Monday, April 12, 2021

News Article - PM Narendra Modi's call to the nation : " Tika Utsav is the beginning of another war against Covid "

 

PM Narendra Modi : " Tika Utsav is the beginning of another war against Covid "

 

* The four-day Tika Utsav (Vaccine Festival) was launched on Sunday by Indian PM Narendra Modi in New Delhi

 

* Aimed at optimum utilisation of the country's vaccination capacity

 

* Success over Covid depends on the country's awareness of the micro containment zone rules , beginning with the home itself : strict adherence to protocols such as not leaving the house premises when there is no need , getting the eligible members of the house vaccinated , and following the rules for wearing masks along with the other guidelines

 

* The PM described the accelerated vaccination drive as the second round of the fight against Covid-19

 

* The PM has outlined a four-point action plan

01) Each One , Vaccinate One

02) Each One , Treat One

03) Each One , Save One

04) Enforcement of Micro Containment Zones guidelines

 

* This four-day Vaccination Festival will conclude on the birth anniversary of the Dalit reformer and Father of the Constitution of India , Dr Bhimrao Ambedkar

Working with Data in Machine Learning - An overview of the methodology for working with data/datasets in Machine Learning using R and Python

 


        Working with Data in Machine Learning

 

* Machine Learning is one of the most appealing subjects because it allows machines to learn from real-world examples , such as sales records , signals from sensors and textual data streaming from the internet , and then determine what such data implies

 

* Some of the common outputs that can come from machine learning algorithms are the following : prediction of the future , prescriptions to act on some given knowledge or information , and creation of new knowledge in terms of examples categorised into groups , all of which can feed into the design and build-up of applications

 

* Some of the applications which are already in place and have become a reality thanks to such knowledge are the following :

01) Diagnosing hard to find diseases

02) Discovering criminal behaviour and detecting criminals in action

03) Recommending the right product to the right person

04) Filtering and classifying data from the internet at a big scale

05) Driving a car autonomously etc

 

* The mathematical and statistical basis of machine learning makes outputting such useful results possible

 

* One can apply math and statistics over such accumulated data , which enables algorithms to understand anything that has a numerical basis

 

* In order to begin the process of working with Data , one should represent the solution to the problem in the form of a number .

 

* For example , if one wants to diagnose a disease using a machine learning algorithm , one can make the response to the learning problem a 1 or a 0 ( a binary response ) which would inform about the illness of the person : a value of 1 indicates that the person is ill , while a value of 0 indicates that the person is not ill .

Alternatively , one can use a number between 0 and 1 to convey the likelihood that the person is ill .
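
To make the idea concrete , below is a minimal sketch in Python using the scikit-learn library ; the two feature columns and the toy patient values are purely made-up assumptions for illustration and not from any real diagnostic dataset .

==================================
Python sketch : binary response for a diagnosis problem
==================================
# Minimal sketch : encoding an illness diagnosis as a number .
# The toy data below is invented purely for illustration .
from sklearn.linear_model import LogisticRegression

# Each row describes a patient by two numeric measurements (assumed features)
X = [[0.2, 1.1], [1.8, 3.4], [0.1, 0.9], [2.2, 3.9]]

# Binary response : 1 means "ill" , 0 means "not ill"
y = [0, 1, 0, 1]

model = LogisticRegression()
model.fit(X, y)

new_patient = [[1.5, 3.0]]
print(model.predict(new_patient))        # hard 0 / 1 decision
print(model.predict_proba(new_patient))  # numbers between 0 and 1 per class
==================================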

MapReduce Jobs Execution

 


MapReduce Jobs Execution

 

* A MapReduce job is specified by a Map program and a Reduce program along with the datasets associated with the job

 

* There is a master program called the "Job Tracker" that resides and runs continuously on the NameNode ; it tracks the progress of MapReduce jobs from beginning to completion

 

* Hadoop moves the Map and Reduce computation logic to all the DataNodes which host a fragment of the data

 

* Communication between the nodes is accomplished using YARN , Hadoop's native resource manager

 

* The master machine (NameNode) is completely aware of the data stored over each of the worker machines (DataNodes)

 

* The master machine schedules the Map / Reduce tasks to Task Trackers with full awareness of the data location , which means that the Task Trackers within the monitoring hierarchy are thoroughly aware of which data resides on their nodes .

In this way , the Job Tracker is able to map the requisite jobs to the right Job / Task tracker queues .

 

For example , if "node A" contains data (x,y,z) and "node B" contains data (a,b,c) , the job tracker schedules node B to perform map or Reduce Tasks on (a,b,c) and node A would be scheduled to perform Map or Reduce tasks on (x,y,z). This helps in reduction of the data traffic and subsequent choking of the network .

 

* Each DataNode participating in MapReduce jobs runs a worker program called the Task Tracker , which reports progress back to the Job Tracker

 

Math behind Machine Learning - An introductory article on the usage of mathematics and statistics as the foundation of Machine Learning

 


        Math behind Machine Learning

 

* If one wants to implement existing machine learning algorithms from scratch or if someone wants to devise newer machine learning algorithms , then one would require a profound knowledge of probability , linear algebra , linear programming and multivariable calculus

 

* Along with that , one also needs to translate the math into working code , which means having a good deal of sophisticated computing skills

 

* This article is an introduction that helps in understanding the mechanics of machine learning and describes how to translate math basics into usable code

 

* If one would like to apply existing machine learning knowledge to practical projects , one can leverage the possibilities of machine learning over datasets using the software libraries of the R and Python languages , together with some basic knowledge of math , statistics and programming , since machine learning's core foundation is built upon skills in all of these areas

 

* Some of the things that can be accomplished with a clear understanding and grasp of these tools are the following :

 

1) Performing machine learning experiments using the R and Python languages

2) Working with vectors , variables and matrices

3) Using descriptive statistics techniques

4) Applying statistical methods like mean , median , mode , standard deviation and other important parameters for judging or evaluating a model ( a short sketch follows this list )

5) Understanding the capabilities and the ways in which machine learning could be put to work , which would help in making better predictions etc
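
As a small illustration of points 3 and 4 , the Python sketch below computes the mean , median , mode and standard deviation with the standard library ; the numbers are made up for the example .

==================================
Python sketch : descriptive statistics
==================================
# Descriptive statistics over a small made-up sample .
import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 25]

print("Mean    :", statistics.mean(data))
print("Median  :", statistics.median(data))
print("Mode    :", statistics.mode(data))
print("Std dev :", statistics.stdev(data))   # sample standard deviation
==================================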

 

Thursday, April 8, 2021

Writing MapReduce Programming - a descriptive guide with an example on writing a sample MapReduce program with reference to its architecture

     


             

                   Writing MapReduce Programming

 

* As per standard books , one should start a MapReduce program by writing pseudocode for the Map and Reduce functions

* A "pseudo-code" is not the entire / actual length of the code but it is a blueprint of the code that would be written in place of the actual code that is going to be used in case of a working standardised code . 

* The program code for both the Map and Reduce functions can be written in Java or other programming languages 

* In Java , a Map function is represented by the generic Mapper class ( which is parameterised over the key and value types it consumes and produces ) .

* The Mapper class takes four type parameters ( input key , input value , output key and output value )

* General handling of the Map function of the Mapper class in Java is outside the scope of this article ; however , I shall try to cover the usage of the Map function in a separate blog post with an appropriate example

* The Mapper class declares a map() method which receives an input key and input value and produces output key-value pairs .

* For more complex problems involving Map() functions , it is advisable to use a higher-level abstraction than raw MapReduce , such as Pig , Hive or Spark

 ( Detailed coverage on the above programming languages - Pig , Hive and Spark would be done in separate articles later )

* A Mapper function commonly performs input format parsing , projection ( selection of the relevant fields ) and filtering ( selection of the requisite records )

* The Reducer function typically combines ( adds / averages ) the requisite values produced by the Mapping procedure , which finally yields the output .

* The diagram below breaks down all the operations that happen within the MapReduce program flow .

 


* Following is the step-by-step logic for performing a word count of all unique words in a text file .

1) The document under consideration is split into several different segments . The Map step is run on each segment of the data . The output is a set of key-value pairs ; in the given case , the key is a word in the document .

2) The Big Data system gathers the (key , value) pair outputs from all the mappers and then sorts the entire list by key . The sorted list is then split into a few segments .

3) The task of the Reducer is to take its segment of the sorted list and produce a combined list of word counts from it .

 

==================================

Pseudocode for WordCount

==================================

map(String key, String value):
    // key : document name ; value : document contents
    for each word w in value:
        EmitIntermediate(w, "1")

reduce(String key, Iterator values):
    // key : a word ; values : a list of counts
    int result = 0
    for each v in values:
        result += ParseInt(v)
    Emit(AsString(result))

==================================
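
The same word-count logic can also be expressed as a small , runnable Python simulation of the Map -> Shuffle / Sort -> Reduce flow described in the steps above ; the three sample documents are invented for the example , and this is only a local sketch of the idea , not code that runs on a Hadoop cluster .

==================================
Python sketch : word-count simulation of the MapReduce flow
==================================
# Self-contained simulation of the Map -> Shuffle/Sort -> Reduce steps .
from collections import defaultdict

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]  # stand-in for the file segments fed to each mapper

# Map step : each segment emits (word , 1) pairs
def map_phase(segment):
    return [(word, 1) for word in segment.split()]

mapped = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle / Sort step : sort by key and group the values for each key
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce step : sum the counts for each word
def reduce_phase(key, values):
    return key, sum(values)

print([reduce_phase(k, v) for k, v in grouped.items()])
==================================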

 

Testing MapReduce Programs - An introductory Article on Testing of MapReduce Programs for Load and Performance

 



    
    Testing MapReduce Programs

 

* Mapper programs running on a cluster are usually complicated to debug

 

* The best way of debugging MapReduce programs is to use print statements that write to the logs of the MapReduce programs

 

* But in a large application , where thousands of programs may be running at any single point of time over tens or thousands of nodes , execution of the jobs and programs is preferably done in multiple stages

 

* Therefore , the most preferred mode of execution of any program is :

(a) To run the programs using small sample datasets ; this ensures that whatever program is running , it runs in an efficient and robust manner . The tried and tested approach is to verify the program's working on a small dataset first and only then apply it to a bigger application / bigger dataset / larger number of testcases ( a small local-test sketch follows this list )

 

(b) Expanding the unit tests to cover a larger number of datasets and running the programs over a bigger cluster . As mentioned in the earlier point , the scope of the testcases is enhanced by applying them to larger datasets in order to check the robustness and performance of the application software

 

(c) Ensuring that the Mapper and the Reducer functions can handle their inputs efficiently . This means that the Mapper and Reducer functions created to work over the split input data work in tandem to produce output in the desired format ( text , key-value pairs ) etc
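
As a sketch of point (a) , the small test below exercises a word-count mapper function on a tiny sample in plain Python before the logic is ever run on a cluster ; the function and test names are illustrative and are not part of any Hadoop testing framework .

==================================
Python sketch : unit test over a small sample dataset
==================================
# Minimal local test of the Map logic over a small sample dataset .
import unittest

def word_count_mapper(line):
    """Emit (word , 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

class WordCountMapperTest(unittest.TestCase):
    def test_small_sample(self):
        self.assertEqual(
            word_count_mapper("to be or not to be"),
            [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)],
        )

    def test_empty_line(self):
        self.assertEqual(word_count_mapper(""), [])

if __name__ == "__main__":
    unittest.main()
==================================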

 

* Running the application against the full dataset would likely expose more issues , such as undue errors , undiscovered bugs and unpredictable results , which is why it may not be conducive for the system analyst to test the software against the entire dataset straight away . But after all the necessary unit testcases have passed and the working criteria and pre-requisites have been fulfilled , one may test the program over bigger and bigger datasets , gradually making the MapReduce job easier to run and improving its speed and performance

 

* It may be desirable to split the logic into many simpler Mapper and Reducer functions , chaining the Mappers together using a facility like the ChainMapper library class built within Hadoop ( I am yet to explore all the scope , specifications and functionalities of the ChainMapper library , which I shall try to cover in a forthcoming session ) . This class can run a chain of Mappers , followed by a Reducer , followed again by a chain of Mappers within a single MapReduce job
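
I have not explored ChainMapper itself yet , but the underlying idea of chaining several simple mappers in front of a single reducer can be sketched in plain Python as below ; the function names are hypothetical and this is only a conceptual illustration , not the Hadoop API .

==================================
Python sketch : chaining simple mappers before one reducer
==================================
# Conceptual sketch : a chain of simple mapper functions feeding one reducer .
from collections import defaultdict

def lowercase_mapper(pairs):
    return [(key.lower(), value) for key, value in pairs]

def strip_punctuation_mapper(pairs):
    return [(key.strip(".,!?"), value) for key, value in pairs]

def sum_reducer(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# "Chain" the mappers , then apply the reducer , within one logical job
records = [("Hello,", 1), ("hello", 1), ("World!", 1)]
for mapper in (lowercase_mapper, strip_punctuation_mapper):
    records = mapper(records)
print(sum_reducer(records))   # {'hello': 2, 'world': 1}
==================================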

 

* Moreover , testing of MapReduce jobs , execution of MapReduce jobs , and analysis and debugging of MapReduce jobs will be covered in later articles under appropriate headers and titles .

 

Map Reduce Data Types and Formats ( A short explanatory Article )


 

Map Reduce Data Types and Formats


* In MapReduce's model of data processing , the inputs and outputs of the Map and Reduce functions consist of key-value pairs


* The Map and Reduce functions in Hadoop MapReduce have the following general form , which is the accepted mode of representation :

 

Map : (K1,V1) -> list(K2,V2)

Reduce : (K2,list(V2)) -> list(K3,V3)

 

* The Map input key and value types ( K1 , V1 ) are generally different from the Map output types ( K2 , V2 )

 

* The Reduce input must have the same types as the Map output , that is , K2 together with a list of V2 values . The output of the Reduce process is a list of key-value pairs of types K3 and V3 , which may again be different from the Reduce input types .
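
In word-count terms , this general form can be sketched in Python as follows ; the concrete types chosen here ( line offsets , words , integer counts ) are just one illustrative instantiation of ( K1 , V1 ) , ( K2 , V2 ) and ( K3 , V3 ) .

==================================
Python sketch : the key-value type flow for word count
==================================
# Map : (K1 , V1) -> list(K2 , V2) ; Reduce : (K2 , list(V2)) -> list(K3 , V3)

# K1 = byte offset of the line , V1 = the line text
def wc_map(offset, line):
    # K2 = word , V2 = a count of 1
    return [(word, 1) for word in line.split()]

# K2 = word , list(V2) = list of 1s ; K3 = word , V3 = total count
def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(wc_map(0, "fox fox dog"))   # [('fox', 1), ('fox', 1), ('dog', 1)]
print(wc_reduce("fox", [1, 1]))   # [('fox', 2)]
==================================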

 

* MapReduce can process many different data formats , ranging from text files to databases .


* An "input split" is a chunk of the input that is processed by a single Map function .

 

* Each Map process handles a single split ; the split is divided into records , and the map function processes each record in the form of a key-value pair

 

* Splits and records are logical concepts : a split may correspond to a full file , part of a file , or a collection of different files etc

 

* In a database context , a split corresponds to a range of rows from a table .