Data Science and AI Quest

Thursday, April 8, 2021

Writing MapReduce Programming - a descriptive guide with example on writing of a sample maprReduce programme with reference to its architecture

Writing MapReduce Programming

* As per standard books , one should start MapReduce program by writing pseudocode for Map and Reduce Functions

* A "pseudo-code" is not the entire / actual length of the code but it is a blueprint of the code that would be written in place of the actual code that is going to be used in case of a working standardised code .

* The program code for both the Map and Reduce functions can be written in Java or other programming languages

* In Java , a Map function is represented by generic Mapper Class (which acts over structured and unstructured data type objects ) .

* The Map Function has mainly four parameters (input key,input value, output key and output value)

* General handling of Map Function imported from Mapper Class in Java is out of context for this article , however I shall try to cover the usage of Map Function in a separate blog with appropriate example

* The Mapper Class uses an abstract Map() method which receives the Input Key and Input Values which would produce an Output key and Output value .

* For more complex problems involving Map() functions , it is advised to use a higher-level language than MapReduce such as Pig, Hive and Spark

( Detailed coverage on the above programming languages - Pig , Hive and Spark would be done in separate articles later )

* A Mapper function commonly performs input format parsing , projection (selection of relevant fields) and filtering related operations (selection of requisite records needed from context table)

* The Reducer function typically combines (adds/averages) the requisite values again after performance of the necessary operations after Mapping procedure which finally yields the Output .

* Below diagram is a breakdown of all operations that happen within the MapReduce Program Flow .

* Following is a step-by-step logic for performing a word count of all unique words in a text file .

1) A document taken into consideration is split into several different segments .The Map step is run on each segment of the data . The output is a set of key and value pairs . In the given case , the key is a word in the document .

2) The Big Data system gathers the (key,value) pair outputs from all the mappers and then it will sort the entire system with the help of a Key . The sorted list is then split into a few segments

3) The task of the Reducer in the entire system is to sort the entire list and produce a combined list of word counts from the entire list provided to the system for the purpose of Sorting and counting .

==================================

Java Code for WordCount

==================================

map(String key,String value):

for each word w in value:

EmitIntermediate(w,"I"):

reduce(String key,Iterator values):

int result = 0:

for each 'v' in values:

result == ParseInt(v):

Emit(AsString(result))

==================================

Testing MapReduce Programs - An introductory Article on Testing of MapReduce Programs for Load and Performance

Testing MapReduce Programs

* Mapper programs running on a cluster are usually complicated to debug

* The best way of debugging MapReduce programs is via usage of print statements over log setions in MapReduce programs

* But in a large application where thousands of programs may be running at any single point of time , running the execution jobs and programs over tens or thousands of nodes is preferred to be done in multiple stages

* Therefore , the most preferred mode of execution of any program is :

(a) To run the programs using small sample datasets ; this would ensure that what so ever program is running , the program is running in an efficient and robust manner . And for checking of the same , the tried and tested formula for applying the working proficiency of the program over a small dataset is done followed by applying the same over a bigger application / bigger dataset / more

number of testcases etc

(b) Expanding the unit tests to cover larger number of datasets and to run the programs over a bigger/larger cluster of network applications . As mentioned in the earlier point , the scope of execution of the testcases is enhanced by application of unit testcases over larger datasets in order to check the robustness

and performance of the system application software

(c) Ensuring that the Mapper and the Reducer functions can handle the inputs more efficiently . This means that the set of Mapper and Reducer functions created to work over the split input data would work in cohesion or in tandem with MapReduce programs desired working condition to produce serial output in desired format (text,key-value pair) etc

* Running the system application against a full dataset would likely expose more issues , which might lead to rise of undue errors , undiscovered issues , unpredictable results , undue fitting criterias and so on type of issues over the software because of which it might not be that conducive for the system analyst to put the entire full dataset to test over the software . But after all necessary

unit testcases have been checked and working criteria and pre-requisites have been fulfilled , one may put the program to be tested over bigger datasets ; by and by making the work of the MapReduce job easier to run , thereby also enhancing

speed and performance issues gradually

* it may be desirable to split the logic into many simpler Mapper and Reducer functions , chaining the programs into single Mapper functions using a facility like ChainMapper library class built within Hadoop (I am yet to explore all the scopes , ChainMapper library class built within Hadoop (I am yet to explore all the scopes ,

specifications and functionalities of the ChainMapper Library which I shall try to cover in a forthcoming session ) . This class can run a chain of Mappers followed by a Reducer function , followed again by a chain of Mapper functions within a single MapReduce Job

* More over testing of MapReduce Jobs / Execution of MapReduce Jobs / Analysis and Debugging of MapReduce Jobs would be done in later articles under appropriate headers and titles .

Map Reduce Data Types and Formats ( A short explanatory Article )

Map Reduce Data Types and Formats

* MapReduce 's model of data processing which includes the following components : inputs and outputs as the Map Reduce Functions consist of Key-Value pairs

* The Map and Reduce functions in Hadoop MapReduce have the following general form which is an accepted mode of representation

Map : (K1,V1) -> list(K2,V2)

Reduce : (K1,list(V2)) -> list(K2,V3)

* The Map input key and value types ( K1,V1 ) are different from the Map output types of (K2,V2)

* However , Reduce input takes in K1 and list values of V2 (which is different in format from that of the Map input over Key1 and associated value V1) . And yet again the output for the Reduce process is a list of the key-value pair (K2 and V3) which is again different from that of Reduce Operations .

* MapReduce can process many different types of data formats which may range from text file formats to databases .

* An "input split" is a chunk of the input that is processed by a single Map function .

* Each Map process processes a single split where each of the split is divided into records and the map function processes each record in the form of a key-value pair

* Splits and Records are a part of logical processing of the records and even maps to a full file / part of a file / collection of different files etc

* In a database context , a split corresponds to a range of rows from a table / record .

MapReduce Programming - An introductory article into the concept of MapReduce Programming

MapReduce Programming

* A Data Processing problem can be transformed into a MapReduce Model by the usage of MapReduce Programming

* The very first step within the process is to visualize the processing plan of a Map and Reduce Programming problem in a step by step process

* When a problem involving Map and Reduce Programming gets more complex , the underlying complexity within the Map and Reduce problem can be manifested and resolved in either of the two ways or a combination of two ways

1) Having more number of MapReduce Jobs -- which would eventually increase the load present over the processors and then mitigated by parallel distribution over the servers

2) Having more complex Map and Reduce Jobs -- under this scenario one may suppose that the number of sorting jobs and processes might get increased tremendously which might add to the complexity , otherwise complexity might also get enhanced under conditions when more and more key and values for same set of text/words are found out by the program and thus mapping their frequency to the matched key becomes more and more which would again add to the complexity of the Map-Reduce Program . Having more but simple MapReduce jobs leads to more easily maintainable Map and Reduce Programs .

Wednesday, April 7, 2021

Introduction to Parallel Processing with Map Reduce Algorithm in Big Data Technology

Introduction to Parallel Processing with Map Reduce

* Parallel Processing system is a clever and robust method to process huge amounts of data within a short period of time .

* This is ensured by enlisting the services of several computing devices to work on different parts of a job simultaneously

* An ideal parallel processing system can work across multiple computational problems using any number of computing devices across any size of data sets with ease and high programming productivity

* Parallel Programming can be achieved in such a way that it can be broken down into many parts such that each of the parts can be partially processed independently of the other parts and then processing the intermediate results from processing the parts which can be combined to produce a final solution .

* Infinite parallel processing is the most important essence of the laws of nature

Sample MapReduce Application – WordCount ( analysis and interpretation with an example )

Sample MapReduce Application – WordCount

* Suppose one wants to identify unique words in a piece of text with the frequency of the occurrence of each of the words in the text .

* Suppose the text within a datafile "file.txt" can be split into 4 segments in such a way that each of the segments are somewhat of the same length with a few changes between them and that too very minimally , then one can represent the same in the following manner :

Segment01 - "I stay at WonderVille in the city of Gods"

Segment 02 - "I am going to a picnic near our house "

Segment 03 - " Many of our friends are coming "

Segment 04 - " You are welcome to join us "

Segment 05 - " We will have fun "

* Each of the given segments of data can be processed in parallel where all the constituent data within the sample could be aggregated to provide results for the text as given in the above text segments "

* From this it can be ascertained that there are 4 map tasks one for each segment of data where each Map process takes in input in a <key,value> pair format .

* Each Map process takes in a <key,value> pair format where the first column is addressed as the Key which is the entire sentence in the case .

The second column holds the Value which in the application is the frequency of the words appearing within the counting process . Here , each Map Process within the application is executed by a different processor .

* There are four intermediate files in <keys2,value2> pair format which can be shown in the below manner

* The sort process inherent within "MapReduce" will "SORT" each of the

intermediate files and prodce a following sorted key-value pair in the following format .

* The "Reduce" function will read the sorted intermediate files and combine the results into one result

MapReduce Overview - with Example and Architecture ( an analysis and interpretation )

MapReduce Overview

* MapReduce is a parallel programming framework for speeding up large scale data processing for computation tasks

* MapReduce achieves its performance with minimal movement of data on distributed file systems on Hadoop Clusters to achieve real-time results

* There are two major pre-requisites for MapReduce Programming

(1) The first pre-requisite of MapReduce Programming is that tha application must lend itself to parallel programming

(2) The data for the applications can be expressed in terms of key-value pairs

* MapReduce Processing is similar to UNIX sequence which is can be expressed in the form of pipe separated values data structure eg UNIX command .

grep | sort | count textfile.txt

This upper command produces a "wordcount" within the output text document which is referred to as "textfile.txt"

* There are three commands in the sequence and they work as follows

(a) grep is a command which is used to find and read a text file and create an intermediate file with one word

(b) sort is a command that works upon an intermediate file and produces an alphabetically sorted list of words

(c) count is a command which works on a sorted list to produce the number of occurrences of each word and display the results to a user in a "word-frequency" pair format

For example : Suppose a file "file.txt" contains the following text :

" I stay at Wonderville in the city of Gods . I am going to a picnic near our house . Many of our friends are coming too . You are welcome to join us . We will have fun"

The outputs of grep , sort and wordcount command on this text are in the following manner

* For the sake of simplicity the case taken into account is of a relatively smaller text file . Had the text been very large , then it would have taken the computer a long amount of time to process the longer text document

* In order to process such a file one would take into account the service of Parallel Processing where MapReduce algorithm speeds up the computation process by reading and processing small chunks of file by different computers in parallel mode .

* Taking this into consideration , if the same logic could be applied to a file , then it could be said that the file object could be broken down into 100 smaller chunks where each of the chunks could be processed at a separate computer using parallel processing of the requests . The total time taken to process such a file is then minimised to 1/100 of the time that a single computer/ server/processor would have taken to accomplish the task of the file division .

* Now after the processing at separate nodes/processors is done separately/ parallely , the results of the computation on smaller chunks are done separately and later on aggregated together to produce a composite result . The results of the outputs from the various chunks are combined by another program called as "Reduce Program "

* The Map step distributes the full job into smaller tasks that can be done over separate computers using only a part of the data set . The result of the Map Step can be considered as intermediate results . The Reduce step reads the intermediate results and combines all of the results to produce the aggregated result .

(some results in the above diagram are not in proper order within the "SORT" column)

* As the concrete programmatic level breakdown of the logical handling of Mapping and Reducing the requisite steps are out of the context for this article , I am not able to show the background level code of the flow of each of the ensuing steps within the entire process .

* However one can only imagine the process through the given example where all the requisite data have been provided in the form of a text file which has all the required data and from the required data formation of the singular data chunks , sorting of the dataset (which is a standard procedure available in all database systems ) based on one or multiple fields can be performed . Therefore the intermediate results should have a key field upon which the sorting operation could be performed .

* In order to manage the variety of data structures stored over a file system , data is stored within it in the form of one - key and non-key attribute values . Therefore , data is represented in the form of a key-value pair . Along with that , the intermediate results would be also stored in the form of key-value pair format intermediate results would be also stored in the form of key-value pair format .

* Hence , one of the most significant things to remember about the manner of storage of all the input and output data over a "MapReduce Parallel Processing System is that the input data and the output data are all represented in the form of key-value pairs "

* A Map step reads all data in the form of key-value pair format . The programmer working upon the storage and managment of the stored data decides upon the characteristics and attributes of the key and value fields .

* The Map Step produces results in the form of key-value pair formats . However , the characteristics of the keys produced by the Map step need not be in the same format . Therefore , all this data is called as key2-value2 pairs

* The Reduce Step reads the key2-value2 pairs and produces an output using the same keys that were used for reading the data .

* The overall process of this entire MapReduce process can be seen in the

following manner :