MapReduce Programming
Introduction to Parallel Processing with MapReduce
* A parallel processing system is a clever and robust method for processing huge amounts of data within a short period of time.
* This is achieved by enlisting several computing devices to work on different parts of a job simultaneously.
* An ideal parallel processing system can work across multiple computational problems, using any number of computing devices and any size of data set, with ease and high programming productivity.
* Parallelism is achieved by breaking a job into many parts such that each part can be processed independently of the others; the intermediate results from the parts are then combined to produce the final solution, as in the sketch below.
* Infinite parallel processing is, in essence, how the laws of nature operate.
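As a minimal sketch of this decompose-and-combine idea in R (a toy job of summing the numbers 1 to 1000; the four-way split is arbitrary):

parts <- split(1:1000, rep(1:4, each = 250))   # break the job into 4 parts
partial <- lapply(parts, sum)                  # process each part independently
Reduce(`+`, partial)                           # combine the partial results: 500500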
Sample MapReduce Application – WordCount
* Suppose one wants to identify the unique words in a piece of text, along with the frequency of occurrence of each word in the text.
* Suppose the text within a data file "file.txt" can be split into 5 segments in such a way that each segment is of roughly the same length, with only minimal variation between them; the segments can then be represented in the following manner:
Segment 01 - "I stay at WonderVille in the city of Gods"
Segment 02 - "I am going to a picnic near our house"
Segment 03 - "Many of our friends are coming"
Segment 04 - "You are welcome to join us"
Segment 05 - "We will have fun"
* Each of the given segments of data can be processed in parallel, and the constituent results from all the segments can then be aggregated to produce the result for the full text given above.
* From this it can be ascertained that there are five map tasks, one for each segment of data, where each Map process takes its input in a <key, value> pair format.
* In each <key, value> pair, the first column is addressed as the key, which in this case is the entire sentence of the segment. The second column holds the value, which in this application is the frequency with which each word appears during the counting process. Here, each Map process within the application is executed by a different processor.
* There are five intermediate files in <key2, value2> pair format, one produced by each Map process.
* The sort process inherent within MapReduce will sort each of the intermediate files and produce sorted key-value pairs in that same format.
* The "Reduce"
function will read the sorted intermediate files and combine the results into
one result
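A minimal sketch of this WordCount flow in R, assuming the five segments above as input (map_fn is an illustrative name, not part of any MapReduce library; base R's lapply, split and sapply stand in for the Map, Sort and Reduce stages):

segments <- c("I stay at WonderVille in the city of Gods",
              "I am going to a picnic near our house",
              "Many of our friends are coming",
              "You are welcome to join us",
              "We will have fun")

# Map step: each segment emits one <word, 1> pair per word
map_fn <- function(segment) {
  words <- strsplit(tolower(segment), "\\s+")[[1]]
  setNames(rep(1, length(words)), words)
}
intermediate <- unlist(lapply(segments, map_fn))  # all <key2, value2> pairs

# Sort step: group the pairs by key (the word), alphabetically
grouped <- split(intermediate, names(intermediate))

# Reduce step: sum the counts for each word
sapply(grouped, sum)                              # word-frequency pairs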
MapReduce Overview
* MapReduce is a parallel programming framework for speeding up large-scale data processing for computation tasks.
* MapReduce achieves its performance with minimal movement of data on the distributed file systems of Hadoop clusters, so that results are delivered quickly.
* There are two major prerequisites for MapReduce programming:
(1) The application must lend itself to parallel programming.
(2) The data for the application can be expressed in terms of key-value pairs.
* MapReduce processing is similar to a UNIX pipeline, in which commands are chained with the pipe separator so that the output of one command becomes the input of the next, e.g. the (conceptual) UNIX command:
grep | sort | count textfile.txt
This pipeline produces a "wordcount" of the text document referred to as "textfile.txt".
* There are three commands in the sequence and they work as follows:
(a) grep is a command which reads a text file and creates an intermediate file with one word per line.
(b) sort is a command that works upon the intermediate file and produces an alphabetically sorted list of words.
(c) count is a command which works on the sorted list to produce the number of occurrences of each word, and displays the results to the user in a "word-frequency" pair format.
For example, suppose a file "file.txt" contains the following text:
"I stay at WonderVille in the city of Gods. I am going to a picnic near our house. Many of our friends are coming too. You are welcome to join us. We will have fun."
The outputs of the grep, sort, and count commands on this text would be, respectively, the list of words, the alphabetically sorted word list, and the word-frequency pairs.
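A rough equivalent of that pipeline in R, assuming "file.txt" holds the sample text above (the shell commands themselves are conceptual, so base R functions stand in for each stage):

text <- readLines("file.txt")              # read the text file
words <- unlist(strsplit(text, "\\s+"))    # "grep": one word per entry
words <- sort(words[words != ""])          # "sort": alphabetical word list
table(words)                               # "count": word-frequency pairs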
* For the sake of simplicity, the case taken into account is a relatively small text file. Had the text been very large, the computer would have taken a long time to process the longer document.
* In order to process such a file, one would enlist parallel processing, where the MapReduce algorithm speeds up the computation by having different computers read and process small chunks of the file in parallel.
* Applying the same logic, the file could be broken down into, say, 100 smaller chunks, with each chunk processed on a separate computer. The total time taken to process the file is then reduced to roughly 1/100 of the time that a single computer/server/processor would have taken, ignoring the overhead of dividing the file and combining the results.
* After the processing at the separate nodes/processors has completed in parallel, the results of the computations on the smaller chunks are aggregated to produce a composite result. The outputs from the various chunks are combined by another program called the "Reduce" program.
* The Map step distributes the full job into smaller tasks that can be done on separate computers using only a part of the data set; the results of the Map step can be considered intermediate results. The Reduce step reads the intermediate results and combines all of them to produce the aggregated result.
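A minimal sketch of this distribute-then-combine pattern in R, assuming the chunks fit in memory (parallel::mclapply forks the Map tasks onto separate cores on Unix-like systems; on Windows, parallel::parLapply would be needed instead):

library(parallel)
chunks <- split(1:10000, rep(1:4, each = 2500))      # divide the job into chunks
intermediate <- mclapply(chunks, sum, mc.cores = 2)  # Map step, run in parallel
Reduce(`+`, intermediate)                            # Reduce step: 50005000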
* A concrete, programmatic-level breakdown of the Map and Reduce steps is beyond the scope of this article, so the background-level code for each of the steps in the full process is not shown here.
* One can, however, picture the process through the given example: all the requisite data is provided in a text file, the data is split into individual chunks, and the dataset is sorted (a standard procedure available in all database systems) on one or more fields. The intermediate results should therefore have a key field upon which the sorting operation can be performed.
* In order to manage the variety of data structures stored on a file system, data is stored in the form of one key and one or more non-key attribute values; that is, data is represented as key-value pairs. Along with that, the intermediate results are also stored in key-value pair format.
* Hence, one of the most significant things to remember about a MapReduce parallel processing system is that the input data and the output data are all represented in the form of key-value pairs.
* A Map step reads all data in key-value pair format. The programmer working upon the storage and management of the stored data decides the characteristics and attributes of the key and value fields.
* The Map step produces results in key-value pair format as well. However, the keys produced by the Map step need not have the same characteristics as the input keys; this output data is therefore called key2-value2 pairs.
* The Reduce step reads the key2-value2 pairs and produces an output using the same keys that were used for reading the data.
* The overall MapReduce process therefore follows the Map → Sort → Reduce sequence illustrated in the examples above.
Generic Summary Commands for Data Frames
* Below is a short guide to the results expected from the generic summary commands in R
* Descriptive summary commands that can be applied to dataframes are:
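The construction of mat01 is not shown in these notes; a minimal definition consistent with the printout below would be:

mat01 <- matrix(1:16, nrow = 4, byrow = TRUE)  # 4 x 4 matrix, filled row by row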
00) mat01
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
01) max(mat01)
[1] 16
- The largest value in the entire dataframe
02) min(mat01)
[1] 1
- The smallest value in the entire dataframe
03) sum(mat01)
[1] 136
- The sum of all the values in the entire dataframe
04) fivenum(mat01)
[1] 1.0 4.5 8.5 12.5 16.0
- The five-number summary (minimum, lower hinge, median, upper hinge, maximum) of all the values in the dataframe, found by using the "fivenum" command
05) length(mat01)
[1] 16
- The total number of elements in the matrix; note that for a dataframe proper, length() returns the number of columns instead
06) summary(mat01)
       V1           V2           V3           V4
 Min.   : 1   Min.   : 2   Min.   : 3   Min.   : 4
 1st Qu.: 4   1st Qu.: 5   1st Qu.: 6   1st Qu.: 7
 Median : 7   Median : 8   Median : 9   Median :10
 Mean   : 7   Mean   : 8   Mean   : 9   Mean   :10
 3rd Qu.:10   3rd Qu.:11   3rd Qu.:12   3rd Qu.:13
 Max.   :13   Max.   :14   Max.   :15   Max.   :16
- It provides the summary for each of the columns present within a dataframe
* The summary / descriptive summary commands that work on a dataframe are listed above in short form
* One can always extract a single vector from a dataframe and perform a summary upon that data (see the example below)
* In general, it is better to use more specialised commands when dealing with the rows and columns of a dataframe
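For instance, extracting the first column of mat01 as a vector and summarising just that column:

summary(mat01[, 1])
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      1       4       7       7      10      13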
Special Row and Column Summary Commands
* Two summary commands used for row data are rowMeans() and rowSums():
> rowMeans(mat01)
[1] 2.5 6.5 10.5 14.5
> rowSums(mat01)
[1] 10 26 42 58
* In the given example, the matrix has no set row names; if it had them, the row names would be displayed above the corresponding values
* Without row names, the result appears as a simple vector of values:
> rowSums(mat01)
[1] 10 26 42 58
* The corresponding "colSums()" and
"colMeans()" commands function in the same manner .
* In the following example one can compare the "mean()" and "colMeans()" commands in the following manner:
> colMeans(mat01)
[1]  7  8  9 10
> mean(mat01)
[1] 8.5
* One can see that the two commands do not return the same result: colMeans() gives one mean per column, whereas mean() gives the single overall mean of all the values
* These commands accept an "na.rm" argument, which is set to FALSE by default
* If one wants to ensure that "NA" items are removed before the calculation, one can add "na.rm = TRUE" as an instruction in the command, as in the example below
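A small illustration, using a copy of mat01 with one value set to missing (mat02 is a hypothetical matrix used only for this example):

mat02 <- mat01
mat02[1, 1] <- NA
colMeans(mat02)                # the first column mean is NA
colMeans(mat02, na.rm = TRUE)  # NA dropped: mean of 5, 9, 13 is 9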
"Apply()" Command for finding Summaries
on Rows / Cols
=======================================================
* The "ColMeans()"
and "RowSums()" command
are designed as quick alternatives to a more generalised command
"Apply()"
* The "apply()"
command enables one to apply a function to rows or columns of a matrix or a
dataframe
* The general form of the apply() command is given in the following manner:
apply(X, MARGIN, FUN, ...)
In this command, the MARGIN parameter is either 1 or 2, where 1 applies the function over the rows and 2 over the columns of the dataframe
* One can replace the "FUN" part within the parameter
of the apply() function and one can also add additional instructions which
might be appropriate to the command / function that one is applying
* Example: one might add the parameter "na.rm = TRUE" as an instruction to the apply function:
> mat01
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
> apply(mat01, 1, mean, na.rm = TRUE)
[1] 2.5 6.5 10.5 14.5
* If the original dataframe has set row names, those names are displayed in the output. If it has no row names, as with mat01 here, the result is seen as a plain vector of values:
> apply(mat01, 1, median, na.rm = TRUE)
[1]  2.5  6.5 10.5 14.5
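With MARGIN = 2 the same command works column-wise; the result here happens to agree with the colMeans() output shown earlier because each column of mat01 is evenly spaced (its median equals its mean):

apply(mat01, 2, median, na.rm = TRUE)
# [1] 7 8 9 10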