Writing MapReduce Programs
* As per the standard books, one should start a MapReduce program by writing pseudocode for the Map and Reduce functions.
* A "pseudo-code" is not the entire / actual length of
the code but it is a blueprint of the code that would be written in place of
the actual code that is going to be used in case of a working standardised code
.
* The program code for both the Map and Reduce functions can be
written in Java or other programming languages
* In Java, a Map function is represented by the generic Mapper class (whose type parameters let it operate over both structured and unstructured data).
* The Map function has mainly four parameters (input key, input value, output key and output value), as illustrated in the sketch below.
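* To make the four parameters concrete, here is a minimal sketch assuming Hadoop's org.apache.hadoop.mapreduce API; the specific types chosen are illustrative, not the only possibility.
==================================
Mapper Type Parameters (illustrative)
==================================
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The four type parameters, in order:
//   input key    = LongWritable (byte offset of the line in the file)
//   input value  = Text         (the line of text itself)
//   output key   = Text         (a word)
//   output value = IntWritable  (a count)
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    // the map() method is filled in in the next sketch
}
==================================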
* A general treatment of the Map function imported from the Mapper class in Java is out of scope for this article; I shall try to cover the usage of the Map function, with an appropriate example, in a separate blog post.
* The Mapper class defines a map() method, overridden by the programmer, which receives the input key and input value and produces the output keys and values; a sketch of such an override follows.
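* Filling in the class sketched earlier, the map() override below splits each input line into words and emits a (word, 1) pair for each occurrence (again assuming Hadoop's mapreduce API; the tokenising logic is my own illustration).
==================================
Overriding map() (illustrative)
==================================
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the line into words and emit (word, 1) for each occurrence
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
==================================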
* For more complex problems involving Map() functions, it is advisable to use a higher-level abstraction built on top of MapReduce, such as Pig, Hive or Spark.
( Detailed coverage of the above - Pig, Hive and Spark - will be given in separate articles later )
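* As a taste of why these abstractions count as higher-level, the sketch below shows the entire word count in a few lines using Spark's Java API (the HDFS paths are placeholders, and this is an illustration rather than a full tutorial).
==================================
WordCount in Spark's Java API (illustrative)
==================================
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/sample.txt");  // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile("hdfs:///output/wordcount");  // placeholder path
        }
    }
}
==================================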
* A Mapper function commonly performs input format parsing, projection (selection of the relevant fields) and filtering (selection of only the requisite records from the input), as in the sketch below.
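* For instance, the sketch below shows a mapper that parses hypothetical CSV records of the form "city,product,amount", projects out the city and amount fields, and filters away malformed or non-positive records (the record layout is invented purely for illustration).
==================================
Parsing, Projection and Filtering in a Mapper (illustrative)
==================================
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical input: CSV lines of the form "city,product,amount"
public class SalesMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");  // input format parsing
        if (fields.length < 3) {
            return;                                     // filtering: drop malformed records
        }
        double amount;
        try {
            amount = Double.parseDouble(fields[2].trim());
        } catch (NumberFormatException e) {
            return;                                     // filtering: drop non-numeric amounts
        }
        if (amount <= 0) {
            return;                                     // filtering: keep only positive amounts
        }
        // projection: emit only the fields we care about (city, amount)
        context.write(new Text(fields[0].trim()), new DoubleWritable(amount));
    }
}
==================================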
* The Reducer function typically combines (for example, adds or averages) the values for each key once the Mapping procedure has performed the necessary operations, and this finally yields the output; a sketch follows.
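* Continuing the hypothetical sales example, the sketch below shows a reducer that averages the amounts emitted for each city (again assuming Hadoop's mapreduce API).
==================================
Averaging in a Reducer (illustrative)
==================================
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable v : values) {  // combine all values for this key
            sum += v.get();
            count++;
        }
        if (count > 0) {
            context.write(key, new DoubleWritable(sum / count));
        }
    }
}
==================================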
* The diagram below gives a breakdown of all the operations that happen within the MapReduce program flow.
* Following is the step-by-step logic for performing a word count of all unique words in a text file.
1) The document under consideration is split into several different segments. The Map step is run on each segment of the data, and the output is a set of (key, value) pairs. In the given case, the key is a word in the document and the value is a count of 1 for each occurrence.
2) The Big Data system gathers the (key, value) pair outputs from all the mappers and sorts the entire set by key. The sorted list is then split into a few segments, one for each reducer.
3) The task of each Reducer is to take its segment of the sorted list and combine the values of every unique word, producing a count per word; taken together, the reducers' outputs form the final list of word counts.
==================================
Pseudocode for WordCount
==================================
map(String key, String value):
    // key: document name, value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
==================================
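The pseudocode above maps almost line-for-line onto Java. Below is a minimal runnable sketch of the complete program against Hadoop's mapreduce API, combining the Mapper pieces shown earlier with a summing Reducer and a driver (the class names and job wiring here are my own illustrative choices).
==================================
WordCount in Java (illustrative, Hadoop mapreduce API)
==================================
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                 // add up the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
==================================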