Testing MapReduce Programs
* MapReduce programs running on a cluster are usually complicated to debug
* The best way of debugging MapReduce programs is via print statements and log messages placed in the MapReduce code; a sketch of an instrumented Mapper follows below
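As a quick illustration, here is a minimal sketch of a Mapper instrumented with log messages and a counter. The WordCountMapper class and its details are hypothetical, not taken from any particular codebase; counters appear in the job's final report and web UI, while log output lands in the per-task log files on the cluster:

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word-count Mapper, instrumented for debugging.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Log LOG = LogFactory.getLog(WordCountMapper.class);
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
            // Counters are aggregated across all tasks and shown in the
            // job report, unlike stdout, which is scattered per task.
            context.getCounter("Debug", "EmptyLines").increment(1);
            LOG.warn("Skipping empty line at offset " + key.get());
            return;
        }
        for (String token : line.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```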
* But in a large application, where thousands of jobs may be running at any single point in time over tens or thousands of nodes, it is preferable to carry out execution and testing in multiple stages
* Therefore, the most preferred mode of executing any program is:
(a) Run the program against a small sample dataset first; this ensures that the program runs correctly and efficiently. The tried and tested formula is to prove the program works on a small dataset before applying it to a bigger application, a bigger dataset, a larger number of test cases, and so on. A minimal local-mode driver for such a sample run is sketched after this list
(b) Expand the unit tests to cover a larger number of datasets and run the programs over a bigger/larger cluster. As mentioned in the earlier point, the scope of the test cases is widened by applying the unit tests to larger datasets in order to check the robustness and performance of the application; an example unit test is also sketched after this list
(c) Ensure that the Mapper and Reducer functions handle their inputs robustly and efficiently. This means that the Mapper and Reducer functions created to work over the split input data should work in tandem with the MapReduce job's desired behaviour and produce output in the desired format (text, key-value pairs, etc.)
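Following point (a), here is a minimal sketch of a driver that runs the job in-process against a small local sample. The SampleRunDriver and WordCountReducer names (the latter a hypothetical summing Reducer to match the Mapper above) and the sample-input/sample-output paths are placeholders; setting mapreduce.framework.name to local keeps the whole job in a single JVM, which also makes it possible to attach a debugger:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver that runs the job in a single local JVM
// against a small sample on the local filesystem.
public class SampleRunDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "wordcount-sample");
        job.setJarByClass(SampleRunDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);  // hypothetical summing Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("sample-input"));    // small sample
        FileOutputFormat.setOutputPath(job, new Path("sample-output")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```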
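And following points (b) and (c), one way to write such unit tests is with the Apache MRUnit library (the project has since been retired, but it still illustrates the idea well). The test below assumes the same hypothetical word-count Mapper and Reducer as above, and checks that each function produces key-value output in the expected format:

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

// Unit tests for the hypothetical Mapper and Reducer, using Apache MRUnit.
public class WordCountTest {

    @Test
    public void mapperEmitsOneCountPerToken() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("cat dog cat"))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("dog"), new IntWritable(1))
                .withOutput(new Text("cat"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("cat"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("cat"), new IntWritable(2))
                .runTest();
    }
}
```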
* Running the application against the full dataset would likely expose more issues: undue errors, undiscovered bugs, unpredictable results, ill-fitting criteria, and so on, because of which it may not be advisable for the system analyst to test the software directly over the entire dataset. But after all the necessary unit tests have passed and the working criteria and prerequisites have been fulfilled, one may test the program over bigger and bigger datasets, by and by making the MapReduce job easier to run and gradually improving its speed and performance
* It may be desirable to split the logic into many simpler Mapper and Reducer functions, chaining the Mappers within a single job using a facility like the ChainMapper library class built into Hadoop (I am yet to explore all the scope, specifications, and functionality of the ChainMapper library, which I shall try to cover in a forthcoming session). This class can run a chain of Mappers, followed by a Reducer function, followed again by a chain of Mapper functions, within a single MapReduce job; a small driver sketch follows below
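Here is a sketch of what such a chained driver might look like, using ChainMapper together with its companion ChainReducer class (ChainReducer is what attaches the Reducer and the reduce-side chain of Mappers). The TokenizerMapper, LowerCaseMapper, SumReducer, and FilterMapper stage classes are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// Hypothetical pipeline, all inside one MapReduce job:
//   TokenizerMapper -> LowerCaseMapper -> SumReducer -> FilterMapper
public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chained-wordcount");
        job.setJarByClass(ChainDriver.class);

        // Map-side chain: each stage's output types must match the
        // next stage's input types.
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainMapper.addMapper(job, LowerCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // One Reducer, then a further Mapper on the reduce side.
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainReducer.addMapper(job, FilterMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Input/output paths and job submission as in an ordinary driver.
    }
}
```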
* Moreover, the testing of MapReduce jobs, the execution of MapReduce jobs, and the analysis and debugging of MapReduce jobs will be covered in later articles under appropriate headers and titles.