Testing MapReduce Programs
* MapReduce programs running on a cluster are usually complicated to debug
* The best way of debugging MapReduce programs is via print statements and log messages placed in the MapReduce code; a sketch of an instrumented Mapper follows below
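As a quick illustration, here is a minimal sketch of a Mapper instrumented with log messages and a counter. The WordCountMapper class and its details are hypothetical, not taken from any particular codebase; counters appear in the job's final report and web UI, while log output lands in the per-task log files on the cluster:

```java
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word-count Mapper, instrumented for debugging.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Log LOG = LogFactory.getLog(WordCountMapper.class);
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
            // Counters are aggregated across all tasks and shown in the
            // job report, unlike stdout, which is scattered per task.
            context.getCounter("Debug", "EmptyLines").increment(1);
            LOG.warn("Skipping empty line at offset " + key.get());
            return;
        }
        for (String token : line.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```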
* But in a large application, where thousands of jobs may be running at any single point in time over tens or thousands of nodes, it is preferable to carry out execution and testing in multiple stages
* Therefore, the most preferred mode of executing any program is:
(a) Run the program against a small sample dataset first; this ensures that the program runs correctly and efficiently. The tried and tested formula is to prove the program works on a small dataset before applying it to a bigger application, a bigger dataset, a larger number of test cases, and so on. A minimal local-mode driver for such a sample run is sketched after this list
(b) Expand the unit tests to cover a larger number of datasets and run the programs over a bigger/larger cluster. As mentioned in the earlier point, the scope of the test cases is widened by applying the unit tests to larger datasets in order to check the robustness and performance of the application; an example unit test is also sketched after this list
(c) Ensure that the Mapper and Reducer functions handle their inputs robustly and efficiently. This means that the Mapper and Reducer functions created to work over the split input data should work in tandem with the MapReduce job's desired behaviour and produce output in the desired format (text, key-value pairs, etc.)
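Following point (a), here is a minimal sketch of a driver that runs the job in-process against a small local sample. The SampleRunDriver and WordCountReducer names (the latter a hypothetical summing Reducer to match the Mapper above) and the sample-input/sample-output paths are placeholders; setting mapreduce.framework.name to local keeps the whole job in a single JVM, which also makes it possible to attach a debugger:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver that runs the job in a single local JVM
// against a small sample on the local filesystem.
public class SampleRunDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "wordcount-sample");
        job.setJarByClass(SampleRunDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);  // hypothetical summing Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("sample-input"));    // small sample
        FileOutputFormat.setOutputPath(job, new Path("sample-output")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```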
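And following points (b) and (c), one way to write such unit tests is with the Apache MRUnit library (the project has since been retired, but it still illustrates the idea well). The test below assumes the same hypothetical word-count Mapper and Reducer as above, and checks that each function produces key-value output in the expected format:

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

// Unit tests for the hypothetical Mapper and Reducer, using Apache MRUnit.
public class WordCountTest {

    @Test
    public void mapperEmitsOneCountPerToken() throws Exception {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("cat dog cat"))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("dog"), new IntWritable(1))
                .withOutput(new Text("cat"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("cat"),
                        Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("cat"), new IntWritable(2))
                .runTest();
    }
}
```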
* Running the application against the full dataset would likely expose more issues: undue errors, undiscovered bugs, unpredictable results, ill-fitting criteria, and so on, because of which it may not be advisable for the system analyst to test the software directly over the entire dataset. But after all the necessary unit tests have passed and the working criteria and prerequisites have been fulfilled, one may test the program over bigger and bigger datasets, by and by making the MapReduce job easier to run and gradually improving its speed and performance
* It may be desirable to split the logic into many simpler Mapper and Reducer functions, chaining the Mappers within a single job using a facility like the ChainMapper library class built into Hadoop (I am yet to explore all the scope, specifications, and functionality of the ChainMapper library, which I shall try to cover in a forthcoming session). This class can run a chain of Mappers, followed by a Reducer function, followed again by a chain of Mapper functions, within a single MapReduce job; a small driver sketch follows below
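Here is a sketch of what such a chained driver might look like, using ChainMapper together with its companion ChainReducer class (ChainReducer is what attaches the Reducer and the reduce-side chain of Mappers). The TokenizerMapper, LowerCaseMapper, SumReducer, and FilterMapper stage classes are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// Hypothetical pipeline, all inside one MapReduce job:
//   TokenizerMapper -> LowerCaseMapper -> SumReducer -> FilterMapper
public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chained-wordcount");
        job.setJarByClass(ChainDriver.class);

        // Map-side chain: each stage's output types must match the
        // next stage's input types.
        ChainMapper.addMapper(job, TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainMapper.addMapper(job, LowerCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // One Reducer, then a further Mapper on the reduce side.
        ChainReducer.setReducer(job, SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainReducer.addMapper(job, FilterMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        // Input/output paths and job submission as in an ordinary driver.
    }
}
```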
* Moreover, the testing of MapReduce jobs, the execution of MapReduce jobs, and the analysis and debugging of MapReduce jobs will be covered in later articles under appropriate headers and titles.