Data Science and AI Quest

Wednesday, April 7, 2021

MapReduce Overview - with Example and Architecture ( an analysis and interpretation )

MapReduce Overview

* MapReduce is a parallel programming framework for speeding up large scale data processing for computation tasks

* MapReduce achieves its performance with minimal movement of data on distributed file systems on Hadoop Clusters to achieve real-time results

* There are two major pre-requisites for MapReduce Programming

(1) The first pre-requisite of MapReduce Programming is that tha application must lend itself to parallel programming

(2) The data for the applications can be expressed in terms of key-value pairs

* MapReduce Processing is similar to UNIX sequence which is can be expressed in the form of pipe separated values data structure eg UNIX command .

grep | sort | count textfile.txt

This upper command produces a "wordcount" within the output text document which is referred to as "textfile.txt"

* There are three commands in the sequence and they work as follows

(a) grep is a command which is used to find and read a text file and create an intermediate file with one word

(b) sort is a command that works upon an intermediate file and produces an alphabetically sorted list of words

(c) count is a command which works on a sorted list to produce the number of occurrences of each word and display the results to a user in a "word-frequency" pair format

For example : Suppose a file "file.txt" contains the following text :

" I stay at Wonderville in the city of Gods . I am going to a picnic near our house . Many of our friends are coming too . You are welcome to join us . We will have fun"

The outputs of grep , sort and wordcount command on this text are in the following manner

* For the sake of simplicity the case taken into account is of a relatively smaller text file . Had the text been very large , then it would have taken the computer a long amount of time to process the longer text document

* In order to process such a file one would take into account the service of Parallel Processing where MapReduce algorithm speeds up the computation process by reading and processing small chunks of file by different computers in parallel mode .

* Taking this into consideration , if the same logic could be applied to a file , then it could be said that the file object could be broken down into 100 smaller chunks where each of the chunks could be processed at a separate computer using parallel processing of the requests . The total time taken to process such a file is then minimised to 1/100 of the time that a single computer/ server/processor would have taken to accomplish the task of the file division .

* Now after the processing at separate nodes/processors is done separately/ parallely , the results of the computation on smaller chunks are done separately and later on aggregated together to produce a composite result . The results of the outputs from the various chunks are combined by another program called as "Reduce Program "

* The Map step distributes the full job into smaller tasks that can be done over separate computers using only a part of the data set . The result of the Map Step can be considered as intermediate results . The Reduce step reads the intermediate results and combines all of the results to produce the aggregated result .

(some results in the above diagram are not in proper order within the "SORT" column)

* As the concrete programmatic level breakdown of the logical handling of Mapping and Reducing the requisite steps are out of the context for this article , I am not able to show the background level code of the flow of each of the ensuing steps within the entire process .

* However one can only imagine the process through the given example where all the requisite data have been provided in the form of a text file which has all the required data and from the required data formation of the singular data chunks , sorting of the dataset (which is a standard procedure available in all database systems ) based on one or multiple fields can be performed . Therefore the intermediate results should have a key field upon which the sorting operation could be performed .

* In order to manage the variety of data structures stored over a file system , data is stored within it in the form of one - key and non-key attribute values . Therefore , data is represented in the form of a key-value pair . Along with that , the intermediate results would be also stored in the form of key-value pair format intermediate results would be also stored in the form of key-value pair format .

* Hence , one of the most significant things to remember about the manner of storage of all the input and output data over a "MapReduce Parallel Processing System is that the input data and the output data are all represented in the form of key-value pairs "

* A Map step reads all data in the form of key-value pair format . The programmer working upon the storage and managment of the stored data decides upon the characteristics and attributes of the key and value fields .

* The Map Step produces results in the form of key-value pair formats . However , the characteristics of the keys produced by the Map step need not be in the same format . Therefore , all this data is called as key2-value2 pairs

* The Reduce Step reads the key2-value2 pairs and produces an output using the same keys that were used for reading the data .

* The overall process of this entire MapReduce process can be seen in the

following manner :

Thursday, April 1, 2021

Generic Summary Command for DataFrames / Matrices in R language

Generic Summary Command for Data Frames

* Below is a short guide to the results expected for the generic software commands in R

* Descriptive Summary Commands that can be applied to Dataframes are :

00) mat01

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

[4,] 13 14 15 16

01) max(mat01)

[1] 16

- The largest value in the entire dataframe

02) min(mat01)

[1] 1

- The smallest value in the entire dataframe

03) sum(mat01)

[1] 136

- The sum of all the values in the entire dataframe

04) fivenum(mat01)

[1] 1.0 4.5 8.5 12.5 16.0

- The summary values for the entire dataframe can be found out by using the "fivenum" command over a dataframe taken in as parameter

05) length(mat01)

[1] 16

- The length of all the columns within a dataframe can be found by using the length command over a dataframe

06) summary(mat01)

V1 V2 V3 V4

Min. : 1 Min. : 2 Min. : 3 Min. : 4

1st Qu.: 4 1st Qu.: 5 1st Qu.: 6 1st Qu.: 7

Median : 7 Median : 8 Median : 9 Median :10

Mean : 7 Mean : 8 Mean : 9 Mean :10

3rd Qu.: 10 3rd Qu.:11 3rd Qu.:12 3rd Qu.:13

Max. : 13 Max. :14 Max. :15 Max. :16

- It provides the summary for each of the columns present within a dataframe

* The list of all the summary / descriptive summary commands that work on a dataframe are listed and short

* One can always extract a single vector from a dataframe and perform a summary upon the data

* In general , it is better to use more specialised commands when dealing with the rows and columns of a dataframe

Row & Column Summary Commands "RowMeans()" & "ColMeans()" in R language over dataframe and Matrix objects

Special Row and Column Summary Commands

* Two summary commands used for row data are - rowMeans() and rowSums()

> rowMeans(mat01)

[1] 2.5 6.5 10.5 14.5

> rowSums(mat01)

[1] 10 26 42 58

* In the given example , each row in the dataframe has a specific row name

* If the names of the rows along with the values for the various rows would appear as a simple vector of values

> rowSums(mat01)

[1] 10 26 42 58

* The corresponding "colSums()" and "colMeans()" commands function in the same manner .

* In the following example ... one can see the "mean()" and "colMeans()" command with their comparison in the following manner :

> colMeans(df)

[1]  7  8  9 10

> mean(mf)

[1] 8.5

* One can see that one would essentially get the same display / result using the above two commands

* The commands use "na.rm" instruction which is used by default and is set to FALSE

* If one wants to ensure that the "NA" items are removed from the dataframe then one can add "na.rm = TRUE " as an instruction in the command

"Apply()" Command for finding Summaries on Rows / Columns of a Matrix or Dataframe Object

"Apply()" Command for finding Summaries on Rows / Cols

=======================================================

* The "ColMeans()" and "RowSums()" command are designed as quick alternatives to a more generalised command "Apply()"

* The "apply()" command enables one to apply a function to rows or columns of a matrix or a dataframe

* The general form of the "Apply()" command is given in the following manner :

apply(X,margin,FUN,....)

In this command , the applicable MARGIN within the parameter is either 1 or 2 where 1 is for the rows and 2 is for the columns applicable for the dataframe

* One can replace the "FUN" part within the parameter of the apply() function and one can also add additional instructions which might be appropriate to the command / function that one is applying

* Example :

One might add the parameter "na.rm = TRUE " as an instruction to the apply function .

> mat01

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

[4,] 13 14 15 16

> apply (mat01 , 1 , mean , na.rm = TRUE )

[1]  2.5  6.5 10.5 14.5

* In such a case , one can see that the row names of the original dataframe are displayed as output .

* If the dataframe has no set row names , then one will see the result as a vector of values .

> apply (fw , 1 , median , na.rm = TRUE )

2.5  6.5 10.5 14.5

Wednesday, March 31, 2021

Rotating Data Tables / Matrices in R

Rotating Data Tables / Matrices in R

* One can do a rotation operation over a T table / dataframe / matrix object so that the rows become the columns and the columns become the rows .

* For doing the rotation operation over any data table object / data frame object or a matrix object in R , one can use the t() command .

* One can think of such a program as short for transpose operation

* The given example begins with a dataframe that contains two columns of numeric data with its rows also named .

* The final object is transposed ( means - the rows are changed to columns and the columns are changed to rows / reversal of attributes )

* Also , the new object is in fact a matrix rather than a dataframe .

* One can see this much more clearly if one tries the same t() command over a simple vector .

* Vectors are treated like columns , but when they are displayed they look like rows

* In any event , when you apply a t() command one would get a matrix object as a result .

* One can easily convert the objects to a dataframe using the "as.dataframe()" command .

Selecting and Displaying Parts of a Vector in R Language

Selecting and Displaying Parts of a Vector

· * Being able to select and display the parts of a vector is one of the most important reasons of selection of a Vector .

· * If one has a large sample of data , then in case if one wants to obtain a large sample of data , then one may want to see which of the items are larger than which of the values which would require the user of the data to select those data that are larger ones among the dataset

· * In an alternative scenario , one may want to extract a series of values as a subsample from an analysis .

· * Being able to select / extract required parts of a vector is one of the most important aspects of performing many more complicated operations in R tool .

* The various examples or processes that one may come across while doing any type of selection of a chunk of data from a vector are in form of given scenarios :

·        extraction of the first item / single item from within a vector ;
·        selection of the third item ( nth item) from within a vector ;
·        selection of the first to the third items from a vector ;
·        selection and extraction of all items from a vector ;
·        selection of items from the combination vector ;
·        selection of all items which are greater than the value 3 (that means selection and extraction of given items with a value for the number greater than or lesser than some particular number) ;
·        Showing items which are either greater than or lesser than some set of numbers

The other useful commands over the objects which can be used to extract the various parts of data are : length() command which can be used to find the length of a given vector .

· The length command can be also put to use to obtain / extract segments of data from square brackets :

data[ length(data) - 5 : length(data)]

In the above given scenario example the last five elements of the vector are found out from the above used code :

· max() command can be used to get the largest value in the vector

==========================================================

> data1

[1] 4 6 8 6 4 3 7 9 6 7 10

> max( data1 )

[1] 10

> which( data1 == max(data1))

[1] 11

==========================================================

# The upper command -- "which" is showing the index number or position of the largest data from within the data vector . The maximum value of all the data elements present within the "data1" vector is 10 . The positional index value of the data from the vector is 11 .

· The first command .. max() provides the actual value which is the largest value within the vector and the second command asks which of the elements is the largest .

· Another useful command is one that generates sequences from a vector which can be expressed in the form .. seq()

· While using the "sequence" vector , one may need to pick out the beginning to ending of the interval vectors . In given words , one may select the first , third , fifth and so on vectors using the given in sequence parameters like the start , end and the interval values .

· Therefore putting the full scale general form of the "sequence" command can be writen in the given form :

seq(start ,end ,interval)

· The above command will work on character vectors as well as numeric vectors in the given manner :

==========================================================

> data5

"Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

> data5[-1:-6]

"Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

=================================================

* In the above code , the last 6 strings which are actually the three letter initials of each of the months of a calendar year are found out as result .