
Friday, April 23, 2021

Describing the use of Statistics in Machine Learning - a detailed article on some of the most important concepts in Statistics

 

Describing the use of Statistics


* It's important to skim through some of the basic concepts of probability and statistics. Along with that, we will also try to understand how these concepts help describe the information used by machine learning algorithms.


* Some of the main concepts I shall cover in the articles that follow, leading to a stronger foothold over the various sections of statistics, are sampling, statistical distributions and descriptive measures. In one way or another these all rest on algebra and probability, being more elaborate manifestations of the underlying concepts and theorems of mathematics.

 

* The gist of learning these concepts is not just describing an event by counting its occurrences; it is about being able to describe an event without having to count, every single time, how many times it occurs.

 

* Suppose there is some imprecision in the recording instrument one uses, or simply some random nuisance that disturbs the recording procedure. Then even a simple measure such as weight will differ slightly every time it is taken, oscillating around the true value. Now suppose someone wants to measure the weight of all the people in a city. Doing it in one go is practically impossible: one would have to build a gigantic weighing scale to mount the entire population on its pans, the scale might well break under the load, and once the single measurement had been taken the machine would be of no further use, so the cost of building it for just one task would be meaningless.

* So the purpose of the experiment might be achieved, but the cost of building the instrument would run so high that it would put a big dent in the city's finances and budget. On the other hand, if we record each person's weight one by one, the effort and time for the whole activity could run into weeks or months, which hardly makes the idea worth adopting. And even if every weight were successfully measured, there is a good chance that some amount of error would creep in anyway, making the whole process neither fruitful nor fault-proof.

 

* Having only partial information is therefore not a completely negative condition: working with a smaller data matrix makes calculations efficient and less cumbersome. On top of that, sometimes one simply cannot observe everything one wants to describe and learn, because the event is too complex or features too great a variety of characteristics. Another example is a collection of Twitter tweets: the tweets one gathers can be treated as a sample of data, which is then processed by word processors, sentiment analyzers, spam and abuse filters, and so on, depending on what is contained in each short piece of text.

 

* It is therefore good practice to sample data with similar characteristics and features, so that the sample forms a cohesive group that fits a proper sampling criterion. When sampling is done carefully, one can obtain a better global view of the data from the constituent samples.

 

* In statistics, a population refers to all the events and objects one wants to measure, as defined by a given criterion. Using random sampling, one picks events or objects according to a criterion that determines how the data is collected, assembled and synthesised. The result is then fed into machine learning algorithms, which apply their inherent functions to determine patterns and behaviour.
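A minimal sketch of this idea in R (the population values and sample size below are invented purely for illustration):

==========================================================

set.seed(42)                                               # reproducible illustration
population_weights <- rnorm(100000, mean = 70, sd = 12)    # hypothetical city-wide weights (kg)

# A simple random sample of 500 people instead of weighing everyone
sample_weights <- sample(population_weights, size = 500)

mean(population_weights)   # the (normally unknown) population mean
mean(sample_weights)       # the sample mean comes out close to it

==========================================================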

 

* Along with such determination, a probabilistic model of the input data is built and used to predict similar patterns in any newly arriving data or datasets. Generating data from a population's subsamples and mapping the identified patterns onto new use cases is one of the chief objectives of machine learning, on the back of its supporting algorithms.

 

* "Random Sampling" -- It is not the only approach for any sort of sampling . One can also apply an approach of "stratified sampling" through which one can control some aspects of the random sample in order to avoid picking too many or too few events of a certain kind .After all , it is said that a random sample is a random sample , the manner it gets picked is irrespective of the manner in which all samples would criterion themselves for picking up a sample , and there is no absolute assurance of always replicating an exact distribution of a population .

 

* A distribution is a statistical formulation that describes how likely one is to observe a certain value of an event or measure. Distributions are described by mathematical formulas and can be shown graphically using charts such as histograms or distribution plots. The information one puts into the feature matrix has a distribution, and one may find that the distributions of different features are related to each other. A distribution naturally implies variation, and when dealing with numeric values it is important to identify the centre of that variation. This centre corresponds to the statistical mean, calculated by summing all the values and dividing the sum by the total number of values considered.
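As a concrete sketch (with simulated values), a distribution can be drawn as a histogram and its centre computed as the mean:

==========================================================

set.seed(1)
heights <- rnorm(1000, mean = 170, sd = 8)   # simulated, roughly bell-shaped data

hist(heights, main = "Distribution of simulated heights", xlab = "Height (cm)")
mean(heights)                                # the centre of the variation

==========================================================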

 

* Mean - This is a descriptive measure that tells the user which value to expect most often from the dataset, since the mean generally hovers around the bulk of the data. The mean is best suited to symmetrical, bell-shaped distributions, where the values above the mean are distributed in the same way as the values below it. The normal (Gaussian) distribution is shaped around the mean in exactly this way, which one finds only in data that is not noticeably skewed in either direction away from the symmetric dome of the curve. In the real world, many datasets have skewed distributions with extreme values on one side, and those extremes influence the value of the mean considerably.

 

* Median - The median is the value in the middle after one orders all the observations from smallest to largest. Because it depends only on the order of the values, the median is a measure of the centre of the data that is less affected by extreme values than the mean.
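A small sketch with simulated, right-skewed values shows how the mean is pulled by extreme values while the median stays near the bulk of the data:

==========================================================

set.seed(2)
incomes <- c(rnorm(95, mean = 40000, sd = 5000),    # the bulk of the observations
             rnorm(5,  mean = 500000, sd = 50000))  # a few extreme values

mean(incomes)     # pulled upward by the extreme values
median(incomes)   # stays close to the typical observation

==========================================================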

 

* Variance - The significance of the mean and median descriptors is that they describe a value within the distribution around which there is some form of variation, and machine learning algorithms care a great deal about that variation. This variation is usually quantified as the "variance"; and since the variance is a squared quantity, its root equivalent is also used and is termed the "standard deviation". Machine learning takes the concept of variance into account both in every single variable (univariate distributions) and in all the features together (multivariate distributions), to determine how such variation impacts the response obtained.
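In R the variance and the standard deviation can be computed directly; a minimal sketch on a small made-up vector:

==========================================================

x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # a small illustrative vector

var(x)                             # sample variance (in squared units)
sd(x)                              # standard deviation = square root of the variance
all.equal(sd(x), sqrt(var(x)))     # TRUE

==========================================================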

 

* Statistics is an important matter in machine learning because it conveys the idea that features have a distribution. A distribution implies variation, and variation is a quantification of information: the more variance present in the features, the more information can be matched to the response.

 

* One can use statistics to assess the quality of the feature matrix and then leverage statistical measures to map each type of information to the purpose it serves.

 

Thursday, April 8, 2021

Writing MapReduce Programming - a descriptive guide, with an example, on writing a sample MapReduce program with reference to its architecture

     


             

                   Writing MapReduce Programming

 

* As per standard books, one should start a MapReduce program by writing pseudocode for the Map and Reduce functions.

* A "pseudo-code" is not the entire / actual length of the code but it is a blueprint of the code that would be written in place of the actual code that is going to be used in case of a working standardised code . 

* The program code for both the Map and Reduce functions can be written in Java or other programming languages 

* In Java, a Map function is represented by the generic Mapper class (which can act over structured and unstructured data type objects).

* The Map function has mainly four parameters (input key, input value, output key and output value).

* General handling of the Map function imported from the Mapper class in Java is out of scope for this article; however, I shall try to cover the usage of the Map function in a separate blog post with an appropriate example.

* The Mapper class provides a map() method which receives the input key and input value and produces an output key and output value.

 * For more complex problems involving Map() functions , it is advised to use a higher-level language than MapReduce such as Pig, Hive and Spark

 ( Detailed coverage on the above programming languages - Pig , Hive and Spark would be done in separate articles later )

* A Mapper function commonly performs input-format parsing, projection (selection of relevant fields) and filtering (selection of the requisite records from the context table).

* The Reducer function typically combines (adds or averages) the values produced by the Map step, which finally yields the output.

* The diagram below breaks down all the operations that happen within the MapReduce program flow.

 


* Following is a step-by-step logic for performing a word count of all unique words in a text file .

1) The document under consideration is split into several segments. The Map step is run on each segment of the data. The output is a set of (key, value) pairs; in this case the key is a word in the document and the value is a count of 1.

2) The Big Data system gathers the (key, value) pairs output by all the mappers and sorts them by key. The sorted list is then split into a few segments.

3) The task of each Reducer is to take its segment of the sorted list and produce a combined list of word counts from it.

 

==================================

Pseudocode for WordCount

==================================

map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1")

reduce(String key, Iterator values):
    // key: a word; values: the list of counts emitted for that word
    int result = 0
    for each v in values:
        result += ParseInt(v)
    Emit(AsString(result))

==================================
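For readers more at home in R (the language used elsewhere on this blog), the same word-count flow can be sketched on a single machine; this is only an illustration of the map / shuffle / reduce logic, not a distributed MapReduce job:

==================================

# Hypothetical document split into segments (one element per segment)
segments <- c("the quick brown fox", "the lazy dog", "the quick fox")

# Map step: emit every word as a key (with an implicit count of 1)
words <- unlist(lapply(segments, function(seg) strsplit(seg, " ")[[1]]))

# Shuffle / sort step: group the emitted 1s by their key (the word)
grouped <- split(rep(1L, length(words)), words)

# Reduce step: sum the counts for each word
word_counts <- sapply(grouped, sum)
word_counts

==================================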

 

Thursday, April 1, 2021

Generic Summary Command for DataFrames / Matrices in R language

 


        Generic Summary Command for Data Frames

 

* Below is a short guide to the results expected from the generic summary commands in R.

 

* Descriptive Summary Commands that can be applied to Dataframes are :


00) mat01

     [,1] [,2] [,3] [,4]

[1,]    1    2    3    4

[2,]    5    6    7    8

[3,]    9   10   11   12

[4,]   13   14   15   16
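For reference, a matrix with exactly these values can be created as follows (a guess at how mat01 was built, based on the printed values):

> mat01 <- matrix(1:16, nrow = 4, byrow = TRUE)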

 

01) max(mat01)

[1] 16

- The largest value in the entire dataframe

 

02) min(mat01)

[1] 1

- The smallest value in the entire dataframe

 

03) sum(mat01)

[1] 136

- The sum of all the values in the entire dataframe

 

04) fivenum(mat01)

[1]  1.0  4.5  8.5 12.5 16.0

- The five-number summary (minimum, lower hinge, median, upper hinge, maximum) for the entire dataframe can be found by using the "fivenum" command with the dataframe as its parameter

 

05) length(mat01)

[1] 16

- The total number of elements in the matrix; for a data frame, the length command returns the number of columns instead

 

06) summary(mat01)

       V1             V2             V3             V4
 Min.   : 1     Min.   : 2     Min.   : 3     Min.   : 4
 1st Qu.: 4     1st Qu.: 5     1st Qu.: 6     1st Qu.: 7
 Median : 7     Median : 8     Median : 9     Median :10
 Mean   : 7     Mean   : 8     Mean   : 9     Mean   :10
 3rd Qu.:10     3rd Qu.:11     3rd Qu.:12     3rd Qu.:13
 Max.   :13     Max.   :14     Max.   :15     Max.   :16

- It provides the summary for each of the columns present within a dataframe

 

* The list of summary / descriptive summary commands that work directly on a whole dataframe, as listed above, is fairly short

 

* One can always extract a single vector from a dataframe and perform a summary upon the data

 

* In general , it is better to use more specialised commands when dealing with the rows and columns of a dataframe

 

Row & Column Summary Commands "rowMeans()" & "colMeans()" in R language over dataframe and Matrix objects


 Special Row and Column Summary Commands


* Two summary commands used for row data are - rowMeans() and rowSums()

 

> rowMeans(mat01)

[1]  2.5  6.5 10.5 14.5

 

> rowSums(mat01)

[1] 10 26 42 58

 

* In this example, mat01 has no row names, so the results of rowMeans() and rowSums() are displayed as simple vectors of values; if the rows had names, those names would appear above the corresponding values.

 

* The corresponding "colSums()" and "colMeans()" commands function in the same manner .

 

* In the following example, one can compare the "colMeans()" and "mean()" commands on the same matrix:

> colMeans(mat01)

[1]  7  8  9 10

 

> mean(mat01)

[1] 8.5

 

 

* One can see that the two commands give different results: colMeans() returns one mean per column, whereas mean() returns a single mean over all the values in the object.

 

* The commands use "na.rm" instruction which is used by default and is set to FALSE

 

* If one wants to ensure that the "NA" items are removed from the dataframe then one can add "na.rm = TRUE " as an instruction in the command

"Apply()" Command for finding Summaries on Rows / Columns of a Matrix or Dataframe Object

 


"Apply()" Command for finding Summaries on Rows / Cols

=======================================================

* The "ColMeans()" and "RowSums()" command are designed as quick alternatives to a more generalised command "Apply()"

 

* The "apply()" command enables one to apply a function to rows or columns of a matrix or a dataframe

 

* The general form of the "Apply()" command is given in the following manner :

apply(X, MARGIN, FUN, ...)

 

In this command, the MARGIN parameter is either 1 or 2, where 1 applies the function over the rows and 2 over the columns of the matrix or dataframe

 

* One can replace the "FUN" part within the parameter of the apply() function and one can also add additional instructions which might be appropriate to the command / function that one is applying

 

* Example :

One might add the parameter "na.rm = TRUE " as an instruction to the apply function .

> mat01

     [,1] [,2] [,3] [,4]

[1,]    1    2    3    4

[2,]    5    6    7    8

[3,]    9   10   11   12

[4,]   13   14   15   16

 

> apply (mat01 , 1 , mean , na.rm = TRUE )

[1]  2.5  6.5 10.5 14.5

 

* If the original matrix or dataframe has row names set, those names are carried over and displayed alongside the output.

 

* If the dataframe has no set row names , then one will see the result as a vector of values .

 

> apply (mat01 , 1 , median , na.rm = TRUE )

[1]  2.5  6.5 10.5 14.5
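The same command with MARGIN set to 2 works column by column; a short sketch on the same mat01 matrix:

> apply(mat01, 2, sum)      # column sums
[1] 28 32 36 40

> apply(mat01, 2, range)    # per-column minimum and maximum, returned as a matrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   13   14   15   16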

 



Wednesday, March 31, 2021

Selecting and Displaying Parts of a Vector in R Language

 


   Selecting and Displaying Parts of a Vector

 

*   Being able to select and display parts of a vector is one of the most fundamental operations one can perform on a vector in R.

 

* If one has a large sample of data, one may want to see which items are larger than a particular value, which requires selecting those larger items from the dataset.

 

* In an alternative scenario, one may want to extract a series of values as a subsample for an analysis.

 

* Being able to select / extract the required parts of a vector is also the basis of many more complicated operations in R.

 

* The typical scenarios one may come across when selecting a chunk of data from a vector are the following (each is illustrated in the short R sketch after this list):

 

·        extraction of the first (single) item from a vector ;
·        selection of the third (nth) item from a vector ;
·        selection of the first to the third items from a vector ;
·        selection and extraction of all items from a vector ;
·        selection of items at a chosen combination of positions ;
·        selection of all items greater than the value 3 (that is, items whose value is greater than or less than some particular number) ;
·        showing items that are greater than one value and less than another
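Each of the scenarios above can be written in a single line of R; below is a short sketch using the data1 vector that also appears later in this post:

==========================================================

> data1 <- c(4, 6, 8, 6, 4, 3, 7, 9, 6, 7, 10)

> data1[1]                        # the first item
[1] 4

> data1[3]                        # the third (nth) item
[1] 8

> data1[1:3]                      # the first to the third items
[1] 4 6 8

> data1[c(1, 3, 5)]               # items at a combination of positions
[1] 4 8 4

> data1[data1 > 3]                # all items greater than 3
[1]  4  6  8  6  4  7  9  6  7 10

> data1[data1 > 3 & data1 < 7]    # items greater than one value and less than another
[1] 4 6 6 4 6

==========================================================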








* Other useful commands for extracting parts of a data object include the length() command, which finds the length of a given vector.

·        The length() command can also be used inside the square brackets to obtain / extract a segment of data:

 

data[ (length(data) - 4) : length(data) ]

 

In the example above, the last five elements of the vector are extracted; the parentheses matter because the ":" operator binds more tightly than the subtraction.


·        The max() command can be used to get the largest value in the vector:

 

==========================================================

> data1

[1] 4 6 8 6 4 3 7 9 6 7 10

 

> max( data1 )

[1] 10

 

> which( data1 == max(data1))

[1] 11

==========================================================

 

# The which() command above shows the index number, or position, of the largest value within the data vector. The maximum value of all the elements present within the "data1" vector is 10, and its positional index within the vector is 11.

 

·        The first command, max(), provides the actual largest value within the vector, while the second command asks which of the elements is the largest.

 

·        Another useful command is seq(), which generates sequences of index positions that can be used to select from a vector.

·        Using seq(), one specifies the start, end and interval values, which makes it possible to pick out, for example, the first, third, fifth and subsequent items of a vector.

 

·        The general form of the seq() command can therefore be written as follows:

 

seq(from, to, by)     # i.e. seq(start, end, interval)
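As a brief sketch, seq() can build the index positions used to select every other item of the data1 vector shown earlier:

==========================================================

> data1[ seq(from = 1, to = length(data1), by = 2) ]    # the 1st, 3rd, 5th, ... items
[1]  4  8  4  7  6 10

==========================================================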

 

 

·        Indexing with a vector of positions works on character vectors as well as numeric vectors, as in the following example:

 ==========================================================

> data5

   "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"     "Nov" "Dec"

 

> data5[-1:-6]

   "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" 

=================================================

* In the above code, the negative indices -1 to -6 drop the first six elements, so the result is the last six strings: the three-letter abbreviations of the final six months of the calendar year.