Showing posts with label probability. Show all posts

Sunday, July 18, 2021

An Introduction to Classification Algorithms with 10 fundamental questions | An Infographic Note with questions and answers

 



Questions Covered :
Q1) What are Naive Bayes Classifiers ?
Q2) What do you mean by Probabilistic Classification ?
Q3) What do you mean by Statistical Classification ?
Q4) Give an example of Statistical Classification .
Q5) What is a Classifier Algorithm ?
Q6) Give some examples of various forms of classification .
Q7) Which is the most used Classification algorithm in Statistics ?
Q8) What are some of the characteristics of a Classification problem ?
Q9) What are the popular acronyms for samples , independent and dependent variables in Machine learning (ML) ?
Q10) What is the difference between Binary Classification and Multiclass Classification ?

Friday, April 30, 2021

Updating Machine Learning Algorithms by Mini-Batch and Batch Wise


*  Machine Learning boils down to an optimization problem in which one could look for a global minimum given a certain cost function

 

*  Running an optimization algorithm on all the available data is an advantage, because it allows checking, iteration by iteration, how much the cost function has been minimized with respect to all the data

 

*  This is the main reason why most machine learning algorithms prefer to use all the available data, and want it accessible in memory, whether that is the memory of the host computer or the memory of a GPU

 

 

*  Learning techniques based on statistical algorithms use calculus and matrix algebra , and they need all the data within the memory .

  

*  Simpler algorithms, such as those based on a step-by-step search for the next best solution that proceed iteration by iteration through partial solutions (gradient descent, for example), can also gain an advantage from developing a hypothesis based on all the data, because they can catch weaker signals on the spot and avoid being fooled by noise in the data. This holds whether the learning is supervised or unsupervised.

  

*  When operating within the limits of the computer's memory, one is said to be working in core memory. All the computations then take place on data held in the computer's primary memory (RAM), which is much faster to access than secondary storage such as a disk.

  

*  An algorithm that works this way is called a "batch algorithm" because, as in a factory where machines process batches of materials, such an algorithm learns to handle and predict a single batch of data at a time. The incoming data is generally represented in the form of a data matrix.
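As a minimal sketch of a batch update, assuming a tiny made-up data set that fits in memory and a one-feature linear model (the function name and the default values are illustrative, not taken from any library), every step below uses all the rows of the data at once:

```python
# Batch gradient descent for a one-feature linear model y ~ w * x + b.
# Every update uses ALL rows of the (tiny, made-up) data set.

def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors over the full data set for the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Because each iteration sees the whole data set, the cost can be checked against all the data after every step, which is exactly what stops being possible once the data no longer fits in memory.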

  

*  Sometimes data cannot fit into core memory because it is too big. Data derived from the web is a typical example. Such data can be homogeneous or heterogeneous in form and arrives in many formats (XML, JSON, SQL and NoSQL stores, and so on), which also makes it relatively hard to parse and fit into a single structure.

 

*  In addition, data derived from sensors, tracking devices, satellites and video monitoring devices is often problematic because of its size compared with a computer's RAM. It can, however, be stored easily on a hard disk, given the availability of cheap and large storage devices that easily hold terabytes of data

  

*  A few strategies can help when the data is too big to fit into the standard memory of a single computer. A first solution that one can try is to subsample the data, that is, to work on a smaller sample of it.

  

*  Here, the data is reshaped into a smaller yet more manageable data matrix by selecting cases (and sometimes features) through statistical sampling. Reducing the data cannot always provide the same results as analysing all of it, and working with less data can produce less powerful models. But if the subsampling is executed properly, the approach can generate good and reliable results. A successful subsampling must therefore correctly use statistical sampling, by employing random or stratified sample drawings

 

*  Now we will take a bird's eye view of the sampling methods that are used for reshaping and reducing data :

 

1)     Random Sampling

 

*  In random sampling, one creates a sample by choosing examples at random from the full data set. The larger the sample, the more likely it is to resemble the structure and variety of the original data.

 

2)     Stratified Sampling

 

*  In Stratified Sampling, one can control the final distribution of the target variable, or of certain features within the data that one deems critical for successfully replicating the characteristics of the complete data.

 

*  A classic example of stratified sampling is drawing a sample from a classroom made up of different proportions of males and females in order to estimate the average height of the class.

 

*  If the females of the class are, on average, shorter than the males, then one would want to draw a sample that replicates the same proportion of males and females as in the class, in order to obtain a reliable estimate of the average height.

 

*  If one sampled only the males by mistake, one would overestimate the average height: the sample would contain only the taller sub-group, and the shorter heights of the females would be left out of the estimate altogether.
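The classroom example can be sketched in a few lines. The class below is hypothetical: 60 males and 40 females with made-up, deterministic heights in centimetres, and the helper names are illustrative. The sketch draws a plain random sample, a stratified sample that keeps the 60/40 proportion, and a mistaken males-only sample:

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the draws are repeatable

# Hypothetical class: (gender, height in cm), heights are made up.
students = [("M", 170 + i % 11) for i in range(60)] + \
           [("F", 158 + i % 11) for i in range(40)]

def stratified_sample(population, size, rng):
    # Draw from each gender in proportion to its share of the population.
    sample = []
    for label in sorted({gender for gender, _ in population}):
        group = [p for p in population if p[0] == label]
        quota = round(size * len(group) / len(population))
        sample.extend(rng.sample(group, quota))
    return sample

def average_height(sample):
    return mean(height for _, height in sample)

class_average = average_height(students)
random_draw = rng.sample(students, 20)                  # random sampling
stratified_draw = stratified_sample(students, 20, rng)  # 12 males, 8 females
males_only = rng.sample([s for s in students if s[0] == "M"], 20)

print(class_average, average_height(random_draw),
      average_height(stratified_draw), average_height(males_only))
```

With these made-up heights every male is at least 170 cm while the class average is just below that, so the males-only estimate is always too high, whereas the stratified draw keeps the class proportions by construction.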

 

================

Sampling Strategy

================

 

*  To avoid such problems with both random and stratified sampling, one has to draw a sub-sample with enough examples, guided by a clear idea of the requirement one is trying to fulfil, so that the sample represents the variety of the data.

 

*  Data with high dimensionality, characterised by many cases and many features, is more difficult to sub-sample because it needs a much larger sample, which may not even fit into core memory. So even after choosing a proper sampling strategy, memory limitations can still restrict how well the sample represents the variety of the data.

  

=========

Network Parallelism

=========

 

*  Beyond sub-sampling, a second possible solution to fitting the data in memory is to leverage "network parallelism", which splits the data across multiple computers connected over a network. Each computer handles part of the data for the optimization. After each computer has done its own computation, all of the parallel optimizations are reduced into a single result, and a solution is achieved.

 

*  To understand how this solution works, one can compare it to building a car piece by piece on an assembly line, from the frame laid out in the blueprint to the complete body, with the work shared among a line of workers and robotic arms. Apart from faster assembly, one does not have to keep all the parts within the factory at the same time. In a very similar manner, one doesn't have to keep all the data within a single computer, but can take advantage of a distributed architecture in which different computers work in parallel, thereby overcoming some of the core memory limitations of a single machine.

 

 

*  This approach serves as the basis of MapReduce technology and cluster computing frameworks such as Apache Spark. As a quick recap: in MapReduce, a map step processes many pieces of the input independently and emits intermediate key-value pairs, the pairs are grouped by key, and a reduce step combines the values of each key into a result. Counting the words in a large set of files is the classic example. Clusters typically organise this work with a coordinating (master) node and a number of worker nodes that store and process the data. One needs to explore more on the exact manner in which such systems store and access data at scale.

  

*  All these technologies are focused on mapping a problem onto multiple machines and then reducing their outputs into the desired solution. This means that machine learning computations on large-scale data are not done by a single server or computer; they are done in a parallel and distributed manner across many nodes, and the partial results are combined at a root node before the final result is returned to the user.
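The map-then-reduce idea can be imitated on a single machine. The sketch below is only an illustration in plain Python, not the API of Hadoop or Spark: the three text chunks stand in for data held on three different computers, and the function names are made up for the example.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # "Map": each worker counts the words of its own chunk independently.
    return Counter(chunk.lower().split())

def reduce_phase(partial_a, partial_b):
    # "Reduce": merge two partial results into one.
    return partial_a + partial_b

# Each string stands in for the part of the data held by one machine.
chunks = ["the cat sat", "the dog sat", "the cat ran"]

partials = [map_phase(c) for c in chunks]            # could run in parallel
totals = reduce(reduce_phase, partials, Counter())   # combined at the root

print(totals)
```

No worker ever needs to see the whole data set, which is what lets the approach sidestep the memory limit of a single machine.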

  

*  But even with such a sophisticated system in place, one cannot split every machine learning algorithm into separable processes, and this limits the usability of the approach. More importantly, one encounters significant cost and time overhead in setting up and maintaining a network of computers for this kind of data processing. As this level of computation and infrastructure is beyond the reach of individuals with limited funding, it is mainly run by large organisations that have the ability to build and operate big-scale infrastructure.

  

*  The third solution is to rely on out-of-core algorithms, which work by keeping the data on the storage device and feeding it into the computer's memory in chunks for processing. The feeding process is called streaming. Because the data chunks are smaller than the core memory, the algorithm can handle them properly and use them to update the machine learning algorithm's optimization. After the update, the system discards them in favour of new chunks, which the algorithm uses for further learning. This process repeats until there are no more chunks left. Chunks can be small (depending on the core memory), in which case the process is called mini-batch learning, or they can consist of just a single example, in which case it is called online learning.
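A minimal sketch of the streaming loop, assuming the data arrives as any Python iterable (a file read line by line would work the same way). The helper names are illustrative, and a running mean stands in for the model being updated:

```python
from itertools import islice

def stream_chunks(source, chunk_size):
    # Yield successive lists of at most chunk_size items from any iterable.
    iterator = iter(source)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def streaming_mean(source, chunk_size):
    # Update a running statistic chunk by chunk; only one chunk is held
    # in memory at any time, and it is dropped once it has been used.
    count, total = 0, 0.0
    for chunk in stream_chunks(source, chunk_size):
        count += len(chunk)
        total += sum(chunk)
    return total / count if count else float("nan")

print(streaming_mean(range(1, 101), chunk_size=8))
```

Setting `chunk_size` to 1 turns the same loop into the single-example (online) case.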

 

*  The previously described gradient descent, like other iterative algorithms, can work fine with such an approach. However, reaching an optimum takes more update steps because the gradient's path is more erratic and non-linear than in a batch approach. On the other hand, each update looks at only a chunk of the data, so the algorithm can still reach a solution using fewer computations than its in-memory version.

  

*  When the parameter updates are based on mini-batches or single examples, the gradient descent algorithm takes the name stochastic gradient descent. It will reach a proper optimization solution given two prerequisites :

 

1)    The examples streamed are randomly extracted (hence the name stochastic, which recalls the idea of a random extraction)

 

2)    A proper learning rate is defined, either as a fixed value or as a flexible one that changes according to the number of observations or other criteria

  

*  The learning rate can make a great difference in the quality of the optimisation, because a high learning rate, though it makes the optimisation faster, can tie the parameters to the effects of noisy or erroneous examples seen at the beginning of the stream.

 

*  A high learning rate can also render the algorithm insensitive to the observations streamed later, which can prove to be a problem when the algorithm is learning from sources that are naturally evolving and mutable, such as data from the digital advertising sector, where new advertising campaigns change the level of attention and response of the targeted individuals
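The two prerequisites can be seen together in a small sketch of mini-batch stochastic gradient descent. It assumes a one-feature linear model and a made-up, noise-free data set; the decay formula for the learning rate is just one common choice, and all names and default values are illustrative:

```python
import random

def sgd(xs, ys, batch_size=2, epochs=500, initial_rate=0.05,
        decay=0.001, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(indices)                         # 1) random extraction
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # 2) flexible learning rate that shrinks as more batches are seen
            rate = initial_rate / (1.0 + decay * step)
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = 2.0 / len(batch) * sum(e * xs[i]
                                            for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= rate * grad_w
            b -= rate * grad_b
            step += 1
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
w, b = sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

Each update here touches at most two examples, so the path of the parameters is noisier than in a full-batch run, but the whole data set never has to be in memory at once.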

 

Last modified: 12:23


Wednesday, April 28, 2021

The Learning Process of M.L Algorithms

 *  During the process of optimization , the machine learning algorithm searches the possible variants of parameter combinations in order to find the best one which would allow the correct mapping between the features and the classes during the process of training

 *  This process evaluates many potential candidate target functions from among those which a learning algorithm can guess

 *  The set of all the potential functions that the learning algorithm can figure out is called a Hypothesis Space

 *  One can call the resulting classifier, with its set of parameters, a hypothesis, which is a way in machine learning to say that the algorithm has set its parameters to replicate the target function and is now ready to work out correct classifications

  *  The hypothesis space must contain all the parameter variants of all the machine learning algorithms that one may want to try to map to an unknown function when solving a classification problem. In other words, it covers every candidate function that the learning algorithm could end up with, and the algorithm searches this space for the candidate that best fits the problem at hand. Different algorithms can have different hypothesis spaces. What matters is that the hypothesis space contains the target function or an approximation of it, that is, a different but similar function.
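A toy version of this search, assuming a made-up one-feature training set and a deliberately small hypothesis space made only of threshold rules (all names are illustrative):

```python
# Hypothetical 1-D training set: (feature value, class label 0 or 1).
examples = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def make_hypothesis(threshold):
    # Each threshold defines one candidate function: predict 1 above it.
    return lambda x: 1 if x > threshold else 0

def training_errors(hypothesis, data):
    # Number of examples that the candidate function maps to the wrong class.
    return sum(1 for x, label in data if hypothesis(x) != label)

# The hypothesis space: every threshold from 0.0 to 9.0 in steps of 0.5.
hypothesis_space = [t * 0.5 for t in range(19)]

# "Learning" here is simply a search for the candidate with the fewest errors.
best_threshold = min(
    hypothesis_space,
    key=lambda t: training_errors(make_hypothesis(t), examples),
)
print(best_threshold, training_errors(make_hypothesis(best_threshold), examples))
```

Several thresholds between the two groups fit the training data equally well, which illustrates that the chosen hypothesis is only one acceptable approximation of the unknown target function.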

 *  The equivalent of this could be thought of as a child who, in an effort to figure out what a tree is, experiments with many different ideas by assembling their own knowledge and experiences. Parents play a major role in this learning phase by providing relevant environmental inputs. In machine learning, say in supervised learning, one likewise has to provide the right learning algorithm, supply some non-learnable parameters called hyper-parameters, choose a set of examples to learn from, and select the features that accompany the examples. Just as a child left alone in the world cannot always learn to distinguish between right and wrong (consider the case depicted in the book Lord of the Flies, summaries of which are available on many sites), a machine learning algorithm also needs guidance of this kind in order to learn successfully.

 

*  Even after the completion of the learning process, a machine learning classifier often cannot unequivocally map the examples to the target classification, because many false and erroneous mappings are possible. This happens when there are insufficient data points for the algorithm to discover the right function. In addition, noise in the data (a major factor in machine learning) also affects the process of learning

 

*  In the real world, too, noise acts as an impediment to learning. Likewise, extraneous factors and errors that occur while the data is being recorded distort the values of the features. A good machine learning algorithm should therefore distinguish the signals that can map back to the target function even when extraneous noise is in play.

 

Last modified: 27 Apr 2021

Monday, April 26, 2021

The Various Categories of Machine Learning Algorithms with their Interpretational learnings


Machine learning comes in three different flavours, depending on the algorithm and the objective it serves. One can divide machine learning algorithms into three main groups based on their purpose :

01)      Supervised Learning

02)      Unsupervised Learning

03)      Reinforcement Learning

Now in this article we will learn more on each of the learning techniques in greater detail .

==================================

01)      Supervised Learning

==================================

*  Supervised Learning occurs when an algorithm learns from example data and associated target responses, which can consist of numeric values or string labels such as classes or tags, in order to later predict the correct response when it encounters new examples

*  The supervised learning approach is similar to human learning under the guidance and mentorship of a teacher . This guided teaching and learning of a student under the aegis of a teacher is the basis for Supervised Learning

*  In this process , a teacher provides good examples for the student to memorize and understand and then the student derives general rules from the specific examples

*  One can distinguish between regression problems, whose target is a numeric value, and classification problems, whose target is a qualitative variable such as a class or a tag

*  More on Supervised Learning algorithms, with examples, will be discussed in later articles.

==================================

02)      Unsupervised Learning

==================================

*  Unsupervised Learning occurs when an algorithm learns from plain examples without any associated response in the target variable, leaving it to the algorithm to determine the data patterns on its own

*  This type of algorithm tends to restructure the data into something else, such as new features that may represent a class or a new series of uncorrelated values

*  Unsupervised Learning algorithms are quite useful in providing humans with insights into the meaning of the data, as there are patterns which need to be found out, and in providing new useful inputs to supervised machine learning algorithms

*  As a kind of learning, Unsupervised Learning resembles the methods that humans use to figure out whether certain objects or events belong to the same class, by observing the degree of similarity between them

*  Some of the recommendation systems that one may have come across on retail websites or applications, in the form of marketing automation, are based on this type of learning

*  The marketing automation algorithm derives its suggestions from what one has done in the past

*  The recommendations are based on an estimation of which group of customers one resembles the most, and then inferring one's likely preferences based on that group
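Grouping customers by similarity can be sketched with a bare-bones k-means on a single made-up feature (monthly spend). This is only an illustration of the idea with hypothetical numbers and starting centers, not how any particular retailer's system works:

```python
def k_means_1d(values, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [10, 12, 11, 95, 100, 105]   # made-up monthly spend per customer
centers, clusters = k_means_1d(spend, centers=[0.0, 50.0])
print(centers, clusters)
```

No labels are given anywhere: the two groups emerge from the similarity of the values alone, and a new customer could then be matched to the nearest center to infer likely preferences.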

==================================

03) Reinforcement Learning

==================================

*  Reinforcement Learning occurs when one presents the algorithm with examples that lack labels, as in the case of unsupervised learning.

*  However, one can accompany an example with positive or negative feedback according to the solution the algorithm proposes

*  Reinforcement Learning is connected to the applications for which the algorithm must make decisions ( so the product is mostly prescriptive and not just descriptive as in the case of unsupervised learning ) and on top of that the decisions bear some consequences .

*  In the human world, Reinforcement Learning is just like learning by trial and error

*  In this type of learning, errors help one to learn because the learning is tied to a system of penalties and rewards: a penalty (cost, loss of time, regret, pain and so on) is attached to the results of a poor action, which teaches that this course of action is less likely to succeed than others

*  One of the most interesting examples of reinforcement learning occurs when computers learn to play video games by themselves, climbing through the levels of the game just by learning on their own how to get through each level.

*  The application lets the algorithm know the outcome of each action that it takes.

*  One can come across a typical example of the implementation of a Reinforcement Learning program, developed by Google's DeepMind, which plays old Atari video games in solo mode, at https://www.youtube.com/watch?v=VieYniJORnk

*  From the video, one can notice that the program is initially clumsy and unskilled, but it steadily improves with continued training until it becomes a champion at the task
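Learning from feedback alone can be sketched with a much simpler problem than a video game: an epsilon-greedy agent choosing among three actions whose reward probabilities are hidden from it. This is a standard textbook toy, not the method used in the video; the probabilities and parameter values are made up:

```python
import random

def epsilon_greedy_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)   # running reward estimate per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(values)), key=values.__getitem__)
        # The environment answers only with a reward (1) or nothing (0).
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

values, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(values, counts)
```

The agent starts out with no knowledge and wastes tries on poor actions, but the feedback gradually steers it towards the action that pays off most often, which mirrors the clumsy-then-skilled progression described above.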

 


 


 


 

Wednesday, April 21, 2021

An article on - Conditioning Chance and Probability by Bayes Theorem

Conditioning Chance & Probability by Bayes Theorem


* Probability tells us how likely an event is to occur. Alongside it goes another measure, conditional probability, which is the chance of occurrence of one particular event given the occurrence of some other event that may affect its likelihood.

 

* When one estimates the probability of a given event, one may believe that a certain value applies to it across the whole set of possible events or situations. This belief is expressed by the term "a priori probability", which means the general probability of a given event.

 

* For example, in the throw of a coin: if the coin is fair, the a priori probability of a head is 50 percent. This means that before tossing the coin, one already knows the probability of a positive (in other words, desired) outcome and of a negative (undesired) outcome.

 

* Therefore, no matter how many times one has tossed the coin, when faced with a new toss the probability of a head is still 50 percent and the probability of a tail is still 50 percent.

 

* But if the context changes, the a priori probability is no longer valid, because something has happened that changes the outcome. In such a case, one can express the belief as an a posteriori probability, which is the a priori probability after something has happened that modifies the count or outcome of the event.

 

* For instance, the estimate that a person is either male or female is about 50 percent in almost all cases. But the general assumption that every population has the same demography is wrong: as the article I referenced points out, women tend to live longer than men, and as a result the demographic tilt among the older age groups is towards the female gender.

 

 Hence, taking these factors into account, one should not treat the 50 percent figure as fixed across a whole population, because the gender balance shifts across age brackets and the general estimate does not hold within each of them.

 

* Taking age into account, then, the a posteriori probability is different from the expected a priori one, which in this example would estimate somebody's gender on the belief that there are 50 percent males and 50 percent females in the population.

 

* One can write conditional probability as P(y|x), which is read as the probability of event y given that event x has occurred. Because of the great relevance of conditional probability to machine learning, learning and understanding this notation is of paramount importance to any newcomer or expert in maths, statistics and machine learning. Whenever one comes across the notation P(y|x), it can be read as the probability of event y happening given that x has already happened.
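To make the notation concrete, P(y|x) can be computed from counts as the number of cases where both x and y hold divided by the number of cases where x holds. The survey figures below are made up purely for illustration, and the function name is hypothetical:

```python
from fractions import Fraction

# Made-up survey of 100 people: counts per (hair length, gender) pair.
counts = {("long", "F"): 35, ("long", "M"): 5,
          ("short", "F"): 15, ("short", "M"): 45}

def conditional_probability(y, x):
    # P(y|x) = count(x and y) / count(x)
    joint = counts[(x, y)]
    marginal_x = sum(n for (hair, _), n in counts.items() if hair == x)
    return Fraction(joint, marginal_x)

# General (a priori) probability of being female in this survey.
prior_female = Fraction(
    sum(n for (_, gender), n in counts.items() if gender == "F"),
    sum(counts.values()),
)

print(prior_female)                          # P(F)
print(conditional_probability("F", "long"))  # P(F | long hair)
```

In this made-up survey the general probability of being female is one half, but once long hair is known the probability rises to seven eighths, which is the kind of shift that the notation P(y|x) captures.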

 

* As mentioned above, because it depends on one or more prior conditions, conditional probability is of paramount importance for machine learning, which relies on the statistical conditions under which events occur. If the a priori probability can change because of circumstances, knowing the possible circumstances can give a big boost to one's chances of correctly predicting an event by observing examples, which is exactly what machine learning intends to do.

 

* Generally, the probability that a random person is male or female is around 50 percent. But when one takes into consideration the age structure and mortality of a population, we have seen that the demographic tilt is in favour of females. If, under such conditions, one asks a machine learning algorithm to determine the gender of a person on the basis of evidence such as age or length of hair, the algorithm can use the change in probability that this evidence brings to give a better answer than the a priori 50 percent
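The theorem named in the title of this post, Bayes' theorem, gives the rule for this update: P(y|x) = P(x|y) * P(y) / P(x). The sketch below applies it to the age example with numbers that are purely hypothetical (they are not real demographic figures), and the function name is illustrative:

```python
from fractions import Fraction

def posterior(prior_y, likelihood_x_given_y, likelihood_x_given_not_y):
    # Bayes' theorem for a two-way split (y or not y):
    # P(y|x) = P(x|y) P(y) / [P(x|y) P(y) + P(x|not y) P(not y)]
    evidence = (likelihood_x_given_y * prior_y
                + likelihood_x_given_not_y * (1 - prior_y))
    return likelihood_x_given_y * prior_y / evidence

# Hypothetical figures: half the population is female, 4% of females
# and 2% of males are older than 85.
p_female = Fraction(1, 2)
p_old_given_female = Fraction(4, 100)
p_old_given_male = Fraction(2, 100)

p_female_given_old = posterior(p_female, p_old_given_female, p_old_given_male)
print(p_female_given_old)
```

Under these assumed figures, knowing that a person is older than 85 moves the probability of that person being female from one half to two thirds, even though nothing about the general population has changed.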