Friday, April 30, 2021

Updating Machine Learning Algorithms by Mini-Batches and Batches


*  Machine Learning boils down to an optimization problem in which one looks for the global minimum of a given cost function

 

*  Running an optimization algorithm on all the available data is an advantage, because at every iteration the algorithm can check, against the entire dataset, how much the cost has been minimized

 

*  This is the main reason why machine learning algorithms prefer to use all the available data, which they want to keep in the memory of the residing computer or in the memory of the GPU, backed by plenty of secondary storage

 

 

*  Learning techniques based on statistical algorithms use calculus and matrix algebra, and they need all the data within memory.

  

*     Simpler algorithms, such as those that search step by step for the next best solution, proceeding iteration by iteration through partial solutions (gradient descent, for example), also gain an advantage from developing a hypothesis on all the data, because they can catch weaker signals on the spot and avoid being fooled by the noise in the data. This holds whether the learning is supervised or unsupervised, and whether noise is present or absent.

  

*  While operating within the data limits of the computer's memory, one can think of working in core memory. Put simply, all the operational computations take place within the memory of the computer, which may be primary memory or secondary memory. Because computation needs fast access first and foremost, primary memory is assigned to the task and springs into action when an incoming process is triggered.

  

*     This mechanism of allocating memory to a process and executing the algorithm on it is called a "Batch Algorithm": just as machines in a factory process batches of materials, such algorithms learn from, and predict on, a whole batch of data at a time. The incoming data is generally represented in the form of a data matrix.
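*  As a small illustration of such a batch algorithm, the sketch below (a minimal example, assuming a NumPy environment and a synthetic data matrix) fits a linear model with full-batch gradient descent, touching every row of the data matrix at each iteration:

    import numpy as np

    # Hypothetical data matrix X (rows = examples, columns = features) and target y,
    # both assumed to fit comfortably in core memory.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

    w = np.zeros(X.shape[1])          # model parameters
    learning_rate = 0.1

    for epoch in range(200):
        predictions = X @ w                              # uses ALL the rows at every iteration
        gradient = X.T @ (predictions - y) / len(y)      # gradient of the squared-error cost
        w -= learning_rate * gradient                    # one batch update per full pass

    print(w)   # approaches [1.5, -2.0, 0.5]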

  

*  Sometimes, however, the data cannot fit into core memory because it is too big. Data derived from the web is a typical example of information that cannot easily fit into memory: much of it is heterogeneous in form and cannot be boiled down to a single format such as XML, JSON, SQL, NoSQL or other big-data layouts, so the derived data is relatively hard to decipher and to fit in memory.

 

*  In addition, data derived from sensors, tracking devices, satellites and video monitoring devices is often problematic because of its size compared to the computer's RAM; however, it can be stored easily on a hard disk, given the availability of cheap and large storage devices that easily hold terabytes of data

  

*  A few strategies can help when the data is too big to fit into the standard memory of a single computer. A first solution one can try is to subsample the data into smaller samples.

  

*  Here the data is reshaped, by selecting cases (and sometimes features) through statistical sampling, into a more manageable, reduced data matrix. Reducing the data cannot always guarantee the same results as analysing the full data, and working with less data can produce less powerful models. However, if the subsampling is executed properly, the approach can generate reliable and good results. A successful subsampling must therefore use statistical sampling correctly, by employing random or stratified sample drawings

 

*  Now let us take a bird's-eye view of the various methods of sampling used while reshaping and reducing data :

 

1)     Random Sampling

 

*  In random sampling, one creates a sample by randomly choosing examples from the data. The larger the size of the sample, the more likely it is that the sample will resemble the original structure and variety of the data.
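*  A minimal sketch of random sampling (assuming NumPy and, for simplicity, a data matrix already in memory; with a file-backed dataset one would sample row indices and read only those rows):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1_000_000, 10))   # pretend this is the "big" data matrix

    sample_size = 10_000
    idx = rng.choice(len(data), size=sample_size, replace=False)  # uniform random rows
    sample = data[idx]

    # The larger the sample, the closer its statistics track the full data.
    print(data.mean(axis=0)[:3])
    print(sample.mean(axis=0)[:3])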

 

2)     Stratified Sampling

 

*  In stratified sampling, one controls the final distribution of the target variable, or of certain features deemed critical, so that the sample successfully replicates the characteristics of the complete data.

 

*  A classic example of stratified sampling is drawing a sample from a classroom made up of different proportions of males and females in order to estimate the average height of the class.

 

*  If the females of the class are, on average, shorter than the males, one should draw a sample that replicates the same proportions of males and females in order to obtain a reliable estimate of the average height.

 

*  If one were to sample only the males by mistake, one would overestimate the average height: the sub-sample would be tilted towards its more dominant contributors, so the attribute of the boys' height would override the less represented attribute, the average height of the girls, and the estimate of the overall trend would be biased upwards.
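*  A brief sketch of stratified sampling using scikit-learn's train_test_split; the classroom data below is made up purely for this example:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical classroom: 60% males, 40% females, with simplified heights in cm.
    classroom = pd.DataFrame({
        "sex":    ["M"] * 60 + ["F"] * 40,
        "height": [175] * 60 + [162] * 40,
    })

    # stratify= keeps the male/female proportions intact in the drawn sample.
    sample, _ = train_test_split(classroom, train_size=20,
                                 stratify=classroom["sex"], random_state=0)

    print(sample["sex"].value_counts(normalize=True))  # still ~60% M / 40% F
    print(sample["height"].mean())                      # close to the true class average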

 

================

Sampling Strategy

================

 

*  In order to avoid the problems that might come up during random or stratified sampling, one has to draw a sub-sample with enough examples, starting from a clear idea of the exact requirement one is trying to fulfil, and that requirement has to be reflected in the sampling strategy used, so that the sample still represents the variety of the data.

 

*     One then has to choose a proper sampling strategy that represents the variety of the data while respecting the existing memory limitations. Data with high dimensionality, characterised by many cases and many features, is more difficult to sub-sample, because it needs a much larger sample, which may not even fit into core memory.

  

=========

Network Parallelism

=========

 

*  Beyond sub-sampling, a second possible solution to fitting the data in memory is to leverage "network parallelism", which splits the data across multiple computers connected over a network. Each computer handles part of the data for the optimization; after each computer has done its own computation, the parallel partial results are reduced into a single solution, so no single machine's core memory has to hold all the data.

 

*     To understand how this solution works, compare it to building a car piece by piece, starting from the framework in the blueprint up to the complete body, on a line of assembly workers and robotic manufacturing hands. Apart from achieving faster assembly, one does not have to keep all the parts in the factory at the same time. In a very similar manner, one does not have to keep all the data parts within a single computer or computing device; one can take advantage of a distributed architecture that works on the data in a distributed and parallel fashion over different computers, thereby overcoming some of the core memory limitations.

 

 

*  This approach is the basis of map-reduce technology and of cluster-computing frameworks such as Apache Spark. A quick recap of the underlying idea: map-reduce takes numerous data files, maps each of them independently into intermediate key-value pairs (for example, word counts computed per file), and then reduces the pairs that share a key into a final result kept in the storage architecture. Clustered computers and parallel servers follow a similar structure of data representation, with master and slave nodes for data storage and data access. One needs to explore further the exact manner in which such storage systems facilitate high-level data storage.
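*  As a toy illustration of the map-reduce idea, the single-process sketch below counts words by mapping each document to partial counts and then reducing the partial counts into one result; real frameworks such as Apache Spark or Hadoop distribute the same two steps over many nodes:

    from collections import Counter
    from functools import reduce

    documents = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]

    # Map step: each document is independently turned into (word, count) pairs.
    mapped = [Counter(doc.split()) for doc in documents]

    # Reduce step: partial counts are merged into one final result.
    word_counts = reduce(lambda a, b: a + b, mapped, Counter())

    print(word_counts.most_common(3))   # e.g. [('the', 6), ('cat', 2), ('sat', 2)]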

  

*     All these technologies focus on mapping a problem over multiple machines and then reducing their output into the desired solution. This means that large-scale machine learning computations are not done by a single data server or computer; they are done in a parallel and distributed manner, touching many nodes (the child nodes do the work and the results are assimilated back at a root node), before the final result of the learning process is returned to the user reading the results at the root node.

  

*     However, even with such a sophisticated and complex system in place, not every machine learning algorithm can be split into separable processes, and this limits the usability of the approach. More importantly, one encounters a significant cost and time overhead in setting up and maintaining a network of computers for this kind of data processing. Since computation and infrastructure at this scale is beyond the reach of individuals with limited funding and small application setups, it is mainly hosted and run by a slew of large-scale organisations with the ability to build and operate big infrastructure for implementing, organising and running the chain.

  

*  The third solution is to rely on out-of-core algorithms, which work by keeping the data on the storage device and feeding it into the computer memory in chunks for processing. The feeding process is called streaming. Because the data chunks are smaller than core memory, the algorithm can handle them properly and use them to update the machine learning optimization. After the update, the system discards them in favour of new chunks, which the algorithm then uses for further learning. This process repeats until there are no more chunks left. Chunks can be small (relative to core memory), in which case the process is called mini-batch learning, or they can consist of just a single example, which is called online learning.
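*  A minimal sketch of out-of-core (mini-batch) learning with scikit-learn's SGDRegressor, whose partial_fit method updates the model one chunk at a time; the stream_chunks generator here is a hypothetical stand-in for data streamed from disk (for example via pandas.read_csv with a chunksize):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0, 0.5])

    def stream_chunks(n_chunks=100, chunk_size=256):
        """Hypothetical stand-in for chunks streamed from disk or the network."""
        for _ in range(n_chunks):
            X = rng.normal(size=(chunk_size, 3))
            y = X @ true_w + rng.normal(scale=0.1, size=chunk_size)
            yield X, y

    model = SGDRegressor(learning_rate="invscaling", eta0=0.05)

    for X_chunk, y_chunk in stream_chunks():
        model.partial_fit(X_chunk, y_chunk)   # update, then the chunk can be discarded

    print(model.coef_)   # approaches [2.0, -1.0, 0.5]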

 

*  The previously described gradient descent, like other iterative algorithms, works fine with such an approach; however, reaching the optimum takes longer, because the gradient's path is more erratic and non-linear compared with a batch approach. On the other hand, the algorithm can reach a solution using fewer computations than its in-memory version.

  

*  When the parameter updates are based on mini-batches or single examples, the gradient descent algorithm takes the name of stochastic gradient descent, and it will reach a proper optimization solution given the following prerequisites :

 

1)    The examples streamed are randomly extracted (hence the name stochastic, recalling the idea of a random extraction)

 

2)    A proper learning rate is defined, either as a fixed value or as a flexible one that changes according to the number of observations seen or other criteria

  

*  The learning rate can make a great difference in the quality of the optimisation: a high learning rate, even though it makes the optimisation faster, can constrain the parameters to the effects of noisy or erroneous examples seen at the beginning of the stream.

 

*  A high learning rate also renders the algorithm insensitive to the later streamed observations, which can prove to be a problem when the algorithm is learning from sources that are naturally evolving and mutable, such as data from the digital advertising sector, where new advertising campaigns keep changing the level of attention and response of the targeted individuals
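*  A minimal sketch of stochastic gradient descent on streamed single examples, with a learning rate that decays as more observations arrive (a NumPy toy example, assuming the stream delivers randomly drawn examples):

    import numpy as np

    rng = np.random.default_rng(1)
    true_w = np.array([3.0, -2.0])

    w = np.zeros(2)
    eta0 = 0.5                        # initial learning rate

    for t in range(1, 10_001):
        # One randomly extracted example at a time (the "stochastic" part).
        x = rng.normal(size=2)
        y = x @ true_w + rng.normal(scale=0.1)

        error = x @ w - y
        eta = eta0 / np.sqrt(t)       # learning rate decays with the observation count
        w -= eta * error * x          # single-example gradient step

    print(w)   # close to [3.0, -2.0]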

 

Last modified: 12:23


Validation of Machine Learning Algorithms and Scenarios - A short article

 

                        Validation of Machine Learning Codes

*  It is a widely accepted fact that just having some examples in the form of datasets and a machine learning algorithm at hand does not assure that solving a machine learning problem is possible, or that the results will provide the desired solution

 

*  For example, if one wants a computer to distinguish a photo of a dog from a photo of a cat, one can do it with good examples of dogs and cats. One can then train a dog-versus-cat classifier, based on some machine learning algorithm, that outputs the probability that a given photo shows a dog or a cat. For a set of photos, the output can be summarised as a validation quantity expressing the classifier's accuracy, a number reflecting how well the algorithm performed its computations and with what alacrity. Alacrity here conveys the performance and speed of the identification process when the algorithm is run over a batch of photos, after structuring steps such as segmentation, clustering, KNN and so on; accuracy conveys the degree, expressed as a percentage, to which the referenced sample resembles the sample against which the match is calculated.

 

*     Based on the probability, expressed as a percentage accuracy, one can then decide the class (that is, dog or cat) according to the estimated probability calculated by the algorithm.

 

*  Whenever the obtained probability is higher for a dog, one minimizes the risk of making a wrong assessment by choosing the class with the higher chance, the one favouring the probability of finding a dog.
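*  A small sketch of this decision rule with a scikit-learn classifier; the features and labels below are hypothetical placeholders for real image features extracted from the photos:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical pre-extracted image features and labels: 0 = cat, 1 = dog.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 5))
    y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

    clf = LogisticRegression().fit(X_train, y_train)

    X_new = rng.normal(size=(3, 5))
    proba = clf.predict_proba(X_new)          # columns: P(cat), P(dog)

    # Pick the class with the higher estimated probability.
    predicted = np.argmax(proba, axis=1)
    print(proba.round(2), predicted)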

 

*  The greater the probability difference between the likelihood of a dog and that of a cat, the higher the confidence one can have in the choice.

 

*  Conversely, when the probability difference between the likelihood of a dog and that of a cat is small, it can be assumed that the picture is not clear, or that the subjects in the picture bear a strong resemblance in features, which would mean that some of the pictures of cats look similar to those of dogs; this confusion can lead to the supposition that the dogs in the concerned pictures look rather cattish.

 

*   On the point of training a classifier :

When you pose a problem and offer the examples, with each example carefully marked with the label or class that the algorithm should learn, the computer trains the algorithm for a while and finally produces a resulting model from the training process over the dataset.

 

*  The resulting model can then be queried, and it provides an answer, or a probability, for new examples.

 

*  Labelling is another associated activity that can be carried out, but in the end a probability is just an opportunity to propose a solution and get an answer

 

*  At such a point one may have addressed all the issues and might guess that the work is finished, but one should still validate the results: first to ensure that they are comprehensible to a human, and then to make sure that the user has a clear understanding of the background processes involved and a break-up analysis of the code and the result, which enables other readers to understand the code along with the numbers

 

*  This will be elaborated in forthcoming sessions / articles, where we will look into the various ways in which machine learning results can be validated and made comprehensible to users

 

Last modified: 16:39

Wednesday, April 28, 2021

Exploring Cost Functions in ML

 

*  The driving force behind optimization in machine learning is the response from a function internal to the algorithm, called the cost function

 *  One may see other terms used in some contexts, such as loss function, objective function, scoring function, or error function, but the cost function is an evaluation function that measures how well the machine learning algorithm maps the target function it is striving to guess

 

*  In addition, a cost function determines how well a machine learning algorithm performs in a supervised prediction or an unsupervised optimisation problem

  

*  The evaluation function works by comparing the algorithm's predictions against the actual outcomes recorded from the real world.

 

*     Comparing a prediction against the real value through the cost function determines the algorithm's error level

 

*     Since it is a mathematical formulation, a cost function expresses the error level in a numerical form. Its value changes with the parameters of the learned function, and the algorithm tunes those parameters so as to keep the error of the overall output low.
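*  A short sketch of such a comparison, using mean squared error as the cost function (NumPy, with made-up predictions and real values):

    import numpy as np

    def mse_cost(predictions, actuals):
        """Mean squared error: the average squared gap between prediction and reality."""
        return np.mean((predictions - actuals) ** 2)

    actual     = np.array([3.0, 5.0, 7.5, 10.0])
    good_model = np.array([3.1, 4.8, 7.6,  9.9])
    poor_model = np.array([1.0, 8.0, 4.0, 14.0])

    print(mse_cost(good_model, actual))   # small number -> low error level
    print(mse_cost(poor_model, actual))   # large number -> high error level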

  

*  The cost function transmits whatever is actually important and meaningful for the purposes of the learning algorithm

  

*  As a result, when considering a scenario like stock market forecasting, the cost function expresses the importance of avoiding incorrect predictions: in such a case one wants to make some money while avoiding big losses. In forecasting sales, the concern is different, because one needs to reduce the error in common and frequent situations rather than in rare and exceptional cases, so one uses a different cost function.



  

*  When the problem is to predict who is likely to become ill from a certain disease, algorithms exist that score a higher probability for people who share the characteristics of those who actually did become ill at a later point in time. Depending on the severity of the illness, one may also prefer an algorithm that wrongly flags some people who do not get ill over one that misses the people who actually do get ill
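*  A brief sketch of how such an asymmetric preference can be expressed, here by weighting the "ill" class more heavily in a scikit-learn logistic regression; the data and the chosen weight of 10 are purely illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)   # 1 = became ill (rare)

    # Missing an ill person is treated as 10 times worse than a false alarm.
    clf = LogisticRegression(class_weight={0: 1, 1: 10}).fit(X, y)

    flagged = (clf.predict(X) == 1).sum()
    print("flagged as at risk:", flagged, "vs", (y == 1).sum(), "actual cases")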

  

*  Having gone through these aspects of how cost functions are used and how they are coupled with ML algorithms in order to fine-tune the result, we will now look at how and why the optimisation of a cost function is done.

 

*   Optimisation of Cost Functions :

It is widely accepted that the cost function associated with a machine learning model is what truly drives the success of a machine learning application. An important part of the process is representation, that is, the capability of an algorithm to approximate certain mathematical functions, together with optimisation, that is, how the machine learning algorithm sets its internal parameters.

 

*  Most machine learning algorithms come with their own optimisation tied to their own cost function, which means that the better-developed and more advanced algorithms of the day can fine-tune themselves and arrive at a well-optimised result at each step on their own. This sometimes leaves little room for the user, whose role in fine-tuning and presiding over the learning process becomes less relevant.

 

*  Alongside these, there are algorithms that allow you to choose among a certain number of possible cost functions, providing more flexibility in choosing the course of learning

 

*  When an algorithm uses a cost function directly in the optimisation process, the cost function is used internally. Since algorithms are set to work with certain cost functions, the objective of the optimisation may differ from the desired objective.

 

*  Depending on the direction of the optimisation, the associated function goes by different names: it is called an error function or loss function when its value needs to be minimised, and a scoring function when the objective is to maximise the result.

  

*  With respect to one's target, a standard practice is to define the cost function that works best in solving the problem and then to figure out which algorithm works best in optimising it, in order to define the hypothesis space one would like to test. When working with algorithms that do not allow the cost function one wants, one can still influence their optimisation process indirectly, by fixing their hyper-parameters and selecting the input features with respect to that cost function. Finally, once all the algorithm results have been gathered, one can evaluate them using the chosen cost function and then decide on the final hypothesis with the best result from the chosen error function.
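*  A minimal sketch of this practice with scikit-learn: define the cost function one cares about, wrap it with make_scorer, and use it to compare candidate algorithms via cross-validation (the asymmetric cost below is just one illustrative choice):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression, SGDRegressor
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import cross_val_score

    def asymmetric_cost(y_true, y_pred):
        """Illustrative cost: over-predictions are penalised twice as much as under-predictions."""
        diff = y_pred - y_true
        return np.mean(np.where(diff > 0, 2.0 * np.abs(diff), np.abs(diff)))

    # greater_is_better=False tells scikit-learn this is a cost to be minimised.
    my_scorer = make_scorer(asymmetric_cost, greater_is_better=False)

    X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

    for model in (LinearRegression(), SGDRegressor(max_iter=2000, random_state=0)):
        scores = cross_val_score(model, X, y, scoring=my_scorer, cv=5)
        print(type(model).__name__, -scores.mean())   # lower cost is better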

 

*  Whenever an algorithm learns from a dataset (a combination of data arranged by attributes), the cost function associated with that algorithm guides the optimisation process by pointing out the changes in the internal parameters that are the most beneficial for making better predictions. The optimisation continues as the cost function response improves iteration by iteration. When the response stalls or worsens, it is time to stop tweaking the algorithm's parameters, because the algorithm is not likely to achieve better prediction results from there on. And when the algorithm works on new data and makes predictions, the cost function helps to evaluate whether it is working correctly

 

In conclusion: even if the decision about which cost function to use is an underrated activity in machine learning, it is still a fundamental task, because it determines how the algorithm behaves after learning and how it handles the problem one wants to solve. It is suggested not to go with the default cost function, but rather to ask oneself what the fundamental objective of using it is, so that it yields the appropriate result

 

 

 

Last modified: 00:26