Showing posts with label statistics.

Sunday, July 18, 2021

An Introduction to Classification Algorithms with 10 fundamental questions | An Infographic Note with questions and answers

 



Questions Covered:
Q1) What are Naive Bayes Classifiers?
Q2) What do you mean by Probabilistic Classification?
Q3) What do you mean by Statistical Classification?
Q4) Give an example of Statistical Classification.
Q5) What is a Classifier Algorithm?
Q6) Give some examples of various forms of classification.
Q7) Which is the most used Classification algorithm in Statistics?
Q8) What are some of the characteristics of a Classification problem?
Q9) What are the popular acronyms for samples, independent and dependent variables in Machine Learning (ML)?
Q10) What is the difference between Binary Classification and Multiclass Classification?

Thursday, May 6, 2021

The Commutative Property Is Not Applicable to Matrix Multiplication
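A quick numerical check of this fact, assuming NumPy is available; the two matrices are arbitrary examples (matrix addition, by contrast, is commutative):

```python
import numpy as np

# Two arbitrary 2x2 matrices, chosen only for illustration
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)  # [[2 1], [4 3]]
print(B @ A)  # [[3 4], [1 2]]
# The two products differ, so matrix multiplication is not commutative
print(np.array_equal(A @ B, B @ A))  # False
```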


 

Notes on Logistic Regression for Classification - with usage for Email Spam Recognition, Online Transactions, Detection of Cancerous Tumours, Prediction of Threshold Values - Note Part 2

 


Notes on Logistic Regression for Classification - with usage for Email Spam Recognition, Online Transactions, Detection of Cancerous Tumours, Prediction of Threshold Values

 


Basic Codes on Octave for Statistical Calculations on Matrices and Vectors

 


Notes on Matrices, Properties of Matrices, Inverse Matrix and Transpose of a Matrix, Regression, Scenarios on Regression, Representation of Hypothesis Equation, Gradient Descent, Cost Function

 


Friday, April 30, 2021

Validation of Machine Learning Algorithms and Scenarios - A short article

 

                        Validation of Machine Learning Codes

*  It is a widely accepted fact that merely having some examples in the form of datasets and a machine learning algorithm at hand does not assure that solving a machine learning problem is possible, or that the results will provide any desired solution.

 

*  For example, if one wants a computer to distinguish a photo of a dog from a photo of a cat, one can do so with good examples of dogs and cats. One can then train a dog-versus-cat classifier, based on some machine learning algorithm, that outputs the probability that a given photo is that of a dog or of a cat. For a set of photos resembling a given photo, the output takes the form of a validation quantity expressing some level of accuracy: a number reflecting how well the classifier algorithm performed those computations, and with what alacrity and accuracy. Alacrity here conveys the performance and speed aspect of the identification process when the algorithm is computed over a batch of photos, finding resemblance across classes of photos through structuring steps such as segmentation, clustering, KNN and so on. Accuracy, in turn, can be thought of as the degree and magnitude, in percentage terms, of the resemblance between the referenced sample and the sample against which the match is calculated. A minimal sketch of such a classifier follows.
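As a minimal sketch of this idea, the snippet below trains a logistic-regression classifier on hypothetical numeric feature vectors standing in for image features; the feature values, their meaning, and the labels are all invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features extracted from photos (invented numbers):
# each row might be, say, [ear pointiness, snout length] on some scale.
X_train = np.array([[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]])
y_train = np.array([0, 0, 1, 1])  # 0 = cat, 1 = dog

clf = LogisticRegression().fit(X_train, y_train)

# The classifier reports a probability per class rather than a bare label
photo = np.array([[0.15, 0.85]])       # features of a new photo (invented)
p_cat, p_dog = clf.predict_proba(photo)[0]
print(f"P(cat) = {p_cat:.2f}, P(dog) = {p_dog:.2f}")
```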

 

*  Based on the probability, which is expressed as a percentage accuracy, one can then decide the class (that is, whether dog or cat) from the estimated probability as calculated by the algorithm.

 

*  Whenever the obtained probability or percentage is higher for a dog, one can minimise the risk of making a wrong assessment by choosing the higher-probability option, which favours the probability of finding a dog; the decision rule is sketched below.
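One way this decision rule might look in code, with invented probabilities carried over from the sketch above:

```python
import numpy as np

proba = np.array([0.18, 0.82])    # [P(cat), P(dog)], invented values
classes = np.array(["cat", "dog"])

# Pick the class with the higher estimated probability;
# with two classes this equals thresholding P(dog) at 0.5
print(classes[np.argmax(proba)])  # dog
```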

 

*  The greater the probability difference between the likelihood of a dog and that of a cat, the higher the confidence one can have in the resulting choice.

 

*  And in case the probability difference between the likelihood of a dog and that of a cat is small, it can be assumed that the picture of the subject is not clear, or that the subjects in the picture bear much resemblance in features. This indirectly means that some of the pictures of cats are similar to those of dogs, which may cause confusion and lead to the further question of whether the dogs in the concerned pictures look cattish.
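A simple way to quantify this is the margin between the two probabilities; the 0.2 cutoff below is an arbitrary illustration, not a standard value:

```python
import numpy as np

proba = np.array([0.45, 0.55])   # a low-margin prediction (invented values)
margin = abs(proba[1] - proba[0])

# A small margin suggests an unclear photo or overlapping cat/dog features
if margin < 0.2:
    print(f"Low confidence (margin = {margin:.2f}); flag the photo for review")
else:
    print(f"Confident prediction (margin = {margin:.2f})")
```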

 

*  On the point of training a classifier:

When you pose a problem and offer the examples, with each example carefully marked with the label or class that the algorithm should learn, the computer trains the algorithm for a while, and one finally gets a resulting model out of the training process over the dataset; the model then provides one with an answer or a probability.

 


 

*  Labelling is another associated activity that can be carried out, but in the end a probability is just an opportunity to propose a solution and get an answer.

 

*  At such a point, one may have addressed all the issues and might guess that the work is finished, but one should still validate the results: ensure that the generated results are comprehensible to a human, and make sure the user has a clear understanding of the involved background processes and a break-up analysis of the code and the result, which enables other readers to understand the code along with the numbers. A small sketch of this kind of check follows.
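As a minimal sketch of such validation, assuming held-out labelled photos are available, one can compare predictions against known labels and print both a score and a confusion matrix so a reader can inspect the break-up of errors (labels and predictions below are invented):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])  # held-out labels (0 = cat, 1 = dog)
y_pred = np.array([0, 1, 1, 1, 0, 0])  # the model's predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
# Rows are the actual class, columns the predicted class
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```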

 

*  This will be elaborated further in forthcoming sessions/articles, where we will look into the various modes in which machine learning results can be validated and made comprehensible to users.

 

Last modified: 16:39

Wednesday, April 28, 2021

Exploring Cost Functions in ML

 

*  The driving force behind the concept of optimisation in machine learning is the response from a function internal to the algorithm, called the cost function.

*  One may see other terms used in some contexts, such as loss function, objective function, scoring function, or error function, but the cost function is an evaluation function that measures how well the machine learning algorithm maps the target function it is striving to guess; a concrete example follows.
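As a concrete sketch, mean squared error is one of the most common cost functions for regression; the data values below are invented for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared gap between the
    recorded real-world outcomes and the algorithm's predictions."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 2.5])   # actual outcomes (invented)
y_pred = np.array([2.8, 5.4, 2.0])   # model predictions (invented)
print(mse(y_true, y_pred))           # 0.15
```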

 

*  In addition, a cost function determines how well a machine learning algorithm performs in a supervised prediction or an unsupervised optimisation problem.

  

*  The evaluation function works by comparing the algorithm's predictions against the actual outcomes recorded from the real world.

 

*  Comparing a prediction against a real value using a cost function determines the algorithm's error level.

 

*  Since it is a mathematical formulation, a general cost function expresses the error level in numerical form. This means that the parameters of the function can be modulated so as to tune the produced output and keep the overall error of the output low.

  

*  The cost function conveys whatever is actually important and meaningful for the purposes of the learning algorithm.

  

*  As a result, when considering a scenario like stock market forecasting, the cost function expresses the importance of avoiding incorrect predictions: in such a case, one wants to make money while avoiding big losses. In forecasting sales, the concern is different, because one needs to reduce the error in common and frequent situations rather than in rare and exceptional cases, so one uses a different cost function; a sketch of such an asymmetric cost appears below.


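One way to encode such a concern is a custom cost function that penalises one kind of error more heavily; the 3x weight below is an invented illustrative choice, not a standard one:

```python
import numpy as np

def asymmetric_cost(y_true, y_pred, downside_weight=3.0):
    """Squared error with over-optimistic forecasts (pred > actual)
    weighted more heavily, mimicking a forecaster who fears big
    losses more than missed gains; the 3x weight is illustrative."""
    err = y_pred - y_true
    weights = np.where(err > 0, downside_weight, 1.0)
    return np.mean(weights * err ** 2)

y_true = np.array([100.0, 102.0, 98.0])   # actual prices (invented)
y_pred = np.array([105.0, 101.0, 99.0])   # forecasts (invented)
print(asymmetric_cost(y_true, y_pred))    # ~26.33
```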

  

*  When the problem is to predict who is likely to become ill from a certain disease, there are algorithms in place that can assign a higher probability to singling out the people who have the same characteristics and actually did become ill at a later point in time. Depending on the severity of the illness, one may also prefer that the algorithm wrongly flags some people who do not get ill rather than misses people who actually do get ill; this preference too can be written as a cost function, as sketched below.
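A minimal sketch of such a cost, in which missing a truly ill person (a false negative) costs more than a false alarm; the 10:1 ratio is an assumption for illustration:

```python
import numpy as np

def screening_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Total cost with false negatives (missed illness) weighted
    ten times more than false positives (needless follow-up);
    the ratio is an illustrative assumption."""
    fn = np.sum((y_true == 1) & (y_pred == 0))  # missed ill people
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false alarms
    return fn * fn_cost + fp * fp_cost

y_true = np.array([1, 0, 1, 0, 0])  # 1 = became ill (invented data)
y_pred = np.array([0, 1, 1, 0, 1])
print(screening_cost(y_true, y_pred))  # 1 FN * 10 + 2 FP * 1 = 12.0
```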

  

*  Having gone through these aspects of the usability of cost functions and how they are coupled with some ML algorithms in order to fine-tune the result, we will now see and check how the optimisation of a cost function is done, and why.

 

*  Optimisation of Cost Functions:

It is widely accepted that the cost function associated with a machine learning algorithm is what truly drives the success of a machine learning application. An important part of the process is representation, that is, the capability of an algorithm to approximate certain mathematical functions, and along with it comes the necessary optimisation, meaning how the machine learning algorithm sets its internal parameters.

 

*  Most machine learning algorithms come with their own optimisation tied to their own cost functions, which means that some of the better-developed and more advanced algorithms of the day can fine-tune themselves on their own and arrive at a well-optimised result at each step of the formulation. This sometimes leaves the user's role almost futile, since fine-tuning the learning process and presiding over the aspects of learning are then not so relevant.

 

*  Along with these, there are some algorithms that allow you to choose among a certain number of possible functions, providing more flexibility in choosing the course of learning.

 

*  When an algorithm uses a cost function directly in the optimisation process, the cost function is used internally. As algorithms are set to work with certain cost functions, the objective of the optimisation may differ from the desired objective.

 

*  In such circumstances, where an associated cost function is in use, one speaks of an error function or a loss function when the value needs to be minimised, and, conversely, of a scoring function when the objective is to maximise the result; the duality is sketched below.
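The relationship between the two conventions is just a sign flip; the functions and values below are illustrative stand-ins:

```python
import numpy as np

def error(y_true, y_pred):
    # An error/loss function: lower is better, so it is minimised
    return np.mean((y_true - y_pred) ** 2)

def score(y_true, y_pred):
    # A scoring function: higher is better, so it is maximised;
    # here it is simply the negated error
    return -error(y_true, y_pred)

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.5, 2.0])
print(error(y_true, y_pred), score(y_true, y_pred))  # 0.125 -0.125
```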

  

*  With respect to one's target, a standard practice is to define the cost function that works best for the problem, and then to figure out which algorithms work best at optimising it, in order to define the hypothesis space one would like to test. When working with algorithms that do not allow the cost function one wants, one can still indirectly influence their optimisation by fixing their hyper-parameters and selecting the input features with respect to the cost function. Finally, when all the algorithm results have been gathered, one may evaluate them using the chosen cost function and then decide on the final hypothesis with the best result from the chosen error function; this last step is sketched below.
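A minimal sketch of that final selection step, comparing hypothetical candidate models by the chosen error function and keeping the best (all names and values invented):

```python
import numpy as np

def mse(y_true, y_pred):           # the chosen error function
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 2.0, 3.0])          # held-out outcomes (invented)
candidates = {
    "model_a": np.array([1.1, 2.2, 2.7]),   # each model's predictions
    "model_b": np.array([0.8, 2.0, 3.3]),
}
best = min(candidates, key=lambda name: mse(y_true, candidates[name]))
print("Chosen hypothesis:", best)           # model_b
```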

 

*  Whenever an algorithm learns from a dataset (a combination of multiple data points arranged by attribute), the cost function associated with that particular algorithm guides the optimisation process by pointing out the changes in the internal parameters that are the most beneficial for making better predictions. This optimisation continues as the cost function's response improves iteration by iteration, a result of the algorithm's iterative learning. When the response stalls or worsens, it is time to stop tweaking the algorithm's parameters, because the algorithm is not likely to achieve better prediction results from there on. And when the algorithm works on new data and makes predictions, the cost function helps to evaluate whether it is working correctly; the loop is sketched below.
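As a sketch of this loop, the gradient-descent snippet below fits a one-parameter linear model and stops once the cost's response stalls; the data, learning rate, and stopping tolerance are all illustrative assumptions:

```python
import numpy as np

# Invented data where y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

w, lr, prev_cost = 0.0, 0.01, np.inf
for step in range(1000):
    y_pred = w * x
    cost = np.mean((y - y_pred) ** 2)      # the cost function's response
    if prev_cost - cost < 1e-9:            # response has stalled: stop tweaking
        break
    prev_cost = cost
    grad = -2 * np.mean((y - y_pred) * x)  # most beneficial direction of change
    w -= lr * grad                         # update the internal parameter
print(f"stopped at step {step}, w = {w:.3f}, cost = {cost:.4f}")
```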

 

In conclusion: even if the decision to adopt any particular cost function is an underrated activity in machine learning, it is still a fundamental task, because it determines how the algorithm behaves after learning and how it will handle the problem one would like to take up and solve. It is suggested that one should not go with the default cost functions, but rather ask oneself what the fundamental objective of using such a cost function should be, so that it yields the appropriate result.

 

 

 

Last modified: 00:26