Data Science and AI Quest: R language


Tuesday, April 20, 2021

Exploring the World of Probability Theory in ML .. derived article with own interpretations

Exploring the World of Probability Theory in ML

* What is probability and how can it be used? Probability is the likelihood of an event. It lets one determine mathematically (Gannita Gyaana) how likely something is to happen, by relating the number of ways the event can occur to the total number of possible outcomes.

* The probability of an event is measured on a scale from 0 (the event cannot occur) to 1 (the event is certain to occur). A value in between indicates how close the event lies to either extreme.

* The probability of picking a certain suit from a deck of cards (generally referred to as "Taash" in many Asian countries) is one of the most classic examples used to explain probabilities.

* The deck contains 52 cards (joker cards excluded), divided into four suits: clubs and spades, which are black, and diamonds and hearts, which are red.

* Therefore, to determine the probability that a picked card is an ace, one must consider that there are four aces, one in each suit. The probability of this event is p = 4/52, which evaluates to approximately 0.077.

* Probabilities lie between 0 and 1; no probability can fall outside these boundaries. An event that cannot occur has probability 0, and an event that is certain to occur has probability 1.
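The ace calculation can be checked with a short R sketch. The deck is built here with expand.grid(); the object names (ranks, suits, deck, p_ace) are illustrative and not part of the original article.

```r
# Build a 52-card deck: 13 ranks in each of 4 suits (names are illustrative)
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("clubs", "spades", "diamonds", "hearts")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

# Probability of an ace = number of aces / number of cards in the deck
p_ace <- sum(deck$rank == "A") / nrow(deck)
print(round(p_ace, 3))
```

Counting rows instead of typing 4/52 keeps the numerator and denominator tied to the deck itself, so the same pattern works for any other card property.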

* Suppose one wants to predict probabilities for fraud detection: how many times bank-transaction fraud has occurred over a given set of bank accounts, or how often fraud happens while conducting a banking transaction. A similar question is how many times people get a certain disease in a particular country. After collecting all such events, one can estimate the probability of a forthcoming event with regard to its frequency, mode and time of occurrence, as well as the accounts likely to be affected by the fraud and the conditions likely to affect them.

The estimate is calculated by counting the number of times a particular event occurred and dividing by the total number of events that could possibly occur.

* One can count the number of times fraud happens using recorded data (mostly taken from databases) and then divide that figure by the total number of generic events or observations available.

* Therefore, one should divide the total number of frauds by the number of transactions within a year, or divide the number of people who fell ill during the year by the population of a certain area. The result is a number ranging from 0 to 1, which one can use as a baseline probability for a certain event under certain circumstances.
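As a minimal sketch of this baseline estimate, the division can be written in R as follows. The two counts are invented purely for illustration; they do not come from any real bank or from the original article.

```r
# Hypothetical counts, invented for illustration only
n_transactions <- 200000   # all transactions observed in one year
n_frauds <- 150            # transactions recorded as fraudulent

# Baseline probability = occurrences of the event / total observations
p_fraud <- n_frauds / n_transactions
print(p_fraud)
```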

* Counting all the occurrences of an event is not always possible, which is why one needs the concept of sampling. Sampling means observing a small part of a larger set of events or objects, chosen with some expectation of probability. From the sample one can infer probabilities for an event, as well as quantitative measurements or qualitative classes related to a set of objects, though the inferred values are estimates and may not be exactly correct.

* Example - If one wants to track the sales of cars in a certain country, one doesn't need to track every sale that occurs in that geography. Rather, using a sample of sales from new car sellers around the country, one can determine quantitative measures such as the average price of a car sold, or qualitative measures such as the car model sold most often.
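The idea of estimating from a sample can be sketched in R with simulated data. The population of car prices below is generated with rnorm() and is entirely hypothetical; the point is only that the sample mean lands close to, but not exactly on, the population mean.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of car sale prices (simulated, not real data)
population <- rnorm(100000, mean = 25000, sd = 5000)

# Observe only a small random sample instead of every sale
sales_sample <- sample(population, size = 1000)

# The sample mean approximates the population mean
pop_mean <- mean(population)
sample_mean <- mean(sales_sample)
print(c(population = pop_mean, sample = sample_mean))
```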

Some Operating cases on Probabilities


* Operations on probabilities differ a little from ordinary numeric operations, because the results must always stay within the range 0 to 1

* One must rely on a set of rules for the operations to make sense to the person conducting the experiment. For example, someone conducting a coin-tossing experiment must strictly define the rules by which the game is played. The instructor declares which outcomes count as valid and which must be discarded the moment the norms of the game are violated.

* For example, suppose the coin does not land on either side but comes to rest standing on its edge. The outcome is then neither heads nor tails, and it cannot be counted as half heads and half tails either. In such a circumstance the toss is nullified, and the event is struck off from the set of valid outcomes. That is why one should fix the rules of the experiment before conducting it: they state which events count as valid outcomes and which do not.

* Another property one needs to be aware of is the summation of probabilities: probabilities can be added only when the events concerned are mutually exclusive. For example, consider rolling a die in a game of ludo. The possible outcomes of a throw are 1, 2, 3, 4, 5 and 6, and the probability of each is 1/6 (1 by 6). The outcomes are disjoint and mutually exclusive, and since they are equally likely, the probability of each is one divided by the total number of outcomes in the sample space. To find the probability that any one of the outcomes occurs, one adds up the individual probabilities, which yields 1. So, in retrospect, the individual outcomes of such an experiment are disjoint and mutually exclusive, and their probabilities sum to 1.
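The die example can be written out in R as a small sketch. The vector name p_each and the "1 or 2" event are illustrative choices, not taken from the original text.

```r
# The six mutually exclusive outcomes of one throw of a fair die
outcomes <- 1:6
p_each <- rep(1 / length(outcomes), length(outcomes))
names(p_each) <- outcomes

# Adding all the individual probabilities gives the whole sample space
p_total <- sum(p_each)

# Addition also works for any subset of mutually exclusive outcomes,
# e.g. the probability of throwing a 1 or a 2
p_1_or_2 <- unname(p_each["1"] + p_each["2"])

print(c(total = p_total, one_or_two = p_1_or_2))
```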

* Another simple example demonstrates the calculation: picking a spade or a diamond from a deck of cards. Total number of cards in the entire deck = 52. Number of cards in the house of clubs = 13, number of cards in the house of spades = 13, number of cards in the house of hearts = 13, number of cards in the house of diamonds = 13. The probability of picking a card from the house of diamonds is 13/52; likewise, the probability of picking a card from the house of spades is 13/52. Since a card cannot be both a spade and a diamond, the total probability of picking a card from either house is 26/52, which equals 0.5

* Subtraction helps to determine the probability that an event does not occur. For instance, to find the probability of drawing a card that is not a diamond from the overall deck, one starts from the total probability of drawing any card, which is 1, and subtracts the probability of drawing a diamond: 1 - 0.25 = 0.75. The result is the complement of the event, which gives the probability of non-occurrence of a particular event.
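Both the addition of suit probabilities and the complement rule can be sketched in a few lines of R. The suits vector is an illustrative stand-in for a deck in which only the suit of each card is recorded.

```r
# One entry per card, recording only its suit (13 cards per suit)
suits <- rep(c("clubs", "spades", "diamonds", "hearts"), each = 13)

# Proportion of cards in a suit = probability of drawing that suit
p_spade <- mean(suits == "spades")
p_diamond <- mean(suits == "diamonds")

# Addition: a card cannot be both a spade and a diamond
p_spade_or_diamond <- p_spade + p_diamond

# Complement: probability of NOT drawing a diamond
p_not_diamond <- 1 - p_diamond

print(c(spade_or_diamond = p_spade_or_diamond, not_diamond = p_not_diamond))
```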

* Multiplication is helpful for finding the intersection of a set of independent events. Independent events are those which do not influence each other. For instance, if one throws two dice together, the probability of getting two sixes is 1/36. The probability of obtaining a 6 on the first die is 1/6, and the probability of obtaining a 6 on the second, independent die is also 1/6; multiplying the two gives 1/36, or approximately 0.028.

* Using the concepts of summation, difference and multiplication, one can obtain the probability of most calculations which deal with events. For instance, the probability of getting at least one six from two throws of a die is a summation of mutually exclusive events. The first of these is obtaining two sixes: p = 1/6 * 1/6 = 1/36

* In a similar manner, the probability of having a six on the first throw and something other than a six on the second throw is p = (1/6) * (1 - 1/6) = 5/36. By symmetry, the same value applies to a six on the second throw only.

* The probability of getting at least one six from two thrown dice is therefore p = 1/6 * 1/6 + 2 * 1/6 * (1 - 1/6) = 11/36
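These three results can be verified by listing all 36 equally likely outcomes of two throws. The sketch below uses expand.grid() for the enumeration; the object names are illustrative.

```r
# Every combination of two throws of a fair die: 36 equally likely rows
rolls <- expand.grid(first = 1:6, second = 1:6)

# Proportion of rows matching each event = probability of that event
p_two_sixes      <- mean(rolls$first == 6 & rolls$second == 6)
p_six_then_other <- mean(rolls$first == 6 & rolls$second != 6)
p_at_least_one   <- mean(rolls$first == 6 | rolls$second == 6)

# Same result assembled from the product and sum rules
p_by_formula <- (1/6) * (1/6) + 2 * (1/6) * (1 - 1/6)

print(c(p_two_sixes, p_six_then_other, p_at_least_one, p_by_formula))
```

Enumerating the sample space is only practical for small experiments, but it is a useful check that the product and sum rules were combined correctly.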

Monday, April 19, 2021

Advanced Matrix Operations – A theoretical view


* One may encounter some important matrix operations using algorithmic formulations

* The advanced matrix operations discussed here are the transpose and the inverse of a dataset held in matrix form

* Transposition occurs when a matrix of shape n x m is transformed into a matrix in the form of m x n by exchanging the rows with the columns

* Most texts indicate the operation using the superscript T, in the form A^T (A transpose)
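In R, transposition is available through the built-in t() function. A minimal sketch with an illustrative 2 x 3 matrix:

```r
# A 2 x 3 matrix (values are arbitrary, chosen for illustration)
A <- matrix(1:6, nrow = 2, ncol = 3)

# t() exchanges rows with columns, giving a 3 x 2 matrix
At <- t(A)

print(dim(A))
print(dim(At))

# Element (i, j) of A becomes element (j, i) of the transpose
print(A[1, 2] == At[2, 1])
```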

* One can apply "matrix inversion" to matrices of shape m x m, which are square matrices with the same number of rows and columns. In mathematical language, such a matrix is said to have m rows and m columns.

* This operation is important for solving equations that involve matrix multiplication, such as y = bX, where one has to discover the values in the vector b. More on matrix multiplication, with more conceptual examples, will be showcased in another article, in which I shall try to cover how the multiplication of different matrices occurs and how it is used to solve more important / complex problems.

* Since most scalar numbers (zero being the exception) have a number whose multiplication with them results in a value of 1, the idea is to find a matrix inverse whose multiplication with the original matrix results in a special matrix called the identity matrix. Its elements are zero, except the diagonal elements (the elements in positions where the index i is equal to the index j), which are 1.

* The scalar case is the simplest: the scalar number n has an inverse, n to the power minus 1, which can be written as 1/n (1 upon n).

* By analogy, the inverse of a matrix A is indicated as A to the power minus 1. Sometimes, however, finding the inverse of a matrix is impossible.

* When a matrix cannot be inverted, it is referred to as a "singular matrix" or a "degenerate matrix". Singular matrices are quite rare in practice.
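A short R sketch of inversion, using the base function solve(). The matrices X and S and the vector b_true are invented for illustration; the sketch recovers b from y = bX and shows that solve() raises an error for a singular matrix.

```r
# An invertible 2 x 2 matrix (values chosen for illustration)
X <- matrix(c(2, 1, 1, 3), nrow = 2)
X_inv <- solve(X)

# Multiplying a matrix by its inverse gives the identity matrix
print(round(X %*% X_inv, 10))

# Recover b from y = bX by multiplying y with the inverse of X
b_true <- c(1, 2)
y <- b_true %*% X
b <- as.vector(y %*% X_inv)
print(b)

# A singular matrix: the second column is twice the first
S <- matrix(c(1, 2, 2, 4), nrow = 2)
is_singular <- inherits(try(solve(S), silent = TRUE), "try-error")
print(is_singular)
```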

Monday, April 12, 2021

Working with Data in Machine Learning - An overview of methodology for working over Data/Datasets in Machine Learning using R and Python

Working with Data in Machine Learning

* Machine Learning is one of the most appealing subjects because it allows machines to learn from real-world examples, such as sales records, signals from sensors and textual data streaming from the internet, and then determine what such data imply

* Some of the common outputs that can come from machine learning algorithms are the following: prediction of the future, prescription to act on some given knowledge or information (including prescriptive knowledge for the design and build-up of applications), and creation of new knowledge in terms of examples categorised by groups

* Some of the applications which are already in place and have become a reality thanks to leveraging the use of such knowledge are the following things :

01) Diagnosing hard to find diseases

02) Discovering criminal behaviour and detecting criminals in action

03) Recommending the right product to the right person

04) Filtering and classifying data from the internet at a big scale

05) Driving a car autonomously etc

* The mathematical and statistical basis of machine learning makes outputting such useful results possible

* One can use Math and Statistics over such accumulated data which could enable algorithms to understand anything with a numerical basis

* In order to begin the process of working with Data , one should represent the solution to the problem in the form of a number .

* For example, if one wants to diagnose a disease using a machine learning algorithm, one can make the response to the learning problem a 1 or a 0 (binary response), which tells whether the person is ill. A value of 1 would indicate that the person is ill, and a value of 0 that the person is not.

Alternatively , one can use a number between the values 0 and 1 to convey an

Thursday, April 1, 2021

Generic Summary Command for DataFrames / Matrices in R language

Generic Summary Command for Data Frames

* Below is a short guide to the results expected from the generic summary commands in R

* Descriptive summary commands that can be applied to dataframes and matrices, shown here on the 4 x 4 matrix mat01, are :

00) mat01

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

[4,] 13 14 15 16

01) max(mat01)

[1] 16

- The largest value in the entire dataframe

02) min(mat01)

[1] 1

- The smallest value in the entire dataframe

03) sum(mat01)

[1] 136

- The sum of all the values in the entire dataframe

04) fivenum(mat01)

[1] 1.0 4.5 8.5 12.5 16.0

- The five-number summary (minimum, lower hinge, median, upper hinge, maximum) of all the values can be found by passing the object to the "fivenum" command

05) length(mat01)

[1] 16

- For a matrix, the length command returns the total number of elements (16 here); applied to a dataframe, it returns the number of columns instead

06) summary(mat01)

V1 V2 V3 V4

Min. : 1 Min. : 2 Min. : 3 Min. : 4

1st Qu.: 4 1st Qu.: 5 1st Qu.: 6 1st Qu.: 7

Median : 7 Median : 8 Median : 9 Median :10

Mean : 7 Mean : 8 Mean : 9 Mean :10

3rd Qu.: 10 3rd Qu.:11 3rd Qu.:12 3rd Qu.:13

Max. : 13 Max. :14 Max. :15 Max. :16

- It provides the summary for each of the columns present within a dataframe

* The list of summary / descriptive summary commands that work on a whole dataframe is short

* One can always extract a single vector from a dataframe and perform a summary upon the data

* In general , it is better to use more specialised commands when dealing with the rows and columns of a dataframe
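The following sketch rebuilds the example object and extracts a single column for summarising. The construction matrix(1:16, nrow = 4, byrow = TRUE) is an assumption that reproduces the printed mat01; the name df01 is illustrative.

```r
# Rebuild the 4 x 4 example (assumed construction matching the printout)
mat01 <- matrix(1:16, nrow = 4, byrow = TRUE)

# The same values as a data frame; columns are named V1 to V4 by default
df01 <- as.data.frame(mat01)

# length() differs between the two kinds of object
print(length(mat01))   # total number of elements in the matrix
print(length(df01))    # number of columns in the data frame

# Extract one column as a vector and summarise it on its own
v1 <- df01$V1
print(summary(v1))
print(fivenum(mat01))
```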

Row & Column Summary Commands "rowMeans()" & "colMeans()" in R language over dataframe and Matrix objects

Special Row and Column Summary Commands

* Two summary commands used for row data are - rowMeans() and rowSums()

> rowMeans(mat01)

[1] 2.5 6.5 10.5 14.5

> rowSums(mat01)

[1] 10 26 42 58

* In the given example, mat01 has no row names, so the result appears as a simple vector of values

* If the rows had names, each value in the result would be labelled with its row name

> rowSums(mat01)

[1] 10 26 42 58

* The corresponding "colSums()" and "colMeans()" commands function in the same manner .

* In the following example, one can compare the "mean()" and "colMeans()" commands :

> colMeans(mat01)

[1]  7  8  9 10

> mean(mat01)

[1] 8.5

* The two commands answer different questions: "colMeans()" returns one mean per column, while "mean()" returns a single mean over all the values. Here the mean of the four column means (7, 8, 9, 10) equals the overall mean of 8.5

* These commands accept the "na.rm" instruction, which is set to FALSE by default

* If one wants to ensure that the "NA" items are removed from the dataframe then one can add "na.rm = TRUE " as an instruction in the command
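A small sketch of the effect of "na.rm" on the row and column commands. The matrix mat_na, with one missing value, is invented for illustration.

```r
# A 2 x 3 matrix with one missing value (illustrative data)
mat_na <- matrix(c(1, 2, NA,
                   4, 5, 6), nrow = 2, byrow = TRUE)

# Default (na.rm = FALSE): any row containing NA gives NA
means_default <- rowMeans(mat_na)

# na.rm = TRUE: the NA is dropped before the mean is taken
means_clean <- rowMeans(mat_na, na.rm = TRUE)

# The column commands accept the same instruction
sums_clean <- colSums(mat_na, na.rm = TRUE)

print(means_default)
print(means_clean)
print(sums_clean)
```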

"Apply()" Command for finding Summaries on Rows / Columns of a Matrix or Dataframe Object

"Apply()" Command for finding Summaries on Rows / Cols

=======================================================

* The "colMeans()" and "rowSums()" commands are designed as quick alternatives to a more generalised command, "apply()"

* The "apply()" command enables one to apply a function to rows or columns of a matrix or a dataframe

* The general form of the "apply()" command is given in the following manner :

apply(X, MARGIN, FUN, ...)

In this command , the applicable MARGIN within the parameter is either 1 or 2 where 1 is for the rows and 2 is for the columns applicable for the dataframe

* One can replace the "FUN" part within the parameter of the apply() function and one can also add additional instructions which might be appropriate to the command / function that one is applying

* Example :

One might add the parameter "na.rm = TRUE " as an instruction to the apply function .

> mat01

[,1] [,2] [,3] [,4]

[1,] 1 2 3 4

[2,] 5 6 7 8

[3,] 9 10 11 12

[4,] 13 14 15 16

> apply (mat01 , 1 , mean , na.rm = TRUE )

[1]  2.5  6.5 10.5 14.5

* If the object has row names, they are displayed as labels in the output .

* If the object has no row names set, as with mat01, then one will see the result as a plain vector of values .

> apply (mat01 , 1 , median , na.rm = TRUE )

[1]  2.5  6.5 10.5 14.5
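The sketch below extends the example to the column margin, to a user-written function, and to an object with row names. The construction of mat01 is an assumption matching the printed matrix, and the row names r1 to r4 are invented for illustration.

```r
# Rebuild the example matrix (assumed construction matching the printout)
mat01 <- matrix(1:16, nrow = 4, byrow = TRUE)

# MARGIN = 2 applies the function to each column
col_means <- apply(mat01, 2, mean)

# FUN can be any function, including one written on the spot
row_ranges <- apply(mat01, 1, function(x) max(x) - min(x))

# With row names set, the result carries them as labels
rownames(mat01) <- c("r1", "r2", "r3", "r4")
row_medians <- apply(mat01, 1, median)

print(col_means)
print(row_ranges)
print(row_medians)
```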

Wednesday, March 31, 2021

Selecting and Displaying Parts of a Vector in R Language

Selecting and Displaying Parts of a Vector

* Being able to select and display parts of a vector is one of the most important reasons for working with vectors .

* If one has a large sample of data, one may want to see which items are larger than a given value, which requires selecting those larger values from the dataset .

* In an alternative scenario , one may want to extract a series of values as a subsample for an analysis .

* Being able to select / extract required parts of a vector is one of the most important aspects of performing many more complicated operations in the R tool .

* The various examples or processes that one may come across while doing any type of selection of a chunk of data from a vector are in form of given scenarios :

·        extraction of the first item / single item from within a vector ;
·        selection of the third item ( nth item) from within a vector ;
·        selection of the first to the third items from a vector ;
·        selection and extraction of all items from a vector ;
·        selection of items from the combination vector ;
·        selection of all items which are greater than the value 3 (that means selection and extraction of given items with a value for the number greater than or lesser than some particular number) ;
·        Showing items which are either greater than or lesser than some set of numbers
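The scenarios listed above can be sketched in R as follows, using the data1 vector that appears later in this post. The particular index positions and thresholds are illustrative choices.

```r
data1 <- c(4, 6, 8, 6, 4, 3, 7, 9, 6, 7, 10)

first_item  <- data1[1]            # first item
third_item  <- data1[3]            # third (nth) item
first_three <- data1[1:3]          # first to third items
all_items   <- data1[]             # all items
combo_items <- data1[c(1, 3, 5)]   # items picked with a c() combination
above_three <- data1[data1 > 3]    # items greater than 3
extremes    <- data1[data1 < 5 | data1 > 7]  # less than 5 or greater than 7

print(first_three)
print(combo_items)
print(above_three)
print(extremes)
```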

The other useful commands for extracting parts of data include the length() command , which finds the length of a given vector .

· The length() command can also be used inside square brackets to extract segments of data :

data[(length(data) - 4):length(data)]

The above code returns the last five elements of the vector . The parentheses are needed because the ":" operator is evaluated before the subtraction .
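Operator precedence matters when indexing from the end of a vector: in R, ":" binds more tightly than binary minus, so the start index needs parentheses. A minimal sketch, with an illustrative vector named data and the base function tail() shown as an alternative:

```r
data <- c(4, 6, 8, 6, 4, 3, 7, 9, 6, 7, 10)
n <- length(data)

# Parenthesised start index: positions (n - 4) to n, i.e. the last five
last_five <- data[(n - 4):n]

# tail() gives the same result without index arithmetic
last_five_tail <- tail(data, 5)

# Without parentheses, n - 5:n means n - (5:n), a descending index
idx_unparenthesised <- n - 5:n

print(last_five)
print(idx_unparenthesised)
```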

· max() command can be used to get the largest value in the vector

==========================================================

> data1

[1] 4 6 8 6 4 3 7 9 6 7 10

> max( data1 )

[1] 10

> which( data1 == max(data1))

[1] 11

==========================================================

# The upper command -- "which" is showing the index number or position of the largest data from within the data vector . The maximum value of all the data elements present within the "data1" vector is 10 . The positional index value of the data from the vector is 11 .

· The first command .. max() provides the actual value which is the largest value within the vector and the second command asks which of the elements is the largest .

· Another useful command is one that generates sequences of numbers, which can then be used as index positions within a vector .. seq()

· While using the "sequence" command , one picks the beginning , the end and the interval of the positions . In other words , one may select the first , third , fifth and so on elements by giving the start , end and interval values as parameters .

· Therefore the general form of the "sequence" command can be written in the given form (in R the three arguments are named from , to and by) :

seq(start ,end ,interval)

· Selection by index position works on character vectors as well as numeric vectors , as in the following example with negative indices :

==========================================================

> data5

"Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

> data5[-1:-6]

"Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

=================================================

* In the above code , the negative indices -1:-6 drop the first six elements , so the last 6 strings , which are the three-letter abbreviations of the months of a calendar year , are returned as the result .
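A short sketch combining seq() with indexing on a character vector. It uses R's built-in month.abb constant for the month abbreviations; the variable names are illustrative.

```r
# Built-in constant holding "Jan" ... "Dec"
data5 <- month.abb

# Every second position from the first to the last: 1, 3, 5, ...
idx <- seq(1, length(data5), by = 2)
odd_months <- data5[idx]

# Negative indices drop elements; here the first six are removed
second_half <- data5[-(1:6)]

print(odd_months)
print(second_half)
```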