Data Science and AI Quest

Tuesday, March 2, 2021

Article on - Data Definition Feature of SQL and its specifications

· The set of relations in a database must be specified to the system by means of a data definition language (DDL). In other words, the structure and schema of a database and of each relation are defined using a standard language, the Structured Query Language (SQL).

· The SQL DDL allows the specification not only of a set of relations but also of information about each relation, including the following:

1) The schema for each relation in the database

This means that SQL can be used to create the schema definition of each relation in the database being worked on. For example, suppose a database management system holds the following related tables:

branch (branch_name,branch_city,assets)

customer(customer_name,customer_street,customer_city)

Here the supposed database consists of two tables, branch and customer, whose attributes are listed in the parentheses. These two relation definitions can be created in practice with the SQL create table command, which takes the attribute names, together with their types, as the parameters of the schema definition.
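As a minimal sketch of the create table step, the two relations can be defined from Python using the built-in sqlite3 module and an in-memory database. The column types and lengths (VARCHAR(30), NUMERIC(16, 2)) and the helper column_names are assumptions made for illustration; the article only gives the attribute names.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Column types and lengths are assumed for illustration only.
conn.executescript("""
    CREATE TABLE branch (
        branch_name  VARCHAR(30),
        branch_city  VARCHAR(30),
        assets       NUMERIC(16, 2)
    );
    CREATE TABLE customer (
        customer_name    VARCHAR(30),
        customer_street  VARCHAR(30),
        customer_city    VARCHAR(30)
    );
""")

def column_names(table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

branch_columns = column_names("branch")
customer_columns = column_names("customer")
print(branch_columns)
print(customer_columns)
```

Reading the schema back with PRAGMA table_info is a quick way to confirm that the definition stored by the system matches the intended relation schema.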

2) The domain of values associated with each attribute

The domain types that SQL provides, such as char(n), varchar(n), int and numeric(p,d), support the purpose of data definition within a relation (the original post illustrated them in a figure). Whatever data is defined and stored in the columns/attributes of a relational table must adhere to these domain types; values that do not match the declared datatype and format are not accepted.

3) The Integrity Constraints

The purpose of defining integrity constraints on a database is to make sure that its attributes adhere to the rules declared for them. For example, wherever not null is declared, the attribute cannot hold a null value; wherever primary key is declared, the attribute cannot hold duplicate (non-unique) values, whereas columns/attributes not declared as a primary key (or otherwise unique) may contain repeated values; and so on.
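The effect of the not null and primary key constraints can be sketched with Python's built-in sqlite3 module. The sample rows, the column types and the helper try_insert are hypothetical and exist only to trigger each constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE branch (
        branch_name  VARCHAR(30) NOT NULL PRIMARY KEY,
        branch_city  VARCHAR(30) NOT NULL,
        assets       NUMERIC(16, 2)
    )
""")
# Hypothetical starting row.
conn.execute("INSERT INTO branch VALUES ('Downtown', 'Brooklyn', 900000)")

def try_insert(row):
    """Return 'ok' if the row is accepted, otherwise the constraint error text."""
    try:
        conn.execute("INSERT INTO branch VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)

duplicate_key = try_insert(("Downtown", "Queens", 100))  # repeats the primary key
missing_city = try_insert(("Uptown", None, 100))         # NULL in a NOT NULL column
valid_row = try_insert(("Uptown", "Queens", 100))        # satisfies both constraints

row_count = conn.execute("SELECT COUNT(*) FROM branch").fetchone()[0]
print(duplicate_key, missing_city, valid_row, row_count, sep="\n")
```

Only the rows that satisfy every declared constraint end up in the table; the two offending rows are refused by the database itself rather than by application code.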

4) The set of indices to be maintained for each relation

Relations need to be maintained and described in a way that supports faster searching, easier access and book-keeping of the tables in the database; indexing is therefore one of the key considerations when defining data in a database.
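A small sketch of declaring an index, again using Python's sqlite3 module. The index name idx_customer_city, the column types and the city value in the query are assumptions for illustration; EXPLAIN QUERY PLAN is SQLite's own way of reporting whether a query would use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_name    VARCHAR(30),
        customer_street  VARCHAR(30),
        customer_city    VARCHAR(30)
    )
""")
# Hypothetical index name; the index supports lookups by city.
conn.execute("CREATE INDEX idx_customer_city ON customer (customer_city)")

# sqlite_master is SQLite's catalogue of tables and indices.
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'customer'"
    )
]

# EXPLAIN QUERY PLAN reports how SQLite intends to run the query;
# the last field of each row is a human-readable description.
plan = [
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT customer_name FROM customer WHERE customer_city = ?",
        ("Harrison",),
    )
]
print(index_names)
print(plan)
```

The index is part of the data definition: it is declared once and then maintained by the system on every insert, update and delete.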

5) The security and authorization information for each relation

This aspect of data definition is essential for maintaining the security of the database and of all the relations in it; otherwise unauthorised users could access the data and carry out unregulated transactions or tampering, defeating the purpose of the database.

6) The physical storage structure of each relation on a disk

Sometimes, when a relational table and its associated tables are created, a physical storage limit is specified so that storage problems, such as running out of space, can be handled in advance. Limits on data size are also set when defining the attributes in the table creation commands, so that oversized values are caught at definition and storage time rather than causing exceptions later.

Monday, March 1, 2021

Introduction to Hypothesis Testing In Statistics with sample problems and explanations


* This article breaks the concept of Hypothesis Testing down into smaller chunks. By the end, the reader should have a good grasp of the concept and of how and when the different kinds of test can be put to use.

* The best way to use the article is to go through it segment by segment, taking note of each concept as it is discussed.

* One of the main reasons many students find Hypothesis Testing difficult is that its case studies look very different from the other topics in Statistics. The individual tests also differ a good deal from one another, and students often struggle to tell them apart: some cases involve small samples, others large samples, and each test type uses its own set of parameters.

* Each test also relies on a different distribution and a different technique for arriving at its result.

* So we will begin by looking at what a hypothesis is.

A hypothesis is a premise or claim that we want to test. Testing it does not mean going to a laboratory; rather, one formulates a statistical sample survey, and the task of the statistician or analyst is to collect the data from the sample and draw meaningful conclusions from it by testing the hypothesis.

* Next let us get to know what the Null Hypothesis is. But first let’s understand what "Null" means here. Null stands for something which is zero or empty, a term widely used in computer science and programming literature.

* So, when someone talks about the Null Hypothesis, they are talking about the default hypothesis, the one that is already established. The Null Hypothesis is generally denoted by a capital H with a small zero as the subscript (H0). It states the currently accepted value for a parameter.

* Whenever someone wants to challenge the Null and examine the "what else" side of the question, they are interested in the Alternate or Alternative Hypothesis. It is represented by the symbol H1 or Ha, where the subscript a stands for "alternative". In some statistics books the alternative hypothesis is also called the Research Hypothesis, and it carries the claim to be tested.

* Let's take the principle of gravity into consideration. Centuries ago, as the story goes, Newton grasped the essence of gravity when an apple fell on his head; he did some research and gave the postulates supporting his hypothesis on gravity. Later Einstein came into the picture and gave another hypothesis, an alternative to the view that Newton had provided. His view of gravity was quite different from Newton's and more complicated to generalise, yet this alternative hypothesis was eventually accepted and gained stature in the science journals of that time.

========================================================

* Now let's take a question to get a better perspective on the idea of Hypothesis Testing.

It is believed that a candy machine makes chocolate bars that weigh 5 g on average. A worker claims that after maintenance the machine no longer produces 5 g bars. Formulate H0, the Null Hypothesis, and H1, the Alternative Hypothesis.

H0 = Null Hypothesis

H1 = Alternative Hypothesis

H0 : the mean weight of the chocolate bars is 5 grams (the established claim)

H1 : the mean weight of the chocolate bars is not equal to 5 grams (the worker's claim)

From these two statements one can see that the null hypothesis and the alternative hypothesis are mathematical opposites of each other. This holds in every hypothesis test: the null hypothesis and the alternative hypothesis are always mathematically opposite to each other.

* We now test these two statements against the data, starting from the assumption that the null hypothesis is true.

========================================================

* So what are the possible outcomes of a test of the Null Hypothesis?

1) Reject the Null Hypothesis

2) Fail to reject the Null Hypothesis

* If we reject the null hypothesis, we are saying that the data give enough evidence against the claim it makes: the mean weight of the chocolate bars is concluded to be different from 5 grams.

* If we fail to reject the null hypothesis (which does not strictly mean that we accept it), we are saying that the data do not give enough evidence to support the Alternative Hypothesis, so the claim that the mean weight is 5 grams stands.

* The possible outcomes resemble a verdict in a court of law: one either rejects the null hypothesis (H0), in favour of the alternative (H1), or fails to reject the null hypothesis.

* Having set out the possible outcomes of the test, one next calculates the Test Statistic, as follows.

* Test Statistic :

The test statistic is calculated from sample data and is used to decide whether to reject the null hypothesis or fail to reject it. In the case of the candy-bar factory, for example, one might sample 50 chocolate bars, carry out a descriptive statistical analysis of the data, obtain the average weight of the bars in the sample, and from it compute the value of the test statistic.

One can then determine whether the result is statistically significant, that is, arrive at a decision whether to reject the null hypothesis or fail to reject it.

Suppose one person draws a sample of 50 bars on Monday and finds an average of 5.12 grams, which is not exactly 5 grams; another person draws a sample of 50 bars and finds an average of about 5.72 grams; and on Friday the average is found to be 6.53 grams. All three values differ from each other and from the null hypothesis value of an average weight of 5 grams per bar.

Comparing the results: Monday's 5.12 grams is quite close to 5 grams, 5.72 grams is somewhat further away, and 6.53 grams is far from the hypothesised mean. Informally, the further the sample mean lies from 5 grams, the stronger the case for rejecting the null hypothesis; whether a given gap is large enough depends also on the variability of the weights and on the sample size, which is exactly what the test statistic takes into account.

So, in statistical terms, by comparing the hypothesised value with the observed values, an analyst should be able to make a concrete decision about when to reject the null hypothesis and when not to.

This, in general, is the purpose of a hypothesis test: collect the data, summarise it, and obtain a test statistic that supports the decision. One needs to fix in advance the boundaries beyond which an observed value counts as too high or too low, and then reject or not reject the null hypothesis according to where the statistic falls.
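To make the idea of a test statistic concrete, here is a minimal sketch of a two-sided, large-sample z-test for a mean, written with Python's standard library. The function name z_test_mean and, importantly, the sample standard deviation of 0.5 g are assumptions; the article gives only the sample size (50) and the sample means. With 50 bars the normal approximation is used here in place of a t-distribution.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(sample_mean, sample_sd, n, mu0):
    """Two-sided large-sample z-test for a mean; returns (z, p_value)."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a value at least this extreme in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Monday's sample from the text: 50 bars averaging 5.12 g.
# The standard deviation of 0.5 g is an assumed value, not from the article.
z, p_value = z_test_mean(sample_mean=5.12, sample_sd=0.5, n=50, mu0=5.0)
print(round(z, 2), round(p_value, 3))
```

Under the assumed spread, Monday's gap of 0.12 g gives a p-value of roughly 0.09, which is above 0.05, so the null hypothesis would not be rejected on that sample alone; a sample mean further from 5 g, or a smaller spread, would push the statistic further out and the p-value lower.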

========================================================

Level of Confidence

Level of confidence is closely tied to the level of significance: together they determine, on the graph of the sampling distribution, the region where the null hypothesis is rejected and the region where it is not. The level of confidence, expressed as a percentage, indicates how confident one is in the decision.

For example, if the null hypothesis is rejected at a confidence level of 99%, most people would accept the decision; if the confidence level is only around 50%, rejecting the null hypothesis is not a sound decision in the circumstances, because that level of confidence is far too low.

========================================================

Level of Significance

This is called alpha and is expressed numerically as (1 - C), that is, the level of significance equals 1 minus the level of confidence.

Mathematically, if the level of confidence is 95%, the level of significance is alpha = 1 - 0.95 = 0.05.
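The relation alpha = 1 - C can be sketched in a few lines of Python. The two function names are hypothetical, and the step from alpha to a two-sided critical value on the standard normal curve is an added illustration of where the rejection boundary falls, not something the article computes.

```python
from statistics import NormalDist

def significance_level(confidence):
    """alpha = 1 - C, with C given as a fraction such as 0.95."""
    return 1 - confidence

def two_sided_z_critical(alpha):
    """z value that leaves alpha/2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

alpha = significance_level(0.95)
z_crit = two_sided_z_critical(alpha)
print(round(alpha, 2), round(z_crit, 2))
```

For a 95% level of confidence this gives alpha = 0.05 and boundaries at about plus or minus 1.96 for a two-sided z-test.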

========================================================

Using the "level of confidence" and the "level of significance", one can make an appropriate decision whether or not to reject the formulated hypothesis.

The test of hypothesis is analogous to a criminal trial. When someone is accused of a crime, the first assumption is made in favour of the accused: the accused is innocent. It is up to the lawyers and the evidence to prove that the accused is guilty; if they fail to prove this alternative assumption, the assumption that the accused is innocent stands.

The same holds for our example. We assumed that the mean weight of a chocolate bar is 5 grams, but also formulated the alternative assumption that the mean weight is different from 5 grams. To test our assumption we ran an experiment, obtained sample values that differed from 5 grams, and after analysing the results came to a conclusion whether to reject the null hypothesis or not.

Saturday, February 27, 2021

Chi-Square Tests of Independence testing using R ( Chi - Square Tests for two-way table )


===========================

To run and analyse the Chi-square test in the R language, we follow these steps:

1) Step no. 1 - We create a vector in the following format

data <- c(40,25,19,37,39)

This "data" vector holds the number of students registered for the different classes in a school. There are 40 students in class 1, 25 students in class 2, 19 students in class 3, 37 students in class 4 and 39 students in class 5.

2) Step no. 2 - In R we evaluate whether the class sizes held in the "data" vector are consistent with equal proportions; this equality condition is the Null Hypothesis of the test. (Given a single vector, chisq.test performs a goodness-of-fit test against equal proportions; a test of independence would need a two-way table.)

The Null Hypothesis and the Alternative Hypothesis can be stated as follows, with the Null Hypothesis represented by the symbol (H0) and the Alternative Hypothesis by the symbol (H1).

# H0 is the null hypothesis

H0 : p1 = p2 = p3 = p4 = p5

The alternative hypothesis can be stated as follows:

# H1 is the alternative hypothesis

H1 : at least one of p1, p2, p3, p4, p5 differs from the others

3) In the 3rd step, we run the chi-square test on the data by calling the appropriate function:

> chisq.test(data)

On the console, one can see that the data entered at the prompt has been registered as an object in memory. When the Chi-squared test is run on this object, the result is reported in the following format:

X-squared = 11.125, df = 4, p-value = 0.02519

# From the chi-square table, the critical value (sometimes abbreviated T-crit) for 4 degrees of freedom at a significance level of alpha = 0.05 is 9.488.

The test statistic of 11.125 exceeds this critical value. Equivalently, the p-value of 0.02519 is less than the alpha of 0.05 set for the test, so we can reject the null hypothesis.

But had the alpha level been set at 0.025, we would not have been in a position to reject the null hypothesis, since 0.02519 is slightly greater than 0.025.

Conclusion and Verification

To check this conclusion, we run the chi-square test again on another set of data in the following format.

data1 <- c(35,31,38,27,29)

Running the chi-square test on the numbers in the "data1" vector gives the following results.

> chisq.test(data1)

The X-squared statistic is 2.5, the degrees of freedom are 4, and the p-value is 0.644. Since the p-value is greater than the alpha value of 0.05 (equivalently, the statistic of 2.5 is below the critical value of 9.488), the Null Hypothesis for the values in the vector "data1" cannot be rejected.

Drawing a general conclusion from the experiment: if the p-value is greater than the alpha value, the null hypothesis cannot be rejected.
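As a cross-check of the two R results, the same test can be sketched in plain Python. The function chisq_gof_equal is a hypothetical helper written for this note, not a port of R's chisq.test: it tests against equal proportions only, and its closed-form p-value is valid only when the degrees of freedom are even (as they are here, with df = 4).

```python
from math import exp, factorial

def chisq_gof_equal(counts):
    """Chi-square goodness-of-fit test against equal proportions.

    Returns (statistic, degrees_of_freedom, p_value). The closed-form
    p-value below is valid only when the degrees of freedom are even.
    """
    n_groups = len(counts)
    expected = sum(counts) / n_groups  # same expected count in every group
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    df = n_groups - 1
    if df % 2 != 0:
        raise ValueError("closed-form p-value implemented for even df only")
    half = statistic / 2
    # Upper-tail probability of the chi-square distribution for even df.
    p_value = exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))
    return statistic, df, p_value

stat, df, p = chisq_gof_equal([40, 25, 19, 37, 39])
stat1, df1, p1 = chisq_gof_equal([35, 31, 38, 27, 29])
print(round(stat, 3), df, round(p, 5))
print(round(stat1, 3), df1, round(p1, 3))
```

Both vectors sum to 160, so the expected count is 32 per class in each case; the first vector strays much further from 32 than the second, which is why only the first leads to rejection at alpha = 0.05.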

Friday, February 26, 2021

Keys of a Relational Database System ( detailed description with example on Primary Key Foreign Key , Candidate Key , Super Key )

· One must have a way to specify how the tuples within a given relation are distinguished from one another. This is usually done in terms of the attributes of the relation: the attribute values of a tuple must be such that they uniquely identify the tuple.

· In other words, no two tuples within a relation are allowed to have exactly the same value for all the attributes.

· For easier and more effective identification of a unique tuple or row, the concept of a "superkey" was introduced.

· A superkey is a set of one or more attributes that, taken collectively, allows a tuple within the relation to be identified uniquely.

· Example :

The "customer_id" attribute of the relation "customer" is sufficient to distinguish one customer tuple from another. Therefore, "customer_id" is a superkey of the relation. Similarly, the combination of the attributes "customer_name" and "customer_id" is a superkey for the relation "customer". The "customer_name" attribute alone is not a superkey, because several people might have the same name.

· The concept of a superkey alone is not sufficient, since a superkey may contain extraneous attributes.

· One is often interested in superkeys for which no proper subset is a superkey; such minimal superkeys are called "candidate keys".

· It is possible for several distinct sets of attributes to serve as a candidate key for a relation. Suppose a combination of "customer_name" and "customer_street" is sufficient to distinguish among members of the "customer" relation; then both {"customer_id"} and {"customer_name", "customer_street"} are candidate keys.

· Although the attributes "customer_id" and "customer_name" together can distinguish "customer" tuples, the combination does not form a candidate key, since the attribute "customer_id" alone is a candidate key.

· The term "primary key" denotes a candidate key that is chosen by the database designer as the principal means of identifying the tuples within a relation.

· A key (whether primary, candidate or super) is a property of the entire relation rather than of individual tuples. Any two individual tuples in the relation are prohibited from having the same value on the key attributes at the same time. The designation of a key represents a constraint in the real-world enterprise being modelled.

· Candidate keys must be chosen with care. As noted, a person's name is not sufficient, since several people may have the same name, and a lookup by name could then return more than one tuple. Such duplicate values are not permitted for a candidate key.

· The primary key should be chosen so that its attribute values are never, or very rarely, changed. For example, the address field of a person should not be part of the primary key, since it is likely to change whenever the person moves home. A Social Security number, on the other hand, is normally assigned once and stays the same. In the same way, unique identifiers generated by enterprises for each transaction are not likely to change, and such a field could be considered as a primary key for a transaction relation.

· Formally, let R be a relation schema. If we say that a subset K of R is a superkey for R, we are restricting consideration to relations r(R) in which no two distinct tuples have the same values on all the attributes in K. That is, if t1 and t2 are in r and t1 != t2, then t1[K] != t2[K].
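The condition t1[K] != t2[K] and the minimality requirement for candidate keys can be sketched as two small Python checks. The function names and the three customer tuples are hypothetical, invented for illustration. Note also that a key is a property of every legal instance of a relation, so passing this check on one sample instance is only evidence, not proof.

```python
from itertools import combinations

def is_superkey(rows, attrs):
    """True if no two rows of this instance agree on every attribute in attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)

def is_candidate_key(rows, attrs):
    """True if attrs is a superkey and no proper non-empty subset of it is one."""
    if not is_superkey(rows, attrs):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )

# Hypothetical customer tuples, invented for illustration.
customers = [
    {"customer_id": 1, "customer_name": "Hayes", "customer_street": "Main"},
    {"customer_id": 2, "customer_name": "Hayes", "customer_street": "North"},
    {"customer_id": 3, "customer_name": "Jones", "customer_street": "Main"},
]

id_is_superkey = is_superkey(customers, ["customer_id"])
name_is_superkey = is_superkey(customers, ["customer_name"])
id_name_is_candidate = is_candidate_key(customers, ["customer_id", "customer_name"])
name_street_is_candidate = is_candidate_key(
    customers, ["customer_name", "customer_street"]
)
print(id_is_superkey, name_is_superkey, id_name_is_candidate, name_street_is_candidate)
```

On this sample, "customer_id" is a superkey, "customer_name" is not, {"customer_id", "customer_name"} is a superkey but not a candidate key, and {"customer_name", "customer_street"} is a candidate key, matching the cases discussed in the text.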

· A relation schema, say r1, may include among its attributes the primary key of another relation schema, say r2. This attribute is called a foreign key from r1 referencing r2. The relation r1 is called the "referencing relation" of the foreign key dependency, and r2 is called the "referenced relation" of the foreign key.

· For example, the attribute "branch_name" in the Account schema is a foreign key from Account referencing Branch, since branch_name is the primary key of the Branch schema. In any database instance, given any tuple, say "ta", from the Account relation, there must be some tuple, say "tb", in the Branch relation such that the value of the branch_name attribute of ta is the same as the value of the primary key, branch_name, of tb.
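The foreign key dependency from Account to Branch can be sketched with Python's sqlite3 module. The Account attributes other than branch_name (account_number, balance), the column types and the sample rows are assumptions for illustration. In SQLite, foreign keys are enforced only after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE branch (
        branch_name  VARCHAR(30) PRIMARY KEY,
        branch_city  VARCHAR(30),
        assets       NUMERIC(16, 2)
    );
    CREATE TABLE account (
        account_number  VARCHAR(10) PRIMARY KEY,
        branch_name     VARCHAR(30) REFERENCES branch (branch_name),
        balance         NUMERIC(12, 2)
    );
""")

# Hypothetical rows: one branch, and one account that refers to it.
conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")
conn.execute("INSERT INTO account VALUES ('A-102', 'Perryridge', 400)")

try:
    # No branch tuple has branch_name = 'Nowhere', so this must be refused.
    conn.execute("INSERT INTO account VALUES ('A-999', 'Nowhere', 50)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

account_count = conn.execute("SELECT COUNT(*) FROM account").fetchone()[0]
print(orphan_rejected, account_count)
```

Here account is the referencing relation and branch is the referenced relation: every account tuple must point at an existing branch tuple, and the database rejects any that do not.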

· It is customary to list the primary key attributes of a relation schema before the other attributes; for example, the attribute branch_name of the Branch schema is listed first, since it is the primary key.

· A database schema, along with its primary key and foreign key dependencies, can be depicted pictorially by a schema diagram. In such a diagram for the banking enterprise, each relation appears as a box with its attributes listed inside and the name of the relation written above. If there are primary key attributes, a horizontal line crosses the box, with the primary key attributes listed above the line in grey. Foreign key dependencies appear as arrows from the foreign key attributes of the referencing relation to the primary key of the referenced relation.

Many database systems provide design tools with a graphical user interface for the creation of schema diagrams.
