Lean Six Sigma Green Belt – Six Sigma Analyze Phase Part 8
- Hypothesis Testing_Chi Square Test
Chi-square test. What is the chi-square test? The chi-square test is a test of homogeneity, and we have understood that earlier, right? It is used to compare the homogeneity of proportions of several groups. And here are a few of the underlying assumptions. The sample must be randomly drawn from the population. Data must be reported as raw frequencies (counts), not percentages. Measured variables must be independent of each other.
The values, or categories, of the independent and dependent variables must be mutually exclusive and collectively exhaustive. What does that mean? First of all, try to understand the homogeneity part. For example, in a survey of TV viewing preferences, we might ask respondents to identify their favorite program, right? We might ask the same question of two different populations, such as males and females.
And we could use a chi-square test for homogeneity to determine whether male viewing preferences differ significantly from female viewing preferences or not. That is homogeneity. All right. What do you mean by mutually exclusive and collectively exhaustive? Think about this example: you are flipping a coin. If you get a head, you cannot get a tail, right? And vice versa: if I get a tail, I cannot get a head. Observing one outcome excludes the possibility of observing the other.
That is called mutually exclusive. If one event occurs, the other will not occur. And they are collectively exhaustive. What do you mean by collectively exhaustive? Think about this. You flip a coin, you can get a head or a tail. You do two flips of a coin. You can get a head and a head. Or you can get a tail and a tail. You can get a head the first time and a tail the next time.
Or a tail the first time and a head the second time. These are the only four possibilities if you are flipping a coin twice, right? All these four together are called collectively exhaustive. You cannot get a result apart from these. Of course, you should be flipping the coin on even ground, right? So that is mutually exclusive and collectively exhaustive. One more assumption: observed frequencies cannot be too small. You need to have bigger counts there.
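The two-flip example of mutually exclusive and collectively exhaustive outcomes can be sketched in a few lines of Python. This is only a minimal illustration and assumes a fair coin; the variable names are made up for the example.

```python
from itertools import product

# Every possible result of flipping a coin twice (H = head, T = tail).
outcomes = list(product("HT", repeat=2))
print(outcomes)

# Mutually exclusive: the outcomes are all distinct, so observing one of them
# rules out each of the others on that same pair of flips.
all_distinct = len(set(outcomes)) == len(outcomes)

# Collectively exhaustive: assuming a fair coin, each outcome has the same
# probability, and together the probabilities add up to 1, so nothing is left over.
probability_each = 1 / len(outcomes)
total_probability = probability_each * len(outcomes)
print(all_distinct, total_probability)
```

The same two checks apply to the categories in a chi-square table: every observation must land in exactly one cell, and the cells together must cover every possible answer.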
And when you write down your null and alternate hypotheses, this is how you write them down. The null hypothesis says all proportions are equal. The alternate hypothesis says not all proportions are equal. You once again look at your P value. If P is less than alpha, you reject the null hypothesis, right? Okay, let us look into a case study and try to solve this. Even before that, let us understand this. If my output is discrete, and if my input is discrete in more than two categories, I go with the chi-square test. Here is the case study.
Let us quickly understand that. Bomb and Tech Research company uses four regional centers in South Asia (India, China, Sri Lanka and Bangladesh) to input data from questionnaire responses. They audit a certain percentage of questionnaire responses against the data entry. Any error in data entry renders the entry defective. The chief data scientist wants you to check whether the defective percentage varies by country or not. Analyze the data at the 5% significance level and help the manager draw appropriate inferences.
Now, one in the data set means it is not defective. Zero means it is defective. Okay, first of all, tell me what is Y and what is X? Here, Y is the proportion of defectives. And what is your input here? I have four inputs: the four regional centers, India, China, Sri Lanka and Bangladesh. So both are discrete in nature, and X is discrete in more than two categories. Hence, I do a chi-square test. I need to stack the data even before that. Once I stack the data, here is the Minitab navigation. I simply do that and the magic box is going to give me this P value here. The P value is greater than 0.05, right?
It is 0.632, which is greater than 0.05. P high, null fly. So I fail to reject the null hypothesis. What does my null hypothesis say? Here it is. The null hypothesis says all proportions are equal. So what do you conclude? You say that the percentage of defectives does not vary by country. There is no significant difference in the percentage of defectives among the Indian, Chinese, Sri Lankan and Bangladeshi operations.
Okay, time to go back to the magic box and perform this exercise. Let me go to Bomb and Tech. Here is the data. I have already stacked it and put the details here. And if you say, no, you tell me how to stack the data, then let me remove that. There are a lot of entries, my friend. Let me delete all that. Okay? You simply need to go to Data, Stack, Columns, select India to Bangladesh, and put them here. I am going to store the stacked data in the same worksheet, and I select C7 and C8 for that. I simply click on OK, and I get the data, which is stacked. Now what do I do? Go to Stat, Tables, and click on Cross Tabulation and Chi-Square. The values are already selected here, right? Defective is rows.
All the zeros and ones are the rows, right? These are placed in the column called C7. For the columns, the labels India, China, Sri Lanka and Bangladesh are put here under country. So zero India, zero India, zero India, zero India, one India, in that way. And below this, you will also see zero China, zero China, zero China and all that. Below that you will see zero Sri Lanka and zero Bangladesh. Things like that. All right? Not just that. You need to click on Chi-Square and select the option called Chi-square test. I have done this exercise earlier and hence it is showing the chi-square test option as checked, right? But it would not be checked by default. You will have to select that. Simply click on OK. And then the magic box gives you the p value. Look at the Pearson chi-square. All right. We will not get into the details of what the Pearson chi-square is and what the likelihood ratio chi-square is, because this is not a statistics class, by the way. In Master Black Belt we are going to cover each and everything in detail. But this is Green Belt, right?
Okay. So let us go back to the presentation and continue further. We have done that chi-square. Here is another way of dealing with it, right? So I have a chi-square test for homogeneity. I have this case study which says sales of products in four different regions are tabulated for adults and children. So I have USA, UK, China and India.
Those are the four different regions. I have adults and children. Right? Find whether the male-female buyer ratios are similar across the different countries. For this we need to open the chi-square table. Let me open that in Minitab. So here is the case study for chi-square which we have seen. Let us perform this. You need to go to Stat, Tables, Cross Tabulation and Chi-Square. And this is the box which opens up. Now you need to select the rows. Rows are nothing but your... oh my God. Do you know what? You will have to go to Stat, Tables, Chi-Square Test for Association instead. This is what you do. You select all four locations here, starting from USA to India. You click on Select. Here are the rows, the observed values for adults and children. You click on OK. And here you get the p value.
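The stack, cross-tabulate and chi-square steps from the Minitab walkthrough can be sketched in plain Python. This is a rough sketch under stated assumptions, not the Minitab procedure itself: the audit counts below are hypothetical (they are not the course data set), and `chi2_sf` is a small helper written for this example to get the p value from the standard library alone. A statistics library such as SciPy offers an equivalent ready-made test.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: start from the df = 1 case (erfc) and add the correction terms.
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * extra


# Hypothetical audit results, one list of flags per centre, coded as in the
# lecture: 1 = not defective, 0 = defective.
unstacked = {
    "India": [0] * 30 + [1] * 170,
    "China": [0] * 25 + [1] * 175,
    "Sri Lanka": [0] * 28 + [1] * 172,
    "Bangladesh": [0] * 33 + [1] * 167,
}

# Stack the columns: one (flag, country) row per audited response.
stacked = [(flag, country) for country, flags in unstacked.items() for flag in flags]

# Cross tabulation: rows are the flag values, columns are the countries.
counts = Counter(stacked)
countries = list(unstacked)
table = [[counts[(flag, c)] for c in countries] for flag in (0, 1)]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(countries) - 1)
p_value = chi2_sf(chi_square, df)

alpha = 0.05  # 5% significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}: {decision}")
```

For a summary table such as the adults-and-children case, the stacking step can be skipped: assign the observed counts directly to `table` (one row per group, one column per region) and run the rest unchanged.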
- Regression Analysis_Part 1
What do we do if your variables Y and X are continuous? If both are continuous, then what is the test that you are going to perform? If both Y and X are continuous, we go with regression. Let us move on and understand this with an example. As a first step, if both variables are continuous, you need to do a scatter diagram. What are scatter diagrams? The diagrams which are represented here are called scatter diagrams. Scatter diagrams, or scatter plots, provide a graphical representation of the relationship between two continuous variables. Be careful, my dear friend. Correlation does not guarantee causation. Correlation by itself does not imply a cause-and-effect relationship. There is a good example to explain correlation and causation. Let me use that here. It seems that in the summer season ice cream sales increased. And if you look at ice cream sales versus the number of shark attacks, it seems that in the summer season the number of shark attacks was high and the sales of ice cream were also increasing.
So when people looked into the relationship between ice cream sales in the summer season and the number of shark attacks in the summer season, it seemed there was a strong relationship between the two. Just because there is a strong correlation does not mean there is causation. What do you mean by causation? Is eating ice cream resulting in shark attacks, or are shark attacks increasing the ice cream sales, right? There is no causation; basically, there is no cause-and-effect relationship between the number of ice creams consumed and the number of shark attacks. So correlation does not mean there is causation. That is the most important thing that you need to understand. And a scatter diagram is always done based on continuous variables.
So if I have two variables, X and Y, and if I plot them on a chart and see this kind of representation, wherein the data points are going to increase along this arrow, then you claim that it is a positive correlation. Look at that. Here also the variables are moving in the positive direction; they are increasing, right? If Y increases as X increases, you call it positive correlation. If Y decreases as X increases, then you call it negative correlation. See, here also it is decreasing, so it is called negative correlation. And here the plot points are scattered. There is no particular pattern to this. Hence you call this no correlation. Here the points are forming a straight line; here also it is following a straight line, more or less. Here also it is following a straight line, though it is decreasing. So if it follows a straight line, either in decreasing or increasing order, it is called a linear relationship. But sometimes there might be a relationship which is curvilinear. Look at this, it is like a U shape, right? It is like a parabola, basically.
So you have a curvilinear relationship here. And one more interesting thing here: the closer the points are to each other, the stronger is the relationship; the farther apart the points are, the weaker is the correlation. Here the points are not tightly coupled, right? In comparison, in this diagram the points are a little closer to each other. Hence you call this a strong negative correlation. This one, where the points are loosely coupled, you call a moderate negative correlation. So you can judge the strength of the relationship by looking at the width or tightness of the scatter. The tighter the scatter, that is, the less the width, the more is the strength of the correlation. Basically, a scatter diagram is used to determine the direction of the relationship. If X increases and Y decreases, then it is a negative correlation.
Similarly, if X increases and the corresponding Y also increases, then you call it a positive correlation. There you go: positive correlation, negative correlation, and you might encounter a situation where there is no correlation. Will I be able to make out the strength of the relationship easily using a visual representation? Not necessarily. Sometimes you might find it difficult to say whether it is a strong positive correlation or a moderate positive correlation. So what do I do then? This is what you do. On top of the scatter diagram, you perform correlation analysis. Correlation analysis measures the degree of linear relationship between two continuous variables, and the range of the correlation coefficient is minus one to plus one.
If I have a value of plus one, that means I have a perfect positive relationship. If the correlation analysis gives you a value of minus one, then you have a perfect negative correlation. And if you have the value as zero, there is no linear relationship. There might be a curvilinear relationship; we do not know that, because we are using correlation analysis to measure the degree of linear relationship. If the absolute value of the correlation coefficient is greater than 0.85, then we say there is a good relationship. Here are the examples. In this case you have a strong positive correlation because r is equal to 0.9. Here you have the r value as minus 0.87, which is a strong negative correlation and also describes a good relationship. Here we have another example. If the r value is 0.5, or minus 0.5 for a negative relationship, or if the r value is very low, say 0.28, it describes a poor relationship. The closer you are to zero, the weaker is the linear relationship; the closer you are to plus or minus one, the stronger is the correlation. So correlation values of minus one or plus one imply an exact linear relationship. However, the real value of correlation is in quantifying less-than-perfect relationships.
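The interpretation of r described above can be checked with a small Python sketch. The data points are hypothetical and chosen only to show a perfect positive line, a perfect negative line, and a U-shaped curve; `pearson_r` and `is_good` are helper names made up for this example, and the 0.85 cutoff is the rule of thumb quoted in the lecture.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient: degree of LINEAR relationship, from -1 to +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def is_good(r, cutoff=0.85):
    """Rule of thumb from the lecture: |r| above the cutoff is a good relationship."""
    return abs(r) > cutoff


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises with x on a straight line
y_down = [10, 8, 6, 4, 2]    # falls with x on a straight line
y_curve = [4, 1, 0, 1, 4]    # U shape: y = (x - 3) ** 2

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)
print(r_up, r_down, r_curve)
```

The U-shaped data is the case to watch: y depends completely on x, yet r comes out at zero, because r only measures the linear part of a relationship. That is why the scatter diagram comes first.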
Quantifying those less-than-perfect relationships is where you see the real value of the Pearson correlation coefficient, which is small r. We can then perform regression analysis. If you see that the scatter diagram is showing a good relationship between Y and X, and if you see that the Pearson correlation coefficient is good, then you can go ahead with regression analysis. Regression analysis attempts to further describe the type of relationship if the correlation is good between the two variables. We will understand this using an example, so do not press the panic button yet. Hold on, my dear friend. What is regression analysis? So here I have an equation for you.
Regression analysis plus correlation plus scatter plots: all these three together will help you predict future performance using past results. While correlation explains the degree of linear relationship that exists between two variables, regression defines the relationship more precisely, and we use it when there is existing data over a period of time. What does that mean? Let us look into a simple example here. Used car cost, right? Your scatter diagram or your correlation is going to say whether there is a relationship between used car cost and the miles driven, yes or no. And what kind of relationship is it? Is it strong positive or strong negative? Or is it just a positive or a negative correlation? That information you can find out using correlation and scatter plots. However, your regression analysis will say exactly what the value is. To what extent is miles driven defining your used car cost? Is 78% or 80% of it explained by miles driven? How much of the variation is explained? Right?
How much of the variation in used car cost is explained by miles driven? Regression analysis is a tool that uses data on relevant variables to develop a prediction equation, or a prediction model. It generates an equation to describe the statistical relationship between one or more predictors and the response variable. What is a predictor? Your X variables are called your predictors, and the response variable is your Y, in this case the used car cost. And then you will be able to predict new observations from then on. In simple linear regression, you have a single X variable, a single predictor, to define your Y output. This is how your simple regression equation looks: Y is equal to beta one plus beta two times X. If I have only one input, that is one X, it is called simple linear regression. Plus there would be an element of error, right? You cannot predict anything 100% accurately. And during our school days we might have learned about this equation: Y is equal to MX plus C, plus epsilon, the error.
M stands for the slope, C is a constant or the intercept, and epsilon is the error. If I have a graph with a Y axis and an X axis, and a line which extends in that way, this is called the Y intercept and this is the slope of the line, right? That is how you can imagine this particular equation. I am moving on with regression analysis. The R squared value is also called the coefficient of determination. It is not the correlation coefficient; it is the coefficient of determination. Let us relook into that. With a single predictor, it is the correlation coefficient r raised to the power two. It represents the percentage of variation in the output which is explained by the input variables, or the percentage of response variable variation that is explained by its relationship with one or more predictor variables. The higher the R squared value, the better the model fits your data. And the R squared value lies between 0 and 100%, or zero to one if you want to represent that in numbers. If the R squared value is between 0.65 and 0.8, you call it moderate. This is the thumb rule, my dear friends. If the R squared value is greater than 0.8, it is strong. That is about R squared, the coefficient of determination, not the correlation coefficient; my bad.
Let us understand the difference between the prediction interval and the confidence interval. These are the types of intervals used for predictions in regression and other linear models.
What is a prediction interval? A prediction interval represents a range that a single new observation is likely to fall in, given specified settings of the predictors. So, what is the range of the cost of a used car given the miles driven? That is the prediction interval. What is a confidence interval in that case? It represents a range that the mean response is likely to fall in, given specified settings of the predictors. Confidence interval: mean response. Prediction interval: single new observation. The prediction interval is always wider than the corresponding confidence interval because of the added uncertainty. If I have multiple observations and I take a mean or average of them, that would obviously have lesser width in comparison to a single observation. And that is what is mentioned here.
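The pieces discussed above (slope, intercept, R squared, confidence interval and prediction interval) can be put together in one short Python sketch. The used-car numbers are hypothetical and are not from the course; the critical value 2.306 is the two-sided 95% t value for 8 degrees of freedom taken from a standard t table and hard-coded here so the example needs only the standard library.

```python
import math

# Hypothetical used-car data: miles driven (thousands) and price (thousands).
miles = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
price = [19.2, 18.1, 16.9, 16.2, 14.8, 14.1, 12.7, 12.2, 10.9, 10.1]

n = len(miles)
mean_x = sum(miles) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in miles)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(miles, price))
syy = sum((y - mean_y) ** 2 for y in price)

# Least-squares fit of y = c + m * x + error.
slope = sxy / sxx                      # m
intercept = mean_y - slope * mean_x    # c

# Coefficient of determination: share of the variation in y explained by x.
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(miles, price))
r_squared = 1 - residual_ss / syy

# Residual standard error, with n - 2 degrees of freedom.
s = math.sqrt(residual_ss / (n - 2))

# Two-sided 95% t critical value for n - 2 = 8 degrees of freedom (from a t table).
T_CRIT_95_DF8 = 2.306

x0 = 45  # predictor setting of interest: a car driven 45 thousand miles
fit = intercept + slope * x0
leverage = 1 / n + (x0 - mean_x) ** 2 / sxx

# Confidence interval: where the MEAN price at x0 is likely to fall.
ci_half = T_CRIT_95_DF8 * s * math.sqrt(leverage)
# Prediction interval: where a SINGLE new car at x0 is likely to fall.
pi_half = T_CRIT_95_DF8 * s * math.sqrt(1 + leverage)

print(f"price = {intercept:.2f} + ({slope:.4f}) * miles, R squared = {r_squared:.3f}")
print(f"95% CI at x0: {fit - ci_half:.2f} to {fit + ci_half:.2f}")
print(f"95% PI at x0: {fit - pi_half:.2f} to {fit + pi_half:.2f}")
```

The extra `1 +` under the square root is the added uncertainty of a single observation, which is why the prediction interval always comes out wider than the confidence interval at the same setting.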