V34 Hypothesis Testing

Welcome to Part 10 of our video series in support of hypothesis testing. In this video, we are going to cover a contingency table which is used in a test of independence, and we're going to discuss the marginal frequencies or totals within a contingency table as well as the cell probabilities of the contingency table. I'm Renee Clark from the Swanson School of Engineering at the University of Pittsburgh.

Okay, so, what is a contingency table? An example of a contingency table is shown here on the right. Okay, in general, a contingency table is a matrix that cross-classifies observations or data, and that's what's shown in the middle here, based on categorical variables. Okay, the categorical variables are indicated along the left-hand side and across the top. Now, as you recall, categorical variables, which have categories, are qualitative variables, okay, as opposed to quantitative variables. Okay, in this particular contingency table, the two categorical variables are Hawaiian island and sweetness rating.

Okay, so, Hawaiian island has three categories, at least with this data set: Maui, Kauai, and Oahu. Sweetness rating has two categories in this example, and those two categories are high sweetness rating and super high sweetness rating. Okay, so, for example, based on this contingency table, there are 203 high sweetness pineapples from Oahu, okay, because the count of 203 occurs at the intersection of the high and Oahu categories. Okay, this particular sample of data has 1,000 total pineapples represented there. Okay, in general, a contingency table has R rows and C columns, and you'll often see it referred to as an R by C table. Okay, so, in this example, there are two rows and three columns. Okay, so, this is a 2x3 contingency table. Okay, now, in a contingency table, we have the row and the column totals, which I'm going to circle here: these are the column totals, these are the row totals. Okay, those totals are called marginal totals or marginal frequencies (either one).

Okay, so, to get the marginal probability of a pineapple being from Maui with this data set, we would take the marginal total associated with Maui, 336, and divide it by the total number of pineapples in the data set, 1,000. Okay, so that would be the marginal probability associated with a pineapple being from Maui in this example. So, likewise, the marginal probability of having a super high sweetness level would be 402 over 1,000. Okay, we could label that as the probability of S: the probability of being a super high pineapple is 402 out of 1,000. Again, this is the marginal probability for the super high sweetness level. Okay, so, these concepts, and certainly contingency tables in general, are used in tests of independence, which are another type of hypothesis test that we will be learning about.
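The marginal probabilities just described can be sketched in a few lines of Python. This is only an illustration, not part of the lecture: the names `island_totals`, `sweetness_totals`, and `marginal_probabilities` are made up here, and the Oahu total is not quoted in the lecture, so it is inferred as the remainder of the 1,000 pineapples.

```python
# Marginal totals quoted in the lecture for the sample of 1,000 pineapples.
N = 1000
island_totals = {"Maui": 336, "Kauai": 351}
# Assumption: Oahu's total is not quoted, so infer it as the remainder.
island_totals["Oahu"] = N - sum(island_totals.values())
sweetness_totals = {"high": 598, "super high": 402}


def marginal_probabilities(totals, n):
    """Divide each marginal total by the overall sample size."""
    return {category: count / n for category, count in totals.items()}


p_island = marginal_probabilities(island_totals, N)
p_sweet = marginal_probabilities(sweetness_totals, N)

print(p_island["Maui"])        # 336 / 1000
print(p_sweet["super high"])   # 402 / 1000
```

Within each variable the marginal probabilities add up to 1, which is a quick check that the totals were read off the table correctly.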

Okay, so, let's discuss cell probabilities associated with our contingency table. In order to do that, I'm going to have you recall the following concept or theorem that you learned about in your probability studies. Okay, and that is the following. If you've got two independent events, okay, the probability that they both occur, which you could write as the probability of A and B, or the probability of A intersect B, is actually equal to the product of the individual probabilities. So, the probability on the left equals the probability of A times the probability of B, assuming that A and B are independent events. Now that's theorem 2.11 in your book. Okay, now, if we're running a test of independence, then what we do is we null hypothesize that our categorical variables, so in our example, Hawaiian island (or Hawaiian origin) and sweetness level (or sweetness rating), are independent variables, okay, in the same sense that we're talking about right here. Okay, so, assuming, as we will, that these two variables in our contingency table are independent, then, if we want to find the probability that two events both occur, we proceed as follows.

For example, let's take number one. If we want to determine the probability that a pineapple is from Maui and it has a high sweetness level, okay, since we're assuming those two variables to be independent, that's equal to the product of the probability of being from Maui and the probability of having a high sweetness level. Okay, in order to calculate this, we're going to use the marginal frequencies. Okay, so, the probability of being from Maui is equal to its marginal total or frequency of 336 over 1,000 total data points. Okay, for the probability of being of a high sweetness rating, we're going to use its marginal total of 598 over 1,000 total observations. Okay, so, recapping then, the probability that a pineapple is from Maui and has a high sweetness rating is equal to 336 over 1,000 times 598 over 1,000. Okay, so, we are actually multiplying two marginal probabilities here.
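As a quick check of the arithmetic in this first example, here is a minimal Python sketch that multiplies the two marginal probabilities exactly, using the standard library's `Fraction` type. The variable names are illustrative only, and the result holds under the assumed independence of island and sweetness rating.

```python
from fractions import Fraction

# Marginal probabilities from the lecture's marginal totals.
p_maui = Fraction(336, 1000)   # P(M): from Maui
p_high = Fraction(598, 1000)   # P(H): high sweetness rating

# Under the null hypothesis of independence, P(M and H) = P(M) * P(H).
p_maui_and_high = p_maui * p_high

print(p_maui_and_high)         # exact fraction in lowest terms
print(float(p_maui_and_high))  # decimal value, approximately 0.2009
```

Keeping the product as a fraction avoids rounding until the very end, which is handy when checking hand calculations.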

Okay, let's try another one. What's the probability that a pineapple is from Kauai and it has a super high sweetness rating? Okay, so, we're going to use the marginal probabilities for that, assuming the events are independent. The probability of K and S is the probability of K times the probability of S. Okay, what's the probability of being from Kauai? We're going to use its marginal total of 351 over 1,000 total observations. Okay, what's the probability of having a super high sweetness level? 402, its marginal total, over 1,000 total observations. Okay, so, then this probability is the product 351/1000 * 402/1000. Okay, so, what we have here are six different probabilities. These are called cell probabilities because they are associated with each of the six cells of the contingency table. Okay, so, a cell occurs at the intersection of a particular value of your column variable, in this case Hawaiian island, and a particular value of your row variable, in this case sweetness rating, high or super high. These cell probabilities, okay, we obtain by taking products of the marginal probabilities of these independent variables, and we will be using them in our test of independence.
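The six cell probabilities can be sketched the same way for the whole 2x3 table. This is an illustrative sketch, not the lecture's own code: the function name `cell_probabilities` is made up here, and the Oahu total of 313 is an assumption inferred as 1,000 minus the Maui and Kauai totals, since the lecture does not quote it.

```python
from itertools import product

N = 1000
# Row variable: sweetness rating. Column variable: Hawaiian island.
sweetness_totals = {"high": 598, "super high": 402}
# Assumption: Oahu's total is inferred as 1000 - 336 - 351 = 313.
island_totals = {"Maui": 336, "Kauai": 351, "Oahu": 313}


def cell_probabilities(row_totals, col_totals, n):
    """Cell probability under independence: (row total / n) * (column total / n)."""
    return {
        (row, col): (row_total / n) * (col_total / n)
        for (row, row_total), (col, col_total)
        in product(row_totals.items(), col_totals.items())
    }


cells = cell_probabilities(sweetness_totals, island_totals, N)
for (sweetness, island), p in cells.items():
    print(f"P({island} and {sweetness}) = {p:.6f}")
```

Because every cell is a product of one row marginal and one column marginal, the six cell probabilities sum to 1, just as the marginal probabilities do.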

We wish to thank the National Science Foundation under Grant 233582 for supporting our work. Thank you for watching.