V23 Estimation

Welcome to part six of our video series on estimation. In this video, we are going to discuss binary data and the mode as a descriptive statistic. We're going to talk about proportions. A proportion is a parameter, of course, which has a point estimate. We're going to talk about what a Bernoulli trial is, and finally we're going to talk about the binomial distribution (and a distribution that's often used to approximate it). I'm Renee Clark from the Swanson School of Engineering at the University of Pittsburgh.

Okay, so, let's first talk about binary data. Binary data is qualitative data, or categorical data. Remember from earlier in our videos, we talked about quantitative versus qualitative data. Binary data is qualitative: a binary variable can take on only one of two possible categories or values. So, for example, a part that's made in a factory is either defective or it's not defective, a COVID test is either positive or negative, or an offer may be extended or not extended to a particular candidate. Binary data is also known as dichotomous data, in case you hear that term. Now, let's look at the bar graph on the right, and let's say this is the result of offers that have been extended or not to graduate school applicants. As you can see, more offers were not extended than extended: it looks like there were about 65 offers that were not extended versus around 35 that were. So, for this binary variable, "no" is what we call the mode, or, another way to say that, the modal category. Modal just means the category having the greatest count. In this case, it's "no." The mode is a type of descriptive statistic that is highly useful for categorical data, so you can keep that in mind. Now, binary or dichotomous data stands in contrast to polytomous data. Polytomous simply means a qualitative variable that has more than two categories, hence "poly."
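
As a small illustration of the modal category, here is a minimal Python sketch. The data are hypothetical: the counts of 65 and 35 are only the approximate values read off the bar graph, and the variable names are made up for this example.

```python
from collections import Counter

# Hypothetical data: roughly 65 offers not extended ("no")
# and roughly 35 offers extended ("yes"), as read off the bar graph.
offers = ["no"] * 65 + ["yes"] * 35

# Count how many times each category occurs.
counts = Counter(offers)

# most_common(1) returns a one-element list holding the (category, count)
# pair with the greatest count, which is the modal category.
mode_category, mode_count = counts.most_common(1)[0]

print("counts:", dict(counts))
print("mode:", mode_category, "with count", mode_count)
```

The same approach works for polytomous data, since Counter simply tallies however many categories appear.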

Okay, proportions and Bernoulli trials. The population proportion, which is a parameter, is labeled with the letter p. Some examples of population proportions might be the proportion of all cars in Pittsburgh that are blue, or the proportion of all items made at a factory that are defective. Since p is a parameter, we have to estimate it, right? An estimate of p comes from what are known as Bernoulli trials, and, in particular, n of them. This series of trials forms what's known as a Bernoulli process. A Bernoulli trial, then, is an experiment, or, if you want to think of it this way, a variable, that can be classified in one of two ways. Sounds familiar. We just talked about this: for example, yes or no, success or failure, defective or not defective. So, in essence, it's a binary variable. A point estimator for the population proportion is given by p hat. The little symbol over the top of the p is called a hat, and when we use a hat, that just means estimate. So, p hat, or the estimate of p, is given by y over n. So what's y and what's n? In the numerator, we have y. It is a count variable, and specifically it represents the count of your successes in those n Bernoulli trials. Now, when we say success, think of it as the characteristic that you're interested in counting. It doesn't necessarily have to be a desirable characteristic. So, for example, perhaps we are interested in counting the number of negative COVID tests, or the number of defective parts, which isn't necessarily a successful outcome in the everyday sense. Again, the word success here simply means the characteristic you're interested in counting, whether that is desirable or not. In the denominator, we have n, which represents the number of Bernoulli trials, also known as your sample size.
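
To make the point estimate concrete, here is a minimal Python sketch of p hat = y / n. The function name and the inspection data are hypothetical, chosen only for illustration; each trial is coded as 1 for a "success" (here, a defective part) and 0 otherwise.

```python
def estimate_proportion(outcomes):
    """Return (y, n, p_hat) for a list of 0/1 Bernoulli trial outcomes."""
    n = len(outcomes)          # number of Bernoulli trials (sample size)
    if n == 0:
        raise ValueError("need at least one trial to estimate a proportion")
    y = sum(outcomes)          # count of successes (the 1s)
    return y, n, y / n         # p_hat = y / n

# Hypothetical sample: 20 parts inspected, 3 of them defective (coded 1).
inspections = [1] * 3 + [0] * 17

y, n, p_hat = estimate_proportion(inspections)
print("y =", y, "n =", n, "p_hat =", p_hat)
```

Note that "success" here is the defective outcome, since that is the characteristic being counted.
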
Okay, so, y in the numerator, as we said, is a count variable, and it's distributed according to the binomial distribution. You can also think of it as the sum of the zero-or-one outcomes from your n Bernoulli trials. Remember, each Bernoulli trial is like a binary variable: it can take only one of two values, and the value of one corresponds to your success, or the characteristic that you're interested in counting. Now, there's actually a distribution that we've gotten very familiar with that is often used to approximate the binomial distribution. That distribution is the normal distribution.
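
As a rough sketch of what "distributed according to the binomial distribution" means, the Python below computes the probability of each possible count y using the standard binomial formula. The values n = 10 and p = 0.3 are assumptions picked for illustration, not values from the video.

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(Y = y) when Y counts the successes in n Bernoulli trials."""
    # comb(n, y) counts the ways to place y successes among n trials.
    return comb(n, y) * p**y * (1 - p) ** (n - y)

# Assumed values for illustration only.
n, p = 10, 0.3

# Probability of every possible count y = 0, 1, ..., n.
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

total = sum(pmf)                                     # should be (about) 1
mean = sum(y * prob for y, prob in enumerate(pmf))   # should be (about) n * p

for y, prob in enumerate(pmf):
    print(y, round(prob, 4))
print("total probability:", round(total, 6), "mean:", round(mean, 6))
```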

Okay, and we can do this approximation if the following two conditions hold. The first is that your sample size, or number of trials, times the population proportion of success must be greater than or equal to five. The second condition is that the number of trials times (1 minus p), where p is your population proportion of success, is also greater than or equal to five. So, 1 minus p is actually your population proportion of failure. Now, when we want to test out this particular assumption, since we don't know p, we will substitute in p hat, which we are able to compute as y over n. The normal distribution is often used as an approximation because it's easier to determine probability values using the tables in the back of the book.
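
The two conditions can be checked in a few lines of Python. This is only a sketch under stated assumptions: the function names and sample values are hypothetical, the normal curve is given mean n·p and standard deviation sqrt(n·p·(1 − p)) (the standard choice, though not spelled out in the video), and the 0.5 continuity correction is an addition that the video does not mention.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_approx_ok(n, p_hat, threshold=5):
    """Check n * p_hat >= 5 and n * (1 - p_hat) >= 5."""
    return n * p_hat >= threshold and n * (1 - p_hat) >= threshold

def binomial_cdf(k, n, p):
    """Exact P(Y <= k) for a binomial count Y."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Approximate P(Y <= k) with a normal curve.

    Assumption: the 0.5 continuity correction is not from the video.
    """
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)

# Hypothetical sample: n = 100 trials with 35 successes, so p_hat = 0.35.
n, p_hat = 100, 35 / 100
ok_large = normal_approx_ok(n, p_hat)    # 35 >= 5 and 65 >= 5
ok_small = normal_approx_ok(10, 0.2)     # 10 * 0.2 = 2 < 5

exact = binomial_cdf(30, n, p_hat)
approx = normal_approx_cdf(30, n, p_hat)
print("conditions hold (n=100):", ok_large)
print("conditions hold (n=10): ", ok_small)
print("exact P(Y <= 30):", exact, "normal approximation:", approx)
```

When the conditions hold, the two probabilities should land close together, which is the reason the normal tables can stand in for the binomial calculation.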

We wish to thank the National Science Foundation under Grant 2335802 for supporting our work. Thank you for watching.