|
V23
Estimation
Welcome to part six of our video
series on estimation. In this video, we are going to discuss binary data and
the mode as a descriptive statistic. We're going to
talk about proportions. A proportion is a parameter, of course, which has a
point estimate. We're going to talk about what a Bernoulli trial is, and
finally we're going to talk about the binomial distribution (and a
distribution that's often used to approximate it). I'm Renee Clark from the
Swanson School of Engineering at the University of Pittsburgh. Okay, so, let's first talk about
binary data. Okay, now, binary data is qualitative data, or categorical data,
okay? Remember from earlier in our videos, we talked about quantitative
versus qualitative data. Okay, so, binary data is qualitative. A binary
variable can take on only one of two possible categories or values. So,
for example, a part that's made in a factory is either defective or it's not
defective, or a COVID test is either positive or it's negative, or an offer
may be extended or not extended to a particular
candidate. Okay, binary data is also known as dichotomous data, in case
you hear that term. Okay, so, let's look at the bar graph on the right, and
let's say this is the result of offers that have been extended or not to
graduate school applicants. Okay, and as you can see, more offers were
not extended than extended, okay, because it looks like there were about 65
offers that were not extended versus, say, around 35 that were. Okay, so, in
this case, and this is a binary variable, "no" is what we call the
mode. Okay, or, another way to say that is the modal
category. Modal just refers to the category having the greatest
count. In this case, it's "no." Okay, the mode is a type of descriptive statistic
for categorical data, okay. Highly useful for categorical data, so you can
keep that in mind. Okay, now, binary or dichotomous data contrasts with
polytomous data. Okay, polytomous data simply means a qualitative
variable that has more than two categories, so, poly. Okay, proportions and Bernoulli
trials. Okay, the population proportion, which is a parameter, is labeled
with the letter p. Okay, so, some examples of population proportions might be
the proportion of all cars in Pittsburgh that are blue. Okay, or the proportion of
all items made at a factory that are defective. Okay, so, p, as a proportion,
or as a parameter, is something we have to estimate, right?
So, an estimate of p comes from what are known as Bernoulli trials, and, in
particular, n of them, okay? And this series of trials
forms what's known as a Bernoulli process. Okay, so, a Bernoulli trial then is
an experiment, or, if you want to think of it this way, a variable, that can be
classified in one of two ways, okay. Sounds familiar.
We just talked about this, so, for example, yes or no, success or failure,
defective or not defective. Okay, so, in essence, it's a binary variable,
right? One of two ways. Okay, so, an estimate, or a point estimator, for the
population proportion is given by p hat. Okay, so, the little symbol here over
top of the p, that's a hat. And when we use a hat, that just means estimate. Okay,
so, p hat, or the estimate of p, is given by y over n. So
what's y and what's n? Okay, so, in the numerator, we
have y. Okay, y is a count variable, and specifically it represents a count
of your successes in those n Bernoulli trials. Okay, now, when we say success,
what we mean is the characteristic that you're interested
in counting. It doesn't necessarily have to be a desirable characteristic. So,
for example, perhaps we are interested in counting the number of negative
COVID tests, right, which isn't necessarily a desirable or a successful
outcome, but again the word success here simply means the characteristic
you're interested in counting (whether that is desirable or not). Okay, in
the denominator, we have n, which represents the number of Bernoulli trials,
okay, also known as your sample size. Okay, so, in the numerator, y,
we said it's a count variable. It's distributed according to the
binomial distribution, okay. You can
also think of it as the sum of your zero or one outcomes from your n Bernoulli
trials. Okay, and remember each trial, each Bernoulli trial, is like a
binary variable, okay, so it can only take one of two
values, and that value of one corresponds to your success, right? Or the
characteristic that you're interested in counting. Okay, now, there's actually a
distribution that we've gotten very familiar with
that is often used to approximate the binomial distribution. Okay,
that distribution is the normal distribution. Okay, and we can do this
approximation if the following two conditions hold. The first is that your
sample size, or number of trials, times the population proportion of success
must be greater than or equal to five. Okay, the second condition is that the
number of trials times (1 minus p), where p is your population proportion of success,
is also greater than or equal to five. So, 1 minus p is actually
your population proportion of failure, right? Now, when we want
to check these two conditions,
since we don't know p, we will substitute in p hat,
which we are able to calculate using that equation, y over n. Okay, but, the normal distribution is often used as an
approximation because it's easier to determine probability values using the
tables in the back of the book. We wish to thank the National
Science Foundation under Grant 2335802 for supporting our work. Thank you for
watching. |
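
As a recap, the mode, the point estimate p hat = y over n, and the two conditions for the normal approximation can be sketched in Python. This is a minimal sketch and not something shown in the video: the counts of 65 and 35 are assumed from the approximate bar-graph values, treating "yes" as the success is a choice made only for illustration, and every variable and function name is hypothetical. The mean n times p and the standard deviation sqrt(n times p times (1 minus p)) used for the normal curve are standard results that the video does not state.

```python
import math
from collections import Counter

# Assumed data: roughly 65 offers not extended ("no") and 35 extended ("yes"),
# read approximately off the bar graph. These exact counts are an assumption.
offers = ["no"] * 65 + ["yes"] * 35

# Mode: the category having the greatest count.
counts = Counter(offers)
mode_category, mode_count = counts.most_common(1)[0]

# Code each Bernoulli trial as 1 (success) or 0. "Success" only means the
# characteristic being counted; here that is "yes" (an illustrative choice).
trials = [1 if outcome == "yes" else 0 for outcome in offers]
n = len(trials)      # number of Bernoulli trials (sample size)
y = sum(trials)      # count of successes
p_hat = y / n        # point estimate of the population proportion p

# Conditions for the normal approximation, with p_hat substituted for p.
normal_ok = (n * p_hat >= 5) and (n * (1 - p_hat) >= 5)


def binomial_cdf(k, n, p):
    """Exact P(Y <= k) for Y following a binomial(n, p) distribution."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


# Compare the exact binomial probability with its normal approximation,
# using p_hat in place of the unknown p (no continuity correction applied).
k = 30  # hypothetical cutoff chosen only for illustration
exact = binomial_cdf(k, n, p_hat)
approx = normal_cdf(k, n * p_hat, math.sqrt(n * p_hat * (1 - p_hat)))

print(f"mode = {mode_category} ({mode_count}), p_hat = {p_hat:.2f}")
print(f"normal approximation conditions met: {normal_ok}")
print(f"P(Y <= {k}): exact = {exact:.4f}, normal approx = {approx:.4f}")
```

Here the probabilities are computed directly instead of being looked up in a table; the two printed values give a rough sense of how close the normal approximation is for this sample size.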