Descriptive Statistics

Hi everybody. Welcome to descriptive statistics, part one. I'm Renee Clark from the Swanson School of Engineering, and this is a course in statistical testing and regression. In this part one video, our agenda is to discuss populations versus samples, and then descriptive statistics that describe samples. We'll be discussing descriptive statistics for both central tendency and variability. Okay, so, let's define population
versus sample. A population consists of all the observations that we are
concerned with or interested in, whereas a sample is a subset of the
population. Okay, so, what do we mean by all the observations that we are
concerned with or interested in? Here are some examples. We could be
interested in all voters in Pennsylvania, all females who played golf last
year, all plastic parts that were made in a given year by a manufacturer, or
all students taking this course this semester. Each of these forms a population.
What is a descriptive statistic? It's a number that describes a sample of data, also known simply as a statistic. A parameter, which we'll be learning much more about as the course goes on, characterizes the population. So, descriptive statistics describe samples; parameters characterize the population. A descriptive statistic, as we said, provides a numerical summary for a sample of data.
And, in this part one video, we're going to discuss descriptive statistics for both location (central tendency) and variability in your data.
Okay, so the most familiar descriptive statistic for location or central tendency is the average, or the mean, which I'm sure you're all familiar with. If you want to take the average of a set of numbers, you add them up and divide by the number of numbers in the set. So, if I want to get the sample average for these five numbers here, I add them up and divide by five. That comes out to 5.2. Now, a problem with the average, or mean, is that it is affected by outliers. What are outliers? They are simply unusual observations relative to the rest of your data. So, in this original set of five numbers, 145, for example, would be an outlier: it is a lot farther from the other numbers. Now, if we added it to the mix and then got the average of these six numbers, that would come out to 28.5. Look how much that average jumped just by the addition of that one outlying number. And notice that this 28.5 really isn't close to any of those six numbers in the numerator, so it doesn't characterize the data set well. Okay, in
jumps the median, also a measure of location or central tendency, but a lot less sensitive to outliers. It's defined as the middle observation of a set of ordered data. So, let's say, for example, we've got three data items: three, five, and nine. The median in this case would be five; it's the middle observation in that ordered set of data. Okay, what if I have an even number of items? How do you determine the median? The median, in this case, is going to be the average of your two middle numbers.
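As a quick sketch of both measures, here are the mean, the outlier effect, and the median in Python. The slide's actual numbers aren't reproduced in this transcript, so the values below are hypothetical ones chosen to match the stated results (a mean of 5.2, jumping to 28.5 once the outlier 145 is added):

```python
# Hypothetical data chosen to match the lecture's stated results;
# the slide's actual five numbers are not in the transcript.
data = [3, 5, 2, 9, 7]                 # sums to 26, so the mean is 5.2
mean = sum(data) / len(data)

# Adding the outlier 145 drags the mean far from the rest of the data.
with_outlier = data + [145]            # sums to 171 over 6 values
mean_with_outlier = sum(with_outlier) / len(with_outlier)   # 171/6 = 28.5

def median(values):
    """Middle observation of the sorted data; for an even count,
    the average of the two middle observations."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(mean)                    # 5.2
print(mean_with_outlier)       # 28.5
print(median([3, 5, 9]))       # 5   (odd count: the middle observation)
print(median([2, 3, 4, 9]))    # 3.5 (even count: average of 3 and 4)
print(median(with_outlier))    # 6.0
```

Note how the median of the six numbers, 6.0, stays close to the bulk of the data even with 145 included, while the mean jumps to 28.5.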
So, in this case, the median for that set of four numbers is 3.5. Okay, I'd like to next discuss
the topic of a weighted average, or a weighted mean. Now, averages are weighted by nature: they are actually weighted based on the number of times each value in your data set occurs. Let's look at the table that is shown here. What that table is saying is that these are the values in my data set: I have the values 4, 6, 9, and 11, and in the right-hand column is the number of times each value occurs. Okay, so this
data set could also be written in the following way. The value four occurs twice: 4 and 4. The value six occurs six times: 6, 6, 6, 6, 6, 6. The value nine occurs twice: 9 and 9. The value 11 occurs once. Okay, so that's actually what my data set looks like. Now, I could get the average of that by simply adding up those 11 numbers and dividing by 11.
However, another way to do that by using the table information directly is as
follows. Okay, you want to multiply each value by the number of times it
occurs. Okay, add these individual products together and then divide that sum
by the total number of occurrences. Okay, so, if we were going to do that for this table, we multiply each value by the number of times it occurs: four occurs twice, six occurs six times, nine occurs twice, and eleven occurs once.
Okay, and if I were to add these four individual products together, I would
get 73. Okay, in the denominator, then, you simply have the total number of
occurrences. So, adding the numbers in the second column, 2 + 6 + 2 + 1, I get 11. Okay, and then, to get my weighted
average, I simply take the 73, which
comprises my numerator, divide by 11, which comprises my denominator, and I
come out with a weighted average of 6.636. However, I would get the same result by calculating the average as usual. In other words, if I were to add these 11 numbers up, you would see that they do add to 73; dividing by 11, since there are 11 of them, gives the same result: 6.636 is the average. So, the point here is this: if you see values laid out in a table like this and you're asked to determine the weighted average, the table approach is the way you would go about it. Alternatively, you could list all the values out and divide by the number of values there are. What is variability?
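Before answering that, here is the weighted-average calculation from the table above as a short Python sketch (value 4 occurring twice, 6 six times, 9 twice, and 11 once):

```python
# Weighted average from the lecture's table of values and occurrence counts.
counts = {4: 2, 6: 6, 9: 2, 11: 1}

# Numerator: each value multiplied by the number of times it occurs.
numerator = sum(value * count for value, count in counts.items())   # 8+36+18+11 = 73
# Denominator: total number of occurrences.
denominator = sum(counts.values())                                  # 2+6+2+1 = 11
weighted_avg = numerator / denominator

# Same answer as listing every occurrence out and averaging as usual.
expanded = [v for v, c in counts.items() for _ in range(c)]         # 11 values
plain_avg = sum(expanded) / len(expanded)

print(round(weighted_avg, 3))      # 6.636
print(weighted_avg == plain_avg)   # True
```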
Variability is the extent to which data differs, or is spread or stretched out. So, let's look at this set of data right here. There are 15 data items, so n = 15. Let's say these pieces of data represent the chance of rain, perhaps at different locations in the United States on a given day. There's a lot of variability in this data. You'll notice that it ranges from around 2% up to around 99.5%, so it almost spans the entire range it could span, right? And you'll also notice that there are numbers in the 90s, the 80s, the 70s, 60s, 50s, 30s, and so on. So there's a lot of variability within this particular data set. There are descriptive statistics that measure variability; the two most common, which you may have heard of, are the variance and the standard deviation. What the variance and the standard deviation do is capture the variability of your data items around the mean of those items. Now, because we're capturing variability around the mean, and the mean is sensitive to outliers, the variance and standard deviation are unfortunately also sensitive to outliers.
Okay, but let's show a simple example of how this works. Let's say we have three data items we'll call x1, x2, and x3. This is just three pieces of data; it could be any three pieces of data. To calculate the variance, you must know or calculate the sample average. So, for those three pieces of data, add them up, divide by three, and you get two. The sample size is three because there are three pieces of data. Okay, so, to calculate the variance, this is what the formula looks like. It's a lot more complex looking than it actually is. All that the variance involves is differences, or deviations, of each data item (1, 2, and 3 in this case) around the mean: the difference between each data item, each xi, and the mean. So, for example, to calculate the variance, we're going to take the difference between one and two and square it; add to that the difference between two and two, squared; and add to that (because, you see, there's a summation sign) the difference between three and two, squared. Sum those up and divide by n minus one. That's why the variance is called the average squared deviation from the mean. There's the deviation aspect, there's the squared aspect, and the average aspect comes in too, because you are summing and then dividing by a measure of sample size, n minus one (we'll talk in a minute about why it's n minus one versus n). So we say the variance is the average squared deviation from the mean. The standard deviation, or s, is simply the square root of the variance; it is directly derived
from the variance. Okay, so again, over here on the right is our formula for the sample variance. Take the square root, and you get the sample standard deviation. The quantity in the denominator, n minus one, is known as the degrees of freedom. Another way to say that is that there are n minus one independent pieces of information going into calculating the sample variance. Okay, why n minus one? It all comes down to a mathematical equality which says that the sum of all of the deviations, that is, the sum of all of the differences between each data point and the mean, must equal zero. That's a mathematical equality that we won't prove, but it is a fact. And in class I will walk us through a numerical example that shows how this equality is connected to the fact that there are only n minus one pieces of independent information, and that leads to the n minus one in the denominator here.
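To close out the section, here is the lecture's three-point example in Python: the sample mean, the variance as the squared deviations summed and divided by n minus one, the standard deviation as its square root, and a check of the identity that the deviations sum to zero:

```python
import math

# The lecture's example data: x1 = 1, x2 = 2, x3 = 3.
data = [1, 2, 3]
n = len(data)
mean = sum(data) / n                      # (1 + 2 + 3) / 3 = 2.0

# Sum of squared deviations from the mean: (1-2)^2 + (2-2)^2 + (3-2)^2 = 2
sum_sq_dev = sum((x - mean) ** 2 for x in data)

# Divide by the degrees of freedom, n - 1, to get the sample variance.
variance = sum_sq_dev / (n - 1)           # 2 / 2 = 1.0
std_dev = math.sqrt(variance)             # sqrt(1.0) = 1.0

# The deviations themselves always sum to zero -- the identity behind
# the "n - 1 independent pieces of information" argument.
sum_of_deviations = sum(x - mean for x in data)

print(variance, std_dev, sum_of_deviations)   # 1.0 1.0 0.0
```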