V2 Descriptive Statistics

Hi everybody. Welcome to descriptive statistics, part one. I'm Renee Clark from the Swanson School of Engineering, and this is a course in statistical testing and regression.

So, in this part one video, our agenda is to discuss populations versus samples, and then descriptive statistics that describe samples. And we'll be discussing descriptive statistics for both central tendency and variability.

Okay, so, let's define population versus sample. A population consists of all the observations that we are concerned with or interested in, whereas a sample is a subset of the population. Okay, so, what do we mean by all the observations that we are concerned with or interested in? Here are some examples. We could be interested in all voters in Pennsylvania, all females who played golf last year, all plastic parts that were made in a given year by a manufacturer, or all students taking this course this semester. Each of these forms a population. What is a descriptive statistic? It's a number that describes a sample of data. Okay, it's also known simply as a statistic. A parameter, on the other hand, which we'll be learning much more about as the course goes on, characterizes the population. So, descriptive statistics describe samples. Parameters characterize the population.

So, a descriptive statistic, as we said, provides a numerical summary for a sample of data. And, in this part one video, we're going to discuss descriptive statistics for both location, or central tendency, as well as variability in your data. Okay, so the most familiar descriptive statistic for location or central tendency is the average, or the mean, which I'm sure you're all most familiar with. Okay, so if you want to take the average of a set of numbers, you add them up and you divide by the number of numbers in that set. So, if I want to get the sample average for these five numbers here, I add them up and divide by five. That comes out to 5.2. Now, a problem with the average, or the mean, is that it is affected by outliers. What are outliers? They are simply unusual observations relative to the rest of your data. Relative to this original set of five numbers, 145, for example, would be an outlier. It is a lot farther from the other numbers. Now, if we added that to the mix and then got the average of these six numbers, that would come out to 28.5. So, look how that average has jumped so much just by the addition of that one outlying number. And what you notice about this 28.5 is that it really isn't close to any of those six numbers in the numerator, so it doesn't characterize the data set well.

Okay, in jumps the median. Also a measure of location or central tendency, but it's a lot less sensitive to outliers. Okay, it's defined as the middle observation of a set of ordered data. So, let's say, for example, we've got the three data items: three, five, and nine. Okay? The median in this case would be five. It's the middle observation in that ordered set of data. Okay, what if I have an even number of items? How do you determine the median? The median, in this case, is going to be the average of your two middle numbers. So, in this case, the median for that set of four numbers is 3.5.
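As a quick sketch of this outlier effect, here's a minimal Python example. The video doesn't list the five numbers, so the values below are assumed; they're chosen only so that they sum to 26 and reproduce the stated averages of 5.2 and 28.5.

```python
from statistics import mean, median

# Hypothetical five values matching the stated average of 5.2
# (the video doesn't list them; any five numbers summing to 26 work).
data = [3, 5, 9, 4, 5]
print(mean(data))    # 5.2
print(median(data))  # 5

# Add the outlier 145: the mean jumps a lot, the median barely moves.
with_outlier = data + [145]
print(mean(with_outlier))    # 28.5
print(median(with_outlier))  # 5.0
```

Notice that the median stays at 5 even after the outlier is added, which is exactly why it's less sensitive to unusual observations than the mean.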

Okay, I'd like to next discuss the topic of a weighted average or a weighted mean. Okay, now averages are weighted by nature. Okay, they are actually weighted based on the number of times each value in your data set occurs. Okay, now let's look at the table that is shown here. Okay, what that table is saying is that these are the values in my data set. I have the values 4, 6, 9, and 11, and in the right-hand column is the number of times each value occurs. Okay, so this data set could also be written in the following way.

The value four occurs twice. The value six occurs six times, so I would say 1, 2, 3, 4, 5, 6. The value nine occurs twice: 9 and 9. The value 11 occurs once. Okay, so that's actually what my data set looks like. Now, I could get the average of that by simply adding up those 11 numbers and dividing by 11. However, another way to do that, using the table information directly, is as follows. Okay, you want to multiply each value by the number of times it occurs, add these individual products together, and then divide that sum by the total number of occurrences. Okay, so, if we were going to do that for this table, we multiply each value by the number of times it occurs: four occurs twice, six occurs six times, nine occurs two times, and the value 11 occurs once. Okay, and if I were to add these four individual products together, I would get 73. Okay, in the denominator, then, you simply have the total number of occurrences. So, if I add the numbers in the second column, 2 + 6 + 2 + 1, I get 11. Okay, and then, to get my weighted average, I simply take the 73, which comprises my numerator, divide by 11, which comprises my denominator, and I come out with a weighted average of 6.636. However, I would get the same result by calculating the average as usual. In other words, if I were to add these 11 numbers up, you will see that they do add to 73, and then divide by the fact that there are 11 of them, and I will get the same result: 6.636 is the average. Okay, so, the point here being: if you see values laid out in a table as such and you're asked to determine the weighted average, this approach is the way that you would go about it. Or, alternatively, you could list all the values out and divide by the number of values that there are.

What is variability? Variability is the extent to which data differs or is spread or stretched out. Okay, so let's look at this set of data right here.
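Before turning to the variability data, the weighted-average calculation can be sketched in a few lines of Python; the values and counts come straight from the table in the video.

```python
# Weighted average from the table: values and their occurrence counts.
values = [4, 6, 9, 11]
counts = [2, 6, 2, 1]

# Multiply each value by its count, sum the products, divide by total count.
numerator = sum(v * c for v, c in zip(values, counts))  # 8 + 36 + 18 + 11 = 73
n = sum(counts)                                         # 2 + 6 + 2 + 1 = 11
weighted_avg = numerator / n
print(round(weighted_avg, 3))  # 6.636

# Same result by listing every occurrence and averaging as usual.
expanded = [v for v, c in zip(values, counts) for _ in range(c)]
print(sum(expanded) / len(expanded))  # identical: 73 / 11
```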
There are 15 data items, so n = 15. Okay, and let's say that these pieces of data represent chance of rain, perhaps at different locations in the United States on a given day. Okay, there's a lot of variability in this data. You'll notice that it ranges from around 2% up to around 99.5%. So, it almost spans the entire range of what it could span, right? And you'll also notice that there are numbers in the 90s, the 80s, the 70s, 60s, 50s, down to the 30s, okay? So, even within that data set, there's a lot of variability. So, there are descriptive statistics that measure variability. The two most common, which you may have heard of, are the variance and the standard deviation. Okay, so, what the variance and the standard deviation do is they capture the variability of your data items around the mean of those items. Okay, now, because we're capturing variability around the mean, and the mean is sensitive to outliers, unfortunately the variance and standard deviation are also sensitive to outliers. Okay, but let's show a simple example of how this works. Let's say we have three data items, which we'll call x1 = 1, x2 = 2, and x3 = 3. So, this is just three pieces of data; it could be any three pieces of data. To calculate the variance, you must know or calculate the sample average, okay? So, for those three pieces of data, add them up, divide by three, and you get two. The sample size is three because there are three pieces of data. Okay, so, to calculate the variance, this is what the formula looks like. It looks a lot more complex than it actually is. But all that the variance involves is differences, or deviations, of each data item (1, 2, and 3 in this case) from the mean. So, the difference between each data item, each x i, and the mean.
Okay, so, for example, to calculate the variance, we're going to take the difference between one and two and square it; add to that the difference between two and two, squared; and add to that (because, you see, there's a summation sign) the difference between three and two, squared. Okay, sum those up and divide by n minus one. Okay, that's why the variance is called the average squared deviation from the mean. There's the deviation aspect, there's the squared aspect, and the average aspect comes in too, because you are summing and then dividing by a measure of sample size, n minus one (and we'll talk in a minute about why it's n minus one versus n). So we say the variance is the average squared deviation from the mean. Okay, the standard deviation, or s, is simply the square root of the variance, so it's directly derived from the variance.
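The three-item variance calculation can be sketched as follows; the data items are assumed to be 1, 2, and 3, consistent with the sample mean of 2 and the deviations worked through above.

```python
from statistics import stdev, variance

# Three data items with sample mean 2 (assumed values, per the worked example).
x = [1, 2, 3]
xbar = sum(x) / len(x)  # 2.0

# Variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)
print(s2)         # ((1-2)^2 + (2-2)^2 + (3-2)^2) / 2 = 1.0

# Standard deviation is the square root of the variance.
print(s2 ** 0.5)  # 1.0

# The statistics module's sample variance and standard deviation agree.
print(variance(x), stdev(x))
```

Note that Python's `statistics.variance` and `statistics.stdev` already divide by n minus one; the population versions, `pvariance` and `pstdev`, divide by n instead.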

Okay, so again, over here on the right is our formula for the variance, or the sample variance. Okay, take the square root and you'll get the sample standard deviation. The quantity in the denominator, n minus one, is known as the degrees of freedom. Okay, another way to say that is that there are n minus one independent pieces of information that go into calculating the sample variance. Okay, why n minus one? It all comes down to this mathematical equality here, which says that the sum of all of the deviations, or the sum of all of the differences between each data point and the mean, must equal zero. Okay, that's a mathematical equality that we won't prove, but it is a fact. And in class I will walk us through a numerical example that shows how this equality is connected to the fact that there are only n minus one pieces of independent information, and that leads to the n minus one in the denominator here.
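The equality can be checked numerically. The data below are hypothetical, chosen only to match the video's description of the rain data (n = 15, spanning roughly 2% up to 99.5%):

```python
# Hypothetical chance-of-rain data (n = 15), spanning roughly 2% to 99.5%.
data = [2, 14, 31, 38, 45, 52, 58, 63, 67, 71, 76, 82, 88, 93, 99.5]
xbar = sum(data) / len(data)
deviations = [x - xbar for x in data]

# The deviations always cancel out (up to floating-point rounding)...
print(abs(sum(deviations)) < 1e-9)  # True

# ...so knowing any n - 1 deviations pins down the last one: only
# n - 1 of them are free to vary, hence n - 1 degrees of freedom.
implied_last = -sum(deviations[:-1])
print(abs(implied_last - deviations[-1]) < 1e-9)  # True
```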