V4 Descriptive Statistics

Welcome back to the descriptive statistics videos. This is part three of the videos on descriptive statistics. I'm Renee Clark from the Swanson School of Engineering. In this part three video, our agenda items are, first, defining what a data distribution is. We then move on to the concept of symmetry, or a symmetric distribution, and then we talk about the concepts of skewness and kurtosis, both of which are descriptive statistics that describe or measure the shape of a distribution.

Okay, so first, what is a data distribution? At the bottom of the screen, these are some pictures of various data distributions. A data distribution is the shape of a graph when all the possible values of your variable are plotted on the x-axis. For example, perhaps we're looking at the variable height. A variable such as height would be plotted on the x-axis, and maybe those values range from 6 to 7 feet, right? Maybe we are measuring basketball players, for example. Now, how often each x value, or each value of your variable, occurs is shown on the y-axis. So the y-axis of the graph represents a frequency, meaning how often each particular value of height occurs. So, 6 ft heights may occur with that frequency, whereas 7 ft heights may occur with that frequency.
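As a rough illustration of values on the x-axis and frequencies on the y-axis, here is a minimal Python sketch. The heights are made-up numbers, not data from the video, and the text bars simply stand in for the pictures on the slide.

```python
from collections import Counter

# Hypothetical heights in feet for a small group of basketball players
# (made up for illustration; not data from the video).
heights_ft = [6.0, 6.25, 6.5, 6.5, 6.5, 6.75, 6.75, 7.0]

# Count how often each height occurs: the values would go on the x-axis,
# and the counts (frequencies) would go on the y-axis.
frequency = Counter(heights_ft)

for value in sorted(frequency):
    # Draw a simple text bar whose length equals the frequency.
    print(f"{value:4.2f} ft | {'#' * frequency[value]}")
```

Each printed row is one x value, and the number of `#` marks is how often that value occurs.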

Okay, the concept of symmetry: a distribution is symmetric if a vertical line through its center divides it into two halves that are mirror images of one another. So, picture those two halves folding nicely together if that vertical line serves as an axis, okay? The distribution that is pictured on the right is the normal distribution, right? It is indeed symmetric, and that's because half of the data, or half the area, is to the left of the center line and the other half is to the right of the center line.

Okay, the opposite of being symmetric is known as being skewed. If a distribution is not symmetric, then it is skewed. We have two possibilities for skewed. A distribution can be skewed to the right, meaning it has a long right tail such as shown here. Or a distribution can be skewed to the left, such that it has a long left tail such as shown here. Okay, what would be examples of data that might be skewed left or skewed right? In terms of skewed left, time-to-fail data is often skewed left. So, if our variable is time to fail, which is shown on the x-axis, you hope that many fewer items, which is represented as your frequency on the y-axis, will fail early. And as we go along in time and get to a larger time, we hope, or expect, that many more items will fail at a later time. Okay, what's an example of data that might be skewed right? Salary data. For example, we expect many more people to have a lower salary, and many fewer people to have a higher salary, which would be represented off to the right along the x-axis.
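To picture the salary example, here is a small Python sketch that groups made-up salaries into bins and prints a text histogram. The salary values and the bin width are assumptions chosen only to show a long right tail.

```python
from collections import Counter

# Hypothetical annual salaries in thousands of dollars (made up for illustration).
salaries_k = [40, 42, 45, 45, 48, 50, 52, 55, 58, 60, 65, 70, 85, 110, 180]

BIN_WIDTH = 40  # assumed bin width, in thousands of dollars

# Assign each salary to the lower edge of its bin, then count salaries per bin.
bin_counts = Counter((s // BIN_WIDTH) * BIN_WIDTH for s in salaries_k)

# Walk through every bin from the lowest to the highest, including empty ones.
for lower in range(min(bin_counts), max(bin_counts) + BIN_WIDTH, BIN_WIDTH):
    print(f"{lower:3d}-{lower + BIN_WIDTH - 1:3d}k | {'#' * bin_counts[lower]}")
```

Most of the marks pile up in the lowest bin and only a few trail off toward the higher salaries, which is the long right tail of a skewed right distribution.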

Okay, there are descriptive statistics that measure shape. Okay, so, skewness is a number that measures the lack of symmetry in a distribution. Okay, such as the distribution that's shown right there. This is a skewed right distribution. It has a certain lack of symmetry. Now, as the symmetry of a distribution increases, that skewness number, or that skewness value, approaches zero. And if your skew value is exactly zero, then you have a perfectly symmetrical distribution.
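For a sense of what software does behind the scenes, here is a minimal Python sketch of a skewness calculation. It uses the adjusted sample skewness formula that Excel documents for its SKEW function; whether another package uses exactly the same form is an assumption, and the function name and data are hypothetical.

```python
import math

def sample_skewness(data):
    """Adjusted sample skewness (the form Excel documents for SKEW)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(data) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical; skewness is undefined")
    # Sum the cubed standardized deviations, then apply the small-sample factor.
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

# Hypothetical data sets for illustration.
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 2, 10]

print(sample_skewness(symmetric))        # essentially zero
print(sample_skewness(long_right_tail))  # positive, so skewed right
```

A long right tail gives a positive value and a long left tail gives a negative value, while the symmetric data gives a value of essentially zero.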

Okay, now in terms of a skew value serving as a descriptive statistic for shape, we will allow software to calculate that for us, software such as Minitab or Excel, or another statistical package that you might use. We will then simply interpret it using rules of thumb. So, here are some rules of thumb relative to skewness. We say that a distribution is highly skewed if its skew value is greater than one or less than negative one. Another way to say that is that a distribution is highly skewed if the skew value has an absolute value greater than one. A distribution is said to be moderately skewed if its skew value is somewhere between 0.5 and 1 or between -0.5 and -1. Another way to say that is that it's moderately skewed if its absolute value is somewhere between 0.5 and 1.

In contrast, we say a distribution is fairly symmetric if its skew is low, meaning somewhere between negative 0.5 and 0.5. Another way to say that is that its absolute value is less than 0.5.

Okay, there is a second descriptive statistic that measures the shape of a distribution, and it's that of kurtosis. Kurtosis measures either the pointiness or the flatness of a distribution. If you get a negative kurtosis value, or software calculates a negative kurtosis value for you, that means your distribution is relatively flat compared to the normal distribution. So, the normal distribution, in this case, is our comparison. If you get a positive kurtosis value, that says that your distribution is a little bit taller, or has a higher peak, than the normal distribution. You can associate positive with peak to keep those straight. Okay, so, over here on the right I'm going to overlay in red the normal distribution. So, there's the normal distribution in red. In green is a distribution that has a positive kurtosis, because it's taller or has a higher peak than the normal. And then in orchid I am going to overlay a distribution that has a negative kurtosis, because it's a lot flatter than the normal distribution in red.

Okay, why is all of this important? Well, because the statistical methods that we're going to use in this course, or that you're going to encounter in other courses, often require approximately normal data or approximately normal distributions. Now, for perfectly normal data, both the skew and the kurtosis values will each be equal to zero. With real-world data, that is actually unlikely to occur. However, we can go with approximate normality, right? So, in the case of kurtosis, if your kurtosis value is less than or equal to one in absolute value, that provides a very good approximation to the normal distribution.

A kurtosis value within two in absolute value is also acceptable, just as, when we were talking about skew, we talked about a distribution being fairly symmetric, right? So, in the case of kurtosis, we can work within the bounds of approximate normality in order to use the statistical methods that we're going to be using. But, again, we're going to let Minitab, Excel, or other software calculate the kurtosis for us. We will interpret it.
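The rules of thumb for skew and kurtosis can be sketched in Python as follows. The kurtosis calculation uses the sample excess kurtosis formula that Excel documents for its KURT function, which is zero for a normal distribution; whether another package uses exactly this form is an assumption. The function names, the labels, and the treatment of values that fall exactly on a cutoff are also assumptions made for illustration.

```python
import math

def excess_kurtosis(data):
    """Sample excess kurtosis (the form Excel documents for KURT)."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(data) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical; kurtosis is undefined")
    # Sum the standardized deviations raised to the fourth power.
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

def interpret_skewness(skew):
    """Apply the skewness rules of thumb from the video."""
    size = abs(skew)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:  # assumption: exactly 0.5 or 1 counts as moderate
        return "moderately skewed"
    return "fairly symmetric"

def interpret_kurtosis(kurt):
    """Apply the kurtosis rules of thumb from the video."""
    size = abs(kurt)
    if size <= 1:
        return "very good approximation to normal"
    if size <= 2:
        return "acceptable approximation to normal"
    return "outside the rule-of-thumb bounds"  # label is an assumption

# Hypothetical flat data: the values 1 through 10, each occurring once.
flat = list(range(1, 11))
k = excess_kurtosis(flat)
print(round(k, 2), "->", interpret_kurtosis(k))
print(interpret_skewness(-0.7))
```

The flat data produces a negative kurtosis value, consistent with a distribution that is flatter than the normal distribution.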

Thank you to the National Science Foundation, grant number 2335802, for supporting our work. Thank you for listening.