V37 Simple Linear Regression

Welcome to part two of our video series in support of simple linear regression. In this video, we are going to discuss correlation and what it is, properties of a correlation coefficient, and the impact of outliers on the correlation coefficient. I'm Renee Clark from the Swanson School of Engineering at the University of Pittsburgh.

Okay, correlation is a common term, but… but what's often forgotten about correlation, or not known about correlation, is that it measures the strength of the linear relationship between two variables. Okay, so, it's not just any relationship that correlation detects, it's the strength of the linear relationship. Okay, so, having said this, one should always plot their data first to assess the reasonableness of a linear relationship between the two variables. Okay, so, taking a look at the scatter plot on the right, it's an XY scatter plot, okay, and as you can see, I would say that a linear relationship between X and Y in this case is definitely reasonable, right? It's not a perfect linear relationship- those points don't all lie on the… on that straight line- but it's close. So, it's certainly reasonable. So, after having plotted that XY data, I would proceed, if I wanted to, with calculating an actual correlation, or correlation coefficient, between X and Y.

Okay, so, the sample correlation coefficient that… that you would calculate would be denoted by R. Okay, R is also known as your Pearson correlation coefficient- you may see that term. Okay, important properties of the correlation coefficient, R: R lies somewhere between -1 and 1, inclusive. So, as you can see, a value of R of 0 is also possible. Okay, if the correlation coefficient is equal to exactly one, then that means all of your XY plotted points lie exactly on a straight line having a positive slope, okay, as shown in this graph. Okay, on the other hand, if your correlation coefficient is -1 exactly, that means that all of your XY points lie on a straight line but with a negative slope. They have a negative slope. Okay, so, that means that if the absolute value of your correlation coefficient is one, then there is a perfect linear relationship between your two variables X and Y. Okay, now, if your correlation coefficient is exactly zero, that means there is no linear correlation between X and Y, your two variables, and if R is, you know, approximately zero or around zero, that likewise means that there's little correlation between your variables.
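As a rough illustration of these properties, the sketch below computes R from the usual textbook definition of the Pearson coefficient (the sum of cross-deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the two straight lines are hypothetical choices made for this example; they are not data from the video.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: points lying exactly on a straight line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r_pos = pearson_r(xs, [2 * x + 3 for x in xs])     # positive slope
r_neg = pearson_r(xs, [-0.5 * x + 7 for x in xs])  # negative slope

print(round(r_pos, 6), round(r_neg, 6))
```

Up to floating-point rounding, the positively sloped line gives R of 1 and the negatively sloped line gives R of -1, whatever the steepness of the slope.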

Okay, so, here is… are some graphical depictions of correlation. If we look at the graphs on the… on the top of the screen that are… that show positive correlation, we can see here is our perfect correlation case of R equal 1 where all those points lie exactly on that straight line. To the right of that, these points, although not having a perfect correlation, they still have a high, positive correlation. Okay, and then, even looking to the right of that, the… the points here, again, have a lower correlation than the one… you know… than the points directly to the left, but still there is a… there is a linear trend going on there and I would classify that as having a lower pos… positive correlation than this graph.

Okay, looking at the graphs on the bottom, same but for the negative correlation case. Here is a case of per… perfect negative correlation, where the points all lay… lie exactly on that negatively sloped line. Here, these points don't have perfect negative correlation, but it's fairly high. They are… those points, you know, fairly closely hug that… that line, and then, here, over on the left, certainly a lower, negative correlation compared to this graph, okay, because the points are scattered a little bit further out from the line with a… with a negative slope. Okay, now these cases actually depict zero correlation.

Okay, now think about when we first introduced the term correlation. Correlation refers to the degree of the linear relationship, right, between two variables. Okay, so, let's look at the graph on the left. Okay, this XY scatter plot really shows random scatter of the points and, really, it shows no apparent association at all between X and Y because those points are just randomly scattered. Okay, in this case, we're probably pretty close to zero. Now, contrast that with the case on the right. Okay, that correlation coefficient is also likely pretty close to zero, but for a different reason. Okay, it's not due to the lack of any relationship between them, because there is a relationship between X and Y, some sort of a curvilinear relationship, perhaps a quadratic relationship. So, there is a relationship. It's just that the relationship is not linear. There's a lack of a rel… of a linear relationship. That's why we're approximately zero here. Okay, so, these slides show the importance of plotting your data first, right, to understand what type of a relationship might be reasonable. In the case on the right, this is a relationship that may be… may be able to be modeled, just not in a linear way. Perhaps with some other statistical techniques, other types of regression besides linear regression.
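The two zero-correlation cases can be sketched with made-up data. Both data sets below are assumptions for illustration only: a grid in which every X value is paired with every Y value stands in for the "no association" scatter, and a symmetric parabola stands in for the curved relationship. `pearson_r` is a hypothetical helper that applies the textbook definition of R.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1 (assumed data): every x paired with every y, so knowing x
# says nothing about y. This stands in for the random scatter plot.
grid = [(x, y) for x in range(1, 6) for y in range(1, 6)]
r_scatter = pearson_r([p[0] for p in grid], [p[1] for p in grid])

# Case 2 (assumed data): y is completely determined by x, but through
# a quadratic curve that is symmetric about x = 0, not a straight line.
curve_x = list(range(-5, 6))
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

print(r_scatter, r_curve)
```

Both values come out essentially zero, even though the second data set has an exact relationship between X and Y. The coefficient alone cannot tell the two cases apart, which is why plotting the data first matters.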

Okay, the next topic I want to talk about is the relationship, or better yet, the impact that outliers can have on a correlation coefficient. Okay, really briefly, you'll remember that an outlier is an unusual point. It's an unusual data point, okay, meaning a data point that's different from the others. Okay, so, let's see the impact that outliers can have on a correlation coefficient. Okay, so, let's look at our graph on the left here. We have some XY data that's fairly highly correlated. You'll see that correlation coefficient of about .987. That's a high correlation. As you can see, there's, you know, those points pretty nicely hug that straight line. Okay, now, let's look at the graph on the right. It's actually the same data, it's just rescaled. It's the same data, but it's rescaled because an outlier has been added to the data. Okay, so, remember that an outlier is something that's different, so it sits far out, as you can see. Now, the new correlation coefficient calculated is .996. Okay, so, with the addition of that outlier, the R value went from a .987 to a new correlation coefficient of that data of .996. Okay, so, really not a large change in that correlation coefficient. It's actually a small… a small change. So, that outlier had little effect on that fairly highly correlated data that was shown in the graph on the left. Okay, so, little influence there.

Okay, but, let's look at a different case where, looking here on the left, where we've got some XY data, but the correlation of this data is not as high. It's only, say, .617. Okay, when we then add an outlier to this data, so this data here on the right is the same data set rescaled because it has an outlier added to it, okay, look what happens to the correlation coefficient in this case. Okay, in this case, we started at a .617 and, with the addition of just that one outlier, our correlation coefficient changed to .870. Okay, so, .617 to .870. Okay, that's a larger change. That outlier was more influential in this case. Okay, so, as our data… when we started with less highly correlated data, that outlier had more of an impact on our correlation coefficient.

Okay, and then, let's finally look at a third case of this. Okay, so, this scatter plot here shows really rather random scatter of this XY data, and our correlation coefficient aligns with that, right, pretty small correlation coefficient. Essentially very, very close to zero. In fact, here it is. The correlation of that XY… involving that XY data is .001- very close to zero. So, little to no correlation there. Okay? Okay, so, shown here is our data with very small correlation. It's the same data as shown on the prior slide, it's just rescaled because we have added this outlier right here. Okay, so, when we actually add that outlier, the correlation changes from .001, you know, very close to zero, up to .04. Okay, now, that doesn't perhaps look like a large change, I think because those numbers are small to begin with. But, that R actually increased by a factor of 40, so this outlier was very influential on the correlation coefficient. Okay, so, as we've seen with this progression of graphs, the lower the correlation between your XY data, the more influential an outlier will be on the value of your correlation coefficient, R.
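A small experiment along these lines can be sketched as follows. The data sets and the outlier at (30, 30) are invented for illustration and do not reproduce the values on the slides (.987, .617, .001); the sketch also compares the absolute change in R, whereas the third case above is described in terms of a relative change. `pearson_r` is a hypothetical helper that applies the textbook definition.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Assumed highly correlated data: y is x plus a small alternating wobble.
ys_high = [1.2, 1.8, 3.2, 3.8, 5.2, 5.8, 7.2, 7.8, 9.2, 9.8]

# Assumed uncorrelated data: the y values mirror around the middle of
# the x range, so the linear trend cancels out.
ys_low = [3, 7, 4, 6, 5, 5, 6, 4, 7, 3]

# One assumed outlier, far from the rest of the points, added to each set.
outlier_x, outlier_y = 30, 30

r_high_before = pearson_r(xs, ys_high)
r_high_after = pearson_r(xs + [outlier_x], ys_high + [outlier_y])
r_low_before = pearson_r(xs, ys_low)
r_low_after = pearson_r(xs + [outlier_x], ys_low + [outlier_y])

print("high:", round(r_high_before, 3), "->", round(r_high_after, 3))
print("low: ", round(r_low_before, 3), "->", round(r_low_after, 3))
```

With these made-up numbers, the highly correlated set barely moves when the outlier is added, while the uncorrelated set jumps from essentially zero to a large positive R. That matches the pattern in the three cases: the weaker the original correlation, the more one outlier can shift R.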

We wish to thank the National Science Foundation under Grant 233582 for supporting our work. Thank you for watching.