V37
Simple Linear Regression
Welcome to part two of our video series in support of simple linear
regression. In
this video, we are going to discuss correlation and what it is, properties of
a correlation coefficient, and the impact of outliers on the correlation
coefficient. I'm Renee Clark from the Swanson School of Engineering at the
University of Pittsburgh. Okay, correlation is a common
term, but what's often forgotten about correlation, or not known about
correlation, is that it measures the strength of the linear relationship
between two variables. Okay, so, it's not just any relationship that
correlation detects, it's the strength of the linear relationship. Okay, so,
having said this, one should always plot their data
first to assess the reasonableness of a linear relationship between the two
variables. Okay, so, taking a look at the scatter
plot on the right, it's an XY scatter plot, okay, and as you can see, I would
say that a linear relationship between X and Y in this case is definitely
reasonable, right? It's not a perfect linear relationship (those points don't
all lie on that straight line), but it's close. So, it's certainly
reasonable. So, after having plotted that XY data, I would proceed, if I
wanted to, with calculating an actual correlation, or correlation coefficient,
between X and Y. Okay, so, the sample correlation
coefficient that you would calculate is denoted by R. Okay, R is
also known as the Pearson correlation coefficient; you may see that term. Okay,
important properties of the correlation coefficient, R: R lies somewhere
between -1 and 1, inclusive. So, as you can see, a value of R of 0 is also
possible. Okay, if the correlation coefficient is equal to exactly one, then
that means all of your XY plotted points lie exactly
on a straight line having a positive slope, okay, as shown in this graph. Okay,
on the other hand, if your correlation coefficient is -1 exactly, that means
that all of your XY points lie on a straight line
but with a negative slope. Okay, so, that means
that if the absolute value of your correlation coefficient is one, then there
is a perfect linear relationship between your two variables X and Y. Okay,
now, if your correlation coefficient is exactly zero, that means there is no
linear correlation between X and Y, your two variables, and if R is
approximately zero, that likewise means that there's
little linear correlation between your variables. Okay, so, here are some
graphical depictions of correlation. If we look at the graphs on the
top of the screen that show positive correlation, we can see here
is our perfect correlation case of R equal 1 where all those points lie
exactly on that straight line. To the right of that, these points, although
not having a perfect correlation, still have a
high, positive correlation. Okay, and then, looking even further to the right,
the points here, again, have a lower correlation
than the points directly to the left, but still there is a linear
trend going on there, and I would classify that as having a lower
positive correlation than this graph. Okay, the graphs on the bottom show the
same thing, but for the negative correlation case.
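
The properties of R described so far can be checked numerically. The following is a minimal Python sketch, not something shown in the video: the function name `pearson_r` and every data value are made up for illustration. It computes R from the standard sample formula (the sum of the cross-deviations of X and Y divided by the square root of the product of their sums of squared deviations) and applies it to points on a positively sloped line, points on a negatively sloped line, points that only roughly follow a line, and a symmetric curved pattern where R works out to 0.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient R for paired data (hypothetical helper)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)


# Illustrative data only (assumed values, not taken from the slides)
x = [1, 2, 3, 4, 5]

# All points exactly on a line with positive slope
r_perfect_pos = pearson_r(x, [2 * v + 1 for v in x])

# All points exactly on a line with negative slope
r_perfect_neg = pearson_r(x, [10 - 3 * v for v in x])

# Points close to, but not exactly on, a positively sloped line
r_high_pos = pearson_r(x, [2.1, 3.9, 6.2, 7.8, 10.1])

# Points on a symmetric curve (y = x squared): a relationship, but not a linear one
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print("perfect positive:", r_perfect_pos)
print("perfect negative:", r_perfect_neg)
print("high positive:   ", r_high_pos)
print("symmetric curve: ", r_zero)
```

In every case the result stays between -1 and 1, inclusive, and it only reaches -1 or 1 when the points fall exactly on a straight line.
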
Here is a case of perfect negative correlation, where the points all
lie exactly on that negatively sloped line. Here, these points don't have
perfect negative correlation, but it's fairly high. Those
points fairly closely hug that line, and then,
here, over on the left, there is certainly a lower negative correlation compared to
this graph, because the points are scattered a
little bit further out from the line with a negative slope. Okay, now
these cases actually depict zero correlation. Okay, now think about when we
first introduced the term correlation. Correlation refers to the degree of
the linear relationship, right, between two variables. Okay, so, let's look
at the graph on the left. Okay, this XY scatter plot really shows random
scatter of the points and, really, it shows no apparent association at all
between X and Y because those points are just randomly scattered. Okay, in
this case, we’re probably pretty close to zero. Now,
contrast that with the case on the right. Okay, that correlation coefficient
is likely also pretty close to zero, but for a
different reason. Okay, it's not due to the lack of any relationship between
them, because there is a relationship between X and Y, some
sort of a curvilinear relationship, perhaps a quadratic
relationship. So, there is a relationship. It's just that the relationship is
not linear. There's a lack of a linear
relationship. That's why we’re approximately zero here. Okay, so, these
slides show the importance of plotting your data first, right, to understand
what type of a relationship might be reasonable. In the case on the right,
this is a relationship that may be able to be modeled, just not in a
linear way, perhaps with some other statistical techniques, such as other types of
regression besides linear regression. Okay, the next topic I want to
talk about is the relationship, or better yet, the impact that outliers can
have on a correlation coefficient. Okay, really briefly,
you'll remember that an outlier is an unusual point. It's an unusual data
point, okay, meaning a data point that's different from the others. Okay, so,
let's see the impact that outliers can have on a
correlation coefficient. Okay, so, let's look at our graph on the left
here. We have some XY data that's fairly highly
correlated. You'll see that correlation coefficient of about .987. That's a
high correlation. As you can see, those points pretty nicely hug that
straight line. Okay, now, let's
look at the graph on the right. It's actually the
same data, it's just rescaled. It's the same data, but it's rescaled because
an outlier has been added to the data. Okay, so, remember that an outlier is
something that's different, so it sits far out, as you can see. Now, the new
correlation coefficient calculated is .996. Okay, so, with the addition of
that outlier, the R value went from a .987 to a new correlation coefficient
of that data of .996. Okay, so, really not a large
change in that correlation coefficient. It's actually a
small change. So, that outlier had little effect on that fairly highly
correlated data that was shown in the graph
on the left. Okay, so, little influence there. Okay, but,
let's look at a different case where, looking here on the left,
we've got some XY data, but the correlation of this data is not
as high. It's only, say, .617. Okay, now we add an outlier to this data, so
this data here on the right is the same data set
rescaled because it has an outlier added to it. Okay, look what happens to
the correlation coefficient in this case. Okay, in this case, we started at a
.617 and, with the addition of just that one outlier, our correlation
coefficient changed to .870. Okay, so, .617 to .870. Okay, that's a larger
change. That outlier was more influential in this case. Okay, so,
when we started with less highly correlated data, that outlier had more of an
impact on our correlation coefficient. Okay, and then, let's finally
look at a third case of this. Okay, so, this scatter plot here shows really
rather random scatter of this XY data, and
our correlation coefficient aligns with that, right, a pretty small correlation
coefficient, essentially very, very close to zero. In fact, here it is. The
correlation of that XY data is .001, very close to zero. So,
little to no correlation there. Okay? Okay, so,
shown here is our data with very small correlation. It's the same data as
shown on the prior slide, it's just rescaled because
we have added this outlier right here. Okay, so, when we actually
add that outlier, the correlation changes from .001, you know, very
close to zero, up to .04. Okay, now, that doesn't perhaps look like a large
change, I think because those numbers are small to begin with. But, that R actually increased by a factor of 40, so this
outlier was very influential on the correlation coefficient. Okay, so, as
we've seen with this progression of graphs,
the lower the correlation between your XY data, the more influential an
outlier will be on the value of your correlation coefficient, R. We wish to thank the National
Science Foundation under Grant 233582 for supporting our work. Thank you for watching.
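
The main takeaway about outliers, that one added outlier moves R more when the original data is less highly correlated, can be tried out with a short Python sketch. This is only an illustration under stated assumptions: the helper names `pearson_r` and `r_with_outlier`, the two small data sets, and the outlier at (20, 20) are all invented here and are not the data behind the slides, so the R values will not match the .987, .617, or .001 cases from the video.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient R (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_with_outlier(x, y, outlier):
    """R after appending a single (x, y) outlier to the data (hypothetical helper)."""
    out_x, out_y = outlier
    return pearson_r(x + [out_x], y + [out_y])


# Assumed illustrative data: the same X values paired with two different Y sets
x = [1, 2, 3, 4, 5, 6]
y_strong = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]  # points hug a straight line
y_weak = [3, 1, 5, 2, 6, 3]                # points are much more scattered

# One assumed outlier, sitting far out from the rest of the data
outlier = (20, 20)

r_strong_before = pearson_r(x, y_strong)
r_strong_after = r_with_outlier(x, y_strong, outlier)
r_weak_before = pearson_r(x, y_weak)
r_weak_after = r_with_outlier(x, y_weak, outlier)

# Size of the change in R caused by adding the one outlier
change_strong = abs(r_strong_after - r_strong_before)
change_weak = abs(r_weak_after - r_weak_before)

print("highly correlated data:", r_strong_before, "->", r_strong_after)
print("weakly correlated data:", r_weak_before, "->", r_weak_after)
print("change in R:", change_strong, "vs", change_weak)
```

With these made-up numbers, the highly correlated set barely moves while the weakly correlated set changes substantially, which is the same pattern as in the slide examples.
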