V39: Multiple Linear Regression

Welcome to part one of our video
series in support of multiple linear regression. In this video, we're going
to do a quick introduction to multiple linear regression and discuss the
condition known as multicollinearity. I'm Renee Clark from the Swanson School
of Engineering at the University of Pittsburgh. Thus far, we have learned about simple linear regression, which involves just one independent variable. We studied simple linear regression because we need a foundation in order to study more involved techniques, and in fact it provides a reasonable model at times for some X and Y data, depending on the variables. However, many scientific problems
or real-world applications we tend to encounter are more complex, meaning that they require more than one independent variable to explain the variation in the dependent variable and make good predictions. In such a case, when you have more than one independent variable, you have what's known as a multiple regression model, and if it's a linear model, it's a multiple linear regression model.
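To make this concrete, here is a minimal sketch of a multiple linear regression with two independent variables, using NumPy's least-squares solver. The data and the coefficients 5, 2, and -3 are hypothetical, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical independent variables
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Hypothetical "true" model: Y = 5 + 2*X1 - 3*X2 + error
y = 5 + 2 * x1 - 3 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: a column of ones for the intercept, then one column per variable
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # estimates land close to [5, 2, -3]
```

With more independent variables, you would simply add more columns to the design matrix, one per variable.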
So: simple regression has one independent variable; multiple regression has more than one. This, of course, is our familiar simple linear regression model, shown right here, with its one independent variable and its dependent variable. A multiple
linear regression model, then, is simply an extension of the simple linear regression model: you extend it by adding terms for additional independent variables, each having its own slope coefficient, or slope parameter. In this case, we have k independent variables: X1, X2, X3, and so on up to Xk, where k is the number of independent variables in our model. We still
have, of course, the error term in a multiple linear regression model, just as we did in the simple model, and under the same assumptions: the errors are distributed normally with a mean of zero and constant variance of sigma squared. Each of the slope parameters is still estimated via the least squares method, thereby obtaining the B's, and the B's are used to come up with the fitted model, whatever the number k of independent variables in your model. Now, when you have more
than one independent variable, there is the possibility of a condition known as multicollinearity, which refers to the presence of linear relationships between your independent variables.
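As a quick sketch of what a linear relationship between two predictors looks like, the following (again with made-up data) constructs an X2 that is nearly a linear function of X1 and computes their correlation coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

x1 = rng.normal(size=n)
# X2 is (nearly) a linear function of X1 -- this is multicollinearity
x2 = 3 * x1 + 7 + rng.normal(scale=0.2, size=n)

# Pearson correlation coefficient r between X1 and X2
r = np.corrcoef(x1, x2)[0, 1]
print(round(r, 3))  # r is close to 1
```

An r this close to 1 between two independent variables is exactly the condition being described.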
For example, if X1 and X2 are linearly related, that indicates multicollinearity; or perhaps it's X2 and X3 that have a linear relationship, again indicating multicollinearity. We investigate multicollinearity through the correlation coefficients, that is, the Pearson correlation coefficients, which we call R. Unfortunately,
multicollinearity is not a desirable condition when you are building a multiple linear regression model. Here it is explained a little more. Let's say we have the following correlation matrix for four different independent variables, call them X1, X2, X3, and X4. Let's look at the one I have boxed right here in the upper left: this is the cell for the correlation between X1 and X2, and, as you can see, the correlation coefficient itself is large at .952, and below that is the P value. Okay,
so, the P value is associated with the following hypothesis test, a hypothesis test on the population correlation coefficient. With that low P value, we are able to reject the null and accept the alternative, which says that in the population, the correlation coefficient between X1 and X2 is not zero; it is significantly different from zero. Okay,
so, this correlation coefficient of .952 between X1 and X2 is not only high but also significantly different from zero in the population. As the cell contents note indicates, R is listed at the top and the P value right below it. And here is the correlation coefficient between X3 and X4: again a large R value, .784, with a P value of less than .05, leading to rejection of the null and acceptance of the alternative.
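Correlations like the ones in this matrix, together with their P values, can be computed pairwise with scipy.stats.pearsonr, which returns the sample r and the P value for the null hypothesis that the population correlation is zero. A minimal sketch, with hypothetical data standing in for the predictors:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 30

# Hypothetical predictors: X2 depends on X1; X3 is generated independently of X1
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)
x3 = rng.normal(size=n)

r12, p12 = pearsonr(x1, x2)  # expect large r, tiny p -> reject the null
r13, p13 = pearsonr(x1, x3)  # p likely > .05 -> cannot reject the null

print(f"X1-X2: r = {r12:.3f}, p = {p12:.4f}")
print(f"X1-X3: r = {r13:.3f}, p = {p13:.4f}")
```

Running each pair through this test is how a correlation matrix like the one on the slide is filled in, cell by cell.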
So, based on this correlation matrix, X1 and X2 are linearly related, and X3 and X4 are linearly related; the matrix points to the existence of multicollinearity in our data. You'll notice, too, that the R value for X1 and X3, .534, is not a particularly small R. However, its P value is greater than .05, so we are not able to reject the null, leading to the conclusion that the correlation between X1 and X3 in the population could be zero. So you may not be able to know for sure whether there is multicollinearity between X1 and X3.

We wish to thank the National Science Foundation, Grant 2335802, for supporting our work. Thank you for watching.