V39 Multiple Linear Regression

Welcome to part one of our video series on multiple linear regression. In this video, we're going to give a quick introduction to multiple linear regression and discuss a condition known as multicollinearity. I'm Renee Clark from the Swanson School of Engineering at the University of Pittsburgh.

Thus far, we have learned about simple linear regression, which involves just one independent variable. We studied simple linear regression because we needed a foundation for more involved techniques, and it does provide a reasonable model at times for some X and Y data, depending on the variables. However, many scientific problems and real-world applications we encounter are more complex, meaning they require more than one independent variable to explain the variation in the dependent variable and make good predictions. In such a case, when you have more than one independent variable, you have what's known as a multiple regression model, and if it's a linear model, it's a multiple linear regression model. So: simple regression, one independent variable; multiple regression, more than one independent variable.

This, of course, is the simple linear regression model we're used to working with, with its one independent variable and one dependent variable. A multiple linear regression model is simply an extension of the simple linear regression model: you extend it by adding terms for additional independent variables, each with its own slope coefficient, or slope parameter. In this case, we have k independent variables, X1, X2, X3, and so on up to Xk, where k is the number of independent variables in the model:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

We still have the error term in a multiple linear regression model, just as in the simple model, and under the same assumptions: the errors are distributed normally with a mean of zero and a constant variance of sigma squared. Each of the slope parameters is still estimated via the least squares method, thereby obtaining your b's, and the b's are used to come up with the fitted model, depending on how many independent variables k you have in your model.
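To illustrate the least squares fit just described, here is a minimal sketch in Python. The data, variable names, and coefficient values are all hypothetical, made up purely to show the mechanics; this is not the example from the video.

```python
import numpy as np

# Hypothetical data: 10 observations, k = 2 independent variables.
rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, size=10)
X2 = rng.uniform(0, 5, size=10)
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(0, 1, size=10)

# Design matrix with a leading column of ones for the intercept term.
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares estimates (the b's): minimize the sum of squared errors.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("b0, b1, b2 =", np.round(b, 3))

# Fitted model: Y_hat = b0 + b1*X1 + b2*X2
Y_hat = X @ b
```

In practice you would typically reach for a library routine such as statsmodels' OLS, which also reports standard errors and p-values, but the lstsq call above shows the least squares mechanics directly.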

Now, when you have more than one independent variable, there is the possibility of a condition known as multicollinearity. Multicollinearity refers to the presence of linear relationships between your independent variables. For example, if X1 and X2 are linearly related, that would indicate multicollinearity; or perhaps it's X2 and X3 that have a linear relationship, again indicating multicollinearity. We investigate multicollinearity through the correlation coefficients, or Pearson correlation coefficients, which we call R.
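To make this concrete, here is a small sketch of checking for multicollinearity by computing the pairwise Pearson correlations among the predictors. The variables are synthetic and purely illustrative: X2 is deliberately constructed as a noisy linear function of X1, so their correlation should come out near 1, while X3 is generated independently.

```python
import numpy as np

# Synthetic predictors, purely illustrative.
rng = np.random.default_rng(1)
X1 = rng.uniform(0, 10, size=30)
X2 = 3.0 * X1 + rng.normal(0, 1, size=30)  # linearly related to X1
X3 = rng.uniform(0, 10, size=30)           # independent of X1 and X2

# Pairwise Pearson correlation matrix of the predictors
# (rowvar=False treats each column as one variable).
R = np.corrcoef(np.column_stack([X1, X2, X3]), rowvar=False)
print(np.round(R, 3))
```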

Unfortunately, multicollinearity is not a desirable condition when you are building a multiple linear regression model. Here it is explained a little more. Let's say we have the following correlation matrix for four independent variables, call them X1, X2, X3, and X4. Look at the cell I have boxed in the upper left, which is the correlation between X1 and X2. The correlation coefficient itself is large, at .952, and below it is the P value. The P value is associated with the hypothesis test on the population correlation coefficient: the null hypothesis that the population correlation is zero against the alternative that it is not zero. With that low P value, we are able to reject the null and accept the alternative, which says that in the population, the correlation coefficient between X1 and X2 is not zero; it is significantly different from zero. So this correlation coefficient of .952 between X1 and X2 is not only high, but significantly different from zero in the population. As the cell contents note indicates, R is listed at the top of each cell and the P value right below it. Next, here is the correlation coefficient between X3 and X4: again a large R value of .784, with a P value of less than .05, leading to rejection of the null and acceptance of the alternative. So, based on this correlation matrix, X1 and X2 are linearly related, and X3 and X4 are linearly related; the matrix points to the existence of multicollinearity in our data. You'll notice, too, that the R value of .534 for X1 and X3 is not particularly small. However, its P value is greater than .05, so we are not able to reject the null, leading to the conclusion that the correlation between X1 and X3 in the population could be zero. In that case, you may not be able to know for sure whether there is multicollinearity between X1 and X3.
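The correlation matrix in the video pairs each R with a P value for that test of whether the population correlation is zero. Here is a sketch of the same calculation on made-up data, using scipy.stats.pearsonr, which returns both R and the two-sided P value; the variables and the .05 cutoff mirror the discussion above.

```python
import numpy as np
from scipy import stats

# Made-up predictors: X2 is linearly related to X1, X3 is not.
rng = np.random.default_rng(2)
X1 = rng.uniform(0, 10, size=30)
X2 = 3.0 * X1 + rng.normal(0, 2, size=30)
X3 = rng.uniform(0, 10, size=30)

# pearsonr returns R and the two-sided P value for the test of
# whether the population correlation coefficient equals zero.
for name, other in [("X2", X2), ("X3", X3)]:
    r, p = stats.pearsonr(X1, other)
    decision = "reject the null" if p < 0.05 else "fail to reject the null"
    print(f"X1 vs {name}: R = {r:.3f}, P = {p:.4f} -> {decision}")
```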

We wish to thank the National Science Foundation under Grant 2335802 for supporting our work. Thank you for watching.