V36 Simple Linear Regression

Welcome to part one of our video series in support of simple linear regression. In this video, we are going to do an introduction to regression, including the definition of this statistical technique and uses for it. We'll review the equation of a straight line and also introduce the concept of a residual. I'm Renee Clark from the Swanson School of Engineering at the University of Pittsburgh.

Okay, so, what is regression? It's a statistical method or technique that explores the relationship or association between two or more variables. So, what are some examples of relationships that you might be able to explore with regression? What is the relationship between the score received on a statistics exam and the hours spent studying or preparing for that exam? The two variables here are score and hours spent. Or, what's the relationship between one's happiness and one's wealth? Now, regression has what we call a random, or probabilistic, aspect to it, because you can envision how people with the same wealth may not be equally happy.

Okay, so, let's discuss various views on or uses of regression. Regression works with what we're going to call XY data, as shown in this table or spreadsheet here. The first view is that regression expresses a mathematical relationship between variables. In this case, our variables would be X and Y, and it does this by determining an equation or a function between X and Y. Regression also produces a model, or is a type of modeling technique, where a model can be thought of as an abstraction of a real-world process in an attempt to represent reality.

Regression is used to predict or estimate the values of one variable, say y, from the values of another, say x. With regression we can also measure the degree or strength of the relationship between the variables. And finally, regression is a type of data mining technique, where we are able to extract patterns that exist in the data. Okay, so, the goal in linear regression is to fit a linear model to a set of XY data points, as shown in that table there. With linear regression, the model that is produced is actually a straight line. So, picture here in this scatter plot some X versus Y data points. The model that we would be attempting to produce with a linear regression takes the form of a straight line that summarizes those data points.

Okay, the equation of a straight line: do you recall what that is from math classes? It is y = mx + b, where m represents the slope of the line and b represents the y-intercept, or where that line crosses the y-axis. The result of a regression analysis is a summarization of your XY data points in the graph via an equation called the line of best fit, or, another way to say this, it produces the fitted line. And, like I said above, the goal is a best-fit line, that is, a straight line that comes as close as possible to all of your data points.
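
To make the idea of a fitted line concrete, here is a minimal Python sketch. It is not part of the video: the data values are made up (hours studied versus exam score), the names `hours`, `score`, and `predict` are hypothetical, and it assumes that "best fit" means the usual least-squares criterion, which this part of the series does not spell out. NumPy's `polyfit` with degree 1 returns the slope m and the intercept b of y = mx + b.

```python
import numpy as np

# Hypothetical XY data, made up for illustration only:
# x = hours spent studying, y = score on the exam
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 61.0, 68.0, 79.0, 85.0])

# Fit a straight line y = m*x + b by least squares (an assumed criterion).
# For degree 1, polyfit returns the coefficients as [slope, intercept].
m, b = np.polyfit(hours, score, 1)


def predict(x_value):
    """Height of the fitted line (y hat) at a given x."""
    return m * x_value + b


print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print(f"predicted score for 6 hours = {predict(6.0):.1f}")
```

With these made-up numbers the slope comes out at about 8.4 and the intercept at about 43.8, so the fitted line estimates roughly 8.4 additional points per extra hour of study. Different data would, of course, give a different line.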

Okay, related to this is the concept of a residual. In order to discuss residuals, let's start with our plotted point, which we call (x sub i, y sub i). This is a plotted point that may have come out of our XY table, similar to what I showed you on the previous screen. Now, the residual is denoted by e sub i, and that's shown right here in the graph. This is an XY graph or scatter plot, and the residual is the vertical distance between the plotted point and the fitted line. I'm going to just retrace the fitted line here in green. This is the fitted line that we call y hat; I will label it here as the fitted line.

Okay, so, again, that residual is the vertical distance between the plotted point, which is right there, and the fitted line. The residual is what I have shown there in yellow. Mathematically, the residual, e sub i, is y sub i minus y sub i hat. Again, y sub i is the vertical height of your plotted point at this x sub i here. Y sub i hat is the vertical height of your fitted line at that same x sub i. The residual is then simply the difference between the two. The residual represents the error in the fit of the fitted line, y hat, to your plotted point, (x sub i, y sub i). You want your residuals to be small, because you want that fitted line to come as close as possible to all of your plotted points. So, the smaller the residuals, the better the fit of that line to your data, and you want a good fit of the line to the data.
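
The residual calculation described above can be sketched in a few lines of Python. This is an illustration rather than anything shown in the video: the data are made up, the variable names are hypothetical, and the fitted line is obtained with NumPy's least-squares `polyfit`, which is an assumed way of getting the line of best fit. Each residual is the plotted height y sub i minus the fitted height y sub i hat at the same x sub i.

```python
import numpy as np

# Hypothetical XY data, made up for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 61.0, 68.0, 79.0, 85.0])

# Fitted line y hat = m*x + b (least squares is an assumption here)
m, b = np.polyfit(x, y, 1)

# Height of the fitted line at each x_i
y_hat = m * x + b

# Residual e_i = y_i - y_hat_i: positive when the plotted point lies
# above the fitted line, negative when it lies below
residuals = y - y_hat

for x_i, y_i, y_hat_i, e_i in zip(x, y, y_hat, residuals):
    print(f"x_i={x_i:.0f}  y_i={y_i:.1f}  y_hat_i={y_hat_i:.1f}  e_i={e_i:+.1f}")
```

Because a residual carries a sign, "small" here means small in magnitude: a point just below the line has a small negative residual and still counts as a close fit.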

We wish to thank the National Science Foundation under Grant 233582 for supporting our work. Thank you for watching.