Foundations of Quantitative Research in Political Science
So you have just learned that a regression equation is just like a simple linear function, and now we're going to learn where these numbers come from. In our regression equation, which is analogous to a simple linear function, the regression calculation is telling us that percent Trump is 23 plus one third times percent white, and this is what this line right here is telling us. If percent white is 0, percent Trump is going to be approximately 23; if percent white is 40, percent Trump will be about 36. And you can keep plugging different numbers into the line: an x of 80 gives you about 50 right here, and so on and so forth. But how did we get to these values, this 23 and this one third? Before we get there, let us just highlight the components of the regression equation. We have that y is the dependent variable, in our case percent Trump. Beta 0 is the constant or the intercept, in our case 23. Beta 1 is the slope or the coefficient, in our case one third. And x is the independent variable, in our case percent white. So this is how our regression equation maps onto the standard regression equation. As you will see in other regression equations, the value that in this specific equation is 23, which is whatever the dependent variable would be if the independent variable were 0, is always going to be called the intercept or the constant. And the number that is multiplying the independent variable is always going to be called beta 1, or the slope. The way you interpret the slope is that for each increase of one in the independent variable, we should expect the dependent variable to vary by whatever your slope is. So in California counties, for each increase of one percentage point in the white population, we should expect the Trump vote to vary by 0.34 percentage points in the 2016 election. That 0.34 is the slope of the regression.
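To see the arithmetic behind those predictions, here is a minimal Python sketch of the lecture's equation. The coefficients 23 and one third come from the example on screen; the function name is just for illustration.

```python
# Prediction from the lecture's regression equation:
# percent Trump = 23 + (1/3) * percent white
def predict_trump(percent_white):
    """Predicted Trump vote share for a given percent-white value."""
    return 23 + (1 / 3) * percent_white

print(predict_trump(0))   # the intercept: 23.0
print(predict_trump(40))  # about 36.3
print(predict_trump(80))  # about 49.7
```

Plugging in any x just walks you along the line, exactly as in the plot.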
The slope for this equation, which until now I was rounding to a third, is 0.34. I'll switch between 0.34 and one third, because these are very close numbers. So we know that in our regression equation the slope is 0.34, and the interpretation we give it is that for every increase of one in the independent variable, the white population, you get a change of 0.34, the slope, in the dependent variable, which in our case is the Trump vote. Now for the intercept. To calculate the intercept, or constant, beta 0, you simply take the mean of y and subtract the mean of x times the slope. So that's how you get beta 0: the mean of the dependent variable minus the slope times the mean of the independent variable. The way we interpret it is that if the independent variable is 0, we should expect the dependent variable to be beta 0, the intercept. And this is what we have in our equation: we calculated beta 0 to be 23. So this means that in California counties, if the percent white population of a county is 0, we should expect the percent vote for Trump in 2016 to be 23. If no person in the county is white, we should expect Trump to get 23% of the vote. Now, there is an important property of the way these values are calculated. The formulas that we just saw to calculate the slope and the intercept guarantee that the spread of the dots around the regression line is as small as possible. It's the line of best fit, the line that best fits all of these dots that you see on the screen. To be more precise, notice that I have now drawn these vertical yellow lines. These are the distances from each dot to the regression line, and these distances are also what we call the error, or the residual. So notice that the regression equation is making a prediction, right?
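The intercept formula above can be sketched in a few lines of Python. The data here is made up for illustration, not the actual California county data, and the slope is computed with the standard simple-regression formula (covariance of x and y divided by the variance of x), which the lecture alludes to but does not write out.

```python
# Computing slope and intercept from data (hypothetical numbers,
# not the real California county data).
xs = [10, 30, 50, 70, 90]   # percent white in five made-up counties
ys = [25, 33, 38, 48, 53]   # percent Trump in those counties

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)

# Intercept: mean of y minus the slope times the mean of x,
# exactly the formula from the lecture.
intercept = mean_y - slope * mean_x

print(slope)      # about 0.355 for this toy data
print(intercept)  # about 21.65 for this toy data
```

With this toy data the estimates come out close to the lecture's 0.34 and 23, which is just a coincidence of the numbers chosen.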
It is predicting that in a county in California where 40% of the population is white, the Trump vote should be approximately 36%, right? You can see it here: approximately 36%. However, there is this one county right here that is about 40% white, but where Trump got way less than 36%; Trump got about 10% of the vote. And this difference that we have right here, which is probably about 26 points, is the error, the residual. It is how wrong, how off, the prediction from the regression is compared with what we are actually observing. And you can see that the error is larger for some counties than it is for others. There are a few counties here that sit exactly on the regression line, and in those counties the regression does an excellent job at predicting what percentage of the vote Trump gets, based only on how white the county is. Some are really close, some are really far off. The job of the regression equation is to minimize all of these distances, and the calculation of the slope and intercept ensures that the residual sum of squares is as small as possible. To be more precise, if we were to measure each of these distances, square them, and then sum them up, we would get a value called the residual sum of squares, and the calculation of the slope and the intercept ensures that we are minimizing the residual sum of squares. And because we are minimizing the squares of the residuals, this is the reason why you might find regression estimation called ordinary least squares, or OLS. So whenever you hear someone say that they are using OLS to run a statistical test, this is what they mean.
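The minimization property can be checked numerically: compute the residual sum of squares for the fitted line, then nudge the intercept or the slope and watch the RSS go up. Again, the data is hypothetical, not the real county data.

```python
# Residual sum of squares: square each vertical distance from a dot
# to the line, then add them up. (Hypothetical data.)
xs = [10, 30, 50, 70, 90]
ys = [25, 33, 38, 48, 53]

def rss(intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Fit by the simple-regression formulas.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
     / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# Any nudge to the fitted line can only increase the RSS:
assert rss(b0, b1) < rss(b0 + 1, b1)       # shift the line up
assert rss(b0, b1) < rss(b0, b1 + 0.05)    # tilt the line
```

No other straight line through these dots achieves a smaller residual sum of squares, which is exactly what "least squares" promises.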
You can know that OLS stands for ordinary least squares, and the "least squares" part speaks to this property of the calculation of the regression slope and intercept: it minimizes the squared residuals. Hence, least squares.