Foundations of Quantitative Research in Political Science
The R-squared
The R-squared is a quantity that tells us how much of the variation in the dependent variable is explained by the independent variables.
The R-squared is always between 0 and 1, and it can be read as a percentage. In our example of percent white vs. the Trump vote, an R-squared of 0.23 would indicate that 23% of the variation in the Trump vote across counties is explained by how white a county is. By contrast, an R-squared of 0.62 would indicate that 62% of that variation is explained by how white a county is.
But how can we arrive at a number like this?
This is how we calculate it:

$$R^2 = 1 - \frac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}}$$
Let us look at what’s in the numerator and denominator:
The residual sum of squares (the numerator) reveals how well our model predicts the outcome. We measure the distance between each dot and the regression line, square those distances, and sum them all up:

$$\text{Residual Sum of Squares} = \sum_{i} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value for observation $i$ and $\hat{y}_i$ is the value the regression line predicts. If the model makes predictions that miss the mark by a lot, we end up with a large residual sum of squares. In our Trump example, a large residual sum of squares would mean that for most counties, the regression line is "missing" the actual Trump vote by a lot.

The total sum of squares (the denominator) captures the overall variation in the outcome. We measure the distance between each dot and the mean of the dependent variable, square those distances, and sum them all up:

$$\text{Total Sum of Squares} = \sum_{i} (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean of the dependent variable. In our Trump example, the total sum of squares measures the total variation in the Trump vote across counties (divide it by the number of counties minus one and you get the sample variance).
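To see the two sums at work, here is a minimal sketch in Python. The vote shares and predictions below are invented for illustration; they are not real county figures.

```python
# A minimal sketch of the R-squared calculation, assuming we already have,
# for each county, the actual Trump vote share (y) and the share the
# regression line predicts (y_hat). All numbers are made up.
actual = [0.45, 0.62, 0.38, 0.71, 0.55]      # hypothetical Trump vote shares
predicted = [0.48, 0.58, 0.42, 0.66, 0.57]   # hypothetical regression predictions

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: squared distances between each dot and the line
rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))

# Total sum of squares: squared distances between each dot and the mean
tss = sum((y - mean_actual) ** 2 for y in actual)

r_squared = 1 - rss / tss
print(f"R-squared: {r_squared:.2f}")
```

With these made-up numbers the line misses each dot by only a little, so the R-squared comes out high (about 0.90).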
Visually, the R-squared compares two sets of distances on the scatterplot: the distances from the dots to the regression line (the residual sum of squares) and the distances from the dots to the mean (the total sum of squares).
This should help you understand the logic behind the R-squared. Notice that if the distances between the dots and the regression line are large, the residual sum of squares will be almost as large as the total sum of squares. As a result, the division will yield a number close to 1, and subtracting a number close to 1 from 1 gives an R-squared close to 0, meaning that a low percentage of the variation in the dependent variable is explained by the independent variable.
In the Trump example, if how white a county is did not explain much of the variation in support for Trump, our regression model would make predictions of the Trump vote that miss the mark by a lot, to the point where the residual sum of squares would be almost as large as the total sum of squares. This would result in an R-squared very close to 0.
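To make the arithmetic concrete, suppose (with made-up numbers) that the residual sum of squares came out to 950 and the total sum of squares to 1,000:

$$R^2 = 1 - \frac{950}{1000} = 1 - 0.95 = 0.05$$

Only 5% of the variation in the Trump vote would be explained by how white a county is.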
By contrast, if the distances between the dots and the regression line are small, the residual sum of squares will be much smaller than the total sum of squares. As a result, the division will yield a number close to 0, and subtracting a number close to 0 from 1 gives an R-squared close to 1, meaning that a high percentage of the variation in the dependent variable is explained by the independent variable.
In the Trump example, if how white a county is explained a lot of the variation in support for Trump, our regression model would make predictions of the Trump vote that miss the mark by only a little, so the residual sum of squares would be much smaller than the total sum of squares. This would result in an R-squared close to 1.
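The same logic can be checked by simulation. The sketch below assumes numpy is available; the "counties" and "votes" are randomly generated stand-ins, not real election data. It fits a regression line to a noisy relationship and to a tight one, and computes the R-squared for each:

```python
import numpy as np

def r_squared(x, y):
    """Fit a simple regression line to (x, y) and return the R-squared."""
    slope, intercept = np.polyfit(x, y, 1)   # ordinary least squares line
    predicted = slope * x + intercept
    rss = np.sum((y - predicted) ** 2)       # dots vs. regression line
    tss = np.sum((y - y.mean()) ** 2)        # dots vs. mean
    return 1 - rss / tss

rng = np.random.default_rng(42)
pct_white = rng.uniform(0.2, 0.95, size=500)  # hypothetical county demographics

# Weak relationship: the "Trump vote" is mostly noise, so the line misses
# by a lot and the R-squared lands near 0.
noisy_vote = 0.5 + 0.05 * pct_white + rng.normal(0, 0.15, size=500)

# Strong relationship: little noise, so the line misses by only a little
# and the R-squared lands near 1.
tight_vote = 0.2 + 0.5 * pct_white + rng.normal(0, 0.02, size=500)

print(f"Weak relationship:   R-squared = {r_squared(pct_white, noisy_vote):.2f}")
print(f"Strong relationship: R-squared = {r_squared(pct_white, tight_vote):.2f}")
```

The noisy series yields an R-squared near 0 and the tight series an R-squared near 1, matching the reasoning above.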