5. The R-squared

Foundations of Quantitative Research in Political Science

The R-squared is a quantity that informs us how much variation in the dependent variable is explained by the independent variables.

The R-squared is always between 0 and 1, and it can be read as a percentage. In our example of percent white vs. the Trump vote, an R-squared of 0.23 would indicate that 23% of the variation in the Trump vote across counties is explained by how white a county is. By contrast, an R-squared of 0.62 would indicate that 62% of that variation is explained by how white a county is.

But how can we arrive at a number like this?

This is how we calculate it:

R² = 1 − (RSS / TSS)

Let us look at what’s in the numerator and denominator:

RSS is the Residual Sum of Squares: the sum of the squared vertical distances between each dot and the regression line.

The residual sum of squares reveals how well our model predicts the outcome. If the model makes predictions that miss the mark by a lot, we end up with a large residual sum of squares.

In our Trump example, a large residual sum of squares would mean that for most counties, the regression line is “missing” the actual Trump vote by a lot.

We measure the vertical distance between each dot and the line, square each distance, and sum them all up.
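The steps above can be sketched in a few lines of Python. The data and the fitted coefficients below are made up for illustration; they are not estimated from real county data.

```python
# Hypothetical data: each county's percent white (x) and Trump vote share (y).
x = [40, 55, 70, 85]
y = [35, 48, 62, 73]

# Assume a regression line y_hat = a + b*x has already been fitted
# (these coefficients are invented for the example).
a, b = 1.0, 0.85

# Residual Sum of Squares: square each vertical distance
# between a dot and the line, then sum them all up.
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(rss)  # small here, because the invented line fits the invented dots well
```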

TSS is the Total Sum of Squares: the total variation of the dependent variable (y) around its mean.

We measure the distance between each dot and the mean value of all dots, square them, and sum them all up.

In our Trump example, the Total Sum of Squares is the total variation of the Trump vote around its mean.
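Putting the two sums of squares together gives the R-squared. Again, the data and the fitted coefficients are hypothetical, used only to show the arithmetic:

```python
# Hypothetical counties: percent white (x) and Trump vote share (y),
# with invented regression coefficients a and b.
x = [40, 55, 70, 85]
y = [35, 48, 62, 73]
a, b = 1.0, 0.85

# Total Sum of Squares: squared distances from each y to the mean of y.
y_mean = sum(y) / len(y)
tss = sum((yi - y_mean) ** 2 for yi in y)

# Residual Sum of Squares: squared distances from each y to the line.
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - rss / tss
print(round(r_squared, 3))  # prints 0.997
```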

 

Visually, the R-squared looks like this:

[Figure: r2visualeq.png — the R-squared formula illustrated visually]

 

This should help you understand the logic behind the R-squared. Notice that if the distances between the dots and the regression line are large, the residual sum of squares will be almost as large as the total sum of squares. As a result, the division will yield a number close to 1. When we subtract a number close to 1 from 1, we get an R-squared close to 0, meaning that a low percentage of the variation in the dependent variable is explained by the independent variable.

In the Trump example, if how white a county is did not explain much of the variation in support for Trump, our regression model would make predictions of the Trump vote that miss the mark by a lot, to the point where the Residual Sum of Squares would be almost as large as the total variation of the Trump vote. This would result in an R-squared very close to 0.

[Figure: low r2.png — a scatterplot whose dots sit far from the regression line, giving a low R-squared]

By contrast, if the distances between the dots and the regression line are small, the residual sum of squares will be much smaller than the total sum of squares. As a result, the division will yield a number close to 0. When we subtract a number close to 0 from 1, we get an R-squared close to 1, meaning that a high percentage of the variation in the dependent variable is explained by the independent variable.

In the Trump example, if how white a county is explained a lot of the variation in support for Trump, our regression model would make predictions of the Trump vote that miss the mark by only a little, to the point where the Residual Sum of Squares would be much smaller than the total variation of the Trump vote. This would result in an R-squared closer to 1.

[Figure: high r2.png — a scatterplot whose dots sit close to the regression line, giving a high R-squared]
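The two scenarios can be seen side by side in a short sketch. The function below fits a simple least-squares line (the standard slope and intercept formulas) and returns its R-squared; both datasets are invented to make the contrast obvious.

```python
def r_squared(x, y):
    """Fit a simple OLS line to (x, y) and return its R-squared."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Standard least-squares slope and intercept.
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss

x = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # dots hug a line -> R-squared near 1
noisy = [5.0, 1.0, 9.0, 2.0, 8.0, 3.0]    # dots scattered  -> R-squared near 0
print(r_squared(x, tight), r_squared(x, noisy))
```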
