8.2 Adjusted \(R^2\)

Regardless of the number of independent variables, the variation in the dependent variable can be decomposed and an \(R^2\) can be calculated.

\[TSS = \sum^{N}_{i=1}(Y_i - \bar{Y})^2\]

\[ESS = \sum^{N}_{i=1}(\hat{Y}_i - \bar{Y})^2\]

\[RSS = \sum^{N}_{i=1}(Y_i - \hat{Y}_i)^2 = \sum^{N}_{i=1}e_i^2\]

\[R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}\]

The \(R^2\) still delivers the proportion of the variation in the dependent variable explained by the model, only now the model contains multiple independent variables.
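
To make the decomposition concrete, here is a minimal sketch that computes TSS, ESS, RSS, and the \(R^2\) by hand for a two-variable house-price regression and checks the result against lm(). It assumes the hprice1 data used throughout this section is already loaded (for example via the wooldridge package).

# minimal sketch: the sum-of-squares decomposition by hand
# (assumes hprice1 is loaded, e.g. data("hprice1", package = "wooldridge"))
fit <- lm(price ~ bdrms + sqrft, data = hprice1)
TSS <- sum((hprice1$price - mean(hprice1$price))^2)
ESS <- sum((fitted(fit) - mean(hprice1$price))^2)
RSS <- sum(residuals(fit)^2)
ESS / TSS                 # R-squared from the explained sum of squares
1 - RSS / TSS             # the same number from the residual sum of squares
summary(fit)$r.squared    # should match both lines above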

An \(R^2\) is a very intuitive calculation, but it can sometimes be misleading.

8.2.1 Abusing an \(R^2\)

No matter how hard I try to downplay the importance of an \(R^2\), students always have the tendency to shoot for that measure to be as close to 1 as possible. The problem with this goal is that an \(R^2\) equal to 1 is not necessarily a good thing.

Consider a previous regression where we explained house prices with only the number of bedrooms.

REG1 <- lm(price ~ bdrms, data = hprice1)
summary(REG1)$r.squared
## [1] 0.2581489

The coefficient of determination states that the number of bedrooms explains slightly over 25 percent of the variation in house prices. If we include the size of the house in the regression,

REG2 <- lm(price ~ bdrms + sqrft, data = hprice1)
summary(REG2)$r.squared
## [1] 0.6319184

we see that the \(R^2\) increases to 0.63 as before. If we include yet another variable such as the size of the property,

REG3 <- lm(price ~ bdrms + sqrft + lotsize, data = hprice1)
summary(REG3)$r.squared
## [1] 0.6723622

we see that the regression now explains over 67 percent of the variation in house prices.

What we are seeing is that the more variables you add, the higher the \(R^2\) gets. While this might lead you to believe that we are adding important independent variables to the regression, the problem is that the \(R^2\) will go up no matter what variable you add. The increase might be slight, but the \(R^2\) will never go down.

Xcrap <- rnorm(88)   # one standard normal draw per observation in hprice1
REG4 <- lm(price ~ bdrms + sqrft + lotsize + Xcrap, data = hprice1)
summary(REG3)$r.squared
## [1] 0.6723622
summary(REG4)$r.squared
## [1] 0.6800811

The exercise above adds a completely random variable as a fourth independent variable. It should have nothing to do with explaining house prices. However, depending on the random draw, the \(R^2\) can still increase, here by almost an entire percentage point. Does this mean that the random variable actually helps explain variation in house prices? Of course not. What it does show is that the \(R^2\) can be abused, so we need an additional measure of goodness of fit.
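
As a sanity check on the claim that the \(R^2\) can never go down, the sketch below (a quick simulation, not part of the original output) refits the model many times, each time with a fresh random regressor, and verifies that every resulting \(R^2\) is at least as large as the three-variable benchmark. The variable name noise and the number of replications are arbitrary.

# quick simulation sketch: the R-squared with a junk regressor never falls
# below the three-variable R-squared, whatever the random draw
r2_base <- summary(REG3)$r.squared
r2_junk <- replicate(500, {
  noise <- rnorm(nrow(hprice1))
  summary(lm(price ~ bdrms + sqrft + lotsize + noise, data = hprice1))$r.squared
})
min(r2_junk) >= r2_base   # TRUE every time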

8.2.2 An Adjusted \(R^2\)

The problem with an \(R^2\) is that it will increase no matter what independent variable you throw into the regression. If you think about it, if a regression with two independent variables explains 63 percent of the variation in the dependent variable, then adding a third variable (no matter how silly) delivers a regression that explains no less than 63 percent of the variation. We therefore cannot use the \(R^2\) to decide whether or not an independent variable belongs in the regression, because we don't know how big an increase in the \(R^2\) needs to be to justify the addition. What we need is a goodness of fit measure that has the potential to increase when the added variable is important, but also the potential to decrease when the variable is unimportant. This is the adjusted \(R^2\).

\[\bar{R}^2 = 1 - \frac{RSS/(N-k-1)}{TSS/(N-1)}\]
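
In R, the adjusted \(R^2\) is reported by summary(), and it can also be computed directly from the formula above. A minimal sketch using REG2 from earlier (the names RSS, TSS, N, and k below are just local helpers):

# sketch: adjusted R-squared by hand for REG2, checked against summary()
RSS <- sum(residuals(REG2)^2)
TSS <- sum((hprice1$price - mean(hprice1$price))^2)
N   <- nrow(hprice1)            # number of observations
k   <- length(coef(REG2)) - 1   # number of independent variables
1 - (RSS / (N - k - 1)) / (TSS / (N - 1))
summary(REG2)$adj.r.squared     # should match the line above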

The main difference between the adjusted \(R^2\) and its unadjusted counterpart is the degrees-of-freedom correction in the numerator. When you add an additional independent variable, \(k\) goes up by one while \(N\) stays constant, so \(N-k-1\) falls. At the same time, the RSS goes down (which is what delivers the increase in the standard \(R^2\)). The numerator \(RSS/(N-k-1)\) is therefore a cost / benefit analysis. If the RSS falls proportionally more than \(N-k-1\), then the \(\bar{R}^2\) increases and the independent variable in question might be somewhat important. If \(N-k-1\) falls proportionally more, then the \(\bar{R}^2\) decreases and the independent variable in question is not important.
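
To see this cost / benefit logic in action, compare the adjusted \(R^2\) of the three-variable regression with the one that adds the random regressor. Unlike the ordinary \(R^2\), the adjusted measure can fall when the added variable is junk; whether it actually falls in any particular run depends on the random draw.

summary(REG3)$adj.r.squared   # three sensible regressors
summary(REG4)$adj.r.squared   # with the junk regressor; may be lower or higher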

Conclusion: for informal use only!

While the \(R^2\) and adjusted \(R^2\) are two common measures of goodness of fit, they are informal at best. One can interpret them along the lines of what we did above, but there will be more formal measures of whether or not an independent variable improves the forecasts of the regression model. Bottom line: these measures can give some insight into the results of a regression model, but they aren't anything worth hanging your final conclusions on.