7.4 Decomposition of Variance

Using our regression estimates and sample information, we can construct one of the most popular (and most abused) measures of goodness of fit for a regression. We will construct this measure in pieces.

First, the total sum of squares (or TSS) can be calculated to measure the total variation in the dependent variable:

\[TSS = \sum^{N}_{i=1}(Y_i - \bar{Y})^2\]

This expression is similar to a variance equation (without averaging), and since the movements in the dependent variable are ultimately what we are after, this measure delivers the total variation in the dependent variable that we would like our model to explain.
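
To make this concrete, the TSS can be computed directly in R. This is a minimal sketch that assumes the hprice1 house price data used in the application below come from the wooldridge package (the package is an assumption; the text does not say where the data are loaded from):

library(wooldridge)   # assumed source of the hprice1 house price data
data("hprice1")

# Total sum of squares: squared deviations of price from its sample mean
TSS <- sum((hprice1$price - mean(hprice1$price))^2)
TSS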

Next, we can use our regression estimates to calculate an estimated sum of squares (or ESS) which measures the total variation in the dependent variable that our model actually explained:

\[ESS = \sum^{N}_{i=1}(\hat{Y}_i - \bar{Y})^2\]

Note that this measure uses our conditional forecasts from our regression model in place of the actual observations of the dependent variable.

\[\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\]
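
Since the fitted values \(\hat{Y}_i\) are stored in the estimated model object in R, the ESS can also be computed directly. A minimal sketch, again assuming the hprice1 data come from the wooldridge package, using the price-on-sqrft regression (REG3) estimated below:

library(wooldridge)   # assumed source of the hprice1 data
REG3  <- lm(price ~ sqrft, data = hprice1)   # the regression estimated below
Y.hat <- fitted(REG3)                        # conditional forecasts

# Explained sum of squares: squared deviations of the fitted values from the mean of price
ESS <- sum((Y.hat - mean(hprice1$price))^2)
ESS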

Finally, we can use our regression estimates to also calculate a residual sum of squares (or RSS) which measures the total variation in the dependent variable that our model cannot explain:

\[RSS = \sum^{N}_{i=1}(Y_i - \hat{Y}_i)^2 = \sum^{N}_{i=1}e_i^2\]

Note that this is a measure of the variation in the garbage can, and the garbage can is where all of the variation in the dependent variable that your model cannot explain ends up.
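
The residuals are likewise stored in the fitted model object, so the RSS can be computed directly. A minimal sketch under the same assumption about the hprice1 data:

library(wooldridge)   # assumed source of the hprice1 data
REG3 <- lm(price ~ sqrft, data = hprice1)

# Residual sum of squares: the variation left over in the garbage can
RSS <- sum(resid(REG3)^2)
RSS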

7.4.1 The \(R^2\)

Our regression breaks the variation in \(Y_i\) (the TSS) into what can be explained (the ESS) and what cannot be explained (the RSS). This means \(TSS = ESS + RSS\). Furthermore, because the TSS is fixed by the data, our OLS estimates, which minimize the RSS, also maximize the ESS. This delivers our first measure of how well our model explains the movements in the dependent variable, or goodness of fit:

\[R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}\]

This coefficient of determination or \(R^2\) should be an intuitive measure. First, it is bounded between 0 and 1. If the measure is 0 then the model explains NOTHING and all variation is in the garbage can. If the measure is 1 then the model explains EVERYTHING and the garbage can is empty. Any number in between is simply the proportion of the variation in the dependent variable explained by the model.
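
Putting the pieces together, both the decomposition and the \(R^2\) can be verified by hand for the house price regression estimated below. A minimal sketch, again assuming the hprice1 data come from the wooldridge package:

library(wooldridge)   # assumed source of the hprice1 data
REG3 <- lm(price ~ sqrft, data = hprice1)

TSS <- sum((hprice1$price - mean(hprice1$price))^2)   # total variation
ESS <- sum((fitted(REG3) - mean(hprice1$price))^2)    # explained variation
RSS <- sum(resid(REG3)^2)                             # unexplained variation

# The decomposition holds: TSS equals ESS + RSS (up to rounding)
c(TSS, ESS + RSS)

# Both versions of the formula give the same R-squared as summary(REG3)$r.squared
ESS / TSS
1 - RSS / TSS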

REG3 <- lm(price ~ sqrft, data = hprice1)
summary(REG3)$r.squared
## [1] 0.6207967
pander(summary(REG3))
             Estimate   Std. Error   t value   Pr(>|t|)
-----------  ---------  -----------  --------  ---------
(Intercept)  11.2       24.74        0.45      0.65
sqrft        0.14       0.01         11.87     0

Fitting linear model: price ~ sqrft

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
-------------  --------------------  --------  -----------------
88             63.62                 0.62      0.62

Returning to our house price application above, you can see that our coefficient of determination \((R^2)\) is 0.62.[^10] This states that approximately 62 percent of the variation in the prices of homes in our sample is explained by the size of the house (in square feet), while the remaining 38 percent is unexplained by our model and shoved into the garbage can. That is all it says… no more and no less.

7.4.2 What is a good \(R^2\)?

Is explaining 62 percent of the variation in house prices good? The answer depends on what you want the model to explain. We know that the house size explains a majority of the variation in house prices while all other potential independent variables will explain at most the remaining 38 percent. If you want to explain everything there is to know about house prices, then an \(R^2\) of 0.62 leaves something to be desired. If you only care to understand the impact of size, then the \(R^2\) tells you how much of the variation in house prices it explains. There really isn’t much more to it than that.

7.4.3 Standard Error of the Estimate

A related measure of fit is the standard error of the estimate:

\[S_{YX} = \sqrt{\frac{RSS}{N-2}} = \sqrt{\frac{\sum^{N}_{i=1}e_i^2}{N-2}}\]

The standard error of the estimate is much like a standard deviation equation. However, while the standard deviation measures the variability around a mean, the standard error of the estimate measures the variability around the prediction line.

Note that the denominator of this measure is \(N-2\). The reason we divide the sum of squared errors by \(N-2\) rather than \(N\) is that we have lost two degrees of freedom. Recall that we lose a degree of freedom whenever we need to estimate something based on other estimates. When we consider how we calculated the residuals in the first place,

\[ e_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 \; X_i\]

you will see that we had to estimate two line coefficients before we could determine the prediction error. That is why we deduct two degrees of freedom.[^11]
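
The standard error of the estimate can also be computed by hand and compared against the residual standard error reported in the regression output above (63.62). A minimal sketch under the same assumption about the hprice1 data:

library(wooldridge)   # assumed source of the hprice1 data
REG3 <- lm(price ~ sqrft, data = hprice1)

# Standard error of the estimate: the RSS averaged over N - 2 degrees of freedom
RSS  <- sum(resid(REG3)^2)
N    <- nobs(REG3)
S.YX <- sqrt(RSS / (N - 2))
S.YX

# This should match the residual standard error reported by summary()
sigma(REG3)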


[^10]: Note that this number is sometimes called the multiple \(R^2\).

[^11]: Note that this line of reasoning implies that we will lose more degrees of freedom when we estimate models with more independent variables. More on this later.