8.3 Statistical Inference

With respect to statistical inference, confidence intervals and simple hypothesis tests are performed in multiple regression models exactly the same way as in simple regression models. The only difference is that a simple regression model calculates probabilities using \(n-2\) degrees of freedom, while a multiple regression model calculates probabilities using \(n-k-1\) degrees of freedom, where \(k\) is the number of independent variables in the model. Note that the degrees of freedom are consistent across models - it’s just that a simple regression model has \(k=1\) by default.
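To see how little this changes in practice, here is a minimal sketch (using \(n=88\) observations, as in the example below) comparing the 5% two-sided critical t values for a simple regression and a multiple regression with \(k=3\):

```r
# Critical t values at the 5% significance level (two-sided).
# Extra coefficients use up degrees of freedom, so the critical
# value grows slightly as k increases.
n <- 88
qt(0.975, df = n - 2)      # simple regression: k = 1, df = 86
qt(0.975, df = n - 3 - 1)  # multiple regression: k = 3, df = 84
```

Both are close to 1.99, which is why the mechanics of the tests feel identical; only the `df` argument changes.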

8.3.1 Hypothesis Tests

Let us show this by continuing our analysis explaining the price of a house using its house size (sqrft), number of bedrooms (bdrms), and lot size (lotsize).

\[Price_i=\beta_0+\beta_1\;sqrft_i+\beta_2\;bdrms_i+ \beta_3 \; lotsize_i + \varepsilon_i\] Our estimated results are as follows:

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     -21.77        29.48     -0.74       0.46
sqrft             0.12         0.01      9.28       0.00
bdrms            13.85         9.01      1.54       0.13
lotsize           0.00         0.00      3.22       0.00

Fitting linear model: price ~ sqrft + bdrms + lotsize

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
          88                 59.83      0.67              0.66

The first thing we can do by way of statistical inference is test if each population coefficient is statistically different from zero. For example,

\[H_0: \beta_1 = 0 \quad versus \quad H_1:\beta_1 \neq 0\]

We can conduct this test the same way we did before (only minding the degrees of freedom).

B1 = 0
Bhat1 = summary(REG2)$coef[2,1]
SBhat1 = summary(REG2)$coef[2,2]
N = length(hprice1$price)

(tstat <- (Bhat1 - B1)/SBhat1)
## [1] 9.275093
(Pval <- pt(tstat,N-4,lower.tail=FALSE)*2)
## [1] 1.658015e-14
(1-Pval)
## [1] 1

You can see that the p-value of this test is essentially zero, meaning you can reject the null with almost 100 percent confidence. Note that the only real difference between this test and the one we did in the previous chapter is that we have \(N-4=84\) degrees of freedom in our t distribution. This is because we had to estimate an intercept and 3 slope coefficients.

Note that, just as in our simple linear regression case, this test is already done for you in the summary results. You can see that the test statistic under the null and the p-value for this test are already there.^14

As in our simple linear regression case, all hypothesis tests where the null sets a PRF coefficient to some number other than 0 (and all one-sided hypothesis tests) need to be done by hand. The steps are again the same, but we need to mind the degrees of freedom.

Suppose we want to show that an additional square foot of house size (all else equal) will result in less than a $125 increase in house price on average. \[H_0: \beta_1 \geq 0.125 \quad versus \quad H_1:\beta_1 < 0.125\]

This left-tail test is as follows:

B1 = 0.125

(tstat <- (Bhat1 - B1)/SBhat1)
## [1] -0.1678437
(Pval <- pt(tstat,N-4,lower.tail=TRUE))
## [1] 0.4335549
(1-Pval)
## [1] 0.5664451

Note that we DO NOT REJECT the null because we can only reject with about 56% confidence. This means that \(\hat{\beta}_1 < 0.125\) in the sample is not enough to show \(\beta_1 < 0.125\) in the population.

One final item to point out in the results above is that the p-value on the SRF coefficient for the number of bedrooms is about 0.13. This implies that we can reject the hypothesis that this coefficient equals zero with only about 87 percent confidence! You can confirm this by performing the test by hand, but it is essentially telling you that the number of bedrooms is not important for determining house price once we are already accounting for house and lot size.
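As a quick check, here is a sketch of that test by hand using the rounded values from the coefficient table above (the unrounded values stored in `summary(REG2)$coef` would give slightly different numbers):

```r
# Hand test of H0: beta_bdrms = 0 versus H1: beta_bdrms != 0,
# using the rounded estimate and standard error from the table.
Bhat2  <- 13.85                # estimated bdrms coefficient
SBhat2 <- 9.01                 # its standard error
tstat  <- (Bhat2 - 0) / SBhat2 # about 1.54, as in the table
Pval   <- 2 * pt(tstat, df = 88 - 4, lower.tail = FALSE)
Pval                           # about 0.13
```

A p-value of 0.13 means the highest confidence level at which we could reject the null is about 87 percent - short of the conventional 95 percent threshold.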

8.3.2 Confidence Intervals (around population parameters)

Fortunately for us, our confidence interval trick in R works for multiple regression models as well, so we won’t spend any more time on it. Note that these results confirm our earlier result that the PRF coefficient with respect to the number of bedrooms has a confidence interval that contains zero.

confint(REG2)
##                     2.5 %       97.5 %
## (Intercept) -80.384661400 36.844045104
## sqrft         0.096454149  0.149102222
## bdrms        -4.065140551 31.770184040
## lotsize       0.000790769  0.003344644
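Under the hood, `confint` is just applying the familiar formula \(\hat{\beta} \pm t_{crit} \, S_{\hat{\beta}}\) with \(n-k-1\) degrees of freedom. A sketch for the sqrft coefficient, using its (approximate) estimate and standard error from the regression:

```r
# 95% CI for the sqrft coefficient by hand. The inputs are rounded,
# so the bounds differ slightly from the exact confint() output.
Bhat1  <- 0.1228               # sqrft coefficient estimate
SBhat1 <- 0.0132               # its standard error
tcrit  <- qt(0.975, df = 88 - 4)
c(Bhat1 - tcrit * SBhat1, Bhat1 + tcrit * SBhat1)
```

This reproduces the sqrft row of the `confint` output (roughly 0.096 to 0.149) up to rounding.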

8.3.3 Confidence Intervals (around forecasts)

We can again easily extend the R commands learned for simple regression models to the case where we want to make forecasts of the dependent variable conditional on specific values of the independent variables. Only now there is more than one independent variable on which to build a conditional forecast.

Let us remove the insignificant independent variable from our house price regression to make things cleaner:

REG3 <- lm(price ~ sqrft + lotsize, data = hprice1)
summary(REG3)
## 
## Call:
## lm(formula = price ~ sqrft + lotsize, data = hprice1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -109.995  -36.210   -5.553   27.848  207.081 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.932e+00  2.351e+01   0.252  0.80141    
## sqrft       1.334e-01  1.140e-02  11.702  < 2e-16 ***
## lotsize     2.113e-03  6.466e-04   3.269  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.31 on 85 degrees of freedom
## Multiple R-squared:  0.6631, Adjusted R-squared:  0.6552 
## F-statistic: 83.67 on 2 and 85 DF,  p-value: < 2.2e-16

Recall that an expected value of Y conditional on values of X is obtained by plugging the relevant numbers into the SRF. Suppose for example that you want to know the expected price of a home with 2000 sqrft of house size and 5000 sqrft of lot size:

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \; 2000 + \hat{\beta}_2 \; 5000\]

In order to do this in R, we need to construct a data set with \(sqrft = 2000\) and \(lotsize = 5000\) and then use the predict command:

predict(REG3,
        data.frame(sqrft = 2000, lotsize = 5000))
##        1 
## 283.2239

This result tells us that the expected price of a house with 2000 sqrft of house size and 5000 sqrft of lot size is $283,224.
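You can verify this by plugging the (rounded) coefficients from the `summary(REG3)` output into the SRF by hand; the small discrepancy from `predict` is just rounding:

```r
# Conditional forecast by hand, using the rounded REG3 coefficients.
b0 <- 5.932      # intercept
b1 <- 0.1334     # sqrft coefficient
b2 <- 0.002113   # lotsize coefficient
b0 + b1 * 2000 + b2 * 5000   # close to predict()'s 283.2239
```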

We can also construct mean and individual responses just like before. The formulas get a bit more complex, but the concept and the R code stays the same! You can refer to the discussion in the previous chapter on how to interpret these confidence intervals.

predict(REG3,
        data.frame(sqrft = 2000, lotsize = 5000),
        interval = "confidence",
        level = 0.95)
##        fit      lwr      upr
## 1 283.2239 269.4538 296.9941
predict(REG3,
        data.frame(sqrft = 2000, lotsize = 5000),
        interval = "prediction",
        level = 0.95)
##        fit      lwr      upr
## 1 283.2239 162.5204 403.9275

14. If you are noticing that the p-values slightly differ at the 15th decimal place, it is due to rounding error and shouldn’t be taken too seriously.