7.6 Statistical Inference
Once the assumptions of the regression model have been verified, we can perform statistical inference. With a regression model we can not only calculate confidence intervals and conduct hypothesis tests on the population coefficients, but also perform statistical inference on the model's forecasts.
7.6.1 Confidence Intervals (around population parameters)
Recall our earlier formula for calculating a confidence interval in a single-variable context:
\[Pr\left(\bar{X}-t_{(\frac{\alpha}{2},df=n-1)}\frac{S}{\sqrt{n}} \leq \mu \leq \bar{X}+t_{(\frac{\alpha}{2},df=n-1)}\frac{S}{\sqrt{n}}\right)=1-\alpha\]
We used the CLT to ultimately state that \(\bar{X}\) was drawn from a normal distribution with a mean of \(\mu\) and standard deviation \(\sigma/\sqrt{n}\) (but we only have \(S\) which makes this a t distribution). This line of reasoning is very similar to what we have with regression analyses.
First, \(\hat{\beta}\) is an estimate of \(\beta\) just like \(\bar{X}\) is an estimate of \(\mu\). However, the standard error of the sampling distribution of \(\hat{\beta}\) is derived from the standard deviation of the residuals.
\[S_\beta=\frac{S_{YX}}{\sqrt{\sum{(X_i-\bar{X})^2}}}\]
This means that we construct a standardized random variable from a t distribution with \(n-2\) degrees of freedom.
\[t=\frac{\hat{\beta}-\beta}{S_\beta}\]
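Here \(S_{YX}\) is the standard error of the estimate (the residual standard error) calculated previously:

\[S_{YX}=\sqrt{\frac{\sum_{i=1}^{n}(Y_i-\hat{Y}_i)^2}{n-2}}\]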
We have already derived a confidence interval before, so we can skip to the punchline.
\[Pr\left(\hat{\beta}-t_{(\frac{\alpha}{2},df=n-2)}S_\beta \leq \beta \leq \hat{\beta}+t_{(\frac{\alpha}{2},df=n-2)}S_\beta\right)=1-\alpha\]
This is the formula for a confidence interval around the population slope coefficient \(\beta\) given the estimate \(\hat{\beta}\) and the regression characteristics. It can also be written compactly as before.
\[\hat{\beta} \pm t_{(\frac{\alpha}{2},df=n-2)} S_\beta\]
Recall our regression explaining differences in house prices given information on house sizes.
pander(summary(REG3))
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 11.2 | 24.74 | 0.45 | 0.65 |
| sqrft | 0.14 | 0.01 | 11.87 | 0 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 88 | 63.62 | 0.62 | 0.62 |
The information included in the regression summary is all that is needed for us to construct a 95 percent \((\alpha=0.05)\) confidence interval around the population slope coefficient \(\beta_1\).
# Back out all of the needed information:
Bhat1 <- summary(REG3)$coef[2,1]
SBhat1 <- summary(REG3)$coef[2,2]
N <- length(residuals(REG3))

# Find the critical t-distribution values... same as before
AL <- 0.05
df <- N-2
tcrit <- qt(AL/2,df,lower.tail = FALSE)
# Use the formula... same as before
(LEFT <- Bhat1 - tcrit * SBhat1)
## [1] 0.1167203
(RIGHT <- Bhat1 + tcrit * SBhat1)
## [1] 0.1637017
\[Pr(0.1167 \leq \beta_1 \leq 0.1637)=0.95\]
This states that while an increase in house size of one square foot increases the house price by about $140 \((\hat{\beta}_1 = 0.14)\) on average in the sample, an increase of one square foot increases the house price in the population by somewhere between $116.70 and $163.70, with 95% confidence.
While the code above showed you how to calculate a confidence interval from scratch as we did before, there is an easier (one-line) way in R:
confint(REG3)
## 2.5 % 97.5 %
## (Intercept) -37.9825309 60.3908210
## sqrft 0.1167203 0.1637017
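As a check, the same interval follows from hand arithmetic with the summary values: with \(t_{(0.025,\,df=86)}\approx 1.988\),

\[0.14021 \pm 1.988 \times 0.01182 = 0.14021 \pm 0.0235 \;\Rightarrow\; (0.1167,\ 0.1637)\]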
7.6.2 Hypothesis Tests
We are able to conduct hypothesis tests regarding the values of the population regression coefficients. For example:
\[H_0:\beta_1 = 0 \quad vs. \quad H_1:\beta_1 \neq 0\]
In the context of our house price application, this null hypothesis states that the population slope between house price and size is zero… meaning that there is no relationship between the two variables.
Given the null hypothesis above, we follow the remaining steps laid out previously: we calculate a test statistic under the null, calculate a p-value, and conclude.
The test statistic under the null is given by
\[t=\frac{\hat{\beta}_1 - \beta_1}{S_{\beta_1}}\]
and this test statistic is drawn from a t distribution with \(n-2\) degrees of freedom. Concluding this test is no more difficult than what we’ve done previously.
B1 = 0
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] 11.86555
(Pval <- pt(tstat,N-2,lower.tail=FALSE)*2)
## [1] 8.423405e-20
(1-Pval)
## [1] 1
Our results state that we can reject this null hypothesis with approximately 100% confidence, meaning that there is a statistically significant relationship between house prices and house sizes.
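The test statistic can be reproduced by hand from the regression summary:

\[t=\frac{0.14021-0}{0.01182}\approx 11.86\]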
As with the confidence interval exercise above, we do not actually need to conduct hypothesis tests where the null sets the population parameter to zero, because R does this automatically. If you look again at the columns to the right of the estimated coefficient \(\hat{\beta}_1\), you will see a t value that is exactly what we calculated above and a p-value that is essentially zero. A test with the null hypothesis set to zero is always done for you.
summary(REG3)
##
## Call:
## lm(formula = price ~ sqrft, data = hprice1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.112 -36.348 -6.503 31.701 235.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.20415 24.74261 0.453 0.652
## sqrft 0.14021 0.01182 11.866 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.62 on 86 degrees of freedom
## Multiple R-squared: 0.6208, Adjusted R-squared: 0.6164
## F-statistic: 140.8 on 1 and 86 DF, p-value: < 2.2e-16
This isn’t to say that all hypothesis tests are automatically done for you. Suppose a realtor believes that homes sell for $150 per square foot. This delivers the following hypotheses, followed by a test statistic, p-value, and conclusion.
\[H_0:\beta_1=0.150 \quad vs. \quad H_1:\beta_1\neq0.150\]
B1 = 0.150
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] -0.8284098
(Pval <- pt(tstat,N-2)*2)
## [1] 0.4097316
(1-Pval)
## [1] 0.5902684
Our p-value of 0.41 implies that, if the null hypothesis were true, we would observe a test statistic at least this extreme 41% of the time. We therefore do not have evidence that the population slope is different from 0.150… so we do not reject the null.
One-sided tests also work as before. Suppose a realtor claims that homes sell for more than $160 per square foot. Treating this claim as the null hypothesis delivers the following, followed by a test statistic, p-value, and conclusion.
\[H_0:\beta_1\geq0.160 \quad vs. \quad H_1:\beta_1<0.160\]
B1 = 0.160
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] -1.674674
(Pval <- pt(tstat,N-2))
## [1] 0.04881561
(1-Pval)
## [1] 0.9511844
Our p-value of 0.049 implies that we can reject the null with 95.11% confidence: the evidence suggests that homes in the population sell for less than $160 per square foot.
7.6.3 Confidence Intervals (around forecasts)
A regression can also build confidence intervals around the conditional expectations (i.e., forecasts) of the dependent variable.
Suppose you want to use our model to predict the price of a 1000 square foot house. The conditional expectation is calculated by using our regression coefficients, a value of house size of 1000, and setting our forecast error to zero.
X = 1000
Bhat0 = summary(REG3)$coef[1,1]
Bhat1 = summary(REG3)$coef[2,1]

(Yhat = Bhat0 + Bhat1 * X)
## [1] 151.4151
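In equation form, this is just the fitted line evaluated at \(X=1000\):

\[\hat{Y} = 11.20415 + 0.14021 \times 1000 \approx 151.4\]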
Another way to calculate this forecast is to use the predict command in R. You supply a new data frame containing only the value of the independent variable you want to predict with; the rest is done for you.
predict(REG3,data.frame(sqrft = 1000))
## 1
## 151.4151
Our model predicts that a 1,000 square foot house will sell for $151,415 on average. While this is an expected value based on the sample, what is the prediction in the population? We are able to build a confidence interval around this forecast in a number of ways.
- A confidence interval for the mean response
- A prediction interval for an individual response
The mean response: a confidence interval
Suppose you want to build a confidence interval around the mean price for a 1000 square foot house in the population. This is a conditional mean. In other words, we want the average house price but only for homes with a particular size. This conditional mean is generally given by \(\mu_{Y|X=X_i}\) and in this case by \(\mu_{Y|X=1000}\). The confidence interval for the mean response is given by
\[ \hat{Y}_{X=X_i} \pm t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{h_i}\] or
\[ \hat{Y}_{X=X_i} - t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{h_i} \leq \mu_{Y|X=X_i} \leq \hat{Y}_{X=X_i} + t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{h_i}\]
where
- \(\hat{Y}_{X=X_i}\) is the expectation of the dependent variable conditional on the desired value of \(X_i\).
- \(S_{YX}\) is the standard error of the estimate (calculated previously).
- \(t_{(\frac{\alpha}{2},df=n-2)}\) is the critical t statistic (calculated previously).
- \(h_i = \frac{1}{n}+\frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\)
This last variable \(h_i\) is what is new to us and increases the size of the confidence interval when the desired value of \(X_i\) is farther away from the average value of the observations \(\bar{X}\). This variable can sometimes be difficult to calculate, but R again does it for you. In R, a confidence interval around the population mean is simply called a confidence interval.
predict(REG3,
data.frame(sqrft = 1000),
interval = "confidence",
level = 0.95)
## fit lwr upr
## 1 151.4151 124.0513 178.7789
\[Pr(124.05\leq\mu_{Y|X=1000}\leq178.78)=0.95\]
We can now state with 95% confidence that the population mean house price of all 1000 square-foot houses is somewhere between $124,050 and $178,780. Note that the confidence interval around the mean response is centered at our conditional expectation \((\hat{Y})\), just as every confidence interval is centered at its estimate.
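Although R computes \(h_i\) for you, it can be backed out of this output. The half-width of the interval is \(178.78 - 151.42 \approx 27.36\), so

\[\sqrt{h_i}=\frac{27.36}{t_{(0.025,\,86)}\,S_{YX}}\approx\frac{27.36}{1.988\times63.62}\approx0.216, \qquad h_i\approx0.047\]

The fact that \(h_i\) exceeds its minimum value of \(1/n\approx0.011\) reflects that 1000 square feet is away from the average house size in the sample.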
An individual response: a prediction interval
Suppose that instead of building a confidence interval around the conditional average in the population, we want to determine the range within which we are confident to draw a single home value. This calculation is almost identical to the mean response above, but with one slight difference.
\[ \hat{Y}_{X=X_i} \pm t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{1+h_i}\] or
\[ \hat{Y}_{X=X_i} - t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{1+h_i} \leq Y_{X=X_i} \leq \hat{Y}_{X=X_i} + t_{(\frac{\alpha}{2},df=n-2)}S_{YX} \sqrt{1+h_i}\]
where
- \(\hat{Y}_{X=X_i}\) is the expectation of the dependent variable conditional on the desired value of \(X_i\).
- \(S_{YX}\) is the standard error of the estimate (calculated previously).
- \(t_{(\frac{\alpha}{2},df=n-2)}\) is the critical t statistic (calculated previously).
- \(h_i = \frac{1}{n}+\frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\)
The only difference is that we replace \(\sqrt{h_i}\) with \(\sqrt{1+h_i}\). Conceptually, the added one appears because a single home deviates from the conditional mean with the full residual variance, on top of our uncertainty about the mean itself. This is very different from building a confidence interval around a population mean, but in R it is simply the change of one word.
predict(REG3,
data.frame(sqrft = 1000),
interval = "prediction",
level = 0.95)
## fit lwr upr
## 1 151.4151 22.02204 280.8082
\[Pr(22.02\leq Y_{X=1000} \leq 280.81)=0.95\]
We can now state with 95% confidence that a single draw of a house price from the population of all 1000 square-foot houses will be somewhere between $22,020 and $280,810. Note that the prediction interval is also centered at our conditional expectation \((\hat{Y})\), but now the interval is much wider than in the previous calculation. This should make sense, because when you are selecting a single home then you have a positive probability of selecting either very cheap homes or very expensive homes. A mean would wash these extreme values out.
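The widths of the two intervals line up with the interval formulas. As a quick numeric check, here is a sketch in Python (the chapter's code is in R; this is only arithmetic), using the regression's residual standard error, an approximate critical t value, and a leverage value \(h_i \approx 0.047\) backed out of the confidence-interval output above:

```python
import math

# Values taken from the regression output above; tcrit is the
# approximate two-tailed 5% critical value with 86 df, and h_i is
# an approximate leverage backed out of the earlier interval.
S_YX = 63.62    # residual standard error
tcrit = 1.988   # roughly qt(0.975, df = 86)
h_i = 0.047     # approximate leverage at sqrft = 1000

# Half-width of the confidence interval for the mean response
ci_half = tcrit * S_YX * math.sqrt(h_i)

# Half-width of the prediction interval for an individual response
pi_half = tcrit * S_YX * math.sqrt(1 + h_i)

print(round(ci_half, 1), round(pi_half, 1))
```

The half-widths (about 27.4 and 129.4) match the R intervals up to rounding, and the prediction interval is wider by the factor \(\sqrt{(1+h_i)/h_i}\approx4.7\).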