8.3 Statistical Inference

The course officially discusses statistical inference using the multiple regression model as opposed to the simple regression model, so this section should contain everything that is needed. If any preliminary material is desired, there is an appendix to Chapter 7 that discusses statistical inference specifically with respect to the simple (one independent variable) regression model. One might also want to briefly review the chapters on statistical inference from MBA 8370 (i.e., Chapters 5 and 6).

8.3.1 Recalling the Concept of Statistical Inference

Back in MBA 8370, we wanted to get an idea about the parameters of a population (i.e., the population mean \(\mu\) and the population standard deviation \(\sigma\)), but only had information on the statistics of a sample (i.e., the sample mean \(\bar{X}\) and the sample standard deviation \(S\)). We were able to make probabilistic statements (i.e., educated guesses) concerning the population parameters given the sample statistics along the lines of confidence intervals and hypothesis tests.

  • Confidence Intervals allowed us to make general statements concerning the range of values in which the population mean \(\mu\) will reside given the characteristics of the sample \((\bar{X},\;S)\) and a particular probability or level of confidence \(1-\alpha\).

  • Hypothesis Tests allowed us to determine if nonarbitrary statements concerning the value of the population mean \(\mu\) are consistent or inconsistent with the characteristics of the sample \((\bar{X},\;S)\).

The same concept of statistical inference can be applied to regression models. A population regression model contains parameters such as the intercept, slope coefficients, and residual standard error \((\sigma_{YX})\).

\[PRF:\;Y_i=\beta_0+\beta_1X_{1i}+\beta_2X_{2i}+...+\beta_kX_{ki}+\varepsilon_i\]

We would like to know these population parameters, but we cannot because we cannot analyze the population. Since we can only observe a sample, we can estimate a sample regression model containing statistics such as the intercept, slope coefficients, and residual standard error \((S_{YX})\).

\[SRF:\;Y_i=\hat{\beta}_0+\hat{\beta}_1\;X_{1i}+\hat{\beta}_2\;X_{2i}+...+\hat{\beta}_k\;X_{ki}+e_i\]

Our statistical inference will again amount to using our sample characteristics to make probabilistic statements about our population parameters. Statistical inference will take the form of our familiar confidence intervals and hypothesis tests, along with a new tool of statistical inference: forecasting.

8.3.2 Confidence Intervals (around population parameters)

Recall our earlier formula for calculating a confidence interval in a univariate context:

\[Pr\left(\bar{X}-t_{(\frac{\alpha}{2},df=n-1)}\frac{S}{\sqrt{n}} \leq \mu \leq \bar{X}+t_{(\frac{\alpha}{2},df=n-1)}\frac{S}{\sqrt{n}}\right)=1-\alpha\]

We used the Central Limit Theorem (CLT) to ultimately state that \(\bar{X}\) was drawn from a normal distribution with a mean of \(\mu\) and a standard deviation of \(\sigma/\sqrt{n}\) (since we only observe \(S\) rather than \(\sigma\), we work with a t distribution). This line of reasoning is very similar to what we have with regression analyses.
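As a quick refresher, here is a minimal sketch of this univariate interval in R, using a made-up numeric vector rather than data from the course.

# A made-up sample purely for illustration
X <- c(12, 15, 9, 14, 11, 13, 10, 16)
n <- length(X)

# Critical t value for a 95 percent interval
AL <- 0.05
tcrit <- qt(AL/2, df = n - 1, lower.tail = FALSE)

# Xbar +/- tcrit * S / sqrt(n)
mean(X) - tcrit * sd(X)/sqrt(n)
mean(X) + tcrit * sd(X)/sqrt(n)

The regression intervals below follow exactly the same recipe; only the estimate and its standard error change.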

First, \(\hat{\beta}\) is an estimate of \(\beta\) just like \(\bar{X}\) is an estimate of \(\mu\). However, the standard error of the distribution of \(\hat{\beta}\) is derived from the standard deviation of the residuals (the formula below is written for the case of one independent variable).

\[S_{\hat{\beta}}=\frac{S_{YX}}{\sqrt{\sum{(X_i-\bar{X})^2}}}\]

with

\[S_{YX} = \sqrt{ \frac{\sum e_i^2}{n-k-1} }\]

This means that we construct a standardized random variable from a t distribution with \(n-k-1\) degrees of freedom, where \(k\) is the number of independent variables (or slope coefficients) in the regression model.34

\[t=\frac{\hat{\beta}-\beta}{S_{\hat{\beta}}}\]
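To see where these pieces come from, here is a quick sketch (assuming the hprice1 data used throughout this section is already loaded) that fits the one-variable regression of price on sqrft and rebuilds the reported standard error of the slope from the two formulas above.

# One-variable regression so that the simple formula for S_Bhat applies
SIMPLE <- lm(price ~ sqrft, data = hprice1)

e   <- residuals(SIMPLE)
n   <- length(e)
k   <- 1                                    # one independent variable
SYX <- sqrt(sum(e^2)/(n - k - 1))           # standard error of the estimate

X     <- hprice1$sqrft
SBhat <- SYX / sqrt(sum((X - mean(X))^2))   # standard error of the slope

# These two numbers should match
SBhat
summary(SIMPLE)$coef[2,2]

With more than one independent variable, R performs the analogous (messier) calculation for us, which is why we simply read \(S_{\hat{\beta}}\) off the regression summary.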

We have already derived a confidence interval before, so we can skip to the punchline.

\[Pr\left(\hat{\beta}-t_{(\frac{\alpha}{2},df=n-k-1)}S_{\hat{\beta}} \leq \beta \leq \hat{\beta}+t_{(\frac{\alpha}{2},df=n-k-1)}S_{\hat{\beta}}\right)=1-\alpha\]

This is the formula for a confidence interval around the population slope coefficient \(\beta\) given the estimate \(\hat{\beta}\) and the regression characteristics. It can also be written compactly as before.

\[\hat{\beta} \pm t_{(\frac{\alpha}{2},df=n-k-1)} S_{\hat{\beta}}\]

Recall our regression explaining differences in house prices given information on house sizes and number of bedrooms.

REG <- lm(price ~ sqrft + bdrms, data = hprice1)
summary(REG)

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -19.31     31.05        -0.6221   0.5355
sqrft         0.1284     0.01382      9.291     1.394e-14
bdrms         15.2       9.484        1.603     0.1127

Fitting linear model: price ~ sqrft + bdrms

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
88             63.04                 0.6319    0.6233

The information included in the regression summary is all that is needed for us to construct a 95 percent \((\alpha=0.05)\) confidence interval around the population slope coefficient \(\beta_1\). In other words, we can build a range where the population slope between house price and size will reside with 95 percent confidence.

# Back out all of the needed information:
Bhat1 <- summary(REG)$coef[2,1]
SBhat1 <- summary(REG)$coef[2,2]
n <- length(residuals(REG))
k = 2

# Find the critical t-distribution values... same as before
AL <- 0.05
df <- n-k-1
tcrit <- qt(AL/2,df,lower.tail = FALSE)

# Use the formula... same as before
(LEFT <- Bhat1 - tcrit * SBhat1)
## [1] 0.1009495
(RIGHT <- Bhat1 + tcrit * SBhat1)
## [1] 0.1559229

\[Pr(0.101 \leq \beta_1 \leq 0.156)=0.95\]

This states that while an increase in house size of one square foot (holding the number of bedrooms constant) increases the house price by $128 \((\hat{\beta}_1)\) on average in the sample, we can also state with 95% confidence that the same one-square-foot increase raises the average house price in the population by somewhere between $101 and $156.

While the code above showed you how to calculate a confidence interval from scratch as we did before, there is an easier (one-line) way in R that gives you all confidence intervals for a desired level:

confint(REG, level = 0.95)
##                   2.5 %     97.5 %
## (Intercept) -81.0439924 42.4140009
## sqrft         0.1009495  0.1559229
## bdrms        -3.6575816 34.0539635

8.3.3 Hypothesis Tests

We are able to conduct hypothesis tests regarding the values of the population regression coefficients. For example:

\[H_0:\beta_1 = 0 \quad vs. \quad H_1:\beta_1 \neq 0\]

In the context of our house price application, this null hypothesis states that the population slope between house price and size is zero… meaning that there is no relationship between these two variables in the population.

Given the null hypothesis above, we follow the remaining steps laid out previously: we calculate a test statistic under the null, calculate a p-value, and conclude.

The test statistic under the null is given by

\[t=\frac{\hat{\beta}_1 - \beta_1}{S_{\hat{\beta}_1}}\]

and this test statistic is drawn from a t distribution with \(n-k-1\) degrees of freedom. Concluding this test is no more difficult than what we've done previously.

B1 = 0                                      # hypothesized value under the null
(tstat <- (Bhat1 - B1)/SBhat1)              # test statistic under the null
## [1] 9.290506
(Pval <- pt(tstat,df,lower.tail=FALSE)*2)   # two-sided p-value: twice the area beyond the test statistic
## [1] 1.393748e-14
(1-Pval)                                    # implied confidence with which we can reject
## [1] 1

Our results state that we can reject this null hypothesis with approximately 100% confidence, meaning that there is a statistically significant relationship between house prices and house sizes. By statistically significant, we are essentially saying that the population relationship (i.e., slope) is some number other than zero.

As with the confidence interval exercise above, we actually do not need to conduct hypothesis tests where the null sets the population parameter to zero because R does this automatically. If you look again at the columns to the right of the estimated coefficient \(\hat{\beta}_1\) in the regression summary, you will see a t value that is exactly what we calculated above and a p value that is essentially zero. This implies that a test with the null hypothesis set to zero is always done for you.

summary(REG)
## 
## Call:
## lm(formula = price ~ sqrft + bdrms, data = hprice1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.627  -42.876   -7.051   32.589  229.003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -19.31500   31.04662  -0.622    0.536    
## sqrft         0.12844    0.01382   9.291 1.39e-14 ***
## bdrms        15.19819    9.48352   1.603    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.04 on 85 degrees of freedom
## Multiple R-squared:  0.6319, Adjusted R-squared:  0.6233 
## F-statistic: 72.96 on 2 and 85 DF,  p-value: < 2.2e-16

This isn’t to say that all hypothesis tests are automatically done for you.

Suppose a realtor believes that homes sell for $150 per square foot. This is a non-arbitrary statement on a population parameter that delivers the following hypotheses, followed by a test statistic, p-value, and conclusion.

\[H_0:\beta_1=0.150 \quad vs. \quad H_1:\beta_1\neq0.150\]

B1 = 0.150
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] -1.559829
(Pval <- pt(tstat,df,lower.tail = TRUE)*2)
## [1] 0.122516
(1-Pval)
## [1] 0.877484

Our p-value implies that there is roughly a 12 percent chance of being wrong if we reject the null hypothesis. In other words, we can reject the null with at most 88 percent confidence. We therefore do not have evidence that the population slope is different from 0.150 with any traditional level of confidence (e.g., \(\alpha \leq 0.10\)).35
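This conclusion also lines up with the confidence interval computed earlier. As a quick sketch using the confint output from above, we can check that 0.150 falls inside the 95 percent interval for the sqrft slope, which is another way of seeing that we cannot reject this two-sided null at the 5 percent significance level.

# 0.150 lies inside the 95% confidence interval for the slope on sqrft,
# so the two-sided test cannot reject the null at the 5% level
CI <- confint(REG, "sqrft", level = 0.95)
CI
CI[1] <= 0.150 & 0.150 <= CI[2]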

One-sided tests also work as before. We can consider right-tailed tests where the rejection region (and p-value) are in the right tail, as well as left-tailed tests where the rejection region (and p-value) are in the left tail. Let us examine one of each.

Suppose a realtor believes that homes sell for more than $120 per square foot.36 Since we lend statistical support to this claim by rejecting everything else, this delivers the following hypotheses, which give rise to a right-tailed test.

\[H_0:\beta_1\leq0.120 \quad vs. \quad H_1:\beta_1>0.120\]

We calculate a test statistic under the null as always. But since this is a right-tailed test, we calculate the p-value (and draw the conclusion) from the area to the right of the test statistic, regardless of its sign.

B1 = 0.120
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] 0.610238
(Pval <- pt(tstat,df,lower.tail = FALSE))
## [1] 0.2716661
(1-Pval)
## [1] 0.7283339

Our test concludes that we can reject the null with at most 73 percent confidence.

Suppose a different realtor believes that homes sell for less than $130 per square foot.37 Since we lend statistical support to this claim by rejecting everything else, this delivers the following hypotheses, which give rise to a left-tailed test.

\[H_0:\beta_1\geq0.130 \quad vs. \quad H_1:\beta_1<0.130\]

We calculate a test statistic under the null as always. But since this is a left-tailed test, we calculate the p-value (and draw the conclusion) from the area to the left of the test statistic, regardless of its sign.

B1 = 0.130
(tstat <- (Bhat1 - B1)/SBhat1)
## [1] -0.1131176
(Pval <- pt(tstat,df,lower.tail = TRUE))
## [1] 0.455102
(1-Pval)
## [1] 0.544898

Our test concludes that we can reject the null with at most 54 percent confidence.

8.3.4 Confidence Intervals (around forecasts)

A regression can also build confidence intervals around the conditional expectations (i.e., forecasts) of the dependent variable.

Suppose you want to use our model to predict the price of a 1000 square foot house with 3 bedrooms. The conditional expectation is calculated by using our regression coefficients, a value of house size of 1000, a value of bedrooms of 3, and setting our forecast error to zero.

\[\widehat{price}_i=\hat{\beta}_0+\hat{\beta}_1\;1000+\hat{\beta}_2\;3\]

Bhat0 = summary(REG)$coef[1,1]   # estimated intercept
Bhat1 = summary(REG)$coef[2,1]   # estimated slope on sqrft
Bhat2 = summary(REG)$coef[3,1]   # estimated slope on bdrms

(Yhat = Bhat0 + Bhat1 * 1000 + Bhat2 * 3)   # conditional expectation at sqrft = 1000, bdrms = 3
## [1] 154.7158

This calculation suggests that we expect a house with 1000 square feet and 3 bedrooms to sell for approximately 154.716 (thousand) dollars. Another way to calculate this forecast is with the predict command in R. You hand it a new data frame containing only the values of the independent variables you want to predict with, and the rest is done for you.

predict(REG,data.frame(sqrft = 1000, bdrms = 3))
##        1 
## 154.7158

While this is an expected value based on the sample, what we really want to know is what this prediction says about the population. Just like we did with the population coefficients, we are able to build a confidence interval around this forecast. In this case, however, there are two different intervals based on what exactly we would like to discuss.

  • A confidence interval for the mean response

  • A prediction interval for an individual response

The mean response: a confidence interval

Suppose you want to build a confidence interval around the population mean price for a 1000 square foot house with 3 bedrooms. This is a conditional mean, because we want the average house price but only for homes with these characteristics. This conditional mean is generally given by \(\mu_{Y|X}\) and in this case by \(\mu_{Y|X_1=1000,\;X_2=3}\). Building a confidence interval for the mean response is given by

\[ \hat{Y}_{X} \pm t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{h_i}\] or

\[ \hat{Y}_{X} - t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{h_i} \leq \mu_{Y|X} \leq \hat{Y}_{X} + t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{h_i}\]

where

  • \(\hat{Y}_{X}\) is the expectation of the dependent variable conditional on the desired value of \(X\).

  • \(S_{YX}\) is the standard error of the estimate (calculated previously)

  • \(t_{(\frac{\alpha}{2},df=n-k-1)}\) is the critical t statistic for a given value of \(\alpha\) (calculated previously)

  • \(h_i = \frac{1}{n}+\frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\)

This last variable \(h_i\) is what is new to us and increases the size of the confidence interval when the desired values of \(X_i\) are farther away from the average values of the observations \(\bar{X}\). This variable can sometimes be difficult to calculate when dealing with multiple independent variables, but R again does it for you.38 In R, a confidence interval around the population mean is simply called a confidence interval.

predict(REG,
        data.frame(sqrft = 1000,bdrms = 3), 
        interval = "confidence",
        level = 0.95)
##        fit      lwr      upr
## 1 154.7158 127.2862 182.1454

\[Pr(127.29\leq\mu_{Y|X}\leq182.15)=0.95\]

We can now state with 95% confidence that the population mean house price of all 1000 square-foot houses with 3 bedrooms is somewhere between $127,290 and $182,150. Note that the confidence interval around the mean response is centered at our conditional expectation \((\hat{Y})\), just as every confidence interval is centered at its estimate.
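To see the role of \(h_i\) in action, here is a sketch that compares interval widths for a house near the sample averages with one much larger. The specific sizes (2000 and 3500 square feet) are illustrative choices of mine, not values from the chapter.

# Illustrative comparison: 2000 vs 3500 square feet, both with 3 bedrooms
NEAR <- predict(REG, data.frame(sqrft = 2000, bdrms = 3),
                interval = "confidence", level = 0.95)
FAR  <- predict(REG, data.frame(sqrft = 3500, bdrms = 3),
                interval = "confidence", level = 0.95)

# Interval widths (upper minus lower); the width grows as the chosen
# characteristics move farther from the sample averages
NEAR[,"upr"] - NEAR[,"lwr"]
FAR[,"upr"] - FAR[,"lwr"]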

An individual response: a prediction interval

Suppose that instead of building a confidence interval around the conditional average in the population, we want to determine the range within which we are confident to draw a single home value. This calculation is almost identical to the mean response above, but with one slight difference.

\[ \hat{Y}_{X} \pm t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{1+h_i}\] or

\[ \hat{Y}_{X} - t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{1+h_i} \leq Y_{X} \leq \hat{Y}_{X} + t_{(\frac{\alpha}{2},df=n-k-1)}S_{YX} \sqrt{1+h_i}\]

where

  • \(\hat{Y}_{X}\) is the expectation of the dependent variable conditional on the desired value of \(X_i\).

  • \(S_{YX}\) is the standard error of the estimate (calculated previously)

  • \(t_{(\frac{\alpha}{2},df=n-k-1)}\) is the critical t statistic for a given value of \(\alpha\) (calculated previously)

  • \(h_i = \frac{1}{n}+\frac{(X_i - \bar{X})^2}{\sum_{i=1}^n(X_i - \bar{X})^2}\)

The only difference is that we replace \(\sqrt{h_i}\) with \(\sqrt{1+h_i}\). Conceptually, the added one is there because we are selecting a single home with a specified size and number of bedrooms out of the population, so the interval must account for the spread of individual homes around the conditional mean in addition to our uncertainty about the mean itself. This is very different from building a confidence interval around a population mean, but in R it is simply the change of one word.

predict(REG,
        data.frame(sqrft = 1000, bdrms = 3), 
        interval = "prediction",
        level = 0.95)
##        fit      lwr      upr
## 1 154.7158 26.39973 283.0318

\[Pr(26.40\leq Y_{X} \leq 283.03)=0.95\]

We can now state with 95% confidence that a single draw of a house price from the population of all 1000 square-foot houses with 3 bedrooms will be somewhere between $26,400 and $283,030. Note that the prediction interval is also centered at our conditional expectation \((\hat{Y})\), but now the interval is much wider than in the previous calculation. This should make sense, because when you are selecting a single home you have a positive probability of drawing a very cheap home or a very expensive home. A mean would wash these extreme values out.
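For those who want to see the formulas at work, here is a sketch that rebuilds both intervals by hand. The matrix expression for \(h\) below is my generalization of the one-variable formula in the text to the case of multiple regressors; the chapter itself lets R handle this step.

# Leverage of the new point (generalizes h_i to multiple regressors)
Xmat <- model.matrix(REG)                    # columns: intercept, sqrft, bdrms
xnew <- c(1, 1000, 3)                        # the house we are predicting
h    <- as.numeric(t(xnew) %*% solve(crossprod(Xmat)) %*% xnew)

SYX   <- summary(REG)$sigma                  # standard error of the estimate
tcrit <- qt(0.05/2, df = df.residual(REG), lower.tail = FALSE)

# Yhat was computed earlier in this section
# Confidence interval for the mean response: Yhat +/- tcrit * SYX * sqrt(h)
Yhat + c(-1, 1) * tcrit * SYX * sqrt(h)

# Prediction interval for an individual response: sqrt(1 + h) instead
Yhat + c(-1, 1) * tcrit * SYX * sqrt(1 + h)

Both lines should reproduce the lwr and upr values that predict reported above.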


  1. Note that the appendix in Chapter 7 states that the t distribution in the simple regression case has \(n-2\) degrees of freedom. This is because there is only one independent variable in a simple regression, so \(k=1\) and \(n-k-1 = n - 2\).↩︎

  2. Note that since the test statistic under the null was a negative value, the p-value is twice the area to the left of that number (equivalently, twice the area to the right of its absolute value). This is precisely why lower.tail = TRUE is the code.↩︎

  3. Note that while the estimate from our sample is greater than 0.120, the statement we are testing is regarding what is going on in the population.↩︎

  4. Note that while the estimate from our sample is less than 0.130, the statement we are testing is regarding what is going on in the population.↩︎

  5. Note that this equation is provided for only one independent variable. It becomes even more messy in a multivariate setting. However, the important concept is that this value gets larger when we consider values of X that are farther away from the average values in the sample.↩︎