12.3 Applications

Let's consider two applications. The first is not terribly interesting, but it illustrates a joint hypothesis test that is always provided to you free of charge with any set of regression results. The second is more involved and demonstrates the true importance of joint tests.

Application 1: A wage equation

This is the same scenario we considered in the dummy variable section, only without gender as a variable.

Suppose you are a consultant hired by a firm to help determine the underlying features of the current wage structure for their employees. You want to understand why some wage rates are different from others. Let our dependent variable be wage (the hourly wage of an individual employee) and the independent variables be given by…

  • educ is the total years of education of an individual employee

  • exper is the total years of experience an individual employee had prior to starting with the company

  • tenure is the number of years an employee has been working with the firm.

The resulting PRF is given by…

\[wage_i=\beta_0+\beta_1educ_i+\beta_2exper_i+\beta_3tenure_i+\varepsilon_i\]

Suppose we wanted to test that none of these independent variables help explain movements in wages, so the resulting joint hypothesis would be

\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0, \; \beta_2 \neq 0, \; \text{or} \; \beta_3 \neq 0\]

The unrestricted model is one where each of the coefficients can be whatever number the data wants them to be.

# Load data, run unrestricted model, get R^2...

data(wage1, package = "wooldridge")
UREG <- lm(wage ~ educ + exper + tenure, data = wage1)
(R2u <- summary(UREG)$r.squared)
## [1] 0.3064224

Our unrestricted model can explain roughly 30% of the variation in wages.

The next step is to estimate the restricted model: the model with the null hypothesis imposed. In this case you will notice that setting all slope coefficients to zero results in a rather strange looking model:

\[wage_i=\beta_0+\varepsilon_i\]

This model contains no independent variables. If you were to estimate this model, the intercept term would return the average wage in the data, and the error term would simply capture each individual wage observation's deviation from that average. Since it is impossible for the deterministic component of this model to explain any of the variation in wages, the restricted \(R^2\) is zero by definition. Note that this is a special case arising from what this particular restricted model looks like. There will be more interesting cases where the restricted \(R^2\) must be determined by estimating a restricted model.

R2r <- 0 # By definition
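If you want to verify this, a quick sanity check (a sketch using the wage1 data already loaded above) is to estimate the intercept-only model directly. The fitted intercept is just the sample mean of wage, and the reported \(R^2\) is exactly zero:

RREG0 <- lm(wage ~ 1, data = wage1)  # intercept-only (restricted) model
coef(RREG0)                          # returns mean(wage1$wage)
summary(RREG0)$r.squared             # 0 by construction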

Now that we have the restricted and unrestricted \(R^2\), we need the degrees of freedom to calculate an F-statistic under the null. The numerator degrees of freedom \((m)\) is the number of restrictions placed on the restricted model. Since the null hypothesis sets all three slope coefficients to zero, we have \(m=3\) restrictions. The denominator degrees of freedom \((n-k-1)\) is taken directly from the unrestricted model. Since \(n=526\) and we originally had 3 independent variables (\(k=3\)), the denominator degrees of freedom is \(n-k-1=522\). We can now calculate our F-statistic under the null as well as our p-value.
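The code below implements the R-squared form of the F-statistic,

\[F = \frac{(R^2_u - R^2_r)/m}{(1-R^2_u)/(n-k-1)}\]

which follows an \(F_{m,\,n-k-1}\) distribution under the null.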

m = 3; n = 526; k = 3

(Fstat <- ((R2u - R2r)/m)/((1-R2u)/(n-k-1)))
## [1] 76.87317
(Pval <- pf(Fstat,m,n-k-1,lower.tail = FALSE))
## [1] 3.405862e-41
(1-Pval)
## [1] 1

Note that since our F-statistic is far beyond any reasonable critical value, we can reject the null with approximately 100% confidence (i.e., the p-value is essentially zero).
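Equivalently, you can compare the F-statistic to a critical value instead of looking at the p-value. A minimal check, reusing the m, n, and k defined above:

# The 5% critical value of the F(3, 522) distribution is roughly 2.62;
# our F-statistic of 76.87 is far beyond it, so we reject at the 5% level.
qf(0.95, df1 = m, df2 = n - k - 1)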

What can we conclude from this?

Since we rejected the null hypothesis, we have statistical evidence that the alternative hypothesis is true. However, take a look at what the alternative hypothesis actually says. It says that at least one of the population coefficients is statistically different from zero. It doesn't say which ones. It doesn't say how many. That's it…

Is there a shortcut?

Remember that all regression results provide a test of the simple hypothesis that each slope coefficient is equal to zero.

\[H_0: \beta_j=0 \quad \text{versus} \quad H_1: \beta_j \neq 0\]
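Each of these simple tests uses the t-statistic reported in the regression output, which is just the estimate divided by its standard error:

\[t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}\]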

All regression results also provide a test of the joint hypothesis that all slope coefficients are equal to zero. You can see the result at the bottom of the summary page. The last line delivers the same F-statistic we calculated above as well as a p-value that is essentially zero.

Note that while this uninteresting joint hypothesis test is done by default, other joint tests require a bit more work.

summary(UREG)
## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16
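If you would rather run this joint test explicitly than read it off the summary output, one option is the linearHypothesis() function from the car package. A minimal sketch, assuming car is installed:

library(car)

# Impose all three zero restrictions at once; this reproduces the
# overall F-statistic of 76.87 at the bottom of summary(UREG).
linearHypothesis(UREG, c("educ = 0", "exper = 0", "tenure = 0"))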

Application 2: Constant Returns to Scale

Suppose you have data on the Gross Domestic Product (GDP) of a country as well as observations on two aggregate inputs of production: the nation's capital stock (K) and aggregate labor supply (L). One popular regression to run in growth economics is to see if a nation's aggregate production function possesses constant returns to scale. If it does, then scaling up a nation's inputs by a particular percentage will increase output by the exact same percentage (i.e., doubling the inputs doubles the output). This has implications for what size an economy should be, but we won't get into those details now.

The PRF is given by

\[lnGDP_i = \beta_0 + \beta_K \;lnK_i + \beta_L \;lnL_i + \varepsilon_i\]

where

  • \(lnGDP_i\) is an observation of total output

  • \(lnK_i\) is an observation of total capital stock

  • \(lnL_i\) is an observation of total labor stock.

These variables are actually in logs, but we will ignore that for now.

If we are testing for constant returns to scale (CRS), then we want to show that increasing all of the inputs by a certain amount will result in the same increase in output. Technical issues aside, this results in the following null hypothesis for a joint test:

\[H_0: \beta_K + \beta_L = 1 \quad \text{versus} \quad H_1: \beta_K + \beta_L \neq 1\]
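To see why the coefficients must sum to one, note that this PRF is the log of a Cobb-Douglas production function \(GDP = A\,K^{\beta_K}L^{\beta_L}\). Scaling both inputs by a factor \(\lambda\) gives

\[A\,(\lambda K)^{\beta_K}(\lambda L)^{\beta_L} = \lambda^{\beta_K+\beta_L}\,A\,K^{\beta_K}L^{\beta_L}\]

so output scales by exactly the same factor \(\lambda\) as the inputs precisely when \(\beta_K+\beta_L=1\).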

We now have all we need to test for CRS:

# Load data...

library(readxl)
CDdata <- read_excel("data/CDdata.xlsx")

# Run unrestricted model, get R^2...

UREG <- lm(lnGDP ~ lnK + lnL, data = CDdata)
(R2u <- summary(UREG)$r.squared)
## [1] 0.9574247

The unrestricted model can explain around 96% of the variation in the dependent variable. For us to determine how much the restricted model can explain, we first need to see exactly what the restriction does to our model. Starting from the unrestricted model, imposing the restriction delivers the following:

\[lnGDP_i = \beta_0 + \beta_K \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\]

Substituting \(\beta_K = 1-\beta_L\) from the null and grouping terms:

\[lnGDP_i = \beta_0 + (1 - \beta_L) \; lnK_i + \beta_L \; lnL_i + \varepsilon_i\] \[lnGDP_i = \beta_0 + lnK_i + \beta_L \; (lnL_i - lnK_i) + \varepsilon_i\] \[(lnGDP_i - lnK_i) = \beta_0 + \beta_L \; (lnL_i - lnK_i) + \varepsilon_i\] \[\tilde{Y}_i = \beta_0 + \beta_L \; \tilde{X}_i + \varepsilon_i\] where \[\tilde{Y}_i=lnGDP_i - lnK_i \quad \text{and} \quad \tilde{X}_i=lnL_i - lnK_i\]

Notice how this derivation delivers exactly how the variables of the model need to be transformed and what restricted model needs to be estimated.

# Construct the transformed variables...

Y <- CDdata$lnGDP - CDdata$lnK
X <- CDdata$lnL - CDdata$lnK

# Run restricted model, get R^2...

RREG <- lm(Y ~ X)
(R2r <- summary(RREG)$r.squared)
## [1] 0.9370283

The restricted model can explain roughly 94% of the variation in the dependent variable. To see if this reduction in \(R^2\) is large enough to reject the null hypothesis, we need to calculate an F-statistic. The numerator degrees of freedom is \(m=1\) because the null contains only one restriction (a single equation, even though it involves two coefficients). The denominator degrees of freedom uses \(n=24\) and \(k=2\), so \(n-k-1=21\).

m = 1; n = 24; k = 2

(Fstat <- ((R2u - R2r)/m)/((1-R2u)/(n-k-1)))
## [1] 10.0604
(Pval <- pf(Fstat,m,n-k-1,lower.tail = FALSE))
## [1] 0.004594084
(1-Pval)
## [1] 0.9954059

As in the previous application, we obtained a large F-statistic and a small p-value. This means we reject the hypothesis that this country's aggregate production function exhibits constant returns to scale with slightly over 99.5% confidence.
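As a final check, the same restriction can be tested directly on the unrestricted model without transforming any variables by hand. A minimal sketch, again assuming the car package is available; linearHypothesis() uses the sum-of-squared-residuals form of the F-test, which does not require the restricted and unrestricted models to share a dependent variable:

library(car)

# Test the single linear restriction on the coefficients of UREG.
linearHypothesis(UREG, "lnK + lnL = 1")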