9.1 An Application
Consider an application using simulated data in which the two independent variables have varying degrees of correlation. The data were generated from the following model:
\[Y_i = 1 + 1 \;X_{1i} + 1\; X_{2i} + \varepsilon_i\]
In other words, a regression on the simulated data should recover the coefficients above (all equal to 1) if there are no problems with the estimation. The exercise shows how collinearity can become a problem.
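The exact code that produced the output below is not shown; a minimal sketch of how such data might be simulated and estimated in R (the seed, sample size, and variable names here are illustrative assumptions, not the book's actual code) could look like:

```r
set.seed(123)

n   <- 1000
rho <- 0.33  # target correlation between the two regressors (assumed value)

# Draw two regressors with (approximate) correlation rho
X1 <- rnorm(n)
X2 <- rho * X1 + sqrt(1 - rho^2) * rnorm(n)

# Generate the outcome from the true model: Y = 1 + 1*X1 + 1*X2 + e
Y <- 1 + X1 + X2 + rnorm(n)

cor(X1, X2)               # check the realized correlation
summary(lm(Y ~ X1 + X2))  # estimates should be close to (1, 1, 1)
```

Raising `rho` toward 1 reproduces the deterioration shown below; setting `X2 <- 2 * X1` gives the perfectly collinear fourth case.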
```
## [1] 0.3289358
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0015735  0.0214445  46.705 < 2.2e-16 ***
## X31         1.0152385  0.0401048  25.315 < 2.2e-16 ***
## X32         0.9905016  0.0099267  99.781 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```
## [1] 0.9380521
##
## t test of coefficients:
##
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.001574   0.021445  46.705 < 2.2e-16 ***
## X21         1.100724   0.109304  10.070 < 2.2e-16 ***
## X22         0.905016   0.099267   9.117 1.082e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```
## [1] 0.9992777
##
## t test of coefficients:
##
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.001574   0.021445 46.7053  < 2e-16 ***
## X11         1.955579   0.996657  1.9621  0.05261 .
## X12         0.050161   0.992672  0.0505  0.95980
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```
## [1] 1
##
## t test of coefficients:
##
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.003483   0.021342  47.019 < 2.2e-16 ***
## X41         2.002616   0.037857  52.900 < 2.2e-16 ***
## X42               NA         NA      NA        NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The above application considers four data sets whose only difference is the degree of collinearity between the two independent variables. In the first regression the correlation between \(X_{1i}\) and \(X_{2i}\) is 0.33, and the regression does a fairly good job of recovering the true coefficients. In the second regression the correlation is 0.94, and the estimation begins to suffer: both slope estimates are now off by about 10 percent, and their standard errors have grown. In the third regression the correlation is just shy of perfect (0.999), and the slope estimates are far from their true values, with standard errors so large that neither is statistically distinguishable from zero at the 5 percent level. Finally, the fourth regression has perfect collinearity between \(X_{1i}\) and \(X_{2i}\), and the regression chokes entirely, returning NA (not available) for the second coefficient. Mathematically, perfect collinearity makes the matrix of regressors singular, so the least-squares formula asks the computer to divide by zero (which computers refuse to do).
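The blow-up in the standard errors can be made precise with the textbook variance formula for a two-regressor model. Writing \(r_{12}\) for the sample correlation between the two regressors, the sampling variance of the first slope estimator is
\[\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i}\left(X_{1i} - \bar{X}_1\right)^2 \left(1 - r_{12}^2\right)}\]
As \(r_{12}\) approaches 1, the factor \(1 - r_{12}^2\) in the denominator approaches zero, inflating the variance without bound; at \(r_{12} = 1\) exactly, the formula requires dividing by zero, which is the source of the NA in the fourth regression.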