9.5 A Concluding Application
Let us extend our analysis of the median starting salaries of graduating law classes by adding a second measure of student quality.
\[SALARY_i = \beta_0 + \beta_1 \; LSAT_i + \beta_2 \; RANK_i + \beta_3 \; GPA_i + \varepsilon_i\]
The regression above is the same one considered in the previous chapter, except that it now includes the median GPA of the graduating class. The first step is to load the data and run the regression.
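# Load the data from the Excel workbook
library(readxl)
LAW <- read_excel("data/LAW.xlsx")
# Regress median starting salary on LSAT, RANK, and GPA
REG <- lm(SALARY ~ LSAT + RANK + GPA, data = LAW)
summary(REG)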
##
## Call:
## lm(formula = SALARY ~ LSAT + RANK + GPA, data = LAW)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14528.8  -4108.2   -802.1   4016.8  20859.6
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52187.41   25813.05  -2.022   0.0451 *
## LSAT          -190.77     349.18  -0.546   0.5857
## RANK          -144.89      15.49  -9.354   <2e-16 ***
## GPA          11893.07    4197.45   2.833   0.0053 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5980 on 138 degrees of freedom
## Multiple R-squared: 0.7667, Adjusted R-squared: 0.7616
## F-statistic: 151.1 on 3 and 138 DF, p-value: < 2.2e-16
The regression results suggest that the population coefficient on LSAT is not significantly different from zero, which is contrary to what we have seen in previous chapters. This points to a collinearity problem that is rather intuitive: one can suspect that classes with a high median LSAT score also have a high median GPA.
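The text does not show the call that produces the VIFs reported next. One commonly used option is the vif() function from the car package. As a minimal base-R sketch of the same idea, the helper below (the name vif_manual is hypothetical, not from the text) runs one auxiliary regression per explanatory variable and converts each R-squared into a VIF. It assumes all explanatory variables are numeric.

```r
# Hypothetical helper (not from the text): compute a VIF for every
# explanatory variable in a fitted lm object via auxiliary regressions.
# Assumes numeric regressors only (no factors or interaction terms).
vif_manual <- function(model) {
  # Design matrix of the fitted model, with the intercept column removed
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  sapply(colnames(X), function(v) {
    # Auxiliary regression: variable v on all other explanatory variables
    aux <- lm(X[, v] ~ X[, colnames(X) != v, drop = FALSE])
    # VIF = 1 / (1 - R-squared of the auxiliary regression)
    1 / (1 - summary(aux)$r.squared)
  })
}
```

Under these assumptions, calling vif_manual(REG) would return a named vector with one entry per explanatory variable, in the same layout as the output below.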
##      LSAT      RANK       GPA
## 10.671894  2.399648 11.871970
After using R to quickly calculate all of our VIFs, we can see that there is indeed a collinearity problem in our data: the VIFs for LSAT and GPA are both above 10. The highest VIF arises when GPA is on the left-hand side of the auxiliary regression, so GPA is the best candidate for removal.
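To see what these numbers imply, the standard VIF definition can be inverted to recover the R-squared of each auxiliary regression. The notation \(R_j^2\) for the auxiliary R-squared of variable \(j\) is an assumption here, and the values are rounded to three decimals.

\[VIF_j = \frac{1}{1 - R_j^2} \quad \Longrightarrow \quad R_j^2 = 1 - \frac{1}{VIF_j}\]

\[R_{LSAT}^2 \approx 1 - \frac{1}{10.672} \approx 0.906, \qquad R_{RANK}^2 \approx 1 - \frac{1}{2.400} \approx 0.583, \qquad R_{GPA}^2 \approx 1 - \frac{1}{11.872} \approx 0.916\]

In other words, roughly 92 percent of the variation in GPA is explained by LSAT and RANK together, which leaves very little independent variation with which to estimate its coefficient.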
##     LSAT     RANK
## 2.155515 2.155515
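The two VIFs are identical, and this is no coincidence. With only two explanatory variables, each auxiliary regression is a simple regression of one variable on the other, so its R-squared equals the squared sample correlation between them. A short derivation, writing \(r\) for the sample correlation between LSAT and RANK (notation assumed here):

\[VIF_{LSAT} = VIF_{RANK} = \frac{1}{1 - r^2} \quad \Longrightarrow \quad r^2 = 1 - \frac{1}{2.155515} \approx 0.536, \qquad |r| \approx 0.73\]

The VIF alone does not reveal the sign of the correlation, only its magnitude.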
As the new regression and VIF measures indicate, dropping GPA from the regression did alleviate the collinearity issue. Note that the remaining VIF measures for LSAT and RANK suggest that there is a degree of correlation between these two independent variables, and that should make sense. After all, better-ranked schools attract higher-quality students. The important thing to note here is that this correlation is not strong enough to seriously inflate the standard errors of the estimates and distort inference. Final conclusion: adding an additional measure of student quality would have been nice in theory, but the additional variable simply wasn’t bringing anything new to the analysis.