9.4 How do we remove Collinearity?

There are several ways to remove or reduce collinearity, and they vary in both feasibility and effectiveness.

First, is the collinearity problem due to the inherent nature of the variables themselves, or is it a coincidence of your current sample? If it is a coincidence, then the problem might go away if you collected more observations. Note that this does not always work, and sometimes more data simply isn't available. However, it is an easy first pass when feasible.

Second, one could always ignore collinearity and proceed with the analysis. The rationale is that while collinearity inflates the standard errors of the estimates, the inflation might not be enough to change any conclusions. Think of increasing the value of zero by 100 times… it's still zero.

For example, let's try the ignorance approach with the baseball application above.
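A minimal sketch of the call that produces the output below, assuming the MULTI2 data frame used later in this section (the object name REG_FULL is illustrative):

REG_FULL <- lm(Wins ~ League + ERA + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors, data = MULTI2)
summary(REG_FULL)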

|               | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--------------|---------:|-----------:|--------:|-----------:|
| (Intercept)   |   69.28  |   13.64    |  5.077  |  4.37e-05  |
| League        |   1.847  |   1.012    |  1.825  |  0.08168   |
| ERA           |  -6.058  |   3.441    |  -1.76  |  0.09225   |
| Runs          |  0.08855 |  0.007688  |  11.52  | 8.703e-11  |
| Hits_Allowed  | -0.02523 |  0.01411   | -1.788  |  0.08761   |
| Walks_Allowed | -0.02665 |  0.01178   | -2.262  |  0.03393   |
| Saves         |  0.5378  |  0.07606   |  7.071  | 4.297e-07  |
| Errors        | 0.004109 |  0.04188   |  0.0981 |  0.9227    |

Fitting linear model: Wins ~ League + ERA + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|:------------:|:-------------------:|:-------:|:----------------:|
| 30           | 2.503               | 0.9611  | 0.9488           |

The results suggest that the population coefficients for League, ERA, Hits Allowed, and Errors are all not significantly different from zero with 95% confidence. If every variable had been significant, then we could safely ignore any potential collinearity, because the inflated standard errors were evidently not large enough to change our conclusions. However, since two of these insignificant variables are ones we already identified as having a collinearity problem, we cannot go this route.
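A quick way to verify this from the fitted model, assuming the illustrative REG_FULL object from the sketch above, is to check which 95% confidence intervals contain zero:

confint(REG_FULL, level = 0.95)
# the intervals for League, ERA, Hits_Allowed, and Errors all straddle zero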

The third option for removing collinearity is to remove correlated independent variables until the problematic correlation structure is gone. The way to proceed is to remove the variables one at a time, starting with the highest VIF value, until all remaining VIF values are below 5. The upside of this approach is that you can proceed with the main regression knowing that collinearity is no longer a problem. The downside is that you may have had to remove one or more variables that you really wanted to include in the regression.

The VIF values from the baseball analysis suggest that ERA and Hits Allowed potentially need to be removed due to collinearity. Since we remove only one variable at a time, we start with ERA, the variable with the highest VIF, because it carries the most redundant information.

# Refit the model with ERA, the highest-VIF variable, removed
REG <- lm(Wins ~ League + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors, data = MULTI2)

vif(REG)  # vif() comes from the car package
##        League          Runs  Hits_Allowed Walks_Allowed         Saves 
##      1.149383      1.279914      1.365583      1.235945      1.665172 
##        Errors 
##      1.546465
summary(REG)
## 
## Call:
## lm(formula = Wins ~ League + Runs + Hits_Allowed + Walks_Allowed + 
##     Saves + Errors, data = MULTI2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8127 -2.0776  0.0551  2.0168  4.9951 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.214595  11.607524   7.169 2.67e-07 ***
## League         2.278948   1.026010   2.221   0.0365 *  
## Runs           0.088445   0.008031  11.013 1.20e-10 ***
## Hits_Allowed  -0.047231   0.006840  -6.905 4.86e-07 ***
## Walks_Allowed -0.043122   0.007485  -5.761 7.22e-06 ***
## Saves          0.569301   0.077227   7.372 1.69e-07 ***
## Errors         0.001322   0.043722   0.030   0.9761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.615 on 23 degrees of freedom
## Multiple R-squared:  0.9557, Adjusted R-squared:  0.9441 
## F-statistic: 82.65 on 6 and 23 DF,  p-value: 2.119e-14

The regression with ERA removed is now free of collinearity, which we can confirm by verifying that all VIF values of the remaining independent variables are well below 5. The results suggest that after removing ERA, Hits Allowed now has a population coefficient that is significantly different from zero with 95% confidence. Errors, however, is still insignificant. This suggests that the earlier insignificance of Hits Allowed was due to collinearity, while the insignificance of Errors wasn't: Errors simply do not help us explain why some teams win more games than others.
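One way to see the collinearity story in the numbers, assuming the illustrative REG_FULL and REG objects from above: the standard error of Hits Allowed roughly halves once its redundant partner ERA is dropped.

summary(REG_FULL)$coefficients["Hits_Allowed", "Std. Error"]  # 0.01411 with ERA included
summary(REG)$coefficients["Hits_Allowed", "Std. Error"]       # 0.00684 with ERA removed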

Sometimes removing collinearity might involve multiple rounds

You will note from the application above that we only needed to remove one independent variable, so only one round of VIF calculations displayed values above 5. It might sometimes be the case that even after you remove an independent variable, the next round of VIF values reports one or more values of 5 or more. If this happens, you simply repeat the process: remove the variable with the highest VIF and check again. In general, a complete removal of multicollinearity involves the following (a code sketch follows the list):

  1. calculate VIFs for each independent variable in your original model

  2. drop the variable with the highest VIF (provided that VIF is greater than 5)

  3. recalculate VIFs for each remaining independent variable (with the dropped variable no longer in the model)

  4. repeat steps 2 and 3 until all VIFs are less than 5
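A minimal sketch of this loop in R, assuming the car package for vif() and the MULTI2 baseball data from above (the variable list and object names are illustrative, and the loop assumes at least two variables always remain):

library(car)  # provides vif()

# start from the full set of candidate independent variables
vars <- c("League", "ERA", "Runs", "Hits_Allowed",
          "Walks_Allowed", "Saves", "Errors")

repeat {
  REG <- lm(reformulate(vars, response = "Wins"), data = MULTI2)
  V   <- vif(REG)
  if (max(V) < 5) break        # stop once no VIF is 5 or above
  vars <- vars[-which.max(V)]  # drop the variable with the highest VIF
}

vif(REG)  # all remaining VIFs are below 5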