9.2 Collinearity

A significant problem in regression analysis arises when two or more independent variables are strongly correlated with each other. This is known as collinearity or multicollinearity. A correlation means that two or more variables systematically move together. In regression analysis, movement is the information we use to explain differences or changes in the dependent variable. If independent variables move together because they are correlated, then they contain similar (i.e., redundant) information.

Another issue with collinearity is that when two or more variables systematically move together, it undermines the very interpretation of our estimates: holding all else equal. If the variables are not held equal in the data because of collinearity, then our estimates will reflect that by being unable to separate the changes in these variables along distinct dimensions. Since the information from these independent variables is shared and redundant, the dimensions of these collinear variables become blurred.

9.2.1 An Application

Consider an application that compares four sets of simulated data in which the two independent variables have different degrees of correlation. The simulated data were generated from the following model:

\[Y_i = 1 + 1 X_{1i} + 1 X_{2i} + \varepsilon_i\]

In other words, a regression on the simulated data should recover the coefficients above (all equal to 1) if there are no problems with the estimation. The exercise will show you how collinearity can become a problem.
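The data set MDAT used below was generated outside this section. As a rough sketch, one pair of mildly correlated regressors and its response could be simulated as follows (the parameter choices here are hypothetical, chosen only to produce a correlation of roughly 0.33):

# Hypothetical sketch: simulate one pair of correlated regressors and a response
# (the actual MDAT object used below was created elsewhere)
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.35 * x1 + rnorm(n)            # mild correlation between the regressors
y  <- 1 + 1 * x1 + 1 * x2 + rnorm(n)  # true model: every coefficient equals 1
cor(x1, x2)                           # roughly 0.33
summary(lm(y ~ x1 + x2))              # estimates should land near 1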

# 1) Regression: correlation = 0.3289

library(lmtest)   # provides the coeftest() function used below

cor(MDAT$X31,MDAT$X32)
## [1] 0.3289358
CREG <- lm(Y3~X31+X32,data=MDAT)
coeftest(CREG)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 1.0015735  0.0214445  46.705 < 2.2e-16 ***
## X31         1.0152385  0.0401048  25.315 < 2.2e-16 ***
## X32         0.9905016  0.0099267  99.781 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 2) Regression: correlation = 0.938

cor(MDAT$X21,MDAT$X22)
## [1] 0.9380521
CREG <- lm(Y2~X21+X22,data=MDAT)
coeftest(CREG)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 1.001574   0.021445  46.705 < 2.2e-16 ***
## X21         1.100724   0.109304  10.070 < 2.2e-16 ***
## X22         0.905016   0.099267   9.117 1.082e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 3) Regression: correlation = 0.999

cor(MDAT$X11,MDAT$X12)
## [1] 0.9992777
CREG <- lm(Y1~X11+X12,data=MDAT)
coeftest(CREG)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.001574   0.021445 46.7053  < 2e-16 ***
## X11         1.955579   0.996657  1.9621  0.05261 .  
## X12         0.050161   0.992672  0.0505  0.95980    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 4) Regression: correlation = 1 (perfect collinearity)

cor(MDAT$X41,MDAT$X42)
## [1] 1
CREG <- lm(Y4~X41+X42,data=MDAT)
coeftest(CREG)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 1.003483   0.021342  47.019 < 2.2e-16 ***
## X41         2.002616   0.037857  52.900 < 2.2e-16 ***
## X42               NA         NA      NA        NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above application considers four sets of data where the only difference is the degree of collinearity between the two independent variables. The first regression has a correlation between \(X_{1i}\) and \(X_{2i}\) of 0.33, and the regression does a fairly good job of recovering the true coefficients. The second regression has a correlation of 0.94, and the regression is beginning to suffer: both slope estimates are now off by about 10 percent. The third regression has a correlation just shy of perfect (0.999), and the estimates are now far from their true values. Finally, the fourth regression has perfect collinearity between \(X_{1i}\) and \(X_{2i}\), and the regression chokes by reporting NA (not available) for the second coefficient. Mathematically, perfect collinearity asks the computer to divide a number by zero (which computers do not like to do).
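The division-by-zero problem can be seen directly: with perfectly collinear columns, the matrix that OLS must invert is singular. The short illustration below uses made-up data, not the MDAT variables:

# Made-up illustration of perfect collinearity (not the MDAT data)
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1                      # x2 is an exact linear function of x1
X  <- cbind(1, x1, x2)            # design matrix with an intercept column
det(t(X) %*% X)                   # determinant is essentially zero
# solve(t(X) %*% X)               # uncommenting this line throws an error: the matrix cannot be inverted
coef(lm(rnorm(100) ~ x1 + x2))    # lm() copes by reporting NA for the redundant x2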

9.2.2 What does Collinearity do to our regression?

The takeaway from our application is that collinearity can become a significant problem if the degree of correlation among the independent variables is large enough. What the application does not show is that collinearity also results in excessively large standard errors of the coefficient estimates. Intuitively, if the regression cannot tell which variable is providing the (redundant) information, then it shows this by placing little precision on each estimate, meaning a large standard error. These large standard errors will impact the significance of the estimates via confidence intervals and hypothesis tests.
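To see where these inflated standard errors come from, recall the standard textbook expression (stated here without derivation) for the variance of \(\hat{\beta}_1\) in a regression with two independent variables, where \(\sigma^2\) is the variance of the error term and \(\rho_{12}\) is the correlation between \(X_{1i}\) and \(X_{2i}\):

\[Var(\hat{\beta}_1) = \frac{\sigma^2}{(1-\rho_{12}^2)\sum_{i=1}^n (X_{1i}-\bar{X}_1)^2}\]

As \(\rho_{12}\) approaches 1 in absolute value, the denominator shrinks toward zero and the standard error explodes; at exactly \(\pm 1\) it is undefined, which is the division by zero we saw in the fourth regression.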

9.2.3 How to test for Collinearity?

Note that some collinearity exists in every equation: all variables are correlated to some degree (even if completely at random). Therefore, the real question is how much multicollinearity exists in an equation. Is it enough to cause the types of problems we saw in the application above?

There are two characteristics that help detect the degree of collinearity in a regression:

  • High simple correlation coefficients

  • High Variance Inflation Factors (VIFs)

Correlation Coefficients

\[Cov(X_1,X_2)=\frac{1}{n-1} \sum_{i=1}^n (X_{1i}-\bar{X}_1)(X_{2i}-\bar{X}_2)\]

\[S_{X_1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_{1i}-\bar{X}_1)^2} \qquad S_{X_2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_{2i}-\bar{X}_2)^2}\]

\[\rho(X_1,X_2) = \frac{Cov(X_1,X_2)}{S_{X_1}S_{X_2}}\]

If the simple correlation coefficient between any two explanatory variables, \(\rho(X_1,X_2)\), is high in absolute value, then multicollinearity is a potential problem. As we saw in the application, "high" is rather arbitrary, so researchers commonly settle on a threshold of 0.80. In other words, if two independent variables have a correlation of 0.80 or more in absolute value, then you run the risk that collinearity is distorting your estimates and inflating their standard errors.
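As a quick check of the formulas above, the manual calculation reproduces R's built-in cor() for the first pair of simulated variables (this sketch reuses the MDAT data from the application):

# Manual correlation between X31 and X32, compared with the built-in cor()
x1 <- MDAT$X31
x2 <- MDAT$X32
n  <- length(x1)
cov12 <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (n - 1)
s1    <- sqrt(sum((x1 - mean(x1))^2) / (n - 1))
s2    <- sqrt(sum((x2 - mean(x2))^2) / (n - 1))
cov12 / (s1 * s2)   # manual correlation coefficient...
cor(x1, x2)         # ...matches the built-in function (0.3289 in the application above)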

The problem with looking at simple correlations is that they are pairwise calculations. In other words, you can only look at two variables at a time. What if a collinearity problem is bigger than just two variables?

Variance Inflation Factors (VIFs)

Suppose you want to estimate a regression with three independent variables, but you want to test for collinearity first.

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \varepsilon_i\]

Correlation coefficients, being pairwise, will not be able to uncover a correlation structure that might exist across all three independent variables.

Take for example three independent variables: a pitcher’s ERA, the number of earned runs, and the number of innings pitched. For those of you (like me) who are unfamiliar with baseball, a pitcher’s ERA is essentially their earned runs divided by the number of innings pitched. This means that ERA might be positively correlated with earned runs and negatively correlated with innings pitched, but you would not realize that ERA is almost completely determined by the other two variables unless you consider both of them simultaneously. A Variance Inflation Factor (or VIF) is a method for examining the complete correlation structure among a list of three or more independent variables.
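A rough sketch with made-up numbers illustrates the point: each pairwise correlation is less than one in absolute value, yet the auxiliary regression of ERA on the other two variables explains nearly all of its variation.

# Made-up illustration: pairwise correlations understate the joint relationship
set.seed(7)
innings <- runif(30, 150, 220)                 # innings pitched (hypothetical values)
earned  <- runif(30, 60, 110)                  # earned runs (hypothetical values)
era     <- 9 * earned / innings                # ERA: earned runs per nine innings
cor(era, earned)                               # positive, but not equal to 1
cor(era, innings)                              # negative, but not equal to -1
summary(lm(era ~ earned + innings))$r.squared  # jointly, close to 1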

A Variance Inflation Factor (VIF) is calculated in two steps:

First, run an OLS regression where an independent variable (say, X1) takes a turn at being a dependent variable.

\[X_{1i} = a_0 + a_1 X_{2i} + a_2 X_{3i} + u_i\]

Note that the original dependent variable \((Y_i)\) is NOT in this equation!

The purpose of this auxiliary regression is to see if there is a sophisticated correlation structure between \(X_{1i}\) and the right-hand side variables. Conveniently, we already have a statistic, the \(R^2\), that indicates exactly how much of the variation in the left-hand variable is explained by the right-hand variables. The second step takes the \(R^2\) from this regression and calculates the VIF for the independent variable \(X_{1i}\). Since this VIF describes what happens to the estimate \(\hat{\beta}_1\) in the original regression, it is sometimes referred to as \(VIF(\hat{\beta}_1)\):

\[VIF(\hat{\beta}_1) = \frac{1}{1-R^2}\]

If we did this for every independent variable in the original regression, we would arrive at three VIF values.

\[X_{1i} = a_0 + a_1 X_{2i} + a_2 X_{3i} + u_i \rightarrow VIF(\hat{\beta}_1) = \frac{1}{1-R_1^2}\]

\[X_{2i} = a_0 + a_1 X_{1i} + a_2 X_{3i} + u_i \rightarrow VIF(\hat{\beta}_2) = \frac{1}{1-R_2^2}\]

\[X_{3i} = a_0 + a_1 X_{1i} + a_2 X_{2i} + u_i \rightarrow VIF(\hat{\beta}_3) = \frac{1}{1-R_3^2}\]

where \(R_j^2\) denotes the \(R^2\) from the auxiliary regression that has \(X_{ji}\) on the left-hand side.

These VIF values measure how much the variance (and therefore the standard error) of each estimated coefficient is inflated by the presence of collinearity. In order to determine if there is a problem, we again resort to an arbitrary threshold: \(VIF \geq 5\). Note that a VIF of 5 corresponds to an auxiliary \(R^2\) of 0.8, which is in the same spirit as the 0.80 correlation threshold above.
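As a sketch of the two-step calculation in R (the data frame dat and the variables X1, X2, and X3 below are hypothetical placeholders for whatever regression you are working with):

# Hypothetical sketch of the two-step VIF calculation for X1
# ('dat' is a placeholder data frame containing X1, X2, and X3)
aux1   <- lm(X1 ~ X2 + X3, data = dat)   # step 1: auxiliary regression (note: no Y!)
r2_aux <- summary(aux1)$r.squared        # R^2 from the auxiliary regression
vif_X1 <- 1 / (1 - r2_aux)               # step 2: VIF for the coefficient on X1
vif_X1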

9.2.4 An Application

library(readxl)
MULTI2 <- read_excel("data/MULTI2.xlsx")
names(MULTI2)
## [1] "Team"          "League"        "Wins"          "ERA"           "Runs"          "Hits_Allowed" 
## [7] "Walks_Allowed" "Saves"         "Errors"

Suppose that you want to explain why some baseball teams recorded more wins than others by looking at the season statistics listed above. Before we run a full regression with Wins as the dependent variable and the remaining variables as independent variables, we need to test for collinearity.

If we were to follow the steps above for each independent variable, we would need to calculate seven VIF values (Team isn’t a variable… it’s a name). This is a lot easier done than said in R:

# Estimate the 'intended' model:

REG <- lm(Wins ~ League + ERA + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors, data = MULTI2)

# Use REG object to call vif command:
library(car)
vif(REG)
##        League           ERA          Runs  Hits_Allowed Walks_Allowed         Saves        Errors 
##      1.221101     11.026091      1.279997      6.342662      3.342659      1.762577      1.548678

The output above shows a VIF for each of the independent variables. The largest are for ERA and Hits Allowed, and these are problematic because they are above our threshold of 5. So now that we have detected collinearity… what do we do about it?

9.2.5 How do we remove Collinearity?

There are several ways to remove or reduce the degree of collinearity, and they vary in feasibility and effectiveness.

First, is the collinearity problem due to the inherent nature of the variables themselves, or is it a coincidence of your current sample? If it is a coincidence, then the problem might go away if you collected more observations. Note that this might not always work, and sometimes more data simply isn’t available. However, it is an easy first pass if feasible.

Second, one could always ignore collinearity and proceed with the analysis. The reasoning is that while collinearity inflates the standard errors of the estimates, the inflation might not matter much. Think of multiplying zero by 100: the result is still zero. If the standard errors are very small to begin with, the estimates may remain significant even after the inflation.

For example, let’s try the ignorance approach with the baseball application above:

                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        69.28       13.64     5.08      0.00
League              1.85        1.01     1.82      0.08
ERA                -6.06        3.44    -1.76      0.09
Runs                0.09        0.01    11.52      0.00
Hits_Allowed       -0.03        0.01    -1.79      0.09
Walks_Allowed      -0.03        0.01    -2.26      0.03
Saves               0.54        0.08     7.07      0.00
Errors              0.00        0.04     0.10      0.92

Fitting linear model: Wins ~ League + ERA + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
          30                   2.5      0.96             0.95

The results suggest that the population coefficients for League, ERA, Hits Allowed, and Errors are all insignificantly different from zero with 95% confidence. Now, if they were all significant, we could possibly ignore any potential collinearity issues because the inflation of the standard errors would not have been large enough to matter. However, since two of these insignificant variables (ERA and Hits Allowed) are the very ones we already identified as having a collinearity problem, we are unable to go this route.

The third option is to remove independent variables until the problematic correlation structure is gone. The way to proceed down this route is to remove the variable with the highest VIF, one at a time, until all remaining variables have VIF values below 5. The upside of this approach is that you can then proceed with the main regression knowing that collinearity is not a problem. The downside is that you might have had to remove variables that you really wanted to have in the regression.

The VIF values from the baseball analysis suggest that ERA and Hits Allowed are two variables that potentially need to be removed from the analysis due to collinearity. Since we remove only one variable at a time, we start with the variable that has the highest VIF (ERA) because it is the one carrying the most redundant information.

REG <- lm(Wins ~ League + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors, data = MULTI2)

vif(REG)
##        League          Runs  Hits_Allowed Walks_Allowed         Saves        Errors 
##      1.149383      1.279914      1.365583      1.235945      1.665172      1.546465
summary(REG)
## 
## Call:
## lm(formula = Wins ~ League + Runs + Hits_Allowed + Walks_Allowed + 
##     Saves + Errors, data = MULTI2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8127 -2.0776  0.0551  2.0168  4.9951 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.214595  11.607524   7.169 2.67e-07 ***
## League         2.278948   1.026010   2.221   0.0365 *  
## Runs           0.088445   0.008031  11.013 1.20e-10 ***
## Hits_Allowed  -0.047231   0.006840  -6.905 4.86e-07 ***
## Walks_Allowed -0.043122   0.007485  -5.761 7.22e-06 ***
## Saves          0.569301   0.077227   7.372 1.69e-07 ***
## Errors         0.001322   0.043722   0.030   0.9761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.615 on 23 degrees of freedom
## Multiple R-squared:  0.9557, Adjusted R-squared:  0.9441 
## F-statistic: 82.65 on 6 and 23 DF,  p-value: 2.119e-14

The regression with ERA removed is now free of collinearity. We can confirm this by the fact that all VIF values of the remaining independent variables are well below 5. The regression results suggest that after removing ERA, League and Hits Allowed now have population coefficients that are significantly different from zero with 95% confidence. Errors is still an insignificant variable, which suggests that its insignificance wasn’t due to collinearity. It is simply the fact that Errors does not significantly help us explain why some teams win more games than others.

Sometimes removing collinearity might involve multiple rounds

You will note from the application above that we only needed to remove one independent variable, so only one round of VIF calculations displayed values above 5. It might sometimes be the case that even after you remove an independent variable, the next round of VIF values still reports values at or above 5. If this happens, you simply repeat the process by removing the variable with the highest VIF and checking again. In general, a complete removal of multicollinearity involves the following (a code sketch follows the list):

  1. calculate VIFs for your data set

  2. drop the variable with the highest VIF (greater than 5)

  3. calculate VIFs on your data again (with the dropped variable no longer in the data set)

  4. drop the variable with the highest VIF (greater than 5)

  5. this is repeated until all VIFs are less than 5
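A minimal sketch of this loop in R, assuming the car package is installed and that vif() returns a simple named vector (i.e., no multi-level factors requiring generalized VIFs); the function name prune_vif is hypothetical:

# Hedged sketch: repeatedly drop the regressor with the largest VIF until all are below the threshold
prune_vif <- function(f, dat, threshold = 5) {
  repeat {
    fit  <- lm(f, data = dat)                 # fit the current model
    vifs <- car::vif(fit)                     # VIFs for the current regressors
    if (max(vifs) < threshold) return(fit)    # done: nothing at or above the threshold
    worst <- names(which.max(vifs))           # variable with the largest VIF
    f <- update(f, paste("~ . -", worst))     # drop it and refit
  }
}
# Example usage with the baseball data:
# pruned <- prune_vif(Wins ~ League + ERA + Runs + Hits_Allowed + Walks_Allowed + Saves + Errors, MULTI2)
# car::vif(pruned)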