9.3 How to Test for Collinearity?

Note that most variables are correlated to some degree (even two completely unrelated variables will show some sample correlation just by chance). The real question, therefore, is how much collinearity exists in our data. Is it small enough to disregard (as in the first example in the previous section), or large enough to cause problems (as in the third or fourth example)?

There are two data characteristics that help detect the degree of collinearity in a regression:

  • High simple correlation coefficients

  • High Variance Inflation Factors (VIFs)

Correlation Coefficients

The simple correlation coefficient between two variables is built from their sample covariance and sample standard deviations:

\[Cov(X_1,X_2)=\frac{1}{n-1} \sum_{i=1}^n (X_{1i}-\bar{X}_1)(X_{2i}-\bar{X}_2)\]

\[S_{X_1} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_{1i}-\bar{X}_1)^2}\]

\[S_{X_2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_{2i}-\bar{X}_2)^2}\]

\[\rho(X_1,X_2) = \frac{Cov(X_1,X_2)}{S_{X_1}S_{X_2}}\]

If the simple correlation coefficient between any two explanatory variables, \(\rho(X_1,X_2)\), is high in absolute value, then collinearity is a potential problem. As we saw in the application, "high" is rather arbitrary, so researchers often settle on a threshold of 0.80. In other words, if two explanatory variables have a correlation of 0.80 or more in absolute value, then you run the risk of collinearity seriously inflating the standard errors of your estimates.
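
To make these formulas concrete, here is a minimal R sketch, using made-up numbers rather than any of the textbook's data, that computes the correlation by hand and checks it against R's built-in cor() function:

# Two made-up variables (purely illustrative)
x1 <- c(2, 4, 6, 8, 10, 12)
x2 <- c(1, 3, 2, 7, 9, 8)
n  <- length(x1)

# Sample covariance and sample standard deviations, as defined above
cov_12 <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (n - 1)
s_1    <- sqrt(sum((x1 - mean(x1))^2) / (n - 1))
s_2    <- sqrt(sum((x2 - mean(x2))^2) / (n - 1))

cov_12 / (s_1 * s_2)  # the correlation coefficient, "by hand"
cor(x1, x2)           # matches the built-in function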

The problem with looking at simple correlations is that they are only pairwise calculations. In other words, you can only look at two variables at a time. What if a collinearity problem is spread across more than just two variables?

Variance Inflation Factors (VIFs)

Suppose you want to estimate a regression with three independent variables, but you want to test for collinearity first.

\[Y_i = \beta_0 + \beta_1 \; X_{1i} + \beta_2 \; X_{2i} + \beta_3\; X_{3i} + \varepsilon_i\]

Correlation coefficients, being pairwise, will not be able to uncover a correlation structure that exists across all three independent variables.

Take for example three independent variables: a pitcher’s ERA, the number of earned runs, and the number of innings pitched. For those of you (like me) who are unfamiliar with baseball, a pitcher’s ERA is essentially their earned runs divided by the number of innings pitched. This means that ERA might be positively correlated with earned runs and negatively correlated with innings pitched, but you would not realize that ERA is almost entirely determined by the other two variables unless you consider both of them simultaneously; pairwise correlation coefficients cannot uncover this. A Variance Inflation Factor (or VIF) is a method for examining the complete correlation structure across three or more independent variables.
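
To see how pairwise correlations can miss a joint structure, consider a small simulated illustration (hypothetical numbers, not the baseball data). Here X1 is built almost entirely from X2 and X3, yet no pairwise correlation reaches the 0.80 threshold:

# X1 is nearly a linear combination of X2 and X3
set.seed(1)
n  <- 200
X2 <- rnorm(n)
X3 <- rnorm(n)
X1 <- X2 - X3 + rnorm(n, sd = 0.5)

round(cor(cbind(X1, X2, X3)), 2)  # every pairwise correlation is well below 0.80

Looking at this correlation matrix alone, we would (wrongly) conclude that collinearity is not a concern.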

A Variance Inflation Factor (VIF) is calculated in two steps:

First, run an OLS regression where an independent variable (say, X1) takes a turn as the dependent variable.

\[X_{1i} = a_0 + a_1\; X_{2i} + a_2 \; X_{3i} + u_i\]

Note that the original dependent variable \((Y_i)\) is NOT in this equation!

The purpose of this auxiliary regression is to see whether there is a correlation structure between \(X_{1i}\) and the remaining right-hand side variables, no matter how many of them are involved. Conveniently, this regression already gives us an \(R^2\), which indicates exactly how much of the variation in the left-hand variable is explained by the right-hand variables.

The second step takes the \(R^2\) from this auxiliary regression (call it \(R_1^2\)) and calculates the VIF for independent variable \(X_{1i}\). Since this VIF describes how collinearity affects the precision of the estimate of \(\beta_1\) in the original regression, it is sometimes referred to as \(VIF(\hat{\beta}_1)\):

\[VIF(\hat{\beta}_1) = \frac{1}{1-R_1^2}\]

If we did this for every independent variable in the original regression, we would arrive at three VIF values.

\[X_{1i} = a_0 + a_1 \; X_{2i} + a_2\; X_{3i} + u_i \rightarrow VIF(\hat{\beta}_1) = \frac{1}{1-R_1^2}\]

\[X_{2i} = a_0 + a_1 \; X_{1i} + a_2 \; X_{3i} + u_i \rightarrow VIF(\hat{\beta}_2) = \frac{1}{1-R_2^2}\]

\[X_{3i} = a_0 + a_1 \; X_{1i} + a_2 \; X_{2i} + u_i \rightarrow VIF(\hat{\beta}_3) = \frac{1}{1-R_3^2}\]

where each \(R_j^2\) is the \(R^2\) from the auxiliary regression that has \(X_{ji}\) on the left-hand side.
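
As an illustration, here is the full two-step procedure in R on simulated data (the same hypothetical setup as the sketch above, not the textbook's data):

# Simulated data: X1 is nearly a linear combination of X2 and X3
set.seed(1)
n  <- 200
X2 <- rnorm(n)
X3 <- rnorm(n)
X1 <- X2 - X3 + rnorm(n, sd = 0.5)

# Step 1: each independent variable takes a turn as the dependent variable
R2_1 <- summary(lm(X1 ~ X2 + X3))$r.squared
R2_2 <- summary(lm(X2 ~ X1 + X3))$r.squared
R2_3 <- summary(lm(X3 ~ X1 + X2))$r.squared

# Step 2: convert each R-squared into a VIF
c(VIF_1 = 1 / (1 - R2_1), VIF_2 = 1 / (1 - R2_2), VIF_3 = 1 / (1 - R2_3))

The first VIF comes out well above 5 even though no pairwise correlation exceeded 0.80, which is exactly the situation VIFs are designed to detect.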

These VIF values measure how much collinearity inflates the variance of each estimated coefficient. For example, a VIF of 2 means that the variance of that coefficient estimate is twice as large as it would have been without collinearity (so its standard error is \(\sqrt{2}\) times as large). In order to determine if there is a problem, we again resort to an arbitrary threshold: \(VIF \geq 5\). Note that a VIF of 5 corresponds to an auxiliary \(R^2\) of 0.80, since \(1/(1-0.80)=5\), so this rule is in the same spirit as the 0.80 correlation threshold above.

9.3.1 An Application:

library(readxl)
MULTI2 <- read_excel("data/MULTI2.xlsx")
names(MULTI2)
## [1] "Team"          "League"        "Wins"          "ERA"          
## [5] "Runs"          "Hits_Allowed"  "Walks_Allowed" "Saves"        
## [9] "Errors"

Suppose that you want to explain why some baseball teams recorded more wins than others by looking at the season statistics listed above. Before we run a full regression with Wins as the dependent variable and the remaining variables as independent variables, we need to test for collinearity.

If we were to follow the steps above for each independent variable, we would need to calculate seven VIF values (Team isn’t a variable… it’s a name). This is a lot easier done than said in R:

# Estimate the 'intended' model:
REG <- lm(Wins ~ League + ERA + Runs + Hits_Allowed + 
            Walks_Allowed + Saves + Errors, data = MULTI2)

# Use the REG object to determine the VIFs:
library(car)
vif(REG)
##        League           ERA          Runs  Hits_Allowed Walks_Allowed 
##      1.221101     11.026091      1.279997      6.342662      3.342659 
##         Saves        Errors 
##      1.762577      1.548678

The output above shows a VIF for each of the independent variables. The largest are for ERA and Hits Allowed, and these are problematic given that they are above our threshold of 5.¹
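
As a sanity check, any one of these numbers can be reproduced with the two-step procedure by hand. Here is a sketch for ERA, assuming the MULTI2 data is loaded as above; the result should match the ERA entry of vif(REG):

# Auxiliary regression: ERA takes a turn as the dependent variable
AUX <- lm(ERA ~ League + Runs + Hits_Allowed + Walks_Allowed + 
            Saves + Errors, data = MULTI2)

# Convert the auxiliary R-squared into a VIF
1 / (1 - summary(AUX)$r.squared)

So now that we have detected collinearity… what do we do about it?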


  1. Note that the handy command vif is located in the car package. That is why we needed to load the car package using the library command. See the chapter on R basics for more details.↩︎