10.3 What if there are more than two categories?
Since a dummy variable can take on either a zero or a one, it is perfectly designed to identify two categories. This might be fine for some variables like yes / no or win / lose, but what if a variable has more than two categories? Examples would be direct extensions of the above variables: yes / no / maybe or win / lose / draw.
The rule of thumb (to be explained in detail soon) is:
A variable containing \(N\) categories requires \(N-1\) dummy variables.
This rule also covers the two-category case, because we can model \(N=2\) categories with \(N-1=1\) dummy variable. In our example above, we wanted to identify 2 categories of gender (male or female), so we needed 1 dummy variable. With more than two categories, however, we need to take a little more care and follow some additional steps. Suppose we extend our gender variable to include a third category (other) in order to account for individuals who do not identify with either of the two traditional categories. We will use this scenario to illustrate how our model gets extended.
- Identify a benchmark category
A benchmark category is the category that the researcher chooses as the reference against which all other categories are compared. In our gender example, we choose male as our benchmark category because we have been comparing females to males. This choice is arbitrary in the sense that it does not change the fit of the model, but it does determine which comparisons the estimated coefficients report directly.
- Construct appropriate dummy variables
Once the benchmark category has been established as male, we need two dummy variables: one that identifies individuals as female and one that identifies individuals as other.
\[female_i = 1 \mbox{ if female; }\; 0 \mbox{ if male or other}\]
\[other_i = 1 \mbox{ if other; }\; 0 \mbox{ if male or female}\]
Note that each dummy variable is still a switch that signals the presence or absence of a characteristic. When both dummy variables are zero at the same time, however, the individual belongs to the benchmark category. That is how you can identify three categories with only two dummy variables.
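The construction above can be sketched in a few lines of Python. This is only an illustration: the sample of gender labels and the helper names `make_dummies` and `decode` are made up for this example and are not part of the text's data or software.

```python
# Hypothetical sample of gender labels (made up for illustration).
genders = ["male", "female", "other", "female", "male"]

BENCHMARK = "male"  # the benchmark category gets no dummy of its own


def make_dummies(values, benchmark):
    """Return one 0/1 list per non-benchmark category."""
    categories = sorted(set(values) - {benchmark})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


dummies = make_dummies(genders, BENCHMARK)
female = dummies["female"]
other = dummies["other"]


def decode(f, o, benchmark=BENCHMARK):
    """Recover the category from the two dummies alone."""
    if f == 1:
        return "female"
    if o == 1:
        return "other"
    return benchmark  # both switches are off


recovered = [decode(f, o) for f, o in zip(female, other)]

print(female)     # dummy for female
print(other)      # dummy for other
print(recovered)  # matches the original labels
```

Three categories produce only two columns, and the rows where both columns are zero are exactly the individuals in the benchmark category.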
To illustrate, consider the original model, restricting attention to intercept dummies.
\[wage_i=\beta_0+\beta_1\;educ_i+\beta_2\;exper_i+\beta_3\;tenure_i+\beta_4\;female_i +\beta_5\;other_i +\varepsilon_i\]
We can write down what the model looks like for each of our three categories:
\[\mbox{Male: }\; wage_i=\beta_0+\beta_1\;educ_i+\beta_2\;exper_i+\beta_3\;tenure_i+\varepsilon_i\]
\[\mbox{Female: }\; wage_i=(\beta_0+\beta_4)+\beta_1\;educ_i+\beta_2\;exper_i+\beta_3\;tenure_i+\varepsilon_i\]
\[\mbox{Other: }\; wage_i=(\beta_0+\beta_5)+\beta_1\;educ_i+\beta_2\;exper_i+\beta_3\;tenure_i+\varepsilon_i\]
When comparing these three equations, you can see how the benchmark category comes into play. The first equation is the benchmark equation, indicating that \(\beta_0\) is the intercept term for males. The second equation is for females, and shows how the intercept for females differs from that for males (the difference is \(\beta_4\)). The third equation is for those identifying as other, and shows how the intercept for these individuals differs from that for males (the difference is \(\beta_5\)). Note that the slopes are assumed to be identical across the three categories here (although we could add slope dummies as above).
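A small numerical sketch can make the three intercepts concrete. The coefficient values below are hypothetical numbers chosen only for illustration (they are not estimates from any data set), and `predicted_wage` is a helper defined just for this example.

```python
# Hypothetical coefficient values (NOT estimates), chosen for illustration.
beta = {
    "const": 2.0,     # beta_0: intercept for the benchmark (male)
    "educ": 0.5,      # beta_1
    "exper": 0.25,    # beta_2
    "tenure": 0.125,  # beta_3
    "female": -1.5,   # beta_4: intercept shift for female vs. male
    "other": -0.75,   # beta_5: intercept shift for other vs. male
}


def predicted_wage(educ, exper, tenure, female, other):
    """Systematic part of the wage equation (error term omitted)."""
    return (beta["const"]
            + beta["educ"] * educ
            + beta["exper"] * exper
            + beta["tenure"] * tenure
            + beta["female"] * female
            + beta["other"] * other)


# Hold the other regressors fixed and flip only the dummy switches.
x = dict(educ=12, exper=4, tenure=2)
wage_male = predicted_wage(**x, female=0, other=0)
wage_female = predicted_wage(**x, female=1, other=0)
wage_other = predicted_wage(**x, female=0, other=1)

print(wage_female - wage_male)  # equals beta_4
print(wage_other - wage_male)   # equals beta_5
```

Because the slopes are shared, the gaps relative to the benchmark do not depend on the values chosen for education, experience, or tenure.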
One detail worth mentioning about the application above is that the coefficients \(\beta_4\) and \(\beta_5\) show how each category compares to the benchmark category. We can test whether these coefficients are significantly different from zero with standard hypothesis tests. For example:
\[H_0: \; \beta_4 = 0 \quad H_1: \; \beta_4 \neq 0\]
However, if we find that \(\beta_4\) and \(\beta_5\) are significantly different from zero, we can only conclude that the intercepts for female and other individuals differ from the intercept for males (because male is the benchmark category). We cannot determine whether female and other are significantly different from each other without a joint hypothesis test (examined below) or a new choice of benchmark category. For example, you can easily change the benchmark category to female and end up with a formal test of the difference between female and other.
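To see why changing the benchmark works, it may help to write out the re-benchmarked model. The dummy \(male_i\) (1 if male, 0 if female or other) and the \(\gamma\) coefficients below are notation introduced only for this sketch. With female as the benchmark, the model becomes:

\[wage_i=\gamma_0+\beta_1\;educ_i+\beta_2\;exper_i+\beta_3\;tenure_i+\gamma_4\;male_i +\gamma_5\;other_i +\varepsilon_i\]

Matching the intercepts of the three categories to those in the original model gives:

\[\gamma_0 = \beta_0+\beta_4, \quad \gamma_4 = -\beta_4, \quad \gamma_5 = \beta_5-\beta_4\]

The coefficient on \(other_i\) now measures the difference between other and female directly, so a standard test of \(H_0: \gamma_5 = 0\) is the same as a test of \(H_0: \beta_4 = \beta_5\) in the original model.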