Dummy variables in R

There are several ways to estimate a model with dummy (indicator) independent variables in R:

  • Direct estimation with a character vector
    • Pro: easy to implement (if our data is already a character vector)
    • Con: difficult to change the reference level
  • Create 0-1 variable for estimation
    • Pro: we can manually specify the reference level
    • Con: cumbersome if we have many categories
  • Coerce the variable to a factor
    • Pro: handles variables with many categories well (cf. fixed effects); can also be used with numeric vectors
    • Con: factors can be tricky to handle, e.g. when changing the reference level

For this topic, we are going to use the following packages.

library(tidyverse)
library(forcats)

Estimation with Character Vectors

Suppose we want to estimate \[n = \beta_0 + \beta_1 Admit + \epsilon\] from the UCBAdmissions data.

UCBdata <- UCBAdmissions %>% as_tibble() # convert the table to a tibble; the counts end up in column n
lm(data = UCBdata,n ~ 1 + Admit)
## 
## Call:
## lm(formula = n ~ 1 + Admit, data = UCBdata)
## 
## Coefficients:
##   (Intercept)  AdmitRejected  
##        146.25          84.67

The coefficient’s name displayed in the console follows the format ‘variable name’ + ‘level coded as 1’. Therefore, “AdmitRejected” implies that R has coded Rejected as 1 and Admitted as 0.

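R picks the reference level alphabetically: when lm() converts a character vector to a factor, the levels are sorted, and the first one (here "Admitted") becomes the reference level.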
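
To check this ordering yourself, a minimal base-R sketch follows. It rebuilds the data with as.data.frame() instead of as_tibble() so that it runs without the tidyverse; the names ucb, admit_levels, and fit are illustrative and not part of the original code.

```r
# UCBAdmissions ships with R's datasets package.
# responseName = "n" mirrors the column name that as_tibble() produces.
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# factor() sorts the unique values; the first level is the reference level.
admit_levels <- levels(factor(ucb$Admit))
admit_levels  # "Admitted" "Rejected"

# Only the non-reference level gets a coefficient.
fit <- lm(n ~ 1 + Admit, data = ucb)
names(coef(fit))  # "(Intercept)" "AdmitRejected"
```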


Directly Creating 0-1 Variables

There are many ways to create 0-1 variables. I’m going to introduce two simple methods, using ifelse() and I().

Creating 0-1 variables with ifelse()

UCBdata$admitDum <- ifelse(UCBdata$Admit == "Admitted", 1, 0)
lm(data = UCBdata,n~1+admitDum)
## 
## Call:
## lm(formula = n ~ 1 + admitDum, data = UCBdata)
## 
## Coefficients:
## (Intercept)     admitDum  
##      230.92       -84.67

Creating 0-1 variables with I()

Another method uses I(). The comparison inside I() returns a logical vector, and lm() codes TRUE as 1 and FALSE as 0. This requires only a tiny bit less typing than ifelse().

I(UCBdata$Admit=="Admitted")
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
lm(data = UCBdata,n~1+I(Admit=="Admitted"))
## 
## Call:
## lm(formula = n ~ 1 + I(Admit == "Admitted"), data = UCBdata)
## 
## Coefficients:
##                (Intercept)  I(Admit == "Admitted")TRUE  
##                     230.92                      -84.67
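
As a quick check that the two approaches describe the same regressor, the sketch below fits both models and compares the estimates. It is written in base R so it stands alone; ucb, fit_ifelse, and fit_logical are illustrative names.

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# Approach 1: explicit 0-1 column built with ifelse()
ucb$admitDum <- ifelse(ucb$Admit == "Admitted", 1, 0)
fit_ifelse <- lm(n ~ 1 + admitDum, data = ucb)

# Approach 2: logical comparison wrapped in I() inside the formula
fit_logical <- lm(n ~ 1 + I(Admit == "Admitted"), data = ucb)

# TRUE/FALSE map to 1/0, so both columns carry the same information
same_column <- identical(as.integer(ucb$Admit == "Admitted"),
                         as.integer(ucb$admitDum))

# The coefficient names differ, but the estimates agree
same_coef <- isTRUE(all.equal(unname(coef(fit_ifelse)),
                              unname(coef(fit_logical))))
c(same_column, same_coef)
```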

Creating dummy variables by hand can be cumbersome if we have many categories (cf. fixed effects) or many categorical variables.
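
To get a feel for how quickly the number of dummies grows, the sketch below asks R for the design matrix it would build for Dept, which has six categories in UCBAdmissions. This uses model.matrix() from base R purely as an inspection tool; the names ucb and X are illustrative.

```r
ucb <- as.data.frame(UCBAdmissions, responseName = "n",
                     stringsAsFactors = FALSE)

# model.matrix() returns the columns lm() would use for this formula
X <- model.matrix(~ 1 + Dept, data = ucb)

# One intercept plus one 0-1 column per non-reference department
colnames(X)
dim(X)
```

Writing those five columns by hand with ifelse() is possible, but it is exactly the repetitive work the text warns about.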


Estimation with Integer and Numerical Vectors

Let’s use the mtcars data set. Suppose we want to estimate \[mpg = \alpha + \beta_1I(cyl=6) +\beta_2I(cyl=8) + \epsilon\]

head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

The following code runs, but it does not estimate the model above. Why?

lm(data = mtcars, mpg~1+cyl)
## 
## Call:
## lm(formula = mpg ~ 1 + cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876

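Because cyl is stored as a number, lm() treats it as a continuous regressor and estimates a single slope. We can use the I() function to create the indicators instead.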

lm(data = mtcars,mpg~1+I(cyl==6)+I(cyl==8))
## 
## Call:
## lm(formula = mpg ~ 1 + I(cyl == 6) + I(cyl == 8), data = mtcars)
## 
## Coefficients:
##     (Intercept)  I(cyl == 6)TRUE  I(cyl == 8)TRUE  
##          26.664           -6.921          -11.564
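
One way to read these coefficients: with only dummies on the right-hand side, the intercept is the mean mpg of the reference group (4 cylinders) and each coefficient is the difference between a group's mean and that reference mean. The sketch below checks this against the group means; group_means, fit, b, and implied_means are illustrative names.

```r
# Average mpg within each cylinder group
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

fit <- lm(mpg ~ 1 + I(cyl == 6) + I(cyl == 8), data = mtcars)
b <- unname(coef(fit))

# Intercept = mean of 4-cylinder cars;
# intercept + dummy coefficient = mean of the 6- or 8-cylinder cars
implied_means <- c(b[1], b[1] + b[2], b[1] + b[3])

round(implied_means, 3)
round(unname(as.vector(group_means)), 3)
```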

However, if our variable contains many categories, this may be time-consuming. One easy fix is to coerce the variable into a factor.


Estimation with a Factor Variable

A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor.

If we have a vector of strings or integers, we can create a categorical variable by using the command factor().

strVec <- c("Win", "Win", "Lose", "Tie", "Win", "Lose")
f <- factor(strVec)
f
## [1] Win  Win  Lose Tie  Win  Lose
## Levels: Lose Tie Win

Notice that when we printed the factor, f, R did not put quotes around the values. They are levels, not strings. R also displays the distinct levels below the factor.
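
A short sketch of what a factor stores may help here. Internally, a factor holds integer codes that index into its levels, and the levels argument of factor() lets us choose their order when the factor is created. The name f_custom is illustrative.

```r
strVec <- c("Win", "Win", "Lose", "Tie", "Win", "Lose")
f <- factor(strVec)

levels(f)      # "Lose" "Tie" "Win" (sorted alphabetically by default)
as.integer(f)  # 3 3 1 2 3 1 (position of each value in levels(f))
nlevels(f)     # 3

# Supplying levels explicitly fixes the order; the first level
# is the one lm() would use as the reference category.
f_custom <- factor(strVec, levels = c("Win", "Tie", "Lose"))
levels(f_custom)  # "Win" "Tie" "Lose"
```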

Going back to the previous question, we can coerce cyl into a factor.

lm(data = mtcars,mpg~1+factor(cyl))
## 
## Call:
## lm(formula = mpg ~ 1 + factor(cyl), data = mtcars)
## 
## Coefficients:
##  (Intercept)  factor(cyl)6  factor(cyl)8  
##       26.664        -6.921       -11.564

This will be useful when we estimate models with many fixed effects.

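Dummy Interaction Variables

In a formula, a*b expands to a + b + a:b. With numeric cyl, cyl*hp adds a single product term cyl:hp. With factor(cyl), each non-reference level gets its own intercept shift and its own change in the hp slope.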

lm(data = mtcars, mpg~1+cyl*hp)
## 
## Call:
## lm(formula = mpg ~ 1 + cyl * hp, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl           hp       cyl:hp  
##    50.75121     -4.11914     -0.17068      0.01974
lm(data = mtcars, mpg~1+factor(cyl)*hp)
## 
## Call:
## lm(formula = mpg ~ 1 + factor(cyl) * hp, data = mtcars)
## 
## Coefficients:
##     (Intercept)     factor(cyl)6     factor(cyl)8               hp  
##        35.98303        -15.30917        -17.90295         -0.11278  
## factor(cyl)6:hp  factor(cyl)8:hp  
##         0.10516          0.09853
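
To interpret the factor interaction, it can help to recover the hp slope for each cylinder group: the hp coefficient is the slope for the reference group (4 cylinders), and adding an interaction coefficient gives the slope for that group. The sketch below computes these and, as a sanity check, compares them with regressions run separately on each group's subset. The names slopes and slope_by_group are illustrative.

```r
fit <- lm(mpg ~ 1 + factor(cyl) * hp, data = mtcars)
b <- coef(fit)

# hp slope implied by the interaction model for each group
slopes <- c(
  cyl4 = unname(b["hp"]),
  cyl6 = unname(b["hp"] + b["factor(cyl)6:hp"]),
  cyl8 = unname(b["hp"] + b["factor(cyl)8:hp"])
)

# The same slopes from a separate regression within each group
slope_by_group <- sapply(c(cyl4 = 4, cyl6 = 6, cyl8 = 8), function(k) {
  coef(lm(mpg ~ 1 + hp, data = mtcars[mtcars$cyl == k, ]))[["hp"]]
})

rbind(slopes, slope_by_group)
```

The two rows agree because the model interacts cyl with both the intercept and hp, which makes it equivalent (for the coefficients) to fitting one line per group.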

Reordering Factor Levels

factor(mtcars$cyl)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8

Originally, 4-cylinder is the base case. Suppose we prefer 8-cylinder cars to be the base case. We can reorder the levels with fct_relevel() from the forcats package.

cylf <- mtcars$cyl %>%
  factor() %>%
  fct_relevel("8") # move level "8" to the front so it becomes the reference level
print(cylf)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 8 4 6

Notice that the data order does not change. Therefore, we don’t have to worry about rearranging other variables.

Since the releveled factor was saved in the new variable cylf, we need to estimate with that variable instead.

lm(mtcars$mpg~1+cylf)
## 
## Call:
## lm(formula = mtcars$mpg ~ 1 + cylf)
## 
## Coefficients:
## (Intercept)        cylf4        cylf6  
##      15.100       11.564        4.643

Now the 8-cylinder cars have become the base case.
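
For comparison, base R offers relevel(), which does the same job without forcats. The sketch below is one possible variant, not the approach used above: it stores the releveled factor as a column of a copy of the data (mtcars2 is an illustrative name) so that lm() can take everything from a single data argument.

```r
# relevel() moves the level named in ref to the first position
mtcars2 <- transform(mtcars, cylf = relevel(factor(cyl), ref = "8"))
levels(mtcars2$cylf)  # "8" "4" "6"

fit <- lm(mpg ~ 1 + cylf, data = mtcars2)
names(coef(fit))  # "(Intercept)" "cylf4" "cylf6"

# The intercept is now the mean mpg of the 8-cylinder cars
mean_8 <- mean(mtcars$mpg[mtcars$cyl == 8])
c(intercept = unname(coef(fit)[1]), mean_8 = mean_8)
```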