There are many ways to estimate models with dummy independent variables in R.
For this topic, we are going to use the following packages.
library(tidyverse)
library(forcats)
Suppose we want to estimate \[n = \beta_0 + \beta_1 Admit + \epsilon\] from the UCBAdmissions data.
UCBdata <- UCBAdmissions %>% as_tibble() #Little bit of cleaning
lm(data = UCBdata,n ~ 1 + Admit)
##
## Call:
## lm(formula = n ~ 1 + Admit, data = UCBdata)
##
## Coefficients:
## (Intercept) AdmitRejected
## 146.25 84.67
The coefficient’s name displayed in the console follows the format ‘variable name’ + ‘level coded as 1’. Therefore, “AdmitRejected” implies that R has coded Rejected as 1 and Admitted as 0.
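R picks the reference level alphabetically. Admit is a character variable, so lm() converts it to a factor whose levels are sorted, and the first level (“Admitted”) becomes the reference.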
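A quick way to check this rule, as a minimal sketch using only base R: factor() sorts the distinct values by default, and the first level is the one coded as 0. The vector admit below is a made-up example, not the original data.

```r
# Hypothetical character vector with the same two values as Admit
admit <- c("Rejected", "Admitted", "Rejected", "Admitted")

# factor() sorts the distinct values, so "Admitted" becomes the first level
admit_f <- factor(admit)
levels(admit_f)

# The first level is the reference (coded 0) under R's default treatment contrasts
ref_level <- levels(admit_f)[1]
ref_level
```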
There are many ways to create 0-1 variables. I’m going to introduce two simple methods, with ifelse() and I().
ifelse()
UCBdata$admitDum <- ifelse(UCBdata$Admit == "Admitted", 1, 0)
lm(data = UCBdata,n~1+admitDum)
##
## Call:
## lm(formula = n ~ 1 + admitDum, data = UCBdata)
##
## Coefficients:
## (Intercept) admitDum
## 230.92 -84.67
I()
Another method is to use I(). This requires only a tiny bit less typing than the one with ifelse().
I(UCBdata$Admit=="Admitted")
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
lm(data = UCBdata,n~1+I(Admit=="Admitted"))
##
## Call:
## lm(formula = n ~ 1 + I(Admit == "Admitted"), data = UCBdata)
##
## Coefficients:
## (Intercept) I(Admit == "Admitted")TRUE
## 230.92 -84.67
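To see what these coefficients measure, the sketch below compares them with the group means of the counts. It uses base R only and rebuilds the data with as.data.frame(), so the count column is called Freq rather than n; the names ucb, mean_adm, mean_rej and fit are introduced here for illustration. With a single dummy regressor, the intercept is the mean of the group coded 0 and the slope is the difference in means.

```r
# Rebuild the data with base R; columns are Admit, Gender, Dept, Freq
ucb <- as.data.frame(UCBAdmissions)

# Mean count for each admission status
mean_adm <- mean(ucb$Freq[ucb$Admit == "Admitted"])
mean_rej <- mean(ucb$Freq[ucb$Admit == "Rejected"])

# Dummy equal to 1 for admitted cells, 0 for rejected cells
ucb$admitDum <- as.integer(ucb$Admit == "Admitted")
fit <- lm(Freq ~ 1 + admitDum, data = ucb)

# Intercept = mean of the Rejected group (dummy = 0)
# Slope     = mean(Admitted) - mean(Rejected)
coef(fit)
c(mean_rej = mean_rej, difference = mean_adm - mean_rej)
```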
Creating dummy variables by hand can be cumbersome if we have many categories (cf. fixed effects) or many categorical variables.
Let’s use the mtcars data set. Suppose we want to estimate \[mpg = \alpha + \beta_1I(cyl=6) +\beta_2I(cyl=8) + \epsilon\]
head(mtcars, 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
The following code runs, but it does not estimate the model above. Why?
lm(data = mtcars, mpg~1+cyl)
##
## Call:
## lm(formula = mpg ~ 1 + cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
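One way to see what happened, as a sketch with base R only: cyl is stored as a number, so lm() fits a single slope for it instead of one dummy per category. The name fit_numeric is introduced here for illustration.

```r
# cyl is numeric, so lm() treats it as one continuous regressor
class(mtcars$cyl)

# It takes only three distinct values, which we want to treat as categories
sort(unique(mtcars$cyl))

# One slope for cyl: the model forces the 4-to-6 and 6-to-8 gaps to be equal
fit_numeric <- lm(mpg ~ 1 + cyl, data = mtcars)
length(coef(fit_numeric))  # intercept plus a single slope
```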
We may use the I() function to do the job.
lm(data = mtcars,mpg~1+I(cyl==6)+I(cyl==8))
##
## Call:
## lm(formula = mpg ~ 1 + I(cyl == 6) + I(cyl == 8), data = mtcars)
##
## Coefficients:
## (Intercept) I(cyl == 6)TRUE I(cyl == 8)TRUE
## 26.664 -6.921 -11.564
However, if our variable contains many categories, this may be time-consuming. One easy fix is to coerce the variable into a factor.
A factor looks like a vector, but it has special properties. R keeps track of the unique values in a vector, and each unique value is called a level of the associated factor.
If we have a vector of strings or integers, we can create a categorical variable by using the command factor().
strVec <- c("Win", "Win", "Lose", "Tie", "Win", "Lose")
f <- factor(strVec)
f
## [1] Win Win Lose Tie Win Lose
## Levels: Lose Tie Win
Notice that when we printed the factor, f, R did not put quotes around the values. They are levels, not strings. R also displays the distinct levels below the factor.
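As a small sketch of what R stores behind a factor (base R only, reusing the strVec example): each value is kept as an integer code that points into the sorted vector of levels. The levels argument of factor() lets us choose a different order; the ordering Win, Tie, Lose and the name f2 below are only an illustration.

```r
strVec <- c("Win", "Win", "Lose", "Tie", "Win", "Lose")
f <- factor(strVec)

levels(f)       # distinct values, sorted alphabetically
as.integer(f)   # integer codes: position of each value in levels(f)
table(f)        # number of observations per level

# Set the level order explicitly; the first level is the base case in lm()
f2 <- factor(strVec, levels = c("Win", "Tie", "Lose"))
levels(f2)
```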
Going back to the previous question, we can coerce cyl into a factor.
lm(data = mtcars,mpg~1+factor(cyl))
##
## Call:
## lm(formula = mpg ~ 1 + factor(cyl), data = mtcars)
##
## Coefficients:
## (Intercept) factor(cyl)6 factor(cyl)8
## 26.664 -6.921 -11.564
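To inspect the dummy columns that factor() generates behind the scenes, one option is model.matrix() from base R, which returns the design matrix built from a formula. This is a sketch for inspection only; the names X, fit_factor and fit_manual are introduced here.

```r
# Design matrix for ~ 1 + factor(cyl): an intercept plus one 0-1 column
# for every level except the base case (4 cylinders)
X <- model.matrix(~ 1 + factor(cyl), data = mtcars)
head(X, 3)

# The factor() fit matches the fit with hand-made I() dummies
fit_factor <- lm(mpg ~ 1 + factor(cyl), data = mtcars)
fit_manual <- lm(mpg ~ 1 + I(cyl == 6) + I(cyl == 8), data = mtcars)
all.equal(unname(coef(fit_factor)), unname(coef(fit_manual)))
```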
This will be useful when we estimate models with many fixed effects.
lm(data = mtcars, mpg~1+cyl*hp)
##
## Call:
## lm(formula = mpg ~ 1 + cyl * hp, data = mtcars)
##
## Coefficients:
## (Intercept) cyl hp cyl:hp
## 50.75121 -4.11914 -0.17068 0.01974
lm(data = mtcars, mpg~1+factor(cyl)*hp)
##
## Call:
## lm(formula = mpg ~ 1 + factor(cyl) * hp, data = mtcars)
##
## Coefficients:
## (Intercept) factor(cyl)6 factor(cyl)8 hp
## 35.98303 -15.30917 -17.90295 -0.11278
## factor(cyl)6:hp factor(cyl)8:hp
## 0.10516 0.09853
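A sketch of how to read the interacted model with factor(cyl): the hp coefficient is the slope for the base case (4 cylinders), and each interaction term is the change in that slope for the corresponding group. Because the model is fully interacted, the implied slope for a group should match a separate regression on that group alone. The names fit_int, b, slope_4, slope_6, slope_8 and fit_6 are introduced here for illustration.

```r
fit_int <- lm(mpg ~ 1 + factor(cyl) * hp, data = mtcars)
b <- coef(fit_int)

# hp slope for each group: base slope plus the group's interaction term
slope_4 <- b[["hp"]]
slope_6 <- b[["hp"]] + b[["factor(cyl)6:hp"]]
slope_8 <- b[["hp"]] + b[["factor(cyl)8:hp"]]
c(cyl4 = slope_4, cyl6 = slope_6, cyl8 = slope_8)

# Same slope as regressing mpg on hp within the 6-cylinder cars only
fit_6 <- lm(mpg ~ 1 + hp, data = subset(mtcars, cyl == 6))
coef(fit_6)[["hp"]]
```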
factor(mtcars$cyl)
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8
Originally, the 4-cylinder car is the base case. Suppose we prefer 8-cylinder cars to be the base case. We can reorder the factor with fct_relevel() from the forcats package.
cylf <- mtcars$cyl %>%
  factor() %>%
  fct_relevel("8") # move level "8" to the front so it becomes the base case
print(cylf)
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 8 4 6
Notice that the data order does not change. Therefore, we don’t have to worry about rearranging other variables.
Since the releveled factor was saved in a new variable, cylf, we need to estimate the model with that variable instead.
lm(mtcars$mpg~1+cylf)
##
## Call:
## lm(formula = mtcars$mpg ~ 1 + cylf)
##
## Coefficients:
## (Intercept) cylf4 cylf6
## 15.100 11.564 4.643
Now the 8-cylinder car has become the base case.
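For comparison, base R offers relevel(), which changes the reference level without forcats. The sketch below is an alternative, not the approach used above; cylf_base and fit_base are names introduced here for illustration.

```r
# Make "8" the reference level using base R only
cylf_base <- relevel(factor(mtcars$cyl), ref = "8")
levels(cylf_base)

# The intercept is now the mean mpg of the 8-cylinder cars
fit_base <- lm(mtcars$mpg ~ 1 + cylf_base)
coef(fit_base)
mean(mtcars$mpg[mtcars$cyl == 8])
```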