Consider \[ y = \beta_0 + \beta_1 x + u \]
When \(x\) and \(u\) are correlated, the OLS estimator is biased and inconsistent. IV methods can deliver a consistent estimator, although the IV estimator is never unbiased.
For most purposes, consistency is enough. As the Nobel-prize-winning econometrician Clive W. J. Granger once said, “If you can’t get it right as \(n\) goes to infinity, you shouldn’t be in this business.”
An instrument \(z\) must satisfy two requirements. First, exogeneity: \(z\) should have no partial effect on \(y\) (once \(x\) is accounted for) and should be uncorrelated with the omitted variables. Second, relevance: \(z\) must be correlated with \(x\).
We can test the relevance assumption, but we cannot test exogeneity. We must maintain exogeneity by appealing to economic behavior or introspection.
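A quick simulation with made-up numbers (everything below is illustrative, not from the data used later) shows both claims: OLS is biased when \(x\) and \(u\) are correlated, while the simple IV estimator \(\widehat{Cov}(z, y)/\widehat{Cov}(z, x)\) recovers the true slope.
set.seed(1)
n <- 10000
z <- rnorm(n)                       # instrument: shifts x, unrelated to u
u <- rnorm(n)                       # error term
x <- 0.5 * z + 0.8 * u + rnorm(n)   # x is endogenous: correlated with u
y <- 1 + 2 * x + u                  # true beta1 = 2
coef(lm(y ~ x))["x"]                # OLS slope: biased upward
cov(z, y) / cov(z, x)               # IV estimate: close to 2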
Besides wooldridge and tidyverse, another package we have to install for estimation with IV is AER. The regression tables below are produced with tab_model() from the sjPlot package, so we load that as well.
library(wooldridge)
library(tidyverse)
library(AER)
library(sjPlot)  # for tab_model()
The main function we are going to use is ivreg(). After loading the packages into our R session, let’s see how estimation works.
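As a quick orientation, ivreg() uses a two-part formula: the structural regressors go before the |, and the instruments go after it, with every exogenous regressor repeated on both sides. A schematic sketch with hypothetical names:
# schematic only (hypothetical names): x1 endogenous, x2 exogenous, z1 the instrument
# ivreg(y ~ x1 + x2 | z1 + x2, data = mydata)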
For baseline, suppose we want to estimate the return to education in the simple regression model.
\[ \log(wage) = \beta_0 + \beta_1 educ + u \]
For comparison, let’s obtain the OLS estimates.
olsmod <- lm(data = mroz, lwage ~ 1 + educ)
tab_model(olsmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | -0.19 | 0.19 | -1.00 |
| educ | 0.11 *** | 0.01 | 7.55 |
| Observations | 428 | | |
| R2 / R2 adjusted | 0.118 / 0.116 | | |
There may be an omitted variable in this estimating equation, e.g. ability, so \(educ\) is likely endogenous.
Suppose we use father’s education as an instrument for \(educ\). We have to maintain that father’s education is uncorrelated with \(u\) (exogeneity).
Next, we have to confirm that \(fatheduc\) is correlated with \(educ\) (relevance). We can check this by running a simple regression of \(educ\) on \(fatheduc\).
# make sure that the sample is consistent with the previous equation
data151 <- mroz %>%
select(lwage, educ, fatheduc) %>%
na.omit()
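# the estimation sample should match the 428 observations used in the OLS fit
nrow(data151)
## [1] 428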
# run regression
stage1 <- lm(data = data151, educ ~ 1 + fatheduc)
tab_model(stage1, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: educ

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 10.24 *** | 0.28 | 37.10 |
| fatheduc | 0.27 *** | 0.03 | 9.43 |
| Observations | 428 | | |
| R2 / R2 adjusted | 0.173 / 0.171 | | |
The t statistic on \(fatheduc\) is 9.43, suggesting a statistically significant correlation between \(fatheduc\) and \(educ\).
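Relatedly, a common rule of thumb (Staiger and Stock, 1997) asks for a first-stage F statistic above 10. Since the first stage here is a simple regression with a single instrument, we can read the F statistic straight off the fitted model:
# first-stage F statistic: value, numerator df, denominator df
summary(stage1)$fstatistic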
Then we can estimate the equation using \(fatheduc\) as an IV.
ivmod <- ivreg(data = data151,
lwage ~ 1 + educ | 1 + fatheduc)
tab_model(ivmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 0.44 | 0.45 | 0.99 |
| educ | 0.06 | 0.04 | 1.68 |
| Observations | 428 | | |
| R2 / R2 adjusted | 0.093 / 0.091 | | |
The IV estimate of the return to education is now 5.9%, only around half of the OLS estimate. This suggests that the OLS estimate is too high, consistent with omitted ability bias.
Note that the standard error of the IV estimate is also more than twice that of the OLS estimate. Recall that the variance of the IV estimator depends on the correlation between \(x\) and \(z\) as well: if \(z\) explains little of the variation in \(x\), the standard error of the IV estimate can be much larger.
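The AER package can also report standard IV diagnostics alongside the coefficient estimates: a weak-instrument (first-stage) F test, the Wu-Hausman endogeneity test, and, when there are more instruments than endogenous regressors, the Sargan overidentification test.
# request diagnostic tests in the model summary
summary(ivmod, diagnostics = TRUE)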
How can we estimate with IV if we have other exogenous variables? Let’s look at the following example.
Use the card data from the wooldridge package to estimate the following equation \[
\log(wage) = \beta_0 + \beta_1 educ + X\gamma + u,
\] where \(X\) contains \(exper\), \(exper^2\), \(black\), \(smsa\), \(south\), a full set of regional dummy variables, and an SMSA dummy for where the man was living in 1966.
# build the list of regional dummies reg662-reg669
allreg <- paste0("reg", 662:669)
regvar <- paste(allreg, collapse = "+")
demovar <- paste("exper", "expersq", "black",
                 "smsa", "south", "smsa66", sep = "+")
controls <- paste(demovar, regvar, sep = "+")
fmla <- as.formula(paste("lwage ~ educ + ", controls))
olsmod <- lm(data = card, formula = fmla)
tab_model(olsmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 4.62 *** | 0.07 | 62.25 |
| educ | 0.07 *** | 0.00 | 21.35 |
| exper | 0.08 *** | 0.01 | 12.81 |
| expersq | -0.00 *** | 0.00 | -7.22 |
| black | -0.20 *** | 0.02 | -10.91 |
| smsa | 0.14 *** | 0.02 | 6.79 |
| south | -0.15 *** | 0.03 | -5.69 |
| smsa66 | 0.03 | 0.02 | 1.35 |
| reg662 | 0.10 ** | 0.04 | 2.68 |
| reg663 | 0.14 *** | 0.04 | 4.12 |
| reg664 | 0.06 | 0.04 | 1.32 |
| reg665 | 0.13 ** | 0.04 | 3.06 |
| reg666 | 0.14 ** | 0.05 | 3.11 |
| reg667 | 0.12 ** | 0.04 | 2.63 |
| reg668 | -0.06 | 0.05 | -1.10 |
| reg669 | 0.12 ** | 0.04 | 3.05 |
| Observations | 3010 | | |
| R2 / R2 adjusted | 0.300 / 0.296 | | |
If we have many dummy variables with similar names, we can use paste0() to construct the variable names efficiently.
paste0("reg", 662:669)
## [1] "reg662" "reg663" "reg664" "reg665" "reg666" "reg667" "reg668" "reg669"
We can collapse them into a single string for the formula by using paste().
paste(allreg, collapse = "+")
## [1] "reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669"
We can transform a string into a formula with as.formula().
Keeping the controls in a single object like this is useful when we want to reuse the same set of variables multiple times, as we do in the first stage below.
fmla <- as.formula(paste("educ ~ 1 + nearc4 + ", controls))
stage1 <- lm(data = card, formula = fmla)
tab_model(stage1, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: educ

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 16.64 *** | 0.24 | 69.14 |
| nearc4 | 0.32 *** | 0.09 | 3.64 |
| exper | -0.41 *** | 0.03 | -12.24 |
| expersq | 0.00 | 0.00 | 0.53 |
| black | -0.94 *** | 0.09 | -9.98 |
| smsa | 0.40 *** | 0.10 | 3.84 |
| south | -0.05 | 0.14 | -0.38 |
| smsa66 | 0.03 | 0.11 | 0.24 |
| reg662 | -0.08 | 0.19 | -0.42 |
| reg663 | -0.03 | 0.18 | -0.15 |
| reg664 | 0.12 | 0.22 | 0.54 |
| reg665 | -0.27 | 0.22 | -1.25 |
| reg666 | -0.30 | 0.24 | -1.28 |
| reg667 | -0.22 | 0.23 | -0.93 |
| reg668 | 0.52 | 0.27 | 1.96 |
| reg669 | 0.21 | 0.20 | 1.04 |
| Observations | 3010 | | |
| R2 / R2 adjusted | 0.477 / 0.474 | | |
Holding other factors fixed, people who lived near a four-year college in 1966 had about 0.32 years more education than those who did not.
The t statistic on \(nearc4\) is 3.64, so the instrument is clearly relevant; provided \(nearc4\) is uncorrelated with the unobserved factors in the error term, we can use it as an IV for \(educ\).
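Because \(nearc4\) is the only excluded instrument here, the first-stage F statistic for the instrument is simply the square of its t statistic:
# F statistic for the excluded instrument: t^2 = 3.64^2, roughly 13.2
coef(summary(stage1))["nearc4", "t value"]^2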
fmla <- as.formula(paste("lwage ~ 1 + educ + ", controls, "|",
paste("1 + nearc4 + ", controls)))
ivmod <- ivreg(data = card, formula = fmla)
tab_model(ivmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 3.67 *** | 0.92 | 3.96 |
| educ | 0.13 * | 0.05 | 2.39 |
| exper | 0.11 *** | 0.02 | 4.58 |
| expersq | -0.00 *** | 0.00 | -7.00 |
| black | -0.15 ** | 0.05 | -2.72 |
| smsa | 0.11 *** | 0.03 | 3.53 |
| south | -0.14 *** | 0.03 | -5.30 |
| smsa66 | 0.02 | 0.02 | 0.86 |
| reg662 | 0.10 ** | 0.04 | 2.67 |
| reg663 | 0.15 *** | 0.04 | 4.03 |
| reg664 | 0.05 | 0.04 | 1.14 |
| reg665 | 0.15 ** | 0.05 | 3.11 |
| reg666 | 0.16 ** | 0.05 | 3.14 |
| reg667 | 0.13 ** | 0.05 | 2.72 |
| reg668 | -0.08 | 0.06 | -1.40 |
| reg669 | 0.11 ** | 0.04 | 2.58 |
| Observations | 3010 | | |
| R2 / R2 adjusted | 0.238 / 0.234 | | |
The IV estimate is almost twice as large as the OLS estimate, but its standard error is over 18 times larger. A wider confidence interval is the price we must pay for a consistent estimator.
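To see the price in confidence-interval terms, we can compare the 95% intervals for the \(educ\) coefficient (for the ivreg fit, confint() falls back to the default method based on the estimated coefficients and their covariance matrix):
# 95% confidence intervals for the return to education
confint(olsmod)["educ", ]
confint(ivmod)["educ", ]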
summary(olsmod)$r.squared
## [1] 0.2998365
summary(ivmod)$r.squared
## [1] 0.2381655
Although \(R^2\) is larger for OLS, that does not make OLS the better model. The \(R^2\) from OLS estimation is always larger because OLS minimizes the sum of squared residuals.
Most econometrics packages have special commands for 2SLS, so there is no need to perform the two stages explicitly.
In most cases, we should avoid doing the second stage manually, as the standard errors obtained in this way are not valid. (See Chapter 15.3 for more details.)
Note that the estimated coefficients obtained manually are identical to those obtained from the usual 2SLS routine.
Verify these properties with the following computer exercise.
Use the wage2 data from the wooldridge package.
Use a 2SLS routine to estimate the equation \[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 black + u, \] where \(sibs\) is the IV for \(educ\).
Now, manually carry out 2SLS. Regress \(educ\) on \(sibs, exper, tenure, black\) and obtain the fitted values \(\hat{educ}\). Then run the second-stage regression of \(\log(wage)\) on \(\hat{educ}, exper, tenure, black\).
What happens if we omit the exogenous variables from the first stage? Regress \(educ\) on \(sibs\) only and obtain the fitted values \(\tilde{educ}\). Then run the regression of \(\log(wage)\) on \(\tilde{educ}, exper, tenure, black\).
q1591 <- ivreg(data = wage2,
lwage ~ 1 + educ + exper + tenure + black|
1 + sibs + exper + tenure + black)
tab_model(q1591, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 5.22 *** | 0.54 | 9.60 |
| educ | 0.09 ** | 0.03 | 2.78 |
| exper | 0.02 * | 0.01 | 2.49 |
| tenure | 0.01 *** | 0.00 | 4.22 |
| black | -0.18 *** | 0.05 | -3.66 |
| Observations | 935 | | |
| R2 / R2 adjusted | 0.169 / 0.165 | | |
q1592_1 <- lm(data = wage2, educ ~ 1 + sibs + exper + tenure + black)
wage2$prededuc <- q1592_1$fitted.values
q1592_2 <- lm(data = wage2, lwage ~ 1 + prededuc + exper + tenure + black)
tab_model(q1592_2, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 5.22 *** | 0.57 | 9.17 |
| prededuc | 0.09 ** | 0.04 | 2.65 |
| exper | 0.02 * | 0.01 | 2.38 |
| tenure | 0.01 *** | 0.00 | 4.03 |
| black | -0.18 *** | 0.05 | -3.49 |
| Observations | 935 | | |
| R2 / R2 adjusted | 0.089 / 0.085 | | |
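Putting the point estimates side by side confirms the note above: manual 2SLS reproduces the ivreg() coefficients exactly (prededuc plays the role of educ).
# coefficients agree across the two methods; only the standard errors differ
cbind(ivreg = coef(q1591), manual = coef(q1592_2))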
The standard errors obtained from the manual method are too large; as noted above, they are not valid.
q1593_1 <- lm(data = wage2, educ ~ 1 + sibs)  # first stage without the other exogenous variables
wage2$prededuc <- q1593_1$fitted.values
q1593_2 <- lm(data = wage2, lwage ~ 1 + prededuc + exper + tenure + black)
tab_model(q1593_2, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
show.p = FALSE, p.style = "stars")
Dependent variable: lwage

| Predictors | Estimates | std. Error | Statistic |
|---|---|---|---|
| (Intercept) | 5.77 *** | 0.36 | 16.01 |
| prededuc | 0.07 ** | 0.03 | 2.65 |
| exper | -0.00 | 0.00 | -0.13 |
| tenure | 0.01 *** | 0.00 | 5.19 |
| black | -0.24 *** | 0.04 | -5.82 |
| Observations | 935 | | |
| R2 / R2 adjusted | 0.089 / 0.085 | | |
Now the estimated coefficients are different as well. Omitting the exogenous variables from the first stage makes the procedure inconsistent: every exogenous regressor in the structural equation must also be included in the first stage.
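A side-by-side comparison of the correct 2SLS estimates and those based on the sibs-only first stage makes the discrepancy explicit:
# estimates diverge once the controls are dropped from the first stage
cbind(correct = coef(q1591), sibs_only = coef(q1593_2))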