Review of IV Estimation

Consider \[ y = \beta_0 + \beta_1 x + u \]

When \(x\) and \(u\) are correlated, the OLS estimator is biased and inconsistent. IV methods can deliver a consistent estimator, but the IV estimator is never unbiased.
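To see where consistency comes from, recall the standard result that the simple IV estimator with instrument \(z\),

\[ \hat{\beta}_{1,IV} = \frac{\sum_i (z_i - \bar{z})(y_i - \bar{y})}{\sum_i (z_i - \bar{z})(x_i - \bar{x})}, \]

has probability limit

\[ \text{plim}\ \hat{\beta}_{1,IV} = \beta_1 + \frac{\text{Cov}(z, u)}{\text{Cov}(z, x)}. \]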

For many purposes, consistency is enough. As the Nobel-prize-winning econometrician Clive W. J. Granger once said, “If you can’t get it right as \(n\) goes to infinity, you shouldn’t be in this business.”

An instrument \(z\) must satisfy two requirements. First, exogeneity: \(z\) should have no partial effect on \(y\) (once \(x\) is controlled for) and should be uncorrelated with the omitted variables in \(u\). Second, relevance: \(z\) must be correlated with \(x\).
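Stated formally, the two conditions are \(\text{Cov}(z, u) = 0\) (exogeneity) and \(\text{Cov}(z, x) \neq 0\) (relevance); together they make the correction term in the probability limit above vanish, which is exactly what consistency requires.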

We can test the relevance assumption, but we cannot test the exogeneity assumption; we must maintain exogeneity by appealing to economic behavior or introspection.


IV in the Simple Regression Model

Besides wooldridge and tidyverse, we need the AER package for IV estimation; we also use sjPlot, whose tab_model() produces the regression tables below.

library(wooldridge)
library(tidyverse)
library(AER)     # provides ivreg()
library(sjPlot)  # provides tab_model()

The main function we are going to use is ivreg() from AER. After loading the packages into our R session, let’s see how estimation works.
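In ivreg(), a vertical bar inside the formula separates the regressors from the instruments. A minimal sketch of the pattern, where y, x, z, and mydata are placeholder names rather than objects defined here:

# regressors go before the "|", instruments after it:
# ivreg(y ~ 1 + x | 1 + z, data = mydata)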

Example 15.1 Return to Education for Married Women.

For baseline, suppose we want to estimate the return to education in the simple regression model.

\[ \log(wage) = \beta_0 + \beta_1 educ + u \]

For comparison, let’s obtain the OLS estimates.

olsmod <- lm(data = mroz, lwage ~ 1 + educ)
tab_model(olsmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         -0.19        0.19         -1.00
educ                 0.11 ***    0.01          7.55
Observations        428
R2 / R2 adjusted    0.118 / 0.116
* p<0.05   ** p<0.01   *** p<0.001

There may be omitted variables in this estimating equation, e.g. ability.

Suppose we use father’s education, \(fatheduc\), as an instrument for \(educ\). We have to maintain that father’s education is uncorrelated with \(u\) (exogeneity).

Next, we have to confirm that \(fatheduc\) is correlated with \(educ\) (relevance). We can check this by running a simple regression.

# make sure that the sample is consistent with the previous equation
data151 <- mroz %>%
  select(lwage, educ, fatheduc) %>%
  na.omit()
# run regression
stage1 <- lm(data = data151, educ ~ 1 + fatheduc)
tab_model(stage1, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: educ
Predictors          Estimates    Std. Error   Statistic
(Intercept)         10.24 ***    0.28         37.10
fatheduc             0.27 ***    0.03          9.43
Observations        428
R2 / R2 adjusted    0.173 / 0.171
* p<0.05   ** p<0.01   *** p<0.001

The t statistic on \(fatheduc\) is 9.43, suggesting a statistically significant correlation between \(fatheduc\) and \(educ\).

Then we can estimate the equation using \(fatheduc\) as an IV.

ivmod <- ivreg(data = data151,
               lwage ~ 1 + educ | 1 + fatheduc)
tab_model(ivmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         0.44         0.45         0.99
educ                0.06         0.04         1.68
Observations        428
R2 / R2 adjusted    0.093 / 0.091
* p<0.05   ** p<0.01   *** p<0.001

The IV estimate of the return to education is now 5.9%, only around half of the OLS estimate. This suggests that the OLS estimate is too high, consistent with omitted ability bias.

Note that the standard error of the IV estimate is also more than twice that of the OLS estimate. Recall that the variance of the IV estimator also depends on the correlation between \(x\) and \(z\): if \(z\) explains little of the variation in \(x\), the standard error of the IV estimate can be much larger than its OLS counterpart.
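For reference, a standard asymptotic approximation (not computed in the output above) is

\[ \text{Var}(\hat{\beta}_{1,IV}) \approx \frac{\sigma^2}{SST_x\, \rho_{x,z}^2}, \]

where \(SST_x = \sum_i (x_i - \bar{x})^2\) and \(\rho_{x,z}\) is the correlation between \(x\) and \(z\). This is the OLS variance \(\sigma^2 / SST_x\) divided by \(\rho_{x,z}^2\), so the weaker the instrument, the larger the IV standard error.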


IV Estimation in the Multiple Regression Model

How can we estimate with IV when the model contains other exogenous variables?
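Every exogenous regressor acts as its own instrument, so it must appear on both sides of the vertical bar in ivreg(). A minimal sketch of the pattern, where y, x, w, z, and mydata are placeholders rather than objects defined here:

# x is endogenous, w is exogenous, z is the excluded instrument;
# w appears both as a regressor and as an instrument:
# ivreg(y ~ 1 + x + w | 1 + z + w, data = mydata)

Let’s look at the following example.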

Example 15.4 College proximity as an IV for education.

Use the card data from the wooldridge package to estimate the following equation \[ \log(wage) = \beta_0 + \beta_1 educ + X\gamma + u, \] where \(X\) contains \(exper\), \(exper^2\), \(black\), \(smsa\), \(south\), a full set of regional dummy variables, and an SMSA dummy for where the man was living in 1966 (\(smsa66\)).

  1. Estimate the above equation with OLS.
  2. Check the relevance assumption for \(nearc4\), a dummy variable for whether someone grew up near a four-year college and a possible IV candidate.
  3. Now instrument \(educ\) with \(nearc4\). What is the estimated return to education?
  4. Compare the \(R^2\) of the OLS and IV estimates. What does the comparison imply?

Question 1

# names of the regional dummies reg662 to reg669
allreg <- paste0("reg", 662:669)
regvar <- paste(allreg, collapse = "+")
# the remaining control variables
demovar <- paste("exper", "expersq", "black",
                 "smsa", "south", "smsa66", sep = "+")
controls <- paste(demovar, regvar, sep = "+")

fmla <- as.formula(paste("lwage ~ educ +", controls))

olsmod <- lm(data = card, formula = fmla)
tab_model(olsmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         4.62 ***     0.07         62.25
educ                0.07 ***     0.00         21.35
exper               0.08 ***     0.01         12.81
expersq             -0.00 ***    0.00         -7.22
black               -0.20 ***    0.02         -10.91
smsa                0.14 ***     0.02         6.79
south               -0.15 ***    0.03         -5.69
smsa66              0.03         0.02         1.35
reg662              0.10 **      0.04         2.68
reg663              0.14 ***     0.04         4.12
reg664              0.06         0.04         1.32
reg665              0.13 **      0.04         3.06
reg666              0.14 **      0.05         3.11
reg667              0.12 **      0.04         2.63
reg668              -0.06        0.05         -1.10
reg669              0.12 **      0.04         3.05
Observations        3010
R2 / R2 adjusted    0.300 / 0.296
* p<0.05   ** p<0.01   *** p<0.001
Some remarks on coding

If we need many dummy variables with similar names, we can construct the variable names efficiently with paste0().

paste0("reg", 662:669)
## [1] "reg662" "reg663" "reg664" "reg665" "reg666" "reg667" "reg668" "reg669"

We can combine them into a single string for the formula using paste().

paste(allreg, collapse = "+")
## [1] "reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669"

We can convert the string into a formula with as.formula().
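A quick check that the conversion produces a genuine formula object, reusing the controls string built above:

fmla <- as.formula(paste("lwage ~ educ +", controls))
class(fmla)
## [1] "formula"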

Building the control set once and storing it in an object is useful when we want to use the same set of variables multiple times.

Question 2

fmla <- as.formula(paste("educ ~ 1 + nearc4 + ", controls))
stage1 <- lm(data = card, formula = fmla)
tab_model(stage1, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: educ
Predictors          Estimates    Std. Error   Statistic
(Intercept)         16.64 ***    0.24         69.14
nearc4              0.32 ***     0.09         3.64
exper               -0.41 ***    0.03         -12.24
expersq             0.00         0.00         0.53
black               -0.94 ***    0.09         -9.98
smsa                0.40 ***     0.10         3.84
south               -0.05        0.14         -0.38
smsa66              0.03         0.11         0.24
reg662              -0.08        0.19         -0.42
reg663              -0.03        0.18         -0.15
reg664              0.12         0.22         0.54
reg665              -0.27        0.22         -1.25
reg666              -0.30        0.24         -1.28
reg667              -0.22        0.23         -0.93
reg668              0.52         0.27         1.96
reg669              0.21         0.20         1.04
Observations        3010
R2 / R2 adjusted    0.477 / 0.474
* p<0.05   ** p<0.01   *** p<0.001

Holding other things fixed, people who lived near a four-year college in 1966 had 0.32 years more education than those who did not.

The t statistic on \(nearc4\) is 3.64, so the relevance condition is satisfied; provided \(nearc4\) is uncorrelated with the unobserved factors in the error term, we can use it as an IV for \(educ\).

Question 3

fmla <- as.formula(paste("lwage ~ 1 + educ + ", controls, "|",
                         paste("1 + nearc4 + ", controls)))
ivmod <- ivreg(data = card, formula = fmla)
tab_model(ivmod, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         3.67 ***     0.92         3.96
educ                0.13 *       0.05         2.39
exper               0.11 ***     0.02         4.58
expersq             -0.00 ***    0.00         -7.00
black               -0.15 **     0.05         -2.72
smsa                0.11 ***     0.03         3.53
south               -0.14 ***    0.03         -5.30
smsa66              0.02         0.02         0.86
reg662              0.10 **      0.04         2.67
reg663              0.15 ***     0.04         4.03
reg664              0.05         0.04         1.14
reg665              0.15 **      0.05         3.11
reg666              0.16 **      0.05         3.14
reg667              0.13 **      0.05         2.72
reg668              -0.08        0.06         -1.40
reg669              0.11 **      0.04         2.58
Observations        3010
R2 / R2 adjusted    0.238 / 0.234
* p<0.05   ** p<0.01   *** p<0.001

The IV estimate is almost twice as large as the OLS estimate, but its standard error is over 18 times larger. A wider confidence interval is the price we must pay for a consistent estimator.

Question 4

summary(olsmod)$r.squared
## [1] 0.2998365
summary(ivmod)$r.squared
## [1] 0.2381655

Although the \(R^2\) is larger for OLS, this does not imply that OLS is the better model. The \(R^2\) from OLS estimation is always at least as large, because OLS minimizes the sum of squared residuals.
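To spell this out, with \(SSR\) the sum of squared residuals and \(SST\) the total sum of squares,

\[ R^2 = 1 - \frac{SSR}{SST}, \]

and OLS chooses the coefficients that minimize \(SSR\) for a given set of regressors. Any other estimator of the same equation, IV included, therefore has a weakly smaller \(R^2\); the IV \(R^2\) can even be negative.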


Issues with Doing 2SLS Manually

Most econometrics packages have special commands for 2SLS, so there is no need to perform the two stages explicitly.

In most cases, we should avoid doing the second stage manually, because the standard errors obtained this way are not valid: the second-stage OLS routine computes the residual variance from the fitted values \(\hat{x}\) rather than the original \(x\). (See Chapter 15.3 for more details.)

Note that the estimated coefficients obtained manually are identical to those obtained from the usual 2SLS routine.

Verify these properties with the following computer exercise.

Computer Exercise 15.9

Use the wage2 data from the Wooldridge package.

  1. Use a 2SLS routine to estimate the equation \[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 black + u, \] where \(sibs\) is the IV for \(educ\).

  2. Now, manually carry out 2SLS. Regress \(educ\) on \(sibs, exper, tenure, black\) and obtain the fitted values \(\hat{educ}\). Then run the second-stage regression of \(\log(wage)\) on \(\hat{educ}, exper, tenure, black\).

  3. What happens if we omit the exogenous variables from the first stage? Regress \(educ\) on \(sibs\) only and obtain the fitted values \(\tilde{educ}\). Then regress \(\log(wage)\) on \(\tilde{educ}, exper, tenure, black\).

Question 1

q1591 <- ivreg(data = wage2,
               lwage ~ 1 + educ + exper + tenure + black |
                 1 + sibs + exper + tenure + black)
tab_model(q1591, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         5.22 ***     0.54         9.60
educ                0.09 **      0.03         2.78
exper               0.02 *       0.01         2.49
tenure              0.01 ***     0.00         4.22
black               -0.18 ***    0.05         -3.66
Observations        935
R2 / R2 adjusted    0.169 / 0.165
* p<0.05   ** p<0.01   *** p<0.001

Question 2

# first stage: regress educ on the instrument and all exogenous regressors
q1592_1 <- lm(data = wage2, educ ~ 1 + sibs + exper + tenure + black)
wage2$prededuc <- q1592_1$fitted.values
# second stage: replace educ with its first-stage fitted values
q1592_2 <- lm(data = wage2, lwage ~ 1 + prededuc + exper + tenure + black)
tab_model(q1592_2, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         5.22 ***     0.57         9.17
prededuc            0.09 **      0.04         2.65
exper               0.02 *       0.01         2.38
tenure              0.01 ***     0.00         4.03
black               -0.18 ***    0.05         -3.49
Observations        935
R2 / R2 adjusted    0.089 / 0.085
* p<0.05   ** p<0.01   *** p<0.001

The coefficient estimates match those from the 2SLS routine, but the standard errors obtained from the manual method are not valid; here they are too large.

Question 3

# first stage with sibs only, omitting the exogenous regressors
q1593_1 <- lm(data = wage2, educ ~ sibs)
wage2$prededuc <- q1593_1$fitted.values
q1593_2 <- lm(data = wage2, lwage ~ 1 + prededuc + exper + tenure + black)
tab_model(q1593_2, show.ci = FALSE, show.stat = TRUE, show.se = TRUE,
          show.p = FALSE, p.style = "stars")
Dependent variable: lwage
Predictors          Estimates    Std. Error   Statistic
(Intercept)         5.77 ***     0.36         16.01
prededuc            0.07 **      0.03         2.65
exper               -0.00        0.00         -0.13
tenure              0.01 ***     0.00         5.19
black               -0.24 ***    0.04         -5.82
Observations        935
R2 / R2 adjusted    0.089 / 0.085
* p<0.05   ** p<0.01   *** p<0.001

Now even the estimated coefficients differ. When the other exogenous variables are omitted from the first stage, the fitted values \(\tilde{educ}\) are no longer the proper projections, so the second-stage estimates are not the 2SLS estimates and are inconsistent.