In R, missing values are represented by the symbol NA (not available). Undefined numerical results (e.g., 0/0) are represented by the symbol NaN (not a number). Note that dividing a nonzero number by zero gives Inf or -Inf, not NaN.
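As a quick illustration of the difference, the sketch below builds a small made-up vector v (the name is chosen only for this example) and checks it with the base R functions is.na() and is.nan().

```r
# NA marks a missing value; NaN marks an undefined numerical result.
# 1 / 0 is included to show that it yields Inf rather than NaN.
v <- c(1, NA, 0 / 0, 1 / 0)
print(v)

# is.na() flags both NA and NaN.
na_flags <- is.na(v)
print(na_flags)

# is.nan() flags only NaN.
nan_flags <- is.nan(v)
print(nan_flags)
```

Because is.na() is also TRUE for NaN, counting missing values with is.na() includes any NaN entries as well.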
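By default, lm() silently drops observations with missing values in any of the model variables. This makes it hard to know which observations are included in the estimation. To make the estimation sample explicit, we may remove rows that contain any NA values using na.omit() before fitting the model.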
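One way to see what lm() did with the missing values is to inspect the fitted object. The sketch below uses a made-up data frame toy (not part of the lesson's data) and relies on the default na.action setting, under which incomplete rows are dropped.

```r
# Toy data (made up for illustration): y is missing in row 2, x in row 5.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2.1, NA, 6.2, 7.9, 10.1, 12.2)
)

fit <- lm(y ~ x, data = toy)

# Number of observations actually used in the estimation.
n_used <- nobs(fit)
print(n_used)

# Row positions that lm() dropped because of missing values.
dropped <- as.integer(fit$na.action)
print(dropped)
```

Here four of the six rows are used, and rows 2 and 5 are reported as dropped.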
However, na.omit() may discard more rows than necessary. Suppose we have a data frame that contains \(x\), \(y\), and \(z\). We want to regress \(y\) on \(x\), but \(z\) has some missing values. If we use na.omit() directly on that data frame, rows where only \(z\) is missing are dropped as well, so we end up with fewer observations than the regression could use.
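The following sketch makes this concrete with a made-up data frame toy (the values are invented for illustration). It compares applying na.omit() to every column with applying it only to the two model variables.

```r
# Toy data (made up for illustration): z is missing in rows 2 and 4,
# while x is missing only in row 5 and y is complete.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, 4, 6, 8, 10, 12),
  z = c(5, NA, 7, NA, 9, 10)
)

# Dropping on all columns also removes rows where only z is missing.
n_all <- nrow(na.omit(toy))

# Dropping on the model variables only keeps every usable row.
n_xy <- nrow(na.omit(toy[, c("x", "y")]))

print(c(all_columns = n_all, x_and_y_only = n_xy))
```

Selecting the relevant columns first keeps five rows instead of three, which is why the lesson subsets the data before calling na.omit().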
It is important to carefully examine data before dropping any observations. Make sure that omitting data makes sense in your context.
In this lesson, we need the following packages.
library(wooldridge)
library(tidyverse)
is.na() returns TRUE for each element that is missing and FALSE otherwise.
x <- c(1,2,3,NA,5)
y <- is.na(x)
print(y)
## [1] FALSE FALSE FALSE TRUE FALSE
Logical values, i.e. TRUE or FALSE, can be treated as the numerical values 1 or 0, so they behave like numbers in arithmetic. For example, we can sum over the vector y.
sum(y)
## [1] 1
This means we can count the number of missing values in a column of a data frame by combining is.na() and sum().
discrim$psoda %>%
  is.na() %>%
  sum()
## [1] 8
This means there are eight missing values in discrim$psoda. We can repeat this for every variable of interest or, to save time, put the columns in a list and loop over it.
var <- list(discrim$psoda, discrim$prpblck, discrim$income)
for (i in seq_along(var)) {
  var[[i]] %>%
    is.na() %>%
    sum() %>% # summing logical values counts the TRUEs
    print()
}
## [1] 8
## [1] 1
## [1] 1
There are eight NAs in \(psoda\), one in \(prpblck\) and one in \(income\).
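The loop can also be replaced by a single vectorised call. The sketch below is one possible alternative, shown on a made-up data frame toy. colSums() and colMeans() are base R functions, and is.na() applied to a data frame returns a logical matrix with one column per variable.

```r
# Toy data frame (made up for illustration); the same calls work on any data frame.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, 2, 3, 4),
  d = c(1, 2, 3, 4)
)

# Count of missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of missing values per column.
na_share <- colMeans(is.na(toy))
print(na_share)
```

The result is a named vector, so each count stays attached to its column name, which the list-based loop does not provide.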
Next, we want the indices of the rows with missing values. We can combine is.na() with which(), again looping over the list var defined above.
for (i in seq_along(var)) {
  # is.na() already returns logicals, so no comparison with TRUE is needed
  which(is.na(var[[i]])) %>% print()
}
## [1] 58 93 144 184 284 311 362 369
## [1] 385
## [1] 385
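which() returns the positions of the elements for which its argument is TRUE, here the row numbers with missing values. Row 385 is missing in both \(prpblck\) and \(income\), so the three columns have missing values in 8 + 1 = 9 distinct rows. If we use na.omit() on a data frame with these three columns, we will discard 9 rows.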
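The distinct rows can also be found directly with the base R function complete.cases(), which is FALSE for any row that has at least one NA. The sketch below uses a made-up data frame toy in which two columns are missing in the same row, mirroring the situation of row 385.

```r
# Toy data (made up for illustration): q and r are both missing in row 4.
toy <- data.frame(
  p = c(1, NA, 3, 4, 5),
  q = c(1, 2, 3, NA, 5),
  r = c(1, 2, 3, NA, 5)
)

# Rows with at least one missing value, each counted once.
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)

# Number of rows that na.omit() would remove.
n_dropped <- nrow(toy) - nrow(na.omit(toy))
print(n_dropped)
```

There are three NA values in total but only two incomplete rows, so na.omit() removes two rows, not three.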
Before removing those rows, let’s check the number of rows in our data frame.
rawdata <- discrim %>%
  select(psoda, prpblck, income)
nrow(rawdata)
## [1] 410
After removing the rows with missing values, confirm the result using nrow() again.
cleandata <- na.omit(rawdata)
nrow(cleandata)
## [1] 401
The data frame went from 410 to 401 rows, so exactly 9 rows were discarded, as expected.
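If we also want to know which rows were removed, the object returned by na.omit() records them in its "na.action" attribute. The sketch below shows this on a made-up data frame toy. The same attribute should be available on any data frame cleaned with na.omit(), provided at least one row was dropped.

```r
# Toy data (made up for illustration): row 2 and row 3 each contain an NA.
toy <- data.frame(
  p = c(1, NA, 3, 4),
  q = c(1, 2, NA, 4)
)

clean <- na.omit(toy)

# Positions of the rows that na.omit() removed.
removed <- as.integer(attr(clean, "na.action"))
print(removed)

# The row counts should differ by exactly the number of removed rows.
print(nrow(toy) - nrow(clean) == length(removed))
```

Keeping this record makes it easy to go back to the raw data and check why those observations were incomplete.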