Missing Values

In R, missing values are represented by the symbol NA (not available). Undefined numerical results (e.g., 0/0) are represented by the symbol NaN (not a number).
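
As a quick illustration of how these symbols differ, the following base R sketch evaluates a few expressions. The object names are only for illustration. Note that dividing a nonzero number by zero gives Inf rather than NaN.

```r
# Special values in base R (object names are illustrative)
pos_inf   <- 1 / 0   # Inf: a nonzero number divided by zero
undefined <- 0 / 0   # NaN: the result is undefined

is.nan(undefined)    # TRUE
is.na(undefined)     # TRUE: is.na() also flags NaN
is.nan(NA)           # FALSE: NA is missing, but it is not NaN
```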

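By default, lm() silently drops any row that has a missing value in one of the model variables. This makes it tricky to know which observations are included in the estimation. To keep control of the sample, we may remove rows that contain NA values ourselves using na.omit() before fitting the model.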
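
One way to see this behaviour is to compare the number of rows in the data with the number of observations the fitted model actually used. The sketch below uses a small made-up data frame (toy is hypothetical, not part of the lesson's data) and nobs() from base R's stats package.

```r
# Hypothetical toy data: y is missing in row 2, x is missing in row 4
toy <- data.frame(
  x = c(1, 2, 3, NA, 5, 6),
  y = c(2.1, NA, 6.2, 8.1, 9.8, 12.3)
)

fit <- lm(y ~ x, data = toy)

n_rows <- nrow(toy)  # rows in the data frame
n_used <- nobs(fit)  # rows actually used in the estimation

n_rows
n_used
```

Here the data frame has 6 rows, but only 4 enter the regression, and lm() gives no warning about the difference.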

However, na.omit() may discard more rows than necessary. Suppose we have a data frame that contains $x$, $y$, and $z$. We want to regress $y$ on $x$, but $z$ has some missing values. If we use na.omit() directly on that data frame, rows where only $z$ is missing are dropped as well, so we end up with fewer observations than the regression could have used.
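
A minimal sketch of this situation, using a made-up data frame (toy and its values are assumptions for illustration): applying na.omit() to only the columns that enter the model keeps the rows where just $z$ is missing.

```r
# Hypothetical toy data: z is missing in rows 1 and 4, y is missing in row 3
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, NA, 8, 10),
  z = c(NA, 1, 1, NA, 0)
)

all_cols   <- na.omit(toy)                 # drops rows 1, 3 and 4
model_cols <- na.omit(toy[, c("x", "y")])  # drops only row 3

nrow(all_cols)    # 2
nrow(model_cols)  # 4
```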

It is important to carefully examine data before dropping any observations. Make sure that omitting data makes sense in your context.

In this lesson, we need the following packages.

library(wooldridge)
library(tidyverse)

Counting NAs

is.na() checks each element of its argument and returns TRUE where the element is missing and FALSE otherwise.

x <- c(1,2,3,NA,5)
y <- is.na(x)
print(y)

## [1] FALSE FALSE FALSE  TRUE FALSE

Logical values (TRUE and FALSE) are treated as the numbers 1 and 0 in arithmetic. For example, summing the logical vector y counts how many of its elements are TRUE.

sum(y)

## [1] 1

This means we can count the number of missing values in a column of a data frame by combining is.na() and sum().

discrim$psoda %>%
  is.na() %>%
  sum()

## [1] 8

So there are eight missing values in discrim$psoda. We could repeat this for each variable of interest or, to save time, put the columns in a list and loop over it.

var <- list(discrim$psoda, discrim$prpblck, discrim$income)

# seq_along() is safer than 1:length(), which misbehaves on an empty list
for (i in seq_along(var)) {
  var[[i]] %>%
    is.na() %>%
    sum() %>% # Summing logical values counts the TRUEs
    print()
}

## [1] 8
## [1] 1
## [1] 1

There are eight NAs in $psoda$, one in $prpblck$ and one in $income$.
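
As an alternative to the loop, is.na() also accepts a whole data frame and returns a logical matrix, so colSums() can count the NAs in every column at once. The sketch below uses a made-up data frame (toy is hypothetical). The same call on the three discrim columns should give the counts found above, though that is not run here.

```r
# Hypothetical toy data with 2, 1 and 0 missing values per column
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, 2, 3, 4),
  c = c(1, 2, 3, 4)
)

# is.na() on a data frame gives a logical matrix; colSums() counts TRUEs by column
na_counts <- colSums(is.na(toy))
na_counts
```

The result is a named vector, which makes it easy to see which column each count belongs to.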

Finding Rows with NAs

Next, we want to extract the indices of the rows with missing values. We may combine is.na() with which(). Again, we loop over the list var we defined above.

for (i in seq_along(var)) {
  # is.na() already returns TRUE/FALSE, so comparing with TRUE is unnecessary
  which(is.na(var[[i]])) %>% print()
}

## [1]  58  93 144 184 284 311 362 369
## [1] 385
## [1] 385

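which() returns the positions at which its argument is TRUE, which here are the row numbers with missing values. Row 385 is missing in both prpblck and income, so the three columns have missing values in 8 + 1 = 9 distinct rows. If we use na.omit() on a data frame with these three columns, we will discard 9 rows.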
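
The number of rows that na.omit() would drop can also be found directly, without inspecting each column, by using complete.cases() from base R. It returns TRUE for rows with no missing values. The sketch below uses a made-up data frame (toy is hypothetical) in which one row is missing in two columns at once.

```r
# Hypothetical toy data: row 2 is missing in a, row 4 is missing in both b and c
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(1, 2, 3, NA, 5),
  c = c(1, 2, 3, NA, 5)
)

# Rows with at least one NA in any column
incomplete <- which(!complete.cases(toy))

incomplete          # 2 4
length(incomplete)  # 2 rows would be dropped, although there are 3 NAs in total
```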

Removing NAs from a Data Frame

Before removing those rows, let’s check the number of rows in our data frame.

rawdata <- discrim %>%
  select(psoda, prpblck, income)

nrow(rawdata)

## [1] 410

After removing the rows with missing values, confirm the result using nrow() again.

cleandata <- na.omit(rawdata)
nrow(cleandata)

## [1] 401

The data frame went from 410 to 401 rows, so na.omit() discarded 9 rows, as expected.
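
To check which rows were removed, note that na.omit() records the dropped rows in an attribute called na.action on its result. The sketch below shows this on a made-up data frame (toy, clean and removed are hypothetical names). The same idea should apply to cleandata above.

```r
# Hypothetical toy data: rows 2 and 4 each contain one NA
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(1, 2, 3, NA)
)

clean <- na.omit(toy)

# na.omit() stores the positions of the dropped rows in the "na.action" attribute
removed <- as.integer(attr(clean, "na.action"))

removed                  # 2 4
nrow(toy) - nrow(clean)  # 2, matching length(removed)
sum(is.na(clean))        # 0: no missing values remain
```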