1. What is missing data?

When variables are missing some data values, we say that there is “missing data.” Depending on your software and the coding of the dataset, missing values may be coded as NA, ., an empty cell (""), or a common numeric code (often -99 or 99).

The consequences of missing data for estimation and interpretation depend on which type of variable has missing values. For our purposes, we will consider three types of variables: pretreatment covariates, treatment indicator(s), and outcome (or dependent) variables. Pretreatment covariates, often known simply as “covariates,” are variables that we observe and measure before treatment is assigned. Treatment indicators record the treatment condition to which each subject is assigned. Outcome (or dependent) variables are measured after the assignment of treatment.

Missing data emerges for different reasons. In survey data, a respondent could decline to answer a question or quit the survey without completing all questions. In a panel survey, some subjects may skip the second or later waves. With administrative data, records may be lost at some point in the process of collecting or recording data. To the extent that we can know the process by which data becomes missing, we can better understand the consequences of missing data for our analysis and inferences.

2. Missing treatment or outcome data can bias our ability to describe empirical patterns and estimate causal effects.

Missing data can induce bias in our estimates of descriptive patterns and causal effects. Consider a researcher trying to describe the income distribution in a country with survey data. Some individuals’ incomes are missing, but the researcher describes the non-missing data at hand. Suppose low-income individuals are less likely to report their income than high-income individuals, so missingness concentrates in the lower portion of the distribution. Then the researcher’s characterization of the income distribution is apt to be biased. For example, the researcher’s estimate of median income will tend to be higher than the true (unknown) median income because more data are missing from the lower portion of the distribution. Because missingness is correlated with the variable that we are trying to describe, our characterization of the median of the distribution is biased.

We can illustrate two types of missingness by considering a variable \(x\) – for example, income, as above. In the figure produced by the code below, the left panel shows the full distribution of \(x\) without missing data. The middle panel depicts a scenario in which simulated missingness in \(x\) is not independent of \(x\) – low values of \(x\) are much more likely to be missing. The right panel depicts a scenario in which simulated missingness in \(x\) is independent of \(x\) (often called “missing completely at random”). The red line represents the median of the distribution without missingness. The blue lines represent the median of the observed data in each simulation. Where missingness is not independent of \(x\), we observe that the median of the non-missing data is, in this instance, higher than the true median. This illustrates that missing data can bias our descriptions of single variables.

library(dplyr)
library(ggplot2)

set.seed(1234)  # for reproducible simulated missingness

# Simulate x from a chi-squared distribution and repeat it across three panels
data.frame(x = rep(rchisq(n = 1000, df = 5), 3),
           panel = rep(c("No missing data", "Missingness is not\nindependent of x", "Missingness is\nindependent of x"), each = 1000)) %>%
  mutate(panel = factor(panel, levels = unique(panel)),
         # missingness more likely for low values of x
         missing_cor = rbinom(n = 3000, size = 1, prob = (1 - pchisq(x, df = 5))),
         # missingness unrelated to x
         missing_ind = rbinom(n = 3000, size = 1, prob = .5),
         obs = ifelse(missing_cor == 1 & panel == "Missingness is not\nindependent of x", NA, 
                      ifelse(missing_ind == 1 & panel == "Missingness is\nindependent of x", NA, x))) %>%
  group_by(panel) %>%
  mutate(med_x = median(x),                    # true median (red line)
         med_obs = median(obs, na.rm = TRUE),  # median of the non-missing data (blue line)
         med_obs = ifelse(panel == "No missing data", NA, med_obs)) %>%
  ggplot(aes(x = obs)) +
  geom_histogram(bins = 50) +
  facet_grid(~panel) +
  geom_vline(aes(xintercept = med_x), col = "red", lwd = 1) +
  geom_vline(aes(xintercept = med_obs), col = "blue", lty = 2, lwd = 1) +
  scale_x_continuous("x") +
  theme_minimal()

Similarly, some patterns of missing data can lead to biased estimates of causal effects. In particular, missingness of the treatment indicator or the outcome variable of interest can induce bias in estimates of the ATE. First, consider missingness of an outcome variable \(Y_i(Z)\). Adopting some notation from Gerber and Green (2013), define “reporting” as a potential outcome of treatment assignment \(Z\), \(R_i(Z) \in \{0, 1\}\). In this notation, \(R_i(Z) = 0\) indicates that \(Y_i(Z)\) is missing and \(R_i(Z)=1\) indicates that the outcome is non-missing. Using this notation, we can express the ATE as:

\[\begin{align} \underbrace{E[Y_i(1)]-E[Y_i(0)]}_{ATE} =& \underbrace{E[R_i(1)]E[Y_i(1)|R_i(1) = 1]}_{Z = 1\text{ and }Y_i \text{ is not missing}} + \underbrace{(1-E[R_i(1)])(E[Y_i(1)|R_i(1) = 0])}_{Z = 1 \text{ and } Y_i \text{ is missing}} - \\ &\underbrace{E[R_i(0)]E[Y_i(0)|R_i(0) = 1]}_{Z = 0 \text { and } Y_i \text{ is not missing }} - \underbrace{(1-E[R_i(0)])E[Y_i(0)|R_i(0) = 0]}_{Z = 0 \text { and } Y_i \text{ is missing }} \end{align}\]

If we condition our analysis on non-missing outcomes, our estimator of the ATE is:

\[E[Y_i(1)|R_i(1)=1] - E[Y_i(0) |R_i(0)=1]\]

Examining the preceding equations, the estimator based only on the non-missing outcomes is guaranteed to equal the ATE only if either:

  1. There is no missingness, \(E[R_i(1)] = 1\) and \(E[R_i(0)] = 1\).
  2. Missingness is independent of potential outcomes: \[\underbrace{E[Y_i(1)|R_i(1) = 1]}_{E[Y_i(1)] \text{ if } Y_i(1) \text{ is not missing}} = \underbrace{E[Y_i(1)|R_i(1) = 0]}_{E[Y_i(1)] \text{ if } Y_i(1) \text{ is missing}} \text{ and } \underbrace{E[Y_i(0)|R_i(0) = 1]}_{E[Y_i(0)] \text{ if } Y_i(0) \text{ is not missing}} = \underbrace{E[Y_i(0)|R_i(0) = 0]}_{E[Y_i(0)] \text{ if } Y_i(0) \text{ is missing}}\]

Otherwise, the analysis conditional upon observing \(Y_i(Z)\) can induce an unknowable (but boundable) amount of bias in our estimate of the ATE.

We often do not think of missingness of the treatment indicator in experiments. Indeed, competent administration of an experiment generally ensures against missing treatment values. Nevertheless, it is important to note that missingness of the treatment indicator can also produce bias if missingness is not independent of potential outcomes.

The following simulation shows the consequences of two types of missingness for estimation of the ATE. We set the true ATE to 0.5 (the red vertical lines) in all cases. We simulate missingness through two data generating processes. In the first, missingness occurs only among subjects in treatment (\(Z = 1\)) and is most likely among those with higher values of the outcome \(Y_i(1)\). In the second, missingness occurs at the same overall rate but is independent of treatment assignment and of \(Y_i(1)\). If missingness is correlated with potential outcomes, the estimator of the ATE is biased (top row). This occurs whether we are missing values of the outcome (left column) or the treatment indicator (right column). In contrast, when missingness is independent of potential outcomes, the estimator is unbiased (bottom row).

library(randomizr)
library(estimatr)

set.seed(1234)  # for reproducibility

# Pull the estimated coefficient on Z (the ATE estimate) from an lm_robust fit
extract_ests <- function(estimatr_model){summary(estimatr_model)$coefficients[2, 1]}

simulation <- function(){
  Y0 <- rnorm(2000)                 # potential outcomes under control
  Z <- complete_ra(2000)            # complete random assignment to treatment
  Yobs <- Y0 + Z * .5               # .5 standard deviation treatment effect
  # Missingness correlated with potential outcomes: only treated units can be missing,
  # and those with higher outcomes are more likely to be missing
  Z_miss <- rbinom(n = 2000, prob = Z * pnorm(Yobs), size = 1)
  # Missingness at the same overall rate, but independent of treatment and outcomes
  Z_miss_ind <- rbinom(n = 2000, prob = sum(Z_miss) / 2000, size = 1)
  Y_miss <- rbinom(n = 2000, prob = Z * pnorm(Yobs), size = 1)
  Y_miss_ind <- rbinom(n = 2000, prob = sum(Y_miss) / 2000, size = 1)

  # Drop ("ignore") the missing observations and estimate the ATE on the rest
  m1 <- lm_robust(Yobs ~ Z, subset = Z_miss == 0)      # missing treatment indicator, not independent of POs
  m2 <- lm_robust(Yobs ~ Z, subset = Z_miss_ind == 0)  # missing treatment indicator, independent of POs
  m3 <- lm_robust(Yobs ~ Z, subset = Y_miss == 0)      # missing outcome, not independent of POs
  m4 <- lm_robust(Yobs ~ Z, subset = Y_miss_ind == 0)  # missing outcome, independent of POs

  return(sapply(list(m1, m2, m3, m4), FUN = extract_ests))
}

reps <- replicate(n = 500, expr = simulation())  # 4 estimates per replication, 500 replications

# Columns of t(reps) are m1, m2, m3, m4; the pos labels (length 1000) recycle so that
# rows alternate not-independent / independent within each type of missingness
data.frame(ests = as.vector(t(reps)),
           missing = rep(c("Missing Treatment Indicator", "Missing Outcome"), each = 1000),
           pos = rep(c("Missing is\nNot Independent of POs", "Missingness is\nIndependent of POs"), each = 500)) %>%
  ggplot(aes(x = ests)) + 
  facet_grid(pos ~ missing) +
  geom_histogram(bins = 100) +
  geom_vline(xintercept = .5, col = "red") +  # true ATE
  theme_minimal() +
  xlab("ATE Estimates")

3. The potential for bias increases in the proportion of treatment or outcome data that is missing.

Returning to the equations in #2, it is useful to identify the terms that we, as analysts of a dataset, can actually estimate. In the expression for the ATE in the presence of missing data, we can estimate:

  1. The reporting rates (proportions of non-missing outcomes) in treatment and control: \(\color{red}{E[R_i(1)]}\) and \(\color{red}{E[R_i(0)]}\). The proportions missing are \(1 - E[R_i(1)]\) and \(1 - E[R_i(0)]\).
  2. The expectation of the outcome variable among reporters (non-missing observations) in treatment and control: \(\color{red}{E[Y_i(1)|R_i(1) = 1]}\) and \(\color{red}{E[Y_i(0)|R_i(0)=1]}\).

These estimable quantities are shown in red in the following equation.

\[\begin{align} \underbrace{E[Y_i(1)]-E[Y_i(0)]}_{ATE} =& \underbrace{\color{red}{E[R_i(1)]E[Y_i(1)|R_i(1) = 1]}}_{Z = 1\text{ and }Y_i \text{ is not missing}} + \underbrace{(1-E[R_i(1)])(E[Y_i(1)|R_i(1) = 0])}_{Z = 1 \text{ and } Y_i \text{ is missing}} - \\ &\underbrace{\color{red}{E[R_i(0)]E[Y_i(0)|R_i(0) = 1]}}_{Z = 0 \text { and } Y_i \text{ is not missing }} - \underbrace{(1-E[R_i(0)])E[Y_i(0)|R_i(0) = 0]}_{Z = 0 \text { and } Y_i \text{ is missing }} \end{align}\]

We are ultimately interested in estimating the ATE, \(E[Y_i(1)]-E[Y_i(0)]\). One straightforward implication of this expression is that the magnitude of possible bias in an estimator of the ATE based on non-missing observations increases with the amount of missingness. As \(E[R_i(1)] \rightarrow 1\) and \(E[R_i(0)]\rightarrow 1\), bias stemming from missing outcome data converges to 0. In contrast, when we are missing a large proportion of the data, the magnitude of possible bias increases. Thus, our concerns about missing data should increase as the amount (proportion) of missingness increases.
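To see this concretely, suppose additionally that the outcome is bounded, \(Y_i(Z) \in [Y_{\min}, Y_{\max}]\) (an extra assumption, anticipating the bounds discussed in #9). Substituting these extreme values for the unknowable conditional expectations above yields an interval of possible ATEs whose width is:

\[\begin{align} \underbrace{\big(1-E[R_i(1)]\big)}_{\text{missingness in treatment}}(Y_{\max}-Y_{\min}) + \underbrace{\big(1-E[R_i(0)]\big)}_{\text{missingness in control}}(Y_{\max}-Y_{\min}) \end{align}\]

This width shrinks to zero as \(E[R_i(1)]\) and \(E[R_i(0)]\) approach 1 and grows with the proportion of missing outcomes.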

4. The consequences of missing data for bias in estimates of causal effects depend on the type of variable that is missing.

Missingness of pretreatment covariates need not induce bias in our estimates of the ATE. However, researchers can actually induce bias through improper treatment of missing pretreatment covariate data. If treatment is randomly assigned, treatment assignment should be orthogonal to pre-treatment missingness. In other words, pre-treatment missingness should be balanced across treatment assignment conditions.

However, we should avoid “dropping” (excluding) observations on the basis of pretreatment missingness, for two reasons. First, it is possible to induce bias in our estimate of the ATE by dropping observations with pretreatment missingness: after dropping these observations, we obtain an unbiased estimate of the local average treatment effect (LATE) among observations with no missing pretreatment data, but if treatment effects vary with missingness of pretreatment variables, this LATE may be quite different from the ATE (see the sketch below). Second, as we drop observations the sample size decreases, reducing our power to detect a given ATE. In sum, we should refrain from dropping observations on the basis of missing pretreatment covariates, to avoid inducing bias or efficiency loss in our estimates of the ATE.
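As a simple illustration of the first point, here is a minimal sketch (simulated data, hypothetical variable names) in which the treatment effect is larger for units whose covariate is missing; dropping those units recovers the complete-case LATE rather than the ATE:

set.seed(1234)
N <- 10000
X <- rnorm(N)                                   # pretreatment covariate
X_missing <- rbinom(N, size = 1, prob = 0.3)    # 30% of X is missing
Z <- rbinom(N, size = 1, prob = 0.5)            # random assignment
tau <- ifelse(X_missing == 1, 1, 0.2)           # larger treatment effect where X is missing
Y <- X + Z * tau + rnorm(N)

coef(lm(Y ~ Z))["Z"]                            # close to the ATE: 0.3 * 1 + 0.7 * 0.2 = 0.44
coef(lm(Y ~ Z, subset = X_missing == 0))["Z"]   # close to the complete-case LATE: 0.2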

In contrast, missingness of the treatment indicator or outcome variable(s) can induce bias in our estimates of causal effects, as demonstrated in #2. This categorization informs the strategies that we adopt to address the consequences of missing data.

5. What assumptions do we invoke when we “ignore” treatment or outcome missingness in estimation?

Suppose that we analyze a dataset with missing treatment or outcome data while ignoring the missingness; that is, we drop observations for which we lack a value for either of these variables. If we then estimate the ATE on the subsample with complete treatment and outcome data, we estimate:

\[E[Y_i(1)|R_i(1)=1]- E[Y_i(0)|R_i(0)=1]\]

This is necessarily equivalent to the ATE estimand \(E[Y_i(1)]-E[Y_i(0)]\) if missingness is independent of potential outcomes (MIPO), i.e. \(Y_i(Z) \perp R_i(Z)\) (Gerber and Green, 2013). Thus, to interpret \(E[Y_i(1)|R_i(1)=1]- E[Y_i(0)|R_i(0)=1]\) as an unbiased estimator of the ATE, we must assume MIPO.

The MIPO assumption is most plausible when missingness arises from some essentially random process. Under what conditions might this be the case? Perhaps we were only able to gather a random subset of administrative outcome data; in that case, missingness would be independent of both potential outcomes and pretreatment covariates. Alternatively, idiosyncratic features of data collection (enumerator absence or error) that are plausibly unrelated to potential outcomes may also be consistent with MIPO. On the other hand, survey non-response or other forms of outcome measurement that depend on subject reporting are often much harder to justify under MIPO. Because we cannot test MIPO directly, we rely on researchers’ judgment about whether the assumption is plausible under a given data generating process. Where MIPO is not plausible, we should not simply “ignore” missing data in estimation. In these cases, researchers should consider the methods described in #6-#10.

7. What is imputation?

The most common approaches used to deal with missing data involve the imputation or “filling in” of missing values. In the following dataset with missingness, imputation procedures “fill in” the missing data – the NAs.

   Xobs  Z  Yobs
-1.2070  0     3
     NA  0     2
 1.0840  1     3
-2.3460  1     4
 0.4291  1     2
 0.5061  0    NA
     NA  1    NA
     NA  1     3
-0.5645  0     2
     NA  0     1
-0.4772  0     2
-0.9984  1     4

Just as the consequences of missingness vary by the type of variable that is missing, the imputation methods advocated to address missingness also vary. Importantly, these methods differ in the strength of the assumptions they invoke to identify causal effects in the presence of missing data. We review three common approaches to imputation in #8-#10.

8. How do we address missingness of pre-treatment covariates and why does this matter?

As mentioned in #4, we should never “drop” observations on account of missing pretreatment data. Thus, in order to estimate a model with covariate adjustment, we need to “fill in” missing values to avoid dropping observations. We outline two forms of imputation advocated for missing pretreatment covariates. The most common approach is to create indicators for missingness and include them as covariates. To do this form of imputation, we take two steps (sketched in code after the list):

  1. Substitute a numerical value for the NA (as necessary). In the dataset below, we impute a 0 for all values of Xobs that are NA. The new variable is named Ximputed.
  2. Create an indicator for missingness in each pretreatment variable. This indicator, Xmissing, takes the value 1 whenever Xobs is NA, and 0 otherwise.
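A minimal sketch of these two steps in R, assuming the data are stored in a data frame called df (a hypothetical name):

library(dplyr)

df <- df %>%
  mutate(Ximputed = ifelse(is.na(Xobs), 0, Xobs),  # step 1: substitute 0 for the NAs
         Xmissing = ifelse(is.na(Xobs), 1, 0))     # step 2: indicator for missingness in Xobs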

Our imputed dataset is thus:

   Xobs  Ximputed  Xmissing  Z  Yobs
-1.2070   -1.2070         0  0     3
     NA    0.0000         1  0     2
 1.0840    1.0840         0  1     3
-2.3460   -2.3460         0  1     4
 0.4291    0.4291         0  1     2
 0.5061    0.5061         0  0    NA
     NA    0.0000         1  1    NA
     NA    0.0000         1  1     3
-0.5645   -0.5645         0  0     2
     NA    0.0000         1  0     1
-0.4772   -0.4772         0  0     2
-0.9984   -0.9984         0  1     4

Consider how this imputation changes the specification of a regression model. Instead of estimating:

model1 <- lm(Yobs ~ Z + Xobs)

we estimate:

model2 <- lm(Yobs ~ Z + Ximputed + Xmissing)

The estimator model2 does not drop observations on the basis of the missing pre-treatment variable Xobs like model1 does.

Alternatively, if pre-treatment missingness is minimal (less than 10% of observations), or when working with very small datasets with few observations relative to the number of covariates and missingness indicators, Lin, Green, and Coppock (2016) advocate imputing the (unconditional) mean of the observed pre-treatment covariate. Call this variable Ximputed_mean.
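A minimal sketch of mean imputation, again assuming the data are in a data frame called df (a hypothetical name):

df <- df %>%
  mutate(Ximputed_mean = ifelse(is.na(Xobs), mean(Xobs, na.rm = TRUE), Xobs))  # impute the mean of the observed Xobs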

   Xobs  Ximputed_mean  Z  Yobs
-1.2070     -1.2070000  0     3
     NA     -0.4467375  0     2
 1.0840      1.0840000  1     3
-2.3460     -2.3460000  1     4
 0.4291      0.4291000  1     2
 0.5061      0.5061000  0    NA
     NA     -0.4467375  1    NA
     NA     -0.4467375  1     3
-0.5645     -0.5645000  0     2
     NA     -0.4467375  0     1
-0.4772     -0.4772000  0     2
-0.9984     -0.9984000  1     4

If we are simply imputing the pretreatment mean, then instead of estimating model1 (as above) our regression model should be:

model3 <- lm(Yobs ~ Z + Ximputed_mean)

Since we have imputed missing values in Xobs when constructing Ximputed_mean, we will not lose observations on the basis of the missing pretreatment covariate when estimating model3.

9. We can bound ATEs to account for missing outcome data without making assumptions about the distribution of missing outcomes.

If we know the maximum and minimum possible values of the outcome variable, we can construct bounds on the ATE even in the presence of missing outcome data. For example, the variable Yobs ranges from 1 to 4. While we do not know what the missing values of Yobs would have been, we can use the known maximum and minimum to construct bounds on the ATE. This yields an interval estimate of the ATE instead of the point estimate we would obtain with no missing data. These bounds are known as “extreme value bounds” or “Manski bounds” and invoke no additional assumptions about the values of the missing outcomes.

Manski bounds consist of a maximum and a minimum estimate of the possible ATE. To create these bounds, we impute the maximum or minimum value of the outcome variable as a function of treatment assignment. Thus, as sketched in the code following the list, we construct:

  1. A maximum bound by imputing the maximum value of the outcome for missing values in treatment (\(Z=1\)) and imputing the minimum value of the outcome for missing values in control (\(Z=0\)).
  2. A minimum bound by imputing the minimum value of the outcome for missing values in treatment (\(Z=1\)) and imputing the maximum value of the outcome for missing values in control (\(Z=0\)).
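A minimal sketch of this construction, again assuming a data frame df containing Z and Yobs:

df <- df %>%
  mutate(Y_maxbound = ifelse(!is.na(Yobs), Yobs,
                             ifelse(Z == 1, max(Yobs, na.rm = TRUE), min(Yobs, na.rm = TRUE))),
         Y_minbound = ifelse(!is.na(Yobs), Yobs,
                             ifelse(Z == 1, min(Yobs, na.rm = TRUE), max(Yobs, na.rm = TRUE))))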

Applied to the dataset above, this construction yields the two variables Y_maxbound and Y_minbound:

   Xobs  Z  Yobs  Y_maxbound  Y_minbound
-1.2070  0     3           3           3
     NA  0     2           2           2
 1.0840  1     3           3           3
-2.3460  1     4           4           4
 0.4291  1     2           2           2
 0.5061  0    NA           1           4
     NA  1    NA           4           1
     NA  1     3           3           3
-0.5645  0     2           2           2
     NA  0     1           1           1
-0.4772  0     2           2           2
-0.9984  1     4           4           4

Without covariate adjustment, we can obtain our interval estimate of the ATE by estimating:

upper <- lm(Y_maxbound ~ Z)
lower <- lm(Y_minbound ~ Z)

coef(upper)[2]
##   Z 
## 1.5
coef(lower)[2]
##   Z 
## 0.5

Our interval estimate of the ATE using Manski bounds is thus [0.5, 1.5].

10. Multiple imputation for missing outcomes allows for point estimation of ATEs, but relies on stronger assumptions than bounding.

The methods in #8 and #9 are forms of single imputation, in which a single value is substituted for each missing value. In multiple imputation, we impute the missing values of the dataset multiple times according to an assumed stochastic data generating process. Different methods of multiple imputation impose different structures and assumptions on the probability distributions used to impute missing values. In general, multiple imputation proceeds via three stages:

  1. Imputation: Missing values are imputed via a random draw of plausible values under the specified data generating process. This creates a full dataset without missing values. Typically, researchers will impute at least five complete datasets. The only differences across these imputed datasets are the values that were missing in the original data.

  2. Estimation: Estimate the ATE (or other estimand of interest) using each imputed dataset. This generates as many estimates and standard errors as there are imputed datasets.

  3. Pooling estimates: Finally, researchers combine the estimates from the different imputed datasets into a single point estimate of the ATE and its standard error, typically using Rubin’s (2004) rules.

Multiple imputation is available in many software packages and is relatively straightforward to apply, as the sketch below illustrates. However, specifying a data generating process from which to impute datasets relies on additional assumptions about the correctness of the model of missingness. These assumptions are generally untestable and stronger than the standard experimental assumptions invoked to identify an interval estimate of the ATE using Manski bounds.
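For example, the mice package in R implements this three-stage workflow. A minimal sketch, assuming the example data are stored in a data frame called df (a hypothetical name, as in the sketches above) and using mice’s default imputation models:

library(mice)

imp <- mice(df, m = 5, printFlag = FALSE)  # 1. imputation: create 5 completed datasets
fits <- with(imp, lm(Yobs ~ Z))            # 2. estimation: fit the model within each imputed dataset
summary(pool(fits))                        # 3. pooling: combine estimates using Rubin's rules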

References:

Gerber, Alan S. and Donald P. Green. (2013). Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton.

Lin, Winston, Donald P. Green, and Alexander Coppock. (2016). “Standard Operating Procedures for Don Green’s Lab at Columbia.” Available at https://alexandercoppock.com/Green-Lab-SOP/Green_Lab_SOP.html.

Rubin, Donald B. (2004). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.