Abstract

This guide1 describes ten distinct types of causal effects that researchers may be interested in estimating. As discussed in our guide to causal inference, simple randomization allows one to produce estimates of the average of the unit-level causal effects in a sample. This average causal effect or average treatment effect (ATE) is a powerful concept because it is one solution to the problem of not observing all relevant counterfactuals. Yet it is not the only productive engagement with this problem. In fact, there are many different types of quantities of causal interest. The goal of this guide is to help you choose estimands (parameters of interest) and estimators (procedures for calculating estimates of those parameters) that are appropriate and meaningful for your data.

1 Average Treatment Effects

We begin by reviewing how, with randomization, a simple difference-of-means provides an unbiased estimate of the ATE. We take extra time to introduce some common statistical concepts and notation used throughout this guide.

First we define a treatment effect for an individual observation (a person, household, city, etc.) as the difference between that unit’s behavior under treatment \((Y_{i}(1))\) and control \((Y_{i}(0))\):

\[τ_{i}=Y_{i}(1)−Y_{i}(0)\]

Since we can only observe either \(Y_{i}(1)\) or \(Y_{i}(0)\), the individual treatment effect is unknowable. Now let \(D_{i}\) be an indicator for whether we observe a unit under treatment or control. If treatment is randomly assigned, \(D_{i}\) is independent not only of potential outcomes but also of any covariates (observed and unobserved) that might also predict those outcomes \(((Y_{i}(1),Y_{i}(0),X_{i})⊥⊥D_{i})\).2

Suppose our design involves \(m\) units under treatment and \(N−m\) under control. Suppose we were to reassign treatment at random many times, each time calculating the difference of means between the treated and control groups and recording this value in a list. The average of the values in that list would equal the difference of the means of the true potential outcomes, had we observed the full schedule of potential outcomes for all observations.3 In other words, the difference of observed means is an unbiased estimator of the average treatment effect.

\[ATE≡\frac{1}{N}∑^{N}_{i=1}τ_{i}=\frac{∑^{N}_{i=1}Y_{i}(1)}{N}−\frac{∑^{N}_{i=1}Y_{i}(0)}{N}\]

And we often estimate the ATE using the observed difference in means:4

\[\widehat{ATE} =\frac{∑^{N}_{i=1}D_{i}Y_{i}}{m}−\frac{∑^{N}_{i=1}(1−D_{i})Y_{i}}{N−m}\]

Statistical inference about the estimated ATE requires that we know how it will vary across randomizations. It turns out that we can write the variance of the estimated ATE across randomizations as follows:

\[V(\widehat{ATE}) = \frac{N}{N−1} \left[\frac{V(Y_{i}(1))}{m}+\frac{V(Y_{i}(0))}{N−m}\right]−\frac{1}{N−1}\left[V(Y_{i}(1))+V(Y_{i}(0))−2Cov(Y_{i}(1),Y_{i}(0))\right]\]

and estimate this quantity from the sample estimates of the variance in each group.5
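As a brief sketch of what this means in practice (the symbols \(s^{2}_{1}\) and \(s^{2}_{0}\), denoting the sample variances of the observed outcomes in the treated and control groups, are notation introduced here for illustration): the covariance term cannot be estimated from the data, because we never observe both potential outcomes for the same unit. A common workaround is the conservative estimator

\[\widehat{V}(\widehat{ATE})=\frac{s^{2}_{1}}{m}+\frac{s^{2}_{0}}{N−m}\]

which, in expectation, is at least as large as the true variance and equals it when the treatment effect is the same for every unit.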

A linear model regressing the observed outcome \(Y_{i}\) on a treatment indicator \(D_{i}\) provides a convenient estimator of the ATE (and with some additional adjustments, the variance of the ATE):

\[Y_{i}=Y_{i}(0)∗(1−D_{i})+Y_{i}(1)∗D_{i}=β_{0}+β_{1}D_{i}+u\]

since we can rearrange terms so that \(β_{0}\) estimates the average among control observations \((Y_{i}(0)∣D_{i}=0)\) and \(β_{1}\) estimates the difference of means \((Y_{i}(1)∣D_{i}=1)−(Y_{i}(0)∣D_{i}=0)\). In the code below, we create a sample of 1,000 observations and randomly assign a treatment \(D_{i}\) with a constant unit effect to half of the units. We estimate the ATE using ordinary least squares (OLS) regression to calculate the observed mean difference. Calculating the means in each group and taking their difference would produce the same unbiased estimate of the ATE. Note that the estimated ATE from OLS is unbiased, but the conventional OLS standard errors assume that the errors in this linear model are independent and identically distributed. When our treatment affects both the average value of the outcome and the distribution of responses, this assumption no longer holds, and we need to adjust the standard errors from OLS using a Huber-White sandwich estimator to obtain the correct estimates (based on the variance of the ATE) for statistical inference.6 Finally, we also demonstrate the unbiasedness of these estimators through simulation.

set.seed(1234) # For replication 
N = 1000 # Population size 
Y0 = runif(N) # Potential outcome under control condition 
Y1 = Y0 + 1 # Potential outcome under treatment condition 
D = sample((1:N)%%2) # Treatment: 1 if treated, 0 otherwise 
Y = D*Y1 + (1-D)*Y0 # Outcome in population 
samp = data.frame(D,Y) 

ATE = coef(lm(Y~D,data=samp))[2] # same as with(samp,mean(Y[D==1])-mean(Y[D==0])) 

# SATE with Neyman/Randomization Justified Standard Errors 
# which are the same as OLS standard errors when no covariates or blocking 
library(lmtest) 
library(sandwich) 
fit<-lm(Y~D,data=samp) 
coef(summary(fit))["D",1:2]
##   Estimate Std. Error 
## 0.99282939 0.01842545
ATE.se<-coeftest(fit,vcovHC(fit,type="HC2"))["D",2] 
# same as with(samp,sqrt(var(Y[D==1])/sum(D)+var(Y[D==0])/(N-sum(D)))) 

# Assess unbiasedness and simulate standard errors 
getATE<-function() {
  D = sample((1:N)%%2) # Treatment: 1 if treated, 0 otherwise 
  Y = D*Y1 + (1-D)*Y0 
  coef(lm(Y~D))[["D"]] 
} 

manyATEs<-replicate(10000,getATE()) 

## Unbiasedness: 
c(ATE=mean(Y1)-mean(Y0), ExpEstATE=mean(manyATEs)) 
##       ATE ExpEstATE 
##  1.000000  1.000068
## Standard Error 
### True SE formula 
V<-var(cbind(Y0,Y1)) 
varc<-V[1,1] 
vart<-V[2,2] 
covtc<-V[1,2] 
n<-sum(D) # Number of treated units (m in the text) 
m<-N-n # Number of control units (N-m in the text) 
varestATE<-((N-n)/(N-1))*(vart/n) + ((N-m)/(N-1))* (varc/m) + (2/(N-1)) * covtc 

### Compare SEs 
c(SimulatedSE= sd(manyATEs), TrueSE=sqrt(varestATE), ConservativeSE=ATE.se) 
##    SimulatedSE         TrueSE ConservativeSE 
##     0.01841534     0.01842684     0.01842545

2 Conditional Average Treatment Effects

The problem with looking at average treatment effects only is that it takes attention away from the fact that treatment effects might be very different for different sorts of people. While the “fundamental problem of causal inference” suggests that measuring causal effects for individual units is impossible, making inferences on groups of units is not.

Random assignment ensures that treatment is independent of potential outcomes and any (observed and unobserved) covariates. Sometimes, however, we have additional information about the experimental units as they existed before the experiment was fielded, say \(X_{i}\), and this information can help us understand how treatment effects vary across subgroups. For example, we may suspect that men and women respond differently to treatment, and we can test for this heterogeneity by estimating the conditional ATE for each subgroup separately \((CATE=E(Y_{i}(1)−Y_{i}(0)∣X_{i}=x))\). If our covariate is continuous, we can test its moderating effect by interacting the continuous variable with the treatment. Note, however, that the estimated treatment effect then depends on the value of the conditioning variable at which it is evaluated, and so we must adjust our interpretation and standard errors accordingly.7
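A minimal sketch of both strategies on simulated data follows. The covariates `female` and `age`, the effect sizes, and all object names are illustrative assumptions, not part of the original guide. The treatment effect is built to be one unit larger for women and to rise with age, so we can check that subgroup regressions and interaction models recover that pattern.

```r
set.seed(1234)                           # For replication
N <- 1000                                # Sample size
female <- rep(0:1, each = N/2)           # Hypothetical binary pre-treatment covariate
age <- runif(N, 18, 80)                  # Hypothetical continuous pre-treatment covariate
Y0 <- rnorm(N)                           # Potential outcome under control
Y1 <- Y0 + 1 + female + 0.02*(age - 49)  # Unit effect varies with sex and age
D <- sample((1:N) %% 2)                  # Treatment: 1 if treated, 0 otherwise
Y <- D*Y1 + (1 - D)*Y0                   # Observed outcome
samp <- data.frame(female, age, D, Y)

# Subgroup CATEs: difference of means within each level of the covariate
CATE.men   <- coef(lm(Y ~ D, data = subset(samp, female == 0)))[["D"]]
CATE.women <- coef(lm(Y ~ D, data = subset(samp, female == 1)))[["D"]]

# Interaction model: the D:female coefficient is the difference between the two CATEs
fit.sex <- lm(Y ~ D*female, data = samp)
CATE.diff <- coef(fit.sex)[["D:female"]]

# Continuous moderator: the effect of D at a given age is b_D + b_D:age * age
fit.age <- lm(Y ~ D*age, data = samp)
b <- coef(fit.age)
CATE.age30 <- b[["D"]] + b[["D:age"]]*30
CATE.age70 <- b[["D"]] + b[["D:age"]]*70

# The standard error at age 30 combines the variances and covariance of both coefficients
V <- vcov(fit.age)
SE.age30 <- sqrt(V["D", "D"] + 30^2*V["D:age", "D:age"] + 2*30*V["D", "D:age"])
```

Because the model with `D*female` is saturated, its interaction coefficient reproduces the difference between the two subgroup estimates exactly. With a continuous moderator, the effect and its standard error must be computed at a chosen value of the covariate, which is why the coefficient on `D` alone should not be read as "the" treatment effect.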

A word of warning: looking at treatment effects across dimensions that are themselves affected by treatment is a dangerous business and can lead to incorrect inferences. For example, if you wanted to see how administering a drug led to health improvements, you could look separately at men and women, but you could not look separately at those who in fact took the drug and those who did not (this is an example of inference for compliers, which requires the separate techniques described in section 4 below).

3 Intent-to-Treat Effects

Outside of a controlled laboratory setting, the subjects we assign to treatment often are not the same as the subjects who actually receive the treatment. When some subjects assigned to treatment fail to receive it, we call this an experiment with one-sided non-compliance. When additionally, some subjects assigned to control also receive the treatment, we say there is two-sided non-compliance. For example, in a get-out-the-vote experiment, some people assigned to receive a mailer may not receive it. Perhaps they’ve changed addresses or never check their mail. Similarly, some observations assigned to control may receive the treatment. Perhaps they just moved in, and the previous tenant’s mail is still arriving.

When non-compliance occurs, the receipt of treatment is no longer independent of potential outcomes and confounders. The people who actually read their mail probably differ in a number of ways from the people who throw their mail away (or read their neighbors’ mail), and these differences likely also affect their probability of voting. The difference-of-means between subjects assigned to treatment and control no longer estimates the ATE, but instead estimates what is called an intent-to-treat effect (ITT). We often interpret the ITT as the effect of giving someone the opportunity to receive treatment. The ITT is therefore particularly relevant for assessing programs and interventions with voluntary participation.

In the code below, we create some simple data with one-sided non-compliance. Although the true treatment effect for people who actually received the treatment is 2, our estimated ITT is smaller (about 1) because only some of the people assigned to treatment actually receive it.

set.seed(1234) # For replication
n = 1000 # Population size 
Y0 = runif(n) # Potential outcome under control condition 
C = sample((1:n)%%2) # Whether someone is a complier or not 
Y1 = Y0 + 1 +C # Potential outcome under treatment 
Z = sample((1:n)%%2) # Treatment assignment 
D = Z*C # Treatment Uptake 
Y = D*Y1 + (1-D)*Y0 # Outcome in population 
samp = data.frame(Z,Y)
ITT<-coef(lm(Y~Z,data=samp))[2]

4 Complier Average Treatment Effects

What if you are interested in figuring out the effects of a treatment on those people who actually took up the treatment and not just those people that were administered the treatment? For example what is the effect of radio ads on voting behavior for those people that actually hear the ads?

This turns out to be a hard problem (for more on this see this guide). The reasons for non-compliance with treatment can be thought of as an omitted variable. While the receipt of treatment is no longer independent of potential outcomes, the assignment of treatment status is. As long as random assignment had some positive effect on the probability of receiving treatment, we can use it as an instrument to identify the effects of treatment on the sub-population of subjects who comply with treatment assignment.

Following the notation of Angrist and Pischke,8 let \(Z_{i}\) be an indicator for whether an observation was assigned to treatment and \(D_{i}\) indicate whether that subject actually received the treatment. Experiments with non-compliance are composed of always-takers (\(D_{i}=1\), regardless of \(Z_{i}\)), never-takers (\(D_{i}=0\), regardless of \(Z_{i}\)), and compliers (\(D_{i}=1\) when \(Z_{i}=1\) and \(D_{i}=0\) when \(Z_{i}=0\)).9 We can estimate a complier average causal effect (CACE), sometimes also called a local average treatment effect (LATE), by dividing the ITT (the effect of \(Z_{i}\) on \(Y_{i}\)) by the effect of random assignment on treatment uptake (the effect of \(Z_{i}\) on \(D_{i}\)).

\[CACE= \frac{\text{Effect of } Z \text{ on } Y}{\text{Effect of } Z \text{ on } D}=\frac{E(Y_i∣Z_i=1)−E(Y_i∣Z_i=0)}{E(D_i∣Z_i=1)−E(D_i∣Z_i=0)}\]

The estimator above highlights the fact that the ITT and CACE converge as we approach full compliance. Constructing standard errors for ratios is somewhat cumbersome, and so we usually estimate a CACE using two-stage least squares regression with random assignment, \(Z_i\), serving as an instrument for treatment receipt, \(D_i\), in the first stage of the model. This approach simplifies the estimation of standard errors and allows for the inclusion of covariates as additional instruments. We demonstrate both strategies in the code below for data with two-sided non-compliance. Note, however, that when instruments are weak (e.g., random assignment had only a small effect on the receipt of treatment), instrumental variable estimators and their standard errors can be biased and inconsistent.10

set.seed(1234) # For replication 
n = 1000 # Population size 
Y0 = runif(n) # Potential outcome under control condition 
Y1 = Y0 + 1 # Potential outcome under treatment 
Z = sample((1:n)%%2) # Treatment assignment 
pD<-pnorm(-1+rnorm(n,mean=2*Z)) # Non-compliance 
D<-rbinom(n,1,pD) # Treatment receipt with non-compliance 
Y = D*Y1 + (1-D)*Y0 # Outcome in population 
samp = data.frame(Z,D,Y) 

# IV estimate 
library(AER) 
CACE = coef(ivreg(Y ~ D | Z, data = samp))[2] 

# Wald Estimator 
ITT<-coef(lm(Y~Z,data=samp))[2] 
ITT.D<-coef(lm(D~Z,data=samp))[2] 
CACE.wald<-ITT/ITT.D
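To make the arithmetic of the ratio concrete, the sketch below reuses the one-sided non-compliance simulation from section 3, where compliers have a true effect of 2 and roughly half of the subjects are compliers. It uses base R only; the object names are illustrative. It also checks that, with a single binary instrument, the ratio of covariances used by instrumental variables regression gives the same number as the Wald ratio.

```r
set.seed(1234)              # For replication
n <- 1000                   # Sample size
Y0 <- runif(n)              # Potential outcome under control
C <- sample((1:n) %% 2)     # Complier indicator: 1 if complier, 0 if never-taker
Y1 <- Y0 + 1 + C            # Potential outcome under treatment (effect is 2 for compliers)
Z <- sample((1:n) %% 2)     # Treatment assignment
D <- Z*C                    # Treatment uptake: only assigned compliers are treated
Y <- D*Y1 + (1 - D)*Y0      # Observed outcome

# Numerator: effect of assignment on the outcome (ITT)
ITT <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Denominator: effect of assignment on uptake (share of compliers)
ITT.D <- mean(D[Z == 1]) - mean(D[Z == 0])

# Wald estimator of the CACE
CACE.wald <- ITT/ITT.D

# Instrumental variables estimate with one binary instrument: cov(Y,Z)/cov(D,Z)
CACE.iv <- cov(Y, Z)/cov(D, Z)

c(ITT = ITT, ITT.D = ITT.D, CACE.wald = CACE.wald, CACE.iv = CACE.iv)
```

The ITT of about 1 is the complier effect of 2 diluted by a compliance rate of about one half; dividing by that rate undoes the dilution.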

5 Population and Sample Average Treatment Effects

Often we want to generalize from our sample to make statements about some broader population of interest.11 Let \(S_i\) be an indicator for whether a subject is in our sample. The sample average treatment effect (SATE) is defined simply as \(E(Y_i(1)−Y_i(0)∣S_i=1)\) and the population average treatment effect (PATE) as \(E(Y_i(1)−Y_i(0))\). With a large random sample from a well-defined population and full compliance with treatment, our SATE and PATE are equal in expectation, and so a good estimate for one (like a difference of sample means) will be a good estimate for the other.12

In practice, the experimental pool may consist of a group of units selected in an unknown manner from a vaguely defined population of such units, and compliance with treatment assignment may be less than complete. In such cases our SATE may diverge from the PATE, and recovering estimates of each becomes more complicated. Imai, King, and Stuart (2008) decompose the divergence between these estimates into error that arises from sample selection and error that arises from treatment imbalance. Error from sample selection arises from different distributions of (observed and unobserved) covariates in our sample and population. For example, people in a medical trial often differ from the population for whom the drug would be available. Error from treatment imbalance reflects differences in covariates between treatment and control groups in our sample, perhaps because of non-random assignment and/or non-compliance.
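The sample-selection part of this argument can be sketched with a small simulation. Everything here is a hypothetical illustration (the population size, the covariate `X`, and the selection rule are assumptions made for the example): unit effects rise with `X`, and we compare the true SATE in a simple random sample with the true SATE in a sample where units with larger `X` are more likely to be included. We can compute these quantities directly only because the simulation gives us both potential outcomes.

```r
set.seed(1234)                      # For replication
N.pop <- 20000                      # Hypothetical population size
X <- rnorm(N.pop)                   # Covariate that moderates the treatment effect
Y0 <- rnorm(N.pop)                  # Potential outcome under control
Y1 <- Y0 + 1 + X                    # Potential outcome under treatment
tau <- Y1 - Y0                      # Unit-level effects (unobservable in practice)

PATE <- mean(tau)                   # Population average treatment effect

# (a) Simple random sample: every unit is equally likely to be included
S.random <- sample(N.pop, 1000)
SATE.random <- mean(tau[S.random])

# (b) Selected sample: inclusion probability rises with X
S.select <- sample(N.pop, 1000, prob = pnorm(X))
SATE.select <- mean(tau[S.select])

c(PATE = PATE, SATE.random = SATE.random, SATE.select = SATE.select)
```

In the random sample the SATE stays close to the PATE, while in the selected sample it is noticeably larger, because the sample over-represents units with large effects. An experiment run on the selected sample could estimate its own SATE well and still mislead about the population.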

While there are no simple solutions to the problems created by such error, there are steps you can take in both the design of your study and the analysis of your data to address these challenges to estimating the PATE or CACE/LATE. For example, including a placebo intervention provides additional information on the probability of receiving treatment that can be used to re-weight the effect of actually receiving it in the presence of non-compliance (e.g., Nickerson (2008)). One could also use a model to re-weight observations to adjust for covariate imbalance and the unequal probability of receiving the treatment, both within the sample and between a sample and the population of interest.13

In the code below, we demonstrate several approaches to estimating these effects implemented in the CausalGAM package for R.14 Specifically, the package produces regression, inverse-propensity weighting (IPW), and augmented inverse-propensity weighting (AIPW) estimates of the ATE. Combining regression adjustment with IPW, the AIPW estimator has the feature of being “doubly robust” in that the estimate is still consistent even if we have incorrectly specified either the regression model or the propensity score model used for the probability weighting (but not both).

# Example adapted from ?estimate.ATE 
library(CausalGAM) 
## ##
## ## CausalGAM Package
## ## Copyright (C) 2009 Adam Glynn and Kevin Quinn
set.seed(1234) # For replication 
n = 1000 # Sample size 
X1 = rnorm(n) # Pre-treatment covariates 
X2 = rnorm(n) 
p = pnorm(-0.5 + 0.75*X2) # Unequal probability of Treatment 
D = rbinom(n, 1, p) # Treatment 
Y0 = rnorm(n) # Potential outcomes 
Y1 = Y0 + 1 + X1 + X2 
Y = D*Y1 + (1-D)*Y0 # Observed outcomes 
samp = data.frame(X1,X2,D,Y) 

# Estimate ATE with AIPW, IPW, Regression weights 
ATE.out <- estimate.ATE(pscore.formula = D ~ X1 + X2, 
                        pscore.family = binomial, 
                        outcome.formula.t = Y ~ X1 + X2, 
                        outcome.formula.c = Y ~ X1 + X2, 
                        outcome.family = gaussian, 
                        treatment.var = "D", 
                        data = samp, 
                        divby0.action = "t", 
                        divby0.tol = 0.001, 
                        var.gam.plot = FALSE, 
                        nboot = 50)
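To show what the three estimators compute, here is a hand-rolled sketch in base R on the same simulated data. It is not the CausalGAM implementation (which fits generalized additive models and bootstraps standard errors); it uses a plain logit for the propensity score and linear outcome regressions, and the object names are illustrative.

```r
set.seed(1234)                       # For replication
n <- 1000                            # Sample size
X1 <- rnorm(n)                       # Pre-treatment covariates
X2 <- rnorm(n)
p <- pnorm(-0.5 + 0.75*X2)           # Unequal probability of treatment
D <- rbinom(n, 1, p)                 # Treatment
Y0 <- rnorm(n)                       # Potential outcomes
Y1 <- Y0 + 1 + X1 + X2
Y <- D*Y1 + (1 - D)*Y0               # Observed outcomes
samp <- data.frame(X1, X2, D, Y)

# Propensity score model (a logit, so slightly misspecified for this probit process)
e.hat <- predict(glm(D ~ X1 + X2, data = samp, family = binomial), type = "response")

# Outcome models fit separately in each arm, then predicted for every unit
mu1.hat <- predict(lm(Y ~ X1 + X2, data = subset(samp, D == 1)), newdata = samp)
mu0.hat <- predict(lm(Y ~ X1 + X2, data = subset(samp, D == 0)), newdata = samp)

# Regression estimate: average difference in predicted outcomes
ATE.reg <- mean(mu1.hat - mu0.hat)

# IPW estimate: weight observed outcomes by the inverse probability of their condition
ATE.ipw <- mean(D*Y/e.hat - (1 - D)*Y/(1 - e.hat))

# AIPW estimate: regression estimate plus inverse-probability-weighted residuals
ATE.aipw <- mean(mu1.hat - mu0.hat +
                 D*(Y - mu1.hat)/e.hat -
                 (1 - D)*(Y - mu0.hat)/(1 - e.hat))

c(Regression = ATE.reg, IPW = ATE.ipw, AIPW = ATE.aipw)
```

The AIPW line makes the "doubly robust" idea visible: if the outcome models are right, the weighted residuals average to about zero and the regression estimate carries the result; if only the propensity model is right, the weighted residuals correct the error in the outcome models.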

6 Average Treatment Effects on the Treated and the Control

To evaluate the policy implications of a particular intervention, we often need to know the effects of the treatment not just on the whole population but specifically for those to whom the treatment is administered. We define the average effects of treatment among the treated (ATT) and the control (ATC) as simple counter-factual comparisons:

\[ATT=E(Y_i(1)−Y_i(0)∣D_i=1)=E(Y_i(1)∣D_i=1)−E(Y_i(0)∣D_i=1)\]

\[ATC=E(Y_i(1)−Y_i(0)∣D_i=0)=E(Y_i(1)∣D_i=0)−E(Y_i(0)∣D_i=0)\]

Informally, the ATT is the effect for those that we treated; ATC is what the effect would be for those we did not treat.

When treatment is randomly assigned and there is full compliance, \(ATE=ATT=ATC\), since \(E(Y_i(0)∣D_i=1)=E(Y_i(0)∣D_i=0)\) and \(E(Y_i(1)∣D_i=0)=E(Y_i(1)∣D_i=1)\). Often, either because of the nature of the intervention or specific concerns about cost and ethics, treatment compliance is incomplete and the ATE will not in general equal the ATT or ATC. In such instances, we saw in the previous section that we could re-weight observations by their probability of receiving the treatment to recover estimates of the ATE. The same logic can be extended to produce estimates of the ATT and ATC in both our sample and the population.15

Below, we create a case where the probability of receiving treatment varies but can be estimated using a propensity score model.16 The predicted probabilities from this model are then used as weights to recover estimates of the ATE, ATT, and ATC. Inverse propensity score weighting attempts to balance the distribution of covariates between treatment and control groups when estimating the ATE. For the ATT, this weighting approach treats subjects in the treated group as a sample from the target population (people who received the treatment) and weights subjects in the control group by their odds of receiving the treatment. In a similar fashion, the estimate of the ATC weights treated observations to look like controls. The quality (unbiasedness) of these estimates is inherently linked to the quality of our models for predicting the receipt of treatment. Inverse propensity score weighting and other procedures produce balance between treatment and control groups on observed covariates, but unless we have the “true model” (and we almost never know the true model), the potential for bias from unobserved covariates remains and should lead us to interpret the estimated ATT or ATC in light of the quality of the model that produced it.

set.seed(1234) # For replication
n = 1000 # Sample size 
X1 = rnorm(n) # Pre-treatment covariates 
X2 = rnorm(n) 
p = pnorm(-0.5 + 0.75*X2) # Unequal probability of Treatment 
D = rbinom(n, 1, p) # Treatment 
Y0 = rnorm(n) # Potential outcomes 
Y1 = Y0 +1 +X1 +X2 
Y = D*Y1 + (1-D)*Y0 # Observed outcomes 
samp = data.frame(X1,X2,D,Y) 
# Propensity score model 
samp$p.score<-predict(glm(D~X1+X2,data=samp,family=binomial),type="response") 

# Inverse Probability Weights 
samp$W.ipw<-with(samp, ifelse(D==1,1/p.score,1/(1-p.score))) 
samp$W.att<-with(samp, ifelse(D==1,1,p.score/(1-p.score))) 
samp$W.atc<-with(samp, ifelse(D==1,(1-p.score)/p.score,1)) 

# IPW: ATE, ATT, ATC 
ATE.ipw<-coef(lm(Y~D,data=samp,weights=W.ipw))[2]
ATT.ipw<-coef(lm(Y~D,data=samp,weights=W.att))[2] 
ATC.ipw<-coef(lm(Y~D,data=samp,weights=W.atc))[2]
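Because the data are simulated, we can also compute the true sample ATE, ATT, and ATC directly from the potential outcomes and use them as benchmarks. The sketch below regenerates the same data with base R only (object names such as `tau` and `naive` are illustrative) and shows why the unweighted difference of means is a poor estimate of the ATE in this design.

```r
set.seed(1234)                       # For replication
n <- 1000                            # Sample size
X1 <- rnorm(n)                       # Pre-treatment covariates
X2 <- rnorm(n)
p <- pnorm(-0.5 + 0.75*X2)           # Unequal probability of treatment
D <- rbinom(n, 1, p)                 # Treatment
Y0 <- rnorm(n)                       # Potential outcomes
Y1 <- Y0 + 1 + X1 + X2
Y <- D*Y1 + (1 - D)*Y0               # Observed outcomes

# True effects, available only because we simulated both potential outcomes
tau <- Y1 - Y0
ATE.true <- mean(tau)
ATT.true <- mean(tau[D == 1])
ATC.true <- mean(tau[D == 0])

# The ATE is the average of the ATT and ATC, weighted by the share treated
share.treated <- mean(D)
ATE.check <- share.treated*ATT.true + (1 - share.treated)*ATC.true

# Unweighted difference of observed means
naive <- mean(Y[D == 1]) - mean(Y[D == 0])

c(ATE = ATE.true, ATT = ATT.true, ATC = ATC.true, Naive = naive)
```

Units with larger values of X2 are both more likely to be treated and more responsive to treatment, so the ATT exceeds the ATE, which in turn exceeds the ATC. The unweighted difference of means tracks the ATT rather than the ATE here, which is the gap the weights are meant to close.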

7 Quantile Average Treatment Effects

The ATE focuses on the middle, in a way on the effect for a typical person, but we often also care about the distributional consequences of our treatment. We want to know not just whether our treatment raised average income, but also whether it made the distribution of income in the study more or less equal.

Claims about distributions are difficult. Even though we can estimate the ATE from a difference of sample means, in general, we cannot make statements about the joint distribution of potential outcomes \((F(Y_{i}(1),Y_{i}(0)))\) without further assumptions. Typically, these assumptions either limit our analysis to a specific sub-population17 or require us to assume some form of rank invariance in the distribution of responses to treatment.18 See Frölich and Melly (2010) for a fairly concise discussion of these issues and Abbring and Heckman (2007) for a thorough overview (Abbring, Jaap H., and James J. Heckman. 2007. “Econometric Evaluation of Social Programs, Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation.” Handbook of Econometrics 6. Elsevier: 5145–5303).

If these assumptions are justified for our data, we can obtain consistent estimates of quantile treatment effects (QTE) using quantile regression.19 Just as linear regression estimates the ATE as a difference in means (or, when covariates are used in the model, from a conditional mean), quantile regression fits a linear model to a conditional quantile, and this model can then be used to estimate the effects of treatment for that particular quantile of the outcome. The approach can be extended to include covariates and instruments for non-compliance. Note that the interpretation of the QTE is for a given quantile, not an individual at that quantile.

Below we show a case where the ATE is 0, but the treatment effect is negative for low quantiles of the response and positive for high quantiles. Estimating quantile treatment effects provides another tool for detecting heterogeneous effects and allows us to describe the distributional consequences of our intervention. These added insights come at the cost of requiring more stringent statistical assumptions of our data and more nuanced interpretations of our results.

set.seed(1234) # For replication
n = 1000 # Population size
Y0 = runif(n) # Potential outcome under control condition
Y1= Y0 
Y1[Y0 <.5] = Y0[Y0 <.5]-rnorm(length(Y0[Y0 <.5])) 
Y1[Y0 >.5] = Y0[Y0 >.5]+rnorm(length(Y0[Y0 >.5])) 
D = sample((1:n)%%2) # Treatment: 1 if treated, 0 otherwise 
Y = D*Y1 + (1-D)*Y0 # Outcome in population 
samp = data.frame(D,Y) 
library(quantreg) 
ATE = coef(lm(Y~D,data=samp))[2] 
QTE = rq(Y~D,tau = seq(.05,.95,length.out=10),data=samp,method = "fn") 

plot(summary(QTE),parm=2,main="",ylab="QTE",xlab="Quantile",mar = c(5.1, 4.1, 2.1, 2.1)) 
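With a single binary treatment and no covariates, the quantile regression coefficient on the treatment is, up to the convention used to compute sample quantiles, the difference between the treated and control quantiles. The following base R sketch computes those differences directly for the same simulated data, as a rough cross-check on the plot above; the names `taus` and `QTE.diff` are illustrative.

```r
set.seed(1234)                                      # For replication
n <- 1000                                           # Sample size
Y0 <- runif(n)                                      # Potential outcome under control
Y1 <- Y0
Y1[Y0 < .5] <- Y0[Y0 < .5] - rnorm(sum(Y0 < .5))    # Add mean-zero noise below the median
Y1[Y0 > .5] <- Y0[Y0 > .5] + rnorm(sum(Y0 > .5))    # Add mean-zero noise above the median
D <- sample((1:n) %% 2)                             # Treatment: 1 if treated, 0 otherwise
Y <- D*Y1 + (1 - D)*Y0                              # Observed outcome

# Same grid of quantiles as in the quantile regression
taus <- seq(.05, .95, length.out = 10)

# QTE at each quantile: treated quantile minus control quantile
QTE.diff <- quantile(Y[D == 1], taus) - quantile(Y[D == 0], taus)

# Difference of means for comparison
ATE.diff <- mean(Y[D == 1]) - mean(Y[D == 0])

round(QTE.diff, 2)
```

Treatment leaves the mean roughly unchanged but spreads the outcome distribution out, so the differences are negative at the low quantiles and positive at the high ones.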

8 Mediation Effects

Sometimes we want to describe not just the magnitude and significance of an observed causal effect, but also the mechanism (or mechanisms) that produced it. Did our intervention raise turnout in the treatment group, in part, by increasing these subjects’ sense of political efficacy? If so, how much of that total effect can be attributed to the mediated effects of our treatment on efficacy and efficacy on turnout?

Baron and Kenny (1986) offer a general framework for thinking about mediation by decomposing the total effect of treatment into its indirect effect on a mediator that then affects the outcome, called an average causal mediation effect (ACME), and the remaining average direct effect (ADE) of the treatment. Unbiased estimation of these effects, however, requires a set of strong assumptions about the relationship between treatment, mediators, outcomes, and potential confounders, collectively called sequential ignorability (Imai, Keele, and Yamamoto (2010), Bullock, Green, and Ha (2010)).20

Most causal effects likely operate through multiple channels, and so an assumption of sequential ignorability for your experiment can be hard to justify. For example, the top row in the figure below illustrates situations in which sequential ignorability holds, while the bottom row depicts two (of many possible) cases in which sequential ignorability is violated and mediation analysis is biased. In essence, specifying the effects of a particular mediator requires strong assumptions about the role of all the other mediators in the causal chain. While some experimental designs can, in theory, provide additional leverage (such as running a second, parallel experiment in which the mediator is also manipulated), in practice these designs are hard to implement and still sensitive to unobserved bias. In some cases, the insights we hope to gain from mediation analysis may be more easily acquired from subgroup analysis and experiments designed to test for moderation.

Imai and colleagues propose an approach to mediation analysis that allows researchers to test the sensitivity of their estimates to violations of sequential ignorability.21 In the code, we demonstrate some of the features of their approach, implemented in the mediation package in R (Tingley et al. 2014). We model the relationships with OLS, but the package is capable of handling other outcome processes, such as generalized linear models or generalized additive models, that may be more appropriate for your data. Most importantly, the package allows us to produce bounds that reflect the sensitivity of our point estimates to some violations of sequential ignorability. In our simulated data, just over 20 percent of the total effect is mediated by our proposed mediator, \(M\), and the bias from an unobserved pre-treatment confounder would have to be quite large (ρ=.7) before we would reject the finding of a positive ACME. These bounds are only valid, however, if we believe there are no unobserved post-treatment confounders (as in panel 4). Sensitivity analysis is still possible, but more complicated in such settings (Imai and Yamamoto 2013).
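As a simplified illustration of the decomposition itself, the base R sketch below applies the product-of-coefficients approach with OLS to a hypothetical simulation in which sequential ignorability holds by construction. It does not use the mediation package and does not perform the sensitivity analysis described above; the effect sizes (0.5, 0.4, and 1) and all object names are assumptions chosen for the example, so the proportion mediated will differ from the figure quoted in the text.

```r
set.seed(1234)                        # For replication
n <- 1000                             # Sample size
D <- sample((1:n) %% 2)               # Randomized treatment
M <- 0.5*D + rnorm(n)                 # Mediator: treatment raises M by 0.5
Y <- 1*D + 0.4*M + rnorm(n)           # Outcome: direct effect 1, effect of M is 0.4

# Mediator model and outcome model
fit.m <- lm(M ~ D)
fit.y <- lm(Y ~ D + M)

# ACME: (effect of D on M) x (effect of M on Y, holding D fixed)
ACME <- coef(fit.m)[["D"]] * coef(fit.y)[["M"]]

# ADE: effect of D on Y, holding M fixed
ADE <- coef(fit.y)[["D"]]

# Total effect from the regression of Y on D alone
total <- coef(lm(Y ~ D))[["D"]]

# Share of the total effect that runs through M
prop.mediated <- ACME/total

c(ACME = ACME, ADE = ADE, Total = total, PropMediated = prop.mediated)
```

With linear models and no interaction between treatment and mediator, the estimated ACME and ADE sum exactly to the estimated total effect. The decomposition has a causal interpretation only because the simulation rules out confounding of the mediator and the outcome, which is the assumption real data rarely guarantee.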