1. Randomization inference is a method for calculating p-values for hypothesis tests

One of the advantages of conducting a randomized trial is that the researcher knows the precise procedure by which the units were allocated to treatment and control. Randomization inference considers what would have happened under all possible random assignments, not just the one that happened to be selected for the experiment at hand. Against the backdrop of all possible random assignments, is the actual experimental result unusual? How unusual is it?

2. Randomization inference starts with a null hypothesis

After we have conducted an experiment, we observe outcomes for the control group in their untreated state and outcomes for the treatment group in their treated state. In order to simulate all possible random assignments, we need to stipulate the counterfactual outcomes: what we would have observed among control units had they been treated, or among treated units had they not been treated. The sharp null hypothesis of no treatment effect for any unit is a skeptical worldview that allows us to stipulate all of the counterfactual outcomes. If there were no treatment effect for any unit, then all the control units’ outcomes would have been unchanged had they been placed in treatment. Similarly, the treatment units’ outcomes would have been unchanged had they been placed in the control group.

Under the sharp null hypothesis, we therefore have a complete mapping from our data to the outcomes of all possible experiments. All we need to do is construct all possible random assignments and, for each one, calculate the test statistic (e.g., the difference in means between the assigned treatment group and the assigned control group). The collection of these test statistics over all possible random assignments forms a reference distribution under the null hypothesis. To gauge how unusual our actual experimental test statistic is, we compare it to this reference distribution. For example, our experiment might yield an estimate of 6.5, while 24% of all random assignments produce an estimate of 6.5 or more even in the absence of any treatment effect. In that case, our one-tailed p-value would be 0.24.
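Before turning to the ri package, here is a minimal base-R sketch of this logic (not part of the guide's original code); it reproduces the numbers shown in the worked example below by enumerating every way to treat 2 of the 7 units while holding the observed outcomes fixed, as the sharp null allows.

Y <- c(15, 15, 20, 20, 10, 15, 30)   # observed outcomes (Table 2-1, Gerber and Green 2012)
assignments <- combn(7, 2)           # all choose(7,2) = 21 ways to treat 2 of 7 units
null_dist <- apply(assignments, 2, function(treated)
  mean(Y[treated]) - mean(Y[-treated]))   # difference in means, Y fixed under the sharp null
mean(null_dist >= 6.5)               # one-tailed p-value: 5/21 = 0.238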

# Worked example of randomization inference

rm(list=ls())       # clear objects in memory
library(ri)         # load the RI package
set.seed(1234567)   # random number seed, so that results are reproducible

# Data are from Table 2-1, Gerber and Green (2012)

Y0 <- c(10, 15, 20, 20, 10, 15, 15)   # potential outcomes if untreated
Y1 <- c(15, 15, 30, 15, 20, 15, 30)   # potential outcomes if treated


Z <-  c(1,0,0,0,0,0,1)       # one possible treatment assignment
Y <-  Y1*Z + Y0*(1-Z)  # observed outcomes given assignment


probs <- genprobexact(Z, blockvar=NULL)   # probability of treatment under complete random assignment with no blocking; here 2/7 for every unit

ate <- estate(Y,Z,prob=probs)      # estimate the ATE

perms <- genperms(Z, maxiter=10000, blockvar=NULL)   # enumerate possible random assignments (maxiter caps their number; with only 21 here, all are enumerated exactly)

# show all 21 possible random assignments in which 2 units are treated
perms
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## 1    1    1    1    1    1    1    0    0    0     0     0     0     0
## 2    1    0    0    0    0    0    1    1    1     1     1     0     0
## 3    0    1    0    0    0    0    1    0    0     0     0     1     1
## 4    0    0    1    0    0    0    0    1    0     0     0     1     0
## 5    0    0    0    1    0    0    0    0    1     0     0     0     1
## 6    0    0    0    0    1    0    0    0    0     1     0     0     0
## 7    0    0    0    0    0    1    0    0    0     0     1     0     0
##   [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## 1     0     0     0     0     0     0     0     0
## 2     0     0     0     0     0     0     0     0
## 3     1     1     0     0     0     0     0     0
## 4     0     0     1     1     1     0     0     0
## 5     0     0     1     0     0     1     1     0
## 6     1     0     0     1     0     1     0     1
## 7     0     1     0     0     1     0     1     1
# --------------------------------------------------------------------
# estimate sampling dist under the sharp null that tau=0 for all units
# --------------------------------------------------------------------

Ys <- genouts(Y,Z,ate=0)    # create potential outcomes under the sharp null of no effect for any unit

# show the apparent potential outcomes under the sharp null
Ys
## $Y0
## [1] 15 15 20 20 10 15 30
## 
## $Y1
## [1] 15 15 20 20 10 15 30
distout <- gendist(Ys,perms,prob=probs)  # generate the sampling distribution based on the schedule of potential outcomes implied by the null hypothesis

ate                             # estimated ATE
## [1] 6.5
sort(distout)                   # list the distribution of possible estimates under the sharp null of no effect
##  [1] -7.5 -7.5 -7.5 -4.0 -4.0 -4.0 -4.0 -4.0 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5
## [15]  3.0  3.0  6.5  6.5  6.5 10.0 10.0
sum(    distout  >=     ate )/nrow(as.matrix(distout))   # one-tailed comparison used to calculate p-value
## [1] 0.2380952
sum(abs(distout) >= abs(ate))/nrow(as.matrix(distout))   # two-tailed comparison used to calculate p-value
## [1] 0.3809524
dispdist(distout,ate)        # display p-values, 95% confidence interval, standard error under the null, and graph the sampling distribution under the null

## $two.tailed.p.value
## [1] 0.4761905
## 
## $two.tailed.p.value.abs
## [1] 0.3809524
## 
## $greater.p.value
## [1] 0.2380952
## 
## $lesser.p.value
## [1] 0.9047619
## 
## $quantile
##  2.5% 97.5% 
##  -7.5  10.0 
## 
## $sd
## [1] 5.322906
## 
## $exp.val
## [1] 0
# Compare results to a traditional t-test with unequal variances
t.test(Y~Z,
       alternative = "less",
       mu = 0, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Y by Z
## t = -0.8409, df = 1.1272, p-value = 0.2708
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 33.94232
## sample estimates:
## mean in group 0 mean in group 1 
##            16.0            22.5
t.test(Y~Z,
       alternative = "two.sided",
       mu = 0, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Y by Z
## t = -0.8409, df = 1.1272, p-value = 0.5416
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -82.03199  69.03199
## sample estimates:
## mean in group 0 mean in group 1 
##            16.0            22.5

3. Randomization inference gives exact p-values when all possible random assignments can be simulated

When the reference distribution is known based on a complete census of possible random assignments, p-value calculations are exact – there are no theoretical approximations based on assumptions about the shape of the sampling distribution. Sometimes the set of possible random assignments is so large that a full census is infeasible. In that case, the reference distribution can be approximated to an arbitrary level of precision by randomly sampling from the set of possible random assignments a large number of times. Thousands or tens of thousands of simulated random assignments are recommended.
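As a rough sketch of this sampling approach, assuming complete random assignment and reusing Y, Z, and ate from the worked example above (where a full census of the 21 assignments is easy, but the same code scales to designs where it is not):

n_sims <- 10000                               # number of simulated random assignments
sim_dist <- replicate(n_sims, {
  z_sim <- sample(Z)                          # randomly permute the actual assignment vector
  mean(Y[z_sim == 1]) - mean(Y[z_sim == 0])   # difference in means under the sharp null
})
mean(sim_dist >= ate)                         # approximate one-tailed p-value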

4. Randomization inference requires the analyst to specify a test statistic, and some are more informative than others

In principle, any test statistic can be used as input for randomization inference, which in turn outputs a p-value. Some test statistics provide more informative results than others, however. For example, although the simple difference-in-means often performs well, there are good arguments for other test statistics, such as the t-ratio using a robust standard error. In this case, the researcher would calculate the test statistic for the actual experiment and compare it to a reference distribution of robust t-statistics under the sharp null hypothesis of no effect.
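For instance, the following sketch (not part of the guide's original code) uses the Welch t-ratio as the studentized test statistic, reusing Y, Z, and perms from the worked example above; a t-ratio based on regression with robust standard errors would be handled the same way.

welch_t <- function(y, z) {      # difference in means divided by its unequal-variance SE
  y1 <- y[z == 1]
  y0 <- y[z == 0]
  (mean(y1) - mean(y0)) / sqrt(var(y1)/length(y1) + var(y0)/length(y0))
}
t_obs  <- welch_t(Y, Z)                                # t-ratio in the actual experiment
t_null <- apply(perms, 2, function(z) welch_t(Y, z))   # t-ratios across all assignments under the sharp null
mean(abs(t_null) >= abs(t_obs))                        # two-tailed p-value for the studentized statistic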

5. Randomization inference may give different p-values from conventional tests when the number of observations is small and when the distribution of outcomes is non-normal

Conventional p-values typically rely on approximations that presuppose either that the outcomes are normally distributed or that the subject pool is large enough that the test statistics follow a posited sampling distribution. When outcomes are highly skewed, as in the case of donations (a few people donate large sums, but the overwhelming majority donate nothing), conventional methods may produce inaccurate p-values. Gerber and Green (2012, p. 65) give the following example, in which randomization inference and conventional test statistics produce different results: