# Abstract

The “Multiple Comparisons Problem” is the problem that standard statistical procedures can be misleading when researchers conduct a large group of hypothesis. When a research has multiple “bites at the apple,” the chances are that some finding will appear “significant” even when there’s nothing going on.

Classical hypothesis tests assess statistical significance by calculating the probability under a null hypothesis of obtaining estimates as large or larger as the observed estimate. When multiple tests are conducted, however, classical p-values are incorrect — they no longer reflect the true probability under the null.

This guide 1 will help you guard against drawing false conclusions from your experiments. We focus on the big ideas and provide examples and tools that you can use in R.

# 1 Almost every social science experiment faces a multiple comparisons problem

Typically, researchers are not interested in just one treatment versus control comparison per experiment. There are three main ways that comparisons proliferate:

1. Multiple treatment arms. When an experiment has n treatment arms, there are n(n-1)/2 possible comparisons.
2. Heterogeneous treatment effects. Often, we are interested in whether the treatment has different impacts on different subgroups. For example, a treatment might be more effective for women than for men.
3. Multiple estimators. Often, experimenters will apply multiple estimators to the same dataset: for example, difference-in-means and covariate adjustment. There is of course nothing wrong with employing multiple treatment arms, exploring treatment effect heterogeneity, or using multiple estimators of the treatment effect. However, these design and analysis choices sometimes require that researchers correct their statistical tests to account for multiple comparisons.
4. Multiple outcomes. Researchers often assess the effects of an intervention on multiple distinct outcomes or multiple operationalizations of the outcome variable.

These concerns are especially problematic when making a “family claim,” that is, when you are summarizing a series of results. For example, a family claim might be that treatments A, B, C, and D had no effect, but treatment E did. Or, similarly, the treatment had no effect among group 1, group 2, or group 3, but had a strong effect among group 4.

The multiple comparisons problem is related to, but different from, the problem of “fishing.” Fishing occurs when an unscrupulous analyst conducts many tests but only reports the “interesting” ones. In essence, fishing withholds the necessary information we would need in order to correct for multiple comparisons.

# 2 Why multiple comparisons are a problem

In classical hypothesis testing, the “alpha level” describes how willing the researcher is to make a certain kind of mistake — a so-called Type I error. A Type I error occurs when a researcher falsely concludes that an observed difference is “real,” when in fact there is no difference. In many social science applications, the alpha level, or Type I error rate, is set to 0.05. This means that the researcher is willing to commit a Type I error 5% of the time.

But when we move to the world of multiple comparisons, this simple testing framework is no longer sufficient. We are trying to avoid the scientific error of falsely concluding that an effect is practically or substantively meaningful — but if we apply the classical hypothesis testing framework without addressing multiple comparisons, we are much more likely to commit such an error.

In this guide, we will describe three main approaches for addressing the multiple comparisons problem:

1. p-value adjustments. Statisticians have derived a number of corrections that can guard against multiple comparisons mistakes. As described in the next section, these corrections control either the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR). Most of these adjustments apply a simple formula to a series of “raw” p-values; we will also describe a simulation method that can take account of features of a specific research setting.
2. Pre-analysis plans. These plans are a powerful design-based tool for “calling your shot” so to speak.
3. Replication. If we are concerned that a finding is simply an artifact of sampling variability that we happened to discover because of a blind application of classical hypothesis testing, then the best way to resolve the question is to conduct the experiment again.

# 3 Don’t mix up the FWER and FDR!

In the world of multiple testing, the Type I error rate might refer to one of two things: the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR).

The FWER is the probability of incorrectly rejecting even one null hypothesis. Suppose we have three null hypotheses, all of which are true. When the null hypothesis is true, but we nevertheless reject it in favor of some alternative, we commit a Type I error. If we set alpha (the Type I error rate) to be 0.05, we have a [$$1−(1−0.05)^3=14.2%$$] chance of rejecting at least one of them. In order to control the FWER (i.e., reduce it from 14.2% back down to 5%), we need to employ a correction. We’ll explore three ways to control the FWER (Bonferroni, Holm, and simulation) in the sections below.

The FDR is subtly different. It is the expected proportion of false discoveries among all discoveries. In the case where no discoveries are found then the “share” of false discoveries is taken to be zero. There are some connections to the FWER. For example, if in fact all null hypotheses are true then the FWER and FDR are the same. To see this note that in this case if no significant rejections of the null are found, then the share that are false discoveries is zero. But if some are found (no matter how many), then the share that are false is 100% (since all null hypotheses are true). So the false discovery rate in this case is just the probability that some true null hypothesis is falsely rejected—the FWER. Beyond this case however the FDR is less stringent than the FWER.

# 4 The Bonferroni correction is too extreme

The Bonferroni correction is a commonly-used approach for addressing the multiple comparisons problem, partly because of its simplicity. If you conduct k tests, the target significance level should be alpha/k, or, equivalently, you multiply your p-values by k, and apply the standard alpha level. (The trouble with multiplying the p-values is sometimes you end up with values over one, rendering the interpretation of the p-values incoherent.)

For example, suppose you conduct an experiment that has 3 dependent variables. You conduct three difference-in-means tests that yield the following classical p-values: 0.004, 0.020, and 0.122. If your alpha level is the standard 0.05 threshold, then you would usually declare the first two tests statistically significant and the last test insignificant. The Bonferroni correction, however, adjusts the target p-value to 0.05/3 = 0.016. We then declare only the first test to be statistically significant.

The Bonferroni correction works under the most extreme circumstances, that is, when all k tests are independent from one another. To see how this works, imagine we are testing three true null hypotheses using a classical alpha level of 0.05. Each test, therefore has a 5% chance of yielding the wrong answer that the null hypothesis is false.

But our chances of making at least one mistake are much greater than 5% because we have three chances to get it wrong. As above, this probability is in fact [$$1 - (1 - 0.05)^3 = 14.2%$$] . If we use the Bonferroni correction, however, our chances of getting it wrong fall back to our target alpha value: [$$1 - (1 - 0.05/3)^3 \approx 0.05$$] .

This correction works in the worst-case scenario that all tests are independent. But in most cases, tests are not independent. That is, if your treatment moves outcome A, it probably moves outcome B too, at least a little. So what tends to happen is, researchers report that their results “withstand” Bonferroni when they are extremely strong, but decry Bonferroni as too extreme when the results are weaker.

Instead of using the Bonferroni correction, you can use the Holm correction. It is strictly more powerful than Bonferroni, and is valid under the same assumptions. It also controls the FWER. Suppose you have $$m$$ p-values. Order them from smallest to largest. Find the smallest p-value that satisfies this condition: $$p_{k}>\frac{α}{m+1−k}$$, where $$k$$ is the p-value’s index. This and all larger p-values are insignificant; all smaller p-values are significant.

Taking our three p-values from above: 0.004, 0.020, and 0.122: $0.004<\frac{0.05}{3+1−1}=0.017$ $0.020<\frac{0.05}{3+1−2}=0.025$ $0.122>\frac{0.05}{3+1−3}=0.050$

Under the Holm correction, the first two tests are significant, but the last test is not.

# 5 Control the FDR with Benjamini-Hochberg

The Benjamini–Hochberg procedure controls the FDR. Like the Holm correction, you also begin by ordering $$m$$ p-values. Then you find the largest p-value that satisfies: $$p_{k}≤\frac{k}{m}α$$. This test, and all tests with smaller p-values are declared significant.

$0.004<\frac{1}{3}0.05=0.017$ $0.020<\frac{2}{3}0.05=0.033$ $0.122>\frac{3}{3}0.05=0.050$

Using the Benjamini–Hochberg procedure, the first two tests are significant, but the third is not.

# 6 It’s easy to implement these procedures in R

In R, the p.adjust() function contains many of the corrections devised by statisticians to address the multiple comparisons problem. The p.adjust() function is in base R, so no additional packages are required.

# Set seed for reproducibility
set.seed(343)

# Generate 50 test statistics
# Half are drawn from a normal with mean 0
# The other half are drawn from a normal with mean 3
x <- rnorm(50, mean = c(rep(0, 25), rep(3, 25)))

# Obtain 50 p-values
p <- round(2*pnorm(sort(-abs(x))), 3)

# Choose alpha level
alpha <- 0.05

# Without any corrections
sig <- p < alpha

# Conduct three corrections
# and compare to target alpha
bonferroni_sig <- p.adjust(p, "bonferroni") < alpha
holm_sig <- p.adjust(p, "holm") < alpha
BH_sig <- p.adjust(p, "BH") <alpha

The results of this simulation are presented in the table and figure below.

Correction Type No Correction Benjamini-Hochberg Holm Bonferroni
Statistically Significant 25 22 11 8
Not Statistically Significant 25 28 39 42

Of the 25 null hypotheses that would be rejected if no correction were made, the Bonferroni correction only rejects 8, the Holm procedure rejects 11, and the Benjamini–Hochberg procedure rejects 22. Of these three corrections, Bonferroni is the most stringent while Benjamini–Hochberg is the most lenient.

# 7 A better way to control the FWER is simulation

The trouble with the corrections above is that they struggle to address the extent to which the multiple comparisons are correlated with one another. A straightforward method of addressing this problem is simulation under the sharp null hypothesis of no effect for any unit on any dependent variable. Note that this is a family-wise sharp null.

If the treatment has no effect at all on anything, then we observe all potential outcomes for all subjects. We can re-randomize the experiment 1000 or more times and conduct all k hypothesis tests each time. We know for sure that all k null hypotheses are true, because the treatment has no effect by construction.

The next step is picking the right threshold value below which results are deemed statistically significant. If alpha is 0.05, we need to find the target p-value that, across all simulations under the sharp null, yields 5% significant hypothesis tests.

Once we have the right threshold value, it’s as easy as comparing the uncorrected p-values to the threshold value — those below the threshold are deemed significant.

# Control the FWER through simulation
rm(list=ls())
library(mvtnorm)
library(randomizr)

# Helper functions
do_t_test <- function(Y, Z){
t.test(Y[Z==1], Y[Z==0])\$p.value
}

permute_treat <- function(){
treatment_sim <- complete_ra(n, m=n/2)
ps_sim <- apply(outcomes, 2, do_t_test, Z = treatment_sim)
return(ps_sim)
}

threshold_finder<- function(threshold){
mean(apply(many_ps, 2, x <- function(x) sum(x <= threshold) > 0 ))
}

# Set a seed
set.seed(343)

# Generate correlated outcomes
# Outcomes are unrelated to treatment
# All null hypotheses are true

n <- 1000
k <- 100; r <- .7; s <- 1
sigma <- matrix(s*r, k,k)
diag(sigma) <- s
outcomes <- rmvnorm(n=n, mean=rep(0, k), sigma=sigma)

# Complete Random Assignment
treatment <- complete_ra(n, m=n/2)

# Conduct k hypothesis tests
p_obs <- apply(outcomes, 2, do_t_test, Z = treatment)

# Simulate under the sharp null
many_ps <- replicate(1000, permute_treat(), simplify = TRUE)

# Obtain the Type I error rate for a series of thresholds
thresholds <- seq(0, 0.05, length.out = 1000)
type_I_rate <- sapply(thresholds, threshold_finder)

# Find the largest threshold that yields an alpha type I error rate
target_p_value <- thresholds[max(which(type_I_rate <=0.05))]

# Apply target p_value to observed p_values
sig_simulated <- p_obs <= target_p_value

# Compare to raw p-values
sig <- p_obs <= 0.05

The target p-value obtained by the simulation is 0.002 — hypothesis tests with raw p-values below 0.002 are deemed significant. Compare this with the Bonferroni method, which would require a p-value below 0.05/100 = 0.0005, an order of magnitude smaller. The closer the correlation of the tests (the parameter “r” in the code above) is to zero, the closer the two methods will be.

The flexibility of the simulation method is both an advantage and a disadvantage. The advantage is that it can accommodate any set of estimation procedures, returning a study-specific correction that will generally be more powerful than other methods to control the FWER. The disadvantage is that it requires the researcher to code up a simulation — there are no prewritten functions that will apply across research contexts.

Here are some guidelines and tips for writing your own simulation.

1. Follow the original random assignment procedure as exactly as possible. For example, if you block-randomized your experiment, make sure your simulations permute the treatment assignment according to the same blocks.
2. Each simulation should return a set of p-values. (this was accomplished in the permute_treat() function above.)
3. Be sure to count up the number of simulations in which at least one test was deemed significant, not the average number of tests across all simulations deemed significant.

This calculator works best in Firefox. To use full-screen, go here.

# 9 Creating an index is a way to get a single comparison out of many

Suppose2 a researcher measures $$k>1$$ dependent variables. Indexing allows the researcher to reduce these $$k$$ outcomes into a single measure (or several thematically-grouped measures). These indexing methods effectively condense the number of dependent variables that investigators test in order to address the multiple comparisons problem. There are a number of ancillary benefits of these indexing methods:

• Unlike the other methods of addressing the multiple comparisons problem, the indexing approach may reward researchers for increasing the number of dependent variables. Imagine that a researcher collects $$k=50$$ outcome variables and that the treatment does not cause significant differences for any of them, but all the point estimates are in the same direction. If we were to apply a multiple comparisons correction to our 50 tests, our results would get even murkier. However, if we combine all 50 dependent variables into a single index, the resulting dependent variable may in fact exhibit significant differences.
• In the presence of limited amounts of attrition across outcomes, these methods may provide some leverage for dealing with missingness on some dependent variables (but not for units for which outcome variables are entirely unobserved).

There are two principal indexing methods in the literature:

### Mean Effects Index

Kling, Liebman, and Katz (2004) employ a mean effects index, constructed as follows:

1. If necessary, reorient some outcomes so that beneficial effects are consistently scored higher across all outcomes.
2. Calculate a $$z$$-score, $$\tilde{y}_{ik}$$ by subtracting off the control group mean and dividing by the control group standard deviation, as follows, where $$i$$ indexes individuals and $$k$$ indexes outcomes:

$\tilde{y}_{ik}= \frac{y_{ik}- \bar{y}_k^{Z=0}}{\sigma_{k}^{y,Z=0}}$

1. Sum the $$z$$-scores, $$\sum_{i=1}^K \tilde{y}_{ik}$$ (optionally) divide by $$K$$ to generate the index.

2. Optional: It may be desirable to normalize the final index by the control group mean and standard deviation.

In the presence of missing outcomes, one of two approaches could be employed:

1. Imputation: Kling, Liebman, and Katz advocate a imputation approach for missing values on individual outcomes. Specifically, prior to constructing the index, compute the mean of each outcome variable for each experimental group, $$\bar{y}_{ik}^{Z=1}$$ and $$\bar{y}_{ik}^{Z=0}$$ using the above notation. Then, impute the mean corresponding to a unit’s assignment status (treatment or control) prior to constructing the index.
2. “Greedy” Indexing: Instead of imputing values of missing outcome variables ex-ante as in method 1, calculate the $$z$$-scores as above. Where there are missing values for the “raw” outcome variables, there will be missing $$z$$-scores. For each unit, sum the non-missing $$z$$-scores and then divide by the number of non-missing outcomes. Hence, instead of dividing $$\sum_{i=1}^K \tilde{y}_{ik}$$ by $$K$$ as above, we calculate $$K_{i}$$, the number of non-missing outcomes, for each unit.

### Inverse Covariance Weighted Index

Anderson (2008) provides a similar approach that constructs an index that employs inverse covariance weighting. This weighting scheme improves efficiency relative to the mean effects index above by affording less weight to highly correlated outcomes. The Anderson index can be constructed through the following procedure:

1. If necessary, reorient some outcomes so that beneficial effects are consistently scored higher across all outcomes.
2. Calculate a $$z$$-score, $$\tilde{y}_{ik}$$ by subtracting off the control group mean and dividing by the control group standard deviation, as follows, where $$i$$ indexes individuals and $$k$$ indexes outcomes:

$\tilde{y}_{ik}= \frac{y_{ik}- \bar{y}_k^{Z=0}}{\sigma_{k}^{y,Z=0}}$

1. Construct and invert the (variance)-covariance matrix of the resultant matrix of $$z$$-scores calculated in step 2. Call this $$k \times k$$ inverted (variance)-covariance matrix $$\hat{\boldsymbol{\Sigma}}^{-1}$$.

2. The weighted indexed outcome, $$\bar{s}_i$$ can be estimated via the following procedure, where $$\textbf{1}$$ is a $$k \times 1$$ vector of ones and $$\textbf{y}_{ik}$$ is the $$n \times k$$ matrix of $$z$$-scores calculated in step 2.

$\bar{s}_i = (\textbf{1}^T \hat{\boldsymbol{\Sigma}}^{-1} \textbf{1})^{-1}(\textbf{1}^T \hat{\boldsymbol{\Sigma}}^{-1} \textbf{y}_{ik})$

1. Optional: As above, it may be desirable to normalize the final index by the control group mean and standard deviation.

As with the mean effects index, this varible $$\bar{s}_i$$ the serves as the dependent variable in your analysis. One potential drawback to the inverse covariance weighting index is that there is no guarantee that elements in the inverted covariance matrix ($$\boldsymbol{\Sigma}^{-1}$$) are positive. As such, it is possible to generate negative weights using this indexing method. Given that outcomes are oriented in the same direction, a negative weight effectively reverses the direction of the effect on negatively-weighted outcomes in the construction of the index.

The following functions implement both the mean effects and inverse covariance weighted index methods and evaluate both functions on a DGP with 50 outcome measures:

stopifnot(require(mvtnorm))
stopifnot(require(dplyr))
stopifnot(require(randomizr))
stopifnot(require(ggplot2))

set.seed(1234)

calculate_mean_effects_index <- function(Z, outcome_mat, to_reorient, reorient = FALSE, greedy = TRUE,
impute = FALSE){
if(length(Z) != nrow(outcome_mat)) stop("Error: Treatment assignment, outcome matrix require same n!")
if(impute == TRUE){
R <- 1 * is.na(outcome_mat)
means_for_imputation <- rbind(apply(outcome_mat[Z==0,], MAR = 2, FUN = mean, na.rm = T),
apply(outcome_mat[Z==1,], MAR = 2, FUN = mean, na.rm = T))
to_impute <- R * means_for_imputation[Z+1,]
outcome_mat[is.na(outcome_mat)] <- 0
outcome_mat <- outcome_mat + to_impute
}
c_mean <- apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = mean, na.rm = T)
c_sd <- apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = sd, na.rm = T)
z_score <- t(t(sweep(outcome_mat, 2, c_mean))/ c_sd)
index_numerator <- rowSums(z_score)
if(greedy == TRUE){
n_outcomes <- rowSums(!is.na(z_score))
}
else if(greedy == FALSE){
n_outcomes <- ncol(outcome_mat)
}
index <- index_numerator/n_outcomes
index <-  (index - mean(index[Z==0], na.rm =T))/sd(index[Z==0], na.rm =T)
return(index)
}

calculate_inverse_covariance_weighted_index <- function(Z, outcome_mat, to_reorient, reorient = FALSE){
if(length(Z) != nrow(outcome_mat)) stop("Error: Treatment assignment, outcome matrix require same n!")
if(reorient == TRUE){
outcome_mat[, c(to_reorient)] <- -outcome_mat[, c(to_reorient)]
}
c_mean <- apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = mean, na.rm = T)
c_sd <- apply(X = outcome_mat[Z==0,], MARGIN = 2, FUN = sd, na.rm = T)
z_score <- t(t(sweep(outcome_mat, 2, c_mean))/ c_sd)
Sigma_hat <- solve(cov(z_score, y = z_score, use = "complete.obs"))
one_vec <- as.vector(rep(1, ncol(outcome_mat)))
if(sum(is.na(outcome_mat))>0){
z_score[is.na(z_score)] <- 0
}
w_ij <- t(solve(t(one_vec) %*% Sigma_hat %*% one_vec) %*% (t(one_vec) %*% Sigma_hat))
if(sum(w_ij < 0) > 0){warning('Warning, at least one weight is negative!')}
s_ij <- t(solve(t(one_vec) %*% Sigma_hat %*% one_vec) %*% (t(one_vec) %*% Sigma_hat %*% t(z_score)))
index <- (s_ij - mean(s_ij[Z==0], na.rm = T))/sd(s_ij[Z==0], na.rm = T)
return(s_ij)
}

We can see how these indices perform in a setting with $$k = 5$$ outcome variables.

[Click to show code]
# A DGP with K outcome variables
# Untreated potential outcomes drawn from multivariate normal distribution
K <- 5
r <- runif(n = K, min = -.9, max = .9)
sigma <- outer(r, r, FUN = "*")
diag(sigma) <- 1
mat <- rmvnorm(n = 200, mean = rep(0, K), sigma = sigma)

# Treatment assignment
Z <- complete_ra(200)

# Created observed potential outcomes
# Assume that ATEs are all oriented in the same direction for the time being
ATEs <- rnorm(K, mean = .25, sd = 1)

for(i in 1:K){
mat[,i] <- mat[,i] + rnorm(n = 200, mean = Z * ATEs[i], sd = 1)
}

mean_effects_index <- calculate_mean_effects_index(Z = Z, outcome_mat = mat, reorient = F)
inv_cov_weighted_index <- calculate_inverse_covariance_weighted_index(Z = Z, outcome_mat = mat,reorient = F)

First, we can examine the properties of the indices alongside our five outcome variables by looking at the covariance matrix.

[Click to show code]
knitr::kable(cov(data.frame(mat, mean_effects_index, inv_cov_weighted_index)), digits = 3)
X1 X2 X3 X4 X5 mean_effects_index inv_cov_weighted_index
X1 1.480 0.147 -0.080 -0.285 -0.244 0.410 0.276
X2 0.147 3.047 0.446 -0.090 0.117 1.295 0.292
X3 -0.080 0.446 2.123 -0.260 0.211 0.806 0.317
X4 -0.285 -0.090 -0.260 2.275 0.360 0.656 0.309
X5 -0.244 0.117 0.211 0.360 1.995 0.781 0.324
mean_effects_index 0.410 1.295 0.806 0.656 0.781 1.352 0.520
inv_cov_weighted_index 0.276 0.292 0.317 0.309 0.324 0.520 0.230

We can also plot the two indices to show their similarities (or differences). Note that with the final normalization included in the functions above, both indices are on the same scale.

[Click to show code]
data.frame(Z, mean_effects_index, inv_cov_weighted_index) %>%
mutate(Group = ifelse(Z==1, "Treatment", "Control")) %>%
ggplot(aes(x = mean_effects_index, y = inv_cov_weighted_index, colour = Group)) +
geom_point() + theme_bw() + xlab("Mean Effects Index") + ylab("Inverse Covariance Weighted Index")