1 What clustering is

Cluster randomized1 experiments allocate treatments to groups, but measure outcomes at the level of the individuals that compose the groups. It is this divergence between the level at which the intervention is assigned and the level at which outcomes are defined that classifies an experiment as cluster randomized.

Consider a study that randomly assigns villages to receive different development programs, where the well-being of households in the village is the outcome of interest. This is a clustered design because, while the treatment is assigned to the village as a whole, we are interested in outcomes defined at the household level. Or consider a study that randomly assigns certain households to receive different get-out-the-vote messages, where we are interested in the voting behavior of individuals. Because the unit of assignment for the message is the household, but the outcome is defined as individual behavior, this study is cluster randomized.

Now consider a study in which villages are selected at random, and 10 people from each village are assigned to treatment or control, and we measure the well-being of those individuals. In this case, the study is not cluster randomized, because the level at which treatment is assigned and the level at which outcomes are defined is the same. Suppose that a study randomly assigned villages to different development programs and then measured social cohesion in the village. Though it contains the same randomization procedure as our first example, it is not clustered because we are interested here in village-level outcomes: the level of treatment assignment and of outcome measurement is the same.

Clustering matters for two main reasons. On the one hand, clustering reduces the amount of information in an experiment because it restricts the number of ways that the treatment and control groups can be composed, relative to randomization at the individual level. If this fact is not taken into account, we often underestimate the variance in our estimator, leading to over-confidence in our the results of the study. On the other hand, clustering raises the question of how to combine information from different parts of the same experiment into one quantity. Especially when clusters are not of equal sizes and the potential outcomes of units within them are very different, conventional estimators will systematically produce the wrong answer due to bias. However, several approaches at the design and analysis phases can overcome the challenges posed by cluster randomized designs.

2 Why clustering can matter I: information reduction

We commonly think of the information contained in studies in terms of the sample size and the characteristics of the units within the sample. Yet two studies with exactly the same sample size and the same participants could in theory contain very different amounts of information depending on whether units are clustered. This will greatly affect the precision of the inferences we make based on the studies.

Imagine an impact evaluation with 10 people, where 5 are assigned to the treatment group and 5 to control. In one version of the experiment, the treatment is assigned to individuals - it is not cluster randomized. In another version of the experiment, the 5 individuals with black hair and the 5 individuals with some other color of hair are assigned to treatment as a group. This design has two clusters: ‘black hair’ and ‘other color’.

A simple application of the n choose k rule shows why this matters. In the first version, the randomization procedure allows for
252 different combinations of people as the treatment and control groups. However, in the second version, the randomization procedure assigns all the black-haired subjects either to treatment or to control, and thus only ever produces two ways of combining people to estimate an effect.

Throughout this guide, we will illustrate points using examples written in R code. You can copy and paste this into your own R program to see how it works.

[Click to show code]
# Define the sample average treatment effect (SATE)
treatment_effect     <- 1
# Define the individual ids (i)
person               <- 1:10
# Define the cluster indicator (j)
hair_color           <- c(rep("black",5),rep("brown",5))
# Define the control outcome (Y0)
outcome_if_untreated <- rnorm(n = 10)
# Define the treatment outcome (Y1)
outcome_if_treated   <- outcome_if_untreated + treatment_effect

# Version 1 - Not cluster randomized
# Generate all possible non-clustered assignments of treatment (Z)
non_clustered_assignments <- combn(x = unique(person),m = 5)
# Estimate the treatment effect
treatment_effects_V1 <-
          X = non_clustered_assignments,
          MARGIN = 2,
          FUN = function(assignment) {
               treated_outcomes   <- outcome_if_treated[person %in% assignment]
               untreated_outcomes <- outcome_if_untreated[!person %in% assignment]
               mean(treated_outcomes) - mean(untreated_outcomes)
# Estimate the true standard error
standard_error_V1 <- sd(treatment_effects_V1)
# Plot the histogram of all possible estimates of the treatment effect
hist(treatment_effects_V1,xlim = c(-1,2.5),breaks = 20)

[Click to show code]
# Version 2 - Cluster randomized
# Generate all possible assignments of treatment when clustering by hair color (Z)
clustered_assignments     <- combn(x = unique(hair_color),m = 1)
# Estimate the treatment effect
treatment_effects_V2 <-
          X = clustered_assignments,
          FUN = function(assignment) {
               treated_outcomes   <- outcome_if_treated[person %in% person[hair_color==assignment]]
               untreated_outcomes <- outcome_if_untreated[person %in% person[!hair_color==assignment]]
               mean(treated_outcomes) - mean(untreated_outcomes)
# Estimate the true standard error
standard_error_V2 <- sd(treatment_effects_V2)
# Plot the histogram of all possible estimates of the treatment effect
hist(treatment_effects_V2,xlim = c(-1,2.5),breaks = 20)

As the histograms show, clustering provides a much ‘coarser’ view of what the treatment effect might be. Irrespective of the number of times one randomizes the treatment and of the number of subjects one has, in a clustered randomization procedure the number of possible estimates of the treatment effect will be strictly determined by the number of clusters assigned to the different treatment conditions. This has important implications for the standard error of the estimator.

[Click to show code]
# Compare the standard errors
standard_error_V1 standard_error_V2
0.52 1.13

While the sampling distribution for the non-clustered estimate of the treatment effect has a standard error of about .52, that of the clustered estimate is more than twice as large, at 1.13. Recall that the data going into both studies is identical, the only difference between the studies resides in the way that the treatment assignment mechanism reveals the information.

Related to this issue of information is the question of how much units in our study vary within and between clusters. Two cluster randomized studies with \(J=10\) villages and \(n_j=100\) people per village may have different information about the treatment effect on individuals if, in one study, differences between villages are much greater than the differences in outcomes within them. If, say, all of the individuals in any village acted exactly the same but different villages showed different outcomes, then we would have on the order of 10 pieces of information: all of the information about causal effects in that study would be at the village level. Alternatively, if the individuals within a village acted more or less independently of each other, then we would have on the order of 10 \(\times\) 100=1000 pieces of information.

We can formalize the idea that the highly dependent clusters provide less information than the highly independent clusters with the intracluster correlation coefficient. For a given variable, \(y\), units \(i\) and clusters \(j\), we can write the intracluster correlation coefficient as follows:

\[ \text{ICC} \equiv \frac{\text{variance between clusters in } y}{\text{total variance in } y} \equiv \frac{\sigma_j^2}{\sigma_j^2 + \sigma_i^2} \]

where \(\sigma_i^2\) is the variance between units in the population and \(\sigma_j^2\) is the variance between outcomes defined at the cluster level. Kish (1965) uses this description of dependence to define his idea of the “effective N” of a study (in the sample survey context, where samples may be clustered):

\[\text{effective N}=\frac{N}{1+(n_j -1)\text{ICC}}=\frac{Jn}{1+(n-1)\text{ICC}},\]

where the second term follows if all of the clusters are the same size (\(n_1 \ldots n_J \equiv n\)).

If 200 observations arose from 10 clusters with 20 individuals within each cluster, where ICC = .5, such that 50% of the variation could be attributed to cluster-to-cluster differences (and not to differences within a cluster), Kish’s formula would suggest that we have an effective sample size of about 19 observations, instead of 200.

As the foregoing discussion suggests, clustering reduces information most when it a) greatly restricts the number of ways an effect can be estimated, and b) produces units whose outcomes are strongly related to the cluster they are a member of (i.e. when it increases the ICC).

3 What to do about information reduction

In order to characterize our uncertainty about treatment effects, we typically want to calculate a standard error: an estimate of how much the treatment effect would have varied, were we able to repeat the experiment a very large number of times and observe units alternatively in their treated and untreated states.

However, we are never able to observe the true standard error of an estimator, and therefore must use statistical procedures to infer this unknown quantity. Conventional methods for calculating standard errors do not take into account clustering, which, as we noted above, can strongly increase the amount estimates vary from one repetition of the experiment to another. Thus, in order to avoid over-confidence in experimental findings, it is important to take clustering into account.

In this section we limit our attention to so-called ‘design-based’ approaches to calculating the standard. In the design-based approach we simulate repetitions of the experiment to derive and check ways of characterizing the variance of the estimate of the treatment effect, accounting for clustered randomization. We contrast these with ‘model-based’ approaches further on in the guide. In the model-based approach we state that the outcomes were generated according to a probability model and that the cluster-level relationships also follow a probability model.

To begin, we will create a function which simulates a cluster randomized experiment with fixed intracluster correlation, and use it to simulate some data from a simple cluster-randomized design.

[Click to show code]
make_clustered_data <- function(J = 10, n = 100, treatment_effect = .25, ICC = .1){
     ## Inspired by Mathieu et al, 2012, Journal of Applied Psychology
     if (J %% 2 != 0 | n %% 2 !=0) {
          stop(paste("Number of clusters (J) and size of clusters (n) must be even."))
     Y0_j         <- rnorm(J,0,sd = (1 + treatment_effect) ^ 2 * sqrt(ICC))
     fake_data    <- expand.grid(i = 1:n,j = 1:J)
     fake_data$Y0 <- rnorm(n * J,0,sd = (1 + treatment_effect) ^ 2 * sqrt(1 - ICC)) + Y0_j[fake_data$j]
     fake_data$Y1 <- with(fake_data,mean(Y0) + treatment_effect + (Y0 - mean(Y0)) * (2 / 3))
     fake_data$Z  <- ifelse(fake_data$j %in% sample(1:J,J / 2) == TRUE, 1, 0)
     fake_data$Y  <- with(fake_data, Z * Y1 + (1 - Z) * Y0)

pretend_data <- make_clustered_data(J = 10,n = 100,treatment_effect = .25,ICC = .1)

Because we have created the data ourselves, we can calculate the true standard error of our estimator. We firstly generate the true sampling distribution by simulating every possible permutation of the treatment and calculating the estimate each time. The standard deviation of this distribution is the standard error of the estimator.

[Click to show code]
# Define the number of clusters
J <- length(unique(pretend_data$j))
# Generate all possible ways of combining clusters into a treatment group
all_treatment_groups <- with(pretend_data,combn(x = 1:J,m = J/2))
# Create a function for estimating the effect
clustered_ATE <- function(j,Y1,Y0,treated_clusters) {
     Z_sim    <- (j %in% treated_clusters)*1
     Y        <- Z_sim * Y1 + (1 - Z_sim) * Y0
     estimate <- mean(Y[Z_sim == 1]) - mean(Y[Z_sim == 0])

# Apply the function through all possible treatment assignments
cluster_results <- apply(X = all_treatment_groups,MARGIN = 2,
                         FUN = clustered_ATE,
                         j  = pretend_data$j,Y1 = pretend_data$Y1,
                         Y0 = pretend_data$Y0)

true_SE <- sd(cluster_results)

## [1] 0.2567029

This gives a standard error of 0.26. We can compare the true standard error to two other kinds of standard error commonly employed. The first ignores clustering and assumes that the sampling distribution is identically and independently distributed according to a normal distribution. We will refer to this as the I.I.D. standard error. To take clustering into account, we can use the following formula for the standard error:

\[\text{Var}_\text{clustered}(\hat{\tau})=\frac{\sigma^2}{\sum_{j=1}^J \sum_{i=1}^n_j (Z_{ij}-\bar{Z})^2} (1-(n-1)\rho) \]

where \(\sigma^2=\sum_{j=1}^J \sum_{i=1}^n_j (Y_{ij} - \bar{Y}_{ij})^2\) (following Arceneaux and Nickerson (2009) ). This adjustment to the IID standard error is commonly known as the “Robust Clustered Standard Error” or RCSE.

[Click to show code]
ATE_estimate <- lm(Y ~ Z,data = pretend_data)

IID_SE <- function(model) {

RCSE <- function(model, cluster,return_cov = FALSE){
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- as.matrix(coeftest(model, rcse.cov))

IID_SE_estimate <- IID_SE(model = ATE_estimate)

RCSE_estimate   <- RCSE(model = ATE_estimate,cluster = pretend_data$j)

     true_SE         = true_SE,
     IID_SE_estimate = IID_SE_estimate,
     RCSE_estimate   = RCSE_estimate["Z", "Std. Error"]),
true_SE IID_SE_estimate RCSE_estimate
0.26 0.08 0.27

When we ignore the clustered-assignment, the standard error is too small: we are over-confident about the amount of information provided to us by the experiment. The RCSE is slightly more conservative than the true standard error in this instance, but is very close. The discrepancy is likely because the RCSE is not a good approximation of the true standard error when the number of clusters is as small as it is here. To illustrate the point further, we can compare a simulation of the true standard error generated through random permutations of the treatment to the IID and RC standard errors.

[Click to show code]
compare_SEs <- function(data) {
     simulated_SE <- sd(replicate(
               j = data$j,
               Y1 = data$Y1,
               Y0 = data$Y0,
               treated_clusters = sample(unique(data$j),length(unique(data$j))/2)
     ATE_estimate <- lm(Y ~ Z,data)
     IID_SE_estimate <- IID_SE(model = ATE_estimate)
     RCSE_estimate <- RCSE(model = ATE_estimate,cluster = data$j)["Z", "Std. Error"]
          simulated_SE = simulated_SE,
          IID_SE = IID_SE_estimate,
          RCSE = RCSE_estimate

J_4_clusters    <- make_clustered_data(J = 4)
J_10_clusters   <- make_clustered_data(J = 10)
J_30_clusters   <- make_clustered_data(J = 30)
J_100_clusters  <- make_clustered_data(J = 100)
J_1000_clusters <- make_clustered_data(J = 1000)


  c(J = 4,compare_SEs(J_4_clusters)),
  c(J = 30,compare_SEs(J_30_clusters)),
  c(J = 100,compare_SEs(J_100_clusters)),
  c(J = 1000,compare_SEs(J_1000_clusters))
J simulated_SE IID_SE RCSE
4 0.267 0.124 0.084
30 0.168 0.049 0.177
100 0.095 0.027 0.098
1000 0.027 0.008 0.028

As these simple examples illustrate, the clustered estimate of the standard error (RCSE) gets closer to the truth (the simulated standard error) as the number of clusters increases. Meanwhile, the standard error ignoring clustering (assuming IID) tends to be smaller than either of the other standard errors. The smaller the estimate of the standard error is, the more precise the estimates seem to us, and the more likely we will find results that appear ‘statistically significant’. This is problematic: in this case, the IID standard error is leads us to be over confident in our results because it ignores intra-cluster correlation, the extent to which differences between units can be attributed to the cluster they are a member of. If we estimate standard errors using techniques that understate our uncertainty, we are more likely to falsely reject null hypotheses when we should not.

Another way to approach the problems that clustering introduces into the calculation of standard errors is to analyze the data at the level of the cluster. In this approach, we take averages or sums of the outcomes within the clusters, and then treat the study as though it only took place at the level of the cluster. Hansen and Bowers (2008) show that we can characterize the distribution of the difference of means using what we know about the distribution of the sum of the outcome in the treatment group, which varies from one assignment of treatment to another.

[Click to show code]
# Aggregate the unit-level data to the cluster level
# Sum outcome to cluster level
Yj <- tapply(pretend_data$Y,pretend_data$j,sum)
# Aggregate assignment indicator to cluster level 
Zj <- tapply(pretend_data$Z,pretend_data$j,unique)
# Calculate unique cluster size 
n_j <- unique(as.vector(table(pretend_data$j)))
# Calculate total sample size (our units are now clusters)
N <- length(Zj)
# Generate cluster id
j <- 1:N
# Calculate number of clusters treated
J_treated <- sum(Zj) 

# Make a function for the cluster-level difference in means estimator (See Hansen & Bowers 2008)
cluster_difference <- function(Yj,Zj,n_j,J_treated,N){
     ones <- rep(1, length(Zj))
     ATE_estimate <- crossprod(Zj,Yj)*(N/(n_j*J_treated*(N-J_treated))) - 

# Given equal sized clusters and no blocking, this is identical to the
# unit-level difference in means

ATEs <- colMeans(data.frame(cluster_level_ATE = 
          unit_level_ATE = 

knitr::kable(data.frame(ATEs),align = "c")
cluster_level_ATE 0.5624785
unit_level_ATE 0.5624785

In order to characterize uncertainty about the cluster-level ATE, we can exploit the fact that the only random element of the estimator is now the cross-product between the cluster-level assignment vector and the cluster-level outcome, \(\mathbf{Z}^\top\mathbf{Y}\), scaled by some constant. We can estimate the variance of this random component through permutation of the assignment vector or through an approximation of the variance, assuming that the sampling distribution follows a normal distribution.

[Click to show code]
# Approximating variance using normality assumptions
normal_sampling_variance <-
# Approximating variance using permutations
sampling_distribution <- replicate(10000,cluster_difference(Yj,sample(Zj),n_j,J_treated,N))

ses <- data.frame(sampling_variance = c(sqrt(normal_sampling_variance),sd(sampling_distribution)),
                  p_values = c(2*(1-pnorm(abs(ATEs[1])/sqrt(normal_sampling_variance),mean=0)),

rownames(ses) <- c("Assuming Normality","Permutations")

sampling_variance p_values
Assuming Normality 0.3303386 0.088618
Permutations 0.3338726 0.137000

This cluster-level approach has the advantage of correctly characterizing uncertainty about effects when randomization is clustered, without having to use the RCSE standard errors for the unit-level estimates, which are overly permissive for small N. Indeed, the false positive rate of tests based on RCSE standard errors tend to be incorrect when the number of clusters is small, leading to over-confidence. As we shall see below, however, when the number of clusters is very small (\(J=4\)) the cluster-level approach is overly conservative, rejecting the null with a probability of 1. Another drawback of the cluster-level approach is that it does not allow for the estimation of unit-level quantities of interest, such as heterogeneous treatment effects.

4 Why clustering can matter II: different cluster sizes

When clusters are of different sizes, this can pose a unique class of problems related to the estimation of the treatment effect. Especially when the size of the cluster is in some way related to the potential outcomes of the units within it, many conventional estimators of the sample average treatment effect (SATE) can be biased.

To fix ideas, imagine an intervention targeted at firms of different sizes, which seeks to increase worker productivity. Due to economies of scale, the productivity of employees in big firms is increased much more proportional to that of employees in smaller firms. Imagine that the experiment includes 20 firms ranging in size from one-person entrepreneurs to large outfits with over 500 employees. Half of the firms are assigned to the treatment, and the other half are assigned to control. Outcomes are defined at the employee level.

[Click to show code]
# Number of firms
J <- 20

# Employees per firm
n_j <- rep(2^(0:(J/2-1)),rep(2,J/2))

# Total number of employees
N <- sum(n_j)
# 2046 

# Unique employee (unit) ID
i <- 1:N

# Unique firm (cluster) ID
j <- rep(1:length(n_j),n_j)

# Firm specific treatment effects
cluster_ATE <- n_j^2/10000

# Untreated productivity
Y0 <- rnorm(N)

# Treated productivity
Y1 <- Y0 + cluster_ATE[j]

# True sample average treatment effect 
(true_SATE <- mean(Y1-Y0))
## [1] 14.9943
[Click to show code]
# Correlation between firm size and effect size 
## [1] 0.961843

As we see, there is high correlation in the treatment effect and cluster size. Now let us simulate 1000 analyses of this experiment, permuting the treatment assignment vector each time, and taking the unweighted difference in means as an estimate of the sample average treatment effect.

[Click to show code]
# Unweighted SATE
SATE_estimate_no_weights <- NA
for(i in 1:1000){
     # Clustered random assignment of half of the firms
     Z <- (j %in% sample(1:J,J/2))*1
     # Reveal outcomes
     Y <- Z*Y1 + (1-Z)*Y0
     # Estimate SATE
     SATE_estimate_no_weights[i] <- mean(Y[Z==1])-mean(Y[Z==0])

# Generate histogram of estimated effects
hist(SATE_estimate_no_weights,xlim = c(true_SATE-2,true_SATE+2),breaks = 100)
# Add the expected estimate of the SATE using this estimator
# And add the true SATE