- Abstract
- 1 What Power Is
- 2 Why You Need It
- 3 The Three Ingredients of Statistical Power
- 4 Key Formulas for Calculating Power
- 5 When to Believe Your Power Analysis
- 6 How to Use Simulation to Estimate Power
- 7 How to Change your Design to Improve Your Power
- 8 Power Analysis for Multiple Treatments
- 9 How to Think About Power for Clustered Designs
- 10 Good Power Analysis Makes Preregistration Easy

This guide^{1} will help you assess and improve the power of your experiments. We focus on the big ideas and provide examples and tools that you can use in R and Google Spreadsheets.

Power is the ability to distinguish signal from noise.

The signal that we are interested in is the impact of a treatment on some outcome. Does education increase incomes? Do public health campaigns decrease the incidence of disease? Can international monitoring decrease government corruption?

The noise that we are concerned about comes from the complexity of the world. Outcomes vary across people and places for myriad reasons. In statistical terms, you can think of this variation as the standard deviation of the outcome variable. For example, suppose an experiment uses rates of a rare disease as an outcome. The total number of affected people isn't likely to fluctuate wildly day to day, meaning that the background noise in this environment will be low. When noise is low, experiments can detect even small changes in average outcomes. A treatment that decreased the incidence of the disease by one percentage point would be easily detected, because the baseline rates are so constant.

Now suppose an experiment instead used subjects' income as an outcome variable. Incomes can vary widely: in some places, it is not uncommon for people to have neighbors who earn two, ten, or one hundred times their daily wages. When noise is high, experiments have more trouble. A treatment that increased workers' incomes by 1% would be difficult to detect, because incomes differ by so much in the first place.

A major concern before embarking on an experiment is the danger of a **false negative**. Suppose the treatment really does have a causal impact on outcomes. It would be a shame to go to all the trouble and expense of randomizing the treatment, collecting data on both treatment and control groups, and analyzing the results, just to have the effect be overwhelmed by background noise.

If our experiments are highly powered, we can be confident that if there truly is a treatment effect, we'll be able to see it.

Experimenters often guard against false positives with statistical significance tests. After an experiment has been run, we are concerned about falsely concluding that there is an effect when there really isn't.

Power analysis asks the opposite question: supposing there truly is a treatment effect and you were to run your experiment a huge number of times, how often will you get a statistically significant result?

Answering this question requires informed guesswork. You'll have to supply guesses as to how big your treatment effect can reasonably be, how many subjects will answer your survey, and how many subjects your organization can realistically afford to treat.

Where do these guesses come from? Before an experiment is run, a wealth of baseline data is often available. How old/rich/educated are subjects like yours going to be? How big was the biggest treatment effect ever established for your dependent variable? With power analysis, you can see how sensitive the probability of getting significant results is to changes in your assumptions.

Many disciplines have settled on a target power value of 0.80. Researchers will tweak their designs and assumptions until they can be confident that their experiments will return statistically significant results 80% of the time. While this convention is a useful benchmark, be sure that you are comfortable with the risks associated with an 80% expected success rate.

A note of caution: power matters a lot. Negative results from underpowered studies can be hard to interpret: Is there really no effect? Or is the study just not able to figure it out? Positive results from an underpowered study can also be misleading: conditional upon being statistically significant, an estimate from an underpowered study probably overestimates treatment effects. Underpowered studies are sometimes based on overly optimistic assumptions; a convincing power analysis makes these assumptions explicit and should protect you from implementing designs that realistically have no chance of answering the questions you want to answer.

There are three big categories of things that determine how highly powered your experiment will be. The first two (the strength of the treatment and background noise) are things that you can't really control: these are the realities of your experimental environment. The last, the experimental design, is the only thing that you have power over, so use it!

- Strength of the treatment. As the strength of your treatment increases, the power of your experiment increases. This makes sense: if your treatment were giving every subject $1,000,000, there is little doubt that we could discern differences in behavior between the treatment and control groups. Many times, however, we are not in control of the strength of our treatments. For example, researchers involved in program evaluation don't get to decide what the treatment should be; they are supposed to evaluate the program as it is.
- Background noise. As the background noise of your outcome variables increases, the power of your experiment decreases. To the extent that it is possible, try to select outcome variables that have low variability. In practical terms, this means comparing the standard deviation of the outcome variable to the expected treatment effect size: there is no magic ratio that you should be shooting for, but the closer the two are, the better off your experiment will be. By and large, researchers are not in control of background noise, and picking lower-noise outcome variables is easier said than done. Furthermore, many outcomes we would like to study are inherently quite variable. From this perspective, background noise is something you just have to deal with as best you can.
- Experimental Design. Traditional power analysis focuses on one (albeit very important) element of experimental design: the number of subjects in each experimental group. Put simply, a larger number of subjects increases power. However, there are other elements of the experimental design that can increase power: how is the randomization conducted? Will other factors be statistically controlled for? How many treatment groups will there be, and can they be combined in some analyses?

Statisticians have derived formulas for calculating the power of many experimental designs. They can be useful as a back-of-the-envelope calculation of how large a sample you'll need. Be careful, though, because the assumptions behind the formulas can sometimes be obscure and, worse, they can be wrong.

Here is a common formula used to calculate power.^{2}

\[\beta = \Phi \left(\frac{|\mu_t-\mu_c|\sqrt{N}}{2\sigma}-\Phi^{-1} \left(1-\frac{\alpha}{2}\right) \right)\]

- \(\beta\) is our measure of power. Because it's the probability of getting a statistically significant result, \(\beta\) will be a number between 0 and 1.
- \(\Phi\) is the CDF of the standard normal distribution, and \(\Phi^{-1}\) is its inverse. Everything else in this formula, we have to plug in:
- \(\mu_t\) is the average outcome in the treatment group. Suppose it's 65.
- \(\mu_c\) is the average outcome in the control group. Suppose it's 60.
- Together, assumptions about \(\mu_t\) and \(\mu_c\) define our assumption about the size of the treatment effect: 65 − 60 = 5.
- \(\sigma\) is the standard deviation of outcomes. This is how we make assumptions about how noisy our experiment will be; one of the assumptions we're making is that \(\sigma\) is the same for both the treatment and control groups. Suppose \(\sigma = 20\).
- \(\alpha\) is our significance level; the convention in many disciplines is that \(\alpha\) should be equal to 0.05.
- \(N\) is the total number of subjects. This is the only variable that is under the direct control of the researcher. This formula assumes that every subject has a 50/50 chance of being in the control group. Suppose that \(N = 500\).

Working through the formula, we find that under this set of assumptions, \(\beta \approx 0.80\), meaning that we have an 80% chance of recovering a statistically significant result with this design. Click here for a Google Spreadsheet that includes this formula. You can copy these formulas directly into Excel. If you're comfortable in R, here is code that will accomplish the same calculation.
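Plugging the assumed values into the formula step by step:

\[\beta = \Phi \left(\frac{|65-60|\sqrt{500}}{2 \times 20}-\Phi^{-1}(0.975) \right) = \Phi(2.80 - 1.96) = \Phi(0.84) \approx 0.80\]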

```
power_calculator <- function(mu_t, mu_c, sigma, alpha = 0.05, N){
  lowertail <- (abs(mu_t - mu_c) * sqrt(N)) / (2 * sigma)  # standardized effect size
  uppertail <- -1 * lowertail
  # Two-sided power: probability of a significant result in either tail
  beta <- pnorm(lowertail - qnorm(1 - alpha/2), lower.tail = TRUE) +
    1 - pnorm(uppertail - qnorm(1 - alpha/2), lower.tail = FALSE)
  return(beta)
}
```

From some perspectives, the whole idea of power analysis makes no sense: you want to figure out the size of some treatment effect, but first you need to do a power analysis, which requires that you already know your treatment effect, and a lot more besides.

So in most power analyses you are in fact seeing what happens with numbers that are to some extent made up. The good news is that it is easy to find out how much your conclusions depend on your assumptions: simply vary your assumptions and see how the conclusions on power vary.

This is most easily seen by thinking about how power varies with the number of subjects. A power analysis that looks at power for different study sizes simply plugs in a range of values for \(N\) and sees how \(\beta\) changes.

Using the formula in section 4, you can see how sensitive power is to all of the assumptions: power will be higher if you assume the treatment effect will be larger, if you are willing to accept a higher alpha level, or if you assume your outcome measure will be less noisy.^{3}
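To make this concrete, here is a sketch in R that sweeps the sample size using the section 4 formula (the function body is repeated, in a simplified but equivalent form, so the snippet runs on its own):

```
power_calculator <- function(mu_t, mu_c, sigma, alpha = 0.05, N){
  lowertail <- (abs(mu_t - mu_c) * sqrt(N)) / (2 * sigma)
  # Equivalent to the two-tailed expression in section 4
  pnorm(lowertail - qnorm(1 - alpha/2)) + pnorm(-lowertail - qnorm(1 - alpha/2))
}

# Sweep the sample size, holding the other assumptions fixed
Ns <- seq(from = 100, to = 2000, by = 100)
analytic_power <- sapply(Ns, function(n) power_calculator(mu_t = 65, mu_c = 60, sigma = 20, N = n))
round(analytic_power, 2)

# Halving the assumed effect size shows how fragile the conclusion is
small_effect <- sapply(Ns, function(n) power_calculator(mu_t = 62.5, mu_c = 60, sigma = 20, N = n))
round(small_effect, 2)
```

Under the original assumptions, power crosses 0.80 at roughly N = 500; with half the effect size, reaching the same power takes roughly four times as many subjects.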

Power is a measure of how often, given assumptions, we would obtain statistically significant results if we were to conduct our experiment thousands of times. The power calculation formula takes assumptions and returns an analytic solution. However, thanks to advances in modern computing, we don't have to rely on analytic solutions for power analysis. We can tell our computers to literally run the experiment thousands of times and simply count how frequently our experiment comes up significant.

The code block below shows how to conduct this simulation in R.

```
possible.ns <- seq(from=100, to=2000, by=50)    # The sample sizes we'll be considering
powers <- rep(NA, length(possible.ns))          # Empty object to collect simulation estimates
alpha <- 0.05                                   # Standard significance level
sims <- 500                                     # Number of simulations to conduct for each N

#### Outer loop to vary the number of subjects ####
for (j in 1:length(possible.ns)){
  N <- possible.ns[j]                           # Pick the jth value for N
  Y0 <- rnorm(n=N, mean=60, sd=20)              # Control potential outcomes
  tau <- 5                                      # Hypothesized treatment effect
  Y1 <- Y0 + tau                                # Treatment potential outcomes
  significant.experiments <- rep(NA, sims)      # Empty object to count significant experiments

  #### Inner loop to conduct experiments "sims" times over for each N ####
  for (i in 1:sims){
    Z.sim <- rbinom(n=N, size=1, prob=.5)       # Do a random assignment
    Y.sim <- Y1*Z.sim + Y0*(1-Z.sim)            # Reveal outcomes according to assignment
    fit.sim <- lm(Y.sim ~ Z.sim)                # Do analysis (simple regression)
    p.value <- summary(fit.sim)$coefficients[2,4]     # Extract p-value
    significant.experiments[i] <- (p.value <= alpha)  # Determine significance according to p <= 0.05
  }
  powers[j] <- mean(significant.experiments)    # Store average success rate (power) for each N
}
powers
```

```
## [1] 0.234 0.272 0.362 0.468 0.648 0.700 0.700 0.808 0.738 0.840 0.786
## [12] 0.878 0.924 0.906 0.972 0.946 0.952 0.970 0.992 0.976 0.994 0.994
## [23] 0.994 0.990 0.990 0.992 0.998 0.994 1.000 0.998 0.998 1.000 0.998
## [34] 1.000 1.000 1.000 1.000 1.000 1.000
```

The code for this simulation and others is available here. Simulation is a far more flexible and far more intuitive way to think about power analysis. Even the smallest tweaks to an experimental design (adding a second treatment group, for example) are difficult to capture in a formula, but are relatively straightforward to include in a simulation.

In addition to counting up how often your experiments come up statistically significant, you can directly observe the distribution of p-values you're likely to get. The graph below shows that under these assumptions, you can expect to get quite a few p-values in the 0.01 range, but that 80% will be below 0.05.
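A sketch of how such a p-value distribution can be generated, using the same assumptions as the simulation above (the seed is arbitrary, and 2,000 replications are used to keep it quick):

```
set.seed(123)                                # arbitrary seed, for reproducibility
N <- 500; tau <- 5; sims <- 2000
Y0 <- rnorm(n = N, mean = 60, sd = 20)       # control potential outcomes
Y1 <- Y0 + tau                               # treatment potential outcomes
p.values <- rep(NA, sims)
for (i in 1:sims){
  Z.sim <- rbinom(n = N, size = 1, prob = 0.5)   # random assignment
  Y.sim <- Y1 * Z.sim + Y0 * (1 - Z.sim)         # revealed outcomes
  p.values[i] <- summary(lm(Y.sim ~ Z.sim))$coefficients[2, 4]
}
hist(p.values, breaks = 50)                  # the distribution of p-values
mean(p.values <= 0.05)                       # share significant at 0.05 (the power)
mean(p.values <= 0.01)                       # many results clear the stricter 0.01 bar too
```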

When it comes to statistical power, the only thing that's under your control is the design of the experiment. As we've seen above, an obvious design choice is the number of subjects to include in the experiment. The more subjects, the higher the power.

However, the number of subjects is not the only design choice that has consequences for power. There are two broad classes of design choices that are especially important in this regard.

- Choice of estimator. Are you using difference-in-means? Will you be doing some transformation, such as a logit or a probit? Will you be controlling for covariates? Will you be using some kind of robust standard error estimator? All of these choices will make a difference for the statistical significance of your results, and therefore for the power of your experiment. One easy way to think about this is to imagine what command you'll be running in R or Stata after the experiment has come back; that's your estimator!
- Randomization Protocol. What kind of randomization will you be employing? Simple randomization gives all subjects an equal probability of being in the treatment group, and then performs a (possibly weighted) coin flip for each. Complete randomization is similar, but it ensures that exactly a certain number will be assigned to treatment. Block randomization is even more powerful: it ensures that a certain number within each subgroup will be assigned to treatment. A restricted random assignment rejects some random assignments based on some set of criteria (lack of balance, perhaps). These various types of random assignment can dramatically increase the power of an experiment at no extra cost. Read up on randomization protocols here.
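To give a flavor of how the first three protocols differ in practice, here is a sketch in R for 100 subjects (the two-level blocking variable is hypothetical):

```
N <- 100

# Simple randomization: an independent coin flip per subject,
# so the number treated varies from draw to draw
Z.simple <- rbinom(n = N, size = 1, prob = 0.5)

# Complete randomization: exactly N/2 subjects are treated
Z.complete <- sample(rep(c(0, 1), each = N / 2))

# Block randomization: exactly half treated within each block
block <- rep(c("A", "B"), each = N / 2)      # hypothetical blocking variable
Z.block <- rep(NA, N)
for (b in unique(block)){
  idx <- which(block == b)
  Z.block[idx] <- sample(rep(c(0, 1), each = length(idx) / 2))
}

sum(Z.complete)                              # always exactly 50
tapply(Z.block, block, sum)                  # exactly 25 treated in each block
```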

There are too many choices to cover in this short article, but check out the Simulation for Power Analysis code page for some ways to get started. To give a flavor of the simulation approach, consider how you would conduct a power analysis if you wanted to include covariates in your analysis.

If the covariates you include as control variables are strongly related to the outcome, then you've dramatically increased the power of your experiment. Unfortunately, the extra power that comes with including control variables is very hard to capture in a compact formula. Almost none of the power formulas found in textbooks or floating around on the internet can provide guidance on what the inclusion of covariates will do for your power.

The answer is simulation.

- Suppose we're studying the effect of an educational intervention on income.
- Suppose we have good data on the relationship between two covariates and income: age and gender. In this economy, men earn more than women, and older people earn more than younger people.
- Run a regression of income on age and gender and record the coefficients, using pre-existing survey data (better yet: use baseline data from future participants in your experiment!).
- Generate fake covariate data: N total subjects, broken up by age and gender in a way that reflects your experimental subject pool.
- Generate fake control data, where the outcome is a function of age and gender according to your regression estimates.
- Hypothesize a treatment effect to generate fake treatment data.
- Run the experiment 10,000 times, and record how often, using a regression with controls, your experiment turns up significant.
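The steps above can be sketched as follows. Every number here (the coefficients, the residual noise, the covariate mix) is a hypothetical placeholder for estimates you would take from your own baseline data, and 500 simulations are used instead of 10,000 to keep the sketch fast:

```
set.seed(42)                                 # arbitrary seed
sims <- 500
N <- 500
tau <- 5                                     # hypothesized treatment effect

# Fake covariate data reflecting a hypothetical subject pool
age <- sample(18:65, size = N, replace = TRUE)
male <- rbinom(n = N, size = 1, prob = 0.5)

# Fake control outcomes as a function of the covariates
# (hypothetical coefficients; the residual sd is the noise that
# remains after adjusting for age and gender)
Y0 <- 10 + 1.5 * age + 30 * male + rnorm(n = N, mean = 0, sd = 19)
Y1 <- Y0 + tau

sig.unadj <- sig.adj <- rep(NA, sims)
for (i in 1:sims){
  Z <- rbinom(n = N, size = 1, prob = 0.5)   # random assignment
  Y <- Y1 * Z + Y0 * (1 - Z)                 # revealed outcomes
  sig.unadj[i] <- summary(lm(Y ~ Z))$coefficients[2, 4] <= 0.05
  sig.adj[i]   <- summary(lm(Y ~ Z + age + male))$coefficients[2, 4] <= 0.05
}
mean(sig.unadj)                              # power without controls
mean(sig.adj)                                # power with controls: noticeably higher
```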

Here's a graph that compares the power of an experiment that does control for background attributes to one that does not. The R-squared of the regression relating income to age and gender is pretty high (around 0.66), meaning that the covariates that we have gathered (generated) are highly predictive. For a rough comparison, sigma, the level of background noise that the unadjusted model is dealing with, is around 33. This graph shows that at any N, the covariate-adjusted model has more power: so much so that the unadjusted model would need 1,500 subjects to achieve what the covariate-adjusted model can do with 500.