- 1. What is meta-analysis?
- 2. How does a meta-analysis differ from a literature review?
- 3. What types of data are used as input for a meta-analysis?
- 4. What is the estimand in meta-analysis?
- 5. Meta-analysis and Bayes’ Rule
- 6. Fixed effects versus random effects estimation
- 7. Is it okay to summarize both experimental and observational research findings?
- 8. Publication bias as a threat to meta-analysis
- 9. Modeling inter-study heterogeneity using meta-regression
- 10. Methods for assessing the accuracy of meta-analytic results

Meta-analysis is a method for summarizing the statistical findings from a research literature. For example, if five experiments have been conducted using the same intervention and outcome measure on the same population of people with five separate estimates of an average treatment effect, one might imagine pooling these five studies together into a single dataset and analyzing them jointly. In broad strokes, in such a case, we could act as though the studies came from five blocks within a single experiment rather than five separate experiments. The benefit of such an approach would be more statistical power in the estimation of one overall average treatment effect. In essence, a meta-analysis produces a weighted average of the five studies’ results. As explained below, this method is also used to summarize research literatures that comprise a diverse array of interventions and outcomes measured in diverse settings, under the assumption that the interventions are theoretically similar and the outcome measures tap into a shared underlying trait.

Meta-analysis is often characterized as a form of systematic review insofar as it involves a specific set of data collection and analysis procedures. These procedures are spelled out in great detail by the Campbell Collaboration and by the James Lind Library. Particular attention is paid to gathering both published and unpublished studies (see below). By comparison, most conventional literature reviews cite the most noteworthy theoretical or empirical contributions but rarely attempt to be comprehensive or to summarize the findings quantitatively. Critics of conventional reviews point out the possibility that the most noteworthy or memorable studies present findings that are unrepresentative of the broader research literature and therefore may be a poor guide for policy. On the other hand, critics of meta-analysis point out that the flaws of individual studies are often lost sight of when their estimates are blended together to generate an overarching conclusion. As Uri Simonsohn once quipped, “Meta-analysis is a sausage factory that uses sausages by other factories as inputs.”^{1}

In principle, meta-analysis could be applied to the original data from each study, but in practice such data are seldom available for all relevant studies. Instead, researchers typically cull estimated treatment effects from research papers or other reports. This process presents scholars conducting a meta-analysis with an array of decisions when the research papers present results involving multiple outcomes, treatments, and estimation approaches. Often meta-analysis focuses on the “main” results, but identifying the main or primary results can be a judgment call. The reproducibility of meta-analysis hinges on careful documentation of such decisions. In addition to locating the key estimates, the meta-analyst must also track down measures of statistical uncertainty (e.g., standard errors, confidence intervals), as these statistics will be used to assign weights to each study in the averaging process. As noted below, meta-analysis tends to assign more weight to studies with less sampling variability such as studies with larger sample sizes.

Some meta-analyses are narrowly tailored to specific treatments and outcomes. For example, a vast literature dating back to the 1920s focuses on the extent to which mailings that encourage voting in fact cause people to vote. In this case, one could imagine a population parameter that represents the average causal effect of mailed voting encouragements on a population of people in a specified region over some specified time period.

Other meta-analyses are more abstract, focusing on a broad class of treatments and outcomes. For example, the literature on prejudice reduction comprises hundreds of studies on the effects of interpersonal contact between people with different racial, ethnic, religious, age, or gender backgrounds (Pettigrew and Tropp 2006).^{2} Contact ranges from a brief conversation to a year of co-habitation in a college dormitory. Outcomes also range widely from overt behaviors to self-reported feelings about in-groups and out-groups. Since these interventions and outcomes are on different scales and may refer to different concepts, the underlying population parameter is ambiguous. Researchers try to sidestep this issue by standardizing the outcomes (e.g., by dividing the outcome by the standard deviation in the control group), but there remains the problem of what to make of treatments that vary in intensity and duration. In effect, the population parameter in such cases becomes the average extent to which an ad hoc collection of interventions changes putative measures of prejudice within some location and time period.
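The standardization step mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `standardized_effect` and the outcome values are hypothetical, and the sketch assumes access to individual-level outcomes, which a meta-analyst working from published summaries often lacks.

```python
from statistics import mean, stdev

def standardized_effect(treated, control):
    """Difference in means divided by the control-group standard deviation."""
    sd_control = stdev(control)  # sample SD (n - 1 denominator)
    if sd_control == 0:
        raise ValueError("control-group outcomes have no variation")
    return (mean(treated) - mean(control)) / sd_control

# Made-up outcomes on an arbitrary feeling scale (illustrative only).
treated = [6.0, 7.0, 8.0, 7.0]
control = [5.0, 6.0, 7.0, 6.0]
effect = standardized_effect(treated, control)
```

The result is expressed in control-group standard deviations, which puts outcomes measured on different scales on a common footing. It does nothing to resolve differences in treatment intensity or duration.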

Meta-analyses can also struggle when the underlying population represented by the component studies is vague or abstract. For example, in a literature dominated by laboratory experiments conducted in the United States, the meta-analysis will implicitly assess an underlying average treatment effect in which the “average” gives disproportionate weight to American undergraduates.

Consider the simple case in which two experiments are conducted using the same treatments and outcomes. Imagine that both studies draw their subject pool from the same population. In this case, if we had the individual data for the experiments, we could pool them together into a single dataset and analyze them as though they were part of the same block-randomized experiment. But suppose that we did not have the individual data; instead, we only have the estimated ATE and estimated standard error from each study. How might we combine the two studies to form our best guess of the average treatment effect in the population from which the subjects were drawn? With some simplifying assumptions, we could apply Bayes’ Rule. Let’s assume that the sampling distribution of each experimental estimate is normal. (This is a reasonable assumption under the Central Limit Theorem, since we are using an average to estimate the average treatment effect and we assume that each experiment has at least a few dozen subjects and that the outcome distribution is not too skewed.) Since these experiments are independent of one another, Bayes’ Rule takes a simple form: take a weighted average of the two estimates, where the weights are the inverse of each study’s squared standard error (\(\hat{\sigma}_j^2\) is the squared estimated standard error for study \(j\)).

\[ \hat{ATE_{pooled}} = \frac{\frac{1}{\hat{\sigma}_1^2}}{\frac{1}{\hat{\sigma}_1^2} + \frac{1}{\hat{\sigma}_2^2}}\hat{ATE_1} + \frac{\frac{1}{\hat{\sigma}_2^2}}{\frac{1}{\hat{\sigma}_1^2} + \frac{1}{\hat{\sigma}_2^2}}\hat{ATE_2} \]

This formula is equivalent to a so-called “fixed effects” meta-analysis. It is sometimes called a “precision-weighted average,” where the term “precision” refers to the inverse of the squared standard error. In a simple two-arm, completely randomized study, the standard error of the simple estimator of the average treatment effect is a function of the sample size, the variation in the outcome, and the ratio of treated to control units. Notice, then, that the study with the smaller standard error (i.e., larger sample, less variable outcome, more equal ratio of treated to control units) receives more weight in the pooled meta-analytic result.
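As a minimal sketch, the precision-weighted average can be computed directly from each study's estimate and standard error. The function name `fixed_effect_pool` and the numbers are hypothetical. The pooled standard error shown here, the square root of one over the summed precisions, assumes independent studies estimating a common effect.

```python
import math

def fixed_effect_pool(estimates, std_errors):
    """Precision-weighted average of study estimates and its standard error."""
    precisions = [1.0 / se**2 for se in std_errors]  # precision = 1 / SE^2
    total = sum(precisions)
    pooled = sum(p * est for p, est in zip(precisions, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)  # assumes independent studies
    return pooled, pooled_se

# Hypothetical inputs: study 1 is twice as precise (in SE terms) as study 2.
estimates = [2.0, 4.0]
std_errors = [1.0, 2.0]
pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
```

With these inputs the first study carries four times the weight of the second, so the pooled estimate sits much closer to 2.0 than to 4.0.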

Most meta-analysis software^{3} presents users with a choice between fixed effects estimation and random effects estimation. Fixed effects estimation is simply a precision-weighted average.^{4} And random effects estimation is a special case of more general Bayesian meta-analysis. In either case, the studies with the smallest standard errors are accorded the most weight. Random effects estimation applies a different set of weights depending on the extent to which the estimates vary more than would be expected by chance under a fixed effects model. Therefore, the weights for the random effects estimation not only consider the variance within each study but also an estimate of the between-study variance (\(\tau^2\)).

\[ \hat{ATE_{pooled}^*} = \frac{\frac{1}{\hat{\sigma}_1^2+\tau^2}}{\frac{1}{\hat{\sigma}_1^2+\tau^2} + \frac{1}{\hat{\sigma}_2^2+\tau^2}}\hat{ATE_1} + \frac{\frac{1}{\hat{\sigma}_2^2+\tau^2}}{\frac{1}{\hat{\sigma}_1^2+\tau^2} + \frac{1}{\hat{\sigma}_2^2+\tau^2}}\hat{ATE_2} \]

The more heterogeneous the estimated effects–perhaps due to variations in experimental techniques, outcome measurement, or context–the more the resulting weighted average represents a simple average rather than a precision-weighted average. Typically, researchers start with a fixed effects meta-analysis, test whether the estimates are significantly overdispersed given the fixed effects model, and, if so, estimate and report a random effects meta-analysis.
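The sketch below illustrates that workflow under one specific set of assumptions. It uses Cochran's Q as the overdispersion statistic and the DerSimonian-Laird moment estimator for \(\tau^2\); these are common choices but not the only ones, and the text above does not commit to either. All function names and numbers are hypothetical.

```python
import math

def pool(estimates, std_errors, tau2=0.0):
    """Weighted average with weights 1 / (SE^2 + tau2); tau2 = 0 is fixed effects."""
    weights = [1.0 / (se**2 + tau2) for se in std_errors]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def cochran_q(estimates, std_errors):
    """Weighted sum of squared deviations from the fixed effects estimate."""
    fe_pooled, _ = pool(estimates, std_errors)
    return sum((est - fe_pooled) ** 2 / se**2
               for est, se in zip(estimates, std_errors))

def dersimonian_laird_tau2(estimates, std_errors):
    """Moment estimate of between-study variance, truncated at zero."""
    w = [1.0 / se**2 for se in std_errors]
    q = cochran_q(estimates, std_errors)
    k = len(estimates)
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    return max(0.0, (q - (k - 1)) / c)

# Hypothetical estimates: the least precise study reports the largest effect.
estimates = [1.0, 3.0, 8.0]
std_errors = [0.5, 1.0, 2.0]

fe_pooled, fe_se = pool(estimates, std_errors)
q = cochran_q(estimates, std_errors)  # compare to chi-square with k - 1 df
tau2 = dersimonian_laird_tau2(estimates, std_errors)
re_pooled, re_se = pool(estimates, std_errors, tau2)
simple_average = sum(estimates) / len(estimates)
```

In this made-up example the random effects estimate lies between the fixed effects estimate and the simple average, and its standard error is larger than the fixed effects standard error. That is the pattern the paragraph above describes when estimates are heterogeneous.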

Beware of meta-analyses that combine experimental and observational estimates. When properly executed, experiments provide unbiased estimates of the average treatment effect. An observational study, on the other hand, is prone to bias insofar as the treatments are not randomly assigned. The nominal standard errors associated with observational studies ignore the potential for bias; the standard errors are biased downward because they assume the best-case scenario, namely, that nature assigned treatments in a manner that was as good as random. Gerber, Green, and Kaplan (2004)^{5} show that merely being uncertain about the bias of an observational study is equivalent to according it a larger standard error. Although many prominent meta-analyses include both experiments and observational studies (e.g., Lau and Sigelman^{6}; Pettigrew and Tropp 2006), this practice is frowned upon by leading scholars conducting biomedical meta-analyses.

Because meta-analyses draw their data from reported results, publication bias presents a serious threat to the interpretability of meta-analytic results. If the only results that see the light of day are splashy or statistically significant, meta-analysis may simply amplify publication bias. Methodological guidance to meta-analytic researchers therefore places special emphasis on conducting and carefully documenting a broad-ranging search for relevant studies, whether published or not, including studies written in languages other than English. This task is, in principle, aided by pre-registration of studies in public archives; unfortunately, pre-registration in the social sciences is not sufficiently comprehensive to make this a dependable approach on its own.

When assembling a meta-analysis, it is often impossible to know whether one has missed relevant studies. Some statistical methods have been developed to detect publication bias, but these tests tend to have low power and therefore may give more reassurance than is warranted. For example, one common approach is to construct a scatterplot to assess the relationship between study size (whether measured by the N of subjects or the standard error of the estimated treatment effect) and effect size. A telltale symptom of publication bias is a tendency for smaller studies to produce larger effects, as would be the case if studies were published only if they showed statistically significant results: to reach the significance bar, small studies (with large standard errors) would need to generate larger effect estimates. Unfortunately, this test often produces ambiguous results (CITE), and methods to correct publication bias in the wake of such diagnostic tests (e.g., the trim-and-fill method) may do little to reduce bias. Given growing criticism of statistical tests for publication bias and accompanying statistical correctives, there is an increasing sense that the quality of a meta-analysis hinges on whether research reports in a given domain can be assembled in a comprehensive manner.
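A small simulation can illustrate why selective publication makes smaller studies appear to produce larger effects. Everything in this sketch is an assumption made for illustration: the true effect, the two standard errors, and the crude rule that only positive estimates with a z-ratio above 1.96 are published.

```python
import random
from statistics import mean

def simulate_published(true_effect, std_error, n_studies, rng):
    """Draw study estimates and keep only positive, 'significant' ones."""
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if estimate / std_error > 1.96:  # crude publication filter (assumption)
            published.append(estimate)
    return published

rng = random.Random(12345)  # fixed seed so the sketch is reproducible
true_effect = 0.2           # hypothetical true average treatment effect

# Small studies have large standard errors; large studies have small ones.
small = simulate_published(true_effect, std_error=0.5, n_studies=2000, rng=rng)
large = simulate_published(true_effect, std_error=0.1, n_studies=2000, rng=rng)

mean_small = mean(small)  # average published estimate among small studies
mean_large = mean(large)  # average published estimate among large studies
```

Under this filter every published small-study estimate must exceed 0.98, nearly five times the assumed true effect, while published large-study estimates are inflated far less. Plotting the published estimates against their standard errors would show the asymmetric pattern described above.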

Researchers often seek to investigate systematic sources of treatment effect heterogeneity. These systematic sources may reflect differences among subjects (Do certain drugs work especially well for men or women?), contexts (Do lab studies of exposure to mass media produce stronger effects than field studies?), outcomes (Are treatment effects especially large when outcomes are measured via opinion surveys as opposed to direct observation of behavior?), or treatments (Are partisan messages more effective at mobilizing voters than nonpartisan messages?). Quite often, these investigations are best studied directly, via an experimental design. For example, variation in treatment may be studied by randomly assigning different treatment arms. Variation in effects associated with different outcome measures may also be studied in the context of a given experiment by gathering data on more than one outcome or by randomly assigning how outcomes are measured.

A second-best approach is to compare studies that differ on one or more of these dimensions (subjects, treatments, context, or outcomes). The drawback of this approach is that it is essentially descriptive rather than causal – the researcher is basically characterizing the features of studies that contribute to especially large or small effect sizes. That said, this exercise can be conducted via meta-regression: the estimated effect size is the dependent variable, while study attributes (e.g., whether outcomes were measured through direct observation or via survey self-reports) constitute the independent variables. Note that meta-regression is a generalization of random effects meta-analysis, with measured predictors of effect sizes as well as unmeasured sources of heterogeneity.
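A bare-bones version of meta-regression with a single study-level predictor can be written as weighted least squares. This is a sketch under stated assumptions: the function name `meta_regression` is hypothetical, the between-study variance `tau2` is taken as given rather than estimated, and the effect sizes are made up.

```python
def meta_regression(estimates, std_errors, moderator, tau2=0.0):
    """Weighted least squares of effect estimates on one study attribute.

    Weights are 1 / (SE^2 + tau2); tau2 = 0 gives fixed effects weights.
    Returns (intercept, slope).
    """
    w = [1.0 / (se**2 + tau2) for se in std_errors]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, moderator)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, estimates)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, moderator))
    if sxx == 0:
        raise ValueError("moderator does not vary across studies")
    sxy = sum(wi * (x - x_bar) * (y - y_bar)
              for wi, x, y in zip(w, moderator, estimates))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical studies: moderator = 1 if outcomes came from survey self-reports,
# 0 if they came from direct observation of behavior.
estimates = [0.10, 0.20, 0.40, 0.50]
std_errors = [0.1, 0.1, 0.1, 0.1]
survey_measure = [0, 0, 1, 1]
intercept, slope = meta_regression(estimates, std_errors, survey_measure)
```

Here the intercept is the weighted average effect among direct-observation studies and the slope is the difference associated with survey measurement. As the text cautions, that difference describes the studies; it is not a causal estimate of what the measurement mode does.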

Since meta-analysis is a technique for combining information across different studies, we do not here discuss the detection or modeling of heterogeneous treatment effects within any single study.

A skeptic might ask whether meta-analysis improves our understanding of cause-and-effect in any practical way. Do we learn anything from pooling existing studies via a weighted average versus presenting the studies one at a time and leaving the synthesis to the reader? To address this question EGAP conducted an experiment among the academics and policy experts attending a conference to reveal the results of the first round of EGAP’s Metaketa Initiative, which focused on conducting a coordinated meta-analysis on the impact of information and accountability programs on electoral outcomes. The round consisted of six studies measuring the impact of the same causal mechanism.

To test the idea that accumulated knowledge (in the form of meta-analysis) allows for better inferences about the effect of a given program, the Metaketa committee randomly assigned audience members to hear presentations of the meta-analysis, the component studies, a placebo, and an external study of a similar intervention that was not part of the Metaketa round or the subsequent meta-analysis. Each group of participants was left unexposed to one of these sets of studies, and attendees were then asked to predict the findings of the study they had not seen. This allowed the committee to measure the effect of each study type on attendees’ predictive abilities. The resulting analysis found that exposure to the meta-analysis led to greater accuracy in predicting the effect in the left-out study than exposure to the external study (which, as a reminder, was not part of the meta-analysis in any way). For more on this Metaketa round, along with a more substantial discussion of this “evidence summit,” look for the upcoming book Information, Accountability, and Cumulative Learning: Lessons from Metaketa I.

2. Pettigrew, T.F., & Tropp, L.R. (2006). A Meta-Analytic Test of Intergroup Contact Theory. *Journal of Personality and Social Psychology, 90(5)*, 751–783.

3. For a list of R packages useful in conducting meta-analysis, see https://cran.r-project.org/web/views/MetaAnalysis.html

4. See, for example, https://www.stata.com/support/faqs/statistics/meta-analysis/ and https://cran.r-project.org/web/views/MetaAnalysis.html

5. Gerber, A.S., Green, D.P., & Kaplan, E.H. (2004). The illusion of learning from observational research. In I. Shapiro, R.M. Smith, & T.E. Masoud (Eds.), *Problems and Methods in the Study of Politics* (251-273). Cambridge, England: Cambridge University Press.

6. Lau, R.R., Sigelman, L., & Rovner, I.B. (2007). The Effects of Negative Political Campaigns: A Meta‐Analytic Reassessment. *The Journal of Politics, 69(4)*, 1176-1209.