Drawing inferences from null hypothesis significance testing requires a range of assumptions. Here we use our database to examine the extent to which these assumptions are valid within the language acquisition literature. We find that, broadly, these assumptions are merited.

An important component of reproducibility is the complete description of data in published reports. This is critical both for evaluating an individual study and for the purposes of a cumulative science (e.g., meta-analysis). Here we explore the extent to which papers report key statistics such as means and standard deviations, effect sizes, and test statistics.

```
counts = all_data %>%
  summarise(test_statistic = sum(!is.na(t) | !is.na(F) | !is.na(r)),
            means = sum(!is.na(x_1)),
            SD = sum(!is.na(SD_1)),
            d = sum(!is.na(d_calc)),
            g = sum(!is.na(g_calc)),
            r = sum(!is.na(r_calc)),
            age_range = sum(!is.na(age_range_1)),
            gender = sum(!is.na(gender_1))) %>%
  gather("coded_variable", "n") %>%
  mutate(coded = "coded") %>%
  mutate(total = nrow(all_data)) %>%
  mutate(coded_variable = factor(coded_variable,
                                 levels = c("d", "g", "r", "means",
                                            "SD", "test_statistic",
                                            "age_range", "gender")))

counts = counts %>%
  mutate(n = total - n,
         coded = "uncoded") %>%
  bind_rows(counts) %>%
  mutate(n_lab = ifelse(coded == "coded", n, "")) %>%
  arrange(coded)

ggplot(counts) +
  geom_bar(aes(x = coded_variable,
               y = n / total,
               fill = coded,
               order = coded),
           stat = "identity") +
  ylim(0, 1) +
  ylab("Proportion coded") +
  xlab("Coded variable") +
  ggtitle("All data") +
  # annotate("text", x = 1, y = .9,
  #          label = paste("N =", counts$total[1]), size = 6) +
  scale_fill_manual(values = c("lightgreen", "grey")) +
  geom_text(aes(label = n_lab, x = coded_variable, y = n / total - .06)) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        text = element_text(size = 20),
        axis.text.x = element_text(angle = 30, hjust = 1))
```

This analysis is difficult in practice because many of our source MAs include effect sizes that were added by the coders. Similarly, the proportion of coded gender in many cases reflects only that the coder chose not to code it, not that this information was absent from the paper. Nevertheless, this analysis gives us a good sense of the proportion of papers that report means and standard deviations (about two-thirds).

The same analysis, broken down by MA:

```
counts = all_data %>%
  group_by(dataset) %>%
  summarise(test_statistic = sum(!is.na(t) | !is.na(F)),
            means = sum(!is.na(x_1)),
            SD = sum(!is.na(SD_1)),
            d = sum(!is.na(d_calc)),
            g = sum(!is.na(g_calc)),
            r = sum(!is.na(r_calc)),
            age_range = sum(!is.na(age_range_1)),
            gender = sum(!is.na(gender_1)),
            total = n()) %>%
  gather(coded_variable, n, -dataset, -total) %>%
  mutate(coded = "coded") %>%
  mutate(coded_variable = factor(coded_variable,
                                 levels = c("d", "g", "r", "means",
                                            "SD", "test_statistic",
                                            "age_range", "gender")))

counts = counts %>%
  mutate(n = total - n,
         coded = "uncoded") %>%
  bind_rows(counts) %>%
  mutate(n_lab = ifelse(coded == "coded", n, "")) %>%
  arrange(coded)

ggplot(counts, aes(fill = coded)) +
  geom_bar(aes(x = coded_variable,
               y = n),  # position = "fill" rescales the stacked counts to proportions
           stat = "identity",
           position = "fill") +
  facet_wrap(~dataset, ncol = 2) +
  ylim(0, 1) +
  ylab("Proportion coded") +
  xlab("Coded variable") +
  scale_fill_manual(values = c("lightgreen", "grey")) +
  geom_text(aes(label = n_lab,
                x = coded_variable,
                y = n / total - .06)) +
  theme_bw() +
  theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        text = element_text(size = 20),
        strip.background = element_blank(),
        axis.text.x = element_text(angle = 30, hjust = 1))
```

Here we examine how sample sizes and exclusion rates are related to other variables of interest.

Given that effect size broadly increases with age, we would expect sample sizes to decrease with age. Broadly, the data are consistent with this. Phonemic discrimination (native) is an exception, but this may be due to differences in method (?).

```
all = all_data %>%
  mutate(n_total = ifelse(!is.na(n_2), n_1 + n_2, n_1)) %>%
  select(dataset, d_calc,
         mean_age_1, n_total,
         n_excluded_1, num_trials, method)

ggplot(all, aes(x = mean_age_1, y = n_total, color = dataset)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~dataset, scales = "free") +
  theme_bw() +
  theme(legend.position = "none")
```

Here we explore a more direct test of whether sample sizes are coupled with effect sizes. We test this by looking at the relationship between sample size and effect size, after residualizing out the effects of (1) phenomenon, (2) age, and (3) method.

Compute residuals.

```
full.model = rma(d_calc ~ method + dataset + mean_age_1,
                 vi = d_var_calc, data = all_data, method = "REML")
residuals = rstandard(full.model)

all_data = all_data %>%
  bind_cols(as.data.frame(residuals$resid),
            as.data.frame(residuals$z)) %>%
  rename(residual.d = `residuals$resid`,
         residual.d.s = `residuals$z`)  # standardized
```

```
ns_ds = all_data %>%
  mutate(n_total = ifelse(!is.na(n_2), n_1 + n_2, n_1)) %>%
  select(dataset, d_calc, residual.d.s,
         mean_age_1, n_total)

ggplot(ns_ds, aes(y = n_total, x = residual.d.s, color = dataset)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_bw() +
  xlab("Residual effect size") +
  facet_wrap(~dataset, scales = "free") +
  theme(legend.position = "none")
```

If sample sizes and effect sizes were appropriately coupled, we should expect sample size to decrease as effect size increases. Across three meta-analyses, we find the opposite pattern: sample size increases with effect size. This suggests that researchers are not prospectively planning their sample sizes based on effect size estimates.

Here we look at whether the number of excluded infants varies as a function of effect size. If infants are excluded based on an independent criterion, decided before looking at the data, there should be no relationship. Again, we use residualized effect sizes.

However, there are some technical issues we need to address before properly doing this analysis:

- note that different datasets use differently named columns to indicate exclusions (“n_excluded”, “n_excluded_1”, “n_excluded_fussers”)
- an additional complication: are the exclusions part of the n?
- we don’t have num_trials for any, but this would be great to look at
- what’s up with cases where n_excluded > n?
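The first issue could be handled by coalescing the differently named exclusion columns into a single column before analysis. A minimal sketch using dplyr; the helper name `harmonize_exclusions` and the output column `n_excluded_any` are hypothetical, not part of the existing pipeline:

```r
library(dplyr)

# Hypothetical helper: collapse the differently named exclusion columns into
# one n_excluded_any column. coalesce() takes the first non-NA value per row.
harmonize_exclusions <- function(df) {
  for (col in c("n_excluded", "n_excluded_1", "n_excluded_fussers")) {
    if (!col %in% names(df)) df[[col]] <- NA_real_  # ensure all three exist
  }
  df %>%
    mutate(n_excluded_any = coalesce(n_excluded, n_excluded_1,
                                     n_excluded_fussers))
}
```

This sidesteps per-dataset special-casing, though it does not resolve the deeper question of whether exclusions are already counted in n.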

Compute residuals.

```
# residualize age on method and dataset (NB: despite the names below, these
# are residuals of age, used for the age-based exclusion analysis)
age.method = rma(mean_age_1 ~ method + dataset,
                 vi = d_var_calc, data = all_data, method = "REML")
residuals.age.method = rstandard(age.method)

all_data = all_data %>%
  bind_cols(as.data.frame(residuals.age.method$resid),
            as.data.frame(residuals.age.method$z)) %>%
  rename(residual.d.age = `residuals.age.method$resid`,
         residual.d.s.age = `residuals.age.method$z`)  # standardized
```

```
exclusions = all_data %>%
  filter(!is.na(n_excluded_1) | !is.na(n_excluded_2)) %>%  # vectorized OR, not ||
  mutate(n_excluded = ifelse(!is.na(n_excluded_2),
                             n_excluded_1 + n_excluded_2, n_excluded_1),
         n_total = ifelse(!is.na(n_2), n_1 + n_2, n_1)) %>%
  select(dataset, d_calc,
         mean_age_1, n_total,
         n_excluded, num_trials, method, residual.d.s, residual.d.s.age) %>%
  mutate(prop_excluded = n_excluded / n_total) %>%
  filter(prop_excluded < 1)

ggplot(exclusions, aes(x = prop_excluded, y = residual.d.s,
                       color = dataset)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Residual d as a function of prop. excluded") +
  xlab("proportion excluded") +
  ylab("residual d") +
  theme_bw() +
  facet_wrap(~dataset, scales = "free") +
  theme(legend.position = "none")
```

Again, if there is no bias we should see no relationship. In contrast, we see some evidence that proportion excluded increases with age. This is difficult to interpret, though, for the reasons described above.

```
ggplot(exclusions, aes(x = residual.d.s.age, y = prop_excluded,
color = dataset)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("prop. excluded as a function of age") +
xlab("mean age (months)") +
ylab("proportion excluded") +
theme_bw() +
facet_wrap(~dataset, scales = "free") +
theme(legend.position="none")
```

If studies are randomly sampled from the population of possible studies, effect sizes should not be biased. Here we examine this assumption using residualized effect sizes.

```
years.df = all_data %>%
  # the first run of digits in study_ID is the publication year
  mutate(year = as.numeric(unlist(lapply(strsplit(unlist(all_data$study_ID),
                                                  "[^0-9]+"),
                                         function(x) unlist(x)[2]))))

ggplot(years.df, aes(x = year, y = residual.d.s, color = dataset)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~dataset, scales = "free_y") +
  theme_bw() +
  xlab("published year") +
  theme(legend.position = "none")
```

We see a year bias here: studies conducted early in a literature tend to have larger effect sizes than later studies.

```
ggplot(years.df, aes(x = year, y = residual.d.s.age, color = dataset)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~dataset, scales = "free_y") +
  theme_bw() +
  xlab("published year") +
  theme(legend.position = "none")
```

In some literatures, we see a trend toward looking for the phenomenon in younger children over time.

Here I calculated the power to detect an effect using a random-effects model (Hedges & Pigott, 2001; Valentine, Pigott, & Rothstein, 2010). Note that this information is redundant with confidence intervals on the estimated effect size, and the power in all of these MAs is essentially 1. Below, I ran a simulation sampling n papers from each meta-analysis and calculating power. We need surprisingly few studies to get relatively high power.
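For reference, the power computation implemented in the code follows the noncentral-normal approximation from Hedges & Pigott (2001): with meta-analytic estimate $\hat{\delta}$ and its standard error,

```latex
\lambda = \frac{\hat{\delta}}{\mathrm{SE}(\hat{\delta})},
\qquad
\mathrm{power} \approx 1 - \Phi\!\left(z_{\alpha/2} - \lambda\right)
```

where $\Phi$ is the standard normal CDF and $z_{\alpha/2} \approx 1.96$ for $\alpha = .05$. This approximation ignores the (negligible) lower-tail rejection region.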

```
calculate_ma_power <- function(sample_data) {
  ALPHA <- .05
  NTAILS <- 2
  c_crit <- qnorm(ALPHA / NTAILS, lower.tail = F)
  model <- rma(d_calc, vi = d_var_calc, data = sample_data,
               method = "REML", control = list(maxiter = 1000, stepadj = 0.5))
  lambda <- model$b[1] / model$se[1]  # Valentine, Eq. [6]
  power <- 1 - pnorm(c_crit - lambda)
  power
}

one_sample <- function(ma_data, summary_function, n) {
  function(k) {
    do.call(summary_function, list(sample_n(ma_data, size = n, replace = F)))
  }
}

all_samples <- function(ma_data, summary_function, n, nboot) {
  sample_values <- 1:nboot %>%
    map(one_sample(ma_data, summary_function, n)) %>%
    unlist()
  data.frame(n = n,
             mean = mean(sample_values),
             ci_lower = ci_lower(sample_values),
             ci_upper = ci_upper(sample_values),
             row.names = NULL)
}

dataset_data <- function(ma_data, summary_function, nboot) {
  overall <- do.call(summary_function, list(ma_data))
  seq(1, nrow(ma_data), by = 1) %>%
    map(function(n) all_samples(ma_data, summary_function, n, nboot)) %>%
    bind_rows() %>%
    mutate(dataset = unique(ma_data$dataset),
           overall_est = overall)
}

power_data <- all_data %>%
  split(.$dataset) %>%
  map(function(ma_data) dataset_data(ma_data, calculate_ma_power, 50)) %>%
  bind_rows()

ggplot(power_data, aes(x = n, y = mean, fill = dataset)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.5) +
  geom_hline(aes(yintercept = overall_est), linetype = "dashed") +
  geom_line(aes(x = n, y = mean)) +
  facet_wrap(~dataset) +
  xlab("Number of effect sizes") +
  ylab("Meta-analytic power") +
  xlim(0, 150) +
  scale_fill_solarized(guide = FALSE) +
  theme_bw() +
  theme(text = element_text(family = "Open Sans"),
        panel.grid.minor = element_blank())
```

Not totally sure how to do this. Valentine et al. (2001) say it’s difficult without the raw data?

Trim-and-fill corrects for selective reporting on the basis of effect size (fewer small effects; Duval & Tweedie, 2000a, 2000b). The routines and plotting tools come from the R package “metafor” (Viechtbauer).

[TODO: Adapt to Metalab style]

```
### fit random-effects model and plot
datasets = unique(all_data$short_name)
for (i in seq_along(datasets)) {
  my_data = all_data[all_data$short_name == datasets[i], ]
  res <- rma(d_calc, d_var_calc, data = my_data)
  taf <- trimfill(res)
  funnel(taf, main = datasets[i])
}