Let’s start by looking at the height of adults in NHANES, to explore the use of sampling. We will assume that our population of interest is all adults in NHANES, and we will examine how well different estimators work in terms of our two criteria: bias and variance.

# load the libraries used throughout (NHANES for the data, dplyr/tidyr/ggplot2 for wrangling and plotting)
library(NHANES)
library(dplyr)
library(tidyr)
library(ggplot2)

# flag children, and drop observations with missing height
NHANES$isChild <- NHANES$Age < 18
NHANES_adult <- subset(NHANES, subset = !isChild & !is.na(Height))
NHANES_clean <- subset(NHANES, subset = !is.na(Height))
print(paste('Population height: mean = ',mean(NHANES_adult$Height)))
## [1] "Population height: mean =  168.862957974138"
print(paste('Population height: std deviation = ',sd(NHANES_adult$Height)))
## [1] "Population height: std deviation =  10.10583462654"

We can take a single sample using the sample_n() function from dplyr, which samples a specified number of individuals from the population. Here we sample 10 of them.

exampleSample = NHANES_adult %>% sample_n(10)
dim(NHANES_adult)
## [1] 7424   77
dim(exampleSample)
## [1] 10 77
exampleSample
print(paste('Sample height: mean = ',mean(exampleSample$Height)))
## [1] "Sample height: mean =  168.45"
print(paste('Sample height: std deviation = ',sd(exampleSample$Height)))
## [1] "Sample height: std deviation =  8.33803200868033"

By default, sample_n() samples without replacement; that is, once a person is selected, they can't be selected again. Next session we will see examples where we want to sample with replacement: when someone is selected, they get put back into the pile so that they can possibly be selected again. Here is an example where we sample among four restaurants with replacement; note that if we were sampling without replacement, we could only draw up to 4 observations, because at that point we would have exhausted our population of restaurants.

restaurantDf = data.frame(restaurant=c('Chipotle','KFC','Tacolicious','Coupa'))
restaurantDf %>% sample_n(20,replace = TRUE)
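
As a quick illustration (a minimal sketch, not part of the original notes), asking sample_n() for more rows than the data frame contains while sampling without replacement produces an error, which we can catch and print:

# sampling 20 from 4 rows without replacement is impossible, so this errors
tryCatch(restaurantDf %>% sample_n(20, replace = FALSE),
         error = function(e) print('cannot sample more rows than exist without replacement'))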

Now let’s take a bunch of samples of 100 individuals each, compute the mean of each sample, and look at the distribution of those values (which we call the “sampling distribution”).

We have to decide how many samples to take in order to do a good job of estimating the sampling distribution. Let’s take 5000 samples so that we are really confident in the answer. This might take a moment….

sampSize=100   # number of individuals in each sample
nsamps=5000    # number of samples to draw

# pre-allocate arrays to hold the estimates from each sample
sampMeans=array(NA,nsamps)
sampMeansTrimmed=array(NA,nsamps)
sampMedians=array(NA,nsamps)
sampMidpoints=array(NA,nsamps)
sampMeanSE=array(NA,nsamps)

for (i in 1:nsamps){
  NHANES_sample=sample_n(NHANES_adult,sampSize)
  sampMeans[i]=mean(NHANES_sample$Height)
  # estimated standard error of the mean for this sample
  sampMeanSE[i]=sd(NHANES_sample$Height)/sqrt(sampSize)
  # trimmed mean: drop the top and bottom 10% of values before averaging
  sampMeansTrimmed[i]=mean(NHANES_sample$Height,trim=0.1)
  sampMedians[i]=median(NHANES_sample$Height)
  # midpoint (midrange): halfway between the minimum and maximum
  sampMidpoints[i]=min(NHANES_sample$Height) + (max(NHANES_sample$Height) - min(NHANES_sample$Height))/2
}

sampdataDf=data.frame(mean=sampMeans,meanTrimmed=sampMeansTrimmed,
                      median=sampMedians,midpoint=sampMidpoints,
                      se_mean=sampMeanSE)

print(paste('Average sample mean =',mean(sampMeans)))
## [1] "Average sample mean = 168.8699838"
print(paste('Standard deviation of sample means =',sd(sampMeans)))
## [1] "Standard deviation of sample means = 1.0039740838935"
print(paste('Estimated standard error based on population SD:',sd(NHANES_adult$Height)/sqrt(sampSize)))
## [1] "Estimated standard error based on population SD: 1.010583462654"
sampMeans_df=data.frame(sampMeans=sampMeans)

ggplot(sampMeans_df,aes(sampMeans)) +
  geom_histogram(data=NHANES_adult,aes(Height,..density..),bins=500,col='gray',fill='gray') +
  geom_histogram(aes(y=..density..*0.2),bins=500,col='blue',fill='blue') +
  xlab('Height (cm)') + geom_vline(xintercept = mean(NHANES_adult$Height))

Now let’s see if the sample mean is really the “best” estimator of the population mean.

First, let’s see whether it really is less biased than the other estimators (the median, the 10% trimmed mean, and the midpoint).

sampdataLongDf=gather(sampdataDf)
sampdataGrouped <- group_by(sampdataLongDf, key) %>% filter(key != 'se_mean') %>% summarise(mean=mean(value),sd=sd(value))

ggplot(sampdataGrouped, aes(x=key,y=mean)) + 
  geom_point(stat="identity") +
  geom_hline(yintercept=mean(NHANES_adult$Height)) +
  xlab('summary measure') + ylab('estimate')

Now let’s see whether the mean also has the lowest variance compared to the other estimators.

ggplot(sampdataGrouped, aes(x=key,y=sd)) + 
  geom_point(stat="identity") +
  geom_hline(yintercept = sd(NHANES_adult$Height)/sqrt(sampSize)) +
  xlab('summary measure') + ylab('standard deviation of estimates')

Central limit theorem

The CLT tells us that the sampling distribution of the mean will be approximately normal, regardless of the shape of the distribution that the data are sampled from, as long as the sample size is large enough.

For a simple example, we can look at data sampled from a very non-normal discrete distribution. Let’s say that we have a die that is highly unfair, with the following probability distribution over its 6 sides:

unfairDieProbDist = c(6,5,4,3,2,1) # relative weights for each of the six faces
# normalize by the sum
unfairDieProbDist = unfairDieProbDist/sum(unfairDieProbDist)

ggplot(data.frame(x=seq(1,6),pdist=unfairDieProbDist),aes(x=x,y=pdist)) + 
  geom_bar(stat='identity') + ylab('probability') + xlab('outcome')

Now let’s sample repeatedly from this distribution, take the mean of each sample, and then look at the resulting distribution of sample means. Let’s start with a sample size of 10 rolls per sample.

sampSize=10
nsamps=5000
sampMeans=array(NA,nsamps)

for (i in 1:nsamps){
  samp=sample(6,sampSize,replace=TRUE,prob=unfairDieProbDist)
  sampMeans[i]=mean(samp)
}

sampMeansDf=data.frame(sampMeans=sampMeans)

ggplot(sampMeansDf,aes(sampMeans)) +
  geom_histogram(aes(y=..density..),bins=100,col='blue',fill='blue') +
  xlab('Mean roll') 
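
The sample means should pile up around the population mean of the unfair die, which we can compute directly from the probabilities defined above (a quick check, not in the original notes; it works out to about 2.67):

# expected value of a single roll: sum of each outcome times its probability
print(paste('Population mean of the unfair die =',sum(seq(1,6)*unfairDieProbDist)))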

We can also see this in real data. Let’s work with the Height data from all individuals in the NHANES dataset, which is highly skewed. Let’s take a bunch of samples from this distribution, and then look at the shape of the sampling distribution of the mean.

sampSize=250
nsamps=2500
sampMeansFull=array(NA,nsamps)

for (i in 1:nsamps){
  NHANES_sample=sample_n(NHANES_clean,sampSize)
  sampMeansFull[i]=mean(NHANES_sample$Height)

}
sampMeansFullDf=data.frame(sampMeans=sampMeansFull)

ggplot(sampMeansFullDf,aes(sampMeans)) +
  geom_histogram(data=NHANES_clean,aes(Height,..density..),bins=500,col='gray',fill='gray') +
  geom_histogram(aes(y=..density..*0.18),bins=500,col='blue',fill='blue') +
  xlab('Height (cm)')  + ylim(0,0.065) +
  annotate('text',x=100,y=0.055,label=sprintf('sample size = %d',sampSize),size=5,hjust=0) +
  annotate('text',x=100,y=0.05,label=sprintf('std error (observed) = %.03f',sd(sampMeansFull)),size=5,hjust=0) +
  annotate('text',x=100,y=0.045,label=sprintf('std error (computed) = %.03f',sd(NHANES_clean$Height)/sqrt(sampSize)),size=5,hjust=0)

# compare the empirical CDF of the sample means to the CDF of a normal
# distribution with matching mean and standard error
normalCDF=data.frame(x=seq(min(sampMeansFull),
                           max(sampMeansFull),.05))
normalCDF = normalCDF %>% mutate(cdf= pnorm(x,mean=mean(NHANES_clean$Height),
                                     sd=sd(NHANES_clean$Height)/sqrt(sampSize)))

ggplot(sampMeansFullDf,aes(sampMeans)) + 
  stat_ecdf(color='blue',linetype='dotted',size=1) + 
  geom_line(data=normalCDF,aes(x=x,y=cdf),size=1) + xlim(min(sampMeansFull), max(sampMeansFull)) +
  annotate('text',x=158,y=0.8,label=sprintf('sample size = %d',sampSize),size=5,hjust=0) 

# from https://mgimond.github.io/ES218/Week06a.html#drawing-a-normal-q-q-plot-from-scratch
y     <- quantile(sampMeansFull, c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x     <- qnorm( c(0.25, 0.75))         # Find the matching normal values on the x-axis
slope <- diff(y) / diff(x)             # Compute the line slope
int   <- y[1] - slope * x[1]           # Compute the line intercept

ggplot(sampMeansFullDf,aes(sample=sampMeans)) + geom_qq() + 
  geom_abline(intercept=int, slope=slope)
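
As an aside, newer versions of ggplot2 (3.0 and later) provide geom_qq_line(), which draws the same quartile-based reference line without the manual slope and intercept calculation; this is an alternative, assuming that version is installed:

ggplot(sampMeansFullDf,aes(sample=sampMeans)) + geom_qq() + geom_qq_line()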

How big is large enough? You can explore this by rerunning the simulations above with different values of sampSize and seeing how the shape of the sampling distribution changes; a sketch of that exploration follows.
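
Here is a minimal sketch of that exploration (the sample sizes below are arbitrary choices, not from the original notes): draw sampling distributions of the mean at several sample sizes from the full, skewed Height distribution and compare their histograms.

sampSizes = c(5, 25, 100)  # arbitrary example sizes
nsampsDemo = 1000
demoList = list()
for (s in sampSizes){
  demoMeans = array(NA, nsampsDemo)
  for (i in 1:nsampsDemo){
    demoMeans[i] = mean(sample_n(NHANES_clean, s)$Height)
  }
  demoList[[as.character(s)]] = data.frame(sampMean = demoMeans, sampSize = s)
}
demoDf = do.call(rbind, demoList)

ggplot(demoDf, aes(sampMean)) +
  geom_histogram(bins = 50) +
  facet_wrap(~sampSize, scales = 'free') +
  xlab('Sample mean height (cm)')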

Confidence intervals

A 95% confidence interval around an estimate gives us an interval that should capture the true population value in 95% of repeated samples.

We can use the data from our earlier simulations to examine this.

# 95% confidence interval for each sample, using the normal approximation (mean +/- 1.96*SE)
meandataDf=subset(sampdataDf,select=c('mean','se_mean'))
meandataDf$upperci=meandataDf$mean+1.96*meandataDf$se_mean
meandataDf$lowerci=meandataDf$mean-1.96*meandataDf$se_mean

# did this sample's interval capture the true population mean?
meandataDf$capture=(mean(NHANES_adult$Height)>meandataDf$lowerci)&(mean(NHANES_adult$Height)<meandataDf$upperci)

print(paste('Proportion of confidence intervals that capture the population mean:',mean(meandataDf$capture)))
## [1] "Proportion of confidence intervals that capture the population mean: 0.9478"

Inferential statistics

Our ultimate interest is often to determine whether two groups differ from one another on some feature. We can use inferential statistics to do this.

We will use NHANES for this as well, but now we are going to treat the NHANES data as a sample from the larger US population, rather than treating it as a population in its own right.

Let’s ask a question where we are pretty sure that we already know the answer: Are adult men taller than adult women on average?

ggplot(NHANES_adult,aes(Gender, Height)) +
  geom_violin()

meandiff=mean(NHANES_adult$Height[NHANES_adult$Gender=='male']) - mean(NHANES_adult$Height[NHANES_adult$Gender=='female'])
print(paste('mean difference in height between men and women:',round(meandiff,digits=1),'cm'))
## [1] "mean difference in height between men and women: 13.8 cm"
shortestMan=min(NHANES_adult$Height[NHANES_adult$Gender=='male'])
tallestWoman=max(NHANES_adult$Height[NHANES_adult$Gender=='female'])

print(paste('Proportion of women taller than the shortest man:',mean(NHANES_adult$Height[NHANES_adult$Gender=='female']>shortestMan)))
## [1] "Proportion of women taller than the shortest man: 0.95432819968136"
print(paste('Proportion of men shorter than the tallest woman:',mean(NHANES_adult$Height[NHANES_adult$Gender=='male']<tallestWoman)))
## [1] "Proportion of men shorter than the tallest woman: 0.87452159650082"

The plot seems to confirm our prior hypothesis: men are on average taller than women, by about 14 cm (roughly five and a half inches). But there is also a lot of overlap: most women are taller than the shortest man, and most men are shorter than the tallest woman. So how can we decide whether there is really a difference?

One way we can try to test this is to compare the confidence intervals for the two groups.

heightDf=group_by(NHANES_adult, Gender) %>% summarise(mean=mean(Height),sd=sd(Height),n=length(Height))
heightDf$se=heightDf$sd/sqrt(heightDf$n)
heightDf$upperCI=heightDf$mean + heightDf$se*1.96
heightDf$lowerCI=heightDf$mean - heightDf$se*1.96

ggplot(heightDf,aes(Gender, mean)) +
  geom_violin(data=NHANES_adult,aes(Gender,Height)) + 
  geom_errorbar(aes(ymin=lowerCI,ymax=upperCI))

print(heightDf)
## # A tibble: 2 x 7
##   Gender  mean    sd     n    se upperCI lowerCI
##   <fct>  <dbl> <dbl> <int> <dbl>   <dbl>   <dbl>
## 1 female   162  7.27  3766 0.119     162     162
## 2 male     176  7.48  3658 0.124     176     176

The mean for the women falls well outside of the 95% confidence interval for the men; we would only expect this to happen about 1 time in 20 if that confidence interval also captured the true population mean for women. This gives us fairly good evidence that the means for men and women are different. We will return to this when we come to hypothesis testing later in the course.
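
A more direct version of this comparison (a small sketch using the heightDf values computed above, not part of the original notes) is to put a 95% confidence interval around the difference in means itself; if that interval excludes zero, the data are inconsistent with the two groups having the same mean:

# standard error of the difference between two independent means
seDiff = sqrt(heightDf$se[heightDf$Gender=='male']^2 + heightDf$se[heightDf$Gender=='female']^2)
diffMeans = heightDf$mean[heightDf$Gender=='male'] - heightDf$mean[heightDf$Gender=='female']

print(paste('95% CI for the difference in means:',
            round(diffMeans - 1.96*seDiff,2),'to',round(diffMeans + 1.96*seDiff,2),'cm'))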