Table of Contents

Section 3.1: Introduction
Section 3.2: Generating random data
Section 3.3: Sampling data and sampling variability
Section 3.4: Sampling distributions, sampling experiments, and the central limit theorem
Section 3.5: The normal distribution and confidence intervals
Section 3.6: Computing confidence intervals for means and proportions with R
Section 3.7: A brief intro to resampling and bootstrapping
Section 3.8: What about comparisons? Sampling distribution for the difference of two means
Section 3.9: Comparing means visually by using error bars representing confidence intervals: inference by eye
Up to now we have introduced a series of concepts and tools that are helpful for describing sample data. But in data analysis we often do not observe full populations; we often only have sample data.
Think of the following two problems:
You want to know the extent of intimate partner violence in a given country. You could look at police data. But not every instance of intimate partner violence gets reported to, or recorded by, the police; we know a large proportion of those instances are not reflected in police statistics. You could do a survey, but it would not be practical to ask everybody in the country about this. So you select a sample and try to develop an estimate of the extent of this problem in the population based on what you observe in the sample. But how can you be sure your sample guess, your estimate, is any good? Would you get a different estimate if you selected a different sample?
You conduct a study to evaluate the impact of a particular crime prevention program. You observe a number of areas where you have implemented this program and a number of areas where you have not. As before, you are still working with sample data. How can you reach conclusions about the effectiveness of the intervention based on observations of differences in crime across these areas?

For these and similar problems we need to apply statistical inference: a set of tools that allows us to draw inferences from sample data. In this session we will cover a set of important concepts that constitute the basis for statistical inference. In particular, we will approach this topic from the frequentist tradition.

It is important you understand this is not the only way of doing data analysis. There is an alternative approach, Bayesian statistics, which is very important and increasingly popular. Unfortunately, we do not have the time this semester to cover Bayesian statistics as well. Typically, you would learn about this approach in more advanced courses.
Unlike in previous and future sessions, the focus today will be less applied and a bit more theoretical. However, it is important you pay attention since understanding the foundations of statistical inference is essential for a proper understanding of everything else we will discuss in this course.
For the purpose of today’s session we are going to generate some fictitious data. We use real data in all other sessions, but it is convenient for this session to have some randomly generated fake data (actually, technically speaking, pseudo-random data)1.
So that all of us get the same results (otherwise there would be random differences!), we need to use the set.seed() function. Basically, your numbers are pseudorandom because they are calculated by a number-generating algorithm, and setting the seed gives it a number to “grow” these pseudorandom numbers out of. If you start with the same seed, you get the same set of random numbers.
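To see what this means in practice, here is a quick optional illustration (the seed values 1 and 2 below are just arbitrary examples, not the seed we will use for the session): running rnorm() twice with the same seed gives identical numbers, whereas a different seed gives a different set.

set.seed(1)
rnorm(3) #three "random" draws
set.seed(1)
rnorm(3) #same seed, so exactly the same three draws as above
set.seed(2)
rnorm(3) #different seed, different draws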
So to guarantee that all of us get the same randomly generated numbers, set your seed to 100:
set.seed(100)
We are going to generate an object with skewed data. We often work with severely skewed data in criminology. For example, we saw with the neighbourhood crime data over the last two weeks that the number of crimes is not evenly distributed; instead, the distribution is skewed. To generate this type of data I am going to use the rnbinom() function, which draws values from something called the negative binomial distribution.
skewed <- rnbinom(100000, mu = 1, size = 0.3) #Creates a negative binomial distribution, don't worry too much about the other parameters at this stage, but if curious look at ?rnbinom
We can also get the mean and standard deviation for this object:
mean(skewed)
## [1] 1.00143
sd(skewed)
## [1] 2.083404
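As an optional sanity check, the help page for rnbinom() notes that with this mu/size parameterisation the variance is mu + mu^2/size. With mu = 1 and size = 0.3 we would therefore expect a standard deviation of roughly 2.08, very close to what we just obtained:

1 + 1^2 / 0.3        #theoretical variance: mu + mu^2/size
sqrt(1 + 1^2 / 0.3)  #theoretical standard deviation, approximately 2.08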
And we can also see what it looks like:
library(ggplot2)
qplot(skewed)
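If you are using a recent release of ggplot2 you may see a warning that qplot() is deprecated. A rough equivalent with the standard ggplot() syntax (just a sketch; it first wraps the vector in a data frame) would be:

#Equivalent histogram using ggplot() directly instead of qplot()
ggplot(data.frame(crime = skewed), aes(x = crime)) +
  geom_histogram()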
We are going to pretend this variable measures numbers of crime perpetrated by an individual in the previous year. Let’s see how many offenders we have in this fake population.
sum(skewed > 0)
## [1] 35623
We are now going to put this variable in a dataframe and we are also going to create a new categorical variable identifying whether someone offended over the past year (e.g., anybody with a count of crime higher than 0):
#Let's start by creating a new data frame ("fake_population") with the skewed variable renamed as crime.
fake_population <- data.frame(crime = skewed)
#Then let's define all values above 0 as "Yes" in a variable identifying offenders and everybody else as "No". We use the ifelse() function for this.
fake_population$offender <- ifelse(fake_population$crime > 0, "Yes", "No")
#Let's check the results
table(fake_population$offender)
##
## No Yes
## 64377 35623
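As an optional aside, you can turn these counts into proportions of the whole fake population with prop.table():

#Same frequency table expressed as proportions (about 0.64 and 0.36)
prop.table(table(fake_population$offender))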
We are now going to generate a normally distributed variable. We are going to pretend that this variable measures IQ. We are going to assume that this variable has a mean of 100 in the non-criminal population (pretending there is such a thing) with a standard deviation of 15 and a mean of 92 with a standard deviation of 20 in the criminal population. I am pretty much making up these figures.
#The first expression is asking R to generate random values from a normal distribution with mean 100 and standard deviation 15 for each of the 64377 "non-offenders" in our fake population data frame
fake_population$IQ[fake_population$offender == "No"] <- rnorm(64377, mean = 100, sd = 15)
#And now we are going to artificially create somewhat dumber offenders.
fake_population$IQ[fake_population$offender == "Yes"] <- rnorm(35623, mean = 92, sd = 20)
We can now have a look at the data. Let’s plot the density of IQ for each of the two groups and have a look at the summary statistics.
#This will give us the mean IQ for the whole population
mean(fake_population$IQ)
## [1] 97.19921
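As a quick check of your intuition, this overall mean is (approximately) the weighted average of the two group means we specified, weighted by the proportions of non-offenders and offenders we saw above:

#Weighted average of the two specified group means, using the group proportions
0.64377 * 100 + 0.35623 * 92  #roughly 97.15, close to the overall mean above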
#We will load the plyr package to get the means for IQ for each of the two offending groups
library(plyr)
#We will store these means in a data frame (IQ_means) after getting them with the ddply() function
IQ_means <- ddply(fake_population, "offender", summarise, IQ = mean(IQ))
#You can see the mean value of IQ for each of the two groups; unsurprisingly, they are very close to the values we defined
IQ_means
## offender IQ
## 1 No 99.96347
## 2 Yes 92.20370
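If you prefer the more modern dplyr package (assumed here to be installed; we do not use it elsewhere in this session), the same group means can be obtained as in the sketch below. Note that loading dplyr alongside plyr can produce some masking messages.

#A dplyr alternative to ddply() for getting the group means
library(dplyr)
fake_population %>%
  group_by(offender) %>%
  summarise(IQ = mean(IQ))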
#We are going to create a plot with the density estimate for each of the two groups (first two lines of code) and then I will add a vertical dashed line at the mean (that we saved in IQ_means) for each of the groups
ggplot(fake_population, aes(x = IQ, colour = offender)) +
geom_density() +
geom_vline(aes(xintercept = IQ, colour = offender), data = IQ_means,
linetype = "dashed", size = 1)
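A small note on versions: in ggplot2 3.4.0 and later, the size argument for lines is deprecated in favour of linewidth, so with a recent installation the equivalent call would be:

#Same plot using linewidth instead of size (for ggplot2 >= 3.4.0)
ggplot(fake_population, aes(x = IQ, colour = offender)) +
  geom_density() +
  geom_vline(aes(xintercept = IQ, colour = offender), data = IQ_means,
             linetype = "dashed", linewidth = 1)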