Table of Contents

Section 3.1: Introduction
Section 3.2: Generating random data
Section 3.3: Sampling data and sampling variability
Section 3.4: Sampling distributions, sampling experiments, and the central limit theorem
Section 3.5: The normal distribution and confidence intervals
Section 3.6: Computing confidence intervals for means and proportions with R
Section 3.7: A brief intro to resampling and bootstrapping
Section 3.8: What about comparisons? Sampling distribution for the difference of two means
Section 3.9: Comparing means visually by using error bars representing confidence intervals: inference by eye
Up to now we have introduced a series of concepts and tools that are helpful for describing sample data. But in data analysis we often do not observe full populations; we often only have sample data.
Think of the following two problems:
You want to know the extent of intimate partner violence in a given country. You could look at police data. But not every instance of intimate partner violence gets reported to, or recorded by, the police; we know a large proportion of those instances are not reflected in police statistics. You could do a survey, but it would not be practical to ask everybody in the country about this. So you select a sample and try to develop an estimate of the extent of this problem in the population based on what you observe in the sample. But how can you be sure your sample guess, your estimate, is any good? Would you get a different estimate if you selected a different sample?
You conduct a study to evaluate the impact of a particular crime prevention program. You observe a number of areas where you have implemented this program and a number of areas where you have not. As before, you are still working with sample data. How can you reach conclusions about the effectiveness of the intervention based on observations of differences in crime across these areas?

For these and similar problems we need to apply statistical inference: a set of tools that allows us to draw inferences from sample data. In this session we will cover a set of important concepts that constitute the basis for statistical inference. In particular, we will approach this topic from the frequentist tradition.

It is important you understand this is not the only way of doing data analysis. There is an alternative approach, Bayesian statistics, which is very important and increasingly popular. Unfortunately, we do not have the time this semester to cover Bayesian statistics as well. Typically, you would learn about this approach in more advanced courses.
Unlike in previous and future sessions, the focus today will be less applied and a bit more theoretical. However, it is important you pay attention since understanding the foundations of statistical inference is essential for a proper understanding of everything else we will discuss in this course.
For the purpose of today’s session we are going to generate some fictitious data. We use real data in all other sessions, but it is convenient for this session to have some randomly generated fake data (actually, technically speaking, pseudo-random data)1.
So that all of us get the same results (otherwise there would be random differences!), we need to use the set.seed() function. Basically, your numbers are pseudorandom because they are calculated by a number-generating algorithm, and setting the seed gives it a number to “grow” these pseudorandom numbers out of. If you start with the same seed, you get the same set of random numbers.
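To see what this means in practice, here is a quick optional illustration (the seed values 1 and 2 below are just arbitrary examples, not the seed we will use for the session): running rnorm() twice with the same seed gives identical numbers, whereas a different seed gives a different set.

set.seed(1)
rnorm(3) #three "random" draws
set.seed(1)
rnorm(3) #same seed, so exactly the same three draws as above
set.seed(2)
rnorm(3) #different seed, different draws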
So to guarantee that all of us get the same randomly generated numbers, set your seed to 100:
set.seed(100)
We are going to generate an object with skewed data. We often work with severely skewed data in criminology. For example, we saw with the neighbourhood crime data over the last two weeks that the number of crimes is not evenly distributed; instead, the distribution is skewed. To generate this type of data I am going to use the rnbinom() function, which draws values from something called the negative binomial distribution.
skewed <- rnbinom(100000, mu = 1, size = 0.3) #Creates a negative binomial distribution, don't worry too much about the other parameters at this stage, but if curious look at ?rnbinom
We can also get the mean and standard deviation for this object:
mean(skewed)
## [1] 1.00143
sd(skewed)
## [1] 2.083404
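As an optional sanity check, the help page for rnbinom() notes that with this mu/size parameterisation the variance is mu + mu^2/size. With mu = 1 and size = 0.3 we would therefore expect a standard deviation of roughly 2.08, very close to what we just obtained:

1 + 1^2 / 0.3        #theoretical variance: mu + mu^2/size
sqrt(1 + 1^2 / 0.3)  #theoretical standard deviation, approximately 2.08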
And we can also see what it looks like:
library(ggplot2)
qplot(skewed)
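If you are using a recent release of ggplot2 you may see a warning that qplot() is deprecated. A rough equivalent with the standard ggplot() syntax (just a sketch; it first wraps the vector in a data frame) would be:

#Equivalent histogram using ggplot() directly instead of qplot()
ggplot(data.frame(crime = skewed), aes(x = crime)) +
  geom_histogram()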
We are going to pretend this variable measures numbers of crime perpetrated by an individual in the previous year. Let’s see how many offenders we have in this fake population.
sum(skewed > 0)
## [1] 35623
We are now going to put this variable in a dataframe and we are also going to create a new categorical variable identifying whether someone offended over the past year (e.g., anybody with a count of crime higher than 0):
#Let's start by creating a new data frame ("fake_population") with the skewed variable renamed as crime.
fake_population <- data.frame(crime = skewed)
#Then let's define all values above 0 as "Yes" in a variable identifying offenders and everybody else as "No". We use the ifelse() function for this.
fake_population$offender <- ifelse(fake_population$crime > 0, "Yes", "No")
#Let's check the results
table(fake_population$offender)
##
## No Yes
## 64377 35623
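As an optional aside, you can turn these counts into proportions of the whole fake population with prop.table():

#Same frequency table expressed as proportions (about 0.64 and 0.36)
prop.table(table(fake_population$offender))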
We are now going to generate a normally distributed variable. We are going to pretend that this variable measures IQ. We are going to assume that this variable has a mean of 100 in the non-criminal population (pretending there is such a thing) with a standard deviation of 15 and a mean of 92 with a standard deviation of 20 in the criminal population. I am pretty much making up these figures.
#The first expression is asking R to generate random values from a normal distribution with mean 100 and standard deviation 15 for each of the 64377 "non-offenders" in our fake population data frame
fake_population$IQ[fake_population$offender == "No"] <- rnorm(64377, mean = 100, sd = 15)
#And now we are going to artificially create somewhat dumber offenders.
fake_population$IQ[fake_population$offender == "Yes"] <- rnorm(35623, mean = 92, sd = 20)
We can now have a look at the data. Let’s plot the density of IQ for each of the two groups and have a look at the summary statistics.
#This will give us the mean IQ for the whole population
mean(fake_population$IQ)
## [1] 97.19921
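As a quick check of your intuition, this overall mean is (approximately) the weighted average of the two group means we specified, weighted by the proportions of non-offenders and offenders we saw above:

#Weighted average of the two specified group means, using the group proportions
0.64377 * 100 + 0.35623 * 92  #roughly 97.15, close to the overall mean above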
#We will load the plyr package to get the means for IQ for each of the two offending groups
library(plyr)
#We will store these means in a data frame (IQ_means) after getting them with the ddply() function
IQ_means <- ddply(fake_population, "offender", summarise, IQ = mean(IQ))
#You can see the mean value of IQ for each of the two groups; unsurprisingly, they are very close to the values we defined
IQ_means
## offender IQ
## 1 No 99.96347
## 2 Yes 92.20370
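If you prefer the more modern dplyr package (assumed here to be installed; we do not use it elsewhere in this session), the same group means can be obtained as in the sketch below. Note that loading dplyr alongside plyr can produce some masking messages.

#A dplyr alternative to ddply() for getting the group means
library(dplyr)
fake_population %>%
  group_by(offender) %>%
  summarise(IQ = mean(IQ))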
#We are going to create a plot with the density estimate for each of the two groups (first two lines of code) and then I will add a vertical dashed line at the mean (that we saved in IQ_means) for each of the groups
ggplot(fake_population, aes(x = IQ, colour = offender)) +
geom_density() +
geom_vline(aes(xintercept = IQ, colour = offender), data = IQ_means,
linetype = "dashed", size = 1)
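A small note on versions: in ggplot2 3.4.0 and later, the size argument for lines is deprecated in favour of linewidth, so with a recent installation the equivalent call would be:

#Same plot using linewidth instead of size (for ggplot2 >= 3.4.0)
ggplot(fake_population, aes(x = IQ, colour = offender)) +
  geom_density() +
  geom_vline(aes(xintercept = IQ, colour = offender), data = IQ_means,
             linetype = "dashed", linewidth = 1)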