WEEK 6: AN INTRODUCTION TO REGRESSION

Table of contents

Section 6.1: Introduction.
Section 6.2: Motivating regression
Section 6.3: Fitting a simple regression model
Section 6.4: Residuals revisited: R square
Section 6.5: Inference with regression
Section 6.6: Fitting regression with categorical predictors
Section 6.7: Motivating multiple regression
Section 6.8: Fitting and interpreting a multiple regression model
Section 6.9: Presenting your regression results
Section 6.10: Rescaling input variables to assist interpretation
Section 6.11: Testing conditional hypothesis: interactions
Section 6.12: Model building and variable selection

6.1 Introduction

In this session we are going to cover regression analysis or, rather, we are beginning to talk about regression modelling. This form of analysis has been one the main technique of data analysis in the social sciences for many years and it belongs to a family of techniques called generalised linear models. Regression is a flexible model that allows you to “explain” or “predict” a given outcome (Y), variously called your outcome, response or dependent variable, as a function of a number of what is variously called inputs, features or independent, explanatory, or predictive variables (X1, X2, X3, etc.). Following Gelman and Hill (2007), I will try to stick for the most part to the terms outputs and inputs.

Today we will cover something that is called linear regression or ordinary least squares regression (OLS), which is a technique that you use when you are interested in explaining variation in an interval level variable. First we will see how you can use regression analysis when you only have one input and then we will move to situations when we have several explanatory variables or inputs.

We will return to the BCS0708 data we used in previous sessions.

##R in Windows have some problems with https addresses, that's why we need to do this first:
urlfile<-'https://raw.githubusercontent.com/jjmedinaariza/LAWS70821/master/BCS0708.csv'
#We create a data frame object reading the data from the remote .csv file
BCS0708<-read.csv(url(urlfile))

We are going to look at the relationship between fear of violent crime (tcviolent), high scores represent high fear, with a variable measuring perceptions on antisocial behaviour in the local area (tcarea), high scores represent the respondents perceive a good deal of antisocial behaviour¹:

library(ggplot2)
qplot(x = tcviolent, data = BCS0708)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3242 rows containing non-finite values (stat_bin).

qplot(x = tcarea, data = BCS0708)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 677 rows containing non-finite values (stat_bin).

We can see both are skewed. Let’s look at the scatterplot:

ggplot(BCS0708, aes(x = tcarea, y = tcviolent)) +
  geom_point(alpha=.2, position="jitter")

## Warning: Removed 3664 rows containing missing values (geom_point).

What do you think when looking at this scatterplot? Is there a relationship between perceptions of disorder and fear of violent crime? Does it look as if individuals that have a high score on the X axis (perceptions of disorder) also have a high score on the Y axis (fear of violent crime)? It may be a bit hard to see but I would think there is certainly a trend.

6.2 Motivating regression

Now, imagine that we play a game. Imagine I have all the respondents waiting in a room, and I randomly call one of them to the stage. You’re sitting in the audience, and you have to guess the level of fear (‘tcviolent’) for that respondent. Imagine that I pay £150 to the student that gets the closest to the right value. What would you guess if you only have one guess and you knew (as we do) how fear of violent crime is distributed?

ggplot(BCS0708, aes(x = tcviolent)) + 
  geom_density() +
  geom_vline(xintercept = 0.046, linetype = "dashed", size = 1, color="red") +
  geom_vline(xintercept = -0.117, linetype = "dashed", size = 1, color="blue") +
  ggtitle("Density estimate and mean of fear of violent crime")

## Warning: Removed 3242 rows containing non-finite values (stat_density).

summary(BCS0708$tcviolent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -2.350  -0.672  -0.117   0.046   0.540   3.805    3242

If I only had one shot, I would go for the median (given the skew) but the mean would be your second best. Most of the areas here have values clustered around those values, which is another way of saying they are bound to be not too far from them.

Imagine, however, that now when someone is called to the stage, you are told their perception of antisocial behaviour in their neighbourhood - so the value of the tcarea variable for the individual that has been selected (for example 3). Imagine as well that you have the scatterplot that we produced earlier in front of you. Would you still go for the value of “zero” as your best guess for the value of the selected ball (area)?

I certainly would not go with the overall mean or median as my prediction anymore. If somebody said to me, the value ‘tcarea’ for the selected respondent is 3, I would be more inclined to guess the mean value for the individuals with that level of perceptions of antisocial behaviour (the conditional mean), rather than the overall mean across all the individuals. Wouldn’t you?

If we plot the conditional means we can see that the mean of fear of violent crime for individuals that report a value of 3 in tcarea is around 1.4. So you may be better off guessing that.

library(grid)
ggplot() +
  geom_point(data=BCS0708, aes(x = tcarea, y = tcviolent), alpha=.2) +
geom_line(data=BCS0708, aes(x = round(tcarea/0.12)*0.12, y = tcviolent), 
          stat='summary',
          fun.y=mean,
          color="red",
          size=1) +
  annotate("segment", x=2, xend = 3, y = 2, yend= 1.4, color = "blue", size = 2, arrow = arrow()) +
  annotate("text", x = 2, y = 2.2, label = "Pick this one!", size =7, colour = "blue")

## Warning: Removed 3664 rows containing non-finite values (stat_summary).

## Warning: Removed 3664 rows containing missing values (geom_point).

Linear regression tackles this problem using a slightly different approach. Rather than focusing on the conditional mean (smoothed or not), it draws a straight line that tries to capture the trend in the data. If we focus in the region of the scatterplot that are less sparse we see that this is an upward trend, suggesting that as the perceptions of antisocial behaviour increase so do the expressed fear of violent crime.

Simple linear regression draws a single straight line of predicted values as the model for the data. This line would be a model, a simplification of the real world like any other model (e.g., a toy pistol, an architectural drawing, a subway map), that assumes that there is approximately a linear relationship between X and Y. Let’s draw the regression line:

ggplot(data = BCS0708, aes(x = tcarea, y = tcviolent)) +
  geom_point(alpha = .2, position = "jitter") +
  geom_smooth(method = "lm", se = FALSE, color = "red", size = 1) #This ask for a geom with the regression line, method=lm asks for the linear regression line, se=FALSE ask for just the line to be printed, the other arguments specify the color and thickness of the line

## Warning: Removed 3664 rows containing non-finite values (stat_smooth).

## Warning: Removed 3664 rows containing missing values (geom_point).

What that line is doing is giving you guesses (predictions) for the values of fear of violent crime based in the information that we have about the perceptions of antisocial behaviour. It gives you one possible guess for the value of fear for every possible value of perceptions of antisocial behaviour and links them all together in a straight line.

Another way of thinking about this line is as the best possible summary of the cloud of points that are represented in the scatterplot (if we can assume that a straight line would do a good job doing this). If I were to tell you to draw a straight line that best represents this pattern of points the regression line would be the one that best does it (if certain assumptions are met).

The linear model then is a model that takes the form of the equation of a straight line through the data. The line does not go through all the points. In fact, you can see is a slightly less accurate representation than the (smoothed) conditional means:

## Warning: Removed 3664 rows containing non-finite values (stat_smooth).

## Warning: Removed 3664 rows containing non-finite values (stat_summary).

## Warning: Removed 3664 rows containing missing values (geom_point).

As De Veaux et al (2012: 179) highlight: “like all models of the real world, the line will be wrong, wrong in the sense that it can’t match reality exactly. But it can help us understand how the variables are associated”. A map is never a perfect representation of the world, the same happens with statistical models. Yet, as with maps, models can be helpful.

6.3 Fitting a simple regression model

In order to draw a regression line we need to know two things: (1) We need to know where the line begins, what is the value of Y (our dependent variable) when X (our independent variable) is 0, so that we have a point from which to start drawing the line. The technical name for this point is the intercept. (2) And we need to know what is the slope of that line, that is, how inclined the line is, the angle of the line.

If you recall from elementary algebra (and you may not), the equation for any straight line is: y = mx + b In statistics we use a slightly different notation, although the equation remains the same: y = b0 + b1x

We need the origin of the line (b0) and the slope of the line (b1). How does R get the intercept and the slope for the green line? How does R know where to draw this line? We need to estimate these parameters (or coefficients) from the data. How? We don’t have the time to get into these more mathematical details now. You should study the required reading to understand this (required means it is required, it is not optional)². For now, suffice to say that for linear regression modes like the one we cover here, when drawing the line, R tries to minimise the distance from every point in the scatterplot to the regression line using a method called least squares estimation.

In order to fit the model we use the lm() function using the formula specification (Y ~ X). Typically you want to store your regression model in a “variable”, let’s call it fit_1:

fit_1 <- lm(tcviolent ~ tcarea, data = BCS0708)

You will see in your R Studio global environment space that there is a new object called fit_1 with 13 elements on it and about 2.2 Mb of data. We can get a sense for what this object is and includes using the functions we introduced in Week 1:

class(fit_1)

## [1] "lm"

attributes(fit_1)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

R is telling us that this is an object of class lm and that it includes a number of attributes. One of the beauties of R is that you are producing all the results from running the model, putting them in an object, and then giving you the opportunity for using them later on. A lot of the diagnostics that we will cover in Week 8 work with various elements that we stored in this object. If you want to simply see the basic results from running the model you can use the summary() function.

summary(fit_1)

## 
## Call:
## lm(formula = tcviolent ~ tcarea, data = BCS0708)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9336 -0.6354 -0.1317  0.4731  3.9777 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03159    0.01065   2.965  0.00303 ** 
## tcarea       0.30921    0.01081  28.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9536 on 8010 degrees of freedom
##   (3664 observations deleted due to missingness)
## Multiple R-squared:  0.09263,    Adjusted R-squared:  0.09251 
## F-statistic: 817.7 on 1 and 8010 DF,  p-value: < 2.2e-16

Or if you prefer more parsimonious presentation you could use the display() function of the arm package:

library(arm)

## Loading required package: MASS

## Loading required package: Matrix

## Loading required package: lme4

## 
## arm (Version 1.9-1, built: 2016-8-21)

## Working directory is /Users/reka/Desktop/R-for-Criminologists

display(fit_1)

## lm(formula = tcviolent ~ tcarea, data = BCS0708)
##             coef.est coef.se
## (Intercept) 0.03     0.01   
## tcarea      0.31     0.01   
## ---
## n = 8012, k = 2
## residual sd = 0.95, R-Squared = 0.09

detach(package:arm) #This will unload the package

For now I just want you to focus on the numbers in the “Estimate” column. The value of 0.03159 estimated for the intercept is the “predicted” value for Y when X equals zero. This is the predicted value of the fear of crime score when perceptions of antisocial behaviour are zero.

We then need the b1 regression coefficient for for our independent variable, the value that will shape the slope in this scenario. This value is 0.30921. This estimated regression coefficient for our independent variable has a convenient interpretation. When the value is positive, it tells us that for every one unit increase in X there is a b1 increase on Y. If the coefficient is negative then it represents a decrease on Y. Here, we can read it as “for every one unit increase in the perceptions of antisocial behaviour score, there is a 0.31 unit increase in the fear of violent crime score.”

Knowing these two parameters not only allows us to draw the line, we can also solve for any given value of X. Let’s go back to our guess-the-fear-of-crime game. Imagine I tell you the perceptions of antisocial behaviour score is .23. What would be your best bet now? We can simply go back to our regression line equation and insert the estimated parameters:

y = b0 + b1x
y = 0.03 + 0.31 (.23)
y = .1031

Or if you don’t want to do the calculation yourself, you can use the predict function (differences are due to rounding error):

predict(fit_1, data.frame(tcarea = c(.23))) #First you name your stored model and then you identify the new data (which has to be in a data frame format and with a variable name matching the one in the original data set)

##         1 
## 0.1027126

This is the expected value of Y, fear of violent, when X, perceptions of antisocial behaviour is .23 according to our model (according to our simplification of the real world, our simplification of the whole cloud of points into just one straight line). Look back at the scatterplot we produced earlier with the green line. Does it look as if the green line when X is .23 corresponds to a value of Y of .1027?

6.4 Residuals revisited: R squared

In the output above we saw there was something called the residuals. The residuals are the differences between the observed values of Y for each case minus the predicted or expected value of Y, in other words the distances between each point in the dataset and the regression line (see the visual example below).

drawing

You see that we have our line, which is our predicted values, and then we have the black dots which are our actually observed values. The distance between them is essentially the amount by which we were wrong, and all these distances between observed and predicted values are our residuals. Least square estimation essentially aims to reduce the squared average of all these distances: that’s how it draws the line.

Why do we have residuals? Well, think about it. The fact that the line is not a perfect representation of the cloud of points makes sense, doesn’t it? You cannot predict perfectly what the value of Y is for every individual just by looking ONLY at their perceptions of antisocial behaviour! This line only uses information regarding perceptions of antisocial behaviour. This means that there’s bound to be some difference between our predicted level of fear given our knowledge of these perceptions (the regression line) and the actual level of fear (the actual location of the points in the scatterplot). There are other things that matter not being taken into account by our model to predict the values of Y. There are other things that surely matter in terms of understanding fear of crime. And then, of course, we have measurement error and other forms of noise.

We can re-write our equation like this if we want to represent each value of Y (rather than the predicted value of Y) then: y = b0 + b1x + residuals

The residuals capture how much variation is unexplained, how much we still have to learn if we want to understand variation in Y. A good model tries to maximise explained variation and reduce the magnitude of the residuals.

We can use information from the residuals to produce a measure of effect size, of how good our model is in predicting variation in our dependent variables. Remember our game where we try to guess fear of crime (Y)? If we did not have any information about X our best bet for Y would be the mean of Y. The regression line aims to improve that prediction. By knowing the values of X we can build a regression line that aims to get us closer to the actual values of Y (look at the Figure below).

r squared

The distance between the mean (our best guess without any other piece of information) and the observed value of Y is what we call the total variation. The residual is the difference between our predicted value of Y and the observed value of Y. This is what we cannot explain (i.e, variation in Y that is unexplained). The difference between the mean value of Y and the expected value of Y (the value given by our regression line) is how much better we are doing with our prediction by using information about X (i.e., in our previous example it would be variation in Y that can be explained by knowing about perceptions of antisocial behaviour). How much closer the regression line gets us to the observed values. We can then contrast these two different sources of variation (explained and unexplained) to produce a single measure of how good our model is. The formula is as follows:

formula

All this formula is doing is taking a ratio of the explained variation (the squared differences between the regression line and the mean of Y for each observation) by the total variation (the squared differences of the observed values of Y for each observation from the mean of Y). This gives us a measure of the percentage of variation in Y that is “explained” by X. If this sounds familiar is because it is a measure similar to eta squared in ANOVA that we cover in an earlier session.

As then we can take this value as a measure of the strength of our model. If you look at the R output you will see that the R2 for our model was .09 (look at the multiple R square value in the output) . We can say that our model explains 9% of the variance in the fear of violent crime measure.

#As an aside, and to continue emphasising your appreciation of the object oriented nature of R, when we run the summary() function we are simply generating a list object of the class summary.lm.
attributes(summary(fit_1))

## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"    
## 
## $class
## [1] "summary.lm"

#This means that we can access its elements if so we wish. So, for example, to obtain just the R Squared, we could ask for:
summary(fit_1)$r.squared

## [1] 0.09262767

Knowing how to interpret this is important. R^2 ranges from 0 to 1. The greater it is the more powerful our model is, the more explaining we are doing, the better we are able to account for variation in our outcome Y with our input. In other words, the stronger the relationship is between Y and X. As with all the other measures of effect size interpretation is a matter of judgement. You are advised to see what other researchers report in relation to the particular outcome that you may be exploring.

Weisburd and Britt (2009: 437) suggest that in criminal justice you rarely see values for R^2 greater than .40. Thus, if your R^2 is larger than .40, you can assume you have a powerful model. When, on the other hand, R^2 is lower than .15 or .2 the model is likely to be viewed as relatively weak. Our observed r squared here is rather poor. There is considerably room for improvement if we want to develop a better model to explain fear of violent crime³. In any case, many people would argue that R^2 is a bit overrated. You need to be aware of what it measures and the context in which you are using it. Read here for some additional detail.

6.5 Inference with regression

In real applications, we have access to a set of observations from which we can compute the least squares line, but the population regression line is unobserved. So our regression line is one of many that could be estimated. A different sample would produce a different regression line. The same sort of ideas that we introduced when discussing the estimation of sample means or proportions also apply here. if we estimate b0 and b1 from a particular sample, then our estimates won’t be exactly equal to b0 and b1 in the population. But if we could average the estimates obtained over a very large number of data sets, the average of these estimates would equal the coefficients of the regression line in the population.

In the same way that we can compute the standard error when estimating the mean and explained in Week 3, we can compute standard errors for the regression coefficients to quantify our uncertainty about these estimates. These standard errors can in turn be used to produce confidence intervals. This would require us to assume that the residuals are normally distributed. As seen in the image, and for a simple regression model, you are assuming that the values of Y are approximately normally distributed for each level of X:

normalityresiduals

In those circumstances we can trust the confidence intervals that we can draw around the regression line as in the image below:

estimated

The dark-blue line marks the best fit. The two dark-pink lines mark the limits of the confidence interval. The light-pink lines show the sampling distributions around each of the confidence-interval limits (the many regression lines that would result from repeated sampling); notice that the best-fit line falls at the extreme of each sampling distribution.

You can also then perform standard hypothesis test on the coefficients. As we saw before when summarising the model, R will compute the standard errors and a t test for each of the coefficients.

summary(fit_1)$coefficients

##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 0.03159349 0.01065460  2.965243  3.033356e-03
## tcarea      0.30921336 0.01081345 28.595247 2.496128e-171

In our example, we can see that the coefficient for our predictor here is statistically significant⁴.

We can also obtain confidence intervals for the estimated coefficients using the confint() function:

confint(fit_1)

##                 2.5 %     97.5 %
## (Intercept) 0.0107077 0.05247929
## tcarea      0.2880162 0.33041054

6.6 Fitting regression with categorical predictors

So far we have explained regression using a numeric input. It turns out we can also use regression with categorical explanatory variables. It is quite straightforward to run it.

Last week we saw that men and women seem to express different levels of fear of violent crime. We can also explore this relationship using regression and a regression line. This is how you would express the model:

fit_2 <- lm(tcviolent ~ sex, data=BCS0708)

Notice that there is nothing different in how we ask for the model. And see below the regression line:

ggplot(data=BCS0708, aes(x=sex, y=tcviolent)) +
  geom_point(alpha=.2, position="jitter", color="orange") +
  geom_abline(intercept = fit_2$coefficients[1],
    slope = fit_2$coefficients[2], color="red", size=1) +
  theme_bw()

## Warning: Removed 3242 rows containing missing values (geom_point).

Let’s have a look at the results:

summary(fit_2)

## 
## Call:
## lm(formula = tcviolent ~ sex, data = BCS0708)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6785 -0.6266 -0.1288  0.5346  4.0793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.32817    0.01433   22.91   <2e-16 ***
## sexmale     -0.60200    0.02091  -28.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9584 on 8432 degrees of freedom
##   (3242 observations deleted due to missingness)
## Multiple R-squared:  0.08949,    Adjusted R-squared:  0.08938 
## F-statistic: 828.7 on 1 and 8432 DF,  p-value: < 2.2e-16

As you will see the output does not look too different. But notice that in the print out you see how the row with the coefficient and other values for our input variable sex we see that R is printing [sexmale]. What does this mean?

Remember week 5 and t tests? It turns out that a linear regression model with just one dichotomous categorical predictor is just the equivalent of a t test. When you only have one predictor the value of the intercept is the mean value of the reference category and the coefficient for the slope tells you how much higher (if it is positive) or how much lower (if it is negative) is the mean value for the other category in your factor.

The reference category is the one for which R does not print the level next to the name of the variable for which it gives you the regression coefficient. Here we see that the named level is “male”. That’s telling you that the reference category here is “female”. Therefore the Y intercept in this case is the mean value of fear of violent crime for the females, whereas the coefficient for the slope is telling you how much lower the mean value is for the males. Don’t believe me?

mean(BCS0708$tcviolent[BCS0708$sex == "female"], na.rm=TRUE)

## [1] 0.3281656

mean(BCS0708$tcviolent[BCS0708$sex == "male"], na.rm=TRUE) - mean(BCS0708$tcviolent[BCS0708$sex == "female"], na.rm=TRUE)

## [1] -0.6019978

So, to reiterate, for a binary predictor, the coefficient is nothing else than the difference between the mean of the two levels in your factor variable, between the averages in your two groups.

With categorical variables encoded as factors you always have a situation like this: a reference category and then as many additional coefficients as there are additional levels in your categorical variable. Each of these additional categories is input into the model as “dummy” variables. Here our categorical variable has two levels, thus we have only one dummy variable. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable, females in this example, is known as the reference category or the baseline.

It turns out then that the regression table is printing out for us a t test of statistical significance for every input in the model. If we look at the table above this t value is -28.79 and the p value associated with it is near 0. This is indeed considerably lower than the conventional significance level of 0.05. So we could conclude that the probability of obtaining this value if the null hypothesis is true is very low. However, the observed r squared is also kind of poor. Read this to understand a bit more this phenomenon of low p, but also low r-squared.

Imagine, however, that you are looking at the relationship between fear and race (ethgrp2). In our data set this variable has 5 levels that are alphabetically ordered.

levels(BCS0708$ethgrp2)

## [1] "asian or asian british" "black or black british"
## [3] "chinese or other"       "mixed"                 
## [5] "white"

If we were to run a regression model this would result in a model with 4 dummy variables. The coefficient of each of these dummies would be telling you how much higher or lower (if the sign were negative) was the level of fear of crime for each of the racial groups for which you have a dummy compared to your reference category or baseline. One thing that is important to keep in mind is that R by default will use as the baseline category the first level in your factor. Since ethgrp2 has these levels alphabetically ordered that means that your baseline category would be “Asian or Asian British”. If you want to change this you may need to reorder the levels. There are many different ways of doing this but one possibility would be to use the relevel function within your formula call in lm. In a case in which you are exploring racial differences it can make sense to use the racial majority as your baseline. We could then run the following model.

fit_race <- lm(tcviolent ~ relevel(ethgrp2, "white"), data=BCS0708)

6.7 Motivating multiple regression

So we have seen that our models with just one predictor are not terribly powerful. Partly that’s due to the fact that they are not properly specified, they do not include a solid set of good predictor variables that can help us to explain variation in our response variable. We can build better models by expanding the number of predictors (although keep in mind you should also aim to build models as parsimonious as possible).

Another reason why it is important to think about additional variables in your model is to control for spurious correlations (although here you may also want to use your common sense when selecting your variables!). You must have heard before that correlation does not equal causation. Just because two things are associated we cannot assume that one is the cause for the other. Typically we see how the pilots switch the secure the belt button when there is turbulence. These two things are associated, they tend to come together. But the pilots are not causing the turbulences by pressing a switch! The world is full of spurious correlations, associations between two variables that should not be taking too seriously. You can explore a few here. It’s funny.

Looking only at covariation between pair of variables can be misleading. It may lead you to conclude that a relationship is more important than it really is. This is no trivial matter, but one of the most important ones we confront in research and policy⁵.

It’s not an exaggeration to say that most quantitative explanatory research is about trying to control for the presence of confounders, variables that may explain explain away observed associations. Think about any criminology question: Does marriage reduces crime? Or is it that people that get married are different from those that don’t (and are those pre-existing differences that are associated with less crime)? Do gangs lead to more crime? Or is it that young people that join gangs are more likely to be offenders to start with? Are the police being racist when they stop and search more members of ethnic minorities? Or is it that there are other factors (i.e., offending, area of residence, time spent in the street) that, once controlled, would mean there is no ethnic dis-proportionality in stop and searches? Does a particular program reduces crime? Or is the observed change due to something else?

These things also matter for policy. Wilson and Kelling, for example, argued that signs of incivility (or antisocial behaviour) in a community lead to more serious forms of crime later on as people withdraw to the safety of their homes when they see those signs of incivilities and this leads to a reduction in informal mechanisms of social control. All the policies to tackle antisocial behaviour in this country are very much informed by this model and were heavily influenced by broken windows theory.

But is the model right? Sampson and Raudenbush argue it is not entirely correct. They argue, and tried to show, that there are other confounding (poverty, collective efficacy) factors that explain the association of signs of incivility and more serious crime. In other words, the reason why you see antisocial behaviour in the same communities that you see crime is because other structural factors explain both of those outcomes. They also argue that perceptions of antisocial behaviour are not just produced by observed antisocial behaviour but also by stereotypes about social class and race. If you believe them, then the policy implications are that only tackling antisocial behaviour won’t help you to reduce crime (as Wilson and Kelling have argued) . So as you can see this stuff matters for policy not just for theory.

Multiple regression is one way of checking the relevance of competing explanations. You could set up a model where you try to predict crime levels with an indicator of broken windows and an indicator of structural disadvantage. If after controlling for structural disadvantage you see that the regression coefficient for broken windows is still significant you may be into something, particularly if the estimated effect is still large. If, on the other hand, the t test for the regression coefficient of your broken windows variable is no longer significant, then you may be tempted to think that perhaps Sampson and Raudenbush were into something.

6.8 Fitting and interpreting a multiple regression model

It could not be any easier to fit a multiple regression model. You simply modify the formula in the lm() function by adding terms for the additional inputs.

fit_3 <- lm(tcviolent ~ tcarea + sex, data=BCS0708)
summary(fit_3)

## 
## Call:
## lm(formula = tcviolent ~ tcarea + sex, data = BCS0708)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1310 -0.5870 -0.1281  0.4631  4.2935 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.31337    0.01393   22.50   <2e-16 ***
## tcarea       0.31038    0.01027   30.21   <2e-16 ***
## sexmale     -0.59680    0.02027  -29.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9059 on 8009 degrees of freedom
##   (3664 observations deleted due to missingness)
## Multiple R-squared:  0.1812, Adjusted R-squared:  0.181 
## F-statistic: 886.3 on 2 and 8009 DF,  p-value: < 2.2e-16

With more than one input, you need to ask yourself whether all of the regression coefficients are zero. This hypothesis is tested with a F test, similar to the one we described when describing ANOVA (in fact ANOVA is just a special case of regression). Again we are assuming the residuals are normally distributed, though with large samples the F statistics approximates the F distribution.

You see the F test printed at the bottom of the summary output and the associated p value, which in this case is way below the conventional .05 that we use to declare statistical significance and reject the null hypothesis. At least one of our inputs must be related to our response variable.

Notice that the table printed also reports a t test for each of the predictors. These are testing whether each of these predictors is associated with the response variable when adjusting for the other variables in the model. They report the “partial effect of adding that variable to the model” (James et al. 2014: 77). In this case we can see that both variables seem to be significantly associated with the response variable.

If we look at the r squared we can now see that it is higher than before. It actually doubles in size. r squared will always increase as a consequence of adding new variables, even if the new variables added are weakly related to the response variable. But the increase we are observing suggests that adding these two variables results in a substantially improvement to our previous model.

We see that the coefficients for the predictors change very little, it goes up a bit for tcarea and it goes down a bit for sex. But their interpretation now changes. A common interpretation is that now the regression for each variable tells you about changes in Y related to that variable when the other variables in the model are held constant. So, for example, you could say the coefficient for tcarea represents the increase in fear of crime for every one-unit increase in the measure of perceptions of antisocial behaviour when holding all other variables in the model constant (in this case that refers to holding constant sex). But this terminology can be a bit misleading.

Other interpretations are also possible and are more generalizable. Gelman and Hill (2007: p. 34) emphasise what they call the predictive interpretation that considers how “the outcome variable differs, on average, when comparing two groups of units that differ by 1 in the relevant predictor while being identical in all the other predictors”. So if you’re regressing y on u and v, the coefficient of u is the average difference in y per difference in u, comparing pairs of items that differ in u but are identical in v.

So, for example, in this case we could say that comparing respondents who have the same level of perceptions of antisocial behaviour but who differed in whether they are female or male, the model predicts a expected difference of -.59 in their fear of violent crime scores. And that respondents with the same gender, but that differ by one point in the perception of antisocial behaviour, we would expect to see a difference of 0.31 in their fear of violent crime score. So we are interpreting the regression slopes as comparisons of individuals that differ in one predictor while being at the same levels of the other predictors.

As you can see, interpreting regression coefficients can be kind of tricky⁶. The relationship between the response y and any one explanatory variable can change greatly depending on what other explanatory variables are present in the model.

For example, if you contrast this model with the one we run with only sex as a predictor you will notice the intercept has changed. You cannot longer read the intercept as the mean value of fear of violent crime for females. Adding predictors to the model changes their meaning. Now the intercept index the value of fear of violent crime for females that score 0 in tcarea. In this case you have cases that meet this condition (equal zero in all your predictors), but often you may not have any case that does meet the definition of the intercept. More often than not, then, there is not much value in bothering to interpret the intercept.

Something you need to be particularly careful about is to interpret the coefficients in a causal manner. At least your data come from an experiment this is unlikely to be helpful. With observational data regression coefficients should not be read as indexing causal relations. This sort of textbook warning is, however, often neglectfully ignored by professional researchers. Often authors carefully draw sharp distinctions between causal and correlational claims when discussing their data analysis, but then interpret the correlational patterns in a totally causal way in their conclusion section. This is what is called the causation or causal creep. Beware. Don’t do this tempting as it may be.

Comparing the simple models with this more complex model we could say that adjusting for sex does not change much the impact of tcarea in fear and the same can be said about the effect of sex when holding tcarea fixed.

6.9 Presenting your regression results.

Communicating your results in a clear manner is incredibly important. We have seen the tabular results produced by R. If you want to use them in a paper you may need to do some tidying up of those results. There are a number of packages (textreg, stargazer) that automatise that process. They take your lm objects and produce tables that you can put straight away in your reports or papers. One popular trend in presenting results is the coefficient plot as an alternative to the table of regression coefficients. There are various ways of producing coefficient plots with R for a variety of models. See here and here, for example.

We are going to use instead the sjp.lm() function of the sjPlot package, that makes it easier to produce this sort of plots. You can find a more detailed tutorial about this function here. See below for an example:

library(sjPlot)

Let’s try with a more complex example:

fit_4 <- lm(tcviolent ~ tcarea + sex + age + bcsvictim + rural2 + work +educat3, data=BCS0708)
sjp.lm(fit_4, breakLabelsAt = 30)

Visual display of the effects of the variables in the model are particularly helpful. The effects package allows us to produce plots to visualise these relationships (when adjusting for the other variables in the model). Here’s an example going back to our model fit_3 which contained sex and tcarea predictor variables:

library(effects)
plot(allEffects(fit_3), ask=FALSE)

Notice that the line has a confidence interval drawn around it (to reflect the likely impact of sampling variation) and that the predicted means for males and females (when controlling for tcarea) also have a confidence interval.

6.10 Rescaling input variables to assist interpretation

The interpretation or regression coefficients is sensitive to the scale of measurement of the predictors. This means one cannot compare the magnitude of the coefficients to compare the relevance of variables to predict the response variable. Let’s look at the more recent model, how can we tell what predictors have a stronger effect?

summary(fit_4)

## 
## Call:
## lm(formula = tcviolent ~ tcarea + sex + age + bcsvictim + rural2 + 
##     work + educat3, data = BCS0708)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0551 -0.5840 -0.1114  0.4683  4.1141 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.4867262  0.0538716   9.035  < 2e-16 ***
## tcarea                    0.2945885  0.0108428  27.169  < 2e-16 ***
## sexmale                  -0.5772651  0.0205819 -28.047  < 2e-16 ***
## age                      -0.0045506  0.0007504  -6.064 1.39e-09 ***
## bcsvictimvictim of crime -0.1008647  0.0256423  -3.934 8.44e-05 ***
## rural2urban               0.1483361  0.0228612   6.489 9.18e-11 ***
## workyes                  -0.0929336  0.0244166  -3.806 0.000142 ***
## educat3degree or diploma -0.0849667  0.0289187  -2.938 0.003312 ** 
## educat3none               0.1390824  0.0333705   4.168 3.11e-05 ***
## educat3o level/gcse       0.0333842  0.0322756   1.034 0.301005    
## educat3other              0.1652819  0.0568549   2.907 0.003658 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8961 on 7965 degrees of freedom
##   (3700 observations deleted due to missingness)
## Multiple R-squared:  0.1978, Adjusted R-squared:  0.1968 
## F-statistic: 196.4 on 10 and 7965 DF,  p-value: < 2.2e-16

We just cannot. One way of dealing with this is by rescaling the input variables. A common method involves subtracting the mean and dividing by the standard deviation of each numerical input. The coefficients in these models is the expected difference in the response variable, comparing units that differ by one standard deviation in the predictor while adjusting for other predictors in the model.

Instead, Gelman (2008) has proposed dividing each numeric variables by two times its standard deviation, so that the generic comparison is with inputs equal to plus/minus one standard deviation. As Gelman explains the resulting coefficients are then comparable to untransformed binary predictors. The implementation of this approach in the arm package subtract the mean of each binary input while it subtract the mean and divide by two standard deviations for every numeric input.

The way we would obtain these rescaled inputs uses the standardize() function of the arm package, that takes as an argument the name of the stored fit model.

library(arm)

## 
## arm (Version 1.9-1, built: 2016-8-21)

## Working directory is /Users/reka/Desktop/R-for-Criminologists

standardize(fit_4)

## 
## Call:
## lm(formula = tcviolent ~ z.tcarea + c.sex + z.age + c.bcsvictim + 
##     c.rural2 + c.work + educat3, data = BCS0708)
## 
## Coefficients:
##              (Intercept)                  z.tcarea  
##                  0.04333                   0.59511  
##                    c.sex                     z.age  
##                 -0.57727                  -0.16873  
##              c.bcsvictim                  c.rural2  
##                 -0.10086                   0.14834  
##                   c.work  educat3degree or diploma  
##                 -0.09293                  -0.08497  
##              educat3none       educat3o level/gcse  
##                  0.13908                   0.03338  
##             educat3other  
##                  0.16528

Notice the main change affects the numerical predictors. The unstandardised coefficients are influenced by the degree of variability in your predictors, which means that typically they will be larger for your binary inputs. With unstandardised coefficients you are comparing complete change in one variable (whether one is a female or not) with one-unit changes in your numerical variable, which may not amount to much change. So, by putting in a comparable scale, you avoid this problem.

Standardising in the way described here will help you to make fairer comparisons. This standardised coefficients are comparable in a way that the unstandardised coefficients are not. We can now see what inputs have a comparatively stronger effect. It is very important to realise, though, that one should not compare standardised coefficients across different models.

6.11 Testing conditional hypothesis: interactions

In the social sciences there is a great interest in what are called conditional hypothesis or interactions. Many of our theories do not assume simply additive effects but multiplicative effects.For example, Wikstrom and his colleagues (2011) suggest that the threat of punishment only affects the probability of involvement on crime for those that have a propensity to offend but are largely irrelevant for people who do not have this propensity. Or you may think that a particular crime prevention programme may work in some environments but not in others. The interest in this kind of conditional hypothesis is growing.

One of the assumptions of the regression model is that the relationship between the response variable and your predictors is additive. That is, if you have two predictors x1 and x2. Regression assumes that the effect of x1 on y is the same at all levels of x2. If that is not the case, you are then violating one of the assumptions of regression. This is in fact one of the most important assumptions of regression, even if researchers often overlook it.

One way of extending our model to accommodate for interaction effects is to add additional terms to our model, a third predictor x3, where x3 is simply the product of multiplying x1 by x2. Notice we keep a term for each of the main effects (the original predictors) as well as a new term for the interaction effect. “Analysts should include all constitutive terms when specifying multiplicative interaction models except in very rare circumstances” (Brambor et al., 2006: 66).

How do we do this in R? One way is to use the following notation in the formula argument. Notice how we have added a third term tcarea:sex, which is asking R to test the conditional hypothesis that the perceptions of antisocial behaviour may have a different impact on fear of violent crime for men and women.

fit_5 <- lm(tcviolent ~ tcarea + sex + tcarea:sex , data=BCS0708)
# which is equivalent to: 
# fit_5 <- lm(tcviolent ~ tcarea * sex , data=BCS0708)
summary(fit_5)

## 
## Call:
## lm(formula = tcviolent ~ tcarea + sex + tcarea:sex, data = BCS0708)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1800 -0.5890 -0.1279  0.4637  4.2827 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.31322    0.01393  22.485   <2e-16 ***
## tcarea          0.32339    0.01384  23.370   <2e-16 ***
## sexmale        -0.59635    0.02027 -29.414   <2e-16 ***
## tcarea:sexmale -0.02896    0.02065  -1.402    0.161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9058 on 8008 degrees of freedom
##   (3664 observations deleted due to missingness)
## Multiple R-squared:  0.1814, Adjusted R-squared:  0.1811 
## F-statistic: 591.6 on 3 and 8008 DF,  p-value: < 2.2e-16

You see here that essentially you have only two inputs (perceptions of the area and sex) but several regression coefficients. Gelman and Hill (2007) suggest reserving the term input for the variables encoding the information and to use the term predictor to refer to each of the terms in the model. So here we have two inputs and four predictors (one for the constant term, one for sex, another for perceptions, and a final one for the interaction effect).

In this case the test for the interaction effect is non-significant, which suggests there isn’t such an interaction. The R squared barely changes. Let’s visualise the results with the effects package:

plot(allEffects(fit_5), ask=FALSE)

Notice that essentially what we are doing is running two regression lines and testing whether the slope is different for the two groups. The intercept is different, we know that females are more fearful, but what we are testing here is whether the level of fear goes up in a steeper fashion (and in the same direction) for one or the other group as the perceptions of antisocial behaviour go up. We see that’s not the case here. The estimated lines are almost parallel.

A word of warning, the moment you introduce an interaction effect the meaning of the coefficients for the other predictors changes (what it is often referred as to the “main effects” as opposed to the interaction effect). You cannot retain the interpretation we introduced earlier. Now, for example, the coefficient for the sex variable relates the marginal effect of this variable when tcarea equals zero. The typical table of results helps you to understand whether the effects are significant but offers little of interest that will help you to meaningfully interpret what the effects are. For this is better you use some of the graphical displays we have covered.

Essentially what happens is that the regression coefficients that get printed are interpretable only for certain groups. So now:

The intercept still represents the predicted score of fear of violent crime for female respondents and have a score of 0 in the perception of antisocial behaviour (as before).
The coefficient of sex: male now can be thought of as the difference between the predicted score of fear of violent crime for females that have a score of 0 in perceptions of antisocial behaviour and males that have a score of 0 in perceptions of antisocial behaviour.
The coefficient of tcarea now becomes the comparison of mean fear of violent crime scores for females who differ by one point in perceptions of antisocial behaviour.
The coefficient for the interaction term represents the difference in the slope for tcarea comparing female and male respondents, the difference in the slope of the two lines that we visualised above.

Models with interaction terms are too often misinterpreted. I strongly recommend you read this piece by Brambor et al (2005) to understand some of the issues involved. When discussing logistic regression we will return to this and will consider tricks to ease the interpretation.

Equally, John Fox (2003) piece on the effects package goes to much more detail that we can here to explain the logic and some of the options that are available when producing plots to show interactions with this package.

6.12 Model building and variable selection

How do you construct a good model? This partly depends on your goal, although there are commonalities. You do want to start with theory as a way to select your predictors and when specifying the nature of the relationship to your response variable (e.g., additive, multiplicative). Gelman and Hill (2007) provide a series of general principles⁷. I would like to emphasise at this stage two of them:

Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome.
For inputs with large effects, consider including their interactions as well.

It is often the case that for any model, the response variable is only related to a subset of the predictors. There are some scenarios where you may be interested in understanding what is the best subset of predictors. Imagine that you want to develop a risk assessment tool to be used by police officers that respond to a domestic violence incident, so that you could use this tool for forecasting the future risk of violence. There is a cost to adding too many predictors. A police officer time should not be wasted gathering information on predictors that are not associated with future risk. So you may want to identify the predictors that will help in this process.

Ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of the predictors. There are various statistics that help in making comparisons across models. Unfortunately, as the number of potentially relevant predictors increases the number of potential models to compare increases exponentially. So you need methods that help you in this process. There are a number of tools that you can use for variable selection but this goes beyond the aims of this introduction. If you are interested you may want to read this.

CAUTION! These variables were created by ESDS to aid in teaching. They are the result of running factor analysis in a number of categorical measures to create scalar variables. For any other purposes you should aim to use the original BCS data.↩
This is a fine chapter too if you struggle with the explanations in the required reading. Many universities, like the University of Manchester, have full access to Springer ebooks. You can also have a look at these notes.↩
This is a reasonable explanation of how to interpret R-Squared.↩
This blog post provides a nice animation of the confidence interval and hypothesis testing.↩
This is a nice illustration of the Simpon’s Paradox, a well known example of ommitted variable bias.↩
I recommend reading chapter 13 “Woes of regression coefficients” of an old book Mostseller and Tukey (1977) Data Analysis and Regression. Reading: Addison-Wesley Publishing.↩
Look at this too.↩