7.1 Introduction

So far we have discussed the practicalities of fitting and interpreting regression models, but in practical applications you want to check your model first and proceed from there. There is not much point spending time interpreting your model until you know that it reasonably fits your data.

In previous sessions we talked about the assumptions made by various statistical tests. The regression model also makes assumptions of its own. In fact, there are so many that we will spend an entire session discussing them. We follow, for the most part, Gelman and Hill's (2007) discussion of these assumptions. These authors point out that the most important of these assumptions, in decreasing order of importance, begin with validity:

“Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to all cases to which it will be applied… Data used in empirical research rarely meet all (if any) of these criteria precisely. However, keeping these goals in mind can help you be precise about the types of questions you can and cannot answer reliably”

Apart from this, it is convenient to diagnose multicollinearity (which affects interpretation) and influential observations.
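Multicollinearity is commonly assessed with variance inflation factors. As a minimal sketch, the vif() function from the car package can be applied to a fitted model (here a toy model on R's built-in mtcars data, not our survey data):

```r
# Sketch: variance inflation factors for a toy model on built-in data.
# Values well above 5 or 10 are usually taken to signal multicollinearity.
library(car)
toy_fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vif(toy_fit)
```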

So these are the assumptions of linear regression. Today we will go through how to test for them, and also what options you can consider if you find that your model violates them. Finding that some of the assumptions are violated does not necessarily mean that you have to scrap your model, but it is important to use these diagnostics to show that you have considered the possible issues with your model, and to address any serious issues that you find.

We’ll be using the British Crime Survey data from 2007/8 again, so read it in (you might already have it saved from earlier sessions):

## R on Windows has some problems with https addresses, which is why we need to do this first.
## We create a data frame object by reading the data from the remote .csv file.
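As a sketch of the read-in step (the URL below is a placeholder; substitute the address used in the earlier sessions):

```r
# Placeholder URL: replace with the actual address from earlier sessions.
urlfile <- "https://example.com/BCS0708.csv"
BCS0708 <- read.csv(url(urlfile))
```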

7.2 Plotting residuals

Many of the assumptions can be tested first by having a look at your residuals. Remember, the residuals are the ‘error’ in your model. In previous weeks we defined the ordinary residuals as the difference between the observed and the predicted values, the distance between the points in your scatterplot and the regression line. Apart from the ordinary residuals, most software computes other closely related variants: the standardised, the studentised, and the Pearson residuals.

The raw residuals are just the difference between the observed and the predicted values; the other three are ways of normalising this measure, so you can compare what is large, what is small, and so on. For example, with the standardised residuals you essentially calculate z scores: given a normal distribution of the standardised residuals, the mean is 0 and the standard deviation is 1. Pearson residuals are raw residuals divided by the standard error of the observed value. Studentised residuals (also called standardised Pearson residuals) are raw residuals divided by their standard error. You can read more about these here.
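As an illustrative sketch of how these variants can be extracted in R (using a toy model on the built-in mtcars data, not our survey model):

```r
# Sketch: extracting different residual types from a fitted lm model.
m <- lm(mpg ~ wt, data = mtcars)
raw_res <- residuals(m)   # ordinary residuals: observed minus fitted values
std_res <- rstandard(m)   # standardised residuals
stu_res <- rstudent(m)    # studentised residuals
summary(std_res)          # roughly centred on 0, spread of about 1
```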

Plotting these residuals versus the fitted values and versus each of the predictors is the most basic way of diagnosing problems with your regression model. However, as Fox and Weisberg (2011) emphasise, this “is useful for revealing problems but less useful for determining the exact nature of the problem” and consequently one needs “other diagnostic graphs to suggest improvements to the model”.

In the previous session we fitted the model tcviolent ~ tcarea + sex. This was our fit_3 model during that session. You may have to run the model again if you do not have it in your global environment.

To obtain the basic residual plots for this model we use the residualPlots() function of the car package.

fit_3 <- lm(tcviolent ~ tcarea + sex, data=BCS0708)
library(car)
residualPlots(fit_3)

##            Test stat Pr(>|t|)
## tcarea         6.257        0
## sex               NA       NA
## Tukey test     4.607        0

This function will produce plots of the Pearson residuals versus each of the predictors in the model and versus the fitted values.
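If you want just the residuals-versus-fitted scatterplot without the car package, a rough equivalent can be sketched with base R graphics (using the ordinary rather than the Pearson residuals):

```r
# Sketch: basic residuals vs fitted plot with base R graphics.
# Assumes fit_3 has been estimated as above.
plot(fitted(fit_3), residuals(fit_3),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```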

Residuals vs predicted values

The most important of these is the last one, the scatterplot of the Pearson residuals versus the fitted values. In these plots one has to pay particular attention to nonlinear trends, trends in variation across the graph, and also isolated points. Ideally a plot of the residuals should show that:

    1. they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot
    2. they’re clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150)
    3. in general there aren’t clear patterns

For example, this is a good-looking scatterplot of residuals vs fitted values:
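To see for yourself what such a plot should look like, you can simulate data that satisfies the assumptions and plot the residuals (a sketch with made-up data):

```r
# Simulated example: a model whose errors are independent, homoscedastic
# and normal, so the residual plot shows no pattern.
set.seed(42)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, mean = 0, sd = 1)
good_fit <- lm(y ~ x)
plot(fitted(good_fit), residuals(good_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```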