#Intro day R data tasks
To get comfortable with using R for playing around with data, let’s get some actual crime data, and have a look at it.
Create a folder to save your data and our outputs in. In R, this is known as a working directory. You can name this anything. In this case, I created a folder called ‘demo’ inside the folder that has the course notes. I now have to tell R about this folder. You can do this a few ways.
Click on Session > Set working directory > Choose directory…
Then navigate to the folder you want to use, open it, and click on ‘Open’.
The command to set working directory is setwd()
. Inside the brackets you need to write the path to your folder, in quotation marks. So for me this is:
setwd("~/Desktop/R-for-Criminologists/demo")
Whichever way you choose, once you have done this you can save all data in this folder, and read them in from here. Also any outputs like plots and code get saved here as well.
We can play around with police recorded crime data, which can be downloaded from the police.uk website.
Let’s stick local and download some data for crime in Manchester.
To do this, open the data.police.uk/data website.
Date range
just select June 2016 - June 2016Force
find Greater Manchester Police
, and tick the box next to it.Data sets
tick Include crime data
.Generate File
button.This will take you to a download page, where you have to click the Download now
button. This will open a dialogue to save a .zip file. Navigate to the working directory folder you’ve readet and save it there. Unzip the file.
R reads in data in the form of dataframes. This is essentially the same as a spreadsheet. You have columnds that represent your variables. And rows that reporesent your observations
So for example, in this data set each row is one crime record.
And this column for example is the month variable. Each observation (crime) will have a value for this variable (the month that it was recorded).
So that is essentially a dataframe. So let’s read in the downloaded and unzipped crime data into a dataframe.
To do so, you need to create a dataframe object, that you map the data onto. So you need to give it a name, and then rrad the file into it. This name is what you will use to call the dataframe. Here I simply call it crime
.
To read in the .csv file, which is the format we just downloaded, the command is read.csv
.
Again there are two ways to read in the data, if you want to open a window where you can manually navigate and open the file, you can pass file.choose()
argument to the read.csv() function.
crimes <- read.csv(file.choose())
Or, if you know the path to your file, you can hardcode it in there, within quotation marks:
crimes <- read.csv("2016-06/2016-06-greater-manchester-street.csv")
You might notice that crimes
has appeared in your work environment window. It will tell you how many observations (rows - and incidentally the number of recorded crimes in June 2016 within the GMP jurisdiction) and how many variables (columns) your data has.
As seen earlier, we can just have a look at our data using the View()
function.
View(crimes)
What are the variables? What are some questions that you might ask that you could answer with this data?
A1: You can check this a few ways (eg: look in your environment, or the bottom of the data view) but I like explicitly asking R to tell me things, so I will ask it to count the Number of ROWs with the nrow()
function.
nrow(crimes)
## [1] 30671
There were a total of 30,671 crimes recorded in Greater Manchester in June 2016.
A2: You can use the function unique()
to get R to list all the unique character values any of your variables can take.
unique(crimes$Crime.type)
## [1] Anti-social behaviour Violence and sexual offences
## [3] Other crime Criminal damage and arson
## [5] Other theft Vehicle crime
## [7] Burglary Bicycle theft
## [9] Shoplifting Drugs
## [11] Public order Robbery
## [13] Theft from the person Possession of weapons
## 14 Levels: Anti-social behaviour Bicycle theft ... Violence and sexual offences
The types of crime recorded in this data set are: Anti-social behaviour, Violence and sexual offences, Other crime, Criminal damage and arson, Other theft, Vehicle crime, Burglary, Bicycle theft, Shoplifting, Drugs, Public order, Robbery, Theft from the person, Possession of weapons
#####Q3: Which crime type had the most crimes recorded?
A3: To answer this question, you could make a frequency table using the table()
command. A frequency table just counts the number of times that each crime type appears in the dataframe. Since each row is one crime, each time the crime type appears, one more instance of it can be counted. So a frequency table returns the frequency with which we see each crime type in our data.
Step 1: create table
frequencyTableOfCrimeTypes <- table(crimes$Crime.type)
frequencyTableOfCrimeTypes
##
## Anti-social behaviour Bicycle theft
## 11360 424
## Burglary Criminal damage and arson
## 2184 2790
## Drugs Other crime
## 392 340
## Other theft Possession of weapons
## 2271 117
## Public order Robbery
## 1450 276
## Shoplifting Theft from the person
## 1426 458
## Vehicle crime Violence and sexual offences
## 1910 5273
Since there are not too many crime types we can probably eyeball which crime type has the highest number of appearances in the data (frequency). However, I am lazy, so let’s get R to tell me the answer instead, by sorting the table.
You can use the sort()
command to then arrange this table. You can set the parameter of decreasing
to True or to False, depending on how you want to sort the data.
Step 2: sort table
sort(frequencyTableOfCrimeTypes, decreasing=T)
##
## Anti-social behaviour Violence and sexual offences
## 11360 5273
## Criminal damage and arson Other theft
## 2790 2271
## Burglary Vehicle crime
## 2184 1910
## Public order Shoplifting
## 1450 1426
## Theft from the person Bicycle theft
## 458 424
## Drugs Other crime
## 392 340
## Robbery Possession of weapons
## 276 117
The most common crime recorded is anti-social behaviour.
A4: One of the variables in the dataset is labelled LSOA.name
. LSOA stands for Lower Level Super Output Area, and it is a geographical region that the Greater Manchester region (and all of the UK actually) is broken into. These areas are often used as proxys for identifying neighbourhoods. More on LSOAs can be found here.
One interesting question would be to compare crimes between these neighbourhoods. Let’s just look at total number of crimes in each neighbourhood.
Above with Q3 we talked about how we can make a quick frequency table. This table tells us how many times each LSOA name appears in the data. As every row in the crimes
dataframe is a crime incident, every time the LSOA name appears, it is a crime incident that happened in that neighbourhood. Therefore if we count the number of times each LSOA name appears in the list of crimes, we then find out how many crimes occurred there in June 2016.
But of course we want R to count for us, so as before, to create a table, we use the table()
funciton. Simples.
crimesPerLsoa <- table(crimes$LSOA.name)
At this moment, this is a table. There is less scope for manipulation with tables than with dataframes in R, so let’s turn it into a dataframe. You might be noticing that R commands are pretty much like telling the computer what you want it to do in English. So the command to turn something into a dataframe is to.data.frame
.
crimesPerLsoaDf <- as.data.frame(crimesPerLsoa)
Let’s have a look at this new dataframe.
View(crimesPerLsoaDf)
With viewing the dataframe, we also see that the column titles are Var1
and Freq
. This is because we created a FREQuency table of the neighbourhood VARiable. Basically variable here is neighbourhod, and frequency is the number of crimes that were recorded in each. If we want to, for clarity we can rename these. Don’t worry too much about the syntax at this point.
colnames(crimesPerLsoaDf)[colnames(crimesPerLsoaDf)=="Var1"] <- "neighbourhood"
colnames(crimesPerLsoaDf)[colnames(crimesPerLsoaDf)=="Freq"] <- "numCrimes"
What are the variables in the new dataframe?
What are the observations (ie what does each row represent?)
So one of the variables is the number of crimes, and each row is a neighbourhood. In order to get the average number of crimes in each neighbourhood we calculate the mean. Mean means average in statistics language. Same way you calculate the average, you can calculate the mean by adding up all the values, and then dividing by the total number of observations.
sum(crimesPerLsoaDf$numCrimes)/nrow(crimesPerLsoaDf)
## [1] 18.35488
But of course, we are lazy, and want to make the most of R’s capabilities, so we can just use the … mean()
function!
mean(crimesPerLsoaDf$numCrimes)
## [1] 18.35488
There are an average of 18 crimes per neighbourhood across Greater Manchester.
In your groups come up with three more questions you could ask this data set and do your best to answer them. Use the provided R markdown template to create a html document for submission.