
1.1 Introduction

This course provides an introduction to data analysis using R.

These handouts contain various hyperlinks to videos that explain or expand on some of the ideas and concepts introduced. It will be handy to have speakers or headphones so you can listen to the videos.

1.2 Interacting with R and creating objects

Unlike other programs for data analysis you may have used in the past (Excel, SPSS), R requires you to interact with it by writing instructions and asking it to evaluate them. R is an interpreted programming language.

In these handouts you will see bits of these instructions (I will often refer to this as “code”) inside greyed boxes, and the result of executing those instructions in white boxes. If you are looking at this document and have an open session of R, you should be able to reproduce the results by copying the code and pasting it into your R console (or script).

We’ll be using R Studio in these sessions. As we practised in the intro session, you type your code in the R Script window, which allows you to save and re-run your code later. This way you can build up a “cookbook” of the various bits of code that you have written.

In this session we will interpret the code a bit more, in order to get you moving beyond copy and pasting code, and actually writing some of your own.

So let’s have a look at some basic concepts.


R will always treat numbers as numbers. This sounds straightforward, but it is important to note because, as discussed last week, we can name our variables almost anything. EXCEPT they cannot be numbers. Numbers are protected by R. 1 will always mean 1.

If you want, give it a try. Try to create a variable called 12 and assign it the value “twelve”. As we did last week, we can assign something a value by using the “<-” operator.

12 <- "twelve"
## Error in 12 <- "twelve": invalid (do_set) left-hand side to assignment

You get an error!

12 remains 12. This means that with numbers, you can use R as a calculator. Give it a go.

3 + 5
## [1] 8


7 * 4
## [1] 28


Most of the time, though, you will be using objects, not bare numbers (R is more than just a fancy calculator). Last week we created many objects. We created a data frame object, which was a table of crimes recorded by GMP in June. We can also create an object that holds only one bit of text, or a list of numbers. Objects have a name and a value. The name is how you summon them, and the value is what they represent.

Technically R is an object-oriented language. Object-oriented programming (OOP) is a programming language model organised around objects and data rather than actions and logic.

To create an object, you have to give it a name, and then use the assignment operator (the <- symbol) to assign it some value.

For example, if we want to create an object that we name “x”, and we want it to represent the value 5, we write:

x <- 5

We are simply telling R to create a numeric object, called x, with one element (5) or of length 1. It is numeric because we are putting a number inside this object. It may help you at this stage to think of objects as boxes, things where you store stuff and the assignment operator as the tool you use to tell R what goes inside.
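A quick illustration of the box metaphor (using a throwaway object called my_box, invented just for this example): assigning to the same name again simply replaces what was in the box:

```r
my_box <- 5  #put 5 inside the box called my_box
my_box <- 10 #assigning again replaces the old contents
my_box       #auto-print: the box now holds 10
## [1] 10
```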

You can see the content of the object x by auto-printing: simply type its name in the console:

x
## [1] 5

When writing expressions in R it is very important that you understand that R is case sensitive. This can drive you nuts if you are not careful. More often than not, if you write an expression asking R to do something and R returns an error message, chances are that you have used lower case when upper case was needed (or vice versa). So always check the spelling. For example, see what happens if I use a capital ‘X’:

X
## Error in eval(expr, envir, enclos): object 'X' not found

R is telling us that X does not exist: there isn’t an object X (upper case), but there is an object x (lower case). When you get an error message or implausible results, you want to look back at your code to figure out what the problem is. This process is called debugging. There are systematic ways of writing code that facilitate debugging, but we won’t get into them here. Very often the solution will simply involve correcting the spelling.


As an alternative to typing the name of the object to get its value in the console, you can use a function. R uses functions to perform operations. Everything you do in R is the result of running a function. You can think of functions as preprogrammed routines that ask R to do a particular thing. Here we can use the print() function to see what is inside this object.

print(x)
## [1] 5

A function is a bit of code that does something with the objects you pass to it. These are called arguments: the thing that you pass into the function is its argument. You pass arguments into functions by placing them inside the brackets (). We won’t be writing functions in this course, but just to illustrate, I’ll put a basic one here. I call it doubleThisNumber(), and I say that it will receive one argument in the brackets (call it x), and when it does, it will take x and multiply it by two. That is what it will return.

doubleThisNumber <- function(x){
  return(x * 2)
}

Now I can pass any number (or object that is numeric) to this function and it will take that, double it, and return it.

doubleThisNumber(4)
## [1] 8

Generally functions are more useful than this. Other functions we saw last week were View(), which let us view a dataframe, and plot(), which created a basic x,y plot for us. We’ll be using loads of functions. Think of them as a mechanism for taking your objects and doing something with them. To do so, you need the name of the function, and then you put your object in the brackets which follow that function.

You can often pass parameters into functions, as well as the argument (the object you want it to run the function on). For example, last week we saw sort(). You can create a list in R, and use the sort() function to put it in order.

To demonstrate:

Step 1 create list:

listOfSomeNumbers <- c(2, 5, 23, 1, 7, 56, 109, 33, 21)

Step 2 sort list:

sort(listOfSomeNumbers)
## [1]   1   2   5   7  21  23  33  56 109

But what if I want to sort it from largest to smallest (in decreasing order)?

Well, the sort function allows you to specify this, by passing it the parameter decreasing = TRUE.

sort(listOfSomeNumbers, decreasing = TRUE)
## [1] 109  56  33  23  21   7   5   2   1

To find out what parameters you can pass a function besides the object, you can use the help function. As discussed last week, to call the help on a function, you just put a question mark in front of its name:

?sort

And the details will appear in your help/plot window of R Studio.


We also discussed last week the use of packages. Packages are bundles of code that someone else has written, and uploaded to a central repository (called CRAN) so that anyone can download and use them. These packages have lots and lots of functions in them, which we can use, and also sometimes some data sets.

Different packages are used for different things we want to achieve. For example, last week we used the rmarkdown package, in order to create a markdown document.

Packages need to be installed only once, but they need to be loaded in every R session in which you want to use them.

To download the package you use the install.packages() function. So to download the rmarkdown package, for example, we used:

install.packages("rmarkdown")
You only ever have to do this once on your computer. If you quit R and then start it up again a week later, the package should still be there.

Loading it on the other hand, you have to do every time you start a new R session. That just means if you close R Studio today, and open it back up in a week, and you want to run a function that comes from a package, you need to load that package into your current session first.

You do this by using the library() function. So to load rmarkdown into your session, you need to run:

library(rmarkdown)
To see what packages you currently have loaded in your session, you use the search() function (you do not need to pass it any objects in this case):

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics"
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

To find out more about packages see here.


We’ve also spoken a bit about comments in your code. As discussed, you should save the bits of code that you write and compile them into your own personal R cookbook. However, once you have lots of bits of code and some time has elapsed, you might forget what some of them do.

Similarly, if you want to share your code with someone, comments make it easier for them to see what you are doing with that code and to understand it better.

To create a comment you use the hashtag/ number sign # followed by some text. Whenever the R engine sees the number sign it knows that what follows is not code to be executed. You can use this sign to include annotations when you are coding. These annotations are a helpful reminder to yourself (and others reading your code) of what the code is doing and (even more important) why you are doing it.

It is good practice to use annotations often. You can use these annotations in your code to explain your reasoning and to create “scannable” headings in your code. That way, after you save your script, you will be able to share it with others or return to it at a later point and understand what you were doing when you first created it -see here for further details on annotations and on how to save a script when working with the basic R interface.

So for example, if I wanted someone to be able to use my function I wrote earlier, I could write:

#this function will return the double of any number passed to it
#to call the function type doubleThisNumber(), and inside the brackets enter a number
doubleThisNumber <- function(x){
  return(x * 2)
}

You need one # per line, and anything after it on that line is a comment that is not executed by R. You can use spaces after the # (it’s not like a hashtag on Twitter), but you do need a # at the start of every line of comments.
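To illustrate these rules with a throwaway snippet (the object name total is made up for this example):

```r
#This whole line is a comment and is not executed
#A second comment line needs its own # sign
total <- 3 + 5 #a comment can also follow code on the same line
total
## [1] 8
```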

In sum, everything that exists in R is an object, and everything you do in R is the result of running a function. In fact, the numbers we used in the sum operation earlier are numeric objects (R does not use scalars as some other programming languages do), and the sum operator is in fact a function for summation. If you don’t believe it, try executing the following:

`+`(3, 5)
## [1] 8
1.3 Vectors, factors and data frames


Most commonly, when you use variables in R, you create vectors. What is a vector? An atomic vector is simply a set of elements of the same class (typically: character, numeric, integer, or logical -as in True/False). It is the basic data structure in R. Typically you will use the c() function (c stands for concatenate) to create vectors.

The code below exemplifies how to create vectors of different classes (numeric, logical, etc.). Notice how the listed elements (to simplify there are two elements in each vector below) are separated by commas:

my_1st_vector <- c(0.5, 0.6) #creates a numeric vector with two elements
my_2nd_vector <- c(1L, 2L) #creates an integer vector
my_3rd_vector <- c(TRUE, FALSE) #creates a logical vector
my_4th_vector <- c(T, F) #creates a logical vector using abbreviations of TRUE and FALSE, but you should avoid this formulation and instead use the full words
my_5th_vector <- c("a", "b", "c") #creates a character vector
my_6th_vector <- c(1+0i, 2+4i) #creates a complex vector (we won't really use this class)

The beauty of an object-oriented statistical language is that once you have these objects you can use them as inputs in functions, use them in operations, or use them to create other objects. This makes R very flexible:

class(my_1st_vector) #to figure out the class of the vector
## [1] "numeric"
length(my_1st_vector) #to figure out the length of the vector
## [1] 2
my_1st_vector + 2 #Add a constant to each element of the vector
## [1] 2.5 2.6
my_7th_vector <- my_1st_vector + 1 #Create a new vector that contains the elements of my1stvector plus a constant of 1
my_1st_vector + my_7th_vector #Adds the two vectors and auto-print the results (note how the sum was done)
## [1] 2.0 2.2

When you create objects you place them in your working memory or workspace. Each R session will be associated with a workspace (curiously called the “global environment”). Think of it as a warehouse where you are placing a bunch of boxes (your objects). R works using your RAM. That’s where all these objects get stored, which means you need good RAM for very large datasets. In R Studio you can visualise the objects you have created during a session in the Global Environment screen. But if you want to produce a list of what’s there you can use the ls() function (the results you get may differ from the ones below depending on what you actually have in your global environment).

ls() #list all objects in your global environment
##  [1] "doubleThisNumber"  "listOfSomeNumbers" "my_1st_vector"    
##  [4] "my_2nd_vector"     "my_3rd_vector"     "my_4th_vector"    
##  [7] "my_5th_vector"     "my_6th_vector"     "my_7th_vector"    
## [10] "x"                 "y"

If you want to delete a particular object you can do so using the rm() function.

rm(x) #remove x from your global environment

It is also possible to remove all objects at once:

rm(list = ls()) #remove all objects from your global environment

If you mix elements of different classes in a vector (for example numeric and logical), R will coerce them to the lowest common denominator, so that every element in the vector is of the same class. So, for example, if you input a number and a character, it will coerce the vector to be a character vector -see the example below and notice the use of the class() function to identify the class of an object.

my_8th_vector <- c(0.5, "a")
class(my_8th_vector) #The class() function will tell us the class of the vector
## [1] "character"
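The same coercion happens if you mix numeric and logical elements: R converts the logicals to numbers (TRUE becomes 1, FALSE becomes 0). The vector name below simply continues our numbering:

```r
my_9th_vector <- c(0.5, TRUE, FALSE) #mix a number with two logicals
class(my_9th_vector) #coerced to numeric, the lowest common denominator
## [1] "numeric"
my_9th_vector #TRUE and FALSE have been converted to 1 and 0
## [1] 0.5 1.0 0.0
```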


An important thing to understand in R is that categorical data (ordered, also called ordinal, or unordered, also called nominal) are typically encoded as factors. A factor is simply an integer vector that can contain only predefined values, and it is used to store categorical data. Factors are treated specially by many data analytic and visualisation functions. This makes sense because they are essentially different from quantitative variables.

Although you can use numbers to represent categories, using factors with labels is better than using integers because factors are self-describing (a variable that has the values “Male” and “Female” is easier to interpret than a variable that has the values “1” and “2”). When R reads data in other formats (e.g., comma separated), by default it will automatically convert all character variables into factors. If you would rather keep these variables as simple character vectors, you need to explicitly ask R to do so.
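For illustration, the data.frame() function takes a stringsAsFactors argument that controls this behaviour (a sketch with made-up names; note that from R 4.0.0 onwards character vectors are kept as characters by default):

```r
#Keep the text column as a plain character vector
df_chr <- data.frame(name = c("Anna", "Ben"), stringsAsFactors = FALSE)
class(df_chr$name)
## [1] "character"
#Ask R to convert the text column into a factor instead
df_fct <- data.frame(name = c("Anna", "Ben"), stringsAsFactors = TRUE)
class(df_fct$name)
## [1] "factor"
```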

Factors can also be created with the factor() function, concatenating a series of character elements. You will notice that a factor is printed differently from a simple character vector and that it tells us the levels of the factor (look at the second printed line).

the_smiths <- factor(c("Morrisey", "Marr", "Rourke", "Joyce")) #create a new factor
the_smiths #auto-print the factor
## [1] Morrisey Marr     Rourke   Joyce   
## Levels: Joyce Marr Morrisey Rourke
#Alternatively for similar result using the as.factor() function
the_smiths_bis <- c("Morrisey", "Marr", "Rourke", "Joyce") #create a character vector
the_smiths_f <- as.factor(the_smiths_bis) #create a factor
the_smiths_f #auto-print factor
## [1] Morrisey Marr     Rourke   Joyce   
## Levels: Joyce Marr Morrisey Rourke

Factors in R can be seen as vectors with a bit more information added. This extra information consists of a record of the distinct values in that vector, called levels. If you want to know the levels in a given factor you can use the levels() function:

levels(the_smiths)
## [1] "Joyce"    "Marr"     "Morrisey" "Rourke"

Notice that the levels appear printed by alphabetical order. There will be situations when this is not the most convenient order. We will discuss in these tutorials how to reorder your factor levels when you need to.
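As a preview, one way of imposing your own order is to pass the levels argument to the factor() function:

```r
#Specify the order of the levels explicitly instead of accepting alphabetical order
the_smiths_ordered <- factor(c("Morrisey", "Marr", "Rourke", "Joyce"),
                             levels = c("Morrisey", "Marr", "Rourke", "Joyce"))
levels(the_smiths_ordered) #the levels now follow the order we specified
## [1] "Morrisey" "Marr"     "Rourke"   "Joyce"
```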

You may have noticed the various names I have used to designate objects (my_1st_vector, the_smiths, etc.). You can use almost any name you want for your objects. Names in R can be of any length and can consist of letters, numbers, underscores ("_") or periods ("."), and should begin with a letter. In addition, when naming objects:

  • Some names are forbidden. These include words such as FALSE and TRUE, logical operators, and programming words like Inf, for, else, break, function, and words for special entities like NA and NaN.

  • You want to use names that do not correspond to a specific function. We have seen, for example, that there is a function called print(), you don’t want to call an object “print” to avoid conflicts. To avoid this use nouns instead of verbs for naming your variables and data.

  • You don’t want them to be too long (or you will regret it every time you need to use that object in your analysis: your fingers will bleed from typing).

  • You want to make them as intuitive to interpret as possible.

  • You want to follow consistent naming conventions. R users are terrible about this. But we could make it better if we all aim to follow similar conventions. In these handouts you will see I follow the underscore_separated convention -see here for details.

Data frames

One of the most common objects you will work with in this course are data frames. Data frames can be created with the data.frame() function. Data frames are multiple vectors of possibly different classes (e.g., numeric, factors), but of the same length (e.g., all vectors, or variables, have the same number of rows). This is what in other programmes for data analysis are represented as data sets, the tabular spreadsheets I was referring to earlier.

#We create a dataframe called mydata_1 with two variables: an integer vector called foo and a logical vector called bar
mydata_1 <- data.frame(foo = 1:4, bar = c(T, T, F, F))
mydata_1
##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
## 4   4 FALSE

Or alternatively for the same result:

x <- 1:4
y <- c(T, T, F, F)
mydata_2 <- data.frame(foo = x, bar = y)
mydata_2
##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
## 4   4 FALSE

As you can see in R, as in any other language, there are multiple ways of saying the same thing. Programmers aim to produce code that has been optimised: it is short and quick. It is likely that as you develop your R skills you find increasingly more efficient ways of asking R how to do things.

Every object in R can have attributes. These are: names; dimensions (for matrices and arrays: number of rows and columns) and dimensions names; class of object (numeric, character, etc.); length (for a vector this will be the number of elements in the vector); and other user-defined. You can access the attributes of an object using the attributes() function.

attributes(mydata_1)
## $names
## [1] "foo" "bar"
## $row.names
## [1] 1 2 3 4
## $class
## [1] "data.frame"

By now you must also have noticed the common structure of functions in R. You can think of functions as executable commands that R will evaluate. You will have noticed that functions have a name followed by brackets, and that you can pass arguments to the function by including them within the brackets. In the previous example we used a function called attributes() and passed it the argument mydata_1.

A function in R can take any number of arguments. You can obtain help about functions in R (and the specific arguments they can take) by using the ? as mentioned earlier. A final word on code presentation and coding conventions before we carry on. Code is a form of communication, and it is important that you write it in a way that others can read clearly. As Hadley Wickham has noted: “Good coding style is like using correct punctuation when writing: you can manage without it, but it sure makes things easier to read.” Apart from using the “#” sign to make annotations, there are other basic conventions you should also follow:

  • Every comma should be followed by a space

  • Every mathematical operator (+, -, =, *, /, etc.) should be surrounded by spaces

  • Parentheses do not need spaces

  • Lines should be at most 80 characters. If you have to break up a line, indent the following piece.

You may want to look at the style guide used by Google programmers using R for further details, but the basic conventions listed above will possibly suffice for now.
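To see what a difference these conventions make, compare the same computation written carelessly and then following the rules above (a contrived example):

```r
#Hard to read: no spaces after commas or around operators
badStyle<-c(1,2,3)*2+1
#Easier to read: a space after each comma and around each operator
good_style <- c(1, 2, 3) * 2 + 1
good_style
## [1] 3 5 7
```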

1.4 Data filtering

R capabilities for data manipulation are ridiculously rich. We could devote entire sessions just to discuss those. Here I am just going to introduce some very basic ones around filtering. For more detail (or if you want to do something I have not explained), look at: the R Cookbook; Quick R; this twotorial; or this guide to the fast and excellent dplyr package; and for even more detail (than you possibly want at this stage) you can consult a preprint copy of Rob Muenchen’s book.

Filtering allows you to extract the elements of a vector that satisfy a certain condition. You will be doing filtering frequently, as statistical analysis often focuses on data that satisfy certain conditions. Going through some of these data manipulation commands now will help you understand a bit better some of the code that we will be using during the course.

As discussed, R has many datasets built in for teaching, one of which is the Titanic data about passengers and their fate. If you want to have a look at it, you can do so using the View() function.

View(Titanic)
You can see that it has a few variables. What are these?

This is actually a table, so before we are able to work with this data, we have to turn it from a table into a dataframe. This is achieved with the data.frame() function.

titanicDf <- data.frame(Titanic)

We now have a dataframe called titanicDf to play with.

To filter, or subset a dataframe, you can use a few approaches. Here I will introduce how you do this in base R, and then I will recommend some packages that are very good for data manipulation, which you can move on to if you are feeling more confident.

Filtering in base R

You normally want to filter based on some criteria. The syntax for filtering is to use the name of the dataframe (or vector, if you’re filtering from that) followed by the square brackets. Inside the square brackets, you enter the criteria you want to filter for rows and columns, separated by a comma. Row criteria are always on the left of the comma, and column criteria are always on the right. If you leave either side of the comma blank, that indicates that you want all of those.

So to select all rows of the first two columns of the Titanic dataframe, you would type titanicDf[ , 1:2]. If you want to use this subset dataframe later, don’t forget to assign it to an object! So to create a new dataframe that is the first two columns of the Titanic dataframe you type:

firstTwoColsOfTitanicDf <- titanicDf[ , 1:2]

You can have a look at the new dataframe you have created using View(). If all went well, it should contain the class in which each passenger travelled, and their sex.

So similarly, if you wanted the first two rows of the data, then you would put your criteria on the left (row) side of the comma within the square brackets:

firstTwoRowsOfTitanicDf <- titanicDf[1:2, ]

Again you can View() if you’d like, and hopefully you have all the columns, but only two rows of the dataframe.

Most of the time your criteria will be a little more complex, however. You don’t always know the numbers of the columns you want, and you usually want to select rows where a column meets a certain criterion.

For example, let’s say we only want to see the outcomes for people who travelled in 1st class. So we want a dataframe where we have all the columns, but only the rows (observations) where Class value is 1st.

If we were to say this in an equation, we would say we want all rows where Class equals 1st. It is not so different in R: we want to select all rows where the column that describes class equals “1st”.

If you remember from last week, you summon a column (or variable) from a dataframe by using the $ operator. So to get the Class column from the titanicDf dataframe, we need to type:

titanicDf$Class

The equals operator in R is a double equals “==”. This is because a single equals can be used to assign something a value (sort of like the “<-” operator). == instead asks a question, “does one value equal the other?”.

So for example 1 == 2 returns FALSE, while 2 == 2 returns TRUE. Try it!

1 == 2
## [1] FALSE
2 == 2
## [1] TRUE

So == is a way to test whether some criterion is met. This is why we use it to select rows where a criterion is met.

So to test whether an observation (row) meets the criterion, we can test for each row whether == returns TRUE or FALSE. In the subsetting, we get back all the rows where == returned TRUE.
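You can see this logical vector directly with a toy example (a made-up vector, not the Titanic data):

```r
fruits <- c("apple", "pear", "apple") #a small character vector
fruits == "apple" #one TRUE or FALSE per element
## [1]  TRUE FALSE  TRUE
fruits[fruits == "apple"] #keep only the elements where the test returned TRUE
## [1] "apple" "apple"
```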

So back to the example of selecting all rows where Class equals 1st (that is, all observations about 1st class passengers), the criterion can be written as

titanicDf$Class == "1st"

As you recall, we are filtering rows, and row criteria goes in the left half of the square brackets. We want all of the columns, so we leave the right hand side (column criteria) blank:

firstClassOnly <- titanicDf[titanicDf$Class == "1st", ]
firstClassOnly
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 5    1st Female Child       No    0
## 9    1st   Male Adult       No  118
## 13   1st Female Adult       No    4
## 17   1st   Male Child      Yes    5
## 21   1st Female Child      Yes    1
## 25   1st   Male Adult      Yes   57
## 29   1st Female Adult      Yes  140

Now you have a subset dataframe which has only the information about 1st class passengers on the Titanic. Feel free to view it.

You can add multiple criteria to your selection. To do this you use the “and” operator & and the “or” operator |.

So if you want to know only about 1st class children, your criteria would look like this:

titanicDf$Class == "1st" & titanicDf$Age == "Child"

That is what you would place in the left side (row criteria) of the square bracket:

firstClassChildOnly <- titanicDf[titanicDf$Class == "1st" & titanicDf$Age == "Child", ]
firstClassChildOnly
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 5    1st Female Child       No    0
## 17   1st   Male Child      Yes    5
## 21   1st Female Child      Yes    1

On the other hand, if you wanted to know about both the 1st and 2nd classes, you need the “or” operator. That is because the condition is true either if the class is “1st” or if it is “2nd”.

So your criteria will look like:

titanicDf$Class == "1st" | titanicDf$Class == "2nd"

That is what you would place in the left side (row criteria) of the square bracket:

firstOrSecondClass <- titanicDf[titanicDf$Class == "1st" | titanicDf$Class == "2nd", ]
firstOrSecondClass
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154
## 13   1st Female Adult       No    4
## 14   2nd Female Adult       No   13
## 17   1st   Male Child      Yes    5
## 18   2nd   Male Child      Yes   11
## 21   1st Female Child      Yes    1
## 22   2nd Female Child      Yes   13
## 25   1st   Male Adult      Yes   57
## 26   2nd   Male Adult      Yes   14
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80

You can also use operators such as greater than and less than. A list of operators can be found here.
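For instance, the same square-bracket logic works with > on the listOfSomeNumbers vector we created earlier:

```r
listOfSomeNumbers <- c(2, 5, 23, 1, 7, 56, 109, 33, 21)
listOfSomeNumbers[listOfSomeNumbers > 20] #keep only the values greater than 20
## [1]  23  56 109  33  21
```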

You can also filter with the subset() function. As arguments you need to specify the thing you are subsetting (e.g. a dataframe or vector) and then the filtering condition/criteria. Note that inside subset() you can refer to columns by their bare names. So something like this (nameOfOriginalDf and columnName are placeholders):

subsetDf <- subset(nameOfOriginalDf, columnName == "myCriteria")

So to return to the Titanic examples, to subset only the rows where class is 1st class, we would type:

firstClassOnly <- subset(titanicDf, Class == "1st")
firstClassOnly
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 5    1st Female Child       No    0
## 9    1st   Male Adult       No  118
## 13   1st Female Adult       No    4
## 17   1st   Male Child      Yes    5
## 21   1st Female Child      Yes    1
## 25   1st   Male Adult      Yes   57
## 29   1st Female Adult      Yes  140

Or with the multiple criteria (both 1st and 2nd Class passengers)

firstOrSecondClass <- subset(titanicDf, Class == "1st" | Class == "2nd")
firstOrSecondClass
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154
## 13   1st Female Adult       No    4
## 14   2nd Female Adult       No   13
## 17   1st   Male Child      Yes    5
## 18   2nd   Male Child      Yes   11
## 21   1st Female Child      Yes    1
## 22   2nd Female Child      Yes   13
## 25   1st   Male Adult      Yes   57
## 26   2nd   Male Adult      Yes   14
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80

When applied to vectors, the difference between this function and the ordinary filtering we have covered so far lies in the manner in which missing data is handled.

Let’s see an example:

z <- c(6, 1:3, NA, 12) #This creates a vector with 6 values, one of which is NA. NA is the label R uses for unknown values.
z
## [1]  6  1  2  3 NA 12

If we use ordinary filtering NA values (missing data) will be included in our selection:

z[z > 5]
## [1]  6 NA 12

But if we use the subset() function the NA values are excluded from the selection.

subset(z, z > 5) 
## [1]  6 12

It is important to remember this difference, as you may have particular reasons to exclude or include NA values in what you are planning to do.
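If you want the ordinary square-bracket filter to drop the NA values as well, a common idiom is to combine the condition with the is.na() function:

```r
z <- c(6, 1:3, NA, 12)
z[!is.na(z) & z > 5] #exclude NAs explicitly, then apply the condition
## [1]  6 12
```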

For other logical and arithmetic operators that you can use in R for filtering (and how to write them) in R please look here. For further details you may also want to see Roger Peng’s video on subsetting. The dplyr package is truly quite something if you are interested in filtering data frames. It uses more intuitive language and adds some very helpful functions, but in this section I thought it was important to cover the more traditional approach of filtering with R. In later sessions we will introduce some of the features of dplyr.

1.5 Loading data

We did this a bit last week, but here’s a more comprehensive guide for reading in data.

You can, as we have seen, enter data into R in a variety of ways. In data analysis, however, it is very common that you will work with data already formatted (and collected by somebody else: government, a professional survey organisation you may have contracted for fieldwork, other researchers, etc.). We live in a time when data is everywhere; some people call it the “data revolution that will change the way we live”. And there are, indeed, a variety of places where you can obtain datasets for secondary analysis, such as the UK Data Archive or the ICPSR website (feel free to Google these; they may give you ideas for your MA dissertation or future PhD research).

For most analysis then, the first step will involve importing an already existing data frame into R. In this course every time we introduce one of these datasets you will be provided with some background information about the dataset in question: a short codebook telling you how the data was collated and some information about the variables included. This information is often called metadata, data about data. The first thing you always do in data analysis is to have a look at this codebook to understand what is in it and how the data was generated. With long dry codebooks this may be as inviting as reading a telephone guide, but I can assure you it will save you time and prevent you from making mistakes later on (as I learnt the hard way!).

For the most part the data we use in the course have already been formatted and cleaned to make your life easier. In fact, so have the codebooks. But in real life, data, even pre-processed data, tends to be messy. The day-to-day of data analysis involves spending the bulk of the time (up to 90%) in what some informally call data munging before you can start exploring your data for interesting patterns. Pre-processing data may involve: fixing variable names; creating new variables; merging data sets; reshaping data sets; dealing with missing data; transforming variables; checking on and removing inconsistent values; etc. You can find here a short discussion about data carpentry of this type. Using R means that you will have a much better (and essential) record of all these operations (by saving the scripts performing them).

The sanitised datasets we use in this course will be mostly loaded from the internet. One of the cool things about R is that it can easily read data directly from the internet if you provide a URL for it -click here for a video “twotorial” demonstration. You will be able to download this data from my public repositories in GitHub, a cloud storage facility used by programmers and data analysts 3.

Data may come in a variety of formats (Excel files, comma separated, tab separated, SPSS, STATA, JSON, etc.) 4. There are functions that help you import files in these various formats into R. We are going to start with a simple case: reading a textual comma separated file into R.

In particular, we are going to read a dataset which is a Teaching Unrestricted version of the British Crime Survey for 2007-2008. The codebook with information about this dataset is here. This file is in comma separated format and therefore we will be using the read.csv() function.

##R in Windows has some problems with https addresses, which is why we need to do this first:
#We create a data frame object reading the data from the remote .csv file

Note that R requires forward slashes (/) not back slashes (\) when specifying a file location, even if the file is on your hard drive. Another way of getting this sort of file would be using the following approach:

#Download the data into your working directory
download.file('', "BCS0708.csv", method = "internal")
#Read the data in the csv file into an object called BCS0708
BCS0708 <- read.csv("BCS0708.csv")

Or, for ultimate simplicity, you can do what we did in the half day session last week: download the .csv file into your working directory, and then read it in either by specifying the file path or with the simple file.choose() function.

BCS0708 <- read.csv(file.choose())

For some resources on how to read other file types into R see:

1.6 A first look at the data

We are going to work with the British Crime Survey data for the remainder of this session. And we want to start by having a sense for what the data look like.

Data are often too big to look at the whole thing. It is almost always impossible to eyeball the entire dataset and see what you have in terms of interesting patterns or potential problems. It is often a case of information overload and we want to be able to extract what is relevant and important about it. Summarising the data is the first step in any analysis and it is also used for finding out potential problems with the data. Regarding the latter you want to look out for: missing values; values outside the expected range (e.g., someone aged 200 years); values that seem to be in the wrong units; mislabelled variables; or variables that seem to be the wrong class (e.g., a quantitative variable encoded as a factor).

If you simply type the name of the new dataset, BCS0708, and press return for auto-printing, you will see why we need data summaries. It is hard to eyeball large datasets like this trying to see the whole thing. So let’s start with the basic things you always look at first in a dataset.

You can see in the Environment window that BCS0708 has 11676 observations (rows) of 35 variables (columns). You can also obtain this information using code. Here you want the DIMensions of the dataframe (the number of rows and columns) so you use the dim() function:

dim(BCS0708)
## [1] 11676    35

We can see that the dataset has 11676 observations or rows and 35 columns or variables. Looking at this information will help you to diagnose whether there was any trouble getting your data into R (e.g., imagine you know there should be more cases or variables). You may also want to have a look at the names of the columns using the names() function, which will print the names of the variables.

names(BCS0708)
##  [1] "rowlabel"  "sex"       "age"       "livharm1"  "ethgrp2"  
##  [6] "educat3"   "work"      "yrsarea"   "resyrago"  "tenure1"  
## [11] "rural2"    "rubbcomm"  "vandcomm"  "poorhou"   "tcemdiqu2"
## [16] "tcwmdiqu2" "causem"    "walkdark"  "walkday"   "homealon" 
## [21] "tcviolent" "tcsteal"   "wburgl"    "wmugged"   "wcarstol" 
## [26] "wfromcar"  "wraped"    "wattack"   "winsult"   "wraceatt" 
## [31] "crimerat"  "tcarea"    "tcneigh"   "bcsvictim" "tcindwt"

As you may notice, these names are hard to interpret. You need to look at the codebook to figure out what each of those variables is actually measuring. The bad news is that real life datasets in social science are much larger than this and have many more variables (hundreds or more). The good news is that typically you will only need a small handful of them and will only need to familiarise yourself deeply with that smaller subset.

Then you may want to look at the class of each individual column. As discussed, the class of a variable lets us know whether it is, for example, an integer (number) or a factor.

To get the class of one variable, you pass it to the class() function. For example

class(BCS0708$age)
## [1] "integer"
class(BCS0708$sex)
## [1] "factor"

There is a way to apply a function to every element of a vector, list, or data frame: you can use the functions sapply(), lapply(), and mapply(). To find out more about when to use each one see here.
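As a rough sketch of the difference between the first two, using a small made-up data frame:

```r
# A tiny invented data frame, just to show the two functions side by side
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))

lapply(df, class)  # returns a list, one element per column
sapply(df, class)  # same information, simplified to a named character vector
```

The content is identical; sapply() just simplifies the result to the most compact structure it can.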

For example, we can use the lapply() function to look at each column and get its class. To do so, we have to pass two arguments to the lapply() function: the first is the name of the dataframe, to tell it what to look through, and the second is the function we want it to apply to every column of that dataframe.

So we want to type: lapply(name of dataframe, name of function)

Which is:

lapply(BCS0708, class)
## $rowlabel
## [1] "integer"
## $sex
## [1] "factor"
## $age
## [1] "integer"
## $livharm1
## [1] "factor"
## $ethgrp2
## [1] "factor"
## $educat3
## [1] "factor"
## $work
## [1] "factor"
## $yrsarea
## [1] "factor"
## $resyrago
## [1] "factor"
## $tenure1
## [1] "factor"
## $rural2
## [1] "factor"
## $rubbcomm
## [1] "factor"
## $vandcomm
## [1] "factor"
## $poorhou
## [1] "factor"
## $tcemdiqu2
## [1] "integer"
## $tcwmdiqu2
## [1] "integer"
## $causem
## [1] "factor"
## $walkdark
## [1] "factor"
## $walkday
## [1] "factor"
## $homealon
## [1] "factor"
## $tcviolent
## [1] "numeric"
## $tcsteal
## [1] "numeric"
## $wburgl
## [1] "factor"
## $wmugged
## [1] "factor"
## $wcarstol
## [1] "factor"
## $wfromcar
## [1] "factor"
## $wraped
## [1] "factor"
## $wattack
## [1] "factor"
## $winsult
## [1] "factor"
## $wraceatt
## [1] "factor"
## $crimerat
## [1] "factor"
## $tcarea
## [1] "numeric"
## $tcneigh
## [1] "numeric"
## $bcsvictim
## [1] "factor"
## $tcindwt
## [1] "numeric"

As you can see many variables are classed as factors. This is common with survey data. Many of the questions in social surveys measure the answers as categorical variables (e.g., these are nominal or ordinal level measures) which are then encoded as factors in R. You should know that the default options for read.csv() will encode any character attribute in your file as a factor.
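You can see this default in action with a small in-memory csv (no file needed). Note this is just a sketch, with invented data; from R version 4.0 onwards the default changed to stringsAsFactors = FALSE, so the argument is spelled out explicitly here:

```r
csv_text <- "sex,age\nfemale,36\nmale,44"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$sex)  # "factor": the behaviour described above
class(as_char$sex)    # "character": the text is left as it is
```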

Another useful function is str(), which will return: the name of the variable; the class of each column; the number of levels or categories (if it is a factor); and the values for the first few cases in the dataset.

str(BCS0708$wburgl)
##  Factor w/ 4 levels "fairly worried",..: 1 4 1 4 2 4 1 2 3 2 ...

You can also use the head() function if you just want to visualise the values for the first few cases in your dataset. The next piece of code, for example, asks for the values for the first two cases.

head(BCS0708, 2)
##   rowlabel    sex age  livharm1 ethgrp2                      educat3 work
## 1 61302140 female  36   married   white                         none  yes
## 2 61384060   male  44 separated   white apprenticeship or a/as level  yes
##                           yrsarea resyrago
## 1 10 years but less than 20 years     <NA>
## 2             less than 12 months       no
##                                         tenure1 rural2        rubbcomm
## 1 buying it with the help of a mortgage or loan  urban            <NA>
## 2                                       rent it  urban not very common
##            vandcomm           poorhou tcemdiqu2 tcwmdiqu2     causem
## 1              <NA>              <NA>         1        NA   e. drugs
## 2 not at all common not at all common         3        NA f. alcohol
##       walkdark        walkday     homealon  tcviolent  tcsteal
## 1  very unsafe or very unsafe a bit unsafe         NA       NA
## 2 a bit unsafe    fairly safe  fairly safe -0.3892744 2.139811
##           wburgl          wmugged       wcarstol     wfromcar
## 1 fairly worried not very worried           <NA>         <NA>
## 2   very worried     very worried fairly worried very worried
##               wraped          wattack          winsult         wraceatt
## 1   not very worried not very worried   fairly worried not very worried
## 2 not at all worried not very worried not very worried not very worried
##              crimerat   tcarea   tcneigh             bcsvictim  tcindwt
## 1 a little more crime 1.117700  2.212788 not a victim of crime 1.763460
## 2                <NA> 1.791787 -1.024336 not a victim of crime 3.844527

In the same way you could look at the last two cases in your dataset using tail():

tail(BCS0708, 2)
##       rowlabel    sex age livharm1 ethgrp2 educat3 work
## 11675 86048300 female  73  widowed   white    none   no
## 11676 86052180 female  71  widowed   white    none   no
##                              yrsarea resyrago         tenure1 rural2
## 11675 5 years but less than 10 years     <NA>         rent it  urban
## 11676             20 years or longer     <NA> own it outright  urban
##                rubbcomm          vandcomm           poorhou tcemdiqu2
## 11675 not at all common not at all common not at all common        NA
## 11676 not at all common not at all common not at all common        NA
##       tcwmdiqu2                             causem    walkdark     walkday
## 11675         2                           e. drugs very unsafe   very safe
## 11676         1 d. lack of discipline from parents very unsafe fairly safe
##          homealon  tcviolent  tcsteal         wburgl        wmugged
## 11675   very safe         NA       NA fairly worried   very worried
## 11676 very unsafe -0.1945015 1.837662   very worried fairly worried
##           wcarstol       wfromcar             wraped          wattack
## 11675         <NA>           <NA> not at all worried   fairly worried
## 11676 very worried fairly worried   not very worried not very worried
##                winsult           wraceatt         crimerat   tcarea
## 11675     very worried not at all worried a lot more crime       NA
## 11676 not very worried   not very worried a lot more crime 0.651897
##          tcneigh             bcsvictim   tcindwt
## 11675         NA not a victim of crime 0.6227991
## 11676 -0.9313367       victim of crime 0.6208484

It is good practice to do this to ensure R has read the data correctly and there’s nothing terribly wrong with your dataset. It can also give you a first impression of what the data look like. If you are used to spreadsheet-like views of data, you can use the View() function, which should open this view in R Studio.


One thing you may also want to do is to see if there are any missing values. For that we can use the function. Missing values in R are coded as NA. The code below, for example, checks for NA values in the variable “educat3” in the BCS0708 object for cases 1 to 10:$educat3[1:10])
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

R is telling me that none of those elements are missing. More typically, you will want to ask for the count of NA values for a particular variable:

sum($educat3))
## [1] 58

This is asking R to sum the number of cases that are NA in this variable. When evaluating a logical vector such as the one we are creating, R will treat the FALSE elements as 0s and the TRUE elements as 1s. So basically the sum() function will count the number of TRUE cases returned by

You can use a bit of a hack to get the proportion of missing cases instead of the count:

mean($educat3))
## [1] 0.004967455

This code is exploiting the mathematical fact that the mean of binary outcomes (0 or 1) gives you the proportion of 1s in your data.
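A tiny made-up vector shows both tricks at once:

```r
x <- c(4, NA, 7, NA, 1)   # invented vector with 2 missing values out of 5

sum(    # counts the NAs: 2
mean(   # proportion of NAs: 2/5 = 0.4
```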

If you see more than 5% of the cases declared as NA, you need to start thinking about the implications of this. Beware of formulaic application of rules of thumb such as this though!

There is a whole field of statistics devoted to doing analysis when missing data is a problem. R has extensive capabilities for dealing with missing data -see for example here. For the purpose of this introductory course, we only explain how to do analyses that ignore missing data. You would cover techniques for dealing with issues of this sort in more advanced courses -such as this one offered by my lovely colleagues at CCSR.

The any() function is a useful one if you want to check whether a particular value exists for a given variable. So, say we want to know whether we have anybody younger than 17 in the dataset, we could type the following:

any(BCS0708$age < 17)
## [1] TRUE

1.7 Numerical summaries of central tendency and variability

In this section we start exploring some statistical summaries for our data. If you have not watched the required video for this week explaining these measures, it is convenient you watch it now -just click here. You may also find these other videos on measures of central tendency and the standard deviation useful.

Let’s start with the mean. If you want to obtain the mean of a quantitative variable you can use the following expression:

mean(BCS0708$age)
## [1] NA

NA? What’s going on? Often this will happen to you: you get an unexpected result. Why may this be happening? You could use the View() function to visualise the data. If you view the data you will see that nothing seems odd with the variable. If you use the str() function you will see “age” is a numeric vector, so it’s not as if you are asking something unreasonable (e.g., computing the mean for a categorical variable). The writing of the code also seems ok (we are using lower and upper case correctly).

When these things happen, you will need to go through the mental process of eliminating likely explanations such as the ones we have gone through here. Typically, the next step is to look at the help files for the function you are using.

If you look at the help files you will notice that there is a default argument for the mean() function: na.rm = FALSE. This means that the NA values are not removed before computation proceeds. Of course, you get NA because it is mathematically impossible to perform an operation with a NA value: what is 2 + NA? NA. I can only imagine this is the default to alert you to the fact that you should not ignore missing data problems. Let’s try again, modifying the default. As you will see, now it will work:

mean(BCS0708$age, na.rm = TRUE) #Specifying this argument will ensure that only cases with valid values are used in the mathematical operation
## [1] 50.42278

Another function you may want to use with numeric variables is summary():

summary(BCS0708$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   16.00   36.00   49.00   50.42   65.00  101.00      15
This gives you the five number summary (minimum, first quartile, median, third quartile, and maximum, plus the mean and the count of NA values). Notice how when using this function you did not need to change any default settings to remove NA values.
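You can see that behaviour in isolation with a minimal sketch on a made-up vector (not the survey data):

```r
x <- c(16, 25, 49, 65, 101, NA)  # invented values, one of them missing

summary(x)  # reports the NA count alongside the other summaries, no na.rm needed
```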

There are multiple ways of getting results in R. Particularly for basic and intermediate-level statistical analysis many core functions and packages can give you the answer that you are looking for. For example, there are a variety of packages that allow you to look at summary statistics using functions defined within those packages. You will need to install these packages before you can use them.

Once installed you can activate packages with either the library() or the require() functions. It is a matter of some debate which to use. Some people think that there are good reasons to prefer library(), even if require() uses a more explicit language as to what you are doing. In these handouts we will use library().

You could use the favstats() function from the mosaic package:

To do this you have to first install and then load up the mosaic package. (Hint: if you’ve not used it before, you’ll need to run install.packages(), and after that, library() )

## Loading required package: dplyr
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##     filter, lag
## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: mosaicData
## Loading required package: Matrix
## The 'mosaic' package masks several functions from core packages in order to add additional features.  
## The original behavior of these functions should not be affected by this.
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##     mean
## The following objects are masked from 'package:dplyr':
##     count, do, tally
## The following objects are masked from 'package:stats':
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##     max, mean, min, prod, range, sample, sum
favstats(~age, data = BCS0708) #five number summary, mean, standard deviation, and number of cases
##  min Q1 median Q3 max     mean      sd     n missing
##   16 36     49 65 101 50.42278 18.5389 11661      15

The describe() function from the Hmisc package:

## Loading required package: survival
## Loading required package: Formula
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##     combine, src, summarize
## The following objects are masked from 'package:base':
##     format.pval, round.POSIXt, trunc.POSIXt, units
describe(BCS0708$age) #n,missing, mean, various percentiles, the 5 lowest and highest values
## BCS0708$age 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##   11661      15      84       1   50.42      21      26      36      49 
##     .75     .90     .95 
##      65      76      81 
## lowest :  16  17  18  19  20, highest:  95  97  98  99 101

The stat.desc() function from the pastecs package:

## Loading required package: boot
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##     aml
## The following object is masked from 'package:mosaic':
##     logit
## The following object is masked from 'package:lattice':
##     melanoma
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##     first, last
stat.desc(BCS0708$age) #Which gives you among other things the standard error, standard deviation, and 95% confidence interval (if you specify norm = TRUE as an argument you will get tests for normality of the distribution, but these only work for data sets of size 3 to 5000 -so we cannot use them here). Specifying basic = FALSE as an argument will also exclude some of the descriptives.
##      nbr.val     nbr.null          min          max 
## 1.166100e+04 0.000000e+00 1.500000e+01 1.600000e+01 1.010000e+02 
##        range          sum       median         mean      SE.mean 
## 8.500000e+01 5.879800e+05 4.900000e+01 5.042278e+01 1.716785e-01 
## CI.mean.0.95          var       coef.var 
## 3.365187e-01 3.436907e+02 1.853890e+01 3.676691e-01

Or the describe() function from the psych package:

## Attaching package: 'psych'
## The following object is masked from 'package:boot':
##     logit
## The following object is masked from 'package:Hmisc':
##     describe
## The following objects are masked from 'package:mosaic':
##     logit, read.file, rescale
## The following objects are masked from 'package:ggplot2':
##     %+%, alpha
describe(BCS0708$age) #Which gives you a trimmed mean, as well as measures of skew and kurtosis
##    vars     n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 11661 50.42 18.54     49   50.17 20.76  16 101    85 0.11     -0.9
##      se
## X1 0.17

What package to use? Well, that’s very much a matter of personal preference and what your specific needs are. As you have seen, these packages give you slightly different sets of summaries. You may also inform your choice by considering what other functions those packages include. It could be that one of those packages offers functions for types of analysis you will use often. In that case it may be sensible to rely on that particular package.

I want to alert you to something, however. You may have noticed that the psych and the Hmisc packages both include a describe() function. Packages are user-produced and there’s no policing of the names people use for their functions. With 5000+ packages it is unavoidable that some people will use the same names for some of their functions. When this happens R “masks” the earlier-loaded of these functions.

So in this session, if we use describe() again, we will use the function from the psych package (since that was the last package loaded, and by loading it we masked the describe() function of the previous package). If we want to revert to the describe() function as designed in the Hmisc package, we need to load that package again so that R masks the describe() function from the psych package. It is important you keep an eye on these “masked” messages when you load packages!!!
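As a side note, you can also sidestep masking altogether with the double-colon operator, which names the package a function should come from regardless of loading order. A minimal sketch (the last two lines assume Hmisc and psych are installed, so they are left commented out):

```r
# package::function always picks the function from that package's namespace
stats::median(c(1, 2, 3, 10))  # base R's median: 2.5

# With the two clashing describe() functions it would look like:
# Hmisc::describe(BCS0708$age)
# psych::describe(BCS0708$age)
```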

Sometimes you want to produce summary statistics by group. The psych package is good for this:

#Since we just loaded this package we can continue using its functions
describeBy(BCS0708$age, BCS0708$livharm1) #age descriptives for various categories of marital status
## $cohabiting
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1032 38.27 12.86     36   36.93 11.86  18  95    77 0.95     0.72
##     se
## X1 0.4
## $divorced
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1055 54.28 12.61     55   54.14 13.34  24  95    71 0.09    -0.49
##      se
## X1 0.39
## $married
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 5551 52.97 14.79     53   52.74 16.31  18  94    76 0.11    -0.84
##     se
## X1 0.2
## $separated
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 323 47.66 13.64     45   46.64 13.34  24  86    62 0.63    -0.17
##      se
## X1 0.76
## $single
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 2372 34.65 16.63     30   32.31 14.83  16  92    76 1.09     0.48
##      se
## X1 0.34
## $widowed
##    vars    n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 1320 75.09 10.24     76   75.77 10.38  37 101    64 -0.68     0.54
##      se
## X1 0.28
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)

If you look at the mean for each of the marital status groups, you can see that there is a relationship in this sample. Single people, for example, tend to be younger. Earth shattering discovery! Don’t worry, we will look at more interesting examples in subsequent weeks. As I mentioned earlier, we will cover dplyr in later sections, but it is important at least to mention here that this package is also very good for breaking down your data frame in such a way that would allow you to produce summary statistics by group.
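As a taster of that dplyr approach, here is a sketch on a small made-up data frame (the column names and values are invented, and it assumes dplyr is installed):

```r
library(dplyr)

df <- data.frame(age = c(36, 44, 29, 71, 55),
                 marstat = c("married", "single", "single", "widowed", "married"))

# Group the rows by marital status, then summarise each group
df %>%
  group_by(marstat) %>%
  summarise(mean_age = mean(age), n = n())
```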

Now that you have a bit more familiarity with R you can probably understand why I earlier suggested using short names for your data frames. Imagine that your data frame was called british_crime_survey_2007_2008. Then you would need to type the following:

#Don't try to execute this, it will tell you there is no object called british_crime_survey_2007_2008
describeBy(british_crime_survey_2007_2008$age, british_crime_survey_2007_2008$livharm1)

Yet, even if you have used a short name to designate your objects, you may still have to do more typing than your, let’s face it, lazy nature will want. If you have to use various variables as inputs and the function requires that you use the name_of_dataset$name_of_variable formulation as a way to identify your variables, it can be a bit tedious to do so. An incredibly helpful way of getting around that is the with() function.

#Ok, we don't save so much typing with this example, but imagine you have 10 variables!
with(BCS0708, describeBy(age, livharm1))

Next week we will devote the entire session to graphics as a way of visualising data. But it is important you understand that although we separate the treatment of this topic for practical reasons, in practice visualising the data is one of the very first things you do (even before you produce numerical summaries). So, if you are working with a numerical variable, you may want to produce a histogram to quickly look at the full distribution:

hist(BCS0708$age)

1.8 Summarising categorical variables

How about characterising qualitative variables? You can use the table() function to produce a frequency distribution. Let’s look at how safe people feel walking alone in the area where they live when it is dark:

table(BCS0708$walkdark)
## a bit unsafe  fairly safe    very safe  very unsafe 
##         2604         4718         3002         1301

You can see the modal category is fairly safe. Notice that R by default does not print the missing values. If you want them printed when using this function then you have to specifically ask for them with the useNA argument:

table(BCS0708$walkdark, useNA = "ifany")
## a bit unsafe  fairly safe    very safe  very unsafe         <NA> 
##         2604         4718         3002         1301           51

We may also be interested in expressing these quantities in percentage values. In order to do so, we need to engage in a programming trick. First we create a new object and we use the results from this object to compute those quantities. We do this using the following instructions:

.Table <- table(BCS0708$walkdark) ##creates a new object called .Table containing the results we saw before
round(100*.Table/sum(.Table), 2)  ##percentages for walkdark created using the object called .Table we just created (because we did not use useNA, this will compute the valid percent)
## a bit unsafe  fairly safe    very safe  very unsafe 
##        22.40        40.58        25.82        11.19
remove(.Table) ##removes this object from working memory

When producing frequency distribution tables, it is particularly useful (if you want to save yourself the work of formatting the tables in a nice way) to use the sjPlot package. You can use the sjt.frq() function for this. This online tutorial explains some of the features of this function in greater detail.

sjt.frq(BCS0708$walkdark, var.labels = c("How safe you feel walking alone at dark?"))

1.9 Saving your work and further resources

After you spend some time working with R you may want to save your progress. Equally if you create, alter or obtain data (e.g., from an external repository) you may want to store the data in a file saved in either your P: drive, a USB memory stick, dropbox, etc (whichever you prefer to work from).

The first thing you want to do is to create a folder in either your P: drive, the hard drive of your home computer, or a USB drive for these datasets and the exercises that you will be performing for this course. You may call this folder “R Exercises” or something like that. In future you may want to have a folder for every research project you are working on.

Whatever folder or location you use for the materials of this course, it is convenient you set this folder as your working directory whenever you start your session; otherwise every time you need to use a data frame you will have to tell R where to find it. When you start your R session, R will be working from a pre-specified working directory. Which working directory this is will depend on the machine you are using, and on whether you are working from a previously saved project in R Studio.

To find out what working directory you are currently working from, you can use the following code:

getwd()
## [1] "/Users/reka/Desktop/R-for-Criminologists"

In order to use a more convenient directory (like the one you may want to create for this course), you can use the setwd() function. This is how I set the working directory for my R related teaching materials for this course:

setwd("~/Desktop/R-for-Criminologists/demo") #Path to your working directory goes here

Or as described in the intro session you can navigate there manually by clicking on Session > Set working directory > choose directory (see here)

The argument, what you find between parentheses, is simply the particular location that, in my case, I want to use as the working directory. You will have to adapt this to whatever location you want to work from. In order for R to find your data, and to save your data, programs (“scripts”), and graphics in the right place, it is important that you start any R session by setting up this working directory.

Once you have created a folder and have set it up as the working directory you can save your progress. You can do this using the following code:

save.image("FirstWeek.RData")
This will create an image containing all the various objects in your workspace in the named file. You can use another name rather than “FirstWeek”, but it is important that you make sure you specify that this is a .RData type of file. This will tell R this file is a workspace file. In any case, if you close your R Studio session you will be automatically asked whether you want to save the current workspace image. Watch this video if any of this is not clear.

You can also save particular objects in your workspace, say one of your data frames, as a separate file. The following code will save the BCS0708 data frame as a .RData file. Note, of course, that you can also specify a suitable physical location.

save("BCS0708", file = "BCS0708.RData")

Once the file is saved as a .RData file you can then load it simply with the load() function.

load("BCS0708.RData")
However, some people would advise against saving data files this way. Rather than creating a .RData file, a preferable option may be to use the write.csv() function, which will create a comma separated file with your data frame. See this for details.

write.csv(BCS0708, file="BCS0708.csv")

If you want to write your data frame into a different format, there are a number of ways of doing so. I won’t be expecting you to, but you can learn how to do it: for dbf, SPSS, SAS, and STATA; and for Excel.

The many expressions that you may have been using are better stored within a script. This video, from the excellent series produced by MarinStats (Mike Marin at University of British Columbia), explains how to work with scripts within R Studio. I highly recommend that you watch it.

Also, in order to consolidate what we have covered today, you may also want to watch his other short videos (typically 6 minutes) that demonstrate the sort of stuff we have covered today. Specifically his videos on:

Alternatively you may find the online handouts produced by Andy Teucher a helpful reference. They are more parsimonious than this handout and can be used as a cheat-sheet for the type of functions we introduced today. If you want further practice you may enjoy the interactive modules in Try R. And for further documentation, the Beginner’s Guide to R from Computerworld magazine is pretty handy, and there you will also find a very comprehensive annotated list of additional learning resources for R.

Important terms introduced today (if you are not clear about them, please read the document again or ask me about their meaning):

sessionInfo() #This function provides information about the version of R I used to produce the html file you are reading. All these tutorials have been created directly within RStudio using the rmarkdown and knitr packages. A wonderful thing about R is that it makes largely redundant the use of other tools for publishing your work or for producing presentations. Including this session information at the end of the html file will allow others to reproduce this work if their version of R encounters problems with the code provided in this document. 
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] psych_1.6.9       pastecs_1.3-18    boot_1.3-18      
##  [4] Hmisc_3.17-4      Formula_1.2-1     survival_2.39-4  
##  [7] mosaic_0.14.4     Matrix_1.2-6      mosaicData_0.14.0
## [10] ggplot2_2.2.0     lattice_0.20-33   dplyr_0.5.0      
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7         formatR_1.4         RColorBrewer_1.1-2 
##  [4] plyr_1.8.4          tools_3.3.1         rpart_4.1-10       
##  [7] digest_0.6.10       evaluate_0.10       tibble_1.2         
## [10] gtable_0.2.0        DBI_0.5-1           parallel_3.3.1     
## [13] yaml_2.1.13         ggdendro_0.1-20     gridExtra_2.2.1    
## [16] stringr_1.1.0       knitr_1.14          cluster_2.0.4      
## [19] nnet_7.3-12         grid_3.3.1          data.table_1.9.6   
## [22] R6_2.2.0            foreign_0.8-66      rmarkdown_1.1      
## [25] latticeExtra_0.6-28 tidyr_0.6.0         magrittr_1.5       
## [28] scales_0.4.1        htmltools_0.3.5     splines_3.3.1      
## [31] MASS_7.3-45         mnormt_1.5-5        assertthat_0.1     
## [34] colorspace_1.2-7    stringi_1.1.2       acepack_1.3-3.3    
## [37] lazyeval_0.2.0      munsell_0.4.3       chron_2.3-47

  1. If you are confused by this terminology watch this video

  2. A “semi-legendary” R programmer and part of the R Studio team. We all want to know as much R as he does.

  3. I use Git and GitHub as my systems of version control. We don’t have the time to cover version control systems in this course. But for the purpose of this course you simply need to know that all the data will be stored in GitHub and that you will be able to load the data directly from it. Version control, and Git in particular, is a helpful tool for a data scientist. RStudio allows you to use Git as your system of version control. If you are interested you can find a great tutorial for Git and GitHub here.

  4. The following video “twotorials” (in two minutes or less) provide information on how to read data into R in a variety of formats: csv files; SPSS, STATA, and SAS; and Excel files - here for a different way to read Excel.