Introduction to Solving Biological Problems Using R - Week 1
Last modified: 15 Jul 2018
PeterMac Data Science’s modified version of material by the University of Cambridge (Mark Dunning, Suraj Menon and Aiora Zabala, Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts, original material: http://cambiotraining.github.io/r-intro) and Software Carpentry.
Course Aims
To introduce you to the basics of R
Reading data
Performing simple analyses
Producing graphs
How to get help!
Give you all the background you need to practice by yourselves
Introduce tools that will help you to work in a reproducible manner
R can be run from the command line or through a graphical user interface (GUI)
On this course, we use the RStudio GUI (www.rstudio.com)
Introduction to RStudio
(from Software Carpentry)
Throughout this lesson, we’re going to teach you some of the fundamentals of the R language. We’ll be using RStudio: a free, open-source R integrated development environment. It provides a built-in editor, works on all platforms (including on servers) and offers many advantages such as integration with version control and project management.
To launch RStudio, find the RStudio icon and click it.
Basic layout
When you first open RStudio, you will be greeted by three panels:
The interactive R console (entire left)
Environment/History (tabbed in upper right)
Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout
Once you open files, such as R scripts, an editor panel will also open in the top left.
RStudio layout with .R file open
Workflow within RStudio
There are two main ways one can work within RStudio.
Test and play within the interactive R console then copy code into a .R file to run later.
This works well when doing small tests and initially starting off.
It quickly becomes laborious
Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
This is a great way to start; all your code is saved for later
You will be able to run the file you create from within RStudio or using R’s source() function.
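As a minimal sketch of the second workflow, the example below writes a two-line script to a temporary file and runs it with source(). The file contents and the variable names a and b are made up for illustration; in practice you would pass the path of your own .R file.

```r
# Create a throwaway script (a stand-in for an .R file you saved in RStudio)
script_path <- tempfile(fileext = ".R")
writeLines(c("a <- 2 + 2", "b <- a * 3"), script_path)

# source() runs every line of the file, in order, in the current session
source(script_path)

a   # defined by the script
b   # also defined by the script
```

Anything the script creates is then available in the console, just as if you had typed the lines yourself.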
Tip: Running segments of your code
RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can:
1. click on the Run button above the editor panel, or
2. select “Run Lines” from the “Code” menu, or
3. hit Ctrl+Return on Windows or Linux, or ⌘+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)
To run a block of code, select it and then click Run. If you have modified a line of code within a block you have just run, there is no need to reselect the section and Run again; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.
Introduction to R
Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.
The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the shell lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.
The traditional way to enter R commands is via the Terminal, or using the console in RStudio (bottom-left)
However, another way (not used in this course) is to use a relatively new feature called R-notebooks.
An R-notebook mixes plain text with R code
The R code can be run from inside the document and the results are displayed directly underneath
Each chunk of R code looks something like this.
Each line of R can be executed by clicking on the line and pressing CTRL and ENTER
Or you can press the green triangle on the right-hand side to run everything in the chunk
print("Hello World")
An R notebook can be rendered into a format such as PDF or HTML so that it can be shared with your collaborators
On the course website you will see compiled versions of each session
Basic concepts in R - simple arithmetic
The command line can be used as a calculator and understands the usual arithmetic operators +, -, *, /
Try adding a few more calculations here
2 + 2
2 - 2
4 * 3
10 / 2
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
In the case of expressions involving multiple operations, R respects the BODMAS system to decide the order in which operations should be performed.
2 + 2 * 3
2 + (2 * 3)
(2 + 2) * 3
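A few more cases may help make the order of operations concrete. This is only an illustrative sketch; the ^ operator (powers) has not been introduced yet and is used here to show where it sits in the ordering.

```r
# Multiplication and division are carried out before addition and subtraction
2 + 2 * 3       # 8
# Brackets override the default order
(2 + 2) * 3     # 12
# Powers are evaluated before multiplication
2 * 3^2         # 18
# Powers are also evaluated before a leading minus sign
-2^2            # -4
(-2)^2          # 4
# Operators of equal precedence are applied left to right
10 - 4 - 3      # 3
```

When in doubt, adding brackets costs nothing and makes the intended order explicit.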
R is capable of more complicated arithmetic such as trigonometry and logarithms; like you would find on a fancy scientific calculator. Of course, R also has a plethora of statistical operations as we will see.
pi
sin(pi/2)
cos(pi)
tan(2)
log(1)
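Note that log() in R is the natural logarithm, which can surprise people who expect base 10. A short sketch of the related base R functions, with arbitrary input values:

```r
log(1)              # natural logarithm (base e): 0
log10(1000)         # base-10 logarithm
log2(8)             # base-2 logarithm
log(8, base = 2)    # any base can be given explicitly
exp(1)              # the constant e; exp() is the inverse of log()
log(exp(3))         # taking the log of exp(3) recovers 3
```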
We can only go so far with performing simple calculations like this. Eventually we will need to store our results for later use. For this, we need to make use of variables.
Basic concepts in R - variables
A variable is a letter or word which takes (or contains) a value. We use the assignment operator: <-
x <- 10
x
myNumber <- 25
myNumber
We can perform arithmetic on variables:
sqrt(myNumber)
We can add variables together:
x + myNumber
We can change the value of an existing variable:
x <- 21
x
We can set one variable to equal the value of another variable:
x <- myNumber
x
We can modify the contents of a variable:
myNumber <- myNumber + sqrt(16)
myNumber
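One point worth checking for yourself: assigning one variable to another copies the value at that moment, so later changes to the original do not carry over. A small sketch using the same variable names as above:

```r
myNumber <- 25
x <- myNumber                      # x receives the current value of myNumber

myNumber <- myNumber + sqrt(16)    # only myNumber changes

x          # still 25
myNumber   # now 29
```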
When we are feeling lazy we might give our variables short names (x, y, i, etc.), but it is better practice to give them meaningful names. There are some restrictions on variable names: they can contain letters, numbers, . and _, but they cannot start with a number or _, and they cannot contain spaces or operator characters such as -. Names that match built-in R objects, such as c, T or mean, should also be avoided.
Naming variables is a matter of taste. Some conventions exist, such as separating words with . or _, or using CamelCaps. Whatever convention you decide on, stick with it!
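A short sketch of why these rules matter. The invalid lines are commented out so that the block still runs, and all names are made up for illustration.

```r
myNumber <- 25          # a valid name
# 2myNumber <- 25       # not allowed: a name cannot start with a number
# my-number <- 25       # not allowed: R reads this as 'my' minus 'number'

# Why re-using built-in names is risky: T is a built-in shorthand for TRUE
T                        # TRUE
T <- 0                   # R allows this, and the shorthand is now masked
masked <- isTRUE(T)      # FALSE: code relying on T would silently misbehave
rm(T)                    # remove our variable so the built-in is visible again
restored <- isTRUE(T)    # TRUE
```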
Basic concepts in R - functions
Functions in R perform operations on arguments (the input(s) to the function). We have already used:
sin(x)
This returns the sine of x
In this case the function has one argument: x.
Arguments are always contained in parentheses – curved brackets, () – separated by commas.
Arguments can be named or unnamed, but if they are unnamed they must be ordered (we will see later how to find the right order). The names of the arguments are determined by the author of the function and can be found in the help page for the function. When testing code, it is easier and safer to name the arguments.
seq is a function for generating a numeric sequence from and to particular numbers.
Type ?seq to get the help page for this function.
seq(from = 2, to = 20, by = 4)
seq(2, 20, 4)
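The difference between named and unnamed arguments can be checked directly. In this sketch the variable names (by_name, reordered, by_position, swapped) are hypothetical; the argument names from, to and by are those documented in ?seq.

```r
# Named arguments can be given in any order
by_name   <- seq(from = 2, to = 20, by = 4)
reordered <- seq(by = 4, to = 20, from = 2)
identical(by_name, reordered)     # TRUE

# Unnamed arguments are matched by position: from, then to, then by
by_position <- seq(2, 20, 4)
identical(by_name, by_position)   # TRUE

# Swapping unnamed values silently changes the meaning
swapped <- seq(4, 20, 2)          # now from = 4, to = 20, by = 2
swapped
```

R raises no error for the swapped call because it is still a valid sequence, which is why naming arguments is the safer habit.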
Arguments can have default values, meaning we do not need to specify values for these in order to run the function.
rnorm is a function that will generate a series of values from a normal distribution. In order to use the function, we need to tell R how many values we want
rnorm(n=10)
The normal distribution is defined by a mean (average) and standard deviation (spread). However, in the above example we didn’t tell R what mean and standard deviation we wanted. So how does R know what to do? All arguments to a function and their default values are listed in the help page
(N.B. sometimes a help page describes more than one function)
?rnorm
In this case, we see that the defaults for mean and standard deviation are 0 and 1. We can change the function to generate values from a distribution with a different mean and standard deviation using the mean and sd arguments. It is important that we get the spelling of these arguments exactly right, otherwise R will give an error message, or (worse?) do something unexpected.
rnorm(n = 10, mean = 2, sd = 3)
rnorm(10, 2, 3)
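One way to convince yourself that the defaults really are used is sketched below. set.seed() is not covered in this section; it is used here only as an assumption to make the random draws repeatable, and the variable names are hypothetical.

```r
# args() prints a function's arguments together with their default values
args(rnorm)

# Leaving mean and sd out...
set.seed(1)
default_draw <- rnorm(n = 5)

# ...gives the same numbers as spelling the defaults out
set.seed(1)
explicit_draw <- rnorm(n = 5, mean = 0, sd = 1)
identical(default_draw, explicit_draw)

# A different mean and sd shifts and stretches the same underlying draws
set.seed(1)
shifted_draw <- rnorm(n = 5, mean = 2, sd = 3)
all.equal(shifted_draw, 2 + 3 * default_draw)
```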
In the examples above, seq and rnorm both output a series of numbers. In R this is called a vector, and it is the most fundamental data type.
Basic concepts in R - vectors
The basic data structure in R is a vector – an ordered collection of values.
R treats even single values as 1-element vectors.
The function c combines its arguments into a vector:
x <- c(3,4,5,6)
x
The square brackets [] indicate the position within the vector (the index).
We can extract individual elements by using the [] notation:
x[1]
x[4]
We can even put a vector inside the square brackets (vector indexing):
Before executing this line of code, what do you think it will produce?
y <- c(2,3)
x[y]
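A few further indexing cases, sketched with the same four-element vector. The index values are arbitrary examples.

```r
x <- c(3, 4, 5, 6)
length(x)        # number of elements: 4

x[c(1, 4)]       # first and last elements: 3 6
x[c(4, 1, 1)]    # any order is allowed, and so are repeats: 6 3 3
x[5]             # indexing past the end gives NA rather than an error
```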
There are a number of shortcuts to create a vector.
Instead of:
x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
x
we can write:
x <- 3:12
x
or we can use the seq() function, which returns a vector:
x <- seq(2, 20, 4)
x
x <- seq(2, 20, length.out=5)
x
or we can use the rep() function:
y <- rep(3, 5)
y
y <- rep(1:3, 5)
y
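Both seq() and rep() take further arguments that are described in their help pages. A brief sketch of some commonly used ones (times, each, a negative by, and length.out), with arbitrary values:

```r
rep(1:3, times = 2)                   # repeat the whole vector: 1 2 3 1 2 3
rep(1:3, each = 2)                    # repeat each element: 1 1 2 2 3 3
seq(10, 1, by = -3)                   # a decreasing sequence: 10 7 4 1
length(seq(0, 1, length.out = 11))    # 11 evenly spaced values from 0 to 1
```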
We have seen some ways of extracting elements of a vector. We can use these shortcuts to make things easier (or more complex!)
x <- 3:12
# Extract elements from x:
x[3:7]
x[seq(2, 6, 2)]
x[rep(3, 2)]
We can add an element to a vector:
y <- c(x, 1)
y
We can glue vectors together:
z <- c(x, y)
z
We can “remove” element(s) from a vector:
NOTE: the vector x doesn’t get modified
we’re just displaying what the vector looks like without particular elements
x <- 3:12
x[-3]
x[-(5:7)]
x[-seq(2, 6, 2)]
x
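To drop elements for good, the result has to be assigned, either to a new variable or back to x. A minimal sketch (without_third is a hypothetical name):

```r
x <- 3:12
without_third <- x[-3]   # a new, shorter vector
length(x)                # 10: x itself is unchanged
length(without_third)    # 9

x <- x[-3]               # assign back to x to really drop the element
x
```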
Finally, we can modify the contents of a vector:
x[6] <- 4
x
x[3:5] <- 1
x
Remember!
Square brackets [ ] for indexing
Parentheses () for function arguments
Basic concepts in R - vector arithmetic
When applying all standard arithmetic operations to vectors, application is element-wise
x <- 1:10
y <- x*2
y
z <- x^2
z
Adding two vectors:
y + z
If vectors are not the same length, the shorter one will be recycled:
x + 1:2
But be careful if the length of the longer vector isn’t a multiple of the length of the shorter one:
x + 1:3
Sometimes R will give a warning message. It has performed the calculation you asked it to, but the results may be unexpected. You need to check the output carefully to make sure it is what you really wanted.
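Recycling can be made explicit with rep(). The sketch below also captures the warning as text using tryCatch(), which is not covered in this course and is used here only to show that a warning is raised; the variable names are hypothetical.

```r
x <- 1:10

# Recycling 1:2 along x is the same as repeating it five times
recycled    <- x + 1:2
spelled_out <- x + rep(1:2, times = 5)
identical(recycled, spelled_out)   # TRUE

# 10 is not a multiple of 3, so R raises a warning; here we capture its message
uneven <- tryCatch(x + 1:3, warning = function(w) conditionMessage(w))
uneven
```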
Basic concepts in R - Character vectors and naming
All the vectors we have seen so far have contained numbers, but we can also store text (“strings”) in a vector
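A minimal sketch of a character vector, and of attaching it to a numeric vector with names(). The gene names and expression values below are invented placeholders, not data from the course.

```r
# Text values are written in quotes
gene.names <- c("geneA", "geneB", "geneC", "geneD")
gene.names

# Made-up expression values, one per gene
gene.expression <- c(0, 3.2, 1.2, -2)

# Assigning to names() labels each element
names(gene.expression) <- gene.names
gene.expression

# Elements can now be picked out by name as well as by position
gene.expression["geneB"]
```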
We can also use the names() function to get a vector of the names of an object:
names(gene.expression)
Exercise: Body-Mass Index
Let’s try some vector arithmetic. Here are the weights and heights of five individuals
Person    Weight (kg)   Height (cm)
Jo        65.8          192
Sam       67.9          179
Charlie   75.3          169
Frankie   61.9          175
Alex      92.4          171
Create weight and height vectors to hold the data in each column using the c function. Create a person vector and use this vector to name the values in the other two vectors.
The body-mass index is given by the formula: BMI = Weight / Height^2, where Height is given in metres (note that the table gives heights in centimetres)
Create a new vector to record this, called bmi.
Create a new vector bmi.sorted where the bmi values are put in increasing numeric order (HINT: look up the help on the sort function)
The interquartile range (IQR) of a vector is defined as the 75th percentile of the data minus the 25th percentile. Calculate the IQR for our bmi values
Check your answer using the IQR function
### YOUR ANSWER HERE (please) ###
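If you are unsure how the helper functions behave, the sketch below tries them on a small made-up vector (toy) rather than on the exercise data. The quantile() function is an assumption here, since the exercise does not name it; see ?quantile.

```r
toy <- c(5, 1, 9, 3, 7)

sort(toy)                                   # values in increasing order

# 25th and 75th percentiles
quartiles <- quantile(toy, probs = c(0.25, 0.75))
quartiles

# Interquartile range by hand; unname() drops the "75%" label
manual_iqr <- unname(quartiles[2] - quartiles[1])
manual_iqr

IQR(toy)                                    # the built-in function for comparison
```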
Getting help
This is possibly the most important slide in the whole course!?!
To get help on any R function, type ? followed by the function name. For example:
?seq
This retrieves the syntax and arguments for the function. The help page shows the default order of arguments. It also tells you which package it belongs to.
There is typically a usage example, which you can test using the example function:
example(seq)
If you can’t remember the exact name, type ?? followed by your guess. R will return a list of possibilities:
??mean
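The ? and ?? shortcuts have function equivalents, and a couple of related helpers return ordinary R objects that you can inspect. A brief sketch, with the help calls wrapped in interactive() so that they only open pages when you are working at the console:

```r
# ?seq and ??mean are shorthand for these two calls
if (interactive()) {
  help("seq")
  help.search("mean")
}

# Print a function's arguments without opening its help page
args(sd)

# List the names of visible objects that contain "mean"
matches <- apropos("mean")
head(matches)
```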
The Packages tab in the lower-right panel of RStudio will help you locate the help pages for a particular package and its functions
Often there will be a user-guide or ‘vignette’ too
R packages
R comes ready loaded with various libraries of functions called packages. For example: the function sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats package
There are thousands of additional packages provided by third parties, and the packages can be found in numerous server locations on the web called repositories
The two repositories you will come across the most are: