Hugo Tavares (hugo.tavares@slcu.cam.ac.uk)
7 Dec 2016
Cancer Research UK Bioinformatics Winter School: Cambridge, 7th-9th December 2016
Session 1 - Data Organisation (in spreadsheets)
Session 2 - Introduction to R and RStudio
Session 3 - Data manipulation in R
Session 4 - Data visualisation in R (short intro - more on Day 2)
We will use material from the lessons developed by the Data Carpentry initiative.
This is not a full Data Carpentry course, just a shortened version of it (but we do run these at the University, if you're interested).
The material for this session is available from the Data Carpentry website (here).
Let's say I recorded some collection dates in the form “Month-Day”.
I didn't bother with the year, because I know what year I'm in right now!
Input these data in a worksheet:
Feb-12
Mar-14
Apr-21
What day, month and year does your spreadsheet program assume?
Be careful, different spreadsheet programs assume different things!
Even different versions of the same software might assume different things!
More about dates here.
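One way to sidestep this ambiguity is to keep year, month and day in separate columns and only assemble a date at analysis time. The following is a minimal R sketch, assuming the three entries above were collected in 2016; that year is an assumption made here for illustration, since the entries themselves do not record it.

```r
# Raw "Month-Day" entries, exactly as typed into the worksheet
dates_raw <- c("Feb-12", "Mar-14", "Apr-21")

# Assumption: the year was never recorded, so we have to state it explicitly
collection_year <- 2016

# Split each entry into its month abbreviation and its day
parts <- strsplit(dates_raw, "-")
month_abbr <- sapply(parts, `[`, 1)
day <- as.integer(sapply(parts, `[`, 2))

# month.abb is R's built-in vector of English month abbreviations
month <- match(month_abbr, month.abb)

# Store the three components in separate columns
collection <- data.frame(year = collection_year, month = month, day = day)

# Assemble unambiguous ISO 8601 dates (YYYY-MM-DD) only when needed
collection$date <- as.Date(sprintf("%04d-%02d-%02d",
                                   collection$year,
                                   collection$month,
                                   collection$day))
collection
```

Because the year is written down rather than guessed by the software, the result no longer depends on which program (or which version) opens the file.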
If possible, store/distribute data in CSV (comma-separated values) format:
species,year,month,day,weight_kg,height_cm
mouse,2014,3,21,2,10
dog,2013,7,2,20,60
cat,2016,12,7,4.2,25
More about exporting data here.
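A CSV file is plain text, so any program can read it back without guessing at formats. As a small sketch, the table above can be read into R, written to a file and read again; the object names (`csv_text`, `animals`, `csv_file`) are chosen here for illustration only.

```r
# The example table from above, as plain CSV text
csv_text <- "species,year,month,day,weight_kg,height_cm
mouse,2014,3,21,2,10
dog,2013,7,2,20,60
cat,2016,12,7,4.2,25"

# Read it into a data frame; keep species as plain text rather than a factor
animals <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(animals)

# Write it to a temporary CSV file (no row names) and read it back
csv_file <- tempfile(fileext = ".csv")
write.csv(animals, csv_file, row.names = FALSE)
animals_again <- read.csv(csv_file, stringsAsFactors = FALSE)

# The round trip should leave the data unchanged
all.equal(animals, animals_again)
```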
The materials for this session are available from the Data Carpentry website (Lesson 1 here).
data.frame object
The materials for this session are available from the Data Carpentry website (Lessons 2-4 here).
From here on we will not follow the Data Carpentry lessons, because we will use base R graphics (the default in R).
However, if you want to learn about an alternative, please check the Data Carpentry lesson on using the ggplot2 package.
Make sure you have the surveys data in the right format (as detailed in this lesson):
library(dplyr)

surveys_complete <- surveys %>%
  filter(species_id != "",          # remove missing species_id
         !is.na(weight),            # remove missing weight
         !is.na(hindfoot_length),   # remove missing hindfoot_length
         sex != "")                 # remove missing sex

## Extract the most common species_id
species_counts <- surveys_complete %>%
  group_by(species_id) %>%
  tally() %>%
  filter(n >= 50) %>%
  select(species_id)

## Only keep the most common species
surveys_complete <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
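To make the two filtering steps easier to follow, here is a rough base R equivalent on a tiny data frame. The `toy` data and the threshold of 2 are invented for illustration; the course code uses the real surveys data and a threshold of 50.

```r
# Toy data (made up): empty strings and NA stand for missing values
toy <- data.frame(
  species_id      = c("DM", "DM", "DM", "PB", "PB", "",  "NL"),
  weight          = c(40,   42,   NA,   30,   31,   20,  150),
  hindfoot_length = c(35,   36,   34,   26,   NA,   20,  32),
  sex             = c("M",  "F",  "M",  "F",  "M",  "M", ""),
  stringsAsFactors = FALSE
)

# Step 1: keep only rows with no missing species_id, weight,
# hindfoot_length or sex
keep <- toy$species_id != "" &
  !is.na(toy$weight) &
  !is.na(toy$hindfoot_length) &
  toy$sex != ""
toy_complete <- toy[keep, ]

# Step 2: count rows per species and keep the species that reach the threshold
min_n <- 2  # hypothetical threshold for the toy data
counts <- table(toy_complete$species_id)
common_species <- names(counts)[as.vector(counts) >= min_n]

# Step 3: keep only rows belonging to those common species
toy_complete <- toy_complete[toy_complete$species_id %in% common_species, ]
toy_complete
```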
Let's say we want to see if there is a correlation between the animal's weight and its hindfoot length.
We can use the plot() function for this, indicating which values we want to be assigned to the x and y axes.
plot(x = surveys_complete$weight, y = surveys_complete$hindfoot_length)
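The scatter plot gives a visual impression of the relationship; the cor() function puts a number on it. The sketch below uses R's built-in iris data as a stand-in so that it runs on its own. With the course data, the same call would presumably take surveys_complete$weight and surveys_complete$hindfoot_length instead.

```r
# Stand-in data: two numeric columns from the built-in iris data set
x <- iris$Petal.Length
y <- iris$Petal.Width

# Same kind of scatter plot as above
plot(x = x, y = y)

# Pearson correlation coefficient (between -1 and 1);
# use = "complete.obs" ignores any pair that contains an NA
r <- cor(x, y, use = "complete.obs")
r
```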
Challenge: how can we change the axis labels? Check the help for the function with ?plot
plot(surveys_complete$weight, surveys_complete$hindfoot_length,
     xlab = "Weight (g)", ylab = "Hindfoot length (mm)",
     pch = 16, col = "seagreen")
What do the pch and col arguments do?
Many of these graphical parameters can be found in the par function (see its help: ?par).
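A quick way to explore these two arguments is to draw every standard plotting symbol next to its code. This is only an illustrative sketch; the layout choices (label position, fill colour) are arbitrary.

```r
# pch selects the plotting symbol; the standard codes run from 0 to 25
codes <- 0:25

# Draw one point per code in a single row; col sets the symbol colour,
# bg sets the fill colour of the symbols that can be filled (pch 21 to 25)
plot(x = codes, y = rep(1, length(codes)),
     pch = codes, col = "seagreen", bg = "grey80",
     ylim = c(0.5, 1.5), yaxt = "n",
     xlab = "pch value", ylab = "")

# Label each symbol with its pch code
text(x = codes, y = 1.2, labels = codes, cex = 0.7)

# col accepts colour names; colors() lists all the names R knows
head(colors())
"seagreen" %in% colors()
```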
We would like to know what kind of distribution our data follow.
A typical way to display this information is using a histogram:
hist(surveys_complete$hindfoot_length, breaks = 100, xlab = "Hindfoot length (mm)")
What does this distribution suggest?
The previous histogram suggests that we have several distributions, most likely because we have hindfoot lengths from different species. We can visualise the distribution of hindfoot length within each species using a boxplot.
The syntax for this is special:
boxplot(surveys_complete$hindfoot_length ~ surveys_complete$species_id)
We will learn more about the special '~' tomorrow, but for now it means that we want to plot the hindfoot length split by species id.
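The formula can also be written with bare column names plus a data argument, which avoids repeating the name of the data frame. The sketch below uses the built-in iris data so that it runs on its own; with the course data the equivalent call would presumably be boxplot(hindfoot_length ~ species_id, data = surveys_complete).

```r
# Formula: numeric variable ~ grouping variable,
# with both columns looked up in the data frame given to `data`
b <- boxplot(Petal.Length ~ Species, data = iris,
             xlab = "Species", ylab = "Petal length (cm)")

# boxplot() invisibly returns the summary it has just drawn
b$names  # the group labels, one per box
b$n      # the number of observations in each group
```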
boxplot(surveys_complete$hindfoot_length ~ surveys_complete$species_id, pch = 16, las = 2)
Why are there so many species with no data?
(Also notice the new las argument; check ?par to learn more about it.)
Even though we filtered our data earlier, the levels of our factor variables remained unchanged.
levels(surveys_complete$species_id)
[1] "AB" "AH" "AS" "BA" "CB" "CM" "CQ" "CS" "CT" "CU" "CV" "DM" "DO" "DS"
[15] "DX" "NL" "OL" "OT" "OX" "PB" "PC" "PE" "PF" "PG" "PH" "PI" "PL" "PM"
[29] "PP" "PU" "PX" "RF" "RM" "RO" "RX" "SA" "SC" "SF" "SH" "SO" "SS" "ST"
[43] "SU" "UL" "UP" "UR" "US" "ZL"
levels(surveys$species_id)
[1] "AB" "AH" "AS" "BA" "CB" "CM" "CQ" "CS" "CT" "CU" "CV" "DM" "DO" "DS"
[15] "DX" "NL" "OL" "OT" "OX" "PB" "PC" "PE" "PF" "PG" "PH" "PI" "PL" "PM"
[29] "PP" "PU" "PX" "RF" "RM" "RO" "RX" "SA" "SC" "SF" "SH" "SO" "SS" "ST"
[43] "SU" "UL" "UP" "UR" "US" "ZL"
If we want to remove these levels, we can use the function droplevels():
surveys_complete <- droplevels(surveys_complete)
levels(surveys_complete$species_id)
[1] "DM" "DO" "DS" "NL" "OL" "OT" "PB" "PE" "PF" "PM" "PP" "RF" "RM" "SH"
boxplot(surveys_complete$hindfoot_length ~ surveys_complete$species_id, pch = 16, las = 1)
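The same behaviour can be reproduced on a small scale with the built-in iris data, used here only as a stand-in. Note that the empty levels appear only when species_id is a factor: since R 4.0.0, read.csv() no longer converts text columns to factors by default, so on a recent R version the column may be plain character, in which case levels() returns NULL and the empty boxes do not appear.

```r
# Keep two of the three species in the built-in iris data
iris_subset <- iris[iris$Species != "setosa", ]

levels(iris_subset$Species)  # still lists all three levels
table(iris_subset$Species)   # the filtered-out level has a count of 0

# Drop the levels that no longer occur in the data
iris_subset <- droplevels(iris_subset)
levels(iris_subset$Species)  # only the levels that are still present

# A character vector has no levels at all
levels(as.character(iris_subset$Species))
```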