Data Organisation and Introduction to R

Hugo Tavares (hugo.tavares@slcu.cam.ac.uk)
7 Dec 2016

Cancer Research UK Bioinformatics Winter School: Cambridge, 7th-9th December 2016

Outline of Day 1

  • Session 1 - Data Organisation (in spreadsheets)

  • Session 2 - Introduction to R and RStudio

  • Session 3 - Data manipulation in R

  • Session 4 - Data visualisation in R (short intro - more on Day 2)

We will use material from the lessons developed by the Data Carpentry initiative.

This is a shortened version, not a full Data Carpentry course (but we do run full courses at the University, if you're interested).

Data Organisation (in spreadsheets)

  • Recognise and apply principles and good practices for organising your data
  • Recognise common pitfalls when recording data (and learn to avoid them!)
  • Understand file formats and their suitability for other programs (e.g. R)

The material for this session is available from the Data Carpentry website.

Data Organisation - be cautious with dates

Let's say I recorded some collection dates in the form “Month-Day”.

I didn't bother with the year, because I know what year I'm in right now!

Input these data in a worksheet:

Feb-12
Mar-14
Apr-21

What day, month and year does your spreadsheet program assume?
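
One way to avoid the ambiguity is to record year, month and day explicitly. Below is a minimal sketch in R, assuming the dates above were collected in 2016 (the year is our assumption, it is not recorded in the data). The names `collected`, `dates` and `iso` are illustrative only.

```r
# The "Month-Day" strings as they were recorded
collected <- c("Feb-12", "Mar-14", "Apr-21")

# Split each string into its month name and day
parts <- strsplit(collected, "-", fixed = TRUE)
month_name <- vapply(parts, `[`, character(1), 1)
day <- as.integer(vapply(parts, `[`, character(1), 2))

# Store year, month and day in separate columns
# (the year 2016 is an assumption made for this example)
dates <- data.frame(
  year  = 2016L,
  month = match(month_name, month.abb),  # month.abb holds "Jan", "Feb", ...
  day   = day
)

# Build unambiguous ISO 8601 strings (YYYY-MM-DD) and convert them to dates
iso <- sprintf("%04d-%02d-%02d", dates$year, dates$month, dates$day)
iso
as.Date(iso)
```

Keeping the three components in separate columns means no program has to guess the missing part.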

Data visualisation in R - boxplot

Even though we filtered our data earlier, the levels of our factor variables remained unchanged.

levels(surveys_complete$species_id)
 [1] "AB" "AH" "AS" "BA" "CB" "CM" "CQ" "CS" "CT" "CU" "CV" "DM" "DO" "DS"
[15] "DX" "NL" "OL" "OT" "OX" "PB" "PC" "PE" "PF" "PG" "PH" "PI" "PL" "PM"
[29] "PP" "PU" "PX" "RF" "RM" "RO" "RX" "SA" "SC" "SF" "SH" "SO" "SS" "ST"
[43] "SU" "UL" "UP" "UR" "US" "ZL"
levels(surveys$species_id)
 [1] "AB" "AH" "AS" "BA" "CB" "CM" "CQ" "CS" "CT" "CU" "CV" "DM" "DO" "DS"
[15] "DX" "NL" "OL" "OT" "OX" "PB" "PC" "PE" "PF" "PG" "PH" "PI" "PL" "PM"
[29] "PP" "PU" "PX" "RF" "RM" "RO" "RX" "SA" "SC" "SF" "SH" "SO" "SS" "ST"
[43] "SU" "UL" "UP" "UR" "US" "ZL"
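
The same behaviour can be seen on a small made-up factor. This is only a sketch: the vector `species` and its values are invented for illustration and are not the survey data.

```r
# A small factor with four levels (toy data)
species <- factor(c("DM", "DO", "PB", "DM", "ZL"))

# Filter out every "ZL" observation
kept <- species[species != "ZL"]

# "ZL" is still listed as a level, even though no value uses it
levels(kept)

# table() shows the unused level with a count of zero
table(kept)
```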

If we want to remove the unused levels, we can use the function droplevels():

surveys_complete <- droplevels(surveys_complete)
levels(surveys_complete$species_id)
 [1] "DM" "DO" "DS" "NL" "OL" "OT" "PB" "PE" "PF" "PM" "PP" "RF" "RM" "SH"
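
As a quick check of what droplevels() does to a whole data frame, here is a sketch on invented data (the data frame `toy` and its values are hypothetical, not taken from the surveys):

```r
# Toy data frame: one species has a missing weight
toy <- data.frame(
  species_id = factor(c("DM", "DO", "PB", "ZL")),
  weight     = c(40, 48, 30, NA)
)

# Keep only the rows with a recorded weight
toy_complete <- toy[!is.na(toy$weight), ]
nlevels(toy_complete$species_id)  # the level "ZL" is still counted

# droplevels() removes unused levels from every factor column
toy_complete <- droplevels(toy_complete)
nlevels(toy_complete$species_id)
levels(toy_complete$species_id)
```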

Data visualisation in R - boxplot

boxplot(surveys_complete$hindfoot_length ~ surveys_complete$species_id, pch = 16, las = 1)

[Figure: boxplot of hindfoot_length by species_id]
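
The same kind of plot can also be written with the `data` argument of boxplot(), which avoids repeating the data frame name. The sketch below uses simulated values (the data frame `toy` and the means passed to rnorm() are made up for illustration), so the figure it draws will not match the one above.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated hindfoot lengths for three species (not real survey data)
toy <- data.frame(
  species_id      = factor(rep(c("DM", "DO", "PB"), each = 20)),
  hindfoot_length = c(rnorm(20, 36, 1.5), rnorm(20, 35, 1.5), rnorm(20, 26, 1.5))
)

# Formula interface: response ~ grouping variable, looked up in `data`
# pch = 16 draws outliers as filled circles; las = 1 keeps axis labels horizontal
bp <- boxplot(hindfoot_length ~ species_id, data = toy, pch = 16, las = 1)

# boxplot() invisibly returns the statistics it plotted
bp$names  # group labels, one per box
bp$n      # number of observations in each group
```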