Why use R?

Advantages of R

The R programming language is now recognised beyond the academic community as an effective solution for data analysis and visualisation. Notable users of R include Facebook, Google, Microsoft (who recently invested in a commercial provider of R), and the New York Times.

Key features

  • Open-source
  • Cross-platform
  • Access to existing visualisation / statistical tools
  • Flexibility
  • Visualisation and interactivity
  • Add-ons for many fields of research
  • Facilitating Reproducible Research

Support for R

  • Packages analyse all kinds of Genomic data (>800)
  • Compulsory documentation (vignettes) for each package
  • 6-month release cycle
  • Course Materials
  • Example data and workflows
  • Common, re-usable framework and functionality
  • Available Support
    • Often you will be able to interact with the package maintainers / developers and other power-users of the project software
  • Annual conferences in the U.S. and Europe
    • The last European conference was in Cambridge

The Bioconductor project

Many of the packages are by well-respected authors and get lots of citations.

Downloading a package

Each package has its own landing page, e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html. Here you’ll find:

  • Installation script (will install all dependencies)
  • Vignettes and manuals
  • Details of package maintainer
  • After downloading, you can load using the library function. e.g. library(beadarray)
  • Only need to download once for each version of R
  • CRAN packages installed by install.packages
  • What packages to install?

RStudio

  • RStudio is a free environment for R
  • Convenient menus to access scripts, display plots
  • Still need to use command-line to get things done
  • Developed by some of the leading R programmers

The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that are available once that package has been loaded. The library function is used to load a package and make its functions / data available in your current R session. You need to do this every time you start a new R session.

library(beadarray)

There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages.

We will be using the wakefield R package in this practical. To install it, we would run:

install.packages("wakefield")

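Bioconductor packages have their own install script, which you can download from the Bioconductor website. (The biocLite script shown here was the method when this material was written; from Bioconductor 3.8 onwards it has been replaced by the BiocManager package from CRAN.)

source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
# Bioconductor 3.8 and later:
# install.packages("BiocManager"); BiocManager::install("affy")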

A package may have several dependencies: other R packages from which it uses functions or data types (re-using code from other packages is strongly encouraged). If this is the case, the other R packages will be located and installed too.

So long as you stick with the same version of R, you won’t need to repeat this install process.
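As a rough sketch of how you might check what is already installed before repeating an install, the following uses the stats package purely as a stand-in (it ships with every R installation); substitute the package you actually need.

```r
# Which version of R is running? Installed packages are tied to it.
r_version <- R.version.string
print(r_version)

# requireNamespace() returns TRUE if the package can be found, without
# attaching it. "stats" is only a stand-in for illustration.
pkg <- "stats"
is_installed <- requireNamespace(pkg, quietly = TRUE)
print(is_installed)

# Install only when the package is missing (never triggered for "stats")
if (!is_installed) {
  install.packages(pkg)
}

# Version of the installed package
print(packageVersion(pkg))
```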

Bioconductor packages also come with a vignette

  • Often describing the workflow of using the package, and particular use-cases

R has a built-in help system. At the console, you can type ? followed by the name of a function. This will bring up the documentation for the function, which includes the expected inputs (arguments), the output you should expect from the function and some use-cases.

?mean

About the R markdown format

Aside from teaching you about NGS analysis, we also hope to teach you how to work in a reproducible manner. The first step in this process is to master the R markdown format.

Go to File -> New File -> R Markdown in RStudio now.

Press Ok.

The “markdown” file is a template used to generate a report (in pdf, html or doc format). The report is a mix of R code and plain text. All R code gets run and the results appear in the final report.

  1. Header information
  2. Section heading
  3. Plain text
  4. R code to be run
  5. Plain text
  6. R code to be run
  • Each line of R code can be executed in the R console by placing the cursor on the line and pressing CTRL + ENTER.
  • You can also highlight multiple lines of code. NB. You do not need to highlight to the backtick (```) symbols.
  • Hitting the knit HTML button will run all R code in order and (providing there are no errors!) you will get a PDF or HTML document.
    • the first time you try and knit you’ll need to specify a file name
  • The resultant document will contain all the plain text you wrote, the R code, and any outputs (including graphs, tables etc) that R produced. You can then distribute this document to have a reproducible account of your analysis.
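To make the layout of such a file concrete, here is a minimal sketch that writes a tiny R Markdown document from the console. The title, author and file name are made up for illustration, and the render call is left commented out because it assumes the rmarkdown package and pandoc are available.

```r
# The chunk fence is three backticks; it is built with strrep() only so
# that this example can itself be displayed inside a code block.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                        # 1. header information
  "title: \"My first report\"",
  "author: \"Your name\"",
  "output: html_document",
  "---",
  "",
  "## A section heading",                       # 2. section heading
  "",
  "Some plain text describing the analysis.",   # 3. plain text
  "",
  paste0(fence, "{r}"),                         # 4. R code to be run
  "x <- c(1, 5, 9)",
  "mean(x)",
  fence
)

# Write to a temporary file (hypothetical name; use your own in practice)
rmd_file <- file.path(tempdir(), "example-report.Rmd")
writeLines(rmd_lines, rmd_file)

# Knitting from the console is roughly what the Knit button does:
# rmarkdown::render(rmd_file)
```

Opening the resulting file in RStudio should show the same header, text and code-chunk structure as the template.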

How to use the template

  • Most R sessions will have a markdown template that you can modify to complete the exercises
  • We will also tell you what “working directory” you need to be in
    • the working directory is where RStudio will look to read data from, and save data to
  • Change your name, add a title and date in the header section
  • Add notes, explanations of code etc in the white space between code chunks. You can add new lines with ENTER. Clicking the ? next to the Knit HTML button will give more information about how to format this text. You can introduce bold and italics for example.
  • Some code chunks are left blank. These are for you to write the R code required to answer the questions
  • You can try to knit the document at any point to see how it looks

The practical

Working Directory; Session -> Set Working Directory -> Choose Directory

/home/participant/Course_Materials/Day1/
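The same thing can be done from the console with getwd and setwd. In this sketch tempdir() stands in for the course folder so that the code runs anywhere; the file name is made up.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Switch directory; tempdir() is a stand-in for the course folder
setwd(tempdir())
print(getwd())

# Relative file names are now resolved against this directory
writeLines("hello", "wd-demo.txt")
print(file.exists("wd-demo.txt"))

# Tidy up and go back to the original directory
file.remove("wd-demo.txt")
setwd(old_wd)
```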

Template file

/home/participant/Course_Materials/Day1/Session1-template.Rmd

We are going to explore some of the basic features of R using some patient data; the kind of data that we might encounter in the wild. However, rather than using real-life data we are going to make some up. There is a package called wakefield that is particularly convenient for this task.

library(wakefield)

Various patient characteristics can be generated using the package. The following is a function that uses the package to create a data frame with various clinical characteristics. The number of patients we want to simulate is an argument.

Don’t worry about what the function does; you can just paste the following into the R console, or highlight it in the markdown template and press CTRL + ENTER to run.

random_patients <- function(n) {
  as.data.frame(r_data_frame(
    n,
    id,
    name,
    race,
    sex,
    smokes,
    height,
    birth(random = TRUE, x = NULL, start = Sys.Date() - 365 * 45, k = 365*2,by = "1 days"),
    state,
    pet,
    grade_level(x=1:3),
    died,    
    normal(name="Count"),
    date_stamp)
  )
}

We can now use the random_patients function to generate a data frame of fictitious patients

patients <- random_patients(10)

In RStudio, you can view the contents of this data frame in a tab.

View(patients)



Exercise

  • What are the dimensions of the data frame?

  • What columns are available?
  • HINT: see the dim, ncol, nrow and colnames functions




## [1] 10 13
##  [1] "ID"          "Name"        "Race"        "Sex"         "Smokes"     
##  [6] "Height"      "Birth"       "State"       "Pet"         "Grade_Level"
## [11] "Died"        "Count"       "Date"
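The functions in the hint behave the same way on any data frame. A small sketch on a hand-made data frame (the toy values are invented for illustration):

```r
# A tiny, invented data frame: 4 rows, 3 columns
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cat", "Dev"),
  Height = c(65, 70, 68, 72),
  Smokes = c(FALSE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

dim(toy)       # number of rows, then number of columns
nrow(toy)      # rows only
ncol(toy)      # columns only
colnames(toy)  # the column names as a character vector
```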



Exercise

  • Can you think of two ways to access the Names of the patients?
  • What type of object is returned?



##  [1] "Britt"  "Martin" "Young"  "Deon"   "Juan"   "Devon"  "Adam"  
##  [8] "Cary"   "Yong"   "Clyde"
##  [1] "Britt"  "Martin" "Young"  "Deon"   "Juan"   "Devon"  "Adam"  
##  [8] "Cary"   "Yong"   "Clyde"

We can access the columns of a data frame by either

  • knowing the column index
  • knowing the column name

Accessing by column name is recommended, unless you can guarantee the columns will always be in the same order.

TIP Use auto-complete with the TAB key to get the name of the column correct

A vector (1-dimensional) is returned, the length of which is the same as the number of rows in the data frame. The vector could be stored as a variable and itself be subset or used in further calculations

peeps <- patients$Name
peeps
##  [1] "Britt"  "Martin" "Young"  "Deon"   "Juan"   "Devon"  "Adam"  
##  [8] "Cary"   "Yong"   "Clyde"
length(peeps)
## [1] 10
nchar(peeps)
##  [1] 5 6 5 4 4 5 4 4 4 5
substr(peeps,1,3)
##  [1] "Bri" "Mar" "You" "Deo" "Jua" "Dev" "Ada" "Car" "Yon" "Cly"
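For comparison, the different ways of pulling out a column can be checked against each other. This is a sketch on an invented toy data frame, not the course data:

```r
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cat"),
  Height = c(65, 70, 68),
  stringsAsFactors = FALSE
)

by_dollar <- toy$Name        # by name, using $
by_double <- toy[["Name"]]   # by name, using [[ ]]
by_name   <- toy[, "Name"]   # by name, using [row, column]
by_index  <- toy[, 1]        # by position (fragile if columns move)

# All four give the same character vector
identical(by_dollar, by_index)
class(by_dollar)
length(by_dollar)
```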

The summary function is a useful way of summarising the data contained in each column. It will give information about the type of data (remember, data frames can have a mixture of numeric and character columns) and also an appropriate summary. For numeric columns, it will report some statistics about the distribution of the data. For categorical data, it will report how many observations fall into each level.

summary(patients)
##       ID                Name                  Race       Sex   
##  Length:10          Length:10          White    :7   Male  :6  
##  Class :character   Class :character   Hispanic :2   Female:4  
##  Mode  :character   Mode  :character   Black    :1             
##                                        Asian    :0             
##                                        Bi-Racial:0             
##                                        Native   :0             
##                                        (Other)  :0             
##    Smokes            Height         Birth                     State  
##  Mode :logical   Min.   :65.0   Min.   :1972-01-08   New York    :2  
##  FALSE:7         1st Qu.:67.0   1st Qu.:1972-02-18   Pennsylvania:2  
##  TRUE :3         Median :68.0   Median :1972-10-02   Colorado    :1  
##  NA's :0         Mean   :69.4   Mean   :1972-08-14   Georgia     :1  
##                  3rd Qu.:71.0   3rd Qu.:1972-11-10   Indiana     :1  
##                  Max.   :77.0   Max.   :1973-06-11   Missouri    :1  
##                                                      (Other)     :2  
##     Pet    Grade_Level    Died             Count        
##  Dog  :5   1:2         Mode :logical   Min.   :-1.0275  
##  Cat  :3   2:2         FALSE:4         1st Qu.:-0.3792  
##  None :2   3:6         TRUE :6         Median : 0.2978  
##  Bird :0               NA's :0         Mean   : 0.4072  
##  Horse:0                               3rd Qu.: 0.9062  
##                                        Max.   : 2.4957  
##                                                         
##       Date           
##  Min.   :2015-09-23  
##  1st Qu.:2015-12-30  
##  Median :2016-02-23  
##  Mean   :2016-02-28  
##  3rd Qu.:2016-05-15  
##  Max.   :2016-06-23  
## 

Subsetting

A data frame can be subset using square brackets [] placed after the name of the data frame. As a data frame is a two-dimensional object, you need both a row index and a column index, each of which can be a single number or a vector of indices.

Disclaimer 1:- later in the course, we will see a slightly-nicer way of subsetting and filtering which will be useful in some circumstances




Exercise

  • Make sure you can understand the behaviour of the following commands



patients[1,2]
patients[2,1]
patients[c(1,2,3),1]
patients[c(1,2,3),c(1,2,3)]

Note that the data frame is not altered; we are just seeing what a subset of the data looks like, not changing the underlying data. If we wanted to keep the subset, we would need to assign it to a new variable.

patients
##    ID   Name     Race    Sex Smokes Height      Birth        State  Pet
## 1  01  Britt    White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 2  02 Martin    White   Male  FALSE     68 1973-06-11     Colorado None
## 3  03  Young    White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 4  04   Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia  Cat
## 5  05   Juan Hispanic   Male   TRUE     65 1972-09-06     New York  Dog
## 6  06  Devon    White   Male   TRUE     71 1972-11-05     Missouri  Dog
## 7  07   Adam    Black Female  FALSE     68 1973-03-01     New York  Dog
## 8  08   Cary    White   Male  FALSE     66 1972-01-08      Indiana None
## 9  09   Yong    White Female  FALSE     74 1972-01-19         Ohio  Cat
## 10 10  Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania  Dog
##    Grade_Level  Died      Count       Date
## 1            3  TRUE  2.0429659 2015-09-23
## 2            2 FALSE -0.3968310 2015-11-23
## 3            3  TRUE  0.9995400 2015-12-23
## 4            1  TRUE -0.3263563 2016-01-23
## 5            3 FALSE  0.1008862 2016-02-23
## 6            3 FALSE  2.4956672 2016-02-23
## 7            3  TRUE -0.9369304 2016-04-23
## 8            3  TRUE -1.0275402 2016-05-23
## 9            1 FALSE  0.6261153 2016-06-23
## 10           2  TRUE  0.4947157 2016-06-23

Should we wish to see all rows, or all columns, we can omit the row or column index respectively.




Exercise

  • Make sure you can understand the behaviour of the following commands
patients[1,]
patients[,1]
patients[,c(1,2)]
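One detail worth knowing when working through these commands: selecting a single column with [row, column] returns a plain vector rather than a data frame, unless drop = FALSE is given. A sketch on an invented toy data frame:

```r
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cat"),
  Height = c(65, 70, 68),
  stringsAsFactors = FALSE
)

one_col_vector <- toy[, 1]                # a character vector
one_col_frame  <- toy[, 1, drop = FALSE]  # still a data frame, one column
first_row      <- toy[1, ]                # a data frame with one row

class(one_col_vector)
class(one_col_frame)
class(first_row)
```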






Exercise

  • How can we view all information about the first six patients?
  • HINT: head is commonly used to give a snapshot of a data frame. Otherwise, you can use the [row,column] notation.



##   ID   Name     Race    Sex Smokes Height      Birth        State  Pet
## 1 01  Britt    White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 2 02 Martin    White   Male  FALSE     68 1973-06-11     Colorado None
## 3 03  Young    White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 4 04   Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia  Cat
## 5 05   Juan Hispanic   Male   TRUE     65 1972-09-06     New York  Dog
## 6 06  Devon    White   Male   TRUE     71 1972-11-05     Missouri  Dog
##   Grade_Level  Died      Count       Date
## 1           3  TRUE  2.0429659 2015-09-23
## 2           2 FALSE -0.3968310 2015-11-23
## 3           3  TRUE  0.9995400 2015-12-23
## 4           1  TRUE -0.3263563 2016-01-23
## 5           3 FALSE  0.1008862 2016-02-23
## 6           3 FALSE  2.4956672 2016-02-23

Rather than selecting rows based on their numeric index (as in the previous example) we can use what we call a logical test. This is a test that gives either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.

For example, we could compare the Count variable to zero. The result is a vector of TRUE or FALSE values, one for each row in the data frame.

patients$Count < 0
##  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

This R code can be put inside the square brackets.

patients[patients$Count<0, ]
##   ID   Name     Race    Sex Smokes Height      Birth    State  Pet
## 2 02 Martin    White   Male  FALSE     68 1973-06-11 Colorado None
## 4 04   Deon Hispanic   Male  FALSE     77 1972-11-12  Georgia  Cat
## 7 07   Adam    Black Female  FALSE     68 1973-03-01 New York  Dog
## 8 08   Cary    White   Male  FALSE     66 1972-01-08  Indiana None
##   Grade_Level  Died      Count       Date
## 2           2 FALSE -0.3968310 2015-11-23
## 4           1  TRUE -0.3263563 2016-01-23
## 7           3  TRUE -0.9369304 2016-04-23
## 8           3  TRUE -1.0275402 2016-05-23
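Because the test is just a logical vector, it can be stored and inspected before it is used for subsetting. A small sketch with invented values:

```r
toy <- data.frame(
  Name  = c("Ann", "Bob", "Cat", "Dev"),
  Count = c(0.5, -1.2, 2.0, -0.3),
  stringsAsFactors = FALSE
)

is_negative <- toy$Count < 0   # FALSE TRUE FALSE TRUE

sum(is_negative)     # how many rows pass the test (TRUE counts as 1)
which(is_negative)   # the row numbers that pass the test
toy[is_negative, ]   # the rows themselves
```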

If we wanted to know about the patients that had died, we could do:

deceased <- patients[patients$Died == TRUE,]
deceased
##    ID  Name     Race    Sex Smokes Height      Birth        State  Pet
## 1  01 Britt    White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 3  03 Young    White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 4  04  Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia  Cat
## 7  07  Adam    Black Female  FALSE     68 1973-03-01     New York  Dog
## 8  08  Cary    White   Male  FALSE     66 1972-01-08      Indiana None
## 10 10 Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania  Dog
##    Grade_Level Died      Count       Date
## 1            3 TRUE  2.0429659 2015-09-23
## 3            3 TRUE  0.9995400 2015-12-23
## 4            1 TRUE -0.3263563 2016-01-23
## 7            3 TRUE -0.9369304 2016-04-23
## 8            3 TRUE -1.0275402 2016-05-23
## 10           2 TRUE  0.4947157 2016-06-23

In fact, because the Died column already holds TRUE or FALSE values, this is equivalent:

deceased <- patients[patients$Died,]
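The equivalence can be checked directly, and the ! operator gives the opposite selection. A sketch on an invented toy data frame:

```r
toy <- data.frame(
  Name = c("Ann", "Bob", "Cat", "Dev"),
  Died = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

with_test    <- toy[toy$Died == TRUE, ]
without_test <- toy[toy$Died, ]
identical(with_test, without_test)

# ! flips TRUE and FALSE, so this selects the patients who did not die
survivors <- toy[!toy$Died, ]
survivors
```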

The test of equality == also works for text

patients[patients$Race == "White",]
##    ID   Name  Race    Sex Smokes Height      Birth        State  Pet
## 1  01  Britt White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 2  02 Martin White   Male  FALSE     68 1973-06-11     Colorado None
## 3  03  Young White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 6  06  Devon White   Male   TRUE     71 1972-11-05     Missouri  Dog
## 8  08   Cary White   Male  FALSE     66 1972-01-08      Indiana None
## 9  09   Yong White Female  FALSE     74 1972-01-19         Ohio  Cat
## 10 10  Clyde White Female  FALSE     67 1972-03-16 Pennsylvania  Dog
##    Grade_Level  Died      Count       Date
## 1            3  TRUE  2.0429659 2015-09-23
## 2            2 FALSE -0.3968310 2015-11-23
## 3            3  TRUE  0.9995400 2015-12-23
## 6            3 FALSE  2.4956672 2016-02-23
## 8            3  TRUE -1.0275402 2016-05-23
## 9            1 FALSE  0.6261153 2016-06-23
## 10           2  TRUE  0.4947157 2016-06-23



Exercise

  • Can you create a data frame of dog owners?
##    ID  Name     Race    Sex Smokes Height      Birth        State Pet
## 3  03 Young    White Female  FALSE     67 1972-02-10 Pennsylvania Dog
## 5  05  Juan Hispanic   Male   TRUE     65 1972-09-06     New York Dog
## 6  06 Devon    White   Male   TRUE     71 1972-11-05     Missouri Dog
## 7  07  Adam    Black Female  FALSE     68 1973-03-01     New York Dog
## 10 10 Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania Dog
##    Grade_Level  Died      Count       Date
## 3            3  TRUE  0.9995400 2015-12-23
## 5            3 FALSE  0.1008862 2016-02-23
## 6            3 FALSE  2.4956672 2016-02-23
## 7            3  TRUE -0.9369304 2016-04-23
## 10           2  TRUE  0.4947157 2016-06-23



There are a couple of ways of testing for more than one text value. The first uses an or | statement, i.e. testing if the value of Pet is Dog or the value is Cat.

The second uses %in%, a convenient operator for testing which items in a vector correspond to a defined set of values.

patients[patients$Pet == "Dog" | patients$Pet == "Cat",]
##    ID  Name     Race    Sex Smokes Height      Birth        State Pet
## 1  01 Britt    White   Male   TRUE     71 1972-10-28    Wisconsin Cat
## 3  03 Young    White Female  FALSE     67 1972-02-10 Pennsylvania Dog
## 4  04  Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia Cat
## 5  05  Juan Hispanic   Male   TRUE     65 1972-09-06     New York Dog
## 6  06 Devon    White   Male   TRUE     71 1972-11-05     Missouri Dog
## 7  07  Adam    Black Female  FALSE     68 1973-03-01     New York Dog
## 9  09  Yong    White Female  FALSE     74 1972-01-19         Ohio Cat
## 10 10 Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania Dog
##    Grade_Level  Died      Count       Date
## 1            3  TRUE  2.0429659 2015-09-23
## 3            3  TRUE  0.9995400 2015-12-23
## 4            1  TRUE -0.3263563 2016-01-23
## 5            3 FALSE  0.1008862 2016-02-23
## 6            3 FALSE  2.4956672 2016-02-23
## 7            3  TRUE -0.9369304 2016-04-23
## 9            1 FALSE  0.6261153 2016-06-23
## 10           2  TRUE  0.4947157 2016-06-23
patients[patients$Pet %in% c("Dog","Cat"),]
##    ID  Name     Race    Sex Smokes Height      Birth        State Pet
## 1  01 Britt    White   Male   TRUE     71 1972-10-28    Wisconsin Cat
## 3  03 Young    White Female  FALSE     67 1972-02-10 Pennsylvania Dog
## 4  04  Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia Cat
## 5  05  Juan Hispanic   Male   TRUE     65 1972-09-06     New York Dog
## 6  06 Devon    White   Male   TRUE     71 1972-11-05     Missouri Dog
## 7  07  Adam    Black Female  FALSE     68 1973-03-01     New York Dog
## 9  09  Yong    White Female  FALSE     74 1972-01-19         Ohio Cat
## 10 10 Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania Dog
##    Grade_Level  Died      Count       Date
## 1            3  TRUE  2.0429659 2015-09-23
## 3            3  TRUE  0.9995400 2015-12-23
## 4            1  TRUE -0.3263563 2016-01-23
## 5            3 FALSE  0.1008862 2016-02-23
## 6            3 FALSE  2.4956672 2016-02-23
## 7            3  TRUE -0.9369304 2016-04-23
## 9            1 FALSE  0.6261153 2016-06-23
## 10           2  TRUE  0.4947157 2016-06-23
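A quick way to convince yourself that the two approaches agree is to compare the logical vectors they produce. The sketch below uses invented values, and also shows one difference that matters when data are missing:

```r
toy <- data.frame(
  Name = c("Ann", "Bob", "Cat", "Dev", "Eve"),
  Pet  = c("Dog", "Cat", "None", "Bird", "Dog"),
  stringsAsFactors = FALSE
)

using_or <- toy$Pet == "Dog" | toy$Pet == "Cat"
using_in <- toy$Pet %in% c("Dog", "Cat")
identical(using_or, using_in)

# With a missing value, == gives NA but %in% gives FALSE
pets_with_gap <- c("Dog", NA, "Cat")
gap_equal <- pets_with_gap == "Dog"     # TRUE NA FALSE
gap_in    <- pets_with_gap %in% "Dog"   # TRUE FALSE FALSE
```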

Similar to or, we can require that both tests are TRUE by using an and & operation, e.g. to look for white males.

patients[patients$Race == "White" & patients$Sex =="Male",]
##   ID   Name  Race  Sex Smokes Height      Birth     State  Pet Grade_Level
## 1 01  Britt White Male   TRUE     71 1972-10-28 Wisconsin  Cat           3
## 2 02 Martin White Male  FALSE     68 1973-06-11  Colorado None           2
## 6 06  Devon White Male   TRUE     71 1972-11-05  Missouri  Dog           3
## 8 08   Cary White Male  FALSE     66 1972-01-08   Indiana None           3
##    Died     Count       Date
## 1  TRUE  2.042966 2015-09-23
## 2 FALSE -0.396831 2015-11-23
## 6 FALSE  2.495667 2016-02-23
## 8  TRUE -1.027540 2016-05-23



Exercise

  • Can you create a data frame of deceased patients with a ‘count’ < 0?
##   ID Name     Race    Sex Smokes Height      Birth    State  Pet
## 4 04 Deon Hispanic   Male  FALSE     77 1972-11-12  Georgia  Cat
## 7 07 Adam    Black Female  FALSE     68 1973-03-01 New York  Dog
## 8 08 Cary    White   Male  FALSE     66 1972-01-08  Indiana None
##   Grade_Level Died      Count       Date
## 4           1 TRUE -0.3263563 2016-01-23
## 7           3 TRUE -0.9369304 2016-04-23
## 8           3 TRUE -1.0275402 2016-05-23



Ordering and sorting

A vector can be returned in sorted form using the sort function.

sort(peeps)
##  [1] "Adam"   "Britt"  "Cary"   "Clyde"  "Deon"   "Devon"  "Juan"  
##  [8] "Martin" "Yong"   "Young"
sort(patients$Count,decreasing = TRUE)
##  [1]  2.4956672  2.0429659  0.9995400  0.6261153  0.4947157  0.1008862
##  [7] -0.3263563 -0.3968310 -0.9369304 -1.0275402

However, if we want to sort an entire data frame a different approach is needed. The trick is to use order. Rather than giving a sorted set of values, it will give sorted indices.

patients[order(patients$Count),]
##    ID   Name     Race    Sex Smokes Height      Birth        State  Pet
## 8  08   Cary    White   Male  FALSE     66 1972-01-08      Indiana None
## 7  07   Adam    Black Female  FALSE     68 1973-03-01     New York  Dog
## 2  02 Martin    White   Male  FALSE     68 1973-06-11     Colorado None
## 4  04   Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia  Cat
## 5  05   Juan Hispanic   Male   TRUE     65 1972-09-06     New York  Dog
## 10 10  Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania  Dog
## 9  09   Yong    White Female  FALSE     74 1972-01-19         Ohio  Cat
## 3  03  Young    White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 1  01  Britt    White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 6  06  Devon    White   Male   TRUE     71 1972-11-05     Missouri  Dog
##    Grade_Level  Died      Count       Date
## 8            3  TRUE -1.0275402 2016-05-23
## 7            3  TRUE -0.9369304 2016-04-23
## 2            2 FALSE -0.3968310 2015-11-23
## 4            1  TRUE -0.3263563 2016-01-23
## 5            3 FALSE  0.1008862 2016-02-23
## 10           2  TRUE  0.4947157 2016-06-23
## 9            1 FALSE  0.6261153 2016-06-23
## 3            3  TRUE  0.9995400 2015-12-23
## 1            3  TRUE  2.0429659 2015-09-23
## 6            3 FALSE  2.4956672 2016-02-23
patients[order(patients$Sex),]
##    ID   Name     Race    Sex Smokes Height      Birth        State  Pet
## 1  01  Britt    White   Male   TRUE     71 1972-10-28    Wisconsin  Cat
## 2  02 Martin    White   Male  FALSE     68 1973-06-11     Colorado None
## 4  04   Deon Hispanic   Male  FALSE     77 1972-11-12      Georgia  Cat
## 5  05   Juan Hispanic   Male   TRUE     65 1972-09-06     New York  Dog
## 6  06  Devon    White   Male   TRUE     71 1972-11-05     Missouri  Dog
## 8  08   Cary    White   Male  FALSE     66 1972-01-08      Indiana None
## 3  03  Young    White Female  FALSE     67 1972-02-10 Pennsylvania  Dog
## 7  07   Adam    Black Female  FALSE     68 1973-03-01     New York  Dog
## 9  09   Yong    White Female  FALSE     74 1972-01-19         Ohio  Cat
## 10 10  Clyde    White Female  FALSE     67 1972-03-16 Pennsylvania  Dog
##    Grade_Level  Died      Count       Date
## 1            3  TRUE  2.0429659 2015-09-23
## 2            2 FALSE -0.3968310 2015-11-23
## 4            1  TRUE -0.3263563 2016-01-23
## 5            3 FALSE  0.1008862 2016-02-23
## 6            3 FALSE  2.4956672 2016-02-23
## 8            3  TRUE -1.0275402 2016-05-23
## 3            3  TRUE  0.9995400 2015-12-23
## 7            3  TRUE -0.9369304 2016-04-23
## 9            1 FALSE  0.6261153 2016-06-23
## 10           2  TRUE  0.4947157 2016-06-23
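To see what order actually returns, it helps to try it on a short vector first. The values below are invented; the last line is a sketch of sorting by two columns at once, which order supports by taking several keys.

```r
counts <- c(2.5, -1.0, 0.3)

order(counts)                     # 2 3 1: positions of smallest to largest
counts[order(counts)]             # same result as sort(counts)
order(counts, decreasing = TRUE)  # 1 3 2

toy <- data.frame(
  Sex   = c("Male", "Female", "Male", "Female"),
  Count = c(0.5, -1.2, 2.0, -0.3),
  stringsAsFactors = FALSE
)

# Sort by Sex first, then by Count from largest to smallest within each sex
two_keys <- order(toy$Sex, -toy$Count)
toy[two_keys, ]
```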

A final point on data frames is that we can export them out of R once we have done our data processing.

countOrder <- patients[order(patients$Count),]
write.csv(countOrder, file="patientsOrderedByCount.csv")
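By default write.csv also writes the row names as an unnamed first column, which is often unwanted. A minimal sketch, using invented data and a temporary file, of switching that off and reading the file back in:

```r
toy <- data.frame(
  Name  = c("Britt", "Martin"),
  Count = c(2.04, -0.40),
  stringsAsFactors = FALSE
)

# Temporary location for illustration; use your own file name in practice
out_file <- file.path(tempdir(), "toy-patients.csv")

# row.names = FALSE stops the row numbers being written as an extra column
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back to check what was written
back <- read.csv(out_file, stringsAsFactors = FALSE)
back
```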

Simple plotting

All your favourite types of plot can be created in R

Plotting

  • Simple plots are supported in the base distribution of R (what you get automatically when you download R).
    • boxplot, hist, barplot,… all of which are extensions of the basic plot function
  • Many different customisations are possible
    • colour, overlay points / text, legends, multi-panel figures
  • We will show how some of these plots can be used to inform us about the quality of NGS data, and to visualise our results.
  • References..

Disclaimer 2:- later in the course, we will see a slightly-nicer way of plotting

hist(patients$Height)

plot(patients$Height,patients$Count)

barplot(table(patients$Race))

barplot(table(patients$Pet))

boxplot(patients$Count ~ patients$Died)

Lots of customisations are possible to enhance the appearance of our plots: colour, labels, axes, legends.

plot(patients$Height,patients$Count,pch=16,
     col="red",xlab="Height",
     ylab="Count")

boxplot(patients$Count ~ patients$Died,col=c("red","yellow"))
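As a further sketch of customisation, colours can be chosen per point from another column, and a title and legend added. The data here are invented, and the two lines mentioning interactive() are only there so the example also runs as a script without a display.

```r
toy <- data.frame(
  Height = c(65, 70, 68, 72, 66),
  Count  = c(0.5, -1.2, 2.0, -0.3, 1.1),
  Smokes = c(FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Not needed inside RStudio, where plots appear in the Plots tab
if (!interactive()) pdf(NULL)

# One colour per row: red for smokers, blue for non-smokers
point_cols <- ifelse(toy$Smokes, "red", "blue")

plot(toy$Height, toy$Count, pch = 16, col = point_cols,
     xlab = "Height", ylab = "Count", main = "Count against height")

# Explain the colours in a legend placed in the top-right corner
legend("topright", legend = c("Smoker", "Non-smoker"),
       col = c("red", "blue"), pch = 16)

if (!interactive()) invisible(dev.off())
```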

Plots can be exported from the Plots tab in RStudio, or by calling the pdf or png functions, which will write the plot to a file.

png("myLittlePlot.png")
barplot(table(patients$Pet))
dev.off()
## png 
##   2
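The pdf function works the same way, with the page size given in inches. A minimal sketch using invented pet counts and a temporary file name:

```r
# Invented data: a table of pet counts
pet_counts <- table(c("Dog", "Cat", "Dog", "None", "Dog"))

# Temporary location for illustration; use your own file name in practice
plot_file <- file.path(tempdir(), "pets.pdf")

pdf(plot_file, width = 6, height = 4)   # open the file; size in inches
barplot(pet_counts, col = "steelblue", main = "Pets")
dev.off()                               # close the file so it is written

file.exists(plot_file)
```

Forgetting dev.off() is a common cause of empty or unreadable plot files, because the file is only completed when the device is closed.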