The R programming language is now recognised beyond the academic community as an effective solution for data analysis and visualisation. Notable users of R include Facebook, Google, Microsoft (who recently invested in a commercial provider of R), and the New York Times.
Documentation is available from within R using ? or help.start()
Many of the packages are by well-respected authors and get lots of citations.
Each package has its own landing page, e.g. http://bioconductor.org/packages/release/bioc/html/beadarray.html.
Packages are loaded with the library function, e.g. library(beadarray), and installed with install.packages.
The Packages tab in the bottom-right panel of RStudio lists all packages that you currently have installed. Clicking on a package name will show a list of functions that are available once that package has been loaded. The library function is used to load a package and make its functions / data available in your current R session. You need to do this every time you start a new RStudio session.
library(beadarray)
There are functions for installing packages within R. If your package is part of the main CRAN repository, you can use install.packages.
We will be using the wakefield R package in this practical. To install it, we would run:
install.packages("wakefield")
Bioconductor packages have their own install script, which you can download from the Bioconductor website:
source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
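The biocLite script shown above has been retired in more recent Bioconductor releases in favour of the BiocManager package from CRAN. A minimal sketch of the newer route follows, assuming a recent R installation; install_bioc is a hypothetical helper name used only for illustration.

```r
# Sketch only: BiocManager is the installer used by newer Bioconductor releases.
# install_bioc is a hypothetical helper name, not part of any package.
install_bioc <- function(pkgs) {
  # Install the BiocManager package from CRAN if it is not already available
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Install the requested Bioconductor packages and their dependencies
  BiocManager::install(pkgs)
}

# Usage (needs an internet connection, so it is left commented out here):
# install_bioc("affy")
```

Which of the two routes applies depends on the version of R and Bioconductor you have installed.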
A package may have several dependencies: other R packages from which it uses functions or data types (re-using code from other packages is strongly encouraged). If this is the case, the other R packages will be located and installed too.
So long as you stick with the same version of R, you won’t need to repeat this install process.
Bioconductor packages also come with a vignette
R has an in-built help system. At the console, you can type ? followed by the name of a function. This will bring up the documentation for the function, which includes the expected inputs (arguments), the output you should expect from the function and some use-cases.
?mean
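Alongside the help page, the arguments of a function can be inspected directly at the console. The following is a small sketch using only base R; the variable name arg_names is chosen here for illustration.

```r
# Show the argument list of a function without opening its help page
args(mean)

# The argument names can also be extracted as a character vector
arg_names <- names(formals(mean))
arg_names
```

This can be a quick reminder of what a function expects once you have already read its documentation.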
Aside from teaching you about NGS analysis, we also hope to teach you how to work in a reproducible manner. The first step in this process is to master the R markdown format.
Go to File -> New File -> R Markdown in RStudio now.
Press OK.
The “markdown” file is a template used to generate a report (in pdf, html or doc format). The report is a mix of R code and plain text. All R code gets run and the results appear in the final report.
R code in the template can be run by pressing CTRL + ENTER.
Clicking the ? next to the Knit HTML button will give more information about how to format this text. You can introduce bold and italics, for example.
Working Directory: Session -> Set Working Directory -> Choose Directory
/home/participant/Course_Materials/Day1/
Template file
/home/participant/Course_Materials/Day1/Session1-template.Rmd
We are going to explore some of the basic features of R using some patient data; the kind of data that we might encounter in the wild. However, rather than using real-life data, we are going to make some up. There is a package called wakefield that is particularly convenient for this task.
library(wakefield)
Various patient characteristics can be generated using the package. The following is a function that uses the package to create a data frame with various clinical characteristics. The number of patients we want to simulate is an argument.
Don’t worry about what the function does; you can just paste the following into the R console, or highlight it in the markdown template and press CTRL + ENTER to run.
random_patients <- function(n) {
  as.data.frame(r_data_frame(
    n,
    id,
    name,
    race,
    sex,
    smokes,
    height,
    birth(random = TRUE, x = NULL, start = Sys.Date() - 365 * 45, k = 365 * 2, by = "1 days"),
    state,
    pet,
    grade_level(x = 1:3),
    died,
    normal(name = "Count"),
    date_stamp
  ))
}
We can now use the random_patients function to generate a data frame of fictitious patients:
patients <- random_patients(10)
In RStudio, you can view the contents of this data frame in a tab.
View(patients)
What are the dimensions of the data frame?
HINT: see the dim, ncol, nrow and colnames functions
## [1] 10 13
## [1] "ID" "Name" "Race" "Sex" "Smokes"
## [6] "Height" "Birth" "State" "Pet" "Grade_Level"
## [11] "Died" "Count" "Date"
## [1] "Britt" "Martin" "Young" "Deon" "Juan" "Devon" "Adam"
## [8] "Cary" "Yong" "Clyde"
## [1] "Britt" "Martin" "Young" "Deon" "Juan" "Devon" "Adam"
## [8] "Cary" "Yong" "Clyde"
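The functions named in the hint can be tried out on a small hand-made data frame first. The sketch below does not use the simulated data; toy is a hypothetical three-row frame whose values are copied from patients shown in this document.

```r
# A small hand-made data frame (hypothetical name: toy)
toy <- data.frame(
  Name   = c("Britt", "Martin", "Young"),
  Height = c(71, 68, 67),
  Died   = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

dim(toy)       # number of rows, then number of columns
nrow(toy)      # number of rows only
ncol(toy)      # number of columns only
colnames(toy)  # the column names, in order
```

The same four calls work unchanged on the patients data frame.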
We can access the columns of a data frame by either column name or column index.
Accessing by column name is recommended, unless you can guarantee the columns will always be in the same order
TIP Use auto-complete with the TAB key to get the name of the column correct
A vector (1-dimensional) is returned, the length of which is the same as the number of rows in the data frame. The vector could be stored as a variable and itself be subset or used in further calculations
peeps <- patients$Name
peeps
## [1] "Britt" "Martin" "Young" "Deon" "Juan" "Devon" "Adam"
## [8] "Cary" "Yong" "Clyde"
length(peeps)
## [1] 10
nchar(peeps)
## [1] 5 6 5 4 4 5 4 4 4 5
substr(peeps,1,3)
## [1] "Bri" "Mar" "You" "Deo" "Jua" "Dev" "Ada" "Car" "Yon" "Cly"
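To see why access by name is the safer habit, the two approaches can be compared on a frame whose columns are then reordered. This is a sketch on a hypothetical hand-made frame (toy), not on the simulated patients.

```r
# Hypothetical three-row frame with values copied from the patients shown above
toy <- data.frame(
  Name   = c("Britt", "Martin", "Young"),
  Height = c(71, 68, 67),
  stringsAsFactors = FALSE
)

by_name  <- toy$Name   # access by column name
by_index <- toy[, 1]   # access by column position
identical(by_name, by_index)

# After reordering the columns, the name still finds the right data,
# but position 1 now holds the heights
reordered <- toy[, c("Height", "Name")]
reordered$Name
reordered[, 1]
```

Code written with column names keeps working if someone adds or rearranges columns in the input file.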
The summary function is a useful way of summarising the data contained in each column. It will give information about the type of data (remember, data frames can have a mixture of numeric and character columns) and also an appropriate summary. For numeric columns, it will report some statistics about the distribution of the data. For categorical data, it will report the different levels and how often each occurs.
summary(patients)
## ID Name Race Sex
## Length:10 Length:10 White :7 Male :6
## Class :character Class :character Hispanic :2 Female:4
## Mode :character Mode :character Black :1
## Asian :0
## Bi-Racial:0
## Native :0
## (Other) :0
## Smokes Height Birth State
## Mode :logical Min. :65.0 Min. :1972-01-08 New York :2
## FALSE:7 1st Qu.:67.0 1st Qu.:1972-02-18 Pennsylvania:2
## TRUE :3 Median :68.0 Median :1972-10-02 Colorado :1
## NA's :0 Mean :69.4 Mean :1972-08-14 Georgia :1
## 3rd Qu.:71.0 3rd Qu.:1972-11-10 Indiana :1
## Max. :77.0 Max. :1973-06-11 Missouri :1
## (Other) :2
## Pet Grade_Level Died Count
## Dog :5 1:2 Mode :logical Min. :-1.0275
## Cat :3 2:2 FALSE:4 1st Qu.:-0.3792
## None :2 3:6 TRUE :6 Median : 0.2978
## Bird :0 NA's :0 Mean : 0.4072
## Horse:0 3rd Qu.: 0.9062
## Max. : 2.4957
##
## Date
## Min. :2015-09-23
## 1st Qu.:2015-12-30
## Median :2016-02-23
## Mean :2016-02-28
## 3rd Qu.:2016-05-15
## Max. :2016-06-23
##
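The way summary adapts to the type of each column can be seen by calling it on single columns. A minimal sketch on a hypothetical five-row frame (toy), with values copied from the first five patients in this document:

```r
# Hypothetical frame with one numeric, one categorical and one logical column
toy <- data.frame(
  Height = c(71, 68, 67, 77, 65),
  Pet    = factor(c("Cat", "None", "Dog", "Cat", "Dog")),
  Smokes = c(TRUE, FALSE, FALSE, FALSE, TRUE)
)

height_summary <- summary(toy$Height)  # minimum, quartiles, mean, maximum
pet_summary    <- summary(toy$Pet)     # number of rows in each level
smokes_summary <- summary(toy$Smokes)  # counts of FALSE and TRUE

height_summary
pet_summary
smokes_summary
```

Calling summary on the whole data frame simply applies the same logic to every column at once.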
A data frame can be subset using square brackets [] placed after the name of the data frame. As a data frame is a two-dimensional object, you need a row index and a column index, or vectors of indices.
Disclaimer 1: later in the course, we will see a slightly nicer way of subsetting and filtering which will be useful in some circumstances
patients[1,2]
patients[2,1]
patients[c(1,2,3),1]
patients[c(1,2,3),c(1,2,3)]
Note that the data frame is not altered; we are just seeing what a subset of the data looks like, not changing the underlying data. If we wanted to keep the subset, we would need to create a new variable.
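The difference between viewing a subset and keeping one can be checked directly. A small sketch, using a hypothetical hand-made frame (toy) rather than the simulated patients:

```r
# Hypothetical three-row frame with values copied from this document
toy <- data.frame(
  Name  = c("Britt", "Martin", "Young"),
  Count = c(2.0429659, -0.3968310, 0.9995400),
  stringsAsFactors = FALSE
)

toy[1:2, ]   # prints the first two rows; toy itself is untouched
nrow(toy)    # the original still has all of its rows

first_two <- toy[1:2, ]  # assigning to a new variable keeps the subset
nrow(first_two)
```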
patients
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 2 2 FALSE -0.3968310 2015-11-23
## 3 3 TRUE 0.9995400 2015-12-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
## 7 3 TRUE -0.9369304 2016-04-23
## 8 3 TRUE -1.0275402 2016-05-23
## 9 1 FALSE 0.6261153 2016-06-23
## 10 2 TRUE 0.4947157 2016-06-23
Should we wish to see all rows, or all columns, we can omit either the row or column index:
patients[1,]
patients[,1]
patients[,c(1,2)]
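One detail worth knowing when an index is left blank: selecting a single column returns a plain vector by default, not a data frame. The sketch below illustrates this on a hypothetical hand-made frame (toy); drop = FALSE is a standard argument of the square-bracket operator for data frames.

```r
# Hypothetical three-row frame with values copied from this document
toy <- data.frame(
  Name   = c("Britt", "Martin", "Young"),
  Height = c(71, 68, 67),
  stringsAsFactors = FALSE
)

first_row      <- toy[1, ]                 # one row, all columns: a data frame
one_col_vector <- toy[, 1]                 # one column: simplified to a vector
one_col_frame  <- toy[, 1, drop = FALSE]   # one column, kept as a data frame

is.data.frame(first_row)
is.data.frame(one_col_vector)
is.data.frame(one_col_frame)
```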
head is commonly used to give a snapshot of a data frame. Otherwise, you can use the [row,column] notation.
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 2 2 FALSE -0.3968310 2015-11-23
## 3 3 TRUE 0.9995400 2015-12-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
Rather than selecting rows based on their numeric index (as in the previous example) we can use what we call a logical test. This is a test that gives either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.
For example, we could compare the Count variable to zero. The result is a vector of TRUE or FALSE values, one for each row in the data frame:
patients$Count < 0
## [1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
This R code can be put inside the square brackets.
patients[patients$Count<0, ]
## ID Name Race Sex Smokes Height Birth State Pet
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## Grade_Level Died Count Date
## 2 2 FALSE -0.3968310 2015-11-23
## 4 1 TRUE -0.3263563 2016-01-23
## 7 3 TRUE -0.9369304 2016-04-23
## 8 3 TRUE -1.0275402 2016-05-23
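The logical vector produced by a test is an ordinary R object, so it can be stored, counted and converted to positions before being used inside the brackets. A minimal sketch on a hypothetical five-row frame (toy), with Count values copied from the first five patients above:

```r
# Hypothetical frame; values copied from the patients table in this document
toy <- data.frame(
  Name  = c("Britt", "Martin", "Young", "Deon", "Juan"),
  Count = c(2.0429659, -0.3968310, 0.9995400, -0.3263563, 0.1008862),
  stringsAsFactors = FALSE
)

negative <- toy$Count < 0  # one TRUE or FALSE per row
negative
sum(negative)              # TRUE counts as 1, so this is the number of matches
which(negative)            # the row positions where the test is TRUE

toy[negative, ]            # the same rows as toy[toy$Count < 0, ]
```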
If we wanted to know about the patients that had died, we could do;
deceased <- patients[patients$Died == TRUE,]
deceased
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 3 3 TRUE 0.9995400 2015-12-23
## 4 1 TRUE -0.3263563 2016-01-23
## 7 3 TRUE -0.9369304 2016-04-23
## 8 3 TRUE -1.0275402 2016-05-23
## 10 2 TRUE 0.4947157 2016-06-23
In fact, this is equivalent
deceased <- patients[patients$Died,]
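The equivalence holds because the Died column is already a vector of TRUE and FALSE values, so comparing it to TRUE changes nothing. The sketch below checks this on a hypothetical hand-made frame (toy) and also shows the ! operator, which flips the test to select the other group.

```r
# Hypothetical frame; values copied from the patients table in this document
toy <- data.frame(
  Name = c("Britt", "Martin", "Young", "Deon", "Juan"),
  Died = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

with_test    <- toy[toy$Died == TRUE, ]
without_test <- toy[toy$Died, ]
identical(with_test, without_test)

# ! negates a logical vector, selecting the patients who have not died
survivors <- toy[!toy$Died, ]
survivors
```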
The test of equality == also works for text:
patients[patients$Race == "White",]
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 2 2 FALSE -0.3968310 2015-11-23
## 3 3 TRUE 0.9995400 2015-12-23
## 6 3 FALSE 2.4956672 2016-02-23
## 8 3 TRUE -1.0275402 2016-05-23
## 9 1 FALSE 0.6261153 2016-06-23
## 10 2 TRUE 0.4947157 2016-06-23
## ID Name Race Sex Smokes Height Birth State Pet
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 3 3 TRUE 0.9995400 2015-12-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
## 7 3 TRUE -0.9369304 2016-04-23
## 10 2 TRUE 0.4947157 2016-06-23
There are a couple of ways of testing for more than one text value. The first uses an or | statement, i.e. testing if the value of Pet is Dog or the value is Cat.
The %in% function is a convenient function for testing which items in a vector correspond to a defined set of values.
patients[patients$Pet == "Dog" | patients$Pet == "Cat",]
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 3 3 TRUE 0.9995400 2015-12-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
## 7 3 TRUE -0.9369304 2016-04-23
## 9 1 FALSE 0.6261153 2016-06-23
## 10 2 TRUE 0.4947157 2016-06-23
patients[patients$Pet %in% c("Dog","Cat"),]
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 3 3 TRUE 0.9995400 2015-12-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
## 7 3 TRUE -0.9369304 2016-04-23
## 9 1 FALSE 0.6261153 2016-06-23
## 10 2 TRUE 0.4947157 2016-06-23
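The two approaches give the same answer on complete data, but they behave differently when a value is missing. The following sketch uses plain character vectors (pets and pets_na are hypothetical names) rather than the simulated patients.

```r
# Pet values copied from the first five patients in this document
pets <- c("Cat", "None", "Dog", "Cat", "Dog")

with_or <- pets == "Dog" | pets == "Cat"
with_in <- pets %in% c("Dog", "Cat")
identical(with_or, with_in)

# With a missing value, == returns NA for that element, whereas %in% returns FALSE
pets_na <- c("Cat", NA, "Dog")
or_na <- pets_na == "Dog" | pets_na == "Cat"
in_na <- pets_na %in% c("Dog", "Cat")
or_na
in_na
```

The %in% form also stays short as the set of allowed values grows.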
Similar to or, we can require that both tests are TRUE by using an and & operation, e.g. to look for white males.
patients[patients$Race == "White" & patients$Sex =="Male",]
## ID Name Race Sex Smokes Height Birth State Pet Grade_Level
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat 3
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None 2
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog 3
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None 3
## Died Count Date
## 1 TRUE 2.042966 2015-09-23
## 2 FALSE -0.396831 2015-11-23
## 6 FALSE 2.495667 2016-02-23
## 8 TRUE -1.027540 2016-05-23
## ID Name Race Sex Smokes Height Birth State Pet
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## Grade_Level Died Count Date
## 4 1 TRUE -0.3263563 2016-01-23
## 7 3 TRUE -0.9369304 2016-04-23
## 8 3 TRUE -1.0275402 2016-05-23
A vector can be returned in sorted form using the sort function.
sort(peeps)
## [1] "Adam" "Britt" "Cary" "Clyde" "Deon" "Devon" "Juan"
## [8] "Martin" "Yong" "Young"
sort(patients$Count,decreasing = TRUE)
## [1] 2.4956672 2.0429659 0.9995400 0.6261153 0.4947157 0.1008862
## [7] -0.3263563 -0.3968310 -0.9369304 -1.0275402
However, if we want to sort an entire data frame, a different approach is needed. The trick is to use order. Rather than giving a sorted set of values, it gives the indices that would put the values in sorted order, and these can be used as row indices.
patients[order(patients$Count),]
## ID Name Race Sex Smokes Height Birth State Pet
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## Grade_Level Died Count Date
## 8 3 TRUE -1.0275402 2016-05-23
## 7 3 TRUE -0.9369304 2016-04-23
## 2 2 FALSE -0.3968310 2015-11-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 10 2 TRUE 0.4947157 2016-06-23
## 9 1 FALSE 0.6261153 2016-06-23
## 3 3 TRUE 0.9995400 2015-12-23
## 1 3 TRUE 2.0429659 2015-09-23
## 6 3 FALSE 2.4956672 2016-02-23
patients[order(patients$Sex),]
## ID Name Race Sex Smokes Height Birth State Pet
## 1 01 Britt White Male TRUE 71 1972-10-28 Wisconsin Cat
## 2 02 Martin White Male FALSE 68 1973-06-11 Colorado None
## 4 04 Deon Hispanic Male FALSE 77 1972-11-12 Georgia Cat
## 5 05 Juan Hispanic Male TRUE 65 1972-09-06 New York Dog
## 6 06 Devon White Male TRUE 71 1972-11-05 Missouri Dog
## 8 08 Cary White Male FALSE 66 1972-01-08 Indiana None
## 3 03 Young White Female FALSE 67 1972-02-10 Pennsylvania Dog
## 7 07 Adam Black Female FALSE 68 1973-03-01 New York Dog
## 9 09 Yong White Female FALSE 74 1972-01-19 Ohio Cat
## 10 10 Clyde White Female FALSE 67 1972-03-16 Pennsylvania Dog
## Grade_Level Died Count Date
## 1 3 TRUE 2.0429659 2015-09-23
## 2 2 FALSE -0.3968310 2015-11-23
## 4 1 TRUE -0.3263563 2016-01-23
## 5 3 FALSE 0.1008862 2016-02-23
## 6 3 FALSE 2.4956672 2016-02-23
## 8 3 TRUE -1.0275402 2016-05-23
## 3 3 TRUE 0.9995400 2015-12-23
## 7 3 TRUE -0.9369304 2016-04-23
## 9 1 FALSE 0.6261153 2016-06-23
## 10 2 TRUE 0.4947157 2016-06-23
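The relationship between sort and order can be made explicit on a short vector. This is a sketch on a hypothetical five-row frame (toy) with values copied from the patients above; note that Sex is stored as text here, so it sorts alphabetically, whereas in the simulated data it is a factor and sorts by its level order.

```r
# Hypothetical frame; values copied from the patients table in this document
toy <- data.frame(
  Name  = c("Britt", "Martin", "Young", "Deon", "Juan"),
  Sex   = c("Male", "Male", "Female", "Male", "Male"),
  Count = c(2.0429659, -0.3968310, 0.9995400, -0.3263563, 0.1008862),
  stringsAsFactors = FALSE
)

idx <- order(toy$Count)  # positions of the smallest, next smallest, ... value
idx
toy$Count[idx]           # indexing with idx gives the same result as sort()

toy[idx, ]                                  # whole rows, smallest Count first
toy[order(toy$Count, decreasing = TRUE), ]  # largest Count first

# Several columns can be given; later ones break ties in earlier ones
by_sex_then_count <- order(toy$Sex, toy$Count)
toy[by_sex_then_count, ]
```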
A final point on data frames is that we can export them out of R once we have done our data processing.
countOrder <- patients[order(patients$Count),]
write.csv(countOrder, file="patientsOrderedByCount.csv")
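By default write.csv also writes the row names as an extra first column, which may not be wanted in the exported file. The sketch below, on a hypothetical hand-made frame (toy), switches that off with the row.names argument and reads the file back as a check; a temporary file is used so that nothing is left in the working directory.

```r
# Hypothetical three-row frame with values copied from this document
toy <- data.frame(
  Name  = c("Britt", "Martin", "Young"),
  Count = c(2.0429659, -0.3968310, 0.9995400),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the extra column of row names
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back in to confirm what was written
read_back <- read.csv(out_file, stringsAsFactors = FALSE)
read_back
```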
All your favourite types of plot can be created in R: boxplot, hist, barplot, … all of which are extensions of the basic plot function.
Disclaimer 2: later in the course, we will see a slightly nicer way of plotting
hist(patients$Height)
plot(patients$Height,patients$Count)
barplot(table(patients$Race))
barplot(table(patients$Pet))
boxplot(patients$Count ~ patients$Died)
Lots of customisations are possible to enhance the appearance of our plots: colour, labels, axes, legends
plot(patients$Height,patients$Count,pch=16,
col="red",xlab="Height",
ylab="Count")
boxplot(patients$Count ~ patients$Died,col=c("red","yellow"))
Plots can be exported from the Plots tab in RStudio, or by calling the pdf or png functions, which will write the plot to a file. The file is only completed once dev.off() is called.
png("myLittlePlot.png")
barplot(table(patients$Pet))
dev.off()
## png
## 2
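The pdf function follows the same open, plot, close pattern as png. A minimal sketch, assuming the ten Height values printed in the patients table above; the file is written to a temporary location, and heights and plot_file are names chosen here for illustration.

```r
# Height values copied from the patients table in this document
heights <- c(71, 68, 67, 77, 65, 71, 68, 66, 74, 67)

plot_file <- tempfile(fileext = ".pdf")

pdf(plot_file, width = 6, height = 4)  # open the file; size is given in inches
hist(heights, main = "Patient heights", xlab = "Height")
dev.off()                              # close the file so that it is complete

file.exists(plot_file)
```

Any plotting commands issued between pdf and dev.off go to the file and do not appear in the Plots tab.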