--- title: "Introduction to Solving Biological Problems Using R - Day 1" author: Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy, John Davey, Dávid Molnár and Ian Roberts date: '`r format(Sys.time(), "Last modified: %d %b %Y")`' output: html_notebook: toc: yes toc_float: yes --- # 3. R for data analysis ##3 steps to Basic Data Analysis - In this short section, we show how the data manipulation steps we have just seen can be used as part of an analysis pipeline: 1. Reading in data + `read.table()` + `read.csv(), read.delim()` 2. Analysis + Manipulating & reshaping the data + perhaps dealing with "missing data" + Any maths you like + Diagnostic Plots 3. Writing out results + `write.table()` + `write.csv()` ## A simple walkthrough - We have data from 100 patients that given consent for their data to use in future studies - A researcher wants to undertake a study involving people that are overweight - We will walkthrough how to filter the data and write a new file with the candidates for the study ##The Working Directory (wd) - Like many programs R has a concept of a working directory - It is the place where R will look for files to execute and where it will save files, by default - For this course we need to set the working directory to the location of the course scripts - In RStudio use the mouse and browse to the directory where you saved the Course Materials - ***Session → Set Working Directory → Choose Directory...*** ## 0. Locate the data Before we even start the analysis, we need to be sure of where the data are located on our hard drive - Functions that import data need a file location as a character vector - The default location is the ***working directory*** ```{r} getwd() ``` - If the file you want to read is in your working directory, you can just use the file name ```{r eval=FALSE} list.files() ``` - The `file.exists` function does exactly what it says on the tin! + a good sanity check for your code ```{r} file.exists("patient-info.txt") ``` - Otherwise you need the *path* to the file + you can get this using **`file.choose()`** - If you unsure about specifying a file path at the command line, this [online tutorial](http://rik.smith-unna.com/command_line_bootcamp/?id=vczhybjhtyt) will give you hands-on practice ##1. Read in the data - The data are a tab-delimited file. Each row is a record, each column is a field. Columns are separated by tabs in the text - We need to read in the results and assign it to an object (`patients`) ```{r} patients <- read.delim("patient-info.txt") ``` In the latest RStudio, there is the option to import data directly from the File menu. ***File*** -> ***Import Dataset*** -> ***From Csv*** - If the data are comma-separated, then use either the argument `sep=","` or the function `read.csv()`: - You need to make sure you use the correct function + can you explain the output of the following lines of code? ```{r } tmp <- read.csv("patient-info.txt") head(tmp) ``` - For full list of arguments: ```{r} ?read.table ``` ##1b. Check the data - *Always* check the object to make sure the contents and dimensions are as you expect - R will sometimes create the object without error, but the contents may be un-usable for analysis + If you specify an incorrect separator, R will not be able to locate the columns in your data, and you may end up with an object with just one column ```{r} # View the first 10 rows to ensure import is OK patients[1:10,] ``` - or use the `View()` function to get a display of the data in RStudio: ```{r} View(patients) ``` ##1c. Understanding the object - Once we have read the data successfully, we can start to interact with it - The object we have created is a *data frame*: ```{r} class(patients) ``` - We can query the dimensions: ```{r} ncol(patients) nrow(patients) dim(patients) ``` - The names of the columns are automatically assigned: ```{r} colnames(patients) ``` - We can use any of these names to access a particular column: + and create a vector + TOP TIP: type the name of the object and hit TAB: you can select the column from the drop-down list! ```{r} patients$ID ``` ## Word of warning ![](images/tolstoy.jpg) ![](images/hadley.jpg) > Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham - RStudio chief scientist and author of dplyr, ggplot2 and others) You will make your life a lot easier if you keep your data **tidy** and ***organised***. Before blaming R, consider if your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them - http://www.datacarpentry.org/spreadsheet-ecology-lesson/ - http://kbroman.org/dataorg/ ##Handling missing values - The data frame contains some **`NA`** values, which means the values are missing – a common occurrence in real data collection - `NA` is a special value that can be present in objects of any type (logical, character, numeric etc) - `NA` is not the same as `NULL`: - `NULL` is an empty R object. - `NA` is one missing value within an R object (like a data frame or a vector) - Often R functions will handle `NA`s gracefully: ```{r} length(patients$Height) mean(patients$Height) ``` - However, sometimes we have to tell the functions what to do with them. - R has some built-in functions for dealing with `NA`s, and functions often have their own arguments (like `na.rm`) for handling them: + annoyingly, different functions have different argument names to change their behaviour with regards to `NA` values. *Always check the documentation* ```{r} mean(patients$Height, na.rm = TRUE) mean(na.omit(patients$Height)) ``` ##2. Analysis (reshaping data and maths) - Our analysis involves identifying patients with extreme BMI + we will define this as being two standard deviations from the mean ```{r} # Create an index of results: BMI <- (patients$Weight)/((patients$Height/100)^2) upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE) upper.limit ``` - We can plot a simple chart of the BMI values + add a vertical line to indicate the cut-off + plotting will be covered in detail shortly.. ```{r} plot(BMI) # Add a horizonal line: abline(h=upper.limit) ``` - It is also useful to save the variable we have computed as a new column in the data frame ```{r} round(BMI,1) patients$BMI <- round(BMI,1) head(patients) ``` - To actually select the candidates we can use a logical expression to test the values of the BMI vector being greater than the upper limit + if the second line looks a bit weird, remember that `<-` is doing an assignment. Thevalue we are assigning to our new variable is the logical (`TRUE` or `FALSE`) vector given by testing each item in `BMI` against the `upper.limit` ```{r} BMI > upper.limit candidates <- BMI > upper.limit ``` We have seen that a logical vector can be used to subset a data frame - However, in our case the result looks a bit funny - Can you think why this might be? ```{r} patients[candidates,] ``` The `which` function will take a logical vector and return the indices of the `TRUE` values - This can then be used to subset the data frame ```{r} which(BMI > upper.limit) candidates <- which(BMI > upper.limit) ``` ## 3. Outputting the results - We write out a data frame of candidates (patients with BMI more than standard deviations from the mean) as a 'comma separated values' text file (CSV): ```{r} write.csv(patients[candidates,], file="selectedSamples.csv") ``` - The output file is directly-readable by Excel - It's often helpful to double check where the data has been saved. Use the *get working directory* function: ```{r eval=FALSE} getwd() # print working directory list.files() # list files in working directory ``` To recap, the set of R commands we have used is:- ```{r} patients <- read.delim("patient-info.txt") BMI <- (patients$Weight)/((patients$Height/100)^2) upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE) plot(BMI) # Add a horizonal line: abline(h=upper.limit) patients$BMI <- round(BMI,1) candidates <- which(BMI > upper.limit) write.csv(patients[candidates,], file="selectedSamples.csv") ``` ##Exercise: Exercise 3 - A separate study is looking for patients that are underweight and also smoke; + Modify the condition in our previous code to find these patients + e.g. having BMI that is 2 standard deviations *less* than the mean BMI + Write out a results file of the samples that match these criteria, and open it in a spreadsheet program ```{r} ### Your Answer Here ### ```