Programming with data using R
Overview
- R for the MATLAB user
Basics
Special types
- Categorical variables (factor class)
- Data tables (the R “data frame”)
Key-value pairs
Interacting with files on your hard drive (I/O) and objects in working memory
- Shell operations
- Working memory
Namespaces
Graphics
A taste of higher-order programming (function mapping example)
Troubleshooting and debugging
Seeking help

Programming with data using R

In order to perform the data analysis, we need a tool to help us

clean and transform the data,
visualize our data to generate hypotheses and confirm that assumptions required for statistical tests are met (e.g., normality), and
apply the tests and harvest results.

R is such a tool that was created for “Programming with Data”, and will be covered in this lesson.

We will cover some basic concepts of R programming in this course, with an emphasis on demonstrating what you can (or should) do with a programming language. Once you know this, you can look up how to perform a specific task.
Many demonstrations and example code will be provided. If you find a code block that accomplishes what you need, then you can use it directly or read up on the function or functions to adapt the block to your needs.
Everyone has a different level of background in programming, but it is possible to find appropriate resources at every level.
Find the resource that speaks to your background and way of thinking (e.g., some tutorials are aimed at engineers, while others are aimed at computer scientists, statisticians, economitricians, psychologists, and so on).

We will require each student to learn a lot on his or her own as it is not possible to exhaustively cover every function or operation used in our exmaples. As you will see, the actual number of resources to learn R is endless (just type “introduction to R” in Google), which then leads to a tyranny of choice and decision fatigue. We might recommend a very few resources:

Introduction to language basics:
- (very) short introduction to R
- An Introduction to R
- R Inferno
Working with data
- Modern Applied Statistics with S
- Data Manipulation with R

Rintro Rmanip

One strategy (for learning a new topic in general) is to skim the contents of many resources and identify:

common topics that come up in every tutorial
topics that are of interest for solving your problem
authors that explain concepts using analogies that are relevant to you or your prior training

We will present a few examples to initiate a user generally familiar with a limited amount of programming. Basic classifications to keep in mind.

Important “data types” (objects):

data frame
lists/vectors (ordered/labeled collection of objects)

Everything in R is technically a “vector” with different or additional attributes such as storage mode, dimension, etc.

Important classes of operations:

data selection
data merging
arithmetic operations
statistical analysis
graphics
text processing
shell scripting
flow control

Important features when working with real data:

handling of missing values

Overview

Rlogo

R is a programming language
R is a statistical package
R is object-oriented
R is derived from S, a predecessor language used by the statistics community (overall has 30+ year history)
R is available through GNU Public License
R is used professionally in academia and industry; particularly in machine learning communities

One of its core-strengths is the collection of user-contributed libraries. You will find that R permits many ways to do the same thing. Find the way that works best for you (and your colleagues), and stick with it.

R for the MATLAB user

Superficial differences:

Comment character is # rather than %.
Arrows <- are assignment operators. (Can also use equal sign in most places.)
*, /, ^ operate element-wise (like .*, ./, .^ in MATLAB).
Functions do not have to be defined in separate files.
Most operations are performed by calling a function on arguments. Functional calls are very forgiving, and arguments can be specified by 1) order provided or 2) partial matching of argument names.
```
> divide <- function(numerator,denominator) numerator / denominator
> divide(1,2)
[1] 0.5
> divide(denominator=1,numerator=2)
[1] 2
> divide(d=1,2)
[1] 2
```
Same control structures exist—if, for, while, etc.—but braces {} denote extent of expressions, rather than end statements.

Check out:

Basics

For the rest of the examples, we will assume the following libraries have been imported and options set:

library(tidyverse)
library(chron)

source("functions_extra.R")

Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots

These libraries should be present on the machines in room GRB001. On your own machine, you can install necessary packages from the web:

install.packages("packagename",repos="http://stat.ethz.ch/CRAN/")

or, for multiple packages at once,

install.packages(c("packagename1","packagename2"),repos="http://stat.ethz.ch/CRAN/")

and so on.

In the assignment of values to symbols/variables, <- or = can be used. The former method is recommended.

```r
x <- 1   # scalar (a vector of length 1)
x = 1
x <- 1:5 # vector
x = 1:5
```

Functions accept a set of arguments (inputs) and return a value (output). Even binary operators can be written as a function in prefix notation – e.g., x+1is equivalent to ```+``(x,1).

Variables (data types, objects)

Variables are symbols (e.g., x, y) that represent a set of values. These values can take on one of several types.

simple (“atomic”) data types: logical (Boolean), numeric (integer, real/double), complex, character (string), and raw
complex structures can be composed of atomic data types: list, data.frame, factors

You can convert among data types with as: as.data.frame, as.list, as.numeric, as.character, and so on.

Lists are like cell arrays or structures in MATLAB.

As with MATLAB, R has vectorized operations. Vectorized operations are applicable for atomic data types, but not for “list”. Example:

x <- 1:5
y <- x+1
print(y)

## [1] 2 3 4 5 6

mode(y)

## [1] "numeric"

typeof(y)

## [1] "double"

You can check the mode of your data type with mode() or typeof().

Apart from the mode, objects can have dimensionality: 1-D (vector), 2-D (matrix), N-D (array).

Special data types:

factor (categorical variable) (1-D)
data frame (relational table) (2-D)

Creating instances of vectors, matrices, lists, etc.

c concatenates elements and is often used to construct a vector.

(char <- c("ab","d"))      # concatenate elements (character)

## [1] "ab" "d"

(v <- c(1,3,5))           # concatenate elements (numeric)

## [1] 1 3 5

Other objects are created by functions that are named after the object.

(m <- matrix(1:6,ncol=2)) # create a matrix

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

(l <- list(1,v,m))        # define a list

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1 3 5
## 
## [[3]]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

In contrast with MATLAB, Note that a single element in a character vector in R can contain multiple letters.

Objects

An object defines both the 1) data type and 2) operations that are allowed on them, and is labeled by a “class”.

Check the class of an R object by class(), and the functions allowed with methods():

x <- 1
print(class(x))

## [1] "numeric"

methods(class=class(x))

##  [1] all.equal     as_factor     as_mapper     as.data.frame as.Date      
##  [6] as.POSIXct    as.POSIXlt    as.raster     coerce        Compare      
## [11] full_seq      months        Ops           recode        scale_type   
## see '?methods' for accessing help and source code

Alternatively, you can query what type of objects a function can operate on. For instance, we have a function called mean():

methods(mean)

## [1] mean.Date        mean.default     mean.difftime    mean.POSIXct    
## [5] mean.POSIXlt     mean.quosure*    mean.times*      mean.vctrs_vctr*
## see '?methods' for accessing help and source code

In R, all variables are objects and all operations are functions. All functions are also objects.

Labeling elements

Elements can have optional labels.

Vectors, lists can have names.

(v <- c(a=1, b=3, c=5))

## a b c 
## 1 3 5

v["a"]

## a 
## 1

names(l) <- names(v)
print(l)

## $a
## [1] 1
## 
## $b
## [1] 1 3 5
## 
## $c
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Matrices can have column and row names.

colnames(m) <- letters[1:ncol(m)]
rownames(m) <- letters[1:nrow(m)]
print(m)

##   a b
## a 1 4
## b 2 5
## c 3 6

Text processing

Processing strings of characters is a relevant part of data analysis.

Manipulate strings:

date <- "2012.03.01"
strsplit(date, ".", fixed=TRUE)

## [[1]]
## [1] "2012" "03"   "01"

paste("2012", "03", "01", sep=".")

## [1] "2012.03.01"

sprintf("%d.%02d.%02d",2012, 3, 1)

## [1] "2012.03.01"

Search for strings:

dates <- c("2012.03.01","2013.03.01")
grep("2012", date, fixed=TRUE)

## [1] 1

grepl("2012", date, fixed=TRUE)

## [1] TRUE

grep("2012", date, value=TRUE, fixed=TRUE)

## [1] "2012.03.01"

To delve deeper, you will eventually want to look into regular expressions.

pattern <- "([0-9]{4})\\.([0-9]{2})\\.([0-9]{2})"
sub(pattern, "\\1", dates)

## [1] "2012" "2013"

sub(pattern, "\\2", dates)

## [1] "03" "03"

sub(pattern, "\\3", dates)

## [1] "01" "01"

Useful example:

data <- read.table("data/2013/LAU.csv", skip=5, sep=";", header=TRUE, check.names=FALSE)
names(data)

## [1] "Date/time"          "O3 [\xb5g/m\xb3]"   "NO2 [\xb5g/m\xb3]" 
## [4] "CO [mg/m\xb3]"      "PM10 [\xb5g/m\xb3]" "TEMP [\xb0C]"      
## [7] "PREC [mm]"          "RAD [W/m\xb2]"

Note non-ASCII encoding.

We can delete everything after whitespace:

sub("[ ].*$","",names(data))

## [1] "Date/time" "O3"        "NO2"       "CO"        "PM10"      "TEMP"     
## [7] "PREC"      "RAD"

With such functions, you can relable your data table columns without assigning them manually (note fixed=TRUE indicates that you are using fixed string patterns and not regular expressions).

names(data) <- sub("[ ].*$","",names(data))
names(data) <- sub("Date/time", "datetime", names(data), fixed=TRUE)

Some text processing functions:

paste()
substr(); substring(); nchar()
strsplit()
sub(); gsub()
grep()
regexpr(); gregexpr() match(); pmatch(); %in% `==`

Set operations

newdate1 <- "2013.03.01"
newdate2 <- "2013.03.02"
union(newdate2, dates)

## [1] "2013.03.02" "2012.03.01" "2013.03.01"

intersect(newdate2, dates)

## character(0)

intersect(newdate1, dates)

## [1] "2013.03.01"

setdiff(dates, newdate1)

## [1] "2012.03.01"

Extract/replace

Can refer to elements by sequence number, or by label.

Vectors:

v[2]

## b 
## 3

v["b"]

## b 
## 3

v["b"] <- 10
print(v)

##  a  b  c 
##  1 10  5

Matrices:

m[1,2]

## [1] 4

m[,"b"]               # select column

## a b c 
## 4 5 6

m[,"b"] <- c(0,10,20) # replace column
print(m)

##   a  b
## a 1  0
## b 2 10
## c 3 20

Lists:

l[2]

## $b
## [1] 1 3 5

l[[2]]

## [1] 1 3 5

l[["b"]]

## [1] 1 3 5

l[["b"]] <- 10
print(l)

## $a
## [1] 1
## 
## $b
## [1] 10
## 
## $c
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

You may also see the $ operator used, but this is used to extract or assign to a single element of a list or data frame.

l$b

## [1] 10

l$b <- "a"
l

## $a
## [1] 1
## 
## $b
## [1] "a"
## 
## $c
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

Add to the collection:

c(v,d=3)

##  a  b  c  d 
##  1 10  5  3

Remove from the collection:

v[-2]

## a c 
## 1 5

v[!names(v) %in% "b"]

## a c 
## 1 5

Modify the values

v[2] <- "2"
print(v) # now converted to character string

##   a   b   c 
## "1" "2" "5"

“The most important distinction between [, [[ and $ is that the [ can select more than one element, whereas the other two select a single element.”
2-D, 3-D, N-D: See x[i,j,...,drop=FALSE] to retain other dimensions.
Also, see x[i,j,...,exact=FALSE] to enable partial matching (partial matching enabled for $ by default).
Each extraction operator has a corresponding replacement method.

Arithmetic operations

The usual arithmetic operators/functions:

+, -, *, /,                      # binary operators
sum(), prod(), cumsum(), diff(), # apply to vector

Other functions:

exp(), log(), log10(), ^
%%  #(modulo)
%/% #(integer division),
floor(), ceiling(), round(), signif()

Control structures

For loop (note that for this example, you could also use vectorized addition):

x <- 1:5
y <- numeric(length(x))
for(i in 1:length(x)) {
  y[i] <- x[i] + 1
}
print(y)

## [1] 2 3 4 5 6

If-else:

i <- 3
for(i in 1:3) {
  if(i==1) {
    x <- 1
  } else if(i==2) {
    x <- 2
  } else {
    x <- 3
  }
}
print(x)

## [1] 3

Within the use, you can use a few additional structures:

  break # break out of loop
  next  # skip rest of loop

Logical values

Logical values: TRUE or FALSE.

operator	description
`\|`, `\|\|`	or
`&`, `&&`	and
`<`, `>`	less/greater than
`<=`, `>=`	less/greater than or equal tl
`==`, !=	equal/not equal to
`%in%`	is in {collection}
`!`	not
`any()`	is any `TRUE`
`all()`	are all `FALSE`

||, && return single value (evaluates only first statement if first statement is TRUE) *???? |, & return vectored value

Negate any expression by prefixing with !.

Missing values

Missing values are encoded as NA.

R has very sophisticated facilities for handling missing values. Test for missing values in a vector with the following functions:

is.na()
is.nan()
is.finite()

You can test for non-missing values with is.na(x).

Many common functions provide a rm.na=TRUE argument. E.g., mean(x,na.rm=TRUE).

You can remove missing values with na.omit(x) but be careful as this can change the length of the vector x.

Anatomy of a function

Example:

Foo <- function(x) {
  y <- 1
  x + y + 2*z
}

x is a bound variable, and its value is determined by the argument passed upon function invocation.
y is a local variable, which is defined only within the context of the function.
z is a free variable, and its value is found in the environment in which the function was defined.

The value of the last expression (x + y + 2*z in this case) is returned from the function.

z <- 2
y <- 2
m <- 1
n <- Foo(m)
print(m)    # remains unchanged

## [1] 1

print(y)    # remains unchanged

## [1] 2

print(n)    # value that is returned

## [1] 6

Flexibility of R function invocation

Example function:

Bar <- function(first, second = 3) {
  first + 2*second
}

Note that default values for arguments can be provided in the function definition, in which chase they become optional arguments for the user to specify.

Explore the possibilities:

Bar(1)

## [1] 7

Bar(1, 3)

## [1] 7

Bar(3, 1)

## [1] 5

Bar(second=3, first=1)

## [1] 7

Bar(s=3, 1)

## [1] 7

Special types

Categorical variables (factor class)

The factor class in R is useful for representing categorical variables, which are discrete variables with a defined set of possibilities.

sites <- factor(c("Lausanne","Zurich"),                # values
                 levels=c("Bern","Lausanne","Zurich")) # set from which values are drawn
sites

## [1] Lausanne Zurich  
## Levels: Bern Lausanne Zurich

unclass(sites)

## [1] 2 3
## attr(,"levels")
## [1] "Bern"     "Lausanne" "Zurich"

sites == "Lausanne"

## [1]  TRUE FALSE

sites[1] <- "Fribourg"                                 # not in defined set of possibilities

## Warning in `[<-.factor`(`*tmp*`, 1, value = "Fribourg"): invalid factor level,
## NA generated

methods(class="factor")

##  [1] [             [[            [[<-          [<-           all.equal    
##  [6] as_factor     as.character  as.data.frame as.Date       as.list      
## [11] as.logical    as.POSIXlt    as.vector     coerce        Compare      
## [16] droplevels    format        initialize    is.na<-       length<-     
## [21] levels<-      Math          Ops           plot          print        
## [26] recode        relevel       relist        rep           scale_type   
## [31] show          slotsFromS3   summary       Summary       type_sum     
## [36] xtfrm        
## see '?methods' for accessing help and source code

Caution: factors are integer at heart.

(vec <- c(four=4, five=5))

## four five 
##    4    5

(fac <- factor(c("four","five")))

## [1] four five
## Levels: five four

vec["four"]

## four 
##    4

fac[2]

## [1] five
## Levels: five four

vec[fac[2]]

## four 
##    4

This behavior occurs because:

unclass(fac)

## [1] 2 1
## attr(,"levels")
## [1] "five" "four"

so fac[2] is equivalent to 1, and vec[fac[2]] is vec[1]

Data tables (the R “data frame”)

Note that each column can have a different data type.

(dtable <- data.frame(label=c("a","b"),value=c(1,2)))

##   label value
## 1     a     1
## 2     b     2

ColClasses(dtable)

##       label   value
## 1 character numeric

Common data frame operations defined in base R, with improvements in speeed or usability provided by the reshape2 and dplyr packages. Many of these will be further demonstrated in context throughout the rest of the course.

Operation	base R	`dplyr`/`tidyr`/`reshape2`
subset rows	`[`, `subset()`	`filter()`
select columns	`[`, `subset()`	`select()`
modify column values	`[<-`, `transform()`	`mutate()`
rename coluimns		`rename()`
join tables	`merge()`, `rbind()`, `cbind()`	`{full/inner/left/right}_join()`
pivot frame	`stack()`, `unstack()`	`gather()`, `spread()`
		`melt()`, `dcast()`

Chaining operations

Apply a series of functions in sequence. Subset rows where label is equal to “b”, and then change the value to 3.

Note the following sequence of operations:

mutate(filter(dtable,label=="b"),value=3)

##   label value
## 1     b     3

We can accomplish the same operation with pipes:

dtable %>% filter(label=="b") %>% mutate(value=3)

##   label value
## 1     b     3

The %>% is a “postfix operator” and assigns the preceding object (data frame) to the first argument of the proceeding function.

Modfying tables

Return new data frame (original is unmodified):

(newtable <- dtable %>% mutate(value2=NA))

##   label value value2
## 1     a     1     NA
## 2     b     2     NA

print(dtable)

##   label value
## 1     a     1
## 2     b     2

``In place’’ (modify original data frame):

dtable[,"value2"] <- NA
print(dtable)

##   label value value2
## 1     a     1     NA
## 2     b     2     NA

Merging example

Merging tables is a powerful feature.

dtable[,"value2"] <- NULL # delete column

print(dtable)

##   label value
## 1     a     1
## 2     b     2

(dtable2 <- data.frame(label="b", value2=1))

##   label value2
## 1     b      1

inner_join(dtable, dtable2)

## Joining, by = "label"

##   label value value2
## 1     b     2      1

full_join(dtable, dtable2)

## Joining, by = "label"

##   label value value2
## 1     a     1     NA
## 2     b     2      1

(dtable3 <- data.frame(label="c",value=1))

##   label value
## 1     c     1

inner_join(dtable, dtable3)

## Joining, by = c("label", "value")

## [1] label value
## <0 rows> (or 0-length row.names)

full_join(dtable, dtable3)

## Joining, by = c("label", "value")

##   label value
## 1     a     1
## 2     b     2
## 3     c     1

Reshaping/pivoting

Let us revisit an example from Lesson 1:

data <- read.table("data/2013/LAU.csv", sep=";", skip=6,
  col.names=c("datetime","O3","NO2","CO","PM10","TEMP","PREC","RAD"))

View first two rows.

data[1:2,]

##           datetime   O3  NO2  CO PM10 TEMP PREC  RAD
## 1 31.12.2012 01:00  7.8 56.3 0.5 16.1  3.8    0 -2.4
## 2 31.12.2012 02:00 22.4 38.0 0.4 11.6  4.1    0 -2.3

Convert to “long” format.

lf <- gather(data[1:2,],     # data table, first two rows
             key = variable, # name of new variable column
             value = value,  # name of new value column           
             -datetime)      # columns to keep fixed (collapse/stack all other variables)

head(lf)

##           datetime variable value
## 1 31.12.2012 01:00       O3   7.8
## 2 31.12.2012 02:00       O3  22.4
## 3 31.12.2012 01:00      NO2  56.3
## 4 31.12.2012 02:00      NO2  38.0
## 5 31.12.2012 01:00       CO   0.5
## 6 31.12.2012 02:00       CO   0.4

Convert back to “wide” format.

wf <- spread(lf,             # data table
             key = variable, # column from which new column names should be taken
             value = value)  # column from which values should be taken to fill new wide-format table

head(wf)

##           datetime  CO  NO2   O3 PM10 PREC  RAD TEMP
## 1 31.12.2012 01:00 0.5 56.3  7.8 16.1    0 -2.4  3.8
## 2 31.12.2012 02:00 0.4 38.0 22.4 11.6    0 -2.3  4.1

Split, apply, combine

Based on what we’ve learned so far, we can define a function for reading a csv file from the NABEL network:

ReadTSeries <- function(filename, timecolumn="datetime", timeformat="%d.%m.%Y %H:%M") {
  data <- read.table(filename, skip=5, header=TRUE, sep=";", check.names=FALSE)
  names(data) <- sub("[ ].*$","",names(data))
  names(data) <- sub("Date/time", timecolumn, names(data), fixed=TRUE)
  data[,timecolumn] <- as.chron(data[,timecolumn], timeformat)
  data
}
data <- ReadTSeries("data/2013/LAU.csv")

Add month column:

data[,"month"] <- months(data[,"datetime"])
head(data)

##                datetime   O3  NO2  CO PM10 TEMP PREC  RAD month
## 1 (12/31/2012 01:00:00)  7.8 56.3 0.5 16.1  3.8    0 -2.4   Dec
## 2 (12/31/2012 02:00:00) 22.4 38.0 0.4 11.6  4.1    0 -2.3   Dec
## 3 (12/31/2012 03:00:00) 14.5 37.2 0.3 10.3  3.1    0 -2.1   Dec
## 4 (12/31/2012 04:00:00) 28.7 25.4 0.3 10.5  3.5    0 -2.2   Dec
## 5 (12/31/2012 05:00:00) 19.6 33.7 0.3  9.0  2.9    0 -2.2   Dec
## 6 (12/31/2012 06:00:00) 30.8 51.2 0.3  8.7  3.2    0 -2.3   Dec

Return single value (or series of individually computed values): use summarize().

data %>%
  group_by(month) %>%
    summarize(mean=mean(O3,na.rm=TRUE),
              sd=sd(O3,na.rm=TRUE))

## # A tibble: 12 x 3
##    month  mean    sd
##  * <ord> <dbl> <dbl>
##  1 Jan    22.6  15.5
##  2 Feb    40.6  17.1
##  3 Mar    35.0  21.8
##  4 Apr    50.8  23.4
##  5 May    51.9  18.5
##  6 Jun    59.6  24.3
##  7 Jul    76.2  31.9
##  8 Aug    66.5  23.9
##  9 Sep    43.2  21.8
## 10 Oct    24.7  17.0
## 11 Nov    25.9  17.7
## 12 Dec    18.7  18.2

Return table: use do().

Statsfn <- function(subtable) {
  O3 <- subtable[["O3"]] # to select a single column in this case, use [[]] rather than [,]
  data.frame(mean=mean(O3,na.rm=TRUE), sd=sd(O3,na.rm=TRUE))
}

data %>%
  group_by(month) %>%
    do(Statsfn(.))

## # A tibble: 12 x 3
## # Groups:   month [12]
##    month  mean    sd
##    <ord> <dbl> <dbl>
##  1 Jan    22.6  15.5
##  2 Feb    40.6  17.1
##  3 Mar    35.0  21.8
##  4 Apr    50.8  23.4
##  5 May    51.9  18.5
##  6 Jun    59.6  24.3
##  7 Jul    76.2  31.9
##  8 Aug    66.5  23.9
##  9 Sep    43.2  21.8
## 10 Oct    24.7  17.0
## 11 Nov    25.9  17.7
## 12 Dec    18.7  18.2

Suggestions on how to use do():

Write a function that accepts an argument: a subtable (a subset of a data table split according to variables passed to group_by()) or its variables, and returns a different table. E.g., let us call this function Foo().
Pass the table to do() using this syntax: do(Foo(.)) or do(Foo(.[["column"]])) or do(Foo(.[,c("column1", "column2")])).
- . represents the table (data frame) itself
- Foo(.) or its variants above should return a data frame to be merged back with the other results

Key-value pairs

Use labeling feature of R to implement a key-value pair structure (lookup table'',associative list’‘, ``hash table’’).

seasons <- c(
  Dec="DJF",
  Jan="DJF",
  Feb="DJF",
  Mar="MAM",
  Apr="MAM",
  May="MAM",
  Jun="JJA",
  Jul="JJA",
  Aug="JJA",
  Sep="SON",
  Oct="SON",
  Nov="SON"
  )

dframe <- data.frame(month=c("Jan","Feb","Nov"))
dframe[,"season"] <- seasons[dframe[,"month"]]
print(dframe)

##   month season
## 1   Jan    DJF
## 2   Feb    DJF
## 3   Nov    SON

The same thing can be accomplished by merging data frames.

dframe[,"season"] <- NULL
seasons.df <- data.frame(month=names(seasons), season=seasons)
print(seasons.df)

##     month season
## Dec   Dec    DJF
## Jan   Jan    DJF
## Feb   Feb    DJF
## Mar   Mar    MAM
## Apr   Apr    MAM
## May   May    MAM
## Jun   Jun    JJA
## Jul   Jul    JJA
## Aug   Aug    JJA
## Sep   Sep    SON
## Oct   Oct    SON
## Nov   Nov    SON

dframe <- inner_join(dframe, seasons.df)

## Joining, by = "month"

print(dframe)

##   month season
## 1   Jan    DJF
## 2   Feb    DJF
## 3   Nov    SON

Interacting with files on your hard drive (I/O) and objects in working memory

Shell operations

Determine and set working directory:

  getwd()
  setwd("path/to/directory")

The working directory determines where you read from and write files to, and all path names are written relative to this location.

Shell functions:

list.files()
file.copy()
file.rename()
file.remove()
dir.create()
file.info()
basename()
dirname()
...
system()

Reading/writing text files:

scan()
read.table()
write.table()
...

To save R objects:

save(); load()
saveRDS(); readRDS()

Working memory

Check order of packages loaded into memory with search(). This list determines the order in which function definitions will be searched, much like the search path in MATLAB and your operating system. See Namespaces topic below.

To save objects in your memory (“workspace”), use save.image(). This command will create a file called “.RData” on your hard drive, which you can load with load(".RData"). Note that you will have to reload libraries (preferably in the order in which they were loaded previously) to continue seamlessly.

Namespaces

Namespaces can be thought of as containers which resolve conflicts among variables and functions with identical names. For instance, it is common to name your data tables as “data” or “df” (for data frame). These two names are actually functions predefined in R packages loaded at startup (which you can see with search()).

First, let us look at data:

exists("data")

## [1] TRUE

class(data)        # this is our data set we loaded previously

## [1] "data.frame"

rm(data)           # we can delete our object
exists("data")     # an object called 'data' still exists

## [1] TRUE

class(data)        # function to load avaiable data sets

## [1] "function"

environment(data)  # this is loaded in the 'utils' package

## <environment: namespace:utils>

Next, we continue our example with df:

exists("df")       # F distribution distribution

## [1] TRUE

environment(df)

## <environment: namespace:stats>

df

## function (x, df1, df2, ncp, log = FALSE) 
## {
##     if (missing(ncp)) 
##         .Call(C_df, x, df1, df2, log)
##     else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>

However, you can still assign values to these symbols:

df <- data.frame(x=1:5)

and these new objects will co-exist with the original objects in separate namespaces.

In this case, when df is used in your code, it will first look in the namespace that you are working in, and then move on down the search() list for other definitions.

df

##   x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5

If you want to use the original function, you can prepend the namespace with two colons:

stats::df

## function (x, df1, df2, ncp, log = FALSE) 
## {
##     if (missing(ncp)) 
##         .Call(C_df, x, df1, df2, log)
##     else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>

Or, more generally, use get:

get("df", "package:stats")

## function (x, df1, df2, ncp, log = FALSE) 
## {
##     if (missing(ncp)) 
##         .Call(C_df, x, df1, df2, log)
##     else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>

To access functions that are private to a particular package, use :::.

stats:::df

## function (x, df1, df2, ncp, log = FALSE) 
## {
##     if (missing(ncp)) 
##         .Call(C_df, x, df1, df2, log)
##     else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>

(In this case the triple colon is not necessary as df is not a private function but is shown for illustrative purposes.)

For more details, see here.

Graphics

There are several graphics “paradigms”.

base graphics (built-in)
grid graphics (built-in)
lattice graphics (built-in)
ggplot2 (external)

In base graphics, a graphic is specified by its primitive elements. Grid graphics provide low-level functions; lattice and ggplot2 are built on top of grid graphics. Lattice graphics is an implementation of Trellis graphics in R. In ggplot(2), as illustrated earlier, is an implementation of the “grammar of graphics.”

Anatomy of a graphic

Symbols, lines, and characters:

x <- 1:10
y <- 1:10

Build up with low-level elements:

  plot.new()
  plot.window(range(x),range(y))
  axis(1)
  axis(2)
  box()
  points(x, y, col="blue")
  lines(x, y, col="red")
  title(xlab="x",ylab="y")

Use high level function plot():

  plot(x, y, col="blue")
  lines(x, y, col="red")

However, in many cases we will use the ggplot approach illustrated earlier.

Grammar of graphics

Consider the traditional anatomy of a graphic composed of symbols, lines, and characters. The description of a graphic is built up procedurally; element by element.

Instead of framing the task of plotting as drawing axes, lines, and points on a page, we can describe a statistical graphic in terms of the following elements:

a coordinate system (Cartesian, polar, spherical)
relative position and aesthetic properties (shape, color) of symbols which are associated with data values and categories
facets (``faces’’ along which the data should be grouped)
optional: any (statistical) transformations to the original data
optional: annotations

This is a high-level, declarative approach to creating a plot, and can facilitate exploratory visual analysis.

Proposed by L. Wilkinson in 2005.
Implemented by H. Wickam for R, 2008/2009.

Exploratory graphics

Univariate

jitter plot
box-and-whisker
histogram
density plot
bar chart
dot chart

Multivariate (2-D)

any of the univariate graphics applied across multiple categories
(X-Y) scatter plot
conditioning plots
parallel coordinates plot

For classic textbooks on visualizing data:

Tufte, The Visualization of Quantitative Information
Cleveland, The Elements of Graphing Data

lattice

Murrell, R Graphics, 2003

Saving figures

Example data table:

dtable <- data.frame(x=1:10,y=1:10)

With ggplot, save the specifications of a plot to an object:

ggp <- ggplot(dtable)+
  geom_point(aes(x,y))

We can display this graphic on screen:

print(ggp)

To print to a file, the preferred method is to open a graphics device (pdf in this instance), write to it, and then close it with dev.off().

pdf("filename.pdf",width=20,height=20)
print(ggp)
dev.off()

With base graphics, we create the plot when the graphics device is open.

pdf("filename.pdf",width=20,height=20)
plot(dtable[,"x"], dtable[,"y"])
dev.off()

A more MATLAB-like approach is to plot the figure to screen, and then save to a device.

print(ggp)
dev.print(device=pdf,
          file="filename.pdf",
          width=20,height=20)

The latter method is less robust, so if something does not look right try the former method.

Note that you can also use png() instead of pdf() to create a raster graphics (image). (Use raster graphics if you have images or many data points.) For instance,

print(ggp)
dev.print(device=png,
          file="filename.png",
          width=20,height=20,
          units="cm",res=72)

For ggplot specifically, it is possible to use ggsave():

ggplot(dtable)+
  geom_point(aes(x,y))
ggsave("filename.pdf", width=7, height=7)

Graphics devices:

pdf()
bitmap(,type="pdfwrite")
postscript() ## PS
postscript(onefile=FALSE) ## EPS
bitmap()
png()
jpeg()
tiff()
svg() ## see package...
...

Functions to access devices:

dev.list()
dev.new()
dev.copy()
dev.off()

A taste of higher-order programming (function mapping example)

In many programming languages, it is customary to use a for-loop to apply a function f over each element of a sequence. MATLAB example:

input = {u, v, w}   % cell array
output = {}         % empty cell array
for i=1:length(input),
  output{i} = f(input{i})
end

The same task can be accomplished in MATLAB using cellfun.

output = cellfun(@f,input)

R also as for-loops, but it is also common to use such higher-order functions which accept other functions as its arguments. Map is one such R function, and its analogy in MATLAB is cellfun. R example:

input <- list(u, v, w)
output <- Map(f,input)

Given a seqeuence a, b, c, Map and cellfun take function f and apply it to each element of the sequence such that the returned value is f(a), f(b), f(c).

If structuring your programs in this way appeals to you, read up on “functional programming”. We will use this convention as necessary during this course.

Troubleshooting and debugging

Common errors:

You are not in the correct working directory (affects input/output, often resulting in error "No such file or directory").
- Always check your working directory with getwd().
- Set your working directory with setwd("pathname") (or through the menu with RStudio).
- Check the files (residing on your hard disk) in your working directory with list.files() (or through the directory viewer in RStudio).
Your packages are not loaded in the intended order (affects function calls).
- Check order using search(). Functions with identical names are taken from packages in the order listed.
- You can detach packages you are no longer using. For instance, to detach a package called plyr, use the syntax: detach("package:plyr", unload=TRUE).
You have unintentionally deleted or overwritten objects (data object is not found, or is not the correct type).
- Check the objects (residing in memory) in your workspace with ls() (or workspace viewer in RStudio).

Troubleshooting graphics:

If your plotting command does not work, try dev.list() to see if multiple graphics devices are open. This can happen if an error was encountered before the device could be closed in a previous call, and further plotting may be write to an unintended graphics device. Use dev.off() to close devices that are open (might have to do this several times).
Try print(graphicsobject); in these exercises the graphicsobject is often saved as ggp.

Aborting a computation:

Hit Control-c at the interpreter.

Clearing the console:

Hit Control-l or type cat("\014").

Clearning the workspace (warning):

rm(list=ls()) will technically delete objects in your global workspace, but be forewarned that it is safer to re-start a new instance of R as this removal function does not reset the order in which packages are loaded. Therefore, it is possible that your code will not run in the same way as if you re-started in new instance of R.

Other errors:

Rather than running the entire script, run blocks of R code interactively to isolate the location of the error.
You can follow the example below to precisely identify the location of the offending call.

Simple example

As an example, let us consider the problem of multiplying two numbers. Their data types must be numeric, and an error will be returned if this is not the case.

> 2 * "1"
Error in 2 * "1" : non-numeric argument to binary operator

Let us define 1) a function to obtain the product of two numbers, and 2) a function which increments a number by one:

product <- function(x, y) {
    x * y
}
adder1 <- function(x) {
    x + 1
}

An error will be returned when applying the two functions:

> adder1(product(x, y))
Error in x * y : non-numeric argument to binary operator

or, using the pipe notation:

> library(magrittr) # for the %>%
> product(x, y) %>% adder1()
Error in x * y : non-numeric argument to binary operator

First, determine which function call is causing the problem. Is it product(x, y) or adder called on its result?

In an interactive language like R or MATLAB, it is easy to evaluate the function calls in sequence to find where the error occurs.

> z <- product(x, y)
> adder(z)

and see where the error happens. In this case, it is product(x, y).

The next step is to determine whether the arguments passed to product are sensible. Therefore, you would evaluate:

> x
[1] 2
> y
[1] "1"

and see that the arguments to product are not numeric, and therefore suggests the path to correction.

If the error is not obvious, you will have to insert print statements or use language-specific debugging tools. Error handling functions can be used when certain types of errors are expected.

Debugging functions

debug(): call on function
debugonce(): call on function
browser(): place in function
trace(): call on function
recover(): set in options() or trace()

Error handling functions

try()
tryCatch()

Some references: 1, 2. Revisiting the simple example, try can be used in the following way:

z <- try(x * y)

if(is(z, "try-error")) 
  z <- x * as.numeric(y)

Seeking help

Within R:

> options(htmlhelp=TRUE)
> ?plot #or help("plot")
> ??keyword #or help.search("keyword")
> apropos("keyword")
> RSiteSearch("keyword")

List of vinettes, books, and other resources:

Documentation for plotting with the ggplot2 package:

http://docs.ggplot2.org

Documentation for dplyr:

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Style guides:

Submit specific questions to online forums:

R basics