To perform data analysis, we need a tool to help us.
R is such a tool, created for “Programming with Data”, and will be covered in this lesson.
We will cover some basic concepts of R programming in this course, with an emphasis on demonstrating what you can (or should) do with a programming language. Once you know this, you can look up how to perform a specific task.
Many demonstrations and example code will be provided. If you find a code block that accomplishes what you need, then you can use it directly or read up on the function or functions to adapt the block to your needs.
Everyone has a different level of background in programming, but it is possible to find appropriate resources at every level.
Find the resource that speaks to your background and way of thinking (e.g., some tutorials are aimed at engineers, while others are aimed at computer scientists, statisticians, econometricians, psychologists, and so on).
We will require each student to learn a lot independently, as it is not possible to exhaustively cover every function or operation used in our examples. As you will see, the number of resources for learning R is practically endless (just type “introduction to R” in Google), which leads to a tyranny of choice and decision fatigue. We recommend only a few resources:
One strategy (for learning a new topic in general) is to skim the contents of many resources and identify:
We will present a few examples to initiate a user who is generally familiar with a limited amount of programming. Here are some basic classifications to keep in mind.
Important “data types” (objects):
Everything in R is technically a “vector” with different or additional attributes such as storage mode, dimension, etc.
Important classes of operations:
Important features when working with real data:
One of R's core strengths is its collection of user-contributed libraries. You will find that R permits many ways to do the same thing. Find the way that works best for you (and your colleagues), and stick with it.
Superficial differences:
Comment character is # rather than %.
Arrows <- are assignment operators. (Can also use equal sign in most places.)
*, /, ^ operate element-wise (like .*, ./, .^ in MATLAB).
Functions do not have to be defined in separate files.
Most operations are performed by calling a function on arguments. Function calls are very forgiving, and arguments can be specified by 1) the order provided or 2) partial matching of argument names.
> divide <- function(numerator,denominator) numerator / denominator
> divide(1,2)
[1] 0.5
> divide(denominator=1,numerator=2)
[1] 2
> divide(d=1,2)
[1] 2
The same control structures exist (if, for, while, etc.), but braces {} denote the extent of expressions, rather than end statements.
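As a minimal sketch of two of the differences listed above, the following compares element-wise multiplication with matrix multiplication (written %*% in R) and shows braces delimiting the body of a loop. The variable names are illustrative only.

```r
# Element-wise versus matrix multiplication (names are illustrative)
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
elementwise <- A * A         # like A .* A in MATLAB
matprod <- A %*% A           # like A * A in MATLAB

# Braces, not `end`, delimit the body of a control structure
total <- 0
for (k in 1:3) {
  if (k %% 2 == 1) {         # odd values only
    total <- total + k
  }
}
print(elementwise)
print(matprod)
print(total)                 # 4
```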
Check out:
For the rest of the examples, we will assume the following libraries have been imported and options set:
library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
These libraries should be present on the machines in room GRB001. On your own machine, you can install necessary packages from the web:
install.packages("packagename",repos="http://stat.ethz.ch/CRAN/")
or, for multiple packages at once,
install.packages(c("packagename1","packagename2"),repos="http://stat.ethz.ch/CRAN/")
and so on.
In the assignment of values to symbols/variables, <- or = can be used. The former method is recommended.
```r
x <- 1 # scalar (a vector of length 1)
x = 1
x <- 1:5 # vector
x = 1:5
```
Functions accept a set of arguments (inputs) and return a value (output). Even binary operators can be written as a function in prefix notation, e.g., x+1 is equivalent to `+`(x,1).
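To make the prefix notation concrete, here is a small sketch (variable names are illustrative). Because operators are ordinary functions, they can also be handed to other functions such as sapply().

```r
x <- 1:3
infix  <- x + 1          # usual infix form
prefix <- `+`(x, 1)      # the same call written in prefix form

# Operators are functions, so they can be passed as arguments
doubled <- sapply(x, `*`, 2)

print(identical(infix, prefix))   # TRUE
print(doubled)                    # 2 4 6
```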
Variables are symbols (e.g., x, y) that represent a set of values. These values can take on one of several types.
You can convert among data types with the as.*() family of functions: as.data.frame, as.list, as.numeric, as.character, and so on.
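A short sketch of these conversions follows; the values are made up for illustration. Note that a value that cannot be converted becomes NA (with a warning, suppressed here).

```r
num <- as.numeric("3.14")             # character -> numeric
chr <- as.character(1:3)              # integer -> character
lst <- as.list(1:3)                   # vector -> list of length 3
dfr <- as.data.frame(matrix(1:4, 2))  # matrix -> data frame (columns V1, V2)

# Values that cannot be converted become NA
bad <- suppressWarnings(as.numeric("abc"))

print(chr)          # "1" "2" "3"
print(names(dfr))   # "V1" "V2"
print(is.na(bad))   # TRUE
```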
Lists are like cell arrays or structures in MATLAB.
As with MATLAB, R has vectorized operations. Vectorized operations apply to atomic data types, but not to lists. Example:
x <- 1:5
y <- x+1
print(y)
## [1] 2 3 4 5 6
mode(y)
## [1] "numeric"
typeof(y)
## [1] "double"
You can check the mode of your data type with mode() or typeof().
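The restriction to atomic types can be seen in the following sketch (names are illustrative): arithmetic on a list fails, so the function has to be applied element by element with lapply() or sapply().

```r
v <- c(1, 3, 5)
v_plus <- v + 1                     # vectorized: works on atomic vectors

lst <- list(1, 3, 5)
# lst + 1 stops with "non-numeric argument to binary operator"
failed <- inherits(try(lst + 1, silent = TRUE), "try-error")

incremented <- lapply(lst, function(el) el + 1)  # returns a list
simplified  <- sapply(lst, function(el) el + 1)  # simplified to a vector

print(failed)       # TRUE
print(simplified)   # 2 4 6
```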
Apart from the mode, objects can have dimensionality: 1-D (vector), 2-D (matrix), N-D (array).
Special data types:
c() concatenates elements and is often used to construct a vector.
(char <- c("ab","d")) # concatenate elements (character)
## [1] "ab" "d"
(v <- c(1,3,5)) # concatenate elements (numeric)
## [1] 1 3 5
Other objects are created by functions that are named after the object.
(m <- matrix(1:6,ncol=2)) # create a matrix
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
(l <- list(1,v,m)) # define a list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 3 5
##
## [[3]]
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
In contrast with MATLAB, note that a single element in a character vector in R can contain multiple letters.
An object defines both 1) its data type and 2) the operations that are allowed on it, and is labeled by a “class”.
Check the class of an R object with class(), and the functions available for it with methods():
x <- 1
print(class(x))
## [1] "numeric"
methods(class=class(x))
## [1] all.equal as_factor as_mapper as.data.frame as.Date
## [6] as.POSIXct as.POSIXlt as.raster coerce Compare
## [11] full_seq months Ops recode scale_type
## see '?methods' for accessing help and source code
Alternatively, you can query what types of objects a function can operate on. For instance, we have a function called mean():
methods(mean)
## [1] mean.Date mean.default mean.difftime mean.POSIXct
## [5] mean.POSIXlt mean.quosure* mean.times* mean.vctrs_vctr*
## see '?methods' for accessing help and source code
In R, all variables are objects and all operations are functions. All functions are also objects.
Elements can have optional labels.
Vectors and lists can have names.
(v <- c(a=1, b=3, c=5))
## a b c
## 1 3 5
v["a"]
## a
## 1
names(l) <- names(v)
print(l)
## $a
## [1] 1
##
## $b
## [1] 1 3 5
##
## $c
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Matrices can have column and row names.
colnames(m) <- letters[1:ncol(m)]
rownames(m) <- letters[1:nrow(m)]
print(m)
## a b
## a 1 4
## b 2 5
## c 3 6
Processing strings of characters is a relevant part of data analysis.
Manipulate strings:
date <- "2012.03.01"
strsplit(date, ".", fixed=TRUE)
## [[1]]
## [1] "2012" "03" "01"
paste("2012", "03", "01", sep=".")
## [1] "2012.03.01"
sprintf("%d.%02d.%02d",2012, 3, 1)
## [1] "2012.03.01"
Search for strings:
dates <- c("2012.03.01","2013.03.01")
grep("2012", dates, fixed=TRUE)
## [1] 1
grepl("2012", dates, fixed=TRUE)
## [1] TRUE FALSE
grep("2012", dates, value=TRUE, fixed=TRUE)
## [1] "2012.03.01"
To delve deeper, you will eventually want to look into regular expressions.
pattern <- "([0-9]{4})\\.([0-9]{2})\\.([0-9]{2})"
sub(pattern, "\\1", dates)
## [1] "2012" "2013"
sub(pattern, "\\2", dates)
## [1] "03" "03"
sub(pattern, "\\3", dates)
## [1] "01" "01"
Useful example:
data <- read.table("data/2013/LAU.csv", skip=5, sep=";", header=TRUE, check.names=FALSE)
names(data)
## [1] "Date/time" "O3 [\xb5g/m\xb3]" "NO2 [\xb5g/m\xb3]"
## [4] "CO [mg/m\xb3]" "PM10 [\xb5g/m\xb3]" "TEMP [\xb0C]"
## [7] "PREC [mm]" "RAD [W/m\xb2]"
Note the non-ASCII characters in the column names (e.g., \xb5 and \xb3 in the units).
We can delete everything after whitespace:
sub("[ ].*$","",names(data))
## [1] "Date/time" "O3" "NO2" "CO" "PM10" "TEMP"
## [7] "PREC" "RAD"
With such functions, you can relabel your data table columns without assigning them manually (note that fixed=TRUE indicates that you are using fixed string patterns and not regular expressions).
names(data) <- sub("[ ].*$","",names(data))
names(data) <- sub("Date/time", "datetime", names(data), fixed=TRUE)
Some text processing functions:
paste()
substr(); substring(); nchar()
strsplit()
sub(); gsub()
grep()
regexpr(); gregexpr()
match(); pmatch(); %in%; `==`
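A brief sketch of several of the functions listed above, applied to a time stamp in the format used by the data files in this lesson. regmatches(), used here to extract the matched text, is a base R companion to regexpr() that is not in the list above.

```r
stamp <- "31.12.2012 01:00"
len    <- nchar(stamp)                          # number of characters
day    <- substr(stamp, 1, 2)                   # characters 1 to 2
year   <- substr(stamp, 7, 10)                  # characters 7 to 10
dashed <- gsub(".", "-", stamp, fixed = TRUE)   # replace every "."
pos    <- regexpr("[0-9]{4}", stamp)            # position of first 4-digit run
yr     <- regmatches(stamp, pos)                # extract the matched text
isin   <- c("NO2", "SO2") %in% c("O3", "NO2")   # membership test
where  <- match("NO2", c("O3", "NO2"))          # position of first match

print(dashed)   # "31-12-2012 01:00"
print(yr)       # "2012"
print(isin)     # TRUE FALSE
```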
newdate1 <- "2013.03.01"
newdate2 <- "2013.03.02"
union(newdate2, dates)
## [1] "2013.03.02" "2012.03.01" "2013.03.01"
intersect(newdate2, dates)
## character(0)
intersect(newdate1, dates)
## [1] "2013.03.01"
setdiff(dates, newdate1)
## [1] "2012.03.01"
Elements can be referred to by sequence number or by label.
Vectors:
v[2]
## b
## 3
v["b"]
## b
## 3
v["b"] <- 10
print(v)
## a b c
## 1 10 5
Matrices:
m[1,2]
## [1] 4
m[,"b"] # select column
## a b c
## 4 5 6
m[,"b"] <- c(0,10,20) # replace column
print(m)
## a b
## a 1 0
## b 2 10
## c 3 20
Lists:
l[2]
## $b
## [1] 1 3 5
l[[2]]
## [1] 1 3 5
l[["b"]]
## [1] 1 3 5
l[["b"]] <- 10
print(l)
## $a
## [1] 1
##
## $b
## [1] 10
##
## $c
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
You may also see the $ operator, which extracts or assigns to a single element of a list or data frame.
l$b
## [1] 10
l$b <- "a"
l
## $a
## [1] 1
##
## $b
## [1] "a"
##
## $c
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
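The indexing operators shown above differ in what they return. The following sketch uses fresh objects (mat2 and lst2 are illustrative names, so the objects used in this lesson are left untouched) to contrast them, including the drop and exact arguments.

```r
mat2 <- matrix(1:6, ncol = 2, dimnames = list(NULL, c("a", "b")))
col_vec <- mat2[, "b"]                 # dimension dropped: plain vector
col_mat <- mat2[, "b", drop = FALSE]   # stays a 3 x 1 matrix

lst2 <- list(alpha = 1, beta = c(1, 3, 5))
sub_list <- lst2["beta"]               # `[` returns a list of length 1
element  <- lst2[["beta"]]             # `[[` returns the element itself
partial  <- lst2$al                    # `$` matches names partially
strict   <- lst2[["al"]]               # `[[` is exact by default: NULL
loose    <- lst2[["al", exact = FALSE]]

print(dim(col_mat))   # 3 1
print(strict)         # NULL
print(loose)          # 1
```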
Add to the collection:
c(v,d=3)
## a b c d
## 1 10 5 3
Remove from the collection:
v[-2]
## a c
## 1 5
v[!names(v) %in% "b"]
## a c
## 1 5
Modify the values:
v[2] <- "2"
print(v) # now converted to character string
## a b c
## "1" "2" "5"
“The most important distinction between [, [[ and $ is that the [ can select more than one element, whereas the other two select a single element.”
Use x[i,j,...,drop=FALSE] to retain other dimensions.
Use x[[i,j,...,exact=FALSE]] to enable partial matching (partial matching enabled for $ by default).

The usual arithmetic operators/functions:
+, -, *, /, # binary operators
sum(), prod(), cumsum(), diff(), # apply to vector
Other functions:
exp(), log(), log10(), ^
%% #(modulo)
%/% #(integer division),
floor(), ceiling(), round(), signif()
For loop (note that for this example, you could also use vectorized addition):
x <- 1:5
y <- numeric(length(x))
for(i in 1:length(x)) {
y[i] <- x[i] + 1
}
print(y)
## [1] 2 3 4 5 6
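One caveat with the 1:length(x) idiom is worth a small sketch: when x is empty, 1:length(x) counts down from 1 to 0 and the loop body runs twice. seq_along(x) avoids this. The names below are illustrative.

```r
x_empty <- numeric(0)                  # an empty input vector
bad_index  <- 1:length(x_empty)        # c(1, 0): a loop would run twice
good_index <- seq_along(x_empty)       # integer(0): a loop is skipped

count <- 0
for (i in seq_along(x_empty)) {
  count <- count + 1
}
print(bad_index)   # 1 0
print(count)       # 0
```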
If-else:
i <- 3
for(i in 1:3) {
if(i==1) {
x <- 1
} else if(i==2) {
x <- 2
} else {
x <- 3
}
}
print(x)
## [1] 3
Within a loop, you can use a few additional structures:
break # break out of loop
next # skip to the next iteration
Logical values: TRUE or FALSE.
operator | description |
---|---|
\| , \|\| | or |
& , && | and |
< , > | less/greater than |
<= , >= | less/greater than or equal to |
== , != | equal/not equal to |
%in% | is in {collection} |
! | not |
any() | is any TRUE |
all() | are all TRUE |

||, && return a single value and evaluate the second operand only when the first does not already determine the result (|| stops if the first is TRUE; && stops if the first is FALSE).
|, & return vectorized (element-wise) values.
Negate any expression by prefixing with !.
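The difference between the element-wise and the single-value operators can be sketched as follows (names are illustrative). The calls to stop() are never evaluated, because the first operand already determines the result.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
both   <- a & b          # element-wise and: TRUE FALSE FALSE
either <- a | b          # element-wise or:  TRUE TRUE TRUE
nota   <- !a             # negation:         FALSE TRUE FALSE
some   <- any(a)         # TRUE
every  <- all(a)         # FALSE

# || and && work on single values and stop as soon as the result is known
short_or  <- TRUE  || stop("not evaluated")
short_and <- FALSE && stop("not evaluated")
member <- 3 %in% c(1, 2, 3)

print(both)
print(short_or)    # TRUE
print(short_and)   # FALSE
```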
Missing values are encoded as NA.
R has very sophisticated facilities for handling missing values. Test for missing values in a vector with the following functions:
is.na()
is.nan()
is.finite()
You can test for non-missing values with !is.na(x).
Many common functions provide an na.rm=TRUE argument. E.g., mean(x,na.rm=TRUE).
You can remove missing values with na.omit(x), but be careful as this can change the length of the vector x.
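A small sketch of these functions on a made-up vector (xm is an illustrative name) that contains both NA and NaN:

```r
xm <- c(1, NA, 3, NaN)
missing_any <- is.na(xm)            # TRUE for both NA and NaN
missing_nan <- is.nan(xm)           # TRUE only for NaN
plain_mean  <- mean(xm)             # missing, because of the NA/NaN entries
clean_mean  <- mean(xm, na.rm = TRUE)
cleaned     <- na.omit(xm)          # shorter than xm

print(missing_any)       # FALSE TRUE FALSE TRUE
print(clean_mean)        # 2
print(length(cleaned))   # 2
```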
Example:
Foo <- function(x) {
y <- 1
x + y + 2*z
}
x is a bound variable, and its value is determined by the argument passed upon function invocation.
y is a local variable, which is defined only within the context of the function.
z is a free variable, and its value is found in the environment in which the function was defined.
The value of the last expression (x + y + 2*z in this case) is returned from the function.
z <- 2
y <- 2
m <- 1
n <- Foo(m)
print(m) # remains unchanged
## [1] 1
print(y) # remains unchanged
## [1] 2
print(n) # value that is returned
## [1] 6
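The rule that a free variable is looked up where the function was defined can be illustrated with a function that returns another function. This is only a sketch; MakeScaler, Twice, and mult are hypothetical names introduced for illustration.

```r
MakeScaler <- function(mult) {
  # `mult` is a free variable of the inner function; its value is
  # found in the environment where the inner function was defined
  function(x) x * mult
}
Twice <- MakeScaler(2)
mult <- 100              # a global `mult` does not affect Twice()
result <- Twice(5)
print(result)            # 10
```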
Example function:
Bar <- function(first, second = 3) {
first + 2*second
}
Note that default values for arguments can be provided in the function definition, in which case they become optional arguments for the user to specify.
Explore the possibilities:
Bar(1)
## [1] 7
Bar(1, 3)
## [1] 7
Bar(3, 1)
## [1] 5
Bar(second=3, first=1)
## [1] 7
Bar(s=3, 1)
## [1] 7
The factor class in R is useful for representing categorical variables, which are discrete variables with a defined set of possibilities.
sites <- factor(c("Lausanne","Zurich"), # values
levels=c("Bern","Lausanne","Zurich")) # set from which values are drawn
sites
## [1] Lausanne Zurich
## Levels: Bern Lausanne Zurich
unclass(sites)
## [1] 2 3
## attr(,"levels")
## [1] "Bern" "Lausanne" "Zurich"
sites == "Lausanne"
## [1] TRUE FALSE
sites[1] <- "Fribourg" # not in defined set of possibilities
## Warning in `[<-.factor`(`*tmp*`, 1, value = "Fribourg"): invalid factor level,
## NA generated
methods(class="factor")
## [1] [ [[ [[<- [<- all.equal
## [6] as_factor as.character as.data.frame as.Date as.list
## [11] as.logical as.POSIXlt as.vector coerce Compare
## [16] droplevels format initialize is.na<- length<-
## [21] levels<- Math Ops plot print
## [26] recode relevel relist rep scale_type
## [31] show slotsFromS3 summary Summary type_sum
## [36] xtfrm
## see '?methods' for accessing help and source code
Caution: factors are integer at heart.
(vec <- c(four=4, five=5))
## four five
## 4 5
(fac <- factor(c("four","five")))
## [1] four five
## Levels: five four
vec["four"]
## four
## 4
fac[2]
## [1] five
## Levels: five four
vec[fac[2]]
## four
## 4
This behavior occurs because:
unclass(fac)
## [1] 2 1
## attr(,"levels")
## [1] "five" "four"
so fac[2] is equivalent to 1, and vec[fac[2]] is vec[1].
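One way to avoid this pitfall, sketched below under the same definitions, is to convert the factor to character before using it as an index. The same caution applies to factors whose labels are numbers (numfac is an illustrative name).

```r
vec <- c(four = 4, five = 5)
fac <- factor(c("four", "five"))
by_code  <- vec[fac[2]]                 # uses integer code 1 -> element "four"
by_label <- vec[as.character(fac[2])]   # uses the label -> element "five"

# Factors holding numbers: convert via character to recover the values
numfac <- factor(c("10", "20"))
codes  <- as.numeric(numfac)                  # 1 2
values <- as.numeric(as.character(numfac))    # 10 20

print(by_code)
print(by_label)
```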
Note that each column can have a different data type.
(dtable <- data.frame(label=c("a","b"),value=c(1,2)))
## label value
## 1 a 1
## 2 b 2
ColClasses(dtable)
## label value
## 1 character numeric
Common data frame operations are defined in base R, with improvements in speed or usability provided by the dplyr, tidyr, and reshape2 packages. Many of these will be further demonstrated in context throughout the rest of the course.
Operation | base R | dplyr/tidyr/reshape2 |
---|---|---|
subset rows | [ , subset() | filter() |
select columns | [ , subset() | select() |
modify column values | [<- , transform() | mutate() |
rename columns | names<- | rename() |
join tables | merge() , rbind() , cbind() | {full/inner/left/right}_join() |
pivot frame | stack() , unstack() | gather() , spread() ; melt() , dcast() |
Apply a series of functions in sequence. Subset rows where label is equal to “b”, and then change the value to 3.
Note the following sequence of operations:
mutate(filter(dtable,label=="b"),value=3)
## label value
## 1 b 3
We can accomplish the same operation with pipes:
dtable %>% filter(label=="b") %>% mutate(value=3)
## label value
## 1 b 3
The %>% is an infix operator (the “pipe”) that passes the preceding object (here, a data frame) as the first argument of the function that follows.
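The same idea can be sketched without any packages using the native pipe |>, which is available in base R from version 4.1 onward (an assumption about your installation). The nested and piped forms below compute the same value; the names are illustrative.

```r
values <- c(1, 4, 9)
nested <- sum(sqrt(values))            # read inside-out
piped  <- values |> sqrt() |> sum()    # read left to right (R >= 4.1)
print(nested)                          # 6
print(identical(nested, piped))        # TRUE
```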
Return new data frame (original is unmodified):
(newtable <- dtable %>% mutate(value2=NA))
## label value value2
## 1 a 1 NA
## 2 b 2 NA
print(dtable)
## label value
## 1 a 1
## 2 b 2
“In place” (modify original data frame):
dtable[,"value2"] <- NA
print(dtable)
## label value value2
## 1 a 1 NA
## 2 b 2 NA
Merging tables is a powerful feature.
dtable[,"value2"] <- NULL # delete column
print(dtable)
## label value
## 1 a 1
## 2 b 2
(dtable2 <- data.frame(label="b", value2=1))
## label value2
## 1 b 1
inner_join(dtable, dtable2)
## Joining, by = "label"
## label value value2
## 1 b 2 1
full_join(dtable, dtable2)
## Joining, by = "label"
## label value value2
## 1 a 1 NA
## 2 b 2 1
(dtable3 <- data.frame(label="c",value=1))
## label value
## 1 c 1
inner_join(dtable, dtable3)
## Joining, by = c("label", "value")
## [1] label value
## <0 rows> (or 0-length row.names)
full_join(dtable, dtable3)
## Joining, by = c("label", "value")
## label value
## 1 a 1
## 2 b 2
## 3 c 1
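For comparison, the joins above can be sketched with merge() from base R. The tables are rebuilt here under illustrative names (left, right) so that the example is self-contained. Note that merge() sorts the result by the key column.

```r
left  <- data.frame(label = c("a", "b"), value = c(1, 2))
right <- data.frame(label = "b", value2 = 1)

inner     <- merge(left, right, by = "label")                # like inner_join()
full      <- merge(left, right, by = "label", all = TRUE)    # like full_join()
left_only <- merge(left, right, by = "label", all.x = TRUE)  # like left_join()

print(inner)
print(full)
```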
Let us revisit an example from Lesson 1:
data <- read.table("data/2013/LAU.csv", sep=";", skip=6,
col.names=c("datetime","O3","NO2","CO","PM10","TEMP","PREC","RAD"))
View first two rows.
data[1:2,]
## datetime O3 NO2 CO PM10 TEMP PREC RAD
## 1 31.12.2012 01:00 7.8 56.3 0.5 16.1 3.8 0 -2.4
## 2 31.12.2012 02:00 22.4 38.0 0.4 11.6 4.1 0 -2.3
Convert to “long” format.
lf <- gather(data[1:2,], # data table, first two rows
key = variable, # name of new variable column
value = value, # name of new value column
-datetime) # columns to keep fixed (collapse/stack all other variables)
head(lf)
## datetime variable value
## 1 31.12.2012 01:00 O3 7.8
## 2 31.12.2012 02:00 O3 22.4
## 3 31.12.2012 01:00 NO2 56.3
## 4 31.12.2012 02:00 NO2 38.0
## 5 31.12.2012 01:00 CO 0.5
## 6 31.12.2012 02:00 CO 0.4
Convert back to “wide” format.
wf <- spread(lf, # data table
key = variable, # column from which new column names should be taken
value = value) # column from which values should be taken to fill new wide-format table
head(wf)
## datetime CO NO2 O3 PM10 PREC RAD TEMP
## 1 31.12.2012 01:00 0.5 56.3 7.8 16.1 0 -2.4 3.8
## 2 31.12.2012 02:00 0.4 38.0 22.4 11.6 0 -2.3 4.1
Based on what we’ve learned so far, we can define a function for reading a csv file from the NABEL network:
ReadTSeries <- function(filename, timecolumn="datetime", timeformat="%d.%m.%Y %H:%M") {
data <- read.table(filename, skip=5, header=TRUE, sep=";", check.names=FALSE)
names(data) <- sub("[ ].*$","",names(data))
names(data) <- sub("Date/time", timecolumn, names(data), fixed=TRUE)
data[,timecolumn] <- as.chron(data[,timecolumn], timeformat)
data
}
data <- ReadTSeries("data/2013/LAU.csv")
Add month column:
data[,"month"] <- months(data[,"datetime"])
head(data)
## datetime O3 NO2 CO PM10 TEMP PREC RAD month
## 1 (12/31/2012 01:00:00) 7.8 56.3 0.5 16.1 3.8 0 -2.4 Dec
## 2 (12/31/2012 02:00:00) 22.4 38.0 0.4 11.6 4.1 0 -2.3 Dec
## 3 (12/31/2012 03:00:00) 14.5 37.2 0.3 10.3 3.1 0 -2.1 Dec
## 4 (12/31/2012 04:00:00) 28.7 25.4 0.3 10.5 3.5 0 -2.2 Dec
## 5 (12/31/2012 05:00:00) 19.6 33.7 0.3 9.0 2.9 0 -2.2 Dec
## 6 (12/31/2012 06:00:00) 30.8 51.2 0.3 8.7 3.2 0 -2.3 Dec
Return single value (or series of individually computed values): use summarize().
data %>%
group_by(month) %>%
summarize(mean=mean(O3,na.rm=TRUE),
sd=sd(O3,na.rm=TRUE))
## # A tibble: 12 x 3
## month mean sd
## * <ord> <dbl> <dbl>
## 1 Jan 22.6 15.5
## 2 Feb 40.6 17.1
## 3 Mar 35.0 21.8
## 4 Apr 50.8 23.4
## 5 May 51.9 18.5
## 6 Jun 59.6 24.3
## 7 Jul 76.2 31.9
## 8 Aug 66.5 23.9
## 9 Sep 43.2 21.8
## 10 Oct 24.7 17.0
## 11 Nov 25.9 17.7
## 12 Dec 18.7 18.2
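Grouped summaries of this kind can also be sketched in base R with tapply() or aggregate(). The small table below is made up for illustration and is not taken from the NABEL data.

```r
obs <- data.frame(month = c("Jan", "Jan", "Feb", "Feb"),
                  O3 = c(20, 24, NA, 40))   # toy values

# Named vector of group means (groups sorted alphabetically)
means <- tapply(obs$O3, obs$month, mean, na.rm = TRUE)

# Data frame of group means; rows with missing O3 are dropped by default
agg <- aggregate(O3 ~ month, data = obs, FUN = mean)

print(means)   # Feb 40, Jan 22
print(agg)
```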
Return table: use do().
Statsfn <- function(subtable) {
O3 <- subtable[["O3"]] # to select a single column in this case, use [[]] rather than [,]
data.frame(mean=mean(O3,na.rm=TRUE), sd=sd(O3,na.rm=TRUE))
}
data %>%
group_by(month) %>%
do(Statsfn(.))
## # A tibble: 12 x 3
## # Groups: month [12]
## month mean sd
## <ord> <dbl> <dbl>
## 1 Jan 22.6 15.5
## 2 Feb 40.6 17.1
## 3 Mar 35.0 21.8
## 4 Apr 50.8 23.4
## 5 May 51.9 18.5
## 6 Jun 59.6 24.3
## 7 Jul 76.2 31.9
## 8 Aug 66.5 23.9
## 9 Sep 43.2 21.8
## 10 Oct 24.7 17.0
## 11 Nov 25.9 17.7
## 12 Dec 18.7 18.2
Suggestions on how to use do():
Write a function that takes a table (e.g., a group defined by group_by()) or its variables, and returns a different table. E.g., let us call this function Foo().
Call it within do() using this syntax: do(Foo(.)) or do(Foo(.[["column"]])) or do(Foo(.[,c("column1", "column2")])).
. represents the table (data frame) itself.
Foo(.) or its variants above should return a data frame to be merged back with the other results.

Use the labeling feature of R to implement a key-value pair structure (“lookup table”, “associative list”, “hash table”).
seasons <- c(
Dec="DJF",
Jan="DJF",
Feb="DJF",
Mar="MAM",
Apr="MAM",
May="MAM",
Jun="JJA",
Jul="JJA",
Aug="JJA",
Sep="SON",
Oct="SON",
Nov="SON"
)
dframe <- data.frame(month=c("Jan","Feb","Nov"))
dframe[,"season"] <- seasons[dframe[,"month"]]
print(dframe)
## month season
## 1 Jan DJF
## 2 Feb DJF
## 3 Nov SON
The same thing can be accomplished by merging data frames.
dframe[,"season"] <- NULL
seasons.df <- data.frame(month=names(seasons), season=seasons)
print(seasons.df)
## month season
## Dec Dec DJF
## Jan Jan DJF
## Feb Feb DJF
## Mar Mar MAM
## Apr Apr MAM
## May May MAM
## Jun Jun JJA
## Jul Jul JJA
## Aug Aug JJA
## Sep Sep SON
## Oct Oct SON
## Nov Nov SON
dframe <- inner_join(dframe, seasons.df)
## Joining, by = "month"
print(dframe)
## month season
## 1 Jan DJF
## 2 Feb DJF
## 3 Nov SON
Determine and set working directory:
getwd()
setwd("path/to/directory")
The working directory determines where you read from and write files to, and all path names are written relative to this location.
Shell functions:
list.files()
file.copy()
file.rename()
file.remove()
dir.create()
file.info()
basename()
dirname()
...
system()
Reading/writing text files:
scan()
read.table()
write.table()
...
To save R objects:
save(); load()
saveRDS(); readRDS()
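A round trip through both kinds of files can be sketched as follows. Temporary files are used so that nothing is written to your working directory; the table and file names are illustrative.

```r
tmp_rds <- tempfile(fileext = ".rds")
tmp_csv <- tempfile(fileext = ".csv")
tab <- data.frame(label = c("a", "b"), value = c(1, 2))

saveRDS(tab, tmp_rds)                   # binary file, preserves types exactly
from_rds <- readRDS(tmp_rds)

write.table(tab, tmp_csv, sep = ";", row.names = FALSE)   # plain text
from_csv <- read.table(tmp_csv, sep = ";", header = TRUE)

unlink(c(tmp_rds, tmp_csv))             # clean up the temporary files
print(identical(tab, from_rds))         # TRUE
print(from_csv)
```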
Check the order of packages loaded into memory with search(). This list determines the order in which function definitions will be searched, much like the search path in MATLAB and your operating system. See Namespaces topic below.
To save objects in your memory (“workspace”), use save.image(). This command will create a file called “.RData” on your hard drive, which you can load with load(".RData"). Note that you will have to reload libraries (preferably in the order in which they were loaded previously) to continue seamlessly.
Namespaces can be thought of as containers which resolve conflicts among variables and functions with identical names. For instance, it is common to name your data tables as “data” or “df” (for data frame). These two names are actually functions predefined in R packages loaded at startup (which you can see with search()).
First, let us look at data:
exists("data")
## [1] TRUE
class(data) # this is our data set we loaded previously
## [1] "data.frame"
rm(data) # we can delete our object
exists("data") # an object called 'data' still exists
## [1] TRUE
class(data) # function to load available data sets
## [1] "function"
environment(data) # this is loaded in the 'utils' package
## <environment: namespace:utils>
Next, we continue our example with df:
exists("df") # density function of the F distribution
## [1] TRUE
environment(df)
## <environment: namespace:stats>
df
## function (x, df1, df2, ncp, log = FALSE)
## {
## if (missing(ncp))
## .Call(C_df, x, df1, df2, log)
## else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>
However, you can still assign values to these symbols:
df <- data.frame(x=1:5)
and these new objects will co-exist with the original objects in separate namespaces.
In this case, when df is used in your code, R will first look in the namespace that you are working in, and then move on down the search() list for other definitions.
df
## x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
If you want to use the original function, you can prepend the namespace with two colons:
stats::df
## function (x, df1, df2, ncp, log = FALSE)
## {
## if (missing(ncp))
## .Call(C_df, x, df1, df2, log)
## else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>
Or, more generally, use get:
get("df", "package:stats")
## function (x, df1, df2, ncp, log = FALSE)
## {
## if (missing(ncp))
## .Call(C_df, x, df1, df2, log)
## else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>
To access functions that are private to a particular package, use :::.
stats:::df
## function (x, df1, df2, ncp, log = FALSE)
## {
## if (missing(ncp))
## .Call(C_df, x, df1, df2, log)
## else .Call(C_dnf, x, df1, df2, ncp, log)
## }
## <bytecode: 0x7f8ffc914e10>
## <environment: namespace:stats>
(In this case the triple colon is not necessary as df is not a private function but is shown for illustrative purposes.)
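The coexistence of a user-defined object and a package function of the same name can be sketched with sd, the standard deviation function from the stats package (chosen here only for illustration). When a name is used in a call, R skips objects that are not functions.

```r
sd <- 5                            # a variable sharing its name with stats::sd
masked <- sd                       # refers to our variable
spread <- stats::sd(c(1, 2, 3))    # explicitly the function from stats
also   <- sd(c(1, 2, 3))           # in a call, R looks for a function named sd
rm(sd)                             # remove our variable again

print(masked)   # 5
print(spread)   # 1
print(also)     # 1
```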
For more details, see here.
There are several graphics “paradigms”.
In base graphics, a graphic is specified by its primitive elements. Grid graphics provide low-level functions; lattice and ggplot2 are built on top of grid graphics. Lattice graphics is an implementation of Trellis graphics in R. ggplot2, as illustrated earlier, is an implementation of the “grammar of graphics.”
Symbols, lines, and characters:
x <- 1:10
y <- 1:10
Build up with low-level elements:
plot.new()
plot.window(range(x),range(y))
axis(1)
axis(2)
box()
points(x, y, col="blue")
lines(x, y, col="red")
title(xlab="x",ylab="y")
Use the high-level function plot():
plot(x, y, col="blue")
lines(x, y, col="red")
However, in many cases we will use the ggplot approach illustrated earlier.
Consider the traditional anatomy of a graphic composed of symbols, lines, and characters. The description of a graphic is built up procedurally, element by element.
Instead of framing the task of plotting as drawing axes, lines, and points on a page, we can describe a statistical graphic in terms of the following elements:
This is a high-level, declarative approach to creating a plot, and can facilitate exploratory visual analysis.
Univariate
Multivariate (2-D)
For classic textbooks on visualizing data:
Murrell, R Graphics, 2003
Example data table:
dtable <- data.frame(x=1:10,y=1:10)
With ggplot, save the specifications of a plot to an object:
ggp <- ggplot(dtable)+
geom_point(aes(x,y))
We can display this graphic on screen:
print(ggp)
To print to a file, the preferred method is to open a graphics device (pdf in this instance), write to it, and then close it with dev.off().
pdf("filename.pdf",width=20,height=20)
print(ggp)
dev.off()
With base graphics, we create the plot when the graphics device is open.
pdf("filename.pdf",width=20,height=20)
plot(dtable[,"x"], dtable[,"y"])
dev.off()
A more MATLAB-like approach is to plot the figure to screen, and then save to a device.
print(ggp)
dev.print(device=pdf,
file="filename.pdf",
width=20,height=20)
The latter method is less robust, so if something does not look right try the former method.
Note that you can also use png() instead of pdf() to create raster graphics (an image). (Use raster graphics if you have images or many data points.) For instance,
print(ggp)
dev.print(device=png,
file="filename.png",
width=20,height=20,
units="cm",res=72)
For ggplot specifically, it is possible to use ggsave():
ggplot(dtable)+
geom_point(aes(x,y))
ggsave("filename.pdf", width=7, height=7)
Graphics devices:
pdf()
bitmap(,type="pdfwrite")
postscript() ## PS
postscript(onefile=FALSE) ## EPS
bitmap()
png()
jpeg()
tiff()
svg() ## see package...
...
Functions to access devices:
dev.list()
dev.new()
dev.copy()
dev.off()
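The open, draw, close cycle and the device functions above can be sketched with base graphics as follows. A temporary file is used as the output, and the names are illustrative.

```r
out <- tempfile(fileext = ".pdf")
pdf(out, width = 7, height = 7)    # open a file device (size in inches)
plot(1:10, 1:10, col = "blue")
lines(1:10, 1:10, col = "red")
open_devices <- dev.list()         # the pdf device is listed while open
dev.off()                          # close it so that the file is complete

print(open_devices)
print(file.exists(out))            # TRUE
```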
In many programming languages, it is customary to use a for-loop to apply a function f over each element of a sequence. MATLAB example:
input = {u, v, w} % cell array
output = {} % empty cell array
for i=1:length(input),
output{i} = f(input{i})
end
The same task can be accomplished in MATLAB using cellfun.
output = cellfun(@f,input)
R also has for-loops, but it is also common to use higher-order functions, which accept other functions as their arguments. Map is one such R function, and its analog in MATLAB is cellfun. R example:
input <- list(u, v, w)
output <- Map(f,input)
Given a sequence a, b, c, Map and cellfun take function f and apply it to each element of the sequence such that the returned value is f(a), f(b), f(c).
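A concrete sketch with sum() standing in for f follows; the input values are made up. lapply() and sapply() are closely related base R functions, and Map() can also walk over several sequences in parallel.

```r
input <- list(u = 1:3, v = 4:6, w = 7:9)
as_list   <- Map(sum, input)       # list of results, names preserved
as_list2  <- lapply(input, sum)    # same result for a single input
as_vector <- sapply(input, sum)    # simplified to a named vector

# Map over two sequences in parallel
pairwise <- Map(function(a, b) a + b, 1:3, 4:6)

print(as_vector)          # u = 6, v = 15, w = 24
print(unlist(pairwise))   # 5 7 9
```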
If structuring your programs in this way appeals to you, read up on “functional programming”. We will use this convention as necessary during this course.
Common errors:
File not found ("No such file or directory").
Check the working directory with getwd().
Change the working directory with setwd("pathname") (or through the menu with RStudio).
List the files in the working directory with list.files() (or through the directory viewer in RStudio).
Check the packages loaded with search(). Functions with identical names are taken from packages in the order listed.
To unload a package, e.g., plyr, use the syntax: detach("package:plyr", unload=TRUE).
List the objects in your workspace with ls() (or workspace viewer in RStudio).

Troubleshooting graphics:
Use dev.list() to see if multiple graphics devices are open. This can happen if an error was encountered before the device could be closed in a previous call, and further plotting may be written to an unintended graphics device. Use dev.off() to close devices that are open (might have to do this several times).
If a plot does not appear, call print(graphicsobject); in these exercises the graphicsobject is often saved as ggp.

Aborting a computation:
Control-c at the interpreter.

Clearing the console:
Control-l or type cat("\014").

Clearing the workspace (warning):
rm(list=ls()) will technically delete objects in your global workspace, but be forewarned that it is safer to start a new instance of R, as this removal function does not reset the order in which packages are loaded. Therefore, it is possible that your code will not run in the same way as if you had restarted in a new instance of R.

Other errors:
As an example, let us consider the problem of multiplying two numbers. Their data types must be numeric, and an error will be returned if this is not the case.
> 2 * "1"
Error in 2 * "1" : non-numeric argument to binary operator
Let us define 1) a function to obtain the product of two numbers, and 2) a function which increments a number by one:
product <- function(x, y) {
x * y
}
adder1 <- function(x) {
x + 1
}
An error will be returned when applying the two functions:
> adder1(product(x, y))
Error in x * y : non-numeric argument to binary operator
or, using the pipe notation:
> library(magrittr) # for the %>%
> product(x, y) %>% adder1()
Error in x * y : non-numeric argument to binary operator
First, determine which function call is causing the problem. Is it product(x, y) or adder1 called on its result?
In an interactive language like R or MATLAB, it is easy to evaluate the function calls in sequence to find where the error occurs.
> z <- product(x, y)
> adder1(z)
and see where the error happens. In this case, it is product(x, y).
The next step is to determine whether the arguments passed to product are sensible. Therefore, you would evaluate:
> x
[1] 2
> y
[1] "1"
and see that the arguments to product are not all numeric, which suggests the path to correction.
If the error is not obvious, you will have to insert print statements or use language-specific debugging tools. Error handling functions can be used when certain types of errors are expected.
debug(): call on function
debugonce(): call on function
browser(): place in function
trace(): call on function
recover(): set in options() or trace()
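As a sketch of the error handling functions mentioned above, tryCatch() can intercept an expected error and stopifnot() can check inputs early. SafeProduct and CheckedProduct are hypothetical names introduced for illustration.

```r
SafeProduct <- function(x, y) {
  tryCatch(
    x * y,
    error = function(e) {
      # report the problem and return NA instead of stopping
      message("product failed: ", conditionMessage(e))
      NA
    }
  )
}
ok  <- SafeProduct(2, 3)      # 6
bad <- SafeProduct(2, "1")    # NA, with a message

# stopifnot() documents and enforces assumptions about the inputs
CheckedProduct <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y))
  x * y
}
rejected <- inherits(try(CheckedProduct(2, "1"), silent = TRUE), "try-error")

print(ok)
print(bad)
print(rejected)   # TRUE
```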
Within R:
> options(htmlhelp=TRUE)
> ?plot #or help("plot")
> ??keyword #or help.search("keyword")
> apropos("keyword")
> RSiteSearch("keyword")
List of vignettes, books, and other resources:
Documentation for plotting with the ggplot2 package:
Documentation for dplyr:
Style guides:
Submit specific questions to online forums: