• 2. Data structures
    • R is designed to handle experimental data
    • Character, numeric and logical data types
    • Factors
    • Creating a data frame (first attempt)
    • Naming data frame variables
    • Factors in data frames
    • Removing variables
    • Adding additional columns
    • Indexing data frames and matrices
    • Advanced indexing
    • Logical Operators
    • Exercise: Exercise 2
    • (Supplementary) Matrices

2. Data structures

R is designed to handle experimental data

  • Although the basic unit of R is a vector, we usually handle data in data frames.
  • A data frame is a set of observations of a set of variables – in other words, the outcome of an experiment.
  • For example, we might want to analyse information about a set of patients.
  • To start with, let’s say we have ten patients and for each one we know their name, sex, age, weight and whether they give consent for their data to be made public.
  • We are going to create a data frame called ‘patients’, which will have ten rows (observations) and seven columns (variables). The columns must all be equal lengths.
  • We will explore how to construct these data from scratch.
    • (in practice, we would usually import such data from a file)
   First_Name Second_Name      Full_Name    Sex Age Weight Consent
1        Adam       Jones     Adam Jones   Male  50   70.8    TRUE
2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
3        John       Evans     John Evans   Male  35   75.3   FALSE
4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
5       Peter       Baker    Peter Baker   Male  28   72.4   FALSE
6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
7      Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
9       David     Roberts  David Roberts   Male  57   73.2   FALSE
10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE

Character, numeric and logical data types

  • Each column is a vector, like previous vectors we have seen, for example:
age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 
            63.5, 71.5, 73.2, 64.8)
  • We can define the names using character vectors:
firstName  <- c("Adam", "Eve", "John", "Mary",
                "Peter", "Paul", "Joanna", "Matthew",
                "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
                "Baker","Daniels", "Edwards", "Smith", 
                "Roberts", "Wilson")

Notice how a single R command can be typed over multiple lines. R won’t execute the code until it sees the closing bracket ) that matches the opening bracket (. We often use this trick to make our code easier to read.

  • We also have a new type of vector, the logical vector, which only contains the values TRUE and FALSE:
consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, 
             FALSE, FALSE, TRUE, FALSE, TRUE)
  • Vectors can only contain one type of data; we cannot mix numbers, characters and logical values in the same vector.
    • If we try this, R will convert everything to characters:
c(20, "a string", TRUE)
[1] "20"       "a string" "TRUE"    
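As a small illustration of the conversion rule above, the sketch below mixes types inside c() and then tries to convert back to numbers. The variable names (mixedNum, mixedChr, backToNum) are made up for this example and are not part of the patients data.

```r
# Logical and numeric values together: the logicals become 1 and 0
mixedNum <- c(1.5, TRUE, FALSE)
class(mixedNum)

# Adding a character string turns every element into a character string
mixedChr <- c(20, "a string", TRUE)
class(mixedChr)

# Converting back to numbers: strings that are not numbers become NA
# (R also gives a warning, which we suppress here)
backToNum <- suppressWarnings(as.numeric(mixedChr))
backToNum
```

Only the first element survives the round trip as a number, which is one reason to keep each vector to a single type from the start.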
  • We can see the type of a particular vector using the class() function:
 class(firstName)
[1] "character"
 class(age)
[1] "numeric"
 class(weight)
[1] "numeric"
 class(consent)
[1] "logical"

Factors

  • Character vectors are fine for some variables, like names. But sometimes we have categorical data and we want R to recognize this
  • A factor is R’s data structure for categorical data:
sex <- c("Male", "Female", "Male", "Female", "Male",
         "Male", "Female", "Male", "Male", "Female")
sex
 [1] "Male"   "Female" "Male"   "Female" "Male"   "Male"   "Female" "Male"   "Male"  
[10] "Female"
factor(sex)
 [1] Male   Female Male   Female Male   Male   Female Male   Male   Female
Levels: Female Male
  • R has converted the strings of the sex character vector into two levels, which are the categories in the data
  • Note the values of this factor are not character strings, but levels
  • We can use this factor later on to compare data for males and females
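To make the idea of levels more concrete, here is a minimal sketch that inspects the factor built from the same sex vector. The object name sexFactor is an assumption introduced for this example.

```r
sex <- c("Male", "Female", "Male", "Female", "Male",
         "Male", "Female", "Male", "Male", "Female")
sexFactor <- factor(sex)

# The categories, sorted alphabetically by default
levels(sexFactor)

# Internally each value is stored as an integer code pointing at a level
as.integer(sexFactor)

# Count how many observations fall into each level
table(sexFactor)
```

Because "Female" sorts before "Male", it becomes level 1 and "Male" becomes level 2.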

Creating a data frame (first attempt)

  • We can construct a data frame from other objects (N.B. The paste() function joins character vectors together)
patients <- data.frame(firstName, secondName, 
                       paste(firstName, secondName),  
                       sex, age, weight, consent)
patients
   firstName secondName paste.firstName..secondName.    sex   age weight consent
      <fctr>     <fctr>                       <fctr> <fctr> <dbl>  <dbl>   <lgl>
1       Adam      Jones                   Adam Jones   Male    50   70.8    TRUE
2        Eve     Parker                   Eve Parker Female    21   67.9    TRUE
3       John      Evans                   John Evans   Male    35   75.3   FALSE
4       Mary      Davis                   Mary Davis Female    45   61.9    TRUE
5      Peter      Baker                  Peter Baker   Male    28   72.4   FALSE
6       Paul    Daniels                 Paul Daniels   Male    31   69.9   FALSE
7     Joanna    Edwards               Joanna Edwards Female    42   63.5   FALSE
8    Matthew      Smith                Matthew Smith   Male    33   71.5    TRUE
9      David    Roberts                David Roberts   Male    57   73.2   FALSE
10     Sally     Wilson                 Sally Wilson Female    62   64.8    TRUE

Naming data frame variables

  • We can access particular variables using the $ operator:
  • TIP: you can use TAB-complete to select the variable you want
patients$age
 [1] 50 21 35 45 28 31 42 33 57 62
  • R has inferred the names of our data frame variables from the names of the vectors or the commands (e.g. the paste() command)
  • We can name the variables after we have created a data frame using the names() function, and we can use the same function to see the names:
names(patients) <- c("First_Name", "Second_Name",
                     "Full_Name", "Sex", "Age", 
                     "Weight", "Consent")
names(patients)
[1] "First_Name"  "Second_Name" "Full_Name"   "Sex"         "Age"        
[6] "Weight"      "Consent"    
  • Or we can name the variables when we define the data frame
patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,
                                         secondName), 
                       Sex = sex,
                       Age = age,
                       Weight = weight, 
                       Consent = consent)
names(patients)
[1] "First_Name"  "Second_Name" "Full_Name"   "Sex"         "Age"        
[6] "Weight"      "Consent"    
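Because names() returns an ordinary character vector, we can also rename a single variable by indexing into it. The sketch below uses a small made-up data frame (demoFrame, with columns a and b) rather than the patients data.

```r
# A tiny data frame with two uninformative column names
demoFrame <- data.frame(a = 1:3, b = c(10.5, 11.2, 9.8))

# Rename only the column currently called "b"
names(demoFrame)[names(demoFrame) == "b"] <- "Weight"
names(demoFrame)
```

Matching on the old name, rather than on a column position, keeps the code correct even if the column order changes.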

Factors in data frames

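  • When creating a data frame, versions of R before 4.0.0 assume all character vectors should be categorical variables and convert them to factors (from R 4.0.0 onwards the default is stringsAsFactors = FALSE, so character vectors stay as characters). Automatic conversion is not always what we want: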
    • e.g. we are unlikely to be interested in the hypothesis that people called Adam are taller, so it seems a bit silly to represent this as a factor
patients$First_Name
 [1] Adam    Eve     John    Mary    Peter   Paul    Joanna  Matthew David   Sally  
Levels: Adam David Eve Joanna John Mary Matthew Paul Peter Sally
  • We can avoid this by asking R not to treat strings as factors, and then explicitly stating when we want a factor by using factor():
patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,
                                         secondName), 
                       Sex = factor(sex),
                       Age = age,
                       Weight = weight,
                       Consent = consent,
                       stringsAsFactors = FALSE)
patients
   First_Name Second_Name      Full_Name    Sex   Age Weight Consent
        <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>
1        Adam       Jones     Adam Jones   Male    50   70.8    TRUE
2         Eve      Parker     Eve Parker Female    21   67.9    TRUE
3        John       Evans     John Evans   Male    35   75.3   FALSE
4        Mary       Davis     Mary Davis Female    45   61.9    TRUE
5       Peter       Baker    Peter Baker   Male    28   72.4   FALSE
6        Paul     Daniels   Paul Daniels   Male    31   69.9   FALSE
7      Joanna     Edwards Joanna Edwards Female    42   63.5   FALSE
8     Matthew       Smith  Matthew Smith   Male    33   71.5    TRUE
9       David     Roberts  David Roberts   Male    57   73.2   FALSE
10      Sally      Wilson   Sally Wilson Female    62   64.8    TRUE
patients$Sex
 [1] Male   Female Male   Female Male   Male   Female Male   Male   Female
Levels: Female Male
patients$First_Name
 [1] "Adam"    "Eve"     "John"    "Mary"    "Peter"   "Paul"    "Joanna"  "Matthew"
 [9] "David"   "Sally"  
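Rather than printing each column in turn, we can check all the column types at once. The following is a minimal sketch on a three-row stand-in for the patients data; the names demoPatients and columnClasses are assumptions for this example.

```r
demoPatients <- data.frame(First_Name = c("Adam", "Eve", "John"),
                           Sex = factor(c("Male", "Female", "Male")),
                           Age = c(50, 21, 35),
                           stringsAsFactors = FALSE)

# Apply class() to every column and collect the results
columnClasses <- sapply(demoPatients, class)
columnClasses

# str() gives a compact summary of the whole structure
str(demoPatients)
```

Here only Sex should be reported as a factor, because it is the only column we wrapped in factor().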

Removing variables

Now that we are happy with our data frame, we no longer have any use for the vectors that were used to create it

  • R has a function called rm that will allow us to remove variables
rm(age)

Once something has been removed, we can no longer use it; asking R for it gives an error saying the object was not found

age

Multiple objects can be removed at the same time

# age has already been removed above, so it is not listed again
rm(list = c("firstName","secondName","sex","weight","consent"))
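If we are unsure whether an object is still around, the exists() function can tell us without triggering an error. This is a small sketch using a throwaway object; the names scratchVector, existsBefore and existsAfter are invented for the example.

```r
scratchVector <- c(1, 2, 3)

# Ask whether an object with this name can be found
existsBefore <- exists("scratchVector")

rm(scratchVector)

# After rm() the name no longer refers to anything
existsAfter <- exists("scratchVector")

existsBefore
existsAfter
```

Note that exists() takes the name as a character string, in the same way as the list argument of rm().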

Adding additional columns

Recall that we can create a new variable using an assignment operator and specifying a name that R isn’t currently using as a variable name

myNewVariable <- 42
myNewVariable
[1] 42

We use a similar trick to define new columns in the data frame. The value you assign must have one element for each row of the data frame (a single value is also allowed, and is repeated for every row).

patients$ID
NULL
patients$ID <- paste("Patient", 1:10)
patients
   First_Name Second_Name      Full_Name    Sex   Age Weight Consent         ID
        <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>      <chr>
1        Adam       Jones     Adam Jones   Male    50   70.8    TRUE  Patient 1
2         Eve      Parker     Eve Parker Female    21   67.9    TRUE  Patient 2
3        John       Evans     John Evans   Male    35   75.3   FALSE  Patient 3
4        Mary       Davis     Mary Davis Female    45   61.9    TRUE  Patient 4
5       Peter       Baker    Peter Baker   Male    28   72.4   FALSE  Patient 5
6        Paul     Daniels   Paul Daniels   Male    31   69.9   FALSE  Patient 6
7      Joanna     Edwards Joanna Edwards Female    42   63.5   FALSE  Patient 7
8     Matthew       Smith  Matthew Smith   Male    33   71.5    TRUE  Patient 8
9       David     Roberts  David Roberts   Male    57   73.2   FALSE  Patient 9
10      Sally      Wilson   Sally Wilson Female    62   64.8    TRUE Patient 10
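The sketch below shows what happens when the length rule for new columns is and is not respected. It uses a made-up three-row data frame (demoFrame), and tryCatch() is used only so that the deliberate mistake does not stop the script.

```r
demoFrame <- data.frame(Age = c(50, 21, 35))

# One value per row, computed from an existing column
demoFrame$Over30 <- demoFrame$Age > 30

# A single value is repeated for every row
demoFrame$Study <- "pilot"

# Two values for three rows is an error; record what happened
outcome <- tryCatch({
  demoFrame$Bad <- 1:2
  "assigned"
}, error = function(e) "error")

outcome
names(demoFrame)
```

The failed assignment leaves the data frame unchanged, so no Bad column is created.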

Indexing data frames and matrices

  • You can index multidimensional data structures like matrices and data frames using commas:
  • object[rows, columns]
  • Try to predict what each of the following commands will do:
patients[2,1]
[1] "Eve"
patients[1,2]
[1] "Jones"
patients[1,1:3]
  First_Name Second_Name      Full_Name
       <chr>       <chr>          <chr>
1       Adam       Jones     Adam Jones
  • If you don’t provide an index for either rows or columns, all of the rows or columns will be returned.
patients[1,]
  First_Name Second_Name      Full_Name    Sex   Age Weight Consent         ID
       <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>      <chr>
1       Adam       Jones     Adam Jones   Male    50   70.8    TRUE  Patient 1
  • Rows or columns can be omitted by putting a - in front of the index
patients[,-1]
   Second_Name      Full_Name    Sex   Age Weight Consent         ID
         <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>      <chr>
1        Jones     Adam Jones   Male    50   70.8    TRUE  Patient 1
2       Parker     Eve Parker Female    21   67.9    TRUE  Patient 2
3        Evans     John Evans   Male    35   75.3   FALSE  Patient 3
4        Davis     Mary Davis Female    45   61.9    TRUE  Patient 4
5        Baker    Peter Baker   Male    28   72.4   FALSE  Patient 5
6      Daniels   Paul Daniels   Male    31   69.9   FALSE  Patient 6
7      Edwards Joanna Edwards Female    42   63.5   FALSE  Patient 7
8        Smith  Matthew Smith   Male    33   71.5    TRUE  Patient 8
9      Roberts  David Roberts   Male    57   73.2   FALSE  Patient 9
10      Wilson   Sally Wilson Female    62   64.8    TRUE Patient 10
patients[-c(5,7),]
   First_Name Second_Name      Full_Name    Sex   Age Weight Consent         ID
        <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>      <chr>
1        Adam       Jones     Adam Jones   Male    50   70.8    TRUE  Patient 1
2         Eve      Parker     Eve Parker Female    21   67.9    TRUE  Patient 2
3        John       Evans     John Evans   Male    35   75.3   FALSE  Patient 3
4        Mary       Davis     Mary Davis Female    45   61.9    TRUE  Patient 4
6        Paul     Daniels   Paul Daniels   Male    31   69.9   FALSE  Patient 6
8     Matthew       Smith  Matthew Smith   Male    33   71.5    TRUE  Patient 8
9       David     Roberts  David Roberts   Male    57   73.2   FALSE  Patient 9
10      Sally      Wilson   Sally Wilson Female    62   64.8    TRUE Patient 10
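Besides numeric positions, the column index can also be a character vector of variable names, which is often easier to read. The following sketch uses a three-row stand-in for the patients data; demoPatients and the result names are assumptions for this example.

```r
demoPatients <- data.frame(First_Name = c("Adam", "Eve", "John"),
                           Age = c(50, 21, 35),
                           Weight = c(70.8, 67.9, 75.3),
                           stringsAsFactors = FALSE)

# Row 2 of the Age column
oneValue <- demoPatients[2, "Age"]

# All rows, two columns selected by name: the result is a data frame
twoColumns <- demoPatients[, c("First_Name", "Weight")]

# A single column comes back as a plain vector...
ageVector <- demoPatients[, "Age"]

# ...unless we ask R to keep the data frame structure
ageFrame <- demoPatients[, "Age", drop = FALSE]

oneValue
twoColumns
ageVector
ageFrame
```

Indexing by name has the same effect as indexing by position here, but it does not break if columns are later added or reordered.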

Advanced indexing

  • Indices are actually vectors, and can be numeric or logical:
  • We won’t always know in advance which indices we want to return
    • we might want all values that exceed a particular value or satisfy some other criteria
  • In this example, letters is a vector containing all letters in the English alphabet
letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t"
[21] "u" "v" "w" "x" "y" "z"
s <- letters[1:5]
s
[1] "a" "b" "c" "d" "e"

So far we have seen how to extract values by position, for example the first and third values in the vector

s[c(1,3)]
[1] "a" "c"

R can perform the same operation using a vector of logical values. Only indices with a TRUE value will get returned

s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] "a" "c"
  • We can do the logical test and indexing in the same line of R code
    • R will do the test first, and then use the vector of TRUE and FALSE values to subset the vector
a <- 1:5
a < 3
[1]  TRUE  TRUE FALSE FALSE FALSE
s[a < 3]
[1] "a" "b"
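A logical vector can also be summarised before it is used as an index. This short sketch reuses s and a from above and introduces one hypothetical name, keep, for the result of the test.

```r
s <- letters[1:5]
a <- 1:5

# Store the result of the logical test
keep <- a < 3

# which() reports the positions that are TRUE
which(keep)

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
sum(keep)

# Indexing with the positions gives the same answer as indexing with keep
s[which(keep)]
```

Counting matches with sum() is a quick sanity check before subsetting a larger data set.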

Logical Operators

  • Operators allow us to combine multiple logical tests
  • comparison operators <, >, <=, >=, ==, !=
  • logical operators !, &, | and the function xor()
    • Comparison and logical operators always return logical values, i.e. TRUE or FALSE
s[a > 1 & a < 3]
[1] "b"
s[a == 2]
[1] "b"
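The remaining logical operators work in the same way. The sketch below applies |, ! and xor() to the same s and a vectors; the result names (eitherEnd, notThree, exactlyOne) are made up for this example.

```r
s <- letters[1:5]
a <- 1:5

# OR: keep values where at least one test is TRUE
eitherEnd <- s[a < 2 | a > 4]

# NOT: flip TRUE and FALSE
notThree <- s[!(a == 3)]

# xor(): TRUE where exactly one of the two tests is TRUE
exactlyOne <- s[xor(a > 2, a > 4)]

eitherEnd
notThree
exactlyOne
```

For the last case, a > 2 is TRUE for 3, 4 and 5 while a > 4 is TRUE only for 5, so exactly one test holds for 3 and 4.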

The vector that you use to perform the logical test could be extracted from a data frame

  • which could then be used to subset the data frame
patients$First_Name == "Peter"
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
patients[patients$First_Name == "Peter",]
  First_Name Second_Name      Full_Name    Sex   Age Weight Consent         ID
       <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>      <chr>
5      Peter       Baker    Peter Baker   Male    28   72.4   FALSE  Patient 5

Exercise: Exercise 2

  • Write R code to print the following subsets of the patients data frame
  • The first and second rows, and the first and second columns
  First_Name Second_Name
1       Adam       Jones
2        Eve      Parker
  • Only even-numbered rows

HINT: you can use the seq function that we saw earlier to define a vector of even numbers

   First_Name Second_Name      Full_Name    Sex Age Weight Consent
2         Eve      Parker     Eve Parker Female  21   67.9    TRUE
4        Mary       Davis     Mary Davis Female  45   61.9    TRUE
6        Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
8     Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
10      Sally      Wilson   Sally Wilson Female  62   64.8    TRUE
  • All rows except the last one, all columns

HINT: the nrow function will give the number of rows in the data frame

  First_Name Second_Name      Full_Name    Sex Age Weight Consent
1       Adam       Jones     Adam Jones   Male  50   70.8    TRUE
2        Eve      Parker     Eve Parker Female  21   67.9    TRUE
3       John       Evans     John Evans   Male  35   75.3   FALSE
4       Mary       Davis     Mary Davis Female  45   61.9    TRUE
5      Peter       Baker    Peter Baker   Male  28   72.4   FALSE
6       Paul     Daniels   Paul Daniels   Male  31   69.9   FALSE
7     Joanna     Edwards Joanna Edwards Female  42   63.5   FALSE
8    Matthew       Smith  Matthew Smith   Male  33   71.5    TRUE
9      David     Roberts  David Roberts   Male  57   73.2   FALSE
  • Use logical indexing to select the following patients from the data frame:
    1. Patients under 40
    2. Patients who give consent to share their data
    3. Men who weigh as much as or more than the average European male (70.8 kg)
age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 
            63.5, 71.5, 73.2, 64.8)
firstName  <- c("Adam", "Eve", "John", "Mary",
                "Peter", "Paul", "Joanna", "Matthew",
                "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
                "Baker","Daniels", "Edwards", "Smith", 
                "Roberts", "Wilson")
consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, 
             FALSE, FALSE, TRUE, FALSE, TRUE)
sex <- c("Male", "Female", "Male", "Female", "Male",
         "Male", "Female", "Male", "Male", "Female")
patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,
                                         secondName), 
                       Sex = factor(sex),
                       Age = age,
                       Weight = weight,
                       Consent = consent,
                       stringsAsFactors = FALSE)
rm(list = c("firstName","secondName","sex","weight","consent"))
patients
   First_Name Second_Name      Full_Name    Sex   Age Weight Consent
        <chr>       <chr>          <chr> <fctr> <dbl>  <dbl>   <lgl>
1        Adam       Jones     Adam Jones   Male    50   70.8    TRUE
2         Eve      Parker     Eve Parker Female    21   67.9    TRUE
3        John       Evans     John Evans   Male    35   75.3   FALSE
4        Mary       Davis     Mary Davis Female    45   61.9    TRUE
5       Peter       Baker    Peter Baker   Male    28   72.4   FALSE
6        Paul     Daniels   Paul Daniels   Male    31   69.9   FALSE
7      Joanna     Edwards Joanna Edwards Female    42   63.5   FALSE
8     Matthew       Smith  Matthew Smith   Male    33   71.5    TRUE
9       David     Roberts  David Roberts   Male    57   73.2   FALSE
10      Sally      Wilson   Sally Wilson Female    62   64.8    TRUE
### Your Answer ###

(Supplementary) Matrices

  • Data frames are R’s speciality, but R also handles matrices:
    • All columns are assumed to contain the same data type, e.g. numerical
    • Matrices can be manipulated in the same fashion as data frames
      • We can easily convert between the two object types
e <- matrix(1:10, nrow=5, ncol=2)
e
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
  • Some calculations are more efficient to do on matrices, e.g.:
rowMeans(e)
[1] 3.5 4.5 5.5 6.5 7.5
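The conversion between matrices and data frames mentioned above can be sketched with as.data.frame() and as.matrix(). The object names eFrame and eBack are assumptions introduced for this example.

```r
e <- matrix(1:10, nrow = 5, ncol = 2)

# Matrix to data frame: unnamed columns are given the names V1, V2, ...
eFrame <- as.data.frame(e)
names(eFrame)

# Data frame back to matrix: the dimensions are unchanged
eBack <- as.matrix(eFrame)
dim(eBack)

# rowMeans() gives the same values on either object
rowMeans(eFrame)
```

Converting a data frame with mixed column types to a matrix would force every value to a single type, so this round trip is safest when all columns are numeric.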

Matrices (and indeed data frames) can be joined together using the functions cbind and rbind

Let’s first create some example data

mat1 <- matrix(11:20, nrow=5,ncol=2)
mat1
     [,1] [,2]
[1,]   11   16
[2,]   12   17
[3,]   13   18
[4,]   14   19
[5,]   15   20
mat2 <- matrix(21:30, nrow=5, ncol=2)
mat2
     [,1] [,2]
[1,]   21   26
[2,]   22   27
[3,]   23   28
[4,]   24   29
[5,]   25   30
mat3 <- matrix(31:40,nrow=5,ncol=2)
mat3
     [,1] [,2]
[1,]   31   36
[2,]   32   37
[3,]   33   38
[4,]   34   39
[5,]   35   40

And now try out these functions:

cbind(mat1,mat2)
     [,1] [,2] [,3] [,4]
[1,]   11   16   21   26
[2,]   12   17   22   27
[3,]   13   18   23   28
[4,]   14   19   24   29
[5,]   15   20   25   30
rbind(mat1,mat3)
      [,1] [,2]
 [1,]   11   16
 [2,]   12   17
 [3,]   13   18
 [4,]   14   19
 [5,]   15   20
 [6,]   31   36
 [7,]   32   37
 [8,]   33   38
 [9,]   34   39
[10,]   35   40
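Joining only works when the shapes are compatible: cbind needs the same number of rows, and rbind the same number of columns. The sketch below checks the resulting dimensions and shows a deliberate mismatch; the matrix smallMat and the other result names are made up for this example, and tryCatch() is used only to record the error.

```r
mat1 <- matrix(11:20, nrow = 5, ncol = 2)
mat2 <- matrix(21:30, nrow = 5, ncol = 2)

# Side by side: rows stay at 5, columns add up
wide <- cbind(mat1, mat2)
dim(wide)

# Stacked: columns stay at 2, rows add up
tall <- rbind(mat1, mat2)
dim(tall)

# A 3-row matrix cannot be placed beside a 5-row matrix
smallMat <- matrix(1:6, nrow = 3, ncol = 2)
outcome <- tryCatch({
  cbind(mat1, smallMat)
  "joined"
}, error = function(e) "error")
outcome
```

Checking dim() before joining is a cheap way to catch this kind of mismatch early.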
