Lab2

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Review

This week’s lab is a combination of Chapter 5.4 from R in Action by Robert Kabacoff and last year’s lab material.

Control flow refers to the order of statements or functions calls are executed. We can use different control-flow structures, including for loop, while, and if-else statements.

For loops

The for loop executes a statement repetitively until we reach the last value contained in the sequence in seq. A basic example includes the following,

# loop over a general vector
a_vec = c(0.1, 0.3, 0.5) ;
for (a in a_vec){
  print(a);
}

## [1] 0.1
## [1] 0.3
## [1] 0.5

Remember that the index vector can be more general than 1:n. Furthermore, you could use the double loop as well,

# double for loop
for (i in 1:3){
  for (j in 1:3){
    print(c(i,j))
  }
}

## [1] 1 1
## [1] 1 2
## [1] 1 3
## [1] 2 1
## [1] 2 2
## [1] 2 3
## [1] 3 1
## [1] 3 2
## [1] 3 3

While loops

Like in any language, be careful with terminating condition(s). Avoid infinite loops. A while loop executes a statement repetitively until the condition is no longer true. Look at the following example,

# loop over until our varialbe i reaches 0
i=5
while (i>0){
  print(i)
  i = i-1
}

## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1

if-else

The if-else control structure executes a statement if a given condition is true.

# Checking even number
a = 2
if (a %% 2 == 0){
  print('a is an even number')
} else{
  print('a is not an even number')
}

## [1] "a is an even number"

switch

switch selectively executes an argument based on the expression.

# switch example
feelings = c('sad', 'afraid')
for(i in feelings){
  print(
    switch(i,
        happy = 'I am glad you are happy',
        afraid = 'There is nothing to fear',
        sad = 'Cheer up',
        angry = 'Calm down now')
  )
}

## [1] "Cheer up"
## [1] "There is nothing to fear"

Not relevant at all

# Need to install the package called 'praise'
# require(praise)
# praise()

Apply’s

Probably new to people coming from other programming languages.

apply, lapply, sapply can come in handy when you try to avoid writing loops.

Basic form:

a_matrix = matrix(rnorm(12),3,4)
print(a_matrix)

##            [,1]       [,2]       [,3]       [,4]
## [1,]  0.8757685  0.7984639 -0.3073529  0.7288364
## [2,]  0.6744366 -1.2704720  0.5509735 -0.4348839
## [3,] -1.4905839 -1.0741841 -0.3247654 -0.8777252

apply(a_matrix, MARGIN = 2, mean)

## [1]  0.01987374 -0.51539741 -0.02704827 -0.19459090

colMeans(a_matrix)

## [1]  0.01987374 -0.51539741 -0.02704827 -0.19459090

apply(a_matrix, MARGIN = 1, mean)

## [1]  0.5239290 -0.1199864 -0.9418146

R provides a function apply() which apply() an arbitrary function to a matrix, array, or data frame. Margin is the dimension index, which you can understand through the example above. MARGIN=1 performs a row wise operation, while 2 for column wise operation.

lapply() and sapply() both operate on list input. The only difference is that lapply() outputs a list, and sapply() outputs a vector or matrix instead of list object.

temp_list = list(a=1:10, b=11:15)
lapply(temp_list, mean)

## $a
## [1] 5.5
## 
## $b
## [1] 13

sapply(temp_list, mean)

##    a    b 
##  5.5 13.0

Writing functions

Like other languages, we can write functions for our own, but be careful! If there is a global variable referred within the function, R doesn’t output an error message. The basic form looks like following

myfunction = function(x, y){
  return(x+y)
}

# function from this week's homework
record_high = function(data, station){
    high_index = which.max(data$precip[[station]])
    return(data$time[[station]][high_index])
}

Break and next

Be careful with those two.

Density function of normal distributions

In R, functions are a symbolic representation of an operation to be carried out on variables known as inputs. Functions are useful when you plan to apply a certain operation to a range of inputs which are cumbersome to be declared beforehand. The syntax for declaring a function can be best understood with an example:

The probability density of a normal distribution \(N(\mu,\sigma^{2})\) is defined as

\[ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\{-\frac{(x - \mu)^2}{2 \sigma^2}\}\,. \]

There is a function dnorm() in R that can produce values of this density. Now without using that function, write your own function to calculate \(f(x)\). Then plot the densities of \(N(0,1)\) from your own function on a grid of 30 points from \(-3\) to \(3\), and compare with the result using the internal function dnorm().

normal_density_function = function(x, mean, sd) {
  d = 1 / sqrt(2 * pi * sd ^ 2) * exp(-(x - mean) ^ 2 / (2 * sd ^ 2))
  return(d)
}
# Generate a grid with 30 points between -3 and 3
x = seq(-3, 3, length = 30)
# Compute the corresponding densities
y1 = normal_density_function(x = x, mean = 0, sd = 1)
# Compute the densities from R internal function dnorm()
y2 = dnorm(x = x, mean = 0, sd = 1)
# Plot the result of function normal_density_function() with points
plot(
x, y1, main = 'Density of normal distribution',
xlab = 'x', ylab = 'Density', type = 'p'
)
# Plot the result of function normal_density_function() with curve
lines(x, y2, lty = 1, col = "red")

The return statement within a function is used when you want some value returned by the function.

Remember the variables defined inside a function are local and not accessible to the outside. More on this in the Environment section of Advanced R.

Coin flipping game

In a fair coin flipping game, let \(B_{1},B_{2},\ldots\) denote the result where \(B_{i}=1\) if it’s heads and \(0\) if it’s tails. Let \(N\) be the number of experiments when the first time you get heads, i.e. \[ N = \min \{n \geq 1: B_{n}=1\}\,. \] Suppose you are given a number \(m\), write a piece of R code that runs the game for \(m\) times and prints \(N_{1},\ldots,N_{m}\).

Hint: you can use the command rbinom(1,1,0.5) to simulate the result of flipping a coin. The function rbinom(n, size, prob) generates binomial random variable.

m = 10
for (k in 1:m) {
  # initialize count n and result B
  n = 0
  B = 0
  # stop at the first time that 1 occurs
  while (B != 1) {
    n = n + 1
    # generate a fair coin flip
    B = rbinom(1,1,0.5)
  }
  print(n)
}

## [1] 2
## [1] 1
## [1] 2
## [1] 3
## [1] 1
## [1] 6
## [1] 1
## [1] 1
## [1] 4
## [1] 1

Another coin flipping game

In a fair coin flipping game, two players A and B bet on the result of the first two consecutives that has the same face. If the first two consecutive are heads, then A wins. If the first two consecutive are tails, then B wins.

Suppose you are given a number \(m\), write a piece of R code that simulates the game for \(m\) times and prints the winner at each time. And calculate the proportion of times that A wins.

m = 100
# save the times that A wins
count = 0
for (k in 1:m) {
  # generates the first flip
  coin_old = rbinom(1,1,0.5)
  # generates the second flip
  coin = rbinom(1,1,0.5)
  # stops when two consecutive results appear
  while (coin_old != coin) {
    # save the last flip
    coin_old = coin
    # generate a new flip
    coin = rbinom(1,1,0.5)
  }
  # check the results
  if (coin == 1) {
    print("A wins")
    count = count + 1
  }else{
    print("B wins")
  }
}

# print the proportion of times A wins
print(count/m)

## [1] 0.52

Student performance data set (data I/O)

Download the student-mat.csv file from Canvas. This is a data set about students’ math grades and other information. Make sure that the data is located in your current working directory in R. You can use the command getwd() and setwd() to check and change the directory.

Read the data into R by using command read.table("student-mat.csv",header=T,sep=";"). Check the help document in R to see what the parameters mean.
Look at the variables G1, G2, G3 which denote the first period, second period, and final math grades of the students. Create a new data frame consisting of only two columns: the average of the three grades; the weighted average of the three grades with weights 0.25, 0.25, 0.5. You can use either for loops or apply to solve this.
Save the new data frame into a csv file student\_new.csv.

# read the data
student = read.table("student-mat.csv",header = T,sep = ";")
n = nrow(student)
# create a new data frame and specify the column names
student_new = data.frame(matrix(0,nrow = n,ncol = 2))
names(student_new) = c("avg","weightedAvg")

# use for loop
for (i in 1:n) {
  tmp = as.numeric(student[i,c("G1","G2","G3")])
  student_new$avg[i] = mean(tmp)
  student_new$weightedAvg[i] = sum(tmp * c(0.25,0.25,0.5))
}

# use apply
student_new$avg = apply(student[,c("G1","G2","G3")],1,mean)
# define the weighted sum function
student_new$weightedAvg = apply(student[,c("G1","G2","G3")],1,
                                 function(x) {
                                   sum(x * c(0.25,0.25,0.5))
                                 })

# write the new data frame into csv file
write.table(
  student_new,file = "student_new.csv",sep = ",",
  col.names = T,row.names = F
  )

Extras

More dice throwing by [Tadashi Tokieda].

Simulate the dice comparisons he described.

Of course, this problem can be solved exactly without using simulations, that’s how he obtained the ranking.

Sometimes simulation can be used to solve problem that cannot be solved exactly.

Q&A

How to convert factors to strings?

Some face this problem after reading in the datasets.

senators=read.table("senators.txt",header=F,sep=",")
senators[1,1]

## [1] Akaka  Daniel K.
## 102 Levels: Akaka  Daniel K. Alexander  Lamar ... Wyden  Ron

What you are getting with senators[i,1] is a Factor, not a string (aka character). Verify that with class(senators[i,1]).

class(senators[1,1])

## [1] "factor"

What’s happening is that when you read the dataset into R, all strings are converted to factors by default.

To disable this conversion, set as.is = TRUE, or stringsAsFactors = FALSE.

senators=read.table("senators.txt",header=F,sep=",",stringsAsFactors = FALSE)
senators[1,1]

## [1] "Akaka  Daniel K."

class(senators[1,1])

## [1] "character"

Alternatively (or more generally), to extract the face-values of a factor, and not its numerical representation, you can do this:

senators=read.table("senators.txt",header=F,sep=",",stringsAsFactors = TRUE)
senatorNames = senators[,1]
head(levels(senatorNames)[senatorNames])

## [1] "Akaka  Daniel K." "Alexander  Lamar" "Allard  Wayne"   
## [4] "Barrasso  John"   "Baucus  Max"      "Bayh  Evan"

class(levels(senatorNames)[senatorNames])

## [1] "character"