This tutorial introduces how to do regression in R and then shows you how to write your own functions. Finally, the third part shows you how to import stock return data in R and introduces some related concepts and tools. Rather than appearing at the end, six exercises are scattered throughout the parts of this tutorial.

Regression in R

lm

This command stands for “linear model” and is R’s general purpose regression command. We’re used to specifying a linear model as “y = a + bx” and lm uses a similar syntax. Basically, you’ll specify lm(y ~ x). Let’s do the regression of horsepower on MPG that we did in class. Notice, in the lm command, you can specify the “data”. If you didn’t, you’d have to do “mtcars$mpg” and “mtcars$hp”.

library(data.table)
lm(hp ~ mpg, data = mtcars)
## 
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
## 
## Coefficients:
## (Intercept)          mpg  
##      324.08        -8.83

Remember: unlike correlation and covariance, regression is not symmetric.

lm(mpg ~ hp, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp  
##    30.09886     -0.06823

Let’s plot this relationship.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(data = mtcars, x = mpg, y = hp, type = "scatter", mode = "markers")

To add the regression line, we follow the plot command with the function add_trace, using the function fitted on the output of lm to extract the fitted values that trace out the regression line. The “%>%” lets you string together multiple commands and continue to add onto the same chart instead of making a new one.

regLine <- lm(hp ~ mpg, data = mtcars)
carRegression <- plot_ly(data = mtcars, x = mpg, y = hp, type = "scatter", mode = "markers", name = "data") %>% 
  add_trace(x = mpg, y = fitted(regLine), type = "lines", name = "regression")
carRegression
plotly_POST(carRegression, filename = "carRegression", sharing = "public")
## No encoding supplied: defaulting to UTF-8.
## Success! Modified your plotly here -> https://plot.ly/~emallickhossain/14

Note that add_trace can only be used after you’ve already made a plot. It adds a line to the existing plot rather than making a plot of its own.

You can also use add_trace to plot different kinds of lines. For example, we can show that the regression line goes through the means of the data. I added some style options that make the “average” lines dashed instead of solid. The main idea of this example is to get comfortable with plotting your data and then adding lines on top of the original plot. Unfortunately, Plotly does not have something like an hline or vline command, so you actually have to specify the points that you are plotting. Slightly annoying, but manageable.

carRegMeans <- plot_ly(data = mtcars, x = mpg, y = hp, type = "scatter" , mode = "markers", name = "Data") %>%
  add_trace(x = mpg, y = fitted(regLine), type = "lines", name = "Regression") %>%
  add_trace(x = c(min(mpg), max(mpg)), y = c(mean(hp), mean(hp)), mode = "lines", 
            line = list(dash = "5px"), name = "Average HP") %>%
  add_trace(x = c(mean(mpg), mean(mpg)), y = c(min(hp), max(hp)), mode = "lines", 
            line = list(dash = "5px"), name = "Average MPG")
carRegMeans
plotly_POST(carRegMeans, filename = "carRegMeans", sharing = "public")
## No encoding supplied: defaulting to UTF-8.
## Success! Modified your plotly here -> https://plot.ly/~emallickhossain/16

Exercise #1 - Regression

Carry out a linear regression in which qsec is the y-variable (this is the time to go 1/4 mile) and wt is the x-variable (the weight of the car). Plot the data and add the regression line. Add dashed lines to show that the regression line goes through the means of the data. Interpret your results. Is there a relationship between quarter-mile time and car weight? Bonus: Can you make your dotted lines red? (Hint: you’ll want to add color = "red" somewhere)

Creating Your Own Functions

Functions take one or more inputs, called “arguments”, and combine them in some way to produce an output. You’ve already met a great many of the built-in R functions. Now it’s time to make some of our own. Making functions is a great way to save time by automating repetitive tasks and to extend the functionality of R to make it even more useful.

Writing a function would be the equivalent of writing an equation on your exam and then referencing it each time you compute something. If the function is specified properly, then you’ll get full credit. If your function is not correct, then you’ll probably get wrong answers.

Simple Example

R doesn’t have a built-in function for computing z-scores, so we’ll make one of our own. Among other things, this example will illustrate a key point about functions: you can build them out of other functions, which means you don’t have to re-invent the wheel each time you want to do something. To make our z-score function we’ll use the built-in R functions mean and sd rather than writing our own mean and standard deviation functions from scratch:

z.score <- function(x){
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(z)
}

Let’s walk through the above code one step at a time:

  1. In R, <-, pronounced “gets”, is the assignment operator, so z.score <- means that we’re assigning something to z.score.
  2. The thing we are assigning is indicated next: function. So far we have z.score <- function, which means “R, I’m going to create a function and call it z.score.”
  3. There are some parentheses next to function. These are used to enclose the function’s arguments. Argument is just another word for input. Functions can take one argument, more than one argument or no arguments depending on the situation, but you always need the parentheses. Here, z.score <- function(x) means “R, I’m going to create a function called z.score that takes a single argument whose name is x.”
  4. The next thing we see is a curly bracket: {. Essentially, a function is a set of instructions that tells R how to turn some arguments (inputs) into a desired output. These instructions are called the function body and they are always enclosed in curly brackets. If you look at the bottom of the function, you’ll see the second curly bracket. A common mistake that people make is to forget the second bracket. Always make sure to close any bracket that you open.
  5. Now we’ll look at the function body, the list of commands contained inside of the curly brackets. The first part should be familiar: it’s made up entirely of R commands that you already know! It does the following
  • Calculate the mean of the argument (input) x. Ignore missing observations.
  • Calculate the standard deviation of the argument (input) x. Ignore missing observations.
  • Subtract the mean of x from the vector x and then divide by the standard deviation
  • Store the answer in a vector called z
  6. The last part of the function is the command return. This indicates that our function should output or “return” the vector z. In other words, z is the “answer”.
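
To check that the function behaves as expected, here’s a quick test on a small made-up vector (the values 1 through 5 are chosen only for illustration):

z.score(c(1, 2, 3, 4, 5))
## [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111

The mean is 3 and the standard deviation is about 1.58, so the middle value maps to zero and the other z-scores are symmetric around it.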

It was easy to create the z.score function because mathematical operations in R are vectorized. This means, for example, that if we want to add 3 to each element of a vector x, we can just write x + 3. In other words we don’t have to separately add three to each element of x. Similarly, if we want to add corresponding elements of x and y, two vectors of the same length, we use x + y. Try some simple examples of your own to make sure you understand how this works. See what happens if you try to add two vectors of different lengths.
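
Here’s a quick illustration with two short made-up vectors:

x <- c(1, 2, 3)
y <- c(10, 20, 30)
x + 3   # adds 3 to every element of x
## [1] 4 5 6
x + y   # adds corresponding elements of x and y
## [1] 11 22 33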

Scope

Now, if you run your newly written function z.score on some data and then type z to see the answer again, you will get an error (“Error: object ‘z’ not found”). We assigned the answer to z, so why doesn’t R know what z is?

There is something slightly subtle here: z is a variable “inside a function”, namely a variable inside the function z.score. This makes it different from the other variables that we’ve seen so far. To explain how, we’ll need to introduce a programming concept called “scope.”

Scope refers to where the variable “lives,” meaning how it can be accessed. We will focus on two types of scope, global and local.

If you type the command ls() into the R console (or look in the top right of RStudio), you will see a bunch of variables. These variables have “global scope”. This means that they can be accessed from anywhere by any function. If you’ve run the code for the function z.score above, you’ll see z.score when you type ls() into the console. This means that z.score has global scope. Similarly if you type foo <- 3 into the R console, foo will have global scope. You can see this by re-running ls() after you create foo.

Notice, however, that you did not see z when you typed ls(). This is because z is a “variable inside a function.” In other words, it has local scope. This means that z is only accessible inside the function z.score.
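
Here’s a tiny made-up example of the same idea (the names addOne and inside are hypothetical, chosen just for illustration):

addOne <- function(x){
  inside <- x + 1   # inside has local scope: it only exists while addOne is running
  return(inside)
}
addOne(2)
## [1] 3
exists("inside")    # FALSE, assuming you haven't created a global variable called inside yourself
## [1] FALSE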

Another example of a variable with local scope is the column of a dataframe, for example mpg in the dataframe mtcars. We can only access mpg by extracting it from mtcars, which has global scope, using $.

Don’t worry if you don’t immediately understand this. It’s a weird concept to grasp. You can think of variables with global scope like books on a shelf. You can see each one and pick it out. In contrast, variables with local scope are like chapters in a book: you don’t know what the chapter is called until you open a specific book and read it.

Missing Observations

Before looking at more examples of functions, we need to learn a little more about missing values. As we’ve seen, many of the built-in R functions have optional arguments that you can use to direct R to ignore missing observations, for example na.rm = TRUE. When you make your own functions from scratch, however, things are slightly trickier.

To handle this situation, we’ll use a function called is.na that tests for missing observations:

foo <- c(1, 3, NA, 3, 0, NA)
is.na(foo)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

As you can see from the output, is.na returns a vector of logical (aka Boolean) values. A value of TRUE indicates that the corresponding element of foo is missing. In this case, the 3rd and 6th elements are NAs. On its own, this isn’t very useful, but it turns out we can use vectors of logical values to select desired elements of a vector or dataframe. One way of selecting the first and second elements of foo is

foo[1:2]
## [1] 1 3

but here’s another:

foo[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]
## [1] 1 3

When you select using a logical vector, it must have the same length as the vector you’re selecting. Using this idea, we could select the missing observations of foo as follows:

foo[is.na(foo)]
## [1] NA NA

Now, what we’d really like to do is select all the observations that are not missing. It turns out that this is easy: the R symbol for NOT is the exclamation point (!) which is more commonly called “bang” by programmers. If you put this symbol in front of a vector of logical values, it turns TRUE into FALSE and FALSE into TRUE:

!is.na(foo)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

Hence, we can select all non-missing values of foo as follows:

foo[!is.na(foo)]
## [1] 1 3 3 0

and we can remove all missing values from foo by overwriting the original vector with the values selected in the previous command:

foo <- foo[!is.na(foo)]

Creating Your Own Mean Function

Let’s look at another example: making a function to compute the sample mean. R has a perfectly good function for calculating means, as we’ve seen, so this is purely for illustrative purposes. I’ll call this function mymean to distinguish it from the built-in R function:

mymean <- function(x){
  x <- x[!is.na(x)]
  x.bar <- sum(x) / length(x)
  return(x.bar)
}

Again, I’ll walk you through this one step-by-step:

  1. mymean <- function(x) tells R to create a function that takes a single argument x and store it under the name mymean
  2. The curly braces { and } enclose the function body.
  3. x <- x[!is.na(x)] overwrites x with a new vector containing all the elements of x that are not NAs. In other words, this removes missing observations.
  4. sum(x) / length(x) calculates the sample mean: summing up the observations and dividing by the number of observations. Notice that I did not use na.rm = TRUE in the sum command. If I had used it, everything still would have worked but since I already removed all the missing observations in the line above, this wasn’t necessary.

Let’s test out mymean on some data and make sure it works. First we’ll use R’s built-in mean function:

testData <- c(1, 3, 6, NA, 7, NA, NA, 8)
mean(testData, na.rm = TRUE)
## [1] 5

and then we’ll try mymean

mymean(testData)
## [1] 5

It works! Now I’m going to show you a slightly different version of mymean which I’ll call mymean2. This one will not give the correct answer:

mymean2 <- function(x){
  x.bar <- sum(x, na.rm = TRUE) / length(x)
  return(x.bar)
}
mymean2(testData)
## [1] 3.125

The answer is too small! What’s going on here? The problem has to do with how I handled missing observations. Rather than throwing them away, I told sum to ignore them. The problem is that length does not ignore them: it simply counts up the number of elements in a vector, regardless of whether they’re NAs. In effect we have divided by the wrong value of \(n\): the sum is being taken over all non-missing observations while the length is being computed over all observations.
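
You can see the mismatch directly using testData (this just unpacks the calculation above):

sum(testData, na.rm = TRUE)   # 1 + 3 + 6 + 7 + 8
## [1] 25
length(testData)              # counts the NAs as well
## [1] 8
25 / 8                        # what mymean2 computes
## [1] 3.125
25 / 5                        # dividing by the 5 non-missing observations gives the right answer
## [1] 5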

It’s worth comparing mymean2 to z.score, the first function I introduced above. In that example it was ok to use na.rm = TRUE to handle the missing observations since this argument was available for both the mean and sd functions, so we did not have to worry about any mismatches between the data used in the numerator and the denominator.

Creating Your Own Variance Function

One more example before I set you loose to make your own function. Here’s R code to create a function called myvar that calculates the sample variance. This function piggybacks on mymean from above. This is a good thing. We’ve already tested mymean and know that it works, so incorporating it into myvar is much better than starting from scratch. Again, note how we handle missing observations.

myvar <- function(x){
  x <- x[!is.na(x)]
  s.squared <- sum((x - mymean(x))^2) / (length(x) - 1)
  return(s.squared)
}

We can test this against var to make sure it works:

var(testData, na.rm = TRUE)
## [1] 8.5
myvar(testData)
## [1] 8.5

Looks good!

Exercise #2

It turns out that R doesn’t have a built-in function to calculate skewness. Your job is to write one. Remember the definition of skewness from lecture: \[\mbox{Skewness} = \frac{\frac{1}{n} \sum_{i = 1}^n(x_i - \bar{x})^3}{s^3}\] where \(s\) is the sample standard deviation. Call your function skew. It should take one argument, a vector x, and return the sample skewness of x. Make sure to handle missing values appropriately. Use the built-in R functions mean, sum, length and sd to construct your skewness function. Test your function out on testData.

It turns out that the “e1071” package you downloaded when setting up R does have a function for skewness (appropriately called skewness). Check that it lines up with what you got.

Returning a Dataframe

The examples we’ve seen so far have been very simple: functions that take one argument and return a single value. In fact, functions can take multiple arguments and return pretty much anything. First we’ll look at an example of a function that returns more than one value by packaging its results into a dataframe. This is a very flexible way of returning multiple values in a way that makes them easy to access.

summary.stats <- function(x){
  
  x <- x[!is.na(x)]
  sample.mean <- mean(x)
  std.dev  <- sd(x)
  
  out <- data.frame(sample.mean, std.dev)
  return(out)
}
results <- summary.stats(testData)
results
##   sample.mean  std.dev
## 1           5 2.915476
results$sample.mean
## [1] 5
results$std.dev
## [1] 2.915476

Multiple Arguments

Now we’ll look at a function that takes multiple arguments: mycov. This is essentially a stripped-down version of cov, the built-in R equivalent. Unlike cov which can accept two vectors as arguments or a whole dataframe, our version will only accept two vectors as arguments: x and y. Handling the missing observations is a bit more complicated in this situation: we need to drop any x for which there isn’t a corresponding y and any y for which there isn’t a corresponding x. This is because covariance is a measure of the relationship between x and y. Accordingly, it is calculated using pairs of observations rather than individual observations. First I’ll give you the code, then I’ll explain the part that’s different from what we did above:

mycov <- function(x, y){
  
  keep <- !is.na(x) & !is.na(y)
  x <- x[keep]
  y <- y[keep]
  
  n <- length(x)
  
  s.xy <- sum( (x - mean(x)) * (y - mean(y)) ) / (n - 1)
  return(s.xy)
  
}

Testing this,

cov(mtcars$mpg, mtcars$hp, use = "complete.obs")
## [1] -320.7321
mycov(mtcars$mpg, mtcars$hp)
## [1] -320.7321

we see that it works.

The tricky part of the above code was the definition of the logical vector keep which has TRUE for every element of x and y that we’ll use to calculate the covariance. The symbol & means AND, so the line !is.na(x) & !is.na(y) tells R to find all the indices (which correspond to pairs) for which neither x nor y is missing. This may be a little confusing so here’s a toy example:

x <- c(1, 2, NA, 3)
y <- c(5, NA, 6, 0)
!is.na(x)
## [1]  TRUE  TRUE FALSE  TRUE
!is.na(y)
## [1]  TRUE FALSE  TRUE  TRUE
keep <- !is.na(x) & !is.na(y)
keep
## [1]  TRUE FALSE FALSE  TRUE
x[keep]
## [1] 1 3
y[keep]
## [1] 5 0

Once we’ve defined keep, everything is similar to what we’ve already seen above. We simply select those elements of x and y that we want to keep and overwrite the original vectors x and y. Then we implement the formula for covariance.

Exercise #3

Write a linear regression function called myreg. It should take two arguments: vectors x and y which you may assume have the same length. It should return a dataframe with columns a for the intercept and b for the slope. Make sure to handle missing values appropriately. Feel free to use any built-in R functions you like except lm. Check your answer against the regression results for MPG and horsepower that we calculated above. When creating this function, remember that regression is not symmetric.

Stock Market Data

Now that we know how to write our own functions, let’s have a little fun playing the stock market. It’s easy to download and play with stock price data using the R packages Quandl and zoo. An R package is like an “expansion pack” that gives R new features. One of the great things about R is that there are literally thousands of packages that expand the usefulness of R.

Installing Quandl

First you’ll need to install Quandl. You only need to do this once: all subsequent times you can skip this part. In RStudio go to “Tools > Install Packages.” The first time you do this, you may be asked to choose a “Repository.” Don’t worry about what this is, you can choose any of them. The package installation will just be slightly quicker if you choose a repository that’s geographically close to Penn. (It’s really just a question of which server you instruct R to download the packages from.) Make sure that the checkbox “install dependencies” is checked and then enter Quandl in the text box labeled “Packages”. Lots of text will whiz past in the R console and once it finishes you’ll have access to the desired packages.

If you don’t like pointing and clicking, then just type the following into the console: install.packages("Quandl")

Again, you don’t need to repeat the preceding steps each time you want to use Quandl. You do, however, need to load the package every time you want to use it. (In other words, if you close RStudio you’ll need to reload Quandl.) Packages are loaded using the library command.

library(Quandl)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:data.table':
## 
##     last

Now we’re ready to get started.

Downloading Stock Prices

To start off I’ll download prices for Apple stock:

apple.prices <- Quandl('GOOG/NASDAQ_AAPL', start_date = '2012-01-01', 
                       end_date = '2012-12-31', type = 'raw')

Now I’ll walk you through the arguments of the function Quandl:

  • The first argument is the Quandl Code for the data you want. In this case, the code for Apple is GOOG/NASDAQ_AAPL. The GOOG indicates that the data source is Google Finance and NASDAQ indicates that we’re looking for a company that’s traded on the NASDAQ stock exchange. Finally, AAPL is the ticker symbol for Apple Computer.
  • start_date and end_date mean exactly what you think they do.
  • type = 'raw' tells Quandl to set up our data without any special formatting. Quandl has functionality to output data in various time series forms that can then be used with other R packages.

You can look up the codes for various datasets on the Quandl website. Here are a few others for you to play with:

  • Stock Prices for Google: GOOG/NASDAQ_GOOG
  • Stock Prices for Amazon: GOOG/NASDAQ_AMZN
  • Stock Prices for Netflix: GOOG/NASDAQ_NFLX
  • S&P 500 Index Prices: YAHOO/INDEX_GSPC

All of these series are at the daily frequency.

Plotting Daily Stock Prices for Apple

Now let’s take a look at what’s contained in apple.prices

head(apple.prices)
##         Date  Open  High   Low Close    Volume
## 1 2012-12-31 72.93 76.49 72.71 76.02 164872785
## 2 2012-12-28 72.90 73.50 72.59 72.80  88569243
## 3 2012-12-27 73.36 73.75 72.09 73.58 113779680
## 4 2012-12-26 74.14 74.21 73.02 73.29  75609030
## 5 2012-12-24 74.34 74.89 74.10 74.31  43936977
## 6 2012-12-21 73.21 74.24 72.89 74.19 149129554

As you can see, there’s quite a lot of information here: for each day we have the opening price, the high price, the low price and the closing price. Volume records how many shares were traded on that particular day.

We’re mainly going to be interested in the closing prices, which we can plot by themselves as follows:

applePrice <- plot_ly(data = apple.prices, x = Date, y = Close, type = "lines")
applePrice
plotly_POST(applePrice, filename = "applePrice", sharing = "public")
## No encoding supplied: defaulting to UTF-8.
## Success! Modified your plotly here -> https://plot.ly/~emallickhossain/18

From Prices to Returns

In financial economics, we’re usually more interested in stock returns than stock prices. Returns measure the percentage change in a stock’s price, so they tell us how much the value of our investment has grown (or declined). If \(p_t\) is today’s stock price and \(p_{t-1}\) is yesterday’s stock price, then \[r_t = \frac{p_t - p_{t-1}}{p_{t-1}}\] is today’s stock return. This is just the percentage change in price from yesterday to today expressed as a decimal. For example a return of \(0.5\) corresponds to a 50% increase in the stock price and a return of \(-0.05\) corresponds to a 5% decrease in the stock price.

Log Returns

As it happens, researchers in financial economics tend not to use the formula for geometric returns \(r_t\) given in the previous section. Instead they work with something called log returns which is more convenient mathematically and very similar to geometric returns provided that \(r_t\) isn’t too large. The precise approximation is: \(r_t \approx log(p_t) - log(p_{t-1})\) when \(r_t\) is close to zero. In other words, we can approximate geometric returns by taking the difference of the natural log of successive prices rather than computing the ratio that defines \(r_t\). The interpretation remains the same: a percentage change expressed as a decimal.
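
As a quick sanity check (the prices below are made up purely for illustration), compare the two for a 5% price move:

p.yesterday <- 100
p.today <- 105
(p.today - p.yesterday) / p.yesterday   # geometric return
## [1] 0.05
log(p.today) - log(p.yesterday)         # log return: very close when the return is small
## [1] 0.04879016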

If you want to know more about log returns, there’s an optional section at the end of this document explaining them in more detail.

Log Returns for Apple

Now we’ll construct daily log returns for Apple based on the closing prices. Remember: we need successive differences of the natural log of prices. The command for successive differences in R is diff, for example:

foo <- c(1, 2, 4, 7, 11)
diff(foo)
## [1] 1 2 3 4

since \(2 - 1 = 1\), \(4 - 2 = 2\), \(7 - 4 = 3\) and \(11 - 7 = 4\). We can combine this with the function log to compute log returns for Apple as follows:

apple.returns <- diff(log(apple.prices$Close))

Plotting these, we have

appleReturns <- plot_ly(y = apple.returns)
appleReturns
plotly_POST(appleReturns, filename = "appleReturns", sharing = "public")
## No encoding supplied: defaulting to UTF-8.
## Success! Modified your plotly here -> https://plot.ly/~emallickhossain/20

Notice how different everything looks now! While the prices show a clear trend, the returns do not share this trend. They’re much “spikier.” For those of you who know something about finance, this is often described by saying that prices follow a “random walk” while returns are “stationary.” If you want to learn more about this, you should take Frank Diebold’s undergrad course on forecasting.

Summary Statistics for Apple Returns

So what are Apple’s returns like? Could I make a fortune by owning this stock? The sample mean is a measure of average returns, which you can think of as the average “reward” for holding the stock

mean(apple.returns)
## [1] -0.001034961

While the sample standard deviation is a measure of the variability or volatility in returns

sd(apple.returns)
## [1] 0.01851288

which you can think of as the “risk” of holding the stock. All things equal, you’d prefer a stock with high average returns and a low standard deviation.

Note that the preceding values are percentages expressed as decimals. To convert them into ordinary percents, we multiply by 100

mean(apple.returns) * 100
## [1] -0.1034961
sd(apple.returns) * 100
## [1] 1.851288

So, the average daily (log) return for Apple is -0.1% with a standard deviation of 1.8%.

The nice thing about log returns is that to calculate cumulative returns over a period of time, you simply add up the daily log returns. For example, to get the cumulative log return for 2012 do the following:

sum(apple.returns)
## [1] -0.2577053

This means that if you’d started out holding $1 of Apple stock in January 2012, the value of your investment would have decreased to about $0.77 by the end of the year.
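
If you want the exact figure rather than the approximation, you can undo the log by exponentiating the cumulative log return (this just restates the sum computed above):

exp(sum(apple.returns))   # value of a $1 investment at year end, roughly 0.77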

Are Apple Returns Skewed?

A classic question in financial economics is whether stock returns are skewed. Let’s have a look at Apple:

appleHist <- plot_ly(x = apple.returns, type = "histogram") 
appleHist
plotly_POST(appleHist, filename = "appleHist", sharing = "public")
## No encoding supplied: defaulting to UTF-8.
## Success! Modified your plotly here -> https://plot.ly/~emallickhossain/22
mean(apple.returns)
## [1] -0.001034961
median(apple.returns)
## [1] -0.0003673994

Exercise #4

Are Apple Returns Skewed? Test out your function for calculating skewness on apple.returns and interpret your results.

Exercise #5

Download data for IBM (Quandl Code: GOOG/NYSE_IBM) and carry out the same analysis that we did for Apple Computer above. Use the same time period. Use your results to answer the following questions:

  1. Which has higher average returns over the period studied: Apple or IBM?
  2. Which are more “volatile” or “risky”: IBM’s returns or Apple’s?
  3. If you had invested a dollar in IBM at the start of 2012, how much would your investment have been worth at the end of the year?
  4. Are IBM returns skewed? How do they compare to Apple’s returns in this sense?

Exercise #6

In this exercise you’ll calculate the correlations between different sets of returns.

  1. Calculate the sample correlation between Apple returns and IBM returns. Interpret your result.
  2. Download price data for Bank of America stock (Quandl Code: GOOG/NYSE_BAC) over the same period used for Apple and IBM and construct daily log returns using closing prices.
  3. Calculate the correlation between Apple and Bank of America returns, and between Bank of America and IBM returns.
  4. Based on your answers to 1 and 3, which assets are most strongly correlated? Which are least strongly correlated? If you wanted to “hedge your bets”, which of these stocks would you choose to form a portfolio?

An Explanation of Log Returns (Optional)

You don’t have to read this if you don’t want to, and I’m not going to test you on it, but I suspect some of you may be curious about why log returns provide a good approximation of geometric returns and why summing log returns gives us cumulative returns. This section explains.

Recall the first-order Taylor Approximation from Calculus:

\[f(x) \approx f(x_0) + f'(x_0) (x - x_0)\]

In essence, this formula approximates an arbitrary differentiable function \(f\) at a point \(x\) by a linear function centered around a point of our own choosing, namely \(x_0\). This is a good approximation as long as \(x\) is close to \(x_0\). To understand why, recall the definition of a derivative:

\[f'(x_0) = \lim_{x \rightarrow x_0}\frac{f(x) - f(x_0)}{x- x_0}\]

So if \(x\) is “close enough” to \(x_0\),

\[f'(x_0) \approx \frac{f(x) - f(x_0)}{x- x_0}\]

Rearranging this expression gives the Taylor Approximation. Now, consider the function \(f(x) = log(1+x)\) where \(log\) is the natural logarithm. Taylor expanding this around \(x_0\) gives

\[log(1+x) \approx log(1+x_0) + \frac{1}{1+x_0} (x - x_0)\]

and the approximation is good when \(x\) is close to \(x_0\). We’re free to choose any \(x_0\) we want, so let’s try \(x_0 = 0\). Substituting,

\[log(1+x) \approx x\]

since \(log(1)=0\). In other words, provided that \(x\) is close to zero, \(log(1+x)\approx x\). This is the approximation we’ll use to motivate log returns. Now recall our definition of geometric returns from above:

\[r_t = \frac{p_{t}-p_{t-1}}{p_{t-1}}\]

This is equivalent to

\[r_t = \frac{p_t}{p_{t-1}}-1\]

Applying the approximation \(log(1+x) \approx x\) to \(r_t\) and using the properties of logarithms

\[r_t \approx log(1 + r_t) = log(1 + p_t/p_{t-1} - 1) = log(p_t/p_{t-1}) = log(p_t) - log(p_{t-1})\]

In other words, the difference of log prices is approximately equal to \(r_t\), provided that \(r_t\) is close to zero. This is true of daily stock returns.

Now, suppose we wanted to calculate cumulative returns over some period of time \(t =1, 2,..., T\). For example, suppose that we have daily stock returns and want to find out what the overall return was over a year. I asserted above that when we’re working with log returns, all we have to do is sum them. Here’s an explanation. Consider the sum of log returns:

\[S = \sum_{t=1}^T log(1 + r_t) = log(1+r_1) + log(1+r_2) +\dots + log(1+r_{T-1})+ log(1+ r_T)\]

Since \(log(1+r_t) = log(p_t) - log(p_{t-1})\), we have

\[S = log(p_1) - log(p_0) + log(p_2) - log(p_1) + \dots + log(p_{T-1}) - log(p_{T-2}) + log(p_T) - log(p_{T-1})\]

But notice how almost all of these terms cancel out! All that remains is:

\[S = log(p_T) - log(p_0)\]

which is just the difference of the log price at the end of the year and the log price at the beginning of the year! Now, using the properties of logarithms as above:

\[S = log(p_T/p_0) = log\left(1 + \frac{p_T - p_0}{p_0} \right)\]

Now we immediately recognize \((p_T - p_0)/p_0\): this is the cumulative geometric return over the whole year: the percentage change in price between date 0 and date \(T\). Call this quantity \(R_T\). We see that the sum of log returns, \(S\) is related to \(R_T\) as follows:

\[S = log(1 + R_T)\approx R_T\]

again using our approximation from above. Note that this approximation only works when \(R_T\) is relatively close to zero. Since it is a percentage expressed as a decimal, as long as the horizon over which we calculate cumulative returns isn’t too long this is a reasonable approximation. For example, a cumulative return of 20% over one year gives \(R_T = 0.2\) which is reasonably close to zero. In contrast a 400% return over 15 years corresponds to \(R_T = 4\) which is not close to zero. Hence, the trick of summing log returns to approximate cumulative geometric returns works well as long as we don’t look at time periods that are too long or stocks that grew (or declined) too fast.
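
You can check both of these cases directly in R, just plugging in the numbers from the previous paragraph:

log(1 + 0.2)   # reasonably close to 0.2, so the approximation is fine
## [1] 0.1823216
log(1 + 4)     # nowhere near 4, so the approximation breaks down
## [1] 1.609438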