## (C) (cc by-sa) Wouter van Atteveldt & Jan Kleinnijenhuis, file generated June 24 2015

Note on the data used in this howto: This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, but the excel format is a bit difficult to parse as it is meant to be human readable, with multiple header rows etc. For that reason, I’ve extracted csv files for some interesting tables and uploaded them to http://vanatteveldt.com/uploads/rcourse/data

Time Series Analysis with R

*Caveat*: I am not an expert on time series modelling. Please take this document as a source of inspiration rather than as a definitive set of answers.

Time series analysis consists of statistical procedures that take into account that observations are sequential and often serially autocorrelated. Typical questions for time series analysis are whether change in one variable affects another variable at a later date, and, in reciprocal series, which effect is stronger (Granger causality).

In R, there is an object type ts that can be used to represent time series.

income = read.csv("data/income_toppercentile.csv")
# the first column contains the year: use it as the start of the series and drop it from the data
income = ts(income[-1], start=income[1,1], frequency=1)
class(income)
## [1] "mts"    "ts"     "matrix"

As you can see from the class output, a time series is really a specialised type of matrix. One of the nice things is that the default plot for ts objects gives an overview of the different time series over time:

plot(income, main="Income Inequality (percentage of income earned by top 1%)")

Univariate time series analysis

Let’s first consider the univariate case. We focus on the US series since it is a long series without missing values.

us = na.omit(income[, "US"])
plot(us)

We could easily do a “naive” time series analysis by just running a linear model with a lagged variable, e.g. to model the autocorrelation:

summary(lm(us ~ lag(us)))
## Warning in summary.lm(lm(us ~ lag(us))): essentially perfect fit: summary
## may be unreliable
## 
## Call:
## lm(formula = us ~ lag(us))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.191e-16 -7.700e-20  9.190e-19  2.640e-18  1.097e-17 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 6.628e-17  4.644e-18 1.427e+01   <2e-16 ***
## lag(us)     1.000e+00  3.069e-17 3.259e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.062e+33 on 1 and 99 DF,  p-value: < 2.2e-16
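
Note the warning about an essentially perfect fit: lm() ignores the time shift that lag() applies to a ts object, so the model above effectively regresses the series on itself. A minimal sketch of a properly aligned lag-1 regression (using ts.intersect to pair each observation with the previous year's value; the column name lagus is my own) would be:

d = ts.intersect(us, lagus = stats::lag(us, -1), dframe = TRUE)
summary(lm(us ~ lagus, data = d))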

However, we generally want to use the more specialized methods such as VAR and ARIMA.

Transforming time series

Before going on with the analysis, we need to know whether the time series is *stationary*, i.e. whether its properties such as mean and variance are constant over time. Nonstationarity means that a time series shows an upward or downward trend over many time points; the series does not revert to its mean (almost) immediately. Nonstationary time series easily show spurious positive or negative correlations, since two series that each trend upward or downward over a long sequence of time points are very likely to be correlated even if they are unrelated.
Since a non-stationary time series causes estimation problems, it should be transformed first, generally with a combination of differencing, log transformation, or Box-Cox transformation.
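
Differencing and the log transformation are demonstrated below; the Box-Cox transformation is not used further in this document, but a minimal sketch (assuming the forecast package, which is loaded later on, and letting BoxCox.lambda estimate the transformation parameter) would be:

library(forecast)
lambda = BoxCox.lambda(us)  # estimate the transformation parameter from the data
us_bc = BoxCox(us, lambda)  # apply the Box-Cox transformation
plot(us_bc)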

The first test for stationarity is just looking at the graph. It looks like the series is much more variable initially and at the end, which could be a problem. Moreover, there is an obvious U-shape which means that the mean is not constant.

Formal tests for stationarity include the KPSS test and the Augmented Dickey-Fuller (ADF) test.

library(tseries)
kpss.test(us)
## Warning in kpss.test(us): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  us
## KPSS Level = 0.75464, Truncation lag parameter = 2, p-value = 0.01
adf.test(us)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  us
## Dickey-Fuller = -0.92247, Lag order = 4, p-value = 0.9465
## alternative hypothesis: stationary

Take care with interpreting these: the null hypothesis for KPSS is stationarity (rejected in this case), while the null hypothesis for ADF is a unit root, i.e. non-stationarity (not rejected).
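
Both functions return standard htest objects, so if you want to apply the tests to many series you can extract the p-values programmatically (a small sketch):

kpss.test(us)$p.value  # H0: stationarity; a small p-value rejects stationarity
adf.test(us)$p.value   # H0: unit root (non-stationarity); a small p-value rejects non-stationarity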

A final diagnostic tool is looking at the auto-correlation graph and the partial auto-correlation graph. For a stationary series, both the ACF and the PACF should drop to zero relatively quickly.

par(mfrow=c(3,1))
plot(us)
acf(us)
pacf(us)

Thus, the ACF plot also points to a non-stationary series. As a first step, let’s difference the series, i.e. instead of analysing income inequality we will analyse the change in income inequality.

The reasoning for differencing is as follows: two series that show a spurious correlation when we compare their levels are unlikely to also show a correlation when we compare their changes, so it is often a good idea to difference the time series. A famous example is the spurious correlation between births and storks: over the twentieth century gradually fewer babies were born in Western countries while the number of storks also gradually decreased. If this were not a spurious correlation, one would also expect a correlation of the differences. Thus, one would expect that the birth cohort immediately after the Second World War was preceded by an increase of storks a year before, and that the decrease in births immediately after the introduction of the contraceptive pill was preceded by a mass extinction of storks.

We can difference the series using the diff command:

par(mfrow=c(3,1))
plot(diff(us))
acf(diff(us))
pacf(diff(us))

This looks a lot more stationary, but variance still seems bigger at the extremes of the series.

kpss.test(diff(us))
## Warning in kpss.test(diff(us)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(us)
## KPSS Level = 0.18628, Truncation lag parameter = 2, p-value = 0.1
adf.test(diff(us))
## Warning in adf.test(diff(us)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(us)
## Dickey-Fuller = -6.1894, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

So that is also looking better. However, we should probably still do a log transformation to normalize the variance:

par(mfrow=c(3,1))
plot(diff(log(us)))
acf(diff(log(us)))
pacf(diff(log(us)))

Now, the variance looks about right for the whole series, and there are no significant spikes in the ACF or PACF.

One should realize that these transformations may either facilitate or complicate the interpretation of research results. A linear relationship between log-transformed variables, for example, can often be understood in economics and in finance as a relationship between percentage changes, thus in terms of elasticities. Theory comes first, which means that without a proper interpretation of the transformations, a regression equation based on non-transformed variables may be preferable, although this increases the risk of spurious correlations (the stork-baby correlation). We will come back to the “error correction model” as another way to get rid of trends in the data at the end of this document.
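
To see why log differences can be read as approximate percentage changes, here is a small numeric illustration (the numbers are made up):

x = c(100, 110, 99)
diff(log(x))           # log differences: approximately 0.095 and -0.105
diff(x) / head(x, -1)  # percentage changes: 0.10 and -0.10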

ARIMA modeling

ARIMA (AutoRegressive Integrated Moving Average) models describe a time series with a mixture of autoregressive, integrative, and moving average components. One can base the components needed on the (P)ACF plots as presented above. Very briefly put: a slowly decaying ACF, such as in the plot of the untransformed series, points to an integrative component. After eliminating integrative components by differencing, we check the remaining (P)ACF plots: a gradually decaying ACF with spikes in the PACF suggests an AR component, while a gradually decaying PACF with spikes in the ACF suggests an MA component.

In our case, since the transformed plots had no significant spikes in either the ACF or the PACF plots, we don’t need an AR or MA component. Thus, we can fit the ARIMA model, and inspect the residuals with the tsdiag function:

library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 6.1
m = arima(diff(log(us)), c(0,0,0))
summary(m)
## 
## Call:
## arima(x = diff(log(us)), order = c(0, 0, 0))
## 
## Coefficients:
##       intercept
##          0.0011
## s.e.     0.0086
## 
## sigma^2 estimated as 0.007349:  log likelihood = 103.77,  aic = -203.53
## 
## Training set error measures:
##                       ME       RMSE        MAE  MPE MAPE      MASE
## Training set 7.48235e-19 0.08572538 0.06778016 -Inf  Inf 0.7454546
##                     ACF1
## Training set -0.02815696
tsdiag(m)
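
As a cross-check of the manually chosen order, the forecast package also provides auto.arima, which selects the differencing, AR, and MA orders automatically based on unit root tests and AICc (a sketch; here it is applied to the logged series so it can choose the differencing itself):

auto.arima(log(us))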