## (C) (cc by-sa) Wouter van Atteveldt & Jan Kleinnijenhuis, file generated June 24, 2015

Note on the data used in this howto: This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, but the excel format is a bit difficult to parse as it is meant to be human readable, with multiple header rows etc. For that reason, I’ve extracted csv files for some interesting tables and uploaded them to http://vanatteveldt.com/uploads/rcourse/data

Time Series Analysis with R

*Caveat*: I am not an expert on time series modelling. Please take this document as a source of inspiration rather than as a definitive set of answers.

Time series analysis consists of statistical procedures that take into account that observations are sequential and often serially autocorrelated. Typical questions for time series analysis are whether a change in one variable affects another variable at a later date, and, in reciprocal series, which effect is stronger (Granger causality).

In R, there is an object type ts that can be used to represent time series.

income = read.csv("data/income_toppercentile.csv")
income = ts(income[-1], start=income[1,1], frequency=1)
class(income)
## [1] "mts"    "ts"     "matrix"

As you can see from the class output, a time series is really a specialised type of matrix. One of the nice things is that the default plot for ts objects gives an overview of the different time series over time:

plot(income, main="Income Inequality (percentage of income earned by top 1%)")

Univariate time series analysis

Let’s first consider the univariate case. We focus on the US series since it is a long series without missing values.

us = na.omit(income[, "US"])
plot(us)

We could easily fit a “naive” time series model by just running a linear model with a lagged variable, e.g. to model the autocorrelation:

summary(lm(us ~ lag(us)))
## Warning in summary.lm(lm(us ~ lag(us))): essentially perfect fit: summary
## may be unreliable
## 
## Call:
## lm(formula = us ~ lag(us))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.191e-16 -7.700e-20  9.190e-19  2.640e-18  1.097e-17 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 6.628e-17  4.644e-18 1.427e+01   <2e-16 ***
## lag(us)     1.000e+00  3.069e-17 3.259e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.062e+33 on 1 and 99 DF,  p-value: < 2.2e-16
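
Note the warning about an “essentially perfect fit”: inside lm the ts attributes are dropped, so lag(us) lines up with us itself and we end up regressing the series on its own values (hence the slope of exactly 1). A minimal sketch of the intended naive regression, constructing the lag by hand, would look like this:

us_vec = as.numeric(us)
lagged = data.frame(y = us_vec[-1], y_lag = us_vec[-length(us_vec)])  # shift by one year
summary(lm(y ~ y_lag, data = lagged))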

However, we generally want to use the more specialized methods such as VAR and ARIMA.

Transforming time series

Before going on with the analysis, we need to know whether the time series is /stationary/, i.e. whether its properties such as mean and variance are constant over time. Nonstationarity means that a time series shows an upward or downward trend for many time points; the series does not revert to it mean (almost) immediately. Nonstationary time series easily show either a positive or a negative spurious correlation, since it’s unlikely that two series that show an upward or downward trend for a long sequence of time points will not show such a correlation.
Since a non-stationary time series gives estimation problems it should be transformed first, generally with a combination of differencing, log transformation or Box-Cox transformation.
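
As an illustrative sketch: the forecast package (loaded further down for the ARIMA models) provides functions to estimate and apply a Box-Cox transformation; the call below is just an example of how that could look:

library(forecast)
lambda = BoxCox.lambda(us)   # estimate a suitable transformation parameter
us_bc = BoxCox(us, lambda)   # apply the Box-Cox transformation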

The first test for stationarity is just looking at the graph. It looks like the series is much more variable initially and at the end, which could be a problem. Moreover, there is an obvious U-shape which means that the mean is not constant.

Formal tests for stationarity include the KPSS test and the Augmented Dickey-Fuller (ADF) test.

library(tseries)
kpss.test(us)
## Warning in kpss.test(us): p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  us
## KPSS Level = 0.75464, Truncation lag parameter = 2, p-value = 0.01
adf.test(us)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  us
## Dickey-Fuller = -0.92247, Lag order = 4, p-value = 0.9465
## alternative hypothesis: stationary

Take care with interpreting these: the null hypothesis for KPSS is stationarity (rejected in this case), while the null hypothesis for ADF is a unit root, i.e. non-stationarity (not rejected).

A final diagnostic tool is looking at the autocorrelation (ACF) and partial autocorrelation (PACF) graphs. For a stationary series, both the ACF and the PACF should drop to zero relatively quickly.

par(mfrow=c(3,1))
plot(us)
acf(us)
pacf(us)

Thus, the ACF plot also points to a non-stationary series. As a first step, let’s difference the series, i.e. instead of analysing income inequality we will analyse change in income inequality.

The reasoning for differencing is as follows: since it would be highly coincidental that two series whose levels are spuriously correlated also show a correlation between their changes, it is often a good idea to difference the time series. A famous example is the spurious correlation between births and storks: in the twentieth century gradually fewer babies were born in Western countries as the number of storks gradually decreased. If this were not a spurious correlation, one would also expect a correlation of the differences. Thus, one would expect that the birth cohort immediately after the Second World War was preceded by an increase in storks a year before, and that the decrease in births immediately after the introduction of the contraceptive pill was preceded by a mass extinction of storks.

We can difference the series using the diff command:

par(mfrow=c(3,1))
plot(diff(us))
acf(diff(us))
pacf(diff(us))

This looks a lot more stationary, but variance still seems bigger at the extremes of the series.

kpss.test(diff(us))
## Warning in kpss.test(diff(us)): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  diff(us)
## KPSS Level = 0.18628, Truncation lag parameter = 2, p-value = 0.1
adf.test(diff(us))
## Warning in adf.test(diff(us)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(us)
## Dickey-Fuller = -6.1894, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

So that is also looking better. However, we should probably still do a log transformation to normalize the variance:

par(mfrow=c(3,1))
plot(diff(log(us)))
acf(diff(log(us)))
pacf(diff(log(us)))

Now the variance looks about right for the whole series, and there are no significant ACF or PACF spikes.

One should realize that these transformations may either facilitate or complicate the interpretation of research results. A linear relationship between log-transformed variables, for example, is often understood in economics and finance as a relationship between percentage changes, thus in terms of elasticities: the slope tells us by how many percent y changes when x changes by one percent. Theory comes first, which means that if the transformations have no proper substantive interpretation, a regression equation based on non-transformed variables may be preferable, although this increases the risk of spurious correlations (the stork-baby correlation). We will come back to the “error correction model” as another way to get rid of trends in the data at the end of this document.

ARIMA modeling

ARIMA (AutoRegressive Integrated Moving Average) models describe a time series with a mixture of autoregressive, integrative, and moving average components. One can base the components needed on the (P)ACF plots as presented above. Very briefly put: a slowly decaying ACF, such as in the plot of the untransformed series, points to an integrative component (differencing). After eliminating the integrative components we check the remaining (P)ACF plots: a PACF that cuts off after a few lags while the ACF decays gradually suggests an AR component, whereas an ACF that cuts off while the PACF decays gradually suggests an MA component.

In our case, since the transformed plots had no significant spikes in either the ACF or the PACF plots, we don’t need an AR or MA component. Thus, we can fit the ARIMA model, and inspect the residuals with the tsdiag function:

library(forecast)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 6.1
m = arima(diff(log(us)), c(0,0,0))
summary(m)
## 
## Call:
## arima(x = diff(log(us)), order = c(0, 0, 0))
## 
## Coefficients:
##       intercept
##          0.0011
## s.e.     0.0086
## 
## sigma^2 estimated as 0.007349:  log likelihood = 103.77,  aic = -203.53
## 
## Training set error measures:
##                       ME       RMSE        MAE  MPE MAPE      MASE
## Training set 7.48235e-19 0.08572538 0.06778016 -Inf  Inf 0.7454546
##                     ACF1
## Training set -0.02815696
tsdiag(m)

What this in fact means is that the best estimate of next year’s inequality is slightly higher than this year’s, but other than this very slight trend there are no regularities in the time series.
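
As a back-of-the-envelope sketch of what that intercept implies (illustrative only): the model says the expected change in log inequality is 0.0011 per year, so next year’s level is roughly this year’s level times exp(0.0011), i.e. about 0.1% higher:

# implied one-step-ahead level forecast from the estimated drift in log differences
tail(us, 1) * exp(coef(m)["intercept"])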

We can also use the auto.arima and ets functions that automatically fit a time series, and we can use forecast to project the regularities into the future. In this case, of course, that is not very interesting, as there were no regularities in the time series.

m = auto.arima(us)
summary(m)
## Series: us 
## ARIMA(0,1,0) with drift         
## 
## Coefficients:
##        drift
##       0.0002
## s.e.  0.0014
## 
## sigma^2 estimated as 0.0001954:  log likelihood=282.78
## AIC=-561.55   AICc=-561.43   BIC=-556.34
## 
## Training set error measures:
##                        ME       RMSE        MAE        MPE     MAPE
## Training set 1.760395e-06 0.01383926 0.01027107 -0.4076896 6.765002
##                   MASE       ACF1
## Training set 0.9895055 0.03575944
forecast(m, 4)
##      Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## 2011         0.1982 0.1802860 0.2161140 0.1708030 0.2255970
## 2012         0.1984 0.1730658 0.2237342 0.1596547 0.2371453
## 2013         0.1986 0.1675721 0.2296279 0.1511469 0.2460531
## 2014         0.1988 0.1629721 0.2346279 0.1440059 0.2535941

Note that auto.arima selects a simple ARIMA(0,1,0) model; the drift term it retains is essentially zero (0.0002, well within its standard error), so the slight trend plays hardly any role, presumably because the gain in fit would be too small.
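
For a quick visual check, the forecast object can also be plotted directly (a one-line sketch using the plot method that comes with the forecast package):

plot(forecast(m, 4), main="US income inequality: 4-year forecast")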

VAR models for multivariate time series

VAR (Vector Auto Regression) models can be used to model effects between different time series. Let’s take the time series of France, US, and Canada and see whether change in income inequality in one country predicts change in the other countries.

income = na.omit(income[, c("US", "France", "Canada")])
plot(income, main="Income Inequality")

library(vars)
## Loading required package: MASS
## Loading required package: strucchange
## Loading required package: sandwich
## Loading required package: urca
## Loading required package: lmtest
m = VAR(income)
summary(m)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: US, France, Canada 
## Deterministic variables: const 
## Sample size: 90 
## Log Likelihood: 933.934 
## Roots of the characteristic polynomial:
## 0.9687 0.9338 0.7919
## Call:
## VAR(y = income)
## 
## 
## Estimation results for equation US: 
## =================================== 
## US = US.l1 + France.l1 + Canada.l1 + const 
## 
##           Estimate Std. Error t value Pr(>|t|)    
## US.l1     0.936915   0.054215  17.282   <2e-16 ***
## France.l1 0.007480   0.077948   0.096    0.924    
## Canada.l1 0.017084   0.110300   0.155    0.877    
## const     0.006785   0.006291   1.079    0.284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.01437 on 86 degrees of freedom
## Multiple R-Squared: 0.8813,  Adjusted R-squared: 0.8772 
## F-statistic: 212.9 on 3 and 86 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation France: 
## ======================================= 
## France = US.l1 + France.l1 + Canada.l1 + const 
## 
##            Estimate Std. Error t value Pr(>|t|)    
## US.l1     -0.007523   0.017793  -0.423    0.673    
## France.l1  0.932756   0.025582  36.461   <2e-16 ***
## Canada.l1  0.046958   0.036200   1.297    0.198    
## const      0.001659   0.002065   0.803    0.424    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.004716 on 86 degrees of freedom
## Multiple R-Squared: 0.9801,  Adjusted R-squared: 0.9794 
## F-statistic:  1414 on 3 and 86 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation Canada: 
## ======================================= 
## Canada = US.l1 + France.l1 + Canada.l1 + const 
## 
##           Estimate Std. Error t value Pr(>|t|)    
## US.l1     0.071613   0.027766   2.579   0.0116 *  
## France.l1 0.075741   0.039921   1.897   0.0611 .  
## Canada.l1 0.824773   0.056490  14.600   <2e-16 ***
## const     0.001747   0.003222   0.542   0.5890    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.007359 on 86 degrees of freedom
## Multiple R-Squared: 0.9405,  Adjusted R-squared: 0.9384 
## F-statistic: 453.1 on 3 and 86 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##               US    France    Canada
## US     2.065e-04 1.694e-05 2.129e-05
## France 1.694e-05 2.224e-05 3.439e-06
## Canada 2.129e-05 3.439e-06 5.415e-05
## 
## Correlation matrix of residuals:
##            US  France  Canada
## US     1.0000 0.25006 0.20136
## France 0.2501 1.00000 0.09909
## Canada 0.2014 0.09909 1.00000
plot(m)

All countries show a strong autocorrelation, and only for Canada a second effect is found: a high level of income inequality in the US predicts a high level of inequality in Canada the next year, controlling for the autocorrelation. We can do a VAR analysis on the differenced time series to get rid of the strong autocorrelations (and essentially predict how changes in one country respond to changes in another country):

library(vars)
m = VAR(diff(income))
summary(m)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: US, France, Canada 
## Deterministic variables: const 
## Sample size: 89 
## Log Likelihood: 931.976 
## Roots of the characteristic polynomial:
## 0.373 0.1518 0.1518
## Call:
## VAR(y = diff(income))
## 
## 
## Estimation results for equation US: 
## =================================== 
## US = US.l1 + France.l1 + Canada.l1 + const 
## 
##            Estimate Std. Error t value Pr(>|t|)  
## US.l1     0.0076587  0.1111608   0.069   0.9452  
## France.l1 0.2457523  0.3219072   0.763   0.4473  
## Canada.l1 0.3400251  0.2007215   1.694   0.0939 .
## const     0.0008144  0.0015564   0.523   0.6022  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.01428 on 85 degrees of freedom
## Multiple R-Squared: 0.04298, Adjusted R-squared: 0.009202 
## F-statistic: 1.272 on 3 and 85 DF,  p-value: 0.2891 
## 
## 
## Estimation results for equation France: 
## ======================================= 
## France = US.l1 + France.l1 + Canada.l1 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)  
## US.l1      0.0738163  0.0355239   2.078   0.0407 *
## France.l1  0.2034187  0.1028726   1.977   0.0512 .
## Canada.l1  0.0897387  0.0641450   1.399   0.1655  
## const     -0.0007418  0.0004974  -1.491   0.1396  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.004564 on 85 degrees of freedom
## Multiple R-Squared: 0.1437,  Adjusted R-squared: 0.1135 
## F-statistic: 4.756 on 3 and 85 DF,  p-value: 0.004097 
## 
## 
## Estimation results for equation Canada: 
## ======================================= 
## Canada = US.l1 + France.l1 + Canada.l1 + const 
## 
##             Estimate Std. Error t value Pr(>|t|)  
## US.l1     -0.0352316  0.0532511  -0.662   0.5100  
## France.l1  0.3061549  0.1542083   1.985   0.0503 .
## Canada.l1  0.0326685  0.0961548   0.340   0.7349  
## const     -0.0002445  0.0007456  -0.328   0.7438  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.006842 on 85 degrees of freedom
## Multiple R-Squared: 0.0462,  Adjusted R-squared: 0.01254 
## F-statistic: 1.372 on 3 and 85 DF,  p-value: 0.2567 
## 
## 
## 
## Covariance matrix of residuals:
##               US    France    Canada
## US     2.040e-04 1.367e-05 1.448e-05
## France 1.367e-05 2.083e-05 3.246e-06
## Canada 1.448e-05 3.246e-06 4.681e-05
## 
## Correlation matrix of residuals:
##            US France Canada
## US     1.0000 0.2098 0.1482
## France 0.2098 1.0000 0.1039
## Canada 0.1482 0.1039 1.0000

In the case of reciprocal relations it can be useful to use a Granger causality test. Let’s use this test to see whether the US series Granger-causes the series of the other countries:

m = VAR(income)
causality(m, cause="US")
## $Granger
## 
##  Granger causality H0: US do not Granger-cause France Canada
## 
## data:  VAR object m
## F-Test = 3.5584, df1 = 2, df2 = 258, p-value = 0.02989
## 
## 
## $Instant
## 
##  H0: No instantaneous causality between: US and France Canada
## 
## data:  VAR object m
## Chi-squared = 7.7348, df = 2, p-value = 0.02091

So, the null hypothesis that there is no Granger or instantaneous causality between the US and the other countries is (presumably erroneously) rejected. Another way to deal with common trends in such series is through cointegration and vector error correction models, which we turn to next.

Error correction models

Another way to look at the data for the US, France and Canada is that, in spite of their differences, they show a common trend, called “cointegration”: upward until somewhere in the roaring twenties, downward until the eighties, upward thereafter, with also some common ripples, e.g. the short recovery in the late thirties and the common fall around 2007. The idea of “vector error correction models” (VECM) is that the nonstationarity in each country is caused by the long-term cointegration of economic trends in these countries - due to a common world market. The procedure to test a VECM is to (1) test whether cointegration is indeed present (e.g. with the Johansen test) and (2) test a model in which current changes do not simply depend on previous changes, but also on previous levels. A nickname of the VECM is the “drunk lady and her dog” model: they may drift apart for an unexpectedly long time, but after a while they will walk together once more.
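
As a sketch of step (1), the Johansen test is available as ca.jo in the urca package (already loaded above as a dependency of vars); the lag order K=2 and the inclusion of a constant are just illustrative choices:

library(urca)
# Johansen trace test for cointegration among the three series
jo = ca.jo(income, type="trace", ecdet="const", K=2)
summary(jo)
# if the test suggests r cointegrating relations, the corresponding VECM can be
# estimated with cajorls(jo, r=1), or converted to a level VAR with vars::vec2var(jo, r=1)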

Non-constant variance

In statistical tests for small n one often has to assume a constant variance. Both in economics and in the social sciences, however, changes in volatility (variance) are often more important. GARCH (Generalized AutoRegressive Conditional Heteroskedasticity) models deal not only with changes in means, but also with changes in volatility. Let’s give an example in which changes in volatility are extremely important. Understanding the gambler’s paradox (gambler’s ruin) essentially comes down to understanding that precisely with a fair, memoryless toss (equal chances of winning +p or losing -p at each successive toss) and a starting capital C, the probability that one will eventually go bankrupt approaches certainty if the number of tosses is extended forever, however large C and however small p, since the variance (of a random walk) keeps increasing. Whether the risk of going bankrupt is still acceptable depends critically on the volatility, thus (simplified) on the question whether p increases or decreases in a subsequence of tosses.
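
As a minimal, purely illustrative sketch: the tseries package loaded earlier contains a garch function, so a GARCH(1,1) model for the changes in log US inequality could be fitted roughly as follows (whether this particular series is long and volatile enough for a meaningful GARCH fit is another matter):

# fit a GARCH(1,1) model to the changes in log income inequality (illustrative)
g = garch(diff(log(us)), order=c(1,1))
summary(g)  # coefficient tests and residual diagnostics (Jarque-Bera, Box-Ljung)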