In this lesson, we will explore how to examine relationships among observations that are made in a sequence (time series). We will focus on characterizing additional patterns within a single time series.
Autocovariance for positive values of lag \(k\) is defined in a form similar to the cross-covariance: \[ c_{xx}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t - \overline{x})(x_{t+k} - \overline{x}) \] Similarly, autocorrelation is: \[ r_{xx}(k) = \frac{c_{xx}(k)}{c_{xx}(0)} \]
Autocorrelation coefficients and the correlogram (plot of \(r_{xx}(k)\) as a function of lag \(k\)) can provide insight into the underlying processes (Chatfield 2003).
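As a quick sanity check, the definitions above can be implemented directly and compared against R's built-in acf() function (a minimal sketch on simulated data; note that acf() uses the same 1/N divisor as the formula above):

```r
# Direct implementation of c_xx(k) as defined above
AutocovManual <- function(x, k) {
  N <- length(x)
  xbar <- mean(x)
  sum((head(x, N - k) - xbar) * (tail(x, N - k) - xbar)) / N
}

set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 200))

# r_xx(1) = c_xx(1) / c_xx(0), compared with the value from acf()
r1 <- AutocovManual(x, 1) / AutocovManual(x, 0)
r1.acf <- drop(acf(x, lag.max = 1, plot = FALSE)$acf)[2]
all.equal(r1, r1.acf)  # TRUE
```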
Example correlograms (notated as “\(r_k\)”) are shown below. The corresponding time series signals (“\(x_t\)”) are shown directly above for the first three figures. In the last figure (lower right quadrant), correlograms are shown for monthly observations of air temperature (top), and for the short-term correlations that remain after the seasonal contribution to the signal is removed (bottom).
Panels (left to right, top to bottom): short-term correlation; alternating time series; non-stationary time series; short-term correlation after removal of seasonal variation.
Such analyses can be used as a basis for constructing purely statistical models of time series (ARIMA models), primarily used for forecasting, but we can also isolate and describe relationships among sequential observations using autocorrelations.
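For example, a minimal ARIMA fit in R (a sketch using stats::arima() on a simulated AR(1) series, not on this lesson's data set) might look like:

```r
# Fit an AR(1) model to a simulated series; the estimated "ar1"
# coefficient should lie close to the true value of 0.7
set.seed(3)
x <- arima.sim(list(ar = 0.7), n = 500)
fit <- arima(x, order = c(1, 0, 0))
coef(fit)["ar1"]
```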
For instance, these representations can be included in discussions of:
We can project a time series (or an arbitrary function) onto a basis set consisting of harmonic functions of various frequencies, and determine which coefficients contribute most to the reproduction of the original signal.
Next, we will describe the Fourier series transformation of a sequential series of observations (time series) using notation adapted from Marr and Harley (2002).
The original time series \(x_t\) can be represented by a Fourier series: \[ x_t = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X(k) \exp\left(2\pi i \nu_k t\right) \] where \(N\) is the number of observations, and \(\nu_k = k/N\). The new series \(X(k)\), defined by the discrete Fourier transform, is given by: \[ X(k) = \frac{1}{\sqrt{N}} \sum_{t=0}^{N-1} x_t \exp\left(-2\pi i \nu_k t\right) \] for \(k=0,1,\ldots,N-1\). The periodogram or power spectrum is defined as: \[ P(\nu_k) = |X(k)|^2 \] and indicates the strength of frequencies in a time series.
The periodogram is a finite Fourier transform of the autocovariance \(\{c_{xx}(k)\}\) (Chatfield 2003).
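The periodogram defined above can be computed directly with R's fft() (a minimal sketch on a synthetic signal; note that fft() omits the \(1/\sqrt{N}\) normalization in the definition above, so we divide it out):

```r
# Periodogram from the DFT definitions above
Periodogram <- function(x) {
  N <- length(x)
  X <- fft(x) / sqrt(N)        # X(k) as defined above
  nu <- (seq_len(N) - 1) / N   # nu_k = k / N
  data.frame(freq = nu, power = Mod(X)^2)
}

set.seed(2)
x <- sin(2 * pi * (1:128) / 8) + rnorm(128, sd = 0.1)
pgram <- Periodogram(x)
# the strongest peak (excluding the zero-frequency term and the
# mirrored upper half) should sit at nu = 1/8
pgram$freq[which.max(pgram$power[2:64]) + 1]  # 0.125
```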
library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
df <- readRDS("data/2013/lau-zue.rds")
Select the ZUE site and convert to long format. When providing a character vector of column names, wrap it in all_of().
id.vars <- c("site", "datetime", "year", "month", "day", "hour", "season", "dayofwk", "daytype")
lf <- df %>%
filter(site=="ZUE") %>%
gather(variable, value, -all_of(id.vars))
We will first visualize the effect of imposing a lag on a variable. We can apply lags to each variable (each time series is segmented by season) with the following code:
SelfLag <- function(x, k) {
data.frame(lag=k, x0=head(x,-k), xk=tail(x,-k))
}
lagged <- lf %>%
group_by(site, season, variable) %>%
do(rbind(SelfLag(.[["value"]], 1),
SelfLag(.[["value"]], 3),
SelfLag(.[["value"]], 5),
SelfLag(.[["value"]], 7)))
We will plot the effect of the lag (shown across rows) on two meteorological variables: radiation and temperature.
ggplot(filter(lagged, variable=="RAD"))+
geom_abline(intercept=0, slope=1)+
geom_point(aes(x0, xk), shape=3, color="gray")+
facet_grid(lag~season)+
labs(x=expression(x(t[0])), y=expression(x(t[k])))
filter(lagged, variable=="TEMP") %>%
ggplot+
geom_abline(intercept=0, slope=1)+
geom_point(aes(x0, xk), shape=3, color="gray")+
facet_grid(lag~season)+
labs(x=expression(x(t[0])), y=expression(x(t[k])))
We can see that the autocorrelation in temperature persists for a longer period of time: the correlation between lagged values for TEMP deteriorates much more slowly with increasing lag than it does for RAD.
We will calculate the autocorrelation coefficient for up to a lag of 6 intervals (hours in our case).
Autocorrelation <- function(x, ...) {
with(acf(x, ..., na.action=na.pass, plot=FALSE),
data.frame(lag, acf))
}
autocorr <- lf %>% group_by(site, season, variable) %>%
do(Autocorrelation(.[["value"]], lag.max=6))
We can view the correlograms for all variables:
ggplot(autocorr)+
geom_hline(yintercept=0)+
geom_point(aes(lag, acf))+
geom_segment(aes(lag, 0, xend=lag, yend=acf))+
facet_grid(variable~season)+
labs(x="lag", y="acf")
R has a built-in function, spectrum, to calculate the power spectrum. For illustration, we will apply it to the hourly time series from the summer season (JJA). For further details, see ?spectrum.
ix <- "ZUE"==df[["site"]] & "JJA"==df[["season"]]
spec <- spectrum(df[ix,"CO"])
## Error in na.fail.default(as.ts(x)): missing values in object
We see that there is an error on account of missing values. It is common to assess the extent of missing values with a quantity known as “percent (or fraction) data recovery.” If the recovery is less than 75% (as a rule of thumb), you may want to flag any aggregated statistic calculated over this period (averages, periodicity, etc.) as unrepresentative or unreliable.
PercentRecovery <- function(x) {
length(na.omit(x))/length(x)*1e2
}
PercentRecovery(df[ix,"CO"])
## [1] 99.77355
In this case, our data recovery is acceptable and we assume that interpolating over a few points will not affect the analysis too much. There are additional R packages (e.g., VIM) for visualizing patterns in missing data that may affect the analysis.
Furthermore, there is a class of statistical techniques called “imputation” for replacing missing values. For our lesson, we will simply use linear interpolation. This is accomplished using the approx function built into R. For further details, see ?approx.
out <- approx(df[ix,"datetime"], df[ix,"CO"], df[ix,"datetime"])
str(out)
## List of 2
## $ x: 'chron' num [1:2208] (06/01/2013 00:00:00) (06/01/2013 01:00:00) (06/01/2013 02:00:00) (06/01/2013 03:00:00) (06/01/2013 04:00:00) ...
## ..- attr(*, "format")= Named chr [1:2] "m/d/y" "h:m:s"
## .. ..- attr(*, "names")= chr [1:2] "dates" "times"
## ..- attr(*, "origin")= Named num [1:3] 1 1 1970
## .. ..- attr(*, "names")= chr [1:3] "month" "day" "year"
## $ y: num [1:2208] 0.3 0.3 0.3 0.2 0.2 0.2 0.3 0.3 0.3 0.3 ...
The list element y contains the interpolated values, and now we can calculate the spectrum, which also provides us with the plot.
spec <- spectrum(out[["y"]])
We can mark the frequencies that correspond to periods of 4, 6, 8, … hours. This uses the base graphics plotting functionality in R (in contrast to ggplot, which we have been using for most other applications):
hrs <- c("4-hr"=4, "6-hr"=6, "8-hr"=8, "12-hr"=12, "daily"=24, "weekly"=24*7, "monthly"=24*30)
abline(v=1/hrs, col=seq(hrs)+1, lty=2)
legend("topright", names(hrs), col=seq(hrs)+1, lty=2, bg="white")
Next, we want to generalize over all seasons and variables. First, let us examine our data recovery:
data.frame(lf %>% group_by(site, season, variable) %>%
summarize(recovery=sprintf("%.0f%%",PercentRecovery(value))))
## site season variable recovery
## 1 ZUE DJF CO 100%
## 2 ZUE DJF EC 100%
## 3 ZUE DJF NMVOC 100%
## 4 ZUE DJF NO2 100%
## 5 ZUE DJF O3 100%
## 6 ZUE DJF PM10 100%
## 7 ZUE DJF PREC 98%
## 8 ZUE DJF RAD 100%
## 9 ZUE DJF SO2 100%
## 10 ZUE DJF TEMP 100%
## 11 ZUE MAM CO 100%
## 12 ZUE MAM EC 100%
## 13 ZUE MAM NMVOC 100%
## 14 ZUE MAM NO2 100%
## 15 ZUE MAM O3 100%
## 16 ZUE MAM PM10 99%
## 17 ZUE MAM PREC 100%
## 18 ZUE MAM RAD 100%
## 19 ZUE MAM SO2 100%
## 20 ZUE MAM TEMP 100%
## 21 ZUE JJA CO 100%
## 22 ZUE JJA EC 100%
## 23 ZUE JJA NMVOC 100%
## 24 ZUE JJA NO2 100%
## 25 ZUE JJA O3 100%
## 26 ZUE JJA PM10 95%
## 27 ZUE JJA PREC 100%
## 28 ZUE JJA RAD 100%
## 29 ZUE JJA SO2 100%
## 30 ZUE JJA TEMP 100%
## 31 ZUE SON CO 100%
## 32 ZUE SON EC 95%
## 33 ZUE SON NMVOC 100%
## 34 ZUE SON NO2 100%
## 35 ZUE SON O3 99%
## 36 ZUE SON PM10 99%
## 37 ZUE SON PREC 100%
## 38 ZUE SON RAD 100%
## 39 ZUE SON SO2 100%
## 40 ZUE SON TEMP 100%
"%.0f"
is a formatting specification for zero decimal places (%.0
) applicable to floating point numbers (f
), which are of mode/class "numeric"
in R.
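For example, applied to the CO recovery value computed earlier:

```r
# zero vs. two decimal places ("%%" prints a literal percent sign)
sprintf("%.0f%%", 99.77355)  # "100%"
sprintf("%.2f%%", 99.77355)  # "99.77%"
```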
We find again that simple interpolation will likely not affect our results on account of the small number of missing values.
Let us apply the spectrum function more generally:
Spectr <- function(x, y) {
out <- approx(x, y, xout=x)
spec <- spectrum(out[["y"]], plot=FALSE)
data.frame(spec[c("freq", "spec")])
}
spec <- lf %>% group_by(site, season, variable) %>%
do(Spectr(.[["datetime"]],.[["value"]]))
We can similarly plot the power spectra, marking the reference periods defined earlier:
period <- data.frame(label=factor(names(hrs), names(hrs)), freq=1/hrs)
ggplot(spec)+
geom_vline(aes(xintercept=freq, color=label), data=period, linetype=2)+
geom_line(aes(freq, spec))+
facet_grid(variable~season, scale="free_y")+
scale_y_log10()
We can also calculate daily averages and apply the spectral decomposition:
daily <- lf %>%
mutate(date=dates(datetime)) %>%
group_by(site, variable, season, date) %>%
summarize(value=mean(value, na.rm=TRUE))
spec.daily <- daily %>% group_by(site, season, variable) %>%
do(Spectr(.[["date"]],.[["value"]]))
period.daily <- data.frame(label=c("weekly", "monthly"),
freq =1/c(7, 30))
Plot results for daily power spectra:
ggplot(spec.daily)+
geom_vline(aes(xintercept=freq, color=label), data=period.daily, linetype=2)+
geom_line(aes(freq, spec))+
facet_grid(variable~season, scale="free_y")
Chatfield, C. 2003. The Analysis of Time Series: An Introduction, Sixth Edition. Chapman & Hall/Crc Texts in Statistical Science. CRC Press.
Marr, Linsey C., and Robert A. Harley. 2002. “Spectral Analysis of Weekday–Weekend Differences in Ambient Ozone, Nitrogen Oxide, and Non-Methane Hydrocarbon Time Series in California.” Atmospheric Environment 36 (14): 2327–35. https://doi.org/10.1016/S1352-2310(02)00188-7.