Sample covariance between variables \(x=x_t\) and \(y=y_t\):
\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]
Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):
\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]
Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):
\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]
The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]
\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.
\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.
“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012
Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:
All have similar statistical properties.
set | mean of x | mean of y | variance of x | variance of y | correlation | intercept | slope |
---|---|---|---|---|---|---|---|
1 | 9 | 7.5 | 11 | 4.127 | 0.82 | 3 | 0.5 |
2 | 9 | 7.5 | 11 | 4.128 | 0.82 | 3 | 0.5 |
3 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
4 | 9 | 7.5 | 11 | 4.123 | 0.82 | 3 | 0.5 |
Lag for 20.07.2013 in Lausanne (example for illustrative purposes):
A relationship between radiation intensity and ozone concentration may be better described by examining correlations between the daily maximum values of the two variables.
A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:
\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \] |
\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \] |
“Correlation does not imply causation”:
For entertainment purposes, you may wish to visit the website of Spurious correlations.
library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots
Let us load data saved from Lesson 4.
df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")
Before computing correlations, we will first visualize the relationships using scatter plots (or x-y plots).
Let us calculate the daily maximum values for each variable:
lf <- gather(df, variable, value, all_of(variables))
daily.max <- lf %>%
group_by(site, year, month, day, season, variable) %>%
summarize(value=max(value, na.rm=TRUE)) %>%
spread(variable, value)
Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.
ggplot(daily.max)+
facet_grid(site~season)+
geom_point(aes(TEMP, O3))
The corresponding correlation values can be obtained with the built-in function, cor
(see ?cor). Note use of
use=“pairwise.complete.obs”` for calculation of correlation with missing values.
(cor.values <- daily.max %>% group_by(site, season) %>%
summarize(correlation=cor(TEMP, O3, use="pairwise.complete.obs")))
## # A tibble: 8 x 3
## # Groups: site [2]
## site season correlation
## <chr> <fct> <dbl>
## 1 LAU DJF 0.0760
## 2 LAU MAM 0.534
## 3 LAU JJA 0.802
## 4 LAU SON 0.527
## 5 ZUE DJF 0.180
## 6 ZUE MAM 0.630
## 7 ZUE JJA 0.843
## 8 ZUE SON 0.648
We can format this table and save to file:
(cor.values.wf <- spread(cor.values, site, correlation))
## # A tibble: 4 x 3
## season LAU ZUE
## <fct> <dbl> <dbl>
## 1 DJF 0.0760 0.180
## 2 MAM 0.534 0.630
## 3 JJA 0.802 0.843
## 4 SON 0.527 0.648
write.csv2(cor.values.wf, "correlations.csv", row.names=FALSE)
write.csv2
writes to a CSV (comma separated value) file using the European convention of defining delimiters as semicolons (;
) rather than commas (,
).
Or, we can visualize the values:
ggplot(cor.values)+
geom_col(aes(season, correlation))+
scale_y_continuous(limits=c(0,1))+
facet_grid(.~site)+
scale_y_continuous(expand=expansion(mult=c(0, 0.1)))
We can examine the correlation of other pollutants or meterological variables with O3.
gather(daily.max, variable, value, all_of(setdiff(variables, "O3"))) %>% # everything but ozone
ggplot+
facet_grid(site~variable, scale="free_x")+
geom_point(aes(value, O3, group=season, color=season), shape=4)
We will next look at a scatter plot between hourly measurements of CO and NO2. Why does this relationship exist?
ggplot(df)+
facet_grid(site~season)+
geom_point(aes(CO, NO2))
For the following scatterplot matrix, we will use a built-in library called lattice
which is a plotting system that exists in parallel to R’s base graphics and ggplot2
package (you can check out ggpairs
from the GGally
package for a ggplot2
-based solution). We additionally define a function to include correlation coefficients in our panels.
library(lattice)
CorrelationValue <- function(x, y, ...) {
correlation <- cor(x, y, use="pairwise.complete.obs")
if(is.finite(correlation)) {
cpl <- current.panel.limits()
panel.text(mean(cpl$xlim), mean(cpl$ylim),
bquote(italic(r)==.(sprintf("%.2f", correlation))),
adj=c(0.5,0.5), col="blue")
}
}
Plot daily maximum values as pairwise points (only Lausanne) using this syntax:
daily.max <- as.data.frame(daily.max)
ix <- "LAU" == daily.max[["site"]]
splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
upper.panel = CorrelationValue,
pch=4)
Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.
ix <- grepl("LAU", df[["site"]], fixed=TRUE)
splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
upper.panel = CorrelationValue,
panel = panel.smoothScatter,
pch=4, cex=.2, col="gray")
Notice some values of correlation went up.
We can define a function to generate lagged pairs for \(k\) intervals and apply it using the do()
function. If we provide hourly measurements, \(k=1\) corresponds to a lag of 1 hour, and so on.
Lag <- function(pair, k) {
out <- data.frame(lag=k, head(pair[,1],-k), tail(pair[,2],-k))
names(out)[2:3] <- colnames(pair)
out
}
lagged <- df %>%
group_by(site, season) %>%
do(rbind(Lag(.[,c("RAD","O3")], 1),
Lag(.[,c("RAD","O3")], 2),
Lag(.[,c("RAD","O3")], 3),
Lag(.[,c("RAD","O3")], 4),
Lag(.[,c("RAD","O3")], 5),
Lag(.[,c("RAD","O3")], 6)))
ggplot(lagged) +
geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
facet_grid(lag~season)
R includes a function, ccf
, for calculating the cross correlation. The value passed to the na.action=na.pass
says to compute covariances from the complete cases when there are missing values. Calling this function will generate a corresponding plot by default (which can be turned off). See ?ccf
for further details.
ix <- "LAU"==df[["site"]] & "JJA"==df[["season"]]
ccf(df[ix,"RAD"], df[ix,"O3"], lag.max=6, na.action=na.pass)
This tells us that the correlation is higest at a lag of -4 hours.
We can wrap this in a function that returns information that we want in a data frame (note plot=FALSE
argument):
LaggedCorrelation <- function(pair, ...) {
out <- ccf(pair[,1], pair[,2], ..., na.action=na.pass, plot=FALSE)
data.frame(lag=out[["lag"]], value=out[["acf"]])
}
lagged.values <- df %>% group_by(site, season) %>%
do(LaggedCorrelation(.[,c("RAD","O3")], lag.max=6))
Note the syntax, ...
in the function definition This passes on any additional arguments (lag.max
in this case) from the outer function (LaggedCorrelation
) to the inner-most function (ccf
). ?...
will refer you to the Introduction to R manual for further details on this syntax.
Plot:
ggplot(lagged.values)+
geom_segment(aes(x=lag,xend=lag,y=0,yend=value))+
facet_grid(site~season)+
xlab("lag (hours)")+
ylab("Cross correlation coefficient")