# Background

- If \(y_t = f(x_t)\), a relationship between \(x\) and \(y\) may be observed. The strength of a linear relationship is often characterized by what is known as a correlation coefficient.
- However, when the dependency is time-dependent, \(y_{t+k} = f(x_t)\), we may wish to examine the correlation lagged by \(k\) intervals (\(k\)=hours in our case).
- Note that people commonly refer to any relationship as correlation, but the commonly used correlation coefficients in statistics measure the degree of a *linear* relationship between two variables. It is important to visualize the data when possible, rather than relying solely on the correlation coefficient, so that nonlinear relationships are not overlooked.

## Sample covariance

Sample covariance between variables \(x=x_t\) and \(y=y_t\):

\[
c_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i-\overline{x}) (y_i-\overline{y})
\]

Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, *The Analysis of Time Series*, 2004):

\[
c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y})
\]
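The cross-covariance at a given lag can be sketched directly from this definition. The following is a minimal illustration in Python/NumPy, assuming two equal-length, evenly spaced series and a nonnegative lag \(k\) (the function name `cross_covariance` is ours, not from any library):

```python
import numpy as np

def cross_covariance(x, y, k):
    """Sample cross-covariance c_xy(k) for lag k >= 0.

    Follows the Chatfield convention: only N - k products are summed,
    but the sum is still divided by N.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    # Pair x_t with y_{t+k}: drop the last k values of x and the first k of y.
    return np.sum((x[: n - k] - xbar) * (y[k:] - ybar)) / n
```

At \(k = 0\) this reduces to the sample covariance above (with the \(1/N\) normalization).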

## Sample correlation

Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):

\[
r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }}
\]
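As a quick check of this definition, Pearson's \(r\) can be computed from the centered series; the \(1/N\) factors in the covariance and the variances cancel. A minimal sketch in Python/NumPy (the function name `pearson_r` is ours):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r as c_xy / sqrt(c_xx * c_yy)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    # The 1/N normalization cancels between numerator and denominator.
    return np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))
```

This should agree with `np.corrcoef(x, y)[0, 1]` up to floating-point error.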

The sample cross-correlation function:

\[
r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}}
\]

Here \(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\); equivalently, \(c_{xx}(0)\) and \(c_{yy}(0)\) are the lag-zero autocovariances, which equal the sample variances of \(x_t\) and \(y_t\), respectively.
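Putting the pieces together, the cross-correlation at lag \(k\) is the lagged cross-covariance scaled by the two lag-zero standard deviations. A self-contained sketch in Python/NumPy under the same assumptions as above (equal-length series, \(k \geq 0\); the function name `cross_correlation` is ours):

```python
import numpy as np

def cross_correlation(x, y, k):
    """Sample cross-correlation r_xy(k) = c_xy(k) / sqrt(c_xx(0) * c_yy(0))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    xd, yd = x - x.mean(), y - y.mean()
    # Lagged cross-covariance, Chatfield convention (divide by N).
    c_xy_k = np.sum(xd[: n - k] * yd[k:]) / n
    # Lag-zero autocovariances, i.e. the sample variances.
    return c_xy_k / np.sqrt(np.mean(xd**2) * np.mean(yd**2))
```

At \(k = 0\) with \(y = x\), this returns 1 by construction.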

## Correlation coefficients illustrated

“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, *Modern Mathematical Statistics with Applications*, 2012

Anscombe’s quartet classically illustrates the pitfalls of relying on a single coefficient – always visualize your data. Consider the following four datasets: