Air pollution data (observations or model predictions) consist of realizations of multiple variables in time and space. How can we structure this data so that we may efficiently apply simple statistical analyses on them? In general, the goal of statistical analysis can be thought of as having two products:
In this project, we will learn methods of exploratory data analysis using relational models of data. We will cover the software package R.
The assignment is to analyze measurements from the Swiss Federal Office of the Environment (BAFU/OFFEV/FOEN) network. The objective is to observe seasonal and diurnal patterns in concentrations of each pollutant, and understand the interplay of emissions, photochemistry, and meteorology that gives rise to these patterns to the extent that they can be inferred. You can find many details on the BAFU website, and in the lectures of the course.
See course handout for exact assignment instructions.
Data analysis describes the process of reducing a set of data to a few relevant conclusions.
Code describes data (e.g., measurements) and its transformations (e.g., averaging).
Structuring your data appropriately can significantly reduce the number of steps required for many types of transformations.
Data models describe entities and their relationships to other entities.
These models can be implemented by databases in data warehouses, or data structures residing in local memory.
Array representation in R:
dimnames(arr)
## $year
## [1] "2012" "2013" "2014"
##
## $month
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
##
## $day
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31"
##
## $hour
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [16] "15" "16" "17" "18" "19" "20" "21" "22" "23"
##
## $variable
## [1] "O3" "NO2" "CO" "PM10" "TEMP" "PREC" "RAD"
arr["2013","Jul","20",,"O3"]
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 78.4 71.8 69.5 76.8 63.3 50.3 45.2 54.9 65.2 56.6 61.5 74.1 82.4
## 13 14 15 16 17 18 19 20 21 22 23
## 91.9 98.8 110.4 118.3 122.0 122.9 113.2 99.9 85.8 54.9 48.9
Some operations can be efficiently performed with arrays:
apply(arr["2013",,,,],MARGIN=c("month","variable"), FUN=mean, na.rm=TRUE)
## variable
## month O3 NO2 CO PM10 TEMP PREC RAD
## Jan 22.64865 47.52375 0.4440108 22.54845 2.2271505 0.08721400 46.31774
## Feb 40.58125 45.05580 0.4380952 28.74501 0.6016369 0.07113095 85.00015
## Mar 35.02078 50.33841 0.4849328 35.95249 4.7498656 0.10713324 98.91492
## Apr 50.78187 40.48799 0.3930556 25.55132 10.6093056 0.11930556 170.54847
## May 51.85189 38.63450 0.3339166 12.29838 11.9751344 0.18561828 174.21317
## Jun 59.61657 38.92350 0.3386111 17.55225 17.6750000 0.12517385 248.54292
## Jul 76.19515 38.44542 0.3680108 19.92781 22.3884409 0.16424731 272.65753
## Aug 66.49771 34.13576 0.3336022 17.26532 20.9435484 0.05397039 241.67258
## Sep 43.20181 42.02855 0.3945833 17.52681 17.0044444 0.12472222 157.18375
## Oct 24.74307 40.52557 0.4131720 17.46508 13.7291667 0.28721400 82.58642
## Nov 25.85577 36.38623 0.3695833 13.65309 6.0034722 0.15724234 53.88542
## Dec 18.55666 57.09166 0.5268817 26.03401 3.8581989 0.13575269 46.13616
Hierarchical representation:
tree[["2013"]][["Jul"]][["20"]][c("11","12")]
## $`11`
## $`11`$O3
## [1] 74.1
##
## $`11`$NO2
## [1] 31.6
##
## $`11`$CO
## [1] 0.4
##
## $`11`$PM10
## [1] 15
##
## $`11`$TEMP
## [1] 20.8
##
## $`11`$PREC
## [1] 0
##
## $`11`$RAD
## [1] 799.3
##
##
## $`12`
## $`12`$O3
## [1] 82.4
##
## $`12`$NO2
## [1] 35.1
##
## $`12`$CO
## [1] 0.4
##
## $`12`$PM10
## [1] 17.4
##
## $`12`$TEMP
## [1] 21.7
##
## $`12`$PREC
## [1] 0
##
## $`12`$RAD
## [1] 880.3
Relations with 2-D structure (“a multiset of tuples”).
Like matrices, but different:
Import example data:
View in “wide” format:
kable(data[1:2,])
datetime | O3 | NO2 | CO | PM10 | TEMP | PREC | RAD |
---|---|---|---|---|---|---|---|
31.12.2012 01:00 | 7.8 | 56.3 | 0.5 | 16.1 | 3.8 | 0 | -2.4 |
31.12.2012 02:00 | 22.4 | 38.0 | 0.4 | 11.6 | 4.1 | 0 | -2.3 |
Convert to “long” format:
library(reshape2) # makes melt() library available
kable(melt(data[1:2,], id.vars="datetime"))
datetime | variable | value |
---|---|---|
31.12.2012 01:00 | O3 | 7.8 |
31.12.2012 02:00 | O3 | 22.4 |
31.12.2012 01:00 | NO2 | 56.3 |
31.12.2012 02:00 | NO2 | 38.0 |
31.12.2012 01:00 | CO | 0.5 |
31.12.2012 02:00 | CO | 0.4 |
31.12.2012 01:00 | PM10 | 16.1 |
31.12.2012 02:00 | PM10 | 11.6 |
31.12.2012 01:00 | TEMP | 3.8 |
31.12.2012 02:00 | TEMP | 4.1 |
31.12.2012 01:00 | PREC | 0.0 |
31.12.2012 02:00 | PREC | 0.0 |
31.12.2012 01:00 | RAD | -2.4 |
31.12.2012 02:00 | RAD | -2.3 |
The transformation from the wide format to long format as shown above (and vice versa) is known as “pivoting”. Other common operations with data tables include:
Data tables permit a set of operations common to many types of data analyses.
General strategy:
Example (as shown in Lesson 1):
The basic syntax for SAC and describing a graphic in R can be similar in structure.
To return a data table after grouping rows of another data table by variable1
and variable2
; applying cor
to each piece, and recombining:
library(dplyr)
data_table %>% # data table
group_by(variable1, variable2) %>% # grouping variables
summarize(correlation=cor(x, y)) # (aggregating) function
There are many variations on this specification, but this is the general idea for creating a graphic:
library(ggplot2)
ggplot(data=data_table) + # data table
facet_grid(variable1~variable2) + # grouping variables
geom_line(mapping=aes(x, y, color=z)) # aesthetics
The last example shows the application of a popular programming concept for graph specification. We will show more such examples in the following lessons.
Note that dplyr
and ggplot2
packages can now be loaded together through the tidyverse
package. For data table pivoting, there are many options, including melt
/dcast
from the reshape2
package shown above. We will use gather
/spread
from the tidyr
package (which can is also loaded through tidyverse
.