How to use archivist to boost reproducibility

Przemyslaw Biecek

useR 2016

What and why?

Everything in R is an object


What if we stored every R object that has ever been created?


Goals for the archivist package:

  • keep track of (every) created object,
  • keep track of relations between objects,
  • keep meta-information/properties of objects,
  • provide a way to seek objects that match some conditions,
  • provide a way to restore specific objects.

What does the archivist do?


The archivist is an R package for managing R objects outside an R session. It stores binary copies of R objects in rda files and makes it easy to search for and restore these objects based on timestamps, classes or other properties.


But why would anybody want to store copies of all R objects?

Example 1: archivist + knitr

Enrich knitr reports with hooks for selected objects. Example.

archivist::aread("pbiecek/Eseje/arepo/e10f9d223df408fca73ea548456493f4")

Example 2: archivist + shiny

Enrich shiny applications with hooks to displayed plots. Example.

More about this example.

How does it work?

Basics

  • Artifacts are stored in repositories.
  • Use createLocalRepo() to initialize a new repo.
  • Add new artifacts with saveToLocalRepo().
  • Artifacts are identified by their md5hash.
library(archivist)

createLocalRepo("arepo", default = TRUE)

model <- lm(Sepal.Length ~ Petal.Length + Species, data=iris)
saveToLocalRepo(model)
## [1] "e700c6704890022a4e0277e11cd46488"
## attr(,"data")
## [1] "58c95897ce23bea6f52de936aa711c2b"
library(ggplot2)

plot <- ggplot(iris, aes(Sepal.Length, Petal.Length)) + geom_point()
saveToLocalRepo(plot)
## [1] "fb558ab613b62cb762078d7bdc9faa9c"
## attr(,"data")
## [1] "ff575c261c949d073b2895b05d1097c3"
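The idea of identifying an artifact by a hash of its content can be sketched in base R. This is only an illustration under stated assumptions, not how archivist is implemented: the helpers toy_save() and toy_read() and the folder toy_repo are hypothetical, the hash is taken over an uncompressed saveRDS() file, and no tags or database are involved.

```r
# A toy content-addressed store in base R (illustration only).
toy_repo <- file.path(tempdir(), "toy_repo")
dir.create(file.path(toy_repo, "gallery"), recursive = TRUE, showWarnings = FALSE)

# Hypothetical helper: save an object under the MD5 of its serialized form.
toy_save <- function(object, repo = toy_repo) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  saveRDS(object, tmp, compress = FALSE)   # same object, same bytes
  hash <- unname(tools::md5sum(tmp))       # 32-character hex string
  file.copy(tmp, file.path(repo, "gallery", paste0(hash, ".rds")),
            overwrite = TRUE)
  hash
}

# Hypothetical helper: restore an object from its hash.
toy_read <- function(hash, repo = toy_repo) {
  readRDS(file.path(repo, "gallery", paste0(hash, ".rds")))
}

model <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
hash <- toy_save(model)
restored <- toy_read(hash)
```

Because the file name is derived from the content, saving the same object twice does not create a second copy, and the hash alone is enough to ask for the object back.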

  • Use aread() to read an artifact from a repository to an R session.
  • You can read artifacts either from local folders or remote repositories.
# from the default local repository
aread("e700c6704890022a4e0277e11cd46488")
## 
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Species, data = iris)
## 
## Coefficients:
##       (Intercept)       Petal.Length  Speciesversicolor  
##            3.6835             0.9046            -1.6010  
##  Speciesvirginica  
##           -2.1177
# from the github pbiecek/graphGallery repository
aread("pbiecek/graphGallery/18a98048f0584469483afb65294ce3ed")
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
## 
## Coefficients:
##  (Intercept)  Sepal.Length  
##       -7.101         1.858

  • Use asearch() to search for all artifacts whose properties match given patterns.
  • The result is a list of R objects.
  • You may use time intervals as patterns as well.
  • Below we ask for all lm models in a repository, compute BIC for each of them and sort the models by it.
models <- asearch("pbiecek/graphGallery", patterns = "class:lm")
modelsBIC <- sapply(models, BIC)
sort(modelsBIC)
## 990861c7c27812ee959f10e5f76fe2c3 2a6e492cb6982f230e48cf46023e2e4f 
##                         39.05577                         67.52735 
## 0a82efeb8250a47718cea9d7f64e5ae7 378237103bb60c58600fe69bed6c7f11 
##                        189.73593                        189.73593 
## 7f11e03539d48d35f7e7fe7780527ba7 c1b1ef7bcddefb181f79176015bc3931 
##                        189.73593                        189.73593 
## 0e213ac68a45b6cd454d06b91f991bc7 e58d2f9d50b67ce4d397bf015ec1259c 
##                        243.49450                        243.49450 
## 18a98048f0584469483afb65294ce3ed 
##                        396.16690
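The same "search, score, sort" pattern can be tried without any repository. In the sketch below a hand-made named list stands in for the result of asearch(); the names sepal_only, species_only and both are made-up labels, whereas a real search result would be named by md5 hashes.

```r
# Stand-in for a search result: a named list of fitted models.
models <- list(
  sepal_only   = lm(Petal.Length ~ Sepal.Length, data = iris),
  species_only = lm(Petal.Length ~ Species, data = iris),
  both         = lm(Petal.Length ~ Sepal.Length + Species, data = iris)
)

modelsBIC <- sapply(models, BIC)    # one BIC value per model, names preserved
ranked <- sort(modelsBIC)           # smallest BIC first

# The name of the first element identifies the preferred model.
best <- models[[names(ranked)[1]]]
```

Keeping the names attached to the scores is what makes it possible to go back from a sorted value to the object that produced it.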

The structure

  • A repository is a folder with:
      • a subfolder gallery with rda dumps of recorded objects,
      • an SQLite database (file backpack.db).

  • The database contains two tables: one with artifacts and one with their properties and relations.

  • Properties and relations are stored as Tags.
  • Tags are extracted from an artifact on each call to saveToLocalRepo().
  • Use showLocalRepo() to examine tags for selected artifacts.
  • Popular and useful tags are: format:..., name:..., class:..., date:....

  • Different tags are extracted from different objects.
  • The list is flexible; new tags are easy to add if needed.
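How a search over tags might work can be sketched with a plain data frame. The layout (one row per artifact and tag pair), the short identifiers a1 and b2, and the helper find_artifacts() are assumptions made for illustration; they are not taken from the actual backpack.db schema.

```r
# Toy tag table: one row per (artifact, tag) pair.
tags <- data.frame(
  artifact = c("a1", "a1", "a1", "b2", "b2", "b2"),
  tag = c("class:lm", "name:model", "date:2016-06-30",
          "class:ggplot", "name:plot", "date:2016-06-29"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: artifacts that carry every one of the given tags.
find_artifacts <- function(patterns, tags) {
  hits <- lapply(patterns, function(p) tags$artifact[tags$tag == p])
  Reduce(intersect, hits)
}

lm_artifacts <- find_artifacts("class:lm", tags)
named_plots <- find_artifacts(c("class:ggplot", "name:plot"), tags)
```

Storing properties as key:value strings keeps the table shape fixed, so adding a new kind of tag needs no change to the structure.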

What else can you do with the archivist?

Pedigree of an object

  • Use the %a% operator to record function calls.
  • Use ahistory() to present the pedigree of an object.
  • All intermediate results are also stored in the repo.
library("dplyr")
iris %a%
 dplyr::filter(Sepal.Length < 6) %a%
 lm(Petal.Length~Species, data=.) %a%
 summary() -> tmp
ahistory(md5hash = "050e41ec3bc40b3004bc6bdd356acae7", format = "kable", alink = TRUE)
   call                                  md5hash
4  env[[nm]]                             ff575c261c949d073b2895b05d1097c3
3  dplyr::filter(Sepal.Length < 6)       d3696e13d15223c7d0bbccb33cc20a11
2  lm(Petal.Length ~ Species, data = .)  990861c7c27812ee959f10e5f76fe2c3
1  summary()                             050e41ec3bc40b3004bc6bdd356acae7
ahistory(tmp)
##    env[[nm]]                             [ff575c261c949d073b2895b05d1097c3]
## -> dplyr::filter(Sepal.Length < 6)       [d3696e13d15223c7d0bbccb33cc20a11]
## -> lm(Petal.Length ~ Species, data = .)  [990861c7c27812ee959f10e5f76fe2c3]
## -> summary()                             [050e41ec3bc40b3004bc6bdd356acae7]
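The mechanics of a recording pipe can be sketched in a few lines of base R. The operator %rec% below is hypothetical and much simpler than %a%: it keeps the history in an attribute instead of a repository, stores no intermediate results, and does not support the dot placeholder.

```r
# Hypothetical recording pipe: pass lhs as the first argument of the call
# on the right and remember the call text in a "history" attribute.
`%rec%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)    # the unevaluated call, e.g. nrow()
  # Insert the left-hand side as the first argument of the call.
  full_call <- as.call(c(list(rhs_call[[1]], quote(.lhs)),
                         as.list(rhs_call)[-1]))
  result <- eval(full_call, list(.lhs = lhs), parent.frame())
  # Append this step to the history carried by the input, if any.
  attr(result, "history") <- c(attr(lhs, "history"), deparse(rhs_call))
  result
}

n_small <- iris %rec% subset(Sepal.Length < 6) %rec% nrow()
steps <- attr(n_small, "history")
```

Each step only needs to know the history of its input, which is why the full pedigree can be rebuilt from the final object alone.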

You need a village to raise an R object

  • Some artifacts need specific versions of packages (see ggplot2).
  • The sessionInfo is also stored for every object.
  • Use asession() to check the versions of packages loaded when an artifact was added to the repo.
  • Use restoreLibs() to install packages in the versions that were installed when an artifact was added to the repo.
asession("050e41ec3bc40b3004bc6bdd356acae7")
##  setting  value                       
##  version  R version 3.3.0 (2016-05-03)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       Europe/Warsaw               
##  date     2016-06-30                  
## 
##  package    * version  date       source        
##  archivist  * 2.0.4    2016-05-13 CRAN (R 3.3.0)
##  assertthat   0.1      2013-12-06 CRAN (R 3.3.0)
##  bitops       1.0-6    2013-08-17 CRAN (R 3.3.0)
##  colorspace   1.2-6    2015-03-11 CRAN (R 3.1.3)
##  DBI          0.4-1    2016-05-08 CRAN (R 3.3.0)
##  devtools     1.12.0   2016-06-24 CRAN (R 3.3.0)
##  digest       0.6.9    2016-01-08 CRAN (R 3.3.0)
##  dplyr      * 0.5.0    2016-06-24 CRAN (R 3.3.0)
##  evaluate     0.9      2016-04-29 CRAN (R 3.3.0)
##  formatR      1.4      2016-05-09 CRAN (R 3.3.0)
##  ggplot2    * 2.1.0    2016-03-01 CRAN (R 3.3.0)
##  gtable       0.2.0    2016-02-26 CRAN (R 3.3.0)
##  htmltools    0.3.5    2016-03-21 CRAN (R 3.3.0)
##  httr         1.2.0    2016-06-15 CRAN (R 3.3.0)
##  knitr        1.13     2016-05-09 CRAN (R 3.3.0)
##  labeling     0.3      2014-08-23 CRAN (R 3.3.0)
##  lazyeval     0.2.0    2016-06-12 CRAN (R 3.3.0)
##  lubridate    1.5.6    2016-04-06 CRAN (R 3.3.0)
##  magrittr     1.5      2014-11-22 CRAN (R 3.3.0)
##  memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)
##  munsell      0.4.3    2016-02-13 CRAN (R 3.3.0)
##  plyr         1.8.4    2016-06-08 CRAN (R 3.3.0)
##  R6           2.1.2    2016-01-26 CRAN (R 3.3.0)
##  Rcpp         0.12.5   2016-05-14 CRAN (R 3.3.0)
##  RCurl        1.95-4.8 2016-03-01 CRAN (R 3.3.0)
##  revealjs     0.6      2016-03-09 CRAN (R 3.3.0)
##  rmarkdown    0.9.6    2016-05-01 CRAN (R 3.3.0)
##  RSQLite      1.0.0    2014-10-25 CRAN (R 3.1.2)
##  rstudioapi   0.5      2016-01-24 CRAN (R 3.3.0)
##  scales       0.4.0    2016-02-26 CRAN (R 3.3.0)
##  stringi      1.1.1    2016-05-27 CRAN (R 3.3.0)
##  stringr      1.0.0    2015-04-30 CRAN (R 3.3.0)
##  tibble       1.0      2016-03-23 CRAN (R 3.3.0)
##  withr        1.0.1    2016-02-04 CRAN (R 3.3.0)
##  yaml         2.1.13   2014-06-12 CRAN (R 3.3.0)
# restoreLibs("pbiecek/graphGallery/arepo/600bda83cb840947976bd1ce3a11879d")
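Why recorded versions matter can be shown with a small base R check. The helpers record_versions() and changed_packages() are hypothetical; they only compare version strings and, unlike restoreLibs(), install nothing.

```r
# Hypothetical helper: record the installed versions of the given packages.
record_versions <- function(pkgs) {
  vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
         character(1))
}

# Hypothetical helper: names of packages whose version differs from the record.
changed_packages <- function(recorded) {
  current <- record_versions(names(recorded))
  names(recorded)[recorded != current]
}

recorded <- record_versions(c("stats", "utils"))
unchanged <- changed_packages(recorded)   # compared within the same session

# Simulate a record made under a different version of stats.
stale <- recorded
stale["stats"] <- "0.0.1"
outdated <- changed_packages(stale)
```

A check like this tells you which packages to look at before trusting a restored object, for example a plot created with a different release of ggplot2.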

Other features

  • Deeper integration with GitHub is available in archivist.github thanks to @MarcinKosinski (coauthor of the archivist).
  • Use trace() to force adding an artifact to the repo after each call of a specific function.
  • The code below adds all results of lm() to the repository allModels.
createLocalRepo("allModels", default = TRUE)
trace(lm, exit = quote(saveToLocalRepo(z, repoDir="allModels")))
  • Use addHooksToPrint() to add archivist hooks to all objects of selected classes in a knitr report.
addHooksToPrint(class=c("ggplot", "data.frame"),
  repoDir = "arepo",
  repo = "Eseje", user = "pbiecek", subdir = "arepo")


Thank you!

Find more at pbiecek.github.io/archivist