The Repo R Data Manager

Francesco Napolitano

Introduction

This is a getting-started guide to the Repo R package, which implements a repository manager for R objects and acts as a data-centered data flow manager.

The Repo package builds one (or more) centralized local repositories where R objects are stored together with the corresponding annotations, tags, dependency notes, provenance traces and source code. Once a repository has been populated, stored objects can be easily searched, navigated, edited, imported and exported. Annotations can be exploited to reconstruct data flows and perform typical pipeline management operations.

Additional information can be found in the paper: Napolitano, F. repo: an R package for data-centered management of bioinformatic pipelines. BMC Bioinformatics 18, 112 (2017).

The latest version of Repo can be found at: https://github.com/franapoli/repo

Repo is also on CRAN at: https://cran.r-project.org/package=repo

Preparation

The following command creates a new repository in a temporary path (the default would be “~/.R_repo”). The same function opens existing repositories. The variable rp will be used as the main interface to the repository throughout this guide.

library(repo)
rp <- repo_open(tempdir(), force=T)
Repo created.
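
As a minimal sketch of the create-then-reopen behaviour described above, the following uses a fresh directory under the session's temporary path instead of tempdir() itself. The directory prefix repo_sketch_ and the item name small are made up for illustration, and the requireNamespace guard only skips the example when Repo is not installed.

```r
# Sketch: create a repository, store one item, then reopen the same root.
# "repo_sketch_" and "small" are illustrative names, not part of Repo.
if (requireNamespace("repo", quietly = TRUE)) {
    library(repo)
    root <- tempfile("repo_sketch_")       # fresh, empty directory
    dir.create(root)

    rp1 <- repo_open(root, force = TRUE)   # no index yet: a new repo is created
    rp1$put(1:3, "small", "Three integers", "demo")

    rp2 <- repo_open(root)                 # index found: the existing repo is opened
    reopened <- rp2$get("small")
}
```

If everything works as assumed, reopened holds the same three integers that were stored through rp1.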

This document is produced by a script named index.Rmd. The script itself can be added to the repository and newly created resources annotated as being produced by it. The annotation is made automatic using the options command.

rp$attach("index.Rmd", "Source code for Repo vignette", replace=T)
rp$options(src="index.Rmd")

Populating the repository

Here is a normalized version of the Iris dataset to be stored in the repository:

myiris <- scale(as.matrix(iris[,1:4]))

The shortest way to permanently store the myiris object in the repository is simply:

rp$put(myiris)

However, richer annotation is possible, for example:

## chunk "myiris" {
rp$put(
    obj = myiris,
    name = "myiris", 
    description = paste(
        "A normalized version of the iris dataset coming with R.",
        "Normalization is made with the scale function",
        "with default parameters."
    ),
    tags = c("dataset", "iris", "repodemo"),
    replace=T
)
## }

The call provides the data to be stored (obj), an identifier (name), a longer description, and a list of tags. Setting the replace parameter to T allows the call to overwrite the item created before.

The comment lines (## chunk "myiris" { and ## }) have a special meaning: they associate the enclosed code with the resource. The code can be shown as follows:

rp$chunk("myiris")
rp$put(
    obj = myiris,
    name = "myiris", 
    description = paste(
        "A normalized version of the iris dataset coming with R.",
        "Normalization is made with the scale function",
        "with default parameters."
    ),
    tags = c("dataset", "iris", "repodemo"),
    replace=T
)
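
To make the delimiter convention concrete, here is a base-R sketch of how the lines between ## chunk "name" { and ## } could be pulled out of a script. The function extract_chunk is hypothetical and is not how Repo is implemented; it only illustrates the convention.

```r
# Hypothetical helper (not Repo's own code): return the lines enclosed by
#   ## chunk "<name>" {   ...   ## }
# in a character vector of source lines.
extract_chunk <- function(lines, name) {
    opener <- sprintf('## chunk "%s" {', name)
    start <- which(trimws(lines) == opener)
    if (length(start) != 1L) stop("chunk not found or defined more than once")
    closers <- which(trimws(lines) == "## }")
    end <- closers[closers > start][1]     # first closing line after the opener
    if (is.na(end)) stop("chunk is not closed")
    if (end == start + 1L) return(character(0))
    lines[(start + 1L):(end - 1L)]
}

script <- c(
    'x <- 1',
    '## chunk "myiris" {',
    'myiris <- scale(as.matrix(iris[,1:4]))',
    'rp$put(myiris)',
    '## }',
    'y <- 2'
)
chunk_lines <- extract_chunk(script, "myiris")
```

Here chunk_lines contains only the two lines between the delimiters, which is the kind of text rp$chunk prints for a stored item.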

The code associated with an item should take care of building and storing it. The build command executes the code in the current environment. It can automatically build dependencies, too.

rp$rm("myiris")
rp$build("myiris", "index.Rmd")

In this example, the Iris class annotation will be stored separately:

rp$put(iris$Species, "irisLabels",
       tags = c("labels", "iris", "repodemo"))

Attaching figures

The following code produces a 2D visualization of the Iris data and shows it:

irispca <- princomp(myiris)
iris2d <- irispca$scores[,c(1,2)]
plot(iris2d, main="2D visualization of the Iris dataset",
     col=rp$get("irisLabels"))

Note that irisLabels is loaded on the fly from the repository.

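It would be nice to store the figure itself in the repo together with the Iris data. This is done using the attach method, which stores any file in the repo as is (as opposed to R objects), plus annotations. Two parameters differ from put: the first argument is the path of the file to store rather than an R object (the file name becomes the item identifier), and to names the item the file is attached to.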

fpath <- file.path(rp$root(), "iris2D.pdf")
pdf(fpath)
plot(iris2d, main="2D visualization of the Iris dataset",
     col=rp$get("irisLabels"))
invisible(dev.off())
rp$attach(fpath, "Iris 2D visualization obtained with PCA.",
          c("visualization", "iris", "repodemo"),
          to="myiris")

The attached PDF can be accessed using an external PDF viewer directly from within Repo through the sys command. On a Linux system, this command runs the Evince document viewer and shows iris2D.pdf:

rp$sys("iris2D.pdf", "evince")

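The following code clusters the Iris data and stores the result in the repository. There is one parameter to note: depends, which records that the new item was computed from another item in the repository (here, myiris).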

kiris <- kmeans(myiris, 5)$cluster
rp$put(kiris, "iris_5clu", "Kmeans clustering of the Iris data, k=5.",
       c("metadata", "iris", "kmeans", "clustering", "repodemo"),
       depends="myiris")

The following shows what the clustering looks like. The figure will be attached to the repository as well.

plot(iris2d, main="Iris dataset kmeans clustering", col=kiris)

fpath <- file.path(rp$root(), "iris2Dclu.pdf")
pdf(fpath)
plot(iris2d, main="Iris dataset kmeans clustering", col=kiris)
invisible(dev.off())
rp$attach(fpath, "Iris K-means clustering.",
          c("visualization", "iris", "clustering", "kmeans", "repodemo"),
          to="iris_5clu")
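
The to and depends annotations used above can be checked on a throwaway repository. The sketch below assumes Repo is installed; the directory prefix, the item names and the text file are made up. It also assumes that the dependencies method (covered later in this guide) returns a matrix with item names as row and column names, where 1 marks a dependency and 2 an attachment.

```r
# Sketch on a throwaway repository; all names here are illustrative.
if (requireNamespace("repo", quietly = TRUE)) {
    library(repo)
    root <- tempfile("repo_sketch_")
    dir.create(root)
    rps <- repo_open(root, force = TRUE)

    rps$put(1:10, "numbers", "Ten integers", "demo")
    rps$put(sum(1:10), "total", "Sum of numbers", "demo",
            depends = "numbers")

    # A plain text file attached to the item it documents
    notes <- file.path(tempdir(), "notes.txt")
    writeLines("How 'numbers' was produced.", notes)
    rps$attach(notes, "Free-text notes", "demo", to = "numbers")

    rel <- rps$dependencies(plot = FALSE)
    dep_code <- rel["total", "numbers"]       # expected to be 1 (dependency)
    att_code <- rel["notes.txt", "numbers"]   # expected to be 2 (attachment)
}
```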

Finally, a contingency table of the Iris classes versus clusters is computed below. The special tag hide prevents an item from being shown unless explicitly requested.

res <- table(rp$get("irisLabels"), kiris)
rp$put(res, "iris_cluVsSpecies",
         paste("Contingency table of the kmeans clustering versus the",
               "original labels of the Iris dataset."),
         c("result", "iris","validation", "clustering", "repodemo", "hide"),
         src="index.Rmd", depends=c("myiris", "irisLabels", "iris_5clu"), replace=T)

Looking at the repository

The info command summarizes some information about a repository:

rp$info()
Root:            /tmp/RtmpaYy7by 
Number of items: 7 
Total size:      26.41 kB 

The Repo library supports an S3 print method that shows the contents of the repository. All non-hidden items will be shown, together with some details, which by default are: name, dimensions, size.

rp ## resolves to print(rp)
         ID  Dims    Size
     myiris 150x4 1.82 kB
 irisLabels   150   123 B
  iris_5clu   150   114 B

Hidden items are… hidden. The following will show them too:

print(rp, all=T)
                ID  Dims     Size
        @index.Rmd     - 12.32 kB
            myiris 150x4  1.82 kB
        irisLabels   150    123 B
       @iris2D.pdf     -  5.84 kB
         iris_5clu   150    114 B
    @iris2Dclu.pdf     -  6.03 kB
 iris_cluVsSpecies   3x5    179 B

Items can also be filtered. With the following call, only items tagged with “clustering” will be shown:

print(rp, tags="clustering", all=T)
                ID Dims    Size
         iris_5clu  150   114 B
    @iris2Dclu.pdf    - 6.03 kB
 iris_cluVsSpecies  3x5   179 B

print can show information selectively. This command shows tags and size on disk:

rp$print(show="st")
         ID                                         Tags    Size
     myiris                      dataset, iris, repodemo 1.82 kB
 irisLabels                       labels, iris, repodemo   123 B
  iris_5clu metadata, iris, kmeans, clustering, repodemo   114 B

The find command will match a search string against all item fields in the repository:

rp$find("clu", all=T)
                ID Dims    Size
         iris_5clu  150   114 B
    @iris2Dclu.pdf    - 6.03 kB
 iris_cluVsSpecies  3x5   179 B

It is also possible to obtain a visual synthetic summary of the repository by using the pies command:

rp$pies()

Finally, the check command runs an integrity check verifying that the stored data has not been modified or corrupted. The command also checks for the presence of extraneous (not indexed) files. Since the rp repository was created in a temporary directory, a few extraneous files will pop up.

rp$check()
Checking index.Rmd... ok.
Checking myiris... ok.
Checking irisLabels... ok.
Checking iris2D.pdf... ok.
Checking iris_5clu... ok.
Checking iris2Dclu.pdf... ok.
Checking iris_cluVsSpecies... ok.

Checking for extraneous files in repo root... found some:
/tmp/RtmpaYy7by/file32b114c83843
/tmp/RtmpaYy7by/file32b142982555
/tmp/RtmpaYy7by/file32b1748cc03c
/tmp/RtmpaYy7by/iris2Dclu.pdf
/tmp/RtmpaYy7by/iris2D.pdf

Showing dependencies

In Repo, the relations “generated by”, “attached to” and “dependent on” are summarized in a dependency graph. The formal representation of the graph is a matrix in which entry (i,j) represents a relation from i to j of type 1, 2 or 3 (dependency, attachment or generation). Here is what it looks like:

depgraph <- rp$dependencies(plot=F)
library(knitr)
kable(depgraph)
                  index.Rmd myiris irisLabels iris2D.pdf iris_5clu iris2Dclu.pdf iris_cluVsSpecies
index.Rmd                 0      0          0          0         0             0                 0
myiris                    3      0          0          0         0             0                 0
irisLabels                3      0          0          0         0             0                 0
iris2D.pdf                3      2          0          0         0             0                 0
iris_5clu                 3      1          0          0         0             0                 0
iris2Dclu.pdf             3      0          0          0         2             0                 0
iris_cluVsSpecies         3      1          1          0         1             0                 0
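
To read such a matrix more easily, the codes can be turned into a list of edges. The following base-R sketch re-enters the matrix shown above by hand (so it does not need a repository) and uses the coding stated in the text: 1 for dependency, 2 for attachment, 3 for generation. The names depmat, relation and edge_list are illustrative.

```r
# The matrix printed above, entered by hand. Entry (i, j) encodes a relation
# from item i to item j: 1 = depends on, 2 = attached to, 3 = generated by.
items <- c("index.Rmd", "myiris", "irisLabels", "iris2D.pdf",
           "iris_5clu", "iris2Dclu.pdf", "iris_cluVsSpecies")
depmat <- matrix(c(
    0, 0, 0, 0, 0, 0, 0,
    3, 0, 0, 0, 0, 0, 0,
    3, 0, 0, 0, 0, 0, 0,
    3, 2, 0, 0, 0, 0, 0,
    3, 1, 0, 0, 0, 0, 0,
    3, 0, 0, 0, 2, 0, 0,
    3, 1, 1, 0, 1, 0, 0
), nrow = 7, byrow = TRUE, dimnames = list(items, items))

relation <- c("depends on", "is attached to", "is generated by")
edges <- which(depmat > 0, arr.ind = TRUE)   # (row, col) of every non-zero entry
edge_list <- data.frame(
    from = rownames(depmat)[edges[, "row"]],
    relation = relation[depmat[edges]],
    to = colnames(depmat)[edges[, "col"]],
    stringsAsFactors = FALSE
)
print(edge_list, row.names = FALSE)
```

Each row of edge_list reads as a sentence, for example "iris_5clu depends on myiris".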

When the plot=F parameter is omitted, the dependencies method plots the dependency graph. This plot requires the igraph library.

if(require("igraph", quietly = TRUE, warn.conflicts = FALSE))
    rp$dependencies()

The three types of edges can be shown selectively, so here is what the graph looks like without the “generated” edges:

if(require("igraph"))
    rp$dependencies(generated=F)

Accessing items in the repo

The get command is used to retrieve items from a repository. In the following, the item myiris is loaded into the variable x in the current environment.

x <- rp$get("myiris")

An even simpler command is load, which uses the item name also as variable name:

rm("myiris")
rp$load("myiris")
"myiris" %in% ls()
[1] TRUE

The info command can provide additional information about an entry:

rp$info("myiris")
ID:           myiris
Description:  A normalized version of the iris dataset coming with R. Normalization is made with the scale function with default parameters.
Tags:         dataset, iris, repodemo
Dimensions:   150x4
Timestamp:    2017-08-04 15:44:57
Size on disk: 1.82 kB
Provenance:   index.Rmd
Attached to:  -
Stored in:    yf/fk/dr/yffkdr30htve9p00o9zg5ahwf8ivk0yj
MD5 checksum: d23a5831dfd459be089e51f5bdda8799
URL:          -

Item versions, temporary items, remote contents

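There are actually 3 different ways of adding an object to a repository: a plain put (optionally overwriting an existing item with replace=T), a put with replace="addversion", which keeps the previous contents as an older version, and lazydo, which caches the result of an expression.

In addition, the contents of an existing entry can be downloaded if a URL is provided with it (rp$pull).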

Versioning

The K-means algorithm will likely provide different solutions over multiple runs. Alternative solutions can be stored as new versions of the iris_5clu item as follows:

kiris2 <- kmeans(myiris, 5)$cluster
rp$put(kiris2, "iris_5clu",
         "Kmeans clustering of the Iris data, k=5. Today's version!",
           depends="myiris", replace="addversion")

The new repository looks like the old one:

rp
         ID  Dims    Size
     myiris 150x4 1.82 kB
 irisLabels   150   123 B
  iris_5clu   150   112 B

Except that iris_5clu is actually the one just put (look at the description):

rp$info("iris_5clu")
ID:           iris_5clu
Description:  Kmeans clustering of the Iris data, k=5. Today's version!
Tags:         
Dimensions:   150
Timestamp:    2017-08-04 15:44:58
Size on disk: 112 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    l6/9f/e9/l69fe9tqitr7vx9sewbtq8h2lmyh6tku
MD5 checksum: 3805ef9f3bdbac437442897ff29a54f1
URL:          -

The old one has been renamed and hidden:

rp$info("iris_5clu#1")
ID:           iris_5clu#1
Description:  Kmeans clustering of the Iris data, k=5.
Tags:         metadata, iris, kmeans, clustering, repodemo, hide
Dimensions:   150
Timestamp:    2017-08-04 15:44:58
Size on disk: 114 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    vu/bw/rc/vubwrckkcc2culwppsos2owhi5kr95da
MD5 checksum: a0dff9bd7d32e30844b0be010cab41e7
URL:          -
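
The same behaviour can be checked programmatically on a throwaway repository. This is a sketch assuming Repo is installed and that, as in the output above, the previous version is renamed by appending #1; the directory prefix and the item name result are made up.

```r
# Sketch: store an item, add a new version, then fetch both versions.
if (requireNamespace("repo", quietly = TRUE)) {
    library(repo)
    root <- tempfile("repo_sketch_")
    dir.create(root)
    rpv <- repo_open(root, force = TRUE)

    rpv$put("first run", "result", "A result", "demo")
    rpv$put("second run", "result", "A result, recomputed", "demo",
            replace = "addversion")

    current  <- rpv$get("result")     # the version just stored
    previous <- rpv$get("result#1")   # the older contents, renamed and hidden
}
```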

Caching

It is also possible to use the repository for caching purposes by using the lazydo command. It will run an expression and store the results. When the same expression is run again, the results will be loaded from the repository instead of being built again.

expr <- expression({
    Sys.sleep(3)
    result <- "This took 3 seconds to compute"
})
    
system.time(rp$lazydo(expr)) # first run
lazydo is building resource from code.
Cached item name is: ed37e506c7ed1b11a4d81c5d9aebb599.
   user  system elapsed 
  0.008   0.000   3.013 
system.time(rp$lazydo(expr)) # second run
lazydo found precomputed resource.
   user  system elapsed 
  0.004   0.000   0.003 
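
The idea behind this kind of caching can be illustrated without a repository. The following base-R toy keeps results in memory, keyed by the text of the expression. It is not Repo's implementation: lazy_eval and cache are hypothetical names, and how Repo derives the cached item name is not shown in this guide.

```r
# Toy cache keyed by the expression's source text (NOT Repo's code).
cache <- new.env()

lazy_eval <- function(expr, env = parent.frame()) {
    key <- paste(deparse(expr), collapse = "\n")
    if (!exists(key, envir = cache, inherits = FALSE)) {
        # First time this exact expression is seen: run it and remember the result
        assign(key, eval(expr, envir = env), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
}

runs <- 0
expr <- expression({
    runs <- runs + 1      # side effect: counts how often the code really runs
    sum(1:100)
})

first  <- lazy_eval(expr)   # evaluates the expression
second <- lazy_eval(expr)   # returns the remembered value
```

After both calls runs is still 1, because the second call never evaluates the expression. A consequence of keying on the code is that any change to the expression leads to a new computation.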

Pulling

Existing items can feature a URL property. The pull function updates item contents by downloading them from the Internet. This allows for the distribution of “stub” repositories containing all item information without the actual data. The following code creates an item provided with a remote URL. A call to pull overwrites the stub local content with the remote content.

rp$put("Local content", "item1",
    "This points to big data you may want to download",
    "tag", URL="http://exampleURL/repo")
print(rp$get("item1"))
[1] "Local content"
rp$pull("item1", replace=T)
print(rp$get("item1"))
[1] "Remote content"

Handlers

The handlers method returns a list of functions with the same names as the items in the repo. Each of these functions can call Repo methods (get by default) on the corresponding item. In this way all item names are loaded, which may be useful, for example, to exploit the auto-completion features of the editor.

h <- rp$handlers()
names(h)
 [1] "index.Rmd"                        "myiris"                          
 [3] "irisLabels"                       "iris2D.pdf"                      
 [5] "iris_5clu#1"                      "iris2Dclu.pdf"                   
 [7] "iris_cluVsSpecies"                "iris_5clu"                       
 [9] "ed37e506c7ed1b11a4d81c5d9aebb599" "item1"                           
[11] "repo"                            

Handlers call get by default:

print(h$iris_cluVsSpecies())
            kiris
              1  2  3  4  5
  setosa      0 25  0  0 25
  versicolor 27  0  2 21  0
  virginica  21  0 27  2  0

The tag command (not yet described) adds a tag to an item:

h$iris_cluVsSpecies("tag", "onenewtag")
h$iris_cluVsSpecies("info")
ID:           iris_cluVsSpecies
Description:  Contingency table of the kmeans clustering versus the original labels of the Iris dataset.
Tags:         result, iris, validation, clustering, repodemo, hide, onenewtag
Dimensions:   3x5
Timestamp:    2017-08-04 15:45:02
Size on disk: 179 B
Provenance:   index.Rmd
Attached to:  -
Stored in:    um/b5/6g/umb56gc0ocqo1hrwscbcr07sw972jt5l
MD5 checksum: c6dc52b657e1d37325472b0e7810e036
URL:          -

One may want to open a repo directly with:

h <- repo_open(rp$root())$handlers()
Found repo index in "/tmp/RtmpaYy7by/R_repo.RDS".

In that case, the handler to the repo itself will come in handy:

h$repo
                               ID  Dims    Size
                           myiris 150x4 1.82 kB
                       irisLabels   150   123 B
                        iris_5clu   150   112 B
 ed37e506c7ed1b11a4d81c5d9aebb599     1    72 B
                            item1     1    58 B

If items are removed or added, handlers may need a refresh:

h <- h$repo$handlers()

Further documentation

The repo manual starts at:

help(repo)

In order to get help on the function “func”, try the following:

help(repo_func)

Based on Repo build 2.1.1