DIPPS

Lyron Winderbaum

2016-12-21

This vignette demonstrates the use of the functions included in the dipps package for reading in peaklists as generated by Bruker software for mass spectrometry imaging (MSI) data, and performing simple analyses on peakpicked MSI data such as peak-grouping, investigating the quality of peak-grouping using heuristics with the help of the plyr package, calculations of difference in proportions statistics (DIPPS), and making relevant plots using the ggplot2 package.

library(dipps)
library(plyr)
library(ggplot2)

Unlike the vignette included in the dipps package itself, which uses a small subset of this dataset as an example (mainly for compile-time reasons), this seperate vignette includes the full example dataset, named "A1" – corresponding to the dataset A1 in “Feature extraction for proteomics imaging mass spectrometry data.” The Annals of Applied Statistics. 2015;9(4):1973-1996. doi: 10.1214/15-AOAS870. In this vignette, the full example dataset will be used to illustrate the use of the dipps package.

The current development version of the dipps package is available from github. Just follow the instructions in the README.md

This vignette is seperated into four main sections:

Reading Peaklists (from the output of Bruker software)

combine_peaklists is the relevent function here, reading in all the peaklists in <folname>/peaklists/ and writing two csv files in the current working directory, a Peak-list and a Spectra-list. Note that I include a tarball of the data in the git repo, not the raw data and so an additional line of code is needed to untar the data, this would not normally be needed but it is preferable to uploading the raw data to github. This untar step is slow, and so if modifying this and testing/ iterating ideas you should either untar once and use the untarred data, or cache the chunk that does the untar step as I have done here. Note that the cache files are not included in the repo, so you will have to run the untar step at least once locally to generate them.

untar("A1/A1.tar.gz", exdir = "A1")

## Read untarred data
i.path = file.path(".", "A1")
n.empt = combine_peaklists(i.path)
print(paste("Number of empty spectra:", n.empt))
## [1] "Number of empty spectra: 4"

i.path is specified here using file.path to acquire a path to these data because this is good practice, although in this case because the data folder is in the current working directory, it could have simply been specified as i.path = "A1". Additional arguments can be passed to combine_peaklists to modify locations of inputs and outputs, see the documentation, ?combine_peaklists. The outputs can be read in using the helper functions load_*:

o.name = basename(i.path)
pl = load_peaklist(o.name)
sl = load_speclist(o.name)
print(paste("Number of peaks:", nrow(pl)))
## [1] "Number of peaks: 1721862"
print(head(pl), digits = 6)
##       m.z     SN QualityFactor Resolution intensity   area   Acq
## 1 1011.60 13.450        9171.3     7929.5    1112.4  278.9 12859
## 2 1045.52 81.455       17456.1     7560.9    6669.3 1825.3 12859
## 3 1067.51  7.744        3988.5     6801.5     628.7  198.2 12859
## 4 1078.46  2.100          89.6     2374.6     170.3  154.6 12859
## 5 1084.01  2.447          98.6     5858.8     199.1   74.1 12859
## 6 1095.99  2.051          53.5     4160.4     165.2   87.7 12859

o.name can be supplied to combine_peaklists in which case it would be used to name the output files thereof, and should be supplied to the load_* functions here. When o.name is not supplied to combine_peaklists, it defaults to basename(i.path), which is why that is how I specify o.name here even though in this case this simply results in it being "A1".

In terms of computational expense, pl is the large object here, and if speed is of importance it is possible to remove columns unneccessary for whatever analyses you are performing and saving that – which will be quicker to load due to the size reduction.

Peakgrouping

Usually one of the first things you will want to do with such data is to produce peak-groups of some kind – grouping peaks with similar mass to identify them with each other. This package supplies two functions for this purpose:

After introducing these two functions and their uses, I’ll introduce a some heuristics I often use for judging the quality of a peak-group clustering, particularly applicable to peak-grouping by dbscan, but that can also be applied peak-grouping done by extract_masses.

Extract Masses

Most simply, extract_masses can be used to simply extract peaks near known masses of interest, in this dataset for example, m/z = 1570.677 is a mass of interest, and so we could extract it as:

df.cal = data.frame(mass = c(1296.685,
                             1570.677,
                             2147.199,
                             2932.588),
                    name = c("Angiotensin_I",
                             "Glu_Fibrinopeptide_B",
                             "Dynorphin_A",
                             "ACTH_fragment1_24"),
                    short = c("AngI",
                              "GluFib",
                              "DynA",
                              "ACTH"))
ppm_err = 20
pl.cal = extract_masses(pl, df.cal$mass, margin = ppm_err, use_ppm = TRUE)

This returns pl.cal as a simple subset of pl of peaks within 20 ppm of the m/z values specified in df.cal$mass, with an additional column called group whose values are the values of df.cal$mass to which each peak is matched.

DBSCAN

As mentioned in the documentation, dbscan provides a univariate optimisation of the DBSCAN* algorithm of section 3 of Campello et al. (2013). This is a density based clustering algorithm that, breifly, can be thought of as two steps performed in sequence, first points with insufficient density are removed, and second the remaining points are grouped as per the equivelance classes induced by an equivalence relation. In this context density is measured as the number of points within a certain eps, and the equivalence relation used is “are within eps of each other”. This second step, without the first, is often called tolerance clustering and is often sufficient – this can be acheived by specifing a minimum density of 1:

pl$tol_clus_1 = dbscan(pl$m.z, eps = 0.1, mnpts = 1)
pl$tol_clus_2 = dbscan(pl$m.z, eps = 0.05, mnpts = 1)

But more generally the minimum density, mnpts, will need to be adjusted based on the dataset, usually proportionally to the size of the dataset but also depending on other factors. Similarly eps will depend on the dataset, and is directly related to mnpts – the same value of mnpts is a more stringent density cutoff for a smaller value of eps. This test dataset is quite small, so a small value of mnpts is appropriate, such as for example:

pl$dbs_clus_1 = dbscan(pl$m.z, eps = 0.1, mnpts = 100)
pl$dbs_clus_2 = dbscan(pl$m.z, eps = 0.05, mnpts = 100)
pl$dbs_clus_3 = dbscan(pl$m.z, eps = 0.1, mnpts = 250)
pl$dbs_clus_4 = dbscan(pl$m.z, eps = 0.05, mnpts = 250)

Note I have generated several different clusterings by varying the parameters, and will compare these in the following section.

Heuristics for diagnosis of peak-grouping.

Here I’ll show how to construct a plot I often use for heuristically diagnosing the quality of the performed peakpicking, and to adjust the tuning parameter selection to get a peak-grouping i am happy with. First, a couple of helper functions for calculating peak-group summaries, and generating the plot to which I alluded respectively:

summarise_peakgroups = function(pl, var) {
  return(ddply(pl, var, summarise,
               AWM = weighted.mean(m.z, log1p(intensity)),
               Centre = (max(m.z) + min(m.z)) / 2,
               Range = max(m.z) - min(m.z),
               nPeaks = length(m.z),
               nDupPeaks = (length(m.z) - length(unique(Acq)))))
}
diag_plot = function(pgs, t) {
  return(ggplot(pgs, aes(x=Range, y=nDupPeaks))
         + ggtitle(t)
         + geom_jitter(width = 0, height = 0.4) 
         + ylab("# Duplicate Peaks"))
}

Given these functions, peak-group summaries can be calculates as:

pgs_tol_1 <- summarise_peakgroups(pl, "tol_clus_1")
pgs_tol_2 <- summarise_peakgroups(pl, "tol_clus_2")
pgs_dbs_1 <- summarise_peakgroups(pl, "dbs_clus_1")
pgs_dbs_2 <- summarise_peakgroups(pl, "dbs_clus_2")
pgs_dbs_3 <- summarise_peakgroups(pl, "dbs_clus_3")
pgs_dbs_4 <- summarise_peakgroups(pl, "dbs_clus_4")

and heuristic diagnostic plots can then be generated like so:

print(diag_plot(subset(pgs_tol_1, tol_clus_1 != 0), 
                "eps: 0.1, mnpts: 1\n0% of peaks unassigned"))
print(diag_plot(subset(pgs_tol_1, tol_clus_1 != 0), 
                "eps: 0.05, mnpts: 1\n0% of peaks unassigned"))
per.unassigned_1 = round(100 * subset(pgs_dbs_1, dbs_clus_1 == 0)$nPeaks / nrow(pl), 1)
print(diag_plot(subset(pgs_dbs_1, dbs_clus_1 != 0), 
                paste0("eps: 0.1,  mnpts: 100\n", per.unassigned_1,
                      "% of peaks unassigned")))
per.unassigned_2 = round(100 * subset(pgs_dbs_2, dbs_clus_2 == 0)$nPeaks / nrow(pl), 1)
print(diag_plot(subset(pgs_dbs_2, dbs_clus_2 != 0), 
                paste0("eps: 0.05,  mnpts: 100\n", per.unassigned_2,
                      "% of peaks unassigned")))
per.unassigned_3 = round(100 * subset(pgs_dbs_3, dbs_clus_3 == 0)$nPeaks / nrow(pl), 1)
print(diag_plot(subset(pgs_dbs_3, dbs_clus_3 != 0), 
                paste0("eps: 0.1,  mnpts: 250\n", per.unassigned_3,
                      "% of peaks unassigned")))
per.unassigned_4 = round(100 * subset(pgs_dbs_4, dbs_clus_4 == 0)$nPeaks / nrow(pl), 1)
print(diag_plot(subset(pgs_dbs_4, dbs_clus_4 != 0), 
                paste0("eps: 0.05,  mnpts: 250\n", per.unassigned_4,
                      "% of peaks unassigned")))

Duplicate peaks here refers to multiple peaks from the same spectrum occuring in the same peak-group. Usually, we would not expect this to happen if the peak-picking algorithm is working well (although occasional artefacts can happen). Either way, number of duplicate peaks should be zero or if that is not possible, then it should be as small as possible (and noted, as these duplicates will likely have to be dealt with somehow later in the processing pipeline, either by averaging their intensities or by some other method, depending on the outcome being pursued).

I jitter the points in these plots vertically by +/- 0.2 so that in the case that there are no duplicate peaks, or close to, there will be some visual que for the density of these otherwise overlapping points, giving an idea what kind of range is typical.

Peaks left as unassigned due to insufficient density in the first step of the DBSCAN* algorithm would usually be discarded from this point downstream, so to avoid throwing out important information the number of peaks left unassigned during the peak-grouping step should generally be minimised if possible. On the other hand, if a peak-grouping that acheives zero duplicate peaks in all clusters can be acheived this is preferable as otherwise duplicate peaks would generally have to be dealt with by some exclusion or averaging method in order to avoid ambiguity. On the other hand if there are only a handful of clusters with a number of duplicate peaks greater than zero, these could be manually inspected and a decision be made as to what to do about them. For example, lets consider an example – the clustering performed above with eps: 0.1 and mnpts: 250.

print(subset(pgs_dbs_3, nDupPeaks > 0 & dbs_clus_3 != 0), digits = 6)
##     dbs_clus_3     AWM  Centre Range nPeaks nDupPeaks
## 18          17 1081.52 1081.44 0.481   4421       445
## 25          24 1101.63 1101.72 0.743   4893       481
## 245        244 1592.72 1592.73 0.319   1185         1
## 280        279 1665.96 1665.91 0.336   2233         4
## 372        371 1850.00 1850.00 0.425   1287        10

This allows us to identify the mass range in which these peaks occur, and other information about these peakgroups. To go even deeper, lets take a closer look at the m/z distribution of a couple of these peakgroups, specifically the two with over 400 duplicate peaks which seem to be particularly problematic:

pgs = subset(pgs_dbs_3, nDupPeaks > 400 & dbs_clus_3 != 0)$dbs_clus_3
centres = pgs_dbs_3[pgs_dbs_3$dbs_clus_3 %in% pgs, "Centre"]
margins = pgs_dbs_3[pgs_dbs_3$dbs_clus_3 %in% pgs, "Range"] / 2
print(qplot(subset(pl, dbs_clus_3 == pgs[1])$m.z, binwidth = 0.01)
      + ggtitle(paste(round(centres[1], 2), "+/-", round(margins[1], 2)))
      + xlab("m/z"))
print(qplot(subset(pl, dbs_clus_3 == pgs[2])$m.z, binwidth = 0.01)
      + ggtitle(paste(round(centres[2], 2), "+/-", round(margins[2], 2)))
      + xlab("m/z"))

Now lets take a look at how these peaks are clustering by the more stringent clustering with eps: 0.05 and mnpts: 250:

pl.sub = subset(pl, abs(m.z - centres[1]) <= margins[1])
print(qplot(pl.sub$m.z, fill = factor(pl.sub$dbs_clus_4), binwidth = 0.01)
      + ggtitle(paste(round(centres[1], 2), "+/-", round(margins[1], 2)))
      + guides(fill = FALSE) + xlab("m/z"))
pl.sub = subset(pl, abs(m.z - centres[2]) <= margins[2])
print(qplot(pl.sub$m.z, fill = factor(pl.sub$dbs_clus_4), binwidth = 0.01)
      + ggtitle(paste(round(centres[2], 2), "+/-", round(margins[2], 2)))
      + guides(fill = FALSE) + xlab("m/z"))

This looks much better – note that in the figures above colours denote different clusters, with the salmon colour representing peakgroup zero, which is the unassigned peaks due to insufficient density. Of interest is the two seemingly seperate green groups of peaks that are actually clustered together, but as seen previously, do not contain any duplicate peaks. Meaning the two consist of non-overlapping sets of spectra. This is interesting as it could potentially represent a systematic mass error is a portion of the spectra in this dataset, and could be further investigated as such. Particularly as this is peptide data, different resolvable masses are expected to differ by roughly 1Da, and the distance between these groups is much less than that.

DIPPS

DIPPS can be calculated for a subset of spectra using the dipps function. In this case we have annotatations made by an expert based on a hematoxylin and eosin (H&E) stain image of the same tissue section. These annotations where in the form of a region of interest (ROI) in the Bruker software, which has the functionality to output this as a “ClinProTools” extensible markup language (XML) file. In order to help match these to the spectra_list we made above, I’ll make a helper function to extract out the regular expression matching the Bruker naming convention for spectra:

library(stringr)
Peaklist_ID <- function(x) str_extract(x,"R\\d{2,3}X\\d{3,4}Y\\d{3,4}")

which I’ll use to demonstrate how to read and match these annotations to our MSI data:

library(XML)

fname <- file.path('A1','annotations.xml')
doc = xmlInternalTreeParse(fname)
rois  = xpathSApply(doc, "/ClinProtSpectraImport/Class", xmlGetAttr, "Name")
nSpec = xpathSApply(doc, "/ClinProtSpectraImport/Class", xmlSize)
spec  = xpathSApply(doc, "/ClinProtSpectraImport/Class/Element", 
                    xmlGetAttr, "Path")

sl$ID = Peaklist_ID(sl$fname)
sl = merge(sl, data.frame(ID = Peaklist_ID(spec),
                          ROI = rep(rois, nSpec),
                          stringsAsFactors = FALSE),
           all.x = TRUE)
sl[is.na(sl$ROI), 'ROI'] = "Not Annotated"

sl$cancer = (sl$ROI != "Not Annotated")

We can plot these annotations spatially:

p = (ggplot(sl, aes(x=X, y=Y, fill = ROI, alpha = (sl$ROI != "Not Annotated")), 
            colour=NA)
  + geom_tile() + coord_fixed() + guides(alpha = FALSE) + xlab("X") + ylab("Y"))
print(p)

and calculate DIPPS for the four annotated cancer regions combined, using the eps: 0.05, mnpts: 250 peakgrouping produced above to identify the variables:

pl = merge(pl, sl[, c("Acq", "cancer")])

pl.sub = subset(pl, dbs_clus_4 != 0)
dipps.df = dipps(pl.sub$Acq, pl.sub$dbs_clus_4, pl.sub$cancer)
print(head(dipps.df), digits = 2, row.names = FALSE)
##  var   p.d  p.u    d  c.u  cos t
##  589 0.061 0.94 0.88 0.14 0.86 1
##  153 0.076 0.96 0.88 0.14 0.81 1
##  535 0.019 0.88 0.86 0.12 0.77 1
##  147 0.136 0.99 0.85 0.14 0.73 1
##  144 0.047 0.88 0.83 0.12 0.71 1
##   92 0.109 0.91 0.80 0.13 0.68 1

From the output of dipps we can see that cluster number 589 occurs in 0.94 of cancer-annotated spectra, and in 0.06 of remaining spectra, giving it the highest DIPPS of 0.88. The peak-group summary we made in the previous section is a conviniant way to look at the details of this peakgroup:

subset(pgs_dbs_4, dbs_clus_4 == dipps.df[1, "var"])
##     dbs_clus_4      AWM   Centre Range nPeaks nDupPeaks
## 590        589 2854.411 2854.401 0.092   1294         0

The main observation being that the m/z window for this cluster is 2854.401 +/- 0.046 with most of the intense peaks grouping nearer to the AWM of 2854.411. If we wanted a better idea of the m/z distribution within this cluster, we could make a histogram, as we did above:

centres = round(subset(pgs_dbs_4, dbs_clus_4 == dipps.df[1, "var"])$Centre, 3)
margins = round(subset(pgs_dbs_4, dbs_clus_4 == dipps.df[1, "var"])$Range / 2, 3)
qplot(subset(pl, dbs_clus_4 == dipps.df[1, "var"])$m.z, xlab = "m/z", 
     main = paste(centres, "+/-", margins), binwidth = 0.01)

Spatial Plotting

We could plot the spatial distribution of these peaks, or their intensities, using the spatial_plot function:

pl.sub = subset(pl, dbs_clus_4 == dipps.df[1, "var"])
pl.sub$occurrence = 1
pl.sub$log.intensity = log1p(pl.sub$intensity)
print(spatial_plot(pl.sub, sl, plot.var = "occurrence",
                   return.plot = TRUE, print.plot = FALSE))
print(spatial_plot(pl.sub, sl, plot.var = "log.intensity",
                   return.plot = TRUE, print.plot = FALSE))

The t column of the output from dipps identifies the variables above the heuristically selected DIPPS cutoff based on cosine distance. From dipps.df we can observe there are 68 such variables above the cutoff. We can count how many of these variables exhibit a peak in each spectrum, and plot this count:

pl.sub = subset(pl, dbs_clus_4 %in% subset(dipps.df, t == 1)$var)
pl.sum = ddply(pl.sub, "Acq", summarise, count = length(Acq))
print(spatial_plot(pl.sum, sl, plot.var = "count",
                   return.plot = TRUE, print.plot = FALSE))

Colophon

This vignette was written in Rmarkdown inside RStudio. knitr and pandoc converted the raw Rmarkdown to html. The complete source is available from github. The wording for this colophon was inspired by hadley’s book which also helped me learn Rmarkdown and write packages (writing this vignette was something of a learning exercise thereof).

This version of this vignette was built with:

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       Australia/Adelaide          
##  date     2016-12-21
## Packages ------------------------------------------------------------------
##  package    * version    date       source        
##  assertthat   0.1        2013-12-06 CRAN (R 3.3.1)
##  colorspace   1.2-6      2015-03-11 CRAN (R 3.3.1)
##  devtools     1.12.0     2016-06-24 CRAN (R 3.3.1)
##  digest       0.6.10     2016-08-02 CRAN (R 3.3.1)
##  dipps      * 0.1.0.9000 2016-12-19 local         
##  evaluate     0.9        2016-04-29 CRAN (R 3.3.1)
##  formatR      1.4        2016-05-09 CRAN (R 3.3.1)
##  ggplot2    * 2.1.0      2016-03-01 CRAN (R 3.3.1)
##  gtable       0.2.0      2016-02-26 CRAN (R 3.3.1)
##  htmltools    0.3.5      2016-03-21 CRAN (R 3.3.1)
##  knitr        1.14       2016-08-13 CRAN (R 3.3.1)
##  magrittr     1.5        2014-11-22 CRAN (R 3.3.1)
##  memoise      1.0.0      2016-01-29 CRAN (R 3.3.1)
##  munsell      0.4.3      2016-02-13 CRAN (R 3.3.1)
##  plyr       * 1.8.4      2016-06-08 CRAN (R 3.3.1)
##  Rcpp         0.12.7     2016-09-05 CRAN (R 3.3.1)
##  rmarkdown    1.1        2016-10-16 CRAN (R 3.3.1)
##  scales       0.4.0      2016-02-26 CRAN (R 3.3.1)
##  stringi      1.1.1      2016-05-27 CRAN (R 3.3.1)
##  stringr      1.1.0      2016-08-19 CRAN (R 3.3.1)
##  tibble       1.2        2016-08-26 CRAN (R 3.3.1)
##  withr        1.0.2      2016-06-20 CRAN (R 3.3.1)
##  yaml         2.1.13     2014-06-12 CRAN (R 3.3.1)