Comparative non-targeted metabolomics comes of age through commercial vendors offering metabolomics for translational researchers outside the mass spectrometry field. The MetaboDiff package aims to provide a low-level entry to differential metabolomic analysis by starting off with the table of relative metabolite measurements. In addition, it introduces the usage of weighted correlation network analysis for metabolomics.
MetaboDiff 0.9.5
The MetaboDiff
R package requires R version 4.0.2 or higher.
CRAN occasionally fails to compile the WGCNA
package for Mac OS X. Hence, it is recommended to install the package before installing MetaboDiff
.
install.packages("WGCNA")
If asked, install the package from source. Alternatively you might need to download the package from the developer and install the package locally:
install.packages(path_to_file, repos = NULL, type="source")
If you encounter problems installing WGCNA
, please refer to the developer page. Please note that MetaboDiff
can only be installed if WGCNA
was successfully installed.
MetaboDiff
is available for all operating systems and can be installed via Github
library("devtools")
install_github("andreasmock/MetaboDiff")
and once installed loaded by
library(MetaboDiff)
MetaboDiff requires three objects as input:
assay
- a matrix containing the relative metabolic measurementsrowData
- a dataframe containing the available metabolite annotationcolData
- a dataframe containing sample metadataExample data for the tutorial is derived from a study by Priolo and colleagues in which the authors used the service of Metabolon® to compare the tissue metabolome of 61 prostate cancers with 25 normal prostate specimens1 Priolo, C., Pyne, S., Rose, J., Regan, E. R., Zadra, G., Photopoulos, C., et al. (2014). AKT1 and MYC Induce Distinctive Metabolic Fingerprints in Human Prostate Cancer. Cancer Research, 74(24), 7198–7204. http://doi.org/10.1158/0008-5472.CAN-14-1490.
Please make sure that the colnames (samples) of assay
correspond to the rownames (patients) of colData
and the rownames of assay
to the rownames of rowData
.
assay[1:5,1:5]
## pat1 pat2 pat3 pat4 pat5
## met1 33964.73 117318.43 118856.90 78670.7 102565.94
## met2 18505.56 167585.32 59621.97 66220.4 74892.27
## met3 NA 42373.93 27141.21 NA 38390.78
## met4 61638.77 74595.78 NA NA NA
## met5 NA 148363.61 43861.79 105835.2 25589.08
head(colData)
## id tumor_normal random_gender group
## pat1 cp2 N female Control
## pat2 cp7 N female Control
## pat3 cp19 N male Control
## pat4 cp26 N male Control
## pat5 cp29 N female Control
## pat6 cp32 N male Control
head(rowData)
## BIOCHEMICAL SUPER_PATHWAY SUB_PATHWAY METABOLON_ID PLATFORM KEGG_ID HMDB_ID
## met1 1-arachidonoylglycerophosphoethanolamine* Lipid Lysolipid 35186 LC/MS neg <NA> HMDB11517
## met2 1-arachidonoylglycerophosphoinositol* Lipid Lysolipid 34214 LC/MS neg <NA> <NA>
## met3 1-arachidonylglycerol Lipid Monoacylglycerol 34397 LC/MS neg C13857 HMDB11572
## met4 1-eicosadienoylglycerophosphocholine* Lipid Lysolipid 33871 LC/MS pos <NA> <NA>
## met5 1-heptadecanoylglycerophosphoethanolamine* No Super Pathway No Pathway 37419 LC/MS neg <NA> <NA>
## met6 1-linoleoylglycerol (1-monolinolein) Lipid Monoacylglycerol 27447 LC/MS neg <NA> <NA>
The function create_mae
merges all objects into a so called MultiAssayExperiment
object2 Sig M (2017). MultiAssayExperiment: Software for the integration of multi-omics experiments in Bioconductor. R package version 1.2.1 to simplify all downstream analysis.
(met <- create_mae(assay,rowData,colData))
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] raw: SummarizedExperiment with 307 rows and 86 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DFrame
## assays() - convert ExperimentList to a SimpleList of matrices
Metabolite annotation can be retrieved from the Small Molecular Pathway Database (SMPDB) if HMDB, KEGG or ChEBI ids are part of the rowData object 3 Jewison T, Su Y, Disfany FM, et al. SMPDB 2.0: Big Improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014 Jan;42(Database issue):D478-84..
# Please specify the accoding column in rowData. In the example data set, the KEGG id was in column 6 and the HMBD id in column 7. A ChEBI id was not available.
met <- get_SMPDBanno(met,
column_kegg_id=6,
column_hmdb_id=7,
column_chebi_id=NA)
In contrast to other high-throughput technologies, missing values are common in quantitative metabolomic datasets.
The function na_heatmap
visualizes the missing metabolite measurements across the samples. The name of the column in colData for grouping and the label colors for the two groups need to be specified.
na_heatmap(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue"))
The example data supports the need for data imputation (figure 1). Imputation is performed by k-nearest neighbor imputation, which could be shown to minimize the effects on the normality and variance of the data as long as the number of missing data does not exceed 40%4 Armitage, E. G., Godzien, J., Alonso-Herranz, V., L pez-Gonz lvez, N., & Barbas, C. (2015). Missing value imputation strategies for metabolomics data. Electrophoresis, 36(24), 3050–3060. http://doi.org/10.1002/elps.201500352.
The function knn_impute
adds the slot “impute” to the MultiAssayExperiment object that contains the imputed relative metabolite measurements for all metabolites with raw measurements in more than 60% of cases. We recommend a cutoff of 40% (i.e. 0.4). However the cutoff might be changed according to the discretion of the user.
(met = knn_impute(met,cutoff=0.4))
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] raw: SummarizedExperiment with 307 rows and 86 columns
## [2] imputed: SummarizedExperiment with 238 rows and 86 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DFrame
## assays() - convert ExperimentList to a SimpleList of matrices
As apparent form the summary description of the object met
69 metabolites are excluded in the slot imputed
due to missing measurements in more than 40% of samples.
Before we normalize the data, we want to exclude putative outliers in the study set. To this end, the function outlier_heatmap
is provided. The sample annotation shows the number of missing metabolites per sample as a proxy of the impact of imputation on clustering. To identify outliers, the dendrogram also displays the results of a k-means clustering. In the examplary data we set 2 clusters (k=2; figure 2).
outlier_heatmap(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue"),
k=2)
The imputed data of the example study set displays a cluster of 5 samples (cluster 1) with in average lower relative metabolite measurements. Due to the lack of batch information, this cannot be investigated further at this time. To demonstrate how a cluster can be removed, we apply the function remove_cluster
to remove cluster 1:
(met <- remove_cluster(met,cluster=1))
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] raw: SummarizedExperiment with 307 rows and 81 columns
## [2] imputed: SummarizedExperiment with 238 rows and 81 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DFrame
## assays() - convert ExperimentList to a SimpleList of matrices
As displayed in the summary of the met
object, the 5 samples of cluster 1 were successfully removed from the slots “raw” and “imputed”.
Variance stabilizing normalization (vsn) is used to ensure that the variance remains nearly constant over the measured spectrum5 Huber, W., Heydebreck, von, A., Sültmann, H., Poustka, A., & Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18 Suppl 1, S96–104..
(met <- normalize_met(met))
## A MultiAssayExperiment object of 4 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 4:
## [1] raw: SummarizedExperiment with 307 rows and 81 columns
## [2] imputed: SummarizedExperiment with 238 rows and 81 columns
## [3] norm: SummarizedExperiment with 307 rows and 81 columns
## [4] norm_imputed: SummarizedExperiment with 238 rows and 81 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DFrame
## sampleMap() - the sample availability DFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DFrame
## assays() - convert ExperimentList to a SimpleList of matrices
At this point the data processing is completed with the MultiAssayExperiment
object containing 4 slots:
quality_plot(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue"))
The quality control plots shows the distribution of raw and normalized metabolic measurements for every sample in the study set. As aimed, the distribution of measurements is comparable across the study set for the normalized and imputed data (figure 3D).
Part II of MetaboDiff
comprises a set of functions for comparative data analysis that are purpose-built for the preprocessed MultiAssayExperiment
object. All statistics obtained in this part will be saved in the metadata slot of the object. Certain plotting functions will only be available once the corresponding statistics has been run, e.g. you will only be able to plot the volcano plot after executing the function for differential analysis (diff_test
). Please note that the data analysis has been developed for metabolomic data sets of more than 100 metabolites. Hence, some functionality might not be available with smaller data sets < 50 metabolites.
The following table summarizes the biological questions that can be answered by MetaboDiff functions.
Question | Statistical Methodology |
---|---|
Are there outliers in my data set? | outlier heatmap including k-means clustering |
Are there metabolome-wide changes between samples? | unsupervised analyses (PCA, tSNE) |
Which individual metabolites show differential abundance between groups? | hypothesis testing with correction for multiple testing |
Which metabolic subpathways (i.e. modules) differ between groups? | metabolic correlation network analysis - module significance (MS) plot |
Which metabolite is most closely related to the individual subpathway (i.e. modules)? | metabolic correlation network analysis - module of interest (MOI) plot |
To explore the metabolomic profiles between samples in an unsupervised fashion, MetaboDiff offers PCA as a linear and tSNE as a non-linear dimension reduction technique.
source("http://peterhaschke.com/Code/multiplot.R")
multiplot(
pca_plot(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue")),
tsne_plot(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue")),
cols=2)
## Warning in if (class(X) == "dist") {: Bedingung hat Länge > 1 und nur das erste Element wird benutzt
A plot of the first two principal components shows no metabolome-wide difference between the grouping of interest. The first two dimensions in the tSNE plot do not reveal a distinct difference in the metabolomic profiles between the normal and tumor samples.
Differential analysis for individual metabolites is performed with the function diff_test
. If two groups are compared the function applies the Student`s T-Test. If there are more than two groups an analysis of variance (ANOVA) is applied. The p-values are corrected for multiple testing by the Benjamini-Hochberg procedure. Hypothesis testing can be simultaneously performed for multiple groups. In our example data, we perform the hypothesis testing for the sample grouping “tumor_normal”, as well as for the randomly generated grouping “random_gender”.
met = diff_test(met,
group_factors = c("tumor_normal","random_gender"))
The results of the hypothesis testing is saved in the metadata slot of the MultiAssayExperiment
object.
str(metadata(met), max.level=2)
## List of 2
## $ ttest_tumor_normal_T_vs_N :'data.frame': 238 obs. of 4 variables:
## ..$ metabolite: chr [1:238] "met1" "met2" "met5" "met6" ...
## ..$ pval : num [1:238] 0.0206 0.7808 0.0832 0.0432 0.5859 ...
## ..$ adj_pval : num [1:238] 0.102 0.904 0.221 0.158 0.758 ...
## ..$ dm : num [1:238] 0.2872 0.0366 -0.3936 -0.5391 -0.1646 ...
## $ ttest_random_gender_male_vs_female:'data.frame': 238 obs. of 4 variables:
## ..$ metabolite: chr [1:238] "met1" "met2" "met5" "met6" ...
## ..$ pval : num [1:238] 0.2318 0.8626 0.4048 0.0121 0.2111 ...
## ..$ adj_pval : num [1:238] 0.83 0.959 0.862 0.386 0.83 ...
## ..$ dm : num [1:238] -0.1372 -0.0208 0.1742 0.607 0.3438 ...
The name of the corresponding slot comprises the name of the group_factor (e.g. tumor_normal), as well as the factor levels of the grouping (N_vs_T). Here, the first factor is always the reference for the difference in means (dm) calculation. A positive difference in means in the grouping T_vs_N is hence a higher abundance in the normal (N) group.
The columns of the result dataframe are the unadjusted p-value (pval), the adjusted p-value by Benjaminni-Hochberg procedure (adj_pval) and the difference in means between groups.
The comparative analysis can be visualized by means of a volcano plot (figure 5). The option p_adjust
enables to choose if the p-value cutoff counts for unadjusted or adjusted p-values. The argument dm_cutoff
sets the cutoff for the absolute difference in means.
par(mfrow=c(1,2))
volcano_plot(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue"),
dm_cutoff=0.5,
p_adjust = FALSE)
volcano_plot(met,
group_factor="tumor_normal",
label_colors=c("darkseagreen","dodgerblue"),
dm_cutoff=0.5,
p_adjust = TRUE)
As a sanity check, we also display the volcano plot for the random grouping “random_gender” for which we do not expect the same number of significant metabolites.
par(mfrow=c(1,2))
volcano_plot(met,
group_factor="random_gender",
label_colors=c("brown","orange"),
p_adjust = FALSE)
volcano_plot(met,
group_factor="random_gender",
label_colors=c("brown","orange"),
p_adjust = TRUE)
As expected, no metabolite was significantly different between the randomly assigned grouping male vs. female after multiple testing with a cutoff of p-value<0.05 (figure 6).
To derive meaningful subpathways that are enriched between groups, MetaboDiff generates a metabolic correlation network. Within MetaboDiff, metabolic correlation network analysis is performed by a set of functions:
diss_matrix
- construction of dissimilarity matrixidentify_modules
- identification of metabolic correlation modulesname_modules
- name metabolic correlation modulescalculate_MS
- calculation of module signficance to relate sample traits to modulesFor more details about the implementation and interpretation, please refer to the package vignette “Metabolic correlation network analysis”.
met <- met %>%
diss_matrix %>%
identify_modules(min_module_size=5) %>%
name_modules(pathway_annotation="SUB_PATHWAY") %>%
calculate_MS(group_factors=c("tumor_normal","random_gender"))
##
## ..cutHeight not given, setting it to 0.991 ===> 99% of the (truncated) height range in dendro.
## ..done.
The following plots illustrate the association of the sample trait of interest (tumor vs. normal) with the metabolic correlation modules derived form the data. The option p_adjust
enables to choose if the p-value cutoff counts for unadjusted or adjusted p-values.
MS_plot(met,
group_factor="tumor_normal",
p_value_cutoff=0.05,
p_adjust=FALSE)
MS_plot(met,
group_factor="random_gender",
p_value_cutoff=0.05,
p_adjust = FALSE)
In line with the volcano plots, the random grouping did not result in a significant association (figure 8).
The relationship of individual metabolites within a correlation module can be explored by means of a module of interest (MOI) plot, where the so called module membership is plotted over the p-value. In short, the module membership tells us, how closely related a metabolite is to the corresponding module and to other metabolites.
MOI_plot(met,
group_factor="tumor_normal",
MOI = 2,
label_colors=c("darkseagreen","dodgerblue"),
p_adjust = FALSE) + xlim(c(-1,8))
For more information regarding the correlation network methodology, see the vignette Metabolic_correlation_netork_analysis
. A full case study applying metabolic correlation network analysis is described in the vignette Case_study
.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MetaboDiff_0.9.5 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4 readr_1.3.1
## [7] tidyr_1.1.0 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 ComplexHeatmap_2.4.3 MultiAssayExperiment_1.14.0
## [13] SummarizedExperiment_1.18.2 DelayedArray_0.14.1 matrixStats_0.56.0 Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
## [19] IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0 BiocStyle_2.16.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.7 circlize_0.4.10 fastmatch_1.1-0 Hmisc_4.4-0 igraph_1.2.5 plyr_1.8.6
## [8] splines_4.0.2 digest_0.6.25 foreach_1.5.0 htmltools_0.5.0 magick_2.5.0 GO.db_3.11.4 fansi_0.4.1
## [15] phytools_0.7-70 magrittr_1.5 checkmate_2.0.0 memoise_1.1.0 cluster_2.1.0 doParallel_1.0.15 limma_3.44.3
## [22] fastcluster_1.1.25 annotate_1.66.0 modelr_0.1.8 jpeg_0.1-8.1 colorspace_1.4-1 blob_1.2.1 rvest_0.3.6
## [29] haven_2.3.1 xfun_0.15 crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.0 genefilter_1.70.0 impute_1.62.0
## [36] phangorn_2.5.5 ape_5.4-1 survival_3.1-12 iterators_1.0.12 glue_1.4.1 gtable_0.3.0 zlibbioc_1.34.0
## [43] XVector_0.28.0 GetoptLong_1.0.2 shape_1.4.4 maps_3.3.0 scales_1.1.1 vsn_3.56.0 DBI_1.1.0
## [50] Rcpp_1.0.5 plotrix_3.7-8 xtable_1.8-4 htmlTable_2.0.1 tmvnsim_1.0-2 clue_0.3-57 foreign_0.8-80
## [57] bit_1.1-15.2 preprocessCore_1.50.0 Formula_1.2-3 tsne_0.1-3 htmlwidgets_1.5.1 httr_1.4.2 RColorBrewer_1.1-2
## [64] acepack_1.4.1 ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.5 farver_2.0.3 nnet_7.3-14 dbplyr_1.4.4
## [71] dynamicTreeCut_1.63-1 tidyselect_1.1.0 labeling_0.3 rlang_0.4.7 AnnotationDbi_1.50.3 munsell_0.5.0 cellranger_1.1.0
## [78] tools_4.0.2 cli_2.0.2 generics_0.0.2 RSQLite_2.2.0 broom_0.7.0 evaluate_0.14 yaml_2.2.1
## [85] knitr_1.30 bit64_0.9-7 fs_1.4.2 nlme_3.1-148 xml2_1.3.2 compiler_4.0.2 rstudioapi_0.11
## [92] png_0.1-7 affyio_1.58.0 clusterGeneration_1.3.5 reprex_0.3.0 stringi_1.4.6 highr_0.8 lattice_0.20-41
## [99] Matrix_1.2-18 vctrs_0.3.2 pillar_1.4.6 lifecycle_0.2.0 BiocManager_1.30.10 combinat_0.0-8 GlobalOptions_0.1.2
## [106] data.table_1.13.0 cowplot_1.0.0 bitops_1.0-6 R6_2.4.1 latticeExtra_0.6-29 affy_1.66.0 bookdown_0.21
## [113] gridExtra_2.3 codetools_0.2-16 gtools_3.8.2 MASS_7.3-51.6 assertthat_0.2.1 rjson_0.2.20 withr_2.2.0
## [120] mnormt_2.0.2 GenomeInfoDbData_1.2.3 expm_0.999-5 hms_0.5.3 quadprog_1.5-8 rpart_4.1-15 coda_0.19-3
## [127] rmarkdown_2.4 scatterplot3d_0.3-41 numDeriv_2016.8-1.1 WGCNA_1.69 lubridate_1.7.9 base64enc_0.1-3