In the following tutorial we are going to create BGC class-specific models to predict the abundance of BGC classes in metagenomic samples. We used a machine-learning approach to create the models, which are class-specific, and were created using the abundance of each class and its corresponding domains as the response and predictor variables, respectively. Each model consists of a two-step zero-inflated process: First, the presence/absence of the BGC class is predicted using a random forest (RF) classifier; Second, the abundance is predicted with a multiple linear (ML) regression only if the class was previously predicted as present. The models are trained using a simulated metagenomic dataset of 150 samples (i.e. OMs dataset, see below).
OMs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences of the Ocean Microbial Reference Gene Catalogue (OM-RGC) (Sunagawa et al., 2015). Composed of 376 different species.
TGs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences corresponding to the genera found in the TARA Oceans taxonomic annotation of the Operational Taxonomic Units (OTUs) (Sunagawa et al., 2015). Composed of 331 different species.
First we need to load the necessary libraries and data. Bgcpred installation instructions can be found here
library(tidyverse)
library(bgcpred)
URL <- "https://github.com/pereiramemo/BiG-MEx/wiki/files/"
URL_train_data_domains <- paste(URL,"Marine-RM_dom_abund.tsv", sep = "")
train_data_domains <- read_tsv(file = URL_train_data_domains, col_names = TRUE)
URL_test_data_domains <- paste(URL,"Marine-TM_dom_abund.tsv", sep = "")
test_data_domains <- read_tsv(file = URL_test_data_domains, col_names = TRUE)
URL_train_data_classes <- paste(URL,"Marine-RM_class_abund.tsv", sep = "")
train_data_classes <- read_tsv(file = URL_train_data_classes, col_names = TRUE)
URL_test_data_classes <- paste(URL,"Marine-TM_class_abund.tsv", sep = "")
test_data_classes <- read_tsv(file = URL_test_data_classes, col_names = TRUE)
And massage a little bit the data. First we will add a few domains not present in the test_data_domains table but present in the train_data_domains table, this will make our life easier later in the prediction and training
i <- !colnames(train_data_domains) %in% colnames(test_data_domains)
for (n in colnames(train_data_domains)[i]){
test_data_domains <- test_data_domains %>%
mutate_(.dots = setNames(0, n))
}
For the model creation we will use the get_domains() and class_model_train() functions from bgcpred package (although the package already includes the models: see wrap_up_predict()).
We will create the models for the lantipeptide, nrps, t1pks, t2pks, t3pks and terpene BGC classes. First, we use the get_domains() function to extract the corresponding domains of each BGC class. Then, we use these domains and classes RCs to train the models.
bgc_classes <- c("nrps","t1pks","t2pks","terpene")
bgc_model <- list()
for (b in bgc_classes) {
y <- train_data_classes %>% select(b) %>% unlist
domains_x <- get_domains(b)
domains_x <- domains_x[ domains_x %in% colnames(train_data_domains) ]
x <- train_data_domains %>% select(domains_x)
bgc_model[[b]] <- class_model_train(y = y,
x = x,
binary_method = "rf",
regression_method = "lm",
seed = 111)
}
Now that we have created the models, we will use them to predict the BGC class abundance in the test data set. For this task, we are going to use the class_model_predict() function from bgcpred.
bgc_pred <- list()
for (b in bgc_classes) {
domains_x <- get_domains(b)
domains_x <- domains_x[ domains_x %in% colnames(test_data_domains) ]
x <- test_data_domains %>% select(domains_x)
bgc_pred[[b]] <- class_model_predict(x = x,
model_c = bgc_model[[b]]$binary_model,
model_r = bgc_model[[b]]$regression_model)
}
Given that TGs is a simulated data set, we actually know the BGC class abundances, and we can compare those with the model predictions.
# select some nice colors
class2color <- cbind(class = bgc_classes,
color = c("#ECD078","#D95B43","#C02942","#000000"))
# convert predictions list to a long data frame
pred_df <- bgc_pred %>%
bind_cols() %>%
gather(key = bgc, value = pred )
# extract and convert real abundances to a long data frame
ref_df <- test_data_classes %>%
select(bgc_classes) %>%
gather(key = bgc, value = ref)
# join data frames
X <- bind_cols(pred_df, ref = ref_df$ref )
# make titles including correlation values
COR <- X %>%
group_by(bgc) %>%
summarise(cor = cor(ref, pred) %>% round(., digits = 2))
titles <- paste(COR$bgc, "cor:", COR$cor)
names(titles) <- COR$bgc
# plot
ggplot(X, aes(x = ref, y=pred)) +
geom_abline(intercept = 0, slope = 1, color = "gray60") +
geom_point(aes(color = bgc ), alpha = 0.5) +
facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free", labeller = as_labeller(titles)) +
scale_color_manual(values = class2color[,"color"], name = "BGC class") +
xlab("Reference abundance") +
ylab("Predicted abundance") +
expand_limits(y=0) +
theme_light() +
theme(strip.background = element_blank(),
strip.text = element_text(color = "black", face = "bold"))
Here we simply plot the distribution of the residuals.
res_df <- lapply(bgc_model,"[[", "regression_model") %>%
lapply(., residuals) %>%
plyr::ldply(., cbind)
colnames(res_df) <- c("bgc","res")
# plot
ggplot(res_df, aes(res)) +
geom_histogram(aes(fill = bgc), alpha = 0.8, bins = 30) +
facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free" ) +
scale_fill_manual(values = class2color[,"color"], name = "BGC class") +
xlab("MLR residuals") +
ylab("Counts") +
expand_limits(y=0) +
theme_light() +
theme(strip.background = element_blank(),
strip.text = element_text(color = "black", face = "bold"))
Now, we plot the residuals vs. the fitted values. The idea is to check if the variability of the residuals stays constant as the fitted values increase.
fit_df <- lapply(bgc_model,"[[", "regression_model") %>%
lapply(., fitted) %>%
plyr::ldply(., cbind)
colnames(fit_df) <- c("bgc","fit")
fit2res_df <- data.frame(bgc = fit_df$bgc, fit = fit_df$fit, res = res_df$res)
# plot
ggplot(fit2res_df, aes(x = fit, y = res)) +
geom_abline(intercept = 0, slope = 0, color = "gray60") +
geom_point(aes(color = bgc ), alpha = 0.5) +
facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free" ) +
scale_color_manual(values = class2color[,"color"], name = "BGC class") +
xlab("Residuals") +
ylab("Fitted") +
expand_limits(y=0) +
theme_light() +
theme(strip.background = element_blank(),
strip.text = element_text(color = "black", face = "bold"))
Lastly, we check if there is a linear relationship between the predictor and response variables. To do this, we will plot the predictor variables vs. the residuals and check if the residuals “bounce randomly” around the 0 line. Given that we would have to create a plot for each domain, here we only show the plots for the nrps domains.
res_redu_df <- res_df %>% filter(bgc == "nrps")
domains_x <- get_domains("nrps")
domains_x <- domains_x[ domains_x %in% colnames(train_data_domains) ]
dom_abund2res_df <- data.frame(train_data_domains[ ,domains_x], res = res_redu_df$res) %>%
gather(key = "domain", value = "abund", domains_x )
# select domain colors
domain2colors <- c("#C02842","#B75131","#AE8539","#A0A540","#759D46")
ggplot(dom_abund2res_df, aes(x = abund, y = res)) +
geom_abline(intercept = 0, slope = 0, color = "gray60") +
geom_point(aes(color = domain ), alpha = 0.5) +
facet_wrap( ~ domain, ncol = 2, nrow = 2, scales = "free" ) +
scale_color_manual(values = domain2colors, name = "BGC domain") +
xlab("Abundance") +
ylab("Residuals") +
expand_limits(y=0) +
theme_light() +
theme(strip.background = element_blank(),
strip.text = element_text(color = "black", face = "bold"))
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5???32. http://doi.org/10.1023/A:1010933404324
Delmont, T. O., Quince, C., Shaiber, A., Esen, O. C., Lee, S. T. M., Lucker, S., & Eren, A. M. (2017). Nitrogen-Fixing Populations Of Planctomycetes And Proteobacteria Are Abundant In The Surface Ocean. bioRxiv, 129791. http://doi.org/10.1101/129791
Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., ??? Velayoudon, D. (2015). Structure and function of the global ocean microbiome. Science, 348(6237), 1261359???1261359. http://doi.org/10.1126/science.1261359
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 bgcpred_0.0.0.9000 e1071_1.6-8
## [4] xgboost_0.6-4 vegan_2.4-5 lattice_0.20-35
## [7] permute_0.9-4 randomForest_4.6-12 dplyr_0.7.4
## [10] purrr_0.2.3 readr_1.1.1 tidyr_0.7.2
## [13] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.2 reshape2_1.4.2 haven_1.1.0
## [4] colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.14
## [7] mgcv_1.8-20 rlang_0.1.2 foreign_0.8-69
## [10] glue_1.1.1 modelr_0.1.1 readxl_1.0.0
## [13] bindr_0.1 plyr_1.8.4 stringr_1.2.0
## [16] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0
## [19] rvest_0.3.2 psych_1.7.8 evaluate_0.10.1
## [22] labeling_0.3 knitr_1.17 forcats_0.2.0
## [25] curl_3.1 class_7.3-14 parallel_3.3.1
## [28] broom_0.4.2 Rcpp_0.12.13 scales_0.5.0
## [31] backports_1.1.0 jsonlite_1.5 mnormt_1.5-5
## [34] hms_0.3 digest_0.6.12 stringi_1.1.5
## [37] grid_3.3.1 rprojroot_1.2 tools_3.3.1
## [40] magrittr_1.5 lazyeval_0.2.0 cluster_2.0.6
## [43] pkgconfig_2.0.1 MASS_7.3-47 Matrix_1.2-11
## [46] data.table_1.11.4 xml2_1.1.1 lubridate_1.6.0
## [49] assertthat_0.2.0 rmarkdown_1.6 httr_1.3.1
## [52] R6_2.2.2 nlme_3.1-131