1. Introduction

In the following tutorial we are going to create BGC class-specific models to predict the abundance of BGC classes in metagenomic samples. We used a machine-learning approach to create the models, which are class-specific, and were created using the abundance of each class and its corresponding domains as the response and predictor variables, respectively. Each model consists of a two-step zero-inflated process: First, the presence/absence of the BGC class is predicted using a random forest (RF) classifier; Second, the abundance is predicted with a multiple linear (ML) regression only if the class was previously predicted as present. The models are trained using a simulated metagenomic dataset of 150 samples (i.e. OMs dataset, see below).

2. Datasets

  1. OMs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences of the Ocean Microbial Reference Gene Catalogue (OM-RGC) (Sunagawa et al., 2015). Composed of 376 different species.

  2. TGs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences corresponding to the genera found in the TARA Oceans taxonomic annotation of the Operational Taxonomic Units (OTUs) (Sunagawa et al., 2015). Composed of 331 different species.

3. Step-by-step tutorial

3.1 Load libraries and tables

First we need to load the necessary libraries and data. Bgcpred installation instructions can be found here

library(tidyverse)
library(bgcpred)

URL <- "https://github.com/pereiramemo/BiG-MEx/wiki/files/"

URL_train_data_domains <- paste(URL,"Marine-RM_dom_abund.tsv", sep = "")
train_data_domains <- read_tsv(file = URL_train_data_domains, col_names = TRUE)

URL_test_data_domains <- paste(URL,"Marine-TM_dom_abund.tsv", sep = "")
test_data_domains <- read_tsv(file = URL_test_data_domains, col_names = TRUE)

URL_train_data_classes <- paste(URL,"Marine-RM_class_abund.tsv", sep = "")
train_data_classes <- read_tsv(file = URL_train_data_classes, col_names = TRUE) 

URL_test_data_classes <- paste(URL,"Marine-TM_class_abund.tsv", sep = "")
test_data_classes <- read_tsv(file = URL_test_data_classes, col_names = TRUE) 

3.2 Reformat test domain table

And massage a little bit the data. First we will add a few domains not present in the test_data_domains table but present in the train_data_domains table, this will make our life easier later in the prediction and training

i <- !colnames(train_data_domains) %in% colnames(test_data_domains)
for (n in colnames(train_data_domains)[i]){
  test_data_domains <- test_data_domains %>%
    mutate_(.dots = setNames(0, n))
}

3.3 Training the models

For the model creation we will use the get_domains() and class_model_train() functions from bgcpred package (although the package already includes the models: see wrap_up_predict()).

We will create the models for the lantipeptide, nrps, t1pks, t2pks, t3pks and terpene BGC classes. First, we use the get_domains() function to extract the corresponding domains of each BGC class. Then, we use these domains and classes RCs to train the models.

bgc_classes <- c("nrps","t1pks","t2pks","terpene")

bgc_model <- list()

for (b in bgc_classes) {
  
  y <- train_data_classes %>% select(b) %>% unlist
  domains_x <- get_domains(b)
  domains_x <- domains_x[ domains_x %in% colnames(train_data_domains) ]
  x <- train_data_domains %>% select(domains_x)
  bgc_model[[b]] <- class_model_train(y = y, 
                                      x = x, 
                                      binary_method = "rf", 
                                      regression_method = "lm", 
                                      seed = 111)
}

3.4 BGC class abundance predictions

Now that we have created the models, we will use them to predict the BGC class abundance in the test data set. For this task, we are going to use the class_model_predict() function from bgcpred.

bgc_pred <- list()

for (b in bgc_classes) {
  
  domains_x <- get_domains(b)
  domains_x <- domains_x[ domains_x %in% colnames(test_data_domains) ]
  x <- test_data_domains %>% select(domains_x)
  
  bgc_pred[[b]] <- class_model_predict(x = x, 
                                       model_c = bgc_model[[b]]$binary_model, 
                                       model_r = bgc_model[[b]]$regression_model)
}  

3.5 Results evaluation

Given that TGs is a simulated data set, we actually know the BGC class abundances, and we can compare those with the model predictions.

# select some nice colors
class2color <- cbind(class = bgc_classes, 
                     color = c("#ECD078","#D95B43","#C02942","#000000"))


# convert predictions list to a long data frame
pred_df <- bgc_pred %>%
  bind_cols() %>%
  gather(key = bgc, value = pred )


# extract and convert real abundances to a long data frame
ref_df <- test_data_classes %>% 
  select(bgc_classes) %>%
  gather(key = bgc, value = ref)


# join data frames
X <- bind_cols(pred_df, ref = ref_df$ref )


# make titles including correlation values
COR <- X %>% 
  group_by(bgc) %>%
  summarise(cor = cor(ref, pred) %>% round(., digits = 2))

titles <- paste(COR$bgc, "cor:", COR$cor)
names(titles) <- COR$bgc

# plot
ggplot(X, aes(x = ref, y=pred)) +
  geom_abline(intercept = 0, slope = 1, color = "gray60") +  
  geom_point(aes(color = bgc ), alpha = 0.5) +
  facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free", labeller = as_labeller(titles)) +
  scale_color_manual(values = class2color[,"color"], name = "BGC class") + 
  xlab("Reference abundance") +
  ylab("Predicted abundance") +
  expand_limits(y=0) +
  theme_light() +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", face = "bold"))

4. Checking MLR assumptions

4.1 Residuals normality

Here we simply plot the distribution of the residuals.

res_df <- lapply(bgc_model,"[[", "regression_model") %>%
          lapply(., residuals) %>%
          plyr::ldply(., cbind)

colnames(res_df) <- c("bgc","res")


# plot
ggplot(res_df, aes(res)) +
  geom_histogram(aes(fill = bgc), alpha = 0.8, bins = 30) +  
  facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free" ) +
  scale_fill_manual(values = class2color[,"color"], name = "BGC class") + 
  xlab("MLR residuals") +
  ylab("Counts") +
  expand_limits(y=0) +
  theme_light() +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", face = "bold"))

4.2 Homoscedasticity

Now, we plot the residuals vs. the fitted values. The idea is to check if the variability of the residuals stays constant as the fitted values increase.

fit_df <- lapply(bgc_model,"[[", "regression_model") %>%
                  lapply(., fitted) %>%
                  plyr::ldply(., cbind)

colnames(fit_df) <- c("bgc","fit")
fit2res_df <- data.frame(bgc = fit_df$bgc, fit = fit_df$fit, res = res_df$res)

# plot
ggplot(fit2res_df, aes(x = fit, y = res)) +
  geom_abline(intercept = 0, slope = 0, color = "gray60") +  
  geom_point(aes(color = bgc ), alpha = 0.5) + 
  facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free" ) +
  scale_color_manual(values = class2color[,"color"], name = "BGC class") + 
  xlab("Residuals") +
  ylab("Fitted") +
  expand_limits(y=0) +
  theme_light() +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", face = "bold"))

4.3 Linearity

Lastly, we check if there is a linear relationship between the predictor and response variables. To do this, we will plot the predictor variables vs. the residuals and check if the residuals “bounce randomly” around the 0 line. Given that we would have to create a plot for each domain, here we only show the plots for the nrps domains.

res_redu_df <- res_df %>% filter(bgc == "nrps")
domains_x <- get_domains("nrps")
domains_x <- domains_x[ domains_x %in% colnames(train_data_domains) ]

dom_abund2res_df <- data.frame(train_data_domains[ ,domains_x], res = res_redu_df$res) %>%
                    gather(key = "domain", value = "abund", domains_x )
  

# select domain colors
domain2colors <- c("#C02842","#B75131","#AE8539","#A0A540","#759D46")
        

ggplot(dom_abund2res_df, aes(x = abund, y = res)) +
  geom_abline(intercept = 0, slope = 0, color = "gray60") +  
  geom_point(aes(color = domain ), alpha = 0.5) + 
  facet_wrap( ~ domain, ncol = 2, nrow = 2, scales = "free" ) +
  scale_color_manual(values = domain2colors, name = "BGC domain") + 
  xlab("Abundance") +
  ylab("Residuals") +
  expand_limits(y=0) +
  theme_light() +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", face = "bold"))

5. Bibliography

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5???32. http://doi.org/10.1023/A:1010933404324

  • Delmont, T. O., Quince, C., Shaiber, A., Esen, O. C., Lee, S. T. M., Lucker, S., & Eren, A. M. (2017). Nitrogen-Fixing Populations Of Planctomycetes And Proteobacteria Are Abundant In The Surface Ocean. bioRxiv, 129791. http://doi.org/10.1101/129791

  • Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., ??? Velayoudon, D. (2015). Structure and function of the global ocean microbiome. Science, 348(6237), 1261359???1261359. http://doi.org/10.1126/science.1261359

6. Session Info

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2        bgcpred_0.0.0.9000  e1071_1.6-8        
##  [4] xgboost_0.6-4       vegan_2.4-5         lattice_0.20-35    
##  [7] permute_0.9-4       randomForest_4.6-12 dplyr_0.7.4        
## [10] purrr_0.2.3         readr_1.1.1         tidyr_0.7.2        
## [13] tibble_1.3.4        ggplot2_2.2.1       tidyverse_1.1.1    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.2  reshape2_1.4.2    haven_1.1.0      
##  [4] colorspace_1.3-2  htmltools_0.3.6   yaml_2.1.14      
##  [7] mgcv_1.8-20       rlang_0.1.2       foreign_0.8-69   
## [10] glue_1.1.1        modelr_0.1.1      readxl_1.0.0     
## [13] bindr_0.1         plyr_1.8.4        stringr_1.2.0    
## [16] munsell_0.4.3     gtable_0.2.0      cellranger_1.1.0 
## [19] rvest_0.3.2       psych_1.7.8       evaluate_0.10.1  
## [22] labeling_0.3      knitr_1.17        forcats_0.2.0    
## [25] curl_3.1          class_7.3-14      parallel_3.3.1   
## [28] broom_0.4.2       Rcpp_0.12.13      scales_0.5.0     
## [31] backports_1.1.0   jsonlite_1.5      mnormt_1.5-5     
## [34] hms_0.3           digest_0.6.12     stringi_1.1.5    
## [37] grid_3.3.1        rprojroot_1.2     tools_3.3.1      
## [40] magrittr_1.5      lazyeval_0.2.0    cluster_2.0.6    
## [43] pkgconfig_2.0.1   MASS_7.3-47       Matrix_1.2-11    
## [46] data.table_1.11.4 xml2_1.1.1        lubridate_1.6.0  
## [49] assertthat_0.2.0  rmarkdown_1.6     httr_1.3.1       
## [52] R6_2.2.2          nlme_3.1-131