1. Introduction

In the following tutorial we are going to create BGC class-specific models to predict the abundance of BGC classes in metagenomic samples. We used a machine-learning approach to create the models, which are class-specific, and were created using the abundance of each class and its corresponding domains as the response and predictor variables, respectively. Each model consists of a two-step zero-inflated process: First, the presence/absence of the BGC class is predicted using a random forest (RF) classifier; Second, the abundance is predicted with a multiple linear (ML) regression only if the class was previously predicted as present. The models are trained using a simulated metagenomic dataset of 150 samples (i.e. OMs dataset, see below).

2. Datasets

  1. OMs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences of the Ocean Microbial Reference Gene Catalogue (OM-RGC) (Sunagawa et al., 2015). Composed of 376 different species.

  2. TGs: 150 simulated metagenomes. It is based on the complete and nearly complete genome sequences corresponding to the genera found in the TARA Oceans taxonomic annotation of the Operational Taxonomic Units (OTUs) (Sunagawa et al., 2015). Composed of 331 different species.

3. Step-by-step tutorial

3.1 Load libraries and tables

First we need to load the necessary libraries and data. Bgcpred installation instructions can be found here


URL <- "https://github.com/pereiramemo/BiG-MEx/wiki/files/"

URL_train_data_domains <- paste(URL,"Marine-RM_dom_abund.tsv", sep = "")
train_data_domains <- read_tsv(file = URL_train_data_domains, col_names = TRUE)

URL_test_data_domains <- paste(URL,"Marine-TM_dom_abund.tsv", sep = "")
test_data_domains <- read_tsv(file = URL_test_data_domains, col_names = TRUE)

URL_train_data_classes <- paste(URL,"Marine-RM_class_abund.tsv", sep = "")
train_data_classes <- read_tsv(file = URL_train_data_classes, col_names = TRUE) 

URL_test_data_classes <- paste(URL,"Marine-TM_class_abund.tsv", sep = "")
test_data_classes <- read_tsv(file = URL_test_data_classes, col_names = TRUE) 

3.2 Reformat test domain table

And massage a little bit the data. First we will add a few domains not present in the test_data_domains table but present in the train_data_domains table, this will make our life easier later in the prediction and training

i <- !colnames(train_data_domains) %in% colnames(test_data_domains)
for (n in colnames(train_data_domains)[i]){
  test_data_domains <- test_data_domains %>%
    mutate_(.dots = setNames(0, n))

3.3 Training the models

For the model creation we will use the get_domains() and class_model_train() functions from bgcpred package (although the package already includes the models: see wrap_up_predict()).

We will create the models for the lantipeptide, nrps, t1pks, t2pks, t3pks and terpene BGC classes. First, we use the get_domains() function to extract the corresponding domains of each BGC class. Then, we use these domains and classes RCs to train the models.

bgc_classes <- c("nrps","t1pks","t2pks","terpene")

bgc_model <- list()

for (b in bgc_classes) {
  y <- train_data_classes %>% select(b) %>% unlist
  domains_x <- get_domains(b)
  domains_x <- domains_x[ domains_x %in% colnames(train_data_domains) ]
  x <- train_data_domains %>% select(domains_x)
  bgc_model[[b]] <- class_model_train(y = y, 
                                      x = x, 
                                      binary_method = "rf", 
                                      regression_method = "lm", 
                                      seed = 111)

3.4 BGC class abundance predictions

Now that we have created the models, we will use them to predict the BGC class abundance in the test data set. For this task, we are going to use the class_model_predict() function from bgcpred.

bgc_pred <- list()

for (b in bgc_classes) {
  domains_x <- get_domains(b)
  domains_x <- domains_x[ domains_x %in% colnames(test_data_domains) ]
  x <- test_data_domains %>% select(domains_x)
  bgc_pred[[b]] <- class_model_predict(x = x, 
                                       model_c = bgc_model[[b]]$binary_model, 
                                       model_r = bgc_model[[b]]$regression_model)

3.5 Results evaluation

Given that TGs is a simulated data set, we actually know the BGC class abundances, and we can compare those with the model predictions.

# select some nice colors
class2color <- cbind(class = bgc_classes, 
                     color = c("#ECD078","#D95B43","#C02942","#000000"))

# convert predictions list to a long data frame
pred_df <- bgc_pred %>%
  bind_cols() %>%
  gather(key = bgc, value = pred )

# extract and convert real abundances to a long data frame
ref_df <- test_data_classes %>% 
  select(bgc_classes) %>%
  gather(key = bgc, value = ref)

# join data frames
X <- bind_cols(pred_df, ref = ref_df$ref )

# make titles including correlation values
COR <- X %>% 
  group_by(bgc) %>%
  summarise(cor = cor(ref, pred) %>% round(., digits = 2))

titles <- paste(COR$bgc, "cor:", COR$cor)
names(titles) <- COR$bgc

# plot
ggplot(X, aes(x = ref, y=pred)) +
  geom_abline(intercept = 0, slope = 1, color = "gray60") +  
  geom_point(aes(color = bgc ), alpha = 0.5) +
  facet_wrap( ~ bgc, ncol = 2, nrow = 2, scales = "free", labeller = as_labeller(titles)) +
  scale_color_manual(values = class2color[,"color"], name = "BGC class") + 
  xlab("Reference abundance") +
  ylab("Predicted abundance") +
  expand_limits(y=0) +
  theme_light() +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", face = "bold"))