Contents

1 Introduction

This document gives an introduction to and overview of inferring the clone identity of cells using the cardelino package using a given clonal structure.

cardelino contains general tools for inferring the clonal or donor identity of cells from single-cell transcriptomic data, focusing on RNA-seq data. Specifically, the package features:

Here, we focus on demonstrating the use of cardelino to probabilistically assign cells to clones when a clonal tree is provided.

2 About the model

cardelino can use variant information extracted from single-cell RNA-seq reads to probabilistically assign single-cell transcriptomes to individual clones.

Briefly, cardelino is based on a Bayesian mixture model with a beta-binomial error model to account for sequencing errors as well as a gene-specific model for allelic imbalance between haplotypes and associated bias in variant detection. Bayesian inference allows the model to account for uncertainty in model parameters and cell assignments.

We assume that clones are tagged by somatic mutations, and that these mutations are known (e.g. from exome sequencing or equivalent). Given a set of known mutations, these sites can be interrogated in scRNA-seq reads to obtain evidence for the presence or absence of each mutation in each cell. As input, the model requires the count of reads supporting the alternative (mutant) allele at each mutation site, the total number of reads overlapping the mutation site (“coverage”).

Typically, coverage of somatic mutations in scRNA-seq data is very sparse (most mutation sites in a given cell have no read coverage), but the cardelino model accounts for this sparsity and aggregates information across all available mutation sites to infer clonal identity.

3 Clone ID with a clonal tree provided

In many clone ID scenarios, a clonal tree is known. That is, we have been able to infer the clones present in the sampled cell population, for example using bulk or single-cell DNA-seq data, and we know which mutations are expected to be present in which clones.

To infer the clonal identity of cells when a clonal tree is provided, cardelino requires the following input data:

The configuration matrix, Config, can be provided by other tools used to infer the clonal structure of the cell population. For example, the package Canopy can be used to infer a clonal tree from DNA-seq data and the “Z” element of its output is the configuration matrix.

Here, we demonstrate the use of cardelino to assign 428 cells to clones identified with Canopy using 34 somatic mutations.

We load the package and the example clone ID dataset distributed with the package.

library(cardelino)
data(example_donor)

The clonal tree inferred by Canopy for this donor consists of four clones, including a “base” clone (“clone1”) that has no sublconal somatic mutations present.

plot_tree(tree, orient = "v")

The included dataset contains the A and D matrices, so combined with the Canopy tree object provided, we have the necessary input to probabilistically assign cells to clones.

assignments <- clone_id(A_clone, D_clone, Config = tree$Z)
## 100% converged.[1] "Converged in 3000 iterations."
names(assignments)
##  [1] "theta0"         "theta1"         "theta0_all"     "theta1_all"    
##  [5] "element"        "logLik"         "prob_all"       "prob"          
##  [9] "prob_variant"   "relax_rate"     "Config_prob"    "Config_all"    
## [13] "relax_rate_all" "n_chain"

We can visualise the cell-clone assignment probabilities as a heatmap.

prob_heatmap(assignments$prob)

We recommend assigning a cell to the highest-probability clone if the highest posterior probability is greater than 0.5 and leaving cells “unassigned” if they do not reach this threshold. The assign_cells_to_clones function conveniently assigns cells to clones based on a threshold and returns a data.frame with the cell-clone assignments.

df <- assign_cells_to_clones(assignments$prob)
head(df)
##    cell      clone  prob_max
## 1 cell1     clone3 0.7226487
## 2 cell2     clone2 0.9994734
## 3 cell3     clone1 0.5049057
## 4 cell4 unassigned 0.2503391
## 5 cell5     clone2 0.9985593
## 6 cell6     clone3 0.8481360
table(df$clone)
## 
##     clone1     clone2     clone3     clone4 unassigned 
##         48        190         25          7        158

4 Reading in data from VCF files

In the genomics field, genotype data is most commonly stored in VCF (variant call format) files. There are many possible ways to extract the data required by cardelino from a VCF file, here we show just one approach using convenience functions in cardelino:

Read in data from VCF included in the package.

vcf <- read_vcf(system.file("extdata", "cell_example.mpileup.vcf.gz", 
                   package = "cardelino"))
input_data <- get_snp_matrices(vcf)

Read in Canopy tree results for the same individual.

canopy <- readRDS(system.file("extdata", "canopy_results.example.rds", 
                   package = "cardelino"))
Config <- canopy$tree$Z

Be careful to ensure that the same variant IDs are used in both data sources.

rownames(Config) <- gsub("chr", "", gsub(":", "_", gsub("_.*", "", 
                                                        rownames(Config))))

Now we can run the clone ID function to assign cells to clones.

assignments <- clone_id(input_data$A, input_data$D, Config)
## 100% converged.[1] "Converged in 3000 iterations."

As above, we can easily assign cells to clones using a threshold on the posterior probabilities output by the clone_id function.

df <- assign_cells_to_clones(assignments$prob)
table(df$clone)
## 
##     clone1     clone2     clone3     clone4 unassigned 
##        114         56         10          5        243

Session information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] foreach_1.4.4    cardelino_0.6.0  ggplot2_3.1.1    knitr_1.22      
## [5] BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.42.0              httr_1.4.0                 
##  [3] tidyr_0.8.3                 jsonlite_1.6               
##  [5] bit64_0.9-7                 splines_3.6.1              
##  [7] assertthat_0.2.1            rvcheck_0.1.3              
##  [9] BiocManager_1.30.4          stats4_3.6.1               
## [11] blob_1.1.1                  BSgenome_1.50.0            
## [13] GenomeInfoDbData_1.2.0      Rsamtools_1.34.1           
## [15] yaml_2.2.0                  progress_1.2.0             
## [17] pillar_1.3.1                RSQLite_2.1.1              
## [19] lattice_0.20-38             glue_1.3.1                 
## [21] digest_0.6.18               GenomicRanges_1.34.0       
## [23] RColorBrewer_1.1-2          XVector_0.22.0             
## [25] colorspace_1.4-1            htmltools_0.3.6            
## [27] Matrix_1.2-17               plyr_1.8.4                 
## [29] XML_3.98-1.19               pkgconfig_2.0.2            
## [31] pheatmap_1.0.12             biomaRt_2.38.0             
## [33] bookdown_0.9                zlibbioc_1.28.0            
## [35] purrr_0.3.2                 tidytree_0.2.4             
## [37] scales_1.0.0                BiocParallel_1.16.6        
## [39] tibble_2.1.1                IRanges_2.16.0             
## [41] withr_2.1.2                 SummarizedExperiment_1.12.0
## [43] GenomicFeatures_1.34.8      BiocGenerics_0.28.0        
## [45] lazyeval_0.2.2              survival_2.44-1.1          
## [47] magrittr_1.5                crayon_1.3.4               
## [49] memoise_1.1.0               evaluate_0.13              
## [51] nlme_3.1-140                tools_3.6.1                
## [53] prettyunits_1.0.2           doMC_1.3.5                 
## [55] hms_0.4.2                   matrixStats_0.54.0         
## [57] stringr_1.4.0               S4Vectors_0.20.1           
## [59] ggtree_1.14.6               munsell_0.5.0              
## [61] DelayedArray_0.8.0          AnnotationDbi_1.44.0       
## [63] snpStats_1.32.0             Biostrings_2.50.2          
## [65] compiler_3.6.1              GenomeInfoDb_1.18.2        
## [67] rlang_0.3.4                 grid_3.6.1                 
## [69] RCurl_1.95-4.12             iterators_1.0.10           
## [71] VariantAnnotation_1.28.13   labeling_0.3               
## [73] bitops_1.0-6                rmarkdown_1.12             
## [75] codetools_0.2-16            gtable_0.3.0               
## [77] DBI_1.0.0                   R6_2.4.0                   
## [79] GenomicAlignments_1.18.1    dplyr_0.8.0.1              
## [81] rtracklayer_1.42.2          bit_1.1-14                 
## [83] treeio_1.6.2                ape_5.3                    
## [85] stringi_1.4.3               parallel_3.6.1             
## [87] Rcpp_1.0.1                  tidyselect_0.2.5           
## [89] xfun_0.6