cardelino 0.6.0
This document gives an introduction to and overview of inferring the clone identity of cells using the cardelino package using a given clonal structure.
cardelino contains general tools for inferring the clonal or donor identity of cells from single-cell transcriptomic data, focusing on RNA-seq data. Specifically, the package features:
Here, we focus on demonstrating the use of cardelino
to probabilistically
assign cells to clones when a clonal tree is provided.
cardelino can use variant information extracted from single-cell RNA-seq reads to probabilistically assign single-cell transcriptomes to individual clones.
Briefly, cardelino is based on a Bayesian mixture model with a beta-binomial error model to account for sequencing errors as well as a gene-specific model for allelic imbalance between haplotypes and associated bias in variant detection. Bayesian inference allows the model to account for uncertainty in model parameters and cell assignments.
We assume that clones are tagged by somatic mutations, and that these mutations are known (e.g. from exome sequencing or equivalent). Given a set of known mutations, these sites can be interrogated in scRNA-seq reads to obtain evidence for the presence or absence of each mutation in each cell. As input, the model requires the count of reads supporting the alternative (mutant) allele at each mutation site, the total number of reads overlapping the mutation site (“coverage”).
Typically, coverage of somatic mutations in scRNA-seq data is very sparse (most mutation sites in a given cell have no read coverage), but the cardelino model accounts for this sparsity and aggregates information across all available mutation sites to infer clonal identity.
In many clone ID scenarios, a clonal tree is known. That is, we have been able to infer the clones present in the sampled cell population, for example using bulk or single-cell DNA-seq data, and we know which mutations are expected to be present in which clones.
To infer the clonal identity of cells when a clonal tree is provided, cardelino requires the following input data:
The configuration matrix, Config, can be provided by other tools used to infer the clonal structure of the cell population. For example, the package Canopy can be used to infer a clonal tree from DNA-seq data and the “Z” element of its output is the configuration matrix.
Here, we demonstrate the use of cardelino to assign 428 cells to clones identified with Canopy using 34 somatic mutations.
We load the package and the example clone ID dataset distributed with the package.
library(cardelino)
data(example_donor)
The clonal tree inferred by Canopy for this donor consists of four clones, including a “base” clone (“clone1”) that has no sublconal somatic mutations present.
plot_tree(tree, orient = "v")
The included dataset contains the A and D matrices, so combined with the Canopy tree object provided, we have the necessary input to probabilistically assign cells to clones.
assignments <- clone_id(A_clone, D_clone, Config = tree$Z)
## 100% converged.[1] "Converged in 3000 iterations."
names(assignments)
## [1] "theta0" "theta1" "theta0_all" "theta1_all"
## [5] "element" "logLik" "prob_all" "prob"
## [9] "prob_variant" "relax_rate" "Config_prob" "Config_all"
## [13] "relax_rate_all" "n_chain"
We can visualise the cell-clone assignment probabilities as a heatmap.
prob_heatmap(assignments$prob)
We recommend assigning a cell to the highest-probability clone if the highest
posterior probability is greater than 0.5 and leaving cells “unassigned” if they
do not reach this threshold. The assign_cells_to_clones
function conveniently
assigns cells to clones based on a threshold and returns a data.frame with the
cell-clone assignments.
df <- assign_cells_to_clones(assignments$prob)
head(df)
## cell clone prob_max
## 1 cell1 clone3 0.7226487
## 2 cell2 clone2 0.9994734
## 3 cell3 clone1 0.5049057
## 4 cell4 unassigned 0.2503391
## 5 cell5 clone2 0.9985593
## 6 cell6 clone3 0.8481360
table(df$clone)
##
## clone1 clone2 clone3 clone4 unassigned
## 48 190 25 7 158
In the genomics field, genotype data is most commonly stored in VCF
(variant call format) files. There are
many possible ways to extract the data required by cardelino
from a VCF file,
here we show just one approach using convenience functions in cardelino
:
Read in data from VCF included in the package.
vcf <- read_vcf(system.file("extdata", "cell_example.mpileup.vcf.gz",
package = "cardelino"))
input_data <- get_snp_matrices(vcf)
Read in Canopy tree results for the same individual.
canopy <- readRDS(system.file("extdata", "canopy_results.example.rds",
package = "cardelino"))
Config <- canopy$tree$Z
Be careful to ensure that the same variant IDs are used in both data sources.
rownames(Config) <- gsub("chr", "", gsub(":", "_", gsub("_.*", "",
rownames(Config))))
Now we can run the clone ID function to assign cells to clones.
assignments <- clone_id(input_data$A, input_data$D, Config)
## 100% converged.[1] "Converged in 3000 iterations."
As above, we can easily assign cells to clones using a threshold on the
posterior probabilities output by the clone_id
function.
df <- assign_cells_to_clones(assignments$prob)
table(df$clone)
##
## clone1 clone2 clone3 clone4 unassigned
## 114 56 10 5 243
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] foreach_1.4.4 cardelino_0.6.0 ggplot2_3.1.1 knitr_1.22
## [5] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.42.0 httr_1.4.0
## [3] tidyr_0.8.3 jsonlite_1.6
## [5] bit64_0.9-7 splines_3.6.1
## [7] assertthat_0.2.1 rvcheck_0.1.3
## [9] BiocManager_1.30.4 stats4_3.6.1
## [11] blob_1.1.1 BSgenome_1.50.0
## [13] GenomeInfoDbData_1.2.0 Rsamtools_1.34.1
## [15] yaml_2.2.0 progress_1.2.0
## [17] pillar_1.3.1 RSQLite_2.1.1
## [19] lattice_0.20-38 glue_1.3.1
## [21] digest_0.6.18 GenomicRanges_1.34.0
## [23] RColorBrewer_1.1-2 XVector_0.22.0
## [25] colorspace_1.4-1 htmltools_0.3.6
## [27] Matrix_1.2-17 plyr_1.8.4
## [29] XML_3.98-1.19 pkgconfig_2.0.2
## [31] pheatmap_1.0.12 biomaRt_2.38.0
## [33] bookdown_0.9 zlibbioc_1.28.0
## [35] purrr_0.3.2 tidytree_0.2.4
## [37] scales_1.0.0 BiocParallel_1.16.6
## [39] tibble_2.1.1 IRanges_2.16.0
## [41] withr_2.1.2 SummarizedExperiment_1.12.0
## [43] GenomicFeatures_1.34.8 BiocGenerics_0.28.0
## [45] lazyeval_0.2.2 survival_2.44-1.1
## [47] magrittr_1.5 crayon_1.3.4
## [49] memoise_1.1.0 evaluate_0.13
## [51] nlme_3.1-140 tools_3.6.1
## [53] prettyunits_1.0.2 doMC_1.3.5
## [55] hms_0.4.2 matrixStats_0.54.0
## [57] stringr_1.4.0 S4Vectors_0.20.1
## [59] ggtree_1.14.6 munsell_0.5.0
## [61] DelayedArray_0.8.0 AnnotationDbi_1.44.0
## [63] snpStats_1.32.0 Biostrings_2.50.2
## [65] compiler_3.6.1 GenomeInfoDb_1.18.2
## [67] rlang_0.3.4 grid_3.6.1
## [69] RCurl_1.95-4.12 iterators_1.0.10
## [71] VariantAnnotation_1.28.13 labeling_0.3
## [73] bitops_1.0-6 rmarkdown_1.12
## [75] codetools_0.2-16 gtable_0.3.0
## [77] DBI_1.0.0 R6_2.4.0
## [79] GenomicAlignments_1.18.1 dplyr_0.8.0.1
## [81] rtracklayer_1.42.2 bit_1.1-14
## [83] treeio_1.6.2 ape_5.3
## [85] stringi_1.4.3 parallel_3.6.1
## [87] Rcpp_1.0.1 tidyselect_0.2.5
## [89] xfun_0.6