Loading [MathJax]/jax/output/HTML-CSS/jax.js

1. Overview of analysis pipeline

The possible analysis scenarios are listed in Table 1. Figure 1 gives an outline for the relationship between the software: CODEX and CODEX2 perform read depth normalization for total copy number profiling; read depth normalized by CODEX/CODEX2 is received by iCNV, which combines it with allele-specific read counts and microarray data to detect CNVs; FALCON and FALCON-X perform ASCN analysis; and Canopy receives input from FALCON/FALCON-X to perform tumor phylogeny reconstruction.

Figure 1. A flowchart outlining the procedures for profiling CNV, ASCN, and reconstructing tumor phylogeny. CNVs with common and rare population frequencies can be profiled by CODEX and CODEX2, with and without negative control samples. iCNV integrates sequencing and microarray data for CNV detection. ASCNs can be profiled by FALCON and FALCON-X using allelic read counts at germline heterozygous loci. Canopy infers tumor phylogeny using somatic SNVs and ASCNs.

Table 1. Analysis scenarios and pipeline design. The last column shows the sequence of software that should be used for each analysis scenario. * By normal we mean samples that are not derived from tumor tissue, which are not expected to carry chromosome-level copy number changes.

2. Questions & issues

If you have any questions or problems when using MARATHON, please feel free to open a new issue here. You can also email the maintainers of the corresponding packages – the contact information is below.

3. Installation

3.1 Installation Option 1: Docker image - good for ease of installation

A docker image is available here. This image is an Rstudio GUI built on rocker/tidyverse with MARATHON as well as all of its dependent packages and datasets pre-installed. Note that this can take a while to download the human reference genome as well as the toy sequencing dataset. Instructions for using Docker can be found here.

docker pull lzeppelini/marathon

3.2 Installation Option 2: Install to R/RStudio - good for performance

Install all packages in the latest version of R.

install.packages(c("falcon", "falconx", "devtools"))
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("WES.1KG.WUGSC", "GenomeInfoDbData", "GenomeInfoDb", "VariantAnnotation"))
devtools::install_github(c("yuchaojiang/CODEX/package", "yuchaojiang/CODEX2/package", "yuchaojiang/Canopy/package", "zhouzilu/iCNV", "yuchaojiang/MARATHON/package"))

4. Running MARATHON

4.1. Total copy number analysis of normal

4.1.1. CODEX for data normalization and total copy number analysis in normal

CODEX (Jiang et al. 2015) adopts a Poisson latent factor model for normalization to remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts.

Get bam file directories, sample names from .txt file, and exonic positions from .bed file.

library(MARATHON, quietly = TRUE)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "CODEX_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr

Get read depth coverage

coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength

Get GC content and mappability

genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))  

Quality control

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

Normalization with normal controls

Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

normObj <- normalize(Y_qc, gc_qc, K = 1:6, N=N)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K

Choose the number of latent factors

choiceofK(AIC, BIC, RSS, K, filename = paste(projectname, "_", chr, 
                                             "_choiceofK", ".pdf", sep = ""))

Poisson-likelihood recursive segmentation.

optK = K[which.max(BIC)]
chr=22
finalcall <- segmentCBS(Y_qc, Yhat, optK = optK, K = K, sampname_qc,
                     ref_qc, chr, lmax = 200, mode = "integer")
head(finalcall)
##      sample_name chr cnv    st_bp    ed_bp length_kb st_exon ed_exon
## cnv1     NA18990  22 dup 22312814 22326373    13.560      62      74
## cnv2     NA19210  22 del 22329961 22599725   269.765      77      78
##      raw_cov norm_cov copy_no lratio   mBIC
## cnv1    1382     1007       3 56.830 46.173
## cnv2     140      217       1 11.466  2.746

4.1.2. CODEX2 for calling improvement

CODEX2 (Jiang et al. 2018) builds on CODEX with a significant improvement of sensitivity for both rare and common variants.

4.1.2.1. Analysis overview

The figure below illustrates the two experimental designs for which CODEX2 can be applied: (A) case-control design with a group of negative control samples, where the goal is to detect CNVs disproportionately present in the cases versus the controls; and (B) detection of all CNVs present in all samples design, such as in the Exome Aggregation Consortium. The key innovation in CODEX2 is the usage of negative control genome regions in a genome-wide latent factor model for sample- and position-specific background correction, and the utilization of negative control samples, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor common CNVs, but that are still allowed to harbor rare CNVs, and can be constructed from existing studies or learned from data.

4.1.2.2. Pre-processing

This step is to get directories of .bam files, read in exon target positions from .bed files, and get sample names. The direct input of CODEX2 include: bamdir, which is a vector indicating the directories of all .bam files; sampname, which is a column vector with row entries of sample names; bedFile, which indicates the directory of the .bed file (WES bait file, no header, sorted by start and end positions); and chr, which specifies the chromosome. CODEX2 processes the entire genome chromosome by chromosome; make sure the chromosome formats are consistent between the .bed and the .bam files.

library(MARATHON)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "CODEX_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr

Getting raw read depth, GC content and mappability: read depth matrix, as well as read lengths across all samples, will be returned.

coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength

genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))

Quality control: take a sample-wise and exon-wise quality control procedure on the depth of coverage matrix.

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

Running CODEX2: For demonstration purpose, we in silico spiked in CNVs spanning exon 1580 - 1620 with a population frequency 40%. There are altogether 90 samples, 36 of which have the heterozygous deletion. The toy dataset is stored as part of the CODEX2 R-package.

4.1.2.3. Running CODEX2 with negative control regions

We can empirically identify common CNV regions by a first-pass CODEX run: For exons residing in common CNV regions, the s.d. of normalized z-scores across all samples will be large.

# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
Y_qc <- Y_qc_demo
ref_qc <- ref_qc_demo
gc_qc <- ref_qc$gc

# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Empirically identify common CNV regions from the data.
normObj=normalize_null(Y_qc = Y_qc, gc_qc = gc_qc, K = 1:5, N = N)
z.codex = log(Y_qc/normObj$Yhat[[which.max(normObj$BIC)]])
cnv_index1 = which(apply(z.codex,1,sd)>=0.25) 

# This can also be provided by the user as known,
# e.g., from existing database (DGV or dbVar) or knowledge (tumor supressors or oncogenes).
cnv_index2 <- 1580:1620
normObj <- normalize_codex2_nr(Y_qc = Y_qc, gc_qc = gc_qc, 
                            K = 1:6, cnv_index = cnv_index2,
                            N = N)
Yhat.nr <- normObj$Yhat; fGC.hat.nr <- normObj$fGC.hat;
beta.hat.nr <- normObj$beta.hat; g.hat.nr <- normObj$g.hat; h.hat.nr <- normObj$h.hat
AIC.nr <- normObj$AIC; BIC.nr <- normObj$BIC; RSS.nr <- normObj$RSS

Running Poisson-likelihood recursive segmentation by CODEX2

finalcall.codex2.nr <- segmentCBS(Y_qc, Yhat.nr, optK = which.max(BIC.nr),
                                  K = 1:6, sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
                                  ref_qc = IRanges(start=1:nrow(Y_qc)*100, end=1:nrow( Y_qc )*100+50),
                                  chr = 18, lmax = 200, mode = "integer")

4.1.3. iCNV for calling improvement

iCNV (Zhou et al. 2018) uses the normalized coverage from CODEX/CODEX2, and makes use of sequenced reads at inherited single nucleotide polymorphism (SNP) positions for CNV detection. These heterozygous loci are shown to be valuable in improving detection and genotyping accuracy. iCNV takes as input normalized coverage by CODEX/CODEX2, allelic frequency at inherited SNP positions from sequencing, and log-ratio and B-allele frequency from SNP array. Output is CNV calls with quality scores.

4.1.3.1. Bam file normalization using CODEX

Researchers working on WGS need to generate your own BED file for CODEX normalization. We made a function bed_generator to help with this. For WES and targeted sequencing, the BED file is from pull-down design. Follow code below to get normalized NGS PLR. For demonstration, we utilize toy data from the 1000 Genomes Project.

library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
library(iCNV)
# Pre-process
dir <- system.file("extdata", package = "WES.1KG.WUGSC") # 'PATH/TO/BAM'
bamFile <- list.files(dir, pattern = '*.bam$')
bamdir <- file.path(dir, bamFile)
sampname <- as.matrix(read.table(file.path(dir, "sampname")))
bedFile <- file.path(dir, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- CODEX::getbambed(bamdir = bamdir, bedFile = bedFile, 
                   sampname = sampname, projectname = "icnv.demo.", chr)
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
# Get coverage
coverageObj <- CODEX::getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength
# Get GC content and mappability
gc <- CODEX::getgc(chr, ref)
mapp <- CODEX::getmapp(chr, ref)
# Quality control
qcObj <- CODEX::qc(Y, sampname, chr, ref, mapp, gc, cov_thresh = c(20, 4000), 
            length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
# CODEX normalization
normObj <- CODEX::normalize(Y_qc, gc_qc, K = 1:9)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
choiceofK(AIC, BIC, RSS, K, filename = paste0(projectname, "_", chr, "_choiceofK.pdf"))
save(qcObj,normObj,sampname,file=paste0(projectname,"_",chr,".rda") )

CODEX reports all three statistical metrics (AIC, BIC, variance reduction) and uses BIC as the default method to determine the number of Poisson factors. Since false positives can be screened out through a closer examination of the post-segmentation data, whereas CNV signals removed in the normalization step cannot be recovered, CODEX opts for a more conservative normalization that, when in doubt, uses a smaller value of K.

load(paste0(projectname,"_",chr,".rda"))
optK = K[which.max(BIC)] # by default or customize given curves
# Log-transformed z-score based on CODEX's normalization results
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
ref_qc=qcObj$ref_qc # IRanges object for exon target
sampname_qc=qcObj$sampname_qc # sample names
Y_norm=normObj$Yhat[[optK]] # normalized read count under null (no CNV)
plr=log(pmax(Y_qc,0.0001)/pmax(Y_norm,0.0001)) # log transformed z-scores
ngs_plr=lapply(seq_len(ncol(plr)), function(i) plr[,i])
ngs_plr.pos=lapply(seq_len(ncol(plr)),function(x){return(cbind(start(ref_qc),end(ref_qc)))})
save(sampname_qc,ngs_plr,ngs_plr.pos,file=paste0(projectname,'plrObj_',chr,'_',optK,'.rda'))

4.1.3.2. B-allele frequency calling of germline variant

For sequencing data without SNV callsets in VCF format, we manually call SNVs from quality controlled BAM files by the mpileup module in samtools, and calculate B allele frequency(BAF) on heterozygous loci by dividing DV (Number of high-quality non-reference bases, FORMAT) from DP (Number of high-quality bases, FORMAT). Example code are:

# Prerequest: samtools, bcftools and reference fasta file
cd PATH/TO/BAM
for i in *bam; do PATH/TO/SAMTOOLS/samtools mpileup -ugI -t DP -t DV -f PATH/TO/REF/human_hg37.fasta $i | ~/bcftools-1.3.1/bcftools call -cv -O z -o PATH/TO/OUTPUT/$i.vcf.gz; done

We could further extract variants BAF info from vcf file by function bambaf_from_vcf.

library(MARATHON)
projectname <- "icnv.demo."
dir <- system.file("extdata", package="iCNV")
bambaf_from_vcf(dir,'bam_vcf.list',chr=22,projectname) 
# ignore chr argement if you want to convert all chromosome 1-22
load(paste0(projectname,'bambaf_22.rda'))
str(ngs_baf)
str(ngs_baf.pos)

4.1.3.3. SNP array LRR normalization and BAF

Due to the fact that signal intensity files vary from platform to platform, we set a standard signal intensity file format for each individual:

Name,Chr,POS,sample_1.Log R Ratio,sample_1.B Allele Freq
rs1,22,15462739,-0.096390619874,0.0443964861333
rs2,22,15520991,-0.154103130102,0.963218688965
rs3,22,15780940,-0.110297381878,0.0457459762692
rs4,22,15863717,-0.21270519495,0.957377910614
rs5,22,16532045,-0.0330782271922,0.0300635993481

The first row is rsid (Name), followed by chromosome (Chr), position (POS), log R ratio (sample1.Log R Ratio), and BAF (sample1.B Allele Freq). Function get_array_input can be used to convert the data input to iCNV’s format. For example, in the demo dataset:

# iCNV_array_input
dir <- system.file("extdata", package="iCNV")#PATH/TO/FOLDER where you stored the array data
chr <- 22
projectname <- "icnv.demo."
pattern <- paste0('*.csv.arrayicnv$')
get_array_input(dir,pattern,chr,projectname)
load(paste0(projectname,'array_lrrbaf_',chr,'.rda'))
str(snp_lrr)
str(snp_lrr.pos)
str(snp_baf)
str(snp_baf.pos)

For some of the SNP array LRR data, we need to apply singular value decomposition (SVD) to remove high-dimensional noise and preserve low-dimension signal. Noisy data has the feature of local strip across samples (use the plot_intensity function). Conventional way for identifying the elbow point to determine the number of PCs can be applied here. More details can be found here.

4.1.3.4. Mutiple platform CNV detection using iCNV

At this step, we should alreday have PLR and variants BAF from sequencing, normalized LRR and BAF from SNP array. Please make sure all the input are of type list. This is mainly to accomadate the fact that the lengths for ngs_baf and ngs_baf.pos are sample specific. In addition, CNV detection should perform per chromosome basis.

str(ngs_plr) # List of n vector, each one is the PLR for an exon
str(ngs_plr.pos) # List of n matrix (p x 2), each one is the start and end location for an exon
str(ngs_baf) # List of n vector, each one is the variants BAF from .bam
str(ngs_baf.pos) # List of n vector, each one is the variants BAF position
str(snp_lrr) # List of n vector, each one is the normalized LRR for a SNP
str(snp_lrr.pos) # List of n vector, each one is a SNP position
str(snp_baf) # List of n vector, each one is the BAF for a SNP
str(snp_baf.pos) # List of n vector, each one is the SNP BAF position
projectname <- 'icnv.demo.'
icnv_res0 <- iCNV_detection(ngs_plr,snp_lrr,
                         ngs_baf,snp_baf,
                         ngs_plr.pos,snp_lrr.pos,
                         ngs_baf.pos,snp_baf.pos,
                         projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 1)
icnv.output <- output_list(icnv_res0,sampname_qc,CN=1,min_size=0,min_gap=0) 
# You can alter min size of the CNV you want to keep and merge CNV with gap 
# smaller than min gap.
head(icnv.output)

The results plot looks are follow. Each row is a sample and each column is a hidden state. The color indicates hidden state Z-score (large positive number prefers amplification, low negative number prefers deletion). Black dots represent amplification detected, while white dots show deletion detected.

Additional functionality of iCNV see iCNV notebook.

4.1.3.5. Single platform CNV detection

At this step, we should already have PLR and variants BAF from sequencing OR normalized LRR and BAF from SNP array. Please make sure all the input are in list form.

NGS only CNV detection using iCNV. As we showed in the paper, we highly recommend using iCNV for WGS CNV detection.

str(ngs_plr) # List of n vector, each one is the PLR for an exon
str(ngs_plr.pos) # List of n matrix (p x 2), each one is the start and end location for an exon
str(ngs_baf) # List of n vector, each one is the variants BAF from .bam
str(ngs_baf.pos) # List of n vector, each one is the variants BAF position
projectname <- 'icnv.demo.ngs.'
icnv_res0_ngs <- iCNV_detection(ngs_plr=ngs_plr, ngs_baf = ngs_baf, 
                             ngs_plr.pos = ngs_plr.pos,ngs_baf.pos = ngs_baf.pos,
                             projname=projectname,CN=0,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_ngs,sampname_qc,CN=0)
head(icnv.output)

SNP array only CNV detection using iCNV

str(snp_lrr) # List of n vector, each one is the normalized LRR for a SNP
str(snp_lrr.pos) # List of n vector, each one is a SNP position
str(snp_baf) # List of n vector, each one is the BAF for a SNP
str(snp_baf.pos) # List of n vector, each one is the SNP BAF position
projectname <- 'icnv.demo.snp'
icnv_res0_snp <- iCNV_detection(snp_lrr=snp_lrr, snp_baf = snp_baf, 
                             snp_lrr.pos = snp_lrr.pos,snp_baf.pos = snp_baf.pos,
                             projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_snp,sampname_qc,CN=1)
head(icnv.output)

4.2. Total copy number analysis of tumor

CODEX2 (Jiang et al. 2018) can be applied to two scenarios: the case control scenario where the goal is to detect CNVs that are enriched in the case samples; and the scenario where control samples are not available and the goal is simply to profile all CNVs. CODEX and CODEX2 take as input assembled BAM files as well as bed files specifying targets for WES and targeted sequencing and output normalized read counts and tab-delimited text files with copy number calls.

4.2.1. Analysis overview

The figure below illustrates the two experimental designs for which CODEX2 can be applied: (i) case-control design with a group of negative control samples, where the goal is to detect CNVs disproportionately present in the cases versus the controls; and (ii) detection of all CNVs present in all samples design, such as in the Exome Aggregation Consortium. The key innovation in CODEX2 is the usage of negative control genome regions in a genome-wide latent factor model for sample- and position-specific background correction, and the utilization of negative control samples, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor common CNVs, but that are still allowed to harbor rare CNVs, and can be constructed from existing studies or learned from data.

4.2.2. Pre-processing

This step is to get directories of .bam files, read in exon target positions from .bed files, and get sample names. The direct input of CODEX2 include: bamdir, which is a vector indicating the directories of all .bam files; sampname, which is a column vector with row entries of sample names; bedFile, which indicates the directory of the .bed file (WES bait file, no header, sorted by start and end positions); and chr, which specifies the chromosome. CODEX2 processes the entire genome chromosome by chromosome; make sure the chromosome formats are consistent between the .bed and the .bam files.

library(MARATHON)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile, 
                       sampname = sampname, projectname = "CODEX_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr

Getting raw read depth, GC content and mappability: read depth matrix, as well as read lengths across all samples, will be returned.

coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength

genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))

Quality control: take a sample-wise and exon-wise quality control procedure on the depth of coverage matrix.

qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
            length_thresh = c(20, 2000), mapp_thresh = 0.9,
            gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)

Running CODEX2: For demonstration purpose, we in silico spiked in CNVs spanning exon 1580 - 1620 with a population frequency 40%. There are altogether 90 samples, 36 of which have the heterozygous deletion. The toy dataset is stored as part of the CODEX2 R-package.

4.2.3. Running CODEX2 with negative control samples

Y_qc and gc_qc can be obtained from the sequencing bam files using the code in the previous section. For the case-control scenario, the normal sample index is known (samples without spike-in signals).

# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
Y_qc <- Y_qc_demo
ref_qc <- ref_qc_demo
norm_index <- norm_index_demo # indices for normal samples
gc_qc <- ref_qc$gc

# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

normObj=normalize_codex2_ns(Y_qc = Y_qc, gc_qc = gc_qc, 
                            K = 1:6, norm_index = norm_index, N = N)
Yhat.ns=normObj$Yhat; fGC.hat.ns=normObj$fGC.hat;
beta.hat.ns=normObj$beta.hat; g.hat.ns=normObj$g.hat; h.hat.ns=normObj$h.hat
AIC.ns=normObj$AIC; BIC.ns=normObj$BIC; RSS.ns=normObj$RSS

Choose the number of latent Poisson factors. Use BIC as the model selection metric by default.

choiceofK(AIC.ns, BIC.ns, RSS.ns, K = 1:6 , filename = "codex2_ns_choiceofK.pdf")

4.2.4. Running CODEX2 with negative control regions

Alternatively, we can empirically identify common CNV regions by a first-pass CODEX run: For exons residing in common CNV regions, the s.d. of normalized z-scores across all samples will be large.

# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
Y_qc <- Y_qc_demo
ref_qc <- ref_qc_demo
gc_qc <- ref_qc$gc

# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

# Empirically identify common CNV regions from the data.
normObj=normalize_null(Y_qc = Y_qc, gc_qc = gc_qc, K = 1:5, N = N)
z.codex = log(Y_qc/normObj$Yhat[[which.max(normObj$BIC)]])
cnv_index1 = which(apply(z.codex,1,sd)>=0.25) 

# This can also be provided by the user as known,
# e.g., from existing database (DGV or dbVar) or knowledge (tumor supressors or oncogenes).
cnv_index2 <- 1580:1620
normObj <- normalize_codex2_nr(Y_qc = Y_qc, gc_qc = gc_qc, 
                               K = 1:6, cnv_index = cnv_index2,
                               N = N)
Yhat.nr <- normObj$Yhat; fGC.hat.nr <- normObj$fGC.hat;
beta.hat.nr <- normObj$beta.hat; g.hat.nr <- normObj$g.hat; h.hat.nr <- normObj$h.hat
AIC.nr <- normObj$AIC; BIC.nr <- normObj$BIC; RSS.nr <- normObj$RSS

4.2.5. Running Poisson-likelihood recursive segmentation by CODEX2

finalcall.codex2.ns <- segmentCBS(Y_qc, Yhat.ns, optK = which.max(BIC.ns),
                               K = 1:6, sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
                               ref_qc = IRanges(start=1:nrow(Y_qc)*100, end=1:nrow(Y_qc)*100+50),
                               chr = 18, lmax = 200, mode = "integer")
# For CODEX2 with negative control regions, simply change 'ns' to 'nr'.
nrow(finalcall.codex2.ns[finalcall.codex2.ns[,'st_exon']=='1580',])
## [1] 36
nrow(finalcall.codex2.ns) # identified 4 additional CNVs (false positives)
## [1] 40
# this can be easily filtered out by a QC on the CNV lengths in kb or number of exons
head(finalcall.codex2.ns[finalcall.codex2.ns[,'st_exon']=='1580',])
##      sample_name chr cnv  st_bp  ed_bp length_kb st_exon ed_exon raw_cov
## cnv1     sample3  18 del 158000 162050     4.051    1580    1620    2063
## cnv2     sample4  18 del 158000 162050     4.051    1580    1620    1583
## cnv3     sample7  18 del 158000 162050     4.051    1580    1620    2310
## cnv4     sample8  18 del 158000 162050     4.051    1580    1620    2086
## cnv5    sample10  18 del 158000 162050     4.051    1580    1620    2271
## cnv6    sample15  18 del 158000 162050     4.051    1580    1620    1264
##      norm_cov copy_no  lratio    mBIC
## cnv1     4055       1 597.641 575.765
## cnv2     3106       1 455.827 433.951
## cnv3     4476       1 636.946 615.070
## cnv4     4072       1 590.199 568.323
## cnv5     4511       1 681.476 659.600
## cnv6     2502       1 374.925 353.049

4.2.6. Running segmentation by Hidden Markov Model

Also available is a Hidden Markov model for segmentation of CNV events, using normalized read depth from whole exome sequencing. This algorithm incorporates distance between exons as information.

Y_qc <- qcObjDemo$Y_qc
Yhat <- normObjDemo$Yhat
BIC <- normObjDemo$BIC
K <- normObjDemo$K
sampname_qc <- qcObjDemo$sampname_qc
ref_qc <- qcObjDemo$ref_qc
chr <- bambedObjDemo$chr

finalcall <- segmentHMM(Y_qc, Yhat, optK = K[which.max(BIC)], 
  K = K, sampname_qc, ref_qc, chr, mode = "integer")

finalcall
##      sample_name chr cnv    st_bp    ed_bp length_kb st_exon ed_exon
## cnv1     NA18990  22 dup 22312814 22326373     13.56      60      72
##      raw_cov norm_cov copy_no
## cnv1    1382 1005.096       3

4.2.7. Visualization of copy number by IGV

Finally, it is very helpful to visualize the results in the Integrative Genomics Viewer (IGV). Takes as input the output of segment (or segmentHMM) from CODEX/2 and converts to .seg format. This is a preferred format for Integrative Genomics Viewer (IGV). This function takes a single data frame as input, so combine CNV calls from all chromosomes beforehand if necessary. This function writes to file, so specify the outPath is desired.

Y_qc <- qcObjDemo$Y_qc
Yhat <- normObjDemo$Yhat
BIC <- normObjDemo$BIC
K <- normObjDemo$K
sampname_qc <- qcObjDemo$sampname_qc
ref_qc <- qcObjDemo$ref_qc
chr <- bambedObjDemo$chr

#Add simulated CNV
Y_qc[1:10,10] = 1.5*Y_qc[1:10,10]   # add a dup
Y_qc[50:60,10] = .5*Y_qc[50:60,10]  # add a del
Y_qc[1:10,15] = 1.5*Y_qc[1:10,15]   # add a dup
Y_qc[50:60,15] = .5*Y_qc[50:60,15]  # add a del
Y_qc[25:35,15] = .5*Y_qc[25:35,15]  # add a del

finalcall <- segment(Y_qc, Yhat, optK = K[which.max(BIC)], 
  K = K, sampname_qc, ref_qc, chr, lmax = 200, mode = "fraction")

#If necessary, combine (rbind) individual chromosome data frames 
#  to single data frame

convertForIGV(finalcall, outPath = NULL, filterFraction = TRUE )

4.3. Tumor allele-specific copy number by WGS

For allele-specific copy number estimation in a matched tumor-normal setting, FALCON (Chen et al. 2014) is based on a change-point model on a process of a mixture of two bivariate Binomial distributions. FALCON takes as input allelic read counts at germline heterozygous loci and outputs ASCN estimates with genome segmentations.

4.3.1. Calling heterozygous loci

# 1. Index the genome template
bwa index ~/structure/hg19/ucsc.hg19.fasta

# 2. Align reads in .fastq file to the template (sample sequenced at two different lanes)
bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248897_UCF_1014_NoIndex_L008_R1_001.fastq _EGAR00001248897_UCF_1014_NoIndex_L008_R2_001.fastq > _EGAR00001248897_UCF_1014_NoIndex_L008.sam

bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248899_UCF1014_NoIndex_L002_R1_001.fastq _EGAR00001248899_UCF1014_NoIndex_L002_R2_001.fastq > _EGAR00001248899_UCF1014_NoIndex_L002.sam

# 3. Convert .sam to .bam and sort
samtools view -bS _EGAR00001248897_UCF_1014_NoIndex_L008.sam > _EGAR00001248897_UCF_1014_NoIndex_L008.bam
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate 

samtools view -bS _EGAR00001248899_UCF1014_NoIndex_L002.sam > _EGAR00001248899_UCF1014_NoIndex_L002.bam
line=_EGAR00001248899_UCF1014_NoIndex_L002.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate 

# 4. Add read group
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
#line=_EGAR00001248899_UCF1014_NoIndex_L002.bam

java -jar AddOrReplaceReadGroups.jar INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGS_UCF1014 RGPL=ILLUMINA RGPU=machine RGSM=UCF1014
samtools index $line.sorted.rg.bam

# 5. Merge two different lanes (sample sequenced at two different lanes)
samtools merge UCF1014.merge.bam _EGAR00001248899_UCF1014_NoIndex_L002.bam.sorted.rg.bam _EGAR00001248897_UCF_1014_NoIndex_L008.bam.sorted.rg.bam
java -jar SortSam.jar INPUT=UCF1014.merge.bam OUTPUT=UCF1014.merge.sorted.bam SORT_ORDER=coordinate 
samtools index UCF1014.merge.sorted.bam

# 6. Dedup
java -jar MarkDuplicates.jar INPUT=UCF1014.merge.sorted.bam OUTPUT=UCF1014.merge.sorted.dedup.bam  METRICS_FILE=UCF1014.merge.sorted.dedup.metrics.txt PROGRAM_RECORD_ID= MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar BuildBamIndex.jar INPUT=UCF1014.merge.sorted.dedup.bam

# 7. Realign
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.target_intervals.list 

java -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -targetIntervals UCF1014.merge.sorted.dedup.target_intervals.list -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.bam

# 8. Recalibrate
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.recal_data.table

java -jar GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -BQSR UCF1014.merge.sorted.dedup.realigned.recal_data.table -o UCF1014.merge.sorted.dedup.realigned.recal.bam
samtools index UCF1014.merge.sorted.dedup.realigned.recal.bam

# 9. GATK HaplotypeCaller
java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I UCF1014.merge.sorted.dedup.realigned.recal.bam -o UCF1014.merge.sorted.dedup.realigned.recal.raw.snps.indels.g.vcf

readVCFforFalcon is a function which generates falcon input from a VCF file. This function takes as input a the path to the VCF file. First 9 column names after the header should be c(“#CHROM”, “POS”, “ID”, “REF”, “ALT”, “QUAL”, “FILTER”, “INFO”, “FORMAT”). Following column names should be sample names. see .

The function outputs a data frame where each row contains location for each variant with read counts and allele info. It is comparable to relapse.demo from FALCON for ease of use with the QC procedures below.

vcfFile = system.file("extdata", "sample_w_header.vcf", package="MARATHON")
coverageDataDemo = readVCFforFalcon(vcfFile)
head(coverageDataDemo)
##               SRR5906250_ReadCount_Alt SRR5906250_ReadCount_Ref
## chrM:152_T/C                         9                       54
## chrM:195_C/T                         0                       59
## chrM:410_A/T                         0                      210
## chrM:2354_C/T                        0                      108
## chrM:2485_C/T                        0                      170
## chrM:2759_A/G                        5                       85
##               SRR5906250_ReadCount_Total SRR5906251_ReadCount_Alt
## chrM:152_T/C                          63                        1
## chrM:195_C/T                          59                        0
## chrM:410_A/T                         210                        0
## chrM:2354_C/T                        108                        0
## chrM:2485_C/T                        170                        0
## chrM:2759_A/G                         90                        4
##               SRR5906251_ReadCount_Ref SRR5906251_ReadCount_Total
## chrM:152_T/C                       192                        194
## chrM:195_C/T                       241                        241
## chrM:410_A/T                       250                        250
## chrM:2354_C/T                      250                        250
## chrM:2485_C/T                      250                        250
## chrM:2759_A/G                      244                        248
##               SRR5906252_ReadCount_Alt SRR5906252_ReadCount_Ref
## chrM:152_T/C                         0                      249
## chrM:195_C/T                         0                      250
## chrM:410_A/T                         0                      249
## chrM:2354_C/T                        2                      248
## chrM:2485_C/T                        0                      250
## chrM:2759_A/G                        1                      249
##               SRR5906252_ReadCount_Total SRR5906253_ReadCount_Alt
## chrM:152_T/C                         249                        1
## chrM:195_C/T                         250                        0
## chrM:410_A/T                         249                        0
## chrM:2354_C/T                        250                        0
## chrM:2485_C/T                        250                        0
## chrM:2759_A/G                        250                        0
##               SRR5906253_ReadCount_Ref SRR5906253_ReadCount_Total
## chrM:152_T/C                       249                        250
## chrM:195_C/T                       250                        250
## chrM:410_A/T                       249                        249
## chrM:2354_C/T                      248                        248
## chrM:2485_C/T                      249                        249
## chrM:2759_A/G                      250                        250
##               Reference_Allele SRR5906250_Allele1 SRR5906250_Allele2
## chrM:152_T/C                 T                  T                  C
## chrM:195_C/T                 C                  T                  T
## chrM:410_A/T                 A                  T                  T
## chrM:2354_C/T                C                  T                  T
## chrM:2485_C/T                C                  T                  T
## chrM:2759_A/G                A                  G                  G
##               SRR5906251_Allele1 SRR5906251_Allele2 SRR5906252_Allele1
## chrM:152_T/C                   C                  C                  C
## chrM:195_C/T                   T                  T                  T
## chrM:410_A/T                   T                  T                  T
## chrM:2354_C/T                  T                  T                  T
## chrM:2485_C/T                  T                  T                  T
## chrM:2759_A/G                  G                  G                  G
##               SRR5906252_Allele2 SRR5906253_Allele1 SRR5906253_Allele2
## chrM:152_T/C                   C                  C                  C
## chrM:195_C/T                   T                  T                  T
## chrM:410_A/T                   T                  T                  T
## chrM:2354_C/T                  T                  T                  T
## chrM:2485_C/T                  T                  T                  T
## chrM:2759_A/G                  G                  G                  G
##               Chromosome Start_position End_position
## chrM:152_T/C        chrM            152          152
## chrM:195_C/T        chrM            195          195
## chrM:410_A/T        chrM            410          410
## chrM:2354_C/T       chrM           2354         2354
## chrM:2485_C/T       chrM           2485         2485
## chrM:2759_A/G       chrM           2759         2759

4.3.2. Running FALCON for allele-specific copy number profiling

Calculate depth ratio (total read counts of tumor versus normal). Restrict our analysis to chromosome 14, where a copy-neutral loss of heterozygosity has been previously reported.

coverageData = relapse.demo
rdep = sum(coverageData$Tumor_ReadCount_Total)/sum(coverageData$Normal_ReadCount_Total)
chr = 14
coverageChr=coverageData[which(coverageData[,'Chromosome']==chr),]

Remove variants with missing genotype.

nonMissing1 = coverageChr[,'Match_Norm_Seq_Allele1']!=' '
nonMissing2 = coverageChr[,'Match_Norm_Seq_Allele2']!=' '
nonMissing3 = coverageChr[,'Reference_Allele']!=' '
nonMissing4 = coverageChr[,'TumorSeq_Allele1']!=' '
nonMissing5 = coverageChr[,'TumorSeq_Allele2']!=' '
coverageChr=coverageChr[nonMissing1 & nonMissing2 & nonMissing3 & nonMissing4 & nonMissing5,]

Get germline heterozygous loci where normal allele1 differs from normal allele2.

coverageChrHet=coverageChr[(as.matrix(coverageChr[,'Match_Norm_Seq_Allele1'])!=as.matrix(coverageChr[,'Match_Norm_Seq_Allele2'])),]

QC procedures to remove false neg and false pos variants. The thresholds can be adjusted.

# Remove indels (this can be relaxed but we think indels are harder to call than SNPs).
indelThresh = 1
indel.filter1=nchar(as.matrix(coverageChrHet[,'Reference_Allele']))<=indelThresh
indel.filter2=nchar(as.matrix(coverageChrHet[,'Match_Norm_Seq_Allele1']))<=indelThresh
indel.filter3=nchar(as.matrix(coverageChrHet[,'Match_Norm_Seq_Allele2']))<=indelThresh
indel.filter4=nchar(as.matrix(coverageChrHet[,'TumorSeq_Allele1']))<=indelThresh
indel.filter5=nchar(as.matrix(coverageChrHet[,'TumorSeq_Allele2']))<=indelThresh
coverageChrHet=coverageChrHet[indel.filter1 & indel.filter2 & indel.filter3 & indel.filter4 & indel.filter5,]

# Filter on coverage: total number of reads greater than 30 in both tumor and normal.
coverageThresh = 30
depth.filter1=(coverageChrHet[,"Normal_ReadCount_Ref"]+coverageChrHet[,"Normal_ReadCount_Alt"])>=coverageThresh
depth.filter2=(coverageChrHet[,"Tumor_ReadCount_Ref"]+coverageChrHet[,"Tumor_ReadCount_Alt"])>=coverageThresh
coverageChrHet=coverageChrHet[depth.filter1 & depth.filter2,]

Generat input for FALCON (data frame with four columns), run FALCON, and view FALCON’s segmentation.

readMatrix=as.data.frame(coverageChrHet[,c('Tumor_ReadCount_Ref',
                                                   'Tumor_ReadCount_Alt',
                                                   'Normal_ReadCount_Ref',
                                                   'Normal_ReadCount_Alt')])
colnames(readMatrix)=c('AT','BT','AN','BN')
tauhat = getChangepoints(readMatrix)
cn = getASCN(readMatrix, tauhat=tauhat, rdep = rdep, threshold = 0.3)
view(cn, pos=coverageChrHet[,'Start_position'], rdep = rdep)

From the figure above, we see that: (1) There are small segments that need to be removed; (2) Consecutive segments with similar allelic copy number states need to be combined. Therefore, we need to further curate FALCON’s segmentation results.

if(length(tauhat)>0){
  length.thres=10^6  # Threshold for length of segments, in base pair.
  delta.cn.thres=0.3  # Threshold of absolute copy number difference between consecutive segments.
  falcon.qc.list = falcon.qc(readMatrix = readMatrix,
                             tauhat = tauhat,
                             cn = cn,
                             st_bp = coverageChrHet[,"Start_position"],
                             end_bp = coverageChrHet[,"End_position"],
                             rdep = rdep,
                             length.thres = length.thres,
                             delta.cn.thres = delta.cn.thres)
  tauhat=falcon.qc.list$tauhat
  cn=falcon.qc.list$cn
}
# Chromosomal view of QC'ed segmentation results.
view(cn,pos=coverageChrHet[,'Start_position'], rdep = rdep)

The code below is to generate table output including genomic locations for segment boudaries, as well as mean and standard deviation for each segment. For Canopy’s input, we use Bootstrap-based method to estimate the standard deviations for the allele-specific copy numbers.

  falconOutput=falcon.output(readMatrix = readMatrix,
                              tauhat = tauhat,
                              cn = cn,
                              st_bp = coverageChrHet[,"Start_position"],
                              end_bp = coverageChrHet[,"End_position"],
                              nboot = 5000)
## Running bootstrap for segment 2 ...
  falconOutput = cbind(chr=rep(chr,nrow(falconOutput)), falconOutput)
  falconOutput
##   chr st_snp end_snp    st_bp    end_bp Minor_copy Major_copy Minor.sd
## 1  14      1    3451 19066021  26963559      1.000      1.000       NA
## 2  14   3451   24617 26963559 107287663      0.277      1.719    8e-04
##   Major.sd
## 1       NA
## 2   0.0032

4.3.3. Visualization of allele-specific copy number by IGV

Finally, it is very helpful to visualize the results in the Integrative Genomics Viewer (IGV). Takes as input the output of falcon.output and converts to .seg format. This is a preferred format for Integrative Genomics Viewer (IGV). This function takes a single data frame as input, so combine CNV calls from all chromosomes beforehand if necessary. This function writes to file, so specify the outPath if desired.

falconOutput  #from the MARTHON notebook section 4.3.2. 
##   chr st_snp end_snp    st_bp    end_bp Minor_copy Major_copy Minor.sd
## 1  14      1    3451 19066021  26963559      1.000      1.000       NA
## 2  14   3451   24617 26963559 107287663      0.277      1.719    8e-04
##   Major.sd
## 1       NA
## 2   0.0032
#Running FALCON for allele-specific copy number profiling

convertFalconForIGV( 
  falconOutput = falconOutput, sampleName = "sample1", outPath = getwd(), log2 = FALSE)

4.4. Tumor allele-specific copy number by WES

For WES data, biases and artifacts cannot be fully captured by comparing the tumor sample to the matched normal sample. FALCON-X (Chen et al. 2017) extends upon FALCON, where it takes as inputs allelic read counts at germline heterozygous loci and total coverage biases for each of these loci estimated by CODEX2 and outputs ASCN estimates.

4.4.1. Calling heterozygous loci

# 1. Index the genome template
bwa index ~/structure/hg19/ucsc.hg19.fasta

# 2. Align reads in .fastq file to the template (sample sequenced at two different lanes)
bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248897_UCF_1014_NoIndex_L008_R1_001.fastq _EGAR00001248897_UCF_1014_NoIndex_L008_R2_001.fastq > _EGAR00001248897_UCF_1014_NoIndex_L008.sam

bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248899_UCF1014_NoIndex_L002_R1_001.fastq _EGAR00001248899_UCF1014_NoIndex_L002_R2_001.fastq > _EGAR00001248899_UCF1014_NoIndex_L002.sam

# 3. Convert .sam to .bam and sort
samtools view -bS _EGAR00001248897_UCF_1014_NoIndex_L008.sam > _EGAR00001248897_UCF_1014_NoIndex_L008.bam
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate 

samtools view -bS _EGAR00001248899_UCF1014_NoIndex_L002.sam > _EGAR00001248899_UCF1014_NoIndex_L002.bam
line=_EGAR00001248899_UCF1014_NoIndex_L002.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate 

# 4. Add read group
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
#line=_EGAR00001248899_UCF1014_NoIndex_L002.bam

java -jar AddOrReplaceReadGroups.jar INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGS_UCF1014 RGPL=ILLUMINA RGPU=machine RGSM=UCF1014
samtools index $line.sorted.rg.bam

# 5. Merge two different lanes (sample sequenced at two different lanes)
samtools merge UCF1014.merge.bam _EGAR00001248899_UCF1014_NoIndex_L002.bam.sorted.rg.bam _EGAR00001248897_UCF_1014_NoIndex_L008.bam.sorted.rg.bam
java -jar SortSam.jar INPUT=UCF1014.merge.bam OUTPUT=UCF1014.merge.sorted.bam SORT_ORDER=coordinate 
samtools index UCF1014.merge.sorted.bam

# 6. Dedup
java -jar MarkDuplicates.jar INPUT=UCF1014.merge.sorted.bam OUTPUT=UCF1014.merge.sorted.dedup.bam  METRICS_FILE=UCF1014.merge.sorted.dedup.metrics.txt PROGRAM_RECORD_ID= MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar BuildBamIndex.jar INPUT=UCF1014.merge.sorted.dedup.bam

# 7. Realign
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.target_intervals.list 

java -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -targetIntervals UCF1014.merge.sorted.dedup.target_intervals.list -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.bam

# 8. Recalibrate
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.recal_data.table

java -jar GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -BQSR UCF1014.merge.sorted.dedup.realigned.recal_data.table -o UCF1014.merge.sorted.dedup.realigned.recal.bam
samtools index UCF1014.merge.sorted.dedup.realigned.recal.bam

# 9. GATK HaplotypeCaller
java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I UCF1014.merge.sorted.dedup.realigned.recal.bam -o UCF1014.merge.sorted.dedup.realigned.recal.raw.snps.indels.g.vcf

4.4.2. Running CODEX2 to get bias matrix

Below is demo dataset consisting of 39 tumor-normal paired whole-exome sequencing, published in Maxwell et al. (Nature Communications, 2017). We focus on chr17, where copy-neutral loss-of-heterozygosity has been reported. Below are allelic reads, genotype, and genomic locations, which can be extracted from vcf files. rda files available for download [here] (https://github.com/yuchaojiang/Canopy/tree/master/instruction)

library(MARATHON)
chr=17
head(mymatrix.demo)# genomic locations and SNP info across all loci from chr17
##   CHROM   POS REF ALT
## 1    17  6115   G   C
## 2    17  6157   A   G
## 3    17  6188   C   T
## 4    17 11830   A   G
## 5    17 11837   G   A
## 6    17 11843   G   A
# genotype in blood (bloodGT1 and blood GT2) and tumor (tumorGT1 and tumor GT2) across all samples
head(genotype.demo[,1:8]) 
##      samp1_bloodGT1 samp1_bloodGT2 samp1_tumorGT1 samp1_tumorGT2
## [1,]              0              1              0              1
## [2,]              0              1              0              1
## [3,]              0              1              0              1
## [4,]              1              1              1              1
## [5,]              0              0              0              0
## [6,]              0              0              0              0
##      samp2_bloodGT1 samp2_bloodGT2 samp2_tumorGT1 samp2_tumorGT2
## [1,]              0              0              0              1
## [2,]              0              0              0              1
## [3,]              0              1              0              1
## [4,]              0              1              0              1
## [5,]              0              0              0              0
## [6,]              0              0              0              0
#allelic reads in blood (AN and BN) and tumor (AT and BT) across all samples
head(reads.demo[,1:8]) 
##      samp1_AN samp1_BN samp1_AT samp1_BT samp2_AN samp2_BN samp2_AT
## [1,]      123       78      122      128      240        8      191
## [2,]       80       54       83       85      194       11      170
## [3,]       35       49       43       49       76       50       80
## [4,]        0       33        1       37       20       26       24
## [5,]       37        0       50        0       49        0       70
## [6,]       39        0       55        0       55        0       79
##      samp2_BT
## [1,]       58
## [2,]       59
## [3,]       58
## [4,]       37
## [5,]        0
## [6,]        0

Apply CODEX2 to get total coverage bias. First we get GC content from a 50bp window centered at the SNP, as well as total read depth at each loci.

pos=as.numeric(paste(mymatrix.demo[,'POS']))
ref=GRanges(seqnames=chr, ranges=IRanges(start=pos-25,end=pos+25))
gc <- getgc(ref, genome = genome) # GC content for each loci
values(ref) <- cbind(values(ref), DataFrame(gc))

Y = reads.demo[,seq(1,ncol(reads.demo),2)] +
    reads.demo[,seq(2,ncol(reads.demo),2)] # total read depth for each loci
colnames(Y)=paste('samp',1:(ncol(Y)/2),'_',rep(c('N','T'),ncol(Y)/2),sep='')
head(Y[,1:8])
##      samp1_N samp2_T samp3_N samp4_T samp5_N samp6_T samp7_N samp8_T
## [1,]     201     250     248     249     158      72     168     247
## [2,]     134     168     205     229     105      41     113     143
## [3,]      84      92     126     138      82      25      69     100
## [4,]      33      38      46      61      57      16      42      50
## [5,]      37      50      49      70      70      16      41      52
## [6,]      39      55      55      79      71      22      40      59

QC procedure – remove loci with low median coverage across samples. Calculate library size factor and normalize by CODEX2.

pos.filter=(apply(Y,1,median)>=20)
pos=pos[pos.filter]; ref=ref[pos.filter]; gc=gc[pos.filter]
genotype.demo=genotype.demo[pos.filter,]; reads.demo=reads.demo[pos.filter,]
mymatrix.demo=mymatrix.demo[pos.filter,]; Y=Y[pos.filter,]

# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y[apply(Y, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)

normObj=normalize_codex2_ns(Y, gc, K=1:3, norm_index=seq(1,77,2), N = N)
# choiceofK(normObj$AIC,normObj$BIC,normObj$RSS,K=1:3,filename='CODEX2_falconX_demo.pdf')
Yhat=round(normObj$Yhat[[which.max(normObj$BIC)]],0)

4.4.3. Running FALCON-X for allele-specific copy number profiling

Generate input for FALCON-X: allelic read depth and genotype across all loci.

n=39 # total number of samples
for (i in 1:n){
  message('Generating input for sample',i,'...\n')
  ids = (4*i-3):(4*i)
  ids2 = (2*i-1):(2*i)
  mydata = as.data.frame(cbind(mymatrix.demo[,1:2], 
                               genotype.demo[,ids],
                               reads.demo[,ids],
                               Yhat[,ids2]))
  colnames(mydata) = c("chr", "pos", "bloodGT1", "bloodGT2", "tumorGT1",
                       "tumorGT2", "AN", "BN", "AT", "BT", 'sN','sT')
  ids=which(as.numeric(mydata[,3])!=as.numeric(mydata[,4])) # select heterozygou loci
  newdata0 = mydata[ids,]
  index.na=apply(is.na(newdata0), 1, any)
  newdata=newdata0[index.na==FALSE,]
  
  # Remove loci with multiple alternative alleles
  mut.alt.filter=(newdata$tumorGT1==newdata$bloodGT1 | newdata$tumorGT1==newdata$bloodGT2) &
  (newdata$tumorGT2==newdata$bloodGT1 | newdata$tumorGT2==newdata$bloodGT2)
  newdata=newdata[mut.alt.filter,]
  # write text at germline heterozygous loci, which is used as input for Falcon-X
  write.table(newdata, file=paste("sample",i,"_het.txt",sep=""), quote=F, row.names=F)
}

Apply FALCON-X to generate allele-specific copy number profiles. Note that CODEX normalize total read depth across samples, while falcon-x profiles ASCN in each sample separately.

k=10 # calling ASCN for the 10th sample
ascn.input=read.table(paste("sample",k,"_het.txt",sep=""),head=T)
readMatrix=ascn.input[,c('AN','BN','AT','BT')]
biasMatrix=ascn.input[,c('sN','sT')]

tauhat = getChangepoints.x(readMatrix, biasMatrix, pos=ascn.input$pos)
cn = getASCN.x(readMatrix, biasMatrix, tauhat=tauhat, pos=ascn.input$pos, threshold = 0.3)
# cn$tauhat would give the indices of change-points.
# cn$ascn would give the estimated allele-specific copy numbers for each segment.
# cn$Haplotype[[i]] would give the estimated haplotype for the major chromosome in segment i
# if this segment has different copy numbers on the two homologous chromosomes.
view(cn, pos=ascn.input$pos)

Further curate FALCON-X’s segmentation, i.e., remove small regions and combine consecutive regions with similar ASCN profiles.

if(length(tauhat)>0){
  length.thres=10^6  # Threshold for length of segments, in base pair.
  delta.cn.thres=0.3  # Threshold of absolute copy number difference between consecutive segments.
  falcon.qc.list = falconx.qc(readMatrix = readMatrix,
                              biasMatrix = biasMatrix,
                              tauhat = tauhat,
                              cn = cn,
                              st_bp = ascn.input$pos,
                              end_bp = ascn.input$pos,
                              length.thres = length.thres,
                              delta.cn.thres = delta.cn.thres)
  
  tauhat=falcon.qc.list$tauhat
  cn=falcon.qc.list$cn
}

view(cn,pos=ascn.input$pos)

4.5. Tumor phylogeny analysis by WGS/WES

Canopy (Jiang et al. 2016) identifies subclones within a tumor, determines the mutational profiles of these subclones, and infers the tumor’s phylogenetic history by NGS data from temporally and/or spatially separated tumor resections from the same patient. Canopy jointly models somatic copy number changes and SNVs in a similar fashion to non-negative matrix factorization and adopts a Bayesian framework to reconstruct phylogeny with posterior confidence assessment. Canopy takes as input both somatic ASCN changes returned by FALCON/FALCON-X as well as somatic SNVs and outputs tumor phylogenetic trees with somatic mutations placed along tree branches and subclones placed at the leaves.

This is a demo for using the Canopy package in R. Canopy is a statistical framework and computational procedure for identifying subpopulations within a tumor, determining the mutation profiles of each subpopulation, and inferring the tumor’s phylogenetic history. The input to Canopy are variant allele frequencies of somatic single nucleotide alterations (SNAs) along with allele-specific coverage ratios between the tumor and matched normal sample for somatic copy number alterations (CNAs). These quantities can be directly taken from the output of existing software. Canopy provides a general mathematical framework for pooling data across samples and sites to infer the underlying parameters. For SNAs that fall within CNA regions, Canopy infers their temporal ordering and resolves their phase. When there are multiple evolutionary configurations consistent with the data, Canopy outputs all configurations along with their confidence.

Below is an example on reconstructing tumor phylogeny of a transplantable metastasis model system derived from a heterogeneous human breast cancer cell line MDA-MB-231. Cancer cells from the parental line MDA-MB-231 were engrafted into mouse hosts leading to organ-specific metastasis. Mixed cell populations (MCPs) were in vivo selected from either bone or lung metastasis and grew into phenotypically stable and metastatically competent cancer cell lines. The parental line as well as the MCP sublines were whole-exome sequencedwith somatic SNAs and CNAs profiled. Canopy is used to infer metastatic phylogeny. Code for analysis of this dataset is broken down below with explanations and is further available here.

4.5.1. CNA and SNA input

The input to Canopy are variant allele frequencies of somatic SNAs along with allele-specific coverage ratios between the tumor and matched normal sample for somatic CNAs. For SNAs, let the matrices R and X be, respectively, the number of reads containing the mutant allele and the total number of reads for each locus across all samples. The ratio R/X is the proportion of reads supporting the mutant allele, known as the variant allele frequency. For CNAs, Canopy directly takes output from allele-specific copy number estimation softwares, such as FALCON-X or Sequenza. These outputs are in the form of estimated major and minor copy number ratios, respectively denoted by WM and Wm, with their corresponding standard errors ϵM and ϵm. Matrix Y specifies whether SNAs are affected by CNAs; matrix C specifies whether CNA regions harbor specific CNAs (this input is only needed if overlapping CNA events are observed).

How to generate SNA input for Canopy

  • We use UnifiedGenotyper by GATK to call somatic SNAs and follow its Best Practices. MuTect and VarScan2 can also be used when paired normal samples are available.

# 1. Index the genome template
bwa index ~ $GENOME_DIR/hg19/ucsc.hg19.fasta

# 2. Align reads in .fastq file to the template
bwa mem -M -t 16 $GENOME_DIR/ucsc.hg19.fasta sampleRead1.fastq \
  sampleRead2.fastq > sample.sam

# 3. Convert .sam to .bam and sort
samtools view -bS sample.sam > sample.bam
java -jar $PATH_PICARD/picard.jar SortSam \
  INPUT=sample.bam \
  OUTPUT=sample.sorted.bam SORT_ORDER=coordinate

# 4. Add read group and index
java -jar $PATH_PICARD/picard.jar AddOrReplaceReadGroups \
  INPUT=sample.sorted.bam OUTPUT=sample.sorted.rg.bam \
  RGID=sample RGLB=myRGLB RGPL=myRGPL RGPU=myRGPU RGSM=myRGSM
samtools index sample.sorted.rg.bam

# 5. *Not Run (Merge two different lanes if sample sequenced at two different lanes),
# sort, index
#samtools merge sampleMerge.bam \
#  sampleLane1.sorted.rg.bam sampleLane2.sorted.rg.bam
#java -jar SortSam.jar INPUT=sampleMerge.bam \
#  OUTPUT=sampleMerge.sorted.bam SORT_ORDER=coordinate


# 6. Dedup, index
java -jar $PATH_PICARD/picard.jar MarkDuplicates \
  INPUT=sample.sorted.rg.bam OUTPUT=sample.sorted.dedup.bam  \
  METRICS_FILE=sample.sorted.dedup.metrics.txt \
  PROGRAM_RECORD_ID= MarkDuplicates \
  PROGRAM_GROUP_VERSION=null \
  PROGRAM_GROUP_NAME=MarkDuplicates
java -jar $PATH_PICARD/picard.jar BuildBamIndex \
  INPUT=sample.sorted.dedup.bam

# 7. Realign
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R $GENOME_DIR/ucsc.hg19.fasta \
  -I sample.sorted.dedup.bam \
  -known $GENOME_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
  -known $GENOME_DIR/1000G_phase1.indels.hg19.sites.vcf \
  -o sample.sorted.dedup.target_intervals.list

java -jar $PATH_GATK/GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R $GENOME_DIR/ucsc.hg19.fasta \
  -I sample.sorted.dedup.bam \
  -targetIntervals sampleMerge.sorted.dedup.target_intervals.list \
  -known $GENOME_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
  -known $GENOME_DIR/1000G_phase1.indels.hg19.sites.vcf \
  -o sample.sorted.dedup.realigned.bam

# 8. Recalibrate Base Quality Scores
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -R $GENOME_DIR/ucsc.hg19.fasta \
  -I sample.sorted.dedup.realigned.bam \
  -knownSites $GENOME_DIR/dbsnp_138.hg19.vcf \
  -knownSites $VCF_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
  -knownSites $VCF_DIR/1000G_phase1.indels.hg19.sites.vcf \
  -o sample.sorted.dedup.realigned.recal_data.table

java -jar $PATH_GATK/GenomeAnalysisTK.jar 
  -T PrintReads \
  -R $GENOME_DIR/ucsc.hg19.fasta \
  -I sample.sorted.dedup.realigned.bam \
  -BQSR sample.sorted.dedup.realigned.recal_data.table \
  -o sample.sorted.dedup.realigned.recal.bam
samtools index sample.sorted.dedup.realigned.recal.bam

# 9. GATK UnifiedGenotyper
java -jar $PATH_GATK/GenomeAnalysisTK.jar 
  -T UnifiedGenotyper\
  -R $GENOME_DIR/ucsc.hg19.fasta \
  -I sample.sorted.dedup.realigned.recal.bam \
  -o sample.sorted.dedup.realigned.recal.raw.snps.indels.g.vcf
  
  • Stringent QC procedures are strongly recommended. Just to list a few QCs that we have adopted:
    • Pass variant recalibration (VQSR) from GATK;
    • Have only one alternative allele (one locus being double hit by two different SNAs in one patient is very unlikely);
    • Are highly deleterious (i.e., focuse on driver mutations) from functional annotations (ANNOVAR);
    • Have low population variant frequency from the 1000 Genomes Project (if no normal samples are available);
    • Don’t reside in segmental duplication regions;
    • Have high depth of coverage (total as well as mutated read depth);
    • Reside in target baits (e.g., exonic regions for exome sequencing);
  • After mutation calling and QC procedures, the number of mutated reads will be stored in matrix R while the total number of reads will be stored in matrix X across all mutational loci in all samples.

  • A good way for sanity check is to plot the variant allele frequencies (VAFs) across samples.
  • If there are only two samples, a 2-D scatterplot will suffice. In the figure below, the left panel contains the VAFs at two timepoints for the leukemia patient from Ding et al. (Nature 2012). We see that there are clear clusters of mutations indicating they fall on the same branch of the tree, which serves as a nice sanity check of our mutation calling and filtering.
  • If there are more than two samples, heatmap can be used for visualization. The right panel in the figure below is the heatmap of VAFs across 63 somatic mutations in 10 spatially separated tumor slices of a ovarian cancer patient from Bashashati et al. (The Journal of pathology 2013). The heatmap below is plotted using the pheatmap R package.

Convert VCF file to input for Canopy

readVCFforCanopy is a function which generates canopy input from a VCF file. This function takes as input a the path to the VCF file. First 9 column names after the header should be c(“#CHROM”, “POS”, “ID”, “REF”, “ALT”, “QUAL”, “FILTER”, “INFO”, “FORMAT”). Following column names should be sample names. see . The function outputs a list of R(alternate reads), X(total reads), and vcfTargets which contains gene target info from the VCF.
GenotypeMatrix is a snp by sample matrix representing alleles at that position (i.e. 0: homozygous ref, 1: heterozygous alt, 2: homozygous alt).
ReadsMatrix is a matrix where rows are positions and each sample has two columns, first the reference allele count, and second the alternate allele count.

vcfFile = system.file("extdata", "sample_w_header.vcf", package="MARATHON")
canopyInput = readVCFforCanopy(vcfFile)
lapply(canopyInput, head)
## $R
##               SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C          54        192        249        249
## chrM:195_C/T          59        241        250        250
## chrM:410_A/T         210        250        249        249
## chrM:2354_C/T        108        250        248        248
## chrM:2485_C/T        170        250        250        249
## chrM:2759_A/G         85        244        249        250
## 
## $X
##               SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C          63        194        249        250
## chrM:195_C/T          59        241        250        250
## chrM:410_A/T         210        250        249        249
## chrM:2354_C/T        108        250        250        248
## chrM:2485_C/T        170        250        250        249
## chrM:2759_A/G         90        248        250        250
## 
## $vcfTargets
## GRanges object with 6 ranges and 5 metadata columns:
##                 seqnames    ranges strand | paramRangeID            REF
##                    <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##    chrM:152_T/C     chrM       152      * |         <NA>              T
##    chrM:195_C/T     chrM       195      * |         <NA>              C
##    chrM:410_A/T     chrM       410      * |         <NA>              A
##   chrM:2354_C/T     chrM      2354      * |         <NA>              C
##   chrM:2485_C/T     chrM      2485      * |         <NA>              C
##   chrM:2759_A/G     chrM      2759      * |         <NA>              A
##                                ALT      QUAL      FILTER
##                 <DNAStringSetList> <numeric> <character>
##    chrM:152_T/C                  C  28617.33           .
##    chrM:195_C/T                  T  31116.35           .
##    chrM:410_A/T                  T  37789.35           .
##   chrM:2354_C/T                  T  32179.35           .
##   chrM:2485_C/T                  T  34291.35           .
##   chrM:2759_A/G                  G  31965.36           .
##   -------
##   seqinfo: 93 sequences from hg19 genome
## 
## $GenotypeMatrix
##               SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C           1          2          2          2
## chrM:195_C/T           2          2          2          2
## chrM:410_A/T           2          2          2          2
## chrM:2354_C/T          2          2          2          2
## chrM:2485_C/T          2          2          2          2
## chrM:2759_A/G          2          2          2          2
## 
## $ReadsMatrix
##               SRR5906250_Ref SRR5906250_Alt SRR5906251_Ref SRR5906251_Alt
## chrM:152_T/C               9             54              1            192
## chrM:195_C/T               0             59              0            241
## chrM:410_A/T               0            210              0            250
## chrM:2354_C/T              0            108              0            250
## chrM:2485_C/T              0            170              0            250
## chrM:2759_A/G              5             85              4            244
##               SRR5906252_Ref SRR5906252_Alt SRR5906253_Ref SRR5906253_Alt
## chrM:152_T/C               0            249              1            249
## chrM:195_C/T               0            250              0            250
## chrM:410_A/T               0            249              0            249
## chrM:2354_C/T              2            248              0            248
## chrM:2485_C/T              0            250              0            249
## chrM:2759_A/G              1            249              0            250

How to generate CNA input for Canopy

To generate allele-specific copy number calls, Sequenza or FALCON can be used. Both methods require paired tumor normal samples as input.

Sequenza

Sequenza estimates allele-specific copy numbers as well as tumor purity and ploidy using B-allele frequencies and depth ratios from paired tumor-normal sequencing data. Here are Sequenza outputs from WES of normal, primary tumor and relapse genome of a neuroblastoma patient from Eleveld et al. (Nature Genetics 2015).

Data visualization and QC procedures on Sequenza’s segmentation output are strongly recommended. Below is genome-wide view from Sequenza as a sanity check. The top panel is for the primary tumor and the bottom panel is for the relapse. The red and blue lines are for the major and minor copy numbers, respectively. We see that large chromosome-arm level deletions and duplications are fairly concordant between the primary tumor and the relapse genome (check!), while the relapse tumor gained additional CNAs at the second timepoint. The small CNA events, however, are most likely false positives and should be filtered out through QC procedures, which may include: * Have at least 100 heterozygous loci within each segment; * Are at least 5Mb long; * Don’t have extreme estimated copy numbers (e.g., total copy number >= 6); * … * …

It is also worth noting that additional steps are further needed to generate curated CNA segments. For example, chr17q is split into three segments in the primary tumor, which should be just one large duplication instead (via comparison against the relapse). Furthermore, the breakpoints of the same CNA events between the primrary tumor and the relapse genome are sometimes different and they need to be merged. They should NOT be treated as separate events.

Primary tumor Sequenza segmentation genome-wide view Relapse genome Sequenza segementation genome-wide view

Canopy does not require interger allele-specific copy number as input. While Sequenza infers tumor purity and ploidy and outputs interger-valued allle-specifc copy numbers, Canopy directly takes as input the fractional allele-specific copy numbers, which can be obtained from the segmentation output (see above). The B-allele frequency is Bf = Wm / (WM + Wm) and the depth ratio is depth.ratio = (WM + Wm)/2. From here the input matrix WM and Wm can be calculated. The standard errors of the estimated copy numbers, epsilonM and epsilonm, can be set as the default value by Canopy or be taken as the sd.ratio from Sequenza’s output, assuming they are the same. NOTE: epsilonM and epsilonm should NOT be sd(WM) and sd(Wm). The former are both matrix of dimension number of CNA events x number of samples, unless if they are set all the same by default as a scalar; the latter are the standard deviations of the copy numbers across all events in all samples.

Which CNAs and SNAs to use?

How to generate a clean set of input to Canopy is important and non-trivial. While in our phylogeney reconstruction paper we were not trying to solve this tricky but separate problem of quality control of point mutation profiling and copy number estimation, an input with too many false positives will only lead to “garbage in garbage out” by Canopy. We are currently working on automating the pipeline for generating CNA and SNA input as well as offering guidance to select the informative SNAs and CNAs. By saying informative, we mean that the SNAs or CNAs show distinct patterns between different samples (from the same patient since we are looking at intratumor heterogeneity). For SNAs, this means that the observed VAFs are different (see Figure 4B in our paper) and in this case a heatmap is a good way for visualization. For CNAs, this means that the WM and Wm are different (see Supplementary Figure S13 in our paper) and we find IGV a good tool for visualization and recommend focusing on large CNA regions, which helps remove false calls and speed up computation.

Just like SNAs, there will likely to be CNAs carried by small fractions of the cells, or that reside in hard-to-call regions of the genome, which are not detected. The former scenario, in particular, includes CNAs which may be informative about rare subclones. We do not assume that the CNAs (and SNAs) given to Canopy comprise all mutations that are carried by the sample, and similarly, do not attempt to claim that Canopy detects all subclones or resolves all branches of the evolutionary tree. Our goal is only to estimate all of the subclones that have representation among the set of input CNAs and SNAs, which are, inherently, limited in resolution by the experimental protocol (sequencing platform and coverage, number of tumor slices, etc.) We believe this is the best that can be achieved under current settings.

Canopy Input

Below is demo data input from project MDA231 (first case study in our paper).

library(MARATHON)
data("MDA231")
projectname = MDA231$projectname ## name of project
R = MDA231$R; R ## mutant allele read depth (for SNAs)
##       MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF           155           59          136                  77
## KRAS            44           21           54                  19
## ALPK2           37           17           28                  10
## RYR1            44            0           26                   0
##       MCP3481_lung
## BRAF            49
## KRAS            17
## ALPK2            7
## RYR1             0
X = MDA231$X; X ## total depth (for SNAs)
##       MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF           157          111          177                 146
## KRAS            44           30           64                  42
## ALPK2           63           17           65                  24
## RYR1           107           56          165                  55
##       MCP3481_lung
## BRAF            71
## KRAS            27
## ALPK2            7
## RYR1            43
WM = MDA231$WM; WM ## observed major copy number (for CNA regions)
##       MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## chr7         2.998        2.002        2.603               2.000
## chr12        1.998        1.998        1.603               1.001
## chr18        1.000        2.992        1.000               1.002
## chr19        2.000        2.000        2.000               2.000
##       MCP3481_lung
## chr7         2.001
## chr12        1.999
## chr18        2.996
## chr19        2.000
Wm = MDA231$Wm; Wm ## observed minor copy number (for CNA regions)
##       MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## chr7         0.002        0.998        0.397               1.000
## chr12        0.002        0.998        0.397               1.000
## chr18        1.000        0.004        1.000               0.999
## chr19        1.000        1.000        1.000               1.000
##       MCP3481_lung
## chr7         0.999
## chr12        0.999
## chr18        0.002
## chr19        1.000
epsilonM = MDA231$epsilonM ## standard deviation of WM, pre-fixed here
epsilonm = MDA231$epsilonm ## standard deviation of Wm, pre-fixed here
## Matrix C specifices whether CNA regions harbor specific CNAs 
## only needed if overlapping CNAs are observed, specifying which CNAs overlap
C = MDA231$C; C
##       chr7_1 chr7_2 chr12_1 chr12_2 chr18 chr19
## chr7       1      1       0       0     0     0
## chr12      0      0       1       1     0     0
## chr18      0      0       0       0     1     0
## chr19      0      0       0       0     0     1
Y = MDA231$Y; Y ## whether SNAs are affected by CNAs
##       non-cna_region chr7 chr12 chr18 chr19
## BRAF               0    1     0     0     0
## KRAS               0    0     1     0     0
## ALPK2              0    0     0     1     0
## RYR1               0    0     0     0     1

4.5.2. Binomial clustering of SNAs

A multivariate binomial mixture clustering step can be applied to the SNAs before MCMC sampling. We show in our paper via simulations that this pre-clustering method helps the Markov chain converge faster with smaller estimation error (especially when mutations show clear cluster patterns by visualization). This clustering step can also remove likely false positives before feeding the mutations to the MCMC algorithm.

Below is a toy example, where three bulk tumor samples were in silico simulated from a tree of 4 clones/leaves. The 5 tree segments (excluding the leftmost branch, which corresponds to the normal clone) separate 200 mutations into 5 mutation clusters. More detailed demo codes for clustering can be found here. Detailed methods can be found in the supplements of our paper under section Binomial mixture clustering. BIC is used for model selection. 2D (two longitudinal/spatial samples) or 3D (three samples) plots are generated for visualization.

data(toy3)
R=toy3$R; X=toy3$X # 200 mutations across 3 samples
num_cluster=2:9 # Range of number of clusters to run
num_run=10 # How many EM runs per clustering step for each mutation cluster wave
canopy.cluster=canopy.cluster(R = R,
                              X = X,
                              num_cluster = num_cluster,
                              num_run = num_run)
## Running EM with 2 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 2 clusters 
## Running EM with 3 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 3 clusters 
## Running EM with 4 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 4 clusters 
## Running EM with 5 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 5 clusters 
## Running EM with 6 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 6 clusters 
## Running EM with 7 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 7 clusters 
## Running EM with 8 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 8 clusters 
## Running EM with 9 clusters...    1   2   3   4   5   6   7   8   9   10   
## EM converged in 10 out of 10 runs with 9 clusters
bic_output=canopy.cluster$bic_output # BIC for model selection (# of clusters)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau  # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation

Below is real dataset from Ding et al. (Nature 2012), where a leukemia patient was sequenced at two timepoints – primary tumor (sample 1) and relapse genome (sample 2). The real dataset is noisier and can potentially contain false positives for somatic mutations. We thus include in the mixture a multivariate uniform component on the unit interval, which corresponds to mutations that have high standard errors during sequencing or that are likely to be false positives. The code and SNA input for this dataset can be found here.

data(AML43)
R=AML43$R; X=AML43$X
num_cluster=4 # Range of number of clusters to run
num_run=6 # How many EM runs per clustering step for each mutation cluster wave
Tau_Kplus1=0.05 # Pre-specified proportion of noise component
Mu.init=cbind(c(0.01,0.15,0.25,0.45),
              c(0.2,0.2,0.01,0.2)) # Initial value for centroid
canopy.cluster=canopy.cluster(R = R,
                              X = X,
                              num_cluster = num_cluster,
                              num_run = num_run,
                              Mu.init = Mu.init,
                              Tau_Kplus1=Tau_Kplus1)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau  # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation

R.qc=R[sna_cluster<=4,] # exclude mutations in the noise cluster
X.qc=X[sna_cluster<=4,]
sna_cluster.qc=sna_cluster[sna_cluster<=4]

R.cluster=round(Mu*100)  # Generate pseudo-SNAs correponding to each cluster. 
X.cluster=pmax(R.cluster,100)   # Total depth is set at 100 but can be obtained as median instead 
rownames(R.cluster)=rownames(X.cluster)=paste('SNA.cluster',1:4,sep='')

4.5.3. Phylogenetic tree (unknown)

Each sampled tree is modeled as a list by Canopy. Below are the treeelements of the most likely tree from the project MDA231 (first case study in the paper). The most likely tree is obtained from the posterior distributionin the tree space from the MCMC sampling. How to visualize/plot the sampled trees is discussed later.

data('MDA231_tree')
MDA231_tree$Z # Z matrix specifies the position of the SNAs along the tree branch
##       clone1 clone2 clone3 clone4
## BRAF       0      1      1      1
## KRAS       0      1      1      1
## ALPK2      0      1      1      1
## RYR1       0      0      1      0
MDA231_tree$cna.copy # major and minor copy number (interger values) for each CNA
##            chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## major_copy        3        2         2         2         3         2
## minor_copy        0        1         1         0         0         1
MDA231_tree$CM # Major copy per clone for each CNA
##       clone1 clone2 clone3 clone4
## chr7       1      2      3      2
## chr12      1      1      2      2
## chr18      1      1      1      3
## chr19      1      2      2      2
MDA231_tree$Cm # Minor copy per clone for each CNA
##       clone1 clone2 clone3 clone4
## chr7       1      1      0      1
## chr12      1      1      0      1
## chr18      1      1      1      0
## chr19      1      1      1      1
MDA231_tree$Q # whether an SNA precedes a CNA
##       chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## BRAF         1        1         0         0         0         0
## KRAS         0        0         1         1         0         0
## ALPK2        0        0         0         0         1         0
## RYR1         0        0         0         0         0         0
MDA231_tree$H # if an SNA precedes a CNA, whether it resides in the major copy
##       chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## BRAF         1        1         0         0         0         0
## KRAS         0        0         1         1         0         0
## ALPK2        0        0         0         0         1         0
## RYR1         0        0         0         0         0         0
MDA231_tree$P # clonal compostion for each sample
##        MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## clone1        0.000        0.000        0.000               0.000
## clone2        0.006        0.000        0.399               0.992
## clone3        0.991        0.001        0.600               0.001
## clone4        0.003        0.999        0.001               0.007
##        MCP3481_lung
## clone1        0.000
## clone2        0.003
## clone3        0.001
## clone4        0.996
MDA231_tree$VAF # VAF based on current tree structure
##       MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF         0.997        0.667        0.867               0.667
## KRAS         0.996        0.667        0.800               0.502
## ALPK2        0.502        1.000        0.501               0.505
## RYR1         0.330        0.000        0.200               0.000
##       MCP3481_lung
## BRAF         0.667
## KRAS         0.667
## ALPK2        0.999
## RYR1         0.000

4.5.4. MCMC sampling

Canopy samples in subtree space with varying number of subclones(denoted as K) by a Markov chain Monte Carlo (MCMC) method. A plot of posterior likelihood (pdf format) will be generated for each subtree space and we recommend users to refer to the plot as a sanity check for sampling convergence and to choose the number of burn-ins and thinning accordingly. Note that this step can be time-consuming, especially with larger number of chains numchain specifies the number of chains with random initiations, a larger value of which is in favor of not getting stuck in local optima) and longer chains (simrun specifies number of iterations per chain). MCMC sampling is the most computationally heavy step in Canopy. It is recommended that jobs are run in parallel on high-performance cluster.

There are four modes of MCMC sampling embedded in Canopy: (1) canopy.sample which takes both SNA and CNA as input by default; (2) canopy.sample.nocna for cases where there is no CNA input; (3) canopy.sample.cluster for cases where SNAs are pre-clustered by the Binomial mixture EM algorithm; (4) canopy.sample.cluster.nocna for cases where there is no CNA input and SNAs are pre-clustered by the Binomial mixture EM algorithm. More details can be found here.

Below is sampling code for the MDA231 dataset (see section 4.5.1) where both SNA and CNA are used as input.

K = 3:6 # number of subclones
numchain = 20 # number of chains with random initiations
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM, 
                          epsilonm = epsilonm, C = C, Y = Y, K = K, numchain = numchain,
                          max.simrun = 50000, min.simrun = 10000, 
                          writeskip = 200, projectname = projectname, cell.line = TRUE, 
                          plot.likelihood = TRUE)
#save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
#           compress = 'xz')
length(sampchain) ## number of subtree spaces (K=3:6)
## [1] 4
length(sampchain[[which(K==4)]]) ## number of chains for subtree space with 4 subclones
## [1] 20
length(sampchain[[which(K==4)]][[1]]) ## number of posterior trees in each chain
## [1] 250

4.5.5 BIC for model selection

Canopy uses BIC as a model selection criterion to determine to optimal number of subclones. The first [burnin] iterations are removed and of the remaining, every [thin]th iteration is considered.

burnin = 100
thin = 5 # If there is error in the bic and canopy.post step below, make sure
# burnin and thinning parameters are wisely selected so that there are
# posterior trees left.
bic = canopy.BIC(sampchain = sampchain, projectname = projectname, K = K,
                 numchain = numchain, burnin = burnin, thin = thin, pdf = TRUE)
## k = 3 : mean likelihood -17243.57 ; BIC -34650.33 .
## k = 4 : mean likelihood -548.2876 ; BIC -1305.67 .
## k = 5 : mean likelihood -611.4774 ; BIC -1477.948 .
## k = 6 : mean likelihood -586.8295 ; BIC -1474.551 .
optK = K[which.max(bic)]

4.5.5.1 Diagnostic for length of Markov chains

To determine if the markov chain has run long enough, we examine the stability. The following function provides insight. The right is a zoomed plot of the left. All chains appear to be stable on the left, because the scale is so large. However, the zoomed image on the right gives greater deatil and ability to determine stability. Make sure writeskip here matches your setting in canopy.sample(). You can change yRange setting to expand the view so that more chains are visible.

On the right, we see stability for the highest likelihood chains over a period of 50,000 trees. That, is the most recent value is not greater than the mean of the past 50,000 trees. If values are still climbing, then the chain should be run longer, using min.simrun / max.simrum.

canopy.simrun.diagnostic(sampchain, optK, K, writeskip = 1000, yRange = 100 )

4.5.5.2 Parallel Computing

Parallel computing is easy and each chain can run simultneously given suffifient number of availiable cpu cores. Simply use canopy.sample.parallel() instead of canopy.sample(). All parameters are identical.

If using high performance clusters, it may me necessary to request additional memory, as sampchain files are large. Alternatively, one can raise the value of writeskip. This is a way of pre-thinning the written record of trees. If you raise writeskip, you may need to lower the thin parameter for canopy.BIC() and canopy.post() to ensure that enough iterations are kept.

sampchain = canopy.sample.parallel(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM, 
                          epsilonm = epsilonm, C = C, Y = Y, K = K, numchain = numchain,
                          max.simrun = 50000, min.simrun = 10000, 
                          writeskip = 1000, projectname = projectname, cell.line = TRUE,
                          plot.likelihood = TRUE)
#save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
#           compress = 'xz')

4.5.6. Posterior evaluation of sampled trees, output, and plotting

Canopy then runs a posterior evaluation of all sampled trees by MCMC. If modes of posterior probabilities (second column of config.summary) aren’t obvious, check if the algorithm has converged (and run sampling longer if not). The first [burnin] iterations are removed and of the remaining, every [thin]th iteration is considered.

post = canopy.post(sampchain = sampchain, projectname = projectname, K = K,
numchain = numchain, burnin = burnin, thin = thin, optK = optK,
C = C, post.config.cutoff = 0.05)
samptreethin = post[[1]]   # list of all post-burnin and thinning trees
samptreethin.lik = post[[2]]   # likelihoods of trees in samptree
config = post[[3]] # configuration for each posterior tree
config.summary = post[[4]] # configuration summary
print(config.summary)
##      Configuration Post_prob Mean_post_lik
## [1,]             1     0.562       -547.55
## [2,]             2     0.066       -549.68
## [3,]             3     0.073       -548.62
## [4,]             4     0.146       -547.68
## [5,]             5     0.153       -547.80
# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space

One can then use Canopy to output and plot the most likely tree (i.e.,tree with the highest posterior likelihood). Mutations, clonal frequencies, and tree topology, etc., of the tree are obtained from the posterior distributions of subtree space with trees having the same configuration. In our MDA231 example, the most likely tree is the tree with configuration 3. Note: A separate txt file can be generated (with txt=TRUE and txt.name=’*.txt’) if the figure legend of mutational profiles (texts below the phylogenetic tree) in the plot is too long to be fitted entirely.

config.i = config.summary[which.max(config.summary[,3]),1]
cat('Configuration', config.i, 'has the highest posterior likelihood!\n')
# plot the most likely tree in the posterior tree space
output.tree = canopy.output(post, config.i, C)
canopy.plottree(output.tree)

# plot the tree with configuration 1 in the posterior tree space
output.tree = canopy.output(post, 1, C)
canopy.plottree(output.tree, pdf=TRUE, pdf.name = 
paste(projectname,'_first_config.pdf',sep=''))
## Configuration 1 has the highest posterior likelihood!

4.5.7. Try it yourself

Now try Canopy yourself using the simulated toy dataset below! Note that no overlapping CNAs are used as input and thus matrix C doesn’t need to be specified.

library(MARATHON)
data(toy)
projectname = 'toy'
R = toy$R; X = toy$X; WM = toy$WM; Wm = toy$Wm
epsilonM = toy$epsilonM; epsilonm = toy$epsilonm; Y = toy$Y

K = 3:6; numchain = 10
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM, 
                          epsilonm = epsilonm, C = NULL, Y = Y, K = K, 
                          numchain = numchain, max.simrun = 100000,
                          min.simrun = 10000, writeskip = 200,
                          projectname = projectname, cell.line = FALSE,
                          plot.likelihood = TRUE)

The most likely tree is shown below. There should be only one tree configurationfrom the posterior tree space. The code for this toy dataset analysis can be found here.

The second toy example has a different tree topology. Feel free to try Canopy on this dataset too! There should be also just one tree configuration as is shown below from the posterior tree space. The code for this toy dataset analysis can be found here.

library(MARATHON)
data(toy2)
projectname = 'toy2'
R = toy2$R; X = toy2$X; WM = toy2$WM; Wm = toy2$Wm
epsilonM = toy2$epsilonM; epsilonm = toy2$epsilonm; Y = toy2$Y
true.tree = toy2$true.tree  # true underlying tree
K = 3:6; numchain = 15
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM, 
                          epsilonm = epsilonm, C = NULL, Y = Y, K = K, 
                          numchain = numchain, max.simrun = 100000,
                          min.simrun = 10000, writeskip = 200,
                          projectname = projectname, cell.line = FALSE,
                          plot.likelihood = TRUE)

The third toy example consists of three bulk tumor samples, in silico simulated from a tree of 4 clones/leaves. The 5 tree segments (excluding the leftmost branch, which corresponds to the normal clone) separate 200 mutations into 5 mutation clusters. The SNA clustering details are outlined in section ``Binomial clustering of SNAs“. The code for this toy dataset analysis is briefly attached below. Code in more details with visualization and posterior analysis can be found here.

library(MARATHON)
data(toy3)
R=toy3$R; X=toy3$X
num_cluster=2:9 # Range of number of clusters to run
num_run=10 # How many EM runs per clustering step for each mutation cluster wave
canopy.cluster=canopy.cluster(R = R,
X = X,
num_cluster = num_cluster,
num_run = num_run)

bic_output=canopy.cluster$bic_output # BIC for model selection (# of clusters)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau  # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation

projectname='toy3'
K = 3:5 # number of subclones
numchain = 15 # number of chains with random initiations
sampchain = canopy.sample.cluster.nocna(R = R, X = X, sna_cluster = sna_cluster,
K = K, numchain = numchain, 
max.simrun = 100000, min.simrun = 20000,
writeskip = 200, projectname = projectname,
cell.line = FALSE, plot.likelihood = TRUE)
save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
compress = 'xz')

5. Session info

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] WES.1KG.WUGSC_1.8.0               CODEX_1.8.0                      
##  [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0                  
##  [5] rtracklayer_1.36.4                Rsamtools_1.28.0                 
##  [7] Biostrings_2.44.1                 XVector_0.16.0                   
##  [9] GenomicRanges_1.28.4              GenomeInfoDb_1.12.2              
## [11] IRanges_2.10.2                    S4Vectors_0.14.3                 
## [13] BiocGenerics_0.22.0              
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.14               knitr_1.17                
##  [3] magrittr_1.5               GenomicAlignments_1.12.1  
##  [5] zlibbioc_1.22.0            BiocParallel_1.10.1       
##  [7] lattice_0.20-35            stringr_1.2.0             
##  [9] tools_3.4.3                grid_3.4.3                
## [11] SummarizedExperiment_1.6.3 Biobase_2.36.2            
## [13] matrixStats_0.52.2         htmltools_0.3.6           
## [15] yaml_2.1.15                rprojroot_1.2             
## [17] digest_0.6.12              Matrix_1.2-12             
## [19] GenomeInfoDbData_0.99.0    bitops_1.0-6              
## [21] codetools_0.2-15           RCurl_1.95-4.8            
## [23] evaluate_0.10.1            rmarkdown_1.8             
## [25] DelayedArray_0.2.7         stringi_1.1.6             
## [27] compiler_3.4.3             backports_1.1.1           
## [29] XML_3.98-1.9

6. References

Chen, Hao, John M Bell, Nicolas A Zavala, Hanlee P Ji, and Nancy R Zhang. 2014. “Allele-Specific Copy Number Profiling by Next-Generation Dna Sequencing.” Nucleic Acids Research 43 (4). Oxford University Press: e23–e23.

Chen, Hao, Yuchao Jiang, Kara N Maxwell, Katherine L Nathanson, and Nancy Zhang. 2017. “Allele-Specific Copy Number Estimation by Whole Exome Sequencing.” The Annals of Applied Statistics 11 (2). NIH Public Access: 1169.

Jiang, Yuchao, Derek A Oldridge, Sharon J Diskin, and Nancy R Zhang. 2015. “CODEX: A Normalization and Copy Number Variation Detection Method for Whole Exome Sequencing.” Nucleic Acids Research 43 (6). Oxford University Press: e39–e39.

Jiang, Yuchao, Yu Qiu, Andy J Minn, and Nancy R Zhang. 2016. “Assessing Intratumor Heterogeneity and Tracking Longitudinal and Spatial Clonal Evolutionary History by Next-Generation Sequencing.” Proceedings of the National Academy of Sciences 113 (37). National Acad Sciences: E5528–E5537.

Jiang, Yuchao, Rujin Wang, Eugene Urrutia, Ioannis N Anastopoulos, Katherine L Nathanson, and Nancy R Zhang. 2018. “CODEX2: Full-Spectrum Copy Number Variation Detection by High-Throughput Dna Sequencing.” Genome Biology 19 (1). BioMed Central: 202.

Zhou, Zilu, Weixin Wang, Li-San Wang, and Nancy Ruonan Zhang. 2018. “Integrative Dna Copy Number Detection and Genotyping from Sequencing and Array-Based Platforms.” Bioinformatics 34 (14). Oxford University Press: 2349–55.