Abstract
Copy number variation is an important and abundant source of variation in the human genome and has been associated with a number of diseases, especially cancer. Massively parallel next-generation sequencing allows copy number profiling at fine resolution. Such efforts, however, have met with mixed success, with setbacks arising partly from the lack of reliable analytical methods to meet the diverse and unique challenges posed by the myriad experimental designs and study goals in genetic studies. In cancer genomics, detection of somatic copy number changes and profiling of allele-specific copy number (ASCN) are complicated by experimental biases and artifacts as well as normal cell contamination and cancer subclone admixture. Furthermore, careful statistical modeling is warranted to reconstruct tumor phylogeny from both somatic ASCN changes and single nucleotide variants. Here we describe a flexible computational pipeline, MARATHON (copy nuMber vARiAtion and Tumor pHylOgeNy), which integrates multiple related statistical software packages for copy number profiling and downstream analyses in disease genetic studies.
The possible analysis scenarios are listed in Table 1. Figure 1 outlines the relationships among the software packages: CODEX and CODEX2 perform read depth normalization for total copy number profiling; the read depth normalized by CODEX/CODEX2 is passed to iCNV, which combines it with allele-specific read counts and microarray data to detect CNVs; FALCON and FALCON-X perform ASCN analysis; and Canopy receives input from FALCON/FALCON-X to reconstruct tumor phylogeny.
Figure 1. A flowchart outlining the procedures for profiling CNV, ASCN, and reconstructing tumor phylogeny. CNVs with common and rare population frequencies can be profiled by CODEX and CODEX2, with and without negative control samples. iCNV integrates sequencing and microarray data for CNV detection. ASCNs can be profiled by FALCON and FALCON-X using allelic read counts at germline heterozygous loci. Canopy infers tumor phylogeny using somatic SNVs and ASCNs.
Table 1. Analysis scenarios and pipeline design. The last column shows the sequence of software that should be used for each analysis scenario. * By normal we mean samples that are not derived from tumor tissue, which are not expected to carry chromosome-level copy number changes.
If you have any questions or problems when using MARATHON, please feel free to open a new issue here. You can also email the maintainers of the corresponding packages – the contact information is below.
Yuchao Jiang (yuchaoj at email dot unc dot edu)
Department of Biostatistics & Department of Genetics, UNC-Chapel Hill
Hao Chen (hxchen at ucdavis dot edu)
Department of Statistics, UC Davis
Gene Urrutia (gene dot urrutia at gmail dot com)
Hill-Rom Innovation, Cary, NC
Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
Genomics and Computational Biology Graduate Group, UPenn
Nancy R. Zhang (nzh at wharton dot upenn dot edu)
Department of Statistics, UPenn
A docker image is available here. This image is an RStudio GUI built on rocker/tidyverse with MARATHON as well as all of its dependent packages and datasets pre-installed. Note that downloading the human reference genome and the toy sequencing dataset can take a while. Instructions for using Docker can be found here.
docker pull lzeppelini/marathon
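To launch the RStudio server from the image, a typical invocation for rocker-based images is shown below (the published port and password are illustrative choices); then point a browser to localhost:8787 and log in as rstudio.
docker run -d -p 8787:8787 -e PASSWORD=marathon lzeppelini/marathon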
Install all packages in the latest version of R.
install.packages(c("falcon", "falconx", "devtools"))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("WES.1KG.WUGSC", "GenomeInfoDbData", "GenomeInfoDb", "VariantAnnotation"))
devtools::install_github(c("yuchaojiang/CODEX/package", "yuchaojiang/CODEX2/package", "yuchaojiang/Canopy/package", "zhouzilu/iCNV", "yuchaojiang/MARATHON/package"))
CODEX (Jiang et al. 2015) adopts a Poisson latent factor model for normalization to remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts.
Get bam file directories, sample names from .txt file, and exonic positions from .bed file.
library(MARATHON, quietly = TRUE)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "CODEX_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
Get read depth coverage
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength
Get GC content and mappability
library(BSgenome.Hsapiens.UCSC.hg19); genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
Quality control
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
sep = '\t', quote = FALSE, row.names = FALSE)
Normalization with normal controls
# Estimate the library size factor for each sample by median-of-ratios:
# a geometric-mean pseudo sample is formed across samples, and each sample's
# size factor is the median ratio of its counts to the pseudo sample.
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
normObj <- normalize(Y_qc, gc_qc, K = 1:6, N=N)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
Choose the number of latent factors
choiceofK(AIC, BIC, RSS, K, filename = paste(projectname, "_", chr,
"_choiceofK", ".pdf", sep = ""))
Poisson-likelihood recursive segmentation.
optK = K[which.max(BIC)]
chr=22
finalcall <- segmentCBS(Y_qc, Yhat, optK = optK, K = K, sampname_qc,
ref_qc, chr, lmax = 200, mode = "integer")
head(finalcall)
## sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon
## cnv1 NA18990 22 dup 22312814 22326373 13.560 62 74
## cnv2 NA19210 22 del 22329961 22599725 269.765 77 78
## raw_cov norm_cov copy_no lratio mBIC
## cnv1 1382 1007 3 56.830 46.173
## cnv2 140 217 1 11.466 2.746
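The final calls can then be saved to a tab-delimited text file for downstream use; a minimal sketch (the file name is illustrative):
write.table(finalcall, file = paste(projectname, '_', chr, '_CNV.txt', sep=''),
            sep = '\t', quote = FALSE, row.names = FALSE)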
iCNV (Zhou et al. 2018) uses the normalized coverage from CODEX/CODEX2 and makes use of sequencing reads at inherited single nucleotide polymorphism (SNP) positions for CNV detection; these heterozygous loci have been shown to be valuable in improving detection and genotyping accuracy. iCNV takes as input the coverage normalized by CODEX/CODEX2, allelic frequencies at inherited SNP positions from sequencing, and the log R ratio and B-allele frequency from SNP arrays. The output is CNV calls with quality scores.
Researchers working on WGS need to generate their own BED file for CODEX normalization; we provide the function bed_generator to help with this. For WES and targeted sequencing, the BED file comes from the pull-down design. Follow the code below to obtain the normalized NGS PLR. For demonstration, we use toy data from the 1000 Genomes Project.
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
library(iCNV)
# Pre-process
dir <- system.file("extdata", package = "WES.1KG.WUGSC") # 'PATH/TO/BAM'
bamFile <- list.files(dir, pattern = '*.bam$')
bamdir <- file.path(dir, bamFile)
sampname <- as.matrix(read.table(file.path(dir, "sampname")))
bedFile <- file.path(dir, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- CODEX::getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "icnv.demo.", chr)
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
# Get coverage
coverageObj <- CODEX::getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength
# Get GC content and mappability
gc <- CODEX::getgc(chr, ref)
mapp <- CODEX::getmapp(chr, ref)
# Quality control
qcObj <- CODEX::qc(Y, sampname, chr, ref, mapp, gc, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9, gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
# CODEX normalization
normObj <- CODEX::normalize(Y_qc, gc_qc, K = 1:9)
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
choiceofK(AIC, BIC, RSS, K, filename = paste0(projectname, "_", chr, "_choiceofK.pdf"))
save(qcObj,normObj,sampname,file=paste0(projectname,"_",chr,".rda") )
CODEX reports all three statistical metrics (AIC, BIC, variance reduction) and uses BIC as the default method to determine the number of Poisson factors. Since false positives can be screened out through a closer examination of the post-segmentation data, whereas CNV signals removed in the normalization step cannot be recovered, CODEX opts for a more conservative normalization that, when in doubt, uses a smaller value of K.
load(paste0(projectname,"_",chr,".rda"))
optK = K[which.max(BIC)] # by default or customize given curves
# Log-transformed z-score based on CODEX's normalization results
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc; gc_qc <- qcObj$gc_qc
mapp_qc <- qcObj$mapp_qc; ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat
Yhat <- normObj$Yhat; AIC <- normObj$AIC; BIC <- normObj$BIC
RSS <- normObj$RSS; K <- normObj$K
ref_qc=qcObj$ref_qc # IRanges object for exon target
sampname_qc=qcObj$sampname_qc # sample names
Y_norm=normObj$Yhat[[optK]] # normalized read count under null (no CNV)
plr=log(pmax(Y_qc,0.0001)/pmax(Y_norm,0.0001)) # log transformed z-scores
ngs_plr=lapply(seq_len(ncol(plr)), function(i) plr[,i])
ngs_plr.pos=lapply(seq_len(ncol(plr)),function(x){return(cbind(start(ref_qc),end(ref_qc)))})
save(sampname_qc,ngs_plr,ngs_plr.pos,file=paste0(projectname,'plrObj_',chr,'_',optK,'.rda'))
For sequencing data without SNV callsets in VCF format, we manually call SNVs from the quality-controlled BAM files with the mpileup module in samtools, and calculate the B-allele frequency (BAF) at heterozygous loci by dividing DV (number of high-quality non-reference bases, FORMAT field) by DP (number of high-quality bases, FORMAT field). Example code:
# Prerequisite: samtools, bcftools and a reference fasta file
cd PATH/TO/BAM
for i in *bam; do PATH/TO/SAMTOOLS/samtools mpileup -ugI -t DP -t DV -f PATH/TO/REF/human_hg37.fasta $i | ~/bcftools-1.3.1/bcftools call -cv -O z -o PATH/TO/OUTPUT/$i.vcf.gz; done
We can further extract variant BAF information from the VCF files with the function bambaf_from_vcf.
library(MARATHON)
projectname <- "icnv.demo."
dir <- system.file("extdata", package="iCNV")
bambaf_from_vcf(dir,'bam_vcf.list',chr=22,projectname)
# omit the chr argument to convert all chromosomes 1-22
load(paste0(projectname,'bambaf_22.rda'))
str(ngs_baf)
str(ngs_baf.pos)
Because signal intensity files vary from platform to platform, we set a standard signal intensity file format for each individual:
Name,Chr,POS,sample_1.Log R Ratio,sample_1.B Allele Freq
rs1,22,15462739,-0.096390619874,0.0443964861333
rs2,22,15520991,-0.154103130102,0.963218688965
rs3,22,15780940,-0.110297381878,0.0457459762692
rs4,22,15863717,-0.21270519495,0.957377910614
rs5,22,16532045,-0.0330782271922,0.0300635993481
The columns are rsid (Name), chromosome (Chr), position (POS), log R ratio (sample_1.Log R Ratio), and BAF (sample_1.B Allele Freq). The function get_array_input converts this data input into iCNV's format. For example, in the demo dataset:
# iCNV_array_input
dir <- system.file("extdata", package="iCNV")#PATH/TO/FOLDER where you stored the array data
chr <- 22
projectname <- "icnv.demo."
pattern <- paste0('*.csv.arrayicnv$')
get_array_input(dir,pattern,chr,projectname)
load(paste0(projectname,'array_lrrbaf_',chr,'.rda'))
str(snp_lrr)
str(snp_lrr.pos)
str(snp_baf)
str(snp_baf.pos)
For some SNP array LRR data, we need to apply singular value decomposition (SVD) to remove high-dimensional noise while preserving the low-dimensional signal. Noisy data shows local stripes across samples (visualize with the plot_intensity function). The conventional elbow-point heuristic on the singular values can be used to determine the number of principal components to retain. More details can be found here.
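A minimal denoising sketch, assuming snp_lrr is a list of equal-length LRR vectors (one per sample); the number of retained components k is a placeholder to be chosen from the elbow of the singular values:
# Stack per-sample LRR vectors into a samples-by-SNPs matrix.
lrr <- do.call(rbind, snp_lrr)
s <- svd(lrr)
k <- 5  # placeholder; choose from the elbow of s$d
# Reconstruct the matrix using only the top k singular components.
lrr.denoised <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])
snp_lrr <- lapply(seq_len(nrow(lrr.denoised)), function(i) lrr.denoised[i, ])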
At this step, we should already have the PLR and variant BAF from sequencing, plus the normalized LRR and BAF from the SNP array. Please make sure all inputs are of type list; this is mainly to accommodate the fact that the lengths of ngs_baf and ngs_baf.pos are sample specific. In addition, CNV detection should be performed on a per-chromosome basis.
str(ngs_plr) # List of n vector, each one is the PLR for an exon
str(ngs_plr.pos) # List of n matrix (p x 2), each one is the start and end location for an exon
str(ngs_baf) # List of n vector, each one is the variants BAF from .bam
str(ngs_baf.pos) # List of n vector, each one is the variants BAF position
str(snp_lrr) # List of n vector, each one is the normalized LRR for a SNP
str(snp_lrr.pos) # List of n vector, each one is a SNP position
str(snp_baf) # List of n vector, each one is the BAF for a SNP
str(snp_baf.pos) # List of n vector, each one is the SNP BAF position
projectname <- 'icnv.demo.'
icnv_res0 <- iCNV_detection(ngs_plr,snp_lrr,
ngs_baf,snp_baf,
ngs_plr.pos,snp_lrr.pos,
ngs_baf.pos,snp_baf.pos,
projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 1)
icnv.output <- output_list(icnv_res0,sampname_qc,CN=1,min_size=0,min_gap=0)
# You can alter min size of the CNV you want to keep and merge CNV with gap
# smaller than min gap.
head(icnv.output)
The resulting plot is shown below. Each row is a sample and each column is a hidden state. The color indicates the hidden-state Z-score (a large positive number favors amplification; a large negative number favors deletion). Black dots mark detected amplifications, while white dots mark detected deletions.
For additional functionality of iCNV, see the iCNV notebook.
At this step, we should already have the PLR and variant BAF from sequencing OR the normalized LRR and BAF from the SNP array. Please make sure all inputs are in list form.
NGS-only CNV detection using iCNV. As we showed in the paper, we highly recommend using iCNV for WGS CNV detection.
str(ngs_plr) # List of n vector, each one is the PLR for an exon
str(ngs_plr.pos) # List of n matrix (p x 2), each one is the start and end location for an exon
str(ngs_baf) # List of n vector, each one is the variants BAF from .bam
str(ngs_baf.pos) # List of n vector, each one is the variants BAF position
projectname <- 'icnv.demo.ngs.'
icnv_res0_ngs <- iCNV_detection(ngs_plr=ngs_plr, ngs_baf = ngs_baf,
ngs_plr.pos = ngs_plr.pos,ngs_baf.pos = ngs_baf.pos,
projname=projectname,CN=0,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_ngs,sampname_qc,CN=0)
head(icnv.output)
SNP-array-only CNV detection using iCNV.
str(snp_lrr) # List of n vector, each one is the normalized LRR for a SNP
str(snp_lrr.pos) # List of n vector, each one is a SNP position
str(snp_baf) # List of n vector, each one is the BAF for a SNP
str(snp_baf.pos) # List of n vector, each one is the SNP BAF position
projectname <- 'icnv.demo.snp'
icnv_res0_snp <- iCNV_detection(snp_lrr=snp_lrr, snp_baf = snp_baf,
snp_lrr.pos = snp_lrr.pos,snp_baf.pos = snp_baf.pos,
projname=projectname,CN=1,mu=c(-3,0,2),cap=TRUE,visual = 2)
icnv.output <- output_list(icnv_res0_snp,sampname_qc,CN=1)
head(icnv.output)
CODEX2 (Jiang et al. 2018) builds on CODEX with significantly improved sensitivity for both rare and common variants. It can be applied to two scenarios: the case-control scenario, where the goal is to detect CNVs enriched in the case samples; and the scenario where control samples are not available and the goal is simply to profile all CNVs. CODEX and CODEX2 take as input assembled BAM files, as well as BED files specifying the targets for WES and targeted sequencing, and output normalized read counts and tab-delimited text files with copy number calls.
The figure below illustrates the two experimental designs to which CODEX2 can be applied: (i) a case-control design with a group of negative control samples, where the goal is to detect CNVs disproportionately present in the cases versus the controls; and (ii) detection of all CNVs present across all samples, as in the Exome Aggregation Consortium. The key innovation in CODEX2 is the use of negative control genome regions in a genome-wide latent factor model for sample- and position-specific background correction, and the use of negative control samples, under a case-control design, to further improve background bias estimation under this model. The negative control genome regions defined by CODEX2 are regions that do not harbor common CNVs but are still allowed to harbor rare CNVs; they can be constructed from existing studies or learned from the data.
This step gets the directories of the .bam files, reads in the exon target positions from the .bed file, and gets the sample names. The direct input of CODEX2 includes: bamdir, a vector of the directories of all .bam files; sampname, a column vector of sample names; bedFile, the directory of the .bed file (WES bait file, no header, sorted by start and end positions); and chr, which specifies the chromosome. CODEX2 processes the entire genome chromosome by chromosome; make sure the chromosome formats are consistent between the .bed and .bam files.
library(MARATHON)
library(WES.1KG.WUGSC) # Load Toy data from the 1000 Genomes Project.
dirPath <- system.file("extdata", package = "WES.1KG.WUGSC")
bamFile <- list.files(dirPath, pattern = '*.bam$')
bamdir <- file.path(dirPath, bamFile)
sampname <- as.matrix(read.table(file.path(dirPath, "sampname")))
bedFile <- file.path(dirPath, "chr22_400_to_500.bed")
chr <- 22
bambedObj <- getbambed(bamdir = bamdir, bedFile = bedFile,
sampname = sampname, projectname = "CODEX_demo")
bamdir <- bambedObj$bamdir; sampname <- bambedObj$sampname
ref <- bambedObj$ref; projectname <- bambedObj$projectname; chr <- bambedObj$chr
Getting the raw read depth, GC content and mappability: the read depth matrix, as well as the read lengths across all samples, will be returned.
coverageObj <- getcoverage(bambedObj, mapqthres = 20)
Y <- coverageObj$Y; readlength <- coverageObj$readlength
library(BSgenome.Hsapiens.UCSC.hg19); genome = BSgenome.Hsapiens.UCSC.hg19 # hg19
# library(BSgenome.Hsapiens.UCSC.hg38); genome = BSgenome.Hsapiens.UCSC.hg38 # hg38
gc <- getgc(ref, genome = genome)
mapp <- getmapp(ref, genome = genome)
values(ref) <- cbind(values(ref), DataFrame(gc, mapp))
Quality control: perform a sample-wise and exon-wise quality control procedure on the depth-of-coverage matrix.
qcObj <- qc(Y, sampname, ref, cov_thresh = c(20, 4000),
length_thresh = c(20, 2000), mapp_thresh = 0.9,
gc_thresh = c(20, 80))
Y_qc <- qcObj$Y_qc; sampname_qc <- qcObj$sampname_qc
ref_qc <- qcObj$ref_qc; qcmat <- qcObj$qcmat; gc_qc <- ref_qc$gc
write.table(qcmat, file = paste(projectname, '_qcmat', '.txt', sep=''),
sep = '\t', quote = FALSE, row.names = FALSE)
Running CODEX2: for demonstration purposes, we in silico spiked in CNVs spanning exons 1580-1620 with a population frequency of 40%. There are 90 samples altogether, 36 of which carry the heterozygous deletion. The toy dataset is stored as part of the CODEX2 R package.
Y_qc and gc_qc can be obtained from the sequencing BAM files using the code above. For the case-control scenario, the normal sample index is known (samples without spike-in signals).
# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
Y_qc <- Y_qc_demo
ref_qc <- ref_qc_demo
norm_index <- norm_index_demo # indices for normal samples
gc_qc <- ref_qc$gc
# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
normObj=normalize_codex2_ns(Y_qc = Y_qc, gc_qc = gc_qc,
K = 1:6, norm_index = norm_index, N = N)
Yhat.ns=normObj$Yhat; fGC.hat.ns=normObj$fGC.hat;
beta.hat.ns=normObj$beta.hat; g.hat.ns=normObj$g.hat; h.hat.ns=normObj$h.hat
AIC.ns=normObj$AIC; BIC.ns=normObj$BIC; RSS.ns=normObj$RSS
Choose the number of latent Poisson factors. Use BIC as the model selection metric by default.
choiceofK(AIC.ns, BIC.ns, RSS.ns, K = 1:6 , filename = "codex2_ns_choiceofK.pdf")
Alternatively, we can empirically identify common CNV regions by a first-pass CODEX run: For exons residing in common CNV regions, the s.d. of normalized z-scores across all samples will be large.
# Below are pre-computed demo dataset, stored as part of the CODEX2 R-package.
Y_qc <- Y_qc_demo
ref_qc <- ref_qc_demo
gc_qc <- ref_qc$gc
# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y_qc[apply(Y_qc, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
# Empirically identify common CNV regions from the data.
normObj=normalize_null(Y_qc = Y_qc, gc_qc = gc_qc, K = 1:5, N = N)
z.codex = log(Y_qc/normObj$Yhat[[which.max(normObj$BIC)]])
cnv_index1 = which(apply(z.codex,1,sd)>=0.25)
# This can also be provided by the user as known,
# e.g., from an existing database (DGV or dbVar) or prior knowledge (tumor suppressors or oncogenes).
cnv_index2 <- 1580:1620
normObj <- normalize_codex2_nr(Y_qc = Y_qc, gc_qc = gc_qc,
K = 1:6, cnv_index = cnv_index2,
N = N)
Yhat.nr <- normObj$Yhat; fGC.hat.nr <- normObj$fGC.hat;
beta.hat.nr <- normObj$beta.hat; g.hat.nr <- normObj$g.hat; h.hat.nr <- normObj$h.hat
AIC.nr <- normObj$AIC; BIC.nr <- normObj$BIC; RSS.nr <- normObj$RSS
Run Poisson-likelihood recursive segmentation by CODEX2, with BIC selecting the number of latent factors.
finalcall.codex2.ns <- segmentCBS(Y_qc, Yhat.ns, optK = which.max(BIC.ns),
K = 1:6, sampname_qc = paste('sample',1:ncol(Y_qc),sep=''),
ref_qc = IRanges(start=1:nrow(Y_qc)*100, end=1:nrow(Y_qc)*100+50),
chr = 18, lmax = 200, mode = "integer")
# For CODEX2 with negative control regions, simply change 'ns' to 'nr'.
nrow(finalcall.codex2.ns[finalcall.codex2.ns[,'st_exon']=='1580',])
## [1] 36
nrow(finalcall.codex2.ns) # identified 4 additional CNVs (false positives)
## [1] 40
# these can easily be filtered out by a QC on CNV length in kb or the number of exons; see the sketch below
head(finalcall.codex2.ns[finalcall.codex2.ns[,'st_exon']=='1580',])
## sample_name chr cnv st_bp ed_bp length_kb st_exon ed_exon raw_cov
## cnv1 sample3 18 del 158000 162050 4.051 1580 1620 2063
## cnv2 sample4 18 del 158000 162050 4.051 1580 1620 1583
## cnv3 sample7 18 del 158000 162050 4.051 1580 1620 2310
## cnv4 sample8 18 del 158000 162050 4.051 1580 1620 2086
## cnv5 sample10 18 del 158000 162050 4.051 1580 1620 2271
## cnv6 sample15 18 del 158000 162050 4.051 1580 1620 1264
## norm_cov copy_no lratio mBIC
## cnv1 4055 1 597.641 575.765
## cnv2 3106 1 455.827 433.951
## cnv3 4476 1 636.946 615.070
## cnv4 4072 1 590.199 568.323
## cnv5 4511 1 681.476 659.600
## cnv6 2502 1 374.925 353.049
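Following the comment above, a minimal post-hoc QC sketch with illustrative thresholds: keep only calls spanning more than one exon and longer than 1 kb.
calls <- as.data.frame(finalcall.codex2.ns, stringsAsFactors = FALSE)
n_exon <- as.numeric(calls$ed_exon) - as.numeric(calls$st_exon) + 1  # exons per call
keep <- (n_exon > 1) & (as.numeric(calls$length_kb) > 1)             # illustrative thresholds
finalcall.qc <- calls[keep, , drop = FALSE]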
Finally, it is very helpful to visualize the results in the Integrative Genomics Viewer (IGV). The function convertForIGV takes as input the output of segment (or segmentHMM) from CODEX/CODEX2 and converts it to .seg format, the format preferred by IGV. It takes a single data frame as input, so combine CNV calls from all chromosomes beforehand if necessary. The function writes to file; specify outPath if desired.
Y_qc <- qcObjDemo$Y_qc
Yhat <- normObjDemo$Yhat
BIC <- normObjDemo$BIC
K <- normObjDemo$K
sampname_qc <- qcObjDemo$sampname_qc
ref_qc <- qcObjDemo$ref_qc
chr <- bambedObjDemo$chr
#Add simulated CNV
Y_qc[1:10,10] = 1.5*Y_qc[1:10,10] # add a dup
Y_qc[50:60,10] = .5*Y_qc[50:60,10] # add a del
Y_qc[1:10,15] = 1.5*Y_qc[1:10,15] # add a dup
Y_qc[50:60,15] = .5*Y_qc[50:60,15] # add a del
Y_qc[25:35,15] = .5*Y_qc[25:35,15] # add a del
finalcall <- segment(Y_qc, Yhat, optK = K[which.max(BIC)],
K = K, sampname_qc, ref_qc, chr, lmax = 200, mode = "fraction")
#If necessary, combine (rbind) individual chromosome data frames
# to single data frame
convertForIGV(finalcall, outPath = NULL, filterFraction = TRUE)
For allele-specific copy number estimation in a matched tumor-normal setting, FALCON (Chen et al. 2014) is based on a change-point model on a mixture of two bivariate binomial processes. FALCON takes as input allelic read counts at germline heterozygous loci and outputs ASCN estimates with genome segmentation.
# 1. Index the genome template
bwa index ~/structure/hg19/ucsc.hg19.fasta
# 2. Align reads in .fastq file to the template (sample sequenced at two different lanes)
bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248897_UCF_1014_NoIndex_L008_R1_001.fastq _EGAR00001248897_UCF_1014_NoIndex_L008_R2_001.fastq > _EGAR00001248897_UCF_1014_NoIndex_L008.sam
bwa mem -M -t 16 ucsc.hg19.fasta _EGAR00001248899_UCF1014_NoIndex_L002_R1_001.fastq _EGAR00001248899_UCF1014_NoIndex_L002_R2_001.fastq > _EGAR00001248899_UCF1014_NoIndex_L002.sam
# 3. Convert .sam to .bam and sort
samtools view -bS _EGAR00001248897_UCF_1014_NoIndex_L008.sam > _EGAR00001248897_UCF_1014_NoIndex_L008.bam
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate
samtools view -bS _EGAR00001248899_UCF1014_NoIndex_L002.sam > _EGAR00001248899_UCF1014_NoIndex_L002.bam
line=_EGAR00001248899_UCF1014_NoIndex_L002.bam
java -jar SortSam.jar INPUT=$line OUTPUT=$line.sorted.bam SORT_ORDER=coordinate
# 4. Add read group
line=_EGAR00001248897_UCF_1014_NoIndex_L008.bam
#line=_EGAR00001248899_UCF1014_NoIndex_L002.bam
java -jar AddOrReplaceReadGroups.jar INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGS_UCF1014 RGPL=ILLUMINA RGPU=machine RGSM=UCF1014
samtools index $line.sorted.rg.bam
# 5. Merge two different lanes (sample sequenced at two different lanes)
samtools merge UCF1014.merge.bam _EGAR00001248899_UCF1014_NoIndex_L002.bam.sorted.rg.bam _EGAR00001248897_UCF_1014_NoIndex_L008.bam.sorted.rg.bam
java -jar SortSam.jar INPUT=UCF1014.merge.bam OUTPUT=UCF1014.merge.sorted.bam SORT_ORDER=coordinate
samtools index UCF1014.merge.sorted.bam
# 6. Dedup
java -jar MarkDuplicates.jar INPUT=UCF1014.merge.sorted.bam OUTPUT=UCF1014.merge.sorted.dedup.bam METRICS_FILE=UCF1014.merge.sorted.dedup.metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar BuildBamIndex.jar INPUT=UCF1014.merge.sorted.dedup.bam
# 7. Realign
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.target_intervals.list
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.bam -targetIntervals UCF1014.merge.sorted.dedup.target_intervals.list -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.bam
# 8. Recalibrate
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites 1000G_phase1.indels.hg19.sites.vcf -o UCF1014.merge.sorted.dedup.realigned.recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ucsc.hg19.fasta -I UCF1014.merge.sorted.dedup.realigned.bam -BQSR UCF1014.merge.sorted.dedup.realigned.recal_data.table -o UCF1014.merge.sorted.dedup.realigned.recal.bam
samtools index UCF1014.merge.sorted.dedup.realigned.recal.bam
# 9. GATK HaplotypeCaller
java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I UCF1014.merge.sorted.dedup.realigned.recal.bam -o UCF1014.merge.sorted.dedup.realigned.recal.raw.snps.indels.g.vcf
readVCFforFalcon is a function that generates FALCON input from a VCF file; it takes as input the path to the VCF file. The first nine column names after the header should be c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"), and the following column names should be sample names.
The function outputs a data frame in which each row contains the location of a variant along with read counts and allele information. It is comparable to relapse.demo from FALCON, for ease of use with the QC procedures below.
vcfFile = system.file("extdata", "sample_w_header.vcf", package="MARATHON")
coverageDataDemo = readVCFforFalcon(vcfFile)
head(coverageDataDemo)
## SRR5906250_ReadCount_Alt SRR5906250_ReadCount_Ref
## chrM:152_T/C 9 54
## chrM:195_C/T 0 59
## chrM:410_A/T 0 210
## chrM:2354_C/T 0 108
## chrM:2485_C/T 0 170
## chrM:2759_A/G 5 85
## SRR5906250_ReadCount_Total SRR5906251_ReadCount_Alt
## chrM:152_T/C 63 1
## chrM:195_C/T 59 0
## chrM:410_A/T 210 0
## chrM:2354_C/T 108 0
## chrM:2485_C/T 170 0
## chrM:2759_A/G 90 4
## SRR5906251_ReadCount_Ref SRR5906251_ReadCount_Total
## chrM:152_T/C 192 194
## chrM:195_C/T 241 241
## chrM:410_A/T 250 250
## chrM:2354_C/T 250 250
## chrM:2485_C/T 250 250
## chrM:2759_A/G 244 248
## SRR5906252_ReadCount_Alt SRR5906252_ReadCount_Ref
## chrM:152_T/C 0 249
## chrM:195_C/T 0 250
## chrM:410_A/T 0 249
## chrM:2354_C/T 2 248
## chrM:2485_C/T 0 250
## chrM:2759_A/G 1 249
## SRR5906252_ReadCount_Total SRR5906253_ReadCount_Alt
## chrM:152_T/C 249 1
## chrM:195_C/T 250 0
## chrM:410_A/T 249 0
## chrM:2354_C/T 250 0
## chrM:2485_C/T 250 0
## chrM:2759_A/G 250 0
## SRR5906253_ReadCount_Ref SRR5906253_ReadCount_Total
## chrM:152_T/C 249 250
## chrM:195_C/T 250 250
## chrM:410_A/T 249 249
## chrM:2354_C/T 248 248
## chrM:2485_C/T 249 249
## chrM:2759_A/G 250 250
## Reference_Allele SRR5906250_Allele1 SRR5906250_Allele2
## chrM:152_T/C T T C
## chrM:195_C/T C T T
## chrM:410_A/T A T T
## chrM:2354_C/T C T T
## chrM:2485_C/T C T T
## chrM:2759_A/G A G G
## SRR5906251_Allele1 SRR5906251_Allele2 SRR5906252_Allele1
## chrM:152_T/C C C C
## chrM:195_C/T T T T
## chrM:410_A/T T T T
## chrM:2354_C/T T T T
## chrM:2485_C/T T T T
## chrM:2759_A/G G G G
## SRR5906252_Allele2 SRR5906253_Allele1 SRR5906253_Allele2
## chrM:152_T/C C C C
## chrM:195_C/T T T T
## chrM:410_A/T T T T
## chrM:2354_C/T T T T
## chrM:2485_C/T T T T
## chrM:2759_A/G G G G
## Chromosome Start_position End_position
## chrM:152_T/C chrM 152 152
## chrM:195_C/T chrM 195 195
## chrM:410_A/T chrM 410 410
## chrM:2354_C/T chrM 2354 2354
## chrM:2485_C/T chrM 2485 2485
## chrM:2759_A/G chrM 2759 2759
Calculate the depth ratio (total read counts of tumor versus normal). We restrict our analysis to chromosome 14, where a copy-neutral loss of heterozygosity has previously been reported.
coverageData = relapse.demo
rdep = sum(coverageData$Tumor_ReadCount_Total)/sum(coverageData$Normal_ReadCount_Total)
chr = 14
coverageChr=coverageData[which(coverageData[,'Chromosome']==chr),]
Remove variants with missing genotype.
nonMissing1 = coverageChr[,'Match_Norm_Seq_Allele1']!=' '
nonMissing2 = coverageChr[,'Match_Norm_Seq_Allele2']!=' '
nonMissing3 = coverageChr[,'Reference_Allele']!=' '
nonMissing4 = coverageChr[,'TumorSeq_Allele1']!=' '
nonMissing5 = coverageChr[,'TumorSeq_Allele2']!=' '
coverageChr=coverageChr[nonMissing1 & nonMissing2 & nonMissing3 & nonMissing4 & nonMissing5,]
Get germline heterozygous loci where normal allele1 differs from normal allele2.
coverageChrHet=coverageChr[(as.matrix(coverageChr[,'Match_Norm_Seq_Allele1'])!=as.matrix(coverageChr[,'Match_Norm_Seq_Allele2'])),]
QC procedures to remove false-negative and false-positive variants. The thresholds can be adjusted.
# Remove indels (this can be relaxed but we think indels are harder to call than SNPs).
indelThresh = 1
indel.filter1=nchar(as.matrix(coverageChrHet[,'Reference_Allele']))<=indelThresh
indel.filter2=nchar(as.matrix(coverageChrHet[,'Match_Norm_Seq_Allele1']))<=indelThresh
indel.filter3=nchar(as.matrix(coverageChrHet[,'Match_Norm_Seq_Allele2']))<=indelThresh
indel.filter4=nchar(as.matrix(coverageChrHet[,'TumorSeq_Allele1']))<=indelThresh
indel.filter5=nchar(as.matrix(coverageChrHet[,'TumorSeq_Allele2']))<=indelThresh
coverageChrHet=coverageChrHet[indel.filter1 & indel.filter2 & indel.filter3 & indel.filter4 & indel.filter5,]
# Filter on coverage: total number of reads greater than 30 in both tumor and normal.
coverageThresh = 30
depth.filter1=(coverageChrHet[,"Normal_ReadCount_Ref"]+coverageChrHet[,"Normal_ReadCount_Alt"])>=coverageThresh
depth.filter2=(coverageChrHet[,"Tumor_ReadCount_Ref"]+coverageChrHet[,"Tumor_ReadCount_Alt"])>=coverageThresh
coverageChrHet=coverageChrHet[depth.filter1 & depth.filter2,]
Generate input for FALCON (a data frame with four columns), run FALCON, and view FALCON's segmentation.
readMatrix=as.data.frame(coverageChrHet[,c('Tumor_ReadCount_Ref',
'Tumor_ReadCount_Alt',
'Normal_ReadCount_Ref',
'Normal_ReadCount_Alt')])
colnames(readMatrix)=c('AT','BT','AN','BN')
tauhat = getChangepoints(readMatrix)
cn = getASCN(readMatrix, tauhat=tauhat, rdep = rdep, threshold = 0.3)
view(cn, pos=coverageChrHet[,'Start_position'], rdep = rdep)
From the figure above, we see that: (1) There are small segments that need to be removed; (2) Consecutive segments with similar allelic copy number states need to be combined. Therefore, we need to further curate FALCON’s segmentation results.
if(length(tauhat)>0){
length.thres=10^6 # Threshold for length of segments, in base pair.
delta.cn.thres=0.3 # Threshold of absolute copy number difference between consecutive segments.
falcon.qc.list = falcon.qc(readMatrix = readMatrix,
tauhat = tauhat,
cn = cn,
st_bp = coverageChrHet[,"Start_position"],
end_bp = coverageChrHet[,"End_position"],
rdep = rdep,
length.thres = length.thres,
delta.cn.thres = delta.cn.thres)
tauhat=falcon.qc.list$tauhat
cn=falcon.qc.list$cn
}
# Chromosomal view of QC'ed segmentation results.
view(cn,pos=coverageChrHet[,'Start_position'], rdep = rdep)
The code below generates table output including the genomic locations of the segment boundaries, as well as the mean and standard deviation for each segment. For Canopy's input, we use a bootstrap-based method to estimate the standard deviations of the allele-specific copy numbers.
falconOutput=falcon.output(readMatrix = readMatrix,
tauhat = tauhat,
cn = cn,
st_bp = coverageChrHet[,"Start_position"],
end_bp = coverageChrHet[,"End_position"],
nboot = 5000)
## Running bootstrap for segment 2 ...
falconOutput = cbind(chr=rep(chr,nrow(falconOutput)), falconOutput)
falconOutput
## chr st_snp end_snp st_bp end_bp Minor_copy Major_copy Minor.sd
## 1 14 1 3451 19066021 26963559 1.000 1.000 NA
## 2 14 3451 24617 26963559 107287663 0.277 1.719 8e-04
## Major.sd
## 1 NA
## 2 0.0032
Finally, it is very helpful to visualize the results in the Integrative Genomics Viewer (IGV). The function convertFalconForIGV takes as input the output of falcon.output and converts it to .seg format, the format preferred by IGV. It takes a single data frame as input, so combine CNV calls from all chromosomes beforehand if necessary. The function writes to file; specify outPath if desired.
falconOutput # from the MARATHON notebook section 4.3.2
## chr st_snp end_snp st_bp end_bp Minor_copy Major_copy Minor.sd
## 1 14 1 3451 19066021 26963559 1.000 1.000 NA
## 2 14 3451 24617 26963559 107287663 0.277 1.719 8e-04
## Major.sd
## 1 NA
## 2 0.0032
# Convert FALCON's allele-specific copy number output for IGV visualization
convertFalconForIGV(
falconOutput = falconOutput, sampleName = "sample1", outPath = getwd(), log2 = FALSE)
For WES data, biases and artifacts cannot be fully captured by comparing the tumor sample to the matched normal sample. FALCON-X (Chen et al. 2017) extends FALCON: it takes as input allelic read counts at germline heterozygous loci, together with the total coverage biases at these loci estimated by CODEX2, and outputs ASCN estimates.
The pre-processing pipeline (BWA alignment, sorting, read-group assignment, merging, deduplication, realignment, recalibration, and GATK HaplotypeCaller) is identical to that shown in the FALCON section above.
Below is a demo dataset consisting of 39 tumor-normal paired whole-exome sequencing samples, published in Maxwell et al. (Nature Communications, 2017). We focus on chr17, where copy-neutral loss of heterozygosity has been reported. Below are the allelic reads, genotypes, and genomic locations, which can be extracted from VCF files. The rda files are available for download [here](https://github.com/yuchaojiang/Canopy/tree/master/instruction).
library(MARATHON)
chr=17
head(mymatrix.demo)# genomic locations and SNP info across all loci from chr17
## CHROM POS REF ALT
## 1 17 6115 G C
## 2 17 6157 A G
## 3 17 6188 C T
## 4 17 11830 A G
## 5 17 11837 G A
## 6 17 11843 G A
# genotype in blood (bloodGT1 and bloodGT2) and tumor (tumorGT1 and tumorGT2) across all samples
head(genotype.demo[,1:8])
## samp1_bloodGT1 samp1_bloodGT2 samp1_tumorGT1 samp1_tumorGT2
## [1,] 0 1 0 1
## [2,] 0 1 0 1
## [3,] 0 1 0 1
## [4,] 1 1 1 1
## [5,] 0 0 0 0
## [6,] 0 0 0 0
## samp2_bloodGT1 samp2_bloodGT2 samp2_tumorGT1 samp2_tumorGT2
## [1,] 0 0 0 1
## [2,] 0 0 0 1
## [3,] 0 1 0 1
## [4,] 0 1 0 1
## [5,] 0 0 0 0
## [6,] 0 0 0 0
#allelic reads in blood (AN and BN) and tumor (AT and BT) across all samples
head(reads.demo[,1:8])
## samp1_AN samp1_BN samp1_AT samp1_BT samp2_AN samp2_BN samp2_AT
## [1,] 123 78 122 128 240 8 191
## [2,] 80 54 83 85 194 11 170
## [3,] 35 49 43 49 76 50 80
## [4,] 0 33 1 37 20 26 24
## [5,] 37 0 50 0 49 0 70
## [6,] 39 0 55 0 55 0 79
## samp2_BT
## [1,] 58
## [2,] 59
## [3,] 58
## [4,] 37
## [5,] 0
## [6,] 0
Apply CODEX2 to get the total coverage bias. First we get the GC content from a 50bp window centered at each SNP, as well as the total read depth at each locus.
pos=as.numeric(paste(mymatrix.demo[,'POS']))
ref=GRanges(seqnames=chr, ranges=IRanges(start=pos-25,end=pos+25))
gc <- getgc(ref, genome = genome) # GC content at each locus; 'genome' as loaded in the CODEX section
values(ref) <- cbind(values(ref), DataFrame(gc))
Y = reads.demo[,seq(1,ncol(reads.demo),2)] +
reads.demo[,seq(2,ncol(reads.demo),2)] # total read depth at each locus
colnames(Y)=paste('samp',rep(1:(ncol(Y)/2),each=2),'_',rep(c('N','T'),ncol(Y)/2),sep='')
head(Y[,1:8])
## samp1_N samp1_T samp2_N samp2_T samp3_N samp3_T samp4_N samp4_T
## [1,] 201 250 248 249 158 72 168 247
## [2,] 134 168 205 229 105 41 113 143
## [3,] 84 92 126 138 82 25 69 100
## [4,] 33 38 46 61 57 16 42 50
## [5,] 37 50 49 70 70 16 41 52
## [6,] 39 55 55 79 71 22 40 59
QC procedure – remove loci with low median coverage across samples. Calculate library size factor and normalize by CODEX2.
pos.filter=(apply(Y,1,median)>=20)
pos=pos[pos.filter]; ref=ref[pos.filter]; gc=gc[pos.filter]
genotype.demo=genotype.demo[pos.filter,]; reads.demo=reads.demo[pos.filter,]
mymatrix.demo=mymatrix.demo[pos.filter,]; Y=Y[pos.filter,]
# Estimate library size factor based on genome-wide read depth after QC
Y.nonzero <- Y[apply(Y, 1, function(x){!any(x==0)}),]
pseudo.sample <- apply(Y.nonzero,1,function(x){exp(1/length(x)*sum(log(x)))})
N <- apply(apply(Y.nonzero, 2, function(x){x/pseudo.sample}), 2, median)
normObj=normalize_codex2_ns(Y, gc, K=1:3, norm_index=seq(1,77,2), N = N)
# choiceofK(normObj$AIC,normObj$BIC,normObj$RSS,K=1:3,filename='CODEX2_falconX_demo.pdf')
Yhat=round(normObj$Yhat[[which.max(normObj$BIC)]],0)
Generate input for FALCON-X: allelic read depth and genotype across all loci.
n=39 # total number of samples
for (i in 1:n){
message('Generating input for sample',i,'...\n')
ids = (4*i-3):(4*i)
ids2 = (2*i-1):(2*i)
mydata = as.data.frame(cbind(mymatrix.demo[,1:2],
genotype.demo[,ids],
reads.demo[,ids],
Yhat[,ids2]))
colnames(mydata) = c("chr", "pos", "bloodGT1", "bloodGT2", "tumorGT1",
"tumorGT2", "AN", "BN", "AT", "BT", 'sN','sT')
ids=which(as.numeric(mydata[,3])!=as.numeric(mydata[,4])) # select heterozygous loci
newdata0 = mydata[ids,]
index.na=apply(is.na(newdata0), 1, any)
newdata=newdata0[index.na==FALSE,]
# Remove loci with multiple alternative alleles
mut.alt.filter=(newdata$tumorGT1==newdata$bloodGT1 | newdata$tumorGT1==newdata$bloodGT2) &
(newdata$tumorGT2==newdata$bloodGT1 | newdata$tumorGT2==newdata$bloodGT2)
newdata=newdata[mut.alt.filter,]
# write a text file of germline heterozygous loci, used as input for FALCON-X
write.table(newdata, file=paste("sample",i,"_het.txt",sep=""), quote=F, row.names=F)
}
Apply FALCON-X to generate allele-specific copy number profiles. Note that CODEX2 normalizes total read depth across samples, while FALCON-X profiles ASCN in each sample separately.
k=10 # calling ASCN for the 10th sample
ascn.input=read.table(paste("sample",k,"_het.txt",sep=""),head=T)
readMatrix=ascn.input[,c('AN','BN','AT','BT')]
biasMatrix=ascn.input[,c('sN','sT')]
tauhat = getChangepoints.x(readMatrix, biasMatrix, pos=ascn.input$pos)
cn = getASCN.x(readMatrix, biasMatrix, tauhat=tauhat, pos=ascn.input$pos, threshold = 0.3)
# cn$tauhat would give the indices of change-points.
# cn$ascn would give the estimated allele-specific copy numbers for each segment.
# cn$Haplotype[[i]] would give the estimated haplotype for the major chromosome in segment i
# if this segment has different copy numbers on the two homologous chromosomes.
view(cn, pos=ascn.input$pos)
Further curate FALCON-X’s segmentation, i.e., remove small regions and combine consecutive regions with similar ASCN profiles.
if(length(tauhat)>0){
length.thres=10^6 # Threshold for length of segments, in base pair.
delta.cn.thres=0.3 # Threshold of absolute copy number difference between consecutive segments.
falcon.qc.list = falconx.qc(readMatrix = readMatrix,
biasMatrix = biasMatrix,
tauhat = tauhat,
cn = cn,
st_bp = ascn.input$pos,
end_bp = ascn.input$pos,
length.thres = length.thres,
delta.cn.thres = delta.cn.thres)
tauhat=falcon.qc.list$tauhat
cn=falcon.qc.list$cn
}
view(cn,pos=ascn.input$pos)
Canopy (Jiang et al. 2016) identifies the subclones within a tumor, determines the mutational profiles of these subclones, and infers the tumor's phylogenetic history from NGS data of temporally and/or spatially separated tumor resections from the same patient. Canopy jointly models somatic copy number changes and SNVs in a fashion similar to non-negative matrix factorization, and adopts a Bayesian framework to reconstruct the phylogeny with posterior confidence assessment. Canopy takes as input both the somatic ASCN changes returned by FALCON/FALCON-X and somatic SNVs, and outputs tumor phylogenetic trees with somatic mutations placed along the branches and subclones placed at the leaves.
This is a demo for using the Canopy package in R. Canopy is a statistical framework and computational procedure for identifying subpopulations within a tumor, determining the mutation profiles of each subpopulation, and inferring the tumor’s phylogenetic history. The input to Canopy are variant allele frequencies of somatic single nucleotide alterations (SNAs) along with allele-specific coverage ratios between the tumor and matched normal sample for somatic copy number alterations (CNAs). These quantities can be directly taken from the output of existing software. Canopy provides a general mathematical framework for pooling data across samples and sites to infer the underlying parameters. For SNAs that fall within CNA regions, Canopy infers their temporal ordering and resolves their phase. When there are multiple evolutionary configurations consistent with the data, Canopy outputs all configurations along with their confidence.
Below is an example of reconstructing the tumor phylogeny of a transplantable metastasis model system derived from a heterogeneous human breast cancer cell line, MDA-MB-231. Cancer cells from the parental line MDA-MB-231 were engrafted into mouse hosts, leading to organ-specific metastasis. Mixed cell populations (MCPs) were in vivo selected from either bone or lung metastases and grew into phenotypically stable and metastatically competent cancer cell lines. The parental line as well as the MCP sublines were whole-exome sequenced, with somatic SNAs and CNAs profiled. Canopy is used to infer the metastatic phylogeny. Code for the analysis of this dataset is broken down below with explanations and is further available here.
The input to Canopy are the variant allele frequencies of somatic SNAs along with allele-specific coverage ratios between the tumor and matched normal sample for somatic CNAs. For SNAs, let the matrices R and X be, respectively, the number of reads containing the mutant allele and the total number of reads for each locus across all samples. The ratio R/X is the proportion of reads supporting the mutant allele, known as the variant allele frequency. For CNAs, Canopy directly takes output from allele-specific copy number estimation software, such as FALCON-X or Sequenza. These outputs are in the form of estimated major and minor copy number ratios, denoted by WM and Wm respectively, with corresponding standard errors ϵM and ϵm. Matrix Y specifies whether SNAs are affected by CNAs; matrix C specifies whether CNA regions harbor specific CNAs (this input is only needed if overlapping CNA events are observed).
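Before walking through the pre-processing that produces these inputs, here is a minimal sketch of the core sampling call on the MDA231 demo data that ships with Canopy; the candidate numbers of subclones K and the MCMC settings below are illustrative choices, not definitive recommendations.
library(Canopy)
data(MDA231)
R <- MDA231$R; X <- MDA231$X        # SNA mutant-allele and total read counts
WM <- MDA231$WM; Wm <- MDA231$Wm    # major and minor copy number ratios
epsilonM <- MDA231$epsilonM; epsilonm <- MDA231$epsilonm # their standard errors
C <- MDA231$C; Y <- MDA231$Y        # CNA-region and SNA-CNA overlap indicators
K <- 3:6       # candidate numbers of subclones (illustrative)
numchain <- 15 # number of MCMC chains with random initializations
sampchain <- canopy.sample(R = R, X = X, WM = WM, Wm = Wm,
                           epsilonM = epsilonM, epsilonm = epsilonm,
                           C = C, Y = Y, K = K, numchain = numchain,
                           max.simrun = 50000, min.simrun = 10000,
                           writeskip = 200, projectname = 'MDA231',
                           cell.line = TRUE, plot.likelihood = TRUE)
# Downstream: canopy.BIC to choose K, canopy.post for posterior evaluation,
# and canopy.plottree to visualize the inferred phylogeny.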
# 1. Index the genome template
bwa index $GENOME_DIR/hg19/ucsc.hg19.fasta
# 2. Align reads in .fastq file to the template
bwa mem -M -t 16 $GENOME_DIR/ucsc.hg19.fasta sampleRead1.fastq \
sampleRead2.fastq > sample.sam
# 3. Convert .sam to .bam and sort
samtools view -bS sample.sam > sample.bam
java -jar $PATH_PICARD/picard.jar SortSam \
INPUT=sample.bam \
OUTPUT=sample.sorted.bam SORT_ORDER=coordinate
# 4. Add read group and index
java -jar $PATH_PICARD/picard.jar AddOrReplaceReadGroups \
INPUT=sample.sorted.bam OUTPUT=sample.sorted.rg.bam \
RGID=sample RGLB=myRGLB RGPL=myRGPL RGPU=myRGPU RGSM=myRGSM
samtools index sample.sorted.rg.bam
# 5. *Not Run (Merge two different lanes if sample sequenced at two different lanes),
# sort, index
#samtools merge sampleMerge.bam \
# sampleLane1.sorted.rg.bam sampleLane2.sorted.rg.bam
#java -jar SortSam.jar INPUT=sampleMerge.bam \
# OUTPUT=sampleMerge.sorted.bam SORT_ORDER=coordinate
# 6. Dedup, index
java -jar $PATH_PICARD/picard.jar MarkDuplicates \
INPUT=sample.sorted.rg.bam OUTPUT=sample.sorted.dedup.bam \
METRICS_FILE=sample.sorted.dedup.metrics.txt \
PROGRAM_RECORD_ID=MarkDuplicates \
PROGRAM_GROUP_VERSION=null \
PROGRAM_GROUP_NAME=MarkDuplicates
java -jar $PATH_PICARD/picard.jar BuildBamIndex \
INPUT=sample.sorted.dedup.bam
# 7. Realign
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R $GENOME_DIR/ucsc.hg19.fasta \
-I sample.sorted.dedup.bam \
-known $GENOME_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-known $GENOME_DIR/1000G_phase1.indels.hg19.sites.vcf \
-o sample.sorted.dedup.target_intervals.list
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R $GENOME_DIR/ucsc.hg19.fasta \
-I sample.sorted.dedup.bam \
-targetIntervals sample.sorted.dedup.target_intervals.list \
-known $GENOME_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-known $GENOME_DIR/1000G_phase1.indels.hg19.sites.vcf \
-o sample.sorted.dedup.realigned.bam
# 8. Recalibrate Base Quality Scores
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
-T BaseRecalibrator \
-R $GENOME_DIR/ucsc.hg19.fasta \
-I sample.sorted.dedup.realigned.bam \
-knownSites $GENOME_DIR/dbsnp_138.hg19.vcf \
-knownSites $VCF_DIR/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
-knownSites $VCF_DIR/1000G_phase1.indels.hg19.sites.vcf \
-o sample.sorted.dedup.realigned.recal_data.table
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
-T PrintReads \
-R $GENOME_DIR/ucsc.hg19.fasta \
-I sample.sorted.dedup.realigned.bam \
-BQSR sample.sorted.dedup.realigned.recal_data.table \
-o sample.sorted.dedup.realigned.recal.bam
samtools index sample.sorted.dedup.realigned.recal.bam
# 9. GATK UnifiedGenotyper
java -jar $PATH_GATK/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R $GENOME_DIR/ucsc.hg19.fasta \
-I sample.sorted.dedup.realigned.recal.bam \
-o sample.sorted.dedup.realigned.recal.raw.snps.indels.g.vcf
After mutation calling and QC procedures, the number of mutated reads will be stored in matrix R while the total number of reads will be stored in matrix X across all mutational loci in all samples.
If there are more than two samples, a heatmap can be used for visualization. The right panel in the figure below shows the heatmap of VAFs across 63 somatic mutations in 10 spatially separated tumor slices of an ovarian cancer patient from Bashashati et al. (The Journal of Pathology, 2013). The heatmap below is plotted using the pheatmap R package.
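A minimal plotting sketch, assuming R and X are the mutant-allele and total read count matrices described above (mutations in rows, samples in columns):
library(pheatmap)
vaf <- R / X # variant allele frequencies
pheatmap(vaf, cluster_rows = TRUE, cluster_cols = TRUE,
         main = 'VAFs across mutations and samples')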
readVCFforCanopy is a function that generates Canopy input from a VCF file; it takes the path to the VCF file as its input. The first 9 column names after the header should be c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"), and the following column names should be the sample names. The function outputs a list containing R (alternate-allele read counts), X (total read counts), vcfTargets (gene target information from the VCF), GenotypeMatrix, and ReadsMatrix.
GenotypeMatrix is an SNP-by-sample matrix giving the genotype at each position (0: homozygous reference; 1: heterozygous; 2: homozygous alternate).
ReadsMatrix is a matrix whose rows are positions; each sample contributes two columns, the reference allele count followed by the alternate allele count.
vcfFile = system.file("extdata", "sample_w_header.vcf", package="MARATHON")
canopyInput = readVCFforCanopy(vcfFile)
lapply(canopyInput, head)
## $R
## SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C 54 192 249 249
## chrM:195_C/T 59 241 250 250
## chrM:410_A/T 210 250 249 249
## chrM:2354_C/T 108 250 248 248
## chrM:2485_C/T 170 250 250 249
## chrM:2759_A/G 85 244 249 250
##
## $X
## SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C 63 194 249 250
## chrM:195_C/T 59 241 250 250
## chrM:410_A/T 210 250 249 249
## chrM:2354_C/T 108 250 250 248
## chrM:2485_C/T 170 250 250 249
## chrM:2759_A/G 90 248 250 250
##
## $vcfTargets
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chrM:152_T/C chrM 152 * | <NA> T
## chrM:195_C/T chrM 195 * | <NA> C
## chrM:410_A/T chrM 410 * | <NA> A
## chrM:2354_C/T chrM 2354 * | <NA> C
## chrM:2485_C/T chrM 2485 * | <NA> C
## chrM:2759_A/G chrM 2759 * | <NA> A
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## chrM:152_T/C C 28617.33 .
## chrM:195_C/T T 31116.35 .
## chrM:410_A/T T 37789.35 .
## chrM:2354_C/T T 32179.35 .
## chrM:2485_C/T T 34291.35 .
## chrM:2759_A/G G 31965.36 .
## -------
## seqinfo: 93 sequences from hg19 genome
##
## $GenotypeMatrix
## SRR5906250 SRR5906251 SRR5906252 SRR5906253
## chrM:152_T/C 1 2 2 2
## chrM:195_C/T 2 2 2 2
## chrM:410_A/T 2 2 2 2
## chrM:2354_C/T 2 2 2 2
## chrM:2485_C/T 2 2 2 2
## chrM:2759_A/G 2 2 2 2
##
## $ReadsMatrix
## SRR5906250_Ref SRR5906250_Alt SRR5906251_Ref SRR5906251_Alt
## chrM:152_T/C 9 54 1 192
## chrM:195_C/T 0 59 0 241
## chrM:410_A/T 0 210 0 250
## chrM:2354_C/T 0 108 0 250
## chrM:2485_C/T 0 170 0 250
## chrM:2759_A/G 5 85 4 244
## SRR5906252_Ref SRR5906252_Alt SRR5906253_Ref SRR5906253_Alt
## chrM:152_T/C 0 249 1 249
## chrM:195_C/T 0 250 0 250
## chrM:410_A/T 0 249 0 249
## chrM:2354_C/T 2 248 0 248
## chrM:2485_C/T 0 250 0 249
## chrM:2759_A/G 1 249 0 250
To generate allele-specific copy number calls, Sequenza or FALCON can be used. Both methods require paired tumor-normal samples as input.
FALCON-X is for ASCN profiling from WES of one or more tumor samples with multiple (>= 20) normal controls, which do not need to be matched. See the corresponding section above in the MARATHON pipeline.
FALCON is for ASCN profiling from WGS of paired tumor-normal samples. See the corresponding section above in the MARATHON pipeline.
Sequenza estimates allele-specific copy numbers as well as tumor purity and ploidy using B-allele frequencies and depth ratios from paired tumor-normal sequencing data. Below are Sequenza outputs from WES of the normal, primary tumor, and relapse genomes of a neuroblastoma patient from Eleveld et al. (Nature Genetics, 2015).
Data visualization and QC procedures on Sequenza's segmentation output are strongly recommended. Below is the genome-wide view from Sequenza as a sanity check: the top panel is the primary tumor and the bottom panel is the relapse. The red and blue lines correspond to the major and minor copy numbers, respectively. The large chromosome-arm-level deletions and duplications are fairly concordant between the primary tumor and the relapse genome (a reassuring sanity check), while the relapse gained additional CNAs at the second timepoint. The small CNA events, however, are most likely false positives and should be filtered out through QC procedures, e.g., by keeping only segments that:
* have at least 100 heterozygous loci;
* are at least 5 Mb long;
* do not have extreme estimated copy numbers (e.g., total copy number >= 6);
* …
* …
Additional curation is also needed to generate clean CNA segments. For example, chr17q is split into three segments in the primary tumor but should be a single large duplication (as can be seen by comparison against the relapse). Furthermore, the breakpoints of the same CNA event sometimes differ between the primary tumor and the relapse genome; such segments need to be merged and should NOT be treated as separate events.
Figure: primary tumor Sequenza segmentation, genome-wide view.
Figure: relapse genome Sequenza segmentation, genome-wide view.
Canopy does not require integer allele-specific copy numbers as input. While Sequenza infers tumor purity and ploidy and outputs integer-valued allele-specific copy numbers, Canopy directly takes as input the fractional allele-specific copy numbers, which can be obtained from the segmentation output (see above). The B-allele frequency is Bf = Wm / (WM + Wm) and the depth ratio is depth.ratio = (WM + Wm) / 2. From these, the input matrices WM and Wm can be calculated. The standard errors of the estimated copy numbers, epsilonM and epsilonm, can be left at Canopy's default value or taken as the sd.ratio from Sequenza's output, assuming the two are equal. NOTE: epsilonM and epsilonm should NOT be sd(WM) and sd(Wm). The former are matrices of dimension (number of CNA events) x (number of samples), unless set to a common scalar by default; the latter are the standard deviations of the copy numbers across all events in all samples.
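As a minimal sketch of this conversion for one sample (the file name is hypothetical; the columns Bf, depth.ratio, and sd.ratio are as defined above):
# Sketch: back out WM and Wm from Sequenza's segmentation output
seg = read.table('sample_segments.txt', header = TRUE, sep = '\t')
total = 2 * seg$depth.ratio       # since depth.ratio = (WM + Wm) / 2
Wm.sample = seg$Bf * total        # since Bf = Wm / (WM + Wm)
WM.sample = (1 - seg$Bf) * total
# epsilonM and epsilonm can be left at Canopy's default, or taken as sd.ratio:
epsilonM.sample = epsilonm.sample = seg$sd.ratio
# Repeat per sample and cbind() the per-sample vectors into
# (CNA events) x (samples) matrices WM, Wm, epsilonM, epsilonm.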
Generating a clean set of inputs to Canopy is important and non-trivial. Our phylogeny reconstruction paper did not attempt to solve the separate, tricky problem of quality control for point mutation profiling and copy number estimation, but an input with too many false positives will only lead to "garbage in, garbage out" from Canopy. We are currently working on automating the pipeline for generating CNA and SNA input, as well as offering guidance on selecting informative SNAs and CNAs. By informative, we mean SNAs or CNAs that show distinct patterns between different samples (from the same patient, since we are looking at intratumor heterogeneity). For SNAs, this means the observed VAFs differ between samples (see Figure 4B in our paper); a heatmap is a good visualization in this case. For CNAs, this means WM and Wm differ between samples (see Supplementary Figure S13 in our paper); we find IGV a good visualization tool and recommend focusing on large CNA regions, which helps remove false calls and speeds up computation.
Just like SNAs, there will likely be CNAs that are carried by small fractions of cells, or that reside in hard-to-call regions of the genome, and these will not be detected. The former scenario, in particular, includes CNAs that may be informative about rare subclones. We do not assume that the CNAs (and SNAs) given to Canopy comprise all mutations carried by the sample and, similarly, do not claim that Canopy detects all subclones or resolves all branches of the evolutionary tree. Our goal is only to estimate the subclones that are represented among the set of input CNAs and SNAs, which are inherently limited in resolution by the experimental protocol (sequencing platform and coverage, number of tumor slices, etc.). We believe this is the best that can be achieved under current settings.
Below is the demo input data from project MDA231 (the first case study in our paper).
library(MARATHON)
data("MDA231")
projectname = MDA231$projectname ## name of project
R = MDA231$R; R ## mutant allele read depth (for SNAs)
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF 155 59 136 77
## KRAS 44 21 54 19
## ALPK2 37 17 28 10
## RYR1 44 0 26 0
## MCP3481_lung
## BRAF 49
## KRAS 17
## ALPK2 7
## RYR1 0
X = MDA231$X; X ## total depth (for SNAs)
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF 157 111 177 146
## KRAS 44 30 64 42
## ALPK2 63 17 65 24
## RYR1 107 56 165 55
## MCP3481_lung
## BRAF 71
## KRAS 27
## ALPK2 7
## RYR1 43
WM = MDA231$WM; WM ## observed major copy number (for CNA regions)
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## chr7 2.998 2.002 2.603 2.000
## chr12 1.998 1.998 1.603 1.001
## chr18 1.000 2.992 1.000 1.002
## chr19 2.000 2.000 2.000 2.000
## MCP3481_lung
## chr7 2.001
## chr12 1.999
## chr18 2.996
## chr19 2.000
Wm = MDA231$Wm; Wm ## observed minor copy number (for CNA regions)
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## chr7 0.002 0.998 0.397 1.000
## chr12 0.002 0.998 0.397 1.000
## chr18 1.000 0.004 1.000 0.999
## chr19 1.000 1.000 1.000 1.000
## MCP3481_lung
## chr7 0.999
## chr12 0.999
## chr18 0.002
## chr19 1.000
epsilonM = MDA231$epsilonM ## standard error of WM, pre-fixed here
epsilonm = MDA231$epsilonm ## standard error of Wm, pre-fixed here
## Matrix C specifies whether CNA regions harbor specific CNAs;
## only needed if overlapping CNAs are observed, specifying which CNAs overlap
C = MDA231$C; C
## chr7_1 chr7_2 chr12_1 chr12_2 chr18 chr19
## chr7 1 1 0 0 0 0
## chr12 0 0 1 1 0 0
## chr18 0 0 0 0 1 0
## chr19 0 0 0 0 0 1
Y = MDA231$Y; Y ## whether SNAs are affected by CNAs
## non-cna_region chr7 chr12 chr18 chr19
## BRAF 0 1 0 0 0
## KRAS 0 0 1 0 0
## ALPK2 0 0 0 1 0
## RYR1 0 0 0 0 1
A multivariate binomial mixture clustering step can be applied to the SNAs before MCMC sampling. We show in our paper, via simulations, that this pre-clustering step helps the Markov chain converge faster with smaller estimation error (especially when the mutations show clear cluster patterns on visualization). The clustering step can also remove likely false positives before the mutations are fed to the MCMC algorithm.
Below is a toy example, where three bulk tumor samples were simulated in silico from a tree of 4 clones/leaves. The 5 tree segments (excluding the leftmost branch, which corresponds to the normal clone) separate 200 mutations into 5 mutation clusters. More detailed demo code for clustering can be found here. Detailed methods can be found in the supplement of our paper under the section Binomial mixture clustering. BIC is used for model selection. 2D (two longitudinal/spatial samples) or 3D (three samples) plots are generated for visualization.
data(toy3)
R=toy3$R; X=toy3$X # 200 mutations across 3 samples
num_cluster=2:9 # Range of number of clusters to run
num_run=10 # How many EM runs per clustering step for each mutation cluster wave
canopy.cluster=canopy.cluster(R = R,
X = X,
num_cluster = num_cluster,
num_run = num_run)
## Running EM with 2 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 2 clusters
## Running EM with 3 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 3 clusters
## Running EM with 4 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 4 clusters
## Running EM with 5 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 5 clusters
## Running EM with 6 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 6 clusters
## Running EM with 7 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 7 clusters
## Running EM with 8 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 8 clusters
## Running EM with 9 clusters... 1 2 3 4 5 6 7 8 9 10
## EM converged in 10 out of 10 runs with 9 clusters
bic_output=canopy.cluster$bic_output # BIC for model selection (# of clusters)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation
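As a quick sketch (assuming bic_output holds one BIC value per candidate in num_cluster), the number of clusters can be chosen by maximizing BIC:
# Sketch: pick the number of SNA clusters by maximizing BIC
plot(num_cluster, bic_output, type = 'b',
     xlab = 'Number of mutation clusters', ylab = 'BIC')
opt_cluster = num_cluster[which.max(bic_output)]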
Below is a real dataset from Ding et al. (Nature 2012), where a leukemia patient was sequenced at two timepoints: the primary tumor (sample 1) and the relapse genome (sample 2). Real data are noisier and can contain false-positive somatic mutation calls. We thus include in the mixture a multivariate uniform component on the unit interval, corresponding to mutations that have high standard errors during sequencing or that are likely false positives. The code and SNA input for this dataset can be found here.
data(AML43)
R=AML43$R; X=AML43$X
num_cluster=4 # Number of mutation clusters to fit (fixed here)
num_run=6 # How many EM runs per clustering step for each mutation cluster wave
Tau_Kplus1=0.05 # Pre-specified proportion of noise component
Mu.init=cbind(c(0.01,0.15,0.25,0.45),
c(0.2,0.2,0.01,0.2)) # Initial value for centroid
canopy.cluster=canopy.cluster(R = R,
X = X,
num_cluster = num_cluster,
num_run = num_run,
Mu.init = Mu.init,
Tau_Kplus1=Tau_Kplus1)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation
R.qc=R[sna_cluster<=4,] # exclude mutations in the noise cluster
X.qc=X[sna_cluster<=4,]
sna_cluster.qc=sna_cluster[sna_cluster<=4]
R.cluster=round(Mu*100) # Generate pseudo-SNAs corresponding to each cluster
X.cluster=pmax(R.cluster,100) # Total depth set at 100; the median depth can be used instead
rownames(R.cluster)=rownames(X.cluster)=paste('SNA.cluster',1:4,sep='')
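The QC'ed mutations and their cluster assignments can then be fed to the clustered sampler; below is a sketch (parameter values illustrative) mirroring the canopy.sample.cluster.nocna call used for the toy3 dataset later in this document:
# Sketch: sample trees from the pre-clustered, QC'ed SNAs (no CNA input)
sampchain = canopy.sample.cluster.nocna(R = R.qc, X = X.qc,
                sna_cluster = sna_cluster.qc,
                K = 2:4, numchain = 15,
                max.simrun = 100000, min.simrun = 20000,
                writeskip = 200, projectname = 'AML43',
                cell.line = FALSE, plot.likelihood = TRUE)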
Each sampled tree is modeled as a list by Canopy. Below are the tree elements of the most likely tree from the MDA231 project (the first case study in the paper). The most likely tree is obtained from the posterior distribution in the tree space from the MCMC sampling. How to visualize/plot the sampled trees is discussed later.
data('MDA231_tree')
MDA231_tree$Z # Z matrix specifies the position of the SNAs along the tree branch
## clone1 clone2 clone3 clone4
## BRAF 0 1 1 1
## KRAS 0 1 1 1
## ALPK2 0 1 1 1
## RYR1 0 0 1 0
MDA231_tree$cna.copy # major and minor copy numbers (integer values) for each CNA
## chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## major_copy 3 2 2 2 3 2
## minor_copy 0 1 1 0 0 1
MDA231_tree$CM # Major copy per clone for each CNA
## clone1 clone2 clone3 clone4
## chr7 1 2 3 2
## chr12 1 1 2 2
## chr18 1 1 1 3
## chr19 1 2 2 2
MDA231_tree$Cm # Minor copy per clone for each CNA
## clone1 clone2 clone3 clone4
## chr7 1 1 0 1
## chr12 1 1 0 1
## chr18 1 1 1 0
## chr19 1 1 1 1
MDA231_tree$Q # whether an SNA precedes a CNA
## chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## BRAF 1 1 0 0 0 0
## KRAS 0 0 1 1 0 0
## ALPK2 0 0 0 0 1 0
## RYR1 0 0 0 0 0 0
MDA231_tree$H # if an SNA precedes a CNA, whether it resides in the major copy
## chr7_LOH chr7_dup chr12_dup chr12_LOH chr18_LOH chr19_dup
## BRAF 1 1 0 0 0 0
## KRAS 0 0 1 1 0 0
## ALPK2 0 0 0 0 1 0
## RYR1 0 0 0 0 0 0
MDA231_tree$P # clonal composition for each sample
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## clone1 0.000 0.000 0.000 0.000
## clone2 0.006 0.000 0.399 0.992
## clone3 0.991 0.001 0.600 0.001
## clone4 0.003 0.999 0.001 0.007
## MCP3481_lung
## clone1 0.000
## clone2 0.003
## clone3 0.001
## clone4 0.996
MDA231_tree$VAF # VAF based on current tree structure
## MCP1833_bone MCP1834_lung MCP2287_bone MDA-MB-231_parental
## BRAF 0.997 0.667 0.867 0.667
## KRAS 0.996 0.667 0.800 0.502
## ALPK2 0.502 1.000 0.501 0.505
## RYR1 0.330 0.000 0.200 0.000
## MCP3481_lung
## BRAF 0.667
## KRAS 0.667
## ALPK2 0.999
## RYR1 0.000
Canopy samples in subtree space with varying numbers of subclones (denoted K) by a Markov chain Monte Carlo (MCMC) method. A plot of the posterior likelihood (in pdf format) is generated for each subtree space; we recommend using this plot as a sanity check for sampling convergence and for choosing the number of burn-in and thinning iterations accordingly. Note that this step can be time-consuming, especially with a larger number of chains (numchain specifies the number of chains with random initializations; a larger value helps avoid getting stuck in local optima) and longer chains (simrun specifies the number of iterations per chain). MCMC sampling is the most computationally heavy step in Canopy; it is recommended to run jobs in parallel on a high-performance cluster.
There are four modes of MCMC sampling embedded in Canopy: (1) canopy.sample, which takes both SNA and CNA input (the default); (2) canopy.sample.nocna, for cases with no CNA input; (3) canopy.sample.cluster, for cases where SNAs are pre-clustered by the binomial mixture EM algorithm; and (4) canopy.sample.cluster.nocna, for cases with no CNA input and SNAs pre-clustered by the binomial mixture EM algorithm. More details can be found here.
Below is sampling code for the MDA231 dataset (see section 4.5.1) where both SNA and CNA are used as input.
K = 3:6 # number of subclones
numchain = 20 # number of chains with random initiations
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
epsilonm = epsilonm, C = C, Y = Y, K = K, numchain = numchain,
max.simrun = 50000, min.simrun = 10000,
writeskip = 200, projectname = projectname, cell.line = TRUE,
plot.likelihood = TRUE)
#save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
# compress = 'xz')
length(sampchain) ## number of subtree spaces (K=3:6)
## [1] 4
length(sampchain[[which(K==4)]]) ## number of chains for subtree space with 4 subclones
## [1] 20
length(sampchain[[which(K==4)]][[1]]) ## number of posterior trees in each chain
## [1] 250
Canopy uses BIC as a model selection criterion to determine the optimal number of subclones. The first burnin iterations are removed and, of the remainder, every thin-th iteration is kept.
burnin = 100
thin = 5 # If there is error in the bic and canopy.post step below, make sure
# burnin and thinning parameters are wisely selected so that there are
# posterior trees left.
bic = canopy.BIC(sampchain = sampchain, projectname = projectname, K = K,
numchain = numchain, burnin = burnin, thin = thin, pdf = TRUE)
## k = 3 : mean likelihood -17243.57 ; BIC -34650.33 .
## k = 4 : mean likelihood -548.2876 ; BIC -1305.67 .
## k = 5 : mean likelihood -611.4774 ; BIC -1477.948 .
## k = 6 : mean likelihood -586.8295 ; BIC -1474.551 .
optK = K[which.max(bic)]
To determine whether the Markov chain has run long enough, we examine its stability. The following function provides insight. The right panel is a zoomed view of the left. On the left, all chains appear stable only because the scale is so large; the zoomed view on the right gives greater detail and a better basis for judging stability. Make sure writeskip here matches your setting in canopy.sample(). You can change the yRange setting to expand the view so that more chains are visible.
On the right, we see stability for the highest-likelihood chains over a period of 50,000 trees; that is, the most recent value is not greater than the mean of the past 50,000 trees. If values are still climbing, the chain should be run longer by adjusting min.simrun / max.simrun.
canopy.simrun.diagnostic(sampchain, optK, K, writeskip = 1000, yRange = 100)
Parallel computing is easy: each chain can run simultaneously given a sufficient number of available CPU cores. Simply use canopy.sample.parallel() instead of canopy.sample(); all parameters are identical.
If using high-performance clusters, it may be necessary to request additional memory, as the sampchain files are large. Alternatively, one can raise the value of writeskip, which pre-thins the written record of trees. If you raise writeskip, you may need to lower the thin parameter for canopy.BIC() and canopy.post() to ensure that enough trees are kept (see the arithmetic sketch after the code below).
sampchain = canopy.sample.parallel(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
epsilonm = epsilonm, C = C, Y = Y, K = K, numchain = numchain,
max.simrun = 50000, min.simrun = 10000,
writeskip = 1000, projectname = projectname, cell.line = TRUE,
plot.likelihood = TRUE)
#save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
# compress = 'xz')
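To make the writeskip/burnin/thin bookkeeping concrete, here is the arithmetic (a sketch assuming, consistent with the runs above, that burnin and thin are counted in stored trees):
# Trees stored per chain = iterations / writeskip
50000 / 200      # = 250 trees per chain (the canopy.sample run above)
50000 / 1000     # = 50 trees per chain with writeskip = 1000
# Trees kept per chain after burn-in and thinning, e.g. burnin = 100, thin = 5:
(250 - 100) / 5  # = 30 posterior trees per chain
# With only 50 stored trees, burnin = 100 would leave none; lower burnin
# and thin when writeskip is raised.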
Canopy then runs a posterior evaluation of all trees sampled by MCMC. If the modes of the posterior probabilities (second column of config.summary) are not obvious, check whether the algorithm has converged (and run the sampling longer if not). The first burnin iterations are removed and, of the remainder, every thin-th iteration is kept.
post = canopy.post(sampchain = sampchain, projectname = projectname, K = K,
numchain = numchain, burnin = burnin, thin = thin, optK = optK,
C = C, post.config.cutoff = 0.05)
samptreethin = post[[1]] # list of all post-burnin and thinning trees
samptreethin.lik = post[[2]] # likelihoods of trees in samptree
config = post[[3]] # configuration for each posterior tree
config.summary = post[[4]] # configuration summary
print(config.summary)
## Configuration Post_prob Mean_post_lik
## [1,] 1 0.562 -547.55
## [2,] 2 0.066 -549.68
## [3,] 3 0.073 -548.62
## [4,] 4 0.146 -547.68
## [5,] 5 0.153 -547.80
# first column: tree configuration
# second column: posterior configuration probability in the entire tree space
# third column: posterior configuration likelihood in the subtree space
One can then use Canopy to output and plot the most likely tree (i.e., the tree with the highest posterior likelihood). The mutations, clonal frequencies, tree topology, etc., of this tree are obtained from the posterior distributions of the subtree space over trees sharing the same configuration. In our MDA231 example, the most likely tree is the tree with configuration 1 (see the config.summary output above). Note: a separate txt file can be generated (with txt = TRUE and txt.name = '*.txt') if the figure legend of mutational profiles (the text below the phylogenetic tree) is too long to fit entirely in the plot.
config.i = config.summary[which.max(config.summary[,3]),1]
cat('Configuration', config.i, 'has the highest posterior likelihood!\n')
# plot the most likely tree in the posterior tree space
output.tree = canopy.output(post, config.i, C)
canopy.plottree(output.tree)
# plot the tree with configuration 1 in the posterior tree space
output.tree = canopy.output(post, 1, C)
canopy.plottree(output.tree, pdf=TRUE, pdf.name =
paste(projectname,'_first_config.pdf',sep=''))
## Configuration 1 has the highest posterior likelihood!
Now try Canopy yourself using the simulated toy dataset below! Note that no overlapping CNAs are used as input and thus matrix C doesn’t need to be specified.
library(MARATHON)
data(toy)
projectname = 'toy'
R = toy$R; X = toy$X; WM = toy$WM; Wm = toy$Wm
epsilonM = toy$epsilonM; epsilonm = toy$epsilonm; Y = toy$Y
K = 3:6; numchain = 10
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
epsilonm = epsilonm, C = NULL, Y = Y, K = K,
numchain = numchain, max.simrun = 100000,
min.simrun = 10000, writeskip = 200,
projectname = projectname, cell.line = FALSE,
plot.likelihood = TRUE)
The most likely tree is shown below. There should be only one tree configuration from the posterior tree space. The code for this toy dataset analysis can be found here.
The second toy example has a different tree topology. Feel free to try Canopy on this dataset too! There should also be just one tree configuration, shown below, from the posterior tree space. The code for this toy dataset analysis can be found here.
library(MARATHON)
data(toy2)
projectname = 'toy2'
R = toy2$R; X = toy2$X; WM = toy2$WM; Wm = toy2$Wm
epsilonM = toy2$epsilonM; epsilonm = toy2$epsilonm; Y = toy2$Y
true.tree = toy2$true.tree # true underlying tree
K = 3:6; numchain = 15
sampchain = canopy.sample(R = R, X = X, WM = WM, Wm = Wm, epsilonM = epsilonM,
epsilonm = epsilonm, C = NULL, Y = Y, K = K,
numchain = numchain, max.simrun = 100000,
min.simrun = 10000, writeskip = 200,
projectname = projectname, cell.line = FALSE,
plot.likelihood = TRUE)
The third toy example consists of three bulk tumor samples, simulated in silico from a tree of 4 clones/leaves. The 5 tree segments (excluding the leftmost branch, which corresponds to the normal clone) separate 200 mutations into 5 mutation clusters. The SNA clustering details are outlined in the section "Binomial clustering of SNAs". The code for this toy dataset analysis is briefly attached below; more detailed code with visualization and posterior analysis can be found here.
library(MARATHON)
data(toy3)
R=toy3$R; X=toy3$X
num_cluster=2:9 # Range of number of clusters to run
num_run=10 # How many EM runs per clustering step for each mutation cluster wave
canopy.cluster=canopy.cluster(R = R,
X = X,
num_cluster = num_cluster,
num_run = num_run)
bic_output=canopy.cluster$bic_output # BIC for model selection (# of clusters)
Mu=canopy.cluster$Mu # VAF centroid for each cluster
Tau=canopy.cluster$Tau # Prior for mutation cluster, with a K+1 component
sna_cluster=canopy.cluster$sna_cluster # cluster identity for each mutation
projectname='toy3'
K = 3:5 # number of subclones
numchain = 15 # number of chains with random initiations
sampchain = canopy.sample.cluster.nocna(R = R, X = X, sna_cluster = sna_cluster,
K = K, numchain = numchain,
max.simrun = 100000, min.simrun = 20000,
writeskip = 200, projectname = projectname,
cell.line = FALSE, plot.likelihood = TRUE)
save.image(file = paste(projectname, '_postmcmc_image.rda',sep=''),
compress = 'xz')
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] WES.1KG.WUGSC_1.8.0 CODEX_1.8.0
## [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.0
## [5] rtracklayer_1.36.4 Rsamtools_1.28.0
## [7] Biostrings_2.44.1 XVector_0.16.0
## [9] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
## [11] IRanges_2.10.2 S4Vectors_0.14.3
## [13] BiocGenerics_0.22.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.14 knitr_1.17
## [3] magrittr_1.5 GenomicAlignments_1.12.1
## [5] zlibbioc_1.22.0 BiocParallel_1.10.1
## [7] lattice_0.20-35 stringr_1.2.0
## [9] tools_3.4.3 grid_3.4.3
## [11] SummarizedExperiment_1.6.3 Biobase_2.36.2
## [13] matrixStats_0.52.2 htmltools_0.3.6
## [15] yaml_2.1.15 rprojroot_1.2
## [17] digest_0.6.12 Matrix_1.2-12
## [19] GenomeInfoDbData_0.99.0 bitops_1.0-6
## [21] codetools_0.2-15 RCurl_1.95-4.8
## [23] evaluate_0.10.1 rmarkdown_1.8
## [25] DelayedArray_0.2.7 stringi_1.1.6
## [27] compiler_3.4.3 backports_1.1.1
## [29] XML_3.98-1.9
Chen, Hao, John M Bell, Nicolas A Zavala, Hanlee P Ji, and Nancy R Zhang. 2014. "Allele-Specific Copy Number Profiling by Next-Generation DNA Sequencing." Nucleic Acids Research 43 (4). Oxford University Press: e23–e23.
Chen, Hao, Yuchao Jiang, Kara N Maxwell, Katherine L Nathanson, and Nancy Zhang. 2017. “Allele-Specific Copy Number Estimation by Whole Exome Sequencing.” The Annals of Applied Statistics 11 (2). NIH Public Access: 1169.
Jiang, Yuchao, Derek A Oldridge, Sharon J Diskin, and Nancy R Zhang. 2015. “CODEX: A Normalization and Copy Number Variation Detection Method for Whole Exome Sequencing.” Nucleic Acids Research 43 (6). Oxford University Press: e39–e39.
Jiang, Yuchao, Yu Qiu, Andy J Minn, and Nancy R Zhang. 2016. “Assessing Intratumor Heterogeneity and Tracking Longitudinal and Spatial Clonal Evolutionary History by Next-Generation Sequencing.” Proceedings of the National Academy of Sciences 113 (37). National Acad Sciences: E5528–E5537.
Jiang, Yuchao, Rujin Wang, Eugene Urrutia, Ioannis N Anastopoulos, Katherine L Nathanson, and Nancy R Zhang. 2018. "CODEX2: Full-Spectrum Copy Number Variation Detection by High-Throughput DNA Sequencing." Genome Biology 19 (1). BioMed Central: 202.
Zhou, Zilu, Weixin Wang, Li-San Wang, and Nancy Ruonan Zhang. 2018. "Integrative DNA Copy Number Detection and Genotyping from Sequencing and Array-Based Platforms." Bioinformatics 34 (14). Oxford University Press: 2349–55.