For this example we begin at this step 3.1. Load Destin R package. Previous steps are described for reference.
Approximate run time (8 core macbook pro): ~ 1 minute.
startTime = Sys.time()
input: fastq files
output: bam files and peaks bed file
install the bioinformatics pipeline from github, where yourPathToDestinRepo is your path to the local cloned repository
cd yourPathToDestinRepo
git clone https://github.com/urrutiag/destin.git
The bioinformatics pipeline is displayed as an example, and not meant to be run as is. Running it requires that the user:
outDir=yourPathToDestinRepo/practice/BuenrostroMouse
SRRFile=$outDir/SRR_Acc_ListBuenrostroMouse.txt
sampleName=BuenrostroMouse
srcDir=yourPathToDestinRepo/src
model=mm10
cellData=$outDir/SraRunTableBuenrostroMouse.txt
This is the output tree for the bioinformatics processing pipeline. This should not be customized. The only selectable parameter is the outDir, defined in the last step
mkdir -p $outDir
mkdir -p $outDir/temp
mkdir -p $outDir/bam
mkdir -p $outDir/peaks
mkdir -p $outDir/fastq
Here we select the path to model-specific files, these must be set specific to your system
export genomeBowtie=/proj/seq/data/MM10_UCSC/Sequence/Bowtie2Index/genome
export genomeMacs2=mm;
export blacklistFile=yourPathToDestinRepo/data/blacklist/ENCFF547MET.bed.gz
In order for the bioninformatics pipeline to run, several dependencies must be included in the system search PATH:
additionally, we set “picardPath”, the system specific path to picard, manually
below is an example when using a high performance cluster, but this will need to be customized to your system.
module add sratoolkit
module add cutadapt
module add bowtie2
module add samtools
export picardPath=~/applications/picard/build/libs
module add macs/2016-02-15
module add bedtools
Alternatively, manually add your system specific path to the executable to the system search PATH, as below
export PATH=/nas/longleaf/apps/bowtie2/2.3.4.1/bin:$PATH
To ensure you have set up the path to bowtie2 correctly, type “which bowtie2” and the result should be your system specific path to the executable
which bowtie2
/nas/longleaf/apps/bowtie2/2.3.4.1/bin/bowtie2
This is an optional step, if you wish to download fastq files from the SRA repository. The first line reads the SRRFile contents into an array. The next line downloads those files. –split-files tells sratoolkit that the read are paired and we desire 2 files per cell/sample. This pipeline assumes paired reads.
SRRNames=( $( cat $SRRFile ) )
fastq-dump "${SRRNames[@]}" --split-files -O $outDir/fastq --gzip
input: paired-read fastq files, one pair of fastq files per cell
output: indexed bam files, one bam file per cell
Overview
We first cut Illumina adaptors using cutadapt with –minimum length set to 20. parameters w, x, y, z refer to cutadapt inputs. In this experiment Illumina adaptors were used, but the sequences can be adapted to a different experiment. See https://cutadapt.readthedocs.io/en/stable/guide.html for further info.
-w refers to cutadapt input ADAPTER_FWD “-A” -x refers to cutadapt input ADAPTER_REV “-a” -y refers to cutadapt input second ADAPTER_FWD “-G” -z refers to cutadapt input second ADAPTER_REV “-g”
Reads are aligned to respective genome using bowtie2 with setting X2000 to ensure paired reads are within 2kb of one another. Samtools is used to covert to bam format. Picard tools is then used to perform a series of tasks: SortSam to sort; AddOrReplaceReadGroups to add read groups and index; MarkDuplicates to mark and remove duplicates; and BuildBamIndex to index reads. Samtools is used to remove mitochondrial, unmapped, and chromosome Y reads.
Next, due to Tn5 insertion, the starting position of forward reads is adjusted +4, and the end position of reverse reads is adjusted -5. This is performed using a custom awk command. Only aligned reads with map quality over 30 are retained. Finally, aligned reads are indexed by picard SortSam and indexed by samtools.
SRRNames=( $( cat $SRRFile ) )
for cellID in "${SRRNames[@]}"
do
$srcDir/fastqToBam.sh \
-c $cellID \
-s $sampleName \
-o $outDir \
-g $genomeBowtie \
-w CTGTCTCTTATACACA \
-x CTGTCTCTTATACACA \
-y GATGTGTATAAGAGACAG \
-z GATGTGTATAAGAGACAG
done
input: indexed bam file, one file per cell
output: filtered narrow_peaks.bed
Overview
Peaks are called a single time, inlcuding all cell bam files as input. This is essentially calling peaks on a psuedobulk tissue incorporating all cell types.
Peaks are called by MACS2 using parameters: –nomodel -p 0.01. Thus, only peaks with p-value below 0.01 are retained. Peaks are filtered using an ENCODE annotated blacklist file mainly consisting of low-mappability regions and repeat regions: “ENCFF547MET.bed” for mm10 and “wgEncodeDacMapabilityConsensusExcludable.bed” for hg19.
$srcDir/callPeaks.sh \
-s $sampleName \
-o $outDir \
-l $blacklistFile \
-q $genomeMacs2
Destin incorporates a set of existing and novel practices regarding normalization, dimension reduction, and clustering. Several existing and novel techniques are cast as tuning parameters which are optimized by an unsupervised model-based likelihood.
Specifically, Destin incorporates biological weights for chromatin accessible regions based on reference annotations and bulk regulomic profiles. These weights, the number of clusters, and the number of principle components are cast as tuning parameters. Destin fits the full binary chromatin accessibility data using the multinomial distribution, providing an objective function in an unsupervised clustering context.
if you have already installed the github repository, you can install the R package from the local copy. yourPathToDestinRepo is your system specific path to the local cloned repository
yourPathToDestinRepo = "~/Documents/gitRepos/destin"
install.packages(file.path(yourPathToDestinRepo,"package"), repos = NULL, type = "source")
if you have not already installed the github repository, install the R package directly from github.
install.packages("devtools")
devtools::install_github("urrutiag/destin/package")
load the package
library(destin, quietly = T)
sampleName = "BuenrostroMouse"
outDir = file.path(yourPathToDestinRepo, "practice/BuenrostroMouse")
cellDataFile = file.path(outDir, "SraRunTableBuenrostroMouse.txt")
model = "mm10"
inputs: indexed bam files (one bam per cell), peaks file
output: ranged summarized experiment containing:
bamDir = file.path(outDir, "bam")
peaksDir = file.path(outDir, "peaks")
bedFile = file.path(peaksDir,
paste0(sampleName, "_peaks.blacklist_removed.narrowPeak"))
extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
qValue = "numeric", peak = "integer")
bedData = rtracklayer::import(bedFile, format = "BED",
extraCols = extraCols_narrowPeak)
bamFiles = dir( bamDir, pattern = "final.bam" )
bamFiles = bamFiles[!grepl("bai", bamFiles)]
rse = createRSE(bamDir, bamFiles, bedData)
here is a description of the ranged summarized experiment
column data is a description of the cells
colData(rse)
## DataFrame with 20 rows and 1 column
## fileName
## <factor>
## 1 SRR1781315.BuenrostroMouse.final.bam
## 2 SRR1781316.BuenrostroMouse.final.bam
## 3 SRR1781317.BuenrostroMouse.final.bam
## 4 SRR1781318.BuenrostroMouse.final.bam
## 5 SRR1781319.BuenrostroMouse.final.bam
## ... ...
## 16 SRR1781416.BuenrostroMouse.final.bam
## 17 SRR1781417.BuenrostroMouse.final.bam
## 18 SRR1781418.BuenrostroMouse.final.bam
## 19 SRR1781419.BuenrostroMouse.final.bam
## 20 SRR1781420.BuenrostroMouse.final.bam
rowRanges is a description of the peaks
rowRanges(rse)
## GRanges object with 146111 ranges and 6 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 3006578-3006839 * |
## [2] chr1 3035755-3035954 * |
## [3] chr1 3062441-3062746 * |
## [4] chr1 3062859-3063222 * |
## [5] chr1 3205834-3206033 * |
## ... ... ... ... .
## [146107] chrX 169980749-169981120 * |
## [146108] chrX 169984843-169985042 * |
## [146109] chrX 169986765-169987645 * |
## [146110] chrX 169993635-169994358 * |
## [146111] chrX 170005001-170005479 * |
## name
## <character>
## [1] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_1
## [2] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_2
## [3] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_3
## [4] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_4
## [5] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_5
## ... ...
## [146107] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146102
## [146108] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146103
## [146109] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146104
## [146110] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146105
## [146111] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146106
## score signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <numeric> <integer>
## [1] 41 3.55717 4.18469 1.83307 104
## [2] 30 2.84573 3.08771 0.94733 42
## [3] 20 2.1343 2.08456 0.22292 153
## [4] 65 4.98003 6.59934 3.91686 185
## [5] 30 2.84573 3.08771 0.94733 88
## ... ... ... ... ... ...
## [146107] 49 3.59477 4.96282 2.51408 327
## [146108] 33 2.74725 3.37433 1.19415 60
## [146109] 128 6.49351 12.88313 9.61703 191
## [146110] 65 4.98003 6.59934 3.91686 260
## [146111] 267 13.1579 26.71017 22.5086 273
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
assay is the binary chromatin accessibility matrix in sparse matrix format
assay(rse)[1:20,1:20]
## 20 x 20 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . . . . . . . . . . . .
## [2,] . . . . . . . . . . . . . . . . . . . .
## [3,] . . . . . . . . . . . . . . . . . . . .
## [4,] . . . . . . . . . . . . . 1 . . . . . .
## [5,] . . . . . . . . . . . . . . . 1 . . . .
## [6,] . . . . . . . . . . . . . . . . . . . .
## [7,] . . . . . . . . . . . . . . . . . 1 . .
## [8,] . . . . . . . . . . . . . . . . . . . .
## [9,] . . . . . . . . . . . . . . . 1 . . . .
## [10,] . . . . . . . . . . . . . . . 1 . . . .
## [11,] . . . . . . . . . . . . . . . . . . . .
## [12,] . . . . . . . . . . . . . . . . . . . .
## [13,] . . . . . . . . . . . . . . . . . . . .
## [14,] . . . . . . . . . . . . . . . . . . . .
## [15,] . 1 . 1 1 . . . . . . . . . . . . . . .
## [16,] . . . . . . 1 . . . . . . . . 1 . . . .
## [17,] . . . . . . . . . . . . . . . . . . . .
## [18,] . . . . . . . . . . . . . . . . . . . .
## [19,] . . . . . . . . . . . . . . . . 1 1 1 .
## [20,] . . . . . . . . . . . . . . . . . . . .
Annotate the ranges with TSS distance and DHS frequency
TSS distance:
Distance to transcription start site is annotated using R function annotatePeakInBatch from the ChIPpeakAnno package. Data set ’TSS.human.GRCh37’ is used for model hg19 and ’TSS.mouse.GRCm38’ is used for model mm10. A promoter is defined as distance >=2kb and <= 1kb. Otherwise the chromatin accessible region is defined as distal regulatory element.
DHS frequency:
To generate the reference DNase hypersensitivity frequency map, we used the ENCODE database (Consortium et al., 2012).
For human reference, we used search terms: “bed broadpeaks”, “Homo sapiens”, “DNase-seq”, and “cell line”. This resulted in 138 experiments representing 99 cell lines, all from either the John Stamatoyannopoulos lab or Gregory Crawford lab.
For mouse reference, we used search terms: “bed broadpeaks”, “Mus musculus”, “DNase-seq”, and “tissue”. This resulted in 61 experiments representing 27 tissue types. All experiments were from the John Stamatoyannopoulos lab.
For processing we created a template representing the entire genome in 500bp bins using R functions tile and GRanges from package GenomicRanges. This was followed by GenomicRanges function overlapsAny to assign the files to the template. Overlaps were averaged across all files of same type (cell line or tissue), so that no type would be overrepresented. Finally, overlaps were averaged across all types.
rse = annotateRSE(rse, model)
in this example we append the SRA Run table from SRA to the colData
cellData = fread(cellDataFile)
colData(rse)$Run = tstrsplit(colData(rse)$fileName, split = '.', fixed= T)[[1]]
colMerged = merge(rse@colData, cellData, by = "Run")
colData(rse) = colMerged
Destin begins with quality control, retaining chromatin regions accessible in at least 5 cells, and retaining cells with total number of chromatin accessible regions within 3 standard deviations from median (robustly calculated using median absolute deviations). These default parameters can be modified if desired.
rse = doQC(rse, regionSumCutoff = 5, cellSumCutoffSDs = 3)
Destin uses an unsupervised model-based likelihood to estimate number of clusters. Like WCSSE, the likelihood is non-decreasing with increasing number of clusters. We fit the model log-likelihood as a linear spline function of number of clusters with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as number of clusters (n_k).
Also shown are results using 4 other methods to estimate number of clusters:
wcsseElbow: They fit WCSSE as a linear spline function of k with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as n_k
silhouette: n_k is selected as the number of clusters that maximizes average silhouette width across all cells
distortion: accounts for the decrease in WCSSE by comparing to a baseline model incorporating a weight factor based on number of dimensions
GapStat: The gap (Tibshirani et al., 2001) statistic accounts for the tendency to overestimate n_k by using a permutation approach and fitting a null distribution to the WCSSE which naturally decreases as n_k increases. Then the empirical WCSSE is compared to the null distribution with the gap statistic as output. To select the number of clusters the “first SE” method is used which determines the first nk where the gap statistic is within 1 standard deviation of the following gap statistic.
We search a space of 1 to 5 number of clusters (user-modifiable). Here we find 2 clusters.
set.seed(10)
clusterEst = estimateNClusters(rse, nClustersRange = 1:5, allMethods = T)
## Warning in irlba(t(X), nv = max(PCrange)): You're computing too large a
## percentage of total singular values, use a standard svd instead.
## [1] "computing elbow methods"
## [1] "computing silhouette"
## [1] "computing distortion"
## [1] "computing gap"
Estimated number of clusters shown in blue by method.
plotNClusters(clusterEst)
## Warning: Removed 1 rows containing missing values (geom_point).
For each combination of TSS distance weights, DHS frequency weights, and number of PCs, cell type clustering is performed via k-means. The likelihood is calculated according to the multinomial model, and the clustering result producing the highest multinomial likelihood is selected as the final clustering result.
In this example, we run Destin with default settings aside from lowering the number of principle components (since we only have 17 cells post-QC). Optimal tuning parameters for the toy example are displayed in the summary: TSS weights (1, 2), higher weight to distal regulatory elements; DHS weights (1, 1), unweighted DHS frequency; and 3 principle components.
nClusters = clusterEst$nClustersList$logLikeElbow
PCrange = 3:10
results = destinGrid (rse, nClusters = nClusters,
PCrange = PCrange)
results$summary
## nPCs TSSWeight1 TSSWeight2 DHSWeight1 DHSWeight2 nPeaksActual logLike
## 1 3 1 2 1 1 7670 -85079.35
Next we reassign cells by maximizing each cell’s post-classification multinomial likelihood. The input to the multinomial model is the chromatin accessibility count matrix and the cell cluster assignments. The output is a cell by cluster post-classification likelihood.
In this example there is no change in clustering assignment
countMat = assay(rse)
clusterResult = results$cluster$cluster
logLikeMatrix = getLogLike(countMat, clusterResult, reassign = T)
clusterReassigned = apply(logLikeMatrix, 1, which.max)
clusterReassigned
## [1] 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
Since Buenrostro et al. provided cell labels, we can compute cluster purity
clusterTable = table(results$cluster$cluster, rse$cell_type)
purity = sum( apply(clusterTable, 1, max) ) / sum(clusterTable)
clusterTable
##
## hematopoietic cells mouse embryonic stem cells
## 1 0 8
## 2 9 0
paste0("purity = ", purity)
## [1] "purity = 1"
Plot results PCA
plotCluster(results, clusterLabels = c("embryonic", "hematopoietic"))
Destin performs differential chromatin accessibility testing via the Fisher test, generating a chromatin accessibility profile for each cluster. We compare each cell type to each other in a 1 vs. all others comparison, treating all chromatin accessible regions independently.
The output includes a genetic feature, Bonferroni corrected p-value and log2 fold change for each chromatin accessible region. We show the two clusters for the first 1K peaks here.
alternateCellLabel = results$cluster$cluster
rse = getDiffAccess(rse, clusterMap = results$cluster)
Results of differential accessibility (top 10 p-values per cluster)
rowRanges(rse)[order(rowRanges(rse)$pValsCor_1, decreasing = F)][1:10, c("feature", "pValsCor_1", "log2FC_1")]
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr6 134364557-134364988 * | ENSMUSG00000030200
## [2] chr1 4496261-4497024 * | ENSMUSG00000025902
## [3] chr1 4774796-4775928 * | ENSMUSG00000033845
## [4] chr1 4785453-4786298 * | ENSMUSG00000033845
## [5] chr1 4807522-4808279 * | ENSMUSG00000025903
## [6] chr1 4857413-4857944 * | ENSMUSG00000033813
## [7] chr1 4858100-4858459 * | ENSMUSG00000033813
## [8] chr1 5082726-5083581 * | ENSMUSG00000033793
## [9] chr1 6214107-6215250 * | ENSMUSG00000090031
## [10] chr1 7088410-7089368 * | ENSMUSG00000051285
## pValsCor_1 log2FC_1
## <numeric> <numeric>
## [1] 0.315508021390374 -3.15200309344505
## [2] 1 0.906890595608518
## [3] 1 -2.83007499855769
## [4] 1 -0.830074998557688
## [5] 1 -0.245112497836531
## [6] 1 -0.415037499278844
## [7] 1 0.491853096329675
## [8] 1 -0.245112497836531
## [9] 1 -0.637429920615292
## [10] 1 -0.637429920615292
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
rowRanges(rse)[order(rowRanges(rse)$pValsCor_2, decreasing = F)][1:10, c("feature", "pValsCor_2", "log2FC_2")]
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr6 134364557-134364988 * | ENSMUSG00000030200
## [2] chr1 4496261-4497024 * | ENSMUSG00000025902
## [3] chr1 4774796-4775928 * | ENSMUSG00000033845
## [4] chr1 4785453-4786298 * | ENSMUSG00000033845
## [5] chr1 4807522-4808279 * | ENSMUSG00000025903
## [6] chr1 4857413-4857944 * | ENSMUSG00000033813
## [7] chr1 4858100-4858459 * | ENSMUSG00000033813
## [8] chr1 5082726-5083581 * | ENSMUSG00000033793
## [9] chr1 6214107-6215250 * | ENSMUSG00000090031
## [10] chr1 7088410-7089368 * | ENSMUSG00000051285
## pValsCor_2 log2FC_2
## <numeric> <numeric>
## [1] 0.315508021390374 3.15200309344505
## [2] 1 -0.906890595608519
## [3] 1 2.83007499855769
## [4] 1 0.830074998557688
## [5] 1 0.245112497836531
## [6] 1 0.415037499278844
## [7] 1 -0.491853096329675
## [8] 1 0.245112497836531
## [9] 1 0.637429920615292
## [10] 1 0.637429920615292
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
Alternatively, we can examine differential access on a label such as disease vs. healthy, treated vs untreated etc.
alternateCellLabel = c(rep('Disease', 5), rep('Health', 12))
rse = getDiffAccess(rse, alternateCellLabel = alternateCellLabel)
Results of differential accessibility (top 10 p-values per cluster)
rowRanges(rse)[order(rowRanges(rse)$pValsCor_Disease, decreasing = F)][1:10, c("feature", "pValsCor_Disease", "log2FC_Disease")]
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 4496261-4497024 * | ENSMUSG00000025902
## [2] chr1 4774796-4775928 * | ENSMUSG00000033845
## [3] chr1 4785453-4786298 * | ENSMUSG00000033845
## [4] chr1 4807522-4808279 * | ENSMUSG00000025903
## [5] chr1 4857413-4857944 * | ENSMUSG00000033813
## [6] chr1 4858100-4858459 * | ENSMUSG00000033813
## [7] chr1 5082726-5083581 * | ENSMUSG00000033793
## [8] chr1 6214107-6215250 * | ENSMUSG00000090031
## [9] chr1 7088410-7089368 * | ENSMUSG00000051285
## [10] chr1 7397431-7398561 * | ENSMUSG00000097797
## pValsCor_Disease log2FC_Disease
## <numeric> <numeric>
## [1] 1 -0.321928094887362
## [2] 1 1.58496250072116
## [3] 1 0.941106310946431
## [4] 1 0.84799690655495
## [5] 1 0.0406419844973458
## [6] 1 0.263034405833794
## [7] 1 0.84799690655495
## [8] 1 1
## [9] 1 1
## [10] 1 0.941106310946431
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
rowRanges(rse)[order(rowRanges(rse)$pValsCor_Health, decreasing = F)][1:10, c("feature", "pValsCor_Health", "log2FC_Health")]
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | feature
## <Rle> <IRanges> <Rle> | <character>
## [1] chr1 4496261-4497024 * | ENSMUSG00000025902
## [2] chr1 4774796-4775928 * | ENSMUSG00000033845
## [3] chr1 4785453-4786298 * | ENSMUSG00000033845
## [4] chr1 4807522-4808279 * | ENSMUSG00000025903
## [5] chr1 4857413-4857944 * | ENSMUSG00000033813
## [6] chr1 4858100-4858459 * | ENSMUSG00000033813
## [7] chr1 5082726-5083581 * | ENSMUSG00000033793
## [8] chr1 6214107-6215250 * | ENSMUSG00000090031
## [9] chr1 7088410-7089368 * | ENSMUSG00000051285
## [10] chr1 7397431-7398561 * | ENSMUSG00000097797
## pValsCor_Health log2FC_Health
## <numeric> <numeric>
## [1] 1 0.321928094887362
## [2] 1 -1.58496250072116
## [3] 1 -0.941106310946431
## [4] 1 -0.84799690655495
## [5] 1 -0.0406419844973458
## [6] 1 -0.263034405833794
## [7] 1 -0.84799690655495
## [8] 1 -1
## [9] 1 -1
## [10] 1 -0.941106310946431
## -------
## seqinfo: 20 sequences from an unspecified genome; no seqlengths
Time to run
TotalTime = Sys.time() - startTime
TotalTime
## Time difference of 57.86891 secs