For this example we begin at this step 3.1. Load Destin R package. Previous steps are described for reference.

Approximate run time (8 core macbook pro): ~ 1 minute.

startTime = Sys.time()

1. Data

Mouse scATAC-seq from Buenrostro et al.
GEO accession: GSE65360
2 cell types (hematopoietic cells, mouse embryonic stem cells)
Fluidigm microfluidics platform
192 cells (20 in the Destin toy example)
original fastq files correspond to individual cells

2. Bioinformatics pipeline

input: fastq files
output: bam files and peaks bed file

install the bioinformatics pipeline from github, where yourPathToDestinRepo is your path to the local cloned repository

cd yourPathToDestinRepo
git clone https://github.com/urrutiag/destin.git

2.1. Set up parameters, paths, and directories

The bioinformatics pipeline is displayed as an example, and not meant to be run as is. Running it requires that the user:

sets all parameters appropriately
ensures that all bioinformatics dependencies are included in the system search PATH
ensures directories are selected appropriately

2.1.1. Designate inputs and parameters

select outDir as the directory to read input and print all output. In this example we use “yourPathToDestinRepo/practice/BuenrostroMouse”, where yourPathToDestinRepo is the path to your local cloned repository
(optionally) if you want to download files using sratoolkit, include SRRFile path
sampleName is used to name files and metadata
srcDir is the path to our bioinformatic processing scripts (located locally at “yourPathToDestinRepo/src”)
model is selected as hg19 or mm10
cellData is the path to a sample description table with each row corresponding to a cell

outDir=yourPathToDestinRepo/practice/BuenrostroMouse
SRRFile=$outDir/SRR_Acc_ListBuenrostroMouse.txt
sampleName=BuenrostroMouse
srcDir=yourPathToDestinRepo/src
model=mm10
cellData=$outDir/SraRunTableBuenrostroMouse.txt

2.1.2. Create directories

This is the output tree for the bioinformatics processing pipeline. This should not be customized. The only selectable parameter is the outDir, defined in the last step

mkdir -p $outDir
mkdir -p $outDir/temp
mkdir -p $outDir/bam
mkdir -p $outDir/peaks
mkdir -p $outDir/fastq

2.1.3. Set model specific parameters for mm10

Here we select the path to model-specific files, these must be set specific to your system

genomeBowtie is the path to the indexed genome for bowtie2
genomeMacs2 is the model for MACS2 peak caller: either mm (mouse) or hs (human)
blacklistFile is the path to file containing peaks to exclude: either ENCFF547MET.bed.gz (mouse) or ENCFF001TDO.bed.gz (human). We have included these files in our github repository (located locally at pseudoname “yourPathToDestinRepo/data/blacklist”).

export genomeBowtie=/proj/seq/data/MM10_UCSC/Sequence/Bowtie2Index/genome
export genomeMacs2=mm;
export blacklistFile=yourPathToDestinRepo/data/blacklist/ENCFF547MET.bed.gz

2.1.4. Ensure that dependencies are in system search PATH

In order for the bioninformatics pipeline to run, several dependencies must be included in the system search PATH:

sratoolkit (optional if you wish to dowload fastq files from SRA)
cutadapt
bowtie2
samtools
macs/2016-02-15 (latest version)
bedtools

additionally, we set “picardPath”, the system specific path to picard, manually

below is an example when using a high performance cluster, but this will need to be customized to your system.

module add sratoolkit
module add cutadapt
module add bowtie2
module add samtools
export picardPath=~/applications/picard/build/libs
module add macs/2016-02-15
module add bedtools

Alternatively, manually add your system specific path to the executable to the system search PATH, as below

export PATH=/nas/longleaf/apps/bowtie2/2.3.4.1/bin:$PATH

To ensure you have set up the path to bowtie2 correctly, type “which bowtie2” and the result should be your system specific path to the executable

which bowtie2
/nas/longleaf/apps/bowtie2/2.3.4.1/bin/bowtie2

2.2. (Optional) Download fastq files for toy example

This is an optional step, if you wish to download fastq files from the SRA repository. The first line reads the SRRFile contents into an array. The next line downloads those files. –split-files tells sratoolkit that the read are paired and we desire 2 files per cell/sample. This pipeline assumes paired reads.

SRRNames=( $( cat $SRRFile ) )
fastq-dump "${SRRNames[@]}" --split-files -O $outDir/fastq --gzip

2.3. Align reads

input: paired-read fastq files, one pair of fastq files per cell
output: indexed bam files, one bam file per cell

Overview

cut adapters
align
sam to bam
sort
Add read group and index
mark duplicates
remove mitochondrial, unmapped and chr Y
adjust for Tn5 insertion
alignment quality >= 30
index

We first cut Illumina adaptors using cutadapt with –minimum length set to 20. parameters w, x, y, z refer to cutadapt inputs. In this experiment Illumina adaptors were used, but the sequences can be adapted to a different experiment. See https://cutadapt.readthedocs.io/en/stable/guide.html for further info.

-w refers to cutadapt input ADAPTER_FWD “-A” -x refers to cutadapt input ADAPTER_REV “-a” -y refers to cutadapt input second ADAPTER_FWD “-G” -z refers to cutadapt input second ADAPTER_REV “-g”

Reads are aligned to respective genome using bowtie2 with setting X2000 to ensure paired reads are within 2kb of one another. Samtools is used to covert to bam format. Picard tools is then used to perform a series of tasks: SortSam to sort; AddOrReplaceReadGroups to add read groups and index; MarkDuplicates to mark and remove duplicates; and BuildBamIndex to index reads. Samtools is used to remove mitochondrial, unmapped, and chromosome Y reads.

Next, due to Tn5 insertion, the starting position of forward reads is adjusted +4, and the end position of reverse reads is adjusted -5. This is performed using a custom awk command. Only aligned reads with map quality over 30 are retained. Finally, aligned reads are indexed by picard SortSam and indexed by samtools.

SRRNames=( $( cat $SRRFile ) )
for cellID in "${SRRNames[@]}"
do
  $srcDir/fastqToBam.sh \
    -c $cellID  \
    -s $sampleName \
    -o $outDir \
    -g $genomeBowtie \
    -w CTGTCTCTTATACACA \
    -x CTGTCTCTTATACACA \
    -y GATGTGTATAAGAGACAG \
    -z GATGTGTATAAGAGACAG
done

2.4. Call Peaks

input: indexed bam file, one file per cell
output: filtered narrow_peaks.bed

Overview

call peaks
filter blacklist

Peaks are called a single time, inlcuding all cell bam files as input. This is essentially calling peaks on a psuedobulk tissue incorporating all cell types.

Peaks are called by MACS2 using parameters: –nomodel -p 0.01. Thus, only peaks with p-value below 0.01 are retained. Peaks are filtered using an ENCODE annotated blacklist file mainly consisting of low-mappability regions and repeat regions: “ENCFF547MET.bed” for mm10 and “wgEncodeDacMapabilityConsensusExcludable.bed” for hg19.

$srcDir/callPeaks.sh \
  -s $sampleName \
  -o $outDir \
  -l $blacklistFile \
  -q $genomeMacs2

3. Destin

Destin incorporates a set of existing and novel practices regarding normalization, dimension reduction, and clustering. Several existing and novel techniques are cast as tuning parameters which are optimized by an unsupervised model-based likelihood.

Specifically, Destin incorporates biological weights for chromatin accessible regions based on reference annotations and bulk regulomic profiles. These weights, the number of clusters, and the number of principle components are cast as tuning parameters. Destin fits the full binary chromatin accessibility data using the multinomial distribution, providing an objective function in an unsupervised clustering context.

20 cells (out of 192) in this Destin toy example

3.1. Load Destin R package

if you have already installed the github repository, you can install the R package from the local copy. yourPathToDestinRepo is your system specific path to the local cloned repository

yourPathToDestinRepo = "~/Documents/gitRepos/destin"

install.packages(file.path(yourPathToDestinRepo,"package"), repos = NULL, type = "source")

if you have not already installed the github repository, install the R package directly from github.

install.packages("devtools")
devtools::install_github("urrutiag/destin/package")

load the package

library(destin, quietly = T)

3.2. Set parameters

sampleName should be the same as used during the pipeline, used to identify input files
outDir is the home directory for input files, yourPathToDestinRepo is your path to the local cloned repository
cellDataFile is the path to a sample description table with each row corresponding to a cell
model is selected as hg19 or mm10

sampleName = "BuenrostroMouse"
outDir = file.path(yourPathToDestinRepo, "practice/BuenrostroMouse")
cellDataFile = file.path(outDir, "SraRunTableBuenrostroMouse.txt")
model = "mm10"

3.3. Create RSE from bam and peaks

inputs: indexed bam files (one bam per cell), peaks file
output: ranged summarized experiment containing:

colData which describes the cells
rowRanges which describes the peaks
assay which is the binary chromatin accessibility matrix

bamDir = file.path(outDir, "bam")
peaksDir = file.path(outDir, "peaks")
bedFile = file.path(peaksDir, 
    paste0(sampleName, "_peaks.blacklist_removed.narrowPeak"))

extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
                          qValue = "numeric", peak = "integer")
bedData = rtracklayer::import(bedFile, format = "BED",
               extraCols = extraCols_narrowPeak)

bamFiles = dir( bamDir, pattern = "final.bam" )
bamFiles = bamFiles[!grepl("bai", bamFiles)]

rse = createRSE(bamDir, bamFiles, bedData)

here is a description of the ranged summarized experiment

column data is a description of the cells

colData(rse)

## DataFrame with 20 rows and 1 column
##                                 fileName
##                                 <factor>
## 1   SRR1781315.BuenrostroMouse.final.bam
## 2   SRR1781316.BuenrostroMouse.final.bam
## 3   SRR1781317.BuenrostroMouse.final.bam
## 4   SRR1781318.BuenrostroMouse.final.bam
## 5   SRR1781319.BuenrostroMouse.final.bam
## ...                                  ...
## 16  SRR1781416.BuenrostroMouse.final.bam
## 17  SRR1781417.BuenrostroMouse.final.bam
## 18  SRR1781418.BuenrostroMouse.final.bam
## 19  SRR1781419.BuenrostroMouse.final.bam
## 20  SRR1781420.BuenrostroMouse.final.bam

rowRanges is a description of the peaks

rowRanges(rse)

## GRanges object with 146111 ranges and 6 metadata columns:
##            seqnames              ranges strand |
##               <Rle>           <IRanges>  <Rle> |
##        [1]     chr1     3006578-3006839      * |
##        [2]     chr1     3035755-3035954      * |
##        [3]     chr1     3062441-3062746      * |
##        [4]     chr1     3062859-3063222      * |
##        [5]     chr1     3205834-3206033      * |
##        ...      ...                 ...    ... .
##   [146107]     chrX 169980749-169981120      * |
##   [146108]     chrX 169984843-169985042      * |
##   [146109]     chrX 169986765-169987645      * |
##   [146110]     chrX 169993635-169994358      * |
##   [146111]     chrX 170005001-170005479      * |
##                                                                                              name
##                                                                                       <character>
##        [1]      /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_1
##        [2]      /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_2
##        [3]      /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_3
##        [4]      /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_4
##        [5]      /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_5
##        ...                                                                                    ...
##   [146107] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146102
##   [146108] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146103
##   [146109] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146104
##   [146110] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146105
##   [146111] /proj/yuchaojlab/gene/scATACseq/data/BuenrostroMouse/peaks/BuenrostroMouse_peak_146106
##                score signalValue    pValue    qValue      peak
##            <numeric>   <numeric> <numeric> <numeric> <integer>
##        [1]        41     3.55717   4.18469   1.83307       104
##        [2]        30     2.84573   3.08771   0.94733        42
##        [3]        20      2.1343   2.08456   0.22292       153
##        [4]        65     4.98003   6.59934   3.91686       185
##        [5]        30     2.84573   3.08771   0.94733        88
##        ...       ...         ...       ...       ...       ...
##   [146107]        49     3.59477   4.96282   2.51408       327
##   [146108]        33     2.74725   3.37433   1.19415        60
##   [146109]       128     6.49351  12.88313   9.61703       191
##   [146110]        65     4.98003   6.59934   3.91686       260
##   [146111]       267     13.1579  26.71017   22.5086       273
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

assay is the binary chromatin accessibility matrix in sparse matrix format

rows are peaks
columns are cells

assay(rse)[1:20,1:20]

## 20 x 20 sparse Matrix of class "dgCMatrix"
##                                              
##  [1,] . . . . . . . . . . . . . . . . . . . .
##  [2,] . . . . . . . . . . . . . . . . . . . .
##  [3,] . . . . . . . . . . . . . . . . . . . .
##  [4,] . . . . . . . . . . . . . 1 . . . . . .
##  [5,] . . . . . . . . . . . . . . . 1 . . . .
##  [6,] . . . . . . . . . . . . . . . . . . . .
##  [7,] . . . . . . . . . . . . . . . . . 1 . .
##  [8,] . . . . . . . . . . . . . . . . . . . .
##  [9,] . . . . . . . . . . . . . . . 1 . . . .
## [10,] . . . . . . . . . . . . . . . 1 . . . .
## [11,] . . . . . . . . . . . . . . . . . . . .
## [12,] . . . . . . . . . . . . . . . . . . . .
## [13,] . . . . . . . . . . . . . . . . . . . .
## [14,] . . . . . . . . . . . . . . . . . . . .
## [15,] . 1 . 1 1 . . . . . . . . . . . . . . .
## [16,] . . . . . . 1 . . . . . . . . 1 . . . .
## [17,] . . . . . . . . . . . . . . . . . . . .
## [18,] . . . . . . . . . . . . . . . . . . . .
## [19,] . . . . . . . . . . . . . . . . 1 1 1 .
## [20,] . . . . . . . . . . . . . . . . . . . .

3.4. Annotate regions

Annotate the ranges with TSS distance and DHS frequency

TSS distance:

Distance to transcription start site is annotated using R function annotatePeakInBatch from the ChIPpeakAnno package. Data set ’TSS.human.GRCh37’ is used for model hg19 and ’TSS.mouse.GRCm38’ is used for model mm10. A promoter is defined as distance >=2kb and <= 1kb. Otherwise the chromatin accessible region is defined as distal regulatory element.

DHS frequency:

To generate the reference DNase hypersensitivity frequency map, we used the ENCODE database (Consortium et al., 2012).

For human reference, we used search terms: “bed broadpeaks”, “Homo sapiens”, “DNase-seq”, and “cell line”. This resulted in 138 experiments representing 99 cell lines, all from either the John Stamatoyannopoulos lab or Gregory Crawford lab.

For mouse reference, we used search terms: “bed broadpeaks”, “Mus musculus”, “DNase-seq”, and “tissue”. This resulted in 61 experiments representing 27 tissue types. All experiments were from the John Stamatoyannopoulos lab.

For processing we created a template representing the entire genome in 500bp bins using R functions tile and GRanges from package GenomicRanges. This was followed by GenomicRanges function overlapsAny to assign the files to the template. Overlaps were averaged across all files of same type (cell line or tissue), so that no type would be overrepresented. Finally, overlaps were averaged across all types.

rse = annotateRSE(rse, model)

3.5. (optional) Append experimental information

in this example we append the SRA Run table from SRA to the colData

cellData = fread(cellDataFile) 
colData(rse)$Run = tstrsplit(colData(rse)$fileName, split = '.', fixed= T)[[1]]
colMerged = merge(rse@colData, cellData, by = "Run")
colData(rse) = colMerged

3.6. QC

Destin begins with quality control, retaining chromatin regions accessible in at least 5 cells, and retaining cells with total number of chromatin accessible regions within 3 standard deviations from median (robustly calculated using median absolute deviations). These default parameters can be modified if desired.

rse = doQC(rse, regionSumCutoff = 5, cellSumCutoffSDs = 3)

3.7. Determine number of clusters

Destin uses an unsupervised model-based likelihood to estimate number of clusters. Like WCSSE, the likelihood is non-decreasing with increasing number of clusters. We fit the model log-likelihood as a linear spline function of number of clusters with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as number of clusters (n_k).

Also shown are results using 4 other methods to estimate number of clusters:

wcsseElbow: They fit WCSSE as a linear spline function of k with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as n_k

silhouette: n_k is selected as the number of clusters that maximizes average silhouette width across all cells

distortion: accounts for the decrease in WCSSE by comparing to a baseline model incorporating a weight factor based on number of dimensions

GapStat: The gap (Tibshirani et al., 2001) statistic accounts for the tendency to overestimate n_k by using a permutation approach and fitting a null distribution to the WCSSE which naturally decreases as n_k increases. Then the empirical WCSSE is compared to the null distribution with the gap statistic as output. To select the number of clusters the “first SE” method is used which determines the first nk where the gap statistic is within 1 standard deviation of the following gap statistic.

We search a space of 1 to 5 number of clusters (user-modifiable). Here we find 2 clusters.

set.seed(10)
clusterEst = estimateNClusters(rse,   nClustersRange = 1:5, allMethods = T)

## Warning in irlba(t(X), nv = max(PCrange)): You're computing too large a
## percentage of total singular values, use a standard svd instead.

## [1] "computing elbow methods"
## [1] "computing silhouette"
## [1] "computing distortion"
## [1] "computing gap"

Estimated number of clusters shown in blue by method.

plotNClusters(clusterEst)

## Warning: Removed 1 rows containing missing values (geom_point).

3.8. Clustering cells

For each combination of TSS distance weights, DHS frequency weights, and number of PCs, cell type clustering is performed via k-means. The likelihood is calculated according to the multinomial model, and the clustering result producing the highest multinomial likelihood is selected as the final clustering result.

In this example, we run Destin with default settings aside from lowering the number of principle components (since we only have 17 cells post-QC). Optimal tuning parameters for the toy example are displayed in the summary: TSS weights (1, 2), higher weight to distal regulatory elements; DHS weights (1, 1), unweighted DHS frequency; and 3 principle components.

nClusters = clusterEst$nClustersList$logLikeElbow
PCrange = 3:10
results = destinGrid (rse, nClusters = nClusters,
            PCrange = PCrange)
results$summary

##   nPCs TSSWeight1 TSSWeight2 DHSWeight1 DHSWeight2 nPeaksActual   logLike
## 1    3          1          2          1          1         7670 -85079.35

Next we reassign cells by maximizing each cell’s post-classification multinomial likelihood. The input to the multinomial model is the chromatin accessibility count matrix and the cell cluster assignments. The output is a cell by cluster post-classification likelihood.

In this example there is no change in clustering assignment

countMat = assay(rse)
clusterResult = results$cluster$cluster
logLikeMatrix = getLogLike(countMat, clusterResult, reassign = T)
clusterReassigned = apply(logLikeMatrix, 1, which.max)
clusterReassigned

##  [1] 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1

Since Buenrostro et al. provided cell labels, we can compute cluster purity

clusterTable = table(results$cluster$cluster, rse$cell_type)
purity = sum( apply(clusterTable, 1, max) ) / sum(clusterTable)
clusterTable

##    
##     hematopoietic cells mouse embryonic stem cells
##   1                   0                          8
##   2                   9                          0

paste0("purity = ", purity)

## [1] "purity = 1"

Plot results PCA

plotCluster(results, clusterLabels = c("embryonic", "hematopoietic"))

3.9. Differential Accessibility

Destin performs differential chromatin accessibility testing via the Fisher test, generating a chromatin accessibility profile for each cluster. We compare each cell type to each other in a 1 vs. all others comparison, treating all chromatin accessible regions independently.

The output includes a genetic feature, Bonferroni corrected p-value and log2 fold change for each chromatin accessible region. We show the two clusters for the first 1K peaks here.

alternateCellLabel = results$cluster$cluster
rse = getDiffAccess(rse, clusterMap = results$cluster)

Results of differential accessibility (top 10 p-values per cluster)

rowRanges(rse)[order(rowRanges(rse)$pValsCor_1, decreasing = F)][1:10, c("feature", "pValsCor_1", "log2FC_1")]

## GRanges object with 10 ranges and 3 metadata columns:
##        seqnames              ranges strand |            feature
##           <Rle>           <IRanges>  <Rle> |        <character>
##    [1]     chr6 134364557-134364988      * | ENSMUSG00000030200
##    [2]     chr1     4496261-4497024      * | ENSMUSG00000025902
##    [3]     chr1     4774796-4775928      * | ENSMUSG00000033845
##    [4]     chr1     4785453-4786298      * | ENSMUSG00000033845
##    [5]     chr1     4807522-4808279      * | ENSMUSG00000025903
##    [6]     chr1     4857413-4857944      * | ENSMUSG00000033813
##    [7]     chr1     4858100-4858459      * | ENSMUSG00000033813
##    [8]     chr1     5082726-5083581      * | ENSMUSG00000033793
##    [9]     chr1     6214107-6215250      * | ENSMUSG00000090031
##   [10]     chr1     7088410-7089368      * | ENSMUSG00000051285
##               pValsCor_1           log2FC_1
##                <numeric>          <numeric>
##    [1] 0.315508021390374  -3.15200309344505
##    [2]                 1  0.906890595608518
##    [3]                 1  -2.83007499855769
##    [4]                 1 -0.830074998557688
##    [5]                 1 -0.245112497836531
##    [6]                 1 -0.415037499278844
##    [7]                 1  0.491853096329675
##    [8]                 1 -0.245112497836531
##    [9]                 1 -0.637429920615292
##   [10]                 1 -0.637429920615292
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

rowRanges(rse)[order(rowRanges(rse)$pValsCor_2, decreasing = F)][1:10, c("feature", "pValsCor_2", "log2FC_2")]

## GRanges object with 10 ranges and 3 metadata columns:
##        seqnames              ranges strand |            feature
##           <Rle>           <IRanges>  <Rle> |        <character>
##    [1]     chr6 134364557-134364988      * | ENSMUSG00000030200
##    [2]     chr1     4496261-4497024      * | ENSMUSG00000025902
##    [3]     chr1     4774796-4775928      * | ENSMUSG00000033845
##    [4]     chr1     4785453-4786298      * | ENSMUSG00000033845
##    [5]     chr1     4807522-4808279      * | ENSMUSG00000025903
##    [6]     chr1     4857413-4857944      * | ENSMUSG00000033813
##    [7]     chr1     4858100-4858459      * | ENSMUSG00000033813
##    [8]     chr1     5082726-5083581      * | ENSMUSG00000033793
##    [9]     chr1     6214107-6215250      * | ENSMUSG00000090031
##   [10]     chr1     7088410-7089368      * | ENSMUSG00000051285
##               pValsCor_2           log2FC_2
##                <numeric>          <numeric>
##    [1] 0.315508021390374   3.15200309344505
##    [2]                 1 -0.906890595608519
##    [3]                 1   2.83007499855769
##    [4]                 1  0.830074998557688
##    [5]                 1  0.245112497836531
##    [6]                 1  0.415037499278844
##    [7]                 1 -0.491853096329675
##    [8]                 1  0.245112497836531
##    [9]                 1  0.637429920615292
##   [10]                 1  0.637429920615292
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

Alternatively, we can examine differential access on a label such as disease vs. healthy, treated vs untreated etc.

alternateCellLabel = c(rep('Disease', 5), rep('Health', 12)) 
rse = getDiffAccess(rse, alternateCellLabel = alternateCellLabel)