Introduction

In this example we use the following data set:

5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor

Single Cell ATAC Dataset by Cell Ranger ATAC 1.0.1

Peripheral blood mononuclear cells (PBMCs) from a healthy donor.

  • ~6700 transposed nuclei were loaded.
  • 5335 nuclei were recovered.
  • Sequenced on Illumina NovaSeq with approximately 42k read pairs per cell.
  • 50bp read1, 8bp i7 (sample index), 16bp i5 (10x Barcode), 49bp read2.
  • Published on December 17, 2018

This dataset is licensed under the Creative Commons Attribution license.

https://support.10xgenomics.com/single-cell-atac/datasets/1.0.1/atac_v1_pbmc_5k

10x uses Cell Ranger ATAC 1.0.1 for bioinformatic processing and produces a peak by cell matrix.

Approximate time to run Destin on this dataset (8 core macbook pro): ~ 15 min

startTime = Sys.time()

Download Data

Download the file corresponding to Peak by cell matrix (filtered) 97.86 MB 03ea4420a7bdaadd8c1cab8d4a0cf2e9

We download the peak by cell matrix via the terminal using curl, although other options are available on thier webpage.

cd localDir10xData
curl -O http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_filtered_peak_bc_matrix.tar.gz

Unpack the tar.gz download via the terminal (method may depend on operating system)

tar -xzf atac_v1_pbmc_5k_filtered_peak_bc_matrix.tar.gz

The resulting folder contains 3 files:

  • matrix.mtx: peak by cell chromatin accessiblilty matrix in MatrixMarket format. We can use R package “Matrix” to read straight to a sparse matrix
  • peaks.bed: bed file corresponding to rows of matrix.mtx
  • barcodes.tsv: barcodes corresponding to columns of matrix.mtx

set data10xDir to appropriate local folder

data10xDir = "~/Dropbox/GitHub/PBMC_10X/filtered_peak_bc_matrix"

install destin R package

yourPathToDestinRepo = "~/Documents/gitRepos/destin"
install.packages(file.path(yourPathToDestinRepo,"package"), repos = NULL, type = "source")

load Destin

library(destin, quietly = T) 

Create ranged summarized experiment from 10x peak by cell matrix

rse = createRSEfrom10xMatrix(data10xDir)

test memory limits

countMat = assay(rse)
countMatBig = cbind(countMat, countMat, countMat, countMat, countMat) 
countMatBigger = cbind(countMatBig, countMatBig, countMatBig)
pca = irlba(t(countMatBigger), nv = 2)

here is a description of the ranged summarized experiment

column data is a description of the cells

colData(rse)
## DataFrame with 5335 rows and 1 column
##                      V1
##                <factor>
## 1    AAACGAAAGCGCAATG-1
## 2    AAACGAAAGGGTATCG-1
## 3    AAACGAAAGTAACATG-1
## 4    AAACGAAAGTTACACC-1
## 5    AAACGAACAGAGATGC-1
## ...                 ...
## 5331 TTTGTGTCACTCAAGT-1
## 5332 TTTGTGTCACTGGGCT-1
## 5333 TTTGTGTGTACGCAAG-1
## 5334 TTTGTGTGTCTGCGCA-1
## 5335 TTTGTGTTCAACTTGG-1

rowRanges is a description of the peaks

rowRanges(rse)
## GRanges object with 97824 ranges and 0 metadata columns:
##           seqnames              ranges strand
##              <Rle>           <IRanges>  <Rle>
##       [1]     chr1         10245-10510      *
##       [2]     chr1       237576-237942      *
##       [3]     chr1       565099-565554      *
##       [4]     chr1       569173-569645      *
##       [5]     chr1       713422-715095      *
##       ...      ...                 ...    ...
##   [97820]     chrX 155049753-155050199      *
##   [97821]     chrX 155070988-155071433      *
##   [97822]     chrX 155110236-155111744      *
##   [97823]     chrX 155196592-155196884      *
##   [97824]     chrX 155227026-155227540      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

assay is the binary chromatin accessibility matrix in sparse matrix format

  • rows are peaks
  • columns are cells
assay(rse)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                          
##  [1,] . . . . . . . . . .
##  [2,] . . . . . . . 1 . .
##  [3,] . . 1 . . . . . . .
##  [4,] . . . . . 1 . . . .
##  [5,] 1 . 1 1 . . . 1 . 1
##  [6,] 1 . . . . . . . . .
##  [7,] 1 . 1 . . 1 . . . .
##  [8,] . . . . . . . . . .
##  [9,] . . 1 . . . . . . .
## [10,] . . . . . . . . . .

Peak annotation

Destin calculates the distance of each peak to the transcript start site (TSS) and correspondingly annotates the peak to be distal or proximal regulatory element based on a 3kb window centered at the TSS. Destin also calculates for each peak its openness frequency (i.e., the frequency of this peak being open among existing cell lines and tissue types) based on the reference frequency map, pre-constructed from the ENCODE Project.

  model = "hg19" # choose from hg19, hg38, mm10
  rse = annotateRSE(rse, model)

Quality control

Destin begins with quality control, retaining chromatin regions accessible in at least 5 cells, and retaining cells with log total number of chromatin accessible regions within 3 standard deviations from median (robustly calculated using median absolute deviations). These default parameters can be modified if desired.

  rse = doQC(rse, regionSumCutoff = 5, cellSumCutoffSDs = 3)

Estimate number of clusters

Destin uses an unsupervised model-based likelihood (logLikeElbow) to estimate number of clusters. Like WCSSE, the likelihood is non-decreasing with increasing number of clusters. We fit the model log-likelihood as a linear spline function of number of clusters with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as number of clusters (n_k).

clusterEst = estimateNClusters(rse, nClustersRange = 2:20)
## [1] "computing elbow methods"
## Warning: did not converge in 10 iterations
nClusters = clusterEst$nClustersList$logLikeElbow
plotNClusters(clusterEst)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

Perform Destin clustering

#set nCores to an appropriate value based on your system
nCores = 8
clusterResults = destinGrid (rse, nClusters = nClusters, nCores = nCores)
clusterResults$cluster
## DataFrame with 4741 rows and 3 columns
##                      V1 cellSumPostQC   cluster
##                <factor>     <numeric> <integer>
## 1    AAACGAAAGCGCAATG-1          8744         8
## 2    AAACGAAAGGGTATCG-1         10150         8
## 3    AAACGAAAGTAACATG-1         11320         7
## 4    AAACGAAAGTTACACC-1         10951         7
## 5    AAACGAACAGAGATGC-1          5436         8
## ...                 ...           ...       ...
## 4737 TTTGTGTCACTCAAGT-1          5283         8
## 4738 TTTGTGTCACTGGGCT-1         17604         1
## 4739 TTTGTGTGTACGCAAG-1         10961         7
## 4740 TTTGTGTGTCTGCGCA-1          6111         8
## 4741 TTTGTGTTCAACTTGG-1          1876         5

Plot results

plotCluster(clusterResults, type = "t-SNE")

Time to run

TotalTime = Sys.time() - startTime
TotalTime
## Time difference of 20.72764 mins