In this example we use the following data set:
5k Peripheral blood mononuclear cells (PBMCs) from a healthy donor
Single Cell ATAC Dataset by Cell Ranger ATAC 1.0.1
Peripheral blood mononuclear cells (PBMCs) from a healthy donor.
This dataset is licensed under the Creative Commons Attribution license.
https://support.10xgenomics.com/single-cell-atac/datasets/1.0.1/atac_v1_pbmc_5k
10x uses Cell Ranger ATAC 1.0.1 for bioinformatic processing and produces a peak by cell matrix.
Approximate time to run Destin on this dataset (8 core macbook pro): ~ 15 min
startTime = Sys.time()
Download the file corresponding to Peak by cell matrix (filtered) 97.86 MB 03ea4420a7bdaadd8c1cab8d4a0cf2e9
We download the peak by cell matrix via the terminal using curl, although other options are available on thier webpage.
cd localDir10xData
curl -O http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_filtered_peak_bc_matrix.tar.gz
Unpack the tar.gz download via the terminal (method may depend on operating system)
tar -xzf atac_v1_pbmc_5k_filtered_peak_bc_matrix.tar.gz
The resulting folder contains 3 files:
matrix.mtx
: peak by cell chromatin accessiblilty matrix in MatrixMarket format. We can use R package “Matrix” to read straight to a sparse matrixpeaks.bed
: bed file corresponding to rows of matrix.mtxbarcodes.tsv
: barcodes corresponding to columns of matrix.mtxset data10xDir to appropriate local folder
data10xDir = "~/Dropbox/GitHub/PBMC_10X/filtered_peak_bc_matrix"
install destin R package
yourPathToDestinRepo = "~/Documents/gitRepos/destin"
install.packages(file.path(yourPathToDestinRepo,"package"), repos = NULL, type = "source")
load Destin
library(destin, quietly = T)
rse = createRSEfrom10xMatrix(data10xDir)
countMat = assay(rse)
countMatBig = cbind(countMat, countMat, countMat, countMat, countMat)
countMatBigger = cbind(countMatBig, countMatBig, countMatBig)
pca = irlba(t(countMatBigger), nv = 2)
here is a description of the ranged summarized experiment
column data is a description of the cells
colData(rse)
## DataFrame with 5335 rows and 1 column
## V1
## <factor>
## 1 AAACGAAAGCGCAATG-1
## 2 AAACGAAAGGGTATCG-1
## 3 AAACGAAAGTAACATG-1
## 4 AAACGAAAGTTACACC-1
## 5 AAACGAACAGAGATGC-1
## ... ...
## 5331 TTTGTGTCACTCAAGT-1
## 5332 TTTGTGTCACTGGGCT-1
## 5333 TTTGTGTGTACGCAAG-1
## 5334 TTTGTGTGTCTGCGCA-1
## 5335 TTTGTGTTCAACTTGG-1
rowRanges is a description of the peaks
rowRanges(rse)
## GRanges object with 97824 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 10245-10510 *
## [2] chr1 237576-237942 *
## [3] chr1 565099-565554 *
## [4] chr1 569173-569645 *
## [5] chr1 713422-715095 *
## ... ... ... ...
## [97820] chrX 155049753-155050199 *
## [97821] chrX 155070988-155071433 *
## [97822] chrX 155110236-155111744 *
## [97823] chrX 155196592-155196884 *
## [97824] chrX 155227026-155227540 *
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
assay is the binary chromatin accessibility matrix in sparse matrix format
assay(rse)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . . . . . . 1 . .
## [3,] . . 1 . . . . . . .
## [4,] . . . . . 1 . . . .
## [5,] 1 . 1 1 . . . 1 . 1
## [6,] 1 . . . . . . . . .
## [7,] 1 . 1 . . 1 . . . .
## [8,] . . . . . . . . . .
## [9,] . . 1 . . . . . . .
## [10,] . . . . . . . . . .
Destin calculates the distance of each peak to the transcript start site (TSS) and correspondingly annotates the peak to be distal or proximal regulatory element based on a 3kb window centered at the TSS. Destin also calculates for each peak its openness frequency (i.e., the frequency of this peak being open among existing cell lines and tissue types) based on the reference frequency map, pre-constructed from the ENCODE Project.
model = "hg19" # choose from hg19, hg38, mm10
rse = annotateRSE(rse, model)
Destin begins with quality control, retaining chromatin regions accessible in at least 5 cells, and retaining cells with log total number of chromatin accessible regions within 3 standard deviations from median (robustly calculated using median absolute deviations). These default parameters can be modified if desired.
rse = doQC(rse, regionSumCutoff = 5, cellSumCutoffSDs = 3)
Destin uses an unsupervised model-based likelihood (logLikeElbow) to estimate number of clusters. Like WCSSE, the likelihood is non-decreasing with increasing number of clusters. We fit the model log-likelihood as a linear spline function of number of clusters with a single knot. The knot placement which results in the best fit (lowest sum of squared errors) is selected as number of clusters (n_k).
clusterEst = estimateNClusters(rse, nClustersRange = 2:20)
## [1] "computing elbow methods"
## Warning: did not converge in 10 iterations
nClusters = clusterEst$nClustersList$logLikeElbow
plotNClusters(clusterEst)
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
#set nCores to an appropriate value based on your system
nCores = 8
clusterResults = destinGrid (rse, nClusters = nClusters, nCores = nCores)
clusterResults$cluster
## DataFrame with 4741 rows and 3 columns
## V1 cellSumPostQC cluster
## <factor> <numeric> <integer>
## 1 AAACGAAAGCGCAATG-1 8744 8
## 2 AAACGAAAGGGTATCG-1 10150 8
## 3 AAACGAAAGTAACATG-1 11320 7
## 4 AAACGAAAGTTACACC-1 10951 7
## 5 AAACGAACAGAGATGC-1 5436 8
## ... ... ... ...
## 4737 TTTGTGTCACTCAAGT-1 5283 8
## 4738 TTTGTGTCACTGGGCT-1 17604 1
## 4739 TTTGTGTGTACGCAAG-1 10961 7
## 4740 TTTGTGTGTCTGCGCA-1 6111 8
## 4741 TTTGTGTTCAACTTGG-1 1876 5
plotCluster(clusterResults, type = "t-SNE")
Time to run
TotalTime = Sys.time() - startTime
TotalTime
## Time difference of 20.72764 mins