cell-type specific GWAS association

Approximate run time (8 core macbook pro): ~ 7 min

1. Data

startTime = Sys.time()

set packageDir to the path where you have cloned the git repository and install

packageDir = "~/Documents/gitRepos/destin"
install.packages(file.path(packageDir,"package"), repos = NULL, type = "source")

load libraries and set paths according to your system

library(destin, quietly = T)
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## The following object is masked from 'package:IRanges':
## 
##     shift
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:Biostrings':
## 
##     type
## The following objects are masked from 'package:base':
## 
##     aperm, apply
## 
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:data.table':
## 
##     last
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## Warning: replacing previous import 'data.table::first' by
## 'GenomicAlignments::first' when loading 'destin'
## Warning: replacing previous import 'data.table::last' by
## 'GenomicAlignments::last' when loading 'destin'
## Warning: replacing previous import 'data.table::second' by
## 'GenomicAlignments::second' when loading 'destin'
gwasDir = "~/Dropbox/Gene_shared/scATACseq/articles/GWAS"
outDir = "~/Documents/statGenGit/GWASout" 

Load clustered chromatin accessibility summarized experiment. QC and annotate clusters.

load(file.path(packageDir, "practice/PreisslP56/peaks/PreisslP56Annotated.Rdata"))
rse = doQC(rse)
clusterData = fread(file.path(packageDir, "practice/gwas/Preisslp56Cluster.csv"))
colData(rse)$cluster = clusterData$cluster

2. Process chromatin accessiblility data for cell-type specific GWAS

To prepare for cell-type specific GWAS association, we must: - provide human annotation for the chromatin accessible regions. - aggregate the binary signal across peaks within each gene - calculate cell type specificity for chromatin accessibility by gene

2.1. Provide human symbol (HGNC) annotation for the chromatin accessible regions.

2.1.1 Conversion from mouse annotation to human annnotation

GWAS predominantly use human subjects, while our data set is mouse forebrain. Thus we need to utilize a series of 3 conversions to map the accessible regions to HGNC and Entrez ID.

annotated peak feature (mouse) -> MGI symbol (mouse) -> HGNC symbol (human) -> Entrez ID (human)

2.1.1.1 MGI symbol (Mouse)

MGI symbol is mapped according to ensembl_gene_id, which was annotated during the Destin clustering (“TSS.mouse.GRCm38” from the R ChIPpeakAnnopackage). We use the BioMart dataset “mmusculus_gene_ensembl” to map to MGI symbol.

The mmusculus conversion data frame “mmusculusConversion.Rdata” is preinstalled with Destin

2.1.1.2. HGNC (Human) via Mouse-Human homolog

Next, we map mouse to human gene using homolog list “HOM_MouseHumanSequence.rpt” from MGI. Only genes with 1 to 1 homolog mapping were retained.

“HOM_MouseHumanSequence.rpt” is preinstalled with Destin

2.1.1.3. Entrez ID (Human)

Finally, we use bioMart again to map to Entrez ID, after which accessible regions were aggregated by gene.

The hsapiens conversion data frame “hsapiensConversion.Rdata” is preinstalled with Destin.

2.1.1.4. Annotation results

rse = annotateMouseToHuman(rse)
rowRanges(rse)
## GRanges object with 70658 ranges and 13 metadata columns:
##           seqnames              ranges strand |     score signalValue
##              <Rle>           <IRanges>  <Rle> | <numeric>   <numeric>
##       [1]     chr1     3597507-3597732      * |        33      2.8353
##       [2]     chr1     3649211-3649594      * |        40     3.15033
##       [3]     chr1     3654966-3655506      * |        68     4.29448
##       [4]     chr1     3670496-3672507      * |       564     7.45739
##       [5]     chr1     4089665-4090022      * |       155     7.24576
##       ...      ...                 ...    ... .       ...         ...
##   [70654]     chrX 168698186-168698436      * |        40     3.15033
##   [70655]     chrX 168795436-168795702      * |        47     3.46536
##   [70656]     chrX 169270711-169270911      * |        54      3.7804
##   [70657]     chrX 169299659-169300030      * |       106      5.6706
##   [70658]     chrX 169319901-169320794      * |       790    12.62255
##              pValue    qValue      peak             DHSsum distanceToTSS
##           <numeric> <numeric> <integer>          <numeric>     <numeric>
##       [1]    3.3645   1.52226        66 0.0952380952380952         73991
##       [2]   4.03729   2.11312       101    5.4166511048864         22287
##       [3]   6.89429   4.72941       295  0.178571428571429         16532
##       [4]  56.41362  52.68634      1248   11.1109307359307          1002
##       [5]  15.55749  12.93645       166   1.28968253968254        270649
##       ...       ...       ...       ...                ...           ...
##   [70654]   4.03729   2.11312       119   1.59446000622471        -24288
##   [70655]    4.7497   2.75158       182   15.7707070707071           336
##   [70656]   5.49855   3.43169        96  0.648148148148148         49661
##   [70657]  10.62655   8.22295       223   14.4206427015251         20713
##   [70658]  79.01832   74.6328       612   21.8929172679173           471
##                      feature         region  mgi_symbol HomoloGene.ID
##                  <character>    <character> <character>     <integer>
##       [1] ENSMUSG00000051951 distal element        Xkr4         45848
##       [2] ENSMUSG00000051951 distal element        Xkr4         45848
##       [3] ENSMUSG00000051951 distal element        Xkr4         45848
##       [4] ENSMUSG00000051951 distal element        Xkr4         45848
##       [5] ENSMUSG00000025900 distal element         Rp1          4564
##       ...                ...            ...         ...           ...
##   [70654] ENSMUSG00000031358 distal element        Msl3         40749
##   [70655] ENSMUSG00000031355       promoter     Arhgap6          7630
##   [70656] ENSMUSG00000031352 distal element        Hccs          3897
##   [70657] ENSMUSG00000031352 distal element        Hccs          3897
##   [70658] ENSMUSG00000031352       promoter        Hccs          3897
##           humanSymbol  entrezID
##           <character> <integer>
##       [1]        XKR4    114786
##       [2]        XKR4    114786
##       [3]        XKR4    114786
##       [4]        XKR4    114786
##       [5]         RP1 107984125
##       ...         ...       ...
##   [70654]        MSL3     10943
##   [70655]     ARHGAP6       395
##   [70656]        HCCS      3052
##   [70657]        HCCS      3052
##   [70658]        HCCS      3052
##   -------
##   seqinfo: 20 sequences from an unspecified genome; no seqlengths

2.1.2 Directly annotate human features

Alternatively, if conducting a human scATAC-seq experiment, user can simplify the steps above, and the conversion becomes:

annotated Esememble peak feature (human) -> HGNC symbol (human) + Entrez ID (human)

annotateHuman() utilizes addGeneIDs function from ChIPpeakAnno package to annotate human Entrez ID and HGNC symbol from Ensemble feature. org.Hs.eg.db data set is used for annotation.

# rse = annotateHuman(rse)
# rowRanges(rse)

2. Annotate from from mouse annotation to human annnotation

GWAS predominantly use human subjects, while our data set is mouse forebrain. Thus we need to utilize a series of 3 conversions to map the accessible regions to HGNC and Entrez ID.

annotated peak feature (mouse) -> MGI symbol (mouse) -> HGNC symbol (human) -> Entrez ID (human)

2.2. Aggregate chromatin acessibility by gene

The summarized experiment now contains multiple accessible peaks per entrez ID. Thus we aggregate across peaks within gene to create a chromatin accessiblity matrix that is no longer binary.

This step is high in computational processing, thus it is recommended to run in parallel, but that is optional. It is also high in memory, thus we use sparseMatrix format.

The output is an aggregated summarized experiment where rows are genes by entrez ID

Set nCores to an appropriate value based on your system

# Set nCores to an appropriate value based on your system 
nCores = 7
se = aggregateRSEByGene(rse, nCores = nCores)

2.3. Calculate Cell type specificity for chromatin accessilbilty by gene

This step calculates the proportion of total accessibility by cell type. This is required input for magma and EWCE.

The result is a stacked data frame containing the cell type specificity for each gene - cell type combination. cell type specificity is calculated as the ratio between the cell type accessibility and the total accessibility across all cell types.

geneAccessibility = getGeneAccessibility(se)
geneAccessibility[hgncSymbol == "XKR4"]
##    hgncSymbol entrezID cluster accessibility totalAcrossTypes
## 1:       XKR4   114786       4     0.3007812          1.04871
## 2:       XKR4   114786       2     0.1172414          1.04871
## 3:       XKR4   114786       3     0.2292683          1.04871
## 4:       XKR4   114786       5     0.2491349          1.04871
## 5:       XKR4   114786       1     0.1522843          1.04871
##    cellTypeSpecificity
## 1:           0.2868107
## 2:           0.1117958
## 3:           0.2186193
## 4:           0.2375632
## 5:           0.1452110

3. EWCE for GWAS accessbility association

Now we determine whether GWAS results are associated with increased accessibility in a particular cell type. We utilize 2 methods originally developed for scRNA-seq expression: ECWE and MAGMA.

Expression Weighted Celltype Enrichment (ECWC) (Skene and Grant, 2016). While magma tests for trend in SNP p-value over increasing cell type specificty, ECWC tests whether the set of significant SNPs are more specific to cell type than are a random set of SNPs.

The empirical score = sum of cellTypeSpecificity from the geneAccessibility data frame only within the snpGenes, specific by cell type.

The p-value is calculated via permutation across all genes of the above score.

getEWCE is a function which takes as input: - geneAccessibility: the cell type specific accessibility data frame created is the previous step - geneList: list of significant snps for a single GWAS

Here we perform EWCE for 3 GWAS studies: “alzLambert.csv”: Alzheimers “parkNalls.csv”: Parkinsons “SCZ_PGC.tsv”: Schizophrenia

3.1. Prepare SNP lists

we begin by processing the significant snp files

EWCEDir = file.path(packageDir, "practice/gwas/EWCE")
fileNames = c("alzLambert.csv", "parkNalls.csv", "SCZ_PGC.tsv")

geneLists = lapply(fileNames, function(fileName) {
  snps = fread(file.path(EWCEDir,  fileName), check.names = T)
  geneList = snps$MAPPED_GENE[!grepl(" ", snps$MAPPED_GENE) & !grepl("LOC", snps$MAPPED_GENE)]
  return( geneList )
})
names(geneLists) = fileNames
str(geneLists)
## List of 3
##  $ alzLambert.csv: chr [1:22] "SQSTM1" "NDUFAF6" "AP2A2" "IGH" ...
##  $ parkNalls.csv : chr [1:12] "SIPA1L2" "MCCC1" "TMEM175" "BST1" ...
##  $ SCZ_PGC.tsv   : chr [1:52] "PLCH2" "MIR137HG" "C1orf132" "SDCCAG8" ...

3.2. Perform GWAS association via EWCE

then we call getEWCE individually for each significant snp list, and annotate the result and combine

set.seed(10)
resultsEWCEList = lapply(seq_along(geneLists), function(geneListIndex){
  getEWCE(geneAccessibility, geneLists[[geneListIndex]])
})
names(resultsEWCEList) = names(geneLists)

3.3. Results

The result is a data frame with p-value, foldChange, sd from mean for each cluster study.

Cluster 1 associated with schizophrenia SNPs (3rd study): p = 0.004.

resultsEWCEList
## $alzLambert.csv
##    cluster pValue foldChange sd_from_mean
## 1:       4  0.976  0.6842756  -1.81053179
## 2:       2  0.098  1.1892859   1.29983011
## 3:       3  0.367  1.0427030   0.30318715
## 4:       5  0.317  1.0538552   0.42263228
## 5:       1  0.449  1.0103079   0.06661755
## 
## $parkNalls.csv
##    cluster pValue foldChange sd_from_mean
## 1:       4  0.345  1.0569098    0.3085322
## 2:       2  0.232  1.1081952    0.6592406
## 3:       3  0.721  0.9021513   -0.6564246
## 4:       5  0.528  0.9827863   -0.1291906
## 5:       1  0.570  0.9590534   -0.2458018
## 
## $SCZ_PGC.tsv
##    cluster pValue foldChange sd_from_mean
## 1:       4  0.889  0.8876822   -1.2136889
## 2:       2  1.000  0.7433052   -3.0042313
## 3:       3  0.052  1.1347055    1.7971638
## 4:       5  0.410  1.0146945    0.2129084
## 5:       1  0.004  1.2201005    2.6003749

4. MAGMA for GWAS accessibility association

We used the GWAS association tool magma to test for associate with GWAS SNPs. We first called annotate with window size 10kb upstream and 1.5kb downstream of transcribed region. All 3 GWAS sets used hg19. Annotated SNPs were aggregated by gene to a single p-value. The magma model adjusts for covariance in SNP p-values due to linkage disequilibrium.

As done by Skeene et al., we binned the cell type specificity into 40 quantiles, with an additional bin for regions with no accessibility. The cell type specific bin was used as the feature vector for association testing. Finally test for trend between cell type specificity was performed.

Magma accounts for gene size, log gene size, gene density, and log gene density. Gene density accounts for the linkage disequilibrium between SNPs in the gene. The model also incorporates correlations between genes.

MAGMA input - raw GWAS data or SNP p values - reference data set (eg. 1,000 Genomes European panel) - specificity quantiles

4.1. Create specificity quantile matrix

We convert the cell type specific accessibility from the geneAccessibility data frame to 40 binned quantiles.

The specificity quantile matrix must be printed to file, as it is required by magma using the terminal (shell) command line.

magmaDir = file.path(packageDir, "practice/gwas/magma")
magmaOutDir = file.path(magmaDir, "magmaOut")
if (!dir.exists(magmaOutDir) )  dir.create(magmaOutDir)

quantsForMagma = getQuantsForMagma(geneAccessibility)
write.table(quantsForMagma, 
            file = file.path(magmaOutDir, "quantsForMagma.tsv"),
            row.names = F, quote = F)

Comparing the quantsForMagma to geneAccessibility, we see that a gene (A1BG) in which a cluster has high cellTypeSpecificity receives a high quantile (cluster 1), and vice versa (cluster 2). The highest quantile is 40 and 0 indicates that there was no accessibility for any cell in the cluster.

quantsForMagma[entrezID == 1]
##    entrezID  1 2  3  4 5
## 1:        1 38 2 29 28 3
geneAccessibility[entrezID == 1]
##    hgncSymbol entrezID cluster accessibility totalAcrossTypes
## 1:       A1BG        1       4    0.04296875        0.1799735
## 2:       A1BG        1       2    0.01034483        0.1799735
## 3:       A1BG        1       3    0.04682927        0.1799735
## 4:       A1BG        1       5    0.01384083        0.1799735
## 5:       A1BG        1       1    0.06598985        0.1799735
##    cellTypeSpecificity
## 1:          0.23875040
## 2:          0.05747972
## 3:          0.26020087
## 4:          0.07690481
## 5:          0.36666420

4.2. Get GWAS full data

Please download “ckqny.scz2snpres.gz”" from http://www.med.unc.edu/pgc/files/resultfiles/scz2.snp.results.txt.gz/view

Load the GWAS data set and prepare to be read by magma. Columns and names must be ordered appropriately: “SNP”, “CHR”, “BP”, “P”; corresponding to SNP ID, chromosome, base pair position, and p-value.

The prepared GWAS data must be printed to file, as it is required by magma using the terminal (shell) command line.

schizophreniaSNPsFile = file.path(magmaDir, "scz/ckqny.scz2snpres.gz")
data = fread(schizophreniaSNPsFile, fill=TRUE)

#format SNP data for magma
data[, hg19chrc := gsub("chr", "", hg19chrc)]
dataReorder = data[,c("snpid", "hg19chrc", "bp", "p")]
setnames(dataReorder, c("SNP", "CHR", "BP", "P"))
write.table(dataReorder, file = file.path(magmaOutDir, "ckqny.scz2snpres.reorderd.txt"),
            quote = F, row.names = F)
head(dataReorder)
##            SNP CHR     BP      P
## 1:   rs4951859   1 729679 0.2083
## 2: rs142557973   1 731718 0.3298
## 3: rs141242758   1 734349 0.3055
## 4:  rs79010578   1 736289 0.5132
## 5: rs143225517   1 751756 0.8431
## 6:   rs3094315   1 752566 0.7870

4.3. run magma via terminal (shell) command line

the following code is in bash script, and is not run in this notebook

4.3.1. set paramaters

for installation, documentation, and troubleshooting, see https://ctg.cncr.nl/software/magma calls below require the path to magma be included in the system path, here we assign by: PATH=$PATH:/Users/urrutia/Applications/magma_v1.06c_mac

packageDir=~/Documents/gitRepos/destin
PATH=$PATH:~/Applications/magma_v1.07b_mac
magmaDir=$packageDir/practice/gwas/magma
magmaOutDir=$magmaDir/magmaOut

4.3.2. annotation

The SNP location file should contain three columns: SNP ID, chromosome, and basepair position. These should be the first three columns in that file (additional columns are ignored).

This step requires a file of gene locations, using build 37. We include this file as part of destin. Please see https://ctg.cncr.nl/software/magma if another build is required.

magma --annotate window=10,1.5 \
--snp-loc $magmaOutDir/ckqny.scz2snpres.reorderd.txt \
--gene-loc $packageDir/practice/gwas/magma/NCBI37.3.gene.loc  \
--out $magmaOutDir/SCZ2014

4.3.3. Gene wise p-values and covariance

Magma adjusts for covariates in order to produce a biologically appropriate p-value to each gene.

Magma accounts for gene size, log gene size, gene density, and log gene density. Gene density accounts for the linkage disequilibrium between SNPs in the gene. The model also incorporates correlations between genes.

A reference file from Phase 3 of 1000 genomes project is required for this step, and here we use the European ancestry file. Please download from https://ctg.cncr.nl/software/magma. The location is assigned by the –bfile command.

magma --bfile  $magmaDir/g1000_eur/g1000_eur \
--gene-model snp-wise=mean \
--pval $magmaOutDir/ckqny.scz2snpres.reorderd.txt use=SNP,P N=10000 \
--gene-annot $magmaOutDir/SCZ2014.genes.annot \
--out $magmaOutDir/SCZ2014

4.3.4. Gene property analysis

Now mamga tests for association between cluster specific accessibility quantiles and significant genes (aggregated SNPs) for each cell cluster.

magma --gene-results $magmaOutDir/SCZ2014.genes.raw \
--gene-covar $magmaOutDir/quantsForMagma.tsv  \
--out $magmaOutDir/SCZ2014

4.3.5. Magma results

The results are below, precomputed and saved in upstream folder. Again we see strong association between cluster specific accessibility quantiles and significant genes (aggregated SNPs) for cluster number 1 corresponding to interneuron 1.

This indicates that genes that are associated with schizophrenia are specifically accessible in cluster 1 compared to other clusters. Beta indicates the trend between p-values enrichment by gene and cluster specific accessibility quantiles. Positive beta indicates that decreasing p-value trends positively with accessibility quantile.

#pMagmaData = readLines(file.path(magmaOutDir, "SCZ2014.gsa.out"))
# we read the precomputed results
pMagmaData = readLines(file.path(magmaDir, "SCZ2014.gsa.out"))
pMagmaData
##  [1] "# MEAN_SAMPLE_SIZE = 10000"                                                                                        
##  [2] "# TOTAL_GENES = 14153"                                                                                             
##  [3] "# TEST_DIRECTION = one-sided, positive (set), two-sided (covar)"                                                   
##  [4] "# CONDITIONED_INTERNAL = gene size, gene density, inverse mac, log(gene size), log(gene density), log(inverse mac)"
##  [5] "VARIABLE      TYPE  NGENES         BETA     BETA_STD           SE            P"                                    
##  [6] "1            COVAR   14153    0.0025526     0.030821   0.00076402   0.00083717"                                    
##  [7] "2            COVAR   14153   -0.0012294    -0.014635   0.00078074      0.11535"                                    
##  [8] "3            COVAR   14153    0.0013858     0.016008   0.00081748     0.090058"                                    
##  [9] "4            COVAR   14153   0.00026362    0.0031811   0.00077018      0.73214"                                    
## [10] "5            COVAR   14153    0.0010363     0.012351   0.00076929      0.17796"

Time to run

TotalTime = Sys.time() - startTime
TotalTime
## Time difference of 8.939172 mins