GUESS is software for Bayesian variable selection and model averaging using a stochastic search algorithm to visit the most likely models via MCMC. There also exists an R package, R2GUESS, which is a wrapper for running GUESS, and provides useful diagnostic and summary functions.
This package, GUESSFM, aims to extend GUESS for fine mapping, defined here as the task of identifying the most likely set of causal variants. There are particular challenges for fine mapping, mostly related to the density of SNP markers. This vignette introduces GUESSFM and works through a sample run on a trait simulated from an example dataset in the snpStats package. The other vignettes visit some topics in more depth: plotting, and SNP groups.
Although GUESS uses g-priors, which inhibit visiting models containing SNPs in high LD, we have found that running GUESS with SNPs in complete LD can lead to instability. GUESSFM was therefore created to allow GUESS to be run on a tagged set of SNPs to approximate the posterior model space, and then to expand the most interesting models to include their tags. It is perhaps useful to give an example of expanding a model.
Suppose we begin tagging, and determine a couple of tag groups as follows:
| Tag | SNPs in tag group |
|-----|-------------------|
| \(A\) | \(A_1\), \(A_2\) |
| \(B\) | \(B_1\) |
GUESSFM will remove SNPs \(A_1\), \(A_2\) and \(B_1\) before running GUESS, and will save a file containing an object of class tags that holds the information in the table above. Then, suppose that after GUESS has run on the set of tag SNPs, model \((A,B)\) is one of the most interesting. GUESSFM will 'expand' this model to the set of models \(\{(A,B), (A_1,B), (A_2,B), (A,B_1), (A_1,B_1), (A_2,B_1)\}\).
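To illustrate the bookkeeping, the expanded set is simply the Cartesian product of the tag groups appearing in the model. A minimal sketch in base R (illustrative only, not GUESSFM's implementation; the SNP names are the hypothetical ones from the table above):
## Illustrative only: expand model (A,B) over its tag groups using base R
groups <- list(c("A","A1","A2"),          ## tag A stands for itself plus A1 and A2
               c("B","B1"))               ## tag B stands for itself plus B1
expanded <- expand.grid(groups, stringsAsFactors=FALSE)
apply(expanded, 1, paste, collapse="%")   ## models appear as %-separated strings in GUESSFM output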
We assess each model using Bayes Factors. The Bayes Factor comparing model \(M_i\) with the null model \(M_0\) is
\[BF_{i0} = \frac{P(\text{data} \mid M_i)}{P(\text{data} \mid M_0)}.\]
In a fully Bayesian analysis, these marginal likelihoods are calculated by integrating over prior distributions for the model parameters (regression coefficients, for example). This is what GUESS does. However, GUESS assumes a linear model, whereas we will often want to fine map a disease trait and therefore use a logistic model. Further, we cannot go back and run the expanded models through GUESS. Instead, we use Approximate Bayes Factors based on the BIC:
\[-2 \log{ABF_{i0}} = BIC(M_i) - BIC(M_0).\]
These are calculated within R via the glm and BIC functions.
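For a single model the calculation is easy to do by hand. Here is a minimal sketch (illustrative only, not GUESSFM's internal code; dat, y, snp1 and snp2 are hypothetical names):
## Sketch: BIC-based Approximate Bayes Factor for one model against the null.
## dat is a hypothetical data.frame holding a response y and SNP dosages snp1, snp2;
## for a case/control trait you would use family="binomial".
m1 <- glm(y ~ snp1 + snp2, data=dat, family="gaussian")   ## candidate model M_i
m0 <- glm(y ~ 1, data=dat, family="gaussian")             ## null model M_0
log.ABF <- -(BIC(m1) - BIC(m0)) / 2                       ## from the relation above
exp(log.ABF)                                              ## the Approximate Bayes Factor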
At what threshold should you tag? Probably, as high as you can go without inducing instability in GUESS. We have found \(r^2=0.99\) to work in some large datasets, but your mileage may vary. In the examples below we use a lower threshold purely for demonstration.
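If you want to look at the LD in your own region before settling on a threshold, the distribution of pairwise \(r^2\) is easy to inspect with snpStats (a sketch; here X stands for a SnpMatrix such as the one created below):
## Sketch: distribution of pairwise r-squared, to help choose a tagging threshold
library(snpStats)
r2 <- as.matrix(ld(X, depth=ncol(X)-1, stats="R.squared"))
summary(r2[upper.tri(r2)])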
We start by taking some sample data from the snpStats package, subsetting to 20 SNPs, and simulating a quantitative trait that depends on two causal SNPs.
library(snpStats)
data(for.exercise, package="snpStats")
set.seed(12346)
X <- snps.10[,101:120]
n <- nrow(X)
causal <- c("rs1555897","rs7069505")
Y <- rnorm(n,mean=as.numeric(X[,causal[1]]))*sqrt(0.2) +
     rnorm(n,mean=as.numeric(X[,causal[2]]))*sqrt(0.2) +
     rnorm(n)*sqrt(0.6)
X contains some missing genotypes, but no SNPs with such a low call rate that we would worry in a large study. Still, for the purposes of a vignette, the rest of the analysis is easier to interpret if we fill in the missing values.
library(GUESSFM)
summary(col.summary(X))
## Calls Call.rate Certain.calls RAF
## Min. :984.0 Min. :0.9840 Min. :1 Min. :0.04651
## 1st Qu.:988.5 1st Qu.:0.9885 1st Qu.:1 1st Qu.:0.16658
## Median :989.0 Median :0.9890 Median :1 Median :0.41617
## Mean :989.7 Mean :0.9897 Mean :1 Mean :0.41482
## 3rd Qu.:990.0 3rd Qu.:0.9900 3rd Qu.:1 3rd Qu.:0.62680
## Max. :998.0 Max. :0.9980 Max. :1 Max. :0.84747
## MAF P.AA P.AB P.BB
## Min. :0.04651 Min. :0.01921 Min. :0.08898 Min. :0.002022
## 1st Qu.:0.14523 1st Qu.:0.14170 1st Qu.:0.22626 1st Qu.:0.029575
## Median :0.31430 Median :0.33984 Median :0.42095 Median :0.172658
## Mean :0.27944 Mean :0.40774 Mean :0.35487 Mean :0.237383
## 3rd Qu.:0.37406 3rd Qu.:0.69520 3rd Qu.:0.47258 3rd Qu.:0.392982
## Max. :0.49298 Max. :0.90900 Max. :0.50101 Max. :0.732323
## z.HWE
## Min. :-3.5140
## 1st Qu.:-1.1135
## Median : 0.1907
## Mean :-0.4489
## 3rd Qu.: 0.4224
## Max. : 1.1354
X <- impute.missing(X)
## 20 to impute
## 1 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 2 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 3 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 4 .SNPs tagged by a single SNP: 1
## 5 .SNPs tagged by a single SNP: 1
## 6 .SNPs tagged by a single SNP: 1
## 7 .SNPs tagged by a single SNP: 1
## 8 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 9 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 10 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 11 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 12 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 13 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 14 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 15 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 16 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 17 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 18 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 19 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 20 .SNPs tagged by multiple tag haplotypes (saturated model): 1
##
## coercing object of mode numeric to SnpMatrix
summary(col.summary(X))
## Calls Call.rate Certain.calls RAF
## Min. :1000 Min. :1 Min. :1 Min. :0.0460
## 1st Qu.:1000 1st Qu.:1 1st Qu.:1 1st Qu.:0.1660
## Median :1000 Median :1 Median :1 Median :0.4158
## Mean :1000 Mean :1 Mean :1 Mean :0.4145
## 3rd Qu.:1000 3rd Qu.:1 3rd Qu.:1 3rd Qu.:0.6269
## Max. :1000 Max. :1 Max. :1 Max. :0.8465
## MAF P.AA P.AB P.BB
## Min. :0.0460 Min. :0.0200 Min. :0.0880 Min. :0.00200
## 1st Qu.:0.1457 1st Qu.:0.1410 1st Qu.:0.2268 1st Qu.:0.02925
## Median :0.3145 Median :0.3400 Median :0.4220 Median :0.17150
## Mean :0.2794 Mean :0.4079 Mean :0.3551 Mean :0.23700
## 3rd Qu.:0.3739 3rd Qu.:0.6957 3rd Qu.:0.4733 3rd Qu.:0.39275
## Max. :0.4940 Max. :0.9100 Max. :0.5020 Max. :0.73100
## z.HWE
## Min. :-3.5284
## 1st Qu.:-1.0185
## Median : 0.2031
## Mean :-0.4348
## 3rd Qu.: 0.3971
## Max. : 1.1292
Looking at the LD, we see this is a region in which D' (above the diagonal) is typically high, whilst \(r^2\) can be high between some SNPs, with moderately strong \(r^2 \simeq 0.6\) between our two causal SNPs:
ld <- show.ld(X=X)
First, we run GUESS via GUESSFM's wrapper, run.bvs, tagging at \(r^2=0.95\). This is lower than suggested above, and is used here purely for demonstration, as there is limited strong LD in this dataset.
## THIS DIRECTORY WILL HOLD ALL THE GUESS INPUT AND OUTPUT
## AND WILL BE CREATED IF IT DOESN'T ALREADY EXIST
mydir <- tempfile()
## NB, IF GUESS IS NOT ON YOUR PATH, SUPPLY THE FULL PATH IN AN ARGUMENT
## guess.command="/path/to/GUESS"
run.bvs(X,Y,gdir=mydir,
tag.r2=0.95, # maximum r2 between SNPs to be modelled
guess.command="GUESS", # /path/to/GUESS
nexp=3, # expected number of causal variants, an overestimate
nsave=1000) # number of best models to save
This can take a (long) while. For the purposes of this vignette, we will load the results from an existing run.
mydir <- system.file("extdata",package="GUESSFM")
## what files were created?
list.files(mydir)
## [1] "Par_file_example.xml"
## [2] "X_55000"
## [3] "Y_55000"
## [4] "decode_55000"
## [5] "decode_samples_55000"
## [6] "init_55000"
## [7] "log"
## [8] "out_55000_sweeps_features.txt"
## [9] "out_55000_sweeps_output_all_exchange_history.txt"
## [10] "out_55000_sweeps_output_best_visited_models.txt"
## [11] "out_55000_sweeps_output_cross_over_history.txt"
## [12] "out_55000_sweeps_output_delayed_rejection_history.txt"
## [13] "out_55000_sweeps_output_fast_scan_history.txt"
## [14] "out_55000_sweeps_output_g_adaptation_history.txt"
## [15] "out_55000_sweeps_output_g_history.txt"
## [16] "out_55000_sweeps_output_gibbs_history.txt"
## [17] "out_55000_sweeps_output_log_cond_post_prob_history.txt"
## [18] "out_55000_sweeps_output_marg_prob_incl.txt"
## [19] "out_55000_sweeps_output_model_size_history.txt"
## [20] "out_55000_sweeps_output_models_history.txt"
## [21] "out_55000_sweeps_output_n_models_visited_history.txt"
## [22] "out_55000_sweeps_output_temperature_history.txt"
## [23] "par.xml"
## [24] "simdata.RData"
## [25] "tags.RData"
## read output with GUESSFM
d <- read.snpmod(mydir)
## /ipswich/data/chrisw/R/packages/GUESSFM/inst/extdata/out_55000 is a directory - reading most recent filestub: /ipswich/data/chrisw/R/packages/GUESSFM/inst/extdata/out_55000
## 396 models, with pp > 1.18e-47, and, cumulatively, to 1
## examine the best models and SNPs with greatest marginal support within the tagged data.
best.models(d)
## rank visits size logPP PP jeffreys
## 1 1 31725 2 -3520.197 0.65605419 46.75253
## 2 2 2557 3 -3522.787 0.04923028 46.36002
## 3 3 2422 3 -3522.829 0.04720462 46.34177
## 4 4 1901 3 -3523.065 0.03729365 46.23942
## 5 5 1512 3 -3523.295 0.02962316 46.13942
## 6 6 1312 3 -3523.454 0.02527297 46.07044
## 7 7 759 3 -3524.022 0.01432030 45.82374
## 8 8 653 3 -3524.215 0.01180112 45.73971
## 9 9 536 3 -3524.287 0.01098127 45.70844
## str snps
## 1 rs11253451%rs7069505 rs11253451%rs7069505
## 2 rs11253451%rs17221435%rs7069505 rs11253451%rs17221435%rs7069505
## 3 rs11253446%rs11253451%rs7069505 rs11253446%rs11253451%rs7069505
## 4 rs11253451%rs2296625%rs7069505 rs11253451%rs2296625%rs7069505
## 5 rs10752021%rs11253451%rs7069505 rs10752021%rs11253451%rs7069505
## 6 rs10904578%rs11253451%rs7069505 rs10904578%rs11253451%rs7069505
## 7 rs11253451%rs12268008%rs7069505 rs11253451%rs12268008%rs7069505
## 8 rs11253451%rs11253516%rs7069505 rs11253451%rs11253516%rs7069505
## 9 rs11253451%rs7069505%rs9124 rs11253451%rs7069505%rs9124
best.snps(d)
## Predictor Marg_Prob_Incl var
## rs7069505 15 0.9999798 rs7069505
## rs11253451 4 0.9999653 rs11253451
Huh. GUESS has selected one of our causal SNPs, but not both. Why? We have a clue from the ld matrix:
sel=rownames(best.snps(d))
ld[c(sel,causal),c(sel,causal)]
## rs7069505 rs11253451 rs1555897 rs7069505
## rs7069505 1.0000000 0.2797505 0.2763362 1.0000000
## rs11253451 0.6310478 1.0000000 1.0000000 0.6310478
## rs1555897 0.6258456 0.9957333 1.0000000 0.6258456
## rs7069505 1.0000000 0.2797505 0.2763362 1.0000000
So the selected SNP rs11253451 has \(r^2=0.996\) and \(D'=1\) with the causal but unselected SNP rs1555897. That would explain it.
Note that both best.models and best.snps allow you to specify thresholds for how to determine “best”. See their help pages for details.
The tags created within the run.bvs function are saved to a tags.RData file under mydir, and can be examined.
(load(file.path(mydir,"tags.RData")))
## [1] "tags"
tags
## tags object, containing 20 SNPs in 17 groups.
tagsof(tags,causal)
## tag snps
## 1 rs11253451 rs1555897
## 2 rs7069505 rs7069505
taggedby(tags,sel)
## tag snps
## 1 rs11253451 rs12413228
## 2 rs11253451 rs10904571
## 3 rs11253451 rs11253451
## 4 rs11253451 rs1555897
## 5 rs7069505 rs7069505
Indeed, rs11253451 is a tag for rs1555897.
NB: to see more about how to manipulate tags and groups objects, see the vignette
vignette("groups",package="GUESSFM")
Tagging has allowed us to shrink the model space, by assuming that models with SNPs in very high LD will have very similar likelihoods, but for fine mapping we really do want to evaluate each and every model. So, having chosen our best set of models within the shrunken space, we need to expand each of them to all the possible models they tag:
dx<-expand.tags(d,tags)
## expanding tags for 396 models over 17 tag SNPs, tagging a total of 20 SNPs.
The expanded models above have all been assigned the log Bayes Factor for their nearest tag model. This isn't terrible, in practice, but if you care about fine mapping then you should get a more precise answer by refitting the most likely models individually. Note this is especially important if you have a binomial outcome, as GUESS has been run using a linear model.
Here we take the set of most likely models which collectively capture 90% of the posterior support from GUESS, after expansion. Loading the speedglm library makes the fitting faster, and we calculate the Approximate Bayes Factors using the BIC approximation. To do this, we also need to supply some information about our prior for the number of causal variants in a model.
best <- best.models(dx,cpp.thr=0.9)
library(speedglm)
abf <- abf.calc(y=Y,x=X,models=best$str,family="gaussian")
## Calculating BICs using speedglm
## 43 of 43 models can be evaluated directly
## calculating ABF
## evaluating direct models, n=43
sm <- abf2snpmod(abf,expected=3)
## creating snpmod data.frame
## calculating marginal SNP inclusion probabilities
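The expected=3 argument encodes a prior on the number of causal variants. As a rough illustration of how such a prior combines with the ABFs (a sketch of the idea only; the prior actually used is documented on the abf2snpmod help page, and the log ABFs and model sizes below are made-up numbers):
## Sketch: turning ABFs into model posterior probabilities with a model-size prior
nsnp <- 20                                    ## number of SNPs in the region
expected <- 3                                 ## prior expected number of causal variants
size.prior <- dbinom(0:nsnp, size=nsnp, prob=expected/nsnp)  ## one natural choice of prior
lABF <- c(0, 12.4, 11.9, 8.2)                 ## hypothetical log ABFs vs the null
ksize <- c(0, 2, 2, 3)                        ## number of SNPs in each model
w <- exp(lABF) * size.prior[ksize + 1]        ## unnormalised posterior weights
round(w / sum(w), 3)                          ## posterior probabilities over these models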
Now we can explore the best SNPs and models in the tagged data, the expanded data, and the refitted data:
best.snps(d)
## Predictor Marg_Prob_Incl var
## rs7069505 15 0.9999798 rs7069505
## rs11253451 4 0.9999653 rs11253451
best.snps(dx)
## Marg_Prob_Incl var rownames
## rs7069505 0.9959669 rs7069505 rs7069505
## rs12413228 0.2496380 rs12413228 rs12413228
## rs10904571 0.2496380 rs10904571 rs10904571
## rs11253451 0.2496380 rs11253451 rs11253451
## rs1555897 0.2496380 rs1555897 rs1555897
best.snps(sm)
## Marg_Prob_Incl var rownames
## rs7069505 1.000000 rs7069505 rs7069505
## rs1555897 0.451403 rs1555897 rs1555897
## rs11253451 0.449641 rs11253451 rs11253451
We see that on expanding the tags we pick up the true causal variant rs1555897, together with two more extraneous SNPs, but on refitting the models we see only the two causal SNPs and rs11253451, which is in very high LD with rs1555897.
A formal way to group SNPs in LD with posterior support is to use the snp.picker function, which can also produce a plot to show how it is working. We do not expect to be able to discriminate, statistically, between highly correlated variants; instead, the posterior support is likely to be diluted across such sets of variants. To group such SNPs, snp.picker uses the marginal posterior probability of inclusion (MPPI) of each SNP together with the \(r^2\) between SNPs; the details of the algorithm are described in its help page. We summarize the support for any group of SNPs by the grouped marginal posterior probability of inclusion, or gMPPI.
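As a toy illustration of the gMPPI idea (this is not GUESSFM's internal code; the model table and SNP names are made up), the support for a group can be summarised as the total posterior probability of the models that contain at least one SNP from the group:
## Toy illustration of a grouped posterior inclusion probability
mods <- data.frame(str=c("snpA1%snpC","snpA2%snpC","snpC","snpA1"),  ## hypothetical models
                   PP =c(0.40, 0.30, 0.20, 0.10),                    ## their posterior probabilities
                   stringsAsFactors=FALSE)
group <- c("snpA1","snpA2")                                          ## a group of correlated SNPs
in.group <- sapply(strsplit(mods$str, "%"), function(s) any(s %in% group))
sum(mods$PP[in.group])    ## 0.8: posterior support for the group as a whole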
sp <- snp.picker(sm,X)
## r2 mpi model.count
## rs7069505 -4.440892e-16 1 0
## 1 rs7069505 1 1
## r2 mpi model.count
## rs1555897 -2.220446e-16 0.4514030 0
## rs11253451 4.266735e-03 0.4496410 0
## rs10904571 4.039755e-02 0.0827517 0
## rs12413228 6.558729e-02 0.0162043 0
## 2 rs1555897 4 1
summary(sp)
## SNP.index SNP.count min.R2 gMMPI
## 1 rs7069505 1 1.0000000 1
## 2 rs1555897 4 0.9344127 1
plot(sp)
It is useful to assess the steps of any analysis by looking at the data. With such a large number of models, the best way is to plot aspects of the data. For that reason, GUESSFM contains lots of plotting functions, described in a separate vignette. To see it, do:
vignette("plotting",package="GUESSFM")
Looping over many, many models can be made quicker by parallel processing. GUESSFM does this by means of calls to the mclapply function in the parallel package. By default, the parallel package sets itself up to use two cores. You can change this by setting the option mc.cores. For example, if you have 20 cores on your machine, you might set
options(mc.cores=16)
to use 16 of them for R, and leave the remainder free to run other processes.
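Alternatively, you can set this relative to however many cores your machine reports (a sketch; how many cores you hold back is up to you):
## Sketch: use most, but not all, of the detected cores for R
library(parallel)
options(mc.cores=max(1, detectCores() - 4))
getOption("mc.cores")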
Functions which make use of this (and over which you might therefore not want to use mclapply yourself) are:
expand.tags
You can convert a run from R2GUESS into a snpmod object with:
## read output using a convenience wrapper for as.ESS.object()
## this returns an object of class ESS, used by R2GUESS
ess <- read.ess(mydir)
str(ess)
## GUESSFM maps snp numbers to names via a decode vector
decode <- structure(colnames(X),names=as.numeric(1:ncol(X)))
## create a snpmod
gfm2 <- ess2snpmod(ess)
best.models(gfm2)
Now you can apply all the plotting functions etc. in GUESSFM but, without the tagging strategy, you won't be able to do the expansion step.