Processing math: 100%

GUESSFM Introduction

Chris Wallace // web // email

1. Introduction
2. R2GUESS Strategy
3. Simulate some data
4. Running GUESS and reading output
5. Expanding
6. Refitting
7. SNP groups
8. Plotting
9. Parallelism
10. Using an existing R2GUESS run

Introduction

GUESS is software for Bayesian variable selection and model averaging using a stochastic search algorithm to visit the most likely models via MCMC. There also exists an R package, R2GUESS, which is a wrapper for running GUESS, and provides useful diagnostic and summary functions.

This package, GUESSFM, aims to extend GUESS for fine mapping, defined here as the task of identifying the most likely set of causal variants. The are particular challenges for fine mapping, mostly related to the density of SNP markers. Nonetheless, this vignette will at first focus on an example dataset that comes with R2GUESS.

This vignettes introduces GUESSFM, and tries to do a sample run with simulated data. The other vignettes visit some topics in more depth:

plotting and SNP groups.

R2GUESS Strategy

Although GUESS uses g-priors, which inhibits visiting models with SNPs in high LD, we have found that running GUESS with SNPs in complete LD can lead to instability. R2GUESS was created purely to allow running GUESS on a tagged set of SNPs to approximate the posterior model space, and then expand the most interesting models to include their tags. It is perhaps useful to give an example of expanding a model.

Suppose we begin tagging, and determine a couple of tag groups as follows:

Tag	SNPs in tag group
$A$	$A_1$ , $A_2$
$B$	$B_1$

R2GUESS will remove SNPs A1, A2 and B1, and will save a file containing an object of class tags containting the information in the above table. Then, suppose after GUESS has run on the set of tag SNPs, model $(A,B)$ is one of the most interesting. GUESS FM will expand' this model to the set of models ${(A,B), (A_1,B), (A_2,B), (A,B_1), (A_1,B_1), (A_2,B_1)}$ .

We assess each model using Bayes Factors. The Bayes Factor for model $M_i$ is

$BF_{i0} = \frac{P(M_i | \text{data} )}{P(M_0 | \text{data})}.$

In a fully Bayesian analysis, the individual model probabilities are calculated by integrating over prior distributions for the model parameters (regression coefficients, for example). This is what GUESS does. However, it assumes a linear model when often we will want to fine map a disease trait and therefore use a logistic model. Further, we cannot go back and run the expanded models through GUESS. Instead, we use Approximate Bayes Factors, based on the BIC:

$-2 \log{ABF_{i0}} = BIC(M_i) - BIC(M_0).$

These are calculated within R via the glm and BIC functions.

At what threshold should you tag? Probably, as high as you can go without inducing instability in GUESS. We have found $r^2=0.99$ to work in some large datasets, but your mileage may vary. In the examples below we use a lower threshold purely for demonstration.

Simulate some data

We start with using some sample data from the snpStats package including 20 SNPs, and simulating a quantitative trait that depends on 3 causal SNPs.

library(snpStats)
data(for.exercise, package="snpStats")
set.seed(12346)
X <- snps.10[,101:120]
n <- nrow(X)
causal <- c("rs1555897","rs7069505")
Y <- rnorm(n,mean=as.numeric(X[,causal[1]]))*sqrt(0.2) +
  rnorm(n,mean=as.numeric(X[,causal[2]]))*sqrt(0.2) +
  rnorm(n)*sqrt(0.6)

X contains some missing genotypes, but no SNPs with such a low call rate we would worry in a large study. Still, the rest of the analysis is easier to interpret for the purposes of a vignette if we fill in the missing values.

library(GUESSFM)
summary(col.summary(X))

##      Calls         Call.rate      Certain.calls      RAF         
##  Min.   :984.0   Min.   :0.9840   Min.   :1     Min.   :0.04651  
##  1st Qu.:988.5   1st Qu.:0.9885   1st Qu.:1     1st Qu.:0.16658  
##  Median :989.0   Median :0.9890   Median :1     Median :0.41617  
##  Mean   :989.7   Mean   :0.9897   Mean   :1     Mean   :0.41482  
##  3rd Qu.:990.0   3rd Qu.:0.9900   3rd Qu.:1     3rd Qu.:0.62680  
##  Max.   :998.0   Max.   :0.9980   Max.   :1     Max.   :0.84747  
##       MAF               P.AA              P.AB              P.BB         
##  Min.   :0.04651   Min.   :0.01921   Min.   :0.08898   Min.   :0.002022  
##  1st Qu.:0.14523   1st Qu.:0.14170   1st Qu.:0.22626   1st Qu.:0.029575  
##  Median :0.31430   Median :0.33984   Median :0.42095   Median :0.172658  
##  Mean   :0.27944   Mean   :0.40774   Mean   :0.35487   Mean   :0.237383  
##  3rd Qu.:0.37406   3rd Qu.:0.69520   3rd Qu.:0.47258   3rd Qu.:0.392982  
##  Max.   :0.49298   Max.   :0.90900   Max.   :0.50101   Max.   :0.732323  
##      z.HWE        
##  Min.   :-3.5140  
##  1st Qu.:-1.1135  
##  Median : 0.1907  
##  Mean   :-0.4489  
##  3rd Qu.: 0.4224  
##  Max.   : 1.1354

X <- impute.missing(X)

## 20 to impute
## 1 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 2 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 3 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 4 .SNPs tagged by a single SNP: 1
## 5 .SNPs tagged by a single SNP: 1
## 6 .SNPs tagged by a single SNP: 1
## 7 .SNPs tagged by a single SNP: 1
## 8 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 9 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 10 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 11 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 12 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 13 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 14 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 15 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 16 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 17 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 18 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 19 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 20 .SNPs tagged by multiple tag haplotypes (saturated model): 1
## 
## coercing object of mode  numeric  to SnpMatrix

summary(col.summary(X))

##      Calls        Call.rate Certain.calls      RAF        
##  Min.   :1000   Min.   :1   Min.   :1     Min.   :0.0460  
##  1st Qu.:1000   1st Qu.:1   1st Qu.:1     1st Qu.:0.1660  
##  Median :1000   Median :1   Median :1     Median :0.4158  
##  Mean   :1000   Mean   :1   Mean   :1     Mean   :0.4145  
##  3rd Qu.:1000   3rd Qu.:1   3rd Qu.:1     3rd Qu.:0.6269  
##  Max.   :1000   Max.   :1   Max.   :1     Max.   :0.8465  
##       MAF              P.AA             P.AB             P.BB        
##  Min.   :0.0460   Min.   :0.0200   Min.   :0.0880   Min.   :0.00200  
##  1st Qu.:0.1457   1st Qu.:0.1410   1st Qu.:0.2268   1st Qu.:0.02925  
##  Median :0.3145   Median :0.3400   Median :0.4220   Median :0.17150  
##  Mean   :0.2794   Mean   :0.4079   Mean   :0.3551   Mean   :0.23700  
##  3rd Qu.:0.3739   3rd Qu.:0.6957   3rd Qu.:0.4733   3rd Qu.:0.39275  
##  Max.   :0.4940   Max.   :0.9100   Max.   :0.5020   Max.   :0.73100  
##      z.HWE        
##  Min.   :-3.5284  
##  1st Qu.:-1.0185  
##  Median : 0.2031  
##  Mean   :-0.4348  
##  3rd Qu.: 0.3971  
##  Max.   : 1.1292

Looking at the LD, we see this is a region in which D' (above the diagonal) is typically high, whilst $r^2$ can be high between some SNPs, and with moderately strong $r^2 \simeq 0.7$ between two of our causal SNPs:

ld <- show.ld(X=X)

plot of chunk unnamed-chunk-3

Running GUESS and reading output

First, via GUESSFM's wrapper, tagging at $r^2=0.8$ . This is considerably lower than suggested above, and is used here purely for demonstration as there is limited strong LD in this dataset.

## THIS DIRECTORY WILL HOLD ALL THE GUESS INPUT AND OUTPUT 
## AND WILL BE CREATED IF IT DOESN'T ALREADY EXIST
mydir <- tempfile() 
## NB, IF GUESS IS NOT ON YOUR PATH, SUPPLY THE FULL PATH IN AN ARGUMENT
## guess.command="/path/to/GUESS"
run.bvs(X,Y,gdir=mydir,
        tag.r2=0.95,           # maximum r2 between SNPs to be modelled
        guess.command="GUESS", # /path/to/GUESS
        nexp=3,                # expected number of causal variants, an overestimate
        nsave=1000)            # number of best models to save

This can take a (long) while. For the purposes of this vignette, we will load the results from an existing run.

mydir <-  system.file("extdata",package="GUESSFM")

## what files were created?
list.files(mydir)

##  [1] "Par_file_example.xml"                                  
##  [2] "X_55000"                                               
##  [3] "Y_55000"                                               
##  [4] "decode_55000"                                          
##  [5] "decode_samples_55000"                                  
##  [6] "init_55000"                                            
##  [7] "log"                                                   
##  [8] "out_55000_sweeps_features.txt"                         
##  [9] "out_55000_sweeps_output_all_exchange_history.txt"      
## [10] "out_55000_sweeps_output_best_visited_models.txt"       
## [11] "out_55000_sweeps_output_cross_over_history.txt"        
## [12] "out_55000_sweeps_output_delayed_rejection_history.txt" 
## [13] "out_55000_sweeps_output_fast_scan_history.txt"         
## [14] "out_55000_sweeps_output_g_adaptation_history.txt"      
## [15] "out_55000_sweeps_output_g_history.txt"                 
## [16] "out_55000_sweeps_output_gibbs_history.txt"             
## [17] "out_55000_sweeps_output_log_cond_post_prob_history.txt"
## [18] "out_55000_sweeps_output_marg_prob_incl.txt"            
## [19] "out_55000_sweeps_output_model_size_history.txt"        
## [20] "out_55000_sweeps_output_models_history.txt"            
## [21] "out_55000_sweeps_output_n_models_visited_history.txt"  
## [22] "out_55000_sweeps_output_temperature_history.txt"       
## [23] "par.xml"                                               
## [24] "simdata.RData"                                         
## [25] "tags.RData"

## read output with GUESSFM
d <- read.snpmod(mydir)

## /ipswich/data/chrisw/R/packages/GUESSFM/inst/extdata/out_55000 is a directory - reading most recent filestub: /ipswich/data/chrisw/R/packages/GUESSFM/inst/extdata/out_55000
## 396 models, with pp > 1.18e-47, and, cumulatively, to 1

## examine the best models and SNPs with greatest marginal support within the tagged data.
best.models(d)

##   rank visits size     logPP         PP jeffreys
## 1    1  31725    2 -3520.197 0.65605419 46.75253
## 2    2   2557    3 -3522.787 0.04923028 46.36002
## 3    3   2422    3 -3522.829 0.04720462 46.34177
## 4    4   1901    3 -3523.065 0.03729365 46.23942
## 5    5   1512    3 -3523.295 0.02962316 46.13942
## 6    6   1312    3 -3523.454 0.02527297 46.07044
## 7    7    759    3 -3524.022 0.01432030 45.82374
## 8    8    653    3 -3524.215 0.01180112 45.73971
## 9    9    536    3 -3524.287 0.01098127 45.70844
##                               str                            snps
## 1            rs11253451%rs7069505            rs11253451%rs7069505
## 2 rs11253451%rs17221435%rs7069505 rs11253451%rs17221435%rs7069505
## 3 rs11253446%rs11253451%rs7069505 rs11253446%rs11253451%rs7069505
## 4  rs11253451%rs2296625%rs7069505  rs11253451%rs2296625%rs7069505
## 5 rs10752021%rs11253451%rs7069505 rs10752021%rs11253451%rs7069505
## 6 rs10904578%rs11253451%rs7069505 rs10904578%rs11253451%rs7069505
## 7 rs11253451%rs12268008%rs7069505 rs11253451%rs12268008%rs7069505
## 8 rs11253451%rs11253516%rs7069505 rs11253451%rs11253516%rs7069505
## 9     rs11253451%rs7069505%rs9124     rs11253451%rs7069505%rs9124

best.snps(d)

##            Predictor Marg_Prob_Incl        var
## rs7069505         15      0.9999798  rs7069505
## rs11253451         4      0.9999653 rs11253451

Huh. GUESS has selected one of our causal SNPs, but not both. Why? We have a clue from the ld matrix:

sel=rownames(best.snps(d))
ld[c(sel,causal),c(sel,causal)]

##            rs7069505 rs11253451 rs1555897 rs7069505
## rs7069505  1.0000000  0.2797505 0.2763362 1.0000000
## rs11253451 0.6310478  1.0000000 1.0000000 0.6310478
## rs1555897  0.6258456  0.9957333 1.0000000 0.6258456
## rs7069505  1.0000000  0.2797505 0.2763362 1.0000000

So the selected SNP rs11253451 has $r^2=0.996$ and $D'=1$ with the causal but unselected SNP rs1555897. That would explain it.

Note that both best.models and best.snps allow you to specify thresholds for how to determine “best”. See their help pages for details.

The tags created within the run.bvs function are saved to a tags.RData file under mydir and can be examined.

(load(file.path(mydir,"tags.RData")))

## [1] "tags"

tags

## tags object, containing 20 SNPs in 17 groups.

tagsof(tags,causal)

##          tag      snps
## 1 rs11253451 rs1555897
## 2  rs7069505 rs7069505

taggedby(tags,sel)

##          tag       snps
## 1 rs11253451 rs12413228
## 2 rs11253451 rs10904571
## 3 rs11253451 rs11253451
## 4 rs11253451  rs1555897
## 5  rs7069505  rs7069505

Indeed, rs11253451 is a tag for rs1555897.

NB: to see more about how to manipulate tags and groups objects, see the vignette

vignette("groups",package="GUESSFM")

Expanding

Tagging has allowed us to shrink the model space, by assuming that models with SNPs in very high LD will have very similar likelihoods, but for fine mapping we really do want to evaluate each and every model. So, having chosen our best set of models within the shrunken space, we need to expand each of them to all the possible models they tag:

dx<-expand.tags(d,tags)

## expanding tags for 396 models over 17 tag SNPs, tagging a total of 20 SNPs.

Refitting

The expanded models above have all been assigned the log Bayes Factor for their nearest tag model. This isn't terrible, in practice, but if you care about fine mapping then you should get a more precise answer by refitting the most likely models individually. Note this is especially important if you have a binomial outcome, as GUESS has been run using a linear model.

Here, we take the set of most likely models which collectively capture 50% of the posterior support from GUESS, after expansion. Loading the speedglm library makes the fitting faster, and we calculate the approximate Bayes Factors using the BIC approximation. To do this, we also need to supply some information about our prior for the number of causal variants in the model.

best <- best.models(dx,cpp.thr=0.9)
library(speedglm)
abf <- abf.calc(y=Y,x=X,models=best$str,family="gaussian")

## Calculating BICs using speedglm
## 43 of 43 models can be evaluated directly
## calculating ABF
## evaluating direct models, n=43

sm <- abf2snpmod(abf,expected=3)

## creating snpmod data.frame
## calculating marginal SNP inclusion probabilities

Now we can explore the best SNPs and models in the tagged data, the expanded data, and the refitted data

best.snps(d)

##            Predictor Marg_Prob_Incl        var
## rs7069505         15      0.9999798  rs7069505
## rs11253451         4      0.9999653 rs11253451

best.snps(dx)

##            Marg_Prob_Incl        var   rownames
## rs7069505       0.9959669  rs7069505  rs7069505
## rs12413228      0.2496380 rs12413228 rs12413228
## rs10904571      0.2496380 rs10904571 rs10904571
## rs11253451      0.2496380 rs11253451 rs11253451
## rs1555897       0.2496380  rs1555897  rs1555897

best.snps(sm)

##            Marg_Prob_Incl        var   rownames
## rs7069505        1.000000  rs7069505  rs7069505
## rs1555897        0.451403  rs1555897  rs1555897
## rs11253451       0.449641 rs11253451 rs11253451

We see that on expanding the tags we pick up the true causal variant, together with two more extraneous SNPs, but on refitting the model we see only the two causal SNPs and rs11253451, which is in very high LD with rs1555897.

SNP groups

A formal way to group SNPs in LD with posterior support is to use the snp.picker function, which can also produce a plot to show how it's working.
We do not expect that we will be able to discriminate, statistically, between highly correlated variants. Instead, the posterior support is likely to be diluted across such sets of variants. To group such SNPs, we used the marginal posterior probabilities of inclusion (MPPI) for each SNP, and applied the following algorithm:

Pick the index SNP with maximum MPPI
Order remaining SNPs by $r^2$ with index SNP
Exclude SNPs which co-occur in models with the index SNP (joint MPPI $>0.02$ )
Step away from the index SNP in order of decreasing $r^2$ , adding SNPs to its group until $\text{MPPI}<0.001$ for two SNPs in a row (NB, these SNPs will not be added to the SNP group), or until $r^2<0.5$
Remove this set of SNPs and return to step 1 until no SNP remains with $\operatorname{MPPI}>0.01$ .

We summarize the support for any group of SNPs by the grouped marginal posterior probability of inclusion, or gMPPI.

sp <- snp.picker(sm,X)

##                      r2 mpi model.count
## rs7069505 -4.440892e-16   1           0
## 1 rs7069505 1 1 
##                       r2       mpi model.count
## rs1555897  -2.220446e-16 0.4514030           0
## rs11253451  4.266735e-03 0.4496410           0
## rs10904571  4.039755e-02 0.0827517           0
## rs12413228  6.558729e-02 0.0162043           0
## 2 rs1555897 4 1

summary(sp)

##   SNP.index SNP.count    min.R2 gMMPI
## 1 rs7069505         1 1.0000000     1
## 2 rs1555897         4 0.9344127     1

plot(sp)

plot of chunk unnamed-chunk-10

Plotting

It is useful to assess the steps of any analysis by looking at the data. With such a large number of models, the best way is to plot aspects of the data. For that reason, GUESSFM contains lots of plotting functions, described in a separate vignette. To see it, do:

vignette("plotting",package="GUESSFM")

Parallelism

Looping over many, many models can be made quicker by parallel processing. GUESSFM does this by means of calls to the mclapply function in the parallel package. By default, the parallel package sets itself up to use two cores. You can change this by setting the option mc.cores. Eg, if you have 20 cores on your machine, you might set

options(mc.cores=16)

to use 16 of this for R, and leave the remainder free to run other processes.

Functions which make use of this (and over which you might then not to use mclapply are:

expand.tags

Using an existing R2GUESS run

You can convert a run from R2GUESS into a snpmod object with:

## read output using a convenience wrapper for as.ESS.object()
## this returns an object of class ESS, used by R2GUESS
ess <- read.ess(mydir)
str(ess)

## GUESSFM maps snp numbers to names via a decode vector
decode <- structure(colnames(X),names=as.numeric(1:ncol(X)))

## create a snpmod
gfm2 <- ess2snpmod(ess)
best.models(gfm2)

Now you can apply all the plotting functions etc in GUESSFM, but without the tagging strategy, you won't be able to do the expansion.