Introduction

The purpose of this document is to give an overview over frequently occurring quantities in quantitative genetics and to demonstrate how to they can be validated in R examples using simulated genotype data.

Since the focus here is on how different concepts relate to one another, many topics are presented in a simplified way that ignores complicating factors that sometimes arise in the application to real data.

For questions, and any kind of feedback (especially mistakes), please contact Robert Maier at r.maier@uq.edu.au.


 

Notation

Symbol Meaning
\(\textbf{X}_{012}\) Original genotype matrix
\(\textbf{X}^*\) Mean-centered genotypes
\(\textbf{X}\) Scaled genotypes (each SNP has mean 0, variance 1)
\(\textbf{y}\) Phenotype
\(p\) Minor Allele Frequency (MAF)
\(M\) Number of SNPs
\(N\) Number of individuals
\(M_e\) Effective number of SNPs
\(N_e\) Effective number of individuals
\(\textbf{A}\) Genetic relatedness matrix (GRM)
\(\textbf{V}\) Variance-covariance matrix of the phenotype
\(\textbf{I}_M\) Identity matrix of dimension \(M \times M\)
\(\beta^*\) Effect of a SNP (assumes unscaled genotypes)
\(\beta\) Effect of a SNP (assumes scaled genotypes)
\(g\) Genetic effect of an individual
\(i\) Index for individuals
\(j\) Index for SNPs
\(\sigma^2_y\) Phenotypic variance. Often assumed to be 1. \(\sigma^2_g + \sigma^2_e\)
\(\sigma^2_g\) Genetic variance
\(\sigma^2_e\) Error variance
\(h^2\) SNP-heritability. \(\frac{\sigma^2_g}{\sigma^2_y}\)
\(\hat{p}\) Estimate of \(p\)
\(\bar{p}\) Mean of \(p\)
\(var(\textbf{y})\) (Scalar) variance of \(y\) (usually 1 here)
\(Var[\textbf{y}]\) Variance-covariance matrix of the random variable \(\textbf{y}\)

 


Summary of Equations

Quantity Definition
Genotype properties
MAF estimate for a SNP \(j\) \(\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}\)
Expected MAF sampling variance \(SE^2_{p_j} = var(\hat{p_j} \mid p_j) = \frac{\hat{p_j}(1-\hat{p_j})}{2N}\)
Expected variance of a SNP \(var(\textbf{X}_{012,j}) = 2\hat{p_j}(1-\hat{p_j})\)
LD matrix \(\textbf{L} = \frac{\textbf{X}^\top\textbf{X}}{N}\)
GRM \(\textbf{A} = \frac{\textbf{X}\textbf{X}^\top}{M}\)
LD score of a SNP \(l_j = \frac{1}{N^2}\textbf{X}^\top_j\textbf{X}\textbf{X}^\top\textbf{X}_j\)
SNP effect estimates
OLS effect estimate for model with one SNP (\(\hat{\beta}_{GWAS}\)) \(\hat{\beta}_{j,GWAS} = \frac{\textbf{X}_j^\top \textbf{y}}{\textbf{X}_j^\top\textbf{X}_j} = \frac{cov(\textbf{X}_j, \textbf{y})}{var(\textbf{X}_j)}\)
Mixed linear mode association (MLMA) estimate for one SNP \(\hat{\beta}_{j,MLMA} = \frac{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{y}}{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{X}_j}\)
OLS effect estimate for model with all SNPs (\(\hat{\boldsymbol{\beta}}_{OLS}\)) \(\hat{\boldsymbol{\beta}}_{OLS} = (\textbf{X}^\top\textbf{X})^{-1}\textbf{X}^\top \textbf{y}\)
BLUP effect estimate \(\hat{\boldsymbol{\beta}}_{BLUP} = (\textbf{X}^\top\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^\top \textbf{y}\)
Precision of SNP effect estimates
Expected sampling variance of \(\hat{\beta^*}_{j,GWAS}\) \(SE^2_{\hat{\beta^*}_j} = var(\hat{\beta^*}_j \mid \beta^*_j) \approx \frac{1}{N \times var(\textbf{X}_j)}\)
Expected sampling variance of \(\hat{\beta}_{j,GWAS}\) \(SE^2_{\hat{\beta}_j} = var(\hat{\beta}_j \mid \beta_j) \approx \frac{1}{N}\)
Expected variance of \(\hat{\beta}_{j,GWAS}\) assuming independent markers \(var(\hat{\beta}_j) = var(\beta_j) + var(\hat{\beta_j} \mid \beta_j) = \frac{h^2}{M} + \frac{1}{N}\)
Expected variance of \(\hat{\beta}_{j,GWAS}\) \(var(\hat{\beta}_j) = \frac{h^2}{M_e} + \frac{1}{N}\)
Expected accuracy of GWAS predictor \(cor^2(\textbf{y}, \hat{\textbf{g}}_{GWAS}) = \frac{h^2}{1+\frac{M_e}{N h^2}}\)
Expected accuracy of BLUP predictor \(cor^2(\textbf{y}, \hat{\textbf{g}}_{BLUP}) = R^2 = \frac{h^2}{1+\frac{M_e (1-R^2)}{N h^2}}\)

 

 


The structure of genotype data

 

Simulating genotypes

Humans have diploid genomes, so at each biallelic SNP, there are \(2 \times 2\) possible combinations of alleles at each locus. Since we don’t usually distinguish between, say \(AG\) and \(GA\), we are left with 3 distinct genotypes, which means that we can code genotypes for each SNP and individual as 0, 1 or 2, which can be interpreted as 0, 1 or 2 alternative alleles.

Under a random mating assumption, the number of alternative alleles for a SNP and individual follows a binomial distribution with 2 draws (one from mum and one from dad) and probability equal to the minor allele frequency of that SNP.

Let’s assume that the minor allele frequency of our \(M\) SNPs come from a uniform distribution between \(0\) and \(0.5\).

#set.seed(6147)
#set.seed(6154)
set.seed(6155)
m = 500                                  # number of SNPs
maf = runif(m, 0, .5)                    # random MAF for each SNP
# apologies if using "=" as the assignment operator in R makes your eyes hurt

The set.seed command here ensures that the random draws from a distribution will be the same each time this code is run. We can then draw genotypes for one person for each SNP.

x012 = rbinom(m, 2, maf)

We want genotypes for \(N\) individuals, so let’s replicate this \(N\) times.

n = 400                                    # number of individuals
x012 = t(replicate(n, rbinom(m, 2, maf)))  # n x m genotype matrix

If we are unlucky, some SNPs will be monomorphic, so not actually vary between people. Since this can cause problems, let’s simulate more SNPs and only keep \(M\) polymorphic ones.

x012 = t(replicate(n, rbinom(2*m, 2, c(maf, maf))))
polymorphic = apply(x012, 2, var) > 0
x012 = x012[,polymorphic][,1:m]
maf = c(maf, maf)[polymorphic][1:m]
round(maf[1:10], 2)
##  [1] 0.02 0.04 0.28 0.18 0.08 0.10 0.45 0.03 0.22 0.26
x012[1:5, 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    1    0    0     0
## [2,]    0    0    1    0    0    0    2    0    0     0
## [3,]    0    0    0    1    0    0    0    0    1     1
## [4,]    0    1    1    1    0    0    1    0    1     1
## [5,]    0    0    1    0    0    0    1    0    1     0

Later on, we will not only need the original genotype matrix \(\textbf{X}_{012}\), but also a version in which each SNP is mean-centered, \(\textbf{X}^*\), and a version in which each SNP has mean 0 and variance 1, \(\textbf{X}\).

x_o = scale(x012, scale=FALSE)               # mean 0
x = scale(x012, scale=TRUE)                  # mean 0, variance 1

Working with mean-centered genotypes and phenotypes makes life a lot easier, but it means we don’t have to estimate intercept terms. So a model like this:

\[\textbf{y} \sim \beta_0 + \beta_1 \textbf{x} + \textbf{e}\]

just becomes

\[\textbf{y} \sim \beta \textbf{x} + \textbf{e}\] Similarly, assuming that there are no covariates, or that \(y\) has already been corrected for any covariates, makes things a lot simpler as well.

 

 

MAF estimate

Here, we know what the true MAF is for each SNP (\(p\)), because we have simulated it, but in real data we will have to get an estimate based on a finite sample. For a diploid genome coded in 0 and 1, this estimate would just be the mean across individuals, but since we have a diploid genome, it is the mean divided by two.

\[\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}\]

maf_est = colMeans(x012)/2
qplot(maf, maf_est, col=maf) +
  geom_abline() + xlab('p') + ylab(expression(hat(p)))