Â
The purpose of this document is to give an overview over frequently occurring quantities in quantitative genetics and to demonstrate how to they can be validated in R examples using simulated genotype data.
Since the focus here is on how different concepts relate to one another, many topics are presented in a simplified way that ignores complicating factors that sometimes arise in the application to real data.
For questions, and any kind of feedback (especially mistakes), please contact Robert Maier at r.maier@uq.edu.au.
Â
Symbol | Meaning |
---|---|
\(\textbf{X}_{012}\) | Original genotype matrix |
\(\textbf{X}^*\) | Mean-centered genotypes |
\(\textbf{X}\) | Scaled genotypes (each SNP has mean 0, variance 1) |
\(\textbf{y}\) | Phenotype |
\(p\) | Minor Allele Frequency (MAF) |
\(M\) | Number of SNPs |
\(N\) | Number of individuals |
\(M_e\) | Effective number of SNPs |
\(N_e\) | Effective number of individuals |
\(\textbf{A}\) | Genetic relatedness matrix (GRM) |
\(\textbf{V}\) | Variance-covariance matrix of the phenotype |
\(\textbf{I}_M\) | Identity matrix of dimension \(M \times M\) |
\(\beta^*\) | Effect of a SNP (assumes unscaled genotypes) |
\(\beta\) | Effect of a SNP (assumes scaled genotypes) |
\(g\) | Genetic effect of an individual |
\(i\) | Index for individuals |
\(j\) | Index for SNPs |
\(\sigma^2_y\) | Phenotypic variance. Often assumed to be 1. \(\sigma^2_g + \sigma^2_e\) |
\(\sigma^2_g\) | Genetic variance |
\(\sigma^2_e\) | Error variance |
\(h^2\) | SNP-heritability. \(\frac{\sigma^2_g}{\sigma^2_y}\) |
\(\hat{p}\) | Estimate of \(p\) |
\(\bar{p}\) | Mean of \(p\) |
\(var(\textbf{y})\) | (Scalar) variance of \(y\) (usually 1 here) |
\(Var[\textbf{y}]\) | Variance-covariance matrix of the random variable \(\textbf{y}\) |
Â
Quantity | Definition |
---|---|
Genotype properties | |
MAF estimate for a SNP \(j\) | \(\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}\) |
Expected MAF sampling variance | \(SE^2_{p_j} = var(\hat{p_j} \mid p_j) = \frac{\hat{p_j}(1-\hat{p_j})}{2N}\) |
Expected variance of a SNP | \(var(\textbf{X}_{012,j}) = 2\hat{p_j}(1-\hat{p_j})\) |
LD matrix | \(\textbf{L} = \frac{\textbf{X}^\top\textbf{X}}{N}\) |
GRM | \(\textbf{A} = \frac{\textbf{X}\textbf{X}^\top}{M}\) |
LD score of a SNP | \(l_j = \frac{1}{N^2}\textbf{X}^\top_j\textbf{X}\textbf{X}^\top\textbf{X}_j\) |
SNP effect estimates | |
OLS effect estimate for model with one SNP (\(\hat{\beta}_{GWAS}\)) | \(\hat{\beta}_{j,GWAS} = \frac{\textbf{X}_j^\top \textbf{y}}{\textbf{X}_j^\top\textbf{X}_j} = \frac{cov(\textbf{X}_j, \textbf{y})}{var(\textbf{X}_j)}\) |
Mixed linear mode association (MLMA) estimate for one SNP | \(\hat{\beta}_{j,MLMA} = \frac{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{y}}{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{X}_j}\) |
OLS effect estimate for model with all SNPs (\(\hat{\boldsymbol{\beta}}_{OLS}\)) | \(\hat{\boldsymbol{\beta}}_{OLS} = (\textbf{X}^\top\textbf{X})^{-1}\textbf{X}^\top \textbf{y}\) |
BLUP effect estimate | \(\hat{\boldsymbol{\beta}}_{BLUP} = (\textbf{X}^\top\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^\top \textbf{y}\) |
Precision of SNP effect estimates | |
Expected sampling variance of \(\hat{\beta^*}_{j,GWAS}\) | \(SE^2_{\hat{\beta^*}_j} = var(\hat{\beta^*}_j \mid \beta^*_j) \approx \frac{1}{N \times var(\textbf{X}_j)}\) |
Expected sampling variance of \(\hat{\beta}_{j,GWAS}\) | \(SE^2_{\hat{\beta}_j} = var(\hat{\beta}_j \mid \beta_j) \approx \frac{1}{N}\) |
Expected variance of \(\hat{\beta}_{j,GWAS}\) assuming independent markers | \(var(\hat{\beta}_j) = var(\beta_j) + var(\hat{\beta_j} \mid \beta_j) = \frac{h^2}{M} + \frac{1}{N}\) |
Expected variance of \(\hat{\beta}_{j,GWAS}\) | \(var(\hat{\beta}_j) = \frac{h^2}{M_e} + \frac{1}{N}\) |
Expected accuracy of GWAS predictor | \(cor^2(\textbf{y}, \hat{\textbf{g}}_{GWAS}) = \frac{h^2}{1+\frac{M_e}{N h^2}}\) |
Expected accuracy of BLUP predictor | \(cor^2(\textbf{y}, \hat{\textbf{g}}_{BLUP}) = R^2 = \frac{h^2}{1+\frac{M_e (1-R^2)}{N h^2}}\) |
Â
Â
Â
Humans have diploid genomes, so at each biallelic SNP, there are \(2 \times 2\) possible combinations of alleles at each locus. Since we donâ€™t usually distinguish between, say \(AG\) and \(GA\), we are left with 3 distinct genotypes, which means that we can code genotypes for each SNP and individual as 0, 1 or 2, which can be interpreted as 0, 1 or 2 alternative alleles.
Under a random mating assumption, the number of alternative alleles for a SNP and individual follows a binomial distribution with 2 draws (one from mum and one from dad) and probability equal to the minor allele frequency of that SNP.
Letâ€™s assume that the minor allele frequency of our \(M\) SNPs come from a uniform distribution between \(0\) and \(0.5\).
#set.seed(6147)
#set.seed(6154)
set.seed(6155)
m = 500 # number of SNPs
maf = runif(m, 0, .5) # random MAF for each SNP
# apologies if using "=" as the assignment operator in R makes your eyes hurt
The set.seed
command here ensures that the random draws from a distribution will be the same each time this code is run. We can then draw genotypes for one person for each SNP.
x012 = rbinom(m, 2, maf)
We want genotypes for \(N\) individuals, so letâ€™s replicate this \(N\) times.
n = 400 # number of individuals
x012 = t(replicate(n, rbinom(m, 2, maf))) # n x m genotype matrix
If we are unlucky, some SNPs will be monomorphic, so not actually vary between people. Since this can cause problems, letâ€™s simulate more SNPs and only keep \(M\) polymorphic ones.
x012 = t(replicate(n, rbinom(2*m, 2, c(maf, maf))))
polymorphic = apply(x012, 2, var) > 0
x012 = x012[,polymorphic][,1:m]
maf = c(maf, maf)[polymorphic][1:m]
round(maf[1:10], 2)
## [1] 0.02 0.04 0.28 0.18 0.08 0.10 0.45 0.03 0.22 0.26
x012[1:5, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 1 0 0 0
## [2,] 0 0 1 0 0 0 2 0 0 0
## [3,] 0 0 0 1 0 0 0 0 1 1
## [4,] 0 1 1 1 0 0 1 0 1 1
## [5,] 0 0 1 0 0 0 1 0 1 0
Later on, we will not only need the original genotype matrix \(\textbf{X}_{012}\), but also a version in which each SNP is mean-centered, \(\textbf{X}^*\), and a version in which each SNP has mean 0 and variance 1, \(\textbf{X}\).
x_o = scale(x012, scale=FALSE) # mean 0
x = scale(x012, scale=TRUE) # mean 0, variance 1
Working with mean-centered genotypes and phenotypes makes life a lot easier, but it means we donâ€™t have to estimate intercept terms. So a model like this:
\[\textbf{y} \sim \beta_0 + \beta_1 \textbf{x} + \textbf{e}\]
just becomes
\[\textbf{y} \sim \beta \textbf{x} + \textbf{e}\] Similarly, assuming that there are no covariates, or that \(y\) has already been corrected for any covariates, makes things a lot simpler as well.
Â
Â
Here, we know what the true MAF is for each SNP (\(p\)), because we have simulated it, but in real data we will have to get an estimate based on a finite sample. For a diploid genome coded in 0 and 1, this estimate would just be the mean across individuals, but since we have a diploid genome, it is the mean divided by two.
\[\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}\]
maf_est = colMeans(x012)/2
qplot(maf, maf_est, col=maf) +
geom_abline() + xlab('p') + ylab(expression(hat(p)))