## Introduction

The purpose of this document is to give an overview over frequently occurring quantities in quantitative genetics and to demonstrate how to they can be validated in R examples using simulated genotype data.

Since the focus here is on how different concepts relate to one another, many topics are presented in a simplified way that ignores complicating factors that sometimes arise in the application to real data.

## Notation

Symbol Meaning
$$\textbf{X}_{012}$$ Original genotype matrix
$$\textbf{X}^*$$ Mean-centered genotypes
$$\textbf{X}$$ Scaled genotypes (each SNP has mean 0, variance 1)
$$\textbf{y}$$ Phenotype
$$p$$ Minor Allele Frequency (MAF)
$$M$$ Number of SNPs
$$N$$ Number of individuals
$$M_e$$ Effective number of SNPs
$$N_e$$ Effective number of individuals
$$\textbf{A}$$ Genetic relatedness matrix (GRM)
$$\textbf{V}$$ Variance-covariance matrix of the phenotype
$$\textbf{I}_M$$ Identity matrix of dimension $$M \times M$$
$$\beta^*$$ Effect of a SNP (assumes unscaled genotypes)
$$\beta$$ Effect of a SNP (assumes scaled genotypes)
$$g$$ Genetic effect of an individual
$$i$$ Index for individuals
$$j$$ Index for SNPs
$$\sigma^2_y$$ Phenotypic variance. Often assumed to be 1. $$\sigma^2_g + \sigma^2_e$$
$$\sigma^2_g$$ Genetic variance
$$\sigma^2_e$$ Error variance
$$h^2$$ SNP-heritability. $$\frac{\sigma^2_g}{\sigma^2_y}$$
$$\hat{p}$$ Estimate of $$p$$
$$\bar{p}$$ Mean of $$p$$
$$var(\textbf{y})$$ (Scalar) variance of $$y$$ (usually 1 here)
$$Var[\textbf{y}]$$ Variance-covariance matrix of the random variable $$\textbf{y}$$

## Summary of Equations

Quantity Definition
Genotype properties
MAF estimate for a SNP $$j$$ $$\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}$$
Expected MAF sampling variance $$SE^2_{p_j} = var(\hat{p_j} \mid p_j) = \frac{\hat{p_j}(1-\hat{p_j})}{2N}$$
Expected variance of a SNP $$var(\textbf{X}_{012,j}) = 2\hat{p_j}(1-\hat{p_j})$$
LD matrix $$\textbf{L} = \frac{\textbf{X}^\top\textbf{X}}{N}$$
GRM $$\textbf{A} = \frac{\textbf{X}\textbf{X}^\top}{M}$$
LD score of a SNP $$l_j = \frac{1}{N^2}\textbf{X}^\top_j\textbf{X}\textbf{X}^\top\textbf{X}_j$$
SNP effect estimates
OLS effect estimate for model with one SNP ($$\hat{\beta}_{GWAS}$$) $$\hat{\beta}_{j,GWAS} = \frac{\textbf{X}_j^\top \textbf{y}}{\textbf{X}_j^\top\textbf{X}_j} = \frac{cov(\textbf{X}_j, \textbf{y})}{var(\textbf{X}_j)}$$
Mixed linear mode association (MLMA) estimate for one SNP $$\hat{\beta}_{j,MLMA} = \frac{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{y}}{\textbf{X}_j^\top \textbf{V}^{-1}\textbf{X}_j}$$
OLS effect estimate for model with all SNPs ($$\hat{\boldsymbol{\beta}}_{OLS}$$) $$\hat{\boldsymbol{\beta}}_{OLS} = (\textbf{X}^\top\textbf{X})^{-1}\textbf{X}^\top \textbf{y}$$
BLUP effect estimate $$\hat{\boldsymbol{\beta}}_{BLUP} = (\textbf{X}^\top\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^\top \textbf{y}$$
Precision of SNP effect estimates
Expected sampling variance of $$\hat{\beta^*}_{j,GWAS}$$ $$SE^2_{\hat{\beta^*}_j} = var(\hat{\beta^*}_j \mid \beta^*_j) \approx \frac{1}{N \times var(\textbf{X}_j)}$$
Expected sampling variance of $$\hat{\beta}_{j,GWAS}$$ $$SE^2_{\hat{\beta}_j} = var(\hat{\beta}_j \mid \beta_j) \approx \frac{1}{N}$$
Expected variance of $$\hat{\beta}_{j,GWAS}$$ assuming independent markers $$var(\hat{\beta}_j) = var(\beta_j) + var(\hat{\beta_j} \mid \beta_j) = \frac{h^2}{M} + \frac{1}{N}$$
Expected variance of $$\hat{\beta}_{j,GWAS}$$ $$var(\hat{\beta}_j) = \frac{h^2}{M_e} + \frac{1}{N}$$
Expected accuracy of GWAS predictor $$cor^2(\textbf{y}, \hat{\textbf{g}}_{GWAS}) = \frac{h^2}{1+\frac{M_e}{N h^2}}$$
Expected accuracy of BLUP predictor $$cor^2(\textbf{y}, \hat{\textbf{g}}_{BLUP}) = R^2 = \frac{h^2}{1+\frac{M_e (1-R^2)}{N h^2}}$$

## The structure of genotype data

### Simulating genotypes

Humans have diploid genomes, so at each biallelic SNP, there are $$2 \times 2$$ possible combinations of alleles at each locus. Since we don’t usually distinguish between, say $$AG$$ and $$GA$$, we are left with 3 distinct genotypes, which means that we can code genotypes for each SNP and individual as 0, 1 or 2, which can be interpreted as 0, 1 or 2 alternative alleles.

Under a random mating assumption, the number of alternative alleles for a SNP and individual follows a binomial distribution with 2 draws (one from mum and one from dad) and probability equal to the minor allele frequency of that SNP.

Let’s assume that the minor allele frequency of our $$M$$ SNPs come from a uniform distribution between $$0$$ and $$0.5$$.

#set.seed(6147)
#set.seed(6154)
set.seed(6155)
m = 500                                  # number of SNPs
maf = runif(m, 0, .5)                    # random MAF for each SNP
# apologies if using "=" as the assignment operator in R makes your eyes hurt

The set.seed command here ensures that the random draws from a distribution will be the same each time this code is run. We can then draw genotypes for one person for each SNP.

x012 = rbinom(m, 2, maf)

We want genotypes for $$N$$ individuals, so let’s replicate this $$N$$ times.

n = 400                                    # number of individuals
x012 = t(replicate(n, rbinom(m, 2, maf)))  # n x m genotype matrix

If we are unlucky, some SNPs will be monomorphic, so not actually vary between people. Since this can cause problems, let’s simulate more SNPs and only keep $$M$$ polymorphic ones.

x012 = t(replicate(n, rbinom(2*m, 2, c(maf, maf))))
polymorphic = apply(x012, 2, var) > 0
x012 = x012[,polymorphic][,1:m]
maf = c(maf, maf)[polymorphic][1:m]
round(maf[1:10], 2)
##  [1] 0.02 0.04 0.28 0.18 0.08 0.10 0.45 0.03 0.22 0.26
x012[1:5, 1:10]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    1    0    0     0
## [2,]    0    0    1    0    0    0    2    0    0     0
## [3,]    0    0    0    1    0    0    0    0    1     1
## [4,]    0    1    1    1    0    0    1    0    1     1
## [5,]    0    0    1    0    0    0    1    0    1     0

Later on, we will not only need the original genotype matrix $$\textbf{X}_{012}$$, but also a version in which each SNP is mean-centered, $$\textbf{X}^*$$, and a version in which each SNP has mean 0 and variance 1, $$\textbf{X}$$.

x_o = scale(x012, scale=FALSE)               # mean 0
x = scale(x012, scale=TRUE)                  # mean 0, variance 1

Working with mean-centered genotypes and phenotypes makes life a lot easier, but it means we don’t have to estimate intercept terms. So a model like this:

$\textbf{y} \sim \beta_0 + \beta_1 \textbf{x} + \textbf{e}$

just becomes

$\textbf{y} \sim \beta \textbf{x} + \textbf{e}$ Similarly, assuming that there are no covariates, or that $$y$$ has already been corrected for any covariates, makes things a lot simpler as well.

### MAF estimate

Here, we know what the true MAF is for each SNP ($$p$$), because we have simulated it, but in real data we will have to get an estimate based on a finite sample. For a diploid genome coded in 0 and 1, this estimate would just be the mean across individuals, but since we have a diploid genome, it is the mean divided by two.

$\hat{p}_j = \frac{\bar{\textbf{X}}_{012,j}}{2}$

maf_est = colMeans(x012)/2
qplot(maf, maf_est, col=maf) +
geom_abline() + xlab('p') + ylab(expression(hat(p)))