Random Regression
Defining the mixed model equation
The following random regression model was used to model trajectories for PSA across the 20 time points and obtain estimates for \(\mathbf{\Omega}\) and \(\mathbf{P}\)
\[\begin{align} \label{RR1} \text{PSA}_{tijk} =\mu + \sum_{k=0}^{2}\phi_{jtk}\beta_k + \sum_{k=0}^{2}\phi_{jtk} u_{jk} + \sum_{k=0}^{1}\phi_{itk} s_{ik} + e_{tijk} \end{align}\]
In matrix notation, the model can be written as \[\begin{align} \label{RR3} \mathbf{y} &= \mathbf{Xb} + \mathbf{Zu} + \mathbf{Qs} + \mathbf{e} \end{align}\]
\(\mathbf{y}\) is a vector with an order equal to the number of observations and contains the \(PSA\) over the 20 days. \(\mathbf{X}\) is a covariable matrix for the fixed effects where the number of rows is equal to the number of observations (\(n\)) and the number of columns is equal to the order of Legendre polynomial used to model fixed effects (\(k_f\)). The matrices \(\mathbf{Z}\) and \(\mathbf{Q}\) are covariable matrices for the random additive genetic and random experimental effects, respectively. The number of rows for \(\mathbf{Z}\) is equal to the number of observations and the number of columns corresponds to the order of Legendre polynomial times the number of lines used to fit the additive genetic effect (\(q * k_g = 357 * 3 = 1,071\)). For \(\mathbf{Q}\) the number of columns would be 6 (\(e * k_s = 3 * 2\)) and the number of rows would be equal to the number of observations. We assume \(\mathbf{u} \sim N(0, \mathbf{G} \otimes \mathbf{\Omega})\), \(\mathbf{s} \sim N(0, \mathbf{I} \otimes \mathbf{P})\), and \(\mathbf{e} \sim N(0, \mathbf{I} \otimes \mathbf{D})\). Here, \(\mathbf{\Omega}\) and \(\mathbf{P}\) are the covariance matrices for the RR coefficients for the additive genetic and permanent environmental effects. \(\mathbf{D}\) is a diagonal matrix that allows for heterogeneous variances over the 20 time points.
The mixed model equation (MME) is
\[\begin{align} \begin{bmatrix} \mathbf{X' R^{-1} X} & \mathbf{X' R^{-1} Z} & \mathbf{X' R^{-1} Q}\\ \mathbf{Z' R^{-1} X} & \mathbf{Z' R^{-1} Z} + \mathbf{G^{-1}} \otimes \boldsymbol{\Omega} & \mathbf{Z' R^{-1} Q}\\ \mathbf{Q' R^{-1} X} & \mathbf{Q' R^{-1} Z} & \mathbf{Q' R^{-1} Q} + \mathbf{I} \otimes \mathbf{P}\\ \end{bmatrix} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{u}} \\ \mathbf{\hat{s}} \\ \end{bmatrix} &= \begin{bmatrix} \mathbf{X' R^{-1} y} \\ \mathbf{Z' R^{-1} y} \\ \mathbf{Q' R^{-1} y} \\ \end{bmatrix} \end{align}\]
Solving the above MME will give three RR coefficients for each line for the random genetic effects. Using these RR coefficients, the genetic values at each time point can be obtained. For line \(j\) the predicted genetic values (gBLUP) at each time point is given by \(gBLUP_j = \boldsymbol{\Phi} \hat{u}_j\).
Setting up covariable matrices for the RR approach
For each term, we define a matrix of Legendre polynomials evaluated at each time point. Recall that both the fixed and random additive genetic effect are modeled using a second-order Legendre polynomial. The covariable matrix \(\mathbf{X}\) is defined as \(\mathbf{X = X^{o} \Phi_f}\) where \(\mathbf{X^{o}}\) is a vector of 1 with length \(q*e\). Similarly, we define matrices \(\mathbf{Z}\) and \(\mathbf{Q}\) as \[\begin{align} \mathbf{Z} &= \mathbf{Z^o} \otimes \boldsymbol{\Phi_g} \\ \mathbf{Q} &= \mathbf{Q^o} \otimes \boldsymbol{\Phi_s} \end{align}\] \(\mathbf{Z^o}\) and \(\mathbf{Q^o}\) are incidence matrices that allocate temporal records to individuals and experiments respectively. The order of \(\mathbf{Z^o}\) is \(q*e \times q\) and \(\mathbf{Q^o}\) is \(q*e \times e\) (\(q\) is the number of individuals, \(e\) is the number of experiments).
The functions below were adapted from Mrode (2005) by Gota Morota
##Return coefficient matrix (lambda) of n-th order Legendre polynomials. Scaling method implemented by Gengler et. al. (1999) converts constant Legendre polynomial coefficients into 1
`legendre` <-
function(n, gengler){
if (nargs()==1){
gengler <- TRUE
}
if (gengler != TRUE & gengler != FALSE){
gengler=TRUE
}
N <- n+1
L <- matrix(0,nrow=N, ncol=N)
for(i in (1:N)){
if(i==1){
L[i,i] <- 1
}
else if(i==2){
L[i,i] <- 1
}
else {
tmp <- L[i-1,]
tmp2 <- as.numeric()
tmp2 <- c(0,tmp[1:(N-1)])
L[i,] <- (1/(i-2+1))*( (2*(i-2) + 1)*tmp2 -(i-2)*L[i-2,] )
}
}
# Normalize
for (j in (1:N)){
L[j,] <- (sqrt( (2*(j-1)+1)/2) )*L[j,]
}
# Gengler (1999)
if (gengler==TRUE){
L <- sqrt(2)*L
}
return(L)
}
##Given time points covariate and order of fit for Legendre polynomials, return matrix 'M' containing the polynomials of standardized time. 'M' is order t (number of time points) by k (order of Legendre polynomials)
stdtime <- function(t, n, tmax, tmin){
if(missing(tmax)) {
tmax <- t[which.max(t)]
}
if(missing(tmin)) {
tmin <- t[which.min(t)]
}
N <- n+1
M <- matrix(0, nrow=length(t), ncol=N)
a <- -1 + 2*(t-tmin)/(tmax - tmin)
M[,1] <- 1
for (i in 2:N){
M[,i] <- a^(i-1)
}
return(M)
}
TPs <- 20
Inds <- 357
Ex <- 3
PhiF <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F)) #Fixed effects
PhiG <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F)) #Additive genetic effects
PhiE <- stdtime(1:20, 1) %*% t(legendre(1, gengler = F)) #Experimental effect
X <- do.call(rbind, replicate(Inds*Ex, PhiF, simplify=FALSE))
Idnt <- matrix(0, ncol = Inds, nrow = Inds)
diag(Idnt) <- 1
Z <- kronecker(Idnt, matrix(1, nrow = Ex, ncol = 1))
Z <- kronecker(Z, PhiG) #n x kg*q; 21420 x 357*3
Idnt <- matrix(0, ncol = Ex, nrow = Ex)
diag(Idnt) <- 1
Q <- kronecker(Idnt, matrix(1, nrow = Inds, ncol = 1)) #n x e; 1071 x 3
Q <- kronecker(Q, PhiE) #n x ke*e; 21420 x 2*3
Calculating \(\textrm{Var(}\hat{\beta} \textrm{)}\) at each time point
The objective is to calculate SNP effects at each time point. For the single time point approach \(\textrm{BLUP(}\hat{\beta} \textrm{)} = \mathbf{W_{sc}' G^{-1} \hat{g} }\). Thus, \[\begin{align} \label{eq:9} \textrm{Var(}\hat{\beta} \textrm{)} &= \textrm{Var(} \mathbf{W_{sc}' G^{-1} \hat{u} } \textrm{)} \\ &= \mathbf{W_{sc}' G^{-1}} \textrm{Var(} \mathbf{\hat{u}} \textrm{)} \mathbf{G^{-1} W_{sc}} \end{align}\]
The prediction error variance (PEV) of \(\boldsymbol{\hat{u}}\) is \[\begin{align} \textrm{PEV(}\hat{\mathbf{u}} \textrm{)} = \mathbf{C^{22}} \sigma^2_e &= \textrm{Var(} \mathbf{u - \hat{u} }\textrm{)} \\ &= \textrm{Var(} \mathbf{u} \textrm{)} - \textrm{Var(} \hat{\mathbf{u}} \textrm{)} \\ &= \mathbf{G} \sigma^2_g - \textrm{Var(} \hat{\mathbf{u}} \textrm{)} \end{align}\]
By rearranging the equation above we obtain
\[\begin{align} \label{eq:13} \textrm{Var(} \hat{\mathbf{u}} \textrm{)} &= \mathbf{G} \sigma^2_g - \mathbf{C^{22}} \sigma^2_e \end{align}\]
To calculate the variance of the SNP effects, we can introduce equation (13) into equation (9) giving
\[\begin{align} \textrm{Var(} \hat{\boldsymbol{\beta}} \textrm{)} &= \mathbf{W'_{sc} G^{-1}} (\mathbf{G} \sigma_g^2 - \mathbf{C^{22}} \sigma^2_e) \mathbf{G^{-1} W_{sc} } \\ &= \mathbf{W'_{sc} G^{-1} W_{sc}} \sigma_g^2 - \mathbf{W'_{sc} G^{-1} C^{22}} \mathbf{G^{-1} W_{sc} \sigma^2_e } \end{align}\]
At each time point, \(\sigma_g^2\) is extracted from the corresponding the diagonal element of \(\boldsymbol{\Phi_g \Omega \Phi'_g}\).
\(\mathbf{C^{22}}\) is obtained by inverting the coefficient matrix of the MME (equation 3), and is of order \(q*k_g \times q*k_g\). The diagonal elements of \(\mathbf{C^{22}}\) contain the PEV for the RR coefficients for the additive genetic effect. To obtain the variance of SNP effects at each time point \(\mathbf{C^{22}}\) must be transformed so that the diagonal elements correspond to the PEV for GEBVs at each time point. We will refer to this \(q*t \times q*t\) matrix as \(\mathbf{C^{22}}^*\). Following , PEV for individual \(i\) at each time point can be obtained by taking the diagonal elements of \(\textrm{PEV}_i = \boldsymbol{\Phi_g} \mathbf{C_{ii}} \boldsymbol{\Phi_g'}\). \(\mathbf{C_{ii}}\) is a \(3 \times 3\) submatrix of RR coefficients from \(\mathbf{C_{22}}\) for individual \(i\). To extend this approach to the full \(\mathbf{C_{22}}\) matrix, we construct a block matrix of \(\boldsymbol{\Phi_{g}}\) (\(\boldsymbol{\Phi^*_{g}}\)) via \(\boldsymbol{\Phi^*_{g}} = \mathbf{I} \otimes \boldsymbol{\Phi_{g}}\) and obtain \(\boldsymbol{C^{22*}}\) by
\[\begin{align} \boldsymbol{C^{22*}} &= \boldsymbol{\Phi_{g}^*} \boldsymbol{C^{22}} \boldsymbol{\Phi_{g}^*}' \end{align}\]
Thus, \(\mathbf{C_{22}^*}\) is \(q*t \times q*t\) and the diagonal elements are the PEV for GEBVs at each time point. Finally, to calculate the variance of SNP effects at each time point at each, we extract the corresponding elements of \(\boldsymbol{C^{22*}}\) and introduce them into .
Setting up MME and obtaining C22*
Resvar <- diag(20)
diag(Resvar) <- c(1.37E-03, 1.11E-03, 1.87E-03, 3.00E-03, 4.81E-03,
7.08E-03, 1.04E-02, 1.33E-02, 1.78E-02, 2.40E-02,
3.52E-02, 5.10E-02, 7.23E-02, 0.101765, 0.1486,
0.203647, 0.308825, 0.484353, 0.743772, 1.11587)
#Genetic variance. Here a second order Legendre polynomial was used. So G will be a 3 x 3 matrix of
Omega <- matrix(c(0.583091, 0.456320, 0.117960,
0.456320, 0.360961, 0.945203E-01,
0.117960, 0.945203E-01, 0.251287E-01), 3, 3)
#Experimental variance
P <- matrix(c(0.603125E-01, 0.400317E-01,
0.400317E-01, 0.267477E-01), 2, 2)
library(MASS)
Idnt <- matrix(0, ncol = Inds, nrow = Inds)
diag(Idnt) <- 1
PhiG.star <- kronecker(Idnt, PhiG) #q*t x q*k; 7140 1071
R.ident <- matrix(0, ncol = Ex * Inds, nrow = Ex * Inds)
diag(R.ident) <- 1
Rinv <- kronecker(R.ident, ginv(Resvar)) #21420 x 21420
Omega.i <- ginv(Omega)
P.i <- ginv(P)
Set up the MME.
G <- as.matrix(read.table("DataAndCode/CompCluster/G.txt", sep = "\t", header = T))
Ginv <- ginv(G)
C11 <- t(X) %*% Rinv %*% X; dim(C11) #3 x 3
C12 <- t(X) %*% Rinv %*% Z; dim(C12) #3 x 1071
C13 <- t(X) %*% Rinv %*% Q; dim(C13) #3 x 6
C21 <- t(Z) %*% Rinv %*% X; dim(C21) #1071 x 3
C22 <- (t(Z) %*% Rinv %*% Z) + kronecker(Ginv, Omega.i); dim(C22) #1071 x 1071
C23 <- t(Z) %*% Rinv %*% Q; dim(C23) #1071 x 6
C31 <- t(Q) %*% Rinv %*% X; dim(C31) #6 x 3
C32 <- t(Q) %*% Rinv %*% Z; dim(C32) #6 x 1071
Idnt <- matrix(0, nrow = 3, ncol = 3)
diag(Idnt) <- 1
C33 <- (t(Q) %*% Rinv %*% Q) + kronecker(Idnt, P.i); dim(C33) #6 x 6
TopRow <- cbind(C11, C12, C13) #3 x 1080
MidRow <- cbind(C21, C22, C23) #1071 x 1080
BotRow <- cbind(C31, C32, C33) #6 x 1080
LHS <- rbind(TopRow, MidRow, BotRow)
InvLHS <- ginv(LHS)
C22 <- InvLHS[4:1074, 4:1074]
C22.star <- PhiG.star %*% C22 %*% t(PhiG.star)
rm(LHS)
rm(InvLHS)
rm(C22)
rm(TopRow)
rm(MidRow)
rm(BotRow)
Getting p-values at each time point from C22*
Eff.mat <- as.matrix(read.csv("DataAndCode/CompCluster/RR/Eff.mat.csv", header = T) )
Zsc <- t( as.matrix (read.table("Zsc.txt", header = T, sep = "\t") ) )
#Define all matrices to store the results
pvalue.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc))
SD.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc))
#Split into time specific blocks and calculate the SNP effects
#calculate index for each line
indx <- 1
for (i in 1:356){
indx <- c(indx, indx[i] + 20)
}
for ( day in 1:20){
C22.tmp <- C22.star[indx, indx]
VarU.hat <- (G * Gvar[day]) - C22.tmp
#Varg.hat <- (G * Gvar[day]) + (C22.tmp * diag(Resvar)[day])
VarBeta.hat <- diag(t(Zsc) %*% ginv(G) %*% VarU.hat %*% ginv(G) %*% Zsc)
#summary(diag(VarBeta.hat))
sd.SNPs <- sqrt(VarBeta.hat)
SNPe_ad <- Eff.mat[,day] / sd.SNPs
pvalue.mat[,day] <- 2*(1-pnorm(abs(SNPe_ad)))
SD.mat[,day] <- sd.SNPs
indx <- indx + 1
}
write.table(pvalue.mat, "pval.mat.txt", sep = "\t", row.names = T, col.names = T)
write.table(SD.mat, "SDmat.mat.txt", sep = "\t", row.names = T, col.names = T)
Single time point approach
Defining the mixed model equation
The single time point approach is essentially a univariate gBLUP approach. We fit the following model at each time point \[\begin{align} \label{eq:VarSNP_matrix} \mathbf{y} &= \mathbf{Xb} + \mathbf{Zu} + \mathbf{Qs} + \mathbf{e} , \end{align}\] The matrices \(\mathbf{X, Z}\) and \(\mathbf{Q}\) correspond to incidence matrices for the fixed, random additive genetic and random experimental effect, respectively. Moreover, the dimensions for \(\mathbf{X, Z}\) and \(\mathbf{Q}\) are \(n \times 1\), \(n \times q\) and \(n \times e\), where \(n\) is the number of observations at each time point. We assume the random terms are distributed as follows \(\mathbf{u} \sim N(0, \mathbf{G}\sigma_g^2)\), \(\mathbf{s} \sim N(0, \mathbf{I}\sigma_s^2)\), and \(\mathbf{e} \sim N(0, \mathbf{I}\sigma_e^2)\).
The MME for the TP approach is \[\begin{align} \label{eq:TP.MME} \begin{bmatrix} \mathbf{X' R^{-1} X} & \mathbf{X' R^{-1} Z} & \mathbf{X' R^{-1} Q}\\ \mathbf{Z' R^{-1} X} & \mathbf{Z' R^{-1} Z} + \mathbf{G^{-1}} \lambda_g & \mathbf{Z' R^{-1} Q}\\ \mathbf{Q' R^{-1} X} & \mathbf{Q' R^{-1} Z} & \mathbf{Q' R^{-1} Q} + \mathbf{I} \lambda_s\\ \end{bmatrix} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{u}} \\ \mathbf{\hat{s}} \\ \end{bmatrix} &= \begin{bmatrix} \mathbf{X' R^{-1} y} \\ \mathbf{Z' R^{-1} y} \\ \mathbf{Q' R^{-1} y} \\ \end{bmatrix}, \\ \lambda_g = \frac{\sigma_e^2}{\sigma_g^2}; \lambda_s = \frac{\sigma_e^2}{\sigma_s^2} \end{align}\]
Calculating \(\textrm{Var(}\hat{\beta} \textrm{)}\) at each time point
Since the TP approach is a univariate gBLUP model calculating the variance of SNP effects is much simpler than the RR approach, and are given by
\[\begin{align} \textrm{Var(} \hat{\boldsymbol{\beta}} \textrm{)} &= \mathbf{W'_{sc} G^{-1} W_{sc}} \sigma_g^2 - \mathbf{W'_{sc} G^{-1} C^{22}} \mathbf{G^{-1} W_{sc} \sigma^2_e) } \label{eq:14} \end{align}\]
\(\mathbf{C^{22}}\) is obtained by inverting the MME .
Extracting variance components from asreml output
Recall, that the TP approach produces 20 individual files. The code chunks below parses those files, extracts the variance components, and stores them for later use.
cd TP_GP/
for file in *.asr; do less $file | grep -A6 "Model_Term" | sed -n -e 1,3p -e 5p | tr -s " " > $file.varcomp.txt; done
Read in the parsed asr file and extract the variance components.
rm(list = ls())
library(reshape2)
FILES <- paste0("DataAndCode/TP_GP/TPY", 1:20, ".asr.varcomp.txt")
for(i in 1:20){
tmp <- read.table(FILES[i], skip = 1, sep="")
tmp$DOI <- i
if(i == 1){
final <- tmp
}else{
final <- rbind(final, tmp)
}
}
final <- final[c("V1", "V5", "DOI")]
colnames(final)[1:2] <- c("Source", "Sigma")
Var.TP <- dcast(final, Source ~ DOI, value.var = "Sigma")
Var.TP <- Var.TP[2:21]
row.names(Var.TP) <- c("Exp", "G", "E")
print(Var.TP)
write.csv(Var.TP, "DataandCode/CompCluster/TP/VarComps.csv", row.names=T)
Setting up MME for TP approach and running GWAS from GEBVs
library(MASS)
G <- as.matrix(read.table("DataAndCode/CompCluster/G.txt", sep = "\t", header = T))
PSA.df <- read.csv("DataandCode/CompCluster/TP/PSA.cleaned.mn.csv")
VarComps <- read.csv("DataandCode/CompCluster/TP/VarComps.csv", row.names = 1)
Eff.mat <- as.matrix(read.csv("DataandCode/CompCluster/TP/Eff.mat.csv", header = T) )
Zsc <- t( as.matrix (read.table("DataandCode/CompCluster/Zsc.txt", header = T, sep = "\t") ) )
#Define all matrices to store the results
pvalue.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc))
SD.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc))
for(day in 1:20){
tmp.PSA.df <- PSA.df[c("NSFTV.ID", "Exp", paste0("X", day))]
colnames(tmp.PSA.df)[3] <- "Y"
###############################################
#Set up incidence matrices for each time point#
###############################################
#Fixed effects X; n x 1
X <- model.matrix(Y ~ 1, data = tmp.PSA.df); dim(X)
#Random additive genetic Z; n x q
Z <- model.matrix(Y ~ -1 + NSFTV.ID, data = tmp.PSA.df); dim(Z)
#Random experiment Q; n x e
Q <- model.matrix(Y ~ -1 + Exp, data = tmp.PSA.df); dim(Q)
#########
###R ####
#########
#R = I * \sigma^2_e
Rinv <- matrix(0, ncol = nrow(X), nrow = nrow(X))
diag(Rinv) <- VarComps[3, day]
Rinv <- ginv(Rinv)
############
#Set up MME#
############
lambda.g <- 1 / VarComps[2, day]
lambda.q <- 1 / VarComps[1, day]
C11 <- t(X) %*% Rinv %*% X; dim(C11) #1 x 1
C12 <- t(X) %*% Rinv %*% Z; dim(C12) #1 x 357
C13 <- t(X) %*% Rinv %*% Q; dim(C13) #1 x 3
C21 <- t(Z) %*% Rinv %*% X; dim(C21) #357 x 1
C22 <- (t(Z) %*% Rinv %*% Z) + (ginv(G) * lambda.g); dim(C22) #357 x 357
C23 <- t(Z) %*% Rinv %*% Q; dim(C23) #357 x 3
C31 <- t(Q) %*% Rinv %*% X; dim(C31) #3 x 1
C32 <- t(Q) %*% Rinv %*% Z; dim(C32) #3 x 357
Idnt <- matrix(0, nrow = 3, ncol = 3)
diag(Idnt) <- 1
C33 <- (t(Q) %*% Rinv %*% Q) + (Idnt * lambda.q); dim(C33) #3 x 3
TopRow <- cbind(C11, C12, C13)
MidRow <- cbind(C21, C22, C23)
BotRow <- cbind(C31, C32, C33)
#Left hand side of MME
LHS <- rbind(TopRow, MidRow, BotRow) #361 x 361
InvLHS <- ginv(LHS)
#Extract only the additive genetic parts
C22 <- InvLHS[2:358, 2:358]
#######################################
#Variance of predicted breeding values#
#######################################
VarU.hat <- (G * VarComps[2, day]) - C22
#########################
#Variance of SNP effects#
#########################
VarBeta.hat <- diag(t(Zsc) %*% ginv(G) %*% VarU.hat %*% ginv(G) %*% Zsc)
#summary(diag(VarBeta.hat))
sd.SNPs <- sqrt(VarBeta.hat)
SNPe_ad <- Eff.mat[,day] / sd.SNPs
pvalue.mat[,day] <- 2*(1-pnorm(abs(SNPe_ad)))
SD.mat[,day] <- sd.SNPs
print(day)
}
write.table(pvalue.mat, "pval.mat.txt", sep = "\t", row.names = T, col.names = T)
write.table(SD.mat, "SDmat.mat.txt", sep = "\t", row.names = T, col.names = T)