--- title: "Calculating $p$-values for RR and TP GWAS" author: "Malachy Campbell" date: "8/7/2018" output: rmdformats::html_clean: fig_width: 6 fig_height: 6 highlight: kate thumbnails: true lightbox: true gallery: true --- ```{r setup, include=FALSE, echo = F, eval = T} knitr::opts_knit$set(root.dir = '/Users/malachycampbell/Documents/Dropbox/Work/Manuscripts/2018_RandomRegressionGWAS/ThePlantGenome/Revision/New Analysis/') ``` # Random Regression ## Defining the mixed model equation The following random regression model was used to model trajectories for PSA across the 20 time points and obtain estimates for $\mathbf{\Omega}$ and $\mathbf{P}$ \begin{align} \label{RR1} \text{PSA}_{tijk} =\mu + \sum_{k=0}^{2}\phi_{jtk}\beta_k + \sum_{k=0}^{2}\phi_{jtk} u_{jk} + \sum_{k=0}^{1}\phi_{itk} s_{ik} + e_{tijk} \end{align} In matrix notation, the model can be written as \begin{align} \label{RR3} \mathbf{y} &= \mathbf{Xb} + \mathbf{Zu} + \mathbf{Qs} + \mathbf{e} \end{align} $\mathbf{y}$ is a vector with an order equal to the number of observations and contains the $PSA$ over the 20 days. $\mathbf{X}$ is a covariable matrix for the fixed effects where the number of rows is equal to the number of observations ($n$) and the number of columns is equal to the order of Legendre polynomial used to model fixed effects ($k_f$). The matrices $\mathbf{Z}$ and $\mathbf{Q}$ are covariable matrices for the random additive genetic and random experimental effects, respectively. The number of rows for $\mathbf{Z}$ is equal to the number of observations and the number of columns corresponds to the order of Legendre polynomial times the number of lines used to fit the additive genetic effect ($q * k_g = 357 * 3 = 1,071$). For $\mathbf{Q}$ the number of columns would be 6 ($e * k_s = 3 * 2$) and the number of rows would be equal to the number of observations. We assume $\mathbf{u} \sim N(0, \mathbf{G} \otimes \mathbf{\Omega})$, $\mathbf{s} \sim N(0, \mathbf{I} \otimes \mathbf{P})$, and $\mathbf{e} \sim N(0, \mathbf{I} \otimes \mathbf{D})$. Here, $\mathbf{\Omega}$ and $\mathbf{P}$ are the covariance matrices for the RR coefficients for the additive genetic and permanent environmental effects. $\mathbf{D}$ is a diagonal matrix that allows for heterogeneous variances over the 20 time points. The mixed model equation (MME) is \begin{align} \begin{bmatrix} \mathbf{X' R^{-1} X} & \mathbf{X' R^{-1} Z} & \mathbf{X' R^{-1} Q}\\ \mathbf{Z' R^{-1} X} & \mathbf{Z' R^{-1} Z} + \mathbf{G^{-1}} \otimes \boldsymbol{\Omega} & \mathbf{Z' R^{-1} Q}\\ \mathbf{Q' R^{-1} X} & \mathbf{Q' R^{-1} Z} & \mathbf{Q' R^{-1} Q} + \mathbf{I} \otimes \mathbf{P}\\ \end{bmatrix} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{u}} \\ \mathbf{\hat{s}} \\ \end{bmatrix} &= \begin{bmatrix} \mathbf{X' R^{-1} y} \\ \mathbf{Z' R^{-1} y} \\ \mathbf{Q' R^{-1} y} \\ \end{bmatrix} \end{align} Solving the above MME will give three RR coefficients for each line for the random genetic effects. Using these RR coefficients, the genetic values at each time point can be obtained. For line $j$ the predicted genetic values (gBLUP) at each time point is given by $gBLUP_j = \boldsymbol{\Phi} \hat{u}_j$. ## Setting up covariable matrices for the RR approach For each term, we define a matrix of Legendre polynomials evaluated at each time point. Recall that both the fixed and random additive genetic effect are modeled using a second-order Legendre polynomial. The covariable matrix $\mathbf{X}$ is defined as $\mathbf{X = X^{o} \Phi_f}$ where $\mathbf{X^{o}}$ is a vector of 1 with length $q*e$. Similarly, we define matrices $\mathbf{Z}$ and $\mathbf{Q}$ as \begin{align} \mathbf{Z} &= \mathbf{Z^o} \otimes \boldsymbol{\Phi_g} \\ \mathbf{Q} &= \mathbf{Q^o} \otimes \boldsymbol{\Phi_s} \end{align} $\mathbf{Z^o}$ and $\mathbf{Q^o}$ are incidence matrices that allocate temporal records to individuals and experiments respectively. The order of $\mathbf{Z^o}$ is $q*e \times q$ and $\mathbf{Q^o}$ is $q*e \times e$ ($q$ is the number of individuals, $e$ is the number of experiments). The functions below were adapted from Mrode (2005) by [Gota Morota](http://morotalab.org/Mrode2005/rr/rr.html#section00020000000000000000) ```{r load functions for RR with leg polynomials, echo = T, eval = F} ##Return coefficient matrix (lambda) of n-th order Legendre polynomials. Scaling method implemented by Gengler et. al. (1999) converts constant Legendre polynomial coefficients into 1 `legendre` <- function(n, gengler){ if (nargs()==1){ gengler <- TRUE } if (gengler != TRUE & gengler != FALSE){ gengler=TRUE } N <- n+1 L <- matrix(0,nrow=N, ncol=N) for(i in (1:N)){ if(i==1){ L[i,i] <- 1 } else if(i==2){ L[i,i] <- 1 } else { tmp <- L[i-1,] tmp2 <- as.numeric() tmp2 <- c(0,tmp[1:(N-1)]) L[i,] <- (1/(i-2+1))*( (2*(i-2) + 1)*tmp2 -(i-2)*L[i-2,] ) } } # Normalize for (j in (1:N)){ L[j,] <- (sqrt( (2*(j-1)+1)/2) )*L[j,] } # Gengler (1999) if (gengler==TRUE){ L <- sqrt(2)*L } return(L) } ##Given time points covariate and order of fit for Legendre polynomials, return matrix 'M' containing the polynomials of standardized time. 'M' is order t (number of time points) by k (order of Legendre polynomials) stdtime <- function(t, n, tmax, tmin){ if(missing(tmax)) { tmax <- t[which.max(t)] } if(missing(tmin)) { tmin <- t[which.min(t)] } N <- n+1 M <- matrix(0, nrow=length(t), ncol=N) a <- -1 + 2*(t-tmin)/(tmax - tmin) M[,1] <- 1 for (i in 2:N){ M[,i] <- a^(i-1) } return(M) } ``` ```{r, echo = T, eval = F} TPs <- 20 Inds <- 357 Ex <- 3 PhiF <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F)) #Fixed effects PhiG <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F)) #Additive genetic effects PhiE <- stdtime(1:20, 1) %*% t(legendre(1, gengler = F)) #Experimental effect X <- do.call(rbind, replicate(Inds*Ex, PhiF, simplify=FALSE)) Idnt <- matrix(0, ncol = Inds, nrow = Inds) diag(Idnt) <- 1 Z <- kronecker(Idnt, matrix(1, nrow = Ex, ncol = 1)) Z <- kronecker(Z, PhiG) #n x kg*q; 21420 x 357*3 Idnt <- matrix(0, ncol = Ex, nrow = Ex) diag(Idnt) <- 1 Q <- kronecker(Idnt, matrix(1, nrow = Inds, ncol = 1)) #n x e; 1071 x 3 Q <- kronecker(Q, PhiE) #n x ke*e; 21420 x 2*3 ``` ## Calculating $\textrm{Var(}\hat{\beta} \textrm{)}$ at each time point The objective is to calculate SNP effects at each time point. For the single time point approach $\textrm{BLUP(}\hat{\beta} \textrm{)} = \mathbf{W_{sc}' G^{-1} \hat{g} }$. Thus, \begin{align} \label{eq:9} \textrm{Var(}\hat{\beta} \textrm{)} &= \textrm{Var(} \mathbf{W_{sc}' G^{-1} \hat{u} } \textrm{)} \\ &= \mathbf{W_{sc}' G^{-1}} \textrm{Var(} \mathbf{\hat{u}} \textrm{)} \mathbf{G^{-1} W_{sc}} \end{align} The prediction error variance (PEV) of $\boldsymbol{\hat{u}}$ is \begin{align} \textrm{PEV(}\hat{\mathbf{u}} \textrm{)} = \mathbf{C^{22}} \sigma^2_e &= \textrm{Var(} \mathbf{u - \hat{u} }\textrm{)} \\ &= \textrm{Var(} \mathbf{u} \textrm{)} - \textrm{Var(} \hat{\mathbf{u}} \textrm{)} \\ &= \mathbf{G} \sigma^2_g - \textrm{Var(} \hat{\mathbf{u}} \textrm{)} \end{align} By rearranging the equation above we obtain \begin{align} \label{eq:13} \textrm{Var(} \hat{\mathbf{u}} \textrm{)} &= \mathbf{G} \sigma^2_g - \mathbf{C^{22}} \sigma^2_e \end{align} To calculate the variance of the SNP effects, we can introduce equation (13) into equation (9) giving \begin{align} \textrm{Var(} \hat{\boldsymbol{\beta}} \textrm{)} &= \mathbf{W'_{sc} G^{-1}} (\mathbf{G} \sigma_g^2 - \mathbf{C^{22}} \sigma^2_e) \mathbf{G^{-1} W_{sc} } \\ &= \mathbf{W'_{sc} G^{-1} W_{sc}} \sigma_g^2 - \mathbf{W'_{sc} G^{-1} C^{22}} \mathbf{G^{-1} W_{sc} \sigma^2_e } \end{align} At each time point, $\sigma_g^2$ is extracted from the corresponding the diagonal element of $\boldsymbol{\Phi_g \Omega \Phi'_g}$. $\mathbf{C^{22}}$ is obtained by inverting the coefficient matrix of the MME (equation 3), and is of order $q*k_g \times q*k_g$. The diagonal elements of $\mathbf{C^{22}}$ contain the PEV for the RR coefficients for the additive genetic effect. To obtain the variance of SNP effects at each time point $\mathbf{C^{22}}$ must be transformed so that the diagonal elements correspond to the PEV for GEBVs at each time point. We will refer to this $q*t \times q*t$ matrix as $\mathbf{C^{22}}^*$. Following \citep{mrode2014linear}, PEV for individual $i$ at each time point can be obtained by taking the diagonal elements of $\textrm{PEV}_i = \boldsymbol{\Phi_g} \mathbf{C_{ii}} \boldsymbol{\Phi_g'}$. $\mathbf{C_{ii}}$ is a $3 \times 3$ submatrix of RR coefficients from $\mathbf{C_{22}}$ for individual $i$. To extend this approach to the full $\mathbf{C_{22}}$ matrix, we construct a block matrix of $\boldsymbol{\Phi_{g}}$ ($\boldsymbol{\Phi^*_{g}}$) via $\boldsymbol{\Phi^*_{g}} = \mathbf{I} \otimes \boldsymbol{\Phi_{g}}$ and obtain $\boldsymbol{C^{22*}}$ by \begin{align} \boldsymbol{C^{22*}} &= \boldsymbol{\Phi_{g}^*} \boldsymbol{C^{22}} \boldsymbol{\Phi_{g}^*}' \end{align} Thus, $\mathbf{C_{22}^*}$ is $q*t \times q*t$ and the diagonal elements are the PEV for GEBVs at each time point. Finally, to calculate the variance of SNP effects at each time point at each, we extract the corresponding elements of $\boldsymbol{C^{22*}}$ and introduce them into \ref{eq:15}. ### Setting up MME and obtaining C22* ```{r, echo = T, eval = F} Resvar <- diag(20) diag(Resvar) <- c(1.37E-03, 1.11E-03, 1.87E-03, 3.00E-03, 4.81E-03, 7.08E-03, 1.04E-02, 1.33E-02, 1.78E-02, 2.40E-02, 3.52E-02, 5.10E-02, 7.23E-02, 0.101765, 0.1486, 0.203647, 0.308825, 0.484353, 0.743772, 1.11587) #Genetic variance. Here a second order Legendre polynomial was used. So G will be a 3 x 3 matrix of Omega <- matrix(c(0.583091, 0.456320, 0.117960, 0.456320, 0.360961, 0.945203E-01, 0.117960, 0.945203E-01, 0.251287E-01), 3, 3) #Experimental variance P <- matrix(c(0.603125E-01, 0.400317E-01, 0.400317E-01, 0.267477E-01), 2, 2) ``` ```{r, echo = T, eval = F} library(MASS) Idnt <- matrix(0, ncol = Inds, nrow = Inds) diag(Idnt) <- 1 PhiG.star <- kronecker(Idnt, PhiG) #q*t x q*k; 7140 1071 R.ident <- matrix(0, ncol = Ex * Inds, nrow = Ex * Inds) diag(R.ident) <- 1 Rinv <- kronecker(R.ident, ginv(Resvar)) #21420 x 21420 Omega.i <- ginv(Omega) P.i <- ginv(P) ``` Set up the MME. ```{r, echo = T, eval = F} G <- as.matrix(read.table("DataAndCode/CompCluster/G.txt", sep = "\t", header = T)) Ginv <- ginv(G) C11 <- t(X) %*% Rinv %*% X; dim(C11) #3 x 3 C12 <- t(X) %*% Rinv %*% Z; dim(C12) #3 x 1071 C13 <- t(X) %*% Rinv %*% Q; dim(C13) #3 x 6 C21 <- t(Z) %*% Rinv %*% X; dim(C21) #1071 x 3 C22 <- (t(Z) %*% Rinv %*% Z) + kronecker(Ginv, Omega.i); dim(C22) #1071 x 1071 C23 <- t(Z) %*% Rinv %*% Q; dim(C23) #1071 x 6 C31 <- t(Q) %*% Rinv %*% X; dim(C31) #6 x 3 C32 <- t(Q) %*% Rinv %*% Z; dim(C32) #6 x 1071 Idnt <- matrix(0, nrow = 3, ncol = 3) diag(Idnt) <- 1 C33 <- (t(Q) %*% Rinv %*% Q) + kronecker(Idnt, P.i); dim(C33) #6 x 6 TopRow <- cbind(C11, C12, C13) #3 x 1080 MidRow <- cbind(C21, C22, C23) #1071 x 1080 BotRow <- cbind(C31, C32, C33) #6 x 1080 LHS <- rbind(TopRow, MidRow, BotRow) InvLHS <- ginv(LHS) C22 <- InvLHS[4:1074, 4:1074] C22.star <- PhiG.star %*% C22 %*% t(PhiG.star) rm(LHS) rm(InvLHS) rm(C22) rm(TopRow) rm(MidRow) rm(BotRow) ``` ### Getting p-values at each time point from C22* ```{r, echo = T, eval = F} Eff.mat <- as.matrix(read.csv("DataAndCode/CompCluster/RR/Eff.mat.csv", header = T) ) Zsc <- t( as.matrix (read.table("Zsc.txt", header = T, sep = "\t") ) ) ``` ```{r, echo = T, eval = F} Gvar <- diag(PhiG %*% Omega %*% t(PhiG)) ``` ```{r, echo = T, eval = F} #Define all matrices to store the results pvalue.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc)) SD.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc)) #Split into time specific blocks and calculate the SNP effects #calculate index for each line indx <- 1 for (i in 1:356){ indx <- c(indx, indx[i] + 20) } for ( day in 1:20){ C22.tmp <- C22.star[indx, indx] VarU.hat <- (G * Gvar[day]) - C22.tmp #Varg.hat <- (G * Gvar[day]) + (C22.tmp * diag(Resvar)[day]) VarBeta.hat <- diag(t(Zsc) %*% ginv(G) %*% VarU.hat %*% ginv(G) %*% Zsc) #summary(diag(VarBeta.hat)) sd.SNPs <- sqrt(VarBeta.hat) SNPe_ad <- Eff.mat[,day] / sd.SNPs pvalue.mat[,day] <- 2*(1-pnorm(abs(SNPe_ad))) SD.mat[,day] <- sd.SNPs indx <- indx + 1 } write.table(pvalue.mat, "pval.mat.txt", sep = "\t", row.names = T, col.names = T) write.table(SD.mat, "SDmat.mat.txt", sep = "\t", row.names = T, col.names = T) ``` # Single time point approach ## Defining the mixed model equation The single time point approach is essentially a univariate gBLUP approach. We fit the following model at each time point \begin{align} \label{eq:VarSNP_matrix} \mathbf{y} &= \mathbf{Xb} + \mathbf{Zu} + \mathbf{Qs} + \mathbf{e} , \end{align} The matrices $\mathbf{X, Z}$ and $\mathbf{Q}$ correspond to incidence matrices for the fixed, random additive genetic and random experimental effect, respectively. Moreover, the dimensions for $\mathbf{X, Z}$ and $\mathbf{Q}$ are $n \times 1$, $n \times q$ and $n \times e$, where $n$ is the number of observations at each time point. We assume the random terms are distributed as follows $\mathbf{u} \sim N(0, \mathbf{G}\sigma_g^2)$, $\mathbf{s} \sim N(0, \mathbf{I}\sigma_s^2)$, and $\mathbf{e} \sim N(0, \mathbf{I}\sigma_e^2)$. The MME for the TP approach is \begin{align} \label{eq:TP.MME} \begin{bmatrix} \mathbf{X' R^{-1} X} & \mathbf{X' R^{-1} Z} & \mathbf{X' R^{-1} Q}\\ \mathbf{Z' R^{-1} X} & \mathbf{Z' R^{-1} Z} + \mathbf{G^{-1}} \lambda_g & \mathbf{Z' R^{-1} Q}\\ \mathbf{Q' R^{-1} X} & \mathbf{Q' R^{-1} Z} & \mathbf{Q' R^{-1} Q} + \mathbf{I} \lambda_s\\ \end{bmatrix} \begin{bmatrix} \mathbf{\hat{b}} \\ \mathbf{\hat{u}} \\ \mathbf{\hat{s}} \\ \end{bmatrix} &= \begin{bmatrix} \mathbf{X' R^{-1} y} \\ \mathbf{Z' R^{-1} y} \\ \mathbf{Q' R^{-1} y} \\ \end{bmatrix}, \\ \lambda_g = \frac{\sigma_e^2}{\sigma_g^2}; \lambda_s = \frac{\sigma_e^2}{\sigma_s^2} \end{align} ## Calculating $\textrm{Var(}\hat{\beta} \textrm{)}$ at each time point Since the TP approach is a univariate gBLUP model calculating the variance of SNP effects is much simpler than the RR approach, and are given by \begin{align} \textrm{Var(} \hat{\boldsymbol{\beta}} \textrm{)} &= \mathbf{W'_{sc} G^{-1} W_{sc}} \sigma_g^2 - \mathbf{W'_{sc} G^{-1} C^{22}} \mathbf{G^{-1} W_{sc} \sigma^2_e) } \label{eq:14} \end{align} $\mathbf{C^{22}}$ is obtained by inverting the MME \ref{eq:TP.MME}. ### Extracting variance components from asreml output Recall, that the TP approach produces 20 individual files. The code chunks below parses those files, extracts the variance components, and stores them for later use. ```{bash varcomp, echo = T, eval = F} cd TP_GP/ for file in *.asr; do less $file | grep -A6 "Model_Term" | sed -n -e 1,3p -e 5p | tr -s " " > $file.varcomp.txt; done ``` Read in the parsed asr file and extract the variance components. ```{r TP var comp, echo = T, eval = F} rm(list = ls()) library(reshape2) FILES <- paste0("DataAndCode/TP_GP/TPY", 1:20, ".asr.varcomp.txt") for(i in 1:20){ tmp <- read.table(FILES[i], skip = 1, sep="") tmp$DOI <- i if(i == 1){ final <- tmp }else{ final <- rbind(final, tmp) } } final <- final[c("V1", "V5", "DOI")] colnames(final)[1:2] <- c("Source", "Sigma") Var.TP <- dcast(final, Source ~ DOI, value.var = "Sigma") Var.TP <- Var.TP[2:21] row.names(Var.TP) <- c("Exp", "G", "E") print(Var.TP) write.csv(Var.TP, "DataandCode/CompCluster/TP/VarComps.csv", row.names=T) ``` ### Setting up MME for TP approach and running GWAS from GEBVs ```{r, echo = T, eval = F} library(MASS) G <- as.matrix(read.table("DataAndCode/CompCluster/G.txt", sep = "\t", header = T)) PSA.df <- read.csv("DataandCode/CompCluster/TP/PSA.cleaned.mn.csv") VarComps <- read.csv("DataandCode/CompCluster/TP/VarComps.csv", row.names = 1) Eff.mat <- as.matrix(read.csv("DataandCode/CompCluster/TP/Eff.mat.csv", header = T) ) Zsc <- t( as.matrix (read.table("DataandCode/CompCluster/Zsc.txt", header = T, sep = "\t") ) ) ``` ```{r, echo = T, eval = F} #Define all matrices to store the results pvalue.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc)) SD.mat <- matrix(0, ncol = 20, nrow = ncol(Zsc)) for(day in 1:20){ tmp.PSA.df <- PSA.df[c("NSFTV.ID", "Exp", paste0("X", day))] colnames(tmp.PSA.df)[3] <- "Y" ############################################### #Set up incidence matrices for each time point# ############################################### #Fixed effects X; n x 1 X <- model.matrix(Y ~ 1, data = tmp.PSA.df); dim(X) #Random additive genetic Z; n x q Z <- model.matrix(Y ~ -1 + NSFTV.ID, data = tmp.PSA.df); dim(Z) #Random experiment Q; n x e Q <- model.matrix(Y ~ -1 + Exp, data = tmp.PSA.df); dim(Q) ######### ###R #### ######### #R = I * \sigma^2_e Rinv <- matrix(0, ncol = nrow(X), nrow = nrow(X)) diag(Rinv) <- VarComps[3, day] Rinv <- ginv(Rinv) ############ #Set up MME# ############ lambda.g <- 1 / VarComps[2, day] lambda.q <- 1 / VarComps[1, day] C11 <- t(X) %*% Rinv %*% X; dim(C11) #1 x 1 C12 <- t(X) %*% Rinv %*% Z; dim(C12) #1 x 357 C13 <- t(X) %*% Rinv %*% Q; dim(C13) #1 x 3 C21 <- t(Z) %*% Rinv %*% X; dim(C21) #357 x 1 C22 <- (t(Z) %*% Rinv %*% Z) + (ginv(G) * lambda.g); dim(C22) #357 x 357 C23 <- t(Z) %*% Rinv %*% Q; dim(C23) #357 x 3 C31 <- t(Q) %*% Rinv %*% X; dim(C31) #3 x 1 C32 <- t(Q) %*% Rinv %*% Z; dim(C32) #3 x 357 Idnt <- matrix(0, nrow = 3, ncol = 3) diag(Idnt) <- 1 C33 <- (t(Q) %*% Rinv %*% Q) + (Idnt * lambda.q); dim(C33) #3 x 3 TopRow <- cbind(C11, C12, C13) MidRow <- cbind(C21, C22, C23) BotRow <- cbind(C31, C32, C33) #Left hand side of MME LHS <- rbind(TopRow, MidRow, BotRow) #361 x 361 InvLHS <- ginv(LHS) #Extract only the additive genetic parts C22 <- InvLHS[2:358, 2:358] ####################################### #Variance of predicted breeding values# ####################################### VarU.hat <- (G * VarComps[2, day]) - C22 ######################### #Variance of SNP effects# ######################### VarBeta.hat <- diag(t(Zsc) %*% ginv(G) %*% VarU.hat %*% ginv(G) %*% Zsc) #summary(diag(VarBeta.hat)) sd.SNPs <- sqrt(VarBeta.hat) SNPe_ad <- Eff.mat[,day] / sd.SNPs pvalue.mat[,day] <- 2*(1-pnorm(abs(SNPe_ad))) SD.mat[,day] <- sd.SNPs print(day) } write.table(pvalue.mat, "pval.mat.txt", sep = "\t", row.names = T, col.names = T) write.table(SD.mat, "SDmat.mat.txt", sep = "\t", row.names = T, col.names = T) ```