Background
The aim of scenario B was to determine if traits at later time points can be predicted for known lines using information at earlier time points. Thus, it can be considered as a forecasting approach. Here, a RR model was fitted using the first ten time points and was used to predict the phenotypes for the same set of lines in the last ten time points. The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA at each of the three experiments using Pearson’s correlation method. The lines were randomly assigned to each fold, and the process was repeated 20 times. For each fold, the average correlation over the three experiments was used.
For scenario C, the aim is similar to B, however here we are interested in predicting future phenotypes for unknown lines. This can be thought of as a forecasting approach for unobserved, or undeveloped lines. Here, a RR model was fitted using PSA at the first ten time points for the training set, and was used to predict the phenotypes for the testing set of lines in the last ten time points. Again, accuracy was determined using Pearson’s correlation.
Scenario B
Generate the dataset for cross validation
Here, we will reuse the CV file created in Scenario A, so I won’t bother to add the code to generate that here. We’ll load that file and only select the first 10 time points.
final <- read.csv("~/Desktop/RR/ScenarioA/RR/RR_CV.csv")
final <- final[final$DayOfImaging < 11 ,]
write.csv(final, "~/Desktop/RR/ScenarioB/RR_CV_ScenB", row.names=F)
Running asreml for CV
Again, as for Scenario A, CV was done on our computing cluster at UNL. Four jobs were created, where the RR model was fit for five resampling runs. Below is an example .as file, as well as the slurm script used to submit the job. Note there are 10 starting values for the error term, thus we’re using 10 time points.
!RENAME !ARG Y1 Y2 Y3 Y4 Y5
RR scenario B
NID !A
PSA !/100000
DayOfImaging 10 !I
Rep 2
Exp !A
Y1 !/100000
Y2 !/100000
Y3 !/100000
Y4 !/100000
Y5 !/100000
Y6 !/100000
Y7 !/100000
Y8 !/100000
Y9 !/100000
Y10 !/100000
Y11 !/100000
Y12 !/100000
Y13 !/100000
Y14 !/100000
Y15 !/100000
Y16 !/100000
Y17 !/100000
Y18 !/100000
Y19 !/100000
Y20 !/100000
G2.grm
RR_CV_ScenB.csv !SKIP 1 !MAXITER 1000 !EXTRA 100 !WORKSPACE 6144 !ASUV
!ASSIGN USe !< !INIT
8.02E-04 1.18E-03 1.85E-03 2.84E-03 4.44E-03 6.64E-03 9.97E-03 1.46E-02 2.09E-02 3.19E-02
!>
!ASSIGN USg !< !INIT
0.538879E-01
0.280743E-01 0.147578E-01
0.526344E-02 0.280003E-02 0.540146E-03
!>
!ASSIGN USp !< !INIT
0.554204E-02
0.359945E-02 0.240679E-02
!>
$1 ~ mu leg(DayOfImaging,2) !r us(leg(DayOfImaging,1) $USp).Exp us(leg(DayOfImaging,2) $USg).grm(NID) !f mv
residual id(2142).idh(DayOfImaging $USe)
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=4gb
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=ASREML_Y1
errlog="logs/log_${SLURM_JOB_ID}.err"
outlog="logs/log_${SLURM_JOB_ID}.out"
module load asreml
asreml CV1.as
Cross validation results
Again, this is very similar to Scenario A. The difference is here, we are interested in predicting the phenotypes at later time points for the same lines. Therefore, we are only interested in doing the correlation between predicted gBLUPs and PSA for lines in the training set.
The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA at each of the three experiments using Pearson’s correlation method. For each fold, the average correlation over the three experiments was used.
The functions below were adapted from Mrode (2005) by Gota Morota.
##Return coefficient matrix (lambda) of n-th order Legendre polynomials. Scaling method implemented by Gengler et. al. (1999) converts constant Legendre polynomial coefficients into 1
`legendre` <-
function(n, gengler){
if (nargs()==1){
gengler <- TRUE
}
if (gengler != TRUE & gengler != FALSE){
gengler=TRUE
}
N <- n+1
L <- matrix(0,nrow=N, ncol=N)
for(i in (1:N)){
if(i==1){
L[i,i] <- 1
}
else if(i==2){
L[i,i] <- 1
}
else {
tmp <- L[i-1,]
tmp2 <- as.numeric()
tmp2 <- c(0,tmp[1:(N-1)])
L[i,] <- (1/(i-2+1))*( (2*(i-2) + 1)*tmp2 -(i-2)*L[i-2,] )
}
}
# Normalize
for (j in (1:N)){
L[j,] <- (sqrt( (2*(j-1)+1)/2) )*L[j,]
}
# Gengler (1999)
if (gengler==TRUE){
L <- sqrt(2)*L
}
return(L)
}
##Given time points covariate and order of fit for Legendre polynomials, return matrix 'M' containing the polynomials of standardized time. 'M' is order t (number of time points) by k (order of Legendre polynomials)
stdtime <- function(t, n, tmax, tmin){
if(missing(tmax)) {
tmax <- t[which.max(t)]
}
if(missing(tmin)) {
tmin <- t[which.min(t)]
}
N <- n+1
M <- matrix(0, nrow=length(t), ncol=N)
a <- -1 + 2*(t-tmin)/(tmax - tmin)
M[,1] <- 1
for (i in 2:N){
M[,i] <- a^(i-1)
}
return(M)
}
#This code will calculate the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)
setwd("~/Desktop/RR/ScenarioB/")
##CV dataset for scenario B
final <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenB.csv")
#This is the CV dataset from Scenario A. Here we'll use that since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA=mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
## Using PSA as value column: use value.var to override.
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]
#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
paste0("CV2Y", 6:10, ".sln"),
paste0("CV3Y", 11:15, ".sln"),
paste0("CV4Y", 16:20, ".sln"))
for (i in 1:length(Files)){
sln <- read.delim(Files[i], sep = "", header = T)
g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
sln[grep("2.NSFTV", sln$Level) ,][,3],
sln[grep("3.NSFTV", sln$Level) ,][,3]))
#gBLUPs for legendre polynomials
colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])
#Calculate gBLUPs at each time point
Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
colnames(ghat.t.y) <- 1:20
gBLUP <- melt(ghat.t.y)
colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")
if(i == 1 ){
final.blups=gBLUP
}else{
final.blups=cbind(final.blups, gBLUP[,3])
}
}
#This file contains all the gBLUP values for ach accession, run and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[:1-4:]", "", Files) )
library(plyr)
#Now we do the correlation and get the actual CV results
Cor.res <- matrix(0, ncol=20, nrow=20)
for(j in 1:length(Files)){
tmp <- final[c("NSFTV.ID", colnames(final.blups)[j+2])]
colnames(tmp)[2] <- "Y"
tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))
#Find which accessions are in the test set
test.acc <- tmp[ tmp$Cnt < 60 ,]$NSFTV.ID
#Merge the gBLUPs and observed PSA datasets
tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all=T)
#Only keep the data for individuals in the testing set
tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
colnames(tmp.df)[6] <- "gBLUP"
tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]
#Do the correlations
res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use="complete.obs") ) )[,2],
ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use="complete.obs") ) )[,2],
ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use="complete.obs") ) )[,2] )
Cor.res[,j] <- colMeans(res)
}
Plot the results
This is the code that was used to generate Figure 5B.
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)
#pdf("~/Desktop/RR/RR/Figures/ScenarioB.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))
plot(1:20, Cor.res.mean, ylim=c(0,1), cex=0.3, pch=19, xlab="Day of Imaging", ylab=expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd=1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd=1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd=1)
#dev.off()
Scenario C
Generate the dataset for cross validation
The CV dataset and script is identical to that used for Scenario B. Refer to the processes above for how to generate the CV dataset and run the asreml code.
Cross validation results
The code is nearly identical to that used for scenario B, however here we’ll subset the lines that are in the testing set.
#This code will calucalte the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)
setwd("~/Desktop/RR/ScenarioC/")
##CV dataset for scenario C
final <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenB.csv")
#This is the CV dataset from Scenario A. Here we'll use that since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA=mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
## Using PSA as value column: use value.var to override.
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]
#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
paste0("CV2Y", 6:10, ".sln"),
paste0("CV3Y", 11:15, ".sln"),
paste0("CV4Y", 16:20, ".sln"))
for (i in 1:length(Files)){
sln <- read.delim(Files[i], sep = "", header = T)
g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
sln[grep("2.NSFTV", sln$Level) ,][,3],
sln[grep("3.NSFTV", sln$Level) ,][,3]))
#gBLUPs for legendre polynomials
colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])
#Calculate gBLUPs at each time point
Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
colnames(ghat.t.y) <- 1:20
gBLUP <- melt(ghat.t.y)
colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")
if(i == 1 ){
final.blups=gBLUP
}else{
final.blups=cbind(final.blups, gBLUP[,3])
}
}
#This file contains all the gBLUP values for ach accession, run and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[:1-4:]", "", Files) )
library(plyr)
#Now we do the correlation and get the actual CV results
Cor.res <- matrix(0, ncol=20, nrow=20)
for(j in 1:length(Files)){
tmp <- final[c("NSFTV.ID", colnames(final.blups)[j+2])]
colnames(tmp)[2] <- "Y"
tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))
#Find which accessions are in the test set
test.acc <- tmp[ tmp$Cnt == 60 ,]$NSFTV.ID
#Merge the gBLUPs and observed PSA datasets
tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all=T)
#Only keep the data for individuals in the testing set
tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
colnames(tmp.df)[6] <- "gBLUP"
tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]
#Do the correlations
res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use="complete.obs") ) )[,2],
ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use="complete.obs") ) )[,2],
ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use="complete.obs") ) )[,2] )
Cor.res[,j] <- colMeans(res)
}
Plot the results
This is the code that was used to generate Figure 5C.
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)
#pdf("~/Desktop/RR/RR/Figures/ScenarioC.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))
plot(1:20, Cor.res.mean, ylim=c(0,1), cex=0.3, pch=19, xlab="Day of Imaging", ylab=expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd=1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd=1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd=1)
#dev.off()