---
title: "Scenarios B and C"
author: "Malachy Campbell"
date: "8/7/2017"
output:
  rmdformats::html_clean:
    fig_width: 6
    fig_height: 6
    highlight: kate
    thumbnails: true
    lightbox: true
    gallery: true
---

```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = '~/Desktop/RR/')
```

#Background

The aim of scenario B was to determine whether traits at later time points can be predicted for known lines using information from earlier time points; it can therefore be considered a forecasting approach. Here, a RR model was fitted using the first ten time points and was used to predict the phenotypes for the same set of lines at the last ten time points. The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA in each of the three experiments using Pearson's correlation. The lines were randomly assigned to each fold, and the process was repeated 20 times. For each fold, the average correlation over the three experiments was used.

The aim of scenario C is similar to B; however, here we are interested in predicting future phenotypes for unknown lines. This can be thought of as a forecasting approach for unobserved, or undeveloped, lines. Here, a RR model was fitted using PSA at the first ten time points for the training set and was used to predict the phenotypes for the testing set of lines at the last ten time points. Again, accuracy was determined using Pearson's correlation.

#Scenario B

##Generate the dataset for cross validation

Here, we will reuse the CV file created in Scenario A, so I won't repeat the code used to generate it. We'll load that file and keep only the first 10 time points.

```{r scenario B CV file, eval = F}
final <- read.csv("~/Desktop/RR/ScenarioA/RR/RR_CV.csv")
final <- final[final$DayOfImaging < 11 ,]

write.csv(final, "~/Desktop/RR/ScenarioB/RR_CV_ScenB.csv", row.names = F)
```

##Running asreml for CV

Again, as for Scenario A, CV was run on our computing cluster at UNL. Four jobs were created, and within each job the RR model was fitted for five resampling runs. Below is an example .as file, as well as the slurm script used to submit the job. Note that there are ten starting values for the error term, since only ten time points are used.

```{asreml RR CV, eval = F}
!RENAME !ARG Y1 Y2 Y3 Y4 Y5
RR scenario B
 NID !A
 PSA !/100000
 DayOfImaging 10 !I
 Rep 2
 Exp !A
 Y1 !/100000
 Y2 !/100000
 Y3 !/100000
 Y4 !/100000
 Y5 !/100000
 Y6 !/100000
 Y7 !/100000
 Y8 !/100000
 Y9 !/100000
 Y10 !/100000
 Y11 !/100000
 Y12 !/100000
 Y13 !/100000
 Y14 !/100000
 Y15 !/100000
 Y16 !/100000
 Y17 !/100000
 Y18 !/100000
 Y19 !/100000
 Y20 !/100000
G2.grm
RR_CV_ScenB.csv !SKIP 1 !MAXITER 1000 !EXTRA 100 !WORKSPACE 6144 !ASUV

!ASSIGN USe !< !INIT
8.02E-04 1.18E-03 1.85E-03 2.84E-03 4.44E-03 6.64E-03 9.97E-03 1.46E-02 2.09E-02 3.19E-02
!>
!ASSIGN USg !< !INIT
0.538879E-01 0.280743E-01 0.147578E-01 0.526344E-02 0.280003E-02 0.540146E-03
!>
!ASSIGN USp !< !INIT
0.554204E-02 0.359945E-02 0.240679E-02
!>

$1 ~ mu leg(DayOfImaging,2) !r us(leg(DayOfImaging,1) $USp).Exp us(leg(DayOfImaging,2) $USg).grm(NID) !f mv
residual id(2142).idh(DayOfImaging $USe)
```

```{slurm script for RR CV, eval = F}
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=4gb
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=ASREML_Y1

errlog="logs/log_${SLURM_JOB_ID}.err"
outlog="logs/log_${SLURM_JOB_ID}.out"

module load asreml

asreml CV1.as
```
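Before moving on, it may help to see how the 20 resampling runs are split across the four asreml jobs. The sketch below is purely illustrative (it is not part of the original pipeline): each .as file cycles over five of the resampling columns via !ARG, and with !RENAME each run writes its own set of output files (e.g. CV1Y1.sln), which is what the cross-validation code below reads back in.

```{r arg layout sketch, eval = F}
#Illustrative sketch: map the 20 resampling columns (Y1-Y20) onto the four
#asreml jobs, five !ARG values per job. The naming mirrors the .sln files
#(CV1Y1.sln ... CV4Y20.sln) read in below.
args <- split(paste0("Y", 1:20), rep(1:4, each = 5))
for (b in seq_along(args)){
  cat(sprintf("CV%d.as: !RENAME !ARG %s\n", b, paste(args[[b]], collapse = " ")))
}
```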
##Cross validation results

Again, this is very similar to Scenario A. The difference is that here we are interested in predicting the phenotypes at later time points for the *same lines*. Therefore, we only compute the correlation between predicted gBLUPs and observed PSA for lines in the training set. The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA in each of the three experiments using Pearson's correlation, and for each fold the average correlation over the three experiments was used. The functions below were adapted from Mrode (2005) by [Gota Morota](http://morotalab.org/Mrode2005/rr/rr.html#section00020000000000000000).

```{r define functions for RR}
##Return the coefficient matrix (lambda) of n-th order Legendre polynomials. The scaling method of Gengler et al. (1999) converts the coefficient of the constant Legendre polynomial to 1.
`legendre` <- function(n, gengler){
  if (nargs() == 1){
    gengler <- TRUE
  }
  if (gengler != TRUE & gengler != FALSE){
    gengler <- TRUE
  }

  N <- n + 1
  L <- matrix(0, nrow = N, ncol = N)

  for (i in (1:N)){
    if (i == 1){
      L[i, i] <- 1
    } else if (i == 2){
      L[i, i] <- 1
    } else {
      tmp <- L[i - 1, ]
      tmp2 <- as.numeric()
      tmp2 <- c(0, tmp[1:(N - 1)])
      L[i, ] <- (1/(i - 2 + 1))*((2*(i - 2) + 1)*tmp2 - (i - 2)*L[i - 2, ])
    }
  }

  #Normalize
  for (j in (1:N)){
    L[j, ] <- (sqrt((2*(j - 1) + 1)/2))*L[j, ]
  }

  #Gengler (1999)
  if (gengler == TRUE){
    L <- sqrt(2)*L
  }

  return(L)
}

##Given the time point covariate and the order of fit for the Legendre polynomials, return the matrix 'M' containing the polynomials of standardized time. 'M' is of order t (number of time points) by k (order of the Legendre polynomials + 1).
stdtime <- function(t, n, tmax, tmin){
  if (missing(tmax)){
    tmax <- t[which.max(t)]
  }
  if (missing(tmin)){
    tmin <- t[which.min(t)]
  }

  N <- n + 1
  M <- matrix(0, nrow = length(t), ncol = N)
  a <- -1 + 2*(t - tmin)/(tmax - tmin)

  M[, 1] <- 1
  for (i in 2:N){
    M[, i] <- a^(i - 1)
  }

  return(M)
}
```
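Before these functions are used in the loop below, it may help to see what they produce. The short sketch that follows (not evaluated here) builds the same matrix `Phi` that appears in the next chunk: `stdtime()` standardizes the 20 days of imaging to the interval [-1, 1] and returns their polynomial terms, `legendre()` supplies the coefficients of the second-order Legendre polynomials, and their product maps the three random regression coefficients of a line onto the 20 time points.

```{r Phi sketch, eval = F}
#Sketch: Phi has one row per day of imaging and one column per Legendre coefficient
Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
dim(Phi) #20 x 3

#Given the three estimated random regression coefficients for a line (g.hat),
#its gBLUPs at the 20 time points are Phi %*% g.hat
```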
```{r RR Scen B CV get gBLUPs for each run, eval = T}
#This code will calculate the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)

setwd("~/Desktop/RR/ScenarioB/")

##CV dataset for scenario B
final <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenB.csv")

#This is the CV dataset from Scenario A. We'll use it here since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA = mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]

#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
           paste0("CV2Y", 6:10, ".sln"),
           paste0("CV3Y", 11:15, ".sln"),
           paste0("CV4Y", 16:20, ".sln"))

for (i in 1:length(Files)){
  sln <- read.delim(Files[i], sep = "", header = T)

  #gBLUPs for the Legendre polynomial coefficients
  g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
                     sln[grep("2.NSFTV", sln$Level) ,][,3],
                     sln[grep("3.NSFTV", sln$Level) ,][,3]))
  colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])

  #Calculate gBLUPs at each time point
  Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
  ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
  colnames(ghat.t.y) <- 1:20

  gBLUP <- melt(ghat.t.y)
  colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")

  if (i == 1){
    final.blups <- gBLUP
  } else {
    final.blups <- cbind(final.blups, gBLUP[,3])
  }
}

#This object contains the gBLUP values for each accession, run, and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[1-4]", "", Files))
```

```{r RR CV results scenario B, eval = T}
library(plyr)

#Now we do the correlations and get the actual CV results
Cor.res <- matrix(0, ncol = 20, nrow = 20)

for (j in 1:length(Files)){
  tmp <- final[c("NSFTV.ID", colnames(final.blups)[j + 2])]
  colnames(tmp)[2] <- "Y"
  tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))

  #Find which accessions are in the training set (lines with observed records in the first ten days)
  test.acc <- tmp[tmp$Cnt < 60 ,]$NSFTV.ID

  #Merge the gBLUPs and observed PSA datasets
  tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all = T)

  #Only keep the data for individuals in the training set
  tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
  colnames(tmp.df)[6] <- "gBLUP"
  tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
  tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]

  #Do the correlations within each experiment and average over experiments
  res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use = "complete.obs")))[,2])

  Cor.res[,j] <- colMeans(res)
}
```

##Plot the results

This is the code that was used to generate Figure 5B.

```{r plot results for scenario B, eval = T}
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)

#pdf("~/Desktop/RR/RR/Figures/ScenarioB.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))

plot(1:20, Cor.res.mean, ylim = c(0,1), cex = 0.3, pch = 19, xlab = "Day of Imaging", ylab = expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd = 1)

#dev.off()
```

#Scenario C

##Generate the dataset for cross validation

The CV dataset and script are identical to those used for Scenario B. Refer to the steps above for how to generate the CV dataset and run the asreml code.

##Cross validation results

The code is nearly identical to that used for Scenario B; the difference is that here we subset the lines that are in the testing set, as sketched below.
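The only substantive change from Scenario B is how the lines of interest are identified. My reading of the design is that each accession has 2 reps x 3 experiments x 10 retained days = 60 records in the CV file (check this count against your own data); an accession with all 60 phenotypes missing in a resampling column was therefore completely masked and belongs to the testing set, whereas any smaller count marks a line that was observed in the first ten days. A minimal sketch of that selection, using the resampling column Y1:

```{r test set selection sketch, eval = F}
#Sketch (assumes 2 reps x 3 experiments x 10 days = 60 records per accession):
#Cnt == 60 means the line was fully masked (testing set, used in Scenario C);
#Cnt < 60 means the line was observed in the first ten days (used in Scenario B).
miss <- ddply(final[c("NSFTV.ID", "Y1")], .(NSFTV.ID), summarise, Cnt = sum(is.na(Y1)))
trn.acc <- miss[miss$Cnt < 60 ,]$NSFTV.ID
tst.acc <- miss[miss$Cnt == 60 ,]$NSFTV.ID
```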
```{r RR CV Scenario C get gBLUPs for each run, eval = T}
#This code will calculate the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)

setwd("~/Desktop/RR/ScenarioC/")

##CV dataset for scenario C
final <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenB.csv")

#This is the CV dataset from Scenario A. We'll use it here since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA = mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]

#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
           paste0("CV2Y", 6:10, ".sln"),
           paste0("CV3Y", 11:15, ".sln"),
           paste0("CV4Y", 16:20, ".sln"))

for (i in 1:length(Files)){
  sln <- read.delim(Files[i], sep = "", header = T)

  #gBLUPs for the Legendre polynomial coefficients
  g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
                     sln[grep("2.NSFTV", sln$Level) ,][,3],
                     sln[grep("3.NSFTV", sln$Level) ,][,3]))
  colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])

  #Calculate gBLUPs at each time point
  Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
  ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
  colnames(ghat.t.y) <- 1:20

  gBLUP <- melt(ghat.t.y)
  colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")

  if (i == 1){
    final.blups <- gBLUP
  } else {
    final.blups <- cbind(final.blups, gBLUP[,3])
  }
}

#This object contains the gBLUP values for each accession, run, and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[1-4]", "", Files))
```

```{r RR CV results scenario C, eval = T}
library(plyr)

#Now we do the correlations and get the actual CV results
Cor.res <- matrix(0, ncol = 20, nrow = 20)

for (j in 1:length(Files)){
  tmp <- final[c("NSFTV.ID", colnames(final.blups)[j + 2])]
  colnames(tmp)[2] <- "Y"
  tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))

  #Find which accessions are in the testing set (all records masked)
  test.acc <- tmp[tmp$Cnt == 60 ,]$NSFTV.ID

  #Merge the gBLUPs and observed PSA datasets
  tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all = T)

  #Only keep the data for individuals in the testing set
  tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
  colnames(tmp.df)[6] <- "gBLUP"
  tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
  tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]

  #Do the correlations within each experiment and average over experiments
  res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use = "complete.obs")))[,2])

  Cor.res[,j] <- colMeans(res)
}
```

##Plot the results

This is the code that was used to generate Figure 5C.

```{r plot results for scenario C, eval = T}
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)

#pdf("~/Desktop/RR/RR/Figures/ScenarioC.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))

plot(1:20, Cor.res.mean, ylim = c(0,1), cex = 0.3, pch = 19, xlab = "Day of Imaging", ylab = expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd = 1)

#dev.off()
```
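If you want the numbers behind Figures 5B and 5C rather than the plots, a small sketch along these lines tabulates the per-day mean accuracy and standard deviation across the 20 resampling runs (the output file name is just a placeholder):

```{r accuracy summary sketch, eval = F}
#Sketch: rows of Cor.res are days of imaging, columns are resampling runs
acc.summary <- data.frame(DayOfImaging = 1:20,
                          r.mean = apply(Cor.res, 1, mean),
                          r.sd = apply(Cor.res, 1, sd))
head(acc.summary)

#write.csv(acc.summary, "~/Desktop/RR/ScenarioC/RR_CV_accuracy_ScenC.csv", row.names = F) #placeholder path
```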