---
title: "Scenarios B and C"
author: "Malachy Campbell"
date: "8/7/2017"
output:
  rmdformats::html_clean:
    fig_width: 6
    fig_height: 6
    highlight: kate
    thumbnails: true
    lightbox: true
    gallery: true
---

```{r setup, include=FALSE}
knitr::opts_knit$set(root.dir = '~/Desktop/RR/')
```

#Background

The aim of scenario B was to determine whether traits at later time points can be predicted for known lines using information from earlier time points; it can therefore be considered a forecasting approach. Here, a RR model was fitted using the first ten time points and was used to predict the phenotypes for the same set of lines at the last ten time points. The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA in each of the three experiments using Pearson's correlation. The lines were randomly assigned to each fold, and the process was repeated 20 times. For each fold, the average correlation over the three experiments was used.

The aim of scenario C is similar to B; however, here we are interested in predicting future phenotypes for unknown lines. This can be thought of as a forecasting approach for unobserved, or undeveloped, lines. Here, a RR model was fitted using PSA at the first ten time points for the training set and was used to predict the phenotypes for the testing set of lines at the last ten time points. Again, accuracy was determined using Pearson's correlation.

#Scenario B

##Generate the dataset for cross validation

Here, we will reuse the CV file created in Scenario A, so I won't repeat the code used to generate it. We'll load that file and keep only the first 10 time points.

```{r scenario B CV file, eval = F}
final <- read.csv("~/Desktop/RR/ScenarioA/RR/RR_CV.csv")
final <- final[final$DayOfImaging < 11 ,]

write.csv(final, "~/Desktop/RR/ScenarioB/RR_CV_ScenB.csv", row.names = F)
```

##Running asreml for CV

Again, as for Scenario A, CV was run on our computing cluster at UNL. Four jobs were created, and within each job the RR model was fitted for five resampling runs. Below is an example .as file, as well as the slurm script used to submit the job. Note that there are ten starting values for the error term, since only ten time points are used.

```{asreml RR CV, eval = F}
!RENAME !ARG Y1 Y2 Y3 Y4 Y5
RR scenario B
 NID !A
 PSA !/100000
 DayOfImaging 10 !I
 Rep 2
 Exp !A
 Y1 !/100000
 Y2 !/100000
 Y3 !/100000
 Y4 !/100000
 Y5 !/100000
 Y6 !/100000
 Y7 !/100000
 Y8 !/100000
 Y9 !/100000
 Y10 !/100000
 Y11 !/100000
 Y12 !/100000
 Y13 !/100000
 Y14 !/100000
 Y15 !/100000
 Y16 !/100000
 Y17 !/100000
 Y18 !/100000
 Y19 !/100000
 Y20 !/100000
G2.grm
RR_CV_ScenB.csv !SKIP 1 !MAXITER 1000 !EXTRA 100 !WORKSPACE 6144 !ASUV

!ASSIGN USe !< !INIT
8.02E-04 1.18E-03 1.85E-03 2.84E-03 4.44E-03 6.64E-03 9.97E-03 1.46E-02 2.09E-02 3.19E-02
!>
!ASSIGN USg !< !INIT
0.538879E-01 0.280743E-01 0.147578E-01 0.526344E-02 0.280003E-02 0.540146E-03
!>
!ASSIGN USp !< !INIT
0.554204E-02 0.359945E-02 0.240679E-02
!>

$1 ~ mu leg(DayOfImaging,2) !r us(leg(DayOfImaging,1) $USp).Exp us(leg(DayOfImaging,2) $USg).grm(NID) !f mv
residual id(2142).idh(DayOfImaging $USe)
```

```{slurm script for RR CV, eval = F}
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=4gb
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=ASREML_Y1

errlog="logs/log_${SLURM_JOB_ID}.err"
outlog="logs/log_${SLURM_JOB_ID}.out"

module load asreml

asreml CV1.as
```
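Before moving on, it may help to see how the 20 resampling runs are split across the four asreml jobs. The sketch below is purely illustrative (it is not part of the original pipeline): each .as file cycles over five of the resampling columns via !ARG, and with !RENAME each run writes its own set of output files (e.g. CV1Y1.sln), which is what the cross-validation code below reads back in.

```{r arg layout sketch, eval = F}
#Illustrative sketch: map the 20 resampling columns (Y1-Y20) onto the four
#asreml jobs, five !ARG values per job. The naming mirrors the .sln files
#(CV1Y1.sln ... CV4Y20.sln) read in below.
args <- split(paste0("Y", 1:20), rep(1:4, each = 5))
for (b in seq_along(args)){
  cat(sprintf("CV%d.as: !RENAME !ARG %s\n", b, paste(args[[b]], collapse = " ")))
}
```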
##Cross validation results

Again, this is very similar to Scenario A. The difference is that here we are interested in predicting the phenotypes at later time points for the *same lines*. Therefore, we only compute the correlation between predicted gBLUPs and observed PSA for lines in the training set. The accuracy of prediction was assessed by comparing predicted gBLUPs with observed PSA in each of the three experiments using Pearson's correlation, and for each fold the average correlation over the three experiments was used. The functions below were adapted from Mrode (2005) by [Gota Morota](http://morotalab.org/Mrode2005/rr/rr.html#section00020000000000000000).

```{r define functions for RR}
##Return the coefficient matrix (lambda) of n-th order Legendre polynomials. The scaling method of Gengler et al. (1999) converts the coefficient of the constant Legendre polynomial to 1.
`legendre` <- function(n, gengler){
  if (nargs() == 1){
    gengler <- TRUE
  }
  if (gengler != TRUE & gengler != FALSE){
    gengler <- TRUE
  }

  N <- n + 1
  L <- matrix(0, nrow = N, ncol = N)

  for (i in (1:N)){
    if (i == 1){
      L[i, i] <- 1
    } else if (i == 2){
      L[i, i] <- 1
    } else {
      tmp <- L[i - 1, ]
      tmp2 <- as.numeric()
      tmp2 <- c(0, tmp[1:(N - 1)])
      L[i, ] <- (1/(i - 2 + 1))*((2*(i - 2) + 1)*tmp2 - (i - 2)*L[i - 2, ])
    }
  }

  #Normalize
  for (j in (1:N)){
    L[j, ] <- (sqrt((2*(j - 1) + 1)/2))*L[j, ]
  }

  #Gengler (1999)
  if (gengler == TRUE){
    L <- sqrt(2)*L
  }

  return(L)
}

##Given the time point covariate and the order of fit for the Legendre polynomials, return the matrix 'M' containing the polynomials of standardized time. 'M' is of order t (number of time points) by k (order of the Legendre polynomials + 1).
stdtime <- function(t, n, tmax, tmin){
  if (missing(tmax)){
    tmax <- t[which.max(t)]
  }
  if (missing(tmin)){
    tmin <- t[which.min(t)]
  }

  N <- n + 1
  M <- matrix(0, nrow = length(t), ncol = N)
  a <- -1 + 2*(t - tmin)/(tmax - tmin)

  M[, 1] <- 1
  for (i in 2:N){
    M[, i] <- a^(i - 1)
  }

  return(M)
}
```
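Before these functions are used in the loop below, it may help to see what they produce. The short sketch that follows (not evaluated here) builds the same matrix `Phi` that appears in the next chunk: `stdtime()` standardizes the 20 days of imaging to the interval [-1, 1] and returns their polynomial terms, `legendre()` supplies the coefficients of the second-order Legendre polynomials, and their product maps the three random regression coefficients of a line onto the 20 time points.

```{r Phi sketch, eval = F}
#Sketch: Phi has one row per day of imaging and one column per Legendre coefficient
Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
dim(Phi) #20 x 3

#Given the three estimated random regression coefficients for a line (g.hat),
#its gBLUPs at the 20 time points are Phi %*% g.hat
```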
```{r RR Scen B CV get gBLUPs for each run, eval = T}
#This code will calculate the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)

setwd("~/Desktop/RR/ScenarioB/")

##CV dataset for scenario B
final <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenB.csv")

#This is the CV dataset from Scenario A. We'll use it here since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioB/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA = mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]

#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
           paste0("CV2Y", 6:10, ".sln"),
           paste0("CV3Y", 11:15, ".sln"),
           paste0("CV4Y", 16:20, ".sln"))

for (i in 1:length(Files)){
  sln <- read.delim(Files[i], sep = "", header = T)

  #gBLUPs for the Legendre polynomial coefficients
  g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
                     sln[grep("2.NSFTV", sln$Level) ,][,3],
                     sln[grep("3.NSFTV", sln$Level) ,][,3]))
  colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])

  #Calculate gBLUPs at each time point
  Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
  ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
  colnames(ghat.t.y) <- 1:20

  gBLUP <- melt(ghat.t.y)
  colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")

  if (i == 1){
    final.blups <- gBLUP
  } else {
    final.blups <- cbind(final.blups, gBLUP[,3])
  }
}

#This object contains the gBLUP values for each accession, run, and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[1-4]", "", Files))
```

```{r RR CV results scenario B, eval = T}
library(plyr)

#Now we do the correlations and get the actual CV results
Cor.res <- matrix(0, ncol = 20, nrow = 20)

for (j in 1:length(Files)){
  tmp <- final[c("NSFTV.ID", colnames(final.blups)[j + 2])]
  colnames(tmp)[2] <- "Y"
  tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))

  #Find which accessions are in the training set (lines with observed records in the first ten days)
  test.acc <- tmp[tmp$Cnt < 60 ,]$NSFTV.ID

  #Merge the gBLUPs and observed PSA datasets
  tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all = T)

  #Only keep the data for individuals in the training set
  tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
  colnames(tmp.df)[6] <- "gBLUP"
  tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
  tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]

  #Do the correlations within each experiment and average over experiments
  res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use = "complete.obs")))[,2])

  Cor.res[,j] <- colMeans(res)
}
```

##Plot the results

This is the code that was used to generate Figure 5B.

```{r plot results for scenario B, eval = T}
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)

#pdf("~/Desktop/RR/RR/Figures/ScenarioB.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))

plot(1:20, Cor.res.mean, ylim = c(0,1), cex = 0.3, pch = 19, xlab = "Day of Imaging", ylab = expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd = 1)

#dev.off()
```

#Scenario C

##Generate the dataset for cross validation

The CV dataset and script are identical to those used for Scenario B. Refer to the steps above for how to generate the CV dataset and run the asreml code.

##Cross validation results

The code is nearly identical to that used for Scenario B; the difference is that here we subset the lines that are in the testing set, as sketched below.
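The only substantive change from Scenario B is how the lines of interest are identified. My reading of the design is that each accession has 2 reps x 3 experiments x 10 retained days = 60 records in the CV file (check this count against your own data); an accession with all 60 phenotypes missing in a resampling column was therefore completely masked and belongs to the testing set, whereas any smaller count marks a line that was observed in the first ten days. A minimal sketch of that selection, using the resampling column Y1:

```{r test set selection sketch, eval = F}
#Sketch (assumes 2 reps x 3 experiments x 10 days = 60 records per accession):
#Cnt == 60 means the line was fully masked (testing set, used in Scenario C);
#Cnt < 60 means the line was observed in the first ten days (used in Scenario B).
miss <- ddply(final[c("NSFTV.ID", "Y1")], .(NSFTV.ID), summarise, Cnt = sum(is.na(Y1)))
trn.acc <- miss[miss$Cnt < 60 ,]$NSFTV.ID
tst.acc <- miss[miss$Cnt == 60 ,]$NSFTV.ID
```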
```{r RR CV Scenario C get gBLUPs for each run, eval = T}
#This code will calculate the gBLUPs at each time point from the asreml .sln files.
library(plyr)
library(reshape2)

setwd("~/Desktop/RR/ScenarioC/")

##CV dataset for scenario C
final <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenB.csv")

#This is the CV dataset from Scenario A. We'll use it here since it has PSA for all 20 time points.
PSA <- read.csv("~/Desktop/RR/ScenarioC/RR_CV_ScenA.csv")
PSA <- PSA[1:5]
PSA <- ddply(PSA, .(NSFTV.ID, Exp, DayOfImaging), summarise, PSA = mean(PSA, na.rm = T))
PSA <- dcast(PSA, NSFTV.ID + DayOfImaging ~ Exp)
PSA <- PSA[order(PSA$DayOfImaging, PSA$NSFTV.ID) ,]

#Create an object that lists the file names for all CV runs. Recall that the CV was split up into four batches, each with 5 jobs.
Files <- c(paste0("CV1Y", 1:5, ".sln"),
           paste0("CV2Y", 6:10, ".sln"),
           paste0("CV3Y", 11:15, ".sln"),
           paste0("CV4Y", 16:20, ".sln"))

for (i in 1:length(Files)){
  sln <- read.delim(Files[i], sep = "", header = T)

  #gBLUPs for the Legendre polynomial coefficients
  g.hat.y <- t(cbind(sln[grep("1.NSFTV", sln$Level) ,][,3],
                     sln[grep("2.NSFTV", sln$Level) ,][,3],
                     sln[grep("3.NSFTV", sln$Level) ,][,3]))
  colnames(g.hat.y) <- sub("1.", "", sln[grep("1.NSFTV", sln$Level) ,][,2])

  #Calculate gBLUPs at each time point
  Phi <- stdtime(1:20, 2) %*% t(legendre(2, gengler = F))
  ghat.t.y <- t(apply(g.hat.y, 2, function (x) Phi %*% x))
  colnames(ghat.t.y) <- 1:20

  gBLUP <- melt(ghat.t.y)
  colnames(gBLUP) <- c("NSFTV.ID", "DayOfImaging", "gBLUP")

  if (i == 1){
    final.blups <- gBLUP
  } else {
    final.blups <- cbind(final.blups, gBLUP[,3])
  }
}

#This object contains the gBLUP values for each accession, run, and time point
colnames(final.blups)[3:ncol(final.blups)] <- sub(".sln", "", sub("CV[1-4]", "", Files))
```

```{r RR CV results scenario C, eval = T}
library(plyr)

#Now we do the correlations and get the actual CV results
Cor.res <- matrix(0, ncol = 20, nrow = 20)

for (j in 1:length(Files)){
  tmp <- final[c("NSFTV.ID", colnames(final.blups)[j + 2])]
  colnames(tmp)[2] <- "Y"
  tmp <- ddply(tmp, .(NSFTV.ID), summarise, Cnt = sum(is.na(Y)))

  #Find which accessions are in the testing set (all records masked)
  test.acc <- tmp[tmp$Cnt == 60 ,]$NSFTV.ID

  #Merge the gBLUPs and observed PSA datasets
  tmp.df <- merge(PSA, final.blups, by = c("NSFTV.ID", "DayOfImaging"), all = T)

  #Only keep the data for individuals in the testing set
  tmp.df <- cbind(tmp.df[1:5], tmp.df[, (5 + j)])
  colnames(tmp.df)[6] <- "gBLUP"
  tmp.df <- tmp.df[order(tmp.df$DayOfImaging, tmp.df$NSFTV.ID) ,]
  tmp.df <- tmp.df[tmp.df$NSFTV.ID %in% test.acc ,]

  #Do the correlations within each experiment and average over experiments
  res <- rbind(ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E1, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E2, x$gBLUP, use = "complete.obs")))[,2],
               ldply(dlply(tmp.df, .(DayOfImaging), function(x) cor(x$E3, x$gBLUP, use = "complete.obs")))[,2])

  Cor.res[,j] <- colMeans(res)
}
```

##Plot the results

This is the code that was used to generate Figure 5C.

```{r plot results for scenario C, eval = T}
Cor.res.mean <- apply(Cor.res, 1, mean)
Cor.res.sd <- apply(Cor.res, 1, sd)

#pdf("~/Desktop/RR/RR/Figures/ScenarioC.pdf", h=2.1, w=3.5, useDingbats = F, pointsize = 10)
#par(mar=c(3,3,1,.2), mgp=c(1.8,0.5,0))

plot(1:20, Cor.res.mean, ylim = c(0,1), cex = 0.3, pch = 19, xlab = "Day of Imaging", ylab = expression(italic(r)))
lines(1:20, Cor.res.mean)
segments(1:20, Cor.res.mean - Cor.res.sd, 1:20, Cor.res.mean + Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean - Cor.res.sd, 1:20 + 0.1, Cor.res.mean - Cor.res.sd, lwd = 1)
segments(1:20 - 0.1, Cor.res.mean + Cor.res.sd, 1:20 + 0.1, Cor.res.mean + Cor.res.sd, lwd = 1)

#dev.off()
```
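If you want the numbers behind Figures 5B and 5C rather than the plots, a small sketch along these lines tabulates the per-day mean accuracy and standard deviation across the 20 resampling runs (the output file name is just a placeholder):

```{r accuracy summary sketch, eval = F}
#Sketch: rows of Cor.res are days of imaging, columns are resampling runs
acc.summary <- data.frame(DayOfImaging = 1:20,
                          r.mean = apply(Cor.res, 1, mean),
                          r.sd = apply(Cor.res, 1, sd))
head(acc.summary)

#write.csv(acc.summary, "~/Desktop/RR/ScenarioC/RR_CV_accuracy_ScenC.csv", row.names = F) #placeholder path
```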