Nick Kennedy
Clinical Research Fellow
GI Unit, IGMM
University of Edinburgh
16/09/2015
for, lapply, plyr package
lapply
set.seed(123)
my_list <- list(a = rnorm(100), b = rnorm(50), c = runif(20))
lapply(my_list, mean)
## $a
## [1] 0.09040591
##
## $b
## [1] -0.2539004
##
## $c
## [1] 0.4793934
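For comparison, here is a minimal sketch of the same computation written as a for loop with a pre-allocated result, using base R only. The names input_list, means_loop and means_lapply are illustrative and do not come from the slides.

```r
set.seed(123)
input_list <- list(a = rnorm(100), b = rnorm(50), c = runif(20))

# Pre-allocate the output list, then fill it one element at a time
means_loop <- vector("list", length(input_list))
names(means_loop) <- names(input_list)
for (nm in names(input_list)) {
  means_loop[[nm]] <- mean(input_list[[nm]])
}

# lapply performs the same iteration in a single call
means_lapply <- lapply(input_list, mean)
identical(means_loop, means_lapply)
```

The loop needs explicit bookkeeping for the output container and its names, which lapply handles implicitly.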
Image copyright Hadley Wickham
| input \ output | l | a | d | _ |
|---|---|---|---|---|
| l | llply | laply | ldply | l_ply |
| a | alply | aaply | adply | a_ply |
| d | dlply | daply | ddply | d_ply |
| r | rlply | raply | rdply | r_ply |
| m | mlply | maply | mdply | m_ply |
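The naming scheme in the table can be sketched by applying one function to the same list with each output type. The input scores is a made-up example, not data from the slides.

```r
library(plyr)

scores <- list(a = 1:3, b = 4:6)  # illustrative input, not from the slides

as_list  <- llply(scores, sum)   # list in, list out
as_array <- laply(scores, sum)   # list in, array (here a plain vector) out
as_df    <- ldply(scores, sum)   # list in, data.frame out
l_ply(scores, function(x) NULL)  # list in, output discarded (side effects only)

class(as_list)
class(as_df)
```

The first letter of the function name is the input type and the second is the output type, so switching output format only means changing one letter.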
llply
The plyr counterpart of lapply.
llply(my_list, mean)
## $a
## [1] 0.09040591
##
## $b
## [1] -0.2539004
##
## $c
## [1] 0.4793934
laply
Like sapply, but will always return an array (whereas sapply will return a list if the output is ragged).
llply(my_list, mean)
## $a
## [1] 0.09040591
##
## $b
## [1] -0.2539004
##
## $c
## [1] 0.4793934
# laply(my_list, I)
# Will return an error because the elements have different lengths
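The difference from sapply can be sketched with two small lists. The inputs and the names same_length and ragged are illustrative and not from the slides.

```r
library(plyr)

# Every result has length 2, so laply can stack them into a matrix
same_length <- laply(list(a = 1:3, b = 1:5), range)
dim(same_length)

# Results of different lengths cannot form an array, so laply signals an error
ragged <- tryCatch(
  laply(list(a = 1:2, b = 1:3), function(x) x),
  error = function(e) e
)
inherits(ragged, "error")

# sapply silently falls back to a list in the same situation
is.list(sapply(list(a = 1:2, b = 1:3), function(x) x))
```

Failing loudly can be preferable in scripts, since downstream code can rely on always receiving an array.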
ldply
ldply(my_list, mean)
## .id V1
## 1 a 0.09040591
## 2 b -0.25390043
## 3 c 0.47939338
ldply(my_list, function(x) data.frame(mean = mean(x), sd = sd(x)))
## .id mean sd
## 1 a 0.09040591 0.9128159
## 2 b -0.25390043 0.9893339
## 3 c 0.47939338 0.2830888
l_ply
layout(1:3)
l_ply(my_list, hist)
layout(1)
aaply
Analogous to apply in base R.
Always returns an array (base::apply may return a list if the result cannot be simplified).
The order of the result's dimensions differs from apply in base R.
aaply(.data, .margins, .fun = identity) will return the same as aperm(.data, c(.margins, (1:length(dim(.data)))[-.margins])).
x <- array(1:24, c(2, 3, 4))
all(aaply(x, 2, .fun = identity) == aperm(x, c(2, 1, 3)))
## [1] TRUE
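The difference in dimension order between the two functions can be sketched on a small matrix. The matrix m is an illustrative input, not from the slides.

```r
library(plyr)

m <- matrix(1:12, nrow = 3, ncol = 4)

# base::apply puts the function's output along the first dimension
base_result <- apply(m, 1, range)
dim(base_result)

# aaply keeps the split dimension first and appends the output dimension
plyr_result <- aaply(m, 1, range)
dim(plyr_result)

# Same numbers, transposed layout
all(t(base_result) == plyr_result)
```

With aaply each row of the result corresponds to a row of the input, which avoids the transpose that often follows a row-wise apply call.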
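aaply
Can be used for row-wise or column-wise summaries of a matrix (where an equivalent function exists, the matrixStats package should be preferred).
set.seed(913)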
x <- matrix(rnorm(100), 10, 10)
aaply(x, 1, sd)
## 1 2 3 4 5 6 7
## 0.9194631 0.8210030 0.5969373 0.8796405 0.9995688 1.2863265 0.8085736
## 8 9 10
## 0.8429865 0.9946271 1.1209652
aaply on a data.frame
The input to aaply can be a data.frame instead of an array or matrix.
aaply on margin 2 will apply the function to each column and return a vector or array as one might expect.
On margin 1 (row-wise), the result depends on .expand, which defaults to TRUE. If the data.frame has 4 columns and the function result is a scalar, the final result will have 4 dimensions (one for each column), and each dimension will have every possible value of that variable.
my_fun <- function(r) r$Sepal.Length + r$Petal.Length
iris_data <- iris[, c("Sepal.Length", "Petal.Length")]
aaply(iris_data, 1, my_fun)[1:10, 1:10]
## Petal.Length
## Sepal.Length 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.9 3
## 4.3 NA 5.4 NA NA NA NA NA NA NA NA
## 4.4 NA NA NA 5.7 5.8 NA NA NA NA NA
## 4.5 NA NA NA 5.8 NA NA NA NA NA NA
## 4.6 5.6 NA NA NA 6.0 6.1 NA NA NA NA
## 4.7 NA NA NA 6.0 NA NA 6.3 NA NA NA
## 4.8 NA NA NA NA 6.2 NA 6.4 NA 6.7 NA
## 4.9 NA NA NA NA 6.3 6.4 NA NA NA NA
## 5 NA NA 6.2 6.3 6.4 6.5 6.6 NA NA NA
## 5.1 NA NA NA NA 6.5 6.6 6.7 6.8 7.0 8.1
## 5.2 NA NA NA NA 6.6 6.7 NA NA NA NA
aaply on a data.frame
With .expand = FALSE:
aaply(iris_data, 1, my_fun, .expand = FALSE)[1:20]
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 6.5 6.3 6.0 6.1 6.4 7.1 6.0 6.5 5.8 6.4 6.9 6.4 6.2 5.4 7.0 7.2 6.7 6.5
## 19 20
## 7.4 6.6
adply
adply works similarly to aaply but, as expected, returns a data.frame.
x <- array(1:24, c(2, 3, 4))
sum(x[1, , ])
## [1] 144
adply(x, 1, sum)
## X1 V1
## 1 1 144
## 2 2 156
adply
add_one <- function(a) a + 1
add_one(x[1, 1, ])
## [1] 2 8 14 20
adply(x, 1:2, add_one)
## X1 X2 V1 V2 V3 V4
## 1 1 1 2 8 14 20
## 2 2 1 3 9 15 21
## 3 1 2 4 10 16 22
## 4 2 2 5 11 17 23
## 5 1 3 6 12 18 24
## 6 2 3 7 13 19 25
adply
If the function returns a matrix, each result is converted to a data.frame and the results are combined with rbind.
add_one(x[1, , ])
## [,1] [,2] [,3] [,4]
## [1,] 2 8 14 20
## [2,] 4 10 16 22
## [3,] 6 12 18 24
adply(x, 1, add_one)
## X1 1 2 3 4
## 1 1 2 8 14 20
## 2 1 4 10 16 22
## 3 1 6 12 18 24
## 4 2 3 9 15 21
## 5 2 5 11 17 23
## 6 2 7 13 19 25
adply on a data.frame with .expand
.expand indicates whether to add the result as a new column to the existing data.frame (TRUE) or to return it next to a row index only (FALSE).
adply(iris_data, 1, my_fun, .expand = TRUE)[1:10, ]
## Sepal.Length Petal.Length V1
## 1 5.1 1.4 6.5
## 2 4.9 1.4 6.3
## 3 4.7 1.3 6.0
## 4 4.6 1.5 6.1
## 5 5.0 1.4 6.4
## 6 5.4 1.7 7.1
## 7 4.6 1.4 6.0
## 8 5.0 1.5 6.5
## 9 4.4 1.4 5.8
## 10 4.9 1.5 6.4
adply(iris_data, 1, my_fun, .expand = FALSE)[1:10, ]
## X1 V1
## 1 1 6.5
## 2 2 6.3
## 3 3 6.0
## 4 4 6.1
## 5 5 6.4
## 6 6 7.1
## 7 7 6.0
## 8 8 6.5
## 9 9 5.8
## 10 10 6.4
a*ply functions
alply and a_ply work analogously to the l*ply functions described previously.
d*ply functions
One way of working with data.frames is to use the a*ply functions in a row-wise manner.
plyr offers the d*ply functions as an alternative where the intention is to split by every unique combination of values of the desired columns.
The columns to split by can be given as a character vector, a formula or quoted names. .() is useful shorthand for the latter.
daply(iris, .(Species), function(r) mean(r$Petal.Length))
## setosa versicolor virginica
## 1.462 4.260 5.552
ddply(iris, .(Species), function(r) mean(r$Petal.Length))
## Species V1
## 1 setosa 1.462
## 2 versicolor 4.260
## 3 virginica 5.552
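The three ways of naming the splitting columns can be sketched side by side. The helper mean_petal and the result names are illustrative and not from the slides.

```r
library(plyr)

# Illustrative helper: one-row data.frame with the mean petal length of a group
mean_petal <- function(d) data.frame(mean_petal = mean(d$Petal.Length))

by_quoted  <- ddply(iris, .(Species), mean_petal)  # quoted name via .()
by_string  <- ddply(iris, "Species", mean_petal)   # character vector
by_formula <- ddply(iris, ~ Species, mean_petal)   # one-sided formula

all.equal(by_quoted$mean_petal, by_string$mean_petal)
all.equal(by_quoted$mean_petal, by_formula$mean_petal)
```

The character form is convenient when column names are held in a variable, while .() also allows expressions such as round(Sepal.Length, 0).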
summarise with ddply
plyr provides a summarise function that makes it easier to compute several summary statistics per group.
ddply(iris, .(Species, round(Sepal.Length, 0)), summarise,
      Mean.Petal.Length = mean(Petal.Length),
      SD.Petal.Length = sd(Petal.Length))
## Species round(Sepal.Length, 0) Mean.Petal.Length SD.Petal.Length
## 1 setosa 4 1.280000 0.1095445
## 2 setosa 5 1.490000 0.1661016
## 3 setosa 6 1.420000 0.1923538
## 4 versicolor 5 3.583333 0.5382069
## 5 versicolor 6 4.277778 0.3711843
## 6 versicolor 7 4.687500 0.2167124
## 7 virginica 5 4.500000 NA
## 8 virginica 6 5.255556 0.3238391
## 9 virginica 7 5.737500 0.3263434
## 10 virginica 8 6.566667 0.2804758
r*ply functions
With plyr, evaluating the same code repeatedly (for example in a simulation) can easily be done using the r*ply functions.
They work like replicate, and unlike the other ply functions, they take an expression, not a function.
r*ply functions example
result <- raply(100, mean(runif(1000)))
sum(result)
## [1] 49.96337
hist(result)
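A minimal sketch of two r*ply variants, assuming a simulation that returns either a single number or a one-row data.frame. The names sim_means and sim_table are illustrative and not from the slides.

```r
library(plyr)

set.seed(42)

# raply: evaluate the expression 20 times and collect the results in a vector
sim_means <- raply(20, mean(runif(1000)))

# rdply: the same idea, but each evaluation returns a one-row data.frame
sim_table <- rdply(20, {
  draws <- runif(1000)
  data.frame(mean = mean(draws), sd = sd(draws))
})

length(sim_means)
nrow(sim_table)
```

Because the argument is an expression rather than a function, it is re-evaluated on every repetition, so each run draws fresh random numbers.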
Progress bars in plyr
A useful feature of plyr is progress bars.
They are available through the .progress argument of the **ply functions.
file_list <- list.files("data", "\\.csv$")
process_file <- function(file_name) {
# Do something rather slow on a file and return a one row data.frame
}
processed_data <- ldply(file_list, process_file, .progress = "text")
|================ | 21%
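Since process_file above is only a placeholder, here is a self-contained sketch of the same idea. The function slow_square is a hypothetical stand-in for slow work and is not from the slides.

```r
library(plyr)

# Hypothetical stand-in for a slow computation
slow_square <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# .progress = "text" draws a text progress bar in the console while iterating
squares <- laply(1:10, slow_square, .progress = "text")
squares
```

The progress bar only affects console output; the returned values are the same as with the default .progress = "none".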
Other progress bar options in plyr are .progress = "tk" and .progress = "win" (Windows only).
Parallel plyr
plyr offers an easy route into parallelisation when used in conjunction with one of a few backends:
SNOW/parallel (doSNOW or doParallel)
multicore (doMC)
MPI (doMPI)
Parallel plyr (doMC)
multicore works on UNIX-like OSes by forking the main process.
library("doMC")
registerDoMC(4)
# Defined here so that this example runs on its own
sleepy_time <- function(x) Sys.sleep(2)
system.time(llply(1:4, sleepy_time, .parallel = TRUE))
Parallel plyr (doParallel)
library("doParallel")
cl <- makeCluster(4)
registerDoParallel(cl)
sleepy_time <- function(x) Sys.sleep(2)
system.time(llply(1:4, sleepy_time, .parallel = FALSE))
## user system elapsed
## 0.00 0.00 8.02
system.time(llply(1:4, sleepy_time, .parallel = TRUE))
## user system elapsed
## 0.03 0.00 3.00
stopCluster(cl)
Parallel plyr (doParallel 2)
With doParallel, packages and the required variables need to be explicitly exported using the .paropts parameter.
library("pROC")
my_data <- data.frame(resp = sample(1:2, 1000, TRUE), V1 = rnorm(1000), V2 = rnorm(1000))
library("doParallel")
cl <- makeCluster(4)
registerDoParallel(cl)
llply(c("V1", "V2"), function(var) auc(my_data$resp, my_data[, var]),
.parallel = TRUE, .paropts = list(.packages = "pROC", .export = "my_data"))
## [[1]]
## Area under the curve: 0.4919
##
## [[2]]
## Area under the curve: 0.5129
# Without .paropts, the workers cannot find auc():
llply(c("V1", "V2"), function(var) auc(my_data$resp, my_data[, var]), .parallel = TRUE)
## Error in do.ply(i) : task 1 failed - "could not find function "auc""
stopCluster(cl)
Parallel plyr (doParallel 3)
Alternatively, packages can be loaded on the workers using clusterEvalQ and data exported using clusterExport.
library("pROC")
my_data <- data.frame(resp = sample(1:2, 1000, TRUE), V1 = rnorm(1000), V2 = rnorm(1000))
library("doParallel")
cl <- makeCluster(4)
registerDoParallel(cl)
clusterExport(cl, "my_data")
invisible(clusterEvalQ(cl, library("pROC")))
llply(c("V1", "V2"), function(var) auc(my_data$resp, my_data[, var]), .parallel = TRUE)
## [[1]]
## Area under the curve: 0.538
##
## [[2]]
## Area under the curve: 0.5007
stopCluster(cl)
Limitations of plyr
As with lapply, each iteration cannot access data from a previous iteration.
Working around this with the <<- operator is possible, but not recommended.
Where iterations depend on each other, use a for loop (ideally with pre-assignment of the output variable).
The d*ply functions are quite a bit slower than the equivalents in dplyr and data.table.
When using dplyr and plyr in the same session, note that they have multiple functions with the same name. It is recommended to load the packages in the order:
library("plyr")
library("dplyr")
A masked plyr function can still be called explicitly using the :: operator.
plyr::summarise(x, mean = mean(y))
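A runnable sketch of a namespace-qualified call, using the built-in iris data in place of the placeholder x above. The column names mean_petal and sd_petal are illustrative.

```r
library(plyr)

# Namespace-qualified call: this is always plyr's summarise,
# regardless of which other packages are attached afterwards
overall <- plyr::summarise(iris,
                           mean_petal = mean(Petal.Length),
                           sd_petal = sd(Petal.Length))
overall
```

Qualifying the call makes a script robust to the order in which packages happen to be loaded.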
plyr is an excellent way of taking some data, splitting it up, doing something to each bit and joining it all together.