Data Manipulation in R & Python: the dplyr, data.table, and dplython packages

Som B. Bohora

Department of Pediatrics

University of Oklahoma Health Sciences Center

October 04, 2016

Outline of the presentation

  1. dplyr in R
  2. dplython in Python
  3. data.table in R

Why `dplyr`?

  1. Speed and performance
  2. Direct connection to and analysis within external databases
  3. Function chaining
  4. Syntax simplicity and ease of use (SQL flavor)
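Function chaining is easiest to see side by side. The sketch below is only an illustration: it assumes the built-in `mtcars` data set and an arbitrary `hp > 100` cutoff, neither of which comes from the slides. It writes the same query once as nested calls and once as a `%>%` chain.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, hp > 100), cyl),
  mean_mpg = mean(mpg)
)

# Chained form: each step reads top to bottom
chained <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Both forms should describe the same result
all.equal(nested, chained)
```

The chained version passes the result of each step as the first argument of the next, so the order of the code matches the order of the operations.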

`dplyr` verbs

6 key operations

  • filter: keep the rows of a data frame that match a condition
  • mutate: modify existing columns or create new ones
  • group_by: set grouping variables
  • summarise: aggregate a data frame (one row per group)
  • arrange: sort the rows of a data frame by column values
  • select: select a set of columns
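As a rough sketch, the six verbs can be combined in a single pipeline. The data set (`mtcars`), the `hp > 100` cutoff, and the derived column `hp_per_wt` are assumptions chosen for illustration and are not part of the original slides.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%            # select: keep a set of columns
  filter(hp > 100) %>%                    # filter: keep rows matching a condition
  mutate(hp_per_wt = hp / wt) %>%         # mutate: create a new column
  group_by(cyl) %>%                       # group_by: set the grouping variable
  summarise(                              # summarise: one row per group
    n = n(),
    mean_hp_per_wt = mean(hp_per_wt)
  ) %>%
  arrange(desc(mean_hp_per_wt))           # arrange: sort rows, largest first

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets them be chained in any order.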

`data.table` (advanced operations)

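DT[.N - 2]  # third-to-last row (.N is the number of rows)
DT[, .N]  # no. of rows
DT[, .(V2, V3)]  # select columns V2 and V3, or equivalently
DT[, list(V2, V3)]
DT[, .(V3.Mean = mean(V3)), by = .(V1, V2)]  # mean of V3 for each V1, V2 group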
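The slides do not show how `DT` is built, so the following is a minimal sketch on a hypothetical 12-row table. The column names V1 to V4 follow the slides; the values are made up so that the results are easy to check by hand.

```r
library(data.table)

# Hypothetical toy table (values are assumptions, not from the slides)
DT <- data.table(
  V1 = rep(1:2, times = 6),
  V2 = rep(c("A", "B", "C"), times = 4),
  V3 = rep(c(0.5, 1.5, 2.5, 3.5), times = 3),
  V4 = 1:12
)

third_last <- DT[.N - 2]        # third-to-last row (here, row 10)
n_rows     <- DT[, .N]          # number of rows
cols       <- DT[, .(V2, V3)]   # columns V2 and V3, returned as a data.table

# Mean of V3 within each (V1, V2) combination, with a named result column
group_mean <- DT[, .(V3.Mean = mean(V3)), by = .(V1, V2)]

print(group_mean)
```

All four calls use the same `DT[i, j, by]` form: `i` picks rows, `j` picks or computes columns, and `by` sets the groups.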

`data.table` (advanced operations cont.)

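# .SD: the Subset of Data for each group (all columns except the by columns)
DT[, .SD[c(1, .N)], by = V2]  # first and last row of each V2 group
DT[, lapply(.SD, sum), by = V2]  # column sums within each V2 group
DT[, lapply(.SD, sum), by = V2, .SDcols = c("V3", "V4")]  # same, but only for V3 and V4

DT[, lapply(.SD, function(x) sum(x, na.rm = TRUE)), by = V2]  # column sums, ignoring NAs

DT[, .(V4.Sum = sum(V4)), by = V1][V4.Sum > 25]  # like SQL HAVING: filter on the aggregate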
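The `.SD` and chaining idioms can be tried on a small table. This sketch assumes a hypothetical 12-row `DT` (the values are invented for illustration) and uses a threshold of 40 rather than 25 so that the HAVING-style filter actually drops a group.

```r
library(data.table)

# Hypothetical toy table (values are assumptions, not from the slides)
DT <- data.table(
  V1 = rep(1:2, times = 6),
  V2 = rep(c("A", "B", "C"), times = 4),
  V3 = rep(c(0.5, 1.5, 2.5, 3.5), times = 3),
  V4 = 1:12
)

# First and last row of each V2 group
first_last <- DT[, .SD[c(1, .N)], by = V2]

# Sum every non-grouping column within each V2 group
sums_all <- DT[, lapply(.SD, sum), by = V2]

# Same, restricted to V3 and V4 through .SDcols
sums_some <- DT[, lapply(.SD, sum), by = V2, .SDcols = c("V3", "V4")]

# Aggregate, then filter on the aggregate by chaining a second [ ]
having <- DT[, .(V4.Sum = sum(V4)), by = V1][V4.Sum > 40]

print(having)
```

The second pair of brackets operates on the result of the first, which is how `data.table` expresses what SQL writes as `GROUP BY ... HAVING ...`.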

Questions?