Database Best Practices

DBI, odbc and pool

Jim Hester

Brief History of Jim

  • CCF - Program Analyst - Genomic Medicine Institute (2008-2013) R 2.8.0
  • Explorys - Data Scientist - (2013-2015)
  • Bioconductor - Core Member - (2015)
  • RStudio - Software Engineer - (2016)
    • Hadley Wickham's Team
      • Open source R packages (devtools, xml2, readr, ...)

Make R The Best Environment for Research and Data Science.

Databases

  • RDBMS - Relational Database Management System
  • SQL - Strucured Query Language
  • Lots of data
  • Many types exist
  • Each has proprietary API / SQL Dialect

Databases

ODBC - Open Database Connectivity

  • Standard Interface (API)
  • Connection Configuration / Strings
  • Widely adopted (Drivers)
  • Cross Platform (Linux, MacOS, Windows)

ODBC - Open Database Connectivity

RODBC

  • Original work by Michael Lapsley (2000)
  • Maintained by Brian Ripley (2002-Present)
  • ODBC support for R
  • R C API
  • Battle Tested
  • Not DBI Compatible

RODBC

DBI - Database Interface

  • First CRAN release (2001-12-05), also S-PLUS
  • David A. James - Bell Labs
  • DBs <-> R/S
  • Hadley Wickham (2014), Kirill Müller (2016)
  • DBItest - comprehensive compatability testing
  • Next dplyr release, generic DBI backend

DBI - Database Interface

odbc

  • devtools::install_github("rstats-db/odbc")
  • DBI compatible
  • Complete Rewrite of RODBC
  • C++ / Rcpp
  • Well tested
  • Support for dates / timestamps / raw binary
  • Batch parameterized queries
  • Faster (3-4X Reading, 2x Writing than RODBC)

Reading

# First using RODBC / RODBCDBI
library(DBI)
library(RODBCDBI)
rodbc <- dbConnect(RODBCDBI::ODBC(), dsn = "PostgreSQL")
system.time(rodbc_result <- dbReadTable(rodbc, "flights"))
#> Warning: closing unused RODBC handle 2
#>    user  system elapsed 
#>  19.203   1.356  21.724

# Now using odbc
odbc <- dbConnect(odbc::odbc(), dsn = "PostgreSQL")
system.time(odbc_result <- dbReadTable(odbc, "flights"))
#>    user  system elapsed 
#>   5.119   0.290   6.771

Writing

library(nycflights13)
# rodbc does not support writing timestamps natively.
rodbc <- dbConnect(RODBCDBI::ODBC(), dsn = "PostgreSQL")
system.time(dbWriteTable(rodbc, "flights2", as.data.frame(flights[, names(flights) != "time_hour"])))
#>    user  system elapsed 
#>   6.693   3.786  48.423

# Now using odbc
odbc <- dbConnect(odbc::odbc(), dsn = "PostgreSQL")
system.time(dbWriteTable(odbc, "flights3", as.data.frame(flights)))
#>    user  system elapsed 
#>   7.802   3.703  26.016

Avoiding SQL Injection (Stopping little bobby tables)

  • Very bad!!! - Users could drop all data!
  • DBI
    • sqlInterpolate(), dbQuoteString(), dbQuoteIdentifier()
    • dbBind() - Parameterized Queries

Parameterized Queries

  • INSERT INTO XYZ (?, ?), dbBind()
  • dbWriteTable() (automatic)

Parameterized Queries

library(DBI);con <- dbConnect(RSQLite::SQLite(), ":memory:");dbWriteTable(con, "iris", iris)
iris_result <- dbSendQuery(con, "SELECT * FROM iris WHERE [Petal.Width] > ?")
dbBind(iris_result, list(2.3))
dbFetch(iris_result)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          7.2         3.6          6.1         2.5 virginica
## 3          5.8         2.8          5.1         2.4 virginica
## 4          6.3         3.4          5.6         2.4 virginica
## 5          6.7         3.1          5.6         2.4 virginica
## 6          6.7         3.3          5.7         2.5 virginica
dbBind(iris_result, list(2.4))
dbFetch(iris_result)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1          6.3         3.3          6.0         2.5 virginica
## 2          7.2         3.6          6.1         2.5 virginica
## 3          6.7         3.3          5.7         2.5 virginica
dbClearResult(iris_result); dbDisconnect(con)

Pool

  • General Object Pooling, reuse objects from pool
  • DBI Connections - Use Case
  • Problem - Shiny Applications
    • Single R process
    • Multiple Users
    • Establishing Connections
      • Takes time
      • Limited #
      • Failure?

Pool

Pool

library(shiny)
library(DBI)
library(pool)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

ui <- fluidPage(
  textInput("ID", "Enter your ID:", "5"),
  tableOutput("tbl"),
  numericInput("nrows", "How many cities to show?", 10),
  plotOutput("popPlot")
)

server <- function(input, output, session) {
  output$tbl <- renderTable({
    sql <- "SELECT * FROM City WHERE ID = ?id;"
    query <- sqlInterpolate(pool, sql, id = input$ID)
    dbGetQuery(pool, query)
  })
  output$popPlot <- renderPlot({
    query <- paste0("SELECT * FROM City LIMIT ",
                    as.integer(input$nrows)[1], ";")
    df <- dbGetQuery(pool, query)
    pop <- df$Population
    names(pop) <- df$Name
    barplot(pop)
  })
}

shinyApp(ui, server)

DBI

install.packages("DBI")

sqlIterpolate(), dbQuoteString(), dbQuoteIdentifier()

odbc

devtools::install_github("rstats-db/odbc")

dbConnect(odbc::odbc(), ...), dbGetQuery(), dbReadTable(), dbWriteTable()

pool

devtools::install_github("rstudio/pool")

dbPool(), sqlInterpolate()