@ London Business Analytics Group

Objective

Learn about R and do useful day-job stuff with it

Sub-objectives

  • Install R
  • Connect to database using RODBC
  • Manipulate and summarise data using data.table
  • Visualise your data using ggplot2
  • Present your data using rmarkdown and knitr
  • Produce a report using shiny

Steph Locke

Scenario

I've heard about this new thing called R. Apparently everyone's using it so we need to. Make me something using it, by tomorrow!

Step 1: Google "What is R?"

I Google frantically and find out…

Eep, but I'm not a stats bod or a programmer, what will I do? Do my usual stuff, but swisher!

Step 2: Get R installed

OK, I need to get this thing called R installed on my laptop!

  • Install R from r-project.org myself
  • Jump on IT's desk until they install Rstudio via Rstudio.com for me
  • Use R to nab some useful packages

install.packages(c(
  "RODBC",      # Database access
  "data.table", # Table manipulation
  "ggplot2",    # Charting
  "scales",     # Scale manipulations
  "rmarkdown",  # Presentation writing
  "knitr",      # Presentation production
  "shiny"       # Reporting
  ))

Step 3: Connect to the database

I'll report on global bike sales from our company's database

Establishing a connection

# Getting data
library("RODBC")

# Manipulating data
library("data.table")

# Build a connection to the DB for reuse
# Driver names can vary eg 
# {ODBC Driver 11 for SQL Server}
azure <- odbcDriverConnect(
          "Driver={SQL Server Native Client 11.0};
          Server=mhknbn2kdz.database.windows.net;
          Database=AdventureWorks2012;
          Uid=sqlfamily;
          Pwd=sqlf@m1ly;")

Getting our starting data

Order    <- data.table( sqlQuery( azure, 
         "SELECT * FROM [Sales].[SalesOrderHeader]"))

Territory<- data.table( sqlQuery( azure, 
         "SELECT * FROM [Sales].[SalesTerritory]"))

Region   <- data.table( sqlQuery( azure, 
         "SELECT * FROM [Person].[CountryRegion]"))

Step 4: Create big dataset

Hmm, I need to put all the data into a "flat-file" view

  • Set "primary keys" with setkey
  • Join in format Y[X] to join Y on to X
  • Equivalent of LEFT JOINs in SQL
  • Equivalent of extensive VLOOKUPs in Excel

Join Territory to the Order

# Specify the columns to join by
setkey(Order     , TerritoryID)
setkey(Territory , TerritoryID)

# Reads as join Territory to the Order table
# Seperate table so we don't have to start again
OrderTerritory <- Territory[Order]

Join Region to the OrderTerritory

# Specify the columns to join by
setkey(OrderTerritory , CountryRegionCode)
setkey(Region         , CountryRegionCode)

# Reads as join Region to the OrderTerritory table
# Will overwrite as it's our desired state
OrderTerritory <- Region[OrderTerritory]

Step 5: Summarise data

OK, let's produce some tables!

  • Lots of functionality, but kept quite simple here
  • Read data.table intro for a great in-depth introduction

Basic summary

OrderTerritory[  , #Nothing in 1st position
               .( Value  = sum(TotalDue) ,
                  Volume = .N ), # Aggregations
               by = Name # Group by, no alias
               ]

Basic summary

Name Value Volume
Australia 11814376 6843
Canada 18398929 4067
Germany 5479820 2623
France 8119749 2672
United Kingdom 8574049 3219
United States 70829863 12041

Top 5 sales ever

OrderTerritory[order(-TotalDue)[1:5],
               # Picks first 5 rows of TotalDue desc
               .( SalesOrderNumber, # multi-select
                  AccountNumber,
                  Region=Name, # Aliasing colum
                  TotalDue,
                  OrderDate
                  )
               ]

Top 5 sales ever

SalesOrderNumber AccountNumber Region TotalDue OrderDate
SO7473 10-4020-000599 United States 187487.8 2007-07-01
SO11624 10-4020-000599 United States 182018.6 2007-10-01
SO2958 10-4020-000514 Canada 170512.7 2006-07-01
SO3323 10-4020-000638 France 166537.1 2006-08-01
SO3737 10-4020-000072 United States 165028.7 2006-09-01

Top sale ever for each region

OrderTerritory[  ,
               .SD[which.max(TotalDue),
               # .SD is subset of rows within group
               
               # which.max() returns row 
               # no of largest val
                   .( SalesOrderNumber,
                      AccountNumber,
                      TotalDue,
                      OrderDate
                    )
                   ],
               by=.(Region=Name)
               #Aliasing in groupung
               ]

Top sale ever for each region

Region SalesOrderNumber AccountNumber TotalDue OrderDate
Australia SO9908 10-4020-000015 71729.86 2007-09-01
Canada SO2958 10-4020-000514 170512.67 2006-07-01
Germany SO11596 10-4020-000302 117506.12 2007-10-01
France SO3323 10-4020-000638 166537.08 2006-08-01
United Kingdom SO9915 10-4020-000502 130249.26 2007-09-01
United States SO7473 10-4020-000599 187487.83 2007-07-01

Step 6: Make some charts

I need to make some pretties!

  • Using ggplot2 as much easier than base graphics
  • Simple structure = chart(data, axis & series) + chart types + formatting
  • No secondary axes as considered bad visualisation, use optiRum::multiplot()
  • Easy trellis / small multiple charts

Basic chart – volumes

library(ggplot2)

ggplot(OrderTerritory                 # dataset
       ,aes(x=Name, y=..count..))+    # values
  geom_bar()+                         # chart type
  theme_minimal()+                    # themeing
  labs(x="Region",title="All time sales volumes")

Basic chart – volumes

Basic chart – value

library(scales)

ggplot(OrderTerritory               
       ,aes(x=Name, y=TotalDue))+    
  geom_bar(stat="identity")+         # use value
  theme_minimal()+                   
  scale_y_continuous(label=dollar)+  # customisation
  labs(x="Region",title="All time sales value")

Basic chart – value

Trellis chart – value

ggplot(OrderTerritory
       ,aes(x=year(OrderDate), y=TotalDue, 
            colour=Name))+ 
  geom_bar(stat="identity")+
  theme_minimal()+                  
  scale_y_continuous(label=dollar)+ 
  facet_wrap(~Name,ncol = 2) +   # trellis
  xlab("Order Year")

Trellis chart – value

Step 7: Produce presentation

I need a slide deck for the boss to take use!

  • Use rmarkdown to produce SamplePresentation.Rmd
  • Really light syntax, same as Stack Overflow
  • Customisable via CSS
  • Create slides (html or PDF) or docs (html, PDF, Word)
  • Use a modular design
  • Re-knit to get latest data

Step 8: Produce report

I need a report for people to play with

  • Use shiny to produce interactive online reports
  • Can host locally (for other R users)
  • Can use a server (for internal users, uses Linux)
  • Can use a hosting platform like shinyapps.io
library("shiny")

#Run the shiny app (app.R) in the current folder
runApp(".")

Conclusion

Phew, he's got something to show to the Board & a web page he can play with all day long.

We learnt along the way:

  • connect to our data using RODBC
  • manipulate it with ggplot2
  • make documents with rmarkdown
  • produce interactive reports using shiny

Not bad for a day's work!

Where next?

Get this presentation