How to use DALEX with the xgboost package

Przemyslaw Biecek

2018-04-29

This vignette demonstrates how to use the DALEX package with models created with the xgboost package.

Regression

In this example we are going to use the wine dataset from the breakDown package. The wine quality will be predicted based on other features.

library("breakDown")
head(wine)
#>   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
#> 1           7.0             0.27        0.36           20.7     0.045
#> 2           6.3             0.30        0.34            1.6     0.049
#> 3           8.1             0.28        0.40            6.9     0.050
#> 4           7.2             0.23        0.32            8.5     0.058
#> 5           7.2             0.23        0.32            8.5     0.058
#> 6           8.1             0.28        0.40            6.9     0.050
#>   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
#> 1                  45                  170  1.0010 3.00      0.45     8.8
#> 2                  14                  132  0.9940 3.30      0.49     9.5
#> 3                  30                   97  0.9951 3.26      0.44    10.1
#> 4                  47                  186  0.9956 3.19      0.40     9.9
#> 5                  47                  186  0.9956 3.19      0.40     9.9
#> 6                  30                   97  0.9951 3.26      0.44    10.1
#>   quality
#> 1       6
#> 2       6
#> 3       6
#> 4       6
#> 5       6
#> 6       6

Model building

Let’s build a model. xgboost does not accept a data frame directly, so we need to prepare an xgb.DMatrix first.
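
The model-building step can be sketched as follows. This is a minimal example rather than a definitive recipe: the hyperparameters are illustrative and untuned, and the objective name "reg:squarederror" assumes xgboost 0.90 or later (the 0.6.4.1 release listed in the session info called this objective "reg:linear").

```r
library("breakDown")  # provides the wine data
library("xgboost")

# xgboost works on numeric matrices; "- 1" drops the intercept column
model_matrix_train <- model.matrix(quality ~ . - 1, data = wine)

# Wrap the features and the target in an xgb.DMatrix
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)

# Illustrative, untuned hyperparameters (an assumption of this sketch)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "reg:squarederror")

wine_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)
wine_xgb_model
```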

Explainer

Now we can create an explainer.
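
A sketch of the explainer step, assuming a current DALEX release (1.0 or later) and xgboost 0.90 or later. The names `wine_features` and `predict_xgb` are introduced here for illustration only. Passing the features as a data frame together with a small prediction wrapper is one possible design; it keeps DALEX working on a data frame while xgboost still receives a numeric matrix.

```r
library("breakDown")
library("xgboost")
library("DALEX")

# Fit the regression model (illustrative, untuned hyperparameters)
model_matrix_train <- model.matrix(quality ~ . - 1, data = wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "reg:squarederror")
wine_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)

# Hypothetical helpers: features as a data frame, and a wrapper that
# converts them back to a matrix before calling predict()
wine_features <- as.data.frame(model_matrix_train)
predict_xgb <- function(model, newdata, ...) {
  predict(model, as.matrix(newdata))
}

explainer_xgb <- explain(wine_xgb_model,
                         data = wine_features,
                         y = wine$quality,
                         predict_function = predict_xgb,
                         label = "xgboost",
                         verbose = FALSE)
explainer_xgb
```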

Single variable

For a continuous variable
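
A partial dependence profile shows how the average prediction changes with a single continuous feature. The sketch below assumes a current DALEX release (1.0 or later), where this is exposed as `model_profile()`; the DALEX 0.2.2 listed in the session info used differently named functions. The choice of `alcohol` as the variable and the helper names are illustrative.

```r
library("breakDown")
library("xgboost")
library("DALEX")

# Fit the regression model (illustrative, untuned hyperparameters)
model_matrix_train <- model.matrix(quality ~ . - 1, data = wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "reg:squarederror")
wine_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)

# Explainer built on a data frame with a matrix-converting wrapper
wine_features <- as.data.frame(model_matrix_train)
predict_xgb <- function(model, newdata, ...) predict(model, as.matrix(newdata))
explainer_xgb <- explain(wine_xgb_model, data = wine_features,
                         y = wine$quality, predict_function = predict_xgb,
                         label = "xgboost", verbose = FALSE)

# Partial dependence of predicted quality on alcohol content
pdp_alcohol <- model_profile(explainer_xgb, variables = "alcohol",
                             type = "partial")
plot(pdp_alcohol)
```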

Single prediction
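
To explain one prediction, the model output for a single observation can be decomposed into per-variable contributions. A minimal sketch, assuming a current DALEX release (1.0 or later) where this is done with `predict_parts()`; the first row of the wine data is picked arbitrarily, and the helper names are illustrative.

```r
library("breakDown")
library("xgboost")
library("DALEX")

# Fit the regression model (illustrative, untuned hyperparameters)
model_matrix_train <- model.matrix(quality ~ . - 1, data = wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "reg:squarederror")
wine_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)

# Explainer built on a data frame with a matrix-converting wrapper
wine_features <- as.data.frame(model_matrix_train)
predict_xgb <- function(model, newdata, ...) predict(model, as.matrix(newdata))
explainer_xgb <- explain(wine_xgb_model, data = wine_features,
                         y = wine$quality, predict_function = predict_xgb,
                         label = "xgboost", verbose = FALSE)

# Break down the prediction for one observation (first row, chosen arbitrarily)
new_obs <- wine_features[1, , drop = FALSE]
bd_xgb <- predict_parts(explainer_xgb, new_observation = new_obs,
                        type = "break_down")
bd_xgb
plot(bd_xgb)
```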

Variable importance
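
Permutation-based variable importance measures how much the model's loss grows when the values of one variable are shuffled. A minimal sketch, assuming a current DALEX release (1.0 or later) where this is done with `model_parts()`; `type = "raw"` reports the loss itself rather than a difference or ratio. Helper names are illustrative.

```r
library("breakDown")
library("xgboost")
library("DALEX")

# Fit the regression model (illustrative, untuned hyperparameters)
model_matrix_train <- model.matrix(quality ~ . - 1, data = wine)
data_train <- xgb.DMatrix(model_matrix_train, label = wine$quality)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "reg:squarederror")
wine_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)

# Explainer built on a data frame with a matrix-converting wrapper
wine_features <- as.data.frame(model_matrix_train)
predict_xgb <- function(model, newdata, ...) predict(model, as.matrix(newdata))
explainer_xgb <- explain(wine_xgb_model, data = wine_features,
                         y = wine$quality, predict_function = predict_xgb,
                         label = "xgboost", verbose = FALSE)

# Loss after permuting each variable in turn
vi_xgb <- model_parts(explainer_xgb, type = "raw")
vi_xgb
plot(vi_xgb)
```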

Classification

In this example we are going to use the HR_data dataset from the breakDown package. The model will predict the probability that an employee leaves the company.

library("breakDown")
head(HR_data)
#>   satisfaction_level last_evaluation number_project average_montly_hours
#> 1               0.38            0.53              2                  157
#> 2               0.80            0.86              5                  262
#> 3               0.11            0.88              7                  272
#> 4               0.72            0.87              5                  223
#> 5               0.37            0.52              2                  159
#> 6               0.41            0.50              2                  153
#>   time_spend_company Work_accident left promotion_last_5years sales salary
#> 1                  3             0    1                     0 sales    low
#> 2                  6             0    1                     0 sales medium
#> 3                  4             0    1                     0 sales medium
#> 4                  5             0    1                     0 sales    low
#> 5                  3             0    1                     0 sales    low
#> 6                  3             0    1                     0 sales    low

Model building

Let’s build a model. As before, we need to prepare an xgb.DMatrix first; the factor columns (sales, salary) have to be expanded into numeric dummy columns.
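
A minimal sketch of the classification model. The hyperparameters are illustrative and untuned; `model.matrix()` is used here to turn the factor columns into dummy variables, which is one common way to prepare such data for xgboost.

```r
library("breakDown")  # provides the HR_data data
library("xgboost")

# Expand factors into dummy columns; "- 1" drops the intercept column
model_matrix_train <- model.matrix(left ~ . - 1, data = HR_data)

# Wrap the features and the 0/1 target in an xgb.DMatrix
data_train <- xgb.DMatrix(model_matrix_train, label = HR_data$left)

# Illustrative, untuned hyperparameters (an assumption of this sketch);
# "binary:logistic" makes predict() return probabilities
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "binary:logistic", eval_metric = "auc")

HR_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)
HR_xgb_model
```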

Explainer

Now we can create an explainer.
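
A sketch of the explainer for the classification model, assuming a current DALEX release (1.0 or later). Because the model is trained with the "binary:logistic" objective, `predict()` already returns probabilities, so the wrapper only converts the data frame to a matrix. The names `HR_features` and `predict_xgb` are illustrative, and `type = "classification"` is passed explicitly on the assumption that DALEX may not recognize the task from an xgboost model on its own.

```r
library("breakDown")
library("xgboost")
library("DALEX")

# Fit the classification model (illustrative, untuned hyperparameters)
model_matrix_train <- model.matrix(left ~ . - 1, data = HR_data)
data_train <- xgb.DMatrix(model_matrix_train, label = HR_data$left)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "binary:logistic", eval_metric = "auc")
HR_xgb_model <- xgb.train(params = param, data = data_train, nrounds = 50)

# Hypothetical helpers: features as a data frame, and a wrapper that
# converts them back to a matrix and returns predicted probabilities
HR_features <- as.data.frame(model_matrix_train)
predict_xgb <- function(model, newdata, ...) {
  predict(model, as.matrix(newdata))
}

explainer_xgb <- explain(HR_xgb_model,
                         data = HR_features,
                         y = HR_data$left,
                         predict_function = predict_xgb,
                         type = "classification",
                         label = "xgboost",
                         verbose = FALSE)
explainer_xgb
```

The single-variable, single-prediction and variable-importance steps then follow the same pattern as in the regression example, for instance with `satisfaction_level` as the continuous variable.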

Single variable

For a continuous variable

Single prediction

Variable importance

Session info

devtools::session_info()
#>  setting  value                       
#>  version  R version 3.4.4 (2018-03-15)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       Europe/Warsaw               
#>  date     2018-04-29                  
#> 
#>  package      * version    date       source                            
#>  agricolae      1.2-8      2017-09-12 cran (@1.2-8)                     
#>  ALEPlot        1.0        2017-11-13 CRAN (R 3.4.2)                    
#>  AlgDesign      1.1-7.3    2014-10-15 CRAN (R 3.2.0)                    
#>  assertthat     0.2.0      2017-04-11 CRAN (R 3.4.0)                    
#>  backports      1.1.2      2017-12-13 cran (@1.1.2)                     
#>  base         * 3.4.4      2018-03-15 local                             
#>  bindr          0.1.1      2018-03-13 CRAN (R 3.4.4)                    
#>  bindrcpp       0.2.2      2018-03-29 CRAN (R 3.4.4)                    
#>  boot           1.3-20     2017-08-06 CRAN (R 3.4.4)                    
#>  breakDown    * 0.1.5      2018-04-18 local (pbiecek/breakDown@5081b24) 
#>  cluster        2.0.7-1    2018-04-09 CRAN (R 3.4.4)                    
#>  coda           0.19-1     2016-12-08 cran (@0.19-1)                    
#>  colorspace     1.3-2      2016-12-14 CRAN (R 3.4.0)                    
#>  combinat       0.0-8      2012-10-29 CRAN (R 3.1.0)                    
#>  compiler       3.4.4      2018-03-15 local                             
#>  DALEX        * 0.2.2      2018-04-29 local (pbiecek/DALEX@NA)          
#>  data.table     1.10.4-3   2017-10-27 CRAN (R 3.4.2)                    
#>  datasets     * 3.4.4      2018-03-15 local                             
#>  deldir         0.1-15     2018-04-01 CRAN (R 3.4.4)                    
#>  devtools       1.13.5     2018-02-18 CRAN (R 3.4.3)                    
#>  digest         0.6.15     2018-01-28 cran (@0.6.15)                    
#>  dplyr          0.7.4      2017-09-28 CRAN (R 3.4.2)                    
#>  evaluate       0.10.1     2017-06-24 CRAN (R 3.4.1)                    
#>  expm           0.999-2    2017-03-29 cran (@0.999-2)                   
#>  factorMerger   0.3.6      2018-04-04 CRAN (R 3.4.4)                    
#>  gdata          2.18.0     2017-06-06 CRAN (R 3.4.0)                    
#>  ggplot2        2.2.1.9000 2018-04-29 Github (thomasp85/ggplot2@f1ba983)
#>  ggpubr         0.1.6      2017-11-14 cran (@0.1.6)                     
#>  glue           1.2.0      2017-10-29 cran (@1.2.0)                     
#>  gmodels        2.16.2     2015-07-22 CRAN (R 3.4.0)                    
#>  graphics     * 3.4.4      2018-03-15 local                             
#>  grDevices    * 3.4.4      2018-03-15 local                             
#>  grid           3.4.4      2018-03-15 local                             
#>  gridExtra      2.3        2017-09-09 CRAN (R 3.4.1)                    
#>  gtable         0.2.0      2016-02-26 CRAN (R 3.2.3)                    
#>  gtools         3.5.0      2015-05-29 CRAN (R 3.2.0)                    
#>  highr          0.6        2016-05-09 CRAN (R 3.4.0)                    
#>  htmltools      0.3.6      2017-04-28 CRAN (R 3.4.0)                    
#>  httpuv         1.3.6.2    2018-03-02 CRAN (R 3.4.3)                    
#>  klaR           0.6-14     2018-03-19 CRAN (R 3.4.4)                    
#>  knitr          1.20       2018-02-20 cran (@1.20)                      
#>  labeling       0.3        2014-08-23 CRAN (R 3.2.2)                    
#>  lattice        0.20-35    2017-03-25 CRAN (R 3.4.4)                    
#>  lazyeval       0.2.1      2017-10-29 CRAN (R 3.4.2)                    
#>  LearnBayes     2.15.1     2018-03-18 CRAN (R 3.4.4)                    
#>  magrittr       1.5        2014-11-22 CRAN (R 3.2.2)                    
#>  MASS           7.3-49     2018-02-23 CRAN (R 3.4.4)                    
#>  Matrix         1.2-14     2018-04-09 CRAN (R 3.4.4)                    
#>  memoise        1.1.0      2017-04-21 CRAN (R 3.4.0)                    
#>  methods      * 3.4.4      2018-03-15 local                             
#>  mime           0.5        2016-07-07 CRAN (R 3.4.0)                    
#>  miniUI         0.1.1      2016-01-15 CRAN (R 3.2.3)                    
#>  munsell        0.4.3      2016-02-13 CRAN (R 3.2.3)                    
#>  mvtnorm        1.0-7      2018-01-25 cran (@1.0-7)                     
#>  nlme           3.1-137    2018-04-07 CRAN (R 3.4.4)                    
#>  pdp            0.6.0      2017-07-20 CRAN (R 3.4.1)                    
#>  pillar         1.2.1      2018-02-27 CRAN (R 3.4.3)                    
#>  pkgconfig      2.0.1      2017-03-21 CRAN (R 3.4.0)                    
#>  plyr           1.8.4      2016-06-08 CRAN (R 3.4.0)                    
#>  proxy          0.4-22     2018-04-08 CRAN (R 3.4.4)                    
#>  questionr      0.6.2      2017-11-01 CRAN (R 3.4.2)                    
#>  R6             2.2.2      2017-06-17 CRAN (R 3.4.0)                    
#>  RColorBrewer   1.1-2      2014-12-07 CRAN (R 3.2.2)                    
#>  Rcpp           0.12.16    2018-03-13 cran (@0.12.16)                   
#>  reshape2       1.4.3      2017-12-11 cran (@1.4.3)                     
#>  rlang          0.2.0.9001 2018-04-29 Github (tidyverse/rlang@82b2727)  
#>  rmarkdown      1.9        2018-03-01 CRAN (R 3.4.3)                    
#>  rprojroot      1.3-2      2018-01-03 CRAN (R 3.4.3)                    
#>  rstudioapi     0.7        2017-09-07 CRAN (R 3.4.1)                    
#>  scales         0.5.0.9000 2018-04-29 Github (hadley/scales@d767915)    
#>  shiny          1.0.5      2017-08-23 CRAN (R 3.4.1)                    
#>  sp             1.2-7      2018-01-19 cran (@1.2-7)                     
#>  spData         0.2.8.3    2018-03-25 CRAN (R 3.4.4)                    
#>  spdep          0.7-7      2018-04-03 CRAN (R 3.4.4)                    
#>  splines        3.4.4      2018-03-15 local                             
#>  stats        * 3.4.4      2018-03-15 local                             
#>  stringi        1.1.7      2018-03-12 cran (@1.1.7)                     
#>  stringr        1.3.0      2018-02-19 cran (@1.3.0)                     
#>  survival       2.41-3     2017-04-04 CRAN (R 3.4.4)                    
#>  tibble         1.4.2      2018-01-22 cran (@1.4.2)                     
#>  tools          3.4.4      2018-03-15 local                             
#>  utils        * 3.4.4      2018-03-15 local                             
#>  withr          2.1.2      2018-04-29 Github (jimhester/withr@79d7b0d)  
#>  xgboost      * 0.6.4.1    2018-01-23 CRAN (R 3.4.3)                    
#>  xtable         1.8-2      2016-02-05 CRAN (R 3.2.3)                    
#>  yaImpute       1.0-29     2017-12-10 CRAN (R 3.4.3)                    
#>  yaml           2.1.18     2018-03-08 cran (@2.1.18)