1 Introduction

DALEX is designed to work with various black-box models, such as tree ensembles, linear models, and neural networks. Unfortunately, R packages that create such models are inconsistent: different tools use different interfaces to train, validate, and use models.

In this vignette we show explanations for models created with the h2o package.

2 Regression use case - apartments data

library("DALEX")
library("h2o")
set.seed(123)

To illustrate the application of DALEX to regression problems we use apartments, an artificial dataset available in the DALEX package. Our goal is to predict the price per square meter of an apartment based on five features: construction year, surface, floor, number of rooms, and district. Four of these variables are numeric, while the fifth, district, is categorical. Prices are given in euros.

data(apartments)
head(apartments)
##   m2.price construction.year surface floor no.rooms    district
## 1     5897              1953      25     3        1 Srodmiescie
## 2     1818              1992     143     9        5     Bielany
## 3     3643              1937      56     1        2       Praga
## 4     3517              1995      93     7        3      Ochota
## 5     3013              1992     144     6        5     Mokotow
## 6     5795              1926      61     6        2 Srodmiescie

3 Models

We create two regular H2O models: glm and gbm. To do this we first need to initialize h2o and convert apartments to an H2OFrame.

h2o.init()
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\AnnA\AppData\Local\Temp\RtmpAFFy1j\file11c8b9b1e4/h2o_AnnA_started_from_r.out
##     C:\Users\AnnA\AppData\Local\Temp\RtmpAFFy1j\file11c8576b2e7e/h2o_AnnA_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         6 seconds 350 milliseconds 
##     H2O cluster timezone:       Europe/Belgrade 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    1 month and 21 days  
##     H2O cluster name:           H2O_started_from_R_AnnA_osn291 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.48 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 3.6.3 (2020-02-29)
h2o.no_progress()

apartments_hf <- as.h2o(apartments)

model_h2o_glm <- h2o.glm(y = "m2.price", training_frame = apartments_hf)
model_h2o_gbm <- h2o.gbm(y = "m2.price", training_frame = apartments_hf)

We also create a third model with H2O AutoML.

model_h2o_automl <- h2o.automl(y = "m2.price", training_frame = apartments_hf, max_models = 10)

4 The explain() function

The first step in using the DALEX package is to wrap the black-box model with metadata that unifies model interfacing. For models built in h2o we can use DALEXtra, an extension of the DALEX package.

To create an explainer we use the explain_h2o() function. The validation dataset for the models is apartmentsTest from the DALEX package. An explainer needs a predict function that takes two arguments, model and newdata, and returns a numeric vector of predictions. For h2o models explain_h2o() supplies one, which the output below reports as yhat.H2ORegressionModel (default).

explainer_h2o_glm <- DALEXtra::explain_h2o(model = model_h2o_glm, 
                             data = apartmentsTest[,2:6],  
                             y = apartmentsTest$m2.price,
                             label = "h2o glm",
                             colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  h2o glm 
##   -> data              :  9000  rows  5  cols 
##   -> target variable   :  9000  values 
##   -> predict function  :  yhat.H2ORegressionModel  will be used (  default  )
##   -> predicted values  :  numerical, min =  2437.868 , mean =  3500.502 , max =  4694.49  
##   -> model_info        :  package h2o , ver. 3.32.0.1 , task regression (  default  ) 
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -923.4434 , mean =  11.02155 , max =  2119.145  
##   A new explainer has been created!
explainer_h2o_gbm <- DALEXtra::explain_h2o(model = model_h2o_gbm, 
                             data = apartmentsTest[,2:6],  
                             y = apartmentsTest$m2.price,
                             label = "h2o gbm",
                             colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  h2o gbm 
##   -> data              :  9000  rows  5  cols 
##   -> target variable   :  9000  values 
##   -> predict function  :  yhat.H2ORegressionModel  will be used (  default  )
##   -> predicted values  :  numerical, min =  1702.853 , mean =  3505.4 , max =  6446.298  
##   -> model_info        :  package h2o , ver. 3.32.0.1 , task regression (  default  ) 
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -587.5726 , mean =  6.123194 , max =  645.5172  
##   A new explainer has been created!
explainer_h2o_automl <- DALEXtra::explain_h2o(model = model_h2o_automl, 
                             data = apartmentsTest[,2:6],  
                             y = apartmentsTest$m2.price,
                             label = "h2o automl",
                             colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  h2o automl 
##   -> data              :  9000  rows  5  cols 
##   -> target variable   :  9000  values 
##   -> predict function  :  yhat.H2ORegressionModel  will be used (  default  )
##   -> predicted values  :  numerical, min =  1583.434 , mean =  3515.396 , max =  6570.365  
##   -> model_info        :  package h2o , ver. 3.32.0.1 , task regression (  default  ) 
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -517.7739 , mean =  -3.872876 , max =  571.5371  
##   A new explainer has been created!
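
An explainer relies on a predict function that takes model and newdata and returns a numeric vector. As a rough sketch of what such a function might look like if written by hand for an h2o regression model, consider the definition below. The name custom_predict_h2o is hypothetical, and the sketch assumes that h2o.predict() returns an H2OFrame with a predict column, as it does for regression models. It is not necessarily how DALEXtra implements its default.

```r
# Hypothetical hand-written predict function for an h2o regression model.
# It takes a fitted model and a plain data.frame and returns a numeric vector.
custom_predict_h2o <- function(model, newdata) {
  newdata_hf <- h2o::as.h2o(newdata)               # data.frame -> H2OFrame
  pred_hf <- h2o::h2o.predict(model, newdata_hf)   # H2OFrame with a "predict" column (assumed)
  as.numeric(as.data.frame(pred_hf)$predict)       # back to a plain numeric vector
}

# Possible usage (requires a running H2O cluster and the models fitted above):
# explainer_custom <- DALEX::explain(model = model_h2o_gbm,
#                                    data = apartmentsTest[, 2:6],
#                                    y = apartmentsTest$m2.price,
#                                    predict_function = custom_predict_h2o,
#                                    label = "h2o gbm (custom predict)")
```

Defining the function does not touch h2o. The package is only needed once the function is called.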

5 Model performance

The model_performance() function calculates predictions and residuals for the validation dataset.

mp_h2o_glm <- model_performance(explainer_h2o_glm)
mp_h2o_gbm <- model_performance(explainer_h2o_gbm)
mp_h2o_automl <- model_performance(explainer_h2o_automl)

The generic print() function shows performance measures and quantiles of the residuals.

mp_h2o_glm
## Measures for:  regression
## mse        : 407698.6 
## rmse       : 638.5128 
## r2         : 0.4971679 
## mad        : 481.9948
## 
## Residuals:
##         0%        10%        20%        30%        40%        50%        60% 
## -923.44339 -677.18027 -578.66955 -486.98678 -292.75927  -23.36066   85.25062 
##        70%        80%        90%       100% 
##  194.81788  426.24329  977.34524 2119.14517
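
To make the printed measures concrete, the sketch below recomputes them by hand on a few made-up numbers (not taken from apartmentsTest). It assumes that residuals are y minus yhat, as the explainer output states, and that mad denotes the median of the absolute residuals. Both the toy data and the mad definition are assumptions of this example.

```r
# Toy observed values and predictions (made-up numbers)
observed  <- c(3000, 3500, 4200, 2800, 5100)
predicted <- c(3100, 3400, 4000, 3000, 4900)

res <- observed - predicted            # residual = y - yhat

mse  <- mean(res^2)                    # mean squared error
rmse <- sqrt(mse)                      # root mean squared error
r2   <- 1 - sum(res^2) / sum((observed - mean(observed))^2)
mad  <- median(abs(res))               # assumption: median absolute residual

print(c(mse = mse, rmse = rmse, r2 = r2, mad = mad))
print(quantile(res, probs = seq(0, 1, by = 0.1)))   # deciles, as in the printed output
```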

The generic plot() function shows the reversed empirical cumulative distribution function of the absolute values of residuals. Plots can be generated for one or more models.

plot(mp_h2o_glm, mp_h2o_gbm, mp_h2o_automl)
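
The reversed empirical cumulative distribution function can be sketched in base R as follows. For each value it gives the share of observations whose absolute residual is larger. The residuals here are simulated stand-ins, not the residuals of the models above.

```r
set.seed(1)
res <- rnorm(200, sd = 500)                 # stand-in residuals (simulated)

abs_res <- abs(res)
grid <- sort(abs_res)
# Fraction of absolute residuals strictly greater than each grid value
share_above <- 1 - ecdf(abs_res)(grid)

head(data.frame(abs_residual = grid, share_above = share_above))
# Optional: plot(grid, share_above, type = "s")
```

A curve that drops to zero faster indicates a model with smaller absolute residuals.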

We can also use the plot() function to get an alternative comparison of residuals. Setting geom = "boxplot" compares the distributions of residuals for the selected models.

plot(mp_h2o_glm, mp_h2o_gbm, mp_h2o_automl, geom = "boxplot")

6 Variable importance

Using the DALEX package we are able to better understand which variables are important.

Model-agnostic variable importance is calculated by means of permutations. We permute the values of a single variable in the validation dataset, calculate the loss function on the permuted data, and subtract from it the loss function calculated on the original validation dataset. The larger the increase in loss, the more important the variable.

This method is implemented in the model_parts() function.

vi_h2o_glm <- model_parts(explainer_h2o_glm)
vi_h2o_gbm <- model_parts(explainer_h2o_gbm)
vi_h2o_automl <- model_parts(explainer_h2o_automl)
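
The permutation idea can be illustrated without h2o. The sketch below uses a simulated dataset and an lm() model as stand-ins (both are assumptions of this example), permutes each variable once, and reports the increase in RMSE. For brevity it evaluates on the training data and uses a single permutation per variable, so it is an illustration of the principle rather than a reimplementation of model_parts().

```r
set.seed(123)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1 * toy$x2 + rnorm(n, sd = 0.5)   # x3 is pure noise

fit <- lm(y ~ x1 + x2 + x3, data = toy)
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

loss_full <- rmse(toy$y, predict(fit, toy))   # loss on the unpermuted data

perm_importance <- sapply(c("x1", "x2", "x3"), function(v) {
  permuted <- toy
  permuted[[v]] <- sample(permuted[[v]])           # break the link between v and y
  rmse(toy$y, predict(fit, permuted)) - loss_full  # loss after permutation minus original loss
})

print(sort(perm_importance, decreasing = TRUE))
```

The variable with the largest coefficient, x1, shows the largest increase in loss, while the noise variable x3 stays close to zero.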

We can compare all models using the generic plot() function.

plot(vi_h2o_automl, vi_h2o_gbm, vi_h2o_glm)

The length of the interval corresponds to variable importance. A longer interval means a larger loss, so the variable is more important.

For better comparison of the models we can anchor the variable importance at 0 using type = "difference".

vi_h2o_glm <- model_parts(explainer_h2o_glm, type="difference")
vi_h2o_gbm <- model_parts(explainer_h2o_gbm, type="difference")
vi_h2o_automl <- model_parts(explainer_h2o_automl, type="difference")
plot(vi_h2o_automl, vi_h2o_gbm, vi_h2o_glm)

7 Variable response

Explainers presented in this section are designed to better understand the relation between a variable and model output.

For more details on the methods described in this section, see Partial-dependence Profiles and Local-dependence and Accumulated-local Profiles.

7.1 Partial Dependence Plot

Partial Dependence Plots (PDP) are one of the most popular methods for exploration of the relation between a continuous variable and the model outcome.

The model_profile() function with the parameter type = "partial" calculates the PDP response.

pdp_h2o_glm <- model_profile(explainer_h2o_glm, variable = "construction.year", type = "partial")
pdp_h2o_gbm <- model_profile(explainer_h2o_gbm, variable = "construction.year", type = "partial")
pdp_h2o_automl <- model_profile(explainer_h2o_automl, variable = "construction.year", type = "partial")

plot(pdp_h2o_glm, pdp_h2o_gbm, pdp_h2o_automl)
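
A partial dependence profile can be computed by hand in a few lines: for each grid value, set the variable to that value in every row and average the predictions. The sketch below does this on simulated data with an lm() model. The data, the model, and the helper name partial_dependence are illustrative assumptions, not part of DALEX.

```r
set.seed(123)
n <- 300
toy <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
toy$y <- (toy$x1 - 5)^2 + 2 * toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + I(x1^2) + x2, data = toy)

# Hypothetical helper: average prediction when `variable` is fixed at each grid value
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value      # same value for every row
    mean(predict(model, modified))     # average over the other variables
  })
}

grid <- seq(0, 10, by = 1)
pdp <- partial_dependence(fit, toy, "x1", grid)
print(data.frame(x1 = grid, pdp = round(pdp, 2)))
```

The profile recovers the U-shaped relation built into the simulated data, with its minimum near x1 = 5.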

7.2 Accumulated Local Effects plot

An Accumulated Local Effects (ALE) plot is an extension of PDP that is better suited to highly correlated variables.

The model_profile() function with the parameter type = "accumulated" calculates the ALE curve for the variable construction.year.

ale_h2o_glm <- model_profile(explainer_h2o_glm, variable = "construction.year", type = "accumulated")
ale_h2o_gbm <- model_profile(explainer_h2o_gbm, variable = "construction.year", type = "accumulated")
ale_h2o_automl <- model_profile(explainer_h2o_automl, variable = "construction.year", type = "accumulated")

plot(ale_h2o_glm, ale_h2o_gbm, ale_h2o_automl)
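
The following is a simplified sketch of how an ALE curve can be computed. The variable is split into quantile bins, each observation is moved to the lower and upper edge of its own bin, and the average prediction differences are accumulated and centered. The simulated data, the lm() model, and the helper name accumulated_local are assumptions of this example, and the binning and centering are one common choice rather than the exact procedure used by model_profile().

```r
set.seed(123)
n <- 1000
x1 <- runif(n, 0, 10)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.5))   # x2 strongly correlated with x1
toy$y <- 2 * toy$x1 + toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# Hypothetical helper computing a first-order ALE curve on quantile bins
accumulated_local <- function(model, data, variable, n_bins = 10) {
  x <- data[[variable]]
  breaks <- unname(unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1))))
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

  lower <- data
  lower[[variable]] <- breaks[bin]        # each row moved to its bin's lower edge
  upper <- data
  upper[[variable]] <- breaks[bin + 1]    # each row moved to its bin's upper edge

  # Average local effect within each bin, then accumulate across bins
  local_effect <- tapply(predict(model, upper) - predict(model, lower), bin, mean)
  ale <- c(0, cumsum(as.numeric(local_effect)))

  # Center so that the count-weighted average effect is zero
  counts <- as.numeric(table(bin))
  centre <- sum(counts * (ale[-1] + ale[-length(ale)]) / 2) / length(x)
  data.frame(x = breaks, ale = ale - centre)
}

ale_x1 <- accumulated_local(fit, toy, "x1")
print(ale_x1)
```

Because each observation is only moved within its own bin, the values of the correlated variable x2 stay realistic, which is the motivation for preferring ALE over PDP in this setting.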

8 Prediction understanding

A single model prediction is visualized with Break Down Plots, which show the contribution of every variable present in the model. The predict_parts() function with type = "break_down" generates variable attributions for a selected prediction. The generic plot() function shows these attributions.

new_apartment <- apartmentsTest[1,]
pb_h2o_glm <- predict_parts(explainer_h2o_glm, new_observation = new_apartment, type = "break_down")
pb_h2o_gbm <- predict_parts(explainer_h2o_gbm, new_observation = new_apartment, type = "break_down")
pb_h2o_automl <- predict_parts(explainer_h2o_automl, new_observation = new_apartment, type = "break_down")

plot(pb_h2o_automl)

plot(pb_h2o_gbm)

plot(pb_h2o_glm)
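
The logic behind break-down attributions can be sketched as follows. Starting from the average prediction, variables are fixed one by one at the values of the selected observation, and each change in the average prediction is recorded as that variable's contribution. The simulated data, the lm() model, and the helper name break_down are assumptions of this example. The order of variables is passed in by hand here, whereas predict_parts() chooses it itself.

```r
set.seed(123)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + 0.5 * toy$x3 + rnorm(n, sd = 0.5)
fit <- lm(y ~ x1 + x2 + x3, data = toy)

new_obs <- data.frame(x1 = 1, x2 = -1, x3 = 2)

# Hypothetical helper: sequential attributions for a fixed variable order
break_down <- function(model, data, new_observation, order) {
  current <- data
  intercept <- mean(predict(model, current))     # average prediction over the data
  running <- intercept
  contributions <- setNames(numeric(length(order)), order)
  for (v in order) {
    current[[v]] <- new_observation[[v]]         # fix variable v for every row
    new_mean <- mean(predict(model, current))
    contributions[v] <- new_mean - running       # change attributed to v
    running <- new_mean
  }
  c(intercept = intercept, contributions, prediction = running)
}

bd <- break_down(fit, toy, new_obs, order = c("x1", "x2", "x3"))
print(round(bd, 3))
```

The intercept and the contributions add up to the model prediction for the selected observation, which is the property that Break Down Plots visualize.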

9 Session info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
## [3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.1250    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] h2o_3.32.0.1 DALEX_2.0.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4        pillar_1.4.3      compiler_3.6.3    ingredients_2.0  
##  [5] bitops_1.0-6      tools_3.6.3       bit_4.0.4         digest_0.6.25    
##  [9] jsonlite_1.6.1    evaluate_0.14     lifecycle_0.2.0   tibble_2.1.3     
## [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.6       yaml_2.2.1       
## [17] xfun_0.12         DALEXtra_2.0      stringr_1.4.0     dplyr_1.0.0      
## [21] knitr_1.28        generics_0.0.2    vctrs_0.3.1       bit64_4.0.5      
## [25] grid_3.6.3        tidyselect_1.1.0  reticulate_1.14   glue_1.3.2       
## [29] data.table_1.12.8 R6_2.4.1          iBreakDown_1.3.1  rmarkdown_2.1    
## [33] farver_2.0.3      ggplot2_3.3.0     purrr_0.3.3       magrittr_1.5     
## [37] scales_1.1.0      htmltools_0.4.0   colorspace_1.4-1  labeling_0.3     
## [41] stringi_1.4.6     RCurl_1.98-1.2    munsell_0.5.0     crayon_1.3.4