To illustrate applications of explainers from the DrWhy.AI universe, we will use the credit card fraud dataset available at the following address: https://www.openml.org/d/1597. The data consists of 28 anonymized variables (V1-V28). They were obtained by a PCA transformation of the raw dataset; due to the sensitivity of the raw data, we have access only to this anonymized version. Moreover, there are 3 additional variables: Time, Amount and Class. Time records when a transaction took place, Amount is the amount of money involved, and Class is the target variable, where 1 means that a transaction was fraudulent.
Let’s download our data and display a few observations to get some insight.
# packages used in this analysis (OpenML is assumed attached here; it provides getOMLDataSet())
library(OpenML); library(DALEX); library(ingredients); library(iBreakDown)
library(ggplot2); library(e1071); library(glmnet); library(randomForest)
library(gbm); library(PRROC); library(precrec)
set.seed(333)
OpenML <- getOMLDataSet(data.id = 1597)
data <- OpenML$data
# there are a few duplicated rows, which we remove at the beginning
data <- data[!duplicated(data), ]
head(data)
## Time V1 V2 V3 V4 V5 V6
## 0 0 -1.3598071 -0.07278117 2.5363467 1.3781552 -0.33832077 0.46238778
## 1 0 1.1918571 0.26615071 0.1664801 0.4481541 0.06001765 -0.08236081
## 2 1 -1.3583541 -1.34016307 1.7732093 0.3797796 -0.50319813 1.80049938
## 3 1 -0.9662717 -0.18522601 1.7929933 -0.8632913 -0.01030888 1.24720317
## 4 2 -1.1582331 0.87773675 1.5487178 0.4030339 -0.40719338 0.09592146
## 5 2 -0.4259659 0.96052304 1.1411093 -0.1682521 0.42098688 -0.02972755
## V7 V8 V9 V10 V11 V12
## 0 0.23959855 0.09869790 0.3637870 0.09079417 -0.5515995 -0.61780086
## 1 -0.07880298 0.08510165 -0.2554251 -0.16697441 1.6127267 1.06523531
## 2 0.79146096 0.24767579 -1.5146543 0.20764287 0.6245015 0.06608369
## 3 0.23760894 0.37743587 -1.3870241 -0.05495192 -0.2264873 0.17822823
## 4 0.59294075 -0.27053268 0.8177393 0.75307443 -0.8228429 0.53819555
## 5 0.47620095 0.26031433 -0.5686714 -0.37140720 1.3412620 0.35989384
## V13 V14 V15 V16 V17 V18
## 0 -0.9913898 -0.3111694 1.4681770 -0.4704005 0.20797124 0.02579058
## 1 0.4890950 -0.1437723 0.6355581 0.4639170 -0.11480466 -0.18336127
## 2 0.7172927 -0.1659459 2.3458649 -2.8900832 1.10996938 -0.12135931
## 3 0.5077569 -0.2879237 -0.6314181 -1.0596472 -0.68409279 1.96577500
## 4 1.3458516 -1.1196698 0.1751211 -0.4514492 -0.23703324 -0.03819479
## 5 -0.3580907 -0.1371337 0.5176168 0.4017259 -0.05813282 0.06865315
## V19 V20 V21 V22 V23
## 0 0.40399296 0.25141210 -0.018306778 0.277837576 -0.11047391
## 1 -0.14578304 -0.06908314 -0.225775248 -0.638671953 0.10128802
## 2 -2.26185710 0.52497973 0.247998153 0.771679402 0.90941226
## 3 -1.23262197 -0.20803778 -0.108300452 0.005273597 -0.19032052
## 4 0.80348692 0.40854236 -0.009430697 0.798278495 -0.13745808
## 5 -0.03319379 0.08496767 -0.208253515 -0.559824796 -0.02639767
## V24 V25 V26 V27 V28 Amount Class
## 0 0.06692807 0.1285394 -0.1891148 0.133558377 -0.02105305 149.62 0
## 1 -0.33984648 0.1671704 0.1258945 -0.008983099 0.01472417 2.69 0
## 2 -0.68928096 -0.3276418 -0.1390966 -0.055352794 -0.05975184 378.66 0
## 3 -1.17557533 0.6473760 -0.2219288 0.062722849 0.06145763 123.50 0
## 4 0.14126698 -0.2060096 0.5022922 0.219422230 0.21515315 69.99 0
## 5 -0.37142658 -0.2327938 0.1059148 0.253844225 0.08108026 3.67 0
We see that all variables are numeric. We will take a closer look at the Time and Amount variables. The rest are anonymized, so we cannot do anything more advanced with them than plot simple summary statistics. All anonymized variables look fairly similar, so we show only the first 4 of them.
theme_update(plot.title = element_text(hjust = 0.5))
# helper used below (assumed definition, following the outlier rule stated in
# the text): percentage of points further than 1.5 IQR from the median
compute_outliers <- function(x) {
  round(100 * mean(abs(x - median(x)) > 1.5 * IQR(x)), 2)
}
# summary statistics only for the anonymized variables V1-V28
anonymized_variables <- colnames(data[, grep('V', colnames(data))])
for (v in anonymized_variables[1:4]) {
  D2 <- round(var(data[, v]), 2)
  outliers <- compute_outliers(data[, v])
  kurt <- round(kurtosis(data[, v]), 2)  # kurtosis() from e1071
  pl <- ggplot(data, aes(x = '', y = data[, v])) +
    geom_boxplot(outlier.size = 0.01) +
    ggtitle(paste('Variable:', v,
                  '\n Variance of a variable:', D2,
                  '\n Percentage of outliers:', outliers, '%',
                  '\n Kurtosis of a variable: ', kurt)) +
    ylab('') +
    xlab('') +
    coord_flip()
  print(pl)
}
We can see that the variables have many outliers (here defined as points which lie further than 1.5 interquartile ranges from the median). Additionally, all kurtoses are positive, which means that our variables are heavy-tailed. We can standardize the variances so they are all equal to 1; this is a very common step during data preparation and usually helps the training algorithm perform better. We should also note that all variables have means around zero, hence there is no need to center them.
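We can check the claim about the means directly (a quick sanity check):
# means of the anonymized variables; all should be close to zero
round(colMeans(data[, anonymized_variables]), 3)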
theme_update(plot.title = element_text(hjust = 0.5))
ggplot(data, aes(x = Time)) +
  geom_histogram(color = 'black', fill = 'grey') +
  ylab('Frequency') +
  ggtitle('Histogram of Time variable')
ggplot(data, aes(x = Amount)) +
  geom_histogram(color = 'black', fill = 'grey') +
  ylab('Frequency') +
  ggtitle('Histogram of Amount variable')
ggplot(data, aes(x = log(Amount + 0.1))) +
  geom_histogram(color = 'black', fill = 'grey') +
  ylab('Frequency') +
  ggtitle('Amount variable after log transformation')
temp <- data.frame(table(data$Class))
ggplot(temp, aes(x = as.factor(Var1), y = Freq)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = Freq), vjust = -0.2) +
  ggtitle('Histogram of classes') +
  xlab('') +
  ylab('Frequency')
It seems reasonable to apply a log transformation to the Amount variable, as it is heavily skewed. We will also categorize the Time variable: although it is not documented, it is sensible to assume that the high values in the histogram correspond to transactions made during the day, whereas the low values correspond to nights. The last thing to note is the huge disproportion between positive (473) and negative (283,253) cases; we have to keep in mind that we are working with extremely unbalanced data. With all those corrections, let's build our train and test datasets.
# changing numeric time into a binary day/night variable
data[, 1] <- sapply(data[, 1], function(x){
  # the two low-frequency intervals in the histogram are treated as nights
  if (x < 30000) return(0)
  if ((x > 85000) & (x < 115000)) return(0)
  return(1)})
# turning the Class factor levels into numbers from {0, 1}
data$Class <- as.numeric(data$Class) - 1
# scaling the data
data$Amount <- (data$Amount - mean(data$Amount)) / sd(data$Amount)
for (v in anonymized_variables){
  data[, v] = data[, v] / sd(data[, v])
}
# preparing a subtable with only fraudulent observations
data_fraud <- data[data$Class == 1, ]
data_fraud <- data_fraud[sample(nrow(data_fraud)), ]
data_nonfraud <- data[data$Class == 0, ]
# creating train and test datasets with a similar percentage of fraudulent
# and non-fraudulent transactions; the train set gets 80% of the whole data
data_train <- rbind(data_fraud[1 : floor(0.8*nrow(data_fraud)), ],
                    data_nonfraud[1 : floor(0.8*nrow(data_nonfraud)), ])
data_test <- rbind(data_fraud[(floor(0.8*nrow(data_fraud))+1) : nrow(data_fraud), ],
                   data_nonfraud[(floor(0.8*nrow(data_nonfraud))+1) : nrow(data_nonfraud), ])
col <- ncol(data)
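A quick sanity check (a small sketch) confirms that the split preserves the class proportions:
# fraction of fraudulent transactions in each split
c(train = mean(data_train$Class), test = mean(data_test$Class))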
Finally, we are able to build the models. Hyperparameters were tuned with Bayesian optimization; the goal was to maximize the area under the precision-recall curve, which is a good alternative to the standard AUC score for such an unbalanced dataset. The optimization scripts are attached at the bottom. Additionally, weighting the observations made the minority class more important during training.
frauds <- sum(data_train$Class)/nrow(data_train)
weights <- ifelse(data_train$Class == 1, 1-frauds, frauds)
weights_for_rf <- c(1-frauds, frauds)
names(weights_for_rf) <- c('1', '0')
model_glm <- glmnet(x = as.matrix(data_train[, -col]),
                    y = data_train$Class,
                    family = 'binomial',
                    weights = weights,
                    lambda = 1,
                    alpha = 0.063)
model_rf <- randomForest(as.factor(Class) ~ .,
                         data = data_train,
                         classwt = weights_for_rf,
                         ntree = 1000,
                         mtry = 4,
                         replace = TRUE,
                         # sampsize expects a row count, so the tuned
                         # fraction (0.5) is converted to one here
                         sampsize = floor(0.5 * nrow(data_train)),
                         nodesize = 1)
model_gbm <- gbm(Class == 1 ~ .,
                 data = data_train,
                 weights = weights,
                 distribution = 'adaboost',
                 n.trees = 1172,
                 interaction.depth = 3,
                 shrinkage = 0.1614,
                 bag.fraction = 0.7865)
plot_curves <- function(predictions){
  plot(roc.curve(scores.class0 = predictions,
                 weights.class0 = data_test$Class,
                 curve = TRUE))
  plot(pr.curve(scores.class0 = predictions,
                weights.class0 = data_test$Class,
                curve = TRUE))
}
par(mfrow = c(1, 2))
pred_glm <- predict(model_glm, as.matrix(data_test[, -col]), type = 'response')
plot_curves(pred_glm)
par(mfrow = c(1, 2))
pred_rf <- predict(model_rf, data_test[, -col], type = 'prob')[, 2]
plot_curves(pred_rf)
par(mfrow = c(1, 2))
pred_gbm <- predict(model_gbm, data_test[, -col], n.trees = 1172, type = 'response')
plot_curves(pred_gbm)
The data is unbalanced, hence it is better to look at precision-recall curves instead of the standard ROC. Let's draw all 3 curves in one plot to make the comparison easier.
temp <- mmdata(cbind(pred_gbm, pred_glm, pred_rf), data_test$Class,
               modnames = c('gbm', 'glm', 'rf'))
ms <- evalmod(temp)
autoplot(ms, 'PRC')
As we can see, all models perform really well. From the last plot we can deduce that GLM is the model whose precision drops first; GBM and RF have very similar precision-recall curves.
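To attach numbers to this comparison, we can extract the areas under the curves from the same precrec object (a small sketch using precrec's auc() helper):
# areas under the precision-recall curves for all three models
aucs <- precrec::auc(ms)
subset(aucs, curvetypes == 'PRC')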
Now we can explain our models. It is not strictly necessary to use the explain() function from the DALEX package, but it makes the whole procedure easier.
explainer_glm <- explain(model_glm,
                         data = data_train[, -col],
                         y = data_train$Class,
                         predict_function = function(m, x)
                           as.vector(predict(m, as.matrix(x), type = 'response')),
                         label = 'glm')
explainer_rf <- explain(model_rf,
                        data = data_train[, -col],
                        y = data_train$Class,
                        label = 'rf')
explainer_gbm <- explain(model_gbm,
                         data = data_train[, -col],
                         y = (data_train$Class == 1),
                         predict_function = function(m, x)
                           predict(m, x, n.trees = 1172, type = "response"),
                         label = 'gbm')
Using the DrWhy.AI universe we are able to better understand which variables are important.
Model-agnostic variable importance is calculated by means of permutations: we simply subtract the loss computed on the validation dataset from the loss computed on the same dataset with the values of a single variable permuted. The larger the difference, the more the model relies on that variable.
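To make the procedure concrete, here is a minimal hand-rolled sketch of this idea for a single variable (illustrative only; loss and pred_fun are placeholders for a loss function and a model's prediction function):
# illustrative sketch of permutation importance for one variable
permutation_importance <- function(model, X, y, variable, loss, pred_fun) {
  baseline <- loss(y, pred_fun(model, X))            # loss on the intact data
  X_perm <- X
  X_perm[[variable]] <- sample(X_perm[[variable]])   # break the association
  loss(y, pred_fun(model, X_perm)) - baseline        # importance = increase in loss
}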
This method is implemented in the feature_importance() function from the ingredients package. Because there are many variables, we display only those with the largest importance.
custom_loss_function <- function(y, yhat) {
  1 - mltools::auc_roc(yhat, y)
}
vars <- c('V11', 'V16', 'V4', 'V10', 'V12', 'V14')
fi_glm <- ingredients::feature_importance(x = explainer_glm,
                                          type = "difference",
                                          loss_function = custom_loss_function,
                                          variables = vars)
fi_rf <- ingredients::feature_importance(x = explainer_rf,
                                         type = "difference",
                                         loss_function = custom_loss_function,
                                         variables = vars)
fi_gbm <- ingredients::feature_importance(x = explainer_gbm,
                                          type = "difference",
                                          loss_function = custom_loss_function,
                                          variables = vars)
We can visualize those values in a single plot using the generic plot() or plotD3() functions.
plot(fi_glm, fi_rf, fi_gbm)
As we can see, the tree-based models show only small drops in performance after the permutations.
The break_down() function finds variable attributions via sequential variable conditioning and then computes the contribution of each variable to the final prediction. For non-additive models the function can also compute interactions between predictors. Let's check which variables had the biggest impact on the prediction for the 12th observation from the training dataset.
plotD3(break_down(explainer_glm, data_train[12, ]))
plotD3(break_down(explainer_rf, data_train[12, ]))
plotD3(break_down(explainer_gbm, data_train[12, ]))
This particular observation was fraudulent, and we see that all models classified its label correctly.
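We can verify this directly against the stored label and the three predicted probabilities of fraud (a quick check, assuming the fitted models are still in memory):
# true label of the 12th observation
data_train[12, 'Class']
# predicted probabilities of fraud from each model
as.vector(predict(model_glm, as.matrix(data_train[12, -col]), type = 'response'))
predict(model_rf, data_train[12, -col], type = 'prob')[, 2]
predict(model_gbm, data_train[12, -col], n.trees = 1172, type = 'response')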
Additionally, if we are not confident about one specific ordering of the variables, we can estimate the impact of each variable by sampling several different orderings and averaging the results. We can even obtain interquartile ranges for the estimates; all of this is implemented in the break_down_uncertainty() function from the iBreakDown package.
plot(break_down_uncertainty(explainer_glm, data_train[12, ], B = 25))
plot(break_down_uncertainty(explainer_rf, data_train[12, ], B = 25))
plot(break_down_uncertainty(explainer_gbm, data_train[12, ], B = 25))
We can deduce that GLM has the most consistent variable contributions across different orderings, whereas RF can be unstable. One more thing to note is that in the RF model the variables have only positive contributions (i.e., their values cannot decrease the prediction).
Partial Dependency Profiles are averages of Ceteris Paribus Profiles over the whole dataset. The latter can be used to hypothesize how the model's prediction would change if a selected variable were changed; for this reason they are also called 'What-If Profiles'. Averaging makes the explanations more stable. Below we first peek at a single profile, and then plot Partial Dependency Profiles for the variables V4, V12 and V14.
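A single Ceteris Paribus profile underlying these averages can be computed directly; here is a small sketch using ingredients::ceteris_paribus(), reusing the 12th observation from above (the choice of V14 is arbitrary):
# What-If profile of V14 for a single observation
cp_rf <- ingredients::ceteris_paribus(explainer_rf,
                                      new_observation = data_train[12, -col],
                                      variables = "V14")
plot(cp_rf)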
pdp_glm_V4 <- ingredients::partial_dependency(explainer_glm, variables = "V4")
pdp_rf_V4 <- ingredients::partial_dependency(explainer_rf, variables = "V4")
pdp_gbm_V4 <- ingredients::partial_dependency(explainer_gbm, variables = "V4")
plot(pdp_glm_V4, pdp_rf_V4, pdp_gbm_V4)
pdp_glm_V12 <- ingredients::partial_dependency(explainer_glm, variables = "V12")
pdp_rf_V12 <- ingredients::partial_dependency(explainer_rf, variables = "V12")
pdp_gbm_V12 <- ingredients::partial_dependency(explainer_gbm, variables = "V12")
plot(pdp_glm_V12, pdp_rf_V12, pdp_gbm_V12)
pdp_glm_V14 <- ingredients::partial_dependency(explainer_glm, variables = "V14")
pdp_rf_V14 <- ingredients::partial_dependency(explainer_rf, variables = "V14")
pdp_gbm_V14 <- ingredients::partial_dependency(explainer_gbm, variables = "V14")
plot(pdp_glm_V14, pdp_rf_V14, pdp_gbm_V14)
Let's take a look at the last variable. The figure shows that decreasing the value of V14 increases the predicted probability that a transaction is fraudulent for both the logistic regression and the gradient boosting machine, whereas the random forest's profile is almost flat. Note that the changes for the tree-based models are rather small, which is consistent with the fact that individual variables do not cause dramatic changes in the models' performance (as we saw in the feature importance plots).
if (!require('rBayesianOptimization')){
install.packages('rBayesianOptimization')
library('rBayesianOptimization')
}
if (!require('PRROC')){
install.packages('PRROC')
library('PRROC')
}
if (!require('glmnet')){
install.packages('glmnet')
library('glmnet')
}
# load('../data/data_train_small')
# load('../data/data_test_small')
load('../data/data_train_final')
load('../data/data_test_final')
col = ncol(data_train)
# optimizations
cv_folds <- KFold(target = data_train$Class,
                  nfolds = 5,
                  stratified = TRUE)
# LR optimization
glm_optim <- function(alpha, lambda){
  scores <- numeric(5)
  for (i in 1:5){
    # the i-th fold is held out for validation, the rest is used for training
    train <- data_train[-cv_folds[[i]], ]
    test <- data_train[cv_folds[[i]], ]
    frauds <- sum(train$Class)/nrow(train)
    weights <- ifelse(train$Class == 1, 1-frauds, frauds)
    model_glm <- glmnet(as.matrix(train[, -col]),
                        train$Class,
                        family = 'binomial',
                        lambda = lambda,
                        alpha = alpha,
                        weights = weights)
    # predict() returns the linear predictor by default, so apply the sigmoid
    pred <- predict(model_glm, as.matrix(test[, -col]))
    pred <- sapply(pred, function(x) {1/(1 + exp(-x))})
    pr <- pr.curve(scores.class0 = pred,
                   weights.class0 = test$Class)
    scores[i] <- pr$auc.integral
  }
  return(list(Score = mean(scores), Pred = 0))
}
start <- Sys.time()
print('Training has begun!')
glm_bayes <- BayesianOptimization(FUN = glm_optim,
                                  bounds = list(alpha = c(0, 1),
                                                lambda = c(0, 1)),
                                  n_iter = 50,
                                  init_points = 10,
                                  verbose = TRUE)
stop <- Sys.time()
print(stop - start)
save(glm_bayes, file = 'glmnet_bayes')
if (!require('rBayesianOptimization')){
install.packages('rBayesianOptimization')
library('rBayesianOptimization')
}
if (!require('PRROC')){
install.packages('PRROC')
library('PRROC')
}
if (!require('randomForest')){
install.packages('randomForest')
library('randomForest')
}
# load('../data/data_train_small')
# load('../data/data_test_small')
load('../data/data_train_final')
load('../data/data_test_final')
col = ncol(data_train)
# optimizations
cv_folds <- KFold(target = data_train$Class,
                  nfolds = 5,
                  stratified = TRUE)
# RF optimization
rf_optim <- function(ntree, mtry, replace, samplesize, nodesize){
  scores <- numeric(5)
  for (i in 1:5){
    # the i-th fold is held out for validation, the rest is used for training
    train <- data_train[-cv_folds[[i]], ]
    test <- data_train[cv_folds[[i]], ]
    frauds <- sum(train$Class)/nrow(train)
    weights <- c(1-frauds, frauds)
    names(weights) <- c('1', '0')
    model_rf <- randomForest(as.factor(Class) ~ .,
                             data = train,
                             classwt = weights,
                             ntree = ntree,
                             mtry = mtry,
                             replace = replace,
                             # sampsize expects a row count, so the sampled
                             # fraction is converted to one here
                             sampsize = floor(samplesize * nrow(train)),
                             nodesize = nodesize)
    pr <- pr.curve(scores.class0 = predict(model_rf, test[, -col], type = 'prob')[, 2],
                   weights.class0 = test$Class)
    scores[i] <- pr$auc.integral
  }
  return(list(Score = mean(scores), Pred = 0))
}
start <- Sys.time()
print('Training has begun!')
rf_bayes <- BayesianOptimization(FUN = rf_optim,
                                 bounds = list(ntree = c(300L, 1000L),
                                               mtry = c(4L, 8L),
                                               # 0 and 1 stand for FALSE and TRUE values
                                               replace = c(0L, 1L),
                                               samplesize = c(0.5, 0.8),
                                               nodesize = c(1L, 4L)),
                                 n_iter = 30,
                                 init_points = 10,
                                 verbose = TRUE)
stop <- Sys.time()
print(stop - start)
save(rf_bayes, file = 'rf_bayes')
if (!require('rBayesianOptimization')){
install.packages('rBayesianOptimization')
library('rBayesianOptimization')
}
if (!require('PRROC')){
install.packages('PRROC')
library('PRROC')
}
if (!require('gbm')){
install.packages('gbm')
library('gbm')
}
# load('../data/data_train_small')
# load('../data/data_test_small')
load('../data/data_train_final')
load('../data/data_test_final')
col = ncol(data_train)
# optimizations
cv_folds <- KFold(target = data_train$Class,
                  nfolds = 5,
                  stratified = TRUE)
opts.distribution <- c('bernoulli', 'huberized', 'adaboost')
# GBM optimization
gbm_optim <- function(distribution, n.trees, interaction.depth, shrinkage, bag.fraction){
  # the integer parameter indexes into the vector of candidate distributions
  distribution <- opts.distribution[distribution]
  scores <- numeric(5)
  for (i in 1:5){
    # the i-th fold is held out for validation, the rest is used for training
    train <- data_train[-cv_folds[[i]], ]
    test <- data_train[cv_folds[[i]], ]
    frauds <- sum(train$Class)/nrow(train)
    weights <- ifelse(train$Class == 1, 1-frauds, frauds)
    model_gbm <- gbm(Class == 1 ~ .,
                     data = train,
                     distribution = distribution,
                     n.trees = n.trees,
                     weights = weights,
                     interaction.depth = interaction.depth,
                     shrinkage = shrinkage,
                     bag.fraction = bag.fraction)
    pr <- pr.curve(scores.class0 = predict(model_gbm,
                                           test[, -col],
                                           n.trees = n.trees,
                                           type = 'response'),
                   weights.class0 = test$Class)
    scores[i] <- pr$auc.integral
  }
  return(list(Score = mean(scores), Pred = 0))
}
start <- Sys.time()
print('Training has begun!')
gbm_bayes <- BayesianOptimization(FUN = gbm_optim,
                                  bounds = list(distribution = c(1L, 3L),
                                                n.trees = c(50L, 2000L),
                                                interaction.depth = c(1L, 3L),
                                                shrinkage = c(0.001, 0.3),
                                                bag.fraction = c(0.2, 0.8)),
                                  n_iter = 30,
                                  init_points = 10,
                                  verbose = TRUE)
stop <- Sys.time()
print(stop - start)
save(gbm_bayes, file = 'gbm_bayes')
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=pl_PL.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=pl_PL.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=pl_PL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] precrec_0.10.1 glmnet_2.0-16 foreach_1.4.4
## [4] Matrix_1.2-17 e1071_1.7-1 PRROC_1.3.1
## [7] ggplot2_3.1.1 gbm_2.1.5 randomForest_4.6-14
## [10] iBreakDown_0.9.6 ingredients_0.3.5 DALEX_0.4.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.4 purrr_0.3.0
## [4] splines_3.4.4 lattice_0.20-35 colorspace_1.4-1
## [7] htmltools_0.3.6 yaml_2.2.0 survival_2.41-3
## [10] rlang_0.3.4 pillar_1.4.1 glue_1.3.1
## [13] withr_2.1.2 bindrcpp_0.2.2 bindr_0.1.1
## [16] plyr_1.8.4 stringr_1.4.0 munsell_0.5.0
## [19] gtable_0.3.0 codetools_0.2-15 evaluate_0.12
## [22] labeling_0.3 knitr_1.21 class_7.3-14
## [25] Rcpp_1.0.1 scales_1.0.0 gridExtra_2.3
## [28] digest_0.6.19 stringi_1.4.3 dplyr_0.7.8
## [31] grid_3.4.4 tools_3.4.4 magrittr_1.5
## [34] lazyeval_0.2.2 tibble_2.1.2 crayon_1.3.4
## [37] pkgconfig_2.0.2 data.table_1.12.2 assertthat_0.2.1
## [40] rmarkdown_1.11 rstudioapi_0.10 iterators_1.0.10
## [43] R6_2.4.0 compiler_3.4.4