This code covers chapter 5 of “Introduction to Data Mining” by Pang-Ning Tan, Michael Steinbach and Vipin Kumar. See the table of contents for code examples for other chapters.

This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Show fewer digits

options(digits=3)

Load the data set

data(Zoo, package="mlbench")
head(Zoo)
##           hair feathers  eggs  milk airborne aquatic predator toothed
## aardvark  TRUE    FALSE FALSE  TRUE    FALSE   FALSE     TRUE    TRUE
## antelope  TRUE    FALSE FALSE  TRUE    FALSE   FALSE    FALSE    TRUE
## bass     FALSE    FALSE  TRUE FALSE    FALSE    TRUE     TRUE    TRUE
## bear      TRUE    FALSE FALSE  TRUE    FALSE   FALSE     TRUE    TRUE
## boar      TRUE    FALSE FALSE  TRUE    FALSE   FALSE     TRUE    TRUE
## buffalo   TRUE    FALSE FALSE  TRUE    FALSE   FALSE    FALSE    TRUE
##          backbone breathes venomous  fins legs  tail domestic catsize
## aardvark     TRUE     TRUE    FALSE FALSE    4 FALSE    FALSE    TRUE
## antelope     TRUE     TRUE    FALSE FALSE    4  TRUE    FALSE    TRUE
## bass         TRUE    FALSE    FALSE  TRUE    0  TRUE    FALSE   FALSE
## bear         TRUE     TRUE    FALSE FALSE    4 FALSE    FALSE    TRUE
## boar         TRUE     TRUE    FALSE FALSE    4  TRUE    FALSE    TRUE
## buffalo      TRUE     TRUE    FALSE FALSE    4  TRUE    FALSE    TRUE
##            type
## aardvark mammal
## antelope mammal
## bass       fish
## bear     mammal
## boar     mammal
## buffalo  mammal

Enable multi-core support for cross-validation. Note: This does not work with rJava, which is used by RWeka below, so the code is commented out here.

#library(doParallel)
#registerDoParallel()
#getDoParWorkers()
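
If RWeka is not needed, the full doParallel pattern looks like the following sketch (the number of worker processes is an assumption; adjust it to your machine):

library(doParallel)
cl <- makeCluster(2)    # start a cluster with 2 worker processes (adjust as needed)
registerDoParallel(cl)  # register it as the parallel backend used by caret
getDoParWorkers()       # check how many workers are registered
# ... run the train() calls ...
stopCluster(cl)         # release the workers when done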

Fitting Different Classification Models

Load the caret data mining package

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

Create a fixed sampling scheme (10 folds) so we can compare the fitted models later on.

train <- createFolds(Zoo$type, k=10)
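
createFolds() returns a list of index vectors; by default these are the indices held out in each fold, which is what indexOut expects below. A quick look at the structure:

str(train[1:3])  # each fold is a vector of row indices to hold out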

For help with building models in caret see: ? train

Note: Be careful if you have many NA values in your data. train() and cross-validation may fail in some cases. If that happens, you can remove features (columns) with many NAs, omit rows with NAs using na.omit(), or use imputation to replace them with reasonable values (e.g., the feature mean or values estimated via kNN), as sketched below.
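
The Zoo data has no missing values, so the following is only an illustration of the options:

Zoo_noNA <- na.omit(Zoo)  # drop all rows that contain at least one NA
# kNN imputation via caret (also centers and scales the predictors);
# this assumes all-numeric predictors:
# preProc <- preProcess(Zoo[, -17], method = "knnImpute")
# Zoo_imputed <- predict(preProc, Zoo[, -17])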

Conditional Inference Tree (Decision Tree)

ctreeFit <- train(type ~ ., method = "ctree", data = Zoo,
    tuneLength = 5,
    trControl = trainControl(
        method = "cv", indexOut = train))
## Loading required package: party
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
ctreeFit
## Conditional Inference Tree 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 90, 90, 89, 92, 92, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy  Kappa
##   0.010         0.873     0.832
##   0.255         0.873     0.832
##   0.500         0.873     0.832
##   0.745         0.873     0.832
##   0.990         0.873     0.832
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mincriterion = 0.99.
plot(ctreeFit$finalModel)

The fitted model can be used directly with predict()

predict(ctreeFit, Zoo[1:2,])
## [1] mammal mammal
## Levels: mammal bird reptile fish amphibian insect mollusc.et.al
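
predict() on a caret fit can also return class probabilities for most methods, including ctree:

predict(ctreeFit, Zoo[1:2,], type = "prob")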

C4.5 Decision Tree

library(RWeka)
C45Fit <- train(type ~ ., method = "J48", data = Zoo,
    tuneLength = 5,
    trControl = trainControl(
        method = "cv", indexOut = train))
C45Fit
## C4.5-like Trees 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 90, 91, 92, 91, 91, ... 
## Resampling results across tuning parameters:
## 
##   C      M  Accuracy  Kappa
##   0.010  1  0.979     0.971
##   0.010  2  0.979     0.971
##   0.010  3  0.979     0.971
##   0.010  4  0.951     0.935
##   0.010  5  0.941     0.922
##   0.133  1  0.979     0.971
##   0.133  2  0.979     0.971
##   0.133  3  0.979     0.971
##   0.133  4  0.951     0.935
##   0.133  5  0.941     0.922
##   0.255  1  0.979     0.971
##   0.255  2  0.979     0.971
##   0.255  3  0.979     0.971
##   0.255  4  0.951     0.935
##   0.255  5  0.941     0.922
##   0.378  1  0.979     0.971
##   0.378  2  0.979     0.971
##   0.378  3  0.979     0.971
##   0.378  4  0.951     0.935
##   0.378  5  0.941     0.922
##   0.500  1  0.979     0.971
##   0.500  2  0.979     0.971
##   0.500  3  0.979     0.971
##   0.500  4  0.951     0.935
##   0.500  5  0.941     0.922
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were C = 0.01 and M = 1.
C45Fit$finalModel
## J48 pruned tree
## ------------------
## 
## feathersTRUE <= 0
## |   milkTRUE <= 0
## |   |   backboneTRUE <= 0
## |   |   |   airborneTRUE <= 0: mollusc.et.al (12.0/2.0)
## |   |   |   airborneTRUE > 0: insect (6.0)
## |   |   backboneTRUE > 0
## |   |   |   finsTRUE <= 0
## |   |   |   |   tailTRUE <= 0: amphibian (3.0)
## |   |   |   |   tailTRUE > 0: reptile (6.0/1.0)
## |   |   |   finsTRUE > 0: fish (13.0)
## |   milkTRUE > 0: mammal (41.0)
## feathersTRUE > 0: bird (20.0)
## 
## Number of Leaves  :  7
## 
## Size of the tree :   13

K-Nearest Neighbors

Note: kNN uses Euclidean distance, so the data should be standardized (scaled) first. Here legs is a count between 0 and 8, while all other variables are binary (0/1). An alternative that lets caret do the scaling is shown after the results below.

Zoo_scaled <- cbind(as.data.frame(scale(Zoo[,-17])), type = Zoo[,17])
knnFit <- train(type ~ ., method = "knn", data = Zoo_scaled,
    tuneGrid = data.frame(k = 1:10),
    trControl = trainControl(
        method = "cv", indexOut = train))
knnFit
## k-Nearest Neighbors 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 92, 91, 90, 90, 92, 90, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy  Kappa
##    1  1.000     1.000
##    2  0.980     0.974
##    3  0.979     0.970
##    4  0.959     0.943
##    5  0.979     0.970
##    6  0.959     0.943
##    7  0.958     0.941
##    8  0.917     0.889
##    9  0.917     0.888
##   10  0.888     0.849
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 1.
knnFit$finalModel
## 1-nearest neighbor classification model
## Training set class distribution:
## 
##        mammal          bird       reptile          fish     amphibian 
##            41            20             5            13             4 
##        insect mollusc.et.al 
##             8            10
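
Instead of scaling the data manually, caret can standardize the predictors within each cross-validation fold via the preProcess argument. A sketch using the same folds (results should be very similar; knnFit2 is just an illustrative name):

knnFit2 <- train(type ~ ., method = "knn", data = Zoo,
    preProcess = c("center", "scale"),
    tuneGrid = data.frame(k = 1:10),
    trControl = trainControl(
        method = "cv", indexOut = train))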

PART (Rule-based classifier)

rulesFit <- train(type ~ ., method = "PART", data = Zoo,
  tuneLength = 5,
  trControl = trainControl(
    method = "cv", indexOut = train))
rulesFit
## Rule-Based Classifier 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 93, 90, 92, 91, 90, 91, ... 
## Resampling results across tuning parameters:
## 
##   threshold  pruned  Accuracy  Kappa
##   0.010      yes     0.969     0.958
##   0.010      no      0.989     0.985
##   0.133      yes     0.969     0.958
##   0.133      no      0.989     0.985
##   0.255      yes     0.969     0.958
##   0.255      no      0.989     0.985
##   0.378      yes     0.969     0.958
##   0.378      no      0.989     0.985
##   0.500      yes     0.969     0.958
##   0.500      no      0.989     0.985
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were threshold = 0.5 and pruned = no.
rulesFit$finalModel
## PART decision list
## ------------------
## 
## feathersTRUE <= 0 AND
## milkTRUE > 0: mammal (41.0)
## 
## feathersTRUE > 0: bird (20.0)
## 
## backboneTRUE <= 0 AND
## airborneTRUE <= 0 AND
## predatorTRUE > 0: mollusc.et.al (8.0)
## 
## backboneTRUE <= 0 AND
## legs > 2: insect (8.0)
## 
## finsTRUE > 0: fish (13.0)
## 
## backboneTRUE > 0 AND
## tailTRUE > 0 AND
## aquaticTRUE <= 0: reptile (4.0)
## 
## aquaticTRUE > 0 AND
## venomousTRUE <= 0: amphibian (3.0)
## 
## aquaticTRUE <= 0: mollusc.et.al (2.0)
## 
## : reptile (2.0/1.0)
## 
## Number of Rules  :   9

Linear Support Vector Machines

svmFit <- train(type ~ ., method = "svmLinear", data = Zoo,
    tuneLength = 5,
    trControl = trainControl(
        method = "cv", indexOut = train))
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:modeltools':
## 
##     prior
## The following object is masked from 'package:ggplot2':
## 
##     alpha
svmFit
## Support Vector Machines with Linear Kernel 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 92, 92, 91, 90, 90, 90, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.99      0.987
## 
## Tuning parameter 'C' was held constant at a value of 1
## 
svmFit$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 47 
## 
## Objective Function Value : -0.145 -0.218 -0.148 -0.175 -0.0936 -0.103 -0.297 -0.0819 -0.156 -0.0907 -0.114 -0.182 -0.576 -0.13 -0.183 -0.118 -0.0474 -0.0823 -0.124 -0.148 -0.567 
## Training error : 0

Artificial Neural Network

nnetFit <- train(type ~ ., method = "nnet", data = Zoo,
    tuneLength = 5,
    trControl = trainControl(
        method = "cv", indexOut = train),
  trace = FALSE)
## Loading required package: nnet
nnetFit
## Neural Network 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 91, 93, 91, 91, 90, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy  Kappa
##   1     0e+00  0.777     0.690
##   1     1e-04  0.774     0.676
##   1     1e-03  0.876     0.834
##   1     1e-02  0.835     0.780
##   1     1e-01  0.724     0.623
##   3     0e+00  0.961     0.949
##   3     1e-04  0.960     0.944
##   3     1e-03  1.000     1.000
##   3     1e-02  1.000     1.000
##   3     1e-01  0.990     0.985
##   5     0e+00  1.000     1.000
##   5     1e-04  1.000     1.000
##   5     1e-03  1.000     1.000
##   5     1e-02  1.000     1.000
##   5     1e-01  1.000     1.000
##   7     0e+00  0.990     0.986
##   7     1e-04  1.000     1.000
##   7     1e-03  1.000     1.000
##   7     1e-02  1.000     1.000
##   7     1e-01  1.000     1.000
##   9     0e+00  0.990     0.985
##   9     1e-04  1.000     1.000
##   9     1e-03  1.000     1.000
##   9     1e-02  1.000     1.000
##   9     1e-01  1.000     1.000
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were size = 3 and decay = 0.01.
nnetFit$finalModel
## a 16-3-7 network with 79 weights
## inputs: hairTRUE feathersTRUE eggsTRUE milkTRUE airborneTRUE aquaticTRUE predatorTRUE toothedTRUE backboneTRUE breathesTRUE venomousTRUE finsTRUE legs tailTRUE domesticTRUE catsizeTRUE 
## output(s): .outcome 
## options were - softmax modelling  decay=0.01

Random Forest

randomForestFit <- train(type ~ ., method = "rf", data = Zoo,
    tuneLength = 5,
    trControl = trainControl(
        method = "cv", indexOut = train))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
randomForestFit
## Random Forest 
## 
## 101 samples
##  16 predictor
##   7 classes: 'mammal', 'bird', 'reptile', 'fish', 'amphibian', 'insect', 'mollusc.et.al' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 91, 91, 91, 90, 91, 90, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    0.99      0.987
##    5    0.99      0.987
##    9    0.99      0.987
##   12    0.99      0.987
##   16    0.99      0.987
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
randomForestFit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.96%
## Confusion matrix:
##               mammal bird reptile fish amphibian insect mollusc.et.al
## mammal            41    0       0    0         0      0             0
## bird               0   20       0    0         0      0             0
## reptile            0    1       2    1         1      0             0
## fish               0    0       0   13         0      0             0
## amphibian          0    0       0    0         4      0             0
## insect             0    0       0    0         0      8             0
## mollusc.et.al      0    0       0    0         0      1             9
##               class.error
## mammal                0.0
## bird                  0.0
## reptile               0.6
## fish                  0.0
## amphibian             0.0
## insect                0.0
## mollusc.et.al         0.1
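
Random forests also provide variable importance scores. caret's varImp() extracts them from the fit:

varImp(randomForestFit)  # Gini-based importance, scaled to [0, 100] by default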

Compare Models

resamps <- resamples(list(
  ctree=ctreeFit,
  C45=C45Fit,
  SVM=svmFit,
  KNN=knnFit,
  rules=rulesFit,
  NeuralNet=nnetFit,
  randomForest=randomForestFit))
resamps
## 
## Call:
## resamples.default(x = list(ctree = ctreeFit, C45 = C45Fit, SVM =
##  svmFit, KNN = knnFit, rules = rulesFit, NeuralNet = nnetFit,
##  randomForest = randomForestFit))
## 
## Models: ctree, C45, SVM, KNN, rules, NeuralNet, randomForest 
## Number of resamples: 10 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: ctree, C45, SVM, KNN, rules, NeuralNet, randomForest 
## Number of resamples: 10 
## 
## Accuracy 
##               Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## ctree        0.800   0.808  0.894 0.873     0.9    1    0
## C45          0.889   1.000  1.000 0.979     1.0    1    0
## SVM          0.900   1.000  1.000 0.990     1.0    1    0
## KNN          1.000   1.000  1.000 1.000     1.0    1    0
## rules        0.889   1.000  1.000 0.989     1.0    1    0
## NeuralNet    1.000   1.000  1.000 1.000     1.0    1    0
## randomForest 0.900   1.000  1.000 0.990     1.0    1    0
## 
## Kappa 
##               Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## ctree        0.737   0.753  0.849 0.832   0.868    1    0
## C45          0.845   1.000  1.000 0.971   1.000    1    0
## SVM          0.868   1.000  1.000 0.987   1.000    1    0
## KNN          1.000   1.000  1.000 1.000   1.000    1    0
## rules        0.847   1.000  1.000 0.985   1.000    1    0
## NeuralNet    1.000   1.000  1.000 1.000   1.000    1    0
## randomForest 0.868   1.000  1.000 0.987   1.000    1    0
difs <- diff(resamps)
difs
## 
## Call:
## diff.resamples(x = resamps)
## 
## Models: ctree, C45, SVM, KNN, rules, NeuralNet, randomForest 
## Metrics: Accuracy, Kappa 
## Number of differences: 21 
## p-value adjustment: bonferroni
summary(difs)
## 
## Call:
## summary.diff.resamples(object = difs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##              ctree   C45      SVM      KNN      rules    NeuralNet
## ctree                -0.10576 -0.11687 -0.12687 -0.11576 -0.12687 
## C45          0.01923          -0.01111 -0.02111 -0.01000 -0.02111 
## SVM          0.00296 1.00000           -0.01000  0.00111 -0.01000 
## KNN          0.00294 1.00000  1.00000            0.01111  0.00000 
## rules        0.06237 1.00000  1.00000  1.00000           -0.01111 
## NeuralNet    0.00294 1.00000  1.00000  NA       1.00000           
## randomForest 0.00296 1.00000  NA       1.00000  1.00000  1.00000  
##              randomForest
## ctree        -0.11687    
## C45          -0.01111    
## SVM           0.00000    
## KNN           0.01000    
## rules        -0.00111    
## NeuralNet     0.01000    
## randomForest             
## 
## Kappa 
##              ctree   C45     SVM     KNN     rules   NeuralNet
## ctree                -0.1389 -0.1544 -0.1676 -0.1523 -0.1676  
## C45          0.01671         -0.0155 -0.0287 -0.0134 -0.0287  
## SVM          0.00226 1.00000         -0.0132  0.0021 -0.0132  
## KNN          0.00232 1.00000 1.00000          0.0153  0.0000  
## rules        0.06303 1.00000 1.00000 1.00000         -0.0153  
## NeuralNet    0.00232 1.00000 1.00000 NA      1.00000          
## randomForest 0.00226 1.00000 NA      1.00000 1.00000 1.00000  
##              randomForest
## ctree        -0.1544     
## C45          -0.0155     
## SVM           0.0000     
## KNN           0.0132     
## rules        -0.0021     
## NeuralNet     0.0132     
## randomForest

All models perform similarly well except ctree; the pairwise comparisons above show significant accuracy differences (small adjusted p-values) between ctree and most other methods.
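
The resampling distributions can also be compared visually with caret's lattice plots:

bwplot(resamps)     # box-and-whisker plots of Accuracy and Kappa per model
# dotplot(resamps)  # alternative: dot plots with confidence intervals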

More Information