21  Smile classification models reference - DRAFT 🛠

(ns noj-book.smile-classification
  (:require
   [noj-book.utils.example-code :refer [iris-std make-iris-pipeline]]
   [noj-book.utils.render-tools :refer [render-key-info]]
   [noj-book.utils.surface-plot :refer [surface-plot]]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.toydata :as datasets]
   [scicloj.ml.smile.classification]
   [scicloj.ml.xgboost]
   [tablecloth.api :as tc]
   [tech.v3.dataset.metamorph :as ds-mm]
   [tech.v3.dataset.modelling :as ds-mod]))
(render-key-info :smile.classification)

21.1 :smile.classification/ada-boost

javadoc
user guide
name type default description
trees int32 500 Number of trees
max-depth int32 200 Maximum depth of the tree
max-nodes int32 6 Maximum number of leaf nodes in the tree
node-size int32 1 Number of instances in a node below which the tree will not split, setting nodeSize = 5 generally gives good results
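The parameter names in the table map directly onto keys of the options map passed to ml/model. A minimal sketch (the values are illustrative, not tuned recommendations):

(ml/model {:model-type :smile.classification/ada-boost
           :trees 200       ;; fewer trees than the default 500
           :max-depth 20
           :max-nodes 6
           :node-size 5})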


In this example we use the ability of the AdaBoost classifier to report the importance of variables.

As data we take the Wisconsin Breast Cancer dataset, which has 30 variables.

(def df (datasets/breast-cancer-ds))
(tc/column-names df)
(:mean-radius
 :mean-texture
 :mean-perimeter
 :mean-area
 :mean-smoothness
 :mean-compactness
 :mean-concavity
 :mean-concave-points
 :mean-symmetry
 :mean-fractal-dimension
 :radius-error
 :texture-error
 :perimeter-error
 :area-error
 :smoothness-error
 :compactness-error
 :concavity-error
 :concave-points-error
 :symmetry-error
 :fractal-dimension-error
 :worst-radius
 :worst-texture
 :worst-perimeter
 :worst-area
 :worst-smoothness
 :worst-compactness
 :worst-concavity
 :worst-concave-points
 :worst-symmetry
 :worst-fractal-dimension
 :class)

To get an overview of the dataset, we print its summary:

(-> df tc/info)

_unnamed: descriptive-stats [31 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:mean-radius :float64 569 0 6.9810000 14.12729174 28.11000 3.52404883 0.94237957 17.990000 7.760000
:mean-texture :float64 569 0 9.7100000 19.28964851 39.28000 4.30103577 0.65044954 10.380000 24.540000
:mean-perimeter :float64 569 0 43.7900000 91.96903339 188.50000 24.29898104 0.99065043 122.800000 47.920000
:mean-area :float64 569 0 143.5000000 654.88910369 2501.00000 351.91412918 1.64573218 1001.000000 181.000000
:mean-smoothness :float64 569 0 0.0526300 0.09636028 0.16340 0.01406413 0.45632376 0.118400 0.052630
:mean-compactness :float64 569 0 0.0193800 0.10434098 0.34540 0.05281276 1.19012303 0.277600 0.043620
:mean-concavity :float64 569 0 0.0000000 0.08879932 0.42680 0.07971981 1.40117974 0.300100 0.000000
:mean-concave-points :float64 569 0 0.0000000 0.04891915 0.20120 0.03880284 1.17118008 0.147100 0.000000
:mean-symmetry :float64 569 0 0.1060000 0.18116186 0.30400 0.02741428 0.72560897 0.241900 0.158700
:mean-fractal-dimension :float64 569 0 0.0499600 0.06279761 0.09744 0.00706036 1.30448881 0.078710 0.058840
:radius-error :float64 569 0 0.1115000 0.40517206 2.87300 0.27731273 3.08861217 1.095000 0.385700
:texture-error :float64 569 0 0.3602000 1.21685343 4.88500 0.55164839 1.64644381 0.905300 1.428000
:perimeter-error :float64 569 0 0.7570000 2.86605923 21.98000 2.02185455 3.44361520 8.589000 2.548000
:area-error :float64 569 0 6.8020000 40.33707909 542.20000 45.49100552 5.44718628 153.400000 19.150000
:smoothness-error :float64 569 0 0.0017130 0.00704098 0.03113 0.00300252 2.31445006 0.006399 0.007189
:compactness-error :float64 569 0 0.0022520 0.02547814 0.13540 0.01790818 1.90222071 0.049040 0.004660
:concavity-error :float64 569 0 0.0000000 0.03189372 0.39600 0.03018606 5.11046305 0.053730 0.000000
:concave-points-error :float64 569 0 0.0000000 0.01179614 0.05279 0.00617029 1.44467814 0.015870 0.000000
:symmetry-error :float64 569 0 0.0078820 0.02054230 0.07895 0.00826637 2.19513290 0.030030 0.026760
:fractal-dimension-error :float64 569 0 0.0008948 0.00379490 0.02984 0.00264607 3.92396862 0.006193 0.002783
:worst-radius :float64 569 0 7.9300000 16.26918981 36.04000 4.83324158 1.10311521 25.380000 9.456000
:worst-texture :float64 569 0 12.0200000 25.67722320 49.54000 6.14625762 0.49832131 17.330000 30.370000
:worst-perimeter :float64 569 0 50.4100000 107.26121265 251.20000 33.60254227 1.12816387 184.600000 59.160000
:worst-area :float64 569 0 185.2000000 880.58312830 4254.00000 569.35699267 1.85937327 2019.000000 268.600000
:worst-smoothness :float64 569 0 0.0711700 0.13236859 0.22260 0.02283243 0.41542600 0.162200 0.089960
:worst-compactness :float64 569 0 0.0272900 0.25426504 1.05800 0.15733649 1.47355490 0.665600 0.064440
:worst-concavity :float64 569 0 0.0000000 0.27218848 1.25200 0.20862428 1.15023682 0.711900 0.000000
:worst-concave-points :float64 569 0 0.0000000 0.11460622 0.29100 0.06573234 0.49261553 0.265400 0.000000
:worst-symmetry :float64 569 0 0.1565000 0.29007557 0.66380 0.06186747 1.43392777 0.460100 0.287100
:worst-fractal-dimension :float64 569 0 0.0550400 0.08394582 0.20750 0.01806127 1.66257927 0.118900 0.070390
:class :int16 569 0 0 1.000000 0.000000

Then we create a metamorph pipeline with the AdaBoost model:

(def ada-pipe-fn
 (mm/pipeline
   (ds-mm/set-inference-target :class)
   (ds-mm/categorical->number [:class])
   (ml/model {:model-type :smile.classification/ada-boost})))

We run the pipeline in mode :fit. As we just explore the data, no train/test split is needed.

(def trained-ctx (mm/fit-pipe df ada-pipe-fn))

Next we take the model out of the pipeline:

;; the fitted context maps pipeline steps to their outputs; here we pick
;; the model step's entry by position and thaw it into the underlying
;; Smile Java object
(def model (-> trained-ctx vals (nth 2) ml/thaw-model))
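The positional (nth 2) lookup is brittle. A sketch of a more robust alternative gives the model step an explicit :metamorph/id (the same trick used in the decision-tree example below), so the trained model can be fetched from the context by key:

(def ada-pipe-fn-with-id
 (mm/pipeline
   (ds-mm/set-inference-target :class)
   (ds-mm/categorical->number [:class])
   ;; tag the model step so it lands under :model in the fitted context
   {:metamorph/id :model}
   (ml/model {:model-type :smile.classification/ada-boost})))

;; after fitting, the trained model sits under :model:
;; (-> (mm/fit-pipe df ada-pipe-fn-with-id) :model ml/thaw-model)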

The variable importances can be obtained from the trained model:

(def var-importances
 (mapv
   #(hash-map :variable %1 :importance %2)
   (map #(first (.variables %)) (.. model formula predictors))
   (.importance model)))
var-importances
[{:variable "mean-radius", :importance 27.21071195125037}
 {:variable "mean-texture", :importance 36.85190362720516}
 {:variable "mean-perimeter", :importance 4.0493008550371705}
 {:variable "mean-area", :importance 3.4765857390070463}
 {:variable "mean-smoothness", :importance 21.66390589589813}
 {:variable "mean-compactness", :importance 15.912486432250832}
 {:variable "mean-concavity", :importance 12.34363977341074}
 {:variable "mean-concave-points", :importance 22.70359436651821}
 {:variable "mean-symmetry", :importance 10.048959432953504}
 {:variable "mean-fractal-dimension", :importance 6.924343262361257}
 {:variable "radius-error", :importance 8.971221228662214}
 {:variable "texture-error", :importance 7.9896123740813945}
 {:variable "perimeter-error", :importance 10.790149398506824}
 {:variable "area-error", :importance 15.09584591367921}
 {:variable "smoothness-error", :importance 13.99070642969226}
 {:variable "compactness-error", :importance 12.399212661680444}
 {:variable "concavity-error", :importance 3.270004790745155}
 {:variable "concave-points-error", :importance 11.344151515529605}
 {:variable "symmetry-error", :importance 9.820853228342536}
 {:variable "fractal-dimension-error", :importance 14.749874549557097}
 {:variable "worst-radius", :importance 8.139150986088634}
 {:variable "worst-texture", :importance 26.12818019288407}
 {:variable "worst-perimeter", :importance 9.897002892242261}
 {:variable "worst-area", :importance 15.010564335320119}
 {:variable "worst-smoothness", :importance 18.550024822772443}
 {:variable "worst-compactness", :importance 8.87201713129558}
 {:variable "worst-concavity", :importance 13.521732540972554}
 {:variable "worst-concave-points", :importance 19.603206499908776}
 {:variable "worst-symmetry", :importance 9.504501280412123}
 {:variable "worst-fractal-dimension", :importance 9.406581636351218}]

and we plot the importances per variable:

(kind/vega-lite
  {:data {:values var-importances},
   :width 800,
   :height 500,
   :mark {:type "bar"},
   :encoding
   {:x {:field :variable, :type "nominal", :sort "-y"},
    :y {:field :importance, :type "quantitative"}}})

21.2 :smile.classification/decision-tree

javadoc
user guide
name type default description lookup-table
max-nodes int32 100 maximum number of leaf nodes in the tree
node-size int32 1 minimum size of leaf nodes
max-depth int32 20 maximum depth of the tree
split-rule keyword gini the splitting rule
{:gini "GINI",
 :entropy "ENTROPY",
 :classification-error "CLASSIFICATION_ERROR"}


A decision tree learns a set of rules from the data in the form of a tree, which we will plot in this example. We use the iris dataset:

(def iris (datasets/iris-ds))
iris

_unnamed [150 5]:

:sepal_length :sepal_width :petal_length :petal_width :species
5.1 3.5 1.4 0.2 0
4.9 3.0 1.4 0.2 0
4.7 3.2 1.3 0.2 0
4.6 3.1 1.5 0.2 0
5.0 3.6 1.4 0.2 0
5.4 3.9 1.7 0.4 0
4.6 3.4 1.4 0.3 0
5.0 3.4 1.5 0.2 0
4.4 2.9 1.4 0.2 0
4.9 3.1 1.5 0.1 0
...
6.9 3.1 5.4 2.1 1
6.7 3.1 5.6 2.4 1
6.9 3.1 5.1 2.3 1
5.8 2.7 5.1 1.9 1
6.8 3.2 5.9 2.3 1
6.7 3.3 5.7 2.5 1
6.7 3.0 5.2 2.3 1
6.3 2.5 5.0 1.9 1
6.5 3.0 5.2 2.0 1
6.2 3.4 5.4 2.3 1
5.9 3.0 5.1 1.8 1

We create a pipeline containing only the model, as the dataset is already in a form scicloj.ml can use directly:

(def trained-pipe-tree
 (mm/fit-pipe
   iris
   (mm/pipeline
     #:metamorph{:id :model}
     (ml/model {:model-type :smile.classification/decision-tree}))))

We extract the Java object of the trained model.

(def tree-model (-> trained-pipe-tree :model ml/thaw-model))
tree-model
#object[smile.classification.DecisionTree 0xaf48683 "n=150\nnode), split, n, loss, yval, (yprob)\n* denotes terminal node\n1) root 150 329.58 0 (0.33333 0.33333 0.33333)\n 2) petal_length<=2.45000 50 3.8466 0 (0.96226 0.018868 0.018868) *\n 3) petal_length>2.45000 100 140.58 1 (0.0097087 0.49515 0.49515)\n  6) petal_width<=1.75000 54 35.354 2 (0.017544 0.10526 0.87719)\n   12) sepal_length<=7.10000 53 30.434 2 (0.017857 0.089286 0.89286)\n    24) petal_width<=1.65000 51 24.944 2 (0.018519 0.074074 0.90741) *\n    25) petal_width>1.65000 2 3.6652 1 (0.20000 0.40000 0.40000)\n     50) sepal_width<=2.75000 1 1.3863 1 (0.25000 0.50000 0.25000) *\n     51) sepal_width>2.75000 1 1.3863 2 (0.25000 0.25000 0.50000) *\n   13) sepal_length>7.10000 1 1.3863 1 (0.25000 0.50000 0.25000) *\n  7) petal_width>1.75000 46 12.083 1 (0.020408 0.93878 0.040816) *"]

The model has a .dot method, which returns a GraphViz textual representation of the decision tree. We render it to SVG using the kroki service.
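The kroki helper is not required in the namespace above; a minimal sketch of such a helper, assuming clj-http is on the classpath, posts the diagram source to the kroki.io rendering service:

(require '[clj-http.client :as client])

(defn kroki [diagram-source diagram-type output-format]
  ;; kroki.io accepts a JSON payload describing the diagram and returns
  ;; the rendered bytes in the response :body
  (client/post "https://kroki.io/"
               {:content-type :json
                :as :byte-array
                :form-params {:diagram_source diagram-source
                              :diagram_type (name diagram-type)
                              :output_format (name output-format)}}))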

(kind/html
  (String. (:body (kroki (.dot tree-model) :graphviz :svg)) "UTF-8"))
(Rendered SVG of the CART tree: the root splits on petal_length ≤ 2.45; the right branch splits further on petal_width ≤ 1.75, sepal_length ≤ 7.1, petal_width ≤ 1.65, and sepal_width ≤ 2.75, with the predicted species at each leaf.)

21.3 :smile.classification/discrete-naive-bayes

javadoc
user guide
name type default lookup-table
p int32
k int32
discrete-naive-bayes-model keyword
{:polyaurn "POLYAURN",
 :wcnb "WCNB",
 :cnb "CNB",
 :twcnb "TWCNB",
 :bernoulli "BERNOULLI",
 :multinomial "MULTINOMIAL"}


21.4 :smile.classification/fld

javadoc
user guide
name type default description
dimension int32 -1.0 The dimensionality of the mapped space.
tolerance float64 1.0E-4 A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol


21.5 :smile.classification/gradient-tree-boost

javadoc
user guide
name type default description
ntrees int32 500 number of iterations (trees)
max-depth int32 20 maximum depth of the tree
max-nodes int32 6 maximum number of leaf nodes in the tree
node-size int32 5 number of instances in a node below which the tree will not split, setting nodeSize = 5 generally gives good results
shrinkage float64 0.05 the shrinkage parameter in (0, 1] controls the learning rate of the procedure
sampling-rate float64 0.7 the sampling fraction for stochastic tree boosting


21.6 :smile.classification/knn

javadoc
user guide
name type default description
k int32 3 number of neighbors for decision


In this example we use a knn model to classify some dummy data. This is the training data:

(def df-knn
 (tc/dataset
   {:x1 [7 7 3 1], :x2 [7 4 4 4], :y [:bad :bad :good :good]}))
df-knn

_unnamed [4 3]:

:x1 :x2 :y
7 7 :bad
7 4 :bad
3 4 :good
1 4 :good

Then we construct a pipeline with the knn model, using 3 neighbors for decision.

(def knn-pipe-fn
 (mm/pipeline
   (ds-mm/set-inference-target :y)
   (ds-mm/categorical->number [:y])
   (ml/model {:model-type :smile.classification/knn, :k 3})))

We run the pipeline in mode :fit:

(def trained-ctx-knn
 (knn-pipe-fn #:metamorph{:data df-knn, :mode :fit}))

Then we run the pipeline in mode :transform with some test data, take the prediction, and convert it from numeric back to categorical:

(-> trained-ctx-knn
 (merge
   #:metamorph{:data (tc/dataset {:x1 [3 5], :x2 [7 5], :y [nil nil]}),
               :mode :transform})
 knn-pipe-fn
 :metamorph/data
 (ds-mod/column-values->categorical :y)
 seq)
(:good :bad)

21.7 :smile.classification/linear-discriminant-analysis

javadoc
user guide
name type default description
priori float64-array The a priori probability of each class. If null, it will be estimated from the training data.
tolerance float64 1.0E-4 A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol


21.8 :smile.classification/logistic-regression

javadoc
user guide
name type default description
lambda float64 0.1 lambda > 0 gives a regularized estimate of linear weights, which often has superior generalization performance, especially when the dimensionality is high
tolerance float64 1.0E-5 tolerance for stopping iterations
max-iterations int32 500 maximum number of iterations
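These options go into the ml/model options map like for any other model. A minimal sketch on the binary breast-cancer dataset from the AdaBoost example (parameter values are illustrative):

(def lr-pipe
 (mm/pipeline
   (ds-mm/set-inference-target :class)
   (ds-mm/categorical->number [:class])
   (ml/model {:model-type :smile.classification/logistic-regression
              :lambda 0.1
              :tolerance 1.0e-5
              :max-iterations 1000})))

(def lr-ctx (mm/fit-pipe df lr-pipe))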


21.9 :smile.classification/maxent-binomial



21.10 :smile.classification/maxent-multinomial



21.11 :smile.classification/mlp

javadoc
user guide
name type default description
layer-builders seq [] Sequence of type smile.base.mlp.LayerBuilder describing the layers of the neural network
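The layer builders are plain Smile Java objects. A minimal sketch, assuming Smile's smile.base.mlp.Layer builder API (a hidden rectifier layer plus a maximum-likelihood softmax output layer) and the iris dataset defined above, which has three classes:

(import '[smile.base.mlp Layer OutputFunction])

(def mlp-pipe
 (mm/pipeline
   (ds-mm/set-inference-target :species)
   (ds-mm/categorical->number [:species])
   ;; one hidden ReLU layer with 10 neurons, and a softmax output
   ;; layer for the 3 iris classes
   (ml/model {:model-type :smile.classification/mlp
              :layer-builders [(Layer/rectifier 10)
                               (Layer/mle 3 OutputFunction/SOFTMAX)]})))

(def mlp-ctx (mm/fit-pipe iris mlp-pipe))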


21.12 :smile.classification/quadratic-discriminant-analysis

javadoc
user guide
name type default description
priori float64-array The a priori probability of each class. If null, it will be estimated from the training data.
tolerance float64 1.0E-4 A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol


21.13 :smile.classification/random-forest

javadoc
user guide
name type default description lookup-table
trees int32 500 Number of trees
mtry int32 0 number of input variables to be used to determine the decision at a node of the tree. floor(sqrt(p)) generally gives good performance, where p is the number of variables
split-rule keyword gini Decision tree split rule
{:gini "GINI",
 :entropy "ENTROPY",
 :classification-error "CLASSIFICATION_ERROR"}
max-depth int32 20 Maximum depth of tree
max-nodes int32 (computed from the data) Maximum number of leaf nodes in the tree
node-size int32 5 number of instances in a node below which the tree will not split, nodeSize = 5 generally gives good results
sample-rate float32 1.0 the sampling rate for training tree. 1.0 means sampling with replacement. < 1.0 means sampling without replacement.
class-weight string Priors of the classes. The weight of each class is roughly the ratio of samples in each class. For example, if there are 400 positive samples and 100 negative samples, the classWeight should be [1, 4] (assuming label 0 is negative and label 1 is positive)


The following code plots the decision surfaces of the random forest model on pairs of features.

We use the Iris dataset for this.

iris-std

https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [150 5]:

:sepal_length :sepal_width :petal_length :petal_width :species
-0.89767388 1.02861128 -1.33679402 -1.30859282 setosa
-1.13920048 -0.12454038 -1.33679402 -1.30859282 setosa
-1.38072709 0.33672028 -1.39346985 -1.30859282 setosa
-1.50149039 0.10608995 -1.28011819 -1.30859282 setosa
-1.01843718 1.25924161 -1.33679402 -1.30859282 setosa
-0.53538397 1.95113261 -1.16676652 -1.04652483 setosa
-1.50149039 0.79798095 -1.33679402 -1.17755883 setosa
-1.01843718 0.79798095 -1.28011819 -1.30859282 setosa
-1.74301699 -0.35517071 -1.33679402 -1.30859282 setosa
-1.13920048 0.10608995 -1.28011819 -1.43962681 setosa
...
1.27606556 0.10608995 0.93023937 1.18105307 virginica
1.03453895 0.10608995 1.04359104 1.57415505 virginica
1.27606556 0.10608995 0.76021186 1.44312105 virginica
-0.05233076 -0.81643138 0.76021186 0.91898508 virginica
1.15530226 0.33672028 1.21361854 1.44312105 virginica
1.03453895 0.56735062 1.10026687 1.70518904 virginica
1.03453895 -0.12454038 0.81688770 1.44312105 virginica
0.55148575 -1.27769204 0.70353603 0.91898508 virginica
0.79301235 -0.12454038 0.81688770 1.05001907 virginica
0.43072244 0.79798095 0.93023937 1.44312105 virginica
0.06843254 -0.12454038 0.76021186 0.78795108 virginica

The surface-plot helper creates a Vega-Lite specification for the random forest decision surface over a given pair of columns; the pipeline comes from the make-iris-pipeline helper in noj-book.utils.example-code:

(def rf-pipe
 (make-iris-pipeline {:model-type :smile.classification/random-forest}))
(kind/vega-lite
  (surface-plot
    iris
    [:sepal_length :sepal_width]
    rf-pipe
    :smile.classification/random-forest))
(kind/vega-lite
  (surface-plot
    iris-std
    [:sepal_length :petal_length]
    rf-pipe
    :smile.classification/random-forest))
(kind/vega-lite
  (surface-plot
    iris-std
    [:sepal_length :petal_width]
    rf-pipe
    :smile.classification/random-forest))
(kind/vega-lite
  (surface-plot
    iris-std
    [:sepal_width :petal_length]
    rf-pipe
    :smile.classification/random-forest))
(kind/vega-lite
  (surface-plot
    iris-std
    [:sepal_width :petal_width]
    rf-pipe
    :smile.classification/random-forest))
(kind/vega-lite
  (surface-plot
    iris-std
    [:petal_length :petal_width]
    rf-pipe
    :smile.classification/random-forest))

21.14 :smile.classification/regularized-discriminant-analysis

javadoc
user guide
name type default description
priori float64-array The a priori probability of each class. If null, it will be estimated from the training data.
alpha float64 0.9 Regularization factor in [0, 1] allows a continuum of models between LDA and QDA.
tolerance float64 1.0E-4 A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol


21.15 :smile.classification/sparse-logistic-regression

name type default
lambda float32 0.1
tolerance float32 1.0E-5
max-iterations int32 500


21.16 :smile.classification/sparse-svm

javadoc
user guide
name type default description
C float32 1.0 soft margin penalty parameter
tol float32 1.0E-4 tolerance of convergence test


21.17 :smile.classification/svm

javadoc
user guide
name type default description
C float32 1.0 soft margin penalty parameter
tol float32 1.0E-4 tolerance of convergence test
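A minimal sketch fitting an SVM on the binary breast-cancer data from above, following the same pipeline pattern as the other examples (C and tol are illustrative):

(def svm-ctx
 (mm/fit-pipe
   df
   (mm/pipeline
     (ds-mm/set-inference-target :class)
     (ds-mm/categorical->number [:class])
     (ml/model {:model-type :smile.classification/svm
                :C 1.0
                :tol 1.0e-4}))))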


22 Compare decision surfaces of different classification models

In the following we see the decision surfaces of some models on the same data from the Iris dataset, using the two columns :sepal_width and :sepal_length:

(Gallery of decision-surface plots, one per model type.)
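A minimal sketch of how such a gallery can be produced, assuming the surface-plot and make-iris-pipeline helpers and the iris-std dataset from the previous chapter (the list of model types is illustrative):

(->> [:smile.classification/decision-tree
      :smile.classification/ada-boost
      :smile.classification/logistic-regression
      :smile.classification/random-forest
      :smile.classification/knn]
     (mapv (fn [model-type]
             ;; one decision-surface plot per model type, all on the
             ;; same pair of columns
             (kind/vega-lite
              (surface-plot iris-std
                            [:sepal_width :sepal_length]
                            (make-iris-pipeline {:model-type model-type})
                            model-type)))))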

This nicely shows that different model types differ in their ability to separate, and therefore classify, the data.

source: notebooks/noj_book/smile_classification.clj