21 Smile classification models reference - DRAFT 🛠
```clojure
(ns noj-book.smile-classification
  (:require
   [noj-book.utils.example-code :refer [iris-std make-iris-pipeline]]
   [noj-book.utils.render-tools :refer [render-key-info]]
   [noj-book.utils.surface-plot :refer [surface-plot]]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.ml.smile.classification]
   [scicloj.ml.xgboost]
   [tech.v3.dataset.metamorph :as ds-mm]
   ;; the following requires are implied by the examples below:
   [scicloj.metamorph.ml.toydata :as datasets]
   [tablecloth.api :as tc]
   [tech.v3.dataset.modelling :as ds-mod]))
```

```clojure
(render-key-info :smile.classification)
```
21.1 :smile.classification/ada-boost
name | type | default | description |
---|---|---|---|
trees | int32 | 500 | Number of trees |
max-depth | int32 | 200 | Maximum depth of the tree |
max-nodes | int32 | 6 | Maximum number of leaf nodes in the tree |
node-size | int32 | 1 | Number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results |
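Hyperparameters from this table are passed as additional keys in the option map of `ml/model`; a minimal sketch (the values here are arbitrary, chosen only for illustration):

```clojure
;; Sketch: hyperparameters go into the `ml/model` option map next to
;; :model-type (values are arbitrary, for illustration only).
(ml/model {:model-type :smile.classification/ada-boost
           :trees 200
           :max-depth 20})
```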
In this example we use the ability of the AdaBoost classifier to report the importance of variables.
As data we take the Wisconsin Breast Cancer dataset, which has 30 variables.
```clojure
(def df (datasets/breast-cancer-ds))
```
```clojure
(tc/column-names df)
```

```clojure
(:mean-radius
 :mean-texture
 :mean-perimeter
 :mean-area
 :mean-smoothness
 :mean-compactness
 :mean-concavity
 :mean-concave-points
 :mean-symmetry
 :mean-fractal-dimension
 :radius-error
 :texture-error
 :perimeter-error
 :area-error
 :smoothness-error
 :compactness-error
 :concavity-error
 :concave-points-error
 :symmetry-error
 :fractal-dimension-error
 :worst-radius
 :worst-texture
 :worst-perimeter
 :worst-area
 :worst-smoothness
 :worst-compactness
 :worst-concavity
 :worst-concave-points
 :worst-symmetry
 :worst-fractal-dimension
 :class)
```
To get an overview of the dataset, we print its summary:
```clojure
(-> df tc/info)
```
_unnamed: descriptive-stats [31 12]:

:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|---|
:mean-radius | :float64 | 569 | 0 | 6.9810000 | 14.12729174 | | 28.11000 | 3.52404883 | 0.94237957 | 17.990000 | 7.760000 |
:mean-texture | :float64 | 569 | 0 | 9.7100000 | 19.28964851 | | 39.28000 | 4.30103577 | 0.65044954 | 10.380000 | 24.540000 |
:mean-perimeter | :float64 | 569 | 0 | 43.7900000 | 91.96903339 | | 188.50000 | 24.29898104 | 0.99065043 | 122.800000 | 47.920000 |
:mean-area | :float64 | 569 | 0 | 143.5000000 | 654.88910369 | | 2501.00000 | 351.91412918 | 1.64573218 | 1001.000000 | 181.000000 |
:mean-smoothness | :float64 | 569 | 0 | 0.0526300 | 0.09636028 | | 0.16340 | 0.01406413 | 0.45632376 | 0.118400 | 0.052630 |
:mean-compactness | :float64 | 569 | 0 | 0.0193800 | 0.10434098 | | 0.34540 | 0.05281276 | 1.19012303 | 0.277600 | 0.043620 |
:mean-concavity | :float64 | 569 | 0 | 0.0000000 | 0.08879932 | | 0.42680 | 0.07971981 | 1.40117974 | 0.300100 | 0.000000 |
:mean-concave-points | :float64 | 569 | 0 | 0.0000000 | 0.04891915 | | 0.20120 | 0.03880284 | 1.17118008 | 0.147100 | 0.000000 |
:mean-symmetry | :float64 | 569 | 0 | 0.1060000 | 0.18116186 | | 0.30400 | 0.02741428 | 0.72560897 | 0.241900 | 0.158700 |
:mean-fractal-dimension | :float64 | 569 | 0 | 0.0499600 | 0.06279761 | | 0.09744 | 0.00706036 | 1.30448881 | 0.078710 | 0.058840 |
:radius-error | :float64 | 569 | 0 | 0.1115000 | 0.40517206 | | 2.87300 | 0.27731273 | 3.08861217 | 1.095000 | 0.385700 |
:texture-error | :float64 | 569 | 0 | 0.3602000 | 1.21685343 | | 4.88500 | 0.55164839 | 1.64644381 | 0.905300 | 1.428000 |
:perimeter-error | :float64 | 569 | 0 | 0.7570000 | 2.86605923 | | 21.98000 | 2.02185455 | 3.44361520 | 8.589000 | 2.548000 |
:area-error | :float64 | 569 | 0 | 6.8020000 | 40.33707909 | | 542.20000 | 45.49100552 | 5.44718628 | 153.400000 | 19.150000 |
:smoothness-error | :float64 | 569 | 0 | 0.0017130 | 0.00704098 | | 0.03113 | 0.00300252 | 2.31445006 | 0.006399 | 0.007189 |
:compactness-error | :float64 | 569 | 0 | 0.0022520 | 0.02547814 | | 0.13540 | 0.01790818 | 1.90222071 | 0.049040 | 0.004660 |
:concavity-error | :float64 | 569 | 0 | 0.0000000 | 0.03189372 | | 0.39600 | 0.03018606 | 5.11046305 | 0.053730 | 0.000000 |
:concave-points-error | :float64 | 569 | 0 | 0.0000000 | 0.01179614 | | 0.05279 | 0.00617029 | 1.44467814 | 0.015870 | 0.000000 |
:symmetry-error | :float64 | 569 | 0 | 0.0078820 | 0.02054230 | | 0.07895 | 0.00826637 | 2.19513290 | 0.030030 | 0.026760 |
:fractal-dimension-error | :float64 | 569 | 0 | 0.0008948 | 0.00379490 | | 0.02984 | 0.00264607 | 3.92396862 | 0.006193 | 0.002783 |
:worst-radius | :float64 | 569 | 0 | 7.9300000 | 16.26918981 | | 36.04000 | 4.83324158 | 1.10311521 | 25.380000 | 9.456000 |
:worst-texture | :float64 | 569 | 0 | 12.0200000 | 25.67722320 | | 49.54000 | 6.14625762 | 0.49832131 | 17.330000 | 30.370000 |
:worst-perimeter | :float64 | 569 | 0 | 50.4100000 | 107.26121265 | | 251.20000 | 33.60254227 | 1.12816387 | 184.600000 | 59.160000 |
:worst-area | :float64 | 569 | 0 | 185.2000000 | 880.58312830 | | 4254.00000 | 569.35699267 | 1.85937327 | 2019.000000 | 268.600000 |
:worst-smoothness | :float64 | 569 | 0 | 0.0711700 | 0.13236859 | | 0.22260 | 0.02283243 | 0.41542600 | 0.162200 | 0.089960 |
:worst-compactness | :float64 | 569 | 0 | 0.0272900 | 0.25426504 | | 1.05800 | 0.15733649 | 1.47355490 | 0.665600 | 0.064440 |
:worst-concavity | :float64 | 569 | 0 | 0.0000000 | 0.27218848 | | 1.25200 | 0.20862428 | 1.15023682 | 0.711900 | 0.000000 |
:worst-concave-points | :float64 | 569 | 0 | 0.0000000 | 0.11460622 | | 0.29100 | 0.06573234 | 0.49261553 | 0.265400 | 0.000000 |
:worst-symmetry | :float64 | 569 | 0 | 0.1565000 | 0.29007557 | | 0.66380 | 0.06186747 | 1.43392777 | 0.460100 | 0.287100 |
:worst-fractal-dimension | :float64 | 569 | 0 | 0.0550400 | 0.08394582 | | 0.20750 | 0.01806127 | 1.66257927 | 0.118900 | 0.070390 |
:class | :int16 | 569 | 0 | 0 | | | 1.000000 | | | 0.000000 | |
Then we create a metamorph pipeline with the AdaBoost model:

```clojure
(def ada-pipe-fn
  (mm/pipeline
   (ds-mm/set-inference-target :class)
   (ds-mm/categorical->number [:class])
   (ml/model {:model-type :smile.classification/ada-boost})))
```
We run the pipeline in mode :fit. As we only explore the data, no train/test split is needed.
```clojure
(def trained-ctx (mm/fit-pipe df ada-pipe-fn))
```
Next we take the model out of the pipeline:
```clojure
(def model (-> trained-ctx vals (nth 2) ml/thaw-model))
```
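Relying on the position of the model in the context (via `(nth 2)`) is brittle. A more robust sketch gives the model step an explicit `:metamorph/id`, as the decision-tree example below also does, so it can be looked up by key:

```clojure
;; Sketch: tag the model step with an explicit :metamorph/id so the trained
;; model can be looked up by key instead of by position.
(def ada-pipe-with-id
  (mm/pipeline
   (ds-mm/set-inference-target :class)
   (ds-mm/categorical->number [:class])
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.classification/ada-boost})))

;; The trained model is then available under :model in the fitted context:
(-> (mm/fit-pipe df ada-pipe-with-id) :model ml/thaw-model)
```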
The variable importances can be obtained from the trained model:
```clojure
(def var-importances
  (mapv
   #(hash-map :variable %1 :importance %2)
   (map #(first (.variables %)) (.. model formula predictors))
   (.importance model)))
```
```clojure
var-importances
```

```clojure
[{:variable "mean-radius", :importance 27.21071195125037}
{:variable "mean-texture", :importance 36.85190362720516}
{:variable "mean-perimeter", :importance 4.0493008550371705}
{:variable "mean-area", :importance 3.4765857390070463}
{:variable "mean-smoothness", :importance 21.66390589589813}
{:variable "mean-compactness", :importance 15.912486432250832}
{:variable "mean-concavity", :importance 12.34363977341074}
{:variable "mean-concave-points", :importance 22.70359436651821}
{:variable "mean-symmetry", :importance 10.048959432953504}
{:variable "mean-fractal-dimension", :importance 6.924343262361257}
{:variable "radius-error", :importance 8.971221228662214}
{:variable "texture-error", :importance 7.9896123740813945}
{:variable "perimeter-error", :importance 10.790149398506824}
{:variable "area-error", :importance 15.09584591367921}
{:variable "smoothness-error", :importance 13.99070642969226}
{:variable "compactness-error", :importance 12.399212661680444}
{:variable "concavity-error", :importance 3.270004790745155}
{:variable "concave-points-error", :importance 11.344151515529605}
{:variable "symmetry-error", :importance 9.820853228342536}
{:variable "fractal-dimension-error", :importance 14.749874549557097}
{:variable "worst-radius", :importance 8.139150986088634}
{:variable "worst-texture", :importance 26.12818019288407}
{:variable "worst-perimeter", :importance 9.897002892242261}
{:variable "worst-area", :importance 15.010564335320119}
{:variable "worst-smoothness", :importance 18.550024822772443}
{:variable "worst-compactness", :importance 8.87201713129558}
{:variable "worst-concavity", :importance 13.521732540972554}
{:variable "worst-concave-points", :importance 19.603206499908776}
{:variable "worst-symmetry", :importance 9.504501280412123}
{:variable "worst-fractal-dimension", :importance 9.406581636351218}] {
and we plot the variables:
```clojure
(kind/vega-lite
 {:data {:values var-importances},
  :width 800,
  :height 500,
  :mark {:type "bar"},
  :encoding
  {:x {:field :variable, :type "nominal", :sort "-y"},
   :y {:field :importance, :type "quantitative"}}})
```
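The most influential variables can also be read off directly, as a small usage sketch:

```clojure
;; Usage sketch: the five most important variables, in descending order.
(->> var-importances
     (sort-by :importance >)
     (take 5)
     (mapv :variable))
```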
21.2 :smile.classification/decision-tree
name | type | default | description | lookup-table |
---|---|---|---|---|
max-nodes | int32 | 100 | maximum number of leaf nodes in the tree | |
node-size | int32 | 1 | minimum size of leaf nodes | |
max-depth | int32 | 20 | maximum depth of the tree | |
split-rule | keyword | gini | the splitting rule | |
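The `split-rule` option is a keyword resolved via the lookup-table; a sketch (assuming `:entropy` is among the supported values):

```clojure
;; Sketch: selecting a non-default splitting rule; :entropy is assumed to be
;; one of the lookup-table entries for split-rule.
(ml/model {:model-type :smile.classification/decision-tree
           :split-rule :entropy})
```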
A decision tree learns a set of rules from the data in the form of a tree, which we will plot in this example. We use the iris dataset:
```clojure
(def iris (datasets/iris-ds))
```

```clojure
iris
```
_unnamed [150 5]:
:sepal_length | :sepal_width | :petal_length | :petal_width | :species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | 0 |
4.9 | 3.0 | 1.4 | 0.2 | 0 |
4.7 | 3.2 | 1.3 | 0.2 | 0 |
4.6 | 3.1 | 1.5 | 0.2 | 0 |
5.0 | 3.6 | 1.4 | 0.2 | 0 |
5.4 | 3.9 | 1.7 | 0.4 | 0 |
4.6 | 3.4 | 1.4 | 0.3 | 0 |
5.0 | 3.4 | 1.5 | 0.2 | 0 |
4.4 | 2.9 | 1.4 | 0.2 | 0 |
4.9 | 3.1 | 1.5 | 0.1 | 0 |
… | … | … | … | … |
6.9 | 3.1 | 5.4 | 2.1 | 1 |
6.7 | 3.1 | 5.6 | 2.4 | 1 |
6.9 | 3.1 | 5.1 | 2.3 | 1 |
5.8 | 2.7 | 5.1 | 1.9 | 1 |
6.8 | 3.2 | 5.9 | 2.3 | 1 |
6.7 | 3.3 | 5.7 | 2.5 | 1 |
6.7 | 3.0 | 5.2 | 2.3 | 1 |
6.3 | 2.5 | 5.0 | 1.9 | 1 |
6.5 | 3.0 | 5.2 | 2.0 | 1 |
6.2 | 3.4 | 5.4 | 2.3 | 1 |
5.9 | 3.0 | 5.1 | 1.8 | 1 |
We make a pipeline containing only the model, as the dataset is ready to be used by scicloj.ml:
```clojure
(def trained-pipe-tree
  (mm/fit-pipe
   iris
   (mm/pipeline
    #:metamorph{:id :model}
    (ml/model {:model-type :smile.classification/decision-tree}))))
```
We extract the Java object of the trained model.
```clojure
(def tree-model (-> trained-pipe-tree :model ml/thaw-model))
```
```clojure
tree-model
```

```
#object[smile.classification.DecisionTree 0xaf48683 "n=150\nnode), split, n, loss, yval, (yprob)\n* denotes terminal node\n1) root 150 329.58 0 (0.33333 0.33333 0.33333)\n 2) petal_length<=2.45000 50 3.8466 0 (0.96226 0.018868 0.018868) *\n 3) petal_length>2.45000 100 140.58 1 (0.0097087 0.49515 0.49515)\n 6) petal_width<=1.75000 54 35.354 2 (0.017544 0.10526 0.87719)\n 12) sepal_length<=7.10000 53 30.434 2 (0.017857 0.089286 0.89286)\n 24) petal_width<=1.65000 51 24.944 2 (0.018519 0.074074 0.90741) *\n 25) petal_width>1.65000 2 3.6652 1 (0.20000 0.40000 0.40000)\n 50) sepal_width<=2.75000 1 1.3863 1 (0.25000 0.50000 0.25000) *\n 51) sepal_width>2.75000 1 1.3863 2 (0.25000 0.25000 0.50000) *\n 13) sepal_length>7.10000 1 1.3863 1 (0.25000 0.50000 0.25000) *\n 7) petal_width>1.75000 46 12.083 1 (0.020408 0.93878 0.040816) *"]
```
The model has a `.dot` method, which returns a GraphViz textual representation of the decision tree. We render it to SVG using the kroki service:
```clojure
(kind/html
 (String. (:body (kroki (.dot tree-model) :graphviz :svg)) "UTF-8"))
```
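The `kroki` helper is not shown in this chapter; a minimal sketch of such a helper (not the actual one used by this book), assuming `clj-http` is available, could POST the diagram source to the kroki.io service:

```clojure
;; Minimal sketch of a `kroki` helper: POSTs the diagram source to kroki.io
;; and returns the response map, whose :body holds the rendered bytes.
(require '[clj-http.client :as http])

(defn kroki [diagram-source diagram-type output-format]
  (http/post (format "https://kroki.io/%s/%s"
                     (name diagram-type)
                     (name output-format))
             {:body diagram-source
              :as   :byte-array}))
```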
21.3 :smile.classification/discrete-naive-bayes
name | type | default | lookup-table |
---|---|---|---|
p | int32 | | |
k | int32 | | |
discrete-naive-bayes-model | keyword | | |
21.4 :smile.classification/fld
name | type | default | description |
---|---|---|---|
dimension | int32 | -1 | The dimensionality of mapped space. |
tolerance | float64 | 1.0E-4 | A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol |
21.5 :smile.classification/gradient-tree-boost
name | type | default | description |
---|---|---|---|
ntrees | int32 | 500 | number of iterations (trees) |
max-depth | int32 | 20 | maximum depth of the tree |
max-nodes | int32 | 6 | maximum number of leaf nodes in the tree |
node-size | int32 | 5 | number of instances in a node below which the tree will not split; setting nodeSize = 5 generally gives good results |
shrinkage | float64 | 0.05 | the shrinkage parameter in (0, 1] controls the learning rate of the procedure |
sampling-rate | float64 | 0.7 | the sampling fraction for stochastic tree boosting |
21.6 :smile.classification/knn
name | type | default | description |
---|---|---|---|
k | int32 | 3 | number of neighbors for decision |
In this example we use a knn model to classify some dummy data. The training data is this:
```clojure
(def df-knn
  (tc/dataset {:x1 [7 7 3 1], :x2 [7 4 4 4], :y [:bad :bad :good :good]}))
```
```clojure
df-knn
```
_unnamed [4 3]:
:x1 | :x2 | :y |
---|---|---|
7 | 7 | :bad |
7 | 4 | :bad |
3 | 4 | :good |
1 | 4 | :good |
Then we construct a pipeline with the knn model, using 3 neighbors for decision.
```clojure
(def knn-pipe-fn
  (mm/pipeline
   (ds-mm/set-inference-target :y)
   (ds-mm/categorical->number [:y])
   (ml/model {:model-type :smile.classification/knn, :k 3})))
```
We run the pipeline in mode :fit:
```clojure
(def trained-ctx-knn
  (knn-pipe-fn #:metamorph{:data df-knn, :mode :fit}))
```
Then we run the pipeline in mode :transform with some test data, take the prediction, and convert it back from numeric to categorical:
```clojure
(-> trained-ctx-knn
    (merge #:metamorph{:data (tc/dataset {:x1 [3 5], :x2 [7 5], :y [nil nil]}),
                       :mode :transform})
    knn-pipe-fn
    :metamorph/data
    (ds-mod/column-values->categorical :y)
    seq)
```

```clojure
(:good :bad)
```
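Merging the `:transform` context by hand is equivalent to the `mm/transform-pipe` convenience wrapper; a sketch with the same test data (assuming `transform-pipe` takes the data, the pipeline function, and the fitted context):

```clojure
;; Sketch: mm/transform-pipe wraps the merge of :metamorph/data and
;; :metamorph/mode :transform shown above.
(-> (mm/transform-pipe
     (tc/dataset {:x1 [3 5], :x2 [7 5], :y [nil nil]})
     knn-pipe-fn
     trained-ctx-knn)
    :metamorph/data
    (ds-mod/column-values->categorical :y)
    seq)
```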
21.7 :smile.classification/linear-discriminant-analysis
name | type | default | description |
---|---|---|---|
priori | float64-array | | The priori probability of each class. If null, it will be estimated from the training data. |
tolerance | float64 | 1.0E-4 | A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol |
21.8 :smile.classification/logistic-regression
name | type | default | description |
---|---|---|---|
lambda | float64 | 0.1 | lambda > 0 gives a regularized estimate of linear weights which often has superior generalization performance, especially when the dimensionality is high |
tolerance | float64 | 1.0E-5 | tolerance for stopping iterations |
max-iterations | int32 | 500 | maximum number of iterations |
21.9 :smile.classification/maxent-binomial
21.10 :smile.classification/maxent-multinomial
21.11 :smile.classification/mlp
name | type | default | description |
---|---|---|---|
layer-builders | seq | | Sequence of type smile.base.mlp.LayerBuilder describing the layers of the neural network |
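Constructing the `layer-builders` requires Smile's Java API directly. A sketch for the iris data (4 input features, 3 classes), assuming Smile 2.x, where `smile.base.mlp.Layer` exposes static builder methods such as `rectifier` and `mle`:

```clojure
;; Sketch (assumes Smile 2.x Java API): one hidden layer of 10 ReLU units
;; and a softmax output layer for 3 classes.
(import '[smile.base.mlp Layer OutputFunction])

(ml/model {:model-type :smile.classification/mlp
           :layer-builders [(Layer/rectifier 10)
                            (Layer/mle 3 OutputFunction/SOFTMAX)]})
```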
21.12 :smile.classification/quadratic-discriminant-analysis
name | type | default | description |
---|---|---|---|
priori | float64-array | | The priori probability of each class. If null, it will be estimated from the training data. |
tolerance | float64 | 1.0E-4 | A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol |
21.13 :smile.classification/random-forest
name | type | default | description | lookup-table |
---|---|---|---|---|
trees | int32 | 500 | Number of trees | |
mtry | int32 | 0 | number of input variables to be used to determine the decision at a node of the tree. floor(sqrt(p)) generally gives good performance, where p is the number of variables | |
split-rule | keyword | gini | Decision tree split rule | |
max-depth | int32 | 20 | Maximum depth of tree | |
max-nodes | int32 | scicloj.ml.smile.classification$fn__86850@59ee0238 | Maximum number of leaf nodes in the tree | |
node-size | int32 | 5 | number of instances in a node below which the tree will not split, nodeSize = 5 generally gives good results | |
sample-rate | float32 | 1.0 | the sampling rate for training tree. 1.0 means sampling with replacement. < 1.0 means sampling without replacement. | |
class-weight | string | | Priors of the classes. The weight of each class is roughly the ratio of samples in each class. For example, if there are 400 positive samples and 100 negative samples, the classWeight should be [1, 4] (assuming label 0 is negative and label 1 is positive) | |
The following code plots the decision surfaces of the random forest model on pairs of features.
We use the Iris dataset for this.
```clojure
iris-std
```
https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [150 5]:
:sepal_length | :sepal_width | :petal_length | :petal_width | :species |
---|---|---|---|---|
-0.89767388 | 1.02861128 | -1.33679402 | -1.30859282 | setosa |
-1.13920048 | -0.12454038 | -1.33679402 | -1.30859282 | setosa |
-1.38072709 | 0.33672028 | -1.39346985 | -1.30859282 | setosa |
-1.50149039 | 0.10608995 | -1.28011819 | -1.30859282 | setosa |
-1.01843718 | 1.25924161 | -1.33679402 | -1.30859282 | setosa |
-0.53538397 | 1.95113261 | -1.16676652 | -1.04652483 | setosa |
-1.50149039 | 0.79798095 | -1.33679402 | -1.17755883 | setosa |
-1.01843718 | 0.79798095 | -1.28011819 | -1.30859282 | setosa |
-1.74301699 | -0.35517071 | -1.33679402 | -1.30859282 | setosa |
-1.13920048 | 0.10608995 | -1.28011819 | -1.43962681 | setosa |
… | … | … | … | … |
1.27606556 | 0.10608995 | 0.93023937 | 1.18105307 | virginica |
1.03453895 | 0.10608995 | 1.04359104 | 1.57415505 | virginica |
1.27606556 | 0.10608995 | 0.76021186 | 1.44312105 | virginica |
-0.05233076 | -0.81643138 | 0.76021186 | 0.91898508 | virginica |
1.15530226 | 0.33672028 | 1.21361854 | 1.44312105 | virginica |
1.03453895 | 0.56735062 | 1.10026687 | 1.70518904 | virginica |
1.03453895 | -0.12454038 | 0.81688770 | 1.44312105 | virginica |
0.55148575 | -1.27769204 | 0.70353603 | 0.91898508 | virginica |
0.79301235 | -0.12454038 | 0.81688770 | 1.05001907 | virginica |
0.43072244 | 0.79798095 | 0.93023937 | 1.44312105 | virginica |
0.06843254 | -0.12454038 | 0.76021186 | 0.78795108 | virginica |
The next function creates a vega specification for the random forest decision surface for a given pair of column names.
```clojure
#'noj-book.utils.example-code/make-iris-pipeline
```
```clojure
(def rf-pipe
  (make-iris-pipeline {:model-type :smile.classification/random-forest}))
```
```clojure
#'noj-book.utils.example-code/iris
```
```clojure
(kind/vega-lite
 (surface-plot
  iris
  [:sepal_length :sepal_width]
  rf-pipe
  :smile.classification/random-forest))
```

```clojure
(kind/vega-lite
 (surface-plot
  iris-std
  [:sepal_length :petal_length]
  rf-pipe
  :smile.classification/random-forest))
```

```clojure
(kind/vega-lite
 (surface-plot
  iris-std
  [:sepal_length :petal_width]
  rf-pipe
  :smile.classification/random-forest))
```

```clojure
(kind/vega-lite
 (surface-plot
  iris-std
  [:sepal_width :petal_length]
  rf-pipe
  :smile.classification/random-forest))
```

```clojure
(kind/vega-lite
 (surface-plot
  iris-std
  [:sepal_width :petal_width]
  rf-pipe
  :smile.classification/random-forest))
```

```clojure
(kind/vega-lite
 (surface-plot
  iris-std
  [:petal_length :petal_width]
  rf-pipe
  :smile.classification/random-forest))
```
21.14 :smile.classification/regularized-discriminant-analysis
name | type | default | description |
---|---|---|---|
priori | float64-array | | The priori probability of each class. If null, it will be estimated from the training data. |
alpha | float64 | 0.9 | Regularization factor in [0, 1] allows a continuum of models between LDA and QDA. |
tolerance | float64 | 1.0E-4 | A tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol |
21.15 :smile.classification/sparse-logistic-regression
name | type | default |
---|---|---|
lambda | float32 | 0.1 |
tolerance | float32 | 1.0E-5 |
max-iterations | int32 | 500 |
21.16 :smile.classification/sparse-svm
name | type | default | description |
---|---|---|---|
C | float32 | 1.0 | soft margin penalty parameter |
tol | float32 | 1.0E-4 | tolerance of convergence test |
21.17 :smile.classification/svm
name | type | default | description |
---|---|---|---|
C | float32 | 1.0 | soft margin penalty parameter |
tol | float32 | 1.0E-4 | tolerance of convergence test |
22 Compare decision surfaces of different classification models
In the following we see the decision surfaces of some models on the same data from the Iris dataset, using the two columns :sepal_width and :sepal_length:
*(Decision-surface plots for the different model types are rendered here.)*
This shows nicely that different model types have different capabilities to separate, and therefore classify, the data.
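A sketch of how such a comparison can be generated with the helpers used earlier in this chapter: build one pipeline per model type and plot each decision surface on the same pair of columns:

```clojure
;; Sketch: decision surfaces of several model types on the same two columns,
;; reusing make-iris-pipeline and surface-plot from above.
(for [model-type [:smile.classification/decision-tree
                  :smile.classification/logistic-regression
                  :smile.classification/random-forest]]
  (kind/vega-lite
   (surface-plot iris-std
                 [:sepal_width :sepal_length]
                 (make-iris-pipeline {:model-type model-type})
                 model-type)))
```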