10  Machine learning

In this tutorial we will train a simple machine learning model in order to predict the survival of Titanic passengers given their data.

10.1 Preface: machine learning models in Noj

ML models in Noj are available as different plugins to the metamorph.ml library.

The metamorph.ml library itself contains no models (except for a linear regression model), but it provides the functions to "train" and "predict" based on data.

Models are available via Clojure wrappers of existing ML libraries. These are currently part of Noj:

Library        Clojure Wrapper
Tribuo         scicloj.ml.tribuo
XGBoost4J      scicloj.ml.xgboost
scikit-learn   sklearn-clj

These libraries do not have any functions for the models they contain. Instead of a function per model, metamorph.ml has the concept of each model having a unique key, the :model-type, which needs to be given when calling metamorph.ml/train.

The model libraries register their models under these keys when their main namespace is required (and the model keys get printed on screen as they are registered). So we cannot provide cljdoc for the models, as they have no corresponding functions.
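For example (anticipating the concrete calls later in this chapter), requiring the Tribuo plugin namespace registers its model keys, and training then amounts to passing one of these keys to metamorph.ml/train. A sketch, in which ds stands for any dataset with an inference target set:

(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.ml.tribuo]) ; requiring the plugin registers its model keys

(comment
  ;; sketch only; the concrete, runnable training calls appear later in this chapter
  (ml/train ds {:model-type :scicloj.ml.tribuo/classification
                :tribuo-components [{:name "logistic"
                                     :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                :tribuo-trainer-name "logistic"}))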

Instead, the last chapters of the Noj book provide a complete list of all models (and their keys), including the parameters they take, with a description. For some models this reference documentation contains code examples as well. It can be used to browse or search for models and their parameters.

The Tribuo plugin and its models are special in this respect. The scicloj.ml.tribuo library only contains two model types as keys, namely :scicloj.ml.tribuo/classification and :scicloj.ml.tribuo/regression. The concrete model is encoded in the same way the Tribuo Java library does it: as a map of all Tribuo components in place, of which one, the so-called "Trainer", is always needed and has a certain :type, the model class.

The reference documentation therefore lists all "Trainer"s and their names, including parameters. It also lists all other "Configurable"s which can be referred to in a component map.
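As a concrete illustration (taken from the random forest example later in this chapter), a "Trainer" component is a map with a :name, a :type naming the trainer class, and optional :properties holding its parameters:

;; a Tribuo "Trainer" component, as used later in this chapter
{:name "random-forest"
 :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
 :properties {:maxDepth "8"
              :useRandomSplitPoints "false"
              :fractionFeaturesInSplit "0.5"}}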

10.2 Setup

(ns noj-book.ml-basic
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.toydata :as data]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.kindly.v4.api :as kindly]
            [tech.v3.dataset.categorical :as ds-cat]))

10.3 Inspect data

The Titanic data is part of metamorph.ml and comes in the form of a train/test split.

We use the :train part only for this tutorial.

(->
 (data/titanic-ds-split)
 :train)

_unnamed [891 12]:

:passenger-id :survived :pclass :name :sex :age :sib-sp :parch :ticket :fare :cabin :embarked
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male 0 0 330877 8.4583 Q
7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 S
9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 S
10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 C
881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 S
882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 S
883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 S
884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 S
885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 S
886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 Q
887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 S
888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 0 3 Johnston, Miss. Catherine Helen “Carrie” female 1 2 W./C. 6607 23.4500 S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 Q

We use defonce to avoid reading the files every time we evaluate the namespace.

(defonce titanic-split
  (data/titanic-ds-split))
(def titanic
  (-> titanic-split
      :train
      (tc/map-columns :survived
                      [:survived]
                      (fn [el] (case el
                                 0 "no"
                                 1 "yes")))))

It has various columns

(tc/column-names titanic)
(:passenger-id
 :survived
 :pclass
 :name
 :sex
 :age
 :sib-sp
 :parch
 :ticket
 :fare
 :cabin
 :embarked)

of which we can get some statistics

(ds/descriptive-stats titanic)

_unnamed: descriptive-stats [12 12]:

:col-name :datatype :n-valid :n-missing :min :mean :mode :max :standard-deviation :skew :first :last
:passenger-id :int16 891 0 1.00 446.00000000 891.0000 257.35384202 0.00000000 1 891
:survived :string 891 0 no no no
:pclass :int16 891 0 1.00 2.30864198 3.0000 0.83607124 -0.63054791 3 3
:name :string 891 0 Mallet, Mr. Albert Braund, Mr. Owen Harris Dooley, Mr. Patrick
:sex :string 891 0 male male male
:age :float64 714 177 0.42 29.69911765 80.0000 14.52649733 0.38910778 22.00 32.00
:sib-sp :int16 891 0 0.00 0.52300786 8.0000 1.10274343 3.69535173 1 0
:parch :int16 891 0 0.00 0.38159371 6.0000 0.80605722 2.74911705 0 0
:ticket :string 891 0 CA. 2343 A/5 21171 370376
:fare :float64 891 0 0.00 32.20420797 512.3292 49.69342860 4.78731652 7.250 7.750
:cabin :string 204 687
:embarked :string 889 2 S S Q

The data is more or less balanced across the 2 classes:

(-> titanic :survived frequencies)
{"no" 549, "yes" 342}

We will make a very simple model, which will predict the column :survived from the columns :sex, :pclass and :embarked. These represent the passenger's sex, passenger class, and port of embarkation.

(def categorical-feature-columns [:sex :pclass :embarked])
(def target-column :survived)

10.4 Convert categorical features to numeric

As we need to convert the non-numeric feature columns to numbers, we will first look at their unique values:

(map
 #(hash-map
   :col-name %
   :values  (distinct (get titanic %)))
 categorical-feature-columns)
({:col-name :sex, :values ("male" "female")}
 {:col-name :pclass, :values (3 1 2)}
 {:col-name :embarked, :values ("S" "C" "Q" nil)})

This allows us to specify the values explicitly in the conversion to numbers. This is good practice, rather than relying on the automatic selection of the categorical mapping:

(We discuss more about categorical mappings in another chapter.)

(require '[tech.v3.dataset.categorical :as ds-cat]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[tech.v3.dataset.column-filters :as cf])

This then gives the selected and numeric columns like this:

(def relevant-titanic-data
  (-> titanic
      (tc/select-columns (conj categorical-feature-columns target-column))
      (tc/drop-missing)
      (ds/categorical->number [:survived] ["no" "yes"] :float64)
      (ds-mod/set-inference-target target-column)))

of which we can inspect the lookup-tables

(def cat-maps
  [(ds-cat/fit-categorical-map relevant-titanic-data :sex ["male" "female"] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :pclass [0 1 2] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :embarked ["S" "Q" "C"] :float64)])
cat-maps
[{:lookup-table {"male" 0, "female" 1},
  :src-column :sex,
  :result-datatype :float64}
 {:lookup-table {0 0, 1 1, 2 2, 3 3},
  :src-column :pclass,
  :result-datatype :float64}
 {:lookup-table {"S" 0, "Q" 1, "C" 2},
  :src-column :embarked,
  :result-datatype :float64}]

After the mappings are applied, we have a numeric dataset, as expected by most models.

(def numeric-titanic-data
  (reduce (fn [ds cat-map]
            (ds-cat/transform-categorical-map ds cat-map))
          relevant-titanic-data
          cat-maps))
(tc/head
 numeric-titanic-data)

_unnamed [5 4]:

:sex :pclass :embarked :survived
0.0 3.0 0.0 0.0
1.0 1.0 2.0 1.0
1.0 3.0 0.0 1.0
1.0 1.0 0.0 1.0
0.0 3.0 0.0 0.0
(ds/rowvecs
 (tc/head
  numeric-titanic-data))
[[0.0 3.0 0.0 0.0] [1.0 1.0 2.0 1.0] [1.0 3.0 0.0 1.0] [1.0 1.0 0.0 1.0] [0.0 3.0 0.0 0.0]]

Split data into train and test set

Now we split the data into train and test sets. We use a :holdout strategy, so we will get a single split into training and test data.

(def split
  (first
   (tc/split->seq numeric-titanic-data :holdout {:seed 112723})))
split

{

:train

Group: 0 [592 4]:

:sex :pclass :embarked :survived
0.0 3.0 2.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 2.0 1.0
0.0 1.0 0.0 0.0
1.0 3.0 0.0 0.0
1.0 2.0 0.0 1.0
1.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 1.0 2.0 1.0
... ... ... ...
0.0 3.0 0.0 0.0
1.0 2.0 0.0 1.0
0.0 3.0 2.0 1.0
1.0 2.0 0.0 0.0
0.0 2.0 0.0 0.0
1.0 3.0 0.0 1.0
0.0 2.0 0.0 0.0
1.0 2.0 0.0 1.0
0.0 3.0 0.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 1.0 0.0
:test

Group: 0 [297 4]:

:sex :pclass :embarked :survived
0.0 1.0 0.0 0.0
1.0 3.0 0.0 0.0
0.0 2.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 2.0 1.0
0.0 1.0 2.0 0.0
... ... ... ...
0.0 3.0 0.0 0.0
1.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 1.0 1.0
0.0 3.0 0.0 0.0
1.0 2.0 2.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 1.0
0.0 1.0 0.0 0.0
1.0 1.0 2.0 1.0

}

10.5 Train a model

Now it’s time to train a model:

(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.classification]
         '[scicloj.metamorph.ml.loss :as loss])

10.5.1 Dummy model

We start with a dummy model, which simply predicts the majority class.

(def dummy-model (ml/train (:train split)
                           {:model-type :metamorph.ml/dummy-classifier}))
(def dummy-prediction
  (ml/predict (:test split) dummy-model))

It always predicts a single class, as expected:

(-> dummy-prediction :survived frequencies)
{0.0 297}

We can calculate the accuracy by using a metric, after converting the numerical data back to the original categories (important!). We should never compare mapped columns directly.

(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))
0.6026936026936027

10.6 Logistic regression

The next model to use is logistic regression:

(require '[scicloj.ml.tribuo])
(def lreg-model (ml/train (:train split)
                          {:model-type :scicloj.ml.tribuo/classification
                           :tribuo-components [{:name "logistic"
                                                :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                           :tribuo-trainer-name "logistic"}))
(def lreg-prediction
  (ml/predict (:test split) lreg-model))
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))
0.7373737373737373

Its performance is better: 73%.

10.7 Random forest

Next is random forest:

(def rf-model (ml/train (:train split) {:model-type :scicloj.ml.tribuo/classification
                                        :tribuo-components [{:name "random-forest"
                                                             :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                                             :properties {:maxDepth "8"
                                                                          :useRandomSplitPoints "false"
                                                                          :fractionFeaturesInSplit "0.5"}}]
                                        :tribuo-trainer-name "random-forest"}))
(def rf-prediction
  (ml/predict (:test split) rf-model))

Let us extract the first five predictions and the probabilities provided by the model.

(-> rf-prediction
    (tc/head)
    (tc/rows))
[[0.0 0.6470588235294118 0.35294117647058826] [0.0 0.5714285714285714 0.42857142857142855] [0.0 0.8529411764705882 0.14705882352941177] [0.0 0.8879310344827587 0.11206896551724138] [0.0 0.8879310344827587 0.11206896551724138]]
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
0.7878787878787878

This is the best so far: 78%.

10.8 Next steps

We could now go further and try to improve the features and/or the model type in order to find the best-performing model for the data we have. All model types have a range of configuration options, so-called hyper-parameters. These can also influence the model accuracy.
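For example, we could re-train the random forest from above with a few values of its maxDepth hyper-parameter and compare the resulting accuracies. A sketch reusing the split and namespaces defined above (the depth values tried here are arbitrary):

;; train one random forest per maxDepth value and measure its accuracy on the test split
(for [max-depth ["4" "8" "12"]]
  (let [model (ml/train (:train split)
                        {:model-type :scicloj.ml.tribuo/classification
                         :tribuo-components [{:name "random-forest"
                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                              :properties {:maxDepth max-depth
                                                           :useRandomSplitPoints "false"
                                                           :fractionFeaturesInSplit "0.5"}}]
                         :tribuo-trainer-name "random-forest"})
        prediction (ml/predict (:test split) model)]
    {:max-depth max-depth
     :accuracy  (loss/classification-accuracy
                 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
                 (:survived (ds-cat/reverse-map-categorical-xforms prediction)))}))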

So far we have used a single split into 'train' and 'test' data, so we only get a point estimate of the accuracy. This should be made more robust via cross-validation, using different splits of the data.
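A sketch of how this could look with tablecloth's :kfold split strategy, assuming 5 folds and reusing the logistic regression configuration from above:

;; split into k folds, train and evaluate a model per fold, and average the accuracies
(let [folds (tc/split->seq numeric-titanic-data :kfold {:k 5 :seed 112723})
      accuracies
      (for [{:keys [train test]} folds]
        (let [model (ml/train train
                              {:model-type :scicloj.ml.tribuo/classification
                               :tribuo-components [{:name "logistic"
                                                    :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                               :tribuo-trainer-name "logistic"})
              prediction (ml/predict test model)]
          (loss/classification-accuracy
           (:survived (ds-cat/reverse-map-categorical-xforms test))
           (:survived (ds-cat/reverse-map-categorical-xforms prediction)))))]
  ;; mean accuracy over the folds
  (/ (reduce + accuracies) (count accuracies)))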

source: notebooks/noj_book/ml_basic.clj