10 Machine learning

In this tutorial we will train a simple machine learning model to predict the survival of Titanic passengers from their data.

(ns noj-book.ml-basic
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.toydata :as data]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.kindly.v4.api :as kindly]
            [tech.v3.dataset.categorical :as ds-cat]))

10.1 Inspect data

The Titanic data is part of metamorph.ml and comes in the form of a train/test split.

We use the :train part only for this tutorial.

(->
 (data/titanic-ds-split)
 :train)

_unnamed [891 12]:

| :passenger-id | :survived | :pclass | :name | :sex | :age | :sib-sp | :parch | :ticket | :fare | :cabin | :embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | | S |
| 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | | S |
| 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | | S |
| 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | | S |
| 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | | S |
| 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | | Q |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | | Q |

We use defonce to avoid reading the files every time we evaluate the namespace.

(defonce titanic-split
  (data/titanic-ds-split))
(def titanic
  (-> titanic-split
      :train
      (tc/map-columns :survived
                      [:survived]
                      (fn [el] (case el
                                 0 "no"
                                 1 "yes")))))

It has various columns:

(tc/column-names titanic)
(:passenger-id
 :survived
 :pclass
 :name
 :sex
 :age
 :sib-sp
 :parch
 :ticket
 :fare
 :cabin
 :embarked)

for which we can get some statistics:

(ds/descriptive-stats titanic)

_unnamed: descriptive-stats [12 12]:

| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| :passenger-id | :int16 | 891 | 0 | 1.00 | 446.00000000 | | 891.0000 | 257.35384202 | 0.00000000 | 1 | 891 |
| :survived | :string | 891 | 0 | | | no | | | | no | no |
| :pclass | :int16 | 891 | 0 | 1.00 | 2.30864198 | | 3.0000 | 0.83607124 | -0.63054791 | 3 | 3 |
| :name | :string | 891 | 0 | | | Mallet, Mr. Albert | | | | Braund, Mr. Owen Harris | Dooley, Mr. Patrick |
| :sex | :string | 891 | 0 | | | male | | | | male | male |
| :age | :float64 | 714 | 177 | 0.42 | 29.69911765 | | 80.0000 | 14.52649733 | 0.38910778 | 22.00 | 32.00 |
| :sib-sp | :int16 | 891 | 0 | 0.00 | 0.52300786 | | 8.0000 | 1.10274343 | 3.69535173 | 1 | 0 |
| :parch | :int16 | 891 | 0 | 0.00 | 0.38159371 | | 6.0000 | 0.80605722 | 2.74911705 | 0 | 0 |
| :ticket | :string | 891 | 0 | | | CA. 2343 | | | | A/5 21171 | 370376 |
| :fare | :float64 | 891 | 0 | 0.00 | 32.20420797 | | 512.3292 | 49.69342860 | 4.78731652 | 7.250 | 7.750 |
| :cabin | :string | 204 | 687 | | | | | | | | |
| :embarked | :string | 889 | 2 | | | S | | | | S | Q |

The data is more or less balanced across the two classes:

(-> titanic :survived frequencies)
{"no" 549, "yes" 342}

We will make a very simple model, which will predict the column :survived from the columns :sex, :pclass and :embarked. These represent the “gender”, “passenger class” and “port of embarkation” of each passenger.

(def categorical-feature-columns [:sex :pclass :embarked])
(def target-column :survived)

10.2 Convert categorical features to numeric

As we need to convert the non-numeric feature columns to numbers, we will first look at their unique values:

(map
 #(hash-map
   :col-name %
   :values  (distinct (get titanic %)))
 categorical-feature-columns)
({:col-name :sex, :values ("male" "female")}
 {:col-name :pclass, :values (3 1 2)}
 {:col-name :embarked, :values ("S" "C" "Q" nil)})

This allows us to specify explicitly the values used in the conversion to numbers. This is good practice, rather than relying on the automatic selection of the categorical mapping. (Note the nil in :embarked; rows with missing values are dropped below.)

(We discuss more about categorical mappings in another chapter.)

(require '[tech.v3.dataset.categorical :as ds-cat]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[tech.v3.dataset.column-filters :as cf])

We then select the relevant columns, drop rows with missing values, convert the target column to numbers, and mark it as the inference target:

(def relevant-titanic-data
  (-> titanic
      (tc/select-columns (conj categorical-feature-columns target-column))
      (tc/drop-missing)
      (ds/categorical->number [:survived] ["no" "yes"] :float64)
      (ds-mod/set-inference-target target-column)))
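
As a quick check, we could look at the first rows of this dataset; at this point only the target column has been converted to numbers, while the feature columns are still strings:

(tc/head relevant-titanic-data)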

We can now fit the categorical mappings for the feature columns and inspect their lookup tables:

(def cat-maps
  [(ds-cat/fit-categorical-map relevant-titanic-data :sex ["male" "female"] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :pclass [0 1 2] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :embarked ["S" "Q" "C"] :float64)])
cat-maps
[{:lookup-table {"male" 0, "female" 1},
  :src-column :sex,
  :result-datatype :float64}
 {:lookup-table {0 0, 1 1, 2 2, 3 3},
  :src-column :pclass,
  :result-datatype :float64}
 {:lookup-table {"S" 0, "Q" 1, "C" 2},
  :src-column :embarked,
  :result-datatype :float64}]

After the mappings are applied, we have a numeric dataset, as expected by most models.

(def numeric-titanic-data
  ;; apply each categorical mapping to the dataset in turn
  (reduce (fn [ds cat-map]
            (ds-cat/transform-categorical-map ds cat-map))
          relevant-titanic-data
          cat-maps))
(tc/head
 numeric-titanic-data)

_unnamed [5 4]:

:sex :pclass :embarked :survived
0.0 3.0 0.0 0.0
1.0 1.0 2.0 1.0
1.0 3.0 0.0 1.0
1.0 1.0 0.0 1.0
0.0 3.0 0.0 0.0
(ds/rowvecs
 (tc/head
  numeric-titanic-data))
[[0.0 3.0 0.0 0.0] [1.0 1.0 2.0 1.0] [1.0 3.0 0.0 1.0] [1.0 1.0 0.0 1.0] [0.0 3.0 0.0 0.0]]

Split data into train and test set

Now we split the data into train and test sets. We use a :holdout strategy here, so we get a single split into training and test data.

(def split
  (first
   (tc/split->seq numeric-titanic-data :holdout {:seed 112723})))
split

{

:train

Group: 0 [592 4]:

:sex :pclass :embarked :survived
0.0 3.0 2.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 2.0 1.0
0.0 1.0 0.0 0.0
1.0 3.0 0.0 0.0
1.0 2.0 0.0 1.0
1.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 1.0 2.0 1.0
... ... ... ...
0.0 3.0 0.0 0.0
1.0 2.0 0.0 1.0
0.0 3.0 2.0 1.0
1.0 2.0 0.0 0.0
0.0 2.0 0.0 0.0
1.0 3.0 0.0 1.0
0.0 2.0 0.0 0.0
1.0 2.0 0.0 1.0
0.0 3.0 0.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 1.0 0.0
:test

Group: 0 [297 4]:

:sex :pclass :embarked :survived
0.0 1.0 0.0 0.0
1.0 3.0 0.0 0.0
0.0 2.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 2.0 1.0
0.0 1.0 2.0 0.0
... ... ... ...
0.0 3.0 0.0 0.0
1.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
1.0 3.0 1.0 1.0
0.0 3.0 0.0 0.0
1.0 2.0 2.0 1.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 0.0
0.0 3.0 0.0 1.0
0.0 1.0 0.0 0.0
1.0 1.0 2.0 1.0

}

10.3 Train a model

Now it's time to train a model:

(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.classification]
         '[scicloj.metamorph.ml.loss :as loss])

10.3.1 Dummy model

We start with a dummy model, which simply predicts the majority class.

(def dummy-model (ml/train (:train split)
                           {:model-type :metamorph.ml/dummy-classifier}))

TODO: Is the dummy model wrong about the majority?
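
One way to investigate this (a quick sketch) would be to count the target classes in the training part of the split, after mapping them back to their original labels:

(-> (:train split)
    ds-cat/reverse-map-categorical-xforms
    :survived
    frequencies)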

(def dummy-prediction
  (ml/predict (:test split) dummy-model))

It always predicts a single class, as expected:

(-> dummy-prediction :survived frequencies)
{1.0 297}

We can calculate the accuracy with a metric, after converting the numeric values back to their original categories (important!). We should never compare mapped columns directly.

(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))
0.3973063973063973

Its performance is poor, even worse than a coin flip.

10.4 Logistic regression

The next model to use is logistic regression:

(require '[scicloj.ml.tribuo])
(def lreg-model (ml/train (:train split)
                          {:model-type :scicloj.ml.tribuo/classification
                           :tribuo-components [{:name "logistic"
                                                :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                           :tribuo-trainer-name "logistic"}))
(def lreg-prediction
  (ml/predict (:test split) lreg-model))
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))
0.7373737373737373

Its performance is better, at about 74 %.

10.5 Random forest

Next is a random forest:

(def rf-model (ml/train (:train split)
                        {:model-type :scicloj.ml.tribuo/classification
                         :tribuo-components [{:name "random-forest"
                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                              :properties {;; maximum depth of each tree
                                                           :maxDepth "8"
                                                           ;; pick split points exactly rather than randomly
                                                           :useRandomSplitPoints "false"
                                                           ;; fraction of features considered at each split
                                                           :fractionFeaturesInSplit "0.5"}}]
                         :tribuo-trainer-name "random-forest"}))
(def rf-prediction
  (ml/predict (:test split) rf-model))

The first five predictions, including the probability distributions, are:

(-> rf-prediction
    (tc/head)
    (tc/rows))
[[0.0 0.6470588235294118 0.35294117647058826] [0.0 0.5714285714285714 0.42857142857142855] [0.0 0.8529411764705882 0.14705882352941177] [0.0 0.8879310344827587 0.11206896551724138] [0.0 0.8879310344827587 0.11206896551724138]]
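
To see which column each of these row values corresponds to, we can inspect the prediction dataset's column names (a quick sketch):

(tc/column-names rf-prediction)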
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
0.7878787878787878

This is the best so far, at about 79 %.

TODO: Extract feature importance.

11 Next steps

We could now go further and try to improve the features and/or the model type in order to find the best-performing model for the data we have. All model types have a range of configuration options, so-called hyper-parameters, which can also influence model accuracy.

So far we have used a single split into ‘train’ and ‘test’ data, so we only get a point estimate of the accuracy. This should be made more robust via cross-validation, using different splits of the data.
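
As a minimal sketch of this idea (assuming the requires and definitions from above, and reusing the logistic regression configuration), we could train a model on each of five :kfold splits and average the accuracies:

(let [;; five train/test splits over the same dataset
      splits (tc/split->seq numeric-titanic-data :kfold {:k 5 :seed 112723})
      accuracies
      (for [{:keys [train test]} splits]
        (let [model (ml/train train {:model-type :scicloj.ml.tribuo/classification
                                     :tribuo-components [{:name "logistic"
                                                          :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                                     :tribuo-trainer-name "logistic"})
              prediction (ml/predict test model)]
          (loss/classification-accuracy
           (:survived (ds-cat/reverse-map-categorical-xforms test))
           (:survived (ds-cat/reverse-map-categorical-xforms prediction)))))]
  ;; mean accuracy across the five folds
  (/ (reduce + accuracies) (count accuracies)))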

source: notebooks/noj_book/ml_basic.clj