10 Machine learning
In this tutorial we will train a simple machine learning model in order to predict the survival of Titanic passengers given their data.
```clojure
(ns noj-book.ml-basic
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.toydata :as data]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.kindly.v4.api :as kindly]
            [tech.v3.dataset.categorical :as ds-cat]))
```
10.1 Inspect data
The Titanic data is part of metamorph.ml and comes in the form of a train/test split.
We use only the :train part in this tutorial.
```clojure
(-> (data/titanic-ds-split)
    :train)
```
_unnamed [891 12]:

| :passenger-id | :survived | :pclass | :name | :sex | :age | :sib-sp | :parch | :ticket | :fare | :cabin | :embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | | S |
| 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | | S |
| 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | | S |
| 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | | S |
| 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | | S |
| 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | | Q |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | | Q |
We use defonce to avoid reading the files every time we evaluate the namespace.

```clojure
(defonce titanic-split
  (data/titanic-ds-split))

(def titanic
  (-> titanic-split
      :train
      (tc/map-columns :survived
                      [:survived]
                      (fn [el] (case el
                                 0 "no"
                                 1 "yes")))))
```
It has various columns:

```clojure
(tc/column-names titanic)
```

```
(:passenger-id :survived :pclass :name :sex :age
 :sib-sp :parch :ticket :fare :cabin :embarked)
```
of which we can get some statistics:

```clojure
(ds/descriptive-stats titanic)
```
_unnamed: descriptive-stats [12 12]:

| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| :passenger-id | :int16 | 891 | 0 | 1.00 | 446.00000000 | | 891.0000 | 257.35384202 | 0.00000000 | 1 | 891 |
| :survived | :string | 891 | 0 | | | no | | | | no | no |
| :pclass | :int16 | 891 | 0 | 1.00 | 2.30864198 | | 3.0000 | 0.83607124 | -0.63054791 | 3 | 3 |
| :name | :string | 891 | 0 | | | Mallet, Mr. Albert | | | | Braund, Mr. Owen Harris | Dooley, Mr. Patrick |
| :sex | :string | 891 | 0 | | | male | | | | male | male |
| :age | :float64 | 714 | 177 | 0.42 | 29.69911765 | | 80.0000 | 14.52649733 | 0.38910778 | 22.00 | 32.00 |
| :sib-sp | :int16 | 891 | 0 | 0.00 | 0.52300786 | | 8.0000 | 1.10274343 | 3.69535173 | 1 | 0 |
| :parch | :int16 | 891 | 0 | 0.00 | 0.38159371 | | 6.0000 | 0.80605722 | 2.74911705 | 0 | 0 |
| :ticket | :string | 891 | 0 | | | CA. 2343 | | | | A/5 21171 | 370376 |
| :fare | :float64 | 891 | 0 | 0.00 | 32.20420797 | | 512.3292 | 49.69342860 | 4.78731652 | 7.250 | 7.750 |
| :cabin | :string | 204 | 687 | | | | | | | | |
| :embarked | :string | 889 | 2 | | | S | | | | S | Q |
The data is more or less balanced across the 2 classes:
```clojure
(-> titanic :survived frequencies)
```

```
{"no" 549, "yes" 342}
```
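Given these counts, always predicting the majority class "no" already sets a baseline accuracy. As a quick sanity check in plain Clojure (using only the counts printed above):

```clojure
;; Baseline accuracy of a classifier that always predicts "no":
;; 549 of the 891 passengers did not survive.
(double (/ 549 891))
;; => 0.6161616161616161
```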
We will make a very simple model, which will predict the column :survived from the columns :sex, :pclass and :embarked. These represent the “gender”, “passenger class” and “port of embarkation”.
```clojure
(def categorical-feature-columns [:sex :pclass :embarked])

(def target-column :survived)
```
10.2 Convert categorical features to numeric
As we need to convert the non-numerical feature columns to numeric ones, we will first look at their unique values:
```clojure
(map
 #(hash-map :col-name %
            :values (distinct (get titanic %)))
 categorical-feature-columns)
```

```
({:col-name :sex, :values ("male" "female")}
 {:col-name :pclass, :values (3 1 2)}
 {:col-name :embarked, :values ("S" "C" "Q" nil)})
```
This allows us to set the values explicitly in the conversion to numbers. This is good practice, instead of relying on the automatic selection of the categorical mapping. (We discuss categorical mappings in more detail in another chapter.)
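Conceptually, such a categorical mapping is just a lookup table from value to number, and fixing it explicitly keeps the encoding stable between runs and datasets. A minimal sketch in plain Clojure (the name sex->number is hypothetical, not part of any API):

```clojure
;; An explicit lookup table from category to number:
(def sex->number {"male" 0.0, "female" 1.0})

;; Applying it to a column of values:
(map sex->number ["male" "female" "female"])
;; => (0.0 1.0 1.0)
```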
```clojure
(require '[tech.v3.dataset.categorical :as ds-cat]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[tech.v3.dataset.column-filters :as cf])
```
We select the relevant columns, drop rows with missing values, convert the target column to numbers and mark it as the inference target:
```clojure
(def relevant-titanic-data
  (-> titanic
      (tc/select-columns (conj categorical-feature-columns target-column))
      (tc/drop-missing)
      (ds/categorical->number [:survived] ["no" "yes"] :float64)
      (ds-mod/set-inference-target target-column)))
```
We then fit the categorical mappings for the feature columns, of which we can inspect the lookup tables:
```clojure
(def cat-maps
  [(ds-cat/fit-categorical-map relevant-titanic-data :sex ["male" "female"] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :pclass [0 1 2] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :embarked ["S" "Q" "C"] :float64)])

cat-maps
```

```
[{:lookup-table {"male" 0, "female" 1},
  :src-column :sex,
  :result-datatype :float64}
 {:lookup-table {0 0, 1 1, 2 2, 3 3},
  :src-column :pclass,
  :result-datatype :float64}
 {:lookup-table {"S" 0, "Q" 1, "C" 2},
  :src-column :embarked,
  :result-datatype :float64}]
```
After the mappings are applied, we have a numeric dataset, as expected by most models.
```clojure
(def numeric-titanic-data
  (reduce (fn [ds cat-map]
            (ds-cat/transform-categorical-map ds cat-map))
          relevant-titanic-data
          cat-maps))

(tc/head numeric-titanic-data)
```
_unnamed [5 4]:
:sex | :pclass | :embarked | :survived |
---|---|---|---|
0.0 | 3.0 | 0.0 | 0.0 |
1.0 | 1.0 | 2.0 | 1.0 |
1.0 | 3.0 | 0.0 | 1.0 |
1.0 | 1.0 | 0.0 | 1.0 |
0.0 | 3.0 | 0.0 | 0.0 |
```clojure
(ds/rowvecs
 (tc/head numeric-titanic-data))
```

```
[[0.0 3.0 0.0 0.0]
 [1.0 1.0 2.0 1.0]
 [1.0 3.0 0.0 1.0]
 [1.0 1.0 0.0 1.0]
 [0.0 3.0 0.0 0.0]]
```
Split data into train and test set
Now we split the data into train and test sets. By default we use a :holdout strategy, so we will get a single split into train and test data.
```clojure
(def split
  (first
   (tc/split->seq numeric-titanic-data :holdout {:seed 112723})))
```
```clojure
split
```

```
{:train Group: 0 [592 4]
 :test  Group: 0 [297 4]}
```
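In miniature, a holdout split just shuffles the row indices and partitions them; tc/split->seq does this for us and returns datasets rather than plain indices. A sketch in plain Clojure:

```clojure
;; A holdout split in miniature: shuffle the indices, then partition them.
(let [indices (shuffle (range 10))
      [train test] (split-at 7 indices)]
  {:train-size (count train)
   :test-size (count test)})
;; => {:train-size 7, :test-size 3}
```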
10.3 Train a model
Now it's time to train a model:
```clojure
(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.classification]
         '[scicloj.metamorph.ml.loss :as loss])
```
10.3.1 Dummy model
We start with a dummy model, which simply predicts the majority class.

```clojure
(def dummy-model (ml/train (:train split)
                           {:model-type :metamorph.ml/dummy-classifier}))
```
TODO: Is the dummy model wrong about the majority?
```clojure
(def dummy-prediction
  (ml/predict (:test split) dummy-model))
```
It always predicts a single class, as expected:
```clojure
(-> dummy-prediction :survived frequencies)
```

```
{1.0 297}
```
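Classification accuracy is simply the fraction of positions at which truth and prediction agree. A minimal sketch in plain Clojure, with made-up labels:

```clojure
;; Accuracy: the fraction of matching labels.
(defn accuracy [truth prediction]
  (double (/ (count (filter true? (map = truth prediction)))
             (count truth))))

(accuracy ["no" "no" "yes" "no"]
          ["yes" "yes" "yes" "yes"])
;; => 0.25
```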
We can calculate the accuracy by using a metric, after having converted the numerical data back to its original values (important!). We should never compare mapped columns directly:
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))
```

```
0.3973063973063973
```
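This value is consistent with the dummy model predicting "yes" for all 297 test passengers: its accuracy is then just the fraction of actual survivors in the test split. Assuming 118 survivors in the test data (a number inferred from the reported accuracy, not read from the data):

```clojure
;; 118 survivors out of 297 test passengers:
(double (/ 118 297))
;; => 0.3973063973063973
```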
Its performance is poor, even worse than a coin flip.
10.4 Logistic regression
The next model to use is logistic regression:
```clojure
(require '[scicloj.ml.tribuo])
```
```clojure
(def lreg-model (ml/train (:train split)
                          {:model-type :scicloj.ml.tribuo/classification
                           :tribuo-components [{:name "logistic"
                                                :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                           :tribuo-trainer-name "logistic"}))
```
```clojure
(def lreg-prediction
  (ml/predict (:test split) lreg-model))
```
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))
```

```
0.7373737373737373
```
Its performance is better, at about 74 %.
10.5 Random forest
Next we try a random forest:
```clojure
(def rf-model (ml/train (:train split)
                        {:model-type :scicloj.ml.tribuo/classification
                         :tribuo-components [{:name "random-forest"
                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                              :properties {:maxDepth "8"
                                                           :useRandomSplitPoints "false"
                                                           :fractionFeaturesInSplit "0.5"}}]
                         :tribuo-trainer-name "random-forest"}))
```
```clojure
(def rf-prediction
  (ml/predict (:test split) rf-model))
```
The first five predictions, including the probability distributions, are:
```clojure
(-> rf-prediction
    (tc/head)
    (tc/rows))
```

```
[[0.0 0.6470588235294118 0.35294117647058826]
 [0.0 0.5714285714285714 0.42857142857142855]
 [0.0 0.8529411764705882 0.14705882352941177]
 [0.0 0.8879310344827587 0.11206896551724138]
 [0.0 0.8879310344827587 0.11206896551724138]]
```
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
```

```
0.7878787878787878
```
This is the best so far, at about 79 %.
TODO: Extract feature importance.
11 Next steps
We could now go further and try to improve the features and/or the model type in order to find the best-performing model for the data we have. All model types have a range of configurations, so-called hyper-parameters, which can influence model accuracy as well.
So far we have used a single split into ‘train’ and ‘test’ data, so we only get a point estimate of the accuracy. This should be made more robust via cross-validation, using different splits of the data.
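For example, averaging the accuracies over several folds gives a more robust estimate than any single split (tablecloth's split->seq also supports a :kfold strategy; the fold accuracies below are hypothetical numbers for illustration only):

```clojure
;; Hypothetical accuracies from 5 cross-validation folds
;; (illustrative numbers, not computed from the data):
(def fold-accuracies [0.74 0.78 0.71 0.76 0.75])

;; The cross-validated estimate is their mean, roughly 0.75:
(double (/ (reduce + fold-accuracies)
           (count fold-accuracies)))
```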
source: notebooks/noj_book/ml_basic.clj