10 Machine learning
In this tutorial we will train a simple machine learning model to predict the survival of Titanic passengers from their data.
10.1 Preface: machine learning models in Noj
ML models in Noj are available as different plugins to the metamorph.ml library.
The metamorph.ml library itself has no models (except for a linear regression model), but it contains the functions to “train” and “predict” based on data.
Models are available via Clojure wrappers of existing ML libraries. These are currently part of Noj:
Library | Clojure Wrapper |
---|---|
Tribuo | scicloj.ml.tribuo |
Xgboost4J | scicloj.ml.xgboost |
scikit-learn | sklearn-clj |
These libraries do not expose a function per model. Instead of functions per model, metamorph.ml has the concept of each model having a unique key, the :model-type, which must be given when calling metamorph.ml/train.
The model libraries register their models under these keys when their main ns is required (the model keys are printed on screen as they get registered). So we cannot provide cljdoc for the models, as they do not have corresponding functions.
Instead, the last chapters of the Noj book provide a complete list of all models (and their keys), including the parameters they take, with a description. For some models this reference documentation also contains code examples. It can be used to browse or search for models and their parameters.
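The key-based lookup described above can be pictured with a small sketch in plain Clojure. This is only an illustration of the idea, not how metamorph.ml is implemented: the registry atom, register-model!, train-by-key and the :demo/constant key are all hypothetical names.

```clojure
;; Hypothetical sketch of key-based model lookup; NOT metamorph.ml's actual code.
(def registry (atom {}))

(defn register-model!
  "Stores a train function under a model key, as a plugin namespace might do when loaded."
  [model-type train-fn]
  (swap! registry assoc model-type train-fn)
  model-type)

(defn train-by-key
  "Looks up the train function for (:model-type options) and calls it."
  [data options]
  (if-let [train-fn (get @registry (:model-type options))]
    (train-fn data options)
    (throw (ex-info "Unknown model type" {:model-type (:model-type options)}))))

;; A toy "model" that only remembers the number of training rows.
(register-model! :demo/constant
                 (fn [data _options] {:n-rows (count data)}))

(train-by-key [[0 1] [1 0] [1 1]] {:model-type :demo/constant})
;; => {:n-rows 3}
```

With such a design, adding a model means registering one more key rather than adding one more public function, which is why the models are documented as a list of keys and parameters.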
The Tribuo plugin and its models are special in this respect. The scicloj.ml.tribuo library only contains 2 model types as keys, namely :scicloj.ml.tribuo/classification and :scicloj.ml.tribuo/regression. The model as such is encoded in the same way as the Tribuo Java libraries do it, namely as a map of all Tribuo components in place. One of these components, the so-called “Trainer”, is always needed and has a certain :type, the model class.
The reference documentation therefore lists all “Trainer”s and their names, including parameters. It also lists all other “Configurable”s which could be referred to in a component map.
10.2 Setup
(ns noj-book.ml-basic
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.toydata :as data]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.kindly.v4.api :as kindly]
            [tech.v3.dataset.categorical :as ds-cat]))
10.3 Inspect data
The Titanic data is part of metamorph.ml and comes in the form of a train/test split. We use only the :train part for this tutorial.
(->
 (data/titanic-ds-split)
 :train)
_unnamed [891 12]:
:passenger-id | :survived | :pclass | :name | :sex | :age | :sib-sp | :parch | :ticket | :fare | :cabin | :embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |
… | … | … | … | … | … | … | … | … | … | … | … |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | | S |
882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | | S |
883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | | S |
884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | | S |
885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | | S |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | | Q |
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | | Q |
We use defonce to avoid reading the files every time we evaluate the namespace.
(defonce titanic-split
  (data/titanic-ds-split))
(def titanic
  (-> titanic-split
      :train
      (tc/map-columns :survived
                      [:survived]
                      (fn [el] (case el
                                 0 "no"
                                 1 "yes")))))
It has various columns
(tc/column-names titanic)
(:passenger-id
 :survived
 :pclass
 :name
 :sex
 :age
 :sib-sp
 :parch
 :ticket
 :fare
 :cabin
 :embarked)
of which we can get some statistics
(ds/descriptive-stats titanic)
_unnamed: descriptive-stats [12 12]:
:col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
---|---|---|---|---|---|---|---|---|---|---|---|
:passenger-id | :int16 | 891 | 0 | 1.00 | 446.00000000 | | 891.0000 | 257.35384202 | 0.00000000 | 1 | 891 |
:survived | :string | 891 | 0 | | | no | | | | no | no |
:pclass | :int16 | 891 | 0 | 1.00 | 2.30864198 | | 3.0000 | 0.83607124 | -0.63054791 | 3 | 3 |
:name | :string | 891 | 0 | | | Mallet, Mr. Albert | | | | Braund, Mr. Owen Harris | Dooley, Mr. Patrick |
:sex | :string | 891 | 0 | | | male | | | | male | male |
:age | :float64 | 714 | 177 | 0.42 | 29.69911765 | | 80.0000 | 14.52649733 | 0.38910778 | 22.00 | 32.00 |
:sib-sp | :int16 | 891 | 0 | 0.00 | 0.52300786 | | 8.0000 | 1.10274343 | 3.69535173 | 1 | 0 |
:parch | :int16 | 891 | 0 | 0.00 | 0.38159371 | | 6.0000 | 0.80605722 | 2.74911705 | 0 | 0 |
:ticket | :string | 891 | 0 | | | CA. 2343 | | | | A/5 21171 | 370376 |
:fare | :float64 | 891 | 0 | 0.00 | 32.20420797 | | 512.3292 | 49.69342860 | 4.78731652 | 7.250 | 7.750 |
:cabin | :string | 204 | 687 | | | | | | | | |
:embarked | :string | 889 | 2 | | | S | | | | S | Q |
The data is more or less balanced across the 2 classes (roughly 62 % “no” and 38 % “yes”):
(-> titanic :survived frequencies)
{"no" 549, "yes" 342}
We will make a very simple model, which will predict the column :survived from the columns :sex, :pclass and :embarked. These represent the “gender”, “passenger class” and “port of embarkation”.
(def categorical-feature-columns [:sex :pclass :embarked])
(def target-column :survived)
10.4 Convert categorical features to numeric
As we need to convert the categorical feature columns to numeric ones, we will first look at their unique values:
(map
 #(hash-map
   :col-name %
   :values (distinct (get titanic %)))
 categorical-feature-columns)
({:col-name :sex, :values ("male" "female")}
 {:col-name :pclass, :values (3 1 2)}
 {:col-name :embarked, :values ("S" "C" "Q" nil)})
This allows us to set explicitly the values used in the conversion to numbers. This is good practice, as opposed to relying on the automatic selection of the categorical mapping:
(We discuss categorical mappings in more detail in another chapter.)
(require '[tech.v3.dataset.categorical :as ds-cat]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[tech.v3.dataset.column-filters :as cf])
We then select the relevant columns, drop the rows with missing values, convert the target column to numbers and mark it as the inference target:
(def relevant-titanic-data
  (-> titanic
      (tc/select-columns (conj categorical-feature-columns target-column))
      (tc/drop-missing)
      (ds/categorical->number [:survived] ["no" "yes"] :float64)
      (ds-mod/set-inference-target target-column)))
Next we fit the categorical maps for the feature columns and inspect their lookup tables:
(def cat-maps
  [(ds-cat/fit-categorical-map relevant-titanic-data :sex ["male" "female"] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :pclass [0 1 2] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :embarked ["S" "Q" "C"] :float64)])
cat-maps
[{:lookup-table {"male" 0, "female" 1},
  :src-column :sex,
  :result-datatype :float64}
 {:lookup-table {0 0, 1 1, 2 2, 3 3},
  :src-column :pclass,
  :result-datatype :float64}
 {:lookup-table {"S" 0, "Q" 1, "C" 2},
  :src-column :embarked,
  :result-datatype :float64}]
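To make the role of a lookup table concrete, here is a minimal sketch in plain Clojure that applies the :embarked table shown above and inverts it again. The helper names encode and decode are assumptions made for this illustration; they are not part of tech.v3.dataset.

```clojure
(require '[clojure.set :as set])

;; The :embarked lookup table, copied from the output above.
(def embarked-lookup {"S" 0, "Q" 1, "C" 2})

(defn encode
  "Maps each categorical value to its number, as a double."
  [lookup values]
  (mapv #(double (get lookup %)) values))

(defn decode
  "Maps numbers back to the original categorical values via the inverted table."
  [lookup numbers]
  (let [inverse (set/map-invert lookup)]
    (mapv #(get inverse (long %)) numbers)))

(encode embarked-lookup ["S" "C" "S" "S" "S"])
;; => [0.0 2.0 0.0 0.0 0.0]

(decode embarked-lookup [0.0 2.0 0.0 0.0 0.0])
;; => ["S" "C" "S" "S" "S"]
```

Because the table is fixed explicitly, the same category always receives the same number, independent of the order in which values happen to appear in the data.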
After the mappings are applied, we have a numeric dataset, as expected by most models.
(def numeric-titanic-data
  (reduce (fn [ds cat-map]
            (ds-cat/transform-categorical-map ds cat-map))
          relevant-titanic-data
          cat-maps))
(tc/head numeric-titanic-data)
_unnamed [5 4]:
:sex | :pclass | :embarked | :survived |
---|---|---|---|
0.0 | 3.0 | 0.0 | 0.0 |
1.0 | 1.0 | 2.0 | 1.0 |
1.0 | 3.0 | 0.0 | 1.0 |
1.0 | 1.0 | 0.0 | 1.0 |
0.0 | 3.0 | 0.0 | 0.0 |
(ds/rowvecs
 (tc/head numeric-titanic-data))
[[0.0 3.0 0.0 0.0] [1.0 1.0 2.0 1.0] [1.0 3.0 0.0 1.0] [1.0 1.0 0.0 1.0] [0.0 3.0 0.0 0.0]]
Split data into train and test set
Now we split the data into train and test. We use a :holdout strategy, so we get a single split into training and test data.
(def split
  (first
   (tc/split->seq numeric-titanic-data :holdout {:seed 112723})))
split
{:train Group: 0 [592 4]
 :test Group: 0 [297 4]}
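The idea behind a holdout split can be sketched in plain Clojure: shuffle the rows with a seeded random generator and cut the result in two. This is an illustration only; holdout-split is a hypothetical helper, the 2/3 training share is an assumption that happens to match the 592/297 sizes shown above, and the sketch will not reproduce the exact rows that tc/split->seq selects.

```clojure
(defn holdout-split
  "Shuffles rows with a seeded RNG and puts the first `ratio` share into :train."
  [rows ratio seed]
  (let [shuffled (java.util.ArrayList. ^java.util.Collection (vec rows))
        _        (java.util.Collections/shuffle shuffled (java.util.Random. seed))
        n-train  (long (* ratio (count shuffled)))]
    {:train (vec (take n-train shuffled))
     :test  (vec (drop n-train shuffled))}))

;; 889 row indices, as many rows as remain after dropping missing values.
(def demo-split (holdout-split (range 889) 2/3 112723))

(map count [(:train demo-split) (:test demo-split)])
;; => (592 297)
```

Fixing the seed makes the split repeatable, so accuracy numbers can be compared between runs.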
10.5 Train a model
Now it’s time to train a model:
(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.classification]
         '[scicloj.metamorph.ml.loss :as loss])
10.5.1 Dummy model
We start with a dummy model, which simply predicts the majority class.
(def dummy-model (ml/train (:train split)
                           {:model-type :metamorph.ml/dummy-classifier}))
(def dummy-prediction
  (ml/predict (:test split) dummy-model))
It always predicts a single class, as expected:
(-> dummy-prediction :survived frequencies)
{0.0 297}
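What “predicting the majority class” means can be written down in a few lines of plain Clojure. This is a sketch of the concept, not the implementation behind :metamorph.ml/dummy-classifier; majority-class and dummy-predict are hypothetical names.

```clojure
(defn majority-class
  "Returns the most frequent label in `labels`."
  [labels]
  (key (apply max-key val (frequencies labels))))

(defn dummy-predict
  "Predicts the majority training label for every one of `n-test` rows."
  [train-labels n-test]
  (vec (repeat n-test (majority-class train-labels))))

;; Class counts of the full :train data, taken from the frequencies above.
(majority-class (concat (repeat 549 "no") (repeat 342 "yes")))
;; => "no"

(dummy-predict [0.0 0.0 1.0] 4)
;; => [0.0 0.0 0.0 0.0]
```

Such a baseline ignores all features, so any useful model should achieve a higher accuracy than it does.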
We can calculate accuracy by using a metric after converting the numerical data back to the original values (important!). We should never compare mapped columns directly.
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))
0.6026936026936027
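Accuracy is simply the share of rows where the predicted label equals the true label. The following plain Clojure sketch spells this out; the accuracy function here is a hypothetical stand-in written for illustration, not the source of loss/classification-accuracy.

```clojure
(defn accuracy
  "Fraction of positions where the true and the predicted labels are equal."
  [y-true y-pred]
  (/ (double (count (filter true? (map = y-true y-pred))))
     (count y-true)))

(accuracy ["no" "yes" "no" "yes"] ["no" "no" "no" "no"])
;; => 0.5
```

Read this way, the value 0.6027 reported above corresponds to 179 correct predictions out of 297 test rows: since the dummy model always predicts the same class, 179 of the test passengers belong to that class.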
10.6 Logistic regression
The next model is logistic regression:
(require '[scicloj.ml.tribuo])
(def lreg-model (ml/train (:train split)
                          {:model-type :scicloj.ml.tribuo/classification
                           :tribuo-components [{:name "logistic"
                                                :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                           :tribuo-trainer-name "logistic"}))
(def lreg-prediction
  (ml/predict (:test split) lreg-model))
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))
0.7373737373737373
Its accuracy is better than that of the dummy model: about 74 % compared to 60 %.
10.7 Random forest
Next is random forest:
(def rf-model (ml/train (:train split)
                        {:model-type :scicloj.ml.tribuo/classification
                         :tribuo-components [{:name "random-forest"
                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                              :properties {:maxDepth "8"
                                                           :useRandomSplitPoints "false"
                                                           :fractionFeaturesInSplit "0.5"}}]
                         :tribuo-trainer-name "random-forest"}))
(def rf-prediction
  (ml/predict (:test split) rf-model))
Let us extract the first five predictions and the probabilities provided by the model.
(-> rf-prediction
    (tc/head)
    (tc/rows))
[[0.0 0.6470588235294118 0.35294117647058826] [0.0 0.5714285714285714 0.42857142857142855] [0.0 0.8529411764705882 0.14705882352941177] [0.0 0.8879310344827587 0.11206896551724138] [0.0 0.8879310344827587 0.11206896551724138]]
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
0.7878787878787878
This is the best result so far, about 79 %.
10.8 Next steps
We could now go further and try to improve the features or the model type in order to find the best-performing model for the data we have. All model types have a range of configuration options, so-called hyper-parameters. These can also influence the model accuracy.
So far we used a single split into ‘train’ and ‘test’ data, so we only get a point estimate of the accuracy. This should be made more robust via cross-validation, using different splits of the data.
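As a rough picture of what cross-validation does, the following plain Clojure sketch builds k train/test splits in which every row is used for testing exactly once, and averages per-fold scores. The names k-fold-splits and mean are hypothetical helpers for this illustration. In practice the rows would be shuffled first, and a library routine would normally be used instead (tablecloth's split->seq, for instance, offers further split strategies besides :holdout).

```clojure
(defn k-fold-splits
  "Returns k maps {:train [...] :test [...]}; every row is in exactly one :test fold."
  [rows k]
  (let [rows  (vec rows)
        ;; Fold i takes rows i, i+k, i+2k, ...
        folds (mapv (fn [i] (vec (take-nth k (drop i rows)))) (range k))]
    (for [i (range k)]
      {:test  (nth folds i)
       :train (vec (apply concat (concat (take i folds) (drop (inc i) folds))))})))

(defn mean
  "Arithmetic mean, e.g. of the accuracies measured on each fold."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(map #(count (:test %)) (k-fold-splits (range 10) 5))
;; => (2 2 2 2 2)

(mean [0.5 1.0])
;; => 0.75
```

Training and scoring a model once per fold and averaging the results gives an estimate that depends less on one particular split.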
source: notebooks/noj_book/ml_basic.clj