10 Machine learning
In this tutorial we will train a simple machine learning model in order to predict the survival of Titanic passengers given their data.
```clojure
(ns noj-book.ml-basic
  (:require [tablecloth.api :as tc]
            [scicloj.metamorph.ml.toydata :as data]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]
            [scicloj.kindly.v4.api :as kindly]
            [tech.v3.dataset.categorical :as ds-cat]))
```
10.1 Inspect data
The Titanic data is part of metamorph.ml and comes in the form of a train/test split.
We use only the :train part in this tutorial.
```clojure
(-> (data/titanic-ds-split)
    :train)
```
_unnamed [891 12]:

| :passenger-id | :survived | :pclass | :name | :sex | :age | :sib-sp | :parch | :ticket | :fare | :cabin | :embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | | S |
| 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S |
| 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | | S |
| 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | | S |
| 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | | S |
| 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | | S |
| 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | | S |
| 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | | Q |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen “Carrie” | female | | 1 | 2 | W./C. 6607 | 23.4500 | | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | | Q |
We use defonce to avoid reading the files every time we evaluate the namespace.

```clojure
(defonce titanic-split
  (data/titanic-ds-split))

(def titanic
  (-> titanic-split
      :train
      (tc/map-columns :survived
                      [:survived]
                      (fn [el] (case el
                                 0 "no"
                                 1 "yes")))))
```
It has various columns:

```clojure
(tc/column-names titanic)
```

```
(:passenger-id :survived :pclass :name :sex :age
 :sib-sp :parch :ticket :fare :cabin :embarked)
```
of which we can get some statistics:

```clojure
(ds/descriptive-stats titanic)
```
_unnamed: descriptive-stats [12 12]:

| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :mode | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|---|
| :passenger-id | :int16 | 891 | 0 | 1.00 | 446.00000000 | | 891.0000 | 257.35384202 | 0.00000000 | 1 | 891 |
| :survived | :string | 891 | 0 | | | no | | | | no | no |
| :pclass | :int16 | 891 | 0 | 1.00 | 2.30864198 | | 3.0000 | 0.83607124 | -0.63054791 | 3 | 3 |
| :name | :string | 891 | 0 | | | Mallet, Mr. Albert | | | | Braund, Mr. Owen Harris | Dooley, Mr. Patrick |
| :sex | :string | 891 | 0 | | | male | | | | male | male |
| :age | :float64 | 714 | 177 | 0.42 | 29.69911765 | | 80.0000 | 14.52649733 | 0.38910778 | 22.00 | 32.00 |
| :sib-sp | :int16 | 891 | 0 | 0.00 | 0.52300786 | | 8.0000 | 1.10274343 | 3.69535173 | 1 | 0 |
| :parch | :int16 | 891 | 0 | 0.00 | 0.38159371 | | 6.0000 | 0.80605722 | 2.74911705 | 0 | 0 |
| :ticket | :string | 891 | 0 | | | CA. 2343 | | | | A/5 21171 | 370376 |
| :fare | :float64 | 891 | 0 | 0.00 | 32.20420797 | | 512.3292 | 49.69342860 | 4.78731652 | 7.250 | 7.750 |
| :cabin | :string | 204 | 687 | | | | | | | | |
| :embarked | :string | 889 | 2 | | | S | | | | S | Q |
The data is more or less balanced across the 2 classes:
```clojure
(-> titanic :survived frequencies)
```

```
{"no" 549, "yes" 342}
```
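Given these counts, always predicting the majority class "no" already sets a baseline accuracy. As a quick sanity check in plain Clojure (using only the counts printed above):

```clojure
;; Baseline accuracy of a classifier that always predicts "no":
;; 549 of the 891 passengers did not survive.
(double (/ 549 891))
;; => 0.6161616161616161
```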
We will make a very simple model, which will predict the column :survived from the columns :sex, :pclass and :embarked. These represent the “gender”, “passenger class” and “port of embarkation”.
```clojure
(def categorical-feature-columns [:sex :pclass :embarked])

(def target-column :survived)
```
10.2 Convert categorical features to numeric
As we need to convert the non-numerical feature columns to numeric ones, we will first look at their unique values:
```clojure
(map
 #(hash-map :col-name %
            :values (distinct (get titanic %)))
 categorical-feature-columns)
```

```
({:col-name :sex, :values ("male" "female")}
 {:col-name :pclass, :values (3 1 2)}
 {:col-name :embarked, :values ("S" "C" "Q" nil)})
```
This allows us to set the values explicitly in the conversion to numbers. This is good practice, instead of relying on the automatic selection of the categorical mapping. (We discuss categorical mappings in more detail in another chapter.)
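Conceptually, such a categorical mapping is just a lookup table from value to number, and fixing it explicitly keeps the encoding stable between runs and datasets. A minimal sketch in plain Clojure (the name sex->number is hypothetical, not part of any API):

```clojure
;; An explicit lookup table from category to number:
(def sex->number {"male" 0.0, "female" 1.0})

;; Applying it to a column of values:
(map sex->number ["male" "female" "female"])
;; => (0.0 1.0 1.0)
```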
```clojure
(require '[tech.v3.dataset.categorical :as ds-cat]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[tech.v3.dataset.column-filters :as cf])
```
We select the relevant columns, drop rows with missing values, convert the target column to numbers and mark it as the inference target:
```clojure
(def relevant-titanic-data
  (-> titanic
      (tc/select-columns (conj categorical-feature-columns target-column))
      (tc/drop-missing)
      (ds/categorical->number [:survived] ["no" "yes"] :float64)
      (ds-mod/set-inference-target target-column)))
```
We then fit the categorical mappings for the feature columns, of which we can inspect the lookup tables:
```clojure
(def cat-maps
  [(ds-cat/fit-categorical-map relevant-titanic-data :sex ["male" "female"] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :pclass [0 1 2] :float64)
   (ds-cat/fit-categorical-map relevant-titanic-data :embarked ["S" "Q" "C"] :float64)])

cat-maps
```

```
[{:lookup-table {"male" 0, "female" 1},
  :src-column :sex,
  :result-datatype :float64}
 {:lookup-table {0 0, 1 1, 2 2, 3 3},
  :src-column :pclass,
  :result-datatype :float64}
 {:lookup-table {"S" 0, "Q" 1, "C" 2},
  :src-column :embarked,
  :result-datatype :float64}]
```
After the mappings are applied, we have a numeric dataset, as expected by most models.
```clojure
(def numeric-titanic-data
  (reduce (fn [ds cat-map]
            (ds-cat/transform-categorical-map ds cat-map))
          relevant-titanic-data
          cat-maps))

(tc/head numeric-titanic-data)
```
_unnamed [5 4]:
:sex | :pclass | :embarked | :survived |
---|---|---|---|
0.0 | 3.0 | 0.0 | 0.0 |
1.0 | 1.0 | 2.0 | 1.0 |
1.0 | 3.0 | 0.0 | 1.0 |
1.0 | 1.0 | 0.0 | 1.0 |
0.0 | 3.0 | 0.0 | 0.0 |
```clojure
(ds/rowvecs
 (tc/head numeric-titanic-data))
```

```
[[0.0 3.0 0.0 0.0]
 [1.0 1.0 2.0 1.0]
 [1.0 3.0 0.0 1.0]
 [1.0 1.0 0.0 1.0]
 [0.0 3.0 0.0 0.0]]
```
Split data into train and test set
Now we split the data into train and test sets. By default we use a :holdout strategy, so we will get a single split into train and test data.
```clojure
(def split
  (first
   (tc/split->seq numeric-titanic-data :holdout {:seed 112723})))
```
```clojure
split
```

```
{:train Group: 0 [592 4]
 :test  Group: 0 [297 4]}
```
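In miniature, a holdout split just shuffles the row indices and partitions them; tc/split->seq does this for us and returns datasets rather than plain indices. A sketch in plain Clojure:

```clojure
;; A holdout split in miniature: shuffle the indices, then partition them.
(let [indices (shuffle (range 10))
      [train test] (split-at 7 indices)]
  {:train-size (count train)
   :test-size (count test)})
;; => {:train-size 7, :test-size 3}
```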
10.3 Train a model
Now it's time to train a model:
```clojure
(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.classification]
         '[scicloj.metamorph.ml.loss :as loss])
```
10.3.1 Dummy model
We start with a dummy model, which simply predicts the majority class.

```clojure
(def dummy-model (ml/train (:train split)
                           {:model-type :metamorph.ml/dummy-classifier}))
```
TODO: Is the dummy model wrong about the majority?
```clojure
(def dummy-prediction
  (ml/predict (:test split) dummy-model))
```
It always predicts a single class, as expected:
```clojure
(-> dummy-prediction :survived frequencies)
```

```
{1.0 297}
```
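Classification accuracy is simply the fraction of positions at which truth and prediction agree. A minimal sketch in plain Clojure, with made-up labels:

```clojure
;; Accuracy: the fraction of matching labels.
(defn accuracy [truth prediction]
  (double (/ (count (filter true? (map = truth prediction)))
             (count truth))))

(accuracy ["no" "no" "yes" "no"]
          ["yes" "yes" "yes" "yes"])
;; => 0.25
```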
We can calculate the accuracy by using a metric, after having converted the numerical data back to its original values (important!). We should never compare mapped columns directly:
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms dummy-prediction)))
```

```
0.3973063973063973
```
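This value is consistent with the dummy model predicting "yes" for all 297 test passengers: its accuracy is then just the fraction of actual survivors in the test split. Assuming 118 survivors in the test data (a number inferred from the reported accuracy, not read from the data):

```clojure
;; 118 survivors out of 297 test passengers:
(double (/ 118 297))
;; => 0.3973063973063973
```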
Its performance is poor, even worse than a coin flip.
10.4 Logistic regression
The next model to use is logistic regression:
```clojure
(require '[scicloj.ml.tribuo])
```
```clojure
(def lreg-model (ml/train (:train split)
                          {:model-type :scicloj.ml.tribuo/classification
                           :tribuo-components [{:name "logistic"
                                                :type "org.tribuo.classification.sgd.linear.LinearSGDTrainer"}]
                           :tribuo-trainer-name "logistic"}))
```
```clojure
(def lreg-prediction
  (ml/predict (:test split) lreg-model))
```
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms lreg-prediction)))
```

```
0.7373737373737373
```
Its performance is better, at about 74 %.
10.5 Random forest
Next we try a random forest:
```clojure
(def rf-model (ml/train (:train split)
                        {:model-type :scicloj.ml.tribuo/classification
                         :tribuo-components [{:name "random-forest"
                                              :type "org.tribuo.classification.dtree.CARTClassificationTrainer"
                                              :properties {:maxDepth "8"
                                                           :useRandomSplitPoints "false"
                                                           :fractionFeaturesInSplit "0.5"}}]
                         :tribuo-trainer-name "random-forest"}))
```
```clojure
(def rf-prediction
  (ml/predict (:test split) rf-model))
```
The first five predictions, including the probability distributions, are:
```clojure
(-> rf-prediction
    (tc/head)
    (tc/rows))
```

```
[[0.0 0.6470588235294118 0.35294117647058826]
 [0.0 0.5714285714285714 0.42857142857142855]
 [0.0 0.8529411764705882 0.14705882352941177]
 [0.0 0.8879310344827587 0.11206896551724138]
 [0.0 0.8879310344827587 0.11206896551724138]]
```
```clojure
(loss/classification-accuracy
 (:survived (ds-cat/reverse-map-categorical-xforms (:test split)))
 (:survived (ds-cat/reverse-map-categorical-xforms rf-prediction)))
```

```
0.7878787878787878
```
This is the best so far, at about 79 %.
TODO: Extract feature importance.
11 Next steps
We could now go further and try to improve the features and/or the model type in order to find the best-performing model for the data we have. All model types have a range of configurations, so-called hyper-parameters, which can influence model accuracy as well.
So far we have used a single split into ‘train’ and ‘test’ data, so we only get a point estimate of the accuracy. This should be made more robust via cross-validation, using different splits of the data.
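For example, averaging the accuracies over several folds gives a more robust estimate than any single split (tablecloth's split->seq also supports a :kfold strategy; the fold accuracies below are hypothetical numbers for illustration only):

```clojure
;; Hypothetical accuracies from 5 cross-validation folds
;; (illustrative numbers, not computed from the data):
(def fold-accuracies [0.74 0.78 0.71 0.76 0.75])

;; The cross-validated estimate is their mean, roughly 0.75:
(double (/ (reduce + fold-accuracies)
           (count fold-accuracies)))
```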
source: notebooks/noj_book/ml_basic.clj