2 Machine learning - DRAFT

This is part of the Scicloj Clojure Data Tutorials.

(ns ml
  (:require [scicloj.ml.core :as ml]
            [scicloj.ml.metamorph :as mm]
            [scicloj.ml.dataset :as ds :refer [dataset add-column]]
            [fastmath.stats]
            [tablecloth.api :as tc]
            [scicloj.noj.v1.datasets :as datasets]
            [scicloj.kindly.v4.kind :as kind]))

2.1 Linear regression

We will explore the Iris dataset:

(tc/head datasets/iris)

_unnamed [5 5]:

:sepal-length  :sepal-width  :petal-length  :petal-width  :species
5.1            3.5           1.4            0.2           setosa
4.9            3.0           1.4            0.2           setosa
4.7            3.2           1.3            0.2           setosa
4.6            3.1           1.5            0.2           setosa
5.0            3.6           1.4            0.2           setosa

A Metamorph pipeline for linear regression, predicting :sepal-length from the remaining numeric columns (the :species column is dropped):

(def additive-pipeline
  (ml/pipeline
   (mm/set-inference-target :sepal-length)
   (mm/drop-columns [:species])
   {:metamorph/id :model}
   (mm/model {:model-type :smile.regression/ordinary-least-square})))
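To make the pipeline idea more concrete: a Metamorph step is a function from a context map to a context map, where the map carries the data under :metamorph/data and the current mode (:fit or :transform) under :metamorph/mode. The following is a minimal plain-Clojure sketch of that idea, not the library's implementation. The names toy-pipeline, drop-keys-step, and mean-model-step are hypothetical, rows are plain maps rather than a dataset, and the "model" simply predicts the training mean.

```clojure
;; Sketch only: hypothetical helpers illustrating the context-map idea.

(defn toy-pipeline
  "Composes steps left to right; each step takes and returns a context map."
  [& steps]
  (fn [ctx]
    (reduce (fn [c step] (step c)) ctx steps)))

(defn drop-keys-step
  "Removes the given keys from every row, in both modes."
  [ks]
  (fn [ctx]
    (update ctx :metamorph/data
            (fn [rows] (mapv #(apply dissoc % ks) rows)))))

(defn mean-model-step
  "In :fit mode, stores the mean of `target` under `id` in the context.
  In :transform mode, replaces `target` in every row with the stored mean."
  [target id]
  (fn [ctx]
    (let [rows (:metamorph/data ctx)]
      (case (:metamorph/mode ctx)
        :fit       (assoc ctx id {:mean (/ (reduce + (map target rows))
                                           (double (count rows)))})
        :transform (let [m (get-in ctx [id :mean])]
                     (assoc ctx :metamorph/data
                            (mapv #(assoc % target m) rows)))))))

(def toy
  (toy-pipeline (drop-keys-step [:species])
                (mean-model-step :sepal-length :model)))

;; Fit on training rows; the fitted state lives in the returned context.
(def fit-ctx
  (toy {:metamorph/mode :fit
        :metamorph/data [{:sepal-length 5.0 :species "setosa"}
                         {:sepal-length 7.0 :species "virginica"}]}))

;; Reuse the fitted context in :transform mode to predict for new rows.
(def predictions
  (-> fit-ctx
      (assoc :metamorph/mode :transform
             :metamorph/data [{:sepal-length 0.0 :species "setosa"}])
      toy
      :metamorph/data))
```

The fitted state is stored under the key :model here, which mirrors how the {:metamorph/id :model} entry in the real pipeline gives the model step a known key to look up later.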

Training the pipeline and evaluating it on a holdout split, with RMSE as the loss and R² as an additional metric:

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (ds/split->seq datasets/iris :holdout)
   ml/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fastmath.stats/r2-determination}]}))
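For intuition about the two metrics used here, the sketch below computes a root mean squared error and a coefficient of determination in plain Clojure. The functions rmse-sketch and r2-sketch are hypothetical names written for this illustration; they are assumed to match the usual textbook definitions and are not the implementations behind ml/rmse or fastmath.stats/r2-determination.

```clojure
;; Sketch only: textbook definitions, not the library implementations.

(defn- squared-error [y y-hat]
  (let [d (- y y-hat)]
    (* d d)))

(defn rmse-sketch
  "Root mean squared error between observed and predicted values."
  [observed predicted]
  (Math/sqrt (/ (reduce + (map squared-error observed predicted))
                (double (count observed)))))

(defn r2-sketch
  "Coefficient of determination, 1 - RSS/TSS."
  [observed predicted]
  (let [mean (/ (reduce + observed) (double (count observed)))
        rss  (reduce + (map squared-error observed predicted))
        tss  (reduce + (map #(squared-error % mean) observed))]
    (- 1.0 (/ rss tss))))

(def observed  [5.0 6.0 7.0])
(def predicted [5.0 6.0 8.0])

(rmse-sketch observed predicted) ; sqrt(1/3), one squared error of 1 over 3 points
(r2-sketch observed predicted)   ; RSS = 1, TSS = 2, so 1 - 1/2
```

A lower RMSE is better, which is why the metric is passed together with :loss; R² moves in the opposite direction, with values closer to 1 indicating a better fit.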

Printing one of the trained models (note that the Smile regression model is recognized by Kindly and printed correctly):

(-> evaluations
    flatten
    first
    :fit-ctx
    :model
    ml/thaw-model)

Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -0.8517     -0.2316      0.0315      0.2308      0.6501

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           1.8837     0.2870     6.5634     0.0000 ***
sepal-width         0.6507     0.0753     8.6382     0.0000 ***
petal-length        0.6913     0.0673    10.2732     0.0000 ***
petal-width        -0.5117     0.1490    -3.4344     0.0009 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3172 on 96 degrees of freedom
Multiple R-squared: 0.8534,    Adjusted R-squared: 0.8488
F-statistic: 186.2321 on 4 and 96 DF,  p-value: 6.948e-40
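
One way to read the coefficient table is as a formula: the predicted sepal length is the intercept plus each coefficient times the corresponding column value. The sketch below applies the printed estimates by hand to the first Iris row shown earlier. The names coefficients and predict-sepal-length are hypothetical and exist only for this illustration; in practice the prediction would come from running the fitted pipeline.

```clojure
;; Estimates copied from the printed model summary (rounded to 4 decimals).
(def coefficients
  {:sepal-width   0.6507
   :petal-length  0.6913
   :petal-width  -0.5117})

(def intercept 1.8837)

(defn predict-sepal-length
  "Intercept plus the sum of coefficient * value over the predictor columns."
  [row]
  (reduce-kv (fn [acc column estimate]
               (+ acc (* estimate (get row column))))
             intercept
             coefficients))

;; First row of the dataset preview; its observed :sepal-length is 5.1.
(def first-row
  {:sepal-width 3.5 :petal-length 1.4 :petal-width 0.2})

(predict-sepal-length first-row) ; roughly 5.03, against an observed 5.1
```

The gap of about 0.07 between prediction and observation is well within the printed residual standard error of 0.3172.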

source: projects/noj/notebooks/ml.clj