2  Machine learning - DRAFT

This is part of the Scicloj Clojure Data Tutorials.
(ns ml
  (:require [scicloj.ml.core :as ml]
            [scicloj.ml.metamorph :as mm]
            [scicloj.ml.dataset :as ds :refer [dataset add-column]]
            [fastmath.stats]
            [tablecloth.api :as tc]
            [scicloj.noj.v1.datasets :as datasets]
            [scicloj.kindly.v4.kind :as kind]))

2.1 Linear regression

We will explore the Iris dataset:

(tc/head datasets/iris)

_unnamed [5 5]:

:sepal-length :sepal-width :petal-length :petal-width :species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa

A Metamorph pipeline for linear regression, predicting :sepal-length from the other numeric columns:

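(def additive-pipeline
  (ml/pipeline
   ;; predict :sepal-length from the remaining columns
   (mm/set-inference-target :sepal-length)
   ;; :species is categorical, so it is left out of this model
   (mm/drop-columns [:species])
   ;; name the next step so its fitted state is stored under :model
   {:metamorph/id :model}
   (mm/model {:model-type :smile.regression/ordinary-least-square})))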

Training and evaluating the pipeline on a holdout split of the data, with RMSE as the loss to minimize and R² as an additional metric:

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (ds/split->seq datasets/iris :holdout)
   ml/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fastmath.stats/r2-determination}]}))
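
To make the two metrics above concrete, here is a minimal sketch of RMSE and R² in plain Clojure, using the usual textbook definitions. The function names rmse and r2 and the sample vectors are illustrative assumptions; they are not the implementations behind ml/rmse or fastmath.stats/r2-determination, which may differ in details.

```clojure
;; Illustrative helpers only; names and data are assumptions, not library code.
(defn- sq [x] (* x x))

(defn rmse
  "Root mean squared error between actual and predicted values."
  [actual predicted]
  (Math/sqrt (double (/ (reduce + (map (fn [a p] (sq (- a p))) actual predicted))
                        (count actual)))))

(defn r2
  "Coefficient of determination: 1 - SS_res / SS_tot."
  [actual predicted]
  (let [mean   (/ (reduce + actual) (count actual))
        ss-res (reduce + (map (fn [a p] (sq (- a p))) actual predicted))
        ss-tot (reduce + (map (fn [a] (sq (- a mean))) actual))]
    (- 1.0 (double (/ ss-res ss-tot)))))

;; Made-up sample values, for demonstration only.
(def sample-actual    [5.1 4.9 4.7])
(def sample-predicted [5.0 5.0 4.6])

(rmse sample-actual sample-predicted) ; approximately 0.1
(r2 sample-actual sample-predicted)   ; approximately 0.625
```

A lower RMSE and a higher R² both indicate a better fit, which is why the evaluation above treats RMSE as a :loss.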

Printing the trained model from the first evaluation result (note that the Smile regression model is recognized by Kindly and printed as a readable summary):

(-> evaluations
    flatten
    first
    :fit-ctx
    :model
    ml/thaw-model)

Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -0.8590     -0.2245      0.0465      0.2136      0.8509

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           2.0365     0.3037     6.7065     0.0000 ***
sepal-width         0.6058     0.0810     7.4785     0.0000 ***
petal-length        0.6784     0.0725     9.3580     0.0000 ***
petal-width        -0.4970     0.1676    -2.9659     0.0038 **
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3247 on 96 degrees of freedom
Multiple R-squared: 0.8626,    Adjusted R-squared: 0.8583
F-statistic: 200.8142 on 4 and 96 DF,  p-value: 3.132e-41
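
As a rough check on how to read the coefficient table, the fitted equation can be applied by hand to the first Iris row shown earlier. The sketch below copies the rounded estimates from the printed summary; the names coefficients and predict-sepal-length are hypothetical helpers for illustration and are not part of scicloj.ml.

```clojure
;; Rounded estimates copied from the model summary above.
(def coefficients
  {:intercept    2.0365
   :sepal-width  0.6058
   :petal-length 0.6784
   :petal-width  -0.4970})

(defn predict-sepal-length
  "Applies the linear equation: intercept plus the sum of coefficient * feature."
  [row]
  (+ (:intercept coefficients)
     (* (:sepal-width coefficients)  (:sepal-width row))
     (* (:petal-length coefficients) (:petal-length row))
     (* (:petal-width coefficients)  (:petal-width row))))

;; First row of the dataset head: sepal-width 3.5, petal-length 1.4, petal-width 0.2.
(predict-sepal-length {:sepal-width 3.5 :petal-length 1.4 :petal-width 0.2})
;; approximately 5.007, against an observed :sepal-length of 5.1
```

The difference of about 0.09 is this row's residual, which falls inside the residual range reported in the summary.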
source: projects/noj/notebooks/ml.clj