2  Machine learning - DRAFT

This is part of the Scicloj Clojure Data Scrapbook.
(ns ml
  (:require [scicloj.ml.core :as ml]
            [scicloj.ml.metamorph :as mm]
            [scicloj.ml.dataset :as ds :refer [dataset add-column]]
            [fastmath.stats]
            [tablecloth.api :as tc]
            [scicloj.noj.v1.datasets :as datasets]
            [scicloj.kindly.v4.kind :as kind]))

2.1 Linear regression

We will explore the Iris dataset:

(tc/head datasets/iris)

_unnamed [5 5]:

:sepal-length :sepal-width :petal-length :petal-width :species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa

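A Metamorph pipeline for linear regression, predicting :sepal-length from the remaining numeric columns (the categorical :species column is dropped):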

(def additive-pipeline
  (ml/pipeline
   (mm/set-inference-target :sepal-length)
   (mm/drop-columns [:species])
   {:metamorph/id :model}
   (mm/model {:model-type :smile.regression/ordinary-least-square})))
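
As a reading aid, the model this pipeline fits can be written out explicitly. This is a sketch that assumes ordinary least squares with an intercept term; the symbols $\beta_i$ and $\varepsilon$ are notation introduced here, not names from the notebook.

```latex
\text{sepal-length}
  = \beta_0
  + \beta_1 \,\text{sepal-width}
  + \beta_2 \,\text{petal-length}
  + \beta_3 \,\text{petal-width}
  + \varepsilon
```

The pipeline is called additive because each feature contributes its own term, with no interaction terms between features.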

Training and evaluating the pipeline on a holdout split (a single train/test partition of the data), with RMSE as the loss and R² as an additional metric:

(def evaluations
  (ml/evaluate-pipelines
   [additive-pipeline]
   (ds/split->seq datasets/iris :holdout)
   ml/rmse
   :loss
   {:other-metrices [{:name :r2
                      :metric-fn fastmath.stats/r2-determination}]}))
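
To make the two metrics concrete, here is a minimal sketch of their textbook definitions in plain Clojure. The names `rmse-sketch` and `r2-sketch` are hypothetical helpers introduced for illustration; they are not part of scicloj.ml or fastmath, and the library implementations may differ in detail (for example, in how R² is computed).

```clojure
(defn- sq [x] (* x x))

(defn rmse-sketch
  "Root mean squared error: sqrt of the mean squared difference
  between actual and predicted values. Hypothetical helper."
  [actual predicted]
  (let [sq-errors (map (fn [y y-hat] (sq (- y y-hat))) actual predicted)]
    (Math/sqrt (/ (reduce + sq-errors) (double (count actual))))))

(defn r2-sketch
  "Coefficient of determination, 1 - SS_res / SS_tot. Hypothetical helper."
  [actual predicted]
  (let [mean   (/ (reduce + actual) (double (count actual)))
        ;; residual sum of squares: error left after prediction
        ss-res (reduce + (map (fn [y y-hat] (sq (- y y-hat))) actual predicted))
        ;; total sum of squares: spread of the actual values around their mean
        ss-tot (reduce + (map (fn [y] (sq (- y mean))) actual))]
    (- 1.0 (/ ss-res ss-tot))))

;; Toy data, not taken from the Iris evaluation:
(rmse-sketch [1.0 2.0 3.0] [1.0 2.0 4.0])
(r2-sketch   [1.0 2.0 3.0] [1.0 2.0 4.0])
```

A lower RMSE is better, which is why the evaluation above treats it as a `:loss`; an R² closer to 1 means the model explains more of the variance in the target.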

Printing one of the trained models (note that the Smile regression model is recognized by Kindly and printed correctly):

(-> evaluations
    flatten
    first
    :fit-ctx
    :model
    ml/thaw-model)

Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
   -0.7326     -0.2096     -0.0182      0.1866      0.8517

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept           1.7373     0.2932     5.9256     0.0000 ***
sepal-width         0.6949     0.0767     9.0631     0.0000 ***
petal-length        0.6458     0.0660     9.7786     0.0000 ***
petal-width        -0.3970     0.1481    -2.6810     0.0086 **
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3194 on 96 degrees of freedom
Multiple R-squared: 0.8511,    Adjusted R-squared: 0.8465
F-statistic: 182.9573 on 4 and 96 DF,  p-value: 1.434e-39
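
To illustrate how the coefficient table is read, the sketch below applies the printed estimates by hand to the first row of the Iris preview. The names `coefficients`, `predict-sepal-length`, and `first-row` are hypothetical and introduced only for this example; in practice predictions would come from the pipeline itself.

```clojure
;; Estimates copied from the "Coefficients" table in the printed model.
(def coefficients
  {:intercept    1.7373
   :sepal-width  0.6949
   :petal-length 0.6458
   :petal-width  -0.3970})

(defn predict-sepal-length
  "Hypothetical helper: intercept plus the weighted sum of the three features."
  [{:keys [sepal-width petal-length petal-width]}]
  (+ (:intercept coefficients)
     (* (:sepal-width coefficients) sepal-width)
     (* (:petal-length coefficients) petal-length)
     (* (:petal-width coefficients) petal-width)))

;; Feature values of the first row shown by (tc/head datasets/iris);
;; its observed :sepal-length is 5.1.
(def first-row {:sepal-width 3.5 :petal-length 1.4 :petal-width 0.2})

(predict-sepal-length first-row) ; approximately 4.994
```

The difference between the observed 5.1 and the predicted value is about 0.106, which falls inside the residual range reported in the model summary.
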
source: projects/noj/notebooks/ml.clj