22  Smile regression models reference - DRAFT 🛠

(ns noj-book.smile-regression 
  (:require
   [noj-book.utils.render-tools :refer [render-key-info]]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.toydata :as datasets]
   [tablecloth.api :as tc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.metamorph :as ds-mm]
   [tech.v3.datatype.functional :as dtf]))

22.1 :smile.regression/elastic-net

javadoc
user guide
name type default range
lambda1 float64 0.1 >0
lambda2 float64 0.1 >0
tolerance float64 1.0E-4 >0
max-iterations int32 1000.0 >0
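
The two lambdas are easier to read next to the penalty they weight. The following is a minimal sketch in plain Clojure, assuming that lambda1 weights the L1 norm and lambda2 the squared L2 norm of the coefficients, as in the textbook elastic net. `elastic-net-penalty` is a hypothetical helper, not part of Smile or metamorph.ml, and Smile's exact scaling constants may differ.

```clojure
;; Hypothetical helper for illustration; not part of any library.
(defn elastic-net-penalty
  "Textbook elastic-net penalty: lambda1 * sum|b| + lambda2 * sum b^2."
  [lambda1 lambda2 coefficients]
  (+ (* lambda1 (reduce + (map #(Math/abs (double %)) coefficients)))
     (* lambda2 (reduce + (map #(* (double %) (double %)) coefficients)))))

;; Made-up coefficients, using the default lambdas from the table above.
(elastic-net-penalty 0.1 0.1 [3.0 -4.0])
```

Setting lambda2 to zero leaves only the L1 term (the Lasso penalty), and setting lambda1 to zero leaves only the squared L2 term (the ridge penalty).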


22.2 :smile.regression/gradient-tree-boost

javadoc
user guide
name type default range lookup-table
trees int32 500 >0
loss enumeration least-absolute-deviation {:least-squares "LeastSquares", :quantile "Quantile", :least-absolute-deviation "LeastAbsoluteDeviation", :huber "Huber"}
max-depth int32 20 >0
max-nodes int32 6 >0
node-size int32 5 >0
shrinkage float64 0.05 >0
sample-rate float64 0.7 [0.0 1.0]
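
As a rough illustration of how these parameters might be supplied, the sketch below builds an options map in plain Clojure. It assumes that the option keys are the keyword forms of the parameter names (as the `:lambda` key in the Lasso example of this chapter suggests) and that `:loss` takes one of the keys of the lookup table. The chosen values and the `valid-loss?` helper are hypothetical.

```clojure
;; Lookup table copied from the reference above.
(def loss-lookup
  {:least-squares "LeastSquares"
   :quantile "Quantile"
   :least-absolute-deviation "LeastAbsoluteDeviation"
   :huber "Huber"})

;; Hypothetical option values, for illustration only.
(def gtb-options
  {:model-type :smile.regression/gradient-tree-boost
   :trees 100
   :loss :huber
   :max-depth 10
   :shrinkage 0.05
   :sample-rate 0.7})

;; Hypothetical helper: is the chosen :loss one of the documented keys?
(defn valid-loss? [options]
  (contains? loss-lookup (:loss options)))

(valid-loss? gtb-options)
```

Under the same assumption, such a map would be passed to `ml/model` in place of the Lasso options used later in this chapter.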


22.3 :smile.regression/lasso

javadoc
user guide
name type default description range
lambda float64 1.0 The shrinkage/regularization parameter. Large lambda means more shrinkage. Choosing an appropriate value of lambda is important, and also difficult >0
tolerance float64 1.0E-4 Tolerance for stopping iterations (relative target duality gap) >0
max-iterations int32 1000.0 Maximum number of IPM (Newton) iterations >0


We use the diabetes dataset to show how Lasso regression shrinks the coefficients of the different variables depending on lambda.

First we make a function that creates pipelines with different lambdas:

(defn make-pipe-fn [lambda]
  (mm/pipeline
    (ds-mm/update-column
      :disease-progression
      (fn [col] (map #(double %) col)))
    (mm/lift tc/convert-types :disease-progression :float32)
    (ds-mm/set-inference-target :disease-progression)
    #:metamorph{:id :model}
    (ml/model
      {:model-type :smile.regression/lasso, :lambda (double lambda)})))

Now we go over a sequence of lambdas, fit a pipeline for each of them, and store the coefficients for each predictor variable:

(def diabetes (datasets/diabetes-ds))
(ds/column-names diabetes)
(:age :sex :bmi :bp :s1 :s2 :s3 :s4 :s5 :s6 :disease-progression)
(ds/shape diabetes)
[11 442]
(def coefs-vs-lambda
 (flatten
   (map
     (fn [lambda]
       (let [fitted (mm/fit-pipe diabetes (make-pipe-fn lambda))
             model-instance (-> fitted :model (ml/thaw-model))
             predictors (map
                          #(first (.variables %))
                          (seq
                            (.. model-instance formula predictors)))]
         (map
           #(hash-map
             :log-lambda
             (dtf/log10 lambda)
             :coefficient
             %1
             :predictor
             %2)
           (-> model-instance .coefficients seq)
           predictors)))
     (range 1 100000 100))))
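
The grid above is linear in lambda, while the plot uses the log10 of lambda on the x axis, so about 90% of the grid points fall between log-lambda 4 and 5. A log-spaced grid is one possible alternative. The sketch below is plain Clojure; `log-spaced-lambdas` is a hypothetical helper and not part of the notebook.

```clojure
;; Hypothetical helper for illustration; requires n >= 2.
(defn log-spaced-lambdas
  "Returns n values spaced evenly on a log10 scale from 10^lo to 10^hi (inclusive)."
  [lo hi n]
  (let [step (/ (- hi lo) (double (dec n)))]
    (map #(Math/pow 10.0 (+ lo (* step %))) (range n))))

;; Six lambdas, one per decade between 10^0 and 10^5.
(log-spaced-lambdas 0 5 6)
```

A call such as `(log-spaced-lambdas 0 5 50)` could stand in for `(range 1 100000 100)` if evenly spaced points on the log axis are preferred.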

Then we plot the coefficients over the log of lambda.

(kind/vega-lite
  {:data {:values coefs-vs-lambda},
   :width 500,
   :height 500,
   :mark {:type "line"},
   :encoding
   {:x {:field :log-lambda, :type "quantitative"},
    :y {:field :coefficient, :type "quantitative"},
    :color {:field :predictor}}})

This shows that an increasing lambda shrinks the coefficients of more and more variables to zero. The plot can also be used to find important variables, namely the ones whose coefficients stay non-zero even for large lambda.
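
The shrink-to-zero behaviour can be illustrated without Smile. In the textbook special case of an orthonormal design, the Lasso solution is the soft-thresholded least-squares coefficient, sign(b) * max(|b| - t, 0), where the threshold t grows with lambda (the exact factor depends on how the objective is scaled). The sketch below is plain Clojure with made-up coefficient values. `soft-threshold` is a hypothetical helper and not how Smile solves the problem; the parameter table above describes an iterative method.

```clojure
;; Hypothetical helper for illustration; not Smile's solver.
(defn soft-threshold
  "sign(b) * max(|b| - t, 0): shrinks b toward zero and clips it at zero."
  [b t]
  (let [b (double b)
        magnitude (max (- (Math/abs b) t) 0.0)]
    (if (pos? magnitude)
      (* (Math/signum b) magnitude)
      0.0)))

;; Made-up least-squares coefficients, for illustration only.
(def example-coefficients {:large 5.0 :small -1.5})

;; One row per threshold: the small coefficient reaches zero first.
(for [t [0.0 1.0 2.0 6.0]]
  (into {:threshold t}
        (map (fn [[k b]] [k (soft-threshold b t)]) example-coefficients)))
```

The coefficient with the larger magnitude survives larger thresholds, which matches the reading of the plot: variables that stay non-zero longest are the ones the model relies on most.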

22.4 :smile.regression/ordinary-least-square

javadoc
user guide
name type default lookup-table
method enumeration qr {:qr "qr", :svd "svd"}
standard-error boolean true
recursive boolean true


In this example we will explore the relationship between the body mass index (bmi) and the diabetes disease progression.

First we load the data and split into train and test sets.

(def diabetes (datasets/diabetes-ds))
(def diabetes-train (ds/head diabetes 422))
(def diabetes-test (ds/tail diabetes 20))

Next we create the pipeline, converting the target variable to a float value, as needed by the model.

(def ols-pipe-fn
 (mm/pipeline
   (ds-mm/select-columns [:bmi :disease-progression])
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))

We can then fit the model by running the pipeline in mode :fit:

(def fitted (mm/fit diabetes-train ols-pipe-fn))

Next we run the pipe-fn in mode :transform and extract the prediction for the disease progression:

(def diabetes-test-prediction
 (-> diabetes-test
  (mm/transform-pipe ols-pipe-fn fitted)
  :metamorph/data
  :disease-progression))
diabetes-test-prediction
#tech.v3.dataset.column<float64>[20]
:disease-progression
[226.0, 115.7, 163.3, 114.7, 120.8, 158.2, 236.1, 121.8, 99.57, 123.8, 204.7, 96.53, 154.2, 130.9, 83.39, 171.4, 138.0, 138.0, 189.6, 84.40]

The truth is available in the test dataset.

(def diabetes-test-trueth (-> diabetes-test :disease-progression))
diabetes-test-trueth
#tech.v3.dataset.column<int32>[20]
:disease-progression
[233, 91, 111, 152, 120, 67, 310, 94, 183, 66, 173, 72, 49, 64, 48, 178, 104, 132, 220, 57]
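
To put a number on how far the predictions are from the truth, one option is the mean absolute error. The sketch below is plain Clojure; `mean-absolute-error` is a hypothetical helper written for this illustration, and it assumes both inputs are equally long sequences of numbers.

```clojure
;; Hypothetical helper for illustration.
(defn mean-absolute-error
  "Mean of |prediction - truth| over two equally long sequences of numbers."
  [predictions truths]
  (let [abs-errors (map #(Math/abs (- (double %1) (double %2)))
                        predictions truths)]
    (/ (reduce + abs-errors) (count abs-errors))))

;; With the columns from this section (columns can be mapped over like sequences):
;; (mean-absolute-error diabetes-test-prediction diabetes-test-trueth)

;; Small made-up example:
(mean-absolute-error [1.0 4.0] [2 2])
```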

The Smile Java object of the LinearModel is available in the fitted pipeline as well:

(def model-instance (-> fitted :model (ml/thaw-model)))

This object contains all information regarding the model fit, such as the coefficients and the formula:

(-> model-instance .coefficients seq)
(152.9188618261617 938.2378612512631)
(-> model-instance .formula str)
"disease-progression ~ bmi"

Smile also generates a string summarizing the linear regression, available through the toString() method of class LinearModel:

(kind/code (str model-instance))
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
 -164.9058    -44.8263     -8.6914     47.9600    153.4435

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept         152.9189     3.0688    49.8299     0.0000 ***
bmi               938.2379    64.4835    14.5500     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.0385 on 420 degrees of freedom
Multiple R-squared: 0.3351,    Adjusted R-squared: 0.3335
F-statistic: 211.7036 on 2 and 420 DF,  p-value: 3.981e-39

This tells us that there is a statistically significant (positive) correlation between the bmi and the diabetes disease progression in this data.
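
With a single predictor, the two coefficients reported above describe a straight line, so a prediction can be reproduced by hand. The sketch below is plain Clojure and assumes the first coefficient is the intercept and the second the bmi slope, as the summary suggests. `predict-progression` is a hypothetical helper, not part of Smile or metamorph.ml.

```clojure
;; Coefficients copied from the model output above.
(def intercept 152.9188618261617)
(def bmi-slope 938.2378612512631)

;; Hypothetical helper for illustration.
(defn predict-progression
  "Evaluates the fitted line intercept + slope * bmi."
  [bmi]
  (+ intercept (* bmi-slope bmi)))

;; A bmi value of 0.0 yields the intercept itself.
(predict-progression 0.0)
```

Mapping this function over `(:bmi diabetes-test)` should give values close to `diabetes-test-prediction`, up to floating point precision.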

At the end we can plot the truth and the prediction on the test data, and observe the linear nature of the model.

(kind/vega-lite
  {:layer
   [{:data
     {:values
      (map
        #(hash-map :disease-progression %1 :bmi %2 :type :truth)
        diabetes-test-trueth
        (:bmi diabetes-test))},
     :width 500,
     :height 500,
     :mark {:type "circle"},
     :encoding
     {:x {:field :bmi, :type "quantitative"},
      :y {:field :disease-progression, :type "quantitative"},
      :color {:field :type}}}
    {:data
     {:values
      (map
        #(hash-map :disease-progression %1 :bmi %2 :type :prediction)
        diabetes-test-prediction
        (:bmi diabetes-test))},
     :width 500,
     :height 500,
     :mark {:type "line"},
     :encoding
     {:x {:field :bmi, :type "quantitative"},
      :y {:field :disease-progression, :type "quantitative"},
      :color {:field :type}}}]})

22.5 :smile.regression/random-forest

javadoc
user guide
name type default range
trees int32 500 >0
max-depth int32 20 >0
max-nodes int32 (computed by a function; no fixed value) >0
node-size int32 5 >0
sample-rate float64 1.0 [0.0 1.0]


22.6 :smile.regression/ridge

javadoc
user guide
name type default range
lambda float64 1.0 >0
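
To see what lambda does in ridge regression, the sketch below uses the textbook closed form for one predictor and no intercept, which minimizes sum((y - b*x)^2) + lambda * b^2. It is plain Clojure with made-up data. `ridge-slope` is a hypothetical helper, and Smile's handling of the intercept and of scaling may differ.

```clojure
;; Hypothetical helper for illustration; not Smile's implementation.
(defn ridge-slope
  "Ridge estimate for one predictor and no intercept:
   sum(x*y) / (sum(x*x) + lambda)."
  [lambda xs ys]
  (/ (reduce + (map * xs ys))
     (+ (reduce + (map * xs xs)) lambda)))

;; Made-up data with y = 2x exactly; the slope shrinks as lambda grows.
(map #(ridge-slope % [1.0 2.0 3.0] [2.0 4.0 6.0]) [0.0 1.0 14.0])
```

Unlike the Lasso, this estimate gets smaller as lambda grows but does not become exactly zero for any finite lambda.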


source: notebooks/noj_book/smile_regression.clj