21 Smile regression models reference

As discussed in the Machine Learning chapter, this book contains reference chapters for machine learning models that can be registered in metamorph.ml.

This specific chapter focuses on regression models of Smile version 2.6, which are wrapped by scicloj.ml.smile.

Note that this chapter requires scicloj.ml.smile as an additional dependency to Noj.

Below is a list of all model keys for Smile regression models, together with their parameters. They can be used like this:

(comment
  (ml/train df
            {:model-type <model-key>
             :param-1 0
             :param-2 1}))
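
As a concrete instance of the pattern above, here is a minimal sketch of an options map for the elastic-net model, filled in with the defaults from the parameter table in this chapter. The names `elastic-net-options` and `df` are hypothetical, and the `ml` alias is assumed to refer to scicloj.metamorph.ml; the training call stays inside `comment` because it needs a dataset with an inference target.

```clojure
;; Options map: the model key plus the parameters listed for that model.
(def elastic-net-options
  {:model-type :smile.regression/elastic-net
   :lambda1 0.1
   :lambda2 0.1
   :tolerance 1.0E-4
   :max-iterations 1000})

(comment
  ;; `df` is a hypothetical dataset whose inference target is already set.
  (ml/train df elastic-net-options))
```

Parameters that are left out of the map fall back to the defaults shown in the tables.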

(require '[scicloj.ml.smile.regression]
         '[scicloj.ml.tribuo])

The examples below rely on namespace aliases that are not shown here. They appear to follow the usual Noj conventions: mm (scicloj.metamorph.core), ml (scicloj.metamorph.ml), datasets (scicloj.metamorph.ml.toydata), ds (tech.v3.dataset), ds-mm (tech.v3.dataset.metamorph), tc (tablecloth.api), dtf (tech.v3.datatype.functional) and kind (scicloj.kindly.v4.kind).

21.1 :smile.regression/elastic-net

javadoc

user guide

name	type	default
lambda1	float64	0.1
lambda2	float64	0.1
tolerance	float64	1.0E-4
max-iterations	int32	1000.0

21.2 :smile.regression/gradient-tree-boost

javadoc

user guide

name	type	default
trees	int32	500
loss	enumeration	least-absolute-deviation
max-depth	int32	20
max-nodes	int32	6
node-size	int32	5
shrinkage	float64	0.05
sample-rate	float64	0.7

21.3 :smile.regression/lasso

javadoc

user guide

name	type	default	description
lambda	float64	1.0	The shrinkage/regularization parameter. Large lambda means more shrinkage. Choosing an appropriate value of lambda is important, and also difficult
tolerance	float64	1.0E-4	Tolerance for stopping iterations (relative target duality gap)
max-iterations	int32	1000.0	Maximum number of IPM (Newton) iterations

We use the diabetes dataset to show how Lasso regression regularizes the coefficients of the different variables, and how the amount of regularization depends on the lambda parameter.

First we make a function to create pipelines with different lambdas.

(defn make-pipe-fn [lambda]
  (mm/pipeline
    (ds-mm/update-column
      :disease-progression
      (fn [col] (map #(double %) col)))
    (mm/lift tc/convert-types :disease-progression :float32)
    (ds-mm/set-inference-target :disease-progression)
    #:metamorph{:id :model}
    (ml/model
      {:model-type :smile.regression/lasso, :lambda (double lambda)})))

Now we go over a sequence of lambdas, fit a pipeline for all of them, and store the coefficients for each predictor variable:

(def diabetes (datasets/diabetes-ds))

(ds/column-names diabetes)

(:age :sex :bmi :bp :s1 :s2 :s3 :s4 :s5 :s6 :disease-progression)

(ds/shape diabetes)

[11 442]

(def coefs-vs-lambda
 (flatten
   (map
     (fn [lambda]
       (let [fitted (mm/fit-pipe diabetes (make-pipe-fn lambda))
             model-instance (-> fitted :model (ml/thaw-model))
             predictors (map
                          #(first (.variables %))
                          (seq
                            (.. model-instance formula predictors)))]
         (map
           #(hash-map
             :log-lambda
             (dtf/log10 lambda)
             :coefficient
             %1
             :predictor
             %2)
           (-> model-instance .coefficients seq)
           predictors)))
     (range 1 100000 100))))
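
The lambdas above are spaced linearly (1, 101, 201, ...), so after taking the logarithm most of the points cluster at the upper end of the axis. As a hedged alternative, a log-spaced grid gives evenly distributed points on the log scale. This is only a sketch; the name `log-spaced-lambdas` and the choice of half-decade steps are assumptions, not part of the original notebook.

```clojure
;; Eleven lambdas from 10^0 to 10^5 in steps of half a decade.
(def log-spaced-lambdas
  (map #(Math/pow 10.0 (/ % 2.0)) (range 0 11)))

(first log-spaced-lambdas)
;; => 1.0
(last log-spaced-lambdas)
;; => 100000.0
```

Such a sequence could be passed in place of the `range` expression, at the cost of fitting far fewer models.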

Then we plot the coefficients over the log of lambda.

(kind/vega-lite
  {:data {:values coefs-vs-lambda},
   :width 500,
   :height 500,
   :mark {:type "line"},
   :encoding
   {:x {:field :log-lambda, :type "quantitative"},
    :y {:field :coefficient, :type "quantitative"},
    :color {:field :predictor}}})

This shows that an increasing lambda shrinks more and more coefficients to zero. The plot can also be used to find important variables, namely the ones whose coefficients stay non-zero even with large lambda.
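
Why coefficients reach exactly zero can be illustrated with the soft-thresholding operator, which is the lasso solution in the simplified case of orthonormal predictors. This is a sketch of the shrinkage effect only: it is not how Smile fits the model (the parameter table above mentions an interior-point method). The coefficient values are hypothetical, and the relation between `gamma` and lambda depends on the scaling convention.

```clojure
(defn soft-threshold
  "Moves z toward zero by gamma; returns 0.0 once |z| <= gamma."
  [z gamma]
  (let [magnitude (- (Math/abs (double z)) gamma)]
    (if (pos? magnitude)
      (* (Math/signum (double z)) magnitude)
      0.0)))

;; Hypothetical unpenalized coefficients, for illustration only.
(def unpenalized-coefs {:bmi 500.0 :bp 300.0 :age -10.0})

(defn shrink-all
  "Applies soft-thresholding with the same gamma to every coefficient."
  [coefs gamma]
  (into {} (map (fn [[k v]] [k (soft-threshold v gamma)]) coefs)))

(shrink-all unpenalized-coefs 50.0)
;; => {:bmi 450.0, :bp 250.0, :age 0.0}
```

Small coefficients are removed first, while large ones survive with a reduced magnitude, which mirrors the pattern in the plot.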

21.4 :smile.regression/ordinary-least-square

javadoc

user guide

name	type	default
method	enumeration	qr
standard-error	boolean	true
recursive	boolean	true

In this example we will explore the relationship between the body mass index (bmi) and a diabetes indicator.

First we load the data and split into train and test sets.

(def diabetes (datasets/diabetes-ds))

(def diabetes-train (ds/head diabetes 422))

(def diabetes-test (ds/tail diabetes 20))

Next we create the pipeline, converting the target variable to a float value, as needed by the model.

(def ols-pipe-fn
 (mm/pipeline
   (ds-mm/select-columns [:bmi :disease-progression])
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))

We can then fit the model by running the pipeline in :fit mode.

(def fitted (mm/fit diabetes-train ols-pipe-fn))

Next we run the pipe-fn in :transform mode and extract the prediction for the disease progression:

(def diabetes-test-prediction
 (-> diabetes-test
  (mm/transform-pipe ols-pipe-fn fitted)
  :metamorph/data
  :disease-progression))

diabetes-test-prediction

#tech.v3.dataset.column<float64>[20]
:disease-progression
[226.0, 115.7, 163.3, 114.7, 120.8, 158.2, 236.1, 121.8, 99.57, 123.8, 204.7, 96.53, 154.2, 130.9, 83.39, 171.4, 138.0, 138.0, 189.6, 84.40]

The truth is available in the test dataset.

(def diabetes-test-trueth (-> diabetes-test :disease-progression))

diabetes-test-trueth

#tech.v3.dataset.column<int32>[20]
:disease-progression
[233, 91, 111, 152, 120, 67, 310, 94, 183, 66, 173, 72, 49, 64, 48, 178, 104, 132, 220, 57]
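
To put a number on how far the predictions are from the truth, here is a minimal sketch that computes the mean absolute error in plain Clojure. It uses the values printed above, so the predictions are the rounded display values; the names `predictions`, `truth` and `mean-absolute-error` are introduced here for illustration.

```clojure
;; Values copied from the printed columns above (predictions are rounded).
(def predictions
  [226.0 115.7 163.3 114.7 120.8 158.2 236.1 121.8 99.57 123.8
   204.7 96.53 154.2 130.9 83.39 171.4 138.0 138.0 189.6 84.40])

(def truth
  [233 91 111 152 120 67 310 94 183 66
   173 72 49 64 48 178 104 132 220 57])

(defn mean-absolute-error
  "Average of the absolute differences between predicted and actual values."
  [predicted actual]
  (/ (reduce + (map #(Math/abs (double (- %1 %2))) predicted actual))
     (count predicted)))

(mean-absolute-error predictions truth)
;; => roughly 41.2
```

An average error of about 41 on a target that ranges from 48 to 310 in this test set indicates that bmi alone explains only part of the variation.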

The Smile Java object of type LinearModel is available in the fitted pipeline as well:

(def model-instance (-> fitted :model (ml/thaw-model)))

This object contains all information regarding the model fit such as coefficients and formula.

(-> model-instance .coefficients seq)

(152.9188618261617 938.2378612512631)
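
With a single predictor, these two numbers fully describe the model: the first is the intercept and the second is the slope for bmi. A minimal sketch of the resulting prediction rule follows; the function name `predict-progression` and the input value 0.05 are hypothetical.

```clojure
;; Coefficients as printed above: intercept first, then the bmi slope.
(def intercept 152.9188618261617)
(def bmi-slope 938.2378612512631)

(defn predict-progression
  "Linear prediction: intercept + slope * bmi."
  [bmi]
  (+ intercept (* bmi-slope bmi)))

(predict-progression 0.0)
;; => 152.9188618261617

(predict-progression 0.05)
;; => roughly 199.83
```

The intercept is therefore the predicted disease progression at a bmi value of zero.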

(-> model-instance .formula str)

"disease-progression ~ bmi"

Smile also generates a String with the result of the linear regression as part of the toString() method of class LinearModel:

(kind/code (str model-instance))

Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
 -164.9058    -44.8263     -8.6914     47.9600    153.4435

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept         152.9189     3.0688    49.8299     0.0000 ***
bmi               938.2379    64.4835    14.5500     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.0385 on 420 degrees of freedom
Multiple R-squared: 0.3351,    Adjusted R-squared: 0.3335
F-statistic: 211.7036 on 2 and 420 DF,  p-value: 3.981e-39

This tells us that there is a statistically significant (positive) correlation between the bmi and the diabetes disease progression in this data.
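
Two of the figures in the summary can be re-derived by hand, which helps when reading such reports. The sketch below assumes the standard textbook definitions (t value as estimate divided by standard error, and the usual adjusted R-squared formula with n observations and p predictors); the function names are hypothetical and the inputs are the rounded values printed above.

```clojure
(defn t-value
  "t statistic of a coefficient: estimate divided by its standard error."
  [estimate std-error]
  (/ estimate std-error))

(defn adjusted-r-squared
  "Adjusted R-squared for n observations and p predictors."
  [r-squared n p]
  (- 1.0 (* (- 1.0 r-squared)
            (/ (- n 1.0) (- n p 1.0)))))

(t-value 938.2379 64.4835)
;; => roughly 14.55, matching the bmi row

(adjusted-r-squared 0.3351 422 1)
;; => roughly 0.3335, matching the summary
```

Here n is 422 (the size of the training set) and p is 1, which also gives the 420 degrees of freedom reported for the residual standard error.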

At the end we can plot the truth and the prediction on the test data, and observe the linear nature of the model.

(kind/vega-lite
  {:layer
   [{:data
     {:values
      (map
        #(hash-map :disease-progression %1 :bmi %2 :type :truth)
        diabetes-test-trueth
        (:bmi diabetes-test))},
     :width 500,
     :height 500,
     :mark {:type "circle"},
     :encoding
     {:x {:field :bmi, :type "quantitative"},
      :y {:field :disease-progression, :type "quantitative"},
      :color {:field :type}}}
    {:data
     {:values
      (map
        #(hash-map :disease-progression %1 :bmi %2 :type :prediction)
        diabetes-test-prediction
        (:bmi diabetes-test))},
     :width 500,
     :height 500,
     :mark {:type "line"},
     :encoding
     {:x {:field :bmi, :type "quantitative"},
      :y {:field :disease-progression, :type "quantitative"},
      :color {:field :type}}}]})

21.5 :smile.regression/random-forest

javadoc

user guide

name	type	default
trees	int32	500
max-depth	int32	20
max-nodes	int32	(default computed by a function)
node-size	int32	5
sample-rate	float64	1.0

21.6 :smile.regression/ridge

javadoc

user guide

name	type	default
lambda	float64	1.0

source: notebooks/noj_book/smile_regression.clj