22 Smile regression models reference - DRAFT 🛠
```clojure
(ns noj-book.smile-regression
  (:require
   [noj-book.utils.render-tools :refer [render-key-info]]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.toydata :as datasets]
   [tablecloth.api :as tc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.metamorph :as ds-mm]
   [tech.v3.datatype.functional :as dtf]))
```
22.1 :smile.regression/elastic-net
name | type | default | range |
---|---|---|---|
lambda1 | float64 | 0.1 | >0 |
lambda2 | float64 | 0.1 | >0 |
tolerance | float64 | 1.0E-4 | >0 |
max-iterations | int32 | 1000 | >0 |
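Elastic net combines the lasso and ridge penalties (lambda1 weights the L1 term, lambda2 the L2 term). As a minimal, untuned sketch, a pipeline for this model could look like the following (it uses the diabetes toy data introduced later in this chapter; the hyper-parameter values are illustrative):

```clojure
;; A minimal sketch: an elastic-net pipeline on the diabetes toy data.
;; The lambda values are illustrative, not tuned.
(def elastic-net-pipe
  (mm/pipeline
   ;; the Smile regression models need a floating-point target
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/elastic-net
              :lambda1 0.1
              :lambda2 0.1})))

(def elastic-net-fitted
  (mm/fit (datasets/diabetes-ds) elastic-net-pipe))
```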
22.2 :smile.regression/gradient-tree-boost
name | type | default | range | lookup-table |
---|---|---|---|---|
trees | int32 | 500 | >0 | |
loss | enumeration | least-absolute-deviation | | |
max-depth | int32 | 20 | >0 | |
max-nodes | int32 | 6 | >0 | |
node-size | int32 | 5 | >0 | |
shrinkage | float64 | 0.05 | >0 | |
sample-rate | float64 | 0.7 | | |
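The same pipeline pattern used throughout this chapter works for gradient tree boosting as well; a minimal sketch, with a few of the options from the table spelled out (the values are the defaults, shown only for illustration):

```clojure
;; A minimal sketch: gradient tree boosting on the diabetes toy data.
(def gradient-tree-boost-pipe
  (mm/pipeline
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/gradient-tree-boost
              :trees 500
              :shrinkage 0.05
              :sample-rate 0.7})))

(def gradient-tree-boost-fitted
  (mm/fit (datasets/diabetes-ds) gradient-tree-boost-pipe))
```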
22.3 :smile.regression/lasso
name | type | default | description | range |
---|---|---|---|---|
lambda | float64 | 1.0 | The shrinkage/regularization parameter. Large lambda means more shrinkage. Choosing an appropriate value of lambda is important, and also difficult | >0 |
tolerance | float64 | 1.0E-4 | Tolerance for stopping iterations (relative target duality gap) | >0 |
max-iterations | int32 | 1000 | Maximum number of IPM (Newton) iterations | >0 |
We use the diabetes dataset to show how Lasso regression regularizes the coefficients of the different variables depending on lambda.
First we make a function that creates pipelines with different lambdas:
```clojure
(defn make-pipe-fn [lambda]
  (mm/pipeline
   ;; the Smile lasso model needs a floating-point target column
   (ds-mm/update-column :disease-progression
                        (fn [col] (map #(double %) col)))
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/lasso
              :lambda (double lambda)})))
```
Now we go over a sequence of lambdas, fit a pipeline for all of them, and store the coefficients for each predictor variable:
```clojure
(def diabetes (datasets/diabetes-ds))

(ds/column-names diabetes)
;; => (:age :sex :bmi :bp :s1 :s2 :s3 :s4 :s5 :s6 :disease-progression)

(ds/shape diabetes)
;; => [11 442]
```
```clojure
(def coefs-vs-lambda
  (flatten
   (map
    (fn [lambda]
      (let [fitted (mm/fit-pipe diabetes (make-pipe-fn lambda))
            model-instance (-> fitted :model (ml/thaw-model))
            ;; the names of the predictor variables, taken from the
            ;; formula of the underlying Smile model
            predictors (map #(first (.variables %))
                            (seq (.. model-instance formula predictors)))]
        (map #(hash-map :log-lambda (dtf/log10 lambda)
                        :coefficient %1
                        :predictor %2)
             (-> model-instance .coefficients seq)
             predictors)))
    (range 1 100000 100))))
```
Then we plot the coefficients over the log of lambda.
```clojure
(kind/vega-lite
 {:data {:values coefs-vs-lambda}
  :width 500
  :height 500
  :mark {:type "line"}
  :encoding {:x {:field :log-lambda :type "quantitative"}
             :y {:field :coefficient :type "quantitative"}
             :color {:field :predictor}}})
```
This shows that an increasing lambda shrinks more and more coefficients to zero. The plot can also be used to find important variables, namely the ones whose coefficients stay away from zero even for large lambda.
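As a small follow-up sketch (not part of the original analysis), we can list exactly those predictors whose coefficients are still non-zero at the largest lambda we tried:

```clojure
(let [max-log-lambda (apply max (map :log-lambda coefs-vs-lambda))]
  (->> coefs-vs-lambda
       ;; keep only the entries belonging to the largest lambda
       (filter #(= max-log-lambda (:log-lambda %)))
       ;; the lasso sets unimportant coefficients exactly to zero
       (remove #(zero? (:coefficient %)))
       (map :predictor)))
```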
22.4 :smile.regression/ordinary-least-square
name | type | default | lookup-table |
---|---|---|---|
method | enumeration | qr | |
standard-error | boolean | true | |
recursive | boolean | true | |
In this example we will explore the relationship between the body mass index (bmi) and a diabetes indicator.
First we load the data and split into train and test sets.
```clojure
(def diabetes (datasets/diabetes-ds))
(def diabetes-train (ds/head diabetes 422))
(def diabetes-test (ds/tail diabetes 20))
```
Next we create the pipeline, converting the target variable to a float value, as needed by the model.
```clojure
(def ols-pipe-fn
  (mm/pipeline
   (ds-mm/select-columns [:bmi :disease-progression])
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))
```
We can then fit the model by running the pipeline in mode :fit:
```clojure
(def fitted (mm/fit diabetes-train ols-pipe-fn))
```
Next we run the pipe-fn in :transform and extract the prediction for the disease progression:
```clojure
(def diabetes-test-prediction
  (-> diabetes-test
      (mm/transform-pipe ols-pipe-fn fitted)
      :metamorph/data
      :disease-progression))

diabetes-test-prediction
```

```
#tech.v3.dataset.column<float64>[20]
:disease-progression
[226.0, 115.7, 163.3, 114.7, 120.8, 158.2, 236.1, 121.8, 99.57, 123.8, 204.7, 96.53, 154.2, 130.9, 83.39, 171.4, 138.0, 138.0, 189.6, 84.40]
```
The truth is available in the test dataset.
```clojure
(def diabetes-test-trueth (-> diabetes-test :disease-progression))

diabetes-test-trueth
```

```
#tech.v3.dataset.column<int32>[20]
:disease-progression
[233, 91, 111, 152, 120, 67, 310, 94, 183, 66, 173, 72, 49, 64, 48, 178, 104, 132, 220, 57]
```
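To quantify how close prediction and truth are, we could use a loss function from scicloj.metamorph.ml.loss. A sketch (this namespace is not part of this chapter's ns form, so we require it inline):

```clojure
(require '[scicloj.metamorph.ml.loss :as loss])

;; root-mean-square error between truth and prediction
(loss/rmse diabetes-test-trueth
           diabetes-test-prediction)
```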
The Smile Java object of the LinearModel is in the pipeline as well:
```clojure
(def model-instance (-> fitted :model (ml/thaw-model)))
```
This object contains all information regarding the model fit, such as coefficients and formula:
```clojure
(-> model-instance .coefficients seq)
;; => (152.9188618261617 938.2378612512631)

(-> model-instance .formula str)
;; => "disease-progression ~ bmi"
```
Smile also generates a String with the result of the linear regression, as part of the toString() method of class LinearModel:
```clojure
(kind/code (str model-instance))
```

```
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
 -164.9058    -44.8263     -8.6914     47.9600    153.4435

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept         152.9189     3.0688    49.8299     0.0000 ***
bmi               938.2379    64.4835    14.5500     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.0385 on 420 degrees of freedom
Multiple R-squared: 0.3351,    Adjusted R-squared: 0.3335
F-statistic: 211.7036 on 2 and 420 DF,  p-value: 3.981e-39
```
This tells us that there is a statistically significant (positive) correlation between the bmi and the diabetes disease progression in this data.
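If we want these goodness-of-fit numbers programmatically instead of parsing the string, the LinearModel object exposes them as methods; a sketch, using method names from Smile's Java API:

```clojure
(.RSquared model-instance)          ;; multiple R-squared
(.adjustedRSquared model-instance)  ;; adjusted R-squared
(.pvalue model-instance)            ;; p-value of the F-test
```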
At the end we can plot the truth and the prediction on the test data, and observe the linear nature of the model.
```clojure
(kind/vega-lite
 {:layer
  [{:data {:values (map #(hash-map :disease-progression %1
                                   :bmi %2
                                   :type :truth)
                        diabetes-test-trueth
                        (:bmi diabetes-test))}
    :width 500
    :height 500
    :mark {:type "circle"}
    :encoding {:x {:field :bmi :type "quantitative"}
               :y {:field :disease-progression :type "quantitative"}
               :color {:field :type}}}
   {:data {:values (map #(hash-map :disease-progression %1
                                   :bmi %2
                                   :type :prediction)
                        diabetes-test-prediction
                        (:bmi diabetes-test))}
    :width 500
    :height 500
    :mark {:type "line"}
    :encoding {:x {:field :bmi :type "quantitative"}
               :y {:field :disease-progression :type "quantitative"}
               :color {:field :type}}}]})
```
22.5 :smile.regression/random-forest
name | type | default | range |
---|---|---|---|
trees | int32 | 500 | >0 |
max-depth | int32 | 20 | >0 |
max-nodes | int32 | (computed from the dataset size at train time) | >0 |
node-size | int32 | 5 | >0 |
sample-rate | float64 | 1.0 | |
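A minimal random-forest pipeline sketch, following the same pattern as the other models in this chapter (the :trees value is illustrative, not tuned):

```clojure
;; A minimal sketch: a random-forest pipeline on the diabetes toy data.
(def random-forest-pipe
  (mm/pipeline
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/random-forest
              :trees 100})))

(def random-forest-fitted
  (mm/fit (datasets/diabetes-ds) random-forest-pipe))
```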
22.6 :smile.regression/ridge
name | type | default | range |
---|---|---|---|
lambda | float64 | 1.0 | >0 |
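Ridge regression fits the same pipeline pattern as the lasso example above; a minimal sketch with the default lambda made explicit:

```clojure
;; A minimal sketch: ridge regression on the diabetes toy data;
;; the lambda value is the default from the table above.
(def ridge-pipe
  (mm/pipeline
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/ridge
              :lambda 1.0})))

(def ridge-fitted
  (mm/fit (datasets/diabetes-ds) ridge-pipe))
```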