22 Smile regression models reference - DRAFT 🛠
Note that this chapter requires `scicloj.ml.smile` as an additional dependency to Noj.
Below is a list of all model keys of Smile regression models, including their parameters. They can be used like this:
```clojure
(comment
  (ml/train df
            {:model-type <model-key>
             :param-1 0
             :param-2 1}))
```
```clojure
(require '[scicloj.ml.smile.regression]
         '[scicloj.ml.tribuo])
```
22.1 :smile.regression/elastic-net
name | type | default | range |
---|---|---|---|
lambda1 | float64 | 0.1 | >0 |
lambda2 | float64 | 0.1 | >0 |
tolerance | float64 | 1.0E-4 | >0 |
max-iterations | int32 | 1000 | >0 |
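For example, a hypothetical elastic-net fit on a made-up toy dataset might look as follows. This is a minimal sketch, assuming `ml` resolves to `scicloj.metamorph.ml` and `ds` to `tech.v3.dataset`, as elsewhere in Noj; the column names, data and parameter values are illustrative only.

```clojure
(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.ml.smile.regression]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

;; A made-up toy dataset; :y is the regression target.
(def toy-ds
  (-> (ds/->dataset {:x1 [1.0 2.0 3.0 4.0 5.0]
                     :x2 [2.0 1.0 4.0 3.0 5.0]
                     :y  [1.1 2.2 2.9 4.1 5.0]})
      (ds-mod/set-inference-target :y)))

;; Train an elastic-net model, overriding both shrinkage
;; parameters from the table above.
(def enet-model
  (ml/train toy-ds {:model-type :smile.regression/elastic-net
                    :lambda1 0.05
                    :lambda2 0.05}))

;; Predict on the same data (for illustration only).
(ml/predict toy-ds enet-model)
```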
22.2 :smile.regression/gradient-tree-boost
name | type | default | range | lookup-table |
---|---|---|---|---|
trees | int32 | 500 | >0 | |
loss | enumeration | least-absolute-deviation | | |
max-depth | int32 | 20 | >0 | |
max-nodes | int32 | 6 | >0 | |
node-size | int32 | 5 | >0 | |
shrinkage | float64 | 0.05 | >0 | |
sample-rate | float64 | 0.7 | | |
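A hypothetical training call for this model, reusing `toy-ds` and the `ml` alias from the elastic-net sketch above (the parameter values are illustrative only):

```clojure
;; Gradient tree boosting with a few of the tabulated
;; parameters overridden.
(def gtb-model
  (ml/train toy-ds {:model-type :smile.regression/gradient-tree-boost
                    :trees 100
                    :max-depth 10
                    :shrinkage 0.1}))
```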
22.3 :smile.regression/lasso
name | type | default | description | range |
---|---|---|---|---|
lambda | float64 | 1.0 | The shrinkage/regularization parameter. Large lambda means more shrinkage. Choosing an appropriate value of lambda is important, and also difficult. | >0 |
tolerance | float64 | 1.0E-4 | Tolerance for stopping iterations (relative target duality gap) | >0 |
max-iterations | int32 | 1000 | Maximum number of IPM (Newton) iterations | >0 |
We use the diabetes dataset to show how Lasso regression regularizes the different variables, and how this regularization depends on the `lambda` parameter.
First we make a function to create pipelines with different `lambda`s.
```clojure
(defn make-pipe-fn [lambda]
  (mm/pipeline
   (ds-mm/update-column :disease-progression
                        (fn [col] (map #(double %) col)))
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/lasso
              :lambda (double lambda)})))
```
Now we go over a sequence of `lambda`s, fit a pipeline for all of them, and store the coefficients for each predictor variable:
```clojure
(def diabetes (datasets/diabetes-ds))

(ds/column-names diabetes)
```
```
(:age :sex :bmi :bp :s1 :s2 :s3 :s4 :s5 :s6 :disease-progression)
```
```clojure
(ds/shape diabetes)
```
```
[11 442]
```
```clojure
(def coefs-vs-lambda
  (flatten
   (map
    (fn [lambda]
      (let [fitted (mm/fit-pipe diabetes (make-pipe-fn lambda))
            model-instance (-> fitted :model (ml/thaw-model))
            predictors (map
                        #(first (.variables %))
                        (seq (.. model-instance formula predictors)))]
        (map
         #(hash-map :log-lambda (dtf/log10 lambda)
                    :coefficient %1
                    :predictor %2)
         (-> model-instance .coefficients seq)
         predictors)))
    (range 1 100000 100))))
```
Then we plot the coefficients over the log of `lambda`.
```clojure
(kind/vega-lite
 {:data {:values coefs-vs-lambda},
  :width 500,
  :height 500,
  :mark {:type "line"},
  :encoding {:x {:field :log-lambda, :type "quantitative"},
             :y {:field :coefficient, :type "quantitative"},
             :color {:field :predictor}}})
```
This shows that an increasing `lambda` regularizes more and more variables to zero. The plot can also be used to find important variables, namely the ones whose coefficients stay away from zero even for large `lambda`.
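As a small sketch (not part of the original chapter), we can extract those variables from `coefs-vs-lambda` directly, keeping only the predictors whose coefficient is still non-zero at the largest `lambda` we tried:

```clojure
;; Predictors whose coefficient survives the strongest regularization.
(let [max-log-lambda (apply max (map :log-lambda coefs-vs-lambda))]
  (->> coefs-vs-lambda
       (filter #(= (:log-lambda %) max-log-lambda))
       (filter #(> (Math/abs (double (:coefficient %))) 1e-6))
       (map :predictor)))
```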
22.4 :smile.regression/ordinary-least-square
name | type | default | lookup-table |
---|---|---|---|
method | enumeration | qr | |
standard-error | boolean | true | |
recursive | boolean | true | |
In this example we will explore the relationship between the body mass index (bmi) and a diabetes indicator.
First we load the data and split into train and test sets.
```clojure
(def diabetes (datasets/diabetes-ds))

(def diabetes-train (ds/head diabetes 422))

(def diabetes-test (ds/tail diabetes 20))
```
Next we create the pipeline, converting the target variable to a float value, as needed by the model.
```clojure
(def ols-pipe-fn
  (mm/pipeline
   (ds-mm/select-columns [:bmi :disease-progression])
   (mm/lift tc/convert-types :disease-progression :float32)
   (ds-mm/set-inference-target :disease-progression)
   #:metamorph{:id :model}
   (ml/model {:model-type :smile.regression/ordinary-least-square})))
```
We can then fit the model by running the pipeline in mode `:fit`.
```clojure
(def fitted (mm/fit diabetes-train ols-pipe-fn))
```
Next we run the pipe-fn in mode `:transform` and extract the prediction for the disease progression:
```clojure
(def diabetes-test-prediction
  (-> diabetes-test
      (mm/transform-pipe ols-pipe-fn fitted)
      :metamorph/data
      :disease-progression))

diabetes-test-prediction
```
```
#tech.v3.dataset.column<float64>[20]
:disease-progression
[226.0, 115.7, 163.3, 114.7, 120.8, 158.2, 236.1, 121.8, 99.57, 123.8, 204.7, 96.53, 154.2, 130.9, 83.39, 171.4, 138.0, 138.0, 189.6, 84.40]
```
The truth is available in the test dataset.
```clojure
(def diabetes-test-truth (-> diabetes-test :disease-progression))

diabetes-test-truth
```
```
#tech.v3.dataset.column<int32>[20]
:disease-progression
[233, 91, 111, 152, 120, 67, 310, 94, 183, 66, 173, 72, 49, 64, 48, 178, 104, 132, 220, 57]
```
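For a quick numeric comparison of prediction and truth (a sketch, not part of the original chapter), we can compute the root-mean-square error with plain Clojure over the two columns:

```clojure
;; RMSE between predicted and true disease progression on the test set.
(let [sq-errs (map (fn [p t]
                     (let [e (- (double p) (double t))]
                       (* e e)))
                   diabetes-test-prediction
                   diabetes-test-truth)]
  (Math/sqrt (/ (reduce + sq-errs) (count sq-errs))))
```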
The Smile Java object of the `LinearModel` is in the pipeline as well:
```clojure
(def model-instance (-> fitted :model (ml/thaw-model)))
```
This object contains all information regarding the model fit, such as the coefficients and the formula.
```clojure
(-> model-instance .coefficients seq)
```
```
(152.9188618261617 938.2378612512631)
```
```clojure
(-> model-instance .formula str)
```
```
"disease-progression ~ bmi"
```
Smile also generates a String with the result of the linear regression as part of the `toString()` method of class `LinearModel`:

```clojure
(kind/code (str model-instance))
```
```
Linear Model:

Residuals:
       Min          1Q      Median          3Q         Max
 -164.9058    -44.8263     -8.6914     47.9600    153.4435

Coefficients:
                  Estimate Std. Error    t value   Pr(>|t|)
Intercept         152.9189     3.0688    49.8299     0.0000 ***
bmi               938.2379    64.4835    14.5500     0.0000 ***
---------------------------------------------------------------------
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.0385 on 420 degrees of freedom
Multiple R-squared: 0.3351,    Adjusted R-squared: 0.3335
F-statistic: 211.7036 on 2 and 420 DF,  p-value: 3.981e-39
```
This tells us that there is a statistically significant (positive) correlation between the bmi and the diabetes disease progression in this data.
At the end we can plot the truth and the prediction on the test data, and observe the linear nature of the model.
```clojure
(kind/vega-lite
 {:layer
  [{:data {:values (map #(hash-map :disease-progression %1
                                   :bmi %2
                                   :type :truth)
                        diabetes-test-truth
                        (:bmi diabetes-test))},
    :width 500,
    :height 500,
    :mark {:type "circle"},
    :encoding {:x {:field :bmi, :type "quantitative"},
               :y {:field :disease-progression, :type "quantitative"},
               :color {:field :type}}}
   {:data {:values (map #(hash-map :disease-progression %1
                                   :bmi %2
                                   :type :prediction)
                        diabetes-test-prediction
                        (:bmi diabetes-test))},
    :width 500,
    :height 500,
    :mark {:type "line"},
    :encoding {:x {:field :bmi, :type "quantitative"},
               :y {:field :disease-progression, :type "quantitative"},
               :color {:field :type}}}]})
```
22.5 :smile.regression/random-forest
name | type | default | range |
---|---|---|---|
trees | int32 | 500 | >0 |
max-depth | int32 | 20 | >0 |
max-nodes | int32 | data-dependent (computed at train time) | >0 |
node-size | int32 | 5 | >0 |
sample-rate | float64 | 1.0 | |
22.6 :smile.regression/ridge
name | type | default | range |
---|---|---|---|
lambda | float64 | 1.0 | >0 |
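As with the other models, a hypothetical ridge fit could reuse `toy-ds` and the `ml` alias from the elastic-net sketch at the top of the chapter (the `lambda` value is illustrative only):

```clojure
;; Ridge regression with an explicit shrinkage parameter.
(def ridge-model
  (ml/train toy-ds {:model-type :smile.regression/ridge
                    :lambda 0.5}))
```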