(ns index
  (:require [scicloj.metamorph.ml :as ml]
            [scicloj.metamorph.core :as mm]
            [scicloj.metamorph.ml.toydata :as toydata]
            [tablecloth.api :as tc]
            [tech.v3.dataset :as ds]
            [tech.v3.dataset.column-filters :as ds-cf]
            [tech.v3.dataset.modelling :as ds-mod]
            [libpython-clj2.python :refer [py. py.-] :as py]))
sklearn-clj
sklearn-clj is a Clojure library which allows us to use all sklearn estimators (models and others) from Clojure. It uses libpython-clj behind the scenes, but we do not need to use the libpython-clj API. All models are available via the standard Clojure functions in metamorph.ml.
Train sklearn model with sklearn-clj
In this scenario, we will not use any sklearn or libpython-clj API directly, only metamorph.ml functions.
Use iris data
Let's first get our data, the well-known iris dataset:
(def iris
  (-> (toydata/iris-ds)
      (ds/categorical->number [:species])
      (ds-mod/set-inference-target :species)))
Register models
This require will register all sklearn models and make them available to metamorph.ml:
(require '[scicloj.sklearn-clj.ml])
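With the models registered, metamorph.ml's plain train/predict functions work with sklearn model keys, even without a pipeline. A minimal sketch (not from the original page; the model key is taken from the annex below):

;; ml/train takes a dataset with an inference target and an options map
(def dtree-model
  (ml/train iris {:model-type :sklearn.classification/decision-tree-classifier}))

;; ml/predict returns a tech.ml.dataset with the predictions
(ml/predict (tc/head iris 5) dtree-model)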
Define metamorph pipeline
All models are available by specifying a key of the form :sklearn.xxx/yyy as the model type. The available models are listed in the annex. They take the same parameters as in sklearn, just in kebab case.
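For example, sklearn's RandomForestClassifier(n_estimators=100, max_depth=3) would be specified like this (a sketch; the model key is from the annex):

(ml/model {:model-type :sklearn.classification/random-forest-classifier
           :n-estimators 100 ;; sklearn's n_estimators
           :max-depth 3})    ;; sklearn's max_depth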
We define a normal metamorph.ml pipeline, as we would do with Clojure models.
(def pipe-fn
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :sklearn.classification/logistic-regression
              :max-iter 1000
              :verbose true})))
It will use the sklearn model “sklearn.linear_model.LogisticRegression”.
Use tech.dataset as training data
We need to train the model using a tech.ml.dataset as training data. sklearn-clj will transform the data behind the scenes into a tech.v3.tensor, which libpython-clj auto-transforms into a numpy array that the model can work with.
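The same conversion chain can be reproduced by hand. A small sketch (tech.v3.tensor and py/->python are extra requires, not used elsewhere on this page):

(require '[tech.v3.tensor :as dtt])

;; numeric data as a tech.v3.tensor ...
(def small-tensor (dtt/->tensor [[1.0 2.0] [3.0 4.0]] :datatype :float64))

;; ... which libpython-clj copies over into a numpy ndarray
(py/->python small-tensor)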
(def trained-ctx (mm/fit-pipe iris pipe-fn))
trained-ctx
{:metamorph/data _unnamed [150 5] …,
 :metamorph/mode :fit,
 :model {:model-data {:model LogisticRegression(max_iter=1000, verbose=True),
                      :predict-proba? true,
                      :pickled-model #object["[S" 0x4100ae4d "[S@4100ae4d"],
                      :attributes {:n_features_in_ 4,
                                   :coef_ [[-0.42456599  0.96664261 -2.51554625 -1.08216927]
                                           [ 0.53541119 -0.32073935 -0.20740629 -0.94263206]
                                           [-0.1108452  -0.64590325  2.72295254  2.02480133]],
                                   :intercept_ [  9.85494228   2.23117432 -12.0861166 ],
                                   :n_iter_ [109],
                                   :classes_ [0. 1. 2.]}},
         :options {:model-type :sklearn.classification/logistic-regression, :max-iter 1000, :verbose true},
         :id #uuid "f10d9cf9-fe6a-4cf2-8e99-37f74071c06a",
         :feature-columns [:sepal_length :sepal_width :petal_length :petal_width],
         :target-columns [:species],
         :target-categorical-maps {:species #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {0 0, 1 1, 2 2}, :src-column :species, :result-datatype :float64}},
         :scicloj.metamorph.ml/unsupervised? nil}}
Inspect trained model
We can inspect the model object:
(def model-object
  (-> trained-ctx :model :model-data :model))
It's a libpython-clj reference to a Python object:
model-object
LogisticRegression(max_iter=1000, verbose=True)
We can use libpython-clj functions to get information out of it. We can get the model's coefficients, for example:
(py/->jvm (py.- model-object coef_))
#tech.v3.tensor<float64>[3 4]
[[-0.4246  0.9666 -2.516  -1.082 ]
 [ 0.5354 -0.3207 -0.2074 -0.9426]
 [-0.1108 -0.6459  2.723   2.025 ]]
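In the same way we could read any other fitted attribute, e.g. the intercepts (a sketch; the values should match the :attributes map shown in trained-ctx above):

(py/->jvm (py.- model-object intercept_))
;; => roughly [9.855 2.231 -12.09]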
We can as well ask for a prediction on new data:
(def simulated-new-data
  (tc/head (tc/shuffle iris) 10))
(def prediction
  (:metamorph/data
   (mm/transform-pipe
    simulated-new-data
    pipe-fn
    trained-ctx)))
We get a tech.ml.dataset with the prediction result back; sklearn-clj auto-transforms the prediction result into a tech.ml.dataset:
prediction
:_unnamed [10 4]:

|          0 |          1 |          2 | :species |
|-----------:|-----------:|-----------:|---------:|
| 0.00045021 | 0.34963497 | 0.64991483 |      2.0 |
| 0.96393942 | 0.03606051 | 0.00000007 |      0.0 |
| 0.00000008 | 0.00356624 | 0.99643367 |      2.0 |
| 0.00009945 | 0.12072613 | 0.87917442 |      2.0 |
| 0.14799720 | 0.84894630 | 0.00305651 |      1.0 |
| 0.00000003 | 0.00465871 | 0.99534126 |      2.0 |
| 0.00226244 | 0.80529023 | 0.19244733 |      1.0 |
| 0.00024314 | 0.16250271 | 0.83725414 |      2.0 |
| 0.98521363 | 0.01478635 | 0.00000001 |      0.0 |
| 0.00105987 | 0.72525793 | 0.27368221 |      1.0 |
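Since simulated-new-data still contains the true labels, we could measure how well these predictions match them. A sketch (scicloj.metamorph.ml.loss is an extra require, not used elsewhere on this page):

(require '[scicloj.metamorph.ml.loss :as loss])

;; fraction of rows where truth and prediction agree
(loss/classification-accuracy
 (:species simulated-new-data)
 (:species prediction))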
Train model with sklearn using libpython-clj directly
As an alternative approach, we can also use libpython-clj directly.
We take the following example and translate it 1:1 into Clojure using libpython-clj: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Import Python modules as Clojure vars
(py/from-import sklearn.svm SVC)
#'index/SVC
(py/from-import sklearn.preprocessing StandardScaler)
#'index/StandardScaler
(py/from-import sklearn.datasets make_classification)
#'index/make_classification
(py/from-import sklearn.model_selection train_test_split)
#'index/train_test_split
(py/from-import sklearn.pipeline Pipeline)
#'index/Pipeline
(py/import-as numpy np)
#'index/np
Define X and y from artificial data
(py/def-unpack [X y] (make_classification :random_state 0))
#'index/y
Split data into test and train
(py/def-unpack [X_train X_test y_train y_test] (train_test_split X y :random_state 0))
#'index/y_test
We get 4 vars, for example:
X_train
[[-0.11054066  0.65240858  0.49374178  1.30184623 ... -1.30819171 -1.04525337]
 [ 1.55042935 -0.35178011 -0.47003288 -0.37914756 ... -2.38076394 -0.11048941]
 [-1.61805427 -1.58249448 -1.42279491 -0.56430103 ...  1.26661394 -1.31771734]
 ...
 [-0.27567053  0.96050438 -2.28862004  1.02943883 ... -0.79347019  1.12859406]
 [ 0.01702041 -0.91017891  0.78632796  0.06326199 ...  0.42234144 -0.46359597]
 [-0.73165893 -0.87916063 -1.63880731 -0.30769128 ... -0.6054158   1.57886519]]
We define a pipeline with a standard scaler and a support vector machine model:
(def pipe (Pipeline [["scaler" (StandardScaler)]
                     ["svc" (SVC)]]))
pipe
Pipeline(steps=(('scaler', StandardScaler()), ('svc', SVC())))
Train and score the pipeline:
(py/py.. pipe
         (fit X_train y_train)
         (score X_test y_test))
0.88
Train and score the pipeline, setting a parameter as well:
(py/py.. pipe
         (set_params :svc__C 10)
         (fit X_train y_train)
         (score X_test y_test))
0.76
Train model with sklearn using libpython-clj from tech dataset
When we start with a tech.ml.dataset, like:
(-> iris tc/shuffle tc/head)
_unnamed [5 5]:

| :sepal_length | :sepal_width | :petal_length | :petal_width | :species |
|--------------:|-------------:|--------------:|-------------:|---------:|
|           6.3 |          3.3 |           4.7 |          1.6 |      1.0 |
|           4.8 |          3.4 |           1.6 |          0.2 |      0.0 |
|           6.3 |          2.3 |           4.4 |          1.3 |      1.0 |
|           5.7 |          3.8 |           1.7 |          0.3 |      0.0 |
|           6.5 |          3.0 |           5.5 |          1.8 |      2.0 |
we need to first split it into :train and :test subsets, and convert it to row vectors in (Java) array format. libpython-clj then knows how to convert these into Python (numpy) arrays.
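For example, the feature part of iris as row vectors (a sketch; the first row shown for illustration):

(-> iris ds-cf/feature tc/rows first)
;; => [5.1 3.5 1.4 0.2]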
Split into test and train:
(def train-test-split (tc/split->seq iris))
where the data looks like this:
(-> train-test-split first :train)
Group: 0 [120 5]:

| :sepal_length | :sepal_width | :petal_length | :petal_width | :species |
|--------------:|-------------:|--------------:|-------------:|---------:|
|           5.2 |          3.4 |           1.4 |          0.2 |      0.0 |
|           5.5 |          2.4 |           3.7 |          1.0 |      1.0 |
|           6.0 |          2.2 |           4.0 |          1.0 |      1.0 |
|           5.1 |          3.5 |           1.4 |          0.3 |      0.0 |
|           5.0 |          3.5 |           1.6 |          0.6 |      0.0 |
|           6.2 |          2.8 |           4.8 |          1.8 |      2.0 |
|           6.4 |          3.2 |           5.3 |          2.3 |      2.0 |
|           5.9 |          3.0 |           4.2 |          1.5 |      1.0 |
|           6.5 |          3.0 |           5.2 |          2.0 |      2.0 |
|           6.5 |          3.2 |           5.1 |          2.0 |      2.0 |
|             … |            … |             … |            … |        … |
|           6.7 |          3.3 |           5.7 |          2.5 |      2.0 |
|           5.8 |          4.0 |           1.2 |          0.2 |      0.0 |
|           5.5 |          3.5 |           1.3 |          0.2 |      0.0 |
|           5.5 |          2.3 |           4.0 |          1.3 |      1.0 |
|           4.5 |          2.3 |           1.3 |          0.3 |      0.0 |
|           4.8 |          3.4 |           1.6 |          0.2 |      0.0 |
|           5.8 |          2.7 |           3.9 |          1.2 |      1.0 |
|           7.7 |          2.8 |           6.7 |          2.0 |      2.0 |
|           5.8 |          2.7 |           5.1 |          1.9 |      2.0 |
|           5.6 |          2.9 |           3.6 |          1.3 |      1.0 |
|           4.8 |          3.1 |           1.6 |          0.2 |      0.0 |
A helper to call numpy.ravel() more easily:
(defn ravel [x]
  (py. np ravel x))
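sklearn expects y as a 1-D array, while tc/rows on the single target column yields one vector per row; ravel flattens that. A small usage sketch:

(ravel [[0.0] [1.0] [2.0]])
;; => a 1-D numpy array: [0. 1. 2.]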
Fit the pipeline on the :train subset of the tech dataset:
(def fitted-pipe
  (py/py. pipe fit
          (-> train-test-split first :train ds-cf/feature tc/rows)
          (-> train-test-split first :train ds-cf/target tc/rows ravel)))
Predict with the pipeline on the :test subset of the tech dataset:
(py/py. fitted-pipe predict
        (-> train-test-split first :test ds-cf/feature tc/rows))
[0. 1. 0. 0. 2. 2. 0. 2. 2. 1. 2. 2. 2. 2. 2. 1. 2. 1. 0. 2. 1. 2. 1. 2.
 0. 2. 0. 1. 1. 1.]
Score the model on the :test data:
(py/py. fitted-pipe score
        (-> train-test-split first :test ds-cf/feature tc/rows)
        (-> train-test-split first :test ds-cf/target tc/rows))
0.9666666666666667
Annex
List of model types of all sklearn models supported by sklearn-clj (when using sklearn 1.5.1)
(->> @ml/model-definitions* keys sort)
(:sklearn.classification/ada-boost-classifier
:sklearn.classification/bagging-classifier
:sklearn.classification/bernoulli-nb
:sklearn.classification/calibrated-classifier-cv
:sklearn.classification/categorical-nb
:sklearn.classification/complement-nb
:sklearn.classification/decision-tree-classifier
:sklearn.classification/dummy-classifier
:sklearn.classification/extra-tree-classifier
:sklearn.classification/extra-trees-classifier
:sklearn.classification/gaussian-nb
:sklearn.classification/gaussian-process-classifier
:sklearn.classification/gradient-boosting-classifier
:sklearn.classification/hist-gradient-boosting-classifier
:sklearn.classification/k-neighbors-classifier
:sklearn.classification/label-propagation
:sklearn.classification/label-spreading
:sklearn.classification/linear-discriminant-analysis
:sklearn.classification/linear-svc
:sklearn.classification/logistic-regression
:sklearn.classification/logistic-regression-cv
:sklearn.classification/mlp-classifier
:sklearn.classification/multinomial-nb
:sklearn.classification/nearest-centroid
:sklearn.classification/nu-svc
:sklearn.classification/passive-aggressive-classifier
:sklearn.classification/perceptron
:sklearn.classification/quadratic-discriminant-analysis
:sklearn.classification/radius-neighbors-classifier
:sklearn.classification/random-forest-classifier
:sklearn.classification/ridge-classifier
:sklearn.classification/ridge-classifier-cv
:sklearn.classification/sgd-classifier
:sklearn.classification/svc
:sklearn.regression/ada-boost-regressor
:sklearn.regression/ard-regression
:sklearn.regression/bagging-regressor
:sklearn.regression/bayesian-ridge
:sklearn.regression/cca
:sklearn.regression/decision-tree-regressor
:sklearn.regression/dummy-regressor
:sklearn.regression/elastic-net
:sklearn.regression/elastic-net-cv
:sklearn.regression/extra-tree-regressor
:sklearn.regression/extra-trees-regressor
:sklearn.regression/gamma-regressor
:sklearn.regression/gaussian-process-regressor
:sklearn.regression/gradient-boosting-regressor
:sklearn.regression/hist-gradient-boosting-regressor
:sklearn.regression/huber-regressor
:sklearn.regression/isotonic-regression
:sklearn.regression/k-neighbors-regressor
:sklearn.regression/kernel-ridge
:sklearn.regression/lars
:sklearn.regression/lars-cv
:sklearn.regression/lasso
:sklearn.regression/lasso-cv
:sklearn.regression/lasso-lars
:sklearn.regression/lasso-lars-cv
:sklearn.regression/lasso-lars-ic
:sklearn.regression/linear-regression
:sklearn.regression/linear-svr
:sklearn.regression/mlp-regressor
:sklearn.regression/multi-task-elastic-net
:sklearn.regression/multi-task-elastic-net-cv
:sklearn.regression/multi-task-lasso
:sklearn.regression/multi-task-lasso-cv
:sklearn.regression/nu-svr
:sklearn.regression/orthogonal-matching-pursuit
:sklearn.regression/orthogonal-matching-pursuit-cv
:sklearn.regression/passive-aggressive-regressor
:sklearn.regression/pls-canonical
:sklearn.regression/pls-regression
:sklearn.regression/poisson-regressor
:sklearn.regression/quantile-regressor
:sklearn.regression/radius-neighbors-regressor
:sklearn.regression/random-forest-regressor
:sklearn.regression/ransac-regressor
:sklearn.regression/ridge
:sklearn.regression/ridge-cv
:sklearn.regression/sgd-regressor
:sklearn.regression/svr
:sklearn.regression/theil-sen-regressor
:sklearn.regression/transformed-target-regressor
:sklearn.regression/tweedie-regressor)
source: projects/ml/sklearn-clj/notebooks/index.clj