(ns index)
(require '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.core :as mm]
         '[scicloj.metamorph.ml.toydata :as toydata]
         '[tablecloth.api :as tc]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column-filters :as ds-cf]
         '[tech.v3.dataset.modelling :as ds-mod]
         '[libpython-clj2.python :refer [py. py.-] :as py])

sklearn-clj

sklearn-clj is a Clojure library that makes all sklearn estimators (models and others) usable from Clojure. It uses libpython-clj behind the scenes, but we do not need to use the libpython-clj API ourselves. All models are available via the standard Clojure functions in metamorph.ml.

Train sklearn model with sklearn-clj

In this scenario, we will not use any sklearn or libpython-clj API, only metamorph.ml functions.

Use iris data

Let's first get our data, the well-known iris dataset:

(def iris
  (-> (toydata/iris-ds)
      (ds-mod/set-inference-target :species)
      (ds/categorical->number [:species])))

Register models

This require registers all sklearn models and makes them available to metamorph.ml:

(require '[scicloj.sklearn-clj.ml])

Define metamorph pipeline

All models are available by specifying a key of the form :sklearn.xxx.yyy as the model type. The available models are listed in the annex. They take the same parameters as in sklearn, just written in kebab-case.
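To make the naming convention concrete, here is a minimal sketch of how kebab-case option keys line up with sklearn's snake_case parameter names. The helper kebab->snake is hypothetical and written only for illustration; it is not taken from sklearn-clj, whose internal conversion may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, for illustration only: it is not part of sklearn-clj.
(defn kebab->snake
  "Turns a kebab-case keyword such as :max-iter into the snake_case
  string \"max_iter\"."
  [k]
  (str/replace (name k) "-" "_"))

;; The option keys of a model spec, minus :model-type,
;; mapped to the names the Python estimator expects.
(def options {:model-type :sklearn.classification/logistic-regression
              :max-iter 1000
              :verbose true})

(def python-kwargs
  (into {}
        (map (fn [[k v]] [(kebab->snake k) v]))
        (dissoc options :model-type)))

python-kwargs
;; => {"max_iter" 1000, "verbose" true}
```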

We define a normal metamorph.ml pipeline, as we would do with Clojure models.

(def pipe-fn
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :sklearn.classification/logistic-regression
              :max-iter 1000
              :verbose true})))

It will use the sklearn model “sklearn.linear_model.LogisticRegression”.

Use tech.dataset as training data

We need to train the model using a tech.ml.dataset as training data. Behind the scenes, sklearn-clj transforms the data to a tech.v3.tensor, which libpython-clj auto-transforms to a numpy array that the model can work with.

(def trained-ctx (mm/fit-pipe iris pipe-fn))
trained-ctx

{

:metamorph/data

_unnamed [150 5]:

:sepal_length :sepal_width :petal_length :petal_width :species
5.1 3.5 1.4 0.2 0.0
4.9 3.0 1.4 0.2 0.0
4.7 3.2 1.3 0.2 0.0
4.6 3.1 1.5 0.2 0.0
5.0 3.6 1.4 0.2 0.0
5.4 3.9 1.7 0.4 0.0
4.6 3.4 1.4 0.3 0.0
5.0 3.4 1.5 0.2 0.0
4.4 2.9 1.4 0.2 0.0
4.9 3.1 1.5 0.1 0.0
... ... ... ... ...
6.9 3.1 5.4 2.1 2.0
6.7 3.1 5.6 2.4 2.0
6.9 3.1 5.1 2.3 2.0
5.8 2.7 5.1 1.9 2.0
6.8 3.2 5.9 2.3 2.0
6.7 3.3 5.7 2.5 2.0
6.7 3.0 5.2 2.3 2.0
6.3 2.5 5.0 1.9 2.0
6.5 3.0 5.2 2.0 2.0
6.2 3.4 5.4 2.3 2.0
5.9 3.0 5.1 1.8 2.0
:metamorph/mode :fit
:model {:model-data {:model LogisticRegression(max_iter=1000, verbose=True), :predict-proba? true, :pickled-model #object["[S" 0x4100ae4d "[S@4100ae4d"], :attributes {:n_features_in_ 4, :coef_ [[-0.42456599  0.96664261 -2.51554625 -1.08216927]
 [ 0.53541119 -0.32073935 -0.20740629 -0.94263206]
 [-0.1108452  -0.64590325  2.72295254  2.02480133]], :intercept_ [  9.85494228   2.23117432 -12.0861166 ], :n_iter_ [109], :classes_ [0. 1. 2.]}}, :options {:model-type :sklearn.classification/logistic-regression, :max-iter 1000, :verbose true}, :id #uuid "f10d9cf9-fe6a-4cf2-8e99-37f74071c06a", :feature-columns [:sepal_length :sepal_width :petal_length :petal_width], :target-columns [:species], :target-categorical-maps {:species #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {0 0, 1 1, 2 2}, :src-column :species, :result-datatype :float64}}, :scicloj.metamorph.ml/unsupervised? nil}

}

Inspect trained model

We can inspect the model object:

(def model-object
  (-> trained-ctx :model :model-data :model))

It’s a libpython-clj reference to a Python object:

model-object
LogisticRegression(max_iter=1000, verbose=True)

We can use libpython-clj functions to get information out of it. For example, we can get the model's coefficients:

(py/->jvm
  (py.- model-object coef_))
#tech.v3.tensor<float64>[3 4]
[[-0.4246  0.9666  -2.516  -1.082]
 [ 0.5354 -0.3207 -0.2074 -0.9426]
 [-0.1108 -0.6459   2.723   2.025]]

We can also ask for predictions on new data:

(def simulated-new-data
  (tc/head (tc/shuffle iris) 10))

(def prediction
  (:metamorph/data
   (mm/transform-pipe
    simulated-new-data
    pipe-fn
    trained-ctx)))

We get the prediction result back as a tech.ml.dataset; sklearn-clj converts the model's output automatically.

prediction

:_unnamed [10 4]:

0 1 2 :species
0.00045021 0.34963497 0.64991483 2.0
0.96393942 0.03606051 0.00000007 0.0
0.00000008 0.00356624 0.99643367 2.0
0.00009945 0.12072613 0.87917442 2.0
0.14799720 0.84894630 0.00305651 1.0
0.00000003 0.00465871 0.99534126 2.0
0.00226244 0.80529023 0.19244733 1.0
0.00024314 0.16250271 0.83725414 2.0
0.98521363 0.01478635 0.00000001 0.0
0.00105987 0.72525793 0.27368221 1.0
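In the rows shown above, the :species value coincides with the column holding the highest class probability. A small pure-Clojure sketch checks this on the first three rows, copied by hand from the output; argmax is a hypothetical helper, not a metamorph.ml function.

```clojure
;; First three rows of the prediction output shown above:
;; class probabilities in column order 0, 1, 2.
(def probability-rows
  [[0.00045021 0.34963497 0.64991483]
   [0.96393942 0.03606051 0.00000007]
   [0.00000008 0.00356624 0.99643367]])

;; Hypothetical helper, for illustration only.
(defn argmax
  "Returns the index of the largest value in row."
  [row]
  (first (apply max-key second (map-indexed vector row))))

(mapv argmax probability-rows)
;; => [2 0 2]
```

These indices match the :species values 2.0, 0.0 and 2.0 of the same rows.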

Train model with sklearn using libpython-clj directly

As an alternative approach, we can also use libpython-clj directly.

I take the following example and translate it 1:1 into Clojure using libpython-clj: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Import python modules as Clojure vars

(py/from-import sklearn.svm SVC)
#'index/SVC
(py/from-import sklearn.preprocessing StandardScaler)
#'index/StandardScaler
(py/from-import sklearn.datasets make_classification)
#'index/make_classification
(py/from-import sklearn.model_selection train_test_split)
#'index/train_test_split
(py/from-import sklearn.pipeline Pipeline)
#'index/Pipeline
(py/import-as numpy np)
#'index/np

Define X and y from artificial data

(py/def-unpack [X y] (make_classification :random_state 0))
#'index/y

Split data into test and train

(py/def-unpack [X_train X_test y_train y_test]
               (train_test_split X y :random_state 0))
#'index/y_test

We get four vars, for example:

X_train
[[-0.65240858  0.49374178  1.30184623 ... -1.30819171 -1.04525337
  -0.11054066]
 [ 0.35178011 -0.47003288 -0.37914756 ... -2.38076394 -0.11048941
  -1.55042935]
 [-1.58249448 -1.42279491 -0.56430103 ...  1.26661394 -1.31771734
   1.61805427]
 ...
 [-0.96050438 -2.28862004  1.02943883 ... -0.79347019  1.12859406
  -0.27567053]
 [ 0.91017891  0.78632796  0.06326199 ...  0.42234144 -0.46359597
  -0.01702041]
 [-0.87916063 -1.63880731 -0.30769128 ... -0.6054158   1.57886519
   0.73165893]]

We define a pipeline with a standard scaler and a support vector machine model:

(def pipe (Pipeline [["scaler" (StandardScaler)]
                     ["svc" (SVC)]]))
pipe
Pipeline(steps=(('scaler', StandardScaler()), ('svc', SVC())))

Train and score the pipeline:

(py/py.. pipe
         (fit X_train y_train)
         (score X_test y_test))
0.88

Set a parameter, then train and score the pipeline:

(py/py.. pipe
         (set_params :svc__C 10)
         (fit X_train y_train)
         (score X_test y_test))
0.76
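The :svc__C keyword follows scikit-learn's convention for addressing a parameter of a named pipeline step: the step name, two underscores, then the parameter name. As a minimal sketch, such keywords can be built with a small helper; step-param is hypothetical and not part of libpython-clj or sklearn-clj.

```clojure
;; Hypothetical helper, for illustration only.
(defn step-param
  "Builds a keyword like :svc__C from a pipeline step name and a
  parameter name, joined by two underscores."
  [step param]
  (keyword (str step "__" param)))

(step-param "svc" "C")
;; => :svc__C
```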

Train model with sklearn using libpython-clj from tech dataset

When we start with a tech.dataset, such as

(-> iris tc/shuffle tc/head)

_unnamed [5 5]:

:sepal_length :sepal_width :petal_length :petal_width :species
6.3 3.3 4.7 1.6 1.0
4.8 3.4 1.6 0.2 0.0
6.3 2.3 4.4 1.3 1.0
5.7 3.8 1.7 0.3 0.0
6.5 3.0 5.5 1.8 2.0

we first need to split it into :train and :test and convert it to row vectors in (Java) array format. libpython-clj then knows how to convert these into Python (numpy) arrays.

Split into test and train:

(def train-test-split (tc/split->seq iris))

where data looks like this:

(-> train-test-split first :train)

Group: 0 [120 5]:

:sepal_length :sepal_width :petal_length :petal_width :species
5.2 3.4 1.4 0.2 0.0
5.5 2.4 3.7 1.0 1.0
6.0 2.2 4.0 1.0 1.0
5.1 3.5 1.4 0.3 0.0
5.0 3.5 1.6 0.6 0.0
6.2 2.8 4.8 1.8 2.0
6.4 3.2 5.3 2.3 2.0
5.9 3.0 4.2 1.5 1.0
6.5 3.0 5.2 2.0 2.0
6.5 3.2 5.1 2.0 2.0
6.7 3.3 5.7 2.5 2.0
5.8 4.0 1.2 0.2 0.0
5.5 3.5 1.3 0.2 0.0
5.5 2.3 4.0 1.3 1.0
4.5 2.3 1.3 0.3 0.0
4.8 3.4 1.6 0.2 0.0
5.8 2.7 3.9 1.2 1.0
7.7 2.8 6.7 2.0 2.0
5.8 2.7 5.1 1.9 2.0
5.6 2.9 3.6 1.3 1.0
4.8 3.1 1.6 0.2 0.0

A helper that makes numpy.ravel() easier to call:

(defn ravel [x]
  (py. np ravel x))
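The reason for flattening: the rows of a single-column target dataset form one single-element row per observation, which is a two-dimensional shape, while sklearn estimators expect a one-dimensional array of labels. The following pure-Clojure sketch shows the same reshaping on hand-written data; ravel-rows is a hypothetical stand-in for numpy.ravel and is not used in the notebook.

```clojure
;; Hand-written stand-in for the rows of a single-column target dataset:
;; one single-element row per observation (shape [4 1]).
(def target-rows [[0.0] [1.0] [1.0] [2.0]])

;; Hypothetical pure-Clojure analogue of numpy.ravel for nested rows.
(defn ravel-rows
  "Concatenates all rows into one flat vector."
  [rows]
  (into [] cat rows))

(ravel-rows target-rows)
;; => [0.0 1.0 1.0 2.0]
```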

Fit the pipeline to the :train subset of the tech dataset:

(def fitted-pipe
  (py/py. pipe
          fit
          (-> train-test-split first :train ds-cf/feature tc/rows)
          (-> train-test-split first :train ds-cf/target tc/rows ravel)))

Predict with the fitted pipeline on the :test subset of the tech dataset:

(py/py. fitted-pipe predict
         (-> train-test-split first :test ds-cf/feature tc/rows))
[0. 1. 0. 0. 2. 2. 0. 2. 2. 1. 2. 2. 2. 2. 2. 1. 2. 1. 0. 2. 1. 2. 1. 2.
 0. 2. 0. 1. 1. 1.]

Score the model on the :test data:

(py/py. fitted-pipe score
        (-> train-test-split first :test ds-cf/feature tc/rows)
        (-> train-test-split first :test ds-cf/target tc/rows))
0.9666666666666667
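For a classifier, sklearn's score method returns the mean accuracy, that is, the fraction of correctly predicted labels. With 30 test rows, the value above corresponds to 29 correct predictions. A minimal pure-Clojure sketch of that calculation follows; the labels are made up for illustration and are not the notebook's actual test split.

```clojure
(defn accuracy
  "Fraction of positions where predicted equals actual, as a double."
  [predicted actual]
  (/ (count (filter true? (map = predicted actual)))
     (double (count actual))))

;; Made-up labels for illustration: 30 observations.
(def actual (vec (take 30 (cycle [0.0 1.0 2.0]))))

;; Same labels with one deliberate error (the actual value at index 4 is 1.0).
(def predicted (assoc actual 4 2.0))

(accuracy predicted actual)
;; 29 correct out of 30, i.e. (/ 29 30.0)
```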

Annex

The model types of all sklearn models supported by sklearn-clj (when using sklearn 1.5.1):

(->> @ml/model-definitions* keys sort)
(:sklearn.classification/ada-boost-classifier
 :sklearn.classification/bagging-classifier
 :sklearn.classification/bernoulli-nb
 :sklearn.classification/calibrated-classifier-cv
 :sklearn.classification/categorical-nb
 :sklearn.classification/complement-nb
 :sklearn.classification/decision-tree-classifier
 :sklearn.classification/dummy-classifier
 :sklearn.classification/extra-tree-classifier
 :sklearn.classification/extra-trees-classifier
 :sklearn.classification/gaussian-nb
 :sklearn.classification/gaussian-process-classifier
 :sklearn.classification/gradient-boosting-classifier
 :sklearn.classification/hist-gradient-boosting-classifier
 :sklearn.classification/k-neighbors-classifier
 :sklearn.classification/label-propagation
 :sklearn.classification/label-spreading
 :sklearn.classification/linear-discriminant-analysis
 :sklearn.classification/linear-svc
 :sklearn.classification/logistic-regression
 :sklearn.classification/logistic-regression-cv
 :sklearn.classification/mlp-classifier
 :sklearn.classification/multinomial-nb
 :sklearn.classification/nearest-centroid
 :sklearn.classification/nu-svc
 :sklearn.classification/passive-aggressive-classifier
 :sklearn.classification/perceptron
 :sklearn.classification/quadratic-discriminant-analysis
 :sklearn.classification/radius-neighbors-classifier
 :sklearn.classification/random-forest-classifier
 :sklearn.classification/ridge-classifier
 :sklearn.classification/ridge-classifier-cv
 :sklearn.classification/sgd-classifier
 :sklearn.classification/svc
 :sklearn.regression/ada-boost-regressor
 :sklearn.regression/ard-regression
 :sklearn.regression/bagging-regressor
 :sklearn.regression/bayesian-ridge
 :sklearn.regression/cca
 :sklearn.regression/decision-tree-regressor
 :sklearn.regression/dummy-regressor
 :sklearn.regression/elastic-net
 :sklearn.regression/elastic-net-cv
 :sklearn.regression/extra-tree-regressor
 :sklearn.regression/extra-trees-regressor
 :sklearn.regression/gamma-regressor
 :sklearn.regression/gaussian-process-regressor
 :sklearn.regression/gradient-boosting-regressor
 :sklearn.regression/hist-gradient-boosting-regressor
 :sklearn.regression/huber-regressor
 :sklearn.regression/isotonic-regression
 :sklearn.regression/k-neighbors-regressor
 :sklearn.regression/kernel-ridge
 :sklearn.regression/lars
 :sklearn.regression/lars-cv
 :sklearn.regression/lasso
 :sklearn.regression/lasso-cv
 :sklearn.regression/lasso-lars
 :sklearn.regression/lasso-lars-cv
 :sklearn.regression/lasso-lars-ic
 :sklearn.regression/linear-regression
 :sklearn.regression/linear-svr
 :sklearn.regression/mlp-regressor
 :sklearn.regression/multi-task-elastic-net
 :sklearn.regression/multi-task-elastic-net-cv
 :sklearn.regression/multi-task-lasso
 :sklearn.regression/multi-task-lasso-cv
 :sklearn.regression/nu-svr
 :sklearn.regression/orthogonal-matching-pursuit
 :sklearn.regression/orthogonal-matching-pursuit-cv
 :sklearn.regression/passive-aggressive-regressor
 :sklearn.regression/pls-canonical
 :sklearn.regression/pls-regression
 :sklearn.regression/poisson-regressor
 :sklearn.regression/quantile-regressor
 :sklearn.regression/radius-neighbors-regressor
 :sklearn.regression/random-forest-regressor
 :sklearn.regression/ransac-regressor
 :sklearn.regression/ridge
 :sklearn.regression/ridge-cv
 :sklearn.regression/sgd-regressor
 :sklearn.regression/svr
 :sklearn.regression/theil-sen-regressor
 :sklearn.regression/transformed-target-regressor
 :sklearn.regression/tweedie-regressor)
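Because the model types are namespaced keywords, the list can be narrowed to one family with plain Clojure functions. The sketch below uses a small hand-picked sample of the keys above; in the notebook the full list would come from (keys @ml/model-definitions*). The helper models-in is hypothetical.

```clojure
;; A small sample of the keys listed above.
(def model-keys
  [:sklearn.classification/logistic-regression
   :sklearn.classification/svc
   :sklearn.regression/lasso
   :sklearn.regression/ridge
   :sklearn.regression/svr])

;; Hypothetical helper, for illustration only.
(defn models-in
  "Keeps only the keys whose namespace part equals ns-str."
  [ns-str ks]
  (filterv #(= ns-str (namespace %)) ks))

(models-in "sklearn.regression" model-keys)
;; => [:sklearn.regression/lasso :sklearn.regression/ridge :sklearn.regression/svr]
```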
source: projects/ml/sklearn-clj/notebooks/index.clj