27 Transformer reference - DRAFT
(ns noj-book.transformer-references
  (:require
   [scicloj.kindly.v4.api :as kindly]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.preprocessing :as preprocessing]
   [scicloj.ml.smile.classification]
   [scicloj.ml.smile.metamorph :as smile-mm]
   [scicloj.ml.smile.nlp :as nlp]
   [scicloj.ml.smile.projections :as projections]
   [tablecloth.api :as tc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.categorical :as ds-cat]
   [tech.v3.dataset.metamorph :as ds-mm]
   [tech.v3.dataset.modelling :as ds-mod]
   [tech.v3.dataset.print]))
27.1 Transformer count-vectorize
Clojure doc:
Converts the text column text-col to a bag-of-words representation in the form of a frequency-count map. The default text->bow function is default-text->bow. All options are passed to it.
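As a rough illustration of what such a text->bow function computes, here is a minimal Python sketch that lowercases, tokenizes, and counts. This is an illustration only: the actual default normalization also stems tokens and keeps punctuation tokens (as the output below shows), which this sketch omits.

```python
from collections import Counter
import re

def text_to_bow(text):
    # Lowercase, tokenize on word characters, and count occurrences.
    # (The real default-text->bow also stems tokens and keeps
    # punctuation tokens such as "," and "!", omitted here.)
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))

print(text_to_bow("Hello Clojure world, hello ML word !"))
# {'hello': 2, 'clojure': 1, 'world': 1, 'ml': 1, 'word': 1}
```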
In the following, we transform the text given in a dataset into a map of token counts, applying some default text normalization.
(def data (ds/->dataset {:text ["Hello Clojure world, hello ML word !"
                                "ML with Clojure is fun"]}))
^kind/dataset data
_unnamed [2 1]:
:text |
---|
Hello Clojure world, hello ML word ! |
ML with Clojure is fun |
(def fitted-ctx
  (mm/fit data (scicloj.ml.smile.metamorph/count-vectorize :text :bow)))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} |
(def bow-ds
  (:metamorph/data fitted-ctx))
^kind/dataset bow-ds
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} |
A custom tokenizer can be specified, either by passing options to scicloj.ml.smile.nlp/default-tokenize
(def fitted-ctx
  (mm/fit
   data
   (scicloj.ml.smile.metamorph/count-vectorize :text :bow {:stopwords ["clojure"]
                                                           :stemmer :none})))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, is 1, fun 1} |
or by passing in an implementation of a tokenizer function:
(def fitted-ctx
  (mm/fit
   data
   (scicloj.ml.smile.metamorph/count-vectorize :text :bow
                                               {:text->bow-fn (fn [text options]
                                                                {:a 1 :b 2})})))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {:a 1, :b 2} |
ML with Clojure is fun | {:a 1, :b 2} |
27.2 Transformer bow->SparseArray
Clojure doc:
Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the discrete naive Bayes model. Options can be:

create-vocab-fn: A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all

The sparse data is represented as smile.util.SparseArray.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
Now we convert the bag-of-words map to a sparse array of class smile.util.SparseArray:
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse)))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("clojur" "!" "word" "hello" "is" "fun" "ml" "," "with" "world"), :vocab->index-map {"clojur" 0, "!" 1, "word" 2, "hello" 3, "is" 4, "fun" 5, "ml" 6, "," 7, "with" 8, "world" 9}, :index->vocab-map {0 "clojur", 7 ",", 1 "!", 4 "is", 6 "ml", 3 "hello", 2 "word", 9 "world", 5 "fun", 8 "with"}}}
^kind/dataset
(:metamorph/data ctx-sparse)
_unnamed [2 3]:
:text | :bow | :sparse |
---|---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} | [3:2, 0:1, 9:1, 7:1, 6:1, 2:1, 1:1] |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} | [6:1, 8:1, 0:1, 4:1, 5:1] |
The SparseArray instances look like this:
(zipmap
 (:text bow-ds)
 (map seq
      (-> ctx-sparse :metamorph/data :sparse)))
{"Hello Clojure world, hello ML word !"
 (#object[smile.util.SparseArray$Entry 0x42c18b70 "3:2"]
  #object[smile.util.SparseArray$Entry 0x366265df "0:1"]
  #object[smile.util.SparseArray$Entry 0x1ae976be "9:1"]
  #object[smile.util.SparseArray$Entry 0x204eed6b "7:1"]
  #object[smile.util.SparseArray$Entry 0x704d9414 "6:1"]
  #object[smile.util.SparseArray$Entry 0x55a44b5f "2:1"]
  #object[smile.util.SparseArray$Entry 0x4298bd3 "1:1"]),
 "ML with Clojure is fun"
 (#object[smile.util.SparseArray$Entry 0x46049b65 "6:1"]
  #object[smile.util.SparseArray$Entry 0x69c82cd3 "8:1"]
  #object[smile.util.SparseArray$Entry 0x5886c76a "0:1"]
  #object[smile.util.SparseArray$Entry 0x70162606 "4:1"]
  #object[smile.util.SparseArray$Entry 0x51b580e3 "5:1"])}
27.3 Transformer bow->sparse-array
Clojure doc:
Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the Maxent model. Options can be:

create-vocab-fn: A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all

The sparse data is represented as primitive int arrays, whose entries are the indices of the present tokens in the vocabulary.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
Now we convert the bag-of-words map to a sparse array, here a Java primitive int array:
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->sparse-array :bow :sparse)))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("clojur" "!" "word" "hello" "is" "fun" "ml" "," "with" "world"), :vocab->index-map {"clojur" 0, "!" 1, "word" 2, "hello" 3, "is" 4, "fun" 5, "ml" 6, "," 7, "with" 8, "world" 9}, :index->vocab-map {0 "clojur", 7 ",", 1 "!", 4 "is", 6 "ml", 3 "hello", 2 "word", 9 "world", 5 "fun", 8 "with"}}}
We also see the sparse representation: the indices into the vocabulary of the tokens with non-zero counts.
(zipmap
 (:text bow-ds)
 (map seq
      (-> ctx-sparse :metamorph/data :sparse)))
{"Hello Clojure world, hello ML word !" (0 1 2 3 6 7 9),
 "ML with Clojure is fun" (0 4 5 6 8)}
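The conversion itself is simple to state: look up each token of the bow map in the vocabulary's index map and collect the sorted indices of the tokens with non-zero counts. A small Python sketch of that idea (plain Python, not the Smile implementation), using the vocabulary shown above:

```python
def bow_to_sparse_indices(bow, vocab_to_index):
    # Keep only tokens present in the vocabulary and return the
    # sorted vocabulary indices of the tokens with non-zero counts.
    return sorted(vocab_to_index[tok] for tok in bow if tok in vocab_to_index)

vocab_to_index = {"clojur": 0, "!": 1, "word": 2, "hello": 3, "is": 4,
                  "fun": 5, "ml": 6, ",": 7, "with": 8, "world": 9}
bow = {"ml": 1, "with": 1, "clojur": 1, "is": 1, "fun": 1}
print(bow_to_sparse_indices(bow, vocab_to_index))  # [0, 4, 5, 6, 8]
```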
In both ->sparse functions we can control the vocabulary via an option: we can pass in a different / custom function which creates the vocabulary from the bow maps.
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse
                              {:create-vocab-fn
                               (fn [bow] (nlp/->vocabulary-top-n bow 1))})))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("ml"), :vocab->index-map {"ml" 0}, :index->vocab-map {0 "ml"}}}
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse
                              {:create-vocab-fn
                               (fn [_]
                                 ["hello" "fun"])})))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ["hello" "fun"], :vocab->index-map {"hello" 0, "fun" 1}, :index->vocab-map {0 "hello", 1 "fun"}}}
27.4 Transformer bow->tfidf
Clojure doc:
Calculates the tf-idf score from bags of words (as token frequency maps) in column bow-column and stores them in a new column tfidf-column as maps of token->tfidf-score.

It calculates a global term-frequency map in :fit and reuses it in :transform.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | none |
Here we calculate the tf-idf score from the bag of words:
^kind/dataset
(mm/pipe-it
 bow-ds
 (smile-mm/bow->tfidf :bow :tfidf {}))
_unnamed [2 3]:

:text | :bow | :tfidf |
---|---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} | {clojur 1.0, ! 1.4054651081081644, word 1.4054651081081644, hello 2.8109302162163288, ml 1.0, , 1.4054651081081644, world 1.4054651081081644} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} | {clojur 1.0, is 1.4054651081081644, fun 1.4054651081081644, ml 1.0, with 1.4054651081081644} |
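The printed scores are consistent with a smoothed tf-idf of the form tf * (1 + ln((1 + n) / (1 + df))), where n is the number of documents and df is the number of documents containing the token. Note this formula is inferred from the output above, not taken from the Smile sources; a quick Python check under that assumption:

```python
import math

def tfidf(tf, df, n_docs):
    # Smoothed idf; the formula is inferred from the printed scores,
    # not taken from the Smile source code.
    return tf * (1.0 + math.log((1.0 + n_docs) / (1.0 + df)))

# "clojur" occurs in both of the two documents -> the idf term vanishes:
print(tfidf(1, 2, 2))  # 1.0
# "hello" occurs twice in one of the two documents:
print(tfidf(2, 1, 2))  # 2.8109302162163288
```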
27.5 Transformer model
Clojure doc:
Executes a machine learning model in train/predict (depending on :mode) from the metamorph.ml
model registry.
The model is passed between both invocations via the shared context ctx, under a key (a step identifier) which is passed in as :metamorph/id and guaranteed to be unique for each pipeline step. The function writes and reads into this common context key.
Options:

:model-type - Keyword for the model to use

Further options get passed to the train functions and are model specific.
See here for an overview of the models built into scicloj.ml:
https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html
Other libraries might contribute other models, which are documented as part of the library.
metamorph | . |
---|---|
Behaviour in mode :fit | Calls scicloj.metamorph.ml/train using data in :metamorph/data and options and stores trained model in ctx under key in :metamorph/id |
Behaviour in mode :transform | Reads trained model from ctx and calls scicloj.metamorph.ml/predict with the model in $id and data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use for prediction from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id and writes feature-ds and target-ds before prediction into ctx at :scicloj.metamorph.ml/feature-ds /:scicloj.metamorph.ml/target-ds |
See as well:
scicloj.metamorph.ml/train
scicloj.metamorph.ml/predict
The model transformer allows executing any machine learning model which registers itself inside the metamorph.ml system via the function scicloj.metamorph.ml/define-model!. The built-in models are listed here: https://scicloj.github.io/scicloj.ml/userguide-models.html
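The fit/transform plumbing can be pictured as a context map threaded through the pipeline steps. Below is a minimal Python sketch of the idea, with illustrative names only (this is not the metamorph API): a step trains and stores its model under its step id in fit mode, and reads it back to predict in transform mode.

```python
def model_step(step_id, train_fn, predict_fn):
    # A pipeline step closing over a step id. In "fit" mode it trains
    # and stores the model in the ctx under its id; in "transform"
    # mode it reads the model back and predicts with it.
    def step(ctx):
        ctx = dict(ctx)  # do not mutate the caller's context
        if ctx["mode"] == "fit":
            ctx[step_id] = train_fn(ctx["data"])
        else:
            ctx["data"] = predict_fn(ctx[step_id], ctx["data"])
        return ctx
    return step

# Toy "model": memorize the mean of the training data, predict it everywhere.
step = model_step("model",
                  train_fn=lambda xs: sum(xs) / len(xs),
                  predict_fn=lambda m, xs: [m for _ in xs])

fitted = step({"mode": "fit", "data": [1.0, 2.0, 3.0]})
out = step({**fitted, "mode": "transform", "data": [10, 20]})
print(out["data"])  # [2.0, 2.0]
```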
We use the Iris data for this example:
(def iris
  (->
   (ds/->dataset "https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv" {:key-fn keyword})
   (tech.v3.dataset.print/print-range 5)))
^kind/dataset iris
https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [150 5]:
:sepal_length | :sepal_width | :petal_length | :petal_width | :species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
β¦ | β¦ | β¦ | β¦ | β¦ |
6.5 | 3.0 | 5.2 | 2.0 | virginica |
6.2 | 3.4 | 5.4 | 2.3 | virginica |
5.9 | 3.0 | 5.1 | 1.8 | virginica |
(def train-test
  (ds-mod/train-test-split iris))
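train-test-split partitions the dataset rows at random; the sizes printed further down (105 training and 45 test rows out of 150) correspond to a 70/30 split. A minimal Python sketch of such a split (the helper name and default fraction here are illustrative, not the tech.v3 API):

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    # Shuffle row indices and cut them into a train and a test part.
    # Hypothetical helper mirroring the idea of ds-mod/train-test-split.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(len(rows) * train_fraction)
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

train, test = train_test_split(list(range(150)))
print(len(train), len(test))  # 105 45
```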
The pipeline consists of specifying the inference target, transforming the target to categorical, and the model function:
(def pipe-fn
  (mm/pipeline
   (mm/lift ds-mod/set-inference-target :species)
   (mm/lift ds/categorical->number [:species])
   {:metamorph/id :model}
   (ml/model {:model-type :smile.classification/logistic-regression})))
First we run the training:
(def fitted-ctx
  (mm/fit
   (:train-ds train-test)
   pipe-fn))

(dissoc-in fitted-ctx [:model :model-data])
{:metamorph/data https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [105 5]
 :metamorph/mode :fit
 :model {:options {:model-type :smile.classification/logistic-regression}, :id #uuid "22ec5e61-a9a5-4d9b-aa25-8243dce49129", :feature-columns [:sepal_length :sepal_width :petal_length :petal_width], :target-columns [:species], :target-datatypes {:species :int64}, :target-categorical-maps {:species #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {"virginica" 0, "setosa" 1, "versicolor" 2}, :src-column :species, :result-datatype :int64}}, :scicloj.metamorph.ml/unsupervised? nil}}
and then the prediction on the test data:
(def transformed-ctx
  (mm/transform-pipe (:test-ds train-test) pipe-fn fitted-ctx))

(-> transformed-ctx
    (dissoc-in [:model :model-data])
    (update-in [:metamorph/data] #(tech.v3.dataset.print/print-range % 5)))
{:metamorph/data :_unnamed [45 4]
 :metamorph/mode :transform
 :model {…}}
and we get the predictions:
^kind/dataset
(-> transformed-ctx
    :metamorph/data
    (ds-cat/reverse-map-categorical-xforms)
    (ds/select-columns [:species])
    (ds/head))
:_unnamed [5 1]:
:species |
---|
setosa |
virginica |
virginica |
versicolor |
setosa |
27.6 Transformer std-scale
Clojure doc:
Metamorph transformer, which centers and scales the dataset per column.

columns-selector: tablecloth columns-selector to choose the columns to work on
meta-field: tablecloth meta-field working with columns-selector
options are the options for the scaler and can take:

mean? If true (default), the data gets shifted by the column means, so it is 0-centered
stddev? If true (default), the data gets scaled by the standard deviation of the column
metamorph | . |
---|---|
Behaviour in mode :fit | Centers and scales the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained std-scale model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id |
We can use the std-scale transformer to center and scale data. Let's take some example data:
(def data
  (tc/dataset [[100 0.001]
               [8 0.05]
               [50 0.005]
               [88 0.07]
               [4 0.1]]
              {:layout :as-row}))
^kind/dataset data
:_unnamed [5 2]:
0 | 1 |
---|---|
100 | 0.001 |
8 | 0.050 |
50 | 0.005 |
88 | 0.070 |
4 | 0.100 |
Now we can center each column around 0 and scale it by the standard deviation of the column:
^kind/dataset
(mm/pipe-it
 data
 (preprocessing/std-scale [0 1] {}))
:_unnamed [5 2]:
0 | 1 |
---|---|
1.13053908 | -1.04102352 |
-0.94965283 | 0.11305233 |
0.00000000 | -0.94681324 |
0.85920970 | 0.58410369 |
-1.04009595 | 1.29068074 |
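The numbers above can be reproduced by centering each column on its mean and dividing by the sample standard deviation (ddof = 1). A quick Python check of that arithmetic (plain Python, not the preprocessing API):

```python
import math

def std_scale(column):
    # Center by the column mean and scale by the sample standard
    # deviation (ddof = 1); this reproduces the values shown above.
    m = sum(column) / len(column)
    s = math.sqrt(sum((x - m) ** 2 for x in column) / (len(column) - 1))
    return [(x - m) / s for x in column]

print([round(v, 8) for v in std_scale([100, 8, 50, 88, 4])])
```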
27.7 Transformer min-max-scale
Clojure doc:
Metamorph transformer, which scales the column data into a given range.

columns-selector: tablecloth columns-selector to choose the columns to work on
meta-field: tablecloth meta-field working with columns-selector
options for the scaler can take:

min: Minimal value to scale to (default -0.5)
max: Maximum value to scale to (default 0.5)
metamorph | . |
---|---|
Behaviour in mode :fit | Scales the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained min-max-scale model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id |
The min-max scaler scales columns into a specified interval, by default from -0.5 to 0.5:
^kind/dataset
(mm/pipe-it
 data
 (preprocessing/min-max-scale [0 1] {}))
:_unnamed [5 2]:
0 | 1 |
---|---|
0.50000000 | -0.50000000 |
-0.45833333 | -0.00505051 |
-0.02083333 | -0.45959596 |
0.37500000 | 0.19696970 |
-0.50000000 | 0.50000000 |
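Min-max scaling maps a column linearly so that its minimum and maximum land on the target interval: scaled = (x - min) / (max - min) * (new-max - new-min) + new-min. A quick Python check of that formula against the values above (plain Python, not the preprocessing API):

```python
def min_max_scale(column, new_min=-0.5, new_max=0.5):
    # Linearly map the column range onto [new_min, new_max];
    # -0.5 and 0.5 are the defaults described above.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in column]

print([round(v, 8) for v in min_max_scale([100, 8, 50, 88, 4])])
```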
27.8 Transformer reduce-dimensions
Clojure doc:
Metamorph transformer, which reduces the dimensions of a given dataset.

algorithm can be any of: :pca-cov, :pca-cor, :pca-prob, :kpca, :gha, :random
target-dims is the number of dimensions to reduce to.
cnames is a sequence of column names on which the reduction gets performed
opts are the options of the algorithm
metamorph | . |
---|---|
Behaviour in mode :fit | Reduces dimensions of the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained reduction model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from ctx at key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in ctx under key in :metamorph/id . |
27.8.1 PCA example
In this example we run PCA on some data.
(require '[scicloj.metamorph.ml.toydata :as toydata])
We use the sonar dataset, which has 60 columns of quantitative data (certain measurements from a sonar device). The original purpose of the dataset is to learn to detect rock vs. metal from the measurements.
(def sonar
  (toydata/sonar-ds))
sample 10x10:
^kind/dataset
(ds/select-by-index sonar (range 10) (range 10))
_unnamed [10 10]:
:x0 | :x1 | :x2 | :x3 | :x4 | :x5 | :x6 | :x7 | :x8 | :x9 |
---|---|---|---|---|---|---|---|---|---|
0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 |
0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 |
0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 |
0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 |
0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 |
0.0286 | 0.0453 | 0.0277 | 0.0174 | 0.0384 | 0.0990 | 0.1201 | 0.1833 | 0.2105 | 0.3039 |
0.0317 | 0.0956 | 0.1321 | 0.1408 | 0.1674 | 0.1710 | 0.0731 | 0.1401 | 0.2083 | 0.3513 |
0.0519 | 0.0548 | 0.0842 | 0.0319 | 0.1158 | 0.0922 | 0.1027 | 0.0613 | 0.1465 | 0.2838 |
0.0223 | 0.0375 | 0.0484 | 0.0475 | 0.0647 | 0.0591 | 0.0753 | 0.0098 | 0.0684 | 0.1487 |
0.0164 | 0.0173 | 0.0347 | 0.0070 | 0.0187 | 0.0671 | 0.1056 | 0.0697 | 0.0962 | 0.0251 |
(def col-names (map #(keyword (str "x" %))
                    (range 60)))
First we create and run a pipeline which does the PCA. In this pipeline we do not fix the number of target dimensions, as we want to plot the result for all numbers of components (up to 60).
(def fitted-ctx
  (mm/fit
   sonar
   (projections/reduce-dimensions :pca-cov 60
                                  col-names
                                  {})))
The next function transforms the result from the fitted pipeline into a vega-lite compatible format for plotting. It accesses the underlying Smile Java object to get the data on the cumulative variance for each PCA component.
(defn create-plot-data [ctx]
  (map
   #(hash-map :principal-component %1
              :cumulative-variance %2)
   (range)
   (-> ctx vals (nth 2) :fit-result :model bean :cumulativeVarianceProportion)))
Next we plot the cumulative variance over the component index:
^kind/vega-lite
{:$schema "https://vega.github.io/schema/vega-lite/v5.json"
 :width 850
 :data {:values (create-plot-data fitted-ctx)}
 :mark "line"
 :encoding {:x {:field :principal-component, :type "nominal"},
            :y {:field :cumulative-variance, :type "quantitative"}}}
From the plot we see that transforming the data via PCA and reducing it from 60 dimensions to about 25 would still preserve the full variance. Looking at this plot, we could now decide how many dimensions to keep. We could, for example, decide that keeping 60% of the variance is enough, which would result in keeping the first 2 dimensions.
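The decision rule just described (keep the smallest number of components whose cumulative explained-variance ratio reaches a threshold) can be written down directly. A small numpy sketch of the :pca-cov idea on toy data; this is an illustration, not the Smile implementation:

```python
import numpy as np

def n_components_for_variance(X, threshold):
    # PCA via eigendecomposition of the covariance matrix (the
    # :pca-cov flavour, sketched with numpy rather than Smile).
    # Returns the smallest number of components whose cumulative
    # explained-variance ratio reaches the threshold.
    cov = np.cov(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending
    cumvar = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumvar, threshold) + 1)

# Toy data where one direction dominates the variance:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])
print(n_components_for_variance(X, 0.6))  # 1
```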
So our pipeline becomes:
(def fitted-ctx
  (mm/fit
   sonar
   (projections/reduce-dimensions :pca-cov 2
                                  col-names
                                  {})
   (ds-mm/select-columns [:material "pca-cov-0" "pca-cov-1"])
   (ds-mm/shuffle)))
^kind/dataset
(:metamorph/data fitted-ctx)
_unnamed [208 3]:
:material | pca-cov-0 | pca-cov-1 |
---|---|---|
M | 0.89443090 | 0.44016812 |
M | 0.33202860 | -0.40242658 |
M | -0.42172459 | 0.00461004 |
M | 0.72660907 | 0.41808875 |
R | -0.70654319 | -0.68603026 |
M | -1.52470970 | 0.07391137 |
R | 0.41905230 | 0.17147989 |
M | 0.28088718 | 0.28213827 |
M | -0.29773311 | -1.15185988 |
R | 0.23467347 | 0.36520863 |
β¦ | β¦ | β¦ |
R | 1.34875184 | 0.64443495 |
R | -0.11410315 | -0.74143479 |
R | 1.01226225 | 0.43795441 |
M | -0.04178149 | -1.03164391 |
R | -0.16196134 | -0.66724957 |
R | -0.22626634 | -0.68497148 |
M | 0.78631803 | -0.37264745 |
R | 1.04259531 | -0.47114466 |
R | -0.30512516 | -0.68425259 |
R | 1.22936765 | -0.27666872 |
R | 0.12658623 | 0.49437376 |
As the data is now 2-dimensional, it is easy to plot:
(def scatter-plot-data
  (-> fitted-ctx
      :metamorph/data
      (ds/select-columns [:material "pca-cov-0" "pca-cov-1"])
      (ds/rows :as-maps)))
^kind/vega
{:$schema "https://vega.github.io/schema/vega-lite/v5.json"
 :data {:values scatter-plot-data}
 :width 500
 :height 500
 :mark :circle
 :encoding {:x {:field "pca-cov-0" :type "quantitative"}
            :y {:field "pca-cov-1" :type "quantitative"}
            :color {:field :material}}}
The plot shows that the reduction to 2 dimensions does not create linearly separable areas of M and R. So a linear model will not be able to predict the material well from the 2 PCA components.

It even seems that the reduction to 2 dimensions removes too much information for predicting the material with any type of model.