27 Transformer reference - DRAFT
(ns noj-book.transformer-references
  (:require
   [scicloj.kindly.v4.api :as kindly]
   [scicloj.kindly.v4.kind :as kind]
   [scicloj.metamorph.core :as mm]
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.preprocessing :as preprocessing]
   [scicloj.ml.smile.classification]
   [scicloj.ml.smile.metamorph :as smile-mm]
   [scicloj.ml.smile.nlp :as nlp]
   [scicloj.ml.smile.projections :as projections]
   [tablecloth.api :as tc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.categorical :as ds-cat]
   [tech.v3.dataset.metamorph :as ds-mm]
   [tech.v3.dataset.modelling :as ds-mod]
   [tech.v3.dataset.print]))
27.1 Transformer count-vectorize
Clojure doc:
Converts the text column text-col to a bag-of-words representation in the form of a frequency-count map. The default text->bow function is default-text->bow. All options are passed to it.
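As a rough illustration of what such a text->bow function computes, here is a minimal Python sketch that lowercases, tokenizes, and counts. This is an illustration only: the actual default normalization also stems tokens and keeps punctuation tokens (as the output below shows), which this sketch omits.

```python
from collections import Counter
import re

def text_to_bow(text):
    # Lowercase, tokenize on word characters, and count occurrences.
    # (The real default-text->bow also stems tokens and keeps
    # punctuation tokens such as "," and "!", omitted here.)
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))

print(text_to_bow("Hello Clojure world, hello ML word !"))
# {'hello': 2, 'clojure': 1, 'world': 1, 'ml': 1, 'word': 1}
```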
In the following, we transform the text given in a dataset into a map of token counts, applying some default text normalization.
(def data (ds/->dataset {:text ["Hello Clojure world, hello ML word !"
                                "ML with Clojure is fun"]}))
^kind/dataset data
_unnamed [2 1]:
:text |
---|
Hello Clojure world, hello ML word ! |
ML with Clojure is fun |
(def fitted-ctx
  (mm/fit data (scicloj.ml.smile.metamorph/count-vectorize :text :bow)))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} |
(def bow-ds
  (:metamorph/data fitted-ctx))
^kind/dataset bow-ds
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} |
A custom tokenizer can be specified, either by passing options to scicloj.ml.smile.nlp/default-tokenize
(def fitted-ctx
  (mm/fit
   data
   (scicloj.ml.smile.metamorph/count-vectorize :text :bow {:stopwords ["clojure"]
                                                           :stemmer :none})))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {hello 2, world 1, , 1, ml 1, word 1, ! 1} |
ML with Clojure is fun | {ml 1, with 1, is 1, fun 1} |
or by passing in an implementation of a tokenizer function:
(def fitted-ctx
  (mm/fit
   data
   (scicloj.ml.smile.metamorph/count-vectorize :text :bow
                                               {:text->bow-fn (fn [text options]
                                                                {:a 1 :b 2})})))

(:metamorph/data fitted-ctx)
_unnamed [2 2]:
:text | :bow |
---|---|
Hello Clojure world, hello ML word ! | {:a 1, :b 2} |
ML with Clojure is fun | {:a 1, :b 2} |
27.2 Transformer bow->SparseArray
Clojure doc:
Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the discrete naive Bayes model. Options can be:

create-vocab-fn: A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all

The sparse data is represented as smile.util.SparseArray.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
Now we convert the bag-of-words map to a sparse array of class smile.util.SparseArray:
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse)))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("clojur" "!" "word" "hello" "is" "fun" "ml" "," "with" "world"), :vocab->index-map {"clojur" 0, "!" 1, "word" 2, "hello" 3, "is" 4, "fun" 5, "ml" 6, "," 7, "with" 8, "world" 9}, :index->vocab-map {0 "clojur", 7 ",", 1 "!", 4 "is", 6 "ml", 3 "hello", 2 "word", 9 "world", 5 "fun", 8 "with"}}}
^kind/dataset
(:metamorph/data ctx-sparse)
_unnamed [2 3]:
:text | :bow | :sparse |
---|---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} | [3:2, 0:1, 9:1, 7:1, 6:1, 2:1, 1:1] |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} | [6:1, 8:1, 0:1, 4:1, 5:1] |
The SparseArray instances look like this:
(zipmap
 (:text bow-ds)
 (map seq
      (-> ctx-sparse :metamorph/data :sparse)))
{"Hello Clojure world, hello ML word !"
 (#object[smile.util.SparseArray$Entry 0x42c18b70 "3:2"]
  #object[smile.util.SparseArray$Entry 0x366265df "0:1"]
  #object[smile.util.SparseArray$Entry 0x1ae976be "9:1"]
  #object[smile.util.SparseArray$Entry 0x204eed6b "7:1"]
  #object[smile.util.SparseArray$Entry 0x704d9414 "6:1"]
  #object[smile.util.SparseArray$Entry 0x55a44b5f "2:1"]
  #object[smile.util.SparseArray$Entry 0x4298bd3 "1:1"]),
 "ML with Clojure is fun"
 (#object[smile.util.SparseArray$Entry 0x46049b65 "6:1"]
  #object[smile.util.SparseArray$Entry 0x69c82cd3 "8:1"]
  #object[smile.util.SparseArray$Entry 0x5886c76a "0:1"]
  #object[smile.util.SparseArray$Entry 0x70162606 "4:1"]
  #object[smile.util.SparseArray$Entry 0x51b580e3 "5:1"])}
27.3 Transformer bow->sparse-array
Clojure doc:
Converts a bag-of-words column bow-col to a sparse indices column indices-col, as needed by the Maxent model. Options can be:

create-vocab-fn: A function which converts the bow map to a list of tokens. Defaults to scicloj.ml.smile.nlp/create-vocab-all

The sparse data is represented as primitive int arrays, whose entries are the indices of the present tokens in the vocabulary.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | :scicloj.ml.smile.metamorph/bow->sparse-vocabulary |
Now we convert the bag-of-words map to a sparse array, here a Java primitive int array:
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->sparse-array :bow :sparse)))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("clojur" "!" "word" "hello" "is" "fun" "ml" "," "with" "world"), :vocab->index-map {"clojur" 0, "!" 1, "word" 2, "hello" 3, "is" 4, "fun" 5, "ml" 6, "," 7, "with" 8, "world" 9}, :index->vocab-map {0 "clojur", 7 ",", 1 "!", 4 "is", 6 "ml", 3 "hello", 2 "word", 9 "world", 5 "fun", 8 "with"}}}
We also see the sparse representation: the indices into the vocabulary of the tokens with non-zero counts.
(zipmap
 (:text bow-ds)
 (map seq
      (-> ctx-sparse :metamorph/data :sparse)))
{"Hello Clojure world, hello ML word !" (0 1 2 3 6 7 9),
 "ML with Clojure is fun" (0 4 5 6 8)}
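The conversion itself is simple to state: look up each token of the bow map in the vocabulary's index map and collect the sorted indices of the tokens with non-zero counts. A small Python sketch of that idea (plain Python, not the Smile implementation), using the vocabulary shown above:

```python
def bow_to_sparse_indices(bow, vocab_to_index):
    # Keep only tokens present in the vocabulary and return the
    # sorted vocabulary indices of the tokens with non-zero counts.
    return sorted(vocab_to_index[tok] for tok in bow if tok in vocab_to_index)

vocab_to_index = {"clojur": 0, "!": 1, "word": 2, "hello": 3, "is": 4,
                  "fun": 5, "ml": 6, ",": 7, "with": 8, "world": 9}
bow = {"ml": 1, "with": 1, "clojur": 1, "is": 1, "fun": 1}
print(bow_to_sparse_indices(bow, vocab_to_index))  # [0, 4, 5, 6, 8]
```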
In both ->sparse functions we can control the vocabulary via an option: we can pass in a different / custom function which creates the vocabulary from the bow maps.
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse
                              {:create-vocab-fn
                               (fn [bow] (nlp/->vocabulary-top-n bow 1))})))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ("ml"), :vocab->index-map {"ml" 0}, :index->vocab-map {0 "ml"}}}
(def ctx-sparse
  (mm/fit
   bow-ds
   (smile-mm/bow->SparseArray :bow :sparse
                              {:create-vocab-fn
                               (fn [_]
                                 ["hello" "fun"])})))
ctx-sparse
{:metamorph/data _unnamed [2 3]
 :metamorph/mode :fit
 :scicloj.ml.smile.metamorph/bow->sparse-vocabulary {:vocab ["hello" "fun"], :vocab->index-map {"hello" 0, "fun" 1}, :index->vocab-map {0 "hello", 1 "fun"}}}
27.4 Transformer bow->tfidf
Clojure doc:
Calculates the tf-idf score from bags of words (as token frequency maps) in column bow-column and stores them in a new column tfidf-column as maps of token->tfidf-score.

It calculates a global term-frequency map in :fit and reuses it in :transform.
metamorph | . |
---|---|
Behaviour in mode :fit | normal |
Behaviour in mode :transform | normal |
Reads keys from ctx | none |
Writes keys to ctx | none |
Here we calculate the tf-idf score from the bag of words:
^kind/dataset
(mm/pipe-it
 bow-ds
 (smile-mm/bow->tfidf :bow :tfidf {}))
_unnamed [2 3]:

:text | :bow | :tfidf |
---|---|---|
Hello Clojure world, hello ML word ! | {hello 2, clojur 1, world 1, , 1, ml 1, word 1, ! 1} | {clojur 1.0, ! 1.4054651081081644, word 1.4054651081081644, hello 2.8109302162163288, ml 1.0, , 1.4054651081081644, world 1.4054651081081644} |
ML with Clojure is fun | {ml 1, with 1, clojur 1, is 1, fun 1} | {clojur 1.0, is 1.4054651081081644, fun 1.4054651081081644, ml 1.0, with 1.4054651081081644} |
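The printed scores are consistent with a smoothed tf-idf of the form tf * (1 + ln((1 + n) / (1 + df))), where n is the number of documents and df is the number of documents containing the token. Note this formula is inferred from the output above, not taken from the Smile sources; a quick Python check under that assumption:

```python
import math

def tfidf(tf, df, n_docs):
    # Smoothed idf; the formula is inferred from the printed scores,
    # not taken from the Smile source code.
    return tf * (1.0 + math.log((1.0 + n_docs) / (1.0 + df)))

# "clojur" occurs in both of the two documents -> the idf term vanishes:
print(tfidf(1, 2, 2))  # 1.0
# "hello" occurs twice in one of the two documents:
print(tfidf(2, 1, 2))  # 2.8109302162163288
```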
27.5 Transformer model
Clojure doc:
Executes a machine learning model in train/predict (depending on :mode) from the metamorph.ml
model registry.
The model is passed between both invocations via the shared context ctx, under a key (a step identifier) which is passed in as :metamorph/id and guaranteed to be unique for each pipeline step. The function writes and reads into this common context key.
Options:

:model-type - Keyword for the model to use

Further options get passed to the train functions and are model specific.
See here for an overview of the models built into scicloj.ml:
https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html
Other libraries might contribute other models, which are documented as part of the library.
metamorph | . |
---|---|
Behaviour in mode :fit | Calls scicloj.metamorph.ml/train using data in :metamorph/data and options and stores trained model in ctx under key in :metamorph/id |
Behaviour in mode :transform | Reads trained model from ctx and calls scicloj.metamorph.ml/predict with the model in $id and data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use for prediction from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id and writes feature-ds and target-ds before prediction into ctx at :scicloj.metamorph.ml/feature-ds /:scicloj.metamorph.ml/target-ds |
See as well:
scicloj.metamorph.ml/train
scicloj.metamorph.ml/predict
The model transformer allows executing any machine learning model which registers itself inside the metamorph.ml system via the function scicloj.metamorph.ml/define-model!. The built-in models are listed here: https://scicloj.github.io/scicloj.ml/userguide-models.html
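The fit/transform plumbing can be pictured as a context map threaded through the pipeline steps. Below is a minimal Python sketch of the idea, with illustrative names only (this is not the metamorph API): a step trains and stores its model under its step id in fit mode, and reads it back to predict in transform mode.

```python
def model_step(step_id, train_fn, predict_fn):
    # A pipeline step closing over a step id. In "fit" mode it trains
    # and stores the model in the ctx under its id; in "transform"
    # mode it reads the model back and predicts with it.
    def step(ctx):
        ctx = dict(ctx)  # do not mutate the caller's context
        if ctx["mode"] == "fit":
            ctx[step_id] = train_fn(ctx["data"])
        else:
            ctx["data"] = predict_fn(ctx[step_id], ctx["data"])
        return ctx
    return step

# Toy "model": memorize the mean of the training data, predict it everywhere.
step = model_step("model",
                  train_fn=lambda xs: sum(xs) / len(xs),
                  predict_fn=lambda m, xs: [m for _ in xs])

fitted = step({"mode": "fit", "data": [1.0, 2.0, 3.0]})
out = step({**fitted, "mode": "transform", "data": [10, 20]})
print(out["data"])  # [2.0, 2.0]
```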
We use the Iris data for this example:
(def iris
  (->
   (ds/->dataset "https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv" {:key-fn keyword})
   (tech.v3.dataset.print/print-range 5)))
^kind/dataset iris
https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [150 5]:
:sepal_length | :sepal_width | :petal_length | :petal_width | :species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
β¦ | β¦ | β¦ | β¦ | β¦ |
6.5 | 3.0 | 5.2 | 2.0 | virginica |
6.2 | 3.4 | 5.4 | 2.3 | virginica |
5.9 | 3.0 | 5.1 | 1.8 | virginica |
(def train-test
  (ds-mod/train-test-split iris))
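train-test-split partitions the dataset rows at random; the sizes printed further down (105 training and 45 test rows out of 150) correspond to a 70/30 split. A minimal Python sketch of such a split (the helper name and default fraction here are illustrative, not the tech.v3 API):

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    # Shuffle row indices and cut them into a train and a test part.
    # Hypothetical helper mirroring the idea of ds-mod/train-test-split.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(len(rows) * train_fraction)
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

train, test = train_test_split(list(range(150)))
print(len(train), len(test))  # 105 45
```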
The pipeline consists of specifying the inference target, transforming the target to categorical, and the model function:
(def pipe-fn
  (mm/pipeline
   (mm/lift ds-mod/set-inference-target :species)
   (mm/lift ds/categorical->number [:species])
   {:metamorph/id :model}
   (ml/model {:model-type :smile.classification/logistic-regression})))
First we run the training:
(def fitted-ctx
  (mm/fit
   (:train-ds train-test)
   pipe-fn))

(dissoc-in fitted-ctx [:model :model-data])
{:metamorph/data https://raw.githubusercontent.com/scicloj/metamorph.ml/main/test/data/iris.csv [105 5]
 :metamorph/mode :fit
 :model {:options {:model-type :smile.classification/logistic-regression}, :id #uuid "22ec5e61-a9a5-4d9b-aa25-8243dce49129", :feature-columns [:sepal_length :sepal_width :petal_length :petal_width], :target-columns [:species], :target-datatypes {:species :int64}, :target-categorical-maps {:species #tech.v3.dataset.categorical.CategoricalMap{:lookup-table {"virginica" 0, "setosa" 1, "versicolor" 2}, :src-column :species, :result-datatype :int64}}, :scicloj.metamorph.ml/unsupervised? nil}}
and then the prediction on the test data:
(def transformed-ctx
  (mm/transform-pipe (:test-ds train-test) pipe-fn fitted-ctx))

(-> transformed-ctx
    (dissoc-in [:model :model-data])
    (update-in [:metamorph/data] #(tech.v3.dataset.print/print-range % 5)))
{:metamorph/data :_unnamed [45 4]
 :metamorph/mode :transform
 :model {…}}
and we get the predictions:
^kind/dataset
(-> transformed-ctx
    :metamorph/data
    (ds-cat/reverse-map-categorical-xforms)
    (ds/select-columns [:species])
    (ds/head))
:_unnamed [5 1]:
:species |
---|
setosa |
virginica |
virginica |
versicolor |
setosa |
27.6 Transformer std-scale
Clojure doc:
Metamorph transformer, which centers and scales the dataset per column.

columns-selector: tablecloth columns-selector to choose the columns to work on
meta-field: tablecloth meta-field working with columns-selector
options are the options for the scaler and can take:

mean? If true (default), the data gets shifted by the column means, so it is 0-centered
stddev? If true (default), the data gets scaled by the standard deviation of the column
metamorph | . |
---|---|
Behaviour in mode :fit | Centers and scales the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained std-scale model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id |
We can use the std-scale transformer to center and scale data. Let's take some example data:
(def data
  (tc/dataset [[100 0.001]
               [8 0.05]
               [50 0.005]
               [88 0.07]
               [4 0.1]]
              {:layout :as-row}))
^kind/dataset data
:_unnamed [5 2]:
0 | 1 |
---|---|
100 | 0.001 |
8 | 0.050 |
50 | 0.005 |
88 | 0.070 |
4 | 0.100 |
Now we can center each column around 0 and scale it by the standard deviation of the column:
^kind/dataset
(mm/pipe-it
 data
 (preprocessing/std-scale [0 1] {}))
:_unnamed [5 2]:
0 | 1 |
---|---|
1.13053908 | -1.04102352 |
-0.94965283 | 0.11305233 |
0.00000000 | -0.94681324 |
0.85920970 | 0.58410369 |
-1.04009595 | 1.29068074 |
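The numbers above can be reproduced by centering each column on its mean and dividing by the sample standard deviation (ddof = 1). A quick Python check of that arithmetic (plain Python, not the preprocessing API):

```python
import math

def std_scale(column):
    # Center by the column mean and scale by the sample standard
    # deviation (ddof = 1); this reproduces the values shown above.
    m = sum(column) / len(column)
    s = math.sqrt(sum((x - m) ** 2 for x in column) / (len(column) - 1))
    return [(x - m) / s for x in column]

print([round(v, 8) for v in std_scale([100, 8, 50, 88, 4])])
```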
27.7 Transformer min-max-scale
Clojure doc:
Metamorph transformer, which scales the column data into a given range.

columns-selector: tablecloth columns-selector to choose the columns to work on
meta-field: tablecloth meta-field working with columns-selector
options for the scaler can take:

min: Minimal value to scale to (default -0.5)
max: Maximum value to scale to (default 0.5)
metamorph | . |
---|---|
Behaviour in mode :fit | Scales the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained min-max-scale model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in key $id |
The min-max scaler scales columns into a specified interval, by default from -0.5 to 0.5:
^kind/dataset
(mm/pipe-it
 data
 (preprocessing/min-max-scale [0 1] {}))
:_unnamed [5 2]:
0 | 1 |
---|---|
0.50000000 | -0.50000000 |
-0.45833333 | -0.00505051 |
-0.02083333 | -0.45959596 |
0.37500000 | 0.19696970 |
-0.50000000 | 0.50000000 |
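Min-max scaling maps a column linearly so that its minimum and maximum land on the target interval: scaled = (x - min) / (max - min) * (new-max - new-min) + new-min. A quick Python check of that formula against the values above (plain Python, not the preprocessing API):

```python
def min_max_scale(column, new_min=-0.5, new_max=0.5):
    # Linearly map the column range onto [new_min, new_max];
    # -0.5 and 0.5 are the defaults described above.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in column]

print([round(v, 8) for v in min_max_scale([100, 8, 50, 88, 4])])
```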
27.8 Transformer reduce-dimensions
Clojure doc:
Metamorph transformer, which reduces the dimensions of a given dataset.

algorithm can be any of: :pca-cov, :pca-cor, :pca-prob, :kpca, :gha, :random
target-dims is the number of dimensions to reduce to.
cnames is a sequence of column names on which the reduction gets performed
opts are the options of the algorithm
metamorph | . |
---|---|
Behaviour in mode :fit | Reduces dimensions of the dataset at key :metamorph/data and stores the trained model in ctx under key at :metamorph/id |
Behaviour in mode :transform | Reads trained reduction model from ctx and applies it to data in :metamorph/data |
Reads keys from ctx | In mode :transform : Reads trained model to use from ctx at key in :metamorph/id . |
Writes keys to ctx | In mode :fit : Stores trained model in ctx under key in :metamorph/id . |
27.8.1 PCA example
In this example we run PCA on some data.
(require '[scicloj.metamorph.ml.toydata :as toydata])
We use the sonar dataset, which has 60 columns of quantitative data (certain measurements from a sonar device). The original purpose of the dataset is to learn to detect rock vs. metal from the measurements.
(def sonar
  (toydata/sonar-ds))
sample 10x10:
^kind/dataset
(ds/select-by-index sonar (range 10) (range 10))
_unnamed [10 10]:
:x0 | :x1 | :x2 | :x3 | :x4 | :x5 | :x6 | :x7 | :x8 | :x9 |
---|---|---|---|---|---|---|---|---|---|
0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 |
0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 |
0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 |
0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 |
0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 |
0.0286 | 0.0453 | 0.0277 | 0.0174 | 0.0384 | 0.0990 | 0.1201 | 0.1833 | 0.2105 | 0.3039 |
0.0317 | 0.0956 | 0.1321 | 0.1408 | 0.1674 | 0.1710 | 0.0731 | 0.1401 | 0.2083 | 0.3513 |
0.0519 | 0.0548 | 0.0842 | 0.0319 | 0.1158 | 0.0922 | 0.1027 | 0.0613 | 0.1465 | 0.2838 |
0.0223 | 0.0375 | 0.0484 | 0.0475 | 0.0647 | 0.0591 | 0.0753 | 0.0098 | 0.0684 | 0.1487 |
0.0164 | 0.0173 | 0.0347 | 0.0070 | 0.0187 | 0.0671 | 0.1056 | 0.0697 | 0.0962 | 0.0251 |
(def col-names (map #(keyword (str "x" %))
                    (range 60)))
First we create and run a pipeline which does the PCA. In this pipeline we do not fix the number of target dimensions, as we want to plot the result for all numbers of components (up to 60).
(def fitted-ctx
  (mm/fit
   sonar
   (projections/reduce-dimensions :pca-cov 60
                                  col-names
                                  {})))
The next function transforms the result from the fitted pipeline into a vega-lite compatible format for plotting. It accesses the underlying Smile Java object to get the data on the cumulative variance for each PCA component.
(defn create-plot-data [ctx]
  (map
   #(hash-map :principal-component %1
              :cumulative-variance %2)
   (range)
   (-> ctx vals (nth 2) :fit-result :model bean :cumulativeVarianceProportion)))
Next we plot the cumulative variance over the component index:
^kind/vega-lite
{:$schema "https://vega.github.io/schema/vega-lite/v5.json"
 :width 850
 :data {:values (create-plot-data fitted-ctx)}
 :mark "line"
 :encoding {:x {:field :principal-component, :type "nominal"},
            :y {:field :cumulative-variance, :type "quantitative"}}}
From the plot we see that transforming the data via PCA and reducing it from 60 dimensions to about 25 would still preserve the full variance. Looking at this plot, we could now decide how many dimensions to keep. We could, for example, decide that keeping 60% of the variance is enough, which would result in keeping the first 2 dimensions.
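The decision rule just described (keep the smallest number of components whose cumulative explained-variance ratio reaches a threshold) can be written down directly. A small numpy sketch of the :pca-cov idea on toy data; this is an illustration, not the Smile implementation:

```python
import numpy as np

def n_components_for_variance(X, threshold):
    # PCA via eigendecomposition of the covariance matrix (the
    # :pca-cov flavour, sketched with numpy rather than Smile).
    # Returns the smallest number of components whose cumulative
    # explained-variance ratio reaches the threshold.
    cov = np.cov(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending
    cumvar = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumvar, threshold) + 1)

# Toy data where one direction dominates the variance:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])
print(n_components_for_variance(X, 0.6))  # 1
```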
So our pipeline becomes:
(def fitted-ctx
  (mm/fit
   sonar
   (projections/reduce-dimensions :pca-cov 2
                                  col-names
                                  {})
   (ds-mm/select-columns [:material "pca-cov-0" "pca-cov-1"])
   (ds-mm/shuffle)))
^kind/dataset
(:metamorph/data fitted-ctx)
_unnamed [208 3]:
:material | pca-cov-0 | pca-cov-1 |
---|---|---|
M | 0.89443090 | 0.44016812 |
M | 0.33202860 | -0.40242658 |
M | -0.42172459 | 0.00461004 |
M | 0.72660907 | 0.41808875 |
R | -0.70654319 | -0.68603026 |
M | -1.52470970 | 0.07391137 |
R | 0.41905230 | 0.17147989 |
M | 0.28088718 | 0.28213827 |
M | -0.29773311 | -1.15185988 |
R | 0.23467347 | 0.36520863 |
β¦ | β¦ | β¦ |
R | 1.34875184 | 0.64443495 |
R | -0.11410315 | -0.74143479 |
R | 1.01226225 | 0.43795441 |
M | -0.04178149 | -1.03164391 |
R | -0.16196134 | -0.66724957 |
R | -0.22626634 | -0.68497148 |
M | 0.78631803 | -0.37264745 |
R | 1.04259531 | -0.47114466 |
R | -0.30512516 | -0.68425259 |
R | 1.22936765 | -0.27666872 |
R | 0.12658623 | 0.49437376 |
As the data is now 2-dimensional, it is easy to plot:
(def scatter-plot-data
  (-> fitted-ctx
      :metamorph/data
      (ds/select-columns [:material "pca-cov-0" "pca-cov-1"])
      (ds/rows :as-maps)))
^kind/vega
{:$schema "https://vega.github.io/schema/vega-lite/v5.json"
 :data {:values scatter-plot-data}
 :width 500
 :height 500
 :mark :circle
 :encoding {:x {:field "pca-cov-0" :type "quantitative"}
            :y {:field "pca-cov-1" :type "quantitative"}
            :color {:field :material}}}
The plot shows that the reduction to 2 dimensions does not create linearly separable areas of M and R. So a linear model will not be able to predict the material well from the 2 PCA components.

It even seems that the reduction to 2 dimensions removes too much information for predicting the material with any type of model.