scicloj.metamorph.ml.categorical
Categorical feature encoding for machine learning pipelines.
This namespace provides metamorph transformers for handling categorical variables commonly used in supervised learning. It currently focuses on one-hot encoding, which converts categorical values into binary indicator columns.
One-hot encoding is essential for:
- Preparing categorical features for algorithms that expect numeric inputs
- Preventing ordinal assumptions on nominal categories
- Creating interpretable model features
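To make the idea of binary indicator columns concrete, here is a minimal sketch in plain Clojure. It works on a vector of row maps rather than a real dataset, and `one-hot-encode`, `example-rows`, and `encoded` are hypothetical names used only for illustration; this is not how the library implements the encoding.

```clojure
(defn one-hot-encode
  "Replaces key `col` in every row map with one 0/1 indicator key per
  distinct value of `col`. Illustrative helper, not part of the library."
  [rows col]
  (let [levels (distinct (map col rows))]
    (mapv (fn [row]
            (reduce (fn [m level]
                      (assoc m
                             ;; e.g. :cyl and 6 become :cyl-6
                             (keyword (str (name col) "-" level))
                             (if (= level (get row col)) 1 0)))
                    (dissoc row col)
                    levels))
          rows)))

(def example-rows
  [{:mpg 21.0 :cyl 6}
   {:mpg 22.8 :cyl 4}
   {:mpg 18.7 :cyl 8}])

(def encoded (one-hot-encode example-rows :cyl))
;; => [{:mpg 21.0, :cyl-6 1, :cyl-4 0, :cyl-8 0}
;;     {:mpg 22.8, :cyl-6 0, :cyl-4 1, :cyl-8 0}
;;     {:mpg 18.7, :cyl-6 0, :cyl-4 0, :cyl-8 1}]
```

Each row ends up with exactly one indicator set to 1, so no ordering between the levels 4, 6, and 8 is implied.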
Main API:
transform-one-hot: The primary metamorph transformer for one-hot encoding
Encoding strategies:
- :full - Uses a predefined level set from full dataset context
- :fit - Levels discovered during :fit used in :transform
- :independent - Each mode independently determines and encodes levels
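The practical difference between the strategies is where the level set comes from. The sketch below mimics that choice in plain Clojure on two small value vectors. `fit-levels`, `encode`, and the sample data are hypothetical, and the sketch is only a rough analogy to the library's behaviour, not its implementation.

```clojure
;; Hypothetical train/test values for one categorical column.
(def train-cyl [6 6 4 8])
(def test-cyl  [4 4 6])

(defn fit-levels
  "Distinct values in first-seen order."
  [values]
  (vec (distinct values)))

(defn encode
  "One 0/1 indicator vector per value, with one position per level."
  [levels values]
  (mapv (fn [v] (mapv #(if (= % v) 1 0) levels)) values))

;; Analogous to :fit - levels come from the training data and are reused.
(def fit-encoded
  (encode (fit-levels train-cyl) test-cyl))
;; => [[0 1 0] [0 1 0] [1 0 0]]

;; Analogous to :independent - levels are rediscovered on the test data.
(def independent-encoded
  (encode (fit-levels test-cyl) test-cyl))
;; => [[1 0] [1 0] [0 1]]

;; Analogous to :full - levels come from all available data.
(def full-levels
  (fit-levels (concat train-cyl test-cyl)))
;; => [6 4 8]
```

In this sketch the independent variant yields two indicator positions for the test data while the fitted variant yields three, which is why a shared level set matters when a model expects the same columns at training and prediction time.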
transform-one-hot
(transform-one-hot column-selector strategy)
(transform-one-hot column-selector strategy options)
Metamorph transformer that maps categorical variables to one-hot encoded columns.
Each unique value of the categorical column becomes its own binary column in the one-hot encoding.
column-selector - Tablecloth column selector (keyword, fn, or selector spec)
strategy - Strategy for handling train/test level differences:
- :full - Levels retrieved from dataset at :metamorph.ml/full-ds in context
- :independent - One-hot columns fitted and transformed independently
- :fit - Mapping from :fit mode used in :transform (assumes all levels present in fit)
options - Optional map with:
- :table-args - Precise mapping as sequence of val idx pairs or sorted values
- :result-datatype - Datatype of the one-hot-mapping columns
Returns a metamorph step function that transforms the data in both :fit and :transform modes.
| metamorph | . |
|---|---|
| Behaviour in mode :fit | Fits one-hot encoding and applies it to :metamorph/data |
| Behaviour in mode :transform | Applies fitted encoding to :metamorph/data |
| Reads keys from ctx | In :transform: reads the fitted encoding from the key given by :metamorph/id |
| Writes keys to ctx | In :fit: stores the fitted encoding under the key given by :metamorph/id |
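
The fit/transform contract in the table can be sketched as a function from context map to context map. The example below is a simplified stand-in written in plain Clojure: `one-hot-step` and `encode-rows` are hypothetical, :metamorph/data holds a vector of row maps instead of a dataset, and the step id :one-hot-1 is set by hand for illustration.

```clojure
(defn encode-rows
  "Replaces `col` in each row map with one 0/1 indicator key per level."
  [levels col rows]
  (mapv (fn [row]
          (into (dissoc row col)
                (map (fn [level]
                       [(keyword (str (name col) "-" level))
                        (if (= level (get row col)) 1 0)]))
                levels))
        rows))

(defn one-hot-step
  "Hypothetical metamorph-style step: a function of the context map."
  [col]
  (fn [{:metamorph/keys [id mode data] :as ctx}]
    (case mode
      ;; :fit discovers the levels, stores them under the step id,
      ;; and encodes the data.
      :fit (let [levels (vec (distinct (map col data)))]
             (assoc ctx
                    id levels
                    :metamorph/data (encode-rows levels col data)))
      ;; :transform reuses the levels stored under the step id.
      :transform (assoc ctx
                        :metamorph/data (encode-rows (get ctx id) col data)))))

(def step (one-hot-step :cyl))

(def fitted-ctx
  (step {:metamorph/id   :one-hot-1
         :metamorph/mode :fit
         :metamorph/data [{:cyl 6} {:cyl 4}]}))

(def transformed-ctx
  (step (assoc fitted-ctx
               :metamorph/mode :transform
               :metamorph/data [{:cyl 4}])))

(:metamorph/data transformed-ctx)
;; => [{:cyl-6 0, :cyl-4 1}]
```

Because the levels travel in the context under the step id, the transform call produces the same columns as the fit call even though the new data contains only one level.
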
See also: tech.v3.dataset.categorical/fit-one-hot, tech.v3.dataset/categorical->one-hot
Examples
One-hot encode the :cyl column in a pipeline
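;; Namespace aliases assumed by this example
(require '[scicloj.metamorph.core :as mm]
         '[scicloj.metamorph.ml.categorical :as cat]
         '[scicloj.metamorph.ml.rdatasets :as rdatasets]
         '[tech.v3.dataset :as ds])
(let [ds (-> (rdatasets/datasets-mtcars)
             (ds/select-columns [:mpg :cyl]))]
  (-> (mm/fit ds (cat/transform-one-hot :cyl :independent))
      :metamorph/data
      str))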
;;=> https://vincentarelbundock.github.io/Rdatasets/doc/datasets/mtcars.html [32 4]:
;;=>
;;=> | :mpg | :cyl-8 | :cyl-4 | :cyl-6 |
;;=> |-----:|-------:|-------:|-------:|
;;=> | 21.0 | 0 | 0 | 1 |
;;=> | 21.0 | 0 | 0 | 1 |
;;=> | 22.8 | 0 | 1 | 0 |
;;=> | 21.4 | 0 | 0 | 1 |
;;=> | 18.7 | 1 | 0 | 0 |
;;=> | 18.1 | 0 | 0 | 1 |
;;=> | 14.3 | 1 | 0 | 0 |
;;=> | 24.4 | 0 | 1 | 0 |
;;=> | 22.8 | 0 | 1 | 0 |
;;=> | 19.2 | 0 | 0 | 1 |
;;=> | ... | ... | ... | ... |
;;=> | 15.5 | 1 | 0 | 0 |
;;=> | 15.2 | 1 | 0 | 0 |
;;=> | 13.3 | 1 | 0 | 0 |
;;=> | 19.2 | 1 | 0 | 0 |
;;=> | 27.3 | 0 | 1 | 0 |
;;=> | 26.0 | 0 | 1 | 0 |
;;=> | 30.4 | 0 | 1 | 0 |
;;=> | 15.8 | 1 | 0 | 0 |
;;=> | 19.7 | 0 | 0 | 1 |
;;=> | 15.0 | 1 | 0 | 0 |
;;=> | 21.4 | 0 | 1 | 0 |