scicloj.metamorph.ml.categorical

Categorical feature encoding for machine learning pipelines.

This namespace provides metamorph transformers for handling the categorical variables commonly used in supervised learning. It currently focuses on one-hot encoding, which converts each categorical value into a binary indicator column.

One-hot encoding is essential for:

  • Preparing categorical features for algorithms that expect numeric inputs
  • Preventing ordinal assumptions on nominal categories
  • Creating interpretable model features

Main API:

  • transform-one-hot: The primary metamorph transformer for one-hot encoding

Encoding strategies:

  • :full - Uses a predefined level set taken from the full dataset in the context
  • :fit - Levels discovered during :fit are reused in :transform
  • :independent - Each mode determines and encodes its levels independently
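
The difference between the three strategies can be sketched in plain Clojure, without any dataset library. The helpers `levels` and `one-hot` below are hypothetical names invented for this illustration and are not part of this namespace; the sketch only shows where each strategy takes its level set from, not how the library implements it.

```clojure
;; Illustrative sketch only: `levels` and `one-hot` are hypothetical helpers,
;; not functions of scicloj.metamorph.ml.categorical.

(defn levels
  "Distinct values of a column, in sorted order."
  [values]
  (vec (sort (distinct values))))

(defn one-hot
  "Encodes each value as a map from level to a 0/1 indicator.
  In this sketch, a value missing from `lvls` yields a row of zeros."
  [lvls values]
  (mapv (fn [v] (into {} (map (fn [l] [l (if (= l v) 1 0)])) lvls))
        values))

(def train-values [4 6 8 6])
(def test-values  [4 4 6])   ; level 8 never appears here

;; :fit - levels come from the training data and are reused later
(def fit-encoded (one-hot (levels train-values) test-values))

;; :independent - levels are recomputed from the data being encoded
(def independent-encoded (one-hot (levels test-values) test-values))

;; :full - levels come from the complete dataset (train and test together)
(def full-encoded
  (one-hot (levels (concat train-values test-values)) test-values))

;; Number of indicator columns produced for the test data per strategy
(count (first fit-encoded))
(count (first independent-encoded))
(count (first full-encoded))
```

Under these assumptions, the independent encoding of the test values has fewer indicator columns than the encodings built from the training or full level sets, which is the train/test mismatch that the :fit and :full strategies are meant to avoid.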

    transform-one-hot

    (transform-one-hot column-selector strategy)
    (transform-one-hot column-selector strategy options)

    Metamorph transformer that maps categorical variables to one-hot encoded columns.

    Each unique value of the categorical column becomes its own binary column in the one-hot encoding.

    column-selector - Tablecloth column selector (keyword, fn, or selector spec)

    strategy - Strategy for handling train/test level differences:

    • :full - Levels retrieved from dataset at :metamorph.ml/full-ds in context
    • :independent - One-hot columns fitted and transformed independently
    • :fit - Mapping from :fit mode used in :transform (assumes all levels present in fit)

    options - Optional map with:

    • :table-args - Precise mapping, given either as a sequence of [val idx] pairs or as a sorted sequence of values
    • :result-datatype - Datatype of the one-hot-mapping columns
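
One way to read the two accepted shapes of :table-args is that both describe a mapping from category value to column index. The plain-Clojure sketch below makes that reading concrete. `normalize-table-args` is a hypothetical helper written for this illustration, and the interpretation of a bare value sequence as "position equals index" is an assumption, not confirmed library behaviour.

```clojure
;; Hypothetical helper, not part of the library: turns either documented
;; shape of :table-args into a {value -> index} map.
(defn normalize-table-args
  [table-args]
  (if (every? sequential? table-args)
    ;; sequence of [val idx] pairs: use the pairs as given
    (into {} (map vec) table-args)
    ;; sequence of values: assume each value's position is its index
    (into {} (map-indexed (fn [idx v] [v idx])) table-args)))

(normalize-table-args [[4 0] [6 1] [8 2]])
(normalize-table-args [4 6 8])
```

Under this assumed reading, both calls describe the same mapping, so either shape could be used to pin the encoding of :cyl to a fixed set of levels.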

    Returns a metamorph step function that transforms the data in both :fit and :transform modes.

    metamorph:

    • Behaviour in mode :fit - Fits the one-hot encoding and applies it to :metamorph/data
    • Behaviour in mode :transform - Applies the fitted encoding to :metamorph/data
    • Reads keys from ctx - In :transform, reads the fitted encoding stored under the key given by :metamorph/id
    • Writes keys to ctx - In :fit, stores the fitted encoding under the key given by :metamorph/id
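
The fit/transform contract described above can be illustrated with a hand-written step in plain Clojure. This is a minimal sketch, not the library's implementation: `sketch-one-hot-step` is a hypothetical function, the data is a vector of row maps rather than a dataset, and :metamorph/id is set by hand here, whereas in a real pipeline it is supplied by the pipeline machinery.

```clojure
;; Hypothetical step for illustration only; rows are plain maps.
(defn sketch-one-hot-step
  "Returns a step function that one-hot encodes `column` in each row map."
  [column]
  (fn [{:metamorph/keys [id mode data] :as ctx}]
    (let [lvls   (case mode
                   ;; :fit discovers the levels from the incoming data
                   :fit       (vec (sort (distinct (map column data))))
                   ;; :transform reuses the levels stored under the step id
                   :transform (get ctx id))
          encode (fn [row]
                   (reduce (fn [m l]
                             (assoc m [column l]
                                    (if (= l (get row column)) 1 0)))
                           (dissoc row column)
                           lvls))]
      (cond-> (assoc ctx :metamorph/data (mapv encode data))
        ;; only :fit writes the fitted levels back into the context
        (= mode :fit) (assoc id lvls)))))

(def step (sketch-one-hot-step :cyl))

;; Fit on training rows; the levels end up under the key :step-1.
(def fitted
  (step {:metamorph/id   :step-1
         :metamorph/mode :fit
         :metamorph/data [{:mpg 21.0 :cyl 6} {:mpg 22.8 :cyl 4}]}))

;; Transform new rows, reusing the fitted context.
(def transformed
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [{:mpg 18.7 :cyl 4}])))
```

The new row is encoded with both indicator columns even though it contains only one level, because the level set is read from the context rather than recomputed.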

    See also: tech.v3.dataset.categorical/fit-one-hot, tech.v3.dataset/categorical->one-hot

    Examples

    One-hot encode the :cyl column in a pipeline:

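    ;; Namespace aliases assumed by this example (ds may equally be tablecloth.api)
    (require '[scicloj.metamorph.core :as mm]
             '[scicloj.metamorph.ml.categorical :as cat]
             '[scicloj.metamorph.ml.rdatasets :as rdatasets]
             '[tech.v3.dataset :as ds])

    (let [ds (-> (rdatasets/datasets-mtcars)
                 (ds/select-columns [:mpg :cyl]))]
      ;; mm/fit runs the step in :fit mode and returns the resulting context
      (-> (mm/fit ds (cat/transform-one-hot :cyl :independent))
          :metamorph/data
          str))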
    ;;=> https://vincentarelbundock.github.io/Rdatasets/doc/datasets/mtcars.html [32 4]:
    ;;=> 
    ;;=> | :mpg | :cyl-8 | :cyl-4 | :cyl-6 |
    ;;=> |-----:|-------:|-------:|-------:|
    ;;=> | 21.0 |      0 |      0 |      1 |
    ;;=> | 21.0 |      0 |      0 |      1 |
    ;;=> | 22.8 |      0 |      1 |      0 |
    ;;=> | 21.4 |      0 |      0 |      1 |
    ;;=> | 18.7 |      1 |      0 |      0 |
    ;;=> | 18.1 |      0 |      0 |      1 |
    ;;=> | 14.3 |      1 |      0 |      0 |
    ;;=> | 24.4 |      0 |      1 |      0 |
    ;;=> | 22.8 |      0 |      1 |      0 |
    ;;=> | 19.2 |      0 |      0 |      1 |
    ;;=> |  ... |    ... |    ... |    ... |
    ;;=> | 15.5 |      1 |      0 |      0 |
    ;;=> | 15.2 |      1 |      0 |      0 |
    ;;=> | 13.3 |      1 |      0 |      0 |
    ;;=> | 19.2 |      1 |      0 |      0 |
    ;;=> | 27.3 |      0 |      1 |      0 |
    ;;=> | 26.0 |      0 |      1 |      0 |
    ;;=> | 30.4 |      0 |      1 |      0 |
    ;;=> | 15.8 |      1 |      0 |      0 |
    ;;=> | 19.7 |      0 |      0 |      1 |
    ;;=> | 15.0 |      1 |      0 |      0 |
    ;;=> | 21.4 |      0 |      1 |      0 |