scicloj.metamorph.ml.design-matrix

Design matrix construction for machine learning pipelines.

This namespace provides utilities to transform datasets into numeric design matrices suitable for machine learning models. It supports deriving new features, transforming existing columns, managing target variables, and expanding complex column types (arrays, maps).

Main Entry Point:

  • create-design-matrix: Transform a dataset into a design matrix with custom specs

Design Matrix Specification Syntax:

Column specifications are pairs of a column name and a transformation, where:

  • Transformations are quoted Clojure expressions (quoted with ')
  • Expressions can reference column names directly as symbols
  • Expressions are evaluated in order, so a later expression can use a column defined by an earlier one
  • Columns that are not listed are removed from the output
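
As a minimal sketch of these rules, the following builds a small hypothetical dataset (the columns :a, :b and :y are made up for illustration) and chains two derived columns. It assumes the namespace is aliased as dm and that columns are referenced as keywords inside the quoted expression, mirroring the examples in the create-design-matrix docstring; it is not taken from the library's own tests.

```clojure
(require '[scicloj.metamorph.ml.design-matrix :as dm]
         '[tablecloth.api :as tc])

;; Hypothetical toy data, for illustration only.
(def toy-ds
  (tc/dataset {:a [1.0 2.0 3.0]
               :b [4.0 5.0 6.0]
               :y [0 1 0]}))

(def chained
  (dm/create-design-matrix
   toy-ds
   [:y]                          ; target spec
   [[:sum    '(+ :a :b)]         ; derived from existing columns
    [:sum-sq '(* :sum :sum)]]))  ; refers to :sum, defined one line earlier

;; :a and :b are not listed in either spec, so they should be absent here.
(tc/column-names chained)
```

Because specs are evaluated in order, swapping the :sum and :sum-sq entries would presumably fail, since :sum would not exist yet when :sum-sq is computed.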

Shorthand Syntax:

  • :column-name - keeps the column unchanged (identity function)
  • nil '(+ a b) - auto-generates a column name for the derived column
  • '(+ a b) - same as above
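
The shorthand forms might be combined as in the sketch below. The dataset is hypothetical, and two things are assumptions rather than confirmed behaviour: that the nil form is written as a pair [nil expr], and that keyword column references work here as they do in the docstring examples. The auto-generated column names are not documented in this section, so the sketch only inspects them.

```clojure
(require '[scicloj.metamorph.ml.design-matrix :as dm]
         '[tablecloth.api :as tc])

;; Hypothetical toy data, for illustration only.
(def toy-ds
  (tc/dataset {:a [1.0 2.0 3.0]
               :b [4.0 5.0 6.0]
               :y [0 1 0]}))

(def shorthand
  (dm/create-design-matrix
   toy-ds
   [:y]
   [:a                  ; bare keyword: keep :a unchanged
    [nil '(+ :a :b)]    ; nil name: the column name is generated (assumed pair form)
    '(* :a :b)]))       ; bare quoted expression: same, name is generated

;; Inspect which names were generated for the two derived columns.
(tc/column-names shorthand)
```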

Available Aliases (no qualification needed):

  • ds - tech.v3.dataset
  • tc - tablecloth.api
  • tcc - tablecloth.column.api
  • All of clojure.core
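
A short sketch of using one of these aliases inside a spec follows. The data is hypothetical, and the choice of tcc/* (element-wise multiplication from tablecloth.column.api) is only an illustration of an aliased call; keyword column references are assumed to work as in the docstring examples.

```clojure
(require '[scicloj.metamorph.ml.design-matrix :as dm]
         '[tablecloth.api :as tc])

;; Hypothetical toy data, for illustration only.
(def toy-ds
  (tc/dataset {:a [1.0 2.0 3.0]
               :b [4.0 5.0 6.0]
               :y [0 1 0]}))

(def with-alias
  (dm/create-design-matrix
   toy-ds
   [:y]
   ;; tcc is resolved by the spec evaluator; no require for it is needed here.
   [[:product '(tcc/* :a :b)]]))

(tc/column-names with-alias)
```

Functions from namespaces outside this list would need to be written fully qualified inside the quoted expression.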

Features:

  • Derives new columns from existing data
  • Expands array and map columns into separate columns
  • Automatically converts categorical columns to numbers
  • Sets inference target(s) for supervised learning
  • Chains transformations in dependency order
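
To see the target and categorical handling in practice, one could inspect the result as sketched below. The dataset and its string-valued :label column are hypothetical; inference-target-column-names comes from tech.v3.dataset.modelling, and keyword column references are assumed to work as in the docstring examples.

```clojure
(require '[scicloj.metamorph.ml.design-matrix :as dm]
         '[tablecloth.api :as tc]
         '[tech.v3.dataset.modelling :as ds-mod])

;; Hypothetical data with a categorical (string) target column.
(def labelled-ds
  (tc/dataset {:a [1.0 2.0 3.0]
               :b [4.0 5.0 6.0]
               :label ["yes" "no" "yes"]}))

(def dmat
  (dm/create-design-matrix
   labelled-ds
   [:label]                 ; becomes the inference target
   [[:sum '(+ :a :b)]]))

;; Which columns are marked as inference targets?
(ds-mod/inference-target-column-names dmat)

;; Per-column summary, useful for checking that :label is now numeric.
(tc/info dmat)
```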

Limitations:

  • Does not automatically expand categorical variables the way R formulae do (specify the expansion manually)
  • Design matrix approach is more flexible but less compact than R formula syntax

See also: fastmath.ml/lm for linear regression with formula-based transformations

Categories

    Other vars: create-design-matrix

    create-design-matrix

    (create-design-matrix ds targets-specs features-specs)

    Converts the given dataset into a fully numeric dataset.

    • ds is the tech.v3.dataset to transform
    • targets-specs specifies how to transform the target variables
    • features-specs specifies how to transform the features

    The ‘spec’ can express several types of dataset transformations in a compact way:

    • add new derived columns
    • remove columns
    • rename columns
    • convert columns to categorical
    • set inference target

    Column specs are generally given as pairs of colname and function.

    The function must be given as a list (quoted with '), and can refer to column names.

    Specs are evaluated from top to bottom, and later specs can refer to columns defined by earlier ones.

    Columns that are not listed get removed.

    Special syntax:

    The following aliases can be used as part of the spec (other functions need to be fully qualified):

    • ds (tech.v3.dataset)
    • tc (tablecloth.api)
    • tcc (tablecloth.column.api)

    Symbols from clojure.core can be used without qualification.

    Example:

    (dm/create-design-matrix
     ds
     [:y]
     [[:sum '(+ :a :b :c)]])

    This will:

    • set the inference target to :y
    • create a new derived variable :sum, the sum of :a, :b and :c
    • remove all columns except :y and :sum

    This covers a range of cases, but is not as complete as R formulae. In particular, it does not handle automatic expansion of categorical variables, but these can be specified manually.

    See design_matrix_test.clj for more examples.

    For model type :fastmath/ols (linear regression), a different way of expressing arbitrary ‘row transformations’ is supported via the :transformer option; see the fastmath.ml/lm documentation.

    Examples

    Usage

    (-> (rdatasets/datasets-iris)
        (ds/drop-columns [:rownames])
        (create-design-matrix [:species]
                              [:petal-length
                               [:sepal-ratio
                                (quote (/ :sepal-length :sepal-width))]])
        str)
    ;;=> https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html [150 3]:
    ;;=> 
    ;;=> | :petal-length | :species | :sepal-ratio |
    ;;=> |--------------:|---------:|-------------:|
    ;;=> |           1.4 |        0 |   1.45714286 |
    ;;=> |           1.4 |        0 |   1.63333333 |
    ;;=> |           1.3 |        0 |   1.46875000 |
    ;;=> |           1.5 |        0 |   1.48387097 |
    ;;=> |           1.4 |        0 |   1.38888889 |
    ;;=> |           1.7 |        0 |   1.38461538 |
    ;;=> |           1.4 |        0 |   1.35294118 |
    ;;=> |           1.5 |        0 |   1.47058824 |
    ;;=> |           1.4 |        0 |   1.51724138 |
    ;;=> |           1.5 |        0 |   1.58064516 |
    ;;=> |           ... |      ... |          ... |
    ;;=> |           5.4 |        1 |   2.22580645 |
    ;;=> |           5.6 |        1 |   2.16129032 |
    ;;=> |           5.1 |        1 |   2.22580645 |
    ;;=> |           5.1 |        1 |   2.14814815 |
    ;;=> |           5.9 |        1 |   2.12500000 |
    ;;=> |           5.7 |        1 |   2.03030303 |
    ;;=> |           5.2 |        1 |   2.23333333 |
    ;;=> |           5.0 |        1 |   2.52000000 |
    ;;=> |           5.2 |        1 |   2.16666667 |
    ;;=> |           5.4 |        1 |   1.82352941 |
    ;;=> |           5.1 |        1 |   1.96666667 |