scicloj.metamorph.ml
Core machine learning framework integrating metamorph pipelines with standardized model APIs.
This is the central namespace of metamorph.ml, providing infrastructure for:
- Registering and using machine learning models
- Training models and making predictions
- Evaluating pipelines via cross-validation
- Standardized model diagnostics (glance, tidy, augment)
- Optional caching of computationally expensive operations
Key Concepts:
Model Registration: Models are registered using define-model! and can be referenced by keyword (e.g., :fastmath/ols, :metamorph.ml/dummy-classifier). Models define a train-fn, predict-fn, and optional diagnostic functions.
Training and Prediction:
- train: Train a model on a dataset given options including :model-type
- predict: Make predictions using a trained model
- train-predict-cache: Optional cache to avoid redundant computations
Pipeline Evaluation:
- evaluate-pipelines: Evaluate multiple pipelines across train/test splits
- evaluate-one-pipeline: Evaluate a single pipeline with cross-validation
- Returns results sorted by metric performance with optional filtering
- Supports parallel evaluation (:map/:pmap/:ppmap)
Model Diagnostics (following tidymodels conventions):
- glance: One-row model summary (goodness-of-fit)
- tidy: One-row-per-component output (coefficients with statistics)
- augment: One-row-per-observation output (predictions, residuals)
Main API Functions:
- define-model!: Register a new model type with train/predict/diagnostic functions
- train: Train a model with a specified model-type
- predict: Generate predictions from a trained model
- evaluate-pipelines: Evaluate pipelines with cross-validation
- glance: Get model summary statistics
- tidy: Extract coefficient-level results
- augment: Add predictions and residuals to data
Pipeline Integration:
Models integrate with metamorph pipelines via the model step, which:
- Trains in :fit mode using training data
- Predicts in :transform mode on new data
- Stores model output column metadata for later evaluation
Built-in Models:
Regression:
- :metamorph.ml/ols: Apache Commons Math OLS
- :fastmath/ols: FastMath OLS
- :fastmath/glm: FastMath GLM
- :metamorph.ml/dummy-regressor: Mean baseline
Classification:
- :metamorph.ml/dummy-classifier: Majority class or random baseline
- :metamorph.ml/random-forest: Random forest classifier
Preprocessing:
See specific namespaces for transformers:
- scicloj.metamorph.ml.preprocessing: Scaling and normalization
- scicloj.metamorph.ml.categorical: One-hot encoding
- scicloj.metamorph.ml.r-model-matrix: R formula features
See also: scicloj.metamorph.core for metamorph pipeline mechanics, scicloj.metamorph.ml.tidy-models for diagnostic validation
Categories
- Model definition: define-model! hyperparameters model-definition-names model-definitions* options->model-def
- Model diagnostics: augment explain glance loglik tidy
- Model lifecycle: evaluate-pipelines model optimize-hyperparameter predict train
Other vars: default-loss-fn enable-strict-prediction-validations ensemble-pipe plot prediction-column-meta-schema probability-column-meta-schema thaw-model train-predict-cache
augment
(augment model data)
Adds information about observations to a dataset.
Potential column names are these: https://raw.githubusercontent.com/scicloj/metamorph.ml/main/resources/columms-augment.edn
No other column names should be used.
Each model will only return a small subset of the possible columns.
A model might not implement this function, in which case the dataset is returned unchanged.
Examples
Use augment after regression, which adds ‘residual’ and ‘fitted’ to data
(require (quote scicloj.metamorph.ml.classification))
;;=> nil
(let [ds (-> (scicloj.metamorph.ml.rdatasets/datasets-mtcars)
(ds/drop-columns [:rownames])
(ds/categorical->number cf/categorical)
(ds-mod/set-inference-target :mpg))
model (train ds {:model-type :fastmath/ols})]
(str (augment model ds)))
;;=> https://vincentarelbundock.github.io/Rdatasets/doc/datasets/mtcars.html [32 16]:
;;=>
;;=> | :mpg | :cyl | :disp | :hp | :drat | :wt | :qsec | :vs | :am | :gear | :carb | :.resid | :.std.resid | :.fitted | :.cooksd | :.hat |
;;=> |-----:|-----:|------:|----:|------:|------:|------:|----:|----:|------:|------:|------------:|------------:|------------:|-----------:|-----------:|
;;=> | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | -1.59950576 | -0.72266589 | 22.59950576 | 0.02059098 | 0.30250648 |
;;=> | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | -1.11188608 | -0.49798984 | 22.11188608 | 0.00921835 | 0.29022074 |
;;=> | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | -3.45064408 | -1.49237335 | 26.25064408 | 0.06352411 | 0.23881707 |
;;=> | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 0.16259545 | 0.06981493 | 21.23740455 | 0.00013067 | 0.22773935 |
;;=> | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 | 1.00656597 | 0.42450871 | 17.69343403 | 0.00408314 | 0.19951177 |
;;=> | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | -2.28303904 | -1.01685467 | 20.38303904 | 0.03697080 | 0.28228409 |
;;=> | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 | -0.08625625 | -0.03964205 | 14.38625625 | 0.00006907 | 0.32591814 |
;;=> | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | 1.90398812 | 0.87785664 | 22.49601188 | 0.03454201 | 0.33023115 |
;;=> | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 | -1.61908990 | -1.20344045 | 24.41908990 | 0.37922064 | 0.74228696 |
;;=> | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 | 0.50097006 | 0.25023012 | 18.69902994 | 0.00428239 | 0.42932607 |
;;=> | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
;;=> | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 | -1.44305322 | -0.61574883 | 16.94305322 | 0.00960926 | 0.21801004 |
;;=> | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 | -2.53218150 | -1.05158327 | 17.73218150 | 0.02124258 | 0.17444502 |
;;=> | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 | -0.00602198 | -0.00295343 | 13.30602198 | 0.00000055 | 0.40807317 |
;;=> | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 | 2.50832101 | 1.06170734 | 16.69167899 | 0.02647385 | 0.20530539 |
;;=> | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 | -0.99346869 | -0.40473802 | 28.29346869 | 0.00246798 | 0.14216447 |
;;=> | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 | -0.15295396 | -0.09402469 | 26.15295396 | 0.00132940 | 0.62322569 |
;;=> | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 | 2.76372742 | 1.38260585 | 27.63627258 | 0.13168703 | 0.43109820 |
;;=> | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 | -3.07004080 | -1.99624210 | 18.87004080 | 0.71352055 | 0.66325159 |
;;=> | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 | 0.00617185 | 0.00298425 | 19.69382815 | 0.00000052 | 0.39101910 |
;;=> | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 | 1.05888162 | 0.66847870 | 13.94111838 | 0.07309138 | 0.64275732 |
;;=> | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 | -2.96826768 | -1.32969201 | 24.36826768 | 0.06581417 | 0.29050770 |
default-loss-fn
deprecated in 1.4.0
(default-loss-fn dataset)
Given a dataset, which must have exactly one inference target column, returns a default loss function. If the column is categorical, the loss is tech.v3.ml.loss/classification-loss; otherwise the loss is tech.v3.ml.loss/mae (mean absolute error).
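To make the two defaults concrete, here is a minimal plain-Clojure sketch of both kinds of loss. The functions below are illustrative stand-ins written for this page, not the tech.v3.ml.loss implementations; it assumes classification loss means the fraction of misclassified observations, and pick-default-loss is a hypothetical helper.

```clojure
;; Sketch only: plain-Clojure stand-ins, not the tech.v3.ml.loss functions.
(defn mae
  "Mean absolute error between two equally long numeric sequences."
  [y-true y-pred]
  (/ (reduce + (map (fn [t p] (Math/abs (double (- t p)))) y-true y-pred))
     (double (count y-true))))

(defn classification-loss
  "Fraction of predictions that differ from the true labels (assumed definition)."
  [y-true y-pred]
  (/ (count (filter false? (map = y-true y-pred)))
     (double (count y-true))))

(defn pick-default-loss
  "Hypothetical helper: chooses a loss from a flag saying whether the
  target column is categorical."
  [categorical-target?]
  (if categorical-target? classification-loss mae))

((pick-default-loss false) [1.0 2.0 3.0] [1.5 2.0 2.0])
;; => 0.5
((pick-default-loss true) [:a :b :a :b] [:a :b :b :b])
;; => 0.25
```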
define-model!
(define-model! model-kwd train-fn predict-fn {:keys [hyperparameters thaw-fn explain-fn loglik-fn tidy-fn glance-fn augment-fn plot-fn options documentation unsupervised?], :as opts})
Creates a model definition. An ML model is a function that takes a dataset and an options map and returns a model. A model is something that, combined with a dataset, produces an inferred dataset.
Examples
Define simple (noop) model
(define-model! :myns/model1
(fn train [feature-ds label-ds opts] "my model")
(fn predict [feature-ds thawed-model opts])
{})
;;=> :ok
(train (-> (ds/->dataset {:a [0], :b [1]})
(ds-mod/set-inference-target [:a]))
{:model-type :myns/model1})
;;=> {:feature-columns [:b],
;;=> :id #uuid "41d4757c-d27e-46d4-86fa-055a19a9800c",
;;=> :model-data "my model",
;;=> :options {:model-type :myns/model1},
;;=> :target-columns [:a],
;;=> :target-datatypes {:a :int64},
;;=> :train-input-hash nil}
enable-strict-prediction-validations
Atom controlling strict prediction validation behavior.
When set to true (via reset! or swap!), enables validation that throws exceptions during prediction if:
- Target categorical maps don’t match between training and prediction datasets
- Predicted values are not present in the prediction categorical map
Defaults to false for backward compatibility. Set to true to catch potential prediction inconsistencies early.
Example: (reset! enable-strict-prediction-validations true)
ensemble-pipe
(ensemble-pipe pipes)
Creates an ensemble pipeline from multiple pipelines using majority voting.
pipes - Sequence of metamorph pipeline functions
Returns a single metamorph pipeline function that trains all sub-pipelines in :fit mode and combines their predictions via majority voting in :transform mode. Each pipeline is trained independently on the same data.
In :fit mode, stores all fitted pipeline contexts. In :transform mode, runs predictions from all pipelines and selects the most common prediction for each observation.
The ensemble pipeline can be used anywhere a regular pipeline is accepted (e.g., in evaluate-pipelines).
| metamorph | . |
|---|---|
| Behaviour in mode :fit | Fits all sub-pipelines and stores their contexts |
| Behaviour in mode :transform | Runs all sub-pipelines and combines predictions by majority vote |
| Reads keys from ctx | In :transform: reads fitted sub-pipeline contexts |
| Writes keys to ctx | In :fit: stores all fitted contexts; In :transform: writes final prediction |
See also: scicloj.metamorph.ml/evaluate-pipelines
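The majority-voting step described above can be sketched in plain Clojure. This is only an illustration of the idea, not the code used by ensemble-pipe; majority-vote is a hypothetical helper, and how ties are resolved is not specified here.

```clojure
(defn majority-vote
  "Given one prediction sequence per pipeline (all of equal length),
  returns the most frequent prediction for each observation."
  [predictions-per-pipeline]
  (apply map
         (fn [& votes]
           ;; count the votes for one observation and keep the most frequent label
           (key (apply max-key val (frequencies votes))))
         predictions-per-pipeline))

(majority-vote [[:setosa :virginica  :setosa]
                [:setosa :versicolor :setosa]
                [:setosa :virginica  :versicolor]])
;; => (:setosa :virginica :setosa)
```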
evaluate-pipelines
deprecated in 1.4.0
(evaluate-pipelines pipeline-fn-or-decl-seq train-test-split-seq metric-fn loss-or-accuracy options)
(evaluate-pipelines pipeline-fn-seq train-test-split-seq metric-fn loss-or-accuracy)
Evaluates the performance of a seq of metamorph pipelines, which are supposed to have a model as the last step under key :model, which behaves correctly in modes :fit and :transform. The function scicloj.metamorph.ml/model is such a function.
This function calculates the accuracy or loss, as given by metric-fn, of each pipeline in pipeline-fn-or-decl-seq using all the train-test splits given in train-test-split-seq.
It runs the pipelines in mode :fit and in mode :transform for each pipeline in pipeline-fn-or-decl-seq and each split in train-test-split-seq.
The function returns a seq of seqs of evaluation results, per pipeline and per train-test split. Each evaluation result is a context map, which is specified in the malli schema attached to this function.
- pipeline-fn-or-decl-seq needs to be a sequence of pipeline functions or pipeline declarations which follow the metamorph approach. These functions are typically produced by calling scicloj.metamorph.core/pipeline.
- train-test-split-seq needs to be a sequence of maps containing the train and test datasets (tech.ml.dataset) at keys :train and :test. tablecloth.api/split->seq produces such splits. Supervised models require both keys (:train and :test), while unsupervised models only use :train.
- metric-fn Metric function to use, typically coming from tech.v3.ml.loss. For supervised models, the metric-fn receives the truth and the predicted values and should return a single double. The metric fn receives seqs without categorical maps; these get reverse-applied to the prediction, if present, before the values are passed to the metric fn. For unsupervised models, the function receives the fitted ctx and should return a single double as well. This metric is used to sort and eventually filter the results, depending on the options (:return-best-pipeline-only and :return-best-crossvalidation-only). The notion of "best" comes from metric-fn combined with loss-or-accuracy.
- loss-or-accuracy States whether the metric-fn is a loss or an accuracy calculation. Can be :loss or :accuracy. It decides the notion of "best" model: with :loss, pipelines with a lower metric are better; with :accuracy, pipelines with a higher metric are better.
- options map controls mainly performance-related parameters. This function can potentially produce a large amount of data, enough to make the JVM run out of memory. Memory consumption and parallelism can be controlled by the options below. The defaults are quite aggressive in removing details, and this can be tuned toward more or less detail via:
  - :return-best-pipeline-only - Only return information on the best-performing pipeline. Default is true.
  - :return-best-crossvalidation-only - Only return information on the best cross-validation (per pipeline returned). Default is true.
  - :map-fn - Controls parallelism by deciding whether map, pmap, ppmap or mapv is used to map over the different pipelines. Default is :map. Giving :run!, :prun!, pprun! executes the pipelines based on run!. In this case this function returns nothing, and it assumes that a side-effecting :evaluation-handler-fn is given (which can then return nil).
  - :evaluation-handler-fn - Gets called once with the complete result of an individual pipeline evaluation. It can be used to adapt the data returned for each evaluation and/or to perform side effects using the evaluation data. The result of this function is taken as the evaluation result. It needs to contain at minimum these two key paths: [:train-transform :metric] and [:test-transform :metric]. All other evaluation data can be removed, if desired. The evaluation-handler-fn can also return nil, in which case it should be used together with :map-fn :run!, :prun! or pprun!, so that it is executed for side effects only, which also means that memory consumption is minimal. It can be used for side effects, such as experiment tracking on disk or in a database. The passed-in evaluation result is a map with all information on the current evaluation, including the datasets used.
    The default handler function is scicloj.metamorph.ml/default-result-dissoc--in-fn, which removes the often large model object and the training data. identity can be used to retain all evaluation data, including the datasets. scicloj.metamorph.ml/result-dissoc-in-seq--all is available and reduces even more aggressively, keeping just the metrics.
  - :other-metrics Specifies other metrics to be calculated during evaluation.
This function also expects the ground truth of the target variable in the context at key path :model :scicloj.metamorph.ml/target-ds. See here for the simplest way to set this up: https://github.com/behrica/metamorph.ml/blob/main/README.md The function scicloj.ml.metamorph/model does this correctly.
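As an illustration of the :evaluation-handler-fn contract, the following sketch keeps only the two required key paths and drops everything else. The function name keep-metrics-only and the shape of the sample input map are assumptions made for this example; a real evaluation result contains many more keys.

```clojure
(defn keep-metrics-only
  "Hypothetical :evaluation-handler-fn: keeps only the two required key paths."
  [evaluation-result]
  {:train-transform {:metric (get-in evaluation-result [:train-transform :metric])}
   :test-transform  {:metric (get-in evaluation-result [:test-transform :metric])}})

;; A made-up evaluation result, standing in for the much larger real map.
(keep-metrics-only
 {:fit-ctx         {:model :large-model-object}
  :train-transform {:metric 1.0  :ctx :large-context}
  :test-transform  {:metric 0.97 :ctx :large-context}})
;; => {:train-transform {:metric 1.0}, :test-transform {:metric 0.97}}
```

Such a function would be passed in the options map, for example as {:evaluation-handler-fn keep-metrics-only}.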
explain
(explain model & [options])
Explains (if possible) an ML model. A model explanation is a model-specific map of data that usually indicates some mapping between features and their importance.
Examples
explain (= show feature importance) of :random-forest model
(require (quote scicloj.metamorph.ml.classification))
;;=> nil
(let [training-data (-> (scicloj.metamorph.ml.rdatasets/datasets-iris)
(ds/drop-columns [:rownames])
(ds-mod/set-inference-target [:species]))
model (train training-data
{:model-type :metamorph.ml/random-forest})]
(explain model))
;;=> {:feature-importance {:petal-length 0.38887991321808574,
;;=> :petal-width 0.3883721466983636,
;;=> :sepal-length 0.16562420661481295,
;;=> :sepal-width 0.05712373346873774}}
glance
(glance model)
Gives a glance at the model, returning a dataset with information about the entire model.
Potential column names are these: https://raw.githubusercontent.com/scicloj/metamorph.ml/main/resources/columms-glance.edn
No other column names should be used.
Each model will only return a small subset of the possible columns.
The list of allowed column names might change over time.
A model might not implement this function, in which case an empty dataset is returned.
Examples
Use glance after regression, which gives basic regression information
(require (quote scicloj.metamorph.ml.classification))
;;=> nil
(let [ds (-> (scicloj.metamorph.ml.rdatasets/datasets-mtcars)
(ds/drop-columns [:rownames])
(ds/categorical->number cf/categorical)
(ds-mod/set-inference-target :mpg))
model (train ds {:model-type :fastmath/ols})]
(str (glance model)))
;;=> _unnamed [1 13]:
;;=>
;;=> | :p.value | :statistic | :adj.r.squared | :n | :mse | :rss | :df | :df.residual | :aic | :bic | :totss | :r.squared | :log-lik |
;;=> |-----------:|------------:|---------------:|---:|------------:|-------------:|----:|-------------:|-------------:|-------------:|-------------:|-----------:|-------------:|
;;=> | 0.00000038 | 13.93246369 | 0.80664232 | 32 | 97.85527575 | 147.49443002 | 10 | 21 | 163.70981043 | 181.29864127 | 1126.0471875 | 0.86901576 | -69.85490522 |
hyperparameters
(hyperparameters model-kwd)
Retrieves the hyperparameters definition for a model type.
model-kwd - Keyword identifying the model type (e.g., :smile.classification/random-forest)
Returns the hyperparameters map specified during model registration, or nil if no hyperparameters were defined. Hyperparameters describe tunable options for the model.
Used for introspection and hyperparameter tuning/grid search.
See also: define-model!, options->model-def
loglik
(loglik model y yhat)
Calculates the log-likelihood for the given model and predictions.
- model - Trained model map containing :options with the model definition
- y - Actual target values (ground truth)
- yhat - Predicted values from the model
Returns the log-likelihood value by calling the model’s :loglik-fn function. The specific log-likelihood function used depends on the model type.
See also: scicloj.metamorph.ml/tidy, scicloj.metamorph.ml/glance
model
(model options)
Executes a machine learning model from the metamorph.ml model registry in train or predict mode (depending on :mode).
The model is passed between both invocations via the shared context ctx, in a key (a step identifier) which is passed in key :metamorph/id and guaranteed to be unique for each pipeline step.
The function writes to and reads from this common context key.
Options:
- :model-type - Keyword for the model to use
Further options get passed to the train functions and are model-specific.
See here for an overview of the models built into scicloj.ml:
https://scicloj.github.io/scicloj.ml-tutorials/userguide-models.html
Other libraries might contribute other models, which are documented as part of the library.
| metamorph | . |
|---|---|
| Behaviour in mode :fit | Calls scicloj.metamorph.ml/train using data in :metamorph/data and options, and stores the trained model in ctx under the key in :metamorph/id |
| Behaviour in mode :transform | Reads trained model from ctx and calls scicloj.metamorph.ml/predict with the model in $id and data in :metamorph/data |
| Reads keys from ctx | In mode :transform : Reads trained model to use for prediction from key in :metamorph/id. |
| Writes keys to ctx | In mode :fit : Stores trained model in key $id and writes feature-ds and target-ds before prediction into ctx at :scicloj.metamorph.ml/feature-ds /:scicloj.metamorph.ml/target-ds |
See also:
- scicloj.metamorph.ml/train
- scicloj.metamorph.ml/predict
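The fit/transform behaviour in the table above can be illustrated with a toy step written in plain Clojure. This is a sketch of the general pattern only, not the implementation of model: mean-model-step is hypothetical, its "model" is just the mean of a seq of numbers, and it ignores datasets, feature-ds and target-ds entirely.

```clojure
(defn mean-model-step
  "Hypothetical step: 'trains' by computing the mean of the data (a seq of
  numbers) and 'predicts' by repeating that mean once per input value."
  []
  (fn [{:metamorph/keys [id mode data] :as ctx}]
    (case mode
      ;; :fit stores the trained model in ctx under the step id
      :fit       (assoc ctx id {:mean (/ (reduce + data) (double (count data)))})
      ;; :transform reads the model back and replaces the data with predictions
      :transform (assoc ctx :metamorph/data
                        (repeat (count data) (:mean (get ctx id)))))))

(def step (mean-model-step))

(def fitted-ctx
  (step {:metamorph/id :model :metamorph/mode :fit :metamorph/data [1 2 3]}))

(:model fitted-ctx)
;; => {:mean 2.0}

(:metamorph/data
 (step (assoc fitted-ctx :metamorph/mode :transform :metamorph/data [10 20])))
;; => (2.0 2.0)
```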
Examples
Pipeline incl. the model step
(mm/pipeline (tc-mm/drop-columns [:a])
{:metamorph/id :model}
(model {:model-type :metamorph.ml/dummy-classifier}))
;;=> clojure.core$partial$fn__5929@10e92c50
model-definition-names
(model-definition-names)
Returns a list of all registered model definition names.
Returns a sequence of keywords representing all model types that have been registered via define-model!. These can be used as the :model-type value when training models.
Example: [:metamorph.ml/dummy-classifier ...]
See also: define-model!, options->model-def
optimize-hyperparameter
(optimize-hyperparameter pipeline-fn-or-decl-seq train-test-split-seq metric-def options)
(optimize-hyperparameter pipeline-fn-or-decl-seq train-test-split-seq metric-def)
Finds optimal hyperparameters in a machine learning pipeline using cross-validation.
It evaluates the performance of a seq of metamorph pipelines, which are supposed to have a model as the last step under key :model, which behaves correctly in modes :fit and :transform. The function scicloj.metamorph.ml/model is such a function.
This function calculates the accuracy or loss, as given by metric-def, of each pipeline in pipeline-fn-or-decl-seq using all the train-test splits given in train-test-split-seq.
It runs the pipelines in mode :fit and in mode :transform for each pipeline in pipeline-fn-or-decl-seq and each split in train-test-split-seq.
The function returns a seq of seqs of evaluation results, per pipeline and per train-test split. Each evaluation result is a context map, which is specified in the malli schema attached to this function.
- pipeline-fn-or-decl-seq needs to be a sequence of pipeline functions or pipeline declarations which follow the metamorph approach. These functions are typically produced by calling scicloj.metamorph.core/pipeline. Each item in this list represents one hyperparameter setting. scicloj.metamorph.ml.gridsearch/sobol-gridsearch can be used to generate model option variants from a search grid definition. The different pipelines can contain one model with several options, multiple models with different options, or even different preprocessing instructions per pipeline.
- train-test-split-seq needs to be a sequence of maps containing the train and test datasets (tech.ml.dataset) at keys :train and :test. tablecloth.api/split->seq produces such splits. Supervised models require both keys (:train and :test), while unsupervised models only use :train. The nested loop over pipeline-fn-or-decl-seq and train-test-split-seq decides how many evaluations are run, and results are returned in detailed and aggregated form.
- metric-def Metric definition to use. It is a map with keys:
  - :metric - a keyword specifying the metric fn to use; see scicloj.metamorph.ml.column-metric/classification-metric and scicloj.metamorph.ml.column-metric/regression-metric for available metric functions
  - :averaging - :macro or :micro, decides how to average over the results of the usually binary classification functions
  - :loss-or-accuracy - :accuracy or :loss, states whether the fn calculates an accuracy (higher values are better) or a loss (lower values are better)
  - :options - optional options for the metric fn
  This metric is used to sort and eventually filter the results, depending on the options (:return-best-pipeline-only and :return-best-crossvalidation-only). The notion of "best" comes from the metric combined with :loss-or-accuracy.
- options map controls mainly performance-related parameters. This function can potentially produce a large amount of data, enough to make the JVM run out of memory. Memory consumption and parallelism can be controlled by the options below. The defaults are quite aggressive in removing details, and this can be tuned toward more or less detail via:
  - :return-best-pipeline-only - Only return information on the best-performing pipeline. Default is true.
  - :return-best-crossvalidation-only - Only return information on the best cross-validation (per pipeline returned). Default is true.
  - :map-fn - Controls parallelism by deciding whether map, pmap, ppmap or mapv is used to map over the different pipelines. Default is :map. Giving :run!, :prun!, pprun! executes the pipelines based on run!. In this case this function returns nothing, and it assumes that a side-effecting :evaluation-handler-fn is given (which can then return nil).
  - :evaluation-handler-fn - Gets called once with the complete result of an individual pipeline evaluation. It can be used to adapt the data returned for each evaluation and/or to perform side effects using the evaluation data. The result of this function is taken as the evaluation result. It needs to contain at minimum these two key paths: [:train-transform :metric] and [:test-transform :metric]. All other evaluation data can be removed, if desired. The evaluation-handler-fn can also return nil, in which case it should be used together with :map-fn :run!, :prun! or pprun!, so that it is executed for side effects only, which also means that memory consumption is minimal. It can be used for side effects, such as experiment tracking on disk or in a database. The passed-in evaluation result is a map with all information on the current evaluation, including the datasets used.
    The default handler function is scicloj.metamorph.ml/default-result-dissoc--in-fn, which removes the often large model object and the training data. identity can be used to retain all evaluation data, including the datasets. scicloj.metamorph.ml/result-dissoc-in-seq--all is available and reduces even more aggressively, keeping just the metrics.
  - :other-metrics Specifies other metrics to be calculated during evaluation.
This function also expects the ground truth of the target variable in the context at key path :model :scicloj.metamorph.ml/target-ds. See here for the simplest way to set this up: https://github.com/behrica/metamorph.ml/blob/main/README.md The function scicloj.ml.metamorph/model does this correctly.
(This function replaces the deprecated function evaluate-pipelines. Its main difference is the changed handling of metric functions.)
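The nested loop over pipelines and splits, and the way :loss-or-accuracy determines what "best" means, can be sketched in plain Clojure as follows. The helpers evaluate-all and best-first and the metric values are made up for illustration; the actual function does considerably more (fitting, transforming, filtering, result handling).

```clojure
(defn evaluate-all
  "Runs evaluate-fn (a hypothetical function of pipeline and split that
  returns a map with a :metric) for every pipeline and every split."
  [pipelines splits evaluate-fn]
  (for [pipeline pipelines]
    (for [split splits]
      (evaluate-fn pipeline split))))

(defn best-first
  "Sorts evaluation results so that the best one comes first."
  [loss-or-accuracy results]
  (case loss-or-accuracy
    :loss     (sort-by :metric < results)   ; lower is better
    :accuracy (sort-by :metric > results))) ; higher is better

;; Made-up metrics per pipeline and split, standing in for real evaluations.
(def fake-results
  (evaluate-all [:pipe-a :pipe-b]
                [:split-1 :split-2]
                (fn [pipeline split]
                  {:pipeline pipeline
                   :split    split
                   :metric   (get-in {:pipe-a {:split-1 0.90 :split-2 0.80}
                                      :pipe-b {:split-1 0.95 :split-2 0.70}}
                                     [pipeline split])})))

(map count fake-results)
;; => (2 2)

(:pipeline (first (best-first :accuracy (apply concat fake-results))))
;; => :pipe-b

(:metric (first (best-first :loss (apply concat fake-results))))
;; => 0.7
```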
Examples
Simple call to optimize-hyperparameter using single :holdout split and single pipeline. (so we try only one configuration)
(require (quote scicloj.metamorph.ml))
;;=> nil
(def result
(let [iris (-> (rdatasets/datasets-iris)
(ds/remove-columns [:rownames])
(ds-mod/set-inference-target [:species])
(ds/categorical->number cf/categorical))
split (tc/split->seq iris :holdout {:ratio [0.1 0.9]})
pipe (mm/pipeline {:metamorph/id :model}
(ml/model {:model-type
:metamorph.ml/random-forest}))
result (ml/optimize-hyperparameter
[pipe]
split
{:metric :accuracy,
:metric-type :classification,
:loss-or-accuracy :accuracy})]
result))
;;=> #'examples/result
(def train-accuracy
(-> result
first
first
:train-transform
:metric))
;;=> #'examples/train-accuracy
train-accuracy
;;=> 1.0
(def test-accuracy
(-> result
first
first
:test-transform
:metric))
;;=> #'examples/test-accuracy
test-accuracy
;;=> 0.9753086419753086
Grid-search over hyperparameter space of :random-forest model
(def iris
(-> (rdatasets/datasets-iris)
(tc/drop-columns [:rownames])
(ds/categorical->number [:species])
(ds-mod/set-inference-target :species)))
;;=> #'examples/iris
(def iris-split (tc/split->seq iris))
;;=> #'examples/iris-split
(def hyperparms-space
(ml/hyperparameters :smile.classification/random-forest))
;;=> #'examples/hyperparms-space
(def search-space (take 10 (gs/sobol-gridsearch hyperparms-space)))
;;=> #'examples/search-space
(def model-options
(map (fn [m]
(assoc m :model-type :smile.classification/random-forest))
search-space))
;;=> #'examples/model-options
(def pipeline-fns
(map (fn [opts] (mm/pipeline {:metamorph/id :model} (ml/model opts)))
model-options))
;;=> #'examples/pipeline-fns
(def results
(ml/optimize-hyperparameter pipeline-fns
iris-split
{:metric :accuracy,
:loss-or-accuracy :accuracy,
:metric-type :classification}
{:return-best-pipeline-only false}))
;;=> #'examples/results
(-> (map (fn [res]
(hash-map :options (-> res
first
:fit-ctx
:model
:options)
:test-accuracy (-> res
first
:test-transform
:metric)))
results)
(tc/dataset)
(tc/map-column->columns :options)
str)
;;=> _unnamed [10 8]:
;;=>
;;=> | :test-accuracy | :options-trees | :options-max-depth | :options-max-nodes | :options-node-size | :options-sample-rate | :options-split-rule | :options-model-type |
;;=> |---------------:|---------------:|-------------------:|-------------------:|-------------------:|---------------------:|-----------------------|-------------------------------------|
;;=> | 1.00000000 | 750 | 32 | 32 | 26 | 0.77272727 | :classification-error | :smile.classification/random-forest |
;;=> | 0.91111111 | 880 | 89 | 20 | 38 | 0.89090909 | :entropy | :smile.classification/random-forest |
;;=> | 0.84444444 | 940 | 15 | 71 | 20 | 0.38181818 | :entropy | :smile.classification/random-forest |
;;=> | 0.53333333 | 200 | 38 | 94 | 44 | 0.60909091 | :entropy | :smile.classification/random-forest |
;;=> | 0.53333333 | 130 | 66 | 43 | 13 | 0.20909091 | :entropy | :smile.classification/random-forest |
;;=> | 0.53333333 | 380 | 43 | 66 | 88 | 0.43636364 | :gini | :smile.classification/random-forest |
;;=> | 0.53333333 | 510 | 55 | 55 | 51 | 0.55454545 | :entropy | :smile.classification/random-forest |
;;=> | 0.55555556 | 630 | 20 | 89 | 63 | 0.66363636 | :classification-error | :smile.classification/random-forest |
;;=> | 0.57777778 | 260 | 77 | 77 | 75 | 0.32727273 | :entropy | :smile.classification/random-forest |
;;=> | 0.60000000 | 690 | 82 | 49 | 94 | 0.15454545 | :classification-error | :smile.classification/random-forest |
options->model-def
(options->model-def options)
Retrieves the model definition corresponding to the :model-type option.
options - Map containing at minimum a :model-type keyword
Returns the model definition map registered for the given :model-type. Throws an exception if the model type is not found, suggesting a missing namespace require.
Used internally to look up train/predict functions and model metadata.
See also: define-model!, model-definition-names, hyperparameters
predict
(predict dataset-or-dmatrix {:keys [feature-columns options train-input-hash], :as model})
Predict returns a dataset with only the predictions in it.
- For regression, a single column dataset is returned with the column named after the target
- For classification, a dataset is returned with a float64 column for each target value and values that describe the probability distribution.
Each implementing model should construct its prediction in a shape expressed by the following keys it receives:
- :target-column
- :target-datatypes
- :target-categorical-maps
Any implementing model needs to behave symmetrically between the datatype in the target columns of the training data and the datatype of the prediction columns. A model can decide not to accept certain datatypes in the target columns of the training data (and fail with an exception). But any model should try to minimize this and accept, for categorical data:
- all numeric types ( :int32, :int64, :float32, :float64)
- string
- categorical maps
It needs to be symmetric and return the same datatype in prediction as it receives in training:
- numeric in train -> same numeric in predict
- string in train -> string in predict
- categorical map in train -> equivalent categorical map in predict
ml/train passes the needed information about the train target column to the model implementation to do this.
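The symmetry requirement for string targets can be illustrated with a small plain-Clojure sketch: labels are encoded to numbers for training, and numeric predictions are decoded back with the inverse of the same lookup table. The lookup-table value and the encode and decode helpers are hypothetical; they are not the data structures that ml/train actually passes to a model.

```clojure
(require '[clojure.set :as set])

(def lookup-table
  "Hypothetical categorical map built from the training target column."
  {"setosa" 0, "versicolor" 1, "virginica" 2})

(defn encode
  "Maps string labels to their numeric codes."
  [labels]
  (map lookup-table labels))

(defn decode
  "Maps numeric predictions back to the original string labels."
  [codes]
  (map (set/map-invert lookup-table) codes))

(encode ["setosa" "virginica"])
;; => (0 2)

(decode [2 0 1])
;; => ("virginica" "setosa" "versicolor")
```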
Examples
train/predict on (split) iris data
(let [iris (-> (scicloj.metamorph.ml.rdatasets/datasets-iris)
(ds/drop-columns [:rownames])
(ds-mod/set-inference-target [:species]))
split (ds-mod/train-test-split iris)
model (scicloj.metamorph.ml/train (:train-ds split)
{:model-type
:metamorph.ml/random-forest})]
(predict (:test-ds split) model))
;;=> {:species
;;=> #tech.v3.dataset.column[45]
;;=> :species
;;=> [virginica, setosa, versicolor, virginica, virginica, virginica, versicolor, versicolor, setosa, versicolor, setosa, versicolor, setosa, setosa, versicolor, setosa, setosa, setosa, versicolor, versicolor...]}
thaw-model
(thaw-model model {:keys [thaw-fn], :as opts})
(thaw-model model)
Thaws a frozen model for use in predictions.
Models returned from train may be ‘frozen’ (serialized) for storage efficiency. A ‘thaw’ operation deserializes the model for use. This happens automatically during predict, but you can manually thaw and cache the model under :thawed-model for faster repeated predictions on small datasets.
- model - Model map from train containing :model-data
- opts - Optional map with :thaw-fn to override the model’s thaw function
Returns the thawed model data ready for prediction. If already thawed and cached, returns the cached version.
tidy
(tidy model)
Summarizes information about model components. Returns a dataset with columns from this list: https://raw.githubusercontent.com/scicloj/metamorph.ml/main/resources/columms-tidy.edn
No other column names should be used.
Each model will only return a small subset of the possible columns.
The list of allowed column names might change over time.
A model might not implement this function, in which case an empty dataset is returned.
Examples
Use tidy after regression, which gives basic information per term
(require (quote scicloj.metamorph.ml.classification))
;;=> nil
(let [ds (-> (scicloj.metamorph.ml.rdatasets/datasets-mtcars)
(ds/drop-columns [:rownames])
(ds/categorical->number cf/categorical)
(ds-mod/set-inference-target :mpg))
model (train ds {:model-type :fastmath/ols})]
(str (tidy model)))
;;=> _unnamed [11 5]:
;;=>
;;=> | :term | :statistic | :estimate | :p.value | :std.error |
;;=> |-------|------------:|------------:|-----------:|------------:|
;;=> | :mpg | 0.65730581 | 12.30337416 | 0.51812440 | 18.71788443 |
;;=> | :cyl | -0.10663922 | -0.11144048 | 0.91608738 | 1.04502336 |
;;=> | :disp | 0.74675849 | 0.01333524 | 0.46348865 | 0.01785750 |
;;=> | :hp | -0.98684065 | -0.02148212 | 0.33495531 | 0.02176858 |
;;=> | :drat | 0.48130362 | 0.78711097 | 0.63527790 | 1.63537307 |
;;=> | :wt | -1.96118871 | -3.71530393 | 0.06325215 | 1.89441430 |
;;=> | :qsec | 1.12341328 | 0.82104075 | 0.27394127 | 0.73084480 |
;;=> | :vs | 0.15099145 | 0.31776281 | 0.88142347 | 2.10450861 |
;;=> | :am | 1.22540355 | 2.52022689 | 0.23398971 | 2.05665055 |
;;=> | :gear | 0.43891421 | 0.65541302 | 0.66520643 | 1.49325996 |
;;=> | :carb | -0.24062583 | -0.19941925 | 0.81217871 | 0.82875250 |
train
(train data options)
Given a dataset and an options map, produces a model. The :model-type keyword in the options map selects which model definition to use to train the model. Returns a map containing at least:
- :model-data - the result of that definition’s train-fn.
- :options - the options passed in.
- :id - new randomly generated UUID.
- :feature-columns - vector of column names.
- :target-columns - vector of column names.
- :target-datatypes - map of target column names -> target column types
- :target-categorical-maps - the categorical maps of the target columns, if present
A well-behaved model implementation should use
- :target-column
- :target-datatypes
- :target-categorical-maps
to construct its prediction dataset so that it matches the target column of the training data.
Examples
Train random-forest on iris data
(let [training-data (-> (scicloj.metamorph.ml.rdatasets/datasets-iris)
(ds/drop-columns [:rownames])
(ds-mod/set-inference-target [:species]))
model (train training-data
{:model-type :metamorph.ml/random-forest})]
(-> model
:model-data
:forest
:trees
count))
;;=> 100
(comment
"forest with hundred trees created")
;;=> nil
Do regression and calculate RMSE
(let [split (-> (rdatasets/datasets-iris)
(ds/remove-columns [:rownames :species])
(ds-mod/set-inference-target [:petal-width])
(ds-mod/train-test-split))
model (ml/train (:train-ds split) {:model-type :fastmath/ols})
prediction (ml/predict (:test-ds split) model)]
(col-metric/regression-metric (cf/target (:test-ds split))
prediction
:rmse))
;;=> 0.18028218869854934
train-predict-cache
Controls whether train/predict invocations are cached. If ‘use-cache’ is true, the get-fn and set-fn functions are called accordingly.
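How such a cache might be consulted can be sketched in plain Clojure. The key names :use-cache, :get-fn and :set-fn follow the description above, but the function signatures, the atom-backed store and the cached-call helper are assumptions made for this example, not the library's implementation.

```clojure
(def store (atom {}))

(def cache-config
  "Hypothetical cache settings; function signatures are assumptions."
  {:use-cache true
   :get-fn    (fn [k] (get @store k))
   :set-fn    (fn [k v] (swap! store assoc k v))})

(defn cached-call
  "Returns the cached value for k if caching is on and a value exists,
  otherwise calls compute-fn and stores its result."
  [{:keys [use-cache get-fn set-fn]} k compute-fn]
  (if-not use-cache
    (compute-fn)
    (if-some [hit (get-fn k)]
      hit
      (let [v (compute-fn)]
        (set-fn k v)
        v))))

;; Count how often the (pretend) expensive training actually runs.
(def calls (atom 0))
(defn expensive-train [] (swap! calls inc) :trained-model)

(cached-call cache-config :train-key expensive-train)
(cached-call cache-config :train-key expensive-train)
@calls
;; => 1
```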