11  🚧 Draft: pocket-model, drop-in caching for metamorph.ml

Last modified: 2026-02-08

This chapter shows how to cache model training in a metamorph.ml pipeline using Pocket. We define a small pocket-model function, a drop-in replacement for ml/model, and use it with cross-validation, grid search, and multiple model types.

Background

metamorph.ml is the Scicloj library for machine learning pipelines. It builds on metamorph, a data-transformation framework where each step is a function that takes a context map and returns an updated one. Pipelines run in one of two modes, :fit (learn from training data) and :transform (apply to new data), so that a pipeline can be trained once and reused for prediction.
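
To make the context-map contract concrete, here is a minimal sketch of a metamorph-style step in plain Clojure. The step center-step is hypothetical and is not part of metamorph or metamorph.ml; it also works on a plain vector of numbers rather than a dataset, and the step id is passed by hand, whereas a real pipeline supplies it.

```clojure
;; A hypothetical metamorph-style step: subtracts the training mean.
(defn center-step
  "Returns a step function that takes a context map and returns an updated one."
  []
  (fn [{:metamorph/keys [id data mode] :as ctx}]
    (case mode
      ;; :fit learns the mean, stores it under the step id, and centers the data
      :fit (let [mean (/ (reduce + data) (double (count data)))]
             (assoc ctx
                    id {:mean mean}
                    :metamorph/data (mapv #(- % mean) data)))
      ;; :transform reuses the stored mean on new data
      :transform (let [mean (get-in ctx [id :mean])]
                   (assoc ctx :metamorph/data (mapv #(- % mean) data))))))

(def step (center-step))

(def fitted
  (step {:metamorph/id :center
         :metamorph/mode :fit
         :metamorph/data [1.0 2.0 3.0]}))

(def transformed
  (step (assoc fitted
               :metamorph/mode :transform
               :metamorph/data [10.0])))

(:metamorph/data fitted)
;; => [-1.0 0.0 1.0]
(:metamorph/data transformed)
;; => [8.0]
```

The state learned in :fit travels inside the context map under the step id, which is the same mechanism pocket-model uses later in this chapter to carry the trained model into :transform.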

On top of this, metamorph.ml adds model training/prediction, cross-validation (evaluate-pipelines), loss functions, and hyperparameter search. A typical workflow looks like:

  1. Define a pipeline of preprocessing + model steps
  2. Split data into folds
  3. Call evaluate-pipelines to train and score across folds
  4. Compare results, pick the best model
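
Steps 2 and 3 can be sketched in plain Clojure without any library. Everything below (k-fold-indices, rmse, cross-validate, mean-model) is illustrative naming and makes no claim about how tablecloth or metamorph.ml implement splitting or scoring; the "model" simply predicts the training mean.

```clojure
(defn k-fold-indices
  "Splits row indices 0..n-1 into k folds; each fold is the test set once."
  [n k]
  (for [fold (range k)]
    {:test  (vec (filter #(= fold (mod % k)) (range n)))
     :train (vec (remove #(= fold (mod % k)) (range n)))}))

(defn rmse
  "Root mean squared error between two equally long sequences of numbers."
  [actual predicted]
  (Math/sqrt (/ (reduce + (map #(let [d (- %1 %2)] (* d d)) actual predicted))
                (double (count actual)))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn mean-model
  "A trivial model: always predicts the mean of the training targets."
  [train-ys]
  (let [m (mean train-ys)]
    (fn [_row-index] m)))

(defn cross-validate
  "Mean test RMSE over k folds. `ys` must be a vector of target values;
  `train-fn` takes training targets and returns a prediction function."
  [ys k train-fn]
  (mean (for [{:keys [train test]} (k-fold-indices (count ys) k)]
          (let [predict (train-fn (mapv ys train))]
            (rmse (mapv ys test) (mapv predict test))))))

;; Training-set sizes for 200 rows and 3 folds
(map (comp count :train) (k-fold-indices 200 3))
;; => (133 133 134)
```

Step 4 then amounts to computing such a score for each candidate and keeping the one with the lowest value.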

Why cache with Pocket?

metamorph.ml includes a built-in caching mechanism. This notebook explores what happens when we use Pocket's caching instead, bringing a few things that are natural to Pocket's design:

  • Disk persistence: cached models survive JVM restarts, so we can pick up where we left off across sessions

  • Content-based keys: cache keys are derived from function identity and full argument values via SHA-1

  • Concurrent dedup: when multiple threads request the same computation, only one trains and the rest wait for the result
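
The last two ideas can be illustrated with a small plain-Clojure sketch. This is not Pocket's implementation: content-key, key->path, and compute-once! are hypothetical names, the key is derived naively from the printed form of its inputs, and the two-character directory prefix only mirrors the layout visible in the cache-write log lines later in this chapter.

```clojure
(import '(java.security MessageDigest))

(defn sha1-hex
  "Hex-encoded SHA-1 digest of a string."
  [^String s]
  (let [digest (.digest (MessageDigest/getInstance "SHA-1") (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn content-key
  "Naive content-based key: hash the printed function name and arguments."
  [fn-name args]
  (sha1-hex (pr-str [fn-name args])))

(defn key->path
  "Places an entry under a directory named after the key's first two hex digits."
  [base-dir k]
  (str base-dir "/" (subs k 0 2) "/" k))

;; Concurrent dedup: one shared delay per key, so the body runs at most once.
(def in-flight (atom {}))

(defn compute-once!
  [k compute]
  (let [candidate (delay (compute))
        ;; the first caller's delay wins; later callers get that same delay
        winner (-> (swap! in-flight #(if (contains? % k) % (assoc % k candidate)))
                   (get k))]
    @winner))

(sha1-hex "abc")
;; => "a9993e364706816aba3e25717850c26c9cd0d89d"
```

Equal inputs always yield the same key, and any change to the function name or arguments yields a different one. In compute-once!, threads that lose the race dereference the winner's delay and block until its value is ready, so the expensive body is not repeated.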

The integration is lightweight: a pocket-model function that is a drop-in replacement for ml/model. We swap one pipeline step and everything else (evaluate-pipelines, preprocessing, grid search) stays the same.

What this gives us:

  • Same pipeline code, same evaluate-pipelines
  • Model training cached to disk (survives JVM restarts)
  • Graceful fallback for non-serializable models

What this notebook does not cover: because pocket-model plugs into metamorph.ml's existing pipeline machinery, only the model-training step is cached through Pocket. Preprocessing, splitting, and evaluation happen outside Pocket's awareness. There is no computational DAG tracking the full pipeline, no per-step storage control (choosing whether each step caches to disk, memory, or not at all), and no provenance trail that connects a final metric back to the data and parameters that produced it. A companion notebook is in the works, exploring a deeper integration where every pipeline step is a Pocket caching-fn, giving us all of those things.

Setup

(ns pocket-book.pocket-model
  (:require
   ;; Logging setup for this chapter (see Logging chapter):
   [pocket-book.logging]
   ;; Pocket API:
   [scicloj.pocket :as pocket]
   ;; Annotating kinds of visualizations:
   [scicloj.kindly.v4.kind :as kind]
   ;; Data processing:
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.dataset.modelling :as ds-mod]
   [tech.v3.dataset.column-filters :as cf]
   ;; Machine learning:
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.loss :as loss]
   [scicloj.metamorph.ml.regression]
   [scicloj.metamorph.core :as mm]
   [scicloj.ml.tribuo]))
(def cache-dir "/tmp/pocket-model")
(pocket/set-base-cache-dir! cache-dir)
10:06:45.235 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-model
"/tmp/pocket-model"
(pocket/cleanup!)
10:06:45.236 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-model
{:dir "/tmp/pocket-model", :existed false}

The pocket-model function

This is the core of the integration. It follows the same contract as ml/model: a metamorph step that trains in :fit mode and predicts in :transform mode. The only difference is that ml/train is wrapped with pocket/cached.

If Nippy can't serialize a model (e.g., Apache Commons Math OLS), pocket-model falls back to uncached training automatically.

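(defn pocket-model
  "Drop-in replacement for ml/model that caches training via Pocket.
  Falls back to uncached training if serialization fails."
  [options]
  (fn [{:metamorph/keys [id data mode] :as ctx}]
    (case mode
      :fit
      ;; Train through Pocket; deref yields the trained (or cached) model.
      (let [model (try
                    (deref (pocket/cached #'ml/train data options))
                    ;; Any exception, including a failed Nippy write,
                    ;; triggers a plain uncached retrain.
                    (catch Exception _e
                      (ml/train data options)))]
        ;; Store the model under this step's id for use in :transform.
        (assoc ctx id (assoc model :scicloj.metamorph.ml/unsupervised?
                             (get (ml/options->model-def options)
                                  :unsupervised? false))))
      :transform
      (let [model (get ctx id)]
        ;; Unsupervised models pass the context through unchanged.
        (if (get model :scicloj.metamorph.ml/unsupervised?)
          ctx
          ;; Otherwise record feature/target datasets and replace the data
          ;; with the predictions.
          (-> ctx
              (update id assoc
                      :scicloj.metamorph.ml/feature-ds (cf/feature data)
                      :scicloj.metamorph.ml/target-ds (cf/target data))
              (assoc :metamorph/data (ml/predict data model))))))))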

Test data

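Simple synthetic regression data: 200 rows, enough for quick feedback. The intent is y = 3x + noise, but note that the code below draws a fresh random number for the y column instead of reusing x, so y is effectively independent of x. This does not matter for demonstrating caching, and it explains why even the linear models later in the chapter score an RMSE near 9 rather than near the noise level of 2.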

(def ds (-> (let [rng (java.util.Random. 42)]
              (tc/dataset
               {:x (vec (repeatedly 200 #(* 10.0 (.nextDouble rng))))
                :y (vec (repeatedly 200 #(+ (* 3.0 (* 10.0 (.nextDouble rng)))
                                            (* 2.0 (.nextGaussian rng)))))}))
            (ds-mod/set-inference-target :y)))
(def splits (tc/split->seq ds :kfold {:k 3 :seed 42}))
(count splits)
3
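
In the generator above, the :y column calls .nextDouble again rather than reusing the values in :x, so the two columns are not actually related. A hypothetical variant in which y really follows 3x + noise could look like the sketch below. It builds plain column vectors only; passing the resulting map to tc/dataset and ds-mod/set-inference-target as in the original definition is assumed to work the same way. The metrics shown in this chapter come from the original generator, not from this variant.

```clojure
;; Hypothetical variant: y = 3x + Gaussian noise (standard deviation 2).
(def linear-columns
  (let [rng (java.util.Random. 42)
        xs  (vec (repeatedly 200 #(* 10.0 (.nextDouble rng))))
        ;; reuse each x instead of drawing a new uniform number
        ys  (mapv #(+ (* 3.0 %) (* 2.0 (.nextGaussian rng))) xs)]
    {:x xs :y ys}))

(count (:x linear-columns))
;; => 200
```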

Basic usage

Use pocket-model in place of ml/model. The {:metamorph/id :model} map step sets the step ID that evaluate-pipelines expects.

(def cart-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "cart"
                        :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                        :properties {:maxDepth "8"}}]
   :tribuo-trainer-name "cart"})
(def pipe-cart
  (mm/pipeline
   {:metamorph/id :model}
   (pocket-model cart-spec)))

The first run trains 3 models (one per fold):

(def results-1
  (ml/evaluate-pipelines
   [pipe-cart]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
10:06:45.249 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.255 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/7a/7a2371066976291d06fe1aad1b48bbeba167ff70
10:06:45.258 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.264 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/9d/9d2799f31ec89ab47c28abaedf1a94632d6e4912
10:06:45.268 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.274 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/75/752a5761fad71dd397dad959c21a078b67503a46
(mapv #(-> % :test-transform :metric) (flatten results-1))
[10.938693265902357 11.23113067170221 12.12978921023711]

Cache now has 3 entries (one per fold):

(pocket/cache-stats)
{:total-entries 3,
 :total-size-bytes 53368,
 :entries-per-fn {"scicloj.metamorph.ml/train" 3}}

The second run is all cache hits, with the same metrics:

(def results-2
  (ml/evaluate-pipelines
   [pipe-cart]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
(= (mapv #(-> % :test-transform :metric) (flatten results-1))
   (mapv #(-> % :test-transform :metric) (flatten results-2)))
true

Multiple model types

Compare CART, linear SGD, and fastmath OLS in the same evaluation. Each model type is cached independently.

(pocket/cleanup!)
10:06:45.588 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-model
{:dir "/tmp/pocket-model", :existed true}
(def sgd-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "squared"
                        :type "org.tribuo.regression.sgd.objectives.SquaredLoss"}
                       {:name "linear-sgd"
                        :type "org.tribuo.regression.sgd.linear.LinearSGDTrainer"
                        :properties {:objective "squared"
                                     :epochs "50"
                                     :loggingInterval "10000"}}]
   :tribuo-trainer-name "linear-sgd"})
(def multi-results
  (ml/evaluate-pipelines
   [(mm/pipeline {:metamorph/id :model} (pocket-model cart-spec))
    (mm/pipeline {:metamorph/id :model} (pocket-model sgd-spec))
    (mm/pipeline {:metamorph/id :model} (pocket-model {:model-type :fastmath/ols}))]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
10:06:45.589 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.596 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/7a/7a2371066976291d06fe1aad1b48bbeba167ff70
10:06:45.600 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.606 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/9d/9d2799f31ec89ab47c28abaedf1a94632d6e4912
10:06:45.610 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.618 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/75/752a5761fad71dd397dad959c21a078b67503a46
10:06:45.623 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 133 examples
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=133,max=32.285163,min=-3.003255,mean=15.591786,variance=84.043799})
10:06:45.632 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/6a/6ac08d75a9c1dfba5441528a6c2cb027b0986f6f
10:06:45.635 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 133 examples
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=133,max=31.652557,min=-1.736155,mean=15.631001,variance=80.557863})
10:06:45.643 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/cf/cff2b9e4351565863c5cf69ac6a1aa7a626936af
10:06:45.647 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 134 examples
Feb 09, 2026 10:06:45 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=134,max=32.285163,min=-3.003255,mean=16.262557,variance=77.697467})
10:06:45.658 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/21/2174cb3cdbabf34a7fd782c8efb1ab0084db8081
10:06:45.663 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.673 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/c2/c2ccc1e0dcf1c2c00d9c621178aafc97ec23e85e
10:06:45.676 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.681 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/42/42cb21d2dea2f78ba1450f2f2eb4c3683652e07f
10:06:45.683 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.689 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/3c/3cb60cf5b7576d22ded8c276361bcc0eac5d3c40

3 model types × 3 folds = 9 entries:

(pocket/cache-stats)
{:total-entries 9,
 :total-size-bytes 188853,
 :entries-per-fn {"scicloj.metamorph.ml/train" 9}}

Mean RMSE per model type:

(let [model-names ["CART" "SGD" "fastmath-OLS"]
      means (mapv (fn [pipeline-results]
                    (tcc/mean (map #(-> % :test-transform :metric) pipeline-results)))
                  multi-results)]
  (tc/dataset {:model model-names :mean-rmse means}))

_unnamed [3 2]:

:model         :mean-rmse
CART           11.43320438
SGD             9.00886158
fastmath-OLS    9.01791979

Graceful fallback

The built-in metamorph.ml/ols uses Apache Commons Math, which Nippy can't serialize. pocket-model catches the error and falls back to uncached training: the pipeline still works, just without disk caching for that model.

(pocket/cleanup!)
10:06:45.705 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-model
{:dir "/tmp/pocket-model", :existed true}
(def fallback-results
  (ml/evaluate-pipelines
   [(mm/pipeline {:metamorph/id :model} (pocket-model cart-spec))
    (mm/pipeline {:metamorph/id :model} (pocket-model {:model-type :metamorph.ml/ols}))]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
10:06:45.706 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.713 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/7a/7a2371066976291d06fe1aad1b48bbeba167ff70
10:06:45.718 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.725 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/9d/9d2799f31ec89ab47c28abaedf1a94632d6e4912
10:06:45.730 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.738 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/75/752a5761fad71dd397dad959c21a078b67503a46
10:06:45.744 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.759 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.771 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train

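CART models are cached: 3 entries, one per fold. OLS falls back to uncached training silently; in the log above, these are the three cache misses with no matching cache write. Because the fallback calls ml/train again after the cached attempt fails, each OLS model is presumably trained twice. The failed serialization attempts leave empty cache directories, which show up as entries with a nil function name: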

(pocket/cache-stats)
{:total-entries 6,
 :total-size-bytes 53369,
 :entries-per-fn {"scicloj.metamorph.ml/train" 3, nil 3}}
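
The cost of this fallback pattern can be seen in a small stand-alone sketch. The names with-fallback, train!, and train-then-fail-to-store! are hypothetical stand-ins: the last one imitates a cached call whose training succeeds but whose result cannot be written, which is an assumption about where the failure occurs rather than a description of Pocket's internals.

```clojure
(defn with-fallback
  "Tries `cached-call`; if it throws, runs `direct-call` instead.
  Returns [path value] so the caller can see which path produced the value."
  [cached-call direct-call]
  (try
    [:cached (cached-call)]
    (catch Exception _e
      [:fallback (direct-call)])))

;; Counts how often training actually runs.
(def train-count (atom 0))

(defn train! []
  (swap! train-count inc)
  {:model :trained})

(defn train-then-fail-to-store!
  "Stands in for a cached call: training works, storing the result fails."
  []
  (let [model (train!)]
    (throw (ex-info "cannot serialize" {:model model}))))

(with-fallback train-then-fail-to-store! train!)
;; => [:fallback {:model :trained}]
@train-count
;; => 2
```

Under that assumption the model is trained twice per fold for non-serializable backends, which is acceptable for fast models such as OLS but worth keeping in mind for slow ones.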

Both model types produce valid metrics:

(let [model-names ["CART" "OLS-fallback"]
      means (mapv (fn [pipeline-results]
                    (tcc/mean (map #(-> % :test-transform :metric) pipeline-results)))
                  fallback-results)]
  (tc/dataset {:model model-names :mean-rmse means}))

_unnamed [2 2]:

:model         :mean-rmse
CART           11.43320438
OLS-fallback    9.00886158

Disk persistence

Models survive JVM restarts. After clearing the in-memory cache, models are loaded from disk on next access.

(pocket/cleanup!)
10:06:45.791 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-model
{:dir "/tmp/pocket-model", :existed true}

Train fresh:

(def persist-results-1
  (ml/evaluate-pipelines
   [(mm/pipeline {:metamorph/id :model} (pocket-model cart-spec))]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
10:06:45.792 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.800 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/7a/7a2371066976291d06fe1aad1b48bbeba167ff70
10:06:45.805 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.813 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/9d/9d2799f31ec89ab47c28abaedf1a94632d6e4912
10:06:45.817 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:45.824 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-model/75/752a5761fad71dd397dad959c21a078b67503a46

Clear in-memory cache (simulates JVM restart):

(pocket/clear-mem-cache!)
nil

Re-evaluating now loads the models from disk:

(def persist-results-2
  (ml/evaluate-pipelines
   [(mm/pipeline {:metamorph/id :model} (pocket-model cart-spec))]
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
10:06:45.831 DEBUG scicloj.pocket.impl.cache - Cache hit (disk): scicloj.metamorph.ml/train /tmp/pocket-model/7a/7a2371066976291d06fe1aad1b48bbeba167ff70
10:06:45.837 DEBUG scicloj.pocket.impl.cache - Cache hit (disk): scicloj.metamorph.ml/train /tmp/pocket-model/9d/9d2799f31ec89ab47c28abaedf1a94632d6e4912
10:06:45.841 DEBUG scicloj.pocket.impl.cache - Cache hit (disk): scicloj.metamorph.ml/train /tmp/pocket-model/75/752a5761fad71dd397dad959c21a078b67503a46

Same metrics:

(= (mapv #(-> % :test-transform :metric) (flatten persist-results-1))
   (mapv #(-> % :test-transform :metric) (flatten persist-results-2)))
true
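
The memory-then-disk lookup order seen in this section can be sketched in plain Clojure. This is only an illustration of the idea: it stores values as EDN text, whereas the chapter states that Pocket serializes with Nippy, and the names mem-cache and lookup-or-compute! are hypothetical.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; In-memory layer; emptied on "restart". The disk layer lives under `dir`.
(def mem-cache (atom {}))

(defn lookup-or-compute!
  "Returns [source value], where source is :memory, :disk, or :computed."
  [dir k compute]
  (let [file (io/file dir (str k ".edn"))]
    (cond
      ;; fastest path: value already held in memory
      (contains? @mem-cache k)
      [:memory (get @mem-cache k)]

      ;; second path: value was written by an earlier session
      (.exists file)
      (let [v (edn/read-string (slurp file))]
        (swap! mem-cache assoc k v)
        [:disk v])

      ;; miss: compute, write to disk, and remember in memory
      :else
      (let [v (compute)]
        (io/make-parents file)
        (spit file (pr-str v))
        (swap! mem-cache assoc k v)
        [:computed v]))))
```

Calling lookup-or-compute! twice with the same key returns :computed and then :memory; after (reset! mem-cache {}), the next call returns :disk, which is the same sequence the log lines above show for the real cache.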

Discussion

pocket-model is a thin wrapper (about 20 lines of code) that gives us disk-persistent model caching with zero changes to our pipeline structure. It works with evaluate-pipelines, preprocessing steps, learning curves, and grid search.

Serialization compatibility (tested):

Backend                            Cacheable?
Tribuo regression (CART, SGD)      Yes
Tribuo classification              Yes
fastmath/ols                       Yes
metamorph.ml/ols (Commons Math)    No (falls back)
metamorph.ml/dummy-regressor       Yes

When to use pocket-model:

  • Grid search / hyperparameter tuning (train once, reuse)
  • Iterative notebook development (change downstream code, keep models)
  • Learning curves (add new sizes, only new ones train)
  • Any workflow where we re-evaluate with the same data + options
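
For the grid search case, one simple approach is to generate a spec per hyperparameter value and build one pipeline per spec. The helper cart-spec-with-depth and the depth values below are hypothetical; the commented part reuses only names already defined in this chapter and is meant to be evaluated in the notebook's namespace.

```clojure
(defn cart-spec-with-depth
  "Variant of the chapter's cart-spec with a configurable maximum tree depth."
  [max-depth]
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "cart"
                        :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                        :properties {:maxDepth (str max-depth)}}]
   :tribuo-trainer-name "cart"})

;; One spec per candidate depth
(def depth-grid (mapv cart-spec-with-depth [2 4 8 16]))

;; In the notebook's namespace, each spec would become its own pipeline:
(comment
  (def grid-pipelines
    (mapv #(mm/pipeline {:metamorph/id :model} (pocket-model %)) depth-grid))
  (ml/evaluate-pipelines grid-pipelines splits loss/rmse :loss
                         {:return-best-crossvalidation-only false
                          :return-best-pipeline-only false}))
```

Since cache keys depend on the data and the options, extending the grid later (for example, adding depth 32) should train only the new combinations, while the existing ones are served from the cache.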

Cache key efficiency: When pocket-model receives a dataset that was obtained by dereferencing a Cached reference (e.g., from ml/evaluate-pipelines, which passes real datasets through :metamorph/data), Pocket's origin registry recognizes it and uses the lightweight identity from the original Cached reference. This avoids hashing the full dataset content for the cache key, giving the same efficiency as passing a Cached reference directly.
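
The idea of an origin registry can be illustrated with an identity-keyed map. This is a sketch under assumptions, not Pocket's actual mechanism: origin-registry, register-origin!, and cache-key-part are hypothetical names, and an IdentityHashMap holds strong references, so a real implementation would likely need weak references to avoid retaining large datasets.

```clojure
(import '(java.util Collections IdentityHashMap))

;; Maps a realized value (by object identity) to a small description of its origin.
(def ^java.util.Map origin-registry
  (Collections/synchronizedMap (IdentityHashMap.)))

(defn register-origin!
  "Remembers where `value` came from and returns the value unchanged."
  [value origin]
  (.put origin-registry value origin)
  value)

(defn cache-key-part
  "Uses the registered origin when this exact object is known; otherwise
  falls back to the (potentially large) value itself."
  [value]
  (if-let [origin (.get origin-registry value)]
    [:origin origin]
    [:content value]))

(def big-value
  (register-origin! (vec (range 1000)) {:fn "load-data" :args [42]}))

(cache-key-part big-value)
;; => [:origin {:fn "load-data", :args [42]}]
```

An equal but distinct copy of the value is not recognized and falls back to its content, which is why the lookup is by identity rather than by equality: checking equality would itself require walking the full content.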

Cleanup

(pocket/cleanup!)
10:06:45.848 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-model
{:dir "/tmp/pocket-model", :existed true}
source: notebooks/pocket_book/pocket_model.clj