12 🚧 Draft: pocket-pipeline: cached ML pipelines with evaluate-pipelines
The previous chapter (pocket-model) showed how to cache model training in a metamorph.ml pipeline by swapping one step. That approach is simple (just replace ml/model with pocket-model), but only the training step is cached through Pocket.
This chapter explores a deeper integration: building the entire pipeline as a chain of pocket/caching-fn calls, where every step (data splitting, feature engineering, outlier clipping, training) becomes a cached node. This gives us:

- Per-step storage control – choose :mem, :mem+disk, or :none for each step independently
- Full provenance – origin-story traces any result back to the scalar parameters that produced it
- Disk persistence – cached models and intermediate results survive JVM restarts
- Concurrent dedup – the same computation runs once across threads
The key ingredient is Pocket's origin registry: when a Cached value is derefed, the real result keeps its lightweight identity. This lets us deref at each pipeline step, so real datasets flow through metamorph's context while cache keys stay efficient. Because the data is always a real dataset, we can use metamorph.ml's evaluate-pipelines directly for cross-validation and model comparison.
Background
metamorph.ml is the Scicloj library for machine learning pipelines. It builds on metamorph, a data-transformation framework where each step is a function that takes a context map and returns an updated one. Metamorph distinguishes two modes, :fit (learn from training data) and :transform (apply to new data), so a pipeline can be trained once and reused for prediction.
On top of this, metamorph.ml adds model training/prediction, cross-validation (evaluate-pipelines), loss functions, and hyperparameter search.
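As a reminder of the shape of that API, here is a minimal, illustrative sketch (using the mm and tc aliases introduced in the Setup section below); it is not part of this chapter's pipeline:

```clojure
;; Minimal metamorph sketch (illustrative only): each step maps a context map
;; to a context map, and mm/lift wraps a plain dataset->dataset function.
(comment
  (def toy-pipe
    (mm/pipeline
     (mm/lift tc/drop-missing)        ; stateless step: drop rows with missing values
     (mm/lift tc/add-column :z 1)))   ; stateless step: add a constant column
  ;; :fit runs the pipeline on training data and records state in the context;
  ;; :transform replays it on new data using that fitted context:
  (def toy-fit-ctx (mm/fit-pipe (tc/dataset {:x [1 nil 3]}) toy-pipe))
  (:metamorph/data
   (mm/transform-pipe (tc/dataset {:x [4 5 nil]}) toy-pipe toy-fit-ctx)))
```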
How this chapter relates to others
The ML Workflows chapter demonstrates Pocket caching with plain functions: pocket/cached calls wired into a DAG. This chapter uses the same pipeline functions and the same DAG approach, but adds cross-validation and model comparison on top, reusing metamorph.ml's evaluate-pipelines.
The pocket-model chapter takes the opposite approach: it plugs into metamorph.ml's existing pipeline machinery with a single drop-in replacement. Simpler to adopt, but only the training step is cached through Pocket.
| | pocket-model | This chapter |
|---|---|---|
| Integration effort | One-line change | Build pipeline with caching-fn wrappers |
| What's cached | Training only | Every step |
| Provenance | Training step only | Full DAG – any result to its parameters |
| Storage control | Global | Per-step |
| Evaluation | ml/evaluate-pipelines | ml/evaluate-pipelines (same) |
Setup
(ns pocket-book.pocket-pipeline
(:require
;; Logging setup for this chapter (see Logging chapter):
[pocket-book.logging]
;; Pocket API:
[scicloj.pocket :as pocket]
;; Annotating kinds of visualizations:
[scicloj.kindly.v4.kind :as kind]
;; Metamorph pipeline tools:
[scicloj.metamorph.core :as mm]
;; Data processing:
[tablecloth.api :as tc]
[tablecloth.column.api :as tcc]
[tech.v3.dataset.modelling :as ds-mod]
;; Column role filters (feature, target, prediction):
[tech.v3.dataset.column-filters :as cf]
;; Machine learning:
[scicloj.metamorph.ml :as ml]
[scicloj.metamorph.ml.loss :as loss]
   [scicloj.ml.tribuo]))

Override the default log level from debug to info. This notebook has many cached steps, and debug output (cache hits, writes) would overwhelm the rendered output. Info level shows cache misses, invalidation, and cleanup – enough to see when computation happens.
(pocket-book.logging/set-slf4j-level! :info)

nil

(def cache-dir "/tmp/pocket-metamorph")

(pocket/set-base-cache-dir! cache-dir)

10:06:45.929 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-metamorph
"/tmp/pocket-metamorph"

(pocket/cleanup!)

10:06:45.929 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed false}

Pipeline functions
These are plain Clojure functions – each takes data in and returns data out. They know nothing about caching or about metamorph's context maps and fit/transform modes. Pocket will wrap them with caching-fn later to add caching; the evaluation loop will call them across folds to add cross-validation.
(defn make-regression-data
"Generate synthetic regression data: y = f(x) + noise.
Optional outlier injection on x values."
[{:keys [f n noise-sd seed outlier-fraction outlier-scale]
:or {outlier-fraction 0 outlier-scale 10}}]
(let [rng (java.util.Random. (long seed))
xs (repeatedly n #(* 10.0 (.nextDouble rng)))
xs-final (if (pos? outlier-fraction)
(let [out-rng (java.util.Random. (+ (long seed) 7919))]
(map (fn [x]
(if (< (.nextDouble out-rng) outlier-fraction)
(+ x (* (double outlier-scale) (.nextGaussian out-rng)))
x))
xs))
xs)
ys (map (fn [x] (+ (double (f x))
(* (double noise-sd) (.nextGaussian rng))))
xs)]
(-> (tc/dataset {:x xs-final :y ys})
      (ds-mod/set-inference-target :y))))

(defn nonlinear-fn [x] (* (Math/sin x) x))

(defn split-dataset
"Split into train/test using holdout."
[ds {:keys [seed]}]
  (first (tc/split->seq ds :holdout {:seed seed})))

(defn prepare-features
"Add derived columns: :raw (none), :poly+trig (xΒ², sin, cos)."
[ds feature-set]
(let [x (:x ds)]
(-> (case feature-set
:raw ds
:poly+trig (tc/add-columns ds {:x2 (tcc/sq x)
:sin-x (tcc/sin x)
:cos-x (tcc/cos x)}))
      (ds-mod/set-inference-target :y))))

(defn fit-outlier-threshold
"Compute IQR-based clipping bounds for :x from training data."
[train-ds]
(let [xs (sort (:x train-ds))
n (count xs)
q1 (nth xs (int (* 0.25 n)))
q3 (nth xs (int (* 0.75 n)))
iqr (- q3 q1)]
{:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))

(defn clip-outliers
"Clip :x values using pre-computed threshold bounds."
[ds threshold]
(let [{:keys [lower upper]} threshold]
    (tc/add-column ds :x (-> (:x ds) (tcc/max lower) (tcc/min upper)))))

(defn predict-model
"Predict on test data using a trained model."
[test-ds model]
  (ml/predict test-ds model))

Model specifications
(def cart-spec
{:model-type :scicloj.ml.tribuo/regression
:tribuo-components [{:name "cart"
:type "org.tribuo.regression.rtree.CARTRegressionTrainer"
:properties {:maxDepth "8"}}]
:tribuo-trainer-name "cart"})(def linear-sgd-spec
{:model-type :scicloj.ml.tribuo/regression
:tribuo-components [{:name "squared"
:type "org.tribuo.regression.sgd.objectives.SquaredLoss"}
{:name "linear-sgd"
:type "org.tribuo.regression.sgd.linear.LinearSGDTrainer"
:properties {:objective "squared"
:epochs "50"
:loggingInterval "10000"}}]
:tribuo-trainer-name "linear-sgd"})Caching-fn wrappers
Each pipeline function gets a caching-fn wrapper with an appropriate storage policy:
:memfor cheap shared steps (threshold, clipping, features, prediction) β no disk I/O, but in-memory dedup ensures each runs once:mem+disk(default) for expensive steps (model training) β persists to disk, survives JVM restarts
(def c-fit-outlier-threshold
  (pocket/caching-fn #'fit-outlier-threshold {:storage :mem}))

(def c-clip-outliers
  (pocket/caching-fn #'clip-outliers {:storage :mem}))

(def c-prepare-features
  (pocket/caching-fn #'prepare-features {:storage :mem}))

(def c-predict-model
  (pocket/caching-fn #'predict-model {:storage :mem}))

Custom pipeline steps
metamorph provides the pipeline machinery we need: mm/pipeline composes steps, mm/lift wraps stateless functions, mm/fit-pipe and mm/transform-pipe run pipelines in each mode.
We build on this with two custom step types. Each step uses caching-fn wrappers internally and derefs the Cached result, so :metamorph/data always holds a real dataset. The origin registry ensures these derefed datasets carry their lightweight identity, so the next stepβs cache key stays efficient.
pocket-fitted
Creates a stateful pipeline step from two functions: one that fits parameters on training data, and one that applies those parameters to any dataset. In :fit mode, both are called. In :transform mode, only the apply function runs, using the parameters saved during :fit.
Both functions should be caching-fn wrappers. Their Cached results are derefed before being stored in the context or in :metamorph/data.
(defn pocket-fitted
"Create a stateful pipeline step from fit and apply caching-fns.
In :fit mode, fits parameters from data and applies them.
In :transform mode, applies previously fitted parameters.
Results are derefed so real datasets flow through the pipeline."
[fit-caching-fn apply-caching-fn]
(fn [{:metamorph/keys [data mode id] :as ctx}]
(case mode
:fit (let [fitted (deref (fit-caching-fn data))]
(-> ctx
(assoc id fitted)
(assoc :metamorph/data (deref (apply-caching-fn data fitted)))))
:transform (assoc ctx :metamorph/data
                   (deref (apply-caching-fn data (get ctx id)))))))

pocket-model
The model step is compatible with ml/evaluate-pipelines. It trains in :fit mode (cached via Pocket) and stores the model map under its step ID. In :transform mode, it predicts using the stored model (also cached) and saves the preprocessed test data as :target (for our loss computation) and as :scicloj.metamorph.ml/target-ds (for evaluate-pipelines).
The Cached reference from training is also stored (under :pocket/model-cached) so we can trace provenance later.
(defn pocket-model
"Cached model step compatible with ml/evaluate-pipelines.
Caches training and prediction via pocket/cached. Stores the
training Cached reference at :pocket/model-cached for provenance."
[model-spec]
(fn [{:metamorph/keys [data mode id] :as ctx}]
(case mode
:fit
(let [model-c (pocket/cached #'ml/train data model-spec)
model (deref model-c)]
(assoc ctx id model
:pocket/model-cached model-c))
:transform
(let [model (get ctx id)]
(-> ctx
(update id assoc
:scicloj.metamorph.ml/feature-ds (cf/feature data)
:scicloj.metamorph.ml/target-ds (cf/target data))
(assoc :metamorph/data (deref (c-predict-model data model))
             :target data))))))

Composing a pipeline
With these tools, we can build a pipeline by composing steps. The pocket-fitted step handles stateful outlier clipping, mm/lift with (comp deref c-fn) handles stateless cached feature preparation, and pocket-model handles training.
(def data-c
(pocket/cached #'make-regression-data
{:f #'nonlinear-fn :n 500 :noise-sd 0.5 :seed 42
                  :outlier-fraction 0.1 :outlier-scale 15}))

(def split-c (pocket/cached #'split-dataset data-c {:seed 42}))

(def train-c (pocket/cached :train split-c))

(def test-c (pocket/cached :test split-c))

(def pipe-cart
(mm/pipeline
{:metamorph/id :clip} (pocket-fitted c-fit-outlier-threshold c-clip-outliers)
{:metamorph/id :prep} (mm/lift (comp deref c-prepare-features) :poly+trig)
   {:metamorph/id :model} (pocket-model cart-spec)))

Fit on training data. We deref the Cached split to get a real dataset; the origin registry ensures our caching-fn wrappers still see a lightweight cache key:

(def fit-ctx (mm/fit-pipe (deref train-c) pipe-cart))

10:06:45.943 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:45.943 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/split-dataset
10:06:45.943 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/make-regression-data
10:06:45.953 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
10:06:45.953 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:45.956 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:45.957 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Transform on test data (using fitted params from training):
(def transform-ctx (mm/transform-pipe (deref test-c) pipe-cart fit-ctx))

10:06:45.972 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:45.974 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:45.974 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:45.976 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
The fitted context carries the model:
(-> fit-ctx
(get :model)
(update :model-data dissoc :model-as-bytes)
    kind/pprint)

{:model-data {:target-ds Group: 0 [333 1]:
| :y |
|------------:|
| -2.03959829 |
| 1.97631359 |
| -1.13244613 |
| -0.28466023 |
| -3.51203007 |
| 1.56378543 |
| -0.34817318 |
| 3.95058332 |
| 1.75833936 |
| 6.90883052 |
| ... |
| 5.94841143 |
| 0.76281763 |
| 6.72108056 |
| -2.61412072 |
| -0.47687264 |
| 0.90381607 |
| 6.37372487 |
| -4.57928612 |
| 0.72187253 |
| -1.30560162 |
| 1.86963826 |
, :feature-ds Group: 0 [333 4]:
| :x | :x2 | :sin-x | :cos-x |
|------------:|------------:|------------:|------------:|
| 3.56214010 | 12.68884212 | -0.40826026 | -0.91286557 |
| 2.64156573 | 6.97786948 | 0.47944917 | -0.87756965 |
| 9.51947133 | 90.62033446 | -0.09455192 | -0.99551993 |
| 6.27079557 | 39.32287712 | -0.01238942 | 0.99992325 |
| 5.62533957 | 31.64444532 | -0.61141358 | 0.79131121 |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 0.28305731 | 0.08012144 | 0.27929260 | 0.96020604 |
| 6.94559974 | 48.24135581 | 0.61502245 | 0.78850960 |
| 6.46688990 | 41.82066494 | 0.18267307 | 0.98317371 |
| 7.51301426 | 56.44538329 | 0.94243162 | 0.33439893 |
| ... | ... | ... | ... |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 1.03856897 | 1.07862551 | 0.86167893 | 0.50745386 |
| 7.39090954 | 54.62554378 | 0.89468442 | 0.44669877 |
| 3.90381073 | 15.23973824 | -0.69052749 | -0.72330615 |
| 0.22759616 | 0.05180001 | 0.22563633 | 0.97421160 |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 7.38225379 | 54.49767101 | 0.89078444 | 0.45442610 |
| 4.66158402 | 21.73036562 | -0.99870971 | -0.05078310 |
| 2.67983090 | 7.18149367 | 0.44552604 | -0.89526898 |
| 3.47865983 | 12.10107420 | -0.33072073 | -0.94372867 |
| 2.13681411 | 4.56597455 | 0.84404323 | -0.53627515 |
},
:options
{:model-type :scicloj.ml.tribuo/regression,
:tribuo-components
[{:name "cart",
:type "org.tribuo.regression.rtree.CARTRegressionTrainer",
:properties {:maxDepth "8"}}],
:tribuo-trainer-name "cart"},
:train-input-hash nil,
:id #uuid "7a635726-06ac-4cab-b9f7-0e65d609d6ba",
:feature-columns [:x :x2 :sin-x :cos-x],
:target-columns [:y],
 :target-datatypes {:y :float64}}

Predictions:
(tc/head (:metamorph/data transform-ctx))

_unnamed [5 1]:
| :y |
|---|
| 7.49226834 |
| -1.63699192 |
| -4.08775473 |
| -3.85731359 |
| 2.39153284 |
Compute RMSE from the target and predictions:
(loss/rmse (:y (get-in transform-ctx [:model :scicloj.metamorph.ml/target-ds]))
           (:y (:metamorph/data transform-ctx)))

1.6810603338440813

Train and test loss
A single metric on test data tells us how well the model generalizes, but comparing train and test loss reveals whether the model is overfitting. We compute the loss separately for each, then gather both into a summary.
(defn compute-loss
"Compute RMSE between actual and predicted :y columns."
[actual-ds predicted-ds]
  (loss/rmse (:y actual-ds) (:y predicted-ds)))

(def c-compute-loss (pocket/caching-fn #'compute-loss {:storage :mem}))

We already have the test predictions from transform-ctx. For training loss, we also transform the training data through the fitted pipeline:
(def train-transform-ctx (mm/transform-pipe (deref train-c) pipe-cart fit-ctx))

10:06:45.987 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
Now we compute loss on each split independently:
(def train-loss-c
(c-compute-loss (:target train-transform-ctx)
                  (:metamorph/data train-transform-ctx)))

(def test-loss-c
(c-compute-loss (:target transform-ctx)
                  (:metamorph/data transform-ctx)))

A report function gathers both into one summary:
(defn report
"Gather train and test loss into a summary map."
[train-loss test-loss]
{:train-rmse train-loss
   :test-rmse test-loss})

(def c-report (pocket/caching-fn #'report {:storage :mem}))

(def summary-c (c-report train-loss-c test-loss-c))

(deref summary-c)

10:06:45.991 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/report
10:06:45.991 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/compute-loss
10:06:45.992 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/compute-loss
{:train-rmse 0.9236365142505957, :test-rmse 1.6810603338440813}

Provenance
The summary reference carries full provenance. The origin registry lets origin-story follow derefed values back through the caching chain, so the DAG branches into train and test paths that share the same model node. This diamond dependency is traced naturally:
(pocket/origin-story-mermaid summary-c)

[A Mermaid provenance diagram is rendered here: the data-generation parameters (f, n, noise-sd, seed, outlier-fraction, outlier-scale) flow through make-regression-data, split-dataset, and the :train/:test selections into fit-outlier-threshold, clip-outliers, prepare-features (:poly+trig), train (with the CART model spec), predict-model, and compute-loss on both the train and test paths, converging on the report node.]
(pocket/cleanup!)

10:06:45.997 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed true}

Splits as Cached references
For cross-validation, we need k train/test splits. We create them as Cached references (preserving provenance) and then deref them to get real datasets for ml/evaluate-pipelines. The origin registry ensures the derefed datasets carry their lightweight identity.
(defn nth-split-train
"Extract the train set of the nth split."
[ds split-method split-params idx]
  (:train (nth (tc/split->seq ds split-method split-params) idx)))

(defn nth-split-test
"Extract the test set of the nth split."
[ds split-method split-params idx]
  (:test (nth (tc/split->seq ds split-method split-params) idx)))

(defn- n-splits
"Derive the number of splits from the method and params.
For :loo, derefs the dataset to get its row count."
[data-c split-method split-params]
(case split-method
:kfold (:k split-params 5)
:holdout 1
:bootstrap (:repeats split-params 1)
    :loo (tc/row-count (deref data-c))))

#'pocket-book.pocket-pipeline/n-splits

(defn pocket-splits
"Create k-fold splits as Cached references.
Returns [{:train Cached, :test Cached, :idx int} ...]."
[data-c split-method split-params]
(for [idx (range (n-splits data-c split-method split-params))]
{:train (pocket/cached #'nth-split-train
data-c split-method split-params idx)
:test (pocket/cached #'nth-split-test
data-c split-method split-params idx)
     :idx idx}))

Create Cached splits and deref them for ml/evaluate-pipelines. The derefed datasets are real (passing malli validation) while carrying their origin identity (for efficient cache keys):

(def cached-splits (pocket-splits data-c :kfold {:k 3 :seed 42}))

(def splits
(map (fn [{:keys [train test]}]
{:train (deref train) :test (deref test)})
       cached-splits))

Cross-validation with ml/evaluate-pipelines
Because our pipeline steps deref their outputs, real datasets flow through :metamorph/data at every point. This makes our pipeline fully compatible with evaluate-pipelines, which needs real datasets for metric computation.
(defn make-pipe [{:keys [feature-set model-spec]}]
(mm/pipeline
{:metamorph/id :clip} (pocket-fitted c-fit-outlier-threshold c-clip-outliers)
{:metamorph/id :prep} (mm/lift (comp deref c-prepare-features) feature-set)
   {:metamorph/id :model} (pocket-model model-spec)))

(def configs
[{:feature-set :poly+trig :model-spec cart-spec}
{:feature-set :raw :model-spec cart-spec}
{:feature-set :poly+trig :model-spec linear-sgd-spec}])(def results
(ml/evaluate-pipelines
(map make-pipe configs)
splits
loss/rmse
:loss
{:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))

10:06:46.008 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
10:06:46.008 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/make-regression-data
10:06:46.021 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
10:06:46.030 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
10:06:46.035 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
10:06:46.040 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
10:06:46.044 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
10:06:46.049 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
10:06:46.049 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.049 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.050 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.066 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.070 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.070 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.071 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.073 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
10:06:46.074 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.074 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.075 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.089 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.091 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.092 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.092 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.094 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
10:06:46.094 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.094 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.095 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.108 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.110 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
10:06:46.111 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.112 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.115 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.115 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.123 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.126 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.126 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.128 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.128 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.135 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.137 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.138 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.140 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.140 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.149 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.152 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
10:06:46.152 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.154 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.463754,min=-5.615639,mean=1.066586,variance=14.157836})
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 4.050872565748835
10:06:46.163 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.165 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.166 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.739185,min=-5.756192,mean=0.898097,variance=14.475543})
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.2396946375069637
10:06:46.175 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.177 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.179 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 334 examples
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=334,max=8.739185,min=-5.756192,mean=0.997828,variance=14.612461})
Feb 09, 2026 10:06:46 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.473370043059974
10:06:46.193 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.196 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
3 configs × 3 folds → aggregate mean RMSE per config:
(def summary
(map (fn [config pipeline-results]
{:feature-set (:feature-set config)
:model-type (-> config :model-spec :tribuo-trainer-name)
:mean-rmse (tcc/mean (map #(-> % :test-transform :metric)
pipeline-results))})
       configs results))

(tc/dataset summary)

_unnamed [3 3]:
| :feature-set | :model-type | :mean-rmse |
|---|---|---|
| :poly+trig | cart | 1.65064791 |
| :raw | cart | 1.63217429 |
| :poly+trig | linear-sgd | 2.13019778 |
Second run: all training hits the cache, same metrics:
(def results-2
(ml/evaluate-pipelines
(map make-pipe configs)
splits
loss/rmse
:loss
{:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))

(= (map #(-> % first :test-transform :metric) results)
   (map #(-> % first :test-transform :metric) results-2))

true

Hyperparameter sweep
Vary tree depth × feature set. Each unique combination trains once and is cached. Re-running adds only new combinations.
(def sweep-configs
(for [depth [4 6 8 12]
fs [:raw :poly+trig]]
{:feature-set fs
:model-spec {:model-type :scicloj.ml.tribuo/regression
:tribuo-components [{:name "cart"
:type "org.tribuo.regression.rtree.CARTRegressionTrainer"
:properties {:maxDepth (str depth)}}]
:tribuo-trainer-name "cart"}}))(def sweep-results
(ml/evaluate-pipelines
(map make-pipe sweep-configs)
splits
loss/rmse
:loss
{:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))

10:06:46.222 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.229 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.231 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.233 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.240 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.243 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.244 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.251 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.253 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.255 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.266 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.268 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.270 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.281 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.283 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.286 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.296 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.299 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.302 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.311 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.313 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.316 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.346 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.349 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.351 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.362 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.366 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.371 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.394 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.399 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.403 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.477 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.484 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.488 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.507 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.509 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.518 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.527 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.530 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.533 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.543 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.547 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.550 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.561 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.565 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.569 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.585 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.588 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.590 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.604 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.607 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.611 INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
10:06:46.629 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
10:06:46.632 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
Results by depth and feature set:
(def sweep-summary
(->> (map (fn [config pipeline-results]
{:depth (-> config :model-spec :tribuo-components
first :properties :maxDepth)
:feature-set (:feature-set config)
:mean-rmse (tcc/mean (map #(-> % :test-transform :metric)
pipeline-results))})
sweep-configs sweep-results)
       (sort-by :mean-rmse)))

(tc/dataset sweep-summary)

_unnamed [8 3]:
| :depth | :feature-set | :mean-rmse |
|---|---|---|
| 12 | :poly+trig | 1.52866093 |
| 6 | :poly+trig | 1.63217429 |
| 8 | :poly+trig | 1.64435595 |
| 6 | :raw | 1.65064791 |
| 4 | :poly+trig | 1.65350222 |
| 4 | :raw | 1.66164890 |
| 8 | :raw | 1.66262333 |
| 12 | :raw | 1.71514296 |
On this synthetic data, the deepest trees only pay off with the engineered features: depth 12 with poly+trig gives the best mean RMSE, while depth 12 with raw features gives the worst. Shallower trees show similar results regardless of feature set.
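To read the sweep as a depth-by-feature-set grid, one option is tablecloth's pivot->wider. This is a sketch; the exact column naming of the result may differ:

```clojure
;; Pivot the sweep summary so each feature set becomes a column of mean RMSE,
;; keyed by tree depth (sketch; result column naming may differ).
(-> (tc/dataset sweep-summary)
    (tc/pivot->wider :feature-set :mean-rmse))
```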
Sweep provenance
Pick the best result and trace its full provenance. The DAG goes from the trained model back to the original scalar parameters (seed, noise-sd, outlier-fraction, etc.):
(pocket/origin-story-mermaid
(:pocket/model-cached
  (-> sweep-results first first :fit-ctx)))

[A Mermaid provenance diagram is rendered here: the data-generation parameters flow through make-regression-data and the :kfold split ({:k 3, :seed 42}, fold 0) into fit-outlier-threshold, clip-outliers, and prepare-features (:poly+trig), which feed the train node together with the CART model spec (maxDepth "12").]
Discussion
What the Pocket DAG approach brings to an ML workflow:
| Aspect | What Pocket adds |
|---|---|
| Caching | Per-step, configurable – each step chooses :mem, :mem+disk, or :none |
| Provenance | Full DAG via origin-story – trace any result to its parameters |
| Disk persistence | Cached models and intermediates survive JVM restarts |
| Concurrent dedup | ConcurrentHashMap ensures each computation runs once across threads |
Reusing metamorph:
We use mm/pipeline, mm/lift, mm/fit-pipe, and mm/transform-pipe directly, and now also ml/evaluate-pipelines for cross-validation and model comparison. Pocket only adds two custom step types:

- pocket-model – like ml/model, but caches training via pocket/cached so models persist to disk
- pocket-fitted – a general pattern for stateful steps
The deref-through pattern:
Each pipeline step wraps a caching-fn and immediately derefs the Cached result. This means real datasets (not Cached references) flow through :metamorph/data at every point. The origin registry provides two benefits:
- Efficient cache keys – each derefed dataset carries its lightweight identity, so the next step's caching-fn avoids hashing full dataset content
- Full provenance – origin-story follows derefed values back through the registry to their Cached origin, preserving the complete DAG (as seen in the diamond dependency above)
Because :metamorph/data is always a real dataset, ml/evaluate-pipelines can call cf/target, cf/prediction, and malli validation – things that require concrete dataset types.
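If the deref-through pattern keeps recurring, it could be captured in a small helper. The following is a hypothetical sketch, not part of Pocket or metamorph.ml:

```clojure
;; Hypothetical helper: lift a caching-fn into a stateless metamorph step,
;; derefing its Cached result so a real dataset lands in :metamorph/data.
(defn lift-cached
  [caching-fn & args]
  (apply mm/lift (comp deref caching-fn) args))

;; With it, the :prep step above could be written as:
;; {:metamorph/id :prep} (lift-cached c-prepare-features :poly+trig)
```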
What we write:
- Plain pipeline functions (data in, data out)
- caching-fn wrappers with storage policies (one line each)
- Pipeline composition via mm/pipeline with our custom steps
- ml/evaluate-pipelines for cross-validation
Open question: where should the custom steps live? pocket-fitted and pocket-model are currently defined in this notebook. A future scicloj.pocket.ml namespace could provide them, but only if the pattern proves stable across different use cases.
Cleanup
(pocket/cleanup!)

10:06:46.646 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed true}