12 🚧 Draft: pocket-pipeline (cached ML pipelines with evaluate-pipelines)
The previous chapter (pocket-model) showed how to cache model training in a metamorph.ml pipeline by swapping one step. That approach is simple (just replace ml/model with pocket-model), but only the training step is cached through Pocket.
This chapter explores a deeper integration: building the entire pipeline as a chain of pocket/caching-fn calls, where every step (data splitting, feature engineering, outlier clipping, training) becomes a cached node. This gives us:
- Per-step storage control: choose :mem, :mem+disk, or :none for each step independently
- Full provenance: origin-story traces any result back to the scalar parameters that produced it
- Disk persistence: cached models and intermediate results survive JVM restarts
- Concurrent dedup: the same computation runs once across threads
The key ingredient is Pocket's origin registry: when a Cached value is derefed, the real result keeps its lightweight identity. This lets us deref at each pipeline step, so real datasets flow through metamorph's context while cache keys stay efficient. Because the data is always a real dataset, we can use metamorph.ml's evaluate-pipelines directly for cross-validation and model comparison.
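The idea of a value that "keeps its lightweight identity" can be pictured with a toy model in plain Clojure. This is only a sketch under stated assumptions, not Pocket's implementation: the names `toy-registry`, `toy-cache-key`, and `toy-compute` are hypothetical, and the registry here holds strong references, which a real cache would probably avoid.

```clojure
(import 'java.util.IdentityHashMap)

;; Toy registry (hypothetical): maps a result object, by identity,
;; to a small description of the call that produced it.
(def ^IdentityHashMap toy-registry (IdentityHashMap.))

(defn toy-cache-key
  "Use the recorded origin when there is one; otherwise the value itself."
  [x]
  (or (locking toy-registry (.get toy-registry x))
      x))

(defn toy-compute
  "Call f on args, then record the result's origin in the registry."
  [fn-name f & args]
  (let [result (apply f args)
        origin {:fn fn-name :args (mapv toy-cache-key args)}]
    (locking toy-registry
      (.put toy-registry result origin))
    result))

;; A large value produced from one scalar parameter.
(def big-data (toy-compute 'make-data (fn [n] (vec (range n))) 100000))

;; A downstream step receives the real vector, not a reference.
(def head-data
  (toy-compute 'take-head (fn [xs k] (vec (take k xs))) big-data 3))

head-data
;; => [0 1 2]

(toy-cache-key head-data)
;; => {:fn take-head, :args [{:fn make-data, :args [100000]} 3]}
```

In this toy, the key for `head-data` mentions only function names and scalar arguments, even though the step itself was handed a 100000-element vector.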
Background
metamorph.ml is the Scicloj library for machine learning pipelines. It builds on metamorph, a data-transformation framework where each step is a function that takes a context map and returns an updated one. Metamorph distinguishes two modes, :fit (learn from training data) and :transform (apply to new data), so a pipeline can be trained once and reused for prediction.
On top of this, metamorph.ml adds model training/prediction, cross-validation (evaluate-pipelines), loss functions, and hyperparameter search.
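To make the two modes concrete, here is a minimal sketch in plain Clojure that does not use the metamorph library. The step `center-step`, its id `:center`, and the `toy-` vars are hypothetical names for illustration, and a vector of numbers stands in for a dataset. The context keys :metamorph/data, :metamorph/mode, and :metamorph/id are the ones the custom steps later in this chapter destructure.

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn center-step
  "Hypothetical stateful step: learns the mean in :fit mode and
   subtracts the stored mean in :transform mode."
  [{:metamorph/keys [data mode id] :as ctx}]
  (case mode
    :fit (let [m (mean data)]
           (-> ctx
               (assoc id {:mean m})
               (assoc :metamorph/data (mapv #(- % m) data))))
    :transform (let [m (:mean (get ctx id))]
                 (assoc ctx :metamorph/data (mapv #(- % m) data)))))

;; Fit on "training" data: the learned mean is stored under the step id.
(def toy-fit-ctx
  (center-step {:metamorph/data [1.0 2.0 3.0]
                :metamorph/mode :fit
                :metamorph/id :center}))

;; Transform new data, reusing the state learned during :fit.
(def toy-transform-ctx
  (center-step (assoc toy-fit-ctx
                      :metamorph/data [10.0 20.0]
                      :metamorph/mode :transform)))

(:metamorph/data toy-fit-ctx)
;; => [-1.0 0.0 1.0]

(:metamorph/data toy-transform-ctx)
;; => [8.0 18.0]
```

The new data is centered with the training mean (2.0), not its own mean, which is the point of separating the two modes.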
How this chapter relates to others
The ML Workflows chapter demonstrates Pocket caching with plain functions: pocket/cached calls wired into a DAG. This chapter uses the same pipeline functions and the same DAG approach, but adds cross-validation and model comparison on top, reusing metamorph.ml's evaluate-pipelines.
The pocket-model chapter takes the opposite approach: it plugs into metamorph.ml's existing pipeline machinery with a single drop-in replacement. Simpler to adopt, but only the training step is cached through Pocket.
| | pocket-model | This chapter |
|---|---|---|
| Integration effort | One-line change | Build pipeline with caching-fn wrappers |
| What's cached | Training only | Every step |
| Provenance | Training step only | Full DAG: any result to its parameters |
| Storage control | Global | Per-step |
| Evaluation | ml/evaluate-pipelines | ml/evaluate-pipelines (same) |
Setup
(ns pocket-book.pocket-pipeline
(:require
;; Logging setup for this chapter (see Logging chapter):
[pocket-book.logging]
;; Pocket API:
[scicloj.pocket :as pocket]
;; Annotating kinds of visualizations:
[scicloj.kindly.v4.kind :as kind]
;; Metamorph pipeline tools:
[scicloj.metamorph.core :as mm]
;; Data processing:
[tablecloth.api :as tc]
[tablecloth.column.api :as tcc]
[tech.v3.dataset.modelling :as ds-mod]
;; Column role filters (feature, target, prediction):
[tech.v3.dataset.column-filters :as cf]
;; Machine learning:
[scicloj.metamorph.ml :as ml]
[scicloj.metamorph.ml.loss :as loss]
[scicloj.ml.tribuo]))

Override the default log level from debug to info. This notebook has many cached steps, and debug output (cache hits, writes) would overwhelm the rendered output. Info level shows cache misses, invalidation, and cleanup, which is enough to see when computation happens.

(pocket-book.logging/set-slf4j-level! :info)
nil

(def cache-dir "/tmp/pocket-metamorph")

(pocket/set-base-cache-dir! cache-dir)
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket - Cache dir set to: /tmp/pocket-metamorph
"/tmp/pocket-metamorph"

(pocket/cleanup!)
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed false}

Pipeline functions
These are plain Clojure functions: each takes data in and returns data out. They know nothing about caching or about metamorph's context maps and fit/transform modes. Pocket will wrap them with caching-fn later to add caching; the evaluation loop will call them across folds to add cross-validation.
(defn make-regression-data
"Generate synthetic regression data: y = f(x) + noise.
Optional outlier injection on x values."
[{:keys [f n noise-sd seed outlier-fraction outlier-scale]
:or {outlier-fraction 0 outlier-scale 10}}]
(let [rng (java.util.Random. (long seed))
xs (repeatedly n #(* 10.0 (.nextDouble rng)))
xs-final (if (pos? outlier-fraction)
(let [out-rng (java.util.Random. (+ (long seed) 7919))]
(map (fn [x]
(if (< (.nextDouble out-rng) outlier-fraction)
(+ x (* (double outlier-scale) (.nextGaussian out-rng)))
x))
xs))
xs)
ys (map (fn [x] (+ (double (f x))
(* (double noise-sd) (.nextGaussian rng))))
xs)]
(-> (tc/dataset {:x xs-final :y ys})
(ds-mod/set-inference-target :y))))

(defn nonlinear-fn [x] (* (Math/sin x) x))

(defn split-dataset
"Split into train/test using holdout."
[ds {:keys [seed]}]
(first (tc/split->seq ds :holdout {:seed seed})))

(defn prepare-features
"Add derived columns: :raw (none), :poly+trig (x², sin, cos)."
[ds feature-set]
(let [x (:x ds)]
(-> (case feature-set
:raw ds
:poly+trig (tc/add-columns ds {:x2 (tcc/sq x)
:sin-x (tcc/sin x)
:cos-x (tcc/cos x)}))
(ds-mod/set-inference-target :y))))

(defn fit-outlier-threshold
"Compute IQR-based clipping bounds for :x from training data."
[train-ds]
(let [xs (sort (:x train-ds))
n (count xs)
q1 (nth xs (int (* 0.25 n)))
q3 (nth xs (int (* 0.75 n)))
iqr (- q3 q1)]
{:lower (- q1 (* 1.5 iqr))
:upper (+ q3 (* 1.5 iqr))}))

(defn clip-outliers
"Clip :x values using pre-computed threshold bounds."
[ds threshold]
(let [{:keys [lower upper]} threshold]
(tc/add-column ds :x (-> (:x ds) (tcc/max lower) (tcc/min upper)))))

(defn predict-model
"Predict on test data using a trained model."
[test-ds model]
(ml/predict test-ds model))

Model specifications
(def cart-spec
{:model-type :scicloj.ml.tribuo/regression
:tribuo-components [{:name "cart"
:type "org.tribuo.regression.rtree.CARTRegressionTrainer"
:properties {:maxDepth "8"}}]
:tribuo-trainer-name "cart"})

(def linear-sgd-spec
{:model-type :scicloj.ml.tribuo/regression
:tribuo-components [{:name "squared"
:type "org.tribuo.regression.sgd.objectives.SquaredLoss"}
{:name "linear-sgd"
:type "org.tribuo.regression.sgd.linear.LinearSGDTrainer"
:properties {:objective "squared"
:epochs "50"
:loggingInterval "10000"}}]
:tribuo-trainer-name "linear-sgd"})

Caching-fn wrappers
Each pipeline function gets a caching-fn wrapper with an appropriate storage policy:
- :mem for cheap shared steps (threshold, clipping, features, prediction): no disk I/O, but in-memory dedup ensures each runs once
- :mem+disk (default) for expensive steps (model training): persists to disk, survives JVM restarts

(def c-fit-outlier-threshold
  (pocket/caching-fn #'fit-outlier-threshold {:storage :mem}))

(def c-clip-outliers
  (pocket/caching-fn #'clip-outliers {:storage :mem}))

(def c-prepare-features
  (pocket/caching-fn #'prepare-features {:storage :mem}))

(def c-predict-model
  (pocket/caching-fn #'predict-model {:storage :mem}))

Custom pipeline steps
metamorph provides the pipeline machinery we need: mm/pipeline composes steps, mm/lift wraps stateless functions, mm/fit-pipe and mm/transform-pipe run pipelines in each mode.
We build on this with two custom step types. Each step uses caching-fn wrappers internally and derefs the Cached result, so :metamorph/data always holds a real dataset. The origin registry ensures these derefed datasets carry their lightweight identity, so the next step's cache key stays efficient.
pocket-fitted
Creates a stateful pipeline step from two functions: one that fits parameters on training data, and one that applies those parameters to any dataset. In :fit mode, both are called. In :transform mode, only the apply function runs, using the parameters saved during :fit.
Both functions should be caching-fn wrappers. Their Cached results are derefed before being stored in the context or in :metamorph/data.
(defn pocket-fitted
"Create a stateful pipeline step from fit and apply caching-fns.
In :fit mode, fits parameters from data and applies them.
In :transform mode, applies previously fitted parameters.
Results are derefed so real datasets flow through the pipeline."
[fit-caching-fn apply-caching-fn]
(fn [{:metamorph/keys [data mode id] :as ctx}]
(case mode
:fit (let [fitted (deref (fit-caching-fn data))]
(-> ctx
(assoc id fitted)
(assoc :metamorph/data (deref (apply-caching-fn data fitted)))))
:transform (assoc ctx :metamorph/data
(deref (apply-caching-fn data (get ctx id)))))))

pocket-model
The model step, compatible with ml/evaluate-pipelines. Trains in :fit mode (cached via Pocket) and stores the model map under its step ID. In :transform mode, predicts using the stored model (also cached) and saves the preprocessed test data as :target (for our loss computation) and as :scicloj.metamorph.ml/target-ds (for evaluate-pipelines).
The Cached reference from training is also stored (under :pocket/model-cached) so we can trace provenance later.
(defn pocket-model
"Cached model step compatible with ml/evaluate-pipelines.
Caches training and prediction via pocket/cached. Stores the
training Cached reference at :pocket/model-cached for provenance."
[model-spec]
(fn [{:metamorph/keys [data mode id] :as ctx}]
(case mode
:fit
(let [model-c (pocket/cached #'ml/train data model-spec)
model (deref model-c)]
(assoc ctx id model
:pocket/model-cached model-c))
:transform
(let [model (get ctx id)]
(-> ctx
(update id assoc
:scicloj.metamorph.ml/feature-ds (cf/feature data)
:scicloj.metamorph.ml/target-ds (cf/target data))
(assoc :metamorph/data (deref (c-predict-model data model))
:target data))))))

Composing a pipeline
With these tools, we can build a pipeline by composing steps. The pocket-fitted step handles stateful outlier clipping, mm/lift with (comp deref c-fn) handles stateless cached feature preparation, and pocket-model handles training.
(def data-c
(pocket/cached #'make-regression-data
{:f #'nonlinear-fn :n 500 :noise-sd 0.5 :seed 42
:outlier-fraction 0.1 :outlier-scale 15}))

(def split-c (pocket/cached #'split-dataset data-c {:seed 42}))

(def train-c (pocket/cached :train split-c))

(def test-c (pocket/cached :test split-c))

(def pipe-cart
(mm/pipeline
{:metamorph/id :clip} (pocket-fitted c-fit-outlier-threshold c-clip-outliers)
{:metamorph/id :prep} (mm/lift (comp deref c-prepare-features) :poly+trig)
{:metamorph/id :model} (pocket-model cart-spec)))

Fit on training data. We deref the Cached split to get a real dataset; the origin registry ensures our caching-fn wrappers still see a lightweight cache key:
(def fit-ctx (mm/fit-pipe (deref train-c) pipe-cart))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/split-dataset
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/make-regression-data
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Transform on test data (using fitted params from training):
(def transform-ctx (mm/transform-pipe (deref test-c) pipe-cart fit-ctx))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
The fitted context carries the model:
(-> fit-ctx
(get :model)
(update :model-data dissoc :model-as-bytes)
kind/pprint)
{:model-data {:target-ds Group: 0 [333 1]:
| :y |
|------------:|
| -2.03959829 |
| 1.97631359 |
| -1.13244613 |
| -0.28466023 |
| -3.51203007 |
| 1.56378543 |
| -0.34817318 |
| 3.95058332 |
| 1.75833936 |
| 6.90883052 |
| ... |
| 5.94841143 |
| 0.76281763 |
| 6.72108056 |
| -2.61412072 |
| -0.47687264 |
| 0.90381607 |
| 6.37372487 |
| -4.57928612 |
| 0.72187253 |
| -1.30560162 |
| 1.86963826 |
, :feature-ds Group: 0 [333 4]:
| :x | :x2 | :sin-x | :cos-x |
|------------:|------------:|------------:|------------:|
| 3.56214010 | 12.68884212 | -0.40826026 | -0.91286557 |
| 2.64156573 | 6.97786948 | 0.47944917 | -0.87756965 |
| 9.51947133 | 90.62033446 | -0.09455192 | -0.99551993 |
| 6.27079557 | 39.32287712 | -0.01238942 | 0.99992325 |
| 5.62533957 | 31.64444532 | -0.61141358 | 0.79131121 |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 0.28305731 | 0.08012144 | 0.27929260 | 0.96020604 |
| 6.94559974 | 48.24135581 | 0.61502245 | 0.78850960 |
| 6.46688990 | 41.82066494 | 0.18267307 | 0.98317371 |
| 7.51301426 | 56.44538329 | 0.94243162 | 0.33439893 |
| ... | ... | ... | ... |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 1.03856897 | 1.07862551 | 0.86167893 | 0.50745386 |
| 7.39090954 | 54.62554378 | 0.89468442 | 0.44669877 |
| 3.90381073 | 15.23973824 | -0.69052749 | -0.72330615 |
| 0.22759616 | 0.05180001 | 0.22563633 | 0.97421160 |
| -4.85326548 | 23.55418581 | 0.99009331 | 0.14041098 |
| 7.38225379 | 54.49767101 | 0.89078444 | 0.45442610 |
| 4.66158402 | 21.73036562 | -0.99870971 | -0.05078310 |
| 2.67983090 | 7.18149367 | 0.44552604 | -0.89526898 |
| 3.47865983 | 12.10107420 | -0.33072073 | -0.94372867 |
| 2.13681411 | 4.56597455 | 0.84404323 | -0.53627515 |
},
:options
{:model-type :scicloj.ml.tribuo/regression,
:tribuo-components
[{:name "cart",
:type "org.tribuo.regression.rtree.CARTRegressionTrainer",
:properties {:maxDepth "8"}}],
:tribuo-trainer-name "cart"},
:train-input-hash nil,
:id #uuid "7f016414-2811-4faa-b3f6-0bc75ea7e911",
:feature-columns [:x :x2 :sin-x :cos-x],
:target-columns [:y],
:target-datatypes {:y :float64}}

Predictions:
(tc/head (:metamorph/data transform-ctx))
_unnamed [5 1]:
| :y |
|---|
| 7.49226834 |
| -1.63699192 |
| -4.08775473 |
| -3.85731359 |
| 2.39153284 |
Compute RMSE from the target and predictions:
(loss/rmse (:y (get-in transform-ctx [:model :scicloj.metamorph.ml/target-ds]))
(:y (:metamorph/data transform-ctx)))
1.6810603338440813

Train and test loss
A single metric on test data tells us how well the model generalizes, but comparing train and test loss reveals whether the model is overfitting. We compute the loss separately for each, then gather both into a summary.
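For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below writes it out in plain Clojure; `toy-rmse` is a hypothetical helper for illustration, while the chapter itself relies on loss/rmse from metamorph.ml.

```clojure
(defn toy-rmse
  "Root mean squared error between two equal-length sequences."
  [actual predicted]
  (Math/sqrt
   (/ (reduce + (map (fn [a p]
                       (let [d (- a p)]
                         (* d d)))
                     actual predicted))
      (double (count actual)))))

;; Every prediction is off by 2, so the RMSE is 2.
(toy-rmse [0.0 0.0 0.0 0.0] [2.0 2.0 2.0 2.0])
;; => 2.0

;; Perfect predictions give zero loss.
(toy-rmse [1.0 2.0 3.0] [1.0 2.0 3.0])
;; => 0.0
```

A train RMSE that is much lower than the test RMSE is the usual sign of overfitting that the comparison here is meant to reveal.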
(defn compute-loss
"Compute RMSE between actual and predicted :y columns."
[actual-ds predicted-ds]
(loss/rmse (:y actual-ds) (:y predicted-ds)))

(def c-compute-loss (pocket/caching-fn #'compute-loss {:storage :mem}))

We already have the test predictions from transform-ctx. For training loss, we also transform the training data through the fitted pipeline:
(def train-transform-ctx (mm/transform-pipe (deref train-c) pipe-cart fit-ctx))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
Now we compute loss on each split independently:
(def train-loss-c
(c-compute-loss (:target train-transform-ctx)
(:metamorph/data train-transform-ctx)))

(def test-loss-c
(c-compute-loss (:target transform-ctx)
(:metamorph/data transform-ctx)))

A report function gathers both into one summary:
(defn report
"Gather train and test loss into a summary map."
[train-loss test-loss]
{:train-rmse train-loss
:test-rmse test-loss})

(def c-report (pocket/caching-fn #'report {:storage :mem}))

(def summary-c (c-report train-loss-c test-loss-c))

(deref summary-c)
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/report
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/compute-loss
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/compute-loss
{:train-rmse 0.9236365142505957, :test-rmse 1.6810603338440813}

Provenance
The summary reference carries full provenance. The origin registry lets origin-story follow derefed values back through the caching chain, so the DAG branches into train and test paths that share the same model node. This diamond dependency is traced naturally:
(pocket/origin-story-mermaid summary-c)
[Mermaid flowchart: origin story of summary-c. Two compute-loss nodes feed the summary, one per path. Each path runs clip-outliers, prepare-features (:poly+trig), and predict-model; the :test path shares the fit-outlier-threshold and train nodes with the training path. The leaves are scalar parameters: the data-generation map (:n 500, :noise-sd 0.5, :seed 42, :outlier-fraction 0.1, :outlier-scale 15), {:seed 42}, :poly+trig, and the CART model spec.]
(pocket/cleanup!)
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed true}

Splits as Cached references
For cross-validation, we need k train/test splits. We create them as Cached references, which preserves provenance, and then deref them to get real datasets for ml/evaluate-pipelines. The origin registry ensures the derefed datasets carry their lightweight identity.
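As a reminder of what k train/test splits look like, here is a plain-Clojure sketch that partitions row indices into k folds. It is an illustration only: `shuffle-with-seed` and `kfold-indices` are hypothetical helpers, and the assignment of rows to folds is not claimed to match what tc/split->seq produces.

```clojure
(defn shuffle-with-seed
  "Deterministic shuffle of coll using a seeded java.util.Random."
  [coll seed]
  (let [al (java.util.ArrayList. ^java.util.Collection (vec coll))]
    (java.util.Collections/shuffle al (java.util.Random. (long seed)))
    (vec al)))

(defn kfold-indices
  "Return k maps {:idx i, :train [...], :test [...]} of row indices.
   Rows are shuffled with a fixed seed and dealt round-robin into k
   folds; fold i is the test set and the remaining folds form the
   training set."
  [n k seed]
  (let [shuffled (shuffle-with-seed (range n) seed)
        folds (mapv (fn [i] (vec (take-nth k (drop i shuffled))))
                    (range k))]
    (vec
     (for [i (range k)]
       {:idx i
        :test (nth folds i)
        :train (vec (mapcat folds (remove #{i} (range k))))}))))

(def toy-index-splits (kfold-indices 10 3 42))

;; Test-fold sizes for 10 rows and 3 folds.
(mapv (comp count :test) toy-index-splits)
;; => [4 3 3]
```

Every row index lands in exactly one test fold, and the same seed always yields the same splits, which is what makes split-level caching meaningful.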
(defn nth-split-train
"Extract the train set of the nth split."
[ds split-method split-params idx]
(:train (nth (tc/split->seq ds split-method split-params) idx)))

(defn nth-split-test
"Extract the test set of the nth split."
[ds split-method split-params idx]
(:test (nth (tc/split->seq ds split-method split-params) idx)))

(defn- n-splits
"Derive the number of splits from the method and params.
For :loo, derefs the dataset to get its row count."
[data-c split-method split-params]
(case split-method
:kfold (:k split-params 5)
:holdout 1
:bootstrap (:repeats split-params 1)
:loo (tc/row-count (deref data-c))))
#'pocket-book.pocket-pipeline/n-splits

(defn pocket-splits
"Create k-fold splits as Cached references.
Returns [{:train Cached, :test Cached, :idx int} ...]."
[data-c split-method split-params]
(for [idx (range (n-splits data-c split-method split-params))]
{:train (pocket/cached #'nth-split-train
data-c split-method split-params idx)
:test (pocket/cached #'nth-split-test
data-c split-method split-params idx)
:idx idx}))

Create Cached splits and deref them for ml/evaluate-pipelines. The derefed datasets are real (passing malli validation) while carrying their origin identity (for efficient cache keys):
(def cached-splits (pocket-splits data-c :kfold {:k 3 :seed 42}))

(def splits
(map (fn [{:keys [train test]}]
{:train (deref train) :test (deref test)})
cached-splits))

Cross-validation with ml/evaluate-pipelines
Because our pipeline steps deref their outputs, real datasets flow through :metamorph/data at every point. This makes our pipeline fully compatible with evaluate-pipelines, which needs real datasets for metric computation.
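Conceptually, the evaluation is a nested loop: for each pipeline and each split, fit on the training part, transform the test part, and score the predictions. The sketch below shows that loop in plain Clojure. It is a simplified stand-in, not metamorph.ml's implementation; `mean-model-step`, `toy-mae`, `toy-evaluate`, and `toy-splits` are hypothetical names, and vectors of numbers stand in for datasets.

```clojure
(defn mean-model-step
  "Hypothetical model step. In :fit mode it stores the mean of the
   training targets under its id; in :transform mode it predicts that
   mean for every row and keeps the true targets under :target."
  [{:metamorph/keys [data mode id] :as ctx}]
  (case mode
    :fit (assoc ctx id (/ (reduce + data) (double (count data))))
    :transform (assoc ctx
                      :target data
                      :metamorph/data (vec (repeat (count data)
                                                   (get ctx id))))))

(defn toy-mae
  "Mean absolute error between two equal-length sequences."
  [actual predicted]
  (/ (reduce + (map (fn [a p] (Math/abs (double (- a p))))
                    actual predicted))
     (double (count actual))))

(defn toy-evaluate
  "For each pipeline and each split: fit on :train, transform :test,
   and score the predictions against the held-out targets."
  [pipelines splits metric-fn]
  (vec
   (for [pipe pipelines]
     (vec
      (for [{:keys [train test]} splits]
        (let [fit-ctx (pipe {:metamorph/data train
                             :metamorph/mode :fit
                             :metamorph/id :model})
              tr-ctx (pipe (assoc fit-ctx
                                  :metamorph/data test
                                  :metamorph/mode :transform))]
          {:metric (metric-fn (:target tr-ctx)
                              (:metamorph/data tr-ctx))}))))))

(def toy-splits
  [{:train [1.0 3.0] :test [2.0 4.0]}
   {:train [2.0 4.0] :test [1.0 3.0]}])

(toy-evaluate [mean-model-step] toy-splits toy-mae)
;; => [[{:metric 1.0} {:metric 1.0}]]
```

The result has one inner sequence per pipeline and one entry per split, which is the shape assumed when results are paired with configs for aggregation.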
(defn make-pipe [{:keys [feature-set model-spec]}]
(mm/pipeline
{:metamorph/id :clip} (pocket-fitted c-fit-outlier-threshold c-clip-outliers)
{:metamorph/id :prep} (mm/lift (comp deref c-prepare-features) feature-set)
{:metamorph/id :model} (pocket-model model-spec)))

(def configs
[{:feature-set :poly+trig :model-spec cart-spec}
{:feature-set :raw :model-spec cart-spec}
{:feature-set :poly+trig :model-spec linear-sgd-spec}])

(def results
(ml/evaluate-pipelines
(map make-pipe configs)
splits
loss/rmse
:loss
{:return-best-crossvalidation-only false
:return-best-pipeline-only false}))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/make-regression-data
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.pocket-pipeline/nth-split-test
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/fit-outlier-threshold
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.463754,min=-5.615639,mean=1.066586,variance=14.157836})
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 4.050872565748835
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.739185,min=-5.756192,mean=0.898097,variance=14.475543})
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.2396946375069637
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 334 examples
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=334,max=8.739185,min=-5.756192,mean=0.997828,variance=14.612461})
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.473370043059974
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
3 configs × 3 folds: aggregate the mean RMSE per config:
(def summary
(map (fn [config pipeline-results]
{:feature-set (:feature-set config)
:model-type (-> config :model-spec :tribuo-trainer-name)
:mean-rmse (tcc/mean (map #(-> % :test-transform :metric)
pipeline-results))})
configs results))

(tc/dataset summary)
_unnamed [3 3]:
| :feature-set | :model-type | :mean-rmse |
|---|---|---|
| :poly+trig | cart | 1.65064791 |
| :raw | cart | 1.63217429 |
| :poly+trig | linear-sgd | 2.13019778 |
Second run: all training hits cache, same metrics:
(def results-2
(ml/evaluate-pipelines
(map make-pipe configs)
splits
loss/rmse
:loss
{:return-best-crossvalidation-only false
:return-best-pipeline-only false}))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.463754,min=-5.615639,mean=1.066586,variance=14.157836})
Mar 01, 2026 4:32:00 PM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 4.050872565748835
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
(= (map #(-> % first :test-transform :metric) results)
   (map #(-> % first :test-transform :metric) results-2))
true

Hyperparameter sweep
Vary tree depth × feature set. Each unique combination trains once and is cached. Re-running adds only new combinations.
(def sweep-configs
  (for [depth [4 6 8 12]
        fs [:raw :poly+trig]]
    {:feature-set fs
     :model-spec {:model-type :scicloj.ml.tribuo/regression
                  :tribuo-components [{:name "cart"
                                       :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                                       :properties {:maxDepth (str depth)}}]
                  :tribuo-trainer-name "cart"}}))

(def sweep-results
  (ml/evaluate-pipelines
   (map make-pipe sweep-configs)
   splits
   loss/rmse
   :loss
   {:return-best-crossvalidation-only false
    :return-best-pipeline-only false}))
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/clip-outliers
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/prepare-features
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss, computing: scicloj.metamorph.ml/train
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.pocket-pipeline/predict-model
Results by depth and feature set:
(def sweep-summary
  (->> (map (fn [config pipeline-results]
              {:depth (-> config :model-spec :tribuo-components
                          first :properties :maxDepth)
               :feature-set (:feature-set config)
               :mean-rmse (tcc/mean (map #(-> % :test-transform :metric)
                                         pipeline-results))})
            sweep-configs sweep-results)
       (sort-by :mean-rmse)))

(tc/dataset sweep-summary)

_unnamed [8 3]:
| :depth | :feature-set | :mean-rmse |
|---|---|---|
| 12 | :poly+trig | 1.52866093 |
| 6 | :poly+trig | 1.63217429 |
| 8 | :poly+trig | 1.64435595 |
| 6 | :raw | 1.65064791 |
| 4 | :poly+trig | 1.65350222 |
| 4 | :raw | 1.66164890 |
| 8 | :raw | 1.66262333 |
| 12 | :raw | 1.71514296 |
On this synthetic data, the deepest tree (depth 12) with engineered features (:poly+trig) performs best, whereas depth 12 on raw features performs worst. Shallower trees (depth 4) give similar results regardless of feature set.
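The claim that re-running a sweep only computes new combinations can be illustrated without Pocket. The sketch below uses clojure.core/memoize as an in-memory stand-in for a cached training step; train-stub, cached-train, and sweep are hypothetical names, and the stub does no real training. Unlike Pocket's disk-backed storage, memoize does not survive a JVM restart.

```clojure
;; Counts how many times the stand-in training function actually runs.
(def train-calls (atom 0))

(defn train-stub
  "Hypothetical stand-in for an expensive training step."
  [{:keys [depth feature-set]}]
  (swap! train-calls inc)
  {:model [:cart depth feature-set]})

;; memoize caches by argument value, in memory only.
(def cached-train (memoize train-stub))

(defn sweep
  "Runs the cached training step over every depth x feature-set pair."
  [depths feature-sets]
  (doall
   (for [depth depths
         fs    feature-sets]
     (cached-train {:depth depth :feature-set fs}))))

(sweep [4 6 8 12] [:raw :poly+trig])      ; 8 new combinations
(def calls-after-first @train-calls)      ; 8

(sweep [4 6 8 12 16] [:raw :poly+trig])   ; only depth 16 is new
(def calls-after-second @train-calls)     ; 10
```

Extending the grid from four depths to five adds two training calls (one per feature set) rather than ten, because the eight earlier combinations are served from the cache.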
Sweep provenance
Pick the best result and trace its full provenance. The DAG goes from the trained model back to the original scalar parameters (seed, noise-sd, outlier-fraction, etc.):
(pocket/origin-story-mermaid
(:pocket/model-cached
  (-> sweep-results first first :fit-ctx)))
[Mermaid diagram: provenance DAG for the selected model. The surviving fragment of the output shows a CART model spec with :maxDepth '12' and an edge n2 --> n0.]
Discussion
What the Pocket DAG approach brings to an ML workflow:
| Aspect | What Pocket adds |
|---|---|
| Caching | Per-step, configurable: each step chooses :mem, :mem+disk, or :none |
| Provenance | Full DAG via origin-story: trace any result to its parameters |
| Disk persistence | Cached models and intermediates survive JVM restarts |
| Concurrent dedup | ConcurrentHashMap ensures each computation runs once across threads |
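To make the concurrent dedup row concrete, here is a generic sketch of the technique: a ConcurrentHashMap holding one delay per key, so that concurrent callers share a single computation. This is not Pocket's implementation; in-flight, compute-count, slow-square, and dedup-call are hypothetical names chosen for illustration.

```clojure
(import '(java.util.concurrent ConcurrentHashMap)
        '(java.util.function Function))

(def ^ConcurrentHashMap in-flight (ConcurrentHashMap.))
(def compute-count (atom 0))

(defn slow-square
  "Stand-in for an expensive computation; counts its own invocations."
  [x]
  (swap! compute-count inc)
  (Thread/sleep 50)
  (* x x))

(defn dedup-call
  "Returns (f k), running f at most once per key even under concurrent calls.
  computeIfAbsent installs a single delay per key atomically; derefing a
  delay runs its body once and shares the result with all callers."
  [f k]
  (let [d (.computeIfAbsent in-flight k
                            (reify Function
                              (apply [_ the-key] (delay (f the-key)))))]
    @d))

;; Eight threads ask for the same key at the same time.
(def answers
  (->> (repeatedly 8 #(future (dedup-call slow-square 7)))
       doall
       (mapv deref)))

@compute-count
;; => 1
```

Storing a delay rather than the computed value keeps the mapping function cheap, so the map's internal lock is not held while the expensive work runs.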
Reusing metamorph:
We use mm/pipeline, mm/lift, mm/fit-pipe, and mm/transform-pipe directly, and now also ml/evaluate-pipelines for cross-validation and model comparison. Pocket only adds two custom step types:
- pocket-model: like ml/model, but caches training via pocket/cached so models persist to disk
- pocket-fitted: a general pattern for stateful steps
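As a reminder of what a stateful step looks like in metamorph terms, the sketch below writes one by hand in plain Clojure, with no caching and no library dependency. It follows the metamorph convention of a function from context map to context map that keeps its fitted state under the step's :metamorph/id. The step clip-step is hypothetical and is not the notebook's pocket-fitted; a plain vector of numbers stands in for a dataset.

```clojure
(defn clip-step
  "Hypothetical stateful step: learns clipping bounds in :fit mode and
  reuses them in :transform mode."
  [{:metamorph/keys [data mode id] :as ctx}]
  (case mode
    :fit
    (let [bounds {:lo (apply min data) :hi (apply max data)}]
      (assoc ctx id bounds))               ; fitted state stored under the step id
    :transform
    (let [{:keys [lo hi]} (get ctx id)]
      (assoc ctx :metamorph/data
             (mapv #(-> % (max lo) (min hi)) data)))))

;; Fit on training data: bounds are learned and stored in the context.
(def fitted-ctx
  (clip-step {:metamorph/data [1 5 9]
              :metamorph/mode :fit
              :metamorph/id   :clip}))

;; Transform new data with the bounds learned during fit.
(def transformed-ctx
  (clip-step (assoc fitted-ctx
                    :metamorph/data [-3 4 20]
                    :metamorph/mode :transform)))

(:metamorph/data transformed-ctx)
;; => [1 4 9]
```

In a real pipeline the :metamorph/id is assigned by the pipeline machinery rather than passed by hand, and the fit branch is the part a cached variant would wrap.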
The deref-through pattern:
Each pipeline step wraps a caching-fn and immediately derefs the Cached result. This means real datasets (not Cached references) flow through :metamorph/data at every point. The origin registry provides two benefits:
- Efficient cache keys: each derefed dataset carries its lightweight identity, so the next step's caching-fn avoids hashing full dataset content
- Full provenance: origin-story follows derefed values back through the registry to their Cached origin, preserving the complete DAG (as seen in the diamond dependency above)
Because :metamorph/data is always a real dataset, ml/evaluate-pipelines can call cf/target, cf/prediction, and malli validation, all of which require concrete dataset types.
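How Pocket implements its origin registry is not shown in this chapter. One way to picture the idea is an identity-keyed map from each derefed value to a small description of the call that produced it. The sketch below is written under that assumption; origin-registry, register-origin!, cache-key-part, and run-step are hypothetical names, and plain vectors stand in for datasets.

```clojure
(import '(java.util IdentityHashMap Collections))

;; Maps a derefed value (by object identity, not content) to a small
;; description of how it was produced. Pocket's real registry may differ.
(def ^java.util.Map origin-registry
  (Collections/synchronizedMap (IdentityHashMap.)))

(defn register-origin!
  "Records origin for value and returns the value unchanged."
  [value origin]
  (.put origin-registry value origin)
  value)

(defn cache-key-part
  "Uses the registered origin when there is one; otherwise the value itself."
  [x]
  (if-let [origin (.get origin-registry x)]
    origin
    x))

(defn run-step
  "Calls f on args and records [f-name key-parts] as the result's origin."
  [f-name f & args]
  (let [origin [f-name (mapv cache-key-part args)]]
    (register-origin! (apply f args) origin)))

(def raw     (run-step 'make-data (fn [n] (vec (range n))) 5))
(def clipped (run-step 'clip (fn [data hi] (mapv #(min % hi) data)) raw 3))

(.get origin-registry clipped)
;; => [clip [[make-data [5]] 3]]
```

The key for the second step mentions only the scalar 5 and the function names, never the contents of raw, which is the sense in which the cache key stays lightweight while the values flowing between steps remain ordinary data.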
What we write:
- Plain pipeline functions (data in, data out)
- caching-fn wrappers with storage policies (one line each)
- Pipeline composition via mm/pipeline with our custom steps
- ml/evaluate-pipelines for cross-validation
Open question: where should the custom steps live? pocket-fitted and pocket-model are currently defined in this notebook. A future scicloj.pocket.ml namespace could provide them, but only if the pattern proves stable across different use cases.
Cleanup
(pocket/cleanup!)
[nREPL-session-120ee500-d4ba-4b41-bcdc-e26822f35e2b] INFO scicloj.pocket - Cache cleanup: /tmp/pocket-metamorph
{:dir "/tmp/pocket-metamorph", :existed true}