10  Example: Machine Learning Workflows

Last modified: 2026-02-08

This chapter demonstrates Pocket in a realistic machine learning scenario. If you’re new to ML, don’t worry — we’ll explain the concepts as we go. The focus is on how caching helps when you’re exploring many combinations of data, features, and models.

The problem: We want to predict a numeric value (like house prices or temperature) from input data. This is called regression. We’ll generate synthetic data, try different ways of preparing it, and compare two learning algorithms.

Why caching matters: Training models can be slow. When you’re experimenting — tweaking parameters, trying new features — you don’t want to recompute everything each time. Pocket caches each step independently, so only the parts you changed get recomputed.

What we’ll cover:

  • Setting up a cache directory and defining plain pipeline functions for Pocket to wrap
  • Part 1: how feature engineering affects a linear model versus a decision tree, across eight cached combinations
  • Part 2: how each model degrades as noise increases, reusing cached steps along the way

Note: This notebook uses tablecloth for data manipulation, metamorph.ml and tribuo for ML, and Plotly.js for visualization. These are not Pocket dependencies — they illustrate a realistic ML workflow. All output is shown inline; to reproduce it, add noj to your project dependencies.

Why synthetic data? Working with synthetic data is a standard practice in machine learning. Because we define the true relationship (\(y = \sin(x) \cdot x\)), we can measure exactly how well each model recovers it — something impossible with real-world data where the ground truth is unknown. Synthetic experiments let us isolate one variable at a time: does feature engineering help? How does noise affect each algorithm? These controlled comparisons build intuition that transfers to real problems. In our case, we’ll see that a linear model is helpless against a nonlinear target unless we give it the right features, while a decision tree handles the shape on its own but pays a different price when noise increases.

Setup

(ns pocket-book.ml-workflows
  (:require
   ;; Logging setup for this chapter (see Logging chapter):
   [pocket-book.logging]
   ;; Pocket API:
   [scicloj.pocket :as pocket]
   ;; Annotating kinds of visualizations:
   [scicloj.kindly.v4.kind :as kind]
   ;; Data processing:
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.modelling :as ds-mod]
   ;; Machine learning:
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.loss :as loss]
   [scicloj.ml.tribuo]))
(def cache-dir "/tmp/pocket-regression")
(pocket/set-base-cache-dir! cache-dir)
10:06:44.475 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-regression
"/tmp/pocket-regression"
(pocket/cleanup!)
10:06:44.477 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed false}

Pipeline functions

These are the steps of our ML pipeline — plain Clojure functions that know nothing about caching. Pocket will wrap them later.

Data generation: make-regression-data creates a synthetic dataset from a ground-truth function. We control the sample size, noise level, and random seed — all of which become part of the cache key, so changing any parameter triggers recomputation.

(defn make-regression-data
  "Generate a synthetic regression dataset.
  `f` is a function from x to y (the ground truth).
  Optional `outlier-fraction` (0–1) and `outlier-scale` inject
  corrupted x values to simulate sensor glitches."
  [{:keys [f n noise-sd seed outlier-fraction outlier-scale]
    :or {outlier-fraction 0 outlier-scale 10}}]
  (let [rng (java.util.Random. (long seed))
        xs (vec (repeatedly n #(* 10.0 (.nextDouble rng))))
        xs-final (if (pos? outlier-fraction)
                   (let [out-rng (java.util.Random. (+ (long seed) 7919))]
                     (mapv (fn [x]
                             (if (< (.nextDouble out-rng) outlier-fraction)
                               (+ x (* (double outlier-scale) (.nextGaussian out-rng)))
                               x))
                           xs))
                   xs)
        ys (mapv (fn [x] (+ (double (f x))
                            (* (double noise-sd) (.nextGaussian rng))))
                 xs)]
    (-> (tc/dataset {:x xs-final :y ys})
        (ds-mod/set-inference-target :y))))

Splitting: split-dataset divides data into training and test sets. This is a cached step, so the full provenance chain — from parameters through data generation to the split — is captured in the DAG.

(defn split-dataset
  "Split a dataset into train/test using holdout."
  [ds {:keys [seed]}]
  (first (tc/split->seq ds :holdout {:seed seed})))
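A quick aside on the return shape, which matters for the keyword steps later in this chapter: tc/split->seq yields a sequence of maps with :train and :test dataset entries, so split-dataset returns one such map. A sketch on a toy 10-row dataset (illustrative only, not part of the pipeline):

```clojure
;; Illustrative only: a tiny dataset to show the shape of the split.
(let [toy (tc/dataset {:x (range 10) :y (range 10)})
      {:keys [train test]} (split-dataset toy {:seed 1})]
  ;; train and test are themselves datasets; together they
  ;; partition the original rows (tablecloth's holdout default ratio).
  [(tc/row-count toy) (tc/row-count train) (tc/row-count test)])
```

Because the result is a plain map, ordinary keywords like :train and :test can serve as accessor steps in the cached pipeline below.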

Feature engineering: prepare-features transforms raw data by adding derived columns. The choice of feature set is a key hyperparameter — a linear model with only :raw features can’t learn nonlinear patterns, but :trig or :poly+trig features give it the building blocks it needs.

(defn prepare-features
  "Add derived columns to a dataset according to `feature-set`.
  Supported feature sets:

  - `:raw`       — no extra columns
  - `:quadratic` — add x²
  - `:trig`      — add sin(x) and cos(x)
  - `:poly+trig` — add x², sin(x), and cos(x)"
  [ds feature-set]
  (let [x (:x ds)]
    (-> (case feature-set
          :raw ds
          :quadratic (tc/add-columns ds {:x2 (tcc/sq x)})
          :trig (tc/add-columns ds {:sin-x (tcc/sin x)
                                    :cos-x (tcc/cos x)})
          :poly+trig (tc/add-columns ds {:x2 (tcc/sq x)
                                         :sin-x (tcc/sin x)
                                         :cos-x (tcc/cos x)}))
        (ds-mod/set-inference-target :y))))

Training and evaluation: train-model fits a model to prepared data, and predict-and-rmse measures how well it generalizes to unseen test data. These are thin wrappers around metamorph.ml — the caching value comes from avoiding redundant retraining when only downstream parameters change.

(defn train-model
  "Train a model on a dataset."
  [train-ds model-spec]
  (ml/train train-ds model-spec))
(defn predict-and-rmse
  "Predict on test data and return RMSE."
  [test-ds model]
  (let [pred (ml/predict test-ds model)]
    (loss/rmse (:y test-ds) (:y pred))))
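For readers new to the metric: RMSE (root mean squared error) is the square root of the average squared difference between predictions and actual values. Lower is better, and it is in the same units as y. A plain-Clojure sketch of the formula (rmse-sketch is a hypothetical helper for illustration; the pipeline itself uses loss/rmse):

```clojure
;; RMSE from first principles, without any ML library.
(defn rmse-sketch [actual predicted]
  (let [sq-errs (map (fn [a p] (let [d (- a p)] (* d d)))
                     actual predicted)]
    (Math/sqrt (/ (reduce + sq-errs) (count actual)))))

(rmse-sketch [1.0 2.0 3.0] [1.0 2.0 5.0])
;; => roughly 1.155 (the square root of 4/3)
```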

Ground truth

We need a target relationship for the models to learn. In real problems you don’t know the true relationship; discovering it is the whole task. Here we define it explicitly so we can measure how well each model recovers it.

Our target is \(y = \sin(x) \cdot x\) — a wavy curve that grows with \(x\). A straight line can’t fit this shape, so a simple linear model will struggle unless we help it with better features.

(defn nonlinear-fn
  "y = sin(x) · x"
  [x]
  (* (Math/sin x) x))
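A quick sanity check at the REPL: at x = π/2 we have sin(x) = 1, so y equals x itself; at x = π we have sin(x) = 0, so y should vanish (up to floating-point error):

```clojure
(nonlinear-fn (/ Math/PI 2))
;; => 1.5707963267948966, i.e. π/2

(nonlinear-fn Math/PI)
;; sin(π) is ~1.2e-16 in double precision, so this is
;; ~3.8e-16 rather than exactly 0
```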

Model specifications

We’ll compare two fundamentally different algorithms:

Linear model (gradient descent): Finds the best straight-line (or hyperplane) relationship between inputs and output. Simple and fast, but can only learn linear patterns. Needs good features.

Decision tree (CART): Learns by splitting data into regions based on thresholds (“if x > 5, go left”). Can capture complex patterns automatically, but may overfit noisy data.

These algorithms respond differently to feature engineering — that contrast is the heart of Part 1.

(def linear-sgd-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "squared"
                        :type "org.tribuo.regression.sgd.objectives.SquaredLoss"}
                       {:name "linear-sgd"
                        :type "org.tribuo.regression.sgd.linear.LinearSGDTrainer"
                        :properties {:objective "squared"
                                     :epochs "50"
                                     :loggingInterval "10000"}}]
   :tribuo-trainer-name "linear-sgd"})
(def cart-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "cart"
                        :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                        :properties {:maxDepth "8"}}]
   :tribuo-trainer-name "cart"})

Part 1 — Feature engineering matters (for some models)

Feature engineering means transforming raw inputs into forms that help models learn. For example, if the true relationship involves \(x^2\), adding a squared column gives the model that pattern directly instead of forcing it to discover it.

We’ll test four feature sets:

  • :raw — just the original \(x\) value
  • :quadratic — add \(x^2\)
  • :trig — add \(\sin(x)\) and \(\cos(x)\)
  • :poly+trig — add all three

Crossed with two model types, that’s eight combinations. Every step is cached, so re-running is instant.

Generate data

(def data-c
  (pocket/cached #'make-regression-data
                 {:f #'nonlinear-fn :n 500 :noise-sd 0.5 :seed 42}))
(tc/head (deref data-c))
10:06:44.487 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.494 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b4/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42})

_unnamed [5 2]:

:x :y
7.27563680 6.74555252
6.83223472 4.07224915
3.08719455 0.22904859
2.77078490 0.47163659
6.65548952 2.81816258

Split into train and test

(def split-c
  (pocket/cached #'split-dataset data-c {:seed 42}))

Extract train and test sets — using keywords as cached functions. The DAG now traces from numerical parameters through data generation to the split to each subset.

(def train-c (pocket/cached :train split-c))
(def test-c (pocket/cached :test split-c))

Feature sets

(def feature-sets [:raw :quadratic :trig :poly+trig])

Prepare features (cached)

Each feature set applied to each split half is a separate cached computation — eight in total.

(def prepared
  (into {}
        (for [fs feature-sets
              [role ds-c] [[:train train-c] [:test test-c]]]
          [[fs role]
           (pocket/cached #'prepare-features ds-c fs)])))

Train models (cached)

Two models per feature set — eight cached training runs.

(def models
  (into {}
        (for [fs feature-sets
              [model-name spec] [[:sgd linear-sgd-spec]
                                 [:cart cart-spec]]]
          [[fs model-name]
           (pocket/cached #'train-model
                          (prepared [fs :train])
                          spec)])))

Results

(def feature-results
  (vec (for [fs feature-sets
             [model-name _] [[:sgd linear-sgd-spec]
                             [:cart cart-spec]]]
         {:feature-set fs
          :model (name model-name)
          :rmse (predict-and-rmse @(prepared [fs :test])
                                  @(models [fs model-name]))})))
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.521 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/e3/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})
10:06:44.522 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/23/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42}))
10:06:44.523 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/05/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :raw)
10:06:44.523 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.523 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.524 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.524 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/04/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42}))
10:06:44.525 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/79/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :raw)
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 7.061050130878382
10:06:44.536 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c6/c60324b3a114e8d5646efe7ec8bc1d78e743001b
10:06:44.541 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.549 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1e/1e01826f37666f143cccc6e1883455eb2562ed2e
10:06:44.607 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.610 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/6d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :quadratic)
10:06:44.610 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.610 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.612 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/06/0644c627bd9ef15c830deb29e333d06403c26a4f
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 7.746488252108407
10:06:44.625 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2f/2fdc6bcd8e2009e923d805ad1f2fdc52fc57948e
10:06:44.628 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.636 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/cc/cc330e34b3221a68d7bb7649e629ad6b645e4f47
10:06:44.638 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.640 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/98/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :trig)
10:06:44.640 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.640 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.642 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9b/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :trig)
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 0.9717267353378612
10:06:44.649 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/31/31897de14a3f45f4aaf19d0981053c5fb21403cb
10:06:44.651 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.661 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/04/04527115901fc961e204e50922811c525652d96d
10:06:44.664 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.666 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/99/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.666 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.666 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.669 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d7/d737f05eb6b2fdcaf7508a2f74277ea684a788ac
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.5031022528715852
10:06:44.679 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9a/9af00491ae43878968b94538ab45619c1c58f0d9
10:06:44.681 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.695 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d1/d168df4bc60ad63e27c35709a7e8e2c6b8036407
feature-results
[{:feature-set :raw, :model "sgd", :rmse 3.6917759873191685}
 {:feature-set :raw, :model "cart", :rmse 0.6334615055076024}
 {:feature-set :quadratic, :model "sgd", :rmse 3.576999082640806}
 {:feature-set :quadratic, :model "cart", :rmse 0.6334615055076024}
 {:feature-set :trig, :model "sgd", :rmse 1.4577701415666355}
 {:feature-set :trig, :model "cart", :rmse 0.6805569759436894}
 {:feature-set :poly+trig, :model "sgd", :rmse 1.3410184469297421}
 {:feature-set :poly+trig, :model "cart", :rmse 0.6805569759436894}]

What the results show:

The linear model (SGD) has high error with raw features — it’s trying to draw a straight line through a wavy curve. But give it \(\sin(x)\) and \(\cos(x)\) as features, and it can combine them to approximate the true shape. Feature engineering saved the day.

The decision tree (CART) doesn’t care. It discovers the wavy pattern on its own by splitting the data into regions. Extra features don’t help (the trig columns even nudge its error up slightly here) because the tree has already found the structure from x alone.

Takeaway: Some models need feature engineering; others don’t. Caching lets you explore both without waiting.
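The takeaway can be felt directly at the REPL. Derefing an already-computed step again reads the cached value rather than retraining; a sketch (timings will vary by machine, so no expected output is shown):

```clojure
;; Second deref of an already-computed step: Pocket finds the
;; cached artifact, so no SGD epochs run and `time` reports only
;; the cost of fetching the stored model.
(time @(models [:poly+trig :sgd]))
```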

Predictions plot

Best linear model (poly+trig) vs best tree (raw) vs actual values.

(let [test-ds @(prepared [:raw :test])
      sgd-pred (:y (ml/predict @(prepared [:poly+trig :test])
                               @(models [:poly+trig :sgd])))
      cart-pred (:y (ml/predict test-ds
                                @(models [:raw :cart])))
      xs (vec (:x test-ds))
      actuals (vec (:y test-ds))
      sgd-vals (vec sgd-pred)
      cart-vals (vec cart-pred)]
  (kind/plotly
   {:data [{:x xs :y actuals :mode "markers" :name "actual"
            :marker {:opacity 0.3 :color "gray"}}
           {:x xs :y sgd-vals :mode "markers" :name "Linear SGD (poly+trig)"
            :marker {:opacity 0.5 :color "steelblue"}}
           {:x xs :y cart-vals :mode "markers" :name "CART (raw)"
            :marker {:opacity 0.5 :color "tomato"}}]
    :layout {:xaxis {:title "x"} :yaxis {:title "y"}}}))

Part 2 — How models handle noisy data

Real data is messy. Measurements have errors, and inputs are approximate. Noise is the random variation that obscures the true pattern.

How do our models behave as noise increases? We’ll test five levels, from nearly clean (0.1) to very noisy (5.0).

Notice: the noise=0.5 dataset reuses the cache from Part 1 — Pocket recognizes the same function and arguments.

(def noise-levels [0.1 0.5 1.0 2.0 5.0])
(def noise-results
  (vec
   (for [noise-sd noise-levels]
     (let [data-c (pocket/cached #'make-regression-data
                                 {:f #'nonlinear-fn :n 500 :noise-sd noise-sd :seed 42})
           split-c (pocket/cached #'split-dataset data-c {:seed 42})
           train-c (pocket/cached :train split-c)
           test-c (pocket/cached :test split-c)
           cart-train (pocket/cached #'prepare-features train-c :raw)
           cart-test (pocket/cached #'prepare-features test-c :raw)
           sgd-train (pocket/cached #'prepare-features train-c :poly+trig)
           sgd-test (pocket/cached #'prepare-features test-c :poly+trig)
           cart-model (pocket/cached #'train-model cart-train cart-spec)
           sgd-model (pocket/cached #'train-model sgd-train linear-sgd-spec)]
       {:noise-sd noise-sd
        :cart-rmse (predict-and-rmse @cart-test @cart-model)
        :sgd-rmse (predict-and-rmse @sgd-test @sgd-model)}))))
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.716 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/6a/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42})
10:06:44.721 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3a/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})
10:06:44.722 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/94/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42}))
10:06:44.723 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/96/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :raw)
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.724 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/55/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42}))
10:06:44.725 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/08/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :raw)
10:06:44.735 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b8/b852ace3759232f1dec48d3e01572a918e9e31e6
10:06:44.739 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.741 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/dd/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.741 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.741 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.744 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/bc/bcc1185d8c044fe468f72f058191a99f73c4ea91
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.106624,min=-5.332393,mean=0.805881,variance=15.304966})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.4235088958624642
10:06:44.756 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/82/82fcdba4291961e398a2dabf02049538d45de7ff
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.767 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42})
10:06:44.772 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/68/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})
10:06:44.773 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/85/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42}))
10:06:44.774 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d1/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :raw)
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.775 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/62/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42}))
10:06:44.776 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/ac/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :raw)
10:06:44.785 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/56/56b12a0b688bc09e6b02b1079e0b3db9e27780bb
10:06:44.788 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.789 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.789 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.789 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.791 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/21/2138858cf3d145d3b1dda31f1fbff57c42021903
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=9.816940,min=-6.625413,mean=0.883112,variance=15.979208})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.8153008504655874
10:06:44.799 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/af/afcc6439d95dc6b19a47d7bb13b405a6e3f7bf75
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.803 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1f/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42})
10:06:44.807 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/ec/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})
10:06:44.808 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/a2/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42}))
10:06:44.809 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b0/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :raw)
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.810 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/be/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42}))
10:06:44.811 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d8/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :raw)
10:06:44.818 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/42/4283017d384a0edb1360ed439ce2022cded0120a
10:06:44.821 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.822 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9a/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.823 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.823 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.824 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c5/c596cf0eff32ed864e9956ede9b9f1a8ba604d8a
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=11.717291,min=-8.958664,mean=0.968924,variance=18.285525})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.123027977933463
10:06:44.835 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/e7/e73082ba807aa7585ee9bfe055500e1478428bfb
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.840 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/16/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42})
10:06:44.843 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/33/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})
10:06:44.844 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/f6/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42}))
10:06:44.845 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :raw)
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.846 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b2/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42}))
10:06:44.847 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c0/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :raw)
10:06:44.856 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/41/4125ad81cf3bdb18d7b227f84c899e30d5692f27
10:06:44.860 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.862 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2a/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.862 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.862 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.864 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2a/2ab54b2e292bd3cd5f65f72be9ddd1b0a17c3de8
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=17.418345,min=-15.958416,mean=1.226360,variance=35.039175})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 12.1030268449752
10:06:44.875 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/533425c770f1d0b6600c387d76730e7b41036713
noise-results
[{:noise-sd 0.1,
  :cart-rmse 0.18813079748027944,
  :sgd-rmse 1.2744431229599325}
 {:noise-sd 0.5,
  :cart-rmse 0.6334615055076024,
  :sgd-rmse 1.3410184469297421}
 {:noise-sd 1.0,
  :cart-rmse 1.2499664669902657,
  :sgd-rmse 1.582298583473656}
 {:noise-sd 2.0,
  :cart-rmse 2.453719422103725,
  :sgd-rmse 2.352662287937308}
 {:noise-sd 5.0,
  :cart-rmse 5.960858808406107,
  :sgd-rmse 5.262029923696976}]

What the results show:

At low noise, the tree wins — it captures fine details the linear model smooths over. But as noise increases, the tree starts memorizing random wiggles (overfitting), and its error grows faster than the linear model's — by noise-sd 2.0 the two curves have crossed over.

The linear model degrades more gracefully. Its rigid structure (a weighted sum of features) acts as a built-in regularizer — it can’t chase noise even if it wanted to.

Takeaway: Flexible models (trees) excel with clean data but suffer with noise. Simple models (linear) are more robust.
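RMSE (root mean squared error) is the metric behind all of these comparisons. The notebook computes it with loss/rmse from metamorph.ml; as a minimal plain-Clojure sketch of the same formula:

```clojure
;; Root mean squared error: the square root of the mean squared residual.
(defn rmse [ys preds]
  (let [sq-errs (map (fn [y p] (let [d (- y p)] (* d d))) ys preds)]
    (Math/sqrt (/ (reduce + sq-errs) (count ys)))))

(rmse [1.0 2.0 3.0] [1.0 2.0 5.0])
;; => 1.1547005383792515  (the square root of 4/3)
```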

RMSE vs. noise

(let [noise-sds (vec (map :noise-sd noise-results))
      cart-rmses (vec (map :cart-rmse noise-results))
      sgd-rmses (vec (map :sgd-rmse noise-results))]
  (kind/plotly
   {:data [{:x noise-sds :y cart-rmses :mode "lines+markers" :name "CART"}
           {:x noise-sds :y sgd-rmses :mode "lines+markers" :name "Linear SGD"}]
    :layout {:xaxis {:title "noise-sd"} :yaxis {:title "rmse"}}}))

Part 3 — What got cached?

We’ve run many combinations of data, features, and models. Each pocket/cached call created an independent cache entry. Let’s see what we accumulated:

(:total-entries (pocket/cache-stats))
60
(:entries-per-fn (pocket/cache-stats))
{"pocket-book.ml-workflows/train-model" 16,
 "pocket-book.ml-workflows/prepare-features" 24,
 "pocket-book.ml-workflows/make-regression-data" 5,
 ":test" 5,
 ":train" 5,
 "pocket-book.ml-workflows/split-dataset" 5}

With this small synthetic dataset, each step runs in milliseconds. But the structure is what matters. In real workflows — large datasets, deep neural networks, hyperparameter searches — the same cache graph saves hours or days.

Here’s what happens when you change something:

Change                          What recomputes
Edit a feature set              That feature prep + its models
Change a model hyperparameter   Only that model
Change the noise level          That data + its features + its models
Re-run the whole notebook       Nothing — all cached
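This invalidation pattern follows from how cache keys are built: a key encodes the function together with its full (recursive) argument expression, so a change invalidates exactly its downstream subgraph while untouched branches keep hitting the cache. A toy plain-Clojure sketch of the idea (not Pocket's actual implementation):

```clojure
;; Toy content-addressed cache: the key combines the function name and its
;; arguments. Changing any upstream value produces a new key downstream,
;; while unchanged calls keep hitting the cache.
(def cache (atom {}))

(defn cached-call [fname f & args]
  (let [k (list fname args)]
    (if (contains? @cache k)
      (get @cache k)
      (let [v (apply f args)]
        (swap! cache assoc k v)
        v))))

(cached-call "double" #(* 2 %) 21)  ;; cache miss: computes 42
(cached-call "double" #(* 2 %) 21)  ;; cache hit: returns 42
(cached-call "double" #(* 2 %) 22)  ;; new key: computes 44
```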

Cleanup

(pocket/cleanup!)
10:06:44.907 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed true}

Part 4 — Sharing computations across branches

Real sensors glitch. A positioning system occasionally records a wildly wrong x value — the physics (y) is unaffected, but the recorded input is corrupted. When we build polynomial features like x², these outlier x values get amplified: an errant x=50 gives x²=2500 instead of the expected ~25 from a normal x≈5.

The fix is feature outlier clipping: compute what range of x is “normal” from training data, then clip both train and test inputs to those bounds — before feature engineering.

The clipping threshold must come from training data alone. Using test data would leak future information.

This creates a diamond dependency — one computation (the threshold) feeds into multiple downstream steps:

 make-regression-data (with x outliers)
         |
    split-dataset
         |
    +----+----+
    v         v
 (:train)  (:test)
    |         |
    v         |
fit-threshold |
    |         |
    +----+----+
    v         v
clip(train) clip(test)
    |         |
    v         v
features   features
    |         |
    v         |
train-model   |
    |         |
    +----+----+
    v
  evaluate

Pocket handles this naturally. The threshold node is computed once and feeds both clipping steps. When you change the training data, the threshold recomputes, and both branches update.

Pipeline functions

These are plain functions. Each does one thing: fit a threshold, clip outliers, or evaluate. Pocket will wire them together.

(defn fit-outlier-threshold
  "Compute IQR-based clipping bounds for :x from training data.
  Returns {:lower <bound> :upper <bound>}."
  [train-ds]
  (println "  Fitting outlier threshold from training data...")
  (let [xs (sort (vec (:x train-ds)))
        n (count xs)
        q1 (nth xs (int (* 0.25 n)))
        q3 (nth xs (int (* 0.75 n)))
        iqr (- q3 q1)]
    {:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))
(defn clip-outliers
  "Clip :x values using pre-computed threshold bounds."
  [ds threshold]
  (println "  Clipping outliers with bounds:" (select-keys threshold [:lower :upper]))
  (let [{:keys [lower upper]} threshold]
    (tc/add-column ds :x (-> (:x ds) (tcc/max lower) (tcc/min upper)))))
(defn evaluate-model
  "Evaluate a model on test data."
  [test-ds model]
  (println "  Evaluating model...")
  (let [pred (ml/predict test-ds model)]
    {:rmse (loss/rmse (:y test-ds) (:y pred))}))
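To see the clipping logic on its own, here is the same IQR rule applied to a plain Clojure vector — no tablecloth, hypothetical helper names, but the same quantile convention as fit-outlier-threshold above:

```clojure
;; IQR-based bounds, then elementwise clipping — the same rule as above,
;; applied to a plain vector.
(defn iqr-bounds [xs]
  (let [s (vec (sort xs))
        n (count s)
        q1 (nth s (int (* 0.25 n)))
        q3 (nth s (int (* 0.75 n)))
        iqr (- q3 q1)]
    {:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))

(let [xs [1 2 3 4 5 50]                 ;; 50 is a "sensor glitch"
      {:keys [lower upper]} (iqr-bounds xs)]
  (mapv #(-> % (max lower) (min upper)) xs))
;; => [1 2 3 4 5 9.5]
```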

Build the DAG with mixed storage policies

Not every step needs disk persistence. We use caching-fn with per-function storage policies:

  • :mem for cheap shared computations (threshold, clipping, feature engineering) — no disk I/O; in-memory dedup ensures each runs only once per JVM session, but nothing persists across sessions

  • :none for trivial steps (evaluation) — just tracks identity in the DAG without any shared caching

(def c-fit-threshold
  (pocket/caching-fn #'fit-outlier-threshold {:storage :mem}))
(def c-clip
  (pocket/caching-fn #'clip-outliers {:storage :mem}))
(def c-prepare
  (pocket/caching-fn #'prepare-features {:storage :mem}))
(def c-train
  (pocket/caching-fn #'train-model))
(def c-evaluate
  (pocket/caching-fn #'evaluate-model {:storage :none}))

Generate data with outliers for this demo — 10% of the x values are corrupted by large random spikes, simulating sensor glitches. The y values (the physics) are computed from the clean x, with noise added as before — so only the recorded input is corrupted, not the target.

(def dag-data-c
  (pocket/cached #'make-regression-data
                 {:f #'nonlinear-fn :n 200 :noise-sd 0.3 :seed 99
                  :outlier-fraction 0.1 :outlier-scale 15}))
(def dag-split-c
  (pocket/cached #'split-dataset dag-data-c {:seed 99}))
(def dag-train-c (pocket/cached :train dag-split-c))
(def dag-test-c (pocket/cached :test dag-split-c))

Now wire the pipeline. The threshold is fitted once from training data (in memory) and feeds both clipping steps — a diamond dependency handled naturally.

(def threshold-c
  (c-fit-threshold dag-train-c))
(def train-clipped-c
  (c-clip dag-train-c threshold-c))
(def test-clipped-c
  (c-clip dag-test-c threshold-c))
(def train-prepped-c
  (c-prepare train-clipped-c :poly+trig))
(def test-prepped-c
  (c-prepare test-clipped-c :poly+trig))
(def model-c
  (c-train train-prepped-c cart-spec))
(def metrics-c
  (c-evaluate test-prepped-c model-c))

Visualize the DAG

Pocket provides three functions for DAG introspection, each suited to different use cases.

origin-story returns a nested tree structure. Each cached node has :fn, :args, and :id. The :id is unique; when the same Cached instance appears multiple times (diamond pattern), subsequent occurrences become {:ref <id>} pointers. This avoids infinite recursion and makes the diamond explicit:

(pocket/origin-story metrics-c)
{:fn #'pocket-book.ml-workflows/evaluate-model,
 :args
 [{:fn #'pocket-book.ml-workflows/prepare-features,
   :args
   [{:fn #'pocket-book.ml-workflows/clip-outliers,
     :args
     [{:fn :test,
       :args
       [{:fn #'pocket-book.ml-workflows/split-dataset,
         :args
         [{:fn #'pocket-book.ml-workflows/make-regression-data,
           :args
           [{:value
             {:f #'pocket-book.ml-workflows/nonlinear-fn,
              :n 200,
              :noise-sd 0.3,
              :seed 99,
              :outlier-fraction 0.1,
              :outlier-scale 15}}],
           :id "c6"}
          {:value {:seed 99}}],
         :id "c5"}],
       :id "c4"}
      {:fn #'pocket-book.ml-workflows/fit-outlier-threshold,
       :args [{:fn :train, :args [{:ref "c5"}], :id "c8"}],
       :id "c7"}],
     :id "c3"}
    {:value :poly+trig}],
   :id "c2"}
  {:fn #'pocket-book.ml-workflows/train-model,
   :args
   [{:fn #'pocket-book.ml-workflows/prepare-features,
     :args
     [{:fn #'pocket-book.ml-workflows/clip-outliers,
       :args [{:ref "c8"} {:ref "c7"}],
       :id "c11"}
      {:value :poly+trig}],
     :id "c10"}
    {:value
     {:model-type :scicloj.ml.tribuo/regression,
      :tribuo-components
      [{:name "cart",
        :type "org.tribuo.regression.rtree.CARTRegressionTrainer",
        :properties {:maxDepth "8"}}],
      :tribuo-trainer-name "cart"}}],
   :id "c9"}],
 :id "c1"}

Notice how the threshold node appears as a :ref in one branch — it’s the same computation feeding both train and test clipping.
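Because every node carries an :id, those :ref pointers can be resolved mechanically. A small sketch, assuming only the tree shape shown above (:fn/:args/:id nodes, :value leaves, :ref pointers):

```clojure
;; Index every :fn node by its :id; a {:ref id} occurrence can then be
;; looked up in the index instead of being expanded a second time.
(defn index-by-id [node]
  (if (or (:ref node) (contains? node :value))
    {}
    (apply merge {(:id node) node}
           (map index-by-id (:args node)))))

(let [tree {:fn 'evaluate
            :args [{:fn 'split :args [{:value 42}] :id "c2"}
                   {:ref "c2"}]
            :id "c1"}]
  (:fn (get (index-by-id tree) "c2")))
;; => split
```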

origin-story-graph normalizes the tree into a flat {:nodes ... :edges ...} structure suitable for graph algorithms:

(pocket/origin-story-graph metrics-c)
{:nodes
 {"c9" {:fn #'pocket-book.ml-workflows/fit-outlier-threshold},
  "c10" {:fn :train},
  "c13" {:fn #'pocket-book.ml-workflows/prepare-features},
  "c14" {:fn #'pocket-book.ml-workflows/clip-outliers},
  "v15" {:value :poly+trig},
  "v7"
  {:value
   {:f #'pocket-book.ml-workflows/nonlinear-fn,
    :n 200,
    :noise-sd 0.3,
    :seed 99,
    :outlier-fraction 0.1,
    :outlier-scale 15}},
  "v8" {:value {:seed 99}},
  "c2" {:fn #'pocket-book.ml-workflows/prepare-features},
  "v11" {:value :poly+trig},
  "c12" {:fn #'pocket-book.ml-workflows/train-model},
  "v16"
  {:value
   {:model-type :scicloj.ml.tribuo/regression,
    :tribuo-components
    [{:name "cart",
      :type "org.tribuo.regression.rtree.CARTRegressionTrainer",
      :properties {:maxDepth "8"}}],
    :tribuo-trainer-name "cart"}},
  "c3" {:fn #'pocket-book.ml-workflows/clip-outliers},
  "c4" {:fn :test},
  "c5" {:fn #'pocket-book.ml-workflows/split-dataset},
  "c6" {:fn #'pocket-book.ml-workflows/make-regression-data},
  "c1" {:fn #'pocket-book.ml-workflows/evaluate-model}},
 :edges
 [["c1" "c2"]
  ["c2" "c3"]
  ["c3" "c4"]
  ["c4" "c5"]
  ["c5" "c6"]
  ["c6" "v7"]
  ["c5" "v8"]
  ["c3" "c9"]
  ["c9" "c10"]
  ["c10" "c5"]
  ["c2" "v11"]
  ["c1" "c12"]
  ["c12" "c13"]
  ["c13" "c14"]
  ["c14" "c10"]
  ["c14" "c9"]
  ["c13" "v15"]
  ["c12" "v16"]]}
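Since origin-story-graph returns plain data, ordinary graph code applies directly. For example, here is a sketch that counts consumers per node — in the edge shape shown above, each edge is a [consumer input] pair — so any node counted more than once is shared, like the threshold in the diamond:

```clojure
;; Each edge is [consumer input]; counting the second elements reveals
;; shared nodes — those consumed by more than one downstream step.
(defn consumer-counts [{:keys [edges]}]
  (frequencies (map second edges)))

(consumer-counts {:edges [["c1" "c2"] ["c3" "c9"] ["c14" "c9"]]})
;; => {"c2" 1, "c9" 2}
```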

origin-story-mermaid renders the DAG as a Mermaid flowchart, with arrows showing data flow direction (from inputs toward the final result). The diamond dependency is clearly visible — the threshold feeds both clipping steps:

(pocket/origin-story-mermaid metrics-c)
flowchart TD
n0["evaluate-model"]
n1["prepare-features"]
n2["clip-outliers"]
n3[":test"]
n4["split-dataset"]
n5["make-regression-data"]
n6[/"{:f #'pocket-book.ml-workflows/nonlinear-fn,
:n 200,
:noise-sd 0.3,
:seed 99,
:outlier-fraction 0.1,
:outlier-scale 15}"/]
n6 --> n5
n5 --> n4
n7[/"{:seed 99}"/]
n7 --> n4
n4 --> n3
n3 --> n2
n8["fit-outlier-threshold"]
n9[":train"]
n4 --> n9
n9 --> n8
n8 --> n2
n2 --> n1
n10[/":poly+trig"/]
n10 --> n1
n1 --> n0
n11["train-model"]
n12["prepare-features"]
n13["clip-outliers"]
n9 --> n13
n8 --> n13
n13 --> n12
n14[/":poly+trig"/]
n14 --> n12
n12 --> n11
n15[/"{:model-type :scicloj.ml.tribuo/regression,
:tribuo-components [{:name 'cart',
:type 'org.tribuo.regression.rtree.CARTRegressionTrainer',
:properties {:maxDepth '8'}}],
:tribuo-trainer-name 'cart'}"/]
n15 --> n11
n11 --> n0

Execute the pipeline

(deref metrics-c)
10:06:44.922 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.922 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/clip-outliers
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.925 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/19/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99})
10:06:44.928 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99})
10:06:44.929 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/fb/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99}))
10:06:44.929 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/fit-outlier-threshold
10:06:44.929 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.930 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/09/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99}))
  Fitting outlier threshold from training data...
  Clipping outliers with bounds: {:lower -5.499694170624462, :upper 15.994051959902624}
10:06:44.931 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.931 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.932 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/clip-outliers
  Clipping outliers with bounds: {:lower -5.499694170624462, :upper 15.994051959902624}
10:06:44.940 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/29/298735d4cc3dbde1964a5f86130dba60f3a9db43
  Evaluating model...
{:rmse 1.601302555211606}

How much did clipping help? Let’s compare three scenarios using the same cached building blocks.

The no-clip and clean-baseline pipelines are local — they exist only for this comparison. Each still builds a cached DAG that shares steps with the clipped pipeline above.

(let [;; No-clip: skip clipping, go straight from raw splits to features
      noclip-train-c  (c-prepare dag-train-c :poly+trig)
      noclip-test-c   (c-prepare dag-test-c :poly+trig)
      noclip-model-c  (c-train noclip-train-c cart-spec)
      noclip-metrics  @(c-evaluate noclip-test-c noclip-model-c)
      ;; Clean baseline: same structure, data without outliers
      clean-data-c    (pocket/cached #'make-regression-data
                                     {:f #'nonlinear-fn :n 200 :noise-sd 0.3 :seed 99})
      clean-split-c   (pocket/cached #'split-dataset clean-data-c {:seed 99})
      clean-train-c   (c-prepare (pocket/cached :train clean-split-c) :poly+trig)
      clean-test-c    (c-prepare (pocket/cached :test clean-split-c) :poly+trig)
      clean-model-c   (c-train clean-train-c cart-spec)
      clean-metrics   @(c-evaluate clean-test-c clean-model-c)]
  {:clean            clean-metrics
   :outliers-no-clip noclip-metrics
   :outliers-clipped @metrics-c})
10:06:44.946 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.947 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.947 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.955 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/81/81bd64358cce9b4ef112b6e4b16b04cb1cdfb14e
  Evaluating model...
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.961 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/55/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99})
10:06:44.963 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/87/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99})
10:06:44.964 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/80/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99}))
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.966 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/54/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99}))
10:06:44.973 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/a0/a076e329e7ab037718d99d2a664dfc9114879d46
  Evaluating model...
{:clean {:rmse 0.4263098047865239},
 :outliers-no-clip {:rmse 2.552495499444297},
 :outliers-clipped {:rmse 1.601302555211606}}

Clipping x before building polynomial features makes a visible difference — the amplification through x² is tamed.


Part 5 — Comparing many experiments at once

Hyperparameters are settings you choose before training: tree depth, learning rate, which features to use. Finding good values usually means trying many combinations — a hyperparameter sweep.

Pocket’s compare-experiments helps here. You pass a collection of cached experiments, and it extracts the parameters that vary across them (ignoring ones that are constant).

(defn run-pipeline
  "Run a complete pipeline with given hyperparameters."
  [{:keys [noise-sd feature-set max-depth]}]
  (let [ds (make-regression-data {:f nonlinear-fn :n 200 :noise-sd noise-sd :seed 42})
        sp (split-dataset ds {:seed 42})
        train-prep (prepare-features (:train sp) feature-set)
        test-prep (prepare-features (:test sp) feature-set)
        spec {:model-type :scicloj.ml.tribuo/regression
              :tribuo-components [{:name "cart"
                                   :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                                   :properties {:maxDepth (str max-depth)}}]
              :tribuo-trainer-name "cart"}
        model (ml/train train-prep spec)
        pred (ml/predict test-prep model)]
    {:rmse (loss/rmse (:y test-prep) (:y pred))}))

Run experiments across a grid of hyperparameters:

(def experiments
  (for [noise-sd [0.3 0.5]
        feature-set [:raw :poly+trig]
        max-depth [4 8]]
    (pocket/cached #'run-pipeline
                   {:noise-sd noise-sd
                    :feature-set feature-set
                    :max-depth max-depth})))

Compare all experiments — only varying parameters are shown:

(def comparison
  (pocket/compare-experiments experiments))
10:06:44.985 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:44.993 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/0f/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 4, :noise-sd 0.3})
10:06:44.993 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.001 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/23/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 8, :noise-sd 0.3})
10:06:45.001 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.011 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3f/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 4, :noise-sd 0.3})
10:06:45.011 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.024 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/40/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 8, :noise-sd 0.3})
10:06:45.024 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.034 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c8/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 4, :noise-sd 0.5})
10:06:45.035 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.045 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/01/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 8, :noise-sd 0.5})
10:06:45.046 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.058 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b3/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 4, :noise-sd 0.5})
10:06:45.059 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.072 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/00/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 8, :noise-sd 0.5})
(tc/dataset comparison)

_unnamed [8 4]:

:noise-sd :feature-set :max-depth :result
0.3 :raw 4 {:rmse 0.7189521159338053}
0.3 :raw 8 {:rmse 0.41024324994778005}
0.3 :poly+trig 4 {:rmse 0.5297031491020386}
0.3 :poly+trig 8 {:rmse 0.4530388384300822}
0.5 :raw 4 {:rmse 0.8815492449083825}
0.5 :raw 8 {:rmse 0.6467985374993637}
0.5 :poly+trig 4 {:rmse 0.7728233864192875}
0.5 :poly+trig 8 {:rmse 0.6785270538736407}

Each row shows the varying parameters plus the result. Parameters that were constant (like seed=42) are excluded automatically — you see only what differs.
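A sketch of how that filtering might work (plain Clojure, not Pocket's actual implementation): keep only the keys whose values differ across the experiments' parameter maps:

```clojure
;; Keep only parameter keys that take more than one distinct value
;; across the experiment grid; constant keys (like :seed) drop out.
(defn varying-keys [param-maps]
  (->> (mapcat keys param-maps)
       distinct
       (filter #(> (count (distinct (map % param-maps))) 1))))

(varying-keys [{:seed 42 :noise-sd 0.3 :max-depth 4}
               {:seed 42 :noise-sd 0.3 :max-depth 8}
               {:seed 42 :noise-sd 0.5 :max-depth 4}])
;; => (:noise-sd :max-depth)
```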

Results visualization

(let [rows (map (fn [exp]
                  (merge (select-keys exp [:noise-sd :feature-set :max-depth])
                         (:result exp)))
                comparison)
      ;; Group by both feature-set and noise-sd for legend entries
      grouped (group-by (juxt :feature-set :noise-sd) rows)
      feature-colors {:raw "steelblue" :poly "tomato" :poly+trig "green"}]
  (kind/plotly
   {:data (vec (for [[[feature-set noise-sd] pts] (sort-by first grouped)
                     :let [max-depths (mapv :max-depth pts)
                           rmses (mapv :rmse pts)]]
                 {:x max-depths
                  :y rmses
                  :mode "markers"
                  :name (str (name feature-set) " (noise=" noise-sd ")")
                  :legendgroup (name feature-set)
                  :marker {:size (+ 8 (* 15 noise-sd))
                           :color (feature-colors feature-set)}}))

    :layout {:xaxis {:title "max-depth"} :yaxis {:title "rmse"}}}))

What we learned

This experiment revealed a clear story about the interplay between models, features, and noise:

  • Feature engineering is decisive for linear models. With raw features, the linear model couldn’t capture the nonlinear target at all. Adding trigonometric features (sin, cos) — which match the structure of the true function — dramatically improved it. The model didn’t get smarter; we gave it the right vocabulary.

  • Decision trees are self-sufficient but fragile. The CART model achieved low error regardless of feature set, because it can learn nonlinear splits on its own. But as noise increased, it began fitting the noise rather than the signal — a classic overfitting pattern.

  • The crossover point matters. At low noise, the tree wins. At high noise, the well-featured linear model degrades more gracefully. Knowing where this crossover happens is exactly the kind of insight you get from systematic experimentation.

  • Caching structures the workflow. In this small example, each step runs in milliseconds — caching isn’t needed for speed. But the pattern scales: with real datasets and expensive training, the same pipeline structure ensures that only changed steps recompute. Meanwhile, compare-experiments extracted the varying parameters automatically, turning cached results into a comparison table — useful at any scale.

  • Preprocessing order matters. Outlier x values get amplified by polynomial features (x²), so clipping must come before feature engineering. The diamond dependency — one threshold feeding both train and test clipping — is handled naturally by Pocket’s DAG.

Cleanup

(pocket/cleanup!)
10:06:45.084 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed true}
source: notebooks/pocket_book/ml_workflows.clj