10  Example: Machine Learning Workflows

Last modified: 2026-02-08

This chapter demonstrates Pocket in a realistic machine learning scenario. If you’re new to ML, don’t worry — we’ll explain the concepts as we go. The focus is on how caching helps when you’re exploring many combinations of data, features, and models.

The problem: We want to predict a numeric value (like house prices or temperature) from input data. This is called regression. We’ll generate synthetic data, try different ways of preparing it, and compare two learning algorithms.

Why caching matters: Training models can be slow. When you’re experimenting — tweaking parameters, trying new features — you don’t want to recompute everything each time. Pocket caches each step independently, so only the parts you changed get recomputed.

What we’ll cover:

  • Setting up a cache directory and defining plain pipeline functions for Pocket to wrap
  • Part 1: how feature engineering affects a linear model versus a decision tree, across eight cached combinations
  • Part 2: how each model degrades as noise increases, reusing cached steps along the way

Note: This notebook uses tablecloth for data manipulation, metamorph.ml and tribuo for ML, and Plotly.js for visualization. These are not Pocket dependencies — they illustrate a realistic ML workflow. All output is shown inline; to reproduce it, add noj to your project dependencies.

Why synthetic data? Working with synthetic data is a standard practice in machine learning. Because we define the true relationship (\(y = \sin(x) \cdot x\)), we can measure exactly how well each model recovers it — something impossible with real-world data where the ground truth is unknown. Synthetic experiments let us isolate one variable at a time: does feature engineering help? How does noise affect each algorithm? These controlled comparisons build intuition that transfers to real problems. In our case, we’ll see that a linear model is helpless against a nonlinear target unless we give it the right features, while a decision tree handles the shape on its own but pays a different price when noise increases.

Setup

(ns pocket-book.ml-workflows
  (:require
   ;; Logging setup for this chapter (see Logging chapter):
   [pocket-book.logging]
   ;; Pocket API:
   [scicloj.pocket :as pocket]
   ;; Annotating kinds of visualizations:
   [scicloj.kindly.v4.kind :as kind]
   ;; Data processing:
   [tablecloth.api :as tc]
   [tablecloth.column.api :as tcc]
   [tech.v3.dataset :as ds]
   [tech.v3.dataset.modelling :as ds-mod]
   ;; Machine learning:
   [scicloj.metamorph.ml :as ml]
   [scicloj.metamorph.ml.loss :as loss]
   [scicloj.ml.tribuo]))
(def cache-dir "/tmp/pocket-regression")
(pocket/set-base-cache-dir! cache-dir)
10:06:44.475 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-regression
"/tmp/pocket-regression"
(pocket/cleanup!)
10:06:44.477 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed false}

Pipeline functions

These are the steps of our ML pipeline — plain Clojure functions that know nothing about caching. Pocket will wrap them later.

Data generation: make-regression-data creates a synthetic dataset from a ground-truth function. We control the sample size, noise level, and random seed — all of which become part of the cache key, so changing any parameter triggers recomputation.

(defn make-regression-data
  "Generate a synthetic regression dataset.
  `f` is a function from x to y (the ground truth).
  Optional `outlier-fraction` (0–1) and `outlier-scale` inject
  corrupted x values to simulate sensor glitches."
  [{:keys [f n noise-sd seed outlier-fraction outlier-scale]
    :or {outlier-fraction 0 outlier-scale 10}}]
  (let [rng (java.util.Random. (long seed))
        xs (vec (repeatedly n #(* 10.0 (.nextDouble rng))))
        xs-final (if (pos? outlier-fraction)
                   (let [out-rng (java.util.Random. (+ (long seed) 7919))]
                     (mapv (fn [x]
                             (if (< (.nextDouble out-rng) outlier-fraction)
                               (+ x (* (double outlier-scale) (.nextGaussian out-rng)))
                               x))
                           xs))
                   xs)
        ys (mapv (fn [x] (+ (double (f x))
                            (* (double noise-sd) (.nextGaussian rng))))
                 xs)]
    (-> (tc/dataset {:x xs-final :y ys})
        (ds-mod/set-inference-target :y))))

Splitting: split-dataset divides data into training and test sets. This is a cached step, so the full provenance chain — from parameters through data generation to the split — is captured in the DAG.

(defn split-dataset
  "Split a dataset into train/test using holdout."
  [ds {:keys [seed]}]
  (first (tc/split->seq ds :holdout {:seed seed})))
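A quick aside on the return shape, which matters for the keyword steps later in this chapter: tc/split->seq yields a sequence of maps with :train and :test dataset entries, so split-dataset returns one such map. A sketch on a toy 10-row dataset (illustrative only, not part of the pipeline):

```clojure
;; Illustrative only: a tiny dataset to show the shape of the split.
(let [toy (tc/dataset {:x (range 10) :y (range 10)})
      {:keys [train test]} (split-dataset toy {:seed 1})]
  ;; train and test are themselves datasets; together they
  ;; partition the original rows (tablecloth's holdout default ratio).
  [(tc/row-count toy) (tc/row-count train) (tc/row-count test)])
```

Because the result is a plain map, ordinary keywords like :train and :test can serve as accessor steps in the cached pipeline below.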

Feature engineering: prepare-features transforms raw data by adding derived columns. The choice of feature set is a key hyperparameter — a linear model with only :raw features can’t learn nonlinear patterns, but :trig or :poly+trig features give it the building blocks it needs.

(defn prepare-features
  "Add derived columns to a dataset according to `feature-set`.
  Supported feature sets:

  - `:raw`       — no extra columns
  - `:quadratic` — add x²
  - `:trig`      — add sin(x) and cos(x)
  - `:poly+trig` — add x², sin(x), and cos(x)"
  [ds feature-set]
  (let [x (:x ds)]
    (-> (case feature-set
          :raw ds
          :quadratic (tc/add-columns ds {:x2 (tcc/sq x)})
          :trig (tc/add-columns ds {:sin-x (tcc/sin x)
                                    :cos-x (tcc/cos x)})
          :poly+trig (tc/add-columns ds {:x2 (tcc/sq x)
                                         :sin-x (tcc/sin x)
                                         :cos-x (tcc/cos x)}))
        (ds-mod/set-inference-target :y))))

Training and evaluation: train-model fits a model to prepared data, and predict-and-rmse measures how well it generalizes to unseen test data. These are thin wrappers around metamorph.ml — the caching value comes from avoiding redundant retraining when only downstream parameters change.

(defn train-model
  "Train a model on a dataset."
  [train-ds model-spec]
  (ml/train train-ds model-spec))
(defn predict-and-rmse
  "Predict on test data and return RMSE."
  [test-ds model]
  (let [pred (ml/predict test-ds model)]
    (loss/rmse (:y test-ds) (:y pred))))
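For readers new to the metric: RMSE (root mean squared error) is the square root of the average squared difference between predictions and actual values. Lower is better, and it is in the same units as y. A plain-Clojure sketch of the formula (rmse-sketch is a hypothetical helper for illustration; the pipeline itself uses loss/rmse):

```clojure
;; RMSE from first principles, without any ML library.
(defn rmse-sketch [actual predicted]
  (let [sq-errs (map (fn [a p] (let [d (- a p)] (* d d)))
                     actual predicted)]
    (Math/sqrt (/ (reduce + sq-errs) (count actual)))))

(rmse-sketch [1.0 2.0 3.0] [1.0 2.0 5.0])
;; => roughly 1.155 (the square root of 4/3)
```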

Ground truth

We need a target relationship for the models to learn. In real problems you don’t know the true relationship; discovering it is the whole task. Here we define it explicitly so we can measure how well each model recovers it.

Our target is \(y = \sin(x) \cdot x\) — a wavy curve that grows with \(x\). A straight line can’t fit this shape, so a simple linear model will struggle unless we help it with better features.

(defn nonlinear-fn
  "y = sin(x) · x"
  [x]
  (* (Math/sin x) x))
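A quick sanity check at the REPL: at x = π/2 we have sin(x) = 1, so y equals x itself; at x = π we have sin(x) = 0, so y should vanish (up to floating-point error):

```clojure
(nonlinear-fn (/ Math/PI 2))
;; => 1.5707963267948966, i.e. π/2

(nonlinear-fn Math/PI)
;; sin(π) is ~1.2e-16 in double precision, so this is
;; ~3.8e-16 rather than exactly 0
```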

Model specifications

We’ll compare two fundamentally different algorithms:

Linear model (gradient descent): Finds the best straight-line (or hyperplane) relationship between inputs and output. Simple and fast, but can only learn linear patterns. Needs good features.

Decision tree (CART): Learns by splitting data into regions based on thresholds (“if x > 5, go left”). Can capture complex patterns automatically, but may overfit noisy data.

These algorithms respond differently to feature engineering — that contrast is the heart of Part 1.

(def linear-sgd-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "squared"
                        :type "org.tribuo.regression.sgd.objectives.SquaredLoss"}
                       {:name "linear-sgd"
                        :type "org.tribuo.regression.sgd.linear.LinearSGDTrainer"
                        :properties {:objective "squared"
                                     :epochs "50"
                                     :loggingInterval "10000"}}]
   :tribuo-trainer-name "linear-sgd"})
(def cart-spec
  {:model-type :scicloj.ml.tribuo/regression
   :tribuo-components [{:name "cart"
                        :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                        :properties {:maxDepth "8"}}]
   :tribuo-trainer-name "cart"})

Part 1 — Feature engineering matters (for some models)

Feature engineering means transforming raw inputs into forms that help models learn. For example, if the true relationship involves \(x^2\), adding a squared column gives the model that pattern directly instead of forcing it to discover it.

We’ll test four feature sets:

  • :raw — just the original \(x\) value
  • :quadratic — add \(x^2\)
  • :trig — add \(\sin(x)\) and \(\cos(x)\)
  • :poly+trig — add all three

Crossed with two model types, that’s eight combinations. Every step is cached, so re-running is instant.

Generate data

(def data-c
  (pocket/cached #'make-regression-data
                 {:f #'nonlinear-fn :n 500 :noise-sd 0.5 :seed 42}))
(tc/head (deref data-c))
10:06:44.487 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.494 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b4/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42})

_unnamed [5 2]:

:x :y
7.27563680 6.74555252
6.83223472 4.07224915
3.08719455 0.22904859
2.77078490 0.47163659
6.65548952 2.81816258

Split into train and test

(def split-c
  (pocket/cached #'split-dataset data-c {:seed 42}))

Extract train and test sets — using keywords as cached functions. The DAG now traces from numerical parameters through data generation to the split to each subset.

(def train-c (pocket/cached :train split-c))
(def test-c (pocket/cached :test split-c))

Feature sets

(def feature-sets [:raw :quadratic :trig :poly+trig])

Prepare features (cached)

Each feature set applied to each split half is a separate cached computation — eight in total.

(def prepared
  (into {}
        (for [fs feature-sets
              [role ds-c] [[:train train-c] [:test test-c]]]
          [[fs role]
           (pocket/cached #'prepare-features ds-c fs)])))

Train models (cached)

Two models per feature set — eight cached training runs.

(def models
  (into {}
        (for [fs feature-sets
              [model-name spec] [[:sgd linear-sgd-spec]
                                 [:cart cart-spec]]]
          [[fs model-name]
           (pocket/cached #'train-model
                          (prepared [fs :train])
                          spec)])))

Results

(def feature-results
  (vec (for [fs feature-sets
             [model-name _] [[:sgd linear-sgd-spec]
                             [:cart cart-spec]]]
         {:feature-set fs
          :model (name model-name)
          :rmse (predict-and-rmse @(prepared [fs :test])
                                  @(models [fs model-name]))})))
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.516 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.521 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/e3/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})
10:06:44.522 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/23/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42}))
10:06:44.523 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/05/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :raw)
10:06:44.523 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.523 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.524 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.524 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/04/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42}))
10:06:44.525 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/79/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :raw)
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 7.061050130878382
10:06:44.536 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c6/c60324b3a114e8d5646efe7ec8bc1d78e743001b
10:06:44.541 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.549 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1e/1e01826f37666f143cccc6e1883455eb2562ed2e
10:06:44.607 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.610 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/6d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :quadratic)
10:06:44.610 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.610 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.612 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/06/0644c627bd9ef15c830deb29e333d06403c26a4f
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 7.746488252108407
10:06:44.625 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2f/2fdc6bcd8e2009e923d805ad1f2fdc52fc57948e
10:06:44.628 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.636 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/cc/cc330e34b3221a68d7bb7649e629ad6b645e4f47
10:06:44.638 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.640 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/98/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :trig)
10:06:44.640 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.640 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.642 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9b/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :trig)
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 0.9717267353378612
10:06:44.649 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/31/31897de14a3f45f4aaf19d0981053c5fb21403cb
10:06:44.651 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.661 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/04/04527115901fc961e204e50922811c525652d96d
10:06:44.664 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.666 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/99/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.5, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.666 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.666 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.669 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d7/d737f05eb6b2fdcaf7508a2f74277ea684a788ac
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.866764,min=-5.501972,mean=0.840206,variance=15.440717})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.5031022528715852
10:06:44.679 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9a/9af00491ae43878968b94538ab45619c1c58f0d9
10:06:44.681 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.695 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d1/d168df4bc60ad63e27c35709a7e8e2c6b8036407
feature-results
[{:feature-set :raw, :model "sgd", :rmse 3.6917759873191685}
 {:feature-set :raw, :model "cart", :rmse 0.6334615055076024}
 {:feature-set :quadratic, :model "sgd", :rmse 3.576999082640806}
 {:feature-set :quadratic, :model "cart", :rmse 0.6334615055076024}
 {:feature-set :trig, :model "sgd", :rmse 1.4577701415666355}
 {:feature-set :trig, :model "cart", :rmse 0.6805569759436894}
 {:feature-set :poly+trig, :model "sgd", :rmse 1.3410184469297421}
 {:feature-set :poly+trig, :model "cart", :rmse 0.6805569759436894}]

What the results show:

The linear model (SGD) has high error with raw features — it’s trying to draw a straight line through a wavy curve. But give it \(\sin(x)\) and \(\cos(x)\) as features, and it can combine them to approximate the true shape. Feature engineering saved the day.

The decision tree (CART) doesn’t care. It discovers the wavy pattern on its own by splitting the data into regions. Extra features don’t help (the trig columns even nudge its error up slightly here) because the tree has already found the structure from x alone.

Takeaway: Some models need feature engineering; others don’t. Caching lets you explore both without waiting.
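The takeaway can be felt directly at the REPL. Derefing an already-computed step again reads the cached value rather than retraining; a sketch (timings will vary by machine, so no expected output is shown):

```clojure
;; Second deref of an already-computed step: Pocket finds the
;; cached artifact, so no SGD epochs run and `time` reports only
;; the cost of fetching the stored model.
(time @(models [:poly+trig :sgd]))
```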

Predictions plot

Best linear model (poly+trig) vs best tree (raw) vs actual values.

(let [test-ds @(prepared [:raw :test])
      sgd-pred (:y (ml/predict @(prepared [:poly+trig :test])
                               @(models [:poly+trig :sgd])))
      cart-pred (:y (ml/predict test-ds
                                @(models [:raw :cart])))
      xs (vec (:x test-ds))
      actuals (vec (:y test-ds))
      sgd-vals (vec sgd-pred)
      cart-vals (vec cart-pred)]
  (kind/plotly
   {:data [{:x xs :y actuals :mode "markers" :name "actual"
            :marker {:opacity 0.3 :color "gray"}}
           {:x xs :y sgd-vals :mode "markers" :name "Linear SGD (poly+trig)"
            :marker {:opacity 0.5 :color "steelblue"}}
           {:x xs :y cart-vals :mode "markers" :name "CART (raw)"
            :marker {:opacity 0.5 :color "tomato"}}]
    :layout {:xaxis {:title "x"} :yaxis {:title "y"}}}))

Part 2 — How models handle noisy data

Real data is messy. Measurements have errors, and inputs are approximate. Noise is the random variation that obscures the true pattern.

How do our models behave as noise increases? We’ll test five levels, from nearly clean (0.1) to very noisy (5.0).

Notice: the noise=0.5 dataset reuses the cache from Part 1 — Pocket recognizes the same function and arguments.

(def noise-levels [0.1 0.5 1.0 2.0 5.0])
(def noise-results
  (vec
   (for [noise-sd noise-levels]
     (let [data-c (pocket/cached #'make-regression-data
                                 {:f #'nonlinear-fn :n 500 :noise-sd noise-sd :seed 42})
           split-c (pocket/cached #'split-dataset data-c {:seed 42})
           train-c (pocket/cached :train split-c)
           test-c (pocket/cached :test split-c)
           cart-train (pocket/cached #'prepare-features train-c :raw)
           cart-test (pocket/cached #'prepare-features test-c :raw)
           sgd-train (pocket/cached #'prepare-features train-c :poly+trig)
           sgd-test (pocket/cached #'prepare-features test-c :poly+trig)
           cart-model (pocket/cached #'train-model cart-train cart-spec)
           sgd-model (pocket/cached #'train-model sgd-train linear-sgd-spec)]
       {:noise-sd noise-sd
        :cart-rmse (predict-and-rmse @cart-test @cart-model)
        :sgd-rmse (predict-and-rmse @sgd-test @sgd-model)}))))
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.714 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.716 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/6a/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42})
10:06:44.721 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3a/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})
10:06:44.722 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/94/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42}))
10:06:44.723 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/96/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :raw)
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.723 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.724 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/55/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42}))
10:06:44.725 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/08/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :raw)
10:06:44.735 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b8/b852ace3759232f1dec48d3e01572a918e9e31e6
10:06:44.739 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.741 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/dd/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 0.1, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.741 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.741 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.744 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/bc/bcc1185d8c044fe468f72f058191a99f73c4ea91
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=8.106624,min=-5.332393,mean=0.805881,variance=15.304966})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.4235088958624642
10:06:44.756 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/82/82fcdba4291961e398a2dabf02049538d45de7ff
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.765 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.767 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42})
10:06:44.772 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/68/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})
10:06:44.773 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/85/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42}))
10:06:44.774 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d1/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :raw)
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.774 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.775 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/62/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42}))
10:06:44.776 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/ac/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :raw)
10:06:44.785 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/56/56b12a0b688bc09e6b02b1079e0b3db9e27780bb
10:06:44.788 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.789 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 1.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.789 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.789 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.791 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/21/2138858cf3d145d3b1dda31f1fbff57c42021903
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=9.816940,min=-6.625413,mean=0.883112,variance=15.979208})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 1.8153008504655874
10:06:44.799 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/af/afcc6439d95dc6b19a47d7bb13b405a6e3f7bf75
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.802 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.803 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1f/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42})
10:06:44.807 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/ec/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})
10:06:44.808 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/a2/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42}))
10:06:44.809 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b0/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :raw)
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.809 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.810 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/be/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42}))
10:06:44.811 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/d8/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :raw)
10:06:44.818 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/42/4283017d384a0edb1360ed439ce2022cded0120a
10:06:44.821 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.822 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/9a/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 2.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.823 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.823 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.824 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c5/c596cf0eff32ed864e9956ede9b9f1a8ba604d8a
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=11.717291,min=-8.958664,mean=0.968924,variance=18.285525})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 3.123027977933463
10:06:44.835 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/e7/e73082ba807aa7585ee9bfe055500e1478428bfb
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.838 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.840 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/16/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42})
10:06:44.843 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/33/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})
10:06:44.844 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/f6/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42}))
10:06:44.845 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/1d/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :raw)
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.845 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.846 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b2/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42}))
10:06:44.847 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c0/(pocket-book.ml-workflows_prepare-features (:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :raw)
10:06:44.856 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/41/4125ad81cf3bdb18d7b227f84c899e30d5692f27
10:06:44.860 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.862 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2a/(pocket-book.ml-workflows_prepare-features (:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 500, :noise-sd 5.0, :seed 42}) {:seed 42})) :poly+trig)
10:06:44.862 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.862 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/prepare-features
10:06:44.864 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/2a/2ab54b2e292bd3cd5f65f72be9ddd1b0a17c3de8
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Training SGD model with 333 examples
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: Outputs - RegressionInfo({name=y,id=0,count=333,max=17.418345,min=-15.958416,mean=1.226360,variance=35.039175})
Feb 09, 2026 10:06:44 AM org.tribuo.common.sgd.AbstractSGDTrainer train
INFO: At iteration 10000, average loss = 12.1030268449752
10:06:44.875 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/533425c770f1d0b6600c387d76730e7b41036713
noise-results
[{:noise-sd 0.1,
  :cart-rmse 0.18813079748027944,
  :sgd-rmse 1.2744431229599325}
 {:noise-sd 0.5,
  :cart-rmse 0.6334615055076024,
  :sgd-rmse 1.3410184469297421}
 {:noise-sd 1.0,
  :cart-rmse 1.2499664669902657,
  :sgd-rmse 1.582298583473656}
 {:noise-sd 2.0,
  :cart-rmse 2.453719422103725,
  :sgd-rmse 2.352662287937308}
 {:noise-sd 5.0,
  :cart-rmse 5.960858808406107,
  :sgd-rmse 5.262029923696976}]

What the results show:

At low noise, the tree wins — it captures fine details the linear model smooths over. But as noise increases, the tree starts memorizing random wiggles (overfitting), and its error grows faster than the linear model's — by noise-sd 2.0 the two curves have crossed over.

The linear model degrades more gracefully. Its rigid structure (a weighted sum of features) acts as a built-in regularizer — it can’t chase noise even if it wanted to.

Takeaway: Flexible models (trees) excel with clean data but suffer with noise. Simple models (linear) are more robust.
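RMSE (root mean squared error) is the metric behind all of these comparisons. The notebook computes it with loss/rmse from metamorph.ml; as a minimal plain-Clojure sketch of the same formula:

```clojure
;; Root mean squared error: the square root of the mean squared residual.
(defn rmse [ys preds]
  (let [sq-errs (map (fn [y p] (let [d (- y p)] (* d d))) ys preds)]
    (Math/sqrt (/ (reduce + sq-errs) (count ys)))))

(rmse [1.0 2.0 3.0] [1.0 2.0 5.0])
;; => 1.1547005383792515  (the square root of 4/3)
```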

RMSE vs. noise

(let [noise-sds (vec (map :noise-sd noise-results))
      cart-rmses (vec (map :cart-rmse noise-results))
      sgd-rmses (vec (map :sgd-rmse noise-results))]
  (kind/plotly
   {:data [{:x noise-sds :y cart-rmses :mode "lines+markers" :name "CART"}
           {:x noise-sds :y sgd-rmses :mode "lines+markers" :name "Linear SGD"}]
    :layout {:xaxis {:title "noise-sd"} :yaxis {:title "rmse"}}}))

Part 3 — What got cached?

We’ve run many combinations of data, features, and models. Each pocket/cached call created an independent cache entry. Let’s see what we accumulated:

(:total-entries (pocket/cache-stats))
60
(:entries-per-fn (pocket/cache-stats))
{"pocket-book.ml-workflows/train-model" 16,
 "pocket-book.ml-workflows/prepare-features" 24,
 "pocket-book.ml-workflows/make-regression-data" 5,
 ":test" 5,
 ":train" 5,
 "pocket-book.ml-workflows/split-dataset" 5}

With this small synthetic dataset, each step runs in milliseconds. But the structure is what matters. In real workflows — large datasets, deep neural networks, hyperparameter searches — the same cache graph saves hours or days.

Here’s what happens when you change something:

Change                          What recomputes
Edit a feature set              That feature prep + its models
Change a model hyperparameter   Only that model
Change the noise level          That data + its features + its models
Re-run the whole notebook       Nothing — all cached
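This invalidation pattern follows from how cache keys are built: a key encodes the function together with its full (recursive) argument expression, so a change invalidates exactly its downstream subgraph while untouched branches keep hitting the cache. A toy plain-Clojure sketch of the idea (not Pocket's actual implementation):

```clojure
;; Toy content-addressed cache: the key combines the function name and its
;; arguments. Changing any upstream value produces a new key downstream,
;; while unchanged calls keep hitting the cache.
(def cache (atom {}))

(defn cached-call [fname f & args]
  (let [k (list fname args)]
    (if (contains? @cache k)
      (get @cache k)
      (let [v (apply f args)]
        (swap! cache assoc k v)
        v))))

(cached-call "double" #(* 2 %) 21)  ;; cache miss: computes 42
(cached-call "double" #(* 2 %) 21)  ;; cache hit: returns 42
(cached-call "double" #(* 2 %) 22)  ;; new key: computes 44
```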

Cleanup

(pocket/cleanup!)
10:06:44.907 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed true}

Part 4 — Sharing computations across branches

Real sensors glitch. A positioning system occasionally records a wildly wrong x value — the physics (y) is unaffected, but the recorded input is corrupted. When we build polynomial features like x², these outlier x values get amplified: an errant x=50 gives x²=2500 instead of the expected ~25 from a normal x≈5.

The fix is feature outlier clipping: compute what range of x is “normal” from training data, then clip both train and test inputs to those bounds — before feature engineering.

The clipping threshold must come from training data alone. Using test data would leak future information.

This creates a diamond dependency — one computation (the threshold) feeds into multiple downstream steps:

 make-regression-data (with x outliers)
         |
    split-dataset
         |
    +----+----+
    v         v
 (:train)  (:test)
    |         |
    v         |
fit-threshold |
    |         |
    +----+----+
    v         v
clip(train) clip(test)
    |         |
    v         v
features   features
    |         |
    v         |
train-model   |
    |         |
    +----+----+
    v
  evaluate

Pocket handles this naturally. The threshold node is computed once and feeds both clipping steps. When you change the training data, the threshold recomputes, and both branches update.

Pipeline functions

These are plain functions. Each does one thing: fit a threshold, clip outliers, or evaluate. Pocket will wire them together.

(defn fit-outlier-threshold
  "Compute IQR-based clipping bounds for :x from training data.
  Returns {:lower <bound> :upper <bound>}."
  [train-ds]
  (println "  Fitting outlier threshold from training data...")
  (let [xs (sort (vec (:x train-ds)))
        n (count xs)
        q1 (nth xs (int (* 0.25 n)))
        q3 (nth xs (int (* 0.75 n)))
        iqr (- q3 q1)]
    {:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))
(defn clip-outliers
  "Clip :x values using pre-computed threshold bounds."
  [ds threshold]
  (println "  Clipping outliers with bounds:" (select-keys threshold [:lower :upper]))
  (let [{:keys [lower upper]} threshold]
    (tc/add-column ds :x (-> (:x ds) (tcc/max lower) (tcc/min upper)))))
(defn evaluate-model
  "Evaluate a model on test data."
  [test-ds model]
  (println "  Evaluating model...")
  (let [pred (ml/predict test-ds model)]
    {:rmse (loss/rmse (:y test-ds) (:y pred))}))
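To see the clipping logic on its own, here is the same IQR rule applied to a plain Clojure vector — no tablecloth, hypothetical helper names, but the same quantile convention as fit-outlier-threshold above:

```clojure
;; IQR-based bounds, then elementwise clipping — the same rule as above,
;; applied to a plain vector.
(defn iqr-bounds [xs]
  (let [s (vec (sort xs))
        n (count s)
        q1 (nth s (int (* 0.25 n)))
        q3 (nth s (int (* 0.75 n)))
        iqr (- q3 q1)]
    {:lower (- q1 (* 1.5 iqr))
     :upper (+ q3 (* 1.5 iqr))}))

(let [xs [1 2 3 4 5 50]                 ;; 50 is a "sensor glitch"
      {:keys [lower upper]} (iqr-bounds xs)]
  (mapv #(-> % (max lower) (min upper)) xs))
;; => [1 2 3 4 5 9.5]
```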

Build the DAG with mixed storage policies

Not every step needs disk persistence. We use caching-fn with per-function storage policies:

  • :mem for cheap shared computations (threshold, clipping, feature engineering) — no disk I/O; in-memory dedup ensures each runs only once per JVM session, but nothing persists across sessions

  • :none for trivial steps (evaluation) — just tracks identity in the DAG without any shared caching

(def c-fit-threshold
  (pocket/caching-fn #'fit-outlier-threshold {:storage :mem}))
(def c-clip
  (pocket/caching-fn #'clip-outliers {:storage :mem}))
(def c-prepare
  (pocket/caching-fn #'prepare-features {:storage :mem}))
(def c-train
  (pocket/caching-fn #'train-model))
(def c-evaluate
  (pocket/caching-fn #'evaluate-model {:storage :none}))

Generate data with outliers for this demo — 10% of the x values are corrupted by large random spikes, simulating sensor glitches. The y values (the physics) are computed from the clean x, with noise added as before — so only the recorded input is corrupted, not the target.

(def dag-data-c
  (pocket/cached #'make-regression-data
                 {:f #'nonlinear-fn :n 200 :noise-sd 0.3 :seed 99
                  :outlier-fraction 0.1 :outlier-scale 15}))
(def dag-split-c
  (pocket/cached #'split-dataset dag-data-c {:seed 99}))
(def dag-train-c (pocket/cached :train dag-split-c))
(def dag-test-c (pocket/cached :test dag-split-c))

Now wire the pipeline. The threshold is fitted once from training data (in memory) and feeds both clipping steps — a diamond dependency handled naturally.

(def threshold-c
  (c-fit-threshold dag-train-c))
(def train-clipped-c
  (c-clip dag-train-c threshold-c))
(def test-clipped-c
  (c-clip dag-test-c threshold-c))
(def train-prepped-c
  (c-prepare train-clipped-c :poly+trig))
(def test-prepped-c
  (c-prepare test-clipped-c :poly+trig))
(def model-c
  (c-train train-prepped-c cart-spec))
(def metrics-c
  (c-evaluate test-prepped-c model-c))

Visualize the DAG

Pocket provides three functions for DAG introspection, each suited to different use cases.

origin-story returns a nested tree structure. Each cached node has :fn, :args, and :id. The :id is unique; when the same Cached instance appears multiple times (diamond pattern), subsequent occurrences become {:ref <id>} pointers. This avoids infinite recursion and makes the diamond explicit:

(pocket/origin-story metrics-c)
{:fn #'pocket-book.ml-workflows/evaluate-model,
 :args
 [{:fn #'pocket-book.ml-workflows/prepare-features,
   :args
   [{:fn #'pocket-book.ml-workflows/clip-outliers,
     :args
     [{:fn :test,
       :args
       [{:fn #'pocket-book.ml-workflows/split-dataset,
         :args
         [{:fn #'pocket-book.ml-workflows/make-regression-data,
           :args
           [{:value
             {:f #'pocket-book.ml-workflows/nonlinear-fn,
              :n 200,
              :noise-sd 0.3,
              :seed 99,
              :outlier-fraction 0.1,
              :outlier-scale 15}}],
           :id "c6"}
          {:value {:seed 99}}],
         :id "c5"}],
       :id "c4"}
      {:fn #'pocket-book.ml-workflows/fit-outlier-threshold,
       :args [{:fn :train, :args [{:ref "c5"}], :id "c8"}],
       :id "c7"}],
     :id "c3"}
    {:value :poly+trig}],
   :id "c2"}
  {:fn #'pocket-book.ml-workflows/train-model,
   :args
   [{:fn #'pocket-book.ml-workflows/prepare-features,
     :args
     [{:fn #'pocket-book.ml-workflows/clip-outliers,
       :args [{:ref "c8"} {:ref "c7"}],
       :id "c11"}
      {:value :poly+trig}],
     :id "c10"}
    {:value
     {:model-type :scicloj.ml.tribuo/regression,
      :tribuo-components
      [{:name "cart",
        :type "org.tribuo.regression.rtree.CARTRegressionTrainer",
        :properties {:maxDepth "8"}}],
      :tribuo-trainer-name "cart"}}],
   :id "c9"}],
 :id "c1"}

Notice how the threshold node appears as a :ref in one branch — it’s the same computation feeding both train and test clipping.
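Because every node carries an :id, those :ref pointers can be resolved mechanically. A small sketch, assuming only the tree shape shown above (:fn/:args/:id nodes, :value leaves, :ref pointers):

```clojure
;; Index every :fn node by its :id; a {:ref id} occurrence can then be
;; looked up in the index instead of being expanded a second time.
(defn index-by-id [node]
  (if (or (:ref node) (contains? node :value))
    {}
    (apply merge {(:id node) node}
           (map index-by-id (:args node)))))

(let [tree {:fn 'evaluate
            :args [{:fn 'split :args [{:value 42}] :id "c2"}
                   {:ref "c2"}]
            :id "c1"}]
  (:fn (get (index-by-id tree) "c2")))
;; => split
```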

origin-story-graph normalizes the tree into a flat {:nodes ... :edges ...} structure suitable for graph algorithms:

(pocket/origin-story-graph metrics-c)
{:nodes
 {"c9" {:fn #'pocket-book.ml-workflows/fit-outlier-threshold},
  "c10" {:fn :train},
  "c13" {:fn #'pocket-book.ml-workflows/prepare-features},
  "c14" {:fn #'pocket-book.ml-workflows/clip-outliers},
  "v15" {:value :poly+trig},
  "v7"
  {:value
   {:f #'pocket-book.ml-workflows/nonlinear-fn,
    :n 200,
    :noise-sd 0.3,
    :seed 99,
    :outlier-fraction 0.1,
    :outlier-scale 15}},
  "v8" {:value {:seed 99}},
  "c2" {:fn #'pocket-book.ml-workflows/prepare-features},
  "v11" {:value :poly+trig},
  "c12" {:fn #'pocket-book.ml-workflows/train-model},
  "v16"
  {:value
   {:model-type :scicloj.ml.tribuo/regression,
    :tribuo-components
    [{:name "cart",
      :type "org.tribuo.regression.rtree.CARTRegressionTrainer",
      :properties {:maxDepth "8"}}],
    :tribuo-trainer-name "cart"}},
  "c3" {:fn #'pocket-book.ml-workflows/clip-outliers},
  "c4" {:fn :test},
  "c5" {:fn #'pocket-book.ml-workflows/split-dataset},
  "c6" {:fn #'pocket-book.ml-workflows/make-regression-data},
  "c1" {:fn #'pocket-book.ml-workflows/evaluate-model}},
 :edges
 [["c1" "c2"]
  ["c2" "c3"]
  ["c3" "c4"]
  ["c4" "c5"]
  ["c5" "c6"]
  ["c6" "v7"]
  ["c5" "v8"]
  ["c3" "c9"]
  ["c9" "c10"]
  ["c10" "c5"]
  ["c2" "v11"]
  ["c1" "c12"]
  ["c12" "c13"]
  ["c13" "c14"]
  ["c14" "c10"]
  ["c14" "c9"]
  ["c13" "v15"]
  ["c12" "v16"]]}
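Since origin-story-graph returns plain data, ordinary graph code applies directly. For example, here is a sketch that counts consumers per node — in the edge shape shown above, each edge is a [consumer input] pair — so any node counted more than once is shared, like the threshold in the diamond:

```clojure
;; Each edge is [consumer input]; counting the second elements reveals
;; shared nodes — those consumed by more than one downstream step.
(defn consumer-counts [{:keys [edges]}]
  (frequencies (map second edges)))

(consumer-counts {:edges [["c1" "c2"] ["c3" "c9"] ["c14" "c9"]]})
;; => {"c2" 1, "c9" 2}
```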

origin-story-mermaid renders the DAG as a Mermaid flowchart, with arrows showing data flow direction (from inputs toward the final result). The diamond dependency is clearly visible — the threshold feeds both clipping steps:

(pocket/origin-story-mermaid metrics-c)
flowchart TD
n0["evaluate-model"]
n1["prepare-features"]
n2["clip-outliers"]
n3[":test"]
n4["split-dataset"]
n5["make-regression-data"]
n6[/"{:f #'pocket-book.ml-workflows/nonlinear-fn,
:n 200,
:noise-sd 0.3,
:seed 99,
:outlier-fraction 0.1,
:outlier-scale 15}"/]
n6 --> n5
n5 --> n4
n7[/"{:seed 99}"/]
n7 --> n4
n4 --> n3
n3 --> n2
n8["fit-outlier-threshold"]
n9[":train"]
n4 --> n9
n9 --> n8
n8 --> n2
n2 --> n1
n10[/":poly+trig"/]
n10 --> n1
n1 --> n0
n11["train-model"]
n12["prepare-features"]
n13["clip-outliers"]
n9 --> n13
n8 --> n13
n13 --> n12
n14[/":poly+trig"/]
n14 --> n12
n12 --> n11
n15[/"{:model-type :scicloj.ml.tribuo/regression,
:tribuo-components [{:name 'cart',
:type 'org.tribuo.regression.rtree.CARTRegressionTrainer',
:properties {:maxDepth '8'}}],
:tribuo-trainer-name 'cart'}"/]
n15 --> n11
n11 --> n0

Execute the pipeline

(deref metrics-c)
10:06:44.922 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.922 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/clip-outliers
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.925 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/19/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99})
10:06:44.928 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/53/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99})
10:06:44.929 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/fb/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99}))
10:06:44.929 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/fit-outlier-threshold
10:06:44.929 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.930 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/09/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :outlier-fraction 0.1, :outlier-scale 15, :seed 99}) {:seed 99}))
  Fitting outlier threshold from training data...
  Clipping outliers with bounds: {:lower -5.499694170624462, :upper 15.994051959902624}
10:06:44.931 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.931 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.932 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/clip-outliers
  Clipping outliers with bounds: {:lower -5.499694170624462, :upper 15.994051959902624}
10:06:44.940 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/29/298735d4cc3dbde1964a5f86130dba60f3a9db43
  Evaluating model...
{:rmse 1.601302555211606}

How much did clipping help? Let’s compare three scenarios using the same cached building blocks.

The no-clip and clean-baseline pipelines are local — they exist only for this comparison. Each still builds a cached DAG that shares steps with the clipped pipeline above.

(let [;; No-clip: skip clipping, go straight from raw splits to features
      noclip-train-c  (c-prepare dag-train-c :poly+trig)
      noclip-test-c   (c-prepare dag-test-c :poly+trig)
      noclip-model-c  (c-train noclip-train-c cart-spec)
      noclip-metrics  @(c-evaluate noclip-test-c noclip-model-c)
      ;; Clean baseline: same structure, data without outliers
      clean-data-c    (pocket/cached #'make-regression-data
                                     {:f #'nonlinear-fn :n 200 :noise-sd 0.3 :seed 99})
      clean-split-c   (pocket/cached #'split-dataset clean-data-c {:seed 99})
      clean-train-c   (c-prepare (pocket/cached :train clean-split-c) :poly+trig)
      clean-test-c    (c-prepare (pocket/cached :test clean-split-c) :poly+trig)
      clean-model-c   (c-train clean-train-c cart-spec)
      clean-metrics   @(c-evaluate clean-test-c clean-model-c)]
  {:clean            clean-metrics
   :outliers-no-clip noclip-metrics
   :outliers-clipped @metrics-c})
10:06:44.946 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.947 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.947 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.955 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/81/81bd64358cce9b4ef112b6e4b16b04cb1cdfb14e
  Evaluating model...
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: :test
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/split-dataset
10:06:44.959 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/make-regression-data
10:06:44.961 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/55/(pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99})
10:06:44.963 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/87/(pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99})
10:06:44.964 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/80/(:test (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99}))
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/train-model
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss (mem), computing: pocket-book.ml-workflows/prepare-features
10:06:44.965 INFO scicloj.pocket.impl.cache - Cache miss, computing: :train
10:06:44.966 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/54/(:train (pocket-book.ml-workflows_split-dataset (pocket-book.ml-workflows_make-regression-data {:f #'pocket-book.ml-workflows_nonlinear-fn, :n 200, :noise-sd 0.3, :seed 99}) {:seed 99}))
10:06:44.973 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/a0/a076e329e7ab037718d99d2a664dfc9114879d46
  Evaluating model...
{:clean {:rmse 0.4263098047865239},
 :outliers-no-clip {:rmse 2.552495499444297},
 :outliers-clipped {:rmse 1.601302555211606}}

Clipping x before building polynomial features makes a visible difference — the amplification through x² is tamed.


Part 5 — Comparing many experiments at once

Hyperparameters are settings you choose before training: tree depth, learning rate, which features to use. Finding good values usually means trying many combinations — a hyperparameter sweep.

Pocket’s compare-experiments helps here. You pass a collection of cached experiments, and it extracts the parameters that vary across them (ignoring ones that are constant).

(defn run-pipeline
  "Run a complete pipeline with given hyperparameters."
  [{:keys [noise-sd feature-set max-depth]}]
  (let [ds (make-regression-data {:f nonlinear-fn :n 200 :noise-sd noise-sd :seed 42})
        sp (split-dataset ds {:seed 42})
        train-prep (prepare-features (:train sp) feature-set)
        test-prep (prepare-features (:test sp) feature-set)
        spec {:model-type :scicloj.ml.tribuo/regression
              :tribuo-components [{:name "cart"
                                   :type "org.tribuo.regression.rtree.CARTRegressionTrainer"
                                   :properties {:maxDepth (str max-depth)}}]
              :tribuo-trainer-name "cart"}
        model (ml/train train-prep spec)
        pred (ml/predict test-prep model)]
    {:rmse (loss/rmse (:y test-prep) (:y pred))}))

Run experiments across a grid of hyperparameters:

(def experiments
  (for [noise-sd [0.3 0.5]
        feature-set [:raw :poly+trig]
        max-depth [4 8]]
    (pocket/cached #'run-pipeline
                   {:noise-sd noise-sd
                    :feature-set feature-set
                    :max-depth max-depth})))

Compare all experiments — only varying parameters are shown:

(def comparison
  (pocket/compare-experiments experiments))
10:06:44.985 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:44.993 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/0f/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 4, :noise-sd 0.3})
10:06:44.993 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.001 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/23/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 8, :noise-sd 0.3})
10:06:45.001 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.011 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/3f/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 4, :noise-sd 0.3})
10:06:45.011 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.024 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/40/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 8, :noise-sd 0.3})
10:06:45.024 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.034 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/c8/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 4, :noise-sd 0.5})
10:06:45.035 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.045 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/01/(pocket-book.ml-workflows_run-pipeline {:feature-set :raw, :max-depth 8, :noise-sd 0.5})
10:06:45.046 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.058 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/b3/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 4, :noise-sd 0.5})
10:06:45.059 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.ml-workflows/run-pipeline
10:06:45.072 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-regression/00/(pocket-book.ml-workflows_run-pipeline {:feature-set :poly+trig, :max-depth 8, :noise-sd 0.5})
(tc/dataset comparison)

_unnamed [8 4]:

:noise-sd :feature-set :max-depth :result
0.3 :raw 4 {:rmse 0.7189521159338053}
0.3 :raw 8 {:rmse 0.41024324994778005}
0.3 :poly+trig 4 {:rmse 0.5297031491020386}
0.3 :poly+trig 8 {:rmse 0.4530388384300822}
0.5 :raw 4 {:rmse 0.8815492449083825}
0.5 :raw 8 {:rmse 0.6467985374993637}
0.5 :poly+trig 4 {:rmse 0.7728233864192875}
0.5 :poly+trig 8 {:rmse 0.6785270538736407}

Each row shows the varying parameters plus the result. Parameters that were constant (like seed=42) are excluded automatically — you see only what differs.
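A sketch of how that filtering might work (plain Clojure, not Pocket's actual implementation): keep only the keys whose values differ across the experiments' parameter maps:

```clojure
;; Keep only parameter keys that take more than one distinct value
;; across the experiment grid; constant keys (like :seed) drop out.
(defn varying-keys [param-maps]
  (->> (mapcat keys param-maps)
       distinct
       (filter #(> (count (distinct (map % param-maps))) 1))))

(varying-keys [{:seed 42 :noise-sd 0.3 :max-depth 4}
               {:seed 42 :noise-sd 0.3 :max-depth 8}
               {:seed 42 :noise-sd 0.5 :max-depth 4}])
;; => (:noise-sd :max-depth)
```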

Results visualization

(let [rows (map (fn [exp]
                  (merge (select-keys exp [:noise-sd :feature-set :max-depth])
                         (:result exp)))
                comparison)
      ;; Group by both feature-set and noise-sd for legend entries
      grouped (group-by (juxt :feature-set :noise-sd) rows)
      feature-colors {:raw "steelblue" :poly "tomato" :poly+trig "green"}]
  (kind/plotly
   {:data (vec (for [[[feature-set noise-sd] pts] (sort-by first grouped)
                     :let [max-depths (mapv :max-depth pts)
                           rmses (mapv :rmse pts)]]
                 {:x max-depths
                  :y rmses
                  :mode "markers"
                  :name (str (name feature-set) " (noise=" noise-sd ")")
                  :legendgroup (name feature-set)
                  :marker {:size (+ 8 (* 15 noise-sd))
                           :color (feature-colors feature-set)}}))

    :layout {:xaxis {:title "max-depth"} :yaxis {:title "rmse"}}}))

What we learned

This experiment revealed a clear story about the interplay between models, features, and noise:

  • Feature engineering is decisive for linear models. With raw features, the linear model couldn’t capture the nonlinear target at all. Adding trigonometric features (sin, cos) — which match the structure of the true function — dramatically improved it. The model didn’t get smarter; we gave it the right vocabulary.

  • Decision trees are self-sufficient but fragile. The CART model achieved low error regardless of feature set, because it can learn nonlinear splits on its own. But as noise increased, it began fitting the noise rather than the signal — a classic overfitting pattern.

  • The crossover point matters. At low noise, the tree wins. At high noise, the well-featured linear model degrades more gracefully. Knowing where this crossover happens is exactly the kind of insight you get from systematic experimentation.

  • Caching structures the workflow. In this small example, each step runs in milliseconds — caching isn’t needed for speed. But the pattern scales: with real datasets and expensive training, the same pipeline structure ensures that only changed steps recompute. Meanwhile, compare-experiments extracted the varying parameters automatically, turning cached results into a comparison table — useful at any scale.

  • Preprocessing order matters. Outlier x values get amplified by polynomial features (x²), so clipping must come before feature engineering. The diamond dependency — one threshold feeding both train and test clipping — is handled naturally by Pocket’s DAG.

Cleanup

(pocket/cleanup!)
10:06:45.084 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-regression
{:dir "/tmp/pocket-regression", :existed true}
source: notebooks/pocket_book/ml_workflows.clj