5  Recursive Caching in Pipelines

Last modified: 2026-02-09

When you pass a Cached value as an argument to another cached function, Pocket handles this recursively. The cache key for the outer computation is derived from the identity of the inner computation (its function name and arguments), not from its result. This means the entire pipeline’s cache key captures the full computation graph.

Pocket automatically derefs any Cached arguments before calling the function, so pipeline functions receive plain values and don’t need any special handling.

Setup

(ns pocket-book.recursive-caching-in-pipelines
  (:require
   ;; Logging setup for this chapter (see Logging chapter):
   [pocket-book.logging]
   ;; Pocket API:
   [scicloj.pocket :as pocket]
   ;; Annotating kinds of visualizations:
   [scicloj.kindly.v4.kind :as kind]
   ;; String utilities:
   [clojure.string :as str]))
(def cache-dir "/tmp/pocket-demo-pipelines")
(pocket/set-base-cache-dir! cache-dir)
10:06:31.991 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-demo-pipelines
"/tmp/pocket-demo-pipelines"
(pocket/cleanup!)
10:06:31.992 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-demo-pipelines
{:dir "/tmp/pocket-demo-pipelines", :existed false}

A three-step pipeline

We’ll build a simple data science pipeline with three stages: load data, preprocess it, and train a model. Each stage is wrapped with caching-fn so every call returns a Cached object. Passing one Cached result into the next stage is what makes the caching recursive.

flowchart LR
  LD[load-dataset] --> PP[preprocess]
  PP --> TM[train-model]

Pipeline functions

(defn load-dataset [path]
  (println "Loading dataset from" path "...")
  (Thread/sleep 300)
  {:data [1 2 3 4 5] :source path})
(defn preprocess [data opts]
  (println "Preprocessing with options:" opts)
  (Thread/sleep 300)
  (update data :data #(map (fn [x] (* x (:scale opts))) %)))
(defn train-model [data params]
  (println "Training model with params:" params)
  (Thread/sleep 300)
  {:model :trained :accuracy 0.95 :data data})

Wrap each function with caching-fn so every call returns a Cached object:

(def load-dataset* (pocket/caching-fn #'load-dataset))
(def preprocess* (pocket/caching-fn #'preprocess))
(def train-model* (pocket/caching-fn #'train-model))

Running the pipeline

Chain cached computations in a pipeline:

First pipeline run:

(time
 (-> "data/raw.csv"
     (load-dataset*)
     (preprocess* {:scale 2})
     (train-model* {:epochs 100})
     deref
     (select-keys [:model :accuracy])))
10:06:31.998 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/train-model
10:06:31.998 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/preprocess
10:06:31.998 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/load-dataset
Loading dataset from data/raw.csv ...
10:06:32.311 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/b8/(pocket-book.recursive-caching-in-pipelines_load-dataset "data_raw.csv")
Preprocessing with options: {:scale 2}
10:06:32.614 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/85/(pocket-book.recursive-caching-in-pipelines_preprocess (pocket-book.recursive-caching-in-pipelines_load-dataset "data_raw.csv") {:scale 2})
Training model with params: {:epochs 100}
10:06:32.915 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/c9/(pocket-book.recursive-caching-in-pipelines_train-model (pocket-book.recursive-caching-in-pipelines_preprocess (pocket-book.recursive-caching-in-pipelines_load-dataset "data_raw.csv") {:scale 2}) {:epochs 100})
"Elapsed time: 918.094275 msecs"
{:model :trained, :accuracy 0.95}

Run the same pipeline again — everything loads from cache:

Second pipeline run (all cached):

(time
 (-> "data/raw.csv"
     (load-dataset*)
     (preprocess* {:scale 2})
     (train-model* {:epochs 100})
     deref
     (select-keys [:model :accuracy])))
"Elapsed time: 0.525695 msecs"
{:model :trained, :accuracy 0.95}

No log output above — the result was served entirely from the in-memory cache, so no disk I/O or computation occurred. Each step caches independently. If you change only the last step (e.g., different training params), the upstream steps load from cache while only the final step recomputes.
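For example, rerunning the pipeline with different training parameters (a sketch; 200 epochs is an arbitrary new value) recomputes only train-model, while load-dataset and preprocess are served from cache:

(-> "data/raw.csv"
    (load-dataset*)
    (preprocess* {:scale 2})
    (train-model* {:epochs 200}) ;; new params => new cache key for this step only
    deref
    (select-keys [:model :accuracy]))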

Provenance in cache entries

The cache entries reveal the pipeline structure. Each entry’s identity encodes its full computation history — not just the function name, but the nested identities of all its cached inputs.

(->> (pocket/cache-entries)
     (mapv :id))
["(pocket-book.recursive-caching-in-pipelines/load-dataset \"data/raw.csv\")"
 "(pocket-book.recursive-caching-in-pipelines/preprocess (pocket-book.recursive-caching-in-pipelines/load-dataset \"data/raw.csv\") {:scale 2})"
 "(pocket-book.recursive-caching-in-pipelines/train-model (pocket-book.recursive-caching-in-pipelines/preprocess (pocket-book.recursive-caching-in-pipelines/load-dataset \"data/raw.csv\") {:scale 2}) {:epochs 100})"]
(->> (pocket/cache-entries)
     (mapv :id)
     (str/join "\n")
     kind/code)
(pocket-book.recursive-caching-in-pipelines/load-dataset "data/raw.csv")
(pocket-book.recursive-caching-in-pipelines/preprocess (pocket-book.recursive-caching-in-pipelines/load-dataset "data/raw.csv") {:scale 2})
(pocket-book.recursive-caching-in-pipelines/train-model (pocket-book.recursive-caching-in-pipelines/preprocess (pocket-book.recursive-caching-in-pipelines/load-dataset "data/raw.csv") {:scale 2}) {:epochs 100})

The inner step appears as a literal sub-expression in the outer step’s identity. This is how Pocket tracks provenance: the cache key for train-model records that its input came from preprocess, which in turn came from load-dataset.

This happens automatically when you pass Cached objects (without derefing) from one cached step to the next.

If you deref early with @ (or deref), the derefed value still carries its origin identity — (pocket/->id @cached-ref) returns the same lightweight identity as the Cached reference itself. This means downstream cached steps get efficient cache keys even when working with real (derefed) values. The link breaks only when the value is transformed (e.g., adding a column), creating a new object that falls back to content-based identity. See Under the hood: cache keys for details on the origin registry.
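As a sketch of this behavior (cached-ref is a hypothetical binding to any Cached result):

(def cached-ref (load-dataset* "data/raw.csv"))

;; The derefed value carries the same lightweight origin identity:
(= (pocket/->id cached-ref) (pocket/->id @cached-ref))

;; A transformed value is a new object, so it falls back to
;; content-based identity and the origin link is broken:
(pocket/->id (assoc @cached-ref :note :transformed))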

Similarly, origin-story follows derefed values back through the registry to their Cached origin, producing the full provenance DAG. We demonstrate this below.

For a fuller example with branching dependencies, see the Real-World Walkthrough.

Inspecting the DAG

Pocket provides three functions for DAG introspection:

  • origin-story — nested tree with :ref pointers for shared nodes
  • origin-story-graph — flat {:nodes ... :edges ...} for graph algorithms
  • origin-story-mermaid — Mermaid flowchart string for visualization

Build the pipeline keeping the intermediate Cached objects:

(def data-c (load-dataset* "data/experiment.csv"))
(def preprocessed-c (preprocess* data-c {:scale 2}))
(def model-c (train-model* preprocessed-c {:epochs 100}))

origin-story — tree structure

Returns a nested map where each cached step is {:fn <var> :args [...] :id <string>}. Plain arguments become {:value ...} leaves. If a step has been computed, :value is included.

Before any computation:

(pocket/origin-story model-c)
{:fn #'pocket-book.recursive-caching-in-pipelines/train-model,
 :args
 [{:fn #'pocket-book.recursive-caching-in-pipelines/preprocess,
   :args
   [{:fn #'pocket-book.recursive-caching-in-pipelines/load-dataset,
     :args [{:value "data/experiment.csv"}],
     :id "c3"}
    {:value {:scale 2}}],
   :id "c2"}
  {:value {:epochs 100}}],
 :id "c1"}

No :value keys yet. Now trigger computation:

(deref model-c)
10:06:32.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/train-model
10:06:32.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/preprocess
10:06:32.923 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/load-dataset
Loading dataset from data/experiment.csv ...
10:06:33.224 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/cd/(pocket-book.recursive-caching-in-pipelines_load-dataset "data_experiment.csv")
Preprocessing with options: {:scale 2}
10:06:33.526 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/f8/(pocket-book.recursive-caching-in-pipelines_preprocess (pocket-book.recursive-caching-in-pipelines_load-dataset "data_experiment.csv") {:scale 2})
Training model with params: {:epochs 100}
10:06:33.828 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/9e/(pocket-book.recursive-caching-in-pipelines_train-model (pocket-book.recursive-caching-in-pipelines_preprocess (pocket-book.recursive-caching-in-pipelines_load-dataset "data_experiment.csv") {:scale 2}) {:epochs 100})
{:model :trained,
 :accuracy 0.95,
 :data {:data (2 4 6 8 10), :source "data/experiment.csv"}}

After deref, every node includes its :value:

(pocket/origin-story model-c)
{:fn #'pocket-book.recursive-caching-in-pipelines/train-model,
 :args
 [{:fn #'pocket-book.recursive-caching-in-pipelines/preprocess,
   :args
   [{:fn #'pocket-book.recursive-caching-in-pipelines/load-dataset,
     :args [{:value "data/experiment.csv"}],
     :id "c3",
     :value {:data [1 2 3 4 5], :source "data/experiment.csv"}}
    {:value {:scale 2}}],
   :id "c2",
   :value {:data (2 4 6 8 10), :source "data/experiment.csv"}}
  {:value {:epochs 100}}],
 :id "c1",
 :value
 {:model :trained,
  :accuracy 0.95,
  :data {:data (2 4 6 8 10), :source "data/experiment.csv"}}}

When the same Cached instance appears multiple times (diamond pattern), subsequent occurrences are {:ref <id>} pointing to the first.
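A diamond can be created with a hypothetical two-input step (a sketch; combine is not part of the pipeline above):

(defn combine [a b]
  {:left a :right b})
(def combine* (pocket/caching-fn #'combine))

(def shared-c (load-dataset* "data/shared.csv"))
(def diamond-c (combine* (preprocess* shared-c {:scale 2})
                         (preprocess* shared-c {:scale 3})))

;; In (pocket/origin-story diamond-c), the first occurrence of shared-c
;; is a full node; the second occurrence is {:ref <its id>}.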

origin-story-graph — flat graph

Returns {:nodes {<id> <node-map>} :edges [[<from> <to>] ...]}. Useful for graph algorithms or custom rendering.

(pocket/origin-story-graph model-c)
{:nodes
 {"c1"
  {:fn #'pocket-book.recursive-caching-in-pipelines/train-model,
   :value
   {:model :trained,
    :accuracy 0.95,
    :data {:data (2 4 6 8 10), :source "data/experiment.csv"}}},
  "c2"
  {:fn #'pocket-book.recursive-caching-in-pipelines/preprocess,
   :value {:data (2 4 6 8 10), :source "data/experiment.csv"}},
  "c3"
  {:fn #'pocket-book.recursive-caching-in-pipelines/load-dataset,
   :value {:data [1 2 3 4 5], :source "data/experiment.csv"}},
  "v4" {:value "data/experiment.csv"},
  "v5" {:value {:scale 2}},
  "v6" {:value {:epochs 100}}},
 :edges [["c1" "c2"] ["c2" "c3"] ["c3" "v4"] ["c2" "v5"] ["c1" "v6"]]}
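Since the edges form a plain DAG, ordinary Clojure collection functions apply directly. For example, a sketch listing each node's direct inputs by grouping the edges by their source:

(let [{:keys [edges]} (pocket/origin-story-graph model-c)]
  (update-vals (group-by first edges)
               #(mapv second %)))
;; e.g. {"c1" ["c2" "v6"], "c2" ["c3" "v5"], "c3" ["v4"]}
;; (map key order may vary)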

origin-story-mermaid — visualization

Returns a Mermaid flowchart as a kindly value, so it renders directly in the notebook. Arrows show data flow direction, from inputs toward the final result.

(pocket/origin-story-mermaid model-c)
flowchart TD
  n0["train-model"]
  n1["preprocess"]
  n2["load-dataset"]
  n3[/"'data/experiment.csv'"/]
  n3 --> n2
  n2 --> n1
  n4[/"{:scale 2}"/]
  n4 --> n1
  n1 --> n0
  n5[/"{:epochs 100}"/]
  n5 --> n0

Provenance through derefed values

In the pipeline above, we passed Cached references directly from one step to the next. But sometimes we need to deref between steps — for example, when passing data to a library that expects concrete values. The origin registry ensures provenance still works in this case.

Here we build the same three-step pipeline, but deref between each step:

(def data-c2 (load-dataset* "data/deref-demo.csv"))
(def data-val (deref data-c2))
10:06:33.833 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/load-dataset
Loading dataset from data/deref-demo.csv ...
10:06:34.134 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/94/(pocket-book.recursive-caching-in-pipelines_load-dataset "data_deref-demo.csv")
(def processed-c2 (preprocess* data-val {:scale 3}))
(def processed-val (deref processed-c2))
10:06:34.136 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.recursive-caching-in-pipelines/preprocess
Preprocessing with options: {:scale 3}
10:06:34.437 DEBUG scicloj.pocket.impl.cache - Cache write: /tmp/pocket-demo-pipelines/0c/(pocket-book.recursive-caching-in-pipelines_preprocess (pocket-book.recursive-caching-in-pipelines_load-dataset "data_deref-demo.csv") {:scale 3})
(def model-c2 (train-model* processed-val {:epochs 50}))

Even though each cached function received a derefed (real) value, origin-story traces the full chain — from train-model back through preprocess to load-dataset:

(pocket/origin-story model-c2)
{:fn #'pocket-book.recursive-caching-in-pipelines/train-model,
 :args
 [{:fn #'pocket-book.recursive-caching-in-pipelines/preprocess,
   :args
   [{:fn #'pocket-book.recursive-caching-in-pipelines/load-dataset,
     :args [{:value "data/deref-demo.csv"}],
     :id "c3",
     :value {:data [1 2 3 4 5], :source "data/deref-demo.csv"}}
    {:value {:scale 3}}],
   :id "c2",
   :value {:data (3 6 9 12 15), :source "data/deref-demo.csv"}}
  {:value {:epochs 50}}],
 :id "c1"}

origin-story-graph shows three cached nodes (one per pipeline step). Without the origin registry, the derefed values would appear as opaque leaves and we would see only one cached node:

(let [g (pocket/origin-story-graph model-c2)]
  (count (filter (fn [[_ v]] (:fn v)) (:nodes g))))
3

The full chain as a Mermaid flowchart:

(pocket/origin-story-mermaid model-c2)
flowchart TD
  n0["train-model"]
  n1["preprocess"]
  n2["load-dataset"]
  n3[/"'data/deref-demo.csv'"/]
  n3 --> n2
  n2 --> n1
  n4[/"{:scale 3}"/]
  n4 --> n1
  n1 --> n0
  n5[/"{:epochs 50}"/]
  n5 --> n0

Cleanup

(pocket/cleanup!)
10:06:34.448 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-demo-pipelines
{:dir "/tmp/pocket-demo-pipelines", :existed true}
source: notebooks/pocket_book/recursive_caching_in_pipelines.clj