13 Under the hood: cache keys
When Pocket caches a function call, it builds a cache key from the function identity and all arguments. This chapter looks at how that works internally and what it costs β especially for large arguments like datasets.
Setup
(ns pocket-book.cache-keys
(:require
;; Pocket API and internals:
[scicloj.pocket :as pocket]
[scicloj.pocket.impl.cache :as cache]
[scicloj.pocket.protocols :as proto]
;; Annotating kinds of visualizations:
[scicloj.kindly.v4.kind :as kind]
;; Data processing:
[tablecloth.api :as tc]
[tech.v3.dataset.modelling :as ds-mod]))The four steps
Every call to pocket/cached goes through four steps to produce a filesystem path for the cache entry:
->idβ convert each argument to its identity representation via thePIdentifiableprotocol. Vars become fully-qualified symbols;Cachedreferences become lightweight references; datasets become their full column data + metadata.canonical-idβ deep-sort maps and normalize the structure so that{:a 1 :b 2}and{:b 2 :a 1}produce the same key.strβ serialize the canonical form to a string.shaβ SHA-1 hash for a fixed-length, filesystem-safe path.
When a functionβs arguments are small (scalars, keywords, or Cached references), all four steps are sub-millisecond. But when a raw dataset is passed directly, the full dataset content becomes part of the cache key.
Measuring the cost
Letβs pass a 50,000-row dataset as a direct argument and time each step:
(let [ds (-> (tc/dataset {:x (vec (range 50000))
:y (vec (range 50000))
:z (repeatedly 50000 rand)})
(ds-mod/set-inference-target :y))
;; Step 1: ->id
t0 (System/nanoTime)
id (proto/->id ds)
t1 (System/nanoTime)
;; Step 2: canonical-id
cid (cache/canonical-id id)
t2 (System/nanoTime)
;; Step 3: str
s (str cid)
t3 (System/nanoTime)
;; Step 4: sha
_ (cache/sha s)
t4 (System/nanoTime)]
{:rows 50000
:string-length (count s)
:->id-ms (/ (- t1 t0) 1e6)
:canonical-id-ms (/ (- t2 t1) 1e6)
:str-ms (/ (- t3 t2) 1e6)
:sha-ms (/ (- t4 t3) 1e6)}){:rows 50000,
:string-length 1541262,
:->id-ms 3.355993,
:canonical-id-ms 5.730036,
:str-ms 25.115645,
:sha-ms 1.579439}The str serialization step dominates β it must walk the entire nested structure and produce a ~1.5 MB string. SHA-1 hashing that string is fast by comparison. Switching to a faster hash algorithm would not meaningfully help.
Why Cached references matter
When an argument is a Cached reference rather than a raw dataset, its identity is a lightweight reference to the computation that produced it β not the data itself. Compare:
(pocket/set-base-cache-dir! "/tmp/pocket-cache-keys")10:06:46.813 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-cache-keys
"/tmp/pocket-cache-keys"(pocket/cleanup!)10:06:46.814 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-cache-keys
{:dir "/tmp/pocket-cache-keys", :existed false}(defn make-data [n]
(tc/dataset {:x (vec (range n))
:y (vec (range n))}))Direct dataset β identity includes all 50,000 rows:
(let [ds (make-data 50000)
t0 (System/nanoTime)
_ (str (cache/canonical-id (proto/->id ds)))
t1 (System/nanoTime)]
{:direct-ms (/ (- t1 t0) 1e6)}){:direct-ms 20.211526}Cached reference β identity is just (make-data 50000):
(let [data-c (pocket/cached #'make-data 50000)
t0 (System/nanoTime)
_ (str (cache/canonical-id (proto/->id data-c)))
t1 (System/nanoTime)]
{:cached-reference-ms (/ (- t1 t0) 1e6)}){:cached-reference-ms 0.074606}The Cached reference is orders of magnitude faster because its identity is a small form like (pocket-book.cache-keys/make-data 50000), regardless of how large the output dataset is.
This is one of the key reasons to chain pocket/cached calls in a pipeline: each stepβs cache key references its inputs by identity rather than by content, keeping key generation fast and enabling full provenance through the DAG.
Origin registry: derefed values keep their identity
Sometimes we need to pass real values β not Cached references β to code that requires concrete types. For example, metamorph.mlβs evaluate-pipelines checks (instance? Dataset ds), which fails for Cached references. The natural solution is to deref the reference first, but that would lose the lightweight identity: the derefed dataset would need full content hashing for its cache key.
Pocket solves this with an origin registry. When a Cached value is derefed, the result is registered in a side channel that maps it back to the Cached identity. Later, when ->id is called on that derefed value, the registry provides the lightweight identity instead of hashing the content.
A derefed value has the same identity as its Cached reference:
(let [data-c (pocket/cached #'make-data 50000)
data (deref data-c)]
(= (proto/->id data) (proto/->id data-c)))10:06:46.848 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.cache-keys/make-data
trueAnd the performance is the same β sub-millisecond, like a Cached reference, rather than the tens-of-milliseconds cost of hashing 50,000 rows:
(let [data-c (pocket/cached #'make-data 50000)
data (deref data-c)
t0 (System/nanoTime)
_ (str (cache/canonical-id (proto/->id data)))
t1 (System/nanoTime)]
{:derefed-with-origin-ms (/ (- t1 t0) 1e6)}){:derefed-with-origin-ms 0.036806}What breaks the link
Transforming a derefed value creates a new object. The new object is not in the registry, so ->id falls back to content-based identity. This is intentional β a transformed dataset is semantically different from its source, and its cache key should reflect its actual content.
(let [data-c (pocket/cached #'make-data 100)
data (deref data-c)
transformed (tc/add-column data :z (repeat 100 0))]
{:original-has-origin (= (proto/->id data) (proto/->id data-c))
:transformed-has-origin (= (proto/->id transformed) (proto/->id data-c))})10:06:46.867 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.cache-keys/make-data
{:original-has-origin true, :transformed-has-origin false}Which values are registered
Only values implementing clojure.lang.IObj β maps, vectors, sets, and datasets β are registered. The JVM interns small integers and other primitives, meaning (Long/valueOf 1) always returns the same object. Registering such values would cause false origin matches across unrelated computations. Excluding them avoids this problem entirely.
Cleanup
(pocket/cleanup!)10:06:46.875 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-cache-keys
{:dir "/tmp/pocket-cache-keys", :existed true}