13  Under the hood: cache keys

When Pocket caches a function call, it builds a cache key from the function identity and all arguments. This chapter looks at how that works internally and what it costs — especially for large arguments like datasets.

Setup

(ns pocket-book.cache-keys
  (:require
   ;; Pocket API and internals:
   [scicloj.pocket :as pocket]
   [scicloj.pocket.impl.cache :as cache]
   [scicloj.pocket.protocols :as proto]
   ;; Annotating kinds of visualizations:
   [scicloj.kindly.v4.kind :as kind]
   ;; Data processing:
   [tablecloth.api :as tc]
   [tech.v3.dataset.modelling :as ds-mod]))

The four steps

Every call to pocket/cached goes through four steps to produce a filesystem path for the cache entry:

  1. ->id — convert each argument to its identity representation via the PIdentifiable protocol. Vars become fully-qualified symbols; Cached references become lightweight references; datasets become their full column data + metadata.

  2. canonical-id — deep-sort maps and normalize the structure so that {:a 1 :b 2} and {:b 2 :a 1} produce the same key.

  3. str — serialize the canonical form to a string.

  4. sha — SHA-1 hash for a fixed-length, filesystem-safe path.

When a function’s arguments are small (scalars, keywords, or Cached references), all four steps are sub-millisecond. But when a raw dataset is passed directly, the full dataset content becomes part of the cache key.
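The four steps can be sketched in plain Clojure. This is a simplified stand-in, not Pocket's implementation: canonicalize here only deep-sorts maps, and the ->id step is reduced to pairing the function symbol with its arguments, whereas the real step is a full protocol dispatch.

```clojure
(require '[clojure.walk :as walk])
(import '(java.security MessageDigest))

(defn canonicalize [x]
  ;; Step 2 stand-in: deep-sort every map so key order
  ;; cannot affect the resulting key.
  (walk/postwalk (fn [v] (if (map? v) (into (sorted-map) v) v)) x))

(defn sha1-hex [^String s]
  ;; Step 4: SHA-1 over the serialized form, hex-encoded.
  (let [digest (.digest (MessageDigest/getInstance "SHA-1")
                        (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

(defn cache-key [fn-sym args]
  (-> [fn-sym (mapv canonicalize args)] ; step 1 stand-in: identity form
      str                               ; step 3: serialize
      sha1-hex))                        ; step 4: hash

;; Key order inside map arguments does not matter:
(= (cache-key 'f [{:a 1 :b 2}])
   (cache-key 'f [{:b 2 :a 1}]))
;; => true
```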

Measuring the cost

Let’s pass a 50,000-row dataset as a direct argument and time each step:

(let [ds (-> (tc/dataset {:x (vec (range 50000))
                          :y (vec (range 50000))
                          :z (repeatedly 50000 rand)})
             (ds-mod/set-inference-target :y))
      ;; Step 1: ->id
      t0 (System/nanoTime)
      id (proto/->id ds)
      t1 (System/nanoTime)
      ;; Step 2: canonical-id
      cid (cache/canonical-id id)
      t2 (System/nanoTime)
      ;; Step 3: str
      s (str cid)
      t3 (System/nanoTime)
      ;; Step 4: sha
      _ (cache/sha s)
      t4 (System/nanoTime)]
  {:rows 50000
   :string-length (count s)
   :->id-ms (/ (- t1 t0) 1e6)
   :canonical-id-ms (/ (- t2 t1) 1e6)
   :str-ms (/ (- t3 t2) 1e6)
   :sha-ms (/ (- t4 t3) 1e6)})
{:rows 50000,
 :string-length 1541262,
 :->id-ms 3.355993,
 :canonical-id-ms 5.730036,
 :str-ms 25.115645,
 :sha-ms 1.579439}

The str serialization step dominates — it must walk the entire nested structure and produce a ~1.5 MB string. SHA-1 hashing that string is fast by comparison. Switching to a faster hash algorithm would not meaningfully help.

Why Cached references matter

When an argument is a Cached reference rather than a raw dataset, its identity is a lightweight reference to the computation that produced it — not the data itself. Compare:

(pocket/set-base-cache-dir! "/tmp/pocket-cache-keys")
10:06:46.813 INFO scicloj.pocket - Cache dir set to: /tmp/pocket-cache-keys
"/tmp/pocket-cache-keys"
(pocket/cleanup!)
10:06:46.814 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-cache-keys
{:dir "/tmp/pocket-cache-keys", :existed false}
(defn make-data [n]
  (tc/dataset {:x (vec (range n))
               :y (vec (range n))}))

Direct dataset — identity includes all 50,000 rows:

(let [ds (make-data 50000)
      t0 (System/nanoTime)
      _ (str (cache/canonical-id (proto/->id ds)))
      t1 (System/nanoTime)]
  {:direct-ms (/ (- t1 t0) 1e6)})
{:direct-ms 20.211526}

Cached reference — identity is just (make-data 50000):

(let [data-c (pocket/cached #'make-data 50000)
      t0 (System/nanoTime)
      _ (str (cache/canonical-id (proto/->id data-c)))
      t1 (System/nanoTime)]
  {:cached-reference-ms (/ (- t1 t0) 1e6)})
{:cached-reference-ms 0.074606}

The Cached reference is orders of magnitude faster because its identity is a small form like (pocket-book.cache-keys/make-data 50000), regardless of how large the output dataset is.

This is one of the key reasons to chain pocket/cached calls in a pipeline: each step’s cache key references its inputs by identity rather than by content, keeping key generation fast and enabling full provenance through the DAG.
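As an illustration, a two-step chain might look like this. compute-stats is a hypothetical step invented for this example, and the code assumes, per the chapter, that a Cached argument is resolved to its value when the downstream step actually runs:

```clojure
;; A hypothetical downstream step over the dataset from make-data:
(defn compute-stats [ds]
  {:rows  (tc/row-count ds)
   :sum-x (reduce + (:x ds))})

(let [data-c  (pocket/cached #'make-data 50000)
      stats-c (pocket/cached #'compute-stats data-c)]
  ;; stats-c's cache key embeds data-c's lightweight identity,
  ;; (pocket-book.cache-keys/make-data 50000), not 50,000 rows
  ;; of content.
  @stats-c)
```

If make-data changes its arguments or its var, data-c's identity changes, and stats-c's key changes with it — provenance flows through the chain without re-hashing any data.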

Origin registry: derefed values keep their identity

Sometimes we need to pass real values — not Cached references — to code that requires concrete types. For example, metamorph.ml’s evaluate-pipelines checks (instance? Dataset ds), which fails for Cached references. The natural solution is to deref the reference first, but that would lose the lightweight identity: the derefed dataset would need full content hashing for its cache key.

Pocket solves this with an origin registry. When a Cached value is derefed, the result is registered in a side channel that maps it back to the Cached identity. Later, when ->id is called on that derefed value, the registry provides the lightweight identity instead of hashing the content.
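The mechanism can be modeled with a weak map as a side channel. This is a toy sketch, not Pocket's actual registry — among other differences, a faithful version would key on object identity rather than equality, and would integrate with ->id dispatch:

```clojure
(import '(java.util Collections WeakHashMap))

;; Weak keys: registration does not prevent the value from
;; being garbage-collected once the program drops it.
(def ^java.util.Map origin-registry
  (Collections/synchronizedMap (WeakHashMap.)))

(defn register-origin!
  "Record that `value` was produced by the computation `identity-form`."
  [value identity-form]
  (.put origin-registry value identity-form)
  value)

(defn lookup-origin [value]
  (.get origin-registry value))

(let [ds {:x [1 2 3]}]          ; stands in for a derefed dataset
  (register-origin! ds '(make-data 3))
  (lookup-origin ds))
;; => (make-data 3)
```

With such a registry in place, an ->id-style function can first consult lookup-origin and fall back to content hashing only when no origin is recorded.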

A derefed value has the same identity as its Cached reference:

(let [data-c (pocket/cached #'make-data 50000)
      data (deref data-c)]
  (= (proto/->id data) (proto/->id data-c)))
10:06:46.848 INFO scicloj.pocket.impl.cache - Cache miss, computing: pocket-book.cache-keys/make-data
true

And the performance is the same — sub-millisecond, like a Cached reference, rather than the tens-of-milliseconds cost of hashing 50,000 rows:

(let [data-c (pocket/cached #'make-data 50000)
      data (deref data-c)
      t0 (System/nanoTime)
      _ (str (cache/canonical-id (proto/->id data)))
      t1 (System/nanoTime)]
  {:derefed-with-origin-ms (/ (- t1 t0) 1e6)})
{:derefed-with-origin-ms 0.036806}

Which values are registered

Only values implementing clojure.lang.IObj — maps, vectors, sets, and datasets — are registered. The JVM caches small boxed integers and similar primitives, so (Long/valueOf 1) always returns the same object. Registering such values would cause false origin matches across unrelated computations. Excluding them avoids this problem entirely.
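This distinction is easy to observe directly with plain Clojure and JVM behavior, independent of Pocket:

```clojure
;; Small boxed longs are cached by the JVM, so two unrelated
;; computations can return the very same object -- registering
;; it would conflate their origins:
(identical? (Long/valueOf 1) (Long/valueOf 1))
;; => true

;; Collections implement IObj and are freshly allocated, so object
;; identity reliably distinguishes one computation's result
;; from another's:
(instance? clojure.lang.IObj {:x [1 2 3]})
;; => true
(identical? (vec (range 3)) (vec (range 3)))
;; => false
```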

Cleanup

(pocket/cleanup!)
10:06:46.875 INFO scicloj.pocket - Cache cleanup: /tmp/pocket-cache-keys
{:dir "/tmp/pocket-cache-keys", :existed true}
source: notebooks/pocket_book/cache_keys.clj