12  MALDI Validation & Edge Cases

This notebook tests error paths, boundary conditions, and utility functions in Ripple’s MALDI preprocessing pipeline. Robust signal processing code must handle degenerate inputs gracefully — even window sizes, too-few data points, zero-noise spectra, and memory-safety limits on bin counts.

For the algorithms themselves, see MALDI Algorithms Explained. For validation against the R reference implementation, see MALDIquant Comparison and MALDIquant Test Vectors.

Note: This notebook requires impl namespaces to test individual internal functions (median filter, noise estimation, peak filtering, alignment) at a finer granularity than the public API exposes.

(ns ripple-book.maldi-validation
  (:require
   ;; Ripple MALDI signal processing internals:
   [scicloj.ripple.maldi.impl.signal :as signal]
   ;; Spectral binning internals:
   [scicloj.ripple.maldi.impl.binning :as binning]
   ;; Multi-sample alignment (https://github.com/scicloj/ripple):
   [scicloj.ripple.maldi.impl.alignment :as alignment]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Median Filter

The median filter replaces each point with the median of its local window. Our implementation uses the R-compatible median definition (average of two middle values for even-length windows), matching MALDIquant behavior.

(def median-filtered
  (vec (signal/median-filter [1.0 10.0 2.0 9.0 3.0 8.0 4.0] {:window-size 3})))

Interior values should be the median of each 3-element window: [1,10,2]→2.0 [10,2,9]→9.0 [2,9,3]→3.0 [9,3,8]→8.0 [3,8,4]→4.0 Edge values use smaller windows: [1,10]→5.5 and [8,4]→6.0

median-filtered
[5.5 2.0 9.0 3.0 8.0 4.0 6.0]

Even-length window uses proper averaging (not “high median”):

(def median-even-window
  (vec (signal/median-filter [1.0 2.0 3.0 4.0 5.0] {:window-size 3})))

Windows: [1,2]→1.5 [1,2,3]→2.0 [2,3,4]→3.0 [3,4,5]→4.0 [4,5]→4.5

median-even-window
[1.5 2.0 3.0 4.0 4.5]

Local Noise Estimation (MAD local)

The Median Absolute Deviation can be computed globally (one noise estimate for the entire spectrum) or locally (a sliding window giving position-dependent noise). Local estimation is useful when noise varies across the m/z range.

(def local-noise-data
  ;; Quiet region followed by noisy region
  [1.0 1.0 1.0 1.0 1.0 10.0 20.0 5.0 15.0 25.0])
(def global-noise (signal/estimate-noise-mad local-noise-data {}))
(def local-noise (signal/estimate-noise-mad local-noise-data {:half-window-size 2}))

Local noise should vary across the spectrum: quiet region should have lower noise than noisy region

(let [quiet-noise (local-noise 2)
      noisy-noise (local-noise 7)]
  {:quiet quiet-noise :noisy noisy-noise :quiet-lower? (< quiet-noise noisy-noise)})
{:quiet 0.0, :noisy 7.412999999999999, :quiet-lower? true}

detect-peaks with :mad-local should work without error

(def peaks-local
  (signal/detect-peaks [1.0 5.0 1.0 1.0 1.0 1.0 10.0 1.0 1.0 1.0]
                       {:half-window-size 2 :snr 2 :noise-method :mad-local}))
peaks-local
[1 6]

Signal Processing Error Paths

Savitzky-Golay: even window size

(def sg-even-error
  (try
    (signal/savitzky-golay-smooth [1.0 2.0 3.0] {:window-size 4})
    (catch Exception e (ex-data e))))
sg-even-error
{:window-size 4}

Savitzky-Golay: too few data points

(def sg-short-error
  (try
    (signal/savitzky-golay-smooth [1.0 2.0] {:window-size 5})
    (catch Exception e (ex-data e))))
sg-short-error
{:data-points 2, :window-size 5}

SNIP: non-positive iterations

(def snip-zero-error
  (try
    (signal/snip-baseline-removal [1.0 2.0 3.0] {:iterations 0})
    (catch Exception e (ex-data e))))
snip-zero-error
{:iterations 0}

SNIP: too few data points

(def snip-short-error
  (try
    (signal/snip-baseline-removal [1.0 2.0] {:iterations 5})
    (catch Exception e (ex-data e))))
snip-short-error
{:data-points 2}

Median filter: even window size

(def mf-even-error
  (try
    (signal/median-filter [1.0 2.0 3.0] {:window-size 4})
    (catch Exception e (ex-data e))))
mf-even-error
{:window-size 4}

Binning: Single Point Per Bin

(def test-spectrum-1
  (tc/dataset {:mass [2001.5 2004.5 2007.5]
               :intensity [100.0 200.0 300.0]}))
(def binning-params-1
  {:range [2000 2010]
   :step 3})

Expected bins:

  • bin 0: [2000, 2003) contains mass 2001.5, intensity 100.0
  • bin 1: [2003, 2006) contains mass 2004.5, intensity 200.0
  • bin 2: [2006, 2009) contains mass 2007.5, intensity 300.0
(def binned-1 (binning/bin-spectrum test-spectrum-1 binning-params-1))
(vec binned-1)
[100.0 200.0 300.0]

Binning: Multiple Points in Same Bin

(def test-spectrum-2
  (tc/dataset {:mass [2001.0 2002.0 2004.0 2005.0]
               :intensity [100.0 50.0 200.0 75.0]}))

bin 0: [2000, 2003) contains masses 2001.0, 2002.0 — sum = 150.0 bin 1: [2003, 2006) contains masses 2004.0, 2005.0 — sum = 275.0 bin 2: [2006, 2009) — empty — 0.0

(def binned-2 (binning/bin-spectrum test-spectrum-2 binning-params-1))
(vec binned-2)
[150.0 275.0 0.0]

Binning: Out-of-Range Values

(def test-spectrum-3
  (tc/dataset {:mass [1999.0 2001.0 2005.0 2011.0]
               :intensity [999.0 100.0 200.0 888.0]}))

1999.0 and 2011.0 are outside [2000, 2010) — filtered out.

(def binned-3 (binning/bin-spectrum test-spectrum-3 binning-params-1))
(vec binned-3)
[100.0 200.0 0.0]

Binning: DRIAMS Parameters

The Weis et al. DRIAMS paper specifies: range [2000, 20000] Da, step 3 Da → 6000 bins.

(def driams-params
  {:range [2000 20000]
   :step 3})
(def n-bins-driams (binning/calculate-n-bins driams-params))
n-bins-driams
6000
(def test-spectrum-driams
  (tc/dataset {:mass [2000.0 2003.0 2006.0 10000.0 19999.0]
               :intensity [10.0 20.0 30.0 500.0 999.0]}))
(def binned-driams (binning/bin-spectrum test-spectrum-driams driams-params))
(count binned-driams)
6000

Verify specific bin assignments: 2000.0 → bin 0, 2003.0 → bin 1, 2006.0 → bin 2, 10000.0 → bin 2666, 19999.0 → bin 5999

{:bin-0 (aget binned-driams 0)
 :bin-1 (aget binned-driams 1)
 :bin-2 (aget binned-driams 2)
 :bin-2666 (aget binned-driams 2666)
 :bin-5999 (aget binned-driams 5999)}
{:bin-0 10.0,
 :bin-1 20.0,
 :bin-2 30.0,
 :bin-2666 500.0,
 :bin-5999 999.0}

Binning: Empty Spectrum

(def test-spectrum-empty
  (tc/dataset {:mass []
               :intensity []}))
(def binned-empty (binning/bin-spectrum test-spectrum-empty binning-params-1))
(vec binned-empty)
[0.0 0.0 0.0]

Binning: Boundary Values

A mass exactly on a bin boundary falls into the bin that starts there. With bins [2000, 2003), [2003, 2006), [2006, 2009): 2003.0 lands in bin 1, not bin 0.

(def test-spectrum-boundary
  (tc/dataset {:mass [2000.0 2003.0 2006.0]
               :intensity [10.0 20.0 30.0]}))
(def binned-boundary (binning/bin-spectrum test-spectrum-boundary binning-params-1))
(vec binned-boundary)
[10.0 20.0 30.0]

Binning: Too Many Bins (Memory Safety)

(def bins-too-many-error
  (try
    (binning/calculate-n-bins {:range [0 100000000] :step 1})
    (catch Exception e (:calculated-bins (ex-data e)))))
bins-too-many-error
100000000

Moving Average: Known Values

The moving average replaces each point with the mean of its window. Edge values use the mean of the first/last full window (MALDIquant convention).

(def ma-known-input [1.0 2.0 3.0 4.0 5.0 6.0 7.0])
(def ma-known-result
  (vec (signal/moving-average-smooth ma-known-input {:half-window-size 1})))

With half-window-size=1, window size is 3.

  • edge[0] = mean(1,2,3) = 2.0
  • interior[1..5] = 3-point running mean
  • edge[6] = mean(5,6,7) = 6.0
ma-known-result
[2.0 2.0 3.0 4.0 5.0 6.0 6.0]

Moving Average: Negative Clamping

Smoothed values that go negative are clamped to 0.0, matching MALDIquant’s behavior for intensity data.

(def ma-neg-result
  (vec (signal/moving-average-smooth [1.0 -5.0 1.0] {:half-window-size 1})))

mean(1,-5,1) = -1.0 → clamped to 0.0 for all positions

ma-neg-result
[0.0 0.0 0.0]

TopHat Baseline Removal: Known Values

The TopHat transform removes broad baseline features while preserving narrow peaks. It computes: signal − opening(signal), where opening = dilation(erosion(signal)).

(def tophat-input [1.0 2.0 3.0 4.0 25.0 4.0 3.0 2.0 1.0])
(def tophat-result
  (vec (signal/tophat-baseline tophat-input {:half-window-size 2})))

The triangular baseline [1..5..1] should be removed. The spike at index 4 (value 25) should be preserved.

tophat-result
[0.0 0.0 0.0 1.0 22.0 1.0 0.0 0.0 0.0]

TopHat: Flat and Zero Spectra

A constant signal has no baseline to remove — TopHat returns all zeros.

(def tophat-flat-result
  (vec (signal/tophat-baseline (double-array (repeat 10 5.0)) {:half-window-size 2})))
tophat-flat-result
[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

All-zero input also returns zeros:

(def tophat-zero-result
  (vec (signal/tophat-baseline (double-array (repeat 10 0.0)) {:half-window-size 2})))
tophat-zero-result
[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]

Median Calibration: Properties

Median calibration divides each intensity by the spectrum’s median. This normalizes spectra so that the “typical” intensity is 1.0.

(def medcal-input [2.0 4.0 6.0 8.0 10.0])
(def medcal-result (vec (signal/median-calibrate medcal-input {})))

Median of [2,4,6,8,10] = 6.0, so result = each / 6.0

medcal-result
[0.3333333333333333
 0.6666666666666666
 1.0
 1.3333333333333333
 1.6666666666666667]

Zero-median input (all zeros) returns zeros with a warning:

(def medcal-zero-result (vec (signal/median-calibrate [0.0 0.0 0.0] {})))
medcal-zero-result
[0.0 0.0 0.0]

Trim: Edge Cases

trim-spectrum restricts a spectrum to a mass range using inclusive boundaries (≥ min and ≤ max).

(def trim-test-ds
  (tc/dataset {:mass [100.0 200.0 300.0] :intensity [1.0 2.0 3.0]}))

Range outside data → empty dataset

(def trim-empty (signal/trim-spectrum trim-test-ds {:range [500.0 600.0]}))
(tc/row-count trim-empty)
0

Exact data bounds → full dataset

(def trim-full (signal/trim-spectrum trim-test-ds {:range [100.0 300.0]}))
(tc/row-count trim-full)
3

Single-point range → 1 row

(def trim-single (signal/trim-spectrum trim-test-ds {:range [200.0 200.0]}))
(tc/row-count trim-single)
1

Boundary inclusivity — mass on boundary is included

(vec (:mass trim-single))
[200.0]

Local Maxima Detection: Properties

find-local-maxima-logical returns a boolean array marking positions that are the maximum in their window. Useful properties to verify:

Monotonically increasing → only the last element is a local max

(vec (signal/find-local-maxima-logical [1.0 2.0 3.0 4.0 5.0] {:half-window-size 2}))
[false false false false true]

Monotonically decreasing → only the first element

(vec (signal/find-local-maxima-logical [5.0 4.0 3.0 2.0 1.0] {:half-window-size 2}))
[true false false false false]

All equal values → first element (leftmost wins ties)

(vec (signal/find-local-maxima-logical [3.0 3.0 3.0 3.0 3.0] {:half-window-size 2}))
[true false false false false]

Single peak in center

(vec (signal/find-local-maxima-logical [1.0 2.0 5.0 2.0 1.0] {:half-window-size 2}))
[false false true false false]

Filter Peaks by SNR: Direct Tests

filter-peaks-by-snr selects peaks where intensity > (SNR × noise). Testing the function directly (not through detect-peaks) exercises the zero-noise and mixed-noise paths.

Zero noise → threshold is 0, all positive local maxima pass:

(def snr-zero-noise
  (vec (signal/filter-peaks-by-snr
        [0.0 5.0 0.0 3.0 0.0]
        (boolean-array [false true false true false])
        (double-array [0.0 0.0 0.0 0.0 0.0])
        {:snr-threshold 2})))
snr-zero-noise
[1 3]

Very high noise → no peaks pass:

(def snr-high-noise
  (vec (signal/filter-peaks-by-snr
        [0.0 5.0 0.0 3.0 0.0]
        (boolean-array [false true false true false])
        (double-array [100.0 100.0 100.0 100.0 100.0])
        {:snr-threshold 2})))
snr-high-noise
[]

Mixed: intensity[1]=5 > 2×1=2 (pass), intensity[3]=3 ≤ 2×2=4 (fail):

(def snr-mixed
  (vec (signal/filter-peaks-by-snr
        [0.0 5.0 0.0 3.0 0.0]
        (boolean-array [false true false true false])
        (double-array [1.0 1.0 1.0 2.0 1.0])
        {:snr-threshold 2})))
snr-mixed
[1]

Pipeline: Skip-Steps Verification

preprocess-spectrum-data has boolean flags to skip the square root transform and TIC normalization. Verify each flag works.

(def pipeline-test-masses (vec (range 2000.0 2100.0 1.0)))
(def pipeline-test-intensities
  (mapv (fn [m] (+ 100.0 (* 50.0 (Math/sin (* m 0.1))))) pipeline-test-masses))
(def pipeline-test-spec
  {:mass pipeline-test-masses :intensity pipeline-test-intensities})

With sqrt=false, intensities remain larger (no compression):

(def pipeline-with-sqrt
  (signal/preprocess-spectrum-data pipeline-test-spec
                                   {:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
                                    :baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))
(def pipeline-no-sqrt
  (signal/preprocess-spectrum-data pipeline-test-spec
                                   {:should-sqrt-transform false :smooth-window 5 :smooth-polynomial 2
                                    :baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))
(let [max-with (double (apply max (seq (:intensity pipeline-with-sqrt))))
      max-no (double (apply max (seq (:intensity pipeline-no-sqrt))))]
  {:max-with-sqrt max-with :max-no-sqrt max-no
   :sqrt-reduces-range? (< max-with max-no)})
{:max-with-sqrt 0.16218210486563872,
 :max-no-sqrt 4.314465590062217,
 :sqrt-reduces-range? true}

With tic-normalize=true, trapezoid area ≈ target (1.0). With tic-normalize=false, area is much larger.

(def ^:private trapezoid-area
  (fn [ds]
    (let [m (double-array (:mass ds))
          inten (double-array (:intensity ds))
          n (count m)]
      (loop [i 1 acc 0.0]
        (if (>= i n) acc
            (recur (inc i)
                   (+ acc (* 0.5 (+ (aget inten i) (aget inten (dec i)))
                             (- (aget m i) (aget m (dec i)))))))))))
(def pipeline-with-tic
  (signal/preprocess-spectrum-data pipeline-test-spec
                                   {:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
                                    :baseline-iterations 5 :should-tic-normalize true :tic-target 1.0}))
(def pipeline-no-tic
  (signal/preprocess-spectrum-data pipeline-test-spec
                                   {:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
                                    :baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))
(let [area-with (trapezoid-area pipeline-with-tic)
      area-no (trapezoid-area pipeline-no-tic)]
  {:area-with-tic area-with :area-no-tic area-no
   :tic-normalizes? (< (Math/abs (- area-with 1.0)) 1e-10)
   :no-tic-different? (> (Math/abs (- area-no 1.0)) 0.1)})
{:area-with-tic 1.0000000000000002,
 :area-no-tic 4.450518596479303,
 :tic-normalizes? true,
 :no-tic-different? true}

Alignment: Small Example

Multi-sample peak alignment groups nearby m/z values across spectra and builds a rectangular feature matrix for machine learning.

Three spectra with slightly different peak positions:

(def align-s1 {:masses (double-array [1000.0 2000.0 3000.0])
               :intensities (double-array [10.0 20.0 30.0])})
(def align-s2 {:masses (double-array [1000.5 2000.3 3000.1])
               :intensities (double-array [11.0 21.0 31.0])})
(def align-s3 {:masses (double-array [1000.2 3000.4])
               :intensities (double-array [12.0 32.0])})
(def aligned (alignment/bin-peaks [align-s1 align-s2 align-s3] {:tolerance 0.002}))

After binning, corresponding peaks share the same mass (the group mean):

(let [b1 (vec (:masses (nth aligned 0)))
      b2 (vec (:masses (nth aligned 1)))
      b3 (vec (:masses (nth aligned 2)))]
  {:s1-masses b1 :s2-masses b2 :s3-masses b3
   :mass0-aligned? (= (first b1) (first b2))
   :s3-has-2-peaks? (= 2 (count b3))})
{:s1-masses [1000.2333333333333 2000.15 3000.1666666666665],
 :s2-masses [1000.2333333333333 2000.15 3000.1666666666665],
 :s3-masses [1000.2333333333333 3000.1666666666665],
 :mass0-aligned? true,
 :s3-has-2-peaks? true}

Filter at min-frequency=0.5: mass ~2000 appears in 2/3 spectra (keep). Filter at min-frequency=0.8: mass ~2000 appears in only 2/3 (reject).

(def filtered-50
  (alignment/filter-peaks aligned {:min-frequency 0.5}))
(def filtered-80
  (alignment/filter-peaks aligned {:min-frequency 0.8}))
{:at-50-counts (mapv #(count (:masses %)) filtered-50)
 :at-80-counts (mapv #(count (:masses %)) filtered-80)}
{:at-50-counts [3 3 2], :at-80-counts [2 2 2]}

Intensity matrix — s3 is missing mass ~2000, so its column is 0.0:

(def align-im (alignment/intensity-matrix aligned))
{:shape [(:n-spectra align-im) (:n-features align-im)]
 :s3-row (vec (nth (:matrix align-im) 2))
 :s3-zero-fill? (zero? (aget (nth (:matrix align-im) 2) 1))}
{:shape [3 3], :s3-row [12.0 0.0 32.0], :s3-zero-fill? true}

Alignment: Tolerance Edge

Peaks that are just within relative tolerance get grouped; peaks just outside remain separate.

Two spectra with 2 peaks each. Mass pair 1000.0/1000.5: relative diff to mean (1000.25) = 0.00025 < tolerance 0.002 → grouped

(def tol-wide (alignment/bin-peaks
               [{:masses (double-array [1000.0 2000.0])
                 :intensities (double-array [10.0 20.0])}
                {:masses (double-array [1000.5 2000.3])
                 :intensities (double-array [11.0 21.0])}]
               {:tolerance 0.002}))

Same data, but tolerance=0.0001 → mass 1000/1000.5 should NOT group (relative diff 0.00025 > 0.0001)

(def tol-narrow (alignment/bin-peaks
                 [{:masses (double-array [1000.0 2000.0])
                   :intensities (double-array [10.0 20.0])}
                  {:masses (double-array [1000.5 2000.3])
                   :intensities (double-array [11.0 21.0])}]
                 {:tolerance 0.0001}))
{:wide-grouped? (= (aget (:masses (first tol-wide)) 0)
                   (aget (:masses (second tol-wide)) 0))
 :narrow-separate? (not= (aget (:masses (first tol-narrow)) 0)
                         (aget (:masses (second tol-narrow)) 0))}
{:wide-grouped? true, :narrow-separate? true}

Alignment: Single Spectrum

With a single spectrum, binning and filtering are identity operations.

(def single-spec {:masses (double-array [1000.0 2000.0 3000.0])
                  :intensities (double-array [10.0 20.0 30.0])})
(def single-binned (alignment/bin-peaks [single-spec] {:tolerance 0.002}))
(def single-filtered (alignment/filter-peaks single-binned {:min-frequency 1.0}))
(def single-im (alignment/intensity-matrix single-binned))
{:masses-unchanged? (= (vec (:masses (first single-binned))) [1000.0 2000.0 3000.0])
 :filter-keeps-all? (= 3 (count (:masses (first single-filtered))))
 :matrix-shape [(:n-spectra single-im) (:n-features single-im)]
 :matrix-row (vec (first (:matrix single-im)))}
{:masses-unchanged? true,
 :filter-keeps-all? true,
 :matrix-shape [1 3],
 :matrix-row [10.0 20.0 30.0]}
source: notebooks/ripple_book/maldi_validation.clj