12 MALDI Validation & Edge Cases
This notebook tests error paths, boundary conditions, and utility functions in Ripple’s MALDI preprocessing pipeline. Robust signal processing code must handle degenerate inputs gracefully — even window sizes, too-few data points, zero-noise spectra, and memory-safety limits on bin counts.
For the algorithms themselves, see MALDI Algorithms Explained. For validation against the R reference implementation, see MALDIquant Comparison and MALDIquant Test Vectors.
Note: This notebook requires impl namespaces to test individual internal functions (median filter, noise estimation, peak filtering, alignment) at a finer granularity than the public API exposes.
(ns ripple-book.maldi-validation
(:require
;; Ripple MALDI signal processing internals:
[scicloj.ripple.maldi.impl.signal :as signal]
;; Spectral binning internals:
[scicloj.ripple.maldi.impl.binning :as binning]
;; Multi-sample alignment (https://github.com/scicloj/ripple):
[scicloj.ripple.maldi.impl.alignment :as alignment]
;; Table processing (https://scicloj.github.io/tablecloth/):
[tablecloth.api :as tc]
;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
[scicloj.kindly.v4.kind :as kind]))Median Filter
The median filter replaces each point with the median of its local window. Our implementation uses the R-compatible median definition (average of two middle values for even-length windows), matching MALDIquant behavior.
(def median-filtered
(vec (signal/median-filter [1.0 10.0 2.0 9.0 3.0 8.0 4.0] {:window-size 3})))Interior values should be the median of each 3-element window: [1,10,2]→2.0 [10,2,9]→9.0 [2,9,3]→3.0 [9,3,8]→8.0 [3,8,4]→4.0 Edge values use smaller windows: [1,10]→5.5 and [8,4]→6.0
median-filtered[5.5 2.0 9.0 3.0 8.0 4.0 6.0]Even-length window uses proper averaging (not “high median”):
(def median-even-window
(vec (signal/median-filter [1.0 2.0 3.0 4.0 5.0] {:window-size 3})))Windows: [1,2]→1.5 [1,2,3]→2.0 [2,3,4]→3.0 [3,4,5]→4.0 [4,5]→4.5
median-even-window[1.5 2.0 3.0 4.0 4.5]Local Noise Estimation (MAD local)
The Median Absolute Deviation can be computed globally (one noise estimate for the entire spectrum) or locally (a sliding window giving position-dependent noise). Local estimation is useful when noise varies across the m/z range.
(def local-noise-data
;; Quiet region followed by noisy region
[1.0 1.0 1.0 1.0 1.0 10.0 20.0 5.0 15.0 25.0])(def global-noise (signal/estimate-noise-mad local-noise-data {}))(def local-noise (signal/estimate-noise-mad local-noise-data {:half-window-size 2}))Local noise should vary across the spectrum: quiet region should have lower noise than noisy region
(let [quiet-noise (local-noise 2)
noisy-noise (local-noise 7)]
{:quiet quiet-noise :noisy noisy-noise :quiet-lower? (< quiet-noise noisy-noise)}){:quiet 0.0, :noisy 7.412999999999999, :quiet-lower? true}detect-peaks with :mad-local should work without error
(def peaks-local
(signal/detect-peaks [1.0 5.0 1.0 1.0 1.0 1.0 10.0 1.0 1.0 1.0]
{:half-window-size 2 :snr 2 :noise-method :mad-local}))peaks-local[1 6]Signal Processing Error Paths
Savitzky-Golay: even window size
(def sg-even-error
(try
(signal/savitzky-golay-smooth [1.0 2.0 3.0] {:window-size 4})
(catch Exception e (ex-data e))))sg-even-error{:window-size 4}Savitzky-Golay: too few data points
(def sg-short-error
(try
(signal/savitzky-golay-smooth [1.0 2.0] {:window-size 5})
(catch Exception e (ex-data e))))sg-short-error{:data-points 2, :window-size 5}SNIP: non-positive iterations
(def snip-zero-error
(try
(signal/snip-baseline-removal [1.0 2.0 3.0] {:iterations 0})
(catch Exception e (ex-data e))))snip-zero-error{:iterations 0}SNIP: too few data points
(def snip-short-error
(try
(signal/snip-baseline-removal [1.0 2.0] {:iterations 5})
(catch Exception e (ex-data e))))snip-short-error{:data-points 2}Median filter: even window size
(def mf-even-error
(try
(signal/median-filter [1.0 2.0 3.0] {:window-size 4})
(catch Exception e (ex-data e))))mf-even-error{:window-size 4}Binning: Single Point Per Bin
(def test-spectrum-1
(tc/dataset {:mass [2001.5 2004.5 2007.5]
:intensity [100.0 200.0 300.0]}))(def binning-params-1
{:range [2000 2010]
:step 3})Expected bins:
- bin 0: [2000, 2003) contains mass 2001.5, intensity 100.0
- bin 1: [2003, 2006) contains mass 2004.5, intensity 200.0
- bin 2: [2006, 2009) contains mass 2007.5, intensity 300.0
(def binned-1 (binning/bin-spectrum test-spectrum-1 binning-params-1))(vec binned-1)[100.0 200.0 300.0]Binning: Multiple Points in Same Bin
(def test-spectrum-2
(tc/dataset {:mass [2001.0 2002.0 2004.0 2005.0]
:intensity [100.0 50.0 200.0 75.0]}))bin 0: [2000, 2003) contains masses 2001.0, 2002.0 — sum = 150.0 bin 1: [2003, 2006) contains masses 2004.0, 2005.0 — sum = 275.0 bin 2: [2006, 2009) — empty — 0.0
(def binned-2 (binning/bin-spectrum test-spectrum-2 binning-params-1))(vec binned-2)[150.0 275.0 0.0]Binning: Out-of-Range Values
(def test-spectrum-3
(tc/dataset {:mass [1999.0 2001.0 2005.0 2011.0]
:intensity [999.0 100.0 200.0 888.0]}))1999.0 and 2011.0 are outside [2000, 2010) — filtered out.
(def binned-3 (binning/bin-spectrum test-spectrum-3 binning-params-1))(vec binned-3)[100.0 200.0 0.0]Binning: DRIAMS Parameters
The Weis et al. DRIAMS paper specifies: range [2000, 20000] Da, step 3 Da → 6000 bins.
(def driams-params
{:range [2000 20000]
:step 3})(def n-bins-driams (binning/calculate-n-bins driams-params))n-bins-driams6000(def test-spectrum-driams
(tc/dataset {:mass [2000.0 2003.0 2006.0 10000.0 19999.0]
:intensity [10.0 20.0 30.0 500.0 999.0]}))(def binned-driams (binning/bin-spectrum test-spectrum-driams driams-params))(count binned-driams)6000Verify specific bin assignments: 2000.0 → bin 0, 2003.0 → bin 1, 2006.0 → bin 2, 10000.0 → bin 2666, 19999.0 → bin 5999
{:bin-0 (aget binned-driams 0)
:bin-1 (aget binned-driams 1)
:bin-2 (aget binned-driams 2)
:bin-2666 (aget binned-driams 2666)
:bin-5999 (aget binned-driams 5999)}{:bin-0 10.0,
:bin-1 20.0,
:bin-2 30.0,
:bin-2666 500.0,
:bin-5999 999.0}Binning: Empty Spectrum
(def test-spectrum-empty
(tc/dataset {:mass []
:intensity []}))(def binned-empty (binning/bin-spectrum test-spectrum-empty binning-params-1))(vec binned-empty)[0.0 0.0 0.0]Binning: Boundary Values
A mass exactly on a bin boundary falls into the bin that starts there. With bins [2000, 2003), [2003, 2006), [2006, 2009): 2003.0 lands in bin 1, not bin 0.
(def test-spectrum-boundary
(tc/dataset {:mass [2000.0 2003.0 2006.0]
:intensity [10.0 20.0 30.0]}))(def binned-boundary (binning/bin-spectrum test-spectrum-boundary binning-params-1))(vec binned-boundary)[10.0 20.0 30.0]Binning: Too Many Bins (Memory Safety)
(def bins-too-many-error
(try
(binning/calculate-n-bins {:range [0 100000000] :step 1})
(catch Exception e (:calculated-bins (ex-data e)))))bins-too-many-error100000000Moving Average: Known Values
The moving average replaces each point with the mean of its window. Edge values use the mean of the first/last full window (MALDIquant convention).
(def ma-known-input [1.0 2.0 3.0 4.0 5.0 6.0 7.0])(def ma-known-result
(vec (signal/moving-average-smooth ma-known-input {:half-window-size 1})))With half-window-size=1, window size is 3.
- edge[0] = mean(1,2,3) = 2.0
- interior[1..5] = 3-point running mean
- edge[6] = mean(5,6,7) = 6.0
ma-known-result[2.0 2.0 3.0 4.0 5.0 6.0 6.0]Moving Average: Negative Clamping
Smoothed values that go negative are clamped to 0.0, matching MALDIquant’s behavior for intensity data.
(def ma-neg-result
(vec (signal/moving-average-smooth [1.0 -5.0 1.0] {:half-window-size 1})))mean(1,-5,1) = -1.0 → clamped to 0.0 for all positions
ma-neg-result[0.0 0.0 0.0]TopHat Baseline Removal: Known Values
The TopHat transform removes broad baseline features while preserving narrow peaks. It computes: signal − opening(signal), where opening = dilation(erosion(signal)).
(def tophat-input [1.0 2.0 3.0 4.0 25.0 4.0 3.0 2.0 1.0])(def tophat-result
(vec (signal/tophat-baseline tophat-input {:half-window-size 2})))The triangular baseline [1..5..1] should be removed. The spike at index 4 (value 25) should be preserved.
tophat-result[0.0 0.0 0.0 1.0 22.0 1.0 0.0 0.0 0.0]TopHat: Flat and Zero Spectra
A constant signal has no baseline to remove — TopHat returns all zeros.
(def tophat-flat-result
(vec (signal/tophat-baseline (double-array (repeat 10 5.0)) {:half-window-size 2})))tophat-flat-result[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]All-zero input also returns zeros:
(def tophat-zero-result
(vec (signal/tophat-baseline (double-array (repeat 10 0.0)) {:half-window-size 2})))tophat-zero-result[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]Median Calibration: Properties
Median calibration divides each intensity by the spectrum’s median. This normalizes spectra so that the “typical” intensity is 1.0.
(def medcal-input [2.0 4.0 6.0 8.0 10.0])(def medcal-result (vec (signal/median-calibrate medcal-input {})))Median of [2,4,6,8,10] = 6.0, so result = each / 6.0
medcal-result[0.3333333333333333
0.6666666666666666
1.0
1.3333333333333333
1.6666666666666667]Zero-median input (all zeros) returns zeros with a warning:
(def medcal-zero-result (vec (signal/median-calibrate [0.0 0.0 0.0] {})))medcal-zero-result[0.0 0.0 0.0]Trim: Edge Cases
trim-spectrum restricts a spectrum to a mass range using inclusive boundaries (≥ min and ≤ max).
(def trim-test-ds
(tc/dataset {:mass [100.0 200.0 300.0] :intensity [1.0 2.0 3.0]}))Range outside data → empty dataset
(def trim-empty (signal/trim-spectrum trim-test-ds {:range [500.0 600.0]}))(tc/row-count trim-empty)0Exact data bounds → full dataset
(def trim-full (signal/trim-spectrum trim-test-ds {:range [100.0 300.0]}))(tc/row-count trim-full)3Single-point range → 1 row
(def trim-single (signal/trim-spectrum trim-test-ds {:range [200.0 200.0]}))(tc/row-count trim-single)1Boundary inclusivity — mass on boundary is included
(vec (:mass trim-single))[200.0]Local Maxima Detection: Properties
find-local-maxima-logical returns a boolean array marking positions that are the maximum in their window. Useful properties to verify:
Monotonically increasing → only the last element is a local max
(vec (signal/find-local-maxima-logical [1.0 2.0 3.0 4.0 5.0] {:half-window-size 2}))[false false false false true]Monotonically decreasing → only the first element
(vec (signal/find-local-maxima-logical [5.0 4.0 3.0 2.0 1.0] {:half-window-size 2}))[true false false false false]All equal values → first element (leftmost wins ties)
(vec (signal/find-local-maxima-logical [3.0 3.0 3.0 3.0 3.0] {:half-window-size 2}))[true false false false false]Single peak in center
(vec (signal/find-local-maxima-logical [1.0 2.0 5.0 2.0 1.0] {:half-window-size 2}))[false false true false false]Filter Peaks by SNR: Direct Tests
filter-peaks-by-snr selects peaks where intensity > (SNR × noise). Testing the function directly (not through detect-peaks) exercises the zero-noise and mixed-noise paths.
Zero noise → threshold is 0, all positive local maxima pass:
(def snr-zero-noise
(vec (signal/filter-peaks-by-snr
[0.0 5.0 0.0 3.0 0.0]
(boolean-array [false true false true false])
(double-array [0.0 0.0 0.0 0.0 0.0])
{:snr-threshold 2})))snr-zero-noise[1 3]Very high noise → no peaks pass:
(def snr-high-noise
(vec (signal/filter-peaks-by-snr
[0.0 5.0 0.0 3.0 0.0]
(boolean-array [false true false true false])
(double-array [100.0 100.0 100.0 100.0 100.0])
{:snr-threshold 2})))snr-high-noise[]Mixed: intensity[1]=5 > 2×1=2 (pass), intensity[3]=3 ≤ 2×2=4 (fail):
(def snr-mixed
(vec (signal/filter-peaks-by-snr
[0.0 5.0 0.0 3.0 0.0]
(boolean-array [false true false true false])
(double-array [1.0 1.0 1.0 2.0 1.0])
{:snr-threshold 2})))snr-mixed[1]Pipeline: Skip-Steps Verification
preprocess-spectrum-data has boolean flags to skip the square root transform and TIC normalization. Verify each flag works.
(def pipeline-test-masses (vec (range 2000.0 2100.0 1.0)))(def pipeline-test-intensities
(mapv (fn [m] (+ 100.0 (* 50.0 (Math/sin (* m 0.1))))) pipeline-test-masses))(def pipeline-test-spec
{:mass pipeline-test-masses :intensity pipeline-test-intensities})With sqrt=false, intensities remain larger (no compression):
(def pipeline-with-sqrt
(signal/preprocess-spectrum-data pipeline-test-spec
{:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
:baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))(def pipeline-no-sqrt
(signal/preprocess-spectrum-data pipeline-test-spec
{:should-sqrt-transform false :smooth-window 5 :smooth-polynomial 2
:baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))(let [max-with (double (apply max (seq (:intensity pipeline-with-sqrt))))
max-no (double (apply max (seq (:intensity pipeline-no-sqrt))))]
{:max-with-sqrt max-with :max-no-sqrt max-no
:sqrt-reduces-range? (< max-with max-no)}){:max-with-sqrt 0.16218210486563872,
:max-no-sqrt 4.314465590062217,
:sqrt-reduces-range? true}With tic-normalize=true, trapezoid area ≈ target (1.0). With tic-normalize=false, area is much larger.
(def ^:private trapezoid-area
(fn [ds]
(let [m (double-array (:mass ds))
inten (double-array (:intensity ds))
n (count m)]
(loop [i 1 acc 0.0]
(if (>= i n) acc
(recur (inc i)
(+ acc (* 0.5 (+ (aget inten i) (aget inten (dec i)))
(- (aget m i) (aget m (dec i)))))))))))(def pipeline-with-tic
(signal/preprocess-spectrum-data pipeline-test-spec
{:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
:baseline-iterations 5 :should-tic-normalize true :tic-target 1.0}))(def pipeline-no-tic
(signal/preprocess-spectrum-data pipeline-test-spec
{:should-sqrt-transform true :smooth-window 5 :smooth-polynomial 2
:baseline-iterations 5 :should-tic-normalize false :tic-target 1.0}))(let [area-with (trapezoid-area pipeline-with-tic)
area-no (trapezoid-area pipeline-no-tic)]
{:area-with-tic area-with :area-no-tic area-no
:tic-normalizes? (< (Math/abs (- area-with 1.0)) 1e-10)
:no-tic-different? (> (Math/abs (- area-no 1.0)) 0.1)}){:area-with-tic 1.0000000000000002,
:area-no-tic 4.450518596479303,
:tic-normalizes? true,
:no-tic-different? true}Alignment: Small Example
Multi-sample peak alignment groups nearby m/z values across spectra and builds a rectangular feature matrix for machine learning.
Three spectra with slightly different peak positions:
(def align-s1 {:masses (double-array [1000.0 2000.0 3000.0])
:intensities (double-array [10.0 20.0 30.0])})(def align-s2 {:masses (double-array [1000.5 2000.3 3000.1])
:intensities (double-array [11.0 21.0 31.0])})(def align-s3 {:masses (double-array [1000.2 3000.4])
:intensities (double-array [12.0 32.0])})(def aligned (alignment/bin-peaks [align-s1 align-s2 align-s3] {:tolerance 0.002}))After binning, corresponding peaks share the same mass (the group mean):
(let [b1 (vec (:masses (nth aligned 0)))
b2 (vec (:masses (nth aligned 1)))
b3 (vec (:masses (nth aligned 2)))]
{:s1-masses b1 :s2-masses b2 :s3-masses b3
:mass0-aligned? (= (first b1) (first b2))
:s3-has-2-peaks? (= 2 (count b3))}){:s1-masses [1000.2333333333333 2000.15 3000.1666666666665],
:s2-masses [1000.2333333333333 2000.15 3000.1666666666665],
:s3-masses [1000.2333333333333 3000.1666666666665],
:mass0-aligned? true,
:s3-has-2-peaks? true}Filter at min-frequency=0.5: mass ~2000 appears in 2/3 spectra (keep). Filter at min-frequency=0.8: mass ~2000 appears in only 2/3 (reject).
(def filtered-50
(alignment/filter-peaks aligned {:min-frequency 0.5}))(def filtered-80
(alignment/filter-peaks aligned {:min-frequency 0.8})){:at-50-counts (mapv #(count (:masses %)) filtered-50)
:at-80-counts (mapv #(count (:masses %)) filtered-80)}{:at-50-counts [3 3 2], :at-80-counts [2 2 2]}Intensity matrix — s3 is missing mass ~2000, so its column is 0.0:
(def align-im (alignment/intensity-matrix aligned)){:shape [(:n-spectra align-im) (:n-features align-im)]
:s3-row (vec (nth (:matrix align-im) 2))
:s3-zero-fill? (zero? (aget (nth (:matrix align-im) 2) 1))}{:shape [3 3], :s3-row [12.0 0.0 32.0], :s3-zero-fill? true}Alignment: Tolerance Edge
Peaks that are just within relative tolerance get grouped; peaks just outside remain separate.
Two spectra with 2 peaks each. Mass pair 1000.0/1000.5: relative diff to mean (1000.25) = 0.00025 < tolerance 0.002 → grouped
(def tol-wide (alignment/bin-peaks
[{:masses (double-array [1000.0 2000.0])
:intensities (double-array [10.0 20.0])}
{:masses (double-array [1000.5 2000.3])
:intensities (double-array [11.0 21.0])}]
{:tolerance 0.002}))Same data, but tolerance=0.0001 → mass 1000/1000.5 should NOT group (relative diff 0.00025 > 0.0001)
(def tol-narrow (alignment/bin-peaks
[{:masses (double-array [1000.0 2000.0])
:intensities (double-array [10.0 20.0])}
{:masses (double-array [1000.5 2000.3])
:intensities (double-array [11.0 21.0])}]
{:tolerance 0.0001})){:wide-grouped? (= (aget (:masses (first tol-wide)) 0)
(aget (:masses (second tol-wide)) 0))
:narrow-separate? (not= (aget (:masses (first tol-narrow)) 0)
(aget (:masses (second tol-narrow)) 0))}{:wide-grouped? true, :narrow-separate? true}Alignment: Single Spectrum
With a single spectrum, binning and filtering are identity operations.
(def single-spec {:masses (double-array [1000.0 2000.0 3000.0])
:intensities (double-array [10.0 20.0 30.0])})(def single-binned (alignment/bin-peaks [single-spec] {:tolerance 0.002}))(def single-filtered (alignment/filter-peaks single-binned {:min-frequency 1.0}))(def single-im (alignment/intensity-matrix single-binned)){:masses-unchanged? (= (vec (:masses (first single-binned))) [1000.0 2000.0 3000.0])
:filter-keeps-all? (= 3 (count (:masses (first single-filtered))))
:matrix-shape [(:n-spectra single-im) (:n-features single-im)]
:matrix-row (vec (first (:matrix single-im)))}{:masses-unchanged? true,
:filter-keeps-all? true,
:matrix-shape [1 3],
:matrix-row [10.0 20.0 30.0]}