3 MALDI Algorithms Explained

This notebook walks through the preprocessing pipeline used in the DRIAMS study (Weis et al. 2022) for MALDI-TOF mass spectrometry data, explaining why each step is needed and the mathematical intuition behind it.

The pipeline follows the MALDIquant reference implementation (Gibb & Strimmer 2016). Ripple also provides additional single-spectrum functions (moving average, TopHat baseline, median calibration) and multi-sample alignment — these are not covered here but are documented in the validation notebooks.

References:

Gibb, S. and Strimmer, K. (2016). “Mass Spectrometry Analysis Using MALDIquant.” Chapter 11, Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Springer. arXiv:1607.03180
Weis, C. et al. (2022). “Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning.” Nature Medicine 28, 164–174. doi:10.1038/s41591-021-01619-9

(ns ripple-book.maldi-algorithms-explained
  (:require
   ;; Ripple MALDI public API:
   [scicloj.ripple.maldi :as maldi]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]
   ;; High-performance array math (https://github.com/cnuernber/dtype-next):
   [tech.v3.datatype :as dtype]
   [tech.v3.datatype.functional :as dfn]))

The MALDI-TOF Preprocessing Pipeline

MALDI-TOF (Matrix-Assisted Laser Desorption/Ionization — Time of Flight) mass spectrometry measures the mass-to-charge ratio (m/z) of molecules in a biological sample. A raw spectrum is a sequence of (mass, intensity) pairs where intensity reflects how many ions of that mass were detected.

Raw spectra are noisy and contain artifacts. Before analysis, each spectrum passes through a preprocessing pipeline:

Step	Algorithm	Purpose
1	Trim	Restrict to analysis mass range
2	Square root transform	Stabilize variance
3	Savitzky-Golay smoothing	Reduce noise
4	SNIP baseline removal	Remove background signal
5	TIC normalization	Make spectra comparable
6	Peak detection	Find biologically relevant peaks
7	Spectral binning	Create fixed-length feature vectors for ML

Synthetic Test Spectrum

To clearly see the effect of each algorithm, we construct a synthetic spectrum with known components: a curved baseline, several peaks of varying width and height, and deterministic pseudo-noise.

The raw spectrum spans a wider range than the analysis window. We will trim it to [2000, 4000] Da before processing — the same pattern used in DRIAMS where raw spectra span ~1960–20133 Da but are trimmed to [2000, 20000] Da.

(defn gaussian
  "A Gaussian peak centered at `center` with width `sigma` and `height`."
  [x center sigma height]
  (* height (Math/exp (- (/ (Math/pow (- x center) 2)
                            (* 2 sigma sigma))))))

(def raw-masses
  (vec (range 1500.0 4500.0 1.0)))

The spectrum has:

An exponentially decaying baseline (simulating MALDI matrix effects)
Five peaks of different widths and heights
Deterministic sinusoidal “noise” (reproducible without a random seed)

(def raw-intensities
  (mapv (fn [m]
          (let [baseline (+ (* 800 (Math/exp (- (/ (- m 2000) 500.0))))
                            (* 0.02 (- m 2000)))
                peak1 (gaussian m 2200 8.0 500.0)
                peak2 (gaussian m 2500 15.0 300.0)
                peak3 (gaussian m 2800 5.0 700.0)
                peak4 (gaussian m 3200 20.0 200.0)
                peak5 (gaussian m 3600 10.0 400.0)
                noise (* 15.0 (+ (Math/sin (* m 0.3))
                                 (* 0.7 (Math/sin (* m 1.1)))
                                 (* 0.3 (Math/sin (* m 2.7)))))]
            (max 0.0 (+ baseline peak1 peak2 peak3 peak4 peak5 noise))))
        raw-masses))

(count raw-masses)

The raw spectrum

(-> (tc/dataset {:mass raw-masses :intensity raw-intensities})
    (plotly/base {:=width 800 :=height 300 :=x :mass :=y :intensity})
    (plotly/layer-line {:=mark-opacity 0.8})
    plotly/plot)

The spectrum shows the main real-world challenges: a decaying baseline that obscures the true peak heights, noise that could create false peaks, and peaks of varying width that reflect different molecular species. The region below m/z 2000 is dominated by the matrix baseline; we will trim it away in the first preprocessing step.

Trim to Analysis Range

Raw MALDI-TOF spectra typically span a wider m/z range than is analytically useful. The low-mass region is dominated by matrix ions, and the high-mass tail carries mostly noise.

Trimming restricts the spectrum to the region of interest before any signal processing. In the DRIAMS study, spectra are trimmed to [2000, 20000] Da.

(Weis et al. 2022)

(def raw-spectrum
  (tc/dataset {:mass raw-masses :intensity raw-intensities}))

(def step0-trimmed
  (maldi/trim-spectrum raw-spectrum {:range [2000 4000]}))

(tc/row-count step0-trimmed)

From here on, all processing operates on the trimmed spectrum:

(def masses (:mass step0-trimmed))

(def intensities (:intensity step0-trimmed))

Square Root Transformation

The problem

MALDI-TOF measures ion counts. Like other counting processes, the data follow Poisson-like statistics in which the variance scales with the mean. High-intensity peaks therefore carry more absolute noise than low-intensity regions, and downstream algorithms (smoothing, peak detection) will be biased toward the tallest peaks.

The solution

The square root is a variance-stabilizing transformation for Poisson-distributed data. If Y ~ Poisson(λ), then Var(√Y) ≈ 1/4 once λ is not too small, nearly independent of λ. This follows from a first-order (delta-method) approximation: Var(√Y) ≈ (1/(2√λ))² × Var(Y) = λ/(4λ) = 1/4. After transformation, all regions of the spectrum have roughly equal noise variance, giving downstream algorithms a fair chance at detecting both large and small peaks.

The transformation is simply: y’ = √y

(Gibb & Strimmer 2016, §11.2.3)
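The 1/4 figure can be checked numerically without any sampling. The following is a minimal sketch in plain Clojure, independent of Ripple; the function name `poisson-sqrt-variance` and the truncation point of the sum are illustrative assumptions, not part of any library.

```clojure
;; Illustrative check (not part of Ripple): exact variance of sqrt(Y)
;; for Y ~ Poisson(lambda), obtained by summing over the probability
;; mass function instead of drawing random samples.
(defn poisson-sqrt-variance
  "Variance of sqrt(Y) for Y ~ Poisson(lambda). The sum is truncated far
  enough into the upper tail that the remaining mass is negligible."
  [lambda]
  (let [k-max (long (+ lambda (* 12 (Math/sqrt lambda)) 30))
        ;; log p(0) = -lambda; log p(k) = log p(k-1) + log(lambda) - log(k)
        log-ps (reductions (fn [lp k] (+ lp (Math/log lambda) (- (Math/log k))))
                           (- lambda)
                           (range 1 (inc k-max)))
        ps (mapv #(Math/exp %) log-ps)
        roots (mapv #(Math/sqrt %) (range (inc k-max)))
        mean-root (reduce + (map * ps roots))]
    (reduce + (map (fn [p r] (* p (Math/pow (- r mean-root) 2)))
                   ps roots))))

;; Pairs of [lambda, Var(sqrt Y)] for increasing lambda.
(mapv (fn [lambda] [lambda (poisson-sqrt-variance lambda)])
      [5.0 50.0 500.0])
```

The variance should move toward 0.25 as lambda grows, while the variance of Y itself (equal to lambda) grows a hundredfold over the same range.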

(def step1-sqrt (maldi/sqrt-transform intensities))

The dynamic range is compressed:

(let [raw-range [(dfn/reduce-min intensities) (dfn/reduce-max intensities)]
      sqrt-range [(dfn/reduce-min step1-sqrt) (dfn/reduce-max step1-sqrt)]]
  {:raw-range raw-range :sqrt-range sqrt-range})

{:raw-range [25.22105216303161 1055.9484728959605],
 :sqrt-range [5.022056567087991 32.49536079036453]}

Before and after

(-> step0-trimmed
    (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity
                  :=title "Before: raw intensities"})
    (plotly/layer-line {:=mark-opacity 0.7})
    plotly/plot)

(-> (tc/dataset {:mass masses :intensity step1-sqrt})
    (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity
                  :=title "After: square root transform"})
    (plotly/layer-line {:=mark-opacity 0.7})
    plotly/plot)

The raw spectrum peaks at m/z 2200, where the peak and the baseline beneath it add up to ~1050, dominating the plot. After the square root transform, the same point is at ~32, while the smaller peaks are relatively more visible.

Savitzky-Golay Smoothing

The problem

Even after variance stabilization, point-to-point fluctuations from instrument noise can create spurious local maxima that confuse peak detection.

The solution

The Savitzky-Golay filter fits a local polynomial of degree p to each window of 2m+1 consecutive points using least squares, then replaces the center point with the fitted value.

The key insight is that for each window size and polynomial degree, the coefficients can be precomputed — so the filter reduces to a simple convolution (weighted average). This makes it very efficient.

Parameters:

window-size (must be odd): larger windows smooth more aggressively but risk broadening peaks
polynomial-order: higher orders preserve peak shapes better but smooth less. Order 2 is a good default for mass spectrometry.

After smoothing, any negative values are clamped to 0 (intensities cannot be physically negative).

(Savitzky & Golay, 1964. Analytical Chemistry 36(8), 1627–1639; Gibb & Strimmer 2016, §11.2.3)
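To make the "precomputed coefficients" point concrete, here is a minimal sketch for one fixed case: window size 5, polynomial order 2, whose classic weights are (-3, 12, 17, 12, -3)/35. The names `sg-5-2-kernel` and `sg-smooth-5` are hypothetical, and the edge handling (copying the two outermost points on each side) is a simplifying assumption, not a description of how Ripple or MALDIquant treat edges.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(def sg-5-2-kernel
  "Savitzky-Golay smoothing weights for window size 5, polynomial order 2."
  (mapv #(/ % 35.0) [-3 12 17 12 -3]))

(defn sg-smooth-5
  "Applies the 5-point quadratic Savitzky-Golay kernel to the interior of
  `ys`. The two points at each edge are copied unchanged (an assumption
  made here for brevity)."
  [ys]
  (let [ys (vec ys)
        n (count ys)]
    (mapv (fn [i]
            (if (or (< i 2) (>= i (- n 2)))
              (ys i)
              ;; Weighted average of the five points centered on i.
              (reduce + (map * sg-5-2-kernel (subvec ys (- i 2) (+ i 3))))))
          (range n))))

;; A pure quadratic passes through unchanged (up to rounding), because
;; the local fit of a degree-2 polynomial to quadratic data is exact.
(sg-smooth-5 (mapv #(* % %) (range 10)))
```

The same kernel applied to noisy data averages out point-to-point fluctuations while leaving locally quadratic shapes, such as the top of a peak, nearly intact.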

Effect of window size

Compare three window sizes on the sqrt-transformed data:

(def sg-narrow
  (maldi/savitzky-golay-smooth step1-sqrt {:window-size 5 :polynomial-order 2}))

(def sg-default
  (maldi/savitzky-golay-smooth step1-sqrt {:window-size 11 :polynomial-order 2}))

(def sg-wide
  (maldi/savitzky-golay-smooth step1-sqrt {:window-size 31 :polynomial-order 2}))

Zoomed view around the sharpest peak (m/z 2750–2850):

(let [idx-start 750
      idx-end 850
      sub-masses (dtype/sub-buffer masses idx-start (- idx-end idx-start))
      sub-input (dtype/sub-buffer step1-sqrt idx-start (- idx-end idx-start))
      sub-narrow (dtype/sub-buffer sg-narrow idx-start (- idx-end idx-start))
      sub-default (dtype/sub-buffer sg-default idx-start (- idx-end idx-start))
      sub-wide (dtype/sub-buffer sg-wide idx-start (- idx-end idx-start))
      nm (count sub-masses)]
  (-> (tc/dataset {:mass (vec (concat sub-masses sub-masses sub-masses sub-masses))
                   :intensity (vec (concat sub-input sub-narrow sub-default sub-wide))
                   :series (vec (concat (repeat nm "input (sqrt)")
                                        (repeat nm "SG window=5")
                                        (repeat nm "SG window=11")
                                        (repeat nm "SG window=31")))})
      (plotly/base {:=width 800 :=height 350 :=x :mass :=y :intensity :=color :series})
      (plotly/layer-line {:=mark-opacity 0.8})
      plotly/plot))

Window=31 flattens the peak compared to window=5; the check below compares the peak heights:

(let [peak-narrow (dfn/reduce-max (dtype/sub-buffer sg-narrow 780 40))
      peak-wide (dfn/reduce-max (dtype/sub-buffer sg-wide 780 40))]
  {:peak-narrow peak-narrow :peak-wide peak-wide :broadened? (> peak-narrow peak-wide)})

{:peak-narrow 29.517019349442418,
 :peak-wide 26.595201945025458,
 :broadened? true}

With window=5, the noise is barely reduced. With window=31, the peak is noticeably broadened and shortened. Window=11 is a reasonable compromise.

We proceed with window-size=11:

(def step2-smooth sg-default)

(count step2-smooth)

All values remain non-negative (MALDIquant clamps negatives to 0.0):

(every? #(>= % 0.0) step2-smooth)

true

SNIP Baseline Removal

The problem

In MALDI-TOF, the chemical matrix (the substance that helps ionize the sample) produces a broad background signal called the baseline. Peaks sit on top of this baseline, so the measured intensity at a peak includes both the analyte signal and the baseline contribution.

Without baseline correction, peak intensities are unreliable — a small peak on a high baseline could appear stronger than a large peak on a low baseline.

The SNIP algorithm

SNIP (Statistics-sensitive Non-linear Iterative Peak-clipping) estimates the baseline by iteratively “eroding” peaks:

For each iteration with window half-width w, at each point i:

baseline[i] = min(baseline[i], (baseline[i-w] + baseline[i+w]) / 2)

This compares each point to the linear interpolation of its neighbors w steps away. If the point is higher (i.e., it’s part of a peak), it gets clipped down to the interpolated value.

After iterating from large w down to 1 (MALDIquant’s default decreasing order), broad features are removed first, then progressively finer ones. What remains is the slow-varying baseline estimate.

The corrected spectrum is: corrected = original - baseline

Parameters:

iterations: controls the maximum window width (and thus the broadest feature that can be removed). Default: 25.
decreasing: if true (default), iterate from large to small windows.

(Ryan, C.G. et al., 1988. Nuclear Instruments and Methods in Physics Research B34, 396–402; Gibb & Strimmer 2016, §11.2.4)
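The clipping rule above can be sketched in a few lines of plain Clojure. This is a simplified illustration, not Ripple's or MALDIquant's code: the name `snip-baseline` is hypothetical, points closer than w to either edge are simply left unchanged, and each pass reads only the values produced by the previous pass.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn snip-baseline
  "Estimates a baseline by iterative peak clipping with half-widths from
  `iterations` down to 1. Points closer than w to an edge are left as-is."
  [ys iterations]
  (let [n (count ys)]
    (reduce (fn [baseline w]
              ;; One pass: clip each point to the chord between its
              ;; neighbors w steps away, if the chord lies lower.
              (mapv (fn [i]
                      (if (and (>= i w) (< (+ i w) n))
                        (min (baseline i)
                             (/ (+ (baseline (- i w)) (baseline (+ i w))) 2.0))
                        (baseline i)))
                    (range n)))
            (vec ys)
            (range iterations 0 -1))))

;; A linear ramp with a narrow three-point peak on top of it.
(def ramp (mapv #(* 0.5 %) (range 101)))
(def spiky (-> ramp
               (update 49 + 5.0)
               (update 50 + 10.0)
               (update 51 + 5.0)))

;; The peak is clipped away and the estimate recovers the ramp.
(= ramp (snip-baseline spiky 5))
```

Subtracting the estimate from `spiky` then leaves only the three-point peak, which is the `corrected = original - baseline` step described above.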

Visualizing the baseline estimate

To show the estimated baseline separately, we compute it by subtracting the corrected spectrum from the input:

(def step3-corrected
  (maldi/snip-baseline-removal step2-smooth {:iterations 25}))

(def estimated-baseline
  (dfn/- step2-smooth step3-corrected))

(let [n (count masses)]
  (-> (tc/dataset {:mass (vec (concat masses masses masses))
                   :intensity (vec (concat step2-smooth estimated-baseline step3-corrected))
                   :series (vec (concat (repeat n "input (smoothed)")
                                        (repeat n "estimated baseline")
                                        (repeat n "corrected")))})
      (plotly/base {:=width 800 :=height 350 :=x :mass :=y :intensity :=color :series})
      (plotly/layer-line {:=mark-opacity 0.7})
      plotly/plot))

The estimated baseline follows the broad decay. The corrected spectrum has the baseline removed, leaving only peaks and residual noise near zero.

Effect of iterations parameter

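The iterations parameter sets the widest clipping window. With fewer iterations the baseline estimate hugs the signal more closely and absorbs part of the broader peaks; with more iterations, wider features are clipped and the estimate becomes smoother and lower: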

;; Baseline-corrected spectra (not baselines) for two other settings.
(def corrected-10
  (maldi/snip-baseline-removal step2-smooth {:iterations 10}))

(def corrected-50
  (maldi/snip-baseline-removal step2-smooth {:iterations 50}))

(let [n (count masses)]
  (-> (tc/dataset {:mass (vec (concat masses masses masses))
                   :intensity (vec (concat corrected-10 step3-corrected corrected-50))
                   :series (vec (concat (repeat n "iterations=10")
                                        (repeat n "iterations=25 (default)")
                                        (repeat n "iterations=50")))})
      (plotly/base {:=width 800 :=height 350 :=x :mass :=y :intensity :=color :series})
      (plotly/layer-line {:=mark-opacity 0.7})
      plotly/plot))

The maximum of the corrected spectrum changes more between iterations=10 and 25 than between 25 and 50:

(let [max-10 (dfn/reduce-max corrected-10)
      max-25 (dfn/reduce-max step3-corrected)
      max-50 (dfn/reduce-max corrected-50)
      diff-10-25 (Math/abs (- max-25 max-10))
      diff-50-25 (Math/abs (- max-50 max-25))]
  {:converged? (< diff-50-25 diff-10-25)})

{:converged? true}

With 10 iterations, the baseline estimate sits higher and follows the broader peaks, so part of their height is subtracted along with the background. With 50, the result is similar to 25 because the baseline was already well-estimated.

Verify the corrected spectrum has near-zero background:

(let [region-vals (dtype/sub-buffer step3-corrected 900 50)
      mean-val (/ (dfn/sum region-vals) (count region-vals))]
  {:background-region-mean mean-val
   :near-zero? (< (Math/abs mean-val) 2.0)})

{:background-region-mean 0.5987598002048962, :near-zero? true}

TIC Normalization

The problem

When comparing spectra across different samples, measurements, or instruments, the total signal varies due to:

Different amounts of sample deposited
Instrument sensitivity drift over time
Variations in laser energy

Without normalization, two identical samples could produce spectra with very different absolute intensities.

The solution: Total Ion Current (TIC)

TIC normalization scales each spectrum so that its total area under the curve equals a target value (typically 1.0).

The area is computed using the trapezoid rule (not a simple sum), because mass points may not be uniformly spaced:

A = Σ 0.5 × (m[i+1] - m[i]) × (I[i] + I[i+1])

Then all intensities are multiplied by (target / A).

Note: Unlike the other functions which take intensities only, tic-normalize also requires the mass array — it needs the m/z spacings to compute the trapezoid area correctly.

(Gibb & Strimmer 2016, §11.2.5)

(def step4-normalized
  (maldi/tic-normalize masses step3-corrected {:target-area 1.0}))

Verify the area is now 1.0:

(defn trapezoid-area
  "Compute area under curve using trapezoid rule."
  [masses intensities]
  (dfn/sum (dfn/* (dfn/- (dfn/shift masses -1) masses)
                  (dfn/+ intensities (dfn/shift intensities -1))
                  0.5)))

(def area-after-tic (trapezoid-area masses step4-normalized))

area-after-tic

1.0

Why trapezoid, not sum?

A simple sum of intensities ignores the spacing between mass points. If some regions have denser measurements, they would be over-weighted. The trapezoid rule weights each interval by its width, correctly computing the integral regardless of spacing.
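A small worked example makes the difference visible. The sketch below is plain Clojure and independent of Ripple; `trapezoid-area-plain` is a hypothetical helper that applies the formula above to ordinary sequences. The same constant signal of height 2.0 over [0, 10] is sampled once coarsely and once densely.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn trapezoid-area-plain
  "Trapezoid-rule area for paired sequences of x positions and y values."
  [xs ys]
  (reduce + (map (fn [x0 x1 y0 y1] (* 0.5 (- x1 x0) (+ y0 y1)))
                 xs (rest xs) ys (rest ys))))

(def xs-coarse [0.0 2.5 5.0 7.5 10.0])      ; 5 samples
(def xs-fine (mapv double (range 11)))      ; 11 samples
(def ys-coarse (vec (repeat 5 2.0)))
(def ys-fine (vec (repeat 11 2.0)))

{:sum-coarse (reduce + ys-coarse)                      ; 10.0
 :sum-fine (reduce + ys-fine)                          ; 22.0
 :area-coarse (trapezoid-area-plain xs-coarse ys-coarse) ; 20.0
 :area-fine (trapezoid-area-plain xs-fine ys-fine)}      ; 20.0
```

The plain sum more than doubles when the sampling density changes, although the underlying signal is identical; the trapezoid area is 20.0 in both cases.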

Visualizing normalization

We simulate two “samples” — same peaks but different overall intensities:

(let [scale-factor 3.5
      spectrum-a step3-corrected
      spectrum-b (dfn/* step3-corrected scale-factor)
      s 150 e 300
      sub-masses (dtype/sub-buffer masses s (- e s))
      nm (count sub-masses)]
  (-> (tc/dataset {:mass (vec (concat sub-masses sub-masses))
                   :intensity (vec (concat (dtype/sub-buffer spectrum-a s (- e s))
                                           (dtype/sub-buffer spectrum-b s (- e s))))
                   :series (vec (concat (repeat nm "sample A")
                                        (repeat nm "sample B (3.5x)")))})
      (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity :=color :series
                    :=title "Before normalization: different total intensities"})
      (plotly/layer-line {:=mark-opacity 0.7})
      plotly/plot))

(let [scale-factor 3.5
      spectrum-a step3-corrected
      spectrum-b (dfn/* step3-corrected scale-factor)
      norm-a (maldi/tic-normalize masses spectrum-a {:target-area 1.0})
      norm-b (maldi/tic-normalize masses spectrum-b {:target-area 1.0})
      s 150 e 300
      sub-masses (dtype/sub-buffer masses s (- e s))
      nm (count sub-masses)]
  (-> (tc/dataset {:mass (vec (concat sub-masses sub-masses))
                   :intensity (vec (concat (dtype/sub-buffer norm-a s (- e s))
                                           (dtype/sub-buffer norm-b s (- e s))))
                   :series (vec (concat (repeat nm "sample A (normalized)")
                                        (repeat nm "sample B (normalized)")))})
      (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity :=color :series
                    :=title "After normalization: identical shapes, same scale"})
      (plotly/layer-line {:=mark-opacity 0.7})
      plotly/plot))

Before normalization, sample B dominates at 3.5x the intensity. After normalization, the two samples coincide to within floating-point error:

(let [norm-a (maldi/tic-normalize masses step3-corrected {:target-area 1.0})
      norm-b (maldi/tic-normalize masses (dfn/* step3-corrected 3.5) {:target-area 1.0})]
  (< (dfn/reduce-max (dfn/abs (dfn/- norm-a norm-b))) 1e-10))

true

Peak Detection

Peak detection identifies the biologically relevant signals in a preprocessed spectrum. Ripple implements MALDIquant’s three-step approach.

Step 1: Local maxima detection

A point is a local maximum if it’s the highest value in a sliding window of size 2w+1 centered on it:

is_peak[i] = (intensity[i] == max(intensity[i-w .. i+w]))

The parameter half-window-size (w) controls sensitivity: a larger window requires peaks to dominate a wider region, which filters out small noise fluctuations but may miss closely-spaced peaks.
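The sliding-window rule can be sketched directly on a Clojure vector. This is an illustration only: `local-maxima-indices` is a hypothetical name, the window is truncated at the edges of the spectrum (an assumption), and every point of a flat plateau would be flagged, which a production implementation may handle differently.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn local-maxima-indices
  "Indices i where (ys i) equals the maximum over the window [i-w, i+w].
  The window is truncated at the edges of the sequence."
  [ys w]
  (let [ys (vec ys)
        n (count ys)]
    (filterv (fn [i]
               (let [lo (max 0 (- i w))
                     hi (min n (+ i w 1))]   ; exclusive upper bound
                 (= (ys i) (apply max (subvec ys lo hi)))))
             (range n))))

(def toy-signal [0 1 3 1 0 2 5 2 0 1 0])

(local-maxima-indices toy-signal 1) ;; => [2 6 9]
(local-maxima-indices toy-signal 2) ;; => [2 6]
```

Widening the window from w=1 to w=2 drops the small bump at index 9, because it no longer dominates its neighborhood.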

(def local-max-result
  (maldi/find-local-maxima-logical step4-normalized {:half-window-size 20}))

(def local-max-indices
  (vec (for [i (range (count step4-normalized))
             :when (aget local-max-result i)]
         i)))

(count local-max-indices)

Step 2: MAD noise estimation

The Median Absolute Deviation (MAD) provides a robust noise estimate:

noise = median(|x - median(x)|) × 1.4826

The constant 1.4826 makes MAD equivalent to the standard deviation for normally distributed data: 1/Φ⁻¹(3/4) ≈ 1.4826, where Φ⁻¹ is the inverse normal CDF.

MAD can be computed globally (one noise estimate for the entire spectrum) or locally (a sliding window, giving a noise estimate at each point). Local estimation adapts to varying noise levels across the spectrum.

(Gibb & Strimmer 2016, §11.2.6)
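The global form of the estimate is short enough to write out. The sketch below is plain Clojure with hypothetical helper names (`median-of`, `mad-noise`); it illustrates the formula only and is not a description of Ripple's internals.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn median-of
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n (count sorted)
        mid (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2.0))))

(defn mad-noise
  "Scaled median absolute deviation: 1.4826 * median(|x - median(x)|)."
  [xs]
  (let [m (median-of xs)]
    (* 1.4826 (median-of (map #(Math/abs (double (- % m))) xs)))))

;; One extreme value (100) barely affects the estimate:
;; median = 3, absolute deviations = 2 1 0 1 97, their median = 1.
(mad-noise [1 2 3 4 100]) ;; => 1.4826
```

This robustness is the reason MAD suits spectra: the peaks themselves act as outliers and would inflate a standard-deviation-based noise estimate.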

(def global-noise (maldi/estimate-noise-mad step4-normalized {}))

global-noise

2.8385269208683957E-4

Step 3: SNR filtering

A local maximum is accepted as a peak only if its intensity exceeds the noise level by a factor of at least SNR (Signal-to-Noise Ratio):

keep peak if: intensity > SNR × noise

Note: when noise = 0 (perfectly flat signal), the threshold is 0, so all positive local maxima pass. This matches MALDIquant’s behavior.
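The thresholding step, including the zero-noise case from the note, can be sketched as follows. The function `filter-by-snr` and its argument order are hypothetical; it takes candidate indices (for example from a local-maxima search) and a single global noise value.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn filter-by-snr
  "Keeps the candidate indices whose intensity exceeds snr * noise."
  [intensities candidate-indices noise snr]
  (let [threshold (* snr noise)]
    (filterv #(> (nth intensities %) threshold) candidate-indices)))

(def toy-intensities [0.0 5.0 0.5 3.0 0.2])
(def toy-candidates [1 3 4])

;; noise = 1.0, SNR = 2: threshold 2.0 rejects the weak candidate.
(filter-by-snr toy-intensities toy-candidates 1.0 2) ;; => [1 3]

;; noise = 0.0: threshold 0.0, every positive candidate passes.
(filter-by-snr toy-intensities toy-candidates 0.0 2) ;; => [1 3 4]
```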

The complete peak detection pipeline

(def peaks
  (maldi/detect-peaks step4-normalized {:half-window-size 20 :snr 2}))

(count peaks)

All five Gaussian peaks should be among the detected peaks:

(let [peak-masses (mapv #(masses %) peaks)
      expected [2200 2500 2800 3200 3600]]
  (every? (fn [exp] (some #(< (Math/abs (- % exp)) 30) peak-masses))
          expected))

true

Visualizing detected peaks

(let [peak-masses (mapv #(masses %) peaks)
      peak-intensities (mapv #(step4-normalized %) peaks)
      n (count masses)
      np (count peaks)]
  (-> (tc/dataset {:mass (vec (concat masses peak-masses))
                   :intensity (vec (concat step4-normalized peak-intensities))
                   :series (vec (concat (repeat n "spectrum")
                                        (repeat np "detected peaks")))})
      (plotly/base {:=width 800 :=height 350 :=x :mass :=y :intensity :=color :series})
      (plotly/layer-line {:=mark-opacity 0.5})
      (plotly/layer-point {:=mark-size 8 :=mark-opacity 0.9})
      plotly/plot))

The detected peaks include the five Gaussian peaks we placed in the synthetic spectrum. At SNR=2, some local maxima of the residual noise also pass the threshold, so more than five points are marked; a stricter threshold removes them.

Effect of SNR threshold

(def peaks-snr1 (maldi/detect-peaks step4-normalized {:half-window-size 20 :snr 1}))

(def peaks-snr3 (maldi/detect-peaks step4-normalized {:half-window-size 20 :snr 3}))

(def peaks-snr5 (maldi/detect-peaks step4-normalized {:half-window-size 20 :snr 5}))

(kind/table
 {:column-names ["SNR Threshold" "Peaks Found"]
  :row-vectors [["1 (permissive)" (count peaks-snr1)]
                ["2 (default)" (count peaks)]
                ["3 (moderate)" (count peaks-snr3)]
                ["5 (strict)" (count peaks-snr5)]]})

SNR Threshold	Peaks Found
1 (permissive)	47
2 (default)	27
3 (moderate)	5
5 (strict)	5

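Lower SNR thresholds accept more peaks (including potential noise), while higher thresholds are more selective. Here, thresholds of 3 and 5 both leave five peaks, matching the number of Gaussian peaks in the synthetic spectrum.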

(let [counts [(count peaks-snr1) (count peaks) (count peaks-snr3) (count peaks-snr5)]]
  (apply >= counts))

true

Spectral Binning

The problem

Different spectra may have different numbers of mass points and different m/z positions. Machine learning algorithms require fixed-length feature vectors — every sample must be represented by the same number of features.

The solution

Partition the m/z axis into equal-width bins and sum the intensity of all measurements falling into each bin:

bin_index = floor((mass - min) / step)

N_bins = floor((max - min) / step)

The DRIAMS paper (Weis et al. 2022) uses:

Range: [2000, 20000] Da
Step: 3 Da
Result: 6000 bins

(Weis et al., 2022)
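The two formulas translate into a short plain-Clojure sketch. The names `n-bins-for` and `bin-sums` are hypothetical, and ignoring masses that fall outside the binned span is an assumption made for this illustration, not a statement about how Ripple treats them.

```clojure
;; Illustrative sketch (not Ripple's implementation).
(defn n-bins-for
  "Number of whole bins of width `step` that fit into [lo, hi]."
  [[lo hi] step]
  (long (Math/floor (/ (- hi lo) (double step)))))

(defn bin-sums
  "Sums intensities into equal-width bins starting at lo. Masses that map
  outside the bin range are ignored (an assumption of this sketch)."
  [masses intensities [lo hi] step]
  (let [n (n-bins-for [lo hi] step)]
    (reduce (fn [bins [m y]]
              (let [idx (long (Math/floor (/ (- m lo) (double step))))]
                (if (< -1 idx n)
                  (update bins idx + y)
                  bins)))
            (vec (repeat n 0.0))
            (map vector masses intensities))))

(n-bins-for [2000 20000] 3) ;; => 6000
(n-bins-for [2000 4000] 3)  ;; => 666

;; Three points land in bin 0, one in bin 1, and 1999.0 is out of range.
(bin-sums [2000.0 2001.0 2002.9 2003.0 1999.0]
          [1.0 2.0 3.0 4.0 5.0]
          [2000 2009] 3)    ;; => [6.0 4.0 0.0]
```

Because every spectrum is reduced to the same number of bins, the resulting vectors can be stacked into a sample-by-feature matrix regardless of how many raw points each spectrum had.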

Example: binning a preprocessed spectrum

(def preprocessed-ds
  (tc/dataset {:mass masses :intensity step4-normalized}))

(def binning-params {:range [2000 4000] :step 3})

(def n-bins (maldi/calculate-n-bins binning-params))

n-bins

(def binned (maldi/bin-spectrum preprocessed-ds binning-params))

(count binned)

Visualizing the binned spectrum

Compare the continuous spectrum (2001 points) with the binned version (666 bins), zoomed to a peak region:

(let [[bin-min _] (:range binning-params)
      bin-step (:step binning-params)
      bin-centers (mapv #(+ bin-min (* (+ % 0.5) bin-step)) (range n-bins))
      ;; Zoom to the first peak region
      zoom-mass-min 2150
      zoom-mass-max 2280
      cont-idx (filterv #(and (>= (masses %) zoom-mass-min)
                              (<= (masses %) zoom-mass-max))
                        (range (count masses)))
      cont-masses (mapv #(masses %) cont-idx)
      cont-intens (mapv #(step4-normalized %) cont-idx)
      bin-idx (filterv #(and (>= (nth bin-centers %) zoom-mass-min)
                             (<= (nth bin-centers %) zoom-mass-max))
                       (range n-bins))
      bin-masses (mapv #(nth bin-centers %) bin-idx)
      bin-intens (mapv #(aget binned %) bin-idx)
      nc (count cont-masses)
      nb (count bin-masses)]
  (-> (tc/dataset {:mass (vec (concat cont-masses bin-masses))
                   :intensity (vec (concat cont-intens bin-intens))
                   :series (vec (concat (repeat nc "continuous (1 Da spacing)")
                                        (repeat nb "binned (3 Da bins)")))})
      (plotly/base {:=width 800 :=height 350 :=x :mass :=y :intensity :=color :series})
      (plotly/layer-line {:=mark-opacity 0.7})
      (plotly/layer-point {:=mark-size 4 :=mark-opacity 0.5})
      plotly/plot))

The binned version preserves the overall peak shape while reducing the data from 2001 points to 666 fixed-width bins.

The Complete Pipeline

preprocess-spectrum-data combines the square root transform, smoothing, baseline removal, and TIC normalization into a single function call. Trim, peak detection, and binning are separate steps because they serve different upstream/downstream purposes.

(def fully-preprocessed
  (maldi/preprocess-spectrum-data
   step0-trimmed
   {:should-sqrt-transform true
    :smooth-window 11
    :smooth-polynomial 2
    :baseline-iterations 25
    :should-tic-normalize true
    :tic-target 1.0}))

(tc/row-count fully-preprocessed)

Before and after: full pipeline

(-> step0-trimmed
    (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity
                  :=title "Before: trimmed raw spectrum"})
    (plotly/layer-line {:=mark-opacity 0.6})
    plotly/plot)

(-> fully-preprocessed
    (plotly/base {:=width 800 :=height 250 :=x :mass :=y :intensity
                  :=title "After: fully preprocessed"})
    (plotly/layer-line {:=mark-opacity 0.6})
    plotly/plot)

The raw spectrum has a decaying baseline, noise, and intensities up to ~1050. After preprocessing, the background is near zero, the noise is reduced, and the normalized peaks are clearly defined.

source: notebooks/ripple_book/maldi_algorithms_explained.clj