6  Exploring the DRIAMS Dataset

The DRIAMS dataset (Database of Resistance against Antimicrobials with MALDI-TOF Mass Spectrometry) is a large-scale collection of MALDI-TOF mass spectra linked to antimicrobial resistance profiles, published by Weis et al. (2022).

The dataset contains spectra from four Swiss hospital sites, labeled DRIAMS-A through DRIAMS-D.

Each spectrum comes with species identification and, where available, antimicrobial susceptibility results (R/S/I) for dozens of antibiotics — making this an ideal dataset for applying Ripple’s MALDI preprocessing pipeline to real clinical data.

Download

  1. Go to the Dryad repository
  2. Download the four .tar.gz archives (DRIAMS-A through D)
  3. Extract so that site folders are under a DRIAMS/ directory

This notebook assumes the dataset is available at DRIAMS/ relative to the project root (e.g., via a symlink). We use only the raw/ and id/ subdirectories — Ripple handles preprocessing.

Note: The raw spectra in DRIAMS are individual .txt files (~20K lines each). We recommend gzipping them after extraction (for example, gzip DRIAMS-*/raw/*/*.txt, run from inside the DRIAMS/ directory) to save disk space: the dataset shrinks from ~60 GB to ~15 GB. This notebook assumes gzipped files (.txt.gz); tablecloth reads them transparently.
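
Before going further, it can help to check that the extracted layout matches what this notebook expects. The following is a minimal sketch that uses only clojure.java.io. The helper name count-spectrum-files is hypothetical, and the layout (site folder, then raw/, then one folder per year) is assumed from the paths used later in this chapter.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn count-spectrum-files
  "Count gzipped spectrum files under <root>/<site>/raw, grouped by the
  name of their parent folder (assumed to be the year).
  Hypothetical helper; returns an empty map if the directory is missing."
  [root site]
  (let [raw-dir (io/file root site "raw")]
    (->> (file-seq raw-dir)
         (filter #(.isFile ^java.io.File %))
         (filter #(str/ends-with? (.getName ^java.io.File %) ".txt.gz"))
         (map #(.getName (.getParentFile ^java.io.File %)))
         frequencies
         (into (sorted-map)))))
```

Calling (count-spectrum-files "DRIAMS" "DRIAMS-A") should return a map from year folder to file count. An empty map suggests the symlink or the gzip step is missing.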

(ns ripple-book.driams-exploration
  (:require
   ;; Ripple MALDI public API:
   [scicloj.ripple.maldi :as maldi]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Data Format

Raw spectra

Each spectrum is a gzipped text file with a 2-line header (original file path and UUID) followed by space-separated mass/intensity pairs — typically ~20,000 points spanning roughly 2,000–20,000 Da.
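
To inspect the header and the first data lines without loading the whole file, a small sketch along these lines may be useful. It relies only on the JDK's GZIPInputStream and clojure.java.io; the function name peek-gzipped-lines is our own and is not part of Ripple or tablecloth.

```clojure
(require '[clojure.java.io :as io])
(import '[java.util.zip GZIPInputStream])

(defn peek-gzipped-lines
  "Return the first n lines of a gzipped text file as a vector of strings."
  [path n]
  (with-open [rdr (-> path
                      io/input-stream
                      GZIPInputStream.
                      io/reader)]
    ;; Realize the lines eagerly so nothing is read after the reader closes.
    (into [] (take n) (line-seq rdr))))
```

For example, (peek-gzipped-lines path 4) on a spectrum file should show the two header lines followed by the first two mass/intensity pairs.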

Metadata

The id/ directory contains gzipped CSVs mapping each spectrum code to its species and antimicrobial resistance profile. Resistance is encoded as R (resistant), S (susceptible), or I (intermediate) for each tested antibiotic.
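
For a binary classification task, the three labels need to be collapsed into two classes. The sketch below treats R and I as the positive class. That grouping is an assumption made for illustration, not something this notebook prescribes, and the function name resistance->binary is hypothetical.

```clojure
(defn resistance->binary
  "Map an R/S/I label to 1 (resistant or intermediate) or 0 (susceptible).
  Returns nil for missing or unrecognized labels, e.g. untested antibiotics.
  Assumption: intermediate is grouped with resistant."
  [label]
  (case label
    ("R" "I") 1
    "S" 0
    nil))
```

Mapping this function over an antibiotic column gives a label vector in which nil marks spectra with no usable result for that antibiotic.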

Reading a Raw Spectrum

tablecloth can read gzipped text files directly. The two header lines are skipped automatically (they start with # and "). We just need to specify space as the separator and rename the columns.

(defn load-raw-spectrum
  "Load a raw DRIAMS spectrum from a gzipped text file.
  Returns a tablecloth dataset with :mass and :intensity columns."
  [path]
  (-> path
      (tc/dataset {:separator " "})
      (tc/rename-columns [:mass :intensity])))

Let’s load one spectrum from DRIAMS-A (2018):

(def test-spectrum-path
  "DRIAMS/DRIAMS-A/raw/2018/00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1.txt.gz")
(def raw-spectrum (load-raw-spectrum test-spectrum-path))
raw-spectrum

DRIAMS/DRIAMS-A/raw/2018/00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1.txt.gz [20813 2]:

:mass             :intensity
1959.84464940     565
1960.25874185     579
1960.67287815     621
1961.08705830     652
1961.50128232     697
1961.91555020     642
1962.32986194     714
1962.74421754     688
1963.15861699     657
1963.57306031     619
...               ...
20119.23572141    0
20120.57033330    5
20121.90498983    4
20123.23969102    0
20124.57443685    0
20125.90922734    1
20127.24406246    5
20128.57894224    0
20129.91386667    3
20131.24883574    3
20132.58384946    1

The spectrum has ~20,000 data points covering the mass range from about 1,960 to 20,133 Da:

{:rows (tc/row-count raw-spectrum)
 :mass-min (-> raw-spectrum :mass first)
 :mass-max (-> raw-spectrum :mass last)}
{:rows 20813, :mass-min 1959.84464940456, :mass-max 20132.5838494582}

Visualizing a raw spectrum

(-> raw-spectrum
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Raw MALDI-TOF spectrum — DRIAMS-A"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (a.u.)"})
    (plotly/layer-line)
    plotly/plot)

Preprocessing with Ripple

We apply the same DRIAMS preprocessing pipeline described in Weis et al. (2022): square root transformation, Savitzky-Golay smoothing, SNIP baseline removal, and TIC normalization.

(def preprocessed
  (maldi/preprocess-spectrum-data
   raw-spectrum
   {:should-sqrt-transform true
    :smooth-window 11
    :baseline-iterations 20
    :should-tic-normalize true}))
(-> preprocessed
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Preprocessed spectrum"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (normalized)"})
    (plotly/layer-line)
    plotly/plot)
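
To make two of these steps concrete, here is a minimal sketch of the square root transform and TIC normalization on plain vectors. These helpers are illustrative only and are not Ripple's implementation. TIC normalization is assumed here to mean dividing each intensity by the sum of all intensities.

```clojure
(defn sqrt-transform
  "Apply a square root to each intensity, which dampens large peaks."
  [intensities]
  (mapv #(Math/sqrt (double %)) intensities))

(defn tic-normalize
  "Divide each intensity by the total ion current (the sum of all
  intensities) so that the result sums to 1.
  Returns the input unchanged if the total is zero."
  [intensities]
  (let [total (reduce + 0.0 intensities)]
    (if (zero? total)
      (vec intensities)
      (mapv #(/ % total) intensities))))
```

After normalization, spectra with different overall signal strength become comparable, since each one sums to 1.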

Binning to 6,000 features

The DRIAMS paper uses fixed-width bins of 3 Da over the range [2000, 20000] Da, producing 6,000 features per spectrum — a fixed-length vector suitable for machine learning.

(def binned (maldi/bin-spectrum preprocessed {:range [2000 20000] :step 3}))
(count binned)
6000
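
The feature count follows from the range and step: (20000 - 2000) / 3 = 6000 bins. The sketch below shows one way fixed-width binning can work, by summing the intensities that fall into each bin. It is an illustration under that assumption and may differ from how maldi/bin-spectrum aggregates values. The function name bin-intensities and its option keys are hypothetical.

```clojure
(defn bin-intensities
  "Sum intensities into fixed-width bins over the half-open range [lo, hi).
  Points outside the range are dropped. Returns a vector of bin sums."
  [masses intensities {:keys [lo hi step]}]
  (let [n-bins (long (Math/ceil (/ (- hi lo) (double step))))]
    (reduce (fn [bins [m y]]
              (if (and (>= m lo) (< m hi))
                ;; Bin index is the integer part of (m - lo) / step.
                (update bins (long (quot (- m lo) step)) + y)
                bins))
            (vec (repeat n-bins 0.0))
            (map vector masses intensities))))
```

With {:lo 2000 :hi 20000 :step 3}, a point at 2001.5 Da lands in bin 0 and a point at 2003.0 Da lands in bin 1.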

Reading Metadata

The id/ directory contains species and resistance labels. Each year has a *_clean.csv.gz file mapping spectrum codes to species identification and R/S/I results.

(def metadata
  (-> (tc/dataset "DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz"
                  {:key-fn keyword})
      (tc/drop-columns [:column-0 (keyword "Unnamed: 0")])))

How many spectra and species are in DRIAMS-A 2018?

{:spectra (tc/row-count metadata)
 :species (-> metadata :species distinct count)}
{:spectra 30069, :species 794}

Most common species

(-> metadata
    (tc/group-by [:species])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:count] :desc)
    (tc/head 15)
    (plotly/base {:=x :count
                  :=y :species
                  :=title "Top 15 species — DRIAMS-A 2018"
                  :=x-title "Count"
                  :=y-title ""})
    (plotly/layer-bar)
    plotly/plot
    (assoc-in [:data 0 :orientation] :h)
    (assoc-in [:layout :margin :l] 200))

Linking spectra to metadata

The :code column in the metadata matches the spectrum filename (without the .txt.gz extension). This lets us look up the species and resistance profile for any spectrum.

(def test-code "00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1")
(-> metadata
    (tc/select-rows #(= test-code (:code %)))
    (tc/select-columns [:code :species]))

DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz [1 2]:

:code :species
00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1 Staphylococcus schleiferi
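
Rather than typing the code by hand, we can derive it from the file path. This is a small sketch; the helper name path->code is hypothetical, and it assumes spectrum files end in .txt.gz or .txt, as described above.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn path->code
  "Derive a spectrum code from a file path by dropping the directory
  and the .txt.gz (or .txt) extension."
  [path]
  (-> (io/file path)
      .getName
      (str/replace #"\.txt(\.gz)?$" "")))
```

Applied to test-spectrum-path, this yields the same string as test-code, so the metadata lookup can start from any spectrum file.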

source: notebooks/ripple_book/driams_exploration.clj