6 Exploring the DRIAMS Dataset
The DRIAMS dataset (Database of Resistance Information on Antimicrobials and MALDI-TOF Mass Spectra) is a large-scale collection of MALDI-TOF mass spectra linked to antimicrobial resistance profiles, published by Weis et al. (2022).
The dataset contains spectra from four Swiss hospital sites:
- DRIAMS-A — University Hospital of Basel (2015–2018, ~80K spectra)
- DRIAMS-B — Canton Hospital Basel-Land (2018, ~6K spectra)
- DRIAMS-C — Canton Hospital Aarau (2018, ~22K spectra)
- DRIAMS-D — Viollier AG laboratory (2018, ~76K spectra)
Each spectrum comes with species identification and, where available, antimicrobial susceptibility results (R/S/I) for dozens of antibiotics — making this an ideal dataset for applying Ripple’s MALDI preprocessing pipeline to real clinical data.
Download
- Go to the Dryad repository
- Download the four .tar.gz archives (DRIAMS-A through D)
- Extract so that site folders are under a DRIAMS/ directory
This notebook assumes the dataset is available at DRIAMS/ relative to the project root (e.g., via a symlink). We use only the raw/ and id/ subdirectories — Ripple handles preprocessing.
Note: The raw spectra in DRIAMS are individual .txt files (~20K lines each). We recommend gzipping them after extraction (gzip DRIAMS-*/raw/*/*.txt) to save disk space — the dataset shrinks from ~60 GB to ~15 GB. This notebook assumes gzipped files (.txt.gz); tablecloth reads them transparently.
(ns ripple-book.driams-exploration
  (:require
   ;; Ripple MALDI public API:
   [scicloj.ripple.maldi :as maldi]
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]))

Data Format
Raw spectra
Each spectrum is a gzipped text file with a 2-line header (original file path and UUID) followed by space-separated mass/intensity pairs — typically ~20,000 points spanning roughly 2,000–20,000 Da.
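To make the layout concrete, here is a minimal plain-Clojure sketch that parses spectrum text without tablecloth or Ripple. The `parse-spectrum-text` and `header-line?` helpers are hypothetical names, and the two header lines in `sample-text` are made up; the sketch assumes only that header lines start with `#` or `"` and that data rows are space-separated mass/intensity pairs.

```clojure
(require '[clojure.string :as str])

(def sample-text
  ;; Made-up header lines followed by three illustrative data rows.
  (str "# hypothetical/original/path\n"
       "\"hypothetical-header\"\n"
       "1959.84464940 565\n"
       "1960.25874185 579\n"
       "1960.67287815 621\n"))

(defn header-line?
  "True for lines this sketch treats as header lines (assumption:
  they start with # or a double quote)."
  [line]
  (or (str/starts-with? line "#")
      (str/starts-with? line "\"")))

(defn parse-spectrum-text
  "Parse spectrum text into a vector of {:mass :intensity} maps."
  [text]
  (->> (str/split-lines text)
       (remove str/blank?)
       (remove header-line?)
       (mapv (fn [line]
               (let [[mass intensity] (str/split line #" ")]
                 {:mass (Double/parseDouble mass)
                  :intensity (Double/parseDouble intensity)})))))

(parse-spectrum-text sample-text)
```

For the sample above this yields three rows, the first with `:mass` 1959.8446494 and `:intensity` 565.0.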
Metadata
The id/ directory contains gzipped CSVs mapping each spectrum code to its species and antimicrobial resistance profile. Resistance is encoded as R (resistant), S (susceptible), or I (intermediate) for each tested antibiotic.
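As a small illustration of working with this encoding, the sketch below tallies R/S/I labels for one antibiotic in plain Clojure. The rows, the `:Ciprofloxacin` column name, and the `resistance-counts` helper are hypothetical, and the sketch assumes that untested antibiotics appear as `nil`; the actual column names and missing-value encoding in DRIAMS may differ.

```clojure
(def example-rows
  ;; Hypothetical rows: codes, species, and labels are made up for
  ;; illustration and are not taken from DRIAMS.
  [{:code "spectrum-1" :species "Escherichia coli" :Ciprofloxacin "R"}
   {:code "spectrum-2" :species "Escherichia coli" :Ciprofloxacin "S"}
   {:code "spectrum-3" :species "Escherichia coli" :Ciprofloxacin "S"}
   {:code "spectrum-4" :species "Escherichia coli" :Ciprofloxacin nil}
   {:code "spectrum-5" :species "Escherichia coli" :Ciprofloxacin "I"}])

(defn resistance-counts
  "Count R/S/I labels for one antibiotic, ignoring untested (nil) entries."
  [rows antibiotic]
  (->> rows
       (keep antibiotic)
       frequencies))

(resistance-counts example-rows :Ciprofloxacin)
```

With the made-up rows above, the result is a map with one "R", two "S", and one "I".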
Reading a Raw Spectrum
tablecloth can read gzipped text files directly. The two header lines are skipped automatically (they start with # and "). We just need to specify space as the separator and rename the columns.
(defn load-raw-spectrum
  "Load a raw DRIAMS spectrum from a gzipped text file.
  Returns a tablecloth dataset with :mass and :intensity columns."
  [path]
  (-> path
      (tc/dataset {:separator " "})
      (tc/rename-columns [:mass :intensity])))

Let’s load one spectrum from DRIAMS-A (2018):
(def test-spectrum-path
  "DRIAMS/DRIAMS-A/raw/2018/00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1.txt.gz")

(def raw-spectrum (load-raw-spectrum test-spectrum-path))

raw-spectrum

DRIAMS/DRIAMS-A/raw/2018/00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1.txt.gz [20813 2]:
| :mass | :intensity |
|---|---|
| 1959.84464940 | 565 |
| 1960.25874185 | 579 |
| 1960.67287815 | 621 |
| 1961.08705830 | 652 |
| 1961.50128232 | 697 |
| 1961.91555020 | 642 |
| 1962.32986194 | 714 |
| 1962.74421754 | 688 |
| 1963.15861699 | 657 |
| 1963.57306031 | 619 |
| … | … |
| 20119.23572141 | 0 |
| 20120.57033330 | 5 |
| 20121.90498983 | 4 |
| 20123.23969102 | 0 |
| 20124.57443685 | 0 |
| 20125.90922734 | 1 |
| 20127.24406246 | 5 |
| 20128.57894224 | 0 |
| 20129.91386667 | 3 |
| 20131.24883574 | 3 |
| 20132.58384946 | 1 |
The spectrum has ~20,000 data points covering the mass range from about 1,960 to 20,133 Da:
{:rows (tc/row-count raw-spectrum)
 :mass-min (-> raw-spectrum :mass first)
 :mass-max (-> raw-spectrum :mass last)}

{:rows 20813, :mass-min 1959.84464940456, :mass-max 20132.5838494582}

Visualizing a raw spectrum
(-> raw-spectrum
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Raw MALDI-TOF spectrum — DRIAMS-A"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (a.u.)"})
    (plotly/layer-line)
    plotly/plot)

Preprocessing with Ripple
We apply the same DRIAMS preprocessing pipeline described in Weis et al. (2022): square root transformation, Savitzky-Golay smoothing, SNIP baseline removal, and TIC normalization.
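Two of these steps are simple enough to sketch in plain Clojure, which may help build intuition for what the pipeline does to the intensity values. This is not Ripple's implementation: `sqrt-transform` and `tic-normalize` are hypothetical helpers, TIC normalization is taken here to mean dividing by the sum of intensities (Ripple's exact convention may differ), and smoothing and baseline removal are omitted.

```clojure
(defn sqrt-transform
  "Square root of each intensity (variance stabilization)."
  [intensities]
  (mapv #(Math/sqrt (double %)) intensities))

(defn tic-normalize
  "Divide each intensity by the total ion current, assumed here to be
  the sum of all intensities. An all-zero input is returned unchanged."
  [intensities]
  (let [tic (reduce + 0.0 intensities)]
    (if (zero? tic)
      (vec intensities)
      (mapv #(/ % tic) intensities))))

;; Made-up intensities chosen so the arithmetic is easy to follow.
(def example-intensities [4 9 16 0 1])

(-> example-intensities
    sqrt-transform
    tic-normalize)
```

For the example input, the square roots are 2, 3, 4, 0, and 1, which sum to 10, so the normalized values are 0.2, 0.3, 0.4, 0.0, and 0.1.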
(def preprocessed
  (maldi/preprocess-spectrum-data
   raw-spectrum
   {:should-sqrt-transform true
    :smooth-window 11
    :baseline-iterations 20
    :should-tic-normalize true}))

(-> preprocessed
    (plotly/base {:=x :mass
                  :=y :intensity
                  :=title "Preprocessed spectrum"
                  :=x-title "m/z (Da)"
                  :=y-title "Intensity (normalized)"})
    (plotly/layer-line)
    plotly/plot)

Binning to 6,000 features
The DRIAMS paper uses fixed-width bins of 3 Da over the range [2000, 20000] Da, producing 6,000 features per spectrum — a fixed-length vector suitable for machine learning.
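To make the bin arithmetic concrete, here is a small plain-Clojure sketch of fixed-width binning. It is not Ripple's `bin-spectrum` implementation: the helper names, the half-open bin intervals, and the choice to sum intensities within a bin are assumptions made for illustration.

```clojure
(defn bin-index
  "Index of the bin containing `mass`, or nil if mass is outside [lo, hi)."
  [mass lo hi step]
  (when (and (<= lo mass) (< mass hi))
    (long (Math/floor (/ (- mass lo) (double step))))))

(defn bin-intensities
  "Sum intensities into fixed-width bins; returns a vector of bin totals."
  [masses intensities {:keys [lo hi step]}]
  (let [n-bins (long (Math/ceil (/ (- hi lo) (double step))))]
    (reduce (fn [bins [m i]]
              (if-let [idx (bin-index m lo hi step)]
                (update bins idx + i)
                bins))
            (vec (repeat n-bins 0.0))
            (map vector masses intensities))))

;; Made-up masses and intensities; 1999.5 falls outside the range.
(def example-bins
  (bin-intensities [1999.5 2000.0 2002.9 2003.0 19999.9]
                   [10.0 1.0 2.0 4.0 8.0]
                   {:lo 2000 :hi 20000 :step 3}))

(count example-bins)
```

The count is (20000 − 2000) / 3 = 6000. In the example, the first bin collects the points at 2000.0 and 2002.9 (total 3.0), the second bin holds the point at 2003.0, and the last bin holds the point at 19999.9.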
(def binned (maldi/bin-spectrum preprocessed {:range [2000 20000] :step 3}))

(count binned)

6000

Reading Metadata
The id/ directory contains species and resistance labels. Each year has a *_clean.csv.gz file mapping spectrum codes to species identification and R/S/I results.
(def metadata
  (-> (tc/dataset "DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz"
                  {:key-fn keyword})
      (tc/drop-columns [:column-0 (keyword "Unnamed: 0")])))

How many spectra and species are in DRIAMS-A 2018?
{:spectra (tc/row-count metadata)
 :species (-> metadata :species distinct count)}

{:spectra 30069, :species 794}

Most common species
(-> metadata
    (tc/group-by [:species])
    (tc/aggregate {:count tc/row-count})
    (tc/order-by [:count] :desc)
    (tc/head 15)
    (plotly/base {:=x :count
                  :=y :species
                  :=title "Top 15 species — DRIAMS-A 2018"
                  :=x-title "Count"
                  :=y-title ""})
    (plotly/layer-bar)
    plotly/plot
    (assoc-in [:data 0 :orientation] :h)
    (assoc-in [:layout :margin :l] 200))

Linking spectra to metadata
The :code column in the metadata matches the spectrum filename (without the .txt.gz extension). This lets us look up the species and resistance profile for any spectrum.
(def test-code "00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1")

(-> metadata
    (tc/select-rows #(= test-code (:code %)))
    (tc/select-columns [:code :species]))

DRIAMS/DRIAMS-A/id/2018/2018_clean.csv.gz [1 2]:
| :code | :species |
|---|---|
| 00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1 | Staphylococcus schleiferi |
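Going the other way, a spectrum code can be derived from a file path instead of being typed by hand. The `path->code` helper below is a hypothetical convenience, not part of Ripple; it assumes that the code is the file name with its `.txt` or `.txt.gz` extension removed.

```clojure
(require '[clojure.string :as str])

(defn path->code
  "Derive the spectrum code from a file path by dropping the directory
  and the .txt or .txt.gz extension."
  [path]
  (-> path
      (str/split #"/")
      last
      (str/replace #"\.txt(\.gz)?$" "")))

(path->code
 "DRIAMS/DRIAMS-A/raw/2018/00006690-1411-4a89-87cc-ab84678cc9fb_MALDI1.txt.gz")
```

For the path used in this chapter, the result is the same string as `test-code`, so it can be passed straight to the metadata lookup.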
References
- Weis, C., et al. (2022). Direct antimicrobial resistance prediction from clinical MALDI-TOF mass spectra using machine learning. Nature Medicine, 28, 164–174.