7 Exploring the WESAD Dataset

WESAD (Wearable Stress and Affect Detection) is a multimodal dataset of physiological and motion signals recorded with wearable sensors, published by Schmidt et al. (2018).

Fifteen subjects wore two devices simultaneously — a chest-worn RespiBAN and a wrist-worn Empatica E4 — while undergoing a protocol with three affective conditions:

Baseline — 20 minutes of neutral reading
Stress — Trier Social Stress Test (public speaking + mental arithmetic)
Amusement — watching funny video clips

Two additional conditions (meditation 1 and 2) were also recorded. The dataset contains ECG, EDA, EMG, respiration, temperature, accelerometry, and blood volume pulse (PPG) — making it an ideal testbed for Ripple’s cardio and (future) respiratory signal processing.

Download

Go to the UCI ML Repository page
Download the ZIP archive (~3.5 GB)
Extract so that subject folders (S2, S3, … S17) are under a WESAD/ directory

This notebook assumes the dataset is available at WESAD/ relative to the project root (e.g., via a symlink).

17 subjects participated, but S1 and S12 were discarded due to sensor malfunction, leaving 15 usable subjects (S2–S17, excluding S12).

(ns ripple-book.wesad-exploration
  (:require
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Column-level operations:
   [tablecloth.column.api :as tcc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]
   ;; Python interop (https://github.com/clj-python/libpython-clj):
   [libpython-clj2.python :as py :refer [py.]]
   [libpython-clj2.require :refer [require-python]]
   ;; Zero-copy numpy array support:
   [libpython-clj2.python.np-array]))

Devices and Signals

RespiBAN (chest, 700 Hz)

All RespiBAN channels are sampled at 700 Hz. The raw sensor values need conversion to SI units (vcc=3, chan_bit=2^16):

Channel	Signal	Unit	Conversion
CH1	ECG	mV	(signal/chan_bit − 0.5) × vcc
CH2	EDA	μS	(signal/chan_bit) × vcc / 0.12
CH3	EMG	mV	(signal/chan_bit − 0.5) × vcc
CH4	TEMP	°C	NTC thermistor formula (see readme)
CH5–CH7	ACC x/y/z	g	(signal − 28000) / (38000 − 28000) × 2 − 1
CH8	RESP	%	(signal/chan_bit − 0.5) × 100
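
The conversion formulas in the table can be sketched as plain Clojure functions. This is a minimal sketch, assuming the inputs are the raw integer values of the RespiBAN channels; the names vcc, chan-bit, and the raw->... functions are illustrative and not part of the dataset's tooling. TEMP is omitted because its formula is only given in the readme.

```clojure
;; Constants from the conversion table above.
(def vcc 3.0)
(def chan-bit (Math/pow 2 16)) ; 65536.0

(defn raw->ecg-mv
  "ECG (and EMG) in mV: (signal/chan_bit - 0.5) * vcc."
  [signal]
  (* (- (/ signal chan-bit) 0.5) vcc))

(defn raw->eda-us
  "EDA in microsiemens: (signal/chan_bit) * vcc / 0.12."
  [signal]
  (/ (* (/ signal chan-bit) vcc) 0.12))

(defn raw->acc-g
  "Acceleration in g, using the calibration bounds 28000 and 38000."
  [signal]
  (let [c-min 28000.0
        c-max 38000.0]
    (- (* 2.0 (/ (- signal c-min) (- c-max c-min))) 1.0)))

(defn raw->resp-percent
  "Respiration in %: (signal/chan_bit - 0.5) * 100."
  [signal]
  (* (- (/ signal chan-bit) 0.5) 100.0))
```

A mid-scale raw value (32768) maps to 0 for ECG, EMG, and RESP, and the two accelerometer calibration bounds map to -1 g and +1 g.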

Empatica E4 (wrist, variable rates)

Each signal is a separate CSV. The first row is a Unix timestamp (session start); the second row is the sampling rate in Hz. HR.csv and IBI.csv are derived from BVP, and tags.csv holds event marks; all three should be ignored in favor of the raw signals.

File	Signal	Rate	Unit
BVP.csv	Blood volume pulse (PPG)	64 Hz	—
EDA.csv	Electrodermal activity	4 Hz	μS
ACC.csv	3-axis accelerometer	32 Hz	1/64 g
TEMP.csv	Skin temperature	4 Hz	°C
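
The header layout described above can be read with a few lines of Clojure. This is a minimal sketch for the single-column files (BVP, EDA, TEMP), assuming the file content is already available as a string; parse-e4-csv and the sample values are hypothetical and only illustrate the layout.

```clojure
(require '[clojure.string :as str])

(defn parse-e4-csv
  "Parse the text of a single-column E4 CSV file.
  Row 1: session start (Unix timestamp, seconds).
  Row 2: sampling rate in Hz.
  Remaining rows: one sample per row."
  [text]
  (let [[start-row rate-row & sample-rows] (->> (str/split-lines text)
                                                (map str/trim)
                                                (remove str/blank?))
        start (Double/parseDouble start-row)
        rate (Double/parseDouble rate-row)]
    {:start start
     :rate rate
     :samples (mapv #(Double/parseDouble %) sample-rows)
     ;; Timestamp of sample i is start + i / rate.
     :timestamps (mapv #(+ start (/ % rate))
                       (range (count sample-rows)))}))

;; Made-up content in the E4 layout (not real WESAD data):
(def example-e4
  (parse-e4-csv "1495437325.0\n4.0\n0.10\n0.11\n0.12\n"))
```

ACC.csv has three columns per row, so it would need the rows split on commas before parsing.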

Synchronised pickle files

The RespiBAN and E4 record independently with different clocks. The authors synchronised them using a double-tap gesture visible in both accelerometers. The result is stored in SX.pkl — a Python dict containing all signals (already aligned) plus per-sample labels.

Label Encoding

Per-sample labels at 700 Hz (aligned with RespiBAN signals):

Label	Condition	Use
0	Not defined / transient	Ignore
1	Baseline	Main task
2	Stress	Main task
3	Amusement	Main task
4	Meditation	Optional
5, 6, 7	Study protocol phases	Ignore
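
In practice the table boils down to a lookup plus a filter. The following is a minimal sketch, assuming labels arrive as a plain sequence of integer codes; main-task-labels and keep-main-task are illustrative names, not part of the notebook.

```clojure
(def main-task-labels
  "Label codes used for the main classification task."
  {1 :baseline
   2 :stress
   3 :amusement})

(defn keep-main-task
  "Given parallel sequences of samples and label codes, keep only the
  samples labelled baseline, stress, or amusement, as [condition sample] pairs."
  [samples labels]
  (for [[sample label] (map vector samples labels)
        :let [condition (main-task-labels (long label))]
        :when condition]
    [condition sample]))

(keep-main-task [0.1 0.2 0.3 0.4 0.5] [0 1 2 4 3])
;; => ([:baseline 0.2] [:stress 0.3] [:amusement 0.5])
```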

Reading the Data

Each subject’s .pkl file is a Python pickle containing a nested dict with NumPy arrays for every signal channel plus per-sample labels. We use libpython-clj to load these directly. The key advantage: NumPy arrays become dtype-next buffers via zero-copy interop, so tablecloth can wrap them as dataset columns without any data duplication.

(require-python '[pickle :as pkl]
                '[builtins])

:ok

All RespiBAN signals are sampled at 700 Hz.

(def WESAD-sampling-rate 700)

Python’s pickle.load needs a binary file handle. We use builtins/open to get one, wrapped in py/with (the Clojure equivalent of Python’s with statement) to ensure the file is closed after reading.

(defn load-pickle [filename]
  (py/with [f (builtins/open filename "rb")]
           (pkl/load f :encoding "latin")))

The pickle’s structure is a nested dict:

{"signal" {"chest" {"ECG" <numpy (N,1)>, "EDA" <numpy (N,1)>, ...}
           "wrist" {"BVP" <numpy (M,1)>, ...}}
 "label"  <numpy (N,)>
 "subject" "SX"}

libpython-clj makes Python dicts work like Clojure maps, so we can navigate with get-in. Keyword access works too (:ECG matches the string key "ECG").

Both functions below are memoized so repeated calls for the same subject don’t re-read the large pickle files from disk.

(def labelled-data
  (memoize
   (fn [subject]
     (load-pickle (format "WESAD/S%d/S%d.pkl"
                          subject subject)))))

For this exploration we extract just the ECG signal and labels into a tablecloth dataset. ECG is one of many available signals (see the full list below) — the same approach works for any of them. We call .flatten() on the numpy array because its shape is (N,1) and we need (N,) for a flat column.

(def labelled-dataset
  (memoize
   (fn [subject]
     (let [ld (labelled-data subject)]
       (tc/dataset {:t (tcc/* (range)
                              (/ 1.0 WESAD-sampling-rate))
                    :ECG (-> ld
                             (get-in [:signal :chest :ECG])
                             (py. flatten))
                    :label (-> ld
                               (get :label))})))))

Exploring Subject S2

Let’s load subject S2 and see what we get.

(def ds (labelled-dataset 2))

The dataset has over 4 million rows — about 100 minutes of continuous recording at 700 Hz.

(tc/info ds :basic)

_unnamed :basic info [1 4]:

:name	:grouped?	:rows	:columns
_unnamed	false	4255300	3

Available signals

The pickle contains many more signals than just ECG. Here are all the channels available from each device:

(let [ld (labelled-data 2)]
  (kind/table
   {:column-names ["Device" "Signal" "Shape"]
    :row-vectors
    (concat
     (for [[k v] (get-in ld [:signal :chest])]
       ["RespiBAN (chest)" k (str (py/get-attr v "shape"))])
     (for [[k v] (get-in ld [:signal :wrist])]
       ["Empatica E4 (wrist)" k (str (py/get-attr v "shape"))]))}))

Device	Signal	Shape
RespiBAN (chest)	ACC	(4255300, 3)
RespiBAN (chest)	ECG	(4255300, 1)
RespiBAN (chest)	EMG	(4255300, 1)
RespiBAN (chest)	EDA	(4255300, 1)
RespiBAN (chest)	Temp	(4255300, 1)
RespiBAN (chest)	Resp	(4255300, 1)
Empatica E4 (wrist)	ACC	(194528, 3)
Empatica E4 (wrist)	BVP	(389056, 1)
Empatica E4 (wrist)	EDA	(24316, 1)
Empatica E4 (wrist)	TEMP	(24316, 1)

The chest signals all have the same number of rows (about 4.26 million samples at 700 Hz ≈ 101 minutes). The wrist signals have fewer rows because the E4 samples at lower rates (BVP at 64 Hz, ACC at 32 Hz, EDA and TEMP at 4 Hz).
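
Because the labels are stored at the chest rate, using them with a wrist signal requires mapping wrist sample indices to chest sample indices. The following is a minimal sketch, assuming both streams in the synchronised pickle start at the same instant (the matching durations suggest this, but this notebook does not verify it); wrist->chest-index is a hypothetical helper.

```clojure
(def chest-rate 700)

(defn wrist->chest-index
  "Index of the chest sample (and label) recorded at the same time as
  wrist sample i, for a wrist signal sampled at wrist-rate Hz."
  [i wrist-rate]
  ;; quot on integers gives floor(i * 700 / wrist-rate) without rounding error.
  (quot (* i chest-rate) wrist-rate))

;; The chest signals and the wrist BVP, ACC, and EDA/TEMP signals of S2
;; all span the same number of seconds (sample count / rate):
(map (fn [[n-samples rate]] (/ n-samples rate))
     [[4255300 700] [389056 64] [194528 32] [24316 4]])
;; => (6079 6079 6079 6079)
```

With this mapping, the label for EDA sample 1 (4 Hz) is the chest label at index 175, and the last BVP sample (index 389055) maps to chest index 4255289, which is still inside the label array.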

Label distribution

How much time does this subject spend in each condition?

(def label-names
  {0 "transient"
   1 "baseline"
   2 "stress"
   3 "amusement"
   4 "meditation"
   5 "other (5)"
   6 "other (6)"
   7 "other (7)"})

(let [freqs (frequencies (seq (:label ds)))]
  (kind/table
   {:column-names ["Label" "Condition" "Samples" "Duration"]
    :row-vectors
    (->> freqs
         (sort-by key)
         (mapv (fn [[label-code n]]
                 [(long label-code)
                  (get label-names (long label-code) "?")
                  n
                  (format "%.1f min" (/ n (double WESAD-sampling-rate) 60.0))])))}))

Label	Condition	Samples	Duration
0	transient	2142701	51.0 min
1	baseline	800800	19.1 min
2	stress	430500	10.3 min
3	amusement	253400	6.0 min
4	meditation	537599	12.8 min
6	other (6)	45500	1.1 min
7	other (7)	44800	1.1 min

Over half the recording is transient (label 0) — transitions between conditions, questionnaire filling, etc. The three main conditions (baseline, stress, amusement) together account for about 35 minutes.
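
To see where each condition starts and ends, rather than only its total duration, the label column can be collapsed into contiguous runs. The following is a minimal sketch on a plain sequence of label codes; label-runs is a hypothetical helper, and for the full 4-million-row column a buffer-based implementation would likely be faster.

```clojure
(defn label-runs
  "Collapse a sequence of label codes into contiguous runs.
  Returns a vector of maps with the label, the index of the run's first
  sample, and the run length in samples."
  [labels]
  (->> labels
       (partition-by identity)
       (reduce (fn [runs run]
                 (let [start (if-let [prev (peek runs)]
                               (+ (:start prev) (:n prev))
                               0)]
                   (conj runs {:label (first run)
                               :start start
                               :n (count run)})))
               [])))

(label-runs [0 0 1 1 1 0 2 2])
;; => [{:label 0, :start 0, :n 2}
;;     {:label 1, :start 2, :n 3}
;;     {:label 0, :start 5, :n 1}
;;     {:label 2, :start 6, :n 2}]
```

Dividing :start and :n by the sampling rate gives each block's onset and duration in seconds.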

ECG during different conditions

Let’s look at 5-second ECG excerpts from each of the three main conditions. We take the first 5 seconds of each condition block.

(defn ecg-snippet
  "Extract the first n-seconds of ECG samples carrying the given label.
  The excerpt is contiguous as long as the first block with that label
  lasts at least n-seconds."
  [ds label-code n-seconds]
  (let [rows (tc/select-rows ds #(= label-code (long (% :label))))
        n (* n-seconds WESAD-sampling-rate)]
    (-> rows
        (tc/select-rows (range n))
        ;; Time axis restarting at 0 for each snippet.
        (tc/add-column :seconds (tcc/* (range) (/ 1.0 WESAD-sampling-rate))))))

(-> (tc/concat
     (tc/add-column (ecg-snippet ds 1 5) :condition "baseline")
     (tc/add-column (ecg-snippet ds 2 5) :condition "stress")
     (tc/add-column (ecg-snippet ds 3 5) :condition "amusement"))
    (plotly/base {:=x :seconds
                  :=y :ECG
                  :=color :condition
                  :=title "5-second ECG excerpts — Subject S2"
                  :=x-title "Time (s)"
                  :=y-title "ECG (raw)"})
    (plotly/layer-line)
    plotly/plot)

The three excerpts differ visibly: the stress segment typically shows a higher heart rate (R-peaks spaced more closely) and a different amplitude than baseline and amusement. A single 5-second excerpt per condition is only illustrative of these differences.
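
One way to put a number on the heart-rate difference is to locate R-peaks and convert the mean beat-to-beat interval to beats per minute. The sketch below is a deliberately naive threshold-based peak picker, not an established QRS detector, and it is not part of the original notebook; r-peak-indices, mean-heart-rate-bpm, the threshold, and the synthetic signal are all assumptions made for illustration.

```clojure
(defn r-peak-indices
  "Indices of local maxima above threshold that are at least min-gap
  samples after the previously accepted peak."
  [signal threshold min-gap]
  (let [x (vec signal)]
    (reduce (fn [peaks i]
              (if (and (> (x i) threshold)
                       (>= (x i) (x (dec i)))
                       (> (x i) (x (inc i)))
                       (or (empty? peaks)
                           (>= (- i (peek peaks)) min-gap)))
                (conj peaks i)
                peaks))
            []
            (range 1 (dec (count x))))))

(defn mean-heart-rate-bpm
  "Mean heart rate in beats per minute from peak indices at sampling
  rate fs (Hz). Returns nil when fewer than two peaks are given."
  [peaks fs]
  (let [intervals (map - (rest peaks) peaks)] ; samples between beats
    (when (seq intervals)
      (/ (* 60.0 fs) (/ (reduce + intervals) (count intervals))))))

;; Synthetic 5-second signal at 700 Hz with one spike per second:
(def synthetic-ecg
  (vec (for [i (range 3500)]
         (if (= 350 (mod i 700)) 1.0 0.0))))

(def synthetic-peaks (r-peak-indices synthetic-ecg 0.5 200))
;; => [350 1050 1750 2450 3150]

(mean-heart-rate-bpm synthetic-peaks 700)
;; => 60.0
```

On real ECG the threshold and minimum gap would need tuning per subject, and a filtered signal would give more reliable peaks.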

References

Schmidt, P., Reiss, A., Duerichen, R., Marberger, C., & Van Laerhoven, K. (2018). Introducing WESAD, a Multimodal Dataset for Wearable Stress and Affect Detection. Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI), 400–408.

source: notebooks/ripple_book/wesad_exploration.clj