7 Exploring the WESAD Dataset
WESAD (Wearable Stress and Affect Detection) is a multimodal dataset for recognising stress and affective states from wearable sensors, published by Schmidt et al. (2018).
Fifteen subjects wore two devices simultaneously — a chest-worn RespiBAN and a wrist-worn Empatica E4 — while undergoing a protocol with three affective conditions:
- Baseline — 20 minutes of neutral reading
- Stress — Trier Social Stress Test (public speaking + mental arithmetic)
- Amusement — watching funny video clips
Two additional conditions (meditation 1 and 2) were also recorded. The dataset contains ECG, EDA, EMG, respiration, temperature, accelerometry, and blood volume pulse (PPG) — making it an ideal testbed for Ripple’s cardio and (future) respiratory signal processing.
Download
- Go to the UCI ML Repository page
- Download the ZIP archive (~3.5 GB)
- Extract so that subject folders (S2, S3, … S17) are under a WESAD/ directory
This notebook assumes the dataset is available at WESAD/ relative to the project root (e.g., via a symlink).
Seventeen subjects participated, but S1 and S12 were discarded due to sensor malfunction, leaving 15 usable subjects (S2–S17, excluding S12).
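As a small sketch of the subject numbering just described, the usable IDs and their pickle locations can be enumerated in plain Clojure. The names `usable-subjects` and `pickle-path` are illustrative, and the path layout assumes each subject folder holds a file named after the subject (for example `WESAD/S2/S2.pkl`), as the loading code later in this notebook expects.

```clojure
;; Sketch only: `usable-subjects` and `pickle-path` are illustrative names.

(def usable-subjects
  "Subject IDs 2..17 with 12 removed (S1 was discarded as well)."
  (vec (remove #{12} (range 2 18))))

(defn pickle-path
  "Path of a subject's synchronised pickle, relative to the project root.
  Assumes the WESAD/ directory layout described above."
  [subject]
  (format "WESAD/S%d/S%d.pkl" subject subject))

(count usable-subjects)               ;; => 15
(pickle-path (first usable-subjects)) ;; => "WESAD/S2/S2.pkl"
```

Keeping the excluded ID in one place makes it easy to loop over every subject later without hitting a missing folder.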
(ns ripple-book.wesad-exploration
  (:require
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Column-level operations:
   [tablecloth.column.api :as tcc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]
   ;; Python interop (https://github.com/clj-python/libpython-clj):
   [libpython-clj2.python :as py :refer [py.]]
   [libpython-clj2.require :refer [require-python]]
   ;; Zero-copy numpy array support:
   [libpython-clj2.python.np-array]))

Devices and Signals
RespiBAN (chest, 700 Hz)
All RespiBAN channels are sampled at 700 Hz. The raw sensor values need conversion to physical units (vcc = 3, chan_bit = 2^16):
| Channel | Signal | Unit | Conversion |
|---|---|---|---|
| CH1 | ECG | mV | (signal/chan_bit − 0.5) × vcc |
| CH2 | EDA | μS | (signal/chan_bit) × vcc / 0.12 |
| CH3 | EMG | mV | (signal/chan_bit − 0.5) × vcc |
| CH4 | TEMP | °C | NTC thermistor formula (see readme) |
| CH5–CH7 | ACC x/y/z | g | (signal − 28000) / (38000 − 28000) × 2 − 1 |
| CH8 | RESP | % | (signal/chan_bit − 0.5) × 100 |
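The conversion formulas in the table translate directly into small functions. The following is a minimal sketch, not code from the dataset authors: the function names are made up for illustration, and the temperature channel is left out because its NTC formula is only given in the dataset readme.

```clojure
;; Sketch of the RespiBAN conversions from the table above.
;; Function names are illustrative; constants come from the table.

(def vcc 3.0)
(def chan-bit 65536.0) ; 2^16

(defn raw->ecg-mv
  "ECG (and EMG) in mV: (signal/chan_bit - 0.5) * vcc."
  [signal]
  (* (- (/ signal chan-bit) 0.5) vcc))

(def raw->emg-mv raw->ecg-mv) ; the table lists the same formula for EMG

(defn raw->eda-us
  "EDA in microsiemens: (signal/chan_bit) * vcc / 0.12."
  [signal]
  (/ (* (/ signal chan-bit) vcc) 0.12))

(defn raw->acc-g
  "One accelerometer axis in g: (signal - 28000) / (38000 - 28000) * 2 - 1."
  [signal]
  (- (* 2.0 (/ (- signal 28000.0) (- 38000.0 28000.0))) 1.0))

(defn raw->resp-pct
  "Respiration in percent: (signal/chan_bit - 0.5) * 100."
  [signal]
  (* (- (/ signal chan-bit) 0.5) 100.0))

(raw->ecg-mv 32768) ;; => 0.0  (mid-scale reading)
(raw->acc-g 38000)  ;; => 1.0  (upper calibration point)
```

A mid-scale ADC reading maps to zero for the bipolar channels (ECG, EMG, RESP), which is a quick sanity check on the offset term.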
Empatica E4 (wrist, variable rates)
Each signal is a separate CSV. The first row is a Unix timestamp (session start); the second row is the sampling rate in Hz. HR.csv and IBI.csv are derived from BVP, and tags.csv holds event markers; all three should be ignored in favor of the raw signals.
| File | Signal | Rate | Unit |
|---|---|---|---|
| BVP.csv | Blood volume pulse (PPG) | 64 Hz | — |
| EDA.csv | Electrodermal activity | 4 Hz | μS |
| ACC.csv | 3-axis accelerometer | 32 Hz | 1/64 g |
| TEMP.csv | Skin temperature | 4 Hz | °C |
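The two-row header described above (start timestamp, then sampling rate) is easy to handle by hand. The sketch below parses the text of such a file in plain Clojure; `parse-e4-csv`, `sample-times`, and the tiny example string are all made up for illustration, and the timestamps assume uniform sampling from the session start.

```clojure
(require '[clojure.string :as str])

(defn parse-e4-csv
  "Parse the text of an E4 signal CSV: row 1 is the Unix start time,
  row 2 the sampling rate in Hz, and the remaining rows are samples.
  Each sample is returned as a vector (3 values for ACC, 1 otherwise)."
  [text]
  (let [rows (->> (str/split-lines text)
                  (remove str/blank?)
                  (mapv (fn [line]
                          (mapv #(Double/parseDouble (str/trim %))
                                (str/split line #",")))))
        [[start] [rate] & samples] rows]
    {:start-unix start
     :rate-hz    rate
     :samples    (vec samples)}))

(defn sample-times
  "Unix time of each sample, assuming uniform sampling from the start."
  [{:keys [start-unix rate-hz samples]}]
  (mapv #(+ start-unix (/ % rate-hz)) (range (count samples))))

;; Made-up example in the shape of BVP.csv (values are not real data).
(def example-text "1495437325.0\n64.0\n-0.5\n1.25\n2.0\n")

(def parsed (parse-e4-csv example-text))
(:rate-hz parsed)       ;; => 64.0
(count (:samples parsed)) ;; => 3
```

For real files the text would come from something like `(slurp path)`; the synchronised pickles described next make this step unnecessary for most analyses.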
Synchronised pickle files
The RespiBAN and E4 record independently with different clocks. The authors synchronised them using a double-tap gesture visible in both accelerometers. The result is stored in SX.pkl — a Python dict containing all signals (already aligned) plus per-sample labels.
Label Encoding
Per-sample labels at 700 Hz (aligned with RespiBAN signals):
| Label | Condition | Use |
|---|---|---|
| 0 | Not defined / transient | Ignore |
| 1 | Baseline | Main task |
| 2 | Stress | Main task |
| 3 | Amusement | Main task |
| 4 | Meditation | Optional |
| 5, 6, 7 | Study protocol phases | Ignore |
Reading the Data
Each subject’s .pkl file is a Python pickle containing a nested dict with NumPy arrays for every signal channel plus per-sample labels. We use libpython-clj to load these directly. The key advantage: NumPy arrays become dtype-next buffers via zero-copy interop, so tablecloth can wrap them as dataset columns without any data duplication.
(require-python '[pickle :as pkl]
                '[builtins])
;; => :ok

All RespiBAN signals are sampled at 700 Hz.

(def WESAD-sampling-rate 700)

Python’s pickle.load needs a binary file handle. We use builtins/open to get one, wrapped in py/with (the Clojure equivalent of Python’s with statement) to ensure the file is closed after reading.

(defn load-pickle [filename]
  (py/with [f (builtins/open filename "rb")]
    (pkl/load f :encoding "latin")))

The pickle’s structure is a nested dict:

{"signal"  {"chest" {"ECG" <numpy (N,1)>, "EDA" <numpy (N,1)>, ...}
            "wrist" {"BVP" <numpy (M,1)>, ...}}
 "label"   <numpy (N,)>
 "subject" "SX"}
libpython-clj makes Python dicts work like Clojure maps, so we can navigate with get-in. Keyword access works too (:ECG matches the string key "ECG").
Both functions below are memoized so repeated calls for the same subject don’t re-read the large pickle files from disk.
(def labelled-data
  (memoize
   (fn [subject]
     (load-pickle (format "WESAD/S%d/S%d.pkl"
                          subject subject)))))

For this exploration we extract just the ECG signal and labels into a tablecloth dataset. ECG is one of many available signals (see the full list below); the same approach works for any of them. We call .flatten() on the numpy array because its shape is (N,1) and we need (N,) for a flat column.

(def labelled-dataset
  (memoize
   (fn [subject]
     (let [ld (labelled-data subject)]
       (tc/dataset {:t (tcc/* (range)
                              (/ 1.0 WESAD-sampling-rate))
                    :ECG (-> ld
                             (get-in [:signal :chest :ECG])
                             (py. flatten))
                    :label (-> ld
                               (get :label))})))))

Exploring Subject S2
Let’s load subject S2 and see what we get.
(def ds (labelled-dataset 2))
The dataset has over 4 million rows — about 100 minutes of continuous recording at 700 Hz.
(tc/info ds :basic)

_unnamed :basic info [1 4]:
| :name | :grouped? | :rows | :columns |
|---|---|---|---|
| _unnamed | false | 4255300 | 3 |
Available signals
The pickle contains many more signals than just ECG. Here are all the channels available from each device:
(let [ld (labelled-data 2)]
  (kind/table
   {:column-names ["Device" "Signal" "Shape"]
    :row-vectors
    (concat
     (for [[k v] (get-in ld [:signal :chest])]
       ["RespiBAN (chest)" k (str (py/get-attr v "shape"))])
     (for [[k v] (get-in ld [:signal :wrist])]
       ["Empatica E4 (wrist)" k (str (py/get-attr v "shape"))]))}))

| Device | Signal | Shape |
|---|---|---|
| RespiBAN (chest) | ACC | (4255300, 3) |
| RespiBAN (chest) | ECG | (4255300, 1) |
| RespiBAN (chest) | EMG | (4255300, 1) |
| RespiBAN (chest) | EDA | (4255300, 1) |
| RespiBAN (chest) | Temp | (4255300, 1) |
| RespiBAN (chest) | Resp | (4255300, 1) |
| Empatica E4 (wrist) | ACC | (194528, 3) |
| Empatica E4 (wrist) | BVP | (389056, 1) |
| Empatica E4 (wrist) | EDA | (24316, 1) |
| Empatica E4 (wrist) | TEMP | (24316, 1) |
The chest signals all have the same number of rows (about 4.26 million samples at 700 Hz ≈ 101 minutes). The wrist signals have fewer rows because the E4 samples at lower rates (BVP at 64 Hz, EDA and TEMP at 4 Hz, ACC at 32 Hz).
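The row counts and rates can be cross-checked with a little arithmetic: rows divided by rate should give the same duration for every stream. The sketch below does that for S2 and adds a helper for looking up the chest-rate index that corresponds to a wrist sample, which is what one would need to attach the 700 Hz labels to a wrist signal. The names are illustrative, and the index mapping assumes that the synchronised streams start at the same instant and run at exactly their nominal rates, which the source does not state explicitly.

```clojure
;; Row counts and nominal rates (Hz) copied from the tables above.
(def s2-streams
  {:chest-ECG {:rows 4255300 :rate 700}
   :wrist-BVP {:rows 389056  :rate 64}
   :wrist-ACC {:rows 194528  :rate 32}
   :wrist-EDA {:rows 24316   :rate 4}})

(defn duration-seconds
  "Recording length implied by a row count and a sampling rate."
  [{:keys [rows rate]}]
  (/ rows (double rate)))

(defn wrist-idx->chest-idx
  "Index of the latest 700 Hz sample at or before wrist sample `i`.
  Assumes both streams share a start time and use nominal rates."
  [i wrist-rate]
  (quot (* i 700) wrist-rate))

(into {} (map (fn [[k v]] [k (duration-seconds v)])) s2-streams)
;; => every stream gives 6079.0 s, i.e. about 101.3 minutes

(wrist-idx->chest-idx 4 4) ;; => 700 (one second into the recording)
```

That all four streams agree to the sample suggests the pickle was trimmed to a common time window during synchronisation.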
Label distribution
How much time does this subject spend in each condition?
(def label-names
  {0 "transient"
   1 "baseline"
   2 "stress"
   3 "amusement"
   4 "meditation"
   5 "other (5)"
   6 "other (6)"
   7 "other (7)"})

(let [freqs (frequencies (seq (:label ds)))]
  (kind/table
   {:column-names ["Label" "Condition" "Samples" "Duration"]
    :row-vectors
    (->> freqs
         (sort-by key)
         (mapv (fn [[label-code n]]
                 [(long label-code)
                  (get label-names (long label-code) "?")
                  n
                  (format "%.1f min" (/ n 700.0 60.0))])))}))

| Label | Condition | Samples | Duration |
|---|---|---|---|
| 0 | transient | 2142701 | 51.0 min |
| 1 | baseline | 800800 | 19.1 min |
| 2 | stress | 430500 | 10.3 min |
| 3 | amusement | 253400 | 6.0 min |
| 4 | meditation | 537599 | 12.8 min |
| 6 | other (6) | 45500 | 1.1 min |
| 7 | other (7) | 44800 | 1.1 min |
Over half the recording is transient (label 0) — transitions between conditions, questionnaire filling, etc. The three main conditions (baseline, stress, amusement) together account for about 35 minutes.
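The totals above do not show where each condition sits in the recording. One way to recover that is to collapse the per-sample labels into contiguous blocks with a start time and a duration. The following is a sketch in plain Clojure; `label-blocks` is a hypothetical helper, demonstrated here on a short made-up label sequence rather than on the real data.

```clojure
(defn label-blocks
  "Collapse per-sample labels into contiguous blocks.
  Returns a vector of maps with the label, start time, and duration
  (both in seconds) of each run of identical labels."
  [labels sampling-rate]
  (loop [runs  (partition-by identity labels)
         start 0
         out   []]
    (if-let [run (first runs)]
      (let [n (count run)]
        (recur (rest runs)
               (+ start n)
               (conj out {:label      (first run)
                          :start-s    (/ start (double sampling-rate))
                          :duration-s (/ n (double sampling-rate))})))
      out)))

;; Made-up labels at a toy rate of 2 Hz:
(label-blocks [0 0 1 1 1 1 0 2 2] 2)
;; => [{:label 0, :start-s 0.0, :duration-s 1.0}
;;     {:label 1, :start-s 1.0, :duration-s 2.0}
;;     {:label 0, :start-s 3.0, :duration-s 0.5}
;;     {:label 2, :start-s 3.5, :duration-s 1.0}]
```

Applied to the real label column (converted to longs first, at 700 Hz), this should list each protocol phase in order, which helps when choosing excerpts from a specific block.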
ECG during different conditions
Let’s look at 5-second ECG excerpts from each of the three main conditions. We take the first 5 seconds of each condition block.
(defn ecg-snippet
  "Extract n seconds of ECG from the first occurrence of the given label."
  [ds label-code n-seconds]
  (let [rows (tc/select-rows ds #(= label-code (long (% :label))))
        n (* n-seconds WESAD-sampling-rate)]
    (-> rows
        (tc/select-rows (range n))
        (tc/add-column :seconds (tcc/- (tcc/* (range) (/ 1.0 WESAD-sampling-rate))
                                       0.0)))))

(-> (tc/concat
     (tc/add-column (ecg-snippet ds 1 5) :condition "baseline")
     (tc/add-column (ecg-snippet ds 2 5) :condition "stress")
     (tc/add-column (ecg-snippet ds 3 5) :condition "amusement"))
    (plotly/base {:=x :seconds
                  :=y :ECG
                  :=color :condition
                  :=title "5-second ECG excerpts — Subject S2"
                  :=x-title "Time (s)"
                  :=y-title "ECG (raw)"})
    (plotly/layer-line)
    plotly/plot)

The three conditions show visibly different ECG morphology: the stress segment typically has a higher heart rate and different amplitude compared to baseline and amusement.
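To put a number on the heart-rate difference rather than judging it by eye, one could detect R-peaks in each excerpt and convert the mean peak-to-peak interval to beats per minute. The sketch below uses a deliberately naive detector (local maxima above a threshold, separated by a refractory gap); it is not a validated QRS algorithm, and `threshold` and `refractory` are hypothetical parameters that would need tuning on real ECG. It is demonstrated on a synthetic spike train, not on WESAD data.

```clojure
(defn r-peak-indices
  "Indices of local maxima above `threshold` that are at least
  `refractory` samples after the previously accepted peak."
  [signal threshold refractory]
  (let [v (vec signal)]
    (reduce (fn [peaks i]
              (let [x (v i)]
                (if (and (> x threshold)
                         (>= x (v (dec i)))
                         (> x (v (inc i)))
                         (or (empty? peaks)
                             (>= (- i (peek peaks)) refractory)))
                  (conj peaks i)
                  peaks)))
            []
            (range 1 (dec (count v))))))

(defn mean-heart-rate-bpm
  "Mean heart rate from peak indices; nil with fewer than two peaks."
  [peak-indices sampling-rate]
  (when (>= (count peak-indices) 2)
    (let [rr        (map - (rest peak-indices) peak-indices) ; in samples
          mean-rr-s (/ (/ (reduce + rr) (double (count rr)))
                       sampling-rate)]
      (/ 60.0 mean-rr-s))))

;; Synthetic signal at 10 Hz with a spike every 8 samples (0.8 s apart).
(def toy-signal
  (mapv #(if (= 4 (mod % 8)) 1.0 0.0) (range 32)))

(r-peak-indices toy-signal 0.5 3) ;; => [4 12 20 28]
(mean-heart-rate-bpm (r-peak-indices toy-signal 0.5 3) 10)
;; about 75 bpm
```

For the real 700 Hz signal, a refractory gap of a few hundred milliseconds and a per-excerpt threshold would be reasonable starting points, but both are assumptions to validate against the plotted waveform.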
References
- Schmidt, P., Reiss, A., Duerichen, R., Marberger, C., & Van Laerhoven, K. (2018). Introducing WESAD, a Multimodal Dataset for Wearable Stress and Affect Detection. Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI), 400–408.