7  Exploring the WESAD Dataset

The WESAD dataset (Wearable Stress and Affect Detection) is a multimodal dataset for wearable stress and affect detection, published by Schmidt et al. (2018).

Fifteen subjects wore two devices simultaneously, a chest-worn RespiBAN and a wrist-worn Empatica E4, while undergoing a protocol with three affective conditions: baseline, stress, and amusement.

Two additional conditions (meditation 1 and 2) were also recorded. The dataset contains ECG, EDA, EMG, respiration, temperature, accelerometry, and blood volume pulse (PPG) — making it an ideal testbed for Ripple’s cardio and (future) respiratory signal processing.

Download

  1. Go to the UCI ML Repository page
  2. Download the ZIP archive (~3.5 GB)
  3. Extract so that subject folders (S2, S3, … S17) are under a WESAD/ directory

This notebook assumes the dataset is available at WESAD/ relative to the project root (e.g., via a symlink).

Seventeen subjects participated, but S1 and S12 were discarded due to sensor malfunction, leaving 15 usable subjects (S2–S17, excluding S12).

(ns ripple-book.wesad-exploration
  (:require
   ;; Table processing (https://scicloj.github.io/tablecloth/):
   [tablecloth.api :as tc]
   ;; Column-level operations:
   [tablecloth.column.api :as tcc]
   ;; Interactive plotting via Plotly (https://scicloj.github.io/tableplot/):
   [scicloj.tableplot.v1.plotly :as plotly]
   ;; Annotating kinds of visualizations (https://scicloj.github.io/kindly-noted/):
   [scicloj.kindly.v4.kind :as kind]
   ;; Python interop (https://github.com/clj-python/libpython-clj):
   [libpython-clj2.python :as py :refer [py.]]
   [libpython-clj2.require :refer [require-python]]
   ;; Zero-copy numpy array support:
   [libpython-clj2.python.np-array]))

Devices and Signals

RespiBAN (chest, 700 Hz)

All RespiBAN channels are sampled at 700 Hz. The raw sensor values need conversion to SI units (vcc=3, chan_bit=2^16):

| Channel | Signal | Unit | Conversion |
|---|---|---|---|
| CH1 | ECG | mV | (signal/chan_bit − 0.5) × vcc |
| CH2 | EDA | μS | (signal/chan_bit) × vcc / 0.12 |
| CH3 | EMG | mV | (signal/chan_bit − 0.5) × vcc |
| CH4 | TEMP | °C | NTC thermistor formula (see readme) |
| CH5–CH7 | ACC x/y/z | g | (signal − 28000) / (38000 − 28000) × 2 − 1 |
| CH8 | RESP | % | (signal/chan_bit − 0.5) × 100 |
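
The formulas in the table translate directly into small functions. The following is a minimal sketch, not code from the dataset authors: the function names are made up for illustration, and the temperature channel is left out because its thermistor formula is only given in the readme.

;; Constants from the text above: vcc = 3, chan_bit = 2^16.
(def vcc 3.0)
(def chan-bit 65536.0)

(defn raw->ecg-mv
  "Raw ECG (or EMG, which shares the formula) to mV."
  [signal]
  (* (- (/ signal chan-bit) 0.5) vcc))

(defn raw->eda-us
  "Raw EDA to μS."
  [signal]
  (/ (* (/ signal chan-bit) vcc) 0.12))

(defn raw->acc-g
  "Raw accelerometer axis to g, using the 28000 and 38000 calibration bounds."
  [signal]
  (- (* (/ (- signal 28000.0) (- 38000.0 28000.0)) 2.0) 1.0))

(defn raw->resp-pct
  "Raw respiration to percent."
  [signal]
  (* (- (/ signal chan-bit) 0.5) 100.0))

A mid-scale ECG reading maps to zero, and the upper accelerometer bound maps to one:

(raw->ecg-mv 32768) ;; => 0.0
(raw->acc-g 38000)  ;; => 1.0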

Empatica E4 (wrist, variable rates)

Each signal is a separate CSV. The first row is a Unix timestamp (session start); the second row is the sampling rate in Hz. HR.csv, IBI.csv, and tags.csv are derived and should be ignored in favor of the raw signals.

| File | Signal | Rate | Unit |
|---|---|---|---|
| BVP.csv | Blood volume pulse (PPG) | 64 Hz | |
| EDA.csv | Electrodermal activity | 4 Hz | μS |
| ACC.csv | 3-axis accelerometer | 32 Hz | 1/64 g |
| TEMP.csv | Skin temperature | 4 Hz | °C |
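
The two-row header described above is easy to handle by hand. Here is a minimal sketch of a parser that works on the lines of one signal file; the name parse-e4-lines is hypothetical, and the sketch assumes comma-separated numeric rows (one column for most signals, three for ACC.csv).

(require '[clojure.string :as str])

(defn parse-e4-lines
  "Split the lines of one E4 signal CSV into start time, sampling rate,
  and sample rows. Row 1 is the Unix start timestamp, row 2 the rate in Hz."
  [lines]
  (let [parse-row (fn [line]
                    (mapv #(Double/parseDouble (str/trim %))
                          (str/split line #",")))
        [start-row rate-row & sample-rows] (->> lines
                                                (remove str/blank?)
                                                (map parse-row))]
    {:start-time    (first start-row)
     :sampling-rate (first rate-row)
     :samples       (vec sample-rows)}))

With made-up values standing in for a real file:

(def parsed (parse-e4-lines ["1495437325.0" "4.0" "0.25" "0.26"]))
(:sampling-rate parsed) ;; => 4.0
(:samples parsed)       ;; => [[0.25] [0.26]]

For a file on disk, the lines could come from line-seq on a reader opened with clojure.java.io/reader.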

Synchronised pickle files

The RespiBAN and E4 record independently with different clocks. The authors synchronised them using a double-tap gesture visible in both accelerometers. The result is stored in SX.pkl — a Python dict containing all signals (already aligned) plus per-sample labels.

Label Encoding

Per-sample labels at 700 Hz (aligned with RespiBAN signals):

| Label | Condition | Use |
|---|---|---|
| 0 | Not defined / transient | Ignore |
| 1 | Baseline | Main task |
| 2 | Stress | Main task |
| 3 | Amusement | Main task |
| 4 | Meditation | Optional |
| 5, 6, 7 | Study protocol phases | Ignore |

Reading the Data

Each subject’s .pkl file is a Python pickle containing a nested dict with NumPy arrays for every signal channel plus per-sample labels. We use libpython-clj to load these directly. The key advantage: NumPy arrays become dtype-next buffers via zero-copy interop, so tablecloth can wrap them as dataset columns without any data duplication.

(require-python '[pickle :as pkl]
                '[builtins])
:ok

All RespiBAN signals are sampled at 700 Hz.

(def WESAD-sampling-rate 700)

Python’s pickle.load needs a binary file handle. We use builtins/open to get one, wrapped in py/with (the Clojure equivalent of Python’s with statement) to ensure the file is closed after reading.

(defn load-pickle [filename]
  (py/with [f (builtins/open filename "rb")]
           (pkl/load f :encoding "latin")))

The pickle’s structure is a nested dict:

{"signal" {"chest" {"ECG" <numpy (N,1)>, "EDA" <numpy (N,1)>, ...}
           "wrist" {"BVP" <numpy (M,1)>, ...}}
 "label"  <numpy (N,)>
 "subject" "SX"}

libpython-clj makes Python dicts work like Clojure maps, so we can navigate with get-in. Keyword access works too (:ECG matches the string key "ECG").

Both functions below are memoized so repeated calls for the same subject don’t re-read the large pickle files from disk.

(def labelled-data
  (memoize
   (fn [subject]
     (load-pickle (format "WESAD/S%d/S%d.pkl"
                          subject subject)))))

For this exploration we extract just the ECG signal and labels into a tablecloth dataset. ECG is one of many available signals (see the full list below) — the same approach works for any of them. We call .flatten() on the numpy array because its shape is (N,1) and we need (N,) for a flat column.

(def labelled-dataset
  (memoize
   (fn [subject]
     (let [ld (labelled-data subject)]
       (tc/dataset {:t (tcc/* (range)
                              (/ 1.0 WESAD-sampling-rate))
                    :ECG (-> ld
                             (get-in [:signal :chest :ECG])
                             (py. flatten))
                    :label (-> ld
                               (get :label))})))))
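
The same pattern should carry over to the single-column wrist signals, as long as the time axis uses the E4 rate for that signal instead of 700 Hz. The sketch below is an untested assumption rather than part of the original notebook: wrist-dataset and E4-sampling-rates are hypothetical names, the rates are the ones listed in the Empatica E4 table, and it relies on the namespace aliases and labelled-data defined above. The wrist signals are not aligned sample-by-sample with the 700 Hz labels, so no label column is attached.

;; Rates for the single-column E4 signals (ACC has three columns and is skipped).
(def E4-sampling-rates
  {:BVP 64 :EDA 4 :TEMP 4})

(defn wrist-dataset
  "Build a dataset with a time column and one flattened wrist signal."
  [subject signal]
  (let [rate (E4-sampling-rates signal)]
    (tc/dataset {:t (tcc/* (range) (/ 1.0 rate))
                 signal (-> (labelled-data subject)
                            (get-in [:signal :wrist signal])
                            (py. flatten))})))

A call such as (wrist-dataset 2 :BVP) would then give a two-column dataset with :t in seconds.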

Exploring Subject S2

Let’s load subject S2 and see what we get.

(def ds (labelled-dataset 2))

The dataset has over 4 million rows, about 101 minutes of continuous recording at 700 Hz.

(tc/info ds :basic)

_unnamed :basic info [1 4]:

| :name | :grouped? | :rows | :columns |
|---|---|---|---|
| _unnamed | false | 4255300 | 3 |

Available signals

The pickle contains many more signals than just ECG. Here are all the channels available from each device:

(let [ld (labelled-data 2)]
  (kind/table
   {:column-names ["Device" "Signal" "Shape"]
    :row-vectors
    (concat
     (for [[k v] (get-in ld [:signal :chest])]
       ["RespiBAN (chest)" k (str (py/get-attr v "shape"))])
     (for [[k v] (get-in ld [:signal :wrist])]
       ["Empatica E4 (wrist)" k (str (py/get-attr v "shape"))]))}))
| Device | Signal | Shape |
|---|---|---|
| RespiBAN (chest) | ACC | (4255300, 3) |
| RespiBAN (chest) | ECG | (4255300, 1) |
| RespiBAN (chest) | EMG | (4255300, 1) |
| RespiBAN (chest) | EDA | (4255300, 1) |
| RespiBAN (chest) | Temp | (4255300, 1) |
| RespiBAN (chest) | Resp | (4255300, 1) |
| Empatica E4 (wrist) | ACC | (194528, 3) |
| Empatica E4 (wrist) | BVP | (389056, 1) |
| Empatica E4 (wrist) | EDA | (24316, 1) |
| Empatica E4 (wrist) | TEMP | (24316, 1) |

The chest signals all have the same number of rows (about 4.26M samples at 700 Hz ≈ 101 minutes). The wrist signals have fewer rows because the E4 samples at lower rates (BVP at 64 Hz, ACC at 32 Hz, EDA and TEMP at 4 Hz).
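
The row counts can be cross-checked with a little arithmetic: the chest sample count fixes the recording duration, and that duration times each E4 rate should give the wrist row counts. The names in this sketch are illustrative only.

(def chest-rate 700)
(def chest-samples 4255300)

;; Recording length in whole seconds (the division is exact here).
(def duration-s (quot chest-samples chest-rate))

duration-s          ;; => 6079
(/ duration-s 60.0) ;; roughly 101.3 minutes

;; Expected wrist rows = duration in seconds × E4 sampling rate.
(def expected-wrist-rows
  (into {}
        (for [[signal rate] {"BVP" 64 "ACC" 32 "EDA" 4 "TEMP" 4}]
          [signal (* duration-s rate)])))

expected-wrist-rows
;; => {"BVP" 389056, "ACC" 194528, "EDA" 24316, "TEMP" 24316}

These values match the shapes reported for subject S2.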

Label distribution

How much time does this subject spend in each condition?

(def label-names
  {0 "transient"
   1 "baseline"
   2 "stress"
   3 "amusement"
   4 "meditation"
   5 "other (5)"
   6 "other (6)"
   7 "other (7)"})

(let [freqs (frequencies (seq (:label ds)))]
  (kind/table
   {:column-names ["Label" "Condition" "Samples" "Duration"]
    :row-vectors
    (->> freqs
         (sort-by key)
         (mapv (fn [[label-code n]]
                 [(long label-code)
                  (get label-names (long label-code) "?")
                  n
                  ;; samples -> seconds -> minutes
                  (format "%.1f min"
                          (/ n (double WESAD-sampling-rate) 60.0))])))}))
| Label | Condition | Samples | Duration |
|---|---|---|---|
| 0 | transient | 2142701 | 51.0 min |
| 1 | baseline | 800800 | 19.1 min |
| 2 | stress | 430500 | 10.3 min |
| 3 | amusement | 253400 | 6.0 min |
| 4 | meditation | 537599 | 12.8 min |
| 6 | other (6) | 45500 | 1.1 min |
| 7 | other (7) | 44800 | 1.1 min |

Over half the recording is transient (label 0) — transitions between conditions, questionnaire filling, etc. The three main conditions (baseline, stress, amusement) together account for about 35 minutes.
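
Total counts per label do not show where each condition starts or whether a label occurs in more than one block. One way to see that is to collapse the per-sample labels into contiguous runs. The following is a sketch using plain Clojure sequences; label-runs is a hypothetical helper, not part of the notebook, and on a full 4-million-sample recording it favors clarity over speed.

(defn label-runs
  "Collapse a sequence of per-sample labels into contiguous runs,
  each with its label, start index, and length in samples."
  [labels]
  (->> labels
       (partition-by identity)
       ;; carry the running start index from one run to the next
       (reductions (fn [{:keys [start n]} run]
                     {:label (first run)
                      :start (+ start n)
                      :n     (count run)})
                   {:start 0 :n 0})
       rest   ; drop the seed value
       vec))

On a toy label sequence:

(label-runs [0 0 1 1 1 0 2])
;; => [{:label 0, :start 0, :n 2}
;;     {:label 1, :start 2, :n 3}
;;     {:label 0, :start 5, :n 1}
;;     {:label 2, :start 6, :n 1}]

Dividing :start and :n by the 700 Hz sampling rate gives each block's onset and duration in seconds.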

ECG during different conditions

Let’s look at 5-second ECG excerpts from each of the three main conditions. We take the first 5 seconds of each condition block.

(defn ecg-snippet
  "Extract the first n-seconds of ECG samples carrying the given label."
  [ds label-code n-seconds]
  (let [rows (tc/select-rows ds #(= label-code (long (% :label))))
        ;; number of samples in n-seconds at the chest sampling rate
        n (* n-seconds WESAD-sampling-rate)]
    (-> rows
        (tc/select-rows (range n))
        ;; time axis restarting at 0 for each snippet
        (tc/add-column :seconds (tcc/* (range n)
                                       (/ 1.0 WESAD-sampling-rate))))))

(-> (tc/concat
     (tc/add-column (ecg-snippet ds 1 5) :condition "baseline")
     (tc/add-column (ecg-snippet ds 2 5) :condition "stress")
     (tc/add-column (ecg-snippet ds 3 5) :condition "amusement"))
    (plotly/base {:=x :seconds
                  :=y :ECG
                  :=color :condition
                  :=title "5-second ECG excerpts — Subject S2"
                  :=x-title "Time (s)"
                  :=y-title "ECG (raw)"})
    (plotly/layer-line)
    plotly/plot)

The three conditions show visibly different ECG morphology: the stress segment typically has a higher heart rate and different amplitude compared to baseline and amusement.
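
A heart-rate difference can be put into numbers once the R-peaks are located. The sketch below covers only the last step, turning peak positions into a mean rate. It assumes the peak sample indices come from some R-peak detector that is not shown here, and peaks->bpm is a hypothetical name.

(defn peaks->bpm
  "Mean heart rate in beats per minute from R-peak sample indices.
  Needs at least two peaks."
  [peak-indices sampling-rate]
  (let [intervals     (map - (rest peak-indices) peak-indices)
        mean-interval (/ (reduce + intervals)
                         (double (count intervals)))]
    ;; samples per minute divided by samples per beat
    (/ (* 60.0 sampling-rate) mean-interval)))

Peaks exactly one second apart at 700 Hz give 60 beats per minute, and peaks half a second apart give 120:

(peaks->bpm [0 700 1400 2100] 700) ;; => 60.0
(peaks->bpm [0 350 700] 700)       ;; => 120.0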

References

source: notebooks/ripple_book/wesad_exploration.clj