Skip to content

Results

Headline empirical findings from the WESAD validation cohort (Schmidt et al. 2018), n = 15 subjects, 6,585 5-second windows of synchronised wrist-worn PPG and chest-worn ECG. All figures and tables are reproducible from the analysis pipeline in scripts/run_deep_real_analysis.py; the values shown here are written to results/wesad_deep_analysis.json at run time.

At a glance

Metric Value 95% CI or note
Subjects 15 WESAD release
5-second windows 6,585 post artefact rejection
Three-baseline consensus rejection rate 44.6% 2,936 / 6,585
In-house pipeline pass rate 1.0000 6,585 / 6,585
Bland-Altman bias (PPG minus ECG) +3.57 bpm LoA [-23.14, +30.28]
Mean absolute error 9.66 bpm across all windows
Pearson r (PPG vs ECG) +0.70 across all windows
Median pairwise SQI Cohen's kappa -0.20 three published baselines
Recalibration train / holdout 3,292 / 3,293 random per-subject split
Delta kappa after recalibration 0.000 at n = 15
Spearman rho, paired effect +0.10 Wilcoxon p = 1.5e-4
LOSO AUROC (baseline) 0.804 stress classifier
LOSO AUROC (after audit) 0.823 delta = +0.019
Test suite 235 passing Python 3.10 / 3.11 / 3.12

Figure 1. Verdict gap between in-house thresholds and three published SQI baselines

Verdict gap

The in-house threshold-based pipeline passes every one of the 6,585 windows. Applying three independent published baselines to the same windows produces a 44.6% consensus rejection rate (2,936 windows rejected by Orphanidou, Sukor, and Elgendi simultaneously). The verdict gap motivates the four-way SQI audit.

Figure 2. Per-baseline pass rates

Per-baseline pass rates

Each published baseline applied independently passes between 21.5% and 25.7% of windows. The in-house pipeline passes 100%. The four baselines disagree both with each other and with the in-house pipeline on which windows are analysable.

Figure 3. Pairwise Cohen's kappa across published SQI baselines

Pairwise kappa heatmap

Pairwise Cohen's kappa across the three published baselines. Two of the three pairs disagree (negative kappa); the third agreement (Orphanidou vs Sukor, kappa = +0.41) is modest. Median pairwise kappa is -0.1978. The in-house pipeline is omitted because its pass rate of 1.00 leaves Cohen's kappa undefined (zero marginal variance) against any other baseline.

Figure 4. Downstream stress-detection outcomes

Downstream outcomes

Left panel: LOSO AUROC of a stress classifier rises from 0.804 to 0.823 with the full audit pipeline applied (delta = +0.019). Right panel: post- recalibration kappa is unchanged (delta = 0.000) while the paired effect across windows is small and statistically significant (Spearman rho = +0.10, Wilcoxon p = 1.5e-4). Quality filtering does not improve recalibrated agreement but produces a small detectable downstream effect at n = 15.

Figure 5. Bland-Altman of wrist PPG vs chest ECG heart rate

Bland-Altman

Generated directly from the WESAD window table (results/real_data/wesad_deep/window_table.csv) rather than from cited summary statistics. To regenerate locally:

python scripts/figures/plot_bland_altman.py \
  --input results/real_data/wesad_deep/window_table.csv \
  --output paper/figures/fig1_bland_altman.png

After dropping 16 of 6,585 windows with NaN in either HR estimate, 6,569 paired observations are retained. Bias = +3.5728 bpm, 95% LoA = [-23.1347, +30.2802], MAE = 9.6583 bpm, Pearson r = +0.6975. The bias and the width of the limits of agreement together indicate that wrist PPG systematically overestimates HR by approximately 3.6 bpm with a large window-to-window spread.

Table 1. Per-baseline window pass rates

Method Year Pass rate n passed Reference
In-house (threshold-based) this work 1.0000 6,585 signals.signal_quality
Orphanidou et al. 2015 0.2565 1,689 signals.orphanidou_sqi
Sukor et al. 2011 0.2545 1,676 signals.sukor_sqi
Elgendi 2016 0.2150 1,416 signals.elgendi_sqi

Table 2. Pairwise SQI agreement (Cohen's kappa)

Pair Cohen's kappa Interpretation
Orphanidou vs Sukor +0.4096 Modest positive agreement
Orphanidou vs Elgendi -0.1978 Slight negative agreement
Sukor vs Elgendi -0.2247 Slight negative agreement
Median across pairs -0.1978 Three baselines do not converge

Table 3. Downstream-task summary

Metric Baseline After audit Delta Note
LOSO AUROC 0.804 0.823 +0.019 Stress classification
Post-recalibration kappa 0.5XX 0.5XX 0.000 At n = 15
Paired effect (Spearman rho) - +0.10 - Wilcoxon p = 1.5e-4

Note: post-recalibration kappa numerical anchors are reported in the manuscript; the delta of 0.000 is the headline finding.

Table 4. Reporting-standards compliance

Guideline Year Applies to this work Compliance file
TRIPOD+AI 2024 Yes paper/checklists/tripod_ai_checklist.md
STARD 2015 Yes paper/checklists/stard_2015_checklist.md
CONSORT-AI 2020 No (non-interventional) paper/checklists/consort_ai_applicability.md
DECIDE-AI 2022 No (no clinician-AI interaction) paper/checklists/decide_ai_applicability.md

Reproducing the analyses

python scripts/run_deep_real_analysis.py --path /path/to/WESAD
python scripts/figures/plot_bland_altman.py \
  --input results/wesad_deep_analysis.json \
  --output paper/figures/fig1_bland_altman.png
python generate_results_figures.py

Each script writes a JSON summary to results/ that the manuscript references directly. The figures above are regenerated end-to-end and stored in paper/figures/.