Methods

A narrative walk-through of the four audit components. For the manuscript-grade methods text see paper/paper.md and the reporting-standards checklists in paper/checklists/; for function-level documentation see the API reference.

The audit is organised around four cooperating components, each producing quantitative evidence that feeds a single set of methodology recommendations.

1. Signal-quality audit

The signal-quality component evaluates each 5-second window of wrist PPG against four independent indicators:

Indicator	Reference	Threshold logic
In-house	this work	Amplitude, baseline drift, and beat-detection consistency thresholds tuned on synthetic data
Orphanidou et al.	2015	Template-matching against an averaged beat shape; correlation threshold 0.66
Sukor et al.	2011	Skewness and pulse-amplitude variability of detected beats
Elgendi	2016	Signal-to-noise ratio in the cardiac frequency band
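
As an illustration of the template-matching row, here is a minimal sketch rather than the toolkit's implementation. It assumes the beats in a window have already been segmented and resampled to a common length. The function names `pearson` and `template_sqi` are hypothetical; only the 0.66 threshold is taken from the table.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat segment: treat as uncorrelated (an assumption)
    return sxy / (sxx * syy) ** 0.5


def template_sqi(beats, threshold=0.66):
    """Return (mean correlation with the averaged beat, pass/fail decision)."""
    if len(beats) < 2:
        raise ValueError("need at least two beats to build a template")
    length = len(beats[0])
    if any(len(b) != length for b in beats):
        raise ValueError("beats must be resampled to a common length")
    # The template is the sample-wise average of all beats in the window.
    template = [sum(b[i] for b in beats) / len(beats) for i in range(length)]
    mean_r = sum(pearson(b, template) for b in beats) / len(beats)
    return mean_r, mean_r >= threshold


# Hypothetical beats: three identical pulses, then a window with one inverted pulse.
pulse = [0.0, 1.0, 3.0, 2.0, 1.0, 0.0]
r_clean, ok_clean = template_sqi([pulse, pulse, pulse])
r_bad, ok_bad = template_sqi([pulse, pulse, [-v for v in pulse]])
```

In the second window the inverted pulse correlates negatively with the template, which pulls the mean correlation below the threshold and fails the window.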

For each window the toolkit records a pass/fail decision from each indicator and computes pairwise Cohen's kappa across indicators. A window is consensus-rejected when all three published baselines reject it, and consensus-passed when at least one published baseline passes it. The in-house indicator is reported separately: its 100% pass rate leaves it with zero marginal variance, so observed and chance-expected agreement coincide and Cohen's kappa against the other indicators is degenerate (uninformative).
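
The agreement statistics described above can be sketched in a few lines. This is an illustrative example on made-up decisions, not the toolkit's code: the indicator keys and the helper names `cohen_kappa` and `consensus_rejected` are assumptions. Under the standard formula, a rater that always passes gives a kappa of exactly 0 against any non-constant rater (and 0/0 when both are constant), so the value carries no information.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of booleans (True = pass)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # marginal pass rates
    p_e = pa * pb + (1 - pa) * (1 - pb)           # chance-expected agreement
    if p_e == 1.0:
        return float("nan")                       # both raters constant: 0/0
    return (p_o - p_e) / (1 - p_e)


def consensus_rejected(decisions, published):
    """True for each window that every published indicator rejects."""
    n = len(decisions[published[0]])
    return [all(not decisions[name][i] for name in published) for i in range(n)]


# Hypothetical pass/fail decisions for six windows (True = pass).
decisions = {
    "in_house":   [True, True, True, True, True, True],
    "orphanidou": [True, False, False, True, False, True],
    "sukor":      [True, False, True, False, False, True],
    "elgendi":    [False, False, True, True, False, True],
}
published = ["orphanidou", "sukor", "elgendi"]

pairwise = {
    (a, b): cohen_kappa(decisions[a], decisions[b])
    for a, b in combinations(published, 2)
}
rejected = consensus_rejected(decisions, published)
rejection_rate = sum(rejected) / len(rejected)
```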

The empirical finding on WESAD (44.6% consensus rejection, median pairwise kappa = -0.20, i.e. agreement below chance level) motivates the rest of the audit: if three independent published indicators disagree this strongly on the same six-thousand-window cohort, no single threshold can be assumed to be correct.

2. Algorithmic-fairness audit

The fairness component stratifies signal-quality and downstream outcomes by:

device family (chest-strap, wrist-worn, finger-clip)
skin tone (Fitzpatrick I to VI when annotated)
per-subject drift (sliding-window kappa within subject)
motion intensity (accelerometer RMS quartile)

For each stratum the toolkit reports the difference in pass rate, the difference in downstream AUROC, and a permutation-test p-value against the null hypothesis that the outcome is independent of stratum membership. Disparities are reported as effect sizes with bootstrap confidence intervals; the toolkit makes no categorical claim of "fair" or "unfair".
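
A minimal sketch of the two resampling procedures for a pass-rate difference between two strata follows. The data and function names are hypothetical, and the toolkit's own resampling scheme may differ. In particular, this sketch resamples individual windows and so ignores within-subject clustering.

```python
import random


def pass_rate_diff(group_a, group_b):
    """Difference in pass rate between two lists of 0/1 decisions."""
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided p-value for the pass-rate difference under label shuffling."""
    rng = random.Random(seed)
    observed = abs(pass_rate_diff(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties.
        if abs(pass_rate_diff(pooled[:n_a], pooled[n_a:])) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


def bootstrap_ci(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the pass-rate difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(pass_rate_diff(resample_a, resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical strata: 80% versus 50% pass rate over 100 windows each.
stratum_a = [1] * 80 + [0] * 20
stratum_b = [1] * 50 + [0] * 50
effect = pass_rate_diff(stratum_a, stratum_b)
p_value = permutation_p_value(stratum_a, stratum_b)
ci_low, ci_high = bootstrap_ci(stratum_a, stratum_b)
```

Reporting `effect` together with `(ci_low, ci_high)` keeps the emphasis on the size of the disparity rather than on a binary verdict.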

The audit does not claim to enumerate all sites of inequity. Inclusion of additional strata (socioeconomic, geographic, temporal) is supported through the reliability.fairness_audit.add_stratum API.

3. Causal-sensitivity audit

The causal-sensitivity component evaluates whether an observed exposure-outcome association survives back-door adjustment for measured confounders, and how robust the adjusted estimate is to unmeasured confounding.

Three methods are applied to each candidate confounder:

Method	Reference	What it tests
AIPW doubly-robust estimation	Bang and Robins 2005	Whether the exposure-outcome estimate is stable to misspecification of either the exposure model or the outcome model
E-value	VanderWeele and Ding 2017	The minimum strength of association an unmeasured confounder would need with both the exposure and the outcome to fully explain the observed effect
Negative-control exposures	Lipsitch et al. 2010	Whether a known-null exposure shows a spurious effect in the same direction (a non-null result on a negative control indicates residual confounding)
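
To make the doubly-robust row concrete, the sketch below computes the standard AIPW estimate of an average effect from already-fitted nuisance predictions. It is an illustration under stated assumptions: a binary exposure, plus propensity scores and outcome predictions supplied by the caller. The function name `aipw_ate` and the toy data are hypothetical and not part of the toolkit's API.

```python
def aipw_ate(exposure, outcome, propensity, mu1, mu0):
    """Augmented inverse-probability-weighted average effect estimate.

    exposure   : 0/1 exposure indicators
    outcome    : observed outcomes
    propensity : fitted P(exposure = 1 | covariates), strictly inside (0, 1)
    mu1, mu0   : fitted outcome predictions under exposure and under no exposure
    """
    total = 0.0
    for a, y, e, m1, m0 in zip(exposure, outcome, propensity, mu1, mu0):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        total += (m1 - m0) + a * (y - m1) / e - (1 - a) * (y - m0) / (1 - e)
    return total / len(outcome)


# Toy data: exposed units have outcome 3, unexposed units have outcome 1.
exposure = [1, 1, 0, 0]
outcome = [3.0, 3.0, 1.0, 1.0]
propensity = [0.5, 0.5, 0.5, 0.5]

# Correct outcome model: the residual terms vanish.
ate_good_outcome_model = aipw_ate(exposure, outcome, propensity,
                                  mu1=[3.0] * 4, mu0=[1.0] * 4)
# Badly wrong outcome model (all zeros) with a correct propensity model.
ate_bad_outcome_model = aipw_ate(exposure, outcome, propensity,
                                 mu1=[0.0] * 4, mu0=[0.0] * 4)
```

Both calls return the same estimate on this toy data, which is the doubly-robust property in miniature: the weighted residual term compensates for the misspecified outcome model when the propensity model is right.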

The audit reports the back-door-adjusted point estimate, the E-value, and the result of each negative control. An estimate that flips sign under back-door adjustment, or that has an E-value below 1.5, is flagged as causally fragile in the methodology-recommendations layer.
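
The flagging rule can be sketched as follows, using the closed-form E-value for a risk ratio from VanderWeele and Ding (2017): E = RR + sqrt(RR × (RR − 1)), with RR inverted first when it is below 1. The function names are hypothetical. Treating a "sign flip" as the estimate crossing RR = 1 is an assumption of this sketch, since the text does not state the effect scale.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate on the risk-ratio scale."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


def is_causally_fragile(unadjusted_rr, adjusted_rr, min_e_value=1.5):
    """Flag an estimate that flips direction or has an E-value below the cutoff."""
    # Assumption: a sign flip means log(RR) changes sign under adjustment.
    flipped = math.log(unadjusted_rr) * math.log(adjusted_rr) < 0
    return flipped or e_value(adjusted_rr) < min_e_value


ev_strong = e_value(2.0)   # 2 + sqrt(2), roughly 3.41
ev_weak = e_value(1.1)     # roughly 1.43, below the 1.5 cutoff
```

For example, `is_causally_fragile(1.3, 1.1)` is flagged because the adjusted E-value falls below 1.5, and `is_causally_fragile(1.4, 0.5)` is flagged because the direction of the effect reverses.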

4. Downstream-impact audit

The downstream-impact component measures whether quality filtering and other pre-processing decisions detectably move the metric the field cares about. For stress detection on WESAD the toolkit reports:

Leave-one-subject-out (LOSO) AUROC of a logistic-regression classifier on heart-rate variability features, baseline (no audit) versus after the full audit pipeline
Recalibration kappa from a 50-50 random per-subject train/holdout split
Spearman rho of paired window-level scores and the corresponding Wilcoxon signed-rank p-value
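
For readers unfamiliar with leave-one-subject-out evaluation, the sketch below shows the fold structure and a rank-based AUROC. It is a simplified illustration: the subject identifiers, labels, and scores are made up, the classifier is omitted, and the helper names are hypothetical.

```python
def loso_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def auroc(labels, scores):
    """Rank-based AUROC: P(positive score > negative score) + 0.5 * P(tie)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical window-level subject identifiers.
subject_ids = ["S2", "S2", "S3", "S3", "S4", "S4"]
folds = list(loso_folds(subject_ids))

# Hypothetical held-out labels and classifier scores.
example_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because every window from the held-out subject is excluded from training, no subject contributes to both sides of a fold, which is what keeps subject-specific signal characteristics from inflating the score.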

The headline downstream finding on WESAD is that quality filtering does not improve recalibrated agreement (delta kappa = 0.000 at n = 15) but produces a small, statistically significant paired effect (Spearman rho = +0.10, Wilcoxon signed-rank p = 1.5e-4) and a delta-AUROC of +0.019. The toolkit emphasises that none of these effects is clinically meaningful at this sample size, and reports them as measured rather than rounding them up.

Methodology-recommendations layer

The four audits feed a single recommendations file (written to results/methodology_recommendations.md after a pipeline run) that summarises:

Which signal-quality indicator was used and why
Which fairness strata showed material disparities and the recommended reporting practice
Whether the causal interpretation is robust under back-door adjustment and the E-value for the headline effect
Whether the downstream effect of quality filtering is large enough to matter clinically

The recommendations are intentionally cautious. None of the four audits is a substitute for prospective validation, external replication, or clinical-trial evidence; the toolkit produces evidence about analytic choices, not about clinical truth.

Reporting standards

For the formal reporting-standards mapping see the Reporting standards page and the EQUATOR-Network checklists in paper/checklists/. In short:

TRIPOD+AI (Collins et al. 2024) applies and is mapped item by item
STARD 2015 (Bossuyt et al. 2015) applies and is mapped item by item
CONSORT-AI (Liu et al. 2020) does not apply (non-interventional)
DECIDE-AI (Vasey et al. 2022) does not apply (no clinician-AI interaction)

A STARD-style flow diagram is provided as Figure 1 (paper/figures/fig_flow_diagram.png).

What this toolkit does not do

To be explicit about scope:

It does not generate new signal-quality indicators; it audits the ones already in the literature.
It does not produce clinical predictions; it audits the analytic decisions that feed downstream models.
It does not perform external validation; that requires a second cohort and is in the project roadmap.
It does not claim that any single signal-quality threshold is correct; the whole point of the four-way audit is to surface disagreement.
It does not certify a pipeline as fair, causal, or downstream-safe; it provides evidence that the practitioner uses to make those judgements.