Reporting standards compliance¶

This page summarises compliance with the EQUATOR-Network reporting standards relevant to digital-health AI research. The full per-item checklists live in paper/checklists/ in the repository and are pulled in below.

Summary¶

Standard	Applies?	Coverage
TRIPOD+AI (Collins et al. 2024)	Partial (Section 4.6 classifier)	Per-item compliance below
STARD 2015 (Bossuyt et al. 2015)	Yes (Sections 4.1-4.3 SQI-as-test)	Per-item compliance below
CONSORT-AI (Liu et al. 2020)	No (not a randomised trial)	Applicability assessment below
DECIDE-AI (Vasey et al. 2022)	No (not a clinical deployment)	Applicability assessment below

The flow diagram in paper/figures/fig_flow_diagram.png follows STARD 2015 conventions adapted to the multi-arm structure of this audit.

TRIPOD+AI¶

TRIPOD+AI compliance checklist¶

Reference: Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385:e078378. doi:10.1136/bmj-2023-078378

Applicability to this paper: TRIPOD+AI applies to the Section 4.6 downstream-audit classifier (LF/HF biomarker -> baseline/stress label, LOSO cross-validation). It does not apply to the SQI-agreement analysis in Sections 4.1-4.3, which is framed as a diagnostic-accuracy comparison (see STARD 2015 checklist) rather than a prediction-model deployment. The downstream classifier in Section 4.6 is a research-stage demonstration, not a clinical-deployment candidate; the items below reflect that scope honestly.

Item numbering follows the published TRIPOD+AI checklist. Item statements are paraphrased; the full text of each item is in the original publication.

Title and abstract¶

Item 1 (Title). Identifies the study as developing or validating a prediction model, mentions the target population, the outcome, and the AI methods used.

This paper. The main title positions the work as a "methodology audit of wearable physiological signals," not as a prediction-model paper. Section 4.6 (the downstream audit) is a sub-study; its sub-heading identifies it as a classifier evaluation. The full title is appropriate for the methodology audit as a whole.

Item 2 (Abstract). Structured abstract reporting objectives, methods, results, conclusions.

This paper. Manuscript abstract is structured. Section 4.6 results (AUROC 0.804 -> 0.823, Wilcoxon p = 1.5e-4) are reported in the abstract.

Introduction¶

Item 3a (Background and rationale). Explains the medical context and the need for a prediction model.

This paper. Section 1 motivates the audit; Section 4.6 motivates the downstream classifier as a check on whether the SQI-binarisation choice has measurable consequences for a downstream prediction task.

Item 3b (Objectives). States objectives, including whether the study develops, validates, or updates a model.

This paper. Section 4.6 develops a classifier on per-subject LF/HF ratios and evaluates it under two SQI preprocessing regimes (raw vs in-house cleaned). The objective is to test whether SQI choice changes downstream performance.

Methods¶

Item 4a (Source of data). Describes the data source, including dates, geographic location, setting.

This paper. WESAD (Schmidt et al. 2018), released 2018, lab-based acquisition at the University of Siegen. Section 3.1 (Data).

Item 4b (Eligibility criteria). Describes participant eligibility.

This paper. All 15 WESAD subjects with successful chest-ECG and wrist-PPG synchronisation included. No additional eligibility filter applied. Stated in Section 3.1.

Item 5a (Setting). Geographic and temporal setting.

This paper. Lab-based, single-site, 2018. Stated in Section 3.1.

Item 5b (Outcome). Definition of the outcome being predicted, including how and when it was measured.

This paper. Outcome is the binary label baseline vs stress, defined by the WESAD protocol annotations. Section 4.6.

Item 6 (Predictors). Predictors used, including how and when measured.

This paper. Single predictor: per-subject LF/HF ratio computed on RR intervals derived from chest ECG. Section 4.6.
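For readers unfamiliar with the predictor, the following is a minimal sketch of a frequency-domain LF/HF estimate from an RR series. It is not the repository's implementation: the 4 Hz resampling rate, the linear interpolation, the plain periodogram, and the names `lf_hf_ratio` and `synthetic_rr` are assumptions made for illustration. The band edges (LF 0.04-0.15 Hz, HF 0.15-0.40 Hz) are the conventional HRV Task Force definitions.

```python
import cmath
import math


def lf_hf_ratio(rr_s, fs=4.0):
    """LF/HF power ratio from RR intervals in seconds (illustrative sketch)."""
    # Beat times are the cumulative sum of the RR intervals.
    beat_t, acc = [], 0.0
    for rr in rr_s:
        acc += rr
        beat_t.append(acc)

    # Linearly interpolate the tachogram onto an even grid (assumed 4 Hz).
    n = int((beat_t[-1] - beat_t[0]) * fs)
    x, j = [], 0
    for k in range(n):
        g = beat_t[0] + k / fs
        while beat_t[j + 1] < g:
            j += 1
        w = (g - beat_t[j]) / (beat_t[j + 1] - beat_t[j])
        x.append(rr_s[j] + w * (rr_s[j + 1] - rr_s[j]))

    # Remove the mean so the DC component does not leak into the LF band.
    mean = sum(x) / n
    x = [v - mean for v in x]

    def band_power(lo, hi):
        # Sum periodogram bins whose frequency falls in [lo, hi).
        power = 0.0
        for k in range(1, n // 2):
            if lo <= k * fs / n < hi:
                c = sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
                power += abs(c) ** 2
        return power

    return band_power(0.04, 0.15) / band_power(0.15, 0.40)


def synthetic_rr(osc_hz, n_beats=300):
    """Synthetic RR series (not WESAD data) with one dominant oscillation."""
    rr, t = [], 0.0
    for _ in range(n_beats):
        value = 0.8 + 0.05 * math.sin(2 * math.pi * osc_hz * t)
        rr.append(value)
        t += value
    return rr


lf_dominant = lf_hf_ratio(synthetic_rr(0.10))  # oscillation inside the LF band
hf_dominant = lf_hf_ratio(synthetic_rr(0.25))  # oscillation inside the HF band
```

On the synthetic series, an oscillation at 0.10 Hz gives a ratio above 1 and an oscillation at 0.25 Hz gives a ratio below 1. A production pipeline would more likely use a windowed or Welch-type spectral estimate.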

Item 7 (Sample size). Sample size justification.

This paper. n=15 subjects, dictated by WESAD. Section 4.6 explicitly states this is not powered as a confirmatory study; it is a research-stage demonstration. Section 6 (Limitations) acknowledges the small n.

Item 8 (Missing data). Handling of missing data.

This paper. No subject-level missingness; all 15 subjects contributed one baseline and one stress LF/HF measurement. Stated in Section 4.6.

Item 9 (Statistical analysis). Statistical methods used.

This paper. LOSO cross-validation (15 folds, each with 14 train + 1 test); per-fold AUROC; paired Wilcoxon signed-rank test comparing per-fold AUROC under the two preprocessing regimes; bootstrap confidence intervals. Section 4.6.
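The fold structure described above can be sketched in a few lines. This is an illustration only: the helper name `loso_folds` is hypothetical, and the subject labels are assumed from the public WESAD release (S2-S17 without S12).

```python
def loso_folds(subject_ids):
    """Yield (train_ids, test_id) pairs, holding out one subject per fold."""
    for held_out in subject_ids:
        train = [s for s in subject_ids if s != held_out]
        yield train, held_out


# Subject labels assumed from the public WESAD release (S2-S17 without S12).
subjects = [f"S{i}" for i in range(2, 18) if i != 12]
folds = list(loso_folds(subjects))

for train_ids, test_id in folds[:2]:
    print(test_id, "held out;", len(train_ids), "subjects used for training")
```

Splitting by subject rather than by window keeps all of a subject's data on one side of the split, which is the point of LOSO.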

Item 10a (Model development). Predictors selected, methods for model specification.

This paper. Single-predictor logistic regression. No predictor selection required. Section 4.6.

Item 10b (Model specification). Final model specification.

This paper. Logistic regression with one continuous predictor (LF/HF ratio). Coefficients reported in supplement.

Item 10c (Model performance). Discrimination, calibration, etc.

This paper. AUROC reported (mean and per-fold). Calibration not reported because the analysis focus is on relative performance between two preprocessing regimes, not on calibration of the classifier itself. This is stated as a limitation.
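As a reminder of what the discrimination metric measures, AUROC can be written as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The sketch below assumes label 1 denotes stress and 0 denotes baseline; the function name `auroc` and the toy values are illustrative, not the paper's code or data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical, not WESAD output).
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the metric depends only on ranks, it says nothing about whether predicted probabilities are calibrated, which is why calibration is a separate reporting item.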

Item 10d (Internal validation). Description of internal validation.

This paper. LOSO is the internal validation strategy. Section 4.6.

AI-specific items¶

Item AI-1 (Software/version). Software and version used.

This paper. Python 3.10/3.11/3.12, scikit-learn pinned in pyproject.toml. CI matrix tests all three Python versions.

Item AI-2 (Reproducibility). Code and data availability.

This paper. All code public at https://github.com/ceyhunolcan/biomedical-signal-forensics-lab under MIT. WESAD is public via the original release. Section 7 (Reproducibility).

Item AI-3 (Computational resources). Computational requirements.

This paper. The full pipeline runs on a laptop in under 10 minutes; no GPU required. Stated in repository README.

Item AI-4 (Fairness and bias). Considerations of fairness and bias.

This paper. The synthetic cohort (Sections 3-4.5) was constructed specifically to inject known fairness disparities, including device-family and skin-tone effects, and to test whether the audit recovers them. WESAD itself is not demographically diverse (15 subjects, demographics not stratified); this limitation is stated in Section 6.

Results¶

Item 11 (Risk groups). If applicable, definition of risk groups.

This paper. Not applicable (binary classification, no stratified risk groups).

Item 12 (Development vs validation). Numbers in each set.

This paper. LOSO: every subject contributes both as training (in 14 folds) and as test (in 1 fold).

Item 13a (Participants). Flow of participants through the study.

This paper. See Figure 1 flow diagram.

Item 13b (Performance). Model performance with confidence intervals.

This paper. AUROC 0.804 (95% CI from bootstrap) under raw preprocessing; 0.823 under cleaned preprocessing; paired Wilcoxon p = 1.5e-4 (n=15). Section 4.6.

Item 14a (Final model). Specification of the final model.

This paper. Final model is the logistic regression with LF/HF as predictor; coefficients in supplement.

Item 14b (Performance interpretation).

This paper. Section 4.6 explicitly notes that the AUROC difference is small in absolute terms (0.823 - 0.804 = 0.019) but points in the same direction across all 15 subjects, which is why the paired Wilcoxon test yields a small p-value despite the small effect.
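The reasoning in item 14b can be checked with a short worked computation. Under the exact signed-rank null distribution all 2^n sign patterns are equally likely, so the p-value depends on how consistently the differences point one way, not on their size. The sketch assumes the exact test with no zero or tied differences; whether the paper used the exact distribution or a normal approximation is not stated here, and the differences below are made up rather than the paper's per-fold values.

```python
def exact_wilcoxon_signed_rank(diffs):
    """Return (W_plus, two-sided exact p) for paired differences.

    Assumes no zero differences and no ties among the absolute differences.
    """
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    total = n * (n + 1) // 2

    # counts[s] = number of the 2**n sign patterns whose positive-rank sum is s.
    counts = [1] + [0] * total
    for rank in range(1, n + 1):
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]

    w = min(w_plus, total - w_plus)
    p_two_sided = min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)
    return w_plus, p_two_sided


# Made-up differences (cleaned minus raw AUROC), NOT the paper's per-fold values.
all_positive = [0.001 * (i + 1) for i in range(15)]
one_reversed = [-all_positive[0]] + all_positive[1:]

w_all, p_floor = exact_wilcoxon_signed_rank(all_positive)
w_one, p_one = exact_wilcoxon_signed_rank(one_reversed)
print(f"all 15 positive: W+={w_all}, p={p_floor:.2e}")
print(f"smallest difference reversed: W+={w_one}, p={p_one:.2e}")
```

With 15 untied pairs the smallest attainable two-sided exact p-value is 2/2^15, about 6.1e-5, so a p-value of order 1e-4 implies that nearly all pairs favour the same regime, whatever the size of the differences.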

Discussion¶

Item 15 (Interpretation). Overall interpretation including comparison with other models.

This paper. Section 5 interprets the downstream result within the broader audit framing. No comparison with other published classifiers is attempted because the contribution is the audit, not the classifier.

Item 16 (Limitations). Study limitations.

This paper. Section 6 enumerates: small n; single-site dataset; single predictor; absence of external validation; no calibration analysis.

Item 17 (Implications). Implications for clinical practice and future research.

This paper. Section 5 states the downstream result is a research-stage signal that the SQI-binarisation choice has measurable downstream effect, not a deployment claim.

Other information¶

Item 18 (Data sharing). Statement on data sharing.

This paper. WESAD is public. All derived artefacts (window tables, summary JSONs) are in the repository under results/real_data/wesad_deep/.

Item 19 (Funding). Funding statement.

This paper. No external funding; self-supported research. To be stated in the author affiliation block once finalised.

Summary of compliance¶

Category	Items	Compliance
Title and abstract	2	Compliant
Introduction	2	Compliant
Methods	12	Compliant except item 10c (calibration not reported; acknowledged in Section 6)
AI-specific	4	Compliant
Results	6	Compliant
Discussion	3	Compliant
Other	2	Compliant once author block is finalised

Items not addressed: TRIPOD+AI item 10c calibration analysis. The analysis focus is on relative performance between preprocessing regimes rather than on absolute calibration of the downstream classifier. This is an acknowledged limitation rather than a TRIPOD+AI violation: calibration is out of scope for the audit framing of Section 4.6. A future deployment-style study should add reliability diagrams and a Hosmer-Lemeshow or Spiegelhalter test before clinical use is considered.
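For a future deployment-style study, the data behind a reliability diagram can be produced by binning predicted probabilities and comparing each bin's mean prediction with its observed event rate. The following is a minimal sketch under assumed equal-width bins; the function name `reliability_bins` and the toy inputs are hypothetical, and no such analysis exists in the current paper.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted prob, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            rate = sum(y for y, _ in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows


# Toy outcomes and predicted probabilities (hypothetical, not model output).
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
rows = reliability_bins(y_true, y_prob)
for mean_p, rate, count in rows:
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

Plotting observed rate against mean predicted probability gives the reliability diagram; with only 15 subjects the per-bin counts would be very small, which is one practical reason the analysis is deferred.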

Items to address before submission: item 19 (author block / funding statement). Tracked in the npj DM polish queue.

If the include above renders as raw text in your viewer, the full checklist is at paper/checklists/tripod_ai_checklist.md.

STARD 2015¶

STARD 2015 compliance checklist¶

Reference: Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527. doi:10.1136/bmj.h5527

Applicability to this paper: STARD 2015 applies most directly to Arms 1-3 of Figure 1: HR agreement (Section 4.1), SQI agreement (Section 4.2), and threshold recalibration (Section 4.3). Chest ECG is the reference standard and wrist PPG-derived measurements are the index modality; the one exception is the SQI comparison in Section 4.2, where the comparator is each published SQI baseline's own pass/fail decision (see item 13b). The 30 STARD 2015 items are listed below with a per-item compliance statement.

The wording of each item is paraphrased from the published checklist; the original BMJ paper contains the full text.

Title and abstract¶

Item 1. Identification as a study of diagnostic accuracy using at least one measure of accuracy (sensitivity, specificity, predictive values, likelihood ratios, area under ROC curve, etc.).

This paper. Section 4.2 reports Cohen's kappa as the primary agreement metric between four SQI binarisations on the same 6,585 windows; Section 4.1 reports Bland-Altman bias and limits of agreement on continuous HR. Both are explicit accuracy / agreement measures.

Item 2. Structured summary of study design, methods, results, and conclusions.

This paper. Abstract is structured.

Introduction¶

Item 3. Scientific and clinical background, including the intended use and clinical role of the index test.

This paper. Sections 1-2 motivate the audit. The "index modality" (wrist PPG) is positioned as a candidate component of consumer-wearable physiological monitoring; the comparator (chest ECG) is positioned as the established reference. The intended use of the SQI is to flag low-quality windows for exclusion before downstream analysis.

Item 4. Study objectives and hypotheses.

This paper. Section 1 states three objectives: (a) audit synthetic-tuned SQI thresholds on real data, (b) compare multiple published SQI baselines, (c) test whether SQI choice has measurable downstream effect.

Methods¶

Study design¶

Item 5. Whether data collection was planned before the index test and reference standard were performed (prospective) or after (retrospective).

This paper. Retrospective use of the public WESAD dataset. Stated in Section 3.1.

Participants¶

Item 6. Eligibility criteria.

This paper. All 15 subjects in the public WESAD release. Stated in Section 3.1. The two subjects (S1, S12) excluded by WESAD authors before release are noted in Figure 1.

Item 7. On what basis potentially eligible participants were identified.

This paper. WESAD public release; no additional selection step in this study.

Item 8. Where and when potentially eligible participants were identified (setting, location and dates).

This paper. WESAD acquisition: lab-based, University of Siegen, 2018. This study performed in 2025-2026.

Item 9. Whether participants formed a consecutive, random, or convenience series.

This paper. Convenience series (all available WESAD subjects). Stated in Section 3.1.

Test methods¶

Item 10a (Index test). How and by whom the index test was performed.

This paper. Index modality: wrist Empatica E4 PPG, 64 Hz, automated pulse-peak detection and HR computation. Algorithms in src/signals/ppg_processing.py. Algorithm version pinned by repository tag.

Item 10b (Reference standard). How and by whom the reference standard was performed.

This paper. Reference modality: chest RespiBAN ECG, 700 Hz, Pan-Tompkins R-peak detection followed by RR interval computation. Algorithms in src/signals/ecg_processing.py.
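The step from detected R-peaks to RR intervals and heart rate is simple arithmetic at the stated 700 Hz sampling rate. The sketch below is illustrative only: the function names and the peak indices are hypothetical and are not taken from src/signals/ecg_processing.py, and the R-peak detector itself is out of scope here.

```python
ECG_FS_HZ = 700  # RespiBAN ECG sampling rate stated in item 10b


def rr_intervals_s(r_peak_samples, fs_hz=ECG_FS_HZ):
    """Convert consecutive R-peak sample indices into RR intervals in seconds."""
    return [(b - a) / fs_hz for a, b in zip(r_peak_samples, r_peak_samples[1:])]


def mean_hr_bpm(rr_s):
    """Mean heart rate in beats per minute over a run of RR intervals."""
    return 60.0 * len(rr_s) / sum(rr_s)


# Hypothetical R-peak indices, evenly spaced 560 samples apart.
peaks = [0, 560, 1120, 1680, 2240]
rr = rr_intervals_s(peaks)
hr = mean_hr_bpm(rr)
print(rr, hr)
```

Here 560 samples at 700 Hz correspond to an RR interval of 0.8 s, that is 75 bpm.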

Item 11. Rationale for choosing the reference standard.

This paper. Chest ECG is the established gold standard for instantaneous heart rate and HRV in physiological research (HRV Task Force standards; Schaefer and Vagedes 2013). Stated in Section 3.

Item 12a. Definition of and rationale for test positivity cutoffs.

This paper. Four SQI binarisations are compared. The in-house cutoff (0.70) was inherited from prior synthetic-data tuning. The three published cutoffs (Orphanidou, Sukor, Elgendi) follow the published recommendations. The lack of agreement among these four constitutes the central audit finding (Section 4.2).
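One way to quantify the disagreement is the share of windows that pass the in-house cutoff but fail every published method. The sketch below makes several assumptions that should be checked against Section 4.2: a window passes when its score is at or above 0.70, the denominator is the set of in-house-accepted windows, and the published methods are represented only by toy pass/fail flags rather than by their actual algorithms.

```python
IN_HOUSE_CUTOFF = 0.70  # in-house SQI cutoff stated in item 12a


def accepted_but_rejected_by_all(in_house_sqi, published_pass):
    """Fraction of in-house-accepted windows that every published method rejects.

    published_pass maps a method name to a list of booleans (True = window passes).
    """
    accepted = [i for i, s in enumerate(in_house_sqi) if s >= IN_HOUSE_CUTOFF]
    if not accepted:
        return 0.0
    flagged = [
        i for i in accepted
        if not any(flags[i] for flags in published_pass.values())
    ]
    return len(flagged) / len(accepted)


# Toy scores and flags for five windows (hypothetical, not WESAD output).
in_house = [0.90, 0.80, 0.75, 0.60, 0.72]
published = {
    "orphanidou": [True, False, False, False, False],
    "sukor": [True, False, True, False, False],
    "elgendi": [False, False, False, True, False],
}
fraction = accepted_but_rejected_by_all(in_house, published)
print(fraction)
```

In the toy data four windows pass the in-house cutoff and two of them are rejected by all three published flags, so the fraction is 0.5.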

Item 12b. Whether clinical information and reference standard results were available to performers of the index test.

This paper. Both modalities are automated; no human reader involved at test execution. The SQI computation is blind to the ECG-derived HR by construction.

Item 13a. Whether clinical information and index test results were available to assessors of the reference standard.

This paper. Both modalities are automated; not applicable.

Item 13b. Definition of test positivity (cutoff) for the reference standard.

This paper. The "reference HR" is the continuous output of the ECG pipeline; there is no positivity cutoff. For Section 4.2, the reference comparator is each published SQI baseline's own pass/fail decision.

Analysis¶

Item 14. Methods for estimating or comparing measures of diagnostic accuracy.

This paper. Bland-Altman bias and 95% limits of agreement (Section 4.1); Cohen's kappa pairwise between four SQI binarisations (Section 4.2); fraction of windows on which all three published methods reject (Section 4.2); paired Wilcoxon signed-rank for downstream classifier comparison (Section 4.6). All methods implemented in src/evaluation/deep_real_analysis.py.
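For reference, the Bland-Altman quantities named above reduce to the mean and standard deviation of the paired differences. This is a minimal sketch, not the code in src/evaluation/deep_real_analysis.py: the function name and the heart-rate values are hypothetical, the difference is taken as index minus reference, and windows are treated as independent.

```python
import statistics


def bland_altman(index_hr, reference_hr):
    """Return (bias, lower LoA, upper LoA) for paired measurements."""
    diffs = [a - b for a, b in zip(index_hr, reference_hr)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Toy heart rates in bpm (hypothetical, not WESAD output).
ppg_hr = [72, 80, 65, 90, 78]
ecg_hr = [70, 79, 66, 85, 77]
bias, lower, upper = bland_altman(ppg_hr, ecg_hr)
print(f"bias={bias:.2f} bpm, 95% LoA=({lower:.2f}, {upper:.2f}) bpm")
```

When many windows come from the same subject, a repeated-measures variant of the limits of agreement may be more appropriate than this simple form.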

Item 15. How indeterminate index test or reference standard results were handled.

This paper. Zero indeterminate results in either modality. Reported as the "Exclusions during per-window processing: 0" box in Figure 1.

Item 16. How missing data on the index test and reference standard were handled.

This paper. No subject-level missingness among the 15 released subjects; the two subjects excluded by the WESAD authors before release (S1, S12) are missing from WESAD itself, not from this study. Reported in Figure 1.

Item 17. Any analyses of variability in diagnostic accuracy, distinguishing pre-specified from exploratory.

This paper. Pre-specified: per-state stratification (baseline vs stress vs amusement). Exploratory: per-subject heterogeneity (range 6.75-26.02 bpm). Stated in Section 4.1 and explicitly flagged as exploratory.

Item 18. Intended sample size and how it was determined.

This paper. n=15 subjects (all available WESAD subjects); 6,585 windows followed mechanically from 5-second non-overlapping segmentation. No formal power calculation; this is an exploratory audit. Acknowledged in Section 6.
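The window count follows from the segmentation rule alone. The sketch below shows 5-second non-overlapping windowing at the two sampling rates stated in items 10a and 10b; the function name and the one-minute example length are illustrative assumptions, and any trailing partial window is dropped.

```python
def window_bounds(n_samples, fs_hz, window_s=5):
    """Return (start, stop) sample indices of non-overlapping full windows."""
    size = int(fs_hz * window_s)
    return [(start, start + size) for start in range(0, n_samples - size + 1, size)]


# One minute of signal at each stated sampling rate (example length is arbitrary).
ppg_windows = window_bounds(n_samples=64 * 60, fs_hz=64)    # wrist PPG, 64 Hz
ecg_windows = window_bounds(n_samples=700 * 60, fs_hz=700)  # chest ECG, 700 Hz
print(len(ppg_windows), ppg_windows[0], len(ecg_windows), ecg_windows[-1])
```

As a rough check on scale, 6,585 windows of 5 seconds amount to 32,925 seconds of signal, or about 36.6 minutes per subject if spread evenly over 15 subjects.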

Results¶

Participants¶

Item 19. Flow of participants, using a diagram.

This paper. Figure 1 (the flow diagram at paper/figures/fig_flow_diagram.png).

Item 20. Baseline demographic and clinical characteristics.

This paper. WESAD demographics are stated in Schmidt et al. 2018; not re-tabulated here. This is a limitation acknowledged in Section 6.

Item 21a. Distribution of severity of disease in those with the target condition.

This paper. Not applicable (no disease target).

Item 21b. Distribution of alternative diagnoses in those without the target condition.

This paper. Not applicable.

Item 22. Time interval and any clinical interventions between index test and reference standard.

This paper. Synchronous acquisition; both modalities recorded simultaneously. Stated in Section 3.1.

Test results¶

Item 23. Cross tabulation of the index test results by the results of the reference standard.

This paper. Section 4.2 reports the four-way SQI kappa matrix; this is the analog of a cross-tabulation for the SQI-as-test framing. The HR agreement (Section 4.1) is continuous, so Bland-Altman replaces the cross-tabulation.

Item 24. Estimates of diagnostic accuracy and their precision (e.g. 95% confidence intervals).

This paper. Bland-Altman LoA reported with 95% bounds; bootstrap CIs for Pearson r. Kappa estimates reported as point estimates; bootstrap CIs not yet computed for kappa (acknowledged as a planned addition for next revision).
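The planned kappa intervals could be obtained with a percentile bootstrap. The following is a sketch of one possible approach, not the planned implementation: the function names, the 2,000 resamples, the fixed seed, and the toy decisions are assumptions, and windows are resampled as if independent.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two binary decision vectors (1 = pass, 0 = fail)."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_exp = pa * pb + (1 - pa) * (1 - pb)
    if p_exp == 1.0:
        return float("nan")  # undefined when both vectors are constant
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_kappa_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling windows with replacement."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([a[i] for i in idx], [b[i] for i in idx])
        if k == k:  # drop NaN resamples
            stats.append(k)
    stats.sort()
    lower = stats[int(alpha / 2 * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Toy pass/fail decisions for 40 windows (hypothetical, not WESAD output).
method_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1] * 4
method_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1] * 4
point = cohen_kappa(method_a, method_b)
ci = bootstrap_kappa_ci(method_a, method_b)
print(f"kappa={point:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the 6,585 windows are nested within 15 subjects, resampling whole subjects rather than individual windows would likely give more honest interval widths.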

Item 25. Any adverse events from performing the index test or the reference standard.

This paper. Not applicable (non-invasive sensors, retrospective data analysis only).

Discussion¶

Item 26. Study limitations, including sources of potential bias, statistical uncertainty, and generalisability.

This paper. Section 6 enumerates: single-site dataset, small n, lab-controlled (not free-living), absence of demographic stratification in WESAD itself, single-vendor wearable (E4), and the in-house SQI was tuned on synthetic data rather than held out from real data.

Item 27. Implications for practice, including the intended use and clinical role of the index test.

This paper. Section 5 states the central practical implication: a signal-quality threshold tuned on synthetic data should be re-validated on real data before deployment; failure to do so risks accepting windows that all available published baselines reject.

Other information¶

Item 28. Registration number and name of registry.

This paper. This is a methodology audit, not a clinical trial; trial registration does not apply.

Item 29. Where the full study protocol can be accessed.

This paper. Protocol is the repository itself; pinned by release tag. https://github.com/ceyhunolcan/biomedical-signal-forensics-lab/releases

Item 30. Sources of funding and other support; role of funders.

This paper. No external funding; self-supported research. To be stated in the author affiliation block once finalised.

Summary of compliance¶

Category	Items	Compliance
Title and abstract	2	Compliant
Introduction	2	Compliant
Methods	14	Compliant; item 18 (formal power calculation) intentionally absent for exploratory audit
Results	7	Compliant; item 24 kappa CIs planned for next revision
Discussion	2	Compliant
Other	3	Compliant once author block is finalised

Items partially addressed:

Item 20 (baseline demographics): the paper refers reviewers to WESAD's own demographics rather than re-tabulating; acknowledged in Section 6.
Item 24 (precision estimates): Bland-Altman LoA and Pearson r have 95% bootstrap CIs reported; kappa point estimates do not yet have CIs. Bootstrap kappa CIs are a planned addition.

Items not applicable: 21a, 21b, 25, 28 (no disease target, no clinical-trial framing).

If the include above renders as raw text, the full checklist is at paper/checklists/stard_2015_checklist.md.

CONSORT-AI: applicability assessment¶

Reference: Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK (on behalf of the SPIRIT-AI and CONSORT-AI Working Group). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine 2020;26(9):1364-1374. doi:10.1038/s41591-020-1034-x

Does CONSORT-AI apply to this paper? No.¶

CONSORT-AI extends the CONSORT 2010 reporting standard for randomised controlled trials, adding 14 AI-specific items on top of the base CONSORT checklist. It is intended for clinical trial reports in which an AI-based intervention is evaluated against a comparator (sham, standard care, or another AI system).

This paper is a retrospective methodology audit of physiological signal processing. It contains:

no participant randomisation,
no clinical intervention,
no comparator arm in the trial sense,
no clinical outcome,
no enrolment of human subjects beyond use of an already-collected and already-released public dataset (WESAD).

CONSORT-AI is therefore not a relevant reporting standard for the work presented. The paper references CONSORT-AI for completeness and to signal awareness of the broader EQUATOR network landscape for digital-health AI.

What would CONSORT-AI require if this toolkit were deployed in a trial?¶

A future clinical trial evaluating a signal-quality auditing pipeline derived from this toolkit (for example, a randomised trial comparing clinical decisions made with vs without the toolkit's quality flagging) would need to report:

AI-1: instructions on integrating the AI intervention into the trial setting, including the version of the toolkit deployed, the dependency pin, and the operating environment.
AI-2: the input data handling pipeline (signal acquisition, windowing, pre-processing), including any data exclusion logic.
AI-3: the output (signal-quality decision) and how it was acted on by clinicians or downstream systems.
AI-4: human-AI interaction during the trial (whether clinicians could override the toolkit's quality flag).
AI-5: error analysis, including any human-detected errors in the AI output.

The repository would supply most of the AI-1, AI-2, AI-3 content as provenance metadata; AI-4 and AI-5 would be trial-specific.
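As an illustration of what such provenance metadata might look like, the sketch below records the interpreter version, platform, and installed package versions at run time. The function name, the field names, and the choice of packages are hypothetical; the release tag is left empty because it must come from the deployment itself.

```python
import json
import platform
import sys
from importlib import metadata


def provenance_record(packages=("scikit-learn",), release_tag=None):
    """Collect a small, JSON-serialisable description of the run environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "release_tag": release_tag,  # hypothetical field; supply the deployed tag
    }


record = provenance_record()
print(json.dumps(record, indent=2))
```

Writing such a record next to every result file would let a trial report state exactly which toolkit version and dependency set produced each quality decision.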

This document exists to make the inapplicability explicit and to provide a roadmap for any group that does choose to use this toolkit in a future prospective clinical evaluation.

DECIDE-AI: applicability assessment¶

Reference: Vasey B, Nagendran M, Campbell B, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine 2022;28(5):924-933. doi:10.1038/s41591-022-01772-9

Does DECIDE-AI apply to this paper? No.¶

DECIDE-AI is a 27-item reporting guideline specifically for the early-stage clinical evaluation of AI-based decision support systems being used by clinicians on real patients in real clinical settings. It sits in the EQUATOR landscape between DEVELOPMENT-stage standards (TRIPOD+AI) and TRIAL-stage standards (CONSORT-AI), filling the gap of "first clinical use under monitoring."

This paper presents a methodology audit using retrospective public data (WESAD). The pipeline described has not been used by any clinician, on any patient, in any clinical setting. DECIDE-AI is therefore not applicable.

The paper references DECIDE-AI for completeness: future work using this toolkit might enter that phase, and readers deserve a clear pointer to the relevant standard for that next step.

What would DECIDE-AI require if this toolkit entered early clinical use?¶

If a clinical team deployed the SQI auditing pipeline in a real care setting under monitoring, DECIDE-AI would require reporting on:

the clinical setting and stage of deployment, including how patients were selected and what care decisions the AI output informed,
the version of the AI system deployed (the repository release tag would provide this directly),
the human-AI interaction model, including whether clinicians override the AI output and how often,
patient-relevant outcomes and a safety profile,
learning curves and performance changes during the early-use period,
ethical, regulatory, and consent considerations specific to the deployment site.

None of these items can be addressed in the current paper because none of those activities have occurred. This document exists to flag DECIDE-AI as the right standard for the next step, not as one this paper claims to satisfy.

Why include this document at all if the standard does not apply?¶

Reviewers familiar with the EQUATOR landscape may reasonably ask whether the authors considered DECIDE-AI. Including an explicit applicability assessment makes the answer transparent: the standard was considered, the work does not fall within its scope, and a future deployment study should return to it.