API reference¶

The toolkit is organised into the following modules. Click any module to see its public functions, classes, and constants, auto-generated from docstrings via mkdocstrings.

If you are looking for high-level guidance instead of API specifics, see Getting started or Reproducing the paper.

Signals¶

PPG and ECG preprocessing, along with four signal-quality baselines.

`biomedical_signal_forensics_lab.signals` ¶

Signal processing: ECG, PPG, HRV, sleep, activity, quality indices.

`activity_processing` ¶

Activity / step-count reliability.

`ecg_processing` ¶

ECG processing: bandpass, R-peak detection, RR intervals, SQI.

Deliberately classical methods. No deep learning here: the point is that the downstream forensics layer should work even when the signal processing is plain-vanilla.

`detect_r_peaks(signal, fs, refractory_ms=250)` ¶

Very simple amplitude-threshold detector after bandpass + squaring.

`signal_quality_index(signal, fs)` ¶

A bounded [0, 1] SQI: penalize flatness, saturation, and very high noise.

`elgendi_sqi` ¶

Elgendi 2016 PPG signal-quality indices.

Implements the skewness-based SQI (SSQI) from:

Elgendi M. Optimal Signal Quality Index for Photoplethysmogram Signals. Bioengineering, 3(4):21, 2016. https://doi.org/10.3390/bioengineering3040021

Elgendi (2016) compared eight candidate per-window PPG SQI metrics on 106 annotated 60-second recordings labeled as 'excellent', 'acceptable', or 'unfit for diagnosis' by two independent reviewers with adjudication by a third expert. Of the eight metrics, the skewness-based index (SSQI) had the highest F1 score for the binary acceptable-versus-unfit decision and is recommended as the headline metric. We expose SSQI as the primary baseline and provide kurtosis (KSQI) and entropy (ESQI) as auxiliary metrics for robustness checks.

The three statistics together cover a complementary set of failure modes

SSQI: distribution asymmetry. A clean PPG has positive skew (long upper tail from systolic peaks). Saturated or flat signals lose this asymmetry.
KSQI: distribution tailedness. Heavy-tailed distributions indicate spike artifacts.
ESQI: amplitude entropy in moving sub-windows. Highly random amplitude distributions indicate broadband noise.

Elgendi's recommended thresholds were tuned on fingertip PPG; we report the continuous statistics and let the recalibration machinery (Section 4.3) re-tune binarization for wrist PPG. Default acceptance thresholds: SSQI > 0 (any positive skew) KSQI > 3 (super-Gaussian: heavier tail than normal) ESQI < 0.9 (not pure noise)

These defaults are intentionally lenient; users should recalibrate against either expert labels or a reference SQI on their own data.

Each function operates on a single window (typically 5-30 seconds at 64 Hz for wrist PPG or 125 Hz for fingertip PPG) and returns an ElgendiSQI dataclass with the continuous statistics, the per-statistic acceptance flags, and the overall binary acceptance verdict.

`ElgendiSQI` `dataclass` ¶

Per-window Elgendi 2016 SQI report.

Attributes:

Name	Type	Description
`acceptable`	`bool`	overall pass (SSQI > SSQI_MIN AND KSQI > KSQI_MIN AND ESQI < ESQI_MAX). All three must pass for the window to be accepted under the default policy. Single-statistic users can inspect ssqi_ok, ksqi_ok, esqi_ok individually.
`skewness`	`float`	third standardized moment of the window after bandpass. Clean PPG has positive skew; SSQI is its raw value.
`kurtosis`	`float`	fourth standardized moment (Fisher convention, so normal-distribution kurtosis = 0). KSQI is `kurtosis + 3` in the Pearson convention used in Elgendi 2016.
`entropy`	`float`	normalized Shannon entropy of the amplitude distribution computed over a 16-bin histogram, scaled to [0, 1].
`ssqi_ok,`	`(ksqi_ok, esqi_ok)`	per-statistic acceptance flags.
`n_samples`	`int`	number of samples in the window after bandpass.

`elgendi_window(window, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, apply_bandpass=True, invert=False)` ¶

Compute Elgendi 2016 SSQI / KSQI / ESQI on a single PPG window.

Parameters:

Name	Type	Description	Default
`window`	`ndarray`	1D PPG signal samples.	required
`fs`	`int`	sampling rate in Hz.	required
`ssqi_min`	`float`	SSQI acceptance threshold. See module constants.	`SSQI_MIN`
`ksqi_min`	`float`	KSQI acceptance threshold. See module constants.	`KSQI_MIN`
`esqi_max`	`float`	ESQI acceptance threshold. See module constants. docstring for defaults and recommended recalibration policy.	`ESQI_MAX`
`apply_bandpass`	`bool`	if True, run the standard PPG bandpass filter (0.5-8 Hz, 4th-order Butterworth) before computing statistics.	`True`
`invert`	`bool`	if True, multiply the signal by -1 before computing statistics. Some devices (Empatica E4 wrist PPG in particular) report PPG inverted relative to fingertip conventions, which flips the sign of skewness. Use `batch_elgendi(..., auto_invert=True)` to auto-detect from the cohort.	`False`

Returns:

Type	Description
`ElgendiSQI`	ElgendiSQI dataclass with the continuous statistics and acceptance
`ElgendiSQI`	flags. All NaN/empty edge cases return acceptable=False with
`ElgendiSQI`	statistics set to NaN.

`batch_elgendi(windows, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, auto_invert=True, pilot_size=100)` ¶

Compute Elgendi SQI on a batch of windows.

Parameters:

Name	Type	Description	Default
`windows`	`ndarray`	2D array of shape (n_windows, n_samples) or a list of 1D arrays. Variable-length windows are accepted by passing a list.	required
`fs`	`int`	sampling rate in Hz.	required
`auto_invert`	`bool`	if True (default), compute Elgendi statistics on the first `pilot_size` windows, and if the median skewness across that pilot is negative, invert the entire batch before computing the full result. This handles devices that report PPG with reversed polarity (e.g., Empatica E4 wrist PPG). Set to False to disable polarity detection.	`True`
`pilot_size`	`int`	number of windows used for polarity detection.	`100`

Returns:

Type	Description
`DataFrame`	DataFrame with one row per window containing all ElgendiSQI fields
`DataFrame`	plus a `polarity_inverted` boolean column (True if auto_invert
`DataFrame`	decided to invert).

`four_way_sqi_agreement(in_house_pass, orphanidou_pass, sukor_pass, elgendi_pass)` ¶

Compute pairwise Cohen's kappa and Spearman rho for all four SQIs.

The 'all three published baselines agree against in-house' argument becomes stronger when there are three baselines instead of two; the headline statistic is the median pairwise kappa between Orphanidou, Sukor, and Elgendi.

Returns:

Type	Description
`dict`	Dict with pairwise kappas, fraction of windows accepted by all
`dict`	published baselines, fraction accepted by in-house but not by any
`dict`	published baseline (the most damning category for the in-house SQI),
`dict`	and the median pairwise published-vs-published kappa.

`hrv_processing` ¶

HRV metrics: RMSSD, SDNN, mean RR, artifact-corrected variants.

Operates on RR interval arrays (in milliseconds).

`artifact_corrected_rmssd(rr_ms, lo=300, hi=2000)` ¶

Drop impossible RRs, linearly interpolate, then compute RMSSD.

`orphanidou_sqi` ¶

Template-matching signal quality, following Orphanidou et al. 2015.

Reference

Orphanidou, C., Bonnici, T., Charlton, P., Clifton, D., Vallance, D., & Tarassenko, L. (2015). Signal-quality indices for the electrocardiogram and photoplethysmogram derived from optical pulse-rate measurements. IEEE J. Biomed. Health Inform., 19(3), 832-838.

The method has four rules that a window must satisfy to be 'acceptable': 1. Heart-rate estimated from peaks is in [40, 180] bpm 2. The maximum RR-interval gap is below a fixed threshold (3 s here) 3. The maximum RR/min RR ratio is below 2.2 4. The average correlation between each beat and the median beat template is above 0.66 for ECG (0.86 for PPG)

We expose

orphanidou_ecg_sqi(window, fs) -> dict with continuous metrics and bool
orphanidou_ppg_sqi(window, fs) -> same shape, different thresholds
head_to_head(windows, ground_truth, fs, modality) -> agreement summary

The point of including this is not to compete with the in-house SQI but to demonstrate that the in-house SQI agrees with an established published method on the same windows. Agreement on synthetic data is necessary but not sufficient. Anyone who runs the framework on real data should run this side-by-side and report both.

`orphanidou_ecg_sqi(window, fs)` ¶

Apply Orphanidou's four ECG rules to a single window.

`orphanidou_ppg_sqi(window, fs)` ¶

Apply Orphanidou's four PPG rules to a single window.

Note: the PPG variant uses a stricter template correlation threshold because clean PPG pulses are more morphologically consistent than QRS complexes.

`head_to_head(windows, in_house_sqi, fs, modality='ppg', in_house_threshold=0.7)` ¶

Compare the in-house SQI estimates against Orphanidou's rules.

Parameters¶

windows (n_windows, window_len) array of signal windows. in_house_sqi (n_windows,) array of the in-house continuous SQI in [0, 1]. If shorter than windows, both arrays are truncated to the common length and a warning is logged. fs Sampling rate of the windows. modality 'ecg' or 'ppg'. Switches the Orphanidou template threshold. in_house_threshold Threshold for binarizing the in-house SQI when computing the crosstab. Default 0.7.

`ppg_processing` ¶

PPG processing: smoothing, pulse-peak detection, motion estimation, SQI.

`motion_artifact_score(signal, fs)` ¶

High-frequency / low-frequency power ratio. Higher → more motion.

`signal_quality` ¶

Aggregate signal-quality scoring across ECG, PPG, and daily summaries.

`daily_signal_quality(df)` ¶

A simple per-day quality estimate from wear-time and motion proxy.

Missing source columns are treated as 'no information'. Returns a Series of NaN of the same length as df when both source columns are absent.

`sleep_processing` ¶

Sleep-summary reliability metrics.

`regularity_index(durations)` ¶

1 - normalized std of sleep duration. Higher = more regular.

`sukor_sqi` ¶

Sukor 2011 PPG signal-quality decision rules.

Implements the four-feature decision-tree style SQI from:

Sukor JA, Redmond SJ, Lovell NH. Signal quality measures for pulse oximetry through waveform morphology analysis. Physiological Measurement, 2011.

Sukor's method is a published PPG-specific SQI baseline. We use it alongside Orphanidou 2015 to give the in-house SQI a two-witness comparison rather than relying on a single published baseline.

Sukor's four features per detected pulse: 1. Pulse waveform shape: ratio of systolic peak amplitude to diastolic peak 2. Pulse-to-pulse interval consistency (similar to Orphanidou's RR check) 3. Pulse amplitude variability 4. Trough-to-trough interval consistency

A pulse is acceptable if all four features fall within their published ranges. A window is acceptable if at least 80% of its detected pulses are acceptable.

This is a faithful re-implementation of the published thresholds, adapted to window-level scoring for compatibility with our pipeline.

`sukor_ppg_sqi(window, fs)` ¶

Apply Sukor 2011 four-feature rules to a single PPG window.

`batch_sukor(windows, fs)` ¶

Run Sukor SQI on a batch of windows.

Evaluation¶

WESAD real-data validation, four-way SQI agreement, Bland-Altman, bootstrap CIs.

`biomedical_signal_forensics_lab.evaluation` ¶

Evaluation: metrics, robustness, calibration, sim-to-real comparison.

`calibration` ¶

Reliability-diagram-style calibration data.

`cross_cohort_check` ¶

Cross-cohort generalization check.

A partial defense against the 'circular evaluation' critique. We build a second synthetic cohort with parameters intentionally different from the defaults (different baselines, different effect signs, different within-participant noise), then run the same audit framework on it and report:

The trust-score distribution shifts in the expected direction.
The fairness audit recovers the new injected effects.
The Orphanidou agreement holds (it should: the SQI computation doesn't depend on the generative parameters).
The bootstrap test-retest reliability tracks the new within-participant noise scale (more noise → lower r).

This is not real-data validation. It's a sanity check that the framework's outputs are sensitive to the cohort's properties rather than to the particular default parameters. Run from scripts/run_cross_cohort_check.py.

`CohortRegime` `dataclass` ¶

A named set of generator overrides for one regime.

`evaluate_regime(regime, n_participants=80, n_days=30, seed=0)` ¶

Generate a cohort under the regime, run the relevant audit pieces, return a row.

`predicted_vs_observed(table)` ¶

Compare observed audit outputs against the qualitative predictions per regime.

Each row is a predicted relationship that should hold across regimes.

Returns an empty result-typed DataFrame if table is empty or has no 'regime' column. Predictions involving regimes that aren't present return NaN observations with pass=None (i.e., 'skipped' rather than 'failed').

`deep_real_analysis` ¶

Deep cross-modality analysis of real wearable data.

Computes for every extracted window

HR from chest ECG (gold standard)
HR from wrist PPG (the realistic device)
In-house PPG SQI
In-house ECG SQI
Orphanidou 2015 PPG SQI (published baseline 1)
Sukor 2011 PPG SQI (published baseline 2)
PPG motion artifact score

Then runs a series of validation analyses:

Cross-modality HR agreement (HR_PPG vs HR_ECG): mean absolute error, bias, 95% limits of agreement (Bland-Altman 1986), Pearson correlation.
Per-state effects: for each subject, paired Wilcoxon signed-rank test on baseline vs stress windows for SQI, motion, and HR-PPG-vs-ECG error.
Three-way SQI agreement (in-house vs Orphanidou vs Sukor): pairwise Spearman, Cohen's kappa for binary acceptance, three-way confusion table.
Motion-vs-SQI: correlation between PPG motion score and the absolute HR-PPG-vs-ECG disagreement. If motion predicts disagreement, that's the confounding-detection logic working on real signals.
Recalibration experiment: split windows 50/50 by subject, fit new in-house SQI thresholds on the calibration half to match Orphanidou's pass/fail pattern, evaluate the recalibrated SQI on the held-out half.

`FourWaySQIAgreement` `dataclass` ¶

Pairwise agreement among four PPG SQI methods, including Elgendi 2016.

Reported alongside the three-way result to give the manuscript a more bulletproof 'in-house is the outlier' argument: when all three published baselines reject a window and the in-house method accepts it, the in-house method is unambiguously the outlier rather than the baselines being weak. The headline statistic for that argument is inhouse_passes_when_all_published_fail.

`hr_from_ecg_window(ecg_window, fs=700)` ¶

Estimate HR from a single ECG window in bpm. Returns NaN if undetermined.

`hr_from_ppg_window(ppg_window, fs=64)` ¶

Estimate HR from a single PPG window in bpm. Returns NaN if undetermined.

`compute_window_table(ecg_arr, ppg_arr, meta, ecg_fs=700, ppg_fs=64)` ¶

For every window, compute HR from both modalities plus all three SQIs.

Returns¶

A DataFrame with one row per window. Columns include the meta fields (subject_id, label_name, etc.) plus: - hr_ecg, hr_ppg, hr_abs_diff - inhouse_ecg_sqi, inhouse_ppg_sqi, inhouse_ppg_motion - orphanidou_acceptable, orphanidou_template_corr - sukor_acceptable, sukor_pp_cv, sukor_amp_cv

`cross_modality_hr_agreement(df)` ¶

Bland-Altman style agreement statistics for HR_PPG vs HR_ECG.

`per_state_comparison(df, state_a='baseline', state_b='stress')` ¶

For each metric, compare state_a vs state_b within each subject.

Returns one row per subject with mean values in each state and an unpaired Wilcoxon rank-sum p-value.

`cohens_kappa(a, b)` ¶

Cohen's kappa for two binary rater outputs.

`three_way_sqi_agreement(df, inhouse_threshold=0.7)` ¶

Pairwise agreement among three PPG SQI methods.

`four_way_sqi_agreement(df, inhouse_threshold=0.7)` ¶

Compute pairwise kappas across four PPG SQIs and the headline 'in-house outlier' statistic.

If the input dataframe does not contain Elgendi columns (because it was generated by an older version of the pipeline), returns a result with NaN fields rather than raising. This preserves backwards compatibility with old window_table.csv files.

`motion_effect_analysis(df)` ¶

Does the PPG motion score predict SQI failures and HR-modality disagreement?

`recalibrate_inhouse_sqi(df, original_threshold=0.7, seed=0)` ¶

Refit the in-house PPG SQI threshold to maximize Orphanidou agreement.

Splits windows 50/50, fits a new threshold on the calibration half to maximize Cohen's kappa against Orphanidou, then evaluates that threshold on the held-out half.

`metrics` ¶

Common metrics used by the leaderboard and the audit report.

`robustness` ¶

Robustness: how stable is a metric under input perturbations?

`perturbation_stability(scorer, df, columns, noise_std=0.05, n_trials=10, seed=0)` ¶

Add Gaussian noise to selected columns, see how much the score moves.

A seeded local RNG is used so successive calls with the same seed produce identical results. Returns NaN if the scorer fails on every perturbation.

`synthetic_to_real` ¶

Comparison harness for synthetic-to-real generalization.

This module is a stub: the framework is designed so that a real dataset can replace the synthetic one through the same column schema. The function below documents the contract.

Confounding¶

Back-door adjustment, AIPW doubly-robust estimation, E-values.

`biomedical_signal_forensics_lab.confounding` ¶

Confounding detection: environmental, behavioral, missingness.

`behavioral_confounding` ¶

Behavioral confounders: activity → PPG, low wear → reliability.

`causal_inference` ¶

Causal confounding analysis: DAG, g-computation, doubly-robust estimation.

The default environmental_confounding.py reports Pearson correlations. That's a screening signal: useful, but a reviewer is right to want more before any adjusted-effect claim. This module provides three things:

A minimal DAG representation (no networkx dependency) with enough structure to identify a valid adjustment set under the back-door criterion.
G-computation (standardization): fit an outcome model conditional on the treatment and adjustment set, then average over the empirical covariate distribution to get an adjusted marginal effect. Includes bootstrap CIs.
Doubly-robust AIPW (augmented inverse-probability weighting): combines a propensity model and an outcome model so that the estimate is consistent if either model is correct. Includes bootstrap CIs.

All of this is "first-order" causal inference. We do not handle time-varying confounding or instrumental variables. The point is to give a defensible adjusted effect alongside the screening correlation.

`DAG` `dataclass` ¶

A directed acyclic graph encoded as a dict of node → list of parents.

Parents-of-X means "variables that directly cause X". This convention makes the back-door identification logic cleaner. Use add_edge(parent, child) if you prefer to think in arrows.

`has_cycle()` ¶

True if the graph has a cycle (i.e., is not actually a DAG).

`back_door_adjustment_set(treatment, outcome)` ¶

Return a valid adjustment set under a sufficient (but not always minimal) back-door criterion: take the ancestors of treatment that are also ancestors of outcome, excluding descendants of treatment.

This is the 'all-common-causes' rule. It is conservative but always valid in a DAG with no unmeasured confounders among the listed nodes.

Raises ValueError if the graph contains a cycle or if treatment == outcome.

`g_computation(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear')` ¶

Adjusted marginal effect E[Y | do(T=1)] − E[Y | do(T=0)] by standardization.

`aipw(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear', clip=(0.02, 0.98))` ¶

Augmented inverse-propensity weighted estimator.

Consistent if either the outcome model or the propensity model is correct. Continuous treatments are median-binarized first; results are interpreted as the effect of being above vs below the cohort median.

`default_wearable_dag()` ¶

Encodes the assumed causal structure of the synthetic cohort:

age, sex → baseline_hr, baseline_hrv heat_index → sleep_efficiency, hrv_rmssd stress_proxy → sleep_efficiency, hrv_rmssd, missing activity → resting_hr, hrv_rmssd, ppg_motion device_type → resting_hr, sqi skin_tone → ppg_sqi sleep_efficiency → hrv_rmssd (mediator)

The graph is small on purpose. Anyone applying this to a real cohort should write down their own DAG; this one is a starting point and a worked example.

`screening_vs_adjusted_table(df, pairs, dag=None, learner='linear', default_covariates=None)` ¶

For each (confounder, outcome) pair, report screening r AND AIPW adjusted effect.

If the DAG yields an empty back-door set (e.g., the treatment is a root node), we fall back to default_covariates to give a useful adjusted estimate. This is a deliberate convenience: anyone with a real DAG should pass it in.

`environmental_confounding` ¶

Environmental confounders: heat, AQI, humidity.

`heat_sleep_confounding` ¶

A focused module on heat × sleep confounding.

`missingness_dynamics` ¶

Missingness as signal: clustering, consecutive dropout, state dependence.

`consecutive_run_lengths(flags)` ¶

Return lengths of consecutive 1-runs in a binary array.

`state_dependent_missingness(df)` ¶

Does missingness correlate with stress / poor sleep / heat?

Reliability¶

ICC, test-retest stability, bootstrap reliability primitives.

`biomedical_signal_forensics_lab.reliability` ¶

Reliability engine: ICC, test-retest, drift, device bias, trust score.

`biomarker_trust_score` ¶

Digital Biomarker Trust Score (DBTS).

Six sub-scores → weighted composite in [0, 100] with a categorical label. All weights and thresholds are loaded from configs/reliability.yaml so the score is auditable.

`DigitalBiomarkerTrustScore` ¶

Compute the trust score and its sub-components from a participant frame.

Inputs are kept simple on purpose: a per-participant dataframe with the daily columns plus any computed artifact / signal quality columns. We derive each sub-score with a transparent rule and combine via the configured weights.

`change_point` ¶

Change-point detection on participant time-series.

Firmware updates, manufacturer pipeline changes, and silent algorithm swaps introduce step changes in wearable-derived signals. The reliability stack already catches these as "low temporal stability," but never attributes them to a discrete event. This module produces explicit change-point timestamps and a before-vs-after stratification.

Two methods:

bocpd: a Bayesian online change-point detector with a constant hazard prior and a Normal-Inverse-Gamma model of run-length means. The full message-passing recursion is implemented in numpy with light bookkeeping.
binary_segmentation: a fast offline alternative that recursively splits on the most likely change-point under a sum-of-squared-errors criterion, stopping when the proposed segment fails an F-test.

The two methods agree on textbook step changes (see tests). They disagree on ambiguous drifts, where bocpd is more conservative.

`bocpd(values, hazard=1.0 / 50.0, mu0=None, kappa0=1.0, alpha0=1.0, beta0=1.0, min_collapse=5)` ¶

Bayesian online change-point detection.

NaN samples are removed before processing. If you need to preserve absolute indices, mask externally before calling.

Parameters¶

values 1D array of observations. hazard Constant prior probability of a change at each step. 1/50 ≈ "one expected change per 50 samples". Lower = more conservative. mu0, kappa0, alpha0, beta0 Normal-Inverse-Gamma prior hyperparameters. If mu0 is None, it is initialized to the mean of the first 5 observations. Defaults elsewhere are weakly informative. min_collapse Required drop in the MAP run length to register a change-point. A change at time t shows up as argmax(R_t) collapsing from a large value (the length of the current stationary regime) back to 0 or 1.

`binary_segmentation(values, min_segment=7, f_stat_threshold=8.0, detrend=False)` ¶

Recursive binary segmentation with an F-statistic stopping rule.

f_stat_threshold corresponds to a fairly aggressive p ~ 0.005 floor for typical sample sizes; raise for more conservative detection.

NaN samples are stripped before processing. If detrend=True, a linear trend is subtracted from the series first; this helps avoid spurious change-points on slow continuous drifts (a known weakness of step-change detectors).

`scan_cohort(df, metric, method='binary_segmentation', **kwargs)` ¶

Run change-point detection for every participant's series of metric.

`stratify_before_after(values, cp_index)` ¶

Return summary statistics for the before and after segments.

`device_bias` ¶

Estimate device-class bias on key biomarkers.

`per_column_bias(df, columns, device_col='device_type', reference=None)` ¶

For each device class, report mean - reference_mean for each column.

`bias_severity(bias_df)` ¶

Aggregate device-bias severity in [0, 1].

Defined as the maximum absolute mean_diff across rows, divided by a fixed scale (10 bpm / 10 ms) and clipped to [0, 1]. This is a simple heuristic, not a standardized effect size, so it depends on the unit of mean_diff. For metrics whose natural scale differs from 10, pre-scale or use a different aggregator.

Returns 0.0 on an empty frame.

`fairness_audit` ¶

Stratified fairness audit.

For any categorical column (device_type, sex, age band, a skin-tone proxy quartile), compute the full DBTS panel within each level and emit:

a tidy per-stratum table with all six components and the overall score;
a per-component disparity table (max minus min across strata);
a 'forest-plot-ready' frame with bootstrapped confidence intervals.

The plot itself is in src/reports/figure_builder.py::fairness_forest_plot.

This module deliberately stays domain-agnostic. It does not know what "fair" means; it surfaces the disparities so a human can decide what to do about them.

`fairness_audit(df, stratify_by, bin_strategy='auto', n_bootstrap=100, scorer=None)` ¶

Run DBTS for every level of stratify_by.

Parameters¶

df Daily-summary frame, schema-compliant. stratify_by Column to stratify on. If numeric, will be quartile-binned (or median-split when there aren't enough unique values). bin_strategy 'auto' (default), 'quartile', or 'none'. 'none' requires the column to be categorical or low-cardinality already. n_bootstrap Bootstrap resamples for the per-stratum CI. Set to 0 to skip. scorer Optional pre-built scorer (uses default config if None).

`disparity_summary(stratum_table)` ¶

Max minus min across strata, per component. Quick read of who's worst off.

`multi_stratify(df, stratify_columns, **kwargs)` ¶

Run the audit across multiple stratifiers; return a dict keyed by column.

`intraclass_correlation` ¶

ICC(2,1): two-way random, single rater, absolute agreement.

Treats each participant as a 'target' and each day as a 'rater'. This is an unconventional use of the formula. The classical Shrout & Fleiss (1979) ICC assumes a small fixed set of human raters scoring each subject. Here we substitute calendar days within a participant as the 'raters'. The substitution is empirically defensible (see compare_to_test_retest below): on the bundled synthetic cohort, the ICC ranking matches the proper week-pair test-retest ranking, with the ICC being uniformly more conservative because it counts day-to-day fluctuation as rater disagreement.

The classical numerical implementation here was sanity-checked against the textbook example from Shrout & Fleiss (1979) Table 1 and recovers the published ICC(2,1) value of 0.290.

For a clinical paper the better metric is probably the bootstrap week-pair correlation in reliability/test_retest.py. The ICC value here is included for backward compatibility and as a within-participant noise-floor signal.

`MethodComparison` `dataclass` ¶

Empirical comparison of ICC(2,1) days-as-raters against bootstrap week-pair r.

`icc_2_1(matrix, nan_policy='drop')` ¶

Intraclass correlation, ICC(2,1) form: two-way random, single rater, absolute agreement.

Parameters¶

matrix : np.ndarray Shape (n_subjects, n_raters). nan_policy : {'drop', 'raise'} - 'drop' (default): drop any subject row that contains a NaN. - 'raise': raise ValueError if any NaN is present.

Returns¶

float ICC value, or NaN if the cleaned input has fewer than 2 subjects or 2 raters.

`compute_from_dataframe(df, column, window_days=7)` ¶

Take first window_days per participant, drop incomplete, compute ICC.

`compare_to_test_retest(df, columns, icc_window_days=7, n_bootstrap=200)` ¶

Empirical comparison: ICC(2,1) days-as-raters against bootstrap week-pair r.

This is the formal defense of the unconventional ICC use. If the two methods rank metrics consistently and the ICC is monotonically related to the proper test-retest correlation, the ICC is doing useful work even though the rater interpretation is unusual.

Returns one row per metric with both estimates and the week-pair CI.

`temporal_stability` ¶

Day-to-day stability, rolling CoV, drift.

`drift_slope(values)` ¶

Linear-regression slope and p-value of values vs sample index.

NaN samples are dropped before the regression. Returns NaN slope and p-value when fewer than 5 finite samples remain or when the finite portion has effectively zero variance.

`test_retest` ¶

Test-retest reliability for daily-summary time series.

Three functions, in increasing order of methodological seriousness:

split_half_correlation: Pearson between the first half and the second half of a single participant's series. Quick, simple, suitable for ranking participants. Single point estimate; no CI.
week_pair_correlation: For each pair of adjacent non-overlapping weeks within a participant, correlate the within-week means. Standard test-retest formulation when daily measurements are pooled by week.
cohort_test_retest: Pool participants and report between-participant test-retest reliability of any weekly aggregate of choice. Returns a point estimate plus a participant-level bootstrap 95% CI. This is the one to put in a paper.

The 'classical' (Shrout & Fleiss 1979) ICC is over in intraclass_correlation.py. We deliberately keep the two separate because they answer different questions: this module asks 'is the same participant consistent week to week?', ICC asks 'how much of the total variance is between-participant?'.

`split_half_correlation(values)` ¶

Pearson correlation between the first and second halves of a series.

`per_participant(df, column)` ¶

Per-participant split-half correlation (no CI).

`week_pair_correlation(df, column, min_weeks=4)` ¶

Pool all (week_t, week_{t+1}) pairs across participants, correlate.

A participant with at least min_weeks weeks of data contributes n_weeks - 1 pairs. Returns the Pearson correlation across all pooled pairs.

`cohort_test_retest(df, column, method='week_pair', n_bootstrap=500, min_weeks=4, seed=0)` ¶

Pool participants, compute a single test-retest estimate, bootstrap the CI.

Bootstrap resamples participants (cluster bootstrap), which is the appropriate unit for clustered repeated-measures data.

`cohort_test_retest_table(df, metrics, **kwargs)` ¶

Run cohort_test_retest for several metrics and return a tidy table.

`weight_optimization` ¶

Learn trust-score weights from data, tied to a downstream reproducibility task.

The default DBTS weights are a defensible prior. They are not learned. Anyone who actually cares about a specific downstream task (week-over-week stability of an HRV-based stress estimate, say) can do better by searching the weight simplex for the weighting that maximizes correlation between DBTS and that task's stability.

We provide:

weekly_reproducibility_target: a function that produces a per-participant reproducibility score for a chosen metric. Default metric: HRV RMSSD, default measure: 1 − relative absolute difference between adjacent weeks.
learn_weights: grid + random search over the 6-component simplex, maximizing Spearman ρ between DBTS overall and the per-participant target.
sensitivity_table: how much do conclusions move when the weights change? Reports overall-score correlation between top-k weight settings and the rank stability of the participant ordering they induce.

Methodology: - Search is on the simplex with a Dirichlet-uniform sample plus a refining local grid around the best point. Both phases are exposed so the search budget is auditable. - The target is computed on a held-out half of the participants; weight fitting and validation never see the same participants. - We never claim the learned weights generalize to a different task. The point is to report the dependence and let the user decide.

`weekly_reproducibility_target(df, metric='hrv_rmssd')` ¶

Per-participant week-over-week reproducibility of metric.

Score is 1 minus the median relative absolute difference between adjacent 7-day means. Higher = more reproducible. Range roughly [0, 1].

`participant_components(df, scorer=None)` ¶

Run the DBTS once per participant and return the six component scores.

`overall_from_components(components_df, weights)` ¶

Compute the overall score under a given weighting.

`learn_weights(df, target_metric='hrv_rmssd', n_random=400, n_refine=6, holdout_fraction=0.3, seed=0, min_train_participants=5)` ¶

Search the 6-simplex for weights that maximize Spearman ρ vs reproducibility.

Returns LearnedWeights with NaN ρ values (and default weights) if the cohort has fewer than min_train_participants participants with a computable reproducibility target, or if the bootstrap never produces a finite Spearman.

`sensitivity_table(df, learned, perturbations=None)` ¶

How sensitive is the participant ranking to small weight changes?

For each perturbation magnitude p, draw Dirichlet noise of scale p around the learned weights, recompute the overall score, and report the Spearman rank correlation with the original ordering.

Models¶

Downstream LF/HF biomarker classifier with LOSO cross-validation.

`biomedical_signal_forensics_lab.models` ¶

Quality / anomaly models: autoencoder, isolation forest, baselines.

`anomaly_detector` ¶

Isolation Forest anomaly detector + logistic / random forest baselines.

`quality_autoencoder` ¶

A small PyTorch autoencoder used as an unsupervised quality probe.

We train it on what looks like 'good' data (high signal quality ground truth) and use reconstruction error as a quality score for new samples.

`sequence_model` ¶

Tiny GRU baseline for sequence-level quality (per-participant time series).

Trained against per-participant mean reliability_ground_truth as a regression sanity check. Not the main contribution: just here so reviewers can see we considered the temporal dimension.

Reports¶

Publication-grade figure generation and summary JSON writers.

`biomedical_signal_forensics_lab.reports` ¶

Report generation, figures, trust-report markdown.

`figure_builder` ¶

Publication-style figures. Matplotlib only.

Each function takes an already-loaded dataframe and writes a PNG to results/figures/. Numbers are deliberately kept legible at small sizes so they reproduce well in a PDF.

`fairness_forest_plot(stratum_table, out='results/figures/fairness_forest.png')` ¶

Forest plot of per-stratum overall trust score with bootstrap CIs.

Expects the output of fairness_audit: columns stratum_value, overall_trust_score, ci_low, ci_high.

`real_data_figures` ¶

Publication-quality figures for the deep WESAD analysis.

Six figures

Bland-Altman: HR-from-PPG vs HR-from-ECG, color-coded by state
Scatter HR_PPG vs HR_ECG with identity line and per-subject color
SQI pass rate by method (in-house vs Orphanidou vs Sukor)
Per-state distribution panels (4-subplot: SQI, motion, HR error, template corr)
Motion vs HR disagreement scatter
Recalibration: original vs recalibrated kappa across thresholds

`fig_bland_altman_hr(window_df, out)` ¶

Bland-Altman: mean of HR_PPG and HR_ECG vs their difference.

`fig_hr_scatter(window_df, out)` ¶

HR_PPG vs HR_ECG scatter with identity line.

`fig_sqi_pass_rates(three_way_result, out)` ¶

Bar chart of pass rates across three SQI methods.

`fig_per_state_panels(window_df, out)` ¶

4-panel boxplot: SQI, motion, HR error, template corr by state.

`fig_motion_vs_hr_error(window_df, out)` ¶

Scatter: PPG motion score vs |HR_PPG - HR_ECG|.

`fig_recalibration_curve(window_df, out)` ¶

Cohen's kappa vs in-house SQI threshold, with original and recalibrated.

`make_all_figures(window_df, three_way_result, out_dir)` ¶

Generate all six figures and return their paths.

`report_generator` ¶

Top-level audit report generator.

Produces results/reports/signal_forensics_report.md plus a set of figures. Designed to be readable as a standalone document.

`trust_report` ¶

Markdown formatting for a single trust report.

If a module above does not render, it may be because mkdocstrings could not import it during the docs build. Common causes:

The package is not installed in the docs environment. Run pip install -e . before mkdocs serve.
A module imports a heavy optional dependency that is not in the docs environment. Lazy imports inside functions are recommended over module-level imports of optional heavy dependencies.

API reference¶

Signals¶

biomedical_signal_forensics_lab.signals ¶

activity_processing ¶

ecg_processing ¶

detect_r_peaks(signal, fs, refractory_ms=250) ¶

signal_quality_index(signal, fs) ¶

elgendi_sqi ¶

ElgendiSQI dataclass ¶

elgendi_window(window, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, apply_bandpass=True, invert=False) ¶

batch_elgendi(windows, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, auto_invert=True, pilot_size=100) ¶

four_way_sqi_agreement(in_house_pass, orphanidou_pass, sukor_pass, elgendi_pass) ¶

hrv_processing ¶

artifact_corrected_rmssd(rr_ms, lo=300, hi=2000) ¶

orphanidou_sqi ¶

orphanidou_ecg_sqi(window, fs) ¶

orphanidou_ppg_sqi(window, fs) ¶

head_to_head(windows, in_house_sqi, fs, modality='ppg', in_house_threshold=0.7) ¶

Parameters¶

ppg_processing ¶

motion_artifact_score(signal, fs) ¶

signal_quality ¶

daily_signal_quality(df) ¶

sleep_processing ¶

regularity_index(durations) ¶

sukor_sqi ¶

sukor_ppg_sqi(window, fs) ¶

batch_sukor(windows, fs) ¶

Evaluation¶

biomedical_signal_forensics_lab.evaluation ¶

calibration ¶

cross_cohort_check ¶

CohortRegime dataclass ¶

evaluate_regime(regime, n_participants=80, n_days=30, seed=0) ¶

predicted_vs_observed(table) ¶

deep_real_analysis ¶

FourWaySQIAgreement dataclass ¶

hr_from_ecg_window(ecg_window, fs=700) ¶

hr_from_ppg_window(ppg_window, fs=64) ¶

compute_window_table(ecg_arr, ppg_arr, meta, ecg_fs=700, ppg_fs=64) ¶

Returns¶

cross_modality_hr_agreement(df) ¶

per_state_comparison(df, state_a='baseline', state_b='stress') ¶

cohens_kappa(a, b) ¶

three_way_sqi_agreement(df, inhouse_threshold=0.7) ¶

four_way_sqi_agreement(df, inhouse_threshold=0.7) ¶

motion_effect_analysis(df) ¶

recalibrate_inhouse_sqi(df, original_threshold=0.7, seed=0) ¶

metrics ¶

robustness ¶

perturbation_stability(scorer, df, columns, noise_std=0.05, n_trials=10, seed=0) ¶

synthetic_to_real ¶

Confounding¶

biomedical_signal_forensics_lab.confounding ¶

behavioral_confounding ¶

causal_inference ¶

DAG dataclass ¶

has_cycle() ¶

back_door_adjustment_set(treatment, outcome) ¶

g_computation(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear') ¶

aipw(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear', clip=(0.02, 0.98)) ¶

default_wearable_dag() ¶

screening_vs_adjusted_table(df, pairs, dag=None, learner='linear', default_covariates=None) ¶

environmental_confounding ¶

heat_sleep_confounding ¶

missingness_dynamics ¶

consecutive_run_lengths(flags) ¶

state_dependent_missingness(df) ¶

Reliability¶

biomedical_signal_forensics_lab.reliability ¶

biomarker_trust_score ¶

DigitalBiomarkerTrustScore ¶

change_point ¶

bocpd(values, hazard=1.0 / 50.0, mu0=None, kappa0=1.0, alpha0=1.0, beta0=1.0, min_collapse=5) ¶

Parameters¶

binary_segmentation(values, min_segment=7, f_stat_threshold=8.0, detrend=False) ¶

scan_cohort(df, metric, method='binary_segmentation', **kwargs) ¶

stratify_before_after(values, cp_index) ¶

device_bias ¶

per_column_bias(df, columns, device_col='device_type', reference=None) ¶

`biomedical_signal_forensics_lab.signals` ¶

`activity_processing` ¶

`ecg_processing` ¶

`detect_r_peaks(signal, fs, refractory_ms=250)` ¶

`signal_quality_index(signal, fs)` ¶

`elgendi_sqi` ¶

`ElgendiSQI` `dataclass` ¶

`elgendi_window(window, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, apply_bandpass=True, invert=False)` ¶

`batch_elgendi(windows, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, auto_invert=True, pilot_size=100)` ¶

`four_way_sqi_agreement(in_house_pass, orphanidou_pass, sukor_pass, elgendi_pass)` ¶

`hrv_processing` ¶

`artifact_corrected_rmssd(rr_ms, lo=300, hi=2000)` ¶

`orphanidou_sqi` ¶

`orphanidou_ecg_sqi(window, fs)` ¶

`orphanidou_ppg_sqi(window, fs)` ¶

`head_to_head(windows, in_house_sqi, fs, modality='ppg', in_house_threshold=0.7)` ¶

`ppg_processing` ¶

`motion_artifact_score(signal, fs)` ¶

`signal_quality` ¶

`daily_signal_quality(df)` ¶

`sleep_processing` ¶

`regularity_index(durations)` ¶

`sukor_sqi` ¶

`sukor_ppg_sqi(window, fs)` ¶

`batch_sukor(windows, fs)` ¶

`biomedical_signal_forensics_lab.evaluation` ¶

`calibration` ¶

`cross_cohort_check` ¶

`CohortRegime` `dataclass` ¶

`evaluate_regime(regime, n_participants=80, n_days=30, seed=0)` ¶

`predicted_vs_observed(table)` ¶

`deep_real_analysis` ¶

`FourWaySQIAgreement` `dataclass` ¶

`hr_from_ecg_window(ecg_window, fs=700)` ¶

`hr_from_ppg_window(ppg_window, fs=64)` ¶

`compute_window_table(ecg_arr, ppg_arr, meta, ecg_fs=700, ppg_fs=64)` ¶

`cross_modality_hr_agreement(df)` ¶

`per_state_comparison(df, state_a='baseline', state_b='stress')` ¶

`cohens_kappa(a, b)` ¶

`three_way_sqi_agreement(df, inhouse_threshold=0.7)` ¶

`four_way_sqi_agreement(df, inhouse_threshold=0.7)` ¶

`motion_effect_analysis(df)` ¶

`recalibrate_inhouse_sqi(df, original_threshold=0.7, seed=0)` ¶

`metrics` ¶

`robustness` ¶

`perturbation_stability(scorer, df, columns, noise_std=0.05, n_trials=10, seed=0)` ¶

`synthetic_to_real` ¶

`biomedical_signal_forensics_lab.confounding` ¶

`behavioral_confounding` ¶

`causal_inference` ¶

`DAG` `dataclass` ¶

`has_cycle()` ¶

`back_door_adjustment_set(treatment, outcome)` ¶

`g_computation(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear')` ¶

`aipw(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear', clip=(0.02, 0.98))` ¶

`default_wearable_dag()` ¶

`screening_vs_adjusted_table(df, pairs, dag=None, learner='linear', default_covariates=None)` ¶

`environmental_confounding` ¶

`heat_sleep_confounding` ¶

`missingness_dynamics` ¶

`consecutive_run_lengths(flags)` ¶

`state_dependent_missingness(df)` ¶

`biomedical_signal_forensics_lab.reliability` ¶

`biomarker_trust_score` ¶

`DigitalBiomarkerTrustScore` ¶

`change_point` ¶

`bocpd(values, hazard=1.0 / 50.0, mu0=None, kappa0=1.0, alpha0=1.0, beta0=1.0, min_collapse=5)` ¶

`binary_segmentation(values, min_segment=7, f_stat_threshold=8.0, detrend=False)` ¶

`scan_cohort(df, metric, method='binary_segmentation', **kwargs)` ¶

`stratify_before_after(values, cp_index)` ¶

`device_bias` ¶

`per_column_bias(df, columns, device_col='device_type', reference=None)` ¶

`bias_severity(bias_df)` ¶

`fairness_audit` ¶

`fairness_audit(df, stratify_by, bin_strategy='auto', n_bootstrap=100, scorer=None)` ¶

`disparity_summary(stratum_table)` ¶

`multi_stratify(df, stratify_columns, **kwargs)` ¶

`intraclass_correlation` ¶

`MethodComparison` `dataclass` ¶

`icc_2_1(matrix, nan_policy='drop')` ¶