Skip to content

API reference

The toolkit is organised into the following modules. Click any module to see its public functions, classes, and constants, auto-generated from docstrings via mkdocstrings.

If you are looking for high-level guidance instead of API specifics, see Getting started or Reproducing the paper.

Signals

PPG and ECG preprocessing, along with four signal-quality baselines.

biomedical_signal_forensics_lab.signals

Signal processing: ECG, PPG, HRV, sleep, activity, quality indices.

activity_processing

Activity / step-count reliability.

ecg_processing

ECG processing: bandpass, R-peak detection, RR intervals, SQI.

Deliberately classical methods. No deep learning here: the point is that the downstream forensics layer should work even when the signal processing is plain-vanilla.

detect_r_peaks(signal, fs, refractory_ms=250)

Very simple amplitude-threshold detector after bandpass + squaring.

signal_quality_index(signal, fs)

A bounded [0, 1] SQI: penalize flatness, saturation, and very high noise.

elgendi_sqi

Elgendi 2016 PPG signal-quality indices.

Implements the skewness-based SQI (SSQI) from:

Elgendi M. Optimal Signal Quality Index for Photoplethysmogram Signals. Bioengineering, 3(4):21, 2016. https://doi.org/10.3390/bioengineering3040021

Elgendi (2016) compared eight candidate per-window PPG SQI metrics on 106 annotated 60-second recordings labeled as 'excellent', 'acceptable', or 'unfit for diagnosis' by two independent reviewers with adjudication by a third expert. Of the eight metrics, the skewness-based index (SSQI) had the highest F1 score for the binary acceptable-versus-unfit decision and is recommended as the headline metric. We expose SSQI as the primary baseline and provide kurtosis (KSQI) and entropy (ESQI) as auxiliary metrics for robustness checks.

The three statistics together cover a complementary set of failure modes
  • SSQI: distribution asymmetry. A clean PPG has positive skew (long upper tail from systolic peaks). Saturated or flat signals lose this asymmetry.
  • KSQI: distribution tailedness. Heavy-tailed distributions indicate spike artifacts.
  • ESQI: amplitude entropy in moving sub-windows. Highly random amplitude distributions indicate broadband noise.

Elgendi's recommended thresholds were tuned on fingertip PPG; we report the continuous statistics and let the recalibration machinery (Section 4.3) re-tune binarization for wrist PPG. Default acceptance thresholds: SSQI > 0 (any positive skew) KSQI > 3 (super-Gaussian: heavier tail than normal) ESQI < 0.9 (not pure noise)

These defaults are intentionally lenient; users should recalibrate against either expert labels or a reference SQI on their own data.

Each function operates on a single window (typically 5-30 seconds at 64 Hz for wrist PPG or 125 Hz for fingertip PPG) and returns an ElgendiSQI dataclass with the continuous statistics, the per-statistic acceptance flags, and the overall binary acceptance verdict.

ElgendiSQI dataclass

Per-window Elgendi 2016 SQI report.

Attributes:

Name Type Description
acceptable bool

overall pass (SSQI > SSQI_MIN AND KSQI > KSQI_MIN AND ESQI < ESQI_MAX). All three must pass for the window to be accepted under the default policy. Single-statistic users can inspect ssqi_ok, ksqi_ok, esqi_ok individually.

skewness float

third standardized moment of the window after bandpass. Clean PPG has positive skew; SSQI is its raw value.

kurtosis float

fourth standardized moment (Fisher convention, so normal-distribution kurtosis = 0). KSQI is kurtosis + 3 in the Pearson convention used in Elgendi 2016.

entropy float

normalized Shannon entropy of the amplitude distribution computed over a 16-bin histogram, scaled to [0, 1].

ssqi_ok, (ksqi_ok, esqi_ok)

per-statistic acceptance flags.

n_samples int

number of samples in the window after bandpass.

elgendi_window(window, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, apply_bandpass=True, invert=False)

Compute Elgendi 2016 SSQI / KSQI / ESQI on a single PPG window.

Parameters:

Name Type Description Default
window ndarray

1D PPG signal samples.

required
fs int

sampling rate in Hz.

required
ssqi_min float

SSQI acceptance threshold. See module constants.

SSQI_MIN
ksqi_min float

KSQI acceptance threshold. See module constants.

KSQI_MIN
esqi_max float

ESQI acceptance threshold. See module constants. docstring for defaults and recommended recalibration policy.

ESQI_MAX
apply_bandpass bool

if True, run the standard PPG bandpass filter (0.5-8 Hz, 4th-order Butterworth) before computing statistics.

True
invert bool

if True, multiply the signal by -1 before computing statistics. Some devices (Empatica E4 wrist PPG in particular) report PPG inverted relative to fingertip conventions, which flips the sign of skewness. Use batch_elgendi(..., auto_invert=True) to auto-detect from the cohort.

False

Returns:

Type Description
ElgendiSQI

ElgendiSQI dataclass with the continuous statistics and acceptance

ElgendiSQI

flags. All NaN/empty edge cases return acceptable=False with

ElgendiSQI

statistics set to NaN.

batch_elgendi(windows, fs, ssqi_min=SSQI_MIN, ksqi_min=KSQI_MIN, esqi_max=ESQI_MAX, auto_invert=True, pilot_size=100)

Compute Elgendi SQI on a batch of windows.

Parameters:

Name Type Description Default
windows ndarray

2D array of shape (n_windows, n_samples) or a list of 1D arrays. Variable-length windows are accepted by passing a list.

required
fs int

sampling rate in Hz.

required
auto_invert bool

if True (default), compute Elgendi statistics on the first pilot_size windows, and if the median skewness across that pilot is negative, invert the entire batch before computing the full result. This handles devices that report PPG with reversed polarity (e.g., Empatica E4 wrist PPG). Set to False to disable polarity detection.

True
pilot_size int

number of windows used for polarity detection.

100

Returns:

Type Description
DataFrame

DataFrame with one row per window containing all ElgendiSQI fields

DataFrame

plus a polarity_inverted boolean column (True if auto_invert

DataFrame

decided to invert).

four_way_sqi_agreement(in_house_pass, orphanidou_pass, sukor_pass, elgendi_pass)

Compute pairwise Cohen's kappa and Spearman rho for all four SQIs.

The 'all three published baselines agree against in-house' argument becomes stronger when there are three baselines instead of two; the headline statistic is the median pairwise kappa between Orphanidou, Sukor, and Elgendi.

Returns:

Type Description
dict

Dict with pairwise kappas, fraction of windows accepted by all

dict

published baselines, fraction accepted by in-house but not by any

dict

published baseline (the most damning category for the in-house SQI),

dict

and the median pairwise published-vs-published kappa.

hrv_processing

HRV metrics: RMSSD, SDNN, mean RR, artifact-corrected variants.

Operates on RR interval arrays (in milliseconds).

artifact_corrected_rmssd(rr_ms, lo=300, hi=2000)

Drop impossible RRs, linearly interpolate, then compute RMSSD.

orphanidou_sqi

Template-matching signal quality, following Orphanidou et al. 2015.

Reference

Orphanidou, C., Bonnici, T., Charlton, P., Clifton, D., Vallance, D., & Tarassenko, L. (2015). Signal-quality indices for the electrocardiogram and photoplethysmogram derived from optical pulse-rate measurements. IEEE J. Biomed. Health Inform., 19(3), 832-838.

The method has four rules that a window must satisfy to be 'acceptable': 1. Heart-rate estimated from peaks is in [40, 180] bpm 2. The maximum RR-interval gap is below a fixed threshold (3 s here) 3. The maximum RR/min RR ratio is below 2.2 4. The average correlation between each beat and the median beat template is above 0.66 for ECG (0.86 for PPG)

We expose
  • orphanidou_ecg_sqi(window, fs) -> dict with continuous metrics and bool
  • orphanidou_ppg_sqi(window, fs) -> same shape, different thresholds
  • head_to_head(windows, ground_truth, fs, modality) -> agreement summary

The point of including this is not to compete with the in-house SQI but to demonstrate that the in-house SQI agrees with an established published method on the same windows. Agreement on synthetic data is necessary but not sufficient. Anyone who runs the framework on real data should run this side-by-side and report both.

orphanidou_ecg_sqi(window, fs)

Apply Orphanidou's four ECG rules to a single window.

orphanidou_ppg_sqi(window, fs)

Apply Orphanidou's four PPG rules to a single window.

Note: the PPG variant uses a stricter template correlation threshold because clean PPG pulses are more morphologically consistent than QRS complexes.

head_to_head(windows, in_house_sqi, fs, modality='ppg', in_house_threshold=0.7)

Compare the in-house SQI estimates against Orphanidou's rules.

Parameters

windows (n_windows, window_len) array of signal windows. in_house_sqi (n_windows,) array of the in-house continuous SQI in [0, 1]. If shorter than windows, both arrays are truncated to the common length and a warning is logged. fs Sampling rate of the windows. modality 'ecg' or 'ppg'. Switches the Orphanidou template threshold. in_house_threshold Threshold for binarizing the in-house SQI when computing the crosstab. Default 0.7.

ppg_processing

PPG processing: smoothing, pulse-peak detection, motion estimation, SQI.

motion_artifact_score(signal, fs)

High-frequency / low-frequency power ratio. Higher → more motion.

signal_quality

Aggregate signal-quality scoring across ECG, PPG, and daily summaries.

daily_signal_quality(df)

A simple per-day quality estimate from wear-time and motion proxy.

Missing source columns are treated as 'no information'. Returns a Series of NaN of the same length as df when both source columns are absent.

sleep_processing

Sleep-summary reliability metrics.

regularity_index(durations)

1 - normalized std of sleep duration. Higher = more regular.

sukor_sqi

Sukor 2011 PPG signal-quality decision rules.

Implements the four-feature decision-tree style SQI from:

Sukor JA, Redmond SJ, Lovell NH. Signal quality measures for pulse oximetry through waveform morphology analysis. Physiological Measurement, 2011.

Sukor's method is a published PPG-specific SQI baseline. We use it alongside Orphanidou 2015 to give the in-house SQI a two-witness comparison rather than relying on a single published baseline.

Sukor's four features per detected pulse: 1. Pulse waveform shape: ratio of systolic peak amplitude to diastolic peak 2. Pulse-to-pulse interval consistency (similar to Orphanidou's RR check) 3. Pulse amplitude variability 4. Trough-to-trough interval consistency

A pulse is acceptable if all four features fall within their published ranges. A window is acceptable if at least 80% of its detected pulses are acceptable.

This is a faithful re-implementation of the published thresholds, adapted to window-level scoring for compatibility with our pipeline.

sukor_ppg_sqi(window, fs)

Apply Sukor 2011 four-feature rules to a single PPG window.

batch_sukor(windows, fs)

Run Sukor SQI on a batch of windows.

Evaluation

WESAD real-data validation, four-way SQI agreement, Bland-Altman, bootstrap CIs.

biomedical_signal_forensics_lab.evaluation

Evaluation: metrics, robustness, calibration, sim-to-real comparison.

calibration

Reliability-diagram-style calibration data.

cross_cohort_check

Cross-cohort generalization check.

A partial defense against the 'circular evaluation' critique. We build a second synthetic cohort with parameters intentionally different from the defaults (different baselines, different effect signs, different within-participant noise), then run the same audit framework on it and report:

  1. The trust-score distribution shifts in the expected direction.
  2. The fairness audit recovers the new injected effects.
  3. The Orphanidou agreement holds (it should: the SQI computation doesn't depend on the generative parameters).
  4. The bootstrap test-retest reliability tracks the new within-participant noise scale (more noise → lower r).

This is not real-data validation. It's a sanity check that the framework's outputs are sensitive to the cohort's properties rather than to the particular default parameters. Run from scripts/run_cross_cohort_check.py.

CohortRegime dataclass

A named set of generator overrides for one regime.

evaluate_regime(regime, n_participants=80, n_days=30, seed=0)

Generate a cohort under the regime, run the relevant audit pieces, return a row.

predicted_vs_observed(table)

Compare observed audit outputs against the qualitative predictions per regime.

Each row is a predicted relationship that should hold across regimes.

Returns an empty result-typed DataFrame if table is empty or has no 'regime' column. Predictions involving regimes that aren't present return NaN observations with pass=None (i.e., 'skipped' rather than 'failed').

deep_real_analysis

Deep cross-modality analysis of real wearable data.

Computes for every extracted window
  • HR from chest ECG (gold standard)
  • HR from wrist PPG (the realistic device)
  • In-house PPG SQI
  • In-house ECG SQI
  • Orphanidou 2015 PPG SQI (published baseline 1)
  • Sukor 2011 PPG SQI (published baseline 2)
  • PPG motion artifact score

Then runs a series of validation analyses:

  1. Cross-modality HR agreement (HR_PPG vs HR_ECG): mean absolute error, bias, 95% limits of agreement (Bland-Altman 1986), Pearson correlation.

  2. Per-state effects: for each subject, paired Wilcoxon signed-rank test on baseline vs stress windows for SQI, motion, and HR-PPG-vs-ECG error.

  3. Three-way SQI agreement (in-house vs Orphanidou vs Sukor): pairwise Spearman, Cohen's kappa for binary acceptance, three-way confusion table.

  4. Motion-vs-SQI: correlation between PPG motion score and the absolute HR-PPG-vs-ECG disagreement. If motion predicts disagreement, that's the confounding-detection logic working on real signals.

  5. Recalibration experiment: split windows 50/50 by subject, fit new in-house SQI thresholds on the calibration half to match Orphanidou's pass/fail pattern, evaluate the recalibrated SQI on the held-out half.

FourWaySQIAgreement dataclass

Pairwise agreement among four PPG SQI methods, including Elgendi 2016.

Reported alongside the three-way result to give the manuscript a more bulletproof 'in-house is the outlier' argument: when all three published baselines reject a window and the in-house method accepts it, the in-house method is unambiguously the outlier rather than the baselines being weak. The headline statistic for that argument is inhouse_passes_when_all_published_fail.

hr_from_ecg_window(ecg_window, fs=700)

Estimate HR from a single ECG window in bpm. Returns NaN if undetermined.

hr_from_ppg_window(ppg_window, fs=64)

Estimate HR from a single PPG window in bpm. Returns NaN if undetermined.

compute_window_table(ecg_arr, ppg_arr, meta, ecg_fs=700, ppg_fs=64)

For every window, compute HR from both modalities plus all three SQIs.

Returns

A DataFrame with one row per window. Columns include the meta fields (subject_id, label_name, etc.) plus: - hr_ecg, hr_ppg, hr_abs_diff - inhouse_ecg_sqi, inhouse_ppg_sqi, inhouse_ppg_motion - orphanidou_acceptable, orphanidou_template_corr - sukor_acceptable, sukor_pp_cv, sukor_amp_cv

cross_modality_hr_agreement(df)

Bland-Altman style agreement statistics for HR_PPG vs HR_ECG.

per_state_comparison(df, state_a='baseline', state_b='stress')

For each metric, compare state_a vs state_b within each subject.

Returns one row per subject with mean values in each state and an unpaired Wilcoxon rank-sum p-value.

cohens_kappa(a, b)

Cohen's kappa for two binary rater outputs.

three_way_sqi_agreement(df, inhouse_threshold=0.7)

Pairwise agreement among three PPG SQI methods.

four_way_sqi_agreement(df, inhouse_threshold=0.7)

Compute pairwise kappas across four PPG SQIs and the headline 'in-house outlier' statistic.

If the input dataframe does not contain Elgendi columns (because it was generated by an older version of the pipeline), returns a result with NaN fields rather than raising. This preserves backwards compatibility with old window_table.csv files.

motion_effect_analysis(df)

Does the PPG motion score predict SQI failures and HR-modality disagreement?

recalibrate_inhouse_sqi(df, original_threshold=0.7, seed=0)

Refit the in-house PPG SQI threshold to maximize Orphanidou agreement.

Splits windows 50/50, fits a new threshold on the calibration half to maximize Cohen's kappa against Orphanidou, then evaluates that threshold on the held-out half.

metrics

Common metrics used by the leaderboard and the audit report.

robustness

Robustness: how stable is a metric under input perturbations?

perturbation_stability(scorer, df, columns, noise_std=0.05, n_trials=10, seed=0)

Add Gaussian noise to selected columns, see how much the score moves.

A seeded local RNG is used so successive calls with the same seed produce identical results. Returns NaN if the scorer fails on every perturbation.

synthetic_to_real

Comparison harness for synthetic-to-real generalization.

This module is a stub: the framework is designed so that a real dataset can replace the synthetic one through the same column schema. The function below documents the contract.

Confounding

Back-door adjustment, AIPW doubly-robust estimation, E-values.

biomedical_signal_forensics_lab.confounding

Confounding detection: environmental, behavioral, missingness.

behavioral_confounding

Behavioral confounders: activity → PPG, low wear → reliability.

causal_inference

Causal confounding analysis: DAG, g-computation, doubly-robust estimation.

The default environmental_confounding.py reports Pearson correlations. That's a screening signal: useful, but a reviewer is right to want more before any adjusted-effect claim. This module provides three things:

  1. A minimal DAG representation (no networkx dependency) with enough structure to identify a valid adjustment set under the back-door criterion.
  2. G-computation (standardization): fit an outcome model conditional on the treatment and adjustment set, then average over the empirical covariate distribution to get an adjusted marginal effect. Includes bootstrap CIs.
  3. Doubly-robust AIPW (augmented inverse-probability weighting): combines a propensity model and an outcome model so that the estimate is consistent if either model is correct. Includes bootstrap CIs.

All of this is "first-order" causal inference. We do not handle time-varying confounding or instrumental variables. The point is to give a defensible adjusted effect alongside the screening correlation.

DAG dataclass

A directed acyclic graph encoded as a dict of node → list of parents.

Parents-of-X means "variables that directly cause X". This convention makes the back-door identification logic cleaner. Use add_edge(parent, child) if you prefer to think in arrows.

has_cycle()

True if the graph has a cycle (i.e., is not actually a DAG).

back_door_adjustment_set(treatment, outcome)

Return a valid adjustment set under a sufficient (but not always minimal) back-door criterion: take the ancestors of treatment that are also ancestors of outcome, excluding descendants of treatment.

This is the 'all-common-causes' rule. It is conservative but always valid in a DAG with no unmeasured confounders among the listed nodes.

Raises ValueError if the graph contains a cycle or if treatment == outcome.

g_computation(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear')

Adjusted marginal effect E[Y | do(T=1)] − E[Y | do(T=0)] by standardization.

aipw(df, treatment, outcome, adjustment_set, n_bootstrap=200, seed=0, learner='linear', clip=(0.02, 0.98))

Augmented inverse-propensity weighted estimator.

Consistent if either the outcome model or the propensity model is correct. Continuous treatments are median-binarized first; results are interpreted as the effect of being above vs below the cohort median.

default_wearable_dag()

Encodes the assumed causal structure of the synthetic cohort:

age, sex → baseline_hr, baseline_hrv heat_index → sleep_efficiency, hrv_rmssd stress_proxy → sleep_efficiency, hrv_rmssd, missing activity → resting_hr, hrv_rmssd, ppg_motion device_type → resting_hr, sqi skin_tone → ppg_sqi sleep_efficiency → hrv_rmssd (mediator)

The graph is small on purpose. Anyone applying this to a real cohort should write down their own DAG; this one is a starting point and a worked example.

screening_vs_adjusted_table(df, pairs, dag=None, learner='linear', default_covariates=None)

For each (confounder, outcome) pair, report screening r AND AIPW adjusted effect.

If the DAG yields an empty back-door set (e.g., the treatment is a root node), we fall back to default_covariates to give a useful adjusted estimate. This is a deliberate convenience: anyone with a real DAG should pass it in.

environmental_confounding

Environmental confounders: heat, AQI, humidity.

heat_sleep_confounding

A focused module on heat × sleep confounding.

missingness_dynamics

Missingness as signal: clustering, consecutive dropout, state dependence.

consecutive_run_lengths(flags)

Return lengths of consecutive 1-runs in a binary array.

state_dependent_missingness(df)

Does missingness correlate with stress / poor sleep / heat?

Reliability

ICC, test-retest stability, bootstrap reliability primitives.

biomedical_signal_forensics_lab.reliability

Reliability engine: ICC, test-retest, drift, device bias, trust score.

biomarker_trust_score

Digital Biomarker Trust Score (DBTS).

Six sub-scores → weighted composite in [0, 100] with a categorical label. All weights and thresholds are loaded from configs/reliability.yaml so the score is auditable.

DigitalBiomarkerTrustScore

Compute the trust score and its sub-components from a participant frame.

Inputs are kept simple on purpose: a per-participant dataframe with the daily columns plus any computed artifact / signal quality columns. We derive each sub-score with a transparent rule and combine via the configured weights.

change_point

Change-point detection on participant time-series.

Firmware updates, manufacturer pipeline changes, and silent algorithm swaps introduce step changes in wearable-derived signals. The reliability stack already catches these as "low temporal stability," but never attributes them to a discrete event. This module produces explicit change-point timestamps and a before-vs-after stratification.

Two methods:

  1. bocpd: a Bayesian online change-point detector with a constant hazard prior and a Normal-Inverse-Gamma model of run-length means. The full message-passing recursion is implemented in numpy with light bookkeeping.
  2. binary_segmentation: a fast offline alternative that recursively splits on the most likely change-point under a sum-of-squared-errors criterion, stopping when the proposed segment fails an F-test.

The two methods agree on textbook step changes (see tests). They disagree on ambiguous drifts, where bocpd is more conservative.

bocpd(values, hazard=1.0 / 50.0, mu0=None, kappa0=1.0, alpha0=1.0, beta0=1.0, min_collapse=5)

Bayesian online change-point detection.

NaN samples are removed before processing. If you need to preserve absolute indices, mask externally before calling.

Parameters

values 1D array of observations. hazard Constant prior probability of a change at each step. 1/50 ≈ "one expected change per 50 samples". Lower = more conservative. mu0, kappa0, alpha0, beta0 Normal-Inverse-Gamma prior hyperparameters. If mu0 is None, it is initialized to the mean of the first 5 observations. Defaults elsewhere are weakly informative. min_collapse Required drop in the MAP run length to register a change-point. A change at time t shows up as argmax(R_t) collapsing from a large value (the length of the current stationary regime) back to 0 or 1.

binary_segmentation(values, min_segment=7, f_stat_threshold=8.0, detrend=False)

Recursive binary segmentation with an F-statistic stopping rule.

f_stat_threshold corresponds to a fairly aggressive p ~ 0.005 floor for typical sample sizes; raise for more conservative detection.

NaN samples are stripped before processing. If detrend=True, a linear trend is subtracted from the series first; this helps avoid spurious change-points on slow continuous drifts (a known weakness of step-change detectors).

scan_cohort(df, metric, method='binary_segmentation', **kwargs)

Run change-point detection for every participant's series of metric.

stratify_before_after(values, cp_index)

Return summary statistics for the before and after segments.

device_bias

Estimate device-class bias on key biomarkers.

per_column_bias(df, columns, device_col='device_type', reference=None)

For each device class, report mean - reference_mean for each column.

bias_severity(bias_df)

Aggregate device-bias severity in [0, 1].

Defined as the maximum absolute mean_diff across rows, divided by a fixed scale (10 bpm / 10 ms) and clipped to [0, 1]. This is a simple heuristic, not a standardized effect size, so it depends on the unit of mean_diff. For metrics whose natural scale differs from 10, pre-scale or use a different aggregator.

Returns 0.0 on an empty frame.

fairness_audit

Stratified fairness audit.

For any categorical column (device_type, sex, age band, a skin-tone proxy quartile), compute the full DBTS panel within each level and emit:

  • a tidy per-stratum table with all six components and the overall score;
  • a per-component disparity table (max minus min across strata);
  • a 'forest-plot-ready' frame with bootstrapped confidence intervals.

The plot itself is in src/reports/figure_builder.py::fairness_forest_plot.

This module deliberately stays domain-agnostic. It does not know what "fair" means; it surfaces the disparities so a human can decide what to do about them.

fairness_audit(df, stratify_by, bin_strategy='auto', n_bootstrap=100, scorer=None)

Run DBTS for every level of stratify_by.

Parameters

df Daily-summary frame, schema-compliant. stratify_by Column to stratify on. If numeric, will be quartile-binned (or median-split when there aren't enough unique values). bin_strategy 'auto' (default), 'quartile', or 'none'. 'none' requires the column to be categorical or low-cardinality already. n_bootstrap Bootstrap resamples for the per-stratum CI. Set to 0 to skip. scorer Optional pre-built scorer (uses default config if None).

disparity_summary(stratum_table)

Max minus min across strata, per component. Quick read of who's worst off.

multi_stratify(df, stratify_columns, **kwargs)

Run the audit across multiple stratifiers; return a dict keyed by column.

intraclass_correlation

ICC(2,1): two-way random, single rater, absolute agreement.

Treats each participant as a 'target' and each day as a 'rater'. This is an unconventional use of the formula. The classical Shrout & Fleiss (1979) ICC assumes a small fixed set of human raters scoring each subject. Here we substitute calendar days within a participant as the 'raters'. The substitution is empirically defensible (see compare_to_test_retest below): on the bundled synthetic cohort, the ICC ranking matches the proper week-pair test-retest ranking, with the ICC being uniformly more conservative because it counts day-to-day fluctuation as rater disagreement.

The classical numerical implementation here was sanity-checked against the textbook example from Shrout & Fleiss (1979) Table 1 and recovers the published ICC(2,1) value of 0.290.

For a clinical paper the better metric is probably the bootstrap week-pair correlation in reliability/test_retest.py. The ICC value here is included for backward compatibility and as a within-participant noise-floor signal.

MethodComparison dataclass

Empirical comparison of ICC(2,1) days-as-raters against bootstrap week-pair r.

icc_2_1(matrix, nan_policy='drop')

Intraclass correlation, ICC(2,1) form: two-way random, single rater, absolute agreement.

Parameters

matrix : np.ndarray Shape (n_subjects, n_raters). nan_policy : {'drop', 'raise'} - 'drop' (default): drop any subject row that contains a NaN. - 'raise': raise ValueError if any NaN is present.

Returns

float ICC value, or NaN if the cleaned input has fewer than 2 subjects or 2 raters.

compute_from_dataframe(df, column, window_days=7)

Take first window_days per participant, drop incomplete, compute ICC.

compare_to_test_retest(df, columns, icc_window_days=7, n_bootstrap=200)

Empirical comparison: ICC(2,1) days-as-raters against bootstrap week-pair r.

This is the formal defense of the unconventional ICC use. If the two methods rank metrics consistently and the ICC is monotonically related to the proper test-retest correlation, the ICC is doing useful work even though the rater interpretation is unusual.

Returns one row per metric with both estimates and the week-pair CI.

temporal_stability

Day-to-day stability, rolling CoV, drift.

drift_slope(values)

Linear-regression slope and p-value of values vs sample index.

NaN samples are dropped before the regression. Returns NaN slope and p-value when fewer than 5 finite samples remain or when the finite portion has effectively zero variance.

test_retest

Test-retest reliability for daily-summary time series.

Three functions, in increasing order of methodological seriousness:

  • split_half_correlation: Pearson between the first half and the second half of a single participant's series. Quick, simple, suitable for ranking participants. Single point estimate; no CI.

  • week_pair_correlation: For each pair of adjacent non-overlapping weeks within a participant, correlate the within-week means. Standard test-retest formulation when daily measurements are pooled by week.

  • cohort_test_retest: Pool participants and report between-participant test-retest reliability of any weekly aggregate of choice. Returns a point estimate plus a participant-level bootstrap 95% CI. This is the one to put in a paper.

The 'classical' (Shrout & Fleiss 1979) ICC is over in intraclass_correlation.py. We deliberately keep the two separate because they answer different questions: this module asks 'is the same participant consistent week to week?', ICC asks 'how much of the total variance is between-participant?'.

split_half_correlation(values)

Pearson correlation between the first and second halves of a series.

per_participant(df, column)

Per-participant split-half correlation (no CI).

week_pair_correlation(df, column, min_weeks=4)

Pool all (week_t, week_{t+1}) pairs across participants, correlate.

A participant with at least min_weeks weeks of data contributes n_weeks - 1 pairs. Returns the Pearson correlation across all pooled pairs.

cohort_test_retest(df, column, method='week_pair', n_bootstrap=500, min_weeks=4, seed=0)

Pool participants, compute a single test-retest estimate, bootstrap the CI.

Bootstrap resamples participants (cluster bootstrap), which is the appropriate unit for clustered repeated-measures data.

cohort_test_retest_table(df, metrics, **kwargs)

Run cohort_test_retest for several metrics and return a tidy table.

weight_optimization

Learn trust-score weights from data, tied to a downstream reproducibility task.

The default DBTS weights are a defensible prior. They are not learned. Anyone who actually cares about a specific downstream task (week-over-week stability of an HRV-based stress estimate, say) can do better by searching the weight simplex for the weighting that maximizes correlation between DBTS and that task's stability.

We provide:

  • weekly_reproducibility_target: a function that produces a per-participant reproducibility score for a chosen metric. Default metric: HRV RMSSD, default measure: 1 − relative absolute difference between adjacent weeks.
  • learn_weights: grid + random search over the 6-component simplex, maximizing Spearman ρ between DBTS overall and the per-participant target.
  • sensitivity_table: how much do conclusions move when the weights change? Reports overall-score correlation between top-k weight settings and the rank stability of the participant ordering they induce.

Methodology: - Search is on the simplex with a Dirichlet-uniform sample plus a refining local grid around the best point. Both phases are exposed so the search budget is auditable. - The target is computed on a held-out half of the participants; weight fitting and validation never see the same participants. - We never claim the learned weights generalize to a different task. The point is to report the dependence and let the user decide.

weekly_reproducibility_target(df, metric='hrv_rmssd')

Per-participant week-over-week reproducibility of metric.

Score is 1 minus the median relative absolute difference between adjacent 7-day means. Higher = more reproducible. Range roughly [0, 1].

participant_components(df, scorer=None)

Run the DBTS once per participant and return the six component scores.

overall_from_components(components_df, weights)

Compute the overall score under a given weighting.

learn_weights(df, target_metric='hrv_rmssd', n_random=400, n_refine=6, holdout_fraction=0.3, seed=0, min_train_participants=5)

Search the 6-simplex for weights that maximize Spearman ρ vs reproducibility.

Returns LearnedWeights with NaN ρ values (and default weights) if the cohort has fewer than min_train_participants participants with a computable reproducibility target, or if the bootstrap never produces a finite Spearman.

sensitivity_table(df, learned, perturbations=None)

How sensitive is the participant ranking to small weight changes?

For each perturbation magnitude p, draw Dirichlet noise of scale p around the learned weights, recompute the overall score, and report the Spearman rank correlation with the original ordering.

Models

Downstream LF/HF biomarker classifier with LOSO cross-validation.

biomedical_signal_forensics_lab.models

Quality / anomaly models: autoencoder, isolation forest, baselines.

anomaly_detector

Isolation Forest anomaly detector + logistic / random forest baselines.

quality_autoencoder

A small PyTorch autoencoder used as an unsupervised quality probe.

We train it on what looks like 'good' data (high signal quality ground truth) and use reconstruction error as a quality score for new samples.

sequence_model

Tiny GRU baseline for sequence-level quality (per-participant time series).

Trained against per-participant mean reliability_ground_truth as a regression sanity check. Not the main contribution: just here so reviewers can see we considered the temporal dimension.

Reports

Publication-grade figure generation and summary JSON writers.

biomedical_signal_forensics_lab.reports

Report generation, figures, trust-report markdown.

figure_builder

Publication-style figures. Matplotlib only.

Each function takes an already-loaded dataframe and writes a PNG to results/figures/. Numbers are deliberately kept legible at small sizes so they reproduce well in a PDF.

fairness_forest_plot(stratum_table, out='results/figures/fairness_forest.png')

Forest plot of per-stratum overall trust score with bootstrap CIs.

Expects the output of fairness_audit: columns stratum_value, overall_trust_score, ci_low, ci_high.

real_data_figures

Publication-quality figures for the deep WESAD analysis.

Six figures
  1. Bland-Altman: HR-from-PPG vs HR-from-ECG, color-coded by state
  2. Scatter HR_PPG vs HR_ECG with identity line and per-subject color
  3. SQI pass rate by method (in-house vs Orphanidou vs Sukor)
  4. Per-state distribution panels (4-subplot: SQI, motion, HR error, template corr)
  5. Motion vs HR disagreement scatter
  6. Recalibration: original vs recalibrated kappa across thresholds

fig_bland_altman_hr(window_df, out)

Bland-Altman: mean of HR_PPG and HR_ECG vs their difference.

fig_hr_scatter(window_df, out)

HR_PPG vs HR_ECG scatter with identity line.

fig_sqi_pass_rates(three_way_result, out)

Bar chart of pass rates across three SQI methods.

fig_per_state_panels(window_df, out)

4-panel boxplot: SQI, motion, HR error, template corr by state.

fig_motion_vs_hr_error(window_df, out)

Scatter: PPG motion score vs |HR_PPG - HR_ECG|.

fig_recalibration_curve(window_df, out)

Cohen's kappa vs in-house SQI threshold, with original and recalibrated.

make_all_figures(window_df, three_way_result, out_dir)

Generate all six figures and return their paths.

report_generator

Top-level audit report generator.

Produces results/reports/signal_forensics_report.md plus a set of figures. Designed to be readable as a standalone document.

trust_report

Markdown formatting for a single trust report.


If a module above does not render, it may be because mkdocstrings could not import it during the docs build. Common causes:

  • The package is not installed in the docs environment. Run pip install -e . before mkdocs serve.
  • A module imports a heavy optional dependency that is not in the docs environment. Lazy imports inside functions are recommended over module-level imports of optional heavy dependencies.