Reviewed against: sleep-trackers-performance-master/AnalyticalPipeline_v1.0.0.Rmd and its
companion R functions (ebe2sleep.R, errorMatrix.R, indEBE.R, groupEBE.R, indDiscr.R,
groupDiscr.R, BAplot.R).
- ✅ Implemented / Fixed
⚠️ Partial / In progress- ❌ Missing / Not implemented
- 🐛 Bug
See #228
| # | Issue | Status | Notes |
|---|---|---|---|
| 1 | Specificity (TNR) missing from get_agreement_bystage() |
✅ Implemented | Per-stage one-vs-rest confusion matrix in scorer() |
| 2 | PABAK not computed | ❌ | R: epi.kappa()$pabak. Not trivially added via sklearn |
| 3 | NPV missing from get_agreement_bystage() |
✅ Implemented | Per-stage one-vs-rest confusion matrix in scorer() |
| 4 | Prevalence index / bias index not computed | ❌ | R: epi.kappa()$pindex, $bindex |
| 5 | ROC curves missing | ❌ | R: groupEBE(doROC=TRUE) via ROCR. Could add with sklearn |
| 10 | pooled mode for get_agreement() not surfaced |
✅ Implemented | Added pooled=False parameter; True = R's metricsType="sum" |
- Accuracy is reported 0–1 in YASA vs 0–100% in R (convention only, not a bug).
- R's
metricsType="avg"(per-subject average) = YASA's defaultget_agreement()behavior. - R's
metricsType="sum"(pooled epochs) = YASA's newget_agreement(pooled=True). get_agreement_bystage()now returns 6 metrics:fbeta, npv, precision, recall, specificity, support.
| # | Issue | Status | Notes |
|---|---|---|---|
| 6 | Log transformation missing | ✅ Implemented | log_transform=True param + Euser et al. (2008) back-transform; "log" loa_method. |
| 7 | Individual discrepancy heatmap (indDiscr.R) |
❌ | No equivalent; data available via get_sleep_stats() |
| 8 | Calibration direction bug | ✅ Fixed | See detailed investigation below (Bugs A, B, C) |
| Feature | YASA | R pipeline |
|---|---|---|
| Calibration of new data | ✅ calibrate(), get_calibration_func() |
❌ Not present |
| MCC (Matthews Correlation Coefficient) | ✅ | ❌ |
| Balanced accuracy | ✅ | ❌ |
| BCa bootstrap (more robust) | ✅ | ❌ (R uses "basic" default) |
| Overlaid hypnogram plots | ✅ plot_hypnograms() |
❌ |
| MAD / median in group summary | ✅ summary() |
❌ |
| Human-readable report table | ✅ report() |
❌ |
| Bland-Altman plot | ✅ plot_blandaltman() |
✅ BAplot.R |
✅ plot_blandaltman() implemented in SleepStatsAgreement (evaluation.py:1604).
| Element | R (BAplot.R) |
YASA (plot_blandaltman) |
|---|---|---|
| Scatter points | One dot per session | ✅ One dot per session |
| Bias line | Horizontal mean(diff) or regression b0 + b1*ref |
✅ Same; auto-selected from assumptions |
| Bias CI | t-CI or bootstrap band | ✅ Shaded band; method via ci_method |
| LoA lines | Constant bias ± 1.96 SD or bias ± 2.46*(c0 + c1*ref) |
✅ Same; auto-selected from assumptions |
| LoA CI | Dashed CI bands | ✅ Shaded bands |
| Flag biased | Red bias line if significant | ✅ flag_biased=True |
| Euser LoA | bias ± ref × euser_slope |
✅ log_transform=True |
| Marginal density | ggExtra::ggMarginal |
❌ Not implemented |
xaxis="mean" |
(obs+ref)/2 on x-axis | ❌ Only reference on x-axis |
| Parameter | Purpose | Status |
|---|---|---|
logTransf = TRUE/FALSE |
Use Euser back-transform for LoA | ✅ log_transform=True |
xaxis = "mean" |
X-axis: (obs+ref)/2 instead of ref | ❌ Not planned |
xlim, ylim |
Axis limits | ❌ Can be set via matplotlib post-call |
Reference: groupDiscr.R lines 208–264, BAplot.R log section, Euser et al. (2008) and Bland & Altman (1999).
log_transformis bool only — no per-stat list. Mixed cases use twoSleepStatsAgreementobjects.- Euser slope is internal — not exposed in
summary(), only rendered inplot_blandaltmanandreport. "log"is a first-classloa_methodvalue, alongside"param"and"regr".auto_methods["loa"]returns"log"for all stats whenlog_transform=True.- No
log_normalinassumptions— normality of original diffs drives CI selection as before. - Bootstrap CI for Euser slope handled inside existing
_generate_bootstrap_ci, no new methods.
When differences between devices and reference scale proportionally with the measurement size (heteroscedasticity), a log transformation stabilises the variance. The Euser et al. (2008) method computes limits of agreement in log space and back-transforms them to a slope that multiplies the reference value:
LoA_upper = bias + slope × ref
LoA_lower = bias − slope × ref
slope = 2 * (exp(agreement × SD(log_diffs)) − 1) / (exp(agreement × SD(log_diffs)) + 1)
where log_diffs = log(obs + ε) − log(ref + ε) and ε = 1e-4 to avoid log(0).
This is the only LoA representation that applies for log-transformed stats. The standard loa_lower/loa_upper (constant) and loa_intercept/loa_slope (regression) representations do not apply to these stats.
tests/test_evaluation_sri.py loads all 14 subjects (10 766 epochs, 30-s epochs) from
tests/data/sample_data_sri.csv and compares YASA's output to reference values extracted
from the published HTML report (AnalyticalPipeline_v1.0.0.html), stored in
tests/data/evaluation_sri_full.json. 32 test methods are run via unittest.
| Test class | What is checked | Tolerance |
|---|---|---|
TestSRIPerSubject |
Per-subject recall and specificity — 14 subjects × 4 stages (112 checks) | 0.1 pp |
TestSRIPerSubjectSleepWake |
Per-subject binary SLEEP/WAKE accuracy, sensitivity, specificity — 14 subjects (42 checks) | 0.1 pp |
TestSRIGroupMeans |
Group mean recall, specificity, PPV, NPV per stage — 4 metrics × 4 stages | 0.5 pp |
TestSRIGroupMeansSleepWake |
Group mean binary sensitivity and specificity | 0.5 pp |
TestSRISleepStats |
Per-subject TIB, TST, SE, SOL, stage durations, stage % for both scorers — 14 × 10 × 2 (280 checks) | 0.1 |
TestSRISleepStats |
Per-subject WASO as TIB − SOL − TST (R pipeline definition, including post-sleep wake) — 14 × 2 | 0.1 |
TestSRIDiscrepancies |
Per-subject Device − Reference differences for TST, SE, SOL, WASO, stage durations, stage % — 14 × 10 (140 checks) | 0.1 |
TestSRIPooledMetrics |
Pooled (all-epoch) recall and specificity per stage matching R's metricsType="sum" |
0.1 pp |
TestSRIConfusionMatrixValues |
All 16 cells of the pooled absolute confusion matrix, accessed by label | exact |
TestSRISanity |
Dataset size, output shapes, index names, stage labels, metrics in [0, 1] | — |
Tolerance note: 0.1 pp covers single rounding in the HTML source (±0.005 pp). 0.5 pp at group level covers accumulated rounding across 14 subjects.
WASO note: The R pipeline (ebe2sleep.R lines 47–50) counts wake epochs from the first sleep epoch to the end of the recording, including any post-sleep wake after the final sleep epoch. This is algebraically equivalent to TIB − SOL − TST. YASA's built-in WASO counts only wake within the Sleep Period Time (first to last sleep epoch), equivalent to SPT − TST. For subjects with post-sleep wake (sbj09, sbj11) the two definitions disagree. The tests use TIB − SOL − TST, which matches the R pipeline exactly for all 14 subjects.
| Item | Reason |
|---|---|
| Per-subject accuracy (4-stage) | YASA accuracy = fraction correct across all stages; R pipeline reports binary one-vs-rest accuracy per stage. Numerically different. (Binary accuracy is tested.) |
| Per-subject PPV / NPV | R pipeline reports these at group level only; no per-subject reference. |
| Per-stage Cohen's κ, PABAK, prevalence index | R pipeline computes one-vs-rest; YASA kappa is multiclass. PABAK is not yet in YASA (item 2 above). |
| Bland-Altman bias, LoA, and CIs | R pipeline uses conditional regression depending on assumption tests, making expected outputs data-dependent and impractical to pin as fixed reference values. |