Commit 79a47b3

committed
format
1 parent c2a845a commit 79a47b3

1 file changed

Lines changed: 35 additions & 30 deletions

research/modules/5-billion-scale-polypharmacy/manuscripts/manuscript_v2.0.md

@@ -286,63 +286,63 @@ We demonstrate that billion-scale federated causal inference is computationally
286286

287287
## References
288288

289-
1. Gelman A, Carlin J. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. *Perspectives on Psychological Science* 2014;9(6):641-651.
289+
1. Gelman A, Carlin J. Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. _Perspectives on Psychological Science_ 2014;9(6):641-651.
290290

291-
2. Lash TL, Fox MP, MacLehose RF, et al. Good practices for quantitative bias analysis. *International Journal of Epidemiology* 2014;43(6):1969-1985.
291+
2. Lash TL, Fox MP, MacLehose RF, et al. Good practices for quantitative bias analysis. _International Journal of Epidemiology_ 2014;43(6):1969-1985.
292292

293293
3. FDA Sentinel Initiative. https://www.sentinelinitiative.org
294294

295-
4. Coloma PM, Schuemie MJ, Trifirò G, et al. Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project. *Pharmacoepidemiology and Drug Safety* 2011;20(1):1-11.
295+
4. Coloma PM, Schuemie MJ, Trifirò G, et al. Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project. _Pharmacoepidemiology and Drug Safety_ 2011;20(1):1-11.
296296

297-
5. McMahan HB, Moore E, Ramage D, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. *AISTATS* 2017.
297+
5. McMahan HB, Moore E, Ramage D, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data. _AISTATS_ 2017.
298298

299-
6. Kairouz P, McMahan HB, Avent B, et al. Advances and Open Problems in Federated Learning. *Foundations and Trends in Machine Learning* 2021;14(1-2):1-210.
299+
6. Kairouz P, McMahan HB, Avent B, et al. Advances and Open Problems in Federated Learning. _Foundations and Trends in Machine Learning_ 2021;14(1-2):1-210.
300300

301-
7. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. *Biometrika* 1983;70(1):41-55.
301+
7. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. _Biometrika_ 1983;70(1):41-55.
302302

303303
8. Hernán MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2020.
304304

305305
9. Pearl J. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, 2009.
306306

307-
10. Petersen ML, van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. *Epidemiology* 2014;25(3):418-426.
307+
10. Petersen ML, van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. _Epidemiology_ 2014;25(3):418-426.
308308

309-
11. D'Agostino RB. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. *Statistics in Medicine* 1998;17(19):2265-2281.
309+
11. D'Agostino RB. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. _Statistics in Medicine_ 1998;17(19):2265-2281.
310310

311-
12. Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. *Multivariate Behavioral Research* 2011;46(3):399-424.
311+
12. Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. _Multivariate Behavioral Research_ 2011;46(3):399-424.
312312

313-
13. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. *Journal of Educational Psychology* 1974;66(5):688-701.
313+
13. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. _Journal of Educational Psychology_ 1974;66(5):688-701.
314314

315315
14. Imbens GW, Rubin DB. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
316316

317-
15. VanderWeele TJ, Ding P. Sensitivity Analysis in Observational Research: Introducing the E-Value. *Annals of Internal Medicine* 2017;167(4):268-274.
317+
15. VanderWeele TJ, Ding P. Sensitivity Analysis in Observational Research: Introducing the E-Value. _Annals of Internal Medicine_ 2017;167(4):268-274.
318318

319-
16. Meng XL. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. *Annals of Applied Statistics* 2018;12(2):685-726.
319+
16. Meng XL. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. _Annals of Applied Statistics_ 2018;12(2):685-726.
320320

321-
17. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. *American Journal of Epidemiology* 2008;168(6):656-664.
321+
17. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. _American Journal of Epidemiology_ 2008;168(6):656-664.
322322

323-
18. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. *Statistics in Medicine* 2004;23(19):2937-2960.
323+
18. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. _Statistics in Medicine_ 2004;23(19):2937-2960.
324324

325-
19. Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: dealing with observations in the tails of the propensity score distribution. *American Journal of Epidemiology* 2010;172(7):843-854.
325+
19. Stürmer T, Rothman KJ, Avorn J, Glynn RJ. Treatment effects in the presence of unmeasured confounding: dealing with observations in the tails of the propensity score distribution. _American Journal of Epidemiology_ 2010;172(7):843-854.
326326

327-
20. Petersen ML, Porter KE, Gruber S, et al. Diagnosing and responding to violations in the positivity assumption. *Statistical Methods in Medical Research* 2012;21(1):31-54.
327+
20. Petersen ML, Porter KE, Gruber S, et al. Diagnosing and responding to violations in the positivity assumption. _Statistical Methods in Medical Research_ 2012;21(1):31-54.
328328

329-
21. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. *Journal of the American Statistical Association* 2018;113(521):390-400.
329+
21. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via propensity score weighting. _Journal of the American Statistical Association_ 2018;113(521):390-400.
330330

331-
22. Zhao Q, Percival D. Entropy balancing is doubly robust. *Journal of Causal Inference* 2017;5(1):20160010.
331+
22. Zhao Q, Percival D. Entropy balancing is doubly robust. _Journal of Causal Inference_ 2017;5(1):20160010.
332332

333-
23. Hainmueller J. Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies. *Political Analysis* 2012;20(1):25-46.
333+
23. Hainmueller J. Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies. _Political Analysis_ 2012;20(1):25-46.
334334

335-
24. Athey S, Imbens GW. Machine Learning Methods for Estimating Heterogeneous Causal Effects. *Statistical Science* 2019;34(2):197-209.
335+
24. Athey S, Imbens GW. Machine Learning Methods for Estimating Heterogeneous Causal Effects. _Statistical Science_ 2019;34(2):197-209.
336336

337-
25. Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. *Econometrics Journal* 2018;21(1):C1-C68.
337+
25. Chernozhukov V, Chetverikov D, Demirer M, et al. Double/debiased machine learning for treatment and structural parameters. _Econometrics Journal_ 2018;21(1):C1-C68.
338338

339-
26. Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. *Foundations and Trends in Theoretical Computer Science* 2014;9(3-4):211-407.
339+
26. Dwork C, Roth A. The Algorithmic Foundations of Differential Privacy. _Foundations and Trends in Theoretical Computer Science_ 2014;9(3-4):211-407.
340340

341-
27. Abadi M, Chu A, Goodfellow I, et al. Deep Learning with Differential Privacy. *ACM CCS* 2016:308-318.
341+
27. Abadi M, Chu A, Goodfellow I, et al. Deep Learning with Differential Privacy. _ACM CCS_ 2016:308-318.
342342

343-
28. Li W, Milletarì F, Xu D, et al. Privacy-Preserving Federated Brain Tumour Segmentation. *MICCAI Workshop* 2019:133-141.
343+
28. Li W, Milletarì F, Xu D, et al. Privacy-Preserving Federated Brain Tumour Segmentation. _MICCAI Workshop_ 2019:133-141.
344344

345-
29. Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. *NPJ Digital Medicine* 2020;3:119.
345+
29. Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. _NPJ Digital Medicine_ 2020;3:119.
346346

347347
---
348348

@@ -366,7 +366,7 @@ We demonstrate that billion-scale federated causal inference is computationally
366366

367367
### Figure 3: Sign Flip Phenomenon
368368

369-
Treatment effect convergence across sample sizes for rare polypharmacy subgroup (CKD Stage 3b + Loop Diuretic + Age>80, prevalence 0.064%).
369+
Treatment effect convergence across sample sizes for rare polypharmacy subgroup (CKD Stage 3b + Loop Diuretic + Age>80, prevalence 0.064%).
370370

371371
**Key Finding**: At 1M patients (n=645), estimated ATE = -2.11 ml/min/year (95% CI: -3.14 to -1.07, p=0.003), suggesting harm. At 1B patients (n=632,776), estimated ATE = +1.46 ml/min/year (95% CI: +1.41 to +1.52, p<0.0001), indicating benefit—a complete sign reversal with high statistical confidence at both scales.
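As a quick sanity check on the subgroup sizes quoted above, the expected count is the cohort size multiplied by the 0.064% prevalence. The sketch below uses a hypothetical helper name and integer arithmetic (64 per 100,000) to avoid rounding; the reported n=645 and n=632,776 sit close to these expected values.

```python
def expected_subgroup_size(total_patients, prevalence_per_100k):
    """Expected subgroup count for a prevalence expressed per 100,000 patients."""
    return total_patients * prevalence_per_100k // 100_000

# 0.064% prevalence is 64 patients per 100,000.
PREVALENCE_PER_100K = 64

expected_1m = expected_subgroup_size(1_000_000, PREVALENCE_PER_100K)
expected_1b = expected_subgroup_size(1_000_000_000, PREVALENCE_PER_100K)

print(expected_1m, expected_1b)
```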
372372

@@ -383,21 +383,25 @@ Treatment effect convergence across sample sizes for rare polypharmacy subgroup
383383
**Synthetic Data Protocol**: Extended Synthea framework with embedded ground truth for validation.
384384

385385
**Polypharmacy Modeling**:
386+
386387
- Base rate: 35% (Age>65), 60% (CKD Stage 3+)
387388
- Three interaction tiers: Interaction 1 (16% prevalence), Interaction 2 (0.4%), Interaction 3 (0.064%)
388389

389390
**Ground Truth Effects**:
391+
390392
- SGLT2i baseline: +1.0 ml/min/year
391393
- Interaction 1: +2.0 ml/min/year additional
392-
- Interaction 2: +0.5 ml/min/year additional
394+
- Interaction 2: +0.5 ml/min/year additional
393395
- Interaction 3: +0.5 ml/min/year additional
394396

395397
**Confounding Structure**:
398+
396399
- Logistic propensity model: logit(P(T=1)) = 0.5×(HbA1c-7) - 0.3×(eGFR-60)/10 + 0.2×Age/10
397400
- Confounding by indication: Sicker patients preferentially receive treatment
398401
- Missing data: 5% missing-at-random (MAR)
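
The logistic propensity model in the list above can be sketched as a small function. This is a minimal illustration, not the manuscript's generator: the function name is hypothetical, and the units (HbA1c in %, eGFR in ml/min, age in years) are assumptions inferred from the centering constants.

```python
import math

def propensity(hba1c, egfr, age):
    """P(T=1) under the stated model; argument units are assumed, not sourced."""
    # logit(P(T=1)) = 0.5*(HbA1c-7) - 0.3*(eGFR-60)/10 + 0.2*Age/10
    logit = 0.5 * (hba1c - 7.0) - 0.3 * (egfr - 60.0) / 10.0 + 0.2 * age / 10.0
    return 1.0 / (1.0 + math.exp(-logit))

# Confounding by indication: higher HbA1c and lower eGFR raise the
# treatment probability at a fixed age.
sicker = propensity(hba1c=9.0, egfr=30.0, age=80.0)
healthier = propensity(hba1c=6.0, egfr=90.0, age=80.0)
print(sicker > healthier)
```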
399402

400403
**Data Generation at Scale**:
404+
401405
- 1000 sites × 1M patients per site = 1B total
402406
- Streaming generation (no disk I/O)
403407
- Worker threads parallelization for site-level computation
@@ -408,17 +412,17 @@ Treatment effect convergence across sample sizes for rare polypharmacy subgroup
408412
**Theorem 1 (Federated-Centralized Equivalence)**:
409413
Federated Newton-Raphson propensity score estimation produces identical estimates to centralized analysis.
410414

411-
*Proof*: By associativity of sums, ∑_{k=1}^K g_k = ∑_{i=1}^N x_i(T_i - p_i) and ∑_{k=1}^K H_k = ∑_{i=1}^N x_ix_i^T p_i(1-p_i), where k indexes sites and i indexes patients. Therefore, β^{(t+1)} = β^{(t)} + (∑_k H_k)^{-1}(∑_k g_k) is mathematically equivalent to centralized Newton-Raphson. □
415+
_Proof_: By associativity of sums, ∑_{k=1}^K g_k = ∑_{i=1}^N x_i(T_i - p_i) and ∑_{k=1}^K H_k = ∑_{i=1}^N x_ix_i^T p_i(1-p_i), where k indexes sites and i indexes patients. Therefore, β^{(t+1)} = β^{(t)} + (∑_k H_k)^{-1}(∑_k g_k) is mathematically equivalent to centralized Newton-Raphson. □
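
The equivalence in Theorem 1 can be illustrated numerically with a toy example. The sketch below is an assumption-laden illustration rather than the manuscript's implementation: it uses two hypothetical sites, p = 2 (intercept plus one covariate), made-up data, and a closed-form 2×2 inverse. In floating point the two routes agree up to rounding, not bit for bit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_stats(rows, beta):
    """Gradient g_k and Hessian H_k for one site's rows [(x, t), ...]."""
    p = len(beta)
    g = [0.0] * p
    H = [[0.0] * p for _ in range(p)]
    for x, t in rows:
        prob = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        w = prob * (1.0 - prob)
        for a in range(p):
            g[a] += x[a] * (t - prob)
            for b in range(p):
                H[a][b] += x[a] * x[b] * w
    return g, H

def add_stats(stats):
    """Element-wise sum of per-site gradients and Hessians."""
    p = len(stats[0][0])
    g = [sum(s[0][a] for s in stats) for a in range(p)]
    H = [[sum(s[1][a][b] for s in stats) for b in range(p)] for a in range(p)]
    return g, H

def newton_step(beta, g, H):
    """One Newton-Raphson update for p = 2 via the closed-form 2x2 inverse."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return [beta[0] + d0, beta[1] + d1]

# Toy data (hypothetical): x = (1, covariate), t = treatment indicator.
site_a = [((1.0, 0.2), 1), ((1.0, -1.0), 0), ((1.0, 0.7), 1)]
site_b = [((1.0, 1.5), 0), ((1.0, -0.3), 0), ((1.0, 0.9), 1)]
beta0 = [0.0, 0.0]

# Federated route: each site computes its own statistics, then they are summed.
fed_g, fed_H = add_stats([local_stats(site_a, beta0), local_stats(site_b, beta0)])
# Centralized route: the same statistics computed on pooled rows.
cen_g, cen_H = local_stats(site_a + site_b, beta0)

fed_beta = newton_step(beta0, fed_g, fed_H)
cen_beta = newton_step(beta0, cen_g, cen_H)
print(fed_beta, cen_beta)
```

Only `fed_g` and `fed_H` cross the site boundary in the federated route; patient-level rows stay local.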
412416

413417
**Theorem 2 (Communication Complexity)**:
414418
Federated algorithm achieves O(1) communication per site independent of sample size.
415419

416-
*Proof*: Each site transmits fixed-dimension statistics: gradient g_k ∈ ℝ^p, Hessian H_k ∈ ℝ^{p×p}, weighted matrices XWX_k ∈ ℝ^{p×p}, XWY_k ∈ ℝ^p. For p=5 covariates, communication = 5 (gradient) + 15 (Hessian upper triangle) + 15 (XWX) + 5 (XWY) = 40 floating-point numbers × 8 bytes = 320 bytes per site. Observed: 264 bytes (compression/encoding). □
420+
_Proof_: Each site transmits fixed-dimension statistics: gradient g_k ∈ ℝ^p, Hessian H_k ∈ ℝ^{p×p}, weighted matrices XWX_k ∈ ℝ^{p×p}, XWY_k ∈ ℝ^p. For p=5 covariates, communication = 5 (gradient) + 15 (Hessian upper triangle) + 15 (XWX) + 5 (XWY) = 40 floating-point numbers × 8 bytes = 320 bytes per site. Observed: 264 bytes (compression/encoding). □
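
The payload arithmetic in the proof can be reproduced for any p. The sketch below uses hypothetical function names and follows the proof's counting convention, in which both symmetric p×p matrices are sent as upper triangles of p(p+1)/2 entries. It covers only the uncompressed 320-byte figure and does not model the compression or encoding behind the observed 264 bytes.

```python
def payload_floats(p):
    """Floats sent per site: gradient + Hessian triangle + XWX triangle + XWY."""
    upper_triangle = p * (p + 1) // 2
    return p + upper_triangle + upper_triangle + p

def payload_bytes(p, bytes_per_float=8):
    """Uncompressed payload size, assuming 8-byte floating-point numbers."""
    return payload_floats(p) * bytes_per_float

print(payload_floats(5), payload_bytes(5))
```

The result depends on p alone, which is the sense in which per-site communication is O(1) in the number of patients.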
417421

418422
**Theorem 3 (Privacy Preservation)**:
419423
Aggregated statistics (gradient, Hessian, XWX, XWY) satisfy HIPAA Safe Harbor de-identification standard (§164.514(b)(2)).
420424

421-
*Proof*: Transmitted statistics are aggregates over ≥1M patients per site, containing no individual identifiers, no cell counts <10, and no patient-level data. Satisfies statistical de-identification requirements. □
425+
_Proof_: Transmitted statistics are aggregates over ≥1M patients per site, containing no individual identifiers, no cell counts <10, and no patient-level data. Satisfies statistical de-identification requirements. □
422426

423427
### Supplement C: Sensitivity Analyses
424428

@@ -480,6 +484,7 @@ ATE = (Σ XWY_treated) / (Σ w_treated) - (Σ XWY_control) / (Σ w_control)
480484
**Full results tables** for all sample sizes (100K, 1M, 10M, 100M, 1B) and all subgroups (Overall, Interaction 1, Interaction 2, Interaction 3) are available in the online repository: https://github.com/watilde/Harmonia
481485

482486
**Key findings across all subgroups**:
487+
483488
- Overall subgroup (84% prevalence, n=841M at 1B scale): Monotonic convergence to ATE=+1.28, no sign flip
484489
- Interaction 1 (17% prevalence, n=169M): Monotonic convergence to ATE=+2.86, no sign flip
485490
- Interaction 2 (0.4% prevalence, n=4.2M): Monotonic convergence to ATE=+1.50, no sign flip
