Abstract
Federated learning has gained considerable popularity over the last decade for its ability to build models collaboratively on data from multiple datasets. However, in real-world biomedical settings, practical challenges remain, including the need to protect patient privacy, the need to account for between-site heterogeneity in patient characteristics, and, from an operational point of view, the number of communications required across data partners. In this chapter, we describe and provide examples of multi-database data-sharing mechanisms in the healthcare data context and highlight the primary methods available for performing statistical regression analysis in each setting. For each method, we discuss the advantages and disadvantages in terms of data privacy, communication efficiency, heterogeneity awareness, and statistical accuracy. Our goal is to provide researchers with the insight necessary to choose among the available algorithms when conducting regression analysis with multi-site data.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Edmondson, M.J., Luo, C., Chen, Y. (2023). Statistical Analysis—Meta-Analysis/Reproducibility. In: Asselbergs, F.W., Denaxas, S., Oberski, D.L., Moore, J.H. (eds) Clinical Applications of Artificial Intelligence in Real-World Data. Springer, Cham. https://doi.org/10.1007/978-3-031-36678-9_8
DOI: https://doi.org/10.1007/978-3-031-36678-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36677-2
Online ISBN: 978-3-031-36678-9
eBook Packages: Medicine, Medicine (R0)