Abstract
Electronic Health Records (EHRs) have become a popular data source for conducting observational studies of health outcomes. One advantage of using EHR-derived data for biomedical and epidemiologic research is the ability to efficiently construct large cohorts, providing access to “big data” in healthcare. For example, the U.S. Food and Drug Administration’s Sentinel System, which is composed of EHR and administrative claims data, includes over 100 million people, constituting approximately one-third of the U.S. population. Although the sample size of EHR-derived cohorts can be very large, EHR data arise through a complex, non-random sampling process that can induce bias when using such data to obtain parameter estimates that are meant to be representative of an underlying population. In the U.S.A., where most health insurance is employment-based, insured populations are often non-representative of uninsured populations, and thus, insurance status, as well as health literacy and healthcare-seeking behavior, is associated with representation in EHRs. As a result, the non-random sampling mechanism that gives rise to EHR data can induce significant bias in parameter estimates derived from EHR-based studies relative to the underlying population parameters. Here, we derive formulas for the mean-squared error of an EHR-derived sample as a function of the strength of association between a health outcome of interest, the sampling process, and an underlying unobserved covariate. We also provide a formula for the effective sample size of an EHR-derived cohort defined as the sample size of a simple random sample with equivalent mean-squared error to an EHR-derived sample arising from a biased sampling mechanism. The effective sample size allows for assessment of the advantage of using an EHR-derived sample as opposed to conducting a more traditional, designed observational study, taking into account both the number of patients and the biased sampling mechanism. 
Through simulation studies, we demonstrate the magnitude of bias induced in EHR-based parameter estimates under varying sample selection mechanisms, and we demonstrate how the effective sample size can be used to compute confidence intervals that account for the biased sampling scheme. We conclude that attention to biased sampling is necessary to avoid erroneous inference due to the large sample size and complex, non-random provenance of EHR-derived data, when the goal of a study is to use EHR-derived data to capture parameter estimates that are representative of an underlying population.
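The definition of effective sample size given above can be illustrated with a small simulation. What follows is a minimal sketch under stated assumptions, not the chapter's derivation or simulation design. It assumes a hypothetical finite population in which an unobserved covariate `u` drives both a continuous outcome `y` and, through a logistic model, the probability of appearing in the EHR. The function name, parameter names, and default values are all illustrative. Because the sample mean of a simple random sample of size n has mean-squared error of approximately σ²/n, the effective sample size is computed here as σ² divided by the empirical MSE of the EHR-style sample mean.

```python
import numpy as np


def simulate_effective_sample_size(
    pop_size=100_000,     # hypothetical population size (assumption)
    beta_outcome=0.5,     # strength of association between u and the outcome
    gamma_select=1.0,     # strength of association between u and selection
    intercept=-3.0,       # controls the overall fraction captured in the EHR
    n_reps=200,           # number of simulated EHR-style samples
    seed=0,
):
    """Estimate the MSE and effective sample size of a biased sample mean."""
    rng = np.random.default_rng(seed)

    # Unobserved covariate and outcome for every member of the population.
    u = rng.normal(size=pop_size)
    y = beta_outcome * u + rng.normal(size=pop_size)
    pop_mean = y.mean()   # target population parameter
    pop_var = y.var()     # population variance of the outcome

    # Logistic selection model: inclusion probability depends on u.
    p_select = 1.0 / (1.0 + np.exp(-(intercept + gamma_select * u)))

    sq_errors = np.empty(n_reps)
    sizes = np.empty(n_reps)
    for r in range(n_reps):
        selected = rng.random(pop_size) < p_select
        sizes[r] = selected.sum()
        sq_errors[r] = (y[selected].mean() - pop_mean) ** 2

    mse = sq_errors.mean()
    # Size of a simple random sample whose mean has the same MSE
    # (uses MSE_SRS ~ pop_var / n, ignoring the finite-population correction).
    n_eff = pop_var / mse
    return {"mean_n": sizes.mean(), "mse": mse, "n_eff": n_eff}


if __name__ == "__main__":
    for gamma in (0.0, 1.0):
        result = simulate_effective_sample_size(gamma_select=gamma)
        print(f"gamma_select={gamma}: {result}")
```

In this sketch, setting `gamma_select=0.0` makes selection independent of the outcome, so the effective sample size should be close to the nominal cohort size. With `gamma_select=1.0`, the bias term dominates the MSE and the effective sample size should fall far below the number of patients sampled, which is the qualitative point the abstract makes.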
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Hubbard, R.A., Lou, C., Himes, B.E. (2021). The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling. In: Zhao, Y., Chen, D.-G. (eds) Modern Statistical Methods for Health Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-72437-5_1
DOI: https://doi.org/10.1007/978-3-030-72437-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72436-8
Online ISBN: 978-3-030-72437-5
eBook Packages: Mathematics and Statistics (R0)