The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling

Modern Statistical Methods for Health Research

Abstract

Electronic Health Records (EHRs) have become a popular data source for conducting observational studies of health outcomes. One advantage of using EHR-derived data for biomedical and epidemiologic research is the ability to efficiently construct large cohorts, providing access to “big data” in healthcare. For example, the U.S. Food and Drug Administration’s Sentinel System, which is composed of EHR and administrative claims data, includes over 100 million people, constituting approximately one-third of the U.S. population. Although the sample size of EHR-derived cohorts can be very large, EHR data arise through a complex, non-random sampling process that can induce bias when such data are used to obtain parameter estimates that are meant to be representative of an underlying population. In the United States, where most health insurance is employment-based, insured populations are often non-representative of uninsured populations, and thus insurance status, as well as health literacy and healthcare-seeking behavior, is associated with representation in EHRs. As a result, the non-random sampling mechanism that gives rise to EHR data can induce significant bias in parameter estimates derived from EHR-based studies relative to the underlying population parameters. Here, we derive formulas for the mean-squared error of an estimate obtained from an EHR-derived sample as a function of the strength of association between a health outcome of interest, the sampling process, and an underlying unobserved covariate. We also provide a formula for the effective sample size of an EHR-derived cohort, defined as the size of a simple random sample whose mean-squared error equals that of an EHR-derived sample arising from a biased sampling mechanism. The effective sample size allows for assessment of the advantage of using an EHR-derived sample as opposed to conducting a more traditional, designed observational study, taking into account both the number of patients and the biased sampling mechanism. Through simulation studies, we demonstrate the magnitude of bias induced in EHR-based parameter estimates under varying sample selection mechanisms, and we demonstrate how the effective sample size can be used to compute confidence intervals that account for the biased sampling scheme. We conclude that, when the goal of a study is to use EHR-derived data to obtain parameter estimates that are representative of an underlying population, attention to biased sampling is necessary to avoid erroneous inference arising from the large sample size and complex, non-random provenance of such data.
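The definition of effective sample size given above can be illustrated with a small simulation. The sketch below is not the chapter's derivation: it assumes the estimand is a population mean, so that a simple random sample of size n has mean-squared error σ²/n and the effective sample size is σ²/MSE. The data-generating model (a standard normal unobserved covariate `u`, a linear outcome with coefficient `beta`, and selection odds proportional to `exp(gamma * u)`) and all function and parameter names are hypothetical choices made for illustration.

```python
import numpy as np


def effective_sample_size(pop_var, mse):
    """Size of a simple random sample whose sample mean has the given MSE.

    Assumes the estimand is a population mean, so that under simple
    random sampling MSE = pop_var / n.
    """
    return pop_var / mse


def simulate_mse(n, gamma, beta=1.0, n_pop=200_000, n_reps=200, seed=0):
    """Monte Carlo MSE of the sample mean under covariate-driven selection.

    Hypothetical model (an assumption, not taken from the chapter):
      u ~ N(0, 1)                 unobserved covariate
      y = beta * u + noise        health outcome
      P(selected) ∝ exp(gamma*u)  selection depends on u
    gamma = 0 corresponds to unbiased (uniform) sampling.
    Returns (mse, population variance of y).
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n_pop)
    y = beta * u + rng.normal(size=n_pop)
    weights = np.exp(gamma * u)
    probs = weights / weights.sum()
    target = y.mean()  # the population parameter of interest

    estimates = np.empty(n_reps)
    for r in range(n_reps):
        idx = rng.choice(n_pop, size=n, replace=True, p=probs)
        estimates[r] = y[idx].mean()

    mse = np.mean((estimates - target) ** 2)
    return mse, y.var()


if __name__ == "__main__":
    n = 2000
    for gamma in (0.0, 0.5):
        mse, pop_var = simulate_mse(n=n, gamma=gamma)
        n_eff = effective_sample_size(pop_var, mse)
        print(f"gamma={gamma}: nominal n={n}, effective n ~ {n_eff:.1f}")
```

Under these assumptions, unbiased selection (`gamma = 0`) should give an effective sample size close to the nominal n, while even moderate dependence of selection on `u` makes the squared bias dominate the mean-squared error, so the effective sample size falls far below n no matter how many records are drawn.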

Correspondence to Rebecca A. Hubbard.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Hubbard, R.A., Lou, C., Himes, B.E. (2021). The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling. In: Zhao, Y., Chen, D.-G. (eds) Modern Statistical Methods for Health Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-72437-5_1
