The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling

Hubbard, Rebecca A.; Lou, Carolyn; Himes, Blanca E.

doi:10.1007/978-3-030-72437-5_1

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

Electronic Health Records (EHRs) have become a popular data source for conducting observational studies of health outcomes. One advantage of using EHR-derived data for biomedical and epidemiologic research is the ability to efficiently construct large cohorts, providing access to “big data” in healthcare. For example, the U.S. Food and Drug Administration’s Sentinel System, which is composed of EHR and administrative claims data, includes over 100 million people, constituting approximately one-third of the U.S. population. Although the sample size of EHR-derived cohorts can be very large, EHR data arise through a complex, non-random sampling process that can induce bias when using such data to obtain parameter estimates that are meant to be representative of an underlying population. In the U.S.A., where most health insurance is employment-based, insured populations are often non-representative of uninsured populations, and thus, insurance status, as well as health literacy and healthcare-seeking behavior, is associated with representation in EHRs. As a result, the non-random sampling mechanism that gives rise to EHR data can induce significant bias in parameter estimates derived from EHR-based studies relative to the underlying population parameters. Here, we derive formulas for the mean-squared error of an EHR-derived sample as a function of the strength of association between a health outcome of interest, the sampling process, and an underlying unobserved covariate. We also provide a formula for the effective sample size of an EHR-derived cohort defined as the sample size of a simple random sample with equivalent mean-squared error to an EHR-derived sample arising from a biased sampling mechanism. The effective sample size allows for assessment of the advantage of using an EHR-derived sample as opposed to conducting a more traditional, designed observational study, taking into account both the number of patients and the biased sampling mechanism. 
Through simulation studies, we demonstrate the magnitude of bias induced in EHR-based parameter estimates under varying sample selection mechanisms, and we demonstrate how the effective sample size can be used to compute confidence intervals that account for the biased sampling scheme. We conclude that attention to biased sampling is necessary to avoid erroneous inference due to the large sample size and complex, non-random provenance of EHR-derived data, when the goal of a study is to use EHR-derived data to capture parameter estimates that are representative of an underlying population.
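
The effective sample size defined in the abstract can be illustrated with a minimal simulation sketch. This is not the chapter's derivation, and none of the names or parameter values below come from it. The sketch assumes a hypothetical model in which an unobserved standard normal covariate U shifts the outcome (coefficient beta) and drives inclusion in the cohort through a logistic model (coefficient gamma, with an assumed intercept). The mean-squared error of the cohort mean is estimated by Monte Carlo, and the effective sample size is taken as the size of a simple random sample whose mean would have the same mean-squared error, that is, Var(Y) divided by that mean-squared error.

```python
import numpy as np


def effective_sample_size(sigma2, mse_biased):
    """Size n of a simple random sample whose mean has MSE equal to mse_biased.

    The mean of an SRS of size n is unbiased with variance sigma2 / n,
    so setting sigma2 / n = mse_biased and solving gives n.
    """
    return sigma2 / mse_biased


def simulate_biased_mse(n_pop=10_000, beta=0.5, gamma=1.0,
                        intercept=-2.0, n_rep=200, seed=0):
    """Monte Carlo MSE of the cohort mean under covariate-driven selection.

    Hypothetical model (an assumption, not taken from the chapter):
      U ~ N(0, 1)                    unobserved covariate
      Y = beta * U + N(0, 1)         outcome, true mean 0
      P(selected | U) = expit(intercept + gamma * U)
    Returns (MSE of the selected-sample mean, average cohort size, Var(Y)).
    """
    rng = np.random.default_rng(seed)
    sq_err = np.empty(n_rep)
    sizes = np.empty(n_rep)
    for r in range(n_rep):
        u = rng.normal(size=n_pop)
        y = beta * u + rng.normal(size=n_pop)
        p_select = 1.0 / (1.0 + np.exp(-(intercept + gamma * u)))
        selected = rng.random(n_pop) < p_select
        sq_err[r] = y[selected].mean() ** 2  # true mean of Y is 0
        sizes[r] = selected.sum()
    return sq_err.mean(), sizes.mean(), beta ** 2 + 1.0


if __name__ == "__main__":
    for gamma in (0.0, 0.5, 1.0):
        mse, n_cohort, sigma2 = simulate_biased_mse(gamma=gamma)
        n_eff = effective_sample_size(sigma2, mse)
        print(f"gamma={gamma}: cohort size ~{n_cohort:.0f}, "
              f"effective sample size ~{n_eff:.0f}")
```

Under these assumptions, setting gamma to 0 makes selection independent of U, so the effective sample size should be close to the actual cohort size. As gamma grows, the squared bias dominates the mean-squared error and the effective sample size falls far below the number of patients in the cohort.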

Author information

Authors and Affiliations

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
Rebecca A. Hubbard, Carolyn Lou & Blanca E. Himes

Corresponding author

Correspondence to Rebecca A. Hubbard.

Editor information

Editors and Affiliations

Department of Mathematics & Statistics, Georgia State University, Atlanta, GA, USA
Yichuan Zhao
School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, KwaZulu-Natal, South Africa
(Din) Ding-Geng Chen

About this chapter

Cite this chapter

Hubbard, R.A., Lou, C., Himes, B.E. (2021). The Effective Sample Size of EHR-Derived Cohorts Under Biased Sampling. In: Zhao, Y., Chen, D.-G. (eds) Modern Statistical Methods for Health Research. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-030-72437-5_1

DOI: https://doi.org/10.1007/978-3-030-72437-5_1
Published: 15 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72436-8
Online ISBN: 978-3-030-72437-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)