Accounting for misclassification in electronic health records-derived exposures using generalized linear finite mixture models

Health Services and Outcomes Research Methodology

Abstract

Exposures derived from electronic health records (EHR) may be misclassified, leading to biased estimates of their association with outcomes of interest. An example of this problem arises in the context of cancer screening where test indication, the purpose for which a test was performed, is often unavailable. This poses a challenge to understanding the effectiveness of screening tests because estimates of screening test effectiveness are biased if some diagnostic tests are misclassified as screening. Prediction models have been developed for a variety of exposure variables that can be derived from EHR, but no previous research has investigated appropriate methods for obtaining unbiased association estimates using these predicted probabilities. The full likelihood incorporating information on both the predicted probability of exposure-class membership and the association between the exposure and outcome of interest can be expressed using a finite mixture model. When the regression model of interest is a generalized linear model (GLM), the expectation–maximization algorithm can be used to estimate the parameters using standard software for GLMs. Using simulation studies, we compared the bias and efficiency of this mixture model approach to alternative approaches including multiple imputation and dichotomization of the predicted probabilities to create a proxy for the missing predictor. The mixture model was the only approach that was unbiased across all scenarios investigated. Finally, we explored the performance of these alternatives in a study of colorectal cancer screening with colonoscopy. These findings have broad applicability in studies using EHR data where gold-standard exposures are unavailable and prediction models have been developed for estimating proxies.
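
The EM approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary outcome with a logistic link, a binary latent exposure, no additional covariates, and predicted exposure probabilities that are treated as known and well calibrated. All function names (`em_mixture_logistic`, `weighted_logistic`, `naive_dichotomized`, `simulate`) and the simulation settings are hypothetical choices made for illustration. The E-step computes each subject's posterior probability of exposure from the predicted probability and the current outcome model. The M-step refits a weighted logistic regression on a stacked data set in which every subject appears once as exposed and once as unexposed.

```python
import math
import random


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def weighted_logistic(z, y, w, beta, n_iter=25):
    """Weighted logistic regression of y on (1, z) by Newton-Raphson."""
    b0, b1 = beta
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi, wi in zip(z, y, w):
            mu = expit(b0 + b1 * zi)
            r = wi * (yi - mu)          # weighted score contribution
            v = wi * mu * (1.0 - mu)    # weighted information contribution
            g0 += r
            g1 += r * zi
            h00 += v
            h01 += v * zi
            h11 += v * zi * zi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: inverse information times score
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def em_mixture_logistic(y, p, n_em=200, tol=1e-8):
    """EM for logit P(Y=1 | Z) = b0 + b1*Z with latent Z and known P(Z=1) = p."""
    n = len(y)
    # Stacked data: each subject appears once with z=1 and once with z=0.
    z_stack = [1.0] * n + [0.0] * n
    y_stack = list(y) + list(y)
    beta = (0.0, 0.0)
    for _ in range(n_em):
        # E-step: posterior probability of exposure given y and current beta.
        mu1, mu0 = expit(beta[0] + beta[1]), expit(beta[0])
        w1 = []
        for yi, pi in zip(y, p):
            f1 = mu1 if yi == 1 else 1.0 - mu1
            f0 = mu0 if yi == 1 else 1.0 - mu0
            w1.append(pi * f1 / (pi * f1 + (1.0 - pi) * f0))
        # M-step: weighted GLM fit on the stacked data, warm-started at beta.
        weights = w1 + [1.0 - w for w in w1]
        new = weighted_logistic(z_stack, y_stack, weights, beta)
        done = max(abs(new[0] - beta[0]), abs(new[1] - beta[1])) < tol
        beta = new
        if done:
            break
    return beta


def naive_dichotomized(y, p, cut=0.5):
    """Comparison approach: treat 1[p >= cut] as if it were the true exposure."""
    hi = [yi for yi, pi in zip(y, p) if pi >= cut]
    lo = [yi for yi, pi in zip(y, p) if pi < cut]
    m1, m0 = sum(hi) / len(hi), sum(lo) / len(lo)
    logit = lambda m: math.log(m / (1.0 - m))
    return logit(m0), logit(m1) - logit(m0)


def simulate(n, b0, b1):
    """Hypothetical data: p ~ Beta(0.5, 0.5), Z ~ Bernoulli(p), Y from the logistic model."""
    y, p = [], []
    for _ in range(n):
        pi = random.betavariate(0.5, 0.5)
        zi = 1 if random.random() < pi else 0
        y.append(1 if random.random() < expit(b0 + b1 * zi) else 0)
        p.append(pi)
    return y, p


random.seed(0)
y, p = simulate(4000, b0=-1.0, b1=1.5)
print("EM mixture (b0, b1):  ", em_mixture_logistic(y, p))
print("Dichotomized (b0, b1):", naive_dichotomized(y, p))
```

In this simplified setting the dichotomized proxy would be expected to attenuate the exposure coefficient toward zero, while the mixture estimate targets the true value. With covariates or another GLM family, the same stacked-data M-step could be delegated to any GLM routine that accepts case weights.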

Acknowledgments

This work was supported by the National Cancer Institute of the National Institutes of Health (Grant Number U01CA152959). The collection of cancer incidence data used in this study was supported by the Cancer Surveillance System of the Fred Hutchinson Cancer Research Center, which is funded by Contract Nos. N01-CN-67009 and N01-PC-35142 from the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute with additional support from the Fred Hutchinson Cancer Research Center and the State of Washington.

Author information

Corresponding author

Correspondence to Rebecca A. Hubbard.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the Group Health Institutional Review Board and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

Informed consent

The Group Health Institutional Review Board approved a waiver of consent for this study.

About this article

Cite this article

Hubbard, R.A., Johnson, E., Chubak, J. et al. Accounting for misclassification in electronic health records-derived exposures using generalized linear finite mixture models. Health Serv Outcomes Res Method 17, 101–112 (2017). https://doi.org/10.1007/s10742-016-0149-5

  • DOI: https://doi.org/10.1007/s10742-016-0149-5
