Abstract
Exposures derived from electronic health records (EHR) may be misclassified, leading to biased estimates of their association with outcomes of interest. An example of this problem arises in the context of cancer screening where test indication, the purpose for which a test was performed, is often unavailable. This poses a challenge to understanding the effectiveness of screening tests because estimates of screening test effectiveness are biased if some diagnostic tests are misclassified as screening. Prediction models have been developed for a variety of exposure variables that can be derived from EHR, but no previous research has investigated appropriate methods for obtaining unbiased association estimates using these predicted probabilities. The full likelihood incorporating information on both the predicted probability of exposure-class membership and the association between the exposure and outcome of interest can be expressed using a finite mixture model. When the regression model of interest is a generalized linear model (GLM), the expectation–maximization algorithm can be used to estimate the parameters using standard software for GLMs. Using simulation studies, we compared the bias and efficiency of this mixture model approach to alternative approaches including multiple imputation and dichotomization of the predicted probabilities to create a proxy for the missing predictor. The mixture model was the only approach that was unbiased across all scenarios investigated. Finally, we explored the performance of these alternatives in a study of colorectal cancer screening with colonoscopy. These findings have broad applicability in studies using EHR data where gold-standard exposures are unavailable and prediction models have been developed for estimating proxies.
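To make the estimation idea concrete, the following is a minimal sketch of an EM fit for a two-class mixture. It is not the authors' implementation. It assumes a binary outcome, a single latent binary exposure, no other covariates, and predicted probabilities that are treated as fixed and well calibrated. Under those assumptions the weighted GLM in the M-step reduces to weighted outcome means, so no GLM library is needed. The function names (`simulate`, `fit_mixture_em`, `fit_dichotomized`) and the simulation settings are hypothetical and chosen only for illustration.

```python
import math
import random


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def logit(m):
    return math.log(m / (1.0 - m))


def simulate(n, b0, b1, seed=1):
    """Simulate outcomes y and predicted exposure probabilities p.

    The true exposure x is drawn from p and then discarded, mimicking a
    setting where only the prediction-model output is available.
    """
    rng = random.Random(seed)
    y, p = [], []
    for _ in range(n):
        pi = rng.betavariate(0.5, 0.5)          # predicted P(X = 1), assumed calibrated
        xi = 1 if rng.random() < pi else 0      # true, unobserved exposure
        yi = 1 if rng.random() < expit(b0 + b1 * xi) else 0
        y.append(yi)
        p.append(pi)
    return y, p


def fit_mixture_em(y, p, max_iter=500, tol=1e-10):
    """EM for logit P(Y=1 | X) = b0 + b1*X with X latent and P(X=1) = p_i."""
    w = list(p)  # initial posterior weights: the predicted probabilities
    mu0 = mu1 = None
    for _ in range(max_iter):
        # M-step: weighted Bernoulli fit. With one binary predictor the
        # weighted MLE is the weighted outcome mean within each class.
        new_mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        new_mu0 = (sum((1 - wi) * yi for wi, yi in zip(w, y))
                   / sum(1 - wi for wi in w))
        converged = (mu0 is not None
                     and abs(new_mu0 - mu0) + abs(new_mu1 - mu1) < tol)
        mu0, mu1 = new_mu0, new_mu1
        if converged:
            break
        # E-step: posterior probability that X = 1 given the observed outcome.
        w = []
        for yi, pi in zip(y, p):
            f1 = mu1 if yi == 1 else 1 - mu1
            f0 = mu0 if yi == 1 else 1 - mu0
            w.append(pi * f1 / (pi * f1 + (1 - pi) * f0))
    b0 = logit(mu0)
    return b0, logit(mu1) - b0


def fit_dichotomized(y, p, cut=0.5):
    """Comparison approach: use the proxy exposure 1[p >= cut] directly."""
    y1 = [yi for yi, pi in zip(y, p) if pi >= cut]
    y0 = [yi for yi, pi in zip(y, p) if pi < cut]
    b0 = logit(sum(y0) / len(y0))
    return b0, logit(sum(y1) / len(y1)) - b0


y, p = simulate(20000, b0=-1.0, b1=-1.0)
b0_hat, b1_hat = fit_mixture_em(y, p)
b0_dich, b1_dich = fit_dichotomized(y, p)
print("mixture EM:   ", round(b0_hat, 2), round(b1_hat, 2))
print("dichotomized: ", round(b0_dich, 2), round(b1_dich, 2))
```

With covariates in the outcome model, the closed-form M-step would be replaced by a weighted GLM fit in which each observation appears once per exposure class, weighted by its posterior class probability. This is the sense in which standard GLM software can carry out the estimation.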
Acknowledgments
This work was supported by the National Cancer Institute of the National Institutes of Health (Grant Number U01CA152959). The collection of cancer incidence data used in this study was supported by the Cancer Surveillance System of the Fred Hutchinson Cancer Research Center, which is funded by Contract Nos. N01-CN-67009 and N01-PC-35142 from the Surveillance, Epidemiology and End Results (SEER) Program of the National Cancer Institute with additional support from the Fred Hutchinson Cancer Research Center and the State of Washington.
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the Group Health Institutional Review Board and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.
Informed consent
The Group Health Institutional Review Board approved a waiver of consent for this study.
Cite this article
Hubbard, R.A., Johnson, E., Chubak, J. et al. Accounting for misclassification in electronic health records-derived exposures using generalized linear finite mixture models. Health Serv Outcomes Res Method 17, 101–112 (2017). https://doi.org/10.1007/s10742-016-0149-5
DOI: https://doi.org/10.1007/s10742-016-0149-5