Text Mining of the Electronic Health Record: An Information Extraction Approach for Automated Identification and Subphenotyping of HFpEF Patients for Clinical Trials

Jonnalagadda, Siddhartha R.; Adupa, Abhishek K.; Garg, Ravi P.; Corona-Cox, Jessica; Shah, Sanjiv J.

doi:10.1007/s12265-017-9752-2

Original Article
Published: 05 June 2017

Volume 10, pages 313–321 (2017)

Journal of Cardiovascular Translational Research

Abstract

Precision medicine requires clinical trials that are able to efficiently enroll subtypes of patients in whom targeted therapies can be tested. To reduce the large amount of time spent screening, identifying, and recruiting patients with specific subtypes of heterogeneous clinical syndromes (such as heart failure with preserved ejection fraction [HFpEF]), we need prescreening systems that are able to automate data extraction and decision-making tasks. However, a major obstacle is the vast amount of unstructured free-form text in medical records. Here we describe an information extraction-based approach that automatically converts unstructured text into structured data, which is cross-referenced against eligibility criteria using a rule-based system to determine which patients qualify for a major HFpEF clinical trial (PARAGON). We show that we can achieve a sensitivity and positive predictive value of 0.95 and 0.86, respectively. Our open-source algorithm could be used to efficiently identify and subphenotype patients with HFpEF and other disorders.
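
The workflow summarized above (structured fields checked against eligibility rules, then scored by sensitivity and positive predictive value) can be illustrated with a minimal sketch. This is not the authors' implementation: the `PatientRecord` fields, the `is_eligible` rules, and the `MIN_AGE` and `MIN_LVEF` thresholds are hypothetical placeholders chosen for illustration, and they are not taken from the PARAGON inclusion/exclusion document. The sketch also assumes that an upstream extraction step has already turned free text into structured values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class PatientRecord:
    """Structured fields assumed to come from an upstream extraction step."""
    patient_id: str
    age: Optional[int]
    lvef_percent: Optional[float]       # left ventricular ejection fraction
    has_heart_failure: Optional[bool]   # None means "not found in the record"


# Hypothetical thresholds for illustration only; not the trial's criteria.
MIN_AGE = 50
MIN_LVEF = 45.0


def is_eligible(rec: PatientRecord) -> bool:
    """Return True only if every rule passes; a missing value fails its rule."""
    return (
        rec.age is not None and rec.age >= MIN_AGE
        and rec.lvef_percent is not None and rec.lvef_percent >= MIN_LVEF
        and rec.has_heart_failure is True
    )


def sensitivity_and_ppv(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Compare rule-based decisions with reference labels (e.g. manual review)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # true positives
    fp = sum(1 for p, a in pairs if p and not a)    # false positives
    fn = sum(1 for p, a in pairs if not p and a)    # false negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv
```

In this sketch a missing value makes a patient ineligible. That is a conservative design choice for illustration, and a real prescreening system might instead flag such records for manual review.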

Author information

Siddhartha R. Jonnalagadda
Present address: Microsoft Corporation, 555 110th Ave NE, Bellevue, WA, 98004, USA

Authors and Affiliations

Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
Siddhartha R. Jonnalagadda, Abhishek K. Adupa & Ravi P. Garg
Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
Jessica Corona-Cox & Sanjiv J. Shah

Corresponding author

Correspondence to Siddhartha R. Jonnalagadda.

Ethics declarations

Funding Sources

This work was funded by the National Library of Medicine: R00LM011389 and R01LM011416 (to S.R.J.), and an investigator-initiated study grant from Novartis. S.J.S. is also supported by grants from the National Institutes of Health (R01 HL107577 and R01 HL127028). The authors acknowledge Prasanth Nannapaneni for his valuable ideas on extracting information from the electronic health record.

Conflicts of Interest

Siddhartha R. Jonnalagadda is currently an employee of Microsoft Corporation.

Abhishek K. Adupa declares that he has no conflict of interest.

Ravi P. Garg declares that he has no conflict of interest.

Jessica Corona-Cox declares that she has no conflict of interest.

Sanjiv J. Shah reports receiving consulting fees from Novartis.

Ethical Approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed Consent

Informed consent was waived for this study by the Northwestern University Institutional Review Board because the study only involved retrospective chart review.

About this article

Cite this article

Jonnalagadda, S.R., Adupa, A.K., Garg, R.P. et al. Text Mining of the Electronic Health Record: An Information Extraction Approach for Automated Identification and Subphenotyping of HFpEF Patients for Clinical Trials. J. of Cardiovasc. Trans. Res. 10, 313–321 (2017). https://doi.org/10.1007/s12265-017-9752-2

Received: 18 January 2017
Accepted: 16 May 2017
Published: 05 June 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s12265-017-9752-2
