Identifying Breast Cancer Distant Recurrences from Electronic Health Records Using Machine Learning

Zeng, Zexian; Yao, Liang; Roy, Ankita; Li, Xiaoyu; Espino, Sasa; Clare, Susan E; Khan, Seema A; Luo, Yuan

doi:10.1007/s41666-019-00046-3

Research Article
Published: 08 April 2019

Volume 3, pages 283–299, (2019)
Journal of Healthcare Informatics Research

Zexian Zeng¹,
Liang Yao¹,
Ankita Roy²,
Xiaoyu Li³,
Sasa Espino²,
Susan E Clare²,
Seema A Khan² &
Yuan Luo¹ (ORCID: orcid.org/0000-0003-0195-7456)

Abstract

Accurately identifying distant recurrences in breast cancer from electronic health records (EHRs) is important for both clinical care and secondary analysis. Although multiple applications have been developed for computational phenotyping in breast cancer, distant recurrence identification still relies heavily on manual chart review. In this study, we aim to develop a model that identifies distant recurrences in breast cancer using clinical narratives and structured data from the EHR. We applied MetaMap to extract features from clinical narratives and retrieved structured clinical data from the EHR. Using these features, we trained a support vector machine model to identify distant recurrences in breast cancer patients. We trained the model on 1396 double-annotated subjects and validated it on a held-out test set of 599 double-annotated subjects. In addition, we validated the model on a set of 4904 single-annotated subjects as a generalization test. On the held-out test and the generalization test, we obtained F-measure scores of 0.78 and 0.74 and area under the curve (AUC) scores of 0.95 and 0.93, respectively. To explore the representation learning utility of deep neural networks, we designed multiple convolutional neural networks and multilayer neural networks to identify distant recurrences. On the same held-out test set and generalization test set, these networks obtained F-measure scores of 0.79 ± 0.02 and 0.74 ± 0.004 and AUC scores of 0.95 ± 0.002 and 0.95 ± 0.01, respectively. Our model can accurately and efficiently identify distant recurrences in breast cancer by combining features extracted from unstructured clinical narratives and structured clinical data.
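
As a reading aid for the two metrics reported above, the following minimal sketch shows how an F-measure and an AUC can be computed from a classifier's scores. It is not the authors' evaluation code, which this page does not show. The labels, the scores, and the 0.5 decision threshold are invented for illustration, and the function names `f_measure` and `auc` are hypothetical.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive case receives
    the higher score; ties count as half a pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = distant recurrence, 0 = no distant recurrence.
Y_TRUE = [1, 0, 1, 0, 0, 1, 0, 0]
# Invented classifier scores (for example, a calibrated model output).
SCORES = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8, 0.1, 0.3]
# Assumed decision threshold; the source does not state the one used.
THRESHOLD = 0.5
Y_PRED = [1 if s >= THRESHOLD else 0 for s in SCORES]

toy_f = f_measure(Y_TRUE, Y_PRED)
toy_auc = auc(Y_TRUE, SCORES)
print(f"F-measure: {toy_f:.3f}  AUC: {toy_auc:.3f}")
```

The F-measure depends on the chosen threshold, while the AUC depends only on how the scores rank positive cases against negative ones. This is one way a model can show a high AUC alongside a more modest F-measure, as in the figures above.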

Funding

This project is supported in part by NIH grant R21LM012618-01.

Author information

Authors and Affiliations

Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
Zexian Zeng, Liang Yao & Yuan Luo
Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
Ankita Roy, Sasa Espino, Susan E Clare & Seema A Khan
Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
Xiaoyu Li

Corresponding author

Correspondence to Yuan Luo.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1 (XLSX 50 kb)
ESM 2 (XLSX 13 kb)

About this article

Cite this article

Zeng, Z., Yao, L., Roy, A. et al. Identifying Breast Cancer Distant Recurrences from Electronic Health Records Using Machine Learning. J Healthc Inform Res 3, 283–299 (2019). https://doi.org/10.1007/s41666-019-00046-3

Received: 12 July 2018
Revised: 30 November 2018
Accepted: 07 January 2019
Published: 08 April 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s41666-019-00046-3
