Using machine learning to parse breast pathology reports

Yala, Adam; Barzilay, Regina; Salama, Laura; Griffin, Molly; Sollender, Grace; Bardia, Aditya; Lehman, Constance; Buckley, Julliette M.; Coopey, Suzanne B.; Polubriaginof, Fernanda; Garber, Judy E.; Smith, Barbara L.; Gadd, Michele A.; Specht, Michelle C.; Gudewicz, Thomas M.; Guidi, Anthony J.; Taghian, Alphonse; Hughes, Kevin S.

doi:10.1007/s10549-016-4035-1

Preclinical Study
Published: 08 November 2016

Breast Cancer Research and Treatment
Volume 161, pages 203–211 (2017)

Adam Yala¹,
Regina Barzilay¹,
Laura Salama³,
Molly Griffin² (ORCID: orcid.org/0000-0002-1615-2645),
Grace Sollender⁸,
Aditya Bardia¹⁰,
Constance Lehman⁵,
Julliette M. Buckley²,
Suzanne B. Coopey²,
Fernanda Polubriaginof⁹,
Judy E. Garber⁶,
Barbara L. Smith²,
Michele A. Gadd²,
Michelle C. Specht²,
Thomas M. Gudewicz⁴,
Anthony J. Guidi⁷,
Alphonse Taghian³ &
Kevin S. Hughes²

Abstract

Purpose

Extracting information from electronic medical records is a time-consuming and expensive process when done manually. Rule-based and machine learning techniques are two approaches to automating this task. In this study, we trained a machine learning model on pathology reports to extract pertinent tumor characteristics, which enabled us to create a large database of attribute-searchable pathology reports. This database can be used to identify cohorts of patients with characteristics of interest.

Methods

We collected a total of 91,505 breast pathology reports from three Partners hospitals: Massachusetts General Hospital, Brigham and Women’s Hospital, and Newton-Wellesley Hospital, covering the period from 1978 to 2016. We trained our system with annotations from two datasets, consisting of 6295 and 10,841 manually annotated reports. The system extracts 20 separate categories of information, including atypia types and various tumor characteristics such as receptors. We also report a learning curve analysis to show how much annotation the model needs to reach reasonable performance.
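A learning curve analysis of this kind can be illustrated with a minimal sketch: train on progressively larger slices of the annotated reports and measure accuracy on a fixed held-out set. The bag-of-words naive Bayes classifier below is only a stand-in chosen for brevity; the abstract does not specify the model, and the toy report texts, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (report text, label for one category)


def train_naive_bayes(examples: Sequence[Example]) -> Callable[[str], str]:
    """Fit a bag-of-words naive Bayes model and return a predict function."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    def predict(text: str) -> str:
        best_label, best_score = None, -math.inf
        for label, n in label_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(n / len(examples))  # class prior
            for tok in text.lower().split():
                # Laplace smoothing so unseen words do not zero out a class.
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def learning_curve(train: Sequence[Example], test: Sequence[Example],
                   sizes: Sequence[int]) -> List[Tuple[int, float]]:
    """Return (training size, held-out accuracy) for each size (sizes >= 1)."""
    curve = []
    for n in sizes:
        predict = train_naive_bayes(train[:n])
        correct = sum(predict(text) == label for text, label in test)
        curve.append((n, correct / len(test)))
    return curve


# Hypothetical toy data for a single "carcinoma" category.
train = [
    ("invasive ductal carcinoma identified", "present"),
    ("no evidence of carcinoma", "absent"),
    ("ductal carcinoma in situ present", "present"),
    ("benign breast tissue no carcinoma", "absent"),
]
test = [
    ("invasive carcinoma identified", "present"),
    ("no carcinoma benign tissue", "absent"),
]
curve = learning_curve(train, test, sizes=[1, 2, 4])
for n, acc in curve:
    print(f"{n} training reports: accuracy {acc:.2f}")
```

Plotting accuracy against training size shows where additional annotation stops paying off; in a real study the held-out set would be fixed in advance and far larger than this toy example.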

Results

The model's accuracy was tested on 500 reports that did not overlap with the training set. The model achieved an accuracy of 90% for correctly parsing all carcinoma and atypia categories for a given patient. The average accuracy for individual categories was 97%. Using this classifier, we created a database of 91,505 parsed pathology reports.
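The two figures reported here measure different things: the per-category average scores each category independently, while the all-categories figure counts a report as correct only if every category is right, so it can never exceed the lowest per-category accuracy. A minimal sketch of both metrics follows; the category names (`DCIS`, `ADH`) and the labels are hypothetical examples, not the paper's data.

```python
from typing import Dict, List

Report = Dict[str, str]  # category name -> label


def per_category_accuracy(gold: List[Report], pred: List[Report]) -> Dict[str, float]:
    """Fraction of reports labeled correctly, computed separately per category."""
    return {
        cat: sum(g[cat] == p[cat] for g, p in zip(gold, pred)) / len(gold)
        for cat in gold[0]
    }


def all_categories_accuracy(gold: List[Report], pred: List[Report]) -> float:
    """Fraction of reports for which every category is labeled correctly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical annotations (gold) and model outputs (pred) for four reports.
gold = [
    {"DCIS": "present", "ADH": "absent"},
    {"DCIS": "absent", "ADH": "absent"},
    {"DCIS": "absent", "ADH": "present"},
    {"DCIS": "present", "ADH": "present"},
]
pred = [
    {"DCIS": "present", "ADH": "absent"},
    {"DCIS": "present", "ADH": "absent"},  # DCIS wrong
    {"DCIS": "absent", "ADH": "absent"},   # ADH wrong
    {"DCIS": "present", "ADH": "present"},
]

by_category = per_category_accuracy(gold, pred)
average = sum(by_category.values()) / len(by_category)
strict = all_categories_accuracy(gold, pred)
print(by_category, average, strict)
```

In this toy case each category is right on three of four reports, yet only two reports are entirely correct, which is the same effect that separates the 97% and 90% figures above.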

Conclusions

Our learning curve analysis shows that the model can achieve reasonable results even when trained on only a few annotations. We developed a user-friendly interface to the database that allows physicians to easily identify patients with target characteristics and export the matching cohort. This model has the potential to reduce the effort required for analyzing large amounts of data from medical records, and to minimize the cost and time required to glean scientific insight from these data.
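Once reports are parsed into attribute-value records, cohort identification reduces to filtering on those attributes and exporting the matches. The sketch below assumes a simple in-memory list of dictionaries; the field names (`report_id`, `carcinoma`, `er_status`) and the functions are hypothetical and do not describe the authors' interface or database schema.

```python
import csv
import io
from typing import Dict, List

ParsedReport = Dict[str, str]


def find_cohort(reports: List[ParsedReport], **criteria: str) -> List[ParsedReport]:
    """Return reports whose parsed attributes match every given criterion."""
    return [
        r for r in reports
        if all(r.get(field) == value for field, value in criteria.items())
    ]


def export_csv(rows: List[ParsedReport], fields: List[str]) -> str:
    """Serialize the selected fields of each row as CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical parsed reports.
reports = [
    {"report_id": "r1", "carcinoma": "invasive", "er_status": "positive"},
    {"report_id": "r2", "carcinoma": "none", "er_status": "unknown"},
    {"report_id": "r3", "carcinoma": "invasive", "er_status": "negative"},
    {"report_id": "r4", "carcinoma": "invasive", "er_status": "positive"},
]

cohort = find_cohort(reports, carcinoma="invasive", er_status="positive")
cohort_csv = export_csv(cohort, ["report_id", "er_status"])
print(cohort_csv)
```

A production system would run such queries against a database rather than a Python list, but the filter-then-export shape is the same.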

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA
Adam Yala & Regina Barzilay
Division of Surgical Oncology, MGH, Boston, USA
Molly Griffin, Julliette M. Buckley, Suzanne B. Coopey, Barbara L. Smith, Michele A. Gadd, Michelle C. Specht & Kevin S. Hughes
Department of Radiation Oncology, MGH, Boston, USA
Laura Salama & Alphonse Taghian
Department of Pathology, MGH, Boston, USA
Thomas M. Gudewicz
Department of Radiology, MGH, Boston, USA
Constance Lehman
Department of Medical Oncology, DFCI, Boston, USA
Judy E. Garber
Department of Pathology, NWH, Newton, USA
Anthony J. Guidi
Geisel School of Medicine at Dartmouth, Hanover, USA
Grace Sollender
Department of Biomedical Informatics, Columbia University, New York, USA
Fernanda Polubriaginof
Department of Medical Oncology, MGH, Boston, USA
Aditya Bardia

Corresponding author

Correspondence to Molly Griffin.

About this article

Cite this article

Yala, A., Barzilay, R., Salama, L. et al. Using machine learning to parse breast pathology reports. Breast Cancer Res Treat 161, 203–211 (2017). https://doi.org/10.1007/s10549-016-4035-1

Received: 13 October 2016
Accepted: 21 October 2016
Published: 08 November 2016
Issue Date: January 2017
DOI: https://doi.org/10.1007/s10549-016-4035-1
