Machine learning to parse breast pathology reports in Chinese

Tang, Rong; Ouyang, Lizhi; Li, Clara; He, Yue; Griffin, Molly; Taghian, Alphonse; Smith, Barbara; Yala, Adam; Barzilay, Regina; Hughes, Kevin

doi:10.1007/s10549-018-4668-3

Preclinical study
Published: 29 January 2018

Volume 169, pages 243–250, (2018)
Breast Cancer Research and Treatment

Rong Tang ORCID: orcid.org/0000-0001-5727-1010¹,
Lizhi Ouyang²,
Clara Li³,
Yue He²,
Molly Griffin¹,
Alphonse Taghian⁴,
Barbara Smith¹,
Adam Yala³,
Regina Barzilay³ &
Kevin Hughes¹

Abstract

Introduction

Large structured databases of pathology findings are valuable for deriving new clinical insights. However, they are labor-intensive to create and generally require manual annotation. Prior work in the bioinformatics community has used machine learning to automate this annotation for English-language reports. Our contribution is an automated approach to constructing such structured databases from Chinese-language reports, which also sets the stage for extraction from other languages.

Methods

We collected 2104 de-identified Chinese benign and malignant breast pathology reports from Hunan Cancer Hospital. Physicians with native Chinese proficiency reviewed the reports and annotated a variety of binary and numerical pathologic entities. After excluding 78 cases with a bilateral lesion in the same report, 2026 cases remained: 1216 were used as a training set for the algorithm, which was then refined on 405 development cases. The natural language processing algorithm was tested on the remaining 405 cases. The model was used to extract 13 binary entities and 8 numerical entities.
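The split sizes above can be checked with a little arithmetic. The following is a minimal sketch, not the authors' procedure: the abstract does not say how reports were assigned to each set, so the seeded random shuffle, the function name `split_reports`, and the use of integer report IDs are all assumptions made for illustration.

```python
import random

# Counts taken from the Methods paragraph.
TOTAL_REPORTS = 2104
EXCLUDED_BILATERAL = 78
TRAIN_SIZE, DEV_SIZE, TEST_SIZE = 1216, 405, 405

# Sanity check: the three sets account for every report that was not excluded.
assert TOTAL_REPORTS - EXCLUDED_BILATERAL == TRAIN_SIZE + DEV_SIZE + TEST_SIZE


def split_reports(report_ids, seed=0):
    """Shuffle report IDs and cut them into train/dev/test lists.

    Hypothetical helper: a seeded random split is an assumption,
    the source does not describe the assignment method.
    """
    ids = list(report_ids)
    if len(ids) != TRAIN_SIZE + DEV_SIZE + TEST_SIZE:
        raise ValueError("unexpected number of reports")
    random.Random(seed).shuffle(ids)
    train = ids[:TRAIN_SIZE]
    dev = ids[TRAIN_SIZE:TRAIN_SIZE + DEV_SIZE]
    test = ids[TRAIN_SIZE + DEV_SIZE:]
    return train, dev, test


# Usage with placeholder integer IDs standing in for the retained reports.
train_ids, dev_ids, test_ids = split_reports(range(TOTAL_REPORTS - EXCLUDED_BILATERAL))
```

The resulting proportions are roughly 60% training, 20% development, and 20% test.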

Results

Compared with annotations by physicians with native Chinese proficiency, the model achieved a per-entity accuracy of 91 to 100% for all common diagnoses on the test set. Overall accuracy was 98% for binary entities and 95% for numerical entities. In a per-report evaluation for binary entities with more than 100 training cases, 85% of all the testing reports were completely correct and 11% had an error in 1 out of 22 entities.
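The two kinds of evaluation reported here, per-entity accuracy and per-report correctness, can be illustrated with a short sketch. This is not the authors' evaluation code: the entity names (`carcinoma`, `atypia`, `tumor_size_cm`), the toy values, and the exact-match comparison are all hypothetical choices made for the example.

```python
from collections import Counter


def per_entity_accuracy(gold, pred):
    """For each entity, the fraction of reports where prediction equals annotation."""
    n = len(gold)
    return {
        name: sum(g[name] == p[name] for g, p in zip(gold, pred)) / n
        for name in gold[0]
    }


def errors_per_report(gold, pred):
    """Histogram mapping 'number of wrong entities in a report' to a report count."""
    return Counter(
        sum(g[name] != p[name] for name in g) for g, p in zip(gold, pred)
    )


# Toy data: one dict per report; entity names and values are hypothetical.
gold = [
    {"carcinoma": 1, "atypia": 0, "tumor_size_cm": 2.5},
    {"carcinoma": 0, "atypia": 1, "tumor_size_cm": None},
    {"carcinoma": 1, "atypia": 0, "tumor_size_cm": 1.2},
    {"carcinoma": 0, "atypia": 0, "tumor_size_cm": None},
]
pred = [
    {"carcinoma": 1, "atypia": 0, "tumor_size_cm": 2.5},
    {"carcinoma": 0, "atypia": 0, "tumor_size_cm": None},  # misses the atypia finding
    {"carcinoma": 1, "atypia": 0, "tumor_size_cm": 1.2},
    {"carcinoma": 0, "atypia": 0, "tumor_size_cm": None},
]

accuracy = per_entity_accuracy(gold, pred)
histogram = errors_per_report(gold, pred)
fully_correct = histogram[0] / len(gold)  # share of reports with no errors
one_error = histogram[1] / len(gold)      # share of reports with exactly one error
```

In this toy set, three of four reports are completely correct and one has a single wrong entity. A report counts as completely correct only if every entity matches, so the per-report figure is a stricter measure than the per-entity one.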

Conclusion

We have demonstrated that Chinese breast pathology reports can be automatically parsed into structured data using standard machine learning approaches. Our results show that techniques effective in parsing English reports can be extended to other languages.

Author information

Authors and Affiliations

Division of Surgical Oncology, MGH, Boston, USA
Rong Tang, Molly Griffin, Barbara Smith & Kevin Hughes
Department of Breast Surgery, Hunan Cancer Hospital, Changsha, Hunan, China
Lizhi Ouyang & Yue He
Department of Electrical Engineering and Computer Science, CSAIL, MIT, Cambridge, USA
Clara Li, Adam Yala & Regina Barzilay
Department of Radiation Oncology, MGH, Boston, USA
Alphonse Taghian

About this article

Cite this article

Tang, R., Ouyang, L., Li, C. et al. Machine learning to parse breast pathology reports in Chinese. Breast Cancer Res Treat 169, 243–250 (2018). https://doi.org/10.1007/s10549-018-4668-3

Received: 08 January 2018
Accepted: 11 January 2018
Published: 29 January 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10549-018-4668-3
