Pattern-based information extraction from pathology reports for cancer registration
To evaluate precision and recall for the automatic extraction of information from free-text pathology reports, and to assess the impact that implementing pattern-based methods would have on the completeness of cancer registration.
Over 300,000 electronic pathology reports were scanned to extract Gleason score, Clark level and Breslow depth, using a set of Perl routines that were progressively refined by trial and error. An additional test set of 915 reports potentially containing a Gleason score was used for evaluation.
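As an illustration of the kind of pattern matching described above, the following is a minimal sketch in Python rather than Perl. The pattern, the function name `extract_gleason` and the plausibility checks are assumptions made for this example and are not the routines used in the study; real reports would call for many more variants, which is the role of the trial-and-error refinement.

```python
import re

# Hypothetical pattern: the word "Gleason", then up to 30 non-digit characters
# (e.g. " score ", " sum: "), then either "a+b" with an optional "= total",
# or a bare total score.
GLEASON_RE = re.compile(
    r"gleason\D{0,30}?(?:(?P<primary>[1-5])\s*\+\s*(?P<secondary>[1-5])"
    r"(?:\s*=\s*(?P<sum>\d{1,2}))?|(?P<total>\d{1,2})\b)",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return a dict with the Gleason score found in text, or None."""
    m = GLEASON_RE.search(text)
    if m is None:
        return None
    if m.group("primary"):
        primary = int(m.group("primary"))
        secondary = int(m.group("secondary"))
        total = primary + secondary
        stated = m.group("sum")
        # Discard matches whose stated sum contradicts the two grades.
        if stated is not None and int(stated) != total:
            return None
        return {"primary": primary, "secondary": secondary, "total": total}
    total = int(m.group("total"))
    # A Gleason sum is the sum of two grades in 1..5, so it lies in 2..10.
    if 2 <= total <= 10:
        return {"primary": None, "secondary": None, "total": total}
    return None
```

For example, `extract_gleason("Prostatic adenocarcinoma, Gleason score 3+4=7.")` yields both grades and the total, whereas a report that does not mention Gleason yields `None`.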
Recall and precision values of over 98% and 99%, respectively, were readily achieved. A potential increase in cancer staging completeness of up to 32% was demonstrated.
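For reference, precision and recall can be computed against a manually annotated test set as sketched below. This assumes the standard definitions (correct extractions divided by all extractions, and correct extractions divided by all values present in the gold standard); the function name and the data layout are hypothetical and are not taken from the study.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for per-report extracted values.

    Both arguments map a report identifier to the extracted or annotated
    value, with None meaning that no value was found or is present.
    """
    # True positives: a value was extracted and it equals the annotated one.
    tp = sum(
        1 for rid, value in extracted.items()
        if value is not None and gold.get(rid) == value
    )
    n_extracted = sum(1 for value in extracted.values() if value is not None)
    n_gold = sum(1 for value in gold.values() if value is not None)
    precision = tp / n_extracted if n_extracted else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return precision, recall
```

With this layout, a wrongly extracted value lowers precision, while a value that is present in a report but missed by the patterns lowers recall.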
In cancer registration, simple pattern matching applied to free-text documents can be effectively used to improve completeness and accuracy of pathology information.
Keywords: Surgical pathology; Automatic data processing; Cancer registries; Pattern matching; Information extraction; Text mining; Unstructured data management; Pathology report; Cancer registration; Regular expression
The Northern Ireland Cancer Registry was funded by the Department of Health, Social Services and Public Safety Northern Ireland (DHSSPSNI), at the time this study was completed. It is now funded by the Public Health Agency. We also wish to thank Alejandra González Beltrán for her stimulating comments on this paper.