Cancer Causes & Control, Volume 21, Issue 11, pp 1887–1894

Pattern-based information extraction from pathology reports for cancer registration

  • Giulio Napolitano
  • Colin Fox
  • Richard Middleton
  • David Connolly
Original paper



To evaluate precision and recall rates for the automatic extraction of information from free-text pathology reports, and to assess the impact that implementing pattern-based methods would have on cancer registration completeness.


Over 300,000 electronic pathology reports were scanned to extract Gleason score, Clark level and Breslow depth, using a set of Perl routines that were progressively refined by trial and error. An additional test set of 915 reports potentially containing a Gleason score was used for evaluation.
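As an illustration of the kind of pattern matching described here, the following is a minimal sketch in Python rather than Perl. The regular expressions, function names and sample sentences are hypothetical and are not the routines used in the study; they cover only a few common phrasings and would need the same kind of iterative refinement against real reports.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the study's
# Perl routines, which are not reproduced in this text.

# "Gleason score 3+4", "Gleason grade 3 + 4 = 7"
GLEASON_RE = re.compile(
    r"gleason\b[^0-9]{0,30}?(\d)\s*\+\s*(\d)", re.IGNORECASE
)
# "Clark level IV", "Clark's level 3"
CLARK_RE = re.compile(
    r"clark(?:'s)?\s+level\s*:?\s*(IV|V|III|II|I|[1-5])\b", re.IGNORECASE
)
# "Breslow thickness 1.2 mm", "Breslow depth: 0.75mm"
BRESLOW_RE = re.compile(
    r"breslow\b[^0-9]{0,30}?(\d+(?:\.\d+)?)\s*mm", re.IGNORECASE
)

ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}


def extract_gleason(text):
    """Return the Gleason sum (primary + secondary pattern) or None."""
    m = GLEASON_RE.search(text)
    if not m:
        return None
    primary, secondary = int(m.group(1)), int(m.group(2))
    # Gleason patterns range from 1 to 5; reject anything else.
    if not (1 <= primary <= 5 and 1 <= secondary <= 5):
        return None
    return primary + secondary


def extract_clark(text):
    """Return the Clark level as an integer from 1 to 5, or None."""
    m = CLARK_RE.search(text)
    if not m:
        return None
    value = m.group(1).upper()
    return ROMAN[value] if value in ROMAN else int(value)


def extract_breslow(text):
    """Return the Breslow depth in millimetres as a float, or None."""
    m = BRESLOW_RE.search(text)
    return float(m.group(1)) if m else None


if __name__ == "__main__":
    prostate = "Adenocarcinoma of the prostate, Gleason score 3+4=7."
    skin = "Malignant melanoma, Clark level IV, Breslow thickness 1.2 mm."
    print(extract_gleason(prostate))
    print(extract_clark(skin), extract_breslow(skin))
```

Phrasings that report only the total (for example, a score given without its two component patterns) are deliberately left unmatched in this sketch; handling them would be one of the incremental refinements a trial-and-error approach adds.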


Recall and precision values of over 98% and 99%, respectively, were easily reached. A potential increase in cancer staging completeness of up to 32% was demonstrated.
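For readers unfamiliar with the two measures, the sketch below shows one common way to compute them against a manually reviewed gold standard. It assumes the standard definitions (precision as correct extractions over all extractions, recall as correct extractions over all values present in the gold standard); the exact counting rules used in the study are not stated in this text, and the function name and toy data are hypothetical.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for extracted values.

    extracted, gold: dicts mapping a report id to a value, with None
    meaning "nothing extracted" or "no value present in the report".
    Assumed definitions (not taken from the study):
      precision = correct extractions / all extractions
      recall    = correct extractions / all values present in gold
    """
    correct = sum(
        1
        for report_id, value in extracted.items()
        if value is not None and gold.get(report_id) == value
    )
    n_extracted = sum(1 for value in extracted.values() if value is not None)
    n_gold = sum(1 for value in gold.values() if value is not None)
    precision = correct / n_extracted if n_extracted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented for illustration.
    gold = {"r1": 7, "r2": 6, "r3": None, "r4": 9}
    extracted = {"r1": 7, "r2": 7, "r3": None, "r4": None}
    print(precision_recall(extracted, gold))
```

In the toy data, one of two extracted values is correct and one of three values present in the gold standard is recovered, so precision is 1/2 and recall is 1/3.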


In cancer registration, simple pattern matching applied to free-text documents can be effectively used to improve completeness and accuracy of pathology information.


Keywords: Surgical pathology; Automatic data processing; Cancer registries; Pattern matching; Information extraction; Text mining; Unstructured data management; Pathology report; Cancer registration; Regular expression



The Northern Ireland Cancer Registry was funded by the Department of Health, Social Services and Public Safety Northern Ireland (DHSSPSNI) at the time this study was completed. It is now funded by the Public Health Agency. We also wish to thank Alejandra González Beltrán for her stimulating comments on this paper.




Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  • Giulio Napolitano (1, 2)
  • Colin Fox (1)
  • Richard Middleton (1)
  • David Connolly (3)
  1. Northern Ireland Cancer Registry, Centre for Public Health, Queen’s University of Belfast, Belfast, Northern Ireland (UK)
  2. Centre for Statistical Science and Operational Research, Queen’s University Belfast, Belfast, Northern Ireland (UK)
  3. Department of Urology, Belfast City Hospital, Belfast, Northern Ireland (UK)
