Hybrid Text Segmentation for Hungarian Clinical Records

  • György Orosz
  • Attila Novák
  • Gábor Prószéky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8265)

Abstract

Clinical documents are becoming widely available to researchers who aim to develop resources and tools that may help clinicians in their work. While several attempts exist at processing English medical text, there are only a few for other languages. Moreover, word and sentence segmentation are commonly treated as simple engineering issues. In this study, we describe the difficulties that arise when segmenting Hungarian clinical records and present a complex method that produces normalized and segmented text. Our approach is a hybrid of a rule-based and an unsupervised statistical solution. The presented system is compared with other available and commonly used algorithms. These fail to segment clinical text (all of them reach F-scores below 75%), while our method scores above 90%. This means that, of the tools compared, only the hybrid tool described in this study can be used for the segmentation of Hungarian clinical texts in practical applications.
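
The abstract does not say how the unsupervised statistical component works, but the keywords mention log-likelihood ratios. As a minimal sketch, not the authors' implementation, the following Python function computes Dunning's log-likelihood ratio statistic for a 2x2 contingency table. The framing of the counts (how often a given token is or is not followed by a period, compared with all other tokens) and the function name are assumptions made only for illustration.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table.

    Hypothetical interpretation of the cells (an assumption, not taken
    from the paper):
      k11: the token followed by a period
      k12: the token not followed by a period
      k21: other tokens followed by a period
      k22: other tokens not followed by a period
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Illustrative counts only: a token that almost always precedes a period
# yields a much larger statistic than one that does so at the base rate.
strong = log_likelihood_ratio(90, 10, 1000, 9000)
weak = log_likelihood_ratio(10, 90, 1000, 9000)
```

A high value indicates that the token and the period co-occur more (or less) often than independence would predict, so a practical system would also need to check the direction of the association before treating the period as part of the token.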

Keywords

text segmentation, clinical records, sentence boundary detection, log-likelihood ratios

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • György Orosz (1, 2)
  • Attila Novák (1, 2)
  • Gábor Prószéky (1, 2)

  1. Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
  2. MTA-PPKE Hungarian Language Technology Research Group, Pázmány Péter Catholic University, Budapest, Hungary
