Identifying and Clustering Relevant Terms in Clinical Records Using Unsupervised Methods

  • Conference paper
Statistical Language and Speech Processing (SLSP 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Abstract

The automatic processing of documents created in clinical settings has become a focus of research in natural language processing. However, standard tools developed for general text are either not applicable to this type of document or perform poorly on it. Moreover, several crucial tasks require lexical resources and relational thesauri or ontologies to identify relevant concepts and their connections. For less-resourced languages such as Hungarian, no such lexicons are available, and constructing and organizing annotated data requires the work of human experts. In this paper we show how statistical methods can transform raw documents into a preprocessed, semi-structured form that can be used to aid this human work. The modules detect and resolve abbreviations, identify multiword terms, and derive their similarity, all based on the corpus itself.
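
The idea of deriving term similarity from the corpus itself can be illustrated with a minimal distributional-similarity sketch. This is not the authors' implementation: the toy English sentences, the window size, and the function names `context_vectors` and `cosine` are assumptions made here for illustration. Each token is represented by counts of the tokens that occur near it, and two tokens are compared by the cosine of their count vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus invented for illustration (not data from the paper).
corpus = [
    "the left eye shows mild inflammation".split(),
    "the right eye shows mild inflammation".split(),
    "patient received drops for the left eye".split(),
]

vecs = context_vectors(corpus)
sim_left_right = cosine(vecs["left"], vecs["right"])
sim_left_drops = cosine(vecs["left"], vecs["drops"])
print(sim_left_right > sim_left_drops)
```

In this toy corpus "left" and "right" share most of their contexts, so they score higher than "left" and "drops". A real system would work on normalized clinical text and would likely weight the raw counts, but those choices are outside what the abstract states.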

Acknowledgement

This work was partially supported by TÁMOP – 4.2.1.B – 11/2/KMR-2011-0002 and TÁMOP-4.2.2./B-10/1-2010-0014.

Author information

Corresponding author

Correspondence to Borbála Siklósi.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Siklósi, B., Novák, A. (2014). Identifying and Clustering Relevant Terms in Clinical Records Using Unsupervised Methods. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_18

  • DOI: https://doi.org/10.1007/978-3-319-11397-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11396-8

  • Online ISBN: 978-3-319-11397-5

  • eBook Packages: Computer Science, Computer Science (R0)
