Identifying and Clustering Relevant Terms in Clinical Records Using Unsupervised Methods

Siklósi, Borbála; Novák, Attila

doi:10.1007/978-3-319-11397-5_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

The automatic processing of documents created in clinical settings has become a focus of research in natural language processing. However, standard tools developed for general texts are either not applicable to this type of document or perform poorly on it. Moreover, several crucial tasks require lexical resources and relational thesauri or ontologies to identify relevant concepts and their connections. For less-resourced languages such as Hungarian, no such lexicons are available, and constructing and organizing annotated data requires the work of human experts. In this paper we show how applying statistical methods can yield a preprocessed, semi-structured transformation of the raw documents that can be used to aid human work. The modules detect and resolve abbreviations, identify multiword terms, and derive their similarity, all based on the corpus itself.
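
The abstract states that term similarity is derived from the corpus itself. As a rough illustration of what corpus-based (distributional) similarity can look like, and not a reproduction of the authors' implementation, the following Python sketch builds co-occurrence count vectors for tokens and compares them with cosine similarity. The function names, the window size, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each token to a Counter of tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * v[ctx] for ctx, count in u.items() if ctx in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, invented sentences (hypothetical; not taken from the paper's corpus).
toy_corpus = [
    ["left", "eye", "shows", "cataract"],
    ["right", "eye", "shows", "cataract"],
    ["left", "eye", "shows", "glaucoma"],
    ["patient", "received", "drops"],
]
vectors = context_vectors(toy_corpus)

# Terms that occur in similar contexts score higher than terms that do not.
sim_related = cosine(vectors["cataract"], vectors["glaucoma"])
sim_unrelated = cosine(vectors["cataract"], vectors["drops"])
```

In this toy setting, "cataract" and "glaucoma" share the same neighbouring tokens and therefore score higher than "cataract" and "drops", which share none. A real system would presumably operate on normalized terms, including multiword ones, and use a weighting scheme rather than raw counts.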

Acknowledgement

This work was partially supported by TÁMOP – 4.2.1.B – 11/2/KMR-2011-0002 and TÁMOP-4.2.2./B-10/1-2010-0014.

Author information

Authors and Affiliations

MTA-PPKE Hungarian Language Technology Research Group, Pázmány Péter Catholic University, 50/a Práter Street, Budapest, 1083, Hungary
Attila Novák
Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, 50/a Práter Street, Budapest, 1083, Hungary
Borbála Siklósi & Attila Novák

Corresponding author

Correspondence to Borbála Siklósi.

Editor information

Editors and Affiliations

University Joseph Fourier, Grenoble, France
Laurent Besacier
Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide

About this paper

Cite this paper

Siklósi, B., Novák, A. (2014). Identifying and Clustering Relevant Terms in Clinical Records Using Unsupervised Methods. In: Besacier, L., Dediu, A.-H., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_18

DOI: https://doi.org/10.1007/978-3-319-11397-5_18
Published: 03 September 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer Science, Computer Science (R0)
