Clustering Relevant Terms and Identifying Types of Statements in Clinical Records
The automatic processing of documents created in clinical settings has become a focus of research in natural language processing. However, standard tools developed for general texts are either not applicable to this type of document or perform poorly on it, especially in the case of less-resourced languages. To create a formal representation of the knowledge in clinical records, a normalized representation of concepts needs to be defined. This can be done by mapping each record to an external ontology or other semantic resource. For languages where no such resources exist, it is reasonable to create a representational schema from the texts themselves. In this paper, we show that a conceptual hierarchy can be built from the raw documents based on the pairwise distributional similarities of words and multiword terms. To create the hierarchy, we applied an agglomerative clustering algorithm to the most frequent terms. Given such an initial system of knowledge extracted from the documents, a domain expert can then check the results and build a system of concepts that is in accordance with the documents the system is applied to. Moreover, we propose a method for classifying various types of statements and parts of clinical documents by annotating the texts with cluster identifiers and extracting relevant patterns.
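The clustering idea can be illustrated with a minimal sketch. It is not the implementation used in the paper: the context window of one token, the cosine measure, the average-linkage criterion, the toy English sentences, and the function names (`context_vectors`, `cosine`, `agglomerate`) are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def context_vectors(sentences, window=1):
    """Map each token to counts of the tokens seen within `window` positions."""
    vectors = {}
    for tokens in sentences:
        for i, term in enumerate(tokens):
            ctx = vectors.setdefault(term, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def agglomerate(terms, vectors):
    """Average-linkage agglomerative clustering (assumed linkage).

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    def link(c1, c2):
        total = sum(cosine(vectors[x], vectors[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(t,) for t in terms]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average similarity.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: link(*p))
        merges.append((c1, c2, link(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy, hypothetical corpus; real input would be tokenized clinical records.
sentences = [
    ["left", "eye", "shows", "cataract"],
    ["right", "eye", "shows", "cataract"],
    ["left", "eye", "shows", "glaucoma"],
    ["right", "eye", "shows", "glaucoma"],
]
vectors = context_vectors(sentences)
merges = agglomerate(["left", "right", "cataract", "glaucoma"], vectors)
for a, b, sim in merges:
    print(a, "+", b, round(sim, 3))
```

The sequence of merges forms a binary tree over the terms. A domain expert could cut this tree at a chosen similarity level to obtain candidate concept groups for review.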
Keywords: clinical documents, clustering, ontology construction, less-resourced languages