Clustering Relevant Terms and Identifying Types of Statements in Clinical Records

  • Borbála Siklósi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9042)


The automatic processing of documents created in clinical settings has become a focus of research in natural language processing. However, standard tools developed for general texts are either not applicable to this type of document or perform poorly on it, especially in the case of less-resourced languages. To create a formal representation of the knowledge in clinical records, a normalized representation of concepts needs to be defined. This can be done by mapping each record to an external ontology or other semantic resource. For languages where no such resources exist, it is reasonable to create a representational schema from the texts themselves. In this paper, we show that a conceptual hierarchy can be built from the raw documents based on the pairwise distributional similarities of words and multiword terms. To create the hierarchy, we applied an agglomerative clustering algorithm to the most frequent terms. Given such an initial system of knowledge extracted from the documents, a domain expert can then check the results and build a system of concepts that is in accordance with the documents the system is applied to. Moreover, we propose a method for classifying various types of statements and parts of clinical documents by annotating the texts with cluster identifiers and extracting relevant patterns.
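The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the implementation used in the paper: the toy English corpus, the one-token context window, cosine similarity, average linkage, and all function names (`context_vectors`, `cosine`, `agglomerate`, `annotate`) are assumptions chosen only to keep the example small and dependency-free.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=1):
    """Count the words seen within +/- `window` tokens of each term."""
    vectors = {}
    for tokens in sentences:
        for i, term in enumerate(tokens):
            ctx = vectors.setdefault(term, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerate(terms, vectors, n_clusters):
    """Bottom-up clustering: repeatedly merge the two most similar clusters.

    Average linkage is an assumption of this sketch.
    """
    clusters = [[t] for t in terms]

    def link(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = link(clusters[i], clusters[j])
                if score > best:
                    best, best_pair = score, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def annotate(tokens, clusters):
    """Replace each clustered term with its cluster identifier."""
    ids = {t: f"C{n}" for n, cluster in enumerate(clusters) for t in cluster}
    return [ids.get(t, t) for t in tokens]

# Hypothetical toy corpus standing in for tokenized clinical sentences.
sentences = [
    ["left", "eye", "is", "red"],
    ["right", "eye", "is", "red"],
    ["left", "eye", "is", "swollen"],
    ["right", "eye", "is", "swollen"],
]
vectors = context_vectors(sentences)
clusters = agglomerate(["left", "right", "red", "swollen"], vectors, n_clusters=2)
annotated = annotate(sentences[0], clusters)
```

On this toy input, terms that share contexts ("left"/"right" and "red"/"swollen") end up in the same cluster, and the annotated sentence exposes a pattern of cluster identifiers that could then be matched to classify statement types. A real system would record the full merge history to obtain the hierarchy rather than stopping at a fixed number of clusters.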


Keywords: clinical documents, clustering, ontology construction, less-resourced languages





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
