Advanced Clustering Technique for Medical Data Using Semantic Information

  • Kwangcheol Shin
  • Sang-Yong Han
  • Alexander Gelbukh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2972)


MEDLINE is a representative collection of medical documents supplied with original full-text natural-language abstracts as well as with representative keywords (called MeSH-terms) manually selected by expert annotators from a pre-defined ontology and structured according to their relation to the document. We show how these structured, manually assigned semantic descriptions can be combined with the original full-text abstracts to improve the quality of clustering the documents into a small number of clusters. As a baseline, we compare our results with clustering using only abstracts or only MeSH-terms. Our experiments show 36% to 47% higher cluster coherence, as well as more refined keywords for the produced clusters.
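The idea of combining abstract text with MeSH-terms in one vector space can be illustrated with a minimal sketch. It is not the authors' weighting scheme, which the abstract does not specify: the `mesh_weight` parameter, the `abs:`/`mesh:` feature prefixes, and the three toy documents are all assumptions made for illustration only.

```python
import math
from collections import Counter


def combined_vector(abstract_tokens, mesh_terms, mesh_weight=2.0):
    """Build one unit-length sparse vector from abstract words and MeSH terms.

    mesh_weight is a hypothetical knob controlling how strongly the manually
    assigned MeSH terms count relative to ordinary abstract words.
    """
    vec = Counter()
    for tok in abstract_tokens:
        vec["abs:" + tok.lower()] += 1.0           # raw term frequency
    for term in mesh_terms:
        vec["mesh:" + term.lower()] += mesh_weight  # boosted semantic feature
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {k: w / norm for k, w in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(k, 0.0) for k, w in u.items())


# Toy documents (invented): (abstract tokens, MeSH terms)
docs = [
    (["insulin", "therapy", "in", "diabetes"], ["Diabetes Mellitus", "Insulin"]),
    (["glucose", "control", "outcomes"], ["Diabetes Mellitus"]),
    (["hip", "fracture", "surgery"], ["Hip Fractures"]),
]
vectors = [combined_vector(a, m) for a, m in docs]
sim_01 = cosine(vectors[0], vectors[1])  # share a MeSH term, no abstract words
sim_02 = cosine(vectors[0], vectors[2])  # share nothing
```

In this toy data the first two documents have no abstract words in common, so an abstract-only representation would treat them as unrelated; the shared MeSH term is what makes `sim_01` larger than `sim_02`. A clustering algorithm based on cosine similarity could then be run on such combined vectors.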


Semantic Information, MeSH Term, Vector Space Model, Term Weight, Document Cluster
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Kwangcheol Shin (1)
  • Sang-Yong Han (1)
  • Alexander Gelbukh (1, 2)

  1. School of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
  2. Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico
