Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy

  • Illhoi Yoo
  • Xiaohua Hu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3918)


In this paper we introduce a novel document clustering approach that addresses some major problems of traditional document clustering approaches. Instead of depending on the traditional vector space model, this approach represents a set of documents as a bipartite graph using domain knowledge from an ontology. In this representation, the concepts in the documents are grouped according to their relationships with documents, as reflected in the bipartite graph. Using the concept groups, documents are then clustered based on the concepts' contribution to each document. Through the mutual-refinement relationship between concept groups and document groups, the two sets of groups are recursively refined. Our experimental results on MEDLINE articles show that our approach outperforms two leading document clustering algorithms: Bisecting K-means and CLUTO. In addition to its clustering performance, our approach provides a meaningful explanation for each document cluster by identifying its most contributing concepts, thus helping users understand and interpret documents and clustering results.
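The mutual-refinement idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the grouping rule, the weighting scheme, or the stopping criterion, so the functions `mutual_refinement` and `top_concepts`, the seed grouping, and the toy weights below are all hypothetical. The sketch assumes that each document is linked to ontology concepts by weighted edges, that documents join the concept group contributing the most weight to them, and that concepts are then regrouped by the document cluster in which they carry the most weight.

```python
from collections import defaultdict


def mutual_refinement(doc_concepts, seed_groups, k, max_iter=20):
    """Toy mutual refinement over a document-concept bipartite graph.

    doc_concepts: {doc_id: {concept: weight}}, the weighted edges.
    seed_groups:  {concept: group index in 0..k-1}, a hypothetical initial
                  concept grouping that must cover every concept.
    Returns (doc_cluster, concept_group).
    """
    concept_group = dict(seed_groups)
    doc_cluster = {}
    for _ in range(max_iter):
        # Step 1: each document joins the concept group that contributes
        # the most edge weight to it (ties go to the lowest index).
        new_doc_cluster = {}
        for d, edges in doc_concepts.items():
            score = [0.0] * k
            for c, w in edges.items():
                score[concept_group[c]] += w
            new_doc_cluster[d] = max(range(k), key=lambda g: score[g])
        # Step 2: each concept joins the document cluster in which it
        # carries the most total weight.
        totals = defaultdict(lambda: [0.0] * k)
        for d, edges in doc_concepts.items():
            for c, w in edges.items():
                totals[c][new_doc_cluster[d]] += w
        new_concept_group = {
            c: max(range(k), key=lambda g: totals[c][g]) for c in concept_group
        }
        # Stop once neither side changes any more.
        if new_doc_cluster == doc_cluster and new_concept_group == concept_group:
            break
        doc_cluster, concept_group = new_doc_cluster, new_concept_group
    return doc_cluster, concept_group


def top_concepts(doc_concepts, doc_cluster, cluster, n=3):
    """Rank concepts by total weight inside one document cluster."""
    totals = defaultdict(float)
    for d, edges in doc_concepts.items():
        if doc_cluster[d] == cluster:
            for c, w in edges.items():
                totals[c] += w
    return sorted(totals, key=lambda c: (-totals[c], c))[:n]


# Made-up example data: four documents, four concepts, two groups.
docs = {
    "d1": {"insulin": 3, "glucose": 2},
    "d2": {"insulin": 1, "glucose": 2},
    "d3": {"tumor": 3, "chemotherapy": 1},
    "d4": {"tumor": 3, "chemotherapy": 2},
}
# The seed deliberately misplaces "chemotherapy" in group 0.
seed = {"insulin": 0, "glucose": 0, "tumor": 1, "chemotherapy": 0}

doc_cluster, concept_group = mutual_refinement(docs, seed, k=2)
print(doc_cluster)
print(concept_group)
print(top_concepts(docs, doc_cluster, cluster=1))
```

In this toy run the misplaced concept is pulled into the group of the documents it mostly appears in, and the ranked concepts per cluster give the kind of cluster explanation the abstract mentions. A simple alternating scheme like this can settle in a poor fixed point when the seed grouping is bad, so the result depends heavily on the initial concept groups.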


Bipartite Graph; Vector Space Model; Biomedical Literature; Document Cluster; Concept Hierarchy





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Illhoi Yoo (1)
  • Xiaohua Hu (1)
  1. College of Information Science and Technology, Drexel University, Philadelphia, USA
