PAKDD 2006: Advances in Knowledge Discovery and Data Mining, pp 303-312
Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy
Abstract
In this paper we introduce a novel document clustering approach that addresses several major problems of traditional document clustering approaches. Instead of depending on the traditional vector space model, this approach represents a set of documents as a bipartite graph built with domain knowledge from an ontology. In this representation, the concepts of the documents are grouped according to their relationships with the documents, as reflected in the bipartite graph. Using the concept groups, documents are then clustered based on the concepts' contribution to each document. Through the mutual-refinement relationship between concept groups and document groups, the two sets of groups are recursively refined. Our experimental results on MEDLINE articles show that our approach outperforms two leading document clustering algorithms: BiSecting K-means and CLUTO. Beyond its performance, our approach provides a meaningful explanation for each document cluster by identifying its most contributing concepts, thus helping users to understand and interpret both the documents and the clustering results.
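The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea of mutual refinement on a document-concept bipartite graph, not the authors' method. The function name `mutual_refinement`, the edge weights, the initial concept groups, and the "largest total weight wins" assignment rule are all assumptions made for illustration.

```python
from collections import defaultdict


def mutual_refinement(edges, concept_group, max_iter=20):
    """Alternately refine document clusters and concept groups.

    edges: {doc: {concept: weight}}, the bipartite graph (hypothetical weights).
    concept_group: {concept: group id}, an initial guess (assumed to be given).
    Every document is assumed to have at least one concept.
    """
    concept_group = dict(concept_group)
    doc_cluster = {}
    for _ in range(max_iter):
        # Step 1: put each document in the group whose concepts
        # contribute the most weight to it.
        new_doc_cluster = {}
        for doc, concepts in edges.items():
            score = defaultdict(float)
            for concept, w in concepts.items():
                score[concept_group[concept]] += w
            # sorted() makes tie-breaking deterministic (smallest id wins).
            new_doc_cluster[doc] = max(sorted(score), key=score.get)

        # Step 2: move each concept to the cluster whose documents
        # carry the most weight for it.
        weight = defaultdict(lambda: defaultdict(float))
        for doc, concepts in edges.items():
            for concept, w in concepts.items():
                weight[concept][new_doc_cluster[doc]] += w
        new_concept_group = {
            c: max(sorted(g), key=g.get) for c, g in weight.items()
        }

        # Stop once neither side changes any more.
        if new_doc_cluster == doc_cluster and new_concept_group == concept_group:
            break
        doc_cluster, concept_group = new_doc_cluster, new_concept_group
    return doc_cluster, concept_group


# Toy data: four documents linked to five made-up concepts.
edges = {
    "d1": {"gene": 2, "protein": 1},
    "d2": {"gene": 1, "mutation": 2},
    "d3": {"tumor": 2, "therapy": 1},
    "d4": {"therapy": 2, "protein": 1},
}
initial = {"gene": 0, "protein": 1, "mutation": 0, "tumor": 1, "therapy": 1}

doc_cluster, concept_group = mutual_refinement(edges, initial)
print(doc_cluster)
print(concept_group)
```

In this sketch the concepts that end up in a document's cluster double as a rough explanation of that cluster, which loosely mirrors the "most contributing concepts" idea in the abstract; the ontology-based concept extraction itself is not modeled here.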
Keywords
Bipartite Graph, Vector Space Model, Biomedical Literature, Document Cluster, Concept Hierarchy