A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering

  • Xiaodan Zhang
  • Liping Jing
  • Xiaohua Hu
  • Michael Ng
  • Xiaohua Zhou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4443)

Abstract

Recent research shows that an ontology used as background knowledge can improve document clustering quality through its concept hierarchy. Previous studies treat term semantic similarity as an important means of incorporating domain knowledge into the clustering process, for example in clustering initialization and term re-weighting. However, few studies have examined how different types of term similarity measures affect clustering performance in a given domain. In this paper, we conduct a comparative study of how different term semantic similarity measures, including path-based, information-content-based, and feature-based measures, affect document clustering. We evaluate term re-weighting as an important method for integrating a domain ontology into the clustering process. We apply k-means clustering to one real-world text dataset, our own corpus generated from PubMed. Experimental results on 8 different semantic measures show that: (1) no single type of similarity measure significantly outperforms the others; (2) several similarity measures perform more stably than the others; (3) term re-weighting has a positive effect on medical document clustering, but the effect might not be significant when documents contain few terms.
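
As a rough illustration of what a path-based similarity measure and term re-weighting could look like, here is a minimal Python sketch. It is not the method evaluated in the paper: the toy is-a hierarchy, the similarity formula 1 / (1 + path length), the re-weighting rule, the threshold value, and all function names are assumptions made for this example only.

```python
# Toy is-a hierarchy (child -> parent). Purely illustrative, not a real ontology.
PARENT = {
    "disease": None,
    "cancer": "disease",
    "infection": "disease",
    "lung cancer": "cancer",
    "breast cancer": "cancer",
}


def ancestors(term):
    """Return the chain from term up to the root, with term itself first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_length(a, b):
    """Count is-a edges on the shortest path between a and b in the tree."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            # node is the lowest common ancestor of a and b
            return steps_a + up_b.index(node)
    raise ValueError("terms share no ancestor")


def path_similarity(a, b):
    """Assumed path-based measure: 1 / (1 + path length)."""
    return 1.0 / (1.0 + path_length(a, b))


def reweight(doc, threshold=0.3):
    """Assumed rule: add to each term's weight the similarity-scaled
    weights of the other terms in the same document that are at least
    `threshold` similar to it."""
    new_weights = {}
    for term, weight in doc.items():
        boost = 0.0
        for other, other_weight in doc.items():
            if other != term:
                sim = path_similarity(term, other)
                if sim >= threshold:
                    boost += sim * other_weight
        new_weights[term] = weight + boost
    return new_weights


# Hypothetical document represented as term -> weight.
doc = {"lung cancer": 2.0, "breast cancer": 1.0, "infection": 1.0}
print(reweight(doc))
```

In this toy setting the two cancer terms reinforce each other because they sit two edges apart, while "infection" keeps its original weight because its similarity to both falls below the assumed threshold. The re-weighted vectors could then be passed to any k-means implementation.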

Keywords

Semantic Similarity Measure; Document Clustering; Domain Ontology

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Xiaodan Zhang (1)
  • Liping Jing (2)
  • Xiaohua Hu (1)
  • Michael Ng (3)
  • Xiaohua Zhou (1)

  1. College of Information Science & Technology, Drexel University, 3141 Chestnut, Philadelphia, PA 19104, USA
  2. ETI & Department of Math, The University of Hong Kong, Pokfulam Road, Hong Kong
  3. Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
