A New Cluster Merging Algorithm of Suffix Tree Clustering

  • Jianhua Wang
  • Ruixu Li
Part of the IFIP International Federation for Information Processing book series (IFIPAICT, volume 228)


Document clustering methods can be used to structure large sets of text or hypertext documents. Suffix Tree Clustering has proved to be a good approach to document clustering. However, its cluster merging algorithm is based only on the overlap of the clusters' document sets, and it ignores any similarity between the non-overlapping parts of different clusters. In this paper, we introduce a novel cluster merging approach that combines cosine similarity with overlap percentage. Using this method, we obtain a better clustering result and a comparatively small number of clusters.
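The abstract does not state how the two measures are combined, so the following Python sketch is only an illustration under stated assumptions, not the method of the paper. It assumes a weighted sum of the overlap percentage and the cosine similarity between term-frequency vectors of the non-overlapping parts. The names `overlap`, `cosine`, `term_vector` and `should_merge`, the weight `alpha`, and the `threshold` value are all hypothetical. For comparison, the original Suffix Tree Clustering rule merges two base clusters only when their shared documents make up more than half of each cluster.

```python
import math
from collections import Counter


def overlap(a, b):
    """Shared documents as a fraction of the larger cluster.

    A value above 0.5 is equivalent to the original STC rule
    |A & B| / |A| > 0.5 and |A & B| / |B| > 0.5.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty non-overlapping part contributes no similarity
    return dot / (norm_u * norm_v)


def term_vector(doc_ids, docs):
    """Sum raw term counts over the given documents (whitespace tokens)."""
    vec = Counter()
    for doc_id in doc_ids:
        vec.update(docs[doc_id].lower().split())
    return vec


def should_merge(a, b, docs, alpha=0.5, threshold=0.5):
    """Hypothetical merge test: weighted sum of overlap and cosine similarity.

    The cosine term compares only the documents that the two clusters
    do NOT share, which is the part that plain overlap ignores.
    alpha and threshold are assumed values, not taken from the paper.
    """
    sim = cosine(term_vector(a - b, docs), term_vector(b - a, docs))
    score = alpha * overlap(a, b) + (1 - alpha) * sim
    return score > threshold


docs = {
    1: "suffix tree clustering",
    2: "suffix tree search",
    3: "suffix tree index",
    4: "cooking pasta recipes",
}
print(should_merge({1, 2}, {2, 3}, docs))  # non-overlapping parts are similar
print(should_merge({1, 2}, {2, 4}, docs))  # non-overlapping parts are unrelated
```

In this toy data both pairs of clusters share exactly half of their documents, so the overlap-only rule treats them identically and merges neither. The combined score separates them because documents 1 and 3 share two of their three terms, while documents 1 and 4 share none.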

Key words

suffix tree clustering cluster merging algorithm 



Copyright information

© International Federation for Information Processing 2006

Authors and Affiliations

  • Jianhua Wang (1)
  • Ruixu Li (1)
  1. Computer Science Department, Yantai University, Yantai, Shandong, China
