Comparison of Algorithms for Web Document Clustering Using Graph Representations of Data

  • Adam Schenker
  • Mark Last
  • Horst Bunke
  • Abraham Kandel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3138)


In this paper we compare the performance of several popular clustering algorithms, including k-means, fuzzy c-means, hierarchical agglomerative clustering, and graph partitioning. The novelty of this work is that the objects to be clustered are represented by graphs rather than by the usual numeric feature vectors. We apply these techniques to web document clustering: web documents are structured information sources and are therefore well suited to modeling by graphs. We examine the performance of each clustering algorithm when the web documents are represented both as graphs and as vectors, which allows us to investigate the applicability of each algorithm to the problem of web document clustering.
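To make the idea of clustering graphs rather than vectors concrete, the following is a minimal sketch and not the method used in the paper. The abstract does not specify the graph model or the distance measure, so everything here is an assumption: nodes are the unique terms of a document, directed edges join adjacent terms, graph size is counted as nodes plus edges, and the distance is one minus the relative size of the largest common subgraph. The function names `document_graph` and `mcs_distance` are hypothetical.

```python
def document_graph(text):
    """Build a toy document graph (assumed model, not the paper's).

    Nodes are the unique lowercase terms; a directed edge (a, b)
    exists when term b immediately follows term a in the text.
    """
    words = text.lower().split()
    nodes = set(words)
    edges = set(zip(words, words[1:]))
    return nodes, edges


def graph_size(graph):
    """Size of a graph, assumed here to be nodes plus edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance 1 - |common subgraph| / max(|g1|, |g2|).

    Because node labels are unique, the largest common subgraph is
    simply the intersection of the node sets and of the edge sets.
    """
    largest = max(graph_size(g1), graph_size(g2))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    common = (g1[0] & g2[0], g1[1] & g2[1])
    return 1.0 - graph_size(common) / largest


if __name__ == "__main__":
    a = document_graph("graph clustering of web documents")
    b = document_graph("vector clustering of web documents")
    print(mcs_distance(a, b))
```

A pairwise distance of this kind is enough for algorithms that need only distances between objects, such as hierarchical agglomerative clustering. Centroid-based algorithms such as k-means would additionally need a graph-valued cluster representative, which this sketch does not cover.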



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Adam Schenker (1)
  • Mark Last (2)
  • Horst Bunke (3)
  • Abraham Kandel (1, 4)
  1. University of South Florida, Tampa, USA
  2. Ben-Gurion University of the Negev, Beer-Sheva, Israel
  3. University of Bern, Bern, Switzerland
  4. Tel-Aviv University, Tel-Aviv, Israel
