Clustering Large Scale of XML Documents

  • Tong Wang
  • Da-Xin Liu
  • Xuan-Zuo Lin
  • Wei Sun
  • Gufran Ahmad
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3947)


Clustering can facilitate information retrieval. This paper addresses the problem of clustering a large number of XML documents. We propose the ICX algorithm, which uses a novel similarity metric based on quantitative paths. In our approach, each document is first represented by path sequences extracted from its XML tree. These sequences are then mapped into quantitative paths, from which the distance between documents can be computed with low complexity. Finally, the desired clusters are constructed by the ICX method with literal local search. Experimental results on XML documents obtained from DBLP show the effectiveness and good performance of the proposed techniques.
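
As a rough illustration of the first step described above, the sketch below extracts root-to-leaf path sequences from small XML documents. The abstract does not define the quantitative-path mapping or the ICX distance, so the code does not attempt them; `jaccard_distance` is a plainly named stand-in over sets of paths, and the function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def path_sequences(xml_text):
    """Return the root-to-leaf tag paths of an XML document, in document order."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + [node.tag]
        children = list(node)
        if not children:
            # A leaf element closes one complete path.
            paths.append("/".join(current))
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


def jaccard_distance(paths_a, paths_b):
    """Stand-in distance over path sets (an assumption, NOT the paper's metric)."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical sample documents, loosely modelled on bibliographic records.
doc1 = "<book><title>T</title><author><name>A</name></author></book>"
doc2 = "<book><title>U</title><year>2006</year></book>"

p1 = path_sequences(doc1)
p2 = path_sequences(doc2)
print(p1)
print(p2)
print(jaccard_distance(p1, p2))
```

Only element structure is compared here; text content and attributes are ignored, which matches a purely structural view of the documents but may differ from what the paper actually does.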


Keywords: Local Search; Tree Edit Distance; Path Sequence; Centroid Vector; Book Author
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Tong Wang
    • 1
  • Da-Xin Liu
    • 1
  • Xuan-Zuo Lin
    • 2
  • Wei Sun
    • 1
  • Gufran Ahmad
    • 1
  1. Department of Computer Science and Technology, Harbin Engineering University, China
  2. Northeast Agriculture University, Harbin, China
