Clustering XML Documents by Structure

  • Theodore Dalamagas
  • Tao Cheng
  • Klaas-Jan Winkel
  • Timos Sellis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3025)


This work explores the application of clustering methods for grouping structurally similar XML documents. Modeling the XML documents as rooted, ordered, labeled trees, we apply clustering algorithms using distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We propose using structural summaries of the trees to improve the performance of the distance calculation while maintaining, or even improving, its quality. Experimental results obtained with a prototype testbed are provided.
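
The pipeline described in the abstract (XML documents as rooted ordered labeled trees, a tree distance, structural summaries, clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' method: the distance is a Selkow-style top-down edit distance with unit costs, `summarize` is a crude stand-in for a structural summary that only merges sibling elements sharing a tag, and `single_link_clusters` is a simple threshold-based single-link grouping. The function names and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node)

def selkow(a, b):
    """Selkow-style distance (assumed here, unit costs): relabel the root,
    then align the ordered child lists, inserting or deleting whole subtrees."""
    relabel = 0 if a.tag == b.tag else 1
    ca, cb = list(a), list(b)
    m, n = len(ca), len(cb)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + size(ca[i - 1])        # delete subtree
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + size(cb[j - 1])        # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + size(ca[i - 1]),
                D[i][j - 1] + size(cb[j - 1]),
                D[i - 1][j - 1] + selkow(ca[i - 1], cb[j - 1]),
            )
    return relabel + D[m][n]

def summarize(node):
    """Crude structural summary (an assumption): merge siblings with the
    same tag, pooling their children, and recurse."""
    out = ET.Element(node.tag)
    merged = {}  # tag -> pooled grandchildren, in first-seen order
    for child in node:
        merged.setdefault(child.tag, []).extend(list(child))
    for tag, grandchildren in merged.items():
        tmp = ET.Element(tag)
        tmp.extend(grandchildren)
        out.append(summarize(tmp))
    return out

def single_link_clusters(docs, threshold):
    """Group documents whose summaries are linked by distance <= threshold."""
    trees = [summarize(ET.fromstring(d)) for d in docs]
    clusters = [[i] for i in range(len(trees))]
    merged = True
    while merged:
        merged = False
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if any(selkow(trees[i], trees[j]) <= threshold
                       for i in clusters[x] for j in clusters[y]):
                    clusters[x] += clusters.pop(y)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Hypothetical sample documents: two library listings and one shop listing.
docs = [
    "<lib><book><title/><author/></book><book><title/><author/></book></lib>",
    "<lib><book><title/><author/></book></lib>",
    "<shop><item><price/></item></shop>",
]
raw = selkow(ET.fromstring(docs[0]), ET.fromstring(docs[1]))
summarized = selkow(summarize(ET.fromstring(docs[0])),
                    summarize(ET.fromstring(docs[1])))
print(raw, summarized)                 # prints "3 0"
print(single_link_clusters(docs, 1))   # prints "[[0, 1], [2]]"
```

In this toy case the two library documents differ only in how often the `book` element repeats, so their distance drops from 3 to 0 once the repetition is summarized away, and the summarized trees are also smaller to compare.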


Keywords: XML, structural similarity, tree distance, structural summary, clustering





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Theodore Dalamagas (1)
  • Tao Cheng (2)
  • Klaas-Jan Winkel (3)
  • Timos Sellis (1)

  1. School of Electr. and Comp. Engineering, National Technical University of Athens, Greece
  2. Dept. of Computer Science, University of California, Santa Barbara, USA
  3. Faculty of Computer Science, University of Twente, the Netherlands
