Clustering XML Documents Using Structural Summaries

  • Theodore Dalamagas
  • Tao Cheng
  • Klaas-Jan Winkel
  • Timos Sellis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3268)


This work presents a methodology for grouping structurally similar XML documents using clustering algorithms. Modeling XML documents as trees, we treat the problem of clustering XML documents by structure as a tree clustering problem, exploiting distances that estimate the similarity between those trees in terms of the hierarchical relationships of their nodes. We propose using tree structural summaries to improve the performance of the distance calculation while maintaining or even improving its quality. Experimental results are provided using a prototype testbed.
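
The general idea of clustering documents by structure can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify how summaries or distances are computed, so the sketch substitutes a set of root-to-node label paths as a crude stand-in for a structural summary, a Jaccard distance in place of a tree edit distance, and a threshold-based single-link merge as the clustering step. The names `label_paths`, `structural_distance`, `single_link_clusters`, the sample documents, and the threshold of 0.5 are all hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def label_paths(xml_text):
    """Return the set of root-to-node label paths of an XML document.

    Assumption: a path set is used here as a simple stand-in for a
    structural summary; repeated sibling structure collapses in the set.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def structural_distance(a, b):
    """Jaccard distance between two path sets (not a tree edit distance)."""
    return 1.0 - len(a & b) / len(a | b)


def single_link_clusters(items, dist, threshold):
    """Merge clusters while their closest members are within the threshold."""
    clusters = [[i] for i in range(len(items))]
    merged = True
    while merged:
        merged = False
        for x, y in combinations(range(len(clusters)), 2):
            closest = min(
                dist(items[i], items[j])
                for i in clusters[x]
                for j in clusters[y]
            )
            if closest <= threshold:
                clusters[x].extend(clusters[y])
                del clusters[y]
                merged = True
                break
    return [sorted(c) for c in clusters]


# Hypothetical sample documents: two book-like records and one order.
docs = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<order><item><price/></item><item><price/></item></order>",
]
summaries = [label_paths(d) for d in docs]
clusters = single_link_clusters(summaries, structural_distance, threshold=0.5)
print(clusters)
```

In this sketch the two book documents share three of four distinct paths and end up in one cluster, while the order document shares none and stays alone. Working on the smaller path sets rather than the full trees mirrors, loosely, the motivation for summaries given above: less structure to compare per distance computation.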


Original Tree, Cluster Quality, Label Tree, Edit Graph, Structural Distance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Theodore Dalamagas (1)
  • Tao Cheng (2)
  • Klaas-Jan Winkel (3)
  • Timos Sellis (1)
  1. School of Electr. and Comp. Engineering, National Technical University of Athens, Zographou, Athens, Greece
  2. Department of Computer Science, University of California, Santa Barbara, USA
  3. Faculty of Computer Science, University of Twente, Enschede, The Netherlands
