Structural Similarity Evaluation Between XML Documents and DTDs

  • Conference paper
Web Information Systems Engineering – WISE 2007 (WISE 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4831)

Abstract

The automatic processing and management of XML-based data are increasingly popular research issues, owing to the growing use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received much attention. Among these is the matching of XML documents against XML grammars, which is useful in applications such as document classification, retrieval, and selective dissemination of information. In this paper, we propose an algorithm for measuring the structural similarity between an XML document and a Document Type Definition (DTD), considered the simplest way of specifying structural constraints on XML documents. We take into account the various DTD operators that constrain the existence, repeatability, and alternativeness of XML elements and attributes. Our approach is based on tree edit distance, an effective and efficient means of comparing tree structures, with XML documents and DTDs modeled as ordered labeled trees. It has polynomial complexity, in contrast to existing exponential algorithms. Classification experiments conducted on large sets of real and synthetic XML documents underline the effectiveness of our approach, as well as its applicability to large XML repositories and databases.
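
To make the notion of tree edit distance between ordered labeled trees concrete, the following is a minimal sketch of the classic unit-cost formulation (insert, delete, or relabel a node), written as a memoized recursion over forests. It is not the algorithm proposed in the paper: it ignores the DTD operators for existence, repeatability, and alternativeness, and it makes no attempt to match the paper's complexity bound. The names `tree`, `forest_distance`, and `tree_edit_distance` are hypothetical and chosen for illustration only.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees).

    Illustrative textbook recursion on the rightmost root of each forest;
    an assumption for this sketch, not the paper's method.
    """
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1

    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots: compare their child forests and the
    # remaining forests, paying 1 if the labels differ.
    match = (
        forest_distance(f_kids, g_kids)
        + forest_distance(f[:-1], g[:-1])
        + (f_label != g_label)
    )
    return min(delete, insert, match)


def tree_edit_distance(a, b):
    """Edit distance between two single trees."""
    return forest_distance((a,), (b,))


# Two small element trees that differ in one child label.
doc_a = tree("paper", tree("title"), tree("author"), tree("author"))
doc_b = tree("paper", tree("title"), tree("author"), tree("abstract"))
print(tree_edit_distance(doc_a, doc_b))
```

Comparing a document against a DTD, as the abstract describes, would additionally require a tree representation of the grammar and a treatment of its operators, which this sketch deliberately leaves out.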

Editor information

Boualem Benatallah Fabio Casati Dimitrios Georgakopoulos Claudio Bartolini Wasim Sadiq Claude Godart

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tekli, J., Chbeir, R., Yetongnon, K. (2007). Structural Similarity Evaluation Between XML Documents and DTDs. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds) Web Information Systems Engineering – WISE 2007. WISE 2007. Lecture Notes in Computer Science, vol 4831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76993-4_17

  • DOI: https://doi.org/10.1007/978-3-540-76993-4_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-76992-7

  • Online ISBN: 978-3-540-76993-4

  • eBook Packages: Computer Science, Computer Science (R0)
