Structural Similarity Evaluation Between XML Documents and DTDs

Tekli, Joe; Chbeir, Richard; Yetongnon, Kokou

doi:10.1007/978-3-540-76993-4_17

Joe Tekli¹,
Richard Chbeir¹ &
Kokou Yetongnon¹

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4831))

Included in the following conference series:

International Conference on Web Information Systems Engineering

1167 Accesses
8 Citations

Abstract

The automatic processing and management of XML-based data are ever more popular research issues due to the increasing abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the process of matching XML documents with XML grammars, useful in various applications such as documents classification, retrieval and selective dissemination of information. In this paper, we propose an algorithm for measuring the structural similarity between an XML document and a Document Type Definition (DTD) considered as the simplest way for specifying structural constraints on XML documents. We consider the various DTD operators that designate constraints on the existence, repeatability and alternativeness of XML elements/attributes. Our approach is based on the concept of tree edit distance, as an effective and efficient means for comparing tree structures, XML documents and DTDs being modeled as ordered labeled trees. It is of polynomial complexity, in comparison with existing exponential algorithms. Classification experiments, conducted on large sets of real and synthetic XML documents, underline our approach effectiveness, as well as its applicability to large XML repositories and databases.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection

Towards Flexible Similarity Analysis of XML Data

A methodology for measuring structure similarity of fuzzy XML documents

Article 10 April 2017

References

Aho, A., Hirschberg, D., Ullman, J.: Bounds on the Complexity of the Longest Common Subsequence Problem. ACM J. 23(1), 1–12 (1976)
Article MATH MathSciNet Google Scholar
Akatsu, T.: Approximate String Matching with Don’t Care Characters. Information Processing Letters 55, 235–239 (1995)
Article MathSciNet Google Scholar
Bertino E., et al.: Measuring the Structural Similarity among XML Documents and DTDs, Technical Report, University of Genova (2002), http://www.disi.unige.it/person/MesitiM
Bertino, E., Guerrini, G., Mesiti, M.: A Matching Algorithm for Measuring the Structural Similarity between an XML Documents and a DTD and its Applications. Elsevier Computer Science, Amsterdam (2004)
Google Scholar
Buttler, D.: A Short Survey of Document Structure Similarity Algorithms. In: Proc. of the 5^th International Conference on Internet Computing, pp. 3–9 (2004)
Google Scholar
Chamberlin, D., et al.: XQuery: A Query Language for XML (2001), http://www.w3.org/TR/2001/WD-xquery-20010215
Chawathe, S., et al.: Change Detection in Hierarchically Structured Information. In: SIGMOD, pp. 493–504 (1996)
Google Scholar
Chawathe, S.: Comparing Hierarchical Data in External Memory. In: Proceedings of the 25th Int. Conf. on Very Large Data Bases, pp. 90–101 (1999)
Google Scholar
Cobéna, G., et al.: Detecting Changes in XML Documents. In: Proc. of the 18^th ICDE Conference, pp. 41–52 (2002)
Google Scholar
Dalamagas, T., Cheng, T., Winkel, K., Sellis, T.: A methodology for clustering XML documents by structure. Inf. Syst. 31(3), 187–228 (2006)
Article Google Scholar
Deutsch, A., et al.: XML-QL: A Query Language for XML. Computer Networks 31(11-16), 1155–1169 (1999)
Article Google Scholar
Do, H.H., Rahm, E.: COMA: A System for Flexible Combination of Schema Matching Approaches. In: The 28^th VLDB Conf., Honk Kong, pp. 610–621 (2002)
Google Scholar
Doan, A., Domingos, P., Halevy, A.Y.: Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach. In: ACM SIGMOD, pp. 509–520 (2001)
Google Scholar
Flesca, S., et al.: Detecting Structural Similarities between XML Documents. In: WebDB, pp. 55–60 (2002)
Google Scholar
Goldman, R., Windom, J.: Dataguides: Enabling Query Formulation and Optimization in Semi-structured Databases. In: Proc. of the 23^rd VLDB Conference, pp. 436–445 (1997)
Google Scholar
Grahne, G., Thomo, A.: Approximate Reasoning in Semi-structured Databases. In: Proc. of the 8^th International Workshop on Knowledge Representation meets Databases, vol. 45 (2001), http://CEUR-WS.org/Vol-45/03-thomo.ps
Landau, G.M., Vishkin, U.: Fast Parallel and Serial Approximate String Matching. Journal of Algorithms 10, 157–169 (1989)
Article MATH MathSciNet Google Scholar
Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in IS-A Hierarchies. Journal of Documentation 49(2), 188–207 (1993)
Article Google Scholar
Lee, M., et al.: XClust: Clustering XML Schemas for Effective Integration. In: CIKM, pp. 292–299 (2002)
Google Scholar
Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Sov. Phys. Dokl. 6, 707–710 (1966)
MathSciNet Google Scholar
Madhavan, J., et al.: Generic Schema Matching With Cupid. In: 27^th VLDB Conference, pp. 49–58 (2001)
Google Scholar
Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In: Proc. of the 18th ICDE Conference, pp. 117–128 (2002)
Google Scholar
Myers, E.: An O(N D) Difference Algorithm and Its Variations. Algorithmica 1(2), 251–266 (1986)
Article MATH MathSciNet Google Scholar
Nierman, A., Jagadish, H.V.: Evaluating structural similarity in XML documents. In: Proc. of the 5th Int. Workshop on the Web and Databases, pp. 61–66 (2002)
Google Scholar
Pereira, F.: Technologies for Digital Multimedia Communications: An Evolution Analysis of MPEG Standards. China Communications Journal, 8–19 (2006)
Google Scholar
van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
Google Scholar
Sanz, I., Mesiti, M., Guerrini, G., Berlanga Lavori, R.: Approximate Subtree Identification in Heterogeneous XML Documents Collections. In: Bressan, S., Ceri, S., Hunt, E., Ives, Z.G., Bellahsène, Z., Rys, M., Unland, R. (eds.) XSym 2005. LNCS, vol. 3671, pp. 192–206. Springer, Heidelberg (2005)
Google Scholar
Schlieder, T.: Similarity Search in XML Data Using Cost-based Query Transformations. In: WebDB, pp. 19–24 (2001)
Google Scholar
Shasha, D., Zhang, K.: Approximate Tree Pattern Matching. In: Pattern Matching in Strings, Trees and Arrays, Oxford University Press, Oxford (1995)
Google Scholar
Tai, K.C.: The Tree-to-Tree correction problem. ACM J. 26, 422–433 (1979)
Article MATH MathSciNet Google Scholar
Wagner, J., Fisher, M.: The String-to-String correction problem. J. of the ACM (21), 168–173 (1974)
Google Scholar
Wong, C., Chandra, A.: Bounds for the String Editing Problem. Journal of the Association for Computing Machinery 23(1), 13–16 (1976)
MATH MathSciNet Google Scholar
WWW Consortium, The Document Object Model (August 10, 2006), http://www.w3.org/DOM
Zhang, K., Shasha, D.: Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM J. 18(6), 1245–1262 (1989)
Article MATH MathSciNet Google Scholar
Zhang, K., Shasha, D., Wang, J.: Approximate Tree Matching in the Presence of Variable Length Don’t Cares. J. of Algorithms 16(1), 33–66 (1994)
Article MATH MathSciNet Google Scholar
Zhang, Z., Li, R., Cao, S., Zhu, Y.: Similarity Metric for XML Documents. In: Knowledge Management and Experience Management Workshop, Karlsruhe, Germany, pp. 255–261 (2003)
Google Scholar

Download references

Author information

Authors and Affiliations

LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France
Joe Tekli, Richard Chbeir & Kokou Yetongnon

Authors

Joe Tekli
View author publications
You can also search for this author in PubMed Google Scholar
Richard Chbeir
View author publications
You can also search for this author in PubMed Google Scholar
Kokou Yetongnon
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Boualem Benatallah Fabio Casati Dimitrios Georgakopoulos Claudio Bartolini Wasim Sadiq Claude Godart

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Tekli, J., Chbeir, R., Yetongnon, K. (2007). Structural Similarity Evaluation Between XML Documents and DTDs. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds) Web Information Systems Engineering – WISE 2007. WISE 2007. Lecture Notes in Computer Science, vol 4831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76993-4_17

Download citation

DOI: https://doi.org/10.1007/978-3-540-76993-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76992-7
Online ISBN: 978-3-540-76993-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Structural Similarity Evaluation Between XML Documents and DTDs

Abstract

Access this chapter

Preview

Similar content being viewed by others

SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection

Towards Flexible Similarity Analysis of XML Data

A methodology for measuring structure similarity of fuzzy XML documents

References

Author information

Authors and Affiliations

Editor information

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Structural Similarity Evaluation Between XML Documents and DTDs

Abstract

Access this chapter

Preview

Similar content being viewed by others

SemSynX: Flexible Similarity Analysis of XML Data via Semantic and Syntactic Heterogeneity/Homogeneity Detection

Towards Flexible Similarity Analysis of XML Data

A methodology for measuring structure similarity of fuzzy XML documents

References

Author information

Authors and Affiliations

Editor information

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation