Measuring the structural similarity among XML documents and DTDs

Bertino, Elisa; Guerrini, Giovanna; Mesiti, Marco

doi:10.1007/s10844-006-0023-y

Published: 27 January 2007

Journal of Intelligent Information Systems, volume 30, pages 55–92 (2008)


Abstract

Measuring the structural similarity between an XML document and a DTD has many applications, ranging from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents because a DTD can be regarded as a generator of documents, so the task is to evaluate the similarity between a document and a set of documents. An effective structural similarity measure must meet several requirements: it should account for the presence and absence of required elements, for the structure and level of missing and extra elements, and for vocabulary discrepancies caused by synonymous or syntactically similar tags. Starting from these requirements, we define the measure and present an algorithm that matches a document against a DTD to obtain their structural similarity. Finally, we present experimental results that assess the effectiveness of the approach.
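To make the requirements above concrete, the following is a minimal illustrative sketch, not the algorithm defined in the paper. It assumes a heavily simplified "DTD" that only records, for each tag, which child tags are declared and whether each is required; DTD operators, repetition, alternatives, and tag similarity (synonyms, syntactic closeness) are ignored. The names `similarity`, `schema`, and `decay`, and the ratio used as the final score, are hypothetical choices made for this example only.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified representations (assumptions, not from the paper):
#   Node   = (tag, list of child nodes)
#   Schema = tag -> list of (declared child tag, is it required?)
Node = Tuple[str, list]
Schema = Dict[str, List[Tuple[str, bool]]]


def similarity(doc: Node, schema: Schema, decay: float = 0.5) -> float:
    """Toy structural score in [0, 1]; 1.0 means nothing is missing or extra.

    Elements are weighted by decay**level, so discrepancies deeper in the
    tree count less than discrepancies near the root. The root tag is
    assumed to match the schema's root.
    """
    common = missing = extra = 0.0

    def visit(node: Node, level: int) -> None:
        nonlocal common, missing, extra
        tag, children = node
        common += decay ** level                 # element accepted at this level
        declared = dict(schema.get(tag, []))     # child tag -> required?
        present = {child[0] for child in children}

        # Required children that the document does not contain.
        for child_tag, required in declared.items():
            if required and child_tag not in present:
                missing += decay ** (level + 1)

        for child in children:
            if child[0] in declared:
                visit(child, level + 1)          # declared child: recurse
            else:
                extra += decay ** (level + 1)    # undeclared child, counted once

    visit(doc, 0)
    return common / (common + missing + extra)


# Hypothetical example data.
schema = {"book": [("title", True), ("author", True), ("year", False)]}
valid = ("book", [("title", []), ("author", [])])
invalid = ("book", [("title", []), ("isbn", [])])   # no author, extra isbn

print(similarity(valid, schema))
print(similarity(invalid, schema))
```

In this sketch a document that contains every required element and no undeclared ones scores 1.0, while the second document is penalized once for the missing `author` and once for the undeclared `isbn`. Counting an undeclared subtree as a single extra element is a simplification; a measure that considers the structure of missing and extra elements, as the abstract requires, would weigh the whole subtree.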

Author information

Authors and Affiliations

Purdue University, West Lafayette, IN, USA
Elisa Bertino
University of Genova, Genova, Italy
Giovanna Guerrini
University of Milano, Milano, Italy
Marco Mesiti

Corresponding author

Correspondence to Marco Mesiti.

About this article

Cite this article

Bertino, E., Guerrini, G. & Mesiti, M. Measuring the structural similarity among XML documents and DTDs. J Intell Inf Syst 30, 55–92 (2008). https://doi.org/10.1007/s10844-006-0023-y

Received: 24 September 2004
Revised: 16 February 2006
Accepted: 22 February 2006
Published: 27 January 2007
Issue Date: February 2008
DOI: https://doi.org/10.1007/s10844-006-0023-y
