Measuring the structural similarity among XML documents and DTDs

Journal of Intelligent Information Systems

Abstract

Measuring the structural similarity between an XML document and a DTD has many applications, ranging from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents because a DTD can be regarded as a generator of documents; the task is therefore to evaluate the similarity between a document and a set of documents. An effective structural similarity measure must meet several requirements: it should account for the presence and absence of required elements, for the structure and level of missing and extra elements, and for vocabulary discrepancies due to synonymous or syntactically similar tags. Starting from these requirements, this paper defines the measure and presents an algorithm that matches a document against a DTD to obtain their structural similarity. Finally, experimental results assessing the effectiveness of the approach are presented.

Author information

Corresponding author

Correspondence to Marco Mesiti.

About this article

Cite this article

Bertino, E., Guerrini, G. & Mesiti, M. Measuring the structural similarity among XML documents and DTDs. J Intell Inf Syst 30, 55–92 (2008). https://doi.org/10.1007/s10844-006-0023-y

  • DOI: https://doi.org/10.1007/s10844-006-0023-y
