Abstract
Measuring the structural similarity between an XML document and a DTD has many applications, ranging from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents because a DTD can be regarded as a generator of documents; the task is therefore to evaluate the similarity between a document and a set of documents. An effective structural similarity measure must meet several requirements: it should account for the presence and absence of required elements, for the structure and level of missing and extra elements, and for vocabulary discrepancies caused by synonymous or syntactically similar tags. Starting from these requirements, this paper defines the measure and presents an algorithm that matches a document against a DTD to obtain their structural similarity. Finally, experimental results assessing the effectiveness of the approach are presented.
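To make the notions of common, missing, and extra elements concrete, the following is a minimal Python sketch and not the measure or the algorithm defined in the paper. It assumes a heavily simplified, hypothetical DTD model (`TOY_DTD`, a mapping from each tag to the child tags it requires), ignores optionality, repetition, alternatives, element level, and tag similarity, and uses an illustrative ratio of common elements to all counted elements. The names `count_matches` and `toy_similarity` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified "DTD": each tag maps to the child tags it requires.
# Optional, repeated, and alternative content models are deliberately not represented.
TOY_DTD = {
    "book": ["title", "author"],
    "title": [],
    "author": ["name"],
    "name": [],
}


def count_matches(element, dtd):
    """Return (common, missing, extra) element counts for the subtree at element."""
    required = dtd.get(element.tag, [])
    common, missing, extra = 1, 0, 0  # the element itself matches its declaration
    seen = set()
    for child in element:
        if child.tag in required and child.tag not in seen:
            seen.add(child.tag)
            c, m, e = count_matches(child, dtd)
            common, missing, extra = common + c, missing + m, extra + e
        else:
            # Undeclared or repeated child: its whole subtree counts as extra.
            extra += sum(1 for _ in child.iter())
    # Each required child that never appeared counts as one missing element.
    missing += sum(1 for tag in required if tag not in seen)
    return common, missing, extra


def toy_similarity(xml_text, dtd, root_tag):
    """Illustrative score in [0, 1]: common / (common + missing + extra)."""
    root = ET.fromstring(xml_text)
    if root.tag != root_tag:
        return 0.0
    common, missing, extra = count_matches(root, dtd)
    return common / (common + missing + extra)


if __name__ == "__main__":
    valid = "<book><title/><author><name/></author></book>"
    partial = "<book><title/><isbn/></book>"
    print(toy_similarity(valid, TOY_DTD, "book"))
    print(toy_similarity(partial, TOY_DTD, "book"))
```

In this toy setting a document that satisfies every declaration scores 1, while one that lacks a required `author` element and carries an undeclared `isbn` element scores lower. A measure meeting the requirements listed above would additionally weight elements by their level and structure and tolerate synonymous or syntactically similar tags.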
Cite this article
Bertino, E., Guerrini, G. & Mesiti, M. Measuring the structural similarity among XML documents and DTDs. J Intell Inf Syst 30, 55–92 (2008). https://doi.org/10.1007/s10844-006-0023-y
DOI: https://doi.org/10.1007/s10844-006-0023-y