Journal of Intelligent Information Systems, Volume 30, Issue 1, pp 55–92

Measuring the structural similarity among XML documents and DTDs

Abstract

Measuring the structural similarity between an XML document and a DTD has many applications, ranging from document classification and approximate structural queries on XML documents to selective dissemination of XML documents and document protection. The problem is harder than measuring structural similarity among documents because a DTD can be regarded as a generator of documents; the task is therefore to evaluate the similarity between a document and a set of documents. An effective structural similarity measure must meet several requirements: it should account for the presence and absence of required elements, for the structure and level of missing and extra elements, and for vocabulary discrepancies due to synonymous or syntactically similar tags. Starting from these requirements, we define the measure and present an algorithm that matches a document against a DTD to obtain their structural similarity. Finally, we present experimental results that assess the effectiveness of the approach.
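
To make the requirements above concrete, the following is a minimal sketch of a level-weighted structural score. It is not the algorithm defined in the paper. It assumes a toy schema (a dictionary that maps each element to its allowed children and whether each is required) in place of a real DTD with alternatives and repetition operators, and it compares tags by exact equality, so synonymous or syntactically similar tags are not handled. The names Node, match, similarity and the decay parameter are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A document element: a tag and its ordered child elements."""
    tag: str
    children: list["Node"] = field(default_factory=list)


# Toy stand-in for a DTD (assumption): element -> {child tag: is it required?}.
# The schema must not require an element inside itself, or required_weight loops.
Schema = dict[str, dict[str, bool]]


def subtree_weight(node: Node, level: int, decay: float) -> float:
    """Weight of a whole document subtree; deeper elements count less."""
    return decay ** level + sum(
        subtree_weight(child, level + 1, decay) for child in node.children
    )


def required_weight(tag: str, schema: Schema, level: int, decay: float) -> float:
    """Weight of the minimal required structure rooted at `tag`."""
    return decay ** level + sum(
        required_weight(child, schema, level + 1, decay)
        for child, required in schema.get(tag, {}).items()
        if required
    )


def match(node: Node, schema: Schema, level: int = 0, decay: float = 0.5):
    """Return (plus, minus, common) weights for `node` against the schema.

    plus   = weight of extra elements not declared in the schema
    minus  = weight of required elements missing from the document
    common = weight of elements present in both
    """
    plus = minus = 0.0
    common = decay ** level  # the node itself is assumed to match
    declared = schema.get(node.tag, {})
    seen = set()
    for child in node.children:
        if child.tag in declared:
            seen.add(child.tag)
            p, m, c = match(child, schema, level + 1, decay)
            plus, minus, common = plus + p, minus + m, common + c
        else:
            plus += subtree_weight(child, level + 1, decay)
    for tag, required in declared.items():
        if required and tag not in seen:
            minus += required_weight(tag, schema, level + 1, decay)
    return plus, minus, common


def similarity(doc: Node, schema: Schema, root: str, decay: float = 0.5) -> float:
    """Score in [0, 1]; 1.0 means no extra and no missing elements."""
    if doc.tag != root:
        return 0.0
    plus, minus, common = match(doc, schema, 0, decay)
    return common / (common + plus + minus)


schema = {"book": {"title": True, "author": True, "year": False},
          "author": {"name": True}}
valid = Node("book", [Node("title"), Node("author", [Node("name")])])
partial = Node("book", [Node("title"), Node("isbn")])
print(similarity(valid, schema, "book"))
print(similarity(partial, schema, "book"))
```

In this sketch the document "partial" is penalized twice: once for the undeclared isbn element and once for the missing author element together with its required name child. Because weights shrink with depth, discrepancies near the root cost more than discrepancies deep in the tree.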

Keywords

H.3.2 Information storage; H.3.3 Information search and retrieval; H.3.5.f XML/XSL/RDF; I.5.3.b Similarity measure

Copyright information

© Springer Science+Business Media, LLC 2006

Authors and Affiliations

  • Elisa Bertino (1)
  • Giovanna Guerrini (2)
  • Marco Mesiti (3)
  1. Purdue University, West Lafayette, USA
  2. University of Genova, Genova, Italy
  3. University of Milano, Milano, Italy