XML document-grammar comparison: related problems and applications

  • Joe Tekli
  • Richard Chbeir
  • Agma J. M. Traina
  • Caetano Traina
Review Article

Abstract

XML document comparison is becoming an increasingly popular research topic as the use of XML grows. Likewise, the proliferation of heterogeneous XML data sources, particularly on the Web, has fostered growing interest in XML grammar matching and comparison. Nonetheless, the problem of comparing XML documents with XML grammars, i.e., XML document and grammar similarity evaluation, has not yet received the attention it deserves. In this paper, we provide an overview of existing research related to XML document/grammar comparison, presenting the background and discussing the various techniques related to the problem. We also discuss some prominent application domains, including document classification and clustering, document transformation, grammar evolution, selective dissemination of XML information, XML querying, alert filtering in intrusion detection systems, and Web Services matching and communications.

Keywords

XML, semi-structured data, XML grammar, DTD, XSD, structural similarity, classification, clustering, structure transformation, selective dissemination, grammar evolution

Copyright information

© Versita Warsaw and Springer-Verlag Wien 2011

Authors and Affiliations

  • Joe Tekli (1)
  • Richard Chbeir (2)
  • Agma J. M. Traina (1)
  • Caetano Traina (1)
  1. ICMC, Computer Science and Statistics Department, University of Sao Paulo (USP), Sao Carlos, Brazil
  2. LE2I Laboratory UMR-CNRS, Department of Computer Science, University of Bourgogne, Dijon Cedex, France
