A Hybrid Approach for XML Similarity

  • Joe Tekli
  • Richard Chbeir
  • Kokou Yetongnon
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4362)


In the past few years, XML has been established as an effective means for information management and has been widely exploited for complex data representation. Owing to the unparalleled growth in the use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in information retrieval (IR) research. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. However, to our knowledge, most of them focus exclusively on comparing documents based on structural features, overlooking the semantics involved. In this paper, we integrate IR semantic similarity assessment into an edit distance algorithm, seeking to amend similarity judgments when comparing XML-based documents. Our approach consists of an original edit distance operation cost model that introduces the semantic relatedness of XML element/attribute labels into traditional edit distance computations. A prototype has been developed to evaluate our model’s performance. Experiments yielded notable results.
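
To illustrate the general idea of feeding label semantics into an edit distance cost model, the sketch below may help. It is not the cost model of the paper, which the abstract does not specify. It assumes a simplified top-down tree distance in which whole subtrees are inserted or deleted at a cost equal to their node count, and in which relabeling a node costs 1 minus a semantic similarity score. The names `Node`, `TOY_SIMILARITY`, `semantic_similarity`, `relabel_cost` and `tree_distance` are hypothetical, and the similarity values are invented for illustration; a real system would presumably derive them from a taxonomy such as WordNet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An XML element or attribute, reduced to its label and ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


# Hypothetical, hand-written relatedness scores in [0, 1] (assumption, not from the paper).
TOY_SIMILARITY = {
    frozenset(("author", "writer")): 0.9,
    frozenset(("book", "publication")): 0.8,
}


def semantic_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical labels, a table value if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def relabel_cost(a: str, b: str) -> float:
    """Semantically close labels are cheap to substitute for one another."""
    return 1.0 - semantic_similarity(a, b)


def size(node: Node) -> int:
    """Number of nodes in the subtree; used as its insertion/deletion cost."""
    return 1 + sum(size(child) for child in node.children)


def tree_distance(a: Node, b: Node) -> float:
    """Simplified top-down tree edit distance with semantic relabeling."""
    xs, ys = a.children, b.children
    # d[i][j]: cost of turning the first i children of a into the first j of b.
    d = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + size(xs[i - 1])      # delete subtree
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + size(ys[j - 1])      # insert subtree
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + size(xs[i - 1]),                        # delete
                d[i][j - 1] + size(ys[j - 1]),                        # insert
                d[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),  # match
            )
    return relabel_cost(a.label, b.label) + d[len(xs)][len(ys)]


doc1 = Node("book", [Node("author"), Node("title")])
doc2 = Node("publication", [Node("writer"), Node("title")])
print(tree_distance(doc1, doc2))
```

With these toy scores, `doc1` and `doc2` come out much closer than they would under a purely structural cost model, where each differing label would cost a full unit. That is the kind of adjustment to similarity judgments the abstract describes.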





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Joe Tekli (1)
  • Richard Chbeir (1)
  • Kokou Yetongnon (1)

  1. LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France
