A Hybrid Approach for XML Similarity

Tekli, Joe; Chbeir, Richard; Yetongnon, Kokou

doi:10.1007/978-3-540-69507-3_68

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4362)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Computer Science


Abstract

In the past few years, XML has been established as an effective means for information management and has been widely used for complex data representation. Owing to the rapidly growing use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in information retrieval (IR) research. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. However, to our knowledge, most of them focus exclusively on comparing documents based on structural features, overlooking the semantics involved. In this paper, we integrate IR semantic similarity assessment into an edit distance algorithm, seeking to improve similarity judgments when comparing XML-based documents. Our approach consists of an original edit distance operation cost model that introduces the semantic relatedness of XML element/attribute labels into traditional edit distance computations. A prototype has been developed to evaluate our model's performance. Experiments yielded notable results.
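
The general idea of folding label relatedness into edit distance costs can be illustrated with a minimal sketch. It is not the cost model of the paper: it assumes a flat sequence of element labels rather than an XML tree, unit insertion and deletion costs, and a hypothetical hand-written relatedness table (TOY_SIMILARITY) standing in for a real semantic similarity measure. All function and variable names are illustrative.

```python
from typing import Callable, Sequence

# Hypothetical relatedness scores in [0, 1], made up for illustration.
# A real system would compute these with a semantic similarity measure.
TOY_SIMILARITY = {
    frozenset(("author", "writer")): 0.9,
    frozenset(("book", "novel")): 0.8,
}


def label_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical labels, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def exact_match(a: str, b: str) -> float:
    """Purely syntactic comparison: labels are either equal or unrelated."""
    return 1.0 if a == b else 0.0


def semantic_edit_distance(
    src: Sequence[str],
    dst: Sequence[str],
    sim: Callable[[str, str], float] = label_similarity,
) -> float:
    """Dynamic-programming edit distance over label sequences.

    Assumption: insert and delete cost 1.0, and updating one label into
    another costs 1 - sim(a, b), so related labels are cheap to update.
    """
    rows, cols = len(src) + 1, len(dst) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every label of src[:i]
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every label of dst[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            update_cost = 1.0 - sim(src[i - 1], dst[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,              # delete src[i-1]
                dist[i][j - 1] + 1.0,              # insert dst[j-1]
                dist[i - 1][j - 1] + update_cost,  # update or keep label
            )
    return dist[-1][-1]
```

With this sketch, comparing the label sequences book, author, title and novel, writer, title gives a distance of about 0.3 under the toy relatedness table, against 2.0 when labels are compared by exact match only (passing exact_match as sim), which is the kind of adjustment to similarity judgments the abstract describes.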

Author information

Authors and Affiliations

LE2I Laboratory UMR-CNRS, University of Bourgogne, 21078 Dijon Cedex, France
Joe Tekli, Richard Chbeir & Kokou Yetongnon

Editor information

Jan van Leeuwen, Giuseppe F. Italiano, Wiebe van der Hoek, Christoph Meinel, Harald Sack, František Plášil

About this paper

Cite this paper

Tekli, J., Chbeir, R., Yetongnon, K. (2007). A Hybrid Approach for XML Similarity. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds) SOFSEM 2007: Theory and Practice of Computer Science. SOFSEM 2007. Lecture Notes in Computer Science, vol 4362. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69507-3_68

DOI: https://doi.org/10.1007/978-3-540-69507-3_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69506-6
Online ISBN: 978-3-540-69507-3
eBook Packages: Computer Science (R0)
