Abstract
In the past few years, XML has become established as an effective means of information management and has been widely used for complex data representation. Owing to the rapidly growing use of the XML standard, developing efficient techniques for comparing XML-based documents has become essential in information retrieval (IR) research. Various algorithms for comparing hierarchically structured data, e.g. XML documents, have been proposed in the literature. However, to our knowledge, most of them focus exclusively on comparing documents based on structural features, overlooking the semantics involved. In this paper, we integrate IR semantic similarity assessment into an edit distance algorithm, seeking to improve similarity judgments when comparing XML-based documents. Our approach consists of an original edit distance operation cost model that introduces the semantic relatedness of XML element/attribute labels into traditional edit distance computations. A prototype has been developed to evaluate our model’s performance. Experiments yielded notable results.
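To make the idea of a semantically weighted cost model concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a flat sequence of element labels rather than an XML tree, uses the classic dynamic-programming string edit distance, and charges an update operation 1 minus a relatedness score. The `RELATEDNESS` table, its values, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical relatedness scores in [0, 1]; the values are made up for
# illustration and do not come from the paper or from any real taxonomy.
RELATEDNESS = {
    frozenset(("author", "writer")): 0.9,
    frozenset(("book", "novel")): 0.8,
}


def relatedness(a: str, b: str) -> float:
    """Return 1.0 for identical labels, a table score if known, else 0.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def semantic_edit_distance(
    src: Sequence[str],
    dst: Sequence[str],
    sim: Callable[[str, str], float] = relatedness,
) -> float:
    """Edit distance over label sequences with a relatedness-based update cost.

    Insertions and deletions cost 1.0 (an assumption); updating one label
    into another costs 1.0 - sim(old, new), so related labels are cheap.
    """
    rows, cols = len(src) + 1, len(dst) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete the first i source labels
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert the first j destination labels
    for i in range(1, rows):
        for j in range(1, cols):
            update = 1.0 - sim(src[i - 1], dst[j - 1])
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,         # deletion
                dist[i][j - 1] + 1.0,         # insertion
                dist[i - 1][j - 1] + update,  # update (free if labels match)
            )
    return dist[-1][-1]
```

Passing a `sim` function that returns 1.0 for identical labels and 0.0 otherwise reduces the sketch to the ordinary unit-cost edit distance, which makes the effect of the semantic term easy to compare on the same pair of label sequences.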
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Tekli, J., Chbeir, R., Yetongnon, K. (2007). A Hybrid Approach for XML Similarity. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds) SOFSEM 2007: Theory and Practice of Computer Science. SOFSEM 2007. Lecture Notes in Computer Science, vol 4362. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69507-3_68
DOI: https://doi.org/10.1007/978-3-540-69507-3_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69506-6
Online ISBN: 978-3-540-69507-3
eBook Packages: Computer Science, Computer Science (R0)