Abstract
XML is widely used as a standard for data representation and exchange among Web applications. In recent years, much effort has been devoted to querying, integrating, and clustering XML documents. Measuring the similarity among XML documents is the foundation of such applications. In this paper, we propose a new similarity measure for XML documents based on Merge-Edit-Distance (MED). MED preserves the distribution information of the common tree in XML document trees. We argue that this distribution information is useful for determining the similarity of XML documents. We also propose a novel algorithm to calculate MED. Given two XML document trees A and B, it compresses the two trees into one merge tree C and then transforms C into the common tree of A and B using defined operations such as “Delete”, “Reduce”, and “Combine”. The cost of the operation sequence is defined as MED. Experiments on real datasets provide evidence that the proposed similarity measure is effective.
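The merge-then-transform idea can be illustrated with a small sketch. This is not the MED algorithm from the paper: the abstract does not define the operations or their costs, so the sketch assumes unordered trees, siblings matched by label, and a unit cost per deleted node, and it models only a "Delete"-style step ("Reduce" and "Combine" are left out). The names `MergeNode`, `merge`, and `delete_cost` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MergeNode:
    """Node of the merge tree; `sources` records which input trees contain it."""
    label: str
    sources: set = field(default_factory=set)
    children: dict = field(default_factory=dict)  # child label -> MergeNode


def add_tree(node, tree, source):
    """Overlay `tree`, given as (label, [children]), onto `node`, tagging it with `source`."""
    _, kids = tree
    node.sources.add(source)
    for kid in kids:
        # Assumption: siblings with the same label are merged into one node.
        child = node.children.setdefault(kid[0], MergeNode(kid[0]))
        add_tree(child, kid, source)


def merge(a, b):
    """Compress trees a and b into one merge tree under an artificial root."""
    root = MergeNode("#root")
    for source, tree in (("A", a), ("B", b)):
        child = root.children.setdefault(tree[0], MergeNode(tree[0]))
        add_tree(child, tree, source)
    return root


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children.values())


def delete_cost(node):
    """Unit-cost count of nodes to delete so that only nodes shared by A and B remain."""
    cost = 0
    for child in node.children.values():
        if child.sources == {"A", "B"}:
            cost += delete_cost(child)  # shared node: keep it, inspect its children
        else:
            cost += size(child)  # subtree occurs in one document only: delete it all
    return cost


# Two toy document trees (hypothetical data).
a = ("book", [("title", []), ("author", [("name", [])]), ("year", [])])
b = ("book", [("title", []), ("author", [("name", []), ("email", [])]), ("price", [])])
cost = delete_cost(merge(a, b))
```

Here `cost` counts the three nodes that occur in only one of the two documents (`year`, `email`, `price`); identical documents give a cost of zero. A faithful MED implementation would additionally need the operation definitions and cost model given in the full paper.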
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhou, C., Lu, Y., Zou, L., Hu, R. (2007). Evaluate Structure Similarity in XML Documents with Merge-Edit-Distance. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_31
DOI: https://doi.org/10.1007/978-3-540-77018-3_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science, Computer Science (R0)