Evaluate Structure Similarity in XML Documents with Merge-Edit-Distance

Zhou, Chong; Lu, Yansheng; Zou, Lei; Hu, Rong

doi:10.1007/978-3-540-77018-3_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4819)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

XML is widely used as a standard for data representation and exchange among Web applications. In recent years, much effort has been devoted to querying, integrating, and clustering XML documents. Measuring the similarity among XML documents is the foundation of such applications. In this paper, we propose a new similarity measure for XML documents based on Merge-Edit-Distance (MED). MED preserves the distribution information of the common tree in XML document trees, and we argue that this distribution information is useful for determining the similarity of XML documents. We also propose a novel algorithm to calculate MED. Given two XML document trees A and B, it compresses the two trees into one merge tree C and then transforms C into the common tree of A and B using defined operations such as “Delete”, “Reduce”, and “Combine”. The cost of the operation sequence is defined as the MED. Experiments on real datasets provide evidence that the proposed similarity measure is effective.
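
The merge-then-transform idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not define "Reduce" or "Combine", so the sketch below covers only a unit-cost "Delete" step, and it assumes that sibling elements with the same tag can be collapsed into one node. The names MergeNode, add_tree, merge, and delete_cost are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class MergeNode:
    label: str
    sources: set = field(default_factory=set)     # documents that contain this node
    children: dict = field(default_factory=dict)  # child tag -> MergeNode


def add_tree(merged, element, source):
    """Overlay one XML element onto the merge tree, tagging nodes with `source`."""
    merged.sources.add(source)
    for child in element:
        # Assumption: same-tag siblings collapse into a single merged node.
        node = merged.children.setdefault(child.tag, MergeNode(child.tag))
        add_tree(node, child, source)


def merge(xml_a, xml_b):
    """Build one merge tree from two XML strings under a virtual root."""
    virtual = MergeNode("#root")
    for source, text in (("A", xml_a), ("B", xml_b)):
        virtual.sources.add(source)
        root = ET.fromstring(text)
        node = virtual.children.setdefault(root.tag, MergeNode(root.tag))
        add_tree(node, root, source)
    return virtual


def delete_cost(node):
    """Count nodes present in only one document (unit-cost Delete only)."""
    own = 0 if node.sources == {"A", "B"} else 1
    return own + sum(delete_cost(c) for c in node.children.values())


doc_a = "<book><title/><author/><year/></book>"
doc_b = "<book><title/><author/><publisher><city/></publisher></book>"
print(delete_cost(merge(doc_a, doc_b)))
```

In this toy example the nodes left after deletion (book, title, author) form the common tree of the two documents. A faithful MED implementation would also need the "Reduce" and "Combine" operations and the cost model defined in the full paper.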

Author information

Authors and Affiliations

College of Computer Science and Technology, HuaZhong University of Science and Technology, 1037 Luoyu Road, Wuhan, P.R. China
Chong Zhou, Yansheng Lu, Lei Zou & Rong Hu

Editor information

Takashi Washio, Zhi-Hua Zhou, Joshua Zhexue Huang, Xiaohua Hu, Jinyan Li, Chao Xie, Jieyue He, Deqing Zou, Kuan-Ching Li, Mário M. Freire

About this paper

Cite this paper

Zhou, C., Lu, Y., Zou, L., Hu, R. (2007). Evaluate Structure Similarity in XML Documents with Merge-Edit-Distance. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_31

DOI: https://doi.org/10.1007/978-3-540-77018-3_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science (R0)