A New Sequential Mining Approach to XML Document Similarity Computation

Leung, Ho-pong; Chung, Fu-lai; Chan, Stephen Chi-fai

doi:10.1007/3-540-36175-8_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

There exist several methods for measuring the structural similarity among XML documents. The data mining approach seems to be a novel, interesting and promising one. In view of the deficiencies that arise when hierarchical information is ignored in encoding the paths for mining, we propose a new sequential pattern mining scheme for XML document similarity computation. It makes use of the hierarchical information to compute the structural similarity of documents. In addition, it includes a post-processing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to qualify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and are reported.

This work was supported by the research project H-ZJ84. The full version of this paper can be found at http://www.comp.polyu.edu.hk/~cskchung/Paper/PAKDD03.pdf
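To make the role of hierarchical information in path encoding concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each element on a root-to-leaf path is encoded together with its depth, that every contiguous sub-path stands in for a mined sequential pattern, and that similarity is the share of patterns common to both documents. The function names `leveled_paths`, `patterns`, and `structural_similarity` are hypothetical, and the post-processing step for unmatched elements is not modeled.

```python
import xml.etree.ElementTree as ET


def leveled_paths(xml_text):
    """Return root-to-leaf paths as tuples of (depth, tag) pairs.

    Pairing each tag with its depth is an assumed encoding that keeps
    the hierarchical position of every element on the path.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        current = prefix + ((len(prefix), node.tag),)
        children = list(node)
        if not children:
            paths.append(current)  # leaf reached: record the full path
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def patterns(paths):
    """Collect every contiguous sub-path of every path.

    This is a simplified stand-in for sequential pattern mining,
    used here only for illustration.
    """
    found = set()
    for path in paths:
        for start in range(len(path)):
            for end in range(start + 1, len(path) + 1):
                found.add(path[start:end])
    return found


def structural_similarity(xml_a, xml_b):
    """Share of patterns common to both documents (Jaccard ratio)."""
    a = patterns(leveled_paths(xml_a))
    b = patterns(leveled_paths(xml_b))
    return len(a & b) / len(a | b)


doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><title/><author><email/></author></book>"
doc_c = "<library><book><title/></book></library>"

print(structural_similarity(doc_a, doc_b))  # shares the book/title and book/author structure
print(structural_similarity(doc_a, doc_c))  # same tags, but at different levels
```

In this sketch `doc_a` and `doc_c` score zero even though both contain a `book` element with a `title` child, because the tags sit at different depths. An encoding that dropped the depth would treat those sub-paths as identical, which is the kind of deficiency the abstract attributes to ignoring hierarchical information.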

Author information

Authors and Affiliations

Department of Computing, Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong
Ho-pong Leung, Fu-lai Chung & Stephen Chi-fai Chan

Editor information

Editors and Affiliations

Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
Kyu-Young Whang
Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
Jongwoo Jeon
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim
Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
Jaideep Srivastava

About this paper

Cite this paper

Leung, Hp., Chung, Fl., Chan, S.Cf. (2003). A New Sequential Mining Approach to XML Document Similarity Computation. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_35

DOI: https://doi.org/10.1007/3-540-36175-8_35
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive
