Abstract
Several methods exist for measuring the structural similarity among XML documents, and the data mining approach is a novel and promising one. Because ignoring hierarchical information when encoding paths for mining leads to deficiencies, we propose a new sequential pattern mining scheme for XML document similarity computation. It makes use of the hierarchical information to compute the structural similarity of documents. In addition, it includes a post-processing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to quantify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and are reported.
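To illustrate why hierarchical information matters when encoding XML structure, the following is a minimal sketch and not the mining scheme proposed in the paper. It assumes a simple stand-in encoding in which each element is tagged with its depth, and it uses a plain Jaccard overlap in place of sequential pattern mining. The function names `encode_elements` and `jaccard_similarity` and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode_elements(xml_text, with_levels=True):
    """Return the set of elements found in an XML document.

    With with_levels=True each element is encoded as (depth, tag),
    where the root has depth 0. With with_levels=False only the tag
    is kept, so the hierarchy is ignored. This encoding is an
    assumption made for illustration only.
    """
    root = ET.fromstring(xml_text)
    items = set()

    def walk(node, depth):
        items.add((depth, node.tag) if with_levels else node.tag)
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return items


def jaccard_similarity(xml_a, xml_b, with_levels=True):
    """Jaccard overlap of the two encoded element sets."""
    a = encode_elements(xml_a, with_levels)
    b = encode_elements(xml_b, with_levels)
    return len(a & b) / len(a | b)


# Two hypothetical documents with the same tags but different nesting.
DOC_A = "<book><title/><author><name/></author></book>"
DOC_B = "<book><author/><name><title/></name></book>"

flat_score = jaccard_similarity(DOC_A, DOC_B, with_levels=False)
level_score = jaccard_similarity(DOC_A, DOC_B, with_levels=True)
```

In this sketch the flat encoding cannot tell the two documents apart, whereas the level-tagged encoding gives a lower score because only `book` at depth 0 and `author` at depth 1 coincide.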
This work was supported by the research project H-ZJ84. The full version of this paper can be found at http://www.comp.polyu.edu.hk/~cskchung/Paper/PAKDD03.pdf
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Leung, Hp., Chung, Fl., Chan, S.Cf. (2003). A New Sequential Mining Approach to XML Document Similarity Computation. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_35
DOI: https://doi.org/10.1007/3-540-36175-8_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive