A New Sequential Mining Approach to XML Document Similarity Computation

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Abstract

Several methods exist for measuring the structural similarity among XML documents. The data mining approach appears to be a novel, interesting, and promising one. In view of the deficiencies that arise when hierarchical information is ignored in encoding the paths for mining, we propose a new sequential pattern mining scheme for XML document similarity computation. It makes use of the hierarchical information to compute the structural similarity of documents. In addition, it includes a post-processing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to qualify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and are reported.
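
The abstract does not give the encoding or the mining algorithm, so the following is only an illustrative sketch of why hierarchical information can matter when paths are treated as sequences. It assumes a hypothetical encoding in which each element on a root-to-leaf path is tagged with its depth, and a hypothetical similarity score defined as the averaged fraction of one document's paths that occur as subsequences of the other's. The names `leaf_paths`, `is_subsequence`, and `similarity` are invented for this example and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text, with_levels=True):
    """Collect root-to-leaf paths as tuples of tags or (depth, tag) pairs."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, depth, prefix):
        # Assumed encoding: attach the depth to the tag when with_levels is set.
        item = (depth, node.tag) if with_levels else node.tag
        current = prefix + (item,)
        children = list(node)
        if not children:
            paths.add(current)
        for child in children:
            walk(child, depth + 1, current)

    walk(root, 0, ())
    return paths


def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def similarity(xml_a, xml_b, with_levels=True):
    """Hypothetical score: mean fraction of paths matched in each direction."""
    paths_a = leaf_paths(xml_a, with_levels)
    paths_b = leaf_paths(xml_b, with_levels)

    def coverage(src, dst):
        hits = sum(1 for p in src if any(is_subsequence(p, q) for q in dst))
        return hits / len(src)

    return (coverage(paths_a, paths_b) + coverage(paths_b, paths_a)) / 2


doc_a = "<a><b><c/></b></a>"
doc_b = "<a><c/></a>"
flat = similarity(doc_a, doc_b, with_levels=False)
levelled = similarity(doc_a, doc_b, with_levels=True)
print(flat, levelled)
```

In this toy case the flat encoding treats the path a/c as a subsequence of a/b/c and reports a partial match, while the depth-tagged encoding does not, because c sits at a different level in the two documents. The paper's actual scheme, including its post-processing step for unmatched elements, may differ substantially from this sketch.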

This work was supported by the research project H-ZJ84. The full version of this paper can be found at http://www.comp.polyu.edu.hk/~cskchung/Paper/PAKDD03.pdf




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Leung, Hp., Chung, Fl., Chan, S.Cf. (2003). A New Sequential Mining Approach to XML Document Similarity Computation. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_35

  • DOI: https://doi.org/10.1007/3-540-36175-8_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-04760-5

  • Online ISBN: 978-3-540-36175-6

  • eBook Packages: Springer Book Archive
