Extend tree edit distance for effective object identification

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Similarity joins on XML documents, which are usually modeled as rooted ordered labeled trees, are widely applied because references to real-world objects are often ambiguous. The conventional approach is based on tree edit distance, which lacks flexibility and efficiency. In this paper, we propose two novel edit operations together with an extended tree edit distance, which achieves good performance in similarity matching over hierarchical data structures (the running time is \(O(n^{3})\) in the worst case). We then propose the \(k\)-generation set distance as a good approximation of the tree edit distance, which further improves join efficiency with quadratic time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.
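
For reference, the conventional tree edit distance mentioned in the abstract can be sketched as follows. This is the textbook unit-cost recursion over rightmost roots for rooted ordered labeled trees. It is not the extended edit operations or the \(k\)-generation set distance proposed in the paper, and it is not tuned to any particular complexity bound. The tuple representation and the names `size`, `forest_distance`, and `tree_edit_distance` are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children is a
# tuple of trees; a forest is a tuple of trees. Tuples are hashable, so
# intermediate results can be memoized.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)          # insert every remaining node of f2
    if not f2:
        return size(f1)          # delete every remaining node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Map the two rightmost roots onto each other (rename if labels differ).
    rename = (forest_distance(kids1, kids2)
              + forest_distance(f1[:-1], f2[:-1])
              + (label1 != label2))
    return min(delete, insert, rename)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Hypothetical XML-like records describing the same real-world object.
t1 = ("book", (("title", ()), ("author", ())))
t2 = ("book", (("title", ()), ("writer", ()), ("year", ())))
print(tree_edit_distance(t1, t2))  # one rename plus one insertion
```

In a similarity join, two documents would be reported as a match when this distance falls below a chosen threshold; the cost of evaluating it for every pair is what motivates cheaper approximations.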

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099 and National Sci-Tech Support Plan 2015BAH10F00.

Author information

Corresponding author

Correspondence to Hongzhi Wang.

About this article

Cite this article

Wang, Y., Wang, H., Zhang, L. et al. Extend tree edit distance for effective object identification. Knowl Inf Syst 46, 629–656 (2016). https://doi.org/10.1007/s10115-014-0816-1

  • DOI: https://doi.org/10.1007/s10115-014-0816-1
