Knowledge and Information Systems, Volume 46, Issue 3, pp 629–656

Extend tree edit distance for effective object identification

  • Yue Wang
  • Hongzhi Wang (email author)
  • Liyan Zhang
  • Yang Wang
  • Jianzhong Li
  • Hong Gao
Regular Paper

Abstract

Similarity join on XML documents, which are usually modeled as rooted ordered labeled trees, is widely applied because references to real-world objects are often ambiguous. The conventional approach is based on tree edit distance, which lacks flexibility and efficiency. In this paper, we propose two novel edit operations together with an extended tree edit distance, which achieves good performance in similarity matching over hierarchical data structures (the running time is \(O(n^{3})\) in the worst case). We then propose the \(k\)-generation set distance as a good approximation of the tree edit distance to further improve join efficiency, with quadratic time complexity. Experiments on real and synthetic databases demonstrate the benefit of our method in efficiency and scalability.
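
For orientation, the conventional tree edit distance that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the classic unit-cost distance between rooted ordered labeled trees (insert, delete, and relabel a node), written as a memoized forest recursion. It is not the extended distance or the \(k\)-generation set distance proposed in the paper, and it does not achieve the \(O(n^{3})\) bound. The tree encoding `(label, children)` and the names `forest_size`, `forest_dist`, and `tree_edit_distance` are assumptions made for this sketch.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple
# of trees; a forest is a tuple of trees. Tuples keep everything hashable.

def forest_size(forest):
    """Number of nodes in an ordered forest."""
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return forest_size(g)      # insert every node of g
    if not g:
        return forest_size(f)      # delete every node of f
    (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_dist(f[:-1] + kids_f, g) + 1
    # Insert the rightmost root of g.
    insert = forest_dist(f, g[:-1] + kids_g) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_dist(kids_f, kids_g)
             + forest_dist(f[:-1], g[:-1])
             + (label_f != label_g))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    """Edit distance between two rooted ordered labeled trees."""
    return forest_dist((t1,), (t2,))

# Two small XML-like trees that differ in one leaf label.
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()), ("d", ())))
print(tree_edit_distance(t1, t2))
```

The recursion is only practical for small trees. It is meant to show what the three conventional edit operations do, not to be an efficient algorithm.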

Keywords

XML data matching · New edit operations · Extended tree edit distance · \(k\)-Generation set distance · Cluster

Notes

Acknowledgments

This paper was partially supported by NGFR 973 Grant 2012CB316200, NSFC Grant 61472099 and National Sci-Tech Support Plan 2015BAH10F00.

Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  • Yue Wang (1)
  • Hongzhi Wang (1, email author)
  • Liyan Zhang (2)
  • Yang Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)

  1. The School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  2. Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA