Skip to main content

Efficient Similarity Search for Tree-Structured Data

  • Conference paper
Scientific and Statistical Database Management (SSDBM 2008)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5069))

Abstract

Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. Although similarity search on textual data has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the similarity between trees, especially for large numbers of tress. In this paper, we propose to transform tree-structured data into strings with a one-to-one mapping. We prove that the edit distance of the corresponding strings forms a bound for the similarity measures between trees, including tree edit distance, largest common subtrees and smallest common super-trees. Based on the theoretical analysis, we can employ any existing algorithm of approximate string search for effective similarity search on trees. Moreover, we embed the bound into a filter-and-refine framework for facilitating similarity search on tree-structured data. The experimental results show that our algorithm achieves high performance and outperforms state-of-the-art methods significantly. Our method is especially suitable for accelerating similarity query processing on large numbers of trees in massive datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB (2006)

    Google Scholar 

  2. Augsten, N., Bohlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: VLDB (2005)

    Google Scholar 

  3. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW (2007)

    Google Scholar 

  4. Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1-3), 217–239 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  5. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)

    Google Scholar 

  6. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)

    Google Scholar 

  7. Gionis, A., Gunopulos, D., Koudas, N.: Efficient and tunable similar set retrieval. In: SIGMOD (2001)

    Google Scholar 

  8. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)

    Google Scholar 

  9. Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate xml joins. In: SIGMOD (2002)

    Google Scholar 

  10. Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: ICDE (2008)

    Google Scholar 

  11. Kahveci, T., Singh, A.K.: Efficient index structures for string databases. In: VLDB (2001)

    Google Scholar 

  12. Kailing, K., Kriegel, H.-P., Schonauer, S., Seidl, T.: Efficient similarity search for hierarchical data in large databases. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 676–693. Springer, Heidelberg (2004)

    Google Scholar 

  13. Kim, M.-S., Whang, K.-Y., Lee, J.-G., Lee, M.-J.: n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In: VLDB (2005)

    Google Scholar 

  14. Klein, P.: Computing the edit-distance between unrooted ordered trees. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461. Springer, Heidelberg (1998)

    Chapter  Google Scholar 

  15. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE (2008)

    Google Scholar 

  16. Li, C., Wang, B., Yang, X.: Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: VLDB (2007)

    Google Scholar 

  17. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys, 31–88 (2001)

    Google Scholar 

  18. Prufer, H.: Neuer beweis eines satzes uber permutationen. Archiv fur Mathematik und Physik 27, 142–144 (1918)

    Google Scholar 

  19. Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: ICDE (2003)

    Google Scholar 

  20. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD (2004)

    Google Scholar 

  21. Seidl, T., Kriegel, H.-P.: Optimal multi-step k-nearest neighbor search. In: SIGMOD (1998)

    Google Scholar 

  22. Tai, K.-C.: The tree-to-tree correction problem. Journal of the Association for Computing Machinery (JACM) 26, 422–433 (1979)

    MATH  MathSciNet  Google Scholar 

  23. Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. 92(1), 191–211 (1992)

    Article  MATH  MathSciNet  Google Scholar 

  24. Yang, R., Kalnis, P., Tung, A.K.H.: Similarity evaluation on tree-structured data. In: SIGMOD (2005)

    Google Scholar 

  25. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal of Computing 18, 1245–1262 (1989)

    Article  MATH  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Bertram Ludäscher Nikos Mamoulis

Rights and permissions

Reprints and permissions

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, G., Liu, X., Feng, J., Zhou, L. (2008). Efficient Similarity Search for Tree-Structured Data. In: Ludäscher, B., Mamoulis, N. (eds) Scientific and Statistical Database Management. SSDBM 2008. Lecture Notes in Computer Science, vol 5069. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69497-7_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-69497-7_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69476-2

  • Online ISBN: 978-3-540-69497-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics