Abstract
Tree-structured data are becoming ubiquitous nowadays and manipulating them based on similarity is essential for many applications. Although similarity search on textual data has been extensively studied, searching for similar trees is still an open problem due to the high complexity of computing the similarity between trees, especially for large numbers of tress. In this paper, we propose to transform tree-structured data into strings with a one-to-one mapping. We prove that the edit distance of the corresponding strings forms a bound for the similarity measures between trees, including tree edit distance, largest common subtrees and smallest common super-trees. Based on the theoretical analysis, we can employ any existing algorithm of approximate string search for effective similarity search on trees. Moreover, we embed the bound into a filter-and-refine framework for facilitating similarity search on tree-structured data. The experimental results show that our algorithm achieves high performance and outperforms state-of-the-art methods significantly. Our method is especially suitable for accelerating similarity query processing on large numbers of trees in massive datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB (2006)
Augsten, N., Bohlen, M., Gamper, J.: Approximate matching of hierarchical data using pq-grams. In: VLDB (2005)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW (2007)
Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1-3), 217–239 (2005)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD (2003)
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)
Gionis, A., Gunopulos, D., Koudas, N.: Efficient and tunable similar set retrieval. In: SIGMOD (2001)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate xml joins. In: SIGMOD (2002)
Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: ICDE (2008)
Kahveci, T., Singh, A.K.: Efficient index structures for string databases. In: VLDB (2001)
Kailing, K., Kriegel, H.-P., Schonauer, S., Seidl, T.: Efficient similarity search for hierarchical data in large databases. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 676–693. Springer, Heidelberg (2004)
Kim, M.-S., Whang, K.-Y., Lee, J.-G., Lee, M.-J.: n-gram/2l: A space and time efficient two-level n-gram inverted index structure. In: VLDB (2005)
Klein, P.: Computing the edit-distance between unrooted ordered trees. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461. Springer, Heidelberg (1998)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE (2008)
Li, C., Wang, B., Yang, X.: Vgram: Improving performance of approximate queries on string collections using variable-length grams. In: VLDB (2007)
Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys, 31–88 (2001)
Prufer, H.: Neuer beweis eines satzes uber permutationen. Archiv fur Mathematik und Physik 27, 142–144 (1918)
Sahinalp, S.C., Tasan, M., Macker, J., Ozsoyoglu, Z.M.: Distance based indexing for string proximity search. In: ICDE (2003)
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: SIGMOD (2004)
Seidl, T., Kriegel, H.-P.: Optimal multi-step k-nearest neighbor search. In: SIGMOD (1998)
Tai, K.-C.: The tree-to-tree correction problem. Journal of the Association for Computing Machinery (JACM) 26, 422–433 (1979)
Ukkonen, E.: Approximate string matching with q-grams and maximal matches. Theor. Comput. Sci. 92(1), 191–211 (1992)
Yang, R., Kalnis, P., Tung, A.K.H.: Similarity evaluation on tree-structured data. In: SIGMOD (2005)
Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal of Computing 18, 1245–1262 (1989)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, G., Liu, X., Feng, J., Zhou, L. (2008). Efficient Similarity Search for Tree-Structured Data. In: Ludäscher, B., Mamoulis, N. (eds) Scientific and Statistical Database Management. SSDBM 2008. Lecture Notes in Computer Science, vol 5069. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69497-7_11
Download citation
DOI: https://doi.org/10.1007/978-3-540-69497-7_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69476-2
Online ISBN: 978-3-540-69497-7
eBook Packages: Computer ScienceComputer Science (R0)