The VLDB Journal, Volume 26, Issue 2, pp 249–274

A unified framework for string similarity search with edit-distance constraint

  • Minghe Yu
  • Jin Wang
  • Guoliang Li
  • Yong Zhang
  • Dong Deng
  • Jianhua Feng
Regular Paper

Abstract

String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-\(k\) string similarity search. Existing algorithms are efficient for either the former or the latter, but most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (\({\textsf {HS}}{\text {-}}{\textsf {Tree}}\)) on top of the segments. We then use the \({\textsf {HS}}{\text {-}}{\textsf {Tree}}\) to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-\(k\) search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve performance. To handle large data sets, we extend our techniques to the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on both problems and outperforms state-of-the-art algorithms by 5–10 times.
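
To make the two query variants concrete, the following is a minimal Python sketch of threshold-based and top-\(k\) search under edit distance. It is not HS-Search or HS-Topk and builds no HS-Tree. It is a brute-force baseline combined with the general pigeonhole filter for disjoint segments: if a string is cut into \(\tau + 1\) disjoint segments and its edit distance to the query is at most \(\tau\), at least one segment must occur unchanged in the query. All function names (`edit_distance`, `even_segments`, `may_match`, `threshold_search`, `topk_search`) and the even-split partitioning are assumptions made for illustration.

```python
import heapq


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def even_segments(s, parts):
    """Split s into `parts` disjoint, contiguous segments of near-equal length
    (an illustrative partitioning choice, not taken from the paper)."""
    base, extra = divmod(len(s), parts)
    segs, pos = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        segs.append(s[pos:pos + size])
        pos += size
    return segs


def may_match(s, query, tau):
    """Pigeonhole filter: returns False only if ed(s, query) > tau is certain."""
    if abs(len(s) - len(query)) > tau:
        return False   # length difference alone already exceeds tau
    if len(s) <= tau:
        return True    # too short for tau + 1 non-empty segments; keep it
    return any(seg in query for seg in even_segments(s, tau + 1))


def threshold_search(strings, query, tau):
    """Return (string, distance) for every string within distance tau."""
    results = []
    for s in strings:
        if may_match(s, query, tau):
            d = edit_distance(s, query)
            if d <= tau:
                results.append((s, d))
    return results


def topk_search(strings, query, k):
    """Return the k (distance, string) pairs with the smallest distance,
    computed by brute force; ties are broken by string order."""
    return heapq.nsmallest(k, ((edit_distance(s, query), s) for s in strings))


if __name__ == "__main__":
    data = ["vldb", "pvldb", "sigmod", "icde", "vldbj"]
    print(threshold_search(data, "vldb", 1))
    print(topk_search(data, "vldb", 2))
```

The filter only discards strings that cannot satisfy the threshold, so the threshold result is exact. The top-\(k\) function verifies every string, which is the cost that an upper-bound estimate of the kind the abstract describes is meant to avoid.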

Keywords

Similarity search · Edit distance · Top-k · Disk-based method · Partition

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Minghe Yu (1)
  • Jin Wang (1)
  • Guoliang Li (1)
  • Yong Zhang (1)
  • Dong Deng (1)
  • Jianhua Feng (1)

  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
