Hash\(^{ed}\)-Join: Approximate String Similarity Join with Hashing

  • Peisen Yuan
  • Chaofeng Sha
  • Yi Sun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8505)

Abstract

The string similarity join, which finds similar string pairs across string sets, has received extensive attention in the database and information retrieval fields. Existing work usually adopts the filter-and-refine framework for this problem, and various filtering methods have been proposed. Recently, tree-based index techniques with an edit distance constraint have been employed effectively to evaluate the string similarity join. However, they do not scale well with large distance thresholds. In this paper, we propose an approach for approximate string similarity join based on Min-Hashing locality sensitive hashing and trie-based index techniques. Our approach allows a flexible trade-off between efficiency and performance. An empirical study on real datasets demonstrates that our framework is more efficient and scales better.
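
To make the filter-and-refine idea concrete, the following is a minimal Python sketch, not the authors' algorithm: it filters with Min-Hash signatures over q-grams grouped into LSH bands, then refines candidates with an exact edit distance check. It leaves out the trie-based index entirely. The function names and the parameters `q`, `bands`, and `rows` are assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def qgrams(s, q=2):
    """Return the set of overlapping q-grams of s (the string itself if shorter than q)."""
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """Min-Hash signature: for each seeded hash function, keep the minimum value."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def edit_distance(a, b):
    """Dynamic-programming edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def approx_similarity_join(strings, tau, q=2, bands=8, rows=2):
    """Return pairs (i, j), i < j, with edit distance <= tau among LSH candidates.

    Filter: strings sharing at least one signature band become candidates.
    Refine: candidates are verified with the exact edit distance.
    The result is approximate: true pairs that never collide are missed.
    """
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        sig = minhash_signature(qgrams(s, q), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(ids, 2))
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= tau)
```

In this sketch, raising `bands` while keeping `rows` fixed makes collisions more likely, so fewer true pairs are missed but more candidates must be verified. That is one simple way a hashing-based join can trade efficiency against result quality. Because every candidate is verified, the output never contains a pair whose edit distance exceeds `tau`.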

Keywords

Binary Vector, Active Node, Edit Distance, Jaccard Similarity, Locality Sensitive Hash
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

This work was supported by the 973 project (No. 2010CB328106) and NSFC grants (No. 61033007 and 61170085).

Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing, China
  2. School of Computer Science, Fudan University, Shanghai, China
