Landmark-Join: Hash-Join Based String Similarity Joins with Edit Distance Constraints

  • Kazuyo Narita
  • Shinji Nakadai
  • Takuya Araki
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7448)

Abstract

Parallel data processing complicates string similarity joins because it requires a well-designed data partitioning scheme. Moreover, efficient verification of string pairs is needed to speed up the entire join process. We propose a novel framework, Landmark-Join, that addresses these requirements through the use of edit distance constraints. The framework has two functions, each of which reduces a different search space. The first, q-bucket partitioning, reduces the number of verifications of dissimilar string pairs and lowers skewness among buckets. The second, local upper bound calculation, prunes the search space of the edit distance computation to speed up each verification. Experimental results show that Landmark-Join has good parallel scalability and that the two proposed functions speed up the entire string similarity join process.
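
To make the verification step concrete, here is a minimal Python sketch of checking one string pair against an edit distance threshold. It is not the paper's local upper bound calculation, whose details the abstract does not give. It uses the standard banded dynamic program with early termination, a common way to prune the edit distance search space. The function name `within_edit_distance` and the parameter `tau` (the threshold) are hypothetical names chosen for illustration.

```python
def within_edit_distance(s: str, t: str, tau: int) -> bool:
    """Return True if the edit distance between s and t is at most tau.

    Generic threshold-bounded verification (an assumption for illustration,
    not the Landmark-Join method): only cells with |i - j| <= tau are filled.
    """
    # Length filter: a length gap larger than tau already exceeds the threshold.
    if abs(len(s) - len(t)) > tau:
        return False
    if len(s) > len(t):
        s, t = t, s  # make s the shorter string
    n, m = len(s), len(t)
    inf = tau + 1  # any value above tau is treated as "too far"

    # Row 0: turning the empty prefix of s into t[:j] costs j insertions.
    prev = [j if j <= tau else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        if i <= tau:
            cur[0] = i  # i deletions
        lo, hi = max(1, i - tau), min(m, i + tau)
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitute
                         prev[j] + 1,         # delete from s
                         cur[j - 1] + 1,      # insert into s
                         inf)
        # Early termination: every cell in the band already exceeds tau.
        if min(cur[max(0, i - tau):hi + 1]) > tau:
            return False
        prev = cur
    return prev[m] <= tau
```

For example, `within_edit_distance("kitten", "sitting", 3)` accepts the pair, while the same call with `tau = 2` rejects it. In a join, a check like this would run only on the candidate pairs that survive partitioning, which is why reducing both the number of candidates and the cost per check matters.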

Keywords

Edit Distance; Input String; Edit Graph; Position List; Zipf Distribution
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Kazuyo Narita (1)
  • Shinji Nakadai (1)
  • Takuya Araki (1)
  1. Cloud System Research Laboratories, NEC Corporation, Kawasaki, Japan