Abstract
Parallel data processing complicates string similarity joins because it requires a well-designed data partitioning scheme. Moreover, efficient verification of string pairs is needed to speed up the entire join process. We propose a novel framework, Landmark-Join, that addresses these requirements through the use of edit distance constraints. Landmark-Join has two functions, each of which reduces a different search space. The first, q-bucket partitioning, reduces the number of verifications of dissimilar string pairs and lowers skew among buckets. The second, local upper bound calculation, prunes the search space of the edit distance computation to speed up each verification. Experimental results show that Landmark-Join scales well in parallel and that both proposed functions speed up the entire string similarity join process.
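To make the verification step concrete, the following is a minimal sketch of checking a string pair against an edit distance threshold. It is not the paper's local upper bound calculation, whose details the abstract does not give. It uses the standard banded dynamic program with early termination instead. The names `within_edit_distance` and `tau` are assumptions introduced for illustration.

```python
def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """Return True if the edit distance between a and b is at most tau.

    Illustrative only: a textbook banded DP, not the method of the paper.
    """
    # Length filter: the edit distance is at least the length difference.
    if abs(len(a) - len(b)) > tau:
        return False
    n, m = len(a), len(b)
    inf = tau + 1  # any value above tau means "too far"; cap all cells here
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = [j if j <= tau else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        # Only cells with |i - j| <= tau can hold a value <= tau.
        lo = max(1, i - tau)
        hi = min(m, i + tau)
        cur = [inf] * (m + 1)
        if i <= tau:
            cur[0] = i  # i deletions
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from a
                cur[j - 1] + 1,      # insertion into a
                inf,
            )
        # Early termination: every cell in the band already exceeds tau.
        if min(cur[lo - 1:hi + 1]) > tau:
            return False
        prev = cur
    return prev[m] <= tau


if __name__ == "__main__":
    print(within_edit_distance("kitten", "sitting", 3))
    print(within_edit_distance("kitten", "sitting", 2))
```

Restricting the computation to a band of width 2·tau + 1 around the diagonal keeps each verification at O(tau · min(|a|, |b|)) cell updates rather than filling the full table, which is the general kind of search-space pruning the abstract refers to.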
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Narita, K., Nakadai, S., Araki, T. (2012). Landmark-Join: Hash-Join Based String Similarity Joins with Edit Distance Constraints. In: Cuzzocrea, A., Dayal, U. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2012. Lecture Notes in Computer Science, vol 7448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32584-7_15
DOI: https://doi.org/10.1007/978-3-642-32584-7_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32583-0
Online ISBN: 978-3-642-32584-7
eBook Packages: Computer Science, Computer Science (R0)