Abstract
A string similarity join finds similar string pairs between two sets of strings and arises in many applications, such as duplicate detection, data integration, and data cleaning. Various algorithms have been proposed to make it efficient. Partition-based filtering methods, such as Pass-JOIN, are promising: they process strings in order of increasing length, quickly screen for candidate similar pairs by searching for the partitioned segments of one string within another, and then verify each candidate using edit distance. We observe that filtering in different directions produces different candidate sets, which motivates a bi-directional filtering mechanism. This paper proposes a novel bi-directional filtering mechanism that strengthens filtering by pipelining the results of forward filtering into backward filtering. The substring selection method of Pass-JOIN is adapted for the backward filtering. Experimental results show that the proposed bi-directional filtering algorithm outperforms the original algorithm on real-world datasets.
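The filter-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes an even partition into tau + 1 segments, uses plain substring containment in place of Pass-JOIN's position-aware substring selection, and replaces the inverted index with a nested loop. The names `even_partition`, `shares_segment`, and `bidirectional_join` are hypothetical. The filter rests on the pigeonhole principle: if two strings are within edit distance tau, at least one of the tau + 1 segments of either string must appear unchanged in the other.

```python
def even_partition(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    n = tau + 1
    base, extra = divmod(len(s), n)
    segments, pos = [], 0
    for i in range(n):
        # The last `extra` segments are one character longer.
        length = base + (1 if i >= n - extra else 0)
        segments.append(s[pos:pos + length])
        pos += length
    return segments


def shares_segment(partitioned, other, tau):
    """True if some segment of `partitioned` occurs as a substring of `other`."""
    return any(seg in other for seg in even_partition(partitioned, tau))


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def bidirectional_join(R, S, tau):
    """Return pairs (r, s) with edit distance <= tau (naive nested loop)."""
    results = []
    for r in R:
        for s in S:
            if abs(len(r) - len(s)) > tau:
                continue  # length filter
            if not shares_segment(r, s, tau):
                continue  # forward filter: segments of r searched in s
            if not shares_segment(s, r, tau):
                continue  # backward filter: segments of s searched in r
            if edit_distance(r, s) <= tau:
                results.append((r, s))
    return results


pairs = bidirectional_join(["vldb", "sigmod"], ["pvldb", "sigmmod", "icde"], tau=1)
```

For tau = 1, the pair ("abcd", "cabd") passes the forward filter (the segment "ab" occurs in "cabd") but fails the backward filter (neither "ca" nor "bd" occurs in "abcd"), so the second direction prunes a candidate that would otherwise reach verification.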
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Huang, Y., Niu, B., Song, C. (2015). A Partition-Based Bi-directional Filtering Method for String Similarity JOINs. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_32
DOI: https://doi.org/10.1007/978-3-319-21042-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer Science, Computer Science (R0)