Abstract
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitudes in running time and achieves perfect accuracy on most datasets that we have tested.
Similar content being viewed by others
Notes
Alternatively, for each string we can store the first \(N - (\lceil ( N - K )/q \rceil - K) + 1\) q-grams in the hash table, and make queries with the first \(K+1\) chunks.
If we run the algorithm in [5] in our CPU computational environment, then its embedding step (only) is already 10–100\(\times \) slower than the entire running time of MinJoin++ on the datasets that we use in this paper.
See https://en.wikipedia.org/wiki/MurmurHash for MurmurHash3, and https://docs.rs/seahash/latest/seahash/ for SeaHash.
See the documentation from the project website of [21]: https://github.com/kedayuge/Embedjoin.
References
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)
Bocek, T., Hunt, E., Stiller, B., Hecht, F.: Fast similarity search in large dictionaries. University (2007)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: VLDB, pp. 426–435 (1997)
Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., Cheng, J., Sigir. J., Huang, X., Chang, Y., Cheng, X., Kamps, J., Murdock, V., Wen, J., Liu, Y. (eds.) ACM, pp. 599–608
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
Li, G., Deng, D., Wang, J., Feng, J.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)
Myers, G.: Efficient local alignment discovery amongst noisy long reads. In: Brown, D.G., Morgenstern, B. (eds.), Algorithms in Bioinformatics—14th International Workshop, WABI 2014, Wroclaw, Poland, September 8–10, 2014. Proceedings, vol. 8701 of Lecture Notes in Computer Science, pp. 52–67. Springer (2014)
Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD, pp. 1033–1044 (2011)
Roberts, R.J., Carneiro, M.O., Schatz, M.C.: The advantages of SMRT sequencing. Genome Biol. 14(6), 405 (2013)
Song, Y., Tang, H., Zhang, H., Zhang, Q.: Overlap detection on long, error-prone sequencing reads via smooth q-gram. Bioinformatics 36(19), 4838–4845 (2020)
Su, Z., Ahn, B.-R., Eom, K.-Y., Kang, M.-K., Kim, J.-P., Kim, M.-K.: Plagiarism detection using the levenshtein distance and smith-waterman algorithm. In: 2008 3rd International Conference on Innovative Computing Information and Control, pp. 569–569 (2008)
Ukkonen, E.: Algorithms for approximate string matching. Inf. Control 64(1–3), 100–118 (1985)
Wandelt, S., Deng, D., Gerdjikov, S., Mishra, S., Mitankin, P., Patil, M., Siragusa, E., Tiskin, A., Wang, W., Wang, J., Leser, U.: State-of-the-art in string similarity search and join. SIGMOD Record 43(1), 64–76 (2014)
Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)
Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD, pp. 85–96 (2012)
Wang, W., Qin, J., Xiao, C., Lin, X., Shen, H.T.: Vchunkjoin: an efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng. 25(8), 1916–1929 (2013)
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)
Zhang, H., Zhang, Q.: Embedjoin: efficient edit similarity joins via embeddings. In: KDD, pp. 585–594 (2017)
Zhang, H., Zhang, Q.: Minjoin: efficient edit similarity joins via local hash minima. In: KDD, pp. 1093–1103. ACM (2019)
Zini, M., Fabbri, M., Moneglia, M., Panunzi, A.: Plagiarism detection through multilevel text comparison. In: Proceedings of the Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, AXMEDIS 2006, Leeds, UK, December 13–15, 2006, pp. 181–185. IEEE Computer Society (2006)
Acknowledgements
N. Karpov, H. Zhang, and Q. Zhang are supported in part by NSF CCF-1844234.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Authors are ordered alphabetically.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Karpov, N., Zhang, H. & Zhang, Q. MinJoin++: a fast algorithm for string similarity joins under edit distance. The VLDB Journal 33, 281–299 (2024). https://doi.org/10.1007/s00778-023-00806-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-023-00806-z