MinJoin++: a fast algorithm for string similarity joins under edit distance

Karpov, Nikolai; Zhang, Haoyu; Zhang, Qin

doi:10.1007/s00778-023-00806-z

Regular Paper
Published: 21 August 2023

Volume 33, pages 281–299 (2024)
The VLDB Journal

Abstract

We study the problem of computing similarity joins under edit distance on a set of strings. The edit similarity join is a fundamental problem in databases, data mining and bioinformatics, with many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested.
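To make the problem statement concrete, the following is a minimal sketch of a brute-force edit similarity join with a thresholded verification step. It is not the MinJoin++ algorithm: it compares all pairs and is quadratic in the number of strings. The function names `edit_distance_at_most` and `naive_edit_join` and the threshold parameter `k` are illustrative assumptions, and the banded dynamic program is the standard textbook technique for checking whether two strings are within edit distance `k`.

```python
from itertools import combinations


def edit_distance_at_most(s: str, t: str, k: int) -> bool:
    """Return True iff the edit distance between s and t is at most k.

    Only cells with |i - j| <= k are filled in, so the cost is
    O(k * max(len(s), len(t))) rather than O(len(s) * len(t)).
    """
    if abs(len(s) - len(t)) > k:
        return False  # length filter: at least this many insertions are needed
    inf = k + 1  # any value above k means "too far"
    prev = [j if j <= k else inf for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [inf] * (len(t) + 1)
        if i <= k:
            cur[0] = i  # delete the first i characters of s
        lo, hi = max(1, i - k), min(len(t), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost,  # match or substitution
                         inf)
        if min(cur) > k:
            return False  # every cell in the band already exceeds k
        prev = cur
    return prev[len(t)] <= k


def naive_edit_join(strings: list[str], k: int) -> list[tuple[int, int]]:
    """All index pairs (i, j), i < j, whose strings are within edit distance k."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(strings), 2)
            if edit_distance_at_most(a, b, k)]
```

A join algorithm of the kind described in the abstract would aim to run a verification like this only on a small set of candidate pairs instead of on every pair.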

Notes

Alternatively, for each string we can store the first \(N - (\lceil ( N - K )/q \rceil - K) + 1\) q-grams in the hash table, and make queries with the first \(K+1\) chunks.
If we run the algorithm in [5] in our CPU computational environment, then its embedding step (only) is already 10–100\(\times \) slower than the entire running time of MinJoin++ on the datasets that we use in this paper.
https://en.wikipedia.org/wiki/Rolling_hash.
See https://en.wikipedia.org/wiki/MurmurHash for MurmurHash3, and https://docs.rs/seahash/latest/seahash/ for SeaHash.
See the documentation from the project website of [21]: https://github.com/kedayuge/Embedjoin.
http://www.uniprot.org/.
https://www.personalgenomes.org/us.
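The notes above refer to hashing q-grams and to rolling hashes. As a rough illustration only, the sketch below computes a polynomial rolling hash for every q-gram of a string in time linear in the string length. The function name `rolling_qgram_hashes` and the choices of `base` and `mod` are assumptions made for this example; they are not taken from the paper, and MurmurHash3 and SeaHash, also mentioned in the notes, are different hash functions.

```python
def rolling_qgram_hashes(s: str, q: int,
                         base: int = 257,
                         mod: int = (1 << 61) - 1) -> list[int]:
    """Polynomial hash of every length-q substring (q-gram) of s.

    Each hash is derived from the previous one in O(1): remove the
    contribution of the outgoing character, shift, add the incoming one.
    """
    if q <= 0 or len(s) < q:
        return []
    h = 0
    for ch in s[:q]:
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    top = pow(base, q - 1, mod)  # weight of the leading character
    for i in range(q, len(s)):
        h = (h - ord(s[i - q]) * top) % mod  # drop the outgoing character
        h = (h * base + ord(s[i])) % mod     # append the incoming character
        hashes.append(h)
    return hashes
```

For example, `rolling_qgram_hashes("abcabc", 3)` returns four values, and the first and last are equal because both q-grams are `"abc"`.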

References

Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)
Bocek, T., Hunt, E., Stiller, B., Hecht, F.: Fast similarity search in large dictionaries. University (2007)
Ciaccia, P., Patella, M., Zezula, P.: M-tree: an efficient access method for similarity search in metric spaces. In: VLDB, pp. 426–435 (1997)
Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., Cheng, J., Sigir. J., Huang, X., Chang, Y., Cheng, X., Kamps, J., Murdock, V., Wen, J., Liu, Y. (eds.) ACM, pp. 599–608
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. PVLDB 7(8), 625–636 (2014)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
Li, G., Deng, D., Wang, J., Feng, J.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)
Myers, G.: Efficient local alignment discovery amongst noisy long reads. In: Brown, D.G., Morgenstern, B. (eds.), Algorithms in Bioinformatics—14th International Workshop, WABI 2014, Wroclaw, Poland, September 8–10, 2014. Proceedings, vol. 8701 of Lecture Notes in Computer Science, pp. 52–67. Springer (2014)
Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD, pp. 1033–1044 (2011)
Roberts, R.J., Carneiro, M.O., Schatz, M.C.: The advantages of SMRT sequencing. Genome Biol. 14(6), 405 (2013)
Song, Y., Tang, H., Zhang, H., Zhang, Q.: Overlap detection on long, error-prone sequencing reads via smooth q-gram. Bioinformatics 36(19), 4838–4845 (2020)
Su, Z., Ahn, B.-R., Eom, K.-Y., Kang, M.-K., Kim, J.-P., Kim, M.-K.: Plagiarism detection using the Levenshtein distance and Smith-Waterman algorithm. In: 2008 3rd International Conference on Innovative Computing Information and Control, pp. 569–569 (2008)
Ukkonen, E.: Algorithms for approximate string matching. Inf. Control 64(1–3), 100–118 (1985)
Wandelt, S., Deng, D., Gerdjikov, S., Mishra, S., Mitankin, P., Patil, M., Siragusa, E., Tiskin, A., Wang, W., Wang, J., Leser, U.: State-of-the-art in string similarity search and join. SIGMOD Record 43(1), 64–76 (2014)
Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)
Wang, J., Li, G., Feng, J.: Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: SIGMOD, pp. 85–96 (2012)
Wang, W., Qin, J., Xiao, C., Lin, X., Shen, H.T.: Vchunkjoin: an efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng. 25(8), 1916–1929 (2013)
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)
Zhang, H., Zhang, Q.: Embedjoin: efficient edit similarity joins via embeddings. In: KDD, pp. 585–594 (2017)
Zhang, H., Zhang, Q.: Minjoin: efficient edit similarity joins via local hash minima. In: KDD, pp. 1093–1103. ACM (2019)
Zini, M., Fabbri, M., Moneglia, M., Panunzi, A.: Plagiarism detection through multilevel text comparison. In: Proceedings of the Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution, AXMEDIS 2006, Leeds, UK, December 13–15, 2006, pp. 181–185. IEEE Computer Society (2006)

Acknowledgements

N. Karpov, H. Zhang, and Q. Zhang are supported in part by NSF CCF-1844234.

Author information

Authors and Affiliations

Indiana University, Bloomington, USA
Nikolai Karpov & Qin Zhang
Meta Inc., Menlo Park, CA, USA
Haoyu Zhang

Corresponding author

Correspondence to Qin Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Authors are ordered alphabetically.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Karpov, N., Zhang, H. & Zhang, Q. MinJoin++: a fast algorithm for string similarity joins under edit distance. The VLDB Journal 33, 281–299 (2024). https://doi.org/10.1007/s00778-023-00806-z

Received: 20 July 2022
Revised: 23 January 2023
Accepted: 13 July 2023
Published: 21 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00778-023-00806-z
