Abstract
In the k-mismatch problem, we are given a pattern of length n and a text and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only \(\mathcal {O}(k \log \frac{n}{k})\) space and \(\mathcal {O}(\log \frac{n}{k} (\sqrt{k \log k} + \log ^3 n))\) time per letter (Clifford, Kociumaka, Porat, SODA 2019). In this work, we consider a strictly harder problem called dictionary matching with k mismatches. In this problem, we are given a dictionary of d patterns, where the length of each pattern is at most n, and must find all substrings of the text that are within Hamming distance k from one of the patterns. We develop a streaming algorithm for this problem with \(\mathcal {O}(k d \log ^k d \mathop {\mathrm {polylog} {\,n}})\) space and \(\mathcal {O}(k \log ^{k} d \mathop {\mathrm {polylog} {\,n}} + |\mathrm {output}|)\) time per position of the text. The algorithm is randomised and outputs correct answers with high probability. On the lower bound side, we show that any streaming algorithm for dictionary matching with k mismatches requires \(\varOmega (k d)\) bits of space.
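To make the problem statement concrete, the following is a minimal sketch of a brute-force baseline for dictionary matching with k mismatches. It is not the streaming algorithm of the paper: it keeps the whole text in memory and compares every pattern against every alignment, so it only illustrates what output is expected. The function names `hamming_at_most` and `naive_dictionary_k_mismatch` and the convention of reporting (end position, pattern index) pairs are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple


def hamming_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the budget is exceeded
    return True


def naive_dictionary_k_mismatch(
    text: str, patterns: Sequence[str], k: int
) -> List[Tuple[int, int]]:
    """Brute-force baseline (hypothetical helper, not the paper's method).

    Returns all pairs (end, idx) such that the substring of `text` ending at
    0-based position `end` with length len(patterns[idx]) is within Hamming
    distance k of patterns[idx].
    """
    occurrences = []
    for end in range(len(text)):
        for idx, pattern in enumerate(patterns):
            start = end - len(pattern) + 1
            if start < 0:
                continue  # the pattern does not fit before this position
            if hamming_at_most(text[start:end + 1], pattern, k):
                occurrences.append((end, idx))
    return occurrences


if __name__ == "__main__":
    print(naive_dictionary_k_mismatch("abcabd", ["abc", "bd"], 1))
```

This baseline uses space linear in the text and time proportional to the text length times the total pattern length, which is the cost the streaming bounds stated above are designed to avoid.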
Notes
With high probability means with probability at least \(1-1/n^c\) for any predefined constant \(c>1\).
Hereafter, \(\tilde{\mathcal {O}}\) hides a multiplicative factor polynomial in \(\log n\).
References
Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975). https://doi.org/10.1145/360825.360855
Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with \(k\) mismatches. J. Algorithms 50(2), 257–275 (2004). https://doi.org/10.1016/S0196-6774(03)00097-X
Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching, pp. 88–100 (2010). https://doi.org/10.1007/978-3-642-13509-5_9
Belazzougui, D.: Worst-case efficient single and multiple string matching on packed texts in the word-RAM model. J. Discrete Algorithms 14, 91–106 (2012). https://doi.org/10.1007/978-3-642-19222-7_10
Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 785–794 (2009). https://doi.org/10.1137/1.9781611973068.86
Belazzougui, D., Boldi, P., Vigna, S.: Dynamic \(z\)-fast tries. In: Proceedings of the 17th International Symposium on String Processing and Information Retrieval, pp. 159–172 (2010). https://doi.org/10.1007/978-3-642-16321-0_15
Belazzougui, D., Raffinot, M.: Average optimal string matching in packed strings. In: Proceedings of the 8th International Conference on Algorithms and Complexity, pp. 37–48 (2013). https://doi.org/10.1007/978-3-642-38233-8_4
Breslauer, D., Galil, Z.: Real-time streaming string-matching. ACM Trans. Algorithms 10(4), 22:1–22:12 (2014). https://doi.org/10.1145/2635814
Clifford, R., Fontaine, A., Porat, E., Sach, B., Starikovskaya, T.: Dictionary matching in a stream. In: Proceedings of the 23rd Annual European Symposium on Algorithms, pp. 361–372 (2015). https://doi.org/10.1007/978-3-662-48350-3_31
Clifford, R., Fontaine, A., Porat, E., Sach, B., Starikovskaya, T.: The k-mismatch problem revisited. In: Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2039–2052 (2016). https://doi.org/10.1137/1.9781611974331.ch142
Clifford, R., Kociumaka, T., Porat, E.: The streaming k-mismatch problem. In: Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1106–1125 (2019). https://doi.org/10.1137/1.9781611975482.68
Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pp. 91–100 (2004). https://doi.org/10.1145/1007352.1007374
Commentz-Walter, B.: A string matching algorithm fast on the average. In: Proceedings of the 6th International Colloquium on Automata, Languages and Programming, pp. 118–132 (1979). https://doi.org/10.1007/3-540-09510-1_10
Crochemore, M., Czumaj, A., Gasieniec, L., Lecroq, T., Plandowski, W., Rytter, W.: Fast practical multi-pattern matching. Inf. Process. Lett. 71(3), 107–113 (1999)
Dietzfelbinger, M., Meyer auf der Heide, F.: Dynamic hashing in real time. In: Informatik: Festschrift zum 60. Geburtstag von Günter Hotz, pp. 95–119 (1992). https://doi.org/10.1007/978-3-322-95233-2_7
Epifanio, C., Gabriele, A., Mignosi, F., Restivo, A., Sciortino, M.: Languages with mismatches. Theor. Comput. Sci. 385(1), 152–166 (2007). https://doi.org/10.1016/j.tcs.2007.06.006
Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the 23rd European Symposium on Algorithms, pp. 533–544 (2015). https://doi.org/10.1007/978-3-662-48350-3_45
Gawrychowski, P., Landau, G.M., Starikovskaya, T.: Fast entropy-bounded string dictionary look-up with mismatches. In: Proceedings of the 43rd International Symposium on Mathematical Foundations of Computer Science, vol. 117, pp. 66:1–66:15 (2018). https://doi.org/10.4230/LIPIcs.MFCS.2018.66
Gawrychowski, P., Starikovskaya, T.: Streaming dictionary matching with mismatches. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, pp. 21:1–21:15 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.21
Gawrychowski, P., Uznański, P.: Towards unified approximate pattern matching for Hamming and \(L_1\) distance. In: Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, vol. 107, pp. 62:1–62:13 (2018). https://doi.org/10.4230/LIPIcs.ICALP.2018.62
Golan, S., Kociumaka, T., Kopelowitz, T., Porat, E.: Dynamic dictionary matching in the online model. In: Proceedings of the 16th International Symposium on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 11646, pp. 409–422 (2019). https://doi.org/10.1007/978-3-030-24766-9_30
Golan, S., Kociumaka, T., Kopelowitz, T., Porat, E.: The streaming k-mismatch problem: tradeoffs between space and total time. In: Proceedings of the 31st Annual Symposium on Combinatorial Pattern Matching, vol. 161, pp. 15:1–15:15 (2020). https://doi.org/10.4230/LIPIcs.CPM.2020.15
Golan, S., Kopelowitz, T., Porat, E.: Towards optimal approximate streaming pattern matching by matching multiple patterns in multiple streams. In: Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, pp. 65:1–65:16 (2018). https://doi.org/10.4230/LIPIcs.ICALP.2018.65
Golan, S., Porat, E.: Real-time streaming multi-pattern search for constant alphabet. In: Proceedings of the 25th Annual European Symposium on Algorithms, vol. 87, pp. 41:1–41:15 (2017). https://doi.org/10.4230/LIPIcs.ESA.2017.41
Hon, W.K., Ku, T.H., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed dictionary matching. In: Proceedings of the 17th International Symposium on String Processing and Information Retrieval, pp. 191–200 (2010). https://doi.org/10.1007/978-3-642-16321-0_19
Huynh, T.N.D., Hon, W.K., Lam, T.W., Sung, W.K.: Approximate string matching using compressed suffix arrays. J. Theor. Comput. Sci. 352(1), 240–249 (2006). https://doi.org/10.1016/j.tcs.2005.11.022
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249
Kopelowitz, T., Porat, E., Rozen, Y.: Succinct online dictionary matching with improved worst-case guarantees. In: Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching, vol. 54, pp. 6:1–6:13 (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.6
Kosolobov, D., Sivukhin, N.: Compressed multiple pattern matching. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, pp. 13:1–13:14 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.13
Kremer, I., Nisan, N., Ron, D.: On randomized one-round communication complexity. In: Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pp. 596–605 (1995). https://doi.org/10.1007/s000370050018
Lam, T.W., Sung, W.K., Wong, S.S.: Improved approximate string matching using compressed suffix data structures. J. Algorithmica 51(3), 298–314 (2008). https://doi.org/10.1007/s00453-007-9104-8
Landau, G.M., Vishkin, U.: Efficient string matching with \(k\) mismatches. Theor. Comput. Sci. 43, 239–249 (1986). https://doi.org/10.1016/0304-3975(86)90178-7
Porat, B., Porat, E.: Exact and approximate pattern matching in the streaming model. In: Proceedings of the 50th Annual Symposium on Foundations of Computer Science, pp. 315–323 (2009). https://doi.org/10.1109/FOCS.2009.11
Tsur, D.: Fast index for approximate string matching. J. Discrete Algorithms 8(4), 339–345 (2010). https://doi.org/10.1016/j.jda.2010.08.002
Wu, S., Manber, U.: Agrep—a fast approximate pattern-matching tool. In: Proceedings of the USENIX Technical Conference, pp. 153–162 (1992)
Additional information
This is a full and extended version of the conference paper [19].
P. Gawrychowski was partially supported by the Bekker programme of the Polish National Agency for Academic Exchange (PPN/BEK/2020/1/00444) and the grant ANR-20-CE48-0001 from the French National Research Agency (ANR). T. Starikovskaya was partially supported by the grant ANR-20-CE48-0001 from the French National Research Agency (ANR).
About this article
Cite this article
Gawrychowski, P., Starikovskaya, T. Streaming Dictionary Matching with Mismatches. Algorithmica 84, 896–916 (2022). https://doi.org/10.1007/s00453-021-00876-x
DOI: https://doi.org/10.1007/s00453-021-00876-x