Streaming Dictionary Matching with Mismatches

Algorithmica

Abstract

In the k-mismatch problem, we are given a pattern of length n and a text, and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only \(\mathcal {O}(k \log \frac{n}{k})\) space and \(\mathcal {O}(\log \frac{n}{k} (\sqrt{k \log k} + \log ^3 n))\) time per letter (Clifford, Kociumaka, Porat, SODA 2019). In this work, we consider a strictly harder problem called dictionary matching with k mismatches. In this problem, we are given a dictionary of d patterns, where the length of each pattern is at most n, and must find all substrings of the text that are within Hamming distance k from one of the patterns. We develop a streaming algorithm for this problem with \(\mathcal {O}(k d \log ^k d \operatorname {polylog} n)\) space and \(\mathcal {O}(k \log ^{k} d \operatorname {polylog} n + |\mathrm {output}|)\) time per position of the text. The algorithm is randomised and outputs correct answers with high probability. On the lower bound side, we show that any streaming algorithm for dictionary matching with k mismatches requires \(\varOmega (k d)\) bits of space.
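
To make the problem statement concrete, here is a minimal brute-force sketch of dictionary matching with k mismatches. It is not the streaming algorithm of the paper: it keeps the whole text in memory and compares every pattern against every alignment. The function names `hamming_at_most` and `dictionary_matches` are illustrative assumptions, as is the choice to report each occurrence as a pair (exclusive end position, pattern index). Patterns are assumed to be non-empty.

```python
from typing import Iterable, List, Tuple


def hamming_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:  # stop early once the budget is exceeded
                return False
    return True


def dictionary_matches(text: str, patterns: Iterable[str], k: int) -> List[Tuple[int, int]]:
    """Brute-force reference for dictionary matching with k mismatches.

    Reports (end, index) whenever text[end - len(p):end] is within Hamming
    distance k of p = patterns[index]. Hypothetical helper, not the paper's method.
    """
    pats = list(patterns)
    occurrences = []
    for end in range(1, len(text) + 1):  # exclusive end of a candidate substring
        for index, p in enumerate(pats):
            start = end - len(p)
            if start >= 0 and hamming_at_most(text[start:end], p, k):
                occurrences.append((end, index))
    return occurrences


if __name__ == "__main__":
    print(dictionary_matches("abcabd", ["abc", "bd"], 1))
```

With a text of length m, this baseline takes \(\mathcal {O}(m d n)\) time in the worst case and stores the entire text, which is what a streaming algorithm is designed to avoid. It is useful mainly as a correctness oracle on small inputs.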

Notes

  1. With high probability means with probability at least \(1-1/n^c\) for any predefined constant \(c>1\).

  2. Hereafter, \(\tilde{\mathcal {O}}\) hides a multiplicative factor polynomial in \(\log n\).

  3. The query algorithm of [12] returns only those patterns that are within Hamming distance k from Q itself, but considering prefixes as well does not change the query time and is more suitable for our purposes. We explain the necessary modifications in Sect. 3.1.

References

  1. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975). https://doi.org/10.1145/360825.360855

  2. Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with \(k\) mismatches. J. Algorithms 50(2), 257–275 (2004). https://doi.org/10.1016/S0196-6774(03)00097-X

  3. Belazzougui, D.: Succinct dictionary matching with no slowdown. In: Proceedings of the 21st Annual Symposium on Combinatorial Pattern Matching, pp. 88–100 (2010). https://doi.org/10.1007/978-3-642-13509-5_9

  4. Belazzougui, D.: Worst-case efficient single and multiple string matching on packed texts in the word-RAM model. J. Discrete Algorithms 14, 91–106 (2012). https://doi.org/10.1007/978-3-642-19222-7_10

  5. Belazzougui, D., Boldi, P., Pagh, R., Vigna, S.: Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In: Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 785–794 (2009). https://doi.org/10.1137/1.9781611973068.86

  6. Belazzougui, D., Boldi, P., Vigna, S.: Dynamic \(z\)-fast tries. In: Proceedings of the 17th International Symposium on String Processing and Information Retrieval, pp. 159–172 (2010). https://doi.org/10.1007/978-3-642-16321-0_15

  7. Belazzougui, D., Raffinot, M.: Average optimal string matching in packed strings. In: Proceedings of the 8th International Conference on Algorithms and Complexity, pp. 37–48 (2013). https://doi.org/10.1007/978-3-642-38233-8_4

  8. Breslauer, D., Galil, Z.: Real-time streaming string-matching. ACM Trans. Algorithms 10(4), 22:1–22:12 (2014). https://doi.org/10.1145/2635814

  9. Clifford, R., Fontaine, A., Porat, E., Sach, B., Starikovskaya, T.: Dictionary matching in a stream. In: Proceedings of the 23rd Annual European Symposium on Algorithms, pp. 361–372 (2015). https://doi.org/10.1007/978-3-662-48350-3_31

  10. Clifford, R., Fontaine, A., Porat, E., Sach, B., Starikovskaya, T.: The k-mismatch problem revisited. In: Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2039–2052 (2016). https://doi.org/10.1137/1.9781611974331.ch142

  11. Clifford, R., Kociumaka, T., Porat, E.: The streaming k-mismatch problem. In: Proceedings of the 30th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1106–1125 (2019). https://doi.org/10.1137/1.9781611975482.68

  12. Cole, R., Gottlieb, L.A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pp. 91–100 (2004). https://doi.org/10.1145/1007352.1007374

  13. Commentz-Walter, B.: A string matching algorithm fast on the average. In: Proceedings of the 6th International Colloquium on Automata, Languages and Programming, pp. 118–132 (1979). https://doi.org/10.1007/3-540-09510-1_10

  14. Crochemore, M., Czumaj, A., Gasieniec, L., Lecroq, T., Plandowski, W., Rytter, W.: Fast practical multi-pattern matching. Inf. Process. Lett. 71(3), 107–113 (1999)

  15. Dietzfelbinger, M., Meyer auf der Heide, F.: Dynamic hashing in real time. In: Informatik: Festschrift zum 60. Geburtstag von Günter Hotz, pp. 95–119 (1992). https://doi.org/10.1007/978-3-322-95233-2_7

  16. Epifanio, C., Gabriele, A., Mignosi, F., Restivo, A., Sciortino, M.: Languages with mismatches. Theor. Comput. Sci. 385(1), 152–166 (2007). https://doi.org/10.1016/j.tcs.2007.06.006

  17. Fischer, J., Gagie, T., Gawrychowski, P., Kociumaka, T.: Approximating LZ77 via small-space multiple-pattern matching. In: Proceedings of the 23rd European Symposium on Algorithms, pp. 533–544 (2015). https://doi.org/10.1007/978-3-662-48350-3_45

  18. Gawrychowski, P., Landau, G.M., Starikovskaya, T.: Fast entropy-bounded string dictionary look-up with mismatches. In: Proceedings of the 43rd International Symposium on Mathematical Foundations of Computer Science, vol. 117, pp. 66:1–66:15 (2018). https://doi.org/10.4230/LIPIcs.MFCS.2018.66

  19. Gawrychowski, P., Starikovskaya, T.: Streaming dictionary matching with mismatches. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, pp. 21:1–21:15 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.21

  20. Gawrychowski, P., Uznański, P.: Towards unified approximate pattern matching for Hamming and \(L_1\) distance. In: Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, vol. 107, pp. 62:1–62:13 (2018). https://doi.org/10.4230/LIPIcs.ICALP.2018.62

  21. Golan, S., Kociumaka, T., Kopelowitz, T., Porat, E.: Dynamic dictionary matching in the online model. In: Proceedings of the 16th International Symposium on Algorithms and Data Structures, Lecture Notes in Computer Science, vol. 11646, pp. 409–422 (2019). https://doi.org/10.1007/978-3-030-24766-9_30

  22. Golan, S., Kociumaka, T., Kopelowitz, T., Porat, E.: The streaming k-mismatch problem: tradeoffs between space and total time. In: Proceedings of the 31st Annual Symposium on Combinatorial Pattern Matching, vol. 161, pp. 15:1–15:15 (2020). https://doi.org/10.4230/LIPIcs.CPM.2020.15

  23. Golan, S., Kopelowitz, T., Porat, E.: Towards optimal approximate streaming pattern matching by matching multiple patterns in multiple streams. In: Proceedings of the 45th International Colloquium on Automata, Languages, and Programming, pp. 65:1–65:16 (2018). https://doi.org/10.4230/LIPIcs.ICALP.2018.65

  24. Golan, S., Porat, E.: Real-time streaming multi-pattern search for constant alphabet. In: Proceedings of the 25th Annual European Symposium on Algorithms, vol. 87, pp. 41:1–41:15 (2017). https://doi.org/10.4230/LIPIcs.ESA.2017.41

  25. Hon, W.K., Ku, T.H., Shah, R., Thankachan, S.V., Vitter, J.S.: Faster compressed dictionary matching. In: Proceedings of the 17th International Symposium on String Processing and Information Retrieval, pp. 191–200 (2010). https://doi.org/10.1007/978-3-642-16321-0_19

  26. Huynh, T.N.D., Hon, W.K., Lam, T.W., Sung, W.K.: Approximate string matching using compressed suffix arrays. Theor. Comput. Sci. 352(1), 240–249 (2006). https://doi.org/10.1016/j.tcs.2005.11.022

  27. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249

  28. Kopelowitz, T., Porat, E., Rozen, Y.: Succinct online dictionary matching with improved worst-case guarantees. In: Proceedings of the 27th Annual Symposium on Combinatorial Pattern Matching, vol. 54, pp. 6:1–6:13 (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.6

  29. Kosolobov, D., Sivukhin, N.: Compressed multiple pattern matching. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, pp. 13:1–13:14 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.13

  30. Kremer, I., Nisan, N., Ron, D.: On randomized one-round communication complexity. In: Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pp. 596–605 (1995). https://doi.org/10.1007/s000370050018

  31. Lam, T.W., Sung, W.K., Wong, S.S.: Improved approximate string matching using compressed suffix data structures. Algorithmica 51(3), 298–314 (2008). https://doi.org/10.1007/s00453-007-9104-8

  32. Landau, G.M., Vishkin, U.: Efficient string matching with \(k\) mismatches. Theor. Comput. Sci. 43, 239–249 (1986). https://doi.org/10.1016/0304-3975(86)90178-7

  33. Porat, B., Porat, E.: Exact and approximate pattern matching in the streaming model. In: Proceedings of the 50th Annual Symposium on Foundations of Computer Science, pp. 315–323 (2009). https://doi.org/10.1109/FOCS.2009.11

  34. Tsur, D.: Fast index for approximate string matching. J. Discrete Algorithms 8(4), 339–345 (2010). https://doi.org/10.1016/j.jda.2010.08.002

  35. Wu, S., Manber, U.: Agrep—a fast approximate pattern-matching tool. In: Proceedings of the USENIX Technical Conference, pp. 153–162 (1992)

Author information

Corresponding author

Correspondence to Tatiana Starikovskaya.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a full and extended version of the conference paper [19].

P. Gawrychowski was partially supported by the Bekker programme of the Polish National Agency for Academic Exchange (PPN/BEK/2020/1/00444) and the grant ANR-20-CE48-0001 from the French National Research Agency (ANR). T. Starikovskaya was partially supported by the grant ANR-20-CE48-0001 from the French National Research Agency (ANR).

About this article

Cite this article

Gawrychowski, P., Starikovskaya, T. Streaming Dictionary Matching with Mismatches. Algorithmica 84, 896–916 (2022). https://doi.org/10.1007/s00453-021-00876-x

  • DOI: https://doi.org/10.1007/s00453-021-00876-x
