Abstract
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns \(P_1\) and \(P_2\) and a gap range \([\alpha, \beta]\) we can quickly find the consecutive occurrences of \(P_1\) and \(P_2\) with distance in \([\alpha, \beta]\), i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use linear space and query time \({\widetilde{O}}(|P_1|+|P_2|+n^{2/3})\) for existence and counting and \({\widetilde{O}}(|P_1|+|P_2|+n^{2/3}\,\text{occ}^{1/3})\) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using \({\widetilde{O}}(n)\) space must use \({\widetilde{\Omega}}(|P_1| + |P_2| + \sqrt{n})\) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
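To make the query concrete, the following is a minimal brute-force sketch in Python. It is not the data structure from the paper: it scans S directly and takes time at least linear in n per query, so it serves only as a reference for what the three query types return. It assumes that a consecutive occurrence is a pair (i, j) with i < j, where i is an occurrence of \(P_1\), j is an occurrence of \(P_2\), no occurrence of either pattern starts strictly between them, and the distance is j - i. The function names `occurrences` and `consecutive_occurrences` are hypothetical.

```python
from typing import List, Tuple


def occurrences(s: str, p: str) -> List[int]:
    """Return all 0-indexed starting positions of p in s (overlaps included)."""
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        start = s.find(p, start + 1)
    return positions


def consecutive_occurrences(
    s: str, p1: str, p2: str, alpha: int, beta: int
) -> List[Tuple[int, int]]:
    """Report pairs (i, j) of consecutive occurrences with alpha <= j - i <= beta.

    Assumed definition: i is an occurrence of p1, j > i is an occurrence of p2,
    and no occurrence of p1 or p2 starts strictly between i and j.
    """
    occ1 = set(occurrences(s, p1))
    occ2 = set(occurrences(s, p2))
    # Neighbours in the sorted union have no occurrence strictly between them.
    positions = sorted(occ1 | occ2)
    pairs = []
    for i, j in zip(positions, positions[1:]):
        if i in occ1 and j in occ2 and alpha <= j - i <= beta:
            pairs.append((i, j))
    return pairs


s = "a--b-a----b-ab"
reported = consecutive_occurrences(s, "a", "b", 2, 4)  # reporting query
count = len(reported)                                  # counting query
exists = count > 0                                     # existence query
```

In this example the occurrences of "a" start at 0, 5, and 12 and those of "b" at 3, 10, and 13, so the consecutive pairs have distances 3, 5, and 1, and only the pair (0, 3) falls in the gap range [2, 4].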
Notes
\({\widetilde{O}}\) and \(\widetilde{\Omega }\) ignore polylogarithmic factors.
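Written out under the usual convention (an assumption here, since the note does not give a formal definition), the notation means, for some constant \(c \ge 0\):

\[ {\widetilde{O}}(f(n)) = O\bigl(f(n)\cdot \log^{c} n\bigr), \qquad {\widetilde{\Omega}}(f(n)) = \Omega\bigl(f(n) / \log^{c} n\bigr). \]

With this reading, the \(n^{2/3}\) term in the upper bound and the \(\sqrt{n}\) term in the conditional lower bound differ by a factor of \(n^{2/3} / n^{1/2} = n^{1/6}\), up to polylogarithmic factors.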
References
Alstrup, S., Holm, J., de Lichtenberg, K., Thorup, M.: Minimizing diameters of dynamic trees. In: Proceedings of the 24th ICALP, pp. 270–280 (1997)
Alstrup, S., Holm, J., Thorup, M.: Maintaining center and median in dynamic trees. In: Proceedings of the 7th SWAT, pp. 46–56 (2000)
Alstrup, S., Rauhe, T.: Improved labeling scheme for ancestor queries. In: Proceedings of the 13th SODA, pp. 947–953 (2002)
Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Proceedings of the 41st ICALP, pp. 114–125 (2014)
Amir, A., Kopelowitz, T., Levy, A., Pettie, S., Porat, E., Shalom, B.R.: Mind the gap: essentially optimal algorithms for online dictionary matching with one gap. In: Proceedings of the 27th ISAAC, pp. 12:1–12:12 (2016)
Apostolico, A., Pizzi, C., Satta, G.: Optimal discovery of subword associations in strings. In: Proceedings of the 7th DS, pp. 270–277 (2004)
Apostolico, A., Pizzi, C., Ukkonen, E.: Efficient algorithms for the discovery of gapped factors. Algorithms Mol. Biol. 6, 5 (2011)
Apostolico, A., Satta, G.: Discovering subword associations in strings in time linear in the output size. J. Discrete Algorithms 7(2), 227–238 (2009)
Bader, J., Gog, S., Petri, M.: Practical variable length gap pattern matching. In: Proceedings of the 15th SEA, pp. 1–16 (2016)
Bille, P., Gørtz, I.L.: The tree inclusion problem: in linear space and faster. ACM Trans. Algorithms 7(3), 1–47 (2011)
Bille, P., Gørtz, I.L.: Substring range reporting. Algorithmica 69(2), 384–396 (2014)
Bille, P., Gørtz, I.L., Pedersen, M.R., Rotenberg, E., Steiner, T.A.: String indexing for top-\(k\) close consecutive occurrences. In: Proceedings of the 40th FSTTCS, pp. 14:1–14:17 (2020)
Bille, P., Gørtz, I.L., Pedersen, M.R., Steiner, T.A.: Gapped indexing for consecutive occurrences. In: Proceedings of the 32nd CPM, pp. 10:1–10:19 (2021)
Bille, P., Gørtz, I.L., Vildhøj, H.W., Vind, S.: String indexing for patterns with wildcards. Theory Comput. Syst. 55(1), 41–60 (2014)
Bille, P., Gørtz, I.L., Vildhøj, H.W., Wind, D.K.: String matching with variable length gaps. Theor. Comput. Sci. 443 (2012). Announced at SPIRE (2010)
Biswas, S., Ganguly, A., Shah, R., Thankachan, S.V.: Ranked document retrieval for multiple patterns. Theor. Comput. Sci. 746, 98–111 (2018)
Bucher, P., Bairoch, A.: A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proceedings of the 2nd ISMB, pp. 53–61 (1994)
Cáceres, M., Puglisi, S.J., Zhukova, B.: Fast indexes for gapped pattern matching. In: Proceedings of the 46th SOFSEM, pp. 493–504 (2020)
Cohen, H., Porat, E.: Fast set intersection and two-patterns matching. Theor. Comput. Sci. 411(40–42), 3795–3800 (2010)
Ferragina, P., Koudas, N., Muthukrishnan, S., Srivastava, D.: Two-dimensional substring indexing. J. Comput. Syst. Sci. 66(4), 763–774 (2003)
Frederickson, G.N.: Ambivalent data structures for dynamic 2-edge-connectivity and \(k\) smallest spanning trees. SIAM J. Comput. 26(2), 484–538 (1997)
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)
Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. 11(4), 335–357 (2008)
Goldstein, I., Kopelowitz, T., Lewenstein, M., Porat, E.: Conditional lower bounds for space/time tradeoffs. In: Proceedings of the 15th WADS, pp. 421–436. Springer (2017)
Haapasalo, T., Silvasti, P., Sippu, S., Soisalon-Soininen, E.: Online dictionary matching with variable-length gaps. In: Proceedings of the 10th SEA, pp. 76–87 (2011)
Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999. Nucleic Acids Res. 27(1), 215–219 (1999)
Hon, W., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S.: Indexes for document retrieval with relevance. In: Space-Efficient Data Structures, Streams, and Algorithms—Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pp. 351–362 (2013)
Hon, W., Thankachan, S.V., Shah, R., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)
Hon, W.K., Patil, M., Shah, R., Wu, S.B.: Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms 8(4), 402–417 (2010)
Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), 1–36 (2014). Announced at 50th FOCS
Iliopoulos, C.S., Rahman, M.S.: Indexing factors with gaps. Algorithmica 55(1), 60–70 (2009)
Keller, O., Kopelowitz, T., Lewenstein, M.: Range non-overlapping indexing and successive list indexing. In: Proceedings of the 11th WADS, pp. 625–636 (2007)
Kopelowitz, T., Krauthgamer, R.: Color-distance oracles and snippets. In: Grossi, R., Lewenstein, M. (Eds.) Proceedings of the 27th CPM, pp. 24:1–24:10 (2016)
Kopelowitz, T., Pettie, S., Porat, E.: Higher lower bounds from the 3SUM conjecture. In: Proceedings of the 27th SODA, pp. 1272–1287 (2016)
Larsen, K.G., Munro, J.I., Nielsen, J.S., Thankachan, S.V.: On hardness of several string indexing problems. Theor. Comput. Sci. 582, 74–82 (2015)
Lewenstein, M.: Indexing with gaps. In: Proceedings of the 18th SPIRE, pp. 135–143 (2011)
Mehldau, G., Myers, G.: A system for pattern matching applications on biosequences. Bioinformatics 9(3), 299–314 (1993)
Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. Algorithmica 78(2), 379–393 (2017). Announced at 25th ISAAC
Munro, J.I., Navarro, G., Shah, R., Thankachan, S.V.: Ranked document selection. Theor. Comput. Sci. 812, 149–159 (2020)
Myers, E.W.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1992)
Navarro, G.: Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), 1–47 (2014)
Navarro, G., Nekrich, Y.: Time-optimal top-k document retrieval. SIAM J. Comput. 46(1), 80–113 (2017). Announced at 23rd SODA
Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)
Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014). Announced at 20th SPIRE
Navarro, G., Thankachan, S.V.: Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci. 638, 108–111 (2016). Announced at 26th CPM
Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, pp. 271–282 (2012)
Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, pp. 803–814 (2013)
Tsur, D.: Top-k document retrieval in optimal space. Inf. Process. Lett. 113(12), 440–443 (2013)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th FOCS, pp. 1–11 (1973)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\Theta(N)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3
Zhou, G.: Two-dimensional range successor in optimal time and almost linear space. Inf. Process. Lett. 116(2), 171–174 (2016)
Additional information
A preliminary version of this paper appeared at CPM 2021 [13]. P. Bille, I. L. Gørtz, and M. R. Pedersen were supported by the Danish Research Council Grant DFF-8021-002498.
About this article
Cite this article
Bille, P., Gørtz, I.L., Pedersen, M.R. et al. Gapped Indexing for Consecutive Occurrences. Algorithmica 85, 879–901 (2023). https://doi.org/10.1007/s00453-022-01051-6
DOI: https://doi.org/10.1007/s00453-022-01051-6