Gapped Indexing for Consecutive Occurrences

Bille, Philip; Gørtz, Inge Li; Pedersen, Max Rishøj; Steiner, Teresa Anna

doi:10.1007/s00453-022-01051-6


Published: 20 October 2022

Algorithmica, volume 85, pages 879–901 (2023)

Philip Bille¹,
Inge Li Gørtz¹,
Max Rishøj Pedersen¹ &
…
Teresa Anna Steiner ORCID: orcid.org/0000-0003-1078-4075¹


Abstract

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns \(P_1\) and \(P_2\) and a gap range \([\alpha, \beta]\) we can quickly find the consecutive occurrences of \(P_1\) and \(P_2\) with distance in \([\alpha, \beta]\), i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use linear space and query time \(\widetilde{O}(|P_1|+|P_2|+n^{2/3})\) for existence and counting and \(\widetilde{O}(|P_1|+|P_2|+n^{2/3}\,\text{occ}^{1/3})\) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using \(\widetilde{O}(n)\) space must use \(\widetilde{\Omega}(|P_1|+|P_2|+\sqrt{n})\) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
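To make the query concrete, the following is a minimal brute-force sketch in Python. It is not the data structure from the paper: it scans S at query time and only illustrates what the three query types return. It assumes that a consecutive occurrence is a pair (i, j) where i is a start position of \(P_1\), j is a start position of \(P_2\), i < j, no occurrence of either pattern starts strictly between them, and the distance is j - i. The function names are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def occurrences(s: str, p: str) -> List[int]:
    """Start positions of p in s (overlaps allowed), found by naive scanning."""
    return [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]


def report(s: str, p1: str, p2: str, alpha: int, beta: int) -> List[Tuple[int, int]]:
    """Reporting query: consecutive occurrences (i, j) with alpha <= j - i <= beta.

    Assumed definition: i starts p1, j starts p2, i < j, and no occurrence
    of p1 or p2 starts strictly between i and j.
    """
    occ1 = set(occurrences(s, p1))
    occ2 = set(occurrences(s, p2))
    # Neighbours in the merged, sorted position list have nothing in between.
    events = sorted(occ1 | occ2)
    return [
        (i, j)
        for i, j in zip(events, events[1:])
        if i in occ1 and j in occ2 and alpha <= j - i <= beta
    ]


def count(s: str, p1: str, p2: str, alpha: int, beta: int) -> int:
    """Counting query: number of qualifying consecutive occurrences."""
    return len(report(s, p1, p2, alpha, beta))


def exists(s: str, p1: str, p2: str, alpha: int, beta: int) -> bool:
    """Existence query: is there at least one qualifying consecutive occurrence?"""
    return count(s, p1, p2, alpha, beta) > 0
```

For example, in the string `"axbxxaaxxxb"` the pattern `"a"` starts at positions 0, 5 and 6, and `"b"` starts at positions 2 and 10. Under the assumed definition the consecutive occurrences are (0, 2) and (6, 10); the pair (5, 10) does not qualify because an occurrence of `"a"` starts at position 6 in between. This baseline takes time at least linear in the length of S per query, which is the cost the indexes described in the abstract are designed to avoid.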


Notes

\({\widetilde{O}}\) and \(\widetilde{\Omega }\) ignore polylogarithmic factors.

References

Alstrup, S., Holm, J., de Lichtenberg, K., Thorup, M.: Minimizing diameters of dynamic trees. In: Proceedings of the 24th ICALP, pp. 270–280 (1997)
Alstrup, S., Holm, J., Thorup, M.: Maintaining center and median in dynamic trees. In: Proceedings of the 7th SWAT, pp. 46–56 (2000)
Alstrup, S., Rauhe, T.: Improved labeling scheme for ancestor queries. In: Proceedings of the 13th SODA, pp. 947–953 (2002)
Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Proceedings of the 41st ICALP, pp. 114–125 (2014)
Amir, A., Kopelowitz, T., Levy, A., Pettie, S., Porat, E., Shalom, B.R.: Mind the gap: essentially optimal algorithms for online dictionary matching with one gap. In: Proceedings of the 27th ISAAC, pp. 12:1–12:12 (2016)
Apostolico, A., Pizzi, C., Satta, G.: Optimal discovery of subword associations in strings. In: Proceedings of the 7th DS, pp. 270–277 (2004)
Apostolico, A., Pizzi, C., Ukkonen, E.: Efficient algorithms for the discovery of gapped factors. Algorithms Mol. Biol. 6, 5 (2011)
Apostolico, A., Satta, G.: Discovering subword associations in strings in time linear in the output size. J. Discrete Algorithms 7(2), 227–238 (2009)
Bader, J., Gog, S., Petri, M.: Practical variable length gap pattern matching. In: Proceedings of the 15th SEA, pp. 1–16 (2016)
Bille, P., Gørtz, I.L.: The tree inclusion problem: in linear space and faster. ACM Trans. Algorithms 7(3), 1–47 (2011)
Bille, P., Gørtz, I.L.: Substring range reporting. Algorithmica 69(2), 384–396 (2014)
Bille, P., Gørtz, I.L., Pedersen, M.R., Rotenberg, E., Steiner, T.A.: String indexing for top-\(k\) close consecutive occurrences. In: Proceedings of the 40th FSTTCS, pp. 14:1–14:17 (2020)
Bille, P., Gørtz, I.L., Pedersen, M.R., Steiner, T.A.: Gapped indexing for consecutive occurrences. In: Proceedings of the 32nd CPM, pp. 10:1–10:19 (2021)
Bille, P., Gørtz, I.L., Vildhøj, H.W., Vind, S.: String indexing for patterns with wildcards. Theory Comput. Syst. 55(1), 41–60 (2014)
Bille, P., Gørtz, I.L., Vildhøj, H.W., Wind, D.K.: String matching with variable length gaps. Theor. Comput. Sci. 443 (2012). Announced at SPIRE (2010)
Biswas, S., Ganguly, A., Shah, R., Thankachan, S.V.: Ranked document retrieval for multiple patterns. Theor. Comput. Sci. 746, 98–111 (2018)
Bucher, P., Bairoch, A.: A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proceedings of the 2nd ISMB, pp. 53–61 (1994)
Cáceres, M., Puglisi, S.J., Zhukova, B.: Fast indexes for gapped pattern matching. In: Proceedings of the 46th SOFSEM, pp. 493–504 (2020)
Cohen, H., Porat, E.: Fast set intersection and two-patterns matching. Theor. Comput. Sci. 411(40–42), 3795–3800 (2010)
Ferragina, P., Koudas, N., Muthukrishnan, S., Srivastava, D.: Two-dimensional substring indexing. J. Comput. Syst. Sci. 66(4), 763–774 (2003)
Frederickson, G.N.: Ambivalent data structures for dynamic 2-edge-connectivity and \(k\) smallest spanning trees. SIAM J. Comput. 26(2), 484–538 (1997)
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(o(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)
Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. 11(4), 335–357 (2008)
Goldstein, I., Kopelowitz, T., Lewenstein, M., Porat, E.: Conditional lower bounds for space/time tradeoffs. In: Proceedings of the 15th WADS, pp. 421–436. Springer (2017)
Haapasalo, T., Silvasti, P., Sippu, S., Soisalon-Soininen, E.: Online dictionary matching with variable-length gaps. In: Proceedings of the 10th SEA, pp. 76–87 (2011)
Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999. Nucleic Acids Res. 27(1), 215–219 (1999)
Hon, W., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S.: Indexes for document retrieval with relevance. In: Space-Efficient Data Structures, Streams, and Algorithms—Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pp. 351–362 (2013)
Hon, W., Thankachan, S.V., Shah, R., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)
Hon, W.K., Patil, M., Shah, R., Wu, S.B.: Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms 8(4), 402–417 (2010)
Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), 1–36 (2014). Announced at 50th FOCS
Iliopoulos, C.S., Rahman, M.S.: Indexing factors with gaps. Algorithmica 55(1), 60–70 (2009)
Keller, O., Kopelowitz, T., Lewenstein, M.: Range non-overlapping indexing and successive list indexing. In: Proceedings of the 11th WADS, pp. 625–636 (2007)
Kopelowitz, T., Krauthgamer, R.: Color-distance oracles and snippets. In: Grossi, R., Lewenstein, M. (Eds.) Proceedings of the 27th CPM, pp. 24:1–24:10 (2016)
Kopelowitz, T., Pettie, S., Porat, E.: Higher lower bounds from the 3SUM conjecture. In: Proceedings of the 27th SODA, pp. 1272–1287 (2016)
Larsen, K.G., Munro, J.I., Nielsen, J.S., Thankachan, S.V.: On hardness of several string indexing problems. Theor. Comput. Sci. 582, 74–82 (2015)
Lewenstein, M.: Indexing with gaps. In: Proceedings of the 18th SPIRE, pp. 135–143 (2011)
Mehldau, G., Myers, G.: A system for pattern matching applications on biosequences. Bioinformatics 9(3), 299–314 (1993)
Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. Algorithmica 78(2), 379–393 (2017). Announced at 25th ISAAC
Munro, J.I., Navarro, G., Shah, R., Thankachan, S.V.: Ranked document selection. Theor. Comput. Sci. 812, 149–159 (2020)
Myers, E.W.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1992)
Navarro, G.: Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), 1–47 (2014)
Navarro, G., Nekrich, Y.: Time-optimal top-k document retrieval. SIAM J. Comput. 46(1), 80–113 (2017). Announced at 23rd SODA
Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)
Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014). Announced at 20th SPIRE
Navarro, G., Thankachan, S.V.: Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci. 638, 108–111 (2016). Announced at 26th CPM
Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, pp. 271–282 (2012)
Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, pp. 803–814 (2013)
Tsur, D.: Top-k document retrieval in optimal space. Inf. Process. Lett. 113(12), 440–443 (2013)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th FOCS, pp. 1–11 (1973)
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\Theta(N)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3
Zhou, G.: Two-dimensional range successor in optimal time and almost linear space. Inf. Process. Lett. 116(2), 171–174 (2016)


Author information

Authors and Affiliations

Technical University of Denmark, Kgs. Lyngby, Denmark
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen & Teresa Anna Steiner


Corresponding author

Correspondence to Teresa Anna Steiner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared at CPM 2021 [13]. P. Bille, I. L. Gørtz and M. R. Pedersen: Supported by the Danish Research Council Grant DFF-8021-002498.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Bille, P., Gørtz, I.L., Pedersen, M.R. et al. Gapped Indexing for Consecutive Occurrences. Algorithmica 85, 879–901 (2023). https://doi.org/10.1007/s00453-022-01051-6


Received: 16 August 2021
Accepted: 04 October 2022
Published: 20 October 2022
Issue Date: April 2023
DOI: https://doi.org/10.1007/s00453-022-01051-6
