Skip to main content
Log in

Succinct Non-overlapping Indexing

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

Text indexing is a fundamental problem in computer science. The objective is to preprocess a text T, so that, given a pattern P, we can find all starting positions (or simply, occurrences) of P in \(T\) efficiently. In some cases, additional restrictions are imposed. We consider two variants, namely the non-overlapping indexing problem, and the range non-overlapping indexing problem. Given a text \(T\) having n characters, the non-overlapping indexing problem is defined as follows: pre-process \(T\) into a data structure, such that for any pattern P, containing |P| characters, we can report a set containing the maximum number of non-overlapping occurrences of P in \(T\). Cohen and Porat (in: Algorithms and computation, 20th international symposium, ISAAC 2009, Honolulu, Hawaii. Proceedings, 2009) showed that by maintaining a linear space index in which the suffix tree of \(T\) is augmented with an O(n) word data structure, a query P can be answered in optimal time \(O(|P|+nocc)\), where \(nocc\) is the number of occurrences reported. We present the following new result. Let \(\mathsf {CSA}\) (not necessarily a compressed suffix array) be an index of \(T\) that can compute (i) the suffix range of P in \(\mathsf {search}(P)\) time, and (ii) a suffix array or an inverse suffix array value in \(\mathsf {t}_\mathsf {SA}\) time. By using \(\mathsf {CSA}\) alone, we can answer a query P in \(\mathsf {search}(P)+\mathsf {sort}(nocc)+O(nocc\cdot \mathsf {t}_\mathsf {SA})\) time. The function \(\mathsf {sort}(k)\) denotes the time for sorting k numbers in \(\{1,2,\dots ,n\}\). In the range non-overlapping indexing problem, along with the pattern P, two integers a and b, \(b \ge a\), are provided as input. The task is to report a set containing the maximum number of non-overlapping occurrences of P that lie within the range [ab]. For any arbitrarily small positive constant \(\epsilon \), we present an \(O(n \log ^\epsilon n)\) word index with \(O(|P| + nocc_{a,b})\) query time, where \(nocc_{a,b}\) is the number of occurrences reported. Our index improves upon the result of Cohen and Porat [6].

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

Notes

  1. A preliminary version [11] of this paper appeared in the 26th Annual Symposium on Combinatorial Pattern Matching (CPM) 2015.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)

    Article  MathSciNet  Google Scholar 

  2. Alstrup, S., Brodal, G.S., Rauhe, T.: Optimal static range reporting in one dimension. In: Proceedings on 33rd Annual ACM Symposium on Theory of Computing, July 6–8, 2001, Heraklion, Crete, Greece, pp. 476–482 (2001)

  3. Baker, B.S.: A theory of parameterized pattern matching: algorithms and applications. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, May 16–18, 1993, San Diego, CA, USA, pp. 71–80 (1993)

  4. Boyer, R.S., Moore, J.S.: A fast string searching algorithm. Commun. ACM 20(10), 762–772 (1977)

    Article  Google Scholar 

  5. Brodal, G.S., Fagerberg, R., Greve, M., López-Ortiz, A.: Online sorted range reporting. In: Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16–18, 2009. Proceedings, pp. 173–182 (2009)

  6. Cohen, H., Porat, E.: Range non-overlapping indexing. In: Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16–18, 2009. Proceedings, pp. 1044–1053 (2009)

  7. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13–16, 2004, pp. 91–100 (2004)

  8. Crochemore, M., Iliopoulos, C.S., Kubica, M., Rahman, M.S., Walen, T.: Improved algorithms for the range next value problem and applications. In: STACS 2008, 25th Annual Symposium on Theoretical Aspects of Computer Science, Bordeaux, France, February 21–23, 2008, Proceedings, pp. 205–216 (2008)

  9. Farach, M.: Optimal suffix tree construction with large alphabets. In: 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19–22, 1997, pp. 137–143 (1997)

  10. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  Google Scholar 

  11. Ganguly, A., Shah, R., Thankachan, S.V.: Succinct non-overlapping indexing. In: Combinatorial Pattern Matching—26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29–July 1, 2015, Proceedings, pp. 185–195 (2015)

  12. Ganguly, A., Shah, R., Thankachan, S.V: pBWT: achieving succinct data structures for parameterized pattern matching and related problems. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 397–407. Society for Industrial and Applied Mathematics (2017)

  13. Ganguly, A., Shah, R., Thankachan, S.V.: Structural pattern matching-succinctly. In: 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9–12, 2017, Phuket, Thailand, pp. 35:1–35:13 (2017)

  14. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In: Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, May 21–23, 2000, Portland, OR, USA, pp. 397–406 (2000)

  15. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

    Book  Google Scholar 

  16. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: On position restricted substring searching in succinct space. J. Discrete Algorithms 17, 109–114 (2012)

    Article  MathSciNet  Google Scholar 

  17. Hooshmand, S., Abedin, P., Külekci, M.O., Thankachan, S.V.: Non-overlapping indexing: cache obliviously. In: Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2–4, 2018 - Qingdao, China, pp. 8:1–8:9 (2018)

  18. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987)

    Article  MathSciNet  Google Scholar 

  19. Keller, O., Kopelowitz, T., Lewenstein, M.: Range non-overlapping indexing and successive list indexing. In: Algorithms and Data Structures, 10th International Workshop, WADS 2007, Halifax, Canada, August 15–17, 2007, Proceedings, pp. 625–636 (2007)

  20. Knuth, D.E., Morris Jr., J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)

    Article  MathSciNet  Google Scholar 

  21. Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: LATIN 2006: Theoretical Informatics, 7th Latin American Symposium, Valdivia, Chile, March 20–24, 2006, Proceedings, pp. 703–714 (2006)

  22. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MathSciNet  Google Scholar 

  23. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 1 (2007)

    Article  Google Scholar 

  24. Nekrich, Y., Navarro, G.: Sorted range reporting. In: Algorithm Theory—SWAT 2012: 13th Scandinavian Symposium and Workshops, Helsinki, Finland, July 4–6, 2012. Proceedings, pp. 271–282 (2012)

  25. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

    Article  MathSciNet  Google Scholar 

  26. Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15–17, 1973, pp. 1–11 (1973)

Download references

Acknowledgements

This research is funded in part by National Science Foundation (NSF) Grant CCF 1218904.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Arnab Ganguly.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ganguly, A., Shah, R. & Thankachan, S.V. Succinct Non-overlapping Indexing. Algorithmica 82, 107–117 (2020). https://doi.org/10.1007/s00453-019-00605-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-019-00605-5

Keywords

Navigation