Abstract
Two recent lower bounds on the compressibility of repetitive sequences, \(\delta \le \gamma \), have received much attention. It has been shown that a length-n string S over an alphabet of size \(\sigma \) can be represented within the optimal \(O(\delta \log \tfrac{n\log \sigma }{\delta \log n})\) space, and further, that within that space one can find all the occ occurrences in S of any length-m pattern in time \(O(m\log n + occ \log ^\epsilon n)\) for any constant \(\epsilon >0\). By contrast, the near-optimal search time \(O(m+({occ+1})\log ^\epsilon n)\) has been achieved only within \(O(\gamma \log \frac{n}{\gamma })\) space. The two results rely on considerably different locally consistent parsing techniques. The question of whether the better search time could be supported within the \(\delta \)-optimal space remained open. In this paper, we prove that both techniques can indeed be combined to obtain the best of both worlds: \(O(m+({occ+1})\log ^\epsilon n)\) search time within \(O(\delta \log \tfrac{n\log \sigma }{\delta \log n})\) space. Moreover, the number of occurrences can be computed in \(O(m+\log ^{2+\epsilon }n)\) time within \(O(\delta \log \tfrac{n\log \sigma }{\delta \log n})\) space. We also show that an extra sublogarithmic factor on top of this space enables optimal \(O(m+occ)\) search time, whereas an extra logarithmic factor enables optimal O(m) counting time.
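For readers unfamiliar with the measure \(\delta \), the following minimal sketch may help. It assumes the usual definition from the repetitiveness-measures literature (not restated in this abstract): \(\delta \) is the maximum, over all lengths k, of the number of distinct length-k substrings of S divided by k. The function name `substring_complexity` is hypothetical, and the quadratic-time enumeration is only illustrative, not the method used in the paper.

```python
from fractions import Fraction


def substring_complexity(s: str) -> Fraction:
    """Naive computation of delta under the assumed definition:
    max over k in [1..n] of (number of distinct length-k substrings) / k."""
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        # Collect every distinct substring of length k.
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best
```

Exact fractions are used so that values for different k can be compared without rounding. On highly repetitive inputs the value stays small even as the string grows, which is what makes \(\delta \) useful as a lower bound on compressibility.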
Notes
In this work, we assume that P[1..m] is represented in O(m) space. For small alphabets, the packed setting, where P occupies \(O(\lceil \frac{m\log \sigma }{\log n} \rceil )\) space, could also be considered; see [3, Sec. 2.2].
If \(\delta \log n \le \sqrt{n}\), then \(\delta \log \tfrac{n\log \sigma }{\delta \log \delta } \le \delta \log (n\log \sigma ) \le 2\delta \log (\sqrt{n} \log \sigma ) \le 2 \delta \log \tfrac{n\log \sigma }{\delta \log n}\). Otherwise, \(\log \delta> \frac{1}{2}\log n-\log \log n > \frac{1}{4} \log n\), so \(\delta \log \tfrac{n\log \sigma }{\delta \log \delta } \le \delta \log \tfrac{4n\log \sigma }{\delta \log n}=\delta \log \tfrac{n\log \sigma }{\delta \log n}+ 2\delta \).
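The two cases of this note can be spelled out step by step. The following is a sketch under side conditions that the note does not state explicitly and that are assumed here: logarithms are base 2, \(\delta \ge 2\), \(\sigma \ge 2\), and n is large enough that \(\log \log n < \frac{1}{4}\log n\) (which holds for \(n > 2^{16}\)).

```latex
% Case 1: \delta \log n \le \sqrt{n}
\begin{align*}
\delta \log \tfrac{n\log \sigma}{\delta \log \delta}
  &\le \delta \log (n\log \sigma)
     && \text{since } \delta \log \delta \ge 1 \\
  &\le 2\delta \log (\sqrt{n}\log \sigma)
     && \text{since } (\sqrt{n}\log \sigma)^2 = n\log^2 \sigma \ge n\log \sigma \\
  &\le 2\delta \log \tfrac{n\log \sigma}{\delta \log n}
     && \text{since } \sqrt{n} \le \tfrac{n}{\delta \log n}
\end{align*}

% Case 2: \delta \log n > \sqrt{n}, hence
% \log \delta > \tfrac{1}{2}\log n - \log \log n > \tfrac{1}{4}\log n
\begin{align*}
\delta \log \tfrac{n\log \sigma}{\delta \log \delta}
  &\le \delta \log \tfrac{4n\log \sigma}{\delta \log n}
     && \text{since } \tfrac{1}{\log \delta} < \tfrac{4}{\log n} \\
  &= \delta \log \tfrac{n\log \sigma}{\delta \log n} + 2\delta
     && \text{since } \log 4 = 2
\end{align*}
```

In both cases, the expression with \(\log \delta \) in the denominator is bounded by a constant multiple of the one with \(\log n\), plus an \(O(\delta )\) term.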
The unproductive tests \(u. next = null \) are charged to the primary occurrence v if the label of u is A, or to the first secondary occurrence of u, which exists by the definition of \(u. anc \), otherwise.
References
Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron, M.J., Iyer, R., Schatz, M.C., Sinha, S., Robinson, G.E.: Big data: astronomical or genomical? PLoS Biol. 13(7), 1002195 (2015). https://doi.org/10.1371/journal.pbio.1002195
Navarro, G.: Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv. 54(2), 26:1–26:32 (2021). https://doi.org/10.1145/3432999
Navarro, G.: Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv. 54(2), 29:1–29:31 (2021). https://doi.org/10.1145/3434399
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theoret. Comput. Sci. 483, 115–133 (2013). https://doi.org/10.1016/j.tcs.2012.02.006
Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pp. 827–840 (2018). https://doi.org/10.1145/3188745.3188814
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 8:1–8:39 (2021). https://doi.org/10.1145/3426473
Kociumaka, T., Navarro, G., Prezza, N.: Towards a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory (2022). https://doi.org/10.1109/TIT.2022.3224382
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: 19th International Symposium on String Processing and Information Retrieval, SPIRE 2012. LNCS, vol. 7608, pp. 180–192 (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Claude, F., Navarro, G., Pacheco, A.: Grammar-compressed indexes with logarithmic search time. J. Comput. Syst. Sci. 118, 53–74 (2021). https://doi.org/10.1016/j.jcss.2020.12.001
Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997). https://doi.org/10.1007/BF02522825
Jeż, A.: A really simple approximation of smallest grammar. Theoret. Comput. Sci. 616, 141–150 (2016). https://doi.org/10.1016/j.tcs.2015.12.032
Kociumaka, T., Navarro, G., Olivares, F.: Near-optimal search time in \(\delta \)-optimal space. In: 15th Latin American Symposium on Theoretical Informatics, LATIN 2022. LNCS, vol. 13568, pp. 88–103 (2022). https://doi.org/10.1007/978-3-031-20624-5_6
Batu, T., Sahinalp, S.C.: Locally consistent parsing and applications to approximate string comparisons. In: 9th International Conference on Developments in Language Theory, DLT 2005. LNCS, vol. 3572, pp. 22–35 (2005). https://doi.org/10.1007/11505877_3
Cole, R., Vishkin, U.: Deterministic coin tossing and accelerating cascades: Micro and macro techniques for designing parallel algorithms. In: 18th Annual ACM Symposium on Theory of Computing, STOC 1986, pp. 206–219 (1986). https://doi.org/10.1145/12130.12151
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica 65(3), 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6
Kociumaka, T., Radoszewski, J., Rytter, W., Waleń, T.: Internal pattern matching queries in text and applications (2023) arXiv:1311.6235v5
Birenzwige, O., Golan, S., Porat, E.: Locally consistent parsing for text indexing in small space. In: 31st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, pp. 607–626 (2020). https://doi.org/10.1137/1.9781611975994.37
Kempa, D., Kociumaka, T.: Dynamic suffix array with polylogarithmic queries and updates. In: 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pp. 1657–1670 (2022). https://doi.org/10.1145/3519935.3520061
Sahinalp, S.C., Vishkin, U.: On a parallel-algorithms method for string matching problems (overview). In: 2nd Italian Conference on Algorithms and Complexity, CIAC 1994. LNCS, vol. 778, pp. 22–32 (1994). https://doi.org/10.1007/3-540-57811-0_3
Chan, T.M., Larsen, K.G., Pătraşcu, M.: Orthogonal range searching on the RAM, revisited. SoCG ’11, pp. 1–10. Association for Computing Machinery, New York (2011). https://doi.org/10.1145/1998196.1998198
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249
Navarro, G.: Computing MEMs on repetitive text collections. In: 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023. LIPIcs, vol. 259, pp. 24:1–24:17 (2023). https://doi.org/10.4230/LIPIcs.CPM.2023.24
Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984). https://doi.org/10.1145/828.1884
Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, pp. 198–207 (2000). https://doi.org/10.1109/SFCS.2000.892088
Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16(1), 109–114 (1965). https://doi.org/10.1090/s0002-9939-1965-0174934-9
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms 10(4), 23:1–23:19 (2014). https://doi.org/10.1145/2635816
Kempa, D., Kociumaka, T.: Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In: Proceedings of 64th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (2023) https://doi.org/10.48550/arXiv.2308.03635
Author information
Contributions
T.K. and G.N. defined the main idea, i.e., to verify whether restricted block compression would yield an RLSLP of size \(O(\delta \log (n/\delta ))\), improving upon the \(O(\gamma \log (n/\gamma ))\) space achieved using the ordinary block compression (without pausing), that could be used for pattern matching within the times of the larger index. F.O. worked on this problem under the supervision of G.N., and they prepared the first draft of the manuscript. In the conference version, T.K. contributed a few of the most subtle proofs, including the bound on the expected total size of all productions, and simplified the construction of the set \(M(P)\). While preparing the journal version, F.O. and G.N. developed the counting version of the \(O(\delta \log (n/\delta ))\)-size index and the slightly larger variants allowing for optimal counting and reporting. T.K. refined the analysis of the RLSLP size to achieve an \(O(\delta \log \tfrac{n\log \sigma }{\delta \log n})\) bound and adapted all the proofs so that the index sizes decreased accordingly.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Funded in part by Basal Funds FB0001, Fondecyt Grant 1-200038, and Ph.D. Scholarship 21210579, ANID, Chile.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kociumaka, T., Navarro, G. & Olivares, F. Near-Optimal Search Time in \(\delta \)-Optimal Space, and Vice Versa. Algorithmica 86, 1031–1056 (2024). https://doi.org/10.1007/s00453-023-01186-0
DOI: https://doi.org/10.1007/s00453-023-01186-0