Abstract
We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into \(\tau \) intervals, \(\mathcal {I}=I_1,\ldots ,I_\tau \), the frequency-constrained substring complexity of x is the function \(f_{x,\mathcal {I}}(i,j)\) that maps each pair (i, j) to the number of distinct substrings of length i of x occurring at least \(\alpha _j\) and at most \(\beta _j\) times in x, where \(I_j=[\alpha _j,\beta _j]\). We extend this notion as follows. For a string x, a dictionary \(\mathcal {D}\) of d strings (documents), and a partition of [d] into \(\tau \) intervals \(I_1,\ldots ,I_\tau \), we define a 2D array \(S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]\): S[i, j] is the number of distinct substrings of length i of x occurring in at least \(\alpha _j\) and at most \(\beta _j\) documents, where \(I_j=[\alpha _j,\beta _j]\). Array S can thus be seen as the distribution of the substring complexity of x into \(\tau \) document frequency classes. We show that after a linear-time preprocessing of \(\mathcal {D}\), for any x and any partition of [d] into \(\tau \) intervals given online, array S can be computed in near-optimal \(\mathcal {O}(|x| \tau \log \log d)\) time.
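To make the two definitions concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper (which reaches \(\mathcal {O}(|x| \tau \log \log d)\) time after linear-time preprocessing); it simply enumerates substrings and counts them directly. The function names `frequency_constrained_complexity` and `document_frequency_classes` are hypothetical, and the intervals are assumed to be given as inclusive `(alpha, beta)` pairs.

```python
from collections import Counter


def frequency_constrained_complexity(x, intervals):
    """Brute-force table f where f[i - 1][j] is the number of distinct
    substrings of length i of x whose number of occurrences in x lies in
    intervals[j] = (alpha_j, beta_j), both bounds inclusive."""
    n = len(x)
    table = []
    for length in range(1, n + 1):
        # Count the occurrences of every substring of this length.
        counts = Counter(x[s:s + length] for s in range(n - length + 1))
        table.append([
            sum(1 for c in counts.values() if alpha <= c <= beta)
            for alpha, beta in intervals
        ])
    return table


def document_frequency_classes(x, documents, intervals):
    """Brute-force array S where S[i - 1][j] is the number of distinct
    substrings of length i of x that occur in at least alpha_j and at
    most beta_j of the given documents."""
    n = len(x)
    table = []
    for length in range(1, n + 1):
        distinct = {x[s:s + length] for s in range(n - length + 1)}
        row = [0] * len(intervals)
        for w in distinct:
            # Document frequency: number of documents containing w.
            df = sum(1 for doc in documents if w in doc)
            for j, (alpha, beta) in enumerate(intervals):
                if alpha <= df <= beta:
                    row[j] += 1
        table.append(row)
    return table


# x = "abab" with the partition [1,1], [2,4] of [4]
f = frequency_constrained_complexity("abab", [(1, 1), (2, 4)])

# x = "ab" against d = 3 documents with the partition [1,1], [2,3] of [3]
S = document_frequency_classes("ab", ["ab", "ba", "cc"], [(1, 1), (2, 3)])
```

In the first example, the length-2 substrings of "abab" are "ab" (two occurrences) and "ba" (one occurrence), so the second row of `f` is `[1, 1]`. Because the intervals partition [d], a substring of x that occurs in no document falls in no class and is not counted in `S`.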
Notes
1. By the notation [u] we denote \(\{1,2,\ldots ,u\}\).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pissis, S.P., Shekelyan, M., Liu, C., Loukides, G. (2023). Frequency-Constrained Substring Complexity. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_28
DOI: https://doi.org/10.1007/978-3-031-43980-3_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science, Computer Science (R0)