Frequency-Constrained Substring Complexity

Pissis, Solon P.; Shekelyan, Michael; Liu, Chang; Loukides, Grigorios

doi:10.1007/978-3-031-43980-3_28


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14240)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into \(\tau \) intervals, \(\mathcal {I}=I_1,\ldots ,I_\tau \), the frequency-constrained substring complexity of x is the function \(f_{x,\mathcal {I}}(i,j)\) that maps i, j to the number of distinct substrings of length i of x occurring at least \(\alpha _j\) and at most \(\beta _j\) times in x, where \(I_j=[\alpha _j,\beta _j]\). We extend this notion as follows. For a string x, a dictionary \(\mathcal {D}\) of d strings (documents), and a partition of [d] into \(\tau \) intervals \(I_1,\ldots ,I_\tau \), we define a 2D array \(S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]\) as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least \(\alpha _j\) and at most \(\beta _j\) documents, where \(I_j=[\alpha _j,\beta _j]\). Array S can thus be seen as the distribution of the substring complexity of x into \(\tau \) document frequency classes. We show that after a linear-time preprocessing of \(\mathcal {D}\), for any x and any partition of [d] into \(\tau \) intervals given online, array S can be computed in near-optimal \(\mathcal {O}(|x| \tau \log \log d)\) time.
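
The two definitions above can be illustrated with a brute-force sketch. This is not the algorithm of the paper, which achieves \(\mathcal {O}(|x| \tau \log \log d)\) time after preprocessing; it only spells out what is being counted. The function names and the representation of each interval as an (alpha, beta) pair are assumptions made for this illustration.

```python
from collections import Counter


def frequency_constrained_complexity(x, intervals):
    """Brute-force f_{x,I}: row i-1, column j counts the distinct length-i
    substrings of x whose number of occurrences in x lies in intervals[j].
    `intervals` is assumed to be a list of (alpha, beta) pairs partitioning
    {1, ..., len(x)}."""
    n = len(x)
    table = [[0] * len(intervals) for _ in range(n)]
    for i in range(1, n + 1):
        # Count (possibly overlapping) occurrences of every length-i substring.
        occurrences = Counter(x[s:s + i] for s in range(n - i + 1))
        for count in occurrences.values():
            for j, (alpha, beta) in enumerate(intervals):
                if alpha <= count <= beta:
                    table[i - 1][j] += 1
                    break
    return table


def document_frequency_classes(x, docs, intervals):
    """Brute-force array S: S[i-1][j] counts the distinct length-i substrings
    of x that occur in at least alpha_j and at most beta_j of the documents.
    `intervals` is assumed to partition {1, ..., len(docs)}, so substrings
    occurring in no document fall into no class."""
    n = len(x)
    S = [[0] * len(intervals) for _ in range(n)]
    for i in range(1, n + 1):
        distinct = {x[s:s + i] for s in range(n - i + 1)}
        for w in distinct:
            doc_freq = sum(1 for doc in docs if w in doc)
            for j, (alpha, beta) in enumerate(intervals):
                if alpha <= doc_freq <= beta:
                    S[i - 1][j] += 1
                    break
    return S


f = frequency_constrained_complexity("abab", [(1, 1), (2, 4)])
S = document_frequency_classes("abab", ["ab", "ba", "abab"], [(1, 1), (2, 3)])
print(f)  # [[0, 2], [1, 1], [2, 0], [1, 0]]
print(S)  # [[0, 2], [0, 2], [2, 0], [1, 0]]
```

In the second call, d = 3 and the classes are "exactly one document" and "two or three documents": both length-2 substrings of x ("ab" and "ba") occur in two documents, while the length-3 and length-4 substrings occur only in the document "abab".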

Notes

1. By the notation [u] we denote \(\{1,2,\ldots ,u\}\).

Author information

Authors and Affiliations

CWI, Amsterdam, The Netherlands
Solon P. Pissis
Vrije Universiteit, Amsterdam, The Netherlands
Solon P. Pissis
Queen Mary University of London, London, UK
Michael Shekelyan
Zhejiang University, Medical Center, Zhejiang, China
Chang Liu
King’s College London, London, UK
Grigorios Loukides

Corresponding author

Correspondence to Solon P. Pissis.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Franco Maria Nardini
University of Pisa, Pisa, Italy
Nadia Pisanti
University of Pisa, Pisa, Italy
Rossano Venturini

About this paper

Cite this paper

Pissis, S.P., Shekelyan, M., Liu, C., Loukides, G. (2023). Frequency-Constrained Substring Complexity. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_28

DOI: https://doi.org/10.1007/978-3-031-43980-3_28
Published: 20 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science, Computer Science (R0)
