Frequency-Constrained Substring Complexity

  • Conference paper
String Processing and Information Retrieval (SPIRE 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14240))

Abstract

We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] into \(\tau \) intervals, \(\mathcal {I}=I_1,\ldots ,I_\tau \), the frequency-constrained substring complexity of x is the function \(f_{x,\mathcal {I}}(i,j)\) that maps (i, j) to the number of distinct substrings of length i of x occurring at least \(\alpha _j\) and at most \(\beta _j\) times in x, where \(I_j=[\alpha _j,\beta _j]\). We extend this notion as follows. For a string x, a dictionary \(\mathcal {D}\) of d strings (documents), and a partition of [d] into \(\tau \) intervals \(I_1,\ldots ,I_\tau \), we define a 2D array \(S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]\) as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least \(\alpha _j\) and at most \(\beta _j\) documents, where \(I_j=[\alpha _j,\beta _j]\). Array S can thus be seen as the distribution of the substring complexity of x into \(\tau \) document frequency classes. We show that after a linear-time preprocessing of \(\mathcal {D}\), for any x and any partition of [d] into \(\tau \) intervals given online, array S can be computed in near-optimal \(\mathcal {O}(|x| \tau \log \log d)\) time.
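To make the two definitions concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it enumerates substrings directly and runs in polynomial rather than \(\mathcal {O}(|x| \tau \log \log d)\) time. The function names `freq_constrained_complexity` and `document_frequency_array`, and the representation of each interval \(I_j\) as a pair `(alpha, beta)`, are assumptions made here for illustration only.

```python
from collections import Counter


def freq_constrained_complexity(x, intervals):
    """Naive f_{x,I}: dict mapping (i, j) -> count, both indices 1-based.

    `intervals` is a list of (alpha, beta) pairs assumed to partition
    [n] = {1, ..., n}, where n = len(x).
    """
    n = len(x)
    f = {}
    for i in range(1, n + 1):
        # Number of occurrences (overlaps included) of each distinct
        # substring of length i of x.
        occ = Counter(x[s:s + i] for s in range(n - i + 1))
        for j, (alpha, beta) in enumerate(intervals, start=1):
            f[(i, j)] = sum(1 for c in occ.values() if alpha <= c <= beta)
    return f


def document_frequency_array(x, docs, intervals):
    """Naive array S: dict mapping (i, j) -> count, both indices 1-based.

    `intervals` is a list of (alpha, beta) pairs assumed to partition
    [d] = {1, ..., d}, where d = len(docs). A substring of x that occurs
    in no document falls into no class and is not counted.
    """
    n = len(x)
    S = {}
    for i in range(1, n + 1):
        distinct = {x[s:s + i] for s in range(n - i + 1)}
        # Document frequency: number of documents containing the substring.
        df = [sum(1 for doc in docs if w in doc) for w in distinct]
        for j, (alpha, beta) in enumerate(intervals, start=1):
            S[(i, j)] = sum(1 for c in df if alpha <= c <= beta)
    return S
```

For example, with x = "abab" and the partition [1,1], [2,4] of [4], the substrings "a", "b", and "ab" each occur twice and all others occur once, so `freq_constrained_complexity("abab", [(1, 1), (2, 4)])` assigns 0 and 2 to length 1, 1 and 1 to length 2, 2 and 0 to length 3, and 1 and 0 to length 4.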

Notes

  1. By the notation [u] we denote \(\{1,2,\ldots ,u\}\).

Author information

Corresponding author

Correspondence to Solon P. Pissis.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pissis, S.P., Shekelyan, M., Liu, C., Loukides, G. (2023). Frequency-Constrained Substring Complexity. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_28

  • DOI: https://doi.org/10.1007/978-3-031-43980-3_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43979-7

  • Online ISBN: 978-3-031-43980-3

  • eBook Packages: Computer Science, Computer Science (R0)
