A Framework for Space-Efficient String Kernels

Algorithmica

Abstract

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a \(\mathtt {rangeDistinct}\) data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms run in O(n) time using a suitable implementation of the \(\mathtt {rangeDistinct}\) data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in \(O(n\log {\sigma })\) bits of space in addition to the input, where \(\sigma \) is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n that takes just \(3n\log {\sigma }+o(n\log {\sigma })\) bits of space, and that can be learnt in randomized O(n) time using \(O(n\log {\sigma })\) bits of space in addition to the input. Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in \(2m+o(m)\) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
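To make the object being computed concrete, the following is a minimal brute-force sketch of a k-mer kernel between two strings. It is not the space-efficient algorithm described in the abstract: it uses hash tables rather than the Burrows–Wheeler transform and \(\mathtt {rangeDistinct}\), and it assumes that the kernel is the cosine-normalized dot product of the two k-mer count vectors. The function names `kmer_counts` and `kmer_kernel` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s, overlapping occurrences included."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> float:
    """Brute-force k-mer kernel (illustrative only, not the BWT-based method).

    Assumption: the kernel is the dot product of the two k-mer count
    vectors divided by the product of their Euclidean norms.
    """
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Only k-mers that occur in both strings contribute to the dot product;
    # a Counter returns 0 for a missing key without inserting it.
    dot = sum(count * ct[kmer] for kmer, count in cs.items())
    norm_s = sqrt(sum(c * c for c in cs.values()))
    norm_t = sqrt(sum(c * c for c in ct.values()))
    norm = norm_s * norm_t
    # If either string has no k-mer (for example, it is shorter than k),
    # the kernel is defined here as 0.0 to avoid dividing by zero.
    return dot / norm if norm else 0.0
```

For example, `kmer_kernel("ACGT", "ACGA", 2)` compares the count vectors of {AC, CG, GT} and {AC, CG, GA}, which share two of three 2-mers, so the result is about 0.667. This sketch uses space proportional to the number of distinct k-mers, which is the kind of overhead the framework in the abstract is designed to avoid.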

Notes

  1. Here, abusing terminology, we say that a node is left-maximal if it is labeled by a left-maximal substring of \(\overline{T[1..n]}\#\). Notice that those substrings are the reverse of the right-maximal substrings we are interested in.

  2. Abusing terminology, we say that a node is a maximal repeat if its labeling string is a maximal repeat.

  3. Similarly to the case of \(\mathsf {MS}\), here \(\mathsf {RMS}\) will be represented using a bitvector \(\mathtt {rms}\) of 2m bits that is indexed at the end to support rank and select queries.

  4. The DNA alphabet consists of the following symbols in lexicographic order: \(\mathtt {A},\mathtt {C},\mathtt {G},\mathtt {T}\). DNA complementation is defined as \(\phi (\mathtt {A})=\mathtt {T}\), \(\phi (\mathtt {C})=\mathtt {G}\), \(\phi (\mathtt {G})=\mathtt {C}\), \(\phi (\mathtt {T})=\mathtt {A}\).
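The complementation map \(\phi \) of Note 4 can be sketched in a few lines. The names `COMPLEMENT`, `complement` and `reverse_complement` are hypothetical. The reverse-complement helper is an added assumption about how \(\phi \) is typically applied to a whole strand; the note itself defines only the per-symbol map.

```python
# Per-symbol DNA complementation, exactly as defined in Note 4.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(s: str) -> str:
    """Apply the complementation map to every symbol of s."""
    return "".join(COMPLEMENT[c] for c in s)


def reverse_complement(s: str) -> str:
    """Complement s and reverse it (assumption: the usual strand convention)."""
    return complement(s)[::-1]
```

Since \(\phi \) swaps A with T and C with G, applying `complement` twice returns the original string, and the same holds for `reverse_complement`.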

References

  1. Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Algorithms and Applications, pp. 34–44. Springer, Berlin (2010)

  2. Apostolico, A., Bejerano, G.: Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol. 7(3–4), 381–393 (2000)

  3. Apostolico, A., Denas, O.: Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol. Biol. 3(1), 13 (2008)

  4. Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)

  5. Bejerano, G., Seldin, Y., Margalit, H., Tishby, N.: Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics 17(10), 927–934 (2001)

  6. Bejerano, G., Yona, G.: Modeling protein families using probabilistic suffix trees. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pp. 15–24. ACM, New York (1999)

  7. Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)

  8. Belazzougui, D.: Linear time construction of compressed text indices in compact space. arXiv preprint arXiv:1401.0936 (2014)

  9. Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 03, 2014, pp. 148–193. ACM, New York (2014)

  10. Belazzougui, D., Cunial, F.: Indexed matching statistics and shortest unique substrings. In: String Processing and Information Retrieval, pp. 179–190. Springer, Berlin (2014)

  11. Belazzougui, D., Cunial, F.: A framework for space-efficient string kernels. In: Annual Symposium on Combinatorial Pattern Matching, pp. 13–25 (2015)

  12. Belazzougui, D., Cunial, F.: Space-efficient detection of unusual words. In: String Processing and Information Retrieval, pp. 222–233. Springer, Berlin (2015)

  13. Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows–Wheeler transform. In: Algorithms–ESA 2013, pp. 133–144. Springer, Berlin (2013)

  14. Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms (TALG) 10(4), 23 (2014)

  15. Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)

  16. Bühlmann, P., Wyner, A.J., et al.: Variable length Markov chains. Ann. Stat. 27(2), 480–513 (1999)

  17. Bunton, S.: Semantically motivated improvements for PPM variants. Comput. J. 40(2/3), 76–93 (1997)

  18. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)

  19. Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theor. Comput. Sci. 450, 109–116 (2012)

  20. Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)

  21. Chor, B., Horn, D., Goldman, N., Levy, Y., Massingham, T., et al.: Genomic DNA \(k\)-mer spectra: models and modalities. Genome Biol. 10(10), R108 (2009)

  22. Clark, D.: Compact Pat trees. Ph.D. thesis, University of Waterloo, Canada (1996)

  23. Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)

  24. Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inform. Process. Lett. 67(3), 111–117 (1998)

  25. Dekel, O., Shalev-Shwartz, S., Singer, Y.: Individual sequence prediction using memory-efficient context trees. IEEE Trans. Inform. Theory 55(11), 5251–5262 (2009)

  26. Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. SODA 95, 48–57 (1995)

  27. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings on 41st IEEE Symposium on Foundations of Computer Science (FOCS), pp. 390–398 (2000)

  28. Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)

  29. Gagie, T.: Rank and select operations on sequences. In: Encyclopedia of Algorithms, pp. 1776–1780. Springer, Berlin (2016)

  30. Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997)

  31. Gog, S.: Compressed suffix trees: design, construction, and applications. Ph.D. thesis, University of Ulm, Germany (2011)

  32. Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9(1), 167 (2008)

  33. Hozza, M., Vinař, T., Brejová, B.: How big is that genome? Estimating genome size and coverage from \(k\)-mer abundance spectra. In: String Processing and Information Retrieval, pp. 199–209. Springer, Berlin (2015)

  34. Ileri, A.M., Xu, B.: Shortest unique substring query revisited. In: Combinatorial Pattern Matching, pp. 172–181 (2014)

  35. Lin, J., Adjeroh, D., Jiang, B.H.: Probabilistic suffix array: efficient modeling and prediction of protein families. Bioinformatics 28(10), 1314–1323 (2012)

  36. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  37. Munro, I.: Tables. In: Proceedings of 16th FSTTCS, LNCS 1180, pp. 37–42 (1996)

  38. Qi, J., Wang, B., Hao, B.I.: Whole proteome prokaryote phylogeny without sequence alignment: a \(k\)-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)

  39. Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol. 16(12), 1615–1634 (2009)

  40. Rieck, K., Laskov, P.: Linear-time computation of similarity measures for sequential data. J. Mach. Learn. Res. 9, 23–48 (2008)

  41. Rieck, K., Laskov, P., Sonnenburg, S.: Computation of similarity measures for sequential data using generalized suffix trees. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2006)

  42. Rissanen, J., et al.: A universal data compression system. IEEE Trans. Inform. Theory 29(5), 656–664 (1983)

  43. Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn. 25(2–3), 117–149 (1996)

  44. Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Algorithms in Bioinformatics, pp. 306–317. Springer, Berlin (2008)

  45. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)

  46. Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. 106(8), 2677–2682 (2009)

  47. Smola, A.J., Vishwanathan, S.V.N.: Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15, pp. 585–592. MIT Press, London (2003)

  48. Marcus, S., Sokol, D.: Engineering small space dictionary matching. arXiv preprint arXiv:1301.6428 (2013)

  49. Teo, C.H., Vishwanathan, S.: Fast and space efficient string kernels using suffix arrays. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 929–936. ACM, New York (2006)

  50. Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)

  51. Weinberger, M.J., Rissanen, J.J., Feder, M.: A universal finite memory source. IEEE Trans. Inform. Theory 41(3), 643–652 (1995)

  52. Weiner, P.: Linear pattern matching algorithm. In: Proceedings of 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  53. Witten, I.H., Bell, T.C.: The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory 37(4), 1085–1094 (1991)

Author information

Corresponding author

Correspondence to Fabio Cunial.

Additional information

This work was partially supported by the Academy of Finland under Grant 284598 (Center of Excellence in Cancer Genetics Research). An early partial version of this paper appeared in Proc. CPM 2015 [11].

About this article

Cite this article

Belazzougui, D., Cunial, F. A Framework for Space-Efficient String Kernels. Algorithmica 79, 857–883 (2017). https://doi.org/10.1007/s00453-017-0286-4

  • DOI: https://doi.org/10.1007/s00453-017-0286-4
