A Framework for Space-Efficient String Kernels

Published: 07 February 2017

Algorithmica, volume 79, pages 857–883 (2017)

Djamal Belazzougui¹ &
Fabio Cunial²

Abstract

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a \(\mathtt {rangeDistinct}\) data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the \(\mathtt {rangeDistinct}\) data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in \(O(n\log {\sigma })\) bits of space in addition to the input, where \(\sigma \) is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n that takes just \(3n\log {\sigma }+o(n\log {\sigma })\) bits of space, and that can be learnt in randomized O(n) time using \(O(n\log {\sigma })\) bits of space in addition to the input. Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in \(2m+o(m)\) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
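
For orientation, the k-mer kernel named above is commonly defined as the inner product of the k-mer count vectors of the two strings. The following is a minimal, naive sketch of that definition using hash tables. It is not the BWT-based, space-efficient algorithm of the paper, and the function names and the cosine normalization are illustrative assumptions rather than anything taken from the article.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s, k):
    """Count every length-k substring of s (empty if k > len(s))."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Iterate over the smaller table; Counter returns 0 for missing keys.
    if len(cs) > len(ct):
        cs, ct = ct, cs
    return sum(c * ct[w] for w, c in cs.items())


def normalized_kmer_kernel(s, t, k):
    """Cosine-normalized variant (an assumed, commonly used normalization)."""
    denom = sqrt(kmer_kernel(s, s, k) * kmer_kernel(t, t, k))
    return kmer_kernel(s, t, k) / denom if denom else 0.0
```

This baseline stores one table entry per distinct k-mer, so its memory grows with the number of distinct k-mers. That cost is the kind of overhead the abstract's o(n)-bit bound is meant to avoid.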

Notes

Here, abusing terminology, we say that a node is left-maximal if it is labeled by a left-maximal substring of \(\overline{T[1..n]}\#\). Notice that those substrings are the reverse of the right-maximal substrings we are interested in.
Abusing terminology, we say that a node is a maximal-repeat if its labeling string is a maximal repeat.
Similarly to the case of \(\mathsf {MS}\), here \(\mathsf {RMS}\) will be represented using a bitvector \(\mathtt {rms}\) of length 2m bits that is indexed in the end to support rank and select queries.
The DNA alphabet consists of the following symbols in lexicographic order: \(\mathtt {A},\mathtt {C},\mathtt {G},\mathtt {T}\). DNA complementation is defined as \(\phi (\mathtt {A})=\mathtt {T}\), \(\phi (\mathtt {C})=\mathtt {G}\), \(\phi (\mathtt {G})=\mathtt {C}\), \(\phi (\mathtt {T})=\mathtt {A}\).
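
The complementation map \(\phi \) in the last note can be written down directly. The sketch below is only an illustration: the names `COMPLEMENT`, `complement` and `reverse_complement` are hypothetical, and combining \(\phi \) with string reversal is the usual reverse-complement convention, assumed here rather than stated in the note.

```python
# Complementation map phi on the DNA alphabet A < C < G < T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(s):
    """Apply phi to every character of s."""
    return "".join(COMPLEMENT[c] for c in s)


def reverse_complement(s):
    """Complement s and reverse it (assumed convention, see lead-in)."""
    return complement(s)[::-1]
```

Note that \(\phi \) is an involution that reverses the lexicographic order of the alphabet, so applying `complement` twice returns the original string.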

References

[1] Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Algorithms and Applications, pp. 34–44. Springer, Berlin (2010)
[2] Apostolico, A., Bejerano, G.: Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol. 7(3–4), 381–393 (2000)
[3] Apostolico, A., Denas, O.: Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol. Biol. 3(1), 13 (2008)
[4] Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)
[5] Bejerano, G., Seldin, Y., Margalit, H., Tishby, N.: Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics 17(10), 927–934 (2001)
[6] Bejerano, G., Yona, G.: Modeling protein families using probabilistic suffix trees. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pp. 15–24. ACM, New York (1999)
[7] Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
[8] Belazzougui, D.: Linear time construction of compressed text indices in compact space. arXiv preprint arXiv:1401.0936 (2014)
[9] Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 03, 2014, pp. 148–193. ACM, New York (2014)
[10] Belazzougui, D., Cunial, F.: Indexed matching statistics and shortest unique substrings. In: String Processing and Information Retrieval, pp. 179–190. Springer, Berlin (2014)
[11] Belazzougui, D., Cunial, F.: A framework for space-efficient string kernels. In: Annual Symposium on Combinatorial Pattern Matching, pp. 13–25 (2015)
[12] Belazzougui, D., Cunial, F.: Space-efficient detection of unusual words. In: String Processing and Information Retrieval, pp. 222–233. Springer, Berlin (2015)
[13] Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows–Wheeler transform. In: Algorithms–ESA 2013, pp. 133–144. Springer, Berlin (2013)
[14] Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms (TALG) 10(4), 23 (2014)
[15] Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)
[16] Bühlmann, P., Wyner, A.J., et al.: Variable length Markov chains. Ann. Stat. 27(2), 480–513 (1999)
[17] Bunton, S.: Semantically motivated improvements for PPM variants. Comput. J. 40(2/3), 76–93 (1997)
[18] Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)
[19] Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theor. Comput. Sci. 450, 109–116 (2012)
[20] Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)
[21] Chor, B., Horn, D., Goldman, N., Levy, Y., Massingham, T., et al.: Genomic DNA \(k\)-mer spectra: models and modalities. Genome Biol. 10(10), R108 (2009)
[22] Clark, D.: Compact Pat trees. Ph.D. thesis, University of Waterloo, Canada (1996)
[23] Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
[24] Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inform. Process. Lett. 67(3), 111–117 (1998)
[25] Dekel, O., Shalev-Shwartz, S., Singer, Y.: Individual sequence prediction using memory-efficient context trees. IEEE Trans. Inform. Theory 55(11), 5251–5262 (2009)
[26] Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. SODA 95, 48–57 (1995)
[27] Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings on 41st IEEE Symposium on Foundations of Computer Science (FOCS), pp. 390–398 (2000)
[28] Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)
[29] Gagie, T.: Rank and select operations on sequences. In: Encyclopedia of Algorithms, pp. 1776–1780. Springer, Berlin (2016)
[30] Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997)
[31] Gog, S.: Compressed suffix trees: design, construction, and applications. Ph.D. thesis, University of Ulm, Germany (2011)
[32] Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9(1), 167 (2008)
[33] Hozza, M., Vinař, T., Brejová, B.: How big is that genome? Estimating genome size and coverage from \(k\)-mer abundance spectra. In: String Processing and Information Retrieval, pp. 199–209. Springer, Berlin (2015)
[34] Ileri, A.M., Xu, B.: Shortest unique substring query revisited. In: Combinatorial Pattern Matching, pp. 172–181 (2014)
[35] Lin, J., Adjeroh, D., Jiang, B.H.: Probabilistic suffix array: efficient modeling and prediction of protein families. Bioinformatics 28(10), 1314–1323 (2012)
[36] Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
[37] Munro, I.: Tables. In: Proceedings of 16th FSTTCS, LNCS 1180, pp. 37–42 (1996)
[38] Qi, J., Wang, B., Hao, B.I.: Whole proteome prokaryote phylogeny without sequence alignment: a \(k\)-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)
[39] Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol. 16(12), 1615–1634 (2009)
[40] Rieck, K., Laskov, P.: Linear-time computation of similarity measures for sequential data. J. Mach. Learn. Res. 9, 23–48 (2008)
[41] Rieck, K., Laskov, P., Sonnenburg, S.: Computation of similarity measures for sequential data using generalized suffix trees. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2006)
[42] Rissanen, J., et al.: A universal data compression system. IEEE Trans. Inform. Theory 29(5), 656–664 (1983)
[43] Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn. 25(2–3), 117–149 (1996)
[44] Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Algorithms in Bioinformatics, pp. 306–317. Springer, Berlin (2008)
[45] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
[46] Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. 106(8), 2677–2682 (2009)
[47] Smola, A.J., Vishwanathan, S.V.N.: Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15, pp. 585–592. MIT Press, London (2003)
[48] Sokol, S.M.D.: Engineering small space dictionary matching. arXiv preprint arXiv:1301.6428 (2013)
[49] Teo, C.H., Vishwanathan, S.: Fast and space efficient string kernels using suffix arrays. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 929–936. ACM, New York (2006)
[50] Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)
[51] Weinberger, M.J., Rissanen, J.J., Feder, M.: A universal finite memory source. IEEE Trans. Inform. Theory 41(3), 643–652 (1995)
[52] Weiner, P.: Linear pattern matching algorithm. In: Proceedings of 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
[53] Witten, I.H., Bell, T.C.: The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory 37(4), 1085–1094 (1991)

Author information

Authors and Affiliations

Centre de Recherche sur l’Information Scientifique et Technique (DTISI-CERIST), 16306, Algiers, Algeria
Djamal Belazzougui
Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG), Pfotenhauerstr. 108, 01307, Dresden, Germany
Fabio Cunial

Corresponding author

Correspondence to Fabio Cunial.

Additional information

This work was partially supported by the Academy of Finland under Grant 284598 (Center of Excellence in Cancer Genetics Research). An early partial version of this paper appeared in Proc. CPM 2015 [11].

About this article

Cite this article

Belazzougui, D., Cunial, F. A Framework for Space-Efficient String Kernels. Algorithmica 79, 857–883 (2017). https://doi.org/10.1007/s00453-017-0286-4

Received: 01 October 2015
Accepted: 27 January 2017
Published: 07 February 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s00453-017-0286-4
