Abstract
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that either are space-inefficient or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a \(\mathtt {rangeDistinct}\) data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the \(\mathtt {rangeDistinct}\) data structure. By concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in \(O(n\log {\sigma })\) bits of space in addition to the input, where \(\sigma \) is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, which takes just \(3n\log {\sigma }+o(n\log {\sigma })\) bits of space and can be learnt in randomized O(n) time using \(O(n\log {\sigma })\) bits of space in addition to the input. Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in \(2m+o(m)\) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
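To make the objects named in the abstract concrete, the following is a minimal, naive sketch, not the space-efficient algorithm of the paper. It assumes the k-mer kernel is normalized as the cosine of the two k-mer count vectors (the paper's exact definition may differ), and it builds the Burrows–Wheeler transform by sorting rotations. All function names (`kmer_counts`, `kmer_kernel`, `naive_bwt`, `range_distinct`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt

def kmer_counts(s, k):
    """Count every substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_kernel(s, t, k):
    """Cosine of the k-mer count vectors of s and t (assumed normalization)."""
    fs, ft = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(c * ft[w] for w, c in fs.items())
    norm = sqrt(sum(c * c for c in fs.values()) * sum(c * c for c in ft.values()))
    return dot / norm if norm else 0.0

def naive_bwt(s, end="#"):
    """BWT of s + end by sorting all rotations; illustration only."""
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def range_distinct(bwt, i, j):
    """Distinct characters in bwt[i..j] (0-based, inclusive) with their counts."""
    return Counter(bwt[i:j + 1])
```

For example, `kmer_kernel("ACGT", "ACGT", 2)` compares a string with itself, and `range_distinct(naive_bwt("banana"), 0, 2)` lists the distinct characters in the first three positions of the transform. The hash table and the sorted rotations used here take far more than the o(n) extra bits claimed in the abstract; they only show what is being computed.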
Notes
Here, abusing terminology, we say that a node is left-maximal if it is labeled by a left-maximal substring of \(\overline{T[1..n]}\#\). Notice that those substrings are the reverse of the right-maximal substrings we are interested in.
Abusing terminology, we say that a node is a maximal-repeat if its labeling string is a maximal repeat.
Similarly to the case of \(\mathsf {MS}\), here \(\mathsf {RMS}\) will be represented using a bitvector \(\mathtt {rms}\) of 2m bits, which is indexed at the end to support rank and select queries.
The DNA alphabet consists of the following symbols in lexicographic order: \(\mathtt {A},\mathtt {C},\mathtt {G},\mathtt {T}\). DNA complementation is defined as \(\phi (\mathtt {A})=\mathtt {T}\), \(\phi (\mathtt {C})=\mathtt {G}\), \(\phi (\mathtt {G})=\mathtt {C}\), \(\phi (\mathtt {T})=\mathtt {A}\).
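The complementation map \(\phi \) in this note can be sketched as a lookup table. The helper names `complement` and `reverse_complement` are hypothetical; the reverse complement is included only as the usual application of \(\phi \) to double-stranded DNA and is an assumption, since the note itself defines just the per-symbol map.

```python
# Per-symbol DNA complementation, as defined in the note.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(s):
    """Apply the complementation map to every symbol of s."""
    return "".join(COMPLEMENT[c] for c in s)

def reverse_complement(s):
    """Complement s and reverse it (assumed usage, not stated in the note)."""
    return complement(s)[::-1]
```

Since the map swaps A with T and C with G, applying `complement` twice returns the original string.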
References
Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Algorithms and Applications, pp. 34–44. Springer, Berlin (2010)
Apostolico, A., Bejerano, G.: Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol. 7(3–4), 381–393 (2000)
Apostolico, A., Denas, O.: Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol. Biol. 3(1), 13 (2008)
Begleiter, R., El-Yaniv, R., Yona, G.: On prediction using variable order Markov models. J. Artif. Intell. Res. 22, 385–421 (2004)
Bejerano, G., Seldin, Y., Margalit, H., Tishby, N.: Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics 17(10), 927–934 (2001)
Bejerano, G., Yona, G.: Modeling protein families using probabilistic suffix trees. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology, pp. 15–24. ACM, New York (1999)
Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
Belazzougui, D.: Linear time construction of compressed text indices in compact space. arXiv preprint arXiv:1401.0936 (2014)
Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31–June 03, 2014, pp. 148–193. ACM, New York (2014)
Belazzougui, D., Cunial, F.: Indexed matching statistics and shortest unique substrings. In: String Processing and Information Retrieval, pp. 179–190. Springer, Berlin (2014)
Belazzougui, D., Cunial, F.: A framework for space-efficient string kernels. In: Annual Symposium on Combinatorial Pattern Matching, pp. 13–25 (2015)
Belazzougui, D., Cunial, F.: Space-efficient detection of unusual words. In: String Processing and Information Retrieval, pp. 222–233. Springer, Berlin (2015)
Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional Burrows–Wheeler transform. In: Algorithms–ESA 2013, pp. 133–144. Springer, Berlin (2013)
Belazzougui, D., Navarro, G.: Alphabet-independent compressed text indexing. ACM Trans. Algorithms (TALG) 10(4), 23 (2014)
Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms 18, 3–13 (2013)
Bühlmann, P., Wyner, A.J., et al.: Variable length Markov chains. Ann. Stat. 27(2), 480–513 (1999)
Bunton, S.: Semantically motivated improvements for PPM variants. Comput. J. 40(2/3), 76–93 (1997)
Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation (1994)
Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theor. Comput. Sci. 450, 109–116 (2012)
Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)
Chor, B., Horn, D., Goldman, N., Levy, Y., Massingham, T., et al.: Genomic DNA \(k\)-mer spectra: models and modalities. Genome Biol. 10(10), R108 (2009)
Clark, D.: Compact Pat trees. Ph.D. thesis, University of Waterloo, Canada (1996)
Cleary, J.G., Witten, I.H.: Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 32(4), 396–402 (1984)
Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inform. Process. Lett. 67(3), 111–117 (1998)
Dekel, O., Shalev-Shwartz, S., Singer, Y.: Individual sequence prediction using memory-efficient context trees. IEEE Trans. Inform. Theory 55(11), 5251–5262 (2009)
Farach, M., Noordewier, M., Savari, S., Shepp, L., Wyner, A., Ziv, J.: On the entropy of DNA: algorithms and measurements based on memory and rapid convergence. SODA 95, 48–57 (1995)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings on 41st IEEE Symposium on Foundations of Computer Science (FOCS), pp. 390–398 (2000)
Ferragina, P., Manzini, G.: Indexing compressed texts. J. ACM 52(4), 552–581 (2005)
Gagie, T.: Rank and select operations on sequences. In: Encyclopedia of Algorithms, pp. 1776–1780. Springer, Berlin (2016)
Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997)
Gog, S.: Compressed suffix trees: design, construction, and applications. Ph.D. thesis, University of Ulm, Germany (2011)
Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9(1), 167 (2008)
Hozza, M., Vinař, T., Brejová, B.: How big is that genome? Estimating genome size and coverage from \(k\)-mer abundance spectra. In: String Processing and Information Retrieval, pp. 199–209. Springer, Berlin (2015)
Ileri, A.M., Xu, B.: Shortest unique substring query revisited. In: Combinatorial Pattern Matching, pp. 172–181 (2014)
Lin, J., Adjeroh, D., Jiang, B.H.: Probabilistic suffix array: efficient modeling and prediction of protein families. Bioinformatics 28(10), 1314–1323 (2012)
Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Munro, I.: Tables. In: Proceedings of 16th FSTTCS, LNCS 1180, pp. 37–42 (1996)
Qi, J., Wang, B., Hao, B.I.: Whole proteome prokaryote phylogeny without sequence alignment: a \(k\)-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)
Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol. 16(12), 1615–1634 (2009)
Rieck, K., Laskov, P.: Linear-time computation of similarity measures for sequential data. J. Mach. Learn. Res. 9, 23–48 (2008)
Rieck, K., Laskov, P., Sonnenburg, S.: Computation of similarity measures for sequential data using generalized suffix trees. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2006)
Rissanen, J., et al.: A universal data compression system. IEEE Trans. Inform. Theory 29(5), 656–664 (1983)
Ron, D., Singer, Y., Tishby, N.: The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn. 25(2–3), 117–149 (1996)
Schulz, M.H., Weese, D., Rausch, T., Döring, A., Reinert, K., Vingron, M.: Fast and adaptive variable order Markov chain construction. In: Algorithms in Bioinformatics, pp. 306–317. Springer, Berlin (2008)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Sims, G.E., Jun, S.R., Wu, G.A., Kim, S.H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. 106(8), 2677–2682 (2009)
Smola, A.J., Vishwanathan, S.V.N.: Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems, vol. 15, pp. 585–592. MIT Press, London (2003)
Sokol, S.M.D.: Engineering small space dictionary matching. arXiv preprint arXiv:1301.6428 (2013)
Teo, C.H., Vishwanathan, S.: Fast and space efficient string kernels using suffix arrays. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 929–936. ACM, New York (2006)
Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)
Weinberger, M.J., Rissanen, J.J., Feder, M.: A universal finite memory source. IEEE Trans. Inform. Theory 41(3), 643–652 (1995)
Weiner, P.: Linear pattern matching algorithm. In: Proceedings of 14th Annual IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Witten, I.H., Bell, T.C.: The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory 37(4), 1085–1094 (1991)
Additional information
This work was partially supported by Academy of Finland under Grant 284598 (Center of Excellence in Cancer Genetics Research). An early partial version of this paper appeared in Proc. CPM 2015 [11].
Cite this article
Belazzougui, D., Cunial, F. A Framework for Space-Efficient String Kernels. Algorithmica 79, 857–883 (2017). https://doi.org/10.1007/s00453-017-0286-4
DOI: https://doi.org/10.1007/s00453-017-0286-4