A Framework for Space-Efficient String Kernels

  • Djamal Belazzougui
  • Fabio CunialEmail author
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9133)


String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact string kernels, like the \(k\)-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in \(O(nd)\) time and in \(o(n)\) bits of space in addition to the input, using just a \(\mathtt {rangeDistinct}\) data structure on the Burrows-Wheeler transform of the input strings that takes \(O(d)\) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of \(k\), like the \(k\)-mer profile and the \(k\)-th order empirical entropy, and for calibrating the value of \(k\) using the data.


  1. 1.
    Apostolico, A.: Maximal words in sequence comparisons based on subword composition. In: Elomaa, T., Mannila, H., Orponen, P. (eds.) Ukkonen Festschrift 2010. LNCS, vol. 6060, pp. 34–44. Springer, Heidelberg (2010) CrossRefGoogle Scholar
  2. 2.
    Apostolico, A., Denas, O.: Fast algorithms for computing sequence distances by exhaustive substring composition. Algorithms Mol. Biol. 3(1), 13 (2008)CrossRefGoogle Scholar
  3. 3.
    Belazzougui, D.: Linear time construction of compressed text indices in compact space. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, 31 May–03 June, pp. 148–193 (2014)Google Scholar
  4. 4.
    Belazzougui, D., Navarro, G., Valenzuela, D.: Improved compressed indexes for full-text document retrieval. J. Discret. Algorithms 18, 3–13 (2013)zbMATHMathSciNetCrossRefGoogle Scholar
  5. 5.
    Chairungsee, S., Crochemore, M.: Using minimal absent words to build phylogeny. Theoret. Comput. Sci. 450, 109–116 (2012)zbMATHMathSciNetCrossRefGoogle Scholar
  6. 6.
    Chikhi, R., Medvedev, P.: Informed and automated \(k\)-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2014)CrossRefGoogle Scholar
  7. 7.
    Chor, B., Horn, D., Goldman, N., Levy, Y., Massingham, T., et al.: Genomic DNA \(k\)-mer spectra: models and modalities. Genome Biol. 10(10), R108 (2009)CrossRefGoogle Scholar
  8. 8.
    Crochemore, M., Mignosi, F., Restivo, A.: Automata and forbidden words. Inf. Process. Lett. 67(3), 111–117 (1998)MathSciNetCrossRefGoogle Scholar
  9. 9.
    Gog, S.: Compressed suffix trees: design, construction, and applications. Ph.D. thesis, University of Ulm, Germany (2011)Google Scholar
  10. 10.
    Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinform. 9(1), 167 (2008)CrossRefGoogle Scholar
  11. 11.
    İleri, A.M., Külekci, M.O., Xu, B.: Shortest unique substring query revisited. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 172–181. Springer, Heidelberg (2014) Google Scholar
  12. 12.
    Qi, J., Wang, B., Hao, B.-I.: Whole proteome prokaryote phylogeny without sequence alignment: a \(k\)-string composition approach. J. Mol. Evol. 58(1), 1–11 (2004)CrossRefGoogle Scholar
  13. 13.
    Reinert, G., Chew, D., Sun, F., Waterman, M.S.: Alignment-free sequence comparison (I): statistics and power. J. Comput. Biol. 16(12), 1615–1634 (2009)MathSciNetCrossRefGoogle Scholar
  14. 14.
    Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004) CrossRefGoogle Scholar
  15. 15.
    Sims, G.E., Jun, S.-R., Wu, G.A., Kim, S.-H.: Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. 106(8), 2677–2682 (2009)CrossRefGoogle Scholar
  16. 16.
    Smola, A.J., Vishwanathan, S.V.N.: Fast kernels for string and tree matching. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems 15, pp. 585–592. MIT Press, Cambridge (2003)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. 1.Department of Computer ScienceUniversity of HelsinkiHelsinkiFinland
  2. 2.Helsinki Institute for Information TechnologyHelsinkiFinland

Personalised recommendations