A Framework for Space-Efficient String Kernels

Belazzougui, Djamal; Cunial, Fabio

doi:10.1007/978-3-319-19929-0_2


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9133)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching


Abstract

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that either are space-inefficient or incur large slowdowns. We show that a number of exact string kernels, like the \(k\)-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in \(O(nd)\) time and in \(o(n)\) bits of space in addition to the input, using just a \(\mathtt{rangeDistinct}\) data structure on the Burrows-Wheeler transform of the input strings that takes \(O(d)\) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of \(k\), like the \(k\)-mer profile and the \(k\)-th order empirical entropy, and for calibrating the value of \(k\) using the data.
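For orientation, the following is a minimal sketch of two of the quantities named above, the \(k\)-mer kernel and the \(k\)-th order empirical entropy, computed naively with hash tables. It is not the method of the paper: it does not use the Burrows-Wheeler transform or a \(\mathtt{rangeDistinct}\) structure, and its space grows with the number of distinct \(k\)-mers rather than staying within \(o(n)\) bits. The function names are hypothetical, and the entropy is assumed to follow the common definition normalized by the string length \(n\).

```python
from collections import Counter
from math import log2, sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s, overlapping occurrences included."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t (naive baseline)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    if len(cs) > len(ct):
        cs, ct = ct, cs  # iterate over the smaller table
    return sum(c * ct[w] for w, c in cs.items())


def normalized_kmer_kernel(s: str, t: str, k: int) -> float:
    """Cosine-style normalization of the k-mer kernel; 0.0 if either norm is 0."""
    denom = sqrt(kmer_kernel(s, s, k) * kmer_kernel(t, t, k))
    return kmer_kernel(s, t, k) / denom if denom else 0.0


def empirical_entropy(s: str, k: int) -> float:
    """k-th order empirical entropy of s in bits per character.

    Assumed definition: -(1/n) * sum over (k+1)-mers wa of
    n_wa * log2(n_wa / n_w), where n_w counts the occurrences of the
    context w that are followed by at least one character.
    """
    n = len(s)
    if n <= k:
        return 0.0
    contexts = Counter(s[i:i + k] for i in range(n - k))
    extensions = kmer_counts(s, k + 1)
    return -sum(c * log2(c / contexts[w[:k]]) for w, c in extensions.items()) / n
```

For example, `kmer_kernel("AAAA", "AAA", 2)` multiplies the three occurrences of `AA` in the first string by the two in the second. A space-efficient implementation would replace the hash tables with an enumeration over the BWT, as the abstract describes.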

This work was partially supported by Academy of Finland under grant 284598 (Center of Excellence in Cancer Genetics Research).




Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Djamal Belazzougui & Fabio Cunial
Helsinki Institute for Information Technology, Helsinki, Finland
Djamal Belazzougui & Fabio Cunial

Corresponding author

Correspondence to Fabio Cunial.

Editor information

Editors and Affiliations

Department of Computer Science, University of Verona, Verona, Italy
Ferdinando Cicalese
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Ely Porat
Department of Computer Science, University of Salerno, Fisciano, Italy
Ugo Vaccaro

About this paper

Cite this paper

Belazzougui, D., Cunial, F. (2015). A Framework for Space-Efficient String Kernels. In: Cicalese, F., Porat, E., Vaccaro, U. (eds) Combinatorial Pattern Matching. CPM 2015. Lecture Notes in Computer Science, vol 9133. Springer, Cham. https://doi.org/10.1007/978-3-319-19929-0_2

DOI: https://doi.org/10.1007/978-3-319-19929-0_2
Published: 16 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19928-3
Online ISBN: 978-3-319-19929-0
eBook Packages: Computer Science, Computer Science (R0)