QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings

Wandelt, Sebastian; Leser, Ulf

doi:10.1007/978-3-642-40683-6_20

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8133)

Included in the following conference series:

East European Conference on Advances in Databases and Information Systems

Abstract

Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least an index structure that needs roughly N times the storage of a single-string index structure. For highly-similar strings, redundancies can be identified that do not need to be stored repeatedly; for instance, two human genomes are more than 99 percent similar.
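As a point of reference, a positional q-gram index for a single string can be sketched as follows. This is a minimal illustration, not the index structure used in the paper; the function name `build_qgram_index` and the toy input are assumptions made for this example.

```python
from collections import defaultdict


def build_qgram_index(text: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of `text` to the sorted list of its start positions.

    Hypothetical helper for illustration only.
    """
    index: dict[str, list[int]] = defaultdict(list)
    # A string of length n has n - q + 1 overlapping q-grams.
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return dict(index)


# Toy example: the 3-grams of a short DNA-like string.
index = build_qgram_index("ACGTACGT", 3)
acg_positions = index["ACG"]  # start positions of the 3-gram "ACG"
```

Building one such index per string is what makes storage grow roughly linearly with the number of strings N, even when the strings are nearly identical.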

In this work, we propose QGramProjector, a new way of indexing many highly-similar strings. In order to remove the redundancies caused by similarities, our proposal is to 1) create all q-grams for a fixed reference, 2) referentially compress all strings in the collection with respect to the reference, and then 3) project all q-grams from the reference to the compressed strings.
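The three steps can be illustrated with a small sketch. It is written under several assumptions and is not the authors' implementation: the referential compression is a naive greedy longest-match search, the names `qgram_index`, `match_blocks`, `project`, and `boundary_qgrams` and the threshold `min_match` are hypothetical, and q-grams that are not fully contained in one matched block are simply indexed directly.

```python
from collections import defaultdict


def qgram_index(text, q):
    """Step 1: positional q-gram index of the reference."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def match_blocks(ref, s, min_match):
    """Step 2 (naive): greedy longest matches of s against ref.

    Returns (ref_start, length, s_start) triples. Characters of s that are
    not part of a match of at least min_match are treated as literals and
    left out of the block list.
    """
    blocks, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(len(ref)):
            k = 0
            while j + k < len(ref) and i + k < len(s) and ref[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len >= min_match:
            blocks.append((best_pos, best_len, i))
            i += best_len
        else:
            i += 1  # literal character
    return blocks


def project(ref_index, blocks, q):
    """Step 3: shift reference q-gram positions into the compressed string."""
    projected = defaultdict(list)
    for gram, positions in ref_index.items():
        for p in positions:
            for ref_start, length, s_start in blocks:
                # Only q-grams lying entirely inside one block can be projected.
                if ref_start <= p and p + q <= ref_start + length:
                    projected[gram].append(s_start + (p - ref_start))
    return projected


def boundary_qgrams(s, blocks, q):
    """Index q-grams of s that are not fully contained in a single block."""
    extra = defaultdict(list)
    for i in range(len(s) - q + 1):
        if not any(b <= i and i + q <= b + n for _, n, b in blocks):
            extra[s[i:i + q]].append(i)
    return extra


ref = "ACGTACGTTAGC"
s = "ACGTACCTTAGC"  # differs from ref by a single substitution
q = 3
blocks = match_blocks(ref, s, min_match=4)
projected = project(qgram_index(ref, q), blocks, q)
extra = boundary_qgrams(s, blocks, q)
```

In this toy case only the q-grams overlapping the substitution have to be stored explicitly for `s`; all others are recovered from the reference index and the block list, which is the kind of redundancy the abstract describes.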

Experiments show that a complete index can be relatively small compared to the collection of highly-similar strings. For a collection of 1092 human genomes (raw data size is 3 TB), a 16-gram index structure, which can be used for instance as a basis for multi-genome read alignment, only needs 100.5 GB (compression ratio of 31:1). We think that our work is an important step towards analysis of large sets of highly-similar genomes on commodity hardware.

Author information

Authors and Affiliations

Knowledge Management in Bioinformatics, Institute for Computer Science, Humboldt-Universität zu Berlin, Germany
Sebastian Wandelt & Ulf Leser

Editor information

Editors and Affiliations

Università di Genova, Italy
Barbara Catania
DIBRIS, Università di Genova, Italy
Giovanna Guerrini
Department of Software Engineering Faculty of Mathematics and Physics, Charles University, Malostranské nám. 25, 11800, Prague 1, Czech Republic
Jaroslav Pokorný

Cite this paper

Wandelt, S., Leser, U. (2013). QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings. In: Catania, B., Guerrini, G., Pokorný, J. (eds) Advances in Databases and Information Systems. ADBIS 2013. Lecture Notes in Computer Science, vol 8133. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40683-6_20

DOI: https://doi.org/10.1007/978-3-642-40683-6_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40682-9
Online ISBN: 978-3-642-40683-6
eBook Packages: Computer Science, Computer Science (R0)