Abstract
Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least an index structure that needs roughly N times the storage of a single-string index structure. For highly-similar strings, however, redundancies can be identified that do not need to be stored repeatedly; two human genomes, for instance, are more than 99 percent similar.
In this work, we propose QGramProjector, a new way of indexing many highly-similar strings. In order to remove the redundancies caused by similarities, our proposal is to 1) create all q-grams for a fixed reference, 2) referentially compress all strings in the collection with respect to the reference, and then 3) project all q-grams from the reference to the compressed strings.
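The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify data structures, so the block format `(ref_start, length, literal)`, the naive greedy matching, and all function names (`qgram_index`, `compress`, `decompress`, `boundary_qgrams`, `find`) are assumptions chosen for readability. The idea shown is that q-grams lying fully inside a referenced block are never stored per string; they are looked up in the reference index and shifted to the string's coordinates, and only q-grams that cross block boundaries are stored explicitly.

```python
from collections import defaultdict


def qgram_index(text, q):
    """Step 1: positional q-gram index (q-gram -> list of start positions)."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def compress(reference, target):
    """Step 2: naive greedy referential compression (quadratic, sketch only).

    Returns blocks (ref_start, length, literal): copy `length` symbols from
    the reference at `ref_start`, then append `literal` ('' at the very end).
    """
    blocks, pos = [], 0
    while pos < len(target):
        best_start, best_len = 0, 0
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and pos + length < len(target)
                   and reference[start + length] == target[pos + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        literal = target[pos + best_len:pos + best_len + 1]
        blocks.append((best_start, best_len, literal))
        pos += best_len + len(literal)
    return blocks


def decompress(reference, blocks):
    return "".join(reference[s:s + n] + c for s, n, c in blocks)


def boundary_qgrams(reference, blocks, q):
    """q-grams of the target that are NOT fully inside one referenced block.

    For simplicity this sketch decompresses the string; a real system would
    only materialise the neighbourhood of each block boundary.
    """
    target = decompress(reference, blocks)
    covered, offset = set(), 0
    for _, n, c in blocks:
        covered.update(range(offset, offset + n - q + 1))
        offset += n + len(c)
    extra = defaultdict(list)
    for i in range(len(target) - q + 1):
        if i not in covered:
            extra[target[i:i + q]].append(i)
    return extra


def find(qgram, ref_index, blocks, extra):
    """Step 3: project reference occurrences into the compressed string."""
    hits = list(extra.get(qgram, []))
    offset = 0
    for s, n, c in blocks:
        for p in ref_index.get(qgram, []):
            if s <= p <= s + n - len(qgram):   # q-gram lies inside the block
                hits.append(offset + p - s)    # shift to target coordinates
        offset += n + len(c)
    return sorted(hits)


reference = "ACGTACGTTGCA"
target = "ACGTACCTTGCAA"   # one substitution and one appended symbol
q = 3
ref_index = qgram_index(reference, q)
blocks = compress(reference, target)
extra = boundary_qgrams(reference, blocks, q)
print(find("ACG", ref_index, blocks, extra))
```

In this sketch the per-string cost is the block list plus the boundary q-grams, so it shrinks as the string becomes more similar to the reference. The linear scan over blocks in `find` is a simplification; an interval-based lookup over the blocks would be the natural replacement for large collections.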
Experiments show that a complete index can be relatively small compared to the collection of highly-similar strings. For a collection of 1092 human genomes (raw data size of 3 TB), a 16-gram index structure, which can be used, for instance, as a basis for multi-genome read alignment, needs only 100.5 GB (a compression ratio of 31:1). We think that our work is an important step towards the analysis of large sets of highly-similar genomes on commodity hardware.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wandelt, S., Leser, U. (2013). QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings. In: Catania, B., Guerrini, G., Pokorný, J. (eds) Advances in Databases and Information Systems. ADBIS 2013. Lecture Notes in Computer Science, vol 8133. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40683-6_20
DOI: https://doi.org/10.1007/978-3-642-40683-6_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40682-9
Online ISBN: 978-3-642-40683-6
eBook Packages: Computer Science, Computer Science (R0)