QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8133)

Abstract

Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least an index structure that needs roughly N times the storage of a single-string index structure. For highly-similar strings, redundancies can be identified that do not need to be stored repeatedly; for instance, two human genomes are more than 99 percent similar.

In this work, we propose QGramProjector, a new way of indexing many highly-similar strings. To remove the redundancies caused by these similarities, we 1) create all q-grams for a fixed reference, 2) referentially compress all strings in the collection with respect to the reference, and then 3) project all q-grams from the reference to the compressed strings.
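The three steps can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract only names the steps, so the block layout `(target_start, ref_start, length)`, the naive greedy matcher, and the helper names `qgram_index`, `compress`, `explicit_qgrams` and `lookup` are all assumptions made for illustration. The idea shown is that q-grams lying fully inside a block copied from the reference need no storage of their own, since their positions follow from the reference index; only q-grams that cross block boundaries are stored explicitly.

```python
from collections import defaultdict


def qgram_index(text, q):
    """Step 1: positional q-gram index, q-gram -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def compress(target, reference):
    """Step 2: naive greedy referential compression (illustrative only).

    Returns blocks (target_start, ref_start, length) such that
    target[target_start:target_start + length] equals
    reference[ref_start:ref_start + length]. A character that does not
    occur in the reference becomes a literal block (target_start, None, 1).
    """
    blocks, t = [], 0
    while t < len(target):
        best_pos, best_len, length = None, 0, 1
        # Grow the match one character at a time until it no longer occurs.
        while t + length <= len(target):
            pos = reference.find(target[t:t + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            blocks.append((t, None, 1))
            t += 1
        else:
            blocks.append((t, best_pos, best_len))
            t += best_len
    return blocks


def explicit_qgrams(target, blocks, q):
    """Q-grams not fully inside one copied block must be stored explicitly."""
    covered = set()
    for ts, rs, ln in blocks:
        if rs is not None:
            covered.update(range(ts, ts + ln - q + 1))
    extra = defaultdict(list)
    for i in range(len(target) - q + 1):
        if i not in covered:
            extra[target[i:i + q]].append(i)
    return extra


def lookup(gram, ref_index, blocks, extra):
    """Step 3: project reference hits through the blocks, add explicit hits."""
    q = len(gram)
    hits = set(extra.get(gram, []))
    for p in ref_index.get(gram, []):
        for ts, rs, ln in blocks:
            # The reference occurrence must lie entirely inside the block.
            if rs is not None and rs <= p and p + q <= rs + ln:
                hits.add(ts + (p - rs))
    return sorted(hits)


reference = "ACGTACGTTAGC"
target = "ACGTACCTTAGC"  # differs from the reference by one substitution
q = 3
ref_index = qgram_index(reference, q)
blocks = compress(target, reference)
extra = explicit_qgrams(target, blocks, q)
print(lookup("TAG", ref_index, blocks, extra))  # prints [8]
```

In this toy example only three q-grams of the target are stored explicitly; all others are derived from the reference index. A realistic implementation would replace the linear scan over blocks in `lookup` with an interval search structure and would avoid the quadratic matcher in `compress`.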

Experiments show that a complete index can be relatively small compared to the collection of highly-similar strings. For a collection of 1092 human genomes (raw data size: 3 TB), a 16-gram index structure, which can be used for instance as a basis for multi-genome read alignment, needs only 100.5 GB (a compression ratio of 31:1). We think that our work is an important step towards the analysis of large sets of highly-similar genomes on commodity hardware.


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wandelt, S., Leser, U. (2013). QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings. In: Catania, B., Guerrini, G., Pokorný, J. (eds) Advances in Databases and Information Systems. ADBIS 2013. Lecture Notes in Computer Science, vol 8133. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40683-6_20

  • DOI: https://doi.org/10.1007/978-3-642-40683-6_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40682-9

  • Online ISBN: 978-3-642-40683-6

  • eBook Packages: Computer Science, Computer Science (R0)
