International Symposium on String Processing and Information Retrieval

SPIRE 2015: String Processing and Information Retrieval, pp 188-198

Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly

  • Vladimír Boža
  • Jakub Jursa
  • Broňa Brejová
  • Tomáš Vinař
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9309)


In this paper, we present a memory-efficient index for storing a large set of DNA sequencing reads. The index allows us to quickly retrieve the set of reads containing a given query k-mer. Instead of the usual approach of treating each read as a separate string, we take advantage of the significant overlap between reads and compress the data by aligning the reads to an approximate superstring constructed specifically for this purpose, in combination with several succinct data structures.
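The core idea of the abstract, storing reads as positions on a shared superstring and answering k-mer queries through that superstring, can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes every read is an exact substring of the superstring (the paper uses an approximate superstring), it uses plain `str.find` and a sorted list of start positions in place of succinct data structures, and the class name `ToyReadIndex` and its methods are hypothetical names chosen for this example.

```python
from bisect import bisect_left, bisect_right


class ToyReadIndex:
    """Toy illustration only: each read is stored as a (start, length)
    interval on a superstring instead of as a separate string.
    Assumes every read occurs exactly in the superstring."""

    def __init__(self, superstring, reads):
        self.superstring = superstring
        placed = []
        for read_id, read in enumerate(reads):
            start = superstring.find(read)  # exact placement (assumption)
            if start < 0:
                raise ValueError(f"read {read_id} is not a substring of the superstring")
            placed.append((start, len(read), read_id))
        placed.sort()  # order reads by start position on the superstring
        self.starts = [p[0] for p in placed]
        self.lengths = [p[1] for p in placed]
        self.ids = [p[2] for p in placed]
        self.max_len = max(self.lengths, default=0)

    def query(self, kmer):
        """Return sorted ids of reads that contain kmer."""
        k = len(kmer)
        hits = set()
        pos = self.superstring.find(kmer)
        while pos >= 0:
            # A read covering [pos, pos + k) must start no later than pos
            # and no earlier than pos + k - max_len.
            lo = bisect_left(self.starts, pos + k - self.max_len)
            hi = bisect_right(self.starts, pos)
            for i in range(lo, hi):
                if self.starts[i] + self.lengths[i] >= pos + k:
                    hits.add(self.ids[i])
            pos = self.superstring.find(kmer, pos + 1)
        return sorted(hits)


index = ToyReadIndex("ACGTACGGA", ["ACGTA", "GTACG", "ACGGA"])
print(index.query("TAC"))
```

The sketch only shows why overlapping reads compress well under this layout: each read costs one start position and one length rather than its full sequence. A practical index would replace the linear `str.find` scan with a compressed full-text index over the superstring and would need to handle reads that differ from it.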


Keywords: Reference Genome, Genome Assembly, Memory Usage, Query Time, Read Correction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Vladimír Boža (1)
  • Jakub Jursa (1)
  • Broňa Brejová (1)
  • Tomáš Vinař (1)
  1. Faculty of Mathematics, Physics, and Informatics, Comenius University, Bratislava, Slovakia
