Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly

  • Conference paper
String Processing and Information Retrieval (SPIRE 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

  • International Symposium on String Processing and Information Retrieval

Abstract

In this paper, we present a memory-efficient index for storing a large set of DNA sequencing reads. The index allows us to quickly retrieve the set of reads containing a given query k-mer. Instead of the usual approach of treating each read as a separate string, we take advantage of the significant overlap between reads and compress the data by aligning the reads to an approximate superstring constructed specifically for this purpose, in combination with several succinct data structures.
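
To make the query concrete, the following is a minimal toy sketch of the idea of answering "which reads contain this k-mer?" through a shared superstring rather than through each read separately. It is not the authors' implementation: it assumes every read occurs exactly (without errors) in the superstring, uses plain Python string search where the paper relies on succinct data structures, and the names `build_index` and `reads_containing` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(superstring, reads):
    """Place each read on the superstring (toy assumption: exact match).

    Returns the superstring, the placements sorted by start position,
    and the maximum read length.
    """
    placements = []
    for read_id, read in enumerate(reads):
        start = superstring.find(read)
        if start < 0:
            raise ValueError(f"read {read_id} does not occur in the superstring")
        placements.append((start, len(read), read_id))
    placements.sort()
    max_len = max((length for _, length, _ in placements), default=0)
    return superstring, placements, max_len


def reads_containing(index, kmer):
    """Return sorted ids of reads whose placement covers an occurrence of kmer."""
    superstring, placements, max_len = index
    starts = [start for start, _, _ in placements]
    k = len(kmer)
    hits = set()
    pos = superstring.find(kmer)
    while pos != -1:
        # A read starting at s with length l covers [pos, pos + k)
        # only if pos + k - max_len <= s <= pos and s + l >= pos + k.
        lo = bisect_left(starts, pos + k - max_len)
        hi = bisect_right(starts, pos)
        for start, length, read_id in placements[lo:hi]:
            if start + length >= pos + k:
                hits.add(read_id)
        pos = superstring.find(kmer, pos + 1)
    return sorted(hits)


index = build_index("ACGTACGGA", ["ACGTA", "GTACG", "ACGGA"])
print(reads_containing(index, "TAC"))
```

In this toy version each read costs only a start position and a length on top of the shared superstring, which is the intuition behind exploiting overlap between reads. A realistic index would additionally have to handle reads that differ from the superstring and would replace the linear string search with a compressed full-text index.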

Author information

Corresponding author

Correspondence to Tomáš Vinař.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Boža, V., Jursa, J., Brejová, B., Vinař, T. (2015). Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_19

  • DOI: https://doi.org/10.1007/978-3-319-23826-5_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23825-8

  • Online ISBN: 978-3-319-23826-5

  • eBook Packages: Computer Science, Computer Science (R0)
