Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly

Boža, Vladimír; Jursa, Jakub; Brejová, Broňa; Vinař, Tomáš

doi:10.1007/978-3-319-23826-5_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9309)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

In this paper, we present a memory-efficient index for storing a large set of DNA sequencing reads. The index allows us to quickly retrieve the set of reads containing a given query k-mer. Instead of the usual approach of treating each read as a separate string, we take advantage of the significant overlap between reads and compress the data by aligning the reads to an approximate superstring constructed specifically for this purpose, in combination with several succinct data structures.
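The query described in the abstract can be illustrated with a toy sketch. This is not the authors' data structure: it assumes error-free reads that occur as exact substrings of a given superstring, and it uses plain Python string search and a sorted list in place of the succinct data structures the paper refers to. The names `ToyReadIndex`, `find_all`, and `reads_containing` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def find_all(text, pattern):
    """Yield every start position of pattern in text (overlaps allowed)."""
    pos = text.find(pattern)
    while pos != -1:
        yield pos
        pos = text.find(pattern, pos + 1)


class ToyReadIndex:
    """Toy index: each read is stored only as an interval on a superstring.

    Assumption (not from the paper): every read is an exact substring of
    the superstring, so no mismatches need to be recorded.
    """

    def __init__(self, superstring, reads):
        self.superstring = superstring
        placed = []
        for read_id, read in enumerate(reads):
            start = superstring.find(read)
            if start == -1:
                raise ValueError(f"read {read_id} is not in the superstring")
            placed.append((start, len(read), read_id))
        placed.sort()  # sort intervals by start position
        self.placed = placed
        self.starts = [start for start, _, _ in placed]
        self.max_len = max((length for _, length, _ in placed), default=0)

    def reads_containing(self, kmer):
        """Return sorted ids of all reads that contain kmer."""
        k = len(kmer)
        if k == 0:
            raise ValueError("k-mer must be non-empty")
        hits = set()
        for pos in find_all(self.superstring, kmer):
            # A covering read must start at or before pos, and no earlier
            # than pos + k - max_len (otherwise it ends before pos + k).
            i = bisect_right(self.starts, pos) - 1
            while i >= 0 and self.starts[i] >= pos + k - self.max_len:
                start, length, read_id = self.placed[i]
                if start + length >= pos + k:
                    hits.add(read_id)
                i -= 1
        return sorted(hits)


index = ToyReadIndex("ACGTACGGA", ["ACGTA", "GTACG", "ACGGA"])
tac_hits = index.reads_containing("TAC")
acg_hits = index.reads_containing("ACG")
```

In this sketch the reads themselves are never stored, only their positions on the superstring, which is the intuition behind exploiting overlap between reads. Handling sequencing errors and replacing the linear-time string search with a compressed index are left out.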

Author information

Authors and Affiliations

Faculty of Mathematics, Physics, and Informatics, Comenius University, Mlynská dolina, 842 48, Bratislava, Slovakia
Vladimír Boža, Jakub Jursa, Broňa Brejová & Tomáš Vinař

Corresponding author

Correspondence to Tomáš Vinař.

Editor information

Editors and Affiliations

King's College London, London, United Kingdom
Costas Iliopoulos
University of Helsinki, Helsinki, Finland
Simon Puglisi
University College London, London, United Kingdom
Emine Yilmaz

About this paper

Cite this paper

Boža, V., Jursa, J., Brejová, B., Vinař, T. (2015). Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_19

DOI: https://doi.org/10.1007/978-3-319-23826-5_19
Published: 05 September 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)
