Abstract
In this paper, we present a memory-efficient index for storing a large set of DNA sequencing reads. The index allows us to quickly retrieve the set of reads containing a given query k-mer. Instead of the usual approach of treating each read as a separate string, we take advantage of the significant overlap between reads and compress the data by aligning the reads to an approximate superstring constructed specifically for this purpose, in combination with several succinct data structures.
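The core idea of storing reads as positions on a shared superstring can be illustrated with a small Python sketch. This is not the data structure from the paper: it assumes every read matches the superstring exactly (the paper aligns reads to an approximate superstring), and it uses a plain substring scan and a sorted list where the paper uses succinct data structures. The class name `SuperstringReadIndex`, its method, and the toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right


class SuperstringReadIndex:
    """Toy index: each read is stored as a (start, length) interval on a superstring."""

    def __init__(self, superstring, placements):
        # placements[i] = (start, length) of read i (hypothetical input format).
        # Assumption: read i equals superstring[start:start + length] exactly.
        self.superstring = superstring
        self.placements = sorted(
            (start, length, read_id)
            for read_id, (start, length) in enumerate(placements)
        )
        self.starts = [start for start, _, _ in self.placements]
        self.max_len = max((length for _, length, _ in self.placements), default=0)

    def reads_containing(self, kmer):
        """Return sorted ids of reads whose interval covers an occurrence of kmer."""
        k = len(kmer)
        hits = set()
        # Plain scan; a compressed full-text index would replace this step.
        pos = self.superstring.find(kmer)
        while pos != -1:
            # A read [s, s + l) covers [pos, pos + k) iff s <= pos and s + l >= pos + k,
            # so only starts in [pos + k - max_len, pos] need to be inspected.
            lo = bisect_left(self.starts, pos + k - self.max_len)
            hi = bisect_right(self.starts, pos)
            for start, length, read_id in self.placements[lo:hi]:
                if start + length >= pos + k:
                    hits.add(read_id)
            pos = self.superstring.find(kmer, pos + 1)
        return sorted(hits)


# Toy data: three overlapping reads laid out on one superstring.
superstring = "ACGTACGGA"
placements = [(0, 5), (2, 5), (4, 5)]  # reads ACGTA, GTACG, ACGGA
index = SuperstringReadIndex(superstring, placements)
tac_reads = index.reads_containing("TAC")
acg_reads = index.reads_containing("ACG")
```

In this toy layout only the superstring and one interval per read are stored, so overlapping parts of the reads are not duplicated; a query first locates the k-mer in the superstring and then reports the reads whose intervals cover each occurrence.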
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Boža, V., Jursa, J., Brejová, B., Vinař, T. (2015). Fishing in Read Collections: Memory Efficient Indexing for Sequence Assembly. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds) String Processing and Information Retrieval. SPIRE 2015. Lecture Notes in Computer Science, vol 9309. Springer, Cham. https://doi.org/10.1007/978-3-319-23826-5_19
DOI: https://doi.org/10.1007/978-3-319-23826-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23825-8
Online ISBN: 978-3-319-23826-5
eBook Packages: Computer Science, Computer Science (R0)