Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads

Deshpande, Viraj; Fung, Eric D. K.; Pham, Son; Bafna, Vineet

doi:10.1007/978-3-642-40453-5_27

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8126)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

Genome assembly from high-throughput short reads arguably remains an unresolvable task for repetitive genomes: when a repeat is longer than the read length, the flanking regions cannot be connected unambiguously. Third-generation sequencing (Pacific Biosciences) produces long reads that can resolve complicated repeats that short-read data cannot. However, these long reads have a high error rate, and assembling a genome from them without additional high-quality short reads is an uphill task. Recently, Koren et al. 2012 [1] proposed using high-quality short reads to correct the long reads, thus making assembly from long reads possible. However, because both datasets (short and long reads) are large, error correction of the long reads requires excessive computational resources, even for small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and then map the long reads onto the assembly graph to resolve repeats.
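The repeat ambiguity described above can be illustrated with a small sketch. It is not taken from the paper: the sequences, the names `read_spectrum`, `genome_1`, `genome_2`, and the idealization of reads as error-free substrings at full coverage are all assumptions made for illustration. Two different toy genomes share one repeat `R` three times; when no read is long enough to contain `R` plus a base on each side, both genomes yield exactly the same set of reads.

```python
from collections import Counter

def read_spectrum(genome, read_len):
    """Multiset of all error-free substrings of length read_len (idealized reads)."""
    return Counter(genome[i:i + read_len] for i in range(len(genome) - read_len + 1))

# Hypothetical toy sequences: one repeat R occurring three times between unique flanks.
R = "GATTACA"
A, B, C, D = "CCT", "TGG", "CGC", "AAG"
genome_1 = A + R + B + R + C + R + D
genome_2 = A + R + C + R + B + R + D  # same pieces, middle flanks swapped

short_len = len(R) + 1  # no read can contain R plus a base on both sides
long_len = len(R) + 2   # a read can bridge the repeat and touch both flanks

same_with_short = read_spectrum(genome_1, short_len) == read_spectrum(genome_2, short_len)
same_with_long = read_spectrum(genome_1, long_len) == read_spectrum(genome_2, long_len)
print(same_with_short, same_with_long)
```

With the shorter read length the two spectra are identical, so no assembler could tell the genomes apart from those reads alone; once reads bridge the repeat, the spectra differ. Real data adds sequencing errors and uneven coverage, which this sketch ignores.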

Contribution: We present a hybrid assembly approach that is both computationally efficient and produces high-quality assemblies. Our algorithm first operates on a simplified version of the assembly graph consisting only of long contigs, and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long-read error-correction technique, which requires high computational resources and a long running time on a supercomputer even for bacterial genome datasets, our software can produce a comparable assembly on a standard desktop in a short running time.
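The abstract gives no algorithmic detail, so the following is only a loose sketch of the general idea of starting from long contigs and admitting smaller ones in later iterations. It is not the authors' implementation. The function names `skeleton_edges` and `iterate_thresholds`, the length thresholds, and the representation of a long-read mapping as an ordered list of contig names are all hypothetical; contig orientation and alignment errors are ignored.

```python
from collections import defaultdict

def skeleton_edges(contig_len, read_paths, min_len):
    """Count long-read support for links between consecutive contigs
    of length >= min_len, skipping over shorter contigs in each mapping."""
    support = defaultdict(int)
    for path in read_paths:
        kept = [c for c in path if contig_len[c] >= min_len]
        for a, b in zip(kept, kept[1:]):
            support[(a, b)] += 1
    return dict(support)

def iterate_thresholds(contig_len, read_paths, thresholds):
    """Yield (threshold, edges) from the coarsest graph to the finest."""
    for t in sorted(thresholds, reverse=True):
        yield t, skeleton_edges(contig_len, read_paths, t)

# Hypothetical input: three long unique contigs and one short repeat contig "r".
contig_len = {"A": 5000, "B": 4000, "C": 6000, "r": 300}
# Each long read is given as the ordered contigs it maps across (assumed format).
read_paths = [["C", "r", "A"], ["A", "r", "B"], ["A", "r", "B"]]

for threshold, edges in iterate_thresholds(contig_len, read_paths, [1000, 100]):
    print(threshold, edges)
```

At the coarse threshold only the long contigs are linked, and the repeat contig is bridged implicitly by the reads that span it; lowering the threshold reintroduces the short contig as a node of its own.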

References

1. Koren, S., Schatz, M.C., Walenz, B.P., Martin, J., Howard, J.T., Ganapathy, G., Wang, Z., Rasko, D.A., McCombie, W.R., Jarvis, E.D., et al.: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology 30(7), 693–700 (2012)
2. Staden, R.: A strategy of DNA sequencing employing computer programs. Nucleic Acids Research 6(7), 2601–2610 (1979)
3. Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005)
4. Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287(5461), 2196–2204 (2000)
5. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3), 549–556 (2012)
6. Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. Journal of Computational Biology 2(2), 291–306 (1995)
7. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98(17), 9748–9753 (2001)
8. Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Research 18(2), 324–330 (2008)
9. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, İ.: ABySS: a parallel assembler for short read sequence data. Genome Research 19(6), 1117–1123 (2009)
10. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5), 821–829 (2008)
11. Eisenstein, M.: Companies 'going long' generate sequencing buzz at Marco Island. Nature Biotechnology 31(4), 265–266 (2013)
12. Waldbieser, G.: Production of long (1.5 kb–15.0 kb), accurate, DNA sequencing reads using an Illumina HiSeq2000 to support de novo assembly of the blue catfish genome. In: Plant and Animal Genome XXI Conference, Plant and Animal Genome (2013)
13. Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., et al.: Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods (2013)
14. Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PLoS One 7(10), e46679 (2012)
15. Hercus, C.: Novocraft short read alignment package (2009), http://www.novocraft.com
16. Wu, T.D., Watanabe, C.K.: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21(9), 1859–1875 (2005)
17. Bashir, A., Klammer, A.A., Robins, W.P., Chin, C.S., Webster, D., Paxinos, E., Hsu, D., Ashby, M., Wang, S., Peluso, P., et al.: A hybrid approach for the automated finishing of bacterial genomes. Nature Biotechnology (2012)
18. Ribeiro, F.J., Przybylski, D., Yin, S., Sharpe, T., Gnerre, S., Abouelleil, A., Berlin, A.M., Montmayeur, A., Shea, T.P., Walker, B.J., et al.: Finished bacterial genomes from shotgun sequence data. Genome Research 22(11), 2270–2277 (2012)
19. Chaisson, M.J., Tesler, G.: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13(1), 238 (2012)
20. E. coli MG1655 Illumina HiSeq2000 sequencing dataset (2013), ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF.bam (online; accessed June 24, 2013)
21. E. coli K12 MG1655 PacBio RS sequencing dataset (2013), http://files.pacb.com/datasets/primary-analysis/e-coli-k12/1.3.0/e-coli-k12-mg1655-raw-reads-1.3.0.tgz (online; accessed June 24, 2013)
22. Schmutz, J., Wheeler, J., Grimwood, J., Dickson, M., Yang, J., Caoile, C., Bajorek, E., Black, S., Chan, Y.M., Denys, M., et al.: Quality assessment of the human genome sequence. Nature 429(6990), 365–368 (2004)
23. English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., et al.: Mind the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11), e47768 (2012)

Author information

Authors and Affiliations

Department of Computer Science & Engineering, University of California, San Diego, CA, USA
Viraj Deshpande, Son Pham & Vineet Bafna
Bioinformatics Undergraduate Program, Department of Bioengineering, University of California, San Diego, CA, USA
Eric D. K. Fung

Editor information

Editors and Affiliations

ithree institute, University of Technology Sydney, 2007, Ultimo, NSW, Australia
Aaron Darling
Faculty of Technology, Bielefeld University, Universitätsstraße 25, 33615, Bielefeld, Germany
Jens Stoye

Cite this paper

Deshpande, V., Fung, E.D.K., Pham, S., Bafna, V. (2013). Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads. In: Darling, A., Stoye, J. (eds) Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40453-5_27

DOI: https://doi.org/10.1007/978-3-642-40453-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40452-8
Online ISBN: 978-3-642-40453-5
eBook Packages: Computer Science, Computer Science (R0)
