Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads

  • Conference paper
Algorithms in Bioinformatics (WABI 2013)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 8126)

Abstract

Genome assembly from high-throughput short-read data arguably remains an unresolvable task for repetitive genomes: when a repeat is longer than the read length, it becomes difficult to connect the flanking regions unambiguously. Third-generation sequencing (Pacific Biosciences) produces long reads that offer the opportunity to resolve complicated repeats that short-read data could not. However, these long reads have a high error rate, and assembling a genome from them without additional high-quality short reads is an uphill task. Recently, Koren et al. (2012) [1] proposed using high-quality short-read data to correct the long reads and thus make assembly from long reads possible. However, because both datasets (short and long reads) are large, error correction of the long reads requires excessive computational resources, even for small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and then map the long reads onto the assembly graph to resolve repeats.

Contribution: We present a hybrid assembly approach that is both computationally efficient and produces high-quality assemblies. Our algorithm first operates on a simplified version of the assembly graph consisting only of long contigs and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long-read error-correction technique, which requires substantial computational resources and a long running time on a supercomputer even for bacterial genome datasets, our software produces a comparable assembly on a standard desktop in a short running time.
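The idea of starting from long contigs and adding smaller ones can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `min_support` cutoff, the contig lengths, and the representation of each long read as an ordered list of contig IDs are all assumptions made for illustration. The sketch counts how many long reads support each adjacency between contigs above a length cutoff, then lowers the cutoff so that shorter contigs (here a repeat `R`) re-enter the graph.

```python
from collections import Counter


def skeleton_edges(read_paths, contig_len, min_len):
    """Count long-read support for adjacencies between contigs >= min_len.

    Contigs shorter than min_len are skipped, so two long contigs that a
    read connects through a short repeat become directly adjacent.
    """
    support = Counter()
    for path in read_paths:
        kept = [c for c in path if contig_len[c] >= min_len]
        for a, b in zip(kept, kept[1:]):
            support[(a, b)] += 1
    return support


def iterative_skeleton(read_paths, contig_len, thresholds, min_support=2):
    """Return (cutoff, supported edges) for each cutoff, largest first."""
    stages = []
    for t in sorted(thresholds, reverse=True):
        support = skeleton_edges(read_paths, contig_len, t)
        edges = sorted(e for e, n in support.items() if n >= min_support)
        stages.append((t, edges))
    return stages


# Hypothetical data: a 300 bp repeat R sits between A and B and between C and D.
# Short reads alone cannot tell A-R-B / C-R-D apart from A-R-D / C-R-B.
contig_len = {"A": 5000, "B": 6000, "C": 7000, "D": 8000, "R": 300}
read_paths = [["A", "R", "B"], ["A", "R", "B"], ["C", "R", "D"], ["C", "R", "D"]]

stages = iterative_skeleton(read_paths, contig_len, thresholds=[1000, 100])
for cutoff, edges in stages:
    print(cutoff, edges)
```

With the 1000 bp cutoff only the long contigs remain and the reads pair A with B and C with D; at the 100 bp cutoff the repeat reappears between them. A real pipeline would derive the contig paths from alignments of long reads to the short-read assembly graph and would need to handle strand, overlaps, and alignment errors, none of which is modeled here.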

References

  1. Koren, S., Schatz, M.C., Walenz, B.P., Martin, J., Howard, J.T., Ganapathy, G., Wang, Z., Rasko, D.A., McCombie, W.R., Jarvis, E.D., et al.: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology 30(7), 693–700 (2012)

  2. Staden, R.: A strategy of DNA sequencing employing computer programs. Nucleic Acids Research 6(7), 2601–2610 (1979)

  3. Myers, E.W.: The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005)

  4. Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287(5461), 2196–2204 (2000)

  5. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Research 22(3), 549–556 (2012)

  6. Idury, R.M., Waterman, M.S.: A new algorithm for DNA sequence assembly. Journal of Computational Biology 2(2), 291–306 (1995)

  7. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences 98(17), 9748–9753 (2001)

  8. Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Research 18(2), 324–330 (2008)

  9. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, İ.: ABySS: a parallel assembler for short read sequence data. Genome Research 19(6), 1117–1123 (2009)

  10. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5), 821–829 (2008)

  11. Eisenstein, M.: Companies ‘going long’ generate sequencing buzz at Marco Island. Nature Biotechnology 31(4), 265–266 (2013)

  12. Waldbieser, G.: Production of long (1.5 kb–15.0 kb), accurate, DNA sequencing reads using an Illumina HiSeq2000 to support de novo assembly of the blue catfish genome. In: Plant and Animal Genome XXI Conference, Plant and Animal Genome (2013)

  13. Chin, C.S., Alexander, D.H., Marks, P., Klammer, A.A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E.E., et al.: Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods (2013)

  14. Au, K.F., Underwood, J.G., Lee, L., Wong, W.H.: Improving PacBio long read accuracy by short read alignment. PLoS One 7(10), e46679 (2012)

  15. Hercus, C.: Novocraft short read alignment package (2009), http://www.novocraft.com

  16. Wu, T.D., Watanabe, C.K.: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21(9), 1859–1875 (2005)

  17. Bashir, A., Klammer, A.A., Robins, W.P., Chin, C.S., Webster, D., Paxinos, E., Hsu, D., Ashby, M., Wang, S., Peluso, P., et al.: A hybrid approach for the automated finishing of bacterial genomes. Nature Biotechnology (2012)

  18. Ribeiro, F.J., Przybylski, D., Yin, S., Sharpe, T., Gnerre, S., Abouelleil, A., Berlin, A.M., Montmayeur, A., Shea, T.P., Walker, B.J., et al.: Finished bacterial genomes from shotgun sequence data. Genome Research 22(11), 2270–2277 (2012)

  19. Chaisson, M.J., Tesler, G.: Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13(1), 238 (2012)

  20. E.Coli MG1655 Illumina HiSeq2000 sequencing dataset, ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF.bam (2013) (online; accessed June 24, 2013)

  21. E.Coli K12 MG1655 Pacbio RS sequencing dataset (2013), http://files.pacb.com/datasets/primary-analysis/e-coli-k12/1.3.0/e-coli-k12-mg1655-raw-reads-1.3.0.tgz (online; accessed June 24, 2013)

  22. Schmutz, J., Wheeler, J., Grimwood, J., Dickson, M., Yang, J., Caoile, C., Bajorek, E., Black, S., Chan, Y.M., Denys, M., et al.: Quality assessment of the human genome sequence. Nature 429(6990), 365–368 (2004)

  23. English, A.C., Richards, S., Han, Y., Wang, M., Vee, V., Qu, J., Qin, X., Muzny, D.M., Reid, J.G., Worley, K.C., et al.: Mind the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11), e47768 (2012)

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Deshpande, V., Fung, E.D.K., Pham, S., Bafna, V. (2013). Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads. In: Darling, A., Stoye, J. (eds) Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science, vol 8126. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40453-5_27

  • DOI: https://doi.org/10.1007/978-3-642-40453-5_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40452-8

  • Online ISBN: 978-3-642-40453-5

  • eBook Packages: Computer Science, Computer Science (R0)
