Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Abstract

Third-generation sequencing platforms produce longer reads with higher error rates than second-generation sequencing technologies. While the increased read length can provide useful information for downstream analysis, the underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads are an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not make optimal use of the information contained in the second-generation data and suffer from long runtimes. Recently, a new hybrid error correction method has been proposed, in which the second-generation data is first assembled into a de Bruijn graph, onto which the long reads are then aligned. In this context we present Jabba, a hybrid method that corrects long third-generation reads by mapping them onto a corrected de Bruijn graph constructed from second-generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, we present theoretical results on the possibilities and limitations of using maximal exact matches in the context of third-generation reads.
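
To make the notion of a maximal exact match (MEM) concrete, the following is a minimal illustrative sketch and not the implementation used in Jabba. It assumes two plain strings (a read and a reference sequence) and a hypothetical function name `maximal_exact_matches`; it scans every diagonal of the comparison matrix by brute force, whereas practical tools typically rely on index structures such as suffix arrays. A MEM is reported as a run of identical characters that cannot be extended to the left or to the right.

```python
def maximal_exact_matches(read, ref, min_len=3):
    """Return sorted (read_pos, ref_pos, length) tuples for every maximal
    exact match of at least min_len characters between read and ref.

    Brute-force O(len(read) * len(ref)) sketch for illustration only.
    """
    mems = []
    n, m = len(read), len(ref)
    # A diagonal d = ref_pos - read_pos groups all position pairs that
    # could belong to the same gap-free match.
    for d in range(-(n - 1), m):
        i = max(0, -d)   # start position in the read
        j = i + d        # start position in the reference
        run = 0          # length of the current run of matching characters
        while i < n and j < m:
            if read[i] == ref[j]:
                run += 1
            else:
                # A mismatch ends the run, so the run is maximal on both sides.
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of either string is also maximal.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return sorted(mems)


if __name__ == "__main__":
    # A single substitution at position 4 splits the match into two seeds.
    print(maximal_exact_matches("ACGTTGCA", "ACGTAGCA"))
    # prints [(0, 0, 4), (5, 5, 3)]
```

In a seed-and-extend setting, seeds of this kind would then be chained and the gaps between them aligned; the higher the error rate of the read, the shorter the MEMs that can be expected, which is why the choice of `min_len` matters.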

Acknowledgments

The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, the Hercules Foundation and the Flemish Government – department EWI. We acknowledge the support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: From Nucleotides to Networks”).

Author information

Correspondence to Jan Fostier.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Miclotte, G., Heydari, M., Demeester, P., Audenaert, P., Fostier, J. (2015). Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_13

  • DOI: https://doi.org/10.1007/978-3-662-48221-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48220-9

  • Online ISBN: 978-3-662-48221-6

  • eBook Packages: Computer Science (R0)
