International Workshop on Algorithms in Bioinformatics

WABI 2015: Algorithms in Bioinformatics, pp 175-188

Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches

  • Giles Miclotte
  • Mahdi Heydari
  • Piet Demeester
  • Pieter Audenaert
  • Jan Fostier
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9289)

Abstract

Third-generation sequencing platforms produce longer reads with higher error rates than second-generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, the underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads are an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not make optimal use of the information contained in the second-generation data, and suffer from long runtimes. Recently, a new hybrid error correction method has been proposed in which the second-generation data is first assembled into a de Bruijn graph, to which the long reads are then aligned. In this context we present Jabba, a hybrid method that corrects long third-generation reads by mapping them onto a corrected de Bruijn graph constructed from second-generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, we present theoretical results on the possibilities and limitations of using maximal exact matches in the context of third-generation reads.
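To illustrate what a maximal exact match (MEM) seed is, the following is a minimal sketch, not Jabba's implementation. It assumes a plain reference string rather than a de Bruijn graph, and the function name `maximal_exact_matches` and the parameter `min_len` are hypothetical. It finds MEMs by brute force; practical tools use index structures such as suffix arrays instead.

```python
def maximal_exact_matches(read, ref, min_len):
    """Return (read_pos, ref_pos, length) for each maximal exact match
    of at least min_len characters between read and ref.

    Illustrative brute-force version (quadratic in the input lengths);
    it is an assumption-laden sketch, not the method used by Jabba.
    """
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Only start where the match cannot be extended to the left,
            # so every reported match is left-maximal.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the characters agree,
            # which makes the match right-maximal as well.
            k = 0
            while (i + k < len(read) and j + k < len(ref)
                   and read[i + k] == ref[j + k]):
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


# Hypothetical toy data: the read differs from the reference
# by a single substitution at position 4.
reference = "ACGTACGGTCA"
noisy_read = "ACGTTCGGTCA"
seeds = maximal_exact_matches(noisy_read, reference, min_len=4)
print(seeds)
```

In this toy case the single error splits the read into two exact matches on either side of it. In a seed-and-extend scheme, the gap between such seeds is what the extend step has to bridge, and a higher error rate yields shorter seeds.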

Keywords

Sequence analysis, Error correction, de Bruijn graph, Maximal exact matches

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Giles Miclotte (1)
  • Mahdi Heydari (1)
  • Piet Demeester (1)
  • Pieter Audenaert (1)
  • Jan Fostier (1)
  1. Department of Information Technology, Internet Based Communication Networks and Services (IBCN), Ghent University - iMinds, Gent, Belgium
