Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches

Miclotte, Giles; Heydari, Mahdi; Demeester, Piet; Audenaert, Pieter; Fostier, Jan

doi:10.1007/978-3-662-48221-6_13

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9289)

Abstract

Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed and extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented.
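
To make the seeding idea concrete, the following is a minimal sketch of finding maximal exact matches (MEMs) between a noisy long read and a reference sequence, such as the sequence of a single de Bruijn graph node. It is not the authors' implementation. The function name `maximal_exact_matches`, the `min_len` parameter, and the toy sequences are assumptions chosen for illustration. The scan is a naive quadratic walk over diagonals, whereas a practical tool would use an index structure.

```python
def maximal_exact_matches(read, ref, min_len):
    """Return sorted (read_pos, ref_pos, length) triples for every exact match
    of at least min_len bases that cannot be extended to the left or right.

    Hypothetical helper for illustration: a naive O(len(read) * len(ref)) scan.
    """
    mems = []
    n, m = len(read), len(ref)
    # Each diagonal d = ref_pos - read_pos is one ungapped alignment offset.
    for d in range(-(n - 1), m):
        i = max(0, -d)   # start position in the read
        j = i + d        # matching start position in the reference
        run = 0          # length of the current run of matching bases
        while i < n and j < m:
            if read[i] == ref[j]:
                run += 1
            else:
                # A mismatch ends the run, so the run is maximal on both sides.
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of either sequence is also maximal.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return sorted(mems)


# Toy example: the read differs from the reference by one substitution.
print(maximal_exact_matches("ACGTAGGCT", "ACGTTGGCT", 4))
# prints [(0, 0, 4), (5, 5, 4)]
```

The two seeds flank the erroneous base on the same diagonal, which is the situation a seed-and-extend aligner exploits: the region between consecutive seeds is the only part that still needs to be aligned and corrected.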

Acknowledgments

The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, the Hercules Foundation and the Flemish Government – department EWI. We acknowledge the support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: From Nucleotides to Networks”).

Author information

Authors and Affiliations

Department of Information Technology, Internet Based Communication Networks and Services (IBCN), Ghent University - IMinds, Gaston Crommenlaan 8 (bus 201), 9050, Gent, Belgium
Giles Miclotte, Mahdi Heydari, Piet Demeester, Pieter Audenaert & Jan Fostier

Corresponding author

Correspondence to Jan Fostier.

Editor information

Editors and Affiliations

University of Maryland, College Park, Maryland, USA
Mihai Pop
University of Lille, Lille, France
Hélène Touzet

About this paper

Cite this paper

Miclotte, G., Heydari, M., Demeester, P., Audenaert, P., Fostier, J. (2015). Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_13

DOI: https://doi.org/10.1007/978-3-662-48221-6_13
Published: 28 August 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science, Computer Science (R0)
