Abstract
Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads are an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from long runtimes. Recently, a new hybrid error correction method has been proposed, in which the second generation data is first assembled into a de Bruijn graph, onto which the long reads are then aligned. In this context we present Jabba, a hybrid method that corrects long third generation reads by mapping them onto a corrected de Bruijn graph constructed from second generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, we present theoretical results on the possibilities and limitations of using maximal exact matches in the context of third generation reads.
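To make the notion of a maximal exact match (MEM) seed concrete, the following minimal sketch finds all MEMs between a noisy read and a reference string by scanning every alignment diagonal. It is an illustration only, not Jabba's implementation: the function name `maximal_exact_matches`, the `min_len` parameter, and the toy sequences are assumptions made for this example, and the quadratic scan stands in for the index-based search a practical tool would need.

```python
def maximal_exact_matches(read, ref, min_len=3):
    """Return sorted (read_pos, ref_pos, length) tuples for all MEMs.

    A match is maximal when it cannot be extended to the left or right
    without hitting a mismatch or the end of either string.
    Naive O(len(read) * len(ref)) scan, for illustration only.
    """
    mems = []
    n, m = len(read), len(ref)
    # Diagonal d pairs read[i] with ref[i + d].
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run = 0  # length of the current run of matching characters
        while i < n and j < m:
            if read[i] == ref[j]:
                run += 1
            else:
                # A mismatch closes the run; keep it if long enough.
                if run >= min_len:
                    mems.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run may also be closed by the end of either string.
        if run >= min_len:
            mems.append((i - run, j - run, run))
    return sorted(mems)


if __name__ == "__main__":
    # Hypothetical toy data: the read differs from the reference at position 4.
    print(maximal_exact_matches("ACGTTGCA", "ACGTAGCA", min_len=3))
    # [(0, 0, 4), (5, 5, 3)]
```

The single substitution splits what would have been one 8-base match into two shorter seeds. This shows in miniature why a high error rate limits the length of the exact matches available for seeding.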
Acknowledgments
The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, the Hercules Foundation and the Flemish Government – department EWI. We acknowledge the support of Ghent University (Multidisciplinary Research Partnership “Bioinformatics: From Nucleotides to Networks”).
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Miclotte, G., Heydari, M., Demeester, P., Audenaert, P., Fostier, J. (2015). Jabba: Hybrid Error Correction for Long Sequencing Reads Using Maximal Exact Matches. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science, vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_13
DOI: https://doi.org/10.1007/978-3-662-48221-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer Science, Computer Science (R0)