MDAI 2011: Modeling Decision for Artificial Intelligence pp 198-210
An Efficient Hybrid Approach to Correcting Errors in Short Reads
Abstract
High-throughput sequencing technologies produce a large number of short reads that may contain errors. These sequencing errors are one of the major problems in analyzing such data. Many algorithms and software tools have been proposed to correct errors in short reads, but their computational complexity limits their performance. In this paper, we propose a novel and efficient hybrid approach that combines an alignment-free method with multiple alignments. We construct suffix arrays on all short reads to search for correct overlapping regions. For each correct overlapping region, we form multiple alignments of the substrings that follow it in order to identify and correct the erroneous bases. Our approach can correct all types of errors in short reads produced by different sequencing platforms. Experiments show that our approach provides significantly higher accuracy than previous approaches and runs at a comparable or even faster speed.
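The general idea of pairing a suffix array with a consensus step can be illustrated with a small Python sketch. This is not the paper's implementation: it assumes substitution errors only, builds the suffix array by naive sorting, and uses a majority vote on the single base following a shared k-mer in place of a true multiple alignment. The function names and the parameters `k`, `min_support`, and `max_minority` are hypothetical choices made for illustration.

```python
from collections import Counter


def build_suffix_array(reads):
    """Return (read_index, offset) pairs for all suffixes, sorted lexicographically.

    Naive construction by sorting; real tools use specialised algorithms.
    """
    suffixes = [(i, j) for i, read in enumerate(reads) for j in range(len(read))]
    suffixes.sort(key=lambda pair: reads[pair[0]][pair[1]:])
    return suffixes


def seed_groups(reads, suffix_array, k):
    """Yield runs of (read_index, next_pos) whose suffixes share the same k-mer seed.

    Suffixes with a common prefix are adjacent in the suffix array, so one
    linear scan is enough to collect each overlapping region.
    """
    run, run_seed = [], None
    for i, j in suffix_array:
        if j + k >= len(reads[i]):
            continue  # no base follows the seed in this suffix
        seed = reads[i][j:j + k]
        if seed != run_seed:
            if run:
                yield run
            run, run_seed = [], seed
        run.append((i, j + k))
    if run:
        yield run


def correct_reads(reads, k=4, min_support=2, max_minority=1):
    """One pass of substitution-only correction by majority vote (assumed rule)."""
    suffix_array = build_suffix_array(reads)
    corrected = [list(read) for read in reads]
    for run in seed_groups(reads, suffix_array, k):
        # Count the bases that follow the shared seed across all reads in the run.
        column = Counter(reads[i][p] for i, p in run)
        consensus, support = column.most_common(1)[0]
        if support < min_support:
            continue  # too little evidence to trust the consensus
        for i, p in run:
            base = reads[i][p]
            if base != consensus and column[base] <= max_minority:
                corrected[i][p] = consensus  # rare base replaced by the consensus
    return ["".join(read) for read in corrected]
```

Calling `correct_reads(["ACGTTGCA"] * 3 + ["ACGTAGCA"])` groups the four reads under the seed `ACGT`, sees `T` three times and `A` once in the following column, and rewrites the fourth read to `ACGTTGCA`. Handling insertions and deletions, as the abstract describes, would need an actual alignment of the substrings after the seed rather than a single-column vote.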
Keywords
High-throughput sequencing, Error correction, Suffix array, Multiple alignments