Abstract
Sequencing error correction has become an important step in the analysis of next-generation sequencing (NGS) datasets, because it improves data quality for downstream applications. In this chapter, we discuss different formulations of sequencing read error correction that are based on probabilistic models and can handle datasets with nonuniform read coverage. Nonuniform coverage is common in several applications of NGS, including small RNA and messenger RNA sequencing, metagenomics, metatranscriptomics, and single-cell sequencing, and the breadth of such applications is steadily increasing. We review popular formulations based on the Hamming graph of k-mers found in sequencing reads and place particular focus on a more general formulation using hidden Markov models that can also correct insertion and deletion (indel) errors.
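As a rough illustration of the Hamming graph of k-mers mentioned in the abstract, the following minimal sketch counts the k-mers in a few toy reads and links every pair of k-mers that differ in at most one position. The function names, the toy reads, the choice of k = 4, and the naive all-pairs comparison are assumptions made for illustration only; this is not the chapter's method, and the probabilistic step that decides which k-mer in a group is the true sequence is not shown.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_graph(kmers, max_dist=1):
    """Adjacency sets: k-mers are vertices, and an edge joins two k-mers
    that differ in at most max_dist positions (naive all-pairs comparison)."""
    graph = {kmer: set() for kmer in kmers}
    for a, b in combinations(graph, 2):
        if hamming_distance(a, b) <= max_dist:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy input (hypothetical): three identical reads and one read with a substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC"]
counts = count_kmers(reads, k=4)
graph = hamming_graph(counts)
```

In this toy input, the rare k-mer ACCT ends up adjacent to the frequent k-mer ACGT, which is the kind of neighborhood a Hamming graph based correction would examine. Because the graph only captures substitutions, it cannot represent indel errors, which is the gap the hidden Markov model formulation addresses.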
Acknowledgements
We would like to thank Dilip Ariyur Durai for his help with the Oases benchmark.
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Schulz, M.H., Bar-Joseph, Z. (2017). Probabilistic Models for Error Correction of Nonuniform Sequencing Data. In: Elloumi, M. (ed.) Algorithms for Next-Generation Sequencing Data. Springer, Cham. https://doi.org/10.1007/978-3-319-59826-0_6
DOI: https://doi.org/10.1007/978-3-319-59826-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59824-6
Online ISBN: 978-3-319-59826-0
eBook Packages: Computer Science, Computer Science (R0)