Abstract
Sequence assembly is an important step in bioinformatics studies. With next-generation sequencing (NGS) technology, DNA fragments (reads) can be sampled randomly and at high throughput from DNA or RNA molecules. However, because the positions from which the reads are sampled are unknown, an assembly process is required to combine overlapping reads and reconstruct the original DNA or RNA sequence. Compared with traditional Sanger sequencing, NGS offers higher throughput, but its reads are shorter and have a higher error rate, which introduces several problems for assembly. Moreover, paired-end reads, which carry more information than single-end reads, can now be sampled, yet existing assemblers cannot fully utilize this information and fail to assemble longer contigs. In this article, we revisit the major problems of assembling NGS reads from genomic, transcriptomic, metagenomic, and metatranscriptomic data, and we describe our IDBA package for solving these problems. The IDBA package adopts several novel ideas in assembly, including the use of multiple k values, local assembly, and progressive depth removal. Compared with existing assemblers, IDBA performs better on many simulated and real sequencing datasets.
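To make the basic problem concrete, the following is a minimal sketch of reconstructing a sequence from reads whose sampling positions are unknown, by repeatedly merging the pair of reads with the longest suffix-prefix overlap. It is a toy greedy illustration only and is not the method used by IDBA; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions introduced here. It also assumes error-free reads from a single strand with no repeats, which real NGS data do not satisfy.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap of at least `min_len` bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the longest overlap until no
    pair overlaps by at least `min_len` bases. Returns the remaining contigs.

    Hypothetical helper for illustration: assumes error-free, same-strand
    reads and a repeat-free source sequence.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: contigs cannot be joined further
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three reads sampled from "ACGTTGCAAGCT", given in arbitrary order.
print(greedy_assemble(["CAAGCT", "ACGTTG", "TTGCAA"]))
```

Even in this idealized setting, the sketch hints at why shorter reads and sequencing errors are troublesome: shorter reads yield shorter, more ambiguous overlaps, and a single erroneous base can hide a true overlap.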
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Chin, F.Y.L., Leung, H.C.M. & Yiu, S.M. Sequence assembly using next generation sequencing data—challenges and solutions. Sci. China Life Sci. 57, 1140–1148 (2014). https://doi.org/10.1007/s11427-014-4752-9
DOI: https://doi.org/10.1007/s11427-014-4752-9