Abstract
The development of “next-generation” high-throughput sequencing technologies has made it possible for many labs to undertake sequencing-based research projects that were unthinkable just a few years ago. Although the scientific applications are diverse (e.g., new genome projects, gene expression analysis, genome-wide functional screens, or epigenetics), the sequence data are usually processed in one of two ways: sequence reads are either mapped to an existing reference sequence, or they are assembled into a new sequence (“de novo assembly”). In this chapter, we first discuss some limitations of the mapping process and how these may be overcome through local sequence assembly. We then introduce the concept of de novo assembly and describe essential assembly improvement procedures such as scaffolding, contig ordering, gap closure, error evaluation, gene annotation transfer, and ab initio gene annotation. The results are high-quality draft assemblies that will facilitate informative downstream analyses.
An erratum to this chapter is available at http://dx.doi.org/10.1007/978-1-4939-1438-8_21
Acknowledgements
I would like to thank Adam Reid, Martin Hunt, and Bernardo Foth for proofreading the chapter.
Copyright information
© 2015 Springer Science+Business Media New York
About this protocol
Cite this protocol
Otto, T.D. (2015). From Sequence Mapping to Genome Assemblies. In: Peacock, C. (ed) Parasite Genomics Protocols. Methods in Molecular Biology, vol 1201. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-1438-8_2
DOI: https://doi.org/10.1007/978-1-4939-1438-8_2
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-1437-1
Online ISBN: 978-1-4939-1438-8
eBook Packages: Springer Protocols