From Sequence Mapping to Genome Assemblies

  • Thomas D. Otto
Part of the Methods in Molecular Biology book series (MIMB, volume 1201)


The development of “next-generation” high-throughput sequencing technologies has made it possible for many labs to undertake sequencing-based research projects that were unthinkable just a few years ago. Although the scientific applications are diverse (e.g., new genome projects, gene expression analysis, genome-wide functional screens, or epigenetics), the sequence data are usually processed in one of two ways: sequence reads are either mapped to an existing reference sequence or assembled into a new sequence (“de novo assembly”). In this chapter, we first discuss some limitations of the mapping process and how these may be overcome through local sequence assembly. We then introduce the concept of de novo assembly and describe essential assembly improvement procedures such as scaffolding, contig ordering, gap closure, error evaluation, gene annotation transfer, and ab initio gene annotation. The results are high-quality draft assemblies that will facilitate informative downstream analyses.

Key words

Mapping; De novo assembly; Assembly improvement; Local assemblies; Bin assemblies; Annotation



I would like to thank Adam Reid, Martin Hunt, and Bernardo Foth for proofreading the chapter.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Wellcome Trust Sanger Institute, Cambridge, UK
