De Novo Next Generation Sequencing of Plant Genomes
The genome sequencing of all major food and bioenergy crops is of critical importance in the race to improve crop production to meet the future food and energy security needs of the world. Next generation sequencing technologies have brought about great improvements in sequencing throughput and cost, but do not yet allow for de novo sequencing of large repetitive genomes as found in most crop plants. We present a strategy that combines cutting edge next generation sequencing with “old school” genomics resources and allows rapid cost-effective sequencing of plant genomes.
Keywords: Next generation sequencing; Oryza; Genome sequencing; Genome assembly
To meet the food and bioenergy security needs of the future, farmers must double or even triple crop yields on less land, with less water, on poorer soils, and with fewer pesticides. Information gained from sequenced genomes in crops, coupled with genetic association studies, will allow us to identify key genes/quantitative trait loci and networks that can lead to higher-yielding crops that can grow in extreme conditions but with reduced environmental impact. We therefore need to develop efficient and cost-effective methods to sequence major food crop genomes and their wild relatives.
Plant genome sequencing has progressed rapidly since the first genome (Arabidopsis thaliana) was completed in 2000 (Arabidopsis Genome Initiative 2000). The 389-Mb rice genome was completed in 2004 (International Rice Genome Sequencing Project 2005), and recently, a draft sequence of the 2,300-Mb maize genome was released (Pennisi 2008). All were sequenced using “traditional” sequencing approaches in which sequencing libraries are constructed from individual segments of the genome (such as those contained within bacterial artificial chromosome (BAC) clones) and are sequenced via gel electrophoresis and dideoxy terminator chemistry (i.e., “Sanger sequencing”). A whole-genome shotgun (WGS) strategy, made possible by improved assembly algorithms and in which the sequencing libraries are made directly from genomic DNA, has been used for several recent plant genomes (Tuskan et al. 2006; Jaillon et al. 2007; Paterson et al. 2009).
Recently, next generation sequencing (NGS) technologies have promised to further accelerate progress, with huge increases in sequencing throughput and, perhaps more importantly, the ability to avoid the handling of individual clones from shotgun libraries. There are currently four commercially available NGS technologies: 454 Life Sciences (acquired by Roche), Solexa (acquired by Illumina), ABI SOLiD (acquired from Agencourt Biosciences), and Helicos Biosciences. Although each has its specific features, they can generally be grouped into two classes based on the lengths of the sequence reads produced. Solexa, ABI SOLiD, and Helicos all produce very short reads in very large quantities, while the 454 platform produces a more moderate amount of sequence, but with much longer read lengths. Several of the platforms have already gone through multiple rounds of upgraded specifications, and improvements are likely to continue.
Not unexpectedly, applications for these platforms have up until now been those better suited to the short read lengths they produce. These include resequencing reference genomes (Wheeler et al. 2008), de novo sequencing of small bacterial genomes (McCutcheon and Moran 2007), assessing microbial diversity (Sogin et al. 2006), and gene expression, small RNA, and methylation analyses (Lister et al. 2008).
However, de novo sequencing of large repetitive eukaryotic genomes has yet to be accomplished with an NGS platform and remains an important and significant goal. There are many species for which a genome sequence is still a valuable and desirable resource and that are too distant from the nearest sequenced relative to take advantage of resequencing strategies. In fact, the emerging picture of the “pan genome” (Morgante et al. 2007) suggests that even sequencing genomes within a species may benefit from a de novo approach rather than a resequencing approach. The challenge of de novo sequencing of larger genomes is that assembly becomes difficult as repeat content increases, and many larger genomes, particularly those of crop plants, have significant repetitive content. These assembly challenges sometimes impact even traditional sequencing approaches, but are particularly problematic with short read technologies.
Physical map and MTP
As a pilot experiment, we selected a minimum tiling path (MTP) of 166 BAC clones across the short arm of chromosome 3 of O. barthii, the progenitor of the West African cultivated rice, Oryza glaberrima. The physical map had previously been generated with fingerprinted contigs (FPC; Soderlund et al. 2000) and SyMAP (Soderlund et al. 2006) using SNaPshot fingerprints and BAC end sequencing (BES) of a ~10× BAC library from O. barthii accession #105608 (see supplementary materials for further data on the BAC library). When aligned to the Oryza sativa reference genome via available BES, the selected MTP spanned over 19.4 Mb of the reference genome, with an average predicted clone size of 159 kb and an average predicted overlap between BACs of 42 kb.
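The reported MTP figures can be cross-checked with simple arithmetic. The sketch below is only an approximation: it assumes every adjacent pair of clones overlaps by the average amount and ignores any gaps in the physical map, and the function name `tiling_path_span_kb` is introduced here purely for illustration.

```python
def tiling_path_span_kb(n_clones: int, mean_clone_kb: float, mean_overlap_kb: float) -> float:
    """Approximate span of a tiling path in which every neighbor pair overlaps.

    Each clone contributes its full length, and each of the (n_clones - 1)
    junctions between neighbors removes one overlap from the total.
    """
    if n_clones < 1:
        raise ValueError("a tiling path needs at least one clone")
    return n_clones * mean_clone_kb - (n_clones - 1) * mean_overlap_kb


# Values quoted in the text: 166 clones, 159 kb mean size, 42 kb mean overlap.
span_kb = tiling_path_span_kb(166, 159, 42)
print(f"Approximate MTP span: {span_kb / 1000:.2f} Mb")
```

Under these assumptions the estimate comes to roughly 19.5 Mb, which agrees with the statement that the MTP spans over 19.4 Mb of the reference genome.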
Assembly of sequence data proceeded in four steps. First, the data were preprocessed to maximize the quality of the data provided to the assembler. This involved prefiltering the paired-end reads to remove those that did not contain both ends of the pair and screening both Titanium and paired-end reads for Escherichia coli and BAC vector contamination. After preprocessing, between 15.1× (pool 2) and 22.2× (pool 4) sequence coverage (paired-end and Titanium combined) was available for assembly. Physical coverage of paired ends was between 13.9× (pool 2) and 24.3× (pool 5). Second, the preprocessed data for each individual pool were assembled with the Newbler assembler from 454/Roche. The contiguity of these assemblies varied from pool to pool, with contig N50 ranging from 10.8 kb (pool 6) to 19.9 kb (pool 1) and scaffold N50 ranging from 242 kb (pool 5) to 518 kb (pool 1). A clear trend of declining assembly metrics was seen as proximity to the centromere increased.
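For readers unfamiliar with the N50 metric used above, the following is a minimal sketch of the common definition: the length L such that contigs (or scaffolds) of length L or longer contain at least half of the total assembled bases. The exact convention used by the Newbler assembler may differ in detail, so this is an illustration rather than a reproduction of its output.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Lengths are sorted from longest to shortest and accumulated until the
    running total reaches half of the summed length; the length at which
    that happens is the N50.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (not data from this study).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 is weighted by length, a few long scaffolds can dominate the metric, which is why scaffolding across pools can raise scaffold N50 substantially without changing contig N50.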
Table: Assembly statistics for each of the six BAC pools. Rows reported per pool: N50 scaffold size; number of contigs in scaffolds; N50 contig size; number of Titanium reads used in assembly; number of paired ends used in assembly; total base pairs used.
Table: Assembly statistics for the combined data covering the entire chromosome arm. Columns: stage 1, initial assembly; stage 2, merge overlaps between pools; stage 3, BAMBUS scaffolding. Rows: N50 scaffold size; total scaffold length; N50 contig size; total contig length.
Finally, to take advantage of the available BAC end sequences and to create scaffolds that could span across pools, we performed additional scaffolding using the BAMBUS scaffolding software (Pop et al. 2004). The resulting final assembly had a total length of 18.4 Mb and was composed of just 44 scaffolds. Two facts are of particular note. First, 90% of the chromosome arm was contained in just six scaffolds, with the largest spanning more than a third of the arm (6.6 Mb). Secondly, the scaffolding with BAMBUS led to a nearly ninefold increase in the scaffold N50. Detailed assembly statistics are provided in Tables 1 and 2.
Table: Sequence coverage per pool. Rows: mean coverage in pool; lowest coverage in pool; highest coverage in pool.
Strategies for genome sequencing have generally taken one of two approaches: WGS (Adams et al. 2000; Venter et al. 2001) or hierarchical clone-by-clone sequencing (AGI 2000; IRGSP 2005). For the major plant genomes sequenced to date, both strategies have been employed. The genomes for A. thaliana (AGI 2000), O. sativa ssp. japonica (IRGSP 2005), and, most recently, Zea mays (Pennisi 2008) were generated by a clone-by-clone approach using BAC clones. Conversely, others including O. sativa ssp. indica (Yu et al. 2002), Populus trichocarpa (Tuskan et al. 2006), Vitis vinifera (Jaillon et al. 2007), and Sorghum (Paterson et al. 2009) have been sequenced by WGS. Generally, the advantages of WGS are that it needs fewer sequencing libraries, has a simpler workflow, and is therefore faster while clone-by-clone approaches are usually preferred when the genome is sufficiently large and complex that WGS assembly is problematic. With the advent of NGS technologies, the ease of generating large quantities of sequence has made WGS strategies even more appealing. However, the reality is that the short read lengths produced by early generations of NGS platforms have limited their application in de novo genome sequencing to smaller genomes—primarily bacteria.
The primary reason for this is the increasing abundance of repetitive sequence in larger genomes. Prior to the arrival of NGS platforms, WGS on larger genomes was possible because Sanger-based sequencing platforms generated paired-end reads of 700 bp or greater and a range of insert sizes from 3 to 40 kb (Weber and Myers 1997). This combination of read length and paired reads spanning quite large distances can be used effectively by assembly algorithms to resolve many repeats and reconstruct a draft genome sequence (Myers et al. 2000; Batzoglou et al. 2002; Jaffe et al. 2003). However, current NGS platforms are unable to deliver both these features and thus cannot effectively span repeats.
In the absence of a realistic WGS strategy with NGS platforms, the alternative is a clone-based approach. The traditional single clone-by-clone approach has the advantage of reduced assembly complexity, but the disadvantage of requiring large numbers of sequencing libraries and all of the logistical challenges associated with such a large-scale effort. Pooling multiple BACs together reduces the number of libraries required and, with reasonable size pools, keeps assembly complexity sufficiently manageable.
There are several significant factors that contributed to the success of this strategy in O. barthii. The first is the availability of high quality physical map contigs from which we could select contiguous pools of BAC clones. The second is the availability of the Titanium platform from 454 Life Sciences. The increased read lengths that this platform produces, compared to its predecessor (GS FLX), are invaluable in producing a high quality assembly—particularly when combined with paired-end reads from the GS FLX platform. Recently, there was a report of unsatisfactory results from attempts at pooling as few as eight BAC clones from the salmon genome for sequencing with the 454 GS FLX platform (Quinn et al. 2008). It seems likely that the major factor in the difference between our contrasting experiences with BAC pooling is the increased read lengths we were able to obtain with the newer platform. Lastly, the use of BAC end sequences to combine multiple pools together allowed our sequence scaffolds to grow larger than our initial BAC pools.
Even with these positive factors in place, other factors may negatively impact the success of this strategy. The most obvious is the nature of the repetitive sequence in a given pool. This will certainly vary not only between species (and could have contributed to the experience with the salmon genome) but also across a given genome, as more repetitive regions such as centromeres are approached or traversed. In fact, our data demonstrate this very effect: pool 6 (closest to the centromere) had a contig N50 only 54% of that of pool 1 (furthest from the centromere) and more than twice the number of scaffolds.
Given the factors involved, selection of pool size requires a compromise between assembly quality and sequencing costs/logistics. The larger the pool, the more efficient the sequencing but the lower the expected quality of the assembly. For our proof-of-principle experiment, we selected 3-Mb pools after initial success with 1.5-Mb pools (data not shown). However, it is very possible that even larger pools could provide adequate assembly quality for many needs. For example, preliminary data show that combining data from all six 3-Mb pools for a single assembly gives only a 35% decrease in contig N50 to 9.3 kb and a 55% decrease in scaffold N50 to 161 kb. Given the expected variation in the content and nature of repeats around a genome and between species, it would be useful to be able to predict repeat content prior to deciding upon a pool size. In fact, with a physical map and BAC end sequences in hand, it may be possible to predict such regions and modify the pool size appropriately.
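The logistical side of the pool-size trade-off can be illustrated with a rough count of the sequencing pools a genome would require. This sketch assumes a genome of about 400 Mb (the approximate O. barthii size quoted in this paper) and ignores overlap between neighboring pools; `pools_needed` is a hypothetical helper, not part of any published pipeline.

```python
import math


def pools_needed(genome_mb: float, pool_mb: float) -> int:
    """Number of non-overlapping pools of size pool_mb needed to cover genome_mb."""
    if genome_mb <= 0 or pool_mb <= 0:
        raise ValueError("genome and pool sizes must be positive")
    return math.ceil(genome_mb / pool_mb)


# Compare the two pool sizes mentioned in the text with a larger, hypothetical one.
for pool_mb in (1.5, 3.0, 6.0):
    print(f"{pool_mb:>4} Mb pools -> {pools_needed(400, pool_mb)} pools")
```

Doubling the pool size halves the number of libraries to prepare, which is the efficiency gain that must be weighed against the expected loss in assembly contiguity.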
Two other aspects of our method are important to emphasize. Firstly, with the sole exception of physical map editing in the first step, the results presented here did not utilize the O. sativa RefSeq in any way other than in the assessment of results. Thus, the method is a true de novo assembly and could easily be applied to any species, regardless of which related genomes have been sequenced. Secondly, the quality of the sequence obtained with this strategy should be emphasized. Our results clearly show that overall nucleotide accuracy approaches the Bermuda standard for finished sequence (i.e., 1 error in 10 kb). The results also show that the assembled sequence has very good contiguity, due partly to successful assembly and partly to a feature inherent in the BAC pooling strategy: every pool is from a defined region of the genome. It is important to recognize that there are diverse uses for a genome sequence, and many researchers do not require a high standard for both accuracy and contiguity. Plant researchers, however, due to the nature of their work, often place a premium on knowing where in the genome a sequence resides. This is critical for trait mapping and breeding applications. For this reason, the genome sequencing strategy we present here is of particular utility for plant genomes.
To summarize, we utilized a fusion of “old school” physical mapping with ultrahigh throughput pyrosequencing to sequence an ~18-Mb chromosome arm in a single experiment. The resulting sequence of the chromosome arm has contiguity and accuracy that rival those of any genome sequencing approach. In addition to being useful for targeting specific regions of a genome, this approach could be scaled to generate the sequence of all 12 chromosomes (~400 Mb) of O. barthii in a month or two. Future improvements in NGS technologies may continue to reduce the time and cost of such projects, but this strategy will likely remain valuable for addressing the complexities of sequencing large repetitive genomes with NGS. It provides an immediate and practical solution to the rapid generation of genome sequences from large and complex eukaryotic genomes for accelerated biological discovery to address some of our most critical food and bioenergy challenges.
Generation of a BAC library for O. barthii
Seed from O. barthii accession #105608 was obtained from the International Rice Research Institute, Los Baños, Philippines. Megabase-size DNA, isolated from nuclei of 3-week-old seedling tissue, was partially digested with HindIII, and size-selected fragments were ligated into pAGIBAC1 for BAC library construction, using previously described protocols (Luo and Wing 2003; Ammiraju et al. 2006). The resultant BAC library (OB_ABa) contained 36,864 clones with an average insert size of 136 kb and represented about tenfold coverage of the 420-Mb O. barthii genome. The BAC library was deposited in the AGI BAC/EST Resource Center (www.genome.arizona.edu/orders) and is publicly available.
Generation of a minimum tiling path
An O. barthii chr3 short-arm MTP was selected from a whole genome physical map that was generated by SNaPshot fingerprinting and BAC end sequencing of a ~10× BAC library. The fingerprints were assembled into contigs using FPC (Soderlund et al. 2000). FPC contigs were aligned to the rice RefSeq (IRGSP 2005) using SyMAP (Soderlund et al. 2006), and adjacent contigs were merged when warranted. The MTP consisted of 166 BACs with eight gaps between FPC contigs.
Sequencing library preparation and sequencing
DNA from each BAC clone was isolated, sheared to a size of 2–6 kb (HydroShear), purified, and quantified. Equal microgram amounts of DNA from the 166 BACs were then combined into six pools: 28 sequential BACs each for pools 1 through 5 and 26 BACs for pool 6. Each DNA pool was individually processed using 454/Roche Titanium shotgun and standard long paired-end sequencing kits and sequenced on 454/Roche GS FLX sequencers according to the manufacturer's specifications. Each pool was sequenced individually for the Titanium single-end read data, but three pools were combined for the paired-end read data. Pool-specific paired-end datasets were created by comparing the paired-end reads to the pool-specific Titanium datasets using BLASTN and assigning each read to an individual pool.
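The pooling scheme amounts to splitting the ordered tiling path into consecutive blocks. The sketch below illustrates this with placeholder clone names (the `BAC_001` style identifiers are hypothetical, not the actual clone names) and shows that five pools of 28 plus one pool of 26 account for the 166 clones of the MTP.

```python
def sequential_pools(clones, pool_size):
    """Split an ordered list of clones into consecutive pools of pool_size.

    The final pool holds whatever clones remain, so it may be smaller.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return [clones[i:i + pool_size] for i in range(0, len(clones), pool_size)]


# Placeholder identifiers standing in for the 166 ordered MTP clones.
mtp = [f"BAC_{i:03d}" for i in range(1, 167)]
pools = sequential_pools(mtp, 28)
sizes = [len(pool) for pool in pools]
print(sizes)
```

Keeping the clones in map order is what ensures that each pool corresponds to one contiguous region of the chromosome arm.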
Assembly was performed in four stages. (1) Data preprocessing: paired-end reads were prefiltered so that only reads containing both ends of the pair were included. (2) Per-pool assembly: 454/Roche Titanium shotgun and standard long paired-end reads for each pool were combined and assembled using the 454 Newbler assembler (v2.0.00.10), after screening for E. coli and BAC vector sequence. (3) Merging of overlaps: contigs in overlap regions between neighboring pools were identified from both pools via alignments with MUMMER version 3.20 (Delcher et al. 2002). Underlying reads for these contigs were then combined and reassembled to remove duplication in overlap regions. Alignment against the rice RefSeq identified one misassembled contig caused by a single problematic read; the affected pool was reassembled after removal of that read. A composite assembly was then constructed from the scaffolds and unscaffolded contigs of all six pools. (4) Scaffolding: the BAMBUS scaffolding software (Pop et al. 2004) was used to construct superscaffolds across pool boundaries using paired BAC end sequences from O. barthii combined with the above composite assembly.
Assembly accuracy of the initial 454 Newbler assembly (stage 1) was assessed by comparing the overlap regions of neighboring pools using BLASTN (Altschul et al. 1990). Mismatches and gaps were counted, and the gaps were categorized by the nature of the underlying sequence. Since the overlap regions were independently sequenced from different substrates (overlapping BAC clones), nonmatching sequence could be of biological origin (e.g., mutation in the BAC clone, or heterozygous plant material) or could be due to sequencing or assembly error. For this reason, these accuracy estimates should be considered a minimum accuracy expected from this sequencing approach.
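A discrepancy count of this kind can be converted to a rate per 10 kb for comparison with the Bermuda standard of 1 error in 10 kb. The sketch below is one simple way to do this, assuming mismatches and gap positions are weighted equally; the counts in the usage example are invented for illustration and are not results from this study.

```python
def discrepancies_per_10kb(mismatches: int, gap_positions: int, aligned_bp: int) -> float:
    """Discrepancies (mismatches plus gap positions) per 10 kb of aligned sequence."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return (mismatches + gap_positions) * 10_000 / aligned_bp


def meets_bermuda_standard(rate_per_10kb: float) -> bool:
    """True if the rate is no worse than 1 error in 10 kb."""
    return rate_per_10kb <= 1.0


# Hypothetical counts, for illustration only.
rate = discrepancies_per_10kb(mismatches=12, gap_positions=30, aligned_bp=200_000)
print(f"{rate:.2f} discrepancies per 10 kb; meets standard: {meets_bermuda_standard(rate)}")
```

Because some discrepancies between overlapping clones are biological rather than technical, a rate computed this way is an upper bound on the error rate.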
Sequence submission to NCBI
Assembled contigs and scaffolds for each of the six 3-Mb pools, as well as the entire ~18.4-Mb O. barthii chromosome 3 short arm superscaffold, have been deposited at DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank under the project accession ABRL00000000.
This work was supported by National Science Foundation grants DBI-0638541 (to RAW and SR) and DBI-0321678 (to RAW), and by the Bud Antle Endowed Chair (to RAW).
- Ammiraju JS, Luo M, Goicoechea JL, Wang W, Kudrna D, Mueller C, et al. The Oryza bacterial artificial chromosome library resource: construction and analysis of 12 deep-coverage large-insert BAC libraries that represent the 10 genome types of the genus Oryza. Genome Res. 2006;16(1):140–147.