The plant kingdom is filled with amazing diversity and significance. Plants form the base of the food chain that provides food for all living organisms, and just 15 crop plants provide 90% of the world's food intake [1]. Plant species are responsible for maintaining the balance of the carbon cycle [2], for developing soil and protecting it from erosion [3], and are promising sources of renewable energy [4]. Plant byproducts are used in many human medicines [5], and plants have been essential model organisms for studying biological systems such as the role of transposons and epigenetics [6]. For all these reasons and many more, there is great interest in sequencing plant genomes, but relatively few plant species have been sequenced compared with the hundreds of thousands of species around the world.

The first free-living organisms were sequenced less than 20 years ago, starting with simple microbial genomes [7], and increasing in complexity to the first eukaryotic genomes [8], the first multicellular species [9], and then on to plant genomes, including Arabidopsis thaliana (thale cress) [10], Oryza sativa (rice) [11], Carica papaya (papaya) [12] and Zea mays (maize) in 2009 [13], using first-generation capillary sequencing. Since then many others have been sequenced leveraging second-generation sequencing, including Fragaria vesca (strawberry) [14], Solanum lycopersicum (tomato) [15] and Cajanus cajan (pigeonpea) [16], and dozens more are nearing completion [17]. This increase in sequenced plant genomes has largely been driven by technological improvements: whereas the first generation of automated DNA sequencing instruments could sequence thousands of base pairs per day, current state-of-the-art second-generation sequencing instruments can sequence many billions of bases per day for hundreds or thousands of dollars per gigabase instead of millions or billions of dollars per gigabase [18]. These technologies have been applied to study thousands of genomes across the tree of life, enabling rich annotation of their gene networks [19], the development of comparative genomics approaches to infer evolutionary and domestication forces [13], the cataloging of genomic markers to optimize plant breeding [20], and numerous other studies that use the genome sequence as the backbone of the analysis [21].

In contrast to the tremendous advances in throughput, assembling sequencing reads remains a substantial endeavor, much greater than the sequencing efforts alone would suggest [22-24]. Large complex plant genomes remain a particularly difficult challenge for de novo assembly for a variety of biological, computational and biomolecular reasons. Plant genomes can be nearly 100 times larger [25] than the currently sequenced bird [26], fish [27] or mammalian genomes [28]. In addition, they can have much higher ploidy (polyploidy is estimated to occur in up to 80% of all plant species [29]), and higher rates of heterozygosity and repeats [30] than their counterparts in other kingdoms. Furthermore, the gene content in plants can be very complex, as shown by the presence of large gene families and abundant pseudogenes with nearly identical sequences derived from recent whole genome duplication events and transposon activity [13]. Plants tend to have high-copy chloroplast and mitochondrial organelles, which complicate assembly of their remnants in the nuclear genome and skew coverage levels [12]. Finally, it is often very difficult to extract large quantities of high-quality DNA from plant material, making it difficult to prepare proper libraries for sequencing.

For all of these reasons, sequencing and de novo assembling a plant genome can create a highly fragmented result. Instead of the large contigs and scaffolds spanning large chromosome regions seen in recent vertebrate genome assemblies [31], the sequencing reads are more likely to assemble into isolated gene islands among a background of high-copy repeats [13]. Furthermore, the gene sequences may not always be correct, considering that nearly identical gene families are notoriously difficult to assemble and may collapse into a mosaic sequence without necessarily representing any member of the family [32]. If the level of fragmentation and mis-assembly is too great, downstream analysis will be noisy, and could even lead to false conclusions about the biology [33].

Knowing how to assemble these genomes accurately, how to best make use of the potentially highly fragmented assemblies and how to perform these applications at the lowest cost are important in today's funding environment. Genome assembly has always been an incremental process, and there are only a handful of truly finished large genomes today - even the latest release of the 'finished' human reference genome has millions of unresolved nucleotides [34]. Therefore, we need to assess when an assembly is good enough to be useful to the community, and how funding agencies can get the most out of the available funding. Finally, researchers need to stay afloat in a rapidly evolving landscape: technology is changing so quickly that it is challenging to know what the guidelines for plant assembly will be in 12 months or beyond. Here we assess the state of the art of de novo assembly, assess what can be expected to develop, and review the best practices for the plant community.

Assessing the needs

Assembling any genome requires the proper combination of coverage, read length and read quality [22]. If any of these requirements is not met, then it is a mathematical certainty that the assembly will be fragmented into many small contigs. The Lander-Waterman model offers an analytic, if optimistic, prediction of the minimum coverage needed to assemble large contigs [35]. Using this model, a minimum of 15-fold coverage is required to assemble 100 bp reads into large contigs. However, once coverage has been adjusted for errors, ploidy, sequence biases and other complicating factors, the minimum required coverage level may be much higher and sequencing to at least 100-fold coverage is recommended [31].
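The Lander-Waterman prediction can be explored numerically. The following Python sketch is illustrative only: it uses the standard idealized formulas for N reads of length L sampled uniformly from a genome of size G, with coverage c = NL/G and sigma = 1 - T/L, where T is the minimum overlap an assembler needs to join two reads. The expected number of contigs is N·exp(-c·sigma) and the expected contig length is L·[(exp(c·sigma) - 1)/c + (1 - sigma)]. The function name and the choice of T = 20 bp are assumptions made for this example, not values taken from the text.

```python
import math


def lander_waterman(genome_size, read_length, coverage, min_overlap):
    """Idealized Lander-Waterman expectations for uniform, error-free reads.

    Returns (expected_contigs, expected_contig_length_bp, fraction_covered).
    """
    n_reads = coverage * genome_size / read_length
    # sigma: fraction of a read that is NOT needed to detect an overlap
    sigma = 1.0 - min_overlap / read_length
    expected_contigs = n_reads * math.exp(-coverage * sigma)
    expected_length = read_length * (
        (math.exp(coverage * sigma) - 1.0) / coverage + (1.0 - sigma)
    )
    fraction_covered = 1.0 - math.exp(-coverage)
    return expected_contigs, expected_length, fraction_covered


if __name__ == "__main__":
    # Hypothetical 3 Gbp genome, 100 bp reads, assumed 20 bp minimum overlap.
    for c in (2, 5, 10, 15, 30):
        contigs, length, covered = lander_waterman(3e9, 100, c, 20)
        print(f"{c:>3}x  contigs={contigs:.3g}  "
              f"mean contig={length:.3g} bp  covered={covered:.6f}")
```

Under these assumptions the expected contig length grows exponentially with coverage, which is why the model predicts a sharp transition from short to long contigs. The model ignores repeats, sequencing errors and coverage bias, so real projects need considerably more coverage than it suggests.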

This statistical model also does not consider repeat composition, and short reads alone may never have the information content to resolve complex repetitive sequences. Resolving large or complex repeats fundamentally requires longer spanning information to bridge across the repeats back to unique sequence in the form of longer reads, mate-pairs, long-range mapping information or a method for fragment localization [32]. Read quality is also not directly considered in the Lander-Waterman model, but low-quality reads will reduce effective coverage and obscure true overlaps between sequencing reads, thus fragmenting the assembly and risking collapsing more repeats.

Overcoming these challenges depends on advances in both sequencing technology and assembly technology. Sequencing technology needs: (1) instrumentation improvements, including improvements in throughput, cost, read lengths and accuracy; and (2) molecular protocols, including developing new types of libraries and also new techniques for multiplexing samples to take advantage of the tremendous throughput available per instrument run. Assembly technology needs: (1) improved algorithms for accurately assembling complex genomes at scale; and (2) improved analytics to record, manipulate, analyze and visualize features to translate the salient assembly information to the broader plant biology community.

Sequence technology

The highest capacity sequencing instruments available today, such as the Illumina HiSeq 2000, can sequence nearly 100 Gbp per day, and make it possible to sequence a 3 Gbp genome to high coverage for less than US$10,000 [36]. Using these technologies, it is also possible to sequence paired-end or mate libraries ranging in size up to a few thousand base pairs. As such, even large plant genome projects can count on relatively inexpensive, deep coverage with approximately 100 bp reads and 1 to 5 kbp mate libraries. However, these short reads and small libraries have substantial limitations for large genomes with large repetitive content. Constructing high-quality draft genome assemblies for the largest plant genomes absolutely requires enhanced sequencing approaches to generate longer reads and mate-pair libraries, and protocols for localizing the sequencing and assembly problem.

One of the strongest needs is for protocols for efficiently generating a mix of larger libraries, such as 10 kbp, 40 kbp or 150 kbp in addition to standard 5 kbp libraries. Currently available protocols for these larger sizes, such as with fosmids [37], or bacterial artificial chromosome (BAC)-end sequencing [38], are effective but are laborious, costly and time consuming relative to the sequencing itself. Furthermore, the larger libraries inevitably have increased size variance and less reliable mate information. The sequencing itself needs to be improved to reduce the biases from GC composition, chimeric reads and mates, and other effects so that the coverage along the genome will be uniform and complete [39].

One promising route to substantially longer reads and unbiased coverage is third-generation sequencing, such as the technology from Pacific Biosciences [40] and the newly announced instruments from Oxford Nanopore [41]. These platforms promise to generate longer reads that can be used for sequencing through complex repeats, linking gene islands and phasing haplotypes. However, these technologies are still too immature for immediate widespread application to all large genomes of interest. Sequencers from Roche/454 make it possible to sequence approximately 700 bp reads, but at greater cost than short read sequencing, and reads of this length may not be sufficient to span the largest repeats [42].

Optical mapping technologies are another possibility for generating very long range linking information between sequence contigs and have a successful history in plant genomics [43, 44], although the current worldwide capacity is also below the demand. New technologies such as nanocoding [45], and new instruments from commercial vendors, including OpGen [46] and BioNanoGenomics [47], are expected in the next couple of years and they could expand the capacity for optical mapping similar to that seen in sequencing.

A complementary approach to improved sequencing and mapping is to develop methods for localizing sequencing and thus simplifying the assembly problem. There is a successful history of BAC-by-BAC sequencing of plant genomes [10, 11], and this is effective in the sense that assembling an isolated BAC is far simpler than assembling the entire genome. However, this technology is now prohibitively expensive without significant enhancement. For example, sequencing large genomes such as maize using a BAC-by-BAC approach costs tens of millions of dollars and requires hundreds of thousands of BAC clones. While next-generation sequencing would certainly reduce this cost, it is not readily possible to efficiently use next-generation sequencing on the number of BAC clones needed. This, coupled with the high cost of making and storing the large numbers of libraries needed, greatly limits the feasibility of BAC-by-BAC sequencing in the next-generation world.

A version of BAC-by-BAC sequencing that uses pools of BACs or pools of fosmids is an attractive option for localizing the problem, assuming such libraries can be efficiently made and barcoding protocols can be effectively applied to tag the molecules [48]. However, to utilize the capacity of current sequencers fully, so many BACs need to be pooled in a lane that it would not effectively localize the assembly problem unless the BACs can be multiplexed and barcoded to a very high degree. Furthermore, preparing and storing these libraries will still incur a substantial cost unless they can be made in a fully automated fashion. Alternative molecular isolation technologies that can be used for localizing individual chromosomes in the sample, such as flow sorting, are promising alternatives and are starting to become more widely available [49, 50].

Assembly technology

Genome assembly has been metaphorically described as the process of assembling a jigsaw puzzle from the individual reads [22]. In the case of the largest, most repetitive plant genomes, it could be metaphorically described as assembling a large jigsaw consisting of blue sky separated by nearly indistinguishable wisps of white clouds of genes - seemingly an impossible task. Assembly generally follows a hierarchical approach of comparing the individual reads to form an assembly graph of the overlapping reads or kmers, then simplifying the graph to form the initial contigs, and finally using mate-pairs and marker information to order and orient the initial contigs into scaffolds (Figure 1). Assembling a large genome is operationally complicated in that it demands extensive error correction and filtering, and large computational resources, and is often highly sensitive to the parameters used. Even beyond these complications, assembly is fundamentally complicated because repeats introduce ambiguity in how the reads should be ordered so that no perfect algorithm exists for reconstructing entire genomes even if every base of the genome has been sequenced to high depth.

Figure 1

Schematic overview of genome assembly. (a) DNA is collected from the biological sample and sequenced. (b) The output from the sequencer consists of many billions of short, unordered DNA fragments from random positions in the genome. (c) The short fragments are compared with each other to discover how they overlap. (d) The overlap relationships are captured in a large assembly graph shown as nodes representing kmers or reads, with edges drawn between overlapping kmers or reads. (e) The assembly graph is refined to correct errors and simplify into the initial set of contigs, shown as large ovals connected by edges. (f) Finally, mates, markers and other long-range information are used to order and orient the initial contigs into large scaffolds, as shown as thin black lines connecting the initial contigs.
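Steps (c) to (e) can be illustrated with a toy example. The Python sketch below builds a kmer-based assembly graph from a few short reads and walks its unambiguous path to recover a contig. It is a deliberately minimal illustration under strong assumptions (error-free reads, a single strand, no repeats); the function names, reads and k value are hypothetical, and production assemblers add error correction, graph simplification and scaffolding on top of this idea.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend from `start` while the path is unambiguous (one way in, one way out)."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break  # branch or cycle: stop rather than guess
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Three overlapping toy reads drawn from one short, repeat-free sequence.
    reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
    graph = build_kmer_graph(reads, k=4)
    print(walk_contig(graph, "ATG"))
```

The walk stops at the first node with more than one successor or predecessor. That is exactly where repeats, sequencing errors and heterozygosity break real assemblies into separate contigs.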

Several short-read assembly packages have been proven for mammalian-sized genomes up to the 3 Gbp human genome, including ABySS [51], ALLPATHS-LG [31], the Celera Assembler [52, 53], Newbler [54], SGA [55] and SOAPdenovo [56]. These assemblers can produce high-quality assemblies from short reads, although they generally require servers or clusters with 512 gigabytes of RAM and many terabytes of disk space available for a gigabase-sized genome [31]. However, these servers are decreasing in cost and can be purchased for under US$35,000 from several major computer vendors [57], and supercomputing centers make them available at no cost [58]. This is promising, but assembling the largest plant genomes currently being sequenced, such as the loblolly pine genome of approximately 21 Gbp [59], will increase the computational demands by nearly an order of magnitude, for which there is no proven technology. Enhanced algorithms for compressing the data and distributing the computation are actively being researched [55].

Two major efforts to evaluate the state of the art in assembly technology were published last year: the Assemblathon [24] and the Genome Assembly Gold-Standard Evaluation (GAGE) [23]. Both projects evaluated the performance of various genome assemblers in a competitive framework with both simulated and real datasets. They showed that the quality of the results differed greatly depending on the assembler and pipelines used. Researchers planning to assemble a genome of any size are encouraged to study their results, such as the need for error correction, the recommended assemblers and the evaluation criteria. However, the genomes studied in these projects were relatively small and simple compared with the most complex plant genomes. The plant community would be well served by hosting regular competitions with plant genomes, especially since all of the major assemblers have been developed targeting vertebrate genomes, and no assembler has been proven with higher levels of ploidy or heterozygosity.
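As one concrete example of an evaluation criterion, contiguity is commonly summarized with the N50 statistic: the contig length such that contigs of that length or longer contain at least half of the assembled bases. The text above does not single out a metric, so N50 is chosen here purely for illustration. The Python sketch below is a minimal implementation and the contig lengths are made up.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    # Two hypothetical assemblies with the same total size (300 bases).
    contiguous = [100, 80, 60, 40, 20]
    fragmented = [10] * 30
    print(n50(contiguous), n50(fragmented))
```

N50 measures contiguity only and says nothing about correctness. A mis-assembled genome can score well, which is why the evaluations discussed above also compare assemblies against reference data.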

Related to the de novo assembly problem, research is greatly needed to help improve the representation of assembled genomes, including creating graph-centric and population-aware formats that can represent the complexities of plant genomes, particularly those that are only partially assembled [60-62]. Incremental algorithms that can update the assembly and annotation as new data become available would also be extremely useful [33]. Finally, continued research into assembly validation is necessary for determining when an assembly is correct and conclusions can be trusted [32, 63].

Analytics

Sequencing and assembling a genome are often just the first stages of a larger study. Immediately following the assembly, the genome will need to be annotated to catalog genes and other features of interest [64], or aligned to other genomes to enable comparative genomics studies [65]. Several sequencing-based assays, such as RNA-seq [66] and Methyl-seq [67], can be used with the assembly to study transcriptionally or epigenetically active regions of the genome, and population studies will often attempt to build higher-order relationships, such as gene networks, or relate genotype to phenotype.

Currently, pipelines are available for carrying out these operations and displaying results in a 'genome browser', but continued research is needed to make the pipelines and results more accessible to different types of users. Systems such as Galaxy [68], Gramene [69] and Drupal [70] are among the leading graphical systems for executing workflows, visualizing sequencing assay results, and enabling collaborative discussions, respectively, but they operate as separate systems. A fully integrated system, such as those proposed by the iPlant [71] and DOE Systems Biology Knowledgebase [72] initiatives, would lower the barrier for learning to operate these functions. In either case it is critical that the community enhance these systems and the underlying algorithms to better support the complexity of plant genomes and their evolving assemblies.

Trends and recommendations

The plant kingdom has incredible variation and diversity, and as a result each plant sequencing project seems to have its own unique analysis needs. Sequencing and assembly technologies are evolving so rapidly it is impossible to predict what will be available even one year in the future. Despite these complexities, certain trends are emerging as best practices.

Mixed library, high-coverage sequencing

Because of economic and technological reasons, the majority of sequence produced in the next 18 months will continue to originate from short reads of approximately 100 to 200 bp. Fortunately, sequences of this length can be assembled into high-quality draft assemblies for genomes as complex as human when sequenced in a mixture of libraries. In particular, Gnerre et al. [31] recommend 45× paired-end (2 × 100 bp at 180 bp), 45× short jump (2 × 100 bp at 3 kbp), 5× long jump (2 × 100 bp at 6 kbp) and 1× fosmid (2 × 26 bp at 40 kbp) to generate high-quality draft assemblies. Since the paired-end reads designed in this way overlap by approximately 20 bp, they can be preassembled into pseudo-long reads of approximately twice the original length using the built-in capabilities of ALLPATHS-LG [31] or by a standalone preassembler such as FLASH [73]. Assemblers that do not include built-in error correction greatly benefit from then applying software such as Quake [74] to identify and fix sequencing errors before assembly. The larger libraries are then needed for ordering the initial contigs into progressively larger scaffolds.
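The preassembly of overlapping paired-end reads can be sketched in a few lines. The Python example below is a simplified illustration of the idea, not the algorithm used by FLASH or ALLPATHS-LG: it reverse-complements the second read, accepts the longest suffix-prefix overlap whose mismatch rate is under a threshold, and joins the two reads. The function names, the `min_overlap` and `max_mismatch_rate` parameters and their defaults are assumptions made for this example.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a forward read with its opposite-strand mate.

    Tries overlaps from longest to shortest and accepts the first one whose
    mismatch rate is within the threshold. Returns None if no overlap passes.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    # Simulate one error-free 180 bp fragment sequenced as 2 x 100 bp.
    rng = random.Random(0)
    fragment = "".join(rng.choice("ACGT") for _ in range(180))
    read1 = fragment[:100]
    read2 = reverse_complement(fragment[-100:])
    merged = merge_pair(read1, read2)
    print(len(merged), merged == fragment)
```

With 100 bp reads from a 180 bp fragment, the two reads share about 20 bp, so the merged pseudo-long read is nearly twice the original read length. Real tools score candidate overlaps using base qualities and resolve mismatches in the overlap, which this sketch does not attempt.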

For the largest and most complex plant genomes, even these libraries may not be sufficient to span the largest or most complex repeats, and it may be necessary to employ a hybrid approach using a combination of short and long reads, and even long-range mapping technologies or localization methods. Long reads of approximately 700 bp are available today from Roche/454, albeit at higher cost than short read sequencing, and third-generation sequencing technologies promise to provide even longer reads. As sequencing costs and instrument runtimes continue to drop, researchers are also advised to sequence a low-coverage 'genome snapshot' to evaluate the genome and library composition before attempting to sequence the genome to high coverage.

Bioinformatics partnerships

Assembling and analyzing raw sequence data still require substantial bioinformatics effort and expertise. Before attempting a complex assembly, plant biologists are strongly encouraged to develop partnerships with bioinformatics laboratories that have sufficient skills and resources to handle the onslaught of data and diagnose problems as they occur. Fortunately, the funding agencies are aware of these challenges, and it is our hope they would be responsive to requests for appropriate bioinformatics funding.

Bioinformatics laboratories are encouraged to enhance, expand and refine their algorithms and analytics specifically for the complexities of plant genomes. In particular, because of high diversity, heterozygosity and ploidy not found in other kingdoms, there is a strong need to develop a plant-specific genome assembler that can overcome these challenges and represent the plant genome assemblies in more versatile graph-based formats along with the supporting tools for analyzing these graphs (Figure 2). Furthermore, the trend in bioinformatics software development is to develop only enough of a user interface to support the needs of a particular project. If this trend continues, many groups will reinvent the same software over and over again, wasting time and resources. Instead, funding agencies would be better served by requiring software to be developed with a high-quality user-friendly interface or integrated into a graphical system such as Galaxy, even if it requires modestly more upfront funding.

Figure 2

Ploidy, heterozygosity and the assembly graph. (a) Schematic representation of a tetraploid genome, such as apple, cotton or cabbage, consisting of haploid chromosomes A to D with homozygosity/heterozygosity shown as different colored blocks. (b) Even without repeats or sequencing error, the assembly graph of the homozygous and heterozygous segments of the genome branches and intertwines in complex patterns. A plant-specific assembler would need to recognize these branching patterns and attempt to reconstruct the individual sequences for chromosomes A to D.
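The branching described in panel (b) can be reproduced with a toy example. The Python sketch below builds a kmer graph from one or two hypothetical haplotypes that differ by a single base and reports the nodes where the graph branches. The sequences, function names and k value are invented for illustration; a tetraploid genome would contribute up to four such paths at each heterozygous site.

```python
from collections import defaultdict


def kmer_graph(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            graph[seq[i:i + k - 1]].add(seq[i + 1:i + k])
    return graph


def branch_nodes(graph):
    """Return the nodes with more than one successor, in sorted order."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)


if __name__ == "__main__":
    hap_a = "ATGGCGTGCAATCC"
    hap_b = "ATGGCGTTCAATCC"  # same sequence with one substituted base
    print(branch_nodes(kmer_graph([hap_a], k=4)))         # homozygous case
    print(branch_nodes(kmer_graph([hap_a, hap_b], k=4)))  # heterozygous case
```

A single haplotype yields a simple path with no branches. Adding the second haplotype splits the path at the variant and rejoins it afterwards, forming the 'bubble' that an assembler must either collapse into one consensus or keep as two phased sequences.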

Awareness, training and education

Principal investigators need to become better informed of the current best practices for genome assembly and develop a better understanding of the effort involved to sequence, assemble, annotate and analyze a new genome. More classes and training are needed for graduate and undergraduate students to learn the fundamentals of sequence analysis and quantitative techniques. Better training is needed to teach non-experts to use the software packages, and to educate everyone about the resources that are available. The plant sequencing community would benefit by forming and hosting plant genome analysis competitions in the spirit of the Assemblathon or GAGE to evaluate the state of the art for assembly, annotation and other assays. The best practices of today are certain to change as new sequencing, mapping and computational technologies are introduced, and this will be the only way to monitor these developments.

Final thoughts

We are still many years away from push-button sequencing and assembly of complex plant genomes into completely finished genomes at low cost. Nevertheless, it is now possible and affordable to sequence and assemble great numbers of interesting plant genomes into highly useful draft genome assemblies if one is mindful of the biotechnology and algorithmic challenges involved. The next frontier for plant genomics is to characterize the diversity of genomic variations across large populations, deeply annotate their functional elements, and develop predictive quantitative models relating genotype to phenotype. Improved sequencing technology and sequencing assays are certain to play a large role in these studies as well, and we envision a tight relationship between biology, biotechnology and analytics for years to come.