The draft genome sequence of cassava was obtained by a whole genome shotgun (WGS) strategy (Fig. 1a). In this approach, nuclear genomic DNA is isolated and broken randomly into short fragments, which are then sorted into bins of various sizes. The ends of these fragments are sequenced, and the resulting short sequence reads are computationally reassembled to reconstruct long stretches of genomic sequence. Assembled sequences are called “contigs” if they represent unbroken (i.e., contiguous) stretches of the genome, and “scaffolds” if they contain one or more gaps due to either unsampled sequence or repetitive regions that are difficult to reconstruct from short sequence reads. These gaps are represented as runs of the unknown nucleotide “N”, with the number of Ns roughly corresponding to the length of the gap.
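The relationship between scaffolds, contigs and gaps can be illustrated with a short Python sketch. It is not part of the assembly pipeline described here; the function name `split_scaffold`, the `min_gap` parameter and the toy sequence are all hypothetical, and the sketch simply assumes that every run of Ns marks a gap.

```python
import re


def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into its contigs and report the gap lengths.

    Assumption for illustration: any run of at least `min_gap` N (or n)
    characters is treated as a gap between two contigs.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # Contigs are the unbroken stretches between runs of N.
    contigs = [piece for piece in gap_pattern.split(scaffold) if piece]
    # The length of each run of N approximates the size of the gap.
    gaps = [len(match.group()) for match in gap_pattern.finditer(scaffold)]
    return contigs, gaps


# Toy scaffold (invented sequence): three contigs separated by two gaps.
example = "ACGTACGTNNNNNGGCCTTAANNNACGT"
contigs, gaps = split_scaffold(example)
print(contigs)  # ['ACGTACGT', 'GGCCTTAA', 'ACGT']
print(gaps)     # [5, 3]
```

Raising `min_gap` would treat isolated ambiguous bases as part of a contig rather than as a gap; which threshold is appropriate depends on how a given assembler writes its output.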
Assembling the cassava genome is computationally challenging for two reasons. Firstly, repetitive sequences, often transposon-related, are interspersed between the compact genes of cassava, just as they are in most plant genomes (Gill et al. 2008). Since copies of a given repetitive element may be very similar at the sequence level, it is difficult to reconstruct each copy, especially with short sequence reads. Fortunately, reads from the 454 FLX Titanium and experimental FLX+ platforms are typically 400 and 700 nt long, respectively, which is long enough to assemble many repetitive segments and genes together on scaffolds spanning hundreds of thousands of base pairs (see below).
Secondly, cassava, as an outbreeding species, has allelic variation, both SNPs and structural polymorphisms (deletions, insertions, and inversions involving hundreds or thousands of nucleotides), which can complicate the derivation of a single “reference” sequence for each locus. In an attempt to minimize the number of heterozygous alleles that the assembler would have to deal with, DNA from a partially inbred cassava cultivar was provided for the genome sequencing project. This line, named AM560-2, was developed at CIAT and is reported as being an S3-derived inbred from the cassava cultivar MCOL-1505 (J. Tohme, unpublished results).
The cassava genome spans an estimated 770 Mb (Awoleye et al. 1994) in n = 18 chromosomes. A total of 22.4 billion bp of raw sequence data was generated (Table 1), enough to cover the genome ~29 times. This redundancy is necessary to assemble the reads into contigs and scaffolds (Fig. 1a). These reads were assembled using Roche’s GS de novo assembler (Newbler) v. 2.5 (Quinn et al. 2008) into 12,977 scaffolds that span a total of 532.5 Mb. This version 4 assembly is somewhat short of the estimated genome size, but, based on an analysis of the unassembled reads, the missing sequence is dominated by repetitive sequences such as transposable elements, ribosomal RNA genes and centromeres. Although the genome assembly is in nearly 13,000 pieces, half of it is captured in only 487 scaffolds, each longer than 258 kbp and containing 49 or more genes.
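Two of the figures in this paragraph follow from simple calculations, sketched below in Python. The coverage value uses only numbers quoted in the text. The "half of the assembly" statistic is what is commonly called N50 (with the scaffold count sometimes called L50); those names, the function names and the toy list of lengths are assumptions for illustration, since the real scaffold lengths are not reproduced here.

```python
def coverage(total_bases, genome_size):
    """Average sequencing depth: bases sequenced per base of genome."""
    return total_bases / genome_size


def n50(lengths):
    """Return (length, count): the shortest sequence length and the number
    of sequences needed, taking the longest first, to reach half of the
    total assembled length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must not be empty")


# Figures quoted in the text: 22.4 billion bp of reads, 770 Mb genome.
depth = coverage(22.4e9, 770e6)
print(f"coverage ~ {depth:.1f}x")  # coverage ~ 29.1x

# Toy scaffold lengths (invented), not the cassava assembly.
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))  # (80, 2)
```

Applied to the 12,977 scaffold lengths of the version 4 assembly, the same calculation would be expected to give the 258 kbp and 487 scaffolds reported above.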
Table 1 Cassava genomic and mRNA sequence data
Previously, only eight cassava transposons were described in public databases (Kapitonov and Jurka 2008). With the assembled genome, a more complete characterization of repetitive content is possible by scanning the assembly for sequences that occur many times. Over a third (37.5%) of the assembly was annotated as repetitive (Table 2), dramatically expanding the catalog of cassava transposable elements. This portion of the genome is hidden (in the jargon, “masked”) when the genome is scanned for protein-coding genes.
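As an illustration of what masking means in practice, the following Python sketch hides annotated repeat intervals in a sequence. It is a minimal example and not the tool used for the cassava annotation; the function names, the 0-based half-open interval convention and the toy sequence are assumptions. Two common conventions are shown: "soft" masking (repeats in lowercase) and "hard" masking (repeats replaced by N).

```python
def mask_repeats(sequence, repeats, soft=True):
    """Mask repeat intervals in a sequence.

    `repeats` is a list of (start, end) pairs, 0-based and end-exclusive
    (a convention assumed for this sketch). Soft masking lowercases the
    repeat bases; hard masking replaces them with N.
    """
    chars = list(sequence.upper())
    for start, end in repeats:
        # Clamp each interval to the bounds of the sequence.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


def soft_masked_fraction(masked):
    """Fraction of a soft-masked sequence that is lowercase (repeat)."""
    if not masked:
        raise ValueError("sequence must not be empty")
    return sum(1 for base in masked if base.islower()) / len(masked)


# Toy sequence and a single invented repeat annotation.
toy = "ACGTACGTACGTACGT"
soft = mask_repeats(toy, [(4, 10)])
hard = mask_repeats(toy, [(4, 10)], soft=False)
print(soft)                        # ACGTacgtacGTACGT
print(hard)                        # ACGTNNNNNNGTACGT
print(soft_masked_fraction(soft))  # 0.375
```

The toy interval was chosen so that the masked fraction echoes the 37.5% reported for the assembly; it carries no other meaning. Soft masking keeps the underlying bases available to downstream tools, whereas hard masking discards them.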
Table 2 M. esculenta v. 4 assembly and repeat statistics
Table 3 Protein-coding annotation comparisons to cassava
While a substantial portion of the assembled scaffold sequence is represented in “gaps” (~113 Mb), almost all cassava genes are captured in the contigs. Of the putative transcripts represented in an assembly of publicly available cassava EST sequences (http://cassava.igs.umaryland.edu/blast/db/EST_asmbl_and_single.fasta), 96% can be mapped to the genome assembly. It can therefore be inferred that the remaining portion of the genome is largely repetitive and non-gene-coding. Consistent with this, the fractions of the estimated genome size (~31%) and WGS reads (~36%) that do not appear in the assembly are approximately equal, despite low read error rates. The scaffolds obtained have yet to be assigned to chromosomes, as this requires genetic markers with known sequence. However, an 88% complete genetic map comprising 23 linkage groups includes the genetic locations of 284 scaffolds (Sraphet et al. 2011).
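The ~31% figure can be reproduced from numbers given in the text, as the short Python sketch below shows. The derived gap share and contig total are not stated in the source; they rest on the assumption, labeled in the code, that the 532.5 Mb scaffold span includes the ~113 Mb of gaps.

```python
# Figures quoted in the text, in megabases (Mb).
GENOME_SIZE_MB = 770.0    # estimated genome size
ASSEMBLY_SPAN_MB = 532.5  # total span of the scaffolds
GAP_MB = 113.0            # approximate scaffold sequence represented as gaps


def fraction(part, whole):
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of the estimated genome that is absent from the assembly.
unassembled = fraction(GENOME_SIZE_MB - ASSEMBLY_SPAN_MB, GENOME_SIZE_MB)

# Assumption: the scaffold span includes the gaps, so contig sequence
# is the span minus the gap total.
gap_share = fraction(GAP_MB, ASSEMBLY_SPAN_MB)
contig_mb = ASSEMBLY_SPAN_MB - GAP_MB

print(f"unassembled fraction of genome: {unassembled:.1%}")  # 30.8%
print(f"gap share of scaffold span:     {gap_share:.1%}")    # 21.2%
print(f"sequence in contigs:            {contig_mb:.1f} Mb")  # 419.5 Mb
```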
With the gene-rich portion of the genome in hand, the next step is to identify the protein-coding genes and the exons that comprise them. This is achieved computationally by aligning sequences from mRNA fragments (ESTs) to the genome and by searching for regions with homology to known proteins from other plant species. The 80,459 Sanger ESTs from GenBank were augmented by a new set of 2.7 million reads from leaf and root libraries, generated by 454 Life Sciences using the FLX Titanium platform. While half of the leaf EST reads represented rDNA and chloroplast transcripts and were not useful for annotating the nuclear genome, the remaining 1.4 M reads were used to improve gene prediction. In addition to these mRNA data, predicted protein sequences from castor bean, Arabidopsis, rice, soybean and Populus were mapped to the cassava assembly to help predict gene loci. To date, we have predicted 30,666 protein-coding loci in cassava by combining ESTs, peptide homology to other species, and the statistical sequence patterns common to plant genes (e.g., intronic splice signals; Table 3). Over a third (11,526) of the protein-coding loci are supported by ESTs, which also provide evidence for a total of 3,485 alternative splice forms (Table 3).
The gene content of cassava is broadly similar to that of castor bean, Arabidopsis, soybean and rice (Table 3). The cassava genome sequence and annotation can be readily aligned to those of related plants, allowing comparative analysis of genome structure and function. For example, a global comparison of the cassava and castor bean genomes reveals extensive colinearity (an example region is shown in Fig. 2a) and a paleo-genomic duplication in cassava relative to castor bean (Fig. 2b; Chan et al. 2010). However, not all genes are found in two copies in cassava, suggesting that some initially duplicated genes have subsequently been lost. Conversely, some paralogous families have more than two members, suggesting additional gene duplications that increased family size.
The cassava genome and annotation can be viewed, browsed, compared with other genomes, and downloaded for custom analysis at the Phytozome comparative plant genomics portal (www.phytozome.net, Fig. 1b; Goodstein et al. 2012). Extensive instructions are available to help users find their way around the site (http://phytozome.net/help.php).