Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia and the Americas. Its large, starchy roots and edible leaves provide food for 800 million people globally, many of whom subsist on it, in part because it is drought tolerant and requires few inputs (Ceballos et al. 2010). The high starch content (20-40%) makes cassava a desirable energy source both for human consumption and for industrial biofuel applications (Balat and Balat 2009; FAO 2008; Schmitz and Kavallari 2009). However, its nutritional value is limited, as the roots contain little protein or micronutrients and high levels of cyanogenic compounds. The plant is also susceptible to bacterial (Boher and Verdier 1994) and insect-transmitted viral diseases (Hillocks and Jennings 2003; Patil and Fauquet 2009). While the roots can be left in the ground for many months before harvesting, post-harvest deterioration is rapid, limiting economic development for cassava farmers (Reilly et al. 2007). Cassava is an outcrossing, heterozygous species propagated clonally from stem cuttings. These properties are barriers to the already slow process of improving yield, disease resistance and nutrient content by traditional breeding and selection.

We describe here the progress to date and the future goals of genomic sequencing efforts in cassava. A draft genome sequence has been generated from a single cassava accession. With this genome sequence in hand and a catalog of common genetic variants on the way, many possibilities exist for the rapid realization of cassava’s potential both as a nutritional food and as a biofuel feedstock.

Cassava belongs to the family Euphorbiaceae, and the Fabid superfamily (also known as eurosids I), which includes several distantly related plants such as rosids, legumes and poplars. The availability of several Fabid genomes, in particular Ricinus communis [castor bean (Chan et al. 2010), which is in the same family], Populus trichocarpa [poplar (Tuskan et al. 2006)] and Glycine max [soybean (Schmutz et al. 2010)], allows researchers to take advantage of comparative genomic approaches. In addition, tools for molecular breeding are being developed by building upon the genome sequence. For example, germplasm diversity across both cultivated and wild varieties, once characterized, can be used to map valuable traits. With this map, marker-assisted selection and even genomic selection can be adopted as paradigms for generating improved cassava. Brought together, these resources should allow us to accelerate understanding of basic biology of starch accumulation, nutrient loading and resistance to diseases and pests.

Sequencing the Cassava Genome

The project to sequence the cassava genome began in 2003 with the formation of The Global Cassava Partnership (GCP-21), co-chaired by Dr. Claude Fauquet, director of the International Laboratory for Tropical Agriculture Biology (ILTAB) at the Donald Danforth Plant Science Center (DDPSC), and Dr. Joe Tohme of the International Center for Tropical Agriculture (CIAT). This, in turn, led to a 2006 proposal, by Fauquet, Tohme and 12 other leading international scientists, to the US Department of Energy Joint Genome Institute (JGI) Community Sequencing Program. The proposal was selected for a pilot project, and over the next few years a ~0.8x random shotgun sample sequence (~700k reads) was produced, half from fosmid paired-ends. As a result, nearly half the genome had been sampled, but only relatively short ~700 base-pair (bp) sequences were available.

At the 2009 International Plant and Animal Genomes conference in San Diego, USA, Steve Rounsley from the University of Arizona obtained a commitment from 454 Life Sciences and JGI to produce a more complete cassava sequence, with the encouragement of the Bill & Melinda Gates Foundation (BMGF). 454 Life Sciences and JGI chose 454's Genome Sequencer FLX Titanium platform to rapidly generate the DNA sequence data needed for the project. 454 Life Sciences later added longer sequences from their then-experimental FLX+ extra-long-read technology. Generation of raw sequence data was mostly completed by spring 2009, less than 90 days after the meeting in San Diego, and progress was discussed at a meeting convened by BMGF outside Paris, France, in June 2009. In November of that year, the genome assembly and annotation were publicly released, and funding was obtained from BMGF for post-genome efforts to build upon the genome data to develop breeding tools for sub-Saharan Africa.

Status of Genome Project

The draft genome sequence of cassava was obtained by a whole genome shotgun (WGS) strategy (Fig 1a). In this approach, nuclear genomic DNA is isolated and broken randomly into short fragments, which are then sorted into bins of various sizes. The ends of these fragments are sequenced and the resulting short sequence reads computationally reassembled to reconstruct long stretches of genomic sequence. Assembled sequences are called “contigs” if they represent unbroken (i.e., contiguous) stretches of the genome, and “scaffolds” if they contain one or more gaps due to either unsampled sequence or repetitive regions that are difficult to reconstruct from short sequence reads. These gaps are represented as runs of the unknown nucleotide “N”, with the number of Ns roughly corresponding to the length of the gap.
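The relationship between scaffolds, contigs and gaps can be illustrated with a minimal sketch. It assumes only what is described above, namely that gaps appear as runs of "N" within a scaffold; the function name `split_scaffold` and the toy sequence are hypothetical and are not part of any assembler's output format.

```python
import re


def split_scaffold(scaffold, min_gap=1):
    """Split one scaffold string into contigs and gaps.

    Returns (contigs, gaps) where
      contigs: list of (start, end, sequence) for each run of known bases
      gaps:    list of (start, end) for each run of at least `min_gap` Ns
    Coordinates are 0-based and end-exclusive.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs, gaps = [], []
    pos = 0
    for match in gap_pattern.finditer(scaffold):
        if match.start() > pos:
            # Known sequence between the previous gap and this one
            contigs.append((pos, match.start(), scaffold[pos:match.start()]))
        gaps.append((match.start(), match.end()))
        pos = match.end()
    if pos < len(scaffold):
        # Trailing contig after the last gap
        contigs.append((pos, len(scaffold), scaffold[pos:]))
    return contigs, gaps


# Toy scaffold (hypothetical): three contigs joined by two gaps
toy_scaffold = "ACGTACGT" + "N" * 5 + "GGCCTTAA" + "N" * 3 + "TTGA"
contigs, gaps = split_scaffold(toy_scaffold)
contig_lengths = [end - start for start, end, _ in contigs]
gap_lengths = [end - start for start, end in gaps]
```

Summing `gap_lengths` over all scaffolds gives the amount of assembled sequence that is represented only as gaps, while `contig_lengths` gives the amount of actual sequenced bases.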

Fig. 1

a Overview of whole genome shotgun sequencing and assembly. Starting with plant material, many genomes’ worth of DNA is extracted, purified, fragmented, pooled by length and sequenced to a high level of redundancy, with the aim of sequencing every region of the genome so that the chromosomal sequence can be generated (assembled) by overlapping fragments that have (near-)identical sequences. Longer-range, paired-end sequence information is used to bridge sections of the genome that are not unique (repeats) and therefore impossible to resolve by overlap alone. b The Phytozome genome browser provides a portal for accessing, browsing, searching and downloading all available cassava sequence and annotation data and for comparative plant genomic analysis

Assembling the cassava genome is computationally challenging for two reasons. Firstly, repetitive sequences, often transposon-related, are interspersed between the compact genes of cassava, just as they are in most plant genomes (Gill et al. 2008). Since copies of a given repetitive element may be very similar at the sequence level, it is difficult to reconstruct each copy, especially with short sequence reads. Fortunately, the 454 FLX Titanium and experimental FLX+ sequences are typically 400 and 700 nt long, respectively, making them long enough to assemble many repetitive segments and genes together on scaffolds spanning hundreds of thousands of base pairs (see below).

Secondly, cassava, as an outbreeding species, has allelic variation, both SNPs and structural polymorphisms (deletions, insertions, and inversions involving hundreds or thousands of nucleotides), which can complicate the derivation of a single “reference” sequence for each locus. In an attempt to minimize the number of heterozygous alleles that the assembler would have to deal with, DNA from a partially inbred cassava cultivar was provided for the genome sequencing project. This line, named AM560-2, was developed at CIAT and is reported as being an S3-derived inbred from the cassava cultivar MCOL-1505 (J. Tohme, unpublished results).

The cassava genome spans an estimated 770 Mb (Awoleye et al. 1994) in n = 18 chromosomes. A total of 22.4 billion bp of raw sequence data was generated (Table 1), enough to cover the genome ~29 times. This redundancy is necessary to assemble the reads into contigs and scaffolds (Fig. 1a). The reads were assembled using Roche’s GS de novo assembler (Newbler) v. 2.5 (Quinn et al. 2008) into 12,977 scaffolds that span a total of 532.5 Mb. This version 4 assembly is somewhat short of the estimated genome size, but, based on an analysis of the unassembled reads, the missing sequence is dominated by repetitive sequences such as transposable elements, ribosomal RNA genes and centromeres. Although the genome assembly is in nearly 13,000 pieces, half of it is captured in only 487 scaffolds, each longer than 258 kbp and containing 49 or more genes.
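The two summary statistics quoted above can be reproduced with a short sketch. The coverage figure uses the numbers given in the text (22.4 billion bp of reads over a 770 Mb genome). The "half of the assembly in 487 scaffolds longer than 258 kbp" statement corresponds to what is commonly reported as the N50 length and the number of sequences needed to reach it; because the real scaffold lengths are not listed here, the example uses a hypothetical toy list, and the function names are illustrative only.

```python
def fold_coverage(total_read_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_read_bases / genome_size


def n50(lengths):
    """Return (n50_length, count).

    Sequences are sorted from longest to shortest; `count` is the number
    of sequences needed to reach at least half of the total length, and
    `n50_length` is the length of the shortest sequence in that set.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Figures taken from the text: 22.4 billion bp of reads, 770 Mb genome
coverage = fold_coverage(22.4e9, 770e6)

# Hypothetical toy scaffold lengths, NOT the cassava assembly
toy_lengths = [100, 80, 60, 40, 20, 10, 10]
toy_n50, toy_count = n50(toy_lengths)
```

Applied to the real list of 12,977 scaffold lengths, the same function would be expected to return a length near 258 kbp and a count of 487, as stated in the text.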

Table 1 Cassava genomic and mRNA sequence data

Previously, only eight cassava transposons were described in public databases (Kapitonov and Jurka 2008). With the assembled genome, a more complete characterization of repetitive content is possible by scanning the assembly for sequences that occur many times. Over a third (37.5%) of the assembly was annotated as repetitive (Table 2), dramatically expanding the catalog of cassava transposable elements. This portion of the genome is hidden (in the jargon, “masked”) when the genome is scanned for protein-coding genes.
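As an illustration of what masking means in practice, the sketch below hides annotated repeat intervals in a sequence. It is a generic example and not the annotation pipeline used for cassava: the function names, the interval format (0-based, end-exclusive) and the toy sequence are all assumptions. Two common conventions are shown, replacing repeat bases with "N" (hard masking) or converting them to lowercase (soft masking).

```python
def mask_intervals(sequence, intervals, hard=True):
    """Mask repeat intervals in a sequence.

    intervals: iterable of (start, end), 0-based and end-exclusive
    hard=True  -> masked bases become 'N'
    hard=False -> masked bases become lowercase
    """
    chars = list(sequence)
    for start, end in intervals:
        # Clip each interval to the sequence bounds
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


def masked_fraction(masked_sequence):
    """Fraction of positions that are hard- or soft-masked."""
    if not masked_sequence:
        return 0.0
    masked = sum(1 for c in masked_sequence if c == "N" or c.islower())
    return masked / len(masked_sequence)


# Hypothetical toy sequence and repeat annotations
toy_sequence = "ACGTACGTACGT"
toy_repeats = [(2, 5), (8, 10)]
hard_masked = mask_intervals(toy_sequence, toy_repeats, hard=True)
soft_masked = mask_intervals(toy_sequence, toy_repeats, hard=False)
```

A gene finder run on the hard-masked sequence cannot place exons inside the repeat intervals, which is the sense in which the repetitive portion is "hidden".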

Table 2 M. esculenta v. 4 assembly and repeat statistics
Table 3 Protein-coding annotation comparisons to cassava

While a substantial portion of the assembled scaffold sequence is represented in “gaps” (~113 Mb), almost all cassava genes are captured in the contigs. Of the putative transcripts represented in an assembly of publicly available cassava EST sequences, 96% can be mapped to the genome assembly. We therefore estimate that the remaining portion of the genome is largely repetitive and non-gene-coding. Consistent with this, the fractions of the estimated genome size (~31%) and WGS reads (~36%) that do not appear in the assembly are approximately equal, despite low read error rates. The scaffolds obtained have yet to be assigned to chromosomes, as this requires genetic markers with known sequence. However, an 88% complete genetic map comprising 23 linkage groups includes the genetic locations of 284 scaffolds (Sraphet et al. 2011).
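The ~31% figure follows directly from numbers already given in the text, as the short calculation below shows. The inputs (770 Mb estimated genome size, 532.5 Mb scaffold span, ~113 Mb of gaps) are taken from the article; the variable names are illustrative, and the derived contig total is simple arithmetic rather than a value reported in the source.

```python
GENOME_MB = 770.0      # estimated genome size
ASSEMBLY_MB = 532.5    # total span of the scaffolds
GAP_MB = 113.0         # approximate scaffold sequence represented as gaps

# Fraction of the estimated genome not represented in the assembly
unassembled_fraction = (GENOME_MB - ASSEMBLY_MB) / GENOME_MB

# Scaffold span that consists of actual sequenced bases (contigs)
contig_mb = ASSEMBLY_MB - GAP_MB

# Share of the scaffold span that is gap rather than sequence
gap_fraction_of_assembly = GAP_MB / ASSEMBLY_MB

print(f"unassembled: {unassembled_fraction:.1%}")
print(f"contig bases: {contig_mb:.1f} Mb")
print(f"gaps within scaffolds: {gap_fraction_of_assembly:.1%}")
```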

With the gene-rich portion of the genome in hand, the next step is to identify the protein-coding genes and the exons that comprise them. This is achieved computationally by aligning sequences from mRNA fragments (ESTs) to the genome, as well as by looking for regions with homology to known proteins from other plant species. The 80,459 Sanger ESTs from GenBank were augmented by a new set of 2.7 million reads from leaf and root libraries, generated by 454 Life Sciences using the FLX Titanium platform. While half of the leaf EST reads represented rDNA and chloroplast transcripts and were not useful for annotating the nuclear genome, the remaining 1.4 million reads were used to improve gene prediction. In addition to these mRNA data, predicted protein sequences from castor bean, Arabidopsis, rice, soybean and Populus were mapped to the cassava assembly to help predict gene loci. To date we have predicted 30,666 protein-coding loci in cassava by combining ESTs, peptide homology to other species, and the statistical sequence patterns common to plant genes (e.g., intronic splice signals; Table 3). Over a third (11,526) of the protein-coding loci are supported by ESTs, which also provide evidence for a total of 3,485 alternative splice forms (Table 3).

The gene content of cassava is broadly similar to that of castor bean, Arabidopsis, soybean and rice (Table 3). The cassava genome sequence and annotation can be readily aligned to those of related plants, allowing comparative genomic analysis of genome structure and function. For example, a global comparison of cassava and castor bean genomes reveals extensive colinearity (an example region is shown in Fig. 2a) and a paleo-genomic duplication in cassava (Fig. 2b) that is not seen in castor bean (Chan et al. 2010). However, not all genes are found in two copies in cassava, suggesting that some initially duplicated genes have subsequently been lost. Conversely, some paralogous families have more than two members, suggesting additional gene duplications that resulted in increased family size.

Fig. 2

a Circos plot showing cassava-castor bean colinearity. A sample 2.7 Mb region of the cassava genome was aligned against castor bean scaffolds with the PROmer tool from MUMmer v3.2 and visualized with Circos. Colored segments of the circle represent cassava, while greyscale segments represent the corresponding segments of the castor bean genome. Green blocks in the outer ring are contigs within the 2.7 Mb scaffold. The second ring represents repeat content in grey blocks, and the inner ring represents genes in red blocks. Grey lines linking cassava segments to castor bean segments represent homology at the protein level, and the fact that these lines do not cross over each other indicates colinearity. The central portion of the cassava scaffold, highlighted in yellow, is zoomed in to show more detail. b The same 2.7 Mb cassava scaffold as in a (top), but aligned against other regions of the cassava genome that are highly similar (bottom). This demonstrates the presence of a duplication in the cassava genome

The cassava genome and annotation can be viewed, browsed, compared with other genomes, and downloaded for custom analysis at the Phytozome comparative plant genomics portal (Fig. 1b; Goodstein et al. 2012). Extensive instructions are available to help users find their way around the site.

Genetic Variability and Diversity in Cassava

Although cassava originated in South America and was exported to Africa and Asia, its population structure is poorly understood relative to better-studied crops such as maize and rice. An understanding of genetic variation allows the development of robust systems of markers for mapping and breeding, including the characterization of germplasm that might provide useful alleles (Edwards and Batley 2010). Initial marker development in cassava has relied upon simple sequence repeats (SSRs, also known as microsatellites) (Raji et al. 2009; Roa et al. 2000), as well as ~2,000 SNPs identified in expressed sequence tags (Ferguson et al. 2011; Tangphatsornruang et al. 2008). Known SSR and SNP markers, however, are sparsely distributed across the cassava genome and may not be ideal for either fine-mapping or inexpensive large-scale assays. In the future, this can be addressed using newer methods for marker discovery, which rely on increasingly inexpensive next-generation sequencing (NGS) and can provide greater power for high-density SNP discovery.

Future Directions

Several resources are currently available for improving the genome sequence assembly and are being integrated with the draft genome sequence. The first is a bacterial artificial chromosome (BAC)-based physical map from Pablo Rabinowicz at the Institute for Genome Sciences, University of Maryland and Mincheng Luo at the University of California, Davis (funded by the Generation Challenge Programme). Additional sequencing and computational gap closure are planned, as well as the generation of a dense genetic map to tie scaffolds together at the chromosome scale. This will provide a more accurate substrate for gene models, and a platform for robust systems of markers for studying natural variation and for guiding breeding programs.

Characterizing genetic variation in cassava is a significant part of the cassava genome effort. Whole genome sequencing projects of several key cultivars are already underway, including parents of populations developed for genetic mapping of tolerance to cassava brown streak disease (CBSD), a disease that is having a devastating effect in East Africa. A large mapping population derived from two parents (“Albert”, a disease-susceptible variety, and “Namikonga”, a disease-resistant variety; Ferguson et al., this issue) is being developed, and with inexpensive genotyping-by-sequencing (GBS), a robust, high-resolution genetic map for cassava will be available in the next few years. Two additional sequencing efforts are already underway. These are generating half a dozen whole genome sequences as well as GBS information for hundreds of cultivars (including some wild relatives) chosen to sample traits as widely as possible and maximize diversity. The goal is to obtain a catalog of SNPs segregating within farmer varieties useful for African environments. These can then be used to accelerate breeding programs that will address disease and nutrition concerns, ultimately improving cassava’s quality as a crop.

In addition to describing SNPs, these whole genome sequencing ventures will allow mapping of genomic structural variation (SV), including gene presence/absence, local duplications and transposon activity. These genetic changes are increasingly recognized as important components of heritable variation (Lam et al. 2010; Springer et al. 2009). Rare SNPs/SVs will still be elusive, but these programs should be able to identify nearby linked markers, analogous to the human HapMap.

High-throughput genotyping (the determination of the alleles present at a defined set of marker loci) is now possible, with test assays underway in cassava. With the declining cost of sequencing, there is much enthusiasm for genotyping based on reduced representation sequencing (or GBS), in which loci defined by their proximity to restriction sites are assayed en masse (Elshire et al. 2011). Combined with phenotypic information, the lower cost genotyping will allow deeper sampling of diverse germplasm, the simplification of QTL (quantitative trait locus) mapping for traits such as drought or disease resistance segregating in crosses between arbitrary parents, and, possibly, whole genome association mapping.
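The idea of loci "defined by their proximity to restriction sites" can be illustrated with a simplified in silico digest. This is a sketch under stated assumptions, not the GBS protocol itself: the recognition pattern `GC[AT]GC` is used here as an example site (it is assumed to correspond to the ApeKI enzyme), the tag length is a hypothetical parameter, only the forward strand is scanned, and the toy sequence is invented for the example.

```python
import re


def find_sites(sequence, site_regex="GC[AT]GC"):
    """Return 0-based start positions of every restriction site.

    A zero-width lookahead is used so that overlapping sites are found.
    """
    pattern = re.compile("(?=%s)" % site_regex)
    return [m.start() for m in pattern.finditer(sequence.upper())]


def gbs_tags(sequence, site_regex="GC[AT]GC", tag_length=10):
    """Return (position, tag) pairs for each site on the forward strand.

    Each tag is the `tag_length` bases starting at the site; sites too
    close to the end of the sequence to give a full tag are skipped.
    """
    seq = sequence.upper()
    tags = []
    for pos in find_sites(seq, site_regex):
        tag = seq[pos:pos + tag_length]
        if len(tag) == tag_length:
            tags.append((pos, tag))
    return tags


# Hypothetical toy sequence containing two sites (GCAGC and GCTGC)
toy_genome = "AAA" + "GCAGC" + "TTTTTTTT" + "GGGCTGC" + "AAACCCCGG"
sites = find_sites(toy_genome)
tags = gbs_tags(toy_genome, tag_length=10)
```

Because the same sites are sampled in every individual, comparing tags across cultivars reveals SNPs at a reproducible subset of the genome without sequencing all of it, which is what keeps the per-sample cost low.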

Germplasm from hundreds of African cassava cultivars will be characterized with this approach, allowing marker-assisted breeding schemes to be developed for improving nutrient content as well as tolerance of drought and of the viral diseases cassava mosaic disease (CMD) and CBSD. We anticipate developing tens of thousands of markers across the genome. In addition to mapping and breeding, genome-enabled genetic approaches can also open the door to a mechanistic understanding of the underlying biology. As data are generated and analyzed, a genome variation database will be developed, thus providing breeding tools to aid farmers in improving cassava, with a special focus on increased resistance to CBSD.