The Cassava Genome: Current Progress, Future Directions
The starchy swollen roots of cassava provide an essential food source for nearly a billion people, as well as possibilities for bioenergy, yet improvements to nutritional content and resistance to threatening diseases are currently impeded. A 454-based whole genome shotgun sequence has been assembled, which covers 69% of the predicted genome size and 96% of protein-coding gene space, with genome finishing underway. The predicted 30,666 genes and 3,485 alternate splice forms are supported by 1.4 M expressed sequence tags (ESTs). Maps based on simple sequence repeats (SSRs) and EST-derived single nucleotide polymorphisms (SNPs) already exist. Thanks to the genome sequence, a high-density linkage map is currently being developed from a cross between two diverse cassava cultivars: one susceptible to cassava brown streak disease, the other resistant. An efficient genotyping-by-sequencing (GBS) approach is being developed to catalog SNPs both within the mapping population and among diverse African farmer-preferred varieties of cassava. These resources will accelerate marker-assisted breeding programs, allowing improvements in disease resistance and nutrition, and will help us understand the genetic basis for disease resistance.
Keywords: Cassava; 454-sequencing; Linkage mapping; Genotyping by sequencing; Polymorphism; Crop improvement
Abbreviations
BAC: Bacterial artificial chromosome
BMGF: Bill & Melinda Gates Foundation
CBSD: Cassava brown streak disease
CIAT: International Center for Tropical Agriculture
CMD: Cassava mosaic disease
DDPSC: Donald Danforth Plant Science Center
EST: Expressed sequence tag
GCP-21: Global Cassava Partnership
ILTAB: International Laboratory for Tropical Agriculture Biology
JGI: US Department of Energy Joint Genome Institute
QTL: Quantitative trait locus
SNP: Single nucleotide polymorphism
SSR: Simple sequence repeat
WGS: Whole genome shotgun
Cassava (Manihot esculenta Crantz) is grown throughout tropical Africa, Asia and the Americas. Its large, starchy roots and edible leaves provide food for 800 M people globally, many of whom subsist on it, in part because it is drought tolerant and requires little in the way of inputs (Ceballos et al. 2010). The high starch content (20-40%) makes cassava a desirable energy source both for human consumption and industrial biofuel applications (Balat and Balat 2009; FAO 2008; Schmitz and Kavallari 2009). However, its nutritional value is limited, as the roots contain little protein and few micronutrients, along with high levels of cyanogenic compounds. The plant is also susceptible to bacterial (Boher and Verdier 1994) and insect-transmitted viral diseases (Hillocks and Jennings 2003; Patil and Fauquet 2009). While the roots can be left in the ground for many months before harvesting, post-harvest deterioration is rapid, limiting economic development for cassava farmers (Reilly et al. 2007). Cassava is an outcrossing, heterozygous species propagated clonally from stem cuttings. These properties provide barriers to the already slow process of improving yield, disease resistance and nutrient content by traditional breeding and selection.
We describe here the progress-to-date and the future goals of genomic sequencing efforts in cassava. A draft genome sequence has been generated from a single cassava accession. With this genome sequence in hand and a catalog of common genetic variants on the way, many possibilities exist for the rapid realization of cassava’s potential both as a nutritional food and as a biofuel feedstock.
Cassava belongs to the family Euphorbiaceae, and the Fabid superfamily (also known as eurosids I), which includes several distantly related plants such as rosids, legumes and poplars. The availability of several Fabid genomes, in particular Ricinus communis [castor bean (Chan et al. 2010), which is in the same family], Populus trichocarpa [poplar (Tuskan et al. 2006)] and Glycine max [soybean (Schmutz et al. 2010)], allows researchers to take advantage of comparative genomic approaches. In addition, tools for molecular breeding are being developed by building upon the genome sequence. For example, germplasm diversity across both cultivated and wild varieties, once characterized, can be used to map valuable traits. With this map, marker-assisted selection and even genomic selection can be adopted as paradigms for generating improved cassava. Brought together, these resources should allow us to accelerate understanding of basic biology of starch accumulation, nutrient loading and resistance to diseases and pests.
Sequencing the Cassava Genome
The project to sequence the cassava genome began in 2003 with the formation of The Global Cassava Partnership (GCP-21), co-chaired by Dr. Claude Fauquet, director of the International Laboratory for Tropical Agriculture Biology (ILTAB) at the Donald Danforth Plant Science Center (DDPSC), and Dr. Joe Tohme of the International Center for Tropical Agriculture (CIAT). This, in turn, led to a 2006 proposal, by Fauquet, Tohme and 12 other leading international scientists, to the US Department of Energy Joint Genome Institute (JGI) Community Sequencing Program. The proposal was selected for a pilot project, and over the next few years a ~0.8x random shotgun sample sequence (~700k reads) was produced, half from fosmid paired-ends. As a result, nearly half the genome had been sampled, but only relatively short ~700 base-pair (bp) sequences were available.
At the 2009 International Plant and Animal Genomes conference in San Diego, USA, Steve Rounsley from the University of Arizona obtained a commitment to produce a more complete cassava sequence from 454 Life Sciences and JGI, with the encouragement of the Bill & Melinda Gates Foundation (BMGF). 454 Life Sciences and JGI chose to use 454's Genome Sequencer FLX Titanium platform to rapidly generate the DNA sequence data needed for the project. 454 Life Sciences later added longer sequences from their then-experimental FLX+ extra-long-read technology (http://my454.com/products/gs-flx-system/index.asp). Generation of raw sequence data was mostly completed by spring 2009, less than 90 days after the meeting in San Diego, and progress was discussed at a meeting convened by BMGF outside Paris, France in June 2009. In November of that year, the genome assembly and annotation were publicly released (available from ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v5.0/Mesculenta/), and funding was obtained from BMGF for post-genome efforts to build upon the genome data to develop breeding tools for sub-Saharan Africa.
Status of Genome Project
Assembling the cassava genome is computationally challenging for two reasons. Firstly, repetitive sequences, often transposon-related, are interspersed between the compact genes of cassava, just as they are in most plant genomes (Gill et al. 2008). Since copies of a given repetitive element may be very similar at the sequence level, it is difficult to reconstruct each copy, especially with short sequence reads. Fortunately, the 454 FLX Titanium and experimental FLX+ reads are typically 400 and 700 nt long, respectively, making them long enough to assemble many repetitive segments and genes together on scaffolds spanning hundreds of thousands of base pairs (see below).
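The effect of read length on repeat resolution can be illustrated with a toy calculation. The sketch below assumes that a read places a repeat copy unambiguously only when it covers the entire repeat plus some unique flanking sequence on both sides, and that read start positions are uniformly distributed. The function name, the `min_anchor` parameter and the 300-nt repeat length are hypothetical choices for illustration, not values from the assembly.

```python
def spanning_fraction(read_len: int, repeat_len: int, min_anchor: int = 20) -> float:
    """Fraction of reads overlapping a repeat that also span it entirely.

    A read "spans" the repeat when it covers the whole repeat plus at least
    `min_anchor` unique bases on each side. Uniformly distributed read start
    positions are assumed (an idealization).
    """
    # Start positions at which the read overlaps the repeat by at least 1 base.
    overlapping = read_len + repeat_len - 1
    # Start positions at which the read covers the repeat and both anchors.
    spanning = read_len - repeat_len - 2 * min_anchor + 1
    return max(spanning, 0) / overlapping


# Hypothetical 300-nt repeat; 400 and 700 nt are the typical read lengths
# quoted in the text, 100 nt stands in for a generic short read.
for read_len in (100, 400, 700):
    print(read_len, round(spanning_fraction(read_len, repeat_len=300), 3))
```

Under these assumptions a 100-nt read never spans the repeat, while the share of spanning reads rises several-fold between 400-nt and 700-nt reads.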
Secondly, cassava, as an outbreeding species, has allelic variation, both SNPs and structural polymorphisms (deletions, insertions, and inversions involving hundreds or thousands of nucleotides), which can complicate the derivation of a single “reference” sequence for each locus. In an attempt to minimize the number of heterozygous alleles that the assembler would have to deal with, DNA from a partially inbred cassava cultivar was provided for the genome sequencing project. This line, named AM560-2, was developed at CIAT and is reported as being an S3-derived inbred from the cassava cultivar MCOL-1505 (J. Tohme, unpublished results).
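As a rough guide to how much heterozygosity such a line might retain, the sketch below applies the textbook expectation that each generation of self-fertilization halves the fraction of heterozygous loci. It assumes that "S3" denotes three successive selfing generations and that no selection acts on heterozygosity. Both are assumptions made here for illustration, and the real line could deviate from this expectation.

```python
def residual_heterozygosity(generations: int) -> float:
    """Expected fraction of initially heterozygous loci that remain
    heterozygous after `generations` rounds of selfing, without selection."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 0.5 ** generations


for g in range(4):
    print(f"S{g}: {residual_heterozygosity(g):.3f}")
```

Under these assumptions, an S3 line would be expected to retain about one eighth (12.5%) of the heterozygous loci of its parent cultivar.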
Table 1. Cassava genomic and mRNA sequence data
454 FLX Plus (experimental)
U. Maryland & UC Davis
Expressed sequence tags (ESTs): 1.51 M reads (leaf), 0.30 M after removing chloroplast and rDNA sequences; 1.19 M reads (root)
Table 2. M. esculenta v. 4 assembly and repeat statistics
Total assembled scaffold length
Total assembled contig sequence length: 419.5 Mb (21% gaps)
Total number of scaffolds
Half of the assembly is in scaffolds longer than: 258.1 kbp (487 scaffolds)
Annotated repetitive portion of assembly
140 Mb (10 M reads)
54 Mb (3.9 M reads, 6 k copies)
polygalacturonase gene sequence
12 Mb (850 k reads, 3 k copies)
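The statistic "half of the assembly is in scaffolds longer than" is commonly called the N50. A minimal sketch of how it can be computed from a list of scaffold lengths follows. The function name and the toy lengths are illustrative and are not taken from the cassava assembly.

```python
def n50(lengths):
    """Return (N50 length, number of scaffolds at or above that length).

    Scaffolds are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example: total length 300, so the halfway point is 150.
print(n50([100, 80, 60, 40, 20]))  # (80, 2)
```

Applied to the real scaffold lengths, the same procedure would yield the pair reported in the table: a length of 258.1 kbp reached after 487 scaffolds.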
Table 3. Protein-coding annotation comparisons to cassava
Annotations compared (columns): cassava v. 4.1; castor bean (TIGR v. 0.1); Arabidopsis thaliana Columbia (TAIR10); soybean (JGI v. 1); rice (MSU v. 6.0)
Statistics reported (rows): Protein-coding gene loci; Genes supported by one or more ESTs over >75% of length; Avg. number of exons/transcript; Median exon length (bp); Median intron length (bp); Transcripts with Pfam domain annotation (KOG orthology)
While a substantial portion of the assembled scaffold sequence is represented in “gaps” (~113 Mb), almost all cassava genes are captured in the contigs. Of the putative transcripts represented in an assembly of publicly available cassava EST sequences (http://cassava.igs.umaryland.edu/blast/db/EST_asmbl_and_single.fasta), 96% can be mapped to the genome assembly. We therefore infer that the remaining portion of the genome is largely repetitive and non-coding. Consistent with this, the fractions of the estimated genome size (~31%) and WGS reads (~36%) that do not appear in the assembly are approximately equal, despite low read error rates. The scaffolds obtained have yet to be assigned to chromosomes, as this requires genetic markers with known sequence. However, an 88% complete genetic map comprising 23 linkage groups includes the genetic locations of 284 scaffolds (Sraphet et al. 2011).
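The assembly figures quoted in this article can be cross-checked with simple arithmetic. The sketch below assumes that the 69% coverage figure from the abstract refers to total scaffold length (contigs plus gaps). That reading is an interpretation, and the function name is hypothetical.

```python
def assembly_summary(contig_mb: float, gap_mb: float, assembled_fraction: float) -> dict:
    """Derive scaffold length, gap share and implied genome size (all in Mb)."""
    scaffold_mb = contig_mb + gap_mb
    return {
        "scaffold_mb": scaffold_mb,
        "gap_share": gap_mb / scaffold_mb,
        # Assumes assembled_fraction is measured against scaffold length.
        "implied_genome_mb": scaffold_mb / assembled_fraction,
    }


summary = assembly_summary(
    contig_mb=419.5,          # total assembled contig sequence (assembly table)
    gap_mb=113.0,             # approximate gap sequence (text)
    assembled_fraction=0.69,  # share of predicted genome size (abstract)
)
print(f"scaffold length: {summary['scaffold_mb']:.1f} Mb")
print(f"gap share: {summary['gap_share']:.1%}")
print(f"implied genome size: {summary['implied_genome_mb']:.0f} Mb")
```

On these inputs the gap share comes out at about 21%, matching the table, and the implied genome size is roughly 770 Mb, which is consistent with about 31% of the genome being absent from the assembly.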
With the gene-rich portion of the genome in hand, the next step is to identify the protein-coding genes and the exons that comprise them. This is achieved computationally by aligning sequences from mRNA fragments (ESTs) to the genome, as well as looking for regions with homology to known proteins from other plant species. The 80,459 Sanger ESTs from GenBank were augmented by a new set of 2.7 million reads from leaf and root libraries, generated by 454 Life Sciences using the FLX Titanium platform. While half of the leaf EST reads represented rDNA and chloroplast transcripts and were not useful for annotating the nuclear genome, the remaining 1.4 M reads were used to improve gene prediction. In addition to these mRNA data, predicted protein sequences from castor bean, Arabidopsis, rice, soybean and Populus were mapped to the cassava assembly to help predict gene loci. To date we have predicted 30,666 protein-coding loci in cassava by combining ESTs, peptide homology to other species, and the statistical sequence patterns common to plant genes (e.g., intronic splice signals; Table 3). Over a third (11,526) of the protein-coding loci are supported by ESTs, which also provide evidence for a total of 3,485 alternative splice forms (Table 3).
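One way to decide whether a gene is "supported by one or more ESTs over >75% of length" is to merge the EST alignments that overlap the gene and measure the covered fraction. The sketch below does this along the genomic interval for simplicity. The annotation pipeline's exact definition is not given here, so the function names, the half-open coordinate convention and the example coordinates are all assumptions.

```python
def covered_fraction(gene_start, gene_end, est_alignments):
    """Fraction of the half-open gene interval [gene_start, gene_end)
    covered by at least one EST alignment (also half-open intervals)."""
    gene_len = gene_end - gene_start
    if gene_len <= 0:
        raise ValueError("gene interval must have positive length")
    # Keep only alignments that overlap the gene, clipped to its bounds.
    clipped = sorted(
        (max(s, gene_start), min(e, gene_end))
        for s, e in est_alignments
        if s < gene_end and e > gene_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / gene_len


def is_est_supported(gene_start, gene_end, est_alignments, threshold=0.75):
    """True when EST coverage exceeds the threshold fraction of gene length."""
    return covered_fraction(gene_start, gene_end, est_alignments) > threshold


# Hypothetical gene at 1000-2000 with three EST alignments.
ests = [(900, 1400), (1300, 1700), (1900, 2100)]
print(covered_fraction(1000, 2000, ests))  # 0.8
print(is_est_supported(1000, 2000, ests))  # True
```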
The cassava genome and annotation can be viewed, browsed, compared with other genomes, and downloaded for custom analysis at the Phytozome comparative plant genomics portal (www.phytozome.net, Fig. 1b; Goodstein et al. 2012). Extensive instructions are available to help users find their way around the site (http://phytozome.net/help.php).
Genetic Variability and Diversity in Cassava
Although cassava originated in South America and was exported to Africa and Asia, its population structure is poorly understood relative to better studied crops such as maize and rice. An understanding of genetic variation allows the development of robust systems of markers for mapping and breeding, including the characterization of germplasms that might provide useful alleles (Edwards and Batley 2010). Initial marker development in cassava has relied upon simple sequence repeats (SSRs, also known as microsatellites) (Raji et al. 2009; Roa et al. 2000), as well as ~2,000 SNPs identified in expressed sequence tags (Ferguson et al. 2011; Tangphatsornruang et al. 2008). Known SSR and SNP markers, however, are sparsely distributed across the cassava genome and may not be ideal for either fine-mapping or inexpensive large-scale assays. In the future, this can be addressed using newer methods for marker discovery, which rely on increasingly inexpensive next-generation sequencing (NGS) and can provide greater power for high-density SNP discovery.
Several resources are currently available for improving the genome sequence assembly and are being integrated with the draft genome sequence. One is a bacterial artificial chromosome (BAC)-based physical map from Pablo Rabinowicz at the Institute for Genome Sciences, University of Maryland and Mincheng Luo at the University of California, Davis (funded by the Generation Challenge Programme). Additional sequencing and computational gap closure are planned, as well as the generation of a dense genetic map to tie scaffolds together at the chromosome scale. Together, these will provide a more accurate substrate for gene models, and a platform for robust systems of markers for studying natural variation and for guiding breeding programs.
Characterizing genetic variation in cassava is a significant part of the cassava genome effort. Whole genome sequencing projects of several key cultivars are already underway, including parents of populations developed for genetic mapping of tolerance to cassava brown streak disease (CBSD), a disease that is having a devastating effect in East Africa. A large mapping population derived from two parents (“Albert”, a disease-susceptible variety, and “Namikonga”, a disease-resistant variety; Ferguson et al., this issue) is being developed, and with inexpensive genotyping-by-sequencing (GBS), a robust, high-resolution genetic map for cassava will be available in the next few years. Two additional sequencing efforts are already underway. These are generating half a dozen whole genome sequences as well as GBS information for hundreds of cultivars (including some wild relatives) chosen to sample traits as widely as possible and maximize diversity. The goal is to obtain a catalog of SNPs segregating within farmer varieties useful for African environments. These can then be used to accelerate breeding programs that will address disease and nutrition concerns, ultimately improving cassava’s quality as a crop.
In addition to describing SNPs, these whole genome sequencing ventures will allow mapping of genomic structural variation (SV), including gene presence/absence, local duplications and transposon activity. These genetic changes are increasingly recognized as important components of heritable variation (Lam et al. 2010; Springer et al. 2009). Rare SNPs/SVs will still be elusive, but these programs should be able to identify nearby linked markers, analogous to the human HapMap (http://www.hapmap.org).
High-throughput genotyping (the determination of the alleles present at a defined set of marker loci) is now possible, with test assays underway in cassava. With the declining cost of sequencing, there is much enthusiasm for genotyping based on reduced representation sequencing (or GBS), in which loci defined by their proximity to restriction sites are assayed en masse (Elshire et al. 2011). Combined with phenotypic information, lower-cost genotyping will allow deeper sampling of diverse germplasm, the simplification of QTL (quantitative trait locus) mapping for traits such as drought or disease resistance segregating in crosses between arbitrary parents, and, possibly, whole genome association mapping.
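The idea of loci "defined by their proximity to restriction sites" can be sketched as an in silico digest: scan a sequence for the enzyme's recognition motif and take a fixed-length tag starting at each site. In the sketch below, the default pattern `GC[AT]GC` corresponds to the ApeKI recognition sequence GCWGC, and the 64-nt tag length is a common choice. Neither the enzyme nor the tag length is specified in this article, so both are assumptions, as is the function name. Real GBS libraries also yield reads from both strands, which this simplified version ignores.

```python
import re


def gbs_tags(sequence: str, site_pattern: str = "GC[AT]GC", tag_len: int = 64):
    """Return (site_position, tag) pairs for the forward strand.

    Each tag is the `tag_len` bases starting at a restriction site.
    Sites too close to the end of the sequence to give a full tag are skipped.
    """
    seq = sequence.upper()
    tags = []
    # A zero-width lookahead reports overlapping sites as well.
    for match in re.finditer(f"(?={site_pattern})", seq):
        start = match.start()
        tag = seq[start:start + tag_len]
        if len(tag) == tag_len:
            tags.append((start, tag))
    return tags


# Toy sequence with two sites; a short tag length keeps the output readable.
print(gbs_tags("TTGCAGCAAAAAGCTGCTTTT", tag_len=5))
# [(2, 'GCAGC'), (12, 'GCTGC')]
```

Tags recovered at the same site in different individuals can then be compared to call SNPs, which is what makes the approach usable as a genotyping assay.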
Germplasm from hundreds of African cassava cultivars will be characterized with this approach, allowing marker-assisted breeding schemes to be developed for improving nutrient content as well as tolerance of drought and of the viral diseases cassava mosaic disease (CMD) and CBSD. We anticipate developing tens of thousands of markers across the genome. In addition to mapping and breeding, genome-enabled genetic approaches can also open the door to a mechanistic understanding of the underlying biology. As data are generated and analyzed, a genome variation database will be developed, thus providing breeding tools to aid farmers in improving cassava, with a special focus on increased resistance to CBSD.
We thank Shenqiang Shu for help with annotation. The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.