The importance of the sesame genome

Sesame (Sesamum indicum L., 2n = 26), which belongs to the Sesamum genus of the Pedaliaceae family, is one of the oldest oilseed crops and is cultivated in tropical and subtropical regions of Asia, Africa and South America [1, 2]. Its cultivation history can be traced back to between 5,000 and 5,500 years ago in the Harappa Valley of the Indian subcontinent [3]. The total area of sesame harvested in the world is currently 7.8 million hectares, and annual production is 3.84 million tons (2010, UN Food and Agriculture Organization data). As one of the four main sesame-producing countries, China has contributed 15.2 to 32.5% of the total world sesame production over the past 10 years (2001 to 2010, UN Food and Agriculture Organization data). Sesame has one of the highest oil contents: decorticated seeds contain 45 to 63% oil [2]. The seed is also rich in protein, vitamins (including niacin), minerals and lignans such as sesamolin and sesamin [4–7], and it is a popular food and medicine [8–13]. Sequencing and analysis of the sesame genome is essential if we are to elucidate the evolutionary origins and characteristics of the sesame species.

Sesamum is the main genus in the family Pedaliaceae, which contains 17 genera and 80 species of annual and perennial herbs that are distributed in the Old World tropics and subtropics [14]. The taxonomy and cytogenetics of the Sesamum genus have been reviewed and debated for a long time [1, 14–17], and many heterogeneous landraces present in various growing areas still need to be distinguished [1, 18]. S. indicum is the sole cultivated species in the Sesamum genus and evolved from wild populations [14, 19]. However, the origin and evolution of cultivated sesame are still unclear and require more detailed investigation [1, 15]. Evidence suggests that sesame may have originated in either India or Africa [3, 20–26]. Bedigian reported that sesame was derived from the Indian subcontinent (the western Indian peninsula and parts of Pakistan) thousands of years ago, and believed that the progenitor of sesame is a taxon named S. orientale var. malabaricum Nar. [22, 23], although most species of Sesamum and genera of the Pedaliaceae are native to Africa [27–29]. We hope to clarify the origin and phylogeny of S. indicum by applying comparative genomics and morphological and cytological analyses.

Sesame seed is commonly known as the 'Queen of the oil seeds', perhaps for its resistance to oxidation and rancidity [3]. As it contains lignans, sesame oil also exhibits anti-cancer properties both in vitro and in animal bioassays [30–34]. Compared with peanut (Arachis hypogaea), soybean (Glycine max), oilseed rape (Brassica napus), sunflower (Helianthus annuus L.) and other oilseed crops, sesame seed oil has an ideal, nearly equal content of oleic acid (18:1) (39.6%) and linoleic acid (18:2) (46.0%), and has desirable physiological effects, including antioxidant activity and blood pressure- and serum lipid-lowering potential [2, 35, 36]. Studies of the genome and functional genome of sesame are essential for elucidating the regulatory mechanisms underlying fatty acid and storage protein composition and content, and the secondary metabolism of antioxidant lignans [37–40].

Sesame grows well and gives good yields in both tropical and temperate climates. Its tolerance of drought and high temperatures makes sesame well suited to land where few other crops can survive. However, compared with other oilseed crops, sesame seed production is not consistent, as it is susceptible to pathogens, waterlogging and low temperature conditions [41]. Sesame breeding objectives, like those for other seed-producing crops, especially oil crops, are to create new varieties with high quality and yield potential, and resistance to pathogens (including Fusarium wilt and charcoal rot diseases), insect pests, waterlogging, drought and low temperature stress [37, 42–45]. However, identification of genes or gene families and marker loci associated with yield, quality, and resistance to disease and abiotic stresses has been hampered by a lack of information on the sesame genome. Only a few functional genes, mainly involved in the formation and regulation of fatty acids, seed storage proteins and secondary metabolites, and salt stress response, have been investigated [46–54]. With the exception of a sole amplified fragment length polymorphism (AFLP) marker associated with the indehiscent-capsule trait reported in 2003 [55], no quantitative trait loci have been found in the linkage map of sesame, let alone used for marker-assisted selection (MAS) in sesame breeding programs. Integrating desirable qualities from the few available excellent germplasm resources, including wild species, will not be achievable rapidly unless considerably more genomic and functional genomic information is available. In addition, sequencing of the sesame genome will facilitate studies of other genera of the Pedaliaceae family by providing a closely related reference genome.

We therefore plan to implement a Sesame Genome Project and sequence the S. indicum genome using the Chinese domestic cultivar, Yuzhi 11, which represents S. indicum cultivars with a simple stem, three flowers per axilla, oblong-quadrangular capsules, and white flower and seed-coat color. Yuzhi 11 is one of the most important Chinese cultivars owing to its high oil content (56.66%), its resistance to fungal diseases such as Fusarium wilt, charcoal rot and Alternaria leaf spot, and its tolerance of waterlogging stress. It is cultivated in the main production regions of China [56, 57].

Phylogenetic position of sesame

S. indicum is located in the asterids clade of the core eudicotyledons of Angiosperm Phylogeny Group 2 (APG 2) [58]. Its phylogenetic position, determined using sesame chloroplast genomic data, indicates that Sesamum (Pedaliaceae family) is a sister genus to the Olea and Jasminum (Oleaceae family) clade and represents the core lineage of the Lamiales families [59]. Among the 19 families shown in Figure 1 (adapted from the NCBI taxonomy database [60]), which together account for the 36 available genomes, Sesamum is closely related to the Solanaceae and Phrymaceae families, but distantly related to other oil crops such as soybean (Glycine max), castor (Ricinus communis) and rape (Brassica rapa). At present, genomic information on the Pedaliaceae family is quite limited, as genomes from this family have not previously been sequenced.

Figure 1

Phylogenetic positions of sesame and the 36 land plants with available genome sequences. (a) Refers to sesame (S. indicum L.), a member of the Pedaliaceae family, only 34 genera of which have been entered in the NCBI taxonomy database.

Overview of the Sesame Genome Project

The Sesame Genome Working Group (SGWG) comprises six major sesame research teams in China involved in investigating genetic diversity of germplasm resources, functional genomics, and biotic and abiotic resistance, in addition to sesame genome sequencing. All members of the SGWG work under the Toronto Statement for prepublication data release [61]. The main goal of the Sesame Genome Project is to provide a fine map of S. indicum and facilitate global genomic and functional genomic studies. We have already released a preliminary draft assembly [62] of the sesame genome that can be used according to the conditions outlined in this letter. A detailed plan for the Sesame Genome Project has been made available on our website [62].

Properties of the S. indicum genome and available genomic resources

Natural sesame species can be divided into three types based on chromosome numbers, that is, 2n = 26 (for example, S. indicum, S. alatum), 2n = 32 (for example, S. prostratum, S. angolense) and 2n = 64 (for example, S. radiatum, S. schinzianum) [14, 37]. The basic chromosome number in the Sesamum genus is X = 8 and 13, with X = 13 probably resulting from ancient polyploidy [37]. The size of a haploid genome of S. indicum (2n = 26) was reported to be about 0.95 Gb, with a mass of 0.97 pg [63], which is out of proportion with the 0.51 Gb and 0.97 Gb of Ceratotheca sesamoides (2n = 32) and S. radiatum (2n = 64), respectively [64]. Before beginning this genome project, we examined the characteristics of sesame chromosomes using cv. Yuzhi 11. Results showed that its karyotype formula is 2n = 2x = 26 = 6m + 16sm + 4st, and chromosome length ranges from 1.21 to 2.48 μm (H Zhang, unpublished data). We distinguished and numbered the chromosomes with 45S rRNA, simple sequence repeat (SSR) and bacterial artificial chromosome (BAC) sequence probes using fluorescent in situ hybridization (FISH) and BAC-FISH techniques to facilitate super-scaffold assembly in the sesame genome (H Zhang, unpublished data). Using Arabidopsis thaliana [65], soybean (cv. Williams 82) [66] and rice (cv. Nipponbare) [67] as reference genomes for comparison, we estimated the genome size of S. indicum cv. Yuzhi 11 by flow cytometry to be about 369 Mb (H Zhang, unpublished data). From our preliminary sequencing data, we estimate the genome size to be approximately 354 Mb, close to this result (see below).
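The relationship between DNA mass and genome size quoted above can be checked with a short calculation. The sketch below assumes the widely used conversion factor of roughly 978 Mb per picogram of DNA, which is not stated in this letter; the function names are illustrative only.

```python
# Assumption (not from this letter): 1 pg of double-stranded DNA ~ 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(mass_pg: float) -> float:
    """Convert a DNA mass in picograms to a genome size in megabases."""
    return mass_pg * MB_PER_PG


def mb_to_pg(size_mb: float) -> float:
    """Convert a genome size in megabases to a DNA mass in picograms."""
    return size_mb / MB_PER_PG


if __name__ == "__main__":
    # Previously reported value for S. indicum: 0.97 pg.
    print(f"0.97 pg is about {pg_to_mb(0.97):.0f} Mb")
    # Flow cytometry estimate for cv. Yuzhi 11: about 369 Mb.
    print(f"369 Mb is about {mb_to_pg(369):.2f} pg")
```

Under this assumed factor, 0.97 pg corresponds to roughly 0.95 Gb, which matches the previously reported figure and makes the gap to the 369 Mb flow cytometry estimate explicit.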

The sesame chloroplast genome was published recently [59]. Sequencing of the chloroplast genome of S. indicum cv. Yuzhi 11 has also been performed (H Zhang, unpublished data), and will be used for raw read filtering and genome assembly in our Sesame Genome Project. A total of 86,222 unigenes with an average length of 629 bp are available, and 46,584 (54.03%) of these unigenes have significant similarity to proteins in the NCBI nonredundant protein database and Swiss-Prot database (E-value < 10⁻⁵) [39]. Before the beginning of this project, we sequenced sesame transcriptomes from 24 groups of S. indicum materials and treatments using Illumina paired-end sequencing technology to greatly enrich available information on the functional genome [40, 68], obtaining a 40 Gb dataset containing 42,566 unitranscript sequences. We also constructed a BIBAC (pCLD 04541) library of 80,000 clones with an insert size of 120 kb and a BAC (CopyControl™ pCC1BAC™) library of 57,600 clones with an insert size of 85 kb. The genome coverage of these two libraries was 27- and 13-fold, respectively (H Zhang, unpublished data). There are 45,093 S. indicum expressed sequence tags (ESTs) available in the NCBI EST database. Prior to our work, only two other S. indicum seed-specific cDNA libraries, including one full-length cDNA library, had been constructed, some clones of which were chosen at random and sequenced [38, 69]. In order to explore more genes involved in sesame growth and development, we constructed a full-length cDNA library of S. indicum cv. Yuzhi 11 containing 300,000 clones, 1,200 clones of which were selected randomly and sequenced (H Zhang, unpublished data). The genomic and transcriptomic data from these studies should facilitate genome assembly and analysis. The first sesame linkage map, which contains 284 microsatellite polymorphic loci, was constructed in 2009 and has been used as a landmark framework for assembly of the whole genome [70]. We recently updated this linkage map to a high-density map with 653 SSR, SNP, AFLP and random selective amplification of microsatellite polymorphic loci (RSAMPL) markers falling into 14 linkage groups to facilitate sesame genome assembly and anchoring of trait loci (H Zhang, unpublished data).
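The clone coverage of the two libraries described above follows from the number of clones, the insert size and the genome size. The sketch below is an illustration only: it assumes the 354 Mb k-mer-based genome size estimate as the denominator (the letter does not say which estimate was used for the reported fold coverage), and the function name is hypothetical.

```python
def library_coverage(n_clones: int, insert_kb: float, genome_mb: float) -> float:
    """Fold coverage of a clone library: total insert length / genome size."""
    total_insert_mb = n_clones * insert_kb / 1000.0
    return total_insert_mb / genome_mb


if __name__ == "__main__":
    GENOME_MB = 354.0  # assumption: k-mer-based estimate for cv. Yuzhi 11

    bibac = library_coverage(80_000, 120, GENOME_MB)  # BIBAC library
    bac = library_coverage(57_600, 85, GENOME_MB)     # BAC library
    print(f"BIBAC library: {bibac:.1f}-fold")
    print(f"BAC library:   {bac:.1f}-fold")
```

With the 369 Mb flow cytometry estimate as the denominator instead, both values come out slightly lower, so the result should be read as approximate.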

Sequencing strategy for the S. indicum genome

The Sesame Genome Project is divided into three phases. The first phase, which has already been completed, involved high-coverage Illumina sequencing and draft genome assembly. We constructed five types of Illumina libraries: two paired-end libraries with insert sizes of 300 and 500 bp, and three mate-pair libraries with insert sizes of 2, 3 and 5 kb. In order to avoid bias in library construction, at least two libraries for each insert length were constructed. Illumina technology was used to generate 98 Gb of reads, giving a 276× coverage of the estimated genome (Table 1). Subsequently, the draft genome was assembled using ABySS (v 1.3.3) [71]. Paired-end Illumina reads were first assembled into contigs. Mate-pair reads with insert sizes of 2, 3 and 5 kb were then aligned to the contigs, and the linkage information between mate-pair reads was used to join contigs and construct scaffolds. As a result, a preliminary assembly of 293.7 Mb was generated (Table 2).

Table 1 Summary of Illumina data for the S. indicum genome
Table 2 Overview of the current draft assembly of S. indicum
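The sequencing depth reported for the first phase is simply the total number of sequenced bases divided by the estimated genome size. A minimal sketch, using the 98 Gb of Illumina reads and the 354 Mb k-mer-based genome size estimate given in this letter (the function name is illustrative):

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage: sequenced bases per base of genome."""
    return total_bases / genome_size


if __name__ == "__main__":
    illumina_bases = 98e9   # 98 Gb of Illumina reads
    genome_size = 354e6     # 354 Mb estimated genome size
    depth = sequencing_depth(illumina_bases, genome_size)
    print(f"Illumina depth: about {depth:.1f}x")
```

The same function can be applied to any planned dataset to check whether the intended coverage is realistic for the amount of data to be generated.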

The second phase will involve Roche 454 pyrosequencing, BAC sequencing and fine map construction. We have constructed Roche 454 paired-end libraries with an insert size of 20 kb and will generate 3.5 Gb of data, giving an approximately 10× coverage of the estimated genome. We also plan to end-sequence 40,000 sesame BAC clones using conventional Sanger sequencing, giving a 12× coverage of the estimated genome. To ensure hybrid de novo assembly of the best possible quality, we will use a modified Celera Assembler pipeline [72]. Roche 454 paired-end reads and BAC-end reads are better suited to spanning longer repetitive elements and joining scaffolds into superscaffolds. We will use BAC-end information to retrieve and select 1,000 specific BAC clones, one end of which aligns well to a scaffold while the other end is located in a gap region, for full-length sequencing using the Illumina BAC pooling method. The full-length BAC sequences will fill in the gaps within superscaffolds and greatly improve genome integrity. At this stage, we expect to obtain a fine map of Yuzhi 11 with 800 to 1,000 superscaffolds with a putative N50 length of 1 Mb and N90 length of 250 kb.

In the final phase, the superscaffolds will be anchored to chromosomes. We will first anchor the BACs containing mapped SSR markers on the updated linkage map [70] (H Zhang, unpublished data). Physical distances between landmarks will then be determined. Furthermore, we will construct a physical chromosome map based on at least 1,000 BAC clones using information obtained from BAC-FISH and BAC-end. At least one BAC will be anchored on the chromosomes per superscaffold to ensure all superscaffolds are anchored onto the 13 chromosomes. In order to validate the accuracy and integrity of the sesame genome assembly, several quality control parameters, such as read depth of coverage, average quality values per contig, discordant read pairs and gene footprint coverage, will be examined. To check the accuracy of the assembly of scaffolds, we will also complete full-length sequencing of 15 BAC clones using conventional Sanger sequencing and align them to the scaffolds.

Timeline and goals of the Sesame Genome Project

The blueprint for the Sesame Genome Project was conceived and designed by the SGWG in 2009. We completed the goals of the first phase in March 2012. In the second phase, Roche 454 paired-end reads will be sequenced by December 2012, and the end-sequencing of the 40,000 BAC clones and full-length sequencing of 1,000 BAC clones will be completed by June 2013. The final phase of scaffold anchoring will proceed in parallel with bioinformatics analysis. We expect to complete all the goals of the Sesame Genome Project and submit a paper by December 2013. To make our data broadly available prior to publication, the completion of each goal of these phases will be publicly communicated via our website [62]. Updated versions of assembly data will be made available to any independent research groups performing non-genome-scale analyses. Sequence data and the preliminary assembly produced in the first phase are already available on the website.

Status of current preliminary genome assemblies

The current draft assembly of Yuzhi 11 is 293.7 Mb in length, with a GC content of 34.65%. The N50 and N90 sizes of the scaffolds are 22.6 kb and 4.3 kb, respectively (Table 2). Genome size was estimated to be 354 Mb using the well-established 17-mer method [73], in line with flow cytometry data that suggest it is 369 Mb (H Zhang, unpublished data). The 17-mer distribution frequency in 16.77 Gb of trimmed Illumina PE reads was calculated using Jellyfish (v1.1.4) [74]. We identified a total of 13,931,658,332 unique k-mers, and 87,207,553 k-mers that had a frequency <10. The frequency of peak k-mers was 39 (Figure 2).
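For readers unfamiliar with the N50 and N90 statistics quoted above, the following sketch shows how they are conventionally computed from a list of scaffold lengths: sequences are sorted from longest to shortest, and Nx is the length of the sequence at which the running total first reaches x% of the assembly. The lengths in the example are toy values, not data from the Yuzhi 11 assembly, and the function name is illustrative.

```python
from typing import Iterable


def nx(lengths: Iterable[int], percent: int) -> int:
    """Return the Nx statistic (e.g. percent=50 for N50) of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


if __name__ == "__main__":
    toy_scaffolds = [100, 80, 60, 40, 20]  # toy lengths, total = 300
    print("N50:", nx(toy_scaffolds, 50))
    print("N90:", nx(toy_scaffolds, 90))
```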

Figure 2

K-mer (17mer) frequency analysis of the S. indicum genomic sequence. Data produced from 500 bp insert libraries. The peak k-mer frequency is 39 and its minimum point is 10. Genome size was estimated with the formula: Estimated genome size (bp) = total number of k-mers with a frequency >10/peak k-mer frequency.
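The genome size formula given in the legend of Figure 2 can be reproduced from the k-mer counts reported in the text. The sketch below makes one assumption: the number of k-mers with a frequency >10 is taken as the total number of k-mers minus those reported with a frequency <10. The function name is illustrative.

```python
def estimate_genome_size(total_kmers: int, low_freq_kmers: int, peak_depth: int) -> float:
    """Estimate genome size (bp) as usable k-mers divided by peak k-mer frequency.

    Low-frequency k-mers are treated as likely sequencing errors and excluded.
    """
    usable_kmers = total_kmers - low_freq_kmers
    return usable_kmers / peak_depth


if __name__ == "__main__":
    size_bp = estimate_genome_size(
        total_kmers=13_931_658_332,   # k-mers counted in the trimmed PE reads
        low_freq_kmers=87_207_553,    # k-mers with a frequency <10
        peak_depth=39,                # peak k-mer frequency (Figure 2)
    )
    print(f"Estimated genome size: {size_bp / 1e6:.1f} Mb")
```

Under this assumption the calculation gives a value just under 355 Mb, in line with the approximately 354 Mb estimate reported here.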

In order to determine the frequency and complexity of repetitive elements in the draft assembly, we compared the assembly information with the Arabidopsis repetitive elements database from the RepeatMasker library (version 20120418) and the sesame de novo database constructed for the Yuzhi 11 draft assembly (RepeatModeler, version 1.0.5) using RepeatMasker (version open-3.2.9) [75, 76]. Thirty-eight percent of the draft assembly was identified as repetitive elements (Table 3), only approximately 5.7% of which shared homology with the Arabidopsis database.

Table 3 Repeats derived from de novo and homology-based predictions in S. indicum

Quality control of the raw data and intermediate datasets

To control the quality of the raw data, the SolexaQA package was used to verify the sequence data generated from each of the 17 Illumina-Solexa libraries [77]. The raw reads were trimmed with DynamicTrim (quality threshold Q ≈ 20) and then filtered with LengthSort (length cutoff of 25). Reads left unpaired by this procedure were screened out and discarded. Roche 454 read data, which are stored in Standard Flowgram Format (SFF), will be converted into FastQ format and evaluated using the traditional quality metrics. As Sanger reads may contain vector sequences, the Lucy package will be used to detect and trim vector sequence contamination [78]. Low-quality bases and chimeric reads will be handled by the trimming modules of the Celera Assembler.
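The Illumina read filtering described above can be illustrated with a simplified sketch. It is not the SolexaQA implementation: it only mimics the general idea of cropping each read to its longest contiguous high-quality segment, discarding reads that become too short, and dropping pairs in which either mate fails. The exact comparison (>= versus >) and all function names are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base Phred quality scores)


def longest_high_quality_segment(quals: List[int], threshold: int = 20) -> Tuple[int, int]:
    """Return (start, end) of the longest run with quality >= threshold (end exclusive)."""
    best_start = best_end = 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= threshold:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(read: Read, threshold: int = 20, min_length: int = 25) -> Optional[str]:
    """Crop a read to its best segment; return None if shorter than min_length."""
    seq, quals = read
    start, end = longest_high_quality_segment(quals, threshold)
    if end - start < min_length:
        return None
    return seq[start:end]


def filter_pair(read1: Read, read2: Read) -> Optional[Tuple[str, str]]:
    """Keep a read pair only if both mates survive trimming."""
    t1, t2 = trim_read(read1), trim_read(read2)
    if t1 is None or t2 is None:
        return None  # the surviving mate would be unpaired, so drop the pair
    return t1, t2
```

For example, a 30-base read whose first three bases and last base fall below the threshold would be cropped to the 26 bases in between and retained, whereas a read with only 10 high-quality bases would be discarded together with its mate.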

We validated the coding region coverage of the draft assembly using two different gene footprint coverage methods. Using the Core Eukaryotic Genes Mapping Approach (CEGMA) [79], 444 (96.9%) of the 458 core eukaryotic genes (CEGs) mapped against the draft assembly were identified. An RNA-Seq-based method employing Velvet [80] and OASES [81] allowed us to assemble 3.5 Gb of RNA-Seq reads (NCBI accession SRX061117) [39] into 99,589 putative transcripts. Putative transcripts were then translated into 82,549 peptides using ESTScan (version 2.1) [82]. These peptides were aligned against the SWISS-PROT [83] database using BLAST (E-value 10⁻⁵) to obtain high-confidence peptides. Redundant peptides (such as those from alternatively spliced transcripts) were filtered according to BLAST scores and the names of the hits. More than 99.5% of the 3,584 peptides obtained could be aligned to the draft assembly using GMAP [84]. These results indicate that the draft assembly has a high coverage of the coding region.

Gene prediction for the draft assembly was performed using InchWorm [85]: 3.5 Gb of RNA-Seq reads [GenBank: SRX061117] were assembled into 472,257 contigs and mapped to the draft genome using GMAP. The GMAP mapping results were used as a training set for ab initio prediction using AUGUSTUS [86]. As a result, 23,713 gene models were obtained with a total length of 28 Mb (Table 4). Average coding sequence length was 1.2 kb and average GC content was 45%. We obtained functional annotations of all genes using InterProScan [87], which also determines motifs and domains. Gene Ontology (GO) annotations were given to 10,656 genes using corresponding InterPro entries and the Pfam database [88]. Visualization of the functional categories of these 10,656 genes was performed using WEGO [89] (Figure 3).

Table 4 Predicted genes in S. indicum
Figure 3

Functional catalogues of sesame genes in the preliminary assembly. Results are summarized in three main categories: biological processes, cellular components and molecular functions. A total of 10,656 genes have been assigned with Gene Ontology terms.

Biological questions to be addressed

We plan to address several key biological questions specific to sesame using this new genome and transcriptome data. We will compare the sesame genome with the genomes of monocotyledonous and other dicotyledonous plants to elucidate the phylogeny of the Sesamum genus and the origin of S. indicum. We will also perform more detailed investigations on the formation and regulation of fatty acids, storage proteins and secondary metabolites (including sesamin) in sesame. We will apply the bio-information obtained in this genome project in sesame breeding programs, paying particular attention to the induction and regulation of resistance to the main sesame diseases, including Fusarium wilt and charcoal rot diseases, and the environmental stress of waterlogging. Other possible uses of the genomics dataset, such as determining the regulatory mechanisms of biological characteristics in Sesamum, including simple stem or branch, leaf shape, indeterminate growth habit, flower number per axilla, capsule carpel number, flower color and other species-specific traits, will not form part of our analysis. We believe that the main achievement of this project will be to markedly accelerate sesame genetic research and breeding. Members of the SGWG also hope to address additional questions about the relationship between sesame growth and environmental conditions, such as identifying which genes regulate low temperature responses and drought sensitivity.

Joining the SGWG and using our early release data

This project is being conducted by the SGWG. We invite other research groups to access and use the draft assembly and raw read data, which have already been released. Any group performing non-genome-scale analyses, or investigating the above biological questions, is welcome to use our data without restriction. As a matter of courtesy and to avoid duplication of effort, we request that competing genome-scale projects or studies that overlap with the above stated research areas disclose their status to the SGWG consortium. Formal inquiries and requests to join the working group should be made to HZ. Updated versions of the genome assembly, further project descriptions and a complete list of current SGWG members dedicated to this project can be accessed on our website [62].