Plant genomes: current status

Plants are indispensable for human life: they not only provide food, fiber, and fuel but are also critical for the provision of oxygen and the absorption of CO2. Plant breeding and genetics are powerful tools for increasing plant productivity through the development of improved varieties. The rapid progress of plant genomics in recent years has opened new possibilities in targeted breeding of specific traits, and provides a powerful approach to sustainable crop production [1]. Plant genomics, in combination with genetics and breeding, has a particularly crucial role to play in ensuring food security for the rapidly growing world population.

Many plant genomes are large and complex due to an abundance of transposable elements and a long history of repeated genome duplication, making genome sequencing a major challenge [2]. The era of plant genomics began with the release of the Arabidopsis genome sequence in 2000 [3]. It was a milestone in plant biology and made Arabidopsis one of the most popular species for basic plant research. Rice, a staple food in most of the world, became the second plant with an available genome sequence in 2002 [4, 5]. Rapid progress in the development of new sequencing technology and bioinformatic tools in recent years has allowed faster and more efficient sequencing and assembly of genomes at lower cost. As many as 20 plant genomes have been sequenced and assembled in the last two years [6–25]. Genome sequences of plants belonging to different groups, such as two plants from the early land plant clades (the moss Physcomitrella patens and the spikemoss Selaginella moellendorffii) and numerous economically important monocots (Box 1), such as rice, maize and sorghum, have now been decoded [4, 5, 25–28] (Figure 1). Eudicots (Box 1), the largest group of flowering plants, are composed of two major clades, the Eurosids and the Euasterids. Genome sequences of many Eurosids, such as Arabidopsis, grape, poplar, medicago and cucumber, have already been published [3, 13, 29–31]. However, the Euasterids, which include many plants of economic importance, were not represented among the known plant genome sequences until the release of the genome of potato (Solanum tuberosum), a member of the family Solanaceae [6]. The recently decoded genome sequences of tomato (Solanum lycopersicum) and its close wild relative (Solanum pimpinellifolium) are significant additions to the published Euasterid genomes [7]. These genomes will not only promote plant genomics and breeding studies for crop improvement programs in the Euasterids, particularly in the family Solanaceae, but also provide an unprecedented opportunity for basic plant biological research in the area of development and evolution.

Figure 1

Phylogenetic tree for plants with sequenced genomes. Genomes of the species shown in black are complete, and those of species shown in lighter gray are less complete. The tree is based on information provided by CoGePedia [82] and was updated on 15 July 2012.

Arabidopsis as a model plant for research

Arabidopsis has served as a model plant for basic plant research due to its small size, self-pollination, short life cycle, ease of propagation and genetic transformation [32]. Additionally, its small genome and the availability of its genome sequence made it a favorite for genetic and molecular studies. The sheer volume and extent of Arabidopsis-related research and integration of genetic, molecular, biochemical, genomic and morphological data from Arabidopsis provided insights into many universal aspects of plant biology. Due to a high level of synteny (Box 1) between the genomes of various plant species, the Arabidopsis genome also provided information on the structure of other Eurosid genomes [33].

In spite of these advantages, Arabidopsis is an 'atypical' plant (Figure 2). Its small genome and features such as leaf morphology, fruit characteristics and plant architecture differ from those of most agriculturally important plants [34]. While Arabidopsis is excellent for research in basic plant biology, it is not amenable to investigations of domestication or crop improvement through selective breeding. The ongoing 1001 Arabidopsis genome project aims to look at natural selection and alleles underlying phenotypic diversity across the entire genome and the entire species [35]. Questions related to domestication and crop improvement have mostly been investigated in monocot cereal crops such as maize and rice [36–38]. Genome sequences of other agriculturally and economically important plant species are needed to answer major questions about genome function and genome evolution, and to apply genomic information to the practical problems of yield and quality enhancement. Consistent with this, the US National Plant Genome Initiative, established in 1998, also called for genome sequences of every major plant of economic importance.

Figure 2

Comparison of Arabidopsis, tomato and maize. Leaf morphology, fruit morphology and inflorescence architecture differ among Arabidopsis, tomato and maize.

Tomato genetics and genomics: past and present

The family Solanaceae, which includes many species of economic importance, such as tomato, potato, tobacco, pepper and eggplant, is the most extensively studied family among the Euasterids. Solanaceous crop genomics is in an exciting phase of development following the recent sequencing of the potato and tomato genomes [6, 7]. Tomato (Solanum lycopersicum L.) originated in South America and was spread around the world to become one of the most extensively used vegetable crops. Besides its economic value, it has interesting developmental features, such as compound leaves, fleshy fruits, and sympodial shoot branching (Box 1, Figure 2) [39–41]. Moreover, it has simple diploid genetics, a short generation time and routine transformation technology, and is easy to maintain. Together these make tomato an excellent species for both basic and applied plant research.

Several genetic and genomic resources were available for tomato before the inception of the tomato genome sequencing project. Large germplasm collections consisting of numerous accessions of landraces of tomato (S. lycopersicum) and its wild relatives (Box 2) [42, 43] had been established. Many of these wild relatives are sexually compatible with tomato and are a source of valuable disease resistance and other genes that breeders had exploited to develop modern cultivated tomato varieties. Tomato geneticists had used a number of morphological and isozyme markers to construct a genetic map of tomato and identified the 12 linkage groups corresponding to the cytologically visible chromosomes. This aided the construction of an RFLP (restriction fragment length polymorphism) linkage map [44, 45]. The resulting comprehensive molecular linkage map enabled breeders to identify quantitative trait loci (QTLs), leading to an understanding of the genetic basis of numerous quantitative traits [46]. The Solanaceae Genomics Network website provides extensive information on the available tomato genetic and genomic resources [47].

Domesticated tomato and related wild species (Box 2) exhibit tremendous genetic and trait biodiversity, making the group highly suitable for evolutionary and domestication studies [48, 49]. Sequence diversity analysis of extensive expressed sequence tags from domesticated and wild tomato species identified numerous inter- and intra-specific polymorphisms, many of which could be important for domestication [50]. In order to exploit the rich trait reservoir of domesticated and wild tomato species, tomato breeders developed advanced backcross mapping populations for identifying and transferring favorable QTLs from wild to cultivated germplasm. This subsequently led to the development of permanent mapping populations in the form of introgression lines (ILs) where, by repeated backcrossing, a segment of a wild species genome is introduced into a cultivated tomato background [51, 52]. A set of 76 ILs, ensuring complete genome coverage of S. pennellii introgressed into the cultivated tomato M82 variety, has been extensively phenotyped for numerous traits such as morphology, yield, fruit quality, and fruit primary and secondary metabolites for the identification of QTLs [53]. The high-resolution mapping approach applied to S. pennellii ILs has led to the map-based cloning of the sugar yield QTL Brix9-2-5 and the fruit weight QTL fw2.2 [54, 55]. Brix9-2-5 was delimited to an invertase gene, which is expressed early in fruit development, whereas fw2.2 was delimited to the gene ORFX, which is expressed early in floral development. Classical breeding and marker analysis have also made remarkable contributions to improving various yield traits of tomato. For example, the fruit size QTL fasciated, initially identified using a cross between S. lycopersicum and S. pimpinellifolium, has recently been characterized [56]. The first example of yield improvement by a single overdominant gene (SINGLE FLOWER TRUSS) through heterosis has been demonstrated in tomato [57]. Furthermore, tomato and its wild relatives have also been used as a model for self/hybrid incompatibility studies [58, 59].
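The repeated backcrossing used to build introgression lines can be illustrated with a small calculation. The sketch below is only a textbook expectation, assuming no selection: the F1 hybrid carries half of its genome from the wild donor, and each backcross to the cultivated parent halves that share on average. The function name and the range of generations shown are illustrative choices, not values taken from the studies cited above; in real IL development, marker selection retains one chosen donor segment while the rest of the donor genome is diluted in this way.

```python
def expected_donor_fraction(backcrosses: int) -> float:
    """Expected genome-wide fraction of wild (donor) genome after an F1 cross
    followed by `backcrosses` backcrosses to the cultivated parent.

    Assumes no selection: the F1 is 1/2 donor and each backcross halves it.
    """
    if backcrosses < 0:
        raise ValueError("backcrosses must be non-negative")
    return 0.5 ** (backcrosses + 1)


if __name__ == "__main__":
    # Illustrative range of generations (hypothetical, not from the source).
    for n in range(7):
        label = "F1" if n == 0 else f"BC{n}"
        print(f"{label}: {expected_donor_fraction(n):.4f}")
```

The rapid decline in the unselected donor share is why a few rounds of backcrossing suffice to leave a single, defined wild segment in an otherwise cultivated background.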

Tomato genome in brief

Sequencing of the tomato genome was initiated in 2005 as a multinational effort among 14 countries. The genome of the domesticated tomato Heinz 1706 was sequenced using a combination of longer Sanger and 454/Roche GS FLX reads and high-coverage, shorter SOLiD and Illumina GAIIx reads. The sequences were assembled into 91 scaffolds, covering 760 Mb of the approximately 900 Mb genome and aligned to the 12 tomato chromosomes, with 34,727 predicted protein-coding genes [7]. Most of the gaps were restricted to repeat-rich pericentromeric regions. Additionally, a draft sequence of the closest wild relative, S. pimpinellifolium, was compared to the Heinz sequence. The two genomes are highly similar, showing only 0.6% nucleotide divergence. Sixty percent of the genes are identical or carry only synonymous changes between domesticated and wild tomato, while the remaining 40% have non-synonymous changes, including alterations of stop codons with potential consequences for gene function. Compared to the potato genome, the tomato and S. pimpinellifolium genomes show more than 8% nucleotide divergence. Moreover, the tomato genome is highly syntenic with the genomes of other economically important members of the family Solanaceae, such as eggplant and pepper. Comparative genome analysis identified two consecutive triplication events in the Solanum lineage. Interestingly, these genome triplications added new gene family members, such as transcription factors and enzymes necessary for ethylene biosynthesis and perception, which mediate important fruit-specific functions.
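The quantities in this paragraph (fraction of the genome assembled, nucleotide divergence, and synonymous versus non-synonymous changes) can be made concrete with a toy calculation. The following is a minimal sketch, not the pipeline used in [7]: it assumes two gap-free, already aligned sequences of equal length, uses the standard genetic code, and compares codons one by one. The function names and the short example sequences are hypothetical.

```python
from itertools import product
from typing import List

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nucleotide_divergence(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return mismatches / len(seq_a)


def classify_codons(ref_cds: str, alt_cds: str) -> List[str]:
    """Label each codon pair as identical, synonymous or nonsynonymous.

    A change to or from a stop codon ('*') counts as nonsynonymous here.
    """
    if len(ref_cds) != len(alt_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("coding sequences must be aligned and in frame")
    labels = []
    for i in range(0, len(ref_cds), 3):
        ref = ref_cds[i:i + 3].upper()
        alt = alt_cds[i:i + 3].upper()
        if ref == alt:
            labels.append("identical")
        elif CODON_TABLE[ref] == CODON_TABLE[alt]:
            labels.append("synonymous")
        else:
            labels.append("nonsynonymous")
    return labels


if __name__ == "__main__":
    # Share of the ~900 Mb genome covered by the 760 Mb assembly.
    print(f"assembled fraction: {760 / 900:.3f}")
    # Hypothetical 4-codon coding sequence with one silent change (GCT -> GCC).
    ref = "ATGGCTAAATGA"
    alt = "ATGGCCAAATGA"
    print(nucleotide_divergence(ref, alt), classify_codons(ref, alt))
```

In this scheme a gene counts towards the "identical or only synonymous" class when none of its codons is labeled nonsynonymous. Real comparisons also have to handle alignment gaps, ambiguous bases and frame shifts, which this sketch deliberately ignores.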

Future directions of tomato genetics and genomics

Modern tomato genetics had already used molecular markers and functional analysis to identify a handful of genes underlying developmental or yield traits, but the availability of the tomato genome sequence will further revolutionize tomato genetics and breeding. Because domesticated tomato varieties show limited genetic diversity, the wild tomato relatives provide a rich source of useful allelic variation. The 150 tomato genome resequencing project was recently initiated with the objective of revealing and exploring extant genetic variation in tomato, and will provide a major boost to the identification of valuable alleles. The project aims to sequence 83 genotypes, including 30 wild accessions, 43 landraces and 10 old varieties [60]. This will help identify not only useful SNPs from the wild accessions but also rare SNPs within domesticated varieties. Tomato breeders can then target gene variants (SNPs) in the wild species that are associated with desirable traits, such as disease or pest resistance or growth in extreme environmental conditions, and introduce them into cultivars in order to exploit the rich tomato germplasm for breeding purposes. More genome sequences will facilitate QTL identification, mapping and cloning of underlying genes, and provide new SNP markers for marker-assisted breeding. For example, genome-wide association studies (GWAS) will allow detection and fine mapping of QTLs in the post-genome era, given the high phenotypic diversity among various tomato wild relatives [61, 62]. QTL analyses will also help to investigate the process of domestication and the associated yield increase [56, 63]. Additionally, millions of informative markers (SNPs/InDels) and structural variations, such as duplications, inversions and transpositions, identified through comparison of genome sequences of domesticated and wild tomatoes will promote investigations into the genetic and molecular basis of the process of domestication and crop improvement.

Identification of introgressions of segments of the S. pimpinellifolium genome into the Heinz genome already suggests that introgression through conventional (rather than marker-assisted) breeding has been a significant factor in crop improvement and domestication in tomato [7]. Some of these wild-species introgressions have provided disease resistance, and others have been associated with small fruit size (cherry tomatoes). ILs in the background of cultivated tomato exist for many wild tomato species [51, 64–68]. The tomato ILs are an excellent tool for functional genomics studies to investigate genetic and environmental interactions. Expression QTLs (eQTLs), as identified by large-scale transcriptome profiling of the ILs, will be useful in connecting phenotypic variation to genotypic diversity, thus leading to a hypothetical regulatory network based on the location of eQTLs and phenotypic QTLs [69]. Integrating additional functional genomics approaches such as metabolomics and proteomics can significantly reduce the number of candidate genes for a given QTL [70, 71]. One of the major thrusts of functional genomics in the future will be RNA-seq-enabled transcriptome profiling. For example, comparison of transcriptome profiles from domesticated and wild tomato species will give us insights into the gene expression differences associated with the process of domestication and trait diversity. The tomato functional genomics database (TFGD), which includes microarray, metabolite and small RNA data, had already been established as a comprehensive resource before the complete tomato genome sequence was released [72].

The advent of next generation sequencing and the available genome sequence should make characterization of large collections of tomato mutants even more rapid and robust through sequencing of phenotyped sub-pools from F2 populations and subsequent mapping using methods such as SHOREmap and next generation mapping (NGM) [73, 74]. Availability of the tomato genome sequence will speed up the understanding of gene function in developmental and metabolic pathways and help identify key steps in co-regulation mechanisms by mapping relevant tomato mutants. Additionally, multiple TILLING (Targeting Induced Local Lesions IN Genomes) resources in different backgrounds have already been developed for tomato functional genomics [75, 76]. These TILLING resources, in combination with the tomato genome sequence, should be useful for both forward and reverse genetics in tomato, for basic science as well as crop improvement.
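The idea behind mapping a mutation by sequencing a phenotyped F2 pool can be sketched in a few lines. In a pool of F2 plants selected for a recessive mutant phenotype, alleles from the mutant parent approach a frequency of 1 near the causal mutation and stay near 0.5 at unlinked sites. The code below is a simplified illustration of that principle only; it is not the algorithm implemented in SHOREmap or NGM, and the function name, the input layout (position, reads supporting the mutant-parent allele, reads supporting the other allele) and the example counts are all hypothetical.

```python
from typing import List, Tuple

# (position, reads with mutant-parent allele, reads with other allele)
Snp = Tuple[int, int, int]


def peak_allele_frequency_window(snps: List[Snp], window: int) -> Tuple[int, int, float]:
    """Return (start, end, frequency) for the run of `window` consecutive SNPs
    with the highest pooled mutant-parent allele frequency.

    Ties are resolved in favor of the leftmost window.
    """
    if window < 1 or len(snps) < window:
        raise ValueError("need at least `window` SNPs and window >= 1")
    ordered = sorted(snps)  # sort by position
    best = None
    for i in range(len(ordered) - window + 1):
        chunk = ordered[i:i + window]
        mutant_reads = sum(s[1] for s in chunk)
        total_reads = mutant_reads + sum(s[2] for s in chunk)
        if total_reads == 0:
            continue  # no coverage in this window
        freq = mutant_reads / total_reads
        if best is None or freq > best[2]:
            best = (chunk[0][0], chunk[-1][0], freq)
    if best is None:
        raise ValueError("no reads in any window")
    return best


if __name__ == "__main__":
    # Hypothetical read counts along one chromosome of a mutant F2 pool.
    pool = [(100, 5, 5), (200, 6, 4), (300, 9, 1),
            (400, 10, 0), (500, 9, 1), (600, 6, 4)]
    print(peak_allele_frequency_window(pool, window=3))
```

Pooling reads across a window rather than scoring single SNPs smooths sampling noise from low coverage, at the cost of mapping resolution; published tools make more careful statistical choices than this sketch does.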

Beyond tomato to the family Solanaceae

Besides the genus Solanum, the family Solanaceae has more than 3,000 species that exhibit diversity in development, organ morphology, metabolism and geographic distribution. Many of these species have high economic, nutritional and agricultural importance. The SOL-100 sequencing project, with the objective of centralizing sequences and phenotypes of 100 different species across the phylogenetic diversity of the group, will facilitate the genetic mapping of simple and complex phenotypes affecting the numerous diverse traits in SOL-100 species [77]. It has long been known that gene duplication events followed by sub-functionalization and neo-functionalization have fueled the process of evolution. Genome sequences of SOL-100 species will promote comparative genetics efforts in the family Solanaceae, and will provide important insights into the evolution of gene families [7]. This knowledge will help identify genes for traits that may be useful in tomato breeding, either through introgression from sexually compatible species or by moving them into tomato via a transgenic route.

In the post-genomic era, an overwhelming amount of data from different 'omics' approaches is being generated and utilized for genomics research. It is becoming increasingly clear that, with the availability of new plant genome sequences from both the family Solanaceae and more crops of agricultural interest, coupled with cheap next generation sequencing technologies, conversion of raw data into biologically meaningful information will require better and more easily accessible bioinformatics tools [78]. Progress in plant research in general will depend on our ability to tie together independent components, such as genotypic information, phenotypic data and expression profiles, into higher order complexity with multiple dimensions. The availability of genome sequences and the ability to handle large data sets will promote systems biology approaches for crop plant research to build higher order gene regulatory networks for understanding plant developmental and metabolic pathways [79]. For example, laser capture microdissection (LCM) or fluorescence-activated cell sorting (FACS) in combination with RNA-seq has enabled us to generate transcriptome profiles of specific tissues and cell types, such as those of leaf, inflorescence and fruit, at high spatial and temporal resolution. The resulting large-scale data can be integrated using bioinformatic tools to understand how cell- and tissue-specific gene expression leads to the production of whole organ phenotypes, and will address the developmental changes associated with environmental adaptation and diversification of crop plants [80, 81]. Integration of other 'omics' data from proteomics, metabolomics, epigenomics and other studies will further allow crop biology to be approached from a systems perspective.

Together with genome sequences of related tomato wild species and other SOL-100 species, tomato sequence information will not only accelerate the elucidation of evolutionary relationships within the Solanaceae, but also aid in the discovery of new genes, allele-mining, and large-scale SNP genotyping. More efficient use of genetics and breeding methods and large-scale structural and functional genomics studies, in combination with systems biology approaches, will promote rapid varietal improvement, evolution and domestication studies and functional analysis of genes (Figure 3). Considering the highly diverse developmental and metabolic behavior of different plants and differences in their human usage, it is evident that a single model plant will not be able to answer the many intriguing questions in basic and applied plant biology. With the shifting focus of plant biology from a single model plant, Arabidopsis, to multiple model plants of economic importance, tomato genomics studies will serve as a model for other crop plants, with particular focus on exotic germplasm genomics for trait improvement and the use of systems biology approaches. For example, crop plants within the family Solanaceae, such as eggplant, pepper and potato, have wild relatives that have been utilized for crop improvement and will benefit from incorporation of genome-scale information into breeding efforts.

Figure 3

Approaches to and applications of Solanaceous plant genomes. Efficient breeding, structural and functional genomics and systems biology approaches will promote rapid varietal improvement, domestication studies and functional analysis of genes.

Perspective

With the recently decoded tomato and potato genomes (both belonging to the family Solanaceae), plant biologists have just started to explore the genome sequences of Euasterids for both applied and basic research [6, 7]. To promote large-scale genomic studies in Euasterids, more species belonging to this group need to be sequenced to match the plethora of sequence information from Eurosids. The family Solanaceae itself has many other plants of economic importance, such as eggplant, pepper, tobacco and petunia. Furthermore, Euasterids include numerous other plants of economic importance: the beverage-producing plants tea and coffee; oil-producing plants such as sunflower, safflower and olive; vegetable plants such as carrot, parsley and lettuce; and aromatic plants such as mint, basil and rosemary. Genome sequencing of a number of these species is already in progress (Table 1). Genome information for these plants will not only help to enhance the economic value of these crops but also answer specific questions related to their basic development and the regulation of their secondary metabolism. Orobanchaceae, the largest family of parasitic flowering plants, also belongs to the Euasterids. Deciphering the genomes of parasitic plants such as Striga and their close relatives will help in saving millions of hectares of crop plants worldwide from destruction by these parasites. Moreover, availability of genome sequences from Euasterids and basal eudicots (for example, Columbine; Figure 1), along with the numerous sequenced Eurosid genomes, will promote large-scale evolutionary and biodiversity studies across all the flowering plants.

Table 1 Genome sequencing of Euasterids in progress

Box 1

Monocots: A group of flowering plants whose seeds typically have one cotyledon (embryonic leaf).

Dicots: A group of flowering plants whose seeds typically have two cotyledons (embryonic leaves). However, the term dicots is paraphyletic. 'Eudicots' defines a monophyletic group that excludes monocots, their allies and more basal 'dicot' species.

Synteny: Physical co-localization of genetic loci in the same relative positions in two or more genomes.

Sympodial branching: In this type of branching the shoot apex terminates in a flower and one or more axillary buds continue growth of the inflorescence. The process is reiterated many times.

Box 2: Tomato wild relatives

Tomato wild relatives fall into two groups: red- or orange-fruited species such as S. pimpinellifolium, S. galapagense, and S. cheesmaniae, and the green-fruited species such as S. neorickii, S. chmielewskii, S. chilense, S. arcanum, S. corneliomulleri, S. huaylasense, S. peruvianum, S. habrochaites, S. pennellii, S. lycopersicoides, S. sitiens, S. ochranthum and S. juglandifolium. Red-fruited species accumulate glucose and fructose, while the green-fruited species accumulate sucrose [42]. Traditionally, wild and cultivated tomatoes were grouped within the genus Lycopersicon in the Solanaceae. However, molecular phylogenetic studies support placement of tomatoes in the genus Solanum [43].