Introduction

Rice is one of the three mega-crops (rice, maize, and wheat) on which more than half the world’s population relies as major sources of calories and protein. Genetic improvement based on molecular biotechnology requires an accurate genome sequence, which underpins the establishment of genome-wide DNA markers for tagging and delimiting the genetic regions in which genes and quantitative trait loci (QTLs) are located. Global identification of protein-coding genes in the rice genome could enhance the discovery of the genes that are responsible for agronomically desirable traits.

The International Rice Genome Sequencing Project (IRGSP, 1997–2004), which was run by a collaborative research consortium of ten countries, succeeded in establishing the nucleotide sequence of the japonica rice cultivar Nipponbare to a high standard [10]. The 370 million nucleotides from the 12 chromosomes of rice are now widely used as molecular coordinates for investigating rice genomics and genetics. Although most of the euchromatic regions (95%) of the rice genome were covered by the published sequences, 62 gaps and the heterochromatic regions, centromeres, and telomeres, together corresponding to 5% of the total genome, remained unsequenced. To acquire the sequences of these missing regions and thus improve the public rice genome sequence, new genomic libraries recently constructed from physically fragmented genomic DNA have been utilized, and these efforts have succeeded in revealing some of the junction sequences between the euchromatic and heterochromatic regions of rice telomeres. This information, which we describe here, should give us clues to the molecular diversity of telomere structures and to the mechanisms that generate them.

The genus Oryza comprises 23 species [38], but one of the mysteries of rice history is that most of the modern varieties of rice, derived from Oryza sativa and Oryza glaberrima, are the descendants of only a specific lineage (the AA genomes). Oryza emerged about 20 to 22 Mya [9]. There are geographic, physiological, and genetic diversities among Oryza species, including among rice varieties, landraces, and wild accessions. In recognition of the fact that this variation is indispensable for maintaining the vast genetic resources that should help in developing a sustainable future for the human race, rice is collected, evaluated, and stored in either national or international gene banks (e.g., NIAS Genebank, http://www.gene.affrc.go.jp/about_en.php; [18]; Japanese National Bioresource Project, http://www.shigen.nig.ac.jp/rice/oryzabase/top/top.jsp; International Rice Research Institute; Genetic Resource Center, http://www.irri.org/GRC/GRChome/Home.htm). Global genome and information resources for the investigation of genome evolution among Oryza species have been facilitated by The Oryza Map Alignment Project (OMAP, http://www.omap.org/; [17]).

Now that we are armed with these resources, it should be interesting to know how each gene or genomic region has evolved in the course of rice evolution and domestication. The elucidation of molecular diversity, as revealed by detailed sequence analysis, should be a fundamental product of such research.

Here, we present a review of comparative genomics based on information on the sequences of the genic region. A detailed molecular diversity analysis of both the exon and intron regions within the “Green Revolution” gene would not only present information on protein diversity but also give us clues to genomic conservation and development.

Diversity of the telomere region among rice chromosomes

Although IRGSP attempted clone-by-clone genomic sequencing to cover the whole genome, clone gaps remained in the chromosomal ends. As the restriction enzymes used in the construction of PAC/BAC libraries could not cut the canonical telomere array, (TTTAGGG)n, the libraries did not contain the clones derived from telomeric sequences [5, 39]. To capture the telomere sequences, a rice fosmid library constructed by the cloning of random mechanically sheared DNA [1] was screened [24]. The library enabled telomeric sequences to be obtained without the constraints imposed by enzyme site preferences. We describe here the characteristics of the telomeric regions on the basis of their sequence and length diversity among chromosomes.
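
Why enzyme-based libraries miss the telomere array can be illustrated with a short check. The sketch below asks whether a tandem (TTTAGGG)n array contains a given recognition site on either strand. The enzymes listed (HindIII, EcoRI, and Sau3AI) are assumptions chosen for illustration; the text does not state which enzymes were used to build the PAC/BAC libraries.

```python
# Illustrative check: does a tandem telomere array contain a restriction site?
# The enzyme list is an assumption for illustration, not taken from the text.
TELOMERE_UNIT = "TTTAGGG"

ASSUMED_SITES = {
    "HindIII": "AAGCTT",
    "EcoRI": "GAATTC",
    "Sau3AI": "GATC",
}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def sites_in_array(unit, sites, copies=10):
    """Map each enzyme name to True if its site occurs on either strand."""
    array = unit * copies
    strands = (array, reverse_complement(array))
    return {name: any(site in strand for strand in strands)
            for name, site in sites.items()}

cuttable = sites_in_array(TELOMERE_UNIT, ASSUMED_SITES)
for enzyme, found in cuttable.items():
    print(enzyme, "cuts the array" if found else "has no site in the array")
```

One strand of the canonical array contains no C and the other no G, so any recognition site that needs both bases on the same strand cannot occur in it. A randomly sheared library avoids this constraint.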

The rice chromosomal end has tandemly repeated blocks of the sequence 5′-TTTAGGG-3′ [40]. These telomeric repeats run in the 5′-TTTAGGG-3′ orientation outward from the chromosome-specific region [24, 42]. Near the junction between the telomere and the chromosome-specific region, the seven-nucleotide unit carries deletions, insertions, or substitutions of single nucleotides. The rate of accumulation of telomeric variants is higher in the proximal region than in the distal region [25], suggesting that the proximal region has rarely been reconstructed by telomerase on an evolutionary time scale.

This expansion of telomeric variants makes it possible to characterize the rice chromosomal end. Copies of ATTAGGG, CTTAGGG, GTTAGGG, TATAGGG, TTCAGGG, or TTGAGGG are arrayed in tandem, or the same subtypes lie close to each other, at the ends of chromosomes 2L, 3L, 7L, and 10S (Fig. 1; [25]). Inversion of telomeric repeats is observed adjacent to the beginning of the telomere array at the ends of chromosomes 4L, 7S, and 9S. The proximal telomeric sequences are therefore composed, in a chromosome-specific manner, of blocks of the canonical sequence and at least six types of TTTAGGG variants. This distribution suggests that telomeric variants might have arisen from the rapid expansion of a single mutation rather than from the gradual accumulation of random mutations. The rice telomere also contains a deletion variant, TTAGGG, which lacks one T of TTTAGGG; rice has a 4.9% content of this variant, dispersed throughout the whole of the sequenced region. The telomeric sequence in the Asparagales is similar but not identical to that of rice: the rice deletion variant, TTAGGG, is present in the Asparagales [35]. The partial or full replacement of the telomeric sequences by these variants might have been due to evolutionary changes in the genomic sequence that encodes the RNA template or to structural changes in the catalytic subunit of telomerase.
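
The variant classes named above can be tallied with a few lines of code. The following is a minimal sketch, not the method used in the cited studies: it assumes that every repeat unit ends in GGG and splits the array on that basis, which is a simplifying heuristic, and the toy array is invented for illustration.

```python
import re
from collections import Counter

CANONICAL = "TTTAGGG"
# Substitution variants named in the text
SUBSTITUTION_VARIANTS = {
    "ATTAGGG", "CTTAGGG", "GTTAGGG", "TATAGGG", "TTCAGGG", "TTGAGGG",
}
DELETION_VARIANT = "TTAGGG"  # one T missing from TTTAGGG

def split_units(array):
    """Split a telomere array into units, assuming each unit ends in GGG.

    This is a heuristic for illustration; trailing bases that do not end
    in GGG are dropped.
    """
    return re.findall(r"[ACGT]*?GGG", array)

def classify_units(units):
    """Count canonical, substitution, deletion, and other units."""
    counts = Counter()
    for unit in units:
        if unit == CANONICAL:
            counts["canonical"] += 1
        elif unit in SUBSTITUTION_VARIANTS:
            counts["substitution"] += 1
        elif unit == DELETION_VARIANT:
            counts["deletion"] += 1
        else:
            counts["other"] += 1
    return counts

# Hypothetical toy array, not real rice sequence
toy_array = "TTTAGGG" * 3 + "TTAGGG" + "CTTAGGG" * 2 + "TTTACGGG"
counts = classify_units(split_units(toy_array))
deletion_fraction = counts["deletion"] / sum(counts.values())
print(dict(counts), round(deletion_fraction, 3))
```

Applied to a real chromosome end, the position of each unit relative to the junction would also need to be recorded to reproduce a block map of the kind shown in Fig. 1.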

Fig. 1 Distribution of TTTAGGG substitution variants. Each box represents the 7-nt unit of the telomere repeat TTTAGGG (white) and the different variants (ATTAGGG, CTTAGGG, GTTAGGG, TATAGGG, TTCAGGG, and TTGAGGG), as shown in the key. Gray box represents other variants, including deletion and insertion variants. Numbers indicate positions of telomere sequences from the junction between the chromosome-specific region and the telomere array.

Telomere length varies among rice accessions. The telomeres of 31 rice accessions (both cultivars and wild species, which belong to AA, BB, BBCC, CC, CCDD, GG, or HHJJ species of Oryza) are 5 to 20 kb in length [24]. Marked variation in telomere length is also observed among cultivated rice of the AA genome: the japonica cultivar Nipponbare shows a relatively low molecular weight (MW) pattern and the indica cultivar Kasalath shows a relatively high MW pattern. Moreover, telomere length varies among chromosomes in Nipponbare, as revealed for individual chromosomes by fiber fluorescence in situ hybridization (fiber-FISH). Seven telomeres in Nipponbare range from 5.1 to 10.8 kb in length, corresponding to about 730 to 1,500 copies of the TTTAGGG telomeric repeat. The chromosome-dependent variation might be a consequence of genetic or epigenetic differences among the sequences of subtelomeres; these differences might affect the balance between telomere shortening and telomere elongation.
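
The conversion from telomere length to repeat copy number is simple arithmetic: the length in nucleotides divided by the 7-nt unit. The sketch below works through the two Nipponbare endpoints; it assumes an array made entirely of 7-nt units, which ignores the shorter deletion variants.

```python
UNIT_LENGTH_NT = 7  # length of the canonical TTTAGGG unit

def repeat_copies(length_kb, unit_nt=UNIT_LENGTH_NT):
    """Approximate number of repeat units in a telomere of the given length."""
    return length_kb * 1000 / unit_nt

shortest = repeat_copies(5.1)   # roughly 730 copies
longest = repeat_copies(10.8)   # roughly 1,540 copies
print(round(shortest), round(longest))
```

Both values agree with the rounded range of about 730 to 1,500 copies.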

Telomere lengths have been reported for various plants: 2.5 kb in Arabidopsis thaliana [20]; 4.5 kb at most in Melandrium album [28]; 60 to 160 kb (in most cases 90 to 130 kb) in Nicotiana tabacum [6]; and 1.8 to 40.0 kb in maize [4]. Does telomere length change in different cells? In barley (Hordeum vulgare), wide variation in telomere length is observed during the differentiation or ageing of cells. The cells that develop in long-term callus cultures have very long telomeres [16]. Thus, it is possible that telomere length in rice varies among tissues or developmental stages.

The rice telomere has diversity in both sequence and length. The mosaics of blocks of telomere variants might have resulted from slips during DNA synthesis, a high frequency of DNA recombination, or rapid deletion in the telomeric region, suggesting that the areas near the distal chromosome ends are dynamic and variable.

Diversity analysis of rice functional genes

The genus Oryza contains 23 species that grow in a wide range of environments. Rich in genomic diversity, they could serve not only as genetic resources for improving rice production but also as research materials for studying the evolutionary history and functionality of genes related to speciation, domestication, polyploidy, ecological adaptation, and human selection of rice [37]. The public rice genome sequences obtained from two rice cultivars, Nipponbare (by the IRGSP) and 93-11 (by the Beijing Genomics Institute, BGI), as well as the wild rice BAC library resources established from the AA to HHKK genomes of Oryza species at the Arizona Genomics Institute (AGI), provide a good opportunity to carry out such studies [46, 10, 2]. For example, BAC end sequences from these libraries have been analyzed and preliminary BAC contigs generated. These studies suggested that repeat sequences play a role in genome size evolution and provided physical evidence of changes in genomic composition and structure between the different genomes of Oryza species [17]. All BAC libraries and information on BAC end sequences and BAC contigs are available through the AGI BAC/EST Resource Center (http://www.genome.arizona.edu/orders).

Belonging to the Oryza genus, Oryza sativa, also called Asian cultivated rice, is thought to have originated from the Asian wild rice Oryza rufipogon only about 10,000 years ago [14]. Growing now throughout the world, Oryza sativa has two subspecies, indica and japonica. Knowledge of the differences in phenotype variations among rice species or subspecies at the level of molecular biology would widen future rice breeding possibilities. With this purpose, the Rice Genome Research Program (RGP) has constructed nine novel BAC libraries from species that carry the AA genome, as an important resource for comparative analysis of rice genomes. These include the three Asian rice varieties Kasalath (indica), Shuusoushu (indica), and Kha Mac Kho (japonica) from O. sativa, one accession from the African cultivated rice O. glaberrima, and one accession from each of the wild rice species Oryza rufipogon, Oryza barthii, Oryza glumaepatula, Oryza meridionalis, and Oryza longistaminata [41]. By chromosomal in silico mapping of 78,427 high-quality BAC end sequences, 450 Kasalath BAC contigs that consisted of 12,170 clones and covered 308.5 Mbp of the genome were generated [13]. These resources are freely accessible through the RGP homepage (BAC end sequences at http://rgp.dna.affrc.go.jp/blast/runblast.html, BAC contigs at http://rgp.dna.affrc.go.jp/E/publicdata/kasalathendmap/index.html) for researchers to perform comparative analyses of the genomes of the two subspecies of O. sativa and to generate single nucleotide polymorphism (SNP) or indel markers for genetic studies.

Both basic and applied research on rice genes has been carried out in the past decade, and especially after the completion of sequencing of the two rice genomes (Nipponbare and 93-11), genomic and genetic analyses have greatly increased our understanding of the function of the rice genome. Among the most important achievements are the current use of advanced QTL mapping and genomic sequencing techniques for successful cloning and functional analysis of the rice genes controlling agriculturally important traits. For example, the structure and function of the genes involved in spikelet shattering, grain number, grain shape (width and length), photoperiod sensitivity, tillering, and plant architecture have been reported [3, 7, 19, 21, 22, 29, 33, 36, 43]. It will be both scientifically interesting and agriculturally important to investigate the sequence diversity of these genes among different varieties and species; this information could not only provide valuable information on evolutionary history of a crop but also lead to the discovery of new alleles for the improvement of rice breeding.

To date, only a few genes have had their sequences extensively investigated and compared across the different Oryza species to elucidate molecular and evolutionary mechanisms. The genes most analyzed for sequence comparisons in Oryza are probably the alcohol dehydrogenases (Adh). Ge et al. [8] were the first to sequence two genes (Adh1 and Adh2) from 31 accessions representing all 23 rice species; they reported the phylogenetic relationships among the different Oryza species as determined from the sequence polymorphisms. Yoshida et al. [44, 45] have investigated the nucleotide diversity in the Adh1 and Adh2 gene regions of O. rufipogon in order to clarify the mechanisms by which DNA variation is maintained.

We have started to perform comparative genomics on these functional genes. Here, we introduce current results from the molecular and evolutionary analysis of the semidwarfing gene sd1, one of the most important genes used for the development of high-yielding rice varieties. The sd1 gene is located on the long arm of chromosome 1 in rice and encodes gibberellin 20-oxidase (GA20ox2). In the 1960s, a dramatic increase in rice production throughout Asia was achieved through the development of a high-yielding semidwarf indica rice cultivar known as IR8. This so-called rice Green Revolution depended largely on the introduction of the sd1 gene, because the recessive character of the gene results in a shortened culm with improved lodging resistance and a high harvest index, allowing for the increased use of nitrogen fertilizers to improve yield [12, 15]. Using the AGI and RGP BAC libraries, we obtained and sequenced the entire regions of sd1 genes from 17 cultivated and wild rice species by screening and chromosomal in silico mapping of the positive BAC clones that covered the target region in each species. For comparison of genome diversity and divergence within and among the species, the genomic region of the Adh1 gene within the same accessions was also sequenced as a control. Sequences obtained in this study have been submitted to the DNA Data Bank of Japan (acc. no. AB469048–AB469082).

GA20ox2 differed in length among the species examined, ranging from 389 to 407 amino acids, with the exception of the indica cultivar 93-11, which contained only 341 amino acids because of the presence of an SNP creating a stop codon within the third exon. When the Nipponbare sequence was used as a reference, the indels detected in the other species were found only in the N- and C-terminal regions of the coding sequence (Fig. 2a). Nucleotide substitutions, on the other hand, could be detected throughout the coding region, although, as was the case for the indels, more non-synonymous substitutions seemed to have occurred in the two terminal regions than in the internal regions of the gene (Fig. 2b). It is clear that the sequence of the gene encoding GA20ox2 is conserved within all the species examined, particularly within the AA-genome species, in which only between 0 and 5 non-synonymous sites are present, giving ≥99.2% identity at the amino acid level (Table 1). Even between the two most distant species, O. sativa and Oryza granulata, the gene encoding GA20ox2 had an identity of 88.0%.

Fig. 2 Distribution of indels (a) and SNPs (b) detected within the sd1 gene between the Nipponbare and other rice species. Upper bar indicates synonymous base substitutions while lower bar indicates non-synonymous base substitutions.

Table 1 Summary of sequence comparison in the entire region of sd1 gene among Oryza species using Nipponbare as a reference

The sd1 gene was first identified in the Chinese variety Dee-geo-woo-gen (DGWG) and was crossed at the International Rice Research Institute (IRRI) in the early 1960s with Peta (tall) to develop the semidwarf cultivar IR8 [11]. Genetic and molecular analyses have demonstrated that the sd1 gene in DGWG contains a 383-bp deletion spanning parts of the first and second exon and resulting in a frameshift that gives a stop codon within the coding sequence [26, 29]. A similar deletion (280-bp) was detected in the semidwarf indica rice cultivar Doongara [34]. Additional alleles that carry a single mutation causing changes in the amino acid residues in the semidwarf japonica rice cultivars Jikkoku (in exon 1), Calrose76 (in exon 2), and Reimei (in exon 3) have also been found [29, 34]. Interestingly, two accessions of wild rice O. rufipogon (W1944 and W1718) are reported to carry the DGWG allele, suggesting the preservation and human use of natural alleles from the wild progenitor [27]. However, our examination of the sd1 gene sequence within the 17 cultivated and wild rice species revealed none of these types of alleles. The rice cultivar 93-11 seems to encode a truncated protein owing to the presence of a premature stop codon; this codon could, however, be considered as a null allele, because 93-11 does not have a semidwarf phenotype. We also surveyed the presence of alleles as reported above within 60 accessions of O. sativa and 34 accessions of O. rufipogon by using the world core collections from the National Institute of Agrobiological Sciences, National Institute of Genetics, and IRRI. Along with the two modern indica cultivars IR58 and Milyang 23, another rice cultivar, Rexmont, from the USA, contains the DGWG type of allele. No other varieties within the above collections carry any of the remaining types of known alleles.

DnaSP (http://www.ub.edu/dnasp/) analysis based on the aligned sequences within the entire region of the sd1 gene among different species revealed 507 polymorphic sites among a total of 1,925 available sites, including 117 synonymous and 78 non-synonymous sites within the exon region. These sites enabled us to estimate the genome diversity (π) as well as the divergence (K, genetic distance) within and between the Oryza species (Table 2). Genome diversity and divergence differ between the two genes, sd1 and Adh1. The π value of the sd1 gene in O. sativa is higher than that of the Adh1 gene, except at synonymous sites within the exon region, and the change in K value between species for the two genes is well correlated with the current taxonomic classification of Oryza species on the basis of crossing ability [37]. The Adh1 gene has a lower level of variation than the average heterozygosity in O. sativa and O. rufipogon; this might be related to the adaptive importance of this gene in the face of anaerobic environments and stress in the tropics and subtropics [30, 31, 23].

The species of the O. sativa complex (AA genome) and the Oryza officinalis complex (BB–EE genomes) are >1 m tall; examples are Oryza alta (CCDD), Oryza latifolia (CCDD), and Oryza australiensis (EE), which can grow to a height of 2 to 4 m [32]. In contrast, two small species, Oryza brachyantha (FF) and O. granulata (GG), are shorter than 1 m. The cause of the difference in plant height between these species is not known. The gibberellin hormone family, however, is undoubtedly involved in many aspects of plant growth and development. Although many more rice varieties and accessions of wild rice species might be needed for this kind of genomic analysis, the higher genomic diversity and larger divergence in the sd1 region than in the Adh1 region, particularly in terms of non-synonymous sites within the exon region, should provide primary information for us to understand the evolutionary mechanism of genes involved in the control of plant architecture. Through further comparative genomic and genetic studies, it should be possible to determine how phenotypic variations are induced by DNA mutations. This information could facilitate the exploration of natural alleles for future breeding of rice.
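
The two summary statistics used here can be sketched in a few lines. Nucleotide diversity (π) is the mean number of differences per site over all pairs of sequences within a group, and an uncorrected divergence is the same mean taken over pairs drawn from two different groups. The code below is a simplified illustration under stated assumptions: it skips gapped positions pair by pair and applies no multiple-hit correction, so its handling of gaps and its value of K may differ from DnaSP. The toy sequences are invented.

```python
from itertools import combinations, product

def per_site_difference(a, b):
    """Proportion of differing sites between two aligned sequences.

    Positions with a gap ('-') in either sequence are skipped.
    """
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared

def nucleotide_diversity(seqs):
    """Mean per-site difference over all pairs within one group (pi)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(per_site_difference(a, b) for a, b in pairs) / len(pairs)

def uncorrected_divergence(group_a, group_b):
    """Mean per-site difference over all between-group pairs."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must contain sequences")
    return sum(per_site_difference(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy alignments, not real sd1 or Adh1 data
within = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
other = ["TCGTACGA", "ACGTACGT"]
pi = nucleotide_diversity(within)
k = uncorrected_divergence(within[:1], other)
print(round(pi, 4), round(k, 4))
```

Computing the statistics separately for synonymous and non-synonymous exon sites, as in Table 2, would additionally require the reading frame of each sequence.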

Table 2 Estimated nucleotide diversity within the sd1 gene region of O. sativa and its divergence with other Oryza species in comparison with Adh1 gene region

Conclusion

Genome sequences from many plant species have been published, and more than 150 projects aiming to sequence plant genomes are either completed or ongoing (Genomes OnLine Database v2.0, http://genomesonline.org/). Only the rice and Arabidopsis genomes have been sequenced completely. As the conversion of draft sequences to “finished” ones takes huge amounts of time, effort, and funding, these two plants will serve as reference genomes for the study of monocot and dicot plants, respectively. The emerging ultra-high-throughput sequencing technology will enable us to obtain whole-genome information in less time, and this information will be mapped to and compared with these references. Studies of genome sequences within and among Oryza species will produce a concrete database for comparative genomics. We will be able to use this database to investigate both the evolution and function of regions, genes, motifs, and sequences within the genome.