Maize (Zea mays L., corn) was domesticated in the highlands of Central Mexico approximately 10,000 years ago [1]. Corn agriculture spread rapidly into diverse climate zones, ranging from 45° N to 45° S, and supported vast Native American civilizations. Today, maize is one of the world's most important crops: for direct human consumption, as a key component of animal feed, and as the source of chemical feedstocks. Grass species (including maize) cover 20% of the terrestrial surface of the earth, and the grains from maize, rice, wheat, and minor grass crops provide the majority of calories in the human diet [2].

Since the beginning of the twentieth century, maize has been a model species for genetic analysis, reflecting its unusual biological features. Maize plants produce separate male and female inflorescences, which greatly facilitates experimentally controlled pollination by eliminating the need for emasculation (Figure 1). Large numbers of progeny (300-600 kernels per ear) and the ease of crossing allow a single maize geneticist to generate more than 100,000 outcross progeny per day. Individual plants produce up to 10^7 pollen grains, allowing fine-structure genetic mapping for phenotypes that can be scored at the pollen stage. Using this abundant material and extraordinary natural diversity, early geneticists mapped many genes, uncovered subtle genetic phenomena such as paramutation and imprinting, and made practical contributions to agriculture through the discovery of hybrid vigor and cytoplasmic male sterility.

Figure 1

Maize inflorescences. The separation of (a) female inflorescence (ear) and (b) male inflorescence (tassel) is one of the key features of the maize plant responsible for its pivotal role in plant genetics, greatly simplifying controlled pollination (photos courtesy of Tom Peterson, Iowa State University).

The beautiful detail evident in meiotic maize chromosomes stimulated a generation of gifted cytogeneticists to identify the physical basis for recombination, to construct linkage maps tied to chromosomes, and to analyze the consequences of chromosome breakage. Of particular importance to current functional genomics was Barbara McClintock's discovery of transposable elements by analyzing the regulation of somatic variegation and germinal mutation in maize. Once maize transposons were molecularly cloned, they provided the means to clone any tagged gene: maize provided the first discovery of many plant-specific gene products and facilitated the cloning of related genes from other flowering plants. The availability of detailed genetic knowledge, a large community of researchers, and ease of gene cloning and genetic analysis make maize the monocotyledonous species of choice for many studies.

The maize genome is organized into 10 chromosomes (2N = 20), and is about 2.4 × 10^9 base-pairs in total. Sorghum, which is estimated to have diverged from a common ancestor with maize about 15-20 million years ago (MYA), has the same chromosome number, but its genome is about one third of the size. Rice diverged from a common ancestor with maize and sorghum about 50-60 MYA and has 12 chromosomes (2N = 24), comprising a much smaller genome of about 430 million base-pairs. Comparative genomics of these grasses suggests considerable colinearity between their genomes [3]. The size differences of the genomes are presumed to result from the ancestral allotetraploidization (approximate duplication from diploid to tetraploid when two species hybridize) of the maize genome [4] and differences in the expansion and dispersion of repetitive DNA (long terminal repeat retrotransposons, miniature inverted repeat transposons, and other repetitive sequences) [5].

In December 2000, Arabidopsis thaliana became the first plant species for which the genome was almost entirely sequenced (currently, 117 of an estimated 125 million base-pairs are available, with only centromeric and ribosomal DNA repeat regions as yet unsequenced [6]; reviewed in [7]). Because of its small genome size, ease of transformation, and tolerance of life in a growth chamber, this seemingly lowly weed has emerged as the model flowering plant, ahead of commercially important crops. The choice will be well justified if the evolutionarily recent advent of flowering plants means that most genes found in Arabidopsis prove to be common to all flowering plants. Among the crops, members of the Brassica genus (including B. oleracea and B. rapa, the so-called 'cole-crops', oilseeds, and mustard) are most closely related to Arabidopsis (divergence less than 20 MYA). Gene order seems to be largely conserved, and thus the Arabidopsis genome should prove a powerful tool for studying Brassica genomics [8,9]. Significant colinearity has also been observed between Arabidopsis and soybean [10] (divergence time 100 MYA), and between Arabidopsis and tomato [11,12] (divergence time more than 100 MYA). This article assesses the prospects for comparative maize-Arabidopsis genome analysis in view of the greater divergence time (more than 150 MYA) between the grasses (which are monocots) and dicots such as Arabidopsis.

Lack of synteny between maize and Arabidopsis

The extent of conservation of gene order between the grasses and Arabidopsis can be estimated from three well-studied groups of maize loci: the a1-sh2 region [13,14,15], the adh1 region [16,17], and the bz locus and its associated genes [18]. The a1-sh2 region in maize, sorghum, and rice contains the sh2 gene upstream of a1, transcribed in the same direction. The a1 gene encodes an NADPH dihydroflavonol reductase required for anthocyanin biosynthesis and sh2 encodes an endosperm-expressed ADP glucose pyrophosphorylase important in starch biosynthesis. The two genes are separated by about 140 kilobases (kb) in maize but only about 19 kb in sorghum and rice. Moreover, a1 is duplicated in sorghum. Sequences that are highly similar to sh2 can be found on Arabidopsis chromosomes 1, 2, 4, and 5. Potential homologs of a1 map to Arabidopsis chromosomes 2 and 5, but they are far apart from the potential sh2 genes. Recently, two additional genes have been identified in the a1-sh2 interval: x1 and yz1, which are of unknown function and conserved among maize, rice, and sorghum [14,19].

Genic regions are generally conserved between the adh1 regions of maize and sorghum, although adh1 is the only gene with assigned function (alcohol dehydrogenase), and maize is missing three out of ten other potential genes within this region [16]. Whereas the maize region is replete with retrotransposons, gathered into sequence blocks of 14-70 kb and inserted between the potential genes, the sorghum sequence does not contain any retrotransposons. Colinearity with Arabidopsis appears limited to a block of two genes conserved between sorghum and Arabidopsis [16]. Interestingly, the colinearity of this locus pair is interrupted even between maize and rice [17].

The recently sequenced bz locus of maize and its chromosomal region displays a gene-dense genomic organization very different from adh1, with ten putative genes within a 32 kb stretch that is free of retrotransposons [18]. Although this gene density is similar to that in Arabidopsis, and most of the genes have potential homologs in Arabidopsis according to the genome sequence, no colinearity is evident. Thus, on the basis of our current picture of plant genome organization, micro-colinearity between different genomes may be even more limited than has previously been stated [20].

Proteome comparisons

Although gene order does not appear to be conserved across the monocot-dicot divide, the repertoires of gene products (that is, the typical monocot and dicot proteomes) may be conserved. This hypothesis cannot be fully tested until the complete Arabidopsis genome is matched to a complete monocot genome, but the current collection of maize proteins and genome sequence fragments may provide a clue. We downloaded the entire set of 4,195 maize protein sequence records from GenBank and reduced this collection to a representative, non-redundant set of maize proteins in three steps: first, removal of sequences shorter than 60 amino acids; second, removal of organelle-encoded proteins; and third, selection of a single sequence to represent each cluster of highly similar entries (including identities resulting from duplications in GenBank; this was done using the novel fast string matching program 'vmatch' [21]; V.B. and S.K., unpublished). The resulting set of 1,143 sequences was compared with a set of 25,617 putative Arabidopsis proteins [22] using BLASTP [23] at moderate stringency (BLAST -e option set to 1e-5). Most of the 117 entries without significant hits were identified as polypeptides encoded by transposable elements. The remaining sequences were matched directly against the Arabidopsis genome using the GeneSeqer spliced alignment program [24] to check for possible gene products not included in the Arabidopsis predicted protein set (only one unannotated Arabidopsis homolog of a maize protein was identified in this way). About 50 candidate maize-specific proteins remained, including several zeins, some predicted products of unknown function, and several other proteins (the latter group are listed in Table 1). On the basis of these results (1,026 of the 1,143 sequences have significant hits), we estimate that at most about 90% of maize proteins have close homologs in Arabidopsis. The distinct maize genes appear to be tissue-specific (endosperm) or involved in maize pathogen-defense responses.
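The three filtering steps and the resulting upper estimate can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for the analysis: the ProteinRecord type, its fields, and the toy records are hypothetical, and exact-duplicate removal stands in for the vmatch clustering of highly similar entries. Only the 60-residue cutoff and the counts 1,143 and 117 are taken from the text.

```python
from dataclasses import dataclass

MIN_LENGTH = 60  # minimum protein length in amino acids (from the text)


@dataclass(frozen=True)
class ProteinRecord:
    accession: str   # hypothetical identifier, for illustration only
    sequence: str    # amino-acid sequence
    organelle: bool  # True if encoded by a plastid or mitochondrial genome


def non_redundant(records):
    """Apply the three filters in order.

    Step 3 is simplified: only exact duplicates are collapsed, whereas the
    text describes clustering of highly similar entries.
    """
    kept, seen = [], set()
    for rec in records:
        if len(rec.sequence) < MIN_LENGTH:   # step 1: too short
            continue
        if rec.organelle:                    # step 2: organelle-encoded
            continue
        if rec.sequence in seen:             # step 3: redundant entry
            continue
        seen.add(rec.sequence)
        kept.append(rec)
    return kept


def matched_fraction(total, without_hits):
    """Fraction of sequences that have at least one significant hit."""
    return (total - without_hits) / total


# Toy input (assumed data): one keeper, one duplicate, one short, one organellar.
records = [
    ProteinRecord("toy1", "M" + "A" * 79, False),
    ProteinRecord("toy2", "M" + "A" * 79, False),
    ProteinRecord("toy3", "M" + "K" * 30, False),
    ProteinRecord("toy4", "M" + "L" * 99, True),
]
print([rec.accession for rec in non_redundant(records)])

# Counts from the text: 1,143 non-redundant proteins, 117 without hits.
print(f"{matched_fraction(1143, 117):.1%}")  # 89.8%
```

The figure is an upper estimate because the GenBank protein collection is biased toward well-studied genes, which are more likely to have recognizable homologs.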

Table 1 Maize proteins with no obvious homologs in Arabidopsis

Maize EST analysis

One pivotal strategy for identifying gene products involves sequencing of large sets of expressed sequence tags (ESTs). Many plant genome projects have adopted this approach, and there are currently more than 100,000 EST database entries in the public domain for each of soybean, tomato, Medicago truncatula, maize, Arabidopsis, and rice [25]. To further assess the overlap between the maize and Arabidopsis proteomes, we derived a set of 27,294 maize ESTs with non-redundant open reading frames (ORFs) of at least 120 codons (again using vmatch). The translated ORFs (derived from all six reading frames) were compared to the set of putative Arabidopsis proteins using BLASTP at different stringency levels. As shown in Figure 2, 62-68% of the maize ESTs yield ORF products that match Arabidopsis proteins, and the total fraction of the Arabidopsis protein set matched by the maize ESTs is 60-73%. Similar numbers were obtained for consensus sequences built from maize EST clusters [26]. Thus, a significant proportion of maize ESTs might encode highly diverged or maize-specific proteins. Some ORF products might not correspond to functional proteins, and incorrect gene prediction models and the as yet partial Arabidopsis protein set may also contribute to incomplete matching. For comparison, the same procedure applied to the Arabidopsis EST set compared to the Arabidopsis protein set gave a matching fraction of 88% or more of 28,161 qualifying ESTs, showing that chance ORFs may account for up to 12% of the qualifying ESTs in Arabidopsis, and presumably a similar fraction in maize. We can therefore refine the estimate of maize proteins with close homologs in Arabidopsis to 60-90% of the maize proteome. Because ESTs are difficult to derive from genes expressed at low levels, there may in fact be more unmatched maize proteins to be found.
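The ORF derivation step can be illustrated with a short Python sketch of six-frame translation. It assumes the standard genetic code and treats an ORF as any stop-free stretch of at least 120 codons, without requiring a start codon (a reasonable reading for partial transcripts such as ESTs, but an assumption here). The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translations(dna):
    """Yield the translation of each of the six reading frames."""
    dna = dna.upper()
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons containing ambiguous bases (e.g. N) are translated as 'X'.
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def long_orfs(dna, min_codons=120):
    """Return all stop-free peptide stretches of at least min_codons residues."""
    return [
        segment
        for frame in six_frame_translations(dna)
        for segment in frame.split("*")
        if len(segment) >= min_codons
    ]


# Toy sequence (assumed data): a 131-codon ORF followed by a stop codon.
toy_est = "ATG" + "GCT" * 130 + "TAA"
peptides = long_orfs(toy_est)
print(("M" + "A" * 130) in peptides)  # True
```

On this low-complexity toy sequence the other reading frames also happen to be free of stop codons, so long_orfs returns several peptides. This is a small-scale reminder that a long ORF need not correspond to a real translation product, which is the point of the Arabidopsis control described above.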

Figure 2

Comparison of maize proteins predicted from EST sequences with Arabidopsis proteins. A non-redundant set of protein sequences consisting of at least 120 amino acids each, derived from 27,294 distinct maize ESTs, was compared with 25,617 putative Arabidopsis proteins at different BLASTP stringency levels. The percentages in each pie chart give the fractions of the two sequence sets involved in these matches, at each stringency level.

A glimpse of the maize genome

Several approaches are currently being used to provide further sequence data from the maize genome. These sequences are entered into the Genome Survey Sequence (GSS) division of GenBank because the sequencing is for the most part exploratory, at a low redundancy level. Table 2 summarizes a rough analysis of 11,625 maize GSS entries available as of 1 November 2001. The sequences were obtained by different selection strategies, including genomic sequences flanking Mutator transposon insertions [26], random inserts [27], sequences selected for not being methylated [28], bacterial artificial chromosome (BAC) ends [29], sequences that were genetically mapped [30], and sequences selected for long ORFs using the ORF Rescue vector [31]. Table 2 gives the result of a BLASTP search (option -e 1e-5) of all ORFs of at least 120 codons derived from the GSSs, compared to the non-redundant maize protein set. It can be seen that the random sequencing approaches (random inserts and BAC ends) produce a large fraction of sequences matching transposable elements, whereas the Mutator transposon insertion, methylation filter, and 'ORF rescue' approaches clearly bias against the recovery of such sequences. More than 80% of the GSS entries with ORFs derived from the former two approaches do not show significant similarities to known maize proteins, and, surprisingly, more than 70% do not match any Arabidopsis proteins (Table 3). An intriguing explanation would be that these ORFs correspond to novel or highly diverged maize proteins. It is also possible that some of the ORFs do not correspond to native translation products.
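The way such matching fractions are tallied can be sketched as follows. The example assumes BLAST results in the common 12-column tab-separated tabular layout (query ID in the first column, E-value in the eleventh); the query and subject identifiers and the hit lines are invented for illustration, and only the 1e-5 cutoff is taken from the text.

```python
import csv
import io


def queries_with_hits(tabular_text, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue."""
    matched = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue <= max_evalue:
            matched.add(query_id)
    return matched


def unmatched_fraction(all_query_ids, tabular_text, max_evalue=1e-5):
    """Fraction of queries with no hit passing the E-value cutoff."""
    queries = set(all_query_ids)
    matched = queries_with_hits(tabular_text, max_evalue) & queries
    return (len(queries) - len(matched)) / len(queries)


# Invented example: orf1 has a strong hit, orf2 only a weak one, orf3 no hit.
rows = [
    ["orf1", "subjA", "62.5", "200", "75", "0", "1", "200", "5", "204", "2e-30", "120"],
    ["orf2", "subjB", "28.0", "50", "36", "1", "10", "59", "80", "129", "0.5", "25"],
]
blast_table = "\n".join("\t".join(row) for row in rows)

print(sorted(queries_with_hits(blast_table)))  # ['orf1']
print(unmatched_fraction(["orf1", "orf2", "orf3"], blast_table))
```

Queries that produce no output rows at all must be counted from the input list, which is why unmatched_fraction takes the full set of query IDs rather than inferring it from the hit table.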

Table 2 Analysis of maize genome survey sequences: a comparison with maize proteins and ESTs
Table 3 Analysis of maize genome survey sequences: a comparison with Arabidopsis proteins

To assess these possibilities, we compared the sequences of novel ORFs with the maize EST set using GeneSeqer [24]. The result, that 26-44% of the novel ORFs in the four large GSS collections match a still limited collection of maize ESTs (see Table 2), suggests that many of the ORFs do indeed correspond to expressed genes. The remaining fraction may include less abundantly expressed genes. We can estimate the gene fraction accessible by EST sequencing from the EST coverage of GSS-derived ORFs: if the roughly 10,000 novel ORFs in the maize EST set constitute only 40% of the genes, we can anticipate some 25,000 novel maize proteins that are not found in Arabidopsis. It is likely that many of these proteins are derived from gene duplications. The lack of sequence conservation across the monocot-dicot divide suggests that there has been extensive functional divergence after duplication.
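The extrapolation is simple proportional scaling, shown here as a minimal Python sketch. The function name is hypothetical; the inputs 10,000 and 40% are from the text, and the 26% and 44% values are the ends of the EST coverage range quoted above, used only to show how sensitive the estimate is to the assumed coverage.

```python
def extrapolate_total(observed, sampled_fraction):
    """Scale an observed count up to the whole, given the fraction sampled."""
    if not 0 < sampled_fraction <= 1:
        raise ValueError("sampled_fraction must be in (0, 1]")
    return observed / sampled_fraction


novel_orfs_in_ests = 10_000  # rough count from the text
est_coverage = 0.40          # fraction of genes assumed accessible to ESTs

print(round(extrapolate_total(novel_orfs_in_ests, est_coverage)))  # 25000

# Sensitivity to the assumed coverage, over the 26-44% range reported above.
for coverage in (0.26, 0.44):
    print(coverage, round(extrapolate_total(novel_orfs_in_ests, coverage)))
```

Lower assumed coverage gives a larger projected total, so the figure of 25,000 should be read as an order-of-magnitude estimate rather than a precise count.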

The need for a maize genome sequencing project

On the basis of available data, we think that the resource provided by the Arabidopsis genome cannot adequately substitute for more extensive maize genome sequencing. Genome organization is very different between the two plants, and the proteomes may also have significant differences, particularly with respect to agronomically important maize genes involved in plant-pathogen interactions, reproduction, and the development and function of specific tissues. The many exceptions to micro-colinearity even among the grasses suggest that the completion of the rice genome [32] will still not answer many of the questions particular to maize genomics. Beyond questions concerning agronomically important traits, plant biologists also look to maize as a model for the evolution of plant genomes that are not as small and streamlined as those of Arabidopsis and rice [33]. Correspondingly, a maize genome sequencing project will focus on sequencing gene-rich genome fractions first [34], and other crop genome projects are likely to follow. Plant biologists should look forward to very exciting times when whole-genome comparisons become possible, leading to a clearer understanding of the development of plants from their genetic blueprints.