Introduction

Corn (Zea mays) is an important worldwide crop that is relied upon for human food, animal feed and for starch ethanol production. In 2007, 93.6 million acres were planted in the US and over 13.1 billion bushels of corn were harvested. Over the last 50 years, there has been a steady increase in corn grain yield averaging an increase of 1.6 bushels per acre per year. In addition, the number of acres planted with corn has gone from 68 million in 1966 to 93.6 million in 2007 (National Agricultural Statistics Service, http://www.nass.usda.gov/). The increasing use of corn for the production of ethanol has further added to the need for increased production.

Due to the commercial importance of the crop, there is significant interest in understanding the underlying DNA sequence. The corn genome is consequently being sequenced: on average over 800 BAC clone sequences were produced every month during 2007, and a draft sequence has been announced (Wilson 2008). Sequencing the genome is a large undertaking, not only because the genome is nearly as large as mammalian genomes (2,800 million base pairs (Mbp), in contrast to 130 Mbp for Arabidopsis and 430 Mbp for rice) but also because of the abundance of mobile elements within it. Nevertheless, the genome sequence will be an invaluable asset for accelerating further improvement of this crop.

Because corn has been an important plant cultivated for centuries, there is a wealth of information available from phenotypic descriptions, disease effects and resistance, marker data and quantitative trait loci. As has been done for rice, this information can be synergistically integrated with the genome sequence and linked with well-defined genes.

The identification of genes that are important in yield and yield preservation, which can serve as markers for polymorphisms and their use in improved breeding, offers a huge advantage toward that goal. While the genome sequence will be completed shortly, a quicker and complementary approach to identifying a large number of corn genes is EST and full-length cDNA sequencing. These resources will prove invaluable for annotating the genomes of corn and other monocots and as substrates for transgenic improvement of crops. As in Arabidopsis and rice, these tools will prove to be critical in speeding up the genetic improvement of corn.

The complete genome sequences of several plant species are known and the rate at which whole genomes are being sequenced is increasing. Correct annotation of these genomes remains problematic in spite of gene prediction algorithms becoming ever more sophisticated. ESTs, full-length cDNAs and tiling arrays are extremely helpful toward annotating genomes correctly. As more full-length cDNAs become available from different plant species, the accuracy of annotations improves not only for the newly sequenced genomes but also for evolutionarily related species, including those previously annotated.

In the last several years two research groups submitted more than 5,000 full-length corn cDNAs to GenBank. Lai et al. (2004) sequenced the 5′ and 3′ ends of ~13,000 cDNAs from three endosperm-specific libraries. A total of 2,168 unique sequences were overlapped using just the 5′ and 3′ reads, and 992 additional clones were overlapped by primer walking on the 3,400 clones whose paired reads did not overlap. They assembled a unique set of 5,326 non-redundant clones when the transposon-related sequences were eliminated. Interestingly, 22% of these did not match rice sequences, suggesting that they were lost from rice or gained in maize over the last 50 million years. Jia et al. (2006) sequenced 20,000 cDNA clones from PEG-treated corn tissues and obtained 2,073 clones that were scored as full-length. Of the 84 clones that they could align to annotated maize BAC sequences, only 51% were annotated correctly. In addition to these full-length sequences, and others that have been submitted to GenBank by individual researchers, there are 1,923,065 maize Genome Survey Sequences and 1,014,105 ESTs.

Here we present sequences from our corn sequencing program based on 484,032 cDNA clones made from a diversity of libraries. The 5′ ESTs fall into 63,476 clusters. 36,432 cluster representatives, deemed to be full-length and novel at the time of being selected, have been fully sequenced. Within these, we identified 9,951 that are non-redundant and where we have high confidence they are full-length and lacking any errors. All sequences are available in GenBank. We have added to this set 133 cDNA sequences created from publicly available ESTs from GenBank. Analysis of these 10,084 highest quality clones indicates that they can be divided into two groups based solely on their GC content. We found this bimodal distribution in gene catalogs from other grasses, but genes from dicots and other monocots have a unimodal distribution of GC content. High GC content genes in grasses are less homologous to dicot genes than the low GC content genes suggesting a large accumulation of novel gene sequences was associated with the divergence of grasses from other plants over 60 million years ago. In addition, the genes with high GC content have a much smaller number of introns, further suggesting the novel genes may have originated in a non-plant species or from reverse transcription of RNAs.

Methods

Full-length cDNA library construction and sequencing

The tissues used to generate cDNA libraries were derived from a number of hybrids that were available to us at different parts of the growing season. The libraries generated from the Mixed male/female infl., root tissues all came from Pioneer Hi-Bred International, Inc Hybrid 35A19. The rest of the libraries were generated from tissues obtained from several other hybrids. As our goal was to generate unique full-length cDNAs, the tissues and RNAs from various libraries could have been mixed. As such, these ESTs are useful to look at polymorphisms between ESTs but not between various parents except for those in the Mixed male/female infl., root libraries.

For each library, 130 μg of mRNA was ligated with 5′ end oligonucleotide linkers (5′-rGrCrArCrGrArGrArCrCrAUUrArCrCUrArGrArArCrAUrCrCUrArAUrCrGrArArArA-3′, or 5′-rCrGUrCUrCrArCrCrCrCUrArGrArArArArArA-3′ for mRNA isolated from various stress treatments). After ligation, excess free oligonucleotide was removed by column chromatography. Approximately 10 μg of oligonucleotide-ligated mRNA was obtained and used for cDNA synthesis with SuperScript II reverse transcriptase following the instructions of the vendor (Invitrogen, Carlsbad, CA, USA). The oligo-dT primer (5′-GTACGTCTCGAGTTTTTTTTTTTTTTTTTTVN-3′) was annealed to the mRNA. After removal of RNA by alkaline hydrolysis, first-strand cDNA was precipitated using isopropanol to eliminate the excess free primer. Second-strand cDNA was synthesized with Klenow using the 5′-end oligonucleotide linkers 5′-ATCAAGAATTCGCACGAGACCATTACCTAGAACATCCTAATC-3′, or 5′-GATCGTAGAATTCGTCTCACCCCTAGAAA-3′. The quantity of double-stranded DNA was estimated using PicoGreen. Double-stranded cDNA was digested with EcoRI and XhoI and ligated into pBluescript SK+ (Stratagene, CA). Ligated cDNA was transformed into DH10B cells (Invitrogen, Carlsbad, CA).

In addition to the traditional approach for cDNA normalization (Soares et al. 1994; Carninci et al. 2003), RNA/DNA hybridization and double stranded deoxynuclease were employed to remove cDNAs that were already sequenced from existing libraries. The probe, or driver, was either mRNA or cRNA that was in vitro transcribed using T3 RNA polymerase from the plasmid DNA pool. 10 μg of the probe was hybridized to FL first-strand cDNA (Zhulidov et al. 2004) for 4 h. Kamchatka crab duplex-specific nuclease was added to break the DNA strand of the RNA/DNA duplex. The unhybridized intact single-stranded cDNA was used to complete library construction.

A transposon-based full-length sequencing methodology was developed as a high-throughput alternative to primer walking. The main consideration was pooling a large number of cDNAs together in one transposon reaction and then deconvoluting these through sequencing. A 5′ tag was used as an anchor to facilitate the deconvolution. The method was based on the GPS-1 Genome Priming System from New England Biolabs. High quality plasmid DNA from 16 to 32 cDNA plasmids was normalized for concentration and subjected to the TnsABC transposase reaction. During this reaction, a single transposon was inserted into each plasmid. Post-reaction cleanup consisted of ethanol precipitation and elution in water. Each pooled sample was electroporated and grown in SOC media for 1 h at 37°C. Using large square Petri plates containing SOC agar, the cells were plated and incubated for 16–18 h. After colonies were of appropriate size they were picked and moved to 384 well plates for processing. Since transposons insert randomly within a plasmid, all colonies needed to be screened to determine whether the transposon had inserted into the vector backbone or into the cDNA insert. To facilitate this, PCR oligos were designed to amplify the 3 kb vector backbone and a simple size scan was done on agarose gels. If the band size was larger than expected, indicating a transposon within the vector, the sample was discarded. The remainder of the colonies, in which the transposon had inserted into the cDNA, went through a sequencing process where the DNA sequencing reaction was primed off the inserted transposon. For each pool of 16–32 cDNAs, a 384 well plate was sequenced and the resulting reads were clustered by the 5′ tag of the cDNAs that went into the pool.

MegaBACE sequencers (Amersham Pharmacia/Molecular Dynamics) were used at Genset to generate the 5′ ESTs while 377 sequencers (Applied Biosystems) were used for full-length cDNA sequencing. Ceres used 3700 and 3730xl sequencers (Applied Biosystems) for both 5′ ESTs and full-length cDNA sequencing. Quality scores for the Genset sequences are not available making it difficult to identify reliable polymorphisms in the clusters. The type of sequencer used for each sequence is noted in each GenBank entry.

EST clustering

Clustering of 5′-tags was done using the Washington University Blastn program (Gish 1996–2004). Two sequences were clustered together if there were no more than six mismatches in any 30-nucleotide window of their blast alignment and their alignment covered the entire overlapping region. Information about the relationship between selected clones and other 5′-tags, including their relative start positions and other relevant information was stored in an Oracle database (Alexandrov et al. 2006).
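The window criterion above can be illustrated with a short sketch. It assumes the two sequences have already been aligned (gap characters included) and that a gap column counts as a mismatch; the function name and these conventions are illustrative and are not taken from the clustering software itself.

```python
def within_mismatch_limit(aligned_a: str, aligned_b: str,
                          window: int = 30, max_mismatches: int = 6) -> bool:
    """Return True if no `window`-column stretch of the alignment
    contains more than `max_mismatches` mismatching columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 marks a mismatching column (gaps count as mismatches here), 0 a match.
    mismatches = [int(a != b) for a, b in zip(aligned_a, aligned_b)]
    if len(mismatches) <= window:
        return sum(mismatches) <= max_mismatches
    # Slide the window one column at a time, updating the running count.
    current = sum(mismatches[:window])
    if current > max_mismatches:
        return False
    for i in range(window, len(mismatches)):
        current += mismatches[i] - mismatches[i - window]
        if current > max_mismatches:
            return False
    return True
```

Under this reading, two 5′ tags would be placed in the same cluster only if the function returns True and the alignment also spans their entire overlapping region, which is not checked here.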

Gene models

All ESTs, Ceres cDNA, and public mRNA were aligned against the corn genomic sequences available from GenBank using the spliced alignment method (Alexandrov et al. 2006). Alignments with lower than 98% overall identity (i.e., the ratio of matching nucleotides on the transcript to the overall length of the transcript) were discarded. If a transcript matched to more than one location on the genome, only the annotations with the best overall identity were considered for further analysis. The annotations were then checked for inversion. If an overwhelming majority of an annotation’s splice sites were non-canonical, they were compared to canonical splice sites of annotations on the opposing strand. If each non-canonical splice site had a matching canonical splice site on the opposing strand (within three nucleotides), the offending annotation was marked as inverted, and removed from further analysis. Mutually overlapping annotations were grouped into loci such that each annotation inside a locus overlaps with at least one other annotation in the locus.
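The identity filter and the grouping of overlapping annotations into loci can be sketched as follows. This is a simplified illustration that assumes all annotations lie on the same genomic sequence and strand and use closed coordinates; the tuple layout and function name are hypothetical, and the best-location selection and inversion check described above are omitted.

```python
from typing import List, Tuple

# (name, start, end, identity) with closed coordinates; layout is illustrative.
Annotation = Tuple[str, int, int, float]

def group_into_loci(annotations: List[Annotation],
                    min_identity: float = 0.98) -> List[List[Annotation]]:
    """Discard low-identity alignments, then chain overlapping
    annotations into loci with a single left-to-right sweep."""
    kept = sorted((a for a in annotations if a[3] >= min_identity),
                  key=lambda a: (a[1], a[2]))
    loci: List[List[Annotation]] = []
    locus_end = -1
    for ann in kept:
        if loci and ann[1] <= locus_end:
            # Overlaps the current locus: extend it.
            loci[-1].append(ann)
            locus_end = max(locus_end, ann[2])
        else:
            # No overlap with the current locus: start a new one.
            loci.append([ann])
            locus_end = ann[2]
    return loci
```

Because the annotations are sorted by start position, any annotation that starts before the current locus end necessarily overlaps at least one member of that locus, which matches the grouping rule stated above.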

Transcription start site prediction (TSSer algorithm)

We have developed a novel algorithm, called TSSer, to reliably predict positions of transcription start sites. We aligned EST and mRNA sequences against the corresponding genomic sequence using Washington University Blast. Alignments are further refined using the spliced alignment algorithm (Alexandrov et al. 2006). If there are two or more loci matching a transcript sequence, we select the best one based on the identity of the match. Positions of the 5′ ends of sequence alignments are clustered and the most frequent position that does not contradict the ORF prediction for a locus is designated as the best TSS for this locus. TSSer allows more accurate determination of the transcription start site, as compared to the traditional approach of using the longest cDNA for prediction.
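The final selection step can be illustrated with a minimal sketch. It is not the TSSer implementation: it assumes a plus-strand locus, interprets "does not contradict the ORF prediction" as "lies at or upstream of the predicted start codon", and breaks ties in favour of the most upstream position. All three choices, and the function name, are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

def pick_tss(five_prime_ends: Iterable[int], orf_start: int) -> Optional[int]:
    """Return the most frequent 5' end position that lies at or
    upstream of the ORF start, or None if no position qualifies."""
    # Keep only positions compatible with the ORF (assumed rule).
    counts = Counter(p for p in five_prime_ends if p <= orf_start)
    if not counts:
        return None
    best = max(counts.values())
    # Tie-break (assumption): prefer the most upstream position.
    return min(p for p, c in counts.items() if c == best)
```

Choosing the most frequent position rather than the most upstream one is what distinguishes this approach from simply taking the longest cDNA.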

Functional annotation

To obtain the GO annotations of corn proteins, we downloaded the GO annotations of Arabidopsis proteins from the TAIR website as the reference annotations, performed a BLAST similarity search of the protein sequences of our corn clones against the Arabidopsis protein sequences and propagated the GO annotations of the Arabidopsis proteins to the corn clones.
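The propagation step can be sketched as below, assuming the BLAST output has already been reduced to one best Arabidopsis hit per corn clone. The E-value cutoff, the identifiers and the function name are illustrative only; the threshold actually used is not stated here.

```python
from typing import Dict, Set, Tuple

def propagate_go(best_hits: Dict[str, Tuple[str, float]],
                 arabidopsis_go: Dict[str, Set[str]],
                 max_evalue: float = 1e-10) -> Dict[str, Set[str]]:
    """Copy GO terms from each clone's best Arabidopsis hit,
    provided the hit passes the (hypothetical) E-value cutoff."""
    corn_go: Dict[str, Set[str]] = {}
    for corn_id, (ath_id, evalue) in best_hits.items():
        if evalue <= max_evalue and ath_id in arabidopsis_go:
            corn_go[corn_id] = set(arabidopsis_go[ath_id])
    return corn_go
```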

Since the Arabidopsis proteins from TAIR include clone sequences and genome predictions, to make a fair comparison between the two species, we used the expressed Arabidopsis protein sequences. We performed a similar BLAST search procedure to generate the GO annotations. The GO terms used in this annotation procedure are represented as GO slim terms, which are clusters of GO terms with similar functions in a broad sense.

Pfam annotation was performed for the two sets of proteins by blasting the protein sequences against the latest Pfam BLAST database downloaded from the Pfam website (ftp://selab.janelia.org/pub/Pfam/).

Results and discussion

cDNA libraries

Full-length cDNA libraries were prepared from a mixture of floral, root, stem and leaf tissues obtained from Pioneer Hi-Bred International, Inc. Hybrid 35A19, as well as from separate collections of embryo, callus, root, female flower and abiotic stress-induced tissues obtained from various other hybrids (Table 1). Size fractions were generated for some of the cDNA preparations, and normalization and depletion strategies were employed to enrich for novel clones (see Methods). Our depletion strategy involved hybridizing the most abundant clones (generally 5,000 clones) to a batch of cDNAs to minimize their presence and sequencing additional clones from the depleted library. This strategy proved effective at depleting the most frequent clones: histones, thioredoxins and ribosomal proteins from mixed libraries, and seed storage proteins from embryo libraries (data not shown). The number of clones, clusters and good full-length clones sequenced from libraries made from the various tissues and size fractions are summarized in Table 1. For the mixed tissue libraries it is not possible to gain insight into the gene distribution and frequency for the separate tissue types.

Table 1 cDNA libraries generated from corn

5′ sequencing and EST clustering

We sequenced the 5′ end of >600,000 randomly selected clones from a diverse set of corn cDNA libraries. Usually, the 5′ ends of 1,536 random clones from each library were sequenced initially and the sequences were used to assess the proportion of full-length and novel clones. If the library was of sufficient quality (high percentage of full-length and novel clones), 10–20,000 additional clones were sequenced from the 5′ end. Sequences that were of low quality, very short (<50 nucleotides) or suspected to be from a different organism were discarded leaving 484,032 that were further analyzed and submitted to GenBank. Sequences that are derived from species other than the intended organism are quite common in EST sequencing projects. For example, of the 1,187 non-redundant sequences identified by Lai et al. (2004) as having no match to rice sequences, 61 (>5%) were closely related to E. coli sequences.

The 5′ reads were clustered into 63,476 clusters (any two clones that have >6 nt differences in any 30 nt window are clustered apart). The number of clones in each cluster ranged from 1 to 1,350 with an average of 7.64 clones per cluster. The 20 proteins with the largest number of 5′ ESTs are shown in Table 2. Histones and seed storage proteins are the most abundant in the top 20. The latter obviously reflects the abundance of seed storage protein mRNAs in seed storage tissues, in spite of attempts to minimize seed storage tissue in our libraries.

Table 2 The 20 proteins with the largest number of 5′ ESTs

As these libraries were made from hybrids, reliable polymorphisms were observed in about half of the clusters: about one third of these polymorphisms are 1–6 nt insertion/deletions (indels) and the other two thirds are base substitutions (data not shown). If the indels are seven or more nucleotides, we would not identify them in our analysis because those ESTs would form separate clusters. Interestingly, nearly 10% of the clusters with 40 or more ESTs contain one or a few ESTs with a 5′ end that is 60 or more nucleotides longer than the other members of the cluster indicating a possible alternative transcription start site.

Clone sequencing

From each EST cluster, the clone that had the longest 5′ end at the time of selection was used as a cluster representative (Selected Cluster Representative, SCR). Out of 63,476 SCRs, 39,769 clones possessed an ATG translation start site and were not redundant. 15,308 SCRs were overlapped after sequencing from the 3′ end with an average length of 689 nts. We used primer walking in an attempt to overlap the remaining SCRs. After the clones were overlapped they were reanalyzed and the acceptable clones were sequenced on the second strand. We tried to eliminate ambiguities by generating another primer and sequence to get a consensus sequence. At each step, if a clone was determined to be (1) identical to another already sequenced clone in nucleotide or protein sequence, (2) truncated, (3) wrong organism, (4) chimeric, or (5) the clone could not be overlapped, sequencing of the clone was stopped. Using this approach, 35,497 clones were overlapped. We also experimented with transposon sequencing (Strathmann et al. 1991) and overlapped 935 clones with this approach. From these 36,432 clones, we have carefully selected the highest quality non-redundant clones (9,951) which, to the best of our knowledge are free of common errors, such as redundancy, contamination, truncations, frame-shifts and chimerism. Using our EST clustering system we have also overlapped 5′ and 3′ reads of corn cDNA clones from the GenBank EST database and have added 133 full-length clones to our set of high quality corn transcripts. In most of the analyses in this paper we use this set of 10,084 clones. We estimate that there are several thousand additional good full-length corn cDNA clones in our GenBank submission which we did not use in most of the analyses in this paper, because we have less confidence that these clones contain a complete CDS.

Statistical properties of corn cDNAs

The distribution of the number of 5′ ESTs in clusters corresponding to this set of 10,084 clones is shown in Fig. 1 and follows Zipf’s law, which states that the frequency of occurrence of an event is a power-law function of its rank, where rank is determined by frequency (Zipf 1949). Zipf’s law for cDNA clusters may imply that the rate of evolutionary changes in gene expression (assessed by the number of 5′ ESTs in a cluster) is proportional to the gene expression level (Gibrat 1931).
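A power-law relationship of this kind appears as a straight line on log-log axes, so it can be checked with an ordinary least-squares fit of log(cluster size) against log(rank). The sketch below uses only the standard library; the function name is illustrative, and the rank-size formulation is an assumption about how the fit would be set up rather than a description of how Fig. 1 was produced.

```python
import math
from typing import Sequence, Tuple

def fit_power_law(cluster_sizes: Sequence[float]) -> Tuple[float, float]:
    """Fit log10(size) = slope * log10(rank) + intercept by least squares.
    Requires at least two clusters, all with size > 0."""
    ranked = sorted(cluster_sizes, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(size) for size in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A slope close to -1 corresponds to the classical form of Zipf's law; other negative slopes still describe a power law.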

Fig. 1

Distribution of the number of clones (5′ ESTs) among clusters represented by a set of 10,084 full-length corn cDNA clones. Relatively few genes have many ESTs whereas many have only one or few ESTs. The distribution can be approximated by a power function (inset; linear function in log scale)

Length distributions of CDS, 5′ and 3′ UTRs, and comparisons with other full-length corn transcripts from GenBank, revealed that Ceres cDNA clones have longer 5′ UTRs, shorter coding regions and similar size 3′ UTRs (Table 3). A comparison of our set of corn cDNAs to the annotations of the rice and Arabidopsis genomes also showed that Ceres clones are overall shorter. This is likely to be due to our approaches taken in cloning, selecting and sequencing and probably does not reflect a biological difference in the libraries.

Table 3 Median lengths of 5′ UTRs, CDSs, and 3′ UTRs in corn (from Ceres and GenBank), Arabidopsis and rice

Corn transcripts, especially coding regions, are more GC-rich (Table 4) than the overall genome (58% GC in CDS vs. 46% in the genome) which is consistent with previous observations (Haberer et al. 2005). Equivalent percentages for Arabidopsis are 45% and 36% (Alexandrov et al. 2006); 5′ UTRs are C-rich whereas 3′ UTRs are T-rich, similar to Arabidopsis. As in Arabidopsis, most transcripts start with A (Table 4). The consensus sequence around the initiating ATG (Fig. 2) is similar to Arabidopsis in the coding region, in that there is a strong preference for codons GCN that specify alanine as the 2nd amino acid, but different at the 5′ end in that corn is C-rich while Arabidopsis is A-rich (Alexandrov, Troukhan et al. 2006). The most frequently used stop codon is TGA (occurs in 51% of all transcripts) followed by TAG (30%) and TAA (19%). TGA is the most frequently used stop codon in Arabidopsis (44%) and rice (43%), but in Arabidopsis TAA (36%) is more frequent than TAG (20%), whereas similar frequencies of TAG (30%) and TAA (27%) are used in rice.
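Stop codon frequencies such as those quoted above can be tallied directly from a set of coding sequences. The sketch below assumes each CDS is given as a DNA string that ends with its stop codon; sequences ending in anything else are ignored. The function name is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable

STOP_CODONS = ("TAA", "TAG", "TGA")

def stop_codon_usage(cds_list: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of CDSs ending in each stop codon."""
    counts: Counter = Counter()
    for cds in cds_list:
        stop = cds.upper()[-3:]
        if stop in STOP_CODONS:
            counts[stop] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in STOP_CODONS}
```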

Table 4 Nucleotide composition of 10,084 corn transcripts
Fig. 2

Sequence logo for the ATG consensus in corn. The logo is based on 9,920 sequences. The figure was produced using WebLogo tool (Crooks et al. 2004)

Nucleotide distribution in coding regions

Corn coding regions have elevated GC contents compared to the coding regions of Arabidopsis genes as well as compared to non-coding regions in the corn genome. More intriguingly, the GC distribution in the coding region has two peaks (Fig. 3) indicating that there are two major classes of genes in corn. Genes in the first peak have a mean G+C of about 0.5, i.e. there is no preference for G+C. Genes in the second peak have an unusually broad GC frequency distribution in the third position in the codons (GC3) with a peak at about 0.9. In contrast, the genes of Arabidopsis and other dicot species have a unimodal and narrow GC distribution. Analysis of rice genes reveals a broad GC distribution similar to corn (Campbell and Gowri 1990; Wang et al. 2004; Wang and Hickey 2007). GC content for the first two positions in codons (GC12) has a unimodal distribution for all three species, emphasizing that the main difference in GC content is due to the third nucleotide in the codons (Fig. 3). Analysis of other plant genes from GenBank revealed that the bimodal distribution is a characteristic feature of all grasses (Poaceae) for which a sufficient number of genes have been sequenced (Fig. 4).
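The three quantities compared here (overall GC, GC12 and GC3) can be computed from a coding sequence as in the sketch below. It assumes an in-frame CDS whose length is a multiple of three and treats each value as the fraction of G or C among the positions considered; the function name is illustrative.

```python
from typing import Dict

def gc_fractions(cds: str) -> Dict[str, float]:
    """Return overall GC, GC at codon positions 1-2 (GC12) and
    GC at codon position 3 (GC3) for an in-frame coding sequence."""
    seq = cds.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")

    def gc(s: str) -> float:
        return sum(base in "GC" for base in s) / len(s)

    first_two = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    third = seq[2::3]
    return {"GC": gc(seq), "GC12": gc(first_two), "GC3": gc(third)}
```

For the toy sequence ATGGCCGCGTGA this gives GC = 8/12, GC12 = 5/8 and GC3 = 3/4. A histogram of the GC3 values over all genes in a catalog is the kind of plot in which a second peak near 0.9 would show up.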

Fig. 3

Distribution of GC in the coding region of corn, Arabidopsis and rice. The GC content in the coding region of corn cDNAs is bimodal and the high GC content can be explained by the abundance of GC in the third position of the codons. A similar result is observed for rice but Arabidopsis is unimodal. GC indicates the ratio of GC versus AT. GC12 represents the ratio of GC versus AT in the 1st and 2nd positions of the codons. GC3 represents the ratio of GC versus AT in the 3rd position of the codons

Fig. 4

Distribution of the GC content in the third codon position of CDSs of different plant species. All grasses have a broad distribution with two peaks whereas the dicots have a unimodal distribution. All CDS sequences except corn (we used sequences described in this paper) and Arabidopsis (we used TAIR annotation) were downloaded from the J. Craig Venter Institute (JCVI, formerly known as TIGR) ftp site ftp://ftp.tigr.org/pub/data/plantta/. The number of unique transcripts for each species is: switchgrass 7,638, Arabidopsis 27,983, poplar 12,687, canola 10,709, Medicago 20,414, cotton 24,797, corn 10,084, rice 49,870, sorghum 20,714 and wheat 62,121

We have analyzed genes that contribute to the high and low GC peaks. We hypothesized that due to the effects of cytosine methylation and cytosine deamination, “old” genes would tend to be more AT rich in the third position as in Drosophila and mammalian genomes (Petrov and Hartl 1999). Relatively new genes then would be GC-enriched in the third position. Indeed, genes encoding homologous proteins in corn and Arabidopsis as well as in rice and Arabidopsis tend to be in the lower GC peak (Fig. 5). Genes in the higher GC peak are “newer” genes, not present in Arabidopsis. Also, intronless genes are highly enriched in the higher GC peak (Fig. 5). This latter observation is consistent with a massive synthesis of variant genes by reverse transcription of existing gene transcripts, which lack introns, followed by their stabilization in the evolving genome. However, it is clear that genome evolution in plants is driven most often by genome duplication followed by gene loss and/or modification (Cronk 2001). Often genome duplication is achieved by polyploidization, but more rarely it may involve wider hybridizations. Given the large number of new genes with different GC structures in grasses, perhaps the lineage was initiated by a wide hybridization event with another species that had genes with a high GC content, followed by selective gene retention and loss to create today’s Poaceae. The wide hybridization, while most likely to have involved a plant species, could have been prokaryotic or algal, and a prokaryotic origin could explain the higher proportion of intronless Poaceae-specific genes. We tried to find prokaryotes responsible for such an invasion based on GC profiles of gene sequences but could not find clear candidates due to either lack of sequence data in the relevant species or significant divergence of the genes over the more than 100 million years since the dicot/monocot separation. However, we found that the corn genes with high GC content are similar to Myxobacteria genes (Fig. 5). Myxobacteria are soil-dwelling gram-negative bacteria that produce a wide range of secondary metabolites and can inhibit plant pathogenic fungi (Bull et al. 2002). Some algae, e.g. Chlamydomonas, have genomes with high GC codon usage (Merchant et al. 2007) and these could be a source of the novel genes.

Fig. 5

Distribution of the GC content in the third codon position among different groups of rice genes. Only those genes which have mRNA evidence in TIGR version 5 annotation (total 23,721 genes) were considered. GC3 distribution of these genes has two peaks at about 0.5 and 0.9. Genes without introns (4,452) are more prevalent in the high GC3 peak. Genes sharing similarity with Arabidopsis (blast P-value < 1.e-50, best reciprocal hit, 7,924 genes) are mostly in the lower GC3 peak whereas genes (1,664) similar to Myxobacteria (blast P-value < 1.e-3 and not matching Arabidopsis) are mostly in the high GC3 peak. 17,100 known protein sequences of the order Myxococcales from GenBank were used for comparison

The results in Fig. 4 suggest that all Poaceae have high GC gene fractions and that genes are widely distributed with respect to GC content in the third codon position. Also, some dicots such as canola, cotton and Medicago have a higher proportion of genes with high GC content than Arabidopsis and poplar (Fig. 4). This shows that during evolution gene variants with different GC contents are stabilized in genomes to differing degrees. Thus, hybridization between organisms with different GC contents in coding sequences may occur relatively frequently during evolution, with the resulting selective loss and retention producing genomes whose genes differ in codon usage and GC content. The Poaceae would be examples of where such processes have generated a more extreme form. It is not possible at present to choose between the various hypotheses that explain the divergent codon usage, but the different ideas are not mutually exclusive. It is especially important to discover what forces could result in the selection of plants with variant codon usage on such a massive scale. The answers may lie in RNA biology, chromatin control processes, epigenetics and heterosis.

Statistical features of gene structures

We determined gene structure by spliced alignment of full-length cDNA clones and EST sequences with available genomic DNA. This gives us additional information on parts of genes not present in transcripts, i.e. introns and promoters. 2,793 of the 10,084 cDNA sequences were aligned with 2,714 corn genomic sequences from GenBank. Only genomic sequences longer than 20,000 nucleotides were used to avoid additional bias towards genes with a smaller number of introns. 497 (18%) of 2,793 clones consist of a single exon. Distribution of the number of exons in corn genes is similar to the distribution for Arabidopsis genes having full-length cDNAs sequenced using the same technology (Fig. 6). As these sets of genes are enriched for genes with shorter cDNA transcripts (Table 3) we expect some change of this distribution when all corn transcripts are known (as illustrated for the distributions of all Arabidopsis and rice genes in Fig. 6).

Fig. 6

Distribution of the exon number in corn, Arabidopsis and rice genes. 2,793 of 10,084 full-length corn clones were mapped to corn genomic sequences of >20,000 bp to ascertain the number of exons. This subset is biased towards shorter genes, which may overestimate frequencies of genes with a smaller number of exons and underestimate frequencies of genes with a larger number of exons. This effect can be seen in the distributions for Arabidopsis genes: one was obtained using Arabidopsis cDNA clones produced by a similar technology (Arabidopsis cDNA) and the other derived from all genes in the TAIR genome annotation having mRNA support (Arabidopsis all). The distribution of exons in rice genes was obtained from 23,721 genes with mRNA support from TIGR rice genome annotation, release 6, and is shown for comparison

As in rice and Arabidopsis, exon length distribution depends on the type of exons. The longest exons are single exons, followed by the terminal and initial exons. The shortest exons are internal (Table 5). Internal exons are of about the same size in corn, rice and Arabidopsis, but single exons are much shorter in corn most likely because of the biased subset of shorter Ceres cDNA clones used in this comparison. Initial and terminal exons may also be affected by this bias. It has been noted previously that the median intron length is greater in corn and rice genes compared to Arabidopsis genes (Haberer et al. 2005). However, the mode of intron length distribution is the same in all three species (Fig. 7). It also has been reported that first introns are longer than other introns in Arabidopsis (Seoighe et al. 2005) and in other species (Kriventseva et al. 1999). Using our much larger set of genes, we have shown that this also holds for corn (Table 5).

Table 5 Range and median exon and intron lengths in nucleotides
Fig. 7

Intron length distribution in corn, Arabidopsis and rice. Introns in both the coding and non-coding parts of the mRNA were used in this analysis. All three species have similar modes for intron length, although corn and rice genes have longer introns on average

We have previously shown that the average expression of Arabidopsis genes with introns is higher than the expression of intronless genes (Alexandrov et al. 2006). Using the number of 5′ tags as a measure of expression level, we see a similar trend in corn (Fig. 8). This might be explained by the presence of regulatory sequences in introns (Gidekel et al. 1996), by increased mRNA stability or by different epigenetic chromatin structures given the lower GC content of introns.

Fig. 8

Average gene expression increases with the number of introns in genes. The number of 5′ ESTs in each cluster was used to estimate expression. These 5′ ESTs were derived from primary libraries and so reasonably estimate mRNA abundance in the libraries. The greater the number of exons in a gene the greater its expression, as measured by the number of 5′ ESTs

Alternative splicing

Alternative splicing is an important mechanism of gene regulation and adds significantly to the total number of different transcripts and proteins. It is important to understand the frequency and the common types of alternative splicing events. A commonly used measure of alternative splicing is the fraction of genes having alternative transcripts. However, this number depends on the number of available transcript sequences: the more sequences, the greater the chance of observing splice variants. We proposed using a Gini index (Mirkin 1996; Alexandrov et al. 2006) to compare the frequency of alternative splicing in corn and Arabidopsis. The Gini index ranges from 0 to 1 and is equal to 0 when there are no alternative splice variants. Gini is also small when almost all transcripts support the major splice variant and only a few transcripts correspond to the other isoform. Gini is closer to 1 when the various isoforms are supported by about the same number of transcripts. While the majority of genes have a Gini index of 0 in both corn and Arabidopsis (meaning a lack of alternative splicing), there is a higher frequency of corn genes, as compared to Arabidopsis, at every Gini score above 0, indicating that alternative splicing occurs more frequently in corn than in Arabidopsis (Fig. 9). The average Gini index for corn is 0.10, whereas for Arabidopsis it is 0.08.
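The exact form of the Gini index is not spelled out in the text. As a minimal sketch, the example below assumes the common form 1 − Σ p_i², where p_i is the fraction of transcripts supporting isoform i. The function name and the example counts are hypothetical. Under this assumed form the index is 0 for a single isoform and reaches 1 − 1/k for k equally supported isoforms, so it approaches 1 only as the number of isoforms grows; the authors may have used a different normalization.

```python
def gini_index(isoform_counts):
    """Gini index of isoform usage, assumed here to be 1 - sum(p_i ** 2).

    isoform_counts: number of transcripts supporting each splice isoform
    of one gene (or one intron). This formula is an assumption; the text
    does not state the exact definition used.
    """
    total = sum(isoform_counts)
    if total == 0:
        return 0.0  # no transcript evidence, so no observable variation
    return 1.0 - sum((count / total) ** 2 for count in isoform_counts)


# Hypothetical support counts, for illustration only.
print(gini_index([20]))      # a single isoform
print(gini_index([99, 1]))   # one dominant isoform and a rare variant
print(gini_index([10, 10]))  # two equally supported isoforms
```

Because the index weights each isoform by its share of supporting transcripts, a single rare variant in a deeply sequenced gene changes the value only slightly, which is the property that makes it less sensitive to sequencing depth than the fraction of genes with any alternative transcript.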

Fig. 9

Gini index for corn and Arabidopsis introns. 96% of Arabidopsis introns and 92% of corn introns have a Gini index equal to 0 meaning that there are no variants (the data point is not shown). A larger Gini index in corn means that corn transcripts are more variable

Typically, alternative splicing events are classified as one of four major types: intron retention, alternative acceptor, alternative donor or skipped exon (Wang and Brendel 2006). To compute the relative frequencies of these events, we compared the transcripts within a cluster and assigned every difference in splicing pattern to one of these types. Not all types of alternative splicing are symmetrical. For example, in the case of intron retention, we see retention in transcript A when it is compared to transcript B, but when we compare B with A, retention is not observed; instead we see intron addition. The same is true for exon skipping. Alternative donors and alternative acceptors, however, show up in both directions of the comparison: A to B and B to A. Thus, comparing a pair of transcripts, we should count two alternative donors or acceptors, one case of intron retention (and intron addition) or one case of exon skipping (and exon addition). However, in our calculations, as in previous publications (Wang and Brendel 2006), we counted all events only once per transcript pair, combining intron retention with intron addition and exon skipping with exon addition. We compared all possible pairs of relevant transcripts. We found that the most common types of alternative splicing in corn, as in other plants, are alternative acceptor sites and intron retention. It is interesting that the relative frequency of these events is somewhat related to the number of transcripts in the cluster: with a lower number of transcripts, the most common event is intron retention, while for genes with a larger number of transcripts, the alternative acceptor site is more frequent (Fig. 10).
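The pairwise comparison described above can be sketched as follows. This is an illustrative simplification, not the pipeline used in this work: each transcript is represented only by its list of introns as hypothetical (donor, acceptor) genomic coordinates on the forward strand, both transcripts are assumed to span the compared region, and each event type is reported at most once per transcript pair, so that retention and addition (or skipping and exon addition) are combined.

```python
def classify_pair(introns_a, introns_b):
    """Return the set of splicing event types that distinguish two transcripts.

    introns_a, introns_b: lists of (donor, acceptor) coordinates with
    donor < acceptor. Assumes both transcripts cover the whole region.
    """
    all_a, all_b = set(introns_a), set(introns_b)
    a_only, b_only = all_a - all_b, all_b - all_a
    events, explained = set(), set()

    # Exon skipping: one intron in a transcript spans two or more introns
    # of the other, sharing the outermost donor and acceptor.
    for spanning, inner_pool in ((a_only, b_only), (b_only, a_only)):
        for donor, acceptor in spanning:
            inner = [i for i in inner_pool if donor <= i[0] and i[1] <= acceptor]
            if (len(inner) >= 2
                    and any(d == donor for d, _ in inner)
                    and any(a == acceptor for _, a in inner)):
                events.add("exon_skipping")
                explained.add((donor, acceptor))
                explained.update(inner)

    # Remaining differences: shared donor, shared acceptor, or no overlap.
    for mine, other in ((a_only, all_b), (b_only, all_a)):
        for donor, acceptor in mine - explained:
            if any(d == donor for d, _ in other):
                events.add("alternative_acceptor")
            elif any(a == acceptor for _, a in other):
                events.add("alternative_donor")
            elif not any(d < acceptor and donor < a for d, a in other):
                events.add("intron_retention")
    return events


# Hypothetical coordinates, for illustration only.
print(classify_pair([(100, 200)], [(100, 210)]))
print(classify_pair([(100, 400)], [(100, 200), (300, 400)]))
```

Collecting event types in a set mirrors the counting rule in the text: an alternative acceptor seen in both directions of the comparison still contributes a single event for the pair.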

Fig. 10

Relative frequencies of different types of alternative splicing events. The frequencies of different alternative splicing events were computed from the alignment of 563,251 transcripts with corn genomic sequences. Of these mapped transcripts, 289,608 are from Ceres libraries; the other 273,643 transcript sequences were downloaded from GenBank

Motifs in corn promoters

Promoter sequences are enriched with transcription factor binding sites. Detection of these motifs upstream of the 5′ ends of cDNA clones mapped to the corn genomic sequences confirms that our clones are indeed 5′-full-length clones. Not surprisingly, the most frequently occurring motifs appeared to be the TATA-box about 30 nucleotides upstream of the TSS and a three-letter motif exactly at the TSS. The nucleotide distribution around the TSS has a peak of A/T at position −30 and a peak of C/A just before the TSS (Fig. 11). We estimated the statistical significance of different motifs using the approximation suggested by Waterman et al. (Galas et al. 1985). The significance level p of the word w is estimated as \( p = \exp ( - nH(\beta ,\alpha ))/p_{0} (w) \), where β is the fraction of sequences that contain the word w and \( \alpha = p_{0} (w) \) is the expected fraction of such sequences. The entropy of β relative to α is defined as \( H(\beta ,\alpha ) = \beta \ln \left( {\frac{\beta }{\alpha }} \right) + (1 - \beta )\ln \left( {\frac{1 - \beta }{1 - \alpha }} \right) \). Analyses of the promoter regions of corn and Arabidopsis showed that they generally have two highly pronounced peaks of significance: one at −30 bp, corresponding to the four-nucleotide pattern TATA, and one at the TSS, corresponding to the two-letter word CA. Interestingly, the significance of the CA motif in corn promoters appears to be much larger than that of the TATA box, whereas in Arabidopsis promoters the significance of these motifs is about the same (Fig. 12).
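The significance estimate can be evaluated numerically as in the sketch below, which follows the two formulas exactly as written in the text. The symbol n is not defined in the text; here it is assumed to be the number of promoter sequences examined. The function names and the example values are hypothetical. The result is returned as ln p because exp(−nH) underflows for large n.

```python
import math


def relative_entropy(beta, alpha):
    """H(beta, alpha) = beta ln(beta/alpha) + (1 - beta) ln((1 - beta)/(1 - alpha))."""
    h = 0.0
    if beta > 0.0:
        h += beta * math.log(beta / alpha)
    if beta < 1.0:
        h += (1.0 - beta) * math.log((1.0 - beta) / (1.0 - alpha))
    return h


def log_significance(n, beta, alpha):
    """Natural log of p = exp(-n H(beta, alpha)) / p0(w), with alpha = p0(w).

    n is ASSUMED to be the number of sequences examined.
    """
    return -n * relative_entropy(beta, alpha) - math.log(alpha)


# Hypothetical example: a word expected in 5% of 1,000 promoters at a
# given position but observed in 20% of them.
print(log_significance(n=1000, beta=0.20, alpha=0.05))
```

The more the observed fraction β exceeds the expected fraction α, the larger the relative entropy and the smaller (more significant) the estimated p.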

Fig. 11

Distribution of nucleotides around the Transcription Start Site of corn based on 5,200 promoters that have TSSs predicted by at least four 5′ ESTs. There is a peak of A/T at position -30 and a peak of C/A just prior to the TSS

Fig. 12

The most significant words in corn (a) and in Arabidopsis (b) promoters. The analysis was performed on a subset of 5,200 corn promoters and 5,050 Arabidopsis promoters that have TSSs predicted by at least four 5′ ESTs. For corn, there is a prominent CA peak at the TSS and a smaller TATA motif at position −30. This is in sharp contrast to Arabidopsis, where TATA is more frequent than CA

The importance of the TATA box for variability of gene expression has been shown by several groups for various organisms (Tirosh et al. 2006). It is not clear, however, if the TATA box is important for strong gene expression. We have divided the promoters into five groups based on their expression level (measured by the number of ESTs in the cluster) and compared frequencies of TATA box containing promoters in each group. We found that the TATA box features in both strong and weak promoters but not as frequently in medium expressed genes. This observation is also true for Arabidopsis promoters, although the peak abundance of TATA boxes for corn is within the strongest promoters and for Arabidopsis is within the weakest promoters (Fig. 13).

Fig. 13

Frequency of a TATA box in promoters of different strengths. Strong and weak promoters have a TATA box more often than genes with average expression. In corn, TATA boxes are more frequent in stronger genes whereas in Arabidopsis TATA boxes are more frequent in weaker promoters

We found that GC-rich corn genes have a TATA box more often than genes in the GC-poor peak, namely, only 17% of genes with GC3 < 60% contain a TATA-box as compared to 42% of genes with GC3 > 80%. This observation led us to consider transcriptional differences between GC3-rich and other genes in rice for which microarray chip data are available. We obtained a list of rice microarray experiments from NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6893 (Jain et al. 2007) and GSE4438 (Walia et al. 2007)). For each probe on an Oryza sativa 50K Affymetrix GeneChip Rice Genome Array (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2025) we computed standard deviation of the logarithm of intensities. Genes with GC3 > 80% have an average standard deviation of 0.6, while genes with GC3 < 60% have significantly smaller standard deviation (0.51). These observations imply that genes with a high GC content are on average expressed at more variable levels in different cell types or growth conditions. This more variable level of expression correlates statistically significantly with the presence of a TATA box.
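The per-probe variability measure used above can be sketched as follows. The text does not state the base of the logarithm or whether a sample or population standard deviation was used; this sketch assumes log2 and the sample standard deviation. The function names and the intensity values are hypothetical.

```python
import math
import statistics


def log_intensity_sd(intensities):
    """Sample standard deviation of log2 intensities for one probe.

    intensities: positive signal values of the probe across experiments.
    The log base and the use of the sample SD are assumptions.
    """
    return statistics.stdev(math.log2(value) for value in intensities)


def mean_sd(probes):
    """Average the per-probe standard deviations over a group of genes."""
    return statistics.mean(log_intensity_sd(values) for values in probes)


# Hypothetical intensities for two probes across four experiments.
steady_probe = [100.0, 110.0, 95.0, 105.0]
variable_probe = [20.0, 400.0, 60.0, 900.0]
print(log_intensity_sd(steady_probe), log_intensity_sd(variable_probe))
```

Working on the logarithm of intensity makes the measure reflect fold changes rather than absolute signal, so strongly and weakly expressed genes can be compared on the same scale.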

CG skew around transcription start sites

A peak in the cumulative CG skew (\( CG_{\text{skew}} = \frac{\# C - \# G}{\# C + \# G} \)) is associated with transcription-coupled effects in the DNA template strand and with the location of the replication origin in bacteria (Beletskii and Bhagwat 1996; Grigoriev 1998). Previously, these ideas were extended to explain the CG skew peak near the TSS in Arabidopsis (Tatarinova et al. 2003). An analysis of the CG skew at the TSS of several eukaryotic genomes was reported by Fujimori et al. (2005). Using our collection of 5′ ESTs for corn and genomic DNA from GenBank and TIGR, we predicted TSSs for corn with the TSSer algorithm (see Methods). We computed the CG skew plot for 5,200 promoters having at least four supporting 5′ ESTs as evidence. Figure 14 shows the skew present in corn promoters, which is similar to the skew previously observed in Arabidopsis (Tatarinova et al. 2003).
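A minimal sketch of the skew calculation is given below. It applies the CG skew formula above in a sliding window of 40 nucleotides (the window size stated in the legend of Fig. 14) and averages the resulting profiles over promoters of equal length aligned at the TSS. The function names, the handling of windows without C or G, and the toy sequences are assumptions made for illustration.

```python
def cg_skew(seq):
    """(#C - #G) / (#C + #G) for one sequence; 0.0 if it has no C or G."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    return (c - g) / (c + g) if (c + g) else 0.0


def sliding_cg_skew(seq, window=40):
    """CG skew in every window of the given size along the sequence."""
    return [cg_skew(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def mean_profile(promoters, window=40):
    """Average the sliding-window skew over aligned, equal-length promoters."""
    profiles = [sliding_cg_skew(seq, window) for seq in promoters]
    return [sum(column) / len(column) for column in zip(*profiles)]


# Toy sequences with a short window, for illustration only.
print(sliding_cg_skew("CCGG", window=2))
print(mean_profile(["CCGG", "GGCC"], window=2))
```

Averaging position by position across many promoters suppresses sequence-specific noise, so a consistent strand asymmetry around the TSS appears as a peak in the mean profile.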

Fig. 14

CG skew plot for corn TSSs calculated as average CG skew in a sliding window of 40 nucleotides. The CG skew observed for corn is similar to what we have previously observed for Arabidopsis

Similarity of corn proteins to those of rice and Arabidopsis

We have compared 10,084 corn proteins with those of rice (Ouyang et al. 2007) and Arabidopsis (Swarbreck et al. 2007) derived from the genome annotations. As we expected, rice proteins are more similar to corn proteins than Arabidopsis proteins are (Fig. 15). 9,514, or 94%, of corn clones have a match with 7,137 distinct rice genes (only the best rice match was counted for each corn protein) with blastp P-value ≤ 1.e-10, and 8,932 (88%) have a match with 5,853 distinct Arabidopsis genes (only the best Arabidopsis match was counted for each corn protein). Further examination of the 570 corn cDNAs not matching rice proteins revealed that 192 cDNAs have a strong match with the rice genome, indicating missed predictions in the rice genome annotation.

Fig. 15

Corn proteins are more similar to rice than to Arabidopsis. The few exceptions are due to genes missed in the rice annotation, random fluctuations and possible contamination of corn cDNA clones by cDNAs from other organisms. 10,084 corn proteins, TAIR Arabidopsis genome annotation and TIGR rice annotation were used for comparison. Only matches with P-value ≤ 1.e-10, covering at least 70% of the protein length are shown in the plot

Protein functional characterization

Table 6 lists plant Gene Ontology (GO slim) terms (Berardini et al. 2004) associated with the 10,084 corn full-length cDNA clones. We have compared the frequencies of GO terms in this set of corn genes with the distribution of GO terms among all expressed proteins in the Arabidopsis genome. The corn set contains many more genes belonging to the “structural molecule activity” category than one would expect from a comparison with the Arabidopsis genome, while genes in the “transferase activity” category are underrepresented. The top 10 Pfam families (Finn et al. 2007) are listed in Table 7. The results are consistent with the GO annotation: the most significant Pfam family is histone, which belongs to the overrepresented “nucleic acid binding” GO group.

Table 6 Classification of the 10,084 full-length clones by GO slim categories
Table 7 Top ten Pfam families

Non-coding RNAs

In the course of examining the complete set of 63,476 clusters (SCRs), we assessed the abundance of potentially functional non-coding RNAs that might be contained within the cloned cDNAs by comparing the set to the RNA families maintained at RFAM (Griffiths-Jones et al. 2005). Table 8 summarizes the different types of RNAs that were found to be present. The most abundant species were spliceosomal RNAs, followed by tRNAs and small nucleolar RNAs (snoRNAs), which are involved in modifications of other RNAs, usually ribosomal RNAs. Self-splicing introns (both Groups I and II) were also identified in the set; these are usually found in mitochondrial or chloroplast genes but not in nuclear genes. Closer examination of the nine transcripts containing self-splicing introns indicates that six of these are encoded within chloroplast DNA, two are encoded within mitochondrial DNA, and one can be found in both mitochondrial and chloroplast DNA.

Table 8 Non-coding RNAs in our collection of cDNA clones

Six cDNAs were identified that contained regulatory RNAs. Two copies of RNA-OUT were identified; this RNA is the antisense to RNA-IN and inhibits transposition of the IS10 element (Kittle et al. 1989). Two copies of SnoRD14, also called U14, were identified; this RNA is involved in the processing of rRNA (Samarsky et al. 1996). Two mir elements (160 and 166) were also identified. Eighty-five of the 97 non-coding RNAs identified are contained within a transcript that is at least 30% longer than the corresponding non-coding RNA from RFAM, suggesting that the non-coding RNA is contained within a larger gene, although explanations based on chimeric clones are also possible.

Estimation of the number of protein-coding genes in corn

The total number of corn transcripts can be estimated from the number of matching sequences in two independent sets of transcripts. A similar approach was used to estimate the number of genes in the human genome (Ewing and Green 2000). If N is the total number of genes in the genome, the first set contains \( n_{1} \) randomly selected genes and the second set contains \( n_{2} \) independently selected genes, then the expected number m of genes shared by the two sets can be calculated as \( m = \left( {n_{1} /N} \right)\left( {n_{2} /N} \right)N = n_{1} n_{2} /N \); hence \( N = n_{1} n_{2} /m \). We compared our set of 31,552 transcripts (a part of the 36,565 fully sequenced cDNA clones longer than 100 nucleotides) with a non-redundant set of 10,562 corn mRNAs longer than 100 nucleotides from GenBank. We identified 6,753 transcripts that are the same in these two sets. Thus the total number of transcripts in the corn genome can be estimated as 31,552 × 10,562/6,753 ≈ 50,000. This number is consistent with the previous estimate of the gene count in maize (from 42,000 to 56,000 genes) obtained by sampling BAC end sequences (Haberer et al. 2005).
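The arithmetic of this estimate can be reproduced with the short sketch below, which implements N = n1 n2/m using the counts given in the text. The function name is hypothetical, and the estimate rests on the stated assumption that the two sets are independent random samples of the same gene pool.

```python
def estimate_total(n1, n2, shared):
    """Estimate the total number of genes as N = n1 * n2 / m.

    n1, n2: sizes of the two independently sampled sets.
    shared: number of genes (m) found in both sets.
    """
    if shared <= 0:
        raise ValueError("at least one shared gene is needed for an estimate")
    return n1 * n2 / shared


# Counts from the text: 31,552 Ceres transcripts, 10,562 GenBank mRNAs,
# and 6,753 transcripts common to both sets.
estimate = estimate_total(31552, 10562, 6753)
print(round(estimate))
```

The unrounded result is a little under 49,400, which the text reports as approximately 50,000. If the two sets preferentially sample the same highly expressed genes, the overlap is inflated and the total is underestimated.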