Background

Peanut (Arachis hypogaea L.) is an economically important crop for oil production and a nutritious food for human consumption. However, aflatoxin contamination caused by Aspergillus fungi is a great concern in peanut production worldwide. Aflatoxins are among the most toxic and carcinogenic compounds known and are associated with both acute and chronic toxicity in animals and humans [1, 2]. Both drought stress and high geocarposphere temperature during the latter part of the growing season compromise peanut defense against fungal invasion and exacerbate aflatoxin formation in the seeds [3–6]. Drought stress, extreme temperature or fungal infection can also impair plant growth and yield performance. The development of adapted peanut germplasm and cultivars with improved host-plant resistance is one of our main research objectives.

Resistance to several pathogens is known in peanut [7], indicating that peanuts have evolved a series of defense mechanisms against invasion by plant pathogens. A better understanding of the molecular mechanisms of resistance to Aspergillus colonization will aid in designing strategies to develop new resistant peanut cultivars. The availability of genomic tools and bioinformatics software will significantly improve our understanding of the genetic mechanisms of host-plant resistance and facilitate the genetic improvement of cultivated peanut. Genomic research can also be used to discover novel genes with potential resistance and to develop molecular markers for use in marker-assisted selection. Recently, some genes and proteins associated with A. parasiticus and/or drought stress were identified and studied using genomic and proteomic tools [8–12]. With the completion of the rice and Arabidopsis whole-genome sequencing projects, a vast amount of valuable data has been generated to facilitate cross-species genome comparison in the plant kingdom. The peanut genome (2,800 Mb/1C) is significantly larger than those of the currently sequenced plants [13], such as Arabidopsis (128 Mb), rice (420 Mb), and Medicago (500 Mb) [14, 15]. The cost makes it unrealistic to sequence the whole peanut genome in the near future. Therefore, peanut Expressed Sequence Tags (ESTs) are a cost-effective means to identify important peanut genes involved in defense against fungal invasion and to study gene expression patterns as well as genetic regulation [16, 17].

Expressed Sequence Tag (EST) sequencing is an effective genomic approach for rapid identification of expressed genes, and has been widely used in genome-wide gene expression studies in various tissues, at various developmental stages or under different environmental conditions [18–21]. In addition, the availability of cDNA sequences has accelerated further molecular characterization of genes of interest and provided sequence information for microarray construction and genome annotation [11, 22–25]. As of March 23, 2007, large numbers of ESTs from the top five plant species, Arabidopsis (1,276,131), rice (1,211,154), maize (1,161,193), wheat (855,272) and barley (437,728), had been deposited in the GenBank database (dbEST release 032307) [26]. These sequences provide opportunities to accelerate the understanding of the genetic mechanisms that control plant growth and responses to the environment. In contrast, there were only 19,790 Arachis ESTs deposited in GenBank, among which 13,526 were derived from cultivated peanut A. hypogaea and the remaining 6,264 from the wild species A. stenosperma. These ESTs, submitted by different peanut researchers, were from different tissues subjected to different abiotic and biotic stresses [11, 27, 28].

In this report, large-scale cDNA sequencing was carried out with two goals: comparing gene expression between two genotypes, 'Tifrunner' and 'GT-C20', and providing a genomic resource for the discovery and understanding of novel defense-related genes involved in resistance to Aspergillus colonization and drought stress. To increase gene diversity in the EST population and the probability of identifying genes associated with drought tolerance and disease resistance, different cDNA libraries were prepared from developing seeds at late reproductive stages of a resistant and a susceptible peanut genotype challenged by A. parasiticus and drought stress. Six libraries were constructed, yielding a total of 21,777 high-quality EST sequences, from which 8,689 unique sequences were identified. To provide useful information on the expression profiles of resistance genes at various seed developmental stages and to offer a valuable genomic resource for peanut functional genomics, an extensive analysis of these ESTs was performed using a variety of computational approaches. A functional catalog of expressed genes is reported here, together with a preliminary view of their expression profiles in developing seeds at different developmental stages. This functional catalog seeks to link genes and pathways, and to provide a list of features that could aid in understanding how resistance genes are involved in the response to biotic and abiotic challenges and how their expression is regulated.

Results

Generation of ESTs from developing seeds challenged by A. parasiticus and drought stress

Six cDNA libraries were constructed from developing seeds of two varieties ('GT-C20' and 'Tifrunner') collected at three reproductive stages (R5, R6 and R7) after challenge by A. parasiticus and drought stress. From the six cDNA libraries, a total of 24,290 clones were randomly selected, sequenced and analyzed using Sequencher software. Vector sequences were trimmed from the raw sequence reads and low-quality sequences (shorter than 100 bp) were removed. A total of 21,777 high-quality EST sequences (about 86%) were generated from the 24,290 clones. A total of 8,672 ESTs were generated from 'GT-C20' and 12,426 from 'Tifrunner' (Table 1). The percentage of acceptable-quality EST sequences from individual libraries varied from 81% to 88%. The average length of the ESTs was 411 bp, ranging from 114 to 933 bp (Fig. 1). Together, the ESTs amount to 8.7 Mb of peanut sequence. The high-quality ESTs from both genotypes and all three stages were further assembled into 8,689 unique ESTs, of which 6,948 were singletons and 1,741 were tentative consensus sequences (TCs). The 21,777 ESTs have been deposited in the NCBI GenBank database with accession numbers ES702769 to ES724546.

Table 1 Summary of EST sequences, contigs, and singletons in six libraries from 'GT-C20' and 'Tifrunner'
Figure 1 The length of trimmed EST sequences (cDNA length after removal of vector and low-quality sequences) submitted to clustering. The number of ESTs within each category of trimmed sequence length is presented on the Y-axis. The numbers on the X-axis represent ranges of trimmed sequence lengths (101–200, 201–300, 301–400 bp, etc.).

Overlapping of unique EST sequences and high redundancy of genes

A comparison of unique EST sequences from the two genotypes and different stages of developing seeds allows the identification of common and unique sets of expressed genes among the six libraries. The unique ESTs from the six libraries are summarized in Table 1. A total of 1,825, 681, 685, 3,107, 1,768 and 622 unique sequences were present in the C20R5, C20R6, C20R7, TFR5, TFR6 and TFR7 libraries, respectively. The distribution and overlap of these unique EST sequences are shown in Figure 2.

Figure 3 Hierarchical clustering analysis of differentially expressed transcripts for 'GT-C20' and 'Tifrunner'. TCs with R > 4 (84 in total) were used for hierarchical clustering analysis.

Among the unique ESTs from the C20R5, C20R6 and C20R7 libraries, only 96 ESTs (3%) were common to all three libraries (Fig. 2A). The proportion of ESTs common to any two libraries varied from 10.9% to 34.3%. When the same analysis was applied to the ESTs from TFR5, TFR6 and TFR7, similar results were obtained (Fig. 2B). About 3.4% of the ESTs were common to all three 'Tifrunner' libraries, similar to the value for 'GT-C20'. There were 364 (8%) ESTs common to the TFR5 and TFR6 libraries, 120 (2.6%) common to TFR5 and TFR7, and 37 (0.7%) common to TFR6 and TFR7. To investigate differential gene expression between the resistant and susceptible genotypes, we also performed a comparative analysis between the 'GT-C20' and 'Tifrunner' libraries at each seed developmental stage. A total of 591 (11.74%), 197 (8.04%) and 152 (11.65%) genes were common to 'GT-C20' and 'Tifrunner' at R5, R6 and R7, respectively (Fig. 2C, 2D and 2E). These results indicate that the differences in transcript abundance might reflect genuine differences in gene expression among the libraries. These variations may be due to differences in disease resistance, tolerance to abiotic stress or other genetic factors at the different developmental stages.

Figure 2 Overlapping of unique peanut EST sequences. A: Common and unique sets of expressed genes among the three 'GT-C20' libraries; B: Common and unique sets of expressed genes among the three 'Tifrunner' libraries; C: Common and unique sets of expressed genes between 'GT-C20' and 'Tifrunner' libraries at developmental stage R5; D: Common and unique sets of expressed genes between 'GT-C20' and 'Tifrunner' libraries at developmental stage R6; E: Common and unique sets of expressed genes between 'GT-C20' and 'Tifrunner' libraries at developmental stage R7. The numbers in parentheses are the numbers of clones assembled into unique ESTs.

Genes shared between or among the libraries included highly expressed transcripts. To further investigate the highly abundant transcripts, all six libraries were analyzed, clustered and assembled separately by genotype. Highly expressed genes (TCs) assembled from more than 20 individual ESTs are listed in Table 2 for the 'GT-C20' libraries (C20R5, C20R6 and C20R7) and in Table 3 for the 'Tifrunner' libraries (TFR5, TFR6 and TFR7). A total of 8,672 ESTs from the 'GT-C20' and 12,426 ESTs from the 'Tifrunner' non-normalized libraries were assembled into 599 and 1,119 TCs, respectively. Of these, 27 'GT-C20' and 36 'Tifrunner' highly expressed transcripts, each assembled from more than 20 individual ESTs, were selected for distribution analysis (Tables 2 and 3). These TCs were queried against the GenBank non-redundant protein database (nr) to search for their putative functions. The BLAST results showed that all the highly expressed genes (TCs) were homologous to known sequences in the GenBank database (Tables 2 and 3). There were 31 highly expressed genes identified by BLAST search as having the same putative function in both the 'GT-C20' and 'Tifrunner' libraries. These highly expressed genes encode constitutive proteins such as allergen protein (C20Contig14 and TFContig8 for iso-Ara h3) (Guo et al., unpublished data), storage proteins (C20Contig51 and TFContig31 for 2S protein 1), structural protein (C20Contig75 and TFContig44 for glycine-rich cell wall structural protein precursor), and stress-resistance associated proteins (C20Contig33 and TFContig29 for desiccation-related protein PCC13-62 precursor).

Table 2 Gene expression frequency and BLAST results of the unique ESTs assembled from more than 20 consensus ESTs in the C20R5, C20R6 and C20R7 libraries
Table 3 Gene expression frequency and BLAST results of the unique ESTs assembled from more than 20 consensus ESTs in the TFR5, TFR6 and TFR7 libraries

Functional classification of unique EST sequences

To further characterize the putative functions of the unique ESTs and their involvement in different biological processes, a similarity search against the MIPS Arabidopsis thaliana Database was performed. According to the MIPS Functional Catalogue criteria, 'GT-C20' unique sequences whose functions could be predicted from similarity to Arabidopsis proteins with an E value of ≤ 1e-5 were classified into twenty-two categories (Fig. 4A) [29, 30]. The same analytic procedure was applied to the 'Tifrunner' unique ESTs (Fig. 4B), and those with significant protein homology were also sorted into 22 groups. These results suggest that the genes represented by these unique EST sequences may play roles in different biological processes.

Figure 4 Functional classification of peanut unique ESTs by comparison to Arabidopsis Sequencing Project functional categories. A: functional categories of 'GT-C20' unique EST sequences; B: functional categories of 'Tifrunner' unique ESTs.

The functional classification showed that the unknown genes, including those that had no hits or low identity (less than 95%) with the Arabidopsis protein database and those that matched unclassified and unknown proteins, represented the largest set of genes (33.33% and 34.42% for 'GT-C20' and 'Tifrunner', respectively). The second largest proportion of genes was found to participate in metabolism. Genes related to resistance and to interaction with the environment accounted for 2.6% and 2.46% of genes in 'GT-C20' and 'Tifrunner', respectively (Fig. 4A and 4B). These results indicate that it may be possible to discover novel genes involved in biotic and abiotic responses using the EST profiling strategy.

Expression profiles of cDNA from different genotypes at different developmental stages

Because the libraries were constructed without normalization or subtraction, the number of cDNA clones (or sequenced ESTs) for a given gene reflects the abundance of that gene's transcripts at the corresponding developmental stage. The numbers of ESTs assembled into a unique gene at the three developmental stages may therefore represent the temporal expression pattern of this gene. Thus, the temporal expression profile of a gene can be deduced by comparing EST frequencies across developmental stages, while differences in expression between genotypes can be assessed by comparing EST frequencies between the genotypes. Because the absolute EST counts vary among libraries (Table 1), a meaningful measure of expression profile similarity must be independent of these absolute numbers. To test whether ESTs were distributed independently of library, the relative abundance statistic R (Stekel et al. 2000) was used to identify the most significant differences in EST abundance for each TC among the libraries. A statistically significant unequal distribution of specific ESTs among the libraries implies that these ESTs were expressed at a higher level in some libraries than in others. To limit the analysis to genes that were differentially expressed at different developmental stages, only TCs with an R value larger than 4 were used for hierarchical clustering analysis. This R value provided an 82.2% true positive rate [31]. Using the cutoff threshold of R > 4, 37 TCs from the 'GT-C20' libraries and 47 from the 'Tifrunner' libraries were selected and searched against the GenBank non-redundant protein database (nr) (Tables 4 and 5).
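
As an illustration of the R statistic, the following minimal Python sketch computes R for a single TC, assuming the form given by Stekel et al. (2000): R = Σ x_i ln(x_i / (N_i f)), where x_i is the number of ESTs of the TC in library i, N_i is the total number of ESTs in library i, and f = Σ x_i / Σ N_i. The function name `stekel_r`, the use of the natural logarithm, and the example counts and library sizes are assumptions made for illustration; they are not values from Table 1, and this is not the code used in this study.

```python
import math


def stekel_r(counts, library_sizes):
    """Return the R statistic for one TC (assumed Stekel et al. 2000 form).

    counts        -- ESTs of this TC observed in each library (x_i)
    library_sizes -- total ESTs sequenced in each library (N_i)
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    total = sum(counts)
    if total == 0:
        return 0.0
    # Frequency of the TC expected if it were equally abundant in all libraries.
    f = total / sum(library_sizes)
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a zero count contributes nothing to the sum
            r += x * math.log(x / (n * f))
    return r


# Hypothetical example: three libraries of different sizes.
sizes = [3000, 2500, 2000]
even = stekel_r([30, 25, 20], sizes)   # counts proportional to library size
skewed = stekel_r([40, 3, 1], sizes)   # counts concentrated in one library
selected = skewed > 4                  # cutoff used in this study
```

A TC whose ESTs are spread in proportion to library size gives R close to 0, whereas concentration in one library increases R, which is why a threshold such as R > 4 selects unevenly distributed TCs.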

Table 4 Top hits of C20 unique EST sequences with R > 4
Table 5 Top hits of TF unique EST sequence with R > 4

Based on the abundance and the R statistic, a clustering analysis was performed to assess the relatedness of the libraries in terms of gene expression profiles. As described by Ewing et al. (1999) [32], we compiled the 84 TCs into a matrix of the frequency of ESTs corresponding to each contig in each library, representing the different seed developmental stages, and performed hierarchical clustering analysis. The 84 TCs, with different levels of redundancy and similar expression patterns, could be grouped into eight major clusters, A to H, as shown in Figure 3. Each cluster represents a different expression profile. Hierarchical clustering analysis showed that most of the highly abundant genes with the same putative functions from the 'GT-C20' and 'Tifrunner' libraries could be grouped into the same cluster. These genes usually encode constitutive proteins (such as arachin, conglutin and oleosin), and their expression patterns are not genotype dependent. Some putative resistance-related genes, such as PR10 protein and defensin 2.1 precursors, were found only in 'GT-C20' and were up-regulated (Fig. 3).

The results of hierarchical clustering and similarity search indicated that the 84 unique ESTs (R > 4) with similar DNA sequence were not equally distributed between the 'GT-C20' and 'Tifrunner' libraries. In comparison, only 32 unique ESTs (R > 4) were not equally distributed among the different 'GT-C20' libraries (Table 4 and Fig. 3). Seven, ten and eight unique TCs were observed in the C20R5, C20R6 and C20R7 libraries, respectively. Three unique TCs (C20Contig40 for allergen Ara1, C20Contig48 for arachin 6 and C20Contig37 for arachin Ahy-1) were observed in both the C20R5 and C20R6 libraries. Three unique EST contigs (C20Contig35 for conglutin precursor, C20Contig52 for conglutin and C20Contig86 for gibberellin 2-oxidase) were primarily found in the C20R5 and C20R7 libraries. Only one unique EST (C20Contig62 for Ca2+-binding EF hand protein) had cDNA clones represented only in the C20R6 and C20R7 libraries. Four unique ESTs (C20Contig14 for iso-Ara h3, C20Contig19 for seed storage protein SSP1, C20Contig65 for 2S protein 2 and C20Contig51 for 2S protein 1) had cDNA clones equally distributed across the three 'GT-C20' libraries.

In the three 'Tifrunner' libraries, there were 38 unique ESTs (R > 4) whose cDNA clones were not equally distributed (Table 5 and Fig. 3). Fourteen, five and seven unique EST sequences were observed in the TFR5, TFR6 and TFR7 libraries, respectively. Six unique ESTs were observed only in TFR5 and TFR6 and were absent from TFR7. Two unique ESTs were predominantly present in TFR6 and TFR7. The remaining unique ESTs with R > 4 had cDNA clones equally distributed across the three 'Tifrunner' libraries.

Defense-related genes identified by database search

The information provided by ESTs from plant tissues challenged by specific biotic and abiotic stress conditions offers an opportunity for gene discovery. The unique EST sequences from 'GT-C20' and 'Tifrunner' were compared individually to the non-redundant protein sequence database available from NCBI using the BLASTx program with an E value cutoff of < 1e-5. With reference to the results of the differential expression and hierarchical clustering analyses (Tables 4 and 5), only those genes whose expression was significantly up- or down-regulated at different stages were selected. Other defense-related genes with an E value > 1e-5 were treated as false positives and excluded from the analysis.

Among the unique EST sequences with R > 4, only three up-regulated putative defense-related genes (putative desiccation-related protein PCC13-62 precursor, serine protease inhibitor and seed maturation protein LEA 4) were identified in both the 'GT-C20' and 'Tifrunner' libraries (Table 6 and Fig. 3). Six up-regulated unique EST sequences were observed only in the 'GT-C20' libraries and matched previously reported proteins including PR10 protein, defensin protein and calmodulin (Table 6). In the 'Tifrunner' libraries, five defense-related genes, such as metallothionein-like protein, heat shock protein and Cu/Zn superoxide dismutase II, were detected with significant up-regulation.

Table 6 Putative resistance-related genes with significantly differential expression (R > 4) in 'GT-C20' and 'Tifrunner' libraries

Comparison of these EST data to other plant EST sequences

To compare these peanut ESTs to other publicly available plant ESTs, a similarity search against several plant EST databases in the TIGR Gene Indices was performed (Table 7). At a DNA sequence identity of ≥ 90%, the percentages of peanut ESTs matching soybean and Medicago truncatula were 16.45% and 9.82%, respectively. When the DNA sequence identity threshold was lowered to ≥ 80%, the percentages of peanut ESTs matching soybean and M. truncatula increased greatly, to 79.46% and 72.53%, respectively. In contrast, the percentages of peanut ESTs that matched Arabidopsis, rapeseed, rice, maize and wheat ESTs were less than 50%, ranging from 33.84% to 45.69%, at ≥ 80% identity. Although peanut and rapeseed are both oilseed crops, at ≥ 80% identity only 38.5% of peanut ESTs matched rapeseed ESTs, far fewer than matched the legume crops soybean and M. truncatula. As expected, peanut ESTs showed higher similarity to ESTs of the legume species than to those of cereal crops, and higher similarity to ESTs of the dicot plants than to those of the monocots.

Table 7 Peanut unique EST homologs identified in soybean, Medicago truncatula, Arabidopsis, rapeseed, rice, maize and wheat in TIGR gene indices

Discussion

Large-scale sequencing of Expressed Sequence Tags (ESTs) is an effective method for gene discovery. As of March 23, 2007, the peanut EST collection in GenBank comprised 19,790 entries, derived from leaf, root, pod, cotyledon and other tissues of cultivated peanut (13,526) and wild species (6,264). The number of peanut ESTs deposited in GenBank is far behind that of major crops such as maize, wheat, rice and soybean, and is inadequate to meet the needs of peanut genetic and genomic research. Many successful EST projects have been reported for a number of species and from a variety of tissues under various conditions [6, 11, 17, 27, 33, 34]. However, most of these EST projects were restricted to different tissues from one genotype or different tissues from different genotypes. The EST project reported in this study was systematically designed to use the same tissue (developing seeds) from two genotypes, 'GT-C20' and 'Tifrunner', which differ in resistance and susceptibility to diseases, under the same environmental conditions (challenge by A. parasiticus and drought stress) at specific seed developmental stages (R5, R6 and R7). The completion of this peanut EST project doubles the number of peanut ESTs available to the research community in the GenBank database. In addition, the six libraries were neither normalized nor subtracted, so that the frequency of a unique EST (gene) within each stage could be determined and could provide an indication of the expression level of that specific gene.

To understand the molecular basis of host resistance to A. flavus/parasiticus and consequent aflatoxin contamination, we monitored transcript changes at three developmental stages in developing seeds. The 8,689 unique ESTs were categorized into different functional groups based on the MIPS criteria [29, 30]. The highly expressed overlapping ESTs also helped in assembling full-length unique transcripts expressed in peanut seed, such as the putative allergen protein (iso-Ara h3, GenBank accession no. DQ855115). The putative functions of the identified unique ESTs were predicted by similarity search according to MIPS (Fig. 4). Compared to the Arabidopsis sequence data, 65.99% of the total peanut unique ESTs matched Arabidopsis protein sequences with a known function and 17.58% had significant similarity to Arabidopsis protein sequences with unknown function. About 16.43% of the total unique ESTs showed no significant similarity to Arabidopsis sequences at all. The peanut ESTs that matched Arabidopsis proteins with known functions were divided into nineteen categories [29, 30]. A major portion of these genes with known functions fell into the category of metabolism (24.47%), followed by transcription (8.85%, Fig. 4). To further identify novel peanut sequences, a comprehensive similarity search against the GenBank non-redundant (nr) database was performed using the stand-alone BLASTx algorithm, resulting in the identification of an additional 967 putative novel sequences, including 165 unique peanut ESTs matching previously reported peanut genes. The BLAST results revealed that a significant number of unique peanut seed ESTs matched soybean (396), Arabidopsis (2952), rice (682), and other plant species.

In this study, some previously reported defense-related genes were confirmed to be expressed. Desiccation-related proteins can be induced by drought stress and are relatively sensitive to cellular dehydration [35, 36]. The LEA (late embryogenesis abundant) proteins are known to be involved in protecting higher plants from damage caused by environmental stresses, especially dehydration from drought [37–39]. Serine protease inhibitors are involved in plant defense against pathogens and can be induced in response to infection by pathogens [40–42]. These three classes of genes were up-regulated in the three reproductive stages in both the 'GT-C20' and 'Tifrunner' libraries. Other defense-related genes with significant differential expression were present in either 'GT-C20' or 'Tifrunner'. For example, the PR10 protein family is induced in plants in response to pathogen infection as well as abiotic stress, and shows transcriptional up-regulation upon biotic and abiotic stresses [43–45]. Calmodulin (CaM) is a ubiquitous Ca2+ sensor found in all eukaryotes and has been shown to participate in the regulation of diverse calcium-dependent physiological processes [46]. Calmodulin plays an important role in sensing and transducing changes in cellular Ca2+ concentration in response to several biotic and abiotic stresses [47]. CaM has been implicated in plant-pathogen interactions [48, 49]. PR10 and calmodulin were significantly up-regulated in the 'GT-C20' libraries but not in 'Tifrunner' (Table 6). In contrast, two heat shock proteins, which are synthesized in response to heat stress [50–52], were up-regulated in the 'Tifrunner' libraries but not in 'GT-C20' (Table 6).

This raises the question of why certain genes are present or absent, or show differential expression, in different genotypes such as 'GT-C20' and 'Tifrunner'. There are two possible explanations. One is that clones were randomly selected for cDNA sequencing in this study, so some clones present in the 'GT-C20' or 'Tifrunner' libraries might have been missed. The other is that the presence, absence or significantly differential expression of certain genes, especially defense-related genes, is a result of the genetic differences (resistance and susceptibility) between these two genotypes. To test the assumption that variability of expression might be a result of genetic differences in disease resistance or stress tolerance, two genes (iso-Ara h3, a highly abundant and constitutively expressed allergen protein gene, and LEA 4, an up-regulated defense-related gene) were selected for sequence similarity analysis. As expected, the similarity of iso-Ara h3 between 'GT-C20' and 'Tifrunner' was 97%, whereas the LEA 4 sequences shared only 91% identity over 709 bases. For iso-Ara h3, 6 gaps were found among 1,692 consensus bases. For LEA 4, 19 gaps were found among 709 consensus bases (data not shown). These results imply that allelic differences are greater in defense-related genes than in constitutively expressed genes. Further investigations are necessary to characterize the functions of these genes and to analyze their expression patterns.

Conclusion

This is a unique study using both resistant and susceptible genotypes under the same environmental conditions, challenged by A. parasiticus and drought stress, at specific seed developmental stages (R5, R6 and R7). The large number of peanut ESTs obtained provides an important resource for gene discovery, gene expression profiling, and microarray design [12, 53]. The frequency of individual ESTs indicated the temporal expression pattern of a given gene. The information from this study will significantly improve our understanding of the mechanism of host resistance and provide a useful genomic resource for the peanut breeding and aflatoxin research communities.

Methods

Library construction and sequencing

The peanut varieties 'Tifrunner', susceptible to Aspergillus parasiticus but resistant to TSWV (tomato spotted wilt virus, the No. 1 disease in the southeastern US), and 'GT-C20', resistant to A. parasiticus but susceptible to TSWV, were selected for this experiment. The peanut plants used for RNA extraction were grown in the field and inoculated with A. parasiticus NRRL 2999 at mid-bloom (60 days after planting). Drought stress was imposed during the final 40 days before harvest through the use of rain-out shelters. Immature pods at the R5 (beginning seed), R6 (full seed) and R7 (beginning maturity) stages [54] from the two peanut genotypes, 'GT-C20' and 'Tifrunner', were collected, frozen in liquid nitrogen, and stored at -80°C until RNA extraction.

Developing seeds were removed from the sampled immature pods for total RNA extraction. Six cDNA libraries from developing seeds were constructed according to the protocol reported previously [55]. The cDNA inserts were ligated into the pBlueScript vector. Each of the six cDNA libraries was named using an abbreviation of the genotype (C20 or TF) followed by the corresponding developmental stage. For example, TFR5 refers to 'Tifrunner' at developmental stage R5.

Sequencing was performed from the 5' end of the cDNA with the T3 sequencing primer, using an ABI 3730xl Genetic Analyzer (Applied Biosystems) and the ABI Prism BigDye terminator cycle sequencing kit (Foster City, CA).

EST processing and clustering

Short vector sequences were trimmed from the raw sequence reads and poor-quality sequences (fewer than 100 nucleotides) were removed using Sequencher 4.6 software (Gene Codes, Ann Arbor, MI). The cleaned cDNA sequences from 'GT-C20' and 'Tifrunner' were separately assembled into TCs using Phrap [56] with a 90% minimum match. Sequences sharing greater than 90% identity over 40 or more contiguous bases, with an unmatched overhang of less than 30 bases, were placed into clusters. Overlaps occurring exclusively in low-complexity regions were excluded.
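
The length filter described above can be sketched in a few lines of Python. This is an illustrative stand-in and not the Sequencher 4.6 procedure: it assumes that the vector-trimmed reads are available as FASTA-formatted text, and the function names (`parse_fasta`, `filter_by_length`) and the example reads are hypothetical.

```python
MIN_LENGTH = 100  # reads shorter than 100 nucleotides are discarded


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_length=MIN_LENGTH):
    """Keep only reads with at least min_length nucleotides."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Hypothetical trimmed reads: one too short, one acceptable.
example = ">read_short\n" + "ACGT" * 10 + "\n>read_ok\n" + "ACGT" * 30 + "\n"
records = list(parse_fasta(example))
kept = filter_by_length(records)
fraction_kept = len(kept) / len(records)
```

The fraction of reads retained per library, computed in this way, corresponds to the percentage of acceptable-quality ESTs reported for each library.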

Frequency of cDNAs in different libraries

The six cDNA libraries were neither normalized nor subtracted. Therefore, the number of cDNA clones making up each contig may represent the gene expression profile at the different developmental stages. An "electronic Northern" was conducted by analyzing the frequency of cDNA clones within each contig. The six libraries were divided into two groups for analysis according to source genotype. Each group, comprising the three libraries constructed from the same peanut genotype at different stages, was compiled and analyzed separately. Each of the three libraries, representing a different developmental stage (R5, R6 or R7) subjected to a different length of fungal challenge and drought stress, was analyzed to identify cDNAs whose presence was specific to that developmental stage and environmental challenge.
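
The counting step behind such an "electronic Northern" can be sketched as follows. The sketch assumes two lookup tables that are not part of the original methods description: one mapping each clone to the contig (or singleton) it was assembled into, and one mapping each clone to its source library. The clone and contig identifiers below are hypothetical.

```python
from collections import Counter, defaultdict


def contig_library_counts(clone_to_contig, clone_to_library):
    """Count, for every contig, how many of its clones come from each library."""
    counts = defaultdict(Counter)
    for clone, contig in clone_to_contig.items():
        counts[contig][clone_to_library[clone]] += 1
    return counts


def shared_contigs(counts, libraries):
    """Return the contigs that contain at least one clone from every given library."""
    return {
        contig
        for contig, per_library in counts.items()
        if all(per_library[lib] > 0 for lib in libraries)
    }


# Hypothetical assembly of five clones from the three 'GT-C20' libraries.
clone_to_contig = {
    "cloneA": "ContigX", "cloneB": "ContigX", "cloneC": "ContigX",
    "cloneD": "ContigY", "cloneE": "ContigY",
}
clone_to_library = {
    "cloneA": "C20R5", "cloneB": "C20R6", "cloneC": "C20R7",
    "cloneD": "C20R5", "cloneE": "C20R5",
}
counts = contig_library_counts(clone_to_contig, clone_to_library)
in_all_three = shared_contigs(counts, ["C20R5", "C20R6", "C20R7"])
```

The per-library counts for each contig form the rows of a frequency matrix, and the same table can be used to list the contigs shared between any two or three libraries.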

Functional annotation of unique ESTs and bioinformatics

To identify the putative functions of the unique ESTs by BLAST, the NCBI (National Center for Biotechnology Information) non-redundant protein database (nr) and the Munich Information Center for Protein Sequences (MIPS) Arabidopsis Sequencing Project functional categories [29, 30] were downloaded and installed locally.

A sequence similarity comparison between the EST sequences and the nr database was performed using the BLASTx algorithm [57, 58] with NCBI default parameters. The unique sequences were considered homologous to known proteins in the nr database when the BLAST E value was less than 1e-5 (the probability that the alignment would be generated randomly is less than 1 in 100,000) and the BLAST score was higher than 200. The putative full-length protein-coding region was determined by a complete open reading frame (ORF), a poly(A) tail and significant similarity to a known protein sequence. Functional classifications from MIPS were assigned to each unique EST by referring to the MIPS functional catalogue. Resistance/defense-related genes were identified among the ESTs through a combination of similarity to known genes and transcript expression profiles.
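
The two homology criteria (E value below 1e-5 and score above 200) can be applied to BLAST results as in the sketch below. It assumes, for illustration only, that the hits are available in the 12-column tab-separated layout produced by NCBI BLAST+ with `-outfmt 6` (E value in column 11, bit score in column 12) and that the bit score stands in for the "BLAST score"; the original analysis may have used a different output format. The function name `significant_hits` and the example rows are hypothetical.

```python
def significant_hits(tabular_text, max_evalue=1e-5, min_score=200.0):
    """Return the best-scoring hit per query that passes both cutoffs.

    The result maps each query ID to a (subject ID, E value, score) tuple.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, score = float(fields[10]), float(fields[11])
        if evalue < max_evalue and score > min_score:
            if query not in best or score > best[query][2]:
                best[query] = (subject, evalue, score)
    return best


# Hypothetical hits for three unique ESTs.
rows = [
    ["EST1", "protA", "85.0", "120", "18", "0", "1", "360", "1", "120", "1e-50", "250"],
    ["EST1", "protB", "90.0", "120", "12", "0", "1", "360", "1", "120", "1e-60", "300"],
    ["EST2", "protC", "40.0", "60", "36", "0", "1", "180", "1", "60", "0.002", "45"],
    ["EST3", "protD", "70.0", "90", "27", "0", "1", "270", "1", "90", "1e-20", "150"],
]
hits = significant_hits("\n".join("\t".join(row) for row in rows))
```

In this example only EST1 is retained: EST2 fails the E value cutoff, and EST3 passes the E value cutoff but not the score cutoff.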

Gene expression analysis was performed using TIGR MultiExperiment Viewer software [59] by using transcript abundance in each contig in all six libraries. The significant differences in EST abundance for each contig among the libraries were assessed by an R statistic described by Stekel et al. (2000). Only those TCs with R > 4 were used for hierarchical clustering analysis.

A comparative analysis between our ESTs and the currently available EST gene indices for major crops was performed. These include Arabidopsis thaliana (81,826 ESTs), rapeseed (Brassica napus) (25,929 ESTs), maize (Zea mays) (115,744 ESTs), Medicago truncatula (36,878 ESTs), rice (Oryza sativa) (181,796 ESTs), soybean (Glycine max) (63,676 ESTs), and wheat (Triticum aestivum) (122,282 ESTs). These TIGR EST gene indices (currently curated at Harvard University) were downloaded from the FTP site [60]. The following criteria were used in BLAST against the TIGR gene indices: an E value less than 1e-5 and a DNA identity of at least 80% or at least 90%.