Coffee and tomato share common gene repertoires as revealed by deep sequencing of seed and cherry transcripts
An EST database has been generated for coffee based on sequences from approximately 47,000 cDNA clones derived from five different stages/tissues, with a special focus on developing seeds. When computationally assembled, these sequences correspond to 13,175 unigenes, which were analyzed with respect to functional annotation, expression profile and evolution. Compared with Arabidopsis, the coffee unigenes encode a higher proportion of proteins related to protein modification/turnover and metabolism, an observation that may explain the high diversity of metabolites found in coffee and related species. Several gene families were found to be either expanded or unique to coffee when compared with Arabidopsis. A high proportion of these families encode proteins assigned to functions related to disease resistance. Such families may have expanded and evolved rapidly under the intense pathogen pressure experienced by a tropical, perennial species like coffee. Finally, the coffee gene repertoire was compared with that of Arabidopsis and Solanaceous species (e.g. tomato). Unlike Arabidopsis, tomato has a nearly perfect gene-for-gene match with coffee. These results are consistent with the facts that coffee and tomato have a similar genome size, chromosome karyotype (tomato, n=12; coffee, n=11) and chromosome architecture. Moreover, both belong to the Asterid I clade of dicot plant families. Thus, the biology of coffee (family Rubiaceae) and tomato (family Solanaceae) may be united into one common network of shared discoveries, resources and information.
Keywords: Coffea canephora, Rubiaceae, Solanaceae, Seed development, Comparative genomics
Coffee is an important international commodity, ranking among the five most valuable agricultural exports from developing countries (Food and Agriculture Organization, http://apps.fao.org). Moreover, production and processing of coffee employs more than 25 million people worldwide (O’Brien and Kinnaird 2003). Despite its economic importance, coffee has received little attention with respect to molecular genetics and genomics research. As of December 2004, only 1,570 nucleotide and 115 protein sequences from coffee had been deposited in GenBank with the majority of those sequences derived from leaf ESTs. Many of the remaining sequences correspond to enzymes in the caffeine biosynthesis pathway—the most extensively studied pathway in coffee (Moisyadi et al. 1998; Ogawa et al. 2001; Mizuno et al. 2003; Uefuji et al. 2003).
Commercial coffee production relies mainly on two closely related species: Coffea arabica and Coffea canephora, accounting for approximately 70 and 30% of worldwide coffee production, respectively (Herrera et al. 2002). Although C. canephora accounts for a lower total proportion of the coffee market than does C. arabica, it is the main source for soluble coffee, which is consumed widely throughout the world. C. canephora is a diploid (2n=2x=22), outcrossing and highly polymorphic species native to central Africa, but which has expanded, through cultivation, especially to western Africa, Indonesia and Vietnam (Wrigley 1988). In contrast, C. arabica is believed to be a recently derived tetraploid (2n=4x=44) native to a small region of what is now Ethiopia. C. arabica is now grown widely throughout the world.
The goal of the current project was to increase the genetic and molecular knowledge of coffee through the generation and annotation of an EST database using high throughput single-pass 5′ sequencing of cDNAs derived from leaf, pericarp and seed tissues from a set of C. canephora varieties. Special emphasis was given to sequencing cDNAs from different stages of seed development, both to shed light on this important, but not well understood aspect of plant development and to capture as many genes as possible involved in determining the final chemical composition of seeds which constitute the commercial product. As a result, the EST database reported herein is, to our knowledge, the largest public database of seed-derived ESTs (White et al. 2000; Suh et al. 2003).
The plant family most closely related to coffee in which extensive sequencing has been conducted is Solanaceae (Fig. 1). In this family, comprehensive EST databases have been developed for tomato, potato, pepper, eggplant and petunia (http://www.sgn.cornell.edu/) (Hoeven et al. 2002; Ronning and Stegalkina 2003; Lee et al. 2004). Both Rubiaceae and Solanaceae belong to the Asterid I clade of dicots, and based on existing fossil evidence, are thought to have diverged from one another approximately 50 MYA (Gandolfo et al. 1998; Crepet et al. 2004) (Crepet personal communication) (Fig. 1). The closer taxonomic affinities of coffee and Solanaceae (e.g. tomato) are paralleled by a number of striking botanic and genetic similarities, including the production of fleshy berries, a similar genome content (C=950 and 640 Mb for tomato and coffee, respectively) (Hoeven et al. 2002), similar basic chromosome number (x=12 for tomato and most other Solanaceae; x=11 for coffee) and similar chromosome architecture with highly condensed pericentric heterochromatin and decondensed euchromatin at the pachytene stage of meiosis (Rick 1971; Pinto-Maglio and Cruz 1998). For these reasons, the coffee unigene set was also compared against a series of Solanaceae EST-derived unigene sets.
Materials and methods
Source of tissues
Table 1 Characteristics of the 5 cDNA libraries used to develop the coffee EST database
Columns: tissue; source varieties; average insert size, kb; good quality ESTs
Pericarp, all developmental stages: BP358, BP409, BP42, BP961, Q121
Early stage cherry (whole cherries, 18 and 22 weeks after pollination): BP358, BP409, BP42, Q121
Middle stage seed (endosperm and perisperm of seeds, 30 weeks after pollination): BP409, BP961, Q121
Late stage seed (endosperm and perisperm of seeds, 42 and 46 weeks after pollination): BP358, BP409, BP42, BP961, Q121
RNA and mRNA isolation
Total RNA was extracted using phenol/chloroform (Rogers et al. 1999), treated with RNase-free DNase I and purified using an RNeasy Kit (Qiagen, Valencia, CA 91355). Messenger RNA was isolated from total RNA with the PolyATtract mRNA Isolation System (Promega, Madison, WI 53711).
Directional cDNA libraries were constructed from 3–5 μg of mRNA with the ZAP-cDNA Gigapack III Gold Cloning Kit (Stratagene, La Jolla, CA 92037). The average insert length was estimated by PCR on 36 randomly selected clones from each library and ranged from 1.2 to 1.5 kb (Table 1).
Bacteria, containing coffee cDNAs, were cultured in 384-well plates and cDNA inserts subjected to 5′ end sequencing at the BioResource Center at Cornell University (http://www.brc.cornell.edu). The average size of quality reads was 613 bp with a maximum of 1,037 bp.
Sequence quality processing
EST sequences were base-called and screened for vector sequences using PHRED software (Ewing et al. 1998). For each sequence, the longest stretch of overall high quality (PHRED score over 15, which corresponds to a base-call accuracy of about 97%) was identified. PolyA repeats were trimmed to at most 20 bp and any sequence past the polyA (mostly low quality sequence) was discarded. After trimming, the sequences were screened against the E. coli K12 genome to remove any bacterial contamination. The remaining sequences were screened for minimum length (150 bp), maximum allowed ambiguity (4%) and low complexity (60% of the sequence consisting of a single nucleotide, or 80% consisting of the same two nucleotides, either of which indicates a sequencing error).
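The final screening step can be illustrated with a minimal sketch. It assumes that the thresholds are applied as fractions of the trimmed read length and that a read is rejected when a threshold is reached; the function and constant names are hypothetical and are not taken from the SGN pipeline.

```python
from collections import Counter

# Thresholds quoted in the text; how the pipeline treats the exact
# boundary values is an assumption made for this sketch.
MIN_LENGTH = 150          # minimum read length, bp
MAX_AMBIGUITY = 0.04      # maximum fraction of non-ACGT characters
MAX_SINGLE_BASE = 0.60    # low complexity: one nucleotide dominates
MAX_TWO_BASES = 0.80      # low complexity: two nucleotides dominate


def passes_filters(seq: str) -> bool:
    """Return True if a trimmed EST passes length, ambiguity and complexity checks."""
    seq = seq.upper()
    if len(seq) < MIN_LENGTH:
        return False
    counts = Counter(seq)
    # Anything other than A, C, G or T is counted as ambiguous.
    ambiguous = sum(n for base, n in counts.items() if base not in "ACGT")
    if ambiguous / len(seq) > MAX_AMBIGUITY:
        return False
    # Sort the four base counts so the most frequent come first.
    base_counts = sorted((counts[b] for b in "ACGT"), reverse=True)
    if base_counts[0] / len(seq) >= MAX_SINGLE_BASE:
        return False
    if (base_counts[0] + base_counts[1]) / len(seq) >= MAX_TWO_BASES:
        return False
    return True
```

A 200 bp read with balanced base composition passes, whereas a read shorter than 150 bp, a read with 10% N, or a read dominated by one or two nucleotides is rejected.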
Unigene sets were built by combining the sequences from all five coffee cDNA libraries. Clustering was performed using a program developed at the Sol Genomics Network (SGN at http://www.sgn.cornell.edu), which relied on a custom pre-clustering algorithm and on the CAP3 program for contig generation (Huang and Madan 1999). The pre-clustering algorithm clustered sequences using a Smith-Waterman type algorithm with initial word matching. The command line settings for CAP3 were as follows: -e 5000 -p 90 -d 10000 -b 60. The -e, -d and -b options were set such that the assembler disregards them or minimizes their effect. The -p option raises the sequence identity required in overlaps to 90 from the default of 75, which was found to be insufficiently stringent. Sequences were also checked for length, complexity and contamination. The builds were uploaded to the database, where each unigene was assigned a unique unigene ID.
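For reference, the CAP3 settings quoted above could be assembled into a command line as sketched below. Only the four option values come from the text; the executable name, the input file name and the helper function are assumptions for illustration, and the sketch does not reproduce the SGN pre-clustering step.

```python
import shlex


def cap3_command(fasta_path: str, executable: str = "cap3") -> list:
    """Build a CAP3 invocation with the settings quoted in the text."""
    return [
        executable, fasta_path,
        "-e", "5000",    # set so that the assembler effectively disregards it
        "-p", "90",      # overlap percent identity cutoff (CAP3 default is 75)
        "-d", "10000",   # set so that the assembler effectively disregards it
        "-b", "60",      # set to minimize its effect
    ]


# "precluster_0001.fasta" is a hypothetical file holding one pre-cluster.
print(shlex.join(cap3_command("precluster_0001.fasta")))
```

The resulting list can be passed to `subprocess.run` on a machine where CAP3 is installed.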
Comparison of the coffee and tomato EST databases derived from use of ESTScan calibrated with the same tomato training set (see Materials and methods for details)
Rows: average unigene length, bp; unigenes with coding regions; average length (bp) of predicted peptides; average ESTScan score
BLAST matches between coffee unigenes and other sequence databases
The GenBank non-redundant (NR) protein and dbEST datasets (NCBI, http://www.ncbi.nlm.nih.gov).
The Solanaceae EST-derived unigene sets, including tomato (Solanum lycopersicum) (184,860 ESTs, 30,576 unigenes), potato (Solanum tuberosum) (97,425 ESTs, 24,932 unigenes), pepper (Capsicum annuum) (20,738 ESTs, 9,554 unigenes), petunia (Petunia hybrida) (3,181 ESTs, 1,841 unigenes) and eggplant (Solanum melongena) (11,479 ESTs, 5,135 unigenes), all of which can be accessed at SGN (http://www.sgn.cornell.edu).
Functional annotation based on predicted peptides
ESTScan-predicted coffee peptides were subjected to InterProScan annotation, which integrates the most commonly used protein signature databases (PROSITE, PRINTS, Pfam, ProDom, etc.) together with their associated scanning methods for protein domain analysis (Apweiler et al. 2001; Zdobnov and Apweiler 2001). Based on the domain annotation, GO accessions were assigned to the unigenes using the interpro2go conversion file from the GO consortium (http://www.geneontology.org, also available at http://www.ebi.ac.uk/interpro).
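The domain-to-GO step can be sketched as follows. The sketch assumes the usual interpro2go line layout ("InterPro:IPRxxxxxx name > GO:term name ; GO:xxxxxxx", with comment lines starting with "!"); the example lines, the unigene labels and the function names are illustrative only and are not taken from the coffee dataset.

```python
import re
from collections import defaultdict

# One mapping per line: InterPro accession at the start, GO accession at the end.
LINE = re.compile(r"^InterPro:(IPR\d+)\s.*;\s*(GO:\d+)$")


def parse_interpro2go(lines):
    """Map each InterPro accession to the set of GO accessions listed for it."""
    mapping = defaultdict(set)
    for line in lines:
        if line.startswith("!"):          # header/comment lines
            continue
        match = LINE.match(line.strip())
        if match:
            mapping[match.group(1)].add(match.group(2))
    return mapping


def assign_go(unigene_domains, mapping):
    """Give each unigene the union of GO accessions of its InterPro domains."""
    return {
        unigene: set().union(*(mapping.get(d, set()) for d in domains))
        for unigene, domains in unigene_domains.items()
    }


# Illustrative input, not real results.
example_lines = [
    "!comment line",
    "InterPro:IPR000719 Protein kinase domain > GO:protein kinase activity ; GO:0004672",
    "InterPro:IPR000719 Protein kinase domain > GO:ATP binding ; GO:0005524",
]
mapping = parse_interpro2go(example_lines)
annotations = assign_go({"unigene_A": ["IPR000719"], "unigene_B": []}, mapping)
```

Unigenes without any mapped domain simply receive an empty set, which corresponds to the large fraction of unigenes left without a GO annotation.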
Functional categorization based on gene ontology
GO annotations were formatted for input into the GOSlim program and the output parsed to count the occurrence of each GO category. GOSlims are ‘slimmed down’ versions of the ontologies that allow a high-level view of gene functions. The GOSlim file and program were obtained from the Gene Ontology Consortium at http://www.geneontology.org.
Gene family analysis
The predicted protein sequences for the coffee unigene set and the Arabidopsis protein set were combined into a single file, formatted as a BLAST database using formatdb, and searched against itself with BLASTP (protein vs. protein) using the -m 8 option for tabular output. The resulting file was used as the input for the TribeMCL program (Enright et al. 2002), which writes the clusters to a tab-delimited file, one cluster per line. Simple scripts were used to parse this file to detect the largest gene families, coffee-specific families, and families that showed large expansions in coffee.
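A parsing script of the kind described might look like the sketch below. It assumes that Arabidopsis identifiers begin with "At" (as in At1g80840) and that every other identifier is a coffee protein; the identifier style, the minimum family size and the expansion ratio are hypothetical choices for illustration, since the text does not state the criteria used.

```python
def summarize_clusters(lines, is_arabidopsis=lambda pid: pid.upper().startswith("AT")):
    """Count Arabidopsis and coffee members in each tab-delimited cluster line."""
    summary = []
    for line in lines:
        members = [m for m in line.rstrip("\n").split("\t") if m]
        if not members:
            continue
        n_at = sum(1 for m in members if is_arabidopsis(m))
        summary.append({
            "members": members,
            "arabidopsis": n_at,
            "coffee": len(members) - n_at,
        })
    return summary


def coffee_specific(summary, min_size=2):
    """Families with no Arabidopsis member (min_size is an assumed cutoff)."""
    return [c for c in summary if c["arabidopsis"] == 0 and c["coffee"] >= min_size]


def expanded_in_coffee(summary, min_ratio=2.0):
    """Families with at least min_ratio times more coffee than Arabidopsis members."""
    return [
        c for c in summary
        if c["arabidopsis"] > 0 and c["coffee"] / c["arabidopsis"] >= min_ratio
    ]


# Hypothetical cluster file content, one cluster per line.
clusters = summarize_clusters([
    "At1g01010\tAt1g01020\tcoffee_1\n",
    "coffee_2\tcoffee_3\n",
    "At2g00001\tcoffee_4\tcoffee_5\tcoffee_6\n",
])
```

Sorting `clusters` by `len(c["members"])` would give the largest families.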
Results and discussion
Generation of coffee EST database and unigene set
Differentiating between paralogs and alleles
The Arabidopsis plot showed two peaks, one with low identity (~87%) and the other with higher identity (over 99%), with 6.4 and 1.6% of genes falling into the low and high identity peaks, respectively. Like Arabidopsis, the coffee plot also showed two peaks, one with lower identity (around 91%) and the other with high identity (over 99%). These two peaks corresponded to 6.2 and 0.8% of the total unigenes, respectively. ESTs corresponding to ten pairs of coffee unigenes from the >99% peak were used as probes in genomic Southern hybridizations to determine whether the matching pairs were truly duplicated in the coffee genome (paralogs) or rather allelic (single copy). For eight of the ten pairs, the paired ESTs hybridized to the same single copy gene on Southern blots (data not shown). Thus, a significant number (approximately 80%) of the unigenes in this peak are likely to be allelic. However, this category represented only a small portion of the coffee unigenes (0.8%). A similar experiment was performed with 11 ESTs from 11 pairs of unigenes in the second, lower homology peak (around 91% identity, see Fig. 3). In this case, the majority (8 out of 11) were determined to represent true paralogs (two or more copies in the genome) (data not shown). Thus, for further discussion, it is assumed that the majority of coffee EST-derived unigenes do in fact correspond to unique coffee genes.
Functional annotation of coffee EST-derived unigenes
Predicted coffee proteins
ESTScan (see Materials and methods for details) was able to identify protein-coding sequences in 12,534 coffee unigenes (95% of total unigenes), among which 1,515 (12%) were putatively full-length (starting with ATG and ending with a stop codon). Due to the cDNA library construction method, the unigenes were biased toward the 3′ end: 57% of the unigenes covered the 3′ end (ending with a stop codon) while only 36% covered the 5′ end (starting with ATG). Of the 5% of unigenes from which a protein sequence could not be predicted, 81% were singletons and the majority (97%) did not match any Arabidopsis, GenBank non-redundant (NR) or Solanaceae unigene sequences, suggesting that they are not bona fide gene transcripts.
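The full-length criterion used above (a predicted coding sequence that starts with ATG and ends with a stop codon) can be expressed as a small classifier. This is a sketch under the assumption that the predicted coding region is available as an in-frame nucleotide string; the function name and category labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_cds(cds: str) -> str:
    """Classify a predicted coding sequence by which ends it covers."""
    cds = cds.upper()
    has_start = cds.startswith("ATG")                       # covers the 5' end
    has_stop = len(cds) >= 3 and cds[-3:] in STOP_CODONS    # covers the 3' end
    if has_start and has_stop:
        return "full-length"
    if has_stop:
        return "3' end only"
    if has_start:
        return "5' end only"
    return "internal fragment"
```

Under this scheme, the share of unigenes covering the 3′ end is the sum of the "full-length" and "3' end only" classes, and likewise for the 5′ end.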
Protein domain annotation
Twenty most abundant InterPro domains identified in coffee unigene set and comparative statistics for tomato and Arabidopsis genes, given as % of unigenes (ranking)
Serine/threonine protein kinase
Tyrosine protein kinase
Serine/threonine protein kinase, active site
RNA-binding region RNP-1 (RNA recognition motif)
G-protein beta WD-40 repeat
Ras GTPase superfamily
Ras small GTPase, Rab type
2OG-Fe(II) oxygenase superfamily
E-class P450, group I
Myb DNA-binding domain
Small GTP-binding protein domain
Leucine-rich repeat, plant specific
Gene ontology annotation
A comparison was made between the GOSlim biological process categories of Arabidopsis, tomato and coffee (Fig. 4). For both the tomato and coffee unigene sets, the GO annotations were based on InterProScan results and approximately 25% of both unigene sets were assigned a GO annotation. In Arabidopsis, the genes are full length, giving a higher chance of finding functional domains. Moreover, extensive experimental research and manual annotation have been carried out in Arabidopsis. As a result, 83% of the Arabidopsis genes are assigned GO annotations. No significant differences were observed in the annotated categories for coffee versus tomato, possibly reflecting their close taxonomic affinity. However, for a number of categories, coffee had significantly different proportions of genes than Arabidopsis. The categories with the largest significant differences (P<0.001, based on Chi-square test) are: carbohydrate metabolism, other metabolism, biosynthesis, catabolism, protein biosynthesis, protein modification and energy pathways. In all cases, coffee had a significantly higher proportion of genes in these categories than Arabidopsis (Fig. 4). It is interesting to note that many of these categories center around the synthesis, breakdown or modification of compounds. One of the hallmarks of coffee is its high diversity of primary and secondary compounds, which contributes to the sensory quality of brewed coffee. The Rubiaceae family in general contains some of the most diverse species with regard to secondary metabolism and is an especially rich source of alkaloids, a number of which have pharmacological and/or psychotropic properties (Kutchan 1995; Facchini 2001). In fact, the most widely used psychotropic drug, caffeine, comes mainly from coffee. One can speculate that this metabolic diversity is reflected in the relatively high proportion of coffee genes with putative functions related to metabolism.
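A per-category comparison of this kind can be sketched as a 2×2 Chi-square test on gene counts. The sketch assumes a Pearson statistic without continuity correction (the text does not say which variant was used), and the counts in the example are invented for illustration; the critical value 10.828 is the standard threshold for P=0.001 with one degree of freedom.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    a, b: genes in / not in the category for species 1
    c, d: genes in / not in the category for species 2
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


CRITICAL_P001_DF1 = 10.828  # chi-square critical value, df=1, P=0.001

# Invented counts: 300 of 3,000 annotated genes in species 1 versus
# 1,500 of 22,000 annotated genes in species 2 fall in the category.
statistic = chi_square_2x2(300, 2700, 1500, 20500)
significant = statistic > CRITICAL_P001_DF1
```

With these invented counts the category proportions are 10% and about 6.8%, and the statistic exceeds the P<0.001 threshold.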
In silico analysis of unigene expression
Complexity and uniqueness of different stages/tissues
Differential expression of genes across stages/tissues
Number of coffee unigenes showing significantly (P<0.05) different expression in pairwise comparisons of cDNA libraries (libraries compared include early stage cherry, middle stage seed and late stage seed)
Highly expressed genes
Table 5 The 20 most highly expressed coffee unigenes: functional annotation and most similar Arabidopsis and Solanaceae homologs
Columns: coffee unigene number and annotation; best match (e value/score); libraries, including early stage cherry, middle stage seed and late stage seed
125230: putative 2S seed storage protein
120912: 11S seed storage protein
121707: unknown function
120118: unknown function
124988: unknown function
124158: photoassimilate-responsive protein
119890: unknown function
123265: ADP-ribosylation factor
124083: secretory peroxidase
124815: unknown function
122206: SAM synthase
119460: WRKY4 transcription factor
123045: unknown function
120481: AdoMet synthase
124791: plasmodesmal receptor
122071: rubisco small subunit
Seed storage protein genes
Unigene 125230: a putative 2S seed storage protein
Unigene 125230 is the most highly expressed gene across the entire coffee unigene set (1,219 ESTs) and was the dominant transcript in the middle stage seed library, accounting for 10% of the ESTs at this stage (Table 5). This gene shows high homology to a tomato unigene derived from a developing seed cDNA library, but has no detectable homolog in Arabidopsis (Table 5). Other than the match with tomato, weak homology was also detected to 2S seed storage proteins from sesame, sunflower and Brazil nut (in decreasing order of similarity). It is interesting to note that coffee, tomato, sesame and sunflower are fairly closely related taxonomically: all belong to the Asterid I/II clade of eudicots (Fig. 1). This close phylogenetic relationship may explain why Unigene 125230 has homologous matches only in these species. Moreover, since Unigene 125230 shows homology to the 2S seed storage proteins in these related species, we conjecture that Unigene 125230, its tomato unigene match, and the sesame, Brazil nut and sunflower 2S storage protein genes all encode orthologous 2S seed storage proteins. This is the first time that a 2S seed storage protein has been identified in coffee or any Solanaceae species. Finally, a BLAST search of Unigene 125230 against the coffee unigene set revealed additional putative copies of the 2S seed storage protein. However, on close examination, all appear to be splicing variants or low quality sequences. Moreover, Southern hybridization with a 2S cDNA probe on genomic DNA confirmed that the 2S gene is single copy in the coffee genome (data not shown).
Unigene 120912: 11S seed storage protein
Unigene 120912 is the second most abundant unigene, containing 687 ESTs (Table 5). This gene is preferentially expressed during middle and late stage seed development and shares high similarity (over 98% identity) with a previously cloned C. arabica 11S seed storage protein (Marraccini et al. 1999; Rogers et al. 1999). This unigene also has a highly significant match to the Arabidopsis 12S storage protein and to a tomato unigene derived from seed ESTs (Pang et al. 1988) (Table 5). Given these results, we conclude that Unigene 120912 is allelic with the previously described 11S seed storage protein gene from C. arabica and orthologous to 11S/12S seed storage proteins in both tomato and Arabidopsis. A BLAST search of Unigene 120912 against the coffee unigene set revealed additional putative copies of the 11S seed storage protein. However, as with the 2S seed storage protein (Unigene 125230), all appear to be the result of alternative splicing or low sequence quality. Moreover, Southern hybridization with an 11S cDNA probe on genomic DNA confirmed that the 11S gene is single copy in the coffee genome (data not shown).
Other seed-specific genes
Early stage seed development
Unigenes 122206, 119460 and 121265 were all highly expressed and specific to the early cherry stage. The early cherry library was derived from RNA from both pericarp and seed tissue, while the pericarp library was derived from RNA from all stages of pericarp development. Thus, if the above genes were highly expressed in the pericarp of the early cherry, they should be present in the pericarp library as well. The fact that these genes showed little or no expression outside the early cherry stage suggests that they are probably specific to early developing seed tissues and not pericarp tissues (Table 5). Unigene 122206 showed high homology to an Arabidopsis gene annotated as encoding the enzyme S-adenosyl-l-methionine (SAM) synthetase (Table 5). This enzyme synthesizes S-adenosyl-l-methionine from l-methionine and ATP and is often represented by multiple isozymes in plant species (Schroder et al. 1997). Thus, Unigene 122206 appears to be an SAM synthetase specific to early seed development (Table 5).
Unigene 119460 shows high homology to the highly conserved WRKY transcription factor family. WRKY transcription factors form a large gene family with more than 70 members in the Arabidopsis genome (Dong et al. 2003). Previous studies have linked the family to wounding, stress, pathogen infection and senescence in many plant species. More recently, WRKY proteins were found to be involved in sugar signaling in barley and seed development in Arabidopsis (Johnson et al. 2002; Sun et al. 2003). However, the function of the best Arabidopsis match to Unigene 119460 (At1g80840) has not been determined. Hence, understanding the function of this highly expressed, WRKY-like coffee gene awaits further study.
Unigene 121265 is highly homologous to a gene in Arabidopsis annotated as encoding a Mob1/phocein protein. Mob1/phocein proteins are found in virtually all eukaryotes. While they are conjectured to be involved in cell cycle control, there is still little experimental evidence demonstrating biological function (Pon 2004). Thus it seems premature to conjecture what role Unigene 121265 might have that is specific to the early development of coffee seeds.
Middle stage seed development
As described earlier, Unigene 125230 is a putative 2S seed storage protein with peak expression during middle seed development. Also showing preferential expression during this same stage were Unigenes 121707, 124158 and 119890. Unigene 121707 is a gene of unknown function with high homology matches both in Arabidopsis and Solanaceae EST-derived unigenes (Table 5). Unigene 124158 is homologous to an Arabidopsis gene classified as a photoassimilate-responsive protein, which is related to pathogenesis (Herbers 1995). Finally, Unigene 119890, which is also specific to middle stage seed development, is apparently a gene unique to coffee, which will be discussed more in the following section.
Late stage seed development
As discussed previously, Unigene 120912 corresponds to the 11S seed storage protein, which is largely expressed late in seed development. Other genes with preferential expression in late stage seeds are Unigenes 120118, 119817 and 124791. Unigene 119817 likely encodes a chitinase and is further discussed in the next section. Unigene 120118 shows high homology to genes in both Arabidopsis and Solanaceae EST-derived unigene sets; however, none have known function. Unigene 124791 gives a strong match to an Arabidopsis gene annotated as a plasmodesmatal receptor.
Two highly expressed genes with homology to chitinase
Unigenes 120685 and 119817 show high sequence similarity to a number of genes classified as chitinases in other organisms. Chitinases are a large and diverse class of proteins, some of which have been implicated in resistance to fungi in various plant species, including coffee (Rojas-Herrera 2002; Chen et al. 2003). The two unigenes differ in that Unigene 120685 is expressed in leaves, pericarp and early stage cherries, but not in mid or late stage seed development (Table 5). Unigene 119817, on the other hand, was found to be exclusively expressed in late stage developing seeds and pericarp tissue. As previously mentioned, early stage cherries contained both pericarp and seed tissues. The fact that Unigene 120685 was not found in the middle and late stages of seed development suggests that this gene may not be expressed in seeds, but rather in the maternally derived pericarp and leaf tissues. Based on these results, one can speculate that these two putative chitinase genes may be involved in pathogen defense in developing cherries, with Unigene 120685 being expressed in early developing, post-pollination pericarp and leaf tissues and Unigene 119817 being expressed primarily late in seed development, just prior to maturity.
Highly expressed genes unique to coffee
Unigene 124988 is highly expressed but had no significant matches in the Arabidopsis proteome, Solanaceae EST-derived databases, GenBank NR database or GenBank dbEST. Moreover, the predicted protein encoded by Unigene 124988 has no recognizable domains that might give clues to its function. ESTs for this unigene were detected in all five libraries, with the highest expression observed in the pericarp (Table 5).
Unigene 119890 also has no significant match in any of the tested databases, with the possible exception of a very weak match in the Solanaceae unigene sets (the best hit was from pepper with an e value of 5e-7, Table 5). Like Unigene 124988, its predicted protein has no recognizable domains. Unigene 119890 was highly and exclusively expressed in the middle stage of developing seeds (Table 5).
The fact that neither Unigene 124988 nor Unigene 119890 has a counterpart in any other database suggests that they may represent coffee-specific genes or genes that have been evolving at such a rapid rate that they no longer bear any recognizable homology to proteins from other plants, including the closely related Solanaceous plants. We speculate that these genes may be related to chemical or morphological features unique to coffee.
Gene families unique or significantly expanded in coffee
Table 6 Gene families expanded in coffee relative to Arabidopsis
Columns: number of Arabidopsis family members; number of coffee family members; longest coffee member
Retrotransposon gag protein, class I
Polygalacturonase isoenzyme 1 beta subunit with BURP domain
Hypersensitive-induced protein, band 7 protein
Bet v I allergen
Root hair defective protein
Trypsin inhibitor Kunitz
Table 7 Gene families unique to coffee in comparison to Arabidopsis
Columns: gene family number; number of family members
Retrotransposon gag protein, class II
Thaumatin, pathogenesis related
Zn-finger, CCHC type
Disease resistance protein (TIR-NBS-LRR class)
Retrotransposon gag protein, class III
Disease resistance protein
Leucine-rich repeat, disease resistance protein
ABA/WDS induced protein
Proline-rich region, extensin-like protein
Leucine-rich repeat, plant specific, receptor-related protein kinase
Coffee-expanded gene families
The most expanded gene family in coffee corresponds to a retrotransposon gag protein (Table 6). This result has two implications. First, the retroelement encoding this gag protein occurs at a higher frequency in coffee compared with Arabidopsis, although we cannot determine whether this difference is due to a true expansion of this element in coffee subsequent to divergence from Arabidopsis, or rather a loss of the element in the Arabidopsis lineage. Second, the fact that this retrotransposon gag protein element was discovered in an EST-database indicates that this particular retroelement is being transcribed in the coffee genome, and hence may represent an active retrotransposon.
Another gene family for which coffee has significantly more members than Arabidopsis, encodes proteins annotated in Arabidopsis as acid endochitinases and photoassimilate-responsive proteins (Table 6). As noted earlier, chitinases are associated with fungal resistance and are among the most highly expressed genes in coffee. The fact that chitinases are both highly expressed and represented by an expanded gene family in coffee may reflect a greater need for fungal resistance engendered both by the perennial nature of coffee and the fact that it is a tropical species for which a multiplicity of fungal pathogens is common. The reasons for the putative expansions of the other gene families listed in Table 6 remain for future studies to determine.
Coffee-unique gene families
Table 7 lists the top gene families (based on copy number) that occur in coffee but not in Arabidopsis. Of the 15 gene families listed, four are of unknown function. Of those that could be functionally annotated, five (45%) have putative functions related to disease resistance, such as TIR-NBS-LRR disease resistance proteins, LRR proteins and thaumatin pathogenesis-related proteins (Table 7). These findings are consistent with rapid evolution of genes/gene families related to disease resistance, likely driven by selection pressure from continuously changing pathogens and/or pathogens unique to the particular environments of a species (Meyers 1998; Michelmore and Meyers 1998). Also included in this list of coffee-unique gene families are two that encode retrotransposon gag proteins (Table 7).
Comparison of the coffee gene repertoire with that of Arabidopsis and Solanaceae species
Table 8 Coffee genes not found in Arabidopsis, but with conserved counterparts in tomato or other Solanaceous species
Columns: Solanaceae EST-derived unigene match; GenBank (non-redundant and dbEST) best match
gb|CB686389.1 [Brassica napus]
gi|50252229.1 [Oryza sativa]
ref|NP_922676.1 [Oryza sativa]
emb|CAE05735.1 [Oryza sativa]
TFIIH basal transcription factor p52 subunit
gi|13183175 [Sesamum indicum]
ref|NP_524404.1 [Drosophila melanogaster]
Phosphatidyl inositol transfer protein
gb|CK093976.1 [Populus tremula]
gb|AAO73272.1 [Oryza sativa]
gi|34878866 [Rattus norvegicus]
Phosphatidylinositolglycan class N
gb|CA815435.1 [Vitis vinifera]
gb|CK229938.1 [Macaca mulatta]
40S Ribosomal protein S21
ref|NP_921250.1 [Oryza sativa]
Surprisingly, 8 (40%) of the 20 coffee genes having no match in Arabidopsis did have matches in species phylogenetically distant from both coffee and Arabidopsis, including two in non-plant species (Drosophila and rat) (Table 8). Five of these matches were to genes from rice, a monocot that is highly divergent from coffee, Arabidopsis and other dicot species (Table 8). The fact that all of these species diverged from Arabidopsis and coffee long before the latter diverged from each other suggests that these genes may have been present in the last common ancestor of Arabidopsis and coffee/Solanaceae, but were subsequently lost in the Arabidopsis lineage.
Coffee genes share greater similarity to genes in tomato/Solanaceae than to Arabidopsis
As discussed earlier, coffee is much more closely related to the Solanaceae than to Arabidopsis (Fig. 1). Hence, Solanaceae species may be better models for coffee genomics than Arabidopsis. The results for fast-evolving genes, presented above, are consistent with this prediction. To further investigate this assertion, the degree to which each coffee unigene matched Arabidopsis versus Solanaceae was investigated. In doing this analysis, one has to keep in mind that the entire gene repertoire of Arabidopsis is known, whereas the EST-derived unigene sets for Solanaceae do not represent the entire gene repertoire of these species. We estimate that the combined EST-derived unigene sets of Solanaceae species represent as much as three-quarters of the Solanaceae gene content (Hoeven et al. 2002). Moreover, Arabidopsis genes are of full length, while Solanaceae EST-derived unigenes are not necessarily of full length.
Herein, we describe the development and analysis of a large EST database for coffee. The resulting 47,000 ESTs correspond to 13,175 unique genes (unigenes), a large portion of which are expressed during seed development, a stage that is important to coffee as a crop and for which our understanding of molecular development is still rudimentary. To our knowledge, this is the largest public database for seed-derived ESTs. Hence, this EST database represents a new public resource that can facilitate a better understanding of seed development, as well as genomic, molecular and breeding research in coffee. By comparisons with Arabidopsis and Solanaceous species, we have identified the two major seed storage proteins of coffee (2S and 11S) and demonstrated that these proteins are expressed at different times during seed development. Through in silico gene expression analysis, we have identified a number of highly expressed genes that show high specificity for different stages of seed development as well as for the pericarp tissue that surrounds the seeds. Many of these highly expressed genes are unique to coffee and/or the Asterid clade of higher plants. While the functions of most of these highly expressed, tissue/stage-specific genes remain to be determined, their identification points the way to promoters that could potentially be used to drive gene expression in specific stages/tissues of the coffee plant. Many of these genes are specific to defined periods of seed and/or pericarp development, both of which are critically important for insect/pathogen resistance and in determining the quality of the coffee bean with respect to commercial coffee products.
Coffee, as a member of the family Rubiaceae, is distantly related to the model species Arabidopsis. A computational comparison of the coffee EST-derived unigene set with the sequence databases for Arabidopsis and Solanaceous species (e.g. tomato, pepper) indicates that the latter are much better genomic models for coffee than is Arabidopsis. These results are consistent with the fact that coffee and Solanaceous species share very similar chromosome architecture and are closely related, both belonging to the Asterid I clade of dicot plant families. Moreover, the ability to identify orthologous genes between coffee and tomato opens the door to eventually developing detailed comparative maps for these two species and to the sharing of genomic and biological tools/discoveries, an outcome that should expedite research in both taxa.
This work was supported by a grant from the Nestle Corporation to increase the knowledge base on coffee for the benefit of all stakeholders of the production chain, and funding from the National Science Foundation Plant Genome Program (no. 0116076). The authors wish to thank Dr. Ir Zaenudin Su, Dr. T. Wahyudi, Dr. Surip Mawardi and P. Priyono of the Indonesian Coffee and Cacao Research Institute (ICCRI) for supplying all the leaf and fruit samples used here from the ICCRI collection. We thank V. Caillet and Dr. Pierre Marraccini for their respective efforts in preparing the RNA samples used in the cDNA library construction. Thanks to Dr. Anne Frary for critical reading and editing of the manuscript.
- Herrera JC, Combes MC, Anthony F, Charrier A, Lashermes P (2002) Introgression into the allotetraploid coffee (Coffea arabica L.): segregation and recombination of the C. canephora genome in the tetraploid interspecific hybrid (C. arabica × C. canephora). Theor Appl Genet 104:661–668
- Iseli C, Jongeneel CV, Bucher P (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. American Association of Artificial Intelligence
- Lee S, Kim SY, Chung E, Joung YH, Pai HS, Hur CG, Choi D (2004) EST and microarray analyses of pathogen-responsive genes in hot pepper (Capsicum annuum L.) non-host resistance against soybean pustule pathogen (Xanthomonas axonopodis pv. glycines). Funct Integr Genomics 4:196–205
- Rick CM (1971) Some cytogenetic features of the genome in diploid species. Stadler Sym 1:153–174
- Ronning CM, Stegalkina SSAR, Bougri O, Hart AL, Utterbach TR, Vanaken SE, Riedmuller SB, White JA, Cho J, Pertea GM, Lee Y, Karamycheva S, Sultana R, Tsai J, Quackenbush J, Griffiths HM, Restrepo S, Smart CD, Fry WE, van der Hoeven R, Tanksley S, Zhang PF, Jin HL, Yamamoto ML, Baker BJ, Buell CR (2003) Comparative analyses of potato expressed sequence tag libraries. Plant Physiol 131:419–429
- Wrigley G (1988) Coffee. (Longman Scientific & Technical)