Background

Melon (Cucumis melo) belongs to the Cucurbitaceae family, which comprises 130 genera and approximately 800 species that are mainly found in temperate, subtropical and tropical regions worldwide [1, 2]. Besides melon, the Cucurbitaceae family includes many other economically important species, such as cucumber (C. sativus), watermelon (Citrullus lanatus), squash and pumpkin (Cucurbita spp.). Economically, melon is among the most important fleshy fruits for fresh consumption. Indeed, melon is one of America's, Europe's and the Middle East's favorite fruits for dessert and salad uses because of its unique flavor. The average per capita consumption of melon in the U.S. has increased each decade since the 1960s, with 2000-2006 average per capita consumption exceeding 12 pounds per year, an 8% rise from 1990-1999. Besides its economic importance, melon is a very useful experimental system for fundamental studies on a range of topics including sex determination [3, 4] and vascular biology [5, 6]. Melon is also an intensively studied species in terms of fruit ripening. It exhibits extreme diversity for fruit traits and includes a wide variety of cultivars producing fruits that differ in shape, size, flesh color, sweetness, aroma volatiles and texture [7]. In addition, melon fruits vary significantly in ripening physiology and can be categorized as either climacteric or non-climacteric types based on their ripening-related respiration rate and ethylene evolution profiles [8]. Extensive molecular and genetic studies have been carried out in recent years to better understand the regulatory mechanisms underlying important traits of melon, with the aim of improving melon fruit quality [9, 10].

Melon is a diploid species (2n = 24) with an estimated genome size of 450 Mb [11]. Genetic and genomic tools available in melon include BAC libraries [12-14], a physical map [15], high-resolution genetic maps [16-19], oligo-based microarrays [20], and a TILLING platform for functional studies [21]. Currently the melon genome is being sequenced under the Spanish Genomics Initiative (MELONOMICS) and the genome sequencing should be completed in the near future. The sequence of the closely related cucumber genome is available [22]. Complementary to whole genome sequences, expressed sequence tags (ESTs) directly represent the transcriptome, i.e., the transcribed portions of the genome. They have played significant roles in rapid gene discovery, improving genome annotation, elucidating phylogenetic relationships, facilitating breeding programs, and large-scale expression analysis [23]. Currently in the NCBI dbEST database, there are approximately 35,000 melon ESTs, most of which were produced by González-Ibéas et al. [24]. Approximately 8,000 ESTs are available for cucumber and watermelon, respectively, and a total of approximately 1,000 ESTs from other cucurbit species. Recently several reports have described the generation of large-scale transcriptome sequences in cucurbit species using next generation sequencing technologies (mainly the Roche-454 massively parallel pyrosequencing technology), including melon [25], cucumber [26], and Cucurbita pepo [27]. Although sequences generated under these efforts are much shorter than traditional Sanger ESTs, they represent a significant expansion of cucurbit functional genomics resources.

We undertook to expand the melon transcript catalog in the framework of the International Cucurbit Genome Initiative, which was established in 2005 and has as one of its major objectives the sequencing of approximately 100,000 ESTs from different melon genotypes and tissues [28]. We have constructed eleven full-length enriched cDNA libraries and four standard cDNA libraries from various melon tissues and cultivars and generated ~94,000 ESTs. These melon ESTs were analyzed to determine the structure and putative functions of the corresponding transcripts. In addition, a number of new SSR and SNP markers were identified in this EST collection. All of these data have been integrated in the Cucurbit Genomics Database [28]. The ESTs generated in the present study, especially those from full-length enriched cDNA libraries, will be a useful resource for the ongoing melon whole genome sequencing project and for characterizing gene expression patterns and traits of interest in melon and closely related species.

Results and discussion

Construction and sequencing of melon cDNA libraries

We constructed eleven full-length enriched and four standard cDNA libraries from various melon tissues (cotyledon, leaf, root, flower, fruit and callus) and cultivars (Dulce, PI161375, Piel de Sapo T-111, and Vedrantais) under normal conditions or upon infection with melon necrotic spot virus (MNSV)-Mα5 (Table 1). The flower, fruit and callus libraries were derived from two climacteric (Dulce and Vedrantais) and two non-climacteric cultivars (Piel de sapo T-111 and PI161375). For the flower and fruit, RNA pools were prepared from various developmental stages (see Methods). The leaf, root and cotyledon libraries were constructed from tissues infected with MNSV-Mα5. EST sequencing was carried out independently on full-length enriched and standard cDNA clones. For full-length enriched cDNA libraries, 70,576 randomly-selected clones were sequenced from the 5' end, producing 69,196 (98%) useful reads after trimming vector, adaptor and low-quality sequences and identifying and removing all possible contaminated sequences. Assembly of these ESTs produced 6,469 clusters, among which 2,721 non-redundant clones were selected for 3' end sequencing, yielding a total of 2,381 (87.5%) high quality 3' reads. For the four standard callus libraries, 26,112 randomly-selected clones were sequenced from the 5' end, generating 22,179 (85%) high quality EST sequences. In total, we have generated 93,756 high quality melon ESTs from the constructed cDNA libraries (Table 1) and the average length of these ESTs is 629.6 bp. The EST sequences have been deposited in GenBank and are also available at the Cucurbit Genomics Database [28].

Table 1 Description of melon cDNA libraries and summary of melon ESTs

Melon EST sequence assembly and annotation

The 93,756 high quality melon ESTs generated under this study, together with ~35,000 ESTs that are publicly available [24, 28, 29] and 173 published mRNA sequences, were assembled into a melon unigene build. The resulting assembly contained a total of 24,444 unigenes with an average length of 776.7 bp, among which 11,653 were contigs with an average length of 972 bp and 12,791 were singletons with an average length of 598.7 bp (Table 2). The distribution of the number of ESTs in each melon unigene is shown in Figure 1. A number of highly abundant genes could be identified, with 162 unigenes represented by over 100 ESTs. The most abundant genes in the combined set of libraries (> 500 ESTs) are listed in Table 3. Details of the melon EST assembly are available at the Cucurbit Genomics Database [28].

Table 2 Statistics of melon unigenes
Figure 1 Histogram of number of ESTs in each melon unigene.

Table 3 Most abundant melon unigenes (>500 EST members)

Putative functions of melon unigenes were assessed by comparing unigene sequences against the GenBank non-redundant (nr) protein database using the NCBI BLAST program. With an e value cutoff of 1e-5, a total of 19,359 (79.2%) melon unigenes had hits in the nr database, while 10,068 (41.2%) had hits at an e value cutoff of 1e-50. This indicated that a very high percentage of melon unigenes could be assigned a putative function. Those having no hits in the database are likely to include non-coding RNAs, genes whose sequences do not capture regions that contain conserved functional domains, or protein-coding genes that are novel in the database and/or are melon-specific.

We then further compared melon unigenes to the pfam protein domain database [30]. A total of 8,251 (33.8%) melon unigenes contained at least one pfam domain and a total of 2,206 distinct pfam domains were represented by these 8,251 melon unigenes. A similar analysis on the well-annotated Arabidopsis proteins (TAIR version 10) indicated that 3,272 pfam domains could be represented by the Arabidopsis proteome. This suggested that melon unigenes assembled in the present study captured a large portion (at least 70%) of genes in the melon genome. The most highly represented pfam domains in the melon unigene database included PF00069 (protein kinase; 144 unigenes), PF00076 (RNA recognition motif; 138 unigenes), PF07714 (protein tyrosine kinase; 108 unigenes) and PF00097 (Zinc finger, C3HC4 type; 103 unigenes).

Based on BLAST and pfam annotations, melon unigenes were further annotated with Gene Ontology (GO) terms. A total of 15,350 (62.8%) unigenes were assigned at least one GO term, among which 12,953 (53%) were assigned at least one GO term in the biological process category, 13,149 (53.8%) in the molecular function category and 12,420 (50.8%) in the cellular component category; while 9,927 (40.6%) melon unigenes were annotated with GO terms from all the three categories. Based on the GO annotations, putative gene functions of melon unigenes were classified into high-level plant specific GO slims [31] in each of the three categories. The most abundant GO slims within the biological process, molecular function, and cellular component categories were cellular process, binding, and membrane, respectively. In addition, a large number of melon unigenes appeared to be involved in plant responses to abiotic (1,534) and biotic (844) stimuli, flower development (347), and secondary metabolite process (603), or have transcription factor activities (519).

To gain insights into metabolism-related genes, we further predicted biochemical pathways from the melon unigenes and built a melon metabolic pathway database using the Pathway Tools software [32]. A total of 302 metabolic pathways, as well as 30 superpathways, were predicted from 3,543 enzyme-coding melon unigenes. Most primary and secondary metabolic pathways were well-represented by melon unigenes. The melon metabolic pathway database is freely available at the Cucurbit Genomics Database [28].

Quality assessment of melon full-length enriched cDNAs

As shown in Table 1, a total of 71,577 ESTs derived from full-length enriched cDNA clones were obtained in the present study. These ESTs were assembled into 6,848 unigenes, among which 6,469 contained 5' sequences of at least one full-length enriched cDNA clone. When the sequences of the 6,469 unigenes were compared by BLAST against the GenBank nr, SwissProt/TrEMBL and Arabidopsis (TAIR version 10) protein databases, 5,552 (85.8%) had significant hits (e value cutoff of 1e-05). Of these 5,552 unigenes, 4,668 (84.1%) had hits that began within five amino acids of the corresponding start sites. This indicated that a large portion of clones from full-length enriched cDNA libraries encoded full-length cDNAs.

We further generated 3' end sequences of more than 2,300 clones (Table 1) and ultimately obtained 2,162 clones that were sequenced from both the 5' and 3' ends, among which 1,538 (72.5%) had 5' and 3' sequences that were assembled into the same unigene. After removing redundancy, a total of 1,382 unigenes that contained 5' and 3' sequences of at least one full-length enriched cDNA clone were identified as melon full-length transcripts. The majority of the identified full-length transcripts contained overlapping 5' and 3' sequences from the same clone. The length distribution of melon full-length transcripts is shown in Figure 2A. The full-length transcripts ranged from 269 to 2,839 bp and their average size was 1,230 bp, which was shorter than previously reported for tomato (1,418 bp; [33]), Arabidopsis (1,445 bp; [34]), and soybean (1,539 bp; [35]), but longer than poplar (1,045 bp; [36]). We then predicted the complete protein-coding sequences (CDS) for the 1,382 melon full-length transcripts and were able to obtain CDS for 1,345 (97.3%) full-length transcripts. The remaining 37 could be non-coding RNAs or transcripts that did not contain full CDS. Indeed, we found that four transcripts (e.g., MU51348) did not contain a stop site. The average length of the predicted CDS was 814 bp, which was shorter than that of tomato (938 bp; [33]) and soybean (1,042 bp; [35]), but longer than poplar (649 bp; [36]) and maize (799 bp; [37]). The size distribution of melon CDS predicted from melon full-length transcripts is illustrated in Figure 2A. Overall, the average lengths of both melon full-length transcripts and CDS were shorter than those reported for full-length cDNAs of other plant species such as tomato [33], Arabidopsis [34], and soybean [35]. This is not unexpected since, as mentioned earlier, the majority of melon full-length transcripts were identified based on the overlap between 5' and 3' sequences of a single full-length cDNA clone.

Figure 2 Size distribution of cDNAs, CDS (A) and 5' and 3' UTRs (B) of melon full-length transcripts.

Based on the predicted CDS, we extracted 5' and 3' UTR sequences for each melon full-length transcript. The average lengths of melon 5' and 3' UTRs were 167 bp and 254 bp, respectively, which were very close to those of tomato (175 bp and 257 bp, respectively) and longer than those of other plant species except rice [33]. The length distributions of melon 5' and 3' UTRs are shown in Figure 2B, which were also largely similar to those of tomato [33].

We further examined codon usages of the 1,345 melon full-length transcripts and compared the codon usages to those of Arabidopsis coding sequences (TAIR version 10). The statistics of the complete codon usages of melon and Arabidopsis CDS are provided in Additional file 1. Overall codon usages of melon full-length transcripts were largely similar to those of Arabidopsis CDS. TGA, TAA, and TAG accounted for 44.9%, 37.2%, and 17.9%, respectively, of melon stop codons; and they accounted for 43.6%, 36%, and 20.4%, respectively, of Arabidopsis stop codons (Additional file 1). In addition, the GC content of melon coding sequences (45.61%) was also very close to that of Arabidopsis (44.14%). This, combined with the evidence described above, supported the high quality of melon full-length enriched cDNA libraries.

Comparative genomics analysis with other plants

To date, genome sequences of fourteen plant species have been published. These plant species are Arabidopsis [38], rice [39], poplar [40], grape [41], papaya [42], sorghum [43], cucumber [22], maize [44], soybean [45], Brachypodium [46], apple [47], castor bean [48], strawberry [49], and cacao [50]. Protein sequences of genes predicted from the fourteen plant genomes were downloaded from corresponding websites (Additional file 2). The 24,444 melon unigenes were then compared to these protein sequence databases using the NCBI BLAST (blastx) program. The complete comparative analysis results are shown in Additional file 3. At e value < 1e-05, approximately 85% of melon unigenes matched to proteins of cucumber, 75.4% to 79.2% of melon unigenes matched proteins of other dicot plants (Arabidopsis, poplar, apple, strawberry, cacao, grape, papaya, soybean, and castor bean), while 70.6% to 72.5% of melon unigenes matched proteins of monocot plants (rice, maize, sorghum, and Brachypodium). At a very stringent e value cutoff (e value < 1e-100), approximately 30% of melon unigenes matched cucumber proteins, 10.8% to 13.6% matched proteins of other dicot plants, and 7.9% to 8.5% matched proteins of monocot plants (Additional file 3). These matches represented the highly conserved proteins between melon and other plant species.

We constructed families of homologous proteins using OrthoMCL [51] from protein sequences translated from melon unigenes with ESTScan [52] and from a wide phylogenetic range of representative plant organisms including cucumber, Arabidopsis, rice, and grape. These four organisms were chosen for the OrthoMCL analysis because cucumber, like melon, belongs to the Cucurbitaceae family; grape, cucumber and some cultivars of melon (e.g., Piel de Sapo) bear non-climacteric fleshy fruits; and Arabidopsis and rice represent the model systems for dicot and monocot plants, respectively. As shown in Figure 3, the analysis revealed 6,972 gene families that were shared among all five species, which represented highly conserved gene families across dicot and monocot plants. We also identified 181 gene families that were specific to fleshy fruit-bearing plants (melon, cucumber, and grape), 1,192 families specific to the Cucurbitaceae family (melon and cucumber), and 220 specific to melon. Functional analysis of melon unigenes using GO terms revealed that the 6,972 melon gene families common to the other four plant species were highly enriched with GO terms related to cellular process, metabolic process, and biosynthetic process (Additional file 4). This is consistent with a previous report [50]. Gene families specific to fleshy fruits were significantly enriched with GO terms related to hormone-mediated signaling pathway, response to biotic stimulus, and regulation of metabolic processes (Additional file 4); all these biological processes have been reported to be related to fleshy fruit development [53]. Gene families specific to the Cucurbitaceae family were significantly enriched with GO terms related to responses to various stimuli including responses to hormone and chemical stimuli (Additional file 4). Both melon and cucumber have diverse floral sex types and have long served as the primary model systems for sex determination studies [54]. It has been reported that a number of environmental variables, such as light, temperature, water stress, and disease, as well as exogenous treatment with hormones or other growth-regulating substances, can directly influence floral sex determination [55, 56]. Results obtained from the OrthoMCL analysis indicated that cucurbit-specific gene families were enriched with such stimulus-responsive genes, which might play roles in floral sex determination. Further studies, of course, are required to test this hypothesis. Finally, we found that gene families specific to melon mainly encompassed genes of unknown function, which is consistent with findings reported in other plant species [50].

Figure 3 Venn diagram of ortholog group distribution in melon, cucumber, Arabidopsis, grape, and rice. Numbers in individual sections indicate the numbers of ortholog groups.

Tissue-specific melon gene expression

Melon cDNA libraries generated in the present study, as well as melon phloem EST libraries described in Omid et al. [29], were neither normalized nor subtracted; thus for these libraries, EST copy numbers can be used as an approximate estimation of gene expression levels in the corresponding tissues. The non-normalized and non-subtracted melon cDNA libraries were prepared from the following seven tissues: leaf, flower, fruit, phloem, cotyledon, callus, and root. Statistical analysis identified a total of 175 tissue-specific genes, among which 49, 39, 20, 25, 9, 15, and 18 were leaf, flower, fruit, phloem, cotyledon, callus, and root-specific, respectively (Additional file 5). Heatmap representation of expression profiles of these tissue-specific genes is shown in Figure 4. In most cases, genes expressed in specific tissues had putative functions or were involved in pathways known to be consistent with said tissue, e.g., leaf-specific genes were highly enriched with genes involved in photosynthesis, phloem-specific genes were highly enriched with genes encoding phloem filament proteins and phloem lectins, and callus-specific genes were highly enriched with genes involved in glycolysis, glucose metabolic process, hexose metabolic process, monosaccharide metabolic process, carbohydrate catabolic process, and alcohol metabolic process (Additional file 5). It is worth pointing out that some tissue-specific genes identified in leaf, cotyledon and root might be due to the infection of MNSV-Mα5. Indeed, functional analysis indicated that leaf, cotyledon and root-specific genes were enriched with GO terms such as response to stimulus and defense response (Additional file 5).

Figure 4 Heatmap representation of expression profiles of melon tissue-specific genes.

It is worth noting that one of the fruit-specific genes encoded 1-aminocyclopropane-1-carboxylate oxidase (ACO), the final enzyme in the biosynthesis of ethylene which is a plant hormone that regulates ripening of climacteric fruits [57]. Further detailed digital expression analysis of this gene (MU46283) revealed that, as expected, the gene was predominantly expressed in fruits of melon cultivars Dulce and Vedrantais, both of which are climacteric fruits; while none or very few ACO transcripts were detected in fruits of the two non-climacteric cultivars, PI161375 and Piel de Sapo T-111. In addition, two genes (MU45060 and MU46015) encoding acyl carrier proteins (ACPs) were highly and exclusively expressed in fruit tissues. ACPs are essential components of the fatty acid synthase complex and may be required to maintain the production of fruit aroma volatiles [58].

Interestingly, we found that genes involved in nucleosome and chromatin assembly (e.g., histones) and translation process (e.g., ribosomal proteins) were highly enriched in the list of flower-specific genes (Additional file 5). However, the exact role of these flower-specific genes in melon flower development remains unclear and further studies are required to clarify their functions in flower development.

Marker discovery from melon EST sequences

Molecular markers are valuable resources for constructing high-density genetic maps, facilitating crop breeding and identifying traits of interest. Early melon genetic maps mainly used markers of Restriction Fragment Length Polymorphism (RFLP), Amplified Fragment Length Polymorphism (AFLP), and Random Amplified Polymorphic DNA (RAPD). However these types of markers are not user friendly as they are either labor intensive to generate, harbor low rates of polymorphism in melon [59], or are not readily transferred to other genotypes and populations [60]. With the accumulation of sequence information in melon during the past several years, markers of simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) are becoming more widely used in construction of melon genetic maps. These markers have the following advantages: they are hypervariable, multiallelic, codominant, locus-specific, and evenly distributed throughout the genome [60], and for markers derived from ESTs, they are directly linked to expressed genes. The melon EST sequence information generated in this and other studies has served as a major resource to generate new molecular markers (mainly SSRs and SNPs). Several recently constructed melon high-density genetic maps have already utilized SSR and SNP markers derived from EST sequences generated in the present study [18, 19].

We first screened melon unigenes for the presence of di-, tri-, tetra-, penta- and hexa-nucleotide SSR motifs. We retrieved 4,068 SSR motifs in 3,279 melon unigenes. The major types of melon SSR motifs were tri-nucleotide, followed by di-nucleotide, tetra-nucleotide, penta-nucleotide and hexa-nucleotide (Table 4). The most frequent SSR motif was AAG/CTT (1,269; 31.2%), followed by AG/CT (1,134; 27.9%), AT/AT (364; 8.9%) and AAT/ATT (120; 2.9%). CG/CG (3) was the least frequent SSR motif identified in melon unigenes, possibly because CpG sequences are normally highly methylated, which may in turn inhibit transcription [61]. These statistics are in agreement with previous reports for other plant species [26, 62]. Primer pairs were designed for SSR motifs that had sufficient flanking sequences. The complete list of SSR motifs and their corresponding primer pair information is provided in Additional file 6.

Table 4 Statistics of melon simple sequence repeats (SSRs)

ESTs generated in this (Table 1) and other studies [24, 28, 29] were from a diversity of melon cultivars. We expected that SNPs would be enriched in the melon EST dataset. Using very stringent criteria (see Methods for details), we identified a total of 3,073 high-quality SNPs in 1,331 unigenes, among which 1,972 were transitions, 976 were transversions, and 125 were single-base insertions or deletions (Table 5). The most frequent SNPs were C to T transitions (1,108; 36.1%), followed by A to G transitions (864; 28.1%) (Table 5). The complete list of SNPs identified from melon ESTs is provided in Additional file 7. Detailed information including alignments of sequences containing each individual SNP is also available at the Cucurbit Genomics Database [28]. Both SSRs and SNPs identified in the present study represent an important resource for genetic linkage mapping and marker-assisted breeding in melon and closely related crops. As stated above, they have already been used for these purposes.

Table 5 Statistics of melon single nucleotide polymorphisms (SNPs)

Conclusion

We present the analysis of more than 71,000 and 22,000 melon ESTs from eleven full-length enriched and four standard cDNA libraries, respectively. These libraries were constructed from a range of tissues and melon genotypes. Analysis of approximately 1,400 melon full-length transcripts identified from this EST collection indicated that melon transcripts had 5' and 3' UTRs similar in size to those of tomato, but longer than those of the other dicot plants that we investigated. Comparative analysis between melon ESTs and other plant genomes allowed us to identify a number of highly conserved gene families across the plant kingdom, as well as gene families specific to fleshy fruit-bearing plants, to the Cucurbitaceae family, and to melon. Digital expression analysis identified genes showing significant tissue-specific expression, and this resource remains to be further exploited from the perspective of mining expression data. Furthermore, SSR and SNP markers were also identified in this melon EST collection, and recent research activities have begun to utilize these resources to construct high-density genetic maps [18, 19]. Overall, the availability of a large collection of melon ESTs from full-length enriched and standard cDNA libraries will not only facilitate the annotation of the melon genome, which is currently being sequenced by the Spanish Genomics Initiative, but also provide a valuable resource for further functional and comparative genomics analysis, and for future improvement of breeding programs of melon and closely related species.

Methods

Plant material

Fruits of the four genotypes were collected at four developmental stages: 10, 20, and 30 Days After Anthesis (DAA) and at the mature stage. The mature stage was determined based on the formation of the abscission zone in the two climacteric genotypes Dulce and Vedrantais (42 and 32 DAA, respectively) and based on highest Total Soluble Solids (TSS) for the two non-climacteric genotypes PI161375 and Piel de Sapo (42 and 45 DAA, respectively). Hermaphrodite flowers were collected on secondary axes at three developmental stages, C1, C3, and C5, which correspond to initial, medium and late developmental stages of flowers before anthesis, respectively (Caño-Delgado, unpublished). Specifically, C1 is the earliest stage, where the flowers are around 1 mm along the longitudinal axis; C3 is the stage where the future fruit shape is already defined and the first stamens are visible; and C5 is the stage just before anthesis (1-2 cm). MNSV-Mα5 infected cotyledons, leaves and roots were produced from melon cultivar Piel de Sapo T111 grown in a growth chamber with a 16-hour, 25°C light and 8-hour, 18°C dark regime. Specifically, nine-day-old cotyledons were inoculated mechanically with fresh MNSV-Mα5 inoculum and harvested after 4 days, when necrotic lesions started to appear with high incidence. Leaves and roots were harvested 10 and 8-10 days after inoculation with MNSV-Mα5, respectively. Undifferentiated callus growth was induced from cotyledon sections of the four cultivars (Dulce, Piel de Sapo T111, PI161375, and Vedrantais). Fifty seeds from each genotype were surface-sterilized in 70% ethanol for 2 min, followed by 1% (w/v) NaOCl with 0.1% (v/v) Tween-20 for 20 min, and rinsed three times with sterile distilled water. Under a dissecting microscope, seed coats were removed, a small incision was made in the integuments, and embryos were hydrated overnight in sterile distilled water. The embryo axis was removed from the de-coated seeds. Depending on the genotype, four to six transversal cotyledon sections were dissected from each seed and cultured in Petri dishes containing callus induction medium. Cultures were incubated in the dark at 28°C and subcultured every three weeks to fresh medium. Callus induction medium was MS (Murashige and Skoog) medium supplemented with 30 g/L sucrose, 8 g/L Bacto agar (Difco Laboratories, Detroit), 5 µM 2,4-dichlorophenoxyacetic acid (2,4-D), and 1 µM kinetin (6-furfurylaminopurine). Five months after initiation, 100 Petri dishes (10 cm wide), each with six to eight calli, had been produced from each genotype.

Total RNA preparation, cDNA library construction and cDNA clone sequencing

Total RNAs from callus and MNSV-infected tissues were extracted following the TRI-reagent (SIGMA) protocol, including two additional chloroform purification steps. Fruit total RNAs were prepared from slices of the fruit that included both flesh and rind using the protocol described by Portnoy et al. [25]. Melon flower total RNA was extracted from hermaphrodite flowers using TRIzol reagent (Invitrogen) and chloroform, following the protocol described by Cuperus et al. [63].

All RNA samples were subjected to one extra cleaning step on RNeasy columns (Qiagen) and purified on a poly(A) track system (Promega). For cDNA library construction, fruit and flower RNAs were each pooled by mixing equal amounts of RNA from each developmental stage. Full-length enriched cDNA libraries were constructed with the RNA Captor protocol, as described previously [64], and the four standard callus cDNA libraries were constructed using the pBluescript II XR cDNA Library Construction Kit (Stratagene) according to the manufacturer's instructions. A subset of clones was randomly selected from each cDNA library. Clones from full-length enriched cDNA libraries were sequenced at Genoscope (Evry, France) and those from standard cDNA libraries at the Arizona Genome Institute.

EST sequence processing, assembly, and annotation

The raw chromatogram files were base-called with phred [65]. Vector, adaptor and low-quality bases (a 20-bp window with an average error rate > 0.01) were trimmed from the raw EST sequences using LUCY [66]. The resulting sequences were then screened against the NCBI UniVec database, the E. coli genome, and melon ribosomal RNA sequences using SeqClean [67] to remove possible contamination by these sequences. Sequences shorter than 100 bp were discarded. The resulting high quality melon ESTs have been deposited in the GenBank dbEST database under accession numbers JG463773-JG557528 and are also available at the Cucurbit Genomics Database [28].
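The window-based quality criterion described above can be illustrated with a minimal Python sketch. This is not LUCY's actual algorithm, which is more elaborate; it only assumes that phred quality scores Q are converted to error probabilities as 10^(-Q/10) and that the read is cut down to the longest stretch in which every 20-bp window has an average error rate of at most 0.01. The function names and the handling of reads shorter than one window are assumptions made for illustration.

```python
def error_prob(q):
    """Convert a phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def trim_by_window(seq, quals, window=20, max_avg_error=0.01, min_len=100):
    """Return the longest region of seq in which every full window passes
    the average-error threshold, or None if that region is shorter than
    min_len (hypothetical simplification of window-based trimming)."""
    n = len(seq)
    if n < window:
        return None
    errs = [error_prob(q) for q in quals]
    # passing[i] is True when the window starting at base i is acceptable
    passing = [
        sum(errs[i:i + window]) / window <= max_avg_error
        for i in range(n - window + 1)
    ]
    best_start, best_len = 0, 0
    run_start = None
    for i, ok in enumerate(passing + [False]):  # sentinel closes the last run
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            # a run of passing window starts covers bases up to the end
            # of the last passing window
            region_len = (i - 1 + window) - run_start
            if region_len > best_len:
                best_start, best_len = run_start, region_len
            run_start = None
    if best_len < min_len:
        return None
    return seq[best_start:best_start + best_len]
```

For example, a 180-base read whose last 30 bases have quality 5 would be cut back to its first 150 bases, while a clean 50-base read would be discarded by the 100-bp minimum length filter.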

Melon ESTs were assembled into unigenes using iAssembler [68] with minimum overlap of 40 bp and minimum percent identity of 97. Melon unigene sequences were compared against GenBank non-redundant (nr) and UniProt [69] protein databases using the NCBI BLAST program with a cutoff e value of 1e-5. The unigene sequences were translated into proteins using ESTScan [52] and the translated proteins were then compared to pfam domain database [30] using HMMER3 [70]. Gene Ontology (GO) terms and plant-specific GO slim ontology [31] were assigned to each unigene based on terms annotated to its corresponding homologues in the UniProt database and domains in pfam database. Melon biochemical pathways were predicted from the unigenes using the Pathway Tools program [32] and a melon biochemical pathway database was constructed and is available at the Cucurbit Genomics Database [28].

Full-length transcript identification and analysis

Unigenes containing both 5' and 3' sequences of at least one clone from the full-length enriched cDNA libraries were identified as full-length transcripts. The complete CDS were identified using the getorf application in the EMBOSS package [71]. CDS were also identified based on the ESTScan translations and CDS identified from the two approaches were integrated. 5' and 3' UTRs were then extracted from each candidate full-length transcript. Codon usages were calculated with the cusp program in the EMBOSS package [71].
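The CDS and UTR extraction step can be sketched as follows. This is a simplified stand-in, not the EMBOSS getorf or ESTScan logic used in the study: it assumes the transcript is given in sense orientation, takes the longest forward-strand open reading frame running from the first ATG in a frame to the next in-frame stop codon, and treats everything upstream and downstream as the 5' and 3' UTR. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_cds(transcript):
    """Return (start, end) of the longest ATG..stop ORF on the forward
    strand (end exclusive, stop codon included), or None if no complete
    ORF is found."""
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


def split_utrs(transcript):
    """Split a transcript into (5' UTR, CDS, 3' UTR); None if no CDS."""
    cds = longest_cds(transcript)
    if cds is None:
        return None
    start, end = cds
    return transcript[:start], transcript[start:end], transcript[end:]
```

A transcript lacking an in-frame stop codon yields no CDS under this sketch, which mirrors the situation reported for the few full-length transcripts without a predicted complete CDS.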

Comparative genomics analysis

Melon unigenes were compared to protein databases of fourteen plant species whose genomes have been fully sequenced (Additional file 2) using the NCBI BLAST program with an e value cutoff of 1e-5. Furthermore, ortholog groups of protein sequences for melon (ESTScan translated proteins), Arabidopsis, rice, cucumber, and grape were identified using the OrthoMCL program, which performs an all-against-all BLAST comparison of protein sequences with subsequent Tribe-Markov clustering [51]. A Venn diagram showing the distribution of shared gene families among melon, Arabidopsis, rice, cucumber and grape was created with Venn Diagrams [72]. Enriched GO terms of melon unigenes in each list of specific ortholog groups were identified using GO::TermFinder [73] with corrected p values (False Discovery Rate (FDR); [74]) less than 0.05.

Identification of tissue-specific genes

All normalized or subtracted cDNA libraries (e.g., libraries described in González-Ibéas et al. [24]) were excluded from the digital expression analysis. Pair-wise comparisons between fruit, flower, callus, leaf, root, cotyledon (Table 1), and phloem [29] were performed with the R statistic described in Stekel et al. [75] to identify differentially expressed genes. Only genes with a total of at least five EST members in the two compared tissues were included in the analysis. Raw p values from the R statistic were corrected for multiple testing using the FDR [74]. A gene was identified as tissue-specific if it was significantly up-regulated (ratio > 2 and FDR < 0.05) in a tissue when compared to all other tissues. Enriched GO terms in each list of tissue-specific genes were identified using GO::TermFinder [73], requiring p values adjusted for multiple testing (FDR) to be less than 0.05.
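As an illustration of the statistics involved, the sketch below computes the log-likelihood ratio form of the R statistic for one gene across libraries, R = sum over libraries of x_i * ln(x_i / (N_i * f)), where x_i is the EST count of the gene in library i, N_i is the library size and f is the pooled frequency of the gene, together with a Benjamini-Hochberg FDR adjustment. How raw p values are derived from R is not shown, and the helper `is_tissue_specific` with its list of (ratio, FDR) pairs is a hypothetical encoding of the criteria stated above, not the authors' implementation.

```python
import math


def r_statistic(counts, lib_sizes):
    """R statistic for one gene: sum of x_i * ln(x_i / (N_i * f)),
    where f is the gene's pooled frequency across all libraries."""
    total = sum(counts)
    if total == 0:
        return 0.0
    f = total / sum(lib_sizes)
    return sum(
        x * math.log(x / (n * f))
        for x, n in zip(counts, lib_sizes)
        if x > 0  # libraries with zero counts contribute nothing
    )


def benjamini_hochberg(pvals):
    """Return FDR-adjusted p values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_tissue_specific(pairwise, min_ratio=2.0, max_fdr=0.05):
    """pairwise: (expression ratio, FDR) for the tissue of interest
    against every other tissue; all comparisons must pass."""
    return all(ratio > min_ratio and fdr < max_fdr for ratio, fdr in pairwise)
```

A gene with identical frequencies in two equally sized libraries gives R = 0, and R grows as the counts become more unevenly distributed across libraries.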

Identification of SSRs and SNPs

SSRs in melon unigene sequences were identified using the MISA program [76]. The minimum repeat number was six for dinucleotide and five for tri-, tetra-, penta- and hexa-nucleotide motifs. Primer pairs flanking each SSR locus were designed using the Primer3 program [77].
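The repeat thresholds above can be made concrete with a short regular-expression sketch. This is not MISA itself; it only assumes perfect (uninterrupted) repeats, applies the stated minimum repeat numbers, and skips motifs that are themselves repeats of a shorter unit (for example, ATAT is reported as the dinucleotide AT rather than as a tetranucleotide). The names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# motif length -> minimum number of repeats, as stated in the text
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit
    (e.g. 'AA' or 'ATAT' are not primitive)."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return a sorted list of (start, motif, repeat_count) for perfect
    di- to hexa-nucleotide repeats meeting the minimum repeat numbers."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # one copy of the motif followed by at least (min_rep - 1) more
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // unit))
    return sorted(hits)
```

Note that this sketch reports the motif as it first appears in the sequence and does not group reverse-complementary motifs (such as AG and CT) into one class, as the SSR statistics in the Results do.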

SNPs in the cDNA sequences between different melon cultivars were identified with PolyBayes [78], which takes into account both the depth of coverage and the quality of the bases. To further eliminate errors introduced by PCR amplification during the cDNA synthesis step and to distinguish true SNPs from allele differences, we filtered the PolyBayes results and kept only SNPs meeting both of the following criteria: 1) at least 2X coverage at the potential SNP site for each cultivar; 2) no bases in common at the potential SNP site between the two compared cultivars. Detailed information on all melon SSRs and SNPs is freely available at the Cucurbit Genomics Database [28].
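The two filtering criteria can be expressed as a small predicate. The sketch below assumes that, for a candidate site, the base calls from the reads of each cultivar have already been collected into two lists; this data layout and the function name are assumptions for illustration, not part of PolyBayes.

```python
def passes_snp_filter(bases_a, bases_b, min_coverage=2):
    """Apply the two post-PolyBayes criteria to one candidate site.

    bases_a, bases_b: base calls at the site from reads of the two
    compared cultivars (hypothetical input layout).
    """
    # criterion 1: at least 2X coverage in each cultivar
    if len(bases_a) < min_coverage or len(bases_b) < min_coverage:
        return False
    # criterion 2: the two cultivars share no base at this site
    return set(bases_a).isdisjoint(bases_b)
```

Under these rules a site with C, C in one cultivar and T, T, T in the other is kept, whereas a site where one cultivar shows both C and T is rejected as a possible allelic difference or error.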