Background

Many biological systems rely on symbiotic interactions between different organisms. One of the most dramatic examples is the coral reef ecosystem, which has at its heart a mutualistic partnership between corals and endosymbiotic, dinoflagellate algae. The dinoflagellates are classified in a single genus, Symbiodinium, but molecular methods have revealed a high genetic diversity in this genus [1, 2]. The onset of these symbioses has been shown to display flexibility, but a range of specificity, i.e. from highly flexible to highly specific, is apparent during their maintenance [3–8]. This process is likely to involve early recognition mechanisms [9, 10] and an evasion of the hosts' digestive and immune systems [11], as well as adaptations to diverse ecological niches [12, 13] and physiological acclimation [14, 15]. There have also been controversial discussions of whether Symbiodinium populations may shift toward more heat-tolerant types as a consequence of thermal stress ("bleaching") in order to adapt to environmental changes [16–18] such as increasing seawater temperatures. In light of global climate change, this subject, i.e. cnidarian bleaching, has received much attention as devastating mass bleaching events have increased both in frequency and geographic extent [19]. Nonetheless, our knowledge of the underlying cellular and molecular mechanisms that facilitate the recognition between the partners, and determine the specificity, dynamics, and collapse of cnidarian-dinoflagellate symbioses, is limited.

The cellular and molecular interactions between host and symbiont cells are important targets for genetic and genomic dissection, but corals are notoriously difficult to work with. For example, corals form large, slow-growing colonies that are difficult and costly to maintain in the laboratory, and their handling for microscopy and amenability to other cell biological, biochemical, and genetic methods is complicated by the calcareous skeleton precipitated by reef-building corals. What is needed to make rapid advances in this field is a model system that possesses the key characteristics of coral symbiosis, but allows more facile laboratory investigation (for a detailed review see [20]). The sea anemone Aiptasia represents a good candidate system [20], as it possesses the same mutualistic relationship with Symbiodinium spp., but lacks the calcareous skeleton that hinders cellular-level work. It is widely distributed, and found in shallow tropical marine environments worldwide. Sequence characterized amplified region (SCAR) data indicate that the vast majority of Aiptasia worldwide (encompassing two described species, A. pallida and A. pulchella) appear to be genetically homogeneous (Santos Lab at Auburn University, pers. comm.). The one exception is a closely related, but genetically distinct, lineage potentially restricted to the Florida Keys. Data from the Santos Lab also indicate that natural populations of Aiptasia from the Florida Keys preferentially host Symbiodinium spp. belonging only to clade A or to both clades A and B, whereas those from the remaining global range host clade B exclusively. Typically considered a pest organism by seawater aquarists, Aiptasia is hardy and proliferates rapidly by asexual reproduction. Individual polyps can be maintained in a symbiotic or aposymbiotic state (i.e., with and without symbionts, respectively), experimentally re-infected with a variety of Symbiodinium strains [21, 22], and cultured at low cost [23].
In fact, numerous studies have addressed symbiosis-related questions using A. pallida and its sister species A. pulchella by applying multiple tools ranging from microscopy to RNA-interference methods [24–29]. The generation of genomic resources for Aiptasia would therefore greatly advance research addressing the understanding of symbioses at a molecular, cell-biological, and genomic level.

As a cost-effective alternative to sequencing the genome of an organism, the generation and analysis of expressed-sequence-tag (EST) libraries provides an efficient method for discovering novel genes, estimating gene content, and approximating levels of gene expression. Once established, these resources can be utilized for comparative genomics studies or the construction of gene expression microarrays [30]. Among cnidarians, the extensive genomic resources now available for the non-symbiotic sea anemone Nematostella vectensis have opened new perspectives on the study of basal metazoans [31], and several EST resources have been generated for symbiotic cnidarians (predominantly corals) and Symbiodinium spp. [32–34]. However, to date, only one small-scale project has generated ESTs (N = 870) for the symbiotic anemone Aiptasia pulchella [35].

In this study, we report the generation and analysis of 10,285 high-quality ESTs from a clonal population of Aiptasia pallida that hosts Symbiodinium clade A and was likely derived from an individual of the Florida Keys lineage. The ESTs were processed through a software pipeline [36], resulting in a user-friendly, queryable, web-accessible database named AiptasiaBase. A BLASTx-based approach was used to estimate the relative contributions of each partner to the mixed cDNA library, and we were able to identify numerous genes involved in key processes of cnidarian-dinoflagellate symbioses.

Results and Discussion

EST library construction and assembly

A total of 6,448 cDNA clones were bi-directionally sequenced, resulting in 12,896 raw chromatograms, which served as input for the processing pipeline. After base calling by phred [37], Lucy [38] discarded 2,556 low-quality sequences, short or insert-less sequences, and vector or polyA-only sequences. An additional 55 sequences were removed by seqclean [39], leaving 10,285 high-quality ESTs (from 5,450 cDNA clones) for further processing (success rate ~80%). Assembly of these ESTs by cap3 [40] resulted in the generation of 1,427 contigs, which ranged from 112 to 3,440 bp in length and contained 2 – 259 ESTs (mean: 4.8). Together with the remaining 3,498 singletons, a total of 4,925 unique sequences (UniSeqs) were generated. Because of the possibility that two (or more) UniSeqs originated from the same transcript, we also estimated the number of unique genes (unigenes) in our dataset by assembling only the reverse reads of the directionally cloned cDNAs. The resulting estimate of 2,564 unigenes compared to the 4,925 UniSeqs is likely to reflect the large average size (1.95 kb) of inserts in the cDNA library; thus, in many cases, UniSeqs represent the 3' and 5' ends of genes for which the central parts were not captured due to Sanger-sequencing length limitations (600–800 bp). In addition, different splice variants or alleles of the same gene may have contributed to the excess of UniSeqs over unigenes. Detailed pre-assembly statistics are summarized in Additional file 1: Quality control and assembly statistics.
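The summary figures in this paragraph follow from one another by simple arithmetic. The short sketch below re-derives them from the counts reported above; the variable names are illustrative only and are not part of the EST2uni pipeline.

```python
# Re-derive the assembly summary figures from the counts reported in the text.
# All variable names are illustrative; only the numbers come from the paper.
clones_sequenced = 6_448
raw_reads = clones_sequenced * 2            # bi-directional sequencing
discarded_by_lucy = 2_556
removed_by_seqclean = 55
contigs = 1_427
singletons = 3_498

high_quality_ests = raw_reads - discarded_by_lucy - removed_by_seqclean
success_rate = high_quality_ests / raw_reads
uniseqs = contigs + singletons
ests_in_contigs = high_quality_ests - singletons
mean_ests_per_contig = ests_in_contigs / contigs

print(f"raw chromatograms:     {raw_reads}")
print(f"high-quality ESTs:     {high_quality_ests}")
print(f"success rate:          {success_rate:.1%}")
print(f"UniSeqs:               {uniseqs}")
print(f"mean ESTs per contig:  {mean_ests_per_contig:.1f}")
```

The mean of 4.8 ESTs per contig is obtained by dividing the ESTs that were not left as singletons by the number of contigs.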

Previously, a small-scale EST project was conducted in order to compare the abundance of transcripts between symbiotic and aposymbiotic Aiptasia pulchella polyps [35]. The present study included bi-directional sequencing, and the total number of ESTs (10,285) is more than 11 times larger than in the earlier study (870). Therefore, the availability of almost 5,000 UniSeqs for about 2,500 unigenes represents a rich transcriptomic resource, previously unavailable at this scale, for a symbiotic anemone.

Annotation of unique sequences and implementation of AiptasiaBase

All UniSeqs were assigned putative identities based on BLASTx hits (E-value cutoff: 1e-5) to the UniProt Knowledgebase databases SwissProt and TrEMBL [41]. About 37% and ~63% of the UniSeqs found hits in SwissProt and TrEMBL, respectively, leaving ~36% of the UniSeqs without similarities to known proteins. Because the TrEMBL database contains protein sequences based on conceptual translations of all nucleotide sequence entries in EMBL/GenBank/DDBJ, we chose to annotate the UniSeqs according to the curated SwissProt entries. Assignments of gene ontologies (GO) could be made for about one third of UniSeqs in each of the GO categories: molecular function, biological process, and cellular component. Because our cDNA library represents the symbiotic, adult life-history stage of A. pallida, the GO resource generated in this study sets the stage for statistical assessments of over- or under-representation of specific GO-categories in libraries obtained from anemones under different conditions such as life-history stages, symbiotic state (symbiotic vs. aposymbiotic), or environmental conditions (temperature, salinity, nutrients, etc.). In addition to BLAST and GO annotations, all UniSeqs were screened for single-nucleotide polymorphisms (SNPs) and simple-sequence repeats (SSRs), providing resources for the investigation of gene polymorphisms between individuals and/or populations. The prediction of open reading frames within UniSeqs also provided the basis for domain annotations at the protein level. About 25% of UniSeqs matched a protein domain entry in the Pfam database [42].
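As an illustration of the kind of over-representation assessment mentioned above, the following minimal sketch computes a one-sided hypergeometric P-value for a single GO category. The choice of test is an assumption made for illustration, since no specific test is named here, and the function name and all counts are hypothetical.

```python
from math import comb


def go_overrepresentation_p(k: int, K: int, n: int, N: int) -> float:
    """One-sided hypergeometric P-value, P(X >= k).

    N: annotated UniSeqs in the background set
    K: background UniSeqs carrying the GO term
    n: annotated UniSeqs in the test set (e.g. another library)
    k: test-set UniSeqs carrying the GO term
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts, chosen only to show the call; they are not data from this study.
p = go_overrepresentation_p(k=12, K=40, n=200, N=1600)
print(f"P(X >= 12) = {p:.3g}")
```

In practice, such P-values would need correction for multiple testing across GO categories before any category is interpreted as over- or under-represented.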

One of the primary challenges of sequencing ESTs from a mixed transcriptome originating from two or more partners is to assign sequences to the proper genome of origin. Taking a bioinformatic approach to this problem, we constructed taxon-specific databases representing either "Cnidaria-only" or "Alveolata-only" (i.e., dinoflagellates and their relatives) sequences from GenBank, and then performed BLASTx-searches against those databases as well as the complete non-redundant database (see Methods). We then employed a best-BLASTx-hit (BBH) approach (Additional file 2: Flow diagram illustrating BBH approach) to estimate the numbers of ESTs that originated from A. pallida and Symbiodinium spp., respectively, at various levels of confidence (Table 1). Irrespective of the confidence level, about one quarter of ESTs had no BLASTx-hit (E-value cutoff 1e-5). At the different levels of confidence, 56 – 70% and 1.7 – 6.4% were predicted to originate from the anemone and the Symbiodinium genomes, respectively (Additional file 3: Detailed EST (N = 10,285) distribution and assignment). The relatively small fraction of Symbiodinium ESTs could be expected given that Symbiodinium spp. are spatially restricted to the endodermal tissue layer of the host and that no special effort was made to disrupt the algal cell walls during the preparation of the RNA (see Methods). Furthermore, the number of UniSeqs without a significant BLASTx hit may be higher for Symbiodinium transcripts. However, the uncertainty about the origin of non-annotated sequences represents a current limitation to our approach. Data from ongoing and future genome-sequencing projects for symbiotic cnidarians and their dinoflagellate endosymbionts should soon become available and help to uncover the origins of sequences without currently known homologs in other organisms. This will provide an interesting opportunity to revisit our data set and look further at these perhaps taxonomically restricted genes.

Table 1 Predicted genome-of-origin for contigs and singletons of the holobiont

Using the EST-processing software EST2uni [36], we stored all ESTs, UniSeqs, and annotations in a queryable database named AiptasiaBase (database = AiptasiaBase_v1), which is accessible through the URL: http://aiptasia.cs.vassar.edu/AiptasiaBase/index.php. In addition to the results generated by the software, we have included the annotation of UniSeqs according to KEGG, which provides a convenient way to explore pathway components that were identified in this study.

Analysis of the most highly abundant transcripts

We identified the contigs containing the greatest numbers of ESTs, which we used as a proxy for the most abundant transcripts. Although the numbers of ESTs in contigs that are predicted to originate from Symbiodinium were too low to be analyzed (data not shown), many of the most abundant host-derived transcripts represented genes that are involved in the processes of protein biosynthesis, extracellular-matrix formation, and oxidative-stress response (Table 2).
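Using EST counts per contig as an abundance proxy amounts to a simple tally. The sketch below shows one way this ranking could be done; the function name, the contig identifiers, and the EST-to-contig assignments are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_contigs(est_to_contig: Dict[str, str], total_ests: int,
                 top_n: int = 10) -> List[Tuple[str, int, float]]:
    """Rank contigs by the number of ESTs they contain.

    Returns (contig, EST count, percentage of all ESTs) tuples,
    most abundant first.
    """
    counts = Counter(est_to_contig.values())
    return [(contig, n, 100.0 * n / total_ests)
            for contig, n in counts.most_common(top_n)]


# Hypothetical assignments of EST identifiers to contigs.
example = {
    "est1": "contigA", "est2": "contigA", "est3": "contigA",
    "est4": "contigB", "est5": "contigB",
    "est6": "contigC",
}
for contig, n, pct in rank_contigs(example, total_ests=len(example), top_n=2):
    print(f"{contig}\t{n}\t{pct:.1f}%")
```

Because library preparation and size selection influence which transcripts are cloned, such counts remain only a rough approximation of transcript abundance.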

Table 2 Most highly expressed genes predicted to originate from the host genome

Kuo et al. (2004) reported that the most highly expressed gene in symbiotic A. pulchella was ferritin (11.7%), whereas we found only 4 ESTs (0.04%) that represented this gene. Although differences in the preparations of the cDNA libraries (e.g. insert size-selection) and sequencing depths (474 vs. 10,285 ESTs) pose an obstacle for a direct comparison, the discrepancy in the numbers of ferritin transcripts appears to be noteworthy. In a recent study that investigated the effect of increased temperature and UV levels on the symbiotic anemone Anthopleura elegantissima, Richier et al. (2008) observed a more than 17-fold up-regulation of ferritin expression upon thermal stress, but not UV stress [43]. Given this observation, it seems possible that the anemones in the study by Kuo et al. (2004) were under elevated thermal stress at the time of sampling, which, taken together with the methodological differences mentioned above, makes any further comparative analyses unfeasible. This comparison illustrates how the culturing conditions of organisms, as well as methodological differences between studies, may affect the transcriptome and, by extension, the interpretation of gene-expression analyses.

The highly abundant sequences with the highest uncertainties for correct annotation (highest E-values), apolipophorin and the CUB and zona pellucida-like domain-containing protein 1, were further scrutinized by similarity searches in additional databases. These searches revealed that the best hit for CUB and zona pellucida-like domain-containing protein 1 in the GenBank non-redundant database (nr) was mesoglein, a protein that is proposed to be a structural element of the extracellular matrix of the mesoglea in the jellyfish Aurelia aurita [44]. The sequence annotated as apolipophorin contains a von Willebrand factor type-D domain, and was reported to be involved in forming lipoprotein particles that bind lipoproteins and lipids [45]. Two other highly abundant sequences had no homologs among previously characterized proteins, suggesting that they are novel, and a third contig with no BLASTx-hit was identified as an artifact due to misassembled sequences. These results illustrate some of the caveats to automated sequence assembly and annotation and highlight the necessity for corroboration after automated sequence processing when focusing on single genes or groups of genes of interest.

Candidate genes with potential relevance to cnidarian-dinoflagellate symbioses

We generated a candidate gene list of groups containing UniSeqs that are likely to be of relevance to cnidarian-dinoflagellate symbioses (Table 3). Among these, the cellular antioxidant-response system could be most comprehensively reconstructed (see below). Genes related to the innate immune system and sugar-binding proteins gave rise to a partial gene inventory (Fig. 1; Table 3). Other genes that are likely to play a role in the cellular events surrounding the breakdown of symbiosis (exocytosis, host-cell detachment, apoptosis and/or autophagy [46–52]) were also identified.

Figure 1

Illustration of genes and pathways known or likely to be involved in cnidarian-dinoflagellate symbioses. Genes that were identified or missing in the EST library are highlighted by solid red lines or dashed black lines, respectively. Pathways or cellular processes that were partially represented are highlighted by dashed red lines. APx – ascorbate peroxidase, ATG – autophagy-related protein, AIF – apoptosis-inducing factor, CASP – caspases, CAT – catalase, DRAM – damage-regulated autophagy modulator, GCLC – glutamate-cysteine ligase catalytic subunit, GGT – gamma-glutamyltranspeptidase, GPx – glutathione peroxidase, GR – glutathione reductase, GS – glutathione synthetase, GST – glutathione-S-transferase, HSP – heat-shock protein, IAP – inhibitor of apoptosis, Prx – peroxiredoxin, SOD – superoxide dismutase, Sym – Symbiodinium spp.

Table 3 Potential symbiosis-related genes identified from the Aiptasia transcriptome

Stress-induced photoinhibition and damage to algal photosystem II are thought to be responsible for an increased production of reactive oxygen species [53, 54] and, consequently, diffusion of hydrogen peroxide (H2O2) through the membranes into the host cells [55]. The detoxification of H2O2 requires the activity of catalase or other peroxidases. Superoxide dismutase (SOD), which catalyzes the dismutation of superoxide to H2O2 and O2, and glutathione peroxidase (GPx), which uses glutathione to detoxify H2O2, were not found among the sequenced ESTs. One possibility is that the abundance of SOD transcripts in host cells was low, and the generation of superoxide spatially limited (inside the chloroplasts of Symbiodinium). In this case, superoxide may have been efficiently eliminated within the Symbiodinium cells, while excess H2O2 that was not detoxified (e.g., by Symbiodinium ascorbate peroxidase) could have diffused into the host cytosol and been converted to H2O and O2 by catalase. Alternatively, methodological factors such as insert-size selection or general RNA processing may have prevented the detection of SOD. Other genes that had previously been reported in the context of cnidarian-dinoflagellate symbioses (Additional file 4: Genes that have been studied in the context of cnidarian-dinoflagellate symbiosis, but not found in this study) were also not detected, perhaps for the same reasons as discussed above for SOD.

Conclusion

By analyzing >10,000 high-quality ESTs and generating a comprehensive database for the user community, we have provided a foundation of transcriptomic resources for a symbiotic anemone that is becoming an important model system for studying coral-dinoflagellate symbioses. The set of sequences identified constitutes a rich source of candidate genes that are likely to be involved in processes related to the onset, maintenance, and breakdown of symbiosis. In this context, we were able to reconstruct the oxidative-stress response, which we also found to be prominent during basal transcription. At the current depth of sequencing, we have identified two problems, namely (1) that some transcripts are represented by two (or more) contigs and (2) that we lack information on transcripts of low abundance. These issues will be addressed in the near future by using 454 sequencing, which, for example, has been successfully applied to sequence the coral larval transcriptome of Acropora millepora at 3 × coverage [56].

Methods

Generation and sequencing of a cDNA library from Aiptasia pallida and its dinoflagellate symbiont

A clonal line of Aiptasia pallida (clone CC7, available through the Pringle lab) hosting Symbiodinium of clade A was established from a single tiny propagule in a population obtained from Carolina Biological Supply (Burlington, NC) and grown into an abundant stock. Given the Symbiodinium clade harbored by this population, it is likely that the Aiptasia individual originated from the Florida Keys lineage. Approximately 500 anemones of various sizes were harvested from this stock under normal growth conditions (~26°C; salinity, ~33 ppt; light, ~40 μmol m⁻² s⁻¹ photosynthetic photon flux; 12-h light-dark cycle), blotted to remove excess water, and immediately frozen in liquid nitrogen. The anemones were then ground to a fine powder under liquid nitrogen using a ceramic mortar and pestle. The powder was weighed (~4 g) while still frozen and mixed with a proportional volume (50 ml) of TRIzol Reagent (Invitrogen, Carlsbad, CA); extraction was then performed in accordance with the manufacturer's instructions, yielding ~5 mg of total RNA. This RNA was sent to Open Biosystems (Huntsville, AL), where it was tested for quality; mRNA was then isolated using oligo(dT)-coated magnetic particles (Seradyn, Indianapolis, IN), and cDNA was synthesized. Double-stranded cDNA was size fractionated to enrich for long inserts, cloned into the vector pExpress1 (Express Genomics, Frederick, MD), and electroporated into E. coli strain DH10B. The resulting library was determined to contain ~96% recombinants with an average insert size of 1.95 kb. Sequencing was performed on 96-well capillary sequencing platforms (ABI 3700) at the DOE Joint Genome Institute (JGI, Walnut Creek, CA) and at the Genome Core Facility at the University of California, Merced, CA, USA.

Processing of ESTs and implementation of AiptasiaBase

Raw chromatogram files were used as input for the software pipeline EST2uni [36], which was implemented on an Ubuntu server (8.04 "Hardy Heron", Dual Intel Xeon 3.06 GHz) to generate the database named AiptasiaBase [57]. During the pipeline processing, raw EST reads were base-called by phred [37], and quality filtered and vector trimmed by the software Lucy [38]; low-complexity regions and repetitive elements were then removed by seqclean [39] and RepeatMasker [58], respectively. To remove unexpected vector sequences, seqclean additionally screened the processed ESTs using NCBI's UniVec database. All ESTs are available through GenBank accession numbers GH571982 – GH582266.

Clustering of processed ESTs was performed by cap3 [40] with default settings, resulting in unique sequences (UniSeqs), for which open reading frames were predicted by ESTScan [59]. Similar UniSeqs were identified using BLASTn [60] and grouped into clusters. Simple-sequence repeats (microsatellites) and sequence variations were predicted by Sputnik [61] and local algorithms [36], respectively. All UniSeqs were functionally annotated by BLASTx searches [60] in the protein databases nr (GenBank – NCBI), TrEMBL, and SwissProt (UniProt) [62]; HMMER [63] searches in Pfam [42]; and GO-term associations (UniProt GOA, March 2008) [64]. The number of unique genes was estimated by clustering all reverse reads using the cap3 software with default settings.

BLAST-based prediction of UniSeq origin and KEGG annotation

In order to predict whether an EST originated from Aiptasia pallida or Symbiodinium spp., we used a best-BLASTx-hit (BBH) approach (Additional file 2: Flow diagram illustrating BBH approach). First, all UniSeqs were BLASTx-searched (E-value cutoff: 1e-5) in a non-redundant protein database (nr, GenBank, NCBI). If the BBH was from a cnidarian or an alveolate species, the sequence was predicted to originate from Aiptasia pallida or Symbiodinium spp., respectively, with high confidence. Next, if the BBH was not from a cnidarian or alveolate species, we compared the E-values for the BBHs from searches against nr databases that were previously filtered for sequences from cnidarian (582,480 sequences) or alveolate (468,072 sequences) species. The organism for which the E-value was lower was assigned to the corresponding UniSeq with medium confidence. Finally, if the E-values for BBH searches in the cnidarian and alveolate databases were equal, we compared the percentage of identical amino acids in the sequence alignments. As in the E-value-based approach, the organism with the higher percentage of identical amino acids was assigned to the corresponding UniSeq (low confidence). In addition to the annotations described above, we used the Automatic Annotation Server provided by the Kyoto Encyclopedia of Genes and Genomes (KEGG) for all UniSeqs using the single-directional best-hit option.
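The three-tier decision procedure described above can be summarized in a short sketch. It is an illustration of the logic only, not the code used in this study: the Hit record, the taxon labels, and the function name are hypothetical, and the handling of cases that the description leaves open (a missing hit in one taxon-specific database, or a complete tie) is an assumption flagged in the comments.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

E_VALUE_CUTOFF = 1e-5          # cutoff stated in the text
HOST = "Aiptasia pallida"
SYMBIONT = "Symbiodinium spp."


@dataclass
class Hit:
    """Best BLASTx hit for one UniSeq in one database (hypothetical record)."""
    taxon: str            # "cnidaria", "alveolata", or "other" (illustrative labels)
    evalue: float
    pct_identity: float


def _usable(hit: Optional[Hit]) -> bool:
    # Assumption: a hit counts only if its E-value is at or below the cutoff.
    return hit is not None and hit.evalue <= E_VALUE_CUTOFF


def predict_origin(nr_hit: Optional[Hit],
                   cnidaria_hit: Optional[Hit],
                   alveolata_hit: Optional[Hit]) -> Tuple[str, str]:
    """Return (predicted origin, confidence level) for one UniSeq."""
    if not _usable(nr_hit):
        return ("unassigned", "no hit")
    # Tier 1: the best hit in nr is itself cnidarian or alveolate.
    if nr_hit.taxon == "cnidaria":
        return (HOST, "high")
    if nr_hit.taxon == "alveolata":
        return (SYMBIONT, "high")
    # Tier 2: compare E-values of the best hits in the taxon-filtered databases.
    c_ok, a_ok = _usable(cnidaria_hit), _usable(alveolata_hit)
    if not (c_ok or a_ok):
        return ("unassigned", "no taxon-specific hit")   # assumption
    # Assumption: a missing hit is treated as an infinitely poor E-value.
    c_e = cnidaria_hit.evalue if c_ok else float("inf")
    a_e = alveolata_hit.evalue if a_ok else float("inf")
    if c_e < a_e:
        return (HOST, "medium")
    if a_e < c_e:
        return (SYMBIONT, "medium")
    # Tier 3: equal E-values, so fall back to percent identity.
    if cnidaria_hit.pct_identity > alveolata_hit.pct_identity:
        return (HOST, "low")
    if alveolata_hit.pct_identity > cnidaria_hit.pct_identity:
        return (SYMBIONT, "low")
    return ("unassigned", "tie")                          # assumption


# Example: best nr hit is from another phylum; the cnidarian database gives the lower E-value.
print(predict_origin(Hit("other", 1e-20, 60.0),
                     Hit("cnidaria", 1e-15, 55.0),
                     Hit("alveolata", 1e-10, 50.0)))
```

Equal E-values arise most often when both searches return an E-value of 0 for very strong alignments, which is why the percent-identity comparison is needed as a final tie-breaker.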