Background

Venomous snakes of the families Viperidae and Elapidae possess paired specialized venom glands located in the upper jaw ventral and posterior to the eyes [1] that produce an arsenal of toxins [2, 3], which they inject into prey tissues through high-pressure delivery fangs [4]. Within the reptile clade Toxicofera, venom was a single ancient innovation [5]. Snake venom toxins are the result of recruitment events by which ordinary genes were duplicated, and the new genes selectively expressed in the venom gland and amplified to multigene families with extensive neofunctionalization throughout ~100 million years of evolution [5, 6]. Given the central role that diet has played in the adaptive radiation of snakes [7], venom thus represents a key adaptation that has played an important role in the diversification of snakes.

Envenoming by snakebites constitutes a highly relevant, though neglected, public health issue on a global basis [8], as there are venomous organisms in every continent and almost every country. However, venomous animals are particularly abundant in tropical regions, the kitchen of evolution. Arthropod stings constitute the most common cause of envenoming by animals, although around 80% of the more than 150.000 yearly deaths by envenomings worldwide are caused by snakebites, followed by scorpion stings, which cause 15% [9, 10]. The venoms of extant snakes comprise complex cocktails of proteins tailored by Natural Selection to act on vital systems of the prey [11]. Medical uses of venoms are well documented in folk remedies and in Western and Chinese traditional medicine [12]. However, despite their remarkable potency and high degree of target specificity, only in the last decades have toxins been increasingly used as pharmacological tools, and it has been realized that venoms represent a vast and essentially untapped resource of preoptimized lead molecules for the medicinal chemist [1217].

Adequate treatment of snakebites is critically dependent on the ability of antivenoms to neutralize the lethal and tissue-damaging toxins, reversing thereby the signs of envenoming [18, 19]. A long term research goal of venomics, of applied importance for improving current antivenom therapy, but also for drug discovery, is to understand the molecular mechanisms and evolutionary forces that underlie the enormous pharmacological potential of venoms [12]. Individually or combined, proteomic and transcriptomic studies have demonstrated their feasibility to explore in depth the molecular diversity of venoms [[2028], and references therein]. In the absence of genome sequence, transcriptomes represent also valuable searchable databases for proteomic projects.

Since the pioneer report by Ho and co-workers in 1995 [29], snake venom transcriptomic studies have relied on sequencing DNA clones randomly picked from a cDNA library constructed by reverse transcription of the RNA molecules expressed in the venom gland [25]. The partial cDNA sequences derived from expressed genes, also known as Expressed Sequence Tags (ESTs) [30], cluster into groups of contiguous sequences (contigs), which eventually cover the entire extension of the original RNA molecule. In addition, the number of ESTs clustered into a contig is proportional to the transcriptional level of the parent RNA in the venom gland [25]. However, in the few instances in which transcriptomics and proteomics databases have been compared [26, 27, 31, 32], a low degree of concordance has been reported. The occurrence of non-venom-secreted toxin transcripts might indicate that these messengers exhibit an individual or a temporal expression pattern over the life time of the snake [33], or may encode very low-abundance venom proteins. On the other hand, the presence in the venom of toxins not represented in the transcriptome clearly indicates that construction of the cDNA library was biased, i.e. due to the necessary fractionating steps to avoid interfering substances like short, partial length 3'-end cDNAs and adapter sequences [34]. A second bias of cDNA libraries is the potential of the mRNA transcript in plasmids to be partially expressed in their bacterial cells with lethal effects [35]. Moreover, smaller cDNA fragments are over represented compared to larger ones, due to the higher transformation efficiency of smaller plasmids [35].

The high demand for low-cost sequencing has driven the development of high-throughput next-generation sequencing (NGS) technologies such as 454 Roche, Illumina's Solexa, and Applied Biosystems' SOLiD, and and most recently released Helicos HeliScope platforms as alternatives to the classical chain-termination Sanger method of DNA sequencing for the qualitative and quantitative analyses of transcriptomes [36, 37]. NGS technologies are revolutionizing the field of transcriptomics by rapidly reducing the time and cost per base sequenced [38]. For example, snake venom gland transcriptomes reported are typically arranged from few hundreds to few thousands ESTs [25]. The largest transcriptome database was assembled from 8696 ESTs (mean read length of 398 bp) from Deinagkistrodon acutus venom gland [39]. Only very recently Rokyta and colleagues [40] reported a high-throughput venom gland transcriptome of the Eastern Diamond Rattlesnake (Crotalus adamanteus) using Roche 454 sequencing technology. 82,621 reads were singletons, and the remaining 552,863 reads were assembled into 24,773 contigs of average length 513 nucleotides [40]. NGS technologies applied to the transcriptomic analysis of non-model species has the advantage of providing a genome-wide, unbiased insight into the transcriptome [41]. However, NGS techniques applied to non-model species, which like snakes lack a suitable reference genome sequence, are not devoid of limitations. NGS technologies provide shorter and more error-prone reads than Sanger sequencing, making transcript assembly a challenging bioinformatic task, which frequently yields a large set of contigs but a fragmented transcriptome [38, 41]. Here, we report the application of the 454 platform to infer the venom gland transcriptomes of Costa Rican snakes, Bothrops asper (from Caribbean (car) and Pacific (pac) populations), Bothriechis lateralis, Bothriechis schlegelii, Atropoides picadoi, Atropoides mexicanus, Crotalus simus, and Cerrophidion godmani, with special emphasis on the strategy used to assemble and analyze the gathered DNA sequences. Although the average length of singletons (174 bp) and contigs (208.2 bp), and the low coverage of reads per contig (6), prevented the generation of definitive and reliable full-length gene sequences, our results provide a deep comparison of the transcriptional activity of the venom glands of these medically relevant species in Central America [42, 43].

Results and Discussion

454 sequencing statistics and annotation of transcripts

The eight venom gland single-strand cDNA libraries were sequenced using a multiplex strategy. To this end, each cDNA library was barcoded with a unique 10-base sequence (MID, M ultiplex ID entifier) that is recognized by the sequencing analysis software, allowing for automated sorting of MID-containing reads. A total of 334,540 reads (amounting ~ 62 Mb of 8 snake venom gland transcriptomes) were simultaneously sequenced in two runs of the Genome Sequencer FLX System. Raw sequencing data are archived under accession number SRP003780 in the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra) [44]. Accession codes by species are SRS117169.3 (B. asper (car)); SRS117211.3 (B. asper (pac)); SRS117214.2 (C. simus); SRS117215.1 (A. picadoi); SRS117216.1 (A. mexicanus); SRS117217.1 (C. godmani); SRS117218.1 (B. lateralis); and SRS117219.2 (B. schlegelii). The first run included only cDNA from B. asper from the Caribbean versant of Costa Rica, and was performed as a test run. The second run was done using cDNA from all the species investigated. The average length per read was 186.6 bases (max. 645 bp, and only 3.27% of reads < 50 bp), and this figure is in keeping with other 454 transcriptomic reports conducted in non-model species [45]. 4530 reads could not be assigned due to sequencing errors in the 3' 10-mer label. This figure corresponds to a sequencing error rate of 1.35%, which is higher than that (~ 0.04%) reported in other studies35,44. In addition, 5,165,220 nucleotides (8.27%) were masked with N's characters by RepeatMasker. The vast majority of sequence elements masked are class I (retroelements) and class II (DNA transposons) mobile elements [4648]. Retroelements may have a profound impact on the plasticity of the host genomes [49], i.e. modulating transcription of immediately downstream host genes [50, 51]. The bulk (64%) of mobile elements identified in the snake venom gland transcriptomes investigated here are retrotransposons (Additional file 1: Table S1). Retrotransposable elements have been previously reported in the transcriptomes of Bothrops insularis (4.1% of ESTs) [52], Lachesis muta (0.3%) [53], and Philodryas olfersii (4.1%) [54], and in PLA2 genes from the venom gland of Vipera ammodytes[55, 56] and Protobothrops flavoviridis[57]. In the context of multigene toxins, which like the snake venom PLA2s are evolving under strong positive adaptive selection [5860], it is worth mentioning that transposable elements are overrepresented in the mRNAs of rapidly evolving genes [61], suggesting that they have played a role in the diversification and expansion of these gene families [61, 62].

The 454 sequencing run yielded a total of 330,010 masked reads, which were distributed among the 8 venom gland transcriptomes as displayed in Additional file 1: Table S2. In the absence of any reference genome to guide the assembly, the sets of reads of each species were separately processed with program Newbler, the de novo assembler tool of the 454 Sequencing platform. However, only 58.4% of all reads clustered into 31,025 contigs (average length of the contigs was 208.2 bp; average number of reads per contig = 6), of which 43% comprised only 2 reads. The program also returned 103,357 singletons (mean length, 174.4 bp). Employing other assembler programs, such as MIRA (http://www.chevreux.org/projects_mira.html) or Velvet [63], and using different settings (i.e. Velvet: hash-length 21 or 31; MIRA: job normal or accurate), did not improve the assembly performance. The transcriptome assembly problem has been documented [64, 65], particularly for organisms without a reference genome database. Because of the low data compression gained in the assembly step and the small difference between contigs and reads mean length, bioinformatic processing of the 454 sequence data was performed on whole sets of unassembled reads. The set of 330,010 reads was searched against non-redundant GenBank databases using BLASTX and BLASTN algorithms to identify similar sequences with an e-value cutoff <10-3. 100,394 (30.4%) produced significant hits (Additional file 1: Table S2). The high percentage of reads without significant similarity to any known sequence is in line with previous transcriptomic studies. Hence, the 8696 high quality ESTs from a non-normalized cDNA library of D. acutus were assembled into 2855 clusters, of which only 45.60% matched known sequences and 54.40% had no match to any known sequence in Genbank [39, 66].

BLAST hits included 79,991 matches to entries of the taxonomic suborder Serpentes, of which 62,433 (62% of BLAST hits) displayed similarity to documented venom proteins (Additional file 1: Table S2). The set of reads lacking similarity to Serpentes entries was searched for the presence of cysteine-rich domains (eg. stretches of 50-100 amino acids containing ≥10% cysteine residues), as this feature is commonly shared by many toxin sequences [67]. The survey proved fruitless. Further attempts to enlarge the toxin dataset by searching specific databases, such as the Animal Toxin Database [68] or MEROPS [69], were also unsuccessful. The venom protein families identified, and their relative abundance, in the whole 454 read sequence dataset are listed in Table 1. The relative distribution of these venom protein families among the eight taxa investigated is shown in Table 2.

Table 1 Identity and relative abundance of venom protein entries identified in the whole 454 read sequence dataset of the 8 Costa Rican snake venom gland transcriptomes
Table 2 Relative contribution of the different venom protein family hits in each of the Costa Rican snake venom gland transcriptome

Among the proteins listed in Table 1, glutaminyl cyclase (GC) belongs to the group of venom proteins without demonstrated toxic activity. Glutaminyl cyclase has been been identified in the venom proteomes of B. jararaca, C. atrox, and C. durissus terrificus[7072]. N-terminal pyrrolidone carboxylic acid (pyroglutamate, pGlu) formation from its glutaminyl (or glutamyl) precursor is required in the maturation of numerous bioactive peptides. Snake venom GC is likely involved in the biosynthesis of pyroglutamyl peptides such as hypotensive BPPs [73, 74], and endogenous inhibitors of metalloproteinases, pEQW, pENW, and pEKW [75, 76]. Accumulation of peptide inhibitors in venoms provides a basis for attenuating the proteolytic activity of venom gland-stored SVMPs, preventing thereby autodigestion [77]. Mature PIII-SVMPs secreted into the venom proteome usually contain an N-terminal pyroglutaminyl residue (unpublished results), suggesting the action of the glutaminyl cyclase downstream of the proteolytic processing of the metalloproteinase precursor. However, the structural/functional consequences of N-terminal cyclation are unknown.

To estimate the number of toxin transcript sequences expressed in each transcriptome, multiple alignments of all reads clustered in the same protein family were generated, using the most similar full-length reference sequence as template. It was then realized that a large number of "Serpentes venom protein" reads did not align with translated ORFs. Instead, these reads appeared to represent 5'UTR, 3'UTR, and microsatellite loci. Particularly, all reads matching "cobra venom factor", "crotamine", "crotasin", and "sarafotoxin" entries corresponded to non-translated, mostly (87-100%) microsatellite DNA. In addition, 2397 out of 2412 reads for ohanin, and 840/845 3FTx reads aligned with microsatellite DNA. Microsatellite sequences accounted also for 66% GF, 49% SP, 36% PLA2, 27% CRISP, 15% LAO, and 8% CTL, but represented less than 5% of the reads of the rest of venom protein classes listed in Table 2 and Additional file 1: Table S3. On the other hand, the bulk (> 99%) of non-microsatellite untranslated sequences corresponded to 3' UTRs. Additional file 1: Table S3 summarizes the number of reads aligned to translated regions of reference snake venom toxin sequences. The occurrence of a large number of microsatellites in the venomous snake Agkistrodon contortrix has been recently reported by Castoe and colleagues [78], who used the 454 Genome Sequencer FLX next-generation sequencing platform to sample randomly ~27 Mbp (128,773 reads) of the genome of this species. These authors identified microsatellite loci in 11.3% of all reads obtained, with 14612 microsatellite loci identified in total.

The presence of mRNA coding for 3'-untranslated regions of toxins points to a) a bias due to the first-strand synthesis method used, which produced cDNA libraries enriched in 3'-end-transcripts, b) incompletely sequenced transcripts or c) to transcription of nonfunctional gene copies. Relevant to the latter possibility, excepting A. mexicanus venom, which contains a small amount (<0.1%) of a 3Ftx [50], CVF, SARA, OHA, WAP, 5'NTase, PDE, GC, KUN, HYA and 3FTx have not been detected in the venom proteomes of the Costa Rican snakes sampled here [4952] (Additional file 1: Table S4). Fry and colleagues [79] have shown that the venom system is a basal characteristic of the advanced snakes, and have investigated the timing of toxin recruitment events and patterns of toxin diversification across the full range of the ~100 million-year-old advanced snake clade [2, 3, 5, 6, 79]. These studies revealed single early recruitment events for each toxin type, including those identified here (Table 1), indicating that the venomous function arose once in squamate reptile evolution, at about 200 Myr ago. Structural and functional diversification of the venom system is best described by the birth-and-death model of protein evolution [80, 81]. Pseudogenes in Costa Rican pitviper venom transcriptomes may thus represent relics of the evolution of their venom arsenal.

The 37,961 reads comprising non-venom-protein BLAST hits were classified based on the presumed biological process to which they may contribute (Figure 1A) and on their putative molecular function (Figure 1B), according to the Gene Ontology database [48]. Their relative abundance, biological processes (general metabolism, response to external stimuli, cell differentiation, proliferation, and communication, cell cycle...), and molecular functions (transcription and translation, protein binding, catalysis, etc.) identified in this work generally agree with the broad categories reported for other viperid transcriptomes [27, 32, 52, 53, 8284], and will not be described here in detail again. The most abundant transcripts are related to DNA transcription, mRNA translation, and post-translational processing of proteins, reflecting the high specialization of this tissue for expressing and secreting toxins to the lumen of the venom gland. Furthermore, many venom toxins bear a high number of cysteinyl residues, which are engaged in extensive intra- and intermolecular disulphide crosslinking [20]. Venom proteins such as disintegrins, C-type lectin-like proteins, serine proteinases, PLA2s, 3FTxs, and SVMPs occur in different oligomerization states [8588]. The large structural impact at low energy cost of engineering disulphide bonds represents an opportunity for structural (and functional) diversification of proteins during evolution. Not surprisingly, protein disulphide isomerase (PDI), an enzyme and chaperone involved in disulphide bond formation in the endoplasmic reticulum [89, 90], represents a highly expressed gene transcript (1859 reads; 5.1% of non-venom-protein reads) in all venom gland transcriptomes, ranging from 2.3% in C. simus to 6.9% in B. lateralis.

Figure 1
figure 1

Gene Ontology annotation of the non-toxin Serpentes reads according to their presumed biological process (panel A) and molecular function (panel B). The figure represents the combination of the reads from all eight species. However, each species transcriptome exhibited similar relative expression levels of GO-annotated non-toxin transcript classes.

It is also worth mentioning the finding of reads for ribosomal 12S and 16S RNAs. This finding suggests that either internal mRNA A-rich tracts may have acted as priming sites in the cDNA synthesis, or that these messengers contained poly(A) tails. The possibility that rRNAs represent some residual contamination in the mRNA preparation should also be taken into account. Although polyadenylation is a distinctive feature of mRNA, polyadenylation of rRNA has been reported to occur in mammals and several unicellular organisms (Candida albicans, Saccharomyces cerevisiae, Leishmania braziliensis and L. donovani), and it may have a quality control role in rRNA degradation [9194]. Polyadenylated ribosomal RNA has been also reported in the venom gland transcriptome of the Desert Massasauga Rattlesnake (Sistrurus catenatus edwardsii) [95].

Calculation of the minimum number of gene copies from each toxin family

An estimation of the minimum number of genes from each toxin family transcribed into the venom gland transcriptome of each species was calculated from the multiple alignments of reads matched to a full-length reference sequence of each toxin family (Figure 2). To this end, the nucleotide sequences of the ORF-coding reads of each venom protein family were assembled into contigs using MIRA and an iterative multiple-pass reference-guided protocol. MIRA is recommended for analysis of a normalized dataset or a filtered set of reads that did not have extreme coverage [64]. Each line of the multiple alignment (Figure 2) contained a distinct set of contigs spanning the maximum possible number of nucleotide positions of the reference sequence. Since the short average length of singletons and contigs (174-208 bp), and the low coverage of reads per contig (6), prevented the generation of full-length gene sequences, each line of the alignment corresponds to one or more synthetic gene. We considered two contigs as different if their nucleotide sequences depart in more number of positions than expected from a sequencing error rate of 1.35%, and the same mutated residues were found in at least two other reads. For each toxin family from each venom gland transcriptome a representation of the "number of reads per contig" vs. "number of contigs" was plotted, and only contigs accounting for ~95% of all assembled reads were considered. The rationale for introducing this quality trimming is because in this way only contigs in which the observed sequence differences were validated in a significant number of reads were taken into account, eliminating thus potential false positives due to sequencing errors, which generate "orphan" reads. The number of topologically equivalent homologous multiply-aligned reads corresponds thus to the minimum number of genes from a given toxin family transcribed into the venom gland transcriptome (Figure 2). The outcome of this analysis is displayed in Table 3. The estimated number of toxin-coding genes is in line with the number of different proteins identified in the respective venom proteomes: C. simus (27 reverse-phase HPLC fractions, ~20 proteins) [52]; B.asper (Car) (31 HPLC fractions, ~30 proteins) [51]; B. asper (Pac) (30 HPLC fractions, ~27 proteins) [51]; A. mexicanus (41 HPLC fractions) and A. picadoi (30 HPLC fractions), each containing bradykinin-potentiating peptides and around 25-27 proteins [50]; and B. schlegelii (34 HPLC fractions) and B. lateralis (34 HPLC fractions) matched to ~29 and 27 proteins, respectively [49]. Moreover, in most cases the overwhelming majority of reads of the large multigene toxin families (i.e., SVMP, PLA2, and SP) cluster into a small subset of contigs (Additional file 1: Tables S5, S6 and S7). The uneven distribution of SVMPs of B. asper (Car) (Figure 3) clearly illustrates this point: 3590 out of 6746 reads clustered into a single contig, and only 6 other contigs were assembled from 449-173 reads. The remaining 22 transcripts comprised each 95-12 reads. The low number of venom proteins inferred from our 454 transcriptomic analysis is also in concordance with a recent high-throughput venom pland transcriptomic analysis for the Eastern Diamond Rattlesnake (C. adamanteus), which identified 40 unique toxin transcript [40]. The most diverse and highly expressed toxin classes were the SVMPs (11 isoforms), serine proteinases and C-type lectin-like proteins (9 different protein species each).

Figure 2
figure 2

Calculation of the minimum number of gene copies. Multiple alignment of the top six SVMP transcripts of B. asper (Car) (Additional file 1: Table S6) using the sequence (top) of the most similar database-annotated toxin sequence as template. Each line of the multiple sequence alignment displays a distinct set of contig(s), comprised by a unique set of reads indicated in parentheses (see also Additional file 1: Table S5). Since the short average length of the reads and the low coverage of reads per contig prevented the assemblage of reliable gene sequences, each line of the alignment corresponds to at least a distinct gene of the SVMP multigene family translated into the venom gland transcriptome of B. asper (car).

Table 3 Estimation of the minimum number of toxin family gene copies translated in the venom gland transcriptomes of Costa Rican snakes
Figure 3
figure 3

Cartesian graph and corresponding chart pie displaying the uneven distribution of the number of reads per contig among the 29 SVMP genes identified in the venom gland transcriptome of B. asper (car).

The insight provided by our present transcriptomic data, supported by previous proteomic studies, indicate that the venoms of the Costa Rican snakes investigated are comprised by toxins belonging to a few major protein families. In addition, our data suggest that different genes of a multigene family are subjected to very distinct transcription (and translation) yields, i.e. as the result of distinct stability and translational rates of the messengers.

Comparison of transcriptomes and proteomes

The relative abundance of the different toxin families in each transcriptome was calculated as the percentage of toxin family-specific reads relative to all BLAST hits (Table 2) or to the set of reads aligned to translated (ORF) regions of a reference sequences (Additional file 1: Table S3). To estimate the relative contribution of each toxin family, the total number of nucleotides of the ORF-coding reads was normalized for the full-length nucleotide sequence of a canonical member of the protein family. When available, the obtained figures were compared with the percentages of toxin families reported for the venom proteome of the same species. The outcome of this comparative analysis is compiled in Additional file 1: Tables S3 and S4. Strong discrepancies between the transcriptome-computed and the proteome-gathered toxin compositions are obvious at first sight. The best, although still far from perfect, agreement between proteomic and transcriptomic data occured when the relative abundance of transcripts was computed using all the reads (ORFs + UTRs) belonging only to toxin classes detected in the venom proteome (Figure 4). This would support the view that the majority of reads matching UTRs may indeed form part of parent translatable mRNAs. However, particularly B. asper (Car) (Figure 4A), B. schlegelii (Figure 4B), A. mexicanus (Figure 4C), and C. simus (Figure 4D) strongly depart from this picture. The reasons underlying this discrepancy are elusive, since no clear trend within or between species is apparent, but both intrinsic (methodological) and extrinsic (biological) factors may be involved. Hence, besides the difficulty of deciding between bias due to cDNA libraries enriched in 3'-end-transcripts, and the presence of transcripts of pseudogenes in the transcriptome, we hypothesize that the distinct stability and translational rates of the messengers might also contribute to the observed differences between transcriptome and proteome. Thus, a high abundant messenger subjected to a higher degradation rate may produce the same concentration as a low abundance but more stable mRNA or exhibiting a higher translational rate. The observation that B. asper (Car) and B. asper (Pac), as well as A. mexicanus and A. picadoi, exhibit highly similar transcriptomes but strongly depart in the relative toxin composition of their venom proteomes (Figure 4), indicates that individual mRNA species are translationally controlled in a species-dependent manner. The same conclusion can be drawn by comparing the proteome and transcriptome of C. simus (Figure 4D). In this respect, mounting evidences in yeast, indicate that the loading of ribosomes onto individual mRNA species varies broadly across a cellular transcriptome, and this finding is consistent with each transcript having a uniquely defined efficiency of translation [9699].

Figure 4
figure 4

Transcriptomes versus proteomes. Comparison of the protein composition of the venom of Costa Rican snakes reported from proteomic analysis (chart pies labelled "a") (Additional file 1: Table S4)49-52 and predicted from their venom gland transcriptomes (this work). Chart pies "b" display the relative occurrence of ORF-coding reads listed in Additional file 1: Table S4 normalized for the full-length DNA sequence of a canonical member of the protein family (%mol). Panels c show the relative abundance (mol%) of toxin families as in panels "b" but computing only toxins previously identified in the venom proteome49-52. Chart pies depicted in panels d show the relative composition (reads%) of all venom protein family hits in each of the Costa Rican snake venom gland transcriptome (Table 2). Protein family names are abbreviated as in Table 1.

A rough comparison of the transcriptomes (Table 2) shows that some toxin families are relatively constantly expressed among snakes while others exhibit greater variability. Principal Component Analysis (PCA) revealed that the abscissa (PC1) and the ordinate (PC2) each explained 32% of the observed variability (Figure 5A). PCA discriminated the eight transcriptomes into four groups (Figure 5B). Referring to the average value of the toxin family, the transcriptomes of A. picadoi and B. lateralis contain higher content (⇑) of SVMP reads and lower number (⇓) of PLA2 and SP reads; the two B. asper taxa express ⇑SVMPs and PLA2s and ⇓ SPs; A. mexicanus and C. godmani contain ⇑SVMPs, ⇑SPs, and slightly ⇓PLA2s; and B. schlegelii and C. simus contain ⇓SVMPs, ⇑SPs, and average number (⇔) of PLA2 reads. These data point to convergent and divergent evolutionary trends among pitvipers. PCA of the venom proteomes listed in Additional file 1: Table S4 revealed different clustering of taxa (Figure 5, panels C and D), and unequal contributions of PC1 (48%) and PC2 (22%) to venom variability. These results illustrate the versatility of snake venoms as a system to achieve the purpose of subduing prey through different strategies [100]. On the other hand, the lack of any apparent correlation between the PCA score plots for transcriptome (Figure 5B) and proteome (Figure 5D) data, further highlights the existence of variable translational patterns across species. Clearly, our results emphasize the relevance of combining detailed proteomic and transcriptomic studies for a thorough characterization of the venom toxin repertoire and the factors regulating transcription and translation.

Figure 5
figure 5

Principal Component Analysis (PCA) of the Costa Rican snake venom gland transcriptomes (A, B) and the corresponding proteomes (C, D). Panels A and C show, respectively, the contributions to PC1 and PC2 of the different toxin families of the transcriptomes and the proteomes. Panels B and D, score plots displaying the segregation of the transcriptomes (B) and the proteomes (D) into different categories. In B, PC1 and PC2 contribute equally and together explain 65% of the observed transcriptome variability; in D, PC1 and PC2 explain 70% of the variability among proteomes.

In a previous proteomic study, we identified two RP-HPLC fractions of B. schlegelii venom as Kazal-type proteinase inhibitor-like proteins (family PD000417 in the ProDom database, http://prodom.prabi.fr/prodom/current/html/home.php) [49]. Kazal-type inhibitor-like proteins (KAZ) have not been found in any other snake venom reported to date, casting doubts on their venom gland origin, on the one hand, or pointing to a recruitment event of these proteins along the speciation of the Neotropical pitviper clade [49]. Now we report the finding of 9 and 12 reads encoding ORF regions of KAZ in B. schlegelii and B. lateralis transcriptomes, respectively (Table 2 and Additional file 1: Table S4). In each species, all KAZ reads assembled into a single transcript (Table 3), suggesting a monogenic origin. The occurrence of Kazal-type inhibitor-like proteins only in the venom gland transcriptomes of the two Bothriechis taxa (Additional file 1: Tables S4 and S5) supports the view of a genus-specific recruitment event during the early-Middle Miocene ~14 Mya, the estimated divergence time for Bothriechis in a model of Middle American highland speciation [101]. On the other hand, the presence of Kazal-type proteins in the venom proteome of B. schlegelii, the basal species of the Bothriechis clade [102], suggests a species-specific expression of this class of protein. Though a number of Kazal-like domains harbor serine proteinase inhibitor activity, these protein scaffolds are also present in the extracellular part of a number of proteins, which are not known to be proteinase inhibitors. Clearly, further investigations are needed to assess the biological activity of the Kazal-type proteins, and the role that these proteins may have played in the early adaptive radiation of the Bothriechis clade.

Transcriptome-based cladistic analysis of the Costa Rican snake venom gland transcriptomes

To assess structural relationships between the transcriptomes, consensus sequences were constructed for those major toxin families shared by all snake venom gland transcriptomes, i.e. BPP, LAO, PLA2, SVMP, and SP. To assess the degree of kinship between the Costa Rican snake venom gland transcriptomes, species-specific synthetic sequences were generated by the concatenation of the 5 toxin-family consensus sequences (in the order described above), and using these synthetic sequences as input, a cladogram was built using the suite of web-tools Phylemon (http://phylemon.bioinfo.cipf.es) (Figure 6). Mutation is the driving force of evolution, but inferring evolutionary distances from multiple sequence alignments can yield misleading results if the mutation rates of the genes being compared are unequal across species. Given knowledge of the degree of mutation rate heterogeneity, appropriate algorithms can be applied to correct unbiasness and inaccuracy of the phylogenetic reconstruction [103105]. Globally, translated assembled sequences displayed mean variability levels between 0 and 7.4% (computed as number of variable residues divided by sequence length), being the SVMPs and SPs the toxin families which accumulate more amino acid substitutions. Although the cladogram depicted in Figure 6 should not be regarded as an evolutionary hypothesis, the divergence of the Atropoides taxa, and the clustering of A. mexicanus and C. godmani deserves discussion.

Figure 6
figure 6

Cladogram of phylogenetic alliances among Central American snakes inferred from comparison of concatenated consensus sequences for SVMP, SP, BPP, LAO and PLA 2 gathered from analysis of their venom gland transcriptomes. Numbers at branching points indicate the degree of sequence divergence (0.1 = 10% sequence divergence).

Despite the efforts of numerous authors, phylogenetic relationships within the subfamily Crotalinae remain controversial, particularly at the intergeneric level [100, 105]. In particular, several analyses, even from the same research group, support different phylogenetic models. Thus, Bayesian Markov chain Monte-Carlo results suggested the monophyly of the three genera of the Porthidium group (Atropoides, Cerrophidion, and Porthidium) and indicated that Cerrophidion and Porthidium form a clade that is the sister taxon to Atropoides[106]. On the other hand, genus Atropoide s has been also inferred through Bayesian phylogenetic methods to be paraphyletic with respect to Cerrophidion and Porthidium, due to Atropoides picadoi being distantly related to other Atropoides species [107, 108]. Although resolving the phylogenetic relationships among the Neotropical pitvipers of the Porthidium group requires a detailed genomic study of species occupying geographically close ecological niches, i.e., Porthidium nasutum, Porthidium ophryomegas, Porthidium volcanicum, and Cerrophidion godmani, both a previous proteomic study by Angulo and co-workers [50], who estimated that the similarity of venom proteins between the two Atropoides taxa may be around 14-16%, and the present transcriptomic analysis, support a large divergence between A. mexicanus and A. picadoi, and the closer kinship between A. mexicanus and C. godmani.

Conclusions

The snake venom gland is a highly specialized and sophisticated organ, which harbors the cellular machinery that transformed throughout > 200 million years of evolution genes coding for ordinary proteins of disparate scaffolds, diverse ancestral bioactivities, and recruited from a wide range of tissues, into lethal toxins [2]. Although the details of recruitment and neofunctionalization of these proteins remain elusive, gene duplication events, followed by the accelerated evolution of some copies and degradation of others to pseudogenes, underlay the emergence of venom gland multigene toxin families. Comparative analysis of complete genome sequences of squamate reptiles would be extremely valuable for tracking the evolution of the venom system in lizards and snakes [5, 6]. On the other hand, a deep understanding of the toxin gene expression pattern revealed by high-throughput transcriptomic may reveal taxon-specific adaptations and the underlying biological processes governing the formulation of the venom arsenal. In this respect, invoking a reverse venomics approach, knowledge of the end product of transcriptome translation, the venom proteome, may provide hints on the translation efficiency of toxin-coding transcripts. The high-throughput capability of next-generation sequencing technologies offer the opportunity to generate large transcriptome databases relatively rapidly, which may help to speed up the tedious, often de novo[20], assignment of proteomic-gathered data. Furthermore, analysis of the venom gland transcriptome enhances the comprehensibility of the venom proteome, and this in turn contributes to a more accurate interpretation of the transcriptome. The application of NGS to the analysis of snake venom transcriptomes, may represent the tool for opening the door to systems venomics.

Methods

Snake venom gland cDNA synthesis and sequencing

Venom glands were removed 3 days after venom milking, when transcription is maximal [109], from anesthetized snakes using fine forceps and immediately placed in RNAlater™ solution (Qiagen). 30 mg of tissue were disrupted and homogenized by a rotor-stator homogenizer, and total RNA was isolated using RNeasy Mini kit (Qiagen), quantified in a spectrophotometer, and quality-checked on an agarose gel discerning the 28S and 18S bands of ribosomal RNA. First strand cDNA was synthesized using RevertAid™ H Minus First Strand cDNA Synthesis Kit (Fermentas), which selectively transcribes full-length polyadenylated mRNA. The manufacturer's recommendations were followed except where specified. Approximately 5 μg of total RNA was used as starting material. In order to avoid polymerase slippage, a modified 3' 54-mer adaptor (5' GAGCTAGTTCTGGAG(T) 16 VN), which includes a type IIs enzyme (GsuI) site (underlined), was used for first-strand synthesis. This modified oligonucleotide effectively converts the long run of adenosine residues at the polyA tail into a sequence that causes fewer problems for dideoxy sequencing chemistry, and thus the resulting cDNA libraries were enriched in 3'-end-transcripts. To avoid internal cuts, the cDNA was hemimethylated by adding 5-methyl-dCTP to the dNTPs mix. The first strand cDNA was used as template for second strand synthesis by E. coli DNA Polymerase I and RNase H. Double strand (ds) cDNA was precipitated with ethanol and the pellet was resuspended in 70 μL of nuclease-free water and subjected to enzymatic digestion with GsuI for 4 hours at 30°C. The enzyme was then inactivated at 65°C for 20 minutes and the digested cDNA was stored at -20°C. For 454 pyrosequencing, the GS FLX General Library Preparation Method Manual workflow (Roche Diagnostics) was followed. To this end, 3 μg of final non-normalized cDNA library were sheared by nebulization into small fragments. The fragment ends were polished and short A/B adaptors were ligated onto both ends, providing priming regions to support both emulsion amplification and the pyrosequencing process. A biotin tag on the B adaptor allowed immobilization of the dscDNA library fragments onto streptavidin-conjugated magnetic beads and the subsequent isolation of the library of single strand cDNA sequencing templates. Each of the eight cDNA libraries was tagged with a unique 10-base sequence (MID, M ultiplex ID entifier) that is recognized by the sequencing analysis software, allowing for automated sorting of MID-containing reads. Barcoded libraries were simultaneously sequenced in a Genome sequencing FLX System (Roche Applied Science) at Life Sequencing S.L. (Parc Científic Universitat de Valencia, Paterna, Valencia, Spain; http://www.lifesequencing.com) using the method developed by Margulies et al. [110].,

Bioinformatic processing of the 454 reads, identification of toxin transcripts, and quantitation of the expression levels of toxin families

Additional file 2: Figure S1 displays a scheme of the data analysis pipeline developed to identify sequences of toxin molecules by similarity search against nucleotide databases, and elaborate a reference-guide estimation of transcripts, which included available NGS algorithms and in-house scripts. To this end, interspersed repeats and low complexity DNA sequences were masked from the transcript reads using RepeatMasker (version 3.2.7) [111]. RepeatMasker is available from the Institute for Systems Biology (http://www.systemsbiology.org) addressing http://repeatmasker.org. The program makes use of Repbase, which is a service of the Genetic Information Research Institute (http://www.girinst.org). Repbase is a comprehensive database of repetitive element consensus sequences (update of 20 January 2009). Data and computational resources for the Pre-Masked Genomes page is provided courtesy of the UCSC Genome Bioinformatics group (http://genome.ucsc.edu). Masked reads were then searched against the non-redundant NCBI database (http://blast.ncbi.nlm.nih.gov, release of March 2009) and the UniProtKB/Swiss-Prot Toxin Annotation Program database (http://us.expasy.org/sprot/tox-prot), using BlastX and BlastN [112] algorithms, specifying a cut-off value of e-03 and BLOSUM62 as scoring matrix. Snake venom gland-specific transcripts were selected from best BLAST-hit descriptions identifying GenBank entries belonging to the taxonomic suborder Serpentes. This taxonomic group is represented by 37,070 records comprising entries from 357 different genera. The subset of reads exhibiting similarity to Serpentes sequences were further filtered using a list of keywords (including the acronyms of all known toxin protein families described so far [20, 25]) to distinguish putative snake venom toxins from non-toxin (ribosomal, mitochondrial, nuclear, etc.), ordinary proteins. In a second round of filtering, non-matched sequences were searched for structural features (eg. high cysteine content) expected for a putative toxin molecule. Non-toxin-assigned transcripts were functionally annotated using the Blast2GO software [113] and classified using GO-terms [114]. The relative expression of a given toxin protein family (mol%) was calculated as the number of reads assigned to this protein family (Ri) normalized by the length (in nucleotides) of the reference transcript sequence (ntREF) and expressed as the % of total reads in the snake transcriptome (∑Reads): mol% toxin family "i" = %[(Ri/ntREF)/∑Reads). The relationship between the expression levels of the different protein families in the different species sampled was analyzed by Principal Component Analysis (PCA). When possible, toxin transcriptome profiles were compared with available proteomic-based venom toxin profiles [115118].

Assessing molecular diversity within toxin families and cladistic analysis

To assess the molecular diversity within each toxin family, the phylogenetically nearest top-hit sequence was designated as the reference sequence for aligning all the toxin family-specific reads. To this end, each toxin family read was translated into the 6 possible reading frames and blasted against the reference protein sequence using tBlastN. Matched frames exhibiting e-value thresholds better than e-03 were aligned onto the reference sequence to create a multiple alignment using COBALT [119]. The multiple alignment was then parsed to create an assembled (consensus) toxin sequence in which each amino acid position is supported by at least four reads, representing at least 30% of the total number of reads at that position. Non-sequenced positions or those that did not meet the minimum coverage condition, are depicted in lower case in the assembled sequences. Positions where two or more amino-acids fulfilled the selection criteria were annotated as variable residues suggesting the occurrence of different alleles (isoforms) of the protein. Multiple sequence alignments based on transcriptomic data were computed with program MUSCLE [120] (version 3.52) using species-specific synthetic sequences constructed by the concatenation of the consensus sequences of the major toxin families shared by all snake venom gland transcriptomes, i.e. BPP, LAO, PLA2, SVMP, and SP, in this order. Phylogenetic tree reconstruction was done using the suite of web-tools Phylemon (http://phylemon.bioinfo.cipf.es/cgi-bin/tools.cgi) using the MUSCLE, ProtDist and Neighbor options of Phylip (version 3.65).