Background

Apple snails are a family (Ampullariidae) of snails belonging to Caenogastropoda, the largest and most diverse clade within the class Gastropoda [1,2,3]. Apple snails seem to have originated on Gondwana [4], with the oldest fossils coming from Early Cretaceous deposits in Africa [5]. After the breakup of Gondwana roughly 100 million years ago, apple snails have undergone diversification in the New World and Old World respectively [4, 6]. Currently, around 120 species of apple snails are recognised in nine genera, including the Old World genera Afropomus, Forbesopomus, Lanistes, Pila and Saulea, and the New World genera Asolene, Felipponea, Marisa and Pomacea [7]. In what follows we abbreviate Pomacea, but not Pila to avoid confusion of these two genera. Ampullariids are distributed in a wide variety of freshwater habitats, including swamps, wetlands, lakes and rivers [7,8,9]. Members of the family exhibit a wide range of morphological, behavioural and physiological adaptations to their inhabited environments [10, 11]. For example, the evolutionary radiation of Lanistes in Lake Malawi contains species with contrasting morphological and behavioural features that have been interpreted as differential adaption to habitats which differ in wave action, food resources, and predators [9, 12]. Due to their long evolutionary history, wide geographic distribution and high diversity, Hayes et al. [4] suggested that ampullariids altogether provide an interesting system to study speciation and phylogeography in freshwater gastropods. Furthermore, several species of apple snails, especially P. canaliculata and P. maculata, are notorious invasive species in Asia and Hawaii, where they cause dramatic agricultural losses [10, 13], and other conservation concerns such as reductions in aquatic plant diversity and shift in wetland ecosystem functions [14, 15]. Therefore, there is substantial interest in the mechanisms of adaptation that have enabled these species to become invasive pests [16, 17], and in their biological control [18, 19].

Ampullariids are well-known for their diverse reproductive behaviours. While they are all dioecious and most genera of apple snails deposit their eggs in a jelly mass underwater, two genera (i.e., Pomacea and Pila) produce calcareous egg clutches that are deposited above the waterline. The shift from aquatic to aerial oviposition thus has occurred at least twice in the evolution of ampullariids, indicating parallel evolution in the genera Pomacea and Pila with respect to the changes in egg deposition behaviour and morphology (e.g., larger lung size and longer siphons [10]). Such behavioural and morphological adaptations in Pomacea are known to be accompanied by biochemical adaptations to predation [20]. In this regard, studies of several Pomacea species have shown that the major proteins of the egg perivitelline fluid (PVF), the fluid that surrounds and nourishes the embryo, possess multiple protective functions against predators including several anti-predation proteins (perivitellins) displaying anti-digestive, anti-nutritive, neurotoxic and aposematic properties [20,21,22,23]. Comparison between the protein-coding genes of P. canaliculata and P. maculata has revealed the involvement of gene duplication and positive selection in the formation and evolution of some PVF proteins [24, 25]. Further comparison with more distantly related genera/species would yield novel insights into the origin and evolution of PVF proteins that may underlie the diversity of reproductive behaviour and morphology in apple snails.

Apart from their use in ecological and evolutionary studies, some ampullariids, including P. canaliculata and M. cornuarietis, have been used in toxicological studies due to their high fecundity and the high sensitivity of their juveniles to pollutants such as heavy metals [26], organic pesticides [27] and organotins [28]. Mortality and deficiencies of growth or development have typically been considered to be informative toxicity end-points. Nevertheless, the lack of extensive genomic resources hinders the documentation of molecular pathways in toxicological studies of apple snails.

To facilitate molecular-oriented studies on apple snails, we sequenced the transcriptomes of seven species of apple snails: Lanistes nyassanus; Pila ampullacea; Asolene platae; Marisa cornuarietis; Pomacea diffusa; Pomacea scalaris and Pomacea canaliculata. Together with our previously generated RNA-Seq data for P. canaliculata [29] and P. maculata [25], we cover eight species which represent five genera (Fig. 1a) and both the New World and Old World clades. Among the Old World species are L. nyassanus, a species endemic to Lake Malawi in the East African Rift [9, 30]; and Pila ampullacea, a common species in the paddy fields and irrigation channels of northern Thailand [31]. Among the New World species, A. platae is restricted to the La Plata River basin and has a distribution range from Bolivia to the northern Buenos Aires province of Argentina [32]; this species has a slower growth rate and smaller reproductive output than other ampullariids and probably less invasive [33]. The other five species have been introduced from South America to various freshwater ecosystems in North America, Asia and Pacific islands including Hawaii [10, 13, 34, 35]. Following their introduction, two species of Pomacea (i.e., P. canaliculata and P. maculata) have become widely distributed and they are regarded as some of the most notorious invasive species in freshwater habitats [7, 34, 36, 37]. Our species selection thus covers the various phylogenetic lineages, the diversity of reproductive strategies, the most important invaders, and members that are commonly adopted in ecotoxicology. Fig. 1b shows the phylogenetic relationships among the species used in this study, whereas a phylogeny featuring more extensive taxon sampling is presented in Additional files 1 and 2.

Fig. 1
figure 1

Geographical distribution and phylogeny of apple snails used in the present study. a Rough native distribution ranges of the Old World (Lanistes and Pila) and New World (Asolene, Marisa and Pomacea) genera/species [7, 56]. b A maximum likelihood tree showing the phylogenetic relationship among the eight species of ampullariids based on sequences of three genes used in previous phylogenetic studies of ampullariids [6, 52]. Methodological details for the phylogenetic analysis can be found in Additional file 1. Bootstrap support values are shown, as is a scale bar of 0.05 substitution per site. Photo credit: L. nyassanus, Pila ampullacea and M. cornuarietis (JCHI); A. platae and P. scalaris (SI); P. canaliculata, P. maculata and P. diffusa (HM)

Construction and content

Sample collection and preparation

Adult snails were collected from the field in various regions of South America, Africa and Asia, or purchased from an aquarium shop in Hong Kong (Table 1). All snails were reared in aquaria filled with tap water and acclimated for at least one month at 26 ± 1 °C and a photoperiod of 14 h light/ 10 h dark. Snails were fed with a mixed diet of lettuce, carrot and fish meal once a day and the water was renewed twice a week. For most of the species, four to five female and male snails were chosen for dissection to obtain various tissues. For L. nyassanus, however, due to limited individuals available, only a female was used for dissection. Dissected tissues were immediately fixed in RNAlater™ (Invitrogen, USA) and then stored at − 20 °C until they were subjected to RNA extraction.

Table 1 A summary of transcriptome data from eight apple snails used for database construction. Tissues: albumen gland (AG), digestive gland (DG), foot (F), gill (G), lung (L), mantle (M), kidney (K), stomach (S), testis (T) and other tissues (OT; including DG, F, M and T)

RNA isolation and sequencing

Total RNA was extracted separately from each tissue sample using TRIzol® reagent (Invitrogen, MA, USA) following the manufacturer’s protocol. In general, two RNA samples, including one of the albumen gland (AG), and one of other tissues (OT), which contained equal amounts of RNA extracted from three to four tissue types, were prepared for sequencing (Table 1). AG was always processed separately, because this organ, which secrets the perivitelline fluid that protects and nourishes the embryo, is expected to play a crucial role in the reproduction and evolution of ampullariids [24, 25, 38]. More tissue types of P. canaliculata were sequenced due to the need for producing tissue-specific gene expression data in another project for this species. To enhance the comprehensiveness of the transcripts for P. canaliculata, we combined our new data with the transcriptome data we generated from a previous study [29] for assembly. The transcriptome data of P. canaliculata from another study [39] were not included here because of uncertainty of sample preparation, and because more data would not likely improve the assembly metrics [40]. Raw reads of P. maculata were obtained from a recent publication [25], and re-assembled as described below. In P. scalaris, only AG was sequenced due to the lack of high quality RNA in OT preserved in RNAlater. For all samples, the quality of extracted RNA was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Germany). Samples with an RNA Integrity Number ≥ 8 were used for cDNA library construction using a TruSeq RNA Sample Prep Kit v2 (Illumina, California, USA), and sequenced in paired-end mode on an Illumina HiSeq sequencer (Illumina, California, USA). Library construction and sequencing were conducted by BGI Hong Kong as a commercial service.

Transcriptome assembly and annotation

Illumina raw reads were cleaned by removing adaptor sequences, reads with > 5% unknown “N” bases or > 20% bases with a quality score ≤ 10 (Table 1). Trimmomatic v0.33 was then used to further remove low quality reads with a quality score < 20 and a length < 40 base pairs (bp) [41]. For each species, clean reads from different tissue samples were pooled for de novo assembly using Trinity 2.2.0 under default settings [42]. The assembled transcripts (ranging from 126,582 to 388,329; Table 2) were clustered with CD-HIT-EST 4.6.6 to reduce redundancy using a threshold of 95% sequence similarity [43]. Open reading frames (ORFs) were predicted with TransDecoder 3.0.0 (https://transdecoder.github.io/) using a threshold of ≥100 amino acids. Only the single best ORF per transcript was retained. The longest ORF in each gene cluster was selected as the unigene. Expression levels were estimated as transcripts per kilobase million read (TPM) using Salmon 0.7.2 [44], and unigenes with TPM less than 0.5 were considered as non-expressed [25]. The level of completeness of our eight assembled transcriptomes was evaluated using BUSCO (benchmarking universal single-copy orthologs) v1.1b [45].

Table 2 Transcriptome assembly and annotation statistics. To avoid confusion between Pomacea and Pila, the latter taxon is not abbreviated as “P.

Predicted protein sequences were annotated using BLASTp 2.4.0+ [46] against NCBI’s non-redundant (nr) database with an E-value of 1 × 10− 5. Gene Ontology (GO) function for each unigene was assigned using Blast2GO [47] with BLASTp nr input. Sequences were also submitted to the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Sever (http://www.genome.jp/kegg/kaas/) to determine their functional relationships using the bi-directional best-hit method. References for the KEGG annotation included the default representative eukaryotic genomes as well as the genomes of several invertebrates: Helobdella robusta, Lottia gigantea, Crassostrea gigas, Octopus bimaculoides, Schistosoma mansoni, Nematostella vectensis and Hydra vulgaris. The annotation results are summarized in Table 2.

AmpuBase database construction

AmpuBase is a relational database that provides public access to these newly assembled ampullariid transcriptomes and annotations. The database structure and layout are similar to those of PcarnBase [48], except that data from several species can be searched at the same time and that the GO and KEGG search pages are integrated. In brief, for each species, a relational database was developed using MySQL v5.6.34 and hosted on an Apache HTTP server. The BLAST search function is powered by ViroBLAST [49] using the PHP programming language. The database consists of DNA and protein sequences of all unigenes that are linked with associated NCBI nr, GO and KEGG annotations through unigene ID. The database consists of five entity tables (“NCBI annotation”, “Proteins”, “DNAs”, “Gene Ontology” and “KEGG”) and two relation tables (“NCBI_GO_relation” and “NCBI_KEGG_relation”).

Utility and discussion

Transcriptome assembly metrics

There were between 72,341,892 to 462,231,634 bp of clean data, corresponding to between 20,730 and 35,828 unigenes with ORFs in each of the eight species (Table 2; Fig. 1a). The mean N50 value (shortest sequence length at 50% of the unigenes; 1576 bp) and the percentage of annotated unigenes (average 75.9%) in our study are higher than the corresponding values from previously published ampullariid transcriptomes (P. canaliculata, N50: 283 bp, 29.2% unigenes annotated [29]; P. maculata, N50: 1332 bp, 36.6% unigenes annotated [25]). Our transcriptome assembly metrics are comparable to those of recently published transcriptomes from other families of mollusks (Table 3), indicating the overall robustness of our transcriptome sequencing, assembly and annotation pipeline.

Table 3 Comparison of transcriptome assembly metrics between this study and some other studies of mollusks

To further evaluate the completeness of transcriptomes, we examined the proportions of complete as well as partial homologs of 843 conserved metazoan genes within the eight coding unigene sets. The transcriptomes contain 77.46 to 92.41% of the complete conserved metazoan genes, and 87.54 to 96.09% of the genes if fragmented BUSCO hits are also included (Table 2). These BUSCO metrics are comparable with those of other mollusc transcriptomes published in recent years (Table 3).

AmpuBase: Functions and applications

AmpuBase is available online via web interface at http://www.comp.hkbu.edu.hk/~db/AmpuBase/. The database can be searched with BLAST or other query terms. The BLAST search function allows users to blast query sequence(s) or fasta files against single or multiple DNA/protein datasets with default settings (under Basic Search option) or customizable settings (under Advance Search option) (Fig. 2a). Upon submitting a BLAST search, matched sequences are returned with their E-value and similarity score, and information on the corresponding annotation can be obtained by clicking “Unigene ID” (Fig. 2b).

Fig. 2
figure 2

The web interface of AmpuBase. a Illustration of the Basic and Advanced BLAST search options. b An example of the search result of a BLAST search, showing matched sequences, each with their BLAST statistics. c Illustration of the search functions in AmpuBase based on annotation

Apart from BLAST search, the transcriptome data can be searched in two other ways (Fig. 2c). General Annotation Search allows one to query the relevant annotations (i.e., NCBI annotation, GO and KEGG) either by using the unigene ID or unigene name (e.g., perivitellin ovorubin). Each successful query returns a table that contains Unigene ID, NCBI’s nr, GO and KEGG description (if available). The resultant sequences can be downloaded by selecting the Unigene ID and clicking “Submit” for further analysis, for example, phylogenetic analysis of perivitellin ovorubins, major and multiple functional proteins in PVF [20, 24, 25]. In addition, GO and KEGG Annotation Search is also provided for searching GO and KEGG information using GO ID, KEGG ID or a keyword. All sequence data for these ampullariid transcriptomes are available for download under the “Downloads” menu, for transcriptome wide data mining and analysis of a specific gene.

Conclusions

In this study, we have generated a large set of transcriptome data for eight species that represent five genera of Ampullariidae. These data are compiled in a relational database, AmpuBase, which greatly enhances the publicly available genomic resources for ampullariids. The database provides tools for sequence- or keyword-based query functions, which will facilitate in-depth ecological and evolutionary studies on ampullariids, and comparative studies with other invertebrates. AmpuBase will be updated when more genomic data become available in the future.