Background

Sexual reproduction is the most common form of reproduction in the animal kingdom, and only 0.1% of all animal species reproduce asexually [1]. Compared to asexual reproduction, sexual reproduction enables genetic recombination, but seems otherwise to be less efficient and exhibits profound costs, like the two-fold costs of males [2]. The evolution, persistence and underlying molecular mechanisms of both sexual and asexual reproduction are therefore central topics of evolutionary biology [3, 4]. In sexually reproducing eukaryotes, meiosis, the reduction division of diploid germ cells to generate haploid gametes such as sperm, eggs, and pollen, is an essential process. After fertilization, zygotes are created by incorporating the genetic material of both sexes, restoring the original ploidy level. This is in contrast to some asexual species, including the Amazon molly (Poecilia formosa) where no meiotic cell cycle takes place and the gametes are produced via mitosis [5]. There are several variants and types of asexual reproduction, but we will focus here on the prevalent type in vertebrates, which is parthenogenesis. In many species with parthenogenesis, meiosis is lacking (apomixis) and oocytes do not undergo a reduction division leading to diploid eggs [6]. Consequently, offspring are genetically identical to the mother. In vertebrates, this phenomenon is found in fishes, amphibians, and reptiles and only known for species of hybrid origin [7], shedding light on the role of hybridization in functional aspects of biology, and in particular in hybrid speciation. These unisexual vertebrates are used as model organisms to understand the origin and maintenance of sexual reproduction and meiosis. However, the underlying mechanisms driving asexuality, as well as the mechanisms of the transition from sexuality to asexuality, are still unclear.

Meiosis and sexual reproduction seem to have arisen very early in eukaryotic evolution and therefore vertebrate asexual lineages originated from sexual relatives [8]. Schurko and Logsdon Jr. [9] propose that the presence of a set of multiple genes required specifically for meiosis is indicative of the capability of an organism to undergo meiosis and should imply sexual reproduction. In the genome of an apomictic species, these genes should be obsolete and undergo genomic decay to the point where they are dysfunctional. Alternatively, they may evolve other functions. Meiosis genes were detected even in the putative ancient asexual protists Giardia intestinalis [8] and Trichomonas vaginalis [10]. Recently evolved asexual species, such as apomictic hybrid species provide an excellent model to investigate the evolution of meiosis-related genes under presumably relaxed selective constraints and can help to understand the transition from sexuality to asexuality. The unisexual hybrid species P. formosa and its bisexual, parental species are a particularly suitable model to investigate differences between sexually and asexually reproducing species and to determine relevant genes for the underlying processes.

P. formosa derived its common name ‘Amazon molly’ from a mythological Greek tribe of warrior women, the Amazons. It is an all-female species [11] with a natural distribution in the coastal areas along Northeastern Mexico and Southern Texas [12]. It reproduces clonally by sperm-dependent parthenogenesis, i.e., gynogenesis [13, 14]. Although this is a mode of asexual reproduction, gynogenesis does involve the mating of a female with a male from a different species (pseudo-fertilization) [15]. P. formosa evolved by hybridization of two sexually reproducing species, the Atlantic molly Poecilia mexicana (maternal) and the sailfin molly Poecilia latipinna (paternal) [16,17,18], and originated around 280, 000 years ago [19]. Both ancestral species [11] and the very closely related Tamesí molly (Poecilia latipunctata) [20] can act as sperm donors for P. formosa to initiate embryogenesis of the diploid apomictically produced oocytes [21, 22]. P. formosa progeny are identical copies (clones) of the mother, since the genetic material of the sperm donors does not contribute to the offspring except in very rarely occurring events of paternal introgression [23, 24], when parts of or the complete genetic material of the sperm introgresses and is passed on to subsequent generations, leading to polyploid or microchromosome-bearing lineages [25]. In the family Poeciliidae, like in other live-bearing fishes, insemination takes place by introducing the sperm via a copulatory organ, the modified primary anal fin (gonopodium), to the reproductive tract of the females [26]. Therefore, P. formosa must occur in sympatry with at least one of the species acting as sperm donors to be able to mate and subsequently reproduce [7]. This behavior has been described as sexual parasitism, given that the males gain no apparent benefits from mating with the heterospecific P. formosa [27], except under mate copying scenarios described by Schlupp et al. [28] and Heubel et al. [29].

In this study, we focus on the detection of genes that encode components specific for reproduction and meiosis. The presence or absence of functional (i.e., expressed) copies of these genes is evaluated by comparative transcriptome analyses of the unisexual Amazon molly P. formosa and its parental bisexual species. Such analysis can help to resolve the underlying molecular processes between the two reproduction modes and their evolution. Transcriptomics are a common tool for identifying genes of interest (candidate genes) for diverse research topics [30] and are particularly suitable to discover unique and shared genes/gene expression among closely related species [31, 32]. Here, we describe and characterize the transcriptome of a hybrid vertebrate, P. formosa, in comparison to both ancestral sexual species, P. latipinna and P. mexicana, generated by high-throughput sequencing of RNA from the gonads. The identification of more than 100 expressed genes related to reproduction, especially the meiotic cell cycle, in an apomictic species is remarkable and will provide a valuable genomic resource for future studies.

Methods

Sample preparation and next generation sequencing

To construct the transcriptomes of all three species (P. formosa, P. mexicana, and P. latipinna) the gonadal RNA of three females per species was sequenced with next-generation sequencing methods. These fish were taken from strains kept and bred at the University of Potsdam (Germany). The founder individuals of P. formosa (strain For III/9) were collected at Río Purificación (Barretal, Tamaulipas, Mexico) in 1993, P. latipinna (strain F.O II/7 1355) at Key Largo (Florida, USA) in 1993, and P. mexicana (strain Mex IV/5) at Laguna de Champaxan (Altamira, Tamaulipas, Mexico) in 1994. The fish were kept under standard conditions (12:12 h light-dark cycle at 25 °C) at the University of Potsdam in compliance with German animal welfare regulations. Two months before tissue collection (which took place in 2013), sexually mature females of each species were isolated into separate tanks to avoid gene expression shifts due to interactions with males. Before sacrificing the fish on ice, the sex and species affiliation of each individual was verified by examining the anal fin structure and the dorsal fin ray number, respectively. The excised gonads were immediately frozen in liquid nitrogen and stored at −80 °C. For RNA extraction, a combination of Trizol (Life Technologies) and the RNeasy Mini Kit (Qiagen) extraction methods was performed, including a genomic DNA removal protocol. Detailed instructions for the tissue collection and RNA isolation procedure can be found in Zhu et al. [33]. The total yield of RNA was calculated by measuring the concentration and purity using a Spectrophotometer (NanoDrop 1000; ThermoScientific) and the RNA isolates of three individuals per species were pooled for library preparation. A commercial sequencing provider (LGC Genomics GmbH, Berlin) performed transcriptomics library preparation and sequencing (100 bp, paired-end) of all three libraries on one channel of an Illumina HiSeq2000, as well as demultiplexing and adapter clipping (Casava v1.8.2; Illumina Inc.).

Preprocessing – Quality control, filtering and trimming

The initial processing of the data included quality control, filtering, and trimming of the raw reads. After controlling the quality of the obtained paired-end reads with the FastQC software (v0.11.2) [34], we used Trimmomatic (v0.32) [35] to perform different filtering and trimming steps. First, all reads containing an unknown base character (`N`) were removed. Second, bases which showed a low quality at the start or end of the read were cut off (leading/trailing). Third, the sliding window algorithm scanned the reads with a specific base wide sliding window (4 bp), which cut off when the average quality per base dropped below an average quality threshold (15). After trimming, the potentially present ribosomal RNA (rRNA) fragments were excluded from the dataset with SortMeRNA (v2.0) [36]. This software filters and removes rRNA by comparing the reads with clustered rRNA sequence databases of the small and large subunits of bacteria, eukaryotes, and archaea, compiled with the data of the SILVA project [37].

De novo assembly and removal of contamination

We initially built the transcriptomes on the basis of two strategies, de novo and genome-guided with the genome of P. formosa as reference genome (Ensemble release 2014) [38]. Recently, it has been argued that reference genomes are not always well suited as references for RNA sequencing experiments, unless they have been re-annotated before [39, 40]. Indeed, our assembly statistics and functional annotations for the reference-guided assemblies were not as good as the individual de novo assemblies for the three species. Therefore, we used the de novo assemblies for all subsequent analyses. The assembly of the trimmed and filtered reads was done with the software package Trinity (r20140717 to v2.2.0) [41] with standard parameters. Trinity is a widely used assembler based on the method of de Bruijn graphs for the reconstruction of transcriptomes de novo or genome-guided from RNA sequencing data. The Trinity assembler comprises three major consecutive software modules: First, reads were combined into larger contigs (by Inchworm), second, these contigs were clustered into components (by Chrysalis), and finally the most plausible sets of transcripts from these groups were produced (by Butterfly). Downstream analyses, e.g., to calculate quality statistics of the transcriptomes were conducted with the associated software tools of Trinity using Bowtie2 (v2.2.24) [42], SAMtools (v1.3) [43], and RSEM [44]. All sequence comparisons were conducted with the Basic Local Alignment Search Tool (BLAST) (v2.3.0+ and v2.6.0+) [45]. To identify potential contaminants within the assemblies, the transcripts of all the species were compared to protein sequence databases of four different non-target taxa (archaea, bacteria, fungi, and invertebrates) in UniProt (Swiss-Prot/TrEMBL release 2014_10) [46]. Beforehand, each taxonomic database was clustered by removing redundant sequences with 95% identity (CD-Hit v4.6.1) [47]. Transcripts which had a match were compared against a protein database of Danio rerio (TrEMBLE release 2014_10) to ensure that only real contaminants were eliminated from further analyses, while transcripts showing a high similarity with a fish sequence database were retained. Also, transcripts missing an open reading frame (ORF) were removed. ORF identification was achieved using the web server of OrfPredictor (v2.3) [48].

Annotation and comparative analyses

Classification of gene ontology (GO) terms into the categories “biological process”, “cellular components”, and “molecular functions” associated with a given gene product was carried out with the standalone graphical user interface (GUI) version of GOblet (v0.2.1) [49]. Based on sequence similarities and comparisons to well-annotated proteins from UniProt databases, the contigs of all three species were annotated with terms from the Gene Ontology project [50]. Only records with evidence codes assigned by curators of the GO Consortium from the UniProt/Swiss-Prot databases (release 2015_06) of humans, rodents, vertebrates, and mammals were chosen with an E-value cut-off of 1e−10, while those inferred solely from electronic annotation (IEA) were not considered. For each assembly, the frequencies of occurrence for the 150 generic GO slim terms (www.geneontology.org/ontology/subsets/) were calculated. The generic GO slim terms developed by the GO Consortium contain those GO terms, which show a high biological relevance and cover most of the genes/proteins annotated for all species in the database. Species-specific over- and under-representation of the GO terms was tested with a Fisher’s exact test (α = 0.05) with false discovery rate (FDR) correction of the p-value.

We conducted several sequence comparisons with different protein, genomic, and complementary DNA (cDNA) datasets of teleost fish species (Additional file 1: Table S1; E-value cut-off: 1e−50) [51] and the UniProt/Swiss-Prot database (release 2015_03; E-value cut-off: 1e−20) using the BLAST algorithm. For the identification of candidate genes relevant to our focus on sexual vs. asexual reproduction, the results were scanned for genes known to be involved in meiosis [8,9,10, 52, 53]. Furthermore, transcripts were translated to amino acid sequences with a minimum length of 70, using the Transdecoder pipeline (v3.0.2; http://transdecoder.github.io), which identifies coding regions and detects the longest ORF for every transcript in combination with homology results from the Swiss-Prot database (E-value cut-off: 1e-5) and Hmmer (v3.1b2) [54], which searches the peptides for protein domains against the pfam database (release 30.0) [55], a collection of protein family alignments. These sets of amino acid sequences were further analyzed by the OrthoFinder pipeline (v1.1.4) [56] to identify orthogroups of the three assemblies, using the Poecilia reticulata proteome as reference [57]. Afterward, the orthogroups were annotated with GOblet and analyzed specifically with regard to differences between the unisexual P. formosa and the three sexual species (P. latipinna, P. mexicana, and P. reticulata).

Differential expression

Processed reads of each species were mapped back to the combined transcripts of all three species with Bowtie2 using strict mapping parameters (no-discordant | no-mixed | score-min L,-0.1,-0.1). Then, the transcripts were clustered with Corest (v1.06) [58] and for each gene cluster, the number of mapped reads of each species was compiled. Based on the clustering, differential expression between the unisexual (P. formosa) and sexual (P. mexicana and P. latipinna) species was analyzed for gene clusters with transcripts occurring in all three species using edgeR [59]. Because of the absence of a second (replicate) unisexual species, the dispersion value was set to 0.1 as recommended in the manual and statistical significance of inferred up- and down-regulated genes was not evaluated. Up- and down-regulated genes were annotated with GOblet (v0.2.2). For specific parent GO terms and all their child-terms, all entries were manually scanned to identify further candidate genes to be differentially expressed under unisexual vs. sexual reproduction. For six genes related to the androgen receptor pathway, we obtained gene-specific expression data produced by quantitative real-time RT-PCR (abbreviated as qRT-PCR; data from [33]), originating from the same RNA isolates used for our transcriptome analysis. These gene-specific expression levels were used to exemplary validate differential expression of the respective genes derived from the transcriptome data.

Results

Next generation sequencing and de novo assembly

The sequencing of the three individuals from each species, pooled into one library on one lane of an Illumina HiSeq2000 (paired-end, 100 bp), yielded 115,183,830 raw reads for the Amazon molly P. formosa, 117,678,742 for the sailfin molly P. latipinna, and 100,309,634 for the Atlantic molly P. mexicana (Table 1). The quality control with FastQC showed that the Phred quality was lower in the first three base pairs of the reads, as well as at the end. After adapter clipping and trimming, the number of read pairs was 56,916,341 for P. formosa, 58,302,260 for P. latipinna, and 49,722,788 for P. mexicana. The reads had an average length of 94 bp for P. formosa and 95 bp for P. latipinna and P. mexicana. The average Phred quality of the reads was about 36 (Table 1). Before the assembly, the read pairs which presented potential rRNA fragments were removed with SortMeRNA (0.49% for P. formosa | 0.85% for P. latipinna | 0.89% for P. mexicana). The processed reads for all three species can be obtained from the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) (BioProject: PRJNA385580 - P. formosa: SAMN06894540 | P. latipinna: SAMN06894541 | P. mexicana: SAMN06894542).

Table 1 Transcriptome sequencing results (100 bp, single end)

The de novo assembly for the three read sets was conducted with the Trinity assembler (Table 2). The average contig length for the 108,690 transcripts for P. formosa was 1077 bp, for P. latpinna there were 117,211 transcripts with an average length of 1232 bp, and for P. mexicana the average length was 1365 bp across 135,217 transcripts (Additional file 2: Figure S1). The weighted median length of the transcripts (N50 value) was 1764 bp for P. formosa, 2339 bp for P. latipinna, and 2569 bp for P. mexicana. By comparisons with the four clustered databases of archaea, bacteria, fungi and invertebrates (clustering reduction: bacteria 81%, archaea 27%, fungi and invertebrates 32%) from UniProt (Swiss-Prot/TrEMBL), we detected 1106 (1.02%) possible contaminants among P. formosa transcripts, 1160 (0.99%) in P. latipinna and 1209 (0.89%) in P. mexicana, mostly belonging to invertebrates. ORFs were missing in 0.61% of the transcripts of P. formosa, 0.75% in P. latipinna, and 0.73% in P. mexicana. In total 1768 (1.63%) contigs for P. formosa, 2035 (1.74%) for P. latipinna, and 2192 (1.62%) for P. mexicana were excluded from the transcriptome datasets used for further analysis, either as contaminants or because of a lacking ORF.

Table 2 Statistics for the de novo assembly

Comparative analysis and identification of candidate genes

Functional gene annotation with the GOblet software yielded 47,719 transcripts assigned to GO terms for P. formosa (44.63%), 46,157 (40.08%) for P. latipinna, and 55,659 (41.84%) for P. mexicana, based on sequence similarity comparisons with the UniProt/Swiss-Prot databases of vertebrates, rodents, human, and mammals (total entries: 47,483). In total, 19,227 different GO terms were detected among all three transcriptomes; 18,531 of these were shared between all three species (total number of GO terms for P. formosa: 18,856 | P. latipinna: 18,807 | P. mexicana: 19,019); 85 GO terms are unique for the P. formosa (70 transcripts) assembly (Fig. 1). The relative frequency of found generic GO slim terms was similar for all three species. Significant differences for the GO terms enrichment analysis between the species could be found for 32 GO terms between P. formosa and P. mexicana and for 17 GO terms between P. formosa and P. latipinna (Fig. 2), six of which were significantly different to the unisexual P. formosa in both sexual species (i.e., “immune system process”, “cellular amino acid metabolic process”, “signal transduction”, “cell-cell-signalling”, “biosynthetic process”, and “lyase activity”). The GO enrichment analysis for the detected GO terms for each species (Additional files 3, 4 and 5: Figures S2, S3, S4) did not reveal any differences among the three species.

Fig. 1
figure 1

Venn diagram of unique and shared gene ontology (GO) terms for the three species Total numbers of identified GOs unique for each species and shared among species; for GOs unique to a species, the respective number of transcripts are specified

Fig. 2
figure 2

Occurrence of the represented generic GO slim terms (proportional to the total number) within the annotated transcriptomes of the Amazon molly (P. formosa) and its ancestral species P. latipinna and P. mexicana for the three main categories “Biological process”, “Cellular component” and “Molecular function”. Significant differences (one sided Fisher-Test; p < 0.05 corrected for multiple testing via Benjamini-Hochberg) between the species are labelled as * for P. formosa/P. mexicana; # for P. formosa/P. latipinna and § for P. latipinna/P. mexicana

Sequence comparisons among teleost fish included our three species, two additional species from the family Poeciliidae, i.e., the guppy (P. reticulata) and the common platy (Xiphophorus maculatus), and two well-annotated model organisms, the Japanese medaka (Oryzias latipes) and the zebrafish (D. rerio) (Table 3). In comparison with the two model species, the three assemblies showed similar results. P. formosa had a slightly higher congruency with the entire Swiss-Prot database (52.45%) in comparison with P. latipinna and P. mexicana (46.48% and 48.54%, respectively). All BLAST results were scanned for 108 meiosis-related genes obtained from the literature and databases, 46 of these genes are specific to the meiotic cell cycle (Table 4). Three common housekeeping genes [60, 61], the gene for the TATA-box binding protein (tbp), the hypoxanthine phosphoribosyl transferase 1 (hprt1), and Beta-actin (actb), were inspected and found to be equally present in all three species in terms of the number of counts. Out of the 108 meiosis-related genes, only the stra8 gene and the meiosis-specific hormad2 gene could not be detected in any of the assemblies of our study species. Four other genes were not found for P. formosa: the ccnB1ip gene, the xycp1 gene and two meiosis-specific genes (rad51B and rec114). In total 1335 transcript counts of meiosis-related genes were discovered for P. formosa, markedly fewer than the 2313 counts for P. mexicana and 2054 for P. latipinna.

Table 3 Summary of BLAST comparisons
Table 4 Genes associated with meiosis, their Uniprot accession ID and the number of the corresponding transcripts in the Amazon molly (P. formosa: Pfor), the Sailfin molly (P. latipinna: Plat) and Atlantic molly (P. mexicana: Pmex) transcriptomes

To identify putative orthologues, the transcripts were first analyzed by the Transdecoder pipeline, beginning by translating the contigs into amino acid sequences. The total number of ORFs regardless to their coding potential was 218,390 amino acid sequences for P. formosa, 251,006 for P. latipinna, and 318,099 for P. mexicana. All amino acid sequences were compared via the blastp algorithm to the UniProt/Swiss-Prot database, yielding 44,860 (20.54%) matches for P. formosa, 57,563 (22.93%) for P. latipinna, and 73,013 (22.95%) for P. mexicana. Homology comparisons with the pfam database resulted in 72,519 matches for P. formosa (corresponding to 13,341 unique database entries), 78,797 for P. latipinna (corresponding to 13,388 unique database entries), and 99,659 for P. mexicana (corresponding to 13,603 unique database entries). In total, the Transdecoder analysis yielded 82,815 amino acid sequences predicted as likely coding regions for P. formosa, 87,235 for P. latipinna, and 109,824 for P. mexicana, which were all fed into the OrthoFinder pipeline, together with the P. reticulata proteome (Table 5). For the 323,589 amino acid sequences across all four species, 77.24% were assigned to 37,781 orthogroups with a median group size of four genes. An orthogroup includes the orthologous genes of the compared species and is defined as the group of genes descended from a single gene in the last common ancestor of a group of species. 74.38% of the amino acid sequences of P. formosa were assigned to orthogroups with P. reticulata (P. mexicana: 69.45% | P. latipinna: 73.42%). All four species shared 15,027 orthogroups. Ninety orthogroups (comprising 1052 genes, corresponding to 0.33% of all genes) were species-specific, i.e., they consisted entirely of genes detected only in one species. Specifically, 14 orthogroups were unique for P. formosa, 33 for P. latipinna, 24 for P. mexicana, and 19 for P. reticulata (Fig. 3, created with the online application jvenn [62]). The unique orthogroups for each of the four species and the 988 orthogroups, which were exclusively identified among the three sexual species, were annotated to detect differences in the occurrence of the corresponding GO terms (generic slim) between the sexual and the unisexual species (Fig. 4). In the sexual species, there are more genes annotated to the GO term “embryo development” (GO:0009790) or “chromosome” (GO:0005694) than in the unisexual P. formosa. In contrast, P. formosa exhibits more genes in unique orthogroups for different enzyme activities (for example, “ligase activity” (GO:0016874)). None of the orthogroups specific for P. formosa was associated with reproduction or meiosis. The analysis of the 988 orthogroups shared among the sexual species revealed 34 additional genes related to the meiotic cell cycle (Additional files 6: Table S2, Additional files 7, 8 and 9).

Table 5 Orthology analysis using OrthoFinder
Fig. 3
figure 3

Identified orthogroups for the transcriptomes of P. formosa, P. mexicana, and P. latipinna in comparison with the proteome of P. reticulata

Fig. 4
figure 4

Number of unique gene ontology term entities for the annotated orthogroups for P. formosa, P. mexicana, P. latipinna, P. reticulata and the sexual species in combination (P. latipinna, P. mexicana, and P. reticulata). Only generic slim GO terms significantly different in their occurrence (one-sided Fisher-Tests; p < 0.05 corrected for multiple testing via Benjamini-Hochberg), between P. formosa and the other species are shown; bar sizes are proportional to their total number of occurrences

Compared to its sexually reproducing parental species, 2035 (4.69%) genes were up-regulated and 564 (1.30%) genes were down-regulated in the unisexual P. formosa identified at a false discovery rate (FDR) of 5% for the 43,356 tested genes, corrected via the Benjamini and Hochberg’s algorithm (Fig. 5). The differentially expressed genes associated with the GO terms “reproduction” (GO:000003) and “reproductive process” (GO:0022414) are listed in Table 6. Twenty seven genes related to reproduction have a higher expression in P. formosa, e.g., the gene of the Speedy protein A. For the GO enrichment of the GO term “cell junction” (GO:0030054), up- and down-regulated genes were over-represented. This means some genes of this GO term may be up-regulated in the sexual, others in the unisexual species. This is indicative of an alteration of gene activity between the unisexual and sexual species (Fig. 6). We exemplary compared our expression patterns to gene-specific expression data for six genes of the androgen receptor pathway by qRT-PCR on the same RNA isolates used for the transcriptome analysis (data from [33]). One gene (cyp19a2) was consistently up-regulated in the asexual species (1.9 fold in qRT-PCR; 1.4 fold with regard to transcriptome read numbers). Two genes (erα and erβ) were consistently up-regulated in the sexual species (1.7 resp. 1.6 fold in the qRT-PCR; both 1.9 fold in the transcriptome analysis). Two further genes (arβ and cyp19a2) were not differentially expressed in neither the qRT-PCR study nor the transcriptome analysis. For one gene (arα), the transcriptome data exhibited a 3.7 fold higher read number in the asexual species, relative to the sexual species. This was not confirmed by qRT-PCR, but expression at this gene was very variable among six biological replicates in one of the sexual species, P. latipinna (2.9 fold within 95% confidence limits). While the up-regulation of arα detected in the transcriptome data for the asexual species may be hence a false positive (presumably caused by the variable expression in one of the sexual species), we find overall consistent expression patterns in five (out of six) analyzed genes among transcriptome read number analysis and a gene-specific qRT-PCR analysis. The scale of expression differences (fold change) was also similar among the two methods.

Fig. 5
figure 5

Mean difference plot showing the log-fold change and average abundance of each gene of the differential expression analysis between P. formosa (“Unisexual”) and P. mexicana and P. latipinna (“Sexual”). Color depicts genes down-regulated (blue) or up-regulated (red) in P. formosa

Table 6 Detected GO term IDs, the GO term names and the corresponding genes for the up-regulated (+) or down-regulated (−) in P. formosa (only genes involved in reproduction and meiosis are listed)
Fig. 6
figure 6

Gene ontology term enrichment for the differentially expressed genes between the unisexual (P. formosa) and sexual species (P. mexicana and P. latipinna) generated by GOblet. Only significantly enriched (red) or depleted (blue) generic GO terms are shown (one-sided Fisher-Tests; p < 0.05), for the three components “Molecular function (a)”, “Biological process (b)” and “Cellular components” (c)

Discussion

Quality of the de novo transcriptome assemblies

De novo assembly of the datasets resulted in a higher number of transcripts for the Atlantic molly P. mexicana and the sailfin molly P. latipinna, compared to the Amazon molly P. formosa and sequencing statistics were overall quite similar between the parental species, especially regarding the N50 value and the average contig length. On average, the statistics for the de novo assemblies show similar results compared to other transcriptomes of fish species using RNA sequencing techniques (Illumina) [63,64,65,66]. The higher number of total transcripts for all three species compared to other transcriptomes of the family Poeciliidae, for example, the P. mexicana transcriptome (number of transcripts: 80,111) [63] or the transcriptome of the Western mosquitofish Gambusia affinis (average number of transcripts: 63,734) [64], can be likely attributed to the fact that we retained some of the transcripts with a low expression, which some other authors may have filtered out. We – on purpose – retained these transcripts in order to maintain our ability to detect genes expressed in a species-specific manner.

For the Trinity assembler, which is well suited for the reconstruction of transcriptomes de novo [38, 67, 68], each component (also referred to as unigenes) represents a set of transcripts, which are assumed to represent genes (P. formosa: 59,935 | P. latipinna: 73,450 | P. mexicana: 79,522) and include different isoforms (transcripts) derived from alternative splicing or closely related paralogs. Based on the longest isoform for each component, all three assemblies are more similar in the N50 value (P. formosa: 1510 bp | P. latipinna: 1654 bp | P. mexicana: 1726 bp) and the average contig length (P. formosa: 865 bp | P. latipinna: 843 bp | P. mexicana: 859 bp). Comparing our transcriptomes to the annotation releases of the three Poecilia species genomes, P. formosa has a lower number of transcripts and components than both ancestral species; this appears to reflect the actual composition of the datasets. A lower number of genomic mRNA transcripts has been previously reported for P. formosa (39,207), compared to P. mexicana (47,406) and P. latipinna (47,072) (NCBI annotation release 100, 2015).

All three de novo assembled transcriptomes exhibited comparable quality measures in downstream analyses, like functional annotation and the comparative analysis of sequence similarities with the different teleost databases. Also, the ratio of transcripts without ORFs and possible contaminants for all assemblies was similar. The contamination load obtained was only between 1.6 and 1.7% of the transcripts per species. All three de novo assemblies showed high consistency with the different genomic and proteomic datasets. By implication, the agreement with the closer related species of the Poeciliidae family was higher in comparison to the less closely related species, like the Japanese medaka O. latipes or the zebrafish D. rerio. Even with very strict mapping parameters, a high percentage of reads mapped back to the transcripts (P. formosa: 76.39% | P. latipinna: 78.69% | P. mexicana: 77.63%), which matches the desired range (between 70 and 80%) described in the Trinity user guide. In summary, overall results are similar for all the de novo assembly datasets, suggesting that the transcriptomes for all three species were suitable for comparative analysis.

Differential gene expression between unisexual and bisexual mollies

Based on the clustering and count data of the mapped reads of the species with themselves and between them, we performed a differential expression analysis comparing two conditions (unisexual vs. sexual). We considered the two sexual species as biological replicates and compared this group to the unisexual species. As a second related unisexual species does not exist, we do not have a species replicate for the unisexual condition. We were hence unable to establish statistical significance for the inferred 2035 up-regulated and 564 down-regulated genes identified for the unisexual P. formosa. We exemplary confirmed the transcriptome-derived expression patterns in five (out of six) genes analyzed by qRT-PCR. We are also aware of that an unknown number of differentially expressed genes may have gone undetected and a thorough analysis of differential expression would require a higher number of replicates per condition [69]. Nonetheless, we argue – with caution – that differences in read numbers in our transcriptome data may have revealed candidates for genes differentially expressed among sexual vs. unisexual species, to be further analyzed in future research.

We used three different approaches to identify candidate genes, which may be involved in the molecular underpinning of the different reproduction modes among the sexual and unisexual species. First, we searched the BLAST results for the occurrence of genes related to meiosis or reproduction. Second, we conducted an orthology analysis with a closely related species, the guppy P. reticulata. Finally, as described above, we identified differentially expressed genes, i.e. those, which are higher or lower expressed in P. formosa, as compared to its parental species. Scanning the BLAST results for the occurrence of 108 meiosis-related genes showed that 1.25% (equates to 1335 transcripts) of all generated transcripts for P. formosa are linked to the meiotic cell cycle which is significantly lower compared to 1.74% for P. mexicana and 1.78% for P. latipinna (p < 0.05 in both pairwise comparisons, tested with χ2 test). The ratio of the meiosis-specific genes to the total number of transcripts is 0.50% in P. formosa (P. mexicana: 0.73% | P. latipinna: 0.92%). In line with the lack of meiosis in P. formosa, a significantly lower percentage of transcripts was related to this process, in comparison to the sexual species. Yet, the down-regulation of meiosis-related genes is not as complete as one might have expected for a species producing gametes apomictically. Only two meiosis-related genes could not be detected in any of the three transcriptomes (str8 and hormad2). The stimulated by retinoic acid gene 8 (str8) is required in mice for the transition of female and male germ cells into meiosis and is typically expressed in adult testes and embryonic ovaries [70]. Therefore, this gene is not necessarily expressed in adult female gonads, the tissue analyzed here. The second absent gene was the hormad2 gene, which encodes the HORMA domain-containing 2 protein. The hormad1 and hormad2 genes are explicitly expressed during meiosis in male and female mice [71], but nothing is known about their function in fish.

In P. formosa, the most prominent meiosis-specific gene lacking in the transcriptome was the gene for meiotic recombination protein Rec114, required for DNA double strand break (DSB) formations, which induces meiotic recombination [72]. Studies in mice showed that the rec114 gene is expressed in adult testes and in embryonic ovaries and seems to be conserved among most sexually reproducing eukaryotes [73]. This gene was not found in a previously published transcriptome of the Amazon molly either [74]. The functional annotation of the homologous genes for P. mexicana, P. latipinna, and the closely related P. reticulata yielded 35 genes of interest, which were absent in the Amazon molly transcriptome. Particularly interesting is the gene for the ATP-dependent RNA helicase cgh1 (conserved germline helicase-1). In the hermaphrodite Caenorhabditis elegans, it is responsible for regulating maternal mRNA translational repression and protecting it from degradation (reviewed in [75]). The absence of this gene in C. elegans and presumably in other organisms leads to non-functional sperm and, more importantly, to the degradation of developing oocytes [76].

Evolutionary implications of lowered expression in meiosis-related genes

Our results raise questions about the function of the detected and missing genes expressed in the Amazon molly P. formosa gonads as well as about its reproduction mechanisms. The presence or absence of transcripts related to a specific process (in this case reproduction and especially meiosis) lead to expectations about their evolution in asexual species compared to sexual ones. If a certain biological process is no longer maintained, the underlying genes are expected to be under reduced functional constraints (relaxed selection), leading to the accumulation of deleterious mutations, which may compromise their biological function and/or their expression. Ultimately, genes may degenerate such that they can become pseudogenes [77, 78]. The time span since P. formosa evolved from its ancestor species (280,000 years [19]) may have been too short to result in pervasive pseudogenization of meiosis genes. Nonetheless, the generally lower expression levels and the lack of expression in several such genes, some of which of crucial importance in sexual reproduction, points to an evolutionary erosion of genes no longer necessary in an apomictic species.

However, meiosis genes are not always under relaxed selection in asexually reproducing species. In a comparison of obligate sexual and asexual individuals in the freshwater snail Potamopyrgus antipodarum, three meiosis-specific genes (spo11, msh4 and msh5) exhibited no degeneration in the asexual lineages, but were instead inferred to be under purifying selection [79]. Also, for three ancient asexual oribatid mites, there is stronger purifying selection on nuclear and mitochondrial orthologous genes compared to sexual species [80]. For the microcrustacean Daphnia pulex, whose reproduction cycle consists of alternating sexual and asexual phases, the main meiosis genes are present in the genome and are expressed under parthenogenesis [9]. These genes could gain new or until now undiscovered functions, possibly leading to novel alternative pathways to meiosis. For example, the spo11 gene – known to initiate meiotic recombination by the introduction of DSBs in sexual species – has been described to lead to extensive genetic recombination between homologous chromosomes, including multiple gene conversion events, in an ameiotic species, the parasexual fungus Candida albicans [81]. Gene conversion has been frequently detected in P. formosa [18], but deeper molecular knowledge is needed to unravel, whether there are potential alternate functions of meiosis genes in this species. Comparing meiosis-specific genes on the intron/exon level among the three species can be a first approach to analyze their functions and to detect selective constraints. An additional approach would be to study knockout/knockdown individuals in comparison with the wild type, which is a well-established and extensively used genetic technique to directly examine functional and phenotypic effects of candidate genes [82], particularly in the fish model organisms D. rerio (reviewed in [83]) and O. latipes [84] as well as for Carassius gibelio [85], which has multiple reproduction modes including sexual reproduction and unisexual gynogenesis [86]. The dataset published in this study forms an excellent basis for further investigations, including those described above or for single nucleotide polymorphism (SNP) detection, and qRT-PCR, ideally conducted in an allele-specific manner, to resolve the evolutionary questions raised. Furthermore, our dataset would be beneficial for the (re-)annotation of the genomes of all three species.

Conclusions

The generated de novo gonadal transcriptomes of the Amazon molly Poecilia formosa and its parental species, the sailfin molly P. latipinna and the Atlantic molly P. mexicana, were functionally annotated and analyzed on the basis of sequence similarities between the species. They provide a valuable resource for questions concerning the reproductive mode of an asexual hybrid species in comparison to its sexual ancestor species. Interestingly, there are also vertebrate examples, where hybrid speciation leads to an automictic form of parthenogenesis. Here, meiosis and recombination are maintained (e.g., in whiptail lizards, [87]). In contrast, our ameiotic species lacks recombination and is hence a ‘frozen hybrid’ at all nuclear loci [18, 33, 88].

Inline with our a priori hypothesis, there was a general tendency towards lower expression of meiosis-related genes in the apomictic P. formosa. However, only a few of these genes were completely absent in the P. formosa transcriptome, while the remainder constitutes interesting candidates for further evolutionary studies, e.g., on potential neofunctionalization vs. pseudogenization. Furthermore, our dataset comprises a substantial addition to the already present genomic resources available for the family of Poeciliidae and can be used for future sequencing projects as well as for the annotation of the genome for all three species.