Evaluation of the mechanisms of intron loss and gain in the social amoebae Dictyostelium
Spliceosomal introns are a common feature of eukaryotic genomes. To approach a comprehensive understanding of intron evolution on Earth, studies should look beyond repeatedly studied groups such as animals, plants, and fungi. The slime mold Dictyostelium belongs to a supergroup of eukaryotes not covered in previous studies.
We found 441 precise intron losses in Dictyostelium discoideum and 202 precise intron losses in Dictyostelium purpureum. Consistent with these observations, Dictyostelium discoideum was found to have significantly more copies of reverse transcriptase genes than Dictyostelium purpureum. We also found that the lost introns are significantly further from the 5′ end of genes than the conserved introns. Adjacent introns were prone to be lost simultaneously in Dictyostelium discoideum. In both Dictyostelium species, the exonic sequences flanking lost introns were found to have a significantly higher GC content than those flanking conserved introns. Together, these observations support a reverse-transcription model of intron loss in which intron losses were caused by gene conversion between genomic DNA and cDNA reverse transcribed from mature mRNA. We also identified two imprecise intron losses in Dictyostelium discoideum that may have resulted from genomic deletions. Ninety-eight putative intron gains were also observed. Consistent with previous studies of other lineages, the source sequences were found in only a small number of cases, with only two instances of intron gain identified in Dictyostelium discoideum.
Although they diverged very early from animals and fungi, Dictyostelium species have similar mechanisms of intron loss.
KeywordsDictyostelium discoideum Dictyostelium purpureum Intron gain Reverse transcriptase GC content Imprecise intron losses
Horizontal gene transfer
Million years ago
Non-homologous end joining
Simple sequence repeat
With the exception of the relics of certain endosymbiotic nuclei , all eukaryotic genomes contain spliceosomal introns. Evidence also suggests that eukaryotic genes transferred from organelles or prokaryotes have generally experienced a high rate of intron insertion subsequent to the transfer [2, 3, 4, 5, 6, 7]. The existence of spliceosomal introns is a common feature of eukaryotic nuclear genomes. However, previous studies indicated that the dynamic changes in introns vary greatly among eukaryotic lineages [8, 9, 10, 11, 12, 13, 14]. Thus, a model that successfully explains the mechanisms of intron loss or gain in some eukaryotic lineages may be inadequate for other lineages [15, 16]. Three models have been proposed for the mechanism of intron loss [17, 18, 19, 20]. The first is the reverse transcription model, also termed mRNA-mediated intron loss, in which introns are deleted from the genome by recombination between the genomic DNA and cDNA reverse transcribed from spliced mRNA. In this model, the introns are precisely deleted and adjacent introns tend to be lost simultaneously. As the binding of reverse transcriptase and RNA template is unstable, the reverse transcription process frequently aborts, thus producing incomplete cDNA molecules. Recombination of these cDNAs with genomic DNA would cause a preferential loss of introns from the 3′ end of genes. A long intron is expected to disturb the in vivo alignment of homologous regions between the cDNA and the genomic DNA and therefore be lost at a lower frequency than a short intron. The second model of intron loss is simple genomic deletion. In this model, each individual intron is lost independently and without bias with respect to its position within a gene. In this model, introns might occasionally be lost precisely but are typically accompanied by insertions and/or deletions in the flanking exonic sequences. The final model is one in which introns are lost during non-homologous end joining (NHEJ) repair of DNA double-strand breaks. As the repair process generally requires microhomology between the break sites, this model predicts that there should be short direct repeats at the two ends of the lost intron. Besides this, the same predictions are shared between the NHEJ repair model and the genomic deletion model. The first model has been widely supported by studies that are carried on animals, fungi, and plants [12, 13, 16, 21, 22, 23, 24, 25]. However, the pattern of intron losses observed in Arabidopsis thaliana was different from that predicted by the first model but consistent with the third model in which introns are lost during DNA double-strand break repair [19, 26]. Short direct repeats at the splice sites of lost introns have been detected in plants and invertebrates [25, 26, 27], supporting the third model. However, another prediction of the third model, imprecise intron loss, has not been observed to have a high frequency in most eukaryotic lineages .
For a comprehensive understanding of intron evolution on Earth, studies should cover all major eukaryotic lineages. However, a considerable bias exists toward a limited number of model organisms in animals, plants, and fungi [9, 12, 16, 21, 23, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36], which belong to two (Opisthokonta and Archaeplastida) of the five supergroups of eukaryotes according to the recent consensus phylogenetic tree of eukaryotes [37, 38]. Limited by the biased distribution of sequenced genomes , very few studies have been carried out in the most diverse kingdom, Protozoa [11, 39, 40, 41, 42, 43, 44, 45]. Even in these few studies, the research materials were heavily biased toward Plasmodium, a model lineage of the supergroup SAR (stramenopiles, alveolates, and Rhizaria) [37, 38]. The slime mold Dictyostelium belongs to another supergroup of eukaryotes, Amoebozoa [37, 38]. Dictyostelium can form differentiated multicellular structures by aggregating thousands of solitary amoebae in response to starvation . The most prominent member, Dictyostelium discoideum, has been used as a model organism to study multicellularity, cell differentiation, signal transduction, cell migration, and development for many years . At the genomic level, Dictyostelium has two characteristics which make them helpful for the further study on intron evolution. The first one is the enrichment of simple sequence repeats (SSRs) in the Dictyostelium genomes [48, 49]. According to the NHEJ repair model of intron loss , these SSRs, if exist at the splice sites, could mediated intron losses. The second special characteristic of the Dictyostelium species is the 16 documented new genes gained from bacteria by horizontal gene transfer (HGT) after their divergence from plants and animals, but prior to the divergence among themselves [48, 49]. We hope that these new genes might give some implications on the mechanism of intron gain. In the present study, we surveyed the intron losses and gains in both Dictyostelium discoideum and Dictyostelium purpureum and explored the mechanisms underlying these variations.
Results and discussion
Molecular mechanisms of intron losses in Dictyostelium
We also found that the lost introns are significantly shorter than the conserved introns (Mann–Whitney U test, P = 0.004 and 0.033, respectively, for Dictyostelium discoideum and Dictyostelium purpureum). However, their differences are very small (median size: 76 bp vs. 77 bp in Dictyostelium discoideum and 100 bp vs. 105 bp in Dictyostelium purpureum; mean size: 81 bp vs. 94 bp in Dictyostelium discoideum and 113 bp vs. 125 bp in Dictyostelium purpureum). Compared with vertebrates, the introns of Dictyostelium discoideum and Dictyostelium purpureum are very short. Therefore, we are cautious in interpreting these size differences as evidence of the reverse-transcription model.
Different intron loss rates associated with the abundance of retrotransposons
We observed a large difference in the rate of intron loss between these two species; whereas Dictyostelium discoideum was found to have lost 443 introns, Dictyostelium purpureum lost only 202 introns. χ2 tests showed that this difference is significant, regardless of whether the extant introns were represented by all the annotated introns (number of annotated introns: 15,510 in Dictyostelium discoideum and 18,412 in Dictyostelium purpureum, P = 10−30) or only the conserved introns (P = 3 × 10−20). The genomes of these two species are similar in size, with sizes of 34 and 33 Mb, respectively . Therefore, the difference in intron loss rate is not likely driven by different forces acting on genome size. Two possible explanations based on the reverse-transcription model were tested.
The first possibility is that the introns of Dictyostelium discoideum are generally shorter; thus, the genomic DNAs are more suitable substrates of the recombination process than those of Dictyostelium purpureum. Whereas Dictyostelium discoideum has a mean intron length shorter than that of Dictyostelium purpureum (132.55 vs. 162.26 bp, respectively), globally, its introns are significantly longer than those of Dictyostelium purpureum (median intron sizes: 104 bp vs. 78 bp, respectively; Additional file 3: Figure S1). Dictyostelium purpureum has a longer mean intron length because of its small proportion of extraordinarily large introns.
Copy numbers of reverse transcriptases and retrotransposons
It is also possible that intron losses are not strictly neutral [32, 53] and that a difference in the efficiency of natural selection contributed to the observed different rates of intron loss. For example, if intron losses were primarily beneficial, they would be more likely to be fixed in species with a large effective population size. In contrast, if they were slightly deleterious and were fixed by genetic drift, the species with a small effective population size should have lost more introns. The long length of time separating the divergence of the two species, 400 million years, suggests that synonymous substitutions may have become saturated and are impossible to be estimated accurately . Therefore, dN/dS, the common method used to estimate the efficiency of natural selection and genetic drift [54, 55], is not applicable in this study. Some researchers believe that introns and other repetitive sequences are slightly deleterious; their abundance, therefore, should be negatively correlated with effective population size . According to this hypothesis, the abundance of introns and the abundance of repetitive sequences should be positively correlated. The genome of Dictyostelium discoideum was found to have fewer introns, a shorter total intron length, and more intron losses than that of Dictyostelium purpureum (Additional file 3: Table S3). Unexpectedly, however, Dictyostelium discoideum has more repetitive sequences than Dictyostelium purpureum, even if retrotransposons are withheld from the comparison (Additional file 3: Table S3). Although this evidence is not strong, it indicates that the intron losses in Dictyostelium were unlikely to have been driven by the putative selective forces experienced by repetitive sequences. Future polymorphism data of Dictyostelium will be essential to infer the contribution of natural selection to the frequency of intron loss.
High GC contents around lost introns: evidence for biased gene conversion
The above evidence, in addition to data from numerous previous studies, supports the model that an intron can be deleted during a recombination event occurring between genomic DNA and intronless cDNA [12, 16, 21, 23, 24]. However, the details of such a recombination process have seldom been explored ; therefore, we cannot ascertain whether this was a gene conversion process or a reciprocal crossover recombination process, although the former has often been used in the description of the reverse-transcription model [21, 29, 58, 59, 60].
Furthermore, the high GC content of the exonic sequences flanking lost introns can be explained in two ways. The first explanation is that intron losses preferentially occurred at gene conversion hotspots, where GC contents are higher regardless of whether intron losses have occurred. If this is the case, the exonic regions flanking the extant introns of discordant positions should also have higher GC contents than the exonic sequences flanking conserved introns. For example, gene DDB_G0293580 of Dictyostelium discoideum has lost its second intron while its orthologous gene in Dictyostelium purpureum retained it, i.e. second intron of gene DPU_G0070640. If the intron was lost because of its presence at a gene conversion hotspot, the exonic sequence flanking the second intron of gene DPU_G0070640 is expected to have higher GC content. As shown in Fig. 5b, this prediction has also been proved in both Dictyostelium discoideum and Dictyostelium purpureum. The second explanation is that the intron losses have increased the GC contents of nearby exonic sequences via biased gene conversion. If this is the case, the exonic sequences flanking lost introns should have higher GC contents than their orthologous regions flanking the unique introns. The global GC content differs significantly, however, between Dictyostelium discoideum and Dictyostelium purpureum (all coding sequences were compared, with median values of 28.35 % for Dictyostelium discoideum and 30.83 % for Dictyostelium purpureum, Mann–Whitney U test, P = 0). Therefore, we compared the GC content relative to the exonic sequences flanking conserved introns rather than the absolute values. As shown in Fig. 5c, the exonic sequences flanking lost introns have significantly higher relative GC contents than their orthologous regions. If gene conversion were to increase the GC content of nearby exonic sequences, there should be differences between the two participants of the gene conversion process. The source of nucleotide difference between cDNA and the genomic DNA might result from either transcription errors, reverse transcription errors , or even DNA replication errors if the gene conversion occurred after integration of the cDNA into a chromosome.
In the above comparisons, the GC contents of 100-bp regions from each side of the discordant and conserved intron positions were surveyed. Adjusting the surveyed sequence length to 50 bp and 200 bp yielded similar but less robust results. Longer sequences indicate that more nucleotides surveyed are likely to lie beyond the conversion tracts . Counting the GC content only at 4-fold degenerate sites gave similar but slightly weaker results. Although 4-fold degenerate sites are more accurate in revealing the mutation rate, their limited numbers in coding sequences maximizes the stochastic noise in the calculation of the GC contents.
Because Dictyostelium discoideum and Dictyostelium purpureum diverged approximately 400 million years ago (Mya), it is possible that the changes in GC contents and the intron loss were independent from each other but are correlated by chance. For this reason, we analyzed the relationship between GC content and intron loss in six species that diverged from their sister species more recently: Arabidopsis thaliana (diverged from Arabidopsis lyrata 13 Mya ), Caenorhabditis briggsae (diverged from Caenorhabditis remanei 17.2 Mya ), C. remanei (diverged from C. briggsae 17.2 Mya ), Rattus norvegicus (diverged from Mus musculus 22.6 Mya ), Brassica rapa (diverged from Thellungiella parvula 30.8 Mya ), and Drosophila willistoni (diverged from Drosophila melanogaster 47.6 Mya ). The intron losses of A. thaliana, R. norvegicus, B. rapa, and Drosophila willistoni were obtained from previous publications [13, 16, 25, 28, 32, 66]. The number of intron losses in A. lyrata, M. musculus, T. parvula and Drosophila melanogaster are less than 50 in all cases. These results are not included in our analysis to minimize the stochastic noise. The intron losses of C. remanei and C. briggsae were identified in this study. The pattern of higher GC contents in the exonic sequences flanking lost introns have been confirmed in A. thaliana, B. rapa, Drosophila willistoni, C. remanei and C. briggsae (Additional file 3: Table S4). However, in R. norvegicus, the exonic sequences flanking lost introns do not have significantly higher GC contents than those flanking conserved introns. We observed that R. norvegicus has the smallest number of intron losses among the species analyzed in the present study (Additional file 3: Table S4). We suspect that stochastic noise might have covered the pattern in this particular species. In duplicating the patterns shown in Fig. 5b and c, more conflicting results were obtained among the six species (Additional file 3: Table S5-S6). In summary, we found a correlation between intron loss and higher GC contents of nearby exonic sequences. Introns were lost preferentially from GC-rich regions, which is a characteristic of frequent gene conversions .
Intron gains occur less frequently than intron losses, and most are putative
The proposed models of intron gains  clearly demonstrate that the source sequences are the key to identifying the mechanisms of intron gain. As Dictyostelium discoideum and Dictyostelium purpureum diverged 400 million years ago , the sequences of some early gained introns might have diverged from their source sequences to an extent too great to be detectable. However, if the intron gain rate remained steady throughout evolution, we would expect to find 12 instances of source sequences for the recently gained introns, e.g., within 50 million years. The difficulty in the identification of the source sequences of intron gains is common among studies, regardless of whether the studied species are distantly or closely related. For example, although Drosophila persimilis and Drosophila pseudoobscura diverged only two Mya, researchers failed to identify the source sequences for six of the seven introns gained after their divergence . Even more astonishingly, researchers failed to identify the source sequences of 20 introns among 21 new introns that were gained only in certain local populations of Daphnia pulex . One possibility is that some or even most of the “intron gains” are in fact those intron losses that were misidentified due to incomplete information on the phylogenetic background, as shown for Caenorhabditis elegans . Due to the specialty of Dictyostelium genomes, we are confident that there are intron gains. By HGT, Dictyostelium had gained 16 genes from bacteria. As bacterial genes are definitely intronless, all the introns in these 16 genes were gained after the divergence of the Amoebozoa from the plants and animals. Within these genes, nine introns have been annotated in Dictyostelium discoideum and 19 introns have been annotated in Dictyostelium purpureum. Among them, there are seven pairs of introns at orthologous positions and 14 introns at discordant positions. The sequences of all these introns were used as queries to search against the two Dictyostelium genomes and further against the nucleotide collection of NCBI using BLAST. The BLAST results were filtered with an E-value threshold of 10−10, a coverage threshold of 80 %, and a similarity threshold of 0.85. Putative source sequences for two pairs of orthologous introns had been obtained. However, these source sequences have been rejected after alignment of them with the intron sequences and manual scrutiny. For the difficulty in finding intron sources, another possibility is that exogenous sequences such as viruses have contributed sequences for most intron gains but have not yet been covered in any genome sequencing projects . In the future, surveying the metagenomic sequence data of the environmental samples obtained from the natural habitats of organisms with intron gains might lead to the identification of source sequences for additional intron gains and consequently reveal the mechanism of intron gain.
Dictyostelium belongs to a supergroup, Amoebozoa, which diverged from animals and fungi very early in the evolutionary history of eukaryotes. In spite of this ancient divergence, our results indicate that its mechanism of intron loss is similar to that of animals and fungi. Most introns were lost in the process of gene conversion between the genomic DNA and cDNA reverse transcribed from mature mRNA.
The genome sequences and annotation files of Dictyostelium discoideum, Dictyostelium purpureum, P. pallidum, and Dictyostelium fasciculatum were downloaded from DictyBase [69, 70] in March 2014, and those of E. histolytica (HM1IMSS, version 3.1) were downloaded from the AmoebaDB [71, 72]. We discarded genes with obvious annotation errors, such as those that did not contain coding sequences that were composed of multiples of three nucleotides or those that appeared to conflict with their protein sequences.
Using the BLAST reciprocal best hits with an E-value threshold of 10−10 and an identity threshold of 0.25, 7,503 pairs of one-to-one orthologous proteins were detected between Dictyostelium discoideum and Dictyostelium purpureum. Each pair of orthologous genes were independently aligned using ClustalW and MUSCLE [73, 74] with their default parameters. Sequences surrounding intron positions with low-quality alignments were discarded. Low quality was defined as a similarity between Dictyostelium discoideum and Dictyostelium purpureum within 45 bp at each side of less than 0.5, which is the first quartile of the similarities of all the aligned orthologous mRNAs. The consistent results obtained using ClustalW and MUSCLE were retained, including 6,432 conservative intron positions in 4,150 genes and 2,058 discordant intron positions in 1,605 groups of orthologous genes, with 679 unique introns in Dictyostelium discoideum and 1,379 unique introns in Dictyostelium purpureum. The orthologous genes in P. pallidum, Dictyostelium fasciculatum, and E. histolytica were detected and aligned using the same methods.
In most genome annotations, some errors are inevitable. The mis-annotation of exonic segments as introns would result in false-positive results of intron gain if the mis-annotated segments happened to be new insertions. Similarly, if a mis-annotated segment were deleted in another species, the simple deletion would be mis-recognized as an intron loss. Therefore, we re-annotated the introns at discordant positions using the transcriptome data of Dictyostelium discoideum (SRP023109), Dictyostelium purpureum (SRP001567), P. pallidum (SRP004023) and E. histolytica (SRP017935), which were downloaded from the Sequence Read Archive of NCBI . The RNA-Seq reads were mapped to the genomes using TopHat version 2.0.5 with its default parameters . Finally, we obtained 1,420 discordant intron positions in 1,170 groups of orthologous of genes.
The loss and gain of introns were distinguished using Dollop (version 3.69) . An intron loss was defined when the total number of intron-presence-absence changes were minimized. An intron gain was defined only when the intron is definitely absent from the orthologous positions of all the outgroup species (Fig. 1). A precise loss was defined as an intron loss that did not cause any insertion and/or deletion in the flanking exonic sequences while an imprecise intron loss was companied by insertion and/or deletion in the flanking exonic sequences. Loss of adjacent introns were defined as the loss of two or more neighboring introns.
The version numbers of the genome sequences and source databases of Arabidopsis thaliana, Arabidopsis lyrata, Brassica rapa, Thellungiella parvula, Drosophila willistoni, Drosophila melanogaster, Caenorhabditis briggsae, Caenorhabditis remanei, Caenorhabditis elegans, Caenorhabditis japonica, Rattus norvegicus, Mus musculus are reported in Additional file 3: Table S7. The lost and conserved introns of Brassica rapa, Arabidopsis thaliana and Rattus norvegicus were retrieved from references [16, 25, 28, 32, 66], and the lost introns of Drosophila willistoni were those published in . We updated the intron loss data with the latest versions of the genomes of Drosophila willistoni and Drosophila melanogaster.
Using the BLAST reciprocal best hits, 7,744 pairs of one-to-one orthologous proteins were detected between Drosophila willistoni and Drosophila melanogaster, and 12,613 between C. briggsae and C. remanei. Only hits with an E value below 10−10 and with identity higher than 25 % were considered. Each pair of orthologous genes was aligned using ClustalW and MUSCLE [4, 5] with their default parameters. Intron sites with no gap in the 10 bp alignment adjacent position (on both sides) were considered as “conserved” introns. In this way, we found 25,532 conserved introns between Drosophila willistoni and Drosophila melanogaster and 54,871 conserved introns between C. briggsae and C. remanei, which were used for further study.
Sequences surrounding discordant intron positions with low-quality alignments were discarded. The low quality was based on a similarity between C. briggsae and C. remanei within 45 bp at each side of less than 0.57, which is the first quartile of the similarities of all aligned orthologous mRNAs. Therefore, we obtained 4,827 discordant introns in 3,625 genes between the two Caenorhabditis species. We used C. elegans and C. japonica as related species to predict lost introns, only those introns existing in both C. elegans and C. japonica were used for further study. We then used the transcriptome data of C. briggsae (SRP034522), C. remanei (SRP040962) and C. elegans (SRP000401) to re-annotate the introns at discordant intron positions. Finally, we obtained 1,225 and 664 cases of intron losses in 1,026 and 572 genes of C. briggsae and C. remanei, respectively.
The threshold value of the similarity of coding sequences between Drosophila willistoni and Drosophila melanogaster was established as 0.5, which was lower than the first quartile (0.64). This value was used to best locate the corresponding lost introns found by Yenerall et al. ; in this manner, we identified 1,440 discordant intron sites. By corresponding to old versions of lost intron data, we found 93 cases of intron losses within 89 genes when using the latest genome information for Drosophila willistoni.
GO enrichment analysis was performed using GOTermFinder . The GO annotations of Dictyostelium discoideum were used as the background dataset in this study. The total number of genes used in calculating the background distribution of GO terms was 12,098, and the threshold P-value of 0.01 was used in the identification of specific enrichments. The GO annotations are not been assigned to the genes of Dictyostelium purpureum. So they were represented by their orthologs in Dictyostelium discoideum in GO enrichment analysis. The 443 lost introns in Dictyostelium discoideum and 202 lost introns in Dictyostelium purpureum were mapped to 586 genes of Dictyostelium discoideum. These 586 intron-lost genes were compared with whole background sets. In same way, the putative gained introns of these two species were mapped to 86 intron-gained genes of Dictyostelium discoideum, which were compared with background sets for GO enrichments.
As almost all the data are not normally distributed, we used nonparametric analyses, like Mann–Whitney U test, in our comparisons of the data.
Availability of supporting data
We are thankful to the editors and two anonymous reviewers of Axios Review for insightful feedback on the manuscript. This work was supported by the National Natural Science Foundation of China (grant numbers 31421063, 31371283 and 91231119) and the Fundamental Research Funds for the Central Universities.
- 52.NCBI Protein database [database on the Internet]. Available from: http://www.ncbi.nlm.nih.gov/protein/. Accessed: 27 Mar 2015
- 66.Zhu T. The relationship between intron loss and processed pseudogenes. Beijing: Beijing Normal Univeristy; 2015.Google Scholar
- 70.DictyBase [database on the Internet]. Available from: http://dictybase.org/. Accessed: 26 Mar 2014
- 72.AmoebaDB [database on the Internet]. Available from: http://amoebadb.org/. Accessed: 31 Mar 2014
- 75.The Sequence Read Archive of NCBI [database on the Internet]. Available from: http://www.ncbi.nlm.nih.gov/sra/. Accessed: 9 Apr 2014
- 77.Dollop. http://evolution.genetics.washington.edu/phylip/doc/dollop.html. Accessed 17 Jun 2014.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.