Background

The majority of eukaryotic genomes contain a large proportion of repetitive elements. Based on their arrangement in the genome, repetitive elements can be divided into two major categories: transposable elements (transposons) and tandem repeats. Transposons can be divided into RNA-mediated class I transposons, which include transposons with long terminal repeats (LTRs), long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs); and RNA-independent class II DNA transposons. Tandem repeats are copies of DNA repeats located adjacent to one another [1–3]. Tandem repeats can be dispersed across the whole genome, as in the case of microsatellites, or clustered in highly repetitive genome regions such as centromeric, telomeric and subtelomeric regions [4, 5].

Although repetitive elements were once considered to be junk DNA [6], recent studies suggest that they function in regulating gene expression and contribute to genome evolution [7–11]. Transposons are considered to be drivers of genetic diversification because of their ability to be co-opted into genetic processes such as restructuring chromosomes or providing genetic material on which natural selection can act [12–14], and thus can be a major reason for differences in genome size among species [15–17]. Similarly, expansion or contraction of tandem repeats can also affect genome size [18–20], and consequently affect recombination, gene expression, conversion, and chromosomal organization [21–26].

Fish comprise a large and highly diverse group of vertebrates inhabiting a wide range of aquatic environments [27]. Sequenced fish genomes vary in size from 342 Mb in Tetraodon nigroviridis to 2967 Mb in Salmo salar. Some studies have been conducted on the diversity of repetitive elements in fish [28–30], but systematic comparative studies have been hindered by the lack of whole genome sequences from a large number of species. The recent availability of a large number of fish genome sequences has made it possible to determine the repetitive element profiles of fish species from a broad taxonomic spectrum. In this study, we annotated the repetitive elements of 52 fish genomes from 22 orders and determined their distribution in relation to environmental adaptations. We observed associations of high numbers of DNA transposons, especially the Tc1 transposons, with freshwater bony fish; of high levels of microsatellites with marine bony fish; and of high numbers of class I transposons with cartilaginous fish and lamprey. Based on the phylogenetic tree, the effects of phylogeny on the differences between freshwater and marine bony fish were evaluated with phylogenetically independent contrasts (PIC).

Results

Contents of repetitive elements in various fish genomes

A total of 128 categories of repetitive elements were identified from the 52 fish species (Additional file 1: Table S1). We found an overall positive correlation between the content of repetitive elements in fish genomes and genome size. This correlation was still significant when phylogenetically independent contrasts were applied (Fig. 1, PIC p-value: 1.88e-03, Pearson correlation r = 0.6, p-value = 1.45e-06). However, several exceptions existed. For instance, the whale shark genome is 2.57 Gb but contains only 26.2% repetitive elements; in contrast, the mid-sized zebrafish genome is ~1.5 Gb but contains over 58% repetitive elements.

Fig. 1

Correlation between genome sizes and contents of repetitive elements. Genome sizes are plotted against the percentage of the whole genome made up of repetitive elements for the 52 fish species for which genome sequences are available. The major orders are plotted in different colors and shapes: Yellow circle: Tetraodontiformes; Orange circle: Perciformes; Green circle: Scorpaeniformes; Brown circle: Cypriniformes; Red circle: Cyclostomata; Purple circle: Cyprinodontiformes; Blue triangle: Chondrichthyes; Blue circle: Other species

Differential associations of repetitive elements across species

We investigated the possible association between repetitive elements and aquatic environment. Comparison of the diversity and abundance of repetitive elements across the 52 fish genomes revealed significant differences among species (Fig. 2 and Additional file 2: Table S2). Class I transposons are more prevalent in cartilaginous fish and lampreys than in bony fish species (Wilcoxon rank test, p-value = 1.41e-04). For example, class I transposons represent 76.6% of repetitive elements in elephant shark, whereas bony fish genomes are richer in class II transposons and tandem repeats.
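The kind of two-group comparison reported above can be illustrated with a short Python sketch of a two-sided Wilcoxon rank-sum test. This is not the R function used in this study: it relies on the normal approximation with a tie correction and omits both the exact small-sample distribution and the continuity correction, so its p-values may differ slightly from those of R. The function name `rank_sum_test` and all input values are hypothetical and serve only as an illustration.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, approximate two-sided p-value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2

    rank_sum_x = 0.0   # sum of midranks belonging to sample x
    tie_term = 0       # sum of (t^3 - t) over runs of tied values
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1          # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            if pooled[k][1] == 0:
                rank_sum_x += midrank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                          # every observation is tied
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical class I proportions (%), NOT values from this study.
    group_a = [70.1, 65.3, 76.6, 58.9]
    group_b = [20.4, 31.7, 15.2, 28.8, 35.0, 22.1]
    print(rank_sum_test(group_a, group_b))
```

A rank-based test of this kind makes no normality assumption, which matches the reason given in the Methods for choosing the Wilcoxon test.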

Fig. 2

Classification and distribution of 128 repetitive elements in 52 species. The number of repeats in each category relative to all repeats is displayed in columns, and the species are displayed in rows. Pink shading represents freshwater bony fish, blue represents marine bony fish, and yellow represents diadromous species

Among the bony fish genomes, those of freshwater species contained a greater proportion of Tc1/mariner transposons than those of marine species (Fig. 2, Wilcoxon rank test, p-value = 8.23e-06). However, the difference was not significant when phylogeny was taken into consideration (PIC p-value: 0.117). In contrast, the marine bony fish contained a greater proportion of microsatellites than the freshwater species (PIC p-value: 3.12e-02, Wilcoxon rank test, p-value = 3.72e-05), independent of phylogeny. Interestingly, diadromous species such as Anguilla rostrata, Anguilla anguilla, and S. salar contain high proportions of both Tc1/mariner transposons and microsatellites (Table 1).

Table 1 Proportions of DNA/TcMar-Tc1 transposons and microsatellites among all repeats in freshwater, marine and diadromous teleost species

Analysis of the sequence divergence rates suggests that Tc1 transposons have been present in the genomes of freshwater species for a much longer period of time, or are more active, than in marine species (Fig. 3). The Tc1 transposons in freshwater species are not only more abundant but also exhibit a higher average K (average number of substitutions per site) than those in marine species (PIC p-value: 2.10e-02, Wilcoxon rank test, p-value = 5.39e-03). This is particularly notable in Cyprinodontiformes and in Labroidei within Perciformes, where Tc1 transposons appear to have had the strongest activity over a long history, as reflected by the broad distribution and sharp peaks at higher substitution rates per site (Fig. 3). The long history and high transposition activity in freshwater fish account, at least in part, for the high proportion of Tc1 transposons in the genomes of freshwater species.

Fig. 3

Divergence distribution analysis of DNA/TcMar-Tc1 transposons in the representative fish genomes. Cyprinodontiformes and Labroidei species (red) and marine bony fish (blue) are displayed. The y-axis represents the percentage of the genome comprised of repeat classes (%) and the x-axis represents the substitution rate from consensus sequences (%). Note that the y-axis scales are not all the same; in particular, those for marine species are 10 times smaller

Discussion

Accumulation of repetitive elements in fish genomes

In this work, we determined the correlation between the categories and proportions of repetitive elements and the living environments of various fish species. We found that class II transposons appeared to be more abundant in freshwater bony fish than in marine bony fish when phylogeny was not considered. In contrast, microsatellites are more abundant in marine bony fish than in freshwater bony fish, independent of phylogenetic relationship. In addition, class I transposons are more abundant in primitive species such as cartilaginous fish and lamprey than in bony fish. Such findings suggest that these repetitive elements are related to the adaptability of fish to their living environments, although it is unknown at present whether the differing categories and proportions of repetitive elements led to adaptation to the living environments (the cause) or whether the living environments led to the accumulation of different repetitive elements (the consequence).

In teleost fish, genome sizes are greatly affected by the teleost-specific round of whole genome duplication [31–33]. However, whole genome duplication did not dramatically change the proportion of repetitive elements in the genomes. In contrast, the expansion of repetitive elements may have contributed to the expansion of fish genome sizes: in our analysis, fish genome sizes were, with exceptions, well correlated with their contents of repetitive elements. High contents of repetitive elements in the genome can accelerate the generation of novel genes for adaptation, but an excess of them can also cause abnormal recombination and splicing, resulting in unstable genomes [34]. Therefore, the content of repetitive elements cannot grow without limit as genome size increases; it must be limited to certain levels and shaped by specific natural selection from the environment.

It is worth noting that the quality of the genome assemblies varied greatly. As one would expect, many repetitive elements may not have been assembled into the reference genome sequences, especially in assemblies of lower quality. This may have affected the assessment of the proportions of repetitive elements in the genomes. However, most of the genomes were sequenced with broadly similar next-generation sequencing methods, especially Illumina sequencing, so systematic biases related to repeat resolution should be small. In addition, if the unassembled repetitive elements are more or less random, the quality of the genome assemblies should not have systematically affected the enrichment of specific categories of repetitive elements with habitats. Because the total number of genomes used in the study is relatively large (52), the impact of sequence assembly quality should have been minimized.

Comparison of the repetitive elements among species

The distributions of repetitive elements are significantly associated with various clades during evolution. For example, class I transposons are more prevalent in cartilaginous fish and lampreys than in bony fish species, whereas the cartilaginous fish and lamprey lack class II transposons. Although there is no unifying explanation for this difference, it is speculated that it may be related to the internal fertilization of cartilaginous fish, which may have minimized the exposure of gametes and embryos to horizontal transfer of class II transposons [30, 35, 36]. Interestingly, active transposable elements in mammals are also RNA transposons. For lamprey, since it is still unclear how it fertilizes and develops in the wild [37, 38], its accumulation of class I transposons deserves further investigation. As class I transposons are involved in various biological processes such as regulation of gene expression [39, 40], the ancient accumulation of class I transposons in cartilaginous fish and lamprey is probably related to their evolutionary adaptations [41]. The contents of class I transposons are low in bony fish; the exact reasons are unknown, but could involve putative mechanisms that counteract the invasiveness of RNAs in their genomes. We recognize that a much larger number of bony fish genomes than cartilaginous fish and lamprey genomes were used in this study, but this is dictated by the availability of genome sequences. However, if the categories and genomic proportions of repetitive elements are conserved among closely related species, such bias in the number of genomes used in the analysis should not significantly change the results.

Repetitive elements of most freshwater bony fish are dominated by DNA transposons, except for C. rhenanus and T. nigroviridis, which contain high levels of microsatellites. Although T. nigroviridis is a freshwater species, the vast majority (497 out of 509) of species in the Tetraodontidae family are marine species [42–44]. Thus it is likely that T. nigroviridis had a marine origin. Similarly, C. rhenanus is a freshwater species, but most species of the Cottidae family are marine species [43]. In addition, the biology of C. rhenanus is largely unknown [45, 46], and the origin of C. rhenanus as a freshwater species remains unexplained.

Uncovering the route of class II transposon expansion is difficult, because these elements can be transferred both vertically and horizontally [47–49]. However, when phylogenetic relationships were not considered, the observed prevalence of class II transposons in freshwater species may indicate that freshwater environments are more favorable for the proliferation and spread of DNA transposons. In addition, as found in other species, frequent stresses such as droughts and floods in freshwater ecosystems can accelerate transposition, which facilitates host adaptation to the environment by generating new genetic variants [50]. Previous studies showed that freshwater ray-finned fish have smaller effective population sizes and larger genome sizes than marine species [51]. Our results lend additional support to the idea that shrinking effective population sizes may have underlain the evolution of more complex genomes [52, 53]. The significance of the greater prevalence of Tc1 transposons in freshwater species was reduced when phylogenetic relationships were accounted for, which indicates that the taxa in our data set are not statistically independent because of shared evolutionary history. However, the limited and uneven set of sequenced species available so far inevitably introduces phylogenetic bias into the analysis. For example, a large number of the sequenced fish species belong to the family Cichlidae (6) or Cyprinidae (6), whereas only one genome (Ictalurus punctatus) is available from the order Siluriformes, which comprises 12% of all fish species [54, 55]. Although phylogenetically independent contrasts analysis is robust to random species sampling [56], further analysis should be conducted with a broader set of sequenced fish species to complement the broader comparative studies.

Although Gasterosteus aculeatus was collected from freshwater, studies indicate that limnetic G. aculeatus populations formed recently from marine populations trapped in freshwater [57–59]. We therefore still classify G. aculeatus as a marine species. Populations of marine species tend to be more stable than those of freshwater species. In addition, marine teleost species tend to have a higher osmotic pressure of body fluid [60, 61]; the high-salinity environment may therefore promote DNA polymerase slippage while being unfavorable for the proliferation and spread of transposons, since previous studies indicated that a higher salt concentration might stabilize the hairpin structure formed during DNA polymerase slippage [62]. Future research covering a broader range of sequenced fish lineages will address whether passive increases in genome size have in fact been co-opted for the adaptive evolution of complexity in fish as well as other lineages.

Conclusions

In this study, we investigated the diversity, abundance, and distribution of repetitive elements among 52 fish species in 22 orders. Differential associations of repetitive elements were found with various clades and with their living environments. Class I transposons are abundant in lamprey and cartilaginous fish, but less so in bony fish. Tc1/mariner transposons are more abundant in freshwater bony fish than in marine fish when phylogeny is not taken into consideration, while microsatellites are more abundant in marine species than in freshwater species, independent of phylogeny. The average number of substitutions per site of Tc1 among bony fish species suggested a longer and more active expansion in freshwater species than in marine species, suggesting that the freshwater environment is more favorable for the proliferation of Tc1 transposons. The analysis of the number of repeats within each microsatellite locus suggested that DNA polymerases are more prone to slippage during replication in marine environments than in freshwater environments. These observations support the notion that repetitive elements have roles in environmental adaptation during evolution. However, whether that is the cause or the consequence requires future studies with a more comprehensive set of sequenced genomes.

Methods

Annotation of repetitive elements in fish genome assemblies

The channel catfish genome was assembled by our group [54]; the genome sequences of the other 51 species were retrieved from the NCBI or Ensembl databases [33, 42, 56, 63–89] (Additional file 1: Table S1). The repetitive elements were identified using RepeatModeler 1.0.8, which contains RECON [90] and RepeatScout [91], with default parameters. The derived repetitive sequences were searched against Dfam [92] and Repbase [93]. Sequences classified as “Unknown” were further searched against the NCBI-nt database using blastn 2.2.28+.

Phylogenetic analysis

The phylogenetic analysis was based on cytochrome b [94]. Multiple alignments were generated with MAFFT [95]. The best substitution model was selected with ProtTest 3.2.1 [96]. The phylogenetic tree was constructed using MEGA7 with the maximum likelihood method [97] and the JTT with Freqs. (+F) model, and gaps were removed by partial deletion. The topological stability was evaluated with 1000 bootstraps.

Divergence distribution of DNA/TcMar-Tc1

The average number of substitutions per site (K) for each DNA/TcMar-Tc1 fragment was subtotaled. K was calculated based on the Jukes-Cantor formula: K = −300/4 × ln(1 − D × 4/300), where D represents the proportion of each DNA/TcMar-Tc1 fragment that differs from the consensus sequence [98]. The constants 300/4 and 4/300 (rather than 3/4 and 4/3) imply that D and K are expressed as percentages.
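As a worked illustration of the formula above, the following Python sketch converts an observed divergence D (in percent) into the Jukes-Cantor corrected K (in percent). The function name `jukes_cantor_k` and the example divergences are hypothetical; how the per-fragment values were averaged in this study is not specified here, so the unweighted mean in the example is an assumption.

```python
import math


def jukes_cantor_k(d_percent):
    """Jukes-Cantor corrected substitutions per site, in percent.

    d_percent: observed divergence from the consensus, in percent.
    Implements K = -300/4 * ln(1 - D * 4/300).
    """
    if not 0 <= d_percent < 75:
        # The logarithm is undefined once D reaches the 75% saturation point.
        raise ValueError("D must lie in [0, 75) percent")
    return -300.0 / 4.0 * math.log(1.0 - d_percent * 4.0 / 300.0)


if __name__ == "__main__":
    # Hypothetical per-fragment divergences (%), NOT values from this study.
    divergences = [2.0, 10.0, 25.0]
    k_values = [jukes_cantor_k(d) for d in divergences]
    # Unweighted mean across fragments (an assumption for illustration).
    print(sum(k_values) / len(k_values))
```

The correction is small for low divergence and grows as D increases, because multiple substitutions at the same site become more likely in older copies.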

Statistics and plotting

The statistical significance of differences between groups and between habitats was assessed with the Wilcoxon rank test function in R, because the data are not normally distributed [99]. Pearson correlation analysis in Excel was applied to the correlation between genome size and the content of repetitive elements. Based on the phylogenetic tree of the species generated as described above, phylogenetically independent contrasts between the environments and different characters were computed to evaluate the bias introduced by phylogeny. Freshwater and seawater were represented by their respective salinities (0.5 for freshwater and 35 for seawater) [100]. The phylogenetically independent contrast test was conducted with the “drop.tip()” and “pic()” functions of the ape package in R [101]. The heat map was plotted using HemI 1.0 [102].
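To make the idea behind phylogenetically independent contrasts concrete, the following Python sketch computes standardized contrasts on a small rooted binary tree and correlates two sets of contrasts through the origin. It is an illustration only, not the ape implementation used in this study; the tree encoding, the function names and the toy trait values are all hypothetical, and only the salinity coding (0.5 and 35) follows the text above.

```python
import math


def independent_contrasts(tree, trait):
    """Standardized contrasts for a rooted binary tree with branch lengths.

    tree:  a leaf is (name, branch_length); an internal node is
           (left_subtree, right_subtree, branch_length).
    trait: dict mapping leaf name -> trait value.
    """
    contrasts = []

    def visit(node):
        if len(node) == 2:                       # leaf
            name, length = node
            return trait[name], length
        left, right, length = node
        x_left, v_left = visit(left)
        x_right, v_right = visit(right)
        # Difference between sister lineages, scaled by its expected spread.
        contrasts.append((x_left - x_right) / math.sqrt(v_left + v_right))
        # Branch-length-weighted estimate of the ancestral value.
        x_node = (x_left / v_left + x_right / v_right) / (1 / v_left + 1 / v_right)
        # The stem branch is lengthened to reflect the estimation uncertainty.
        return x_node, length + v_left * v_right / (v_left + v_right)

    visit(tree)
    return contrasts


def correlation_through_origin(cx, cy):
    """Correlation of two contrast vectors, computed without centering."""
    sxy = sum(a * b for a, b in zip(cx, cy))
    return sxy / math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))


# Toy three-species tree with made-up branch lengths (hypothetical).
TOY_TREE = ((("A", 1.0), ("B", 1.0), 1.0), ("C", 2.0), 0.0)

if __name__ == "__main__":
    salinity = {"A": 0.5, "B": 0.5, "C": 35.0}        # habitat coding from the text
    repeat_share = {"A": 12.0, "B": 9.0, "C": 3.0}    # hypothetical values
    cx = independent_contrasts(TOY_TREE, salinity)
    cy = independent_contrasts(TOY_TREE, repeat_share)
    print(correlation_through_origin(cx, cy))
```

Each contrast compares two sister lineages, so closely related species that share a habitat contribute one comparison rather than several correlated data points. This is why an association can be significant in a species-level test yet weaker after the contrasts are applied.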