Background

Microsatellites, also termed simple sequence repeats (SSRs), are short tandemly repeated sequences with motifs of 1 to 6 base pairs (bp) [1, 2]. They are ubiquitous and highly abundant in eukaryote, prokaryote and virus genomes [3,4,5], making up around 3% of the human genome [6]. Microsatellite instability is an important and unique form of mutation that is responsible for, or strongly implicated in, over 40 human neurological, neurodegenerative and neuromuscular disorders [7], and associations have also been observed in other complex diseases [2, 3, 8, 9]. Microsatellites have attracted considerable attention due to their roles in the organization of chromosome structure, DNA recombination and replication, gene expression and cell cycle dynamics [10].

Microsatellite analysis is used for a wide range of biological questions. Unique polymorphism of normal and disease-causing repeats can be used for disease diagnosis and prognosis [11,12,13]. Microsatellite repeats are advantageous as genetic markers due to their high polymorphism, informativeness and co-dominance, and have been used to construct quantitative trait loci (QTL) maps, genetic linkage maps [14,15,16,17,18] and DNA fingerprinting [19]. These features also provide the foundation for their successful application in other fundamental and applied fields of biology, including population and conservation genetics, genetic dissection of complex traits and marker-assisted breeding programs [10, 20,21,22].

Microsatellite content generally correlates positively with genome size [23,24,25]. The distribution of microsatellites exhibits different properties in genomes with different functionality [26,27,28,29,30,31], contradicting earlier studies stating that they are randomly distributed and simply represent “junk” DNA sequences [32]. Microsatellites are ubiquitously distributed across the entire genome, including protein-coding and non-coding regions [6, 33,34,35]. Previous studies have indicated that microsatellite occurrence differs significantly between coding and non-coding regions [36], and that some microsatellite types are preferred and often common in genome-specific regions [26, 29]. Excessive microsatellite repeats occur in non-coding regions of eukaryotic organisms [37], whereas they are relatively rare in coding regions, ranging between 7 and 10% of higher plants [38, 39] and between 9 and 15% of vertebrates [40,41,42]. Meanwhile, multiple studies have demonstrated that the hotspots of microsatellite distribution may be related to various phenotypic traits [43, 44]. In the genome of Saccharomyces cerevisiae, about 17% of genes contain microsatellite repeats in open reading frames (ORFs) [45, 46], and the repeats are specifically enriched in regulatory genes that encode transcription factors, DNA-RNA binding proteins and chromatin modifiers [47]. Microsatellite repeats in cis-regulatory elements and promoters, which occur frequently (e.g., ~25% of promoters in yeast contain tandem repeats), regulate the process of gene expression [48, 49]. The (TTAGGG) n tracts constitute a substantial portion of the telomeric regions and are recognized by telomerase, which can be related to the stability of chromosomes and nucleolus organizing regions [10, 50, 51].

Microsatellites are inherently unstable, with high mutation rates from about 10⁻⁶ to 10⁻² per locus per generation, resulting from DNA replication slippage [52, 53]. Mutation rates vary among microsatellite types (perfect, compound or interrupted), base composition of the repeat [54], repeat types (di-, tri- and tetranucleotide) [55, 56], lengths [21] and heterozygosity [57, 58], and also with chromosome position, cell division, the GC content in flanking DNA and taxonomic group [59,60,61,62]. Microsatellite instability has a strong influence on genomic microsatellite abundance and various functions and is explained by two mutually exclusive mutational mechanisms: (i) DNA replication slippage theory suggests that during DNA replication, the nascent and template strands realign out of register, and if DNA synthesis continues unabated on this molecule the repeat number of the microsatellite is altered [21, 63, 64]. The slipped structure is stabilized by hairpin, triplex, cruciform or quadruplex arrangements of DNA strands [65,66,67,68,69]. (ii) Unequal recombination theory assumes that large-scale contractions and expansions of the repeat array involve the processes of unequal DNA recombination, including unequal crossing over and gene conversion [70], via a number of transposable elements, the best known of which are Alu and other short interspersed elements [64, 71]. Non-reciprocal recombination, random genetic drift and selective forces could have a significant effect on the accumulation of tandem-repetitive sequences in genomes [63, 65, 70].

So far, systematic research on microsatellite variation and characterization has been conducted on several phylogenetic lineages, including humans [41], primates [72,73,74], plants and fungi [36, 75,76,77,78,79,80,81,82], and viruses [83, 84]. Yet microsatellite distribution patterns in fishes, an important branch of biological evolution, have remained unclear. Here, 14 fish genomes were used to characterize microsatellite distribution patterns. The specific aims were 1) to examine the abundance and frequency of microsatellites in several important fish genomes, and 2) to compare the compositional differences of microsatellites among different taxa and genome-specific regions. We anticipate that our study will provide foundational knowledge of microsatellite dynamics in fish species, helping us to better understand microsatellite distribution, and provide strong support for further exploration of genome structure and microsatellite functions.

Materials and methods

Genomic sequences

Genome sequences from 14 fish species, including model fishes (Danio rerio, Oryzias latipes, Astyanax mexicanus), commercial species (Cyprinus carpio, Oncorhynchus mykiss, Oncorhynchus kisutch, Oreochromis niloticus, Ictalurus punctatus, Esox lucius, Cynoglossus semilaevis), ornamental fishes (Poecilia reticulata, Takifugu rubripes, Nothobranchius furzeri), and a “living fossil” fish species (Lepisosteus oculatus), were used in this study. Most genome sequences were downloaded from the Ensembl Genome Browser (Ensembl. Available online: http://asia.ensembl.org/index.html). The sequences of Cyprinus carpio, Nothobranchius furzeri, Oncorhynchus kisutch, and Oncorhynchus mykiss were obtained from the National Centre for Biotechnology Information (NCBI. Available online: http://www.ncbi.nlm.nih.gov/). We also obtained genome annotations to identify microsatellite locations in the genomes. Only genomic (chromosomal) sequences that had complete genome annotations were included in this study. We filtered the unknown bases (Ns) from the genome sequences using a Perl script and used the resulting valid sequence length for further analysis. The details of the genome sequences are listed in Table 1.
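The N-filtering step can be illustrated with a minimal Python sketch. This is not the Perl script used in the study, which is not reproduced here; the function names read_fasta and valid_length are hypothetical, and the sketch assumes that "valid length" means the number of bases remaining once every N (upper or lower case) is excluded.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def valid_length(sequence):
    """Count bases that are not the unknown-base symbol N (assumed definition)."""
    return sum(1 for base in sequence if base not in "Nn")
```

With a real genome file, the lines would come from an open file handle, and the valid lengths of all chromosomes would be summed to give the denominator for per-Mb statistics.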

Table 1 Data source and genome sizes of the 14 fish species studied in the present study

Microsatellite identification

Microsatellites were identified from genome sequences using the Krait v0.9.0 program, a robust and ultrafast tool with a user-friendly graphic interface for genome-wide investigation of microsatellites [85]. We employed the perfect search mode of the program to investigate all motifs according to the minimum number of repeats or the minimum length of a microsatellite. In the present study, we defined perfect microsatellites as mononucleotide repeats ≥12 bp, dinucleotide repeats ≥14 bp, trinucleotide repeats ≥15 bp, tetranucleotide repeats ≥16 bp, pentanucleotide repeats ≥20 bp and hexanucleotide repeats ≥24 bp, and the length of the flanking sequence was constrained to 200 bp, as previously described [72, 74]. We mainly examined the distribution of perfect repeats ≥12 bp long. The rationale for choosing this small cutoff value was that microsatellites are often disrupted by single base substitutions [6, 33]. The occurrence of repeats in exons, introns and intergenic regions was identified from the annotations of the 14 fish genome sequences using Perl scripts. The SciRoKo software tool [86] and the NCBI Graphical Sequence Viewer program (https://www.ncbi.nlm.nih.gov/projects/sviewer/) were employed to increase the reliability of the results for the examined microsatellite repeats.
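To make the length thresholds concrete, the following Python sketch finds perfect repeats with a regular expression. It is an illustration only and does not reproduce Krait's algorithm. The names MIN_LENGTH_BP, is_primitive and find_perfect_ssrs are hypothetical, and the sketch makes two assumptions: the bp thresholds are converted to a minimum number of whole motif copies by rounding up, and motifs that are themselves repeats of a shorter unit (e.g. AAAA) are counted only under the shorter motif. Overlaps between hits of different motif lengths are not resolved.

```python
import re

# Minimum tract length in bp for each motif length, as stated in the text.
MIN_LENGTH_BP = {1: 12, 2: 14, 3: 15, 4: 16, 5: 20, 6: 24}


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_perfect_ssrs(sequence, min_length_bp=MIN_LENGTH_BP):
    """Return sorted (start, end, motif, repeats) tuples, 0-based half-open."""
    seq = sequence.upper()
    hits = []
    for k, min_bp in sorted(min_length_bp.items()):
        min_repeats = -(-min_bp // k)  # ceiling division
        # One motif copy followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # e.g. "AA" runs are reported as mononucleotide only
            repeats = (match.end() - match.start()) // k
            hits.append((match.start(), match.end(), motif, repeats))
    return sorted(hits)
```

Because the character class is restricted to A, C, G and T, runs of unknown bases (N) are never reported as repeats.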

Repeats whose unit patterns are circular permutations and/or reverse complements of each other were grouped together as one type. The total number of non-overlapping types was 501 for 1–6-nt motifs, with the 1-nt motif containing 2 types: A and C (A = T and C = G), the 2-nt motif containing 4 types: AT, AG, AC and GC (AT = TA, AG = GA = CT = TC, AC = CA = GT = TG, and GC = CG), and the 3–6-nt motifs containing 10, 33, 102 and 350 types, respectively [41, 87].
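The grouping rule can be checked with a short Python sketch. The function names are hypothetical, and choosing the alphabetically smallest member as the representative of each type is an assumption made here for illustration (under it the GC/CG type is labelled CG). Motifs that are repeats of a shorter unit are excluded, since they belong to the shorter motif class.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """Return True if the motif is not a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def canonical_motif(motif):
    """Smallest string among all rotations of the motif and of its reverse complement."""
    candidates = []
    for strand in (motif, reverse_complement(motif)):
        candidates.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(candidates)


def count_motif_types(k):
    """Number of distinct motif types of length k after grouping."""
    motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical_motif(m) for m in motifs if is_primitive(m)})
```

Enumerating motif lengths 1 to 6 with this rule gives 2, 4, 10, 33, 102 and 350 types, which sum to the 501 types stated above.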

Results

Distribution patterns of microsatellite repeats in the fish genomes

We examined the number, relative frequency (microsatellite number per Mb of sequence), relative density (total microsatellite length per Mb of sequence), GC content and coverage degree (percentage of the sequence covered by microsatellites) of microsatellites with motif lengths of 1–6 nucleotides in the 14 fish genomes (Table 2). We assigned 4-letter abbreviations to the 14 species, and these are used henceforth to simplify the results and discussion (e.g. Danio rerio = Drer, O. niloticus = Onil; see Table 2). The total number of microsatellites, ranging from 78,378 (Locu) to 1,012,084 (Drer), differed between fish species, and the coverage degree varied from 0.18% (Locu) to 5.29% (Trub) (Table 2). The lowest relative frequency and relative density of microsatellites were both found in Olat (249.99 loci/Mb and 4925.76 bp/Mb, respectively) (Table 2). The highest relative frequency and relative density of microsatellites were found in Csem (3445.94 loci/Mb) and Ipun (25,401.97 bp/Mb), respectively. The GC content ranged from 10.94% (Trub) to 48.20% (Okis) (Table 2).
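The summary statistics defined above can be written out as a small Python sketch. The function name ssr_summary and its dictionary keys are hypothetical. The sketch assumes that the denominator is the valid (N-filtered) sequence length and that GC content is computed over the microsatellite tracts themselves; the source does not state either choice explicitly.

```python
def ssr_summary(tracts, valid_genome_bp):
    """Summarize a list of microsatellite tract sequences for one genome.

    tracts: list of repeat tract strings, e.g. ["ACACACACACACAC", ...]
    valid_genome_bp: sequence length in bp after removing unknown bases
    """
    total_bp = sum(len(t) for t in tracts)
    gc_bp = sum(t.upper().count("G") + t.upper().count("C") for t in tracts)
    megabases = valid_genome_bp / 1_000_000
    return {
        "count": len(tracts),
        "relative_frequency_loci_per_mb": len(tracts) / megabases,
        "relative_density_bp_per_mb": total_bp / megabases,
        "coverage_percent": 100 * total_bp / valid_genome_bp,
        "gc_percent": 100 * gc_bp / total_bp if total_bp else 0.0,
    }
```

For example, three tracts totalling 41 bp in a 2 Mb sequence give 1.5 loci/Mb and 20.5 bp/Mb.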

Table 2 Microsatellite distribution as frequency, density and GC content of different fish genomes

The main distribution pattern of di- (dinucleotide SSRs) > mono- (mononucleotide SSRs) > tetra- (tetranucleotide SSRs) > tri- (trinucleotide SSRs) > penta- (pentanucleotide SSRs) > hexanucleotide (hexanucleotide SSRs) was shared by six fish genomes (i.e., Drer, Trub, Ipun, Amex, Eluc, and Omyk), while a mono- > di- > tetra- > tri- > penta- > hexanucleotide pattern was observed in Olat, Ccar and Pret (Table 2). The di- > mono- > tri- > tetra- > penta- > hexanucleotide pattern was shared by Onil and Csem, whereas Nfur exhibited a di- > tetra- > mono- > tri- > penta- > hexanucleotide pattern (Table 2). The 1-nt or 2-nt repeats had a higher percentage motif abundance in the fish genomes than any other motif length, and the 2-nt repeats represented more than 60% of motif abundance in Omyk, Eluc and Okis (Table 3). There was an almost equal distribution of motif abundance percentages among the first three motif lengths (1–3 nt) in Locu, with 3-nt and 1-nt repeats being almost identical (34.92 and 34.99%, respectively). The percentage of 4-nt repeats was remarkably uniform across all taxa except for Drer and Locu, which had marginally greater or lesser percentages of this motif length (Table 3). Microsatellites with longer motifs (5–6 nt) showed lower percentages than the short motif repeats (1–4 nt). The 6-nt repeats had the lowest percentages among these motif lengths, ranging from 0.21% (Drer) to 0.85% (Trub) (Table 3).
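The ranking patterns and percent motif abundances described above follow directly from per-motif-length counts. The Python sketch below shows one way to derive them; the function name motif_abundance is hypothetical, and the counts in the usage example are invented for illustration and are not taken from Table 2 or Table 3.

```python
PREFIX = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def motif_abundance(counts_by_motif_length):
    """Return (percent abundance per motif length, ranking pattern string)."""
    total = sum(counts_by_motif_length.values())
    percent = {k: 100 * n / total for k, n in counts_by_motif_length.items()}
    # Sort motif lengths from most to least numerous.
    order = sorted(
        counts_by_motif_length,
        key=lambda k: counts_by_motif_length[k],
        reverse=True,
    )
    pattern = " > ".join(PREFIX[k] + "-" for k in order)
    return percent, pattern


# Hypothetical counts for a single genome (not real data).
example_counts = {1: 300, 2: 400, 3: 100, 4: 150, 5: 40, 6: 10}
example_percent, example_pattern = motif_abundance(example_counts)
```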

Table 3 Microsatellite distribution as percent motif abundance (%) among 14 genomes

Mononucleotide repeats

Motif abundance percentages of mononucleotide repeats within intergenic regions, introns and exons varied across species, with intergenic regions ranging from 0.19% (Okis) to 7.19% (Trub), introns ranging from 9.87% (Nfur) to 45.06% (Olat) and exons ranging from 0.37% (Okis) to 3.58% (Pret) (Table 3). Among the two types of mononucleotide repeats, poly(A/T) was generally far more abundant than poly(C/G) in these fish genomes, except that the reverse was found in the Trub, Omyk and Okis genome sequences (Supplement Table 1, Tables 4 and 5). Drer had the maximum repeat number of A (or T) (192,264) followed by Ccar, Ipun, Amex and Pret. Pret contained the maximum number (4549) of C (or G). Although poly(A/T) tracts were clearly more abundant than poly(C/G) in exons (Table 4), this difference was not consistently observed in introns (Table 5) and intergenic regions (Table 6).

Table 4 Total numbers of Mono-, Di-, and Trinucleotide repeats in exons among 14 fish genomes
Table 5 Total numbers of Mono-, Di-, and Trinucleotide repeats in introns among 14 fish genomes
Table 6 Total numbers of Mono-, Di-, and Trinucleotide repeats in intergenic regions among 14 fish genomes

Dinucleotide repeats

Among genome-specific regions, there was a lower percentage of dimer repeats (AT, AC, AG, CG) in exons than in non-coding regions, ranging from 0.37% (Ccar) to 1.77% (Onil). Within the non-coding regions, intronic regions had a higher proportion of dinucleotide repeats than intergenic regions (Table 3). We found that (AC) n repeats were generally the most numerous in specific genomic regions, except that Amex had greater numbers of (AG) n repeats in exonic regions and Okis had greater numbers of (AT) n repeats in intronic regions (Tables 4 and 5). The number of (AT) n repeats showed the greatest variation between genome-specific regions and species. For example, intronic and intergenic regions of Drer had similar numbers of (AT) n and (AG) n repeats, whereas exons contained considerably fewer (AT) n repeats than (AG) n repeats. Olat had more (AT) n repeats than (AG) n repeats in exons, but the opposite was found in the other genomes. Finally, (CG) n repeats were very infrequent or absent in these genomes.

Trinucleotide repeats

Motif abundance percentages of trinucleotide repeats in the exons of six fish species were greater than in intergenic regions; these six species were Olat (1.90%), Csem (1.81%), Pret (1.63%), Onil (1.53%) and Eluc (0.78%) (Table 3). Meanwhile, motif abundance percentages of trinucleotide repeats in the exons of Locu, Csem and Olat were greater than those of other motif lengths (e.g. mono-, di-). Among the different trinucleotide repeats, (AAT) n repeats were generally the most numerous in intronic and intergenic regions of different taxa (Tables 5 and 6), except for Okis, where (ACT) n repeats were the most numerous in intergenic regions (Table 6). No single trinucleotide repeat was consistently the most numerous in exonic regions across the different fish species. For example, (AAT) n repeats were the most numerous in Drer and Ipun, (ATC) n repeats in Ccar and Nfur, and (AGG) n repeats in the 10 remaining species (Table 4). Repeats such as ACT, AGC, ACG and CCG were generally present in low numbers in each specific genomic region. Furthermore, CCG repeats were absent from the intergenic regions of Eluc, Omyk, Ccar and Okis (Table 6).

Tetranucleotide repeats

Tetranucleotide repeats were frequent in each genomic region, and their abundance generally depended on the base composition of the repeat unit (Tables 7, 8, 9 and Supplement Table 1). Overall, repeats with > 50% A + T (e.g. AAAT, ATAG and AATC repeats) were more abundant in the studied fish genomes (Supplement Table 1). There were, however, a few notable exceptions. For example, (ACAG) n repeats were the most numerous in Eluc, Omyk and Okis (Supplement Table 1). We found that (AAAB) n repeats (where B denotes any base other than A) were the most numerous in exonic regions in five fish species (i.e. Olat, Drer, Onil, Ccar and Ipun), (ACAG) n repeats were the most numerous in Eluc, Omyk and Okis, and (ATCC) n repeats were the most common in the remaining four fish species (Table 7). As in exons, the most common tetranucleotide repeat in intergenic regions was (AAAB) n, except for (ATCC) n in Olat, (AATC) n in Eluc and (ATAG) n in Amex (Table 9). In introns, (AATB) n or (ACAG) n were the most common tetranucleotide repeats in the studied fish (Table 8). We also found that some repeats with > 50% C + G (e.g. ACGC, AGGG and AGCG repeats) were in the top 50% of tetranucleotide repeats in specific genome regions (Tables 7, 8, 9).

Table 7 The most frequent Tetra-, Penta-, and Hexanucleotide repeats in introns
Table 8 The most frequent Tetra-, Penta-, and Hexanucleotide repeats in exons
Table 9 The most frequent Tetra-, Penta-, and Hexanucleotide repeats in intergenic regions

Pentanucleotide repeats

As expected, pentanucleotide repeats occurred less frequently than tetranucleotide repeats in the different genome regions. We found a general distribution pattern of pentanucleotide repeats for all species, where (A + T)-rich repeats were the most abundant. Yet, we still found notable exceptions where (C + G)-rich repeats were dominant in specific genomic locations, including AGAGG and ACTGG in introns or intergenic regions of Trub, Csem and Okis and ACTGC in exons of Eluc (Tables 7, 8, 9). Although AGAGG repeats were relatively abundant in the introns and exons of Csem, it was also the only species in this study that lacked this repeat in intergenic regions (Supplement Table 1). We also found that CpG-containing repeats were present in the top 50% of pentanucleotide repeats, including (ATACG) n or (CCCGG) n tracts in intronic regions of Eluc and Locu, (CCCGG) n, (AATCG) n or (ACCGG) n tracts in exonic regions of Trub, Amex and Pret, and (ATACG) n or (ACCGG) n tracts in intergenic regions of Eluc and Pret (Tables 7, 8, 9).

Hexanucleotide repeats

Hexanucleotide repeats were the least numerous in specific genomic regions, except for the exons of Trub (Table 3). In exonic and intronic regions, a dominance of (C + G)-rich repeats was found in the majority of the genomes (Tables 7 and 8). The repeat motifs present in intergenic regions were highly variable and relatively (A + T)-rich (Table 9). Except in Olat, Onil, Ccar and Okis, CpG-containing repeats were common in the top 50% of hexanucleotide repeats in intronic and exonic regions, and half of the species had CpG-containing repeats in the top 50% of hexanucleotide repeats in intergenic regions (Tables 7, 8, 9). A few telomere-like repeats were found in introns or intergenic regions of all species except Pret. However, the (AATCCC) n and (AACCCT) n tracts were observed in exonic regions of Trub and Omyk, respectively (Table 8).

Iteration number and length distribution of microsatellites in fish genomes

The iteration number and length of microsatellites are both important factors determining microsatellite mutation rates, and they could be extremely important not only for genomic stability, but also for the evolution of additional genomic features such as codon usage. To assess the expandability of the repeats, the frequency of microsatellites of each motif length was plotted against iteration number, grouped into the following intervals: <20, 20–50, 50–100, 100–200, 200–300, and >300 (Fig. 1). The details of all iteration numbers and densities of microsatellites in the fish genomes are given in Supplement Table 2. In general, microsatellite frequency was concentrated at small iteration numbers. In other words, short microsatellites were observed more frequently in the fish genomes than long microsatellites. Repeat tracts with an iteration number of less than 20 comprised more than 83.93, 67.22, 90.38, 88.93, 92.58 and 90.42% of the mono- to hexanucleotide (1–6 nt) microsatellites, respectively (Fig. 1 and Supplement Table 2). However, a few exceptional microsatellites were found in which the iteration number exceeded 300, for example 1-nt microsatellites in Csem, Eluc, Amex, Ccar and Okis, 2-nt microsatellites in Drer, Csem, Eluc, Nfur, Amex, Ccar and Okis, 3-nt microsatellites in Nfur, Amex and Ccar, 4-nt microsatellites in Drer, Csem, Nfur, Amex and Ccar, 5-nt microsatellites in Amex and Ccar, and 6-nt microsatellites in Csem, Amex and Ccar (Supplement Table 2).
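The interval-based tabulation behind such a heat map can be sketched in Python as follows. The names EDGES, LABELS and bin_iterations are hypothetical. The source does not say which interval a boundary value falls into, so the sketch assumes each interval includes its lower bound (for example, an iteration number of exactly 50 is placed in 50-100, and exactly 300 in the last bin).

```python
from bisect import bisect_right
from collections import Counter

# Interval boundaries for iteration number; lower bounds are inclusive (assumption).
EDGES = [20, 50, 100, 200, 300]
LABELS = ["<20", "20-50", "50-100", "100-200", "200-300", ">300"]


def bin_iterations(iteration_numbers):
    """Return the percentage of microsatellites falling in each interval."""
    counts = Counter(LABELS[bisect_right(EDGES, n)] for n in iteration_numbers)
    total = sum(counts.values())
    return {
        label: (100 * counts[label] / total if total else 0.0)
        for label in LABELS
    }
```

Applying this function separately to each motif length and species would give one row of percentages per combination, which is the kind of matrix a heat map displays.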

Fig. 1 Heat map of the microsatellite distribution frequency of different motif lengths (1–6 nt) based on the iteration number among 14 fish genomes

Discussion

In this study, we examined microsatellites composed of motifs 1–6 bp long in the entire genomes of 14 fish species and analyzed their distribution and frequency in different genomic regions. Microsatellite occurrence differed markedly, with the coverage degree varying from 0.18 to 5.29%. Comparison of our data with microsatellite repeat occurrence in the genomes of humans (3%) [6], primates (0.83–0.88%) [72,73,74], birds (0.13–0.49%) [88], and plants and fungi (0.04–0.15%) [75, 76, 80, 89, 90] indicates that microsatellite occurrence differs between species and that this might be a general phenomenon across taxa [33]. In fact, differences might even occur between closely related species such as humans and chimpanzees [91], and within the genus Drosophila [92, 93].

Another clear trend to emerge from this analysis was that the observed dependence of microsatellite abundance on repeat unit length and iteration number deviated markedly from the expected trend of gradual decrease, which is consistent with a previous study [36]. Our results also indicated that microsatellite density is not strictly positively correlated with genome size. Although microsatellite density is generally reported to correlate positively with genome size [26, 36, 94], results that contradict this pattern, like ours, have been reported in other studies [72, 83, 88, 95]. Overall, the comparative analysis of microsatellites indicated that there was great variation in microsatellite content across the 14 fish species. This suggests that differential selective constraints may play an important role in microsatellite evolution and result in the accumulated preference for different microsatellite types (Saeed 2016; Ellegren 2004; Schlötterer 2000).

During genome evolution, mutation of microsatellite repeats may provide a molecular mechanism for faster adaptation to environmental stress by increasing the quantity of DNA and providing the raw material for adaptive evolution of organisms. Generally, the instability of dinucleotide repeats is higher than that of trinucleotide, tetranucleotide and pentanucleotide repeats [96]. In other words, the dependence of the microsatellite mutation rate on repeat unit length deviates from a trend of gradual decrease. This could explain the high numbers of mono-/dinucleotide motif microsatellites and the low numbers of penta-/hexanucleotide motif microsatellites in the genomes. We should note that tetranucleotide repeats were more frequent than trinucleotide repeats in most of the 14 genomes. However, there was a trend for trinucleotide repeats to be more frequent than tetranucleotide repeats in exonic regions, and less frequent than tetranucleotide repeats in intronic and intergenic regions of most genomes. We suggest that the lower number of trinucleotide repeats cannot be explained solely by conservation, since they contribute triplet codons that form parts of genes. Instead, there may be a mechanism (e.g. the mismatch repair system) in the exonic regions that maintains the higher number of trinucleotide repeats.

As is evident from Tables 2, 3, 4, 5, 6, 7, 8 and 9, poly(A/T) tracts were more common than poly(C/G) tracts in these genomes. Poly(A/T) tracts were particularly common in exonic and intergenic regions, but the opposite was found in intronic regions of some taxa (e.g., Trub, Omyk and Okis), and this has also been observed in the human genome [6]. The higher frequency of poly(A) tracts can be attributed to the re-integration of processed genes into the genome from mRNA with an attached poly(A) tail, while poly(C/G) tracts are not part of this integrative mechanism. An alternative explanation is that a long A-rich tail is known to be necessary for the universal retrotransposons in eukaryotic genomes, such as Alu, LINE-1 and L1 retrotransposons [97,98,99]. Meanwhile, the formation of pseudogenes may contribute to this higher proportion of (A + T)-rich repeats [36, 100]. In addition, the mutation mechanism of microsatellite DNA provides a basis for this phenomenon. The variable frequencies of poly(A) and poly(C) could be due to the difference in stability between (GC) n and (AT) n repeats. (GC) n repeats are more stable than (AT) n repeats, and hence it would be more difficult for poly(C) sequences to slip during replication over the course of microsatellite evolution [6, 95, 101]. In the intronic regions, the higher than expected frequencies of poly(C/G) tracts in some species may be due to duplication events of key DNA sequences during evolution, or the integrity of chromosomes may depend on a higher-order DNA sequence organization that includes the presence of poly(C/G) tracts [102].

In the case of dimeric repeats, we found that the (AC) n tract was common and the (GC) n tract was rare. Assuming that (GC)-rich regions are relatively stable, there is less replication slippage to generate the repeated motifs of such microsatellites [103]. On a genomic scale, microsatellite sequences are presumably at equilibrium, where (AC) n or (AG) n repeats should be more abundant than (AT) n or (GC) n repeats. However, we found the opposite distribution of microsatellite motifs in the genome of Amex. We suggest that there is interspecific variation in the mechanisms of mutation or repair of specific motifs [63], or that there might be variation in the selective constraints associated with different microsatellite motifs [33].

Compared to other microsatellite motifs, trinucleotide repeats undergo strict regulation under evolutionary stress. While (AAT) n tracts were common in intronic and intergenic regions of the fish genomes, (AGG) n tracts were typically more numerous than other repeat types in exons. Therefore, different genome fractions may be characterized by different microsatellite abundances resulting from genome evolution and selective constraints [104]. In addition, the inconsistent distribution patterns, in which (ACT) n tracts were numerous in intergenic regions of Okis and (AAT) n tracts were common in exons of Drer and Ipun, indicated that the distribution of microsatellites reflected the bias of the base composition in the genome fractions. Other biases, such as the (CCG) n tracts in Trub and the (ACC) n tracts in Ccar, suggest that selective forces probably play various roles in specific genomes and differ from each other in a species-specific manner [36].

It should be noted that (CCG) n and (ACG) n repeats were extremely rare in these genomes. A reasonable explanation for this rarity is the presence of the highly mutable CpG dinucleotide within the motif. The rarity of CpG is almost certainly a consequence of methylation. In vertebrate genomes, a CpG-containing island occurs at about one-fifth of the expected frequency [105, 106] because between 60 and 90% of CpGs are methylated at the 5 position on the cytosine ring and the DNA repair mechanism fails to recognize the deamination of 5-methylcytosine to thymine [107, 108]. However, experiments have shown that clusters of non-methylated CpG may contribute to the lack of CpG suppression in the HTF islands, a DNA fraction that accounts for approximately 1% of the total genome in a variety of vertebrates [109, 110]. The HTF fraction is extremely rich in cleavable sites for mCpG-sensitive restriction enzymes, and sequences chosen at random from the HTF fraction belong to islands of DNA several hundred base pairs long that contain CpG at more than 10 times its density in bulk DNA. This would help to explain the phenomenon that (ACG) n or (CCG) n tracts were abundant in introns of all fishes, in contrast to the rarity or absence of this motif in intergenic regions. An alternative explanation is that a specific mechanism exists to maintain the observed level of CpG-containing repeats in introns. The role of cytosine methylation in histone deacetylation, chromatin remodeling, and gene silencing may account for this phenomenon [111].

Among the tetranucleotide microsatellites, the (AAAB) n tracts (B denotes any base other than A) seem to be the most common, followed by repeats with 25% G + C content, and then by those with 75% and 100% G + C content. Previous studies have indicated that DNA sequence composition could have a profound influence on microsatellite incidence [26, 33]. Kristitin et al. (2002) suggested that the G + C content of microsatellites might influence the mutation rate because the tetranucleotide repeats with 25% G + C content were not statistically different from each other, but each was significantly different from the repeats with 50% G + C content [112]. Meanwhile, the contribution of selective forces and the DNA mismatch repair system to the distribution patterns cannot be ignored, because several exceptions were observed in our study; for example, (ACAG) n tracts were abundant in Omyk and Okis.

The longer microsatellites (5–6 nt) have the advantage of being more polymorphic than the shorter ones (1–4 nt), as mutation rates generally increase with the number of repeat units [33, 113]. The significant differences in the repeat types and motif lengths of microsatellites between the studied fish species seem to be due to their genome-specific characteristics. In conclusion, although it remains unclear why certain repeat motifs are more common than others, or why they vary so much between different fish species, several observations presented here suggest that individual genomes and genome-specific regions may be characterized by unique microsatellite profiles. This is also supported by reports of taxon-specific repeats or genome-specific region repeats [6, 36]. The study of microsatellites may help us understand numerous aspects of genome organization and function.