Introduction

Camelus dromedarius, often referred to as the Arabian camel, is one of the most important members of the family Camelidae. The dromedary is a heat stress-resistant animal (Manee et al. 2017) able to live in extremely harsh environments such as those of the Arabian Peninsula, and its adaptations to arid conditions are remarkable. For instance, camels are able to vary their body temperature from 34 to 41.7 °C, and can conserve water by not sweating (Al-Swailem et al. 2010). Additional members of the Camelidae include the Bactrian camel (C. bactrianus) in Asia and the llama (Lama glama) and alpaca (Vicugna pacos) in South America (Groeneveld et al. 2010; Wu et al. 2014), which play crucial roles in transportation and the provision of important products such as milk and meat. Given the economic value of camelid species, their genetic characterization is essential; in particular, implementing proper strategies for conserving animal genetic resources requires the evaluation of genetic diversity both within and among populations. Consequently, assessment of camel genetic diversity is important to help the development of breeding programs, which will facilitate improvements to camel productivity and identify genetically unique structures, furthering the ongoing conservation and utilization of these valuable animals.

As morphological traits are highly affected by environmental factors (Shehzad et al. 2009; Jugran et al. 2013; Last et al. 2014), morphological variation is not necessarily an accurate marker for genetic variation. Molecular markers are key resources for genetic investigations, as they complement morphological information and are informative at any developmental stage (Backes et al. 2003). Microsatellites, also known as simple sequence repeats (SSRs) or short tandem repeats (STRs), are composed of short repetitive DNA sequences, 1–6 base pairs (bp) in length, and are widely distributed in many eukaryotic (Xu et al. 2016; Qi et al. 2015) and prokaryotic (Gur-Arie et al. 2000; Yang et al. 2003) genomes. Microsatellites undergo rapid contractions and expansions in different populations of the same species because of replication slippage (Huntley and Golding 2006), and thus are very useful markers for evaluating genetic diversity and DNA fingerprinting.

Variation in SSR lengths may also lead to changes in the local structure of DNA or protein sequences (Mrazek et al. 2007). Evidence shows that SSRs are distributed nonrandomly in genomes. Comparative analysis of Arabidopsis thaliana and Oryza sativa revealed that SSR distributions were nonrandomly distributed in different genomic regions, and varied widely in different gene regions (Lawson and Zhang 2006). SSRs are found in both coding and noncoding regions (Katti et al. 2001). However, SSRs are more abundant in noncoding regions than in exons (Hancock 1995), with trinucleotide and hexanucleotide SSRs being more abundant in coding regions (Borstnik 2002; Subramanian et al. 2003). Previous studies suggested that SSRs in promoter regions may affect gene expression, and SSRs in introns may influence gene transcription or mRNA splicing (Li et al. 2004).

The availability of draft whole genome sequences for several camel species provides the opportunity to perform post-genomic analysis to compare and assess the distribution of microsatellites across camel genomes (Bactrian Camels Genome Sequencing and Analysis Consortium et al. 2012; Wu et al. 2014). To the best of our knowledge, genome-wide characterization and analysis of perfect microsatellites in camels have not yet been reported. To date, there are four camelid species with draft genome sequences: C. dromedarius, C. bactrianus, C. ferus, and Vicugna pacos. This study aimed to screen the whole genomes of these four species for microsatellite identification. In particular, we detected and characterized SSRs and their motifs, and examined their distribution and variations in different genomic regions, which will facilitate studying the structure of the camel genome. This study will serve as a foundation for further research to develop camel-specific SSR markers.

Materials and methods

Data source

At the time of this study, only four camelid species (C. dromedarius, C. bactrianus, C. ferus, and V. pacos) were known to have draft genome sequences, which according to the genomic resources of the National Center for Biotechnology Information (NCBI) have been assembled at scaffold level. These four assemblies were used for the analysis of SSR distributions at the genomic level. Genome sequences in FASTA format and annotation information in GFF format were downloaded from the NCBI RefSeq database (Pruitt et al. 2012) through the Genomes FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). The accession numbers were GCF_000767585.1 (NCBI Eukaryotic Genome Annotation Pipeline Release version 100), GCF_000767855.1 (100), GCF_000311805.1 (101) and GCF_000164845.2 (101), respectively.

Identification of microsatellites

The software PERF v0.2.5 (Avvaru et al. 2017) was utilized for genome-wide SSR mining. This tool is implemented in the Python programming language for detection of microsatellites from DNA sequences. Because camelid species have very large genomes (> 2 Gb), the criteria used in this study to search for perfect SSRs were as follows: motif sizes of 1 to 6 nucleotides (set with the -m and -M options), and minimum repeat numbers of 12 for mononucleotides, seven for dinucleotides, five for trinucleotides, and four for tetra-, penta-, and hexanucleotides, which were consistent with previous studies (Qi et al. 2015; Liu et al. 2017; Qi et al. 2018). All other settings were left at their defaults. In this study, repeats whose unit patterns are circular permutations and/or reverse complements of one another were treated as one type for statistical analysis (Jurka and Pethiyagoda 1995; Li et al. 2009a). For instance, the unit AGG denotes AGG, GAG, GGA, CCT, TCC, and CTC in different reading frames or on the complementary strand. Relative frequency and relative density were used to help conduct comparisons between different repeat types or motifs. Relative frequency is the number of SSRs per megabase pair (Mb) of target sequence, and relative density is the length of SSRs in base pairs (bp) per Mb of the target sequence (Karaoglu et al. 2005). Total numbers of SSRs were normalized as relative frequency and relative density to allow comparisons between sequences of different sizes.
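The search criteria and motif grouping described above can be illustrated with a minimal Python sketch. This is not PERF's algorithm or API: the regular-expression scan and the names `find_perfect_ssrs`, `canonical_motif`, `relative_frequency`, and `relative_density` are assumptions introduced here for illustration only. The minimum repeat thresholds are those stated in the text.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 12, 2: 7, 3: 5, 4: 4, 5: 4, 6: 4}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Pick one representative among a motif's rotations and their reverse complements."""
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    candidates = []
    for seq in (motif, revcomp):
        for i in range(len(seq)):
            candidates.append(seq[i:] + seq[:i])
    return min(candidates)


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ACAC is AC twice)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_perfect_ssrs(sequence):
    """Yield (start, end, canonical motif, repeat count) using 0-based, half-open coordinates."""
    sequence = sequence.upper()
    for size, min_rep in MIN_REPEATS.items():
        # A unit of `size` bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            if not is_primitive(unit):
                continue  # already covered by a shorter motif size
            repeats = (match.end() - match.start()) // size
            yield match.start(), match.end(), canonical_motif(unit), repeats


def relative_frequency(n_ssrs, target_bp):
    """Number of SSRs per Mb of target sequence."""
    return n_ssrs / (target_bp / 1e6)


def relative_density(total_ssr_bp, target_bp):
    """Total SSR length in bp per Mb of target sequence."""
    return total_ssr_bp / (target_bp / 1e6)


if __name__ == "__main__":
    toy = "TTT" + "AGG" * 5 + "TT" + "AC" * 7 + "G"  # hypothetical sequence
    for hit in sorted(find_perfect_ssrs(toy)):
        print(hit)
```

With this grouping, the six units AGG, GAG, GGA, CCT, TCC, and CTC all map to the single label AGG, matching the example given in the text. A production tool would also need to handle ambiguous bases and very long sequences efficiently, which this sketch does not attempt.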

Assigning microsatellites to genomic compartments

The sequences and coordinates of gene models, exons, coding sequences (CDSs), and intronic and intergenic regions for the four camelid genomes were determined according to the positions in the genome annotation files in GFF format downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/). These GFF files were converted to BED files for further analysis using gff2bed (v2.4.28) (Neph et al. 2012). The draft genome sequences in FASTA format were indexed using the samtools faidx function implemented in SAMtools v1.7 (Li et al. 2009b). Intergenic and intronic coordinates were obtained using BEDtools subtract tool v2.26.0 (Quinlan and Hall 2010). Intergenic regions were defined as the interval sequences between genes, and intronic regions were defined as the interval sequences between exonic regions. Identified microsatellites were assigned to genomic compartments using the BEDtools intersect tool v2.26.0 (Quinlan and Hall 2010). Each tool was run with default settings.
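The interval logic behind this step can be sketched in plain Python. This is an illustration of the definitions given above (intergenic = scaffold minus genes, intronic = genes minus exons), not the BEDtools implementation; the function names and the toy coordinates are hypothetical, and intervals are assumed to be 0-based and half-open as in BED files.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a, b):
    """Return the parts of the intervals in `a` that are not covered by `b`."""
    b = merge(b)
    result = []
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor or b_start >= end:
                continue  # no overlap with the remaining part of this interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def compartments(scaffold_length, genes, exons):
    """Derive intergenic and intronic intervals for one scaffold."""
    intergenic = subtract([(0, scaffold_length)], genes)
    intronic = subtract(genes, exons)
    return intergenic, intronic


def assign(ssr, regions_by_name):
    """List the names of all compartments that overlap an SSR interval."""
    s, e = ssr
    return [
        name
        for name, regions in regions_by_name.items()
        if any(s < r_end and r_start < e for r_start, r_end in regions)
    ]


if __name__ == "__main__":
    genes = [(100, 500)]                # hypothetical gene model
    exons = [(100, 200), (300, 500)]    # hypothetical exons
    intergenic, intronic = compartments(1000, genes, exons)
    regions = {"exon": exons, "intronic": intronic, "intergenic": intergenic}
    print(intergenic, intronic, assign((250, 270), regions))
```

An SSR that spans a boundary is reported for every compartment it overlaps in this sketch; how such cases were counted in the study is not stated in the text.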

Statistical analysis

All graphical and statistical analyses were conducted in the R programming environment (version 3.4.3) (R Core Team, 2017). The cor.test function with method = "pearson" was used to assess correlations between SSR data sets, including relative frequency, relative density, and GC content.
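For readers who want to follow the arithmetic, the Pearson coefficient and its significance test can be sketched in Python. The study itself used R; the function names and the toy vectors below are illustrative assumptions. The t statistic, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, is the standard test for a Pearson correlation. With four species (n = 4, 2 degrees of freedom) the two-sided critical values are 4.303 for P < 0.05 and 9.925 for P < 0.01.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    # Hypothetical toy data, not values from the study.
    print(pearson_r([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]))
    # Correlation coefficients reported in the Results, each based on n = 4 genomes.
    for r in (0.999, 0.979, -0.994, -0.971):
        print(r, round(t_statistic(r, 4), 2))
```

Under these assumptions, the coefficients 0.999 and −0.994 give |t| above 9.925, while 0.979 and −0.971 give |t| between 4.303 and 9.925, which is consistent with the P < 0.01 and P < 0.05 thresholds reported in the Results.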

Results

Identification and characterization of microsatellites in camelid genomes

We analyzed perfect SSRs from four draft camelid genomes (C. dromedarius, C. bactrianus, C. ferus, and V. pacos). Genome characteristics including genome size, GC content, number of SSRs, relative frequency, and relative density are summarized in Table 1. Perfect microsatellites were searched for and analyzed using PERF software. In total, 546762, 544494, 547974, and 437815 perfect SSRs were identified per genome, with overall frequencies of ~273 SSRs/Mb in Camelus genomes and 201.55 SSRs/Mb in V. pacos, accounting for approximately 0.52% and 0.38% of the genomes, respectively. The number of SSRs was positively correlated with relative frequency (Pearson, r = 0.999, P < 0.01) and GC content of SSRs across species (Pearson, r = 0.979, P < 0.05), but negatively correlated with genome size (Pearson, r = −0.994, P < 0.01). Relative frequency and relative density of SSRs were also negatively correlated with genome size (Pearson, r = −0.997, P < 0.01 and Pearson, r = −0.971, P < 0.05, respectively). For instance, V. pacos has the largest genome (2172.21 Mb) among those surveyed, and was found to have the lowest SSR frequency and density (201.55 SSRs/Mb and 3828.30 bp/Mb, respectively).
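The V. pacos figures quoted above can be cross-checked with a few lines of Python. This is a small worked example using only values reported in the text; the function names are illustrative.

```python
def relative_frequency(n_ssrs, genome_mb):
    """SSRs per megabase of target sequence."""
    return n_ssrs / genome_mb


def percent_of_genome(density_bp_per_mb):
    """Convert a relative density (bp of SSR per Mb) into a percentage of the sequence."""
    return density_bp_per_mb / 1_000_000 * 100


# V. pacos values reported in the text.
vp_count = 437815        # number of perfect SSRs
vp_genome_mb = 2172.21   # genome size in Mb
vp_density = 3828.30     # relative density in bp/Mb

print(round(relative_frequency(vp_count, vp_genome_mb), 2))  # 201.55 SSRs/Mb
print(round(percent_of_genome(vp_density), 2))               # 0.38 % of the genome
print(round(vp_density / 201.55, 1))                         # mean SSR length: 19.0 bp
```

The count and genome size reproduce the reported frequency of 201.55 SSRs/Mb. The reported density corresponds to about 0.38% of the genome, the figure given for V. pacos in the Discussion, and to a mean SSR length of roughly 19 bp.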

Table 1 Overview of the four camelid genomes

The number, relative frequency, and density of perfect mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeat types for the four genomes are shown in Table 2. The results revealed that the relative frequencies and densities of a given type of microsatellites are greatly similar in these species (Fig. 1b, c), with the exception of the relative frequency and density of mononucleotide SSRs in V. pacos. The proportions of mono- to hexanucleotide SSRs were similar across the four genomes, particularly between C. dromedarius, C. bactrianus, and C. ferus (Fig. 1a). Mononucleotide SSRs were the most frequent type, followed by di-, tetra-, tri-, penta-, and hexanucleotide SSRs in decreasing order. Mononucleotide SSRs had frequencies of 69.16–135.79 SSRs/Mb and the highest densities of 951.09–2066.54 bp/Mb, accounting for 34.31–49.79% of the total number of SSRs. Hexanucleotide SSRs were the least frequent, only accounting for 0.76–1.00% of all SSRs.

Table 2 Number, length, frequency, and density of mono- to hexanucleotide repeats in four camelid genomes
Fig. 1 Comparison of percentage, frequency, density, and GC content of SSRs in the camelid genomes. Percentages were calculated according to the total number of each SSR type divided by the total number of SSRs for that species. Panels a, b, c, and d represent percentage, frequency, density, and GC content of SSRs, respectively

GC content and adenine-thymine (AT) content were investigated in camelid SSRs. The overall GC contents of SSRs were almost identical for C. dromedarius, C. bactrianus, and C. ferus, accounting for approximately 22%, and slightly higher in V. pacos (~26%). The lengths and proportions of GC and AT content of all SSR types are presented in Table 3 and Fig. 1d. All SSR repeat types had high AT contents. Mononucleotide SSRs had the highest AT content (> 94%), followed in decreasing order by penta-, tetra-, hexa-, and trinucleotide SSRs, with dinucleotide SSRs having the lowest. The highest GC content among SSR repeat types was in the dinucleotide SSRs (~40%), and the lowest was in the mononucleotide SSRs (~4%) (Fig. 1d). The GC contents in tri- and hexanucleotide SSRs were highly similar across the four genomes, ranging from ~28 to ~32%. Interestingly, GC content in all SSR repeat types was significantly lower than that of the entire genome, except in dinucleotide SSRs. Furthermore, we conducted additional analyses to report all perfect SSRs in the four camelid genomes without applying any search criteria (supplementary files S1–S4).
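One straightforward way to compute the GC content of a set of SSRs is to pool their bases, so that longer repeats weigh more than shorter ones. The sketch below assumes this length-weighted definition, which the text does not spell out; the function name and the example tracts are hypothetical.

```python
def pooled_gc_fraction(ssr_sequences):
    """Fraction of G and C bases over all SSR sequences combined (length-weighted)."""
    total = sum(len(s) for s in ssr_sequences)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in ssr_sequences)
    return gc / total


if __name__ == "__main__":
    # Hypothetical tracts: an (AC)n repeat is 50% GC, an (AT)n repeat 0%, an (A)n repeat 0%.
    print(pooled_gc_fraction(["AC" * 7]))             # 0.5
    print(pooled_gc_fraction(["AC" * 7, "AT" * 7]))   # 0.25
    print(pooled_gc_fraction(["A" * 12]))             # 0.0
```

Under this definition the GC content of a repeat class follows directly from its motif composition. A class dominated by (A)n tracts is necessarily AT-rich, while the 50% GC of (AC)n tracts raises the overall value for dinucleotide SSRs.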

Table 3 AT and GC content of SSRs for each SSR type in four camelid genomes

Repeat numbers for different microsatellite types

The number of repeats in each SSR and the maximum repeats of each SSR type were found to be highly diverse in different microsatellite types across the four genomes. In general, the corresponding repeat motifs were almost identical between the four genomes, with the exception of fewer repeats for mononucleotide SSRs in V. pacos (Fig. 2).

Fig. 2 Repeat times of different SSR types in the camelid genomes. Panels a, b, c, d, e, and f represent mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSR types, respectively

Diversity of microsatellite motifs in camelid genomes

As noted above, the SSRs in camelid genomes were relatively AT-rich. To better understand why this is, we analyzed the motif composition of camelid SSRs. The most frequent SSR motifs for each repeat length were found to vary at the whole genome level across the four camelid species (Table 4). The major repeat motif types shared by the four genomes and having over 5000 SSRs were (A)n, (C)n, (AC)n, (AT)n, (AG)n, (AAT)n, (AAC)n, (AAAT)n, (AAAC)n, (AAAG)n, (AAGG)n, (AATG)n, (AGAT)n, and (AAAAC)n. The numbers of degenerate repeat motifs were found to be 2, 4, 10, and 33 for mono-, di-, tri-, and tetranucleotide repeat types, respectively; these numbers were identical between the four camelid genomes, whereas the numbers for pentanucleotide and hexanucleotide repeat types differed between genomes.
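The counts 2, 4, 10, and 33 are the numbers of distinct motif classes that exist for unit lengths 1 to 4 once circular permutations and reverse complements are merged and units that are repeats of a shorter unit are excluded. A short enumeration makes this concrete; it is a sketch with illustrative helper names, not code from the study.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Smallest string among a motif's rotations and the rotations of its reverse complement."""
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    return min(seq[i:] + seq[:i] for seq in (motif, revcomp) for i in range(len(seq)))


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit (AA and ACAC are not primitive)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def motif_classes(k):
    """All canonical motif classes for unit length k."""
    units = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({canonical_motif(u) for u in units if is_primitive(u)})


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, len(motif_classes(k)))
    print(motif_classes(2))  # ['AC', 'AG', 'AT', 'CG']
```

Because every possible class of length 1 to 4 is short enough to occur in a genome of this size, all four genomes are expected to show the same counts for these types, whereas the much larger sets of penta- and hexanucleotide classes need not all be observed in each genome.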

Table 4 The number, length, frequency, and density of the most frequent motifs for each SSR type in four camelid genomes

The predominant mononucleotide motif was (A)n, accounting for 95–97% of the total mononucleotide SSRs in each genome (Fig. 3a). The (C)n repeat was the least frequent, with frequencies of less than 7 SSRs/Mb. In particular, V. pacos had approximately two-fold and one-fold lower frequency of (C)n repeats than C. dromedarius, C. bactrianus, and C. ferus (Table 4). The (AC)n repeat motif was the predominant dinucleotide SSR, occupying ~60% of all dinucleotide SSRs in the four genomes (Fig. 3b). The (AT)n repeat was the second most frequent dinucleotide repeat, with frequencies of 14.70–17.72 SSRs/Mb. The (AG)n motif was less abundant than (AT)n, and (CG)n was the least frequent dinucleotide SSR. (AAT)n and (AAC)n motifs were the most frequent trinucleotide SSRs, together accounting for 49–53% of trinucleotide SSRs in the four camelid genomes (Fig. 3c). The third most frequent repeat motif was (AGG)n, followed by (ATC)n and (ACC)n, which had almost identical frequencies of approximately 1.50 SSRs/Mb. The (ACG)n motif was the least abundant trinucleotide SSR in the four camelid genomes.

Fig. 3 Percentage of SSR motif types in the camelid genomes. Percentages were calculated according to the total number of each SSR motif type divided by the total number of SSRs for that SSR type in each genome. Panels a, b, c, d, e, and f represent mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSR types, respectively

Fig. 4 Comparison of percentage, frequency, density, and GC content of SSRs in different genomic regions of the camelid species. Panels a, b, c, and d represent percentage, frequency, density, and GC content of SSRs, respectively

Among tetranucleotide repeats, (AAAT)n and (AAAC)n were the most abundant with almost identical frequencies of approximately 8 SSRs/Mb, together accounting for 38.09–39.51% of total tetranucleotide SSRs in the four genomes (Fig. 3d). The third most frequent tetranucleotide motif was (AAAG)n, with a similar frequency of more than 5 SSRs/Mb in these genomes, followed by the (AAGG)n, (AATG)n, and (AGAT)n motifs with frequencies ranging from 2.47 to 4.28 SSRs/Mb. For pentanucleotide repeats, (AAAAC)n was the most abundant motif, occupying 44.30–47.17% of pentanucleotide SSRs in the camelid genomes (Fig. 3e). The second most frequent pentanucleotide motif was (AAAAT)n, followed by (AAAAG)n; these had almost identical frequencies of approximately 1 SSR/Mb, and together accounted for 28.09–28.83% of pentanucleotide SSRs in the four genomes. Hexanucleotide repeats were found to have a lower frequency and density compared to other microsatellite types. The predominant hexanucleotide motif was (AAAAAC)n, with frequencies below 0.84 SSRs/Mb and densities below 24.06 bp/Mb, accounting for ~37% of hexanucleotide SSRs in Camelus species and 32.09% in V. pacos, followed by the (AAAAAG)n and (AGATAT)n motifs (Fig. 3f).

Distribution and motif diversity of microsatellites in different genomic regions

A microsatellite search was carried out in exons, CDSs, and intronic and intergenic regions to determine the distribution of SSRs in different genomic regions of C. dromedarius, C. bactrianus, C. ferus, and V. pacos. The comparison results revealed high similarity by region across the four genomes in terms of the relative abundances, densities, and percentages of most of the similar mono- to hexanucleotide SSRs; however, the occurrences and relative frequencies and densities of SSRs were found to differ significantly in coding and noncoding regions (Fig. 4). SSRs were most commonly located in intergenic regions, followed in order by intronic regions, exons, and CDSs (Fig. 4b). The frequencies of SSRs in CDSs of the four camelid species ranged from 0.83 to 1.26 SSRs/Mb, accounting for 0.30–0.36% of SSRs in Camelus species and 0.62% in V. pacos. The frequencies in exons ranged from 2.79 to 3.93 SSRs/Mb, accounting for 1.01, 1.28, 1.42, and 1.74% of SSRs in C. dromedarius, C. bactrianus, C. ferus, and V. pacos, respectively (Fig. 4a, b). The frequencies of SSRs in intergenic regions were 172.06, 170.45, 173.72, and 130.02 SSRs/Mb, respectively, accounting for ~62% of SSRs in all four species, while the frequencies in intronic regions were 99.69, 101.46, 97.90, and 70.37 SSRs/Mb, accounting for ~35% of SSRs in all four species (Fig. 4a, b). The respective densities of SSRs in coding regions were 14.93, 17.73, 20.14, and 24.15 bp/Mb for CDSs and 49.04, 60.99, 70.65, and 63.01 bp/Mb for exons (Fig. 4c). The densities of SSRs in noncoding regions were much higher, with intronic regions having densities of 1878.09, 1856.92, 1870.78, and 1302.66 bp/Mb, and intergenic regions of 3369.28, 3194.22, 3458.25, and 2505.98 bp/Mb (Fig. 4c).

In addition, the GC content of SSRs was investigated for different genomic regions of the four camelid genomes (Fig. 4d). GC contents were almost identical for C. dromedarius, C. bactrianus, C. ferus, and V. pacos. GC contents were found to vary between different genomic regions (Fig. 4d), but the distributions in intronic and intergenic regions were highly similar. SSRs located in CDSs were found to have the highest GC content (63.82–66.66%), followed by those in exons (33.94–45.89%), intronic regions (21.82–25.51%), and finally intergenic regions (22.14–25.90%).

In CDSs, trinucleotide SSRs were the most abundant type, followed by hexa-, mono-, tetra-, di-, and pentanucleotide SSRs (Fig. 5a). In exons, mononucleotide SSRs were the most abundant type in C. dromedarius, C. bactrianus, and C. ferus, while trinucleotide SSRs were the most abundant type in V. pacos (Fig. 5b). Hexanucleotide SSRs were the least abundant type in the exons of C. bactrianus and C. ferus, versus pentanucleotide SSRs in the exons of C. dromedarius and V. pacos (Fig. 5b). In intronic regions, mononucleotide SSRs were the most abundant type in all four camelid species, followed in decreasing order by di-, tetra-, tri-, penta-, and hexanucleotide SSRs (Fig. 5c). In intergenic regions, mononucleotide SSRs were the most abundant type in Camelus species, while dinucleotide SSRs were the most abundant type in V. pacos (Fig. 5d). Trinucleotide SSRs were rare in intergenic and intronic regions for all four camelid species, and hexanucleotide SSRs were the least abundant type in intronic and intergenic regions (Fig. 5c, d).

Fig. 5 Relative frequency of mono- to hexanucleotide SSRs in different genomic regions of the camelid genomes. Panels a, b, c, and d represent CDSs, exons, intronic regions, and intergenic regions, respectively

The abundances of specific repeat motif types were found to vary distinctly in different genomic regions of the four species (Fig. 6). In CDS regions, the predominant motif was (AGG)n in the three Camelus species, accounting for ~30% of CDS SSRs, followed by (AGC)n at ~28% (Fig. 6a). Meanwhile, (AGC)n was the most abundant trinucleotide repeat in the CDSs of V. pacos, followed by (AGG)n; these together accounted for 56.14% of CDS SSRs. In all four genomes, the motifs (AC)n, (AGG)n, and (AGC)n had similar abundances in CDS regions, together accounting for 39.65–44.19% of CDS SSRs (Fig. 6b). Consistently, the (A)n motif was the most abundant repeat in exons (27.33–44.09%), intronic regions (36.65–50.02%), and intergenic regions (31.37–46.98%) (Fig. 6b, c, d). (AC)n was the second most frequent motif in intronic (15.54–19.95%) and intergenic regions (16.14–20.67%), followed by (AT)n, which comprised 4.70–7.38% and 6.43–9.62% of the SSRs in intronic and intergenic regions, respectively (Fig. 6c, d).

Fig. 6 Relative frequency of SSR motif types in different genomic regions of the camelid species. Panels a, b, c, and d represent CDSs, exons, intronic regions, and intergenic regions, respectively

Discussion

Diversity of microsatellite distribution in camelid genomes

In this study, microsatellites with motifs of 1–6 bp were identified using PERF with consistent search parameters in four camelid species (C. dromedarius, C. bactrianus, C. ferus, and V. pacos). The number of SSRs, relative frequency, relative density, and GC content were analyzed to understand the structure and diversity of SSR content in camelid genomes. The findings provide evidence that these four genomes have similar distribution patterns for SSRs, suggesting that other camelid genomes are likely to share the same pattern. However, our results showed that SSR density did not drive genome size in these four camelids. Instead, there was a negative correlation between SSR densities and genome sizes, suggesting that SSRs might not have contributed significantly to the expansion of the genome during evolution. Perfect SSRs were found to comprise 0.53% of the C. dromedarius and C. ferus genomes, 0.51% in C. bactrianus, and 0.38% in V. pacos. The total percentages of SSRs were higher in the three Camelus species than in bovids (0.44–0.48%) (Qi et al. 2015; Ma 2015), but lower than in macaques (0.83–0.88%) (Liu et al. 2017) and humans (3%) (Subramanian et al. 2002). The wide variance in total percentages may arise from the use of different computational methods for SSR mining, the relative completeness of different genome assemblies, or real differences in SSR content among these species (Sharma et al. 2007).

As expected, the six types of SSRs were not evenly abundant across the four camelid genomes. Mononucleotide SSRs were the most abundant repeat type, consistent with bovids (Qi et al. 2015; Ma 2015) and macaques (Liu et al. 2017). In addition, this finding is consistent with the previous report that mononucleotide SSR repeats are more frequent in eukaryotic genomes than other SSR repeat types (Sharma et al. 2007). However, dinucleotide SSR repeats are the most frequent type in dicotyledons (Kumpatla and Mukhopadhyay 2005), Taenia solium (Pajuelo et al. 2015), Drosophila (Katti et al. 2001), and rodents (Toth 2000), while trinucleotide SSR repeats are the most prevalent type in a number of prokaryotes (Kim et al. 2008; Sharma et al. 2007) and yeast (Katti et al. 2001). The second most frequent SSRs in camelid genomes are dinucleotides, accounting for 25.08–33.94% of all SSRs. The third most abundant SSRs are tetranucleotides, followed by tri-, penta-, and hexanucleotide SSRs. In this analysis, hexanucleotide repeats were the least frequent, at less than 2.22 SSRs/Mb, and accounted for only 0.76–1.00% of the total number of SSRs. This observation in camelids is similar to what has been found in humans (Subramanian et al. 2002), bovids (Qi et al. 2015), and macaques (Liu et al. 2017).

A comparative analysis was conducted for microsatellite motifs within each type of repeat. We observed variation in overall number, frequency, and density between the four camelids. However, SSR motif occurrences are expected to increase as the motif length decreases, as seen in some other species (Karaoglu et al. 2004; Qi et al. 2015; Liu et al. 2017). The most prevalent SSR motifs for each type were found to be almost identical across the four genomes. Among mononucleotide repeats, the motif (A/T)n was the most abundant, accounting for 95.06–96.66% of mononucleotide SSRs. Conversely, the motif (C/G)n was rare. The (A/T)n motif is also predominant in Volvariella volvacea, Agaricus bisporus, Coprinus cinereus (Wang et al. 2014), and Caenorhabditis elegans (Castagnone-Sereno et al. 2010), while the (C/G)n motif is the most frequent in Meloidogyne incognita, Pristionchus pacificus (Castagnone-Sereno et al. 2010), and Schizophyllum commune (Wang et al. 2014). Among dinucleotide SSRs, the most abundant motif was (AC)n, similar to the trend observed in Carlavirus (Alam et al. 2014), humans (Subramanian et al. 2002), bovids (Qi et al. 2015), and macaques (Liu et al. 2017). The second most frequent dinucleotide motif was (AT)n, followed by (AG)n and (CG)n motifs, which is consistent with Bos grunniens (Ma 2015). The rareness of (CG)n motifs can be explained by the tendency to AT richness, and by the fact that strand separation is harder for CG than for AT and other tracts, raising the potential of slipped strand mispairing (Zhao et al. 2011). The (AAT)n motif was the most frequent trinucleotide SSR in the four camelids, similar to macaques (Liu et al. 2017), P. pacificus, M. hapla, B. malayi (Castagnone-Sereno et al. 2010), and Ziziphus jujuba (Xiao et al. 2015); (AAT)n is conversely rare in P. ostreatus, Coprinus cinereus, and S. commune (Wang et al. 2014). A previous study revealed that the (AAAT)n motif predominates in Ailuropoda melanoleuca (Huang et al. 2015). 
Among tetra-, penta-, and hexanucleotide motif types, AT-rich SSR motifs including (AAAT)n, (AAAAC)n, and (AAAAAC)n were found to be predominant, which is consistent with macaques (Liu et al. 2017). Interestingly, none of the most prevalent SSR motifs includes exclusively Cs or Gs. The over-represented motifs identified in this study support the conclusion that nucleotide sequences with higher GC content are expected to contain fewer SSRs than those of higher AT content (Schlötterer 1998). Overall, the great similarity of the most abundant motifs between the four camelids is a strong indication that the pattern of microsatellites is conserved in genus Camelus.

Diversity of microsatellite distribution in different genomic regions

Substantial evidence exists that the genomic distribution of SSRs is nonrandom, presumably due to their influences on processes such as chromatin organization, gene activity, DNA repair, and DNA recombination (Li et al. 2002, 2004). This may indicate that SSRs in different genomic regions play different functional roles. For instance, SSR expansions or contractions in coding regions can control gene activation, while SSRs located in intronic regions impact gene transcription or mRNA splicing (Li et al. 2004). SSRs in coding regions may affect phenotypes, causing neuronal diseases and cancers in humans (Pearson et al. 2005; Li et al. 2004). Furthermore, SSR repeat variations in 5′ UTRs may affect gene expression, and longer SSR repeats located in 3′ UTRs may lead to transcription slippage (Li et al. 2004). Here, we further studied the distribution of SSRs in different genomic regions of four camelids. The results revealed extensive variation in the distributional patterns of different SSR types between different genomic regions of camelids. Our results also demonstrated great similarity in SSR distributions within the same genomic regions of these camelid species. SSRs in noncoding regions were found to be more abundant than in coding regions, which confirms results previously reported in eukaryotes (Toth 2000; Katti et al. 2001; Qi et al. 2016) and plants (Morgante et al. 2002; Lawson and Zhang 2006; Hong et al. 2007). SSRs were most frequent in intergenic regions, followed in order by intronic regions, exons, and CDSs. SSR abundance was lowest in CDS regions, consistent with selection against frameshift mutations in coding regions (Li et al. 2002).

In CDSs, trinucleotide SSRs were the most frequent type, consistent with results observed in primates (Qi et al. 2016) and bovids (Qi et al. 2018). Such predominance of triplets over other SSR repeat types in coding regions may be explained by purifying selection, which serves to eliminate non-trimeric SSRs in coding regions as they may cause frameshift mutations (Metzgar et al. 2000). This strong evolutionary pressure against SSR expansions in CDS regions may maintain the stability of the protein products (Dokholyan et al. 2000). Mononucleotide SSRs were the most abundant in exons, intronic, and intergenic regions, with the exception of V. pacos, in which trinucleotide and dinucleotide SSRs were identified to be most frequent types in exons and intergenic regions, respectively. This was consistent with observations from other eukaryotic genomes (Sharma et al. 2007; Qi et al. 2016; Qi et al. 2018). Pentanucleotide SSRs were the least common type in CDSs, whereas hexanucleotide SSRs were the least common type in exons and intronic and intergenic regions, except in C. dromedarius and V. pacos, where pentanucleotide SSRs were the least common type in exons. The paucity of trinucleotide SSRs compared to di- and tetranucleotide SSRs was also quite pronounced in intronic and intergenic regions of the four camelids. This might be a signature of selection removing triplet repeats from noncoding regions because they could generate false open reading frames (Gonthier et al. 2015).
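The frameshift argument can be made concrete with a few lines of Python. Gaining or losing whole repeat units changes the sequence length by a multiple of the motif length, and the reading frame is preserved only when that change is a multiple of three. The function name below is illustrative.

```python
def preserves_reading_frame(motif_length, unit_change):
    """True if gaining (+) or losing (-) `unit_change` repeat units keeps the codon frame."""
    return (motif_length * unit_change) % 3 == 0


if __name__ == "__main__":
    for k in range(1, 7):
        # Effect of a single-unit expansion for each motif length from 1 to 6.
        print(k, preserves_reading_frame(k, 1))
```

Only tri- and hexanucleotide motifs tolerate a single-unit change without shifting the frame, which fits the observation that these two types are the most abundant in CDSs. Other motif lengths preserve the frame only when the number of units changes by a multiple of three.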

Comparisons among different genomic regions in the four camelid genomes demonstrated that the major SSR motif types showed great similarity in their relative abundances. The nonrandom distribution of SSRs in different genomic regions shows bias to several specific repeat motifs, suggesting that SSRs of different types may play different roles in different genomic regions (Li et al. 2004; Gemayel et al. 2012). For instance, (AGG)n repeats are predominant in the coding regions of primates (Qi et al. 2016) and bovids (Qi et al. 2018). Consistent with those results, this study found (AGG)n repeats to be the most frequent motifs in CDS regions of camelid genomes, followed by (AGC)n repeats. (AGG)n and (AGC)n motifs were also more frequent in exonic regions, and relatively infrequent in intronic and intergenic regions. Trinucleotide and hexanucleotide repeats were more abundant in CDS regions than other motif types, consistent with previous reports (Borstnik 2002; Subramanian et al. 2003). Overall, (A)n repeats were the most abundant motifs in the exons, introns, and intergenic regions of these camelids, followed by dinucleotide (AC)n repeats; these trends are similar to findings in primates (Qi et al. 2016) and bovids (Qi et al. 2018). In addition, dinucleotide (AT)n and (AG)n repeats were relatively frequent in intronic and intergenic regions of the four camelid genomes. (AAAT)n and (AAAC)n motifs were comparatively more frequent than other tetranucleotide repeats in intronic and intergenic regions.

GC content and repeat number in different types of microsatellites

Previous studies reported a correlation between GC content and the genomic features of mammals, including methylation patterns, the distribution of repeat elements (Jabbari and Bernardi 1998), and gene density (Duret et al. 1994; Duret and Hurst 2001). A high level of GC content was found to be associated with gene expression (Ren et al. 2007) and DNA thermostability (Vinogradov 2003). GC-rich regions were also associated with many genes, suggesting a potential functional relevance for the distribution of GC content in mammals (Galtier et al. 2001). Microsatellite motifs with high GC content have been reported to cause some diseases in humans. For instance, a (CGG)n repeat exceeding 200 units in the 5′ untranslated region (UTR) of FMR1 was identified as the genetic cause of fragile X syndrome (Sharma et al. 2007). Furthermore, expansion of (CGG)n repeats in the 5′ UTR of the DIP2B gene causes FRA12A mental retardation (Winnepenninckx et al. 2007). (G)n repeats in the membrane protein gene pmp10 of Chlamydophila were reported to be involved in the virulence and pathogenesis of Chlamydia (Grimwood et al. 2001), and (C)n repeats in outer membrane proteins were found to be involved in the pathogenesis of Chlamydophila pneumoniae (Rocha 2002). Additionally, high GC content may have significant roles in the entire viral genome. For example, G-string mutants in the thymidine kinase gene were found to be associated with reactivation of herpes simplex virus (Griffiths et al. 2006).

Our results revealed that GC content is remarkably consistent within an SSR type, but is not evenly distributed among genomic regions. Our results also suggest that SSRs with high AT content are prevalent in each genome, similar to what has been reported in 26 eukaryotic genomes (Sharma et al. 2007). (A/T)n motifs were more prevalent than (G/C)n motifs, which could be attributed to the high AT content of the majority of the analyzed SSRs. A previous study reported that trinucleotide SSRs have the highest GC content in bovids (Qi et al. 2015), which disagrees with our results. Here, dinucleotide SSRs were found to possess the highest GC content in camelid genomes, which is consistent with findings in macaques (Liu et al. 2017). However, GC content varied greatly among genomic regions, in the order CDSs > exons > intronic regions > intergenic regions. The high level of GC content in coding regions has been investigated to determine its relative influence on gene expression patterns. For example, the GC content of the 5′ UTR has been found to be positively correlated with gene expression in chickens (Rao et al. 2013). In addition, high GC content in SSR motifs has been suggested to potentially impact genome structure. For instance, increasing (CGG)n repeats in the HSV-1 genome were shown to have considerable hairpin-forming and quadruplex-forming potential (Li et al. 2004).
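To make the notion of GC content per SSR type concrete, the sketch below computes the GC fraction of a single motif and a length-weighted GC fraction for a pooled set of SSRs. It is only an illustration under stated assumptions: the input format of (motif, repeat count) pairs, the function names, and the example values are hypothetical and do not reproduce the study's data or software.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA string (0.0 to 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def pooled_gc_fraction(ssrs) -> float:
    """Length-weighted GC fraction over perfect SSRs.

    `ssrs` is an iterable of (motif, repeat_count) pairs; each SSR
    contributes len(motif) * repeat_count bases to the pool.
    """
    gc_bases = 0
    total_bases = 0
    for motif, repeats in ssrs:
        gc_bases += sum(base in "GC" for base in motif.upper()) * repeats
        total_bases += len(motif) * repeats
    if total_bases == 0:
        raise ValueError("no SSR bases supplied")
    return gc_bases / total_bases


if __name__ == "__main__":
    # Hypothetical SSRs, one each of mono-, di-, and trinucleotide type.
    example = [("A", 20), ("AC", 10), ("AGG", 5)]
    print(round(pooled_gc_fraction(example), 3))
```

Weighting by tract length rather than averaging motif-level values keeps long AT-rich tracts, such as (A)n runs, from being underrepresented in a pooled estimate.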

A number of studies reported that SSR repeat count has an influence on gene expression. As an illustration, a promoter of Saccharomyces cerevisiae containing 25 tandem repeats of the (CAG)n motif allows expression of a URA3 reporter gene and yields sensitivity to the drug 5-fluoroorotic acid, but expansion to 30 or more repeats turns off URA3 and provides drug resistance (Miret et al. 1998). Promoter regions of Escherichia coli containing exactly 12 tandem repeats of the (GAA)n motif were found to express lacZ, while those with (GAA)12–16 and (GAA)5–11 repeat motifs do not express lacZ (Liu et al. 2000). In this study, repeat lengths and maximum lengths were found to differ significantly within and between SSR repeat types among the four genomes. Notably, dramatically fewer SSRs were observed as the number of repeats increased. This observation can be explained by the higher mutation rates of longer repeats compared to shorter repeats within a given SSR type (Leopoldino and Pena 2002). In particular, SSR instability is suggested to increase as the length of the repeat tract increases. For instance, an in vitro study in human colorectal cells demonstrated that replication error in a (G)16 repeat was 30-fold higher than in a (G)10 repeat, and in a (CA)26 repeat was 10-fold higher than in a (CA)13 repeat (Campregher et al. 2010). Overall, the GC content and repeat counts of SSRs may play significant roles in most species.
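Tallying SSRs by repeat count presupposes a rule for what qualifies as an SSR. The sketch below shows one simple way to locate perfect SSRs with motifs of 1 to 6 bp and report their repeat counts. The minimum repeat thresholds in MIN_REPEATS are hypothetical placeholders, not the thresholds used in this study, and the left-to-right scan is a simplification of what dedicated SSR-mining tools do.

```python
# Hypothetical minimum repeat counts per motif length (bp); not the study's settings.
MIN_REPEATS = {1: 12, 2: 7, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_perfect_ssrs(seq: str, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) tuples for perfect SSRs.

    Scans left to right; at each position the shortest qualifying motif
    wins, and the scan resumes after the end of the reported tract.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        hit = None
        for k in sorted(min_repeats):
            motif = seq[i:i + k]
            if len(motif) < k or set(motif) - set("ACGT"):
                continue  # too close to the end, or ambiguous base
            if not is_primitive(motif):
                continue  # e.g. "AA" is handled as the mononucleotide "A"
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == motif:
                n += 1
            if n >= min_repeats[k]:
                hit = (i, motif, n)
                break
        if hit:
            hits.append(hit)
            i += len(hit[1]) * hit[2]
        else:
            i += 1
    return hits


if __name__ == "__main__":
    demo = "TTG" + "A" * 14 + "CCG" + "AC" * 8 + "GTT" + "AGG" * 5 + "C"
    for start, motif, repeats in find_perfect_ssrs(demo):
        print(start, motif, repeats)
```

Grouping the returned repeat counts by motif length would give the kind of frequency-by-repeat-number distribution discussed above, although absolute counts depend strongly on the thresholds chosen.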

Conclusion

The current work provides a detailed characterization of microsatellites in camelid genomes. The camelid genomes are dominated by AT-rich SSRs, and SSRs are nonrandomly distributed. Mononucleotide SSRs were the most frequent type, followed in order by di-, tetra-, tri-, penta-, and hexanucleotide SSRs. The greatest GC content was in dinucleotide SSRs and the least in mononucleotide SSRs. The number of SSRs, relative frequency, and relative density were generally found to decrease in these genomes as the number of motif repeats increased. SSRs were demonstrated to be more frequent in noncoding regions than in coding regions. Overall, the results of this study showed similar patterns of SSR distribution across the four camelid species, which suggests that the same pattern of microsatellites may apply to other camelids. These data provide a comprehensive view of SSR genomic distribution in the family Camelidae. Such an understanding of the characteristics of microsatellites in camelid genomes will serve many useful purposes, such as the development of camelid-specific genetic markers with broad applications, in particular for STR-based genotyping, paternity testing, and molecular breeding.