Genome features
Genome size ranged from 3962 bp (BM87) to 7369 bp (BM85), but most genomes were in the range of 5–5.5 kb. GC content, with an average of 42%, ranged between 34.69% (BM95) and 52.35% (BM81) and was much more diverse than genome size (Fig. 1a, Supplementary file 1). In essence, the Polyomaviridae genomes are mostly of similar size, but their composition in terms of GC% is much more variable. If SSR incidence had an equal chance across the whole genome, irrespective of composition, the same should be reflected in the motifs of the SSRs present. However, as discussed later, this is not the case: several species have mono-nucleotide motifs exclusively in the AT region.
The correlation of genome size and GC content with various SSR features was ascertained. SSR incidence was significantly correlated with genome size (r = 0.19, P < 0.05) and with GC content (r = 0.08, P < 0.05). Although relative density and relative abundance were not significantly correlated with genome size (r = 0.01, P > 0.05 and r = 0.005, P > 0.05, respectively), both were significantly correlated with GC content (r = 0.20, P < 0.05 and r = 0.23, P < 0.05, respectively).
Further, cSSR incidence was significantly correlated with genome size (r = 0.06, P < 0.05), but the corresponding relative density (r = 0.0038, P > 0.05) and relative abundance (r = 0.004, P > 0.05) showed no significant correlation. GC content was also significantly correlated with cSSR incidence (r = 0.06, P < 0.05), relative density (r = 0.11, P < 0.05), and relative abundance (r = 0.08, P < 0.05).
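As an illustration of how such coefficients can be obtained, the following minimal sketch computes a correlation coefficient between a genome feature and an SSR feature. It assumes the Pearson coefficient, which the text does not specify; the function name `pearson_r` and the input values are hypothetical and are not data from this study. The sketch does not compute the P value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical illustration values, NOT data from the study
genome_size_bp = [4000, 5000, 5200, 5400, 7000]
ssr_incidence = [20, 28, 27, 33, 50]
r = pearson_r(genome_size_bp, ssr_incidence)
print(f"r = {r:.2f}")
```

In practice, one value per genome (98 in total) would be supplied for each feature, and a statistics package would be used to obtain the significance level alongside r.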
Incidence of SSRs and cSSRs
A total of 3036 SSRs and 223 cSSRs were extracted from the 98 species of Polyomaviridae (Supplementary files 2–4). The average number of SSRs and cSSRs per genome varied from 23 and 1.3 (Gammapolyomavirus) to 33 and 2.9 (Betapolyomavirus), respectively. Their distribution across genera is summarized in Table 1.
Table 1 SSR and cSSR incidence across the different genera of Polyomaviridae
A maximum of 56 SSRs was present in BM85, whereas a minimum of 18 was present in BM80 and BM21. cSSR incidence ranged from 0 in seven species (BM99, BM82, BM76, BM59, BM24, BM21, BM14) to 7 in two species (BM85 and BM84) (Fig. 1a). Two interesting but contrasting observations can be made from these data. First, BM85 and BM84, each with 7 cSSRs, have 56 and 31 SSRs in genomes of 7369 and 4697 bp, respectively (Supplementary file 2). In essence, although a longer genome would be expected to harbor more SSRs, the eventual clustering of SSRs, reflected as cSSR incidence, remains the same. Thus, the SSR-rich regions of the genome are independent of genome size. Second, this observation is not the norm, as is evident from the cSSR range of zero to seven. Multiple Polyomaviridae genomes with varying numbers of SSRs have the same number of cSSRs. This is highlighted by 29 species having 2 cSSRs (Fig. 1a, Supplementary files 2–4), suggestive of a unique genome SSR signature.
To further highlight the regularity of this anomaly, we examined cSSR%, which is the percentage of SSRs present as cSSRs in a particular genome. Notably, cSSR% varies not only across genera but also within them, thereby ruling out genus-specific clustering of SSRs (Fig. 2a). These variations reflect specific yet variable localization and clustering of SSRs in each genome.
Relative abundance (RA) and relative density (RD) of SSRs and cSSRs
RA is the number of microsatellites present per kb of the genome, whereas RD is the sequence space composed of microsatellites per kb of the genome. Together, these values reflect the number of iterations of the SSRs present. If SSRs have a conserved tendency to be iterated, then higher incidence should correspond to elevated RD values. Moreover, a higher RA value should correspond to a higher RD value. BM65 had the highest RA and RD values for SSRs (9.32 and 80.4, respectively); that is, because more SSRs are present per kb of the genome, more of the genome is composed of SSRs. The corresponding lowest values for RA and RD were 3.39 (BM21) and 26.5 (BM80), respectively (Fig. 1b, Supplementary files 2–4).
Similarly, the cSSR relative abundance (cRA) and relative density (cRD) were studied. Because seven species had no cSSRs (Fig. 1a), the minimum cRA and cRD values were zero for these species. The highest values for cRA and cRD were 1.490 (BM84) and 33.93 (BM95), respectively (Fig. 1b, Supplementary files 2–4). That the highest cRA and cRD occur in different species may be due to the differential composition of the cSSRs.
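The definitions of RA and RD above can be sketched as follows. The function names are hypothetical, and RD is assumed here to be expressed as bp of repeat sequence per kb of genome. The counts and genome sizes for BM84 and BM85 are those reported in the text; the total repeat length passed to `relative_density` is an illustrative value only.

```python
def relative_abundance(n_repeats, genome_bp):
    """Number of repeats (SSRs or cSSRs) per kb of genome."""
    return n_repeats / (genome_bp / 1000)


def relative_density(total_repeat_bp, genome_bp):
    """Sequence space (bp) covered by repeats per kb of genome."""
    return total_repeat_bp / (genome_bp / 1000)


# Values taken from the text: BM84 has 7 cSSRs in 4697 bp,
# BM85 has 56 SSRs in 7369 bp.
cra_bm84 = relative_abundance(7, 4697)
ra_bm85 = relative_abundance(56, 7369)

# Hypothetical total repeat length (300 bp in a 5000 bp genome),
# NOT a value from the study.
rd_example = relative_density(300, 5000)

print(f"cRA (BM84) = {cra_bm84:.3f}")
print(f"RA (BM85) = {ra_bm85:.2f}")
print(f"RD (example) = {rd_example:.1f}")
```

The cRA obtained for BM84 under this definition agrees with the reported maximum of 1.490.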
dMAX and cSSR
cSSR incidence depends on the maximum distance (dMAX) allowed between two SSRs for them to be treated as one cSSR. Since cSSR incidence reflects clustering of SSRs, and IMEx allows dMAX values up to 50, we analyzed cSSR incidence in Polyomaviridae genomes by varying dMAX from an initial value of 10 to 20, 30, 40 and 50. Subsequently, the % increase was calculated using the following formula.
$$\%\,\text{increase} = \frac{\text{cSSR incidence at dMAX}_{n} - \text{cSSR incidence at dMAX}_{n-10}}{\text{cSSR incidence at dMAX}_{n-10}} \times 100$$
This % increase was then plotted. Although the maximum increase for most species is observed when dMAX is raised from 10 to 20, as evident from the predominant black bar, the increase does not conform to a pattern per se (Fig. 2b). This means that even in species of the same family, SSRs chart their own path in terms of localization in each genome.
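The dMAX analysis and the % increase formula can be sketched as below. This is a simplified stand-in and not the IMEx implementation: a cSSR is assumed to be any run of two or more SSRs in which each consecutive pair is separated by at most dMAX bp. The function names and the SSR coordinates are hypothetical. When the previous incidence is zero the % increase is undefined, and the sketch returns `None` in that case.

```python
def count_cssrs(ssrs, dmax):
    """Count clusters of >= 2 SSRs whose consecutive gaps are <= dmax.

    ssrs: list of (start, end) tuples, 1-based inclusive coordinates.
    """
    ordered = sorted(ssrs)
    clusters = 0
    members = 1  # SSRs in the cluster currently being built
    for (_, prev_end), (start, _) in zip(ordered, ordered[1:]):
        gap = start - prev_end - 1  # bases between the two SSRs
        if gap <= dmax:
            members += 1
        else:
            if members >= 2:
                clusters += 1
            members = 1
    if members >= 2:
        clusters += 1
    return clusters


def percent_increase(current, previous):
    """% increase from the previous dMAX step; None if previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100


# Hypothetical SSR coordinates, NOT data from the study
ssrs = [(100, 110), (118, 125), (150, 160), (400, 412), (440, 450)]
dmax_values = [10, 20, 30, 40, 50]
counts = {d: count_cssrs(ssrs, d) for d in dmax_values}
increases = {
    d: percent_increase(counts[d], counts[d - 10]) for d in dmax_values[1:]
}
print(counts)
print(increases)
```

Note that under this clustering rule a larger dMAX can also merge two neighbouring cSSRs into one, so incidence need not increase at every step.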
SSR motif types and their prevalence
First, the contribution of the different repeat motifs (mono- to hexa-nucleotide) to overall SSR incidence was ascertained. The data were analysed separately for each genus. Moreover, the analysis was done in percentages rather than absolute numbers to account for the variable number of species across genera. Note that data from species with unassigned genera were not included here. The contribution of mono-nucleotide repeat motifs ranged from 36% (Gammapolyomavirus) to 47% (Betapolyomavirus). Deltapolyomavirus had no incidence of penta- and hexa-nucleotide repeats, whereas Gammapolyomavirus lacked hexa-nucleotide repeats. This can be attributed to the smaller number of species in these genera. Gammapolyomavirus had the highest contribution from di-nucleotide repeats (39.42%) and was the only genus to have more di-nucleotide than mono-nucleotide repeats (Fig. 3a, Supplementary files 2–3).
We then examined the motif composition of mono- and di-nucleotide repeats for their prevalence across the different genera of Polyomaviridae. For the mono-nucleotides, in the overall data the most prevalent repeat motif is “T” (48.95%) followed by “A” (33.48%). “T” also remains the most prevalent mono-nucleotide motif for Alpha-, Beta- and Delta-polyomavirus (47, 52 and 71%, respectively). However, Gammapolyomavirus has its highest contribution from “C” (34.67%) followed by “T” (33.33%) (Fig. 3b, Supplementary files 2–3). Interestingly, the same Gammapolyomavirus has its highest di-nucleotide repeat motif contribution from “AT/TA” (29.27%), while Alphapolyomavirus has its largest contribution from “CT/TC” (29.37%). Overall, “AT/TA” was the most prevalent di-nucleotide repeat motif, closely followed by “CT/TC” (Fig. 3c; PV: polyomavirus).
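A minimal sketch of how motif percentages such as those above could be tallied is given below. It assumes that motifs differing only by cyclic rotation are grouped together (so that “TA” is counted with “AT” and “TC” with “CT”), as the “AT/TA” and “CT/TC” labels suggest; reverse complements are not merged. The function names and the input list are hypothetical.

```python
from collections import Counter


def canonical_motif(motif):
    """Alphabetically smallest cyclic rotation, e.g. 'TA' -> 'AT'."""
    rotations = [motif[i:] + motif[:i] for i in range(len(motif))]
    return min(rotations)


def motif_percentages(motifs):
    """Percentage contribution of each rotation-grouped motif."""
    counts = Counter(canonical_motif(m.upper()) for m in motifs)
    total = sum(counts.values())
    return {m: 100 * c / total for m, c in counts.items()}


# Hypothetical di-nucleotide motifs, NOT data from the study
dinucleotide_motifs = ["AT", "TA", "CT", "TC", "TA", "AG"]
pct = motif_percentages(dinucleotide_motifs)
for motif, value in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{motif}: {value:.2f}%")
```

Working in percentages rather than raw counts keeps genera with different numbers of species comparable, in line with the approach described for the motif-type analysis.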
SSRs in coding regions
Assessment of SSR distribution across the genome revealed that the non-coding region accounted for 679 SSRs (22.4%), whereas the coding region, comprising 32 proteins/putative genes/ORFs, housed 2357 SSRs (77.6%) (Supplementary file 2).
Subsequently, we analyzed SSR prevalence across the different genes of the studied genomes. Six genes accounted for over 92% of SSRs. Overall, the LTAg gene alone accounted for over 47% of total SSRs, with the VP1 gene a distant second at around 16% (Fig. 3d). Thereafter, we dissected the data across the different genera. Interestingly, although the LTAg gene housed the most SSRs in every genus, its contribution varied. In Betapolyomavirus, it accounted for about one in every two SSRs (49.54%), while in Gammapolyomavirus approximately one in every three SSRs was housed in the LTAg gene (35%). This difference extends to all the genes, albeit to a lesser extent (Fig. 3e, Supplementary files 2–3).
SSRs (mono-nucleotide) specificity and host range exclusivity
Compiling the contribution of different SSRs to overall incidence revealed an interesting observation. In eighteen species, one hundred percent of the mono-nucleotide SSRs were composed of A/T. Further, the majority of these viruses had humans or members of the ape family as their hosts. To elucidate a possible pattern and its significance, we arranged all the studied species in decreasing order of the A/T contribution to their mono-nucleotide SSRs (Fig. 4, Supplementary files 1–2). Notably, viruses with humans, apes, and related species as hosts have a much higher A/T mono-nucleotide SSR composition than those with birds and fishes as hosts (Fig. 4).
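The ranking described above can be sketched as follows: for each species, the percentage of mono-nucleotide SSRs whose motif is A or T is computed, and species are then sorted in decreasing order of that percentage. The species labels and motif lists are hypothetical placeholders, not data from the study.

```python
def at_mono_percent(mono_motifs):
    """Percentage of mono-nucleotide SSRs with motif A or T.

    Returns None when a genome has no mono-nucleotide SSRs.
    """
    if not mono_motifs:
        return None
    at_count = sum(1 for m in mono_motifs if m.upper() in ("A", "T"))
    return 100 * at_count / len(mono_motifs)


# Hypothetical species and motifs, NOT data from the study
species_motifs = {
    "species_1": ["A", "T", "T", "A"],
    "species_2": ["A", "C", "T", "G"],
    "species_3": ["C", "C", "T"],
}

scores = {name: at_mono_percent(m) for name, m in species_motifs.items()}
# Species without mono-nucleotide SSRs are left out of the ranking
ranked = sorted(
    ((name, s) for name, s in scores.items() if s is not None),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.1f}% A/T")
```

Host information could then be attached to each ranked entry to look for the host-related trend described in the text.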
Using representative species (9 each), we then investigated whether the A/T composition of SSRs and the hosts reflect a pattern. Dot plot analysis was performed for nine species with humans, apes and related species as hosts (Fig. 5a) and nine species with birds, fishes and other species as hosts (Fig. 5b). Interestingly, even though three species in Fig. 4 have 100% mono-nucleotide SSR contribution by A/T (same as Fig. 5a), the overall number of dots (reflective of repeat sequences) is higher for all the genomes of Fig. 5a, which represent humans and related species as hosts.
Phylogenetic tree of Polyomaviridae
Subsequently, we constructed the phylogenetic tree of the 98 Polyomaviridae genomes and observed that not all viruses have evolved together according to their hosts. However, hosts are reflected in the tree: the virus clusters with the same or related hosts at multiple places (Fig. 6). The fact that not all viruses with human or shared hosts follow this pattern indicates that factors other than hosts also play a role in genome evolution.
We then superimposed the data for percentage mono-nucleotide SSR contribution by the AT region, the phylogenetic analysis and the known hosts. For the sake of clarity, only the hosts of species with > 90% mono-nucleotide SSR contribution from the AT region are shown as illustrations here, though the complete information is provided in Fig. 4. We hypothesize that the presence of mono-nucleotide repeats in the AT region somehow contributes to viral host flexibility and interchangeability.