Introduction

Amino acids play a crucial role in the cellular metabolic activities of an organism. Amino acids are joined step by step to form proteins. In the standard genetic code, 61 sense codons encode the 20 standard amino acids. Methionine and tryptophan are the only two amino acids that are encoded by a single codon, while the other 18 amino acids are encoded by more than one codon (59 synonymous codons in total), thus making some codons seemingly redundant in a transcript. A bias in synonymous substitution of codons resulting in preferential usage of a specific codon within a codon family is termed codon usage bias (CUB), and it differs among genes, genomes, transcriptomes, and species [6, 33, 51]. CUB is frequently observed in highly expressed genes, while genes that are expressed at a low level usually have less CUB [25]. The pattern of synonymous codon usage allows the identification of relevant isoacceptor tRNAs for efficient translation of a particular gene. Thus, genomes with highly expressed genes will have more bias in codon preference, leading to the formation of proteins with lower susceptibility to misfolding [2]. Studies have shown that variation in synonymous substitutions occurs between and within genes [23, 44]. Various researchers have pointed out that the study of CUB provides important information about the evolution of related organisms [55, 58]. Because viruses are replicated and their genes are translated in living host cells, investigation of codon usage patterns of viral genomes can potentially provide information about the interaction and co-evolution of viruses with their hosts [54].

Several theories have been propounded for the origin of CUB, two of which are the selection‐mutation‐drift theory and the neutral theory. In the selection‐mutation‐drift theory, the major determinants of CUB are mutational pressure, natural selection, and genetic drift [10, 34]. In the neutral theory, mutations in the third position of a codon should be neutral, resulting in a random choice of codons [16]. In addition to mutation, selection pressure and genetic drift, several other factors influence CUB, including base constraints [8], base skewness [14], gene length [17], gene stability, translational selection [53], replication [27, 40], protein structure [72], and protein properties [16]. Analysis of CUB is useful for understanding the expression of viral genes [67]. It is also relevant for designing vaccines [20].

Anelloviridae is a family of ssDNA viruses with icosahedral symmetry [52] and a diameter of around 18–30 nm [43]. Members of this family have a single capsid protein and a genome size of 2000–4000 nucleotides [50], and they have a high degree of genetic variability [13]. The virus replicates inside the host cell, where a double-stranded DNA intermediate is formed by DNA polymerase in the S (synthesis) phase of the cell cycle [43]. The natural hosts of anellovirids include chimpanzees, tupaias, African monkeys, chickens, cattle, pigs, sheep, cats, dogs, and humans [68]. These viruses are extremely prevalent, with a comparatively static distribution worldwide and a high degree of genetic heterogeneity [63]. Major diseases associated with anellovirids include lupus, hepatitis, hematologic disorders, pulmonary diseases, and myopathy [49]. They can be transmitted by sexual contact or blood transfusion and possibly by the fecal-oral route [1, 7]. No vaccines or drugs have been developed against these viruses, and further research in this area is still needed.

It has been reported that CUB is a major driving force in the evolution of small DNA viruses and astroviruses [54, 74]. Preliminary analysis of flaviviruses revealed that base constraints and codon usage were related to those of their hosts [9]. Karlin et al. showed that the pattern of codon usage of Epstein-Barr virus plays a major role in influencing latent infection, leading to productive infection [36]. A codon usage pattern analysis of 31 Newcastle disease virus isolates showed that codon usage was associated with gene functions and geographic location but not with host specificity [76]. The GC constraint has been found to be the major determinant of codon usage variation in members of the family Parvoviridae [61]. Xu et al. showed that the codon usage pattern in New World begomoviruses is apparently different from that in Old World begomoviruses, supporting the notion that the New World bipartite begomoviruses might have developed from the Old World begomoviruses [80].

In the current study, we report base constraints, codon usage pattern, protein properties, and influencing factors in the genomes of members of the family Anelloviridae in order to identify their genetic characteristics and discern the role of mutation and natural selection in CUB in genes. We also report the preferred codon for each amino acid and the overrepresented and underrepresented codons for the family as a whole to facilitate genetic engineering for the design of effective vaccines and other therapeutics. Analysis of codon usage patterns in viral genomes might elucidate adaptive traits, the role of evolutionary forces, viral interaction strategies, and host adaptation.

Methodology

Data access

The coding sequences (cds) of 62 genomes of members of the family Anelloviridae, with their accession numbers, were retrieved from the National Center for Biotechnology Information nucleotide database (www.ncbi.nlm.nih.gov). In our analysis, only sequences whose length was an exact multiple of three nucleobases and that had proper start and stop codons were used. A list of accession numbers and genome names of Anelloviridae family members is shown in Supplementary Table S1.
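The inclusion criteria above can be sketched as a small filter. This is an illustrative sketch only, not the authors' pipeline: the function name `is_valid_cds` is hypothetical, and treating ATG as the only proper start codon and TAA/TAG/TGA as the stop codons are assumptions based on the standard genetic code.

```python
# Stop codons of the standard genetic code (assumption: the standard code applies).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq: str, start_codons=("ATG",)) -> bool:
    """Return True if seq satisfies the three stated inclusion criteria.

    Criteria: length is an exact multiple of three, the sequence begins with
    a start codon (assumed here to be ATG) and ends with a stop codon.
    """
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    if seq[:3] not in start_codons:
        return False
    return seq[-3:] in STOP_CODONS
```

Sequences that fail any of the checks would simply be left out of the downstream analyses.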

Base constraint analysis

The base content for the entire family Anelloviridae was analysed using a Perl programme written by the corresponding author (SC) to identify constraints on base composition across the family. The compositional properties of coding sequences considered in our analysis were as follows: (a) overall base content (A, C, G, T%), (b) base content at the third position of the codon (A3, C3, G3, T3%) and (c) GC content (GC, GC1, GC2, GC3, GC12%). In addition, nucleotide skew values, i.e., AT skew [(A-T)/(A+T)] and GC skew [(G-C)/(G+C)], were computed. A positive GC or AT skew indicates higher usage of G over C or A over T, respectively. The imbalanced usage of two bases can be seen from the skew values across the transcript [3]. Similarly, purine, pyrimidine, purine-pyrimidine, keto, and amino skew values were also determined to evaluate their impact on CUB.
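As a minimal illustration of these compositional measures (not the Perl programme used in the study), the following Python sketch computes the base percentages, position-specific GC content, and the AT and GC skews for one coding sequence. The function names are hypothetical, and taking GC12 as the mean of GC1 and GC2 is an assumption.

```python
def skew(x: float, y: float) -> float:
    """Return (x - y) / (x + y), or 0.0 when both counts are zero."""
    total = x + y
    return (x - y) / total if total else 0.0


def gc_percent(s: str) -> float:
    """Percentage of G and C in the string s (0.0 for an empty string)."""
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0


def composition(cds: str) -> dict:
    """Base content, third-position content, GC content and skews of a cds."""
    cds = cds.upper()
    # Split the sequence into first, second and third codon positions.
    first, second, third = cds[0::3], cds[1::3], cds[2::3]
    result = {b: 100.0 * cds.count(b) / len(cds) for b in "ACGT"}
    result.update({b + "3": 100.0 * third.count(b) / len(third) for b in "ACGT"})
    gc1, gc2 = gc_percent(first), gc_percent(second)
    result.update(
        GC=gc_percent(cds),
        GC1=gc1,
        GC2=gc2,
        GC3=gc_percent(third),
        GC12=(gc1 + gc2) / 2,  # assumption: GC12 is the mean of GC1 and GC2
        AT_skew=skew(cds.count("A"), cds.count("T")),
        GC_skew=skew(cds.count("G"), cds.count("C")),
    )
    return result
```

The other skews mentioned above (purine, pyrimidine, keto, amino) could be obtained by calling `skew` with the corresponding pairs of counts, once their exact definitions are fixed.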

Relative synonymous codon usage (RSCU)

RSCU is a CUB index evaluated as the ratio of the observed frequency of a codon to its expected frequency out of all synonymous codons for a particular amino acid, multiplied by the degeneracy level. An RSCU value of 1 indicates equal usage of all synonymous codons, while a value greater than 1 indicates that a particular codon is favored. A codon with an RSCU value greater than 1.6 is considered an overrepresented codon, and one with an RSCU value less than 0.6 is considered an underrepresented codon.

The RSCU value for each codon was computed using the following formula [56]:

$$RSCU_{ij} = \frac{X_{ij}}{\frac{1}{n_{i}}\sum\limits_{j = 1}^{n_{i}} X_{ij}}$$

where X_{ij} indicates the frequency of the jth codon for the ith amino acid and n_{i} is the number of codons for the ith amino acid (ith codon family).
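The RSCU formula can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the software used in the study: the standard genetic code is assumed, and the names `CODON_TABLE`, `FAMILIES`, `rscu` and `classify` are hypothetical. The 1.6 and 0.6 cut-offs are the ones given in the text.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (assumption: the standard code applies).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}

# Synonymous codon families: amino acids with more than one codon (59 codons).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)
FAMILIES = {aa: cs for aa, cs in FAMILIES.items() if len(cs) > 1}


def rscu(cds: str) -> dict:
    """RSCU of every synonymous codon whose amino acid occurs in the cds."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    values = {}
    for codons in FAMILIES.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU is undefined for this family
        for c in codons:
            # X_ij divided by the family mean, i.e. X_ij * n_i / sum_j X_ij
            values[c] = counts[c] * len(codons) / total
    return values


def classify(value: float) -> str:
    """Label an RSCU value with the thresholds used in the text."""
    if value > 1.6:
        return "overrepresented"
    if value < 0.6:
        return "underrepresented"
    return "neither"
```

For a sequence in which alanine is encoded three times by GCT and once by GCC, the family mean is 1, so GCT receives an RSCU of 3 and GCC an RSCU of 1, while the unused codons GCA and GCG receive 0.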

Effective number of codons

The effective number of codons (ENC) measures the extent of bias in codon usage in a particular gene (cds), independent of the gene length and the protein encoded by the gene. The ENC value of a cds varies from 20 to 61 in the standard genetic code. If only one codon is used for each individual amino acid, the ENC value of that cds is 20, whereas equal usage of all synonymous codons leads to a higher ENC value, up to 61. The bias in codon usage is inversely related to the ENC value. An ENC value greater than 35 indicates lower CUB, while a value less than 35 indicates higher CUB [79]. The ENC value of a cds was computed using the formula [79]:

$$ENC = 2 + \frac{9}{{F_{2} }} + \frac{1}{{F_{3} }} + \frac{5}{{F_{4} }} + \frac{3}{{F_{6} }}$$

where Fa (a = 2, 3, 4 or 6) is the average of the Fa values of the amino acids with a-fold degeneracy.
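A hedged Python sketch of the ENC calculation is given below. The text does not state how the individual F values are obtained; the sketch assumes the commonly used homozygosity estimator F = (n Σp² − 1)/(n − 1) for an amino acid observed n times with codon frequencies p. Skipping amino acids with fewer than two occurrences or non-positive F, substituting the mean of F2 and F4 when isoleucine is absent, and capping the result at 61 are further assumptions of this sketch, and all names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (assumption: the standard code applies).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES.setdefault(aa, []).append(codon)
# Keep only families with synonymous codons (degeneracy 2, 3, 4 or 6).
FAMILIES = {aa: cs for aa, cs in FAMILIES.items() if len(cs) > 1}


def enc(cds: str) -> float:
    """Effective number of codons of a cds, following the formula in the text."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    by_class = {2: [], 3: [], 4: [], 6: []}
    for codons in FAMILIES.values():
        n = sum(counts[c] for c in codons)
        if n < 2:
            continue  # F cannot be estimated from fewer than two codons
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            by_class[len(codons)].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in by_class.items() if v}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # assumed fallback for missing Ile
    if len(mean_f) < 4:
        raise ValueError("a degeneracy class is missing; ENC is undefined here")
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # assumed cap at the number of sense codons
```

With this estimator, a sequence that uses a single codon for every amino acid gives F = 1 in each class and therefore ENC = 2 + 9 + 1 + 5 + 3 = 20, the lower bound quoted in the text.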

Multivariate statistical analysis

Correspondence analysis (COA) acts as a multivariate statistical platform for visualization of the major trends in CUB studies of nucleotide sequences. A COA plot is implemented to understand the variation in codon usage, with RSCU values of 59 synonymous codons across two axes F1 (axis 1) and F2 (axis 2) [62]. We used “Past” software to plot the COA graph.

Parity plot (PR2) analysis

A PR2 plot is used to understand the role of evolutionary forces affecting the CUB. A PR2 plot was made for 2-fold, 4-fold and 6-fold degenerate codon families, with G3/(G3+C3) along the x-axis and A3/(A3+T3) along the y-axis for the third position of codons. The midpoint of the plot is 0.5, where no bias exists between complementary nucleotides, i.e., where G = C and A = T [66].
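The coordinates of one point in such a plot can be sketched as follows. This is a simplified illustration with a hypothetical function name: it uses the third position of every codon in the sequence passed in, so restricting the input to the codons of a given degeneracy class is left to the caller.

```python
def pr2_point(cds: str):
    """Return (G3/(G3+C3), A3/(A3+T3)) for the third codon positions of cds.

    A coordinate is None when its denominator is zero.
    """
    third = cds.upper()[2::3]
    g3, c3 = third.count("G"), third.count("C")
    a3, t3 = third.count("A"), third.count("T")
    x = g3 / (g3 + c3) if (g3 + c3) else None
    y = a3 / (a3 + t3) if (a3 + t3) else None
    return x, y
```

A point at (0.5, 0.5) corresponds to G3 = C3 and A3 = T3; displacement from that midpoint along either axis reflects unequal usage of the corresponding pair of bases.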

Neutrality plot analysis

A neutrality plot is used to quantify the magnitude of evolutionary forces, i.e., the role of mutation and natural selection in the determination of CUB (Zhang et al. [82]). It is constructed with GC12 along the y-axis and GC3 along the x-axis. Each dot in the plot represents an independent genome. If the dots are diagonally distributed with a regression coefficient approaching 1, this suggests a more important role of mutation [64], whereas a scattered distribution of dots, with the regression coefficient approaching zero, suggests a significant role of natural selection.
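The regression coefficient of a neutrality plot can be obtained with an ordinary least-squares fit of GC12 on GC3. The sketch below is a plain-Python illustration with a hypothetical function name; it is not the software used in the study.

```python
def neutrality_regression(gc3, gc12):
    """Least-squares slope and intercept of GC12 (y) regressed on GC3 (x)."""
    if len(gc3) != len(gc12) or len(gc3) < 2:
        raise ValueError("need two equally long lists with at least two points")
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    # Sum of squared deviations of x and sum of cross-products of x and y.
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    if sxx == 0:
        raise ValueError("GC3 values are all identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Under the interpretation described above, a slope close to 1 would point towards mutation as the dominant force, and a slope close to 0 towards natural selection.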

Protein properties

Biochemical properties of proteins, i.e., aromaticity and hydropathicity, have been reported to be associated with CUB [16]. The general average hydropathicity score (GRAVY) is the average of the hydropathicity scores of the amino acids in the coding sequence of a gene, and its value ranges from -2 to +2, with a negative value indicating that the protein is hydrophilic in character and vice versa [31]. The aromaticity of a protein indicates the distribution of aromatic amino acids (tryptophan, tyrosine, and phenylalanine) in the protein [42].
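Both protein indices can be sketched in a few lines of Python. The hydropathicity values below are those of the Kyte-Doolittle scale, which is assumed here to be the scale underlying the GRAVY score; the function names are hypothetical, and residues outside the 20 standard amino acids are ignored as a simplification.

```python
# Kyte-Doolittle hydropathicity values (assumed scale for the GRAVY score).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Mean hydropathicity of the standard residues in a protein sequence."""
    residues = [aa for aa in protein.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return sum(HYDROPATHY[aa] for aa in residues) / len(residues)


def aromaticity(protein: str) -> float:
    """Fraction of aromatic residues (F, W, Y) among the standard residues."""
    residues = [aa for aa in protein.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    return sum(aa in "FWY" for aa in residues) / len(residues)
```

The protein sequence would be obtained by translating each cds; a negative `gravy` value then indicates an overall hydrophilic protein, as described above.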

Mutation responsive index (MRI)

The MRI value is used to estimate the rate of mutational drift in a cds. A positive MRI value suggests the impact of directional mutation, while the opposite indicates the influence of translational selection in a gene [21, 22].

Translational selection (P2)

The P2 value estimates the rate of codon-anticodon interactions and suggests the impact of translational efficiency in a gene. A P2 value greater than 0.5 indicates a bias favouring translational selection [22]. The P2 value of a gene is calculated using the following formula [22]:

$$P2=\frac{(WWC+SSU)}{(WWY+SSY)}$$

where W = A or T, S = G or C, and Y = C or T; U in the formula corresponds to T in the DNA coding sequence.
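The P2 formula amounts to counting codons by the class of their first two bases and by their third base. The following Python sketch shows one way to do this on a DNA coding sequence (so SSU is counted as SST); the function name is hypothetical.

```python
def p2(cds: str):
    """P2 = (WWC + SSU) / (WWY + SSY) for a DNA cds; None if undefined."""
    cds = cds.upper()
    weak, strong, pyrimidine = set("AT"), set("GC"), set("CT")
    wwc = sst = wwy = ssy = 0
    for i in range(0, len(cds) - 2, 3):
        first, second, third = cds[i:i + 3]
        if third not in pyrimidine:
            continue  # only codons ending in C or T enter the formula
        if first in weak and second in weak:
            wwy += 1
            wwc += third == "C"
        elif first in strong and second in strong:
            ssy += 1
            sst += third == "T"  # T in DNA stands for U in the formula
    denominator = wwy + ssy
    return (wwc + sst) / denominator if denominator else None
```

For example, in a sequence with the codons AAC, GGT and TTT, two of the three pyrimidine-ending codons (AAC and GGT) fall in the numerator, giving P2 = 2/3.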

Statistical analysis

The correlation coefficients for various parameters, namely, effective number of codons, base content, skew values, protein properties, mRNA free energy, and mRNA stability index, were computed using the statistical software SPSS 16.0 for Windows. A heat map was constructed for comparing GC3 values and codon usage values, using the XLSTAT software [38].

Results

Codon usage bias

To determine the impact of bias on codon usage in the genomes of members of the family Anelloviridae, the effective number of codons (ENC) was computed for each genome (Table 1). The ENC values ranged from 36.65 to 57.40, with a mean ENC value of 51.93, indicating a low CUB [11]. However, a high degree of variation in codon usage was found among the different genomes. Analysis of RSCU values of 59 sense codons revealed that 27 of them were used frequently in the cds. Thus, more than one codon was preferred for each amino acid, supporting our finding of a low CUB in the genes.

Table 1 Average ENC value of the genes of members of the family Anelloviridae

Base composition

The nucleotide base content can affect the CUB of a genome [34]. Thus, we computed the overall base composition of the genomes of various members of the family Anelloviridae. The mean percentage of A (31.78) was found to be the highest, followed by C (27.11) and G (22.06), while T (19.05) had the lowest frequency (Fig. 1). This indicated that the nucleobases A and C occurred more frequently than the bases G and T in the coding sequence. At the third codon position, the percentage of A was the highest (32.87), followed by C (29.39), T (21.54), and G (16.19), suggesting that A- and C-ended codons might have been preferred to G- and T-ended codons. The mean GC content of the genomes was 49.21%, with an almost equal GC and AT content across the genomes. The overall GC content is an important factor contributing to bias in codon usage across genomes [75]. Here, analysis of the GC content at the first, second, and third codon positions revealed that GC1 (54.67%) was higher than GC2 (47.28%) and GC3 (45.67%) (Fig. 1). Correlation analysis of ENC value and base content showed a significant correlation of ENC with A%, G%, A3%, G3%, GC%, GC1%, GC2%, GC3% and GC12% at p < 0.01 (Table 2), suggesting that these bases might be responsible for the CUB of genes in members of the family Anelloviridae.

Fig. 1

(a) Overall base content and base content at the third codon position of genes of members of the family Anelloviridae, (b) Overall GC content and GC content at the first, second, third and first and second codon positions

Table 2 Correlation between ENC and base content of genes of members of the family Anelloviridae

Codon usage pattern

To understand the pattern of codon usage in each synonymous codon family, we performed relative synonymous codon usage (RSCU) analysis of 59 sense codons, as shown in Fig. 2. The pattern of codon usage differed from genome to genome (Supplementary Table S2). Further, we obtained the mean RSCU values of codons for the 62 genomes of Anelloviridae members and categorized them into four groups: overrepresented codons (RSCU value > 1.6), underrepresented codons (RSCU value < 0.6), more frequently used codons (RSCU value > 1), and less frequently used codons (RSCU value < 1). The codon AGA was overrepresented in all of the genomes, while the codons TCG, TTG, CGG, CGT, ACG, GCG and GAT were underrepresented in all of the genomes. The preferred codons for amino acids are reported in Supplementary Table S3. Of the 27 frequently used codons, 13 were A-ended, 10 were C-ended, three were T-ended, and one was G-ended. Thus, A- and C-ended codons were more likely to be abundant in coding sequences. Of the 32 less frequently used codons, 13 were T-ended, 11 were G-ended, six were C-ended, and one was A-ended. Our analysis of base compositional properties and codon usage patterns across the genomes suggested that compositional features under mutation pressure contributed to the observed codon usage pattern [4].

Fig. 2

(a) Overall RSCU values of codons in genomes of members of the family Anelloviridae. Red indicates underrepresented codons, (b) more frequently used codons, and (c) less frequently used codons

We generated a heat map by correlating the codon usage values with GC3 and found that, except for the codons TTG and AAC, all of the G- and C-ended codons were positively correlated with GC3, while, except for CGA, AGT, GTT, GGT, GCT and CGT, all of the A- and T-ended codons were negatively correlated with GC3 (Fig. 3). These findings confirm that codon usage variation is subject to GC constraints, which is important for understanding the molecular architecture of the genomes of members of the family Anelloviridae [30].

Fig. 3

Heat map with codon usage values for members of the family Anelloviridae. Green indicates negative correlation, red indicates positive correlation, and black indicates no correlation of GC3 with A-, T-, G- and C-ended codons

Relationship between codon usage patterns in the genomes of members of the family Anelloviridae and those of their hosts

Viruses, being obligate parasites, can be sustained only within living hosts [86]. Here, the RSCU values of different members of the family Anelloviridae were compared with those of a few of their host organisms (humans, African monkeys, chickens, sheep, pigs, and dogs). It was found that they had several of their most frequently used codons (RSCU value > 1) and less frequently used codons (RSCU value < 1) in common (Supplementary Table S4), suggesting that these viruses were adapted to their host. Similar associations between viruses and their hosts in their pattern of codon usage have also been reported for chikungunya virus [11], poliovirus [47], and coronaviruses [78].

Role of mutational pressure

Alterations in the base sequence of the genome are usually related to mutational pressure, which is an important factor in determining CUB. The correlation between the overall base content and the base content at the third codon position was investigated to determine whether mutational pressure alone was responsible for the observed CUB. The results of Pearson’s correlation analysis, shown in Table 3, indicate a highly significant positive correlation for A-A3%, T-A3%, T-T3%, G-G3%, C-G3%, GC-G3%, G-C3%, C-C3%, GC-C3%, G-GC3%, C-GC3% and GC-GC3% at p < 0.01, indicating that these bases were proportionally related to each other, while a highly significant negative correlation was observed for G-A3%, C-A3%, GC-A3%, C-T3%, GC-T3%, A-G3%, T-G3%, A-C3%, T-C3%, A-GC3% and T-GC3% at p < 0.01, i.e., these bases were inversely related to each other in their abundance. Our results suggest that mutational pressure and natural selection influenced the CUB of members of the family Anelloviridae, in agreement with previous observations [2].

Table 3 Interrelationships of overall base composition with the base composition at third codon positions

Using regression analysis of A-A3%, G-G3%, T-T3%, and C-C3%, as shown in Fig. 4, we investigated the extent to which each base was affected by mutational pressure. G-G3% and A-A3% were found to be more strongly affected by mutation than C-C3% and T-T3%, in agreement with an earlier study [72].

Fig. 4

Regression analysis comparing overall base content and base content at the third codon position. The regression coefficient value of the four plots indicates a greater contribution of mutation in base G, G3% and A, A3% over C, C3% and T, T3%

Trends of codon usage bias

The trends of variation in the codon usage pattern were analyzed using correspondence analysis (COA), a multivariate technique. COA was performed using the RSCU values of 59 sense codons. In the COA plot, all genomes were marked in blue and were scattered across the rectangular plot, while G/C- and A/T-ended codons were represented as green and red dots, respectively, with a few overlapping along the x-axis (Fig. 5). The graph revealed that G/C- and A/T-ended codons were separated along the axes, and the differences in codon distribution in the plot were mainly due to variation in the frequency of G/C- and A/T-ended codons. Here, some codons were found very close to the axes, suggesting that mutational pressure might have governed the CUB.

Fig. 5

Correspondence analysis of genomes of members of the family Anelloviridae. Green indicates GC-ended codons, red indicates AT-ended codons, and blue indicates each genome. The trends of variations of the codons are presented in the plot

Further, we used the UPGMA algorithm to determine the Euclidean similarity index in Past 3 software by performing cluster analysis [28]. The results revealed two major clusters, as shown in Fig. 6. One cluster included seven genomes, and another one included 55 genomes, revealing intraspecific and interspecific relationships between them.

Fig. 6

Cluster analysis of genomes of members of the family Anelloviridae

PR-2 bias plot analysis

PR-2 bias plots were generated to assess the role of mutational pressure and natural selection in determining the CUB. If mutational pressure is the sole factor determining CUB, the GC and AT content will be proportional along the axes, while a deviation from proportional distribution might be due to the combined effect of mutational pressure and natural selection [65]. We constructed PR-2 bias plots for the 2-fold, 4-fold and 6-fold degenerate codon families with G3/(G3+C3) and A3/(A3+T3) on the x- and y-axis, respectively (Fig. 7), and found unequal distribution of AT and GC bases, suggesting that both mutation and natural selection might have shaped the CUB of these genomes.

Fig. 7

Parity rule 2 bias plot of genomes of members of the family Anelloviridae. A non-uniform distribution of bases suggests that both mutation pressure and natural selection might have influenced their CUB

Neutrality plot analysis

A neutrality plot comparing GC12 (y-axis) and GC3 (x-axis) was made to determine the effect of mutation pressure and natural selection on compositional bias (Fig. 8). A highly significant correlation was observed between GC12 and GC3 (r = 0.904** at p < 0.01), suggesting that directional mutation pressure acted on all codon positions. The points were diagonally distributed with a wide range of GC3 distribution, indicating that mutational pressure influenced the CUB. Moreover, the slope of the regression line of GC12 vs. GC3 was 0.586. These results suggest a major effect of mutational pressure (58.6%) and a minor effect of natural selection (41.4%) in determining the CUB of the genes.

Fig. 8

Neutrality plot of genomes of members of the family Anelloviridae. The regression coefficient value suggests that mutation contributed 58.6% and natural selection contributed 41.4% to the CUB

Role of protein properties

Several studies have revealed that aromatic and hydrophilic properties of proteins are related to the bias in codon usage [69]. We performed Pearson’s correlation analysis of ENC with GRAVY, hydrophilicity and aromaticity and found a highly significant positive correlation between ENC and GRAVY (0.333**) at p < 0.01 (Table 4), indicating a proportional relationship between them. A highly significant negative correlation was observed between ENC and hydrophilicity (-0.349**) at p < 0.01, indicating an inverse relationship. These results suggest that the extent of CUB was associated with the properties of the proteins, namely, GRAVY and hydrophilicity.

Table 4 Correlation between ENC and protein properties of genes of members of the family Anelloviridae

Role of skewness

Previously, it was shown that skewness in the base composition of coding sequences is reflected in their transcriptional products [5]. Here, the overall GC skew value for all members of the family Anelloviridae was -0.11, indicating that the base C was more abundant than G, and the mean AT skew value was 0.24, indicating that the base A was more abundant than T. To further examine the role of skewness in CUB, correlation analysis was performed between the ENC value and the AT skew, GC skew, keto skew, amino skew, purine skew, pyrimidine skew and purine-pyrimidine skew (Table 5). The results showed a highly significant positive correlation of the GC skew with CUB at p < 0.01, i.e., the GC skew had a proportional relationship to CUB, while the AT skew, keto skew, amino skew, purine skew, and purine-pyrimidine skew had a highly significant negative correlation with CUB at p < 0.01, indicating an inverse relationship. These results together suggest that skewness of bases influenced the CUB of the genes.

Table 5 Correlation between ENC and skew properties of genes of members of the family Anelloviridae

Mutational responsive index (MRI) and translational selection (P2)

The mutational responsive index is a specific criterion to enumerate the impact of directional mutational pressure and translational selection across the genome. Here, the mean MRI value for 62 members of the family Anelloviridae was 0.50. This positive MRI value suggests a role of directional mutation across the genomes [21]. The mean P2 value was 0.16, i.e., less than 0.5, indicating a low impact of translational selection, consistent with previous observations [14].

Discussion

The choice of a preferred codon out of each synonymous codon family for an amino acid results in a phenomenon called codon usage bias (CUB) [46]. CUB is important for exogenous gene expression, which might require optimization of codons [45]. The degree of CUB varies from species to species [51] and is associated with mutational pressure, genetic drift, and natural selection [29]. Other factors contributing to CUB include base content, base skewness, gene expression level, gene length, protein structure, and the aromaticity and hydrophilicity of the protein. In the present study, we investigated the extent and the variation in the codon usage patterns of 62 genomes of members of the family Anelloviridae. We also examined the dynamics of base content and identified major factors influencing CUB. This study provides insights into the genetics and evolutionary relationships of organisms, and identification of overrepresented and underrepresented codons may assist in efforts to alter gene expression levels through codon optimization.

The effective number of codons reflects the extent of CUB in the coding sequence of a particular gene [73]. A higher CUB is associated with a lower ENC value and vice versa [14]. The ENC values of 62 genomes of members of the family Anelloviridae ranged from 36.65 to 57.40, with an average ENC value of 51.93, indicating low CUB. The lower CUB suggests efficient usage of multiple codons for protein production [34]. Zhou et al. reported that the mean ENC value for H5N1 influenza virus was 50.91, ranging from 43.11 to 55.21, indicating low CUB [86].

The base content of a gene significantly influences the pattern of codon usage across the gene [14]. Our analysis of compositional properties indicated that the relative usage of bases was A > C > G > T. At the third codon position, A was more frequent than C and T was more frequent than G, with a preferred usage of A- and C-ended codons. The overall AT and GC content was almost equal. A highly significant correlation was observed between ENC and base content values; hence, the extent of CUB was dependent on compositional properties of the genome. Torque teno sus virus 1 had 32.72% A, 25.70% G, 22.39% C, and 19.20% U(T). At the third codon position it had 37.45% A, 32.63% C, 30.82% G, and 25.90% U(T), with an increased preference for A-ended codons [Zhang Zhicheng et al. 2013]. The base composition of 17 human cytomegalovirus strains was 21.46% A, 21.07% T, 28.99% G, and 28.49% C, with G and C almost equal and A and T almost equal [32]. Tsai et al. reported that the GC content of 12 iridovirus genomes ranged from 27 to 55%, with distantly related viruses showing different patterns of synonymous CUB [70].

The RSCU value indicates the pattern of codon usage across the genome. We therefore determined the RSCU value for each member of the family Anelloviridae and found that 27 out of 59 codons were more frequently used, with preferred usage of A- and C-ended codons. The AGA codon was overrepresented, and the TCG, TTG, CGG, CGT, ACG, GCG and GAT codons were underrepresented in all of the genomes. Zhang et al. reported 18 frequently used codons with preferential usage of A- and C-ended codons and nine underrepresented codons in torque teno sus virus 1 [82]. RSCU analysis of Newcastle disease virus revealed eight underrepresented codons (CGC, CGA, CGT, CGG, CCG, ACG, TCG, and GCG) that were markedly suppressed [76]. A heat map comparing codon usage values and GC3 content showed that the usage of C- and G-ended codons was positively correlated with GC constraints, while the usage of T- and A-ended codons was negatively correlated with GC constraints, with a few exceptions. Similarly, a positive correlation of G- and C-ended codons with GC3 and a negative correlation of A- and T-ended codons with GC3 content has also been found in yeast, bacteria, humans [48], birds, and mammals [71]. This suggests that GC constraints are positively related to the CUB of genes.

Correlation analysis between the overall base content and the base content at the third codon position showed a highly significant correlation (p < 0.01), suggesting that mutational pressure might have influenced the CUB together with other environmental determinants. The bias in favour of codons with a higher content of one base over the other three also revealed the predominant role of mutational pressure [35, 60, 83, 85]. It has been reported that mutational pressure leads to alterations in biochemical mechanisms, with recurrent changes in certain bases thus contributing to the CUB [19, 24]. In torque teno sus virus 1, a significant negative correlation of A3% with G% and GC% (p < 0.01), and G3% and C3% with G% (p < 0.01) has been reported, indicating that base constraints and mutational pressure together determined the CUB [82].

A COA plot indicated that mutational pressure and natural selection together contributed to the CUB, consistent with a previous report [3]. A COA plot of 30 strains of hepatitis A virus revealed a major role of base content in CUB determination [15]. Fu performed COA analysis of herpesviruses and reported that genes with lower GC content were distributed along the right side of the plot, while genes with higher GC content were at the left side of the plot, supporting the notion that base compositional properties highly influenced synonymous codon changes [20].

To investigate the evolutionary relatedness of anellovirid genomes, a cluster analysis was performed that revealed two major clusters, one with seven genomes and another with 55 genomes, suggesting intraspecific and interspecific relationships. Cluster analysis of 11 isolates of human bocavirus performed based on RSCU values showed that genes with similar functions, even if they were from different isolates, grouped in the same lineage, irrespective of geographical location [84]. Wong et al. performed cluster analysis of influenza A viruses and reported that three genes, NA, HA and PB1, of human H2N2 influenza viruses of avian origin were also found in a cluster of avian virus, while avian-origin PB1 and HA genes of human H3N2 influenza viruses were extended from the cluster and a few of the H1 viral genes of human (PA) were also represented in the human H3 cluster [77].

PR-2 bias plot analysis in this study showed a disproportional distribution of AT and GC content. This pattern of base arrangement across the graph suggested that natural selection and mutational pressure both contributed to the CUB [72]. Parity plot analysis of the PB2 gene of influenza A H7N9 virus revealed no constraints in mutation and natural selection between the two complementary strands of a DNA duplex [26].

Neutrality plot analysis revealed that mutation pressure was more important than selection pressure in members of the family Anelloviridae. The higher mutational bias might be due to chemical decay of nucleotides, non-uniform DNA repair, and non-random replication errors [37]. Mutations usually occur spontaneously without any external driving force [73]. Neutrality plot analysis showed a significant correlation between GC3 and GC12 (0.904) at p < 0.01, suggesting that directional mutation pressure acted at all codon positions [20]. It also suggested that mutational pressure played a more important role than natural selection. Neutrality analysis of the PB2 gene of influenza A H7N9 virus revealed a greater role of selection pressure than mutational bias [26].

A significant correlation was found between the ENC value and biochemical properties of the protein, with the extent of CUB being associated with the GRAVY and hydrophilicity scores of the proteins across the genomes. Zhao et al. reported that hydrophilicity scores were critical factors in the codon usage of 11 isolates of bocavirus [84]. Liu et al. reported a highly significant positive correlation of GRAVY with codon usage variation and a highly significant negative correlation of aromaticity with codon usage variation in porcine circovirus [41].

Analysis of base skewness revealed higher usage of C over G and A over T. Skew values of genomes also correlated significantly with ENC values. AT skew, GC skew, amino skew, keto skew, purine skew, pyrimidine skew, and purine-pyrimidine skew were found to affect CUB. Transcription has been reported to have a likely relationship to base skew properties [12]. A significant correlation of base skewness with CUB has also been found in Nipah virus genes [14].

The MRI index and P2 values suggested a directional role of mutation over translational selection in members of the family Anelloviridae in the present study. Zhang et al. reported that mutational pressure arising from compositional constraint along with translational selection was responsible for CUB in torque teno sus virus 1 [82]. Similarly, in various other organisms, mutation and translational selection were identified as important determinants of variations in codon usage [35, 39, 57, 59].

Conclusion

We determined the base composition and codon usage pattern of the genes of 62 members of the Anelloviridae family of DNA viruses. The overall bias in the pattern of codon usage was low, with a high level of variation in synonymous codon usage in the viral genes. The base A was used more than C and G, and T was the least frequent. A- and C-ended codons were used preferentially. One codon (AGA) was overrepresented, while seven codons (TCG, TTG, CGG, CGT, ACG, GCG and GAT) were underrepresented in all of the genomes. Mutational pressure had a greater role than natural selection in determining the codon usage patterns of members of this family.