CpG-creating mutations are costly in many human viruses

Caudill, Victoria R.; Qin, Sarina; Winstead, Ryan; Kaur, Jasmeen; Tisthammer, Kaho; Pineda, E. Geo; Solis, Caroline; Cobey, Sarah; Bedford, Trevor; Carja, Oana; Eggo, Rosalind M.; Koelle, Katia; Lythgoe, Katrina; Regoes, Roland; Roy, Scott; Allen, Nicole; Aviles, Milo; Baker, Brittany A.; Bauer, William; Bermudez, Shannel; Carlson, Corey; Castellanos, Edgar; Catalan, Francisca L.; Chemel, Angeline Katia; Elliot, Jacob; Evans, Dwayne; Fiutek, Natalie; Fryer, Emily; Goodfellow, Samuel Melvin; Hecht, Mordecai; Hopp, Kellen; Hopson, E. Deshawn; Jaberi, Amirhossein; Kinney, Christen; Lao, Derek; Le, Adrienne; Lo, Jacky; Lopez, Alejandro G.; López, Andrea; Lorenzo, Fernando G.; Luu, Gordon T.; Mahoney, Andrew R.; Melton, Rebecca L.; Nascimento, Gabriela Do; Pradhananga, Anjani; Rodrigues, Nicole S.; Shieh, Annie; Sims, Jasmine; Singh, Rima; Sulaeman, Hasan; Thu, Ricky; Tran, Krystal; Tran, Livia; Winters, Elizabeth J.; Wong, Albert; Pennings, Pleuni S.

doi:10.1007/s10682-020-10039-z


Original Paper
Open access
Published: 24 April 2020

Evolutionary Ecology, Volume 34, pages 339–359 (2020)

Victoria R. Caudill¹,²,
Sarina Qin¹,³,
Ryan Winstead¹,
Jasmeen Kaur¹,
Kaho Tisthammer¹,
E. Geo Pineda¹,
Caroline Solis¹,
Sarah Cobey¹⁶,
Trevor Bedford⁴,
Oana Carja⁵,
Rosalind M. Eggo⁶,
Katia Koelle⁷,
Katrina Lythgoe⁸,
Roland Regoes¹⁷,
Scott Roy¹,
Nicole Allen¹,
Milo Aviles¹,
Brittany A. Baker¹,
William Bauer¹,
Shannel Bermudez¹,
Corey Carlson¹,
Edgar Castellanos¹,
Francisca L. Catalan¹,⁹,
Angeline Katia Chemel¹,
Jacob Elliot¹,
Dwayne Evans¹,¹⁰,
Natalie Fiutek¹,
Emily Fryer¹,¹¹,
Samuel Melvin Goodfellow¹,¹²,
Mordecai Hecht¹,
Kellen Hopp¹,
E. Deshawn Hopson Jr.¹,
Amirhossein Jaberi¹,
Christen Kinney¹,
Derek Lao¹,
Adrienne Le¹,
Jacky Lo¹,
Alejandro G. Lopez¹,
Andrea López¹,
Fernando G. Lorenzo¹,
Gordon T. Luu¹,
Andrew R. Mahoney¹,
Rebecca L. Melton¹,¹³,
Gabriela Do Nascimento¹,
Anjani Pradhananga¹,
Nicole S. Rodrigues¹,¹⁴,
Annie Shieh¹,
Jasmine Sims¹,¹⁵,
Rima Singh¹,
Hasan Sulaeman¹,
Ricky Thu¹,
Krystal Tran¹,
Livia Tran¹,
Elizabeth J. Winters¹,
Albert Wong¹ &
Pleuni S. Pennings ORCID: orcid.org/0000-0001-8704-6578¹


A Correction to this article was published on 16 May 2020

This article has been updated

Abstract

Mutations can occur throughout the virus genome and may be beneficial, neutral or deleterious. We are interested in mutations that yield a C next to a G, producing CpG sites. CpG sites are rare in eukaryotic and viral genomes. For the eukaryotes, it is thought that CpG sites are rare because they are prone to mutation when methylated. In viruses, we know less about why CpG sites are rare. A previous study in HIV suggested that CpG-creating transition mutations are more costly than similar non-CpG-creating mutations. To determine if this is the case in other viruses, we analyzed the allele frequencies of CpG-creating and non-CpG-creating mutations across various strains, subtypes, and genes of viruses using existing data obtained from Genbank, HIV Databases, and Virus Pathogen Resource. Our results suggest that CpG sites are indeed costly for most viruses. By understanding the cost of CpG sites, we can obtain further insights into the evolution and adaptation of viruses.


Introduction

Viruses cause a multitude of diseases such as AIDS, Dengue Fever, Polio, Hepatitis, and the flu. Due to their fast replication, large population sizes and high mutation rates, viruses are able to quickly adapt to new environments (Cuevas et al. 2015). The ability of viruses to adapt quickly is seen in drug resistance evolution in HIV and HCV, immune escape in influenza and vaccine-derived polio outbreaks. High mutation rates may also lead to a high mutational load, since a large proportion of mutations are costly to the virus. In fact, experimental work has shown that most mutations are deleterious for viruses, with a select few being neutral or beneficial (Sanjuán et al. 2004; Duffy 2018).

Fitness costs influence the fate of mutations. Mutations that suffer little or no fitness costs are likely to persist in the population, whereas mutations with high fitness costs will likely be weeded out. A detailed knowledge of mutational fitness costs (also termed selection coefficients) is important to discover new functional properties of a genome and to understand and predict the evolutionary dynamics of populations. Past studies of fitness costs have produced important practical insights into problems as diverse as drug resistance in viruses (Beerenwinkel et al. 2005), extinction in small populations (Schultz and Lynch 1997), and the effect of accumulating deleterious mutations on human health (Keightley 2012). However, studying fitness costs in natural populations is difficult. As a result, most of what we know comes from in vitro studies or phylogenetic approaches (Stern et al. 2007), neither of which can directly give detailed information about the costs of individual mutations in vivo. Costs of mutations can also be studied using within-host diversity data (Zanini et al. 2017; Theys et al. 2018). In this study we use between-host diversity to study fitness costs.

Several different types of studies have found evidence that CpG sites are costly for viruses. A CpG site is an occurrence of a nucleotide C followed by a G in the 5′ to 3′ direction. Studies of viral genomic sequences found that CpG sites were underrepresented in almost all small viruses tested (Karlin and Cardon 1994). Burns et al. (2009) found that CpG sites significantly decreased replicative fitness of polio viruses in vitro, while an increased GC content in itself had little to no effect on the virus’s overall fitness. Stern et al. (2017) showed that CpG sites in the polio vaccine were often mutated in vaccine-derived polio outbreaks, indicating a direct cost of CpG sites in polio in vivo. In 2018, a previous paper from our group (Theys et al. 2018) showed that in HIV, transition mutations resulting in CpG sites were twice as costly as otherwise similar non-CpG-creating mutations, revealing that CpG-creating mutations confer a cost within the host.

It is not entirely clear why CpG sites are costly, but it is likely, at least in part, because the mammalian immune system uses CpG sites to recognize foreign genetic material (Murphy and Weaver 2016). Recently it was shown that ZAP proteins, which inhibit the proliferation of most RNA viruses, are more effective when CpG sites are common (Takata et al. 2017; Ficarelli et al. 2019).

Theys et al. (2018) focused on the cost of CpG-creating mutations in HIV. Here we expand our scope to encompass an array of human viruses, including Dengue, Influenza, Entero, Herpes, and Hepatitis B and C. We focused on human viruses with a sufficient number of available sequences in Genbank, HIV Databases, or The Virus Pathogen Resource (VPR). Unlike in the Theys et al. (2018) paper, we focus on population-wide data (one sequence per patient) as opposed to within-patient data. The main assumption for this study is that when CpG-creating mutations come with a cost (either within hosts or at the transmission stage), we expect them to occur at lower frequencies in the population-wide sample compared to non-CpG-creating mutations. Since the types of mutations we consider (CpG-creating and non-CpG-creating) all occur on the same species-wide genealogy, we consider any significant differences in frequencies to be likely the result of a difference in cost. For a second analysis, we assume that the average frequency of mutations is inversely proportional to the cost of the mutations. This is likely an oversimplification, but it allows us to quantify the effect size we observe.

Depending on data availability, either individual genes or whole genomes were used. We found that CpG sites are costly in most viruses, though the effect is much stronger in some viruses (e.g., HIV, BK Polyoma) than in others (e.g., HCV, Rota). A full list of viruses can be found in Table 1.

Methods

Data and R scripts

Data and R scripts are available on Github:

https://github.com/Vcaudill/CpG_sites/releases/tag/v1.

Data Collection

The sequences were retrieved from the NCBI Genbank, the HIV Databases (http://www.hiv.lanl.gov/), and the Virus Pathogen Resource (VPR, https://www.viprbrc.org/) using R scripts or manually (see Table 1 for data sources). We selected viral sequences from a human host, and proteins required for viral fitness (e.g. VP1, VP2, envelope protein). Dengue, Entero, and Polio sequences were all collected through the VPR, HIV sequences from the HIV Database, and HCV, Human Parainfluenza, Influenza, Human Respiratory Syncytial, Measles, Rhino, Rota, BK, Human Boca, Hepatitis B, Human Herpes, Human Papilloma, and Parvo sequences from Genbank.

Further data preparation and filtering

After data collection, obtained sequences were aligned and trimmed using Geneious v.11.1.4. After checking the alignment, an online translation tool (Artimo et al. 2012) was used to identify coding regions. Once a coding region was found, the sequences were verified using NCBI BLAST. We used the program RDP4 (Martin et al. 2015) to determine if any sequences were the result of recombination. If RDP4 showed that a sequence was recombined it was cut from our analysis, unless the overall number of sequences was below 100.

Consensus sequences for each virus/protein data set were generated using R or Geneious. A custom R script was also used to identify stop codons created by mutations in the coding sequences.

Table 1 Information pertaining to the datasets, such as virus name, how much and where data was available, statistical results


Table 2 The number of data sets (out of 42) for which the Wilcoxon test was significant (percentages in parentheses) indicating that non-CpG-creating mutations were observed at higher frequencies than, otherwise similar, CpG-creating mutations


Accurate estimation of mutation frequencies requires sufficient data points. Therefore, we calculated data points as the number of sequences multiplied by the number of nucleotides, and removed data sets that had fewer than 60,000 data points. We were able to collect sufficient data for 42 data sets.

Data analysis

For each of the 42 data sets, the consensus sequence was translated to create a wild type protein sequence. For each nucleotide, we determined whether a transition mutation would change the amino acid and/or create a CpG site. We determined whether the transition mutation was synonymous, non-synonymous or nonsense by comparing the wild type amino acid to the mutated amino acid. We calculated the frequency of the transition mutation for each nucleotide in the data set by dividing the number of observed transition mutations by the sum of the number of transition mutations and the number of wild type nucleotides at that position.
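The per-site classification described above can be sketched as follows. This is a minimal Python illustration, not the authors' R scripts: the function names `classify_site` and `transition_frequency` are hypothetical, the consensus is assumed to be an in-frame, upper-case coding sequence, and the standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (assumption: no alternative code).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def classify_site(consensus, i):
    """Classify the transition mutation at 0-based position i of an in-frame consensus."""
    wt = consensus[i]
    mut = TRANSITION[wt]
    mutated = consensus[:i] + mut + consensus[i + 1:]
    start = i - i % 3  # first position of the codon containing site i
    wt_aa = CODON_TABLE[consensus[start:start + 3]]
    mut_aa = CODON_TABLE[mutated[start:start + 3]]
    if mut_aa == wt_aa:
        effect = "synonymous"
    elif mut_aa == "*":
        effect = "nonsense"
    else:
        effect = "non-synonymous"
    # Every dinucleotide in this 3-base window includes site i, so a "CG" here
    # must have been created by the mutation (only A->G and T->C can do this).
    creates_cpg = "CG" in mutated[max(i - 1, 0):i + 2]
    return f"{wt}->{mut}", effect, creates_cpg


def transition_frequency(column, wt):
    """Transition count divided by (transition count + wild type count) at one site."""
    n_mut = column.count(TRANSITION[wt])
    n_wt = column.count(wt)
    return n_mut / (n_mut + n_wt) if (n_mut + n_wt) else float("nan")
```

Here `column` is the string of nucleotides observed at one alignment position across all sequences; bases that are neither the wild type nor its transition are ignored, following the frequency definition above.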

Statistical analysis

To determine if CpG sites were costly to viruses, the data were separated into groups. First, the sites were split into four categories; each represented a consensus nucleotide and its transition mutated form (Adenine to Guanine (A\(\rightarrow\)G), Thymine to Cytosine (T\(\rightarrow\)C), Cytosine to Thymine (C\(\rightarrow\)T), or Guanine to Adenine (G\(\rightarrow\)A)). The nucleotides were then sectioned into groups of synonymous and non-synonymous, and further by CpG-creating or non-CpG-creating mutations (Fig. 1). A Wilcoxon rank-sum test was performed to determine if the mutation frequencies differed between groups of synonymous versus non-synonymous, and CpG versus non-CpG-creating mutations (Fig. 3). To calculate a “cost ratio” of CpG-creating transition mutations, we divided the mean mutation frequency of non-CpG-creating mutations by the mean mutation frequency of CpG-creating mutations of the same type (Fig. 4).
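As a rough illustration of the two quantities defined above, the sketch below computes a cost ratio and a one-sided rank-sum p value in plain Python. It is an assumption-laden stand-in for the R analysis: the function names are hypothetical, the p value uses the normal approximation without a tie correction, and the example frequencies are invented for demonstration only.

```python
import math
from statistics import mean


def cost_ratio(non_cpg_freqs, cpg_freqs):
    """Mean frequency of non-CpG-creating mutations divided by that of CpG-creating ones."""
    return mean(non_cpg_freqs) / mean(cpg_freqs)


def rank_sum_one_sided(x, y):
    """Mann-Whitney U and approximate one-sided p value for 'x tends to exceed y'."""
    m, n = len(x), len(y)
    # U counts pairs where x beats y; ties contribute one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mu = m * n / 2
    sigma = math.sqrt(m * n * (m + n + 1) / 12)
    z = (u - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return u, p


# Invented per-site frequencies, for illustration only.
non_cpg = [0.02, 0.03, 0.05, 0.04]
cpg = [0.01, 0.015, 0.005, 0.012]
ratio = cost_ratio(non_cpg, cpg)
u_stat, p_value = rank_sum_one_sided(non_cpg, cpg)
```

For real data sets with many tied frequencies (for example, many sites at frequency zero), an implementation with tie handling such as R's `wilcox.test` or SciPy's `scipy.stats.mannwhitneyu` with `alternative="greater"` would be the safer choice.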

Phylogenetic approach

For each dataset we used PhyML (Guindon et al. 2010) to create unrooted trees from 200 randomly selected sequences. From the PhyML tree output we rooted the tree using the midpoint rooting method. Once rooted we used PAML (Yang 2007) to construct the ancestral sequences. Using these ancestral sequences, we repeated the “cost ratio” analysis (see supplementary figure S2).

Simulations

Using the SLiM simulation framework (Haller and Messer 2019), we simulated viral genomes in 200 hosts. We simulated a genomic region of 10,000 base pairs and a within-host population size of 5000. The first half of the genome (0–4999) is set to have only CpG-creating mutations with a cost of 0.01. The second half of the genome (5000–10,000) is set to have only non-CpG-creating mutations with a cost of 0.005. The mutation rate is set to \(10^{-5}\). Each simulation in a host starts with a population that consists of wildtype sequences only. The simulation runs for 1000 generations, after which a sample of 1 sequence is taken. When the simulation is run 200 times, we have 200 sequences to analyze. The average frequency of non-CpG-creating mutations was 0.011, whereas the average frequency of CpG-creating mutations was 0.006. The ratio between the two means was 1.9. The difference in frequencies was significant (Wilcoxon test, p value \(< 0.01\)).

Relation between cost and genomic CpG under-representation

The relationships between the costs of CpG-creating mutations and the degree of CG dinucleotide under/over-representation (Rho statistic values) were assessed for all viral genes/genomes used in our study. The Rho statistic is obtained by dividing the frequency of dinucleotide xy by the product of the frequencies of nucleotide x and nucleotide y, and was calculated using the ‘seqinr’ package (Charif et al. 2004) in R. The results showed an overall significant negative correlation (Spearman’s \(\rho = -0.37\), \(p = 0.0005\)), indicating that the higher the cost of CpG-creating mutations, the more the CG dinucleotide was underrepresented. Correlation was also assessed separately for A\(\rightarrow\)G and T\(\rightarrow\)C mutations, which resulted in a significant negative correlation for T\(\rightarrow\)C mutations (Spearman’s \(\rho = -0.43\), \(p = 0.004\)) and a marginally significant correlation for A\(\rightarrow\)G mutations (Spearman’s \(\rho = -0.29\), \(p = 0.06\)).

Results

We collected 42 viral datasets from online sources (Genbank, Los Alamos HIV Database, Virus Pathogen Resource (VPR)), each of which is a group of viral sequences of the same species, subtype and gene (see Table 1). Each sequence in a dataset came from an individual host from various parts of the world. The mean number of sequences in a dataset is 2501, median 579, with a maximum at 24,005 and a minimum at 41. The mean number of nucleotides for each sequence is 3710, median 1706, with a maximum of 14,469 and a minimum of 294. We established a minimum cutoff of 60,000 data points per dataset (number of nucleotides \(\times\) number of sequences); viruses or genes with less data available were not included.

We use the following approach. We assume that mutations occur at random, but are then subject to selection and drift. Selection and drift can act within hosts or at the transmission stage. For most mutations, selection will act to purge the mutations from the viral population (within-host population or the global population). Whether within-host or between-host effects are more important is not clear for most viruses, but either way, we expect that more deleterious mutations are less likely to be observed often, and more benign mutations will be observed more often. The main focus of our paper is to determine whether CpG-creating mutations are observed less often in each of the 42 datasets than (otherwise similar) non-CpG-creating mutations. We focus on A\(\rightarrow\)G and T\(\rightarrow\)C mutations, because transition mutations are more common in viruses than transversion mutations and only these transition mutations can create CpG sites.

To check whether our approach was sound in principle, and whether there was sufficient power to assess the cost of CpG-creating mutations, we first tested whether synonymous mutations were observed at higher frequencies than non-synonymous mutations using the non-parametric Wilcoxon test. All tests are one-tailed, because we expect synonymous mutations to occur at a higher frequency than non-synonymous mutations. To make our approach for non-synonymous sites as similar as possible to our approach for CpG-creating mutations, we also focus solely on A\(\rightarrow\)G and T\(\rightarrow\)C mutations. We observed a significant difference between the frequencies of synonymous mutations and non-synonymous mutations for 38 of the 42 datasets analyzed (90.5%) (Table 2).

As an additional test to make sure our approach was sound, we ran simulations in SLiM (Haller and Messer 2019). We simulated virus genomes with costly CpG-creating mutations and less costly non-CpG-creating mutations in 200 patients and found that, as expected, the results show a higher average frequency for the non-CpG-creating mutations (see supplementary figure S1).

Our study focused on transition mutations that result in CpG sites. We focused on transition mutations because they occur at a much higher rate than transversion mutations, and therefore provide greater power to detect meaningful differences. There are two ways for a CpG site to be formed by a transition mutation: (1) a C precedes an A (CA) and the A mutates to a G, and (2) a T precedes a G (TG) and the T mutates to a C (see Fig. 2).

Both synonymous and non-synonymous mutations can create CpG sites. For example, when a TCA codon, which encodes Serine, undergoes an A\(\rightarrow\)G mutation at its third position, the codon becomes TCG, which creates a new CpG site without changing the amino acid. Comparing synonymous CpG-creating versus synonymous non-CpG-creating mutations, we found that the frequencies of non-CpG-creating mutations were significantly higher than those of CpG-creating mutations in 32 of the data sets (76.2%) for A\(\rightarrow\)G mutations and 28 of the data sets (66.7%) for T\(\rightarrow\)C mutations.

Non-synonymous mutations result in an amino acid change that alters the protein. Mutations that create a CpG site and cause an amino acid change are called non-synonymous CpG-creating mutations, while mutations that are non-synonymous but do not create CpG sites are called non-synonymous non-CpG-creating mutations. When comparing non-synonymous CpG-creating versus non-synonymous non-CpG-creating mutations, non-CpG-creating mutations had a significantly higher frequency than CpG-creating mutations in 23.8% of the datasets for A\(\rightarrow\)G mutations and 40.5% for T\(\rightarrow\)C mutations (see Table 2).

From our collection of viruses, we show results from three datasets as examples (Fig. 3). Only A\(\rightarrow\)G and T\(\rightarrow\)C mutations can form CpG sites, but here we also show C\(\rightarrow\)T and G\(\rightarrow\)A mutations as a comparison. Our results varied: they ranged from high to low mutation frequencies and from significant to non-significant test results. The three examples chosen show the diversity of our results.

In each graph, four categories of mutations are compared with one another: synonymous non-CpG-creating mutations (green), synonymous CpG-creating mutations (blue), non-synonymous non-CpG-creating mutations (orange), and non-synonymous CpG-creating mutations (red). Each colored point is the mutation frequency observed at a single position within each of these categories, along with the mean value and standard error bars (one standard error above and below the mean) in black.

Figure 3a shows mutation frequencies for Dengue 1. Dengue’s genome is comprised of one large polyprotein. For Dengue 1, we have 1783 sequences and 10,176 nucleotides, making this a particularly large dataset. We show frequencies of all 10,176 sites in the genome, split into the four different transition mutations (A\(\rightarrow\)G, T\(\rightarrow\)C, C\(\rightarrow\)T, G\(\rightarrow\)A) and then split into synonymous (green and blue) and non-synonymous (orange and red). Non-CpG-creating mutations are green and orange, while CpG-creating mutations are red and blue. For this data set, all tested comparisons are significantly different (p < 0.01, Wilcoxon test). CpG-creating mutations occur at lower frequencies than non-CpG-creating mutations for both A\(\rightarrow\)G and T\(\rightarrow\)C mutations, among synonymous mutations (green vs blue) as well as non-synonymous mutations (orange vs red). There is also a significant difference between the synonymous and non-synonymous mutations for both A\(\rightarrow\)G and T\(\rightarrow\)C mutations.

Next, we show mutation frequencies for the HA gene (hemagglutinin) of the Influenza A H3N2 strain (Fig. 3c, d). The p values show that non-CpG-creating mutations occur at higher frequencies than CpG-creating mutations for synonymous A\(\rightarrow\)G and T\(\rightarrow\)C mutations. For the synonymous T\(\rightarrow\)C mutations, the graph shows that the mean frequencies are almost the same, but the non-parametric Wilcoxon test still detects a significant difference (p < 0.01) (Fig. 3d). For non-synonymous mutations, we find a significant difference between CpG-creating and non-CpG-creating mutations for T\(\rightarrow\)C but not A\(\rightarrow\)G mutations. The difference in frequencies between synonymous and non-synonymous mutations is significant for both A\(\rightarrow\)G and T\(\rightarrow\)C mutations.

Next, we show the results for Human Respiratory Syncytial Virus G gene (Fig. 3e, f). The results here are very similar to the Influenza virus in the figure: all tests are significant except for the difference between CpG-creating and non-CpG-creating mutations for non-synonymous A\(\rightarrow\)G mutations (Fig. 3f).

Cost of CpG-creating mutations across all datasets

With a Wilcoxon test, we could determine whether CpG-creating mutations occur at lower frequencies than otherwise similar non-CpG-creating mutations, but the test does not give us a sense of the size of this effect. To get a better sense of how much less frequent CpG-creating mutations are (and thus roughly how much more costly), we divided the mean frequency of non-CpG-creating mutations by the mean frequency of CpG-creating mutations for each of the datasets (Fig. 4). We graphed only the synonymous mutations as they more often showed a significant CpG effect.

We calculated two ratios for each dataset: (1) the ratio of the mean frequency of synonymous, A\(\rightarrow\)G, non-CpG-creating mutations to that of synonymous, A\(\rightarrow\)G, CpG-creating mutations (red), and (2) the ratio of the mean frequency of synonymous, T\(\rightarrow\)C, non-CpG-creating mutations to that of synonymous, T\(\rightarrow\)C, CpG-creating mutations (blue). When these ratios are above 1, the non-CpG-creating mutations have a higher average frequency than CpG-creating mutations, which shows that the CpG-creating mutations are more costly. The higher the ratio, the higher the cost of CpG-creating mutations relative to the cost of non-CpG-creating mutations. The black line in Fig. 4 indicates a ratio of 1. Most, though not all, viruses analyzed show ratios higher than 1 (above the solid black line).

We performed a sign test (exact binomial test) to determine whether we were significantly more likely to find cost ratios higher than 1 versus cost ratios lower than 1. We found a highly significant result for both types of mutations, which confirms that the over-representation of cost ratios above 1 in Fig. 4 is not due to chance. For A\(\rightarrow\)G mutations (39 ratios higher than 1 out of 42 observations), p value = 5.63e\(-\)09, and for T\(\rightarrow\)C mutations (37 ratios higher than 1 out of 42 observations), p value = 4.43e\(-\)07.
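The sign test above can be reproduced with a short Python sketch. The function name `sign_test_two_sided` is hypothetical, and the two-sided form is an assumption on our part; it is chosen because it appears to match the p values reported in the text.

```python
from math import comb


def sign_test_two_sided(k, n):
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    # Under p = 0.5 the distribution is symmetric, so double the larger tail.
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_ag = sign_test_two_sided(39, 42)  # A->G: 39 of 42 ratios above 1
p_tc = sign_test_two_sided(37, 42)  # T->C: 37 of 42 ratios above 1
```

Formatting `p_ag` and `p_tc` to three significant figures allows a direct comparison with the values quoted in the paragraph above.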

In Fig. 4 the viruses are arranged by genus, with RNA viruses on the left and DNA viruses on the right. We see that the calculated frequency ratios are consistently above 1 for Dengue 1–4, Hepatitis C, HIV, Influenza A, Human Respiratory Syncytial virus, Measles, Rhino viruses, Rota A virus, BK polyoma, Human Boca and Parvo virus. Results are mixed (though the majority are still above 1) for Parainfluenza, Influenza B, Entero viruses, Hepatitis B, Herpes virus and Human papilloma.

There is a pattern among groups of viruses where one type of mutation is more costly than the other. In Dengue and Human Parainfluenza, CpG-creating T\(\rightarrow\)C mutations are relatively more costly than CpG-creating A\(\rightarrow\)G mutations. In Entero and Hepatitis B, on the other hand, CpG-creating A\(\rightarrow\)G mutations are more costly than CpG-creating T\(\rightarrow\)C mutations. It is unclear whether this is an artifact of our dataset or a real effect.

Since we suspect that the amount of data available per dataset may affect our results, we plotted the product of the number of sequences and the number of nucleotides per dataset at the bottom of Fig. 4. In a separate figure (Fig. 5 and supplementary figure S3), we show how the amount of available data affects whether we find significant results for A\(\rightarrow\)G or T\(\rightarrow\)C mutations or both. In these figures, each dot represents a dataset, the x axis shows the number of sequences in each dataset and the y axis shows the number of sites at which a transition mutation creates a CpG site. Blue triangles indicate two significant Wilcoxon tests (for A\(\rightarrow\)G and T\(\rightarrow\)C mutations), green squares indicate one significant result and red dots indicate no significant result. The figure shows that, in general, having more data makes it more likely to find one or two significant results. Figure 5a shows the comparison of synonymous CpG-creating versus synonymous non-CpG-creating mutations. In this figure, the red and green data points are clearly clustered in the lower left corner, which suggests that the absence of significant results here is due to a lack of data. Figure 5b shows the comparison of non-synonymous CpG-creating versus non-synonymous non-CpG-creating mutations. In this case, it seems that only our largest datasets lead to significant results. Finally, Fig. 5c shows the comparison of synonymous versus non-synonymous mutations.

We wanted to determine whether the datasets for which we estimated high costs of CpG-creating mutations also showed a lack of CpG-sites in their genomes. To test this, we determined the relationship between the cost ratio we estimated and the CpG under-representation (Rho statistic values) and we found that overall, this relationship is indeed negative (Spearman’s \(\rho ~=~-0.37\), p = 0.0005) (see Methods and supplementary figure S4). This could mean that the different costs we estimate in different viruses have existed for long enough evolutionary time scales to affect the genome content of the viruses we study.

Discussion

CpG-creating mutations are costly in most viruses

There is previous evidence that CpG-creating mutations are costly for viruses such as HIV and Polio (Theys et al. 2018; Stern et al. 2017). It is expected that such mutations are also costly in other viruses, because CpG sites are rare in many viruses (Karlin and Cardon 1994). Here we used global data for 42 viral datasets to test whether CpG sites are costly for most human viruses. For many viruses, information on within-host diversity is not readily available, so we focused on between-host diversity, using datasets with one viral sequence per patient. We expect that mutation frequencies in such datasets are determined by mutation rates, selection coefficients and stochastic effects such as drift and selective sweeps (Hartl and Clark 2007). Our main assumption here is that stochastic effects and mutation rates affect CpG-creating and non-CpG-creating mutations equally (see section on study limitations). This means that any significant difference in mutation frequencies between CpG-creating and non-CpG-creating mutations will be due to differences in selection coefficients, which allows us to determine whether CpG-creating mutations are generally more costly than non-CpG-creating mutations (Theys et al. 2018).

We found that indeed, in the majority of viruses we tested, the mutation frequencies were significantly different between CpG-creating and non-CpG-creating mutations, which shows that there is a fitness cost to CpG-creating mutations in most viruses. We found a significant effect of CpG-creating mutations in 76.2% of datasets for synonymous A\(\rightarrow\)G mutations and in 66.7% of datasets for synonymous T\(\rightarrow\)C mutations.

To test the statistical power of our novel approach, we also tested whether we could detect a difference in frequencies between synonymous and non-synonymous mutations. We used the same datasets and methods to demonstrate that synonymous mutations occur at higher frequencies than non-synonymous mutations. We detected a significant difference between non-synonymous mutations and synonymous mutations in 90.5% of datasets for A\(\rightarrow\)G mutations and also 90.5% of datasets for T\(\rightarrow\)C mutations. Although we detect the CpG effect less often than the effect of non-synonymous mutations, we still detect it in more than two-thirds of the viruses. The cost of CpG-creating mutations should probably be considered near ubiquitous in human viruses.

We also tested for an effect of CpG-creating mutations among non-synonymous mutations, but found that this effect was only detected in 23.8% of datasets for A\(\rightarrow\)G mutations and 40.5% of datasets for T\(\rightarrow\)C mutations. One reason for this low number of significant results is probably that many non-synonymous mutations occur at very low frequencies (see Fig. 3a, c, e).

Quantifying the cost

After we found that a majority of viruses displayed a lower frequency of CpG-creating mutations when compared to non-CpG-creating mutations, we moved on to quantify this cost. We did this separately for A\(\rightarrow\)G and T\(\rightarrow\)C mutations. For each of these two types of mutations, we calculated the ratio of the mean frequency of synonymous non-CpG-creating mutations to the mean frequency of synonymous CpG-creating mutations. We hypothesize that when CpG-creating mutations come with a large cost, they will be found at much lower frequencies, whereas if they come with a small cost, their frequencies will only be slightly lower than those of non-CpG-creating mutations. Therefore, the ratio we calculate will give us a sense of the relative cost of CpG sites in different viruses.
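One way to make this reasoning explicit is the classical mutation-selection balance approximation, used here as an illustrative assumption rather than as a model fitted in this study. The symbols \(\mu\), \(s\) and \(\bar f\) are introduced only for this illustration. If a deleterious mutation arises at rate \(\mu\) and carries cost \(s\), its expected frequency is roughly \(\mu / s\). Assuming the same \(\mu\) for CpG-creating and non-CpG-creating mutations of the same type, the ratio of mean frequencies approximates the ratio of costs:

\[
\frac{\bar f_{\text{non-CpG}}}{\bar f_{\text{CpG}}} \approx \frac{\mu / s_{\text{non-CpG}}}{\mu / s_{\text{CpG}}} = \frac{s_{\text{CpG}}}{s_{\text{non-CpG}}}
\]

For instance, the simulations described in the Methods use costs of 0.01 and 0.005, for which this approximation predicts a ratio of 2, close to the simulated ratio of 1.9.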

The levels of the cost ratio vary widely between viruses, with some clear differences between viral genera. For example, for HIV, the ratio is near 5 for both A\(\rightarrow\)G and T\(\rightarrow\)C mutations. This shows that CpG sites in HIV come with a large cost, as shown before based on a different data set (Theys et al. 2018). On the other hand, in Hepatitis C the ratio is close to 1 for both genotype 1A and 1B. The Wilcoxon tests were significant for Hepatitis C, but the fact that the ratio is close to 1 shows that the effect size is small. We find similar results when we look at within-host diversity for HCV using another dataset (Tisthammer, unpublished). The cost ratio for BK Polyoma is very high: we see a 100-fold difference in mean frequencies. This result is so extreme that we are tempted to think it is not robust, but we did find that the number of CpG sites in the BK Polyoma genome is extremely low (less than 5% of the expected number; in supplementary figure S4, the two upper left dots are BK Polyoma). This could mean that for some unknown reason CpG sites are much more costly in BK Polyoma virus than in the other viruses. Future studies could look into this.

We find more variable cost ratios in the DNA viruses than in the RNA viruses. This may be because of the smaller sample sizes for DNA viruses, or it may be that different selection pressures are at play in DNA viruses versus RNA viruses. In RNA viruses, we expect that the mammalian immune system recognizes CpG sites and forces the viruses to mimic the low CpG content in mammalian genomes (Takata et al. 2017). In DNA viruses, it is not clear if the same mechanism is at work, though unmethylated CpG sites are expected to stimulate the immune response (Hoelzer et al. 2008).

The cost ratio was calculated for both A\(\rightarrow\)G and T\(\rightarrow\)C mutations. These two ratios are not necessarily equal. In some viruses, we see surprising patterns in the cost ratios. For example, in the Dengue viruses, T\(\rightarrow\)C CpG-creating mutations (blue) are relatively more costly than A\(\rightarrow\)G mutations (red). In Influenza A, however, the trend is in the other direction: T\(\rightarrow\)C CpG-creating mutations (blue) are relatively less costly than A\(\rightarrow\)G mutations (red). Further studies are needed to determine what causes these patterns.

Limitations and future studies

Our study has a number of limitations. We only included datasets with at least 60,000 data points per dataset (number of nucleotides \(\times\) number of sequences). However, we still find that our larger datasets are more likely to yield significant results (Fig. 5). This suggests that increasing either the number of sequences or the sequence length for some of the viral datasets will increase the number of datasets with significant results.
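The inclusion criterion above can be written as a short check. This is only an illustrative sketch: the names `MIN_DATA_POINTS`, `data_points`, and `passes_size_filter`, as well as the example dataset sizes, are hypothetical. Only the 60,000 threshold and the definition of a data point (number of nucleotides times number of sequences) come from the text.

```python
# Threshold stated in the text: at least 60,000 data points per dataset.
MIN_DATA_POINTS = 60_000


def data_points(n_nucleotides, n_sequences):
    """Data points = number of nucleotides x number of sequences."""
    return n_nucleotides * n_sequences


def passes_size_filter(n_nucleotides, n_sequences):
    """True if the dataset meets the minimum size used for inclusion."""
    return data_points(n_nucleotides, n_sequences) >= MIN_DATA_POINTS


# Hypothetical datasets, for illustration only.
print(passes_size_filter(1_000, 60))  # 60,000 data points
print(passes_size_filter(900, 60))    # 54,000 data points
```

Because the criterion is a product, a short alignment with many sequences and a long alignment with few sequences can both pass the filter.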

Another limitation of our study is that we used one sequence per patient. This means that we don’t have any information on within-host diversity, and rare variants that exist within hosts will be missed. While we believe that having within-host diversity data would be useful, this study shows that even with one sequence per patient, we are able to detect costs of mutations. However, it is unclear whether this cost occurs during replication in the host, during transmission or both.

We and others have studied within-host diversity in HIV and HCV and other viruses to study costs of mutations within the host (Wang et al. 2010; Rambaut et al. 2004; Alizon et al. 2011; Theys et al. 2018). This is possible for these viruses because patients are infected for a long time and there is an expectation that mutation and selection occur within the host. For most other viruses, however, it is not clear whether it is possible to study within-host fitness costs separately from between-host effects. For example, if patients are infected with a diverse sample of the virus, then within-host mutation and selection may not be the dominant effects that shape within-host genetic diversity (Varble et al. 2014; Poon et al. 2016). For those types of viruses, studying within-host and between-host diversity may lead to the same results, and having data on within-host diversity may not necessarily increase our knowledge of fitness costs of mutations.

Finally, one of the main assumptions of this study is that the mutation rate doesn’t depend on the neighboring nucleotide. For example, we assume that an A\(\rightarrow\)G mutation is equally likely to occur when it is next to a C (creating a CpG site), or another nucleotide (not creating a CpG site). Similarly, we assume that a T\(\rightarrow\)C mutation is equally likely to occur when it is followed by a G (creating a CpG site), or another nucleotide (not creating a CpG site). In principle, it is possible that the cost we infer is due to a lower mutation rate of CpG-creating mutations. We believe that this is unlikely for several reasons. (1) Our results for A\(\rightarrow\)G and T\(\rightarrow\)C mutations are very similar; if this were due to a mutation rate effect, it would have to have the same effect on both of these mutation types. (2) Our results are consistent with results from epidemiological studies on polio (Stern et al. 2017) and in vitro studies on HIV (Takata et al. 2017; Ficarelli et al. 2019). Future studies will hopefully measure viral mutation rates with more precision.
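To make the distinction between CpG-creating and non-CpG-creating mutations concrete, the following Python sketch classifies a single point mutation by its sequence context. The function `creates_cpg` is a hypothetical helper written for illustration, not the code used in the study. It assumes sequences are written in the DNA alphabet (T rather than U) and uses 0-based positions.

```python
def creates_cpg(seq, pos, new_base):
    """Return True if changing seq[pos] to new_base creates a CG
    dinucleotide that was not present in the original sequence.

    An A->G change creates a CpG when the preceding base is C, and a
    T->C change creates a CpG when the following base is G.
    """
    seq = seq.upper()
    new_base = new_base.upper()
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    # Only the two dinucleotides that overlap the mutated position can change.
    for start in (pos - 1, pos):
        if start < 0 or start + 2 > len(seq):
            continue  # dinucleotide would run off the end of the sequence
        if mutated[start:start + 2] == "CG" and seq[start:start + 2] != "CG":
            return True
    return False


# A->G next to a C creates a CpG site; next to a T it does not.
print(creates_cpg("CAT", 1, "G"))  # CAT -> CGT
print(creates_cpg("TAT", 1, "G"))  # TAT -> TGT
# T->C followed by a G creates a CpG site; followed by an A it does not.
print(creates_cpg("ATG", 1, "C"))  # ATG -> ACG
print(creates_cpg("ATA", 1, "C"))  # ATA -> ACA
```

Under the assumption stated above, the mutation rate would be the same within each pair of examples, so a frequency difference between the two members of a pair would reflect selection rather than mutation.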

In conclusion, we find that CpG-creating mutations are costly for most human viruses. For viruses in which we do not detect an effect of CpG-creating mutations, it is likely because of a small sample size. It was already known for some viruses that CpG-creating mutations were costly, but we have now shown that this cost occurs in most human viruses. Future work should focus on better understanding why the cost of CpG-creating mutations is higher in some viruses than others, and whether there is a relation with how the virus interacts with the human host, and possibly other hosts. We are also excited about future studies that could find what other types of mutations are costly, and we hypothesize that with the advent of artificial intelligence in population genetics (Sheehan and Song 2016; Schrider and Kern 2018), we will be able to get a much more complete understanding of the fitness landscape of viruses. Another interesting future direction would be to use modeling studies to determine the effects of the cost of these CpG-creating mutations on the effective population size and adaptive potential of viral populations.

Change history

16 May 2020
A Correction to this paper has been published: https://doi.org/10.1007/s10682-020-10052-2

References

Alizon S, Luciani F, Regoes RR (2011) Epidemiological and clinical consequences of within-host evolution. Trends Microbiol 19(1):24–32
Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, Duvaud S, Flegel V, Fortier A, Gasteiger E, Grosdidier A, Hernandez C, Ioannidis V, Kuznetsov D, Liechti R, Moretti S, Mostaguir K, Redaschi N, Rossier G, Xenarios I, Stockinger H (2012) ExPASy: SIB bioinformatics resource portal. Nucl Acids Res 40(W1):W597–W603
Beerenwinkel N, Däumer M, Sing T, Rahnenführer J, Lengauer T, Selbig J, Hoffmann D, Kaiser R (2005) Estimating HIV evolutionary pathways and the genetic barrier to drug resistance. J Infect Dis 191(11):1953–1960
Burns CC, Campagnoli R, Shaw J, Vincent A, Jorba J, Kew O (2009) Genetic inactivation of poliovirus infectivity by increasing the frequencies of CpG and UpA dinucleotides within and across synonymous capsid region codons. J Virol 83(19):9957–9969
Charif D, Thioulouse J, Lobry J, Perrière G (2004) Online synonymous codon usage analyses with the ade4 and seqinr packages. Bioinformatics 21(4):545–547
Cuevas JM, Geller R, Garijo R, Lopez-Aldeguer J, Sanjuan R (2015) Extremely high mutation rate of HIV-1 in vivo. PLoS Biol 13(9):e1002251
Duffy S (2018) Why are RNA virus mutation rates so damn high? PLoS Biol 16(8):1–6
Ficarelli M, Antzin-Anduetza I, Hugh-White R, Firth AE, Sertkaya H, Wilson H, Neil SJD, Schulz R, Swanson CM (2019) CpG dinucleotides inhibit HIV-1 replication through zinc finger antiviral protein (ZAP)-dependent and -independent mechanisms. J Virol 94(6)
Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321
Haller BC, Messer PW (2019) SLiM 3: forward genetic simulations beyond the Wright-Fisher model. Mol Biol Evol 36(3):632–637
Hartl DL, Clark AG (2007) Principles of population genetics. Sinauer Associates, Sunderland
Hoelzer K, Shackelton LA, Parrish CR (2008) Presence and role of cytosine methylation in DNA viruses of animals. Nucleic Acids Res 36(9):2825–2837
Karlin S, Cardon LR (1994) Computational DNA sequence analysis. Annu Rev Microbiol 48(1):619–654 PMID: 7826021
Keightley PD (2012) Rates and fitness consequences of new mutations in humans. Genetics 190(2):295–304
Martin DP, Murrell B, Golden M, Khoosal A, Muhire B (2015) RDP4: detection and analysis of recombination patterns in virus genomes. Virus Evol 1(1):vev003
Murphy KM, Weaver C (2016) Janeway’s immunobiology. Garland Science. Taylor & Francis Group, LLC, New York
Poon LL, Song T, Rosenfeld R, Lin X, Rogers MB, Zhou B, Sebra R, Halpin RA, Guan Y, Twaddle A et al (2016) Quantifying influenza virus diversity and transmission in humans. Nat Genet 48(2):195
Rambaut A, Posada D, Crandall KA, Holmes EC (2004) The causes and consequences of HIV evolution. Nat Rev Genet 5(1):52
Sanjuán R, Moya A, Elena SF (2004) The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc Nat Acad Sci 101(22):8396–8401
Schrider DR, Kern AD (2018) Supervised machine learning for population genetics: a new paradigm. Trends Genet 34(4):301–312
Schultz ST, Lynch M (1997) Mutation and extinction: the role of variable mutational effects, synergistic epistasis, beneficial mutations, and degree of outcrossing. Evolution 51(5):1363–1371
Sheehan S, Song YS (2016) Deep learning for population genetic inference. PLoS Comput Biol 12(3):e1004845
Stern A, Doron-Faigenboim A, Erez E, Martz E, Bacharach E, Pupko T (2007) Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach. Nucleic Acids Res 35(suppl 2):W506–W511
Stern A, Te Yeh M, Zinger T, Smith M, Wright C, Ling G, Nielsen R, Macadam A, Andino R (2017) The evolutionary pathway to virulence of an RNA virus. Cell 169(1):35–46
Takata MA, Goncalves-Carneiro D, Zang TM, Soll SJ, York A, Blanco-Melo D, Bieniasz PD (2017) CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature 550(7674):124–127
Theys K, Feder AF, Gelbart M, Hartl M, Stern A, Pennings PS (2018) Correction: within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV. PLoS Genet 14(12):e1007855
Varble A, Albrecht RA, Backes S, Crumiller M, Bouvier NM, Sachs D, García-Sastre A et al (2014) Influenza A virus transmission bottlenecks are defined by infection route and recipient host. Cell Host Microbe 16(5):691–700
Wang GP, Sherrill-Mix SA, Chang K-M, Quince C, Bushman FD (2010) Hepatitis C virus transmission bottlenecks analyzed by deep sequencing. J Virol 84(12):6218–6228
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8):1586–1591
Zanini F, Puller V, Brodin J, Albert J, Neher RA (2017) In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol 3(1):vex003

Acknowledgements

We thank Adi Stern and the Stern lab for discussion. Pleuni Pennings, Victoria Caudill, Sarina Qin, Ryan Winstead, Jasmeen Kaur, Esteban Geo Pineda, Emily Fryer, Caroline Solis, Kaho Tisthammer and Anjani Pradhananga were supported by NSF Grant # 1655212 to Pleuni S. Pennings. Sarah Cobey and Pleuni Pennings were supported by a Grant from the National Evolutionary Synthesis Center (NESCent), NSF # EF-0905606. Angeline K. Chemel, E. Deshawn Hopson, Nicole S. Rodrigues and Caroline Solis were supported by the NIH MARC Grant (T34-GM008574). Dwayne Evans was supported by the NIH MA/MS-PhD Bridge Grant (R25-GM048972). Dwayne Evans, Alejandro G. Lopez, A.R. Mahoney, Rebecca L. Melton, Nathan O’Neill and Ryan Winstead were supported by the NIH RISE Grant (R25-GM059298). Angeline K. Chemel, Kellen Hopp and Jasmine Sims were supported by the NSF STC CCC Grant (DBI 1548297). Dwayne Evans and Alejandro G. Lopez were supported by a Genentech Foundation MS Dissertation Scholarship. Caroline Solis was supported by a Genentech Foundation fellowship. We thank the Student Enrichment Office for supporting student research at SF State Biology. The authors declare no conflicts of interest.

Author information

Authors and Affiliations

Department of Biology, San Francisco State University, San Francisco, CA, USA
Victoria R. Caudill, Sarina Qin, Ryan Winstead, Jasmeen Kaur, Kaho Tisthammer, E. Geo Pineda, Caroline Solis, Scott Roy, Nicole Allen, Milo Aviles, Brittany A. Baker, William Bauer, Shannel Bermudez, Corey Carlson, Edgar Castellanos, Francisca L. Catalan, Angeline Katia Chemel, Jacob Elliot, Dwayne Evans, Natalie Fiutek, Emily Fryer, Samuel Melvin Goodfellow, Mordecai Hecht, Kellen Hopp, E. Deshawn Hopson Jr., Amirhossein Jaberi, Christen Kinney, Derek Lao, Adrienne Le, Jacky Lo, Alejandro G. Lopez, Andrea López, Fernando G. Lorenzo, Gordon T. Luu, Andrew R. Mahoney, Rebecca L. Melton, Gabriela Do Nascimento, Anjani Pradhananga, Nicole S. Rodrigues, Annie Shieh, Jasmine Sims, Rima Singh, Hasan Sulaeman, Ricky Thu, Krystal Tran, Livia Tran, Elizabeth J. Winters, Albert Wong & Pleuni S. Pennings
Department of Biology, University of Oregon, Eugene, OR, USA
Victoria R. Caudill
Quantitative Systems Biology, University of California, Merced, CA, USA
Sarina Qin
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Trevor Bedford
Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Oana Carja
Department of Infectious Disease Epidemiology, London School of Hygiene & Tropical Medicine, London, UK
Rosalind M. Eggo
Department of Biology, Emory University, Atlanta, GA, USA
Katia Koelle
Big Data Institute, University of Oxford, Oxford, UK
Katrina Lythgoe
Department of Neurological Surgery, University of California, San Francisco, CA, USA
Francisca L. Catalan
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
Dwayne Evans
Department of Plant Biology, Carnegie Institution for Science, Stanford, CA, USA
Emily Fryer
Health Sciences Center, University of New Mexico, Albuquerque, NM, USA
Samuel Melvin Goodfellow
UCSD Biomed Sciences PhD Program, University of California, San Diego, CA, USA
Rebecca L. Melton
Biochemistry, Molecular, Cellular and Developmental Biology Graduate Group, University of California, Davis, CA, USA
Nicole S. Rodrigues
UCSF Tetrad Graduate Program, University of California, San Francisco, CA, USA
Jasmine Sims
Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
Sarah Cobey
Department of Environmental Systems Science, ETH Zurich, Zurich, Switzerland
Roland Regoes

Corresponding author

Correspondence to Pleuni S. Pennings.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 32615 KB)

Supplementary material 2 (pdf 634 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Caudill, V.R., Qin, S., Winstead, R. et al. CpG-creating mutations are costly in many human viruses. Evol Ecol 34, 339–359 (2020). https://doi.org/10.1007/s10682-020-10039-z

Received: 05 July 2019
Accepted: 11 March 2020
Published: 24 April 2020
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10682-020-10039-z

CpG-creating mutations are costly in many human viruses

Abstract