Investigating Evolutionary Rate Variation in Bacteria

Rates of molecular evolution are known to vary between species and across all kingdoms of life. Here, we explore variation in the rate at which bacteria accumulate mutations (accumulation rates) in their natural environments over short periods of time. We have compiled estimates of the accumulation rate for 34 species of bacteria, the majority of which are pathogens evolving either within an individual host or during outbreaks. Across species, we find that accumulation rates vary by over 3700-fold. We investigate whether accumulation rates are associated with a number of potential correlates, including genome size, GC content, measures of natural selection and the time frame over which the accumulation rates were estimated. After controlling for phylogenetic non-independence, we find that the accumulation rate is not significantly correlated with any factor. Furthermore, contrary to previous results, we find that it is not affected by the time frame over which the estimate was made. However, our study, with only 34 species, is likely to lack power to detect anything but large effects. We suggest that much of the rate variation may be explained by differences between species in the generation time in the wild.

Electronic supplementary material: The online version of this article (10.1007/s00239-019-09912-5) contains supplementary material, which is available to authorized users.


1. AIMS
The goal of this thesis is to investigate why molecular evolutionary rates vary across bacterial species. Evolutionary rates are known to vary across all kingdoms of life, including plants and animals. However, for bacteria, this topic remains relatively unexplored. This work aims to identify the potential correlates of the accumulation rate in bacteria, which will aid our understanding of bacterial evolution in general.
I first collect all available accumulation rate estimates from the literature and then test whether they correlate with several factors, including genome size, GC content, measures of natural selection and the time-frame over which the accumulation rates are measured.
Secondly, I investigate whether another factor, generation time, can explain the variation in accumulation rates. To do this, I develop a new method to estimate the generation time of bacteria in the wild. The method needs two sources of information: the accumulation rate and the mutation rate. Thus, in addition to collecting accumulation rates, I also collect mutation rates from the literature. I estimate doubling times for five species of bacteria and also the distribution of doubling times across all bacteria.

INTRODUCTION
Knowledge about the rates at which mutations arise and genomic change occurs is crucial to understanding how organisms evolve and adapt and how molecular evolution proceeds. Evolutionary rates are known to vary extensively across species in both prokaryotes and eukaryotes, and this variation will in part be associated with species characteristics and biology. The factors that influence evolutionary rates have been explored in many animal and plant systems (e.g. Bromham 2002; Smith & Donoghue 2008; Welch et al. 2008; Lanfear et al. 2010), but less so in bacteria (though see Rocha et al. 2006; Weller & Wu 2015). Here we investigate variation in the rate at which bacteria accumulate mutations through time in their natural environment over short time periods of a few months to a thousand years. We refer to these as accumulation rates to differentiate them from the mutation rate, the rate at which mutations occur, and the substitution rate, the rate at which mutations fix in a species. These rates of accumulation are commonly estimated using temporally sampled data (Drummond et al. 2003), or concurrent samples from a population with a known date of origin (e.g. from fossil dates or co-speciation events). They vary by orders of magnitude, from species such as Mycobacterium leprae with an accumulation rate of 8.6 x 10^-9 (Schuenemann et al. 2013) to species such as Campylobacter jejuni with a rate of 3.23 x 10^-5 (Wilson et al. 2009).
It remains unclear why the rate at which mutations accumulate varies so much between bacteria. The accumulation rate per year must ultimately depend upon the rate of mutation per year and the probability that a mutation reaches sufficient frequency in the population to be sampled. If some mutations are caused by DNA replication, as seems likely in most organisms, then the mutation rate per year is a function of the mutation rate per generation and the generation time. The probability that a mutation reaches a certain frequency in the population depends upon natural selection, biased gene conversion and the effective population size. We consider each of these explanations in turn.
It has previously been shown that the time-frame over which an accumulation rate is estimated can affect the estimate of evolutionary rate: rates tend to be lower when measured over longer time-frames (Ho & Larson 2006; Ho et al. 2011; Duchene et al. 2014; Biek et al. 2015). This effect is usually attributed to the inefficiency of purifying selection at removing slightly deleterious mutations over shorter time periods, or to problems with reliably estimating rates when the sequences are saturated. This pattern is evident in bacteria (Rocha et al. 2006; Biek et al. 2015); however, the evidence for the pattern is weak. In the most extensive analysis to date, the negative correlation between the accumulation rate and time-frame was a consequence of just two species which had been sampled over a long time period. Furthermore, the authors removed datasets which showed no significant accumulation of mutations through time. This will have biased their analysis towards finding a negative correlation between the accumulation rate and sampling time-frame, because species with slow accumulation rates are removed from the analysis if they are sampled over short time-frames, in which they have not had enough time to accumulate significant numbers of substitutions.
Here we revisit the question of whether the accumulation rate is slower in species sampled over longer time-frames. We do this by comparing the rate of accumulation within species across different sampling times. We find little evidence for an association and consequently move on to explore other potential correlates of the accumulation rate. These include 1) the mutation rate per generation, and 2) the effectiveness of selection. However, we find little evidence that these factors are responsible for the variation in the accumulation rate. This suggests that generation time might be a major factor.
Although the generation time, or doubling time (DT), of bacteria has been measured in the lab for many species, relatively little is known about the DT of bacteria in their natural environment. For example, the bacterium Escherichia coli can divide every 20 minutes in the laboratory under aerobic, nutrient-rich conditions. But how often does it divide in its natural environment in the gut, under anaerobic conditions where it probably spends much of its time close to starvation? And what do we make of a bacterium, such as Syntrophobacter fumaroxidans, which only doubles in the lab every 140 hours (Harmsen et al. 1998)? Does this reflect a slow doubling time in the wild, or our inability to provide the conditions under which it can replicate rapidly?
Estimating the generation time is difficult for most bacteria in their natural environment and very few estimates are available. The DT for intestinal bacteria has been estimated in several mammals by assaying the quantity of bacteria in the gut and faeces. Assuming no cell death, Gibbons & Kapsimalis (1967) estimated the DT for all bacteria in the gut to be 48, 17 and 5.8 hours in hamster, guinea pig and mouse respectively. More recently, Yang et al. (2008) showed that the doubling time of Pseudomonas aeruginosa is correlated with cellular ribosomal content in vitro and used this to estimate the DT in vivo in a cystic fibrosis patient to be between 1.9 and 2.4 hours.
We investigate what we can infer about the generation time in bacteria using a new method that uses two sources of information. The first is the accumulation rate. If we assume that all mutations in the wild are neutral (an assumption that we show in the discussion to be relatively unimportant for this method), then the accumulation rate is an estimate of the mutation rate per year, uy. Second, we can estimate the rate of mutation per generation, ug, in the lab using a mutation accumulation experiment and whole genome sequencing, or through fluctuation tests. If we assume that the mutation rate per generation is the same in the wild and in the lab, an assumption we discuss further below, then dividing the accumulation rate per year in the wild by the mutation rate per generation in the lab gives an estimate of the number of generations that the bacterium goes through per year in the wild, and hence the doubling time (DT = 8760 x ug / uy, where 8760 is the number of hours per year).
In summary, we investigate why the rate of accumulation varies between bacterial species; we consider a number of explanations including the time-frame over which the estimates have been sampled, variation in the mutation rate and the efficiency of natural selection. We also attempt to estimate the generation time of bacteria in the wild, as a means to investigate whether variation in the generation time is a potential explanation for the variation in the rate of accumulation.

Data collection
We compiled estimates of the accumulation rates from the literature (Appendix 1). For some species we obtained multiple estimates and in most analyses we use the average of these (Appendix 2). We also compiled estimates of the mutation rate from the literature and only used estimates that came from a mutation accumulation experiment with whole genome sequencing. If we had multiple estimates of the mutation rate, we summed the number of mutations across the mutation accumulation experiments and divided this by the product of the genome size and the number of generations that were assayed (Appendix 3). The genome size and GC content for each species are the averages of all complete genomes on NCBI for that species. Nucleotide diversity estimates were calculated using orthologous sequence alignments for each species, which were constructed using ODoSE (Vos et al. 2013; http://www.odose.nl) and in-house scripts written in Python (https://www.python.org) (Appendix 2). Lab doubling times were taken from (Vieira-

We recalculated the accumulation rates in two cases in which the number of accumulated mutations had been divided by an incorrect number of years: E. coli and Helicobacter pylori. For E. coli, we re-estimated the accumulation rate using BEAST by constructing sequences of the SNPs reported in the paper and the isolation dates. For Helicobacter pylori, we used two groups of strains in which strains were sampled from a patient at 0, 3 and 16 years; in both cases the 3-year and 16-year strains appear to form a clade to the exclusion of the 0-year strain because they share some common differences from the 0-year strain. We do not know when the 3-year and 16-year strains diverged, but assuming a molecular clock we can estimate the rate as follows: if the numbers of substitutions that have accumulated between the common ancestor of the 3-year and 16-year strains and each of the two strains are S3 and S16 respectively, then the rate of accumulation can be estimated as (S16 - S3)/(13 years x genome size) (Figure 1). Using the numbers of substitutions reported in figure 1 of the original study, we estimated the accumulation rate to be 5 x 10^-6 (for isolates NQ1707 and NQ4060) and 5.9 x 10^-6 (for NQ1671 and NQ4191).

We excluded some accumulation rate estimates for a variety of reasons. We only considered accumulation rates sampled over a historical time-frame of at most 1500 years. Most of our estimates of the accumulation rate are for all sites in the genome, so we excluded cases in which only the synonymous accumulation rate was given. We also excluded accumulation rates from hypermutable strains. Accumulation and mutation rate estimates used in the analysis are given in supplementary tables S1 and S2 respectively.
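The clock-based estimate for two dated strains that share a common ancestor can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, and the substitution counts and genome size in the example are made-up values, not the data used in the analysis.

```python
def accumulation_rate(s_early, s_late, years_between, genome_size):
    """Per-site, per-year accumulation rate from two dated sister strains.

    s_early and s_late are the numbers of substitutions separating the
    earlier-sampled and later-sampled strain from their common ancestor.
    Under a molecular clock, the extra substitutions on the later strain
    accumulated during the interval between the two sampling dates.
    """
    if years_between <= 0 or genome_size <= 0:
        raise ValueError("years_between and genome_size must be positive")
    return (s_late - s_early) / (years_between * genome_size)


# Hypothetical counts and genome size, for illustration only.
rate = accumulation_rate(s_early=4, s_late=30, years_between=13,
                         genome_size=2_000_000)
print(rate)
```

The date of the common ancestor cancels out of the calculation, which is why it does not need to be known.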

Testing for phylogenetic inertia
To estimate phylogenetic signal in the accumulation rates and all other traits, we generated phylogenetic trees for the 34 species for which we have accumulation rate estimates (Appendix 5). 16S rRNA sequences were downloaded from the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/) and aligned using MUSCLE (Edgar 2004) in Geneious version 10.0.9 (http://www.geneious.com; Kearse et al. 2012). From these alignments, maximum likelihood trees were constructed in RAxML (Stamatakis 2014). The tests of Pagel (1999) and Blomberg et al. (2003) were applied to the accumulation rates and all other traits using the phylosig function in the R package phytools v.0.6 (Revell 2012). Phylogenetic independent contrasts were carried out according to the method of Felsenstein (1985) using the pic function in ape v.4.1 (Paradis et al. 2004).

Divergence as a function of time
The accumulation rate is expected to decrease as more divergent sequences are sampled because natural selection will remove deleterious genetic variation over time.
To investigate this phenomenon quantitatively, we used a transition matrix to explicitly calculate the distribution of allele frequencies t generations after a mutation was introduced into a haploid population. In the transition matrix the first column represents the population when the mutation is first introduced. If there are N strains (or chromosomes) in the population then there are N+1 rows, where the first row represents loss of the mutation and the (N+1)th row represents fixation. The first column is therefore (0,1,0,0,0…0). To this column we apply selection and drift, where the fitness of the wildtype is 1 and the fitness of the mutant is 1-s. This gives the second column, P(x,2), for all x from 0 to N, and then P(x,3) for all x from 0 to N, etc. The entry in the ith column and jth row is therefore the probability that a mutation introduced as a single copy at generation 1 is present in j-1 copies in the ith generation.

The chance that a sequence sampled t generations in the future is different to the ancestral sequence can be calculated from these probabilities (equation 3). If we have two strains diverging from each other, then the overall divergence, assuming that mutations do not occur at the same site, which is reasonable for low levels of divergence, is twice this. We are interested in how selection affects the rate of accumulation and so we need to divide by the accumulation rate for neutral mutations, which is equivalent to dividing equation 3 by t. In reality, not all deleterious mutations are subject to the same strength of selection, so we sampled mutations from a gamma distribution, calculated P(x,s,t) for each and then averaged across mutations. We sampled 100 mutations for each set of parameters. We initially constructed a transition matrix with 100 strains to study the pattern from 0 to 4N generations, but then subsequently investigated the pattern in more depth within the first 0.1N generations by constructing a transition matrix with 1000 strains and the first 0.01N generations.
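The transition-matrix calculation can be sketched as below. This is a minimal sketch that assumes the standard haploid Wright-Fisher form for the two steps (selection rescales the mutant frequency by relative fitness, and drift is binomial sampling of N chromosomes); the exact expressions used in the analysis may differ, and all function names are hypothetical.

```python
from math import comb


def wf_transition_matrix(n, s):
    """Return T with T[j][i] = P(j copies next generation | i copies now)."""
    if not 0 <= s < 1:
        raise ValueError("s must satisfy 0 <= s < 1")
    matrix = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        # Selection (assumed form): mutant fitness 1 - s, wildtype fitness 1.
        p = i * (1 - s) / (i * (1 - s) + (n - i))
        for j in range(n + 1):
            # Drift (assumed form): binomial sampling of n chromosomes.
            matrix[j][i] = comb(n, j) * p**j * (1 - p) ** (n - j)
    return matrix


def copy_number_distribution(n, s, t):
    """Distribution of copy number in generation t (t >= 1) for a mutation
    present as a single copy in generation 1."""
    matrix = wf_transition_matrix(n, s)
    dist = [0.0] * (n + 1)
    dist[1] = 1.0  # the first column: (0, 1, 0, ..., 0)
    for _ in range(t - 1):
        dist = [sum(matrix[j][i] * dist[i] for i in range(n + 1))
                for j in range(n + 1)]
    return dist


def mean_frequency(dist):
    """Chance that a randomly sampled chromosome carries the mutation."""
    n = len(dist) - 1
    return sum(copies * prob for copies, prob in enumerate(dist)) / n


# Small illustrative population; s = 0.1 is an arbitrary example value.
neutral = copy_number_distribution(n=20, s=0.0, t=10)
deleterious = copy_number_distribution(n=20, s=0.1, t=10)
print(mean_frequency(neutral), mean_frequency(deleterious))
```

For a neutral mutation the mean frequency stays at 1/N in every generation, whereas for a deleterious mutation it declines with t, which is the behaviour that makes the accumulation rate fall over longer time-frames.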

Estimating doubling times
We estimated the DT of individual species and the distribution across species using the formula DT = 8760 x ug / uy, where ug is the mutation rate per generation as estimated from a mutation accumulation experiment, uy is the mutation rate per year estimated from the accumulation rate, and 8760 is the number of hours per year. The standard error associated with our estimate of the doubling time was obtained using the standard formula for the variance of a ratio: V(x/y) = (M(x)/M(y))^2 (V(x)/M(x)^2 + V(y)/M(y)^2), where M and V are the mean and variance of x and y. The variance for the accumulation rate was either the variance between multiple estimates of the accumulation rate if they were available, or the variance associated with the estimate if there was only a single estimate. The variance associated with the mutation rate was derived by assuming that the number of mutations was Poisson distributed.
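The doubling time formula and the ratio-variance approximation can be written out as a short sketch. The function names are hypothetical and the rates and variances in the example are illustrative values, not estimates from the dataset.

```python
from math import sqrt

HOURS_PER_YEAR = 8760


def doubling_time_hours(ug, uy):
    """DT = 8760 x ug / uy, with ug per generation and uy per year."""
    return HOURS_PER_YEAR * ug / uy


def ratio_variance(mean_x, var_x, mean_y, var_y):
    """V(x/y) = (M(x)/M(y))^2 (V(x)/M(x)^2 + V(y)/M(y)^2)."""
    return (mean_x / mean_y) ** 2 * (var_x / mean_x**2 + var_y / mean_y**2)


def poisson_rate_variance(n_mutations, sites_times_generations):
    """Variance of a rate n/d when the count n is Poisson distributed.

    A Poisson count has variance equal to its mean, so the variance of
    n/d is approximately n/d^2.
    """
    return n_mutations / sites_times_generations**2


def doubling_time_se(ug, var_ug, uy, var_uy):
    """Standard error of DT, scaling the ratio variance by 8760^2."""
    return HOURS_PER_YEAR * sqrt(ratio_variance(ug, var_ug, uy, var_uy))


# Illustrative values only.
ug, uy = 2e-10, 1e-6
var_ug = poisson_rate_variance(n_mutations=100, sites_times_generations=5e11)
var_uy = 4e-14
print(doubling_time_hours(ug, uy), doubling_time_se(ug, var_ug, uy, var_uy))
```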
To infer the distribution of DTs across bacteria we fit log-normal distributions to the accumulation and mutation rate data by taking the natural log of the values and then fitting a normal distribution by maximum likelihood using the FindDistributionParameters function in Mathematica. Normal Q-Q plots for the accumulation and mutation rate data were produced using the qqnorm function in R version 1.0.143. In fitting these distributions, we have not taken into account the sampling error associated with the accumulation and mutation rate estimates. However, these sampling errors are small compared to the variance between species: for the accumulation rates the variance between species is 3.9 x 10^-11 and the average error variance is an order of magnitude smaller at 3.6 x 10^-12; for the mutation rate data, the variance between species is 7.5 x 10^-18 and the average variance associated with sampling is more than two orders of magnitude smaller at 1.8 x 10^-20. Note that we cannot perform these comparisons of variances on a log scale because we do not have variance estimates for the log accumulation and mutation rates.
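For a log-normal distribution, the maximum likelihood fit has a closed form: the mean and the (divide-by-n) variance of the logged values. A minimal Python sketch of that step is given below; it stands in for the Mathematica fit described above rather than reproducing it, the function name is hypothetical, and the input rates are illustrative values.

```python
from math import log
from statistics import fmean, pvariance


def fit_lognormal(values):
    """Maximum likelihood fit of a log-normal distribution.

    Returns the mean and variance of log(values). pvariance divides by n,
    which is the maximum likelihood estimator of a normal variance.
    """
    logs = [log(v) for v in values]
    return fmean(logs), pvariance(logs)


# Illustrative per-site, per-year rates; not the compiled estimates.
example_rates = [8.6e-9, 1.2e-7, 1.5e-6, 2.0e-6, 3.2e-5]
mu, sigma2 = fit_lognormal(example_rates)
print(mu, sigma2)
```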

Across species
We compiled estimates of the accumulation rate for 34 species of bacteria. These vary by over 3700-fold (Figure 2), but the majority of species accumulate mutations at rates of between 1 x 10^-6 and 2 x 10^-6 per site per year. In the sections below, we investigate what might cause this variation by looking for variables which correlate with the accumulation rate. Because the accumulation rate varies over orders of magnitude, all analyses were performed on the log of the accumulation rate. In such an analysis it can be important to correct for phylogenetic non-independence if there is phylogenetic inertia. We tested for phylogenetic inertia by inferring the phylogeny of our species using 16S rRNA and then applying the tests of Pagel (1999) and Blomberg, Garland and Ives (2003). We find that the accumulation rates show phylogenetic inertia using Pagel's λ but not Blomberg et al.'s K, and some, but not all, of our other variables also show inertia, including genome size and GC content (Table 1).

Sampling time
The time interval over which evolutionary rates are measured is thought to affect rate estimates so that they become slower when measured over longer time-frames (Ho et al. 2011; Biek et al. 2015). This is as we might expect if a substantial fraction of mutations are mildly deleterious, since they would appear over a short time-scale but ultimately be removed by natural selection. Evidence for this effect comes from the observation that the relative rate at which non-synonymous and synonymous mutations accumulate in bacterial genomes declines as a function of time (Rocha et al. 2006; Balbi & Feil 2007). The sampling time-frames in our dataset extend to just over 1500 years. We find a highly significant negative relationship between accumulation rate and sampling time. However, when we consider the correlation between the accumulation rate and the sampling time-frame within 12 species using ANCOVA, we find no correlation (slope = 0.022, p = 0.79) (Figure 3). Furthermore, we find no relationship between the relative rates at which non-synonymous and synonymous mutations accumulate and the time-frame over which the accumulation rate estimate was made (r = 0.2, p = 0.53), although for most datasets the accumulation rate was not calculated for the two types of site separately. In conclusion, we do not find strong evidence for a sampling time effect.

The absence of a relationship between the accumulation rate and sampling time might seem surprising given that there is ample evidence that slightly deleterious mutations segregate in bacterial populations; for example, Hughes (2005) showed that non-synonymous polymorphisms segregate at lower frequencies than synonymous polymorphisms in most species of bacteria. So, we would expect the rate of accumulation to decline as time progresses. To investigate this further, we derived the expected relationship between the accumulation rate and time using population genetic theory (see Materials and Methods).
We assume all mutations are drawn from a distribution of fitness effects (DFE), modelled as a gamma distribution, in which all mutations are either effectively neutral or deleterious. We find, as expected, that the rate of accumulation declines. However, it is evident that it will be difficult to detect differences in accumulation rate unless accumulation rates are sampled over both a very short time frame (<0.1N generations, where N is the population size) and a much longer time frame (Figure 5). This is because within a restricted time frame there is very little difference in accumulation rate.

Mutation rate
The rate at which bacteria accumulate mutations through time will in part be determined by the rate at which mutations occur per unit time. If some mutations are caused by DNA replication then the mutation rate per year will depend upon the mutation rate per generation and the generation time. We test each of these components in turn.
Unfortunately, it is difficult to directly test for a relationship between the accumulation rate and the mutation rate per generation because only five species in our dataset have estimates of both these rates. The correlation between the accumulation rate and mutation rate per generation is 0.07 (p = 0.9), but with so little information it is difficult to determine whether a correlation exists. However, it may be possible to test the relationship between the accumulation rate and the mutation rate per generation indirectly because some genomic traits correlate with the mutation rate per generation. For instance, genome size is inversely correlated with the mutation rate per site per generation (Drake 1991; Lynch 2010; Lynch et al. 2016). We find a negative relationship between the mutation rate and genome size (r = -0.68, p < 0.001), although this is mostly driven by Mesoplasma florum (Appendix 6) and the correlation is weaker when we remove M. florum (r = -0.39, p = 0.053).
A negative correlation between genome size and the accumulation rate has previously been observed for a range of viruses and bacteria (Lynch 2010; Biek et al. 2015), and we also find a strong negative correlation between the accumulation rate and genome size (Figure 6a; r = -0.43, p = 0.01), which becomes stronger when the obvious outlier B. aphidicola is excluded (r = -0.57, p < 0.001). The relationship is also negative, but loses significance, if we control for phylogeny using phylogenetic independent contrasts (PICs) after excluding low variance comparisons [...].

Genomic base composition may also correlate with the mutation rate per generation. GC content is known to vary greatly across bacterial species, from less than 20% to over 70%. The origins of this variation remain unresolved. There is evidence that it is not solely a consequence of mutation bias (Hildebrand et al. 2010; Hershberg & Petrov 2010) and that biased gene conversion may be a factor (Lassalle et al. 2015). Given that the pattern of mutation is generally AT-biased in bacteria (Hershberg & Petrov 2010), variation in GC content due to selection or biased gene conversion can potentially generate variation in the mutation rate by shifting the GC content away from its equilibrium value (Krasovec et al. 2017).
This effect may explain why Mesoplasma florum's mutation rate is so high: although it has very low genomic GC content, the equilibrium GC content is predicted to be substantially lower (Krasovec et al. 2017). This will lead to a positive correlation between the accumulation rate and GC content. The mutation rate may also be negatively correlated with GC content due to variation in effective population size; a low effective population size may lead to lower GC content but a higher mutation rate because selection on mutation rate modifiers is relaxed and repair genes are lost. [...] and GC content (r = 0.473, p = 0.0094), although this is lost when we account for phylogenetic non-independence (r = 0.32, p = 0.168).
We observe a negative correlation between GC content and the mutation rate (r = -0.59, p = 0.0016) (Appendix 7), and we also find a strong negative correlation between the accumulation rate and the GC content (r = -0.53, p = 0.001; Fig. 6a). Again, B. aphidicola is a conspicuous outlier and if it is removed the correlation is stronger (r = -0.613, p < 0.001). This negative relationship is maintained and is almost significant for [...].

We have detected moderately significant correlations between the accumulation rate and both genome size and GC content. These two variables are correlated with each other, but a multiple regression of accumulation rate against both yields a marginally significant result for GC content (p = 0.037) and a non-significant result for genome size (p = 0.45), and neither is significant when we control for phylogeny; it is therefore not possible for us to clearly resolve which might be the true correlate. Both could conceivably be linked to the mutation rate per generation. Under the drift-limit hypothesis the mutation rate is expected to be negatively correlated with genome size, because larger genomes have potentially more deleterious mutations and this leads to more effective selection on the mutation rate (Lynch 2010; Lynch 2017). GC content could be related to the mutation rate either through its correlation with genome size, a correlation for which there is no clear explanation, or because GC content is a crude measure of how far a genome is from its equilibrium GC content; if the mutation pattern is AT-biased then increasing GC content increases the mutation rate (Krasovec et al. 2017).

Effectiveness of selection
Selection and biased gene conversion will affect the probability that a mutation spreads to fixation in a population. Accumulation rates are estimated by excluding sites which are inferred to have recombined, and hence biased gene conversion is unlikely to explain the variation. In contrast, purifying selection will act to reduce the number of deleterious mutations surviving in populations, leading to a reduction in the accumulation rate. How effective selection is depends on the power of random genetic drift, i.e. the effective population size. We can potentially measure the effectiveness of selection by considering the ratio of the nucleotide diversity at non-synonymous and synonymous sites (πN/πS); populations with more efficient selection should have lower values of πN/πS. We consider the efficiency of selection using two sources of data: the ratio of the number of non-synonymous to synonymous polymorphisms, pN/pS, for the strains used to estimate the accumulation rate, and πN/πS in the species as a whole. We find no correlation between the accumulation rate and pN/pS in the strains used to estimate it (r = 0.07, p = 0.84), but we have only nine data points. We find an almost significant correlation between the species-wide πN/πS and the accumulation rate (r = -0.35, p = 0.062), but none if we control for phylogenetic inertia (r = -0.1, p = 0.65).

Lifestyle
We examined whether there are differences in the accumulation rate for bacteria with different lifestyles. Most of our species are pathogens, and we divided these into obligate pathogens and opportunistic pathogens. We find that the accumulation rates do not differ significantly between these two groups (t-test, p = 0.488). We further carried out an analysis controlling for phylogenetic non-independence by comparing sister pairs of species. Again, we find no evidence that the two groups are significantly different (paired sample t-test, p = 0.947). Thus, lifestyle does not seem to have any clear impact on the accumulation rate.

All factors
We further carried out a multivariate analysis in which we included all our variables in a multiple regression (apart from our estimates of DTs in the wild). When we consider the raw values, only genome size comes out as significant (p = 0.0153), and when we consider the phylogenetic independent contrasts, lab doubling times and πN/πS come out as marginally significant with similar effect sizes (standardized regression coefficients = -0.095, p = 0.080 and 1.01, p = 0.063, respectively); this suggests that accumulation rates may be higher in species with short lab DTs and smaller Ne.

Generation time
It is likely that the accumulation rate should correlate negatively with generation time (or doubling time) because species with shorter generation times will accumulate more DNA replication errors per unit time. Eukaryotes appear to display this generation time effect (Bromham 2002; Smith & Donoghue 2008; Welch et al. 2008; Lanfear et al. 2010) and it is also evident in bacteria (Weller & Wu 2015; although see Maughan 2007). Furthermore, the accumulation rate may also increase in populations that are rapidly expanding, for instance during epidemic disease, because of a reduction in generation time (Cui et al. 2013).
However, we find no relationship between the accumulation rate and the doubling [...] (Table 2). In all cases the estimated DT in the wild is greater than that of the bacterium in the lab. For example, E. coli can double every 20 minutes in the lab but we estimate that it only doubles every 15 hours in the wild.
In theory, it might be possible to estimate the DT in those bacteria for which we have either an accumulation or mutation rate estimate, but not both, by finding factors that correlate with either rate and using those factors to predict the rates. Unfortunately, we have been unable to find any factor that correlates sufficiently well to be usefully predictive. As mentioned, it has been suggested that the mutation rate is correlated with genome size in microbes (Drake 1991) [...].

In what follows we assume that the mutation and accumulation data are phylogenetically independent, an assumption we address below. We can estimate the distribution of DTs by fitting distributions to the accumulation and mutation rate data, using maximum likelihood, and then dividing one distribution by the other. We assume that both variables are log-normally distributed, an assumption which is supported by Q-Q plots with the exception of the mutation rate per generation in Mesoplasma florum, which is a clear outlier (Figure 8). We repeated all our analyses with and without M. florum.

The spread of the resulting distribution depends on Cov(g,y), the covariance between the accumulation and mutation rates. We might expect that species with higher mutation rates also have higher accumulation rates, because the accumulation rate is expected to depend on the mutation rate, but the correlation between the two will depend upon how variable the DT and other factors, such as the strength of selection, are between bacteria. The observed correlation between the log accumulation rate and log mutation rate is 0.077, but there are only five data points, so the 95% confidence intervals on this estimate encompass almost all possible values (-0.86 to 0.89). We therefore explore different levels of the correlation between the accumulation and mutation rates; it should be noted that Cov(g,y) can be expressed as Sqrt(vg vy) Corr(g,y), where Corr(g,y) is the correlation between the two variables.
The distribution of DTs in the wild inferred using our method is shown in Figure 8. We infer the median doubling time to be 7.04 hours, but there is considerable spread around this even when the accumulation and mutation rates are strongly correlated (Figure 8A); as the correlation increases the variance in DTs decreases, but the median remains unaffected. The analysis suggests that most bacteria have DTs of between 1 and 100 hours, but there are substantial numbers with DTs beyond these limits. For example, even if we assume that the correlation between the accumulation and mutation rate is 0.5, we infer that 10% of bacteria have a DT faster than one hour in the wild and 4.2% have a DT slower than 100 hours in the wild. If we remove the Mesoplasma florum mutation rate estimate from the analysis, the median doubling time is slightly lower at 6.16 hours, but there is almost as much variation as when this bacterium is included; at a correlation of 0.5 we infer that 12% of bacteria have a DT faster than one hour in the wild and 3.5% have a DT slower than 100 hours. [...] with a substantial fraction of bacteria with long DTs and also some with very short DTs (Figure 10).
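Dividing one log-normal distribution by another can be sketched as follows, assuming that g and y denote the log mutation rate per generation and the log accumulation rate per year, with variances vg and vy. Under that assumption log DT is normally distributed with mean log(8760) + mean(g) - mean(y) and variance vg + vy - 2 Cov(g,y). The function name and all parameter values below are hypothetical; they are not the fitted values from the analysis.

```python
from math import exp, log, sqrt
from statistics import NormalDist

HOURS_PER_YEAR = 8760


def log_dt_distribution(mean_g, var_g, mean_y, var_y, corr):
    """Normal distribution of log DT for correlated log-normal rates.

    Requires a positive resulting variance (e.g. corr < 1).
    """
    cov = sqrt(var_g * var_y) * corr  # Cov(g,y) = Sqrt(vg vy) Corr(g,y)
    mean = log(HOURS_PER_YEAR) + mean_g - mean_y
    var = var_g + var_y - 2 * cov
    return NormalDist(mean, sqrt(var))


# Hypothetical log-scale parameters, for illustration only.
params = dict(mean_g=log(3e-10), var_g=1.0, mean_y=log(1.5e-6), var_y=1.5)
for corr in (0.0, 0.5):
    dist = log_dt_distribution(corr=corr, **params)
    median_hours = exp(dist.mean)
    faster_than_1h = dist.cdf(log(1.0))
    slower_than_100h = 1.0 - dist.cdf(log(100.0))
    print(corr, median_hours, faster_than_1h, slower_than_100h)
```

Because the covariance enters only the variance term, changing the assumed correlation narrows or widens the distribution of DTs without moving its median.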
Here, we have assumed that there is no phylogenetic inertia within the accumulation and mutation rate estimates. As stated above, to test whether this is the case we constructed a phylogenetic tree using 16S rRNA sequences and applied the tests of Pagel (1999) and Blomberg et al. (2003). We also find some evidence that the data depart from a Brownian motion model using Pagel's test (i.e. λ is significantly less than one) for the accumulation data (p < 0.001) but not the mutation rate data (p = 0.094); i.e. the accumulation rates are more different than we would expect from their phylogeny and a Brownian motion model. A visual inspection of the data suggests that the phylogenetic signal is largely contributed by species that are closely related, rather than by deeper phylogenetic levels (Figure 11A, B), and species for which we have accumulation and mutation rate estimates are interspersed with one another on the phylogenetic tree (Figure 11C). It therefore seems unlikely that phylogenetic inertia will influence our results.

DISCUSSION
The rate at which bacteria accumulate mutations over short timeframes of 1 to 1500 years varies by three orders of magnitude. The rate of accumulation must depend on the mutation rate per year and the strength of natural selection, and in turn the mutation rate per year is likely to depend on the mutation rate per generation and the generation time. Unfortunately, we find no very clear correlate of the accumulation rate; the accumulation rate is significantly correlated to the GC-content and genome size, but neither factor is significant when we control for phylogeny. There is a hint that both lab DT and the effective population size may be important, since these emerge as marginally significant in a multiple regression of all factors when we control for phylogeny. The lack of any clear correlate may be a result of the size of our dataset; we have data from just 34 species and many of the accumulation rates are estimated with considerable error. It is likely that the number of data points will increase considerably over the coming years and a more powerful analysis will be possible.
It has previously been shown that the accumulation rate is correlated to the timeframe over which it is measured. This relationship is expected given that deleterious mutations can segregate in a population but are ultimately removed from it. However, in the study of Duchenne et al. [...] Such a relationship is very likely to exist, but we have been unable to detect it, and it is clearly not responsible for most of the variation in the accumulation rate.
[Figure legend: red = accumulation rate, green = mutation rate and blue = both a mutation rate and an accumulation rate.]
We find only very weak evidence that the accumulation rate is correlated to the doubling time, as measured in the lab. However, this is perhaps not surprising. Few bacteria probably double at anything like their lab-measured rates in their natural environment. We have recently estimated the DT of five bacterial species indirectly. We have used estimates of the rate at which bacteria accumulate mutations in their natural environment and estimates of the rate at which they mutate in the laboratory to estimate the DT for these five bacteria and to infer the distribution of DTs across bacteria. We estimate that DTs are generally longer in the wild than in the lab, but critically we also infer that DTs vary by several orders of magnitude between bacterial species and that many bacteria have very long DTs in their natural environment.
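The logic of the indirect estimate can be sketched as follows, assuming that all accumulated mutations are neutral and that the mutation rate per generation is the same in the lab and in the wild. The function name and the input rates are hypothetical and chosen only to illustrate the arithmetic.

```python
HOURS_PER_YEAR = 24 * 365.25

def doubling_time_hours(accumulation_per_site_per_year,
                        mutation_per_site_per_generation):
    """Estimate the DT in hours from an accumulation rate and a mutation rate.

    Assumes accumulation rate = mutation rate per generation x generations per year.
    """
    generations_per_year = (accumulation_per_site_per_year
                            / mutation_per_site_per_generation)
    return HOURS_PER_YEAR / generations_per_year

# Hypothetical inputs: 1e-6 substitutions/site/year and 2e-10 mutations/site/generation
dt = doubling_time_hours(1e-6, 2e-10)
print(f"{dt:.2f} hours")
```

In this illustration the ratio of the two rates implies 5000 generations per year, that is, one doubling every 1.75 hours or so. A lower accumulation rate or a higher mutation rate per generation would both imply a longer DT.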
The method by which we have inferred the DT in the wild makes three important assumptions. The first is that the mutation rate per generation is the same in the lab and in the wild. However, it seems likely that bacteria in the wild will have a higher mutation rate per generation than those in the lab for two reasons. First, bacteria in the wild are likely to be stressed and this can be expected to elevate the mutation rate (Bjedov et al. 2003; Galhardo et al. 2007; Foster 2007; Maclean et al. 2013; Shewaramani et al. 2017). Second, if we assume that DTs are longer in the wild than in the lab, then we expect the mutation rate per generation to be higher in the wild than in the lab because some mutational processes do not depend upon DNA replication.
The relative contribution of replication-dependent and replication-independent mutational mechanisms to the overall mutation rate is unknown. Rates of substitution are higher in Firmicutes that do not undergo sporulation, suggesting that replication is a source of mutations in this group of bacteria (Weller & Wu 2015; but see Maugham 2007). However, rates of mutation accumulation seem to be similar in latent versus active infections of M. tuberculosis, suggesting that replication-independent mutations might dominate in this bacterium (Lillebaek et al. 2016).
The second major assumption is that the rate at which mutations accumulate in the wild is equal to the mutation rate per year; in effect, we are assuming that all mutations are effectively neutral, at least over the timeframe in which they are assayed (or that some are inviable, but the same proportion are inviable in the wild and the lab). In those accumulation rate studies in which they have been studied separately, non-synonymous mutations accumulate more slowly than synonymous mutations; relative rates vary from 0.13 to 0.8, with a mean of 0.57 (Table A3). There is no correlation between the timeframe over which the estimate was made and the ratio of non-synonymous to synonymous accumulation rates (r = 0.2, p = 0.53). We did not attempt to control for selection because the relative rates of synonymous and non-synonymous accumulation are only available for a few species, and the relative rates vary between species. However, we can estimate the degree to which selection against deleterious mutations in the wild biases our estimate of the DT as follows. The observed rate at which mutations accumulate in a bacterial lineage is μobs = μtrue[αδi + (1 − α)(βδn + (1 − β)δs)], where μtrue is the underlying mutation rate per year, α is the proportion of the genome that is non-coding and β is the proportion of mutations in protein coding sequence that are non-synonymous. δx is the proportion of mutations of class x (i = intergenic, s = synonymous, n = non-synonymous) that are effectively neutral. α and β are approximately 0.15 and 0.7, respectively, in our dataset. Although there is selection on synonymous codon use in many bacteria (Hershberg & Petrov 2008), selection appears to be weak (Sharp et al. 2005); we therefore assume that δs = 1. This implies, from the rate at which non-synonymous mutations accumulate relative to synonymous mutations, that δn = 0.6.
A recent analysis of intergenic regions in several species of bacteria has concluded that selection is weaker in intergenic regions than at non-synonymous sites; we therefore assume that δi = 0.8 (Thorpe et al. 2017). If there is recombination within a clade, it can affect the phylogeny and potentially lead to the root of the tree being estimated as younger than it should be. This will lead to an over-estimate of the DT.
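The effect of the selection assumptions described above can be checked with a short calculation. The sketch assumes that the observed accumulation rate is the underlying mutation rate per year weighted by the effectively neutral fraction of each site class (intergenic, synonymous, non-synonymous); this weighting is a reconstruction from the stated definitions of α, β and δx, and the function name is hypothetical.

```python
def neutral_fraction(alpha, beta, delta_i, delta_s, delta_n):
    """Fraction of new mutations that are effectively neutral and so accumulate.

    alpha:   proportion of the genome that is non-coding
    beta:    proportion of coding mutations that are non-synonymous
    delta_*: effectively neutral fraction for intergenic (i), synonymous (s)
             and non-synonymous (n) mutations
    """
    coding = beta * delta_n + (1.0 - beta) * delta_s
    return alpha * delta_i + (1.0 - alpha) * coding

# Values quoted in the text: alpha = 0.15, beta = 0.7, delta_i = 0.8,
# delta_s = 1, delta_n = 0.6
f = neutral_fraction(0.15, 0.7, 0.8, 1.0, 0.6)
print(f"neutral fraction = {f:.3f}, true/observed rate = {1.0 / f:.2f}")
```

Under this assumed weighting the quoted values give a neutral fraction of about 0.73, so the underlying mutation rate per year would be roughly 1.37 times the observed accumulation rate, and a DT calculated from the uncorrected accumulation rate would be too long by the same factor.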
It is important to appreciate that our method estimates an average DT within a particular environment that the bacteria were sampled from. The bacterium may go through periods of quiescence interspersed with periods of growth.
Despite the assumptions we have made in our method, our estimate of the DT of P. aeruginosa of 2.3 hours in a cystic fibrosis patient is very similar to that independently estimated using the ribosomal content of cells, of between 1.9 and 2.4 hours (Yang et al. 2008). There is also independent evidence that there are some bacteria that divide slowly in their natural environment. The aphid symbiont Buchnera aphidicola is estimated to double every 175-292 hours in its host (Ochman et al. 1999; Clark et al. 1999), and Mycobacterium leprae doubles every 300-600 hours on mouse footpads (Shepard 1960; Rees 1964; Levy 1976), which is not its natural environment, but one that is probably similar to human skin. Furthermore, in a recent selection experiment, Avrani et al. (2017) found that several E. coli populations, which were starved of resources, accumulated mutations in the core RNA polymerase gene. These mutations caused these strains to divide more slowly than unmutated strains when resources were plentiful. Interestingly, these same mutations are found at high frequency in unculturable bacteria, suggesting that there is a class of slow-growing bacteria in the environment that are adapted to starvation. Korem et al. (2015) have recently proposed a general method by which the DT can potentially be estimated. They note that actively replicating bacterial cells have two or more copies of the chromosome near the origin of replication but only one copy near the terminus, if cell division occurs rapidly after the completion of DNA replication.
Using next generation sequencing, they show that it is possible to assay this signal and that the ratio of sequencing depth near the origin and terminus is correlated to bacterial growth rates in vivo. Brown et al. (2016) have extended the method to bacteria without a reference genome and/or those without a known origin and terminus of replication. In principle, these measures of cells performing DNA replication could be used to estimate the DT of bacteria in the wild. However, it is unclear how or whether the methods can be calibrated. The two methods, the peak-to-trough ratio (PTR) of Korem et al. (2015) and the index of replication (iRep) of Brown et al. (2016), are not consistent. They also yield very different estimates for the absolute DT. Korem et al. (2015) show that PTR is highly correlated to the growth rate of E. coli grown in a chemostat. If we assume that the relationship between PTR and growth rate is the same across bacteria in vivo and in vitro, then this implies that the median DT for the human microbiome is ~2.5 hours. In contrast, Brown et al. (2016) estimate the doubling time of Klebsiella oxytoca to be 19.7 hours in a new-born baby using faecal counts and find that this population has an iRep value of ~1.77. This value is greater than that of the vast majority of bacteria in the human microbiome and of bacteria in the Candidate Phyla Radiation, suggesting that most bacteria in these two communities replicate very slowly. These discrepancies between the two methods suggest that it may not be easy to calibrate the PTR and iRep methods to yield estimates of the DT across bacteria.
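The origin-to-terminus coverage signal described above can be illustrated with a deliberately simplified sketch. It assumes read depth has already been counted in equal-sized bins around a circular chromosome and takes the ratio of the highest to the lowest smoothed depth. The published PTR and iRep pipelines are more involved, and the function name and coverage values here are hypothetical.

```python
def peak_to_trough_ratio(coverage, window=3):
    """Ratio of highest to lowest smoothed read depth on a circular chromosome.

    coverage: read depth per equal-sized genome bin, in genome order
    window:   odd number of bins in the circular moving average
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    n = len(coverage)
    half = window // 2
    smoothed = [
        sum(coverage[(i + k) % n] for k in range(-half, half + 1)) / window
        for i in range(n)
    ]
    trough = min(smoothed)
    if trough <= 0:
        raise ValueError("coverage must be positive in every smoothed bin")
    return max(smoothed) / trough

# Hypothetical coverage: highest at the origin (bin 0), lowest at the terminus (bin 5)
depth = [20, 18, 16, 14, 12, 10, 12, 14, 16, 18]
print(peak_to_trough_ratio(depth, window=1))
print(peak_to_trough_ratio(depth, window=3))
```

A ratio near 1 would indicate little ongoing replication, while higher values indicate that more cells carry extra copies of origin-proximal DNA. Smoothing lowers the ratio in this toy example, which is one reason an absolute calibration against the DT is not straightforward.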
How should we interpret our results for the five focal species in the context of what is known of their ecology? Vibrio cholerae displays the shortest DT of 1.1 hours. Vibrio species are ubiquitous in estuarine and marine environments (Reidl & Klose 2002). They are known to have very short generation times in culture, the shortest being V. natriegens of just 9.8 minutes (G. 1961). In the wild they can exploit a wide range of carbon and energy sources, and as such have been termed "opportunitrophs" (Polz et al. 2006). Natural Vibrio communities do not grow at an accelerated rate continuously, but can exist for long periods in a semi-dormant state punctuated by rapid pulses of high growth rates (Blokesch & Schoolnik 2008), or blooms (Takemura et al. 2014), when conditions are favourable. It has also been argued that the unusual division of Vibrio genomes into two chromosomes facilitates more rapid growth (Yamaichi et al. 1999). By pointing to a very short DT in V. cholerae, our analysis is therefore consistent with what is known of the ecology of this species.
Staphylococcus aureus is predominantly found on animals and humans and inhabits various body parts, including the skin and upper respiratory tract (Schenck et al. 2016). It can cause infection of the skin and soft tissue as well as bacteraemia (John 2004). S. aureus exhibits a range of modes of growth, some of which may allow it to survive stress and antimicrobials whilst in its host. For instance, small subpopulations can adopt a slow-growing, quasi-dormant lifestyle, either in a multicellular biofilm or as small colony variants (SCVs) or persister cells (Bui et al. 2017). Our short DT of 1.8 hours suggests this is not the typical state for S. aureus in the wild, which is not surprising considering that the incidence of SCVs in clinical samples is fairly low, between 1 and 30% (Proctor et al. 2006).
Pseudomonas aeruginosa can inhabit a wide variety of environments, including soil, water, plants and animals. Like our other focal species, it is an opportunistic pathogen and can also infect humans, especially those with compromised immune systems, such as patients with cystic fibrosis (CF). In this context infection is chronic. Parallel evolution, the differential regulation of genes which allow it to evade the host immune system and resist antibiotic treatment during infection (Huse et al. 2010), and evidence of positive selection (Smith et al. 2006) suggest that P. aeruginosa can adapt to the lungs of individuals with CF for its long-term survival. It is known to grow actively in sputum (Yang et al. 2008), where it utilises the available nutrition, which supports its growth to high population densities (Palmer et al. 2005). Its ability to adapt and grow actively in CF sputum is consistent with its relatively short DT of 2.3 hours, especially considering that this is the environment in which the accumulation rate was measured; this DT matches that estimated by Yang et al. (2008).
E. coli and S. enterica primarily reside in the lower intestine of humans and animals, but can also survive in the environment. Although E. coli is commonly recovered from environmental samples, it is not thought able to grow or survive for prolonged periods outside of the guts of warm-blooded animals, except in tropical regions where conditions are more favourable (Winfield & Groisman 2003), although some phylogenetically distinct strains appear to reproduce and survive well in the environment (Oh et al. 2012). In contrast, Salmonella, which is also an enteric coloniser of cold-blooded animals, in particular reptiles, is better adapted than E. coli to surviving and growing in environmental niches. For example, Salmonella can survive and grow for at least a year in soil (Davies & Wray 1996), whereas E. coli can survive for only a few days (Bogosian & Sammons 1996). Although these secondary niches may play a greater role in Salmonella than in E. coli, it remains the case that growth rates in the environment will be much lower than those in a gut. Therefore, the increased tenacity of Salmonella in non-host environments compared to E. coli might help to explain the longer DT in this species.
In summary, the availability of accumulation and mutation rate estimates allows us to infer the DT for bacteria in the wild, and the distribution of wild DTs across bacterial species. These DT estimates are likely to be underestimates because the mutation rate per generation is expected to be higher in the wild than in the lab, and some mutations are not generated by DNA replication. Our analysis therefore suggests that DTs in the wild are typically longer than those in the lab, that they vary considerably between bacterial species and that a substantial proportion of species have very long DTs in the wild. This would explain why accumulation rates vary so widely: there is a very large variance in DTs.

CONCLUSION
We wanted to assess the factors that potentially correlate with the accumulation rate in bacteria, to investigate whether we could explain the variation in the accumulation rate found across different species. In total we collected accumulation rate estimates for 34 species of bacteria, which were mostly pathogens evolving either within individual hosts or during an outbreak. These estimates varied 3700-fold and the timeframe over which they were measured ranged from 1 to 1500 years. There are several factors that could be responsible for this huge variation, including the mutation rate, natural selection and the timeframe over which rates are measured. Whilst genome size and GC content, which are proxies for the mutation rate per generation, showed a significant relationship with the accumulation rate, this relationship was lost after controlling for phylogenetic non-independence. Similarly, a measure of the effectiveness of selection, N/ S, revealed an almost significant correlation to the accumulation rate, which was again lost when we controlled for phylogeny. No correlation was found between the accumulation rate and pN/pS for the strains used to estimate it.
Surprisingly, we find little evidence that the sampling time correlates with the accumulation rate. Although we find a significant negative correlation between sampling time and the accumulation rate, this appears to be mainly driven by two species, Yersinia pestis and Mycobacterium leprae, which were sampled over relatively long timeframes.
One final factor that should influence the accumulation rate is generation time. We find no relationship between lab doubling times and the accumulation rate. However, to further this analysis we developed a method to estimate doubling times in the wild.
We estimate this value for five species of bacteria and also the distribution of DTs across all bacteria. Both suggest that DTs for bacteria in the wild are considerably longer than those in the laboratory. Furthermore, they vary by orders of magnitude between different species and it appears that many species double very slowly in the wild. In conclusion, no one factor tested here stands out as a clear candidate for explaining the variation in the accumulation rates of bacteria. We can, however, suggest that the large variation in bacterial doubling times in the wild could be the major factor driving the variation in the accumulation rate across species.