Mutation and Selection in Bacteria: Modelling and Calibration
Abstract
Temporal evolution of a clonal bacterial population is modelled taking into account reversible mutation and selection mechanisms. For the mutation model, an efficient algorithm is proposed to verify whether experimental data can be explained by this model. The selection–mutation model has unobservable fitness parameters, and, to estimate them, we use an Approximate Bayesian Computation algorithm. The algorithms are illustrated using in vitro data for phase variable genes of Campylobacter jejuni.
Keywords
Stochastic modelling · Population genetics · Phase variable genes · Approximate Bayesian computation
Mathematics Subject Classification
92D25 · 62F15 · 60J10
1 Introduction
The objective of this paper is to propose stochastic models for bacterial population genetics together with their calibration. In other words, our aim is not only to construct models but also to suggest algorithms which can answer the question as to whether experimental data can be explained by a model or not. An answer to this question is the key for establishing which mechanisms are dominant in evolution of bacteria. The models are deliberately relatively simple though they capture two important mechanisms of bacterial population genetics: mutation and selection. Simplicity of the models allows their fast calibration, and it is also consistent with the fact that in experiments sample sizes are usually relatively small.
The models are derived, calibrated and tested within the context of phase variable (PV) genes, which occur in many bacterial pathogens and commensals (Bayliss 2009; Bayliss et al. 2012). Phase variation has three properties: (i) an on/off or high/low switch in gene expression; (ii) high switching rates; and (iii) reversible switching between expression states. Two major mechanisms of phase variation involve hypermutable simple sequence repeats (SSR) and high-frequency site-specific recombinatorial changes in DNA topology (Bayliss 2009; Bayliss et al. 2012; Wisniewski-Dyé and Vial 2008; van der Woude and Bäumler 2004; Moxon et al. 1994). We note that, in contrast to phase variation, non-PV mutations have lower rates and extremely rare reverse mutations, while PV genes have high mutation rates (e.g. in the case of Campylobacter jejuni they are estimated to fall between \(4\times 10^{-4}\) and \(4 \times 10^{-3}\)). PV genes can lead to changes in the expression of outer membrane proteins or structural epitopes of large surface molecules whose functions modulate multiple interactions between bacteria and hosts, including adhesion, immune evasion and iron acquisition. Consequently, phase variation can influence host adaptation and virulence. Models accompanied by efficient data assimilation procedures are an important tool for understanding adaptation of bacteria to new environments and ultimately for determining how some bacteria cause disease.
SSR-mediated phase variation is considered herein, as this is the specific mechanism occurring in the genes of C. jejuni which we will use in our illustrative examples. SSR, otherwise known as microsatellites, consist of tandem arrangements of multiple copies of an identical sequence (i.e. the repeat). In C. jejuni, the majority of these SSR consist of non-triplet repeats, poly-G or poly-C, present within the reading frame. Between 18 and 39 PV genes are present in each C. jejuni strain (Aidley et al. 2018). SSR tracts are hypermutable due to a high error rate occurring during DNA replication. Slipped-strand mispairing, the proposed mechanism (Levinson and Gutman 1987), alters gene expression through parent and daughter strand misalignment during replication, which results in deletion or addition of one repeat unit in the newly synthesized strand. Changes in repeat number of a non-triplet repeat present within a reading frame alter the coding sequence of the codon triplets, producing switches in gene expression, and hence the switches in phenotypes referred to as phase variation.
Other modelling approaches to bacterial population genetics can be found in, e.g. Alonso et al. (2014), Saunders et al. (2003) and Moxon and Kussell (2017) (see also references therein). These models have explored the interplay between selection, mutation and population structure for multiple interacting genes with low or high mutation rates and varying levels of selection (Alonso et al. 2014; Gerrish et al. 2013; Palmer and Lipsitch 2006; Wolf et al. 2005; Barrick and Lenski 2013; O’Brien et al. 2013; Raynes and Sniegowski 2014). A subset of these models have explicitly focused on hypermutability, where reversion is a defining and important phenomenon. These models have indicated that the evolution of hypermutability is driven by the strength and period of selection for each expression state, but is also influenced by the frequency of imposition of population bottlenecks (Saunders et al. 2003; Moxon and Kussell 2017; Palmer et al. 2013; Acar et al. 2008). The majority of these models have considered single-gene phenomena and have not provided approaches or adjustable, portable models for application to actual experimental observations. An exception is a model of non-selective bottlenecks of PV genes (Aidley et al. 2017) that was utilized to predict the bottleneck size in observed bacterial populations (Wanford et al. 2018). The aim herein is to develop models that can be used to examine experimentally observed populations and determine whether mutation rate alone, or mutation rate and selection for changes in expression of one or more loci, was driving changes in bacterial population structure. Our main focus here is on host adaptation of a clonal population of hypermutable bacteria, for which we propose a mutation–selection model. The model describes collective behaviour of interacting PV genes and is accompanied by an effective data assimilation procedure.
The rest of the paper is organized as follows. In Sect. 2, we first recall and revise the mutation model from Bayliss et al. (2012), which is a stochastic discrete-time discrete-space model describing the mutation mechanism only. It is derived under the assumptions that the population is of infinite (very large) size maintained during the whole time period of interest, that time is measured in generations, and that all phasotypes have the same survival rate (fitness). Then, we introduce a new model (the mutation–selection model) which takes into account both mutation and selection mechanisms. It generalizes the mutation model by allowing phasotypes to have different fitness levels. We also discuss properties, including long-time behaviour, of both models. Then, we turn our attention to calibration of the models. In Sect. 3, we propose a very efficient algorithm to test whether experimental data can be explained by the mutation model from Sect. 2, and we illustrate the algorithm by applying it to in vitro data for three PV genes of C. jejuni. In Sect. 4, we describe a general methodology for estimating fitness parameters (as well as other quantities) in the mutation–selection model using Approximate Bayesian Computation (ABC), as well as an algorithm for detecting lack of independence between fitness parameters of different genes. In Sect. 5, we illustrate the methodology with applications to synthetic and real data from experiments involving the bacterium C. jejuni. We conclude with a discussion.
2 Models
Assume that a population of bacteria is sufficiently large (for theoretical purposes “near” infinite). As we will see later in this section, this assumption is used in constructing the models to average over branching trees occurring during population evolution, in order to have deterministic dynamics of phasotype distributions. Hence, the required population size depends on the number of genes considered (the more genes, the richer the state space of the models and the larger the population size required) and on the transition (mutation) rates: rare events need to be “recorded” in the population. This simplifying assumption allows us to have tractable models which can be efficiently calibrated, as we show in Sects. 3, 4 and 5. Using the models, we can examine large bacterial populations, say of size 10,000 or more. This is biologically relevant when the population is far from extinction (the case of weak selection, though possibly not of very strong selective pressures that cause high mortality rates and significant reductions in population size) and far from so-called bottlenecks, which may occur due to strong selection or during transmission of bacterial populations between hosts or other environmental niches. The latter deserves separate modelling and study (see, e.g. Aidley et al. 2017; Moxon and Kussell 2017).
In modelling, we neglect continuous-time effects (see, e.g. Crow and Kimura 1970) and measure time in numbers of generations. The number of generations between two time points is evaluated as the time between the points multiplied by an average division rate. This rate can be estimated in experiments by measuring how much time is required for a population to double in the absence of selection. This simplifying assumption neglects effects related to the random timing of bacterial division. To compensate for the use of an average division rate, in calibration (Sects. 3, 4, 5) we assign to each time point a range of possible numbers of generations that have occurred since the previous observation.
Remark 2.1
We assume that \(\xi _{i}\) can take only two values, 0 and 1, since this work is mainly motivated by PV genes, as explained in the Introduction. To study more detailed genome evolution of bacteria (e.g. repeat numbers instead of phasotypes), the models presented in this section can easily be generalized to the case when the random variables \(\xi _{i}\), \(i=1,\ldots ,\ell ,\) take more than two values, without any additional ideas (see, e.g. Hardwick et al. 2009, where a mutation model analogous to the one presented in Sect. 2.1 but with multiple values of \(\xi _{i}\) was used). However, for clarity of the exposition we restrict ourselves to the binary case here.
In Sect. 2.1, we derive a discrete-time discrete-space stochastic model for the evolution of phasotypes after a fixed number of generations n, taking into account only the mutation mechanism of genes. (This shall be referred to herein as the mutation model.) This model was proposed in Bayliss et al. (2012) (see also Hardwick et al. 2009); here, we provide more details, which are needed for clarity of exposition. In Sect. 2.2, a discrete-time discrete-space stochastic model is considered for the binary switching in bacteria which takes into account fitness of genes in addition to mutation. (This shall be referred to herein as the mutation–selection model.) In Sect. 2.3, it is shown when unique stationary distributions exist for both models.
2.1 Genetic Drift Modelling
Assumption 2.1
Assume that the matrix of transitional probabilities T does not change with time.
For clarity of the exposition, let us summarize what is meant by the mutation model in this paper, highlighting all the assumptions made during its derivation.

infinite (very large) size of the population maintained during the whole time period of interest;

time is measured in generations;

each gene can be either in state 0 or 1 (i.e. OFF or ON);

all phasotypes have the same survival rate (fitness);

the matrix T of transitional probabilities does not change with time (Assumption 2.1);
Assumption 2.2
2.2 Mutation–Selection Model
In the previous section, we constructed a mutation model in which it was assumed that all phasotypes have the same fitness. In this section, we will generalize model (7) to include selection. By selection we mean that bacteria with some phasotypes grow faster than bacteria with other phasotypes. To take into account both mutation and selection mechanisms in modelling, we exploit the idea of splitting the dynamics. Without selection, we model mutation using (8) introduced in the previous section. Assuming there is no mutation, we can model selection via reweighting a distribution of the population at each discrete time. Using the idea of splitting, at each discretetime moment we first take into account the mutation mechanism using one step of (8) and then we reweight the resulting population distribution to model the selection mechanism. We now derive the mutation–selection model.
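In code, one step of this splitting idea can be sketched as follows (a minimal illustration, assuming, consistently with the mutation model, that the phasotype transition matrix T is the Kronecker product of per-gene \(2\times 2\) switching matrices with rates \(p_i\) and \(q_i\); the function names are ours):

```python
import numpy as np

def gene_matrix(p, q):
    # One gene's 2x2 switching matrix: p is the 0->1 rate, q the 1->0 rate
    return np.array([[1.0 - p, p],
                     [q, 1.0 - q]])

def build_T(ps, qs):
    # Phasotype transition matrix over 2^l states, assuming genes mutate
    # independently of each other
    T = np.array([[1.0]])
    for p, q in zip(ps, qs):
        T = np.kron(T, gene_matrix(p, q))
    return T

def mutation_selection_step(pi, T, gamma):
    # Splitting: first a mutation step (as in the mutation model), then
    # reweighting by the fitness coefficients gamma and renormalizing
    pi = pi @ T
    pi = pi * gamma
    return pi / pi.sum()
```

With all fitness coefficients equal to 1, the selection reweighting is the identity and the step reduces to the pure mutation dynamics.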
For clarity of the exposition, let us summarize what is meant by the mutation–selection model in this paper, highlighting all the assumptions made during its derivation.

infinite (very large) size of the population maintained during the whole time period of interest;

time is measured in generations;

each gene can be either in state 0 or 1;

the matrix T of transitional probabilities does not change with time (Assumption 2.1);

the vector \(\gamma \) of fitness coefficients does not change over time and all \(\gamma ^i \ge 1 \);
We remark that model (19) degenerates to the mutation model (8) when all \(\gamma ^{j}=1.\)
In model (19), it was assumed that the vector of fitness coefficients \(\gamma \) does not change over time. But it is straightforward to generalize model (19) to the case of time-dependent fitness parameters \(\gamma (n)\) by simply replacing \(\gamma \) on the right-hand side of (19) by \(\gamma (n).\) This generalization is important for modelling adaptation of bacteria to different environments, which will be illustrated in Sect. 5.4.
Assumption 2.3
Assume that the fitness vector \( \gamma \) can be expressed as the tensor product (25).
Model (19) with the choice of fitness vector in the form (25) is clearly a particular case of model (19), in which fitness coefficients are assigned to each phasotype individually. Let us denote this particular case by (19), (25). In comparison with (19), (25), the general model (19) can describe bacterial population evolution when individual gene dynamics depend on each other. This feature of the selection model is important. For instance, in recent studies (Woodacre et al. 2019; Lango-Scholey et al. 2019; Howitt 2018) of PV genes of C. jejuni, evidence of small networks of genes exhibiting dependent evolutionary behaviour was found. Fisher’s assumption, and hence model (19), (25) with independent contributions of genes to fitness of phasotypes, is open to criticism (see Waxman and Welch 2005 and references therein). In Sect. 4, we describe an algorithm (Algorithm 4.2) which allows us to test whether the data can be explained by the simplified model (19), (25) or whether assumption (25) is not plausible. At the same time, model (19), (25) is simpler than the general model (19): model (19) has \(2^{\ell }-1\) independent fitness parameters (one of the fitness coefficients in (19) is equal to 1 due to the normalization used in the model’s derivation), while (19), (25) has only \(\ell \). In practice, the benefit of reducing the number of parameters by preferring (19), (25) over (19) must be weighed against the loss of flexibility caused by restricting fitness vectors to per-gene products.
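Under assumption (25), the \(2^{\ell }\) phasotype fitness coefficients are generated from \(\ell \) per-gene pairs by a tensor product; a small sketch (our function name, with phasotypes ordered lexicographically by gene states):

```python
import numpy as np

def phasotype_fitness(per_gene):
    # Tensor (Kronecker) product of per-gene fitness pairs: l pairs of
    # coefficients determine all 2^l phasotype fitness values
    gamma = np.array([1.0])
    for pair in per_gene:
        gamma = np.kron(gamma, np.asarray(pair, dtype=float))
    return gamma

# two genes contributing fitness pairs (1.0, 1.2) and (1.0, 1.1);
# resulting phasotypes are ordered 00, 01, 10, 11
gamma = phasotype_fitness([(1.0, 1.2), (1.0, 1.1)])
```

The general model instead assigns one coefficient per phasotype, so the two parameterizations coincide only when fitness contributions multiply across genes.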
Remark 2.2
Both models, (8) and (19), are implemented in R Shiny and are available as a webapp at https://shiny.maths.nottingham.ac.uk/shiny/mutsel/. A description of the webapp is also available in Howitt (2018).
2.3 Long-Time Behaviour of the Models
In this section, we study the long-time behaviour of models (8) and (19). We start with model (8). Since model (8) resembles a linear Markov chain with a finite number of states, and since under Assumption 2.2 all elements of the matrix of transitional probabilities T are strictly positive, we can study the limit of the distribution \(\pi (n;\pi (0))\) as \( n\rightarrow \infty \) using the standard theory of ergodic Markov chains (see, e.g. Meyn and Tweedie 2009) and prove the following proposition.
Proposition 2.1
The proof of (27) is elementary and hence omitted here.
Now let us discuss the mutation–selection model (19). Using Proposition 1.2 from Kolokoltsov (2010, Ch. 1), it is not difficult to prove the following proposition.
Proposition 2.2
Let Assumption 2.2 hold. Then, when \(n\rightarrow \infty ,\) the distribution \(\pi _{\text {sel}}(n;\pi (0))\) has a limit \(^{\infty }\pi _{\text {sel}}\) for any initial \(\pi (0).\)
The next proposition is on uniqueness of the limit \(^{\infty }\pi _{\text {sel}}\) independent of initial \(\pi (0).\)
Proposition 2.3
Proof
Remark 2.3
We note that condition (29) with \(n=1\) (i.e. continuity of the mapping \(\varPhi (\pi )\) [see (23)] with Lipschitz constant less than 1) is used in Butkovsky (2014) to prove uniqueness of the invariant measure for nonlinear Markov chains in a general setting. But this condition is rather restrictive; e.g. it does not hold for our model (19) even in the case of a single gene when \(1-p-q\) is positive and close to 1, \(\gamma _{i}^{1}=1\) and \(\gamma _{i}^{2}>1\).
Remark 2.4
In the case of a single gene, \(\ell =1,\) the uniqueness of the limit \(^{\infty }\pi _{\text {sel}}\) under Assumption 2.2 follows from Lemma A.1 in the Appendix.
In the general case, we were not able to find an explicit expression for \( ^{\infty }\pi _{\text {sel}}\), but we obtained such an expression in the case when Assumption 2.3 holds; it is given in Proposition 2.4 below. In the general case, the stationary distribution \(^{\infty }\pi _{\text {sel}}\) for a particular set of parameters p, q, \(\gamma \) can be found numerically by solving a system of \(2^{\ell }-1\) quadratic equations.
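Alternatively, \(^{\infty }\pi _{\text {sel}}\) can be approximated by iterating the mutation–selection map to a fixed point; a numerical sketch (assuming the one-step map takes the splitting form, i.e. mutation via T followed by reweighting with \(\gamma \); the function name is ours):

```python
import numpy as np

def stationary_distribution(T, gamma, tol=1e-12, max_iter=1_000_000):
    # Fixed-point iteration of the mutation-selection map, started from the
    # uniform distribution; stops once successive iterates agree to `tol`
    pi = np.full(T.shape[0], 1.0 / T.shape[0])
    for _ in range(max_iter):
        nxt = (pi @ T) * gamma
        nxt /= nxt.sum()
        if np.max(np.abs(nxt - pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("fixed-point iteration did not converge")
```

By Proposition 2.2 the iterates converge, and when the limit is unique (Proposition 2.3) the starting distribution is immaterial.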
Proposition 2.4
The proof of this proposition is given in “Appendix A”.
Result (30) has the interpretation that (assuming that the conditions of Proposition 2.3 are verified) in the stationary regime genes behave independently. It also means that if the initial population distribution \(\pi (0)\) is such that genes behave independently then they do so for all times. Further, if the initial population distribution \( \pi (0)\) is such that genes behave dependently then the strength of dependence decays with time. We know that often in practice (see, e.g. Woodacre et al. 2019; LangoScholey et al. 2019; Howitt 2018) this type of evolution behaviour is not the case, which demonstrates a limitation of model (19), (25) in being capable of explaining experimental data. At the same time, the mutation–selection model (19) does not have this deficiency.
Remark 2.5
The webapp from Remark 2.2 also gives \(^{\infty }\pi \) and an accurate approximation of \(^{\infty }\pi _{\text {sel}}.\)
3 Verifying Whether Data can be Explained by the Mutation Model
 1.
Estimates \(\hat{p}_{i},\)\(\hat{q}_{i}\), \(i=1,\ldots ,\ell ,\) of the mutation rates together with \(95\%\) confidence intervals \([\ _{*}p_{i},p_{i}^{*}]\) and \([\ _{*}q_{i},q_{i}^{*}],\) respectively;
 2.
Average number of generations \(\bar{n}_{k}\) between the time points \(k-1\) and k, together with the lowest possible \(_{*}n_{k}\) and the largest possible \(n_{k}^{*}\) number of generations;
 3.
Sample distributions of phasotypes \(_{k}\hat{\pi }\) at time observation points \(k=1,2,\ldots \) and sizes \(N_{k}\) of the samples.
The average number of generations \(\bar{n}_{k}\) is computed by multiplying the calendar time between the observation points by the average division rate of the bacterial species being considered. The average division rate depends on the experimental conditions. Similarly, \(_{*}n_{k}\) and \(n_{k}^{*}\) are found using the slowest and fastest division rates for the bacteria. They are introduced to compensate for the use of the average division rate and to reflect the stochastic nature of bacterial division. For example, in the in vitro C. jejuni experiments (Woodacre et al. 2019) the average division rate was taken as 20 per 3 days, the slowest as 10 and the fastest as 25 (see also growth rates in caecal material in Battersby et al. 2016).
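This conversion from calendar time to a range of generation numbers can be sketched as follows (a hypothetical helper; the 33-day duration below is illustrative, while the per-3-day division rates are those quoted above):

```python
def generation_range(days, rates_per_3_days=(10.0, 20.0, 25.0)):
    # Slowest, average and fastest numbers of generations implied by
    # division rates expressed per 3 days (the C. jejuni values above)
    return tuple(r * days / 3.0 for r in rates_per_3_days)

# e.g. a 33-day observation window (illustrative duration)
n_low, n_avg, n_high = generation_range(33)
```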
Sample distributions of phasotypes \(_{k}\hat{\pi }\) are derived from sample distributions of tract lengths of the PV genes under consideration (Bayliss 2009; Bayliss et al. 2012). The tract length (i.e. the repeat number) is determined by DNA analysis of bacterial material collected during in vitro or in vivo experiments (see further details, e.g. in Bayliss 2009; Bayliss et al. 2012; Howitt 2018; Woodacre et al. 2019; Lango-Scholey et al. 2019). The models and the data assimilation procedures in this paper are aimed at understanding how a bacterial population evolves during a particular experimental setting by looking at the time evolution of \(_{k}\hat{\pi }\). We note that fitness parameters cannot be measured during a biological experiment.
Due to the costs of conducting DNA analysis of bacteria, sample sizes \(N_{k}\) are usually not large [e.g. of order 30–300 (Bayliss et al. 2012; Woodacre et al. 2019; Lango-Scholey et al. 2019)]. Hence, \(_{k}\hat{\pi }\) have a sampling error which cannot be ignored. Let us assume that if \(N_{k}\rightarrow \infty \) then \(_{k}\hat{\pi }\) converges to a distribution \(_{k}\bar{\pi }\); i.e. from the practical perspective, if we get data for a very large sample then the statistical error is effectively equal to zero.
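The magnitude of this sampling error is easy to see by resampling from a known phasotype distribution (an illustrative simulation only; the distribution below is hypothetical):

```python
import numpy as np

def sample_distribution(pi_true, N, rng):
    # Empirical phasotype distribution from a sample of N isolates drawn
    # from the (in practice unknown) population distribution pi_true
    return rng.multinomial(N, pi_true) / N

rng = np.random.default_rng(0)
pi_true = np.array([0.6, 0.2, 0.15, 0.05])  # hypothetical 2-gene distribution
for N in (30, 300):
    pi_hat = sample_distribution(pi_true, N, rng)
# per-component standard error is about sqrt(pi (1 - pi) / N): roughly
# 0.09 at N = 30 and 0.03 at N = 300, so it cannot be ignored
```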
As discussed at the end of Sect. 2.1, we can check for each gene individually [see (15)] whether its behaviour can be explained by the mutation model, and hence determine a subset of PV genes [for C. jejuni strain NCTC11168, there are 28 known PV genes (Bayliss 2009; Bayliss et al. 2012)] for which evolution can be explained by the mutation mechanism alone. For the other genes, i.e. those which fail this test, an alternative model [e.g. (19)] should be used. Thus, we will consider in this section how to determine whether model (15) is consistent with data for a single gene.

\(\pi =(\pi ^{1},\pi ^{2}),\)\(_{k}\hat{\pi }=(_{k}\hat{\pi }^{1},_{k}\hat{ \pi }^{2})\) and \(_{k}\bar{\pi }=(_{k}\bar{\pi }^{1},_{k}\bar{\pi }^{2})\) instead of \(\pi _{i},\)\(_{k}\hat{\pi }_{i}\) and \(_{k}\bar{\pi }_{i},\) respectively;

p, q, \(p_{*},\)\(p^{*},\)\(q_{*},\)\(q^{*}\) instead of \(p_{i}\), \(q_{i},\)\(_{*}p_{i},\)\(p_{i}^{*},\)\(_{*}q_{i},\)\( q_{i}^{*},\) respectively.

\(\bar{n}\), \(n_{*},\)\(n^{*}\) instead of \(\bar{n}_{1},\)\(_{*}n_{1},\)\(n_{1}^{*}.\)
3.1 Algorithm
Assumption 3.1
Assume that \(0<p+q<1.\)
 Step 1
If there are \(x\in \mathbb {I}_{x},\)\(u\in [\ _{0}\varepsilon _{*},\ _{0}\varepsilon ^{*}]\) and \(v\in [\ _{1}\varepsilon _{*},\ _{1}\varepsilon ^{*}]\) such that \(x=u=v\) then Yes, otherwise go to Step 2.
 Step 2
For all \(x\in \mathbb {I}_{x}\) and \(u\in [\ _{0}\varepsilon _{*},\ _{0}\varepsilon ^{*}]\) such that \(x\ne u\), and for \(v\in [\ _{1}\varepsilon _{*},\ _{1}\varepsilon ^{*}],\) form the parametrized set of functions
$$\begin{aligned} y(x;u,v)=\frac{v-x}{u-x}. \end{aligned}$$
(40)
If for \(x\in \mathbb {I}_{x}\) a curve (x, y(x; u, v)) with y(x; u, v) defined in (40) intersects the domain \(\mathbb {J}\) then Yes; otherwise No.
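Step 1 and the curve family of Step 2 reduce to a few lines of code (a sketch with our function names; the geometric intersection of the Step 2 curves with the domain \(\mathbb {J}\) is omitted here, since \(\mathbb {J}\) is defined earlier in the text):

```python
def step1(I_x, eps0, eps1):
    # Step 1: answer Yes iff there is a common point x = u = v, i.e. the
    # three intervals intersect (largest lower bound <= smallest upper bound)
    lo = max(I_x[0], eps0[0], eps1[0])
    hi = min(I_x[1], eps0[1], eps1[1])
    return lo <= hi

def y(x, u, v):
    # The parametrized family of Step 2 (requires x != u)
    return (v - x) / (u - x)
```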
Remark 3.1
Algorithm 3.1 verifying whether the data can be explained by the mutation model (8) is implemented in R Shiny and is available as a webapp at https://shiny.maths.nottingham.ac.uk/shiny/gene_algorithm/.
Remark 3.2
We note that we could verify whether single-gene data can be explained by model (15) using an analogue of the ABC algorithms (Algorithms 4.1 and 4.2) from Sect. 4, in the same spirit as we answer this question for the mutation–selection model (19) in Sects. 4 and 5. But ABC algorithms are more computationally expensive as they are sampling based, requiring the use of Monte Carlo techniques, while Algorithm 3.1 is deterministic and very simple, with negligible computational cost.
3.2 Illustrations
We illustrate Algorithm 3.1 by applying it to the data for three (cj0617, cj1295 and cj1342) of the 28 PV genes obtained in in vitro experiments (Woodacre et al. 2019) (see also Howitt 2018). Statistical analysis in Woodacre et al. (2019) and Howitt (2018) suggested that cj0617 is part of a small network of dependent genes, and hence is likely to be subject to selection, while cj1295 and cj1342 did not demonstrate any dependencies with the other 27 PV genes, and hence their evolution is likely to be explicable by the mutation mechanism alone.
 cj0617:

\(_{0}\hat{\pi }^{1}=0.943\), \(_{1} \hat{\pi }^{1}=0.262\), \(p_{*}=9.1\times 10^{-4},\ p^{*}=22.2\times 10^{-4}\), \(q_{*}=11.0\times 10^{-4},\)\(q^{*}=40.2\times 10^{-4}\), \( n_{*}=110,\)\(n^{*}=275\), \(N_{0}=300\), \(N_{1}=145\).
 cj1295:

\(_{0}\hat{\pi }^{1}=0.305\), \(_{1}\hat{\pi } ^{1}=0.174\), \(p_{*}=3.0\times 10^{-4},\ p^{*}=5.7\times 10^{-4},\)\( q_{*}=1.4\times 10^{-4},\)\(q^{*}=2.8\times 10^{-4},\)\(n_{*}=110,\)\(n^{*}=275\), \(N_{0}=298\), \(N_{1}=149\).
 cj1342 :

\(_{0}\hat{\pi }^{1}=0.017\), \(_{1} \hat{\pi }^{1}=0.153\), \(p_{*}=11.0\times 10^{-4},\ p^{*}=40.2\times 10^{-4},\)\(q_{*}=9.1\times 10^{-4},\)\(q^{*}=22.2\times 10^{-4},\)\( n_{*}=110,\)\(n^{*}=275\), \(N_{0}=298\), \(N_{1}=150\).
Therefore, for cj0617 we have \(\mathbb {I}_{x}=[0.331,0.815]\), \(\varepsilon _{0}=0.071\), \(\varepsilon _{1}=0.102\), and hence \( _{0}\varepsilon _{*}=0.872,\,_{0}\varepsilon ^{*}=1,\)\( _{1}\varepsilon _{*}=0.160,\)\(_{1}\varepsilon ^{*}=0.364\); for cj1295 we have \(\mathbb {I}_{x}=[0.197,0.517]\), \(\varepsilon _{0}=0.071\), \( \varepsilon _{1}=0.10,\) and hence \(_{0}\varepsilon _{*}=0.234,\)\( _{0}\varepsilon ^{*}=0.376,\)\(_{1}\varepsilon _{*}=0.074\), \( _{1}\varepsilon ^{*}=0.274\); and for cj1342 we have \(\mathbb {I} _{x}=[0.185,0.669],\)\(\varepsilon _{0}=0.071\), \(\varepsilon _{1}=0.10,\) and hence \(_{0}\varepsilon _{*}=0,\)\(_{0}\varepsilon ^{*}=0.088,\)\(_{1}\varepsilon _{*}=0.053,\)\(_{1}\varepsilon ^{*}=0.253.\)
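The interval endpoints above are obtained by clipping \(\hat{\pi }\pm \varepsilon \) to [0, 1]; as a small sketch (our function name):

```python
def tolerance_interval(pi_hat, eps):
    # The interval [pi_hat - eps, pi_hat + eps] clipped to [0, 1]
    return max(0.0, pi_hat - eps), min(1.0, pi_hat + eps)
```

This reproduces, for instance, \([0.872,\,1]\) for cj0617 at the initial time point and \([0,\,0.088]\) for cj1342.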
 Step 1
Since \([0.331,0.815]\cap [0.872,\ 1]\cap [0.160,0.364]=\varnothing ,\) we get No and we go to Step 2.
 Step 2
We have under \(x\in [0.331,0.815],\)\(u\in [0.872,\ 1],\) and \(v\in [0.160,0.364]\):
$$\begin{aligned} y_{\min }(x)\le y(x;u,v)\le y_{\max }(x), \end{aligned}$$
where
$$\begin{aligned} y_{\min }(x)=\frac{0.160-x}{1-x}\text { and }y_{\max }(x)=\frac{0.364-x}{0.872-x}, \end{aligned}$$
and we observe in Fig. 2 that the curves \((x,y_{\min }(x))\) and \( (x,y_{\max }(x))\) do not intersect the domain \(\mathbb {J},\) and hence we conclude that the mutation model cannot describe evolution of this gene.
 Step 1
Since \([0.197,0.517]\cap [0.234,0.376]\cap [0.074,0.274]\ne \varnothing ,\) we conclude that this gene can be described by the mutation model and it is possible that its evolution is stationary.
 Step 1
Since \([0.185,0.669]\cap [0,\ 0.088]\cap [0.053,0.253]=\varnothing ,\) we get No and we go to Step 2.
 Step 2
We have under \(x\in [0.185,0.669],\)\(u\in [0,\ 0.088],\) and \(v\in [0.053,0.253]\):
$$\begin{aligned} y_{\min }(x)\le y(x;u,v)\le 1, \end{aligned}$$
(41)
where
$$\begin{aligned} y_{\min }(x)=\frac{x-0.253}{x} \end{aligned}$$
(the bounds in (41) are achievable), and we observe in Fig. 3 that the curve \((x,y_{\min }(x))\) intersects the domain \(\mathbb {J}\), and hence we conclude that the mutation model can describe evolution of this gene.
Further illustrations for Algorithm 3.1 are available in Howitt (2018).
4 Estimation of Fitness Parameters in the Mutation–Selection Model
In this section, we describe our general methodology for the estimation of fitness parameters. We will illustrate the use of this methodology using data from C. jejuni experiments in Sect. 5. We adopt a Bayesian approach, whereby uncertainty in any unknown parameters is summarized by probability distributions. We illustrate how uncertainty in random quantities can be incorporated very naturally in the Bayesian framework, using prior information from previous experiments where available, and show how estimates in all quantities can be obtained in light of the observed data.
4.1 Bayesian Statistics
Computing summaries from the posterior distribution requires integration, which in practice is not possible analytically except for simple models. One can adopt numerical procedures, but the performance of these degrades quite rapidly as the dimension of \(\varTheta \) increases. A powerful alternative is to use simulation methods, which also have the major advantage of not requiring the normalizing constant f(x) in (42), the so-called marginal likelihood, which again requires an integration that is typically computationally expensive. If one can draw independent samples directly from \(f(\theta \mid x)\), then Monte Carlo techniques can be used to estimate posterior quantities of interest. For complex, typically high-dimensional, models, this itself may be difficult, but powerful techniques such as Markov chain Monte Carlo (MCMC) can be employed (Gelman et al. 2013; Gilks et al. 1996; Wilkinson 2012). MCMC itself can be difficult to implement effectively in some complex scenarios, and it can be computationally demanding. An important recent development is the Integrated Nested Laplace Approximation (INLA) method (Rue et al. 2009), which, as the name suggests, is based on Laplace approximations to the required integrals. The Laplace method itself is a well-known tool for approximating integrals in general (de Bruijn 1981) and has been used effectively in Bayesian statistics to compute posterior summaries (Tierney and Kadane 1986). INLA extends this idea to models with a general latent Gaussian structure and allows comparatively fast and simple approximations, which can be used either as an alternative to, or in conjunction with, simulation methods such as MCMC.
However, a further complication, which arises in our case, is that it may not even be possible to evaluate the likelihood \(f(x \mid \theta )\), which is necessary for the simulation methods mentioned above. In this case, so-called likelihood-free methods can be employed, an example of which is Approximate Bayesian Computation (ABC) (Beaumont 2010), which we use here. This assumes only the ability to simulate from the model \(f(\cdot \mid \theta )\) relatively easily, even if evaluation of the likelihood itself is not possible.
4.2 Approximate Bayesian Computation
 1.
Simulate \(\theta \sim f(\theta )\);
 2.
Simulate \(y \sim f(\cdot \mid \theta )\);
 3.
Accept \(\theta \) if \(y=x\), else return to step 1.
4.3 General Algorithm
As discussed in Sect. 3, our data are the observed sample phasotype distributions \(_{i}\hat{\pi }\), where \(i=0\) is the initial time point and \(i=1\) is the final time point. Our main question of interest is whether the proposed mutation–selection model (19) can explain the observed data; that is, are there values of the unknown quantities which are both biologically plausible and for which the final distribution obtained by model (19) is consistent with the observed sample? Recall that model (19) has input parameters \(\theta =(n,p,q,\ _{0}\pi ,\gamma )\), where n is the number of generations, p and q are the vectors of mutation rates, \( _{0}\pi \) is the initial distribution, and \(\gamma \) is the vector of fitness parameters. In general, we will treat all elements of \( \theta \) as random, and we write \(\varTheta =(\eta ,P,Q,\ _{0}\varPi ,\varGamma )\) for the corresponding random vector. Then, in general, the random variables are the elements of \(\varTheta \) together with the final distribution \(_{1}\varPi \) (a realization of which we denote by \(_{1}\pi \)); here, \(_{1}\varPi \) plays the role of X in (42), i.e. the output of the probabilistic model.
Considering first all quantities other than \(\varGamma \) to be fixed, another way to phrase our main question is: is there a value of \(\varGamma \) for which the final distribution obtained from model (19) is “close to” the observed sample final distribution? In this case, there would be no evidence to reject the hypothesis that our proposed model is a plausible description of the evolution of phasotypes. The estimate of \(\varGamma \) is also of interest in its own right, for biologists to understand which phasotypes or genes benefit from advantageous selection.
While there may be estimates or observations of the various quantities we consider random, there is often uncertainty. For instance, in our applications discussed in Sect. 5, there are estimates and plausible ranges available for P, Q and \(\eta \). For the observed sample distributions \(_i\hat{\pi }\), we have only a relatively small sample from a larger population, and hence our observations are subject to sampling variation. In both cases, uncertainty can be handled very naturally in the Bayesian framework, by encoding our existing knowledge in prior distributions. Our question then becomes: while accounting for uncertainty in all unknown quantities, can the mutation–selection model explain the evolution of phasotypes given our observed data?
Let \(f(\theta )=f(n)f(p)f(q)f(_{0}\pi )f(\gamma )\) be the prior distribution on \(\varTheta \). Thus, we assume independence between these quantities a priori, and we also assume that the elements of P, Q and \( \varGamma \) are all mutually independent so that, e.g. \(f(p_{1},\ldots ,p_{l})=f(p_{1})\ldots f(p_{l})\), etc. This independence assumption for the prior is natural from the microbiology point of view.
The prior distributions we use and the methods for sampling from them are discussed below. Assuming for now that we can simulate from these priors, then Algorithm 4.1 gives the steps taken to simulate from the ABC posterior distribution. We write \(\pi _{\text {sel}}(\theta )\) for the output of the mutation–selection model (19), replacing \((n,p,q,_{0}\pi ,\gamma )\) with \(\theta \).
Step 1 Propose a candidate value \(\theta ^*\sim f(\theta )\).
Step 2 Obtain \(\pi _{\text {sel}}(\theta ^*)\) by mutation–selection model (19).
Step 3 Accept \(\theta ^*\) if \(d(_1\hat{\pi },\pi _{\text {sel}}(\theta ^*))\le {}_1\epsilon \), where d is a distance function and \(_1\epsilon \) is a tolerance. Otherwise, discard \(\theta ^*\).
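Steps 1–3 above can be sketched as follows. Model (19) itself is not reproduced here, so a hypothetical stand-in `simulate` is used in its place, and the Hellinger distance (discussed further in Sect. 4.4) serves as the distance function d; all function names are illustrative.

```python
import math, random

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def abc_tolerance(pi_hat, simulate, prior_sample, eps, n_prop=5000, rng=None):
    """Steps 1-3: propose theta from the prior, run the model, and keep theta
    if the simulated final distribution is within Hellinger distance eps."""
    rng = rng or random.Random(1)
    kept = []
    for _ in range(n_prop):
        theta = prior_sample(rng)          # Step 1
        pi_sim = simulate(theta)           # Step 2 (stand-in for model (19))
        if hellinger(pi_hat, pi_sim) <= eps:   # Step 3
            kept.append(theta)
    return kept

# Stand-in model: a scalar theta mapped to a two-state distribution.
pi_obs = (0.3, 0.7)
kept = abc_tolerance(pi_obs,
                     simulate=lambda th: (th, 1.0 - th),
                     prior_sample=lambda rng: rng.random(),
                     eps=0.05)
print(len(kept), sum(kept) / len(kept))
```

The accepted values concentrate around \(\theta \approx 0.3\); their mean is the Monte Carlo point estimate, as described below.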
The samples can then be used to form Monte Carlo estimates of the required quantities. In our applications, we use the mean of the samples to form point estimates and denote the estimates by \(\hat{\gamma }\), etc. When accounting for sampling variability in the initial sample distribution, we denote an estimate of the true population distribution by \(_0 \hat{\dot{\pi }}\) (to distinguish this from the observed sample which we denote by \(_0\hat{\pi }\))—this is the (normalized) elementwise mean of the sampled initial distributions. To quantify uncertainty in the estimated parameters, we give \(95\%\) posterior probability intervals; these are simply the \(2.5\text {th}\) and \(97.5\text {th} \) percentiles of the accepted samples, which are estimates of the true percentiles of the (marginal) posterior distribution for a given parameter.
Note that, in terms of model (19) itself, there is a certain non-identifiability surrounding the fitness parameters, since \(\gamma \) and \(k\gamma \), for some \(k >0\), give the same model. Recall from Sect. 2.2 that we interpret the fitness parameters as relative fitness and remove this non-identifiability by taking the smallest fitness parameter to be 1, which is natural. In all our simulations, normalization is applied at the final stage. Specifically, let \(\hat{\gamma }^*\) be an unnormalized vector, formed by taking the elementwise mean of all sampled fitness vectors (which are themselves unnormalized). Then, we set \(\hat{ \gamma } = \hat{\gamma }^*/k\), where \(k = \min (\hat{\gamma }^*)\), so that \(\hat{\gamma }\) is the required estimate of relative fitness parameters.
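The normalization described above amounts to a one-line rescaling; a minimal sketch (the function name is illustrative):

```python
def normalize_fitness(gamma_star):
    """Rescale an (unnormalized) fitness vector so its smallest entry is 1,
    resolving the gamma vs k*gamma non-identifiability described above."""
    k = min(gamma_star)
    return [g / k for g in gamma_star]

print(normalize_fitness([2.04, 2.0, 2.0, 2.08]))
```

Any common scaling of the sampled fitness vectors is removed by this step, so only relative fitness is reported.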
4.4 Simulation from Priors
In general, prior distributions are chosen which reflect the current knowledge about the unknown parameters. Here, we illustrate the choice of priors we use in our applications, but other prior distributions could be used when relevant.
Fitness parameters As discussed in Sect. 2.2, the quantities of interest are the relative fitness parameters \(\gamma \). We assign independent uniform priors to the fitness parameters, i.e. \(\gamma ^i \sim U[a_i,b_i]\), \(i=1,\ldots ,2^l\), where \(a_i \ge 1\), since the fitness parameter equals 1 for the slowest-growing phasotype (see Sect. 2.2).
Number of generations For the number of generations \(\eta \), we have from microbiology knowledge (see Sect. 3) an estimate \(\bar{n}\) and an interval \([n_{*},n^{*}]\) in which \(\eta \) lies. The interval \([n_{*},n^{*}]\) is typically not symmetric around \(\bar{n}\). We construct a prior for \(\eta \) from a skew-normal distribution, with mean \(\bar{n}\), such that \(P(n_{*}-\frac{1}{2}\le \eta \le n^{*}+\frac{1}{2})=0.95\); this is then discretized to give a probability mass function, since \(\eta \) is integer-valued.
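The discretization step can be sketched as follows: evaluate the skew-normal density at each integer in the support and renormalize. The location, scale and shape values below are placeholders for illustration, not the parameters fitted in the paper.

```python
import math

def skewnorm_pdf(x, loc, scale, shape):
    """Skew-normal density: (2/scale) * phi(z) * Phi(shape*z), z=(x-loc)/scale."""
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 * phi * Phi / scale

def discretized_prior(n_lo, n_hi, loc, scale, shape):
    """Probability mass function for integer-valued eta on [n_lo, n_hi]:
    evaluate the skew-normal density at each integer and renormalize."""
    w = {n: skewnorm_pdf(n, loc, scale, shape) for n in range(n_lo, n_hi + 1)}
    total = sum(w.values())
    return {n: v / total for n, v in w.items()}

# Illustrative parameters (not those of the paper): support [110, 275].
pmf = discretized_prior(110, 275, loc=180.0, scale=45.0, shape=1.5)
print(sum(pmf.values()), max(pmf, key=pmf.get))
```

A positive shape parameter skews the prior to the right of the mode, which accommodates the asymmetric interval \([n_{*},n^{*}]\) around \(\bar{n}\).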
Mutation rates For the mutation rates p and q, as with the number of generations, there are estimates (\(\bar{p}\) and \(\bar{q}\)) and \(95\%\) confidence intervals (\([p_*,p^*]\) and \([q_*,q^*]\)) available from specially designed experiments (Bayliss et al. 2012). We form analogous prior distributions for these quantities via the same process as for \(\eta \), omitting the discretization step, as these quantities are continuous.
Observed sample distributions We account for sampling variability in distributions using probabilistic results for the distribution of distances. Specifically, we use the Hellinger distance to measure distance between two probability distributions and use the relationship between this distance and the \(\chi ^2\) distribution to ascertain plausible discrepancies between two distributions if they are still to be considered the same after accounting for statistical variation.
We also use this idea to account for sampling variability in an observed sample distribution \(\hat{\phi }\), based on a sample size N, as follows. We first obtain a tolerance \(\epsilon = \sqrt{\frac{\chi ^2_{k-1}(0.95)}{8N}}\), where k is the number of categories (here \(k=2^l\)), such that the set of distributions within (Hellinger) distance \(\epsilon \) of \(\hat{\phi }\) defines a \(95\%\) confidence region for the true population distribution \(\phi \), of which \(\hat{\phi }\) is an empirical estimate. We then construct a Dirichlet distribution, centred on \(\hat{\phi }\), with parameter \(\alpha = \alpha _0 \varvec{1}_{2^l}\), \(\alpha _0 \in \mathbb {R}_+\), \(\alpha \in \mathbb {R}_+^{2^l}\), such that \(P(H(\varPhi ,\hat{\phi }) < \epsilon ) = 0.95\), where \(\varPhi \sim \hbox {Dir}(\alpha )\). To account for sampling variability in the observed distribution, we sample an observation \(\phi ^*\) from this Dirichlet distribution and accept \(\phi ^*\) if \(H(\phi ^*,\hat{\phi }) < \epsilon \). Thus, we can think of an accepted \(\phi ^*\) as a plausible sample distribution which could have been observed instead of \(\hat{\phi }\).
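The tolerance and the resampling step can be sketched as follows. The \(95\%\) chi-square quantile is hardcoded for the relevant degrees of freedom, and the Dirichlet is centred on \(\hat{\phi }\) via the parameter \(\alpha _0\hat{\phi }\); this centring is one plausible reading of the construction above, and the paper's exact parameterization of \(\alpha \) may differ.

```python
import math, random

CHI2_95 = {1: 3.841, 3: 7.815, 7: 14.067}   # 95% chi-square quantiles (df -> value)

def hellinger_tolerance(N, k):
    """Tolerance eps = sqrt(chi2_{k-1}(0.95) / (8N)) from the asymptotic
    relation between the Hellinger distance and the chi-square distribution."""
    return math.sqrt(CHI2_95[k - 1] / (8.0 * N))

def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def plausible_sample(phi_hat, alpha0, eps, rng):
    """Repeatedly draw phi* from a Dirichlet centred on phi_hat (assumed
    parameterization alpha = alpha0 * phi_hat) until H(phi*, phi_hat) < eps."""
    while True:
        # Dirichlet via normalized Gamma variates; shapes kept strictly positive.
        g = [rng.gammavariate(max(alpha0 * p, 1e-9), 1.0) for p in phi_hat]
        s = sum(g)
        phi_star = [x / s for x in g]
        if hellinger(phi_star, phi_hat) < eps:
            return phi_star

rng = random.Random(2)
# Zero entries replaced by a tiny value so every Gamma shape is positive.
phi_hat = [0.00333, 0.01, 0.00667, 0.92333, 0.04333, 1e-9, 1e-9, 0.01333]
eps = hellinger_tolerance(N=300, k=8)   # recovers 0.0766 for N=300, as in Sect. 5
phi_star = plausible_sample(phi_hat, alpha0=3000.0, eps=eps, rng=rng)
print(round(eps, 4), round(sum(phi_star), 6))
```

With a large \(\alpha _0\) the Dirichlet concentrates near \(\hat{\phi }\), so the acceptance condition is met quickly.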
Finally, we use the same procedure to obtain the tolerance used in the ABC algorithm (step 3 of Algorithm 4.1). Specifically, if the observed final distribution is based on a sample size of N, then the tolerance used is that given by (45).
4.5 Dependence of Gene Fitness Parameters
Recall the earlier discussion in Sect. 2.2 regarding dependence between the selection/fitness parameters of different genes. Specifically, under the assumption of independence (Assumption 2.3), \(\gamma \) is written as the tensor product (25). We introduce below an algorithm which can be used to test this assumption. In Sect. 5.1, we illustrate this on experimental data and show that the independence assumption does not hold for these data.
Recall that the fitness parameters for a gene l are \(\gamma _l^1\) and \( \gamma _l^2\), and \(\gamma _l = (\gamma _l^1,\gamma _l^2)\). In short, we estimate the full vector of fitness parameters, \(\gamma \), under the assumption of independence, and then assess whether the distance between the observed sample final distribution and that obtained from model (19), with \( \gamma = \hat{\gamma }\), is less than the tolerance given by (45). This is detailed in Algorithm 4.2. Note that here we focus on how to handle the fitness parameters, and assume the other elements of \( \theta \) are available—these could be fixed estimates, or estimated (with uncertainty incorporated) as part of steps 1 and 2 in Algorithm 4.2.
Step 1 Estimate \(\gamma _l\), \(l=1,\ldots ,\ell \) (and other elements of \(\theta \) if required), using Algorithm 4.1 for each gene separately.
Step 2 Form \(\hat{\gamma }^{\text {ind}} =\hat{\gamma }_{1}\otimes \cdots \otimes \hat{\gamma }_{\ell }\) and \(\hat{\theta }\).
Step 3 Obtain the final distribution under the independence assumption, \(\pi ^{\text {ind}}_{\text {sel}}(\hat{\theta })\), from (19).
Step 4 Compute \(d(_1\hat{\pi },\pi ^{\text {ind}}_{\text {sel}}(\hat{\theta }))\).
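Step 2 can be checked directly: forming the tensor product of the single-gene estimates reported in Sect. 5 reproduces the vector \(\hat{\gamma }^{\text {ind}}\) given there. Steps 3–4 additionally require model (19), which is not reproduced in this sketch.

```python
def kron(u, v):
    """Tensor (Kronecker) product of two fitness vectors (Step 2)."""
    return [a * b for a in u for b in v]

# Single-gene fitness estimates for cj0617, cj0685 and cj1437 (Sect. 5):
g617, g685, g1437 = (1.0, 1.016), (1.02, 1.0), (1.02, 1.0)

gamma_ind = kron(kron(g617, g685), g1437)
print([round(g, 6) for g in gamma_ind])
# Steps 3-4 would feed gamma_ind into model (19) and compare the resulting
# final distribution with the observed one against the tolerance from (45).
```

The result is the 8-vector \((1.0404, 1.02, 1.02, 1, 1.057046, 1.03632, 1.03632, 1.016)\), matching the \(\hat{\gamma }^{\text {ind}}\) values reported in Sect. 5.1.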
5 Results
We now illustrate our methodology with applications to data on the bacterium C. jejuni from two in vitro experiments. Full experimental details can be found in Woodacre et al. (2019) and in Howitt (2018). We focus attention on three genes of interest, for which preliminary investigation has found evidence of dependent switching from one PV state to another (Woodacre et al. 2019; Howitt 2018). These genes are labelled cj0617, cj0685 and cj1437; note that the sample space of phasotypes is labelled according to the conventions described in Sect. 2 and Eq. (2), and in what follows, the ordering is with respect to the ordering of the genes as listed above. We first investigate whether the assumption of independence of fitness parameters is justifiable, using Algorithm 4.2, and show that there is evidence this assumption does not hold. We then illustrate the ability of our methodology to successfully estimate fitness parameters using synthetic data, before obtaining estimates of fitness parameters for our experimental data. We conclude this section with an experiment which provides evidence that switching of phasotypes occurs quickly when bacteria are subject to new environmental conditions, suggesting an interesting direction for future work involving time-dependent fitness parameters. Throughout this section, we used 500,000 Monte Carlo samples for all inferences based on ABC simulation, except for the single-gene results given in Table 1, which are based on 100,000 samples.
Remark 5.1
Since we are only dealing with a relatively small number of genes, the ABC algorithm in the form proposed here is feasible in terms of computational complexity. As the dimension of the state space is \(2^l\), then clearly the dimension of the parameter space grows exponentially with the number of genes, and it would not be practical to apply the ABC algorithm for many genes, say more than 6. However, we emphasize that our overall procedure is a twostage process. Firstly, we reduce the number of genes on which to focus, by using the fast and efficient algorithm of Sect. 3 to determine which genes can be explained by the mutation model. Secondly, we then apply the mutation–selection model to the small number of remaining genes.
Table 1 Single-gene data, estimates and results for the independence of fitness parameters investigation

Gene  \(_0\hat{\pi }\)  \(_1\hat{\pi }\)  \(\hat{\gamma }\)
cj0617  (0.9433, 0.0567)  (0.2621, 0.7379)  (1, 1.016)
cj0685  (0.0567, 0.9433)  (0.8267, 0.1733)  (1.02, 1)
cj1437  (0.0533, 0.9467)  (0.8288, 0.1712)  (1.02, 1)
5.1 Independence Assumption
Table 2 Three-gene model input (fitness parameters) and results, with and without application of Assumption 2.3

\(\hat{\gamma }^{\text {ind}}\)  (1.040400, 1.020000, 1.020000, 1.000000, 1.057046, 1.036320, 1.036320, 1.016000)
\(\pi ^{\text {ind}}_{\text {sel}}(\hat{\theta })\)  (0.099859, 0.000256, 0.002181, 0.000143, 0.877841, 0.000756, 0.018654, 0.000311)
\(\hat{\gamma }^{\text {gen}}\)  (1.018, 1.007, 1.009, 1.000, 1.026, 1.027, 1.019, 1.004)
\(\pi ^{\text {gen}}_{\text {sel}}(\hat{\theta })\)  (0.143176, 0.011395, 0.009522, 0.056227, 0.685888, 0.033098, 0.036405, 0.024289)
5.2 Synthetic Data
Table 3 Inputs for the synthetic data experiment

\(_0N\)  300
\(_0\epsilon \)  0.0766
\(_0\hat{\pi }\)  (0.003, 0.010, 0.007, 0.924, 0.043, 0, 0, 0.013)
\([a_i, b_i]\) for \(\varGamma \)  [1.005, 1.04], [1, 1], [1, 1], [1, 1], [1.005, 1.04], [1.005, 1.04], [1.005, 1.04], [1.005, 1.04]
\(_1N\)  150
\(_1\epsilon \)  0.108
\(_1\hat{\pi }\)  (0.13013, 0.01044, 0.01129, 0.13676, 0.63192, 0.00608, 0.03386, 0.03951)
Table 4 Results for the synthetic data experiment

True \(\gamma \)  (1.014, 1.002, 1.007, 1, 1.022, 1.01, 1.015, 1.001)
\(\hat{\gamma }\)  (1.0162, 1, 1, 1, 1.0252, 1.0164, 1.0175, 1)
\(\pi _{\text {sel}}(\hat{\theta })\)  (0.12607, 0.00664, 0.00495, 0.11870, 0.67638, 0.00745, 0.03145, 0.02837)
5.3 Experimental Data and Results
We now turn our attention to analysis of experimental data from two in vitro datasets, where the raw data are in the form of repeat numbers. For different genes, the repeat numbers which determine whether the gene is ON or OFF are different, but this is known, and hence phasotypes can be determined from repeat numbers. The estimates/confidence intervals for mutation parameters p and q, available from Bayliss et al. (2012), relate to mutation rates between repeat numbers, from which mutation rates for phasotypes can again be deduced. For example, if repeat numbers of 8/9 correspond to a certain gene being OFF/ON, then the mutation rate from OFF to ON is simply the mutation rate from repeat number 8 to repeat number 9.
Table 5 Prior settings for dataset 1

Gene  \(\bar{p}_l\,[p_{l_{*}},p_l^{*}]\) \((\times 10^{-4})\)  \(\bar{q}_l\,[q_{l_{*}},q_l^{*}]\) \((\times 10^{-4})\)
cj0617  12.30 [9.1, 22.2]  17.88 [11.0, 40.2]
cj0685  4.23 [3.0, 5.7]  2.15 [1.4, 2.8]
cj1437  0.0725 [0.0388, 0.2597]  0.0045 [0.0029, 0.0107]

\(\bar{n}\,[n_{*},n^{*}]\)  220 [110, 275]
\([a_i, b_i]\) for \(\varGamma \)  [1, 1.04], [1, 1.04], [1, 1.04], [1, 1], [1.005, 1.06], [1.005, 1.06], [1, 1.04], [1, 1.04]
Table 6 Sample data for dataset 1

\(_0N\)  300
\(_0\epsilon \)  0.0766
\(_0\hat{\pi }\)  (0.00333, 0.01, 0.00667, 0.92333, 0.04333, 0, 0, 0.01333)
\(_1N\)  141
\(_1\epsilon \)  0.112
\(_1\hat{\pi }\)  (0.15603, 0.00709, 0.01418, 0.09220, 0.63121, 0.04255, 0.04255, 0.01418)
Table 7 Sample data and prior settings for dataset 2. Also, \(_0N = 84\), \(_1N=87\), \(_0\epsilon = 0.145\), \(_1\epsilon = 0.142\)

\(\bar{p}_l\,[p_{l_{*}},p_l^{*}]\) \((\times \, 10^{-4})\) for cj1437  17.88 [11.0, 40.2]
\(\bar{q}_l\,[q_{l_{*}},q_l^{*}]\) \((\times \, 10^{-4})\) for cj1437  12.30 [9.1, 22.2]
\(\bar{n}\,[n_{*},n^{*}]\)  20 [10, 25]
\([a_i, b_i]\) for \(\varGamma \)  [1, 1.6], [1, 1.6], [1, 2], [1, 1], [1.1, 1.8], [1.05, 2.2], [1, 2.2], [1, 1.6]
\(_0\hat{\pi }\)  (0.0119, 0.0476, 0.0000, 0.7738, 0.1548, 0.0000, 0.0119, 0.0000)
\(_1\hat{\pi }\)  (0.0115, 0.0230, 0.0230, 0.0690, 0.7586, 0.0805, 0.0345, 0.0000)
Again, we formed the vector of estimates \(\hat{\theta }\) and evaluated the predicted final distribution \(\pi _{\text {sel}}(\hat{\theta })\). We find that \(d(_{1}\hat{\pi },\pi _{\text {sel}}(\hat{\theta }))=0.0925\), which is less than the tolerance \(_{1}\epsilon =0.142\) (from (45) with \(N=87\)). As with dataset 1, we conclude that the mutation–selection model is a plausible description of the evolution mechanism for these three genes. For this second dataset, the point estimate of the vector of fitness parameters is \(\hat{\gamma }=(1.180,1.172,1.328,1,1.380,1.575,1.354,1.150)\). Notably, the fitness parameters are larger than those of the first dataset, suggesting that selection advantage may be more prominent in the early stages of the experiment. We explore this further in the following section.
5.4 Time Dependence
The estimated fitness parameters for the second dataset (which correspond to a much shorter period of approximately 20 generations) were larger than those obtained from the first dataset. This leads to a hypothesis of biological interest, namely that selection advantage has a larger influence in the initial stages, when the bacteria are adapting to changes in the environment. Thus, the estimates from the first dataset (corresponding to a much longer period of approximately 220 generations) are averaged over a longer period, for most of which the selection advantage is less important. This is a plausible explanation for the lower estimates seen in the first dataset.
To investigate this further, we conducted the following experiment. First, we used the initial distribution from the first dataset as input for the mutation–selection model and ran it for 20 generations; for the mutation rates we used the point estimates \(\bar{p}\) and \(\bar{q}\) as for the first dataset, given in Table 5, and for the fitness parameters we used the point estimates obtained from the second experiment. This provides an interim distribution, \(_{0.5}\hat{\pi }\) say. We then apply Algorithm 4.1 using \(_{0.5}\hat{\pi }\) as the initial distribution, with the final distribution taken to be that from the first dataset. The aim is to see if the model can explain this final distribution, and whether the estimates of the fitness parameters are lower (as per our hypothesis). For the remaining parameters, the priors for the mutation rates and the tolerances used are given in Tables 5 and 6. We chose \(\bar{n} =200\) with \([n_{*},n^{*}]=[100,250]\), because 200 is the difference between the expected lengths of the second and first experiments. Initial investigation showed that the mutation-only model could not explain the observed final distribution, and hence there is still evidence of selection advantage over this time period. However, as we expect this advantage to be smaller, we use narrower priors for the selection parameters. Specifically, we used uniform priors over the interval [1, 1.01] for each fitness parameter, which also reflects no preference for a particular phasotype a priori.
Table 8 Results for the time-dependence experiment. Here \(_1\epsilon = 0.112\)

\(_1\hat{\pi }\)  (0.15603, 0.00709, 0.01418, 0.09220, 0.63121, 0.04255, 0.04255, 0.01418)
\(\pi _{\text {sel}}(\hat{\theta })\)  (0.15034, 0.02545, 0.03451, 0.07191, 0.59554, 0.05478, 0.04457, 0.02289)
\(d(_1\hat{\pi },\pi _{\text {sel}}(\hat{\theta }))\)  0.0830
Table 9 The minimum, maximum and 95% posterior probability intervals for fitness parameters from the time-dependence experiment

\(\hat{\gamma }^i\)  \(\min \gamma ^i\)  \(\max \gamma ^i\)  95% posterior probability interval for \(\gamma ^i\)
1.004021  1  1.00998  [1, 1.00925] 
1.001056  1  1.00869  [1, 1.00618] 
1.000296  1  1.00586  [1, 1.00410] 
1.006620  1.00228  1.00994  [1.00425, 1.00961] 
1.007894  1.00610  1.00999  [1.00708, 1.00982] 
1.000000  1  1.00552  [1, 1.00341] 
1.002977  1  1.00986  [1, 1.00895] 
1.002558  1  1.00941  [1, 1.008791] 
Table 10 The minimum, maximum and 95% posterior probability intervals \((\times \, 10^{-4})\) for mutation rates from the time-dependence experiment

Gene  \(\hat{p}_l\)  \(\min p_l\)  \(\max p_l\)  95% interval (\(p_l\))  \(\hat{q}_l\)  \(\min q_l\)  \(\max q_l\)  95% interval (\(q_l\))
cj0617  12.308  9.135  17.580  [9.534, 15.762]  16.257  11.084  25.958  [11.727, 21.948]
cj0685  4.126  3.002  5.619  [3.112, 5.248]  2.152  1.405  2.800  [1.580, 2.723]
cj1437  0.0724  0.0389  0.127  [0.0423, 0.109]  0.00453  0.00294  0.00775  [0.00310, 0.00627]
Table 11 The minimum, maximum and 95% posterior probability interval for the number of generations from the time-dependence experiment

\(\hat{n}\)  \(\min g_{\tilde{\eta }}(n)\)  \(\max g_{\tilde{\eta }}(n)\)  2.5/97.5 percentiles from \(g_{\tilde{\eta }}(n)\)
212  145  250  [168, 246]
6 Discussion and Conclusions
In this work, we consider two models (mutation and mutation–selection) for describing the time evolution of a bacterial population. The models are accompanied by algorithms for determining whether they can explain experimental data and for estimating unobservable parameters such as fitness. In the case of the mutation–selection model, we propose an algorithm inspired by Approximate Bayesian Computation (ABC) to link the model and data. The approach considered gives microbiologists a tool for enhancing their understanding of the dominant mechanisms affecting bacterial evolution, which can be used, e.g. for creating vaccines. Here, we limit ourselves to illustrative examples using in vitro data for phase variable (PV) genes of C. jejuni aimed at demonstrating how the methodology works in practice; a more in-depth study of PV genes will be published elsewhere. We note that the models, together with the methodology linking the models and the data, can be applied to other population dynamics problems related to bacteria. In particular, it is straightforward to adjust the methodology presented here if considering repeat numbers instead of phasotypes.
The calibration of the models is split into two steps. First, the very efficient algorithm from Sect. 3 is applied to verify whether data for particular genes can be explained by the mutation model. This allows us to reduce the number of genes to which the mutation–selection model should be applied. The second step is calibration of the mutation–selection model for the remaining genes using the ABC-type algorithm from Sect. 4. In both steps, we take into account experimental errors and sample sizes. We note that, due to its computational complexity, the ABC algorithm is realistic to apply only in the case of a relatively small number of genes (2–6). We also note that if one wants to model simultaneously a large number of genes with dependent behaviour (e.g. if one needs to model simultaneously all 28 PV genes of C. jejuni strain NCTC11168, where the state space is of order \(10^{17}\)), then a space-continuous model should be used instead of the discrete-space models considered here. Development of such space-continuous models, together with calibration procedures for them, is a possible topic for future research.
Further development of the presented approach can include enhancing the models by adding a description of bottlenecks and, consequently, proposing algorithms to answer questions about the presence of bottlenecks during bacterial evolution. It is also of interest to consider continuous-time counterparts of the discrete-time models studied here and thus take into account random bacterial division times. (For this purpose, e.g. ideas from Caravagna et al. (2013) and D'Onofrio (2013) can be exploited.) This will lead to models written as differential equations, for which the discrete models of this paper are approximations.
The proposed ABC algorithm for estimating fitness parameters can be further developed in a number of directions. For instance, the computational costs of this algorithm grow quickly with an increase in the number of genes, and recent improvements to ABC, such as adaptive methods based on importance sampling using sequential Monte Carlo (e.g. Beaumont et al. 2009; del Moral et al. 2012), could potentially be exploited to make the algorithm more efficient. We also leave for future work the analysis of convergence of the considered ABC-type algorithm.
One of the assumptions we used is that mutations of individual genes happen independently of each other [see (11)] and that mutation rates do not change with time or environment; these are commonly accepted hypotheses in microbiology. At the same time, it is of interest to test the environmentally directed mutation hypothesis (see Lenski et al. 1989 and references therein), i.e. to verify whether, upon relaxing assumptions on the transition probabilities, the mutation model can explain the data for the three genes considered in our experiments of Sect. 5. It is clear from our study (see also Bayliss et al. 2012) that under assumption (11) the mutation model cannot explain the data; here, we therefore tested whether these three genes can be explained by a combination of mutation and selection. However, it is formally possible that the observed patterns could be explained by allowing for dependence of mutations. The data assimilation approach of Sect. 4 can be modified to test for dependence of mutations.
Though the main objective of the paper was to propose tractable models for bacterial population evolution together with their robust calibration, a number of biologically interesting observations were made. First, we saw in Sect. 3.2 that in the considered in vitro experiment some of the PV genes can be explained by the mutation model and some cannot; the latter were hence subject to further examination via the mutation–selection model. A plausible explanation, and indeed an expected outcome, is that genes vary in their responses to selection, with the mutation-only genes not contributing to bacterial adaptation in this particular experimental set-up. In Sect. 5, we studied three genes which did not pass the test of Sect. 3. We started by verifying whether the data can be explained by the mutation–selection model with fitness parameters assigned to the individual genes (Assumption 2.3) rather than to specific phasotypes. (Note that three genes can generate eight phasotypes: 111, 110, 100, etc.) This hypothesis was rejected, implying an important biological consequence, namely that selection acts on phasotypes and there is a dependence between the three genes; i.e. adaptivity to a new environment in this case relies on a particular, coordinated configuration of the states of the three genes. Next (Sect. 5.3), we estimated fitness parameters of the mutation–selection model (without imposing Assumption 2.3) and thus showed that the data can be explained by this model; i.e. the behaviour of these genes can be described using a combination of the selection and mutation mechanisms, but not by mutation alone. The treatment encompassed by the in vitro experiment had only one change of environment, when bacteria were moved from a storage environment to sequential replication on plates.
It was then natural to expect that adaptation happens soon after bacteria are placed on plates, since this major environmental shift requires rapid adaptation, whereas subsequent sequential replication on plates maintains a constant selective regime. Using the mutation–selection model with time-dependent fitness coefficients, in Sect. 5.4 we confirmed this hypothesis using data at an intermediate time point. This is a remarkable demonstration of the usefulness of the approach proposed in this paper.
Acknowledgements
We are grateful to Alexander Gorban and Theodore Kypraios for useful discussions and to Max Souza and anonymous referees for valuable suggestions. We thank Alexander Lewis for creating the web app illustrating the algorithm of Sect. 3 and the Wellcome Trust Biomedical Vacation Scholarship 206874/Z/17/Z for supporting Alexander's work. We also thank Lea Lango-Scholey, Alex Woodacre and Mike Jones for provision of data used in Sects. 3 and 5 prior to publication. This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/L50502X/1] through a PhD studentship to RH; and the Biotechnology and Biological Sciences Research Council [grant number BB/I024712/1] for CDB and MVT.
References
 Acar M, Mettetal JT, van Oudenaarden A (2008) Stochastic switching as a survival strategy in fluctuating environments. Nat Genet 40:471–475CrossRefGoogle Scholar
 Aidley J, Bayliss CD (2014) Repetitive DNA: a major source of genetic diversity in Campylobacter populations? In: Sheppard SK (ed) Campylobacter ecology and evolution, chapter 6. Caister Academic Press, Swansea, pp 55–72Google Scholar
 Aidley J, Rajopadhye S, Akinyemi NM, LangoScholey L, Bayliss CD (2017) Nonselective bottlenecks control the divergence and diversification of phasevariable bacterial populations. MBio 8:02311–16Google Scholar
 Aidley J, Wanford JW, Green LR, Sheppard SK, Bayliss CD (2018) PhasomeIt: an ‘omics’ approach to cataloguing the potential breadth of phase variation in the genus Campylobacter. Microb Genom. https://doi.org/10.1099/mgen.0.000228
 Alonso AA, Molina I, Theodoropoulos C (2014) Modeling bacterial population growth from stochastic singlecell dynamics. Appl Environ Microbiol 80:5241–5253CrossRefGoogle Scholar
 Barber S, Voss J, Webster M (2015) The rate of convergence for approximate Bayesian computation. Electron J Stat 9:80–105MathSciNetCrossRefzbMATHGoogle Scholar
 Barrick JE, Lenski RE (2013) Genome dynamics during experimental evolution. Nat Rev Genet 14:827–39CrossRefGoogle Scholar
 Battersby T, Walsh D, Whyte P, Bolton DJ (2016) Campylobacter growth rates in four different matrices: broiler caecal material, live birds, Bolton broth, and brain heart infusion broth. Infect Ecol Epidemiol 6:31217CrossRefGoogle Scholar
 Bayliss CD (2009) Determinants of phase variation rate and the fitness implications of differing rates for bacterial pathogens and commensals. FEMS Microb Rev 33:504–520CrossRefGoogle Scholar
 Bayliss CD, Bidmos F, Anjum A, Manchev V, Richards R, Grossier JP, Wooldridge K, Ketley J, Barrow P, Jones M, Tretyakov MV (2012) Phase variable genes of Campylobacter jejuni exhibit high mutation rates and specific mutational patterns but mutability is not the major determinant of population structure during host colonisation. Nucleic Acids Res 40:5876–5889CrossRefGoogle Scholar
 Beaumont MA (2010) Approximate Bayesian computation in evolution and ecology. Annu Rev Ecol Evol Systematics 41:379–406CrossRefGoogle Scholar
 Beaumont MA, Cornuet JM, Marin JM, Robert CP (2009) Adaptive approximate Bayesian computation. Biometrika 96:983–990MathSciNetCrossRefzbMATHGoogle Scholar
 Butkovsky OA (2014) On ergodic properties of nonlinear Markov chains and stochastic McKean–Vlasov equations. Theory Probab Appl 58:661–674MathSciNetCrossRefzbMATHGoogle Scholar
 Caravagna G, Mauri G, d’Onofrio A (2013) The interplay of intrinsic and extrinsic bounded noises in biomolecular networks. PLoS ONE 8:e51174CrossRefGoogle Scholar
 Crow JF, Kimura M (1970) An introduction to population genetic theory. Harper & Row, New YorkzbMATHGoogle Scholar
 de Bruijn NG (1981) Asymptotic methods in analysis. Dover, Downers GrovezbMATHGoogle Scholar
 del Moral P, Doucet A, Jasra A (2012) An adaptive sequential Monte Carlo method for approximate Bayesian computation. Stat Comput 22:1009–1020MathSciNetCrossRefzbMATHGoogle Scholar
 D’Onofrio A (ed) (2013) Bounded noises in physics, biology, and engineering. Birkhäuser, BaselzbMATHGoogle Scholar
 Fisher RA (1930) The genetical theory of natural selection. Oxford University Press, OxfordCrossRefzbMATHGoogle Scholar
 Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013) Bayesian data analysis, 3rd edn. Chapman & Hall, Boca RatonzbMATHGoogle Scholar
 Gerrish PJ, Colato A, Sniegowski PD (2013) Genomic mutation rates that neutralize adaptive evolution and natural selection. J R Soc Interface 10:20130329CrossRefGoogle Scholar
 Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov chain Monte Carlo in practice. Chapman & Hall, Boca RatonzbMATHGoogle Scholar
 Hardwick RJ, Tretyakov MV, Dubrova YuE (2009) Agerelated accumulation of mutations supports a replicationdependent mechanism of spontaneous mutation at tandem repeat DNA loci in mice. Mol Biol Evol 26:2647–2654CrossRefGoogle Scholar
 Howitt R (2018) Stochastic modelling of repeatmediated phase variation in Campylobacter jejuni. Ph.D. Thesis, University of NottinghamGoogle Scholar
 Kolokoltsov VN (2010) Nonlinear Markov processes and kinetic equations. Cambridge University Press, CambridgeCrossRefzbMATHGoogle Scholar
 Lango-Scholey L, Woodacre A, Yang L, Alarjani K, Fallaize C, Tretyakov MV, Jones MA, Bayliss CD (2019) Combinatorial shifts in phase-variable genes underpin host persistence of Campylobacter jejuni (in preparation)
 Lenski RE, Slatkin M, Ayala FJ (1989) Mutation and selection in bacterial populations: alternatives to the hypothesis of directed mutation. Proc Natl Acad Sci USA 86:2775–2778
 Levinson G, Gutman GA (1987) Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 4:203–221
 Meyn SP, Tweedie RL (2009) Markov chains and stochastic stability. Cambridge University Press, Cambridge
 Moxon R, Kussell E (2017) The impact of bottlenecks on microbial survival, adaptation, and phenotypic switching in host-pathogen interactions. Evolution 71:2803–2816
 Moxon ER, Rainey PB, Nowak MA, Lenski RE (1994) Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr Biol 4:24–33
 Noether GE (1963) Note on the Kolmogorov statistic in the discrete case. Metrika 7:115–116
 O'Brien S, Rodrigues AM, Buckling A (2013) The evolution of bacterial mutation rates under simultaneous selection by interspecific and social parasitism. Proc Biol Sci 280:20131913
 Palmer ME, Lipsitch M (2006) The influence of hitchhiking and deleterious mutation upon asexual mutation rates. Genetics 173:461–472
 Palmer ME, Lipsitch M, Moxon ER, Bayliss CD (2013) Broad conditions favor the evolution of phase-variable loci. mBio 4:e00430-12
 Pitman EJG (1979) Some basic theory for statistical inference. Chapman & Hall, Boca Raton
 Raynes Y, Sniegowski PD (2014) Experimental evolution and the dynamics of genomic mutation rate modifiers. Heredity (Edinb) 113:375–380
 Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc B 71:319–392 (with discussion)
 Saunders NJ, Moxon ER, Gravenor MB (2003) Mutation rates: estimating phase variation rates when fitness differences are present and their impact on population structure. Microbiology 149:485–495
 Tierney L, Kadane JB (1986) Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc 81:82–86
 van der Woude MW, Bäumler AJ (2004) Phase and antigenic variation in bacteria. Clin Microbiol Rev 17:581–611
 Wanford JJ, Lango-Scholey L, Nothaft H, Szymanski CM, Bayliss CD (2018) Random sorting of Campylobacter jejuni phase variants due to a narrow bottleneck during colonisation of broiler chickens. Microbiology 164:896–907
 Waxman D, Welch JJ (2005) Fisher's microscope and Haldane's ellipse. Am Nat 166:447–457
 Wilkinson DJ (2012) Stochastic modelling for systems biology, 2nd edn. Chapman & Hall, Boca Raton
 Wilkinson RD (2013) Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat Appl Genet Mol Biol 12:129–141
 Wisniewski-Dyé F, Vial L (2008) Phase and antigenic variation mediated by genome modifications. Antonie van Leeuwenhoek 94:493–515
 Wolf DM, Vazirani VV, Arkin AP (2005) A microbial modified prisoner's dilemma game: how frequency-dependent selection can lead to random phase variation. J Theor Biol 234:255–262
 Woodacre A, Lango-Scholey L, Kasli I, Howitt R, Fallaize C, Tretyakov MV, Jones MA, Bayliss CD (2019) Culture-dependent fluctuations in combinatorial expression states of phase-variable genes of Campylobacter jejuni (in preparation)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.