Exploiting fast-variables to understand population dynamics and evolution

We describe a continuous-time modelling framework for biological population dynamics that accounts for demographic noise. In the spirit of the methodology used by statistical physicists, transitions between the states of the system are caused by individual events while the dynamics are described in terms of the time-evolution of a probability density function. In general, the application of the diffusion approximation still leaves a description that is quite complex. However, in many biological applications one or more of the processes happen slowly relative to the system's other processes, and the dynamics can be approximated as occurring within a slow low-dimensional subspace. We review these time-scale separation arguments and analyse the more simple stochastic dynamics that result in a number of cases. We stress that it is important to retain the demographic noise derived in this way, and emphasise this point by showing that it can alter the direction of selection compared to the prediction made from an analysis of the corresponding deterministic model.


I. INTRODUCTION
In a series of articles on biological evolution, published in the Journal of Statistical Physics, it is natural to ask what expertise and insights can statistical physicists bring to the study of evolution, and in what way might their approach to the subject differ from biologists. If the subject is the one that will largely interest us in this paper -the study of evolution within the framework of population genetics -these questions are more easily answered. This is because in a system containing a large, but finite, number of individuals with given genetic characteristics, genetic drift leads to a stochastic dynamics which has many of the features which allow the application of the ideas and techniques of non-equilibrium statistical physics. We will use this formalism, but in addition our approach will have parallels with the traditional methodology of theoretical physicists.
Firstly, we will stress the fundamental nature of the microscopic description. That is, we will start with genes as basic constituents, which can be in different states according to their type (allele), location (if the individual carrying the gene is located on an island), the sex of the individual carrying the gene, etc.
Secondly, since the microscopic description contains too much detail which is irrelevant at the macroscale (or in our case, the mesoscale, where some stochastic element is retained) we will derive a reduced or effective model which contains parameters which depend on the parameters of the microscopic description and so encapsulate the relevant aspects of the microscopic description.
Thirdly, we will be interested in generic behaviour. By this we mean that we will attempt to formulate a microscopic description that does not have inbuilt assumptions that make the model more easily solvable. Instead we will try to formulate the model in such a way that it could be generalised to include many more effects, without changing its structure. In this philosophy, the simplifying assumptions are brought in during the process of obtaining the effective model, and should be clearly stated.
Finally, although the whole basis of our work is mathematical, we will use intuition to explore the admissibility of the techniques we use outside of their regime of strict applicability, and check their correctness through the use of computer simulations.
Although these ideas are familiar to the theoretical physics community, they tend to be untilised less in the biological sciences. For example, many biologists may use quite complex verbal arguments to gain insights. Conversely, our methodology may differ from that of many mathematical biologists, since rigorous justification will not be a central feature of our approach. In addition, many mathematical biologists are focussed on the deterministic dynamics found at the macroscale. Nevertheless, we view the approach we will discuss here as able to form a bridge between the intuition gleaned by biologists and the more analytic investigations of mathematicians. In this way we hope that our methods prove of interest to a wider audience outside of the theoretical physics community.
In a previous paper [41], we have reviewed the process of setting-up a description of this class of biological systems in terms of its basic constituents, and from this deriving the mesoscopic equations governing the dynamics which generalise what might be the more familiar macroscopic equations. In particular, in Ref. [41], we give formulae for writing down the form of the mesoscopic dynamics in terms of quantities which appear in the microscopic formulation. This essentially is the first point of our methodology described above, and so while we will discuss it here, we will refer the reader to this earlier paper for more details. Instead we will focus on the second point above, namely obtaining an effective theory that is more amenable to analysis than the original.
There may be several ways of reducing the complexity of the model, but here we will concentrate on one which is based on time-scale separation arguments. That is, we will seek to identify fast modes which die away relatively quickly, and slow modes which endure at long times. This is, of course, a well-known procedure, perhaps the most famous example being in hydrodynamics, where the microscopic molecular dynamics can be replaced by a macroscopic dynamics with a few long-lasting variables. Although this dynamics is macroscopic, a mesoscopic extension can also be derived along the same lines [21]. In the theory of dynamical systems, the concept of a center manifold (CM) is another manifestation of these ideas. During our discussion of this methodology, there will be several illustrations of the third and fourth points discussed above, namely the wish to use generic structures and the use of numerical simulations to check the precision of the approximations we utilise.
As we have already mentioned, one of our aims in writing this article is to make the ideas and techniques available to a larger audience. To help to achieve this we will present an application of the method in a pedagogical manner in Section 2. We have chosen one of the simplest possible systems: haploid individuals on one of two islands of equal size which can migrate from one island to the other. It will be assumed that are only two possible alleles which are modelled by a Moran process. After this informal, and hopefully easily accessible, introduction to the method, we will describe its application to a number of models in Section 3. These include; a haploid Moran model on an arbitrary number of islands with selection and mutation; a stochastic Lotka-Volterra model with an arbitrary number of islands and a stochastic Lotka-Volterra model with an arbitrary number of species; a derivation of the Hardy-Weinberg approximation from first principles; a model of epidemic spread on a network. In Section 4, we will illustrate the method in the slightly more technical case where noise-induced dynamics are present. We will see that noise-induced selection can cause selection for genotypes that are neutral in a deterministic setting, and that further, this noise induced selection can, under certain conditions, be strong enough to reverse the direction of deterministic selection. We will illustrate this behavior with reference to Lotka-Volterra models, where we will see that this effect can help alleviate the dilemma of cooperation, and a model of transitions between sex-chromosome systems. Finally, in Section 5 we conclude with a discussion. Phase plot for a haploid Moran model with two-alleles on two islands with strong migration (described in Section II). The deterministic dynamics rapidly collapse to a slow subspace, indicated here by the blue dashed line. Top right panel: Deterministic trajectories (grey arrows) for a system similar to that in the top left panel but with three islands, addressed in Section III A. Again, the deterministic dynamics rapidly collapse to a one-dimensional subspace indicated by the blue dashed line. Bottom left panel: Genotype frequencies as a function of time for a population genetic model, described in Section IV B. Stochastic trajectories (solid lines) initially rapidly relax on quasi-deterministic trajectories (inset) before reaching a one-dimensional slow subspace along which the system moves on a slower timescale. Bottom Right panel: A neutral three-species Lotka-Volterra model, addressed in Section III B 2. Stochastic trajectories (orange) rapidly collapse along quasi-deterministic trajectories onto a two-dimensional slow subspace (blue surface), about which they are confined.

II. PEDAGOGICAL EXAMPLE
In this section we will explain as simply as possible how to apply the ideas discussed in the Introduction to a concrete example. The example we choose is a Moran model with migration. We ask how this can be reduced to an effective one-island model.

A. Setting-up the model
We set up the model at the microscale. That is, at the level of haploid individuals who each carry an allele of type 1 or of type 2. The individuals can reside on one of two islands, both of which can only carry a fixed number of individuals, which we denote by N . We therefore denote by n 1 the number of individuals carrying allele 1 on island 1, by N − n 1 the number of individuals carrying allele 2 on island 1, by n 2 , the number of individuals carrying allele 1 on island 2, and by N − n 2 , the number of individuals carrying allele 2 on island 2. So the state of the whole system is given by only two variables, which we can form into the two-dimensional vector n = (n 1 , n 2 ). We would like to reduce this description to one involving only one variable, which gives the fraction of individuals in the system carrying allele 1. This would allow us to calculate, for example, the probability that allele 1 or allele 2 fix, and the mean time to fixation.
There is not enough information about the birth, death and migration of individuals to model them in any other way than as random processes, so the model is specified by giving the probability per unit time that the system transitions from its current state, given by the vector n, to a new state n . We write these transition rates as T (n |n), with the initial state on the right and the final state on the left (some authors use the reverse convention). The probability distribution function (pdf) of the system, p(n, t), is then given by the master equation This is relatively easy to understand. The first term on the right-hand side is made up of the probability of being in state n multiplied by the probability of being in that state and making a transition to state n. It therefore represents the probability of starting in state n and making at transition to state n. In the same way the second term on the right-hand side represents the probability of starting in state n and making at transition to state n . Their difference, summed over all state n , different to n, gives the rate of increase of p(n, t) with time.
The form we take for the T (n |n) depends on the model choice. Here we choose the Moran model, because it is simple: it amalgamates births and deaths and asks that birth, death and migration events happen in such a way that the population size of each island, N , is kept fixed. These are not the most realistic assumptions, and we discuss ways to relax them later in the paper, but they have the merit that the number of model parameters are kept to a minimum. The method of constructing T (n |n) also seems a little more complicated, due to the requirement of keeping a fixed number of individuals on each island. This is done as follows: (i) Pick an island (with probability 1/2, since the islands are identical) and then pick an individual randomly from that island. Allow the individual to reproduce by duplicating the individual.
(ii) With probability m, the progeny migrates to the other island. In this case choose an individual on the other island at random to die.
(iii) With probability (1 − m), the progeny remains on the same island. In this case choose an individual on the same island at random to die.
It should be noticed that the processes of birth and death are inextricably linked and that they are assumed to happen at rate 1, this choice being possible through a choice of time units. On top of this process, migration is imposed with a probability of occurrence equal to m (0 ≤ m ≤ 1). Following these rules, if the model is neutral the transition rate from the state (n 1 , n 2 ) to the state (n 1 + 1, n 2 ) is Similar expressions can be be found for T (n 1 − 1, n 2 |n 1 , n 2 ) and for T (n 1 , n 2 ± 1|n 1 , n 2 ). However, we would like to include selection in the model. In this case we have to weight the choice of picking an allele by the relative fitness of that allele on a particular island. Since we are aiming to be as simple as possible to illustrate the basic ideas, we will assume that this fitness weighting is the same on both islands, though it is simple enough to relax this condition.
Therefore we will denote the fitness weighting of allele 1 to be W (1) (n) and of allele 2 to be W (2) (n). Then the four transition rates from state (n 1 , n 2 ) to the new state are where is the fitness of the individuals on island i. Here superscripts are a label for the the two different alleles, whereas subscripts are a label for the two different islands. For further background on how to arrive precisely at the forms given by Eq. (3) the reader is referred to previous discussions in the literature [1,6,41]. If W (1) and W (2) are independent of n, then the selection is known as frequency independent selection. This will be assumed in this pedagogical treatment, and we write W (1) = 1 + α (1) s + O(s 2 ) and W (2) where s is a selection coefficient and α (1) , α (2) are constants. Since s is typically very small, we do not expect it will be necessary to include the order s 2 terms in the expressions for W (1) and W (2) .
Equations (1) and (3) define the microscopic model, and once an initial condition p(n, 0) for the pdf has been given, specify the dynamics for all t. All other systems discussed in this paper will have a similar structure; the form of the transition rates will differ depending on the system, but in all cases the substitution of these rates into the master equation (1) will give the dynamics. As we discussed in the Introduction, the validity of our methods and approximations are checked via computer simulations, and these use the microscopic model defined by Eqs. (1) and (3). The simulations use the Gillepsie algorithm [23,24] which is developed within the same formalism discussed in this section. However, the master equation is difficult to study analytically. It is for this reason that we make the diffusion approximation, approximating the microscopic model with a mesoscopic version.
The diffusion approximation was applied very early on the development of population genetics [19] and is widely used [14]. We do insist, however, that it should be derived from an underlying microscopic model, since there are potentially many microscopic models that give the same mesoscopic model, and so simply defining the model at the mesoscale can lead to ambiguities. The idea itself is very simple: if N is large enough, the ratios n i /N , which are formally fractions, are assumed to be continuous, and denoted by x i . At the same time the master equation is expanded in powers of N −1 , and terms in N −3 and higher are neglected.
This process can be carried out directly, but formulae exist for the analogues of the transition rates which appear in the mesoscopic equations [41]. To use these we need to introduce what are in effect stoichiometric vectors corresponding to the four "reactions" in Eq. (3). In other words we write the final state, n , in terms of the initial state, n, as n = n + ν µ , where µ = 1, . . . , 4 specify each of the four reactions. So for example, in the first reaction of Eq. (3), n 1 increases by 1 and n 2 does not change, so ν 1 = (1, 0). Similarly, ν 2 = (−1, 0), ν 3 = (0, 1) and ν 4 = (0, −1). These identifications allow us to use Eqs. (18) and (21) of Ref. [41] to show that the A and B functions which appear in the mesoscopic equations are and with B 12 = B 21 = 0. These then, specify the model, and the general dynamical equations which allow us to find the dynamics are either the Fokker-Planck equation (FPE) or the equivalent Itō stochastic differential equation (SDE) where τ = t/N is a rescaled time and η i (τ ) is a Gaussian white noise with zero mean and with a correlator As for the microscopic model, substitution of the specific forms given by Eqs. (4) and (5) into the generic forms of the FPE or SDE gives the behaviour of the mesoscopic model for all time, provided an initial condition P (x, 0) is given. For more details on derivation and meaning of these equations standard texts on the theory of stochastic processes [22,51] may be consulted, or previous articles on the application of these ideas in a biological context [1,6,41]. We end this section with two general points. First, if we take the limit N → ∞ of Eq. (7) we obtain the macroscopic model, which is deterministic, since the noise is eliminated by taking the limit. The dynamics of the deterministic model is then given by dx i /dτ = A i (x), i = 1, 2. Second, we keep terms of order s in A, but neglect them in B, since we are envisaging keeping terms of order s/N or 1/N 2 in the FPE, but discarding terms of order s 2 /N, s/N 2 and 1/N 3 . This essentially assumes that s ∼ N −1 , although we will keep s and N to be independent variables throughout the paper.
B. First stage of the reduction process: identifying the fast and slow variables Although the mesoscopic equations (6) and (7) are potentially more manageable than the differential-difference equation (1), they are still formidably difficult to analyze -equation (6) is a partial differential equation in three variables, and in most of the other systems discussed later in this paper, the corresponding equation may have tens or even hundreds of variables. To find a simpler, or reduced, form we want to eliminate the variables which decay away quickly, since they are not relevant to making predictions about the medium-to long-term behaviour. This subset of slow variables will form a slow-subspace (SS), so that instead of allowing the system to explore the whole space of variables, we only allow it to move within this subspace.
In practice, instead of searching for a SS directly, we frequently search for a CM which is composed not of slowvariables, which hardly change with time, but of conserved variables that do not change with time at all. In population genetics, for example, neutral theories may contain conserved quantities, due to symmetries in the system (the different alleles behave in the same way), and the effects of selection can be added as perturbative corrections, given the extremely small size of selection coefficients. The CM is found by looking for fixed points of the macroscopic equation (the macroscopic limit of the SDE with no noise term present).
This first stage of the reduction therefore consists of the following steps: 1. Identify a CM, perhaps by setting some same parameters to zero in order to increase the symmetry of the deterministic equations (this could be the neutral limit of the deterministic dynamics, for example).
2. Find the Jacobian of the fixed points that constitute the CM, and find the eigenvalues and eigenvectors of the Jacobian evaluated on the CM.
3. Form a projection operator from the eigenvectors found in (ii), which is used to operate on quantities in the full mesoscopic system in order to eliminate the fast variables.
4. Use this projection operator, or use conservation laws, to find the point where the system first reaches the CM. This will be the new initial condition for the reduced system.
We will now illustrate these four steps on the pedagogical example.
(i) Setting s = 0 in Eq. (4), we see that there is a line of fixed points x 1 = x 2 . This is the CM.
(ii) The Jacobian of the deterministic system dx i /dτ = A i (x), i = 1, 2, with s = 0, is This matrix has zero determinant and a trace equal to −m, which immediately gives it two eigenvalues as λ {1} = 0 and λ {2} = −m. Two typical features we would expect are illustrated here: the number of zero eigenvalues equals the number of dimensions of the CM (since there is no dynamics at all on the CM -it is comprised only of fixed points) and the non-zero eigenvalue has a real part which is negative (so that it can be identified as the fast mode which dies away quickly). In this case the non-zero eigenvalue is real, which is a reflection of the fact that the Jacobian, J, is symmetric. Another consequence of J being symmetric is that we would normally be required to find right-and left-eigenvectors of J, but these coincide for symmetric matrices and are given by We would expect that the eigenvector corresponding to the zero eigenvalue would lie in the CM, and indeed v {1} lies on the line x 1 = x 2 . The normalization of the eigenvectors has been chosen so that they are orthonormal: where µ, ν = 1, 2 and δ µν is the Kronecker delta.
(iii) The projection operator is defined by j , constructed only from the eigenvectors of the zero mode(s). To illustrate its use, we operate with it on a general vector of the full system given by i , has been wiped out using the orthogonality of the eigenvalues.
(iv) If the point at which the system begins is x IC i , then we would expect it to reach the CM at x CMIC i = 2 j=1 P ij x IC j , since only the slow (zero) modes will have survived by this time. Here i = 1, 2 and 'IC' and 'CMIC' stand for 'initial condition' and 'center-manifold initial condition' respectively. Applying the projection operator one finds that Another way to obtain this result is to use the conserved quantity which exists in this degenerate system. From Eq. (4), one sees that d(x 1 + x 2 )/dτ = 0 when s = 0. Therefore, x 1 + x 2 is unchanged in time, and so x IC These results will be used to construct the reduced model. As an illustration of the fourth general point made in the Introduction relating to intuition and the checking of approximations through simulations, we note that the trajectory from x IC to x CMIC is stochastic, not deterministic. Nevertheless, we will use x CMIC as the initial condition for the reduced system on the CM, even though it has been deduced through a deterministic argument. We expect the deterministic dynamics to dominate the collapse from x IC to the CM under the condition that the rate of migration (which controls the collapse of the system to the CM) is much stronger than the rate of genetic drift (which causes deviations from the deterministic collapse to the CM). Since the rate of genetic drift grows linearly with the population size (see Eq. (6)), this condition can be expressed m N −1 . In addition, the projection of the IC onto the CM as described is only strictly true when trajectories to the CM are linear (as in this case) or when the initial condition is close to the CM (in which case the linear approximation is applicable). We will also continue to use the eigenvectors of the neutral model when constructing the s = 0 reduced model. It would be possible to find perturbative corrections to these in s, but we expect the effects to be sufficiently small as to be completely negligible. These are judgements made on the basis of intuition. Their validity will be examined through numerical simulations, and the comparison of analytic results made on the basis of these assumptions with results obtained by simulations of the original model.
Finally it is worth noting that, as addressed in point (iv) above, an alternative line of attack is possible in which one transforms into the fast-slow basis of the problem, removes the fast-variables and then transforms back into the original, biologically relevant, variables of the problem (for an illustration of such an approach, see [11]). For the system at hand, the fast-slow basis is straightforward to obtain and is given by (x 1 − x 2 ) and (x 1 + x 2 ). However in more general problems such a basis may not be straightforward to obtain analytically (we will explore such a scenario in Section IV B) while the projection method that we develop here will continue to yield insight. We therefore continue to explore the current pedagological example using the projection formalism.
C. Second stage of the reduction process: construction of the reduced model In the first stage we worked with a version of the model which had a CM, but our real interest is in the actual, realistic, model which will typically only have a SS. This will not change materially the process of collapse described in Sec. II B: we still expect what is in effect a deterministic collapse onto the SS. However, we would like to be able to assume that the system would then stay in the subspace which is defined by the slow variables. This is true (at least deterministically) when there is a CM, since the system ceases to move once it reaches the CM (because A = 0). But this is not true when there is a SS. We therefore demand that A has no component in the fast directions. These conditions give the equation for the SS. There will still be a (weak) dynamics in the direction of the slow variables. We will also ask that there is no noise in the fast directions, only in the slow directions. In this way, the system is effectively constrained to evolve in the SS.
This second stage of the reduction therefore consists of the following steps: 1. Ask that A has no components in the fast directions. This gives the equation that the SS must have for this to be true.
2. Apply the projection operator to the SDE of the full system, to obtain the SDE of the reduced system.
We can now illustrate these steps on the pedagogical example.
(i) The condition that A has no component in the fast direction is v {2} · A = 0, where A is the full (s = 0) form given by Eq. (4). This condition is simply i + O(s 2 ) into this equation we can determine the SS. We know that when s = 0 the SS is the CM and that X 2 , so the SS is defined by x 1 = x 2 to order s. It therefore has exactly the same form as the CM, and is linear. This is not true in general, and is only a feature of this simple pedagogical example. The variable x 1 , or x 2 , will be denoted by z on the SS.
(ii) Applying the projection operator to the SDE (7) gives 2/ √ 2dz/dτ for the left-hand side, since x 1 = x 2 ≡ z on the SS. The first term on the right-hand side gives 2/ √ 2Ā(z) for a similar reason: A 1 (x) = A 2 (x) on the SS x 1 = x 2 , and we write A 1 or A 2 on the SS in terms of z asĀ(z). Therefore the reduced SDE is given by whereĀ and where From the properties of the noise η(τ ), including the correlation function in Eq. (8), it follows that ζ(τ ) is a Gaussian white noise with zero mean and with a correlator whereB The reduction we have described here has produced an effective mesoscopic model defined by Eqs. (11), (12), (14) and (15), which has only one variable, and which is sufficiently simple that it can be analysed mathematically. The pedagogical example was chosen on the grounds that its simplicity allowed the method to be clearly explained, and unfortunately this means that the reduction does not give that much useful information. For instance, the final results do not depend on m, and all we have done is to verify a known result [40] that with this type of selection and with the same selection pressure on all islands, that the population behaves in a similar way to a well-mixed population of size equal to that of the two islands added together. Actually, if one takes the calculation to higher order in s, then one can find an exception to this: migration-selection balance can occur if selection acts in opposing directions on the different islands.
In later sections we will describe applications of the method which will yield informative reduced models. Although the essentials of the reduction method will be the same as those that we have described in this section, a few aspects will make it seem slightly more complicated. Apart from the number of variables being greater, requiring additional indices, we mention (a) The generalisation of the Jacobian (9) will typically not be a symmetric matrix. This means that the non-zero eigenvalues will be complex in general, and the left-and right-eigenvectors will not coincide. We will use the notation u {µ} and v {µ} respectively for the left-and right-eigenvectors corresponding to the eigenvalue λ {µ} , where µ = 1, 2, . . .. They will be chosen to be orthonormal, that is, u {µ} · v {ν} = δ µν .
(b) As a consequence of this a typical term in the projection operator will involve left-and right-eigenvectors and the condition for there to be no deterministic dynamics in the µ-direction will be u {µ} · A(x) = 0, that is, will involve u {µ} rather than v {µ} .
These are simply small technical details, and the method as discussed here is in essence that used in the other applications which we will now go on to discuss. However, having accounted for these points, we can define more general forms ofĀ(z) andB(z) that will be relevant for later problems. In particular, for a system with initially M variables we find an effective one-dimensional approximation of form Eq. (11) with; andB and where we recall that u {1} is the left eigenvector corresponding to zero eigenvector. Since u {1} is perpendicular to all of the fast directions [6], these two terms can be viewed as the deterministic and noisy components of the full problem respectively projected onto the SS; that is

III. APPLICATIONS OF THE FAST-MODE ELIMINATION PROCEDURE
In this section we will discuss some applications of the formalism which we have described in Sec. II, making reference to previous work, notable features of the various models and possible future work.

A. The D-island Moran model
The modelling and analysis of migration effects in population genetics has always been challenging since it involves a spatial aspect in an essential way. Historically, it was Wright [61] who first studied migration models in population genetics, however he in fact did not assume a spatial structure, since migratory individuals were chosen from a global, well-mixed, population. The stepping stone model [34] was one of the first which did have real spatial structure. It consisted of a line of islands, but where migration could only take place between an island and its neighbours on either side. If we view migration as an interaction event between two islands, this is a one-dimensional model with nearest-neighbour interactions.
Expressed in this way, an obvious generalisation is to a network of islands with interaction strengths between the islands proportional to the probability of migration between the islands. Such a model was investigated by Nagylaki [43], although with discrete generations and strong migration, where the probability of a migration event is of the same order as a birth or a death. Assumptions either made the analysis difficult to follow or were not thought to be widely applicable, and many further studies of this kind followed which made a different set of assumptions. These, and further discussions and analysis can be found in the book by Rousset [53], while more recent results that utilise probability generating functions [50] are developed in [32].
In Sec. II a simple two island model was introduced. The D-island model is a generalisation of this but with added features. Details are given in earlier papers of ours [6,7,9], but examples of more general features are the migration probability to island i from island j, m ij , which will in general not be symmetric (m ij = m ji ), and the fact that islands will be allowed to contain different number of individuals (β i N on island i). The migration matrix is not so far defined for i = j, however if one sets the probability of the chosen individual not migrating equal to m ii = 1 − D j =i m ji , then this defines m ij for all i, j.
The factor N which appears in the popoulation size is the only large parameter in the model, and the approximations made in Sec. II are reliant on this. So, for example, β i should not be so small or large that we still cannot treat the factor β i N as being of a similar magnitude to N . Similarly, the number of islands D should not be so large that it can be thought of as being of order N . It is likely that in some cases the approximations will continue to be good outside of their strict range of validity, but at present this can only be tested by comparing the analytic results with simulations.
We will discuss the model with and without mutation separately, since typical questions which we are interested in answering differ. We begin with the model with no mutation.

The D-island model without mutation
This model has been analysed in detail in Refs. [6,7], where further details may be found. Here we only note that, in addition to the points already made, that the probability of choosing an island on which an individual is then chosen to die or migrate, f i , has to be done with a probability proportional to β i , if we are to get sensible results. When the approximation described in Sec. II are made, one again finds that the reduced model has only one degree of freedom, and is given by Eqs. (11) and (14). Even the functional forms ofĀ(z) andB(z) are unchanged, although now they contain parameters which are functions of most of the parameters of the starting model: where Here α i is the relative fitness of the first allele over the second on island i and u {1} is the left-eigenvector corresponding to the zero eigenvalue. So even with the added complexity of each island differing in size, arbitrary migration probabilities, selective advantage of allele 1 over allele 2 varying from island to island, and the number of islands itself not fixed, the reduction gives a standard (i.e. non-spatial model) Moran model with selection with effective parameters which can be seen from Eq. (19) contain information from virtually all the parameters of the original model.
As mentioned the reduced model is the mesoscopic version of the Moran (or, in fact, the Wright-Fisher model) with selection. The probability of either allele 1 or allele 2 fixing for a given initial state and also the mean time to fixation may be straightforwardly found [14], given as they are by the solution to second-order ordinary differential equations [7]. These are denoted by Q(z 0 ) and T (z 0 ) respectively. Here z 0 is the initial condition for the reduced system, which was referred to as x CMIC in the final paragraph of Sec. II B on the first stage of the reduction process. We have changed notation to the new coordinate z and also indicated that it is an initial condition through use of the subscript 0, rather than the superscript CMIC. It is clear that both Q and T have to depend on precisely where the reduced system is initialised.
In Fig. 2, fixation times calculated from the reduced model and simulations of the underlying microscopic model, from which it was derived, are shown. The very good agreement between the two gives us confidence in the method, with more comparisons [6,7] also giving further support. The calculation can also be taken to next order in s, and an effective term of order s 2 found in the expression forĀ(z) given to first order in s in Eq. (18). In this case s is tentaively assumed to be of order N −1/2 (although a direct indentification is not made), so that terms of order higher than N −2 have been ignored in the expansion of the master equation. Novel effects can be investigated through use of the s 2 term, and a greater range of parameters explored. We refer the reader to the original literature for a discussion of these [6,7].

The D-island model with mutation
So far in this article we have examined the effects of migration, selection and genetic drift, but not that other important process of population genetics: mutation. The inclusion of mutation has a drastic effect on the long term behaviour of the system, since it is now possible in principle for one allele to mutate to another at any time, and so concepts such as fixation probabilities and mean time to fixation do not apply. On the other hand, the pdf of the allele frequencies is now non-trivial and becomes stationary at long times. This is an interesting quantity which characterises the model and we use it to test the accuracy of the effective model found through the reduction procedure.
The microscopic model is constructed in a similar way to that described in Sec. II. There is more than one way that mutation can be included; here we follow the procedure described in Ref. [9]. In this case we allow birth/death/migration events to happen a fraction ξ of the time, and mutation events a fraction (1 − ξ) of the time. The mutation rate from the first allele to the second on island i is denoted by κ (1) i and from the second to the first on island i by κ (2) i . It may be a little unusual to allow rates to vary from one island to the other, but allowing for this does not markedly increase the complexity of the calculation and so we include it.
If we denote the rates without mutation (such as are shown in Eq. (3) for the case of two islands) as T S (n |n), where the subscript S indicates that selection has been included, but not mutation, then the corresponding rates with mutation and selection included are Here the dependence of the transition rates, T S (n |n), on elements of n that do not change in the transition has been suppressed. We may rescale time in the master equation by a factor of ξ and absorb a factor of (1 − ξ)/ξ in the mutation rates. This effectively means that we can drop the factors ξ and (1 − ξ) from Eq. (20).
Making the diffusion approximation, the mesoscopic model is given by Eq. (7) with the noise correlator given by Eq. (8). In this case where κ (1) = (κ D ). Since mutation has been modelled as a linear process, the κ dependence in A i (x) in Eq. (21) is exact. We will neglect the κ dependence in B ij (x) for precisely the same reasons that we neglected the dependence on the selection coefficients: we are assuming that the elements of κ (1) and κ (2) are so small that they can be thought to be of the same order as N −1 . We therefore only keep terms of order κ (1) /N , κ (2) /N and 1/N 2 in the FPE.
The reduction process itself is similar to that discussed previously since, as just mentioned, mutation rates are generally very small, and so they can be treated as perturbations of the neutral model in exactly the same way as was done for selection strengths. Therefore the reduced model is given by Eqs. (7) and (8) withB(z) unchanged from the form given in Eq. (18). Perhaps not surprisinglyĀ(z) is modified by the addition of an extra term depending on the mutation rates [9]:Ā and where a 1 is defined in Eq. (19). Here, as in Sec. III A 1, we retain terms only to order s. The stationary pdf of the effective theory can be straightforwardly found from the FPE corresponding to the SDE (7). Using the explicit forms forĀ(z) andB(z) one finds that [9]  where N is a normalisation constant and A comparison between simulations of the original model and calculations from the reduced model in Fig. 3 shows that the reduced model captures well the features of the full model. There are several ways in which the work discussed here and in the previous section could be taken forward. There is scope for biologists to tailor the technique to their own interests, perhaps including additional processes with their own set of parameters and dropping others. Mathematicians may be able to provide conditions under which the approximations made would be expected to be valid, perhaps giving upper bounds for the negative real parts of the non-zero eigenvalues. Another extension is to perform a similar analysis, but starting not from the Moran model, but from one which is closer to those used in ecological modelling. We now go on to discuss this.

B. The stochastic Lotka-Volterra model
The theory of evolution has had a very convoluted history, and a reflection of this was the significant contribution made to the subject by theoretical studies, as least, compared to other areas of the biological sciences. This had repercussions on the nature of the mathematical models used in these studies: they tended to be unrelated to broader questions relating to the organism, and more focussed on the combinatorics of allele selection. This was a good strategy when trying to test the ideas of Darwinian evolution, but it tended to isolate the theoretical development of the subject from developments elsewhere.
An example is the Wright-Fisher model [20,61], one of the first models of genetic drift, as well as the precursor of the Moran model [42]. In this model there is no competition between individuals -a trait which is obviously a feature of Darwinian evolution. The population therefore grows very quickly, and is kept under control by sampling the very large pool of individuals to form the next generation. Only a fixed number, N , of individuals are retained to form each new generation (Wright-Fisher) or births and deaths are coupled so the population at any given time is always equal to N (Moran). This leads to an artificiality in the way that the models are set-up.
In this section we will use as a starting point models in which the population is regulated by competition, rather than by a fictitious constraint which fixes the population size. This is a closer reflection of reality, and indeed the formulation of aspects of the models seem less contrived. It is a constant theme in the biological modelling literature that models of evolution should have a more ecological flavour, and this approach conforms to these views. An apparent disadvantage is that the number of variables is increased. To see that, recall that the Moran model of Sec. III A had only D variables, since the number of individuals carrying the second allele on island i could be expressed in terms of the number of individuals carrying the first allele on the same island i.e. N − n i . If the population of an island is not fixed, the number of individuals carrying the first and second allele can independently vary, and so the number of variables will double. Below we will describe how application of the reduction methods to models with competition show that they reduce to Moran-like models [8,10]. The competition will be chosen to be of the simple Lotka-Volterra type [52], but in principle more complex competitive processes could be utilised. Since these are stochastic models, we will refer to them as stochastic Lotka-Volterra (SLVC) models.

The D-island SLVC model
As indicated above this system has 2D variables which in the microscopic model are n (α) i , where i = 1, . . . , D labels the islands and α = 1, 2 labels the allele. The comment made above concerning SLVC models being less contrived can be illustrated here in the way that migration is modelled. The procedure for doing this in the Moran model involves considerable care in making sure that there are no biases built into the way the transition rates are constructed (see Eq. (3) in the simple two-island case) while keeping the population of both islands fixed. For example, one has to ensure that a death on island j occurs before a migration from island i to this island is allowed. In the SLVC model, one simply specifies birth, death, competition and migration rates, respectively b The diffusion approximation is applied in the same way as in the Moran model, although now the large parameter is not N , which is no longer present in the definition of the model, but V , which is some measure of the size of the system, such as the volume. Although there are 2D variables initially, 2D − 1 of these are fast, and so the reduced model has again only one variable. It may be possible in some parameter regimes to see a clear cut decay first to the D variables of a Moran-type model with fixed populations on each island, and then a slower decay to an effective one-island model, which parallels the discussion in Sec. III A, but in many cases these time-scales will be similar or will overlap. The time-scales are related to the inverse of the eigenvalues of the Jacobian, and in general, these are complicated functions of all the model parameters.
While it is true that the reduced SLVC D-island model does reduce to a system which has a mesoscopic description given by Eqs. (11) and (14) -although with N replaced by V -and withB(z) = b 1 z(1 − z) to leading order in , the form ofĀ(z) is a little different. It is found to be given by [47] In this case a 1 , a 2 and b 1 are functions of the rates given which appear in Eq. (26), β i , and the left-eigenvector of the Jacobian corresponding to the zero eigenvalue. This change in the form ofĀ(z) may be slight, but it could give significantly different fixation probabilities and mean time to fixation. One reason for this is that it is now possible to have a fixed point in the deterministic dynamics. This dynamics will be given by Eq. (11) without the noise term, and so fixed points are solutions ofĀ(z) = 0. WhenĀ(z) has the structure shown in Eq. (27), an internal fixed point (i.e. one not at the boundaries z = 0 or z = 1) is possible if a 2 = 0: z * = a 1 /a 2 , where the asterisk denotes a fixed point. It will only exist if 0 < a 1 /a 2 < 1, but if it is stable, it may prolong the time taken for the system to fix (reach the points z = 0 or z = 1). Similarly if it unstable, it may make it difficult for the system to cross from the region z > z * to the region z < z * , and vice-versa. The test the approximation we again compare the fixation probabilities and mean fixation time derived from the reduced model and those found from simulations of the original model. Although the form ofĀ(z) is slightly more complicated than before, it is nevertheless still straightforwaard to work with the ordinary differential equations for Q(z 0 ) and T (z 0 ). The results shown in Fig. 4 shows the reduction method again working well in this case.

The M -allele SLVC model
So far we have only discussed individuals which are haploid and carry one of two possible alleles. Here we discuss the generalisation to M alleles. This is interesting for a number of reasons, not least because we are able to recognise patterns that are not apparent in the two allele case. As in Sec. III B 1, the variables in the microscopic model are n (α) , where now α = 1, . . . , M and there is no island label, because we will assume that the population is well-mixed. Previously we used the reduction method to obtain an effective model which was amenable to analysis. Here will have a different perspective: we will ask if we can perform a reduction on the multiallelic SLVC model with M variables to the multiallelic Moran model with M − 1 variables. If this is so, then the more natural SLVC model will give the same results as the Moran model at medium and long times. From a mathematical point of view, the difference between the reduction described in Sec. III A and in Sec. III B 1 is that previously there were D − 1 or 2D − 1 fast modes, and a single slow mode, whereas here there is one fast mode and M − 1 slow modes, thus giving an effective model which is (M − 1) dimensional [10].
The reduction procedure has been carried out in Ref. [10]. An equation analogous to Eq. (11) is found, but now in (M − 1) variables: where ζ (a) (τ ) is a Gaussian noise with zero mean and with a correlator Here, the notation of underlining a vector is used for (M −1)-dimensional vectors and bold for M -dimensional vectors. The correlation function is given to zeroth order in , which is the neutral model result. It is exactly the form found in the M -allele Moran model, up to some rescalings which are absorbed into the new time τ [10]. In the neutral case A = 0, and the SLVC model reduces exactly to the Moran model at medium to long times, after rescaling the time.
If selection is included, A is no longer equal to zero. In order to aid the comparison with the Moran model, it is useful to introduce two quantities:Ĉ and Then A a (z) takes the form Simulations results are the mean of 10 3 stochastic simulations of the Moran and SLVC models. Parameters used are given Ref. [10], where the parameterization of x (0) in terms of κ is also described.
For now, we note that, just in Eq. (27), A is cubic in the components of z. This is an important point in mappings between reduced SLVC and Moran models, as we will now see. We begin our disussion with the M -allele Moran model in the case where the selection is frequency independent, that is, when the weight functions W (α) , analogous to those introduced in Eq. (3), are independent of n. Specifically we assume that W (α) = 1 + ρ (α) s, where the ρ (α) are constants. This is the case that we have examined so far in this paper. Then we find, after making the diffusion approximation, a model given by Eqs. (28), (29) and (32), but only ifĈ (ab) = 0 for all a and b. If this condition holds, the reduced SLVC model and the Moran model with frequrncy independent selection match, provided that we make the identification ρ (α) = Φ (α) −ĉ (αM ) . We also need to match up the selection strength used in the SLVC model ( ) to the one used in the Moran model (s). The relation between them is: s = (b (0) − d (0) )/b (0) . Although some care has to be taken with making the identification between the two models [10], one can note that the function A in the Moran model with frequency dependent selection is quadratic, and Eq. (32) is cubic in general, so the conditionĈ (ab) = 0 gives the possibility of a direct mapping between the two models.
We have assumed that selection is frequency independent so far, since this is the usual supposition made by many population geneticists and historically was the standard assumption used. However this may simply be a theoretical prejudice, since if one wishes to allow the fitness weightings W (α) to depend on the composition of the population, one has to devise a model for this dependence, and so frequency independence is the simplest and most convenient choice. In addition, there are hints from experimental investigations that even if there are attempts to suppress factors that might lead to frequency dependent selection, it still seems to emerge [39]. Therefore it seems important to devise a natural way of including frequency dependence in modelling selection. Fortunately, there does exist a methodology to do this. It is based on ideas from game theory, where each allele "plays" a game with every other allele in the population [45]. In the way we choose to implement this [10], the fitness weightings are taken to have the form where g (αβ) is the payoff to allele α from interacting with type β.
We can now carry out make the diffusion approximation, just as in the frequency independent case, but now using W (α) (n) given by Eq. (33), rather than the n-independent form W (α) = 1 + ρ (α) s. Clearly, the structure of W (α) (n) in Eq. (33) can potentially lead to more complicated z dependence in A, and indeed A is now found to be cubic, and of exactly the same form as that given by Eq. (32). To get the precise correspondence between the two models one must take G (ab) = −Ĉ (ab) for all a and b, where G (ab) hass the same structure as is displayed forĈ (ab) in Eq. (30), namely In addition the identification g (αM ) = Φ (α) −ĉ (αM ) has to be made. The fact that it is only G (aM ) and G (ab) , and not g (αβ) alone, that appear in the expression for A is interesting, since the quantity G (aβ) can be interpreted as a relative fitness, namely the payoff to allele a against an opponent β relative to the payoff to allele M against the same opponent. Similarly, G (ab) is a relative relative fitness. Therefore, as one would expect, it is not the actual payoffs which are important, but their values relative to the other payoffs.
The finding that the more realistic SLVC model reduces to the Moran model with frequency dependent selection is another reason to use frequency dependence in the modelling of selection in the Moran model. Although, as we have already remarked, the resulting Moran model is still (M − 1)-dimensional, and so difficult to analyse, some progress can still be made in some cases [10]. In this way the SLVC model can, in effect, be analysed. An example of such a situation is shown in Fig. 5.

C. Diploid Moran model with sexual reproduction: The Hardy-Weinberg assumption from first principles
In the Moran model addressed in Sec. II and Sec. III A, individuals are haploid (they carry only a single allele) and reproduction occurs asexually (individuals simply duplicate themselves). While is a relevant case for certain simple organisms, it is less so for many complex organisms which are diploid and reproduce sexually, such as animals.
Suppose we want to model such a system, with diploid individuals and two possible alleles at a single locus, reproducing sexually with any other individual in the population (here, there are no sexes). A mechanistic approach might be to attempt to model the three possible genotypes in the population; here we will denote the homozygotes A (1) A (1) and A (2) A (2) , while the heterozygotes will be denoted A (1) A (2) . If we fix the population size to N , this leaves two free variables. As we have previously discussed, this makes obtaining analytic quantities in the model far more difficult than in the asexual haploid case, where the system was described by a single variable.
Classic Approach. Classic studies in population genetics circumvented this complexity by building single variable models that implicitly exploited a separation of timescales. It was noticed early on in theoretical population genetics that if there were no fitness differences between the genotypes (i.e. the system was neutral) the frequency of genotypes in such a diploid system would quickly relax to Hardy-Weinberg frequencies [29,59] , where the number of each genotype could be described in terms of a single variable, the frequency of one of the alleles. In the terminology of our present paper, the system would quickly relax to a CM. Denoting the allele frequencies x (1) = n (A (1) ) /(2N ) and x (2) = n (A (2) ) /2N = (1 − x (1) ), and the genotype frequencies y (1) = n (A (1) A (1) ) /N , y (2) = n (A (1) A (2) ) /N , y (3) = n (A (2) A (2) ) /N = (1 − y (1) − y (2) ), this is given by [18] Rather than model the dynamics of the diploid population, the dynamics of the alleles was modelled with the assumption that they existed at Hardy-Weinberg frequencies. This was assumed to also hold when selection was sufficiently weak that the deviations from these "equilibrium" frequencies were not too great [42] (note the conceptual similarities with our approach). Genotypes AA are assumed to be under selective pressure A (1) A (1) , genotypes A (1) A (2) under (1 + sh) and genotypes A (2) A (2) under 1 [18]. Note then that choosing h > 1 corresponds to overdominance, while h ≤ 1 corresponds to underdominance [18]. The details are given in Appendix A, however upon applying the diffusion approximation one obtains an FPE (6) or SDE (7), with [18] A(x (1) ) = sx (1) Mechanistic Approach. Whereas Eq. (36) was developed using an a priori assumption that the system lay at Hardy-Weinberg equilibrium, we may now use the methods detailed in Section II to formally obtain an approximation for the dynamics on the CM. We note that a similar approach was taken recently in [30] and that this separation of timescales has been long noted and exploited [16,17,58]. We begin by modelling the genotypes themselves; genotypes encounter each other at a rate proportional to their frequency in the population, weighted by a joint probability W (αβ) that genotype α successfully mates with genotype β. In this way we account for selection. In a similar fashion to Section II A, we assume selection is small and formalize this by setting Expanding the master equation for small s as in Section II A, we obtain a two dimensional description of the dynamics. We now seek to understand how this two-dimensional description is related to the classic description.
Having set up the model, we can proceed to the next stage of the analysis; identifying the fast and slow variables, as described in Section II B. We can obtain a CM by setting selection equal to zero (s = 0). In this case the CM is given by Eq. (35). We can now linearise about the system about this CM to obtain the system's Jacobian, and use this matrix to obtain u {1} and v {1} , the left and right eigenvector of the Jacobian corresponding to zero eigenvalue. Finally, using Eq. (16) and Eq. (17), we can obtain an effective description for the system dynamics in terms of z = y (1) on the CM. However, in order to make a comparison between this effective theory and the classic theory, we must express the effective theory in terms of x (1) , the frequency of A (1) alleles (see Eq. (36)). Since y (1) = (x (1) ) 2 on the CM, we must therefore make the transformation x (1) = √ z. The full calculation is given in Appendix A. We note that while there are some subtle mathematical points that need to be attended to, these should not distract from the key points of the method. Our final result is that we can approximate the dynamics of the mechanistic model by an SDE of type Eq. (7) with terms A(x (1) ) = sx (1) (1 − x (1) ) α (23) − α (33) + α (13) + 2α (22) − 6α (23) + 3α (33) While the form of Eq. (38) appears a little complicated, we can in fact show that this becomes identical to Eq. (36) under the assumption that each term α (ij) can be decomposed into the sum of contributions from each allele; and that α (i) take the precise values Given these values give a precise mapping from Eqs. (38) and (39) This is in fact exactly the form of W that we would expect at leading order in s if i.e. if the fitness of each genotype is the same as that postulated in Eq. (36) and the success of genotype pairings is a multiplicative function of the individual genotype fitness. A similar mapping exists if we assume an additive interaction of the genotypes fitness on their pair sexual success or an averaging effect (see Appendix A). It is testament to the great physical intuition of the founders of population genetics that this captures their assumptions. While we have essentially recovered here a known result, it is worth noting that of course the approach we have taken is not without merits. In particular, it allows us to explore a far richer fitness landscape than the original model (see Figure 6).

D. Stochastic epidemics on networks
The networks which we have discussed so far in this paper have described the way in which islands interact with each other through migration. The islands have had populations of individuals which are already themselves large (in the sense that they are of order N , the large parameter in the system). These are metapopulation models, since the whole system is a population of populations [28]. However, in many network models the nodes are composed of a single individual, rather than a population. It is not immediately obvious how to apply the diffusion approximation in this case, and hence the reduction method of Sec. II. However there are situations when the programme we have outlined so far can be pursued, and we now describe such a case.
The example is taken from stochastic models of infectious diseases, rather than population genetics, but it illustrates the points we wish to make clearly. We will follow Ref. [46], where more details can be found. The model is an SIR model, which means that the individuals are either susceptible to the disease (in which case they are in class S), infected with the disease (in which case they are in class I), or recovered from having the disease (in which case they are in class R). We assume that the disease is such that individuals recover, and do not die as a consequence of having the disease. Our interest will be in the properties of an epidemic which might occur, and we assume that this takes place over a much shorter timeframe than the demographic processes of birth and death. So the only processes which occur in the model are (i) infection, and (ii) recovery.
The network structure enters because we assume that each individual has a fixed number of contacts from which the infection is acquired, even though the contacts themselves will change. Therefore we may imagine all individuals located at the nodes of a network, where the degree of that node is equal to the number of contacts characteristic of that individual. The network is considered in the so-called dynamic limit, in which the network structure is assumed to evolve much more quickly than the epidemic, so that the only role of the network is to encode the number of connections a given individual has to other individuals. The variables at the microscale are therefore S k , I k and R k where k labels the degree of the node on which individuals of the type S, I and R are located. As is often done in SIR models, we assume that the population is closed, so that at time t S k (t) + I k (t) + R k (t) = N k , where N k is independent of time. This implies that one set of individuals -for example those which have recovered -can be removed: R k (t) = N k − S k (t) − I k (t). The large parameter in the model is the total number of individuals: N = K k=1 N k , where K is the maximum degree of the network. The specific network of interest in Ref. [46] was a truncated Zipf distribution where the probability of an individual having degree k is given by d k ∝ k α , with −3 < α < −2 and k = 1, . . . , K, but the method we describe is also applicable to other distributions.
We have stressed throughout this article that one needs to start with the microscopic description, at the level of single individuals, to avoid ambiguities. However here, to avoid too much formalism, we will move directly to the mesoscopic model which can be derived from the microscopic description [46]: where and where k = 1, . . . , K and x is a vector of all the 2K variables. Here β and γ are the infection and recovery rates respectively, and N is assumed to be sufficiently large that s k = S k /N and i k = I k /N can be assumed to be continuous. We will not give the precise form of B (k) µν (x), µ, ν = 1, 2, here, nor the form of the rescaled time τ . The problem we face is clear from Eqs. (44) and (45): this is a stochastic system with 2K variables, where in many cases of interest K will be not be small. This is then another case where we wish to reduce the number of variables, and where the reduction method described above may be useful. In fact, one can effect a partial reduction before applying the technique of Sec. II by making the ansatz s k (τ ) = d k θ(τ ) k , which replaces the K equations (44) by the single equation where ξ(τ ) is a Gaussian noise with zero mean which is related to η (k) 1 (τ ). After this initial manoeuvre we attempt to further reduce this K + 1-dimensional system by searching for fast and slow modes. Examination of the Jacobian [46] reveals that there one K-fold degenerate eigenvalue which may be significantly greater in magnitude than the remaining eigenvalue. If we denote the ratio of the magnitude of the former to the magnitude of the latter by , then as long as is small there will be effectively K fast modes and 1 slow mode, so we would expect to be able to reduce the motion to a two-dimensional SS where one of the variables is θ and the other is the slow one just mentioned. It is found that this additional variable is λ(τ ) = K l=1 li l (τ ) [46], so that Eq. (47) simply becomes dθ/dτ = −βθλ + ξ(τ )/ √ N . The equation for λ(τ ) is more complicated, involving as it does three different noise terms, and we do not give it here.
Although the system has been reduced to a manageable two dimensional model, it is always worthwhile to perform numerical investigations to see if it possible to make further reductions, since it is still the case that two dimensional stochastic systems are difficult to study analytically. In this case, it is found that typically the noise term in the theta equation is much smaller in magnitude to the noise terms in the lambda equation. Therefore we can try to omit the noise term in Eq. (47) and combine the three noises in the λ equation together to obtain the further simplification: where φ(θ) = K k=1 k 2 d k θ k , ζ(τ ) is a Gaussian noise with zero mean and ζ(τ )ζ(τ ) = δ(τ − τ ), andσ is a function of θ and λ which is not given here [46]. The model defined by Eq. (48) is a semi-deterministic mesoscopic model, in the sense that one of the dynamical equations is deterministic, having no noise term. In fact it can be identified as the Cox-Ingersoll-Ross (CIR) model [13], for which some analytic results are known.
As a test of the various reductions that have been made on this model we will investigate how well they capture the distribution of epidemic sizes as given by r ∞ , the number of recovered individuals at the emd of the epidemic [46]. The results are shown for a particular case in Fig. 7, and indicate that even the very much simplified model given by Eq. (48) gives a good representation of the result.

IV. REVEALING NOISE-INDUCED SELECTION VIA FAST VARIABLE ELIMINATION
In Section II, we laid the groundwork for conducting fast-variable elimination in stochastic systems, while in Section III the utility of this approach was demonstrated. We have essentially shown that, given a separation of timescales exists, the dynamics of a high dimensional system can be approximated by the projection of these dynamics onto an SS. Thus far, this is true for both the deterministic and stochastic component of the dynamics (see Eqs. (16) and (17)). However, there is a further consideration that we have until now omitted; noise-induced dynamics in the slow subspace. While accounting for such dynamics is more technically involved than the method outlined in Section II, the resultant dynamical behaviour can be very interesting, and so it is worth describing the key aspects here.
Noise-induced dynamics are the phenomena whereby the projection of the deterministic dynamics onto the slowsubspace (see Eq. (16)) does not entirely describe the time-evolution of the average dynamics when demographic noise is accounted for. In the context of a system that features a CM, this means that rather than there being no average dynamics along the CM, a bias in fact emerges in some direction. Although this is a second order effect (it scales inversely with the population size and disappears in the infinite population size limit), it can in fact completely govern the dynamics along the CM, where the first order terms that control the collapse of the system to the CM disappear. If the system under consideration is one featuring competing organisms, this bias can be interpreted as noise-induced selection (or drift-induced selection in a population-genetics context). The origin of these bias terms can be interpreted in various ways.
First, and perhaps most intuitively, noise-induced dynamics can be graphically understood as resulting from a bias in how fluctuations taking the system off the CM (or SS) return to the CM. Under certain scenarios, such as when the CM is strongly curved or the trajectories to the CM are divergent, fluctuations off the CM do not return, on average, to the point on the CM from which they originated (see Figure 8). This introduces a bias that stochastically 'ratchets' the dynamics in a preferred direction.
Second, these noise-induced dynamics can be understood as a mathematical consequence of making a non-linear change of variables into the slow-subspace of the stochastic system. As we have mentioned, the SDEs that we work with should be interpreted in the Itō sense. In this context, different rules of stochastic calculus apply, and accounting for the additional terms that arise in the analysis gives rise to additional terms in the reduced description on the CM. Note that for readers not familiar with the analysis of SDEs, Itō calculus can in some sense be loosely understood as arising in the same way as Jensen's inequality; the average of a function of a random variable is not necessarily the same as the function evaluated at the average of that random variable.
Finally, in a more biological context, noise-induced dynamics in systems of competing organisms can be understood resulting from a selective pressure to reduce variance in reproductive output (see Gillespie's Criterion [25,27]). However, in contrast to Gillespie's original study, variance between the reproductive output of organisms can arise as a result of the dynamics of the organisms, rather than being assumed a priori.
Of course, the existence of these noise induced dynamics is not in itself dependent on a system exhibiting a separation of timescales. We have already mentioned Gillespie's Criterion, which demonstrates that selection for a genotype with a lower variance in reproductive output takes place in a model with only a single variable (measuring the frequency of one of two genotypes). Further, in a similar single-variable population genetics context that assumes a well-mixed population of finite size, it has been shown that noise-induced selection favours genotypes that increase the population size [31,33]. However, while fast-variable elimination does not generate noise-induced dynamics itself, it does often give us a means of quantifying and analytically understanding the effect.
As in the earlier sections, rather than transforming into the fast-slow basis of the problem, removing the fast variables and then transforming back into the original variables, we take a shortcut by using a non-linear projection of the dynamics on the CM. The reason for this is two-fold. First, it is more straightforward practically; it is in the original, biologically relevant variables that we can better understand the behaviour of the system. Second, the non-linear transform that yields the slow-fast basis is not always obvious. In order to obtain the non-linear projection, second order perturbation techniques can be used to deduce the effective dynamics. The full calculation is too long to reproduce here, however a clear and coherent explanation is given in [49]. There it is shown that if a system has M variables and a one-dimensional slow-subspace, then the equation for the reduced/effective dynamics can be expressed by The termĀ(z) and the white noise term ζ(t) retain the forms given in Eqs. (16)-(17) (that is, they are respectively components of the deterministic dynamics and noise projected onto the SS). Most interesting is the termĀ S (z) as it is this that controls the noise induced dynamics. It is of order 1/N (induced as it is by demographic noise) and takes the explicit formĀ where we recall that v {1} and u {1} are the right and left eigenvectors of the neutral Jacobian on the CM corresponding to the zero eigenvalue (see Section II). The terms which feature du {1} i /dz capture how quickly the fast directions change as a function of z, and can thus be understood as the component of the noise-induced dynamics that arises from the non-linearity of trajectories to the CM. Meanwhile the term dv {1} k /dz is the component of the noise-induced dynamics that arises from curvature of the CM itself [49]. Note that if both v are independent of z (that is, the CM is linear and the direction of fast trajectories to the SS do not vary along the SS to first order in the selection strength), then there can be no noise-induced dynamics. This is the case in the models discussed in Sections II and III A. Finally, there are also cases where, despite the CM being curved or the trajectories to the CM being divergent, there is still no noise induced selection as the terms in Eq. (50) all cancel. This is the case in both Sections III B and III C for parameterisations given in those sections.
In what follows we will illustrate how the effective description provided by Eq. (49) can be used to analytically tackle some particular problems of interest. In Section IV A we will address a two-species Lotka-Volterra model. Unlike in Section III B however, we will allow the species to have distinct birth, death and competition rates at leading order. Calculation of the effective system will reveal both that slow-living species and species that increase the global carrying capacity are stochastically selected for. In Section IV B we will describe a population genetic model of transitions between modes of sex determination. Although this population genetic model is much more complicated than the haploid Moran model, the presence of a CM in the neutral limit allows us to analytically characterise a noise-induced bias favouring the substitution of dominant neutral mutations.

A. Two-species Lotka-Volterra type models
We begin with a two-species Lotka-Volterra model of a similar type to that described in Section III B 2 (see Eq. (26)), except we now make the restriction By taking the limit of large system size, we can apply the diffusion approximation, as described in Section II A. The system is now approximated by an SDE of the same form as Eq. (7). However, since the system is in two variables, it is difficult to analyse. We next follow the approach taken in Section II B and identify a CM. A CM exists under the above parameterisation if is equal to zero. It now takes the form Notice that in isolation, species 1 exists at x (1) =b/c (1) and species 2 at x (2) =b/c (2) . Further, if we assume that c (2) > c (1) , then increasing the frequency of species 1 in the population increases the joint carrying capacity of the species, as can be seen in Figure 8. If additionally > 0, such that species 1 reproduces at a lower rate that species 2, then this can be interpreted as a model of public good production. Species 1 pays a cost to its reproductive rate to increase the carrying capacity of both species. Species 2 pays no cost, but still enjoys a reduced death rate in the presence of species 1 as a result of the lower competition parameter of species 1 (interpreted here as resulting from the production of a public good). However, in isolation, species 2 exists at lower numbers than species 1, interpreted here as resulting from the non-production of the public good.
We can now briefly describe the deterministic dynamics. In the neutral limit, where = 0, the population grows to a point on the CM described by Eq. (52). We now move away from the neutral limit, such that 1 > 0; the system grows to a point on the SS (which is equal to the CM at leading order in ) after which the system moves along the SS until species 1 is driven to extinction. We now use fast-variable elimination to characterise the dynamics when demographic noise is accounted for. x (1) x (2) 0.  (2) , d (1) = d (2) , c (1) = c (2) ). Fluctuations down and right are projected back to the CM in the direction of x (1) , but this is perfectly countered by fluctuations up and left which are projected back to the CM with in the direction of x (2) . Top right: System with two species, the second reproducing and dieing at a faster rate (b (1) (1) , c (1) = c (2) ). As a result of this asymmetry, species 2 experiences greater demographic fluctuations (the ellipse is larger in direction x (2) ). Consequently, fluctuations off the CM are projected back to the CM with a bias that favours x (1) . Bottom: System with two species, the first increasing the carrying capacity of the system (b (1) = b (2) , d (1) = d (2) , c (2) > c (1) ). The asymmetry in the angle of the CM now induces a bias in the projection of fluctuations back to the CM that favours x (1) .
As discussed in Section II B, our first task is to identify the left and right eigenvectors of the system evaluated on the CM, Eq. (52). Note that while the right-eigenvectors are not so important in Section II C (where there was no noise-induced selection), they become very important here, featuring as they do in Eq. (50). We find (2) .
We can use these expressions for the left and right eigenvectors to obtain an approximate description of the dynamics in the SS using Eq. (49) with expressions forĀ(z),Ā S (z) andB(z) directly from Eqs. (16), (50) and (17), where z is the value of x 1 on the SS; An analysis of these terms reveals a number of points. First, and as we might expect, in the limit (2) ≡ d (0) the noise induced termĀ S (z) vanishes and we are left with a system that takes the same form as that in Section III B.
We next consider the limit where = 0. NowĀ(z) = 0 as there are no dynamics along the CM in the infinite population size limit. However, if the finite population has a finite size the noise induced termĀ S (z) is not zero. Continuing with the system when = 0, and now additionally asking that c (1) = c (2) ≡ c (0) (i.e. that the carrying capacity is unchanged by the composition of the population), we find thatĀ S (z) is positive for all z along the CM if b 1 < b 2 . The species with the lower birth rate (and death rate, sinceb is fixed) is therefore selected for. This insight, made in [5,36,48], is a result of the fact that phenotypes which are reproducing and dying more quickly are subject to greater population fluctuations (they have a larger rate of population turnover). Consequently, it easier for the longer lived phenotype (lower birth/death rates), to invade and fixate. This phenomena can be viewed as analogous to Gillespie's Criterion.
Turning instead to the limit = 0, b (1) = b (2) ≡ b (0) and d (1) = d (2) ≡ d (0) reveals a similar, but biologically distinct insight. In this case though the rate of population turnover of both species is identical, one species exists at greater numbers in isolation than the other (this species has a lower value of c (α) ). Again whileĀ(z) = 0, the noise-induced termĀ S (z) is non-zero and drives the dynamics in the finite system. We now find thatĀ(z) > 0 for all z on the CM if c (2) > c (1) . That is, the species that increases the joint carrying capacity of the species is selected for. One interpretation of this noise-induced term is that it is easier for a novel mutant species to invade a small population than a large one; thus species that increase the carrying capacity of the population receive a benefit by being more stochastically robust to invasion attempts [12].
Note that in the above context, if > 0 but N is finite, it is possible thatĀ S (z) >Ā(z) along the length of the SS. Biologically, this provides a mechanism that allows for the evolution of public good producing behaviour despite the evolution of such behaviours being forbidden in the deterministic limit. This has been noted in classic population size models [31,33] as well as models of the form described here. If the population is not well mixed but exists in space, it has been shown that for weak migration the noise induced selection for public good production is amplified both in metapopulation [5,12] and continuous space models [26]. In [12] it was also shown that this behaviour (whereby a species that increases the joint population carrying capacity is stochastically selected for) is generic and robust to the inclusion of a suite of environmental variables in the model that can modify population size.

B. Models of transitions between male and female heterogamety
In this model, a diploid population is considered in which sex is genetically determined by a dominant mutation at a single locus. In mammals, sex is determined by the dominant Y chromosome, so that XY individuals are male while XX individuals are female. Such a system is termed male heterogamety. In birds the situation is somewhat reversed. Their sex determination system features a dominant feminising W chromosome, such that ZW individuals are female, while ZZ individuals are male. This scenario is termed female heterogamety.
Intriguingly, while these systems are relatively static in both mammals and birds, transitions between male and female heterogamety can occur in reptiles and amphibians. In this section, we will discuss how fast-variable elimination can be exploited to understand the impact of genetic drift on these transitions. We consider a system of male heterogamety comprised of XX females and XY males. A mutation arises on the X chromosome, changing it to an X and rendering it dominant to the Y, such that X Y individuals are female. Genotypes X X can also be produced, which are female, as can YY genotypes, which are male. This renders a system of five genotypes. Along with the absorbing state in which the system is entirely XX and XY (male heterogamety), there is also and absorbing state when the system is entirely X Y and YY (female heterogamety). Note that this state is analogous to that found in birds up to some relabeling of the chromosomes. We wish to understand the transitions between these states. We can construct a population genetic model of this process in a very similar way to the diploid Moran model (see Appendix A), except with matings restricted to being between males and females. We assume a fixed population size N . Males and females of each genotype encounter each other proportional to their frequency in the population. They produce a progeny, that inherits its chromosomes (alternatively, alleles at a single locus) in a Mendellian fashion from its parents. Simultaneously, in order that the population size is fixed to N , another individual is picked to die. In order to account for selection against certain genotypes, the probability that each genotype is selected to die can be a weighted probability. The detailed model set-up is given in [57], however here we will discuss only the main results.
Let the frequency of XX, X X and X Y (female) and XY and YY genotypes (male) be respectively given by x = (x (1) , x (2) , x (3) , x (4) , x (5) ). Due to the condition of fixed population size, we can then express x (5) as x (5) (4) . If all genotypes are equally fit, then this four-dimensional system exhibits a one-dimensional CM that connects the absorbing states, characterised in [3]. It is given by and illustrated in Figure 9. By definition there are no deterministic dynamics along this line and so one might expect that transitions between the absorbing states occur with equal probability. However employing the fast-variable elimination described in the introduction to this section, we find that in finite populations, there is noise-induced (equivalently, drift-induced) selection along the line; where z is the value of 1 − 2x (4) on the CM (see Eq. (58)). We find thatĀ S (z) is positive along the entire length of the CM (see Figure 9). Since z = 0 corresponds to the absorbing state in which the population is entirely comprised of XX and XY genotypes, and z = 1 corresponds to the absorbing state in which the population is entirely comprised of X Y and YY genotypes, we can conclude that the dominant mutation X is selected for along the entire CM. Therefore, if a dominant X mutation occurs in a resident XX-XY population, it is more likely to invade and fixate than a recessive X mutation in a resident X Y-YY population. The above picture, whereby dominant sex-determining mutations are more likely to invade and fixate, is recapitulated for successive invasions. If the resident population is X Y-YY, a dominant Y mutation can occur, Y , yielding X Y males. Again five genotypes can emerge following random mating including X X (female) and Y Y (male). This dominant mutation is once again more likely to invade, creating an emergent directionality to evolution with substitution rates of neutral dominant mutations being five times higher than those of recessive mutations [57]. Intriguingly, the behaviour described here recapitulates the empirical observation that sex chromosomes typically evolve to include a cascade of inhibitory mutations [60], with more recent mutations at the top of the cascade. The "bottom up" hypothesis [55] therefore pictures the sequential invasion of sex-determining genes, each one of which represses the action of the previous mutation. While it can be argued that the drift-induced phenomena described here is not solely responsible for this pattern, and other deterministic explanations have been suggested [56], it is nevertheless interesting to see the emergence of the pattern in a model with minimal assumptions.
In [57], the above analysis is extended to include biologically relevant selection. In addition more complicated models of transitions between sex chromosomes are also explored via the same timescale-separation arguments.

V. DISCUSSION
Our aim in this paper has been to describe and review an approach to the systematic modelling and analysis of a class of models in population genetics. Readers with a background in theoretical physics, and statistical physics in particular, will have found many of the ideas and techniques familiar. These include: the care taken in distinguishing between the form of the models at the microscale, the mesoscale and the macroscale; the idea that the model should be formulated and unambiguously defined at the microscale and that the corresponding mesoscopic and macroscopic models should be derived from the former by some form of systematic approximation procedure; the nature of these approximation procedures (the diffusion limit and fast mode elimination); the formalism of non-equilibrium statistical physics including master equations, FPEs and SDEs.
While the techniques and philosophy that we use come from statistical physics, the motivation and the models are taken from population biology, and especially from population genetics. Due to the difficulty of performing analytical calculations with models that have many variables and parameters, researchers in these fields frequently create models for rather specific situations and to answer one very limited and definite question. These models may be restricted to one particular situation, but they at least have the possibility of being analysed. On the other hand the danger with this way of doing things is that the subject becomes characterised by a series of models with conflicting assumptions and little overlap. We have taken a different path: we have not held back from setting-up general models, even though they typically contain very many variables and parameters, but have instead tried to systematically approximate these models to obtain ones which are more tractable. In this way the variables that we focus on are those which are likely to be the most important in the medium to long term (the slow modes) and the parameters which are chosen as being important at late times are combinations of the initial parameters of the original model which would have been impossible to guess a priori.
There are also some modellers who would argue that to study more complex models, and especially those which feature individuals with more than one or two attributes and which have many parameters, one should turn to agent based models. Here there is no attempt at mathematical analysis and a computer simulation is set up in which the agents (i.e. the individuals) are simply allowed to interact with each other according to a set of rules. These rules will have many parameters associated with them, which will not in general easily map onto the parameters of the type we defined here when formulating the microscopic models. While agent based models might be useful in some situations, the difficulty with the whole approach is that one may have no idea what aspect of the model is responsible for a particular outcome that one is interested in. In fact, this is frequently put forward as a strength of the agent-based method: behaviour emerges without having to have been programed-in. We believe that our approach straddles those where the models are simple and tailored for a particular situation and those agent based models which allow for complex systems, but which may be unable to provide insight into the underlying reasons for particular outcomes.
Of course, fast-variable elimination has a long history in theoretical biology and ecology. Most prominently, the form of the Michaelis-Menten function [2] and the Hollings type II functional response [15] are derived based on arguments that assume a separation of timescales. However, these techniques tend to be employed most frequently in deterministic settings. The field of population genetics provides a notable exception to this rule, with the crucial role of genetic drift providing motivation for studies that employed fast-variable elimination in a stochastic setting. We have explicitly discussed the classical assumption of diploid populations at Hardy-Weinberg frequencies in Section III C, where a separation of timescales was assumed a priori. However, there also exists a range of studies which apply more sophisticated techniques (see, for example [16,17,30,43,44,54]).
The relatively frequent occurrence of stochastic fast-variable elimination in population genetics is the result of two factors. The first, as we have already mentioned, is that population genetics models are often inherently stochastic. The second is that a common assumption in many population genetic models (even when fast-variable elimination techniques are not required) is that selection is weak relative to the other processes. If then the question we wish to ask is which of two alleles (or genotypes) fixes in a population, the stage is set for methods based on the elimination of fast-variables to be useful: by definition, if there is no selection, there must be a CM connecting states in which either allele is fixed; if selective forces are weak, the CM is simply replaced by an SS; if there are only two alleles competing, regardless of the other variables, the SS is one-dimensional, allowing an analytic treatment of the resulting effective equation.
Given the ubiquity of the above listed features in models of population genetics, it is therefore perhaps surprising that there are in fact not many more studies employing fast-variable elimination. Perhaps some of the reasons for this stem from the apparent difficulty of dealing with these types of problem, with previous studies [54] relying on the perturbative techniques described by Gardiner [22]. While certainly thorough, this approach is perhaps lacking in intuition, taking place as it does within the context of the FPE and involving operators. In contrast, we believe that there is much to say for the approaches that we have outlined in this paper. As they are described within the context of SDEs (entirely equivalent to the FPE), and as the effective equation is constructed from physically meaningful quantities (such as the left and right eigenvectors), it is relatively straightforward to apply the methodology. We note that while other approaches to fast-variable elimination in the SDE setting are well practiced (see [11], which gives a more complete review of this area of the literature), these too have not cemented themselves as methods within the population genetics literature. The fact that, as described in Section IV B, the techniques we have presented are yielding novel insights to models first developed in the 1970s is testament to the untapped potential of the approach.
In terms of applications of the method, we have tried to some extent to move away from using the Moran model, which has been a very popular starting point among some researchers over the last decade or two. Instead we have introduced competition between individuals to control population growth. There are many other types of interaction between individuals that could be included, and this is an interesting question for the future. But even when only competition between individuals is included, some interesting points emerge. For example, as discussed in Sec. III B 2, the model with competition can be mapped into a Moran model, but only if the selection is frequency dependent, which suggests that the traditional use of frequency independent selection may be too restrictive. In addition, the mapping gives a direct connection between the competition rates and the elements of the payoff matrix. Only relative payoffs of a certain type appear in the mapping between the two models, which again could be used to constrain the nature of the payoffs. This would be useful, since the lack of prescriptions available to guide the choice of payoffs is a significant weakness in the application of game theory to the subject. Of course, our methods are only valid when selection is weak, and moving beyond this regime, such as when strong mutualisms are accounted for, is known to invalidate this mapping between the SLVC and models of game theory [4]. Extending such a study using the methods we have presented is also an interesting avenue for future studies.
Perhaps one of the most surprising insights that fast-variable elimination has provided is the occurrence of noiseinduced dynamics in otherwise deterministically neutral systems. These dynamics, described in Section IV, can be interpreted as drift-induced selection and can give rise to surprising behaviours, such as selection reversal. While historically such behaviour has been observed and characterised in simpler settings [25], it has been often dismissed as a small second order effect of little biological significance [27]. This is because noise induced selection is inversely proportional to the size of the system, and therefore, it is argued, likely to be swamped by other processes. However, this misses the fact that when selection is weak (a frequent assumption in population genetics) or when a population has a small effective population size (such as when a system is spatially distributed) noise-induced terms can in fact dominate the dynamical behaviour.
Moreover, utilisation of fast-variable elimination techniques has led to an increasing awareness that noise-induced selection appears in many distinct evolutionary systems in ecology, epidemiology and population genetics. We have already discussed some of these in detail in Section IV. However more examples are arising with increasing frequency. In [37], stochastic Lotka-Volterra models for multiple islands (similar to the models described in Section III B 1) were used to show that demographic stochasticity induces a selective bias for species that disperse at a higher rate between identical islands, where there is no selection deterministically. Intriguingly, this bias persists even when the islands are inhomogeneous and a deterministic selective pressure exists for slower dispersal [38]. In [35] meanwhile, a model of viral evolution was investigated, showing that noise-induced dynamics selected for viral strains with a fast infection and recovery time in SIS and SIR models.
It is clear that the methods we have described in this paper hold a great degree of utility, whether one is seeking to consolidate existing disparate models, add more biological detail to well understood systems or uncover and characterise entirely new dynamics. Beyond drawing attention to this fact, we hope that as presented, this work will prove of value to readers with both biology and statistical physics backgrounds. For the biologists, this paper has aimed to show that these techniques are not as conceptually formidable as other studies may have suggested. For the statistical physicists, this paper has aimed to show that many interesting and relevant problems in biology still exist, as yet untapped, and amenable to analysis with their approaches. W (22) (n (2) ) 2 + 2 1 2 W (23) n (2) (N − n (1) − n (2) ) n (1) .