Simulating Evolution in Asexual Populations with Epistasis

,

We can directly specify the mapping between genotypes and fitness by explicitly indicating what the fitness of all possible genotypes (or all genotypes with non-zero fitness) is. Here, epistasis is not specified directly, but indirectly. For instance, we could specify a four genotype model with sign epistasis directly as: ## Create a two-column data frame with ## the mapping from genotypes to fitness gf <data.frame(Genotype = c("WT", "A", "B", "A, B"),

plotFitnessLandscape(fitness1)
And we can compute some epistasis statistics using the function Magellan_stats, that calls code provided by MAGELLAN [3]. From the set of statistics provided by MAGELLAN, we only want the fraction of pairs of loci that have no epistasis, magnitude epistasis, sign epistasis, reciprocal sign epistatis, and γ, the correlation in fitness effects between genotypes that differ by one locus (averaged over the fitness landscape) (see details in [3] and [13]). Since we are only interested in some of the output provided by the Magellan_stats function (which itself simply calls the function fl_statistics from MAGELLAN), we write a simple wrapper to Magellan_stats (where use_log controls whether we take the log of the fitness values before computing the epistasis statistics): ## magn sign rsign none gamma ## 0.000 1.000 0.000 0.000 0.113 (Using use_log = TRUE is discussed in further detail below.)

Specifying epistasis indirectly: using models for fitness landscapes
Alternatively, the mapping between genotypes and fitness can be done according to different random fitness landscapes models, which are characterized by different degrees of epistasis (see [3] and [13]). We will use function rfitness, that allows us to use the Rough Mount Fuji (RMF) model, which includes as limit cases both a fully additive and the House of Cards models, and the NK (or LK) model. It must be noted that the values returned by rfitness are to be interpreted as log-fitness values (and this is why, in the calls in this section, we call the function epist_stats without log-transforming the fitness values).
In the examples that follow, and for ease of representation, we will only use four loci. We will first use a fully additive and deterministic model. The additive model is sometimes also called a multiplicative model, as it becomes additive in the log-scale; the multiplicative effects on fitness that each mutation has do not depend on the state of other genes (see [3,13]).
We can generate deterministic, additive models as special cases of Rough Mount Fuji models without noise [26,32]. In the example below, the genotype with all genes mutated has maximum fitness (reference = "max"), and all genotypes have the same decrease in fitness per unit increase in Hamming distance from the genotype with maximum fitness (c = 0.5).

plot(additive)
The other extreme of the RMF model is the House of Cards model, that leads to maximally rugged fitness landscapes: the (log)fitness of each genotype is obtained, independently for each genotype, from some underlying distribution (normal, in this case): ## Set random number generator seed, for reproducibility set.seed(5)

plot(hoc)
and is shown in Figure 1.

Specifying epistasis directly
We can specify directly the fitness effects of mutations on genes and gene interactions, thus having full control over the specification of epistatic interactions (see also Note 1 and Note 2).

Genes without interactions: no epistasis
As a simple baseline example we will first specify a model without epistatic interactions:

Epistasis: two alternative specifications
Suppose we want the effects of two genes and their interaction to be as shown in Table 1: To make the example specific, let s a = 0.2, s b = 0.3, s ab = 0.7. We specify the above scenario as follows:  Here we use a "-" to mean that we explicitly exclude a specific pattern; thus, "A:-B" is interpreted as "A mutated when B is not mutated".
Alternatively, it is possible to specify the effects of genes and their interactions, without using the "-". That requires a different numerical value of the interaction, because now, as we are rewriting the interaction term as genotype "A is mutant, B is mutant" the double mutant will incorporate the effects of "A mutant", "B mutant" and "both A and B mutants". We can define a new s 2 that satisfies (1+s ab ) = (1+s a )(1+s b )(1+s 2 ) so (1+s 2 ) = (1+s ab )/((1+s a )(1+s b )) and therefore specify the model as: For example, this is the way you would specify effects with FFPopsim [34]. Whether this specification or the previous one with "-" is simpler will depend on the model.

Epistasis: two alternative specifications. A three gene example
To further illustrate the two mechanisms for epistasis specification, here we show a more complex three-gene example. We want to use the epistatic interactions where there is no epistasis between genes A and C, but there is epistasis between A and B and B and C, as shown in Table 2: We can specify that model as follows:  We needed to pass the s ac coefficient explicitly, even if it that term was just the product of the corresponding terms for the individual loci, because a full specification is required when using the "-".
We can use the alternative specification without "-", but we will need to do carry out some calculations to obtain some of the coefficients under this parameterization. To make it easier to tell the differences from the previous specification, I use capital "S" in what follows where the letters differ from the previous specification. Note that we can avoid specifying the "A:C", as it just follows from the individual "A" and "C" terms, but we need to obtain new S ab , S bc , S abc :  We can check the output from the above epistasis calculations using the graphical procedure for determining magnitude, sign, and reciprocal sign epistasis described in [6,13]. The two cases with magnitude epistasis (frequency of 2/6) correspond to the sets "abc", "aBc", "Abc", "ABc" on the one hand and "Abc", "AbC", "ABc", "ABC" on the other. The two cases with sign epistasis correspond to the two sets "abC", "aBC", "AbC", "ABC" and "aBc", "ABC", "ABc", "ABC".
Synthetic viability, where each individual mutant is lethal or has decreased fitness but the double mutant is viable, is just a case of reciprocal sign epistasis, but we illustrate it here separately with a minimal example. Suppose we want to model fitness as shown in Table 3: We will set s = 0.2, and s a = s b = −0.5. Then, we can specify fitness as follows:   Table 4: We will make s a = 0.1, s b = 0.2, s ab = −0.8. We can specify it as:

Simulating evolution
The main function for simulating evolution with OncoSimulR is oncoSimulIndiv. In addition, functions oncoSimulPop and oncoSimulSample allow us to run multiple simulations and in particular oncoSimulPop allows us to use multiple chores to parallelize execution.
Once epistasis or, equivalently, fitness of genotypes has been specified, as explained above (section 2.2), we can simulate evolution using any of the oncoSimul * functions. Before actually running simulations (section 2.3.3), though, we need to decide about the growth model and mutation rates of genes.

Growth models
OncoSimulR uses a continuous time model. The main choice for growth models is between a model with exponential growth or a model with carrying capacity (the model of McFarland et al [22][23][24]) (but see also Note 3). In both cases, when we specify the fitness effects of genes and gene interactions, and as shown above (section 2.2.2), we evaluate fitness using the usual [1,7,17,34] multiplicative model: fitness is (1+s i ) where s i is the fitness effect of gene (or gene interaction) i. In both models this fitness refers to the growth rate. The original model of McFarland et al. [24] has a slightly different parameterization, but you can go easily from one to the other (see below, section 2.3.1.2). If you specify fitness of genotypes directly (section 2.2.1) then that is also taken as the birth rate of genotypes.
In the model with exponential growth we specify the growth rate, fixing death rate at 1 (it is possible to modify this if really needed, but there is rarely any need to do so). The model with carrying capacity follows the model of McFarland et al [22][23][24]: mutations affect the birth rate, with the death rate being density dependent (see below).
In OncoSimulR, we choose the exponential growth model setting model = "Exp" in the call to functions oncoSimul * . The model with carrying capacity is specified using model =

Mutation rates, mutator genes
OncoSimulR can use both a common mutation rate for all loci as well as locus-specific mutation rates. Below we show two calls to oncoSimulIndiv, the first one, rmf_common, with a common mutation rate of 1e − 7 for all loci, and the second, rmf_loci_spec, with different mutation rates for each locus. We reuse the RMF fitness specification from section 2.2.1.2.

onlyCancer = FALSE)
It is also possible to specify mutator/antimutator genes (e.g. [14,33]); these genes, when mutated, lead to an increase/decrease in the mutation rate across the genome. Mutator/antimutator genes must be a subset of the genes in the fitness effects (if you want to have mutator genes that have no direct fitness effects, give them a fitness effect of 0 -see examples in the documentation).
In the example below, we specify that mutating gene "A" leads to an increase by a factor of fifty of the mutation rate. See also Note 5 for numerical issues that can result from using multiple mutator genes.

Running simulations until pre-specified contions are met
In addition to the growth model, fitness effects, and possible mutator effects and locus-specific mutation rates, you need to decide: • Where will you start your simulation from. This involves deciding the initial population size (argument initSize); sometimes you might want to start the simulations from a specific genotype (see Note 6).
• When will simulations stop: how long to run simulations, and whether or not to require simulations to reach a particular condition. This is covered in this section.
OncoSimulR provides very flexible ways to decide when to stop a simulation: • Using option onlyCancer = TRUE.
A simulation will be repeated until any one of the conditions below is met, if this happens before the simulation reaches finalTime. Because OncoSimulR was originally developed to simulate cancer evolution, this is often referred as "reaching cancer" but we can refer to it as "reach whatever interests me". These conditions are: -Total population size becomes larger than detectionSize.
-A gene, gene combination, or genotype among those listed in fixation becomes (this could also be used to stop the simulation as soon as a specific genotype is found, by using the genes that make that genotype as the drivers, but the mechanism in section 2.3.3.2 is generally simpler). This option is only reasonable in scenarios where we want to differentiate between driver and passenger genes, requires specifying driver genes, and will not be further discussed here.
The simulations exit as soon as any of the exiting conditions is reached; therefore, if you only care about one condition, set the other to NA.
A simulation will run only once, and will exit as soon as any of the above conditions are met or as soon as the total population size becomes zero or we reach finalTime.
Using onlyCancer = FALSE will often be the setting you want to use to examine general population genetics scenarios without focusing on possible sampling issues; set Under the onlyCancer = TRUE case, if we reach finalTime (or the population size becomes zero) before any of the "reach cancer" conditions have been fulfilled, the simulation will be repeated again, within the limits given by the following parameters to the function: max.wall.time: the total wall time we allow an individual simulation to run; max.num.tries: the maximum number of times we allow a simulation to be repeated to reach cancer; if you use oncoSimulSample, max.wall.time.total and max.num.tries.total, similar to the previous two, but specific for function oncoSimulSample, are also of application. If the specified conditions for "reaching cancer" can not be met, no object with the population state (genotypes and population sizes) will be returned (simulations will abort without returning the population state, as no simulation has achieved the specified conditions).

Fixation of genes and gene combinations
Simulations will exit when any of the genes or gene combinations in the vector (or list) fixation, passed to the oncoSimul * functions, reaches a frequency of 1, or very close to 1 (see section 2.3.3.3). The gene combinations might share genes (i.e., might have non-zero intersection). As explained above, if we want simulations to only exit when fixation of those genes/gene combinations is reached, we will set all other stopping conditions to NA. Note that if the stopping conditions can never be reached, simulations will eventually abort (e.g., when max.wall.time or max.num.tries are reached).
Since we are running simulations until fixation of genes, the Exp model will rarely be appropriate here: models with competition such as McFL are more appropriate.
The following code shows an example based in the model in Ochs and Desai [27]; the authors present a model like the one shown in Figure 2 (the numerical values are arbitrarily set by me): In this model s u > 0, s v > s u , s i < 0 and we can only arrive at v from i. Mutants "ui" and "uv" can never appear as their fitness is 0, or −∞, so s ui = s uv = −1 (or −∞).
We can specify fitness by specifying epistatic effects:  In p.2, section "Simulations" of Ochs and Desai [27], they explain that "Each simulated population was evolved until either the uphill genotype or valley-crossing genotype fixed." To use the same procedure here, we specify that we want to end the simulation when either the "u" or the "v, i" genotypes have reached fixation, by passing those genotype combinations as the fixation argument (in this example using fixation = c("u", "v") would have been equivalent, since the "v" genotype by itself has fitness of 0). Fixation will be the one and only condition for ending the simulations, and thus we set arguments detectionDrivers, finalTime, detectionSize and detectionProb explicitly to NA. We want to run the simulations repeatedly, so we use oncoSimulPop but we set the number of repetitions only to 10 for the sake of speed.
initS <-20 ## We use only a small number of repetitions for the sake ## of speed. here. To specify genotypes, you prepend the genotype combinations with a "_,", and that tells OncoSimulR that you want fixation of genotypes, not just gene combinations.
The following example illustrates the differences between the mechanisms: ## Create a simple fitness landscape  min_successive_fixation: during how many successive sampling periods the conditions of fixation need to be fulfilled before declaring fixation. These must be successive sampling periods without interruptions (i.e., a single period when the condition is not fulfilled will set the counter to 0). This can help to exclude short, transitional, local maxima that are quickly replaced by other genotypes. (The default is 50, but this is probably too small for "real life" usage).
fixation_min_size: you might only want to consider fixation to have happened if a minimal size has been reached; this can help weed out local maxima that have fitness that is barely above that of the wild-type genotype. (The default is 0).

Stochastic detection mechanism
This process is controlled by the argument detectionProb. Under this process, we simulate stopping simulations when a tumor is detected, where the probability of "tumor detection" increases with the total population size. The probability of detection is given by where P (N ) is the probability that a tumor with a population size N will be detected, and c (argument cP Detect in the oncoSimul * functions) controls how fast P (N ) increases with increasing population size relative to a baseline, B (P DBaseline in the oncoSimul * functions).
This function is evaluated at regularly spaced times during the simulation, and the decision to exit the simulation is made by comparing P (N ) against a random uniform number. Using this exiting mechanism is probably only appropriate for modelling diseases such as cancer and will not be further discussed here. See the vignette and documentation for details and examples.

Output and data analysis
The output from the simulation functions oncoSimulIndiv, oncoSimulPop, and oncoSimulSample • The predictability of evolution in complex fitness landscapes, as shown in [12,18].
• The effects of mutator/antimutator genes in reaching particular genotypes or population sizes.
• Whether we can recover restrictions in the order of accumulation of mutations with different types of epistatic relationships, as shown in [9,11].
Simple examples illustrating those scenarios are provided in the vignette of the OncoSimulR package.
4 Notes: potential pitfalls and troubleshooting 1. It is also possible to specify epistasis using Directed Acyclic Graphs (DAGs) that represent order dependencies in the accumulation of mutations. This is equivalent to specifying sign epistasis [11], and it is used by "cancer progression models" such as Conjunctive Bayesian Networks (CBN) [15,16,25], oncogenetic trees (OT) [8,31], CAncer PRogression Inference (CAPRI) [4,29], or CAncer PRogression Extraction with Single Edges (CAPRESE) [28]. This usage, however, can only represent sign epistasis, or relaxations of sign epistasis; see the vignette of the package for examples.
2. OncoSimulR also allows us to specify fitness not in terms of genes but in terms of modules.
This can be useful in some very special scenarios as discussed in, for example, [5,30]. Each module is a set of genes (and the intersection between modules is the empty set). Modules, then, play the role of a "union operation" over sets of genes. There is no major conceptual difference relative to what has been shown in this chapter, but one also needs to specify which genes belong to each module. This specification of fitness can be useful when using DAGs (as discussed in Note 1), but rarely in other scenarios.
3. It is also possible to use an exponential growth model with birth rate fixed to 1, and where the fitness specification affects the death rate, a model inspired in [2]. Specification of fitness effects via their effects on death rates, however, often leads to numerical issues (see documentation and vignette), and is not discussed in this paper.

4.
A branch of OncoSimulR, https://github.com/rdiaz02/OncoSimul/tree/ freq-dep-fitness includes an implementation with frequency-dependent fitness. This implementation includes all the features mentioned here, but allows users to also make fitness depend on the frequency of other genotypes. We have not mentioned these features as they extend beyond the specification of epistasis and the software requires users to carry out manual installation of software, until a new R building toolchain becomes stable.
5. If we use several mutator genes with independent effects it is easy to run into computational problems. Suppose we specify five mutator genes, each with an effect of multiplying by 50 the mutation rate. The genotype with all those five genes mutated will have an increased mutation rate of 50 5 = 312500000. If you set the mutation rate to the default of 1e − 6 you will have a mutation rate of 312 which makes no sense (and leads to several numerical problems and an early warning from the software).
6. It is possible to start simulations from a specific genotype. This can be done using initMutant aregument to the oncoSimul * functions.

Tables
Missing rows have a fitness of 1 and have been deleted for conciseness. Note that the mutant for exactly A and C has a fitness that is the product of the individual terms (so there is no epistasis in that case).