Tight Bounds on the Expected Runtime of a Standard Steady State Genetic Algorithm

Recent progress in the runtime analysis of evolutionary algorithms (EAs) has allowed the derivation of upper bounds on the expected runtime of standard steady-state genetic algorithms (GAs). These upper bounds have shown speed-ups of the GAs using crossover and mutation over the same algorithms that only use mutation operators (i.e., steady-state EAs) both for standard unimodal (i.e., OneMax) and multimodal (i.e., Jump) benchmark functions. The bounds suggest that populations are beneficial to the GA as well as higher mutation rates than the default 1/n rate. However, making rigorous claims was not possible because matching lower bounds were not available. Proving lower bounds on crossover-based EAs is a notoriously difficult task as it is hard to capture the progress that a diverse population can make. We use a potential function approach to prove a tight lower bound on the expected runtime of the (2+1) GA for OneMax for all mutation rates c/n with c < 1.422. This provides the last piece of the puzzle that completes the proof that larger population sizes improve the performance of the standard steady-state GA for OneMax for various mutation rates, and it proves that the optimal mutation rate for the (2+1) GA on OneMax is (√97 − 5)/(4n) ≈ 1.2122/n.


Introduction
The runtime analysis of randomized search heuristics like evolutionary algorithms (EAs), simulated annealing, ant colony optimization and estimation-of-distribution algorithms is a young and active subfield in algorithm research that has produced remarkable results in the last 20 years [2,11,15,23]. Its main goal is to understand the working principles of the algorithms in different scenarios by deriving runtime bounds depending on problem characteristics, choice of algorithms and parameters. This line of research started with simple evolutionary algorithms using mutation only. Still today, the role of crossover, also called recombination, is less well understood than that of mutation. In fact, explaining when recombination- and mutation-based genetic algorithms (GAs) perform better than more traditional general-purpose search heuristics that use mutation alone is regarded as one of the fundamental problems in evolution-inspired computation.
Traditionally, proofs showing that crossover is a useful operator relied either on excessively low crossover rates [16,18] or on some diversity-enforcing mechanism to make recombination effective by increasing the probability that members of the population are different [8,10,20,22,27]. However, it was never shown whether this enforced diversity was necessary or whether it was an additional requirement for the proofs to hold. Recently, some results have appeared proving the superiority of standard steady-state GAs over mutation-only algorithms, without the need of any additional diversity-enforcing mechanisms. Dang et al. [7] proved that for sufficiently large population sizes, the (μ+1) GA is at least a linear factor faster than the best algorithm using only standard bit mutation for the Jump benchmark function. Hence, they showed that crossover may help algorithms to escape more quickly from local optima. Sutton [28] even proved that for the NP-hard Closest String problem from computational biology, the (μ+1) GA with sufficiently large population size and restarts is a fixed parameter tractable (FPT) algorithm while if only standard bit mutation is used (i.e., (μ+1) EA) it is not.
Strikingly, recombination has also been proven to be useful on unimodal functions. Lengler [21] has shown that there exist monotone functions for which the (μ+1) EA with not too low standard bit mutation rate c/n (i.e., c > 2.13) requires exponential runtime with high probability while the (μ+1) GA with sufficiently large population sizes can solve them in O(n log n) expected runtime for arbitrary mutation rates Θ(1)/n. Analyses have revealed that the (μ+1) GA is faster than the (μ+1) EA using any standard bit mutation rate and population size, even on unimodal functions where the latter is particularly efficient, i.e., OneMax [5,6]. Furthermore, if the fitness of offspring that are identical to their parents is not unnecessarily re-evaluated, then the algorithm is faster than any unary unbiased black-box algorithm for the problem [19], albeit slower than if the diversity is enforced [3,27]. To prove these results, precise analyses up to the leading constants are required since for OneMax the algorithms have the same asymptotic expected runtime O(n log n) for moderate population sizes.
An important insight from these analyses is that if diversity is enforced as in Sudholt's work [27], then inevitably there are no advantages of using population sizes greater than μ = 2 for OneMax. On the other hand, the analysis of Corus and Oliveto [6] provides upper bounds that decrease with the population size (up to some sublogarithmic limit). For large enough population sizes the best derived upper bound is roughly 1.64 n ln n while, for μ = 2, Corus and Oliveto only provide a larger upper bound of 4e^c·n ln n/(c(c+4)) + O(n) [5]. Due to a mistake in one probability calculation this turns out to actually be 9e^c·n ln n/(c(2c+9)) + O(n). Indeed, all the positive results summarised above regarding the plain (μ+1) GA required sufficiently large population sizes. While the comparative statements with the mutation-based algorithms were possible because of the availability of lower bounds on their expected runtime, rigorously showing whether the suggested population sizes are actually necessary requires lower bounds on the expected runtime of the (μ+1) GA. Proving lower bounds for GAs with crossover is a notoriously hard task. The only available analysis concerning a standard GA is the proof that the simple genetic algorithm (SGA [12]) cannot solve OneMax in polynomial time with overwhelming probability due to the ineffectiveness of the fitness proportional selection operator [25,26]. There have been recent attempts to generalize proof methods like the family tree technique to crossover-based algorithms; however, these only apply in a specific setting without mutation.
Providing lower bounds on the expected runtime of the (μ+1) GA for OneMax has turned out to be surprisingly difficult. Sudholt simplified the analysis by considering a "greedy" (2+1) GA that always selects amongst the fittest individuals in the population and is sped up by automatically achieving the best possible crossover operation between different parents [27]. A less greedy (2+1) GA was considered by Corus and Oliveto where individuals are only immediately crossed over optimally if the Hamming distance between the parents is larger than 2 [5]. These simplified algorithms allow the analysis to ignore the improvements which may occur in standard GAs when one parent is crossed over with another one of different fitness. However, it was never proven that the algorithms are indeed faster than the standard (2+1) GA, hence that the bounds are also valid for the latter algorithm. In this paper we provide a lower bound for the (2+1) GA with no simplifications that matches its upper bound up to the leading constant, hence providing a rigorous proof that larger populations are beneficial to the GA for OneMax. The preciseness of the results also allows us to derive that the value of c ∈ (0, 1.422] that yields the optimal mutation rate c/n is approximately c = 1.21221445. A major difficulty in proving rigorous lower bounds for populations with crossover is to find a way to aggregate the state of the algorithm such that it accurately captures the current distance from the optimum, but also the potential improvements of the crossover operator. These improvements could be very large if the parents have a large Hamming distance, and our aim is to show that this rarely happens. We solve the aggregation problem for the (2+1) GA by defining a potential function that captures the current fitness and opportunities for easy improvements through crossover.
By showing bounds on the expected increase in the potential, we are able to quantify how the distance to the optimum decreases in one generation. The challenge lies in proving this for every possible population, from those with identical individuals to those with a good amount of diversity. Once the potential is appropriately bounded, we can use standard drift analysis arguments to bound the expected time from below.

Main Contributions
The expected optimisation time of the (2+1) GA with mutation rate c/n, for constant c > 0, is bounded from above by 9e^c·n ln n/(c(2c+9)) + O(n) (Theorem 1). For c = 1 this is (9/11)·e·n ln(n) + O(n) ≈ 2.224 n ln(n) + O(n). The upper bound follows from applying the analytical framework in [5] with a corrected transition probability for p_r, using the value 1/(4e) instead of 5/(24e) (we shall give more details in Sect. 3). It can also be proven with mild adaptations of the proof of [27, Theorem 4]. We provide a self-contained proof in Sect. 3.
Our main contribution is the following lower bound that matches the upper bound proven in Theorem 1 up to small-order terms. Since the bounds from Theorems 1 and 2 have the same leading constant 9e^c/(c(2c+9)), which is minimised for c = (√97 − 5)/4 ≈ 1.2122, we identify this as the optimal mutation rate for the (2+1) GA (up to small-order terms) within the range of rates covered by Theorem 2. Hence (√97 − 5)/(4n) ≈ 1.2122/n is the optimal mutation rate of the (2+1) GA on OneMax, up to small-order terms, and the resulting expected optimisation time is ≈ 2.18417 n ln(n) + O(n).
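As a numerical sanity check (not part of the paper's proofs), the leading constant 9e^c/(c(2c+9)) and its claimed minimiser can be evaluated directly; a short Python sketch (the function name is ours):

```python
from math import sqrt, exp

# Leading constant of the (2+1) GA runtime bound 9e^c / (c(2c+9))
# from Theorems 1 and 2.
def leading_constant(c: float) -> float:
    return 9 * exp(c) / (c * (2 * c + 9))

# Closed-form minimiser stated in the paper: c = (sqrt(97) - 5) / 4.
c_opt = (sqrt(97) - 5) / 4
print(round(c_opt, 4))                    # 1.2122
print(round(leading_constant(c_opt), 3))  # 2.184
print(round(leading_constant(1.0), 3))    # 2.224  (= 9e/11)

# Coarse grid search confirming c_opt is indeed the minimum on (0, 3).
grid = [0.01 * k for k in range(1, 300)]
best = min(grid, key=leading_constant)
print(abs(best - c_opt) < 0.01)           # True
```

Setting the derivative of e^c/(2c^2 + 9c) to zero gives 2c^2 + 5c − 9 = 0, whose positive root is exactly (√97 − 5)/4.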
The best identified mutation rate for the (2+1) GA is lower than the one minimising the upper bound for larger population sizes μ ≥ 5 (it is at least 1.425/n and increases with μ), which always provides upper bounds below 1.7 n ln n that decrease with μ [6]. This implies that the (2+1) GA, even with its best mutation rate, is slower than the (μ+1) GA with larger population sizes.

Algorithm 1 (2+1) GA
  Initialize P = {x_1, x_2} by selecting two search points from {0, 1}^n independently and uniformly at random (u.a.r.).
  for t ← 1, 2, . . . do
    Select y_1 and y_2 from P u.a.r. with replacement.
    Create z by applying uniform crossover to y_1 and y_2.
    Flip each bit in z independently with probability c/n.
    Remove the worst of x_1, x_2 and z, breaking ties u.a.r.
  end for

Structure of the paper. Section 2 formally defines the (2+1) GA and lists important tools for the analysis, including the drift theorem used for our main result. Section 3 presents the above-mentioned, corrected upper bound from Theorem 1 and its self-contained proof. In Sect. 4, we introduce the potential function that captures the state of the (2+1) GA and is crucial for the drift analysis proving the lower bound in Theorem 2. As determining the drift requires a careful case analysis, we give a roadmap of this analysis in Sect. 5, followed by the technical Sects. 6-8 that analyze the drift in different scenarios. In Sect. 9 we then put all pieces of our analysis together to complete the proof of Theorem 2. In addition, Sect. 10 gives empirical results and results of regression analyses. The latter confirm that our theoretical results are remarkably precise and provide further insight into small-order terms. We finish with some conclusions. To streamline the presentation of the most important cases in the drift analysis, some less insightful proofs have been moved to an appendix. This paper extends a previous conference paper [24] where most proofs had to be omitted because of space constraints. The experimental results have not been published before either.

Preliminaries
The (2+1) GA is defined in Algorithm 1. The algorithm initialises the population with two randomly chosen individuals. At each generation it selects two random parents with replacement to be mated via uniform crossover. The operator assigns each bit to the offspring by selecting the corresponding bit from one parent with probability 1/2 and from the other with the same probability. Standard bit mutation is then applied to the offspring by flipping each of its bits independently with probability c/n. Finally, the worst individual amongst the parents and the offspring is removed to select the new population. Ties are broken uniformly at random.
We will analyse the expected runtime of the algorithm to optimise the function f(x) = OneMax(x) = Σ_{i=1}^{n} x_i, which counts the number of 1-bits in a bitstring. Formally, the runtime is defined as the number of fitness function evaluations, which equals the smallest t such that the current population contains an optimum, up to an additive term of at most 2 stemming from evaluating the initial population.
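For illustration, Algorithm 1 can be simulated directly on OneMax. The following Python sketch is a straightforward transcription (the function names and parameter choices are ours, not from the paper):

```python
import random

def onemax(x):
    """OneMax fitness: the number of 1-bits in the bitstring x."""
    return sum(x)

def two_plus_one_ga(n, c=1.0, rng=random):
    """One run of the (2+1) GA (Algorithm 1) on OneMax.

    Returns the number of fitness evaluations until the population
    contains the all-ones string."""
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2)]
    evals = 2  # initial population is evaluated once
    while max(onemax(x) for x in pop) < n:
        # Parent selection u.a.r. with replacement.
        y1, y2 = rng.choice(pop), rng.choice(pop)
        # Uniform crossover: each bit from either parent with prob. 1/2.
        z = [y1[i] if rng.random() < 0.5 else y2[i] for i in range(n)]
        # Standard bit mutation with rate c/n.
        z = [b ^ 1 if rng.random() < c / n else b for b in z]
        evals += 1
        # Remove the worst of {x1, x2, z}, breaking ties u.a.r.
        trio = pop + [z]
        fits = [onemax(x) for x in trio]
        worst = rng.choice([i for i in range(3) if fits[i] == min(fits)])
        pop = [trio[i] for i in range(3) if i != worst]
    return evals

random.seed(1)
print(two_plus_one_ga(50, c=1.0) > 0)  # True: the run terminates
```

For OneMax this terminates quickly; the theoretical bounds of Theorems 1 and 2 predict roughly 2.2 n ln n evaluations for c = 1.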
Throughout this article, we will use state-of-the-art analysis techniques for randomized search heuristics, including drift analysis and concentration inequalities. In the following, we list three important tools, starting with the drift theorem that is crucial for the lower bound on the runtime. Theorem 4 (Multiplicative drift, lower bound, e.g., [29]) Let X_t, t ≥ 0, be a stochastic process, adapted to a filtration F_t, over a state space S ⊆ R_{≥x_min}, where x_min > 0. Assume that X_t is non-increasing, i.e., X_{t+1} ≤ X_t for all t ≥ 0. Let T be the smallest t ≥ 0 such that X_t ≤ x_min. If there exist positive real numbers β, δ > 0 such that for all t < T it holds that E[X_t − X_{t+1} | F_t] ≤ δX_t and Pr(X_t − X_{t+1} ≥ βX_t | F_t) ≤ βδ/ln(X_t), then E[T | X_0] ≥ ((1 − β)/(1 + β)) · ln(X_0/x_min)/δ. The following Chernoff bound goes back to [4] and is formulated in the style of [9, Theorem 1.10.1].
The following lemma bounds a sum by an integral.

Lemma 6
For any integrable, non-increasing function f: R → R and integers a ≤ b, the following statements hold: (i) Σ_{i=a}^{b} f(i) ≤ f(a) + ∫_a^b f(x) dx, and (ii) Σ_{i=a}^{b} f(i) ≥ ∫_a^{b+1} f(x) dx.
The first statement follows from splitting off the term f(a) and using f(i) ≤ ∫_{i−1}^{i} f(x) dx for each remaining term; the second follows from f(i) ≥ ∫_i^{i+1} f(x) dx. Then we apply the first statement to Σ_{i=a}^{α} f(i) and the second statement to the remaining part of the sum. Finally, we frequently need the following estimate of the largest binomial coefficient, which can be found in [...].

Theorem 1 states the leading constant 9e^c/(c(2c+9)) instead of the leading constant 4e^c/(c(c+4)) claimed in [5]. Sudholt [27] provided an upper bound for (μ+λ) GAs with a diversity mechanism in the tie-breaking rule for the replacement selection. The analysis is based on a fitness-level argument and a simple Markov chain analysis made on each fitness level. Corus and Oliveto [5] analysed the standard (μ+1) GA, for which the choice μ = 2 yields the (2+1) GA as defined in Algorithm 1. They observed that the (μ+1) GA can lose diversity after creating it, which required a more complex Markov chain analysis of each fitness level. We first explain their framework and argue why the leading constant is 9e^c/(c(2c+9)). In addition, we give a self-contained bound based on the analysis in [27] with an ad-hoc Markov chain framework similar to the one used in [5], simplified for the fixed value of μ = 2 (and λ = 1). This shows that both previous works [5,27], with appropriate modifications, yield the same upper bound for the (2+1) GA.

The Framework by Corus and Oliveto
Corus and Oliveto coupled the standard artificial fitness levels method [15] with a Markov chain framework to bound the expected runtime of the (μ+1) GA for OneMax and any population size μ [5]. More precisely, they divided the search space into the canonical n + 1 fitness levels, each containing all search points with i 1-bits, and assumed that the algorithm is in level L_i if all the individuals of the population have exactly i 1-bits. Since crossover may only speed up the optimisation process if diversity is present in the population, a Markov chain was used on each fitness level to distinguish between populations with and without diversity. The Markov chain at fitness level i consists of two transient states S_{1,i} and S_{2,i} (i.e., respectively without and with diversity) and an absorbing state S_{3,i} that is reached when a solution with better fitness is identified. For each level the analysis was performed by pessimistically assuming that initially on each level the population has no diversity (i.e., all individuals are identical). For this assumption to hold, and a valid upper bound on the expected runtime to be achieved, once the absorbing state is reached for level L_i, the expected time for the improved individual to take over the population has to be taken into account before the analysis of the absorption time of the Markov chain for level L_{i+1} may be carried out (i.e., the population is initially in state S_{1,i+1} for each level L_{i+1}). When the algorithm is on the current level L_i with no diversity (S_{1,i}), only two changes of state may occur, both due to mutation: either the absorbing state S_{3,i} is reached by increasing the number of 1-bits in an individual, or a state with some diversity is reached by switching the order of 1-bits and 0-bits in an offspring, hence reaching state S_{2,i}.
From this state, S_{3,i} may be reached more quickly than from state S_{1,i}: if diverse parents are selected for reproduction, crossover obtains at least one extra 1-bit in the offspring with constant probability, provided that the diversity is not lost beforehand (e.g., by selecting two identical parents, creating a copy by not flipping any bits, and removing the diverse individual in the selection-for-replacement step). Overall, the proof strategy requires summing up the bounds on the absorption times of the n + 1 Markov chains and the subsequent takeover times over the n + 1 levels according to the artificial fitness levels methodology.
The described technique makes it possible to calculate the transition probabilities of the Markov chain, hence to provide upper bounds on the expected runtime of the (μ+1) GA, as a function of any population size μ. In particular, it provides bounds on the expected runtime of the (μ+1) GA up to the leading constant for population sizes of μ = o(log n/ log log n) that are smaller than those of any steady-state EA using only standard bit mutation (i.e., no crossover); for larger population sizes the takeover time becomes asymptotically larger than the O(n log n) expected runtime of the (1+1) EA. An important insight from the Markov chain analysis is that the probability of losing diversity in state S_{2,i} is higher for population size μ = 2 than for larger populations: in the former case the diversity may be completely lost in every generation by either of the two individuals taking over, which is not the case for larger populations. However, in the analysis of this transition probability for the special case μ = 2, the bound provided by the authors of [5] is by a factor of 2 smaller than the correct one, due to a mistake in the calculation of the probability that different parents are selected for reproduction and the variation operators produce an offspring identical to either parent (and the differing parent is removed by the selection-for-replacement step). This miscalculation led to an upper bound on the expected runtime of the (2+1) GA of 4e^c·n ln n/(c(c+4)) + O(n) instead of the correct bound of 9e^c·n ln n/(c(2c+9)) + O(n) provided in Theorem 1. In the following subsection we provide a self-contained proof of the result and point out exactly where the miscalculation occurred in [5].

A Self-Contained Upper Bound Proof for the (2+1) GA
Here we give a self-contained proof of the upper bound for the (2+1) GA, Theorem 1.

Proof of Theorem 1
We use straightforward adaptations of the proof of [27, Theorem 4], using the same notation as in [27]. As in said proof, we distinguish between the following cases that are labelled according to the current best fitness in the population, called i. There are cases i.1, i.2, and i.3 explained in the following. We estimate the expected time spent in all these cases, summed up over all possible values of i, to obtain an upper bound on the total expected optimisation time. Case i.1 is left for good if another search point of fitness at least i is created, as then both search points will have fitness at least i. A sufficient condition for this is that the parent with fitness i is selected twice as parent (probability 1/4) and no 1-bit is flipped during mutation (probability at least (1 − c/n)^{n−1} = Ω(1)). Let T_{i,2} be the random time spent in Case i.2. We leave this case for good if mutation only flips a single 0-bit and no 1-bit. This event has probability at least p^+ := (n − i) · c/n · (1 − c/n)^{n−1}. We further make a transition to Case i.3 if mutation flips exactly one 0-bit and one 1-bit, leading to a different search point with fitness i, and one of the identical parents is then chosen for removal (probability 2/3). From Case i.3 the algorithm can move back to Case i.2, so that we may have to consider multiple visits to Case i.2. A necessary condition for going back to Case i.2 is that the created offspring is identical to one of the search points in the population, and the other search point is selected for removal (probability 1/3). The probability for creating an identical offspring is maximised when the two search points have Hamming distance 2.
Either the same parent is selected twice (probability 1/2) and mutation does not flip any bit (probability (1 − c/n)^n), or the same parent is selected twice and mutation creates the other parent (probability O(1/n^2)), or different parents are selected (probability 1/2), crossover and mutation set the two differing bits identical to one of the parents (probability 1/2; this is the probability that was wrongly estimated as 1/4 in [5], leading to an upper bound of 5/(24e) for going back to Case i.2 rather than the correct 1/(4e) bound, which we derive now) and mutation does not flip any of the common bits (probability (1 − c/n)^{n−2}). Together, the probability of going back to Case i.2 is at most the sum of these terms, multiplied by the probability 1/3 for the removal. Hence the conditional probability that, when Case i.3 is left towards Case i.2 or higher fitness, it is left towards Case i.2, is bounded accordingly. With these pessimistic estimations for transition probabilities, we obtain a recurrence for T_{i,2}; plugging in the above values yields an upper bound for T_{i,2} with a factor (e^c + O(n^{−1})).
Noting that the times in all Cases i.1 and i.3, summed up for all i, are O(n) proves the claim.

A Potential Function Approach
We now turn to the main contribution of this paper, the tight lower bound for the (2+1) GA. Before going into detail, this section describes the main idea behind our approach and clarifies some further notation and fundamental observations that will be used in the remainder. We write a population {x_1, x_2} in order of monotonically decreasing fitness, that is, f(x_1) ≥ f(x_2). Let n_11 be the number of bit positions where both parents have ones and likewise for n_00 and the number of zeros. Let n_10 be the number of positions where x_1 has a 1 and x_2 has a 0 and likewise for n_01. Then we have f(x_1) = n_11 + n_10 and f(x_2) = n_11 + n_01. Since by assumption f(x_1) ≥ f(x_2), we have n_10 ≥ n_01, and n_10 = n_01 is equivalent to the two individuals having equal fitness. In case n_10 = 0, both individuals are identical. Such a population is called monomorphic in population genetics, and we use this term here.
Note that the (2+1) GA is an unbiased algorithm in the sense of Lehre and Witt [19]. In brief, this means that the algorithm treats all bit positions and all bit values symmetrically when generating new search points. Owing to this symmetry of bit positions and the fact that the fitness function is symmetrical itself, i.e., it only depends on the number but not the positions of the one-bits, it suffices to know n_11, n_10 and n_01 to fully characterise the state of the algorithm. Note that n_00 can be derived as n − n_11 − n_10 − n_01.
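The decomposition into n_11, n_10, n_01 and n_00 can be sketched as follows (an illustrative helper, not code from the paper):

```python
def decompose(x1, x2):
    """Count bit positions by the pair of values (x1_i, x2_i)."""
    n11 = sum(a & b for a, b in zip(x1, x2))          # both have a 1
    n10 = sum(a & (1 - b) for a, b in zip(x1, x2))    # only x1 has a 1
    n01 = sum((1 - a) & b for a, b in zip(x1, x2))    # only x2 has a 1
    n00 = len(x1) - n11 - n10 - n01                   # derived, as in the text
    return n11, n10, n01, n00

x1 = [1, 1, 1, 0, 0]   # fitness 3
x2 = [1, 1, 0, 1, 0]   # fitness 3, Hamming distance 2 to x1
n11, n10, n01, n00 = decompose(x1, x2)
print(n11, n10, n01, n00)        # 2 1 1 1
print(n11 + n10, n11 + n01)      # 3 3  (= f(x1), f(x2))
```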
The following lemma characterises probabilities of setting a bit to 1 in the offspring after a crossover of two different parents and a mutation of the result.

Lemma 8 Consider a crossover of two parents x, y followed by mutation with mutation rate p_m, resulting in an offspring z. For all i, Pr(z_i = 1) = 1 − p_m if x_i = y_i = 1, Pr(z_i = 1) = p_m if x_i = y_i = 0, and Pr(z_i = 1) = 1/2 if x_i ≠ y_i.
Proof If x_i = y_i then crossover will create an offspring with the same bit value. The statement for x_i ≠ y_i holds because of symmetry, or using the following, alternative argument. The offspring has a 1 if crossover creates a 1 and mutation does not flip bit i, or if crossover creates a 0 and mutation does flip bit i. The probability of the former event is 1/2 · (1 − p_m) and the probability of the latter event is 1/2 · p_m. Together, this gives 1/2.
Note that differing bits x_i ≠ y_i are set to 1 with probability 1/2, irrespective of the mutation rate. Hence, when two parents are selected, we only need to consider the effect of mutation on the bits where the parents agree. We frequently and tacitly use this fact.
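A minimal sketch of the case distinction in Lemma 8 (the function name p_one is ours):

```python
def p_one(a: int, b: int, p_m: float) -> float:
    """Exact probability that offspring bit z_i = 1 after uniform
    crossover of parent bits (a, b) and mutation with rate p_m."""
    if a == b:
        # Crossover preserves the common value; only mutation matters.
        return 1 - p_m if a == 1 else p_m
    # Differing bits: 1/2 * (1 - p_m) + 1/2 * p_m = 1/2, for any p_m.
    return 0.5

p_m = 0.01
print(p_one(1, 1, p_m))  # equals 1 - p_m
print(p_one(0, 0, p_m))  # equals p_m
print(p_one(0, 1, p_m))  # 0.5, irrespective of p_m
```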
Our lower bound applies when only considering populations where the number of zeros in the fitter parent is at most n/polylog(n) and at least polylog(n). This implies that all probabilities that involve flipping a 0 to 1 are polylogarithmically small.
The main tool for our lower bound is going to be drift analysis, applied to a potential function that captures the current state and potential easy fitness improvements. Definition 1 For a population P with values n_11, n_10, n_01, n_00 we define the potential of P as ϕ(P) = n_11 + n_10 + n_01/3.
The intuition is that n_11 + n_10 describes the current best fitness in the population. The term n_01/3 adds potential to the best fitness as the population has the potential to exploit the diversity given by the n_01 1-bits that only exist in the less fit individual during a successful crossover operation.
The choice of the factor 1/3 is motivated as follows. We know from previous work [5,27] that the most helpful populations for improvements are those where two search points have the same number of ones and Hamming distance 2, that is, n_10 = n_01 = 1.
(Larger Hamming distances have the potential for larger fitness improvements, but such populations are rarely reached when the number of zeros becomes reasonably small.) Assume the current state has n_10 = n_01 = 1, corresponding to a potential of n_11 + n_10 + 1/3. The most likely transitions (and, when only O(n/polylog(n)) zeros are left, the only transitions with probability Ω(1)) are (1) collapsing the population to copies of one parent (and potential n_11 + n_10) and (2) creating a surplus of one 1-bit by crossover and not flipping anything else (potential n_11 + n_10 + 1). The probability of the former event is roughly (1 − p_m)^n/4, which is the probability of selecting the same parent twice, not flipping any bits and then selecting the other population member for removal, plus the probability of selecting different parents, creating one parent by crossover and not flipping any bits. The probability of the latter event is roughly (1 − p_m)^n/8, which is the probability of selecting different parents, setting both differing bits to 1 and not flipping any bits in the subsequent mutation.
Comparing these terms, the conditional probability of an improvement via crossover is roughly (1/8)/(1/4 + 1/8) = 1/3. In case a monomorphic population is reached, the potential reduces by 1/3, and this happens with conditional probability 1 − 1/3 = 2/3. In the latter event, the potential increases by 1 − 1/3 = 2/3, and this happens with conditional probability 1/3. The net effect of these transitions on the expected change of the potential is (2/3) · (−1/3) + (1/3) · (2/3) = 0. So the potential balances out the effects of "volatile" states left quickly.
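The balance can be verified with exact rational arithmetic; a small sketch assuming the transition probabilities (1 − p_m)^n/4 and (1 − p_m)^n/8 discussed above (the common factor (1 − p_m)^n cancels in the conditional probabilities):

```python
from fractions import Fraction as F

# Potential from Definition 1: phi(P) = n11 + n10 + n01/3.
def phi(n11, n10, n01):
    return F(n11) + F(n10) + F(n01, 3)

# State with n10 = n01 = 1: two equally fit parents at Hamming distance 2.
start = phi(10, 1, 1)            # 34/3 for the illustrative value n11 = 10

# The two likely transitions, with the common (1 - p_m)^n factor dropped.
p_collapse = F(1, 4)  # population becomes monomorphic: phi drops by 1/3
p_improve = F(1, 8)   # crossover gains one 1-bit:      phi rises by 2/3

cond_improve = p_improve / (p_collapse + p_improve)
print(cond_improve)   # 1/3

drift = (1 - cond_improve) * F(-1, 3) + cond_improve * F(2, 3)
print(drift)          # 0 -- the potential balances out these transitions
```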
Obviously, our analysis still needs to account for other, less likely transitions. For populations with n_10 = n_01 > 1 the conditional transition probabilities change as the probability of creating one of the parents by crossover depends on the Hamming distance n_10 + n_01 between parents. For n_10 = n_01 > 1 the likely progress in a successful crossover may be smaller than n_01/3. Hence the term +n_01/3 in Definition 1 is a precise estimate for the likely progress when n_01 = 1 and for larger n_01 it is an overestimation.
It suffices to restrict our considerations to moderate values of n_10 and n_01. The reason is that the (2+1) GA always has a constant probability of creating a monomorphic population in one generation, regardless of the current population. This means that large values of n_10 and n_01 are very unlikely.

Lemma 9
Let t ≥ log^2 n and t = n^{O(1)}. With probability 1 − n^{−Ω(log n)}, all populations within the time interval [log^2 n, t] have Hamming distance at most log^2 n between their two individuals.
Proof We call a generation that creates a population of two identical individuals a monomorphic generation. The crucial idea is to show that monomorphic generations are very frequent so that large Hamming distances are unlikely to occur.
The probability of a monomorphic generation happening is at least (1/4) · (1 − c/n)^n · (1/3) = Ω(1), since it is sufficient to select a fittest parent twice, to clone it and to remove the other parent (which has probability at least 1/3). For a number t ≥ 0 of generations after a monomorphic one, let D_t denote the maximum number of bits in which the two parents ever have differed during these t generations. The crucial idea is that only mutations can increase this D-value. The total number of bits flipped in t generations is the sum of tn Poisson trials with success probability c/n each. Hence, within t generations following a monomorphic one, the D-value is bounded from above by 2ct with probability 1 − 2^{−Ω(t)} according to Chernoff bounds (Theorem 5), and clearly the Hamming distance is no larger than the D-value. We set t := (log^2 n)/(2c) to bound the D-value by log^2 n.
The proof is completed by noting that the probability of not observing a monomorphic generation within log^2 n generations is (1 − Ω(1))^{log^2 n} = n^{−Ω(log n)}. Together with the failure bound 2^{−Ω(t)}, which is n^{−Ω(log n)} for t = (log^2 n)/(2c), and a union bound, this means that in any polynomial number of generations following the first monomorphic one the Hamming distance never exceeds log^2 n with probability 1 − n^{−Ω(log n)}.

Roadmap for the Analysis
We give a deliberately informal, high-level view of our analysis, where Δ := ϕ(P_{t+1}) − ϕ(P_t) denotes the change in potential in one generation. By the law of total probability, Δ can be split up according to the number of zeros flipped by mutation. The case that no zeros flip does not increase the potential, hence we aim to bound the first line from above by 0. The second line captures the most important case: one zero flips and subsequent progress is made. The second line will be bounded by the dominant term in our claimed lower bound. The third line involves the probability of flipping at least two zeros. If the number of zeros is small, this is unlikely and thus the third line only contributes a small-order term. The above high-level view is not particularly accurate. Firstly, the above estimations need to account for error terms. Secondly, the notion of "i zeros flip" used above is not well-defined, because the number of zeros that can flip during mutation depends on the parent selection. The same parent may be selected twice, and then the number of zeros depends on the fitness of that parent. If two different parents are used, we only consider mutations of bits that agree in both parents, as per Lemma 8.
Hence, we need to distinguish between different events from the parent selection. To formalise this, let P_11, P_22, and P_12 denote the events that parent selection chooses the first parent twice, the second parent twice and both parents, respectively. We further denote by F_00 the number of flipping bits amongst the n_00 bits and likewise for F_11 and the n_11 bits. We use asterisks to indicate the union of sets: F_{0*} is the number of flipping bits among the n_01 + n_00 bits and F_{*0} is the number of flipping bits among the n_10 + n_00 bits. Variables F_{1*} and F_{*1} are defined analogously. Armed with this notation, we express the third line rigorously with a combination of events.
Lemma 10 For all populations with n 11 ≥ n − n/log^3 n and n 10 + n 01 ≤ log^2 n, Proof We give one common way of bounding the three lines from the statement. For all F ∈ {F 0* , F *0 , F 00 }, the drift can increase by at most n 01 + F, as every flipping 0-bit can increase the potential by at most 1 and crossover can increase the potential on all n 01 bits by at most 1. Also note that P(F *0 ≥ 2) is the largest amongst all the probabilities P(F ≥ 2) (as the underlying number of zeros is maximal for F *0 ). Thus, all three lines can be bounded in the same way. Since n 0* ≤ n − n 11 ≤ n/log^3 n, we get cn 0* /n ≤ c/log^3 n.

The cases of no zeros flipping and one zero flipping are more difficult to handle. Corresponding drift estimates will be derived in the following sections. More precisely, we derive closed formulas for the drift of the potential function depending on the choice of parents in Sect. 6 and then distinguish between generations that flip no zeros (Sect. 7) and one zero (Sect. 8). As mentioned in the introduction, the drift bounds obtained in these major technical sections are finally put together in Sect. 9.

Closed Formulas for the Potential Drift
Before proceeding to bound the drift of the potential, we derive closed bounds for the potential under the condition P 12 that different parents are being selected. These bounds hold for arbitrary given values of F 00 , F 11 , the numbers of flipping common zeros and ones, respectively. These closed formulas will then be used in the subsequent sections to bound the drift for the cases F 00 = 0 and F 00 = 1, respectively.
For the derivation of closed formulas it is important to distinguish between two cases: the parents having equal fitness, n 10 = n 01 , or the parents having unequal fitness, n 10 > n 01 . This is because, depending on the fitness of the offspring, different cases may occur and the probabilities used during replacement selection may differ. For instance, when both parents have equal fitness and the offspring happens to have the same fitness as well, there is a 3-way tie that is resolved uniformly at random by selection. When the parents have different fitness, 3-way ties are not possible (there can only be ties between the offspring and one parent). However, the offspring might be strictly worse than the fitter parent and strictly better than the worse parent. Such a case cannot happen when n 10 = n 01 . It therefore makes sense to split the analysis into these two cases and to derive separate closed bounds on the potential drift.

A Closed Formula for Selecting Different Parents of Equal Fitness (n 10 = n 01 )
In the following lemma we shall first derive a closed formula for the potential drift when two different parents are chosen that have the same fitness, n 10 = n 01 . The lemma defines a threshold ℓ that reflects the number of bits crossover needs to set to 1 to achieve the same fitness as x 1 and x 2 .

Lemma 11
Let S ∼ Bin(n 10 + n 01 , 1/2) and ℓ := n 10 − F 00 + F 11 . Then for all populations with n 10 = n 01 , Proof Consider a step where both parents are selected and F 00 , F 11 are known. The number of bits set to 1 by crossover is given by S ∼ Bin(n 10 + n 01 , 1/2). By the law of total probability, we may condition on the outcome S = s. Since n 10 = n 01 , the fitness of both parents is n 11 + n 10 and the fitness of the offspring is n 11 + s + F 00 − F 11 . If s < ℓ, the latter is less than n 11 + n 10 and the offspring will be rejected. If s > ℓ, the offspring is fitter than both parents and one of the parents will be chosen uniformly at random for removal. If the offspring is as fit as both parents, the offspring is removed with probability 1/3. Let S 10 be the number of bits among the n 10 bits that are set to 1 in the offspring and define S 01 analogously for the n 01 bits. Note that S = S 10 + S 01 where S 10 ∼ Bin(n 10 , 1/2) and S 01 ∼ Bin(n 01 , 1/2) according to Lemma 8. Given S 10 = s 10 and S 01 = s 01 , the potential difference Δ is derived as follows. Among the n 11 bits, F 11 bits flip to 0, reducing their contribution from 1 to 1/3 each, leading to a contribution of −2F 11 /3 to the potential difference. All the n 10 bits contribute 1 to the potential ϕ(P t ). In P t+1 , s 10 bits contribute 1 and the remaining n 10 − s 10 bits contribute 1/3 each. Hence the contribution to the potential difference is −2(n 10 − s 10 )/3. The n 01 bits contribute 1/3 each in ϕ(P t ) and in P t+1 we have s 01 bits contributing 1 each and the other n 01 − s 01 bits contributing 0. Hence the contribution to the potential difference is s 01 − n 01 /3. Finally, the contribution of the n 00 bits to the potential difference is F 00 .
In the following we denote by (X | Y ) the conditional variable X conditional on Y , where Y is a sequence of events and/or random variables. Omitting the implicit conditions P 12 , F 00 , F 11 for brevity, we obtain a conditional drift Δ and apply the law of total probability. Now, (S 10 | S = s) follows a hypergeometric distribution with parameters n 10 (number of red balls), n 10 + n 01 (number of balls) and s (number of draws). The expectation is thus s · n 10 /(n 10 + n 01 ) = s/2. Plugging this in and simplifying the first terms and the last term proves the claim.
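The hypergeometric step can be verified mechanically. The following sketch (our own illustration, not code from the paper) computes E(S 10 | S = s) exactly and checks it against s · n 10 /(n 10 + n 01 ), which equals s/2 when n 10 = n 01 :

```python
from fractions import Fraction
from math import comb

def cond_mean_s10(n10, n01, s):
    """Exact E(S10 | S10 + S01 = s) where S10 ~ Bin(n10, 1/2), S01 ~ Bin(n01, 1/2).
    Given the sum, S10 is hypergeometric: P(S10 = k) = C(n10,k)C(n01,s-k)/C(n10+n01,s)."""
    total = comb(n10 + n01, s)
    return sum(Fraction(k * comb(n10, k) * comb(n01, s - k), total)
               for k in range(max(0, s - n01), min(n10, s) + 1))

# For equal-fitness parents (n10 = n01) the conditional mean is exactly s/2.
assert cond_mean_s10(5, 5, 4) == Fraction(4, 2)
# In general the mean is s * n10 / (n10 + n01).
assert cond_mean_s10(7, 3, 6) == Fraction(6 * 7, 10)
```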
The bound from Lemma 11 depends on the expected surplus E(S | S ≥ ℓ) generated by a crossover on the bits that differ between the two parents, where ℓ reflects the fitness threshold above which offspring are accepted. We use the following formula to simplify such expressions. The proof goes back to work by Gruder [13] that is highlighted in a paper by Johnson [17]. It is given in the appendix for the sake of completeness.
Lemma 12 Let S ∼ Bin(n, 1/2). Then for all ℓ ∈ N, Using Lemma 12, we obtain the following simplified formula.
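The exact statement of Lemma 12 did not survive extraction. A classical identity of this type for S ∼ Bin(n, 1/2), which we verified independently and which matches the role the lemma plays here (simplifying expectations of the form E(S · 1{S ≥ ℓ})), is E(S · 1{S ≥ ℓ}) = (n/2) · P(S ≥ ℓ) + (ℓ/2) · P(S = ℓ); this is stated as our assumption about the lemma's flavour, not as its verbatim content. A quick exact check:

```python
from fractions import Fraction
from math import comb

def lhs(n, ell):
    """E(S * 1[S >= ell]) for S ~ Bin(n, 1/2), computed exactly."""
    return sum(Fraction(k * comb(n, k), 2**n) for k in range(ell, n + 1))

def rhs(n, ell):
    """(n/2) * P(S >= ell) + (ell/2) * P(S = ell), computed exactly."""
    tail = sum(Fraction(comb(n, k), 2**n) for k in range(ell, n + 1))
    point = Fraction(comb(n, ell), 2**n)
    return Fraction(n, 2) * tail + Fraction(ell, 2) * point

# The identity holds exactly for all n and all thresholds ell.
for n in range(1, 12):
    for ell in range(1, n + 1):
        assert lhs(n, ell) == rhs(n, ell)
```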

A Closed Formula for Selecting Different Parents of Unequal Fitness (n 10 > n 01 )
For the case of unequal fitness (n 10 > n 01 ), we use similar arguments as before. This scenario requires more involved calculations as we need to distinguish different cases: the offspring may be at least as good as the fitter parent, in which case the calculations are similar to the equal-fitness scenario. The offspring may also be worse than the fitter parent and better than the worse parent. In this case, the potential is derived from a different formula as the values for n 10 and n 01 in the next generation are still determined according to the fitter parent. In case the offspring's fitness is equal to that of the worse parent, there is a tie and the offspring is only accepted with probability 1/2. The lemma defines two thresholds ℓ1 and ℓ2 that reflect the number of bits crossover needs to set to 1 to achieve the fitness of x 1 and x 2 , respectively.

Lemma 14
Let S ∼ Bin(n 10 + n 01 , 1/2), ℓ1 := n 10 − F 00 + F 11 and ℓ2 := n 01 − F 00 + F 11 . Then for all populations with n 10 > n 01 , Proof Consider a step where both parents are selected and F 00 , F 11 are known. The number of bits set to 1 by crossover is given by S ∼ Bin(n 10 + n 01 , 1/2). By the law of total probability, we may condition on the outcome S = s. Note that the fitness of x 1 is n 11 + n 10 , that of x 2 is n 11 + n 01 and the fitness of the offspring is n 11 + s + F 00 − F 11 . We distinguish the following cases.
1. If s ≥ ℓ1 , the offspring is at least as fit as the fitter parent, and x 2 will be removed. The new values of n 10 , n 01 will be determined by the offspring.
2. If ℓ2 < s < ℓ1 , the offspring is fitter than the worse parent x 2 , and x 2 will be removed. Parent x 1 will remain the fitter parent.
3. If s = ℓ2 , the offspring is as fit as x 2 and x 2 will be removed with probability 1/2. Then the potential is computed as in the case ℓ2 < s < ℓ1 .
4. If s < ℓ2 , the offspring is worse than x 2 and will be rejected. Hence the potential does not change.
Let Δ be the change of potential from the current population P t to the new population P t+1 as described above. Then, where the last line accounts for the fact that with probability 1/2 the offspring is removed if s = ℓ2 . Now we estimate E(Δ | S = s)P(S = s), starting with the case s ≥ ℓ1 . Let S 10 be the number of bits among the n 10 bits that are set to 1 in the offspring and define S 01 analogously for the n 01 bits. Note that S = S 10 + S 01 where S 10 ∼ Bin(n 10 , 1/2) and S 01 ∼ Bin(n 01 , 1/2).
Given S 10 = s 10 and S 01 = s 01 , the potential difference Δ is derived as follows. Among the n 11 bits, F 11 bits flip to 0, reducing their contribution from 1 to 1/3 each, leading to a contribution of −2F 11 /3 to the potential difference. All the n 10 bits contribute 1 to the potential ϕ(P t ). In P t+1 , s 10 bits contribute 1 and the remaining n 10 − s 10 bits contribute 1/3 each. Hence the contribution to the potential difference is −2(n 10 − s 10 )/3. The n 01 bits contribute 1/3 each in ϕ(P t ) and in P t+1 we have s 01 bits contributing 1 each and the other n 01 − s 01 bits contributing 0. Hence the contribution to the potential difference is s 01 − n 01 /3. Finally, the contribution of the n 00 bits to the potential difference is F 00 . Together, by the law of total probability, we obtain the corresponding drift term for all s ≥ ℓ1 . Now we consider the case ℓ2 ≤ S < ℓ1 . Here x 1 remains the fitter parent and the potential difference Δ is derived as follows. Flipping bits among the n 11 and n 10 bits do not change the potential. We have S 01 + F 00 bits that each contribute 1/3 to the potential, compared to n 01 bits in the previous population. Hence the change in potential is (S 01 + F 00 − n 01 )/3. By the law of total probability, for all ℓ2 ≤ s < ℓ1 , E(Δ | S = s) = s · n 01 /(3(n 10 + n 01 )) + (F 00 − n 01 )/3, using the same calculations as before.
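The last expectation can be checked exactly: given S = s, the variable S 01 is hypergeometric with mean s · n 01 /(n 10 + n 01 ), so E((S 01 + F 00 − n 01 )/3 | S = s) = s · n 01 /(3(n 10 + n 01 )) + (F 00 − n 01 )/3. A short sketch (our illustration):

```python
from fractions import Fraction
from math import comb

def mean_middle_case(n10, n01, s, f00):
    """Exact E((S01 + F00 - n01)/3 | S = s), with S01 hypergeometric given S = s."""
    total = comb(n10 + n01, s)
    e_s01 = sum(Fraction(k * comb(n01, k) * comb(n10, s - k), total)
                for k in range(max(0, s - n10), min(n01, s) + 1))
    return (e_s01 + f00 - n01) / 3  # Fraction arithmetic stays exact

# Compare with the closed form s*n01/(3(n10+n01)) + (F00 - n01)/3.
n10, n01, s, f00 = 6, 4, 5, 1
closed = Fraction(s * n01, 3 * (n10 + n01)) + Fraction(f00 - n01, 3)
assert mean_middle_case(n10, n01, s, f00) == closed
```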

Potential Drift When No Zeros Flip
We now consider the potential drift when no zeros flip. When the distance to the optimum is o(n), this case is by far the most frequent case. This also means that our drift bounds have to be precise, as even a small error term may have a big impact and spoil the analysis.

Selecting the Same Parent Twice
We start by considering the drift conditional on selecting the same parent twice.

Lemma 15
For all populations with n 10 = n 01 , For all populations with n 10 > n 01 , Proof First assume n 10 = n 01 and consider the event P 11 . Given F 0* = 0, that is, if no 0-bit is flipped, the offspring can only be accepted if no 1-bit is flipped, i.e., F 1* = 0. These events happen with probability P(F 0* = 0)P(F 1* = 0) = (1 − c/n)^n and they lead to an offspring that is identical to x 1 . Since all search points have equal fitness, x 2 is removed with probability 1/3. This leads to a monomorphic population and the potential decreases by n 01 /3. Multiplying the above terms proves the claimed equality. The case of P 22 follows analogously, considering F *0 and F *1 instead. For n 10 > n 01 , if the fitter parent x 1 is selected twice, a copy of it is created with probability (1 − c/n)^n and then x 2 is removed. This decreases the potential by n 01 /3. Multiplying the above terms yields an expectation of −n 01 /3 · (1 − c/n)^n . Note that other operations cannot increase the potential since no 0-bit is being flipped and flipping 1-bits in x 1 does not decrease the potential.
If x 2 is selected twice as parent, the potential cannot increase since no 0-bits are flipped, and it cannot decrease as any 1-bit being flipped will lead to the offspring being rejected. Now we consider the event that two different parents are selected. The remainder of this section is split into drift bounds when the parents have equal fitness, n 10 = n 01 , and the case where parents have unequal fitness, n 10 > n 01 .
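The factor (1 − c/n)^n from Lemma 15 is the probability that mutation creates an exact copy of the parent; it converges to e^{−c} from below as n grows. A small numeric illustration (ours, not from the paper):

```python
import math

def copy_prob(n, c):
    """Probability that standard bit mutation with rate c/n flips none of the n bits."""
    return (1 - c / n) ** n

# (1 - c/n)^n approaches e^{-c} from below as n grows.
for c in (1.0, 1.2122, 1.422):
    p = copy_prob(10**6, c)
    assert abs(p - math.exp(-c)) < 1e-5
    assert p < math.exp(-c)
```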

Selecting Different Parents of Equal Fitness (n 10 = n 01 )
We use the closed formula from Lemma 13 to show that, to get an upper bound on the drift when F 00 = 0, we only need to consider F 11 ∈ {0, 1}, as larger values lead to a non-positive drift. This is not obvious, but follows from a lengthy though elementary calculation. The proof is placed in the appendix to keep the main part streamlined.

Lemma 16
For all n 10 = n 01 and all i ≥ 2, E(Δ | P 12 , F 00 = 0, F 11 = i) ≤ 0. Now we are able to give an upper bound on the potential drift, assuming that different parents are selected (P 12 ) and no common zero-bits flip (F 00 = 0).

Selecting Different Parents of Unequal Fitness (n 10 > n 01 )
The case of P 12 , different parents being selected, no common 0-bits flipping (F 00 = 0) and parents having unequal fitness is easy to handle as in this scenario the potential drift is always non-positive. This is shown in the following lemma. It also states a closed formula for the potential drift as this will be used later on, in the proof of Lemma 22.
The proof is not very insightful, hence it is placed in the appendix.

Combined Drift Bound When No Zero Flips
Assembling the previous drift bounds for the various conditions yields the following combined drift bound.

Potential Drift When One Zero Flips
In this section we show that the drift is bounded by a term that yields the leading constant we are aiming for in our main result. Note that here we can afford to include error terms of lower order. We first consider the case that two equal parents are selected.

Lemma 20
For all populations with n 10 = n 01 , For all populations with n 10 > n 01 , Proof First assume n 10 = n 01 and consider the event P 11 . Given F 0* = 1, the offspring is accepted with certainty if no 1-bit is flipped. With probability 1/2 the parent survives and then the new potential is at most n 11 + n 10 + 1. Consequently, the potential changes by at most 1 − n 01 /3 due to the loss of diversity. With the remaining probability 1/2, the potential increases by at most 1. The overall expected change in potential in this case is thus at most 1 − n 01 /6. If a single 1-bit also flips together with the 0-bit, then the offspring has the same fitness as both parents and the individual to be removed is selected uniformly at random. If the offspring is removed, the potential does not change. If the parent is removed, the potential increases by at most 1/3. If the other population member is removed, the potential increases by at most 1/3 − n 10 /3 due to the loss of diversity. The expected change in potential in this case is thus at most 2/9 − n 10 /9. If more than one 1-bit flips, then the offspring will have lower fitness than both other members and it will be rejected. Summing up the terms proves the first claim; the case of P 22 follows analogously, considering F *0 and F *1 instead.
For n 10 > n 01 , given F 0* = 1 and that the fitter parent x 1 is selected twice, we consider separate cases according to the size of n 01 . If n 01 = 0, then no diversity can be lost. Thus, if no 1-bits flip, the potential increases by 1 because the new offspring has higher fitness than x 1 , and x 2 is removed. If at least one 1-bit flips, then the best fitness does not change and the potential increases by at most 1/3. Overall, for the case n 01 = 0, the potential changes by P(F 1* = 0) + 1/3 · P(F 1* > 0) = 1/3 + 2/3 · P(F 1* = 0). If n 01 > 0 and no 1-bits flip, then the potential changes by at most 1 − n 01 /3 since the diversity is lost because x 2 is removed. If instead at least one 1-bit flips, then the potential changes by at most 1/3 − n 01 /3 since the best fitness does not change and the diversity may be lost if x 2 is removed. Since for n 01 > 0 these terms are negative, and the offspring is accepted with probability 1 if F 1* = 1, the claim follows by summing up the two terms.
If x 2 is selected twice and no 1-bits are flipped, then the potential increases by at most 1/3 (i.e., if an n 00 -bit is flipped) since the parent is removed and the diversity is kept. If a single 1-bit is flipped, then the potential increases again by at most 1/3. However, since the offspring has the same fitness as its parent, the parent must be removed, which happens with probability 1/2. If more than one 1-bit is flipped, then the offspring is rejected. Summing up the terms completes the proof.
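The case sums in the proof of Lemma 20 can be double-checked with exact rational arithmetic; the following sketch (our illustration) redoes two of them:

```python
from fractions import Fraction as F

def expected_change_equal_fitness(n01):
    """One 0-bit and no 1-bit flips, n10 = n01: parent survives w.p. 1/2
    (change at most 1 - n01/3), otherwise the potential increases by at most 1."""
    return F(1, 2) * (1 - F(n01, 3)) + F(1, 2) * 1

def expected_change_one_one_bit(n10):
    """One 0-bit and one 1-bit flip: each of the three individuals is removed
    w.p. 1/3 with changes 0, 1/3 and 1/3 - n10/3, respectively."""
    return F(1, 3) * 0 + F(1, 3) * F(1, 3) + F(1, 3) * (F(1, 3) - F(n10, 3))

# These reproduce the bounds 1 - n01/6 and 2/9 - n10/9 from the proof.
for k in range(20):
    assert expected_change_equal_fitness(k) == 1 - F(k, 6)
    assert expected_change_one_one_bit(k) == F(2, 9) - F(k, 9)
```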

The proof of Lemma 20 has revealed a counterintuitive effect. A population of individuals with very different fitness values f(x 1 ) ≫ f(x 2 ) can have an advantage over a population where both members have the same fitness f(x 1 ). This is because, conditioning on a 0-bit flipping, if the fitter parent is chosen twice, a near-arbitrary number of 1-bits can flip at the same time and the outcome may still be accepted. This increases the potential and explains why Lemma 20 contains an unexpectedly large potential drift in the case n 10 > n 01 = 0. Now we consider the event P 12 that two different parents are selected. When F 11 ≥ 2 and the parents have equal fitness, the drift under P 12 is non-positive for n 10 ≥ 10, except for F 11 = n 10 − 1, where it is exponentially small in n 10 . The proof of the following lemma is given in the appendix.

Lemma 21
For all n 10 = n 01 , n 10 ≥ 10 and all 2 ≤ i ≤ n 10 − 1, When the two parents have unequal fitness, the following uniform upper bound on the potential drift applies.

Lemma 22
For all n 10 > n 01 and all F 11 ∈ N 0 , Proof We start again with the drift formula from Lemma 14 and plug in F 00 = 1. Then E(Δ | P 12 , F 00 = 1, F 11 ) = P(S > ℓ1 ) · (2 − 2F 11 − n 10 + n 01 )/3 For n 10 < 12 the claim is verified numerically for all pairs n 01 < n 10 , see Table 4. We now turn to an analytical argument for n 10 ≥ 12. From Lemma 18 we already know the required estimate, since P(S ≥ ℓ1 ) ≤ 1/2 (using that ℓ1 is strictly greater than the mean of S). Altogether, the claim follows no matter how ℓ1 and ℓ2 turn out.
To obtain a combined drift formula, we need to consider probabilities for flipping a certain number of 1-bits. The following simple lemma gives bounds for these.

Lemma 23
For every mutation rate c/n and every i, Proof.
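The statement of Lemma 23 is abbreviated above. For a binomially distributed number of flips F ∼ Bin(m, c/n), one standard bound of this kind, which we state as an assumption about the flavour of the lemma rather than its exact content, is P(F ≥ i) ≤ C(m, i)(c/n)^i . An exact numeric check:

```python
from math import comb

def flip_exactly(m, i, p):
    """P(F = i) for F ~ Bin(m, p): exactly i of m bits flip under rate p = c/n."""
    return comb(m, i) * p**i * (1 - p) ** (m - i)

def flip_at_least(m, i, p):
    """P(F >= i), computed by summing the exact binomial tail."""
    return sum(flip_exactly(m, k, p) for k in range(i, m + 1))

# Union-style upper bound: P(F >= i) <= C(m, i) p^i.
m, p = 100, 1.2122 / 1000
for i in range(6):
    assert flip_at_least(m, i, p) <= comb(m, i) * p**i + 1e-12
```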

Putting Everything Together
Combining results from previous sections, for different numbers of flipping zeros, yields the following unconditional drift bound.
For n 10 > n 01 = 0, the bound from Lemma 24 is at most the bound from the same lemma for n 10 = n 01 whenever c ≤ 1.422. Then the claim follows as above.
To translate the upper bound from Lemma 25 into a lower bound on the expected runtime, we use the lower-bound version of the multiplicative drift theorem from Theorem 4. This theorem requires an upper bound on the drift of the potential function and a sufficiently small probability for large jumps of this value. Such large jumps can occur if the two individuals of the (2+1) GA have a large Hamming distance. Recall that Lemma 9 shows this to be unlikely.
The following lemma shows that a drift of the potential can be translated into a lower bound on the expected optimisation time. This is not immediate since the potential function is a weighted combination of two quantities.

Lemma 26
Let N t denote the number of n 00 -bits at time t of the (2+1) GA and T the first point in time where n 00 ≤ log^5 n. Assume that n 01 + n 10 ≤ log^2 n for all points in time before T , and N 0 ≥ log^5 n. If E(ϕ t+1 − ϕ t | ϕ t ) ≤ δ N t for some n^{−O(1)} ≤ δ < 1 and all t < T , then Proof We introduce a distance function ϕ̄ t := n − ϕ t = n 00 + (2/3) n 01 as the mirror image of our potential to obtain a function to be minimised, as required in Theorem 4. Moreover, we write Δ̄ t = ϕ̄ t − ϕ̄ t+1 . The key idea is to show that the statement on E(T | ϕ̄ 0 ) holds under the slightly different drift condition and then to prove that the actual drift condition E(ϕ t+1 − ϕ t | ϕ t ) ≤ δ N t leads to the same result, up to lower-order terms.
We consider the process from the first point in time where ϕ̄ t ≤ n/log^3 n, assume (11) to hold for all t < T and estimate the remaining expected time to minimise the distance ϕ̄ t . Lemma 9 and the facts that at most log n bits flip per generation and that the distance does not drop below n/log^3 n within the first log^2 n generations (each happening with probability 1 − n^{−Ω(log n)} ) imply that ϕ̄ 0 ≥ n/(2 log^3 n) with respect to our time count. We assume this to happen. By Lemma 9, with probability 1 − n^{−Ω(log n)} it holds that n 10 + n 01 ≤ log^2 n for any polynomial number of steps. Assuming this for a sufficiently long period obtained from applying Markov's inequality on E(T | ϕ̄ 0 ), our assumptions only change the bound on the expected value by a 1 − n^{−Ω(log n)} factor.
Clearly, since n 10 + n 01 ≤ log^2 n, crossover followed by a neutral mutation can change the ϕ̄-value by at most log^2 n. Moreover, each mutation flips k or more bits with probability at most the bound given in (12); in particular, it flips at most log^2 n bits with probability 1 − n^{−Ω(log n)} . Adding up these effects, we arrive at P(ϕ̄ t − ϕ̄ t+1 ≥ 2 log^2 n) = n^{−Ω(log n)} . The time to minimise the distance function is no smaller than the time to reach a distance of at most x min := log^5 n (we stop at this point to fulfil the condition on n 00 in Lemma 25). Along with β := 2/log n and using X t := ϕ̄ t , we verify the second condition of Theorem 4 by estimating the probability of large jumps. We bound this probability by n^{−Ω(log n)} ≤ βδ/log n ≤ βδ/log(X t ) since both β and δ are at least inversely polynomial in n. Hence, P(X t − X t+1 ≥ β X t ) ≤ βδ/log(X t ), which satisfies the condition. Applying the theorem and recalling our assumption ϕ̄ 0 ≥ n/(2 log^3 n) yields the claimed bound. Recall that the unconditional expected time is only by a factor 1 − n^{−Ω(log n)} smaller.
We are now ready to prove our main result.

Proof of Theorem 2
We apply drift analysis to a "typical" run, that is, a run where the events mentioned in the following that happen with overwhelming probability do occur. The contrary is called a failure event, and if a failure occurs, we pessimistically assume that the runtime is 0. By the law of total probability, the expected runtime is bounded from below by the expected runtime in a typical run, multiplied by the probability that no failure occurs. With probability 1 − 2^{−Ω(n)} the initial population has at most (2/3)n one-bits. The random number of one-bits that crossover creates among the relevant n 10 and n 01 bits follows a binomial distribution with parameters n 10 + n 01 and 1/2. The expected value of this number equals (n 10 + n 01 )/2 ≤ n/2, and using Chernoff bounds (Theorem 5) with δ = 2n^{−1/3} and the upper bound n/2 on the expectation, the number of one-bits of every fixed offspring is at most U := (n 10 + n 01 )/2 + n^{2/3} with probability 1 − 2^{−Ω(n^{1/3})} . By a union bound, the probability that at least one offspring has more than U one-bits is still 2^{−Ω(n^{1/3})} . Since the fittest parent has at least (n 10 + n 01 )/2 one-bits, this means that the number of one-bits at the relevant positions, and thereby the number of one-bits of the fitter offspring, grows by at most n^{2/3} with probability 1 − 2^{−Ω(n^{1/3})} . The same holds for each mutation with constant rate c, as seen in (12). Hence, the subsequent 2 log^2 n generations do not increase the OneMax-value to more than (3/4)n with probability 1 − 2^{−Ω(n^{1/3})} .
We consider the first point in time when n 11 ≥ n − n/log^3 n. By the same arguments as before, we then have n 11 ≤ n − n/(2 log^3 n) with overwhelming probability. Now Lemma 9 is in force, implying that we can apply Lemma 26; this yields a lower bound of 9e^c /(c(2c+9)) · n ln(n) − O(n ln ln n) as claimed. This still holds when multiplying the above with the probability of no failure event occurring, as the union of all failure events has superpolynomially small probability.

Experiments
Our lower bound is tight up to lower-order terms. We ran experiments to see how close the results are to the dominant term of 9e^c /(c(2c+9)) · n ln n when varying c and n. We see that as c increases beyond 1.422, a gap starts to appear between the dominant term and the average runtime. The average runtime seems to be smaller than 9e^c /(c(2c+9)) · n ln n. This could indicate that the lower bound does not hold for c > 1.422, or that there are lower-order terms that affect the plots for n = 1000. To see how quickly the runtime approaches the dominant term as n grows, we ran experiments with n increasing exponentially in powers of 2 from n = 2^3 = 8 to n = 2^13 = 8192. Figure 3 shows the normalised average runtimes, that is, the runtime divided by n ln n. One can see that, for both the default and the optimal mutation rates, c = 1 and c = (√97 − 5)/4, respectively, the curves approach their respective dominant terms. However, the approach is quite slow. Even for n = 8192 there is still a significant gap to the leading constant of the dominant term. This might indicate that the average runtime includes a significant negative lower-order term.
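For readers who wish to reproduce such experiments, the following is a minimal sketch of a steady-state (2+1) GA on OneMax consistent with the selection and replacement rules used in the analysis above (uniform parent selection with replacement, uniform crossover, standard bit mutation with rate c/n, and removal of a uniformly chosen worst individual). Details such as tie-breaking follow our reading of the proofs; the exact implementation behind the paper's experiments may differ.

```python
import random

def onemax(x):
    # OneMax fitness: the number of one-bits.
    return sum(x)

def two_plus_one_ga(n, c, rng, max_gens=10**7):
    """Generations of a steady-state (2+1) GA until OneMax is optimised.

    Each generation: pick two parents uniformly at random with replacement,
    apply uniform crossover, mutate each bit with probability c/n, add the
    offspring and remove one individual of worst fitness (ties broken
    uniformly at random)."""
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(2)]
    p = c / n
    for gen in range(max_gens):
        if max(onemax(x) for x in pop) == n:
            return gen
        x1, x2 = rng.choice(pop), rng.choice(pop)
        child = [x1[i] if rng.random() < 0.5 else x2[i] for i in range(n)]
        child = [b ^ 1 if rng.random() < p else b for b in child]
        pop.append(child)
        worst = min(onemax(x) for x in pop)
        pop.pop(rng.choice([i for i, x in enumerate(pop) if onemax(x) == worst]))
    return max_gens

runtime = two_plus_one_ga(50, 1.2122, random.Random(42))
```

Averaging `runtime` over many seeds and dividing by n ln n reproduces the normalisation used in Figure 3.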
Recall that our lower bound (Theorem 2) includes a negative term of −O(n log log n), while the upper bound (Theorem 1) contains an additive term +O(n). The term −O(n log log n) is an artefact of the fact that we only considered around n/log^3 n fitness levels and excluded the most difficult log^5 n ones. It is easy to show that optimising these fitness levels takes expected time O(n log log n) for both the (1+1) EA and the (2+1) GA. We conjecture that the expected running time is close to 9e^c /(c(2c+9)) · n ln(n) + bn for some negative constant b. This is based on related work on the (1+1) EA, whose expected runtime on OneMax is known to be en ln(n) − 1.8925 . . . n + O(log n) (cf. [14], who even determine the expected runtime up to additive errors of O(log n/n)). Intuitively, the negative linear term appears in the bound for the (1+1) EA since the algorithm does not start with the largest possible value of the underlying potential function, which is the number of one-bits. More precisely, the initial number of one-bits X 0 has an expected value of n/2 and the multiplicative drift theorems both for lower and upper bounds involve the term ln(X 0 /x min ). For X 0 ≈ n/2 this becomes roughly ln(n) − ln 2, so that the crucial term ln(X 0 /x min )/δ, where δ = Θ(1/n), becomes at most (ln n)/δ − (ln 2)/δ = Θ(n ln n) − O(n). To see how closely the empirical data matches a conjectured bound with a negative linear term, we performed a non-linear regression to fit the average runtimes to a model a n ln(n) + bn with parameters a, b. We used the software R, version 4.0.3, calling the nls command with starting values a = 1 and b = 0. The results of the regression were values of a, b minimising the residual sum of squares.
For the (2+1) GA with c = 1, this resulted in a fitted function of 2.2381n ln(n) − 0.7995n. The leading constant 2.2381 is close to the value 2.224 from our theoretical analysis. Likewise, for the optimal value c = (√97 − 5)/4 ≈ 1.2122, the best fit is 2.1532n ln(n) − 0.5883n; the leading constant is close to 2.18417 from our analysis and smaller than the leading constant for c = 1. In both cases, the linear term has a negative sign. For comparison, the same regression for the (1+1) EA returned the best fit 2.745n ln(n) − 2.116n. The leading constant is close to the theoretically proven one of e ≈ 2.71828 and the coefficient of the linear term is also close to the theoretical one of −1.8925 . . . [14].
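Since the model a n ln(n) + bn is linear in the parameters a and b, the same fit can be reproduced with ordinary least squares instead of nls. A self-contained sketch (the data below is synthetic, generated from the fitted coefficients reported above, purely to illustrate the method):

```python
import math

def fit_nlogn_plus_n(ns, ys):
    """Least-squares fit of y ~ a*n*ln(n) + b*n via the 2x2 normal equations."""
    u = [n * math.log(n) for n in ns]   # regressor for a
    v = [float(n) for n in ns]          # regressor for b
    suu = sum(x * x for x in u)
    svv = sum(x * x for x in v)
    suv = sum(x * y for x, y in zip(u, v))
    suy = sum(x * y for x, y in zip(u, ys))
    svy = sum(x * y for x, y in zip(v, ys))
    det = suu * svv - suv * suv
    a = (suy * svv - svy * suv) / det   # Cramer's rule
    b = (svy * suu - suy * suv) / det
    return a, b

# Synthetic runtimes from the coefficients reported for c = 1 (no noise),
# so the fit should recover them almost exactly.
ns = [2**k for k in range(3, 14)]
ys = [2.2381 * n * math.log(n) - 0.7995 * n for n in ns]
a, b = fit_nlogn_plus_n(ns, ys)
assert abs(a - 2.2381) < 1e-6 and abs(b + 0.7995) < 1e-6
```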

Conclusions
Proving lower bounds for crossover-based GAs is a notoriously hard problem. We have provided such a lower bound for the (2+1) GA on OneMax through a careful analysis of a potential function that captures both the current best fitness and the potential for finding improvements through crossover combining different "building blocks" of good solutions. Our lower bound is tight up to lower-order terms. This proves rigorously, for the first time, that populations are beneficial for standard steady-state genetic algorithms. We also identified the optimal mutation rate for the (2+1) GA as (√97 − 5)/(4n) for the considered range of mutation rates c/n. Our lower bound applies for c ≤ 1.422, and an obvious open question is whether the leading constant in the expected runtime remains at 9e^c /(c(2c+9)) when this threshold is exceeded. Our empirical results suggest that the expected runtime is smaller than the stated bound, although it is not clear whether these results are affected by negative lower-order terms. We conjecture that the runtime is very close to 9e^c /(c(2c+9)) · n ln(n) + bn, where bn is a linear term with a negative constant b. Our drift estimates subsumed errors that were smaller than the dominant term by a factor of O(1/log n) in asymptotic notation, hence tighter analyses would be required to obtain rigorous bounds on the conjectured linear term. Another avenue for future work would be to simplify our approach, possibly by exploiting that states with n 10 > 1 are rarely reached in the late stages of a run, or to generalise our analysis to larger parent population sizes.
Proof. The claim is trivial for a > b, hence we assume a ≤ b. The following proof is stated in [17], attributed to Gruder [13].
We use Gruder's equation to prove Lemma 12.

A.2 Proof of Lemma 16
Recall that Lemma 16 claims that if F 00 = 0 and F 11 ≥ 2 then under P 12 the potential drift is non-positive.
Since the first term on the right-hand side of (13) is clearly negative, we omit it in the following. We use our assumption n 10 ≥ 12. The aim is to compare different probabilities occurring in (13) to each other. More precisely, we want to show the claim that P(S > ℓ2 ) ≥ 2P(S = ℓ1 ). Along with the fact that the third term is clearly negative, we then have E(Δ | P 12 , F 00 = 0, F 11 ) ≤ P(S = ℓ1 ) · (−F 11 /3 + n 01 /6).

Table 3 Numerical values for the potential drift given P 12 , n 10 > n 01 and F 00 = F 11 = 0, rounded to 4 digits

Table 4 Numerical values for the potential drift given P 12 , n 10 > n 01 and F 00 = 1, F 11 = 0, rounded to 4 digits

Hence, we are left with the claim. We first assume n 10 ≥ n 01 + 2 and treat the case n 10 = n 01 + 1 separately below. Note that nothing is to show if ℓ1 > n 10 + n 01 , since then only negative terms have non-zero probability. Hence, we assume ℓ1 ≤ n 10 + n 01 in the following. Since n 01 ≤ n 10 − 2 and F 00 = 0, we have ℓ1 = n 10 − F 00 + F 11 ≥ (n 10 + n 01 )/2 + 1 and ℓ2 = n 01 − F 00 + F 11 ≤ ℓ1 − 2. Since S follows a binomial distribution with parameters n 10 + n 01 and 1/2, we have shown ℓ1 ≥ E(S) + 1. Using the well-known monotonicity of the binomial distribution, we have P(S = ℓ1 − 1) ≥ P(S = ℓ1 ), and, since P(S > ℓ2 ) ≥ P(S = ℓ1 ) + P(S = ℓ1 − 1), we obtain P(S > ℓ2 ) ≥ 2P(S = ℓ1 ) as claimed in the case n 10 ≥ n 01 + 2.
The ratio α will reflect the following trade-off: if F 11 is small, then ℓ1 = n 10 + F 11 is close to the middle value of the binomial distribution with parameters N := n 10 + n 01 and 1/2, and P(S = ℓ1 + 1) is not much smaller than P(S = ℓ1 ). Then the positive contribution of n 01 /6 can almost be compensated by the negative −αn 01 /6. However, if F 11 is big, then α must be relatively small due to the tail of the binomial distribution. Then we exploit that the negative terms involving F 11 make the drift bound small. We now consider several ranges for F 11 to relate the drift and α. The first range is F 11 ≥ n 01 /2. Then we have (even without bounding α, cf. Table 4) E(Δ | P 12 , F 00 = 0, F 11 ) ≤ P(S = ℓ1 ) · (−(n 01 /2)/3 + n 01 /6) ≤ 0.

A.4 Proof of Lemma 21
Lemma 21 is similar to Lemma 16, stating that the drift is non-positive if at least two 1-bits flip. Strangely enough, this holds for all i ≥ 2 except for i = n 10 − 1, where the drift is exponentially small in n 10 .