On the Choice of the Update Strength in Estimation-of-Distribution Algorithms and Ant Colony Optimization

Probabilistic model-building Genetic Algorithms (PMBGAs) are a class of metaheuristics that evolve probability distributions favoring optimal solutions in the underlying search space by repeatedly sampling from the distribution and updating it according to promising samples. We provide a rigorous runtime analysis concerning the update strength, a vital parameter in PMBGAs such as the step size 1/K in the so-called compact Genetic Algorithm (cGA) and the evaporation factor ρ in ant colony optimizers (ACO). While a large update strength is desirable for exploitation, there is a general trade-off: too strong updates can lead to unstable behavior and possibly poor performance. We demonstrate this trade-off for the cGA and a simple ACO algorithm on the well-known OneMax function.
More precisely, we obtain lower bounds on the expected runtime of Ω(K√n + n log n) and Ω(√n/ρ + n log n), respectively, suggesting that the update strength should be limited to 1/K, ρ = O(1/(√n log n)). In fact, when choosing 1/K, ρ ∼ 1/(√n log n), both algorithms efficiently optimize OneMax in expected time Θ(n log n). Our analyses provide new insights into the stochastic behavior of PMBGAs and propose new guidelines for setting the update strength in global optimization.


Introduction
The term probabilistic model-building Genetic Algorithms describes a class of algorithms that construct a probabilistic model which is used to generate new search points. The model is adapted using information about previous search points. Both estimation-of-distribution algorithms (EDAs) and swarm intelligence algorithms including ant colony optimizers (ACO) and particle swarm optimizers (PSO) fall into this class. These algorithms generally behave differently from evolutionary algorithms where a population of search points fully describes the current state of the algorithm.
EDAs like the compact Genetic Algorithm (cGA) and many ACO algorithms update their probabilistic models by sampling new solutions and then updating the model according to information about good solutions found. In this work we focus on pseudo-Boolean optimization (finding global optima in {0,1}^n, where n is the number of bits) and simple univariate probabilistic models, that is, for each bit there is a value p_i that determines the probability of setting the i-th bit to 1 in a newly created solution.
degrades. Understanding the working principles of the underlying probabilistic model remains an important open problem for both cGA and ACO algorithms. This is evident from the lack of reasonable lower bounds. The previous best known direct lower bound for MMAS algorithms for reasonable parameters was Ω((log n)/ρ − log n) [25, Theorem 5]; this bound holds for all functions with a unique global optimum. The best known lower bound for cGA on OneMax is Ω(K√n) [7]. There are more general bounds from black-box complexity theory [6, 8], showing that the expected runtime of comparison-based algorithms such as MMAS must be Ω(n) on OneMax. However, these black-box bounds do not yield direct insight into the stochastic behavior of the algorithms and do not shed light on the dependency of the algorithms' performance on the update strength.
In this paper, we study 2-MMAS_ib and cGA with a much more detailed analysis that provides such insights through rigorous runtime analysis. We prove lower bounds of Ω(K√n + n log n) and Ω(√n/ρ + n log n) on OneMax. The terms K√n and √n/ρ indicate that the runtime decreases when the update strength 1/K or ρ is increased. However, the added terms + n log n set a limit: there is no asymptotic decrease and hence no benefit for choosing update strengths 1/K or ρ growing faster than 1/(√n log n). The reason is that in this regime both algorithms suffer from a phenomenon well known in evolutionary biology and evolutionary computation as genetic drift: the probabilistic model attains extreme values simply due to the randomness of the sampling process, ignoring or overruling information about the quality of solutions. In our context, genetic drift leads to incorrect decisions being made. Correcting these incorrect decisions requires time Ω(n log n). These lower bounds hold in expectation and with high probability; hence, they accurately reflect the algorithms' typical performance.
We further show that these bounds are tight for 1/K, ρ ≤ 1/(c√n log n). In this parameter regime the impact of genetic drift is bounded and hence these parameter choices provably lead to the best asymptotic performance on OneMax for arbitrary problem sizes n.
The lower bounds formally apply to OneMax, but we believe that they also apply more generally to functions with few optima. Among all functions with a unique global optimum, the function OneMax is provably the easiest function for certain evolutionary algorithms (see [5] for a proof for the (1+1) EA and [30, 32] for extensions to populations), and similar results were shown for the cGA on linear functions by Droste [7]. We believe that the lower bounds give general performance limits for all functions with a unique global optimum. However, new arguments will be required to prove (or disprove) this formally.
From a technical point of view, our work uses a novel approach: using a second-order potential function to approximate the distribution of hitting times for a random walk that underlies changes in the probabilistic model. This approach has recently been picked up in [19] to analyze a different type of EDA, and we are confident that it will find further applications.
Finally, by pointing out similarities between cGA and 2-MMAS_ib, using the same analytical framework to understand changes in the probabilistic model, we make a step towards a unified theory of probabilistic model-building Genetic Algorithms.

This paper is structured as follows. Section 2 introduces the algorithms and Sect. 3 presents important analytical concepts. Section 4 proves efficient upper bounds for small update strengths, whereas Sect. 5 deals with the lower bounds for large update strengths. We finish with some conclusions.

Preliminaries
In the remainder, p_t = (p_{t,1}, …, p_{t,n}) denotes a vector of probabilities and x_t = (x_{t,1}, …, x_{t,n}), y_t = (y_{t,1}, …, y_{t,n}) denote search points from {0,1}^n. Hence p_{t,i} refers to the i-th entry of p_t and x_{t,i} refers to the i-th bit in x_t.
Algorithm 1: Compact Genetic Algorithm (cGA)

Our presentation of cGA follows Droste [7]; see also Friedrich et al. [12]. The parameter 1/K is called update strength (classically, K is called population size) and the p_{t,i} are called marginal probabilities. Pseudocode of cGA is shown in Algorithm 1. In each iteration, the cGA generates two search points according to the probabilistic model. Then the better solution is reinforced: if the two solutions differ on some bit i, the probabilistic model p_{t,i} is adjusted in the direction of the better solution, using a step size of 1/K. If the two solutions have equal values on bit i then p_{t,i} remains unchanged.
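The verbal description of cGA can be sketched in Python as follows. This is a minimal sketch and not a transcription of Algorithm 1: the function names, the iteration cap `max_iters`, the tie-breaking rule (the first sample is kept on equal fitness) and the clamping to the borders [1/n, 1 − 1/n] introduced later in this section are assumptions made for illustration.

```python
import random


def cga_update(p_i, x_i, y_i, K, n):
    """One cGA update of a single marginal; x is the better of the two samples."""
    if x_i > y_i:
        p_i += 1.0 / K      # better sample has a 1 here: move towards 1
    elif x_i < y_i:
        p_i -= 1.0 / K      # better sample has a 0 here: move towards 0
    # Keep the marginal inside the borders [1/n, 1 - 1/n].
    return min(1.0 - 1.0 / n, max(1.0 / n, p_i))


def cga_onemax(n, K, max_iters=10**6, rng=None):
    """Return the first iteration in which the all-ones string is sampled,
    or None if max_iters (an assumption of this sketch) is exhausted."""
    rng = rng or random.Random()
    p = [0.5] * n                                   # initial marginals
    for t in range(1, max_iters + 1):
        x = [int(rng.random() < p_i) for p_i in p]  # first sample
        y = [int(rng.random() < p_i) for p_i in p]  # second sample
        if sum(x) < sum(y):                         # OneMax: count ones
            x, y = y, x                             # x is now the better one
        if sum(x) == n:
            return t
        p = [cga_update(p[i], x[i], y[i], K, n) for i in range(n)]
    return None
```

Calling `cga_onemax(16, 64, rng=random.Random(1))` runs the sketch on a small instance with a fixed seed; the returned value counts iterations, each of which samples two search points.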
The simple MMAS algorithm 2-MMAS_ib, analyzed before in [26], is shown in Algorithm 2. Note that the two algorithms only differ in the update mechanism. In contrast to cGA, 2-MMAS_ib always changes the probabilistic model by either decreasing a value p_{t,i} to (1 − ρ)p_{t,i} or increasing it to (1 − ρ)p_{t,i} + ρ. Here ρ determines the strength of the update. In the context of ACO, the p_{t,i} are usually called pheromone values; however, we also refer to them as marginal probabilities to unify our approach to both algorithms.

Algorithm 2: 2-MMAS_ib
We note that the marginal probabilities for both algorithms are restricted to the interval [1/n, 1 − 1/n]. These bounds are used such that the algorithms always show a finite expected optimization time, as otherwise certain bits can be irreversibly fixed to 0 or 1. Our results also apply to algorithms without these borders: our analysis can be easily adapted to show that when the optimum is found efficiently in the presence of borders, it is found with high probability when borders are removed, and when the algorithm is inefficient, many bits are fixed opposite to the optimum.
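The update rule of 2-MMAS_ib together with the borders can be sketched in the same way. Again this is an illustrative sketch under assumptions, not a transcription of Algorithm 2: the names `mmas_update` and `mmas_onemax`, the cap `max_iters` and the tie-breaking rule are hypothetical choices.

```python
import random


def mmas_update(p_i, best_bit, rho, n):
    """One 2-MMAS_ib update of a single marginal towards the better sample's bit."""
    # (1 - rho) * p_i + rho if the bit is 1, (1 - rho) * p_i if it is 0.
    p_i = (1.0 - rho) * p_i + rho * best_bit
    # Keep the marginal inside the borders [1/n, 1 - 1/n].
    return min(1.0 - 1.0 / n, max(1.0 / n, p_i))


def mmas_onemax(n, rho, max_iters=10**6, rng=None):
    """Return the first iteration in which the all-ones string is sampled,
    or None if max_iters (an assumption of this sketch) is exhausted."""
    rng = rng or random.Random()
    p = [0.5] * n
    for t in range(1, max_iters + 1):
        x = [int(rng.random() < p_i) for p_i in p]
        y = [int(rng.random() < p_i) for p_i in p]
        if sum(x) < sum(y):
            x, y = y, x                 # x is the iteration-best sample
        if sum(x) == n:
            return t
        # Unlike cGA, every marginal changes in every iteration.
        p = [mmas_update(p[i], x[i], rho, n) for i in range(n)]
    return None
```

Compared with the cGA rule, the size of the change depends on the current value of the marginal, but it never exceeds ρ in absolute value.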
There are intriguing similarities in the definition of cGA and 2-MMAS_ib, despite these two algorithms coming from quite different strands of the natural computation community. As mentioned earlier, they only differ in the update mechanism: cGA uses a symmetrical update rule with 1/K as the amount of change and changes a marginal probability if and only if both offspring differ in the corresponding bit value. 2-MMAS_ib will always change a marginal probability in either positive or negative direction by a value dependent on its current state; however, the maximum absolute change will always be at most ρ. We are not the first to point out these similarities (e.g., see the survey by Hauschild and Pelikan [16], who embrace both algorithms under the umbrella of EDAs). However, our analyses will reveal the surprising insight that both cGA and 2-MMAS_ib have the same runtime behavior as well as the same optimal parameter set on OneMax and can be analyzed with almost the same techniques.
Several parts of our analysis will consider random variables X that follow the so-called Poisson-binomial distribution with probability vector (p_1, …, p_n). Then X is the sum of n Bernoulli trials with possibly different success probabilities p_i, 1 ≤ i ≤ n, i.e., X = X_1 + ⋯ + X_n, where X_i = 1 with probability p_i and X_i = 0 with probability 1 − p_i, independently for all trials. Note that the number of ones in the search points x_t and y_t sampled at time t by the cGA and 2-MMAS_ib follows the Poisson-binomial distribution with probability vector (p_{t,1}, …, p_{t,n}), which is why this distribution appears naturally in the analysis of OneMax. Section A.3 in the Appendix describes powerful bounds for such Poisson-binomially distributed random variables.
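For small n, the Poisson-binomial distribution can be computed exactly by adding one Bernoulli trial at a time. The following sketch is only meant to make the definition concrete; the function name is an assumption, and the bounds from Section A.3 are not reproduced here.

```python
def poisson_binomial_pmf(probs):
    """Exact pmf of X = X_1 + ... + X_n with independent X_i ~ Bernoulli(probs[i]).

    Returns a list pmf with pmf[k] = Prob(X = k) for k = 0, ..., n.
    """
    pmf = [1.0]                          # distribution of the empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # trial fails: sum stays at k
            nxt[k + 1] += mass * p       # trial succeeds: sum becomes k + 1
        pmf = nxt
    return pmf
```

The mean of the resulting distribution is the sum of the p_i and its variance is the sum of the p_i(1 − p_i), which is the quantity that governs how often two samples have nearly equal OneMax-values.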
In the remainder of the paper, "poly(n)" is used as a shorthand for "n^{O(1)}".

On the Dynamics of the Probabilistic Model
We first elaborate on the stochastic processes underlying the probabilistic model in both algorithms. These insights will then be used to prove upper runtime bounds for small update strengths in Sect. 4 and lower runtime bounds for large update strengths in Sect. 5.

We fix an arbitrary bit i and p_{t,i}, its marginal probability at time t. Note that p_{t,i} is a random variable, and so is its random change Δ_t := p_{t+1,i} − p_{t,i} in one step. This change depends on whether the value of bit i matters for the decision whether to update with respect to the first bit string x sampled in iteration t (using p_t as sampling distribution) or the second one y (cf. also [26]). More precisely, we inspect D_t := (|x| − x_i) − (|y| − y_i), where |·| denotes the number of ones, which is the difference in OneMax-value between x and y at bits other than i.
We assume p_{t,i} to be bounded away from the borders such that Δ_t is not affected by the borders. Then cGA experiences two different kinds of steps:

Random-walk steps If |D_t| ≥ 2 then bit i does not affect the decision whether to update with respect to x_t or y_t. For Δ_t ≠ 0 it is necessary that bit i is sampled differently. Hence, the p_{t,i}-value increases and decreases by 1/K with equal probability p_{t,i}(1 − p_{t,i}); with the remaining probability p_{t+1,i} = p_{t,i}. In this case, Δ_t can be described by a variable F_t where

F_t := +1/K with probability p_{t,i}(1 − p_{t,i}), −1/K with probability p_{t,i}(1 − p_{t,i}), and 0 with the remaining probability.

We call a step where |D_t| ≥ 2 a random-walk step (rw-step) since the process in such a step is a fair random walk (with self-loops), as E(Δ_t | p_{t,i}) = E(F_t | p_{t,i}) = 0. If D_t = 1 then x_{t+1} and y_{t+1} are never swapped in line 8 of cGA. Hence, the same argumentation as in the previous case applies and the process performs an rw-step as well.
Biased steps If D_t = −1 then x_{t+1} and y_{t+1} are swapped unless bit i is sampled to 1 in x_{t+1} and to 0 in y_{t+1}. Hence, both events of sampling bit i differently increase the p_{t,i}-value. We have Δ_t = 1/K with probability 2p_{t,i}(1 − p_{t,i}) and Δ_t = 0 otherwise.

If D_t = 0 then, as in the case D_t = −1, both events of sampling bit i differently increase the p_{t,i}-value. Hence, we again have Δ_t = 1/K with probability 2p_{t,i}(1 − p_{t,i}) and Δ_t = 0 otherwise. Let B_t be a random variable such that B_t := +1/K with probability 2p_{t,i}(1 − p_{t,i}), and 0 with the remaining probability.
Hence, in the cases D_t = −1 and D_t = 0 we get that Δ_t has the same distribution as B_t. We call such a step a biased step (b-step).

Whether a step is an rw-step or b-step for bit i depends only on circumstances being external to the bit (and independent of it). Let R_t be the event that D_t = 1 or |D_t| ≥ 2. We get the equality

Δ_t = F_t · 1_{R_t} + B_t · 1_{R̄_t},   (1)

where 1_A is the indicator of the event A and R̄_t is the complement of R_t. We denote this equality as superposition. Informally, the change of the p_{t,i}-value is a superposition of a fair (unbiased) random walk and biased steps. The fair random walk reflects the genetic drift underlying the process, i.e., the variance in the process may lead the algorithm to move in a random direction. In contrast, the biased steps reflect steps where the algorithm learns about which bit value leads to a better fitness at the considered bit position. We remark that the superposition of two different behaviors as formulated here is related to the approach taken in [2], where an EDA called UMDA was decomposed into a derandomized, deterministic EDA and a stochastic component modeling genetic drift.

For 2-MMAS_ib, structurally this kind of superposition holds as well; however, the underlying random variables look somewhat different.
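For cGA, the case distinction between rw-steps and b-steps can be checked by enumerating the four outcomes of bit i in the two samples. The sketch below assumes that D is the OneMax-difference of the first and second sample on the bits other than i, that the samples are swapped exactly when the second one is strictly better, and that borders play no role; the function names are hypothetical.

```python
from fractions import Fraction


def cga_delta_distribution(p, K, D):
    """Distribution of the change Δ_t of one marginal in a cGA step.

    p: marginal probability of bit i, K: inverse update strength,
    D: OneMax-difference (first minus second sample) on all other bits.
    Returns a dict mapping each possible change to its probability.
    """
    dist = {}
    for x_i in (0, 1):
        for y_i in (0, 1):
            prob = (p if x_i else 1 - p) * (p if y_i else 1 - p)
            diff = D + x_i - y_i                 # f(x) - f(y) on OneMax
            # Swap only if the second sample is strictly better.
            better, worse = (x_i, y_i) if diff >= 0 else (y_i, x_i)
            delta = Fraction(better - worse, K)  # +1/K, -1/K or 0
            dist[delta] = dist.get(delta, 0) + prob
    return dist


def expected_delta(dist):
    """Expected change for a distribution returned above."""
    return sum(delta * prob for delta, prob in dist.items())
```

With `p = Fraction(1, 3)` and `K = 10`, the cases D = 2 and D = 1 give the symmetric distribution of F_t with expectation 0, whereas D = 0 and D = −1 give the distribution of B_t with expectation 2p(1 − p)/K.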

Random-walk steps
If D_t = 1 or |D_t| ≥ 2 then the considered bit does not affect the choice whether to update with respect to x_t or y_t. Hence, the marginal probability of the considered bit increases with probability p_{t,i} and decreases with probability 1 − p_{t,i}.

We get that Δ_t = p_{t+1,i} − p_{t,i} is distributed as F_t in this case, where F_t is a random variable such that

F_t := ρ(1 − p_{t,i}) with probability p_{t,i}, and −ρ p_{t,i} with probability 1 − p_{t,i}.

We call such a step an rw-step in analogy to cGA as in expectation the current state does not change: E(F_t | p_{t,i}) = p_{t,i} · ρ(1 − p_{t,i}) − (1 − p_{t,i}) · ρ p_{t,i} = 0.

Biased steps If D_t = 0 or D_t = −1 then the marginal probability can only decrease if both offspring sample a 0 at bit i; otherwise it will increase. The difference Δ_t is a random variable B_t with

B_t := ρ(1 − p_{t,i}) with probability 1 − (1 − p_{t,i})^2, and −ρ p_{t,i} with probability (1 − p_{t,i})^2.

Altogether, the superposition for 2-MMAS_ib is also given by (1), with the modified meaning of B_t and F_t.
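As a worked step, the expected change in a b-step of 2-MMAS_ib follows directly from the update rule: the marginal p = p_{t,i} moves to (1 − ρ)p, a change of −ρp, only if both offspring sample a 0 (probability (1 − p)^2), and otherwise to (1 − ρ)p + ρ, a change of ρ(1 − p). The derivation below uses only these two facts; the shorthand p is introduced here for readability.

```latex
\begin{align*}
\mathrm{E}(B_t \mid p)
  &= \bigl(1-(1-p)^2\bigr)\,\rho(1-p) \;-\; (1-p)^2\,\rho p \\
  &= \rho(1-p)\,\bigl[\,1-(1-p)^2-p(1-p)\,\bigr] \\
  &= \rho(1-p)\,\bigl[\,1-(1-p)\bigl((1-p)+p\bigr)\bigr] \\
  &= \rho\,p\,(1-p).
\end{align*}
```

Up to the constant factor 2, this has the same form as the expected change 2p_{t,i}(1 − p_{t,i})/K in a b-step of cGA when ρ is identified with 1/K.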
The strength of the update plays a key role here: if the update is too strong, large steps are made during updates, and genetic drift through rw-steps may overwhelm the probabilistic model, leading to "wrong" decisions being made in individual bits. On the other hand, small updates imply that rw-steps have a bounded impact, and the algorithm receives more time to learn optimal bit values in b-steps. We will formalize these insights in the following sections en route to proving rigorous upper and lower runtime bounds. Informally, one main challenge is to understand the stochastic process induced by the mixture of b- and rw-steps.

Small Update Strengths are Efficient
We first show that small update strengths are efficient for OneMax. This has been shown for 2-MMAS_ib in [26].
Here we exploit the similarities between both algorithms to prove an analogous result for cGA.

Theorem 2 The expected optimization time of cGA on OneMax with K ≥ c√n log n for a sufficiently large c > 0 and K = poly(n) is O(√n K).
The analysis follows the approach for 2-MMAS_ib in [26], adapted to the different update rule, and using modern tools like variable drift analysis [17] and drift analysis with tail bounds [21]. We also extend previous work by showing in Sect. 4.1 that the upper bound for cGA holds with high probability (see Theorem 5 in Sect. 4.1). The main idea is that marginal probabilities are likely to increase from their initial values of 1/2. If the update strength is chosen small enough, the effect of genetic drift (as present in rw-steps) is bounded such that with high probability all bits never reach marginal probabilities below 1/3. Under this condition, we show that the marginal probabilities have a tendency (stochastic drift) to move to their upper borders, such that then the optimum is found with good probability.
The following lemma uses considerations and notation from Sect. 3 to establish a stochastic drift, i.e., a positive trend towards optimal bit values, for cGA. We use the same notation as in Sect. 3.
Proof The assumptions on p_{t,i} assure that p_{t+1,i} is not affected by the borders 1/n and 1 − 1/n. Then the expected change is given by the expectation of the superposition (1):

E(Δ_t | p_{t,i}) = E(F_t | p_{t,i}) · Prob(R_t) + E(B_t | p_{t,i}) · Prob(R̄_t).

From Sect. 3 we know E(F_t | p_{t,i}) = 0 and E(B_t | p_{t,i}) = 2p_{t,i}(1 − p_{t,i})/K. Further, , where the last inequality was shown in [26, proof of Lemma 1]. Here we exploit that cGA and 2-MMAS_ib use the same construction procedure. Together this proves the claim.
Note that the term √(Σ_{j≠i} p_{t,j}(1 − p_{t,j})) reflects the standard deviation of the sampling distribution on all bits j ≠ i.
Lemma 3 indicates that the drift increases with the update strength 1/K. However, a too large value for 1/K also increases genetic drift. The following lemma shows that, if 1/K is not too large, this positive drift implies that the marginal probabilities will generally move to higher values and are unlikely to decrease by a constant.

Lemma 4 Let 0 < α < β < 1 be two constants. For each constant γ > 0 there exists a constant c_γ > 0 (possibly depending on α, β, and γ) such that for a specific bit the following holds. If the bit has marginal probability at least β and K ≥ c_γ √n log n then the probability that during the following n^γ steps the marginal probability decreases below α is at most O(n^{−γ}).
Proof The proof uses a similar approach as the proof of Lemma 3 in [26], using 1/K instead of ρ and drift bounds from Lemma 3.
The aim is to apply the negative drift theorem, Theorem 20 in the Appendix, with respect to the stochastic process X_t := K p_{t,i}, obtained by scaling the process on the marginal probabilities of the considered bit i by a factor of K. Note that the X_t-process is on {K/n, 1, 2, …, K − 1, K − K/n, K}.
We use the interval [a, b] := [αK, βK] in the drift theorem. To establish the first condition of the drift theorem, we use Lemma 3. Hence, we obtain the following bound on the drift and estimating p_{t,j}(1 − p_{t,j}) ≤ 1/4 for all j and t.
For the second condition, we note that always |X_t − X_{t+1}| ≤ 1 since the marginal probabilities change by at most 1/K. Hence, the second condition is trivially satisfied by choosing r := 2.
To verify the third condition, we will use that K ≥ c_γ √n log n for a constant c_γ that may depend on α, β and γ. We compute, using b − a = (β − α)K and r, ε defined above, which is at least 4 if c_γ is chosen large enough but constant; here we use that α and β are constants in (0, 1). Then 1 ≤ r^2 ≤ ε(b − a)/(132 log(r/ε)) as demanded by the third condition. To finally apply the drift theorem, similar calculations as before yield that

With these lemmas, we now prove the main statement of this section.

Proof of Theorem 2
We assume in the following that 1/K divides 1/2 − 1/n, implying that marginal probabilities are restricted to {1/n, 1/n + 1/K, …, 1/2, …, 1 − 1/n − 1/K, 1 − 1/n}.

Following [26, Theorem 3] we show that, starting with a setting where all probabilities are at least 1/2 simultaneously, with probability Ω(1) after O(√n K) iterations either the global optimum has been found or at least one probability has dropped below 1/3. In the first case we speak of a success and in the latter case of a failure. The expected time until either a success or a failure happens is then O(√n K).

Now choose a constant γ > 0 such that n^γ ≥ K n^3. According to Lemma 4 applied with α := 1/3 and β := 1/2, the probability of a failure in n^γ iterations is at most n^{−γ}, provided the constant c in the condition K ≥ c√n log n is large enough. In case of a failure we wait until the probabilities simultaneously reach values at least 1/2 again and then we repeat the arguments from the preceding paragraph. It is easy to show (cf. Lemma 2 in [26]) that the expected time for one probability to reach the upper border is always bounded by O(n^{3/2} K), regardless of the initial probabilities. By standard arguments on independent phases, the expected time until all probabilities have reached their upper border at least once is O(n^{3/2} K log n). Once a bit reaches the upper border, we apply Lemma 4 again with α := 1/2 and β := 2/3 to show that the probability of a marginal probability decreasing below 1/2 in time n^γ is at most n^{−γ} (again, for large enough c). The probability that there is a bit for which this happens is at most n^{−γ+1} by the union bound. If this does not happen, all bits attain value at least 1/2 simultaneously, and we apply our above arguments again.
As the probability of a failure is at most n^{−γ+1}, the expected number of restarts is O(n^{−γ+1}), and considering the expected time until all bits recover to values at least 1/2 only leads to an additional term of n^{−γ+1} · O(n^{3/2} K log n) = o(1) in the expectation, using n^γ ≥ K n^3.

We only need to show that after O(√n K) iterations without failure the probability of having found the global optimum is Ω(1). To this end, we consider a simple potential function that takes into account marginal probabilities for all bits. An important property of the potential is that once the potential has decreased to some constant value, the probability of generating the global optimum is constant.

Let p_1, …, p_n be the current marginal probabilities and q_i := 1 − 1/n − p_i for all i. Define the potential function ϕ := Σ_{i=1}^{n} q_i, which measures the distance to an ideal setting where all probabilities attain their maximum 1 − 1/n. Let q′_i be the q_i-value in the next iteration and p′_i = 1 − 1/n − q′_i. We estimate the expectation of ϕ′ := Σ_{i=1}^{n} q′_i and distinguish between two cases. If We bound p_i(1 − p_i) from below using p_i ≥ 1/3 and 1 − p_i = q_i + 1/n and the sum from above using and p_i can only decrease. A decrease by 1/K happens with probability 1/n, thus

To ease the notation we assume w.l.o.g. that the bits are numbered according to decreasing probabilities, i.e., increasing q-values. Let m ∈ ℕ_0 be the largest index such that p_m = 1 − 1/n. Observe that by definition of the q_i we have Σ_{i=1}^{m} q_i = 0 and Σ_{i=m+1}^{n} q_i = ϕ. It follows Putting everything together, For ϕ ≥ 10000 this can further be bounded using where in the third inequality we used ϕ ≥ 10000 again.

We now apply the variable drift theorem (given by Theorem 18 in the Appendix) to bound the expected time for the potential ϕ to decrease from any initial value ϕ ≤ n to a value ϕ ≤ 10000. To this end, we use the drift function h(ϕ) := ϕ^{1/2}/(17K) as we just established that the expected change (drift) in one step is at least h(ϕ) for all ϕ ≥ 10000. Since Theorem 18 only considers the hitting time of state 0 and the condition on the drift needs to hold for all states larger than 0, we consider a modified process instead where we merge all states with potentials 0 < ϕ < 10000 with state 0: all steps reducing a potential of ϕ ≥ 10000 to a value smaller than 10000 yield a potential of 0.
In the modified process, the smallest state larger than 0 is x_min = 10000. The modification can only increase the drift, hence the drift is still bounded from below by h(ϕ) for all states ϕ ≥ x_min. Now Theorem 18 yields that the expected time to reach state 0 in the modified process, or, equivalently, any state ϕ < 10000 in the original process, is at most x_min/h(x_min) + ∫_{x_min}^{n} 1/h(ϕ) dϕ = O(√n K).

Consider an iteration where ϕ ≤ 10000. The probability of creating ones on all bits simultaneously, given that all marginal probabilities are at least 1/3, is minimal in the extreme setting where a maximal number of bits has marginal probabilities at 1/3 and all other bits, except at most one, have marginal probabilities at their upper border. Then the probability of creating the optimum in one step is at least . Hence a successful phase finds the optimum with probability Ω(1).
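The two quantities used in this argument, the potential and the probability of sampling the optimum, are easy to evaluate for a concrete vector of marginal probabilities. The following sketch only illustrates the definitions; the function names and the example vectors are assumptions, and the lower bound from the proof is not reproduced.

```python
import math


def potential(p):
    """Potential: sum over all bits of q_i = 1 - 1/n - p_i."""
    n = len(p)
    return sum(1.0 - 1.0 / n - p_i for p_i in p)


def optimum_probability(p):
    """Probability that one sample is the all-ones string (bits are independent)."""
    return math.prod(p)


n = 1000
at_border = [1.0 - 1.0 / n] * n      # all marginals at the upper border
few_low = [1.0 / 3] * 3 + [1.0 - 1.0 / n] * (n - 3)   # three marginals at 1/3

print(potential(at_border), optimum_probability(at_border))
print(potential(few_low), optimum_probability(few_low))
```

With all marginals at the upper border the potential is 0 and the optimum is sampled with probability (1 − 1/n)^n, which is close to 1/e. Every marginal that sits at 1/3 instead adds roughly 2/3 to the potential and costs a constant factor in the sampling probability, which is why a potential bounded by a constant leaves a constant probability of sampling the optimum.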

A Tail Bound on the Running Time
We further show that the upper bound from Theorem 2 holds with high probability.
Along with the lower tail bounds to be presented in Sect. 5, this demonstrates that the runtime of cGA is highly concentrated, and that we have developed a very good understanding of its performance and dynamic behaviour. In the following result, the failure probability can be made an arbitrarily small polynomial.
Theorem 5 For every κ > 0 there is a constant c = c(κ) such that the upper bound O(√n K) for the time of the cGA on OneMax from Theorem 2 holds with probability

Throughout this section we re-use the notation from the proof of Theorem 2, in particular the potential function ϕ and variables p_i and q_i.

We still consider the stochastic process w.r.t. the potential function ϕ from the proof of Theorem 2 and consider its drift. As done in said proof, we use that the probability that there exists a p_i whose value decreases below 1/3 in n^γ steps is at most n^{−γ+1} if the constant c in K ≥ c√n log n is chosen large enough. Note that we can make γ larger to decrease the probability of such a failure; however, this dictates what values of c are appropriate. In the following, we assume that the probability of such a failure is at most n^{−κ} and work under the assumption that no failure occurs.
To get a high-probability statement, we aim to apply drift analysis with tail bounds, stated as Theorem 19 in the Appendix. To this end, we have to bound the moment-generating function (mgf.) of (a stochastic upper bound on) the absolute value of where we use K′ = 17K to improve readability and x_min = 10000.
The following lemma gives a tail bound for the time to reach a potential of at most x min .
Lemma 6 Consider the potential ϕ and the drift function h(ϕ) := ϕ^{1/2}/(17K) as defined in the proof of Theorem 2, and assume that no p_i decreases below 1/3. Let T denote the random time for the potential to decrease below x_min = 10000 for the first time, when starting with an initial value of ϕ_0. Then for every t > 0, conditional on the potential always being bounded by a maximum value x_max,

Proof For the purpose of bounding the tail of the first hitting time for potentials below 10000 we again consider a modified process where states 0 < ϕ < 10000 are merged with state 0 (cf. proof of Theorem 18). The following calculations implicitly assume that ϕ_t ≥ 10000 as otherwise we have reached a potential below 10000.
We first note that always ϕ_{t+1} ≥ ϕ_t(1 − 1/K) ≥ ϕ_t/2. This holds since a step of cGA in the worst case increases all frequencies by 1/K (except for those at the upper border), which decreases each q_i by 1/K. Hence, we get  [14], see [23, p. 495] for a summary. Hence, We now bound the mgf. of Z. Looking up the mgf. of a binomial distribution, we obtain and using e^λ ≤ 1 + λ + λ^2 ≤ 1 + 2λ, we bound the last expression from above by for some constant c_2 > 0. Hence, using the variable drift theorem with tail bounds, Theorem 19 in the Appendix, we get for any δ > 0 and η ≤ min{λ, δλ^2/(D − 1 − λ)}, We note that √ϕ_t ≥ 100 if ϕ_t ≥ x_min = 10000. Hence, using our bound D, we satisfy sufficiently small, so that only the second argument of min{λ, δλ^2/(D − 1 − λ)} needs to be considered. We let δ := 1/2. We choose x_max for some constant c_3 to satisfy the requirements on λ and η. Substituting η and δ in (2) proves the claim.
Reaching a small potential is not sufficient to show that the optimum is found with high probability.We also need to show that the algorithm spends a sufficiently large number of steps at a small potential.The following lemma shows that, after having reached a potential of at most x min , the algorithm quickly returns to this regime.
Lemma 7 Consider the potential ϕ as defined in the proof of Theorem 2, where K ≥ c√n log n for a sufficiently large c > 0 and K = poly(n). Whenever ϕ_0 < 10000, the time R = min{t ≥ 1 | ϕ_t < 10000, ϕ_0 < 10000} to return to a potential below 10000 is at most K log^2 n with probability 1 − n^{−ω(1)}.
Proof We first show that with high probability the potential never rises beyond O(K ) in any polynomial number of steps.
Consider the p_i that are at the upper border initially. The probability that in one step more than log n variables move away from the upper border is at most n^{−ω(1)}. Assuming this never happens within the next K log^2 n steps, during this time at most K log^3 n bits move away from the upper border. As every bit can only increase the potential by 1/K in one step, these bits only contribute at most log^3 n to the potential.
All bits that are not at the upper border initially can contribute up to 1 to the potential each. However, as they contribute at least 1/K (the minimum distance to the upper border), the number of such bits is bounded by 10000K. Together, the potential is at most log^3 n + 10000K = O(K) with probability 1 − n^{−ω(1)} (as K log^2 n = poly(n)) throughout the first K log^2 n steps.

Now consider the potential ϕ_1 at time 1. If ϕ_1 < 10000, the return time is R = 1. Otherwise, by the same arguments as above, ϕ_1 ≤ 10000 + O(1) with probability 1 − n^{−ω(1)} as with this probability at most log n bits move away from the upper border, and at most 10000K bits that are away from the border initially only move by ±1/K in one step.
Applying Lemma 6 with an initial potential (denoted by ϕ_0 in Lemma 6 but corresponding to ϕ_1 in the time scale of the present lemma) of at most 10000 + O(1), t = K log^2 n, and x_max = log^3 n + 10000K = O(K) yields that the probability of not returning to a potential below 10000 in K log^2 n steps is at most Note that (still using the definition of h from the proof of Theorem 2), so that the probability under consideration is as claimed.
We now prove Theorem 5.

Proof of Theorem 5
Applying Lemma 4 as in the proof of Theorem 2, the probability of all p_i remaining above 1/3 all the time for n^{γ′} steps is at least 1 − n^{−γ′+1} ≥ 1 − n^{−κ}, where γ′ = max{γ, κ + 1} and γ is chosen as in the proof of Theorem 2.
The aim is to apply Lemma 6 with T* := x_min/h(x_min) + ∫_{x_min}^{n} 1/h(x) dx, t := 3T* and x_max = n. Note that T* just represents the upper bound O(K√n) on the expected value derived from variable drift in the proof of Theorem 2. This bound is at least Invoking the lemma yields for some constant c′ > 0. As K ≥ c√n ln n, this means that the time is at most 3T* = O(K√n) with probability at least 1 − e^{−cc′17 ln n}. This probability becomes at least 1 − n^{−κ} if c is chosen as a large enough constant.
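As a worked step, T* can be evaluated in closed form from the drift function h(x) = x^{1/2}/(17K) and x_min = 10000 given in the proof of Theorem 2. The calculation assumes n ≥ x_min so that the integral is over a non-empty range.

```latex
\begin{align*}
T^* &= \frac{x_{\min}}{h(x_{\min})} + \int_{x_{\min}}^{n} \frac{1}{h(x)}\,\mathrm{d}x
     = 17K\sqrt{x_{\min}} + 17K\int_{x_{\min}}^{n} x^{-1/2}\,\mathrm{d}x \\
    &= 17K\sqrt{x_{\min}} + 34K\bigl(\sqrt{n}-\sqrt{x_{\min}}\bigr)
     = 17K\bigl(2\sqrt{n}-100\bigr) = O(K\sqrt{n}).
\end{align*}
```

This matches the bound O(√n K) on the expected time obtained from the variable drift theorem.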
Whenever the potential is at most 10000, we have a probability of Ω(1) to create the optimum (see proof of Theorem 2). By Lemma 7, the algorithm with high probability returns to such a state within K log^2 n steps. Applying these arguments log^2 n times (and considering failure probabilities for log^2 n applications of Lemma 7), the probability that after K log^4 n steps the optimum has not been found is n^{−ω(1)}.
Adding up all failure probabilities yields the claimed result.

Large Update Strengths Lead to Genetic Drift
The bound $O(\sqrt{n}K)$ from Theorem 2 shows that larger update strengths (i.e., smaller K) result in smaller bounds on the runtime. However, the theorem requires that $K \ge c\sqrt{n}\log n$, so that the best possible choice results in O(n log n) runtime. An obvious question to ask is whether this is only a weakness of the analysis or whether there is an intrinsic limit that prevents smaller choices of K from being efficient.
In this section, we will show that smaller choices of K (i.e., larger update strengths) cannot give runtimes of lower order than n log n. In a nutshell, even though larger update strengths support faster exploitation of correct decisions at single bits by quickly reinforcing promising bit values, they also increase the risk of genetic drift reinforcing incorrectly made decisions at single bits too quickly. Then it typically happens that several marginal probabilities reach their lower border 1/n, from which it (due to so-called coupon collector effects) takes Ω(n log n) steps to "unlearn" the wrong settings. The very same effect happens with 2-MMAS_ib if its update strength ρ is chosen too large.
We now state the lower bounds we obtain for the two algorithms, see Theorems 8 and 9 below. Note that the statements are identical if we identify the update strength 1/K of cGA with the update strength ρ of 2-MMAS_ib. Also the proofs of these two theorems will largely follow the same steps. Therefore, we describe the proof approach in detail with respect to cGA in Sect. 5.1. In Sect. 5.2, we describe the few places where slightly different arguments are needed to obtain the result for 2-MMAS_ib.

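Theorem 8 The optimization time of cGA with K … is $\Omega(K\sqrt{n} + n\log n)$ … and in expectation.

Theorem 9 The optimization time of 2-MMAS_ib … is $\Omega(\sqrt{n}/\rho + n\log n)$ … and in expectation.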
We first describe at an intuitive level why large update strengths can be risky. In the upper bounds from Theorems 1 and 2, we have shown that for sufficiently small update strengths, the positive stochastic drift by b-steps is strong enough such that, even in the presence of rw-steps, no bit ever reaches a marginal probability below 1/3, with high probability. Then no "incorrect" decision is made.
With larger update strengths than 1/( √ n log n) the effect of rw-steps is strong enough such that with high probability some bits will make an incorrect decision and reach the lower borders of marginal probabilities.
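The effect described above can be observed in a small simulation. The following is a minimal sketch, assuming the standard cGA formulation (two samples per iteration, marginals moved by ±1/K towards the better sample and capped to [1/n, 1 − 1/n]); the function name `cga_onemax`, the iteration cap and the returned set `hit_lower` are illustrative choices and not taken from the paper.

```python
import random

def cga_onemax(n, K, rng, max_iters=200_000):
    """Run a textbook cGA on OneMax.

    Returns (iterations, marginals, hit_lower), where hit_lower is the set of
    bits whose marginal probability touched the lower border 1/n at least once.
    """
    lo, hi = 1.0 / n, 1.0 - 1.0 / n
    p = [0.5] * n
    hit_lower = set()
    for t in range(1, max_iters + 1):
        x = [1 if rng.random() < q else 0 for q in p]
        y = [1 if rng.random() < q else 0 for q in p]
        if sum(x) < sum(y):            # make x the better of the two samples
            x, y = y, x
        for i in range(n):
            if x[i] != y[i]:           # only differing bits are updated
                step = 1.0 / K if x[i] == 1 else -1.0 / K
                p[i] = min(hi, max(lo, p[i] + step))
                if p[i] <= lo:
                    hit_lower.add(i)
        if sum(x) == n:                # optimum sampled
            return t, p, hit_lower
    return max_iters, p, hit_lower
```

Running this for a fixed n once with K well below √n log n and once with K above it, and comparing the size of `hit_lower`, gives a rough picture of the genetic drift discussed here. Small experiments of this kind can of course only illustrate, not confirm, the asymptotic statements.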
More specifically, the lower bounds of Ω(n log n) in Theorems 8 and 9 will be established from the following arguments. We show that many marginal probabilities will remain close to their initial values during the early stages of a run (Lemmas 13 and 15). This then implies that b-steps will be rare (Lemma 12) throughout this time, and thus genetic drift dominates. Through a detailed analysis of the distribution of first hitting times in rw-steps we show that then some marginal probabilities will hit the lower border (Lemmas 10 and 16). Finally, we show that once sufficiently many marginal probabilities have reached the lower border, this implies a lower bound of Ω(n log n) as claimed (Lemma 14).

Proof of Lower Bound for cGA
We start with a detailed analysis of the hitting time for a marginal probability to reach the lower border 1/n and of the distribution of hitting times.
To illustrate this setting, fix one bit and imagine that all steps were rw-steps (we will explain later how to handle b-steps), and that all rw-steps change the current value of the bit's marginal probability (i.e., there are no self-loops). Then the process would be a fair random walk on {0, 1/K, 2/K, ..., (K − 1)/K, 1}, started at 1/2. This fair random walk is well understood (see, e.g., Chapter 14.3 in [9]) and it is well known that the hitting time is not sharply concentrated around the expectation. More precisely, there is still a probability that is only polynomially small in K of hitting a border within at most $O(K^2/\log K)$ steps, and also of needing at least $\Omega(K^2 \log K)$ steps. The underlying idea is that the central limit theorem (CLT) approximates the progress within a given number of steps.
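The lack of concentration can be made visible empirically. The sketch below is an illustration under simplifying assumptions: it works on the scaled state space {0, ..., K}, treats 0 and K as the borders, and uses the hypothetical helper names `fair_walk_hitting_time` and `hitting_time_summary`. It samples hitting times of the fair ±1 walk started at K/2 and reports the mean together with the 5% and 95% quantiles.

```python
import random
import statistics

def fair_walk_hitting_time(K, rng):
    """Number of steps until a fair +-1 walk started at K//2 first hits 0 or K."""
    pos, steps = K // 2, 0
    while 0 < pos < K:
        pos += 1 if rng.random() < 0.5 else -1
        steps += 1
    return steps

def hitting_time_summary(K, trials, seed=0):
    """Empirical mean and 5%/95% quantiles of the hitting time."""
    rng = random.Random(seed)
    times = sorted(fair_walk_hitting_time(K, rng) for _ in range(trials))
    return {
        "mean": statistics.mean(times),
        "q05": times[int(0.05 * trials)],
        "q95": times[int(0.95 * trials)],
    }
```

For the walk without self-loops the expected hitting time from the centre is K²/4, so the quantiles can be read relative to that value; they typically differ from the mean by constant factors rather than by lower-order terms.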
The real process is more complicated because of self-loops. Recall from the definition of $F_t$ that the process only changes its current state by ±1/K with probability $2p_{t,i}(1 - p_{t,i})$; hence with probability $1 - 2p_{t,i}(1 - p_{t,i})$ a self-loop occurs on this bit. The closer the process is to one of its borders {1/n, 1 − 1/n}, the larger the self-loop probability becomes and the more the random walk slows down. Hence the actual process is clearly slower in reaching a border since every looping step is just wasted. One might conjecture that the self-loops will asymptotically increase the expected hitting time. But interestingly, as we will show, the expected hitting time in the presence of self-loops is still of order $\Theta(K^2)$. Also the CLT (in a generalized form) is still applicable despite the self-loops, leading to a similar distribution as above.
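That the self-loops change the expected hitting time only by a constant factor can be checked numerically. The following sketch is not part of the paper's proof. It assumes the scaled state space {0, ..., K} with absorbing states 0 and K (a simplification of the borders 1/n and 1 − 1/n) and solves the first-step equations for the expected hitting time exactly; `expected_hitting_time` and `move_prob` are illustrative names.

```python
import math

def expected_hitting_time(K, move_prob):
    """Exact expected time to hit 0 or K from K//2 for a lazy fair walk.

    move_prob(x) is the probability of leaving state x in one step; a move goes
    to x-1 or x+1 with equal probability. The first-step equations read
    -E[x-1] + 2*E[x] - E[x+1] = 2/move_prob(x) with E[0] = E[K] = 0 and are
    solved with the Thomas algorithm for tridiagonal systems.
    """
    m = K - 1
    rhs = [2.0 / move_prob(x) for x in range(1, K)]
    c = [0.0] * m                      # modified super-diagonal
    d = [0.0] * m                      # modified right-hand side
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    E = [0.0] * m
    E[-1] = d[-1]
    for i in range(m - 2, -1, -1):     # back substitution
        E[i] = d[i] - c[i] * E[i + 1]
    return E[K // 2 - 1]               # state K//2 is stored at index K//2 - 1
```

With `move_prob = lambda x: 1.0` this reproduces the classical value K²/4. With `move_prob = lambda x: 2 * (x / K) * (1 - x / K)` the result stays within a constant factor of K², in line with the Θ(K²) claim above.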
The distribution of the hitting time of the random walk with self-loops will be analyzed in Lemma 10 below. In order to deal with self-loops, in its proof, we use a potential function mapping the actual process to a process on a scaled state space with nearly position-independent variance. Unlike the typical applications of potential functions in drift analysis, the purpose of the potential function is not to establish a position-independent first-moment stochastic drift but a (nearly) position-independent variance, i.e., the potential function is designed to analyze a second moment. This argument seems to be new in the theory of drift analysis and may be of independent interest.

Lemma 10
Consider a bit of cGA on OneMax and let $p_t$ be its marginal probability at time t. Let $t_1, t_2, \dots$ be the times where cGA performs an rw-step (before hitting one of the borders 1/n or 1 − 1/n) and let $\Delta_i := p_{t_i+1} - p_{t_i}$. For $s \in \mathbb{R}$, let $T_s$ be the smallest t such that $\mathrm{sgn}(s) \sum_{i=0}^{t} \Delta_i \ge |s|$. Choosing 0 < α < 1, where 1/α = o(K), and −1 ≤ s < 0 constant, we have
$$P\bigl(T_s \le \alpha(sK)^2 \text{ or } p_t \text{ exceeds 5/6 or reaches 1/n before } t_{T_s}\bigr) \ge \dots$$
Moreover, for any α > 0 and $s \in \mathbb{R}$,
$$P\bigl(T_s \ge \alpha(sK)^2\bigr) \ge 1 - e^{-1/(4\alpha)}.$$

Informally, the lemma means that every deviation of the hitting time $T_s$ by a constant factor from its expected value (which turns out as $\Theta(s^2K^2)$) still has constant probability, and even deviations by logarithmic factors have a polynomially small probability. We will mostly apply the lemma for α < 1, especially α ≈ 1/log n, to show that there are marginal probabilities that quickly approach the lower border; in fact, this effect implies that the smallest admissible value $K \sim \sqrt{n}\log n$ in Theorem 2 necessarily involves a log n-term. Note that the second statement of the lemma also holds for α ≥ 1; however, in this realm also Markov's inequality works. Then, by the inequality $e^{-x} \le 1 - x/2$ for 0 ≤ x ≤ 1, applied with x = 1/(4α), we get $P(T_s \ge \alpha(sK)^2) \ge 1/(8\alpha)$, which means that Markov's inequality for deviations above the expected value is asymptotically tight in this case.
We start with the proof of the second statement, which can be obtained by a relatively straightforward analysis of a fair random walk.

Proof of Lemma 10, 2nd statement
Throughout this proof, to ease notation we consider the scaled process on the state space S := {0, 1, ..., K} obtained by multiplying all marginal probabilities by K; the random variables $X_t = K p_t$ will live on this scaled space. Note that we also remove the borders (K/n and K − K/n), which is possible as all considerations are stopped when such a border is reached. For the same reason, we only consider current states from {1, ..., K − 1} in the remainder of this proof.
The first hitting time $T_s$ becomes only stochastically larger if we ignore all self-loops. Formally, recalling the trivial scaling of the state space, we consider the fair random walk where P converges in distribution to a standard normally distributed random variable (see, e.g., Chapter 10 in [9]). However, we do not use this fact directly here. Instead, to bound the deviation from the expectation, we use a classical Hoeffding bound. We assume s ≥ 0 now and will see that the case s < 0 can be handled symmetrically.
Theorem 1.11 in [4] yields, with $c_i = 2$ as the size of the support of $\Delta_i$, that Moreover, according to Theorem 1.13 in [4], the bound also holds for all $k \le \alpha s^2 K^2$ together, more precisely, Symmetrically, we obtain Hence, a distance that is strictly smaller than sK is bridged through $\alpha(sK)^2$ rw-steps (or the process reaches a border before) with probability at least $1 - e^{-1/(4\alpha)}$.
To illustrate the main idea for the proof of the first statement of Lemma 10, we ignore b-steps for a while and recall that we are confronted with a fair random walk then. However, the random walk is not homogeneous with respect to place as the self-loops slow the process down in the vicinity of a border. Unlike the classical fair random walk, the random variables describing the change of position from time t to time t + 1 (formally, $\Delta_t := p_{t+1} - p_t$) are not identically distributed. In fact, the variance of $\Delta_t$ becomes smaller the closer $p_t$ is to one of the borders.
In more detail, the potential function used in the proof of Lemma 10 will essentially use the self-loop probabilities to construct extra distances to bridge. For instance, states with low self-loop probability (e.g., 1/2) will have a potential that is only by Θ(1) larger or smaller than the potential of their neighbors. On the other hand, states with a large self-loop probability, say 1 − 1/K, will have a potential that can differ by as much as $2\sqrt{K}$ from the potential of their neighbors. Interestingly, this choice leads to variances of the one-step changes that are basically the same on the whole state space (very roughly, this is true since the squared change $(2\sqrt{K})^2 = \Theta(K)$ is observed with probability Θ(1/K)). However, using the potential for this trick is at the expense of changing the support of the underlying random variables, which then will depend on the state. Nevertheless, as the support is not changed too much, the Central Limit Theorem (CLT) still applies and we can approximate the progress made within T steps by a normally distributed random variable. This approximation is made precise in the following lemma, along with a bound on the absolute error.
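Lemma 11 (CLT with Lyapunov condition, Berry-Esseen inequality [10], p. 544). Let $X_1, \dots, X_m$ be a sequence of independent random variables, each with finite expected value $\mu_i$ and variance $\sigma_i^2$. Define
$$s_m^2 := \sum_{i=1}^{m} \sigma_i^2 \quad\text{and}\quad C_m := \frac{1}{s_m} \sum_{i=1}^{m} (X_i - \mu_i).$$
If there exists a δ > 0 such that
$$\lim_{m\to\infty} \frac{1}{s_m^{2+\delta}} \sum_{i=1}^{m} E\bigl(|X_i - \mu_i|^{2+\delta}\bigr) = 0$$
(assuming all the moments of order 2 + δ to be defined), then $C_m$ converges in distribution to a standard normally distributed random variable. Moreover, the approximation error is bounded as follows: for all $x \in \mathbb{R}$,
$$\bigl|P(C_m \le x) - \Phi(x)\bigr| \le C \cdot \frac{\sum_{i=1}^{m} E\bigl(|X_i - \mu_i|^{3}\bigr)}{s_m^{3}},$$
where C is an absolute constant and Φ(x) denotes the cumulative distribution function of the standard normal distribution.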
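To get a feeling for the size of the approximation error in such a statement, one can compare the exact distribution of a sum of independent Bernoulli variables with the normal distribution. The sketch below is only an illustration with made-up inputs. It computes the largest distance between the normalized sum's distribution function and Φ, together with the third-moment ratio that appears in the Lyapunov condition for δ = 1. The helper names are hypothetical, and the constant 0.56 mentioned afterwards is a known admissible value for the absolute constant in the Berry-Esseen inequality for non-identically distributed summands, not a value taken from the paper.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def berry_esseen_check(probs):
    """Return (sup-distance to N(0,1), third-moment ratio) for a Bernoulli sum."""
    pmf = [1.0]
    for p in probs:                       # exact law of the sum by convolution
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    mu = sum(probs)
    s = math.sqrt(sum(p * (1 - p) for p in probs))
    # E|X - p|^3 for a Bernoulli(p) variable X
    third = sum(p * (1 - p) * (p * p + (1 - p) ** 2) for p in probs)
    sup_err, cdf = 0.0, 0.0
    for k, mass in enumerate(pmf):
        phi = normal_cdf((k - mu) / s)
        sup_err = max(sup_err, abs(cdf - phi))   # left limit at the atom k
        cdf += mass
        sup_err = max(sup_err, abs(cdf - phi))   # value at the atom k
    return sup_err, third / s ** 3
```

For a few hundred summands with success probabilities bounded away from 0 and 1, the observed distance is of the same order as the ratio and stays below 0.56 times the ratio.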
We now turn to the formal proof of the outstanding 1st statement of Lemma 10.

Proof of Lemma 10, 1st statement
As in the proof of the 2nd statement of Lemma 10 above, we consider the scaled search space {1, ..., K − 1}. Here we will essentially use an approximation of the accumulated state within $\alpha s^2 K^2$ steps by the normal distribution, but have to be careful to take into account steps describing self-loops.
To analyze the hitting time T s for the X t i -process, we now define a potential function g : S → R. Unlike the typical applications of potential functions, the purpose of g is not to establish a position-independent first-moment drift (in fact, there is no drift within S since the original process is a martingale) but a (nearly) position-independent variance, i. e., the potential function is designed to analyze a second moment.
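The idea of using a potential to equalize the variance can be illustrated with a small computation. The sketch below uses a hypothetical potential whose increments between neighbouring states x and x + 1 are $\sqrt{K/(x+1)}$ on the lower half of the state space, extended point-symmetrically around K/2. This choice is an assumption made for illustration only; the exact definition and constants of the paper's function g are not reproduced here. The code compares the one-step variance of the raw process with that of the transformed process.

```python
import math

def potential(K):
    """Hypothetical potential on {0,...,K} (K even): decreasing, g(K/2) = 0,
    point-symmetric around K/2, with g(x) - g(x+1) = sqrt(K/(x+1)) for x < K/2."""
    half = K // 2
    g = [0.0] * (K + 1)
    for x in range(half - 1, -1, -1):
        g[x] = g[x + 1] + math.sqrt(K / (x + 1))
    for x in range(half + 1, K + 1):
        g[x] = -g[K - x]
    return g

def one_step_variances(K):
    """One-step variances of X and of g(X) in the states 1, ..., K-1."""
    g = potential(K)
    raw, scaled = [], []
    for x in range(1, K):
        p = x / K
        q = 2 * p * (1 - p)                  # probability of a non-loop step
        up, down = g[x + 1] - g[x], g[x - 1] - g[x]
        mean = 0.5 * q * (up + down)
        second = 0.5 * q * (up * up + down * down)
        raw.append(q)                        # +-1 steps: the variance equals q
        scaled.append(second - mean * mean)
    return raw, scaled
```

For K = 100 the raw variance varies by a factor of about 25 between the centre and the states next to the borders, whereas the variance of the transformed process stays within a small constant range.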

Potential function
We proceed with the formal definition of the potential function, the analysis of its expected first-moment change and the corresponding variance, and a proof that the Lyapunov condition holds for the accumulated change within $\alpha s^2 K^2$ steps. The potential function g is monotonically decreasing on {1, ..., K/2} and centrally symmetric around K/2. We define it as follows: Inductively, we have for 1 where the second equality holds since the sum is telescoping. We also note that g(0) = O(K), more precisely it holds that where the first inequality used $\sqrt{1/j}$ as a lower sum of the integral. More generally, using the monotonicity of g and the same kind of estimations as before, we obtain for i < j ≤ K/2 that Informally, the potential function stretches the whole state space by a factor of at most 4, but adjacent states in the vicinity of borders can be $2\sqrt{K}$ apart in potential. Let $Y_t := g(X_t)$. We consider the one-step differences $\Psi_i := Y_{t_i+1} - Y_{t_i}$ at the times i where rw-steps occur, and we will show via the representation $Y_{t_i} := \sum_{j=0}^{i-1} \Psi_j$ that $Y_{t_i}$ approaches a normally distributed variable. Note that $Y_{t_i}$ is not necessarily the same as $g(X_{t_i}) - g(X_{t_0})$ since only the effect of rw-steps is covered by $Y_{t_i}$.
In the following, we assume $1 \le X_{t_i} \le K/2$ and note that the case $X_{t_i} > K/2$ can be handled symmetrically with respect to $-\Psi_i$. We proceed with the announced analysis of different moments of $\Psi_i$.

Analysis of expected change of potential
We claim that for all i
$$0 \le E(\Psi_i \mid X_{t_i}) = o(1), \qquad (7)$$
where the o-notation is with respect to K. The lower bound $E(\Psi_i \mid X_{t_i}) \ge 0$ is easy to see since $X_{t_i}$ is a fair random walk and g(j − 1) − g(j) ≥ g(j) − g(j + 1) holds for all j ≤ K/2. To prove the upper bound, we note that Using the properties of rw-steps, we have that P . Moreover, on $Y_{t_i+1} \ne Y_{t_i}$, $Y_{t_i+1}$ takes each of the two values $g(X_{t_i} - 1)$ and $g(X_{t_i} + 1)$ with the same probability. Hence where the last equality used (4).

We estimate the bracketed terms using 1 where the penultimate inequality exploited that $f(x + h) - f(x) \le h f'(x)$ for any concave, differentiable function f and h ≥ 0; here using $f(x) = \sqrt{x}$ and h = 1. Altogether, which proves (7) since $X_{t_i} \ge 1$ and K = ω(1).

Analysis of the variance of the change of potential
We claim that for all i
$$\mathrm{Var}(\Psi_i \mid X_{t_i}) \ge 1/4.$$
To show this, note that 2K .Moreover, Y t i +1 < Y t i implies that X t i +1 = X t i + 1 since g is monotone decreasing on {1, . . ., K /2} and the X t i -value can change by either −1, 0, or 1.Hence, if where we used X t i /(X t i + 1) ≥ 1/2.This proves the lower bound on the variance.
Approximating the accumulated change of potential by a Normal distribution
We are almost ready to prove that $Y_{t_i} := \sum_{j=0}^{i-1} \Psi_j$ can be approximated by a normally distributed random variable for sufficiently large i. We denote by $s_i^2 := \sum_{j=0}^{i-1} \mathrm{Var}(\Psi_j \mid X_{t_j})$ and note that $s_i^2 \ge i/4$ by our analysis of variance from above. The so-called Lyapunov condition, which is sufficient for convergence to the normal distribution (see Lemma 11), requires the existence of some δ > 0 such that
$$\lim_{i\to\infty} \frac{1}{s_i^{2+\delta}} \sum_{j=0}^{i-1} E\bigl(|\Psi_j - E(\Psi_j \mid X_{t_j})|^{2+\delta} \mid X_{t_j}\bigr) = 0. \qquad (9)$$
We will show that the condition is satisfied for δ = 1 (smaller values could be used but do not give any benefit) and i = ω(K) (which, as $i = \alpha s^2 K^2$, holds due to our assumptions 1/α = o(K) and |s| = Ω(1)). We argue that where we have used the bound on $|E(\Psi_i \mid X_{t_i})|$ from (7). As the $X_{t_i}$-value can only change by {−1, 0, 1}, we get, by summing up all possible changes of the g-value, that for K large enough. Hence, plugging this in the Lyapunov condition (9) for δ = 1, we obtain which goes to 0 as i = ω(K). Hence, for the value $i := \alpha s^2 K^2$ considered in the lemma we obtain that
$$\frac{Y_{t_i} - E(Y_{t_i})}{s_i} \qquad (11)$$
converges in distribution to N(0, 1) according to Lemma 11. The absolute error of this approximation is also $O(\sqrt{K/i})$ by reusing (10).
Estimating the accumulated progress
Recall that our aim is to show that the event $\sum_{j=0}^{i-1} \Delta_j \le s$ (where s is negative and $i = \alpha s^2 K^2$) happens with at least the probability stated in the lemma. Since we analyzed the change of the potential function g, we establish a sufficient increase of the g-value (corresponding to a decrease of marginal probability) that implies $\sum_{j=0}^{i-1} \Delta_j \le s$. By (6), we know that $g(X_{t_i}) - g(X_0) \ge 2\sqrt{|s|}\,K$ implies $X_{t_i} - X_0 \le sK < 0$ and therefore also $\sum_{j=0}^{i-1} \Delta_j \le s$. Hence, in the following it suffices to study the event $g(X_{t_i}) - g(X_0) \ge 2\sqrt{|s|}\,K$ and to show that it happens with the required probability.

As already mentioned, the random variable Y t i denotes the accumulated progress (in terms of g-value) due to rw-steps up to time t i .To show that Y t i is at least 2 √ |s|K with the claimed probability bounds, we exploit the above-established property that (11) converges in distribution to N (0, 1).Hence, we need to estimate the variance s i and the expected value E Y t i .
Note that $s_i^2 \ge \alpha s^2 K^2/4$ by our analysis of variance above and therefore $s_i \ge \sqrt{\alpha}|s|K/2$. We have to be more careful when computing $E(Y_{t_i})$ since $E(\Psi_i \mid X_{t_i})$ is negative for $X_{t_i} > K/2$. Note, however, that considerations are stopped when the marginal probability exceeds 5/6, i.e., when $X_{t_i} > 5K/6$. Using (7), we hence have that E We study the event $Y_{t_i} \ge rK$ for general r ≥ 0, which is equivalent to (11) was really N(0, 1)-distributed, the probability of the event would be Φ(r where Φ denotes the cumulative distribution function of the standard normal distribution. Taking into account the approximation error $O(\sqrt{K/i})$ computed above and plugging in our estimates for expected value and variance, we altogether have that for any r leading to a positive argument of Φ, Using $r = 3\sqrt{|s|}$ in (13), we compute Using Lemma 21 (in the Appendix) we can now bound the term $1 - \Phi\bigl(r/(|s|\sqrt{\alpha/4}) + 3.1|s|\sqrt{\alpha}\bigr)$ from (13) below and obtain using |s| ≤ 1 and α ≤ 1. This means that distance sK (in negative direction) is bridged by the rw-steps before or at time $t_i$, where $i = \alpha s^2 K^2$, with probability at least p(α, s) ), where the O-term is the bound on the approximation error computed above. Undoing the scaling of the state space introduced at the beginning of this proof, this corresponds to an accumulated change of the actual state of cGA in rw-steps by s; more formally, $\sum_{i=0}^{t} \Delta_i \le s$ in terms of the original state space. This establishes also the first statement of the lemma and completes the proof.
As rw-steps are interleaved with b-steps, Lemma 10 alone is not sufficient to analyze the overall movement of a marginal probability. We also require a bounded number of b-steps within a given period of time. To establish this, we first show that, during the early stages of a run, the probability of a b-step is only $O(1/\sqrt{n})$. Intuitively, during early stages of the run many bits will have marginal probabilities in the interval [1/6, 5/6]. Then the standard deviation of the sampled OneMax-value is of order $\Theta(\sqrt{n})$, and the probability of a b-step is $1 - P[R_t] = O(1/\sqrt{n})$. The link between $1 - P[R_t]$ and the standard deviation already appeared in Lemma 3 above; roughly, it says that every step is a b-step for bit i with probability at least $\bigl(\sum_{j \ne i} p_j(1 - p_j)\bigr)^{-1/2}$, which is the reciprocal of the standard deviation in terms of the other bits.
The following Lemma 12 represents a kind of counterpart of Lemma 3, but here we seek an upper bound on 1 − P[R t ].

Lemma 12
Assume that at time t there are γn bits, for some constant γ > 0, whose marginal probabilities are within [1/6, 5/6]. Then the probability of having a b-step on any fixed bit position is $O(1/\sqrt{n})$, regardless of the decisions made in this step on all other n − γn − 1 bits.
Proof We know from our earlier discussion that a b-step at bit i requires $D_t \in \{-1, 0\}$, where $D_t$ is the change of the OneMax-value at bits other than i in the two solutions $x_t$ and $y_t$ sampled at time t.
We apply the principle of deferred decisions and fix all decisions for creating $x_t$ as well as decisions for $y_t$ on all but the m := γn selected bits with marginal probabilities in [1/6, 5/6]. Let $p_1, p_2, \dots, p_m$ denote the corresponding marginal probabilities after renumbering these bits, and let S denote the random number of these bits set to 1. Note that there are at most 2 values for S which lead to the algorithm making a b-step.
Since S is a sum of independent Bernoulli trials with success probabilities $p_1, \dots, p_m \in [1/6, 5/6]$, Theorem 22 in the Appendix implies that the probability of S attaining any particular value is $O(1/\sqrt{m}) = O(1/\sqrt{n})$. Taking the union bound over 2 values proves the claim.
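The anti-concentration step can be checked numerically. The following sketch does not reproduce Theorem 22; it simply computes the exact distribution of a sum of independent Bernoulli variables by dynamic programming and multiplies its largest point mass by the standard deviation. The function names are illustrative.

```python
import math

def poisson_binomial_pmf(probs):
    """Exact distribution of a sum of independent Bernoulli(p_j) variables."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)       # bit sampled as 0
            nxt[k + 1] += mass * p         # bit sampled as 1
        pmf = nxt
    return pmf

def max_point_mass_times_sigma(probs):
    """Largest point probability of the sum, scaled by its standard deviation."""
    sigma = math.sqrt(sum(p * (1 - p) for p in probs))
    return max(poisson_binomial_pmf(probs)) * sigma
```

If all success probabilities lie in [1/6, 5/6], the standard deviation is of order $\sqrt{m}$, so a bounded value of this product means that every single outcome of S has probability $O(1/\sqrt{m})$.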
Even though one main aim is to show that rw-steps make certain marginal probabilities reach their lower border, we will also ensure that with high probability, Ω(n) marginal probabilities do not move by too much, resulting in a large sampling variance and a small probability of b-steps. The following lemma serves this purpose. Its proof is a straightforward application of Hoeffding's inequality since it is pessimistic here to ignore the self-loops.

Lemma 13
For any bit, with probability Ω(1) for any $t \le \kappa K^2$, κ > 0 a small enough constant, the first t rw-steps lead to a total change of the bit's marginal probability within [−1/6, 1/6]. This fact holds independently of all other bits. The probability that the above holds for less than γn bits amongst the first n/2 bits is $2^{-\Omega(n)}$, regardless of the decisions made on the last n/2 bits.

Proof Note that the probability of exceeding [−1/6, 1/6] increases with the number of rw-steps that do increase or decrease the marginal probability (as opposed to self-loops). We call these steps relevant and pessimistically assume that all t steps are relevant steps. Now defining $X_j := \sum_{i=1}^{j} X_i$ as the total progress in the first j relevant steps, we have $E(X_j) = 0$ for all j ≤ t, and the total change in these j steps exceeds 1/6 only if $X_j \ge K/6$. Applying a Hoeffding bound, Theorem 1.13 in [4], the maximum total progress is bounded as follows: .
By symmetry, the same holds for the total change reaching values less or equal to −1/6.By the union bound, the probability that the total change always remains within the interval [−1/6, 1/6] is thus at least .
Assuming κ < 1/(12 ln 2) gives a lower bound of Ω(1). Note that due to our pessimistic assumption of all steps being relevant, all bits are treated independently. Hence we may apply standard Chernoff bounds to derive the second claim.
The following lemma shows that whenever a small number of bits has reached the lower border for marginal probabilities, the remaining optimization time is Ω(n log n) with high probability. The proof is similar to the well-known coupon collector's theorem [24].

Lemma 14
Assume cGA reaches a situation where at least $\Omega(n^{\varepsilon})$ marginal probabilities attain the lower border 1/n. Then with probability $1 - e^{-\Omega(n^{\varepsilon/2})}$, and in expectation, the remaining optimization time is Ω(n log n).
Proof Let $m = \Omega(n^{\varepsilon})$ be the number of bits that have reached the lower border 1/n. A necessary condition for reaching the optimum within $t := (n/2 - 1) \cdot (\varepsilon/2) \ln n$ iterations is that during this time each of these m bits is sampled at value 1 in at least one of the two search points constructed. The probability that one bit never samples a 1 in t iterations is at least $(1 - 2/n)^t$. The probability that all m bits sample a 1 during t steps is at most, using $(1 - 2/n)^{n/2-1} \ge 1/e$ and $1 + x \le e^x$ for $x \in \mathbb{R}$,
$$\bigl(1 - (1 - 2/n)^t\bigr)^m \le \bigl(1 - n^{-\varepsilon/2}\bigr)^m \le \exp\bigl(-m\,n^{-\varepsilon/2}\bigr) = \exp\bigl(-\Omega(n^{\varepsilon/2})\bigr).$$
Hence with probability $1 - \exp(-\Omega(n^{\varepsilon/2}))$ the remaining optimization time is at least t = Ω(n log n). As $1 - \exp(-\Omega(n^{\varepsilon/2})) = \Omega(1)$, the expected remaining optimization time is of the same order.
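The chain of estimates in this proof can be evaluated for concrete numbers. The sketch below is an illustration under the simplifying assumption that the m bits stay at the lower border and are sampled independently; the function name and the chosen parameters are hypothetical. It returns the time horizon t together with three quantities that should be ordered from smallest to largest: the probability that all m bits sample a 1 at least once, the bound using $(1 - 2/n)^t$, and the final exponential bound.

```python
import math

def all_border_bits_recover_bounds(n, m, eps):
    """Bounds on the probability that m bits stuck at marginal 1/n all sample
    a 1 at least once within t = (n/2 - 1) * (eps/2) * ln(n) iterations."""
    t = (n / 2 - 1) * (eps / 2) * math.log(n)
    miss_exact = (1 - 1 / n) ** (2 * t)    # bit is 0 in both samples, t times
    miss_lower = (1 - 2 / n) ** t          # lower bound used in the proof
    exact = (1 - miss_exact) ** m          # assumes independent bits at the border
    via_proof = (1 - miss_lower) ** m
    exponential = math.exp(-m * n ** (-eps / 2))
    return t, exact, via_proof, exponential
```

For example, n = 10000, ε = 1/2 and m = 100 give a horizon of roughly 11500 iterations and a success probability below $10^{-4}$.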
We have collected most of the machinery to prove Theorem 8. The following lemma identifies a set of bits that stay centered in a phase of $\Theta(K \min\{K, \sqrt{n}\})$ steps, resulting in a low probability of b-steps. Basically, the idea is to bound the accumulated effect of b-steps in the phase using Chernoff bounds: with at most K/6 b-steps, a marginal probability cannot change by more than 1/6. Note that this applies to many, but not all bits. Later, we will see that within the phase, some of the remaining bits will reach their lower border with not too low probability.
Lemma 15
Let κ > 0 be a small constant. There exists a constant γ, depending on κ, and a selection S of γn bits among the first n/2 bits such that the following properties hold regardless of the last n/2 bits throughout the first $T := \kappa K \cdot \min\{K, \sqrt{n}\}$ steps of cGA with K ≤ poly(n), with probability $1 - \mathrm{poly}(n) \cdot 2^{-\Omega(\min\{K, n\})}$:
1. the marginal probabilities of all bits in S are always within [1/6, 5/6] during the first T steps,
2. the probability of a b-step at any bit is always $O(1/\sqrt{n})$ during the first T steps, and
3. the total number of b-steps for each bit is bounded by K/6, leading to a displacement of at most 1/6.

Proof
The first property is trivially true at initialization, and we show that an event of exponentially small probability needs to occur in order to violate the property. Taking a union bound over all T steps ensures that the property holds throughout the whole phase of T steps with the claimed probability. By Lemma 13, with probability $1 - 2^{-\Omega(n)}$, for at least γn of these bits the total effect of all rw-steps is always within [−1/6, +1/6] during the first $T \le \kappa K^2$ steps. We assume in the following that this happens and take S as a set containing exactly γn of these bits.
It remains to show that for all bits in S the total effect of b-steps is bounded by 1/6 with high probability. Note that, while this is the case, according to Lemma 12, the probability of a b-step at every bit in S is at most $c_2/\sqrt{n}$ for a positive constant $c_2$. This corresponds to the second property, and so long as this holds, the expected number of b-steps in $T \le \kappa K\sqrt{n}$ steps is at most $\kappa \cdot c_2 K$. Each b-step changes the marginal probability of the bit by 1/K. A necessary condition for increasing the marginal probability by a total of at least 1/6 is that we have at least K/6 b-steps amongst the first T steps. Choosing κ small enough to make $\kappa \cdot c_2 K \le 1/2 \cdot K/6$, by Chernoff bounds the probability to get at least K/6 b-steps in T steps is $e^{-\Omega(K)}$. In order for the first property to be violated, an event of probability $e^{-\Omega(K)}$ is necessary for any bit in S and any length of time t ≤ T; otherwise all properties hold true.
Taking the union bound over all $T \le \kappa K^2$ steps and all γn bits gives a probability bound of $\kappa K^2 \cdot \gamma n \cdot e^{-\Omega(K)} \le \mathrm{poly}(n) \cdot 2^{-\Omega(K)}$ for a property being violated. This proves the claim.
Finally, we put everything together to prove our lower bound for cGA.
To bound the last expression from below, we distinguish between two cases. If $K \le \sqrt{n}$, then α = Ω(1) and (14) is at least = Ω(1) Combining with the probability of not exceeding 5/6, which we have proved to be constant, the probability of the bit's marginal probability hitting the lower border within T steps is $\Omega(n^{-\beta})$. Hence by Chernoff bounds, with probability $1 - 2^{-\Omega(n^{1-\beta})}$, the final number of bits hitting the lower border within T steps is $\Omega(n^{1-\beta})$. Once a bit has reached the lower border, while the probability of a b-step is $O(1/\sqrt{n})$, the probability of leaving the border again is $O(n^{-3/2})$, as it is necessary that either the bit is sampled as 1 at one of the offspring and a b-step happens, or in both offspring the bit is sampled at 1. So the probability that this does not happen until the (1). Again applying Chernoff bounds leaves $\Omega(n^{1-o(1)})$ bits at the lower border at time T with probability $1 - 2^{-\Omega(n^{1-o(1)})}$.

Proof of Lower Bound for 2-MMAS ib
We will use, to a vast extent, the same approach as in Sect. 5.1 to prove Theorem 9. Most of the lemmas can be applied directly or with very minor changes. In particular, Lemmas 13, 14 and 15 also apply to 2-MMAS_ib by identifying 1/K with ρ. Intuitively, this holds since the analyses of b-steps always pessimistically bound the absolute change of a marginal probability by the update strength (1/K for cGA). This also holds with respect to the update strength ρ for 2-MMAS_ib.
To prove lower bounds on the time to hit a border through rw-steps, the next lemma is used. It is very similar to Lemma 10, except for two minor differences: first, the accumulated effect of b-steps is also included in the quantity $p_t - p_0$ analyzed in the lemma. Second, considerations are stopped when the marginal probability becomes less than ρ or more than 1 − ρ. This has technical reasons but is not a crucial restriction.

… border is reached. As mentioned, following the structure of the proof of Lemma 10, we now analyze several moments of $\Delta_t$, with the final aim of establishing the Lyapunov condition in Lemma 11.

Analysis of expected change of potential
We claim for all t ≥ 0 where rw-steps occur (hence, formally we enter the conditional probability space on $R_t$, the event that an rw-step occurs at time t) that Moreover, we claim for the unconditional expected value that For a proof of (16), we exploit the martingale property that holds in rw-steps of 2-MMAS_ib, where there are two possible successor states different from $X_t$. Since g(x) is a convex function on [0, 1/2], we have by Jensen's inequality To bound the expected value from above, we carefully estimate the error introduced by the convexity. Note that since the integrand is non-increasing. Analogously, Inspecting the g-values of two possible successor states of $x := X_t$, we get that where the third-last inequality estimated 1 − x ≤ 1 and used that $f(z + \rho) - f(z) \le \rho f'(z)$ for any concave, differentiable function f and ρ ≥ 0; here using $f(z) = \sqrt{z}$ and z = x − ρ. The penultimate used ρ ≤ 1/2. Since the final bound is $O(\rho/\sqrt{x}) = o(1)$ due to our assumption on $X_t \ge \rho$, we have proved (16).
We now consider the case that a b-step occurs at time t.We are only interested in bounding E(Δ t | X t ) from below now.Given X t = x, we have X t+1 > x (which means Δ t < 0) with probability at most 1 Now, since by assumption a b-step occurs with probability at most ρ/(4α), the unconditional expected value of Δ t can be computed using the superposition equality.Combining ( 16) and ( 22), we get since x ≤ 1, proving (17).

Analysis of variance of change of potential
Regarding the variance of Δ t , we claim that and, without the condition of having an rw-step, To prove this, we expand the definition of variance to estimate Var where the penultimate inequality used ρ ≤ x and the last one x ≤ 1/2.Plugging this in, we get which completes the proof of (24).By the law of total probability, we get for the unconditional variance that Var Since P[R t ] ≥ 1/2, we altogether have for the unconditional variance that Var(Δ t | X t = x) ≥ 1/32, as claimed in (25).
Approximating the accumulated change of potential by a Normal distribution
The aim is to apply the central limit theorem (Lemma 11) on the sum of the $\Delta_t$. To this end, we will verify the Lyapunov condition for δ = 1 (smaller values could be used but do not give any benefit) and t = ω(1/ρ) (which, as $t = \alpha(s/\rho)^2$, holds due to our assumptions $1/\alpha = o(\rho^{-1})$ and |s| = Ω(1)). We compute where we again have used (18) and the upper bound from (19) with respect to the two outcomes of $X_{t+1}$. Moreover, we have used the bound $E(\Delta_t \mid X_t) \ge 0$ in the first term and $E(|\Delta_t| \mid X_t) \le 3\rho/(2\sqrt{x}) + \rho/(2\alpha)$ in the second term, which is a crude combination of (21) and (17). As ρ ≤ 1/2 and ρ ≤ x as well as α ≥ ρ, the expected value satisfies where we used x ≤ 1 and x ≥ ρ. Using $s_t^2 := \sum_{j=0}^{t-1} \mathrm{Var}(\Delta_j \mid X_j)$ in the notation of Lemma 11 and using that $\mathrm{Var}(\Delta_j \mid X_j) \ge 1/32$ by (25), we get 1

Proof We consider only the case $X_0 \le \rho$ as the other case is symmetrical. The idea is to consider O(log n) phases and prove that the $X_t$-value only decreases throughout all phases with the stated probability. Phase i, where i ≥ 0, starts at the first time where $X_t \le \rho e^{-i}$. Clearly, as ρ ≤ 1, at the latest in phase ln n the border 1/n has been reached. We note that phase i ends after 1/ρ steps if all these steps decrease the value; here we use that each step decreases by a relative amount of 1 − ρ and that $(1 - \rho)^{1/\rho} \le e^{-1}$. The probability of decreasing the $X_t$-value in a step of phase i is at least $(1 - \rho e^{-i})^2 \ge 1 - 2e^{-i}\rho$ even if the step is a b-step. Hence, the probability of all steps of phase i being decreasing is at least $(1 - 2e^{-i}\rho)^{1/\rho} \ge e^{-2e^{-i}}$. For all phases together, the probability of only having decreasing steps is still at least as suggested.
We have now collected all tools to prove the lower bound for 2-MMAS_ib.

Proof of Theorem 9
This follows mostly the same structure as the proof of Theorem 8. Every occurrence of the update strength 1/K should be replaced by ρ.
There is a minor change in the analysis of rw-steps. The two applications of Lemma 10 are replaced with Lemma 16, followed by an additional application of Lemma 17. The slightly different constants in the statement of Lemma 10 do not affect the asymptotic bound Ω(n^{−β}) obtained. Neither does the additional application of Lemma 17, which gives a constant probability. We do not care about the time O((log n)/ρ) stated in Lemma 17, since we are only interested in a lower bound on the hitting time.
There is a difference in how b-steps are being handled. While Lemma 10 only considers the accumulated effect of rw-steps (leaving the consideration of b-steps to the proof of Theorem 8), Lemma 16 also includes the effect of b-steps, assuming bounds on the probability of b-steps and on the number of b-steps, respectively. We still have to verify that these assumptions are met.
Lemma 16 requires in its first statement that the probability of a b-step is at most ρ/(4α). Recall that such a step has probability O(1/√n). We argue that ρ/(4α) ≥ c/√n for any constant c > 0 if κ is small enough. To see this, we simply recall that α = κ√n ρ/(3s^2) by definition and |s| = Ω(1). Finally, the second statement of Lemma 16 restricts the number of b-steps until time α(s/ρ)^2 to at most s/(2αρ). Reusing that ρ = O(α/(κ√n)), this holds by Chernoff bounds with high probability if κ is a sufficiently small constant. Hence, the application of the lemma is possible.
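The first of these verifications can be spelled out by substituting the definition of α. The constant c_0 below is a hypothetical name, introduced only for illustration, for a lower bound on s^2 that exists because |s| = Ω(1):

```latex
\frac{\rho}{4\alpha}
  = \frac{\rho}{4}\cdot\frac{3s^{2}}{\kappa\sqrt{n}\,\rho}
  = \frac{3s^{2}}{4\kappa\sqrt{n}}
  \ge \frac{3c_{0}}{4\kappa}\cdot\frac{1}{\sqrt{n}}
  \ge \frac{c}{\sqrt{n}}
  \qquad\text{whenever } \kappa \le \frac{3c_{0}}{4c} .
```

Note that ρ cancels, so the condition constrains only κ and not the update strength itself.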

Conclusions
We have performed a runtime analysis of two probabilistic model-building Genetic Algorithms, namely cGA and 2-MMAS_ib, on OneMax. The expected runtime of these algorithms was analyzed in dependence on the so-called update strength S = 1/K and S = ρ, respectively, resulting in the upper bound O(√n/S) for S = O(1/(√n log n)) and the lower bound Ω(√n/S + n log n). Hence, S ∼ 1/(√n log n) was identified as the choice for the update strength leading to the asymptotically smallest expected runtime Θ(n log n).
Our analyses of update strength reveal a general trade-off between the speed of learning and genetic drift. High update strengths globally imply a fast adaptation of the probabilistic model but impact the overall correctness of the model negatively, resulting in an increased risk of adapting to samples that are locally incorrect. We think that this constitutes a universal limitation of the algorithms that extends to more general classes of functions. As even on the simple OneMax the update strength should not be bigger than 1/(√n log n), we propose this setting as a general rule of thumb.
Our analyses have developed a quite technical machinery for the analysis of genetic drift. These techniques are not necessarily limited to cGA and 2-MMAS_ib on OneMax. Very recently, they have been used in [19] to analyze the so-called UMDA, which is a more complicated EDA. We also believe that the techniques will lead to improved results for classical Genetic Algorithms such as the simple Genetic Algorithm [27], where currently only quite restricted lower bounds on the runtime are available.
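To make the rule of thumb concrete, the following is a minimal Python sketch of the cGA on OneMax with K chosen as ⌈√n ln n⌉. Several details are assumptions made for illustration and are not fixed by the analysis: the leading constant 1 and the natural logarithm in `rule_of_thumb_K`, stopping as soon as the optimum is sampled, counting iterations rather than fitness evaluations, and all function names.

```python
import math
import random


def onemax(bits):
    """OneMax fitness: the number of one-bits."""
    return sum(bits)


def rule_of_thumb_K(n):
    """Hypothetical helper: K = ceil(sqrt(n) * ln n), i.e. update strength
    1/K ~ 1/(sqrt(n) log n). Constant factor and log base are assumptions."""
    return math.ceil(math.sqrt(n) * math.log(n))


def cga_onemax(n, K, rng, max_iters=10**6):
    """Run the cGA on OneMax.

    Returns the number of iterations until the optimum is sampled
    (an assumed termination criterion), or None if max_iters is exhausted.
    """
    lower, upper = 1.0 / n, 1.0 - 1.0 / n  # borders for the frequencies
    p = [0.5] * n
    for t in range(1, max_iters + 1):
        # Sample two solutions from the current product distribution.
        x = [1 if rng.random() < pi else 0 for pi in p]
        y = [1 if rng.random() < pi else 0 for pi in p]
        # Make x the better of the two samples.
        if onemax(x) < onemax(y):
            x, y = y, x
        if onemax(x) == n:
            return t
        # Shift each frequency by 1/K towards the better sample.
        for i in range(n):
            if x[i] > y[i]:
                p[i] += 1.0 / K
            elif x[i] < y[i]:
                p[i] -= 1.0 / K
            p[i] = min(upper, max(lower, p[i]))
    return None


if __name__ == "__main__":
    n = 100
    K = rule_of_thumb_K(n)
    iterations = cga_onemax(n, K, random.Random(2024))
    print(f"n={n}, K={K}, iterations={iterations}")
```

Rerunning the sketch with a much smaller K (stronger updates) or a much larger K (weaker updates) gives an informal impression of the trade-off; a handful of runs at small n is of course no substitute for the asymptotic bounds.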

A.2 Bounds on the Cumulative Distribution Function of the Standard Normal Distribution
To prove Lemmas 10 and 16, we need the following estimates for Φ(x). More precise formulas are available (and can be found by searching for bounds on the so-called error function), but are not required for our analysis.
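For numerical sanity checks, Φ can be evaluated through the error function using the standard identity Φ(x) = (1 + erf(x/√2))/2. The sketch below does this in Python and compares the upper tail against the textbook Mills-ratio bounds x/(1 + x²)·φ(x) ≤ 1 − Φ(x) ≤ φ(x)/x for x > 0. These textbook bounds are shown only as an illustration and are not necessarily the estimates used in this appendix; the function names are hypothetical.

```python
import math


def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    """Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_upper_tail(x):
    """1 - Phi(x), computed via erfc to avoid cancellation for large x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bounds(x):
    """Textbook lower and upper bounds on 1 - Phi(x), valid for x > 0."""
    if x <= 0:
        raise ValueError("bounds are stated for x > 0 only")
    pdf = std_normal_pdf(x)
    return x / (1.0 + x * x) * pdf, pdf / x


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 3.0):
        low, high = mills_bounds(x)
        print(x, low, std_normal_upper_tail(x), high)
```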
Algorithm: Compact Genetic Algorithm (cGA)

    t ← 0
    p_{t,1} ← p_{t,2} ← … ← p_{t,n} ← 1/2
    while termination criterion not met do
        for i ∈ {1, …, n} do
            x_{t,i} ← 1 with prob. p_{t,i}, x_{t,i} ← 0 with prob. 1 − p_{t,i}
        for i ∈ {1, …, n} do
            y_{t,i} ← 1 with prob. p_{t,i}, y_{t,i} ← 0 with prob. 1 − p_{t,i}
        if f(x_t) < f(y_t) then swap x_t and y_t
        for i ∈ {1, …, n} do
            if x_{t,i} > y_{t,i} then p_{t+1,i} ← p_{t,i} + 1/K
            if x_{t,i} < y_{t,i} then p_{t+1,i} ← p_{t,i} − 1/K
            if x_{t,i} = y_{t,i} then p_{t+1,i} ← p_{t,i}
            Restrict p_{t+1,i} to be within [1/n, 1 − 1/n]
        t ← t + 1

… and we are left with an analysis of Δ := |ϕ_{t+1} − ϕ_t|. Here we note that for any bit i, its frequency changes by an absolute value of at most 1/K with probability at most q_i + 1/n ≤ 2q_i. Hence, KΔ is stochastically dominated by a Poisson-binomial distribution with parameters n and 2q_i, where 1 ≤ i ≤ n. Let A be the random variable describing this Poisson-binomial distribution. While we do not know the individual success probabilities, we know their average p* := Σ_{i=1}^{n} (2q_i/n) = 2ϕ_t/n and can bound A by a random variable B, where B ∼ np* + Bin(n, p*) + 2. To show this, we note that P[B ≥ t] ≥ P[A ≥ t] is trivial for t ≤ np* + 2 (as P[B ≥ t] = 1). For t > np* + 2, even the dominance P[Bin(n, p*) ≥ t] ≥ P[A ≥ t] holds by the results of Gleser