Multiplicative Up-Drift

Drift analysis aims at translating the expected progress of an evolutionary algorithm (or more generally, a random process) into a probabilistic guarantee on its run time (hitting time). So far, drift arguments have been successfully employed in the rigorous analysis of evolutionary algorithms, but only in situations where the progress is constant or becomes weaker when approaching the target. Motivated by questions like how fast fit individuals take over a population, we analyze random processes exhibiting a $(1+\delta)$-multiplicative growth in expectation. We prove a drift theorem translating this expected progress into a hitting time. This drift theorem gives a simple and insightful proof of the level-based theorem first proposed by Lehre (2011). Our version of this theorem has, for the first time, the best-possible near-linear dependence on $1/\delta$ (the previous results had an at least near-quadratic dependence), and it only requires a population size near-linear in $1/\delta$ (this was super-quadratic in previous results). These improvements immediately lead to stronger run time guarantees for a number of applications. We also discuss the case of large $\delta$ and show stronger results for this setting.


Introduction
In a typical situation in evolutionary search, an algorithm first makes good progress while far away from the target, since a lot can still be improved.
As the search focuses more and more on the fine details, progress slows and finding improving moves becomes rarer. Thus, the expected progress is typically an increasing function of the distance from the optimum. However, there are also many processes where this situation is reversed. For example, for heuristics involving a population, once a superior individual is found, this improvement needs to be spread over the population. This process gains speed when more individuals exist with the improvement.
Turning expected progress into an expected first hitting time is the purpose of drift theorems (see the recent survey [Len20] for a thorough introduction to drift analysis). For example, the additive drift theorem [HY01,HY04] requires a uniform lower bound δ on the expected progress (the drift) and gives an expected first hitting time of at most n/δ, where n is the initial distance from the optimum. This theorem can also be applied when the drift is changing during the process, but since a uniform δ is used in the argument, the additive drift theorem cannot be used to exploit a stronger drift later in the process.
A first step towards profiting from a changing drift behavior was the multiplicative drift theorem [DJW12,DG13]. It assumes that the drift is at least δx when the distance from the optimum is x, for some factor δ < 1. The first hitting time can then be bounded by O(log(n)/δ), where n is again the initial distance from the optimum. Apparently, this gives a much better bound than what could be shown via the additive drift in this setting. Multiplicative drift can be found in many optimization processes, making the multiplicative drift theorem one of the most useful drift theorems.
To cope with a broader variety of changing drift patterns, the variable drift theorem [MRC09,Joh10] has been developed. However, while there are several variants of this drift theorem, most of them require that the strength of the drift is a monotone increasing function in the distance from the optimum (the farther away from the optimum, the easier it is to make progress).
In this paper we are concerned with the reverse setting where the drift is a decreasing function of the distance from the optimum. This has been considered only for a few variable drift theorems, and all of them essentially require step-size-bounded processes. The most recent formulation of this can be found in [OW15]. We want to consider processes which are not step-size bounded, so this drift theorem cannot be usefully applied.
While many drift theorems are phrased such that the aim is to reach the point zero, for our setting it is more natural to consider the case of reaching some target value n starting at a value of 1, and to suppose that the drift is δx going up (for the multiplicative drift theorem, we had a drift of δx going down). Thus, we call our resulting drift theorem the multiplicative up-drift theorem.
Making things more formal, consider a random process $(X_t)_{t \in \mathbb{N}}$ over the positive reals starting at $X_0 = 1$ and with target $n > 1$. We speak of multiplicative up-drift if there is a $\delta > 0$ such that, for all $t \ge 0$, we have the drift condition
$E[X_{t+1} - X_t \mid X_t] \ge \delta X_t$. (D)
Note that this is equivalent to
$E[X_{t+1} \mid X_t] \ge (1+\delta) X_t$.
One trivial case of any drift process is the deterministic process with the desired gain per iteration. We briefly regard this case now as it gives the right impression of what should be a natural expected first hitting time for a well-behaved process exhibiting multiplicative up-drift.
Example 1. Let $\delta > 0$. Suppose $X_0 = 1$ and, for all $t$, $X_{t+1} = (1+\delta)X_t$ with probability 1. Then this process satisfies the drift condition (D) with equality. Clearly, the time to reach a value of at least $n$ is $\lceil \log_{1+\delta}(n) \rceil$. For small $\delta$, this is approximately $\log(n)/\delta$; for large $\delta$, it is approximately $\log(n)/\log(\delta)$. We note here already that we will be mostly concerned with the case where $\delta$ is small. This case is the harder one since the progress is weaker, and thus there is a greater need for stronger analysis tools in this case.
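To make the two regimes of Example 1 concrete, the following minimal sketch counts the steps of the deterministic process and prints them next to the two approximations mentioned above. The function name and the chosen values of $n$ and $\delta$ are illustrative assumptions, not part of the original analysis.

```python
import math


def deterministic_hitting_time(n: float, delta: float) -> int:
    """Steps until X_t = (1 + delta)^t first reaches at least n, starting at X_0 = 1."""
    x, t = 1.0, 0
    while x < n:
        x *= 1.0 + delta  # deterministic multiplicative gain per iteration
        t += 1
    return t


if __name__ == "__main__":
    n = 10**6  # illustrative target value
    for delta in (0.01, 0.1, 1.0, 100.0):
        t = deterministic_hitting_time(n, delta)
        small_delta_approx = math.log(n) / delta
        exact_real_valued = math.log(n) / math.log(1.0 + delta)
        print(delta, t, round(small_delta_approx, 1), round(exact_real_valued, 1))
```

For small $\delta$ the step count tracks $\log(n)/\delta$ closely, while for large $\delta$ that approximation is far too pessimistic and the logarithmic form is the relevant one.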
Unfortunately, not all processes with multiplicative up-drift have a hitting time of O(log(n)/δ), as the following example shows.
Example 2. Let $\delta > 0$. Suppose $X_0 = 1$ and, for all $t$, $X_{t+1} = n$ with probability $\delta/(n-1)$ (which we term a success) and $X_{t+1} = 1$ otherwise. Again, the drift condition (D) is satisfied with equality (while the target $n$ is not reached). The time for the process to hit the target $n$ is thus geometrically distributed with success probability $\delta/(n-1)$, giving an expected time of $(n-1)/\delta = \Theta(n/\delta)$ iterations, significantly more than the $O(\log(n)/\delta)$ seen in the deterministic process.
Note that for this process the additive drift theorem immediately gives the upper bound of O(n/δ) since we always have a drift of at least δ towards the target. Hence Example 2 describes a process where the stronger assumption of multiplicative up-drift does not lead to a better hitting time.
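The claims about Example 2 can be checked with exact rational arithmetic. The sketch below computes the one-step drift from state 1 and the mean of the geometric hitting time; the function name and the sample parameters are illustrative assumptions.

```python
from fractions import Fraction


def example2_summary(n: int, delta: Fraction):
    """Return (drift from state 1, expected hitting time) for the process of Example 2."""
    p = delta / (n - 1)                    # success probability delta / (n - 1)
    expected_next = p * n + (1 - p) * 1    # E[X_{t+1} | X_t = 1]
    drift = expected_next - 1              # should equal delta * X_t = delta
    expected_time = 1 / p                  # mean of a geometric distribution
    return drift, expected_time


if __name__ == "__main__":
    drift, expected_time = example2_summary(1000, Fraction(1, 10))
    print(drift, expected_time)  # 1/10 9990
```

The drift equals $\delta$ exactly, yet the expected hitting time is $(n-1)/\delta$, which is what the additive drift theorem already yields.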
Our first main result (Theorem 3) shows that the targeted bound of O(log(n)/δ), which as we saw is optimal when we want to cover the deterministic process given in Example 1, can be obtained when strengthening condition (D) by assuming (i) that, given X t , the next state X t+1 is at least (in the stochastic domination sense) binomially distributed with expectation (1 + δ)X t , and (ii) that the process never reaches state 0. The first condition is very natural. When generating offspring independently, the number of offspring satisfying a particular desired property is binomially distributed. The second condition is a technical necessity. From the up-drift condition alone, we cannot infer any progress from state 0. Consequently, 0 could well be an absorbing state, resulting in an infinite hitting time if this state can be reached with positive probability.
In quite some applications, however, we cannot rule out that the random process reaches state 0. For example, when regarding the subpopulation of individuals having some desired property, then in an algorithm using comma selection, this might die out completely in one iteration (though often with small probability only). To cover also such processes, in our second drift theorem (Theorem 15) we extend our Theorem 3 to include that state 0 is reached with at most the probability that can be deduced from the up-drift and the binomial distribution conditions. To avoid that state 0 is absorbing, we add an additional condition governing how this state 0 is left again (see Theorem 15 for the precise statement).
As mentioned before, a main application for multiplicatively increasing drift towards the optimum is the analysis of how fit individuals spread in a population. This particular setting was previously analyzed as the level-based theorem [Leh11,DL16,CDEL18], modeled after the method of fitness-based partitions [Weg01]. Essentially, the search space is partitioned into an ordered sequence of levels. The ongoing search process increases the probability that a newly-created individual is at least on a given level and, once this probability is sufficiently high, that there is a good chance that the individual is on an even higher level. We restate the details of this theorem in the version from [CDEL18] in Theorem 19 below. The level-based theorem was originally intended for the analysis of non-elitist population-based algorithms [DL16], but has since also been applied to EDAs, namely to the UMDA in [DLN19] and, with some additional arguments, to PBIL in [LN18].
We use our second multiplicative up-drift theorem (Theorem 15) to prove a new version of the level-based theorem (Theorem 20). This new theorem allows us to derive better asymptotic bounds under mostly weaker conditions: the dependence of the run time on 1/δ is reduced from near-quadratic to near-linear, and the minimum population size λ required for the result to hold is reduced from super-quadratic in 1/δ to near-linear in 1/δ. Since the run time often is linear in λ, this can give a further run time improvement.
Our upper bounds almost match the lower-bound example given in [CDEL18] and, in particular, match the asymptotic dependence on δ displayed by this example.
Our version of the level-based theorem can be applied in all settings where the previous-best level-based theorems were used. It leads to better results when δ is small. In Section 4, we analyze two such situations from previous analyses of non-elitist evolutionary algorithms on standard test functions. The first test function is called OneMax and maps a given bit string to the number of 1s in that bit string, thus simulating a unimodal optimization problem solvable by simple hill climbing. The second test function is called LeadingOnes and maps a bit string to the number of 1s appearing in the bit string before the first 0 (if any); this simulates an optimization problem requiring sequential optimization of different subparts. Our results are as follows. (i) We prove that the (λ, λ) EA with fitness-proportionate selection and suitable parameters can optimize the OneMax and LeadingOnes functions in expected time $O(n^3 \log^2 n)$ and $O(n^4)$, respectively, improving over the previous-best published bound of $O(n^8 \log n)$. (ii) We prove that the (λ, λ) EA with 2-tournament selection and suitable parameters, in the restricted setting that only a constant fraction of the bits of the search points are evaluated, finds the optimum of OneMax in $O(n^{2.5} \log^2 n)$ iterations. The previous-best published bound here is $O(n^{4.5} \log n)$.
We also use our methods to obtain a level-based theorem for the case that δ is large (Theorem 22). This case was not covered by the previous-best level-based theorems, and our theorem now allows exploiting larger values of δ to obtain asymptotically stronger run time guarantees. As an example we show (in Section 4.3) that the (µ, λ) EA with $\mu = n$ and $\lambda = n^{1.5}$ on the LeadingOnes benchmark function using ranking selection and standard bit mutation has an optimization time of $O(n^{2.5})$. This is asymptotically better than the previously known bound of $O(n^{2.5} \log(n))$ and also shows more explicitly how optimization proceeds.
Beyond these particular results, our modular proof (first analyzing the multiplicative up-drift excluding 0, then including 0, then applying it in the context of the level-based theorem) shows the level-based theorem in a way that is more accessible than the previous versions and that gives more insight into population-based optimization processes.
In particular, our proof suggests that the behavior of the process under the named conditions is as follows.
• Once a critical mass in a level is reached, this level is never again abandoned. Thus, we can focus in our analysis on having a critical mass of individuals in one level and analyze the time it takes to gain a critical mass in the next level.
• Reaching a critical mass in the next level consists of two steps.
1. When few elements are in the next level, then these elements go extinct regularly and need to be respawned until this initial population on this level via a mostly unbiased random walk gains a moderate amount of elements.
2. With this moderate amount of elements, the bias of the random walk is large enough to make a significant decrease of the population unlikely, but instead the number of elements increases steadily, as can be shown using a concentration bound for submartingales, so that we quickly gain a critical mass in the next level.
We are optimistic that this increased understanding of population-based processes helps in the future design and analysis of such processes.

Multiplicative Up-Drift Theorems
In this section we prove three multiplicative up-drift theorems. The first is concerned with processes that cannot reach the value 0 (which could be absorbing if only a multiplicative up-drift assumption is made); the second one extends the first theorem to include also the possibility of going down to 0 (but taking an additional assumption how state 0 is left). The third does the same, but exploits the assumption that, with some positive probability, state 0 is left to a state from which, with constant probability, we make strong multiplicative progress in every iteration until the process reaches the target (as opposed to a behavior closer to an unbiased random walk). Note that our theorems essentially deal with martingales, but still we suppress the mention of conditioning on all previous members of the given process (i.e. the natural filtration) to improve readability.

Processes on the Positive Integers
As discussed in the introduction, an expected multiplicative increase as described by (D) is not enough to ensure the run time we aim at. For this reason, we assume that there is a number $k$ such that, conditional on $X_t$, the next state $X_{t+1}$ is binomially distributed with parameters $k$ and $(1+\delta)X_t/k$.
Note that this implies (D). Since often precise distributions are hard to specify, we only require that $X_{t+1}$ is at least as large as this binomial distribution, that is, we require that $X_{t+1}$ stochastically dominates $\mathrm{Bin}(k, (1+\delta)X_t/k)$. See [Doe19] for an introduction to stochastic domination and its use in run time analysis. To avoid that the process reaches the possibly absorbing state 0, we explicitly forbid this, that is, we require that all $X_t$ take values only in the positive integers.
Under these conditions, we analyze the time the process takes to reach or overshoot a given state $n$. For technical reasons, we require that $n$ is not too close to $k$, that is, that there is a constant $\gamma_0 < 1$ such that $n - 1 \le \gamma_0 k$. For the trivial reason that the condition $X_{t+1} \succeq \mathrm{Bin}(k, (1+\delta)X_t/k)$ does not make sense for $X_t > (1+\delta)^{-1}k$, we also require $n - 1 \le (1+\delta)^{-1}k$. For all such $n$, we show that an expected number of $O(\log(n)/\delta)$ iterations suffices to reach $n$ when $\delta \le 1$ and $O(\log(n)/\log(1+\delta))$ iterations suffice for $\delta > 1$. More precisely, we show the following estimate.
In addition, once the process has reached state 32 or higher, the probability to ever return to a state lower than 32 is at most $\frac{1}{e(e-1)} < 0.22$. For the analysis we will employ Lemma 9 from Section 2.1.4 essentially for the time spent below $D_0$. Note that this lemma all by itself, in case of $\delta \le 1$ and $n \le D_0$, gives the stronger bound $E[T] \le \frac{6 n \ln(n)}{1-\gamma_0}$. Since the case $\delta \le 1$ is significantly more complicated, we focus on this case in Sections 2.1.1 to 2.1.6 and discuss the case $\delta > 1$ only in Section 2.1.7.

A Motivating Example
Before proving this result, let us give a simple example of a possible application. Consider the following elitist (µ, λ) EA. It starts with a parent population of µ individuals chosen uniformly and independently from $\{0,1\}^n$. In each iteration, it generates λ offspring, each by independently and uniformly choosing a parent individual and mutating it via standard bit mutation with the usual mutation rate $1/n$. If the offspring population contains at least one individual that is at least as good as the best parent (in terms of fitness), then the new parent population is chosen by selecting µ best offspring (breaking ties arbitrarily). If all offspring are worse than the best parent, then the new parent population is composed of a best individual from the old parent population and µ − 1 best offspring (again, breaking all ties arbitrarily).
We now use the above theorem to analyze the spread of fit individuals in the parent population. Let us assume that at some time, the parent population contains at least one individual of at least a certain fitness. We shall call such individuals fit in the following. Recall that standard bit mutation creates a copy of the parent individual with probability $1/e_n := (1-1/n)^n \approx 1/e$. Hence if the parent population contains $x$ fit individuals, the number of fit individuals in the offspring population is at least (in the domination sense) $\mathrm{Bin}(\lambda, \frac{x}{\mu e_n})$. Due to the elitist selection mechanism, it is also always at least one. Let us assume that $\frac{\lambda}{\mu e_n}$ is greater than one so that the expected number $x \frac{\lambda}{\mu e_n}$ of fit individuals shows a positive drift. Writing $(1+\delta) := \frac{\lambda}{\mu e_n}$, where $\delta > 0$ by our assumption, and assuming for simplicity $\delta \le 1$ as well, we can apply the first up-drift theorem with $k = \lambda$ and $n = \mu$ and observe that after an expected number of $O(\log(\mu)/\delta)$ iterations, the parent population consists of only fit individuals.
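The take-over process described above can be explored with a small simulation. The sketch below follows only the pessimistic binomial model (an offspring counts as fit only if it is an exact copy of a fit parent) together with the elitist floor of one fit individual. The function name, the parameter values, and the iteration cap are illustrative assumptions.

```python
import math
import random


def spread_time(mu: int, lam: int, n_bits: int, rng: random.Random,
                max_iters: int = 10**6) -> int:
    """Iterations until all mu parents are fit under the pessimistic binomial model."""
    copy_prob = (1.0 - 1.0 / n_bits) ** n_bits   # 1/e_n: mutation copies the parent
    x, t = 1, 0                                   # start with a single fit parent
    while x < mu and t < max_iters:
        p = x * copy_prob / mu                    # an offspring copies a fit parent
        fit_offspring = sum(rng.random() < p for _ in range(lam))
        x = max(1, min(mu, fit_offspring))        # elitism keeps at least one fit parent
        t += 1
    return t


if __name__ == "__main__":
    mu, lam, n_bits = 50, 200, 100                # hypothetical parameters
    delta = lam * (1.0 - 1.0 / n_bits) ** n_bits / mu - 1.0
    rng = random.Random(42)
    runs = [spread_time(mu, lam, n_bits, rng) for _ in range(100)]
    print("delta ~", round(delta, 3))
    print("mean take-over time:", sum(runs) / len(runs))
    print("log(mu)/delta:", round(math.log(mu) / delta, 1))
```

The empirical mean is only meant to be compared in order of magnitude with $\log(\mu)/\delta$; the theorem gives an asymptotic bound, not this constant.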

Proof Overview
We now proceed towards proving the first up-drift theorem. As said earlier, we concentrate on the case δ ≤ 1 in all of the following except Section 2.1.7. We start by outlining the two main difficulties and solutions in a high-level language.
One of the main difficulties is that the drift towards the target is negligibly weak in the early stages of the process. To demonstrate this, assume that $\delta = o(1)$ and that $X_t = o(1/\delta)$. Then the up-drift condition (D) only ensures a drift of $E[X_{t+1} - X_t \mid X_t] \ge \delta X_t = o(1)$. At the same time, the binomial condition (Bin) allows a variance $\mathrm{Var}[X_{t+1} \mid X_t]$ of order $X_t$, or, more specifically, admits deviations of $X_{t+1}$ from its expectation of order $\sqrt{X_t}$ with constant probability. For this reason, in this regime we do not progress because of the drift, but rather because of the random fluctuations of the process. It is well-known that random fluctuations are enough to reach a target, with a classical example being the unbiased random walk $(W_t)$ on the line $[0..n] := \{0, 1, \dots, n\}$. This walk, when started in 0, still reaches $n$ in an expected number of $O(n^2)$ iterations despite the complete absence of any drift in $[1..n-1]$. The key to the analysis is to not regard the drift $E[W_{t+1} - W_t \mid W_t]$ of the process, but instead the drift of the process $(W_t^2)$. Then an easy calculation gives $E[W_{t+1}^2 - W_t^2 \mid W_t] = 1$ for all $W_t \in [1..n-1]$ (see [GKK18, Section 5] for an extensive discussion). Consequently, by regarding the drift with respect to $(W_t^2)$ instead of the original process $(W_t)$, we obtain an additive drift of 1, and from this an expected time of $O(n^2)$ to reach state $n$. This has also been applied to the analysis of randomized search heuristics, see for example [Kre19, Theorem 3.18].
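The effect of squaring the random walk can be verified exactly. The sketch below computes the one-step drift of $W_t^2$ for an interior state and the expected time to reach $n$ from 0. It assumes, for illustration, that the walk is reflected at 0 (it moves from 0 to 1 with probability one); this boundary behavior and the function names are assumptions of the sketch.

```python
from fractions import Fraction


def squared_drift(w: int) -> Fraction:
    """E[W_{t+1}^2 - W_t^2 | W_t = w] for steps +1 and -1 with probability 1/2 each."""
    half = Fraction(1, 2)
    up = (w + 1) ** 2 - w ** 2
    down = (w - 1) ** 2 - w ** 2
    return half * up + half * down


def expected_hitting_time(n: int) -> int:
    """Expected time to reach n from 0 for the walk reflected at 0.

    With d_i the expected time to move from i to i + 1, we have d_0 = 1 and
    d_i = 1 + (d_{i-1} + d_i) / 2, that is, d_i = d_{i-1} + 2.
    """
    total, d = 0, 1
    for _ in range(n):
        total += d
        d += 2
    return total


if __name__ == "__main__":
    print([squared_drift(w) for w in range(1, 6)])      # each entry equals 1
    print([expected_hitting_time(n) for n in (1, 2, 10)])  # [1, 4, 100]
```

The hitting time equals $n^2$ exactly under this assumption, matching the bound obtained from an additive drift of 1 in the potential $W_t^2$.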
Apparently more common are transformations with exponents smaller than one. [Jan07, Theorem 2] turned a region with small drift into one with significantly more drift by employing the concave potential function $x \mapsto \sqrt{x}$. He wrote that any other function $x \mapsto x^{\varepsilon}$ with $\varepsilon < 1$ would be equally suitable to obtain the same tight upper bound. Essentially the same argument was used in a more general setting in [CDF14]. The $x \mapsto \sqrt{x}$ transformation was also used in the analysis of how the sampling frequency of a neutral bit in a run of an EDA approaches the boundary values [DZ20, Theorem 6]. In [GK14, Theorem 5] a negative drift in a (small) part of the search space was overcome by considering random changes which make it possible for the algorithm to pass through the area of negative drift by chance. This was formalized by using a tailored potential function turning negative drift into positive drift by excessively rewarding changes towards the target, as opposed to steps away from the target. This ad-hoc argument was made formal and cast into a Headwind Drift Theorem in [KLW15, Theorem 4].
In abstract terms, the art here is finding a potential function $g\colon \mathbb{Z}_{\ge 0} \to \mathbb{R}$ that transforms the unbiased process $(X_t)$ into a process $(g(X_t))$ with constant drift, so that we can apply the additive drift theorem to obtain a bound of $O(g(X_0))$ on the expected optimization time. In order to obtain a positive drift, such a potential function has to be increasing and convex, and since the expected optimization time is $g(X_0)$, at the same time the potential function should increase as slowly as possible.
For our situation, it turns out that g defined by g(x) = x ln(x) is a good choice as this again gives a constant drift and thus an expected time of roughly O(log(1/δ)/δ) to reach a state Ω(1/δ), from where on we will observe that also the original process has sufficient drift. We are not aware of this potential function being used so far in the theory of evolutionary algorithms (apart from a similar function being used in [ADY19], a work done in parallel to ours).
A technical annoyance in the analysis of the time taken to reach Ω(1/δ) is that the additive drift theorem, for good reason, does not allow that the process overshoots the target. In the classical formulation, this follows from the target being 0 and the process living in the non-negative numbers. For this reason, we cannot just show that the process (g(X t )) has a constant drift, but we need to show this drift for a version of this process that is suitably restricted to the range [1..Θ(1/δ)]. This was a major technicality in the previous version of this work [DK19]. In this version, we greatly simplify this part by using a version of the drift theorem (Theorem 4) recently proposed by Krejca [Kre19] that allows overshooting the target (at the price that the time bound depends not on the distance of the target, but the distance plus the expected overshooting).
Once the process has reached a value of $\Omega(1/\delta)$, the drift is strong enough to rely on making progress from the drift (and not the random fluctuations around the expectation). This is easy when the process is above $X_t = \omega(1/\delta^2)$, since then the expected progress of at least $\Omega(\delta X_t)$ is asymptotically larger than the typical random fluctuation of order $\sqrt{X_t}$. Hence a simple Chernoff bound is enough to guarantee that each single iteration gives $X_{t+1} \ge (1 - o(1))(1+\delta)X_t$. When $X_t$ is smaller, say only $\Theta(1/\delta)$, only the combined result of $\Theta(1/\delta)$ iterations gives an expected progress large enough to admit such a strong concentration. Since the iterations are not independent, we need some careful martingale concentration arguments in this regime. Since this part is non-trivial and uses some methods that might be of broader interest, we put this into the following separate subsections. Also, we note that the specific result that the process rarely goes below half its starting point could have some independent interest (and we shall need it later again, in the proof of Theorem 20 to prove the level-based theorem).

Additive Drift with Overshooting
We now give a version of the additive drift theorem [HY01,HY04] as shown in [Kre19, Lemma 3.7], here slightly reformulated to best fit our purposes. In contrast to most other versions of the additive drift theorem, it allows that the process overshoots the target. This is usually implicitly forbidden by regarding processes in R ≥0 and the first time to reach state 0.
This extension is not very deep, but has apparently not been known too well before (as the several works that overcome the overshooting problem with hand-made methods, including [DK19], show). We note that the arguments needed to prove such a result have been known before in this community: For example, both [Jäg07, Lemma 12] and [DK15, Lemma 7] prove lower bounds for expected run times in a way that can immediately be turned into proofs for upper bounds that allow overshooting (by switching the direction of the inequality in both assumptions and results). The proof of [WW05, Lemma 2.6], a result for hitting a particular value, can easily be extended to overshooting the value (for this, it suffices to note that $E[\sum_{i=1}^{\tau_s} D_i]$ is the value of the process after reaching or overshooting $s$).
Theorem 4 (Additive Drift Theorem, upper bound with overshooting). Let a} be the first time the process reaches or drops below a. Suppose that there is δ > 0 such that

We note that the version of this result given in [Kre19] is slightly stronger. There the condition that the process does not take values larger than some (arbitrary) number $b$ was replaced by the weaker condition that this only holds up to time $T$.

Progress From Random Fluctuations: Creating Drift Where There is no Drift
In this subsection, we analyze how the process reaches a value of at least $D_0 = \min\{\lceil 100/\delta \rceil, n\}$. In this regime, the drift of $(X_t)$ is so low that the true reason for making progress is not the drift, but the random fluctuations stemming from the non-trivial variance. To turn these into an exploitable drift, we regard the process $(g(X_t))$ for a suitable function $g$, observe that this process has a positive drift, and use this drift to estimate the time to reach or exceed $D_0$. We use
$g\colon \mathbb{R}_{\ge 0} \to \mathbb{R},\ x \mapsto x \ln(x)$, (1)
where, by convention, $g(0) := 0$, which renders $g$ continuous at 0. To establish the desired drift, we need a few technical results about $g$. Via a Taylor expansion of $g$ around a given point $a$, we obtain the following estimates for $g$.
Lemma 5. For all a > 0 and x ≥ 0, we have Proof. Let a > 0 be given. We prove the (slightly more complicated) lower bound first, showing the claim for positive x and then arguing with continuity. We let f : Then we have, for all x ∈ R + , In particular, we have f (a) = 0, f ′ (a) = 0, and f ′′ (x) ≥ 0 for all x ∈ R + . This shows that for all x ∈ R + , we have f (x) ≥ 0. By the continuity of f , we also obtain f (0) ≥ 0, and thus the claim.
For the upper bound, we regard f : as well as and f ′ (a/2) > 0, by the intermediate value theorem there exist exactly two x such that f ′ (x) = 0, one being larger than a/2 and the other smaller. Note that f ′ (a) = 0. From lim x→0 f (x) = 0 and f (a) = 0, the only local maximum being at a, we can thus conclude that f is non-positive.
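As a numerical sanity check of Taylor-type estimates for $g(x) = x\ln(x)$ around a point $a$, the following sketch tests one pair of bounds on a grid: a third-order lower bound $g(a) + (1+\ln a)(x-a) + \frac{(x-a)^2}{2a} - \frac{(x-a)^3}{6a^2}$ and a quadratic upper bound $g(a) + (1+\ln a)(x-a) + \frac{(x-a)^2}{a}$. These exact expressions are an assumption of the sketch, chosen to be consistent with the properties of $f$ used in the proof above (for the lower bound, $f(a) = f'(a) = 0$ and $f'' \ge 0$; for the upper bound, $f(0) = f(a) = 0$ and $f'(a/2) > 0$); they are not a quotation of the lemma.

```python
import math


def g(x: float) -> float:
    """Potential g(x) = x ln(x), with the convention g(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def taylor_bounds(a: float, x: float):
    """Assumed lower and upper estimates for g(x) around the point a > 0."""
    linear = g(a) + (1.0 + math.log(a)) * (x - a)
    lower = linear + (x - a) ** 2 / (2.0 * a) - (x - a) ** 3 / (6.0 * a ** 2)
    upper = linear + (x - a) ** 2 / a
    return lower, upper


def bounds_hold(a_values, x_values, tol: float = 1e-9) -> bool:
    """Check lower <= g(x) <= upper on a grid, up to floating-point tolerance."""
    for a in a_values:
        for x in x_values:
            lower, upper = taylor_bounds(a, x)
            if not (lower - tol <= g(x) <= upper + tol):
                return False
    return True


if __name__ == "__main__":
    grid = [i / 4 for i in range(0, 401)]        # x in [0, 100]
    print(bounds_hold([0.5, 1.0, 3.0, 10.0, 50.0], grid))
```

A grid check like this cannot replace the proof, but it quickly exposes a sign or coefficient error in a candidate estimate.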
We use the estimates above to show that, under suitable circumstances, the expected g-value of a random variable X is larger than g(E[X]). The lower bound in the theorem below will be used to argue that even for a process (X t ) with no drift, that is, E[X t+1 | X t ] = X t , the process (g(X t )) has a positive drift. We need the upper bound to estimate the expected overshooting of the target when applying the additive drift theorem with overshooting (Theorem 4).
Theorem 6. Let g be defined as above in Equation (1). Let X be a nonnegative random variable with positive expectation. Let Proof. We use Lemma 5 with a = E[X].
The following two corollaries follow immediately from the theorem above by recalling that the second and third central moments of a binomially distributed random variable $X \sim \mathrm{Bin}(n, p)$ are $\mathrm{Var}[X] = np(1-p)$ and $E[(X - E[X])^3] = np(1-p)(1-2p)$. For technical reasons, we need the first estimate also for random variables $X \sim \mathrm{Bin}(n, p) + K$ for some non-negative number $K$.
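To see that $(g(X_t))$ gains drift even when $(X_t)$ has none, the sketch below computes $E[g(X)]$ exactly for $X \sim \mathrm{Bin}(k, x/k)$, so that $E[X] = x$, and compares it with $g(E[X])$. It also compares the result with the moment-based estimates $g(E[X]) + \frac{\mathrm{Var}[X]}{2E[X]} - \frac{E[(X-E[X])^3]}{6E[X]^2}$ and $g(E[X]) + \frac{\mathrm{Var}[X]}{E[X]}$; these two expressions are an assumption of the sketch (they follow from third-order and quadratic Taylor-type bounds on $g$), and the function names and parameters are illustrative.

```python
import math


def g(x: float) -> float:
    """Potential g(x) = x ln(x), with the convention g(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def expected_g_binomial(n: int, p: float) -> float:
    """Exact E[g(X)] for X ~ Bin(n, p), summing over the probability mass function."""
    total = 0.0
    for i in range(n + 1):
        pmf = math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
        total += pmf * g(i)
    return total


def moment_bounds(n: int, p: float):
    """Assumed lower and upper estimates for E[g(X)] from binomial moments."""
    mean = n * p
    var = n * p * (1.0 - p)                       # second central moment
    third = n * p * (1.0 - p) * (1.0 - 2.0 * p)   # third central moment
    lower = g(mean) + var / (2.0 * mean) - third / (6.0 * mean ** 2)
    upper = g(mean) + var / mean
    return lower, upper


if __name__ == "__main__":
    k, x = 200, 10                                # unbiased step: E[X] = x
    value = expected_g_binomial(k, x / k)
    lower, upper = moment_bounds(k, x / k)
    print("drift of g:", value - g(x))
    print("bounds on drift:", lower - g(x), upper - g(x))
```

The gain $E[g(X)] - g(x)$ is a positive constant of order one half here, which is the kind of drift the additive drift theorem can exploit.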
Corollary 7. If X ∼ Bin(n, p) + K for some n ∈ N, p ∈ (0, 1], and K ≥ 0, Corollary 8. If X ∼ Bin(n, p) for some n ∈ N and p ∈ (0, 1], then For p ≥ 1/n, this yields We are now prepared to show the following result. Lemma 9. Let (X t ) t∈N be a stochastic process over the positive integers. Assume that there are D 0 , k ∈ Z ≥1 and γ 0 < 1 such that D 0 − 1 ≤ γ 0 k and for all t ≥ 0 and all x ∈ [1.
Proof. There is nothing to show for D 0 = 1, so we assume D 0 ≥ 2 in the remainder. For technical reasons, let us regard the process (X ′ t ), which agrees with (X t ) while not larger than D 0 , but follows the pessimistic law X ′ t+1 ∼ Bin(k, X t /k) in the iteration where D 0 is exceeded. More precisely, we let and the remaining probability mass is put on D 0 , that is, If X ′ t > D 0 , we let X ′ t+1 = X ′ t with probability one. Since the process (X ′ t ) agrees with (X t ) while less than D 0 , we have To apply the additive drift theorem with overshooting (Theorem 4), we observe that for some Y following a binomial law with parameters k and some p ≤ (D 0 − 1)/k. By elementary arguments analogous to those used in the proof of [DD18, Lemma 1], see also [Doe20, Lemma 1.7.3] and the comment following its proof, X ′ T is stochastically dominated by D 0 + Bin(k, (D 0 − 1)/k), which immediately gives , the last estimate using D 0 ≥ 2. Consequently, the additive drift theorem with overshooting gives We remark that, in principle, Lemma 9 can be strengthened by taking into account the starting point X 0 . Assuming for simplicity that X 0 takes only values in [1..D 0 ], this would give a result like where the last estimate stems from the convexity of g and Jensen's inequality.
The reason for this weak improvement is that we estimated E[g(X ′ T )] very coarsely. However, even with a better estimate of E[g(X ′ t )], asymptotically stronger results would only be possible in the case that X 0 is very close to D 0 , that is, , which we do not expect in our typical applications.
We note further that the problem of overshooting and the resulting negative impact on the hitting time estimate is real. Even if $X_0 = D_0 - 1$ with probability one, we see that when taking $k = 2(D_0 - 1)$ for simplicity, we have $X_1 \le D_0 - \Omega(\sqrt{D_0})$ with constant probability (that this is possible for an unbiased process stems from the fact that $X_1$ overshoots $D_0$ by a comparable amount). We omit a formal proof, but note that from $X_1 \le D_0 - \Omega(\sqrt{D_0})$, the process takes an expected number of $\Omega(\sqrt{D_0})$ iterations to reach or overshoot $D_0$.

Submartingale Arguments Proving a Steady Progress From $D_0$ On
In this subsection, we shall prove that once a process satisfying the assumptions of Theorem 3 has reached a value of $D \ge D_0 := \min\{\lceil 100/\delta \rceil, n\}$, it usually makes a steady progress of a constant factor increase in $\Theta(1/\delta)$ iterations without ever going below $D/2$. To show this result, we use a submartingale argument that might prove to be useful in other analyses of evolutionary algorithms as well. We build on the following result from Freedman [Fre75, Theorem 4.1], cited in a more compact manner in [FGL15, Theorem A] (adjusted for submartingales rather than supermartingales).
We use this result to bound the probability that the process, started in $D \ge D_0$ at time $t$, at any time $s \in [t..t+O(1/\delta)]$ goes below $D/2 + (s-t)\delta D/2$. If this does not happen, we in particular have $X_{t+\lceil 3/\delta \rceil} \ge 2D$.
Lemma 11. Let (X t ) t∈N be a stochastic process over the positive integers. Assume that there are n, k ∈ Z ≥1 and δ ∈ (0, 1] such that n − 1 ≤ (1 + δ) −1 k and for all t ≥ 0 and all Since the proof of this lemma is not obvious, let us describe the main ideas before stating the formal proof. From (Bin) we immediately see that (X t ) is a submartingale. However, since each X t may take all values in [1..k] with positive probability, this submartingale does not admit good absolute bounds on the submartingale differences (the variable u in Theorem 10).
For this reason, we write the variable $X_t$ as a sum of the $k$ independent binary random variables $Y_{t1}, \dots, Y_{tk}$ that describe the binomial distribution in (Bin). Then the suitably defined submartingale differences $Y_{tj} - \frac{1}{k}X_{t-1}$ define a submartingale with differences bounded by one (we can take $u = 1$).
To make the progress of this submartingale visible, we would like to regard instead the submartingale with differences $Y_{tj} - \frac{1}{k}X_{t-1} - D\delta/(2k)$. If $X_{t-1} \ge D/2$, this difference still has a non-negative expectation (as necessary for a submartingale). Since we cannot rule out that some $X_{t-1}$ is less than $D/2$, we define our submartingale via the differences $Y_{tj} - \frac{1}{k}X_{t-1} - \Delta_t$, where $\Delta_t = D\delta/(2k)$ when $X_{t-1} \ge D/2$ and $\Delta_t = 0$ otherwise. This defines a submartingale. Via Theorem 10, we shall show that with high probability, this submartingale never goes below $D/2$. This in particular implies that all $X_t$ are at least $D/2$, and hence, that all $\Delta_t$ are $D\delta/(2k)$. Consequently, $X_t$ is not only at least $D/2$, but it is even at least $D/2 + tD\delta/2$, which shows the desired progress.
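The steady-progress claim can be probed empirically. The sketch below starts the exact binomial process at a value $D$ and records whether it stays above the rising barrier $D/2 + s\delta D/2$ until it has at least doubled. Several choices are assumptions of the sketch rather than part of the lemma: the trial stops as soon as $2D$ is reached, the success probability is capped at 1, and the parameters and function names are hypothetical.

```python
import math
import random


def doubles_without_dropping(D: int, k: int, delta: float, rng: random.Random) -> bool:
    """One trial of X_{s} ~ Bin(k, (1 + delta) X_{s-1} / k), started at X_0 = D."""
    x = D
    steps = math.ceil(3.0 / delta)
    for s in range(1, steps + 1):
        p = min(1.0, (1.0 + delta) * x / k)        # cap is an assumption of this sketch
        x = sum(rng.random() < p for _ in range(k))
        if x < D / 2 + s * delta * D / 2:          # fell below the rising barrier
            return False
        if x >= 2 * D:                             # doubled: count the trial as a success
            return True
    return x >= 2 * D


def success_frequency(trials: int, rng: random.Random,
                      D: int = 200, k: int = 1000, delta: float = 0.5) -> float:
    """Fraction of trials that double without dropping below the barrier."""
    wins = sum(doubles_without_dropping(D, k, delta, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    # D = 200 = 100 / delta matches the threshold D_0 for delta = 0.5.
    print(success_frequency(200, random.Random(7)))
```

With these parameters nearly every trial succeeds, in line with the intuition that above $D_0$ the drift dominates the fluctuations.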
Proof. To ease the notation, let us assume that t 0 = 0.
To have a better control over the one-step variances to be computed later, we first argue that we can pessimistically assume that the progress is exactly the one described by the binomial distributions in (Bin). More precisely, let $X'_0 = X_0$ and define recursively $X'_t$ as a random variable with distribution

To show our claim it thus suffices to show

To ease the argument, let us artificially continue the process in case it reaches a state of at least $\tilde n$ before time $\lceil 3/\delta\rceil$. More specifically, let

Then this modified process agrees with the original process $(X')$ up to time $T'$, and thus it satisfies (3) if and only if the original process does. We can thus work with the modified process in the following.

We define random variables $(Y_{tj})_{1\le t\le\lceil 3/\delta\rceil,\,1\le j\le k}$ as follows. If $X'_{t-1} < \tilde n$, then $Y_{t1}, \dots, Y_{tk}$ are independent Bernoulli random variables with success probability p t = (1 + δ) /k with probability one for all $j$, and we set $p_t = 0$. By (Bin) and (Bin'), we can assume

Further, for all $t \ge 1$, we define $\Delta_t \in \{0, D\delta/(2k)\}$ as follows. If

In particular,

We trivially observe

where in the second estimate we used that $D \le n-1 \le (1+\delta)^{-1}k$ and that the term $\delta/(1+\delta)$ has a unique maximum at $\delta = 1$ in $[0,1]$. Finally, using again $D \le (1+\delta)^{-1}k$ and noting that the term $\delta^2/(1+\delta)$ is maximal for $\delta = 1$ (assuming $\delta \le 1$), with Equation (4) we compute that, conditional on

For all $i \ge 0$ and $j \in [0..k-1]$ let further

that is, the sum of the first $ik+j$ $Z$-variables in the natural lexicographic ordering. By Equation (5), this is a submartingale, by Equation (6)

Let us assume that this rare event does not occur, that is, we have $S_m \ge -D/2$ for all $m \in [0..N]$. We note that when $m$ is a multiple of $k$, then

Consequently, we have $X_t \ge D/2$ for all $t$, hence $\Delta_t = D\delta/(2k)$ for all $t$, and thus $X_t \ge D/2 + tD\delta/2$ as desired.
With an iterated application of the previous result, we can show that the process has a decent chance to reach the target n in time O(log(n)/δ). We shall later only need the result with the success probability 0.2782, but since we easily prove a stronger bound for larger starting points D and since such results might be useful in other contexts, we also prove such an estimate.
Proof. Using Lemma 11, with probability at least $1 - \exp(-\delta X_0/169) = 1 - \exp(-\delta D/169)$ there is $t_1 \le t_0 + \lceil 3/\delta\rceil$ such that $X_{t_1} \ge \min\{2D, n\}$. Given this event and assuming $X_{t_1} < n$, with probability at least $1 - \exp(-\delta X_{t_1}/169) \ge 1 - \exp(-2\delta D/169)$, there is a $t_2 \le t_1 + \lceil 3/\delta\rceil$ such that $X_{t_2} \ge \min\{2X_{t_1}, n\} \ge \min\{4D, n\}$. Repeating this doubling argument at most $\lceil\log_2(n/D)\rceil$ times, we obtain a state of at least $n$. This takes at most $\lceil\log_2(n/D)\rceil\lceil 3/\delta\rceil$ iterations and works out as desired with probability at least

where the first inequality follows from a Weierstrass product inequality, a mild extension of Bernoulli's inequality (see, e.g., [Doe20, Lemma 1.4.8]), and the last equation computes the geometric series. When $D$ is small, this estimate can be negative and then is not very useful. For this case, using our assumption that $D \ge 100/\delta$, we compute

which gives the desired bound.
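As a numerical sanity check of the constant 0.2782 used later, the following sketch multiplies the per-round success probabilities of the doubling argument in the worst case $\delta D = 100$. It assumes that round $i$ succeeds with probability at least $1 - \exp(-2^i\delta D/169)$, as in the proof above; the function name `doubling_success_probability` and the cutoff of 60 rounds are illustrative choices and not part of the original analysis.

```python
import math


def doubling_success_probability(delta_times_d: float, rounds: int = 60) -> float:
    """Product of the per-round bounds 1 - exp(-2^i * delta * D / 169).

    `rounds` truncates the rapidly converging infinite product; it is an
    illustrative cutoff, not a quantity taken from the paper.
    """
    prob = 1.0
    for i in range(rounds):
        prob *= 1.0 - math.exp(-(2 ** i) * delta_times_d / 169.0)
    return prob


# Worst case permitted by the assumption D >= 100/delta.
worst_case = doubling_success_probability(100.0)
print(f"{worst_case:.4f}")
```

Since larger values of $\delta D$ only increase every factor, the value for $\delta D = 100$ is the smallest one that this product can take under the assumption $D \ge 100/\delta$.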
2.1.6 Proof of Theorem 3 for δ ≤ 1
By combining the two main insights of the two preceding subsections, we now prove the first up-drift theorem in the case $\delta \le 1$. Since the proof uses a result known as Wald's equation [Wal44], we first state a simplified version of this result.
Theorem 13 (Wald's equation). Let $M \in \mathbb{R}$. Let $X_1, X_2, \dots$ be an infinite sequence of non-negative random variables with

We now give the proof of the first up-drift theorem for the case that $\delta \le 1$.
Proof of Theorem 3 for $\delta \le 1$. Let us call a phase of the process $(X_t)$ the time interval used to first reach a value of at least $D_0 = \min\{\lceil 100/\delta\rceil, n\}$ and then another $T_2 = \lceil\log_2(n/D_0)\rceil\lceil 3/\delta\rceil \le \log_2(n)\lceil 3/\delta\rceil$ iterations. By Lemma 9, the expected time to reach a value of at least $D_0 = \min\{\lceil 100/\delta\rceil, n\}$ is at most . Hence a phase has an expected length of at most $M = T_1 + T_2$.
By Lemma 12, a phase is successful, that is, reaches or exceeds $n$, with probability at least 0.2782. Hence the number of phases until a successful one is encountered is described by a geometric random variable $T$ with success rate $p = 0.2782$.
By Wald's equation (Theorem 13), the expected time to reach or exceed $n$ is therefore at most $E[T] \cdot M = \frac{1}{0.2782}(T_1 + T_2)$. The claim follows from noting that $1/0.2782 < 3.6$.
2.1.7 The Case δ > 1
In this section, we treat the case that $\delta$ is larger than one. In this case, the up-drift is so strong that we do not have a significant phase in which the progress stems mostly from random fluctuations. Rather, we can argue that with constant "success" probability, the process increases by a factor of at least $(1+\delta/2)$ in each iteration and thus reaches the target of $n$ in at most $\lceil\log_{1+\delta/2}(n)\rceil = O(\log(n)/\log(\delta))$ iterations. In case of failure, a simple restart argument (leading to an expected constant number of restarts of the argument) suffices to show the same bound for the expected time to reach a state of at least $n$.

This argument alone would give a relatively low success probability of $\prod_{i=0}^{\infty}\bigl(1 - \exp(-(1+0.5)^i/16)\bigr) \le 3.4\cdot 10^{-6}$ when proceeding as in the proof below, using $\delta = 1$, and estimating this infinite product via its first ten factors. Consequently, a very high implicit constant in the $O(\log(n)/\log(\delta))$ bound would result. To overcome this, we first argue that it takes at most an expected number of 62 iterations to reach a state of at least 32. From this point on, the probability to increase by a factor of $(1+\delta/2)$ in each subsequent iteration is more than 0.78. While we did not aim at obtaining the best possible constants, we decided to follow this line of argument to obtain a leading constant that is not only of theoretical interest. We note that the same argument could be used with intermediate targets larger than 32 and increase factors closer to $(1+\delta)$, which shows that the right asymptotics is $(1+o(1))\log_{1+\delta}(n)$.
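The estimate of the infinite product via its first ten factors can be reproduced with a few lines of code. The sketch below follows the formula quoted above with $\delta = 1$ (growth factor $1.5$ and scale $16$); the function name and its parameters are hypothetical and only serve the illustration.

```python
import math


def truncated_success_product(factors: int = 10, growth: float = 1.5,
                              scale: float = 16.0) -> float:
    """Product of the first `factors` terms 1 - exp(-growth^i / scale)."""
    prob = 1.0
    for i in range(factors):
        prob *= 1.0 - math.exp(-(growth ** i) / scale)
    return prob


# All factors are below one, so the truncated product is an upper bound
# on the infinite product.
estimate = truncated_success_product()
print(f"{estimate:.2e}")
```

Because every omitted factor is smaller than one, truncating after ten factors can only overestimate the infinite product, which is why the truncated value is a valid upper bound.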
To prove the case δ > 1 of the first up-drift theorem, we show the following lemma. It contains a statement on making less progress than expected which is stronger than what we need here, but which might be useful in other contexts.
Lemma 14 (First Up-Drift Theorem, $\delta > 1$). Let $(X_t)_{t\in\mathbb{N}}$ be a stochastic process over the positive integers. Assume that there are $n, k \in \mathbb{Z}_{\ge 1}$ and $\delta \ge 1$ such that $n - 1 \le (1+\delta)^{-1}k$ and for all $t \ge 0$ and all $x \in [1..n-1]$ with $\Pr[X_t = x] > 0$ we have the binomial condition

In addition, once the process has reached some state $x$ or higher, the probability to have a step with $X_{t+1} < (1+\delta/2)X_t$ before reaching $X_t \ge n$ is at most $\frac{1}{e^{x/32}(e^{x/32}-1)}$. In particular, once the process has reached $x = 32$, the probability to ever go below 32 (before reaching $n$) is less than 0.22.
Proof. To ease the argument, we shall now assume that we have $X_{t+1} \sim \max\{1, \mathrm{Bin}(2(1+\delta)X_t, 1/2)\}$ when $X_t \ge n$. This artificial continuation of the process (similar to the one we used in Lemma 9) does not change the first time to reach or overshoot the target $n$, but allows us to disregard whether the process has reached the target earlier than thought.
We analyze one phase of the process, started at some time $t_0$ with an arbitrary value $X_{t_0}$. We say that this phase ends (after $\ell$ iterations) when either (i) $t_0 + \ell$ is the first time not earlier than $t_0$ that $X_{t_0+\ell} \ge n$ ("success"), or (ii) $t_0 + \ell$ is the first time such that $X_{t_0+\ell} < (1+\delta/2)X_{t_0+\ell-1}$ and $X_{t_0+\ell-1} \ge 32$ ("failure"). In simple words, the phase ends when the target is reached or when we fail to obtain a factor-$(1+\delta/2)$ increase from a state that is at least 32.
We first compute a simple upper bound for the expected length of a phase, which is valid regardless of whether we condition on success or failure. We start by estimating the expected time to reach a value of at least 32. Since $\delta > 1$, at any time $t$ the state $X_{t+1}$ dominates a binomial distribution with expectation $2X_t$. By the well-known fact that the median of a binomial distribution with integral expectation is equal to this expectation, first explicitly shown in [Neu66], we have $\Pr[X_{t+1} \ge 2X_t] \ge 1/2$. Consequently, the time to reach 32 is at most the time $\tilde T$ it takes for a sequence of random bits to encounter five successive ones. We note that the expectation of $\tilde T$ satisfies the recurrence $E[\tilde T] =$

Once a state of 32 or more is reached, we either witness a failure or an increase by a factor of $(1+\delta/2)$. Consequently, after another $\lceil\log_{1+\delta/2}(n/32)\rceil$ iterations, we have encountered a failure or reached the target, and hence the phase has ended within this timespan. In summary, the expected length of a phase, regardless of the starting state and regardless of whether it is successful or not, is at most $62 + \lceil\log_{1+\delta/2}(n/32)\rceil$ iterations. Noting that $1+\delta \le (1+\delta/2)^2$, we have $\log_{1+\delta/2}(x) \le 2\log_{1+\delta}(x)$ for all $x \ge 1$, and thus the expected length of a phase is at most $62 + \lceil\log_{1+\delta/2}(n/32)\rceil \le 63 + 2\log_{1+\delta}(n)$.
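The constant 62 and the final phase-length estimate can be checked with a short computation. The sketch below uses one standard recurrence for the expected waiting time for a run of ones, $E_k = (E_{k-1} + 1)/p$ with $E_0 = 0$, which is not necessarily the form used by the authors; the function names are hypothetical.

```python
import math
from fractions import Fraction


def expected_time_for_run(run_length: int, p_one: Fraction = Fraction(1, 2)) -> Fraction:
    """Expected number of random bits until `run_length` successive ones appear.

    Uses E_k = (E_{k-1} + 1) / p: wait for a run of length k-1, draw one more
    bit, and start over if that bit is a zero.
    """
    expected = Fraction(0)
    for _ in range(run_length):
        expected = (expected + 1) / p_one
    return expected


def phase_length_bounds(n: float, delta: float) -> tuple[float, float]:
    """Return (62 + ceil(log_{1+delta/2}(n/32)), 63 + 2 log_{1+delta}(n)) for n >= 32."""
    exact = 62 + math.ceil(math.log(n / 32) / math.log(1 + delta / 2))
    relaxed = 63 + 2 * math.log(n) / math.log(1 + delta)
    return exact, relaxed


print(expected_time_for_run(5))
print(phase_length_bounds(1000, 1.0))
```

For a fair coin the recurrence doubles and adds two in every step, so five successive ones take $2^6 - 2$ bits in expectation.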
From the "in particular" case of the "in addition" clause, which we shall prove shortly, we see that a phase is successful with probability at least 0.78. By elementary properties of the geometric distribution, there is an expected number of at most $\frac{1}{0.78} \le 1.283$ phases until the process is successful, and hence reaches the target. Since each phase takes an expected number of at most $63 + 2\log_{1+\delta}(n)$ iterations, the desired expected hitting time is at most $\frac{1}{0.78}(63 + 2\log_{1+\delta}(n)) \le 81 + 2.6\log_{1+\delta}(n)$ by Wald's equation (Theorem 13).
We now prove the "in addition" statement. For any time $t$, by a simple Chernoff bound (e.g., Theorem 1.10.5, Equation (1.10.12), in [Doe20]), we have (using $\delta \ge 1$)

Assume that at some time $t_1$ we have $X_{t_1} = x$. Let us now, minimally modifying the previously introduced notation, speak of a failure when for some $t \ge t_1$ we have $X_{t+1} < (1+\delta/2)X_t$. Noting that no failure for $i$ iterations leads to a state $X_{t_1+i} \ge (1+\delta/2)^i X_{t_1} = (1+\delta/2)^i x$, we see that the probability that no failure happens in any iteration later than $t_1$ is at least

where, similarly as in Lemma 12, we employ the Weierstrass product inequality and the fact that $2\cdot(3/2)^i \ge i+2$ for all non-negative integers $i$. We note that for $x = 32$, the resulting failure probability is less than 0.22, and that the event "no failure" implies the event to never go below 32.
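The constants in this argument can be verified numerically. The sketch below evaluates the closed-form failure bound from Lemma 14 and compares it with a direct product of per-step success probabilities. It assumes, as suggested by the product quoted in Section 2.1.7, a per-step failure probability of at most $\exp(-X_t/16)$ and the smallest growth factor $3/2$ (that is, $\delta = 1$); both function names are hypothetical.

```python
import math


def failure_bound(x: float) -> float:
    """Closed-form bound 1 / (e^{x/32} (e^{x/32} - 1)) from Lemma 14."""
    e = math.exp(x / 32.0)
    return 1.0 / (e * (e - 1.0))


def no_failure_product(x: float, steps: int = 60) -> float:
    """Product of 1 - exp(-(3/2)^i * x / 16) over the first `steps` iterations.

    Assumption: each step fails with probability at most exp(-X_t / 16) and
    the state grows by a factor of at least 3/2 per successful step.
    """
    prob = 1.0
    for i in range(steps):
        prob *= 1.0 - math.exp(-(1.5 ** i) * x / 16.0)
    return prob


print(failure_bound(32.0))       # failure bound when starting from state 32
print(no_failure_product(32.0))  # direct product, for comparison
```

The direct product is larger than one minus the closed-form bound, as expected, since the Weierstrass product inequality and the estimate $2\cdot(3/2)^i \ge i+2$ both give away some slack.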

Processes That Can Reach Zero
We now extend the multiplicative up-drift theorem to include state 0. Since the subprocess consisting only of states greater than 0 satisfies the assumptions of the first up-drift theorem, we obtain from the latter an upper bound on the time spent above 0. It therefore remains to estimate the time spent in state 0, which in particular means estimating how often the process reaches this state. In the technically more demanding case that $\delta \le 1$, we exploit that the process is a submartingale. We can thus employ the optional stopping theorem to estimate that, after leaving 0, the process returns to 0 before reaching $D_0 = \min\{\lceil 100/\delta\rceil, n\}$ with probability at most $1 - \Omega(\delta)$. Consequently, after an expected number of $O(1/\delta)$ attempts, the process reaches $D_0$, and from there with constant probability never goes back to zero.
In particular, when $\gamma_0$ is bounded away from 1 by a constant, then $E[T] = O\bigl(\frac{1}{E_0\delta} + \frac{\log(n)}{\delta}\bigr)$, where the asymptotic notation refers to $n$ tending to infinity and where $\delta = \delta(n)$ may be a function of $n$. Furthermore, if $n > 100/\delta$, then we also have that once the process has reached a state of at least $100/\delta$, the probability to ever return to a state of at most $50/\delta$ is at most 0.7218.
If $\delta > 1$, then we have

In addition, once the process has reached state 32 or higher, the probability to ever return to a state lower than 32 is at most $\frac{1}{e(e-1)} < 0.22$.

We show the theorem by considering two different kinds of steps of the process: those spent in state 0 and those spent in other states. For the latter we understand what happens from Theorem 3, so it remains to see what happens in state 0. There are in turn two ways in which the process can be in state 0. Either it could have been in state 0 before; in this case we will use (0) to see how the process gets out again. More complicated is the case of returning to state 0.
From Theorem 3 we know that it is unlikely to return to 0 after having reached a sufficiently high value. In order to compute a good bound on the return probability for smaller values of the process, we use the optional stopping theorem, which we state next for convenience. We use a version given by Grimmett and Stirzaker [GS01, Chapter 12.5, Theorem 9] that can be extended to super- and submartingales.
Theorem 16 (Optional Stopping). Let $(X_t)_{t\in\mathbb{N}}$ be a random process over $\mathbb{R}$, and let $T$ be a stopping time for $(X_t)_{t\in\mathbb{N}}$. Suppose that

Then the following two statements hold.
For the application of the optional stopping theorem it will be necessary to have a good bound on the value of the process after exceeding some value. Since no good bounds are guaranteed for the original process, we instead analyze a slightly different process which we can construct with the following lemma. It states, roughly, that we can replace a binomial random variable with expectation $E$ with a random variable that is identically distributed in $[0..E]$ and takes values only in $[0..\lceil 4E\rceil]$ such that the expectation is not lowered. We suspect that this result may be convenient in many other such situations, e.g., when using additive drift in processes that may overshoot the target.
Lemma 17. Let $Y$ be a random variable taking values in the non-negative integers such that $Y \succeq \mathrm{Bin}(k,p)$ for some $k \in \mathbb{N}$ and $p \in [0,1]$ with $kp \ge 1$. Let $E = kp$ denote the expectation of $\mathrm{Bin}(k,p)$. Then there is a random variable $Z$ such that

Proof. Let first $\delta \le 1$. We first analyze the time spent on all states different from 0. To this aim, let $\tilde X_t$, $t = 0, 1, \dots$, be the subprocess where we are above zero. Formally speaking, $\tilde X$ is the subsequence of $(X_t)$ consisting of all $X_t$ that are greater than 0. Viewed as a random process, this means that we sample the next state according to the same rules as for the $X$-process; however, if this is zero, then immediately and without counting this as a step we sample the new state from the distribution described in (0) conditional on being positive (which is the same as saying that we resample until we obtain a positive result). With this, the distribution describing one step of the process is a distribution on the positive integers such that $(\tilde X_{t+1} \mid \tilde X_t) \succeq \mathrm{Bin}(k, (1+\delta)\tilde X_t/k)$. We may thus apply Theorem 3 and obtain that after an expected total number of at most $\frac{21.6}{1-\gamma_0}D_0\ln(2D_0) + 3.6\log_2(n)\lceil 3/\delta\rceil$ steps, the process $\tilde X$ reaches or exceeds $n$.
It remains to analyze how many steps the process $X$ spends on state 0. To this end we first show the following claim bounding the probability of falling back to 0 when at a state $x$. The proof of the claim is essentially an adaptation of an argument regarding unbiased random walks (also known as the Gambler's Ruin problem), see, for example, [MU05, Section 12.2] for a treatment.
Claim: Let x be such that 0 ≤ x ≤ D 0 and let t 0 ≥ 0. We condition on X t 0 = x. Then the probability that, in the time from t 0 on, the process reaches a state of at least D 0 before reaching state 0 is at least x/(4D 0 ).
The claim is trivially true for x = 0. Thus, suppose x > 0. To ease reading, we regard the process (Y t ) defined by Y t = X t 0 +t for all t ≥ 0.
Let $R$ be the first time that $Y$ reaches or exceeds $D_0$, or hits 0; this is a stopping time. To ease the following argument, we regard the following process $Z$, which equals $Y$ until the stopping time (and hence has the same stopping time). We define $Z$ recursively. We start by setting $Z_0 := Y_0$. Assume that $Z_t$ is defined and $Z_t \cdot \mathbb{1}_{Z_t \le D_0} = Y_t \cdot \mathbb{1}_{Y_t \le D_0}$. If $Z_t > D_0$, then we set $Z_{t+1} = Z_t$. Otherwise, that is, when $Z_t = Y_t = x \le D_0$ for some $x$, we recall that $Y_{t+1} \succeq \mathrm{Bin}(k, (1+\delta)x/k) \succeq \mathrm{Bin}(k, x/k)$. In this case, we let $Z_{t+1}$ be the random variable constructed in Lemma 17 (w.r.t. $Y_{t+1}$, $k$, and $p = x/k$). By this lemma, we have $Z_{t+1} \cdot \mathbb{1}_{Z_{t+1} \le D_0} = Y_{t+1} \cdot \mathbb{1}_{Y_{t+1} \le D_0}$, allowing us to continue our recursive definition of $Z$, and showing that $(Z_t)$ is a submartingale. We can thus use the optional stopping theorem to see that

the latter again due to Lemma 17. Consequently

This shows the claim.

Let $t \ge 0$, let us again condition on $X_t = 0$, and let $A$ be the event that the process reaches a state of at least $D_0$ after time $t$ before reaching a state of 0. Using the claim and the law of total probability we now see that

We conclude that the number of iterations spent on state 0 before reaching a state of at least $D_0$ is dominated by a geometric distribution with success rate $\frac{E_0}{4D_0}$. Consequently, the expected number of these iterations is at most $4D_0/E_0$. Once the process has reached a state of $D_0$ or higher, by Theorem 3 the probability to ever return to 0 is at most 0.7218. Hence the expected number of times this happens is at most $1/0.2782$. We can now use Wald's equation (Theorem 13) to obtain the desired run time result.
The case of δ > 1 is analogous with 32 instead of D 0 and using Lemma 14 instead of Theorem 3.
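The claim used in the proof above, namely that from a state $x \le D_0$ the process reaches $D_0$ before 0 with probability at least $x/(4D_0)$, can be checked exactly on a small instance. The sketch below assumes the extremal process $X_{t+1} \sim \mathrm{Bin}(k, (1+\delta)X_t/k)$ and computes lower bounds on the hitting probabilities by repeated one-step updates; the parameter values ($k = 20$, $\delta = 0.5$, $D_0 = n = 10$) and all function names are illustrative assumptions.

```python
import math


def binom_pmf(k: int, p: float) -> list[float]:
    """Probability mass function of Bin(k, p) as a list indexed by the outcome."""
    return [math.comb(k, y) * p ** y * (1 - p) ** (k - y) for y in range(k + 1)]


def hit_before_zero(k: int, delta: float, d0: int, sweeps: int = 200) -> list[float]:
    """Lower bounds h[x] on Pr[reach >= d0 before 0 | X_0 = x] for x in 0..d0.

    After m sweeps, h[x] is the probability of reaching d0 within m steps
    without visiting 0, so every iterate underestimates the true value.
    """
    pmfs = {x: binom_pmf(k, (1 + delta) * x / k) for x in range(1, d0)}
    h = [0.0] * (d0 + 1)
    h[d0] = 1.0
    for _ in range(sweeps):
        new_h = h[:]
        for x in range(1, d0):
            pmf = pmfs[x]
            reach_now = sum(pmf[d0:])  # jump to d0 or above in one step
            reach_later = sum(pmf[y] * h[y] for y in range(1, d0))
            new_h[x] = reach_now + reach_later
        h = new_h
    return h


# Small instance with n - 1 = 9 <= k / (1 + delta) = 13.33..., so D_0 = n = 10.
h = hit_before_zero(k=20, delta=0.5, d0=10)
for x in range(1, 10):
    print(x, round(h[x], 4), x / 40)
```

On such a small instance the computed probabilities are far above $x/(4D_0)$; the factor 4 in the claim stems from the overshoot bound of Lemma 17 and is not meant to be tight here.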

Processes That Start High
In condition (0) of the second up-drift theorem (Theorem 15), we only exploit the progress made to states not exceeding $D_0$ when leaving state 0. When a process has a decent chance to leave 0 to a state equal to or above $D_0$, then we can ignore the costly first part of the analysis. This is what we analyze in this section by replacing the condition (0) with a start condition (S) which intuitively says that, at any time of the process (even when not at state 0), we have a good chance of starting the process fresh from a rather high minimum value. The proof is an easy combination of Lemma 12 and a restart argument. To ease the notation, we use the shorthand $\log^0_b(x) := \max\{0, \log_b(x)\}$ for all $x \in \mathbb{R}$ and $b > 1$.
If $\delta > 1$, then we have

Proof. We start by considering the case $\delta \le 1$. Regardless of where the process is at some time $t_0$, by the start condition (S) it takes an expected number of at most $1/p$ iterations to again reach a state of at least $x_{\min}$. Then, by Lemma 12 and $x_{\min} \ge D_0$, we see that the time to reach or exceed $n$ when starting at $x_{\min}$ or higher is no more than another $\lceil\log^0_2(n/x_{\min})\rceil\lceil 3/\delta\rceil$ iterations, with a probability of at least 0.2782.
In case this fails (with probability at most $1 - 0.2782$), we simply restart the argument at the current state. By Wald's equation (Theorem 13), the expected time to reach or exceed $n$ is at most $\frac{1}{0.2782}\bigl(1/p + \lceil\log^0_2(n/x_{\min})\rceil\lceil 3/\delta\rceil\bigr)$.
For $\delta > 1$, we proceed similarly. It again takes an expected number of $1/p$ iterations to reach $x_{\min}$ or higher. If $x_{\min} \ge n$, we are done. Otherwise, we invoke Lemma 14 to see that with probability at least 0.78, the process increases by a factor of at least $(1+\delta/2)$ in each subsequent iteration (that starts below $n$), using $x_{\min} \ge D_0 \ge 32$. In this case, using again Equation (8), we reach $n$ in at most $\lceil\log_{1+\delta/2}(n/x_{\min})\rceil \le 2\lceil\log_{1+\delta}(n/x_{\min})\rceil$ iterations. With a restart argument used in the failure case (occurring with probability at most 0.22), we obtain the claimed expected hitting time of at most $1.3/p + 2.6\lceil\log^0_{1+\delta}(n/x_{\min})\rceil$.
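The two run time expressions appearing in this section are easy to evaluate for concrete parameters. The following sketch computes the bound derived in the proof for $\delta \le 1$ and, for $\delta > 1$, the expression $1.3/p + 2.6\lceil\log^0_{1+\delta}(n/x_{\min})\rceil$ with which the theorem is applied later in the text. The function names and the example parameters are hypothetical.

```python
import math


def log0(x: float, base: float) -> float:
    """Shorthand log^0_b(x) = max{0, log_b(x)} from the text."""
    return max(0.0, math.log(x) / math.log(base))


def start_high_bound_small_delta(n: float, x_min: float, p: float, delta: float) -> float:
    """(1/0.2782) * (1/p + ceil(log^0_2(n/x_min)) * ceil(3/delta)), for delta <= 1."""
    rounds = math.ceil(log0(n / x_min, 2.0))
    return (1.0 / p + rounds * math.ceil(3.0 / delta)) / 0.2782


def start_high_bound_large_delta(n: float, x_min: float, p: float, delta: float) -> float:
    """1.3/p + 2.6 * ceil(log^0_{1+delta}(n/x_min)), for delta > 1."""
    return 1.3 / p + 2.6 * math.ceil(log0(n / x_min, 1.0 + delta))


# Example: start condition met with probability 1/4, target five times x_min.
print(start_high_bound_small_delta(n=1000, x_min=200, p=0.25, delta=0.5))
print(start_high_bound_large_delta(n=1000, x_min=200, p=0.25, delta=3.0))
```

When $x_{\min} \ge n$, the logarithmic term vanishes by the definition of $\log^0$, and only the cost of satisfying the start condition remains.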

The Level-Based Theorem
In this section, we apply our up-drift theorems to give an insightful proof of a sharper version of the level-based theorem first proposed by Lehre [Leh11].
The general setup of such level-based theorems is as follows. There is a ground set $\mathcal{X}$, which in typical applications is the search space of an optimization problem. On this ground set, a Markov process $(P_t)$ induced by a population-based EA is defined. We consider populations of fixed size $\lambda$, which may contain elements several times (multi-sets). We write $\mathcal{X}^\lambda$ to denote the set of all such populations. We only consider Markov processes where each element of the next population is sampled independently with repetition. That is, for each population $P \in \mathcal{X}^\lambda$, there is a distribution $D(P)$ on $\mathcal{X}$ such that given $P_t$, the next population $P_{t+1}$ consists of $\lambda$ elements of $\mathcal{X}$, each chosen independently according to the distribution $D(P_t)$. As all our results hold for any initial population $P_0$, we do not make any assumptions on $P_0$.
In the level-based setting, we assume that there is a partition of $\mathcal{X}$ into levels $A_1, \dots, A_m$. Based on information in particular on how individuals in higher levels are generated, we aim for an upper bound on the first time such that the population contains an element of the highest level $A_m$. The first such result was given in [Leh11]. Improved and easier to use versions can be found in [DL16,CDEL18].
To ease the comparison with our result, we now state the strongest level-based theorem before our work. We note that (i) the time bound has a quadratic dependence on $1/\delta$ and (ii) the population size needs to be $\Omega(\delta^{-2}\log(\delta^{-2}))$.
The proof given in [CDEL18], as the previous proofs of level-based theorems, uses drift theory with an intricate potential function.
We now derive from our multiplicative up-drift theorems a version of the level-based theorem with (tight) linear dependence on $1/\delta$. This theorem is further improved with respect to the version given in [DK19] by only requiring a population size that depends linearly on $1/\delta$ (rather than an at least quadratic dependence as in [DK19] or in the previous-best version given in Theorem 19). To allow such much smaller population sizes to suffice, we need a slightly stronger assumption on making improvements (as can be seen in (G1) and (G2) compared between Theorems 19 and 20, where an additional factor of 1/4 is inserted). We do not see any realistic situations in which the assumptions of Theorem 19 are fulfilled, but ours are not.
For the (technically more demanding) case δ ≤ 1, we show the following result. We treat the easier case δ > 1, not discussed in any previous work, separately at the end of this section.
Theorem 20 (Level-Based Theorem). Consider a population-based process as described in the beginning of this section.
Note that, with $z^* = \min_{j\in[1..m-1]} z_j$ and $\gamma_0$ a constant, (G3) in the previous theorem is satisfied for some $\lambda$ with

as well as for all larger $\lambda$.

We now compare our new level-based theorem with the previous best result (Theorem 19). Since we do not try to optimize constant factors, we do not discuss these (but note that ours are large).
We first observe that as long as $\gamma_0$ can be assumed to be a constant bounded away from 1, then our bound for any values of the variables is at most a constant factor larger than the bound of Theorem 19. When $z_j\lambda$ is large, the $\log^0_2(\cdot)$ expression can degenerate to an expression of order $\log(D_0) = O(\log(1/\delta))$. This cannot happen for the logarithmic expression in the run time bound of Theorem 19; however, even in this case, our bound is of order $O(\log(1/\delta)/\delta)$, whereas the previous best result was $O(\delta^{-2})$. Hence when ignoring constant factors and assuming $\gamma_0 < 1$ a constant, our bound is at least as strong as the previous results.
In terms of asymptotic differences, we first note the improved dependence of the run time guarantee on $\delta$. Ignoring a possible influence of $\delta$ on the logarithmic terms in the run time estimate, the dependence now is only $O(\delta^{-1})$, whereas it was $O(\delta^{-2})$ in the previous result.
The second asymptotic difference concerns the minimum value for $\lambda$ that is prescribed by condition (G3). Note that in both results the run time estimate is a sum of two terms, the first depending linearly on $\lambda$. Consequently, being able to use a smaller population size $\lambda$ can improve the run time. The main difference, again ignoring the logarithmic term in (G3), is that $\lambda$ has to be $\Omega(\delta^{-2})$ in the previous result and only $\Omega(\delta^{-1})$ in ours. The logarithmic terms are more tedious to compare, but clearly ours is asymptotically not larger as long as $\lambda$ is at most exponential in $m$ or at most exponential in $1/z^*$.
We continue by discussing minor differences between the two results. We note that t 0 in our result depends on λ. We thus end up in the slightly annoying situation that in our version, λ appears also in the right-hand side of (G3). However, since λ appears on the right-hand side only inside a logarithm (and one that is at least ln(m)), it is usually not difficult to find solutions for this inequality that lead to an asymptotically optimal value λ.
One key difference is that both (G1) and (G2) impose a condition from the point on when at least γ 0 λ/4 individuals are on a level, whereas the previous level-based theorem (as the conference version of this work) only does so from γ 0 λ on. This additional slack is required to bring down the dependence of λ on 1/δ from essentially quadratic to essentially linear. We do not see any realistic application where the stronger versions of (G1) and (G2) would be harder to show than the previous ones.
In summary, when ignoring constant factors, we do not see any noteworthy downsides of our new result and we did not find any result previously proven via a level-based theorem that could not be proven with our result. At the same time, the superior asymptotics of the run time bound and the minimum requirement on λ in terms of δ clearly are an advantage of our result.
We now proceed with proving the new level-based theorem. We shall use an estimate for the probability that a binomial random variable is at least its expectation. The following result was proven with elementary means in [Doe18]. A very similar result was shown with deeper methods in [GM14].
Lemma 21. Let $n \in \mathbb{N}$ and $p \ge \frac{1}{n}$. Let $X \sim \mathrm{Bin}(n,p)$. Then

We are now ready to state the formal proof of Theorem 20.
Proof. We first note that $t_0 \ge 10^4$, so from (G3), we have

We say that we lose level $j$ if, before having optimized, there is a time $t$ at which there are at least $\gamma_0\lambda$ individuals at least on level $j$, and a later time $t' > t$ such that at that time there are fewer than $\gamma_0\lambda/4$ individuals at least on level $j$.
Our proof proceeds now as follows. First we will condition on never losing a level. We show that we have multiplicative up-drift for the number of individuals on the lowest level which does not have at least γ 0 λ individuals and a simple induction allows us to go up level by level. Then we show that any level which has at least γ 0 λ individuals will not be lost until the optimization ends, with sufficiently high probability.
Since we are only interested in the time until we have the first individual in A m , we may assume that condition (G2) also holds for j = m − 1.
We now analyze how the number of individuals above the highest level with at least $\gamma_0\lambda$ individuals develops. Let a level $j \le m-1$ be given such that $|P \cap A_{\ge j}| \ge \gamma_0\lambda$. We condition on never losing level $j$, that is, on never having fewer than $\gamma_0\lambda/4$ individuals on level $j$ or higher. We let $(X_t)$ be the random process describing the number of individuals on level $j+1$ or higher, that is, we have $X_t = |P_t \cap A_{\ge j+1}|$ for all $t$.
We now distinguish two cases. Suppose first that $z_j\lambda \ge D_0$; this means that we expect at least $D_0$ individuals on the new level in any given iteration. By Lemma 21, we can apply Theorem 18 with $p = \frac{1}{4}$, $n = \gamma_0\lambda$, and $x_{\min} = z_j\lambda$ to see that the level is filled to at least $\gamma_0\lambda$ individuals in an expected time of at most $T_j := 3.6\bigl(4 + \lceil\log^0_2(\gamma_0/z_j)\rceil\lceil 3/\delta\rceil\bigr)$ iterations.
In the second case we have z j λ < D 0 and we want to use Theorem 15, where our target is again to have n = γ 0 λ individuals on level j + 1 or higher. We start by determining a useful E 0 for which we can show Condition (0). From (G1) we have that if X t = 0, then the number Y := X t+1 of individuals sampled in A ≥j+1 follows a binomial law with parameters λ and success probability p ≥ z j .
We now estimate E Since we will later need to bound the inverse of E (j) 0 from above, we note that by δ ≤ 1 and Equation (9). From (G2) we see that when X t > 0, then the number X t+1 of individuals sampled on level j + 1 or higher stochastically dominates a binomial law with parameters λ and (1 + δ)X t /λ. Consequently, we can apply Theorem 15 and estimate that the expected number of generations until there are at least γ 0 λ individuals on level j + 1 or higher is at most (j) 0 + 21.6 1−γ 0 D 0 ln(2D 0 ) + 3.6 log 2 (γ 0 λ)⌈3/δ⌉.
Summing over all levels, we obtain the following bound on the number of steps to reach a search point in $A_m$, still conditional on never losing a level:

We now argue that, with sufficiently high probability, we indeed do not lose a level. Specifically, we show that, from any iteration with at least $\gamma_0\lambda/2$ individuals until the next iteration with at least that many individuals, the probability is at most $\exp\bigl(-\frac{\delta\gamma_0\lambda/2}{169}\bigr)$ that we have an iteration with fewer than $\gamma_0\lambda/4$ individuals in between (which we will call a failure). We distinguish two cases: we either have at least $\gamma_0\lambda$ individuals on the level and above, or fewer. Using a standard Chernoff bound argument on (G2) with $\gamma = \gamma_0$ we see that, for iterations with at least $\gamma_0\lambda$ individuals, the probability to fall below $\gamma_0\lambda/2$ individuals in the next step is at most

This shows that steps with at least $\gamma_0\lambda$ individuals lead to a failure with at most the desired small probability. In the case of fewer than $\gamma_0\lambda$ individuals, just as in the proof of Theorem 3, we want to apply Lemma 11. In the language of Lemma 11, we have $n = \gamma_0\lambda \ge 200/\delta$ using Equation (9). Thus, we can use Lemma 11 to estimate the probability of falling below $\gamma_0\lambda/4$ after having reached at least $D \ge \gamma_0\lambda/2 \ge 100/\delta$ individuals. We thus see that this failure probability is at most

Thus, also in this case the probability of failure is small. Using (G3), we see that the last term is at most $1/(8t_0)$. In order to obtain the overall failure probability over any number of $t$ steps, we can now make a union bound over all intervals, each ranging from one iteration with at least $\gamma_0\lambda/2$ individuals to the next. For this we will pessimistically assume that we have $t$ such intervals within $t$ steps. Thus, we see that the probability of ever losing a level within $2t_0$ steps (twice the conditional expected optimization time, conditional on not losing a level) is at most $p_1 := 0.25$.
By Markov's inequality, conditional on not losing a level, the probability that the optimization takes more than $2t_0$ iterations is at most $p_2 := 0.5$. Thus, with a union bound on the failure probabilities, we get an unconditional probability of successful optimization within $2t_0$ iterations of at least $1 - p_1 - p_2 = 0.25$. Thus, a simple restart argument shows that the expected time (in iterations) for optimization is at most $8t_0$, giving the desired run time bound.
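The proof above invokes Lemma 21 to obtain the probability $p = 1/4$ in the start condition of Theorem 18. The lemma's exact statement is not reproduced here, so the sketch below only assumes a bound of the form $\Pr[X \ge E[X]] \ge 1/4$ for $X \sim \mathrm{Bin}(n,p)$ with $p \ge 1/n$ and checks it exactly, with rational arithmetic, on a small grid of parameters. The function name and the grid are hypothetical choices.

```python
from fractions import Fraction
from math import ceil, comb


def prob_at_least_mean(n: int, p: Fraction) -> Fraction:
    """Exact Pr[X >= E[X]] for X ~ Bin(n, p), computed with rational numbers."""
    threshold = ceil(n * p)  # smallest integer that is at least the expectation
    return sum(
        (comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(threshold, n + 1)),
        Fraction(0),
    )


# Grid of success probabilities p = i / (4n) with 1/n <= p <= 1.
smallest = min(
    prob_at_least_mean(n, Fraction(i, 4 * n))
    for n in range(1, 21)
    for i in range(4, 4 * n + 1)
)
print(float(smallest))
```

A grid check of this kind is no substitute for the proofs in [Doe18] and [GM14]; it merely illustrates that the constant 1/4 is plausible for small $n$.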
We now discuss the case δ > 1. With similar, often easier arguments, we prove the following result.
Theorem 22 (Level-Based Theorem for δ > 1). Consider a population-based process as described in the beginning of this section.
Proof. The proof reuses many arguments from the proof for the case δ ≤ 1. To later apply the second multiplicative up-drift theorem, let D 0 = min{32, γ 0 λ} and note that by our assumption D 0 = 32.
Mildly different from the case $\delta \le 1$, we now say that we lose a level

We again condition on never losing a level and later revoke this assumption with a restart argument. Let $j \in [1..m-1]$ and assume that at some time $t'$ we have $|P_{t'} \cap A_{\ge j}| \ge \gamma_0\lambda$. We analyze how the number of individuals on levels above $j$ develops. To this aim, let $X_t = |P_{t'+t} \cap A_{\ge j+1}|$ for all $t = 0, 1, 2, \dots$. As in the analysis of the case $\delta \le 1$, we distinguish two cases. When $z_j\lambda \ge D_0$, then we can again apply Theorem 18 with $p = 1/4$, $x_{\min} = z_j\lambda$, and $n = \gamma_0\lambda$, showing that the expected time to fill level $j+1$ to at least $\gamma_0\lambda$ elements is at most $1.3/p + 2.6\lceil\log^0_{1+\delta}(n/x_{\min})\rceil \le 7.8 + 2.6\log^0_{1+\delta}(\gamma_0/z_j)$.
If instead we have z j λ < D 0 , we argue as follows.
In either case, $z_j\lambda \ge D_0$ or $z_j\lambda < D_0$, this level filling-up time is at most 101.6 + 657 λz j + 2.6 log 0 in expectation. Summing over all levels, we see that the expected time to, one after the other, fill all levels is at most $t_0$ when we condition on never losing a level. The probability to lose the current level in one iteration, by a simple Chernoff bound and (G2), is at most $\exp(-\frac{1}{4}\gamma_0\lambda)$, since we expect to have at least $(1+\delta)\gamma_0\lambda \ge 2\gamma_0\lambda$ offspring on this level or higher. By (G3), this probability is at most $1/(9t_0)$. By a simple union bound, we see that the probability to lose a level in $3t_0$ iterations is at most 1/3. Conditional on never losing a level, the probability to not find a search point in $A_m$ in the first $3t_0$ iterations is at most 1/3 by Markov's inequality. Hence with probability at least 1/3, we find the desired solution in $3t_0$ iterations. A simple restart argument with an expected number of three restarts now shows $E[T] \le \lambda \cdot 9t_0$ as claimed.

Applications
With the improved level-based theorem, we easily obtain the following three results. The first two improve previous results that were obtained via level-based theorems in the case of small δ. The last result shows that our level-based theorem for the case δ > 1 can lead to results better than what was known before for the case δ ≤ 1 (including using δ ≤ 1 when δ actually is larger).

Fitness-Proportionate Selection
Dang and Lehre [DL16] show that fitness-proportionate selection can be efficient when the mutation rate is very small. This is in contrast to previous results showing that, for the standard mutation rate $1/n$, fitness-proportionate selection can lead to exponential run times [HJKN08,NOW09]. More precisely, Dang and Lehre regard the (λ, λ) EA with fitness-proportionate selection for variation and standard bit mutation as variation operator (Algorithm 1). Here fitness-proportionate selection (with respect to a non-negative fitness function $f$) means that from a given population $x_1, \dots, x_\lambda$ we choose a random element such that $x_i$ is chosen with probability $f(x_i)/\sum_{j=1}^{\lambda} f(x_j)$. When $\sum_{j=1}^{\lambda} f(x_j)$ is zero, we choose an individual uniformly at random.

Theorem 23. Consider the (λ, λ) EA with fitness-proportionate selection,
• with population size $\lambda \ge cn\ln(n)$ with $c$ sufficiently large and $\lambda = O(n^K)$ for some constant $K$, and
• mutation rate $p_{\mathrm{mut}} \le \frac{1}{4n^2}$ and $p_{\mathrm{mut}} = \Omega(n^{-k})$ for some constant $k$.
Then this algorithm optimizes OneMax in an expected number of $O(\lambda n^2 \log n + n\log(n)/p_{\mathrm{mut}})$ fitness evaluations, which is $O(n^3 (\log n)^2)$ for optimal parameter choices.
It optimizes LeadingOnes in an expected number of $O(\lambda n^2 \log n + n^2/p_{\mathrm{mut}})$ fitness evaluations, which becomes $O(n^4)$ with optimal parameter choices.
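The selection rule described before the theorem, including the uniform fallback for zero total fitness, might be sketched in Python as follows. Algorithm 1 is not reproduced in this section, so the generation step below (λ offspring, each created from an independently selected parent, replacing the whole population) is an assumption about the (λ, λ) EA, and all function names are illustrative.

```python
import random


def onemax(x):
    """Number of one-bits in the bit string x."""
    return sum(x)


def fitness_proportionate_choice(population, f, rng):
    """Pick x_i with probability f(x_i) / sum_j f(x_j)."""
    weights = [f(x) for x in population]
    if sum(weights) == 0:
        # Zero total fitness: fall back to a uniform choice.
        return rng.choice(population)
    return rng.choices(population, weights=weights, k=1)[0]


def standard_bit_mutation(x, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return tuple(b ^ (rng.random() < p_mut) for b in x)


def next_generation(population, f, p_mut, rng):
    """One assumed (lambda, lambda) step: select a parent, mutate, repeat."""
    return [
        standard_bit_mutation(
            fitness_proportionate_choice(population, f, rng), p_mut, rng
        )
        for _ in population
    ]
```

With a mutation rate as small as $p_{\mathrm{mut}} \le 1/(4n^2)$, most calls to `standard_bit_mutation` return an unchanged copy of the parent.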
To show (G1), assume that we have at least $\gamma_0\lambda/4$ individuals with fitness at least $j$ for some $j \in [0..n-1]$. Since the selection operator favors individuals with higher fitness, the probability that the parent of a particular offspring has fitness at least $j$ is at least $\gamma_0/4$. Assume that such a parent was chosen (and that it does not have fitness $n$, since we would be done then anyway). If the parent has fitness exactly $j$, then the probability to generate a strictly better search point is at least $(n-j)p_{\mathrm{mut}}(1-p_{\mathrm{mut}})^{n-1} \ge (n-j)p_{\mathrm{mut}}(1-(n-1)p_{\mathrm{mut}}) = (n-j)p_{\mathrm{mut}}(1-o(1))$ by Bernoulli's inequality and $p_{\mathrm{mut}} = o(\frac{1}{n})$. If the parent already has a fitness of $j+1$ or better, then the probability to generate an offspring of fitness $j+1$ or better is even higher: by flipping no bit at all, such an offspring is generated with probability at least $(1-p_{\mathrm{mut}})^n \ge 1-np_{\mathrm{mut}} = 1-o(1)$. Hence in either case we have (G1) satisfied with $z_j = (n-j)\gamma_0 p_{\mathrm{mut}}(1-o(1))/4$.
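The two Bernoulli-type estimates in this argument can be checked with exact rational arithmetic. The Python sketch below is a sanity check, not part of the proof; the helper name `g1_bounds` is hypothetical and the mutation rate $1/(4n^2)$ is just the largest value the theorem allows.

```python
from fractions import Fraction


def g1_bounds(n, j, p_mut):
    """Return the quantities compared in the (G1) argument.

    single_flip : probability to flip exactly one of the n - j zero-bits
    bernoulli   : its lower bound obtained from Bernoulli's inequality
    copy_prob   : probability to flip no bit at all
    copy_bound  : its lower bound 1 - n * p_mut
    """
    single_flip = (n - j) * p_mut * (1 - p_mut) ** (n - 1)
    bernoulli = (n - j) * p_mut * (1 - (n - 1) * p_mut)
    copy_prob = (1 - p_mut) ** n
    copy_bound = 1 - n * p_mut
    return single_flip, bernoulli, copy_prob, copy_bound


n = 20
p_mut = Fraction(1, 4 * n * n)
for j in range(n):
    single_flip, bernoulli, copy_prob, copy_bound = g1_bounds(n, j, p_mut)
    assert single_flip >= bernoulli
    assert copy_prob >= copy_bound
```

Using `Fraction` avoids floating-point rounding, which matters here because the two sides of each inequality differ only by terms of order $p_{\mathrm{mut}}^2$.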
To show (G2), let $j \in [0..n-2]$, $\gamma \in (0, \gamma_0]$ and $P$ be a population such that at least $\gamma\lambda$ individuals have a fitness of at least $j+1$ and at least $\gamma_0\lambda/4$ individuals have a fitness of at least $j$. Let $F^+$ be the sum of the fitness values of the individuals of fitness at least $j+1$ and let $F^- = \sum_{x \in P} f(x) - F^+$ be the sum of the remaining fitness values. By our assumption, $F^+ \ge \gamma\lambda(j+1)$. The probability that an individual of fitness $j+1$ or more is chosen as parent of a particular offspring is at least $\frac{\gamma\lambda(j+1)}{\gamma\lambda(j+1) + (1-\gamma)\lambda j} = \gamma\left(1 + \frac{1-\gamma}{j+\gamma}\right) \ge \gamma\left(1 + \frac{1/2}{j+1/2}\right) \ge \gamma\left(1 + \frac{1}{2n}\right)$.
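The chain of estimates for the selection probability can be verified term by term with exact fractions. The sketch below is only a sanity check under the assumption $\gamma_0 = 1/2$ (the value mentioned for the LeadingOnes case); the function name `selection_chain` is hypothetical, and λ is omitted since it cancels.

```python
from fractions import Fraction


def selection_chain(n, j, gamma):
    """Return the four expressions of the (G2) estimate, left to right."""
    half = Fraction(1, 2)
    first = gamma * (j + 1) / (gamma * (j + 1) + (1 - gamma) * j)
    second = gamma * (1 + (1 - gamma) / (j + gamma))
    third = gamma * (1 + half / (j + half))   # uses gamma <= 1/2
    fourth = gamma * (1 + Fraction(1, 2 * n))  # uses j <= n - 2
    return first, second, third, fourth


n = 10
for j in range(n - 1):  # j in [0..n-2]
    for gamma in (Fraction(1, 100), Fraction(1, 4), Fraction(1, 2)):
        first, second, third, fourth = selection_chain(n, j, gamma)
        assert first == second >= third >= fourth
```

The last inequality is where the value $\delta = \Theta(1/n)$ for this application originates.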
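Consequently, we can employ Theorem 20 and derive an expected optimization time of $E[T] = O\left(\frac{\lambda m \log \lambda}{\delta} + \frac{1}{\delta}\sum_{j=0}^{n-1} \frac{1}{(n-j)p_{\mathrm{mut}}}\right) = O(\lambda n^2 \log n + n\log(n)/p_{\mathrm{mut}})$.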
For $f$ being the LeadingOnes function, we take the same partition of the search space and also $\gamma_0 = \frac{1}{2}$. With similar arguments as above, we show (G1) with $z_j = \gamma_0 p_{\mathrm{mut}}(1-o(1))/4$. The proof of (G2) remains valid without changes, since the central argument was that with sufficiently high probability a copy of the parent is generated (hence again we have $\delta = \Theta(1/n)$). The proof of (G3) remains valid since we estimated the $z_j$ uniformly as $z_j = \Omega(p_{\mathrm{mut}})$. Consequently, we obtain from Theorem 20 that the optimization time $T$ satisfies $E[T] = O\left(\frac{\lambda m \log \lambda}{\delta} + \frac{m}{\delta p_{\mathrm{mut}}}\right) = O(\lambda n^2 \log n + n^2/p_{\mathrm{mut}})$. This is $O(n^4)$ for $\lambda = O(n^2/\log n)$ and $p_{\mathrm{mut}} = \Theta(n^{-2})$.

Partial Evaluation
Dang and Lehre [DL16] also considered a different parent selection mechanism, 2-tournament selection, where a parent is chosen by picking two individuals uniformly at random and letting the fitter one produce one offspring (see Algorithm 2).
The test functions they considered were OneMax and LeadingOnes under partial evaluation (a scheme for randomizing a given function), which we here define only for OneMax. Given a parameter $c \in (0, 1)$, we use $n$ i.i.d. random variables $(R_i)_{i \le n}$, each Bernoulli-distributed with parameter $c$. $\mathrm{OneMax}_c$ is defined such that, for all bit strings $x \in \{0,1\}^n$, $\mathrm{OneMax}_c(x) = \sum_{i=1}^{n} R_i x_i$. In other words, a bit string has a value equal to the number of 1s in it, where each 1 only counts with probability $c$.
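A single evaluation of OneMax under partial evaluation might look as follows in Python. This is a minimal sketch that assumes the $R_i$ are drawn afresh for every evaluation; the function name `onemax_partial` and the explicit `rng` argument are illustrative choices.

```python
import random


def onemax_partial(x, c, rng):
    """One noisy evaluation of OneMax_c: sum of R_i * x_i.

    Each R_i is an independent Bernoulli(c) variable, drawn here for
    every position and (by assumption) anew for every call.
    """
    return sum(xi * (rng.random() < c) for xi in x)


rng = random.Random(42)
x = (1, 1, 0, 1, 0, 1)
value = onemax_partial(x, 0.5, rng)
assert 0 <= value <= sum(x)
```

In expectation, one evaluation returns $c$ times the OneMax value of `x`, so fitness comparisons between two individuals become noisy.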
1 Initialize $P_0$ as multi-set of λ individuals chosen independently and uniformly at random from $\{0,1\}^n$;
2 for $t = 1, 2, 3, \dots$ do
3   $P_t \leftarrow \emptyset$;
4   for $i = 1$ to λ do
5     select $x_0, x_1 \in P_{t-1}$ uniformly at random;
6     select $x \in \{x_0, x_1\}$ with maximal fitness (breaking ties uniformly);
7     generate $y$ from $x$ by flipping each bit independently with probability $p_{\mathrm{mut}}$;
8     $P_t \leftarrow P_t \cup \{y\}$;

Dang and Lehre [DL16] showed the following statement as part of their core proof [DL16, proof of Theorem 21] regarding the performance of Algorithm 2 on $\mathrm{OneMax}_c$.
Lemma 24. Let $n$ be large and $c \in (1/n, 1)$. Then there is an $a$ such that, for all $\gamma \in (0, 1/2)$, the probability to produce an offspring (line 7 of Algorithm 2) of at least the quality of the $\gamma\lambda$-ranked individual of the current population is at least $\gamma(1 + a\sqrt{c/n})$.
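One iteration of the outer loop of Algorithm 2, to which the lemma refers, could be sketched in Python as follows. The function name `tournament_generation` is hypothetical, and evaluating each of the two tournament participants exactly once per tournament is an assumption that only matters when `fitness` is randomized, as under partial evaluation.

```python
import random


def tournament_generation(population, fitness, p_mut, rng):
    """Create the next population via 2-tournament selection and mutation."""
    offspring = []
    for _ in range(len(population)):
        # Pick two individuals uniformly at random (with replacement).
        x0 = rng.choice(population)
        x1 = rng.choice(population)
        f0, f1 = fitness(x0), fitness(x1)
        # The fitter one becomes the parent; ties are broken uniformly.
        if f0 > f1:
            parent = x0
        elif f1 > f0:
            parent = x1
        else:
            parent = rng.choice((x0, x1))
        # Standard bit mutation: flip each bit independently.
        child = tuple(b ^ (rng.random() < p_mut) for b in parent)
        offspring.append(child)
    return offspring


rng = random.Random(3)
pop = [tuple(rng.randrange(2) for _ in range(8)) for _ in range(10)]
new_pop = tournament_generation(pop, sum, 0.0, rng)
assert len(new_pop) == len(pop)
```

With mutation rate 0, as in the usage lines, every offspring is a copy of some member of the previous population, so only the selection step is exercised.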
Using their old level-based theorem (with a dependence on δ of order $\delta^{-5}$) and the best possible choice for λ, they obtain a bound for the expected number of fitness evaluations until optimizing OneMax with partial evaluation with parameter $c \ge 1/n$ of $O\left(\frac{n^{4.5}\log n}{c^{3.5}}\right)$.
Using the more refined level-based theorem from [CDEL18], see Theorem 19 (with a quadratic dependence on $1/\delta$), one can find a run time bound of $O\left(\frac{n^3 \log n}{c^2}\right)$.
With our level-based theorem given in Theorem 20 (with a linear dependence on $1/\delta$), one can prove a run time bound of $O\left(\frac{n^2 (\log n)^2}{c}\right)$.
Analogous improvements can be found in the case of LeadingOnes.
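To get a feeling for the gap between the three OneMax bounds, one can evaluate the expressions inside the $O(\cdot)$ for concrete values. The Python sketch below ignores all leading constants, which the asymptotic statements do not specify, so only the growth rates are meaningful; the function names are illustrative.

```python
import math


def bound_dl16(n, c):
    """n^4.5 log(n) / c^3.5, from the original level-based theorem."""
    return n ** 4.5 * math.log(n) / c ** 3.5


def bound_cdel18(n, c):
    """n^3 log(n) / c^2, from the level-based theorem of [CDEL18]."""
    return n ** 3 * math.log(n) / c ** 2


def bound_new(n, c):
    """n^2 log(n)^2 / c, from the level-based theorem of this work."""
    return n ** 2 * math.log(n) ** 2 / c


n, c = 1000, 0.5
assert bound_new(n, c) < bound_cdel18(n, c) < bound_dl16(n, c)
# The improvement over [CDEL18] is a factor of n / (c log n).
assert math.isclose(bound_cdel18(n, c) / bound_new(n, c), n / (c * math.log(n)))
```

The improvement factor $n/(c \log n)$ grows as $c$ shrinks, so the gain is largest when only a small fraction of the bits is evaluated.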

Using δ > 1
In all applications of the level-based theorem in the literature, only the case of δ ≤ 1 was used; in fact, the level-based theorem from [CDEL18] does not give a version that can benefit from δ > 1 (however, it can always be applied with δ = 1 instead of the true δ). We note the following result, which can be improved by taking δ > 1 into account. Consider optimizing the LeadingOnes benchmark function using a (µ, λ) EA with ranking selection and standard bit mutation. When $\lambda \ge 2e\mu$ and $\lambda \ge c\log(n)$ for some specific constant $c$, then an expected run time of $O(n^2 + n\lambda\log(\lambda))$ fitness evaluations is proven in [CDEL18, Theorem 3(2)]. We easily see that in this case, using the partition of the search space into sets of equal fitness, we have $z_j = \Omega(1/n)$ for all $j \in [0..n-1]$ and $\delta = \lambda/(e\mu)$.
Using our level-based theorem for δ > 1 (Theorem 22), we obtain the slightly better bound of $O(n^2 + n\lambda\log_{1+\lambda/(e\mu)}(\lambda))$, since the time to fill up a level becomes shorter if λ is asymptotically larger than µ. For example, for $\mu = n$ and $\lambda = n^{1.5}$, we can now derive an optimization time of $O(n\lambda) = O(n^{2.5})$, while the previous result was $O(n\lambda\log(\lambda)) = O(n^{2.5}\log(n))$.
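The example rests on the observation that $\log_{1+\lambda/(e\mu)}(\lambda)$ stays bounded for $\mu = n$ and $\lambda = n^{1.5}$, whereas $\log(\lambda)$ grows with $n$. A short Python check of this arithmetic, with illustrative function names, could read:

```python
import math


def log_factor_old(lam):
    """Logarithmic factor log(lambda) in the bound of [CDEL18]."""
    return math.log(lam)


def log_factor_new(lam, mu):
    """Logarithmic factor log_{1 + lambda/(e*mu)}(lambda) from Theorem 22."""
    return math.log(lam) / math.log(1 + lam / (math.e * mu))


for n in (10 ** 4, 10 ** 6, 10 ** 8, 10 ** 12):
    mu, lam = n, n ** 1.5
    assert lam >= 2 * math.e * mu            # condition of the cited result
    assert 3 < log_factor_new(lam, mu) < 5   # bounded by a constant
    assert log_factor_old(lam) > 10          # keeps growing with n
```

As $n$ grows, the new factor tends to 3, since $\log(n^{1.5}) / \log(\sqrt{n}/e)$ approaches $1.5/0.5$.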

Conclusion
In this work, we prove three drift results for multiplicatively increasing drift. Since the desired hitting time bound of order $\log(n)/\min\{\delta, \log(1+\delta)\}$, which implies that the process behaves similarly to the deterministic process, can only be obtained under additional assumptions, we formulate our results for processes in which each state $X_{t+1}$ is distributed according to a binomial distribution with expectation $(1+\delta)X_t$ (or better, in the domination sense).
As main application for our drift results, we prove a stronger version of the level-based theorem. It in particular has the asymptotically right dependence on $1/\delta$, which is near-linear. Previous level-based theorems only show a dependence roughly of order $\delta^{-5}$ [DL16] or $\delta^{-2}$ [CDEL18]. This difference can be significant in applications with small δ, e.g., the result on fitness-proportionate selection [DL16], which has $\delta = \Theta(1/n)$.
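A process of this binomial type is easy to simulate, which can help to build intuition for how close its hitting time is to that of the deterministic process $x \mapsto (1+\delta)x$. The Python sketch below is an illustration under simplifying assumptions, not the setting of any specific theorem: the number of trials `trials` is a free parameter, the success probability is capped at 1, and a run that reaches 0 is reported as `None` instead of being restarted.

```python
import random


def binomial(trials, prob, rng):
    """Sample Bin(trials, prob) as a sum of independent Bernoulli trials."""
    return sum(rng.random() < prob for _ in range(trials))


def hitting_time(x0, target, trials, delta, rng, max_steps=10_000):
    """First t with X_t >= target, where X_{t+1} ~ Bin(trials, (1+delta) X_t / trials).

    Returns None if the process reaches 0 or exceeds max_steps.
    """
    x = x0
    for t in range(max_steps + 1):
        if x >= target:
            return t
        if x == 0:
            return None  # absorbed at 0 in this simplified model
        x = binomial(trials, min(1.0, (1 + delta) * x / trials), rng)
    return None


def deterministic_time(x0, target, delta):
    """Steps the deterministic process x -> (1+delta) x needs to reach target."""
    t, x = 0, float(x0)
    while x < target:
        x *= 1 + delta
        t += 1
    return t


rng = random.Random(7)
runs = [hitting_time(5, 80, 100, 0.5, rng) for _ in range(20)]
print(runs, deterministic_time(5, 80, 0.5))
```

Comparing the printed hitting times with the deterministic value shows the kind of "behaves similarly to the deterministic process" statement the drift theorems make precise.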
An equally interesting advance from our new level-based theorem is that its relatively elementary proof gives more insight into the actual development of such processes. It thus tells us in a more informative manner how certain population-based algorithms optimize certain problems. Such additional information can be useful to detect bottlenecks and improve algorithms. Also, the individual building blocks of our drift analysis may find separate applications.
In terms of future work, we note that there are processes showing multiplicative up-drift where the next state is not described by a binomial distribution. One example is population-based algorithms using plus-selection, where, roughly speaking, $X_{t+1} \sim X_t + \mathrm{Bin}(\lambda, X_t/\lambda)$. We are optimistic that such processes can be handled with our methods as well. We did not do this in this first work on multiplicative up-drift since such processes can also be analyzed with elementary methods, e.g., exploiting that the process is non-decreasing and with constant probability attains the expected progress. Nevertheless, extending our drift theorems to such processes should give better constants and a more elegant analysis, so we feel that this is also an interesting goal for future work.