1 Introduction

In a typical situation in evolutionary search, an algorithm first makes good progress while far away from the target, since a lot can still be improved. As the search focuses more and more on the fine details, progress slows and finding improving moves becomes rarer. Thus, the expected progress is typically an increasing function of the distance from the optimum. However, there are also many processes where this situation is reversed. For example, for heuristics involving a population, once a superior individual is found, this improvement needs to be spread over the population. This process gains speed when more individuals exist with the improvement.

Turning expected progress into an expected first hitting time is the purpose of drift theorems (see the recent survey [30] for a thorough introduction to drift analysis). For example, the additive drift theorem [21, 22] requires a uniform lower bound \(\delta\) on the expected progress (the drift) and gives an expected first hitting time of at most \(n/\delta\), where n is the initial distance from the optimum. This theorem can also be applied when the drift is changing during the process, but since a uniform \(\delta\) is used in the argument, the additive drift theorem cannot be used to exploit a stronger drift later in the process.

A first step towards profiting from a changing drift behavior was the multiplicative drift theorem [5, 6]. It assumes that the drift is at least \(\delta x\) when the distance from the optimum is x, for some factor \(\delta <1\). The first hitting time can then be bounded by \(O(\log (n)/\delta )\), where n is again the initial distance from the optimum. Clearly, this gives a much better bound than what could be shown via the additive drift theorem in this setting. Multiplicative drift can be found in many optimization processes, making the multiplicative drift theorem one of the most useful drift theorems.

To cope with a broader variety of changing drift patterns, the variable drift theorem [25, 32] has been developed. However, while there are several variants of this drift theorem, most of them require that the strength of the drift is a monotone increasing function in the distance from the optimum (the farther away from the optimum, the easier it is to make progress).

In this paper we are concerned with the reverse setting where drift is a decreasing function of the distance from the optimum. This has been considered only for a few variable drift theorems, and all of them essentially require step-size bounded processes. The most recent formulation of this can be found in [36]. We want to consider processes which are not step-size bounded, so this drift theorem cannot be usefully applied.

While many drift theorems are phrased such that the aim is to reach the point zero, for our setting it is more natural to consider the case of reaching some target value n starting at a value of 1, and to suppose that the drift is \(\delta x\) going up (for the multiplicative drift theorem, we had a drift of \(\delta x\) going down). Thus, we call our resulting drift theorem the multiplicative up-drift theorem.

Making things more formal, consider a random process \((X_{t})_{t \in {\mathbb {N}}}\) over positive reals starting at \(X_{0} = 1\) and with target \(n > 1\). We speak of multiplicative up-drift if there is a \(\delta > 0\) such that, for all \(t \ge 0\), we have the drift condition

(D):

\(E[X_{t+1} - X_{t} \mid X_{t}] \ge \delta X_{t}\).

Note that this is equivalent to

(D’):

\(E[X_{t+1} \mid X_{t}] \ge (1+\delta ) X_{t}\).

One trivial case of any drift process is the deterministic process with the desired gain per iteration. We briefly discuss this case now as it gives the right impression of what should be a natural expected first hitting time for a well-behaved process exhibiting multiplicative up-drift.

Example 1

Let \(\delta > 0\). Suppose \(X_{0} = 1\) and, for all t, \(X_{t+1} = (1+\delta ) X_{t}\) with probability 1. Then this process satisfies the drift condition (D) with equality. Clearly, the time to reach a value of at least n is \(\lceil \log _{1+\delta }(n) \rceil\). For small \(\delta\), this is approximately \(\log (n) / \delta\), for large \(\delta\), it is approximately \(\log (n) / \log (\delta )\). We note here already that we will be mostly concerned with the case where \(\delta\) is small. This case is the harder one since the progress is weaker, and thus there is a greater need for stronger analysis tools in this case.

Unfortunately, not all processes with multiplicative up-drift have a hitting time of \(O(\log (n) / \delta )\), as the following example shows.

Example 2

Let \(\delta > 0\). Suppose \(X_{0} = 1\) and, for all t, \(X_{t+1} = n\) with probability \(\delta /(n-1)\) (which we term a success) and \(X_{t+1} = 1\) otherwise. Again, the drift condition (D) is satisfied with equality (while the target n is not reached). The time for the process to hit the target n is thus geometrically distributed with probability \(\delta /(n-1)\), giving an expected time of \((n-1)/\delta = \Theta (n/\delta )\) iterations, significantly more than the \(O(\log (n)/\delta )\) seen in the deterministic process.

Note that for this process the additive drift theorem immediately gives the upper bound of \(O(n/\delta )\) since we always have a drift of at least \(\delta\) towards the target. Hence Example 2 describes a process where the stronger assumption of multiplicative up-drift does not lead to a better hitting time.
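To make the contrast between Examples 1 and 2 concrete, the following minimal simulation (a sketch; the parameter choices \(n = 1000\) and \(\delta = 0.1\) are ours and purely illustrative) estimates the hitting times of both processes.

```python
import random

def hitting_time_example1(n, delta):
    """Example 1: deterministic process X_{t+1} = (1+delta) X_t, started at 1."""
    x, t = 1.0, 0
    while x < n:
        x *= 1 + delta
        t += 1
    return t

def hitting_time_example2(n, delta, rng):
    """Example 2: X_{t+1} = n with probability delta/(n-1), else back to 1."""
    t = 1
    while rng.random() >= delta / (n - 1):  # geometric waiting time for a success
        t += 1
    return t

rng = random.Random(42)
n, delta = 1000, 0.1
runs = 300
mean2 = sum(hitting_time_example2(n, delta, rng) for _ in range(runs)) / runs
print(hitting_time_example1(n, delta))  # ceil(log_{1+delta}(n)) = 73, about log(n)/delta
print(mean2)                            # about (n-1)/delta = 9990
```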

Our first main result (Theorem 3) shows that the targeted bound of \(O(\log (n) / \delta )\), which as we saw is optimal when we want to cover the deterministic process given in Example 1, can be obtained when strengthening condition (D) by assuming (i) that, given \(X_{t}\), the next state \(X_{t+1}\) is at least (in the stochastic domination sense) binomially distributed with expectation \((1+\delta ) X_{t}\), and (ii) that the process never reaches state 0. The first condition is very natural. When generating offspring independently, the number of offspring satisfying a particular desired property is binomially distributed. The second condition is a technical necessity. From the up-drift condition alone, we cannot infer any progress from state 0. Consequently, 0 could well be an absorbing state, resulting in an infinite hitting time if this state can be reached with positive probability.

In quite a few applications, however, we cannot rule out that the random process reaches state 0. For example, when regarding the subpopulation of individuals having some desired property in an algorithm using comma selection, this subpopulation might die out completely in one iteration (though often with small probability only). To cover also such processes, in our second drift theorem (Theorem 16) we extend our Theorem 3 to allow that state 0 is reached, with at most the probability that can be deduced from the up-drift and the binomial distribution conditions. To avoid that state 0 is absorbing, we add an additional condition governing how this state 0 is left again (see Theorem 16 for the precise statement).

As mentioned before, a main application for multiplicatively increasing drift towards the optimum is the analysis of how fit individuals spread in a population. This particular setting was previously analyzed via the level-based theorem [2, 9, 29], modeled after the method of fitness-based partitions [38]. Essentially, the search space is partitioned into an ordered sequence of levels. The ongoing search process increases the probability that a newly created individual is at least on a given level; once this probability is sufficiently high, there is a good chance that the individual is on an even higher level. We restate the details of this theorem in the version from [2] in Theorem 20 below. The level-based theorem was originally intended for the analysis of non-elitist population-based algorithms [9], but has since also been applied to EDAs, namely to the UMDA in [10] and, with some additional arguments, to PBIL in [31].

We use our second multiplicative up-drift theorem (Theorem 16) to prove a new version of the level-based theorem (Theorem 21). This new theorem allows us to derive better asymptotic bounds under mostly weaker conditions: The dependence of the run time on \(1/\delta\) is reduced from near-quadratic to near-linear, and the minimum population size \(\lambda\) required for the result to hold is reduced from super-quadratic in \(1/\delta\) to near-linear in \(1/\delta\). Since the run time often is linear in \(\lambda\), this can give a further run time improvement. Our upper bounds almost match the lower-bound example given in [2] and, in particular, match the asymptotic dependence on \(\delta\) displayed by this example.

Our version of the level-based theorem can be applied in all settings where the previous-best level-based theorems were used. It leads to better results when \(\delta\) is small. In Sect. 4, we analyze two such situations from previous analyses of non-elitist evolutionary algorithms on standard test functions. The first test function is called OneMax and maps a given bit string to the number of 1s in that bit string, thus simulating a unimodal optimization problem solvable by simple hill climbing. The second test function is called LeadingOnes and maps a bit string to the number of 1s appearing in the bit string before the first 0 (if any); this simulates an optimization problem requiring the sequential optimization of different subparts. Our results are as follows. (i) We prove that the \((\lambda ,\lambda )\) EA with fitness-proportionate selection and suitable parameters can optimize the OneMax and LeadingOnes functions in expected time \(O(n^3 \log ^2 n)\) and \(O(n^4)\), respectively, improving over the previous-best published bound of \(O(n^8 \log n)\). (ii) We prove that the \((\lambda ,\lambda )\) EA with 2-tournament selection and suitable parameters in the restricted setting that only a constant fraction of the bits of the search points are evaluated finds the optimum of OneMax in \(O(n^{2.5} \log ^2 n)\) iterations. The previous-best published bound here is \(O(n^{4.5} \log n)\).

We also use our methods to obtain a level-based theorem for the case that \(\delta\) is large (Theorem 23). This case was not covered by the previous-best level-based theorems and our theorem now allows us to exploit larger values of \(\delta\) to obtain asymptotically stronger run time guarantees. As an example we show (in Sect. 4.3) that the \((\mu , \lambda )\) EA with \(\mu = n\) and \(\lambda =n^{1.5}\) on the LeadingOnes benchmark function using ranking selection and standard bit mutation has an optimization time of \(O(n^{2.5})\). This is asymptotically better than the previously known bound of \(O(n^{2.5} \log (n))\) and also shows more explicitly how optimization proceeds.

Beyond these particular results, our modular proof (first analyzing the multiplicative up-drift excluding 0, then including 0, then applying it in the context of the level-based theorem) shows the level-based theorem in a way that is more accessible than the previous versions and that gives more insight into population-based optimization processes.

In particular, our proof suggests that the behavior of the process under the named conditions is as follows.

  • Once a critical mass in a level is reached, this level is never again abandoned. Thus, we can focus in our analysis on having a critical mass of individuals in one level and analyze the time it takes to gain a critical mass in the next level.

  • Reaching a critical mass in the next level consists of two steps.

    1.

      When few elements are in the next level, these elements regularly go extinct and need to be respawned until the initial population on this level, behaving mostly like an unbiased random walk, gains a moderate number of elements.

    2.

      With this moderate number of elements, the bias of the random walk is large enough to make a significant decrease of the population unlikely; instead, the number of elements increases steadily, as can be shown using a concentration bound for submartingales, so that we quickly gain a critical mass in the next level.

We are optimistic that this increased understanding of population-based processes helps in the future design and analysis of such processes.

2 Multiplicative Up-Drift Theorems

In this section we prove three multiplicative up-drift theorems. The first is concerned with processes that cannot reach the value 0 (which could be absorbing if only a multiplicative up-drift assumption is made); the second one extends the first theorem to include also the possibility of going down to 0 (but making an additional assumption on how state 0 is left). The third does the same, but exploits the assumption that, with some positive probability, state 0 is left to a state from which, with constant probability, we make strong multiplicative progress in every iteration until the process reaches the target (as opposed to a behavior closer to an unbiased random walk).

Note that our theorems essentially deal with martingales, but still we suppress the mention of conditioning on all previous members of the given process (i.e. the natural filtration) to improve readability.

2.1 Processes on the Positive Integers

As discussed in the introduction, an expected multiplicative increase as described by (D) is not enough to ensure the run time we aim at. For this reason, we assume that there is a number k such that, conditional on \(X_{t}\), the next state \(X_{t+1}\) is binomially distributed with parameters k and \((1+\delta ) X_{t} / k\). Note that this implies (D). Since often precise distributions are hard to specify, we only require that \(X_{t+1}\) is at least as large as this binomial distribution, that is, we require that \(X_{t+1}\) stochastically dominates \({{\,{\mathrm{Bin}}\,}}(k, (1+\delta ) X_{t} / k)\). See [12] for an introduction to stochastic domination and its use in run time analysis. To avoid that the process reaches the possibly absorbing state 0, we explicitly forbid this, that is, we require that all \(X_{t}\) take values only in the positive integers.

Under these conditions, we analyze the time the process takes to reach or overshoot a given state n. For technical reasons, we require that n is not too close to k, that is, that there is a constant \(\gamma _0 < 1\) such that \(n-1 \le \gamma _0 k\). For the trivial reason that the condition \(X_{t+1} \succeq {{\,{\mathrm{Bin}}\,}}(k, (1+\delta ) X_{t} / k)\) does not make sense for \(X_{t} > (1+\delta )^{-1} k\), we also require \(n-1 \le (1+\delta )^{-1} k\). For all such n, we show that an expected number of \(O(\log (n)/\delta )\) iterations suffices to reach n when \(\delta \le 1\) and \(O(\log (n)/\log (1+\delta ))\) iterations suffice for \(\delta > 1\). More precisely, we show the following estimate.

Theorem 3

(First Multiplicative Up-Drift Theorem) Let \((X_{t})_{t \in {\mathbb {N}}}\) be a stochastic process over the positive integers. Assume that there are \({n,k \in {\mathbb {Z}}_{\ge 1}}\), \(\gamma _0 < 1\), and \(\delta > 0\) such that \(n -1 \le \min \{\gamma _0 k, (1+\delta )^{-1} k\}\) and for all \(t \ge 0\) and all \(x \in \{1, \dots , n-1\}\) with \(\Pr [X_{t} = x] > 0\) we have the binomial condition

(Bin):

\((X_{t+1} \mid X_{t} = x) \succeq {\mathrm {Bin}}(k,(1+\delta ) x/k)\).

Let \(T := \min \{t \ge 0 \mid X_{t} \ge n\}\).

  (i)

    If \(\delta \le 1\), then with \(D_0 = \min \{\lceil 100/\delta \rceil , n\}\) we have

    $$\begin{aligned} E[T] \le \tfrac{15}{1-\gamma _0} D_0 \ln (2 D_0)+ 2.5 \log _2(n) \lceil 3 / \delta \rceil . \end{aligned}$$

    If \(n > 100/\delta\), then we also have that once the process has reached a state of at least \(100/\delta\), the probability to ever return to a state of at most \(50/\delta\) is at most 0.5912.

  (ii)

    If \(\delta > 1\), then

    $$\begin{aligned} E[T] \le 2.6 \log _{1+\delta }(n) + 81. \end{aligned}$$

    In addition, once the process has reached state 32 or higher, the probability to ever return to a state lower than 32 is at most \(\tfrac{1}{e(e-1)} < 0.22\).

For the analysis we will employ Lemma 9 from Sect. 2.1.4 essentially for the time spent below \(D_0\). Note that this lemma all by itself, in case of \(\delta \le 1\) and \(n \le D_0\), gives the stronger bound \(E[T] \le \frac{6n \ln (2n)}{1-\gamma _0}\).

Since the case \(\delta \le 1\) is significantly more complicated, we focus on this case in Sects. 2.1.1 to 2.1.6 and discuss the case \(\delta > 1\) only in Sect. 2.1.7.

2.1.1 A Motivating Example

Before proving this result, let us give a simple example of a possible application. Consider the following elitist \((\mu ,\lambda )\) EA. It starts with a parent population of \(\mu\) individuals chosen uniformly and independently from \(\{0,1\}^n\). In each iteration, it generates \(\lambda\) offspring, each by independently and uniformly choosing a parent individual and mutating it via standard bit mutation with the usual mutation rate 1/n. If the offspring population contains at least one individual that is at least as good as the best parent (in terms of fitness), then the new parent population is chosen by selecting \(\mu\) best offspring (breaking ties arbitrarily). If all offspring are worse than the best parent, then the new parent population is composed of a best individual from the old parent population and \(\mu -1\) best offspring (again, breaking all ties randomly).

We now use the above theorem to analyze the spread of fit individuals in the parent population. Let us assume that at some time, the parent population contains at least one individual of at least a certain fitness. We shall call such individuals fit in the following. Recall that standard bit mutation creates a copy of the parent individual with probability \(1/e_n := (1-1/n)^n \approx 1/e\). Hence if the parent population contains x fit individuals, the number of fit individuals in the offspring population is at least (in the domination sense) \({{\,{\mathrm{Bin}}\,}}(\lambda , \frac{x}{\mu e_n})\). Due to the elitist selection mechanism, it is also always at least one. Let us assume that \(\frac{\lambda }{\mu e_n}\) is greater than one so that the expected number \(x \frac{\lambda }{\mu e_n}\) of fit individuals shows a positive drift. Writing \((1+\delta ) := \frac{\lambda }{\mu e_n}\), where \(\delta > 0\) by our assumption, and assuming for simplicity \(\delta \le 1\) as well, we can apply the first up-drift theorem with \(k = \lambda\) and \(n = \mu\) and observe that after an expected number of \(O(\log (\mu )/\delta )\) iterations, the parent population consists of only fit individuals.
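The following minimal simulation (our own sketch; the concrete values of \(\mu\), \(\lambda\), and the problem size are illustrative assumptions) tracks only the number x of fit parents via the dominating law \(\min \{\mu , \max \{1, {{\,{\mathrm{Bin}}\,}}(\lambda , x/(\mu e_n))\}\}\) derived above and estimates the takeover time.

```python
import random

def takeover_time(mu, lam, rng):
    """Iterations until all mu parents are fit, starting from one fit parent.
    Uses the dominating law min(mu, max(1, Bin(lam, x / (mu * e_n)))) from above."""
    n = 100                       # problem size, only used to compute e_n
    e_n = (1 - 1 / n) ** (-n)     # a mutation copies its parent with probability 1/e_n
    x, t = 1, 0
    while x < mu:
        p = min(1.0, x / (mu * e_n))
        fit_offspring = sum(rng.random() < p for _ in range(lam))
        x = min(mu, max(1, fit_offspring))  # elitism keeps at least one fit individual
        t += 1
    return t

rng = random.Random(1)
mu, lam = 100, 400                # (1+delta) = lam/(mu*e_n) ~ 1.46, i.e. delta ~ 0.46
times = [takeover_time(mu, lam, rng) for _ in range(100)]
print(sum(times) / len(times))    # on the order of log(mu)/delta iterations
```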

2.1.2 Proof Overview

We now proceed towards proving the first up-drift theorem. As said earlier, we concentrate on the case \(\delta \le 1\) in all of the following except Sect. 2.1.7. We start by outlining the two main difficulties and solutions in a high-level language.

One of the main difficulties is that the drift towards the target is negligibly weak in the early stages of the process. To demonstrate this, assume that \(\delta = o(1)\) and that \(X_{t} = o(1/\delta )\). Then the up-drift condition (D) only ensures a drift of \(E[X_{t+1} - X_{t} \mid X_{t}] \ge \delta X_{t} = o(1)\). At the same time, the binomial condition (Bin) allows a variance \({{\,{\mathrm{Var}}\,}}[X_{t+1} \mid X_{t}]\) of order \(X_{t}\), or, more specifically, admits deviations of \(X_{t+1}\) from its expectation of order \(\sqrt{X_t}\) with constant probability. For this reason, in this regime we do not progress because of the drift, but rather because of the random fluctuations of the process.

It is well-known that random fluctuations are enough to reach a target, with a classical example being the unbiased random walk \((W_{t})\) on the line \([0..n] := \{0, 1, \dots , n\}\). This walk, when started in 0, still reaches n in an expected number of \(O(n^2)\) iterations despite the complete absence of any drift in \([1..n-1]\). The key to the analysis is to not regard the drift \(E[W_{t+1} - W_{t} \mid W_{t}]\) of the process, but instead the drift of the process \((W_{t}^2)\). Then an easy calculation gives \(E[W^2_{t+1} - W^2_{t} \mid W_{t} = x] = \frac{1}{2} (x+1)^2 + \frac{1}{2} (x-1)^2 - x^2 = 1\) for all \(x \in [1..n-1]\) (see [17, Sect. 5] for an extensive discussion). Consequently, by regarding the drift with respect to \((W_t^2)\) instead of the original process \((W_t)\), we obtain an additive drift of 1, and from this an expected time of \(O(n^2)\) to reach state n. This has also been applied to the analysis of randomized search heuristics, see for example [28, Theorem 3.18].
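As a quick sanity check of this classical fact, the following sketch (our own; we use the natural reflecting behavior at 0, so that the walk always moves to 1 from there) estimates the hitting time of the unbiased walk and confirms the quadratic scaling.

```python
import random

def walk_hitting_time(n, rng):
    """Unbiased +-1 walk on [0..n], reflecting at 0; time to first reach n."""
    w, t = 0, 0
    while w < n:
        w = w + 1 if (w == 0 or rng.random() < 0.5) else w - 1
        t += 1
    return t

rng = random.Random(0)
for n in (10, 20, 40):
    mean = sum(walk_hitting_time(n, rng) for _ in range(1000)) / 1000
    print(n, mean, mean / n**2)   # last column is close to 1, since E[T] = n^2 here
```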

Transformations with exponents smaller than one are apparently more common. In [24, Theorem 2], a region with small drift was turned into one with significantly more drift by employing the concave potential function \(x \mapsto \sqrt{x}\); the author remarks that any other function \(x \mapsto x^\varepsilon\) with \(\varepsilon < 1\) would be equally suitable to obtain the same tight upper bound. Essentially the same argument was used in a more general setting in [3]. The \(x \mapsto \sqrt{x}\) transformation was also used in the analysis of how the sampling frequency of a neutral bit in a run of an EDA approaches the boundary values [14, Theorem 6].

In [16, Theorem 5] a negative drift in a (small) part of the search space was overcome by considering random changes which make it possible for the algorithm to pass through the area of negative drift by chance. This was formalized by using a tailored potential function turning negative drift into positive drift by excessively rewarding changes towards the target, as opposed to steps away from the target. This ad-hoc argument was made formal and cast into a Headwind Drift Theorem in [26, Theorem 4].

In abstract terms, the art here is finding a potential function \(g : {\mathbb {Z}}_{\ge 0} \rightarrow {\mathbb {R}}\) that transforms the unbiased process \((X_{t})\) into a process \((g(X_{t}))\) with constant drift, so that we can apply the additive drift theorem and obtain a bound proportional to the potential of the target on the expected time to reach it. In order to obtain a positive drift, such a potential function has to be increasing and convex, and since the time bound is proportional to the potential of the target, the potential function should at the same time increase as slowly as possible.

For our situation, it turns out that g defined by \(g(x) = x \ln (x)\) is a good choice as this again gives a constant drift and thus an expected time of roughly \(O(\log (1/\delta )/\delta )\) to reach a state \(\Omega (1/\delta )\), from where on we will observe that also the original process has sufficient drift. We are not aware of this potential function being used so far in the theory of evolutionary algorithms (apart from a similar function being used in [1], a work done in parallel to ours).

A technical annoyance in the analysis of the time taken to reach \(\Omega (1/\delta )\) is that the additive drift theorem, for good reason, does not allow the process to overshoot the target. In the classical formulation, this follows from the target being 0 and the process living in the non-negative numbers. For this reason, we cannot just show that the process \((g(X_{t}))\) has a constant drift, but we need to show this drift for a version of this process that is suitably restricted to the range \([1..\Theta (1/\delta )]\). This was a major technicality in the previous version of this work [8]. In this version, we greatly simplify this part by using a version of the drift theorem (Theorem 4) recently proposed by Krejca [28] that allows overshooting the target (at the price that the time bound depends not on the distance to the target, but on this distance plus the expected overshoot).

Once the process has reached a value of \(\Omega (1/\delta )\), the drift is strong enough to rely on making progress from the drift (and not the random fluctuations around the expectation). This is easy once the process satisfies \(X_t = \omega (1/\delta ^2)\), since then the expected progress of at least \(\Omega (\delta X_t)\) is asymptotically larger than the typical random fluctuations of order \(\sqrt{X_t}\). Hence a simple Chernoff bound is enough to guarantee that each single iteration gives \(X_{t+1} \ge (1-o(1)) (1+\delta ) X_t\). When \(X_t\) is smaller, say only \(\Theta (1/\delta )\), only the combined result of \(\Theta (1/\delta )\) iterations gives an expected progress large enough to admit such a strong concentration. Since the iterations are not independent, we need some careful martingale concentration arguments in this regime. Since this part is non-trivial and uses some methods that might be of broader interest, we put it into the following separate subsections. Also, we note that the specific result that the process rarely goes below half its starting point could have some independent interest (we shall need it again in the proof of Theorem 21, our new level-based theorem).

2.1.3 Additive Drift with Overshooting

We now give a version of the additive drift theorem [21, 22] as shown in [28, Lemma 3.7], here slightly reformulated to best fit our purposes. In contrast to most other versions of the additive drift theorem, it allows that the process overshoots the target. This is usually implicitly forbidden by regarding processes in \({\mathbb {R}}_{\ge 0}\) and the first time to reach state 0.

This extension is not very deep, but has apparently not been known too well before (as the several works that overcome the overshooting problem with hand-made methods, including [8], show). We note that the arguments needed to prove such a result have been known before in this community: For example, both [23, Lemma 12] and [7, Lemma 7] prove lower bounds for expected run times in a way that can immediately be turned into proofs for upper bounds that allow overshooting (by switching the direction of the inequality in both assumptions and results). The proof of [39, Lemma 2.6], a result for hitting a particular value, can easily be extended to overshooting the value (for this, it suffices to note that \(E[\sum _{i=1}^{\tau _s} D_i]\) is the value of the process after reaching or overshooting s).

Theorem 4

(Additive Drift Theorem, upper bound with overshooting) Let \(a, b \in {\mathbb {R}}\) with \(a \le b\). Let \((X_t)_{t \in {\mathbb {N}}}\) be a random process over \([-\infty ,b]\). Let \(T = \inf \{t \mid X_t \le a\}\) be the first time the process reaches or drops below a. Suppose that there is \(\delta > 0\) such that

$$\begin{aligned}X_t - E[X_{t+1} \mid X_0,\ldots ,X_t] \ge \delta \end{aligned}$$

for all \(t < T\). Then

$$\begin{aligned} E[T \mid X_0] \le \frac{X_0 - E[X_T \mid X_0]}{\delta }\ . \end{aligned}$$

We note that the version of this result given in [28] is slightly stronger. There the condition that the process does not take values larger than some – arbitrary – number b was replaced by the weaker condition that this only holds up to time T.

2.1.4 Progress From Random Fluctuations: Creating Drift Where There is no Drift

In this subsection, we analyze how the process reaches a value of at least \(D_0 = \min \{\lceil 100/\delta \rceil ,n\}\). In this regime, the drift of \((X_t)\) is so low that the true reason for making progress is not the drift, but the random fluctuations stemming from the non-trivial variance. To turn these into an exploitable drift, we regard the process \((g(X_t))\) for a suitable function g, observe that this process has a positive drift, and use this drift to estimate the time to reach or exceed \(D_0\).

We use

$$\begin{aligned} g: {\mathbb {R}}_{\ge 0} \rightarrow {\mathbb {R}}, x \mapsto x \ln x, \end{aligned}$$
(1)

where, by convention, \(g(0) := 0\), which renders g continuous in 0. To establish the desired drift, we need a few technical results about g. Via a Taylor expansion of g around a given point a, we obtain the following estimates for g.

Lemma 5

For all \(a > 0\) and \(x \ge 0\), we have

$$\begin{aligned} g(x)&\le a \ln a + (x-a)(1+ \ln a) + (x-a)^2 \frac{1}{a},\\ g(x)&\ge a \ln a + (x-a)(1+ \ln a) + (x-a)^2 \frac{1}{2a} - (x-a)^3 \frac{1}{6a^2}. \end{aligned}$$

Proof

Let \(a > 0\) be given. We prove the (slightly more complicated) lower bound first, showing the claim for positive x and then arguing with continuity. We let \(f: {\mathbb {R}}_+ \rightarrow {\mathbb {R}}\) be such that, for all \(x \in {\mathbb {R}}_{+}\),

$$\begin{aligned} f(x) = x \ln x - a \ln a - (x-a)(1+ \ln a) - (x-a)^2 \frac{1}{2a} + (x-a)^3 \frac{1}{6a^2}. \end{aligned}$$

Then we have, for all \(x \in {\mathbb {R}}_+\),

$$\begin{aligned} f'(x) = \ln x + 1 - (1+\ln a) - \frac{2x-2a}{2a} + \frac{3x^2-6xa+3a^2}{6a^2} \end{aligned}$$

and

$$\begin{aligned} f''(x) = \frac{1}{x} - \frac{1}{a} + \frac{x}{a^2} - \frac{1}{a} = \left( \frac{1}{\sqrt{x}} - \frac{\sqrt{x}}{a} \right) ^2. \end{aligned}$$

In particular, we have \(f(a) = 0\), \(f'(a) = 0\), and \(f''(x) \ge 0\) for all \(x \in {\mathbb {R}}_+\). This shows that for all \(x \in {\mathbb {R}}_+\), we have \(f(x) \ge 0\). By the continuity of f, we also obtain \(f(0) \ge 0\), and thus the claim.

For the upper bound, we regard \(f: {\mathbb {R}}_+ \rightarrow {\mathbb {R}}\) defined by

$$\begin{aligned} f(x) = x \ln x - a \ln a - (x-a)(1+ \ln a) - (x-a)^2 \frac{1}{a} \end{aligned}$$

and compute

$$\begin{aligned} f'(x) = \ln x + 1 - (1+\ln a) - \frac{2x-2a}{a} = \ln \left( \frac{x}{a}\right) + 2 - \frac{2x}{a} \end{aligned}$$

as well as

$$\begin{aligned} f''(x) = \frac{1}{x} - \frac{2}{a} \end{aligned}$$

for all \(x \in {\mathbb {R}}_+\). Thus, \(f''(x) > 0\) for \(x < a/2\), \(f''(x) = 0\) for \(x = a/2\), and \(f''(x) < 0\) for \(x > a/2\). Consequently, \(f'\) is zero for at most two arguments. Since \(\lim _{x \rightarrow 0} f'(x) = - \infty = \lim _{x \rightarrow \infty } f'(x)\) and \(f'(a/2) > 0\), by the intermediate value theorem there exist exactly two x such that \(f'(x) = 0\), one being larger than a/2 and the other smaller. Note that \(f'(a) = 0\). From \(\lim _{x \rightarrow 0} f(x) = 0\) and \(f(a)=0\), the only local maximum being at a, we can thus conclude that f is non-positive. \(\square\)

We use the estimates above to show that, under suitable circumstances, the expected g-value of a random variable X is larger than g(E[X]). The lower bound in the theorem below will be used to argue that even for a process \((X_t)\) with no drift, that is, \(E[X_{t+1} \mid X_t] = X_t\), the process \((g(X_t))\) has a positive drift. We need the upper bound to estimate the expected overshooting of the target when applying the additive drift theorem with overshooting (Theorem 4).

Theorem 6

Let g be defined as above in Equation (1). Let X be a non-negative random variable with positive expectation. Let \(\mu _3 = E[(X-E[X])^3]\). Then

$$\begin{aligned} g(E[X]) + \frac{{{\,{\mathrm{Var}}\,}}[X]}{E[X]} \ge E \left[ g(X) \right] \ge g(E[X]) + \frac{{{\,{\mathrm{Var}}\,}}[X]}{2E[X]} - \frac{\mu _3}{6E[X]^2}. \end{aligned}$$

Proof

We use Lemma 5 with \(a = E[X]\). \(\square\)

The following two corollaries follow immediately from the theorem above by recalling that the second and third central moments of a binomially distributed random variable \(X \sim {{\,{\mathrm{Bin}}\,}}(n,p)\) are \({{\,{\mathrm{Var}}\,}}[X] = np(1-p)\) and \(E[(X - E[X])^3] = np(1-p)(1-2p)\). For technical reasons, we need the first estimate also for random variables \(X \sim {{\,{\mathrm{Bin}}\,}}(n,p)+ K\) for some non-negative number K.

Corollary 7

If \(X \sim {{\,{\mathrm{Bin}}\,}}(n,p) + K\) for some \(n \in {\mathbb {N}}\), \(p \in (0,1]\), and \(K \ge 0\), then

$$\begin{aligned} E \left[ g(X) \right] \le g(E[X]) + (1-p). \end{aligned}$$

Corollary 8

If \(X \sim {{\,{\mathrm{Bin}}\,}}(n,p)\) for some \(n \in {\mathbb {N}}\) and \(p \in (0,1]\), then

$$\begin{aligned} E \left[ g(X) \right] \ge g(E[X]) + \frac{1-p}{2} - \frac{(1-p)(1-2p)}{6E[X]}. \end{aligned}$$

For \(p \ge 1/n\), this yields

$$\begin{aligned} E \left[ g(X) \right] \ge g(E[X]) + \frac{1-p}{3}. \end{aligned}$$
(2)
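Since Corollary 8 is the engine of the following lemma, a quick exact computation (our own sketch; the values of k and x are arbitrary illustrations) may be reassuring: it evaluates \(E[g(X)] - g(E[X])\) for \(X \sim {{\,{\mathrm{Bin}}\,}}(k, x/k)\) directly from the probability mass function and compares it with the guaranteed \((1-p)/3\).

```python
import math

def g(x):
    """The potential g(x) = x ln x, with g(0) = 0 by convention."""
    return 0.0 if x == 0 else x * math.log(x)

def binom_pmf(k, p, i):
    return math.comb(k, i) * p**i * (1 - p) ** (k - i)

k = 1000
for x in (1, 5, 50, 500):
    p = x / k
    e_g = sum(binom_pmf(k, p, i) * g(i) for i in range(k + 1))
    print(x, e_g - g(k * p), (1 - p) / 3)  # first number is at least the second
```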

We are now prepared to show the following result.

Lemma 9

Let \((X_{t})_{t \in {\mathbb {N}}}\) be a stochastic process over the positive integers. Assume that there are \(D_0, k \in {\mathbb {Z}}_{\ge 1}\) and \(\gamma _0 < 1\) such that \(D_0-1 \le \gamma _0 k\) and for all \(t \ge 0\) and all \(x \in [1..D_0-1]\) with \(\Pr [X_{t} = x] > 0\) we have the unbiased binomial condition

(\(\hbox {Bin}_0\)):

\((X_{t+1} \mid X_{t} = x) \succeq {\mathrm {Bin}}(k, x/k)\).

Let \(T := \min \{t \ge 0 \mid X_{t} \ge D_0\}\). Then

$$\begin{aligned} E[T] \le \frac{6 D_0 \ln (2 D_0)}{{1-\gamma _0}}. \end{aligned}$$

Proof

There is nothing to show for \(D_0=1\), so we assume \(D_0 \ge 2\) in the remainder. For technical reasons, let us regard the process \((X'_t)\), which agrees with \((X_t)\) while not larger than \(D_0\), but follows the pessimistic law \(X'_{t+1} \sim {{\,{\mathrm{Bin}}\,}}(k, X_t / k)\) in the iteration where \(D_0\) is exceeded. More precisely, we let \(X'_0 = X_0\). Given that some \(X'_t\) is defined already, we define \(X'_{t+1}\) as follows. If \(X'_t \le D_0\), then for all \(x \ge 1\) we have

$$\begin{aligned} \Pr [X'_{t+1} = x] = {\left\{ \begin{array}{ll} \Pr [X_{t+1} = x], &{} \text{ if }~{ x < D_0,}\\ \Pr [{{\,{\mathrm{Bin}}\,}}(k, X_t/k) = x], &{} \text{ if }~{ x > D_0,} \end{array}\right. } \end{aligned}$$

and the remaining probability mass is put on \(D_0\), that is,

$$\begin{aligned} \Pr [X'_{t+1} = D_0] = 1 - \sum _{x=1}^{D_0 -1} \Pr [X_{t+1} = x] - \! \sum _{x=D_0+1}^k \Pr [{{\,{\mathrm{Bin}}\,}}(k, X_t/k) = x].\end{aligned}$$

If \(X'_t > D_0\), we let \(X'_{t+1} = X'_t\) with probability one. Since the process \((X'_t)\) agrees with \((X_t)\) while less than \(D_0\), we have \(T' := \min \{t \mid X'_t \ge D_0\} = \min \{t \mid X_t \ge D_0\} =: T\).

We estimate \(T'\). Consider some time t such that \(x := X'_t\) is in \([1..D_0-1]\). Let \(Y \sim {{\,{\mathrm{Bin}}\,}}(k,x/k)\). Since \(X'_{t+1} \succeq Y\) and g is monotonically increasing in \(\{0\} \cup [1, \infty )\), we have \(E[g(X'_{t+1})] \ge E[g(Y)]\). By Equation (2) in Corollary 8, we have

$$\begin{aligned}E[g(Y)] \ge g(E[Y]) + \frac{1-(D_0-1)/k}{3} \ge g(x) + \frac{1-\gamma _0}{3}.\end{aligned}$$

Consequently, we have \(E[g(X'_{t+1}) - g(X'_t) \mid X'_t < D_0] \ge \frac{1-\gamma _0}{3}\).

To apply the additive drift theorem with overshooting (Theorem 4), we observe that \(T = T' = \min \{t \mid g(X'_t) \ge D_0 \ln (D_0)\}\) and compute \(E[g(X'_T)]\). By construction, \(X'_T \sim (Y \mid Y \ge D_0)\) for some Y following a binomial law with parameters k and some \(p \le (D_0-1)/k\). By elementary arguments analogous to those used in the proof of [4, Lemma 1], see also [13, Lemma 1.7.3] and the comment following its proof, \(X'_T\) is stochastically dominated by \(D_0 + {{\,{\mathrm{Bin}}\,}}(k,(D_0-1)/k)\), which immediately gives

$$\begin{aligned} E[X'_T] \le 2D_0 - 1. \end{aligned}$$

By Corollary 7, we have \(E[g(X'_T)] \le (2 D_0-1) \ln (2 D_0-1) + 1 \le (2 D_0-1) \ln (2 D_0) + 1 \le 2D_0 \ln (2 D_0)\), the last estimate using \(D_0 \ge 2\). Consequently, the additive drift theorem with overshooting gives

$$\begin{aligned} E[T'] \le \frac{2 D_0 \ln (2 D_0)}{\frac{1-\gamma _0}{3}}. \end{aligned}$$

\(\square\)

We remark that, in principle, Lemma 9 can be strengthened by taking into account the starting point \(X_0\). Assuming for simplicity that \(X_0\) takes only values in \([1..D_0]\), this would give a result like

$$\begin{aligned}E[T'] \le \frac{2 D_0 \ln (2 D_0) - E[g(X_0)]}{\frac{1-\gamma _0}{3}} \le \frac{2 D_0 \ln (2 D_0) - g(E[X_0])}{\frac{1-\gamma _0}{3}},\end{aligned}$$

where the last estimate stems from the convexity of g and Jensen’s inequality. Since \(E[g(X_0)]\) is at most \(D_0 \ln (D_0)\), we gain at most a constant factor in the estimate of \(E[T']\). The reason for this weak improvement is that we estimated \(E[g(X'_T)]\) very coarsely. However, even with a better estimate of \(E[g(X'_t)]\), asymptotically stronger results would only be possible in the case that \(X_0\) is very close to \(D_0\), that is, that \(D_0 \ln (D_0) - E[g(X_0)] = o(D_0 \ln (D_0))\), which we do not expect in our typical applications.

We note further that the problem of overshooting and the resulting negative impact on the hitting time estimate is real. Even if \(X_0 = D_0 - 1\) with probability one, we see that when taking \(k = 2(D_0-1)\) for simplicity, we have \(X_1 \le D_0 - \Omega (\sqrt{D_0})\) with constant probability (that this is possible for an unbiased process stems from the fact that \(X_1\) overshoots \(D_0\) by a comparable amount). We omit a formal proof, but note that from \(X_1 \le D_0 - \Omega (\sqrt{D_0})\), the process takes an expected number of \(\Omega (\sqrt{D_0})\) iterations to reach or overshoot \(D_0\).

2.1.5 Submartingale Arguments Proving A Steady Progress From \(D_0\) on

In this subsection, we shall prove that once a process satisfying the assumptions of Theorem 3 has reached a value of \(D \ge D_0 := \min \{\lceil 100/\delta \rceil , n\}\), it usually makes steady progress, gaining a constant factor every \(\Theta (1/\delta )\) iterations without ever going below D/2. To show this result, we use a submartingale argument that might prove useful in other analyses of evolutionary algorithms as well. We build on the following result from [15], phrased as in [27, Theorem 11].

Theorem 10

Let \((Y_{t})_{t \in {\mathbb {N}}}\) be a stochastic process, \((c_t)_{t \in {\mathbb {N}}}\) a sequence of reals, and \(\delta \in {\mathbb {R}}\) such that

$$\begin{aligned} \forall z \in [0,\delta ]: E[\exp (z(Y_{t+1}-Y_t)) \mid Y_0,\ldots ,Y_t] \le \exp (z^2c_t/2). \end{aligned}$$

Let \(C_t = \sum _{j=0}^{t-1}c_j\). Then, for all \(t \ge 0\) and all \(x > 0\),

$$\begin{aligned} \Pr \left[ \max _{0\le j \le t}\left( Y_j - Y_0 \right) \ge x \right] \le \exp \left( -\frac{x}{2} \min \left( \delta , \frac{x}{C_t} \right) \right) . \end{aligned}$$

In order to use this theorem, we state the following two results regarding binomial distributions.

Lemma 11

Let \(n > 0\) be a natural number and \(p \in (0,1)\). Let X be a binomially distributed random variable with parameters n and p. Then we have, for all \(z \in {\mathbb {R}}\),

$$\begin{aligned} E[\exp (zX)] = \left( 1-p+pe^z\right) ^n. \end{aligned}$$

Furthermore, if \(z \in [-1,1]\),

$$\begin{aligned} E[\exp (zX)] \le \exp \left( npz + npz^2\right) . \end{aligned}$$

Proof

The first claim is the standard statement about the moment generating function of a binomial distribution. The second statement is also well-known and a standard computation, we give the derivation here for completeness.

Recall that, for \(z \in [-1,1]\), we have \(1+z \le e^z \le 1+z+z^2\). Thus we get, for all \(z \in [-1,1]\),

$$\begin{aligned} \left( 1-p+pe^z\right) ^n\le & {} \left( 1-p+p(1+z+z^2)\right) ^n\\= & {} \left( 1+pz+pz^2\right) ^n\\\le & {} e^{npz+npz^2}. \end{aligned}$$

\(\square\)
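The second inequality of Lemma 11 can be spot-checked numerically; the following sketch (our own, with arbitrary illustrative parameters) verifies it on a grid of z-values.

```python
import math

# Spot check of Lemma 11: (1 - p + p e^z)^n <= exp(n p z + n p z^2) for z in [-1, 1].
n, p = 50, 0.1
worst_ratio = 0.0
for i in range(-100, 101):
    z = i / 100
    lhs = (1 - p + p * math.exp(z)) ** n
    rhs = math.exp(n * p * z + n * p * z * z)
    worst_ratio = max(worst_ratio, lhs / rhs)
print(worst_ratio)  # stays <= 1 on the whole grid, as the lemma guarantees
```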

With Theorem 10 and this lemma we now show the following result.

Lemma 12

Let \((X_{t})_{t \in {\mathbb {N}}}\) be a stochastic process over the positive integers. Assume that there are \(n,k \in {\mathbb {Z}}_{\ge 1}\) and \(\delta \in (0,1]\) such that \(n-1 \le (1+\delta )^{-1} k\) and for all \(t \ge 0\) and all \(x \in [1..n-1]\) with \(\Pr [X_{t} = x] > 0\) we have the binomial condition

(Bin):

\((X_{t+1} \mid X_{t} = x) \succeq {\mathrm {Bin}}(k,(1+\delta ) x/k)\).

Let \(t_0 \ge 0\) and \(100 / \delta \le D < n\) such that \(X_{t_0} = D\). Let \({{\tilde{n}}} = \min \{n, 2D\}\), \(T = \min \{t \mid X_t \ge {{\tilde{n}}}\}\), and \(T_1 = \min \{T, t_0 + \lceil 3/\delta \rceil \}\). Then

$$\begin{aligned}\Pr [\exists s \in [t_0..T_1] : X_s \le \tfrac{1}{2} D + \tfrac{1}{2} (s-t_0)\delta D] \le \exp \left( - \delta D / 128 \right) .\end{aligned}$$

Proof

To be able to use Lemma 11, we first argue that we can pessimistically assume that the progress is exactly the one described by the binomial distributions in (Bin). To make this claim precise, let \(X'_0 = X_0\) and define recursively \(X'_t\) as a random variable with distribution \(X'_t \sim {{\,{\mathrm{Bin}}\,}}(k,(1+\delta )X'_{t-1}/k)\) for all \(t \ge 1\) such that \({X'_{t-1} < n}\). Then a simple induction shows that \(X_t \succeq X'_t\) for all \(t \in \mathbb {N}_0\): If \(X'_{t-1} \preceq X_{t-1}\), then \(X_t \succeq {{\,\mathrm{Bin}\,}}(k, (1+\delta ) X_{t-1} / k) \succeq {{\,\mathrm{Bin}\,}}(k, (1+\delta ) X'_{t-1} / k) \sim X'_{t}\). Since this domination also implies

$$\begin{aligned}T = \min \{t \mid X_t \ge {{\tilde{n}}}\} \le \min \{t \mid X'_t \ge {{\tilde{n}}}\} =: T',\end{aligned}$$

proving our claim for \(((X'_t),T')\) also shows the claim for \(((X_t),T)\).

For this reason, we assume from now on that the process \((X_t)\) satisfies (Bin) “with equality”, that is, that we have \((X_{t+1} \mid X_{t} = x) \sim \mathrm {Bin}(k,(1+\delta ) x/k)\) whenever (Bin) applies.

To ease the argument, let us modify the process \((X_t)\) a second time, namely in the irrelevant regime where it has reached a state of at least \({{\tilde{n}}}\) (before the time \(\lceil 3/\delta \rceil\)). More specifically, we define, again recursively, \(X_{t+1} = X_t + \frac{1}{2} \delta D\) with probability one when \(X_t \ge {{\tilde{n}}}\). Since this modified process agrees with the original process up to time T (and thus in particular up to time \(T_1\)), it satisfies our claim if and only if the original process does. We can thus work with the modified process in the following. To ease reading, we still denote it by \((X_t)\).

For this process, we let, for all \(t \in [0..\lceil 3/\delta \rceil ]\),

$$\begin{aligned} Y_t = D - X_{t_0+t} + \tfrac{1}{2}t\delta D. \end{aligned}$$

Thus, for all \(t \in [0..\lceil 3/\delta \rceil ]\) and with \(s=t_0+t\), the event \(X_s \le \tfrac{1}{2} D + \tfrac{1}{2} (s-t_0)\delta D\) equals the event \(Y_t \ge D/2\).

We want to apply Theorem 10 to the process \((Y_t)_{t \in \mathbb {N}}\). Assume first that t is such that \(X_t < {{\tilde{n}}}\). Hence we are still in the original regime of the process and have \(X_{t+1} \sim {{\,\mathrm{Bin}\,}}(k,(1+\delta )X_t/k)\). Also, we have \(X_t \le 2D\). For all \(z \in [0,1]\), using Lemma 11 and implicitly assuming \(X_{t_0+t} \ge D/2\) (giving \(X_{t_0+t} + \delta D / 2 \le (1+\delta )X_{t_0+t}\)),

$$\begin{aligned}&{E[\exp (z(Y_{t+1}-Y_t)) \mid Y_0,\ldots ,Y_t]} \\&\quad = E[\exp (z(- X_{t_0+t+1} + X_{t_0+t} + \delta D / 2)) \mid Y_0,\ldots ,Y_t]\\&\quad = E[\exp (- z X_{t_0+t+1}) \mid Y_0,\ldots ,Y_t] \cdot \exp (z(X_{t_0+t} + \delta D / 2))\\&\quad \le \exp (-(1+\delta )X_{t_0+t}z + (1+\delta )X_{t_0+t}z^2 ) \cdot \exp (z(X_{t_0+t} + \delta D / 2))\\&\quad \le \exp ((1+\delta )X_{t_0+t}z^2)\\&\quad \le \exp ((1+\delta )2Dz^2). \end{aligned}$$

Note that this estimate also holds when \(X_t \ge {{\tilde{n}}}\) since in this case \(Y_{t+1} = Y_t\) with probability one. Thus, we can set \(c_t = c = 4(1+\delta )D\) and \(C_t = t \cdot 4(1+\delta )D\) for the application of Theorem 10. We let

$$\begin{aligned} C = C_{\lceil 3/\delta \rceil }= & {} \lceil 3/\delta \rceil 4(1+\delta )D\\\le & {} (1+ 3/\delta ) 4(1+\delta )D\\\le & {} (1/\delta + 3/\delta ) 4(1+\delta )D\\\le & {} (4/\delta ) 4(1+1)D\\= & {} 32D/\delta . \end{aligned}$$

Furthermore, we consider \(x = D/2\) and get

$$\begin{aligned} \min \left( 1,x/C \right) \ge \delta /64. \end{aligned}$$

Thus, we get from Theorem 10 and observing \(Y_0 = 0\)

$$\begin{aligned} \Pr [\exists t \in [0..\lceil 3/\delta \rceil ]: Y_t \ge D/2] \le \exp \left( -\delta D/128 \right) \end{aligned}$$

as desired.\(\square\)

With an iterated application of the previous result, we can show that the process has a decent chance to reach the target n in time \(O(\log (n) / \delta )\). We shall later only need the result with the success probability 0.4088, but since we easily prove a stronger bound for larger starting points D and since such results might be useful in other contexts, we also prove such an estimate in the following lemma.

Lemma 13

Let \((X_{t})_{t \in \mathbb {N}}\) be a stochastic process over the positive integers. Assume that there are \(n,k \in \mathbb {Z}_{\ge 1}\) and \(\delta \in (0,1]\) such that \(n-1 \le (1+\delta )^{-1} k\) and for all \(t \ge 0\) and all \(x \in [1..n-1]\) with \(\Pr [X_{t} = x] > 0\) we have the binomial condition

(Bin):

\((X_{t+1} \mid X_{t} = x) \succeq \mathrm {Bin}(k,(1+\delta ) x/k)\).

Let \(t_0 \ge 0\) and \(100 / \delta \le D < n\) such that \(X_{t_0} = D\). Then, with probability at least \(\max \{0.4088, 1 - \frac{1}{\exp (\delta D / 128) - 1}\}\), the process reaches or exceeds n within at most \(\lceil \log _2(n/D)\rceil \lceil 3 / \delta \rceil\) iterations.

Proof

Using Lemma 12, with probability at least \(1 - \exp (- \delta X_{t_0} / 128) = 1 - \exp (- \delta D / 128)\) there is \(t_1 \le t_0 + \lceil 3/\delta \rceil\) such that \(X_{t_1} \ge \min \{2D, n\}\). Given this event and assuming \(X_{t_1} < n\), with probability at least \(1 - \exp (- \delta X_{t_1} / 128) \ge 1 - \exp (- 2 \delta D / 128)\), there is a \(t_2 \le t_1 + \lceil 3/\delta \rceil\) such that \(X_{t_2} \ge \min \{2 X_{t_1},n\} \ge \min \{4D,n\}\). Repeating this doubling argument at most \(\lceil \log _2(n/D) \rceil\) times, we obtain a state of at least n. This takes at most \(\lceil \log _2(n/D)\rceil \lceil 3 / \delta \rceil\) iterations and works out as desired with probability at least

$$\begin{aligned} \prod _{i=0}^\infty&\left( 1 - \exp (- 2^i \delta D / 128)\right) \\&\ge 1 - \sum _{i=0}^\infty \exp (-2^i \delta D / 128)\\&\ge 1 - \sum _{i=0}^\infty \exp (-(i+1) \delta D / 128)\\&= 1 - \frac{1}{\exp (\delta D / 128) - 1}, \end{aligned}$$

where the first inequality follows from a Weierstrass product inequality, a mild extension of Bernoulli’s inequality (see, e.g., [13, Lemma 1.4.8]), and the last equation computes the geometric series. When D is small, this estimate can be negative and then is not very useful. For this case, using our assumption that \(D \ge 100/\delta\), we compute

$$\begin{aligned} \prod _{i=0}^\infty&\left( 1 - \exp (- 2^i \delta D / 128)\right) \\&\ge \prod _{i=0}^\infty \left( 1 - \exp (-2^i \cdot \tfrac{100}{128})\right) \\&\ge \left( \prod _{i=0}^3 \left( 1 - \exp (-2^i \cdot \tfrac{100}{128})\right) \right) \cdot \left( 1 - \sum _{i=4}^\infty \exp (-2^i \cdot \tfrac{100}{128})\right) \\&\ge 0.4089 \cdot \left( 1 - \exp (-\tfrac{1600}{128}) \sum _{i=0}^\infty \exp (-\tfrac{100}{128})^i\right) \\&= 0.4089 \cdot \left( 1 - \exp (-\tfrac{1600}{128}) \frac{1}{1 - \exp (-\tfrac{100}{128})}\right) \ge 0.4088, \end{aligned}$$

which gives the desired bound.\(\square\)
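To illustrate Lemma 13, the following sketch (our own; the parameters \(\delta = 0.5\), \(n = 1000\), and \(k = 1600\) are illustrative choices satisfying \(n-1 \le (1+\delta )^{-1} k\)) runs the process with (Bin) holding with equality and estimates how often the target is reached within the stated number of iterations.

```python
import math
import random

def phase_success(n, k, delta, rng):
    """One run of the process X_{t+1} ~ Bin(k, (1+delta) X_t / k), started at
    D = ceil(100/delta); success = reaching n within the
    ceil(log2(n/D)) * ceil(3/delta) iterations from Lemma 13."""
    d = math.ceil(100 / delta)
    budget = math.ceil(math.log2(n / d)) * math.ceil(3 / delta)
    x = d
    for _ in range(budget):
        p = min(1.0, (1 + delta) * x / k)
        x = sum(rng.random() < p for _ in range(k))  # one binomial step
        if x >= n:
            return True
    return False

rng = random.Random(7)
n, delta, k = 1000, 0.5, 1600
runs = 200
print(sum(phase_success(n, k, delta, rng) for _ in range(runs)) / runs)
# prints a success frequency well above the guaranteed 0.4088
```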

2.1.6 Proof of Theorem 3 for \(\delta \le 1\)

By combining the two main insights of the two preceding subsections, we now prove the first up-drift theorem in the case \(\delta \le 1\). Since the proof uses a result known as Wald’s equation [37], we first state a simplified version of this result.

Theorem 14

(Wald’s equation) Let \(M \in \mathbb {R}\). Let \(X_1, X_2, \dots\) be an infinite sequence of non-negative random variables with \(E[X_i] \le M\) for all \(i \in \mathbb {N}\). Let T be a positive integer random variable with \(E[T] < \infty\). Assume that for all \(t \in \mathbb {N}\), we have \(E[X_t {\mathbf {1}}_{\{T \ge t\}}] = E[X_t] \Pr [T \ge t]\). Then

$$\begin{aligned}E\left[ \sum _{t = 1}^T X_t\right] \le M \cdot E[T].\end{aligned}$$

We now give the proof of the first up-drift theorem for the case that \(\delta \le 1\).

Proof

(of Theorem 3 for \(\delta \le 1\)) Let us call a phase of the process \((X_t)\) the time interval needed to first reach a value of at least \(D_0 = \min \{\lceil 100/\delta \rceil , n\}\) and then run for another \(T_2 = \lceil \log _2(n/D_0)\rceil \lceil 3 / \delta \rceil \le \log _2(n) \lceil 3 / \delta \rceil\) iterations. By Lemma 9, the expected time to reach a value of at least \(D_0 = \min \{\lceil 100/\delta \rceil , n\}\) is at most \(T_1 = \frac{6 D_0 \ln (2 D_0)}{{1-\gamma _0}}\). Hence a phase has an expected length of at most \(M = T_1 + T_2\).

By Lemma 13, a phase is successful, that is, reaches or exceeds n, with probability at least 0.4088. Hence the number of phases until a successful one is encountered is stochastically dominated by a geometric random variable T with success probability \(p = 0.4088\).

Hence, by Wald’s equation (Theorem 14), the expected time to reach or exceed n is at most

$$\begin{aligned} M E[T] = M \frac{1}{p} \le \frac{1}{0.4088}\left( \frac{6 D_0 \ln (2 D_0)}{{1-\gamma _0}} + \log _2(n) \lceil 3 / \delta \rceil \right) . \end{aligned}$$

The claim follows from noting that \(6/0.4088 < 15\) and \(1/0.4088 < 2.5\). \(\square\)
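As an end-to-end illustration of this bound, the following sketch (our own; the process \(X_{t+1} = \max \{1, {{\,{\mathrm{Bin}}\,}}(k,(1+\delta )X_t/k)\}\) is one concrete choice satisfying both (Bin) and positivity, the parameters are illustrative, and random.binomialvariate requires Python 3.12 or later) compares empirical hitting times with the bound of Theorem 3 for \(\gamma _0 = 1/4\).

```python
import math
from random import binomialvariate, seed  # binomialvariate needs Python >= 3.12

def hitting_time(n, k, delta):
    """X_0 = 1 and X_{t+1} = max(1, Bin(k, (1+delta) X_t / k)); the max with 1
    keeps the process on the positive integers while preserving (Bin)."""
    x, t = 1, 0
    while x < n:
        x = max(1, binomialvariate(k, min(1.0, (1 + delta) * x / k)))
        t += 1
    return t

seed(3)
n, gamma0 = 10**4, 0.25
k = 4 * n                                 # n - 1 <= min(gamma0 * k, k / (1 + delta))
for delta in (0.4, 0.2, 0.1):
    d0 = min(math.ceil(100 / delta), n)
    bound = (15 / (1 - gamma0)) * d0 * math.log(2 * d0) \
        + 2.5 * math.log2(n) * math.ceil(3 / delta)
    mean = sum(hitting_time(n, k, delta) for _ in range(20)) / 20
    print(delta, round(mean), round(bound))  # the empirical mean stays below the bound
```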

2.1.7 The Case \(\delta > 1\)

In this section, we treat the case that \(\delta\) is larger than one. In this case, the up-drift is so strong that we do not have a significant phase in which the progress stems mostly from random fluctuations. Rather, we can argue that with constant “success” probability, the process increases by a factor of at least \((1+\delta /2)\) in each iteration and thus reaches the target of n in at most \(\lceil \log _{1+\delta /2}(n) \rceil = O(\log (n)/\log (\delta ))\) iterations. In case of failure, a simple restart argument (leading to an expected constant number of restarts of the argument) suffices to show the same bound for the expected time to reach a state of at least n.

This argument alone would give a relatively low success probability of

$$\begin{aligned}\prod _{i=0}^\infty (1 - \exp (-(1+0.5)^i /16)) \le 3.4 \cdot 10^{-6}\end{aligned}$$

when proceeding as in the proof below, using \(\delta = 1\), and estimating this infinite product via its first ten factors. Consequently, a very high implicit constant in the \(O(\log (n)/\log (\delta ))\) bound would result. To overcome this, we first argue that it takes at most an expected number of 62 iterations to reach a state of at least 32. From this point on, the probability to increase by a factor of \((1 + \delta /2)\) in each subsequent iteration is more than 0.78. While we did not aim at obtaining the best possible constants, we decided to follow this line of argument to obtain a leading constant that is not only of theoretical interest. We note that the same argument could be used with intermediate targets larger than 32 and increase factors closer to \((1+\delta )\), which shows that the right asymptotics is \((1+o(1)) \log _{1+\delta }(n)\).

To prove the case \(\delta > 1\) of the first up-drift theorem, we show the following lemma. It contains a statement on making less progress than expected which is stronger than what we need here, but which might be useful in other contexts.

Lemma 15

(First Up-Drift Theorem, \(\delta > 1\)) Let \((X_{t})_{t \in \mathbb {N}}\) be a stochastic process over the positive integers. Assume that there are \(n,k \in \mathbb {Z}_{\ge 1}\) and \(\delta \ge 1\) such that \(n-1 \le (1+\delta )^{-1} k\) and for all \(t \ge 0\) and all \(x \in [1..n-1]\) with \(\Pr [X_{t} = x] > 0\) we have the binomial condition

(Bin):

\((X_{t+1} \mid X_{t} = x) \succeq \mathrm {Bin}(k,(1+\delta ) x/k)\).

Let \(T := \min \{t \ge 0 \mid X_{t} \ge n\}\). Then

$$\begin{aligned} E[T] \le 2.6 \log _{1+\delta }(n) + 81. \end{aligned}$$

In addition, once the process has reached some state x or higher, the probability to have a step with \(X_{t+1} < (1+\delta /2) X_t\) before reaching \(X_t \ge n\) is at most \(\tfrac{1}{e^{x/32}(e^{x/32}-1)}\). In particular, once the process has reached \(x = 32\), the probability to ever go below 32 (before reaching n) is less than 0.22.

Proof

To ease the argument, we shall now assume that we have \({X_{t+1} \sim \max \{1, {{\,\mathrm{Bin}\,}}(2(1+\delta )X_t, 1/2)\}}\) when \(X_t \ge n\). This artificial continuation of the process (similar to the one we used in Lemma 9) does not change the first time to reach or overshoot the target n, but allows us to disregard whether the process has reached the target earlier than thought.

We analyze one phase of the process, started at some time \(t_0\) with an arbitrary value \(X_{t_0}\). We say that this phase ends (after \(\ell\) iterations) when either (i) \(t_0+\ell\) is the first time not earlier than \(t_0\) that \(X_{t_0 + \ell } \ge n\) (“success”), or (ii) \(t_0+\ell\) is the first time such that \(X_{t_0+\ell } < (1+\delta /2) X_{t_0+\ell -1}\) and \(X_{t_0+\ell -1} \ge 32\) (“failure”). In simple words, the phase ends when the target is reached or when we fail to obtain a factor-\((1+\delta /2)\) increase from a state that is at least 32.

We first compute a simple upper bound for the expected length of a phase, which is valid regardless of whether we condition on success or failure. We start by estimating the expected time to reach a value of at least 32. Since \(\delta \ge 1\), at any time t the state \(X_{t+1}\) dominates a binomial distribution with expectation \(2 X_t\). By the well-known fact that the median of a binomial distribution with integral expectation is equal to this expectation, first explicitly shown in [34], we have \(\Pr [X_{t+1} \ge 2 X_t] \ge 1/2\). Consequently, the time to reach 32 is at most the time \({{\tilde{T}}}\) it takes for a sequence of random bits to encounter five successive ones. We note that the expectation of \({{\tilde{T}}}\) satisfies the recurrence \(E[{{\tilde{T}}}] = \frac{1}{2} (1 + E[{{\tilde{T}}}]) + \frac{1}{4} (2 + E[{{\tilde{T}}}]) + \frac{1}{8} (3 + E[{{\tilde{T}}}]) + \frac{1}{16} (4 + E[{{\tilde{T}}}]) + \frac{1}{32} (5 + E[{{\tilde{T}}}]) + \frac{1}{32} \cdot 5\), which gives \(E[{{\tilde{T}}}] = 62\).
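Spelled out, the recurrence collects a factor of \(\tfrac{31}{32}\) in front of \(E[{{\tilde{T}}}]\):

$$\begin{aligned} E[{{\tilde{T}}}] = \tfrac{31}{32}\, E[{{\tilde{T}}}] + \tfrac{1}{2} + \tfrac{2}{4} + \tfrac{3}{8} + \tfrac{4}{16} + \tfrac{5}{32} + \tfrac{5}{32} = \tfrac{31}{32}\, E[{{\tilde{T}}}] + \tfrac{62}{32}, \end{aligned}$$

so that \(\tfrac{1}{32}\, E[{{\tilde{T}}}] = \tfrac{62}{32}\) and indeed \(E[{{\tilde{T}}}] = 62\).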

Once a state of 32 or more is reached, we either witness a failure or an increase by a factor of \((1+\delta /2)\). Consequently, after another \(\lceil \log _{1+\delta /2}(n/32)\rceil\) iterations, we have encountered a failure or reached the target, and hence the phase has ended within this timespan. In summary, the expected length of a phase, regardless of the starting state and regardless of whether it is successful or not, is at most \(62+\lceil \log _{1+\delta /2}(n/32)\rceil\) iterations. Noting that \(1+\delta \le (1+\delta /2)^2\), we have

$$\begin{aligned} \log _{1+\delta /2}(x) \le 2 \log _{1+\delta }(x) \end{aligned}$$
(3)

for all \(x \ge 1\), and thus the expected length of a phase is at most

$$\begin{aligned}62+\lceil \log _{1+\delta /2}(n/32)\rceil \le 63 + 2 \log _{1+\delta }(n).\end{aligned}$$

From the “in particular” part of the “in addition” statement, which we shall prove shortly, we see that a phase is successful with probability at least 0.78. By elementary properties of the geometric distribution, there is an expected number of at most \(\frac{1}{0.78} \le 1.283\) phases until the process is successful, and hence reaches the target. Since each phase takes an expected number of at most \(63 + 2 \log _{1+\delta }(n)\) iterations, the desired expected hitting time is at most \(\frac{1}{0.78} (63 + 2 \log _{1+\delta }(n)) \le 81 + 2.6 \log _{1+\delta }(n)\) by Wald’s equation (Theorem 14).

We now prove the “in addition” statement. For any time t, by a simple Chernoff bound (e.g., Theorem 1.10.5, Equation (1.10.12), in [13]), we have (using \(\delta \ge 1\))

$$\begin{aligned} \Pr [&X_{t+1}< (1+\delta /2) X_t] \\&\le \Pr [X_{t+1}< 0.75(1+\delta ) X_t] \le \Pr [X_{t+1} < 0.75 E[X_{t+1}]] \\&\le \exp (-E[X_{t+1}]/32) \le \exp (-(1+\delta )X_t/32) \le \exp (-X_t/16). \end{aligned}$$

Assume that at some time \(t_1\) we have \(X_{t_1} = x\). Let us now, minimally modifying the previously introduced notation, speak of a failure when for some \(t \ge t_1\) we have \(X_{t+1} < (1 + \delta /2) X_t\). Noting that no failure for i iterations leads to a state \(X_{t_1+i} \ge (1+\delta /2)^i X_{t_1} = (1+\delta /2)^i x\), we see that the probability that no failure happens in any iteration later than \(t_1\) is at least

$$\begin{aligned} \prod _{i=0}^\infty&(1 - \exp (-(1+\delta /2)^i x/16)) \\&\ge \prod _{i=0}^\infty \left( 1 - \exp (-2 \cdot (3/2)^i \cdot x/32 )\right) \\&\ge 1 - \sum _{i=0}^\infty \exp (-2 \cdot (3/2)^i \cdot x/32) \\&\ge 1 - \sum _{i=2}^\infty \exp (-i \cdot x/32) = 1 - \tfrac{1}{e^{x/32}(e^{x/32}-1)}, \end{aligned}$$

where, similarly as in Lemma 13, we employ the Weierstrass product inequality and the fact that \(2 \cdot (3/2)^i \ge i+2\) for all non-negative integers i. We note that for \(x=32\), this bound is less than 0.22 and the event “no failure” implies the event to never go below 32. \(\square\)

2.2 Processes That Can Reach Zero

We now extend the multiplicative up-drift theorem to include state 0. Since the subprocess consisting only of states greater than 0 satisfies the assumptions of the first up-drift theorem, we obtain from the latter an upper bound on the time spent above 0. It therefore remains to estimate the time spent in state 0, which in particular means estimating how often the process reaches this state. In the technically more demanding case that \(\delta \le 1\), we exploit that the process is a submartingale. We can thus employ the optional stopping theorem to estimate that with probability \(1 - \Omega (\delta )\) the process reaches 0 before reaching \(D_0 = \min \{\lceil 100 / \delta \rceil , n\}\). Consequently, after an expected number of \(O(1/\delta )\) attempts, the process reaches \(D_0\), and from there with constant probability never goes back to zero.

Theorem 16

(Second Multiplicative Up-Drift Theorem) Let \((X_{t})_{t \in \mathbb {N}}\) be a stochastic process over \(\mathbb {Z}_{\ge 0}\). Let \(n,k \in \mathbb {Z}_{\ge 1}\), \(E_0 > 0\), \(\gamma _0 < 1\), and \(\delta > 0\) such that \(n - 1 \le \min \{\gamma _0 k, (1+\delta )^{-1} k\}\). Let \(D_0 = \min \{\lceil 100/\delta \rceil ,n\}\) when \(\delta \le 1\) and \(D_0 = \min \{32,n\}\) otherwise. Assume that for all \(t \ge 0\) and all \(x \in [0..n-1]\) with \(\Pr [X_{t} = x] > 0\), the following two properties hold.

(Bin):

If \(x \ge 1\), then \((X_{t+1} \mid X_{t} = x) \succeq \mathrm {Bin}(k,(1+\delta ) x/k)\).

(0):

\(E[ \min \{X_{t+1}, D_0\} \mid X_{t} = 0] \ge E_0\).

Let \(T := \min \{t \ge 0 \mid X_{t} \ge n\}\). Then, if \(\delta \le 1\),

$$\begin{aligned} E[T]&\le \frac{4D_0 }{0.4088 E_0} + \frac{15}{1-\gamma _0} D_0 \ln (2 D_0)+ 2.5 \log _2(n) \lceil 3 / \delta \rceil . \end{aligned}$$

In particular, when \(\gamma _0\) is bounded away from 1 by a constant, then \(E[T] = O(\frac{1}{E_0\delta } + \frac{\log (n)}{\delta })\), where the asymptotic notation refers to n tending to infinity and where \(\delta =\delta (n)\) may be a function of n. Furthermore, if \(n > 100/\delta\), then we also have that once the process has reached a state of at least \(100/\delta\), the probability to ever return to a state of at most \(50/\delta\) is at most 0.5912.

If \(\delta > 1\), then we have

$$\begin{aligned} E[T]&\le \frac{128}{0.78 E_0} + 2.6 \log _{1+\delta }(n) + 81\\&= O\left( \frac{1}{E_0} + \frac{\log (n)}{\log (\delta )}\right) . \end{aligned}$$

In addition, once the process has reached state 32 or higher, the probability to ever return to a state lower than 32 is at most \(\tfrac{1}{e(e-1)} < 0.22\).

We show the theorem by considering two different kinds of steps of the process: those spent in state 0 and those spent in other states. For the latter we understand what happens from Theorem 3, so it remains to see what happens in state 0. There are in turn two ways in which the process can be in state 0. Either it was in state 0 in the previous step as well; in this case we use (0) to see how the process gets out again. Or the process has just fallen back to state 0 from a positive state; this second case is more complicated.

From Theorem 3 we know that it is unlikely to return back to 0 after having reached a sufficiently high value. In order to compute a good bound on the return probability for smaller values of the process, we use the optional stopping theorem, which we state next for convenience. We use a version given by Grimmett and Stirzaker [19, Chapter 12.5, Theorem 9] that can be extended to super- and submartingales.

Theorem 17

(Optional Stopping) Let \((X_t)_{t \in \mathbb {N}}\) be a random process over \(\mathbb {R}\), and let T be a stopping time for \((X_t)_{t \in \mathbb {N}}\). Suppose that

  (a) \(E[T] < \infty\), and

  (b) there is some value \(c \ge 0\) such that, for all \(t < T\), it holds that \(E[|X_{t+1} - X_t| \mid X_0,\ldots ,X_t] \le c\).

Then the following two statements hold.

  (i) If, for all \(t < T\), \(X_t - E[X_{t+1} \mid X_0,\ldots ,X_t] \ge 0\), then \(E[X_T] \le E[X_0]\).

  (ii) If, for all \(t < T\), \(X_t - E[X_{t+1} \mid X_0,\ldots ,X_t] \le 0\), then \(E[X_T] \ge E[X_0]\).

For the application of the optional stopping theorem it will be necessary to have a good bound on the value of the process after exceeding some value. Since no good bounds are guaranteed for the original process, we instead analyze a slightly different process which we can construct with the following lemma. It states, roughly, that we can replace a binomial random variable with expectation E with a random variable that is identically distributed in [0..E] and takes values only in \([0..\lceil 4E \rceil ]\) such that the expectation is not lowered. We suspect that this result may be convenient in many other such situations, e.g., when using additive drift in processes that may overshoot the target.

Lemma 18

Let Y be a random variable taking values in the non-negative integers such that \(Y \succeq {{\,\mathrm{Bin}\,}}(k,p)\) for some \(k \in \mathbb {N}\) and \(p \in [0,1]\) with \(kp \ge 1\). Let \(E = kp\) denote the expectation of \({{\,\mathrm{Bin}\,}}(k,p)\). Then there is a random variable Z such that

  • \(\Pr [Z=i] = \Pr [Y=i]\) for all \(i \in [0..E]\),

  • \(\Pr [Z=i] = 0\) for all \(i \ge 4E+1\),

  • \(E[Z] \ge E\).

Proof

Let Z be defined by \(\Pr [Z=i] = \Pr [Y=i]\) for all \(i \in [0..E]\) and \(\Pr [Z = \lceil 4E \rceil ] = 1 - \Pr [Y \in [0..E]]\). Then it remains to show that \(E[Z] \ge E\). If \(X \sim {{\,\mathrm{Bin}\,}}(k,p)\), and hence \(E = E[X]\), then \(\Pr [X > E] \ge \frac{1}{4}\) by [11]. Since \(Y \succeq X\), we have \(\Pr [Y> E ] \ge \Pr [X > E] \ge \frac{1}{4}\). By definition, \(\Pr [Z = \lceil 4E \rceil ] = \Pr [Y > E] \ge \frac{1}{4}\) and thus \(E[Z] = \sum _{i = 0}^{\lceil 4E \rceil } i \Pr [Z = i] \ge \lceil 4E \rceil \Pr [Z = \lceil 4E \rceil ] \ge \lceil 4E \rceil \cdot \frac{1}{4} \ge E\). \(\square\)
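
To make the construction concrete, the following Python sketch (ours; the parameters are an arbitrary example) builds the distribution of Z for \(Y \sim {{\,\mathrm{Bin}\,}}(k,p)\) and confirms \(E[Z] \ge E\).

```python
import math

def capped_distribution(k, p):
    """Distribution of Z from Lemma 18 for Y ~ Bin(k, p) with kp >= 1:
    Z agrees with Y on [0..E] and carries the remaining mass on ceil(4E)."""
    E = k * p
    z = {i: math.comb(k, i) * p**i * (1 - p) ** (k - i)
         for i in range(math.floor(E) + 1)}
    z[math.ceil(4 * E)] = 1.0 - sum(z.values())
    return z

k, p = 100, 0.05  # E = kp = 5
z = capped_distribution(k, p)
print(sum(i * q for i, q in z.items()) >= k * p)  # True: E[Z] >= E
```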

We now prove Theorem 16.

Proof

Consider first the case \(\delta \le 1\). We first analyze the time spent on all states different from 0. To this aim, let \({{\tilde{X}}}_{t}\), \(t = 0, 1, \dots\), be the subprocess where we are above zero. Formally speaking, \({{\tilde{X}}}\) is the subsequence of \((X_{t})\) consisting of all \(X_{t}\) that are greater than 0. Viewed as a random process, this means that we sample the next state according to the same rules as for the X-process; however, if this state is zero, then immediately and without counting this as a step we sample the new state from the distribution described in (0), conditional on being positive (which is the same as saying that we resample until we obtain a positive result). With this, the distribution describing one step of the process is a distribution on the positive integers such that \(({{\tilde{X}}}_{t+1} \mid {{\tilde{X}}}_{t}) \succeq {{\,\mathrm{Bin}\,}}(k,(1+\delta ) {{\tilde{X}}}_{t} / k)\). We may thus apply Theorem 3 and obtain that after an expected total number of at most

$$\begin{aligned} \tfrac{15}{1-\gamma _0} D_0 \ln (2 D_0)+ 2.5 \log _2(n) \lceil 3 / \delta \rceil \end{aligned}$$

steps, the process \({{\tilde{X}}}\) reaches or exceeds n.

It remains to analyze how many steps the process X spends on state 0. To this end we first show the following claim bounding the probability of reaching \(D_0\) before falling back to 0 when at a state x. The proof of the claim is essentially an adaptation of an argument regarding unbiased random walks (also known as the Gambler’s Ruin problem); see, for example, [33, Sect. 12.2] for a treatment.

Claim: Let x be such that \(0 \le x \le D_0\) and let \(t_0 \ge 0\). We condition on \(X_{t_0}=x\). Then the probability that, in the time from \(t_0\) on, the process reaches a state of at least \(D_0\) before reaching state 0 is at least \(x/ ( 4D_0)\).

The claim is trivially true for \(x=0\). Thus, suppose \(x > 0\). To ease reading, we regard the process \((Y_t)\) defined by \(Y_t = X_{t_0 + t}\) for all \(t \ge 0\). Clearly, \(E[Y_0] = x\).

Let R be the first time that Y reaches or exceeds \(D_0\), or hits 0; this is a stopping time. To ease the following argument, we regard the following process Z, which equals Y until the stopping time (and hence has the same stopping time). We define Z recursively. We start by setting \(Z_0 := Y_0\). Assume that \(Z_t\) is defined and \(Z_t \cdot \mathbf{1 }_{Z_t \le D_0} = Y_t \cdot \mathbf{1 }_{Y_t \le D_0}\). If \(Z_t > D_0\), then we set \(Z_{t+1} = Z_t\). Otherwise, that is, when \(Z_t = Y_t = x \le D_0\) for some x, then we recall that \(Y_{t+1} \succeq {{\,\mathrm{Bin}\,}}(k,(1+\delta ) x/k) \succeq {{\,\mathrm{Bin}\,}}(k,x/k)\). In this case, we let \(Z_{t+1}\) be the random variable constructed in Lemma 18 (w.r.t. \(Y_{t+1}\), k, and \(p = x/k\)). By this lemma, we have \(Z_{t+1} \cdot \mathbf{1 }_{Z_{t+1} \le D_0} = Y_{t+1} \cdot \mathbf{1 }_{Y_{t+1} \le D_0}\), allowing us to continue our recursive definition of Z, and \(E[Z_{t+1} \mid Z_t] \ge Z_t\), showing that \((Z_t)\) is a submartingale. We can thus use the optional stopping theorem to see that \(E[Z_R] \ge E[Z_0]\). Furthermore,

$$\begin{aligned} E[Z_R]&= \Pr [Z_R \ge D_0] E[Z_R \mid Z_R \ge D_0] + \Pr [Z_R = 0] E[Z_R \mid Z_R = 0] \\&= \Pr [Z_R \ge D_0] E[Z_R \mid Z_R \ge D_0] \le \Pr [Z_R \ge D_0] \cdot 4D_0, \end{aligned}$$

the latter again due to Lemma 18. Consequently

$$\begin{aligned} \Pr [Y_R \ge D_0] = \Pr [Z_R \ge D_0] \ge \frac{E[Z_0]}{4D_0} = \frac{x}{4D_0}. \end{aligned}$$

This shows the claim.
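
The claim can also be checked empirically. The following Monte Carlo sketch (ours; all parameters are arbitrary choices for illustration) simulates the dominating process \(X_{t+1} \sim {{\,\mathrm{Bin}\,}}(k,(1+\delta )X_t/k)\) and compares the observed probability of reaching \(D_0\) before 0 with the guaranteed \(x/(4D_0)\).

```python
import numpy as np

rng = np.random.default_rng(42)

def reaches_d0_first(x, d0, k, delta):
    """One run of X_{t+1} ~ Bin(k, (1+delta) X_t / k) until 0 or >= d0."""
    while 0 < x < d0:
        x = rng.binomial(k, min(1.0, (1 + delta) * x / k))
    return x >= d0

k, delta, d0, x0, runs = 10_000, 0.01, 100, 10, 10_000
hits = sum(reaches_d0_first(x0, d0, k, delta) for _ in range(runs))
print(hits / runs, ">=", x0 / (4 * d0))  # empirical value is well above 0.025
```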

Let \(t\ge 0\), let us again condition on \(X_t = 0\), and let A be the event that the process reaches a state of at least \(D_0\) after time t before reaching state 0 again. The claim yields \(\Pr [A \mid X_{t+1}=x] \ge x/(4D_0)\) for all \(x \in [0..D_0]\), and for \(x > D_0\) the event A has already occurred; hence \(\Pr [A \mid X_{t+1}=x] \ge \min \{x,D_0\}/(4D_0)\) for all x. Using the law of total probability and condition (0), we now see that

$$\begin{aligned} \Pr [A]&= \sum _{x=0}^\infty \Pr [A \mid X_{t+1}=x]\Pr [X_{t+1}=x]\\&\ge \sum _{x=0}^\infty \frac{\min \{x,D_0\}}{4D_0}\Pr [X_{t+1}=x]\\&= \frac{E[\min \{X_{t+1},D_0\}]}{4D_0} \ge \frac{E_0}{4D_0}. \end{aligned}$$

We conclude that the number of iterations spent on state 0 before reaching a state of at least \(D_0\) is dominated by a geometric distribution with success rate \(\frac{E_0}{4D_0}\). Consequently, the expected number of these iterations is at most \(4D_0/E_0\).

Once the process has reached a state of \(D_0\) or higher, by Theorem 3 the probability to ever return to 0 is at most 0.5912. Hence the expected number of times this happens is at most 1/0.4088. We can now use Wald’s equation (Theorem 14) to obtain the desired run time result.

The case of \(\delta > 1\) is analogous with 32 instead of \(D_0\) and using Lemma 15 instead of Theorem 3. \(\square\)
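
To illustrate the two-part structure of this proof, the following sketch (ours, with arbitrarily chosen parameters) simulates a process satisfying (Bin) and (0) and counts separately the iterations spent in state 0 and above it.

```python
import numpy as np

rng = np.random.default_rng(7)

def hitting_time_split(n, k, delta, sample_from_zero):
    """Process per Theorem 16: from x >= 1 sample Bin(k, (1+delta)x/k);
    from state 0 sample via the given function. Returns the number of
    iterations spent at zero and above zero until reaching n."""
    x, at_zero, above = 0, 0, 0
    while x < n:
        if x == 0:
            at_zero += 1
            x = sample_from_zero()
        else:
            above += 1
            x = rng.binomial(k, min(1.0, (1 + delta) * x / k))
    return at_zero, above

# From state 0, move to state 1 with probability 1/2 (so E_0 = 1/2).
print(hitting_time_split(1000, 10_000, 0.1, lambda: int(rng.integers(0, 2))))
```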

2.3 Processes That Start High

In condition (0) of the second up-drift theorem (Theorem 16), we only exploit the progress made to states not exceeding \(D_0\) when leaving state 0. When a process has a decent chance to leave 0 to a state equal to or above \(D_0\), then we can ignore the costly first part of the analysis. This is what we analyze in this section by replacing the condition (0) with a start condition (S) which intuitively says that, at any time of the process (even when not at state 0), we have a good chance of starting the process fresh from a rather high minimum value. The proof is an easy combination of Lemma 13 and a restart argument. To ease the notation, we use the shorthand \(\log ^0_b(x) := \max \{0, \log _b(x)\}\) for all \(x \in \mathbb {R}\) and \(b > 1\).

Theorem 19

(Third Multiplicative Up-Drift Theorem) Let \((X_{t})_{t \in \mathbb {N}}\) be a stochastic process over \(\mathbb {Z}_{\ge 0}\). Let \(n,k \in \mathbb {Z}_{\ge 1}\), \(p \in (0,1]\), and \(\delta > 0\) such that \(n - 1 \le (1+\delta )^{-1} k\). Let \(D_0 = \min \{100/\delta ,n\}\) when \(\delta \le 1\) and \(D_0 = \min \{32,n\}\) otherwise. Let \(x_{\mathrm {min}}\ge D_0 > 0\). Assume that for all \(t \ge 0\) and all \(x \in [0..n-1]\) with \(\Pr [X_{t} = x] > 0\), the following two properties hold.

(Bin):

If \(x \ge x_{\mathrm {min}}\), then \((X_{t+1} \mid X_{t} = x) \succeq \mathrm {Bin}(k,(1+\delta ) x/k)\).

(S):

\(\Pr [ X_{t+1} \ge x_{\mathrm {min}}\mid X_{t} = x] \ge p\). Also, \(\Pr [X_0 \ge x_{\mathrm {min}}] \ge p\).

Let \(T := \min \{t \ge 0 \mid X_{t} \ge n\}\). Then, if \(\delta \le 1\),

$$\begin{aligned} E[T]&\le 2.5\left( 1/p + \lceil \log ^0_2(n / x_{\mathrm {min}})\rceil \lceil 3 / \delta \rceil \right) . \end{aligned}$$

If \(\delta > 1\), then we have

$$\begin{aligned} E[T]&\le 1.3/p + 2.6 \lceil \log ^0_{1+\delta }(n / x_{\mathrm {min}}) \rceil . \end{aligned}$$

Proof

We start by considering the case \(\delta \le 1\). Regardless of where the process is at some time \(t_0\), by the start condition (S) it takes an expected number of at most 1/p iterations to again reach a state of at least \(x_{\mathrm {min}}\). Then, by Lemma 13 and \(x_{\mathrm {min}}\ge D_0\), with probability at least 0.4088 the process reaches or exceeds n within another \(\lceil \log ^0_2(n / x_{\mathrm {min}})\rceil \lceil 3 / \delta \rceil\) iterations when starting at \(x_{\mathrm {min}}\) or higher.

In case this fails (with probability at most \(1 - 0.4088\)), we simply restart the argument at the current state. By Wald’s equation (Theorem 14), the expected time to reach or exceed n is at most

$$\begin{aligned} \frac{1}{0.4088}\left( 1/p + \lceil \log ^0_2(n / x_{\mathrm {min}})\rceil \lceil 3 / \delta \rceil \right) . \end{aligned}$$

The claim follows from noting that \(1/0.4088 < 2.5\).

For \(\delta > 1\), we proceed similarly. It again takes an expected number of at most 1/p iterations to reach \(x_{\mathrm {min}}\) or higher. If \(x_{\mathrm {min}}\ge n\), we are done. Otherwise, we invoke Lemma 15 to see that with probability at least 0.78, the process increases by a factor of at least \((1+\delta /2)\) in each subsequent iteration (that starts below n), using \(x_{\mathrm {min}}\ge D_0 \ge 32\). In this case, using again Equation (3), we reach n in at most \(\lceil \log _{1+\delta /2}(n/x_{\mathrm {min}}) \rceil \le 2 \lceil \log _{1+\delta }(n / x_{\mathrm {min}}) \rceil\) iterations. With a restart argument for the failure case (occurring with probability at most 0.22), Wald’s equation (Theorem 14) gives the claimed expected hitting time of \(\tfrac{1}{0.78} (1/p + 2 \lceil \log _{1+\delta }(n / x_{\mathrm {min}}) \rceil ) \le 1.3/p + 2.6 \lceil \log _{1+\delta }(n / x_{\mathrm {min}}) \rceil\).\(\square\)

3 The Level-Based Theorem

In this section, we apply our up-drift theorems to give an insightful proof of a sharper version of the level-based theorem first proposed by Lehre [29].

The general setup of such level-based theorems is as follows. There is a ground set \({\mathcal {X}}\), which in typical applications is the search space of an optimization problem. On this ground set, a Markov process \((P_t)\) induced by a population-based EA is defined. We consider populations of fixed size \(\lambda\), which may contain elements several times (multi-sets). We write \({\mathcal {X}}^\lambda\) to denote the set of all such populations. We only consider Markov processes where each element of the next population is sampled independently with repetition. That is, for each population \(P \in {\mathcal {X}}^\lambda\), there is a distribution D(P) on \({\mathcal {X}}\) such that given \(P_t\), the next population \(P_{t+1}\) consists of \(\lambda\) elements of \({\mathcal {X}}\), each chosen independently according to the distribution \(D(P_t)\). As all our results hold for any initial population \(P_0\), we do not make any assumptions on \(P_0\).
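
In code, the class of processes covered by this setup is small enough to state explicitly. The following schematic Python sketch (ours) makes the sampling structure precise; here D(P) is an application-specific function that draws one element of \({\mathcal {X}}\) from the offspring distribution of population P, and in_target checks whether an element is a desired element (in the level-based setting introduced next, membership in the highest level).

```python
def run_population_process(P0, D, in_target, max_gens=100_000):
    """Generic level-based setting: each of the lambda individuals of the
    next population is sampled i.i.d. from the distribution D(P_t)."""
    P, lam = list(P0), len(P0)
    for gen in range(max_gens):
        if any(in_target(x) for x in P):
            return gen  # first generation whose population hits the target
        # All lam samples use the old population P; only then is P replaced.
        P = [D(P) for _ in range(lam)]
    return None
```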

In the level-based setting, we assume that there is a partition of \({\mathcal {X}}\) into levels \(A_1, \dots , A_m\). Based on information in particular on how individuals in higher levels are generated, we aim for an upper bound on the first time such that the population contains an element of the highest level \(A_m\). The first such result was given in [29]. Improved and easier to use versions can be found in [2, 9].

To ease the comparison with our result, we now state the strongest level-based theorem before our work. We note that (i) the time bound has a quadratic dependence on \(1/\delta\) and (ii) the population size needs to be \(\Omega (\delta ^{-2} \log (\delta ^{-2}))\).

Theorem 20

([2]) Consider a population process as described above. Let \((A_1,\ldots ,A_m)\) be a partition of \({\mathcal {X}}\). We write \(A_{\ge j} := \bigcup _{i=j}^m A_i\) for all \(j \in [1..m]\). Assume that there are \(z_1,\ldots ,z_{m-1},\delta \in (0,1]\) and \(\gamma _0 \in (0,1)\) such that, for any population \(P \in {\mathcal {X}}^\lambda\), the following three conditions are satisfied.

(G1):

For each level \(j \in [1..m-1]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\), then

$$\begin{aligned} \Pr _{y \sim D(P)} [y \in A_{\ge j+1}] \ge z_j. \end{aligned}$$
(G2):

For each level \(j \in [1..m-2]\) and all \(\gamma \in (0,\gamma _0]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\) and \(|P \cap A_{\ge j+1}| \ge \gamma \lambda\), then

$$\begin{aligned} \Pr _{y \sim D(P)}[y \in A_{\ge j+1}] \ge (1+\delta )\gamma . \end{aligned}$$
(G3):

The population size \(\lambda\) satisfies

$$\begin{aligned} \lambda \ge \frac{4}{\gamma _0\delta ^2} \ln \left( \frac{128m}{z^*\delta ^2} \right) \text{, } \text{ where } z^* = \min _{j \in [1..m-1]} z_j. \end{aligned}$$

Let \(T := \min \{ \lambda t \mid P_t \cap A_m \ne \emptyset \}\). Then we have

$$\begin{aligned} E[T] \le 8 \frac{\lambda }{\delta ^2} \sum _{j=1}^{m-1} \left( \ln \left( \frac{6\delta \lambda }{4 + z_j\delta \lambda } \right) + \frac{1}{\lambda z_j} \right) . \end{aligned}$$

The proof given in [2], as the previous proofs of level-based theorems, uses drift theory with an intricate potential function.

We now derive from our multiplicative up-drift theorems a version of the level-based theorem with a (tight) linear dependence on \(1/\delta\). This theorem is further improved over the version given in [8] by only requiring a population size that depends linearly on \(1/\delta\) (rather than at least quadratically, as in [8] or in the previous-best version given in Theorem 20). For such considerably smaller population sizes to suffice, we need a slightly stronger assumption on making improvements (compare (G1) and (G2) of Theorems 20 and 21, where an additional factor of 1/4 is inserted). We do not see any realistic situations in which the assumptions of Theorem 20 are fulfilled, but ours are not.

For the (technically more demanding) case \(\delta \le 1\), we show the following result. We treat the easier case \(\delta > 1\), not discussed in any previous work, separately at the end of this section.

Theorem 21

(Level-Based Theorem) Consider a population-based process as described in the beginning of this section.

Let \((A_1,\ldots ,A_m)\) be a partition of \({\mathcal {X}}\). Let \(A_{\ge j} := \bigcup _{i=j}^m A_i\) for all \(j \in [1..m]\). Let \(z_1,\ldots ,z_{m-1},\delta \in (0,1]\), and let \(\gamma _0 \in (0,\frac{1}{1+\delta }]\) with \(\gamma _0 \lambda \in {\mathbb {Z}}\). Let \(D_0 = \min \{\lceil 100/\delta \rceil ,\gamma _0 \lambda \}\) and \(c_1 = 56 \, 000\). Let

$$\begin{aligned} t_0 = \frac{7000}{\delta } \left( m + \frac{1}{1-\gamma _0} \sum _{j=1}^{m-1} \log ^0_2\left( \frac{2\gamma _0\lambda }{1+\frac{z_j \lambda }{D_0}}\right) + \frac{1}{\lambda } \sum _{j=1}^{m-1}\frac{1}{z_j} \right) , \end{aligned}$$

where \(\log ^0_2(x) := \max \{0,\log _2(x)\}\) for all \(x \in {\mathbb {R}}\). Assume that for any population \(P \in {\mathcal {X}}^\lambda\) the following three conditions are satisfied.

(G1):

For each level \(j \in [1..m-1]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda /4\), then

$$\begin{aligned} \Pr _{y \sim D(P)} [y \in A_{\ge j+1}] \ge z_j. \end{aligned}$$
(G2):

For each level \(j \in [1..m-2]\) and all \(\gamma \in (0,\gamma _0]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda /4\) and \(|P \cap A_{\ge j+1}| \ge \gamma \lambda\), then

$$\begin{aligned} \Pr _{y \sim D(P)} [y \in A_{\ge j+1}] \ge (1+\delta )\gamma . \end{aligned}$$
(G3):

The population size \(\lambda\) satisfies

$$\begin{aligned} \lambda \ge \frac{256}{\gamma _0 \delta } \ln \left( 8 t_0 \right) . \end{aligned}$$

Then \(T := \min \{ \lambda t \mid P_t \cap A_m \ne \emptyset \}\) satisfies

$$\begin{aligned} E[T]&\le 8\lambda t_0 = c_1 \frac{\lambda }{\delta } \left( m + \frac{1}{1-\gamma _0} \sum _{j=1}^{m-1} \log ^0_2\left( \frac{2\gamma _0\lambda }{1+\frac{z_j \lambda }{D_0}}\right) + \frac{1}{\lambda }\sum _{j=1}^{m-1}\frac{1}{z_j} \right) . \end{aligned}$$

Note that, with \(z^* = \min _{j \in [1..m-1]} z_j\) and \(\gamma _0\) a constant, (G3) in the previous theorem is satisfied for some \(\lambda\) with

$$\begin{aligned} \lambda = O\left( \frac{1}{\delta }\log \left( \frac{m}{\delta z^*} \right) \right) \end{aligned}$$

as well as for all larger \(\lambda\).

We now compare our new level-based theorem with the previous best result (Theorem 20). Since we do not try to optimize constant factors, we do not discuss these (but note that ours are large).

We first observe that, as long as \(\gamma _0\) can be assumed to be a constant bounded away from 1, our bound is for all values of the variables at most a constant factor larger than the bound of Theorem 20. When \(z_j \lambda\) is large, the \(\log ^0_2(\cdot )\) expression can degenerate to an expression of order \(\log (D_0) = O(\log (1/\delta ))\). This cannot happen for the logarithmic expression in the run time bound of Theorem 20; however, even in this case, our bound is of order \(O(\log (1/\delta )/\delta )\), whereas the previous best result was \(O(\delta ^{-2})\). Hence, ignoring constant factors and assuming that \(\gamma _0\) is a constant less than 1, our bound is at least as strong as the previous results.

In terms of asymptotic differences, we first note the improved dependence of the run time guarantee on \(\delta\). Ignoring a possible influence of \(\delta\) on the logarithmic terms in the run time estimate, the dependence now is only \(O(\delta ^{-1})\), whereas it was \(O(\delta ^{-2})\) in the previous result.

The second asymptotic difference concerns the minimum value for \(\lambda\) that is prescribed by condition (G3). Note that in both results the run time estimate is a sum of two terms, the first depending linearly on \(\lambda\). Consequently, being able to use a smaller population size \(\lambda\) can improve the run time. The main difference, and again ignoring the logarithmic term in (G3), is that \(\lambda\) has to be \(\Omega (\delta ^{-2})\) in the previous result and only \(\Omega (\delta ^{-1})\) in ours. The logarithmic terms are more tedious to compare, but clearly ours is asymptotically not larger as long as \(\lambda\) is at most exponential in m or at most exponential in \(1/z^*\).

We continue by discussing minor differences between the two results. We note that \(t_0\) in our result depends on \(\lambda\). We thus end up in the slightly annoying situation that in our version, \(\lambda\) appears also in the right-hand side of (G3). However, since \(\lambda\) appears on the right-hand side only inside a logarithm (and one that is at least \(\ln (m)\)), it is usually not difficult to find solutions for this inequality that lead to an asymptotically optimal value \(\lambda\).
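
Concretely, one can solve the implicit inequality in (G3) by a simple fixed-point iteration, as in the following sketch (our own helper, not from the paper; t0_of_lambda must evaluate the definition of \(t_0\) above for a given \(\lambda\)).

```python
import math

def feasible_lambda(t0_of_lambda, gamma0, delta, lam=1):
    """Find lam with lam >= (256 / (gamma0 * delta)) * ln(8 * t0(lam)),
    i.e., condition (G3). Since lam enters the right-hand side only
    logarithmically, the fixed-point iteration stabilizes quickly."""
    while True:
        rhs = math.ceil(256 / (gamma0 * delta) * math.log(8 * t0_of_lambda(lam)))
        if lam >= rhs:
            return lam
        lam = rhs

# Toy example with t_0 depending logarithmically on lam:
print(feasible_lambda(lambda lam: 7000 * (10 + math.log(lam + 1)), 0.5, 0.1))
```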

One key difference is that both (G1) and (G2) impose a condition already from the point when at least \(\gamma _0 \lambda /4\) individuals are on a level, whereas the previous level-based theorem (like the conference version of this work) only does so from \(\gamma _0 \lambda\) individuals on. This additional slack is required to bring down the dependence of \(\lambda\) on \(1 / \delta\) from essentially quadratic to essentially linear. We do not see any realistic application where the stronger versions of (G1) and (G2) would be harder to show than the previous ones.

In summary, when ignoring constant factors, we do not see any noteworthy downsides of our new result and we did not find any result previously proven via a level-based theorem that could not be proven with our result. At the same time, the superior asymptotics of the run time bound and the minimum requirement on \(\lambda\) in terms of \(\delta\) clearly are an advantage of our result.

We now proceed with proving the new level-based theorem. We shall use an estimate for the probability that a binomial random variable is at least its expectation. The following result was proven with elementary means in [11]. A very similar result was shown with deeper methods in [18].

Lemma 22

Let \(n \in {\mathbb {N}}\) and \(p \ge \frac{1}{n}\). Let \(X \sim {{\,{\mathrm{Bin}}\,}}(n,p)\). Then

$$\begin{aligned}\Pr [X \ge E[X]] \ge \frac{1}{4}.\end{aligned}$$

We are now ready to state the formal proof of Theorem 21.

Proof

From (G3) we have

$$\begin{aligned} \gamma _0 \lambda \ge 200/\delta \ge 200. \end{aligned}$$
(4)

We say that we lose level j if, before having optimized, there is a time t at which there are at least \(\gamma _0 \lambda\) individuals at least on level j, and a later time \(t' > t\) such that at that time there are less than \(\gamma _0 \lambda / 4\) individuals at least on level j.

Our proof proceeds now as follows. First we will condition on never losing a level. We show that we have multiplicative up-drift for the number of individuals on the lowest level which does not have at least \(\gamma _0\lambda\) individuals and a simple induction allows us to go up level by level. Then we show that any level which has at least \(\gamma _0\lambda\) individuals will not be lost until the optimization ends, with sufficiently high probability.

Since we are only interested in the time until we have the first individual in \(A_m\), we may assume that condition (G2) also holds for \(j=m-1\).

We now analyze how the number of individuals above the highest level with at least \(\gamma _0 \lambda\) individuals develops. Let a level \(j \le m-1\) be given such that \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\). We condition on never losing level j, that is, on never having less than \(\gamma _0 \lambda /4\) individuals on level j or higher. We let \((X_{t})\) be the random process describing the number of individuals on level \(j+1\) or higher, that is, we have \(X_{t} = |P_t \cap A_{\ge j+1}|\) for all t.

We now distinguish two cases. Suppose first that \(z_j \lambda \ge D_0\); this means that we expect at least \(D_0\) individuals on the new level in any given iteration. By Lemma 22, we can apply Theorem 19 with \(p= \frac{1}{4}\), \(n=\gamma _0 \lambda\), and \(x_{{\mathrm {min}}}= z_j \lambda\) to see that the level is filled to at least \(\gamma _0\lambda\) individuals in an expected time of at most

$$\begin{aligned} T_j&:= 2.5\left( 4 + \lceil \log ^0_2(\gamma _0 / z_j)\rceil \lceil 3 / \delta \rceil \right) \\&\le 10 + 10 \, \frac{\lceil \log ^0_2(\gamma _0 / z_j) \rceil }{\delta } \end{aligned}$$

iterations.

In the second case we have \(z_j \lambda < D_0\) and we want to use Theorem 16, where our target is again to have \(n = \gamma _0 \lambda\) individuals on level \(j+1\) or higher. We start by determining a useful \(E_0\) for which we can show Condition (0). From (G1) we have that if \(X_{t} =0\), then the number \(Y := X_{t+1}\) of individuals sampled in \(A_{\ge j+1}\) follows a binomial law with parameters \(\lambda\) and success probability \(p \ge z_j\).

We now estimate \(E_0^{(j)} := E[\min \{D_0,Y\}]\). Assume first that \(\lambda z_j \ge 1\) and hence \(E[Y] \ge 1\). By Lemma 22, we have \(E_0^{(j)} \ge \frac{1}{4} \min \{D_0,E[Y]\} = \frac{1}{4} \min \{D_0, \lambda z_j\}\). If instead we have \(\lambda z_j < 1\), then the probability to sample at least one individual on a higher level is at least \(\Pr [Y \ge 1] \ge 1 - (1-z_j)^{\lambda } \ge 1 - \exp (-z_j \lambda ) \ge 1 - (1 - \frac{1}{2} z_j \lambda ) = \frac{1}{2} z_j \lambda\), using the elementary estimates \(1+x \le \exp (x)\) valid for all \(x \in {\mathbb {R}}\) and \(1 - \frac{1}{2} x \ge \exp (-x)\) valid for all \(0 \le x \le 1\). Consequently, in either case, \(E_0^{(j)} \ge \frac{1}{4} \min \{D_0, \lambda z_j\}\). Since we will later need to bound the inverse of \(E_0^{(j)}\) from above, we note that

$$\begin{aligned} \frac{1}{E_0^{(j)}} \le \frac{\delta }{25} + \frac{4}{\gamma _0\lambda } + \frac{4}{\lambda z_j} \le 2 + \frac{4}{\lambda z_j} \end{aligned}$$
(5)

by \(\delta \le 1\) and Eq. (4).

From (G2) we see that when \(X_{t} > 0\), then the number \(X_{t+1}\) of individuals sampled on level \(j+1\) or higher stochastically dominates a binomial law with parameters \(\lambda\) and \((1+\delta )X_{t} / \lambda\). Consequently, we can apply Theorem 16 and estimate that the expected number of generations until there are at least \(\gamma _0 \lambda\) individuals on level \(j+1\) or higher is at most

$$\begin{aligned} T'_j&:= \frac{ 4D_0 }{0.4088 E_0^{(j)}} + \tfrac{15}{1-\gamma _0} D_0 \ln (2 D_0)+ 2.5 \log _2(\gamma _0 \lambda ) \lceil 3 / \delta \rceil . \end{aligned}$$

Since \(D_0 \ge \min \{100/\delta ,\gamma _0 \lambda \} \ge 100\) by (4) and thus \(\ln (2) \le \ln (D_0) \cdot 0.151\), we have \(15 \ln (2D_0) = 15 (\ln (2) + \ln (D_0)) \le 17.5 \ln (D_0)\). With this and \(c_0 := 17.5\) we estimate

$$\begin{aligned} T_j'&\le c_0 \left( D_0 / E_0^{(j)} + \frac{1}{1-\gamma _0} D_0 \ln (D_0)+ \frac{\log _2(\gamma _0 \lambda )}{\delta } \right) \\&\le c_0 \left( D_0\left( 2 + \frac{4}{\lambda z_j}\right) + \frac{1}{1-\gamma _0} D_0 \ln (D_0) + \frac{\log _2(\gamma _0\lambda )}{\delta }\right) \\&= c_0 \left( D_0\left( 2 + \frac{\ln (D_0)}{1-\gamma _0} \right) + \frac{\log _2(\gamma _0\lambda )}{\delta } + D_0\frac{4}{\lambda z_j}\right) \\&\le c_0 \left( D_0\left( 2\frac{\ln (D_0)}{1-\gamma _0} \right) + \frac{\log _2(\gamma _0\lambda )}{\delta } + \frac{400}{\lambda \delta z_j}\right) \\&\le c_0 \left( \frac{2 \lceil 100/\delta \rceil \ln (\gamma _0 \lambda )}{1-\gamma _0} + \frac{\log _2(\gamma _0\lambda )}{\delta } + \frac{400}{\lambda \delta z_j} \right) \\&\le \frac{c_0}{\delta } \left( \frac{203 \log _2(\gamma _0\lambda )}{1-\gamma _0} + \frac{400}{\lambda z_j} \right) . \end{aligned}$$

Let

$$\begin{aligned}T_j^* = \frac{c_0}{\delta } \left( 1 + \frac{203}{1-\gamma _0} \log ^0_2\left( \frac{2\gamma _0\lambda }{1+\frac{z_j \lambda }{D_0}}\right) + \frac{400}{\lambda z_j} \right) \end{aligned}$$

and note that \(T_j^* \ge T_j\) when \(z_j \lambda \ge D_0\) and \(T_j^* \ge T_j'\) otherwise. Hence \(T_j^*\) is an upper bound for the expected time to have at least \(\gamma _0 \lambda\) individuals in \(A_{\ge j+1}\) when starting with at least \(\gamma _0 \lambda\) individuals in \(A_{\ge j}\) and assuming that we do not lose level j.

Summing over all levels, we obtain the following bound on the number of steps to reach a search point in \(A_m\), still conditional on never losing a level:

$$\begin{aligned} \sum _{j=1}^{m-1} T_j^*&\le \frac{400 c_0}{\delta } \left( m + \frac{1}{1-\gamma _0} \sum _{j=1}^{m-1} \log ^0_2\left( \frac{2\gamma _0\lambda }{1+\frac{z_j \lambda }{D_0}}\right) + \sum _{j=1}^{m-1}\frac{1}{\lambda z_j} \right) = t_0. \end{aligned}$$

We now argue that, with sufficiently high probability, we indeed do not lose a level. Specifically, we show that, from any iteration with at least \(\gamma _0 \lambda /2\) individuals until the next iteration with at least that many individuals, the probability is at most

$$\begin{aligned} \exp \left( - \frac{\delta \gamma _0 \lambda /2}{128} \right) \end{aligned}$$

that we have an iteration with less than \(\gamma _0 \lambda / 4\) individuals in between (which we will call a failure).

We distinguish two cases: we either have at least \(\gamma _0 \lambda\) individuals on the level and above, or less. Using a standard Chernoff bound argument on (G2) with \(\gamma =\gamma _0\) we see that, for iterations with at least \(\gamma _0 \lambda\) individuals, the probability to fall below \(\gamma _0 \lambda /2\) individuals in the next step is at most

$$\begin{aligned} \exp (-\gamma _0 \lambda / 8) < \exp \left( - \frac{\delta \gamma _0 \lambda /2}{128}\right) . \end{aligned}$$

This shows that steps with at least \(\gamma _0 \lambda\) individuals lead to a failure with at most the desired small probability.

In the case of less than \(\gamma _0 \lambda\) individuals, just as in the proof of Theorem 3, we want to apply Lemma 12. In the language of Lemma 12, we have \(n = \gamma _0\lambda \ge 200/\delta\) using Equation (4). Thus, we can use Lemma 12 to estimate the probability of falling below \(\gamma _0\lambda /4\) after having reached at least \(D \ge \gamma _0 \lambda /2 \ge 100/\delta\) individuals. We thus see that this failure probability is at most

$$\begin{aligned} \exp \left( - \frac{\delta D}{128} \right) \le \exp \left( - \frac{\delta \gamma _0 \lambda /2}{128} \right) . \end{aligned}$$

Thus, also in this case the probability of failure is small. Using (G3), we see that the last term is at most \(1/(8t_0)\). In order to obtain the overall failure probability over any number of t steps, we can now make a union bound over all intervals, each ranging from one iteration with at least \(\gamma _0 \lambda /2\) individuals to the next. For this we pessimistically assume that we have t such intervals within t steps. Thus, the probability of ever losing a level within \(2t_0\) steps (twice the expected optimization time conditional on not losing a level) is at most \(p_1 :=0.25\). By Markov’s inequality, the probability of not optimizing within \(2t_0\) iterations, conditional on not losing a level, is at most \(p_2 := 0.5\). Thus, with a union bound on the failure probabilities, we get an unconditional probability of successful optimization within \(2t_0\) iterations of at least \(1 - p_1 - p_2 = 0.25\). A simple restart argument now shows that the expected time (in iterations) for optimization is at most \(8t_0\), giving the desired run time bound.

\(\square\)

We now discuss the case \(\delta > 1\). With similar, often easier arguments, we prove the following result.

Theorem 23

(Level-Based Theorem for \(\delta > 1\)) Consider a population-based process as described in the beginning of this section.

Let \((A_1,\ldots ,A_m)\) be a partition of \({\mathcal {X}}\). Let \(A_{\ge j} := \bigcup _{i=j}^m A_i\) for all \(j \in [1..m]\). Let \(z_1,\ldots ,z_{m-1} \in (0,1]\), \(\delta > 1\), and \(\gamma _0 \in (0,\frac{1}{1+\delta }]\) with \(\gamma _0 \lambda \in {\mathbb {Z}}_{\ge 32}\). Let \(D_0 = \min \{32,\gamma _0 \lambda \} = 32\). Let

$$\begin{aligned} t_0 = 101.6 m + 2.6 \sum _{j=1}^{m-1} \log ^0_{1+\delta }\left( \frac{2\gamma _0\lambda }{1 + \frac{z_j \lambda }{D_0}} \right) + \frac{657}{\lambda } \sum _{j=1}^{m-1} \frac{1}{z_j}. \end{aligned}$$

Assume that for any population \(P \in {\mathcal {X}}^\lambda\) the following three conditions are satisfied.

(G1):

For each level \(j \in [1..m-1]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\), then

$$\begin{aligned} \Pr _{y \sim D(P)} [y \in A_{\ge j+1}] \ge z_j. \end{aligned}$$
(G2):

For each level \(j \in [1..m-2]\) and all \(\gamma \in (0,\gamma _0]\), if \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\) and \(|P \cap A_{\ge j+1}| \ge \gamma \lambda\), then

$$\begin{aligned} \Pr _{y \sim D(P)} [y \in A_{\ge j+1}] \ge (1+\delta )\gamma . \end{aligned}$$
(G3):

The population size \(\lambda\) satisfies \(\lambda \ge \frac{4}{\gamma _0} \ln (9t_0)\).

Then \(T := \min \{ \lambda t \mid P_t \cap A_m \ne \emptyset \}\) satisfies

$$\begin{aligned} E[T]&\le 9 \lambda t_0 \le 915 \lambda m + 24 \lambda \sum _{j=1}^{m-1} \log ^0_{1+\delta }\left( \frac{2\gamma _0\lambda }{1 + \frac{z_j \lambda }{D_0}} \right) + 6000 \sum _{j=1}^{m-1} \frac{1}{z_j}. \end{aligned}$$

The assumption that \(\gamma _0 \lambda \ge 32\) is not strictly necessary, but eases the presentation. Note that (G3) and \(t_0 \ge 101.6\) already imply \(\gamma _0 \lambda \ge 27.27\). Conditions (G1) and (G2) are identical to those of the case \(\delta \le 1\) except that we only require them to hold for \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda\) instead of \(|P \cap A_{\ge j}| \ge \gamma _0 \lambda /4\). Condition (G3) is of a similar type as in the case \(\delta \le 1\).

Proof

The proof reuses many arguments from the proof for the case \(\delta \le 1\). To later apply the second multiplicative up-drift theorem, let \(D_0 = \min \{32,\gamma _0 \lambda \}\) and note that by our assumption \(D_0 = 32\).

Mildly different from the case \(\delta \le 1\), we now say that we lose a level in iteration t if there is a \(j \in [1..m-1]\) such that \(|P_t \cap A_{\ge j}| \ge \gamma _0 \lambda\) and \(|P_{t+1} \cap A_{\ge j}| < \gamma _0 \lambda\).

We again condition on never losing a level and later revoke this assumption with a restart argument. Let \(j \in [1..m-1]\) and assume that at some time \(t'\) we have \(|P_{t'} \cap A_{\ge j}| \ge \gamma _0 \lambda\). We analyze how the number of individuals on levels above j develops. To this aim, let \(X_t = |P_{t'+t} \cap A_{\ge j+1}|\) for all \(t = 0, 1, 2, \dots\). As in the analysis of the case \(\delta \le 1\), we distinguish two cases. When \(z_j \lambda \ge D_0\), then we can again apply Theorem 19 with \(p = 1/4\), \(x_{{\mathrm {min}}}= z_j \lambda\), and \(n = \gamma _0 \lambda\), showing that the expected time to fill level \(j+1\) to at least \(\gamma _0 \lambda\) elements is at most

$$\begin{aligned}1.3/p + 2.6 \lceil \log ^0_{1+\delta }(n/x_{{\mathrm {min}}})\rceil \le 7.8 + 2.6 \log ^0_{1+\delta }(\gamma _0/z_j). \end{aligned}$$

If instead we have \(z_j \lambda < D_0\), we argue as follows. We have \(E[\min \{X_{t+1},D_0\} \mid X_t = 0] \ge \frac{1}{4} \min \{D_0, \lambda z_j\} =: E_0^{(j)}\). We estimate

$$\begin{aligned}\frac{1}{E_0^{(j)}} \le \frac{4}{D_0} + \frac{4}{\lambda z_j} = \frac{1}{8} + \frac{4}{\lambda z_j}.\end{aligned}$$

With (G2), we again invoke Theorem 16 and obtain that the expected number of iterations to have \(X_t \ge \gamma _0 \lambda\) is at most

$$\begin{aligned}\frac{128}{0.78 E_0^{(j)}} + 2.6 \log _{1+\delta }(\gamma _0 \lambda ) + 81 \le 101.6 + \frac{657}{\lambda z_j} + 2.6 \log _{1+\delta }(\gamma _0 \lambda ).\end{aligned}$$

In either case, \(z_j \lambda \ge D_0\) or \(z_j \lambda < D_0\), this level filling-up time is at most

$$\begin{aligned} 101.6 + \frac{657}{\lambda z_j} + 2.6 \log ^0_{1+\delta }\left( \frac{2\gamma _0\lambda }{1 + \frac{z_j \lambda }{D_0}} \right) \end{aligned}$$

in expectation. Summing over all levels, we see that the expected time to, one after the other, fill all levels is at most \(t_0\) when we condition on never losing a level.

The probability to lose the current level in one iteration, by a simple Chernoff bound and (G2), is at most \(\exp (-\frac{1}{4} \gamma _0 \lambda )\), since we expect to have at least \((1+\delta ) \gamma _0 \lambda \ge 2 \gamma _0 \lambda\) offspring on this level or higher. By (G3), this probability is at most \(1 / (9t_0)\). By a simple union bound, we see that the probability to lose a level within \(3t_0\) iterations is at most 1/3. Conditional on not losing a level, the probability to not find a search point in \(A_m\) in the first \(3t_0\) iterations is at most 1/3 by Markov’s inequality. Hence with probability at least 1/3, we find the desired solution within \(3 t_0\) iterations. A simple restart argument with an expected number of three restarts now shows \(E[T] \le \lambda \cdot 9 t_0\) as claimed.\(\square\)

4 Applications

With the improved level-based theorem, we easily obtain the following three results. The first two improve previous results that were obtained via level-based theorems in the case of small \(\delta\). The last result shows that our level-based theorem for the case \(\delta > 1\) can lead to results better than what was known before, when one could only apply a level-based theorem for the case \(\delta \le 1\) (using \(\delta = 1\) even when the true \(\delta\) is larger).

4.1 Fitness-Proportionate Selection

Dang and Lehre [9] show that fitness-proportionate selection can be efficient when the mutation rate is very small, in contrast to previous results showing, for the standard mutation rate 1/n, that fitness-proportionate selection can lead to exponential run times [20, 35]. More precisely, Dang and Lehre regard the \((\lambda ,\lambda )\) EA with fitness-proportionate selection and standard bit mutation as variation operator (Algorithm 1). Here fitness-proportionate selection (with respect to a non-negative fitness function f) means that from a given population \(x_1, \dots , x_\lambda\) we choose a random element such that \(x_i\) is chosen with probability \(f(x_i) / \sum _{j=1}^\lambda f(x_j)\). When \(\sum _{j=1}^\lambda f(x_j)\) is zero, we choose an individual uniformly at random.

Dang and Lehre show that this algorithm with mutation rate \(p_{{{\,{\mathrm{mut}}\,}}}= \frac{1}{6n^2}\) and population size \(\lambda = bn^2 \ln n\) for some constant \(b>0\) optimizes the OneMax and LeadingOnes benchmark functions in an expected number of \(O(n^8 \log n)\) fitness evaluations. We note that the previous improved level-based theorem (Theorem 20) would give a bound of \(O(n^5 \log ^2 n)\) for the smallest-possible choice of \(\lambda\). With our tighter version of the level-based theorem, we obtain the following results.

[Algorithm 1 (pseudocode not reproduced here): the \((\lambda ,\lambda )\) EA with fitness-proportionate selection and standard bit mutation.]
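
Since the pseudocode is not reproduced here, the following Python sketch (ours; all names and parameters are our choices) shows one way to implement the algorithm as just described.

```python
import random

def fitprop_comma_ea(f, n, lam, p_mut, max_gens):
    """(lam,lam) EA: fitness-proportionate selection, standard bit mutation."""
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    for _ in range(max_gens):
        fits = [f(x) for x in pop]
        if max(fits) == n:  # assuming f is OneMax, so the optimum has value n
            return pop
        new_pop = []
        for _ in range(lam):
            if sum(fits) == 0:  # all fitness zero: uniform parent selection
                parent = random.choice(pop)
            else:  # pick x_i with probability f(x_i) / sum_j f(x_j)
                parent = random.choices(pop, weights=fits, k=1)[0]
            new_pop.append([b ^ (random.random() < p_mut) for b in parent])
        pop = new_pop
    return pop
```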

Theorem 24

Consider the \((\lambda ,\lambda )\) EA with fitness-proportionate selection,

  • with population size \(\lambda \ge c n \ln (n)\) with c sufficiently large and \(\lambda = O(n^K)\) for some constant K, and

  • mutation rate \(p_{{{\,{\mathrm{mut}}\,}}}\le \frac{1}{4n^2}\) and \(p_{{{\,{\mathrm{mut}}\,}}}= \Omega (n^{-k})\) for some constant k.

Then this algorithm optimizes OneMax in an expected number of \(O(\lambda n^2 \log n + n \log (n) / p_{{{\,{\mathrm{mut}}\,}}})\) fitness evaluations, which is \(O(n^3 (\log n)^2)\) for optimal parameter choices. It optimizes LeadingOnes in an expected number of \(O(\lambda n^2 \log n + n^2 / p_{{{\,{\mathrm{mut}}\,}}})\) fitness evaluations, which becomes \(O(n^4)\) for optimal parameter choices.

Proof

Let f be the function OneMax. We apply Theorem 21 with \(\gamma _0 = \frac{1}{2}\) and the partition formed by the sets \(A_i := \{x \in \{0,1\}^n \mid f(x) = i-1\}\) with \(i = 1, 2, \dots , n+1 =: m\).

To show (G1), assume that we have at least \(\gamma _0 \lambda / 4\) individuals with fitness at least j for some \(j \in [0..n-1]\). Since the selection operator favors individuals with higher fitness, the probability that the parent of a particular offspring has fitness at least j is at least \(\gamma _0/4\). Assume that such a parent was chosen (and that it does not have fitness n, since we would be done then anyway). If the parent has fitness exactly j, then the probability to generate a strictly better search point is at least \((n-j) p_{{{\,{\mathrm{mut}}\,}}}(1 - p_{{{\,{\mathrm{mut}}\,}}})^{n-1} \ge (n-j) p_{{{\,{\mathrm{mut}}\,}}}(1 - (n-1)p_{{{\,{\mathrm{mut}}\,}}}) = (n-j) p_{{{\,{\mathrm{mut}}\,}}}(1 - o(1))\) by Bernoulli’s inequality and \(p_{{{\,{\mathrm{mut}}\,}}}= o(\frac{1}{n})\). If the parent already has a fitness of \(j+1\) or better, then the probability to generate an offspring of fitness \(j+1\) or better is even higher: by simply flipping zero bits, such an offspring is generated with probability at least \((1-p_{{{\,{\mathrm{mut}}\,}}})^n \ge 1 - np_{{{\,{\mathrm{mut}}\,}}}= 1 - o(1)\). Hence in either case (G1) is satisfied with \(z_j = (n-j) \gamma _0 p_{{{\,{\mathrm{mut}}\,}}}(1 - o(1)) / 4\).

To show (G2), let \(j \in [0..n-2]\), \(\gamma \in (0, \gamma _0]\) and P be a population such that at least \(\gamma \lambda\) individuals have a fitness of at least \(j+1\) and at least \(\gamma _0 \lambda /4\) individuals have a fitness of at least j. Let \(F^+\) be the sum of the fitness values of the individuals of fitness at least \(j+1\) and let \(F^- = \sum _{x \in P} f(x) - F^+\) be the sum of the remaining fitness values. By our assumption, \(F^+ \ge \gamma \lambda (j+1)\). The probability that an individual of fitness \(j+1\) or more is chosen as parent of a particular offspring is

$$\begin{aligned} \frac{F^+}{\sum _{x \in P} f(x)}&= \frac{F^+}{F^++F^-} \\&\ge \frac{\gamma \lambda (j+1)}{\gamma \lambda (j+1)+F^-} \\&\ge \frac{\gamma \lambda (j+1)}{\gamma \lambda (j+1)+(1-\gamma )\lambda j} \\&= \gamma \left( 1+\frac{1-\gamma }{j+\gamma }\right) \ge \gamma \left( 1 + \frac{\frac{1}{2}}{j+\frac{1}{2}}\right) \ge \gamma \left( 1 + \frac{1}{2n}\right) . \end{aligned}$$

The probability that a parent creates an identical offspring is \((1 - p_{{{\,{\mathrm{mut}}\,}}})^n \ge 1 - np_{{{\,{\mathrm{mut}}\,}}}\). Consequently, the probability that an offspring has fitness at least \(j+1\) is at least \(\gamma\) times \((1+\frac{1}{2n}) (1 - np_{{{\,{\mathrm{mut}}\,}}}) \ge 1 + \frac{1}{2n} - n p_{{{\,{\mathrm{mut}}\,}}}- O(n^{-2}) \ge 1 + \frac{1}{4n} - O(n^{-2}) =: 1 + \delta\). With this \(\delta = \Theta (1/n)\), we have satisfied (G2).
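
Such chains of elementary inequalities are easy to check mechanically. The following snippet (ours, purely a sanity check) verifies the selection-probability bound \(\frac{F^+}{F^++F^-} \ge \gamma (1+\frac{1}{2n})\) for a grid of j and \(\gamma\) values with \(n = 100\).

```python
n = 100
for j in range(n - 1):  # levels j in [0..n-2]
    for gamma in (0.01, 0.1, 0.25, 0.5):  # gamma in (0, gamma_0], gamma_0 = 1/2
        lhs = gamma * (j + 1) / (gamma * (j + 1) + (1 - gamma) * j)
        assert lhs >= gamma * (1 + 1 / (2 * n)), (j, gamma)
print("selection-probability bound holds on the grid")
```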

Finally, we observe that

$$\begin{aligned} \frac{256}{\gamma _0 \delta }&\ln \left( \frac{c_1}{\delta } \left( \frac{m \log _2(\gamma _0\lambda )}{1-\gamma _0} + \frac{1}{\lambda } \sum _{j=1}^{m-1} \frac{1}{z_j} \right) \right) \\&= O\left( \frac{1}{\delta }\log \left( \frac{m}{\delta }\left( \log \lambda + \frac{1}{\lambda p_{{{\,{\mathrm{mut}}\,}}}}\right) \right) \right) \\&= O(n \log n), \end{aligned}$$

since m, \(\lambda\), and \(1/p_{{{\,{\mathrm{mut}}\,}}}\) are polynomially bounded in n. This shows (G3).

Consequently, we can employ Theorem 21 and derive an expected optimization time of

$$\begin{aligned} E[T]&\le \lambda \frac{c_1}{\delta } \left( \frac{m \log _2(\gamma _0\lambda )}{1-\gamma _0} + \frac{1}{\lambda } \sum _{j=1}^{m-1} \frac{1}{z_j} \right) \\&= O\left( \frac{\lambda m \log \lambda }{\delta } + \frac{1}{\delta } \sum _{j=1}^{m-1} \frac{1}{(n-j) p_{{{\,{\mathrm{mut}}\,}}}}\right) \\&= O(\lambda n^2 \log n + n \log (n) / p_{{{\,{\mathrm{mut}}\,}}}), \end{aligned}$$

which is \(O(n^3 \log ^2 n)\) for \(\lambda = \Theta (n \log n)\) and \(p_{{{\,{\mathrm{mut}}\,}}}= \Omega (n^{-2} (\log n)^{-1})\).

For f being the LeadingOnes function, we take the same partition of the search space and also \(\gamma _0 = \frac{1}{2}\). With similar arguments as above, we show (G1) with \(z_j = \gamma _0 p_{{{\,{\mathrm{mut}}\,}}}(1-o(1)) / 4\). The proof of (G2) remains valid without changes, since the central argument was that with sufficiently high probability a copy of the parent is generated (hence again we have \(\delta = \Theta (1/n)\)). The proof of (G3) remains valid since we estimated the \(z_j\) uniformly as \(z_j = \Omega (p_{{{\,{\mathrm{mut}}\,}}})\). Consequently, we obtain from Theorem 21 that the optimization time T satisfies

$$\begin{aligned} E[T]&\le \lambda \frac{c_1}{\delta } \left( \frac{m \log _2(\gamma _0\lambda )}{1-\gamma _0} + \frac{1}{\lambda } \sum _{j=1}^{m-1} \frac{1}{z_j} \right) \\&= O\left( \frac{\lambda m \log \lambda }{\delta } + \frac{m}{\delta p_{{{\,{\mathrm{mut}}\,}}}}\right) \\&= O(\lambda n^2 \log n + n^2 / p_{{{\,{\mathrm{mut}}\,}}}). \end{aligned}$$

This is \(O(n^4)\) for \(\lambda = O(n^2 / \log n)\) and \(p_{{{\,{\mathrm{mut}}\,}}}= \Theta (n^{-2})\).\(\square\)

4.2 Partial Evaluation

Dang and Lehre [9] also considered a different parent selection mechanism, 2-tournament selection, where a parent is chosen by picking two individuals uniformly at random and letting the fitter one produce one offspring (see Algorithm 2).

[Algorithm 2 (pseudocode not reproduced here): the population process with 2-tournament parent selection and standard bit mutation.]

The test functions they considered were OneMax and LeadingOnes under partial evaluation (a scheme for randomizing a given function), which we here define only for OneMax. Given a parameter \(c \in (0,1)\), we use n i.i.d. random variables \((R_i)_{i \le n}\), each Bernoulli-distributed with parameter c. \(\textsc {OneMax} _c\) is defined such that, for all bit strings \(x \in \{0,1\}^n\), \(\textsc {OneMax} _c(x) = \sum _{i=1}^n R_i x_i\). In other words, a bit string has a value equal to the number of 1s in it, where each 1 only counts with probability c.
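
A direct implementation of this randomized evaluation could look as follows (a minimal sketch, ours; we draw fresh \(R_i\) in every evaluation, matching the view of \(\textsc {OneMax} _c\) as a randomized function).

```python
import random

def onemax_c(x, c):
    """Partial evaluation of OneMax: each 1-bit is counted with prob. c."""
    return sum(1 for bit in x if bit == 1 and random.random() < c)

# Example: on the all-ones string of length 100 with c = 0.3,
# the value is Bin(100, 0.3)-distributed, so about 30 on average.
print(onemax_c([1] * 100, 0.3))
```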

Dang and Lehre [9] showed the following statement as part of their core proof [9, proof of Theorem 21] regarding the performance of Algorithm 2 on \(\textsc {OneMax} _c(x)\).

Lemma 25

Let n be large and \(c \in (1/n,1)\). Then there is an a such that, for all \(\gamma \in (0,1/2)\), the probability to produce an offspring (line 7 of Algorithm 2) of at least the quality of the \(\gamma \lambda\)-ranked individual of the current population is at least \(\gamma (1+a \sqrt{c/n})\).

Using their old level-based theorem (whose bound has a dependence of order \(\delta ^{-5}\)) and the best possible choice of \(\lambda\), they obtain a bound for the expected number of fitness evaluations to optimize OneMax with partial evaluation with parameter \(c \ge 1/n\) of

$$\begin{aligned} O\left( \frac{n^{4.5} \log n}{c^{3.5}} \right) . \end{aligned}$$

Using the more refined level-based theorem from [2], see Theorem 20 (with a dependence of order \(\delta ^{-2}\)), one can find a run time bound of

$$\begin{aligned} O\left( \frac{n^{3} \log n}{c^{2}} \right) . \end{aligned}$$

With our level-based theorem given in Theorem 21 (with a dependence of order \(\delta ^{-1}\)), one can prove a run time bound of

$$\begin{aligned} O\left( \frac{n^{2} (\log (n))^2}{c} \right) . \end{aligned}$$

For this we choose, analogously to [9], \(\delta = a \sqrt{c/n}\) as given in Lemma 25, \(p_{{{\,{\mathrm{mut}}\,}}}= \delta /3\), \(m=n+1\) (with the partition based on fitness), \(\gamma _0 = 1/2\), \(z_j = 7(1-j/n)(\delta /9)/16\), and \(\lambda = b \ln (n)\sqrt{n/c}\) for some sufficiently large constant b.

Analogous improvements can be found in the case of LeadingOnes.

4.3 Using \(\delta > 1\)

In all applications of the level-based theorem in the literature, only the case of \(\delta \le 1\) was used; in fact, the level-based theorem from [2] does not give a version that can benefit from \(\delta > 1\) (however, it can always be applied with \(\delta =1\) instead of the true \(\delta\)). We note the following result, which can be improved by taking \(\delta > 1\) into account.

Consider optimizing the LeadingOnes benchmark function using a \((\mu , \lambda )\) EA with ranking selection and standard bit mutation. When \(\lambda \ge 2e \mu\) and \(\lambda \ge c \log (n)\) for some specific constant c, then an expected run time of \(O(n^2 + n \lambda \log (\lambda ))\) fitness evaluations is proven in [2, Theorem 3(2)]. We easily see that in this case, using the partition of the search space into sets of equal fitness, we have \(z_j = \Theta (1/n)\) for all \(j \in [0..n-1]\) and \(\delta = \lambda /(e\mu )\).

Using our level-based theorem for \(\delta > 1\) (Theorem 23), we obtain the slightly better bound of \(O(n^2 + n\lambda \log _{1+\lambda /(e\mu )}(\lambda ))\), since the time to fill up a level becomes shorter when \(\lambda\) is asymptotically larger than \(\mu\). For example, for \(\mu = n\) and \(\lambda = n^{1.5}\), we now derive an optimization time of \(O(n\lambda ) = O(n^{2.5})\), while the previous result was \(O(n \lambda \log (\lambda )) = O(n^{2.5}\log (n))\).

5 Conclusion

In this work, we prove three drift results for multiplicatively increasing drift. Since the desired hitting time bound of order \(\log (n)/\min \{\delta ,\log (1+\delta )\}\), which means that the process behaves similarly to the deterministic process \(X_{t+1} = (1+\delta ) X_{t}\), can only be obtained under additional assumptions, we formulate our results for processes in which each state \(X_{t+1}\) is distributed according to a binomial distribution with expectation \((1+\delta ) X_{t}\) (or dominates such a distribution).

As main application for our drift results, we prove a stronger version of the level-based theorem. It in particular has the asymptotically right dependence on \(1/\delta\), which is near-linear. Previous level-based theorems only show a dependence roughly of order \(\delta ^{-5}\) [9] or \(\delta ^{-2}\) [2]. This difference can be significant in applications with small \(\delta\), e.g., the result on fitness-proportionate selection [9], which has \(\delta = \Theta (1/n)\).

An equally interesting benefit of our new level-based theorem is that its relatively elementary proof gives more insight into the actual development of such processes. It thus tells us in a more informative manner how certain population-based algorithms optimize certain problems. Such additional information can be useful to detect bottlenecks and improve algorithms. Also, the individual building blocks of our drift analysis may find separate applications.

In terms of future work, we note that there are processes showing multiplicative up-drift where the next state is not described by a binomial distribution. One example is population-based algorithms using plus-selection, where, roughly speaking, \(X_{t+1} \sim X_{t} + {{\,{\mathrm{Bin}}\,}}(\lambda ,X_{t}/\lambda )\). We are optimistic that such processes can be handled with our methods as well. We did not do this in this first work on multiplicative up-drift since such processes can also be analyzed with elementary methods, e.g., exploiting that the process is non-decreasing and with constant probability attains the expected progress. Nevertheless, extending our drift theorems to such processes should give better constants and a more elegant analysis, so we feel that this is also an interesting goal for future work.