Abstract
Probabilistic model-building Genetic Algorithms (PMBGAs) are a class of metaheuristics that evolve probability distributions favoring optimal solutions in the underlying search space by repeatedly sampling from the distribution and updating it according to promising samples. We provide a rigorous runtime analysis concerning the update strength, a vital parameter in PMBGAs such as the step size \(1/K\) in the so-called compact Genetic Algorithm (cGA) and the evaporation factor \(\rho \) in ant colony optimizers (ACO). While a large update strength is desirable for exploitation, there is a general trade-off: too strong updates can lead to unstable behavior and possibly poor performance. We demonstrate this trade-off for the cGA and a simple ACO algorithm on the well-known OneMax function. More precisely, we obtain lower bounds on the expected runtime of \({\varOmega }(K\sqrt{n} + n \log n)\) and \({\varOmega }(\sqrt{n}/\rho + n \log n)\), respectively, suggesting that the update strength should be limited to \(1/K, \rho = O(1/(\sqrt{n} \log n))\). In fact, when choosing \(1/K, \rho \sim 1/(\sqrt{n}\log n)\), both algorithms efficiently optimize OneMax in expected time \({\varTheta }(n \log n)\). Our analyses provide new insights into the stochastic behavior of PMBGAs and propose new guidelines for setting the update strength in global optimization.
1 Introduction
The term probabilistic model-building Genetic Algorithms describes a class of algorithms that construct a probabilistic model which is used to generate new search points. The model is adapted using information about previous search points. Both estimation-of-distribution algorithms (EDAs) and swarm intelligence algorithms including ant colony optimizers (ACO) and particle swarm optimizers (PSO) fall into this class. These algorithms generally behave differently from evolutionary algorithms, where a population of search points fully describes the current state of the algorithm.
EDAs like the compact Genetic Algorithm (cGA) and many ACO algorithms update their probabilistic models by sampling new solutions and then updating the model according to information about good solutions found. In this work we focus on pseudo-Boolean optimization (finding global optima in \(\{0, 1\}^n\), with n the number of bits) and simple univariate probabilistic models, that is, for each bit there is a value \(p_i\) that determines the probability of setting the ith bit to 1 in a newly created solution.
Recently, the runtime analysis of such univariate EDAs has received increasing interest. Research has focused on the expected optimization time of not only cGA but also the univariate marginal distribution algorithm (UMDA), for which upper bounds [3, 20, 33] and lower bounds [18] on its expected runtime were obtained with respect to the problem OneMax\((x) := \sum _{i=1}^n x_i\), a simple hill-climbing task. Friedrich et al. [12, 13] showed that the cGA is efficient on a noisy OneMax, even under extreme Gaussian noise. Moreover, Friedrich et al. [11] describe general properties of EDAs and how they are related to runtime analysis. In this paper, we follow up on work by Droste [7] on the cGA and by Neumann, Sudholt and Witt [26] on 2-\(\hbox {MMAS}_{\text {ib}}\), an ACO algorithm that is closely related.
The cGA was introduced by Harik et al. [15]. In brief, it simulates the behavior of a Genetic Algorithm with population size K in a more compact fashion. In each iteration two solutions are generated, and if they differ in fitness, \(p_i\) is updated by \(\pm 1/K\) in the direction of the fitter individual. Here \(1/K\) reflects the strength of the update of the probabilistic model. Simple ACO algorithms based on the Max–Min ant system (MMAS) [29], using the iteration-best update rule, behave similarly: they generate a number \(\lambda \) of solutions and reinforce the best solution amongst these by increasing values \(p_i\), here called pheromones, according to \((1-\rho ) p_i + \rho \) if the best solution had bit i set to 1, and \((1-\rho )p_i\) otherwise. Here the parameter \(0< \rho < 1\) is called evaporation factor; it plays a similar role to the update strength \(1/K\) for cGA.
Neumann et al. [26] showed that \(\lambda =2\) ants suffice to optimize the function OneMax, in expected time \(O(\sqrt{n}/\rho )\) if the update strength is chosen small enough, \(\rho \le 1/(c\sqrt{n}\log n)\) for a suitably large constant \(c > 0\). This is \(O(n \log n)\) for \(\rho = 1/(c\sqrt{n}\log n)\). If \(\rho \) is chosen unreasonably large, \(\rho \ge c'/(\log n)\) for some \(c'>0\), the algorithm shows chaotic behavior and needs exponential time even on this very simple function. In a more general sense, this result suggests that such high update strengths should be avoided in global optimization for any problem, unless the problem contains many global optima.
However, these results leave open a wide gap of parameter values between \(\sim 1/(\log n)\) and \(\sim 1/(\sqrt{n}\log n)\), for which no results are available. This raises the question of which update strengths are optimal, and for which values performance degrades. Understanding the working principles of the underlying probabilistic model remains an important open problem for both cGA and ACO algorithms. This is evident from the lack of reasonable lower bounds. The previous best known direct lower bound for MMAS algorithms for reasonable parameters was \({\varOmega }((\log n)/\rho - \log n)\) [25, Theorem 5]; this bound holds for all functions with a unique global optimum. The best known lower bound for cGA on OneMax is \({\varOmega }(K \sqrt{n})\) [7]. There are more general bounds from black-box complexity theory [6, 8], showing that the expected runtime of comparison-based algorithms such as MMAS must be \({\varOmega }(n)\) on OneMax. However, these black-box bounds do not yield direct insight into the stochastic behavior of the algorithms and do not shed light on the dependency of the algorithms’ performance on the update strength.
In this paper, we study 2-\(\hbox {MMAS}_{\text {ib}}\) and cGA with a much more detailed analysis that provides such insights through rigorous runtime analysis. We prove lower bounds of \({\varOmega }(K\sqrt{n} + n \log n)\) and \({\varOmega }(\sqrt{n}/\rho + n \log n)\) on OneMax. The terms \(K \sqrt{n}\) and \(\sqrt{n}/\rho \) indicate that the runtime decreases when the update strength \(1/K\) or \(\rho \) is increased. However, the added terms \(+\,n \log n\) set a limit: there is no asymptotic decrease and hence no benefit in choosing update strengths \(1/K\) or \(\rho \) growing faster than \(1/(\sqrt{n} \log n)\). The reason is that in this regime both algorithms suffer from a phenomenon well known in evolutionary biology and evolutionary computation as genetic drift: the probabilistic model attains extreme values simply due to the randomness of the sampling process, ignoring or overruling information about the quality of solutions. In our context, genetic drift leads to incorrect decisions being made. Correcting these incorrect decisions requires time \({\varOmega }(n \log n)\). These lower bounds hold in expectation and with high probability; hence, they accurately reflect the algorithms’ typical performance.
We further show that these bounds are tight for \(1/K, \rho \le 1/(c\sqrt{n}\log n)\). In this parameter regime the impact of genetic drift is bounded and hence these parameter choices provably lead to the best asymptotic performance on OneMax for arbitrary problem sizes n.
The lower bounds formally apply to OneMax, but we believe that they also apply more generally to functions with few optima. Among all functions with a unique global optimum, the function OneMax is provably the easiest function for certain evolutionary algorithms (see [5] for a proof for the (1+1) EA and [30, 32] for extensions to populations), and similar results were shown for the cGA on linear functions by Droste [7]. We believe that the lower bounds give general performance limits for all functions with a unique global optimum. However, new arguments will be required to prove (or disprove) this formally.
From a technical point of view, our work uses a novel approach: using a second-order potential function to approximate the distribution of hitting times for a random walk that underlies changes in the probabilistic model. This approach has been recently picked up in [19] to analyze a different type of EDAs and we are confident that it will find further applications.
Finally, by pointing out similarities between cGA and 2-\(\hbox {MMAS}_{\text {ib}}\), using the same analytical framework to understand changes in the probabilistic model, we take a step towards a unified theory of probabilistic model-building Genetic Algorithms.
This paper is structured as follows. Section 2 introduces the algorithms and Sect. 3 presents important analytical concepts. Section 4 proves efficient upper bounds for small update strengths, whereas Sect. 5 deals with the lower bounds for large update strengths. We finish with some conclusions.
2 Preliminaries
In the remainder, \(p_t = (p_{t, 1}, \ldots , p_{t, n})\) denotes a vector of probabilities and \(x_t = (x_{t, 1}, \ldots , x_{t, n}), y_t = (y_{t, 1}, \ldots , y_{t, n})\) denote search points from \(\{0, 1\}^n\). Hence \(p_{t, i}\) refers to the ith entry of \(p_t\) and \(x_{t, i}\) refers to the ith bit in \(x_t\).
Our presentation of cGA follows Droste [7]; see also Friedrich et al. [12]. The parameter \(1/K\) is called update strength (classically, K is called population size) and the \(p_{t, i}\) are called marginal probabilities. Pseudocode of cGA is shown in Algorithm 1. In each iteration, cGA generates two search points according to the probabilistic model. Then the better solution is reinforced: if the two solutions differ on some bit i, the marginal probability \(p_{t, i}\) is adjusted in the direction of the better solution, using a step size of \(1/K\). If the two solutions have equal values on bit i then \(p_{t, i}\) remains unchanged.
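The sampling-and-update loop can be sketched as follows. This is a minimal Python transcription of the description above, not the paper's Algorithm 1 verbatim; function and variable names are our own.

```python
import random

def cga_onemax(n, K, max_iters=10**6, rng=random):
    """Minimal sketch of the cGA on OneMax with borders [1/n, 1 - 1/n]."""
    p = [0.5] * n                        # marginal probabilities p_{t,i}
    lo, hi = 1.0 / n, 1.0 - 1.0 / n      # borders of the probabilistic model
    for _ in range(max_iters):
        # sample two search points from the current model
        x = [1 if rng.random() < p_i else 0 for p_i in p]
        y = [1 if rng.random() < p_i else 0 for p_i in p]
        if sum(x) < sum(y):              # let x be the fitter sample (ties keep x)
            x, y = y, x
        if sum(x) == n:
            return x                     # global optimum found
        for i in range(n):
            if x[i] != y[i]:             # update only where the samples differ
                step = 1.0 / K if x[i] == 1 else -1.0 / K
                p[i] = min(hi, max(lo, p[i] + step))
    return None                          # iteration budget exhausted
```

For small instances and an update strength in the regime analyzed below (e.g. \(K \approx c\sqrt{n}\log n\)), the loop typically terminates after a small number of iterations.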
The simple MMAS algorithm 2-\(\hbox {MMAS}_{\text {ib}}\), analyzed before in [26],^{Footnote 1} is shown in Algorithm 2. Note that the two algorithms only differ in the update mechanism. In contrast to cGA, 2-\(\hbox {MMAS}_{\text {ib}}\) always changes the probabilistic model, by either decreasing values \(p_{t, i}\) to \((1-\rho )p_{t, i}\) or increasing them to \((1-\rho )p_{t, i} + \rho \). Here \(\rho \) determines the strength of the update. In the context of ACO, the \(p_{t, i}\) are usually called pheromone values; however, we also refer to them as marginal probabilities to unify our approach to both algorithms.
We note that the marginal probabilities for both algorithms are restricted to the interval \([1/n, 1-1/n]\). These borders ensure that the algorithms always have a finite expected optimization time, as otherwise certain bits could be irreversibly fixed to 0 or 1. Our results also apply to algorithms without these borders: our analysis can easily be adapted to show that when the optimum is found efficiently in the presence of borders, it is found with high probability when borders are removed, and when the algorithm is inefficient, many bits are fixed opposite to the optimum.
There are intriguing similarities in the definitions of cGA and 2-\(\hbox {MMAS}_{\text {ib}}\), despite these two algorithms coming from quite different strands of the natural computation community. As mentioned earlier, they only differ in the update mechanism: cGA uses a symmetric update rule with \(1/K\) as the amount of change and changes a marginal probability if and only if the two offspring differ in the corresponding bit value. 2-\(\hbox {MMAS}_{\text {ib}}\) will always change a marginal probability in either positive or negative direction by a value dependent on its current state; however, the maximum absolute change will always be at most \(\rho \). We are not the first to point out these similarities (e. g., see the survey by Hauschild and Pelikan [16], who embrace both algorithms under the umbrella of EDAs). However, our analyses will reveal the surprising insight that cGA and 2-\(\hbox {MMAS}_{\text {ib}}\) have the same runtime behavior as well as the same optimal parameter set on OneMax and can be analyzed with almost the same techniques.
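The two per-bit update rules can be contrasted directly. The following sketch (our own naming, not taken from the pseudocode) assumes the fitter of the two samples has already been determined; borders at \(1/n\) and \(1-1/n\) are omitted for brevity.

```python
def cga_update(p_i, fitter_bit, other_bit, K):
    """cGA: move p_i by 1/K toward the fitter sample, but only if the samples differ."""
    if fitter_bit == other_bit:
        return p_i                      # no information on this bit: no change
    return p_i + (1.0 / K if fitter_bit == 1 else -1.0 / K)

def mmas_update(p_i, best_bit, rho):
    """2-MMAS_ib: always evaporate by factor (1 - rho), then add rho if the best sample set the bit."""
    return (1.0 - rho) * p_i + (rho if best_bit == 1 else 0.0)
```

Note that `mmas_update` changes `p_i` in every iteration, by an amount that depends on the current value but is at most \(\rho \), whereas `cga_update` leaves `p_i` unchanged whenever the two samples agree on the bit.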
Several parts of our analysis will consider random variables X that follow the so-called Poisson-binomial distribution with probability vector \((p_{1},\ldots ,p_{n})\). Then X is the sum of n Bernoulli trials with possibly different success probabilities \(p_i\), \(1\le i\le n\), i. e., \(X=X_1+\cdots +X_n\), where \(X_i=1\) with probability \(p_i\) and \(X_i=0\) with probability \(1-p_i\), independently for all trials. Note that the number of ones in the search points \(x_t\) and \(y_t\) sampled at time t by the cGA and 2-\(\hbox {MMAS}_{\text {ib}}\) follows the Poisson-binomial distribution with probability vector \((p_{t,1},\ldots ,p_{t,n})\), which is why this distribution appears naturally in the analysis of \(\textsc {OneMax} \). Section A.3 in the Appendix describes powerful bounds for such Poisson-binomially distributed random variables.
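For illustration, sampling from a Poisson-binomial distribution and computing its first two moments is straightforward; the following small sketch uses names of our own choosing.

```python
import random

def sample_poisson_binomial(p, rng=random):
    """One draw of X = X_1 + ... + X_n with P(X_i = 1) = p[i], independently."""
    return sum(1 for p_i in p if rng.random() < p_i)

# The number of ones in a search point sampled with marginal probabilities p
# follows exactly this distribution.
p = [0.5] * 100
mean = sum(p)                                  # E(X) = sum_i p_i
variance = sum(p_i * (1 - p_i) for p_i in p)   # Var(X) = sum_i p_i (1 - p_i)
```

The variance term \(\sum _i p_i(1-p_i)\) is exactly the quantity that reappears in the drift bounds of Sect. 4.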
In the remainder of the paper, “\(\mathrm {poly}(n)\)” is used as a shorthand for “\(n^{O(1)}\).”
3 On the Dynamics of the Probabilistic Model
We first elaborate on the stochastic processes underlying the probabilistic model in both algorithms. These insights will then be used to prove upper runtime bounds for small update strengths in Sect. 4 and lower runtime bounds for large update strengths in Sect. 5.
We fix an arbitrary bit i and let \(p_{t, i}\) be its marginal probability at time t. Note that \(p_{t, i}\) is a random variable, and so is its change \({\varDelta }_{t}:=p_{t+1, i}-p_{t, i}\) in one step. This change depends on whether the value of bit i matters for the decision whether to update with respect to the first bit string \(x_t\) sampled in iteration t (using \(p_{t}\) as sampling distribution) or the second one \(y_t\) (cf. also [26]). More precisely, we inspect \(D_t:=(\textsc {OneMax} (x_t)-x_{t, i})-(\textsc {OneMax} (y_t)-y_{t, i})\), which is the difference in \(\textsc {OneMax} \)-value between \(x_t\) and \(y_t\) at the bits other than i.
We assume \(p_{t, i}\) to be bounded away from the borders such that \({\varDelta }_t\) is not affected by the borders. Then cGA experiences two different kinds of steps:
Random-walk steps If \(|D_t|\ge 2\), then bit i does not affect the decision whether to update with respect to \(x_t\) or \(y_t\). For \({\varDelta }_t\ne 0\) it is necessary that bit i is sampled differently in the two offspring. Hence, the \(p_{t, i}\)-value increases and decreases by \(1/K\) with equal probability \(p_{t, i}(1-p_{t, i})\); with the remaining probability \(p_{t+1, i}=p_{t, i}\). In this case, \({\varDelta }_t\) can be described by a variable \(F_t\) where \(F_t = 1/K\) with probability \(p_{t, i}(1-p_{t, i})\), \(F_t = -1/K\) with probability \(p_{t, i}(1-p_{t, i})\), and \(F_t = 0\) otherwise.
We call a step where \(|D_t|\ge 2\) a random-walk step (rw-step) since the process in such a step is a fair random walk (with self-loops) as \({\mathrm {E}\mathord {\left( {\varDelta }_t \mid p_{t, i}, |D_t|\!\ge \! 2\right) } \!=\! \mathrm {E}\mathord {\left( F_t \mid p_{t, i}\right) } = 0}\).
If \(D_t = 1\) then \(\textsc {OneMax} (x_t) \ge \textsc {OneMax} (y_t)\), so that \(x_t\) and \(y_t\) are never swapped in line 8 of cGA. Hence, the same argument as in the previous case applies and the process performs an rw-step as well.
Biased steps If \(D_t = -1\) then \(x_t\) and \(y_t\) are swapped unless bit i is sampled to 1 in \(x_t\) and to 0 in \(y_t\). Hence, both events of sampling bit i differently increase the \(p_{t, i}\)-value. We have \({\varDelta }_t=1/K\) with probability \(2p_{t, i}(1-p_{t, i})\) and \({\varDelta }_t=0\) otherwise.
If \(D_t=0\) then, as in the case \(D_t=-1\), both events of sampling bit i differently increase the \(p_{t, i}\)-value. Hence, we again have \({\varDelta }_t=1/K\) with probability \(2p_{t, i}(1-p_{t, i})\) and \({\varDelta }_t=0\) otherwise. Let \(B_t\) be a random variable such that \(B_t = 1/K\) with probability \(2p_{t, i}(1-p_{t, i})\) and \(B_t = 0\) otherwise.
Hence, in the cases \(D_t=-1\) and \(D_t=0\) we get that \({\varDelta }_t\) has the same distribution as \(B_t\). We call such a step a biased step (b-step) since \(\mathrm {E}\mathord {\left( {\varDelta }_t \mid p_{t, i}, D_t \in \{-1, 0\}\right) } = \mathrm {E}\mathord {\left( B_t \mid p_{t, i}\right) } = 2p_{t, i}(1-p_{t, i})/K >0\) here.
Whether a step is an rw-step or b-step for bit i depends only on circumstances external to the bit (and independent of it). Let \(R_t\) be the event that \(D_t=1\) or \(|D_t|\ge 2\). We get the equality \({\varDelta }_t = \mathbb {1}\{R_t\}\cdot F_t + \mathbb {1}\{\lnot R_t\}\cdot B_t\),  (1)
which we denote as superposition. Informally, the change of \(p_{t, i}\)value is a superposition of a fair (unbiased) random walk and biased steps. The fair random walk reflects the genetic drift underlying the process, i. e. the variance in the process may lead the algorithm to move in a random direction. In contrast, the biased steps reflect steps where the algorithm learns about which bit value leads to a better fitness at the considered bit position. We remark that the superposition of two different behaviors as formulated here is related to the approach taken in [2], where an EDA called UMDA was decomposed into a derandomized, deterministic EDA and a stochastic component modeling genetic drift.
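For the cGA, the case distinction above can be phrased as a small classifier. The following sketch (our own naming) takes the two samples of an iteration, with x assumed to be kept first in case of equal fitness, and decides whether bit i undergoes an rw-step or a b-step:

```python
def step_type(x, y, i):
    """Classify the cGA step for bit i by D_t, the OneMax difference on the other bits."""
    d = (sum(x) - x[i]) - (sum(y) - y[i])
    if abs(d) >= 2 or d == 1:
        return "rw"  # bit i cannot influence which sample wins: fair random-walk step
    return "b"       # d in {-1, 0}: sampling bit i differently always pushes p_i up
```

For instance, with x = (1, 1, 1) and y = (1, 0, 0) the other bits already decide the comparison for bit 0, so the step is an rw-step; with x = (1, 0) and y = (0, 0) the value of bit 0 alone decides, so the step is biased.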
For 2-\(\hbox {MMAS}_{\text {ib}}\), this kind of superposition holds structurally as well; however, the underlying random variables look somewhat different.
Random-walk steps If \(|D_t|\ge 2\) or \(D_t=1\), then the considered bit does not affect the choice whether to update with respect to \(x_t\) or \(y_t\). Hence, the marginal probability of the considered bit increases with probability \(p_{t, i}\) and decreases with probability \(1-p_{t, i}\).
We get that \({\varDelta }_t=p_{t+1, i}-p_{t, i}\) is distributed as \(F_t\) in this case, where \(F_t\) is a random variable such that \(F_t = \rho (1-p_{t, i})\) with probability \(p_{t, i}\) and \(F_t = -\rho p_{t, i}\) with probability \(1-p_{t, i}\).
We call such a step an rw-step in analogy to cGA as in expectation the current state does not change: \({\mathrm {E}\mathord {\left( {\varDelta }_t \mid p_{t, i}, |D_t|\ge 2 \vee D_t=1\right) } = \mathrm {E}\mathord {\left( F_t \mid p_{t, i}\right) }=0}\).
Biased steps If \(D_t=0\) or \(D_t=-1\) then the marginal probability can only decrease if both offspring sample a 0 at bit i; otherwise it will increase. The difference \({\varDelta }_t\) is a random variable \(B_t\) with \(B_t = \rho (1-p_{t, i})\) with probability \(1-(1-p_{t, i})^2\) and \(B_t = -\rho p_{t, i}\) with probability \((1-p_{t, i})^2\).
This is called a biased step (b-step) as \(\mathrm {E}\mathord {\left( {\varDelta }_t \mid p_{t, i}, D_t \in \{-1, 0\}\right) } = \mathrm {E}\mathord {\left( B_t \mid p_{t, i}\right) } = \rho \cdot (1-p_{t, i})\cdot (1-(1-p_{t, i})^2) - \rho \cdot p_{t, i}\cdot (1-p_{t, i})^2 = \rho (1-p_{t,i}) (1-(1-p_{t,i})^2 - p_{t,i}(1-p_{t,i})) = \rho p_{t, i}(1-p_{t, i})>0\).
Altogether, the superposition for 2-\(\hbox {MMAS}_{\text {ib}}\) is also given by (1), with the modified meaning of \(B_t\) and \(F_t\).
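The two drift identities derived above are easy to check numerically. The sketch below verifies, for sample values of \(\rho \) and \(p_{t,i}\), that the rw-step has zero expected change and that the b-step expectation simplifies to \(\rho p_{t,i}(1-p_{t,i})\):

```python
rho = 0.1
for p in [0.1, 0.3, 0.5, 0.9]:
    # rw-step: increase by rho*(1-p) with prob. p, decrease by rho*p with prob. 1-p
    e_f = rho * (1 - p) * p - rho * p * (1 - p)
    assert abs(e_f) < 1e-12
    # b-step: decrease only if both offspring sample a 0 at bit i
    e_b = rho * (1 - p) * (1 - (1 - p) ** 2) - rho * p * (1 - p) ** 2
    assert abs(e_b - rho * p * (1 - p)) < 1e-12
```

This mirrors the algebraic simplification above: factoring out \(\rho (1-p)\) leaves \(1-(1-p)^2-p(1-p) = p\).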
The strength of the update plays a key role here: if the update is too strong, large steps are made during updates, and genetic drift through rw-steps may overwhelm the probabilistic model, leading to “wrong” decisions being made at individual bits. On the other hand, small updates imply that rw-steps have a bounded impact, and the algorithm receives more time to learn optimal bit values in b-steps. We will formalize these insights in the following sections en route to proving rigorous upper and lower runtime bounds. Informally, one main challenge is to understand the stochastic process induced by the mixture of b- and rw-steps.
4 Small Update Strengths are Efficient
We first show that small update strengths are efficient for OneMax. This has been shown for 2-\(\hbox {MMAS}_{\text {ib}}\) in [26].
Theorem 1
([26]) If \(\rho \le 1/(c\sqrt{n}\log n)\) for a sufficiently large constant \(c > 0\) and \(\rho \ge 1/\mathrm {poly}(n)\) then 2-\(\hbox {MMAS}_{\text {ib}}\) optimizes OneMax in expected time \(O(\sqrt{n}/\rho )\).
For \(\rho = 1/(c\sqrt{n}\log n)\) the runtime bound is \(O(n \log n)\).
Here we exploit the similarities between both algorithms to prove an analogous result for cGA.
Theorem 2
The expected optimization time of cGA on OneMax with \(K\ge c\sqrt{n}\log n\) for a sufficiently large \(c>0\) and \(K = \mathrm {poly}(n)\) is \(O(\sqrt{n}K)\). This is \(O(n\log n)\) for \(K = c\sqrt{n}\log n\).
The analysis follows the approach for 2-\(\hbox {MMAS}_{\text {ib}}\) in [26], adapted to the different update rule, and using modern tools like variable drift analysis^{Footnote 2} [17] and drift analysis with tail bounds [21]. We also extend previous work by showing in Sect. 4.1 that the upper bound for cGA holds with high probability (see Theorem 5 in Sect. 4.1). The main idea is that marginal probabilities are likely to increase from their initial values of 1/2. If the update strength is chosen small enough, the effect of genetic drift (as present in rw-steps) is bounded such that with high probability no bit ever reaches a marginal probability below 1/3. Under this condition, we show that the marginal probabilities have a tendency (stochastic drift) to move to their upper borders, so that the optimum is then found with good probability.
The following lemma uses considerations and notation from Sect. 3 to establish a stochastic drift, i. e. a positive trend towards optimal bit values, for cGA. We use the same notation as in Sect. 3.
Lemma 3
If \(1/n + 1/K \le p_{t, i} \le 1 - 1/n - 1/K\) then
Proof
The assumptions on \(p_{t, i}\) ensure that \(p_{t+1, i}\) is not affected by the borders \(1/n\) and \(1-1/n\). Then the expected change is given by the expectation of the superposition (1):
From Sect. 3 we know \(\mathrm {E}\mathord {\left( F_t \mid p_{t, i}\right) } = 0\) and \(\mathrm {E}\mathord {\left( B_t \mid p_{t, i}\right) } = 2p_{t, i}(1-p_{t, i})/K\). Further,
where the last inequality was shown in [26, proof of Lemma 1]. Here we exploit that cGA and 2-\(\hbox {MMAS}_{\text {ib}}\) use the same construction procedure. Together this proves the claim. \(\square \)
Note that the term \(\left( \sum _{j\ne i} p_{t, j}(1-p_{t, j})\right) ^{1/2}\) reflects the standard deviation of the sampling distribution on all bits \(j \ne i\).
Lemma 3 indicates that the drift increases with the update strength \(1/K\). However, too large a value for \(1/K\) also increases genetic drift. The following lemma shows that, if \(1/K\) is not too large, the positive drift implies that the marginal probabilities will generally move to higher values and are unlikely to decrease by a constant.
Lemma 4
Let \(0< \alpha< \beta < 1\) be two constants. For each constant \(\gamma > 0\) there exists a constant \(c_\gamma > 0\) (possibly depending on \(\alpha , \beta \), and \(\gamma \)) such that for a specific bit the following holds. If the bit has marginal probability at least \(\beta \) and \(K \ge c_\gamma \sqrt{n} \log n\) then the probability that during the following \(n^\gamma \) steps the marginal probability decreases below \(\alpha \) is at most \(O(n^{-\gamma })\).
Proof
The proof uses a similar approach to the proof of Lemma 3 in [26], using \(1/K\) instead of \(\rho \) and drift bounds from Lemma 3.
The aim is to apply the negative drift theorem, Theorem 20 in the Appendix, with respect to the stochastic process \(X_t:=K p_{t, i} \), obtained by scaling the process on the marginal probabilities of the considered bit i by a factor of K. Note that the \(X_t\)-process is on \(\{K/n, 1, 2, \ldots , K-1, K-K/n\}\).
We use the interval \([a,b]:=[\alpha K,\beta K]\) in the drift theorem. To establish the first condition of the drift theorem, we use Lemma 3. Hence, we obtain the following bound on the drift
using that \(a<X_t<b\) implies \(\alpha< p_{t, i}<\beta \), and estimating \(p_{t, j}(1-p_{t, j})\le 1/4\) for all j and t.
For the second condition, we note that always \(|X_t-X_{t+1}|\le 1\) since the marginal probabilities change by at most \(1/K\). Hence, the second condition is trivially satisfied by choosing \(r:=2\).
To verify the third condition, we will use that \(K \ge c_\gamma \sqrt{n} \log n\) for a constant \(c_\gamma \) that may depend on \(\alpha ,\beta \) and \(\gamma \). We compute, using \(\ell := (\beta - \alpha )K\) and \(r, \varepsilon \) defined above,
which is at least 4 if \(c_\gamma \) is chosen large enough but constant; here we use that \(\alpha \) and \(\beta \) are constants in (0, 1). Then \(1\le r^2 \le \frac{\varepsilon (b-a)}{132\log (r/\varepsilon )}\) as demanded by the third condition.
To finally apply the drift theorem, similar calculations as before yield that
which is at least \(\gamma \ln n\) if \(c_\gamma \) is chosen appropriately. By assumption \(X_0\ge b\). Hence, the theorem establishes that \(\mathord {\mathrm {P}}\mathord {\left[ T\le n^\gamma \right] }=O(n^{-\gamma })\). \(\square \)
With these lemmas, we now prove the main statement of this section.
Proof of Theorem 2
We assume in the following that \(1/K\) divides \(1/2-1/n\), implying that marginal probabilities are restricted to \(\{1/n, 1/n + 1/K, \ldots , 1/2, \ldots , 1-1/n-1/K, 1-1/n\}\).
Following [26, Theorem 3] we show that, starting with a setting where all probabilities are at least 1/2 simultaneously, with probability \({\varOmega }(1)\) after \(O(\sqrt{n}K)\) iterations either the global optimum has been found or at least one probability has dropped below 1/3. In the first case we speak of a success and in the latter case of a failure. The expected time until either a success or a failure happens is then \(O(\sqrt{n}K)\).
Now choose a constant \(\gamma > 0\) such that \(n^\gamma \ge K n^3\). According to Lemma 4 applied with \(\alpha := 1/3\) and \(\beta := 1/2\), the probability of a failure in \(n^{\gamma }\) iterations is at most \(n^{-\gamma }\), provided the constant c in the condition \(K \ge c\sqrt{n}\log n\) is large enough. In case of a failure we wait until the probabilities simultaneously reach values at least 1/2 again and then we repeat the arguments from the preceding paragraph. It is easy to show (cf. Lemma 2 in [26]) that the expected time for one probability to reach the upper border is always bounded by \(O(n^{3/2}K)\), regardless of the initial probabilities. By standard arguments on independent phases, the expected time until all probabilities have reached their upper border at least once is \(O(n^{3/2}K \log n)\). Once a bit reaches the upper border, we apply Lemma 4 again with \(\alpha := 1/2\) and \(\beta := 2/3\) to show that the probability of a marginal probability decreasing below 1/2 in time \(n^{\gamma }\) is at most \(n^{-\gamma }\) (again, for large enough c). The probability that there is a bit for which this happens is at most \(n^{-\gamma + 1}\) by the union bound. If this does not happen, all bits attain value at least 1/2 simultaneously, and we apply our above arguments again.
As the probability of a failure is at most \(n^{-\gamma +1}\), the expected number of restarts is \(O(n^{-\gamma +1})\) and considering the expected time until all bits recover to values at least 1/2 only leads to an additional term of \(n^{-\gamma +1} \cdot O(n^{3/2}K \log n) \le o(1)\) (as \(n^{-\gamma } \le n^{-3}/K\)) in the expectation.
We only need to show that after \(O(\sqrt{n}K)\) iterations without failure the probability of having found the global optimum is \({\varOmega }(1)\). To this end, we consider a simple potential function that takes into account marginal probabilities for all bits. An important property of the potential is that once the potential has decreased to some constant value, the probability of generating the global optimum is constant.
Let \(p_1, \ldots , p_n\) be the current marginal probabilities and \(q_i := 1-1/n-p_i\) for all i. Define the potential function \(\varphi := \sum _{i=1}^n q_i\), which measures the distance to an ideal setting where all probabilities attain their maximum \(1-1/n\). Let \(q_i'\) be the \(q_i\)-value in the next iteration and \(p_i' = 1-1/n-q_i'\). We estimate the expectation of \(\varphi ' := \sum _{i=1}^n q_i'\) and distinguish between two cases. If \(p_i \le 1-1/n-1/K\), by Lemma 3
We bound \(p_i(1-p_i)\) from below using \(p_{i} \ge 1/3\) and \(1-p_i = q_i + 1/n\), and the sum from above using
Then
If \(p_i > 1-1/n-1/K\), then \(p_i = 1-1/n\) (as \(1/2-1/n\) is a multiple of \(1/K\)) and \(p_i\) can only decrease. A decrease by \(1/K\) happens with probability at most 1/n, thus
To ease the notation we assume w. l. o. g. that the bits are numbered according to decreasing probabilities, i. e., increasing q-values. Let \(m \in \mathbb {N}_0\) be the largest index such that \(p_{m} = 1-1/n\). Observe that by definition of the \(q_i\) we have \(\sum _{i=1}^m q_i = 0\) and \(\sum _{i=m+1}^n q_i = \varphi \). It follows
Putting everything together,
For \(\varphi \ge 10000 \) this can further be bounded using
thus
where in the third inequality we used \(\varphi \ge 10000 \) again. We now apply the variable drift theorem (given by Theorem 18 in the Appendix) to bound the expected time for the potential \(\varphi \) to decrease from any initial value \(\varphi \le n\) to a value \(\varphi \le 10000 \). To this end, we use the drift function \(h(\varphi ) := \varphi ^{1/2}/(17K)\) as we just established that the expected change (drift) in one step is at least \(h(\varphi )\) for all \(\varphi \ge 10000 \).
Since Theorem 18 only considers the hitting time of state 0 and the condition on the drift needs to hold for all states larger than 0, we consider a modified process instead where we merge all states with potentials \(0< \varphi < 10000 \) with state 0: all steps reducing a potential of \(\varphi \ge 10000 \) to a value smaller than \(10000 \) yield a potential of 0. In the modified process, the smallest state larger than 0 is \(x_{\min }=10000 \). The modification can only increase the drift, hence the drift is still bounded from below by \(h(\varphi )\) for all states \(\varphi \ge x_{\min }\).
Now Theorem 18 yields that the expected time to reach state 0 in the modified process, or, equivalently, any state \(\varphi < 10000 \) in the original process, is at most
Consider an iteration where \(\varphi \le 10000 \). The probability of creating ones on all bits simultaneously, given that all marginal probabilities are at least 1/3, is minimal in the extreme setting where a maximal number of bits has marginal probabilities at 1/3 and all other bits, except at most one, have marginal probabilities at their upper border. Then the probability of creating the optimum in one step is at least \( \left( 1-\frac{1}{n}\right) ^{n-1} \cdot 3^{-\lceil \varphi \cdot 3/2 \rceil } = {\varOmega }(1). \) Hence a successful phase finds the optimum with probability \({\varOmega }(1)\). \(\square \)
4.1 A Tail Bound on the Running Time
We further show that the upper bound from Theorem 2 holds with high probability. Along with the lower tail bounds to be presented in Sect. 5, this demonstrates that the runtime of cGA is highly concentrated, and that we have developed a very good understanding of its performance and dynamic behavior. In the following result, the failure probability can be made an arbitrarily small polynomial.
Theorem 5
For every \(\kappa > 0\) there is a constant \(c = c(\kappa )\) such that the upper bound \(O(\sqrt{n}K)\) for the time of the cGA on OneMax from Theorem 2 holds with probability \(1-O(n^{-\kappa })\), provided \(K\ge c\sqrt{n}\log n\) and \(K = \mathrm {poly}(n)\).
Throughout this section we reuse the notation from the proof of Theorem 2, in particular the potential function \(\varphi \) and variables \(p_i\) and \(q_i := 1 - 1/n - p_i\) for \(1 \le i \le n\).
We still consider the stochastic process w. r. t. the potential function \(\varphi \) from the proof of Theorem 2 and consider its drift. As done in said proof, we use that the probability that there exists a \(p_i\) whose value decreases below 1/3 in \(n^{\gamma }\) steps is at most \(n^{-\gamma +1}\) if the constant c in \(K\ge c\sqrt{n}\log n\) is chosen large enough. Note that we can make \(\gamma \) larger to decrease the probability of such a failure; however, this dictates what values of c are appropriate. In the following, we assume that the probability of such a failure is at most \(n^{-\kappa }\) and work under the assumption that no failure occurs.
To get a high-probability statement, we aim to apply drift analysis with tail bounds, stated as Theorem 19 in the Appendix. To this end, we have to bound the moment-generating function (mgf.) of (a stochastic upper bound on) the absolute value of
where we use \(K'=17K\) to improve readability and \(x_{\min }=10000 \).
The following lemma gives a tail bound for the time to reach a potential of at most \(x_{\min }\).
Lemma 6
Consider the potential \(\varphi \) and the drift function \(h(\varphi ) := \varphi ^{1/2} /(17K)\) as defined in the proof of Theorem 2, and assume that no \(p_i\) decreases below 1 / 3. Let T denote the random time for the potential to decrease below \(x_{\min }= 10000 \) for the first time, when starting with an initial value of \(\varphi _0\). Then for every \(t > 0\), conditional on the potential always being bounded by a maximum value \(x_{\max }\),
Proof
For the purpose of bounding the tail of the first hitting time for potentials below \(10000 \) we again consider a modified process where states \(0< \varphi < 10000 \) are merged with state 0 (cf. proof of Theorem 18). The following calculations implicitly assume that \(\varphi _{t} \ge 10000 \) as otherwise we have reached a potential below 10000.
We first note that always \(\varphi _{t+1}\ge \varphi _t(1-1/K)\ge \varphi _t/2\). This holds since a step of cGA in the worst case increases all frequencies by 1 / K (except for those at the upper border), which decreases each \(q_i\) by 1 / K. Hence, we get
and we are left with an analysis of \({\varDelta } := |\varphi _{t+1}-\varphi _t|\). Here we note that for any bit i, its frequency changes by an absolute value of at most 1 / K with probability at most \(q_i+1/n\le 2q_i\). Hence, \(K{\varDelta }\) is stochastically dominated by a Poisson-binomial distribution with parameters n and \(2q_i\), where \(1\le i\le n\). Let A be the random variable describing this Poisson-binomial distribution. While we do not know the individual success probabilities, we know their average \(p^*:=\sum (2q_i/n)=2\varphi _t/n\) and can bound A by a random variable B, where \(B\sim np^* + {{\mathrm{Bin}}}(n,p^*) + 2\). To show this, we note that \(\mathord {\mathrm {P}}\mathord {\left[ B\ge t\right] }\ge \mathord {\mathrm {P}}\mathord {\left[ A\ge t\right] }\) is trivial for \(t\le np^*+2\) (as \(\mathord {\mathrm {P}}\mathord {\left[ B\ge t\right] } = 1\)). For \(t> np^*+2\), even the dominance \(\mathord {\mathrm {P}}\mathord {\left[ {{\mathrm{Bin}}}(n,p^*)\ge t\right] }\ge \mathord {\mathrm {P}}\mathord {\left[ A\ge t\right] }\) holds by the results of Gleser [14], see [23, p. 495] for a summary. Hence,
for some constant \(c_1>0\).
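The Gleser-type tail domination used above can be checked numerically for small instances: the sketch below computes the exact distribution of a Poisson-binomial random variable by dynamic programming and compares its upper tails, for \(t > np^*+2\), with those of the binomial distribution with the averaged success probability. The instance (n and the heterogeneous probabilities) is illustrative, not taken from the paper.

```python
import math

def poisson_binomial_pmf(probs):
    # DP over bits: pmf[k] = P[exactly k successes]
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, v in enumerate(pmf):
            new[k] += v * (1 - p)
            new[k + 1] += v * p
        pmf = new
    return pmf

def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def tail(pmf, t):
    return sum(pmf[t:])

n = 20
probs = [0.05 + 0.4 * i / (n - 1) for i in range(n)]  # heterogeneous success probs
p_star = sum(probs) / n                               # averaged success probability
pb = poisson_binomial_pmf(probs)
bn = binomial_pmf(n, p_star)

# tail domination for t > n*p_star + 2, as asserted via Gleser's results
for t in range(math.floor(n * p_star) + 3, n + 1):
    assert tail(bn, t) >= tail(pb, t) - 1e-12
```

Intuitively, averaging the success probabilities can only increase the sampling variance (\(\sum p_i(1-p_i)\) is maximized for equal \(p_i\) with the same mean), which is why the binomial has the heavier tails far from the mean.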
We now bound the mgf. of Z. Looking up the mgf. of a binomial distribution, we obtain
Assuming \(\lambda \le \min \{1,1/(16c_1\sqrt{\varphi _t})\}\) and using \(e^{\lambda }\le 1+\lambda +\lambda ^2\le 1+2\lambda \), we bound the last expression from above by
which, since \(1+x\le e^x\) for \(x\in \mathbb {R}\), is at most
since \(p^*=2\varphi _t/n\) and \(np^*\ge 20000\ge 2\) by our assumption on \(\varphi _t\).
Using (again) \(e^x \le 1+2x\) for \(x\le 1\) and recalling that \(\lambda \le 1/(16c_1\sqrt{\varphi _t})\), we arrive at the bound
for some constant \(c_2>0\). Hence, using the variable drift theorem with tail bounds, Theorem 19 in the Appendix, we get for any \(\delta >0\) and \(\eta \le \min \{\lambda ,\delta \lambda ^2/(D-1-\lambda )\}\) that
We note that \(\sqrt{\varphi _t}\ge 100\) if \(\varphi _t\ge x_{\min }=10000 \). Hence, using our bound D, we satisfy
if \(c_2\) is chosen large enough for \(c_2\sqrt{\varphi _t}-1\ge 0\) to hold. Similarly, we show that \( \delta \lambda ^2/(D-1-\lambda ) \le \lambda \) if \(\delta \) is sufficiently small, so that only the second argument of \(\min \{\lambda ,\delta \lambda ^2/(D-1-\lambda )\}\) needs to be considered. We let \(\delta :=1/2\). We choose \(\lambda :=1/(16c_1\sqrt{x_{\max }})\) and \(\eta :=\delta \lambda /(c_2\sqrt{x_{\max }}) = c_3/x_{\max }\) for some constant \(c_3\) to satisfy the requirements on \(\lambda \) and \(\eta \). Substituting \(\eta \) and \(\delta \) in (2) proves the claim. \(\square \)
Reaching a small potential is not sufficient to show that the optimum is found with high probability. We also need to show that the algorithm spends a sufficiently large number of steps at a small potential. The following lemma shows that, after having reached a potential of at most \(x_{\min }\), the algorithm quickly returns to this regime.
Lemma 7
Consider the potential \(\varphi \) as defined in the proof of Theorem 2, where \(K\ge c\sqrt{n}\log n\) for a sufficiently large \(c>0\) and \(K = \mathrm {poly}(n)\). Whenever \(\varphi _0 < 10000 \), the time \(R = \min \{t \ge 1 \mid \varphi _t< 10000 \}\) to return to a potential below \(10000 \) is at most \(K\log ^2 n\) with probability \(1-n^{-\omega (1)}\).
Proof
We first show that with high probability the potential never rises beyond O(K) in any polynomial number of steps.
Consider \(p_i\) that are at the upper border initially. The probability that in one step more than \(\log n\) variables move away from the upper border is at most \(\binom{n}{\log n} (1/n)^{\log n} \le 1/((\log n)!) = n^{-\omega (1)}\). Assuming this never happens within the next \(K\log ^2 n\) steps, during this time at most \(K\log ^3 n\) bits move away from the upper border. As every bit can only increase the potential by 1 / K in one step, these bits only contribute at most \(\log ^3 n\) to the potential.
All bits that are not at the upper border initially can contribute up to 1 to the potential each. However, as they contribute at least 1 / K (the minimum distance to the upper border), the number of such bits is bounded by \(10000 K\). Together, the potential is at most \(\log ^3 n + 10000 K = O(K)\) with probability \(1-(K\log ^2 n) \cdot n^{-\omega (1)} = 1-n^{-\omega (1)}\) (as \(K\log ^2 n = \mathrm {poly}(n)\)) throughout the first \(K\log ^2 n\) steps.
Now consider the potential \(\varphi _1\) at time 1. If \(\varphi _1 < 10000 \), the return time is \(R=1\). Otherwise, by the same arguments as above, \(\varphi _1 \le 10000 + O(1)\) with probability \(1-n^{-\omega (1)}\) as with this probability at most \(\log n\) bits move away from the upper border, and at most \(10000 K\) bits that are away from the border initially only move by \(\pm 1/K\) in one step.
Applying Lemma 6 with an initial potential (denoted by \(\varphi _0\) in Lemma 6 but corresponding to \(\varphi _1\) in the time scale of the present lemma) of at most \(10000 + O(1)\), \(t = K\log ^2 n\), and \(x_{\max }=\log ^3 n + 10000 K = O(K)\) yields that the probability of not returning to a potential below 10000 in \(K\log ^2 n\) steps is at most
Note that
(still using the definition of h from the proof of Theorem 2), so that the probability under consideration is
as claimed. \(\square \)
We now prove Theorem 5.
Proof of Theorem 5
Applying Lemma 4 as in the proof of Theorem 2, the probability of all \(p_i\) remaining above 1 / 3 all the time for \(n^{\gamma '}\) steps is at least \(1-n^{-\gamma '+1} \ge 1-n^{-\kappa }\), where \(\gamma ' = \max \{\gamma , \kappa +1\}\) and \(\gamma \) is chosen as in the proof of Theorem 2.
The aim is to apply Lemma 6 with \(T^*:=x_{\min }/h(x_{\min })+\int _{x_{\min }}^n 1/h(x)\,\mathrm {d}x\), \(t:=3T^*\) and \(x_{\max }=n\). Note that \(T^*\) just represents the upper bound \(O(K \sqrt{n})\) on the expected value derived from variable drift in the proof of Theorem 2. This bound is at least \(T^*\ge \int _{0}^n 1/h(x)\,\mathrm {d}x = 17K \int _{0}^n x^{-1/2}\,\mathrm {d}x = 34K\sqrt{n}\). Invoking the lemma yields
for some constant \(c' > 0\). As \(K\ge c\sqrt{n}\ln n\), this means that the time is at most \(3T^* = O(K\sqrt{n})\) with probability at least \(1-e^{-17 c c' \ln n}\). This probability becomes at least \(1-n^{-\kappa }\) if c is chosen as a large enough constant.
Whenever the potential is at most 10000, we have a probability of \({\varOmega }(1)\) to create the optimum (see proof of Theorem 2). By Lemma 7, the algorithm with high probability returns to such a state within \(K\log ^2 n\) steps. Applying these arguments \(\log ^2 n\) times (and considering failure probabilities for \(\log ^2 n\) applications of Lemma 7), the probability that after \(K\log ^4 n\) steps the optimum has not been found is \((\log ^2 n) \cdot e^{-{\varOmega }(\log ^2 n)} = n^{-\omega (1)}\).
Adding up all failure probabilities yields the claimed result. \(\square \)
5 Large Update Strengths Lead to Genetic Drift
The bound \(O(\sqrt{n}K)\) from Theorem 2 shows that larger update strengths (i. e., smaller K) result in smaller bounds on the runtime. However, the theorem requires that \(K\ge c\sqrt{n}\log n\) so that the best possible choice results in \(O(n\log n)\) runtime. An obvious question to ask is whether this is only a weakness of the analysis or whether there is an intrinsic limit that prevents smaller choices of K from being efficient.
In this section, we will show that smaller choices of K (i. e., larger update strengths) cannot give runtimes of lower orders than \(n\log n\). In a nutshell, even though larger update strengths support faster exploitation of correct decisions at single bits by quickly reinforcing promising bit values, they also increase the risk of genetic drift reinforcing incorrectly made decisions at single bits too quickly. Then it typically happens that several marginal probabilities reach their lower border 1 / n, from which it (due to so-called coupon collector effects) takes \({\varOmega }(n\log n)\) steps to “unlearn” the wrong settings. The very same effect happens with 2\(\hbox {MMAS}_{\text {ib}}\) if its update strength \(\rho \) is chosen too large.
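The “unlearning” cost can be illustrated with a toy coupon-collector model (the model and its parameters are hypothetical simplifications, not the process analyzed below): if each step corrects one uniformly random bit among n bits stuck at the lower border, the expected time until all of them have been corrected once is \(n H_n = {\varTheta }(n \log n)\).

```python
import random

def unlearn_time(n, rng):
    """Toy model: each step corrects one uniformly random border bit;
    returns the number of steps until all n wrong bits were corrected."""
    corrected, steps = set(), 0
    while len(corrected) < n:
        corrected.add(rng.randrange(n))
        steps += 1
    return steps

rng = random.Random(42)
n = 200
avg = sum(unlearn_time(n, rng) for _ in range(30)) / 30
h_n = sum(1 / k for k in range(1, n + 1))  # harmonic number H_n

# coupon collector: the expectation is n * H_n = Theta(n log n)
assert 0.7 * n * h_n < avg < 1.3 * n * h_n
```

The simulation only illustrates the coupon-collector effect named in the text; the actual lower-bound argument additionally has to show that enough marginal probabilities reach the border in the first place.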
We now state the lower bounds we obtain for the two algorithms, see Theorems 8 and 9 below. Note that the statements are identical if we identify the update strength 1 / K of cGA with the update strength \(\rho \) of 2\(\hbox {MMAS}_{\text {ib}}\). Also the proofs of these two theorems will largely follow the same steps. Therefore, we describe the proof approach in detail with respect to cGA in Sect. 5.1. In Sect. 5.2, we describe the few places where slightly different arguments are needed to obtain the result for 2\(\hbox {MMAS}_{\text {ib}}\).
Theorem 8
The optimization time of cGA with \(K \le \mathrm {poly}(n)\) is \({\varOmega }(\sqrt{n}K + n \log n)\) with probability \(1-\mathrm {poly}(n) \cdot 2^{-{\varOmega }(\min \{K, n^{1/2-o(1)}\})}\) and in expectation.
Theorem 9
The optimization time of 2\(\hbox {MMAS}_{\text {ib}}\) with \(1/\rho \le \mathrm {poly}(n)\) is \({\varOmega }(\sqrt{n}/\rho + n \log n)\) with probability \(1-\mathrm {poly}(n) \cdot 2^{-{\varOmega }(\min \{1/\rho , n^{1/2-o(1)}\})}\) and in expectation.
We first describe at an intuitive level why large update strengths can be risky. In the upper bounds from Theorems 1 and 2, we have shown that for sufficiently small update strengths, the positive stochastic drift by b-steps is strong enough such that even in the presence of rw-steps all bits never reach marginal probabilities below 1 / 3, with high probability. Then no “incorrect” decision is made.
With larger update strengths than \(1/(\sqrt{n}\log n)\) the effect of rw-steps is strong enough such that with high probability some bits will make an incorrect decision and reach the lower borders of marginal probabilities.
More specifically, the lower bounds of \({\varOmega }(n \log n)\) in Theorems 8 and 9 will be established from the following arguments. We show that many marginal probabilities will remain close to their initial values during the early stages of a run (Lemmas 13 and 15). This then implies that b-steps will be rare (Lemma 12) throughout this time, and thus genetic drift dominates. Through a detailed analysis of the distribution of first hitting times in rw-steps we show that then some marginal probabilities will hit the lower border (Lemmas 10 and 16). Finally, we show that once sufficiently many marginal probabilities have reached the lower border, then this implies a lower bound of \({\varOmega }(n \log n)\) as claimed (Lemma 14).
5.1 Proof of Lower Bound for cGA
We start with a detailed analysis of the hitting time for a marginal probability to reach the lower border 1 / n and of the distribution of these hitting times.
To illustrate this setting, fix one bit and imagine that all steps were rw-steps (we will explain later how to handle b-steps), and that all rw-steps change the current value of the bit’s marginal probability (i. e., there are no self-loops). Then the process would be a fair random walk on \(\{0,1/K,2/K,\ldots ,(K-1)/K,1\}\), started at 1 / 2. This fair random walk is well understood (see, e. g., Chapter 14.3 in [9]) and it is well known that the hitting time is not sharply concentrated around the expectation. More precisely, there is still a polynomially in K small probability of hitting a border within at most \(O(K^2/\log K)\) steps and also of needing at least \({\varOmega }(K^2\log K)\) steps. The underlying idea is that the central limit theorem (CLT) approximates the progress within a given number of steps.
The real process is more complicated because of self-loops. Recall from the definition of \(F_t\) that the process only changes its current state by \(\pm 1/K\) with probability \(2p_{t, i}(1-p_{t, i})\), hence with probability \(1-2p_{t, i}(1-p_{t, i})\) a self-loop occurs on this bit. The closer the process is to one of its borders \(\{1/n,1-1/n\}\), the larger the self-loop probability becomes and the more the random walk slows down. Hence the actual process is clearly slower in reaching a border since every looping step is just wasted. One might conjecture that the self-loops will asymptotically increase the expected hitting time. But interestingly, as we will show, the expected hitting time in the presence of self-loops is still of order \({\varTheta }(K^2)\). Also the CLT (in a generalized form) is still applicable despite the self-loops, leading to a similar distribution as above.
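That self-loops cost only a constant factor in the central region can be verified exactly for small parameters: by first-step analysis, the expected time to bridge a distance sK from the centre is a weighted sum of waiting times, computable via the Green's function of the path graph. The following sketch uses illustrative values of K and s (they are not parameters from the paper).

```python
K = 100      # hypothetical update-strength parameter
s = 0.25     # bridge a distance of s*K from the centre (illustrative)
a, b = K // 2 - int(s * K), K // 2 + int(s * K)  # stopping states
m = K // 2                                       # start in the middle

def move_prob(x):
    """Probability that an rw-step at scaled state x is not a self-loop."""
    p = x / K
    return 2 * p * (1 - p)

def green(j, l):
    """Green's function of the path [a, b] with absorbing endpoints:
    2*green(j, l) is the expected number of visits of the fair walk
    to state l when started in j."""
    lo, hi = min(j, l), max(j, l)
    return (lo - a) * (b - hi) / (b - a)

# E[T] = sum over interior states of (expected visits) * (wait per visit)
expected_T = sum(2 * green(m, l) / move_prob(l) for l in range(a + 1, b))
fair_T = (s * K) ** 2  # classical value for the walk without self-loops

# in the central region all move probabilities are Theta(1), so the
# self-loops slow the walk down by only a constant factor
assert fair_T < expected_T < 3 * fair_T
```

Near the borders the waiting times \(1/\mathord {\mathrm {move\_prob}}\) grow, which is exactly the effect the potential function in the proof of Lemma 10 is designed to absorb.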
The distribution of the hitting time of the random walk with self-loops will be analyzed in Lemma 10 below. In order to deal with self-loops, in its proof, we use a potential function mapping the actual process to a process on a scaled state space with nearly position-independent variance. Unlike the typical applications of potential functions in drift analysis, the purpose of the potential function is not to establish a position-independent first-moment stochastic drift but a (nearly) position-independent variance, i. e., the potential function is designed to analyze a second moment. This argument seems to be new in the theory of drift analysis and may be of independent interest.
Lemma 10
Consider a bit of cGA on OneMax and let \(p_t\) be its marginal probability at time t. Let \(t_1, t_2, \ldots \) be the times where cGA performs an rw-step (before hitting one of the borders 1 / n or \(1-1/n\)) and let \({\varDelta }_i:=p_{t_i+1}-p_{t_i}\). For \(s\in \mathbb {R}\), let \(T_s\) be the smallest t such that \({{\mathrm{sgn}}}(s)\left( \sum _{i=0}^{t} {\varDelta }_{i}\right) \ge |s|\) holds.
Choosing \(0<\alpha <1\), where \(1/\alpha =o(K)\), and \(-1\le s<0\) constant, we have
Moreover, for any \(\alpha >0\) and \(s\in \mathbb {R}\),
Informally, the lemma means that every deviation of the hitting time \(T_s\) by a constant factor from its expected value (which turns out as \({\varTheta }(s^2K^2)\)) still has constant probability, and even deviations by logarithmic factors have a polynomially small probability. We will mostly apply the lemma for \(\alpha <1\), especially \(\alpha \approx 1/\log n\), to show that there are marginal probabilities that quickly approach the lower border; in fact, this effect implies that the smallest possible update strength \(K\sim \sqrt{n}\log n\) in Theorem 2 necessarily involves a \(\log n\)-term. Note that the second statement of the lemma also holds for \(\alpha \ge 1\); however, in this realm also Markov’s inequality works. Then, by the inequality \(e^{-x}\le 1-x/2\) for \(x\le 1\), we get \(\mathord {\mathrm {P}}\mathord {\left[ T_s\ge \alpha (sK)^2\right] }\ge 1/(8\alpha )\), which means that Markov’s inequality for deviations above the expected value is asymptotically tight in this case.
We start with the proof of the second statement, which can be obtained by a relatively straightforward analysis of a fair random walk.
Proof of Lemma 10, 2nd statement
Throughout this proof, to ease notation we consider the scaled process on the state space \(S:=\{0,1,\ldots ,K\}\) obtained by multiplying all marginal probabilities by K; the random variables \(X_t=K p_{t}\) will live on this scaled space. Note that we also remove the borders (K / n and \(K-K/n\)), which is possible as all considerations are stopped when such a border is reached. For the same reason, we only consider current states from \(\{1,\ldots ,K-1\}\) in the remainder of this proof.
Ignoring all self-loops can only make the first hitting time \(T_s\) stochastically smaller, which is the pessimistic direction here. Formally, recalling the trivial scaling of the state space, we consider the fair random walk where \(\mathord {\mathrm {P}}\mathord {\left[ X_{t_i+1}=j-1\right] }=\mathord {\mathrm {P}}\mathord {\left[ X_{t_i+1}=j+1\right] }=1/2\) if \(X_{t_i}=j\in \{1,\ldots ,K-1\}\). We write \(Y_t=\sum _{i=0}^{t-1} {\varDelta }_{i}\). Clearly, \({\varDelta }_i\) is uniform on \(\{-1,1\}\), \(\mathrm {E}\mathord {\left( {\varDelta }_i\mid 0<X_{t_i}<K\right) }=0\), \({{\mathrm{Var}}}({\varDelta }_i\mid 0<X_{t_i}<K)=1\) and \(Y_t\) is a sum of independent, identically distributed random variables. It is well known that \((Y_t-\mathrm {E}\mathord {\left( Y_t\right) })/\sqrt{{{\mathrm{Var}}}(Y_t)}\) converges in distribution to a standard normally distributed random variable (see, e. g., Chapter 10 in [9]). However, we do not use this fact directly here. Instead, to bound the deviation from the expectation, we use a classical Hoeffding bound. We assume \(s\ge 0\) now and will see that the case \(s<0\) can be handled symmetrically.
Theorem 1.11 in [4] yields, with \(c_i=2\) as the size of the support of \({\varDelta }_i\), that
Moreover, according to Theorem 1.13 in [4], the bound also holds for all \(k\le \alpha s^2 K^2\) together, more precisely,
Symmetrically, we obtain
Hence, a distance that is strictly smaller than sK is bridged through \(\alpha (sK)^2\) rw-steps (or the process reaches a border before) with probability at least \(1-e^{-1/(4\alpha )}\). \(\square \)
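This tail bound is easy to sanity-check numerically: for a fair \(\pm 1\) walk, the exact probability of bridging a distance L within \(\alpha L^2\) steps can be computed by dynamic programming with an absorbing barrier and compared with \(e^{-1/(4\alpha )}\). The distance L and the values of \(\alpha \) below are illustrative.

```python
import math

def prob_bridge(L, T):
    """Exact probability that a fair +-1 random walk reaches +L within
    T steps, via dynamic programming with an absorbing barrier at L
    (self-loops are ignored, which is the pessimistic direction)."""
    dist = {0: 1.0}   # displacement -> probability, barrier not yet hit
    hit = 0.0
    for _ in range(T):
        new = {}
        for x, pr in dist.items():
            for y in (x - 1, x + 1):
                if y >= L:
                    hit += pr / 2
                else:
                    new[y] = new.get(y, 0.0) + pr / 2
        dist = new
    return hit

L = 10  # the distance |s|*K to be bridged (illustrative)
for alpha in (0.1, 0.25, 0.5):
    T = int(alpha * L * L)
    # bridging within alpha * L^2 steps is unlikely, as in the 2nd statement
    assert prob_bridge(L, T) <= math.exp(-1 / (4 * alpha))
```

The exact probabilities are noticeably below the Hoeffding-type bound for small \(\alpha \), consistent with the bound being loose by constant factors in the exponent.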
To illustrate the main idea for the proof of the first statement of Lemma 10, we ignore b-steps for a while and recall that we are confronted with a fair random walk then. However, the random walk is not homogeneous with respect to place as the self-loops slow the process down in the vicinity of a border. Unlike the classical fair random walk, the random variables describing the change of position from time t to time \(t+1\) (formally, \({\varDelta }_t:=p_{t+1}-p_{t}\)) are not identically distributed. In fact, the variance of \({\varDelta }_t\) becomes smaller the closer \(p_t\) is to one of the borders.
In more detail, the potential function used in the proof of Lemma 10 will essentially use the self-loop probabilities to construct extra distances to bridge. For instance, states with low self-loop probability (e. g., at marginal probability 1 / 2) will have a potential that is only by \({\varTheta }(1)\) larger or smaller than the potential of its neighbors. On the other hand, states with a large self-loop probability, say at marginal probability 1 / K, will have a potential that can differ by as much as \(2\sqrt{K}\) from the potential of its neighbors. Interestingly, this choice leads to variances of the one-step changes that are basically the same on the whole state space (very roughly, this is true since the squared change \((2\sqrt{K})^2={\varTheta }(K)\) is observed with probability \({\varTheta }(1/K)\)). However, using the potential for this trick is at the expense of changing the support of the underlying random variables, which then will depend on the state. Nevertheless, as the support is not changed too much, the Central Limit Theorem (CLT) still applies and we can approximate the progress made within T steps by a normally distributed random variable. This approximation is made precise in the following lemma, along with a bound on the absolute error.
Lemma 11
(CLT with Lyapunov condition, Berry-Esseen inequality [10, p. 544]). Let \(X_1,\ldots ,X_m\) be a sequence of independent random variables, each with finite expected value \(\mu _i\) and variance \(\sigma _i^2\). Define
If there exists a \(\delta >0\) such that
(assuming all the moments of order \(2+\delta \) to be defined), then \(C_m\) converges in distribution to a standard normally distributed random variable.
Moreover, the approximation error is bounded as follows: for all \(x\in \mathbb {R}\),
where C is an absolute constant and \({\varPhi }(x)\) denotes the cumulative distribution function of the standard normal distribution.
We now turn to the formal proof of the outstanding 1st statement of Lemma 10.
Proof of Lemma 10, 1st statement
As in the proof of the 2nd statement of Lemma 10 above, we consider the scaled search space \(\{1,\ldots ,K-1\}\). Here we will essentially use an approximation of the accumulated state within \(\alpha s^2K^2\) steps by the normal distribution, but have to be careful to take into account steps describing self-loops. To analyze the hitting time \(T_s\) for the \(X_{t_i}\)-process, we now define a potential function \(g:S\rightarrow \mathbb {R}\). Unlike the typical applications of potential functions, the purpose of g is not to establish a position-independent first-moment drift (in fact, there is no drift within S since the original process is a martingale) but a (nearly) position-independent variance, i. e., the potential function is designed to analyze a second moment.
Potential function. We proceed with the formal definition of the potential function, the analysis of its expected first-moment change and the corresponding variance, and a proof that the Lyapunov condition holds for the accumulated change within \(\alpha s^2K^2\) steps. The potential function g is monotonically decreasing on \(\{1,\ldots ,K/2\}\) and centrally symmetric around K / 2. We define it as follows:
Inductively, we have for \(1\le i \le K/2\) that
where the second equality holds since the sum is telescoping. We also note that \(g(0)=O(K)\), more precisely it holds that
where the first inequality used \(\sum _{j=2}^{K/2-1} \sqrt{1/j}\) as a lower sum of the integral. More generally, using the monotonicity of g and the same kind of estimations as before, we obtain for \(i<j\le K/2\) that
Informally, the potential function stretches the whole state space by a factor of at most 4, but adjacent states in the vicinity of the borders can be as much as \(2\sqrt{K}\) apart in potential.
Let \(Y_t:=g(X_t)\). We consider the one-step differences \({\varPsi }_i:=Y_{t_i+1}-Y_{t_i}\) at the times i where rw-steps occur, and we will show via the representation \(Y_{t_i}:=\sum _{j=0}^{i-1} {\varPsi }_j\) that \(Y_{t_i}\) approaches a normally distributed variable. Note that \(Y_{t_i}\) is not necessarily the same as \(g(X_{t_i})-g(X_{t_0})\) since only the effect of rw-steps is covered by \(Y_{t_i}\).
In the following, we assume \(1\le X_{t_i}\le K/2\) and note that the case \(X_{t_i}>K/2\) can be handled symmetrically with respect to \({\varPsi }_i\). We proceed with the announced analysis of different moments of \({\varPsi }_i\).
Analysis of expected change of potential. We claim that for all \(i\ge 0\)
where the o-notation is with respect to K.
The lower bound \(\mathrm {E}\mathord {\left( {\varPsi }_i\mid X_{t_i}\right) }\ge 0\) is easy to see since \(X_{t_i}\) is a fair random walk and \(g(j-1)-g(j) \ge g(j)-g(j+1)\) holds for all \(j\le K/2\). To prove the upper bound, we note that \(X_{t_i+1}\in \{X_{t_i}-1,X_{t_i},X_{t_i}+1\}\) so that
Using the properties of rw-steps, we have that \(\mathord {\mathrm {P}}\mathord {\left[ Y_{t_i+1}\ne Y_{t_i}\right] } = 2\frac{(K-X_{t_i}) X_{t_i}}{K^2}\). Moreover, on \(Y_{t_i+1}\ne Y_{t_i}\), \(Y_{t_i+1}\) takes each of the two values \(g(X_{t_i}-1)\) and \(g(X_{t_i}+1)\) with the same probability. Hence
where the last equality used (4) and (5).
We estimate the bracketed terms using
where the penultimate inequality exploited that \(f(x+h)-f(x)\le h f'(x)\) for any concave, differentiable function f and \(h\ge 0\); here using \(f(x)=\sqrt{x}\) and \(h=1\). Altogether,
which proves (7) since \(X_{t_i}\ge 1\) and \(K=\omega (1)\).
Analysis of the variance of the change of potential. We claim that for all \(i\ge 0\)
To show this, note that
since \(\mathrm {E}\mathord {\left( {\varPsi }_i\mid X_{t_i}\right) } \ge 0\). Now, as \(0<X_{t_i}\le K/2\), we have \(\mathord {\mathrm {P}}\mathord {\left[ Y_{t_i+1}<Y_{t_i}\right] } = \frac{(K-X_{t_i}) X_{t_i}}{K^2} \ge \frac{X_{t_i}}{2K}\). Moreover, \(Y_{t_i+1}<Y_{t_i}\) implies that \(X_{t_i+1}=X_{t_i}+1\) since g is monotone decreasing on \(\{1,\ldots ,K/2\}\) and the \(X_{t_i}\)-value can change by either \(-1\), 0, or 1. Hence, if \(Y_{t_i+1}<Y_{t_i}\) then \(Y_{t_i+1}-Y_{t_i} = g(X_{t_i}+1) - g(X_{t_i}) = - \sqrt{2K/(X_{t_i}+1)}\). Altogether,
where we used \(X_{t_i}/(X_{t_i}+1)\ge 1/2\). This proves the lower bound on the variance.
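The near position-independence of the variance can also be checked numerically. The sketch below assumes neighbour gaps \(g(j)-g(j+1)=\sqrt{2K/(j+1)}\) on the lower half, matching the expression for \(Y_{t_i+1}-Y_{t_i}\) derived above; the value of K is illustrative.

```python
import math

K = 200  # illustrative value of the update-strength parameter

def gap(j):
    """Assumed potential difference g(j-1) - g(j) for 1 <= j <= K/2."""
    return math.sqrt(2 * K / j)

def step_moments(j):
    """Mean and variance of the one-step potential change Psi in an
    rw-step at state j on the lower half (1 <= j <= K/2)."""
    r = 2 * (j / K) * (1 - j / K)   # probability of actually moving
    up, down = gap(j), -gap(j + 1)  # change when moving to j-1 resp. j+1
    mean = (r / 2) * (up + down)
    var = (r / 2) * (up ** 2 + down ** 2) - mean ** 2
    return mean, var

for j in range(1, K // 2 + 1):
    mean, var = step_moments(j)
    assert mean >= 0          # small drift away from the lower border
    assert 1.5 <= var <= 4.5  # variance stays Theta(1) across all states
```

Without the potential, the one-step variance of the state itself would be \(r={\varTheta }(j/K)\) near the border, i. e., heavily position-dependent; the stretched gaps compensate this exactly up to constant factors.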
Approximating the accumulated change of potential by a Normal distribution. We are almost ready to prove that \(Y_{t_i}:=\sum _{j=0}^{i-1} {\varPsi }_j\) can be approximated by a normally distributed random variable for sufficiently large t. We denote by \(s_i^2 := \sum _{j=0}^{i-1} {{\mathrm{Var}}}({\varPsi }_j\mid X_{t_j})\) and note that \(s_i^2 \ge i/4\) by our analysis of variance from above. The so-called Lyapunov condition, which is sufficient for convergence to the normal distribution (see Lemma 11), requires the existence of some \(\delta >0\) such that
We will show that the condition is satisfied for \(\delta =1\) (smaller values could be used but do not give any benefit) and \(i=\omega (K)\) (which, as \(i=\alpha s^2K^2\), holds due to our assumptions \(1/\alpha =o(K)\) and \(|s|={\varOmega }(1)\)). We argue that
where we have used the bound on \(\mathrm {E}\mathord {\left( {\varPsi }_i\mid X_{t_i}\right) }\) from (7). As the \(X_{t_i}\)-value can only change by \(\{-1,0,1\}\), we get, by summing up all possible changes of the g-value, that
for K large enough.
Hence, plugging this in the Lyapunov condition (9) for \(\delta =1\), we obtain
implying that
which goes to 0 as \(i=\omega (K)\). Hence, for the value \(i:=\alpha s^2 K^2\) considered in the lemma we obtain that
converges in distribution to N(0, 1) according to Lemma 11. The absolute error of this approximation is also \(O(\sqrt{K/i})\) by reusing (10).
Estimating the accumulated progress. Recall that our aim is to show that the event \(\sum _{j=0}^{i-1} {\varDelta }_j \le s\) (where s is negative and \(i=\alpha s^2 K^2\)) happens with at least the probability stated in the lemma. Since we analyzed the change of the potential function g, we establish a sufficient increase of the g-value (corresponding to a decrease of marginal probability) that implies \( \sum _{j=0}^{i-1} {\varDelta }_j \le s\). By (6), we know that \(g(X_{t_i})-g(X_0)\ge 2\sqrt{-s}K \) implies \(X_{t_i}-X_0\le sK<0\) and therefore also \( \sum _{j=0}^{i-1} {\varDelta }_j\le s\). Hence, in the following it suffices to study the event \(g(X_{t_i})-g(X_0)\ge 2\sqrt{-s}K\) and to show that it happens with the required probability.
As already mentioned, the random variable \(Y_{t_i}\) denotes the accumulated progress (in terms of g-value) due to rw-steps up to time \(t_i\). To show that \(Y_{t_i}\) is at least \(2\sqrt{-s}K\) with the claimed probability bounds, we exploit the above-established property that (11) converges in distribution to N(0, 1). Hence, we need to estimate the variance \(s_i\) and the expected value \(\mathrm {E}\mathord {\left( Y_{t_i}\right) }\).
Note that \(s_i^2 \ge \alpha s^2 K^2/4\) by our analysis of variance above and therefore \(s_i\ge \sqrt{\alpha }\,|s|K/2\). We have to be more careful when computing \(\mathrm {E}\mathord {\left( Y_{t_i}\right) }\) since \(\mathrm {E}\mathord {\left( {\varPsi }_i\mid X_{t_i}\right) }\) is negative for \(X_{t_i}>K/2\). Note, however, that considerations are stopped when the marginal probability exceeds 5 / 6, i. e., when \(X_{t_i}>5K/6\). Using (7), we hence have that \(\mathrm {E}\mathord {\left( {\varPsi }_i\mid X_{t_i}\right) }\ge -\sqrt{2/(5K^2/6)} \ge -1.55/K\). Therefore, \(\mathrm {E}\mathord {\left( Y_{t_i}\right) } \ge i\cdot (-1.55/K) = -1.55\alpha s^2 K\) and \(\mathrm {E}\mathord {\left( Y_{t_i}/s_i\right) } \ge 3.1s\sqrt{\alpha }\).
We study the event \(Y_{t_i}\ge rK\) for general \(r\ge 0\), which is equivalent to \(\frac{Y_{t_i}-\mathrm {E}\mathord {\left( Y_{t_i}\mid X_0\right) }}{s_i} \ge rK/s_i - \mathrm {E}\mathord {\left( Y_{t_i}/s_i\right) }\). If (11) was really N(0, 1)-distributed, the probability of the event would be \(1-{\varPhi }(rK/s_i-\mathrm {E}\mathord {\left( Y_{t_i}/s_i\right) })\), where \({\varPhi }\) denotes the cumulative distribution function of the standard normal distribution. Taking into account the approximation error \(O(\sqrt{K/i})\) computed above and plugging in our estimates for expected value and variance, we altogether have that
for any r leading to a positive argument of \({\varPhi }\),
Using \(r=3\sqrt{-s}\) in (13), we compute
Using Lemma 21 (in the Appendix) we can now bound the term \(1-\mathord {{\varPhi }}\Bigl (r/(-s\sqrt{\alpha /4})-3.1s\sqrt{\alpha }\Bigr )\) from (13) from below and obtain
using \(-s\le 1\) and \(\alpha \le 1\). This means that distance sK (in negative direction) is bridged by the rw-steps before or at time \(t_i\), where \(i=\alpha s^2K^2\), with probability at least \(p(\alpha ,s)-O(\sqrt{K/i}) = p(\alpha ,s)-O(\alpha ^{-1/2}|s|^{-1}K^{-1/2})\), where the O-term is the bound on the approximation error computed above. Undoing the scaling of the state space introduced at the beginning of this proof, this corresponds to an accumulated change of the actual state of cGA in rw-steps by s; more formally, \(\left( \sum _{i=0}^{t} {\varDelta }_{i}\right) \le s\) in terms of the original state space. This establishes also the first statement of the lemma and completes the proof. \(\square \)
As rw-steps are interleaved with b-steps, Lemma 10 alone is not sufficient to analyze the overall movement of a marginal probability. We also require a bounded number of b-steps within a given period of time. To establish this, we first show that, during the early stages of a run, the probability of a b-step is only \(O(1/\sqrt{n})\). Intuitively, during early stages of the run many bits will have marginal probabilities in the interval [1 / 6, 5 / 6]. Then the standard sampling deviation of the OneMax-value is of order \({\varTheta }(\sqrt{n})\), and the probability of a b-step is \(1-\mathord {\mathrm {P}}\mathord {\left[ R_t\right] } = O(1/\sqrt{n})\). The link between \(1-\mathord {\mathrm {P}}\mathord {\left[ R_t\right] }\) and the standard deviation already appeared in Lemma 3 above; roughly, it says that every step is a b-step for bit i with probability at least \((\sum _{j \ne i} p_j(1-p_j))^{-1/2}\), which is the reciprocal of the standard deviation in terms of the other bits.
The following Lemma 12 represents a kind of counterpart of Lemma 3, but here we seek an upper bound on \(1-\mathord {\mathrm {P}}\mathord {\left[ R_t\right] }\).
Lemma 12
Assume that at time t there are \(\gamma n\) bits, for some constant \(\gamma > 0\), whose marginal probabilities are within [1 / 6, 5 / 6]. Then the probability of having a b-step on any fixed bit position is
regardless of the decisions made in this step on all other \(n-\gamma n-1\) bits.
Proof
We know from our earlier discussion that a b-step at bit i requires \(D_t \in \{-1, 0\}\), where \(D_t := |x_t| - x_{t, i} - (|y_t| - y_{t, i})\), with \(|\cdot |\) denoting the number of ones, is the change of the OneMax-value at bits other than i in the two solutions \(x_t\) and \(y_t\) sampled at time t.
We apply the principle of deferred decisions and fix all decisions for creating \(x_t\) as well as decisions for \(y_t\) on all but the \(m := \gamma n\) selected bits with marginal probabilities in [1/6, 5/6]. Let \(p_1, p_2, \ldots , p_{m}\) denote the corresponding marginal probabilities after renumbering these bits, and let S denote the random number of these bits set to 1. Note that there are at most 2 values for S which lead to the algorithm making a b-step.
Since S is a sum of independent Bernoulli trials with success probabilities \(p_1, \ldots , p_{m}\), Theorem 22 in the Appendix implies that the probability of S attaining any particular value is at most
Taking the union bound over the two values proves the claim. \(\square \)
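The anti-concentration bound used above can be sanity-checked empirically. The sketch below (our own illustration, not part of the proof) estimates the largest point mass of a sum of independent Bernoulli variables and compares it with \(1/\sigma \), where \(\sigma ^2 = \sum_j p_j(1-p_j)\):

```python
import random

def max_point_mass(probs, trials=40000, seed=2):
    """Empirically estimates max_k P[S = k] for S a sum of independent
    Bernoulli(p_j) random variables.  The anti-concentration bound behind
    Lemma 12 says this maximum is O(1/sigma).  (Numerical sanity check.)"""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        s = sum(rng.random() < p for p in probs)
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values()) / trials
```

For 100 fair bits we have \(\sigma = 5\), and the estimate comes out near \(1/(\sigma \sqrt{2\pi }) \approx 0.08\), below the \(1/\sigma \) bound.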
Even though one main aim is to show that rw-steps make certain marginal probabilities reach their lower border, we will also ensure that, with high probability, \({\varOmega }(n)\) marginal probabilities do not move by too much, resulting in a large sampling variance and a small probability of b-steps. The following lemma serves this purpose. Its proof is a straightforward application of Hoeffding’s inequality since it is pessimistic here to ignore the self-loops.
Lemma 13
For any bit, with probability \({\varOmega }(1)\), for any \(t \le \kappa K^2\), \(\kappa > 0\) a small enough constant, the first t rw-steps lead to a total change of the bit’s marginal probability within \([-1/6, 1/6]\). This fact holds independently of all other bits.
The probability that the above holds for less than \(\gamma n\) bits amongst the first n/2 bits is \(2^{-{\varOmega }(n)}\), regardless of the decisions made on the last n/2 bits.
Proof
Note that the probability of leaving the interval \([-1/6, 1/6]\) increases with the number of rw-steps that actually increase or decrease the marginal probability (as opposed to self-loops). We call these steps relevant and pessimistically assume that all t steps are relevant.
Now let \(X_i \in \{-1, +1\}\) denote the change of the marginal probability in the ith relevant step, measured in units of 1/K, and define \(Y_j := \sum _{i=1}^{j} X_i\) as the total progress in the first j relevant steps. We have \(\mathrm {E}\mathord {\left( Y_j\right) } = 0\) for all \(j \le t\), and the total change in these j steps exceeds 1/6 only if \(Y_j \ge K/6\). Applying a Hoeffding bound, Theorem 1.13 in [4], the maximum total progress is bounded as follows:
By symmetry, the same holds for the total change reaching values less than or equal to \(-1/6\). By the union bound, the probability that the total change always remains within the interval \([-1/6, 1/6]\) is thus at least
Assuming \(\kappa < 1/(12 \ln 2)\) gives a lower bound of \({\varOmega }(1)\).
Note that due to our pessimistic assumption of all steps being relevant, all bits are treated independently. Hence we may apply standard Chernoff bounds to derive the second claim. \(\square \)
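The event analyzed in Lemma 13 is easy to simulate. The following sketch (parameter choices are ours) measures how often a \(\pm (1/K)\) random walk stays within \([-1/6, 1/6]\) for \(\kappa K^2\) steps, pessimistically treating every step as relevant, as in the proof:

```python
import random

def fraction_staying_centered(K, kappa=0.01, runs=2000, seed=0):
    """Fraction of +-(1/K) random walks whose running sum stays within
    [-1/6, 1/6] for kappa*K^2 steps.  Every step is treated as relevant
    (no self-loops), mirroring the pessimistic assumption in the proof."""
    rng = random.Random(seed)
    steps = int(kappa * K * K)
    stayed = 0
    for _ in range(runs):
        total = 0.0
        for _ in range(steps):
            total += 1.0 / K if rng.random() < 0.5 else -1.0 / K
            if abs(total) > 1 / 6:
                break            # the walk left the interval
        else:
            stayed += 1          # stayed centered for all steps
    return stayed / runs
```

For small \(\kappa \) a constant fraction of walks stays centered, matching the \({\varOmega }(1)\) bound of the lemma.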
The following lemma shows that whenever a small number of bits has reached the lower border for marginal probabilities, the remaining optimization time is \({\varOmega }(n \log n)\) with high probability. The proof is similar to the well-known coupon collector’s theorem [24].
Lemma 14
Assume cGA reaches a situation where at least \({\varOmega }(n^\varepsilon )\) marginal probabilities attain the lower border 1/n. Then with probability \(1 - e^{-{\varOmega }(n^{\varepsilon /2})}\), and in expectation, the remaining optimization time is \({\varOmega }(n \log n)\).
Proof
Let \(m = {\varOmega }(n^{\varepsilon })\) be the number of bits that have reached the lower border 1/n. A necessary condition for reaching the optimum within \(t := (n/2-1)\cdot (\varepsilon /2) \ln n\) iterations is that during this time each of these m bits is sampled at value 1 in at least one of the two search points constructed. The probability that one bit never samples a 1 in t iterations is at least \((1 - 2/n)^t\). The probability that all m bits sample a 1 during t steps is at most, using \((1-2/n)^{n/2-1} \ge 1/e\) and \(1+x \le e^x\) for \(x \in \mathbb {R}\),
Hence with probability \(1 - \exp (-{\varOmega }(n^{\varepsilon /2}))\) the remaining optimization time is at least \(t = {\varOmega }(n \log n)\). As \(1 - \exp (-{\varOmega }(n^{\varepsilon /2})) = {\varOmega }(1)\), the expected remaining optimization time is of the same order. \(\square \)
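The coupon-collector effect behind Lemma 14 can be illustrated as follows (a simulation sketch with hypothetical parameters, not the paper's argument):

```python
import random

def iterations_until_all_sampled(m, n, seed=3):
    """Simulates m bits stuck at the lower border 1/n: in each cGA iteration
    two solutions are sampled, so a border bit produces a 1 with probability
    1-(1-1/n)^2.  Returns the number of iterations until every bit has
    produced a 1 at least once -- the coupon-collector effect of Lemma 14."""
    rng = random.Random(seed)
    p_hit = 1 - (1 - 1 / n) ** 2
    remaining, t = m, 0
    while remaining > 0:
        t += 1
        # each still-unsampled bit produces its first 1 with probability p_hit
        remaining -= sum(rng.random() < p_hit for _ in range(remaining))
    return t
```

For m = 50 and n = 200 the waiting time concentrates around \((n/2)\ln m\), i.e. several hundred iterations, in line with the \({\varOmega }(n \log n)\) bound.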
We have now collected most of the machinery to prove Theorem 8. The following lemma identifies a set of bits that stay centered in a phase of \({\varTheta }(K\min \{K,\sqrt{n}\})\) steps, resulting in a low probability of b-steps. Basically, the idea is to bound the accumulated effect of b-steps in the phase using Chernoff bounds: given at most K/6 b-steps, a marginal probability cannot change by more than 1/6 through them. Note that this applies to many, but not all bits. Later, we will see that within the phase, some of the remaining bits will reach their lower border with not too low probability.
Lemma 15
Let \(\kappa > 0\) be a small constant. There exists a constant \(\gamma \), depending on \(\kappa \), and a selection S of \(\gamma n\) bits among the first n/2 bits such that the following properties hold, regardless of the last n/2 bits, throughout the first \(T := \kappa K \cdot \min \{K, \sqrt{n}\}\) steps of cGA with \(K \le \mathrm {poly}(n)\), with probability \(1-\mathrm {poly}(n) \cdot 2^{-{\varOmega }(\min \{K, n\})}\):

1.
the marginal probabilities of all bits in S are always within [1/6, 5/6] during the first T steps,

2.
the probability of a b-step at any bit is always \(O(1/\sqrt{n})\) during the first T steps, and

3.
the total number of b-steps for each bit is bounded by K/6, leading to a displacement of at most 1/6.
Proof
The first property is trivially true at initialization, and we show that an event of exponentially small probability needs to occur in order to violate the property. Taking a union bound over all T steps ensures that the property holds throughout the whole phase of T steps with the claimed probability.
By Lemma 13, with probability \(1-2^{-{\varOmega }(n)}\), for at least \(\gamma n\) of the first n/2 bits the total effect of all rw-steps is always within \([-1/6, +1/6]\) during the first \(T \le \kappa K^2\) steps. We assume in the following that this happens and take S as a set containing exactly \(\gamma n\) of these bits.
It remains to show that for all bits in S the total effect of b-steps is bounded by 1/6 with high probability. Note that, while this is the case, according to Lemma 12 the probability of a b-step at every bit in S is at most \(c_2/\sqrt{n}\) for a positive constant \(c_2\). This corresponds to the second property, and as long as this holds, the expected number of b-steps in \(T \le \kappa K^2\) steps is at most \(\kappa \cdot c_2 K\). Each b-step changes the marginal probability of the bit by 1/K. A necessary condition for increasing the marginal probability by a total of at least 1/6 is that we have at least K/6 b-steps amongst the first T steps. Choosing \(\kappa \) small enough to make \(\kappa \cdot c_2 K \le 1/2 \cdot K/6\), by Chernoff bounds the probability of at least K/6 b-steps in T steps is \(e^{-{\varOmega }(K)}\). In order for the first property to be violated, an event of probability \(e^{-{\varOmega }(K)}\) must occur for some bit in S and some point of time \(t \le T\); otherwise all properties hold true.
Taking the union bound over all \(T \le \kappa K^2\) steps and all \(\gamma n\) bits gives a probability bound of \(\kappa K^2 \cdot \gamma n \cdot e^{-{\varOmega }(K)} \le \mathrm {poly}(n) \cdot 2^{-{\varOmega }(K)}\) for any property being violated. This proves the claim. \(\square \)
Finally, we put everything together to prove our lower bound for cGA.
Proof of Theorem 8
If \(K = O(1)\) then it is easy to show, similarly to Lemma 17, that each bit independently hits the lower border with probability \({\varOmega }(1)\) by sampling only zeros. Then the result follows easily from Chernoff bounds and Lemma 14. Hence we assume in the following \(K = \omega (1)\).
For \(K \ge \sqrt{n}\), Lemma 15 implies a lower bound of \({\varOmega }(K \sqrt{n})\) as then the probability of sampling the optimum in any of the first \(T := \kappa K \cdot \min \{K, \sqrt{n}\}\) steps is at most \((5/6)^{\gamma n} = 2^{-{\varOmega }(n)}\). Taking a union bound over the first T steps and adding the error probability from Lemma 15 proves a lower bound of \({\varOmega }(K \sqrt{n})\) with the claimed probability. This proves the theorem for \(K = {\varOmega }(\sqrt{n}\log n)\) as then the \({\varOmega }(\sqrt{n}K)\) term dominates the runtime. Hence we may assume \(K = o(\sqrt{n}\log n)\) in the following and note that in this realm proving a lower bound of \({\varOmega }(n \log n)\) is sufficient as here this term dominates the runtime.
We still assume that the events from Lemma 15 apply to the first n/2 bits. We now use Lemma 10 to show that some marginal probabilities amongst the last n/2 bits are likely to walk down to the lower border. Note that Lemma 10 applies for an arbitrary (even adversarial) mixture of rw-steps and b-steps over time. This allows us to regard the progress in rw-steps as independent between bits.
In more detail, we will apply both statements of Lemma 10 to a fresh marginal probability from the last n/2 bits, to prove that it walks to its lower border with a not too small probability. First we apply the second statement of the lemma for a positive displacement of \(s:=1/6\) within T steps, using \(\alpha := T/(sK)^2\). The random variable \(T_s\) describes the first point of time where the marginal probability reaches a value of at least \(1/2+1/6+s=5/6\) through a mixture of b- and rw-steps. This holds since we work under the assumption that the b-steps only account for a total displacement of at most 1/6 during the phase. Lemma 10 now gives us a probability of at least \(1-e^{-1/(4\alpha )} = {\varOmega }(1)\) (using \(\alpha =O(1)\)) for the event that the marginal probability does not exceed 5/6. In the following, we condition on this event.
We then revisit the same stochastic process and apply Lemma 10 again to show that, under this condition, the random walk achieves a negative displacement. Note that the event of not exceeding a certain positive displacement is positively correlated with the event of reaching a given negative displacement (formally, the state of the conditioned stochastic process is always stochastically smaller than that of the unconditioned process), allowing us to apply Lemma 10 again despite dependencies between the two applications.
We now apply the first statement of Lemma 10 for a negative displacement of \(s := -1\) through rw-steps within T steps, using \(\alpha := T/((sK)^2)\). Since we still work under the assumption that the b-steps only account for a total displacement of at most 1/6 during the phase, the displacement is then altogether no more than \(s+1/6 = -5/6\), implying that the lower border is hit since the marginal probability does not exceed 5/6.
The conditions on \(\alpha \) in Lemma 10 hold as \(0< \alpha < 1\) when choosing \(\kappa \) small enough, and \(1/\alpha = O(K/\min \{\sqrt{n}, K\}) = o(K)\) for \(K = \omega (1)\). Also note that \(1/\alpha = O(K/\min \{\sqrt{n}, K\}) = o(\log n)\) since \(K = o(\sqrt{n} \log n)\). Now the lemma states that the probability of the random walk reaching a displacement through rw-steps of at most s (or hitting the lower border before) is at least
To bound the last expression from below, we distinguish between two cases. If \(K\le \sqrt{n}\), then \(\alpha ={\varOmega }(1)\) and (14) is at least
since \(K=\omega (1)\) and \(s={\varTheta }(1)\). If \(K\ge \sqrt{n}\), then with \(1/\alpha =o(\log n)\) we estimate (14) from below by
for some \(\beta =\beta (n) = o(1)\).
Combining with the probability of not exceeding 5/6, which we have proved to be constant, the probability of the bit’s marginal probability hitting the lower border within T steps is \({\varOmega }(n^{-\beta })\). Hence by Chernoff bounds, with probability \(1-2^{-{\varOmega }(n^{1-\beta })}\), the final number of bits hitting the lower border within T steps is \({\varOmega }(n^{1-\beta }) = {\varOmega }(n^{1-o(1)})\).
Once a bit has reached the lower border, while the probability of a b-step is \(O(1/\sqrt{n})\), the probability of leaving the border again is \(O(n^{-3/2})\) as it is necessary that either the bit is sampled as 1 in one of the offspring and a b-step happens, or the bit is sampled as 1 in both offspring. So the probability that this does not happen before the \(T = O(n \log n)\) steps are completed is \((1-O(n^{-3/2}))^{T} \ge e^{-O(\log (n)/\sqrt{n})} = 1-o(1)\). Again applying Chernoff bounds leaves \({\varOmega }(n^{1-o(1)})\) bits at the lower border at time T with probability \(1-2^{-{\varOmega }(n^{1-o(1)})}\).
Then Lemma 14 implies a lower bound of \({\varOmega }(n \log n)\) that holds with probability \(1-2^{-{\varOmega }(n^{1/2-o(1)})}\). \(\square \)
5.2 Proof of Lower Bound for 2\(\hbox {MMAS}_{\mathrm{ib}}\)
We will use, to a large extent, the same approach as in Sect. 5.1 to prove Theorem 9. Most of the lemmas can be applied directly or with very minor changes. In particular, Lemmas 13, 14 and 15 also apply to 2\(\hbox {MMAS}_{\text {ib}}\) by identifying 1/K with \(\rho \). Intuitively, this holds since the analyses of b-steps always pessimistically bound the absolute change of a marginal probability by the update strength (1/K for cGA). This also holds with respect to the update strength \(\rho \) for 2\(\hbox {MMAS}_{\text {ib}}\).
To prove lower bounds on the time to hit a border through rw-steps, the next lemma is used. It is very similar to Lemma 10, except for two minor differences: first, the accumulated effect of b-steps is also included in the quantity \(p_t-p_0\) analyzed in the lemma. Second, considerations are stopped when the marginal probability becomes less than \(\rho \) or more than \(1-\rho \). This has technical reasons but is not a crucial restriction. We supply an additional lemma, Lemma 17 below, that applies when the marginal probability is less than \(\rho \). The latter lemma uses known analyses similar to so-called landslide sequences defined in [26, Section 4].
Lemma 16
Consider a bit of 2\(\hbox {MMAS}_{\text {ib}}\) on OneMax and let \(p_t\) be its marginal probability at time t. We say that the process breaks a border at time t if \(\min \{p_t,1-p_t\}\le \max \{1/n,\rho \}\). Given \(s\in \mathbb {R}\) and arbitrary starting state \(p_0\), let \(T_s\) be the smallest t such that \({{\mathrm{sgn}}}(s)(p_t-p_0) \ge |s|\) holds or a border is reached.
Choosing \(0<\alpha <1\), where \(1/\alpha =o(\rho ^{-1})\), and \(-1 \le s< 0\) constant, and assuming that every step is a b-step with probability at most \(\rho /(4\alpha )\), we have
Moreover, for any \(\alpha >0\) and constant \(0<s\le 1\), if there are at most \(s/(2\alpha \rho )\) b-steps until time \(\alpha (s/\rho )^2\), then
Proof
We follow similar ideas as in the proof of Lemma 10. Again, we start with the second statement, where \(s> 0\) is assumed, and aim for applying a Hoeffding bound. We note that a marginal probability of 2\(\hbox {MMAS}_{\text {ib}}\) can only change by an absolute amount of at most \(\rho \) in a step. Hence, the b-steps until time \(\alpha (s/\rho )^2\) account for an increase of the \(X_t\)-value by at most s/2. With respect to the rw-steps, Theorem 1.11 from [4] can be applied with \(c_i=2\rho \) and \(\lambda =s/2\).
Also for the first statement, we follow the ideas from the proof of Lemma 10. In particular, the borders stated in the lemma will be ignored as all considerations are stopped when they are reached. We will apply a potential function and estimate its first and second moment separately with respect to rw-steps and non-rw-steps.
Definition of potential function Our potential function is
which can be considered the continuous analogue of the function g used in the proof of Lemma 10. For \(r>0\) and \(x\le 1/2\), we note that
For better readability, we denote by \(X_t:={p_{t}}\), \(t\ge 0\), the stochastic process obtained by listing the marginal probabilities of the considered bit over time. Let \(Y_t:=g(X_t)\) and \({\varDelta }_t:=Y_{t+1}Y_t\). In the remainder of this proof, we assume \(X_t\le 1/2\); analyses for the case \(X_t>1/2\) are symmetrical by switching the sign of \({\varDelta }_t\). We also assume \(X_t \ge \rho \) as we are only interested in statements before the first point of time where a border is reached. As mentioned, following the structure of the proof of Lemma 10, we now analyze several moments of \({\varDelta }_t\), with the final aim of establishing the Lyapunov condition in Lemma 11.
Analysis of expected change of potential We claim for all \(t\ge 0\) where rw-steps occur (hence, formally we enter the conditional probability space on \(R_t\), the event that an rw-step occurs at time t) that
Moreover, we claim for the unconditional expected value that
For a proof of (16), we exploit the martingale property
that holds in rw-steps of 2\(\hbox {MMAS}_{\text {ib}}\), where there are two possible successor states different from \(X_t\). Since g(x) is a convex function on [0, 1/2], we have by Jensen’s inequality
To bound the expected value from above, we carefully estimate the error introduced by the convexity. Note that
since the integrand is nonincreasing. Analogously,
Inspecting the g-values of the two possible successor states of \(x:=X_t\), we get that
where the third-last inequality estimated \(1-x\le 1\) and used that \(f(z+\rho )-f(z)\le \rho f'(z)\) for any concave, differentiable function f and \(\rho \ge 0\); here using \(f(z)=\sqrt{z}\) and \(z=x-\rho \). The penultimate inequality used \(\rho \le 1/2\). Since the final bound is \(O(\rho /\sqrt{x}) = o(1)\) due to our assumption \(X_t\ge \rho \), we have proved (16).
We now consider the case that a b-step occurs at time t. We are only interested in bounding \(\mathrm {E}\mathord {\left( {\varDelta }_t\mid X_t\right) }\) from below now. Given \(X_t=x\), we have \(X_{t+1}>x\) (which means \({\varDelta }_t<0\)) with probability at most \(1-(1-x)^2 = 1-(1-2x+x^2) \le 2x\). With the remaining probability, \(X_{t+1}<x\). Since \(X_{t+1}\le x+\rho \), we get
Now, since by assumption a b-step occurs with probability at most \(\rho /(4\alpha )\), the unconditional expected value of \({\varDelta }_t\) can be computed using the superposition equality. Combining (16) and (22), we get
since \(x\le 1\), proving (17).
Analysis of variance of change of potential Regarding the variance of \({\varDelta }_t\), we claim that
and, without the condition of having an rw-step,
To prove this, we expand the definition of variance to estimate
since \(\mathrm {E}\mathord {\left( {\varDelta }_t\mid X_t\right) } \ge 0\). We note that for \(X_t=x\), we have \(\mathord {\mathrm {P}}\mathord {\left[ X_{t+1}\ge x\right] } = x\). On \(X_{t+1}\ge x\), we have \({\varDelta }_t<0\), which means \(\mathord {\mathrm {P}}\mathord {\left[ {\varDelta }_t < 0\right] } = x\). Now,
where the penultimate inequality used \(\rho \le x\) and the last one \(x\le 1/2\). Plugging this in, we get
which completes the proof of (24).
By the law of total probability, we get for the unconditional variance that
Since \(\mathord {\mathrm {P}}\mathord {\left[ R_t\right] }\ge 1/2\), we altogether have for the unconditional variance that
as claimed in (25).
Approximating the accumulated change of potential by a Normal distribution The aim is to apply the central limit theorem (Lemma 11) on the sum of the \({\varDelta }_t\). To this end, we will verify the Lyapunov condition for \(\delta =1\) (smaller values could be used but do not give any benefit) and \(t=\omega (1/\rho )\) (which, as \(t=\alpha (s/\rho )^{2}\), holds due to our assumptions \(1/\alpha =o(\rho ^{-1})\) and \(|s|={\varOmega }(1)\)). We compute
where we again have used (18) and the upper bound from (19) with respect to the two outcomes of \(X_{t+1}\). Moreover, we have used the bound \(\mathrm {E}\mathord {\left( {\varDelta }_t\mid X_t\right) }\ge 0\) in the first term and \(\mathrm {E}\mathord {\left( {\varDelta }_t\mid X_t\right) } \le 3\rho /(2\sqrt{x})+\rho /(2\alpha )\) in the second term, which is a crude combination of (21) and (17). As \(\rho \le 1/2\) and \(\rho \le x\) as well as \(\alpha \ge \rho \), the expected value satisfies
where we used \(x\le 1\) and \(x\ge \rho \). Using \(s_t^2:=\sum _{j=0}^{t1}{{\mathrm{Var}}}({\varDelta }_j\mid X_j)\) in the notation of Lemma 11 and using that \({{\mathrm{Var}}}({\varDelta }_j\mid X_j)\ge 1/32\) by (25), we get
which goes to 0 as \(t=\omega (1/\rho )\). This establishes the Lyapunov condition. Hence, for the value \(t:=\alpha (s/\rho )^2\) considered in the lemma, we obtain that \(\frac{Y_{t}\mathrm {E}\mathord {\left( Y_{t}\mid X_0\right) }}{s_t}\) converges in distribution to the normal distribution N(0, 1).
Estimating the accumulated progress Note that \(s_t^2 \ge \alpha (s/\rho )^2/32\) since \({{\mathrm{Var}}}({\varDelta }_t\mid X_t) \ge 1/32\) by (25). Hence, \(s_t \ge \sqrt{\alpha /32}(-s/\rho )\), recalling that \(s<0\). Moreover, as \(x\le 5/6\) is assumed in this part of the lemma, by combining (21) and (17), we get \(\mathrm {E}\mathord {\left( {\varDelta }_t\mid X_t\right) } \ge -\rho /(2\alpha )-\rho \cdot (3/2)\sqrt{6/5} \ge -\rho /(2\alpha ) - 1.7\rho \ge -2.2\rho /\alpha \) and \(\mathrm {E}\mathord {\left( Y_{t}\right) } = Y_0 + \sum _{i=0}^{t-1} \mathrm {E}\mathord {\left( {\varDelta }_i\mid X_i\right) } \ge 0 - t (2.2\rho /\alpha ) \ge -2.2s^2/\rho \). Together, this means \(\frac{\mathrm {E}\mathord {\left( Y_t\right) }}{s_t} \ge \frac{-2.2s^2/\rho }{\sqrt{\alpha /32}(-s/\rho )} \ge \sqrt{155/\alpha }s\ge -\sqrt{155/\alpha }\) since \(-s\le 1\) and \(\alpha \le 1\). By the normalization to N(0, 1), we have that
hence
for any r leading to a positive argument of \({\varPhi }\), where \({\varPhi }\) denotes the cumulative distribution function of the standard normal distribution and \(O(\sqrt{1/(t\rho )})\) the approximation error derived in (27).
We are interested in the event that \(Y_{t}\ge 2\sqrt{-s}/\rho \), recalling that \(s<0\) and \(X_{t+1}\ge X_t \iff Y_{t+1}\le Y_t\). We made this choice because the event \(Y_t = g(X_{t})-g(X_0)\ge 2\sqrt{-s}/\rho \) implies that \(X_{t}-X_0\le s\) by (15).
To compute the probability of the event \(Y_t\ge 2\sqrt{-s}/\rho \), we choose \(r=2\sqrt{-s}/\rho \) and get \(-r\rho /(s\sqrt{\alpha /32})+\sqrt{155/\alpha } \le 24/\sqrt{-s\alpha }\). We get
By Lemma 21,
which means that distance s is bridged (in negative direction) before or at time \(t=\alpha (s/\rho )^2\) with probability at least \(p(\alpha ,s)-O(\sqrt{1/(t\rho )}) = p(\alpha ,s) - O(\sqrt{\rho }/(-s\sqrt{\alpha }))\). \(\square \)
The following lemma shows that a marginal probability of less than \(\rho \) reaches the closer border with at least constant probability, without ever being increased again.
Lemma 17
In the setting of Lemma 16, if \(\min \{p_0,1-p_0\}\le \rho \), the marginal probability will reach the closer border from \(\{1/n,1-1/n\}\) in \(O((\log n)/\rho )\) steps with probability at least \(e^{-2/(1-1/e)}\). This even holds if each step is a b-step.
Proof
We consider only the case \(X_0\le \rho \) as the other case is symmetrical. The idea is to consider \(O(\log n)\) phases and prove that the \(X_t\)-value only decreases throughout all phases with the stated probability. Phase i, where \(i\ge 0\), starts at the first time where \(X_t\le \rho e^{-i}\). Clearly, as \(\rho \le 1\), at the latest in phase \(\ln n\) the border 1/n has been reached. We note that phase i ends after at most \(1/\rho \) steps if all these steps decrease the value; here we use that each decreasing step decreases the value by a factor of \(1-\rho \) and that \((1-\rho )^{1/\rho }\le e^{-1}\).
The probability of decreasing the \(X_t\)-value in a step of phase i is at least \((1-\rho e^{-i})^2 \ge 1-2e^{-i}\rho \), even if the step is a b-step. Hence, the probability of all steps of phase i being decreasing is at least \((1-2e^{-i}\rho )^{1/\rho } \ge e^{-2e^{-i}}\). For all phases together, the probability of only having decreasing steps is still at least
as suggested. \(\square \)
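The argument of Lemma 17 can be mimicked numerically. The sketch below is a simplified model of the process (names and parameters are ours): it estimates the probability that a marginal starting at \(\rho \) is decreased in every step until the border 1/n is reached, where each step decreases p to \((1-\rho )p\) with probability \((1-p)^2\) (both samples are 0, so even a b-step rewards the 0):

```python
import random

def straight_down_probability(rho, n, runs=4000, seed=5):
    """Estimates the probability that a marginal starting at rho decreases in
    every step until it hits the border 1/n.  A step decreases p to (1-rho)*p
    with probability (1-p)^2; any other outcome counts as a failed attempt.
    Lemma 17 lower-bounds this probability by e^{-2/(1-1/e)}.  (Sketch only.)"""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        p = rho
        while True:
            if rng.random() < (1 - p) ** 2:
                p *= 1 - rho              # a decreasing step
                if p <= 1 / n:
                    successes += 1        # border reached without any increase
                    break
            else:
                break                     # an increasing step: attempt failed
    return successes / runs
```

For \(\rho = 0.05\) and n = 1000 the estimate is well above the lemma's bound \(e^{-2/(1-1/e)} \approx 0.042\).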
We have now collected all tools to prove the lower bound for 2\(\hbox {MMAS}_{\text {ib}}\).
Proof of Theorem 9
This follows mostly the same structure as the proof of Theorem 8. Every occurrence of the update strength 1 / K should be replaced by \(\rho \).
There is a minor change in the analysis of rw-steps. The two applications of Lemma 10 are replaced with Lemma 16, followed by an additional application of Lemma 17. The slightly different constants in the statement of Lemma 16 do not affect the asymptotic bound \({\varOmega }(n^{-\beta })\) obtained. Neither does the additional application of Lemma 17, which gives a constant probability. We do not care about the time \(O((\log n)/\rho )\) stated in Lemma 17, since we are only interested in a lower bound on the hitting time.
There is a difference in how b-steps are being handled. While Lemma 10 only considers the accumulated effect of rw-steps (leaving the consideration of b-steps to the proof of Theorem 8), Lemma 16 also includes the effect of b-steps, assuming bounds on the probability of b-steps and on the number of b-steps, respectively. We still have to verify that these assumptions are met.
Lemma 16 requires in its first statement that the probability of a b-step is at most \(\rho /(4\alpha )\). Recall that such a step has probability \(O(1/\sqrt{n})\). We argue that \(\rho /(4\alpha ) \ge c/\sqrt{n}\) for any constant \(c>0\) if \(\kappa \) is small enough. To see this, we simply recall that \(\alpha =\kappa \sqrt{n}\rho /(3s^2)\) by definition and \(s={\varOmega }(1)\).
Finally, the second statement of Lemma 16 restricts the number of b-steps until time \(\alpha (s/\rho )^2\) to at most \(s/(2\alpha \rho )\). Reusing that \(\rho =O(\alpha /(\kappa \sqrt{n}))\), this holds by Chernoff bounds with high probability if \(\kappa \) is a sufficiently small constant. Hence, the application of the lemma is possible. \(\square \)
6 Conclusions
We have performed a runtime analysis of two probabilistic model-building Genetic Algorithms, namely cGA and 2\(\hbox {MMAS}_{\text {ib}}\), on OneMax. The expected runtime of these algorithms was analyzed in dependency of the so-called update strength \(S=1/K\) and \(S=\rho \), respectively, resulting in the upper bound \(O(\sqrt{n}/S)\) for \(S=O(1/(\sqrt{n}\log n))\) and the lower bound \({\varOmega }(\sqrt{n}/S+n\log n)\). Hence, \(S\sim 1/(\sqrt{n}\log n)\) was identified as the choice for the update strength leading to the asymptotically smallest expected runtime \({\varTheta }(n\log n)\).
Our analyses of update strength reveal a general trade-off between the speed of learning and genetic drift. High update strengths imply a globally fast adaptation of the probabilistic model but negatively impact the overall correctness of the model, resulting in an increased risk of adapting to samples that are locally incorrect. We think that this constitutes a universal limitation of the algorithms that extends to more general classes of functions. As even on the simple OneMax the update strength should not be bigger than \(1/(\sqrt{n}\log n)\), we propose this setting as a general rule of thumb.
Our analyses have developed a quite technical machinery for the analysis of genetic drift. These techniques are not necessarily limited to cGA and 2\(\hbox {MMAS}_{\text {ib}}\) on OneMax. Very recently, they have been used in [19] to analyze the so-called UMDA, which is a more complicated EDA. We also believe that the techniques will lead to improved results for classical Genetic Algorithms such as the simple Genetic Algorithm [27], where currently only quite restricted lower bounds on the runtime are available.
Notes
The 2\(\hbox {MMAS}_{\text {ib}}\) in [26] used a randomized tie-breaking rule that swaps x and y with probability 1/2 if \(f(x)=f(y)\). We omit this swap to ease presentation without changing the stochastic behavior; namely, conditioning on creating two specific samples x and y, where \(x\ne y\), in one of the two possible orders, the probability of sampling x first is 1/2 due to the independence of the trials.
The term “drift” is used in both “genetic drift” and in “drift analysis.” In the latter, “drift” is used to indicate the expected progress towards a target. We sometimes use the term “stochastic drift” to distinguish it from “genetic drift”. Drift theorems always refer to stochastic drift.
To apply Theorem 19 we will again consider a slightly modified process, where potential values \(0< \varphi < 10000 \) are being merged with state 0.
References
Baillon, J.B., Cominetti, R., Vaisman, J.: A sharp uniform bound for the distribution of sums of Bernoulli trials. Comb. Probab. Comput. 25, 352–361 (2016)
Chen, T., Lehre, P.K., Tang, K., Yao, X.: When is an estimation of distribution algorithm better than an evolutionary algorithm? In: Proceedings of the IEEE Congress on Evolutionary Computation. IEEE Press, pp. 1470–1477 (2009)
Dang, D., Lehre, P.K.: Simplified runtime analysis of estimation of distribution algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 513–518 (2015)
Doerr, B.: Analyzing randomized search heuristics: tools from probability theory. In: Auger, A., Doerr, B. (eds.) Theory of Randomized Search Heuristics. World Scientific, Singapore (2011)
Doerr, B., Johannsen, D., Winzen, C.: Drift analysis and linear functions revisited. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1967–1974 (2010)
Doerr, C., Lengler, J.: OneMax in blackbox models with several restrictions. In: Proceedings of the Genetic and Evolutionary Computation Conference. ACM Press, pp. 1431–1438 (2015)
Droste, S.: A rigorous analysis of the compact genetic algorithm for linear functions. Nat. Comput. 5(3), 257–283 (2006)
Droste, S., Jansen, T., Wegener, I.: Upper and lower bounds for randomized search heuristics in blackbox optimization. Theory Comput. Syst. 39, 525–544 (2006)
Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, New York (1968)
Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 2. Wiley, New York (1971)
Friedrich, T., Kötzing, T., Krejca, M.S.: EDAs cannot be balanced and stable. In: Proceedings of GECCO’16, pp. 1139–1146 (2016)
Friedrich, T., Kötzing, T., Krejca, M.S., Sutton, A.M.: The benefit of recombination in noisy evolutionary search. In: Proceedings of the 26th International Symposium on Algorithms and Computation. Springer, pp. 140–150 (2015)
Friedrich, T., Kötzing, T., Krejca, M.S., Sutton, A.M.: The compact genetic algorithm is efficient under extreme Gaussian noise. IEEE Trans. Evol. Comput. 21(3), 477–490 (2017)
Gleser, L.J.: On the distribution of the number of successes in independent trials. Ann. Probab. 3(1), 182–188 (1975)
Harik, G.R., Lobo, F.G., Goldberg, D.E.: The compact genetic algorithm. IEEE Trans. Evol. Comput. 3(4), 287–297 (1999)
Hauschild, M., Pelikan, M.: An introduction and survey of estimation of distribution algorithms. Swarm Evol. Comput. 1(3), 111–128 (2011)
Johannsen, D.: Random Combinatorial Structures and Randomized Search Heuristics. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, Germany and the MaxPlanckInstitut für Informatik (2010)
Krejca, M., Witt, C.: Lower bounds on the run time of the univariate marginal distribution algorithm on OneMax. Theor. Comput. Sci. (2018, to appear); preprint at https://doi.org/10.1016/j.tcs.2018.06.004
Krejca, M.S., Witt, C.: Lower bounds on the run time of the univariate marginal distribution algorithm on OneMax. In: Proceedings of FOGA 2017. ACM Press, pp. 65–79 (2017)
Lehre, P.K., Nguyen, P.T.H.: Improved runtime bounds for the univariate marginal distribution algorithm via anticoncentration. In: Proceedings of GECCO’17. ACM Press, pp. 414–434 (2017)
Lehre, P. K., Witt, C.: Concentrated hitting times of randomized search heuristics with variable drift. In: Proceedings of the 25th International Symposium on Algorithms and Computation, vol. 8889 of Lecture Notes in Computer Science. Springer, pp. 686–697 (2014). Extended version at arXiv:1307.2559
Lehre, P.K., Witt, C.: General drift analysis with tail bounds. arXiv e-prints (2017). arXiv:1307.2559
Marshall, A.W., Olkin, I., Arnold, B.C.: Inequalities: Theory of Majorization and Its Applications, 2nd edn. Springer, Berlin (2011)
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995)
Neumann, F., Sudholt, D., Witt, C.: Analysis of different MMAS ACO algorithms on unimodal functions and plateaus. Swarm Intell. 3(1), 35–68 (2009)
Neumann, F., Sudholt, D., Witt, C.: A few ants are enough: ACO with iteration-best update. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 63–70 (2010)
Oliveto, P.S., Witt, C.: Improved time complexity analysis of the simple genetic algorithm. Theor. Comput. Sci. 605, 21–41 (2015)
Rowe, J.E., Sudholt, D.: The choice of the offspring population size in the (1, \(\lambda \)) evolutionary algorithm. Theor. Comput. Sci. 545, 20–38 (2014)
Stützle, T., Hoos, H.H.: MAX–MIN Ant System. Future Gener. Comput. Syst. 16, 889–914 (2000)
Sudholt, D.: A new method for lower bounds on the running time of evolutionary algorithms. IEEE Trans. Evol. Comput. 17(3), 418–435 (2013)
Sudholt, D., Witt, C.: Update strength in EDAs and ACO: how to avoid genetic drift. In: Proceedings of the Genetic and Evolutionary Computation Conference, New York, NY, USA. ACM, pp. 61–68 (2016)
Witt, C.: Tight bounds on the optimization time of a randomized search heuristic on linear functions. Comb. Probab. Comput. 22(2), 294–318 (2013)
Witt, C.: Upper bounds on the running time of the univariate marginal distribution algorithm on OneMax. Algorithmica (2018, to appear); preprint at https://doi.org/10.1007/s00453-018-0463-0
Acknowledgements
This research was initiated at Dagstuhl seminar 15211 “Theory of Evolutionary Algorithms” and also benefitted from Dagstuhl seminars 16011 “Evolution and Computing” and 17191 “Theory of Randomized Optimization Heuristics”. The authors thank the organisers and participants of all three seminars. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 618091 (SAGE) and from the Danish Research Council (DFF-FNU) under Grant 4002-00542. This article is based upon work from COST Action CA15140 ‘Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice (ImAppNIO)’ supported by COST (European Cooperation in Science and Technology).
An extended abstract of this article with parts of the results was presented at GECCO’16 [31].
A General Tools
A.1 Drift Theorems
The term variable drift analysis was coined by Johannsen [17] for stochastic processes on non-negative real values whose expected progress towards an absorbing target state 0 can be bounded from below by a positive, monotonically increasing function h. His variable drift theorem was subsequently refined and generalized (see also [28] for a broader class of functions h). The following variant is due to Lehre and Witt [22], who allow variable drift over continuous state spaces.
Theorem 18
(Variable drift, upper bound; Theorem 16 in [22]). Let \((X_t)_{t \in \mathbb {N}_0}\), be a stochastic process over some state space \(S \subseteq \{0\}\cup [x_{\min }, x_{\max }]\), adapted to a filtration \((\mathcal {F}_t)_{t\in \mathbb {N}_0}\), where \(x_{\min }> 0\). Let \(h :[x_{\min }, x_{\max }] \rightarrow \mathbb {R}^+\) be a monotone increasing function such that 1/h(x) is integrable on \([x_{\min }, x_{\max }]\) and \(\mathrm {E}\mathord {\left( X_t - X_{t+1} \mid \mathcal {F}_t\right) } \ge h(X_t)\) if \(X_t \ge x_{\min }\). Then it holds for the first hitting time \(T := \min \{t \mid X_t = 0\}\) that
$$\begin{aligned} \mathrm {E}\mathord {\left( T \mid X_0\right) } \le \frac{x_{\min }}{h(x_{\min })} + \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x. \end{aligned}$$
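As an illustration of how the theorem is applied (a hypothetical example, not taken from the paper): consider a process that halves in expectation, \(X_{t+1} = U_t X_t\) with \(U_t\) uniform on [0, 1], absorbed at 0 once it drops below \(x_{\min } = 1\). Its drift is \(\mathrm {E}(X_t - X_{t+1} \mid \mathcal {F}_t) \ge X_t/2\), so \(h(x) = x/2\) yields \(\mathrm {E}(T \mid X_0) \le 2 + 2\ln X_0\). A short simulation checks this bound:

```python
import math
import random

def hitting_time(x0: float, x_min: float, rng: random.Random) -> int:
    """Simulate X_{t+1} = U * X_t with U uniform on [0, 1];
    states below x_min are absorbed at 0. Returns the hitting time of 0."""
    x, t = x0, 0
    while x > 0:
        x = rng.random() * x
        if x < x_min:
            x = 0.0
        t += 1
    return t

rng = random.Random(42)
x0, trials = 1000.0, 2000
mean_t = sum(hitting_time(x0, 1.0, rng) for _ in range(trials)) / trials

# Variable drift bound with h(x) = x/2:
# E(T) <= x_min/h(x_min) + integral_1^{x0} 2/x dx = 2 + 2*ln(x0)
bound = 2 + 2 * math.log(x0)
print(f"empirical mean {mean_t:.2f}, drift bound {bound:.2f}")
```

The empirical mean stays well below the drift bound, as the theorem guarantees.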
The next theorem provides tail bounds on the first hitting time of processes exhibiting variable drift.
Theorem 19
(Tail bounds for variable drift [21], see also Th. 4 in [22]). Let \((X_t)_{t\in \mathbb {N}_0}\), be a stochastic process, adapted to a filtration \((\mathcal {F}_t)_{t\in \mathbb {N}_0}\), over some state space \(S\subseteq \{0\}\cup [x_{\min },x_{\max }]\), where \(x_{\min }\ge 0\). Let \(h:[x_{\min },x_{\max }]\rightarrow \mathbb {R}^+\) be a function such that 1 / h(x) is integrable on \([x_{\min },x_{\max }]\). Suppose there exist a random variable Z and some \(\lambda >0\) such that \(\int _{X_{t+1}}^{X_t} 1/h(\max \{x,x_{\min }\})\,\mathrm {d}x\prec Z\) for \(X_{t}\ge x_{\min }\) and \(E(e^{\lambda Z}) = D\) for some \(D>0\). Then the following two statements hold for the first hitting time \(T:=\min \{t\mid X_t=0\}\).

(i)
If \(\mathrm{E}(X_t - X_{t+1} \mid {\mathcal {F}}_t ; X_t\ge x_{\min }) \ge h(X_t)\) then for any \(\delta >0\), and \(\eta :=\min \{\lambda , \delta \lambda ^2/(D-1-\lambda )\}\) and \(t>0\) it holds that
$$\begin{aligned} \mathord {\mathrm {P}}\mathord {\left[ T>t \mid X_0\right] } \le \exp \left( \eta \left( \frac{x_{\min }}{h(x_{\min })}+\int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x-(1-\delta )t\right) \right) . \end{aligned}$$ 
(ii)
If \(\mathrm{E}(X_t - X_{t+1} \mid \mathcal {F}_t ; X_t\ge x_{\min }) \le h(X_t)\) then for any \(\delta >0\), \(\eta :=\min \{\lambda , \delta \lambda ^2/(D-1-\lambda )\}\) and \(t>0\) it holds
$$\begin{aligned}&\mathord {\mathrm {P}}\mathord {\left[ T < t \mid X_0\right] } \\&\quad \le \; \exp \left( \eta \left( (1+\delta )t - \frac{x_{\min }}{h(x_{\min })} - \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x\right) \right) \cdot \frac{1}{\eta (1+\delta )}. \end{aligned}$$If state 0 is absorbing then
$$ \mathord {\mathrm {P}}\mathord {\left[ T < t \mid X_0\right] } \le \exp \left( \eta \left( (1+\delta )t - \frac{x_{\min }}{h(x_{\min })} - \int _{x_{\min }}^{X_0} \frac{1}{h(x)} \,\mathrm {d}x\right) \right) . $$
Finally, we will need the following theorem concerned with drift away from the target. It is taken from [27].
Theorem 20
(Negative Drift with Scaling (Theorem 2 in [27])). Let \((X_t)_{t\in \mathbb {N}_0}\), be a stochastic process, adapted to a filtration \((\mathcal {F}_t)_{t\in \mathbb {N}_0}\), over some state space \(S\subseteq \mathbb {R}_0^+\). Suppose there exist an interval \([a,b]\subseteq \mathbb {R}\) and, possibly depending on \(\ell :=b-a\), a drift bound \(\varepsilon :=\varepsilon (\ell )>0\) as well as a scaling factor \(r:=r(\ell )\) such that for all \(t\ge 0\) the following three conditions hold:

1.
\(\mathrm {E}\mathord {\left( X_{t+1}-X_{t}\mid \mathcal {F}_t\,;\, a< X_t <b\right) } \ge \varepsilon \),

2.
\(\mathord {\mathrm {P}}\mathord {\left[ |X_{t+1}-X_t|\ge jr \mid \mathcal {F}_t\,;\, a< X_t\right] } \le e^{-j}\) for all \(j\in \mathbb {N}_0\),

3.
\(1\le r^2 \le \varepsilon \ell /(132\log (r/\varepsilon ))\).
Then for the first hitting time \(T^*:=\min \{t\ge 0:X_t\le a \mid X_0\ge b\}\) it holds that \(\mathord {\mathrm {P}}\mathord {\left[ T^*\le e^{\varepsilon \ell /(132r^2)}\right] }= O(e^{-\varepsilon \ell /(132r^2)})\).
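The theorem's message can be seen in a minimal hypothetical sketch (parameters chosen for illustration only; this does not verify the exact constants in conditions 2 and 3): a ±1 random walk with upward bias \(p = 0.6\) has constant drift \(\varepsilon = 0.2\) away from the lower boundary, so starting at \(b = 40\) it is exponentially unlikely to ever fall to \(a = 0\):

```python
import random

def hits_lower_bound(a: int, b: int, p_up: float, steps: int,
                     rng: random.Random) -> bool:
    """Run a +1/-1 random walk with upward bias p_up starting at b;
    report whether it ever drops to a within the given number of steps."""
    x = b
    for _ in range(steps):
        x += 1 if rng.random() < p_up else -1
        if x <= a:
            return True
    return False

rng = random.Random(7)
trials = 500
hits = sum(hits_lower_bound(a=0, b=40, p_up=0.6, steps=5000, rng=rng)
           for _ in range(trials))
print(f"{hits} of {trials} runs reached the lower boundary")
```

With drift 0.2 over a distance of 40, the probability of ever reaching the lower boundary is roughly \((0.4/0.6)^{40} \approx 10^{-7}\), so no run is expected to hit it.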
A.2 Bounds on the Cumulative Distribution Function of the Standard Normal Distribution
To prove Lemmas 10 and 16, we need the following estimates for \({\varPhi }(x)\). More precise formulas are available (and can be found by searching for bounds on the so-called error function), but are not required for our analysis.
Lemma 21
([9], p. 175). For any \(x>0\)
$$\begin{aligned} \left( \frac{1}{x} - \frac{1}{x^3}\right) \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi }}< 1 - {\varPhi }(x) < \frac{1}{x} \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi }} \end{aligned}$$
and for \(x<0\)
$$\begin{aligned} \left( \frac{1}{|x|} - \frac{1}{|x|^3}\right) \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi }}< {\varPhi }(x) < \frac{1}{|x|} \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi }}. \end{aligned}$$
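These estimates can be checked numerically. The sketch below assumes Feller's classical form of the bounds, \((1/x - 1/x^3)\varphi (x)< 1 - {\varPhi }(x) < \varphi (x)/x\) for \(x > 0\) with \(\varphi \) the standard normal density, and tests it against \({\varPhi }\) computed via the complementary error function:

```python
import math

def phi_density(x: float) -> float:
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x: float) -> float:
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# Feller's estimate: for x > 0,
#   (1/x - 1/x**3) * phi(x) < 1 - Phi(x) < phi(x) / x
for x in [1.5, 2.0, 3.0, 5.0]:
    lower = (1 / x - 1 / x**3) * phi_density(x)
    upper = phi_density(x) / x
    tail = 1 - Phi(x)
    assert lower < tail < upper, (x, lower, tail, upper)
print("bounds verified")
```

Both bounds are tight for large x, where they differ only by a factor of \(1 - 1/x^2\).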
A.3 A Bound for Poisson Binomial Distributions
Theorem 22
(Adapted from Theorem 2.1 in [1]). Let \(S_n = X_1 + \cdots + X_n\) denote a sum of independent Bernoulli trials where \(\mathord {\mathrm {P}}\mathord {\left[ X_i = 1\right] } = p_i\). Then for every \(0 \le j \le n\)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Sudholt, D., Witt, C. On the Choice of the Update Strength in Estimation-of-Distribution Algorithms and Ant Colony Optimization. Algorithmica 81, 1450–1489 (2019). https://doi.org/10.1007/s00453-018-0480-z
Keywords
 Ant colony optimization
 Estimation-of-distribution algorithms
 Genetic Algorithms
 Probabilistic model-building Genetic Algorithms
 Runtime analysis
 Theory of randomized search heuristics