Runtime Analyses of the Population-Based Univariate Estimation of Distribution Algorithms on LeadingOnes

We perform rigorous runtime analyses for the univariate marginal distribution algorithm (UMDA) and the population-based incremental learning (PBIL) algorithm on LeadingOnes. For the UMDA, the currently known expected runtime on the function is $\mathcal{O}(n\lambda\log\lambda + n^2)$ under an offspring population size $\lambda = \Omega(\log n)$ and a parent population size $\mu \le \lambda/(e(1+\delta))$ for any constant $\delta > 0$ (Dang and Lehre, GECCO 2015). There is no lower bound on the expected runtime under the same parameter settings. It also remains unknown whether the algorithm can still optimise the LeadingOnes function within a polynomial runtime when $\mu \ge \lambda/(e(1+\delta))$.
In case of the PBIL, an expected runtime of $\mathcal{O}(n^{2+c})$ holds for some constant $c \in (0,1)$ (Wu, Kolonko and Möhring, IEEE TEVC 2017). Although the PBIL is a generalisation of the UMDA, this upper bound is asymptotically much looser than the upper bound of $\mathcal{O}(n^2)$ of the UMDA for $\lambda = \Omega(\log n) \cap \mathcal{O}(n/\log n)$. Furthermore, the required population size is very large, i.e., $\lambda = \Omega(n^{1+c})$.
Our contributions are then threefold: (1) we show that the UMDA with $\mu = \Omega(\log n)$ and $\lambda \le \mu e^{1-\varepsilon}/(1+\delta)$ for any constants $\varepsilon \in (0,1)$ and $0 < \delta \le e^{1-\varepsilon}-1$ requires an expected runtime of $e^{\Omega(\mu)}$ on LeadingOnes, (2) an upper bound of $\mathcal{O}(n\lambda\log\lambda + n^2)$ is shown for the PBIL, which improves the current bound $\mathcal{O}(n^{2+c})$ by a significant factor of $\Theta(n^{c})$, and (3) for the first time, we consider the two algorithms on the LeadingOnes function in a noisy environment and obtain an expected runtime of $\mathcal{O}(n^2)$ for appropriate parameter settings. Our results emphasise that despite the independence assumption in the probabilistic models, the UMDA and the PBIL with fine-tuned parameter choices can still cope very well with variable interactions.


Motivations
Estimation of distribution algorithms (EDAs) [24,38,40] are randomised search heuristics that look for optimal solutions by building and sampling from probabilistic models. They are known by various other names, including probabilistic model-building genetic algorithms [40] or iterated density estimation algorithms [2]. Unlike traditional evolutionary algorithms (EAs), which use standard genetic operators such as mutation and crossover to create variation, EDAs generate it via model building and model sampling. The workflow of EDAs is an iterative process. The starting model is a uniform distribution over the search space, from which an initial population of $\lambda$ individuals is sampled. The algorithm ranks individuals according to a fitness function and selects the $\mu \le \lambda$ fittest individuals to update the model. The procedure is repeated many times and terminates when a threshold on the number of iterations is exceeded or a solution of good quality is obtained [13,20]. We call the parameter $\lambda$ the offspring population size, while the parameter $\mu$ is known as the parent population size of the algorithm.
Several EDAs have been proposed over the last decades. They differ in how they learn the variable interplay and build/update the probabilistic models over iterations. In general, EDAs can be categorised into two main classes: univariate and multivariate. Univariate EDAs assume variable independence and usually represent the model as a probability vector (each component is a marginal), encoding a product distribution from which individuals are sampled independently and identically. Typical EDAs in this class are the univariate marginal distribution algorithm (UMDA [38]), the compact genetic algorithm (cGA [19]) and the population-based incremental learning (PBIL [1]). Some ant colony optimisation algorithms like the $\lambda$-MMAS [42] can also be cast into this framework (also called $n$-Bernoulli-$\lambda$-EDA [15]). In contrast, multivariate EDAs apply statistics of order two or more to capture the underlying structures of the addressed problems. This paper focuses on univariate EDAs on discrete optimisation, and for that reason we refer the interested readers to [20,23] for other EDAs on a continuous domain.
The UMDA is probably the most famous univariate EDA. In each iteration, the algorithm updates the marginals to the empirical frequencies of 1s sampled among the $\mu$ fittest individuals. In 2015, Dang and Lehre [5], via the level-based theorem [3], obtained an upper bound of $\mathcal{O}(n\lambda\log\lambda + n^2)$ on the expected runtime of the algorithm on the LeadingOnes function when the offspring population size is $\lambda = \Omega(\log n)$ and the parent population size is $\mu \le \lambda/(e(1+\delta))$ for any constant $\delta > 0$. For $\lambda = \Omega(\log n) \cap \mathcal{O}(n/\log n)$, the above bound becomes $\mathcal{O}(n^2)$. Under a selective pressure $\mu/\lambda \ge (1+\delta)/e$, it is still unknown whether the UMDA can optimise the function in polynomial expected runtime. Furthermore, we are also missing a lower bound on the expected runtime, which is necessary to understand how the algorithm copes with variable dependencies.
Another univariate EDA is the PBIL [1], a generalisation of the UMDA, which updates the marginals using a convex combination with a smoothing parameter $\eta \in (0,1]$ between the current marginals and the empirical frequencies of 1s sampled among the $\mu$ fittest individuals (also called incremental learning). Unlike the UMDA, runtime results for the PBIL are very limited. The only rigorous analysis on test functions has been published recently in [47], where the authors argued that the algorithm with a sufficiently large population size can avoid making wrong decisions early even when the smoothing parameter is large. They also showed an expected runtime of $\mathcal{O}(n\lambda) = \mathcal{O}(n^{2+c})$ on the LeadingOnes function for some small constant $c \in (0,1)$. And yet the required offspring population size still remains large, i.e., $\lambda = \Omega(n^{1+c})$ [47]. It remains open whether a tighter upper bound can be obtained for the PBIL on the LeadingOnes function. The answer to this question is of special interest because it might be considered as the first step towards showing the substantial advantage of incremental learning over the update mechanism used by the UMDA. Furthermore, more bounds on the expected runtime of the PBIL on test functions with well-known structures may shed light on the behaviour of the algorithm on other problems, especially those with a separably additive decomposition property [37], because many sub-functions may have fitness landscapes that resemble those of test functions, and in these situations the behaviour of the algorithm can be quickly deduced.
See Table 1 for a summary of the latest runtime results of the UMDA and the PBIL on the LeadingOnes function.

Our Contributions
The contributions of this paper are threefold.
1. We analyse the expected runtime of the UMDA. Together with previous results [5,6], our results provide a clearer picture of the runtime of the algorithm on the LeadingOnes function. We show that under a low selective pressure the algorithm fails to optimise the function in polynomial expected runtime. This result essentially reveals the limitations of probabilistic models based on probability vectors, as the algorithm hardly stays in promising states when the selective pressure is not high enough, while the optimum cannot be sampled with high probability. On the other hand, when the selective pressure is sufficiently high, we obtain a lower bound of $\Omega(n\lambda/\log\lambda)$ on the expected runtime under any offspring population size $\lambda = \Omega(\log n)$.
2. We obtain an expected runtime of $\mathcal{O}(n\lambda\log\lambda + n^2)$ for the PBIL on the LeadingOnes function under any population size $\lambda = \Omega(\log n)$. For $\lambda = \mathcal{O}(n/\log n)$, the runtime bound becomes $\mathcal{O}(n^2)$, making it comparable to the performance of the class of unary unbiased black-box algorithms in the sense of Lehre and Witt [31], a general framework covering many well-known randomised search heuristics in evolutionary computation. More importantly, the new upper bound improves the previously best known upper bound of $\mathcal{O}(n^{2+c})$ [47] by a factor of $\Theta(n^{c})$ for some constant $c \in (0,1)$. Our bound only requires a population of size $\lambda = \Omega(\log n)$ as opposed to $\lambda = \Omega(n^{1+c})$ as in [47]. To do this, we make use of the level-based theorem [3] with some additional arguments. By taking advantage of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality [33], we observe that with high probability, the empirical frequency does not deviate far from the model probability. We believe that it is the first time that the DKW inequality has been used in the runtime analysis of model-based algorithms.
3. We introduce noise to the LeadingOnes function, where a single bit is flipped before evaluating the fitness with a constant probability $p \in (0,1)$ (also called prior noise). We note that the same noise model was first considered in [11,41] for the (1 + 1) EA and in [4,17] for population-based EAs. We show that if the selective pressure $\mu/\lambda$ is sufficiently high, the algorithms optimise the noisy LeadingOnes function within an expected runtime of $\mathcal{O}(n^2 + n\lambda\log\lambda)$. To the best of our knowledge, this is also the first time that the UMDA and the PBIL are rigorously studied in a noisy environment, while the cGA is already considered in [16] under Gaussian posterior noise. Despite the simplicity of the noise model considered, this can be viewed as the first step towards understanding the behaviour of these EDAs in a noisy environment.

Outline of the Paper
The paper is structured as follows. Section 2 introduces the studied algorithms and general tools used in the paper. Section 3 provides a detailed analysis of the exponential expected runtime of the UMDA on the LeadingOnes function in case of low selective pressure, followed by the analysis under a high selective pressure in Sect. 4. We also present in this section an improved upper bound on the expected runtime of the PBIL on LeadingOnes. In Sect. 5, we consider the LeadingOnes function under a prior noise model and obtain an upper bound of $\mathcal{O}(n^2)$ on the expected runtime for appropriate parameter settings. Section 6 presents an empirical study to complement the theoretical results derived earlier. The paper ends in Sect. 7, where we give our concluding remarks and discuss potential future work.

Preliminaries
We first recall that a random variable $Y$ is said to follow a Bernoulli distribution with success probability $p \in [0,1]$, denoted as $Y \sim \mathrm{Ber}(p)$, if and only if $\Pr(Y = 1) = p$ and $\Pr(Y = 0) = 1 - p$ [36, p. 445]. If there are $n \in \mathbb{N}$ such independent random variables (with the same success probability $p$), then their sum $X$ follows a binomial distribution with $n$ trials and success probability $p$, denoted as $X \sim \mathrm{Bin}(n, p)$ [36, p. 445]. An extension of the binomial distribution is the Poisson binomial distribution, in which each of the $n$ random variables can have a different success probability [14, p. 263]. More formally, we write $X \sim \mathrm{PB}(p_1, p_2, \ldots, p_n)$.

The Studied Fitness Function
In evolutionary computing, we represent a solution to an optimisation problem as a bitstring (or an individual) $x = (x_1, x_2, \ldots, x_n)$. We consider in the paper the problem of maximising the number of leading 1s in a bitstring. The fitness value of such a bitstring can be obtained by
$$\text{LeadingOnes}(x) := \sum_{i=1}^{n} \prod_{j=1}^{i} x_j. \tag{1}$$
This is a unimodal function with a maximum fitness value of $n$ when the input is the all-ones bitstring (i.e., the global optimum). In essence, the bits in this particular function are highly correlated, so it is often used to study the ability of EDAs to cope with variable dependency [22]. We call $n$ the problem instance size and $\mathcal{X} = \{0,1\}^n$ the finite binary search space consisting of all bitstrings of length $n$.
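As a small illustration of the definition, the following Python sketch evaluates LeadingOnes in two ways: by scanning for the first 0, and via the sum-of-products form $\sum_{i=1}^{n}\prod_{j=1}^{i} x_j$. The function names are illustrative only and do not come from the paper.

```python
from typing import Sequence


def leading_ones(x: Sequence[int]) -> int:
    """Count the consecutive 1s at the start of the bitstring x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def leading_ones_sum_of_products(x: Sequence[int]) -> int:
    """Same value, computed as the sum over i of the product x_1 * ... * x_i."""
    total, prefix_product = 0, 1
    for bit in x:
        prefix_product *= bit      # becomes 0 at the first 0 and stays 0
        total += prefix_product
    return total
```

Both functions agree on every bitstring; the first simply stops early, which is how the function is usually evaluated in practice.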

Population-Based Univariate EDAs
The UMDA, defined in Algorithm 1, maintains a probabilistic model that is represented as an $n$-vector $p_t := (p_{t,1}, \ldots, p_{t,n})$, where each so-called marginal $p_{t,i} \in [0,1]$ for $i \in [n]$ is the probability of sampling a one at the $i$-th bit position in the offspring. The probability of sampling a particular individual $x = (x_1, x_2, \ldots, x_n) \in \mathcal{X}$ from the given probability vector $p_t$ equals
$$\Pr(x \mid p_t) = \prod_{i=1}^{n} p_{t,i}^{x_i} (1 - p_{t,i})^{1 - x_i}. \tag{2}$$
We often call the distribution defined in Eq. 2 a product distribution. The starting model is the uniform distribution $p_0 := (1/2, \ldots, 1/2)$. The algorithm, in each iteration $t \in \mathbb{N}$, samples an offspring population of $\lambda$ individuals, denoted as $P_t$, and sorts them in descending order according to fitness to obtain a sorted population $\tilde{P}_t$. A parent population consisting of the $\mu$ fittest individuals in $\tilde{P}_t$ participates in the update of the probabilistic model. Let $x^{(j)}_{t,i}$ denote the value sampled at bit position $i$ in the $j$-th individual in the offspring population $P_t$ (and analogously $\tilde{x}^{(j)}_{t,i}$ for the sorted population $\tilde{P}_t$). Then, $X_{t,i} := \sum_{j=1}^{\mu} \tilde{x}^{(j)}_{t,i}$ is the number of 1s sampled at bit position $i \in [n]$ across the parent population. The algorithm first sets each marginal to the value $X_{t,i}/\mu$ and then adjusts them to be within the interval $[1/n, 1 - 1/n]$, where the two values, $1/n$ and $1 - 1/n$, are called the lower and upper borders (or margins), respectively. In summary, the updating process can be written as follows.
$$p_{t+1,i} := \max\{1/n, \min\{1 - 1/n, X_{t,i}/\mu\}\}. \tag{3}$$
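Furthermore, the ratio $\mu/\lambda \in (0,1]$ is known as the selective pressure of the algorithm. The whole procedure is repeated until some termination condition has been fulfilled. Some common choices are a threshold on the number of iterations allowed to run or a lower bound on the fitness quality of the fittest individual in the population. However, for theoretical analysis, we halt the algorithm only after a global optimum has been found for the first time.
Algorithm 1: UMDA with an offspring population size $\lambda$ and a parent population size $\mu$ for the maximisation of a function $f(x)$ where $x$ is of length $n$.
  $t \leftarrow 0$; initialise $p_0 \leftarrow (1/2, \ldots, 1/2)$
  repeat
    for $j = 1, \ldots, \lambda$ do sample the offspring $x^{(j)}_t$ from the product distribution $p_t$
    obtain $\tilde{P}_t$ by sorting $P_t$ in descending order of fitness, where ties are broken uniformly at random
    for $i = 1, \ldots, n$ do $X_{t,i} \leftarrow \sum_{j=1}^{\mu} \tilde{x}^{(j)}_{t,i}$ and $p_{t+1,i} \leftarrow \max\{1/n, \min\{1 - 1/n, X_{t,i}/\mu\}\}$
    $t \leftarrow t + 1$
  until termination condition is fulfilled
A generalisation of the UMDA is the PBIL. While most operations in the PBIL are similar to those of the UMDA, the algorithm makes use of a new smoothing parameter $\eta \in (0,1]$ and updates the model via a convex combination as follows.
$$p_{t+1,i} := \max\{1/n, \min\{1 - 1/n, (1-\eta)\, p_{t,i} + \eta\, X_{t,i}/\mu\}\}. \tag{4}$$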
In essence, the PBIL takes into account the current marginals when updating the probabilistic model. We also note that the UMDA is the PBIL with the maximum smoothing parameter $\eta = 1$.
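The description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `pbil`, `lam`, `mu`, `eta`, `max_iters` and `seed` are hypothetical, the iteration budget `max_iters` is added only so that the loop always terminates, and the run halts as soon as the all-ones bitstring is sampled. Setting `eta = 1.0` gives the UMDA update.

```python
import random


def leading_ones(x):
    """Number of consecutive 1s at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def pbil(n, lam, mu, eta=1.0, max_iters=10_000, seed=0):
    """Sketch of the PBIL with borders on LeadingOnes (eta = 1 is the UMDA).

    Returns (t, p): the iteration in which the optimum was first sampled
    (None if the budget ran out) and the current marginals.
    """
    rng = random.Random(seed)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n
    p = [0.5] * n                                  # uniform starting model
    for t in range(max_iters):
        # Sample lam offspring from the product distribution p.
        pop = [[1 if rng.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]
        # Shuffle, then stable-sort: ties are broken uniformly at random.
        rng.shuffle(pop)
        pop.sort(key=leading_ones, reverse=True)
        if leading_ones(pop[0]) == n:
            return t, p
        # Convex combination of old marginal and empirical frequency,
        # then clamp to the borders [1/n, 1 - 1/n].
        for i in range(n):
            freq = sum(ind[i] for ind in pop[:mu]) / mu
            p[i] = max(lower, min(upper, (1.0 - eta) * p[i] + eta * freq))
    return None, p
```

For instance, `pbil(n=10, lam=200, mu=20)` runs the UMDA with selective pressure $\mu/\lambda = 0.1$, and `pbil(n=10, lam=200, mu=20, eta=0.5)` runs the PBIL with a smaller smoothing parameter.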

Level-Based Analysis
First proposed in [25], the level-based theorem is a general tool that provides upper bounds on the expected runtime of many non-elitist population-based algorithms on a wide range of optimisation problems [3,5,6,8,26,27,29]. The theorem assumes that the studied algorithm can be cast into the framework in Algorithm 2, which maintains a population $P_t \in \mathcal{X}^{\lambda}$, where $\mathcal{X}^{\lambda}$ is the space of all populations of size $\lambda$. We write $P_{t,i}$ to denote the $i$-th individual in the population $P_t$. The theorem also assumes the existence of a mapping $D$ from the space of populations $\mathcal{X}^{\lambda}$ to the space of probability distributions over the search space. In iteration $t$, the mapping $D$ depends only on the population $P_t$ and is involved in the production of a new population for the next iteration [3].
Algorithm 2: Non-elitist population-based algorithm.
  $t \leftarrow 0$; create initial population $P_0$
  repeat
    for $i = 1, \ldots, \lambda$ do
      sample $P_{t+1,i} \sim D(P_t)$
    $t \leftarrow t + 1$
  until termination condition is fulfilled
The theorem never assumes specific fitness functions, selection mechanisms, or genetic operators like mutation and crossover, but it assumes that the search space $\mathcal{X}$ can be partitioned into $m$ disjoint subsets $A_1, \ldots, A_m$, which we call levels, and the last level $A_m$ consists of all global optima of the objective function. Let $A_{\ge j} := \cup_{k=j}^{m} A_k$. The following theorem is taken from [3, Theorem 1].

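Theorem 1 (Level-Based Theorem) Given a partition $(A_1, \ldots, A_m)$ of $\mathcal{X}$, define $T := \min\{t\lambda \mid |P_t \cap A_m| > 0\}$, where for all $t \in \mathbb{N}$, $P_t \in \mathcal{X}^{\lambda}$ is the population of Algorithm 2 in iteration $t$. Let $y \sim D(P_t)$. If there exist $z_1, \ldots, z_{m-1}, \delta \in (0,1]$, and $\gamma_0 \in (0,1)$ such that for any population $P_t \in \mathcal{X}^{\lambda}$,
(G1) for each level $j \in [m-1]$, if $|P_t \cap A_{\ge j}| \ge \gamma_0\lambda$ then $\Pr(y \in A_{\ge j+1}) \ge z_j$,
(G2) for each level $j \in [m-2]$ and all $\gamma \in (0, \gamma_0]$, if $|P_t \cap A_{\ge j}| \ge \gamma_0\lambda$ and $|P_t \cap A_{\ge j+1}| \ge \gamma\lambda$ then $\Pr(y \in A_{\ge j+1}) \ge (1+\delta)\gamma$,
(G3) and the population size $\lambda \in \mathbb{N}$ satisfies $\lambda \ge \frac{4}{\gamma_0\delta^2}\ln\left(\frac{128m}{z_*\delta^2}\right)$, where $z_* := \min_{j \in [m-1]}\{z_j\}$,
then
$$\mathbb{E}[T] \le \frac{8}{\delta^2}\sum_{j=1}^{m-1}\left[\lambda\ln\left(\frac{6\delta\lambda}{4 + z_j\delta\lambda}\right) + \frac{1}{z_j}\right].$$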

Dvoretzky-Kiefer-Wolfowitz Inequality
The DKW inequality [33] provides an estimate on how close an empirical distribution will be to the true distribution from which the samples are drawn. We note that the advantage of the DKW inequality comes from the fact that the upper bound $\exp\{-2\lambda\varepsilon^2\}$ depends only on the number of samples $\lambda$, which in our case is the offspring population size of the algorithms.
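To make the statement concrete, the following Python sketch compares the two-sided DKW bound in its usual form, $\Pr(\sup_x |\hat{F}_m(x) - F(x)| > \varepsilon) \le 2e^{-2m\varepsilon^2}$ for $m$ i.i.d. samples, with a simulated exceedance frequency. The choice of uniform samples on $[0,1]$ (so that $F(x) = x$) and all function names are assumptions made purely for illustration.

```python
import math
import random


def sup_deviation(samples):
    """sup_x |F_hat(x) - x| for samples assumed to come from U(0, 1).

    The supremum is attained just before or at one of the sample points.
    """
    xs = sorted(samples)
    m = len(xs)
    return max(max((i + 1) / m - x, x - i / m) for i, x in enumerate(xs))


def dkw_bound(m, eps):
    """Two-sided DKW bound 2 * exp(-2 * m * eps^2)."""
    return 2.0 * math.exp(-2.0 * m * eps * eps)


def exceed_frequency(m, eps, trials, seed=0):
    """Fraction of trials in which the sup-deviation exceeds eps."""
    rng = random.Random(seed)
    hits = sum(
        sup_deviation([rng.random() for _ in range(m)]) > eps
        for _ in range(trials)
    )
    return hits / trials
```

Note that the bound depends on the sample size `m` and the deviation `eps` only, not on the underlying distribution.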
Definition 3 (Majorisation) Given two vectors $p := (p_1, \ldots, p_n)$ and $q := (q_1, \ldots, q_n)$, where $p_1 \ge p_2 \ge \cdots \ge p_n$ and analogously for the $q_i$. Vector $p$ is said to majorise vector $q$ if
$$\sum_{i=1}^{k} p_i \ge \sum_{i=1}^{k} q_i \quad \text{for all } k \in [n-1]$$
and
$$\sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i.$$
Majorisation is a powerful tool in runtime analysis of univariate EDAs because the algorithms operate on a probability vector-based model (see [6,10,27,45,47]). The following lemma shows one of the properties of majorisation, which will be used frequently in the main parts of the paper.
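A direct check of the majorisation condition might look as follows in Python. The function name `majorises` and the floating-point tolerance `tol` are assumptions for illustration; the check sorts both vectors in descending order and compares prefix sums, requiring equal total sums.

```python
def majorises(p, q, tol=1e-12):
    """Return True if vector p majorises vector q.

    Both vectors are sorted in descending order; every proper prefix sum
    of p must be at least that of q, and the total sums must coincide.
    """
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    ps = sorted(p, reverse=True)
    qs = sorted(q, reverse=True)
    sum_p = sum_q = 0.0
    for k, (a, b) in enumerate(zip(ps, qs), start=1):
        sum_p += a
        sum_q += b
        if k < len(ps) and sum_p < sum_q - tol:
            return False
    return abs(sum_p - sum_q) <= tol
```

For example, the vector $(1, 0, 0)$ majorises $(1/3, 1/3, 1/3)$, but not the other way around.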

Other Tools
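Lemma 5 (Chernoff Bound [36]) Let $X \sim \mathrm{PB}(p_1, p_2, \ldots, p_n)$. Then $\Pr(X \le (1-\delta)\,\mathbb{E}[X]) \le e^{-\delta^2\,\mathbb{E}[X]/2}$ and $\Pr(X \ge (1+\delta)\,\mathbb{E}[X]) \le e^{-\delta^2\,\mathbb{E}[X]/3}$ for any $0 \le \delta \le 1$.
Lemma 6 (Chernoff-Hoeffding Bound [12]) Let $X \sim \mathrm{PB}(p_1, p_2, \ldots, p_n)$. Let …
We also recall that a random variable $X$ is said to stochastically dominate another random variable $Y$ (defined on the same probability space) if for all $k \in \mathbb{R}$ we have $\Pr(X \ge k) \ge \Pr(Y \ge k)$.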

Runtime Analysis Under Low Selective Pressure
Before we get to analysing the function, we introduce some notation. Let $C_{t,i}$ for all $i \in [n]$ denote the number of individuals having at least $i$ leading 1s in iteration $t$, and let $D_{t,i}$ be the number of individuals having exactly $i - 1$ leading 1s. For the special case $i = 1$, $D_{t,1}$ counts the individuals that do not have any leading 1s.
Once the population has been sampled, the algorithm invokes truncation selection to select the $\mu$ fittest individuals (out of a population of $\lambda$) to update the probability vector. We take this $\mu$-cutoff into account by defining a random variable
$$Z_t := \max\{i \in \{0\} \cup [n] : C_{t,i} \ge \mu\}, \tag{5}$$
which tells us how many marginals, counting from bit position one, are set to the upper border $1 - 1/n$ in iteration $t$. Furthermore, we define another random variable
$$Z_t^{*} := \max\{i \in \{0\} \cup [n] : C_{t,i} > 0\} \tag{6}$$
to be the number of leading 1s of the fittest individual(s).
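The quantities introduced here can be computed from a sampled population as in the following Python sketch. It assumes the reading that $Z_t$ is the largest $i$ with $C_{t,i} \ge \mu$ and that $Z_t^{*}$ is the largest $i$ with $C_{t,i} > 0$, using the convention $C_{t,0} = \lambda$; the function names are hypothetical.

```python
def leading_ones(x):
    """Number of consecutive 1s at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def level_counts(population, n):
    """counts[i] = number of individuals with at least i leading 1s (i = 0..n)."""
    counts = [0] * (n + 1)
    for x in population:
        for i in range(leading_ones(x) + 1):
            counts[i] += 1
    return counts


def cutoff_statistics(population, mu):
    """Return (Z, Z_star) for a population and a parent population size mu.

    Z      : largest i such that at least mu individuals have >= i leading 1s
    Z_star : number of leading 1s of the fittest individual
    """
    n = len(population[0])
    counts = level_counts(population, n)
    z = max(i for i in range(n + 1) if counts[i] >= mu)
    z_star = max(i for i in range(n + 1) if counts[i] > 0)
    return z, z_star
```

With the four individuals `1110`, `1100`, `1011` and `0111` and `mu = 2`, the level counts are `[4, 3, 2, 1, 0]`, so the cutoff sits at two leading 1s while the fittest individual has three.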

On the Distributions of $C_{t,i}$ and $D_{t,i}$
In order to analyse the distributions of the random variables $C_{t,i}$ and $D_{t,i}$, we shall take an alternative view on the sampling process at an arbitrary bit position $i \in [n]$ in iteration $t \in \mathbb{N}$ via the principle of deferred decisions [36, p. 55]. We imagine that the process samples the values of the first bit for $\lambda$ individuals. Once this has finished, it moves on to the second bit and so on until the population is sampled.
To be more specific, we now look at the first bit in iteration $t$. The number of 1s sampled at the first bit position follows a binomial distribution with parameters $\lambda$ and $p_{t,1}$, i.e., $C_{t,1} \sim \mathrm{Bin}(\lambda, p_{t,1})$. Thus, the number of 0s at the first bit position is $D_{t,1} = \lambda - C_{t,1}$. For completeness, we always assume that $C_{t,0} = \lambda$.
Having sampled the first bit for $\lambda$ individuals, note that the bias due to selection at the second bit position comes into play only if the first bit is a 1. If this is the case, then a 1 is preferred to a 0 at the second bit position. Among the $C_{t,1}$ fittest individuals, the probability of sampling a 1 at the second bit position is $p_{t,2}$; thus, the number of individuals having at least 2 leading 1s is binomially distributed with parameters $C_{t,1}$ and $p_{t,2}$, that is, $C_{t,2} \sim \mathrm{Bin}(C_{t,1}, p_{t,2})$, and the number of 0s equals $D_{t,2} = C_{t,1} - C_{t,2}$. Among the $D_{t,1}$ last individuals, since for these individuals the first bit is a 0, there is no bias between a 1 and a 0. The number of 1s sampled at the second bit position among the $D_{t,1}$ last individuals follows a binomial distribution with parameters $D_{t,1}$ (or $\lambda - C_{t,1}$) and $p_{t,2}$.
We can generalise this result for an arbitrary bit position $i \in [n]$. The number of individuals having at least $i$ leading 1s follows a binomial distribution with $C_{t,i-1}$ trials and success probability $p_{t,i}$, i.e.,
$$C_{t,i} \sim \mathrm{Bin}(C_{t,i-1}, p_{t,i}) \tag{7}$$
and
$$D_{t,i} = C_{t,i-1} - C_{t,i}. \tag{8}$$
Furthermore, the number of 1s sampled among the $\lambda - C_{t,i-1}$ remaining individuals is binomially distributed with $\lambda - C_{t,i-1}$ trials and success probability $p_{t,i}$. Let $(\mathcal{F}_t : t \in \mathbb{N})$ be a filtration induced from the population $(P_t : t \in \mathbb{N})$ [44, p. 93]. If we consider the expectations of these random variables, by the tower property of conditional expectation (or tower rule) [44, p. 88] and the fact that $p_{t,i}$ is $\mathcal{F}_{t-1}$-measurable [44, p. 93], we then get
$$\mathbb{E}[C_{t,i}] = \mathbb{E}[C_{t,i-1} \cdot p_{t,i}] \tag{9}$$
and similarly
$$\mathbb{E}[D_{t,i}] = \mathbb{E}[C_{t,i-1} \cdot (1 - p_{t,i})]. \tag{10}$$
We note that by the end of this sampling process, we will obtain a population that is sorted in descending order according to the LeadingOnes-values.
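The bit-by-bit view can be mimicked in Python by sampling the chain $C_{t,0} = \lambda$, $C_{t,i} \sim \mathrm{Bin}(C_{t,i-1}, p_{t,i})$ for a fixed probability vector. This is only a sketch of the deferred-decisions argument for one iteration with the marginals held fixed; the helper names are hypothetical, and the binomial sampler is a plain sum of Bernoulli trials.

```python
import random


def binomial(rng, trials, p):
    """Sample Bin(trials, p) as a sum of independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(trials))


def sample_level_counts(p, lam, rng):
    """Sample C_0 = lam and C_i ~ Bin(C_{i-1}, p_i) for i = 1..n."""
    counts = [lam]
    for p_i in p:
        counts.append(binomial(rng, counts[-1], p_i))
    return counts


def expected_level_counts(p, lam):
    """For fixed marginals, E[C_i] = lam * p_1 * ... * p_i."""
    expected = [float(lam)]
    for p_i in p:
        expected.append(expected[-1] * p_i)
    return expected
```

The numbers $D_{t,i}$ follow as differences of consecutive entries, and the sampled counts are non-increasing in $i$ by construction.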
Recall that we aim at showing that the UMDA takes exponential time to optimise the LeadingOnes function when the selective pressure is not sufficiently high, as required in [6, Theorem 7]. Let $\gamma := \mu/\lambda$ denote the selective pressure of the algorithm. For any constant $\delta \in (0,1)$, we define
$$\alpha := \log_{1-1/n}\left(\frac{\gamma}{1-\delta}\right) \tag{11}$$
and
$$\beta := \log_{1-1/n}\left(\frac{\gamma}{1+\delta}\right). \tag{12}$$
Clearly, we always get $\alpha \le \beta$. We also define a stopping time $\tau := \min\{t \in \mathbb{N} : Z_t \ge \alpha\}$ to be the first hitting time of the value $\alpha$ for the random variable $Z_t$. We then consider two phases: (1) until the random variable $Z_t$ hits the value $\alpha$ for the first time ($t \le \tau$), and (2) after the random variable $Z_t$ has hit the value $\alpha$ for the first time ($t > \tau$).
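For a feel of the magnitudes involved, the two thresholds can be evaluated numerically. The sketch below assumes that they are characterised by $(1-1/n)^{\alpha} = \mu/(\lambda(1-\delta))$ and $(1-1/n)^{\beta} = \mu/(\lambda(1+\delta))$, which is how they are used in the proofs of this section; the function name is hypothetical.

```python
import math


def thresholds(n, lam, mu, delta):
    """Return (alpha, beta) with
    (1 - 1/n)^alpha = mu / (lam * (1 - delta)) and
    (1 - 1/n)^beta  = mu / (lam * (1 + delta)).

    Assumes n >= 2, 0 < delta < 1 and mu / (lam * (1 - delta)) <= 1.
    """
    ratio = mu / lam
    log_base = math.log(1.0 - 1.0 / n)      # negative for n >= 2
    alpha = math.log(ratio / (1.0 - delta)) / log_base
    beta = math.log(ratio / (1.0 + delta)) / log_base
    return alpha, beta
```

For example, `thresholds(100, 100, 50, 0.1)` gives $\alpha \approx 58.5$ and $\beta \approx 78.4$, so under a selective pressure of $1/2$ the cutoff is expected to stay within roughly the first 80 of the 100 bit positions.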

Phase 1: Before the Fitness of the $\mu$-th Individual Hits the Threshold $\alpha$
The algorithm starts with an initial population $P_0$ sampled from the uniform distribution $p_0 = (1/2, \ldots, 1/2)$. An initial observation is that, with high probability, the all-ones bitstring is not sampled in the population $P_0$: the probability of sampling it from the uniform distribution is $2^{-n}$, so by the union bound [14, p. 23] it appears in the population $P_0$ w.p. at most $\lambda \cdot 2^{-n} = 2^{-\Omega(n)}$, since we only consider offspring population sizes at most polynomial in the problem instance size $n$ (i.e., $\lambda \in \mathrm{poly}(n)$). The following lemma states the expectations of the random variables $Z_0^{*}$ and $Z_0$.
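Proof Let $x$ be an individual sampled from the uniform distribution $p_0$, and let $f := \text{LeadingOnes}$. The probability of sampling an individual with more than $k$ leading 1s (where $k < n$) is
$$\Pr(f(x) > k) = 2^{-(k+1)}.$$
The event $Z_0^{*} \le k$ implies that the $\lambda$ individuals all have at most $k$ leading 1s, i.e.,
$$\Pr(Z_0^{*} \le k) = \left(1 - 2^{-(k+1)}\right)^{\lambda},$$
and since $Z_0^{*}$ is integer-valued, we then get
$$\mathbb{E}[Z_0^{*}] = \sum_{k=0}^{n-1}\Pr(Z_0^{*} > k) = \sum_{k=0}^{n-1}\left(1 - \left(1 - 2^{-(k+1)}\right)^{\lambda}\right),$$
which completes the proof. ◻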
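The tail-sum formula for a non-negative integer-valued random variable gives a quick way to evaluate the expected number of leading 1s of the fittest initial individual. The sketch below assumes that the $\lambda$ initial individuals are sampled independently and uniformly at random, so that $\Pr(Z_0^{*} \le k) = (1 - 2^{-(k+1)})^{\lambda}$ for $k < n$; the function name is hypothetical.

```python
def expected_best_initial(n, lam):
    """E[Z*_0] = sum over k = 0..n-1 of Pr(Z*_0 > k),
    where Pr(Z*_0 > k) = 1 - (1 - 2^-(k+1))^lam."""
    return sum(1.0 - (1.0 - 2.0 ** -(k + 1)) ** lam for k in range(n))
```

For `lam = 1` this reduces to $\sum_{k=0}^{n-1} 2^{-(k+1)} = 1 - 2^{-n}$, the expected LeadingOnes value of a single uniform bitstring, and the value grows slowly as `lam` increases.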

Lemma 8 It holds for any $t \in \mathbb{N}$ and $i \ge Z_t + 2$ that $X_{t,i} \sim \mathrm{Bin}(\mu, p_{t,i})$.
Proof By the definition of the random variable $Z_t$, we know that $C_{t,Z_t} \ge \mu$ and $C_{t,Z_t+1} < \mu$. Consider bit position $j := Z_t + 2$. We then obtain from Eq. 7 that among the $C_{t,j-1}$ fittest individuals there are $C_{t,j} \sim \mathrm{Bin}(C_{t,j-1}, p_{t,j})$ individuals with at least $j$ leading 1s. For the $\mu - C_{t,j-1} > 0$ remaining individuals (among the $\mu$ fittest individuals), the fitness ranking of these individuals has already been decided by the first $j - 1$ bits, and what is sampled at bit position $j$ will not have any impact on the ranking of these individuals. In other words, there is no bias in bit $j$ among these (remaining) individuals, which also means that the number of 1s sampled here follows a binomial distribution with $\mu - C_{t,j-1}$ trials and success probability $p_{t,j}$, i.e., $\mathrm{Bin}(\mu - C_{t,j-1}, p_{t,j})$. Thus, we get
$$X_{t,j} \sim \mathrm{Bin}(C_{t,j-1}, p_{t,j}) + \mathrm{Bin}(\mu - C_{t,j-1}, p_{t,j}) = \mathrm{Bin}(\mu, p_{t,j}).$$
Because the distribution of $X_{t,j}$ depends only on $p_{t,j}$, the same line of arguments can be repeated for each of the remaining bit positions $Z_t + 3, \ldots, n$. The proof is now complete. ◻
We now show that the value of the random variable $Z_t$ does not decrease during phase 1 with high probability.
Proof It suffices to show that w.p. at most $e^{-\Omega(\mu)}$ there exists an iteration $t \in [1, \tau]$ such that $Z_t < Z_{t-1}$. We first note that the value of the random variable $Z_t$ drops in iteration $t + 1$ only if the number of individuals with at least $Z_t$ leading 1s in the next iteration is less than $\mu$. Recall that $Z_t < \alpha$ for any $t < \tau$. The number of individuals with at least $Z_t$ leading 1s, sampled in iteration $t + 1$, follows a binomial distribution with $\lambda$ trials and success probability $(1 - 1/n)^{Z_t}$. Thus, in expectation the number of such individuals is
$$\lambda(1 - 1/n)^{Z_t} \ge \lambda(1 - 1/n)^{\alpha} = \frac{\mu}{1-\delta}.$$
By a Chernoff bound (see Lemma 5), the probability of sampling at most $(1-\delta) \cdot \mu/(1-\delta) = \mu$ such individuals is at most $e^{-(\delta^2/2) \cdot \mu/(1-\delta)} = e^{-\Omega(\mu)}$ for any constant $\delta \in (0,1)$. By the union bound, this rare event happens at least once during the first $\tau$ iterations w.p. at most $e^{-\Omega(\mu)}$, and the complement event takes place w.p. at least $1 - e^{-\Omega(\mu)}$, which completes the proof. ◻
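The Chernoff step in this proof can be checked numerically for concrete parameters. The Python sketch below computes the exact probability that a $\mathrm{Bin}(\lambda, \mu/(\lambda(1-\delta)))$ random variable is at most $\mu$ and compares it with the lower-tail bound $e^{-\delta^2\,\mathbb{E}[X]/2}$. The function names and the example parameters are assumptions for illustration.

```python
import math


def binomial_lower_tail(trials, p, k):
    """Exact Pr(X <= k) for X ~ Bin(trials, p)."""
    return sum(
        math.comb(trials, j) * p ** j * (1.0 - p) ** (trials - j)
        for j in range(k + 1)
    )


def chernoff_lower_tail(mean, delta):
    """Chernoff bound on Pr(X <= (1 - delta) * mean), for 0 <= delta <= 1."""
    return math.exp(-delta * delta * mean / 2.0)


def drop_probability(lam, mu, delta):
    """Exact probability and Chernoff bound for sampling at most mu
    individuals when each is sampled w.p. mu / (lam * (1 - delta)).

    Assumes mu / (lam * (1 - delta)) <= 1.
    """
    p = mu / (lam * (1.0 - delta))
    exact = binomial_lower_tail(lam, p, mu)
    bound = chernoff_lower_tail(lam * p, delta)
    return exact, bound
```

With `lam = 200`, `mu = 50` and `delta = 0.2`, the expected count is $62.5$ and the bound evaluates to $e^{-1.25}$; the exact tail probability is smaller, as the bound guarantees.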

Phase 2: After the Fitness of the $\mu$-th Individual has Hit Value $\alpha$ for the First Time
By the definition of $Z_t$, the first $Z_t$ marginals are set to the upper border $1 - 1/n$ in iteration $t \in \mathbb{N}$. Recall that the random variable $X_{t,i}$ denotes the number of 1s at bit position $i \in [n]$ among the $\mu$ fittest individuals, which is used to update the probabilistic model of the UMDA.
The preceding section shows that the random variable $Z_t$ is non-decreasing during phase 1 w.p. $1 - e^{-\Omega(\mu)}$. The following lemma also shows that its value stays above $\alpha$ afterwards with high probability.

Lemma 10 It holds for any constant $k > 0$ that
$$\Pr\left(\forall t \in [\tau, \tau + e^{k\mu}] : Z_t \ge \alpha\right) \ge 1 - e^{k\mu} \cdot e^{-\Omega(\mu)}.$$
Proof Consider the worst scenario in which $Z_t = \alpha$ for some $t \in [\tau, \tau + e^{k\mu}]$. We also note that the value of the random variable $Z_t$ drops below $\alpha$ in iteration $t + 1$ if and only if the number of individuals with at least $Z_t$ leading 1s sampled in the next iteration is less than $\mu$. An offspring with at least $\alpha$ leading 1s is still sampled w.p.
$$(1 - 1/n)^{\alpha} = \frac{\mu}{\lambda(1-\delta)}$$
for some constant $\delta \in (0,1)$, and by a Chernoff bound, there are at most $\mu$ such individuals sampled in the next iteration w.p. at most $e^{-\Omega(\mu)}$. By the union bound, this happens at least once in the interval $[\tau, \tau + e^{k\mu}]$ w.p. at most $e^{k\mu} \cdot e^{-\Omega(\mu)}$. The complement event then occurs w.p. at least $1 - e^{k\mu} \cdot e^{-\Omega(\mu)}$, which completes the proof. ◻
The following lemma further shows that there is also an upper bound on the random variable $Z_t$.

Lemma 11 It holds for any constant $k > 0$ that
$$\Pr\left(\forall t \in [0, e^{k\mu}] : Z_t \le \beta\right) \ge 1 - e^{k\mu} \cdot e^{-\Omega(\mu)}.$$
Proof An individual with at least $\beta$ leading 1s is sampled w.p. at most
$$(1 - 1/n)^{\beta} = \frac{\mu}{\lambda(1+\delta)}$$
for some constant $\delta \in (0,1)$. Thus, the number of such individuals sampled in the next iteration will be stochastically dominated by $\mathrm{Bin}(\lambda, \mu/(\lambda(1+\delta)))$, and thus their expected number is at most $\mu/(1+\delta)$. By a Chernoff bound, the probability of sampling at least $(1+\delta) \cdot \mu/(1+\delta) = \mu$ such individuals in the next iteration is at most $e^{-\Omega(\mu)}$. By the union bound, this rare event happens at least once in the interval $[0, e^{k\mu}]$ w.p. at most $e^{k\mu} \cdot e^{-\Omega(\mu)}$. Thus, the complement event occurs w.p. at least $1 - e^{k\mu} \cdot e^{-\Omega(\mu)}$. The proof is then complete by noting that the value of the random variable $Z_t$ exceeds $\beta$ if and only if the number of individuals with more than $\beta$ leading 1s sampled in the next iteration is at least $\mu$. ◻
Lemmas 10 and 11 together give essential insights about the behaviour of the algorithm. The random variable $Z_t$ will stay well below the threshold $\beta$ for $e^{\Omega(\mu)}$ iterations w.p. $1 - e^{-\Omega(\mu)}$ for a sufficiently large parent population size $\mu$. More precisely, the random variable $Z_t$ will move back and forth around the equilibrium value $\log_{1-1/n}(\mu/\lambda)$. This is because when $Z_t$ equals this value, in expectation there are exactly $\lambda(1 - 1/n)^{Z_t} = \lambda \cdot (\mu/\lambda) = \mu$ individuals having at least $Z_t$ leading 1s.

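An exponential lower bound on the runtime is obtained if we can also show that the probability of sampling the $n - \beta$ last bits correctly is exponentially small. We now choose the ratio $\mu/\lambda$ such that $n - \beta \ge \varepsilon n$ for any constant $\varepsilon \in (0,1)$, which is equivalent to $\beta \le n(1 - \varepsilon)$. By (12), that is, $(1 - 1/n)^{\beta} = \mu/(\lambda(1+\delta))$, and solving for $\mu/\lambda$, we then obtain
$$\frac{\mu}{\lambda(1+\delta)} \ge (1 - 1/n)^{n(1-\varepsilon)}.$$
The right-hand side is at most $1/e^{1-\varepsilon}$ as $(1 - 1/n)^{n} \le 1/e$ for all $n > 0$ [36], so the above inequality always holds if the selective pressure satisfies $\mu/\lambda \ge (1+\delta)/e^{1-\varepsilon}$ for any constants $\delta > 0$ and $\varepsilon \in (0,1)$.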
The remainder of this section shows that the n − ( + 1) = Ω(n) last bits cannot be sampled correctly in any polynomial number of iterations with high probability. We first show that the sampling processes among the Ω(n) last bits are mutually independent. To ease the analysis, we further define Y t,1 , Y t,2 , … , Y t,n to be n Bernoulli random variables representing an offspring sampled from the product distribution p t (see Eq. 2).

Lemma 12 Let k be any positive constant. It holds w.p. at least
Proof By Lemma 11, we know that Z t ≤ for any t ≤ e k w.p. at least 1 − e k ⋅ e −Ω( ) . We also obtain by Lemma 8 that X t+1,j ∼ Bin , p t+1,j for any j ≥ + 2 . In other words, the number of ones sampled at a bit position j ≥ + 2 among the fittest individuals in the next iteration depends only on the marginal p t,j . Thus, for any two distinct bit positions j 1 , j 2 ∈ { + 2, … , n} sampling a one at bit position j 1 is independent of sampling a one at bit position j 2 . ◻ Now consider an arbitrary bit position i ≥ + 1 . We always get [Y t,i | F t−1 ] = p t,i , and by the tower property of conditional expectation we also obtain For the UMDA without borders, the stochastic process (p t,i ∶ t ∈ ℕ) is a martingale [15], which results in [p t,i ] = p 0,i = 1∕2 . We will show in the following lemma that for the UMDA with borders the expected value of a marginal at an arbitrary bit position i ≥ + 2 also stays at 1/2 for any t ∈ ℕ.

Lemma 13
Let $\mu \ge c\log n$ for a sufficiently large constant $c > 0$. If there exists a constant $k < n$ such that $Z_t \le k-2$ for any $t \in \mathbb{N}$, then for any $i \ge k$ it holds that $\mathbb{E}[p_{t,i}] = 1/2$. Proof For readability, we omit the index $i$ throughout the proof. Recall that $p_t = \max\{1/n, \min\{1-1/n, X_{t-1}/\mu\}\}$. By the definition of expectation, we get

Algorithmica (2021) 83:3238-3280 We note further that from which we then obtain Substituting (15) into (14) yields We are left to calculate the two probabilities that X t−1 = 0 and X t−1 = . Since these are unconditional probabilities, we shall make no assumption (even on p t−1 ) when calculating them. All we know are that p 0 = 1∕2 and, by Lemma 8, X t−1 is binomially distributed with trials and success probability p t−1 , which means that there is no bias towards any border in the stochastic process (X t ∶ t ∈ ℕ) . Due to this symmetry, we get Furthermore, by the tower rule we also have Substituting (17) and (18) into (16) yields . Then by induction on time, we obtain which completes the proof. ◻ Lemma 13 gives us insights into the expected values of the marginals at any time t ∈ ℕ . One should not confuse the expectation with the actual value of the marginals. Friedrich et al. [15] showed a similar result for the UMDA without border that (14) even when the expectation stays at 1/2, the actual value of the marginal in iteration t can be close to the trivial lower or upper border due to its large variance. Very recently, Doerr and Zheng [9] obtained a tight bound of Θ( ) on the first hitting time of any trivial border for these marginals. Furthermore, Lehre and Nguyen [28] showed that the variance reaches a value of Θ( 2 ) after only Ω( ) iterations.

Proof Given that
≥ (1 + )∕e 1− , by Lemma 11 we get Z t ≤ ≤ n(1 − ) for any t = poly(n) w.o.p. We shall prove the lemma by looking at the n − ( + 2) ≥ n − n(1 − ) − 2 = n − 2 = Ω(n) last bit positions. Let us now consider the total number of zeros sampled at these bit positions in an iteration. We know by Lemma 13 (for k = + 2 ) that their marginals stay at 1/2 in expectation, and we also know by Lemma 12 that the samplings at these bit positions are mutually independent. Therefore, by the linearity of expectations, the expected total number of zeros sampled there is This means that in order to sample all ones at these bit positions there are still at least Ω(n) zeros to flip. In other words, we need to deviate a distance of Ω(n) below the expected value, and by a Chernoff-Hoeffding bound (see Lemma 6) such an event happens w.p. at most By the union bound, this event happens at least once in a polynomial number of iterations (in n) w.p. still at most e −Ω(n) . The proof is now complete. ◻ We are ready to show our main result.

Theorem 15
The UMDA with a parent population size $\mu \ge c\log n$ for some sufficiently large constant $c > 0$ and a selective pressure satisfying $\mu/\lambda \ge (1+\delta)/e^{1-\varepsilon}$ for any constants $\varepsilon \in (0,1)$ and $0 < \delta \le e^{1-\varepsilon}-1$ has a runtime of $e^{\Omega(\mu)}$ on the LeadingOnes function w.p. $1-e^{-\Omega(\mu)}$ and also in expectation.
Proof Due to the low selective pressure, the upper threshold on $Z_t$ from Lemma 11 is at most $n(1-\varepsilon)$. We now consider the two phases as introduced above. During phase 1, the all-ones bitstring is not sampled w.p. at least $1-e^{-\Omega(n)}$ since by Lemma 14 the $\Omega(n)$ last bit positions cannot be sampled correctly with the same probability. If this phase lasts for $e^{\Omega(\mu)}$ iterations, then we are done, and the theorem holds trivially. Thus, we shall assume that phase 1 lasts for at most $\mathrm{poly}(n)$ iterations. During phase 2, we have observed by Lemma 11 that the random variable $Z_t$ exceeds this threshold in an iteration $t \le e^{k\mu}$ for some constant $k > 0$ w.p. at most $e^{-\Omega(\mu)}$, while in the same iteration the $\Omega(n)$ last bits are sampled as all ones w.p. at most $e^{-\Omega(n)}$ due to Lemma 14. Thus, the all-ones bitstring can be sampled in that iteration w.p. at most $e^{-\Omega(\mu)}$, and by the union bound the all-ones bitstring is sampled at least once in $e^{k\mu}$ iterations w.p. at most $e^{-\Omega(\mu)}$. Note also that the last statement only holds if the constant $c$ (in $\mu \ge c\log n$) is chosen sufficiently large. Therefore, the algorithm takes at least $e^{\Omega(\mu)}$ iterations to optimise the function w.p. at least $1-e^{-\Omega(\mu)}$.

A New Lower Bound for the UMDA
When the selective pressure = ∕ is set too high such that the value of , defined in Eq. 11, exceeds the problem instance size n, phase 1 will end when the fittest individuals are all-ones bitstrings. By Eq. 11, this case occurs when for any constant ∈ (0, 1) . The right-hand side is at least (1 − )∕e for any n ≥ (1 + )∕ [30], and the above inequality always holds if we choose the selective pressure ≤ (1 − ) 2 ∕e , We now recall the following result [5,Theorem 4], which in this case yields the first upper bound on the expected runtime for the UMDA on the LeADIngOnes function.

Theorem 16
The UMDA with an offspring population size $\lambda \ge c\log n$ for some sufficiently large constant $c > 0$ and a selective pressure satisfying $\mu/\lambda \le 1/((1+\delta)e)$ for any constant $\delta > 0$ has an expected runtime of $O(n\lambda\log\lambda + n^2)$ on the LeadingOnes function.
A lower bound on the expected runtime of the UMDA on the LeadingOnes function has so far been missing, and in this section we aim at deriving such a lower bound.
Recall that the random variable $Z_t$, defined in Eq. 5, denotes the number of marginals, counting from the first bit position, which are set to the upper border $1-1/n$ in iteration $t$, and the random variable $Z_t^*$, defined in Eq. 6, denotes the fitness value of the fittest individual. The following lemma shows the expected difference between these two random variables in an arbitrary iteration $t \in \mathbb{N}$. We pessimistically assume that the first $Z_t$ marginals are all set to one since we are only interested in a lower bound and this will speed up the optimisation process.
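To make the two random variables concrete, the following minimal sketch simulates the UMDA with borders on LeadingOnes and records, in each iteration, the length of the prefix of marginals at the upper border $1-1/n$ and the fitness of the fittest sampled individual. The parameter values, the function names and the exact time indexing of the two quantities are assumptions made for illustration only.

```python
import random


def leading_ones(x):
    """Number of leading 1s of the bitstring x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def umda_iteration(p, lam, mu, rng):
    """One UMDA iteration with borders; returns (new marginals, Z, Z*)."""
    n = len(p)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n
    population = [[1 if rng.random() < p_i else 0 for p_i in p] for _ in range(lam)]
    # Stable sort: equally fit individuals keep their (random) sampling order.
    population.sort(key=leading_ones, reverse=True)
    parents = population[:mu]
    new_p = [max(lower, min(upper, sum(x[i] for x in parents) / mu)) for i in range(n)]
    z = 0  # length of the prefix of marginals set to the upper border
    for p_i in new_p:
        if p_i < upper:
            break
        z += 1
    z_star = leading_ones(population[0])  # fitness of the fittest individual
    return new_p, z, z_star


def run_umda(n=30, lam=60, mu=10, iterations=40, seed=1):
    rng = random.Random(seed)
    p = [0.5] * n
    trace = []
    for _ in range(iterations):
        p, z, z_star = umda_iteration(p, lam, mu, rng)
        trace.append((z, z_star))
    return p, trace


final_p, trace = run_umda()
print(trace[-1])
```

In every iteration the recorded prefix length is at most the fitness of the fittest individual, because a marginal reaches the upper border only if all $\mu$ selected individuals have a 1 at that position.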

Lemma 17 It holds for any
Proof Let t ∶=Z * t − Z t . Consider the bit positions Z t + 2, Z t + 3, … , n among the fittest individuals. We shall view this as an abstract population of individuals, each of length n − (Z t + 1) , and also let � t ∶=Z * t − (Z t + 1) = t − 1 . In other words, ′ t is a random variable describing the number of leading 1s of the fittest individual in this abstract population. We first note that if X t,Z t +1 = 0 , then Z * t = Z t and t = 0 . By the law of total expectation, we get We are left to calculate the last conditional expectation. Consider again the abstract population introduced above. The probability of sampling at most k leading 1s in this population is 1 − ∏ (Z t +2)+k i=Z t +2 p t,i , and the probability that all in the abstract population have more than k leading 1s is for any bounded integer-valued random variable Y, we then get We know, by Lemma 13, that the values of the marginals p t,i for each i ≥ Z t + 2 stay at 1/2 in expectation and also, by Lemma 12, that the samplings at these bit positions are pairwise independent. Note also that x ↦ (1 − x) is a convex function for any x ∈ [0, 1] , so by Jensen's inequality for convexity [44, p. 61] we get which completes the proof. ◻ Lemma 17 gives an important insight that the two random variables Z t and Z * t only differ by a logarithmic additive term at any point in time in expectation. The global optimum is found when the random variable Z * t reaches the value of n. We can therefore alternatively analyse the random variable Z t instead of Z * t . In other words, the random variable Z t , starting from an initial value Z 0 given in Lemma 7, has to travel an expected distance of n − O(log ) − Z 0 (at bit positions) before the global optimum is found. We shall apply the additive drift theorem (for a lower bound) [21] for a potential function g(x) = n − x on the stochastic process (Z t ∶ t ∈ ℕ) . 
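The logarithmic gap stated after Lemma 17 can be illustrated in a simplified setting. The sketch below assumes that all marginals of the abstract population are exactly 1/2 (the lemma only requires this in expectation and uses Jensen's inequality to bridge the difference) and computes the expected maximum number of leading 1s among $\mu$ independent uniform bitstrings via the tail-sum formula. The function name and the string length of 64 are illustrative assumptions.

```python
from math import log2


def expected_max_leading_ones(mu: int, length: int = 64) -> float:
    """E[max LeadingOnes] over mu independent uniform bitstrings of the given length."""
    # E[M] = sum_{k >= 1} Pr[M >= k], and Pr[M >= k] = 1 - (1 - 2^{-k})^mu.
    return sum(1.0 - (1.0 - 2.0 ** (-k)) ** mu for k in range(1, length + 1))


gaps = {mu: expected_max_leading_ones(mu) for mu in (1, 2, 16, 1024)}
for mu, gap in gaps.items():
    print(mu, round(gap, 4), "log2(mu) =", log2(mu))
```

For the values tried here the expectation stays within an additive constant of $\log_2 \mu$, which is the behaviour captured by the $O(\log \mu)$ term.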
The single-step change (also called drift) is We are ready to show a lower bound on the expected runtime of the UMDA on the LeADIngOnes function.

Theorem 18
The UMDA with a parent population size ≥ c log n for some sufficiently large constant c > 0 and a selective pressure satisfying for any constant ∈ (0, 1) has an expected runtime of Ω(n ∕ log ) on the LeADin-gOnes function.
Proof Let i∶=Z t + 1 . By definition, Z t+1 = Z t and Δ t = 0 if there are less than individuals with at least i leading 1s sampled in the next iteration (i.e., C t+1,i < ). Thus, the drift is maximised when C t+1,i ≥ . By the law of total expectation, we then get We are left to bound the expectation. Given Z t , we know by Lemma 13 that the marginals of bit positions from Z t + 2 to n stay at 1/2 in expectation, and also by Lemma 8 the samplings at these bit positions are pairwise independent. By following the proof of Lemma 17, we can quickly upper bound the required expectation as follows.
Then, the expected drift is

A Tighter Upper Bound for the PBIL
In this section, we aim at showing a tighter upper bound than the upper bound of $O(n^{2+c})$ in [47] for the PBIL on the LeadingOnes function. We shall apply the level-based theorem. To begin with, we first remark that Algorithm 2 assumes a mapping $D$ from the space of populations $\mathcal{X}^{\lambda}$ to the space of probability distributions over the search space. The mapping $D$ is often said to depend on the current population only [3]; however, this is not always necessary, especially for the PBIL with a sufficiently large offspring population size $\lambda$. The rationale behind this is that in each iteration the PBIL draws samples from the product distribution, specified in Eq. 2, that correspond to individuals in the current offspring population. If the number of samples is sufficiently large, it is very unlikely that the empirical frequencies of ones deviate far from the true marginals. We will make this intuition more rigorous via the DKW inequality (see Theorem 2).
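We shall use a canonical partition of the search space, where each subset $A_j$ contains the bitstrings with exactly $j$ leading 1s:
$$A_j := \{x \in \{0,1\}^n \mid \text{LeadingOnes}(x) = j\}, \qquad j \in \{0, 1, \ldots, n\}. \qquad (19)$$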
Thus, there are n + 1 levels, ranging from A 0 to A n . We then need to verify three conditions in Theorem 1. Recall that A ≥j = ∪ n i=j A i . For conditions (G1) and (G2), we assume that there are at least 0 individuals in levels A ≥j in iteration t. Following [5], we choose 0 = ∕ . This implies that the fittest individuals have at least j leading 1s. We define to be the frequency of ones at bit position i in the entire population of individuals. We now show under the assumption of the condition (G1) of the level-based theorem that if the population size is = Ω(log n) , the first j marginals cannot be too close to the lower border 1/n with high probability.

Lemma 19
Assume that |P t ∩ A ≥j | ≥ 0 and ≥ c((1 + 1∕ )∕ 0 ) 2 ln(n) for any constants c, > 0 and 0 ∈ (0, 1), then Proof We only show the first statement as the second follows from the first statement. Let Q i be the number of ones sampled among the j first bit positions in the i-th individual in the current population P t . By the assumption |P t ∩ A ≥j | ≥ 0 on the current population, the empirical distribution function of Q i must satisfy where q t ≥ 0 is the fraction of individuals in the current population with j leading ones, while the true distribution function satisfies F(j − 1) = 1 − q t , where q t ∶= ∏ j i=1 p t,i is the probability of sampling at least j leading ones in an individual. The DKW inequality yields that for all > 0 . Therefore, with probability at least 1 − 2e −2 2 it holds q t − q t ≤ and, thus, q t ≥q t − ≥ 0 − . Choosing ∶= 0 ∕(1 + ) , we get q t = ∏ j i=1 p t,i ≥ 0 (1 − ∕(1 + )) = 0 ∕(1 + ) with probability at least Lemma 19 tells us that if the current level of the population is j, then all marginals p t,1 , p t,2 , … , p t,j are at least 0 ∕(1 + ) in an iteration t ∈ ℕ with probability polynomially close to one. To show an upper bound on the expected runtime for the PBIL on the LeADIngOnes function, we first apply the level-based theorem to obtain an upper bound conditional on the event that for all iterations t ≤ t * , and 1 ≤ i ≤ j , where j is the current level in iteration t, satisfy p t,i ≥ 0 ∕(1 + ) where t * is a sufficiently long time interval which will be specified later. In the end, we follow the line of argumentations put forward in [10, Theorem 8] to derive an overall unconditional expected runtime.
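The arithmetic behind the population-size requirement in Lemma 19 can be checked directly. The sketch below evaluates the DKW failure bound $2e^{-2\lambda\varepsilon^2}$ for the deviation $\varepsilon = \gamma_0\delta/(1+\delta)$ and a population size $\lambda \ge c((1+1/\delta)/\gamma_0)^2\ln n$, and compares it with $2n^{-2c}$. The symbol names $\delta$ and $\varepsilon$ follow our reading of the proof, and the numeric values are arbitrary assumptions.

```python
from math import ceil, exp, log


def dkw_failure_bound(lam: int, eps: float) -> float:
    """DKW bound on Pr[sup |F_hat - F| > eps] for lam samples."""
    return 2.0 * exp(-2.0 * lam * eps**2)


def required_population(n: int, c: float, delta: float, gamma0: float) -> int:
    """Smallest integer lam with lam >= c * ((1 + 1/delta) / gamma0)^2 * ln(n)."""
    return ceil(c * ((1.0 + 1.0 / delta) / gamma0) ** 2 * log(n))


# Assumed example values, not taken from the paper.
n, c, delta, gamma0 = 1000, 1.0, 0.5, 0.2
lam = required_population(n, c, delta, gamma0)
eps = gamma0 * delta / (1.0 + delta)   # deviation chosen in the proof
bound = dkw_failure_bound(lam, eps)
target = 2.0 * n ** (-2.0 * c)
print(lam, bound, target)
```

Since $\varepsilon = \gamma_0/(1+1/\delta)$, the product $\lambda\varepsilon^2$ is at least $c\ln n$, so the failure bound is at most $2n^{-2c}$, which is the probability used later when bounding the failure events.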
We first introduce the AM-GM inequality [34].
Lemma 20 (AM-GM Inequality) Let $a_1, \ldots, a_n$ be $n$ non-negative real numbers. It holds that
$$\frac{a_1 + a_2 + \cdots + a_n}{n} \ge (a_1 a_2 \cdots a_n)^{1/n}.$$
Equality occurs if and only if $a_1 = a_2 = \cdots = a_n$.
We are ready to establish an improved upper bound on the expected runtime of the PBIL on the LeadingOnes function. Surprisingly, the proof is straightforward and not very technically demanding compared to the proof in [47].

Theorem 21
The PBiL with an offspring population size with c log n ≤ = poly (n) for a sufficiently large constant c > 0, a constant smoothing parameter ∈ (1∕e, 1] and a constant selective pressure satisfying for any constant > 0, has an expected runtime of O n log + n 2 on the LeADin-gOnes function. Proof First, we partition the search space into "levels" using the canonical partition defined in (19), in which each subset A j contains individuals with exactly j leading 1s.There are a total of n + 1 levels ranging from A 0 to A n .
Let ∶= T∕ denote the runtime of the algorithm in terms of number of iterations. We say that failure event F t occurs in iteration t ∈ ℕ if there exist two indices i, j ∈ ℕ satisfying 1 ≤ i ≤ j ≤ n such that |P t ∩ A ≥j | ≥ 0 and p t,i < 0 ∕(1 + ) , where and 0 are parameters which will be specified later. Furthermore, for any t ≥ 0 , we let G t ∶= ⋀ t i=0 ¬F i denote the event that there is no failure in the first t iterations. We will first estimate the expected runtime of the algorithm starting from any initial state, conditional on the event G ∨s , i.e., that no failure occurs before the optimum has been found for the first time or before iteration s, whichever is the larger. Here, s is a parameter we will define later, and x ∨ y ∶= max(x, y) . Note that Pr G ∨s > 0 because any new individual in any iteration is optimal with probability at least n −n > 0 . Afterwards, we will estimate the overall expected runtime of the algorithm on the function.
To obtain an upper bound on the expected runtime conditional on the event G ∨s , we apply the level-based theorem with respect to the partition A 0 , … , A n described above.
For the two conditions (G1) and (G2), assuming that |P t ∩ A ≥j | ≥ 0 = , we are required to show that the probability of sampling an offspring in levels A ≥j+1 in iteration t + 1 is lower bounded by (1 + ) for some constant > 0 . We note by Lemma 4 that this probability can be bounded from below as follows: that holds for any vector q∶=(q 1 , … , q j ) , which majorises the vector p * t+1 ∶=(p t+1,1 , … , p t+1,j ) . In the remainder of this proof, we shall construct such a vector q from vector p * t+1 .
In order to construct vector q, we will shift the weight ∑ j i=1 p t+1,i as far as possible to the marginals with smaller indices. The trivial upper bound on each component q i is the upper border 1 − 1∕n . For the lower bound, we note from the assumption |P t ∩ A ≥j | ≥ that the fittest individuals have at least j leading 1s, meaning that when updating the model we always have p t+1,i = (1 − )p t,i + ≥ for each i ∈ [j] . Therefore, a trivial lower bound on each component q i is the smoothing parameter . We define a vector q = (q 1 , … , q j ) as follows: for an integer m = ⌊g(j)⌋ , where Because of the floor function, we always get g(j) − 1 < m ≤ g(j) , and thus ≤ q m+1 ≤ 1 − 1∕n , meaning that the defined value of the component q m+1 is indeed a probability. By the definition of the vector q in (21), we have for any k ∈ [j − 1] that

Therefore, according to Definition 3 the vector $q$ majorises the vector $p^*_{t+1}$. By Lemma 4, the probability of sampling the first $j$ bits correctly is at least $\prod_{i=1}^{j} q_i \ge \eta^{j-m}/e$, which holds because $(1-1/n)^m \ge 1/e$ for any integer $m < n$ and every remaining component of $q$ is at least $\eta$. Recall that we aim at showing that the above probability is at least a constant, so we are done if we can show that $j - m = O(1)$. We are going to show that this is indeed the case.
Let p 0 ∶= 0 ∕(1 + ) . We get by Lemmas 19 and 20 that the weight thus, We also have the following.
where the last inequality follows from the fact that $\ln(x) \le n(x^{1/n} - 1)$ for all $n > 0$ and $x > 0$ [39]. Since the value $\gamma_0$ is assumed constant, so is the value $p_0 = \gamma_0/(1+\delta)$ for any constant $\delta > 0$; thus $j - m = O(1)$, meaning that the probability of sampling the first $j$ bits correctly in iteration $t+1$ is at least a constant. In the remainder of the proof, we will use this result to verify conditions (G1) and (G2) of the level-based theorem. For condition (G1), we are interested in a lower bound $z_j$ on the probability of sampling an offspring in levels $A_{\ge j+1}$ in iteration $t+1$. Because the marginal $p_{t+1,j+1} \ge 1/n$, this probability is at least $\Omega(1) \cdot (1/n) = \Omega(1/n)$. Thus, condition (G1) is satisfied with the lower bound $z_* = z_j = \Omega(1/n)$.
For condition (G2), assuming further that |P t ∩ A ≥j+1 | ≥ , meaning that the marginal at bit position j + 1 will be set to p t+1,j+1 ≥ (1 − )p t,j+1 + ( )∕ ≥ ∕ 0 . In this case, the probability of sampling the first j + 1 bits correctly is For = 1 , the lower bound in (26) becomes ∕(e 0 ) , and the condition (G2) can be easily confirmed by setting 0 ≤ 1∕(e(1 + )) for any constant > 0 . We note that this is already obtained in Theorem 16 for the UMDA on the LeADIngOnes function. Otherwise, if the smoothing parameter < 1 , we can rewrite for any constant > 0 , then (26) is equivalent to which always holds if we choose the value 0 such that For any ∈ (0, 1] , the right-hand side of Eq. 27 is always less than one as 1 − ln ≥ 1 , so in the left-hand side we require 1 + ln > 0 , which is equivalent to > 1∕e . We then obtain the following bound on 0 : The smoothing parameter is a constant, so is the upper bound on 0 . In the end, condition (G2) of Theorem 1 is verified.
To satisfy the condition (G3), it suffices to choose $\lambda \ge c\log n$ for a sufficiently large constant $c > 0$.
Having verified the three conditions (G1), (G2) and (G3), and noting that ln 6 ∕(4 + z j ) < ln (3 ∕2) , Theorem 1 now guarantees an upper bound, for some constant c 1 > 0, To obtain an upper bound on the unconditional expected runtime, we divide the run into consecutive phases, each of length s ∶= 2t * = poly (n) iterations. Note that for all i ∈ ℕ , the event ≤ s is independent of the failure event F s+i . By Lemma 28 1 and (29), it follows that the probability that the algorithm finds the optimum within one phase is We now estimate the probability of the event G s . By Lemma 19 and a union bound, failure event F t occurs with probability at most 2n −2c+1 assuming the population size satisfies ≥ c((1 + 1∕ )∕ 0 ) 2 ln(n) for a constant c > 0 . By another union bound and assuming that c is chosen sufficiently large, the probability of no failure within s = poly (n) iterations is From (30) and (31), it follows that the algorithm, starting from any initial state, finds the optimum within a phase with probability at least 1∕2 − o (1).
If the algorithm does not find the optimum within a phase or event $G_s$ does not hold, the algorithm enters some unknown state at the end of the current phase. Because our analysis makes no assumption about the state of the algorithm at the beginning of the phase, we can repeat the same analysis for the next phase. Hence, the number of phases until an optimum is found for the first time is stochastically dominated by a geometric random variable [35, Definition 2.8] with success probability $1/2 - o(1)$. By [7, Corollary 8.3] and [35, p. 32], the expected number of phases until an optimum is found for the first time is at most $1/(1/2 - o(1)) = O(1)$.
It follows that the overall expected runtime of the PBIL on the LeADIngOnes function is O( s) = O n log + n 2 , which completes the proof. ◻ We note from Eq. 28 that the threshold on the selective pressure is a function of the smoothing parameter ∈ (1∕e, 1] , denoted by h( ) . When → 1 , that is, the PBIL converges to the UMDA, h( ) → 1∕(e(1 + )) , which matches the selective pressure considered in Theorem 16. Also, h( ) is an increasing function and has a very small value when gets closer to 1∕e ≈ 0.3679 (see Fig. 1). In other words, we need to pick an extremely high selective pressure when the smoothing parameter approaches 1/e (from above).
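The model update used throughout the proof of Theorem 21 can be sketched as follows. The function applies the smoothed update $p_{t+1,i} = (1-\eta)p_{t,i} + \eta X_i/\mu$, where $X_i$ is the number of ones at position $i$ among the $\mu$ selected individuals, and then restricts the result to the borders $[1/n, 1-1/n]$. The function name and the example values are illustrative assumptions; with $\eta = 1$ the update coincides with that of the UMDA.

```python
def pbil_update(p, parents, eta):
    """Smoothed PBIL update of the marginals, restricted to [1/n, 1 - 1/n]."""
    n = len(p)
    mu = len(parents)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n
    new_p = []
    for i in range(n):
        freq = sum(x[i] for x in parents) / mu       # X_i / mu
        value = (1.0 - eta) * p[i] + eta * freq      # smoothing step
        new_p.append(max(lower, min(upper, value)))  # apply borders
    return new_p


# Three selected individuals that agree on five leading 1s (assumed example).
parents = [[1] * 5 + [0] * 5 for _ in range(3)]
smoothed = pbil_update([0.5] * 10, parents, eta=0.4)
umda_like = pbil_update([0.5] * 10, parents, eta=1.0)
print(smoothed)
print(umda_like)
```

In the example, every position on which all selected individuals have a 1 ends up with a marginal of at least $\eta$, which is the lower bound on the components of $q$ used in the majorisation argument.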

Direct Extensions
The BinVal function is another test function widely used in runtime analyses of EDAs [6,10,27,46]. This is a linear function where the bit weights decrease exponentially with bit positions. Due to its similarity to the LeadingOnes function, we will show that the runtime bound derived in Theorem 21 can be extended to the BinVal function. We first partition the search space into non-empty disjoint subsets $A_0, \ldots, A_n$ as follows.
$$A_j := \Big\{x \in \{0,1\}^n \;\Big|\; \sum_{i=1}^{j} 2^{n-i} \le \text{BinVal}(x) < \sum_{i=1}^{j+1} 2^{n-i}\Big\} \text{ for } j \in \{0, \ldots, n-1\}, \qquad A_n := \{1^n\}.$$
The following lemma formalises the similarity between the two functions.

Lemma 22 x ∈ A j if and only if LeadingOnes(x) = j.
Proof For the sufficient condition, if $x \in A_j$, meaning that $\sum_{i=1}^{j} 2^{n-i} \le \text{BinVal}(x) < \sum_{i=1}^{j+1} 2^{n-i}$, then the first $j$ bits must be 1s, followed by a 0 at bit position $j+1$. This is due to the fact that $2^{n-(j+1)} > \sum_{i=j+2}^{n} 2^{n-i}$. For the necessary condition, if $\text{LeadingOnes}(x) = j$, the first $j$ bits are 1s, followed by a 0 at bit position $j+1$. The BinVal-value of the bitstring is at least $\sum_{i=1}^{j} 2^{n-i}$ and at most $\sum_{i=1}^{j} 2^{n-i} + \sum_{i=j+2}^{n} 2^{n-i} < \sum_{i=1}^{j+1} 2^{n-i}$. Therefore, $x$ must be in the level $A_j$. ◻

Fig. 1 Threshold on the selective pressure for the PBIL with $\eta \in (1/e, 1]$ on the LeadingOnes function in Eq. 28 with $\delta = 0.01$. Note also that $1/e \approx 0.3679$ and $1/(e(1+\delta)) \approx 0.3642$.

We now consider the sorting of individuals after the population is sampled in an arbitrary iteration. For the LeadingOnes function, all that matters to determine the ranking of a bitstring is the number of leading 1s. Alternatively, we can say the ranking of an individual depends on the position of the leftmost zero in the bitstring, and all following bits have no contribution to the overall fitness of the individual. However, this is not the case for the BinVal function, where all individuals are first sorted according to their LeadingOnes-values. Ties are broken not uniformly at random as for the LeadingOnes function but by comparing the number of leading 1s following the leftmost zero among these individuals. Nevertheless, since the proof of Theorem 21 never takes bits after the leftmost zero into account, the result also holds for the BinVal function. The following corollary yields the first upper bound on the expected runtime of the PBIL on the BinVal function. We note that a similar bound for the UMDA on the BinVal function is shown in [6].
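Lemma 22 can be verified exhaustively for small problem sizes. The sketch below assumes that BinVal assigns weight $2^{n-i}$ to bit position $i$ and that level $A_j$ collects the bitstrings whose BinVal-value lies in $[\sum_{i=1}^{j} 2^{n-i}, \sum_{i=1}^{j+1} 2^{n-i})$, with $A_n$ containing only the all-ones bitstring; both are our reading of the partition, and the function names are hypothetical.

```python
from itertools import product


def leading_ones(x):
    """Number of leading 1s of the bitstring x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def bin_val(x):
    """BinVal with weight 2^(n-i) at the 1-indexed bit position i."""
    n = len(x)
    return sum(bit * 2 ** (n - i) for i, bit in enumerate(x, start=1))


def binval_level(x):
    """Index j of the level A_j (assumed partition) that contains x."""
    n = len(x)
    value = bin_val(x)
    if value == 2**n - 1:  # all-ones bitstring forms the top level A_n
        return n
    for j in range(n):
        lower = sum(2 ** (n - i) for i in range(1, j + 1))
        upper = sum(2 ** (n - i) for i in range(1, j + 2))
        if lower <= value < upper:
            return j
    raise ValueError("value outside every level")


n = 8
mismatches = [x for x in product((0, 1), repeat=n) if binval_level(x) != leading_ones(x)]
print(len(mismatches))
```

For $n = 8$ the list of mismatches is empty, so the two partitions coincide on all $2^8$ bitstrings.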

Corollary 23
The PBiL with an offspring population size ≥ c log n for a sufficiently large constant c > 0, a constant smoothing parameter ∈ (1∕e, 1], and a constant selective pressure satisfying for any constant > 0, has an expected runtime of O n log + n 2 on BinVAL.
Furthermore, due to the similarity between the PBIL and the $\lambda$-MMAS [42], we are now able to establish the expected runtime of the $\lambda$-MMAS on the LeadingOnes and BinVal functions. For the $\lambda$-MMAS, we have $\mu = 1$; substituting this into Eq. 20 and noting also that $\lambda \ge c\log n$, we then obtain the following corollary.

Corollary 24
The $\lambda$-MMAS with a population size $\lambda \ge c\log n$ for a sufficiently large constant $c > 0$ has an expected runtime of $O(n\lambda\log\lambda + n^2)$ on the LeadingOnes and BinVal functions.

Runtime Analyses on Noisy LeadingOnes
We also consider a prior noise model and formally define the problem for any constant 0 < p < 1 as follows.
We denote by $F$ the noisy fitness and by $f$ the actual fitness. For simplicity, we also denote by $P_t$ the population prior to noise. The same noise model is studied in [4,11,17,41,43] for population-based EAs on the OneMax and LeadingOnes functions.
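A minimal sketch of a prior noise model of this kind is given below. It assumes the common one-bit variant, in which with probability $p$ a single bit position chosen uniformly at random is flipped before the fitness is evaluated, and the search point itself is left unchanged. The function names are hypothetical.

```python
import random


def leading_ones(x):
    """Actual fitness f: number of leading 1s of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def noisy_leading_ones(x, p, rng):
    """Noisy fitness F under the assumed one-bit prior noise model."""
    if rng.random() < p:
        y = list(x)                 # the individual itself is not modified
        i = rng.randrange(len(y))   # uniformly random bit position
        y[i] = 1 - y[i]
        return leading_ones(y)
    return leading_ones(x)


rng = random.Random(0)
x = [1, 1, 0, 1, 1]
noise_free = [noisy_leading_ones(x, 0.0, rng) for _ in range(20)]
always_noisy = [noisy_leading_ones([1] * 6, 1.0, rng) for _ in range(20)]
print(noise_free[0], sorted(set(always_noisy)))
```

Under this assumption, an individual with actual fitness below $j$ can receive a noisy fitness of at least $j$ only if noise occurs and one specific bit is flipped, which happens with probability at most $p/n$.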
We shall make use of the level-based theorem and first partition the search space X into n + 1 disjoint subsets A 0 , … , A n as in Eq. 19. Recall that A ≥j = ∪ n i=j A i . We then need to verify three conditions (G1), (G2) and (G3) of the level-based theorem, where due to the presence of noise we choose the parameter 0 = ∕((1 − )(1 − p)) for any constant ∈ (0, 1) and the selective pressure = ∕ to leverage the impact of noise in our analysis. The following lemma tells us the number of individuals in the population in iteration t which have fitness Proof We take an alternative view on the sampling of the population and the application of noise. More specifically, we first sample the population, sort it in descending order according to the true fitness, and then noise occurs at any individual w.p. p. Because noise does not occur at an individual w.p. 1 − p , amongst the 0 individuals in levels A ≥j , in expectation there are individuals unaffected by noise. Furthermore, by a Chernoff bound [36], there are at least (1 − ) ⋅ ∕(1 − ) = such individuals for some constant 0 < < 1 w.p. at least 1 − e −( 2 ∕2)⋅ ∕(1− ) = 1 − e −Ω( ) , which proves the first statement.
For the second statement, we only consider individuals with actual fitness f (x) < j and noisy fitness F(x) ≥ j in the population. If such an individual is selected when updating the model, it will introduce a 0 to the total number of 0s among the fittest individuals for the first j bits. Let B denote the number of such individuals. There are at most (1 − 0 ) individuals with actual fitness f (x) < j , so the probability that their noisy fitness values are at least F(x) ≥ j is at most p/n because a specific bit must be flipped in the prior noise model. Hence the expected number of these individuals is upper bounded by We now show by a Chernoff bound that the event B ≥ for a small constant ∈ (0, 1) occurs w.p. at most e −Ω( ) . We shall rely on the fact that p∕n ≤ ∕2 for sufficiently large n, which follows from the assumption ∕ = O(1) . We use the parameter ∶= ∕ [B] − 1 , which by (32) and the assumption p∕n ≤ ∕2 satisfies ≥ n∕(p ) − 1 ≥ 1 . We also have the lower bound A Chernoff bound [36] now gives the desired result which completes the proof. ◻ We now derive upper bounds on the expected runtime of the UMDA on the LeAD-IngOnes function in the noisy environment.

Theorem 26
Consider a prior noise model with constant parameter p ∈ (0, 1). The UMDA with a parent population size ≥ c log n for some sufficiently large constant c > 0 and a constant selective pressure satisfying for some constants , ∈ (0, 1) has an expected runtime of O n log + n 2 on the LeADingOnes function.

Proof
We will apply the level-based theorem. Each level A j for j ∈ [n] ∪ {0} is formally defined as in (19), and there are a total of m∶=n + 1 levels.
For the condition (G1), we assume that |P t ∩ A ≥j | ≥ 0 , and we are required to show that the probability of sampling an offspring in levels A ≥j+1 in iteration t + 1 is lower bounded by a value z j . We choose the parameter 0 = ∕((1 − )(1 − p)) for any constant ∈ (0, 1) and the constant selective pressure = ∕ . For convenience, we also partition the noisy population into four groups: 1. Individuals with fitness f (x) ≥ j and F(x) ≥ j. 2. Individuals with fitness f (x) ≥ j and F(x) < j. 3. Individuals with fitness f (x) < j and F(x) ≥ j. 4. Individuals with fitness f (x) < j and F(x) < j.
By Lemma 25, there are at least individuals in group 1 w.p. 1 − e −Ω( ) . The algorithm selects the fittest individuals according to the noisy fitness values to update the probabilistic model. Hence, unless the mentioned event does not happen, no individuals from group 2 or group 4 will be included when updating the model. which holds for any vector q∶=(q 1 , … , q j ) which majorises the vector (p t,1 , … , p t,j ) . By Definition 3, we construct such a vector q which by the definition majorises the vector (p t,1 , … , p t,j ) as follows.
We now show that with high probability, the vector element q j stays within the interval [1 − 1∕n − , 1 − 1∕n] , i.e., q j is indeed a probability. Since p t,i ≤ 1 − 1∕n for all i ≤ j , we have the upper bound q j ≤ (1 − 1∕n)j − (1 − 1∕n)(j − 1) = 1 − 1∕n . For the lower bound, we note from (34) that p t,i ≥ 1 − Q i ∕ − 1∕n for all i ≤ j and any Q i ≥ 0 , so we also obtain By Lemma 25, we have B ≤ for some small constant ∈ (0, 1) w.p. 1 − e −Ω( ) . Assume that this high-probability event actually happens, we therefore have q j ≥ 1 − 1∕n − . From this result, the definition of the vector q and (35), we can conclude that the probability of sampling in iteration t + 1 an offspring x with actual fitness f (x) ≥ j is (1) since (1 − 1∕n) j−1 ≥ 1∕e for any n > 0 . Because we also have p t,j+1 ≥ 1∕n , the probability of sampling an offspring in levels A ≥j+1 is at least Ω(1) ⋅ (1∕n) = Ω(1∕n) . Thus, the condition (G1) holds with a value of z j = Ω(1∕n).
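The majorisation step can be illustrated numerically. The sketch below assumes that the vector $q$ places the upper border $1-1/n$ on the first $j-1$ components and the remaining weight $\sum_{i=1}^{j} p_{t,i} - (j-1)(1-1/n)$ on the last component, which is our reading of the construction from the bound on $q_j$ given above. It then checks that $q$ majorises $p$ and that the product of the components of $p$ is at least that of $q$. All names and example values are illustrative.

```python
from math import prod


def shift_weight_to_front(p, n):
    """Assumed construction of q: upper border on the first j-1 components."""
    upper = 1.0 - 1.0 / n
    j = len(p)
    q_last = sum(p) - (j - 1) * upper
    if not 0.0 <= q_last <= upper + 1e-12:
        raise ValueError("remaining weight is not a valid probability")
    return [upper] * (j - 1) + [q_last]


def majorises(q, p, tol=1e-12):
    """True if q majorises p: dominating sorted prefix sums and equal totals."""
    q_sum = p_sum = 0.0
    for a, b in zip(sorted(q, reverse=True), sorted(p, reverse=True)):
        q_sum += a
        p_sum += b
        if q_sum < p_sum - tol:
            return False
    return abs(q_sum - p_sum) <= tol


p = [0.9, 0.85, 0.9, 0.8]   # assumed marginals of the first j = 4 bits, n = 10
q = shift_weight_to_front(p, n=10)
print(q, majorises(q, p), prod(p), prod(q))
```

In this example the product of the entries of $q$ is smaller than that of $p$, so a lower bound on $\prod_i q_i$ is also a lower bound on the probability of sampling the first $j$ bits as ones.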
For the condition (G2), we assume further that |P t ∩ A ≥j+1 | ≥ for some value ∈ (0, 0 ) , and we are also required to show that the probability of sampling an offspring in levels A ≥j+1 is at least (1 + ) for some small constant ∈ (0, 1) . Because the marginal p t,j+1 can be lower bounded by ∕ , the above probability can be written as follows.
The condition (G3) requires the offspring population size to satisfy which, by noting that 0 = ( ∕ )∕((1 − )(1 − p)) , is equivalent to which can be easily satisfied by choosing a sufficiently large constant c in ≥ c log n .
Having verified the three conditions (G1), (G2) and (G3), and noting that ln( ∕(4 + z j )) < ln(3 ∕2) , the level-based theorem now guarantees an upper bound of Note that, throughout the proof, we always assume the occurrence of the following two events in each iteration (see Lemma 25): (1) The number of individuals in group 1 is at least w.p. 1 − e −Ω( ) , (2) The number of individuals in group 3 is B ≤ for some small constant ∈ (0, 1) w.p. 1 − e −Ω( ) .
By the union bound, either or all of these events happen in an iteration t ∈ ℕ with probability at most 2n −2c+1 + e −Ω( ) + e −Ω( ) = n −c 2 for some constant c 2 > 0 . The complementary event occurs with probability at least 1 − n −c 2 . Following the same line of argumentation as in [10,Theorem 8] (which has already been applied in the proof of Theorem 21), the overall expected runtime is O n log + n 2 . ◻ We remark here that the exponential lower bound in Theorem 15 for the LeADIn-gOnes function without noise also holds for the noisy LeADIngOnes function. We are also interested in the runtime of the PBIL on the noisy LeADIngOnes. The following theorem derives such a result.

Theorem 27
Consider the prior noise model with constant parameter p ∈ (0, 1). The PBiL with a parent population size ≥ c log n for some sufficiently large constant c > 0, a constant smoothing parameter ∈ (1∕e, 1], and also a constant selective pressure satisfying for some constant ∈ (0, 1) has an expected runtime of O n 2 + n log on the LeADingOnes function.
Proof We assume that 0 = ∕((1 − )(1 − p)) for any constant ∈ (0, 1) and the selective pressure = ∕ . We also partition the noisy population into four groups as in the proof of Theorem 26 and pessimistically assume that the PBIL uses all of the B individuals in group 3 and − B individuals chosen from group 1 when updating the model. For all i ∈ [j] , let Q i be the number of individuals in group 3 which has 1s at bit positions 1 through j, except for one position i where it has a 0. By definition, we then have Similarly to the proof of Theorem 21, we shall show that the probability of sampling the first j bits correctly is at least a constant using a majorisation argument. Because noise only impacts the weight ∑ j i=1 p t+1,i , we still define the vector q as in (21) and an integer m = ⌊g(j)⌋ as in (22). We are left to show a constant upper bound on the difference j − m used in Eq. 23. We notice that in this case the weight becomes which by noting that Putting everything together, we then obtain which by (33) that B ≤ for some small constant > 0 w.p. 1 − e −Ω( ) satisfies Thus, the probability of sampling an offspring in levels A ≥j is at least a constant, which immediately results in a lower bound of Ω(1∕n) on the probability of sampling the first j + 1 bits correctly, confirming the condition (G1) of the level-based theorem.
For condition (G2), we use the lower bound p t+1,j+1 ≥ ∕ = ∕ . Then, the probability of sampling an offspring in levels A ≥j+1 is which always holds if we choose the selective pressure = ∕ such that Similar to Eq. 27, we also require ∈ (1∕e, 1] . The condition (G2) is now verified.
For condition (G3), it suffices to use a population size ≥ c log n , for a sufficiently large constant c > 0 . Having verified three conditions, Theorem 1 now guarantees an upper bound of O n 2 + n log . Note that throughout the proof we always assume the occurrence of the following three events: (1) Each of the first j marginals is at least p 0 ≥ 0 ∕(1 + ) w.p. 1 − 2n −2c for any constants c > 0 and > 0 , which requires a population of ≥ c((1 + 1∕ )∕ 0 ) 2 ln(n) = Ω(log n) (see Lemma 19), By the union bound, either or all of these events happen in an iteration t ∈ ℕ with probability at most 2n −2c+1 + e −Ω( ) + e −Ω( ) = n −c 2 for some constant c 2 > 0 . The complementary event occurs with probability at least 1 − n −c 2 . Following the same line of argumentation as in [10,Theorem 8] (which has already been applied in the proofs of Theorems 21 and 26), the overall expected runtime is O n log + n 2 .
◻

Figure 2 plots the threshold on the selective pressure in Eq. 36 for two noise probabilities, p = 0.2 and p = 0.95.

Experiments
In this section, we provide an empirical study to see how closely the theoretical results match the experimental results for reasonable problem instance sizes, and to investigate a broader range of parameters. Our analysis focuses on different regimes of the selective pressure in the noise-free setting.

Under Low Selective Pressure
We have shown in Theorem 15 that when the selective pressure satisfies $\mu/\lambda \ge (1+\delta)/e^{1-\varepsilon}$ for any constants $\delta > 0$ and $\varepsilon \in (0,1)$, the UMDA requires an expected runtime of $e^{\Omega(\mu)}$ to optimise the LeadingOnes function. Choosing $\delta = 0.2$ and $\varepsilon = 0.1$, we get $\mu/\lambda \ge (1+0.2)/e^{1-0.1} \approx 0.4879$. Thus, the choice $\mu/\lambda = 0.5$ should be sufficient to yield an exponential runtime. For the population size, we experiment with two different settings: $\mu = 5\log n$ (small) and $\mu = n$ (large) for a problem instance size $n = 100$. Substituting everything into (11) and (12), we then get two threshold values of approximately 47 and 87. The numbers of leading 1s of the fittest individual and of the $\mu$-th individual in the sorted population (denoted by the random variables $Z^*_t$ and $Z_t$, respectively) are shown in Fig. 3 over an epoch of 5000 iterations. The dotted blue lines mark the two threshold values, 47 and 87. One can see that the $Z_t$-values keep increasing until they reach the lower threshold during the early stage and always stay well below the upper threshold afterwards. Furthermore, the $Z^*_t$-values do not deviate too far from $Z_t$, which matches our analysis, since the chance of sampling all ones in the remaining trailing bits is exponentially small. We also run the same experiments for the PBIL with a smoothing parameter of $0.5 \in (1/e, 1]$. As predicted, one can see that the two random variables $Z_t$ and $Z^*_t$ stay well below the threshold (Fig. 4).
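The setup described above can be mimicked with a short simulation. The following is a minimal sketch under stated assumptions, not the code used for the figures: it assumes the textbook UMDA (sample λ offspring, keep the μ fittest, set each marginal to the frequency among the selected individuals), marginals clamped to [1/n, 1 − 1/n], arbitrary tie-breaking, and a `smoothing` argument that applies the standard PBIL convex-combination update, where `smoothing=1.0` recovers the UMDA. The names `leading_ones` and `run_trace`, the seed handling, and the use of the natural logarithm in μ = 5 log n are illustrative choices.

```python
import math
import random


def leading_ones(x):
    """Return the number of consecutive 1-bits at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def run_trace(n, mu, lam, iterations, smoothing=1.0, seed=0):
    """Record (Z*_t, Z_t) per iteration on LeadingOnes.

    Z*_t is the fitness of the fittest offspring, Z_t that of the mu-th
    offspring in the sorted population. smoothing=1.0 gives the UMDA;
    smaller values give a PBIL-style update (assumed standard rule).
    """
    if not 1 <= mu <= lam:
        raise ValueError("need 1 <= mu <= lam")
    rng = random.Random(seed)
    low, high = 1.0 / n, 1.0 - 1.0 / n  # assumed margins on the marginals
    p = [0.5] * n
    trace = []
    for _ in range(iterations):
        # Sample lam offspring from the product distribution p.
        offspring = [[1 if rng.random() < p_i else 0 for p_i in p]
                     for _ in range(lam)]
        # Sort by fitness, best first (ties broken arbitrarily).
        offspring.sort(key=leading_ones, reverse=True)
        z_star = leading_ones(offspring[0])
        z_mu = leading_ones(offspring[mu - 1])
        trace.append((z_star, z_mu))
        if z_star == n:
            break
        # Update each marginal from the mu selected individuals.
        for i in range(n):
            freq = sum(x[i] for x in offspring[:mu]) / mu
            updated = (1.0 - smoothing) * p[i] + smoothing * freq
            p[i] = min(max(updated, low), high)
    return trace


# Threshold on mu / lam for delta = 0.2 and epsilon = 0.1.
pressure_threshold = (1 + 0.2) / math.exp(1 - 0.1)

if __name__ == "__main__":
    n = 100
    mu = round(5 * math.log(n))  # "small" setting; log base is an assumption
    lam = 2 * mu                 # selective pressure mu / lam = 0.5
    trace = run_trace(n, mu, lam, iterations=200)
    print(round(pressure_threshold, 4))
    print(trace[-1])
```

Calling `run_trace(..., smoothing=0.5)` gives the PBIL-style variant with the smoothing parameter used in the experiment; plotting the two components of the returned trace against the iteration index yields curves of the kind discussed above.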

Under High Selective Pressure
When the selective pressure is sufficiently high, that is, $\mu/\lambda \le (1 - o(1))(1-\delta)/e$ for any constant $\delta \in (0,1)$, there is an upper bound of $\mathcal{O}(n^2 + n\lambda\log\lambda)$ on the expected runtime [5]. Theorem 18 yields a lower bound of Ω(n ∕ log n) . We start by looking at how the values of the random variables $Z_t$ and $Z^*_t$ change over time. Our analysis shows that the value never decreases during the whole optimisation course with overwhelming probability and eventually reaches $n$. As before, we consider the two different settings for the population size and note that our result holds for a parent population size $\mu \ge c\log n$, where the constant $c > 0$ must be tuned carefully; in this experiment, we set $c = 5$ (an integer larger than 3 should be sufficient). We then get $\mu/\lambda \le (1 - 1/100)(1 - 0.1)/e \approx 0.3278$. Therefore, the choice of $\mu/\lambda = 0.1$ should be sufficient, and the corresponding threshold value is then approximately $160 > n = 100$. The experiment outcomes are shown in Fig. 5. The empirical result shows that both the $Z$- and $Z^*$-values keep increasing over the whole course of optimisation, matching our findings in Sect. 4.1. Furthermore, the difference between the $Z$- and $Z^*$-values in each iteration is relatively small, which again matches the result of Lemma 17.
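The arithmetic behind the high-pressure setting can be checked in a few lines. This is only a sketch: the function name `high_pressure_bound` is hypothetical, and reading the o(1) term as 1/n (so that 1/100 corresponds to n = 100) is an assumption made for illustration.

```python
import math


def high_pressure_bound(n, delta):
    """Evaluate (1 - 1/n) * (1 - delta) / e, taking the o(1) term as 1/n."""
    return (1.0 - 1.0 / n) * (1.0 - delta) / math.e


bound = high_pressure_bound(n=100, delta=0.1)
chosen_pressure = 0.1  # value of mu / lambda used in the experiment

# The chosen selective pressure must not exceed the bound.
print(round(bound, 4), chosen_pressure <= bound)
```

With these inputs the bound rounds to 0.3278, so a selective pressure of 0.1 lies comfortably inside the high-pressure regime.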

Conclusion
In this paper, we have derived runtime results for population-based univariate EDAs (i.e., the UMDA and the PBIL) on the LeadingOnes function, a well-known test problem in the theory of evolutionary computation. For the UMDA, we have found that the algorithm under a low selective pressure requires an expected runtime that is exponential in the population size. More specifically, the algorithm takes an expected runtime of $2^{\Omega(\mu)}$ when $\mu \ge c\log n$ for a sufficiently large constant $c > 0$ and $\mu/\lambda \ge (1+\delta)/e^{1-\varepsilon}$ for any constants $\delta > 0$ and $\varepsilon \in (0,1)$. The analyses reveal the limitations of the probabilistic model based on probability vectors, as the algorithm hardly stays at promising states for long enough to make progress. This leads the algorithm into a non-optimal equilibrium state from which the global optimum is exponentially unlikely to be sampled. On the other hand, when the selective pressure is high, we obtain a lower bound of Ω(n ∕ log ) on the expected runtime for the algorithm.
We then moved on to consider the PBIL on the LeadingOnes function. The algorithm is shown to optimise the function within an expected runtime of $\mathcal{O}(n^2)$ for appropriate parameter settings. Our findings here improve the currently best-known upper bound of $\mathcal{O}(n^{2+c})$ in [47] by a significant factor of $\Theta(n^c)$ for some constant $c \in (0,1)$.
Furthermore, we study for the first time the performance of the UMDA and the PBIL on the LeadingOnes function under a prior noise model, where a uniformly chosen bit is flipped with a constant probability $p \in (0,1)$ before invoking the fitness function. We show that an $\mathcal{O}(n^2)$ expected runtime still holds in this case for both algorithms under an offspring population size $\lambda = \Omega(\log n) \cap \mathcal{O}(n/\log n)$. Despite the simplicity of the noise model, this can be viewed as a first step towards broadening our understanding of the two algorithms' behaviour in a noisy environment.
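The prior noise model described above is simple to state in code. The following is a minimal sketch, assuming bit strings are represented as Python lists of 0/1 values; the function names and the explicit `rng` argument are illustrative and not taken from the paper.

```python
import random


def leading_ones(x):
    """Return the number of consecutive 1-bits at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def noisy_leading_ones(x, p, rng):
    """Prior noise: with probability p, flip one uniformly chosen bit of a
    copy of x, then evaluate LeadingOnes on the (possibly) perturbed copy."""
    y = list(x)  # the search point itself is left unchanged
    if rng.random() < p:
        i = rng.randrange(len(y))
        y[i] = 1 - y[i]
    return leading_ones(y)


if __name__ == "__main__":
    rng = random.Random(42)
    x = [1] * 10
    print([noisy_leading_ones(x, 0.5, rng) for _ in range(5)])
```

In a population-based algorithm, the noisy value would only be used to rank the offspring; the individuals that enter the model update are the original, unperturbed bit strings.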
The UMDA with an offspring population size $\lambda = \Omega(\log n) \cap \mathcal{O}(n/\log n)$ needs an $\mathcal{O}(n^2)$ expected runtime on the LeadingOnes function [5]. In this case, Theorem 18 yields a lower bound of $\Omega(n^2/\log^2 n)$. Thus, it remains open whether this gap of $\Theta(\log^2 n)$ could be closed to achieve a tight bound on the runtime. Note that our result in Theorem 15, together with Theorem 16, provides bounds on the expected runtime of the UMDA on the LeadingOnes function when the selective pressure is low and high (around the threshold value of 1/e). Although we could choose the constants small or large enough such that the selective pressure becomes arbitrarily close to 1/e, it is still unknown whether the UMDA will take a polynomial or exponential expected runtime when the selective pressure is exactly 1/e. Another avenue for future work would be to investigate the PBIL with a smoothing parameter in $(0, 1/e)$, a regime that our analysis does not cover.

Additional Results
In the following variant of Markov's inequality, we use the notation x ∨ y ∶= max(x, y).

Lemma 28
Assume any random variable ∈ ℕ and a sequence of events F 0 , F 1 , … such that for all s, i ∈ ℕ , the event ≤ s is independent of the event F s+i . Define for all t ∈ ℕ the event G t ∶= ⋀ t i=0 (¬F i ) . For any t ∈ ℝ, s ∈ ℕ with s ≥ t , if Pr G ∨s > 0 and then Pr ( > s) ≤ 1 − Pr G s (1 − t∕s).
Proof By the law of total probability, we have