Level-Based Analysis of the Univariate Marginal Distribution Algorithm

Estimation of Distribution Algorithms (EDAs) are stochastic heuristics that search for optimal solutions by learning and sampling from probabilistic models. Despite their popularity in real-world applications, there is little rigorous understanding of their performance. Even for the Univariate Marginal Distribution Algorithm (UMDA), a simple population-based EDA assuming independence between decision variables, the optimisation time on the linear problem OneMax was until recently undetermined. The incomplete theoretical understanding of EDAs is mainly due to the lack of appropriate analytical tools. We show that the recently developed level-based theorem for non-elitist populations, combined with anti-concentration results, yields upper bounds on the expected optimisation time of the UMDA. This approach results in the bound O(nλ log λ + n²) on the LeadingOnes and BinVal problems for population sizes λ > µ = Ω(log n), where µ and λ are parameters of the algorithm. We also prove that the UMDA with population sizes µ ∈ O(√n) ∩ Ω(log n) optimises OneMax in expected time O(λn), and for larger population sizes µ = Ω(√n log n), in expected time O(λ√n). The facility and generality of our arguments suggest that this is a promising approach to derive bounds on the expected optimisation time of EDAs.


Introduction
Estimation of Distribution Algorithms (EDAs) are a class of randomised search heuristics with many practical applications [14,19,23,47,48]. Unlike traditional Evolutionary Algorithms (EAs), which search for optimal solutions using genetic operators such as mutation or crossover, EDAs build and maintain a probability distribution of the current population over the search space, from which the next generation of individuals is sampled. Several EDAs have been developed over the last decades. The algorithms differ in how they capture interactions among decision variables, as well as in how they build and update their probabilistic models. EDAs are often classified as either univariate or multivariate; the former treat each variable independently, while the latter also consider variable dependencies [40]. Well-known univariate EDAs include the compact Genetic Algorithm (cGA [20]), the Population-Based Incremental Learning Algorithm (PBIL [4]), and the Univariate Marginal Distribution Algorithm (UMDA [36]). Given a problem instance size n, univariate EDAs represent their probabilistic models as n-vectors, where each vector component is called a marginal. Some Ant Colony Optimisation (ACO) algorithms and even certain single-individual EAs can be cast in the same framework as univariate EDAs (or n-Bernoulli-λ-EDAs; see, e.g., [17,42,21,24]). Multivariate EDAs, such as the Bayesian Optimisation Algorithm, which builds a Bayesian network with nodes and edges representing variables and conditional dependencies respectively, attempt to learn relationships between decision variables [21]. The surveys [1,21,39] describe further variants and applications of EDAs.
Recently, EDAs have drawn growing attention from the theory community of evolutionary computation [10,17,26,44,46,25,45,27,12,31]. The aim of theoretical analyses of EDAs is to gain insight into the behaviour of the algorithms when optimising an objective function, especially in terms of the optimisation time, that is, the number of function evaluations required by the algorithm until an optimal solution is found for the first time. Droste [13] provided the first rigorous runtime analysis of an EDA, specifically the cGA. Introduced in [20], the cGA samples two individuals in each generation and updates the probabilistic model according to the fitter of the two. A quantity of ±1/K is added to the marginal at each bit position where the two individuals differ. The reciprocal K of this quantity is often referred to as the abstract population size of a genetic algorithm that the cGA is supposed to model. Droste showed a lower bound of Ω(K√n) on the expected optimisation time of the cGA for any pseudo-Boolean function [13]. He also proved the upper bound O(nK) for any linear function, where K = n^{1/2+ε} for any small constant ε > 0.
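The update rule just described can be sketched in a few lines (a minimal Python sketch of the cGA without margins; the function name and interface are our own, for illustration only):

```python
import random

def cga_generation(p, K, f):
    """One generation of the cGA (without margins): sample two individuals
    from the product distribution given by the marginals p, then shift each
    marginal by 1/K towards the bit value of the fitter individual at every
    position where the two individuals differ."""
    n = len(p)
    x = [1 if random.random() < p[i] else 0 for i in range(n)]
    y = [1 if random.random() < p[i] else 0 for i in range(n)]
    if f(y) > f(x):
        x, y = y, x                      # make x the fitter of the two
    for i in range(n):
        if x[i] != y[i]:
            p[i] += 1.0 / K if x[i] == 1 else -1.0 / K
    return p
```

With f = sum (OneMax), the marginals tend to drift towards one; without margins they may also fix at zero, which is exactly the premature-convergence issue discussed below.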
Note that each marginal of the cGA considered in [13] is allowed to reach the extreme values zero and one. Such an algorithm is referred to as an EDA without margins; in contrast, it is possible to impose margins (also called borders) on the range of values of each marginal to keep it away from the extreme probabilities, often within the interval [1/n, 1 − 1/n]. An EDA without margins can prematurely converge to a suboptimal solution; thus, the runtime bounds of [13] were in fact conditioned on the event that early convergence never happens. Very recently, Witt [45] studied an effect called domino convergence in EDAs, where bits with heavy weights tend to be optimised before bits with light weights. By deriving a lower bound of Ω(n²) on the expected optimisation time of the cGA on BinVal for any value of K > 0, Witt confirmed the claim made earlier by Droste [13] that BinVal is a harder problem for the cGA than OneMax. Moreover, Lengler et al. [31] considered K = O(√n/log² n), which was not covered by Droste in [13], and obtained a lower bound of Ω(K^{1/3}n + n log n) on the expected optimisation time of the cGA on OneMax. Note that if K = Θ(√n/log² n), this lower bound becomes Ω(n^{7/6}/log² n), which further tightens the bounds on the expected optimisation time of the cGA.
An algorithm closely related to the cGA with (imposed) margins is the 2-Max-Min Ant System with iteration-best update (2-MMAS_ib). The two algorithms differ only slightly in the update procedure of the model, and the 2-MMAS_ib is parameterised by an evaporation factor ρ ∈ (0, 1). Sudholt and Witt [42] proved the lower bounds Ω(K√n + n log n) and Ω(√n/ρ + n log n) for the two algorithms on OneMax under any parameter setting, and the upper bounds O(K√n) and O(√n/ρ) when K and 1/ρ are in Ω(√n log n). Thus, the optimal expected optimisation time Θ(n log n) of the cGA and the 2-MMAS_ib on OneMax is achieved by setting K and 1/ρ to Θ(√n log n). The analyses revealed that choosing lower parameter values results in strong fluctuations that may cause many marginals (or pheromones, in the context of ACO) to fix early at the lower margin, which then need to be repaired later. On the other hand, choosing higher parameter values resolves this issue but may slow down the learning process.
Friedrich et al. [17] pointed out two behavioural properties of univariate EDAs at each bit position: a balanced EDA would be sensitive to signals in the fitness, while a stable one would remain uncommitted under a biasless fitness function. During the optimisation of LeadingOnes, when some bit positions are temporarily neutral while others are not, both properties appear useful to avoid committing to wrong decisions. Unfortunately, many univariate EDAs without margins, including the cGA, the UMDA, the PBIL and some related algorithms, are balanced but not stable [17]. A more stable variant of the cGA, the so-called stable cGA (scGA), was therefore introduced in [17]. Under appropriate settings, it yields an expected optimisation time of O(n log n) on LeadingOnes with probability polynomially close to one. Furthermore, a recent study by Friedrich et al. [16] showed that the cGA can cope with higher levels of noise more efficiently than mutation-only heuristics do.
Introduced by Baluja [4], the PBIL is another univariate EDA. Unlike the cGA, which samples two solutions in each generation, the PBIL samples a population of λ individuals, from which the µ fittest are selected to update the probabilistic model (truncation selection). The new probabilistic model is a convex combination, with smoothing parameter ρ ∈ (0, 1], of the current model and, at each bit position, the frequency of ones among the selected individuals. The PBIL can be seen as a special case of the cross-entropy method [38] on the binary hypercube {0, 1}^n. Wu et al. [46] analysed the runtime of the PBIL on OneMax and LeadingOnes. The authors argued that, due to the use of a sufficiently large population size, it is possible to prevent the marginals from reaching the lower border early, even when a large smoothing parameter ρ is used. Runtime results were proved for the PBIL without margins on OneMax and the PBIL with margins on LeadingOnes, and were then compared to the runtimes of some Ant System approaches. However, the required population size is large, i.e., λ = ω(n). Very recently, Lehre and Nguyen [27] obtained an upper bound of O(nλ log λ + n²) on the expected optimisation time of the PBIL with margins on BinVal and LeadingOnes, which improves the previously known upper bound in [46] by a factor of n^ε, for some positive constant ε, and for smaller population sizes λ = Ω(log n).
The UMDA is a special case of the PBIL with the largest smoothing parameter ρ = 1; that is, the probabilistic model for the next generation depends solely on the selected individuals in the current population. This characteristic distinguishes the UMDA from the cGA and the PBIL in general. The algorithm has a wide range of applications, not only in computer science but also in other areas like population genetics and bioinformatics [19,48]. Moreover, the UMDA is related to the notion of linkage equilibrium [41,35], a popular model assumption in population genetics. Thus, studies of the UMDA can contribute to the understanding of population dynamics in population genetics.
Despite increasing momentum in the runtime analysis of EDAs over the last few years, our understanding of the UMDA in terms of runtime is still limited. The algorithm was first analysed in a series of papers [5,6,7,8], where time complexities of the UMDA on simple unimodal functions were derived. These results showed that the UMDA with margins often outperforms the UMDA without margins, especially on functions like BVLeadingOnes, which is a unimodal problem. The likely reason behind the failure of the UMDA without margins is fixation, which prevents any further progress on the corresponding decision variables. The UMDA with margins avoids this by ensuring that each search point always has a positive probability of being sampled. Shapiro investigated the UMDA with a selection mechanism different from truncation selection [40]. In particular, this variant of the UMDA updates the probabilistic model using only those individuals whose fitnesses are no less than the mean fitness of all individuals in the current population. By representing the UMDA as a Markov chain, the paper showed that the population size has to be at least √n for the UMDA to prevent the probabilistic model from quickly converging to a corner of the hypercube on OneMax. This phenomenon is well known as genetic drift [2]. A decade later, the first upper bound on the expected optimisation time of the UMDA on OneMax was obtained [10]. Working on the standard UMDA with truncation selection, Dang and Lehre [10] proved an upper bound of O(nλ log λ) on the expected optimisation time of the UMDA on OneMax, assuming a population size λ = Ω(log n). If λ = Θ(log n), this upper bound is O(n log n log log n). Inspired by the previous work of [42] on the cGA/2-MMAS_ib, Krejca and Witt [25] obtained a lower bound of Ω(µ√n + n log n) for the UMDA on OneMax via drift analysis, where λ = (1 + Θ(1))µ.
Compared to [42], their analysis is much more involved since, unlike in the cGA/2-MMAS_ib, where each change of a marginal between consecutive generations is small and limited by the smoothing parameter, large changes are always possible in the UMDA. From these results, we observe that the latest upper and lower bounds for the UMDA on OneMax still differ by a factor of Θ(log log n). This raises the question of whether this gap can be closed. This paper derives upper bounds on the expected optimisation time of the UMDA on the following problems: OneMax, BinVal and LeadingOnes. Preliminary versions of this work appeared in [10] and [26].
Here we use the improved version of the level-based analysis technique [9]. The analyses for LeadingOnes and BinVal are straightforward and similar to each other, yielding the same runtime O(nλ ln λ + n²); hence, they serve the purpose of introducing the technique in the context of EDAs. In particular, we only require population sizes λ = Ω(log n) for LeadingOnes, which is much smaller than previously thought necessary [6,7,8]. For OneMax, we give a more detailed analysis, deriving an expected optimisation time of O(n log n) if the population size is chosen appropriately. This significantly improves the results in [9,10] and matches the recent lower bound of [25] as well as the performance of the (1+1) EA. More specifically, we assume λ ≥ bµ for a sufficiently large constant b > 0, and separate two regimes of small and large selected populations: the upper bound O(λn) is derived for µ = Ω(log n) ∩ O(√n), and the upper bound O(λ√n) is shown for µ = Ω(√n log n). These results exhibit the applicability of the level-based technique to the runtime analysis of (univariate) EDAs. Table 1 summarises the latest results on the runtime analysis of univariate EDAs on simple benchmark problems; see [24] for a recent survey on the theory of EDAs. Related independent work: Witt [44] independently obtained the upper bounds O(λn) and O(λ√n) on the expected optimisation time of the UMDA on OneMax for µ = Ω(log n) ∩ o(n) and µ = Ω(√n log n), respectively, and λ = Θ(µ), using an involved drift analysis. While our results do not hold for µ = Ω(√n) ∩ O(√n log n), our methods yield significantly simpler proofs. Furthermore, our analysis also holds when the parent population size µ is not proportional to the offspring population size λ, a case not covered in [44].
This paper is structured as follows. Section 2 introduces the notation used throughout the paper and the UMDA with margins. We also introduce the techniques used, including the level-based theorem, which is central to the paper, and an important sharp bound on sums of Bernoulli random variables. Given all necessary tools, Section 3 presents upper bounds on the expected optimisation time of the UMDA on LeadingOnes and BinVal, and Section 4 derives the upper bounds on the expected optimisation time of the UMDA on OneMax; the latter consists of two subsections corresponding to two different ranges of the parent population size. Section 5 presents a brief empirical analysis of the UMDA on LeadingOnes, BinVal and OneMax to support the theoretical findings in Sections 3 and 4. Finally, our concluding remarks are given in Section 6.

Preliminaries
This section describes the three standard benchmark problems, the algorithm under investigation, and the level-based theorem, a general method to derive upper bounds on the expected optimisation time of non-elitist population-based algorithms. Furthermore, a sharp upper bound on sums of independent Bernoulli trials, which is essential in the runtime analysis of the UMDA on OneMax for small population sizes, is presented, followed by Feige's inequality. We use the following notation throughout the paper. The natural logarithm is denoted ln(·), and log(·) denotes the logarithm with base 2. Let [n] denote the set {1, 2, ..., n}. The floor and ceiling functions are ⌊x⌋ and ⌈x⌉, respectively, for x ∈ R. We often consider a partition of the finite search space X = {0, 1}^n into m ordered subsets A_1, ..., A_m called levels, i.e., A_i ∩ A_j = ∅ for any i ≠ j and ∪_{i=1}^m A_i = X. The union of all levels above level j, inclusive, is denoted A_{≥j} := ∪_{i=j}^m A_i. An optimisation problem on X is assumed, without loss of generality, to be the maximisation of some function f : X → R. A partition is called fitness-based (or f-based) if for any j ∈ [m − 1] and all x ∈ A_j, y ∈ A_{j+1}: f(y) > f(x). An f-based partition is called canonical when x, y ∈ A_j if and only if f(x) = f(y).
Given the search space X, each x ∈ X is called a search point (or individual), and a population is a vector of search points, i.e., P ∈ X^λ. For a finite population P = (x_1, ..., x_λ), we write |P ∩ A_j| := |{i ∈ [λ] : x_i ∈ A_j}| for the number of individuals in population P which are in level A_j. Truncation selection, denoted (µ, λ)-selection for some µ < λ, applied to population P transforms it into a vector P′ (called the selected population) with |P′| = µ by discarding the λ − µ worst search points of P with respect to some fitness function f, where ties are broken uniformly at random.

Three Problems
We consider three pseudo-Boolean functions, OneMax, LeadingOnes and BinVal, which are defined over the finite binary search space X = {0, 1}^n and widely used as theoretical benchmark problems in runtime analyses of EDAs [13,10,27,25,44,46]. Note in particular that these problems serve only to describe and compare the behaviour of EDAs on problems with well-understood structure. The first problem, as its name suggests, simply counts the number of ones in the bitstring and is widely used to test the performance of EDAs as hill climbers [24]. While the bits in OneMax all contribute equally to the overall fitness, BinVal, which aims at maximising the binary value of the bitstring, has exponentially scaled weights relative to bit positions. In contrast, LeadingOnes counts the number of leading ones in the bitstring. Since the bits in this problem are highly correlated, it is often used to study the ability of EDAs to cope with dependencies among decision variables [24].
The global optimum of all three functions is the all-ones bitstring 1^n. For any bitstring x = (x_1, ..., x_n) ∈ X, these functions are defined as follows:

OneMax(x) := ∑_{i=1}^{n} x_i,  LeadingOnes(x) := ∑_{i=1}^{n} ∏_{j=1}^{i} x_j,  BinVal(x) := ∑_{i=1}^{n} 2^{n−i} x_i.
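Following the verbal descriptions above, the three benchmarks translate directly into code (a straightforward Python transcription, with function names of our choosing):

```python
def onemax(x):
    """Number of one-bits in the bitstring."""
    return sum(x)

def leadingones(x):
    """Number of leading one-bits before the first zero."""
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count

def binval(x):
    """Binary value of the bitstring, most significant bit first."""
    n = len(x)
    return sum(bit << (n - 1 - i) for i, bit in enumerate(x))
```

All three are maximised by the all-ones bitstring, matching the remark above.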

Univariate Marginal Distribution Algorithm
Introduced by Mühlenbein and Paaß [36], the Univariate Marginal Distribution Algorithm (UMDA; see Algorithm 1) is one of the simplest EDAs, assuming independence between the decision variables. To optimise a pseudo-Boolean function f : {0, 1}^n → R, the algorithm follows an iterative process: sample independently and identically a population of λ offspring from the current probabilistic model, and update the model using the µ fittest individuals in the current population. Each sample-and-update cycle is called a generation (or iteration). The probabilistic model in generation t ∈ N is represented as a vector p_t = (p_t(1), ..., p_t(n)) ∈ [0, 1]^n, where each component (or marginal) p_t(i), for i ∈ [n] and t ∈ N, is the probability of sampling a one at the i-th bit position of an offspring in generation t.
Algorithm 1: UMDA with margins (parameters: offspring population size λ, parent population size µ; maximising f).
The probabilistic model is initialised as p_0(i) := 1/2 for each i ∈ [n]. In each generation, λ individuals are sampled from the joint probability distribution

Pr(x | p_t) = ∏_{i=1}^{n} p_t(i)^{x_i} (1 − p_t(i))^{1−x_i},  (1)

and the µ fittest of them are selected. Let x^{(k)}_{t,i} denote the value of the i-th bit position of the k-th individual in the current sorted population P_t. For each i ∈ [n], the corresponding marginal of the next model is

p_{t+1}(i) := (1/µ) ∑_{k=1}^{µ} x^{(k)}_{t,i},

which can be interpreted as the frequency of ones among the µ fittest individuals at bit position i.
The extreme probabilities, zero and one, must be avoided for each marginal p_t(i); otherwise, the bit at position i would remain fixed forever at either zero or one, obstructing some regions of the search space. To avoid this, all marginals p_{t+1}(i) are usually restricted to the closed interval [1/n, 1 − 1/n], and the values 1/n and 1 − 1/n are called the lower and upper borders, respectively. The algorithm in this case is known as the UMDA with margins.
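Putting the sampling, truncation-selection and margin steps together, the generation loop of Algorithm 1 can be sketched as follows (a compact Python sketch; the function signature and the `target` stopping criterion are our additions for illustration):

```python
import random

def umda(f, n, lam, mu, target, max_gens=10_000):
    """UMDA with margins: in each generation, sample lam offspring from the
    current product distribution p, select the mu fittest, and set each
    marginal to the frequency of ones among them, restricted to the
    interval [1/n, 1 - 1/n]."""
    lo, hi = 1.0 / n, 1.0 - 1.0 / n
    p = [0.5] * n                      # model initialised to 1/2 per bit
    for gen in range(1, max_gens + 1):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]
        pop.sort(key=f, reverse=True)  # (mu, lam) truncation selection
        if f(pop[0]) >= target:
            return gen                 # generations until an optimum is sampled
        fittest = pop[:mu]
        for i in range(n):
            freq = sum(x[i] for x in fittest) / mu
            p[i] = min(max(freq, lo), hi)   # impose the margins
    return None
```

For example, `umda(sum, 20, 50, 15, 20)` runs the algorithm on OneMax with n = 20, λ = 50, µ = 15 and returns the generation in which the all-ones string is first sampled.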

Level-Based Theorem
We are interested in the optimisation time of the UMDA, which is a non-elitist algorithm; thus, tools for analysing the runtime of this class of algorithms are of importance. In the current literature, drift theorems have often been used to derive upper and lower bounds on the expected optimisation time of the UMDA (see, e.g., [44,25]), because they allow us to examine the dynamics of each marginal in the vector-based probabilistic model. In this paper, we take another perspective, considering the population of individuals. To do this, we make use of the so-called level-based theorem, which was previously used to derive the first upper bound of O(nλ log λ) on the expected optimisation time of the UMDA on OneMax [10].
Algorithm 2: Non-elitist population-based algorithm.
Introduced by Corus et al. [9], the level-based theorem is a general tool that provides upper bounds on the expected optimisation time of many non-elitist population-based algorithms on a wide range of optimisation problems [9]. It has been applied to analyse the expected optimisation time of Genetic Algorithms with or without crossover on various pseudo-Boolean functions and combinatorial optimisation problems [9], self-adaptive EAs [11], the UMDA with margins on OneMax and LeadingOnes [10], and, very recently, the PBIL with margins on LeadingOnes and BinVal [27].
The theorem assumes that the algorithm to be analysed can be described in the form of Algorithm 2. The population P_t at generation t ∈ N consists of λ individuals and is represented as a vector (P_t(1), ..., P_t(λ)) ∈ X^λ. The theorem is general because it does not assume specific fitness functions, selection mechanisms, or genetic operators like mutation and crossover. Rather, it assumes that there exists, possibly implicitly, a mapping D from the set of populations X^λ to the space of probability distributions over the search space X. The distribution D(P_t) depends on the current population P_t, and all individuals in population P_{t+1} are sampled identically and independently from this distribution [9]. The assumption of independent sampling of the individuals holds for the UMDA and many other algorithms.
The theorem assumes a partition A_1, ..., A_m of the finite search space X into m subsets, which we call levels. We assume that the last level A_m consists of all optimal solutions. Given a partition of the search space X, we can state the level-based theorem as follows.
Theorem 4 ([9]). Given a partition (A_1, ..., A_m) of X, define T := min{tλ : |P_t ∩ A_m| > 0}, where for all t ∈ N, P_t ∈ X^λ is the population of Algorithm 2 in generation t. If there exist z_1, ..., z_{m−1}, δ ∈ (0, 1], and γ_0 ∈ (0, 1) such that for any population P_t ∈ X^λ,
(G1) for each level j ∈ [m − 1], if |P_t ∩ A_{≥j}| ≥ γ_0λ, then Pr_{y∼D(P_t)}(y ∈ A_{≥j+1}) ≥ z_j,
(G2) for each level j ∈ [m − 1] and all γ ∈ (0, γ_0], if |P_t ∩ A_{≥j}| ≥ γ_0λ and |P_t ∩ A_{≥j+1}| ≥ γλ, then Pr_{y∼D(P_t)}(y ∈ A_{≥j+1}) ≥ (1 + δ)γ,
(G3) and the population size λ ∈ N satisfies λ ≥ (4/(γ_0δ²)) ln(128m/(z_*δ²)), where z_* := min_{j∈[m−1]} z_j,
then E[T] ≤ (8/δ²) ∑_{j=1}^{m−1} (λ ln(6δλ/(4 + z_jδλ)) + 1/z_j).
Informally, the first condition (G1) requires that the probability of sampling an individual in levels A_{≥j+1} is at least z_j, given that at least γ_0λ individuals in the current population are in levels A_{≥j}. Condition (G2) further requires that, given that γ_0λ individuals of the current population belong to levels A_{≥j} and, moreover, γλ of them lie in levels A_{≥j+1}, the probability of sampling an offspring in levels A_{≥j+1} is at least (1 + δ)γ. The last condition (G3) sets a lower limit on the population size λ. As long as the three conditions are satisfied, an upper bound on the expected time for a population-based algorithm to reach the last level A_m is guaranteed.
To apply the level-based theorem, it is recommended to follow the five-step procedure in [9]: 1) identify a partition of the search space; 2) find appropriate parameter settings such that condition (G2) is met; 3) estimate a lower bound z_j to satisfy condition (G1); 4) ensure that the population size is large enough to satisfy condition (G3); and 5) derive the upper bound on the expected time to reach level A_m.
Note in particular that Algorithm 2 assumes a mapping D from the space of populations X^λ to the space of probability distributions over the search space. The mapping D is often said to depend on the current population only [9]; however, this is not strictly necessary. Very recently, Lehre and Nguyen [27] applied Theorem 4 to analyse the expected optimisation time of the PBIL with a sufficiently large offspring population size λ = Ω(log n) on LeadingOnes and BinVal, where the population for the next generation of the PBIL is sampled using a mapping that depends on the previous probabilistic model p_t in addition to the current population P_t. The rationale is that, in each generation, the PBIL draws λ samples from the probability distribution (1), corresponding to the λ individuals of the current population. If the number of samples λ is sufficiently large, it is highly likely that the empirical distributions for all positions among the entire population do not deviate too far from the true distributions, i.e., the marginals p_t(i) [27], due to the Dvoretzky-Kiefer-Wolfowitz inequality [33].
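This concentration phenomenon is easy to observe empirically: with enough samples, the per-position empirical frequencies stay close to the true marginals (a small Python sketch in the spirit of the DKW/Hoeffding-type argument; the function name is our own):

```python
import random

def max_marginal_deviation(p, lam, rng):
    """Sample lam individuals from the product distribution with marginals p
    and return the largest absolute deviation between the empirical frequency
    of ones and the true marginal, over all positions."""
    n = len(p)
    counts = [0] * n
    for _ in range(lam):
        for i in range(n):
            if rng.random() < p[i]:
                counts[i] += 1
    return max(abs(counts[i] / lam - p[i]) for i in range(n))
```

By Hoeffding's inequality, each per-position deviation exceeds ε with probability at most 2exp(−2λε²), so for λ = Ω(log n) a union bound keeps all n deviations small simultaneously with high probability.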

Feige's Inequality
In order to verify conditions (G1) and (G2) of Theorem 4 for the UMDA on OneMax using a canonical f-based partition A_1, ..., A_m, we later need a lower bound on the probability of sampling an offspring in given levels, that is, Pr_{y∼p_t}(y ∈ A_{≥j}), where y is an offspring sampled from the probability distribution (1). Let Y denote the number of ones in the offspring y. It is well known that the random variable Y follows a Poisson-Binomial distribution with expectation E[Y] = ∑_{i=1}^{n} p_t(i). A general result due to Feige [15] provides such a lower bound when Y < E[Y]; however, for our purposes, it will be more convenient to use the following variant [10].
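For small instances, the distribution of Y can be computed exactly by dynamic programming, which is handy for checking such tail bounds numerically (a standard convolution sketch in Python; the function names are ours):

```python
def poisson_binomial_pmf(p):
    """Exact pmf of Y = sum of independent Bernoulli(p[i]) variables,
    computed by convolving one position at a time."""
    dist = [1.0]                       # dist[y] = Pr(Y = y) after 0 positions
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for y, pr in enumerate(dist):
            new[y] += pr * (1 - pi)    # position sampled as a zero
            new[y + 1] += pr * pi      # position sampled as a one
        dist = new
    return dist

def tail(p, j):
    """Pr(Y >= j) for the Poisson-Binomial distribution with parameters p."""
    return sum(poisson_binomial_pmf(p)[j:])
```

With p = (1/2, ..., 1/2) this reduces to the binomial distribution, which gives a quick sanity check of the implementation.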

Anti-Concentration Bound
In addition to Feige's inequality, it is also necessary to compute an upper bound on the probability of sampling an offspring in a given level, that is, Pr_{y∼p_t}(y ∈ A_j) for any j ∈ [m], where y ∼ Pr(· | p_t) as defined in (1).
Let Y be a random variable following a Poisson-Binomial distribution, as introduced in the previous subsection. Baillon et al. [3] derived the following sharp upper bound on the probability Pr_{y∼p_t}(y ∈ A_j).
Theorem 6 (Adapted from Theorem 2.1 in [3]). Let Y be an integer-valued random variable that follows a Poisson-Binomial distribution with parameters n and p_t, and let σ_n² = ∑_{i=1}^{n} p_t(i)(1 − p_t(i)) be the variance of Y. For all n, y and p_t, it then holds that Pr(Y = y) ≤ η/σ_n, where η is an absolute constant.
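The bound can be checked numerically on small instances: the largest point probability of Y, multiplied by σ_n, stays below a modest constant (a Python sketch reusing the exact pmf computation; for the symmetric binomial case the product approaches 1/√(2π) ≈ 0.399):

```python
import math

def poisson_binomial_pmf(p):
    """Exact pmf of Y = sum of independent Bernoulli(p[i]) variables."""
    dist = [1.0]
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for y, pr in enumerate(dist):
            new[y] += pr * (1 - pi)
            new[y + 1] += pr * pi
        dist = new
    return dist

def max_pointmass_times_sigma(p):
    """max_y Pr(Y = y) multiplied by the standard deviation sigma_n of Y;
    by Theorem 6 this product is at most an absolute constant eta."""
    sigma = math.sqrt(sum(pi * (1 - pi) for pi in p))
    return max(poisson_binomial_pmf(p)) * sigma
```

Intuitively, the larger the variance σ_n², the flatter the distribution of Y, so no single value y can carry more than O(1/σ_n) probability mass.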

Runtime of the UMDA on LeadingOnes and BinVal
As a warm-up example, and to illustrate the method of level-based analysis, we consider the two functions LeadingOnes and BinVal, as defined in Definitions 2 and 3. It is well known that the expected optimisation time of the (1+1) EA on LeadingOnes is Θ(n²), and that this is optimal for the class of unary unbiased black-box algorithms [28]. An early analysis of the UMDA on LeadingOnes [8] required an excessively large population size, i.e., λ = ω(n² log n). Our analysis below shows that a population size λ = Ω(log n) suffices to achieve the expected optimisation time O(n²).
BinVal is a linear function with exponentially decreasing weights relative to the bit positions. Thus, the function is often regarded as an extreme linear function (the other extreme being OneMax) [13]. Droste [13] was the first to prove an upper bound of O(nK) = O(n^{2+ε}) on the expected optimisation time of the cGA on BinVal, assuming that ε > 0 is a constant. Regardless of the abstract population size K, Witt recently derived a lower bound of Ω(n²) on the expected optimisation time of the cGA on BinVal [45, Corollary 3.5], verifying the claim made earlier by Droste [13] that BinVal is a harder problem than OneMax for the cGA. We now give our runtime bounds for the UMDA on LeadingOnes and BinVal with a sufficiently large population size λ.
Proof. We apply Theorem 4 by following the guidelines from [9].
Step 1: For both functions, we define the levels A_j := {x ∈ X : LeadingOnes(x) = j − 1} for each j ∈ [n + 1]. Thus, there are m = n + 1 levels, ranging from A_1 to A_{n+1}. Note that the constant γ_0 appearing later in this proof is set to γ_0 := µ/λ, which coincides with the selective pressure of the UMDA.
For LeadingOnes, the partition is clearly f-based, as it is canonical for the function. For BinVal, however, note that since all the j − 1 leading bits of any x ∈ A_j are ones, the contribution of these bits to BinVal(x) is ∑_{i=1}^{j−1} 2^{n−i}. On the other hand, the contribution of bit position j is 0, and that of the last n − j bits is between 0 and ∑_{i=j+1}^{n} 2^{n−i} = 2^{n−j} − 1. Therefore, for any j ∈ [n], all x ∈ A_j and all y ∈ A_{j+1}, we have BinVal(y) ≥ ∑_{i=1}^{j} 2^{n−i} > ∑_{i=1}^{j−1} 2^{n−i} + 2^{n−j} − 1 ≥ BinVal(x); thus, the partition is also f-based for BinVal. This observation allows us to carry the proof arguments for LeadingOnes over to BinVal.
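This ordering argument is easy to verify exhaustively for small n (a brute-force Python check; the helper names are ours):

```python
from itertools import product

def binval(x):
    n = len(x)
    return sum(bit << (n - 1 - i) for i, bit in enumerate(x))

def leadingones(x):
    c = 0
    for bit in x:
        if bit == 0:
            break
        c += 1
    return c

def partition_is_f_based(n):
    """Check that, with levels A_j = {x : LeadingOnes(x) = j - 1}, every
    point in A_{j+1} has strictly larger BinVal than every point in A_j."""
    best, worst = {}, {}               # per level: max and min BinVal
    for x in product([0, 1], repeat=n):
        j = leadingones(x) + 1
        v = binval(x)
        best[j] = max(best.get(j, v), v)
        worst[j] = min(worst.get(j, v), v)
    levels = sorted(best)
    return all(worst[levels[i + 1]] > best[levels[i]]
               for i in range(len(levels) - 1))
```

The check enumerates all 2^n bitstrings, so it is only feasible for small n, but it confirms the inequality derived above.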
Step 3: In (G1), for any level j ∈ [n] satisfying |P_t ∩ A_{≥j}| ≥ γ_0λ = µ, we need a lower bound Pr(y ∈ A_{≥j+1}) ≥ z_j. Again, the condition on level j gives that the µ fittest individuals of P_t have at least j − 1 leading 1-bits, so p_{t+1}(i) = 1 − 1/n for all i ∈ [j − 1]. Due to the imposed lower margin, we can pessimistically assume that p_{t+1}(j) = 1/n. Hence,
Pr(y ∈ A_{≥j+1}) ≥ (1 − 1/n)^{j−1} · (1/n) ≥ (1/n)(1 − 1/n)^{n−1} ≥ 1/(en).
So, (G1) is satisfied for z_j := 1/(en).
Step 5: All conditions of Theorem 4 are satisfied, so the expected optimisation time of the UMDA on LeadingOnes is O(nλ log λ + n²). We now consider BinVal. In both problems, all that matters for determining the level of a bitstring is the position of its first zero-bit. For two bitstrings in the same level of BinVal, their rankings after the population is sorted are additionally determined by some less significant bits; however, the proof thus far never takes these bits into account. Hence, the expected optimisation time of the UMDA on LeadingOnes carries over to BinVal for the UMDA with margins using truncation selection.
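To illustrate the carry-over, the same UMDA code optimises both functions with identical parameters (a self-contained Python sketch; the population sizes here are illustrative, not the theoretically tuned ones):

```python
import random

def leadingones(x):
    c = 0
    for bit in x:
        if bit == 0:
            break
        c += 1
    return c

def binval(x):
    n = len(x)
    return sum(bit << (n - 1 - i) for i, bit in enumerate(x))

def umda_generations(f, n, lam, mu, target, max_gens=5000):
    """Generations until the UMDA with margins first samples an optimum."""
    lo, hi = 1.0 / n, 1.0 - 1.0 / n
    p = [0.5] * n
    for gen in range(1, max_gens + 1):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]
        pop.sort(key=f, reverse=True)
        if f(pop[0]) >= target:
            return gen
        for i in range(n):
            freq = sum(x[i] for x in pop[:mu]) / mu
            p[i] = min(max(freq, lo), hi)
    return None
```

On both functions the first zero-bit determines the level; BinVal additionally ranks individuals within a level by less significant bits, but, as argued above, this does not affect the level-based argument.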

Runtime of the UMDA on OneMax
We consider the problem in Definition 1, i.e., maximisation of the number of ones in a bitstring. It is well known that OneMax can be optimised in expected time Θ(n log n) by the simple (1+1) EA. The level-based theorem yielded the first upper bound on the expected optimisation time of the UMDA on OneMax, namely O(nλ log λ), assuming that λ = Ω(log n) [10]. This leaves open whether the UMDA is slower than the (1+1) EA and other traditional EAs on OneMax.
We now introduce additional notation used throughout this section. The following random variables, related to sampling from a Poisson-Binomial distribution with parameter vector p_t = (p_t(1), ..., p_t(n)), are often used in the proofs.

- Let Y := (Y_1, Y_2, ..., Y_n) denote an offspring sampled from the probability distribution (1) in generation t, where Pr(Y_i = 1) = p_t(i) for each i ∈ [n].
- Let Y_{i,j} := ∑_{k=i}^{j} Y_k denote the number of ones sampled from the sub-vector (p_t(i), p_t(i+1), ..., p_t(j)) of the model p_t, where 1 ≤ i ≤ j ≤ n.

Small parent population size
Our approach refines the analysis in [10] by considering anti-concentration properties of the random variables involved. As already discussed in Subsection 2.3, we need to verify the three conditions (G1), (G2) and (G3) of Theorem 4 to derive an upper bound on the expected optimisation time. The range of values of the marginals is {1/n, 1/µ, 2/µ, ..., (µ−1)/µ, 1 − 1/n}. When p_t(i) = 1 − 1/n or 1/n, we say that the marginal is at the upper or lower border (or margin), respectively. Therefore, we can categorise the values of p_t(i) into three groups: those at the upper margin 1 − 1/n, those at the lower margin 1/n, and those within the closed interval [1/µ, 1 − 1/µ]. For OneMax, all bits have the same weight and the fitness is just the sum of the bit values, so re-arranging the bit positions has no impact on the sampling distribution. Given the current sorted population, recall that X_i := ∑_{k=1}^{µ} x^{(k)}_{t,i}, and, without loss of generality, we can re-arrange the bit positions so that, for two integers k, ℓ ≥ 0, it holds that p_t(i) ∈ [1/µ, 1 − 1/µ] for all i ∈ [k], X_i = µ and p_t(i) = 1 − 1/n for all i ∈ (k, k + ℓ], and X_i = 0 and p_t(i) = 1/n for all i ∈ (k + ℓ, n]. We define the levels using the canonical f-based partition
A_j := {x ∈ X : OneMax(x) = j − 1} for each j ∈ [n + 1].  (2)
Note that the probability appearing in conditions (G1) and (G2) of Theorem 4 is the probability of sampling an offspring in levels A_{≥j+1}, that is, Pr(Y_{1,n} ≥ j).
We aim to obtain an upper bound of O(nλ) on the expected optimisation time of the UMDA on OneMax using the level-based theorem. The logarithmic factor O(log λ) in the previous upper bound O(nλ log λ) in [10] stems from the lower bound Ω(1/µ) on the parameter z_j in condition (G1) of Theorem 4. We aim for the stronger bound z_j = Ω((n − j + 1)/n). Note that in the following proofs, we choose the parameter γ_0 := µ/λ.
Assume that the current level is A_j, that is, |P_t ∩ A_{≥j}| ≥ γ_0λ = µ, which, together with the two variables k and ℓ, implies that there are at least j − ℓ − 1 ones among the first k bit positions. To verify conditions (G1) and (G2) of Theorem 4, we need to calculate the probability of sampling an offspring with at least j ones (in levels A_{≥j+1}). It is likely that the algorithm maintains the ℓ ones in all bit positions i ∈ (k, k + ℓ] (this happens with probability at least 1/e), and also samples at least j − ℓ ones from the remaining n − ℓ bit positions. This leads us to consider three distinct cases according to different configurations of the current population with respect to the two parameters k and j in Step 3 of Theorem 8 below.
1. k ≥ µ. In this situation, the variance of Y_{1,k} is not too small. By the result of Theorem 6, the distribution of Y_{1,k} cannot be too concentrated around its mean, and with probability Ω(1) the algorithm samples at least j − ℓ ones from the first k bit positions, obtaining an offspring with at least (j − ℓ) + ℓ = j ones. Thus, the probability of sampling at least j ones is bounded from below by (1/e) · Ω(1) = Ω(1).
2. k < µ and j ≥ n + 1 − n/µ. In this case, the current level is very close to the optimal level A_{n+1}, and the sampled bitstrings have few zeros. As already obtained in [10], the probability of sampling an offspring in A_{≥j+1} in this case is Ω(1/µ). Since the condition on j can be rewritten as 1/µ ≥ (n − j + 1)/n, this probability is also Ω((n − j + 1)/n).
3. The remaining cases. We will later prove that if µ² ≤ n(1 − c) for some constant c ∈ (0, 1), then excluding the two cases above implies 0 ≤ k < (1 − c)(n − j + 1). In this case, k is relatively small, and ℓ is not too large since the current level is not very close to the optimal level A_{n+1}. This implies that most zeros must be located among the bit positions i ∈ (k + ℓ, n], and it suffices to sample one extra one from this region to get at least (j − ℓ − 1) + ℓ + 1 = j ones. The probability of sampling an offspring in levels A_{≥j+1} is then Ω((n − j + 1)/n).
We now present our detailed runtime analysis for the UMDA on OneMax when the population size is small.
Theorem 8. For some constant a > 0 and any constant c ∈ (0, 1), the UMDA (with margins) with parent population size a ln(n) ≤ µ ≤ √(n(1 − c)), and offspring population size λ ≥ (1 + β)µ for a sufficiently large constant β > 0, has expected optimisation time O(nλ) on OneMax.
Proof. Recall that γ_0 := µ/λ. We re-arrange the bit positions as explained above and follow the recommended 5-step procedure for applying Theorem 4 [9].
Step 1. The levels are defined as in Eq. (2). There are exactly m = n + 1 levels, from A_1 to A_{n+1}, where level A_{n+1} consists of the optimal solution.
Step 2. We verify condition (G2) of Theorem 4. In particular, for some δ ∈ (0, 1), for any level j ∈ [m − 2] and any γ ∈ (0, γ_0], assuming that the population is configured such that |P_t ∩ A_{≥j}| ≥ γ_0λ = µ and |P_t ∩ A_{≥j+1}| ≥ γλ > 0, we must show that the probability of sampling an offspring in levels A_{≥j+1} is no less than (1 + δ)γ. By the re-arrangement of the bit positions mentioned earlier, it holds that Σ_{i=k+1}^{k+ℓ} X_i = ℓµ, (3) where the X_i for all i ∈ [n] are given in Algorithm 1. By assumption, the current population P_t consists of γλ individuals with at least j ones and µ − γλ individuals with exactly j − 1 ones. Therefore, Σ_{i=1}^{n} X_i ≥ γλ · j + (µ − γλ)(j − 1) = µ(j − 1) + γλ. (4) Combining (3), (4) and noting that λ = µ/γ_0 yields a lower bound on the expected number of ones in a sampled offspring. Let Z = Y_{1,k} + Y_{k+ℓ+1,n} be the integer-valued random variable which describes the number of ones sampled in the first k and the last n − k − ℓ bit positions. Since k + ℓ ≤ n, the expected value of Z satisfies E[Z] ≥ j − ℓ − 1 + γ/γ_0. (5) In order to obtain an offspring with at least j ones, it is sufficient to sample ℓ ones in positions k + 1 to k + ℓ and at least j − ℓ ones from the other positions. The probability of this event is bounded from below by Pr(Y_{k+1,k+ℓ} = ℓ) · Pr(Z ≥ j − ℓ). (6) The probability of obtaining ℓ ones in the middle interval is Pr(Y_{k+1,k+ℓ} = ℓ) = (1 − 1/n)^ℓ ≥ (1 − 1/n)^{n−1} ≥ 1/e (7) by the result of Lemma 10 for t = −1. We now estimate the probability Pr(Z ≥ j − ℓ) using Feige's inequality. Since Z takes integer values only, it follows by (5) that Pr(Z ≥ j − ℓ) = Pr(Z > j − ℓ − 1) ≥ Pr(Z > E[Z] − γ/γ_0). Applying Theorem 5 for ∆ = γ/γ_0 ≤ 1, and noting that we chose µ and λ such that γ_0 is a sufficiently small constant, yields Pr(Z ≥ j − ℓ) ≥ (1 + δ)eγ. (8) Combining (6), (7), and (8) yields Pr(Y_{1,n} ≥ j) ≥ (1 + δ)γ, and, thus, condition (G2) of Theorem 4 holds.
Step 3. We now consider condition (G1) for any level j. Let P_t be any population where |P_t ∩ A_{≥j}| ≥ γ_0λ = µ. For a lower bound on Pr(Y_{1,n} ≥ j), we modify the population such that any individual in levels A_{≥j+1} is moved to level A_j. Thus, the µ fittest individuals belong to level A_j. By the definition of the UMDA, this can only reduce the probabilities p_{t+1}(i) on the OneMax problem. Hence, by Lemma 13, the distribution of Y_{1,n} for the modified population is stochastically dominated by that of Y_{1,n} for the original population. A lower bound z_j that holds for the modified population therefore also holds for the original population. All the µ fittest individuals in the current sorted population P_t then have exactly j − 1 ones, and, therefore, Σ_{i=1}^{n} X_i = µ(j − 1) and Σ_{i=1}^{k} X_i = µ(j − ℓ − 1). There are four distinct cases that cover all situations according to the values of the variables k and j. We aim to show that in all four cases, we can use the parameter z_j = Ω((n − j + 1)/n).
Case 0: k = 0. In this case, p_t(i) = 1 − 1/n for 1 ≤ i ≤ j − 1, and p_t(i) = 1/n for j ≤ i ≤ n. To obtain j ones, it suffices to sample only ones in the first j − 1 positions, and at least one one in the remaining positions, i.e., z_j ≥ (1 − 1/n)^{j−1} (1 − (1 − 1/n)^{n−j+1}) = Ω((n − j + 1)/n).
Case 1: k ≥ µ. We will apply the anti-concentration inequality of Theorem 6. To lower bound the variance of the number of ones sampled in the first k positions, we use the bounds 1/µ ≤ p_t(i) ≤ 1 − 1/µ, which hold for 1 ≤ i ≤ k. In particular, Var[Y_{1,k}] = Σ_{i=1}^{k} p_t(i)(1 − p_t(i)) ≥ (k/µ)(1 − 1/µ) ≥ 1 − 1/µ ≥ 81/100, where the last inequality holds for sufficiently large n because µ ≥ a ln(n) for some constant a > 0.
Theorem 6, applied with σ_k ≥ 9/10, now gives that no single value of Y_{1,k} carries more than a constant probability mass bounded away from 1. By combining these two probability bounds, the probability of sampling at least j − ℓ ones from the first k positions is Ω(1). In order to obtain an offspring in levels A_{≥j+1}, it is sufficient to sample at least j − ℓ ones from the k first positions and ℓ ones from position k + 1 to position k + ℓ. Therefore, using (7) and the above lower bound, this event happens with probability bounded from below by (1/e) · Ω(1) = Ω(1) = Ω((n − j + 1)/n).
Case 2: k < µ and j ≥ n + 1 − n/µ. The second condition is equivalent to 1/µ ≥ (n − j + 1)/n. The probability of sampling an offspring in levels A_{≥j+1} is then bounded from below by p_t(1) · Pr(Y_{2,k} ≥ j − ℓ − 1) · (1/e) ≥ (1/µ)(1/14)(1/e), where we used the inequality Pr(Y_{2,k} ≥ j − ℓ − 1) ≥ 1/14 for µ ≥ 14 proven in [10]. Since 1/µ ≥ (n − j + 1)/n, we can conclude that z_j = Ω((n − j + 1)/n).
Case 3: 1 ≤ k < µ and j < n + 1 − n/µ. This case covers all the remaining situations not included in the previous cases. The latter inequality can be rewritten as n − j + 1 > n/µ. We also have µ² ≤ n(1 − c), so n/µ ≥ µ/(1 − c). It then holds that k < µ ≤ (1 − c) · n/µ < (1 − c)(n − j + 1). Thus, the two conditions can be shortened to 1 ≤ k < (1 − c)(n − j + 1). In this case, the probability of sampling at least j ones is at least Pr(Y_{1,k} ≥ j − ℓ − 1) · (1 − 1/n)^ℓ · (1 − (1 − 1/n)^{c(n−j+1)}) ≥ (1/2)(1/e) · Ω((n − j + 1)/n) = Ω((n − j + 1)/n), where the 1/2 factor in the last inequality is due to (10).
Combining all four cases yields the following probability of sampling an offspring in levels A_{≥j+1}: z_j = Ω((n − j + 1)/n).
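The role of anti-concentration in Case 1 can be illustrated empirically: for a Poisson-Binomial variable with non-trivial variance, no single value carries most of the probability mass, so a constant fraction of samples lands at or above any fixed threshold near the mean. The following Monte Carlo sketch (ours, not part of the proof) estimates the largest point probability.

```python
import random

def max_point_mass(p, samples, rng):
    """Monte Carlo estimate of max_x Pr(Y_{1,k} = x) for a Poisson-Binomial
    random variable with parameter vector p."""
    counts = {}
    for _ in range(samples):
        s = sum(1 for pi in p if rng.random() < pi)  # one draw of Y_{1,k}
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values()) / samples
```

For k = 20 fair marginals the true maximum is C(20, 10)/2^20 ≈ 0.176, well below 1, which is exactly the kind of bound Theorem 6 provides via the standard deviation.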
Step 5. We have verified all three conditions (G1), (G2), and (G3). By Theorem 4 and the bound z_j = Ω((n − j + 1)/n), the expected optimisation time is therefore O(λ Σ_{j=1}^{n} ln(n/(n − j + 1)) + Σ_{j=1}^{n} n/(n − j + 1)). We simplify the two terms separately. By Stirling's approximation (see Lemma 12), the first term is O(λ ln(n^n / n!)) = O(λn). The second term is n Σ_{i=1}^{n} 1/i = O(n log n). Since λ > µ = Ω(log n), the expected optimisation time is O(nλ).

Large parent population size
For larger parent population sizes, i.e., µ = Ω(√n log n), we prove an upper bound of O(λ√n) on the expected optimisation time of the UMDA on OneMax. Note that Witt [44] obtained a similar result, and we rely on one of his lemmas to derive ours. Overall, our proof is not only significantly simpler but also holds for different settings of µ and λ, that is, λ = Ω(µ) instead of λ = Θ(µ).
Theorem 9. For sufficiently large constants a > 1 and c > 0, the UMDA (with margins) with offspring population size λ ≥ aµ and parent population size µ ≥ c√n log n has expected optimisation time O(λ√n) on OneMax.
Here, we are mainly interested in the parent population size µ ≥ c√n log n for a sufficiently large constant c > 0. In this case, Witt [44] found that Pr(T ≤ n^{cc′}) = O(n^{−cc′}), where c′ is another positive constant and T := min{t ≥ 0 | p_t(i) ≤ 1/4} for an arbitrary bit i ∈ [n]. This result implies that with probability O(n^{−cc′}) some marginal drops below 1/4 within the first n^{cc′} generations. The UMDA needs O(nλ log λ)/λ = O(n log λ) generations [10], so only with probability O(n^{−cc′}) does the pessimistic assumption below fail before the optimum is found. If we choose the constant c large enough, then n^{cc′} can subsume any polynomial number of generations, i.e. n log λ ≤ poly(n) ≤ n^{cc′}, so the failure event contributes only o(1) generations to the expectation. Therefore, the overall expected number of generations is still bounded by O(√n), and the expected optimisation time by O(λ√n). In addition, the analysis by Witt [44] implies that all marginals will generally move towards higher values and are unlikely to drop by a large distance. We therefore pessimistically assume that all marginals are bounded from below by the constant p_min := 1/4. Again, we re-arrange the bit positions such that there exist two integers 0 ≤ k, ℓ ≤ n with k + ℓ = n, where p_min ≤ p_t(i) < 1 − 1/n for all i ∈ [k], and p_t(i) = 1 − 1/n for all i ∈ (k, n]. Note that k > 0, because if k = 0 we would have sampled a globally optimal solution.
Step 1: We partition the search space into the m subsets A_1, ..., A_m (i.e. levels), defined for i ∈ [m − 1] as A_i := {x ∈ {0,1}^n | f_{i−1} ≤ OneMax(x) < f_i} and A_m := {1^n}, where the sequence (f_i)_{i∈N} is defined recursively with some constant d ∈ (0, 1]. The range of d will be specified later; for now, note that m = min{i | f_i = n} + 1, and due to Lemma 15 we know that the sequence (f_i)_{i∈N} is well-behaved: it starts at 0 and increases steadily (by at least 1 per level), eventually reaches n exactly, and remains there afterwards. Moreover, the number of levels satisfies m = Θ(√n).
Step 2: For (G2), we assume that |P_t ∩ A_{≥j}| ≥ γ_0λ = µ and |P_t ∩ A_{≥j+1}| ≥ γλ. Additionally, we make the pessimistic assumption that |P_t ∩ A_{≥j+2}| = 0, i.e. the current population contains exactly γλ individuals in A_{j+1}, µ − γλ individuals in level A_j, and λ − µ individuals in the levels below A_j. From this configuration we obtain lower bounds on Σ_{i=1}^{n} X_i and hence on the expected value of Y_{1,k}. Due to the assumption p_t(i) ≥ p_min for all i ∈ [k], the probability of sampling an offspring in A_{≥j+1} is bounded from below by the probability of sampling ℓ ones in the last ℓ positions times the probability that Y_{1,k} reaches the required threshold. By Theorem 6, each point probability of Y_{1,k} is at most η/σ_k, where η ≈ 0.4688 by Lemma 16, so (12) becomes a lower bound of the required form. The last inequality is satisfied if, for any j ∈ [m − 2], the constant d satisfies the quadratic inequality (13). The discriminant of the corresponding quadratic equation is ∆ = 1 + 4ψ^{−2} > 0, and Vieta's formulas [43] yield that the product of its two solutions is negative, implying that the equation has two real solutions d_1 < 0 and d_2 > 0. Therefore, if we choose any value of d such that 0 < d ≤ d_2, then inequality (13) always holds. The probability of sampling an offspring in A_{≥j+1} is therefore bounded from below by (1 + δ)γ, provided we choose the population sizes in the UMDA such that µ/λ = γ_0 ≤ ψ/((1 + δ)e), where δ ∈ (0, 1]. Condition (G2) then follows.
Step 3: Assume that |P_t ∩ A_{≥j}| ≥ γ_0λ = µ. This means that the µ fittest individuals in the current sorted population P_t belong to levels A_{≥j}; in other words, each of them has at least f_{j−1} ones. An individual belonging to the higher levels A_{≥j+1} must have at least f_j ones, so the probability of sampling an offspring y ∈ A_{≥j+1} equals Pr(Y_{1,n} ≥ f_j). According to the level definitions and the result of Lemma 17, it remains to bound this probability from below by a constant. We obtain such a bound by applying Lemma 14 with the constant d* ≥ 1/p_min = 4 and d ≤ d*. Hence, the probability of sampling an offspring in levels A_{≥j+1} is bounded from below by a positive constant z_j := κ independent of n.
Step 5: The probability of sampling an offspring in levels A_{≥j+1} is bounded from below by z_j = κ. Having satisfied all three conditions, Theorem 4 then guarantees that the expected optimisation time of the UMDA on OneMax, assuming that µ = Ω(√n log n), is O(λm) = O(λ√n).

Empirical results
We have proved upper bounds on the expected optimisation time of the UMDA on OneMax, LeadingOnes and BinVal. However, these are only asymptotic upper bounds as functions of the problem and population sizes; they provide no information about the multiplicative constants or the influence of lower-order terms. Our goal is also to investigate the runtime behaviour for larger populations. To complement the theoretical findings, we therefore carried out experiments running the UMDA on the three functions.
For each function, the parameters were chosen consistently with the theoretical analyses. Specifically, we set λ = n, and n ∈ {100, 200, ..., 4500}. Although the theoretical results imply that significantly smaller population sizes would suffice, e.g. λ = O(log n) for Theorem 8, we chose a larger population size in the experiments to observe the impact of λ on the running time more easily. The results are shown in Figures 1-3. For each value of n, the algorithm was run 100 times, and the average runtime was computed. The mean runtime for each value of n is estimated with 95% confidence intervals using the bootstrap percentile method [29] with 100 bootstrap samples. Each mean point is plotted with two error bars to illustrate the upper and lower margins of the confidence intervals.
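The bootstrap percentile method used here is straightforward to implement; the following Python sketch (our own minimal version, not the authors' exact script) resamples the 100 observed runtimes with replacement and reads the interval endpoints off the sorted resample means.

```python
import random
import statistics

def bootstrap_ci(data, num_resamples, rng, alpha=0.05):
    """Bootstrap percentile confidence interval for the mean of data.

    Returns the (alpha/2, 1 - alpha/2) percentiles of the resample means,
    i.e. a 95% interval for the default alpha = 0.05."""
    means = []
    for _ in range(num_resamples):
        # Resample the data with replacement and record the resample mean.
        resample = [rng.choice(data) for _ in data]
        means.append(statistics.mean(resample))
    means.sort()
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi
```

With 100 bootstrap samples, as in the experiments, the interval endpoints are simply the 3rd-smallest and 3rd-largest resample means.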

OneMax
In Section 4, we obtained two upper bounds on the expected optimisation time of the UMDA on OneMax, both tighter than the earlier bound O(nλ log λ) in [10]: O(λn) for µ = Ω(log n) ∩ O(√n), and O(λ√n) for µ = Ω(√n log n). We therefore experimented with two different settings for the parent population size: µ = √n and µ = √n log(n). We call the first the small-population setting and the second the large-population setting; with λ = n, the corresponding theoretical bounds are O(n²) for the small-population setting and O(n^{3/2}) for the large-population setting. Following [29], we identify the three positive constants c_1, c_2 and c_3 that best fit the models c_1 n log n, c_2 n^{3/2} and c_3 n² in non-linear least squares regression. Note in particular that these models were chosen because they are close to the theoretical results. The correlation coefficient ρ is then calculated for each model to find the best-fit model. In Table 2, we observe that for small parent populations (i.e. µ = √n), the model 0.8104 n^{3/2} fits the empirical data best, while the quadratic model gives the worst result. For the larger parent population (i.e. µ = √n log n), the model 1.0767 n^{3/2} fits the empirical data best among the three models. Since 0.8104 n^{3/2} ∈ O(n²), these findings are consistent with the theoretical expected optimisation time and may further suggest that the quadratic bound in the small-population case is not tight.
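Fitting a one-parameter model y ≈ c·g(n) by least squares has a closed-form solution, which makes the model comparison above easy to reproduce. The following Python sketch (ours, not the authors' exact fitting procedure, which also reports correlation coefficients) fits each candidate model and compares residuals.

```python
import math

def fit_one_param(ns, ys, g):
    """Least-squares fit of c in the model y = c * g(n).
    The closed form is c = sum(y_i g(n_i)) / sum(g(n_i)^2)."""
    gs = [g(n) for n in ns]
    c = sum(y * gi for y, gi in zip(ys, gs)) / sum(gi * gi for gi in gs)
    residual = sum((y - c * gi) ** 2 for y, gi in zip(ys, gs))
    return c, residual

# The three candidate growth models used for OneMax and BinVal.
models = {
    "n log n": lambda n: n * math.log(n),
    "n^{3/2}": lambda n: n ** 1.5,
    "n^2": lambda n: n ** 2,
}
```

Applied to runtimes generated exactly as 2·n^{3/2}, the procedure recovers both the constant and the correct model.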

LeadingOnes
We conducted experiments with µ = √n and λ = n. According to Theorem 7, the upper bound on the expected runtime in this case is O(nλ log λ + n²) = O(n² log n). Figure 2 shows the empirical runtime. Similarly to the OneMax problem, we fit the empirical runtime with four different models - c_1 n log n, c_2 n^{3/2}, c_3 n² and c_4 n² log n - using non-linear regression. The best values of the four constants are shown in Table 3, along with the correlation coefficients of the models. Figure 2 and Table 3 show that both the model 1.5223 n² and the model 0.1851 n² log n, having the same correlation coefficient, fit well with the empirical data (i.e. the empirical data lie between these two curves). This finding is consistent with the theoretical runtime bound O(n² log n). Note also that these two models differ asymptotically only by a Θ(log n) factor, suggesting that our analysis of the UMDA on LeadingOnes is nearly tight.

BinVal
Finally, we consider BinVal. The upper bound O(nλ log λ + n²) from Theorem 7 for this function is identical to the bound for LeadingOnes. Since BinVal, like OneMax, is a linear function, we set up the experiments similarly for these functions, i.e. with the two parent population sizes µ = √n and µ = √n log n. The empirical results are shown in Figure 3. Again the empirical runtime is fitted to the three models c_1 n log n, c_2 n^{3/2} and c_3 n². The best values of c_1, c_2 and c_3 are listed in Table 4, along with the correlation coefficient for each model. Theorem 7 gives the upper bound O(n² log n) on the expected runtime for BinVal. However, Figure 3 and Table 4 clearly show that the model 1.4605 n^{3/2} fits the empirical runtime best for µ = √n. On the other hand, the empirical runtime lies between the two models 11.973 n log n and 1.6586 n^{3/2} when µ = √n log n. While these observations are consistent with the theoretical upper bound, since O(n^{3/2}) and O(n log n) are both contained in O(n² log n), they also suggest that our analysis of the UMDA on BinVal in Theorem 7 may be loose.

Conclusion
Despite the popularity of EDAs in real-world applications, little has been known about their theoretical optimisation time, even in apparently simple settings such as the UMDA on toy functions. More results for the UMDA on simple problems with well-understood structure provide a way to describe and compare the performance of the algorithm against other search heuristics. Furthermore, results about the UMDA are relevant not only to evolutionary computation, but also to population genetics, where it corresponds to the notion of linkage equilibrium [35, 41].
We have analysed the expected optimisation time of the UMDA on three benchmark problems: OneMax, LeadingOnes and BinVal. For both LeadingOnes and BinVal, we proved the upper bound O(nλ log λ + n²), which holds for λ = Ω(log n). For OneMax, two upper bounds of O(λn) and O(λ√n) were obtained for µ = Ω(log n) ∩ O(√n) and µ = Ω(√n log n), respectively. Although our result assumes that λ ≥ (1 + β)µ for some positive constant β > 0, it no longer requires λ = Θ(µ) as in [44]. Note that if λ = Θ(log n), a tight bound of Θ(n log n) on the expected optimisation time of the UMDA on OneMax is obtained, matching the well-known tight bound of Θ(n log n) for the (1+1) EA on the class of linear functions. Although we did not obtain a runtime bound when the parent population size is µ = Ω(√n) ∩ O(√n log n), our results finally close the existing Θ(log log n) gap between the first upper bound of O(n log n log log n) for λ = Ω(µ) [10] and the relatively new lower bound of Ω(µ√n + n log n) for λ = (1 + Θ(1))µ [25].
Our analysis further demonstrates that the level-based theorem can yield, relatively easily, asymptotically tight upper bounds for non-trivial, population-based algorithms. An important additional component of the analysis was the use of anti-concentration properties of the Poisson-Binomial distribution. Provided that the variance of the sampled individuals is not too small, the distribution of the population cannot be too concentrated anywhere, even around the mean, yielding sufficient diversity to discover better solutions. We expect that similar arguments will lead to new results in the runtime analysis of evolutionary algorithms.
In the following we write X ⪯ Y to denote that the random variable Y stochastically dominates the random variable X, i.e. Pr(X ≥ k) ≤ Pr(Y ≥ k) for all k ∈ R. The lemma below can easily be proved with a coupling argument [37].
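For intuition, the dominance relation can be checked exactly for simple distributions, e.g. Bin(n, p) ⪯ Bin(n, q) whenever p ≤ q. A small sketch with exact tail probabilities (function names are ours):

```python
import math

def binom_tail(n, p, k):
    """Pr(X >= k) for X ~ Bin(n, p), computed exactly from the pmf."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

def dominates(n, p, q):
    """Check Pr(X >= k) <= Pr(Y >= k) for all k, where X ~ Bin(n, p)
    and Y ~ Bin(n, q); a small epsilon absorbs floating-point error."""
    return all(binom_tail(n, p, k) <= binom_tail(n, q, k) + 1e-12
               for k in range(n + 1))
```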
Proof. Let us rewrite (14) by introducing a variable x ≥ 0 as follows. We consider two different cases. From ∂g/∂y = 0, we obtain y = 0; note that ∂²g/∂y² < 0 at y = 0. This means that g(y, f_{j−1}) attains its maximum with respect to y at y = 0, and g(y, f_{j−1}) ≥ min{g(−ℓ/n, f_{j−1}), g(n − f_{j−1}, f_{j−1})}. The lemma is proved by combining the results of the two cases.

Fig. 1: Mean runtime of the UMDA on OneMax with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression.


Fig. 2: Mean runtime of the UMDA on LeadingOnes with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression.

Fig. 3: Mean runtime of the UMDA on BinVal with 95% confidence intervals plotted with error bars in red colour. Models are also fitted via non-linear regression.

Table 1: Expected optimisation time (number of fitness evaluations) of univariate EDAs on the three problems OneMax, LeadingOnes and BinVal.

Table 2: Correlation coefficient ρ for the best-fit models in the experiments with OneMax shown in Figures 1a and 1b.

Table 3: Correlation coefficient ρ for the best-fit models in the experiments with LeadingOnes shown in Figure 2.

Table 4: Correlation coefficient ρ for the best-fit models in the experiments with BinVal shown in Figures 3a and 3b.