Analysing the Robustness of Evolutionary Algorithms to Noise: Refined Runtime Bounds and an Example Where Noise is Beneficial

We analyse the performance of well-known evolutionary algorithms, the (1+1) EA and the (1+λ) EA, in the prior noise model, where in each fitness evaluation the search point is altered before the evaluation with probability p. We present refined results for the expected optimisation time of these algorithms on the function Leading-Ones, where bits have to be optimised in sequence. Previous work showed that the (1+1) EA on Leading-Ones runs in polynomial expected time if p = O((log n)/n^2) and needs superpolynomial expected time if p = ω((log n)/n), leaving a huge gap for which no results were known.
We close this gap by showing that the expected optimisation time is Θ(n^2) · exp(Θ(min{pn^2, n})) for all p ≤ 1/2, allowing for the first time to locate the threshold between polynomial and superpolynomial expected times at p = Θ((log n)/n^2). Hence the (1+1) EA on Leading-Ones is surprisingly sensitive to noise. We also show that offspring populations of size λ ≥ 3.42 log n can effectively deal with much higher noise than known before. Finally, we present an example of a rugged landscape where prior noise can help to escape from local optima by blurring the landscape and allowing a hill climber to see the underlying gradient.
We prove that in this particular setting noise can have a highly beneficial effect on performance.


Introduction
Many real-world problems suffer from sources of uncertainty, such as noise in the fitness evaluation, changing constraints, or dynamic changes to the fitness function [26]. Evolutionary algorithms are well suited for dealing with these challenges due to their use of a population, and because they can often recover quickly from setbacks resulting from noise or dynamic changes. They have proven to work well in many applications to combinatorial problems [6].
However, our theoretical understanding of how evolutionary algorithms deal with noise is limited. It is often not clear how noise affects the performance of evolutionary algorithms, and how much noise an evolutionary algorithm can cope with. For evolution strategies in continuous optimisation there exists a rich body of work (see, e.g., [4,25,32] and the references therein), but there are only a few rigorous theoretical analyses of the performance of noisy evolutionary optimisation in discrete spaces.
The first runtime analysis for discrete evolutionary algorithms in a noisy setting was given by Droste [16] in the context of a simple algorithm called (1+1) EA on the well-known function OneMax(x) := ∑_{i=1}^{n} x_i, which simply counts the number of bits set to 1. He considered a setting now known as one-bit prior noise, where with probability p a uniform random bit is flipped before evaluation. Hence, instead of returning the fitness of the evaluated search point, the fitness function may return the fitness of a random Hamming neighbour. He proved that the (1+1) EA can still optimise OneMax efficiently when p = O((log n)/n), but that the expected optimisation time becomes superpolynomial when p = ω((log n)/n).
Gießen and Kötzing [22] studied a more general class of algorithms, including the (1+1) EA, the (1+λ) EA that generates λ new solutions (offspring) in parallel and picks the best one, and the (µ+1) EA that keeps a [...]

Table 1: Overview of results on the expected optimisation time on LeadingOnes with prior noise. Results for the (1+1) EA also hold for asymmetric one-bit noise, for which no results on LeadingOnes are available, with the caveat that for p = ω(1/n) we only have an upper bound of O(n^2 [...] The bound O(λn + n^2) from [22] was simplified to O(n^2) using their condition λ = o(n).
[Table body not recovered; surviving entry: [22, Thm. 20]: polynomial if p = O((log n)/n^2 [...]]
The upper bound (Section 3) is based on a very simple argument: estimating the probability that no noise will occur during a period of time long enough to allow the algorithm to find an optimum without experiencing any noise. A similar argument was used independently in [11] to derive precise and general results for the (1+1) EA on noisy and dynamic OneMax. The lower bound (Section 4) follows arguments from Rowe and Sudholt [48] who analysed the performance of the non-elitist algorithm (1,λ) EA on LeadingOnes.
In Section 5 we show an improved upper bound for the (1+λ) EA on LeadingOnes. Finally, in Section 6 we show that on the class of Hurdle problems [39], a class of rugged functions with many local optima on an underlying slope, noise helps to overcome local optima, allowing a simple hill climber to succeed that would otherwise fail with overwhelming probability.
This manuscript extends a preliminary version [49] that contained parts of the results. In this extension, conditions on bit-wise noise were relaxed in the context of the (1+1) EA to allow for larger noise values. An exponential upper bound for the (1+1) EA was added to obtain asymptotically tight exponents for all reasonable noise strengths. Several empirical analyses were added to complement the theoretical results for LeadingOnes and Hurdle.

Preliminaries
Algorithm 1 shows the (1+λ) EA in the context of prior noise, which includes the (1+1) EA as a special case of λ = 1. Here noise(x) denotes a noisy version of a search point x, according to the given noise model. We assume that all applications of noise are independent. The (1+λ) EA creates λ independent offspring, evaluates their noisy fitness, and then picks a best offspring. This offspring is then compared against the parent, whose noisy fitness is evaluated in each generation. This means in particular that an offspring can replace a parent whose real fitness is higher if the parent is misevaluated to a lower noisy fitness, the offspring is misevaluated to a higher noisy fitness, or both.

Algorithm 1: (1+λ) EA with prior noise
Choose x uniformly at random.
while termination criterion not met do
    for i = 1, . . ., λ do
        Create y_i by copying x and flipping each bit independently with probability 1/n.
        Evaluate [...]

The optimisation time is defined as the number of fitness evaluations until a global optimum is found for the first time. We consider the following prior noise models from previous work; asymmetric noise is inspired by an asymmetric mutation operator [23].
One-bit noise(p) [16,22]: with probability 1 − p, noise(x) = x and otherwise noise(x) = x' where in x', compared to x, one bit chosen uniformly at random was flipped.
Bit-wise noise(p, q) [42]: with probability 1 − p, noise(x) = x and otherwise noise(x) = x' where in x', compared to x, each bit was flipped independently with probability q.
Asymmetric one-bit noise(p) [45]: with probability 1 − p, noise(x) = x and otherwise noise(x) = x' where in x', compared to x, if x ∉ {0^n, 1^n}, with probability 1/2 a uniform random 0-bit is flipped, with probability 1/2 a uniform random 1-bit is flipped, and if x ∈ {0^n, 1^n} a uniform random bit is flipped.
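The three noise models above can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the function names, the list-of-bits representation and the `rng` parameter are assumptions made for the example.

```python
import random


def one_bit_noise(x, p, rng=random):
    """One-bit noise(p): with probability p, flip one uniformly chosen bit."""
    y = list(x)  # work on a copy so the true search point is unchanged
    if rng.random() < p:
        i = rng.randrange(len(y))
        y[i] = 1 - y[i]
    return y


def bitwise_noise(x, p, q, rng=random):
    """Bit-wise noise(p, q): with probability p, flip each bit independently with probability q."""
    y = list(x)
    if rng.random() < p:
        y = [1 - b if rng.random() < q else b for b in y]
    return y


def asymmetric_one_bit_noise(x, p, rng=random):
    """Asymmetric one-bit noise(p): with probability p, flip a random 0-bit or a
    random 1-bit (probability 1/2 each); for 0^n and 1^n flip a uniform random bit."""
    y = list(x)
    if rng.random() < p:
        zeros = [i for i, b in enumerate(y) if b == 0]
        ones = [i for i, b in enumerate(y) if b == 1]
        if not zeros or not ones:
            i = rng.randrange(len(y))  # x is 0^n or 1^n
        elif rng.random() < 0.5:
            i = rng.choice(zeros)
        else:
            i = rng.choice(ones)
        y[i] = 1 - y[i]
    return y
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible; the bit-wise model (p, q/n) used in the text corresponds to calling `bitwise_noise(x, p, q / n)`.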
The special case (1, q) denotes bit-wise noise as investigated in [22]. We often write (p, q/n) for bit-wise noise instead of (p, q) as then q plays a similar role to p in one-bit prior noise p, which allows for a more unified presentation of results: we obtain identical noise thresholds across both models (thresholds for q in the (1, q) model are by a factor of n smaller than those for p [42]). Note that we do generally allow q > 1, while in our preliminary work [49] q was restricted to q ≤ 1. The conditions from [5] for (p, q/n) bit-wise noise simplify to p min{q, 1} = O((log n)/n^2) for polynomial expected times and p min{q, 1} = ω((log n)/n) for superpolynomial times, respectively.
Note that Pr(noise(x) ≠ x) = p for one-bit noise and asymmetric one-bit noise, and for the bit-wise noise model (p, q/n), Pr(noise(x) ≠ x) = p(1 − (1 − q/n)^n) as noise occurs with probability p and at least one bit is flipped with probability 1 − (1 − q/n)^n. We simplify the last expression using the following inequalities for all 0 ≤ p ≤ 1 and ℓ ∈ N. [...]
The second and third inequality are shown in [2, Lemma 6], and the first one follows from considering the two cases p ≤ 1/2 and p > 1/2. Thus Pr(noise(x) ≠ x) is tightly bounded as follows: [...]

We often limit our considerations to p ≤ 1/2 for one-bit noise as otherwise more than half of the time, the optimum will not be recognised as an optimum. This can lead to counterintuitive effects. For instance, [43, Theorem 3.3] for bit-wise noise with p = 1 shows that increasing the sample size for the (1+1) EA with resampling can turn a polynomial expected time on LeadingOnes into an exponential time; this is essentially because states close to the optimum become more appealing than the optimum itself. For bit-wise noise (p, q/n) we assume q/n ≤ 1/2 as otherwise noise(x) is more likely to return search points that are closer to the bit-wise complement x̄ of x than to x itself. With q/n ≤ 1/2 the worst possible noise is q/n = 1/2 where noise(x) is chosen uniformly at random from the whole search space, irrespective of x.
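The (1+λ) EA with prior noise described in this section can be sketched as below for LeadingOnes under one-bit noise. The sketch rests on several assumptions that are not spelled out in the text above: ties between offspring are broken in favour of the first best one, the offspring replaces the parent if its noisy fitness is at least the parent's noisy fitness, and the run stops once the current parent is the optimum 1^n. All names (`one_plus_lambda_ea`, `max_evals`, `seed`) are hypothetical.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of x."""
    count = 0
    for b in x:
        if b != 1:
            break
        count += 1
    return count


def one_bit_noise(x, p, rng):
    """With probability p, return a copy of x with one uniform random bit flipped."""
    y = list(x)
    if rng.random() < p:
        i = rng.randrange(len(y))
        y[i] = 1 - y[i]
    return y


def one_plus_lambda_ea(n, lam, p, max_evals, seed=0):
    """Return the number of noisy evaluations until the parent equals 1^n,
    or None if the evaluation budget runs out first."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    evals = 0
    while evals < max_evals:
        if leading_ones(x) == n:  # true fitness is used only as stopping test
            return evals
        best, best_fit = None, -1
        for _ in range(lam):
            # standard bit mutation: flip each bit independently with prob. 1/n
            y = [1 - b if rng.random() < 1.0 / n else b for b in x]
            fy = leading_ones(one_bit_noise(y, p, rng))
            evals += 1
            if fy > best_fit:  # assumption: keep the first best offspring
                best, best_fit = y, fy
        fx = leading_ones(one_bit_noise(x, p, rng))  # parent is re-evaluated
        evals += 1
        if best_fit >= fx:  # assumption: accept on ties
            x = best
    return evals if leading_ones(x) == n else None
```

Calling `one_plus_lambda_ea(n=50, lam=1, p=0.0, max_evals=10**6)` runs the noise-free (1+1) EA; increasing `p` lets one observe how quickly the number of evaluations grows.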

A Simple and General Upper Bound For Dealing With Uncertainty
We first present a very simple result that applies in a general setting of optimisation under uncertainty (noise/dynamic changes/etc.). It is formulated for iterative algorithms that maintain a single search point, called trajectory-based algorithms; however, it is easy to extend the definition to population-based algorithms as well.
Our approach is based on the worst-case median optimisation time, defined as follows. The definition uses the term trajectory-based algorithm to denote an iterative algorithm that maintains one search point in each iteration. The (1+1) EA and the (1+λ) EA are both trajectory-based algorithms as they both evolve a single search point. The definition also includes randomised local search (RLS), simulated annealing, the (1,λ) EA [48] or the Strong Selection Weak Mutation (SSWM) algorithm [38].
Definition 1. For any trajectory-based algorithm A optimising a fitness function f let T_{A,f}(x) be the random first hitting time of a global optimum when starting in x. We assume hereinafter that each initial search point x leads to a finite expectation.
We define the worst-case expected optimisation time E_{A,f} as

E_{A,f}(x) := E(T_{A,f}(x)),   E_{A,f} := max_x E_{A,f}(x).

Further define the median optimisation time and the worst-case median optimisation time

M_{A,f}(x) := min{t | Pr(T_{A,f}(x) ≤ t) ≥ 1/2},   M_{A,f} := max_x M_{A,f}(x).

We omit subscripts if the context is clear. Applying Markov's inequality for all x, the worst-case median optimisation time is not much larger than the worst-case expected optimisation time, as shown in the following simple theorem.
Theorem 1. For every A and every f, M_{A,f} ≤ 2E_{A,f}.
Proof. For all x, M_{A,f}(x) ≤ 2E_{A,f}(x) by Markov's inequality. Taking the maximum over all x on both sides yields the claim.
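The relation between median and expectation in Theorem 1 can be checked on a toy example. The sketch below assumes, purely for illustration, that the hitting time is geometrically distributed with success probability s per step; the function names are hypothetical.

```python
def geometric_median(s):
    """Smallest t with Pr(T <= t) = 1 - (1 - s)^t >= 1/2 for a geometric T."""
    t = 1
    while 1.0 - (1.0 - s) ** t < 0.5:
        t += 1
    return t


def geometric_expectation(s):
    """Expected value of a geometric random variable with success probability s."""
    return 1.0 / s


def median_within_twice_expectation(s):
    """Check the Markov-type bound: median <= 2 * expectation."""
    return geometric_median(s) <= 2.0 * geometric_expectation(s)
```

For a geometric hitting time the median is in fact close to ln(2)/s, so the factor 2 from Markov's inequality is not tight here; it is the generality of the bound that matters for the theorems that follow.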
The following theorem gives an upper bound on the worst-case expected optimisation time under uncertainty, assuming we do know (an upper bound on) the median worst-case optimisation time in a setting without uncertainty.
Theorem 2. Consider a setting where in each iteration a failure event may occur independently with probability 0 ≤ p < 1. Consider any function f on which an iterative algorithm A has worst-case median optimisation time M if p = 0. Then the worst-case expected optimisation time of A with failure probability p is at most

2M(1 − p)^{−M} ≤ 2M · exp(pM/(1 − p)).

The statement also holds if p is an upper bound on the probability of a failure and/or M is an upper bound on the described time.
Proof. By definition of the worst-case median optimisation time, if the algorithm experiences M steps without a failure, it will find an optimum with probability at least 1/2 regardless of the initial search point. The probability that in a phase of M steps there will be no failure is at least (1 − p)^M. Hence the expected waiting time for a phase of M steps without failures where the algorithm finds an optimum is at most 2M(1 − p)^{−M} for every initial search point.
The inequality follows from 1/(1 − p) = 1 + p/(1 − p) ≤ exp(p/(1 − p)).

In the setting of prior noise, Theorem 2 implies the following.
Theorem 3. Consider an iterative algorithm A that evaluates up to ν search points in each iteration. For every function f on which A has worst-case median optimisation time M without prior noise, its worst-case expected optimisation time is at most

2M(1 − p)^{−νM} ≤ 2M · exp(νpM/(1 − p))

for each of the following settings: 1. one-bit prior noise with probability p < 1, 2. bit-wise prior noise (p', q/n) with q/n ≤ 1/2 and p := p' min{q, 1}, and 3. asymmetric one-bit prior noise with probability p < 1.
Proof. The probability of noise occurring in one search point is at most p; this is immediate for one-bit noise and it is p'(1 − (1 − q/n)^n) ≤ p' min{q, 1} for bit-wise noise by (2). Since noise is applied to all search points independently, noise occurs in one iteration with probability at most p* := 1 − (1 − p)^ν. Invoking Theorem 2 with parameter p* and the occurrence of noise as failure event yields the first claimed bound. The inequality follows as in the proof of Theorem 2.
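The restart-style argument can be made concrete numerically. The sketch below evaluates the waiting-time expression 2M(1 − p)^{−M} from the proof of Theorem 2 with the per-iteration failure probability p* = 1 − (1 − p)^ν from the proof above, and compares it with the exponential relaxation obtained from 1/(1 − p) ≤ exp(p/(1 − p)). The function names are hypothetical and the parameter values in the usage note are arbitrary.

```python
import math


def failure_free_bound(M, p_fail):
    """2M(1 - p)^(-M): expected waiting time for a failure-free successful phase."""
    return 2.0 * M * (1.0 - p_fail) ** (-M)


def prior_noise_bound(M, p, nu):
    """Same bound with failure probability p* = 1 - (1 - p)^nu per iteration,
    i.e. noise hitting at least one of nu evaluated search points."""
    p_star = 1.0 - (1.0 - p) ** nu
    return failure_free_bound(M, p_star)


def exponential_relaxation(M, p, nu):
    """Weaker but simpler bound 2M * exp(nu * p * M / (1 - p))."""
    return 2.0 * M * math.exp(nu * p * M / (1.0 - p))
```

For example, `prior_noise_bound(50, 0.01, 2)` is roughly 273, while `exponential_relaxation(50, 0.01, 2)` is slightly larger; both grow exponentially in p · M, which is the source of the e^{O(pn^2)} factor once M = O(n^2) is plugged in.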
We remark that Theorem 2 also applies in many other settings, for example in
• restart strategies that restart the algorithm in each iteration with probability p,
• non-elitist algorithms like the (1,λ) EA, where the failure event could be defined as the best fitness decreasing,
• stochastic ageing [8,37], an approach from artificial immune systems, where individuals are suddenly killed off with a fixed probability and the failure event is that the whole population happens to die at the same time (which implies a restart),
• dynamic optimisation where p is the probability of the fitness function changing, if M is taken as (an upper bound for) the worst-case median optimisation time for all possible fitness functions that can be attained in the considered dynamic setting.
For LeadingOnes, Theorem 3 implies the following.
Theorem 4. The expected optimisation time of the (1+1) EA with prior noise probability p ≤ 1/2 for each of the settings from Theorem 3 on LeadingOnes is O(n^2) · e^{O(pn^2)}.
Proof. The upper bound follows directly from Theorem 3 with ν = 2 (as the (1+1) EA evaluates parent and offspring in each generation), 2p/(1 − p) = O(p), and the fact that the worst-case expected optimisation time of the (1+1) EA on LeadingOnes is O(n^2) [17], hence by Theorem 1 the worst-case median optimisation time is O(n^2).

Despite the simplicity of the above proofs, Theorem 4 matches, unifies and generalises the best known results [5,42] which only classify the expected optimisation time on LeadingOnes as being either polynomial, superpolynomial, or exponential (see Table 1). It also gives results for asymmetric one-bit noise, for which no results on LeadingOnes are available.

An Exponential Upper Bound for Large Noise
For very large noise levels p, Theorem 4 gives an upper bound of essentially e^{O(pn^2)}, which can be as bad as e^{O(n^2)} for p = Ω(1). This is clearly too pessimistic as the expected time to create the optimum by mutation is at most n^n = e^{n ln n} for every fitness function and every initial search point.
We therefore provide a new, tailored upper bound for large noise levels, showing that the expected optimisation time is at most e^{O(n)}. To this end, we will prove that the (1+1) EA converges to a stationary distribution π in which the optimum 1^n has stationary mass π(1^n) ≥ 2^{−n}. We then bound the mixing time, that is, the time until the algorithm has approached the stationary distribution such that the optimum is found with a probability close to π(1^n). Throughout this section we assume that the reader is familiar with the foundations of Markov chain theory and mixing times as described in relevant text books like [31].
The following lemma shows that transitions to higher fitness values are at least as likely as transitions to lower values.
Lemma 5. Let Pr(x → y) denote the probability that the (1+1) EA with prior noise transitions from x to y in one generation. Then for all x, y with LeadingOnes(x) < LeadingOnes(y) we have Pr(x → y) ≥ Pr(y → x) in each of the following settings: 1. one-bit prior noise with probability p ≤ 1/2, 2. bit-wise prior noise (p, q) with q ≤ 1/2, 3. asymmetric one-bit prior noise with probability p ≤ 1/2.
Proof. A transition from x to y is made if and only if mutation of x results in y and y is accepted. Since the probability of mutation of x creating y is equal to that of mutation of y creating x, we just need to show that the probability of accepting y as offspring of x is no smaller than the probability of accepting x as offspring of y.
Let i denote the smallest index of any bit flipped in the parent's noise, and i := ∞ if there is no such bit. Define j in the same way for the offspring's noise. Abbreviate ℓ := LeadingOnes(x). Now, if i ≤ j ≤ ℓ then the offspring will be accepted regardless of whether the parent is x or y. If j < i ≤ ℓ the offspring will be rejected in both scenarios. Hence we only need to show the claimed inequality for conditional probabilities assuming i, j > ℓ.
If i, j > ℓ + 1 then the better search point y will survive, regardless of whether the parent is x or y. If i = ℓ + 1 and the parent is x then the inferior search point x may survive. This case is symmetric to j = ℓ + 1 and y being the parent. Since Pr(i = ℓ + 1) = Pr(j = ℓ + 1) and only one of the previous cases can occur, the probability of x surviving is at most Pr(i = ℓ + 1).

Thus the claim follows if we can show that
Pr(i, j > ℓ + 1) ≥ Pr(i = ℓ + 1).
In the symmetric and asymmetric one-bit noise settings, the left-hand side is at least (1 − p)^2 ≥ 1/4 and the right-hand side is at most p/n ≤ 1/4. For the bit-wise noise setting, if p ≤ 1/2 the left-hand side is at least 1/4 as above and the right-hand side equals pq(1 − q)^ℓ ≤ pq ≤ 1/4. If p > 1/2 we argue that the left-hand side is at least 2p(1 − p)(1 − q)^{ℓ+1} ≥ p(1 − q)^{ℓ+1} ≥ pq(1 − q)^ℓ = Pr(i = ℓ + 1) as it is sufficient to have noise in exactly one parent, if noise does not flip the first ℓ + 1 bits.
The exponential upper bound is stated as follows.
Theorem 6. The expected optimisation time of the (1+1) EA with prior noise probability p ≤ 1/2 for each of the settings from Theorem 3, except for asymmetric one-bit noise, on LeadingOnes is at most 2^{O(n)}.
Proof. If p = 0 then the expected optimisation time of the (1+1) EA on LeadingOnes is O(n^2) ≤ 2^{O(n)}, hence we assume p > 0 in the following.
We first show that the (1+1) EA on LeadingOnes is an ergodic Markov chain, which implies the existence of a stationary distribution π. Ergodicity simply follows from the fact that every search point x can be turned into any other search point y in one generation if mutation of x creates y (probability at least n^{−n}) and LeadingOnes(noise(x)) = 0, which happens with probability at least p/n > 0 for one-bit noise, probability at least p'q > 0 for bit-wise noise with p = p' min{q, 1} > 0 and probability at least p/(2n) > 0 for asymmetric one-bit noise.
To prove the claimed inequality 1/π(1^n) ≤ 2^n we will use the following property of stationary distributions (cf. Proposition 1.19 in [31]): [...]

It remains to bound the mixing time, that is, the time until the algorithm has gotten close to the stationary distribution (as will be made precise soon). Let p_t be the distribution of the current search point at time t. The difference to the stationary distribution π is described by the total variation distance that describes the maximum difference between probabilities for any event A:

||p_t − π|| := max_A |p_t(A) − π(A)|.

In particular, we have Pr( [...]

We now show that ||p_t − π|| ≤ 2^{−n−1} for a suitable t = poly(n) · 2^{O(n)}. This will be achieved by using a coupling (X^t, Y^t). In a nutshell, a coupling is a pair process where, viewed individually, X^t and Y^t are both faithful copies of the original process, the (1+1) EA on LeadingOnes. But they may not be independent: they can follow a joint distribution and the coupling ensures that, once they have reached the same state, their states will always be equal. More formally, if X^t = Y^t then X^{t+1} = Y^{t+1}. The first point in time where their states become equal, when starting in states X^0 = x and Y^0 = y, is called the coupling time T_xy.
It is known that the tail of the coupling time, or more precisely the tail of the worst-case coupling time for any initial states x, y, yields a bound on the total variation distance. Using [31, Theorem 5.2] we get

||p_t − π|| ≤ max_{x,y} Pr(T_xy > t).

We will show the right-hand side becomes less than 2^{−n−1} within 2^{O(n)} generations.
We use the following coupling between two copies X^t, Y^t of the (1+1) EA, where we identify X^t and Y^t with the (1+1) EA's current search points in the respective chains. During mutation, for bits where X^t and Y^t agree we make the same decisions in both Markov chains. Otherwise, with probability 1/n we flip the bit in X^t but not in Y^t, with probability 1/n we flip the bit in Y^t but not in X^t, and with the remaining probability 1 − 2/n the bit is not flipped at all. We further assume that the same noise is applied in both chains. It is easy to verify that both chains, viewed in isolation, represent faithful copies of the (1+1) EA on LeadingOnes, and that after both chains have reached the same state, their states will always be equal as they experience the same mutations and the same noise.
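The coupled mutation step can be sketched as follows. The sketch covers only the mutation part of the coupling, not noise or selection, and assumes n ≥ 2 so that 2/n ≤ 1; the function name `coupled_mutation` is hypothetical.

```python
import random


def coupled_mutation(x, y, rng):
    """Mutate two parents jointly so that each offspring, viewed alone, is a
    standard bit mutation (each bit flipped with probability 1/n)."""
    n = len(x)
    x_new, y_new = list(x), list(y)
    for k in range(n):
        u = rng.random()
        if x[k] == y[k]:
            # bits agree: make the same decision in both chains
            if u < 1.0 / n:
                x_new[k] = 1 - x[k]
                y_new[k] = 1 - y[k]
        else:
            # bits differ: flip in at most one chain, which makes them agree
            if u < 1.0 / n:
                x_new[k] = 1 - x[k]
            elif u < 2.0 / n:
                y_new[k] = 1 - y[k]
    return x_new, y_new
```

Under this joint mutation, agreeing bits stay in agreement and differing bits can only become equal, so the Hamming distance between the two offspring never exceeds that of the parents. Selection can still undo this, which is why the proof tracks the identical prefix rather than the Hamming distance.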
Let Eq_t denote the size of the largest prefix that is identical in X^t and Y^t, i.e., Eq_t = max{i | X^t_j = Y^t_j for all 1 ≤ j ≤ i}. Note that if both chains decide to reject their offspring, Eq_{t+1} = Eq_t and if both chains decide to accept then Eq_{t+1} ≥ Eq_t due to the way mutations are coupled. Once Eq_t has reached a value of n, both chains will always have the same state.
Let i := Eq_t < n, then X^t_{i+1} ≠ Y^t_{i+1} by definition of Eq_t. Assume without loss of generality that X^t_{i+1} = 0. We first show that Pr(Eq_{t+1} > Eq_t | Eq_t, Eq_t < n) ≥ 1/(3en). A sufficient event is that mutation makes bit i + 1 equal in X^t and Y^t and the outcome is accepted in both chains. Mutation flips X^t_{i+1} while not flipping X^t_1, . . ., X^t_i and Y^t_1, . . ., Y^t_{i+1} with probability 1/n · (1 − 1/n)^i ≥ 1/(en): as per definition of the coupling, mutation flips X^t_{i+1} and does not flip Y^t_{i+1} with probability 1/n, and every bit j ≤ i is not flipped in X^t and Y^t with probability 1 − 1/n since X^t_j = Y^t_j. The outcome of such a mutation then needs to be accepted despite noise. Let α_i denote the probability of noise flipping any of the first i bits. The offspring will be accepted if noise leaves the first i bits intact, or if noise does flip at least one bit amongst the first i bits in both parent and offspring, but still the offspring's noisy fitness is at least as good as that of its parent. Noting the symmetry in the latter case, the probability of accepting said mutation is at least (1 − α_i)^2 + α_i^2/2 ≥ 1/3 for every possible value α_i. Together, this shows Pr(Eq_{t+1} > Eq_t | Eq_t, Eq_t < n) ≥ 1/(3en).
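The constant 1/3 in the acceptance bound can be checked numerically. The following sketch evaluates (1 − α)^2 + α^2/2 on a fine grid of α in [0, 1]; the grid resolution and all names are choices made for this illustration. Setting the derivative −2(1 − α) + α to zero gives the minimiser α = 2/3 with value 1/9 + 2/9 = 1/3.

```python
def acceptance_lower_bound(alpha):
    """Lower bound (1 - alpha)^2 + alpha^2 / 2 on the acceptance probability."""
    return (1.0 - alpha) ** 2 + alpha ** 2 / 2.0


# evaluate the bound on a grid over [0, 1] and locate its minimum
grid = [k / 10000.0 for k in range(10001)]
min_alpha = min(grid, key=acceptance_lower_bound)
min_value = acceptance_lower_bound(min_alpha)
```

The grid minimum sits next to α = 2/3 and its value is 1/3 up to grid resolution, in line with the bound used in the proof.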
Note that the first i bits are identical in the noisy parent evaluation of both X^t and Y^t, and they are also identical in the noisy evaluation of both offspring x', y' in X^t and Y^t, respectively. If either of these noisy evaluations is less than i, the decision whether to accept or reject is only based on the first i bits and X^t and Y^t make the same decision. The only problematic case is when noise(X^t), noise(Y^t), noise(x'), and noise(y') all have at least i leading ones as then one Markov chain might accept their offspring while the other might reject theirs. If LeadingOnes(x') and LeadingOnes(y') are both at least i, Eq_{t+1} ≥ Eq_t and no harm is done.
However, we might have LeadingOnes(x') < i or LeadingOnes(y') < i in case mutation destroys the prefix of i leading ones (probability at most i/n), but noise flips the same bits, covering up all detrimental mutations. The probability of the latter event is at most p/n for one-bit noise (or 0 in case mutation flipped more than one bit). We call step t a relevant step if Eq_{t+1} ≠ Eq_t. In a relevant step, the conditional probability of increasing Eq_t is Ω(1) and the probability of increasing Eq_t in at most n subsequent relevant steps, until Eq_t = n is reached, is at least (Ω(1))^n = 2^{−Ω(n)}.
In the case of bit-wise noise, the probability of decreasing Eq_t is at most q(1 − q)^{i−1} as (since q ≤ 1/2) the best case is that mutation has only flipped one bit, which needs to be covered up by noise. The conditional probability of Eq_t increasing in a relevant step is thus at least 1/(3en) [...] The probability of increasing Eq_t in at most n subsequent relevant steps until a value of n is reached is thus at least [...]

The reciprocal of this expression is upper bounded by [...]
For both one-bit and bit-wise noise, a relevant step occurs with probability at least 1/(3en) (unless the chains have already coupled). Hence the expected waiting time for n relevant steps is at most 3en^2. Thus, from any initial configuration of X^t and Y^t, the expected time for a sequence of up to n relevant steps all increasing Eq_t until the maximum value n is reached and the chains are coupled is bounded by E(max_{x,y} T_xy) ≤ 3en^2 · e^{O(n)} =: t*. By Markov's inequality, Pr(max_{x,y} T_xy ≥ 2t*) ≤ 1/2 and the probability that the process has not coupled within n + 1 subsequent phases of length 2t* each is at most 2^{−n−1}.
This shows that the time until the total variation distance to π has decreased to a value of at most 2^{−n−1} is at most 2^{O(n)}. Then the probability of sampling the optimum in the next generation is at least π(1^n) − 2^{−n−1} ≥ 2^{−n−1}. If the optimum is not found then, we repeat the above arguments. This establishes an upper bound of 2^{O(n)}.

A Matching Lower Bound for the (1+1) EA on LeadingOnes
The arguments from Section 3 and Theorem 2 pessimistically assume that, once noise occurs, the algorithm needs to restart from scratch. For LeadingOnes, and problems with a similar structure, this is not far from the truth. An unlucky mutation can destroy a long prefix of leading ones and the fitness of the current search point can decrease significantly. We will see that then the algorithm comes close to having to start from scratch. Such an effect was already observed and made rigorous in the analysis of island models with migration [27], separable functions [15], and for the (1,λ) EA on LeadingOnes [48]; parts of this section closely follow the proof of Theorem 12 in [48] (but had to be adapted to noisy settings).
The main result of this section is the following.
Theorem 7. The expected optimisation time of the (1+1) EA with prior noise probability p ≤ 1/2 for each of the settings from Theorem 3 on LeadingOnes is Ω(n^2) · e^{Ω(min{pn^2, n})}. This is superpolynomial for p = ω((log n)/n^2).
Along with Theorems 4 and 6 and the fact that polynomial factors only account for a ±O(log n) term in the exponent, yielding e^{Ω(n)} = Θ(n^2) · e^{Ω(n)}, we get the following result.
Theorem 8. The expected optimisation time of the (1+1) EA with prior noise probability p ≤ 1/2 on LeadingOnes is Θ(n^2) · exp(Θ(min{pn^2, n})) for each of the following settings: 1. one-bit prior noise with probability p ≤ 1/2 and 2. bit-wise prior noise (p', q/n) with q/n ≤ 1/2 and p := p' min{q, 1}.
The result is tight up to constants in the exponent of the term exp(Θ(min{pn^2, n})) that reflects the impact of noise.
Theorem 7 improves on the best known results, summarised in Table 1. Note that there is a gap of order 1/n between the noise parameter regime p = ω((log n)/n) where times are known to be superpolynomial [5,42] and the noise parameter regime p = O((log n)/n^2) that led to polynomial upper bounds in [5,42] and in Theorem 4.
Theorem 7 closes this gap by showing that superpolynomial times already occur for noise parameters p = ω((log n)/n^2), which is by a factor of 1/n smaller than previous results [5,42]. This shows that the (1+1) EA on LeadingOnes is highly sensitive to noise, especially since the corresponding threshold for OneMax is at p = Θ((log n)/n) [16,22]. Theorem 7 also unifies and generalises all known results for LeadingOnes under prior noise by giving bounds that hold for the whole range of noise parameters p, and for different prior noise models.
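To see why the threshold sits at p = Θ((log n)/n^2), one can evaluate the noise-dependent factor exp(min{pn^2, n}) for a few noise levels. The sketch below sets all hidden constants in the Θ-notation to 1, which is an assumption made purely for illustration; the function name is hypothetical.

```python
import math


def noise_exponent(n, p):
    """Exponent min{p * n^2, n} of the noise-dependent factor, with all
    hidden constants taken as 1 (illustrative assumption)."""
    return min(p * n * n, n)


def noise_factor(n, p):
    """exp(min{p * n^2, n}) under the same assumption."""
    return math.exp(noise_exponent(n, p))
```

With p = c (ln n)/n^2 the factor equals n^c, which is polynomial for constant c. With p = (ln n)/n the exponent is already min{n ln n, n} = n, so the factor is e^n.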
In order to prove Theorem 7, we first analyse the probability of the fitness dropping significantly.
Lemma 9. Consider the setting of Theorem 7 with a current LeadingOnes value of i ≥ 4. Then the probability that the LeadingOnes value decreases below i/2 in one generation is Ω(pi^2/n^2). This is Ω(p) if i = Ω(n).
Proof. Mutation flips a bit at a position in {⌈i/4⌉, . . ., ⌊i/2⌋} and leaves the other bits unflipped with probability Ω(i/n). Let ⌈i/4⌉ ≤ i* ≤ ⌊i/2⌋ denote the position of the bit flipped during mutation. Let i_x denote the smallest index of any bit flipped during the parent's noise and i_x := ∞ if no such bit exists. Define i_y in the same way for the offspring. We claim that after a mutation as described above, the probability that the offspring is accepted regardless is Ω(pi/n). A sufficient condition for this to happen is that i_x ≤ i/4 ≤ i* and i_y ≥ i_x.
For all noise models, we claim that Pr(i [...] as parent and offspring are subject to the same independent noise under identical conditions. If all these events happen, the offspring will appear to be no worse than the parent. Hence the offspring will survive, and its LeadingOnes value is at most i/2. Since all events are independent (or conditionally independent), multiplying these probabilities implies the claim.
As argued in [48] for the (1,λ) EA, such a fallback is not too detrimental per se as the (1+1) EA might recover from this easily. If the bits between i/2 and i have not been flipped during the mutation creating the accepted offspring, the previous leading ones can be easily recovered, in the best case by simply flipping the first 0-bit in the current search point. However, while waiting for such a mutation to happen, all bits between i/2 + 1 and i do not contribute to the fitness. So over time these bits are subjected to random mutations, which are likely to destroy many of the former leading ones. In other words, after a fallback previous leading ones are forgotten quickly.
The last fact was formalised in [27, Lemma 3] stated below. The lemma states that the probability distribution of a bit subjected to random mutations rapidly approaches a uniform distribution.
Lemma 10 (Adapted from Lässig and Sudholt [27]). Let x_0, x_1, . . ., x_t be a sequence of random bit values such that x_{j+1} results from x_j by flipping the bit x_j independently with probability 1/n. Then for every t ∈ N

We now say that the (1+1) EA falls back if, starting from a fitness at least f* := 2n/3, the algorithm drops to a fitness of i* for some i* ≤ n/2. We speak of a lasting fallback if in the 2n/(1 − p) generations directly following a fallback the following holds:
1. all acceptance decisions are made independently from bit values at positions i* + 2, . . ., n,
2. bit i* + 1 is never flipped during mutation and
3. in at least n/2 generations the offspring is accepted.
A lasting fallback implies that the fitness remains at most i* during at least n/2 accepted steps. In these accepted steps, the bits at positions i* + 2, . . ., n are mutated independently from acceptance decisions and hence take on a near-random state.
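The mixing behaviour behind this argument can be illustrated numerically. The following is a minimal Python sketch, not taken from the paper: it iterates the exact two-state recurrence for a single bit that is flipped independently with probability 1/n per step and compares it with the closed-form solution of that recurrence, 1/2 + (Pr(x_0 = 1) − 1/2)(1 − 2/n)^t. The function names are hypothetical, and the closed form is the standard solution of the recurrence, which need not coincide with the exact wording of the bound in Lemma 10.

```python
from fractions import Fraction


def prob_one_after(t: int, n: int, start_bit: int) -> Fraction:
    """Exact Pr(x_t = 1) if the bit is flipped with probability 1/n per step."""
    flip = Fraction(1, n)
    prob = Fraction(start_bit)  # Pr(x_0 = 1) is either 0 or 1
    for _ in range(t):
        # The bit is 1 afterwards if it was 1 and kept, or was 0 and flipped.
        prob = prob * (1 - flip) + (1 - prob) * flip
    return prob


def closed_form(t: int, n: int, start_bit: int) -> Fraction:
    """Closed-form solution of the same recurrence (an assumption of this sketch)."""
    half = Fraction(1, 2)
    return half + (Fraction(start_bit) - half) * (1 - Fraction(2, n)) ** t


if __name__ == "__main__":
    n = 100
    for t in (0, 10, n // 2, 2 * n):
        print(t, float(prob_one_after(t, n, 1)))
```

The distance to 1/2 shrinks by a factor of 1 − 2/n per step, which is why a linear number of independent mutations already brings each bit close to a uniform random state.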
We remark that in a noise-free setting, so long as bit i* + 1 is never flipped, the acceptance decisions would trivially be independent from bit positions i* + 2, . . ., n. In a setting with noise, however, these bits might play a role as bit i* + 1 might be flipped by noise, and then the acceptance decision might depend on further bits. Hence more careful arguments are needed.
We also say that the initial search point is a lasting fallback if its fitness is at most n/2. If i* is the initial fitness, the bits at positions i* + 2, . . ., n take on a uniform random state.
The following lemma estimates probabilities for fallbacks and lasting fallbacks.
Lemma 11. Consider any of the settings described in Theorem 3. If p ≤ 1/2 and the current fitness is at least f*, the probability of one generation yielding a fallback is Ω(p). Additionally, the probability of a fallback becoming a lasting fallback is Ω(1).
Proof. The first statement follows from Lemma 9 as halving the current fitness results in a search point of fitness at most n/2. It remains to estimate the probability of a fallback becoming a lasting fallback. Let i* be the fitness obtained during a fallback and let x_t be the parent in generation t. Abbreviate i_t := LeadingOnes(x_t). We call a generation t good if
• bit i_t + 1 is not flipped during mutation and
• bit i_t + 1 is set to 0 in the noisy parent.
In a good generation the noisy fitness of the parent is at most i_t, hence the offspring is accepted if and only if its noisy fitness is at least i_t. This decision only depends on bits at positions 1, . . ., i_t and is independent from bits at positions i_t + 2, . . ., n.
Moreover, in a good generation t we have i_{t+1} ≤ i_t as the fitness cannot increase if bit i_t + 1 is not flipped during mutation. If all generations since the fallback have been good then i_t ≤ i* and decisions are independent from bits i* + 2, . . ., n as claimed.
We estimate the probability of all 2n/(1 − p) generations being good. For any generation t, the probability of the first event is 1 − 1/n. The probability of the second event is at least 1 − 1/n − p/n ≥ 1 − 2/n as bit i_t + 1 can only be set to 1 if it is mutated or flipped during noise. The probability of noise flipping any fixed bit is at most p/n in all considered noise settings. Hence the probability of a generation t being good is at least 1 − 3/n by a union bound and the probability that all 2n/(1 − p) generations are good is at least

$$\left(1 - \frac{3}{n}\right)^{2n/(1-p)} \ge \left(1 - \frac{3}{n}\right)^{4n} = \Omega(1),$$

where we used p ≤ 1/2.

Assuming that these generations are all good, we finally estimate the number of accepted generations under this condition. Using Pr(A | B) = Pr(A ∩ B)/Pr(B) ≥ Pr(A ∩ B), we lower-bound the probability of a generation t being accepted and good. This happens if bits 1, . . ., i_t + 1 are not flipped during mutation (probability at least (1 − 1/n)^n), bit i_t + 1 is set to 0 in the noisy parent (probability at least 1 − 2/n as estimated above) and the offspring does not suffer from noise (probability at least 1 − p). Together, the probability of an accepted generation conditional on it being good is at least

$$\left(1 - \frac{1}{n}\right)^{n} \left(1 - \frac{2}{n}\right)(1 - p) \ge \frac{1 - p}{3}$$

if n is large enough. The expected number of accepted generations in 2n/(1 − p) good generations is at least 2n/3 and by Chernoff bounds, the probability of having at least n/2 accepted generations is 1 − 2^{−Ω(n)}.
Together, all three criteria in the definition of lasting fallbacks hold with probability Ω(1).
After a lasting fallback has occurred, the (1+1) EA with overwhelming probability needs some time in order to recover. Specifically, at least cn^2 generations are needed to increase the best fitness since the latest lasting fallback by at least n/6.

Lemma 12. Let t be the latest generation where a fallback became a lasting fallback or t = 0 if no lasting fallback occurred. Let B_t be the best fitness found since generation t. With probability 1 − e^{−Ω(n)}, for a small constant c > 0, B_{t+cn^2} < B_t + n/6.
Proof. We pessimistically overestimate the probability of a fitness improvement due to the effects of noise in generations from t to t + cn^2: we assume that noise never leads to a decrease in the number of leading ones. Secondly, we call a step successful if the first 0-bit is flipped during mutation or if it is flipped during the parent's or offspring's noise. In this case we assume that this bit becomes part of the leading ones for the next generation and the next parent's fitness is determined by the position of the first 0-bit amongst the following bits. The probability of a successful step is still bounded from above by 3/n.
A lasting fallback implies that at any generation from t, all bits at positions {B_t + 1, . . ., n} have been subjected to mutation at least t_mix = n/2 times and these mutations were independent of the acceptance decision (by definition of a lasting fallback). Every mutation flips each of these bits independently with probability 1/n, leaving the bits in a random state. We apply the principle of deferred decisions [33, page 9] and determine the current bit value for these bits at the time these bits first have a chance to become part of the leading ones in an offspring. By Lemma 10 we know that then the probability such a bit is set to 1 is at most

Note that due to our pessimistic assumptions concerning successful steps, the bits following the first 0-bit will always be irrelevant for the decision whether or not to accept the offspring. Hence the above probability bound also holds after generation t.
A necessary condition for increasing the best fitness by at least n/6 in cn^2 generations, c a positive constant chosen later, is that either
1. among cn^2 mutations at least 6cn steps are successful or
2. during at most 6cn successful steps the total fitness gain is at least n/6.
The probability of a successful step is always at most 3/n as mentioned earlier. By standard Chernoff bounds, the probability for the first event is at most e^{−Ω(n)}. The total fitness gain is given by the number of improvements (at most 6cn) plus a sum of up to 6cn geometric random variables to account for additional bits gained (these additional bits are often called "free riders"). By Theorem 5 in [3], we get that the probability of a fitness gain of n/6 is e^{−Ω(n)}, provided that c is small enough.
Lemma 13. Let c > 0 be any constant. Within cn^2 generations where the current fitness is larger than f*, a lasting fallback occurs with probability at least 1 − e^{−Ω(pn^2)}.
Proof. The probability of a fallback occurring is Ω(p), and then it becomes lasting with probability Ω(1). Note that the time until a fallback potentially becomes a lasting fallback (whether it does or not) is not counted towards the cn^2 generations from the statement as during this time the fitness is smaller than f*.
So the probability that no lasting fallback occurs is at most (1 − Ω(p))^{cn^2} ≤ e^{−Ω(pn^2)}.

Now we prove Theorem 7.
Proof of Theorem 7. With probability 1 − 2^{−Ω(n)} the initial search point has fitness less than n/2, so the (1+1) EA starts with a lasting fallback. As the fitness after initialisation and after every lasting fallback is at most n/2, by Lemma 12, reaching a fitness of at least f* from there takes time at least cn^2 with overwhelming probability, for a suitably small constant c > 0. Applying Lemma 12 every time the fitness increases to at least f*, the (1+1) EA does not find an optimum within the next cn^2 generations where the fitness is at least f*, with overwhelming probability. But by Lemma 13 during these cn^2 generations another lasting fallback occurs, with overwhelming probability. We iterate this argument until a failure occurs. The largest failure probability is e^{−Ω(pn^2)} if p = O(1/n), hence in expectation we can iterate this argument at least e^{Ω(pn^2)} times, each iteration taking time at least cn^2 (from the time it takes to reach fitness f* after a lasting fallback). If p = ω(1/n), the largest failure probability is e^{−Ω(n)} and in expectation we can iterate this argument for e^{Ω(n)} generations. Together, this proves the claim.

Improved Results for Offspring Populations
The general Theorem 2 can also be used in the context of offspring populations in the (1+λ) EA, in order to quantify the robustness of evolutionary algorithms with offspring populations to noise. Offspring populations can reduce the probability of the current fitness decreasing. The current fitness can decrease in two different ways:
1. the current search point may be misevaluated as having a poor fitness, and then be replaced by an offspring that is worse than the parent in real fitness or
2. the current search point may be replaced by an offspring where mutation has led to poor real fitness, but noise happens to misevaluate the offspring as having a high fitness, thus replacing its parent. Here noise essentially needs to make the same bit-flips as the preceding mutation to cover up the effect of mutation.
The first failure can be avoided if there is a clone of the current search point where no prior noise has occurred. A large offspring population can amplify this probability.

Lemma 14. Consider the (1+λ) EA in a prior noise model where Pr(noise(y) ≠ y) ≤ p for all search points y. Then for all current search points x the probability that all copies of x among parent and offspring are affected by noise is at most

Proof. Let q := (1 − 1/n)^n abbreviate the probability of creating a clone of the parent for an offspring. The probability of creating exactly i clones is (λ choose i) q^i (1 − q)^{λ−i}, and then the probability that all i + 1 copies of x (including the parent) are affected by noise is at most p^{i+1}. Hence the sought probability is at most

$$\sum_{i=0}^{\lambda} \binom{\lambda}{i} q^i (1-q)^{\lambda-i} p^{i+1} = p \sum_{i=0}^{\lambda} \binom{\lambda}{i} (pq)^i (1-q)^{\lambda-i} = p\,(pq + 1 - q)^{\lambda} = p\,(1 - q(1-p))^{\lambda},$$

where we have used the binomial theorem in the penultimate equality. Plugging in (1 − 1/n)^n for q yields the claimed result. For the second bound we use (1

Our aim is to apply Theorem 2 where the failure event is the union of the event described in Lemma 14 and other events described later. However, we still need a bound on the worst-case median optimisation time, or (by Theorem 1) the worst-case expected optimisation time, assuming that the algorithm always retains at least one copy of the current search point.
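The summation in the proof of Lemma 14 can be checked with exact arithmetic. The sketch below is an illustration under stated assumptions, not part of the paper: it treats p as the exact probability that a single copy is affected by noise (the lemma only assumes an upper bound), and it compares the sum over the number of clones with the closed form p(1 − q(1 − p))^λ obtained from the binomial theorem. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def all_copies_noisy_sum(lam: int, q: Fraction, p: Fraction) -> Fraction:
    """Sum over i clones of C(lam, i) q^i (1-q)^(lam-i) * p^(i+1)."""
    return sum(
        comb(lam, i) * q**i * (1 - q) ** (lam - i) * p ** (i + 1)
        for i in range(lam + 1)
    )


def all_copies_noisy_closed(lam: int, q: Fraction, p: Fraction) -> Fraction:
    """Closed form of the same sum via the binomial theorem."""
    return p * (1 - q * (1 - p)) ** lam


if __name__ == "__main__":
    n, lam = 10, 5
    q = (1 - Fraction(1, n)) ** n  # probability that an offspring is a clone
    p = Fraction(1, 3)
    print(float(all_copies_noisy_sum(lam, q, p)))
    print(float(all_copies_noisy_closed(lam, q, p)))
```

Since 1 − q(1 − p) < 1 whenever p < 1, the closed form shows that the probability of losing every unaffected copy decays exponentially in λ.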
Note that we cannot simply use a runtime result for the (1+λ) EA without noise as noise can still affect the generated offspring; the only condition we can rely on is that we cannot lose all copies of the current search point. If noise is disruptive, the (1+λ) EA may behave like having a smaller effective offspring population, the size of which is random. Note that we cannot pessimistically use a bound on the (1+1) EA to upper bound the time of the (1+λ) EA in this setting as different offspring population sizes can affect search dynamics in unforeseen ways. Jansen et al. [24] presented a problem class where different offspring population sizes lead to very different performance.
The following theorem gives improved upper bounds for one-bit noise and bit-wise noise.
Theorem 15. The expected number of function evaluations for the (1+λ) EA with prior noise parameter p ≤ 1/2 on LeadingOnes with λ ≥ log_{e/(e−1/2)}(n) is O(n^2 · e^{O(pn/λ)}) in each of the following settings:
1. one-bit prior noise with probability p < 1 and
2. bit-wise prior noise (p′, q/n) with q/n ≤ 1/n and p := p′ min{q, 1}.
The exponent is smaller compared to the upper bound for the (1+1) EA by a factor of order λn, and thus the threshold for p for which polynomial times are guaranteed increases by the same factor. The threshold between polynomial and superpolynomial times could be higher as we do not have a corresponding lower bound.
Theorem 15 improves and generalises the best known result for the (1+λ) EA [22, Corollary 24] which requires p = O(1/n) and λ ≥ 72 log n and gives a time bound of O(λn + n^2). This is O(n^2) as the authors also assume λ = o(n). Our result covers the whole parameter range for p up to 1/2 and also identifies a functional relationship between p and λ that guarantees robustness to noise.
Proof of Theorem 15. We estimate the probability of the following failure events in order to apply a union bound later on.
Failure event E_1: all copies of the current search point are affected by noise. By Lemma 14, this probability is at most

Failure event E_2: the best offspring is evaluated as having the parent's fitness, and the offspring y chosen to replace the parent carries disruptive mutations that were undone by noise, i. e. LeadingOnes(y) < LeadingOnes(noise(y)) = LeadingOnes(x). The probability for this to happen is at most

as noise has to flip at least one specific bit.
Failure event E_3: there is an offspring y that carries disruptive mutations, but is being evaluated as being better than the parent, i. e. LeadingOnes(y) < LeadingOnes(x) and LeadingOnes(noise(y)) > LeadingOnes(x). For each offspring where mutation flips one of the leading ones, two events may occur: if mutation flips the first 0-bit, noise in an offspring has to undo all mutations of the leading ones. This has probability at most p/n^2. Otherwise, noise has to undo all mutations of the leading ones and flip the first 0-bit at the same time. This is impossible under one-bit noise, and has probability at most p/n^2 under bit-wise noise. Along with a union bound over these two events and λ offspring,

As long as no failure occurs, the current fitness of the (1+λ) EA cannot decrease. We now show that, conditional on no failure occurring, the expected worst-case number of generations of the (1+λ) EA is bounded by

The probability of one offspring increasing the current fitness is at least (1 − p)/(en) as it suffices to flip the first 0-bit and not to flip any of the other bits, and to have the offspring being evaluated correctly. The probability that this happens in at least one of the λ offspring and the parent is evaluated correctly is at least

where the inequality follows from [2, Lemma 6]. The expected time to increase the best fitness is thus O(n/λ), and since the fitness only has to be increased at most n times, an upper bound of O(n^2/λ) generations follows, for every initial search point. The same bound also holds for the worst-case median optimisation time by Theorem 1.

Now the result follows from applying Theorem 2 with a time bound of O(n^2/λ) and a failure probability bound of p_1 + p_2 + p_3 = O(p/n), and multiplying the number of generations by λ for the number of function evaluations.
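The amplification of the success probability by λ offspring can be made concrete with a small numerical sketch. It is an illustration only: it uses the per-offspring lower bound (1 − p)/(en) from the proof, ignores the additional requirement that the parent is evaluated correctly, and compares 1 − (1 − x)^λ with the elementary estimate λx/(1 + λx). That estimate follows from (1 − x)^λ ≤ e^{−λx} ≤ 1/(1 + λx); whether it is the exact form of the inequality cited from [2, Lemma 6] is an assumption. All function names are hypothetical.

```python
import math


def single_success_lower_bound(n: int, p: float) -> float:
    """Lower bound (1 - p)/(e n) for one offspring improving the fitness."""
    return (1 - p) / (math.e * n)


def success_any_offspring(n: int, p: float, lam: int) -> float:
    """Probability that at least one of lam independent offspring succeeds,
    if each succeeds with the single-offspring lower bound."""
    x = single_success_lower_bound(n, p)
    return 1 - (1 - x) ** lam


def simple_lower_bound(n: int, p: float, lam: int) -> float:
    """Elementary estimate lam*x / (1 + lam*x) for 1 - (1 - x)^lam."""
    x = single_success_lower_bound(n, p)
    return lam * x / (1 + lam * x)


if __name__ == "__main__":
    for lam in (1, 4, 16, 64):
        print(lam, success_any_offspring(100, 0.25, lam), simple_lower_bound(100, 0.25, lam))
```

For λ = O(n) the estimate is of order λ/n, which is consistent with the expected waiting time of O(n/λ) generations per fitness improvement used in the proof.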

Experiments for LeadingOnes
We also performed experiments to see the threshold behaviour more clearly and to get further insights into the search dynamics in the presence of noise.
Figure 1 shows the average optimisation times over 1000 runs of the (1+λ) EA on 100-bit LeadingOnes with λ ∈ {1, 2, 4, 8, 16} for both one-bit prior noise with probability p and bit-wise prior noise (1, q/n). For both noise models the parameter was varied exponentially: p ∈ {2^{−20}, 2^{−19}, . . ., 2^{−1}} and q ∈ {2^{−20}, 2^{−19}, . . ., 2^0}. Runs were stopped after 10n^2 = 10^5 generations or when the optimum was found. For the (1+1) EA with one-bit noise we can see that for small noise values like p ∈ {2^{−20}, . . ., 2^{−15}} the averages seem unaffected by the noise parameter, as noise occurs too rarely to have a noticeable effect. When increasing p, the average time increases slightly before shooting up around p = 2^{−8} and hitting the generation limit at p = 2^{−6} in nearly all runs. This illustrates how the expected optimisation time grows exponentially in pn^2 in this regime.
Figure 1 further shows how offspring populations can shift the threshold between efficient and inefficient times towards higher values of p. Even very small offspring population sizes λ have a significant effect. For instance, the (1+8) EA is still efficient for p = 1/4 and only becomes inefficient for p = 1/2. The (1+16) EA is efficient even for p = 1/2. Note that the curves for all (1+λ) EAs have a very similar shape, independent of λ; they just appear to be shifted towards different values of p. This matches our theoretical results as the exponential term e^{O(pn/λ)} contains the ratio p/λ, indicating that the noise strength can be compensated by the offspring population size in a linear fashion.
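The experimental setup for one-bit prior noise can be sketched in a few lines of Python. This is a minimal illustration and not the code used for Figure 1. It assumes that one-bit prior noise flips one uniformly random bit of the evaluated search point with probability p, that the parent is re-evaluated in every generation, that the best-evaluated offspring replaces the parent if it appears no worse (the first such offspring wins ties among offspring), and that a run stops once the parent truly is the optimum. All function names are hypothetical.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits at the start of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def noisy_leading_ones(x, p, rng):
    """One-bit prior noise (assumed model): with probability p, evaluate a
    copy of x in which one uniformly random bit is flipped."""
    if rng.random() < p:
        y = list(x)
        y[rng.randrange(len(y))] ^= 1
        return leading_ones(y)
    return leading_ones(x)


def one_plus_lambda_ea(n, lam, p, max_gens, rng):
    """Return the generation at which the parent is optimal, or None."""
    parent = [rng.randint(0, 1) for _ in range(n)]
    for gen in range(max_gens):
        if leading_ones(parent) == n:
            return gen
        parent_fit = noisy_leading_ones(parent, p, rng)  # re-evaluated each generation
        best, best_fit = None, -1
        for _ in range(lam):
            # Standard bit mutation: flip each bit independently with prob. 1/n.
            child = [b ^ 1 if rng.random() < 1.0 / n else b for b in parent]
            fit = noisy_leading_ones(child, p, rng)
            if fit > best_fit:
                best, best_fit = child, fit
        if best_fit >= parent_fit:
            parent = best
    return None


if __name__ == "__main__":
    rng = random.Random(42)
    for p in (0.0, 2 ** -10, 2 ** -4):
        print(p, one_plus_lambda_ea(n=30, lam=4, p=p, max_gens=10 * 30 ** 2, rng=rng))
```

Averaging the returned values over many runs for a grid of p and λ gives curves of the kind discussed above, subject to the modelling assumptions listed in the lead-in.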
Comparing plots for one-bit noise and bit-wise noise, the curves look almost identical.

Another interesting performance measure not covered by our theoretical results is to inspect the best fitness found during a run before either finding an optimum or being stopped at 10n^2 generations. Figure 2 shows averages over these values. For the (1+1) EA the best fitness steadily decreases when increasing the noise parameter beyond the threshold for inefficient running times, reaching values of 30.414 for one-bit noise with p = 1/2 and 25.781 for bit-wise noise with q = 1. For comparison, the average best fitness found during 10n^2 = 10^5 uniform random samples was 16.926. Again, we see that offspring populations help by shifting the curves towards higher noise strengths.
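The random-sampling baseline can be cross-checked analytically. For a uniform random bit string, Pr(LeadingOnes(x) ≥ k) = 2^{−k} for 1 ≤ k ≤ n, so the expected best value among m independent samples is the sum over k of 1 − (1 − 2^{−k})^m. The short Python sketch below evaluates this sum; the function name is hypothetical and the calculation is an illustration added here, not part of the reported experiments.

```python
def expected_best_leading_ones(n: int, m: int) -> float:
    """Expected maximum LeadingOnes value among m uniform random n-bit strings.

    Uses E[max] = sum_{k=1}^{n} Pr(max >= k) and
    Pr(max >= k) = 1 - (1 - 2^-k)^m.
    """
    return sum(1.0 - (1.0 - 2.0 ** (-k)) ** m for k in range(1, n + 1))


if __name__ == "__main__":
    print(expected_best_leading_ones(n=100, m=10 ** 5))
```

For n = 100 and m = 10^5 the sum evaluates to roughly 16.9, in line with the empirical average reported above.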

An Example Where Noise Helps
The results so far show that on LeadingOnes, noise is disruptive and larger noise values lead to higher expected optimisation times.
The final contribution of this paper is to look at noise from a very different angle. We will show that noise can be beneficial for escaping from local optima. To this end, we consider a known class of functions that lead to a highly rugged fitness landscape with an underlying gradient pointing towards the location of the global optimum. Such landscapes are known as "big valley" structures, which is an important characteristic of many hard problems from combinatorial optimisation [36,47].
Prügel-Bennett defined such a class of problems known as Hurdle problems [39] as an example function where genetic algorithms with crossover outperform hill climbers. Hurdle functions are functions of unitation, that is, they only depend on the number of 1-bits. The fitness is given as

$$\mathrm{Hurdle}(x) = -\left\lceil \frac{|x|_0}{w} \right\rceil - \frac{\operatorname{rem}(|x|_0, w)}{w},$$

where |x|_0 denotes the number of 0-bits in x, rem(|x|_0, w) is the remainder of |x|_0 divided by w, and w is a parameter called hurdle width that defines the distance between subsequent peaks. A sketch of the function is shown in Figure 3. Here all search points with i mod w = 0 zeros are local optima, and all search points with j zeros, i − w < j < i, have worse fitness. Hence an evolutionary algorithm needs to flip at least w bits in order to find a search point of better fitness. Nguyen and Sudholt [35] proved that the (1+1) EA has expected time Θ(n^w) if 2 ≤ w ≤ n/2.
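The landscape can be made concrete with a short Python sketch. It assumes the definition of Hurdle commonly used in the literature, namely −⌈z/w⌉ − (z mod w)/w with z the number of 0-bits; this formula is an assumption of the sketch. The function name is hypothetical.

```python
def hurdle(x, w):
    """Hurdle fitness of a 0/1 list x with hurdle width w (assumed definition):
    -(ceil(z / w)) - (z mod w) / w, where z is the number of 0-bits."""
    zeros = len(x) - sum(x)
    ceil_div = -(-zeros // w)  # integer ceiling of zeros / w
    return -ceil_div - (zeros % w) / w


if __name__ == "__main__":
    n, w = 20, 4  # parameters of the sketch in Figure 3
    for zeros in range(n + 1):
        x = [0] * zeros + [1] * (n - zeros)
        print(zeros, hurdle(x, w))
```

With w = 4, the points with 4, 8, 12, . . . zeros are local optima: for example, 4 zeros gives fitness −1, whereas both neighbouring unitation values (3 and 5 zeros) give a strictly lower fitness.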
In the following, we consider the well-known algorithm Randomised Local Search (RLS), which works like the (1+1) EA, but only flips exactly one bit in each mutation (chosen uniformly at random). We choose RLS instead of the (1+1) EA to keep the analyses simple and to make the point that even a very badly performing algorithm can be turned into a highly efficient algorithm through beneficial effects of noise. We will in particular show that RLS under noise is drastically faster than the (1+1) EA without noise. Section 6.2 will further discuss whether results for RLS under noise can be transferred to the (1+1) EA under noise.
It is obvious that RLS has infinite expected time on any Hurdle function with non-trivial hurdle width w ≥ 2, and Nguyen and Sudholt [35] showed via Chernoff bounds that local searchers get stuck in a nonoptimal local optimum with probability 1 However, prior noise can help to escape from such a local optimum: RLS with one-bit prior noise can misevaluate either the parent or the offspring, which allows the algorithm to accept a search point with i mod w = w − 1 ones.Then it can climb to the next local optimum from there, until the global optimum is found.This is made precise in the following theorem.
Theorem 16. The expected optimisation time of RLS with one-bit prior noise p ≤ 1/(6n) on Hurdle with hurdle width w ≥ 2 log n is O(n^2/(pw^2) + n log n).
Note that in particular for p = 1/(6n) and w = Ω(n/√(log n)) this is O(n log n). Then RLS is as efficient as on the underlying function OneMax without any hurdles.
Proof of Theorem 16. The algorithm can escape from a local optimum with i zeros, i mod w = 0, if the offspring has i − 1 zeros (probability i/n) and additionally
1. the offspring is misevaluated as having i zeros (probability p(n − i + 1)/n) or
2. the parent is misevaluated as having i − 1 zeros (probability pi/n).
The probability of the union of these events is

$$p \cdot \frac{n-i+1}{n} + p \cdot \frac{i}{n} - p^2 \cdot \frac{i(n-i+1)}{n^2} \ge p,$$

using p ≤ 1/(6n), as the event of both offspring and parent being misevaluated as described is counted twice in the enumeration. Together, the probability of escaping from a local optimum with i zeros is at least pi/n. We now define a potential function g such that g(i) estimates or overestimates the expected optimisation time from a state with i zeros, bar constant factors. Let a_i := 2^{(i mod w)−w+1}, then

The term a_i · n^2/(i^2 p(1−p)^2) is necessary since on a slope towards a local optimum there is a chance to increase the number of zeros and to possibly return to a worse, previously visited local optimum. The term is largest, n^2/(i^2 p(1−p)^2), for i mod w = w − 1 as from there returning to a local optimum with i + 1 zeros is very likely. This needs to be accounted for in our choice of potential function. The term decreases exponentially for decreasing i mod w since this risk is reduced as the algorithm moves away from a local optimum.
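The escape probability from the start of the proof can be verified with exact arithmetic. The Python sketch below is an illustration, assuming that the noise in the parent and in the offspring is independent, so that the union of the two misevaluation events follows from inclusion-exclusion. It only covers the two events listed in the proof and therefore yields a lower bound on the true escape probability. The function name is hypothetical.

```python
from fractions import Fraction


def escape_lower_bound(n, i, p):
    """Lower bound on the probability of leaving a local optimum with i zeros
    towards i - 1 zeros in one generation of RLS with one-bit prior noise."""
    improve = Fraction(i, n)                  # mutation flips one of the i zeros
    off_mis = p * Fraction(n - i + 1, n)      # offspring appears to have i zeros
    par_mis = p * Fraction(i, n)              # parent appears to have i - 1 zeros
    union = off_mis + par_mis - off_mis * par_mis  # inclusion-exclusion
    return improve * union


if __name__ == "__main__":
    n, w = 100, 14
    p = Fraction(1, 6 * n)
    for i in range(w, n + 1, w):
        print(i, float(escape_lower_bound(n, i, p)), float(p * Fraction(i, n)))
```

For every local optimum the computed value is at least pi/n, which is the bound used in the remainder of the proof.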
Note that g(0) ≤ g(1) ≤ · · · ≤ g(n), with g(n) being composed of the following sums. The additive terms n/i for all i > 0, i mod w > 0 sum up to at most Σ_{i=1}^{n} n/i = O(n log n). For each hurdle with a peak at i zeros, g(n) contains an additive term n/(ip) as well as terms . Adding up the terms for each hurdle with w, 2w, 3w, . . ., (n/w)w zeros yields

where the penultimate line follows from

and in the last line we used log(n/w) = O(n/w) to absorb the middle term. We show in the following that the potential decreases in expectation by Ω(1).
For 0 < i mod w < w − 1, the potential decreases by g(i) − g(i − 1) if mutation creates a search point with i − 1 zeros and the mutant is evaluated correctly (probability at least i/n · (1 − p)). It is increased by g(i + 1) − g(i) only if mutation creates a search point with i + 1 zeros (probability (n − i)/n ≤ 1) and either the parent or the offspring is misevaluated (probability at most 2p), as otherwise the offspring will be rejected. Thus for all i with i mod w ∉ {0, w − 1}, using a_{i+1} = 2a_i,

As p ≤ 1/(6n), the bracket is at least 1 − 1/(6n) − 2/3 ≥ 0, hence the drift is at least

For i mod w = 0, the potential is decreased by g(i) − g(i − 1) = n/(ip) with probability at least pi/n, and it is increased by g(i + 1) − g(i) only if either the parent or the offspring is misevaluated and the offspring increases the number of zeros. The probability of an increase is bounded by 2p. Thus

and using p ≤ 1/(6n), i ≥ w and w ≥ 2 log n this is at least 1 − o(1).
For i mod w = w − 1 the potential is decreased by g(i) − g(i − 1) if mutation decreases the number of zeros and both parent and offspring are evaluated truthfully. The potential is increased by g(i + 1) − g(i) only if mutation creates a search point with i + 1 zeros (probability at most 1). Thus

For all states i > 0, the expected decrease in g(X_t) is at least c for a suitable constant c > 0. Once g(X_t) = 0 is reached, an optimum is found. Standard additive drift analysis (see, e. g. [29, Theorem 1] for a self-contained statement and proof) then implies that the expected time until g(

The reason why prior noise is helpful is that, intuitively speaking, it can "smooth out" the fitness landscape, blurring rugged peaks and allowing the algorithm to see the underlying gradient. Hence noise can be useful for problems with a big valley structure [36,47]. This effect has been observed in continuous spaces before [46] where it was termed "annealing of peaks". In discrete spaces the only other examples the author is aware of showing a positive effect of noise are deceptive functions and needle-in-a-haystack functions [45].
To put our result in perspective, we have shown that noise can mitigate a poor choice of algorithm. In our case, an elitist algorithm became a non-elitist algorithm because of noise. This is helpful for Hurdle as here non-elitism is advantageous, while even a small amount of non-elitism is clearly detrimental for LeadingOnes. Note that, as argued in [1, Section 4], noise can never improve an optimal algorithm for a particular problem. If noise was able to improve the performance of an optimal algorithm, we could simply simulate the effect of noise in the algorithm and obtain a better performing algorithm.

Experiments
We also provide experiments for Hurdle to see how well the theory predicts the average optimisation time, and to answer questions not covered by Theorem 16.
Figure 4 shows the expected optimisation time of RLS and the (1+1) EA, for Hurdle with n = 100 bits and a hurdle width of w = 2 log n = 14. Runs were stopped after n^3 = 10^6 generations or when the optimum was found. For one-bit noise with noise strength p, the plots show that the algorithm is very efficient in the region p ∈ {2^{−10}, . . ., 2^{−4}} ≈ {1/(10n), . . ., 6.4/n} as predicted by Theorem 16. The time further seems to increase with 1/p as p is decreased, which matches the term n^2/(pw^2) in the running time bound.
We can further see that as p becomes too large, i. e., for p ≥ 2^{−3}, the average time increases sharply. This matches known results for OneMax where p = ω((log n)/n) leads to superpolynomial expected times [16].
Figure 4 further shows that the choice of the noise model is insignificant: the results are nearly identical for one-bit prior noise p and bit-wise prior noise (1, q/n) across all values of p = q.

The proof of Theorem 16 relies on the fact that a fitness-decreasing step leaving a local optimum towards the global optimum is accepted because of noise. Offspring populations reduce this effect of noise. While this reduction was helpful on LeadingOnes, it is detrimental for Hurdle. This is shown empirically in Figure 6.
An increased offspring population shifts the curves towards higher noise parameters, while maintaining the unimodal shape of the curve, with steep increases for too large values.This shift is very similar to the one observed for the (1+λ) EA on LeadingOnes.
For instance, for p ∈ {2^{−10}, . . ., 2^{−4}} where (1+1) RLS is efficient, the (1+8) RLS fails to find the optimum before time runs out in almost all runs, and the (1+16) RLS only found the optimum in a single run at p = 0.5, with a time of 795,151 generations. We conclude that, in this context, offspring populations can be harmful.

Conclusions
We have presented a simple method for proving upper bounds under several prior noise models, based on estimating the probability that during the median worst-case optimisation time no noise occurs. Despite its simplicity, it matches and generalises the best known results [5,42] and provides a unified approach for one-bit noise, bit-wise noise, and asymmetric bit-wise noise. Along with our negative result for LeadingOnes, the expected optimisation time of the (1+1) EA on LeadingOnes is Θ(n^2) · exp(Θ(min{pn^2, n})) for one-bit noise p ≤ 1/2, asymmetric one-bit noise with p = O(1/n), and bit-wise noise (p′, q/n) where q/n ≤ 1/2 and p = p′ min{q, 1}. This confirms that the threshold between polynomial and superpolynomial expected times is p = Θ((log n)/n^2) and p = Ω(1/n) leads to exponential expected times.
Offspring populations can cope with noise up to p ≤ 1/2 if the population size is at least λ ≥ log_{e/(e−1/2)}(n) ≈ 3.42 log n. We obtained an upper bound of O(n^2 · e^{O(pn/λ)}), guaranteeing polynomial expected times for p = O((λ log n)/n). An open problem is whether the upper bound is tight in the same sense as for the (1+1) EA.
Finally, we showed that on the Hurdle problem class, a highly rugged problem with a clear "big valley" structure, prior noise is helpful as it allows RLS to escape from local optima and to follow the underlying gradient. Experiments complemented our theoretical results and also showed that RLS under noise outperforms the (1+1) EA both with and without noise. Experiments further showed that on Hurdle, in stark contrast to LeadingOnes, offspring populations in RLS can be harmful as here they reduce the beneficial effects of noise.
Open problems for future work include showing a lower bound for the expected optimisation time of the (1+λ) EA on LeadingOnes, and obtaining tighter results on the performance of evolutionary algorithms with parent populations, i. e., the (µ+1) EA, on LeadingOnes and other problems.

Figure 3: Sketch of a Hurdle function with hurdle width w = 4 and problem size n = 20.