On the Benefits of Populations on the Exploitation Speed of Standard Steady-State Genetic Algorithms

It is generally accepted that populations are useful for the global exploration of multi-modal optimisation problems. Indeed, several theoretical results are available showing such advantages over single-trajectory search heuristics. In this paper we provide evidence that evolving populations via crossover and mutation may also benefit the optimisation time for hillclimbing unimodal functions. In particular, we prove bounds on the expected runtime of the standard ($\mu$+1)~GA for OneMax that are lower than its unary black box complexity and decrease in the leading constant with the population size up to $\mu=O(\sqrt{\log n})$. Our analysis suggests that the optimal mutation strategy is to flip two bits most of the time. To achieve the results we provide two interesting contributions to the theory of randomised search heuristics: 1) A novel application of drift analysis which compares absorption times of different Markov chains without defining an explicit potential function. 2) The inversion of fundamental matrices to calculate the absorption times of the Markov chains. The latter strategy was previously proposed in the literature but to the best of our knowledge this is the first time it has been used to show non-trivial bounds on expected runtimes.


INTRODUCTION
Populations in evolutionary and genetic algorithms are considered crucial for the effective global optimisation of multi-modal problems. For this to be the case, the population should be sufficiently diverse such that it can explore multiple regions of the search space at the same time [9]. Also, if the population has sufficient diversity, then it considerably enhances the effectiveness of crossover for escaping from local optima. Indeed the first proof that crossover can considerably improve the performance of GAs relied on either enforcing diversity by not allowing genotypic duplicates or by using unrealistically small crossover rates for the Jump function [11]. It has been shown several times that crossover is useful to GAs using the same, or similar, diversity enhancing mechanisms for a range of optimisation problems including shortest path problems [7], vertex cover [17], colouring problems inspired by the Ising model [21] and computing input output sequences in finite state machines [14].
These examples provide considerable evidence that, by enforcing the necessary diversity, crossover makes GAs effective and often superior to applying mutation alone. However, it has rarely been proven that the diversity mechanisms are actually necessary for GAs, or to what extent they are beneficial to outperform their mutation-only counterparts rather than being applied to simplify the analysis. Recently, some light has been shed on the power of standard genetic algorithms without diversity over the same algorithms using mutation alone. Dang et al. showed that the plain (µ+1) GA is at least a linear factor faster than its (µ+1) EA counterpart at escaping the local optimum of Jump [3]. Sutton showed that the same algorithm with crossover, if run sufficiently many times, is a fixed parameter tractable algorithm for the closest string problem while without crossover it is not [23]. Lengler provided an example of a class of unimodal functions to highlight the robustness of the crossover based version with respect to the mutation rate compared to the mutation-only version, i.e., the (µ+1) GA is efficient for any mutation rate c/n while the (µ+1) EA requires exponential time as soon as approximately c > 2.13 [15]. In all three examples the population size has to be large enough for the results to hold, thus providing evidence of the importance of populations in combination with crossover.
Recombination has also been shown to be very helpful at exploitation if the necessary diversity is enforced through some mechanism. In the (1+(λ, λ)) GA such diversity is achieved through large mutation rates. The algorithm can optimise the well-known OneMax function in sublogarithmic time with static offspring population sizes λ [5], and in linear time with self-adaptive values of λ [4]. Although using a recombination operator, the algorithm is still basically a single-trajectory one (i.e., there is no population). More realistic steady-state GAs that actually create offspring by recombining parents have also been analysed for OneMax. Sudholt showed that (µ+λ) GAs are twice as fast as their mutation-only version (i.e., no recombination) for OneMax if diversity is enforced artificially, i.e., genotype duplicates are preferred for deletion [22]. He proved a runtime of (e/2)n ln n + O(n) versus the en ln n + O(n) function evaluations required by any standard bit mutation-only evolutionary algorithm for OneMax and any other linear function [25]. If offspring are identical to their parents it is not necessary to evaluate the quality of their solution. When the unnecessary queries are avoided, the expected runtime of the GA using artificial diversity from [22] is bounded above by (1 + o(1))0.850953 n ln n [19]. Hence, it is faster than any unary (i.e., mutation-only) unbiased¹ black-box search heuristic [12].
On one hand, the enforced artificiality in the last two results considerably simplifies the analysis. On the other hand, the power of evolving populations for effective optimisation cannot be appreciated. Since the required diversity to make crossover effective is artificially enforced, the optimal population size is 2 and larger populations provide no benefits. Corus and Oliveto showed that the standard (µ+1) GA without diversity is still faster than mutation-only ones by proving an upper bound on the runtime of (3/4)en ln n + O(n) for any 3 < µ < o(log n/log log n) [2]. A result of enforcing the diversity in [22] was that the best GA for the problem only used a population of size 2. However, even though this artificiality was removed in [2], a population of size 3 was sufficient to get the best upper bound on the runtime achievable with their analysis. Overall, their analysis does not indicate any tangible benefit towards using a population larger than µ = 3. Thus, rigorously showing that populations are beneficial for GAs in the exploitation phase has proved to be a non-trivial task.
In this paper we provide a more precise analysis of the behaviour of the population of the (µ+1) GA for OneMax. We prove that the standard (µ+1) GA is at least 60% faster than the same algorithm using only mutation. We also prove that the GA is faster than any unary unbiased black-box search heuristic if offspring with identical genotypes to their parents are not evaluated. More importantly, our upper bounds on the expected runtime decrease with the population size up to µ = o(√(log n)), thus providing for the first time a natural example where populations evolved via recombination and mutation optimise faster than unary unbiased heuristics.

The Genetic Algorithm
The (µ+1) GA is a standard steady-state GA which samples a single new solution at every generation [8,20]. It keeps a population of the µ best solutions sampled so far and at every iteration selects two solutions from the current population uniformly at random with replacement as the parents. The recombination operator then picks building blocks from the parents to create the offspring solution. For the case of pseudo-Boolean functions $f : \{0,1\}^n \to \mathbb{R}$, the most frequently used recombination operator is uniform crossover which picks the value of each bit position i ∈ [n] from one parent or the other uniformly at random (i.e., from each parent with probability 1/2) [8].
Then, an unbiased unary variation operator, which is called the mutation operator, is applied to the offspring solution before it is added to the population. The most common mutation operator is the standard bit-mutation which independently flips each bit of the offspring solution with some probability c/n [25]. Finally, before moving to the next iteration, one of the solutions with the worst fitness value is removed from the population. For the case of maximisation the (µ+1) GA is defined in Algorithm 1.
The runtime of Algorithm 1 is the number of function evaluations until a solution which maximises the function f is sampled for the first time. If every offspring is evaluated, then the runtime is equal to the value of the variable t in Alg. 1 when the optimal solution is sampled. However, if the fitness of offspring which are identical to their parents is not evaluated, then the runtime is smaller than t. We will first analyse the former scheme and then adapt the result to the latter.
¹ The probability of a bit being flipped by an unbiased operator is the same for each bit-position.
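The description above can be turned into a minimal Python sketch. It is an illustration under stated assumptions, not a transcription of Algorithm 1: the uniformly random initialisation, the counting of the initial µ solutions, the uniform tie-breaking among worst solutions, and all function and parameter names are assumptions introduced here. The function returns both counters discussed in the text: `t`, which counts every sampled solution, and `evals`, which skips offspring identical to one of their parents.

```python
import random

def onemax(x):
    """OneMax with the all-ones target: the number of 1-bits in x."""
    return sum(x)

def uniform_crossover(a, b, rng):
    """Take each bit from parent a or parent b with probability 1/2."""
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]

def standard_bit_mutation(x, c, rng):
    """Flip each bit independently with probability c/n."""
    rate = c / len(x)
    return [1 - xi if rng.random() < rate else xi for xi in x]

def mu_plus_one_ga(n, mu, c=1.0, seed=0, max_iters=10**6):
    """Run a (mu+1) GA on OneMax and return (t, evals).

    Assumptions (not taken from the source): uniformly random
    initialisation, initial solutions counted in both counters,
    uniform tie-breaking when a worst solution is removed.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    fit = [onemax(x) for x in pop]
    t = evals = mu
    while max(fit) < n and t < max_iters:
        # Select two parents uniformly at random with replacement.
        i1, i2 = rng.randrange(mu), rng.randrange(mu)
        child = standard_bit_mutation(
            uniform_crossover(pop[i1], pop[i2], rng), c, rng)
        if child == pop[i1]:
            child_fit = fit[i1]      # fitness already known, no query
        elif child == pop[i2]:
            child_fit = fit[i2]
        else:
            child_fit = onemax(child)
            evals += 1
        t += 1
        pop.append(child)
        fit.append(child_fit)
        # Remove one solution with the worst fitness value.
        worst = min(fit)
        victim = rng.choice([k for k, f in enumerate(fit) if f == worst])
        pop.pop(victim)
        fit.pop(victim)
    return t, evals

t, evals = mu_plus_one_ga(n=50, mu=5, c=1.0, seed=42)
print(f"sampled solutions: {t}, fitness queries: {evals}")
```

Because the flag-free design uses the same random choices for both counters, a single run shows how many queries the second scheme saves on that trajectory.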

The Optimisation Problem
Given a secret bitstring $z \in \{0,1\}^n$, $\mathrm{OneMax}_z(x) := |\{i \in [n] \mid z_i = x_i\}|$ returns the number of bits on which a candidate solution $x \in \{0,1\}^n$ matches z [25]. The optimisation time (synonymously, runtime) is defined by the number of queries to the function required by an algorithm to minimise the Hamming distance between the candidate solution and the hidden bitstring z.
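As a small illustration of the definition, the following Python sketch evaluates OneMax for a hidden string and checks that the fitness and the Hamming distance to z always sum to n. The function names and the example strings are chosen here for illustration only.

```python
def onemax_z(x, z):
    """Number of positions on which x agrees with the hidden string z."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    return sum(1 for xi, zi in zip(x, z) if xi == zi)

def hamming_distance(x, z):
    """Number of positions on which x and z differ."""
    return sum(1 for xi, zi in zip(x, z) if xi != zi)

z = [1, 0, 1, 1, 0]   # hypothetical hidden bitstring
x = [1, 1, 1, 0, 0]   # candidate solution; agrees with z at positions 0, 2 and 4

# Maximising OneMax_z is the same as minimising the Hamming distance to z.
assert onemax_z(x, z) + hamming_distance(x, z) == len(z)
```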

Our Results
In this paper we prove the following results.

Informal. The expected runtime for the (µ+1) GA (with unbiased mutations and population size µ = o(√(log n))) to optimise the OneMax function is where $\gamma_1$ and $\gamma_2$ are decreasing functions of the population size µ.
The above two statements are very general as they provide upper bounds on the expected runtime of the (µ+1) GA for each value of the population size up to µ = o(√(log n)) and any unbiased mutation operator. The leading constants $\gamma_1$ and $\gamma_2$ in Statements 1 and 2 are plotted respectively in Fig. 1 and 2 for different population sizes using the $p_0$, $p_1$, $p_2$ and c values which minimise the upper bounds.
The result is significant particularly for the following three reasons (in order of increasing importance).
(1) The first statement shows how the genetic algorithm outperforms any unbiased mutation-only heuristic since the best expected runtime achievable by any algorithm belonging to such class is at least n ln n − cn ± o(n) [6]. Given that the best expected runtime achievable with any search heuristic using only standard bit mutation is (1 + o(1))en ln n [25], the second statement shows how by adding recombination a speed-up of 60% is achieved for the OneMax problem for any population size up to µ = o(√(log n)).
(2) Very few results are available proving constants in the leading terms of the expected runtime for randomised algorithms due to the considerable technical difficulties in deriving them. Exceptions exist such as the analyses of [25] and [6] without which our comparative results would not have been achievable. While such precise results are gaining increasing importance in the theoretical computer science community, the available ones are related to simpler algorithms. This is the first time similar results are achieved concerning a standard genetic algorithm, which is much more complicated to analyse, using realistic population sizes and recombination.
(3) The preciseness of the analysis allows for the first time an appreciation of the surprising importance of the population for optimising unimodal functions² as our upper bounds on the expected runtime decrease as the population size increases. In particular as the problem size increases, so does the optimal size of the population (the best known runtime available for the (µ+1) GA was of (1 + o(1))(3/4)en ln n independent of the population size as long as it is greater than µ = 3, i.e., there were no evident advantages in using a larger population [2]). This result is in contrast to all previous analyses of simplified evolutionary algorithms for unimodal functions where the algorithmic simplifications, made for the purpose of making the analysis more accessible, caused the use of populations to be either ineffective or detrimental [19,22,24]. Our upper bound of µ = o(√(log n)) is very close to the at most logarithmic population sizes typically recommended for monotone functions to avoid detrimental runtimes [1,24]. We conjecture that the optimal population size is $\Theta((\log n)^{1-\epsilon})$ for any constant ϵ > 0, which cannot be proven with our mathematical methods for technical reasons.
² Populations are traditionally thought to be useful for solving multi-modal problems.

Proof Strategy
Our aim is to provide a precise analysis of the expected runtime of the (µ+1) GA for optimising OneMax with arbitrary problem size n. Deriving the exact transition probabilities of the algorithm from all possible configurations to all others of its population is prohibitive. We will instead devise a set of n Markov chains, one for each improvement the algorithm has to pessimistically make to reach the global optimum, which will be easier to analyse. Then we will prove that the Markov chains are slower to reach their absorbing state than the (µ+1) GA is in finding the corresponding improvement.
In essence, our proof strategy consists of: (1) identifying suitable Markov chains, (2) proving that the absorbing times of the Markov chains are larger than the expected improving times of the actual algorithm and (3) bounding the absorbing times of each Markov chain.
In particular, concerning point (2) we will first define a potential function which monotonically increases with the number of copies of the genotype with most duplicates in the population and then bound the expected change in the potential function at every iteration (i.e., the drift) from below. Using the maximum value of the potential function and the minimum drift, we will bound the expected time until the potential function value drops to its minimum value for the first time. This part of the analysis is a novel application of drift analysis techniques [13]. In particular, rather than using an explicit distance function as traditionally occurs, we define the potential function to be equal to the conditional expected absorption time of the corresponding states of each Markov chain.
Concerning point (3) of our proof strategy, we will calculate the absorbing times of the Markov chains $M^j$ by identifying their fundamental matrices. This requires the inversion of tridiagonal matrices. Similar matrix manipulation strategies to bound the runtime of evolutionary algorithms have been previously suggested in the literature [10,18]. However, all previous applications showed that the approach could only be applied to prove results that could be trivially achieved via simpler standard methods. To the best of our knowledge, this is the first time that the power of this long abandoned approach has finally been shown by proving non-trivial bounds on the expected runtime.
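To make point (3) concrete, the sketch below computes expected absorption times of a small absorbing Markov chain exactly. If Q holds the transition probabilities among the transient states, the fundamental matrix is $N = (I - Q)^{-1}$ and the vector of expected absorption times is the solution of $(I - Q)t = \mathbf{1}$. The three-state chain and its probabilities are hypothetical values chosen for illustration; they are not the transition probabilities of the paper's chains $M^j$, and plain Gauss-Jordan elimination is used here instead of a dedicated tridiagonal method.

```python
from fractions import Fraction

def expected_absorption_times(Q):
    """Solve (I - Q) t = 1 exactly by Gauss-Jordan elimination.

    Q[i][k] is the probability of moving from transient state i to
    transient state k. t[i] is the expected number of steps until
    absorption from state i, i.e. the i-th row sum of the fundamental
    matrix N = (I - Q)^{-1}. For an absorbing chain I - Q is invertible,
    so a non-zero pivot always exists.
    """
    k = len(Q)
    # Augmented matrix [I - Q | 1].
    A = [[Fraction(int(i == j)) - Fraction(Q[i][j]) for j in range(k)]
         + [Fraction(1)] for i in range(k)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        inv = 1 / A[col][col]
        A[col] = [v * inv for v in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical tridiagonal chain with three transient states; the missing
# probability mass in each row is the probability of absorption.
F = Fraction
Q = [
    [F(89, 100), F(1, 10), F(0)],
    [F(1, 5),    F(1, 2),  F(1, 5)],
    [F(0),       F(1, 5),  F(3, 5)],
]
times = expected_absorption_times(Q)
print([float(v) for v in times])
```

Exact rational arithmetic is used so that the law of total expectation, $t_i = 1 + \sum_k Q_{i,k} t_k$, can be checked without rounding error.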

MAIN RESULT STATEMENT
Our main result is the following theorem.
Theorem 3.1. The expected runtime for the (µ+1) GA with µ = o(√(log n)) using an unbiased mutation operator mutate(x) that flips i bits with probability $p_i$, with $p_0 \in \Omega(1)$ and $p_1 \in \Omega(1)$, to optimise the OneMax function is: if the quality of offspring identical to their parents is not evaluated for their quality is known; and
The recombination operator of the GA is effective only if individuals with different genotypes are picked as parents (i.e., recombination cannot produce any improvements if two identical individuals are recombined). However, more often than not, the population of the (µ+1) GA consists only of copies of a single individual. When diversity is created via mutation (i.e., a new genotype is added to the population), it either quickly leads to an improvement or it quickly disappears. The bound on the runtime reflects this behaviour as it is simply a waiting time until one of two events happens; either the current individual is mutated to a better one or diversity emerges and leads to an improvement before it is lost. The $\xi_2$ term in the runtime is the conditional probability that once diversity is created by mutation, it will be lost before reaching the next fitness level (an improvement). Naturally, $(1 - \xi_2)$ is the probability that a successful crossover will occur before losing diversity. The $(1 - \xi_2)$ factor increases with the population size µ, which implies that larger populations have a higher capacity to maintain diversity long enough to be exploited by the recombination operator.
Note that setting $p_i := 0$ for all i > 2 minimises the upper bound on the expected runtime in the second statement of Theorem 3.1 and reduces the bound to: . Now, we can see the critical role that $\xi^*(\mu) = (1 - \xi_2)\mu/(\mu + 1)$ plays in the expected runtime. For any population size which yields $\xi^*(\mu) \le 1/2$, flipping only one bit per mutation becomes advantageous. The best upper bound achievable from the above expression is then (1 + o(1))n ln n by assigning an arbitrarily small constant to $p_0$ and $p_1 = 1 - p_0$. As long as $p_0 = \Omega(1)$, when an improvement occurs, the superior genotype takes over the population quickly relative to the time between improvements. Since there are only one-bit flips, the crossover operator becomes virtually useless (i.e., crossover requires a Hamming distance of 2 between parents to create an improving offspring) and the resulting algorithm is a stochastic local search algorithm with a population. However, when $\xi^*(\mu) > 1/2$, setting $p_2$ as large as possible provides the best upper bound. The transition for $\xi^*(\mu)$ happens between population sizes of 4 and 5. For populations larger than 5, by setting $p_1 := \epsilon/2$ and $p_0 := \epsilon/2$ for an arbitrarily small constant ϵ and setting $p_2 = 1 - \epsilon$, we get the upper bound ), which is plotted for different population sizes in Figure 1.
A direct corollary to the main result is the upper bound for the classical (µ+1) GA commonly used in evolutionary computation which applies standard bit mutation with mutation rate c/n for which p
Corollary 3.2. Let $\xi_2$ be as defined in Theorem 3.1. The expected runtime for the (µ+1) GA with µ = o(√(log n)) using standard bit-mutation with mutation rate c/n, c = Θ(1), to optimise the OneMax function is: if the quality of offspring identical to their parents is not evaluated for their quality is known;
By calculating $\xi^*(\mu) := (1 - \xi_2)\mu/(\mu + 1)$ for fixed values of µ we can determine values of c (i.e., mutation rate) which minimise the leading constant of the runtime bound in Corollary 3.2. In Figure 2 we plot the leading constants in the first statement, minimised by picking the appropriate c values for µ ranging from 5 to 50. All the values presented improve upon the upper bound on the runtime of 1.96n ln n given in [2] for any µ ≥ 3 and µ = o(log n/log log n). All the upper bounds are less than 1.7n ln n and clearly decrease with the population size, signifying an at least 60% increase in speed compared to the (1 − o(1))en ln n lower bound for the same algorithm without the recombination operator.
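For orientation, the probabilities $p_i$ of flipping exactly i bits under standard bit mutation with rate c/n follow the binomial distribution, which approaches a Poisson distribution with mean c as n grows. This is a standard fact rather than a reconstruction of the expressions used in the corollary; the function names below are chosen for illustration.

```python
from math import comb, exp, factorial

def flip_probability(n, c, i):
    """Probability that standard bit mutation with rate c/n flips
    exactly i of the n bits (binomial distribution)."""
    q = c / n
    return comb(n, i) * q**i * (1 - q)**(n - i)

def poisson_limit(c, i):
    """Limit of flip_probability(n, c, i) as n grows: c^i e^{-c} / i!."""
    return c**i * exp(-c) / factorial(i)

n, c = 10**4, 1.0
for i in range(3):
    print(i, flip_probability(n, c, i), poisson_limit(c, i))
```

For constant c, the values of $p_0$, $p_1$ and $p_2$ are therefore all constants up to lower order terms, which is in line with the requirement $p_0, p_1 \in \Omega(1)$ of the main theorem.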
Considering the leading constants in the second statement of Corollary 3.2, for all population sizes larger than 5, the upper bound for the optimal mutation rate is smaller than the theoretical lower bound on the runtime of unary unbiased black-box algorithms. For population sizes of 3 and 4, $\xi^* = 1/3$ and the expression to be minimised is $(1 - e^{-c})e^{c}/(c + c^2/3)$. For c > 0, this expression has no minimum and is always larger than one. Thus, at least with our technique, a population of size 5 or larger is necessary to prove that the (µ+1) GA outperforms stochastic local search and any other unary unbiased optimisation heuristic.
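The claim about the expression for $\xi^* = 1/3$ can be checked numerically with a short sketch. Since $(1 - e^{-c})e^{c} = e^{c} - 1$, the expression equals $(e^{c} - 1)/(c + c^2/3)$, which behaves like $1 + c/6$ for small c: it approaches one from above as c tends to zero, so the infimum is not attained. The sample grid below is arbitrary and only illustrates this behaviour.

```python
from math import expm1

def leading_factor(c):
    """(1 - e^{-c}) e^{c} / (c + c^2/3), rewritten as (e^c - 1) / (c + c^2/3).
    expm1 keeps the numerator accurate for small c."""
    return expm1(c) / (c + c * c / 3)

samples = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
values = [leading_factor(c) for c in samples]
for c, v in zip(samples, values):
    print(f"c = {c:<6} factor = {v:.6f}")
```

On this grid every value exceeds one and the values grow with c, consistent with the statement that no mutation rate brings the leading constant down to that of stochastic local search for these population sizes.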

ANALYSIS
Our main aim is to provide an upper bound on the expected runtime (E[T]) of the (µ+1) GA defined in Algorithm 1 to maximise the OneMax function. W.l.o.g. we will assume that the target string z of the $\mathrm{OneMax}_z$ function to be identified is the bitstring of all 1-bits since all the operators in the (µ+1) GA are invariant to the bit-value (have no bias towards 0s or 1s). We will provide upper bounds on the expected value $E[T^j]$, where $T^j$ is the time until an individual with at least j + 1 1-bits is sampled for the first time given that the initial population consists of individuals with j 1-bits (i.e., the population is at level j). Then, by summing up the values of $E[T^j]$ and the expected times for the whole population to reach j + 1 1-bits for j ∈ {1, ..., n − 1} we achieve a valid upper bound on the runtime of the (µ+1) GA. Similarly to the analysis in [2], we will pessimistically assume that the algorithm is initialised with all individuals having just 0-bits, and that throughout the optimisation process at most one extra 1-bit is discovered at a time.
We will devise a Markov chain $M^j$ for each j ∈ {0, ..., n − 1} for which we can analyse the expected absorbing time $E[T^j_i]$ starting from state $S^j_i$. We will then prove that it is in expectation slower in reaching its absorbing state than the (µ+1) GA is in finding an improvement given an initial population at level j. In particular, we will define a non-negative potential function on the domain of all possible configurations of a population at level j or above. For any configuration at level j, we will refer to the genotype with the most copies in the population as the majority genotype and define the diversity of a population as the number of non-majority individuals in the population. Our potential function will be monotonically decreasing with the diversity. Moreover, we will assign the potential function a value of zero for all populations with at least one solution which has more than j 1-bits. Then, we will bound the expected change in the potential function at every iteration (i.e., the drift) from below. Using the maximum value of the potential function and the minimum drift, we will derive a bound on the expected time until an improvement is found starting from a population at level j with no diversity (i.e., all the solutions in the population are identical). While this upper bound will not provide an explicit runtime as a function of the problem size, it will allow us to conclude that $E[T^j_0] \ge E[T^j]$. Thus, all that remains will be to bound the expected absorbing time of $M^j$ initialised at state $S^j_0$. We will obtain this bound by identifying the fundamental matrix of $M^j$. After establishing that the inverse of the fundamental matrix is a strongly diagonally dominant tridiagonal matrix, we will make use of existing tools in the literature for inverting such matrices and complete our proof.

Markov Chain Definition
In this subsection we will present the Markov chains which we will use to model the behaviour of the (µ+1) GA. Each Markov chain $M^j$ has m := µ/2 transient states ($S^j_0, S^j_1, \ldots, S^j_{m-1}$) and one absorbing state ($S^j_m$) with the topology depicted in Figure 3. The states $S^j_i$ represent the amount of diversity in the population. Hence, e.g., $S^j_1$ refers to a population where all the individuals have j 1-bits and all but one of them share the same genotype, while $S^j_{m-1}$ refers to a population where at most m − 1 = µ/2 + 1 individuals are identical. Compared to the analysis presented in [2] that used Markov chains of only three states (i.e., no diversity, diversity, increase in 1-bits), $M^j$ allows us to control the diversity in the population more precisely, and thus to show that larger populations are beneficial to the optimisation process.
$p_{i,j} := 0$ otherwise.
Now, we will point out the important characteristics of these transition probabilities. The transition probabilities $p_{i,k}$ are set to be equal to provable bounds on the probabilities of the (µ+1) GA, with a population consisting of solutions with j 1-bits, of gaining/losing diversity ($p_{i,i+1}$/$p_{i,i-1}$) and sampling a solution with more than j 1-bits ($p_{i,m}$). In particular, upper bounds are used for the transition probabilities $p_{i,k}$ where i < k and lower bounds are used for the transition probabilities $p_{i,k}$ where i > k. Note that greater diversity corresponds to a higher probability of two distinct individuals with j 1-bits being selected as parents and improved via recombination (i.e., $p_{i,m}$ monotonically increases with i). Recombination is ineffective if i = 0, and the improvement probability $p_{0,m}$ is then simply the probability of increasing the number of 1-bits by mutation only. Thus, $p_{0,m} = \Theta((n - j)/n)$ while $p_{i,m} = \Theta(i(\mu - i)/\mu^2)$ when i > 0. The first forward transition probability $p_{0,1}$ denotes the probability of the mutation operator creating a different individual with j 1-bits and the selection operator removing one of the majority individuals from the population. The other transition probabilities, $p_{i,i+1}$ and $p_{i,i-1}$, bound the probability that a copy of the minority solution or the majority solution is added to the population and that a member of the other species (minority/majority) is removed in the subsequent selection phase. All transition probabilities except $p_{0,1}$ and $p_{0,m}$ are independent of j and referred to in the theorem statements without specifying j.

Validity of the Markov Chain Model
In this subsection we will present how we establish that $M^j$ is a pessimistic representation of Algorithm 1 initialised with a population of µ identical individuals at level j. In particular, we will show that $E[T^j_0]$, the expected absorbing time starting from state $S^j_0$, is larger than $E[T^j]$. This result is formalised in the following lemma.
We will use drift analysis [13], a frequently used tool in the runtime analysis of evolutionary algorithms, to prove the above result. We will start by defining a potential function over the state space of Alg. 1 that maps states to the expected absorbing times of $M^j$. The minimum of the potential function will correspond to the state of Algorithm 1 which has sampled a solution with more than j 1-bits and we will explicitly prove that the maximum of the potential function is $E[T^j_0]$. Then, we will show that the drift, i.e., the expected decrease in the potential function value in a single time unit (from time t to t + 1), is at least one. Using the maximum value of the potential function and the minimum drift, we will bound the runtime of the algorithm by the absorbing time of the Markov chain.
We will define our potential function over the domain of all possible population diversities at level j. We will refer to the genotype with the most copies in the population as the majority genotype and recall that the diversity, $D_t \in \{0, \ldots, \mu - 1\}$, of a population $P_t$ is defined as the number of non-majority individuals in the population.

Definition 4.3.
The potential function value for level j, $\phi^j_t$ (or $\phi_t$ wherever j is obvious), is defined as follows: where $E[T^j_i]$ (denoted as $E[T_i]$ wherever j is obvious) is the expected absorbing time of the Markov chain $M^j$ starting from state $S^j_i$.
The absorbing state of the Markov chain corresponds to a population with at least one individual with more than j 1-bits, thus having potential function value (and expected absorbing time) equal to zero. The state $S^j_0$ corresponds to a population with no diversity. The following lemma formalises that the expected absorbing time gets larger as the initial states get further away from µ/2 − 1. This property implies that the expected absorbing time from state $S^j_0$ constitutes an upper bound for the potential function $\phi^j$.
Lemma 4.4. Let $E[T^j_i]$ be the expected absorbing time of Markov chain $M^j$ conditional on its initial state being
Now that the potential function is bounded from above, we will bound the drift $E[\phi_t - \phi_{t+1} \mid D_t = i]$. Due to the law of total expectation, the expected absorbing time $E[T_i]$ satisfies $\sum_k p_{i,k}(E[T_i] - E[T_k]) = 1$ for any absorbing Markov chain at state i. Since $E[T_i]$ and $E[T_k]$ are the respective potentials of the states $S_i$ and $S_k$, the left hand side of the equation closely resembles the drift. Since the probabilities for $M^j$ are pessimistically set to underestimate the drift, the above equation allows us to formally prove the following:
Lemma 4.5. For a population at level j, $E[\phi_t - \phi_{t+1} \mid D_t = i] \ge 1 - o(1)$ for all t > 0.
Proof. We will now show that $\Delta_t(i)$, the expectation of the difference between the potential function values of population $P_t$ and $P_{t+1}$, is at least 1 − o(1) for all i.
When there is no diversity in the population (i.e., $D_t = 0$) the only way to increase the diversity is to introduce it during a mutation operation. A non-majority individual is obtained when one of the n − j 0-bits and one of the j 1-bits are flipped while no other bits are touched. Then one of the majority individuals must be removed from the population during the selection phase. This event has probability at least $p_{0,1}$. Another way to change the potential function value is to create an improved individual with the mutation operator. In order to improve a solution it is sufficient to pick one of the n − j 0-bits and flip no other bits. This event has probability at least $p_1 \cdot \frac{n-j}{n} = p_{0,m}$. Thus, we can conclude that when $D_t = 0$, the drift is at least $p_{0,m} T_0 + p_{0,1}(T_0 - T_1)$. We can observe through the law of total expectation for the state $S^j_0$ of Markov chain $M^j$ that this expression for the drift when $D_t = 0$ is at least one.
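The last observation can be spelled out as a short derivation. It is a sketch under two assumptions: $T_i$ is shorthand for the expected absorbing time $E[T^j_i]$, and from state $S^j_0$ the chain can only move to $S^j_1$, move to the absorbing state $S^j_m$ (whose absorbing time is zero), or stay in place, which matches the topology described earlier.

```latex
\begin{align*}
T_0 &= 1 + p_{0,1}\,T_1 + p_{0,m}\,T_m + (1 - p_{0,1} - p_{0,m})\,T_0,
      \qquad T_m = 0, \\
(p_{0,1} + p_{0,m})\,T_0 - p_{0,1}\,T_1 &= 1, \\
p_{0,m}\,T_0 + p_{0,1}\,(T_0 - T_1) &= 1 .
\end{align*}
```

Under these assumptions the lower bound on the drift for $D_t = 0$ equals exactly one, which is the value needed for the additive drift argument.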
For $D_t > 0$, we will condition the drift on whether the picked parents are both majority individuals ($E_1$), are both minority individuals with the same genotype ($E_2$), are a pair that consists of one majority and one minority individual ($E_3$), or are both minority individuals with different genotypes ($E_4$).
Let $E^*$ be the event that the population $P_t$ consists of two genotypes with Hamming distance two. Then, Note that when there are more than two genotypes in the population, the event of picking any two non-majority individuals is divided into two separate cases of picking identical minority individuals and picking two different minority individuals. Obviously, the sum of the probabilities of these two cases is equal to the probability of picking two minority individuals when there are only two genotypes (one majority and one minority) in the population.
Restricting ourselves to $\Delta^{i>0}_t$, the drift conditional on i > 0, the law of total expectation states:
We can rearrange the above expression using the probabilities from Eq. 1
We will now write the law of total expectation for state i for our Markov chain $M^j$:
We will then substitute the probabilities in the law of total expectation with the values from Definition 4.1,
Finally, we will rearrange the above expression into the terms with the probabilities of events $E_i$ as multiplicative factors
We will refer to the first, second and third line of Eq. 2 as the $E_1$, $E_2$ and $E_3$ term respectively. We will show that for each term, the conditional drift is larger than the term without the multiplicative factor.
When two majority individuals are selected as parents ($E_1$), we pessimistically assume that improving to the next level and increasing the diversity have zero probability. Losing the diversity requires that no bits are flipped during mutation and that a minority individual will be removed from the population. The probability that no bits are flipped is $p_0$. Thus we can show that: . This bound is obviously the same as the $E_1$ term of Eq. 2 without the parent selection probability.
When two minority individuals are selected as parents ($E_2$ or $E_4$), if they are identical ($E_2$) then it is sufficient that the mutation does not flip any bits, which occurs with probability $p_0$, and that a majority individual is removed from the population (with probability (µ − i)/µ). Thus, the probability of increasing the diversity is $p_0 (\mu - i)/\mu$ and the probability of creating a majority individual is O(1/n) since it is necessary to flip at least one particular bit position: However, if the two minority individuals have a Hamming distance of 2d ≥ 2 (i.e., $E_4$), then in order to create another minority individual at the end of the crossover operation it is necessary that the crossover picks exactly d 1-bits and d 0-bits among the 2d bit positions where they differ. There are $\binom{2d}{d}$ different ways that this can happen and the probability that any particular outcome of crossover is realised is $2^{-2d}$. One of those outcomes, though, might be the majority individual and if that is the case the diversity can decrease afterwards. However, while the Hamming distance between the minority individuals can be 2d = 2, obtaining a majority individual by recombining two minority individuals requires at least four specific bit positions to be picked correctly during crossover and thus does not occur with probability greater than 1/16. On the other hand, when two different minority individuals are selected as parents, there is at least a $\frac{1 - \binom{2d}{d}2^{-2d}}{2} \ge 1/4$ probability that the crossover will result in an individual with more 1-bits and then with probability $p_0$ the mutation will not flip any bits.
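The crossover probabilities used in this case can be verified by brute force for small d. The sketch below enumerates every outcome of uniform crossover on the 2d positions where two parents with equally many 1-bits differ, and checks that the offspring keeps the same number of 1-bits with probability $\binom{2d}{d}2^{-2d}$ and gains 1-bits with probability $(1 - \binom{2d}{d}2^{-2d})/2 \ge 1/4$. The function name and the representation of the parents are choices made here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def crossover_outcome_probabilities(d):
    """Exact probabilities that uniform crossover yields an offspring with
    (fewer, equally many, more) 1-bits than its two parents, which have
    the same number of 1-bits and differ in exactly 2d positions."""
    parent_a = [1] * d + [0] * d               # parent A on the differing positions
    parent_b = [1 - bit for bit in parent_a]   # parent B is the complement there
    fewer = same = more = 0
    # choice[k] == 0 takes position k from A, choice[k] == 1 takes it from B.
    for choice in product((0, 1), repeat=2 * d):
        ones = sum(b if pick else a
                   for a, b, pick in zip(parent_a, parent_b, choice))
        if ones < d:
            fewer += 1
        elif ones == d:
            same += 1
        else:
            more += 1
    total = 2 ** (2 * d)
    return Fraction(fewer, total), Fraction(same, total), Fraction(more, total)

for d in range(1, 6):
    fewer, same, more = crossover_outcome_probabilities(d)
    assert same == Fraction(comb(2 * d, d), 4 ** d)
    assert more == (1 - same) / 2 >= Fraction(1, 4)
```

The bound of 1/4 is tight for d = 1 and the probability of gaining 1-bits grows with d, which is why larger Hamming distances only help the improvement probability.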
Note that the $E_2$ term of Eq. 2 multiplied with a factor of $(1 - o(1))$ is smaller than both conditional drifts multiplied with the parent selection probability $(i/\mu)^2$.

Finally, we will consider the drift conditional on event $E_3$, the case when one minority and one majority individual are selected as parents. We will further divide this event into two subcases. In the first case the Hamming distance $2d$ between the minority and the majority individual is exactly two ($d = 1$). Then, the probabilities that crossover creates a copy of the minority individual, a copy of the majority individual or a new individual with more 1-bits are all equal to $1/4$. Thus, the conditional drift is

However, when $d > 1$, the drift is more similar to the case of $E_3$ where the probabilities of creating copies of either the minority or the majority diminish with larger $d$, while larger $d$ increases the probability of creating an improved individual. More precisely, the drift is . Now, finally, we can observe that $(\Delta_t \mid E_3, d = 1)$ multiplied with $2i(\mu - i)/\mu^2$ is larger than the $E_3$ term in Eq. 2 multiplied with a factor of $(1 - o(1))$.
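The case distinction on $d$ can be checked by brute force. The sketch below assumes uniform crossover and looks only at the $2d$ positions where two parents of equal fitness differ. The function name and the category labels are ours for illustration and do not appear in the analysis.

```python
from fractions import Fraction
from itertools import product

CATEGORIES = ("copy_a", "copy_b", "more_ones", "fewer_ones", "other_same_fitness")


def crossover_outcomes(d: int) -> dict:
    """Exact distribution of uniform crossover outcomes on 2d differing bits."""
    # Parent a holds its 1-bits in the first d differing positions,
    # parent b holds its 1-bits in the last d differing positions.
    a = (1,) * d + (0,) * d
    b = (0,) * d + (1,) * d
    counts = {name: 0 for name in CATEGORIES}
    # mask[k] == 0 takes bit k from parent a, mask[k] == 1 takes it from b.
    for mask in product((0, 1), repeat=2 * d):
        child = tuple(a[k] if mask[k] == 0 else b[k] for k in range(2 * d))
        if child == a:
            counts["copy_a"] += 1
        elif child == b:
            counts["copy_b"] += 1
        elif sum(child) > d:
            counts["more_ones"] += 1
        elif sum(child) < d:
            counts["fewer_ones"] += 1
        else:
            counts["other_same_fitness"] += 1
    total = 4 ** d  # every mask is equally likely
    return {name: Fraction(c, total) for name, c in counts.items()}


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, crossover_outcomes(d))
```

For $d = 1$ the two copies and the improved offspring each have probability $1/4$. For $d = 2$ each copy drops to $1/16$ while the improved offspring rises to $5/16$, in line with the trend described for $d > 1$.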
We have now shown piece by piece that the conditional drifts are larger than the corresponding terms on the right-hand side of Eq. 2 up to small order terms, and thus established that $\Delta_t \geq (1 - o(1))$ for all $t > 0$. Since we have previously shown that j t ≤ $T^j_0$, we can now apply Theorem 6.1 to obtain E

Once an individual with $j + 1$ 1-bits is sampled for the first time, it takes $O(\mu \log \mu)$ iterations before the whole population consists of individuals with at least $j + 1$ 1-bits [2,22]. If the population size is in the order of $o(\log n / \log\log n)$, then the total number of iterations where there are individuals with different fitness values in the population is in the order of $o(n \log n)$. Since $j \in \{0, 1, \ldots, n-1\}$, we can establish that E

With the potential function bounded from above by Lemma 4.4 and the drift bounded from below by Lemma 4.5, we can use the additive drift theorem$^3$ from [13] to bound $E[T^j]$ by $E[T^j_0]$. By summing over all levels, we get the bound stated in Lemma 4.2 on the expected runtime of Algorithm 1.
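The $o(n \log n)$ term for the takeover phases follows in one line. This is a sketch under the stated condition $\mu = o(\log n / \log\log n)$, additionally assuming that $n$ is large enough for $\mu \leq \log n$ to hold, so that $\log \mu \leq \log\log n$:

```latex
\[
\mu \log \mu \;\le\; \mu \log\log n
  \;=\; o\!\left(\frac{\log n}{\log\log n}\right) \cdot \log\log n
  \;=\; o(\log n),
\qquad\text{hence}\qquad
\sum_{j=0}^{n-1} O(\mu \log \mu) \;=\; o(n \log n).
\]
```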

Markov Chain Absorption Time Analysis
In the previous subsection we stated in Lemma 4.2 that we can bound the absorbing times of the Markov chains $M^j$ to derive an upper bound on the runtime of Algorithm 1. In this subsection we use mathematical tools developed for the analysis of Markov chains to provide such bounds on the absorbing times.
The absorbing time of a Markov chain starting from any initial state $i$ can be derived by identifying its fundamental matrix. Let the matrix $Q$ denote the transition probabilities between the transient states of the Markov chain $M^j$. The fundamental matrix of $M^j$ is defined as $N := (I - Q)^{-1}$, where $I$ is the identity matrix. The most important characteristic of the fundamental matrix is that when it is multiplied by a column vector of ones, the product is a vector holding $E[T^j_i]$, the expected absorbing times conditional on the initial state $i$ of the Markov chain. Since Lemma 4.2 only involves $T^j_0$, we are only interested in the entries of the first row of $N = [n_{ik}]$.

$^3$ The additive drift theorem is provided in the appendix as Theorem 6.1 for reviewer convenience.

However, inverting the matrix $I - Q$ is not always a straightforward task. Fortunately, $I - Q = [a_{ik}]$ has characteristics that allow bounds on the entries of its inverse. Its entries are related to the transition probabilities of $M^j$ as follows:

$a_{ii} = 1 - p_{i-1,i-1} = p_{i-1,i-2} + p_{i-1,i} + p_{i-1,m} \quad \forall i \in \{2, \ldots, m-1\}$ (5)

Observe that $I - Q$ is a tridiagonal matrix, in the sense that all non-zero elements of $I - Q$ are either on the diagonal or adjacent to it. Moreover, the diagonal entries $a_{ii}$ of $I - Q$ are of the form $1 - p_{i-1,i-1}$, which is equal to the sum of all transition probabilities out of state $i - 1$. Since the other entries on row $i$ are transition probabilities from state $i - 1$ to adjacent states, we can see that $|a_{ii}| > \sum_{k \neq i} |a_{ik}|$. Matrices for which $|a_{ii}| > \sum_{k \neq i} |a_{ik}|$ holds are called strongly diagonally dominant (SDD). Since $I - Q$ is SDD, according to Lemma 2.14 in [16], it holds for the fundamental matrix that $E[T^j_0] \leq n_{1,1} + O(\mu^2)$. We note here that the $O(\mu^2)$ factor in the above expression creates the condition $\mu = o(\sqrt{\log n})$ on the population size for our main results. We will now bound the term $n_{1,1}$ from above to establish our upper bound using the following theorem:

Theorem 4.6 (D C 3.2 [16]). If $A$ is an $m \times m$ tridiagonal non-singular SDD matrix such that $a_{i,k} \leq 0$ for all $i \neq k$, $A^{-1} = [n_{i,k}]$ exists and $n_{i,k} \geq 0$ for all $i, k$, then $n_{1,1} = 1/(a_{1,1} + a_{1,2}\,\xi_2)$, where $\xi_i = -a_{i,i-1}/(a_{i,i} + a_{i,i+1}\,\xi_{i+1})$ and $\xi_m = -a_{m,m-1}/a_{m,m}$.
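To make the role of the fundamental matrix and of the recurrence for $n_{1,1}$ concrete, the following sketch works through a toy chain with three transient states. The transition probabilities are hypothetical and are not those of $M^j$. The recurrence is written with the sign convention $\xi_m = -a_{m,m-1}/a_{m,m}$ and $\xi_i = -a_{i,i-1}/(a_{i,i} + a_{i,i+1}\xi_{i+1})$, under which it reproduces the top-left entry of the exact inverse. The helper names `solve` and `n11_recurrence` are ours.

```python
from fractions import Fraction as F


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def n11_recurrence(a):
    """Top-left entry of the inverse of a tridiagonal matrix with m >= 2 rows."""
    m = len(a)
    xi = -a[m - 1][m - 2] / a[m - 1][m - 1]  # xi_m
    for i in range(m - 2, 0, -1):  # xi_{m-1}, ..., xi_2 (0-based rows)
        xi = -a[i][i - 1] / (a[i][i] + a[i][i + 1] * xi)
    return 1 / (a[0][0] + a[0][1] * xi)


# Hypothetical I - Q: each row is SDD and the off-diagonal entries are
# non-positive; the missing row mass is the probability of absorption.
A = [
    [F(1, 2), F(-1, 4), F(0)],
    [F(-1, 4), F(3, 4), F(-1, 4)],
    [F(0), F(-1, 4), F(3, 4)],
]

if __name__ == "__main__":
    # N * 1: expected absorbing time from each transient state.
    print("absorbing times:", solve(A, [F(1), F(1), F(1)]))
    # First column of N, whose first entry is n_{1,1}.
    print("n11 via inverse:", solve(A, [F(1), F(0), F(0)])[0])
    print("n11 via recurrence:", n11_recurrence(A))
```

Exact rational arithmetic is used so that the two ways of computing $n_{1,1}$ can be compared for equality rather than up to rounding error.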
In order to use Theorem 4.6, we need to satisfy its conditions. We can easily see that the non-diagonal entries of the original matrix $I - Q$ are non-positive, and we use Theorem 3.15 in [16] to show that $N = (I - Q)^{-1}$ has no negative entries. Thus, Theorem 4.6 yields:

CONCLUSION
In this work, we have shown that the steady-state ($\mu$+1) GA optimises OneMax faster than any unary unbiased search heuristic. Providing precise asymptotic bounds on the expected runtime of standard GAs without artificial mechanisms that simplify the analysis has been a long-standing open problem. We have derived bounds up to the leading term constants of the expected runtime.
To achieve this result we show that a simplified Markov chain pessimistically represents the behaviour of the GA for OneMax. This insight about the algorithm/problem pair allows the derivation of runtime bounds for a complex multi-dimensional stochastic process.
The analysis shows that as the number of states in the Markov chain (the population size) increases, so does the probability that diversity in the population is kept. Thus, larger populations increase the probability that recombination finds improved solutions quickly, and hence reduce the expected runtime.

Figure 1: The leading constant from the second statement of Theorem 3.1 versus the population size. The best leading constant achievable by any unary unbiased algorithm is 1.
2 and $m := \mu/2$ are defined in Definition 4.1 (Section 4.1) and are depicted in Figure 3.

Figure 3: The topology of Markov chain $M^j$.

Definition 4.1. Let $M^j$ be a Markov chain with $m := \mu/2$ transient states

Lemma 4.2. Let $E[T]$ be the expected runtime until the ($\mu$+1) GA with $\mu = o(\log n / \log\log n)$ optimises the OneMax function and let $E[T^j_i]$ (or $E[T_i]$ wherever $j$ is obvious) be the expected absorbing time of $M^j$ starting from state $S^j_i$. Then, $E[T] \leq o(n \log n) + (1 + o(1)) \sum_{j=0}^{n-1} E[T^j_0]$.

for all $i \neq k$ that

$|n_{i,k}| \leq |n_{k,k}| \leq \left( |a_{k,k}| \left( 1 - \sum_{l \neq k} \frac{|a_{l,k}|}{|a_{k,k}|} \right) \right)^{-1} = \left( |a_{k,k}| - \sum_{l \neq k} |a_{l,k}| \right)^{-1}.$

In our particular case, the above inequality implies that $|n_{1,k}| \leq 1/p_{k-1,m}$. For any population with diversity, there is a probability in the order of $O(1/\mu)$ to select one minority and one majority individual and a constant probability that their offspring will have more 1-bits than the current level. Considering $m = O(\mu)$,

$E[T^j_0] = \sum_{k=1}^{m} n_{1,k} < n_{1,1} + \sum_{k=2}^{m} \frac{1}{p_{k-1,m}} \leq n_{1,1} + O(\mu^2)$