Lazy Parameter Tuning and Control: Choosing All Parameters Randomly From a Power-Law Distribution

Most evolutionary algorithms have multiple parameters and their values drastically affect the performance. Due to the often complicated interplay of the parameters, setting these values right for a particular problem (parameter tuning) is a challenging task. This task becomes even more complicated when the optimal parameter values change significantly during the run of the algorithm since then a dynamic parameter choice (parameter control) is necessary. In this work, we propose a lazy but effective solution, namely choosing all parameter values (where this makes sense) in each iteration randomly from a suitably scaled power-law distribution. To demonstrate the effectiveness of this approach, we perform runtime analyses of the $(1+(\lambda,\lambda))$ genetic algorithm with all three parameters chosen in this manner. We show that this algorithm on the one hand can imitate simple hill-climbers like the $(1+1)$ EA, giving the same asymptotic runtime on problems like OneMax, LeadingOnes, or Minimum Spanning Tree. On the other hand, this algorithm is also very efficient on jump functions, where the best static parameters are very different from those necessary to optimize simple problems. We prove a performance guarantee that is comparable to the best performance known for static parameters. For the most interesting case that the jump size $k$ is constant, we prove that our performance is asymptotically better than what can be obtained with any static parameter choice. We complement our theoretical results with a rigorous empirical study confirming what the asymptotic runtime results suggest.


Introduction
Evolutionary algorithms (EAs) are general-purpose randomized search heuristics. They are adapted to the particular problem to be solved by choosing suitable values for their parameters. This flexibility is a great strength on the one hand, but a true challenge for the algorithm designer on the other. Missing the right parameter values can lead to catastrophic performance losses.
Despite being a core topic of both theoretical and experimental research, general advice on how to set the parameters of an EA is still rare. The difficulty stems from the fact that different problems need different parameters, different instances of the same problem may need different parameters, and even during the optimization process on one instance the most profitable parameter values may change over time.
In an attempt to design a simple one-size-fits-all solution, Doerr, Le, Makhmara, and Nguyen [DLMN17] proposed to use random parameter values chosen independently in each iteration from a power-law distribution (note that random mutation rates were used before [DDK18, DDK19], but with different distributions and for different reasons). Mostly via mathematical means, this was shown to be highly effective for the choice of the mutation rate of the (1 + 1) EA when optimizing the jump benchmark, which has the property that the optimal mutation rate depends strongly on the problem instance. More precisely, for a jump function with representation length $n$ and jump size $2 \le k = o(\sqrt{n})$, the standard mutation rate $p = 1/n$ gives an expected runtime of $(1 + o(1)) e n^k$, where $e \approx 2.718$ is Euler's number. The asymptotically optimal mutation rate $p = k/n$ leads to a runtime of $(1 + o(1)) n^k (e/k)^k$. Deviating from the optimal rate by a small constant factor increases the runtime by a factor exponential in $k$. When using the mutation rate $\alpha/n$, where $\alpha \in [1..n/2]$ is sampled independently in each iteration from a power-law distribution with exponent $\beta > 1$, the runtime becomes $\Theta(k^{\beta - 0.5} n^k (e/k)^k)$, where the constants hidden by the asymptotic notation are independent of $n$ and $k$. Consequently, apart from the small polynomial factor $\Theta(k^{\beta - 0.5})$, this randomized mutation rate gives the performance of the optimal mutation rate and in particular also achieves the super-exponential runtime improvement by a factor of $(e/k)^{\Theta(k)}$ over the standard rate $1/n$. The idea of choosing parameter values randomly according to a power-law distribution was quickly taken up by other works. In [FQW18, QGWF21], variants of the heavy-tailed mutation operator were proposed and analyzed on TwoMax, Jump, MaxCut, and several sub-modular problems. In [WQT18, DZ21, DQ22], power-law mutation in multi-objective optimization was studied. In [COY21], the authors compared power-law mutation and artificial immune systems. In
[ABD22], heavy-tailed mutation was considered for the (1 + (λ, λ)) GA, however again only for a single parameter, namely the mutation rate. Very recently, the first analysis of a heavy-tailed choice of a parameter of the selection operator was conducted [DELQ22].
While optimizing a single parameter is already non-trivial (and the latest work [ABD22] showed that the heavy-tailed mutation rate can even give results better than any static mutation rate, that is, it can inherit advantages of dynamic parameter choices), the really difficult problem is finding good values for several parameters of an algorithm. Here the often intricate interplay between the different parameters can be a true challenge (see, e.g., [Doe16] for a theory-based determination of the optimal values of three parameters).
The only attempt to choose more than one parameter randomly was made in [AD20] for the (1 + (λ, λ)) GA, which has a three-dimensional parameter space spanned by the population size $\lambda$, the mutation rate $p$, and the crossover bias $c$. For this algorithm, first proposed in [DDE15], the product $d = pcn$ of mutation rate, crossover bias, and representation length describes the expected distance of an offspring from the parent. It was argued heuristically in [AD20] that a reasonable parameter setting should have $p = c$, that is, the same mutation rate and crossover bias. With this reduction of the parameter space to two dimensions, the parameter choice in [AD20] was made as follows. Independently (and independently in each iteration), both $\lambda$ and $d$ were chosen from a power-law distribution. Mutation rate and crossover bias were both set to $d/n$ to ensure $p = c$ and $pcn = d$. When using unbounded power-law distributions with exponents $\beta_\lambda = 2 + \varepsilon$ and $\beta_d = 1 + \varepsilon'$ with $\varepsilon, \varepsilon' > 0$ any small constants, this randomized way of setting the parameters gave an expected runtime of $e^{O(k)} (n/k)^{(1+\varepsilon)k/2}$ on jump functions with jump size $k \ge 3$. This is very similar (slightly better for $k < 1/\varepsilon$, slightly worse for $k > 1/\varepsilon$) to the runtime of $(n/k)^{(k+1)/2} e^{O(k)}$ obtainable with the optimal static parameters. This is a surprisingly good performance for a parameter-less approach, in particular, when compared to the runtime of $\Theta(n^k)$ of many classic evolutionary algorithms. Note that both for the static and dynamic parameters only upper bounds were proven^1, hence we cannot make a rigorous conclusion on which algorithm performs better on jump functions. The proofs of these upper bounds however suggest to us that they are tight.
Our results: While the work [AD20] showed that in principle it can be profitable to choose more than one parameter randomly from a power-law distribution, it relied on the heuristic assumption that one should take the mutation rate equal to the crossover bias. There is nothing wrong with using such heuristic insight; however, one has to question whether an algorithm user (different from the original developers of the (1 + (λ, λ)) GA) would have easily found this relation $p = c$.
In this work, we show that such heuristic preparations are not necessary: one can simply choose all three parameters of the (1 + (λ, λ)) GA from (scaled) power-law distributions and obtain a runtime comparable to the ones seen before. More precisely, when using the power-law exponents $2 + \varepsilon$ for the distribution of the population size and $1 + \varepsilon'$ for the distributions of the parameters $p$ and $c$, and scaling the distributions for $p$ and $c$ by dividing by $\sqrt{n}$ (to obtain a constant distance of parent and offspring with constant probability), we obtain the same $e^{O(k)} (n/k)^{(1+\varepsilon)k/2}$ runtime guarantee as in [AD20]. From our theoretical results one can see that the exact choice of $\varepsilon'$ affects the asymptotic runtime neither on easy functions such as OneMax, nor on hard functions such as Jump$_k$. Hence if an algorithm user chose all exponents as $2 + \varepsilon$, which is a natural choice as it leads to a constant expectation and a super-constant variance as usually desired from a power-law distribution, the resulting runtimes would still be $O(n \log n)$ for OneMax and $e^{O(k)} (n/k)^{(1+\varepsilon)k/2}$ for jump functions with gap size $k$. With this approach, the only remaining design choice is the scaling of the distributions. It is clear that this cannot be completely avoided simply because of the different scales of the parameters (mutation rates are in $[0, 1]$, population sizes are positive integers). However, we argue that here very simple heuristic arguments can be employed. For the population size, being a positive integer, we simply use a power-law distribution on the positive integers. For the mutation rate and the crossover bias, we definitely need some scaling as both numbers have to be in $[0, 1]$. Recalling that (and this is visible right from the algorithm definition) the expected distance of offspring from their parents in this algorithm is $d = pcn$, and recalling further the general recommendation that EAs should generate offspring with constant Hamming distance from the parent with reasonable probability (this is, for example, implicit both in the general recommendation to use a mutation rate of $1/n$ and in the heavy-tailed mutation operator proposed in [DLMN17]), a scaling leading to a constant expected value of $d$ appears to be a good choice. We obtain this by taking both $p$ and $c$ from power-law distributions on the positive integers scaled down by a factor of $\sqrt{n}$.

^1 A lower bound of $(n/k)^{k/2} e^{\Theta(k)}$ fitness evaluations on the runtime of the (1 + (λ, λ)) GA with static parameters was shown in [ADK22], but this bound was proven for the initialization in the local optimum of Jump$_k$ and it does not include the runtime until the algorithm reaches the local optimum from a random solution.

This appears again to be the most natural
choice. We note that if an algorithm user missed this scaling and scaled down both $p$ and $c$ by a factor of $n$ (e.g., to obtain an expected constant number of bits flipped in the mutation step), then our runtime estimates would increase by a factor of $n^{(\beta_p + \beta_c)/2 - 1}$, which is still not much compared to the roughly $n^{k/2}$ runtimes we have and the $\Theta(n^k)$ runtimes of many simple evolutionary algorithms.
Our precise result is a mathematical runtime analysis of this heavy-tailed algorithm for arbitrary parameters of the three heavy-tailed distributions (power-law exponent and upper bound on the range of positive integers it can take, including the case of no bound) on a set of "easy" problems (OneMax, LeadingOnes, the minimum spanning tree and the partition problem) and on jump functions. We show that on the easy problems the heavy-tailed (1 + (λ, λ)) GA is asymptotically not worse than the (1 + 1) EA, and on Jump it significantly outperforms the (1 + 1) EA for a wide range of the parameters of the power-law distributions. These results show that the absolutely best performance can be obtained by guessing correctly suitable upper bounds on the ranges. Since guessing these parameters wrong can lead to significant performance losses, whereas the gains from these optimal parameter values are not so high, we rather advertise our parameter-less "layman recommendation" to use unrestricted ranges and power-law exponents slightly more than two for the population size and slightly more than one for the other parameters. These recommendations are supported by the empirical study shown in Section 4.
Our work also provides an example where a dynamic (here simply randomized) parameter choice provably gives an asymptotic runtime improvement. This improvement is significantly more pronounced than the $o(\sqrt{\log n})$ factor speed-up observed in [DD18, ABD22] for the optimization of OneMax via the (1 + (λ, λ)) GA.
We note that our situation is different, e.g., from the optimization of jump functions via the (1 + 1) EA. Here the mutation rate $k/n$ is asymptotically optimal [DLMN17] for Jump$_k$. Clearly, for the easy OneMax-type part of the optimization process, the mutation rate $1/n$ would be superior, but the damage from using the larger rate $k/n$ only leads to a lower-order increase of the runtime.
We prove that this is different for the optimization of jump functions via the (1 + (λ, λ)) GA. Since this effect is already visible for constant values of $k$, and in fact is most pronounced there, to ease the presentation we assume that $k$ is constant. We note that only for constant $k$ do the different variants of the (1 + (λ, λ)) GA have a polynomial runtime, so clearly, constant (and not too large) $k$ is the most interesting case.
For constant $k$, our result is $e^{O(k)} (n/k)^{(1+\varepsilon)k/2}$. The best runtime that could be obtained with a static mutation rate was $e^{O(k)} n^{(k+1)/2} k^{-k/2}$. Hence by choosing $\varepsilon$ sufficiently small, our upper bound is asymptotically smaller than the best upper bound for static parameters. Unfortunately, no lower bounds were proven in [ADK22] for static parameters. To rigorously support our claim that dynamic parameter choices can asymptotically outperform static ones when optimizing jump functions via the (1 + (λ, λ)) GA, in Section 3.3 we prove such a lower bound. Since this is not the main topic of this work, we shall not go as far as proving that the upper bound for static parameters is tight, but we content ourselves with a simpler proof of a weaker lower bound, which however suffices to support our claim of the superiority of dynamic parameter choices.
In summary, our results demonstrate that choosing all parameters of an algorithm randomly according to a simple (scaled) power-law can be a good way to overcome the problem of choosing appropriate fixed or dynamic parameter values. We are optimistic that this approach will lead to a good performance also for other algorithms and other optimization problems.

Preliminaries
In this section we collect the definitions and tools which we use in the paper. To avoid misreading of our results, we note that we use the following notation. By $\mathbb{N}$ we denote the set of all positive integers and by $\mathbb{N}_0$ the set of all non-negative integers. We write $[a..b]$ to denote an integer interval including its borders and $[a, b]$ to denote a real-valued interval including its borders. For any probability distribution $\mathcal{L}$ and random variable $X$, we write $X \sim \mathcal{L}$ to indicate that $X$ follows the law $\mathcal{L}$. We denote the binomial law with parameters $n \in \mathbb{N}$ and $p \in [0, 1]$ by $\mathrm{Bin}(n, p)$. We denote the geometric distribution taking values in $\{1, 2, \dots\}$ with success probability $p \in [0, 1]$ by $\mathrm{Geom}(p)$. We denote by $T_I$ and $T_F$ the number of iterations and the number of fitness evaluations performed until some event holds (which is always specified in the text).

Objective Functions
In this paper we consider five benchmark functions and problems, namely OneMax, LeadingOnes, the minimum spanning tree problem, the partition problem, and Jump$_k$. All of them are pseudo-Boolean functions, that is, they are defined on the set of bit strings of length $n$ and return a real number.
OneMax returns the number of one-bits in its argument, that is, $\textsc{OneMax}(x) = \mathrm{OM}(x) = \sum_{i=1}^{n} x_i$. It is one of the most intensively studied benchmarks in evolutionary computation. Many evolutionary algorithms can find the optimum of OneMax in time $\Theta(n \log n)$ [JJW05, Wit06, AD21]. The (1 + (λ, λ)) GA with a fitness-dependent or self-adjusting choice of the population size [DDE15, DD18] or with a heavy-tailed random choice of the population size [ABD21] is capable of solving OneMax in linear time when the other two parameters are chosen suitably depending on the population size.
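As a concrete reference, the two simplest benchmarks used in this paper can be sketched in a few lines of Python. The encoding of individuals as lists of 0/1 integers is our choice for illustration; LeadingOnes follows its standard definition (length of the longest all-ones prefix).

```python
def one_max(x):
    # OneMax: number of one-bits in the bit string x.
    return sum(x)

def leading_ones(x):
    # LeadingOnes: length of the longest prefix of x consisting of ones.
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count
```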
In the minimum spanning tree problem (MST for brevity) we are given an undirected graph $G = (V, E)$ with positive integer edge weights defined by a weight function $\omega : E \to \mathbb{N}_{\ge 1}$. We assume that this graph has no parallel edges or loops. The aim is to find a connected subgraph of minimum total weight. By $n$ we denote the number of vertices and by $m$ the number of edges in $G$.
This problem can be solved by minimizing the following fitness function, defined on all subgraphs $G' = (V, E')$ of the given graph $G$:
$$f(G') = (cc(G') - 1) \cdot W_{\mathrm{total}}^2 + (|E'| - n + cc(G')) \cdot W_{\mathrm{total}} + \sum_{e \in E'} \omega(e),$$
where $cc(G')$ is the number of connected components in $G'$ and $W_{\mathrm{total}}$ is the total weight of the graph $G$, that is, the sum of all edge weights. This definition of the fitness guarantees that any connected graph has a better (in this case, smaller) fitness than any unconnected graph and any tree has a better fitness than any graph with cycles.
The natural representation for subgraphs used in [NW07] is via bit strings of length $m$, where each bit corresponds to one particular edge of the graph $G$. An edge is present in the subgraph $G'$ if and only if its corresponding bit is equal to one. In [NW07] it was shown that the (1 + 1) EA solves the MST problem with the mentioned representation and fitness function in an expected number of $O(m^2 \log(W_{\mathrm{total}}))$ iterations.
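A minimal Python sketch of this representation and fitness. The penalty structure (components penalized by $W_{\mathrm{total}}^2$, surplus edges by $W_{\mathrm{total}}$) is one standard choice in the spirit of [NW07]; the exact constants in the paper may differ, and the helper names are ours.

```python
def connected_components(n_vertices, edges, bits):
    """Number of connected components of the subgraph selected by `bits`,
    computed with a small union-find structure."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = n_vertices
    for (u, v, _w), present in zip(edges, bits):
        if present:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
    return components

def mst_fitness(n_vertices, edges, bits):
    """edges: list of (u, v, weight); bits: 0/1 selection of edges.
    Smaller fitness is better."""
    w_total = sum(w for (_u, _v, w) in edges)
    cc = connected_components(n_vertices, edges, bits)
    n_selected = sum(bits)
    weight = sum(w for (_u, _v, w), b in zip(edges, bits) if b)
    # components penalty >> surplus-edges penalty >> actual weight
    return (cc - 1) * w_total**2 + (n_selected - n_vertices + cc) * w_total + weight
```

On a weighted triangle, a spanning tree beats the full (cyclic) graph, which in turn beats the empty (disconnected) subgraph, matching the ordering stated in the text.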
In the partition problem we have a set of $n$ objects with positive integer weights $w_1, w_2, \dots, w_n$ and our aim is to split the objects into two sets (usually called bins) such that the total weight of the heavier bin is minimal. Without loss of generality we assume that the weights are sorted in non-increasing order, that is, $w_1 \ge w_2 \ge \dots \ge w_n$. By $w$ we denote the total weight of all objects, that is, $w = \sum_{i=1}^{n} w_i$. By a $(1 + \delta)$ approximation (for any $\delta > 0$) we mean a solution in which the weight of the heavier bin is greater than in an optimal solution by at most a factor of $(1 + \delta)$.
Each partition into two bins can be represented by a bit string of length $n$, where each bit corresponds to one particular object. The object is put into the first bin if and only if the corresponding bit is equal to one. As the fitness $f(x)$ of an individual $x$ we consider the total weight of the objects in the heavier bin of the partition which corresponds to $x$.
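This fitness is equally simple to sketch (the encoding of a partition as a 0/1 list is our illustration):

```python
def partition_fitness(weights, bits):
    """Weight of the heavier bin; bit i == 1 puts object i into bin 1.
    Smaller fitness is better."""
    bin1 = sum(w for w, b in zip(weights, bits) if b)
    bin2 = sum(weights) - bin1
    return max(bin1, bin2)
```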
In [Wit05] it was shown that the (1 + 1) EA finds a $(\frac{4}{3} + \varepsilon)$ approximation (for any constant $\varepsilon > 0$) for any instance of the partition problem in linear time and that it finds a $\frac{4}{3}$ approximation in time $O(n^2)$.
The function Jump$_k$ (where $k$ is a positive integer parameter) is defined via OneMax as follows:
$$\mathrm{Jump}_k(x) = \begin{cases} k + \mathrm{OM}(x), & \text{if } \mathrm{OM}(x) \in [0..n-k] \cup \{n\}, \\ n - \mathrm{OM}(x), & \text{otherwise.} \end{cases}$$
A plot of Jump$_k$ is shown in Figure 1. The main feature of Jump$_k$ is a set of local optima at distance $k$ from the global optimum and a valley of extremely low fitness in between. Most EAs optimizing Jump$_k$ first reach the local optima and then have to perform a jump to the global one, which turns out to be a challenging task for most classic algorithms. In particular, for all values of $\mu$ and $\lambda$ it was shown that the $(\mu + \lambda)$ EA and the $(\mu, \lambda)$ EA have a runtime of $\Omega(n^k)$ fitness evaluations when they optimize Jump$_k$ [DJW02, Doe22]. Using a mutation rate of $k/n$ [DLMN17], choosing it from a power-law distribution [DLMN17], or setting it dynamically with a stagnation detection mechanism [RW20, RW21b, RW21a, DR22] reduces the runtime of the (1 + 1) EA by a $k^{\Theta(k)}$ factor; however, for constant $k$ the runtime of the (1 + 1) EA remains $\Theta(n^k)$. Many crossover-based algorithms have a better runtime on Jump$_k$, see [JW02, FKK+16, DFK+16, DFK+18, RA19, WVHM18] for results on algorithms different from the (1 + (λ, λ)) GA. Those beating the $\tilde{O}(n^{k-1})$ runtime shown in [DFK+18] may appear somewhat artificial and overfitted to the precise definition of the jump function, see [Wit21]. Outside the world of genetic algorithms, the estimation-of-distribution algorithm cGA and the ant-colony optimizer 2-MMAS$_{ib}$ were shown to optimize jump functions with small $k = O(\log n)$ in time $O(n \log n)$ [HS18, Doe21, BBD21]. Runtime analyses for artificial immune systems, hyperheuristics, and the Metropolis algorithm exist [COY17, COY19, LOW19], but their runtime guarantees are asymptotically weaker than $O(n^k)$ for constant $k$.
Figure 1: Plot of the Jump$_k$ function. As a function of unitation, the function value of a search point $x$ depends only on the number of one-bits in $x$.
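The Jump$_k$ definition above translates directly into Python (the standard Droste-Jansen-Wegener formulation; variable names are ours):

```python
def jump(x, k):
    """Jump_k benchmark: OneMax slope shifted up by k, with a fitness
    valley of width k - 1 just below the all-ones optimum."""
    n = len(x)
    om = sum(x)          # OneMax value of x
    if om <= n - k or om == n:
        return k + om    # easy slope, local optima at om = n - k, global optimum
    return n - om        # the valley of extremely low fitness
```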

Power-Law Distributions
We say that an integer random variable $X$ follows a power-law distribution with parameters $\beta$ and $u$ if
$$\Pr[X = i] = C_{\beta,u} \, i^{-\beta} \quad \text{for all } i \in [1..u].$$
Here $C_{\beta,u} = (\sum_{j=1}^{u} j^{-\beta})^{-1}$ denotes the normalization coefficient. We write $X \sim \mathrm{pow}(\beta, u)$ and call $u$ the bounding of $X$ and $\beta$ the power-law exponent.
The main feature of this distribution is that while having a decent probability to sample $X = \Theta(1)$ (where the asymptotic notation is used for $u \to +\infty$), we also have a good (inverse-polynomial instead of negative-exponential) probability to sample a super-constant value. The following lemmas show well-known properties of power-law distributions. Their proofs can be found, for example, in [AD20].
Lemma 1 (Lemma 1 in [AD20]). For all positive integers $a$ and $b$ such that $b \ge a$ and for all $\beta > 0$, the sum $\sum_{i=a}^{b} i^{-\beta}$ is
$$\sum_{i=a}^{b} i^{-\beta} = \begin{cases} \Theta(a^{1-\beta}), & \text{if } \beta > 1, \\ \Theta(\log(b/a) + 1), & \text{if } \beta = 1, \\ \Theta(b^{1-\beta}), & \text{if } \beta < 1, \end{cases}$$
where the $\Theta$ notation is used for $b \to +\infty$.
Lemma 4 simply follows from Lemma 1.
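A power-law sample can be drawn by inverse transform over the finite support $[1..u]$. The following sketch is our illustration of the distribution, not an implementation from the paper; for large $u$ one would precompute the cumulative weights instead of rescanning them.

```python
import random

def sample_power_law(beta, u, rng=random):
    """Sample X in [1..u] with Pr[X = i] proportional to i^(-beta)."""
    weights = [i ** -beta for i in range(1, u + 1)]
    total = sum(weights)          # 1 / C_{beta,u}
    r = rng.random() * total      # uniform point on the total mass
    for i, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return i
    return u  # numerical safety net against floating-point rounding
```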

The Heavy-Tailed (1 + (λ, λ)) GA

We now define the heavy-tailed (1 + (λ, λ)) GA. The main difference from the standard (1 + (λ, λ)) GA is that at the start of each iteration the mutation rate $p$, the crossover bias $c$, and the population size $\lambda$ are randomly chosen as follows. We sample $p \sim n^{-1/2} \mathrm{pow}(\beta_p, u_p)$ and $c \sim n^{-1/2} \mathrm{pow}(\beta_c, u_c)$. The population size is chosen via $\lambda \sim \mathrm{pow}(\beta_\lambda, u_\lambda)$. Here the upper limits $u_\lambda$, $u_p$ and $u_c$ can be any positive integers, except that we require $u_p$ and $u_c$ to be at most $\sqrt{n}$ (so that we choose both $p$ and $c$ from the interval $(0, 1]$). The power-law exponents $\beta_\lambda$, $\beta_p$ and $\beta_c$ can be any non-negative real numbers. We call these parameters of the power-law distributions the hyperparameters of the heavy-tailed (1 + (λ, λ)) GA, and we give recommendations on how to choose them in Section 3.3. The pseudocode of this algorithm is shown in Algorithm 1. We note that it is not necessary to store the whole offspring population, since only the best individual has a chance to be selected as a mutation or crossover winner. Hence also large values of $\lambda$ are algorithmically feasible.
Concerning the scalings of the power-law distributions, we find it natural to choose the integer parameter $\lambda$ from a power-law distribution without any normalization. For the scalings of the power-laws determining the parameters $p$ and $c$, we argued already in the introduction that the scaling factor of $n^{-1/2}$ is natural as it ensures that the Hamming distance between parent and offspring, which is $pcn$ in expectation for this algorithm, is one with constant probability. We see some risk that an algorithm user misses this argument and, for example, chooses a scaling factor of $n^{-1}$ for the mutation rate, which leads to the Hamming distance between parent and mutation offspring being one with constant probability. A completely different alternative would be to choose $c \sim 1/\mathrm{pow}(\beta_m, u_m)$, inspired by the recommendation "$c := 1/(pn)$" made for static parameters in [DDE15]. Without proof, we note that these and many similar strategies increase the runtime by at most a factor of $\Theta(n^c)$, $c$ a constant independent of $n$ and $k$, thus not changing the general $n^{(0.5+\varepsilon)k}$ runtime guarantee proven in this work.
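To make the lazy parameter choice concrete, here is a compact Python sketch of one iteration of the heavy-tailed (1 + (λ, λ)) GA. It follows the verbal description above; the function names, the hyperparameter dictionary, and minor details such as tie-breaking are our assumptions, and Algorithm 1 in the paper is the authoritative description.

```python
import math
import random

def sample_power_law(beta, u, rng):
    """X in [1..u] with Pr[X = i] proportional to i^(-beta)."""
    weights = [i ** -beta for i in range(1, u + 1)]
    r = rng.random() * sum(weights)
    for i, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return i
    return u

def heavy_tailed_iteration(x, f, hyper, rng=random):
    """One iteration of the heavy-tailed (1 + (lambda, lambda)) GA
    maximizing f; returns the (elitist) next parent."""
    n = len(x)
    # Lazy parameter choice: all three parameters are sampled anew,
    # with p and c scaled down by sqrt(n) as recommended in the text.
    lam = sample_power_law(hyper["beta_lambda"], hyper["u_lambda"], rng)
    p = sample_power_law(hyper["beta_p"], hyper["u_p"], rng) / math.sqrt(n)
    c = sample_power_law(hyper["beta_c"], hyper["u_c"], rng) / math.sqrt(n)

    # Mutation phase: sample l ~ Bin(n, p) once, then create lam offspring
    # by flipping l bits chosen uniformly at random; keep the best one.
    l = sum(rng.random() < p for _ in range(n))
    best_mut = None
    for _ in range(lam):
        y = list(x)
        for pos in rng.sample(range(n), l):
            y[pos] = 1 - y[pos]
        if best_mut is None or f(y) > f(best_mut):
            best_mut = y

    # Crossover phase: lam biased crossovers between x and the mutation
    # winner; each bit is taken from the winner with probability c.
    best_cross = None
    for _ in range(lam):
        y = [xm if rng.random() < c else xp for xp, xm in zip(x, best_mut)]
        if best_cross is None or f(y) > f(best_cross):
            best_cross = y

    # Elitist selection: the parent is never replaced by a worse individual.
    return best_cross if f(best_cross) >= f(x) else x
```

Note that only the current best mutation and crossover offspring are kept, reflecting the remark above that the whole offspring population need not be stored even for large $\lambda$.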
The following theoretical results exist for the (1 + (λ, λ)) GA. With optimal static parameters the algorithm solves OneMax in $O(n \log(n))$ fitness evaluations [DDE15]. The runtime becomes slightly worse on random satisfiability instances due to a weaker fitness-distance correlation [BD17]. In [ADK19] it was shown that the runtime of the (1 + (λ, λ)) GA on LeadingOnes is the same as the runtime of the most classic algorithms, that is, $\Theta(n^2)$, which means that it is not slower than most other EAs despite the absence of a strong fitness-distance correlation. The analysis of the (1 + (λ, λ)) GA with static parameters on Jump$_k$ in [ADK22] showed that the (1 + (λ, λ)) GA (with uncommon parameters) can find the optimum in $e^{O(k)} (n/k)^{(k+1)/2}$ fitness evaluations, which is roughly a square root of the $\Theta(n^k)$ runtime of many classic algorithms on this function.
Concerning dynamic parameter choices, a fitness-dependent parameter choice was shown to give a linear runtime on OneMax [DDE15], which is the best known runtime for crossover-based algorithms on OneMax. In [DD18], it was shown that also the self-adjusting approach of controlling the parameters with a simple one-fifth rule can lead to this linear runtime. The adapted one-fifth rule with a logarithmic cap lets the (1 + (λ, λ)) GA outperform the (1 + 1) EA on random satisfiability instances [BD17].
Choosing $\lambda$ from a power-law distribution and taking $p = \lambda/n$ and $c = 1/\lambda$ lets the (1 + (λ, λ)) GA optimize OneMax in linear time [ABD22]. Also, as mentioned in the introduction, with randomly chosen parameters (but with some dependencies between several of them) the (1 + (λ, λ)) GA efficiently optimizes jump functions [AD20]. For LeadingOnes it was shown in [ADK19] that the runtime of the (1 + (λ, λ)) GA is $\Theta(n^2)$ and that any dynamic choice of $\lambda$ does not change this asymptotic runtime.
If the current individual $x$ has reached the local optimum of Jump$_k$, then we call the mutation phase successful if all $k$ zero-bits of $x$ are flipped to ones in the mutation winner $x'$. We call the crossover phase successful if the crossover winner has a greater fitness than $x$.

Useful Tools
An important tool in our analysis is Wald's equation [Wal45] as it allows us to express the expected number of fitness evaluations through the expected number of iterations and the expected cost of one iteration.
Lemma 5 (Wald's equation). Let $(X_t)_{t \in \mathbb{N}}$ be a sequence of real-valued random variables and let $T$ be a positive integer random variable. Let also all of the following conditions be true.

1. All $X_t$ have the same finite expectation.

2. For all $t \in \mathbb{N}$ we have $E[X_t \mathbb{1}_{\{T \ge t\}}] = E[X_t] \Pr[T \ge t]$.

3. $\sum_{t=1}^{+\infty} E[|X_t| \mathbb{1}_{\{T \ge t\}}] < +\infty$.

4. $E[T]$ is finite.

Then we have $E[\sum_{t=1}^{T} X_t] = E[T] \cdot E[X_1]$.

In our analysis of the heavy-tailed (1 + (λ, λ)) GA we use the following multiplicative drift theorem.
Theorem 6 (Multiplicative Drift [DJW12]). Let $S \subset \mathbb{R}$ be a finite set of positive numbers with minimum $s_{\min}$. Let $\{X_t\}_{t \in \mathbb{N}_0}$ be a sequence of random variables over $S \cup \{0\}$. Let $T$ be the first point in time $t$ when $X_t = 0$, that is, $T := \min\{t \in \mathbb{N}_0 : X_t = 0\}$, which is a random variable. Suppose that there exists a constant $\delta > 0$ such that for all $t \in \mathbb{N}_0$ and all $s \in S$ such that $\Pr[X_t = s] > 0$ we have $E[X_t - X_{t+1} \mid X_t = s] \ge \delta s$. Then for all $s_0 \in S$ with $\Pr[X_0 = s_0] > 0$ we have
$$E[T \mid X_0 = s_0] \le \frac{1 + \ln(s_0 / s_{\min})}{\delta}.$$

We use the following well-known relation between the arithmetic and geometric means.
Lemma 7. For all positive $a$ and $b$ it holds that $a + b \ge 2\sqrt{ab}$.
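For completeness, the standard one-line proof: the claim is equivalent to the non-negativity of a binomial square.

```latex
a + b - 2\sqrt{ab} = \left(\sqrt{a} - \sqrt{b}\right)^2 \ge 0,
\qquad\text{hence}\qquad a + b \ge 2\sqrt{ab}.
```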

Runtime Analysis
In this section we perform a runtime analysis of the heavy-tailed (1 + (λ, λ)) GA on the easy problems OneMax, LeadingOnes, and the minimum spanning tree, as well as on the more difficult Jump problem. We show that this algorithm can efficiently escape local optima and that it is capable of solving jump functions much faster than the known mutation-based algorithms and most of the crossover-based EAs. At the same time it does not fail on easy functions like OneMax, unlike the (1 + (λ, λ)) GA with those static parameters which are optimal for Jump [ADK22].
From the results of this section we distill the recommendations to use $\beta_p$ and $\beta_c$ slightly greater than one and to use $\beta_\lambda$ slightly greater than two. We also suggest to use almost unbounded power-law distributions, taking $u_c = u_p = \sqrt{n}$ and $u_\lambda = 2^n$. These recommendations are justified in Corollary 16.

Easy Problems
In this subsection we show that the heavy-tailed (1 + (λ, λ)) GA has a reasonably good performance on the easy problems OneMax, LeadingOnes, minimum spanning tree, and partition.
Theorem 8. If $\beta_\lambda > 1$, $\beta_p > 1$, and $\beta_c > 1$, then the heavy-tailed (1 + (λ, λ)) GA finds the optimum of OneMax in $O(n \log(n))$ iterations. The expected number of fitness evaluations is $O(E[\lambda] \cdot n \log(n))$, where $E[\lambda]$ denotes the expected value of $\lambda \sim \mathrm{pow}(\beta_\lambda, u_\lambda)$.

The central argument in the proof of Theorem 8 is the observation that the heavy-tailed (1 + (λ, λ)) GA performs an iteration equivalent to one of the (1 + 1) EA with a constant probability, which is shown in the following lemma.
Lemma 9. If $\beta_p$, $\beta_c$ and $\beta_\lambda$ are all strictly greater than one, then with probability $\rho = \Theta(1)$ the heavy-tailed (1 + (λ, λ)) GA chooses $p = c = 1/\sqrt{n}$ and $\lambda = 1$ and performs an iteration of the (1 + 1) EA with mutation rate $\frac{1}{n}$.

Proof. Since we choose $p$, $c$ and $\lambda$ independently, by the definition of the power-law distribution and by Lemma 2 we have
$$\rho = \Pr\left[p = \tfrac{1}{\sqrt{n}}\right] \cdot \Pr\left[c = \tfrac{1}{\sqrt{n}}\right] \cdot \Pr[\lambda = 1] = \Theta(1).$$
If we have $\lambda = 1$, then we have only one mutation offspring, which is automatically chosen as the mutation winner $x'$. Note that although we first choose $\ell \sim \mathrm{Bin}(n, p)$ and then flip $\ell$ random bits in $x$, the distribution of $x'$ in the search space is the same as if we flipped each bit independently with probability $p$ (see Section 2.1 in [DDE15] for more details).
In the crossover phase we create only one offspring $y$ by applying the biased crossover to $x$ and $x'$. Each bit of this offspring is different from the bit in the same position in $x$ if and only if it was flipped in $x'$ (with probability $p$) and then taken from $x'$ in the crossover phase (with probability $c$). Therefore, $y$ is distributed in the search space as if we generated it by applying the standard bit mutation with mutation rate $pc$ to $x$. Hence, we can consider such an iteration of the heavy-tailed (1 + (λ, λ)) GA as an iteration of the (1 + 1) EA which uses standard bit mutation with mutation rate $pc = \frac{1}{n}$. We are now in a position to prove Theorem 8.
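The marginal of the two-stage mutation used in the proof can be checked exactly: if $\ell \sim \mathrm{Bin}(n, p)$ positions are flipped, chosen uniformly at random, then each fixed bit flips with probability exactly $p$. The following small Python sanity check is ours (it verifies only this marginal, not the full distributional equality shown in [DDE15]):

```python
from math import comb

def marginal_flip_probability(n, p):
    """Pr[a fixed bit is flipped] when l ~ Bin(n, p) bits are flipped,
    with the l positions chosen uniformly at random among the n bits."""
    return sum(
        comb(n, l) * p**l * (1 - p) ** (n - l)  # Pr[l flips in total]
        * l / n                                 # Pr[our bit is among them]
        for l in range(n + 1)
    )
```

The sum equals $E[\ell]/n = np/n = p$, matching the independent-flip view.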
Proof of Theorem 8. By Lemma 9, with probability at least $\rho$, which is at least some constant independent of the problem size $n$, the heavy-tailed (1 + (λ, λ)) GA performs an iteration of the (1 + 1) EA. Hence, the probability $P(i)$ to increase the fitness in one iteration if we have already reached fitness $i$ is at least
$$P(i) \ge \rho \cdot \frac{n - i}{n}\left(1 - \frac{1}{n}\right)^{n-1} \ge \rho \cdot \frac{n - i}{en}.$$
Hence, we estimate the total runtime in terms of iterations as a sum of the expected runtimes until we leave each fitness level:
$$E[T_I] \le \sum_{i=0}^{n-1} \frac{1}{P(i)} \le \frac{en}{\rho} \sum_{i=0}^{n-1} \frac{1}{n - i} = O(n \log(n)).$$
To compute the expected number of fitness evaluations until we find the optimum we use Wald's equation (Lemma 5). Since in each iteration of the heavy-tailed (1 + (λ, λ)) GA we make $2\lambda$ fitness evaluations, we have $E[T_F] = 2 E[\lambda] \cdot E[T_I]$. By Lemma 3 we have an upper bound on $E[\lambda]$ in terms of $\beta_\lambda$ and $u_\lambda$. Therefore, the expected number of fitness evaluations is $O(E[\lambda] \cdot n \log(n))$.

Theorem 8 shows that the heavy-tailed (1 + (λ, λ)) GA can fall back to a (1 + 1) EA behavior and turn into a simple hill climber. Since we do not have a matching lower bound, our analysis leaves open the question to what extent the heavy-tailed (1 + (λ, λ)) GA benefits from iterations in which it samples parameter values different from the ones used in the lemma above. On the one hand, in [ABD22] it was shown that if we choose only the parameter $\lambda$ from the power-law distribution and set the other parameters to their optimal values in the (1 + (λ, λ)) GA (namely, $p = \lambda/n$ and $c = 1/\lambda$ [DDE15]), then we have a linear runtime on OneMax. This indicates that there is a chance that the heavy-tailed (1 + (λ, λ)) GA with an independent choice of the three parameters can also have an $o(n \log(n))$ runtime on this problem. On the other hand, the probability that we choose $p$ and $c$ close to their optimal values is not high, hence we have to rely on making good progress when using non-optimal parameter values. Our experiments presented in Section 4.1 suggest that such parameters do not yield the desired progress speed and that the heavy-tailed (1 + (λ, λ)) GA has an $\Omega(n \log(n))$ runtime (see Figure 3). For this reason, we rather believe that the heavy-tailed (1 + (λ, λ)) GA proposed in this work has an inferior performance on OneMax compared to the one proposed in [ABD22]. Since our new algorithm has a massively better performance on jump functions, we feel that losing a logarithmic factor in the runtime on OneMax is not too critical.
Lemma 9 also allows us to transform any upper bound on the runtime of the (1 + 1) EA which was obtained via the fitness level argument or via drift with the fitness into the same asymptotic runtime for the heavy-tailed (1 + (λ, λ)) GA. We give three examples in the following subsections.

LeadingOnes
For the LeadingOnes problem, we now show that arguments analogous to the ones in [Rud97] can be used to prove an O(n 2 ) runtime guarantee also for the heavy-tailed (1 + (λ, λ)) GA.
Theorem 10. If $\beta_\lambda > 1$, $\beta_p > 1$, and $\beta_c > 1$, then the expected runtime of the heavy-tailed $(1+(\lambda,\lambda))$ GA on LeadingOnes is $O(n^2)$ iterations. In terms of fitness evaluations, the expected runtime is larger by the factor $E[2\lambda]$ given by Lemma 3.

Proof. The probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA improves the fitness in one iteration is at least the probability that it performs an iteration of the $(1+1)$ EA that improves the fitness. By Lemma 9, the probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA performs an iteration of the $(1+1)$ EA is $\Theta(1)$. The probability that the $(1+1)$ EA increases the fitness in one iteration is at least the probability that it flips the first zero-bit in the string and does not flip any other bit, which is $\frac{1}{n}(1-\frac{1}{n})^{n-1} \ge \frac{1}{en}$. Hence, the probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA increases the fitness in one iteration is $\Omega(\frac{1}{n})$. Therefore, the expected number of iterations before the heavy-tailed $(1+(\lambda,\lambda))$ GA improves the fitness is $O(n)$. Since there are no more than $n$ fitness improvements before we reach the optimum, the expected total runtime of the heavy-tailed $(1+(\lambda,\lambda))$ GA on LeadingOnes is at most $O(n^2)$ iterations. Since by Lemma 3 with $\beta_\lambda > 1$ the expected cost $E[2\lambda]$ of one iteration is bounded in terms of $\beta_\lambda$ and $u_\lambda$, Wald's equation (Lemma 5) yields the claimed expected total runtime in terms of fitness evaluations.
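The counting argument in this proof is simple enough to sketch numerically. The following is our own illustration (the per-level bound $e \cdot n$ is the standard estimate used in the proof, not a formula quoted from the paper):

```python
import math

def leadingones_upper_bound(n):
    """Fitness-level bound from the proof sketch: at most n improvements,
    each found in expected at most e*n iterations, since flipping exactly
    the first zero-bit has probability (1/n)(1 - 1/n)**(n - 1) >= 1/(e*n)."""
    p_improve = (1 / n) * (1 - 1 / n) ** (n - 1)  # prob. of the specific improving step
    assert p_improve >= 1 / (math.e * n)          # the inequality used in the proof
    return n * (1 / p_improve)                    # n levels times E[iterations per level]

assert leadingones_upper_bound(100) <= math.e * 100 ** 2
```

The constant-factor overhead of the heavy-tailed algorithm (the $\Theta(1)$ probability of imitating a $(1+1)$ EA iteration from Lemma 9) would only multiply this bound by a constant.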

Minimum Spanning Tree Problem
We proceed with the runtime on the minimum spanning tree problem. Reusing some of the arguments from [NW07] and some more from the later work [DJW12], we show that the expected runtime of the heavy-tailed $(1+(\lambda,\lambda))$ GA admits the same upper bound $O(m^2 \log(W_{\mathrm{total}}))$ as that of the $(1+1)$ EA.
In terms of fitness evaluations, the bound is larger by the factor $E[2\lambda]$ given by Lemma 3.

Proof. In [NW07] it was shown that, starting with a random subgraph of $G$, the $(1+1)$ EA finds a spanning tree in $O(m \log(n))$ iterations. We now briefly adjust these arguments to the heavy-tailed $(1+(\lambda,\lambda))$ GA. If $G'$ is disconnected, then the probability to reduce the number of connected components is at least the probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA performs an iteration of the $(1+1)$ EA multiplied by the probability that an iteration of the $(1+1)$ EA adds an edge connecting two connected components (and does not add or remove any other edge of the subgraph $G'$). The latter probability is $\Omega(\frac{cc(G')-1}{m})$, since there are at least $cc(G')-1$ edges whose addition connects a pair of connected components. Therefore, by the fitness level argument, the expected number of iterations before the heavy-tailed $(1+(\lambda,\lambda))$ GA finds a connected graph is $O(m \log(n))$.
If the algorithm has found a connected graph, then with probability $\Omega(\frac{|E'|-(n-1)}{m})$ the heavy-tailed $(1+(\lambda,\lambda))$ GA performs an iteration of the $(1+1)$ EA that removes an edge participating in a cycle (since there are at least $|E'|-(n-1)$ such edges). Therefore, in $O(m \log(m))$ iterations the heavy-tailed $(1+(\lambda,\lambda))$ GA finds a spanning tree (possibly not a minimum one). Note that $O(m \log(m)) = O(m \log(n))$, since we have no loops and no parallel edges and thus $m \le \frac{n(n-1)}{2}$.
Once the heavy-tailed $(1+(\lambda,\lambda))$ GA has obtained a spanning tree, it cannot accept any subgraph that is not a spanning tree. Therefore, we can use the multiplicative drift argument from [DJW12]. Namely, we define a potential function $\Phi(G')$ that is equal to the weight of the current tree minus the weight of a minimum spanning tree. In [DJW12] it was shown that in every iteration $t$ of the $(1+1)$ EA the potential decreases in expectation by a multiplicative factor, where $G'_t$ denotes the current graph at the start of iteration $t$. By Lemma 9, and since the weight of the current graph cannot increase in one iteration, the heavy-tailed $(1+(\lambda,\lambda))$ GA has at least a $\rho$ fraction of this drift for some constant $\rho$ independent of $m$ and $W$. Since the edge weights are integers, we have $\Phi(G'_t) \ge 1 =: s_{\min}$ for all $t$ such that $G'_t$ is not an optimal solution. We also have $\Phi(G'_0) \le W_{\mathrm{total}}$ by the definition of $W_{\mathrm{total}}$. Therefore, by the multiplicative drift theorem (Theorem 6), the expected runtime until we find the optimum starting from a spanning tree is $O(m^2 \log(W_{\mathrm{total}}))$ iterations. Together with the runtime to find a spanning tree, we obtain a total expected runtime of $O(m^2 \log(W_{\mathrm{total}}))$ iterations. By Lemma 3 and by Wald's equation (Lemma 5), the expected number of fitness evaluations is therefore larger by at most the factor $E[2\lambda]$.
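The multiplicative drift theorem invoked here (Theorem 6 in the text) turns the parameters of the argument into a concrete bound; a minimal calculator, with purely illustrative numbers of our own choosing:

```python
import math

def multiplicative_drift_bound(s0, s_min, delta):
    """Multiplicative drift theorem: if the potential shrinks in expectation
    by a delta-fraction per iteration, then E[T] <= (1 + ln(s0/s_min)) / delta."""
    return (1 + math.log(s0 / s_min)) / delta

# Illustrative numbers only (not from the paper): Phi_0 <= W_total,
# s_min = 1 (integer edge weights), and a drift factor delta = rho / m**2
# for a constant rho -- yielding the O(m**2 * log(W_total)) bound above.
m, W_total, rho = 50, 10 ** 6, 0.1
bound = multiplicative_drift_bound(W_total, 1, rho / m ** 2)
```

The bound scales as $m^2 \log(W_{\mathrm{total}})$, matching the statement of the theorem up to the constant $\rho$.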

Approximations for the Partition Problem
We finally regard the partition problem. We use arguments similar to those in [Wit05] (slightly modified to exploit multiplicative drift analysis) to show that the heavy-tailed $(1+(\lambda,\lambda))$ GA also finds a $(\frac{4}{3}+\varepsilon)$-approximation in linear time. For $\frac{4}{3}$-approximations we improve the $O(n^2)$ runtime result of [Wit05] and show that both the $(1+1)$ EA and the heavy-tailed $(1+(\lambda,\lambda))$ GA succeed in $O(n \log(w))$ fitness evaluations: they find a $\frac{4}{3}$-approximation in an expected number of $O(n \log(w))$ iterations, and for the heavy-tailed $(1+(\lambda,\lambda))$ GA the expected number of fitness evaluations is larger by at most the factor $E[2\lambda]$.

Proof. We first recall the definition of a critical object from [Wit05]. Let $\ell \ge \frac{w}{2}$ be the fitness of the optimal solution. Let $i_1 < i_2 < \dots < i_k$ be the indices of the objects in the heavier bin. Then we call the object $r$ in the heavier bin critical if it is the object with the smallest index such that $\sum_{j:\, i_j \le r} w_{i_j} > \ell$. In other words, the critical object is the object in the heavier bin such that the total weight of all previous (non-lighter) objects in that bin is not greater than $\ell$, but the total weight of all previous objects together with the weight of this object is greater than $\ell$. We call the weight of the critical object the critical weight. We also call the objects in the heavier bin which have index at least $r$ the light objects. This notation is illustrated in Figure 2.
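The definition of the critical object is wordy, so a small executable restatement may help. This is a toy illustration of the definition above (our own code, not from the paper):

```python
def critical_object(heavier_bin_weights, ell):
    """Locate the critical object: scan the heavier bin's objects in index
    order and return the first one whose running prefix weight exceeds ell,
    the fitness of the optimal solution. Toy re-statement of the definition."""
    prefix = 0
    for idx, weight in enumerate(heavier_bin_weights):
        prefix += weight
        if prefix > ell:
            return idx, weight  # index of the critical object and the critical weight
    return None  # the heavier bin weighs at most ell

# Example: the heavier bin holds weights 5, 4, 3 and the optimum has fitness 7;
# the prefix 5 is <= 7 but 5 + 4 > 7, so the second object (weight 4) is critical.
assert critical_object([5, 4, 3], 7) == (1, 4)
```

The objects at or after the returned index are the "light objects" used in the drift argument that follows.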
We now show that at some point of the run the critical weight becomes at most $\frac{w}{3}$ and never exceeds this value afterwards. For this we consider two cases.
Case 1: $w_2 > \frac{w}{3}$. Note that in this case we also have $w_1 > \frac{w}{3}$, since $w_1 \ge w_2$, and the total weight of all other objects is $w - w_1 - w_2 < \frac{w}{3}$. If the two heaviest objects are in the same bin, then the weight of this (heavier) bin is at least $\frac{2w}{3}$. In any partition in which these two objects are separated, the weight of the heavier bin is at most $\max\{w - w_1,\, w - w_2\} < \frac{2w}{3}$; therefore, if the algorithm generates such a partition, it replaces any partition in which the two heaviest objects are in the same bin. For the same reason, once we have a partition with the two heaviest objects in different bins, we cannot accept a partition in which they are in the same bin.
The probability of separating the two heaviest objects into different bins is at least the probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA performs an iteration of the $(1+1)$ EA (which by Lemma 9 is $\Theta(1)$) multiplied by the probability that in this iteration we move one of these two objects into the other bin and do not move the second one, which is $\Omega(\frac{1}{n})$. Consequently, in an expected number of $O(n)$ iterations the two heaviest objects are separated into different bins.
Note that the weight of the heaviest object cannot be greater than the weight of the heavier bin (even in an optimal solution), hence we have $w_2 \le w_1 \le \ell$. Therefore, when the two heaviest objects are separated into different bins, neither of them can be the critical one. Hence, the critical weight is now at most $w_3 < \frac{w}{3}$.

Case 2: $w_2 \le \frac{w}{3}$. Since the heaviest object can never be the critical one, the critical weight is at most $w_2 \le \frac{w}{3}$.

Once the critical weight is at most $\frac{w}{3}$, we define a potential function $\Phi(x_t) = \max\{f(x_t) - \ell - \frac{w}{6},\, 0\}$, where $x_t$ is the current individual of the heavy-tailed $(1+(\lambda,\lambda))$ GA at the beginning of iteration $t$. Note that this potential function does not increase due to the elitist selection of the $(1+(\lambda,\lambda))$ GA. We now show that as long as $\Phi(x_t) > 0$, any iteration which moves a light object to the lighter bin and moves no other object reduces the fitness (and the potential). Recall that the weight of each light object is at most $\frac{w}{3}$. The weight of the bin which was heavier before the move is reduced by the weight of the moved object. The weight of the other bin becomes at most $w - f(x_t) + \frac{w}{3} \le \ell + \frac{w}{6} < f(x_t)$. Therefore, the weight of both bins becomes smaller than the weight of the bin which was heavier before the move, hence such a partition is accepted by the algorithm. Now we estimate the expected decrease of the potential in one iteration. Recall that by Lemma 9 the probability that the heavy-tailed $(1+(\lambda,\lambda))$ GA performs an iteration of the $(1+1)$ EA is at least some $\rho = \Theta(1)$. The probability that such an iteration moves exactly one particular object is $\frac{1}{n}(1-\frac{1}{n})^{n-1} \ge \frac{1}{en}$. Hence we have two options.
• If there is at least one light object with weight at least $\Phi(x_t)$, then moving it decreases the potential to zero, since the weight of the heavier bin becomes at most $\ell + \frac{w}{6}$ and, as shown earlier, the weight of the lighter bin also cannot exceed $\ell + \frac{w}{6}$. Hence, the expected decrease of the potential in one iteration is at least $\frac{\rho}{en}\Phi(x_t)$.

• Otherwise, moving any light object decreases the potential by the weight of the moved object, since the heavy bin remains the heavier one after such a move. The total weight of the light objects is at least $f(x_t) - \ell \ge \Phi(x_t)$. Let $L$ be the set of indices of the light objects. Then the expected decrease of the potential in one iteration is at least $\sum_{i \in L}\frac{\rho}{en}w_i \ge \frac{\rho}{en}\Phi(x_t)$.

We are now in a position to use the multiplicative drift theorem (Theorem 6). Note that the maximum value of the potential function is $\frac{w}{2}$ and its minimum positive value is $\frac{1}{6}$ (since $f(x_t)$ and $\ell$ are integers and $\frac{w}{6}$ is a multiple of $\frac{1}{6}$). Therefore, denoting by $T_I$ the smallest $t$ such that $\Phi(x_t) = 0$, we have $E[T_I] \le \frac{en}{\rho}(1 + \ln(3w)) = O(n\log(w))$. When $\Phi(x_t) = 0$, we have $f(x_t) \le \ell + \frac{w}{6} \le \ell + \frac{\ell}{3} = \frac{4}{3}\ell$, which means that $x_t$ is a $\frac{4}{3}$-approximation of the optimal solution. To show that we obtain a $(\frac{4}{3}+\varepsilon)$-approximation in expected linear time for all constants $\varepsilon > 0$, we use a modified potential function $\Phi_\varepsilon$. For this potential function the drift is at least as large as for $\Phi$, but its smallest non-zero value is $\frac{\varepsilon w}{2}$. Hence, by the multiplicative drift theorem (Theorem 6) the expectation of the first time $T_I(\varepsilon)$ at which $\Phi_\varepsilon$ becomes zero is at most $\frac{en}{\rho}(1 + \ln(\frac{1}{\varepsilon})) = O(n)$. When $\Phi_\varepsilon(x_t) = 0$, we have $f(x_t) \le \ell + \frac{w}{6} + \frac{\varepsilon w}{2} \le (\frac{4}{3} + \varepsilon)\ell$, therefore $x_t$ is a $(\frac{4}{3}+\varepsilon)$-approximation. By Lemma 3 and by Wald's equation (Lemma 5) we also obtain the corresponding estimates of the runtimes $T_F$ and $T_F(\varepsilon)$ in terms of fitness evaluations.

Jump Functions
In this subsection we show that the heavy-tailed $(1+(\lambda,\lambda))$ GA performs well on jump functions; hence there is no need for the informal argumentation of [AD20] to choose the mutation rate $p$ and the crossover bias $c$ identically. The main result is the following theorem, which estimates the expected runtime until we leave the local optimum of Jump$_k$.
Theorem 13. Let $k \in [2..\frac{n}{4}]$, $u_p \ge \sqrt{2k}$, and $u_c \ge \sqrt{2k}$. Assume that we use the heavy-tailed $(1+(\lambda,\lambda))$ GA (Algorithm 1) to optimize Jump$_k$, starting already in the local optimum. Then the expected number of fitness evaluations until the optimum is found is as shown in Table 1, where $p_{pc}$ denotes the probability that both $p$ and $c$ are in $[\frac{k}{n}, \frac{2k}{n}]$.

The proof of Theorem 13 follows from arguments similar to those in [AD20, Theorem 6], the main differences being highlighted in the following two lemmas.
Lemma 14. The probability $p_{pc}$ that both $p$ and $c$ are in $[\frac{k}{n}, \frac{2k}{n}]$ is as shown in Table 2.
Proof. Since we choose $p$ and $c$ independently, we have $p_{pc} = \Pr[p \in [\frac{k}{n}, \frac{2k}{n}]]\cdot\Pr[c \in [\frac{k}{n}, \frac{2k}{n}]]$. By the definition of the power-law distribution and by Lemmas 1 and 2, we can estimate $\Pr[p \in [\frac{k}{n}, \frac{2k}{n}]]$ for each regime of $\beta_p$ and $u_p$.

(Caption of Table 1: $F(\beta_\lambda, u_\lambda)$ is some function of $\beta_\lambda$ and $u_\lambda$; to ease reading we only state $F(\beta_\lambda, u_\lambda) = E[T_F]\,p_{pc}$ and show the influence of the hyperparameters on $p_{pc}$ in Table 2. Asymptotic notation is used for $n \to +\infty$. The highlighted cell shows the result for the hyperparameters suggested in Corollary 16.)
We can estimate $\Pr[c \in [\frac{k}{n}, \frac{2k}{n}]]$ in the same manner, which gives us the final estimate of $p_{pc}$ shown in Table 2 (the table states $p_{pc}$ when both $u_p$ and $u_c$ are at least $\sqrt{2k}$; asymptotic notation is used for $n \to +\infty$, and the highlighted cell shows the result for the hyperparameters suggested in Corollary 16). We now proceed with an estimate of the probability to find the optimum in one iteration after choosing $p$ and $c$.

Lemma 15. If the heavy-tailed $(1+(\lambda,\lambda))$ GA is in the local optimum of Jump$_k$, then the probability that the algorithm generates the global optimum in one iteration is at least $e^{-\Theta(k)} \min\{1, (\frac{k}{n})^k \lambda^2\}$.

Proof. The probability $P_{pc}(\lambda)$ that we find the optimum in one iteration is the probability that we have a successful mutation phase and a successful crossover phase in the same iteration. If we denote the probability of a successful mutation phase by $p_M$ and the
probability of a successful crossover phase by $p_C$, then we have $P_{pc}(\lambda) = p_M p_C$. Then, with $q_\ell$ being some constant denoting the probability that the number $\ell$ of bits we flip in the mutation phase is in $[pn, 2pn]$, Lemmas 3.1 and 3.2 in [ADK22] give lower bounds on $p_M$ and $p_C$, each of the form of a minimum of one and a term growing with $\lambda$. If $\lambda$ is so large that both minima are equal to their first argument, we obtain $P_{pc}(\lambda) \ge e^{-\Theta(k)}$. Otherwise, both minima are equal to their second argument, and we obtain $P_{pc}(\lambda) \ge e^{-\Theta(k)}(\frac{k}{n})^k\lambda^2$. Bringing the two cases together, we finally obtain $P_{pc}(\lambda) \ge e^{-\Theta(k)}\min\{1, (\frac{k}{n})^k\lambda^2\}$. We are now in a position to prove Theorem 13.
Proof of Theorem 13. Let the current individual $x$ of the heavy-tailed $(1+(\lambda,\lambda))$ GA already be in the local optimum. Let $P$ be the probability of the event $F$ that the algorithm finds the optimum in one iteration. By the law of total probability this probability is at least $p_{pc}\,p_{(F|pc)}$, where $p_{(F|pc)}$ is the conditional probability of $F$ given that both $p$ and $c$ are in $[\frac{k}{n}, \frac{2k}{n}]$. The number $T_I$ of iterations until we jump to the optimum follows a geometric distribution $\mathrm{Geom}(P)$ with success probability $P$; therefore, $E[T_I] = \frac{1}{P}$. Since in each iteration the heavy-tailed $(1+(\lambda,\lambda))$ GA performs $2\lambda$ fitness evaluations (with $\lambda$ chosen from the power-law distribution), by Wald's equation (Lemma 5) the expected number $E[T_F]$ of fitness evaluations the algorithm makes before it finds the optimum is $2E[\lambda]\,E[T_I]$. In the remainder we show how $E[\lambda]$, $p_{(F|pc)}$, and $p_{pc}$ depend on the hyperparameters of the algorithm.
First we note that $p_{pc}$ was estimated in Lemma 14. Also, by Lemma 3 the expected value of $\lambda$ is determined by $\beta_\lambda$ and $u_\lambda$. Finally, we compute the conditional probability of $F$ via the law of total probability as $p_{(F|pc)} = \sum_{i=1}^{u_\lambda}\Pr[\lambda = i]\,P_{pc}(i)$,
where $P_{pc}(i)$ is as defined in Lemma 15, in which it was shown that $P_{pc}(i) \ge e^{-\Theta(k)}\min\{1, (\frac{k}{n})^k i^2\}$. We consider two cases depending on the value of $u_\lambda$.

Case 1: $u_\lambda \le (\frac{n}{k})^{k/2}$. In this case the minimum is equal to its second argument for all $i \le u_\lambda$, hence $p_{(F|pc)} \ge e^{-\Theta(k)}(\frac{k}{n})^k E[\lambda^2]$, and by Lemma 4 we estimate $E[\lambda^2]$ and obtain the corresponding bounds.

Case 2: $u_\lambda > (\frac{n}{k})^{k/2}$. In this case the minimum is equal to one for all $i > (\frac{n}{k})^{k/2}$, so we split the sum at this point. Estimating the two sums via Lemma 1, we obtain the corresponding bounds.

Gathering the estimates for the two cases together with the estimates of $E[\lambda]$ and $p_{pc}$, we obtain the runtimes listed in Table 1.
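The quantity $p_{pc}$ from Lemma 14 is easy to evaluate numerically for concrete hyperparameters. A small sketch (the helper name and the concrete values of $n$, $k$, $\beta$, and $u$ are our own illustrative choices):

```python
def power_law_prob_in(beta, u, lo, hi):
    """Pr[X in [lo, hi]] for X power-law distributed on {1, ..., u}
    with exponent beta. Illustrative helper, not code from the paper."""
    total = sum(i ** (-beta) for i in range(1, u + 1))
    return sum(i ** (-beta) for i in range(lo, min(hi, u) + 1)) / total

# p = index / n lies in [k/n, 2k/n] exactly when the sampled index lies in
# [k, 2k]; p and c are drawn independently with the same hyperparameters
# here, so p_pc is the square of a single-parameter probability.
n, k = 1000, 5
u = 30                                   # any cutoff with u >= sqrt(2k); toy choice
p_pc = power_law_prob_in(1.1, u, k, 2 * k) ** 2
assert 0 < p_pc < 1
```

With a power-law exponent close to one, this probability decays only polynomially in $k$, which is how the $e^{-O(k)}$ factors in Table 1 absorb it.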

Recommended Hyperparameters
In this subsection we summarize the results of our runtime analysis and identify the most preferable parameters of the power-law distributions for practical use. We state the runtimes with these parameters on OneMax and Jump$_k$ in Corollary 16. We then also prove a lower bound on the runtime of the $(1+(\lambda,\lambda))$ GA with static parameters, showing that when $k$ is constant (the most interesting case, since only then we have a polynomial runtime), the performance of the heavy-tailed $(1+(\lambda,\lambda))$ GA is asymptotically better than the best performance that can be obtained with static parameters.
Corollary 16. Let $\beta_\lambda = 2 + \varepsilon_\lambda$, $\beta_p = 1 + \varepsilon_p$, and $\beta_c = 1 + \varepsilon_c$, where $\varepsilon_\lambda, \varepsilon_p, \varepsilon_c > 0$ are constants. Let also $u_\lambda$ be at least $2^n$ and $u_p = u_c = \sqrt{n}$. Then the expected runtime of the heavy-tailed $(1+(\lambda,\lambda))$ GA is $O(n\log(n))$ fitness evaluations on OneMax, and on Jump$_k$ it is as shown in the right column of Table 1.

This corollary follows from Theorems 8 and 13. We only note that for the runtime on Jump$_k$, the same arguments as in Theorem 8 show that the runtime until we reach the local optimum is at most $O(n\log(n))$, which is small compared to the runtime until we reach the global optimum. We also note that when $\beta_p$ and $\beta_c$ are both greater than one and $u_p = u_c = \sqrt{n} \ge \sqrt{2k}$, by Lemma 14 we have $p_{pc}^{-1} = e^{O(k)}$, which is implicitly hidden in the $e^{O(k)}$ factor of the runtime on Jump$_k$. Finally, the choice $u_\lambda \ge 2^n$ guarantees that $u_\lambda > (\frac{n}{k})^{k/2}$, which yields the runtimes shown in the right column of Table 1.

Corollary 16 shows that when we use (almost) unbounded distributions and power-law exponents slightly greater than one for all parameters except the population size, for which we use a power-law exponent slightly greater than two, we obtain a good performance both on easy monotone functions, which give a clear signal towards the optimum, and on the much harder jump functions, without any knowledge of the jump size.
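The reason $\beta_\lambda = 2 + \varepsilon$ is recommended can be seen numerically: the first moment of the truncated power law stays bounded as the cutoff grows (so iterations stay cheap on average), while the second moment, which drives the jump probability in Lemma 15, keeps growing. A sketch with our own illustrative numbers:

```python
def truncated_power_law_moment(beta, u, order):
    """order-th moment of X ~ power-law(beta) on {1, ..., u} (illustration)."""
    norm = sum(i ** (-beta) for i in range(1, u + 1))
    return sum(i ** (order - beta) for i in range(1, u + 1)) / norm

# With beta_lambda = 2.5 (slightly above two), E[lambda] stays bounded as the
# cutoff u grows, so the expected cost 2*E[lambda] of an iteration is O(1).
assert truncated_power_law_moment(2.5, 10 ** 4, 1) < truncated_power_law_moment(2.5, 10 ** 5, 1) < 3
# E[lambda**2], in contrast, keeps growing with u (roughly like sqrt(u) here).
assert truncated_power_law_moment(2.5, 10 ** 5, 2) > 5 * truncated_power_law_moment(2.5, 10 ** 3, 2)
```

This is the quantitative version of "imitating a hill climber most of the time while still making occasional large jumps".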
We now also show that the proposed choice of the hyperparameters gives a better performance than any static parameter choice on Jump$_k$ for constant $k$. As we have already noted in the introduction, only for such values of $k$ do the different variants of the $(1+(\lambda,\lambda))$ GA and many other classic EAs have a polynomial runtime, hence this case is the most interesting one to consider. We prove the following theorem, which holds for any static parameter choice of the $(1+(\lambda,\lambda))$ GA, even when different population sizes $\lambda_M$ and $\lambda_C$ are used in the mutation and crossover phases, respectively.

Theorem 17. Let $n$ be sufficiently large. Then the expected runtime of the $(1+(\lambda,\lambda))$ GA with any static parameters $p$, $c$, $\lambda_M$, and $\lambda_C$ on Jump$_k$ with $k \le \frac{n}{512}$ is at least a value $B$ which, for constant $k$, is asymptotically larger than the runtime guarantee of Corollary 16.

Before we prove Theorem 17, we give a short sketch of the proof to ease the further reading. First we show that with high probability the $(1+(\lambda,\lambda))$ GA with static parameters starts at a point with approximately $\frac{n}{2}$ one-bits. In the second step we handle a wide range of parameter settings and show that with them we cannot obtain a runtime better than $B$, by showing that the probability to find the optimum in one iteration is at most $\frac{1}{B}$. For the remaining settings we then show that we are unlikely to observe an $\Omega(n)$ progress in one iteration, hence with high probability there is an iteration in which the fitness is both $\frac{n}{2} + \Omega(n)$ and $n - k - \Omega(n)$. From that point on, a progress of $\Omega(k\log(\frac{n}{k}))$ in one iteration is very unlikely, hence with high probability the $(1+(\lambda,\lambda))$ GA reaches neither the local optimum of Jump$_k$ nor the global one within $B$ fitness evaluations. For the narrowed range of parameters this yields the lower bound.
To transform these informal arguments into a rigorous proof, we use several auxiliary tools. The first of them is Lemma 14 from [DWY21], which we formulate as follows.
Lemma 18 (Lemma 14 in [DWY21]). Let $x$ be a bit string of length $n$ with exactly $m$ one-bits in it. Let $y$ be an offspring of $x$ obtained by flipping each bit independently with probability $\frac{r}{n}$, where $r \le \frac{n}{2}$. Let also $m'$ be a random variable denoting the number of one-bits in $y$. Then for any $\Delta \ge 0$, the probability that $m'$ exceeds its expectation by at least $\Delta$ decreases exponentially in $\Delta$.

We also use the following lemma, which bounds the probability to make a jump to a certain point.
Lemma 19. If the current individual is at distance $d \le \frac{n}{2}$ from the unique optimum of any function, then the probability $P$ that the $(1+(\lambda,\lambda))$ GA with mutation rate $p$, crossover bias $c$, and population sizes $\lambda_M$ and $\lambda_C$ for the mutation and crossover phases, respectively, finds the optimum in one iteration is at most $\min\{1,\ 2\lambda_M\lambda_C(\frac{d}{2n})^d\}$.

A very similar, but less general result has been proven in [ADK22] (Theorem 16).
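The proof of Lemma 19 hinges on maximizing a function $f_d$ over $[0,1]$. A quick numeric check (our own sketch, assuming the standard choice $f_d(x) = x^d(1-x)^{n-d}$, which matches the stated derivative root $x = \frac{d}{n}$) confirms both the location of the maximum and the bound $(\frac{d}{2n})^d$ for $d \le \frac{n}{2}$:

```python
n, d = 100, 10
f = lambda x: x ** d * (1 - x) ** (n - d)  # assumed form of f_d

xs = [i / 10000 for i in range(1, 10000)]  # grid over (0, 1)
x_best = max(xs, key=f)
assert abs(x_best - d / n) < 1e-3                     # maximum attained at x = d/n
assert all(f(x) <= (d / (2 * n)) ** d for x in xs)    # f_d(x) <= (d/2n)^d for d <= n/2
```

The second assertion is exactly the inequality that feeds into the $(\frac{d}{2n})^d$ factor of the lemma's bound.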
Proof of Lemma 19. Without loss of generality we assume that the unique optimum is the all-ones bit string; hence the current individual has exactly $d$ zero-bits. Let $p_\ell$ be the probability that we choose $\ell$ as the number of bits to flip at the start of the iteration of the $(1+(\lambda,\lambda))$ GA. Let also $p_m(\ell)$ be the probability (conditional on the chosen $\ell$) that the mutation winner has all zero-bits flipped to ones; note that this is necessary for the crossover to be able to create the global optimum. Let $p_c(\ell)$ be the probability that, conditional on the chosen $\ell$ and on flipping all $d$ zero-bits in the mutation winner, at least one crossover offspring repairs the $\ell - d$ one-bits that were flipped to zeros in the mutation winner while keeping its $d$ repaired bits. Then by the law of total probability we have (1): $P = \sum_{\ell=0}^{n} p_\ell\, p_m(\ell)\, p_c(\ell)$. For $\ell < d$ the probability that we flip all $d$ zero-bits in the mutation winner is zero. For $\ell = d$ we flip all $d$ zero-bits in one particular mutation offspring with probability $q_m(\ell) = \binom{n}{d}^{-1}$. Since we create all $\lambda_M$ offspring independently, the probability that we flip all $d$ zero-bits in at least one offspring is at most $\lambda_M q_m(\ell)$, where we used the Bernoulli inequality. Since such an offspring is already the optimum, we do not need to perform the crossover, and therefore $p_c(\ell) = 1$ in this case. When $\ell > d$, the probability to flip all $d$ zero-bits in one offspring is $q_m(\ell) = \binom{n-d}{\ell-d}\binom{n}{\ell}^{-1}$.
The probability to do so in at least one of the $\lambda_M$ independently created offspring is thus at most $\lambda_M q_m(\ell)$. The probability that one crossover offspring takes from the current individual all $\ell - d$ bits which are zeros in the mutation winner and takes from the mutation winner all $d$ bits which are zeros in the current individual is $q_c(\ell) = c^d(1-c)^{\ell-d}$. Consequently, the probability that we do this in at least one of the $\lambda_C$ independently created individuals is at most $\lambda_C q_c(\ell)$. Recall that $\ell$ is chosen from the binomial distribution $\mathrm{Bin}(n,p)$, thus we have $p_\ell = \binom{n}{\ell}p^\ell(1-p)^{n-\ell}$. Putting all the estimates above into (1), we obtain a bound whose sums can be expressed via the function $f_d(x) = x^d(1-x)^{n-d}$. To find its maximum, we consider its value at the ends of the interval (which is zero at both ends) and at the roots of its derivative, which is $f_d'(x) = x^{d-1}(1-x)^{n-d-1}(d - nx)$. Hence, the only root of the derivative in $(0,1)$ is $x = \frac{d}{n}$. Since $f_d(x)$ is a smooth function, it attains its maximum there, which is $(\frac{d}{n})^d(1-\frac{d}{n})^{n-d}$. Since we assume that $d \le \frac{n}{2}$, we conclude that for all $x \in [0,1]$ we have $f_d(x) \le (\frac{d}{2n})^d$. Hence we have $P \le \lambda_M(\frac{d}{2n})^d + \lambda_M\lambda_C(\frac{d}{2n})^d \le 2\lambda_M\lambda_C(\frac{d}{2n})^d$. Since $P$ cannot exceed one, we also have $P \le \min\{1,\ 2\lambda_M\lambda_C(\frac{d}{2n})^d\}$.

An important corollary of Lemma 19 is the following lower bound for the case when too small population sizes are used.
Corollary 20. Consider a run of the $(1+(\lambda,\lambda))$ GA with static parameters on Jump$_k$ with $k < \frac{n}{2}$. Let the population sizes used for the mutation and crossover phases be $\lambda_M$ and $\lambda_C$, respectively. Let also the current individual $x$ be a point outside the fitness valley with at least $\frac{n}{2}$ one-bits. Then if $\lambda_M\lambda_C \le \ln(\frac{n}{k})(\frac{2n}{k})^{k-1}$, the expected runtime until we find the global optimum is at least $\frac{\lambda_M+\lambda_C}{4\lambda_M\lambda_C}(\frac{2n}{k})^k$ fitness evaluations.

Proof. Since the algorithm has already found a point outside the fitness valley, it will never accept a point inside the valley as the current individual $x$. Hence, unless we find the optimum, the distance to it from the current individual is at least $k$ and at most $\frac{n}{2}$. We now consider the term $(\frac{d}{2n})^d$, which appears in the bound of Lemma 19, as a function of $d$ and maximize it for $d \in [k, \frac{n}{2}]$. For this purpose we consider its values at the ends of the interval and at the zeros of its derivative, which is $(\frac{d}{2n})^d(\ln(\frac{d}{2n}) + 1)$. Hence, the derivative is zero only when $\frac{d}{2n} = e^{-1}$, that is, when $d = \frac{2n}{e}$. Since we only consider $d \le \frac{n}{2} < \frac{2n}{e}$, the derivative has no roots in this range; moreover, for $d < \frac{2n}{e}$ the derivative is negative, hence the maximal value of $(\frac{d}{2n})^d$ on $[k, \frac{n}{2}]$ is attained at $d = k$. Therefore, by Lemma 19 the probability to find the optimum in one iteration is at most $\min\{1,\ 2\lambda_M\lambda_C(\frac{k}{2n})^k\}$. Since $\lambda_M\lambda_C \le \ln(\frac{n}{k})(\frac{2n}{k})^{k-1}$ and since for all $x \ge 2$ we have $\ln(x) < \frac{x}{2}$, we compute $2\lambda_M\lambda_C(\frac{k}{2n})^k \le 2\ln(\frac{n}{k})\cdot\frac{k}{2n} = \frac{k}{n}\ln(\frac{n}{k}) \le \frac{1}{2}$. Therefore, the minimum above is equal to its second argument. Thus, the number $T_I$ of iterations stochastically dominates a geometric distribution with success probability $2\lambda_M\lambda_C(\frac{k}{2n})^k \le \frac{1}{2}$. This implies that the expected number of unsuccessful iterations is at least $\frac{E[T_I]}{2} \ge \frac{1}{4\lambda_M\lambda_C}(\frac{2n}{k})^k$. Since in each unsuccessful iteration we make exactly $\lambda_M + \lambda_C$ fitness evaluations, the claimed bound follows.

In the following lemma we show that too large population sizes also yield a too large expected runtime.
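The monotonicity claim at the heart of the proof of Corollary 20, that $(\frac{d}{2n})^d$ is decreasing on $[k, \frac{n}{2}]$ because the derivative's only root lies at $d = \frac{2n}{e} > \frac{n}{2}$, is easy to verify numerically; a small sketch with illustrative values of our own:

```python
n = 200  # toy problem size; any n works since n/2 < 2n/e always holds

# g(d) = (d / (2n))**d on the integer range [2, n/2]
vals = [(d / (2 * n)) ** d for d in range(2, n // 2 + 1)]

# strictly decreasing over the whole range, so the maximum over [k, n/2]
# is attained at d = k, as used in the proof
assert all(a > b for a, b in zip(vals, vals[1:]))
```

This is why the jump distance $d = k$ dominates the failure probability: any longer jump is even less likely.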
Lemma 21. Consider a run of the $(1+(\lambda,\lambda))$ GA with static parameters on Jump$_k$ with $k < \frac{n}{2}$. Let the population sizes used for the mutation and crossover phases be $\lambda_M$ and $\lambda_C$, respectively. Let also the current individual $x$ be a point outside the fitness valley with at least $\frac{n}{2}$ one-bits. Then if $\lambda_M + \lambda_C \ge 2^{n/2}$, the expected runtime until we find the global optimum is at least $2^{n/2-2}$ fitness evaluations.

Proof. By Lemma 7, the cost of one iteration is $\lambda_M + \lambda_C$ fitness evaluations; hence to prove this lemma it is enough to consider only the first iteration of the algorithm, which already takes at least $\lambda_M + \lambda_C$ fitness evaluations plus one evaluation for the initial individual. We first show that if $\lambda_M \ge 2^{n/2} - 2$, then we are not likely to sample the optimum before making $2^{n/2} - 1$ fitness evaluations. For this we note that the initial individual and all mutation offspring in the first iteration are sampled independently of the fitness function, thus they are random points in the search space. Therefore, for each of these individuals the probability to be the optimum is $2^{-n}$. Consequently, by the union bound, when we create the initial individual and $2^{n/2} - 2$ mutation offspring, the probability that at least one of them is the optimum is at most $(2^{n/2} - 1)2^{-n} \le 2^{-n/2}$. Hence, with probability at least $1 - 2^{-n/2}$ we have to make $2^{n/2} - 1$ or more fitness evaluations, which implies the claimed bound. In the rest of the proof we assume that $\lambda_M < 2^{n/2} - 2$.
Since the mutation winner is chosen based on the fitness, we cannot use the same argument with random points in the search space for the crossover phase. However, we can consider an artificial process which, in parallel, runs the crossover phase for each mutation offspring treated as the winner. If none of these parallel processes has generated the optimum within $m$ crossover offspring samples, then the true process has not done so within a total of $1 + \lambda_M + m$ fitness evaluations either. We note that in the parallel crossover phases, since no selection has been made, all offspring are again uniformly distributed in $\{0,1\}^n$.
Let us fix $m = 2^{n/2-1}$. By the union bound, the probability that one of the $1 + \lambda_M + m\lambda_M$ individuals generated by the artificial process is the optimum is at most $(1 + \lambda_M + m\lambda_M)2^{-n} \le \frac{1}{2}$, using $\lambda_M < 2^{n/2} - 2$. At the same time, if the original $(1+(\lambda,\lambda))$ GA creates $1 + \lambda_M + m \ge m$ individuals, it also performs at least $m$ fitness evaluations. Hence, the expected number of fitness evaluations is at least $\frac{m}{2} = 2^{n/2-2}$.
We are now in a position to prove Theorem 17.
Proof of Theorem 17. Initialization. Recall that the initial individual is sampled uniformly at random, hence the number of one-bits in it follows a binomial distribution $\mathrm{Bin}(n, \frac{1}{2})$. By a symmetry argument, the number of one-bits $X$ in the initial individual is at least $\frac{n}{2}$ with probability at least $\frac{1}{2}$. By Chernoff bounds (see, e.g., Theorem 1.10.1 in [Doe20]), the probability that $X$ is greater than $\frac{n}{2} + \frac{n}{8}$ is at most $e^{-\Theta(n)}$. Hence, with probability at least $\frac{1}{2} - e^{-\Theta(n)}$ the initial individual has a number of one-bits (and hence a fitness) in $[\frac{n}{2}, \frac{5n}{8}]$. We now condition on this event.

Narrowing the reasonable population sizes. Since we condition on starting at distance $d \le \frac{n}{2}$ from the optimum of Jump$_k$, by Corollary 20 and Lemma 21 we have that choosing $\lambda_M$ and $\lambda_C$ outside the range considered below yields an expected runtime of at least $B$. Hence, in the rest of the proof we assume that $\ln(\frac{n}{k})(\frac{2n}{k})^{k-1} \le \lambda_M\lambda_C \le \frac{1}{\ln(n/k)}(\frac{2n}{k})^{k+1}$. We note that by Lemma 7 this assumption also implies a bound on the cost of one iteration.

Narrowing the reasonable mutation rate and crossover bias. We now show that a too large mutation rate or crossover bias also yields a runtime which is greater than $(\frac{2n}{k})^{\frac{k+1}{2}}$ and therefore greater than $B$. Conditional on the current individual $x$ being at distance $d \le \frac{n}{2}$ from the optimum, by Lemma 19 we have that if $pc \ge \frac{1}{2}$ (and therefore $p \ge \frac{1}{2}$), then the probability to find the optimum in one iteration is at most $2\lambda_M\lambda_C 2^{-n}$.
Therefore, the expected number of iterations until we find the optimum is at least $\frac{2^n}{2\lambda_M\lambda_C}$. Since we already assume that $\lambda_M\lambda_C \le \frac{1}{\ln(n/k)}(\frac{2n}{k})^{k+1}$, by Lemma 7 we can translate this into a bound on the number of fitness evaluations.
Therefore, we obtain an expected runtime of at least $\frac{2^n}{2\lambda_M\lambda_C} \ge \frac{2^n\ln(\frac{n}{k})}{2}(\frac{k}{2n})^{k+1}$ fitness evaluations. We note that for $k \in [1, \frac{n}{512}]$ the term $(\frac{k}{2n})^{k+1}$ is decreasing in $k$ (we omit the proof of this fact, which trivially follows from considering the derivative). Consequently, if we assume that $k \le \frac{n}{512}$, then this bound is exponential in $n$. Hence, using $pc \ge \frac{1}{2}$ gives us an expected runtime which is not less than $B$.

Making a linear progress. For the rest of the proof we assume that the population sizes satisfy $\ln(\frac{n}{k})(\frac{2n}{k})^{k-1} \le \lambda_M\lambda_C \le \frac{1}{\ln(n/k)}(\frac{2n}{k})^{k+1}$ and that $p$ and $c$ satisfy $pc \le \frac{1}{2}$. We now show that at some iteration, before we have made $(\frac{2n}{k})^{\frac{k+1}{2}}$ fitness evaluations, we get a current individual $x$ with fitness in $[\frac{n}{2}+\frac{n}{8}, \frac{n}{2}+\frac{n}{4}]$. For this we show that, conditional on $f(x) \ge \frac{n}{2}$, we are not likely to increase the fitness in one iteration by at least $\frac{n}{8}$ for a very long time. For this purpose we consider a modified iteration of the $(1+(\lambda,\lambda))$ GA, in which the crossover phase creates not only $\lambda_C$ offspring by crossing the current individual $x$ with the mutation winner $x'$, but $\lambda_M \cdot \lambda_C$ offspring by performing crossover between $x$ and each mutation offspring $\lambda_C$ times. The best offspring of this modified iteration cannot be worse than the best offspring of a non-modified iteration. Hence the probability that we increase the fitness by at least $\frac{n}{8}$ is at most the probability that the best offspring of this modified iteration is better than the current individual $x$ by at least $\frac{n}{8}$. Consider one particular offspring $y'$ created in this modified iteration. Recall that its parent was created by first choosing a number $\ell$ from the binomial distribution $\mathrm{Bin}(n, p)$ and then flipping $\ell$ bits; therefore, it is distributed as if we created it by flipping each bit independently with probability $p$.
Then, when we create $y'$, we take each flipped bit from its parent with probability $c$; hence in the resulting offspring each bit is flipped with probability $pc$, independently of the other bits. Consequently, the distribution of $y'$ is the same as if we created it via the standard bit mutation with probability $pc$ of flipping each bit. Note that this argument works only when we consider one particular individual, since the mutation offspring are dependent on each other (they all have the same number $\ell$ of bits flipped) and therefore their offspring are also all dependent.
To estimate the probability that $y'$ has a fitness exceeding that of $x$ by $\frac{n}{8}$, we use Lemma 18 with $r = pcn$ (note that since $pc \le \frac{1}{2}$, we have $r \le \frac{n}{2}$, thus we satisfy the conditions of Lemma 18). In each modified iteration we create $\lambda_M\lambda_C$ offspring; therefore, by the union bound, the probability that at least one of them has a fitness greater by at least $\frac{n}{8}$ than the fitness of its parent is at most $\lambda_M\lambda_C$ times the individual tail bound. We also note that the term $(\frac{2n}{k})^{\frac{k+2}{2}}$ is increasing in $k$ for $k \in [1, \frac{n}{512}]$ (we omit the proof, which trivially follows from considering the derivative). Therefore, for such $k$ this probability is exponentially small, and by the union bound the probability that we create such an offspring within $\frac{n/4-k}{\delta}$ iterations is also exponentially small. If we start at some point $x$ with fitness $f(x) \le \frac{3n}{4}$ and do not improve the fitness by at least $\delta$ for $\frac{n/4-k}{\delta}$ iterations, then we reach neither the local optimum nor the global optimum within this number of iterations (note that for the considered values of $k \le \frac{n}{32}$ and $\delta$, the value of $\frac{n/4-k}{\delta}$ is at least one). During these iterations we perform at least $(\lambda_M + \lambda_C)\frac{n/4-k}{\delta}$ fitness evaluations. We now estimate the factor $\frac{(n-4k)k}{2\delta}\ln(\frac{n}{k})$: since by the theorem conditions we have $k \le \frac{n}{32}$, for $n$ large enough this factor is large enough to yield the claimed bound.

Summary of the proof. We now bring our arguments together. For the narrowed range of parameters we have shown that (i) with probability $\frac{1}{2} - e^{-\Theta(n)}$ the initial individual has between $\frac{n}{2}$ and $\frac{5n}{8}$ one-bits, (ii) then with probability $1 - e^{-\Theta(n)}$ we reach a point which has between $\frac{5n}{8}$ and $\frac{3n}{4}$ one-bits, and (iii) from such a point, with high probability we do not find the optimum before making $B$ fitness evaluations. Therefore, the expected runtime $T_F$ (in terms of fitness evaluations) is at least $\Omega(B)$.
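The initialization step of the proof, that the initial fitness lies in $[\frac{n}{2}, \frac{5n}{8}]$ with probability close to $\frac{1}{2}$, is easy to confirm by simulation. A small Monte Carlo sketch (our own; the sample sizes are arbitrary):

```python
import random

rng = random.Random(42)
n, trials = 500, 2000

def initial_fitness():
    # number of one-bits in a uniformly random bit string of length n
    return sum(rng.randrange(2) for _ in range(n))

hits = sum(n / 2 <= initial_fitness() <= 5 * n / 8 for _ in range(trials))
# Pr[X >= n/2] is about 1/2 by symmetry, and Pr[X > 5n/8] is exponentially
# small by Chernoff bounds -- matching the 1/2 - e^(-Theta(n)) estimate.
assert hits / trials > 0.45
```

The empirical fraction settles slightly above $\frac{1}{2}$ (the event includes the symmetric midpoint $X = \frac{n}{2}$), in line with the bound used in the proof.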

Experiments
As our theoretical analysis gives upper bounds that are precise only up to constant factors, we now use experiments to obtain a better understanding of how the heavy-tailed $(1+(\lambda,\lambda))$ GA performs on concrete problem sizes. We conducted a series of experiments on OneMax and on jump functions with jump sizes $k \in [2..6]$. Since our theory-based recommendations for $\beta_p$ and $\beta_c$ are very similar, since the analysis on the jump functions treats the corresponding distributions symmetrically, and since our preliminary experimentation did not find any significant advantage of using different values for $\beta_p$ and $\beta_c$, we decided to keep them equal in our experiments and denote them jointly by $\beta_{pc}$, so that $\beta_p = \beta_c = \beta_{pc}$.
In all presented plots we display the average values over 100 independent runs of the considered algorithm, together with the standard deviation.
Figure 3 presents the running times of the heavy-tailed (1 + (λ, λ)) GA on OneMax against the problem size n for all considered values of β_pc, with a fixed value of β_λ = 2.8. The running times of the (1 + 1) EA are also shown for comparison. Since the runtime is normalized by n ln(n), the plot of the latter tends to a horizontal line, and so do the plots of the heavy-tailed (1 + (λ, λ)) GA with β_pc ≥ 1.8. The other plots, after discounting for the noise in the measurements, appear to be convex upwards and, similarly to [ABD22], will likely become horizontal as n grows. For OneMax, larger values of β_pc appear to be better. Since a larger β_pc increases the chance of behaving similarly to the (1 + 1) EA during an iteration, this fits the situation discussed in Lemma 9. The plots look similar for values of β_λ other than 2.8, so we do not present them here.
To investigate the dependencies on β_λ and β_pc more thoroughly, we consider the largest available problem size n = 2^14 and plot the runtimes for all parameter configurations in Figure 4. The general trend of improving performance with growing β_pc is clearly visible here as well. For β_λ, the picture is less clear. It appears that very small β_λ also results in larger running times, medium values of roughly β_λ = 2.4 yield the best observed runtimes, and a further increase of β_λ increases the runtime again, but only slightly. As very large β_λ, such as β_λ = 3.2, correspond to regimes similar to the (1 + 1) EA, this might be a sign that some of the working principles of the (1 + (λ, λ)) GA are still beneficial on an easy problem like OneMax.

Results for Jump Functions
For Jump functions we used the problem sizes n ∈ {2^i | 3 ≤ i ≤ 7}, subject to the condition k ≤ n/4 and hence n ≥ 4k, as assumed in the theoretical results of this paper. As the running times are higher in this setting, we consider a smaller set of parameter combinations, β_pc ∈ {1.0, 1.2, 1.4} and β_λ ∈ {2.0, 2.2, 2.4}.
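For reference, a minimal sketch of the Jump_k fitness function, assuming the standard definition with a fitness valley of width k before the all-ones optimum (the helper name is ours):

```python
def jump(x: list, k: int) -> int:
    """Standard Jump_k fitness on a bit string x of length n.

    Outside the valley the fitness rewards more one-bits (shifted OneMax);
    inside the valley of width k it rewards fewer one-bits, creating the gap
    that elitist hill-climbers like the (1 + 1) EA need Theta(n^k) time to cross.
    """
    n = len(x)
    ones = sum(x)
    if ones == n or ones <= n - k:
        return k + ones
    return n - ones
```

With this definition the global optimum is the all-ones string with fitness n + k, and the local optima are the strings with exactly n − k one-bits.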
Figure 5 presents the results of a comparison of the heavy-tailed (1 + (λ, λ)) GA with the (1 + 1) EA on Jump with jump parameter k ∈ {3, 5}. We chose the two most distant distribution parameters of the heavy-tailed (1 + (λ, λ)) GA for this figure. However, the difference between these is negligible compared to the difference to the (1 + 1) EA. Such a difference aligns well with the theory, as the running time of the (1 + 1) EA is Θ(n^k), whereas Theorem 13 predicts much smaller running times for the heavy-tailed (1 + (λ, λ)) GA. Due to the large standard deviations, we validated the results presented in Figure 5 using two statistical tests: Student's t-test, which compares the mean values that are the subject of our theorems, and the Wilcoxon rank-sum test as a nonparametric alternative. The results are presented in Table 3, where for each row and each test we show the maximum of the two p-values obtained from comparing the (1 + 1) EA with either of the two parameterizations of the heavy-tailed (1 + (λ, λ)) GA. The p-values are very small in all cases: except for the case k = 3, n = 16, they are all well below 10^{−10}, which indicates a vast difference between the algorithms and hence a clear superiority of the heavy-tailed (1 + (λ, λ)) GA.
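The two test statistics behind such a comparison can be computed from two runtime samples as follows. This is an illustrative stdlib-only sketch with synthetic data (not the paper's measurements); in practice one would use scipy.stats.ttest_ind and scipy.stats.ranksums, which report the corresponding p-values directly:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (compares the means)."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / na + variance(b) / nb)

def ranksum_z(a, b):
    """Normal-approximation z-score of the Wilcoxon rank-sum statistic of sample a.

    Ties between the samples are broken arbitrarily in this sketch.
    """
    na, nb = len(a), len(b)
    combined = sorted((v, src) for src in (0, 1) for v in (a if src == 0 else b))
    w = sum(rank for rank, (v, src) in enumerate(combined, start=1) if src == 0)
    mu = na * (na + nb + 1) / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)
    return (w - mu) / sigma

# Synthetic samples standing in for 100 runs of each algorithm
ea_times = [1000 + 10 * i for i in range(100)]  # hypothetical (1 + 1) EA runtimes
ga_times = [100 + i for i in range(100)]        # hypothetical heavy-tailed GA runtimes
```

For two clearly separated samples like these, both statistics are far out in the tail of their null distributions, corresponding to p-values far below any common significance level.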
The parameter study presented in Figure 6 suggests that for Jump the particular values of β_pc are not very important, although larger values result in marginally better performance. However, larger β_λ tend to worsen the performance, which is more pronounced for larger jump sizes k. This finding agrees with the upper bound proven in Corollary 16, in which (n/k) is raised to a power proportional to ε_λ = β_λ − 1. Each difference is statistically significant with p < 0.008 according to the Wilcoxon rank-sum test.
Finally, Figure 7 shows the running times of the heavy-tailed (1 + (λ, λ)) GA for the fixed parameterization β_pc = 1.0 and β_λ = 2.0 for all available values of n and k, to give an impression of the typical running times of this algorithm on the Jump problem.

Figure 2 :
Figure 2: Illustration of the definition of the critical object.

Lemma 15 .
Let k ∈ [2..n/4]. Let λ, p, and c be already chosen in an iteration of the heavy-tailed (1 + (λ, λ)) GA, and let p, c ∈ [k/n, 2k/n]. If the current individual x of the heavy-tailed (1 + (λ, λ)) GA

Figure 3 :
Figure 3: Running times of the heavy-tailed (1 + (λ, λ)) GA on OneMax starting from a random point, normalized by n ln(n), for different β_pc = β_p = β_c and β_λ = 2.8, in relation to the problem size n. The expected running times of the (1 + 1) EA, also starting from a random point, are given for comparison.

Figure 5 :
Figure 5: Running times of the heavy-tailed (1 + (λ, λ)) GA on Jump, depending on the problem size n, in comparison to the (1 + 1) EA. Jump sizes are k = 3 on the left and k = 5 on the right.

Figure 6:
Figure 6: Dependency of the running times of the heavy-tailed (1 + (λ, λ)) GA on Jump_k on β_λ and β_pc for k = 3 (on the left) and k = 6 (on the right). Problem size n = 2^7 is used.
Figure 7:
Table 2 shows estimates for p_pc.

Table 1 :
Influence of the hyperparameters β_λ and u_λ on the expected number E[T_F] of fitness evaluations the heavy-tailed (1 + (λ, λ)) GA, starting in the local optimum, takes to optimize Jump_k. Since all runtime bounds are of the type E[T_F] = F(β_λ, u_λ)/p_pc, where p_pc

Table 2 :
Influence of the hyperparameters β_p and β_c on p_pc