Large Population Sizes and Crossover Help in Dynamic Environments

Dynamic linear functions on the hypercube are functions which assign to each bit a positive weight, but the weights change over time. Throughout optimization, these functions maintain the same global optimum, and never have defecting local optima. Nevertheless, it was recently shown [Lengler, Schaller, FOCI 2019] that the $(1+1)$-Evolutionary Algorithm needs exponential time to find or approximate the optimum for some algorithm configurations. In this paper, we study the effect of larger population sizes for Dynamic BinVal, the extremal form of dynamic linear functions. We find that moderately increased population sizes extend the range of efficient algorithm configurations, and that crossover boosts this positive effect substantially. Remarkably, similar to the static setting of monotone functions in [Lengler, Zou, FOGA 2019], the hardest region of optimization for the $(\mu+1)$-EA is not close to the optimum, but far away from it. In contrast, for the $(\mu+1)$-GA, the region around the optimum is the hardest region in all studied cases.


Introduction
The (µ + 1) Evolutionary Algorithm and the (µ + 1) Genetic Algorithm, (µ + 1)-EA and (µ + 1)-GA for short, are heuristic algorithms that aim to optimize an objective or fitness function $f : \{0,1\}^n \to \mathbb{R}$. Both maintain a population of µ search points, and in each round they create an offspring from the population and discard one of the µ + 1 search points, based on their objective values. They differ in how the offspring is created: the (µ + 1)-EA chooses a parent from the population and mutates it, whereas the (µ + 1)-GA uses crossover of two parent solutions in addition to mutation.
Two classical theoretical questions for these algorithms have always been:
- What is the effect of the population size? In which optimization landscapes and regimes are larger (smaller) populations beneficial?
- In which situations does crossover improve performance?
Although these questions have always been central for studies of the (µ + 1)-EA and the (µ + 1)-GA, there is still active ongoing research on them, see [1,2,3,7,14,15,17,18] for a selection of theoretical work. (Also, the book chapter [16] treats related topics.) More generally, the research question is: which algorithm configurations perform well in which optimization landscapes? Such landscapes are given by specific benchmark functions or by a class of functions.
Recently, a new type of dynamic landscape was introduced by Lengler and Schaller [11]. It was called noisy linear functions in [11], but we prefer the term dynamic linear functions. In this setting, the objective function is of the form $f : \{0,1\}^n \to \mathbb{R}$; $f(x) = \sum_{i=1}^{n} W_i x_i$ with positive coefficients $W_i > 0$. However, the twist is that the weights $W_i$ are redrawn for each generation. That is, we have a distribution $D$, and for the $t$-th generation we draw i.i.d. weights $W_i^{(t)}$ from that distribution, which define a function $f^{(t)}$. Then the µ + 1 competing individuals are compared with respect to the fitness function $f^{(t)}$.
To motivate this setting, let us give a grotesquely oversimplified example. Imagine a chess engine has 100 bits as parameters. Each bit switches on/off a database tailored to one specific opening (1 = access, 0 = no access), and this improves performance massively for this opening. E.g., the first bit determines whether the engine plays well in a French Opening, but has no influence whatsoever on the performance of an Italian Opening (the database is ignored since it does not produce matches to that situation). Let us go even further and assume that the engine will always win an opening if the corresponding database is active, and always lose with inactive database. Then we have removed even the slightest ambiguity, and this setting has an obvious optimal solution, which is the all-one string (activate all databases). This situation may seem completely trivial, but crucially, it is not solved by some standard optimization algorithms.
To complete the analogy, assume that the engine is trained by playing against different players, where player $t$ has probability $W_i^{(t)}$ to choose the $i$-th opening. Then the reward is precisely the dynamic linear function introduced above, and it was shown in [11] that the (1 + 1)-EA needs exponential time to approximate the optimum within a constant factor when configured with bad parameters. These bad parameter settings look quite innocent. With standard bit mutation (i.e., for mutation we flip each bit independently with probability $p = c/n$), any choice $c > c_0 \approx 1.59$ leads asymptotically to an exponential time for finding or approximating the optimum, if the distribution $D$ is too skewed. On the other hand, for any $c < c_0$ the (1 + 1)-EA finds the optimum in time $O(n \log n)$ for any $D$. This lack of stability motivates our paper: we ask whether larger population sizes and/or crossover can raise the failure threshold $c_0$.
Optimization of dynamic functions may occur in various contexts. The chess engine with varying opponents is one such example. Similar examples arise in the context of co-evolution, e.g., a chess engine trained against itself, or a team of DOTA agents in which some abilities of an agent (good aim, good exploration strategy, good path planning, ...) are always helpful (positive weight), but may be more or less important depending on its current co-agents. A rather different example is planning the timetable of a transportation company: to be efficient in the exploration phase of the optimization algorithm, schedules may be compared only for some partial data, and not for the whole data set. Similarly, consider an optimization process in which the function evaluation involves an offline test, as in drug development or robotic training. Then each test may involve subtly varying outer conditions (e.g., different temperatures or humidity, different lighting), which effectively gives a slightly different fitness function for each test.
Our Results in a Nutshell. Instead of the full range of dynamic linear functions as in [11], we only study the limiting case of these functions, which we call dynamic binval. We perform experiments to study the performance of the (µ + 1)-EA and the (µ + 1)-GA for small values of µ. As for the (1 + 1)-EA, we find that for each algorithm there is a threshold $c_0$ such that the algorithm is efficient for every mutation parameter $c < c_0$, and inefficient for $c > c_0$. This threshold $c_0$ is our main object of study, and we investigate how it depends on the algorithmic choices. We find that an increased population size helps to raise $c_0$, but that the benefits are much larger when crossover is used. As a baseline, we recover the theoretical result from [11] that for the (1 + 1)-EA the threshold is at $c_0 \approx 1.59$, though experimentally, for $n = 3000$ it seems closer to 1.7. For the (2 + 1)-EA the threshold increases to $c_0 \approx 2.2$, and further to $c_0 \approx 3.1$ for the (2 + 1)-GA. If we explicitly forbid that the two parents in crossover are identical, then the threshold even shifts to $c_0 \approx 4.2$. We call the resulting algorithm (2 + 1)-GA-NoCopy. For larger population sizes we get a threshold of $c_0 \approx 2.6$ for the (3 + 1)-EA, $c_0 \approx 3.4$ for the (5 + 1)-EA, $c_0 \approx 6.1$ for the (3 + 1)-GA, and $c_0 > 20$ for the (5 + 1)-GA.
The theoretical results for the (1 + 1)-EA predict that the runtime jumps from quasi-linear to exponential. Indeed, we can experimentally confirm huge jumps in the runtime even for slight changes of the mutation parameter $c$. For example, we obtain a significant p-value for the a posteriori hypothesis that the (2 + 1)-EA with $c = 2.5$ is more than 60 times slower than the (2 + 1)-EA with $c = 2.0$. In fact, this is a highly conservative estimate since we needed to cut off the runs for $c = 2.5$. We systematically list these factors in our result sections.
To get a better understanding of the hardness of the optimization landscape, we compute the drift of degenerate populations, inspired by [8]. We call a population degenerate if it consists entirely of multiple copies of the same individual. If $X_i$ is the number of zero-bits in the $i$-th degenerate population, then we estimate the drift $E[X_i - X_{i+1} \mid X_i = y]$ by Monte-Carlo simulations. Moreover, for $y$ close to 0 we derive precise asymptotic formulas for the degenerate population drift for the (2 + 1)-EA and the (2 + 1)-GA. In [8] the degenerate population drift was studied theoretically for the (µ + 1)-EA on monotone functions, which is a related, but not identical setup (see below). Still, part of the analysis carries over: if the population drift is negative for some $y$ then the runtime is exponential, while it is $O(n \log n)$ if the population drift is positive everywhere.
Perhaps surprisingly, the (µ + 1)-EA and the (µ + 1)-GA are not just quantitatively different, but we also find a strong qualitative difference in the hardness landscape. For the (µ + 1)-GA, the "hardest" part of the optimization process is close to the optimum, in all cases that we have experimentally explored. Formally, we found that if the degenerate population drift is negative somewhere, then it is also negative close to the optimum. For the (µ + 1)-EA, we found the opposite: the degenerate population drift can be negative for some intermediate ranges, although it is positive around the optimum. This implies that the hard part of optimization (taking exponential time) is getting in the vicinity of the optimum. But once the algorithm is somewhat near the optimum, it will efficiently finish optimization. This behavior is rather counter-intuitive, since common wisdom says that optimization gets harder close to the optimum. Notably, a similar phenomenon has recently been proven for certain monotone functions by Lengler and Zou [8,13], see below.
Related Work. The only previous work on dynamic linear functions is by Lengler and Schaller [11]. As mentioned before, they proved that for every $c > c_0 \approx 1.59$ there is $\varepsilon > 0$ and a distribution $D$ such that the (1 + 1)-EA with mutation rate $c/n$ needs exponential time to find a search point with at least $(1 - \varepsilon)n$ one-bits for dynamic linear functions with weight distribution $D$. For $c < c_0$ the optimization time is $O(n \log n)$ for all distributions $D$. Moreover, for any $c > c_0$, they gave a complete characterization of all distributions for which the (1 + 1)-EA with mutation rate $c/n$ is efficient/inefficient.
An important strand of work that is similar in spirit, though not in detail, is the study of monotone functions. A function is monotone if for every bitstring, flipping any zero-bit into a one-bit increases the fitness. Doerr, Jansen, Sudholt, Winzen, and Zarges [4] and Lengler and Steger [12] showed that there are monotone functions for which the (1 + 1)-EA needs exponential time to find or approximate the optimum if the mutation parameter $c$ is too large ($c > c_0 \approx 2.1$), while it is efficient for all monotone functions if $c \le 1 + \varepsilon$ for some small $\varepsilon > 0$ [9]. The construction of hard (static) instances from [12] was named HotTopic in [8], and it resembles dynamic linear functions: the HotTopic function is locally given by linear functions with certain positive weights, but as the algorithm proceeds from one part of the search space ("level") to the next, the weights change. This analogy inspired the introduction of dynamic linear functions in [11].
For HotTopic functions, there is a plethora of results. In [8], the dichotomy between exponential and quasi-linear time from the (1 + 1)-EA was extended to a large number of other algorithms, including the (1 + λ)-EA, the (µ + 1)-EA, their so-called "fast" counterparts, and the (1 + (λ, λ))-GA. On the other hand, it was shown that the (µ + 1)-GA is always efficient for HotTopic functions if the population size is sufficiently large, while for the (µ + 1)-EA the population size does not change the threshold $c_0$ at all. Notably, for the population-based algorithms (µ + 1)-EA and (µ + 1)-GA, the efficiency result was only obtained for parameterizations of the HotTopic functions in which the weight changes occur close to the optimum. This seemed like a technical detail at first, but in an extremely surprising result, Lengler and Zou [13] showed that this detail was hiding an unexpected core: if the weights are changed far away from the optimum, then increasing the population size has a devastating effect on the performance of the (µ + 1)-EA. For any $c > 0$ (also values much smaller than 1), there is a $\mu_0$ such that the (µ + 1)-EA with $\mu \ge \mu_0$ and mutation rate $c/n$ needs exponential time on some monotone functions. Together with [8], this shows three things for monotone functions:
1. For optimization close to the optimum, the population size has no strong impact on the performance of the (µ + 1)-EA.
2. Close to the optimum, the (µ + 1)-GA outperforms the (µ + 1)-EA massively (quasi-linear instead of exponential) if the population size is large enough. It can cope with any constant mutation parameter $c$.
3. Far away from the optimum, a larger population size decreases the performance of the (µ + 1)-EA massively. There is no safe choice of $c$ if µ is too large.
It would be extremely interesting to understand the (µ + 1)-GA far away from the optimum. Unfortunately, such results are unknown. Theoretical analysis is hard (though perhaps not impossible), and function evaluations of HotTopic are extremely expensive, so experiments are only possible for very small problem sizes. Our paper can be seen as the first work which studies the behavior of the (µ + 1)-GA in a related, though not identical setting.
We conclude this section with a word of caution. HotTopic functions and dynamic linear functions are similar in spirit, but not in actual detail. For example, the analysis of the (µ + 1)-GA in [8] and of the (µ + 1)-EA in [13] rely heavily on the fact that the weights are locally stable in HotTopic functions. Thus it is unclear how far the analogy carries. Some of our experimental findings for the (µ + 1)-EA for dynamic linear functions differ from the theoretical (asymptotic) results for HotTopic in [8,13]. For us, a larger µ is beneficial, as it shifts the threshold $c_0$ to the right. For HotTopic functions, it does not shift the threshold at all if the algorithm operates close to the optimum, and it shifts the threshold to the left (i.e., makes things worse) far away from the optimum. This could either be because the theoretical effects only kick in for very large µ, or because HotTopic and dynamic linear functions are genuinely different. On the other hand, both settings agree in the surprising effect that the hardest part for the algorithm is not close to the optimum, but rather far away from it.

The Algorithms
All algorithms that we consider maintain a population $P$ of µ search points. In each round (or generation), they create an offspring from the population, and from the µ + 1 search points they remove the one with lowest fitness, breaking ties randomly. They only differ in the offspring creation. The (µ + 1)-EA uses standard bit mutation: a random parent is picked from the population, and each bit in the parent is flipped with probability $c/n$. The genetic algorithms flip a coin in each round to decide whether to use mutation (as above), or whether to use bitwise uniform crossover: for the latter, they pick two random parents from the population, and for each bit they randomly choose the bit of either parent. For the (µ + 1)-GA, the two parents are chosen independently. For the (µ + 1)-GA-NoCopy, they are chosen without repetition. The parameters are thus the mutation parameter $c > 0$, which we will assume to be independent of $n$, and the population size µ. In our theoretical (asymptotic) results, we will only consider µ = 2. The pseudocode description is given in Algorithm 1.
1 Initialize P with µ independent x ∈ {0, 1}^n uniformly at random;
2 Optimization: for t = 1, 2, 3, . . . do
3 Creation of Offspring: For GAs, flip a fair coin to do either mutation or crossover; for EA, always do mutation and no crossover.
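The prose description of the algorithms above can be turned into a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation from [10]: the function names (`dynbv_key`, `mutate`, `uniform_crossover`, `run`) and the `variant` strings are hypothetical, the fitness used for selection is DynBV as defined in the next section, and "without repetition" for the GA-NoCopy variant is read as choosing two distinct population slots.

```python
import random


def dynbv_key(x, perm):
    """Bits of x in the order given by perm; perm[0] is the most significant position.

    Comparing these tuples lexicographically equals comparing DynBV values.
    """
    return tuple(x[i] for i in perm)


def mutate(x, c, rng):
    """Standard bit mutation: flip each bit independently with probability c/n."""
    p = c / len(x)
    return [b ^ 1 if rng.random() < p else b for b in x]


def uniform_crossover(x, y, rng):
    """Bitwise uniform crossover: each bit comes from either parent with probability 1/2."""
    return [a if rng.random() < 0.5 else b for a, b in zip(x, y)]


def run(n, mu, c, variant="EA", max_gens=10**6, seed=None):
    """Generations until the all-ones string is sampled, or None if max_gens is hit.

    variant is "EA", "GA" or "GA-NoCopy" (hypothetical labels; the last needs mu >= 2).
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for gen in range(1, max_gens + 1):
        if variant != "EA" and rng.random() < 0.5:
            if variant == "GA-NoCopy":
                p1, p2 = rng.sample(pop, 2)  # two distinct population slots
            else:
                p1, p2 = rng.choice(pop), rng.choice(pop)  # independent choices
            child = uniform_crossover(p1, p2, rng)
        else:
            child = mutate(rng.choice(pop), c, rng)
        if all(child):
            return gen
        pop.append(child)
        # Redraw the fitness function for this generation (here: DynBV).
        perm = list(range(n))
        rng.shuffle(perm)
        keys = [dynbv_key(x, perm) for x in pop]
        worst = min(keys)
        # Remove one worst individual, breaking ties uniformly at random.
        pop.pop(rng.choice([i for i, k in enumerate(keys) if k == worst]))
    return None
```

For instance, `run(n=100, mu=2, c=1.0, variant="GA", seed=0)` returns the number of generations of one run of this sketch of the (2 + 1)-GA.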

Dynamic Linear Functions and the Dynamic Binval Function
We have described the algorithms for optimizing a static fitness function $f$. However, throughout the paper, we will consider dynamic functions that change in every round. We denote the function in the $t$-th iteration by $f^{(t)}$. That means that in the selection step (Line 11), we select the worst individual as $z \in \arg\min\{f^{(t)}(x) \mid x \in P^{(t)}\}$, where $P^{(t)}$ is the $t$-th population. Crucially, we never mix different fitness functions, i.e., we never compare $f^{(t_1)}(x)$ with $f^{(t_2)}(x')$ for $t_1 \neq t_2$. In other words, the fitness of all individuals changes in each round. Since this requires µ + 1 function evaluations per generation, we define the runtime as the number of generations until the algorithm finds the optimum. This deviates from the more standard convention to count the number of function evaluations (essentially by a factor µ + 1), but it makes the performance easier to compare with previous work on static linear functions. Also, note that the runtime equals the number of search points that are sampled, up to an additive −(µ − 1) from initialization.
We consider two types of dynamic functions. A dynamic linear function is described by a distribution $D$ on $\mathbb{R}^+$. For the $t$-th round, we draw $n$ independent samples $W_1^{(t)}, \ldots, W_n^{(t)}$ from $D$ and set
$$f^{(t)}(x) := \sum_{i=1}^{n} W_i^{(t)} x_i. \qquad (1)$$
So $f^{(t)}$ is a linear function with positive weights. Thus all $f^{(t)}$ share the same global optimum 1 . . . 1, have no other local optima, and they are monotone, i.e., flipping a zero-bit into a one-bit always increases the fitness.
For dynamic binval, DynBV, in the $t$-th round we draw a permutation $\pi_t : \{1..n\} \to \{1..n\}$ uniformly at random, and define
$$f^{(t)}(x) := \sum_{i=1}^{n} 2^{n-i} x_{\pi_t(i)}. \qquad (2)$$
So, we randomly permute the bits of the string, and take the binary value of the permuted string. As for dynamic linear functions, all $f^{(t)}$ share the same global optimum 1 . . . 1, have no other local optima, and prefer one-bits to zero-bits.
We claim that in a certain sense, DynBV is a limit case of dynamic linear functions in which the tail of the distribution $D$ becomes infinitely heavy. Let us make this precise. Consider a dynamic linear function $f$ and two strings $x, x'$ that differ in $k$ bits. To ease notation, assume they differ in the first $k \ge 2$ bits. The order statistics $W_{(1)} \le \ldots \le W_{(k)}$ are obtained from the weights $W_1, \ldots, W_k$ by sorting, i.e., the first order statistic $W_{(1)}$ is the smallest of $W_1, \ldots, W_k$, and so on. If the distribution $D$ is sufficiently skewed, then the probability
$$p_k := \Pr\Big[W_{(k)} > \sum_{i=1}^{k-1} W_{(i)}\Big] \qquad (3)$$
comes arbitrarily close to one. However, conditioned on the event in (3), comparison with respect to the dynamic linear function is equivalent to comparison with respect to DynBV. If we compute the difference $f(x) - f(x')$, then the sign of this difference will only depend on which of the $k$ bits has the highest weight, since this summand dominates the whole remaining sum. The position of the highest-weight bit is uniformly random, so effectively, conditioned on the event in (3), the dynamic linear function picks randomly one of the bits in which $x$ and $x'$ differ, and bases its comparison only on that bit. This is exactly what DynBV also does.
In fact, this reasoning was implicitly used in [11, Theorem 7]. There, to construct a hard example for the (1 + 1)-EA with $c > c_0 \approx 1.59$, the authors used $D$ as a Pareto distribution $\mathrm{Par}(\beta)$, showed that for this distribution $p_k \ge (k-1)^{-\beta}$, and observed that this term comes arbitrarily close to 1 as $\beta \to 0$. Then they picked $\beta$ so close to zero that the difference was negligible. Thus, in effect, they used for their hard function that DynBV can be arbitrarily well approximated by dynamic linear functions, and their proof implies that the (1 + 1)-EA also needs exponential time for DynBV if $c > c_0$, and time $O(n \log n)$ for $c < c_0$. Note that in general one needs to be a bit careful when two limits $n \to \infty$ and $\beta \to 0$ are involved. However, this is no problem for the (1 + 1)-EA since there are strong tail bounds for the number of bits in which $x$ and $x'$ differ [11]. The same argument also extends to populations in situations in which the population tends to degenerate to copies of a single point, as is the case for (µ + 1) algorithms if it is hard to find improvements [8].
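The claim that DynBV bases each comparison on a single uniformly chosen differing bit can be checked with a small Python sketch. The helper names below are hypothetical, and the indexing is an assumption: positions are 0-based and `perm[0]` is taken to be the position that receives the largest weight $2^{n-1}$.

```python
import random


def dynbv_value(x, perm):
    """Binary value of x after permuting its bits: perm[0] gets weight 2^(n-1)."""
    n = len(x)
    return sum(x[perm[i]] << (n - 1 - i) for i in range(n))


def decisive_bit(x, y, perm):
    """First position (in permuted order) where x and y differ, or None if x == y."""
    for pos in perm:
        if x[pos] != y[pos]:
            return pos
    return None


def dynbv_prefers(x, y, perm):
    """True if x is strictly fitter than y under perm, using only the decisive bit."""
    pos = decisive_bit(x, y, perm)
    return pos is not None and x[pos] == 1
```

Since the permutation is uniformly random, the decisive position is uniform among the positions in which the two strings differ, which is the behavior described above. Comparing via `decisive_bit` also avoids computing numbers of size $2^n$.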

Runtime Simulations
Recall that we count the runtime as the number of generations until the optimum is sampled. We run the different algorithms to observe the distribution of the runtime. A run terminates if either the optimum is found, or an upper limit of generations is reached. Unless otherwise noted, the upper limit is set to be $100 \cdot \frac{e^c}{c} \cdot n \ln n$, which is 100 times larger than the expected runtime of the (1 + 1)-EA [18]. The Python code of our runtime simulation can be found in our GitHub repository [10]. Unless otherwise noted, each data point is obtained by 30 independent runs.
To verify the correctness of our simulation, we first measure mean and variance of the runtime of the (1 + 1)-EA with mutation parameter $c = 1$ on the function OneMax (the linear function where all weights are 1), and compare them with the highly accurate values derived in [6]. We compute mean and variance of 3000 runs and find that our observed mean deviates by 0.16% from the predicted mean, while the observed variance is within 2.3% of the predicted variance.
To visualize runtimes, we use plots provided by the IOHprofiler [5]. As an example, the runtime of the (1 + 1)-EA with mutation parameter $c = 1.0$ on OneMax is visualized in Figure 1. Note that time (i.e., number of generations) is displayed on the y-axis, while the x-axis gives the number of 1-bits. Thus, a steep part of the curve corresponds to slow progress, while a flat part of a curve corresponds to fast progress. Also, mind that the y-axis uses log scale.
Due to the exponential runtimes, we frequently encounter the problem that runs are terminated due to the iteration limit of $100 \cdot \frac{e^c}{c} \cdot n \ln n$ generations. In this case we will often plot two values, which are lower and upper estimates for the expected runtime. The lower bound is the mean runtime, i.e., the average number of generations among the successful runs. By definition, this number never exceeds the upper limit. For the upper bound, we use the expected runtime (ERT) as defined by the IOHprofiler. The ERT is calculated by drawing random runs from our pool of runs, until a successful run is drawn. Then the runtime of all drawn runs is added up, and the expectation of this process is defined as the ERT. This estimates the expected runtime if we restart the algorithm every time the iteration limit is reached. The ERT overestimates the runtime if the state at hitting the iteration limit is better than the starting state of a restart, which is the case in all benchmarks we consider. (It may not be the case for deceptive functions.) We remark that if all runs hit the iteration limit, then our lower bound (mean runtime) is the iteration limit, while the upper bound (ERT) is infinity. A more detailed explanation of the ERT can be found in [5].
Comparison of Runtimes. We want to compare runtimes for different algorithms and values of $c$. We denote by $R^{\mathrm{Alg}}_c$ the random variable describing the runtime of Algorithm Alg for a specific $c$. Because our sample size is fairly small (10 to 30), we compare runtimes using the Wilcoxon-Mann-Whitney test. The Wilcoxon-Mann-Whitney test is a test of the null hypothesis that with probability at least 1/2 a randomly selected value from one runtime distribution will be at most (at least) a randomly selected value from a second runtime distribution. A small p-value would then indicate that the runtime of one algorithm, treated as random variable, is larger (smaller) than the runtime of the other algorithm in significantly more than half of the cases.
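The two runtime estimates used for runs with an iteration limit (mean and ERT) can be sketched in a few lines of Python. The function name and the input format, a list of `(generations, success)` pairs, are assumptions made for illustration. The mean is taken here over all runs, with unsuccessful runs counted at the limit; this is one reading of the description, chosen because it makes the mean equal to the limit when no run succeeds. The ERT uses the closed form of the restart process: total generations spent over all runs, divided by the number of successful runs.

```python
import math


def runtime_estimates(runs, limit):
    """Return (mean, ert) for a list of (generations, success) pairs.

    Unsuccessful runs are assumed to have been stopped at `limit` generations.
    """
    spent = [gens if ok else limit for gens, ok in runs]
    successes = sum(1 for _, ok in runs if ok)
    mean = sum(spent) / len(spent)
    ert = sum(spent) / successes if successes else math.inf
    return mean, ert
```

As a sanity check of the closed form: with two successful runs of 100 and 300 generations and two runs cut off at 1000, a restart strategy pays on average one failed run (1000 generations) plus the average successful run (200 generations) per success, which matches (100 + 300 + 1000 + 1000) / 2 = 1200.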
We will also be interested in quantifying by how much an algorithm is slower than another algorithm. To this end, we will determine the largest factor $d \ge 1$ by which we can multiply one runtime distribution such that the Wilcoxon-Mann-Whitney test still yields a statistically significant p-value. For example, we will find that even if we multiply the runtime R according to the Wilcoxon-Mann-Whitney test. Note that this is an a posteriori hypothesis since the factor $d$ is chosen in hindsight. Therefore, it must not be treated as an actually significant result. Still, it gives useful information, and shows that R . All tests are done with R.
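The search for the largest factor $d$ can be sketched as follows. This is a pure-Python approximation and not the procedure used for the paper (whose tests were done with R): the p-value comes from the normal approximation of the Mann-Whitney U statistic with tie correction, and the function names, the grid `step`, the bound `d_max` and the significance level `alpha = 0.05` are all assumptions.

```python
import math
from collections import Counter


def p_value_smaller(a, b):
    """One-sided Mann-Whitney p-value for 'a tends to be smaller than b'.

    Uses the normal approximation with tie correction (no continuity correction).
    """
    m, n = len(a), len(b)
    u = sum((x < y) + 0.5 * (x == y) for x in a for y in b)
    total = m + n
    ties = sum(t**3 - t for t in Counter(list(a) + list(b)).values())
    var = m * n / 12 * ((total + 1) - ties / (total * (total - 1)))
    if var <= 0:
        return 1.0  # all observations identical: no evidence either way
    z = (u - m * n / 2) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)


def largest_factor(fast, slow, alpha=0.05, step=0.01, d_max=1000.0):
    """Largest grid value d >= 1 such that d * fast is still significantly smaller than slow.

    Scans d = 1, 1 + step, ... and stops at the first non-significant factor.
    Returns None if the difference is not significant even for d = 1.
    """
    best = None
    k = 0
    while 1 + k * step <= d_max:
        d = 1 + k * step
        if p_value_smaller([d * x for x in fast], slow) >= alpha:
            break
        best = d
        k += 1
    return best
```

Runs that were cut off at the iteration limit enter as ties at the limit, which is why the tie correction is included. As noted above, a factor found this way is chosen in hindsight and should be read as descriptive only.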

Analysis of Population Drift
Recall that $X_i$ was defined to be the number of zero-bits in an individual of the $i$-th degenerate population, where degenerate means that all individuals are copies of the same search point. To be precise, if $P^{(i)}$ is the $i$-th degenerate population, then $P^{(i+1)}$ is the first degenerate population after $P^{(i)}$ has changed at least once. That does not exclude the possibility $P^{(i)} = P^{(i+1)}$, but we require at least one intermediate step in which an offspring is accepted that is not a copy of the parent(s) in $P^{(i)}$. Then the degenerate population drift (population drift for short) is $E[X_i - X_{i+1} \mid X_i = y]$. We will estimate this drift with Monte-Carlo simulations. Moreover, for $y = o(n)$ we will derive a Markov chain state diagram in which all transition probabilities coincide with the transition probabilities in the real process up to $(1 + o(1))$ factors. By analyzing this state diagram, we are able to compute the population drift up to lower-order error terms.
To motivate the use of population drift, consider the following example for µ = 2. Take a population $\{x_1, x_2\}$, and assume that $x_1$ has at least as many one-bits as $x_2$. Then in every iteration, there is a chance to simply copy $x_1$ (mutate but flip none of the bits), and accept it. For this to happen, we need to first mutate $x_1$, flip no bits at all and accept the new offspring. The probability of mutating $x_1$ and flipping no bits is $\frac{1}{2} \cdot (1 - \frac{c}{n})^n \approx e^{-c}/2$ for the (2 + 1)-EA and $\frac{1}{4} \cdot (1 - \frac{c}{n})^n \approx e^{-c}/4$ for the (2 + 1)-GA and the (2 + 1)-GA-NoCopy. The probability of accepting $x_1$ is at least 1/2, as $x_1$ has at least as many 1s as $x_2$. Hence, we have a constant probability to degenerate to a population $\{x_1, x_1\}$ in every iteration. This implies that any population degenerates within an expected constant number of rounds. The argument can be generalized to larger µ, see [8]: for every constant µ and $c$, the expected time until the population degenerates from any starting population is $O(1)$. Moreover, it was shown in [8] that if the population drift is negative for $y = \alpha n$ for some $\alpha \in (1/2, 1)$ then asymptotically the runtime is exponential in $n$. On the other hand, if the population drift is positive for all $\alpha$ then the runtime is $O(n \log n)$. Hence, we are trying to identify parameter regimes for which areas of negative population drift occur.
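A Monte-Carlo estimate of the degenerate population drift, as defined above, can be sketched for the (2 + 1)-EA on DynBV. The code below is an illustrative reading of the definition and not the authors' simulation: the function names are hypothetical, one sample starts from two copies of a string with $y$ zero-bits and runs until the population is degenerate again after at least one accepted offspring that is not a copy of the starting string.

```python
import random


def drift_sample(n, y, c, rng):
    """One sample of X_i - X_{i+1} for the (2+1)-EA on DynBV, starting with y zero-bits.

    Requires c > 0, otherwise the population never changes and the loop does not end.
    """
    x0 = [0] * y + [1] * (n - y)
    pop = [x0[:], x0[:]]
    changed = False
    p = c / n
    while True:
        parent = rng.choice(pop)
        child = [b ^ 1 if rng.random() < p else b for b in parent]
        trio = pop + [child]
        # DynBV selection: compare bits in a fresh uniformly random order.
        perm = list(range(n))
        rng.shuffle(perm)
        keys = [tuple(z[i] for i in perm) for z in trio]
        worst = min(keys)
        out = rng.choice([i for i, k in enumerate(keys) if k == worst])
        if out != 2 and child != x0:
            changed = True  # an offspring that is not a copy of x0 was accepted
        trio.pop(out)
        pop = trio
        if changed and pop[0] == pop[1]:
            return y - pop[0].count(0)


def estimate_drift(n, y, c, samples=1000, seed=None):
    """Average of independent drift samples; positive values mean progress."""
    rng = random.Random(seed)
    return sum(drift_sample(n, y, c, rng) for _ in range(samples)) / samples
```

Placing the zero-bits at the first $y$ positions is without loss of generality here, because DynBV draws a fresh uniform permutation in every generation. For large $c$ the sample mean may converge slowly if rare large jumps contribute to the expectation, so the number of samples is a parameter worth varying.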

Runtimes
The results of our runtime simulations for different algorithms and values of $c$ can be found in Figure 2. As expected, we find a threshold behavior, i.e., there is a value $c_0$ such that the runtime increases dramatically as $c$ crosses this threshold. For the (1 + 1)-EA, by visual inspection we observe a threshold $c_0 \in [1.5, 1.8]$ (in agreement with the theoretically derived threshold $c_0 \approx 1.59$ from [11]). For the (2 + 1)-EA, it seems to be within $c_0 \in [2.2, 2.3]$, for the (2 + 1)-GA in $[3, 3.2]$ and for the (2 + 1)-GA-NoCopy in $[4.1, 4.3]$. Thus we obtain a clear ranking of the algorithms for µ ≤ 2, which is (1 + 1)-EA (worst), (2 + 1)-EA, (2 + 1)-GA, and (2 + 1)-GA-NoCopy. For the (3 + 1)-EA, the threshold appears to lie in the interval $[2.5, 2.7]$, for the (3 + 1)-GA in $[6.0, 6.3]$ and for the (5 + 1)-EA in $[3.3, 3.45]$. We were not able to find a threshold behavior for the (5 + 1)-GA for any $c < 20$. These results further confirm that the GA variants perform massively better than their EA counterparts. Moreover, both for EAs and GAs, a larger population size shifts the threshold to the right. For GAs, this is analogous to theoretical results for monotone functions, but for EAs the effect goes in the opposite direction compared to monotone functions, see the discussion in Section 1.
Fig. 2. Comparison of runtimes for different algorithms and values of $c$. We choose values of $c$ that lie around the threshold for the respective algorithm. We plot both ERT and mean (both in the same color) if they differ significantly (mean is always the lower curve, see Section 3.1 for a more detailed explanation).
To validate the ranking for the algorithms with µ ≤ 2 statistically, we use the following comparisons. If we compare R for d ≤ 63.36. This confirms the aforementioned ranking of the algorithms.
To establish intervals for the critical value $c_0$ of the algorithms with µ ≤ 2, we compare R , and find that the latter is significantly smaller for all d ≤ 38.84. We interpret this huge drop in performance as strong indication that the threshold lies in the interval c 0 ∈ [1.  3. We can clearly see that the conditional population drift is negative in the area between 300 and 50 one-bits away from the optimum, but then becomes positive again when less than 50 one-bits away from the optimum. We conclude that the hardest part for the (2 + 1)-EA is not around the optimum. We obtained similar results for the (3 + 1)-EA, also visualized in Figure 3. This surprising result is similar to monotone functions [8,13], see the discussion in Section 1.
For the (2 + 1)-GA, the picture looks entirely different, as the drift is now strictly decreasing. We can only observe a negative drift area right at the optimum, starting from about 2900 1-bits. This behavior is similar to the (1 + 1)-EA and the (µ + 1)-GA-NoCopy, where the most difficult part is also close to the optimum (data not shown). Unfortunately, due to the large value of $c_0$, we were not able to obtain a conclusive result for the (3 + 1)-GA within reasonable computation time.
For the (2 + 1)-EA and (2 + 1)-GA, we also derive exact asymptotic formulas for the population drift close to the optimum. We postpone the derivation to the appendix. We compare the formula with the estimates via Monte Carlo simulation for different values of $c$, using $n = 3000$, $y = 1$, see Figure 4. We can see that the curves match closely for small $c$, and that we get a moderate fit for larger $c$. We suspect that the deviations for the (2 + 1)-EA come from the expectation of the population drift being influenced by large but rare values of $X_i - X_{i+1}$ for large $c$. So the Monte Carlo simulations might be missing parts of this heavy tail. This tail is less heavy for the (2 + 1)-GA, since the probability to produce duplicates is always high, and thus degeneration happens quickly even for large $c$. In both cases, the curves agree perfectly in the sign of the population drift, which is our main interest. Negative drift at the optimum occurs at $c > 3.1$ for the (2 + 1)-GA, which matches the threshold obtained from runtime simulations well. However, as expected, there is no such match for the (2 + 1)-EA, where the threshold for the population drift at the optimum is 2.5 while the threshold for the runtime is below 2.3. That is, the (2 + 1)-EA already struggles for values of $c$ for which optimization around the optimum is easy. This confirms that the hardest region for the (2 + 1)-EA (but not the (2 + 1)-GA) is not at the optimum, but a bit away from it.

Conclusions
We have studied the effect of population size and crossover for the dynamic DynBV benchmark. We have found that the algorithms generally profited from larger population size. Moreover, they profited strongly from crossover, even more so if we forbid crossovers between identical parents.
We have studied the case µ ≤ 2 in more depth. Remarkably, there is a strong qualitative difference between the (2 + 1)-EA and the (2 + 1)-GA. While for the latter the hardest region for optimization is close to the optimum (as one would expect), the same is not true for the (2 + 1)-EA. We believe that this is an interesting discovery. The only hint at such an effect on OneMax-like functions that we are aware of is for monotone functions [13]. However, the results in [13] predict that large population sizes hurt the (µ + 1)-EA, in opposition to our findings. Currently we are lacking any understanding of whether this comes from the small values of µ that we considered here, or whether it is due to the differences between monotone and dynamic linear functions.
For future work, there are many natural questions.We have chosen the (µ + 1)-GA to decide randomly between a mutation and a crossover step, but other choices are possible.Even with our formulation, it might be that the probability 1/2 for choosing crossover has a strong impact.Also, we have exclusively focused on the limiting case DynBV, but dynamic linear functions are also interesting for less extreme case weight distributions.Finally, an interesting variant of dynamic linear functions or DynBV might not change the objective every round, but only every s rounds (our runtime simulation already supports this feature and is publicly available).

A Appendix
A.1 (2 + 1)-EA
State diagram of (2 + 1)-EA
Here, we derive an expression for the conditional drift based on a state diagram for the (2 + 1)-EA. Recall that we defined $X_i$ to be the number of zero-bits in the i-th degenerated population. We define $E_{\text{progress}}$ as the event that an offspring is accepted into the degenerate population that is not identical with the old search point in the population. We always assume that we are working close to the optimum, which means that the number of 0-bits, y, is o(n). This allows us to ignore cases where more than one 0-bit is flipped when going from one degenerated population to the next. Additionally, we are interested only in the case n → ∞. We will encounter several o(1) terms in our calculations that go to 0 as n becomes large.
We claim that the development of the population follows the state diagram in Figure 5. The transition probabilities are written below the arrows. Before we justify Figure 5 in detail, note crucially that an arrow may summarize several generations. For example, assume that the algorithm generates a population {x, y} in which x strictly dominates y, i.e., every one-bit in y is also a one-bit in x (and the converse is not true). In particular, x is strictly fitter than y. We could model this situation by a state S. However, recall that it takes only O(1) steps until a copy of x is generated. Since by our assumption the number of 0-bits is o(n), the probability that any zero-bit is flipped into a one-bit in this time is o(1). Hence, whp (with high probability, i.e., with probability 1 − o(1)) all offspring are strictly dominated by x until a copy of x is created, and then the population degenerates. Hence, we would obtain an arrow from S to the state {x, x} with probability 1 − o(1), and the o(1) will only affect the minor order terms of the final result. In cases like this, we do not draw the state S in the first place. Instead, we omit it and replace any arrow to S with an arrow (of the same probability) to the state {x, x}. This allows us to keep the state diagram simple.
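The O(1) waiting time for a copy of the dominating individual can be made concrete with a small calculation. The sketch assumes uniform parent selection among the two individuals and bit-wise mutation with rate c/n; the function names are hypothetical. A copy of x arises when x is chosen as parent and no bit flips, and it is then always kept because the strictly dominated individual is the worst under every weight assignment.

```python
import math


def copy_probability(c, n):
    """Probability per generation that the offspring is an exact copy of x:
    x is selected (probability 1/2) and none of the n bits flips."""
    return 0.5 * (1.0 - c / n) ** n


def expected_generations_until_copy(c, n):
    """Mean of the geometric waiting time for the first copy of x."""
    return 1.0 / copy_probability(c, n)


def zero_bit_flip_bound(c, n, y):
    """Union bound on the probability that some zero-bit of the parent flips
    in one generation, multiplied by the expected waiting time (the parent
    has at most y + n*0 extra zero-bits only if it is x itself, so this is
    a rough indication rather than a rigorous bound)."""
    return expected_generations_until_copy(c, n) * y * c / n
```

For large n the waiting time approaches 2·e^c, which does not depend on n, while the last quantity scales like y/n and therefore vanishes for y = o(n).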
The top vertex represents a degenerate starting state {x, x}. Let y be the number of 0-bits in an individual x of our initial degenerated population {x, x}. A superscript denotes the number of additional 1-bits of an individual compared to x. This number can also be negative. By S(k) we denote a state where the next degenerated population has (y − k) zero-bits. So the initial top state is identical with S(0).
Starting with a population {x, x}, three things can happen. First, no bits at all or only 1-bits are flipped. Then we surely reject the offspring, as it is dominated by x (there is no position where the offspring has a 1-bit and x has a 0-bit). Close to the optimum, this is what happens most of the time, as only a few zero-bits are left and they are rarely flipped. Second, we could flip just a 0-bit, but no 1-bits. Then we are in a state in which the offspring $x^{1}$ dominates $x^{0} = x$. Assuming that no further 1-bits are flipped, the population will whp degenerate to $\{x^{1}, x^{1}\}$, as no offspring will be accepted over $x^{1}$. At some point, $x^{1}$ will just be copied and the copy accepted. The third and perhaps most interesting case occurs when a 0-bit and r ≥ 1 1-bits are flipped. We then have an offspring $x^{1-r}$ such that x and $x^{1-r}$ differ at exactly r + 1 positions. In these r + 1 positions, $x^{1-r}$ has exactly one 1-bit and 0-bits everywhere else. Now $x^{1-r}$ will either be rejected, which results in the same population {x, x} we started with, or be accepted. If it is accepted, we land in a state which we call F(r) (green in Figure 5). Starting from F(r), we can mutate either x or $x^{1-r}$. Assume we mutate x and flip s 1-bits to create $\dot{x}^{-s}$. Notice that $\dot{x}^{-s}$ is dominated by x. We can then either accept $\dot{x}^{-s}$ (in place of $x^{1-r}$) to land in S(0), or reject it to return to F(r) once again. If we mutate $x^{1-r}$, we create an offspring with superscript 1 − r − s, which is dominated by $x^{1-r}$. If we accept it (in place of x), we end in S(1 − r); otherwise we go back to F(r).
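The three cases for the first mutation from {x, x} can be written down as a small classifier. This is only an illustrative sketch with hypothetical names; it reports which branch of the case distinction an offspring falls into, not whether selection actually accepts it.

```python
def dominates(a, b):
    """True if a has a 1-bit wherever b has one."""
    return all(ai >= bi for ai, bi in zip(a, b))


def classify_first_step(x, child):
    """Branch of the case distinction for an offspring of the population {x, x}."""
    zeros_flipped = sum(1 for xb, cb in zip(x, child) if xb == 0 and cb == 1)
    ones_flipped = sum(1 for xb, cb in zip(x, child) if xb == 1 and cb == 0)
    if zeros_flipped == 0:
        # Copy of x or dominated by x: the population stays {x, x}.
        return "rejected"
    if zeros_flipped == 1 and ones_flipped == 0:
        # The offspring dominates x; the population degenerates to it whp.
        return "S(1)"
    if zeros_flipped == 1:
        # One 0-bit and r >= 1 1-bits flipped.
        return f"F({ones_flipped}) if accepted"
    # More than one 0-bit flipped: ignored for y = o(n).
    return "neglected"
```

For example, with `x = (1, 1, 1, 0)` the offspring `(0, 0, 1, 1)` flips one 0-bit and two 1-bits, so it is a candidate for the state F(2).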
Putting it all together, we can get an explicit formula for the drift. In the diagram, note that we need to compute the drift conditioned on $E_{\text{progress}}$, i.e., assuming we visit either S(1) or F(r).
First, we compute the expected number of 0-bits when starting in state F(r). Let us slightly abuse notation and also call this F(r). Recall that we land in state F(r) if we flip one 0-bit and r 1-bits to create an offspring and accept it. Simply writing out the transition probabilities yields [...]. Approximating the sums by letting them run up to infinity only gives another factor (1 ± o(1)). The dominant terms have small s, and so we may approximate [...]. For the left hand side, we artificially write [...].
In order to compute the population drift, we need some elementary probabilities. Let $E^r_0$ be the event that exactly r zero-bits are flipped and $E^r_1$ the event that exactly r one-bits are flipped. Also, define $E_{\text{acc}}$ to be the event that the offspring is accepted. Then [...], where [...]. Here, $r_{\max}$ could be as large as n − y, but for the evaluation of the formula we cut off at a value such that the difference is negligible, e.g. 50.
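To see why a cut-off around 50 is harmless, one can bound the probability mass that is discarded. The sketch below assumes bit-wise mutation with rate c/n, so that for y = o(n) and n → ∞ the number of flipped 1-bits is approximately Poisson distributed with mean c; the helper names are hypothetical, and the check only concerns the flip probabilities, not the full drift formula.

```python
import math


def poisson_pmf(r, c):
    """P[exactly r one-bits flip] in the Poisson(c) limit, computed in log space."""
    return math.exp(-c + r * math.log(c) - math.lgamma(r + 1))


def tail_mass(r_max, c, extra_terms=200):
    """Probability mass of r > r_max, summed directly to avoid cancellation."""
    return sum(poisson_pmf(r, c)
               for r in range(r_max + 1, r_max + 1 + extra_terms))


def kept_mass(r_max, c):
    """Probability mass of 0 <= r <= r_max that the truncated sum retains."""
    return sum(poisson_pmf(r, c) for r in range(r_max + 1))
```

For the values of c considered here (up to roughly 3 or 4), `tail_mass(50, c)` is many orders of magnitude below floating-point precision, so truncating at 50 does not visibly change the evaluated formula.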

B (2 + 1)-GA Population Drift
We compute the degenerate population drift at the optimum using a state diagram as we did for the (2 + 1)-EA, see Figure 6. We only show the part that has changed significantly, which is the state that takes the place of F(r) in Figure 5. The upper part of the state diagram is the same as in Figure 5, except that now we can also do a crossover in the initial state, which does not change the population. Notice that the population drift also excludes this case, since we only consider cases where at least one new offspring is accepted in the process.
Starting from F(r), we can do a mutation, which is exactly the same as in the (2 + 1)-EA. Otherwise, we do a crossover between two strings. If we do a crossover of x with itself (x is just copied in this case), we can either accept the copy over $x^{1-r}$ to get to a state S(0), or reject it to return to F(r). Similarly, we can do a crossover of $x^{1-r}$ with itself and either accept the copy to end in S(1 − r) or reject it to land back in F(r).
Finally, we can do a crossover between x and $x^{1-r}$. To simplify notation, let us remove the bits on which x and $x^{1-r}$ agree. Moreover, we may assume that the first position is the one at which $x^{1-r}$ has a one-bit, but not x. So we are left with x = (0, 1, 1, . . ., 1) and $x^{1-r}$ = (1, 0, 0, . . ., 0). We can now do a case distinction depending on the shape of our crossover result ẋ. If ẋ gets every one-bit, we land in a state S(1), as ẋ dominates both x and $x^{1-r}$. If ẋ has a one-bit at the first position and inherits s < r 1-bits from x, we can either remove $x^{1-r}$ to go into a state F(r − s), or remove x and land in S(s + 1 − r), as ẋ dominates $x^{1-r}$. Finally, ẋ could have a zero-bit at the first position and inherit s one-bits from x. Then we could reject ẋ to return to F(r), or accept it over $x^{1-r}$ to conclude in S(0).
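The case distinction on the shape of the crossover result can be checked by brute force for small r. The sketch assumes uniform crossover (each position is inherited from either parent with probability 1/2), which is an assumption made here for illustration; `crossover_shapes` and `next_states` are hypothetical names. A shape is the pair (first bit, number s of 1-bits inherited from x).

```python
from fractions import Fraction
from itertools import product


def crossover_shapes(r):
    """Exact distribution of child shapes for the reduced parents
    x = (0, 1, ..., 1) and its offspring (1, 0, ..., 0) of length r + 1."""
    x = (0,) + (1,) * r
    x_off = (1,) + (0,) * r
    counts = {}
    for mask in product((0, 1), repeat=r + 1):
        child = tuple(x_off[i] if take else x[i] for i, take in enumerate(mask))
        shape = (child[0], sum(child[1:]))
        counts[shape] = counts.get(shape, 0) + 1
    total = 2 ** (r + 1)
    return {shape: Fraction(k, total) for shape, k in counts.items()}


def next_states(first_bit, s, r):
    """States reachable after selection, following the case distinction above."""
    if first_bit == 1 and s == r:
        return ["S(1)"]                      # child dominates both parents
    if first_bit == 1:
        return [f"F({r - s})", f"S({s + 1 - r})"]
    return [f"F({r})", "S(0)"]               # child is dominated by x
```

Under this assumption, the child inherits every one-bit with probability 2^(-(r+1)), for instance 1/8 for r = 2, which shows how quickly the direct transition to S(1) becomes unlikely as r grows.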
We can derive the conditional drift for the (2 + 1)-GA in exactly the same fashion as before. We start again by computing $F(r) := E[X_{i+1} \mid X_i = y \in o(n) \wedge \text{we are in state } F(r)]$. Notice that now we obtain a recursive formula, as we can go from a state F(r) to a state F(r − s). Unfortunately, the formula does not simplify as much. Solving for F(r) yields a recursive formula: [...]
Now computing the population drift is exactly the same exercise as for the (2 + 1)-EA, except that we have to account for crossovers in the initial states. We only have to adjust the probability $P[E_{\text{progress}}]$ and beware that mutations now have an additional factor of 1/2. Then we can write down the expression for the conditional drift of the (2 + 1)-GA.

Fig. 1. Runtime of the (1 + 1)-EA on OneMax test yields significant p-values ≤ 0.05 for every d ≤ 57.88. We conclude that for mutation parameter c = 2.0, the (1 + 1)-EA is much slower than the (2 + 1)-EA. In the same manner, d • R d ≤ 29.59, all with p < 0.05.
Degenerate population drift. We estimate the degenerate population drift by Monte Carlo simulation on the (2 + 1)-EA with c = 2.3, which is slightly above the threshold. The results are visualized in Figure

Fig. 3. Degenerate population drift for different algorithms and values of c just above the respective thresholds. The shaded area shows standard deviation.
