Predicting effective control parameters for differential evolution using cluster analysis of objective function features

A methodology is introduced which uses three simple objective function features to predict effective control parameters for differential evolution. This is achieved using cluster analysis techniques to classify objective functions by these features. Information on the prior performance of various control parameters for each classification is then used to determine which control parameters to use in future optimisations. Our approach is compared to state-of-the-art adaptive and non-adaptive techniques. Two accepted benchmark suites are used to compare performance and in all cases we show that the improvement resulting from our approach is statistically significant. The majority of the computational effort of this methodology is performed off-line; however, even when the additional on-line cost is taken into account, our approach outperforms other adaptive techniques. We also investigate the key tuning parameters of our methodology, such as the number of clusters, which further supports the finding that the simple features selected are predictors of effective control parameters. The findings presented in this paper are significant because they show that simple-to-calculate features of objective functions can help to select control parameters for optimisation algorithms. This can have an immediate positive impact on the application of these optimisation algorithms to real world problems, where it is often difficult to select effective control parameters.


Introduction
Despite having a number of advantages over classical gradient-based techniques, the performance of evolutionary algorithms depends both on the problem to be optimised and the algorithm being used [24]. To make matters worse, this performance also depends heavily on the selection of algorithm-specific control parameters. This variability of performance makes the field hard to penetrate for users in industry who simply want to use an algorithm to solve a problem. Often the problem they wish to solve is not well understood before they start to solve it, which makes selecting an algorithm and control parameters all the more difficult. The motivation of the work presented in this paper is to automate this selection using simple machine learning techniques. Specifically, at this stage of development, the aim is to automatically select an effective set of control parameters for differential evolution, for the problem to be optimised.

Terminology
The problem to be optimised is termed the objective function. This paper focusses on optimising continuous black box objective functions. We are proposing that an objective function can be described by a number of features β. An optimisation algorithm instance is determined by its control parameters p. Our aim is to classify objective functions using their features, in order to predict a set of effective control parameters which will result in a high performing algorithm for a particular objective function.

Background
When applying an evolutionary algorithm to a new application it is common to use the control parameters suggested in literature. These parameters are usually obtained from extensive studies on algorithm behaviour using suites of benchmark optimisation problems. Parameters which work well on common problem test suites will emerge [7] and this single set will end up being used in the majority of applications. The problem is that with truly novel applications there may be no understanding of which test suites, if any, correctly represent the real world problem. Strictly speaking, each time an algorithm is applied to a new application a parameter study should be undertaken, to both provide insight into the robustness of the parameters and perhaps squeeze out some additional performance. The reality is that these studies are often infeasible in real applications, where a single objective function evaluation may represent hours, or days, of computational time [14,23,21,22]. Thus a great deal of research has been undertaken with the motivation to address this problem.

Automatic Tuning Algorithms
Sequential Model-Based Optimization for General Algorithm Configuration (SMAC) [11] and Sequential Parameter Optimization (SPO) [2] are both examples of work carried out to improve the tuning process of algorithms. In the case of SPO the approach facilitates manual tuning whereas SMAC is automated. In both cases optimum control parameters are determined for a single objective function, whereas our goal here is to predict effective control parameters for a general unknown objective function.
The closest approach we could find to the one we propose is Feature Based Algorithm Configuration (FBAC) [3]. FBAC can be thought of as an extension of automatic tuning algorithms. It uses sophisticated objective function features to classify objective functions. The authors are able to accurately predict performance models for objective functions which could, in theory, be used to determine an effective set of control parameters. However, the features they use require a large number of samples of the objective function to calculate. This would lead to an excessive computational cost in real applications. In our approach we try to address this by selecting somewhat crude features which can be calculated with a small number of samples.

Adaptive Algorithms
The most common strategy to address the problem of performance variability is to design algorithms with self-adaptive control parameters. In such algorithms the control parameters are themselves optimised, based on current performance, as the algorithm runs [16,25,9]. A related field is Hyper-Heuristics, whose goal is to automate the design of heuristic optimisation algorithms based on current performance [4,12]. The work done in adaptive algorithms closely relates to our motivation of improving performance on an unknown objective function; for this reason we focus on comparing our approach to state-of-the-art adaptive algorithms.

Case Study Optimisation Algorithm: Differential Evolution
To show the effectiveness of our approach we are forced to select a single optimisation algorithm. Differential Evolution (DE) [18] will be used to test the effectiveness of the predictive methodology. It is stressed that the predictive methodology is independent of the evolutionary algorithm; in cases where control parameters are not continuous they can be mapped to integers. DE is popular and its control parameters are well studied. DE is aimed at nonlinear, non-differentiable, continuous functions and has been designed to be a direct stochastic search method. The method has a small number of control parameters and applies crossover and mutation operators based on the differences between randomly selected individuals of the population.
There are a number of alternative DE methods and many additions have been made to the algorithm. It is beyond the scope of this paper to explain these additions in detail, so instead we describe the algorithm used in this study and allow the reader to find detailed explanations in the original papers.
To select new members of the population, a direct one-to-one competition scheme is employed in each generation. From the population of the current generation, a target member, x_i,g, is selected, where i refers to the member's number and g the generation. A donor vector, v_i,g, is generated using the current-to-pbest/1/bin approach [26]. Three members of the population, distinct from the target member, are selected at random and v_i,g is calculated according to the relation

v_i,g = x_i,g + p_2 (x_pbest,g − x_i,g) + p_2 (x_r1,g − x_r2,g),

where p_2 is a control parameter usually referred to as the weighting factor.
x_r1,g and x_r2,g are two members selected at random from the whole population and x_pbest,g is randomly selected from the top q × p_3 members (q ∈ [0, 1]). p_3 is the population size, or number of parents. q is a control parameter which controls the greediness of the algorithm; to eliminate this parameter it is randomised as in the Success-History Based Parameter Adaptation for Differential Evolution (SHADE) algorithm [20]. In addition, an external archive is used to generate x_r2,g [20].
A crossover operator is applied to the target and donor vectors to form a trial vector. The elements of the target and donor vectors enter the trial vector with a probability p_1, a control parameter usually referred to as the crossover constant. The target vector is compared with the trial vector and the vector with the best fitness value is selected for admission into the next generation. This iteration scheme repeats until a suitable stopping criterion is met [18].
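For concreteness, a minimal sketch of one generation of the DE variant described above (current-to-pbest/1 mutation, binomial crossover and one-to-one selection) is given below. It is an illustration only, not the implementation used in this study: the SHADE-style external archive is omitted, the randomisation range for q follows our reading of [20], and the function name de_generation is ours.

```python
import numpy as np

def de_generation(pop, fitness, func, p1, p2, rng):
    """One generation of DE with current-to-pbest/1/bin (sketch only).

    pop     : (p3, D) array of current population members
    fitness : (p3,) array of objective values for pop (minimisation)
    p1, p2  : crossover constant and weighting factor
    """
    p3, D = pop.shape
    order = np.argsort(fitness)                      # best members first
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(p3):
        q = rng.uniform(2.0 / p3, 0.2)               # per-member greediness, randomised as in SHADE
        pbest = order[rng.integers(max(1, int(q * p3)))]
        r1, r2 = rng.choice([j for j in range(p3) if j != i], size=2, replace=False)
        # current-to-pbest/1 mutation
        v = pop[i] + p2 * (pop[pbest] - pop[i]) + p2 * (pop[r1] - pop[r2])
        # binomial crossover: take each element from the donor with probability p1,
        # forcing at least one donor element so the trial differs from the target
        mask = rng.random(D) < p1
        mask[rng.integers(D)] = True
        trial = np.where(mask, v, pop[i])
        # one-to-one selection against the target vector
        f_trial = func(trial)
        if f_trial <= fitness[i]:
            new_pop[i], new_fit[i] = trial, f_trial
    return new_pop, new_fit
```

A full optimiser would call this function repeatedly, with rng = np.random.default_rng(seed), until a stopping criterion is met.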
DE has been applied, with success, to the fields of electrical power systems, electromagnetic engineering, control systems and robotics, chemical engineering, pattern recognition, artificial neural networks and signal processing [5]. In [17] Storn suggests using the control parameters p_1 = 0.900, p_2 = 0.500 and p_3 = 10D, where D is the number of dimensions of the function. The effect of these parameters on algorithm performance is a well researched subject. For example, there appear to be complex relationships between problem dimensionality and the most appropriate population size [15].
We compare the proposed predictive technique to a state-of-the-art adaptive technique: SHADE [20]. This technique uses an historical memory of control parameters which have performed well to guide the selection of control parameters each generation. In the original study it was shown to have competitive performance compared to other state-of-the-art algorithms using the CEC 2005 benchmarks, which are also used in this study. All control parameters used in our study are the same as used in the original SHADE study [20].

Contribution and Motivation of this Paper
Our contribution to the line of enquiry described in the introduction is a set of results which show that simple-to-calculate features can be used to classify objective functions in order to directly predict effective control parameters for an algorithm. Many authors have suggested this might be possible, but we believe this work is the first time a statistically significant improvement in performance has been shown using such an approach. We hope that this work will motivate further investigations into features which are computationally cheap to calculate. The motivation for this is real world applications where it is infeasible to tune an algorithm each time a new objective function is considered, and where the form of the objective function may be unknown, making it difficult to relate to previous analyses of control parameters.

Our Approach: Predicting Effective Control Parameters for Evolutionary Algorithms using Cluster Analysis of Objective Function Features
The aim of our approach is to automatically predict an effective set of control parameters for an unknown objective function. This is achieved by classifying objective functions using three simple-to-calculate features, which are described in Section 2.1.2. A number of experiments are performed offline with varying control parameters, across a range of objective functions. The algorithm performance is measured and recorded for each experiment; the performance metric used is described in Section 2.1.1. Functions are split into classifications using the unsupervised machine learning technique k-means++ [1]. All the experiments in a particular classification are ranked by performance and the mean values of the control parameters used in the top 10% of experiments are calculated. When a new function is to be optimised it is sampled, on-line, and its features calculated. It is then classified and the mean control parameter values calculated for its classification are used to optimise it.
In the remainder of this section each aspect of this predictive approach is discussed in detail.

Optimisation Algorithm Performance Metric
There are a number of metrics which can be utilised to define the performance of an optimisation algorithm [7,3]. The meaning of performance may change depending on the application [13], but in general we wish to reduce the objective function value with a small number of objective function evaluations. In this work a performance metric, α, is defined as

α = (F_1 − F_G) / N_g,

where F_1 is the lowest objective function value in the first generation, F_G is the lowest objective function value in the final generation, G, and N_g is the total number of function evaluations performed up to and including generation g. Generation g is the first generation at which the reduction in the objective function reaches 99% of the total reduction, i.e.

F_1 − F_g ≥ 0.99 (F_1 − F_G).
This choice is justified as follows. In practice, an evolutionary optimisation algorithm is run until a maximum number of objective function evaluations is reached or a predetermined accuracy, or tolerance, is achieved. Dividing by N_g means that α gives us information on the efficiency of the optimisation algorithm. An algorithm which finds the optimum in the first few generations therefore has a larger α than an algorithm which finds the optimum in the final generation. In real applications, practicalities such as objective function evaluation cost limit the number of objective function evaluations [14,23,21,22]. α is designed to reward algorithms which exhibit high convergence in the first few function evaluations. It is not claimed that α is the correct metric for all situations; it is a choice depending on user requirements. In this study an attempt is made to model a situation where an engineer wishes to apply an optimisation algorithm to a real problem. One can imagine that such an engineer would simply select an algorithm and use the set of control parameters suggested in literature. In the authors' experience of applying optimisation algorithms to engineering applications, the proposed metric α is relevant for many engineers. Control parameters suggested in literature may not have been tuned with this metric in mind; despite this the engineer would likely use these parameters. In the results section convergence curves are presented to show the effect of this metric choice.
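As a small worked example of this metric, the sketch below computes α from a recorded history of the best objective value per generation, assuming the cumulative number of evaluations per generation is also recorded; the function name performance_alpha is ours.

```python
import numpy as np

def performance_alpha(best_per_gen, cum_evals_per_gen):
    """Compute alpha = (F1 - FG) / Ng for one optimisation run.

    best_per_gen      : best objective value found in each generation (index 0 = first generation)
    cum_evals_per_gen : total function evaluations performed up to and including each generation
    """
    best = np.asarray(best_per_gen, dtype=float)
    F1, FG = best[0], best[-1]
    total_reduction = F1 - FG
    # g is the first generation at which 99% of the total reduction has been achieved
    g = int(np.argmax((F1 - best) >= 0.99 * total_reduction))
    Ng = cum_evals_per_gen[g]
    return (F1 - FG) / Ng
```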

Objective Function Features
Functions are often described using features such as symmetry, smoothness, condition number or separability. These are usually only defined for analytical problems. It is well understood that these features affect the performance of optimisation algorithms. The challenge, therefore, is to formulate a set of features that can be calculated with a small number of objective function evaluations.
In this proof of concept study, the starting point for calculating these features will be a Latin hypercube sampling of the objective function search space. The number of samples taken is referred to as σ. This sampling will be performed prior to the optimisation in this study, but in future this sampling could also be used as the first generation of the optimisation algorithm. The objective function values in this sampling are first normalised by subtracting the mean and dividing by the standard deviation. Three simple features have been selected to test the methodology.
1. β_1 is the number of dimensions of the function, which is known to strongly affect algorithm performance.
2. β_2 is the interquartile range of the normalised data, which provides information on function variation within the domain. This feature will identify functions which have a largely flat topology.
3. β_3 is the skew of the normalised data. The skew tells us how the function value is distributed: a skew of zero would indicate a symmetric distribution, whereas positive and negative values would indicate a tailed distribution. This feature could potentially identify functions with sharp optima as well as give information regarding function symmetry.
Collectively these features make up the characteristics, β, of a particular objective function.
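A minimal sketch of the feature calculation, assuming a Latin hypercube sample drawn with scipy and an objective function func that accepts a single array of decision variables, is shown below; the function name and bound handling are illustrative only.

```python
import numpy as np
from scipy.stats import qmc, skew, iqr

def objective_features(func, lower, upper, sigma, seed=0):
    """Estimate beta = (beta1, beta2, beta3) from sigma Latin hypercube samples."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = lower.size
    sampler = qmc.LatinHypercube(d=D, seed=seed)
    X = qmc.scale(sampler.random(sigma), lower, upper)   # sigma x D sample of the search domain
    f = np.array([func(x) for x in X])
    f_norm = (f - f.mean()) / f.std()                     # normalise: subtract mean, divide by std
    beta1 = D                                             # number of dimensions
    beta2 = iqr(f_norm)                                   # interquartile range of normalised values
    beta3 = skew(f_norm)                                  # skew of normalised values
    return np.array([beta1, beta2, beta3])
```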

Control Parameter Selection
DE requires a number of control parameters, stored in the vector p, which defines a single instance of the algorithm. Running many p on many objective functions results in a number of data points of the form (p, β, α). This data is named the training data and is used to exploit any relationships between the control parameters, function features and performance. The approach adopted is to apply the unsupervised clustering algorithm k-means++ [1]. The k-means algorithm takes an unlabelled data set and classifies it into a user specified number of groups, κ. Each group is defined by a cluster centroid; a data point belongs to the group whose centroid it lies closest to. The k-means++ variant of the algorithm carefully initialises these centroids rather than using random initialisation [1].
Using the training set, the objective functions are classified by applying k-means++ to the β data points. For each classification the data points are sorted by α. The top 10% of data points are identified, and the mean p is calculated from that set and used to optimise new functions identified as belonging to that classification.
At the end of an optimisation the new data point, (p, β, α), from that run is appended to the training data and the k-means++ algorithm is run to update the classifications and redetermine the best control parameters for each new centroid. The key idea is that the memory of well performing parameters is extended from a single optimisation run to the entire history of using the algorithm.
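The classification and prediction step can be sketched as follows using scikit-learn's k-means++ initialisation; the array layout of the training data and the helper name are assumptions, while the top-10% rule follows the text above.

```python
import numpy as np
from sklearn.cluster import KMeans

def predict_parameters(train_beta, train_p, train_alpha, new_beta, kappa=10):
    """Predict control parameters for a new function from clustered training data.

    train_beta  : (n, 3) feature vectors from the training data
    train_p     : (n, 3) control parameter vectors used in those experiments
    train_alpha : (n,)   performance values alpha of those experiments
    new_beta    : (3,)   features of the function to be optimised
    """
    km = KMeans(n_clusters=kappa, init="k-means++", n_init=10).fit(train_beta)
    label = km.predict(np.asarray(new_beta).reshape(1, -1))[0]
    members = np.where(km.labels_ == label)[0]
    # rank the cluster's experiments by alpha (descending) and keep the top 10%
    ranked = members[np.argsort(train_alpha[members])[::-1]]
    top = ranked[: max(1, len(ranked) // 10)]
    return train_p[top].mean(axis=0)
```

After the run, the new record (p, β, α) would be appended to the training arrays and the clustering re-run, mirroring the update described above.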

Procedure for a Single Optimisation
Each time a function was optimised the optimiser was limited to 10,000 objective function evaluations. To assess our approach fairly, the on-line cost of calculating β before the optimisation takes place contributes to this number of objective function evaluations. In other words, our approach has fewer objective function evaluations available when the optimisation starts. All optimisation runs were repeated 30 times, with 30 different random seeds used for all random number generation. With the same random seed, the same control parameters would result in the same performance on the same function instance. This allowed pairwise comparisons between different control parameters. Repeating each optimisation run with 30 different random seeds ensured a 'lucky' seed was not selected which benefited a particular approach.

Test Suites
Two established optimisation benchmark suites were used in this study.
Real-Parameter Black-Box Optimization Benchmarking functions (BBOB) 2015
The BBOB 2015 functions were used to train the predictive methodology. The 24 benchmark functions which make up the BBOB 2015 test suite are given in [8,10]. This suite includes separable functions, functions with low to high condition numbers and multi-modal functions with weak global structure. The same numbering system for the functions as in [10] is used in this paper. All of these functions are defined for an arbitrary number of dimensions and have the same search domain. The test suite includes 15 instances for each function; for each instance a combination of optimal location shifting and linear transformations is applied. Each instance is shifted and rotated in the same manner on subsequent runs, which enables direct comparison of performance. In the experiments presented here, a single test suite entails optimising each function instance at a range of dimensions (2, 10, 20, 30, 40, 50) using 30 different random seeds. The resulting number of tests in a single run of the suite is then 24 × 15 × 6 × 30 = 64,800.

IEEE Congress on Evolutionary Computation CEC 2005 Real-Parameter Optimisation benchmarks
The CEC 2005 benchmark functions as detailed in [19] make up the second test suite. These were used to test the effectiveness of problem specific tuning on objective functions different to the training set. The 25 functions were used with the same numbering system presented in the technical report [19]. All functions were optimised at 2, 10, 30 and 50 dimensions using 30 different random seeds, resulting in a total of 25 × 4 × 30 = 3,000 tests.

Statistical Methodology
Using the test suites described above allows pairwise comparison of α between different approaches. The approach for using nonparametric statistical tests described by Derrac et al. [6] is followed here. The Wilcoxon signed ranks test is used to compare the predictive methodology to other approaches. The test results in the value W, which will be reported along with the p-value. Alone, W can be difficult to interpret, so the z-ratio is also reported. A z-ratio greater than 3.09 implies there is only a 0.1% chance that the improvements observed were caused by random chance. Conversely, a z-ratio less than 1.64 would be considered statistically insignificant. For all the statistics presented a positive W indicates that the predictive methodology has performed better than the group it is compared to; a larger W indicates a more significant improvement.
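A hedged sketch of the pairwise comparison is given below; it assumes the α values of the two approaches are stored in paired arrays (same function instance and seed at each index), and constructs W as the sum of signed ranks so that a positive value favours the predictive methodology, which is our reading of the procedure in [6].

```python
import numpy as np
from scipy.stats import rankdata, norm

def wilcoxon_signed_ranks(alpha_predictive, alpha_other):
    """Paired Wilcoxon signed-ranks comparison of two parameter selection strategies."""
    d = np.asarray(alpha_predictive) - np.asarray(alpha_other)
    d = d[d != 0]                                  # discard zero differences
    ranks = rankdata(np.abs(d))                    # rank absolute differences (ties get average ranks)
    W = float(np.sum(np.sign(d) * ranks))          # positive W favours the predictive methodology
    n = len(d)
    z = W / np.sqrt(n * (n + 1) * (2 * n + 1) / 6.0)   # normal approximation, no tie correction
    p = 2.0 * norm.sf(abs(z))                      # two-sided p-value
    return W, z, p
```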

Experiments
For each function in the BBOB 2015 suite a Latin hypercube sampling of p will be generated and each of these control parameter sets used to optimise that function. Since there are three control parameters, 30 sets of p will be generated each time. For these optimisation runs the number of samples used to calculate β was set to σ = 1000. The resulting data will be used as the initial training set for the problem aware tuning.
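A sketch of how such a Latin hypercube sampling of p could be generated is given below; the parameter ranges used (p_1, p_2 in [0, 1] and p_3 between 4 and 20D members) are illustrative assumptions only, as they are not stated in this section.

```python
import numpy as np
from scipy.stats import qmc

def sample_control_parameters(n_sets=30, D=10, seed=0):
    """Latin hypercube sample of n_sets control parameter vectors p = (p1, p2, p3)."""
    u = qmc.LatinHypercube(d=3, seed=seed).random(n_sets)
    p1 = u[:, 0]                                              # crossover constant
    p2 = u[:, 1]                                              # weighting factor
    p3 = np.round(4 + u[:, 2] * (20 * D - 4)).astype(int)     # population size (assumed range)
    return np.column_stack([p1, p2, p3])
```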
There will then be four methods for selecting the DE control parameters:
• the suggested parameters from literature,
• SHADE,
• the predictive methodology (using cluster analysis),
• using the best performing control parameters from the training set.
The predictive methodology will be applied with varying σ and κ to gauge the sensitivity to these. Each time our approach is used, a new set of samples is generated to calculate β in order to simulate the use of the approach in practice. Each of these methods will be used to optimise both function suites. The non-parametric tests will then be utilised to compare the effectiveness of each method. It needs to be stressed that, in the comparisons presented, β is recalculated and used for objective function classification in each optimisation run. This does mean that, when comparing the number of function evaluations to other methods, the predictive methodology has σ additional evaluations. These function evaluations have been included in all measurements of performance as they indicate the cost of our methodology.
The goal of this paper is to show that features, such as those in β, can be used as predictors for p in order to maximise α. If this is the case, future research into minimising the required σ to effectively approximate β can be undertaken, as well as research into different definitions for α. In the following experiments the BBOB 2015 suite is both the training suite and the testing suite. This is the most basic test for our approach. It is worth pointing out that β is recalculated using new random samplings for each optimisation experiment.

Predictive Methodology Compared to Picking the Best from the Training Set
In the following study a single set of tuning parameters, the best performing in the initial training set, was used to optimise the BBOB 2015 test suite. The control parameters used in this study are presented in Table 1.
The results of the statistical tests, shown in Table 2, show that the predictive methodology results in a statistically significant increase of α compared to the best from the training set. This improvement was achieved regardless of the values of κ and σ. With κ = 100 the improvement is comparatively reduced, yet remains statistically significant, which suggests there may be an optimum κ. This implies that objective functions which lie close to each other in the abstract β space require similar p. With κ set too high each function is only affected by its close neighbours and therefore has access to less information regarding p and α.

Predictive Methodology Compared to Using the Best Parameters from Literature
The test suite was optimised using the control parameters suggested in literature; these parameters are shown in Table 1. These parameters are what most practitioners would use in practice as a rule of thumb. Table 3 shows the statistical comparison between the performance of these fixed parameters and the predictive methodology. In all cases the predictive methodology performs significantly better. There is a jump in performance when σ increases from 100 to 1000, which indicates a sensitivity to the sampling of the objective functions.

Predictive Methodology Compared to SHADE
The test suite was optimised using SHADE. Table 4 shows the statistical comparison between the performance of SHADE and the predictive methodology. In all cases the predictive methodology performs significantly better than adaptive tuning. The improvement is more significant when σ = 1000.

Convergence Behaviour
In Fig. 2 convergence plots are presented for a number of functions in the BBOB 2015 suite. The objective function value is plotted against the number of function evaluations for each control parameter selection strategy. These data points were recorded at most once per generation, and only when an improvement in the objective function value was found. This means that the number of data points depends on the population size and is not the same for every curve. Each function is shown for different numbers of dimensions and the control parameters selected by the predictive methodology are presented. These functions were selected to show a range of cases, some where the predictive methodology performs well and some where it does not. Where the predictive methodology performs well it achieves rapid convergence early in the optimisation, which is what the metric α was designed to achieve.

CEC 2005 Function Suite: Predictive Methodology Compared to Picking the Best from the Training Set
The predictive methodology and the best control parameters from the initial training set were used to optimise the CEC 2005 benchmark functions. Table 5 shows the statistical tests for this comparison. For all κ and σ the z-ratio falls just short of the 99.9% confidence level, but still shows a statistically significant improvement. The improvement observed when optimising the CEC 2005 function suite is comparatively less significant than the improvement observed when optimising the BBOB 2015 suite.

Predictive Methodology Compared to Using the Best Parameters from Literature
The results of statistical tests comparing the predictive methodology to the fixed parameters suggested in literature are shown in Table 6. For all κ and σ the z-ratio shows that the improvement is statistically significant.

The Predictive Methodology Compared to SHADE
The results of statistical tests comparing the predictive methodology to SHADE are shown in Table 7. For all κ and σ the z-ratio shows that the improvement is statistically significant.

Discussion
The results show that, when optimising the BBOB 2015 objective function suite, the predictive methodology outperformed using fixed and adaptive tuning parameters for DE with a 99.9% confidence level. There was an observed increase in performance when the initial sampling size, σ, of the objective functions was increased. There was a slight drop in performance from κ = 10 to κ = 100. This indicates that having fewer, larger classifications of objective function is better than a more granulated approach. This trend does not continue to κ = 1, i.e. simply using the best parameters from the training set.

Optimising the CEC 2005 benchmark functions, using only the BBOB 2015 suite for training, is a tough test for the suitability of β to act as a predictor for control parameters in the general case. In all cases the z-ratio indicates that the predictive methodology is likely to be the most effective approach. Overall there is a slightly reduced confidence in this conclusion compared with the BBOB 2015 suite, but the z-ratio indicates that the improvement is still statistically significant. This reduction in significance is likely explained by the fact that one function suite has been used to train the problem aware tuning for another.
Overall the results show that the simple objective function features, β, can act as a predictor for selecting appropriate control parameters for DE. The advantage of this approach can be observed when comparing it to SHADE. SHADE learns which control parameters are most effective during the optimisation process. The predictive methodology attempts to predict effective control parameters prior to the optimisation, so the benefit is felt from the first iteration. This prediction itself comes at the cost of objective function evaluations, which were accounted for in this study. The open problem, therefore, is how to effectively approximate these features with fewer objective function evaluations. In the future, an aim is to use these objective function evaluations as the first generation of the optimisation run so as not to waste them.
To improve performance, it may be possible to update the value of β as the optimisation runs and better samples the objective function. As the approximation of β improves, the control parameters could be changed mid-run. This would require thoughtful implementation to avoid introducing significant computational overhead. There is also scope for designing more sophisticated and varied objective function features. Performance may also be increased with a larger training data set. This does come with an additional overhead, as the computational cost of the k-means algorithm increases with the size of the training data. In the future there is no reason why the k-means calculation could not be performed using cloud computing, with training data collected from many users of the algorithm.

Conclusions
The methodology proposed has been shown to offer a statistically significant improvement over other approaches. This implementation shows that the concept has the potential to be a powerful addition to evolutionary optimisation algorithms. The method is general and could be applied to any evolutionary algorithm and any performance measure of interest. There are a number of avenues to investigate, discussed above, which may improve the methodology further. In particular an investigation into more sophisticated function features, such as those used in FBAC [3], is a high priority. The long term goal should be to extend this methodology to automatically select the most appropriate evolutionary algorithm for a problem, not just the control parameters. Such an automation would be of great use for industrialists wishing to apply evolutionary algorithms to real world applications.

Fig. 1
Fig. 1 shows a two-dimensional projection of the training data used in the following studies. This data resulted from optimising the BBOB 2015 function suite only. The marker colour indicates which classification each data point belongs to when κ = 10.

Figure 1 :
Figure 1: The BBOB 2015 training data set. Markers are coloured according to their classification when κ = 10.

Figure 2 :
Figure 2: Examples of the convergence behaviour using different control parameter selection strategies for the BBOB 2015 function suite. In the predictive methodology κ = 10 and σ = 1000.

Figure 3 :
Figure 3: Examples of the convergence behaviour using different control parameter selection strategies for the CEC 2005 function suite. The predictive methodology was trained using the BBOB 2015 function suite with κ = 10 and σ = 1000.

Table 1 :
The control parameters used in the fixed control parameter optimisations.

Table 3 :
BBOB 2015 Function Suite: Wilcoxon signed ranks test data comparing the predictive methodology to using the control parameters most commonly suggested in literature.