MVMOO: Mixed variable multi-objective optimisation

In many real-world problems there is often the requirement to optimise multiple conflicting objectives in an efficient manner. In such problems there can be the requirement to optimise a mixture of continuous and discrete variables. Herein, we propose a new multi-objective algorithm capable of optimising both continuous and discrete bounded variables in an efficient manner. The algorithm utilises Gaussian processes as surrogates in combination with a novel distance metric based upon Gower similarity. The MVMOO algorithm was compared to an existing mixed variable implementation of NSGA-II and random sampling for three test problems. MVMOO shows competitive performance on all proposed problems with efficient data acquisition and approximation of the Pareto fronts for the selected test problems.


Introduction
As the requirement for optimising process efficiency, whilst simultaneously improving other environmental or economic factors becomes a necessity, utilising optimisation methodologies has become increasingly more important. This has led to optimisation tasks requiring an ever-increasing resource demand to solve. Consequently, efficient and accessible optimisation methodologies are required to guide complex systems to their optimal conditions. Many real-world optimisation problems consist of multiple conflicting objectives and constraints which can be composed of both continuous and discrete variables. Continuous variables are variables in which we have access to any real value in the desired optimisation range of the variable. Given their inherent continuous nature they are often easier to solve with a wider array of applicable optimisation techniques. Discrete variables can take the form of integer values or categorical values (materials, reaction solvents). In many cases these optimisations can be expensive to evaluate in terms of time or monetary resources.
It is therefore necessary to utilise algorithms that can efficiently guide the search towards the optimum set of conditions for a given problem. Commonly for real-world problems derivative information may be unavailable, therefore, the use of derivative free optimisation techniques is required.
For many multi-objective problems there does not exist a common optimal for all objectives. As such, algorithms will work towards the trade-off between the competing objectives. This non-dominated set of points is known as the Pareto front [1]. The Pareto front can then be utilised to inform the decision maker as to conditions which act as the best compromise for their given problem and requirements. There are two main techniques for solving multi-objective problems; scalarisation or Pareto based methods. Scalarisation involves converting the multi-objective problem into a problem with a single objective. This can often be done by applying weightings to each objective and additively combining them into a single objective, although there exist many alternative scalarisation techniques [2]. As scalarization combines the objective functions into a single objective, often the methodology needs to be rerun with alternative weightings or weightings changed during the optimisation to develop a complete picture of the Pareto set for a problem. The settings or procedure for weight generation can have a significant effect on the efficacy of the optimisations [3].
Pareto based methods look to iteratively improve a set of non-dominated solutions towards an optimal front. Here, evolutionary algorithms (EAs) have been widely applied for the solution of multi-objective problems [4]. In general, EAs iteratively change a set of candidate solutions towards an optimum set of values. The most notable example of this would be the Non-dominated Sorting Genetic Algorithm II (NSGA-II) algorithm [5], with the algorithm being successfully applied to a wide array of simulated and real world problems [6,7]. A key disadvantage for their application on expensive to evaluate problems is the requirement for many function evaluations to progress to an optimum set of points. In these instances, EAs are often coupled with surrogate models of the real process to reduce this burden [8,9].
Bayesian optimisation approaches have been applied to multi-objective problems. For Bayesian approaches, typically, a Gaussian process surrogate is built from past evaluations, and is utilised in tandem with an acquisition function to determine the next point for the optimisation [10]. The Expected Hypervolume Improvement acquisition function provides an extension of the single objective expected improvement to the multi-objective domain [11]. The algorithm provides good convergence to the true Pareto front, however requires a high computational budget for computing the infill criteria for more than three objectives [12,13]. Alternative acquisition functions and approaches have been suggested all looking to reduce the computational burden for acquisition evaluation as well as improve the efficiency of the optimisations. Recently, the TSEMO algorithm employing Thompson sampling with an internal NSGA-II optimisation has been proposed [14]. The authors utilise the inherent exploratory nature of Thompson sampling to build Gaussian surrogates, which are subsequently optimised utilising NSGA-II to suggest points that maximise the hypervolume improvement.
Generally, mixed variable multi-objective problems have received less attention when compared with other optimisation fields. However, a significant volume of research has made use of EAs such as genetic algorithms (GAs) to solve a wide variety of real-world problems, often being coupled with simulated systems or surrogate models [15][16][17][18][19][20]. In many of these works the direct system is rarely optimised with preliminary work instead used to construct accurate surrogate models. Surrogate-based techniques utilise a cheap model constructed from a training dataset generated from the expensive objective function. For each iteration, the surrogate is used to guide the optimisation and determine the next point of evaluation. Gaussian processes (GPs) have often shown excellent performance in these surrogate based techniques; however, their use has usually been limited to continuous variable only processes. This limitation is linked with the correlation function requiring a distance metric which can be difficult to define between different levels of discrete variables. Previous methods [21][22][23][24] propose modifications to the covariance function to ensure a valid covariance matrix can be defined for mixed inputs.
More recently, Zhang et al. [25] adopted a latent variable approach to enable the Bayesian optimisation for material design. The proposed methodology maps the qualitative variables to underlying quantitative latent variables, from which a standard covariance function can be used. The latent variables for each discrete variable combination are determined utilising maximum likelihood estimation with each combination being mapped to a 2D latent variable domain. The authors compared their suggested single objective optimisation methodology to MATLAB's bayesopt which utilises dummy variables in place of the qualitative variables.
Herein we present a mixed variable multi-objective Bayesian optimisation algorithm and provide three test cases with comparison to both random sampling and a mixed variable version of NSGA-II provided in jMetalPy [26]. The algorithm looks to extend Bayesian multi-objective methodologies to the mixed variable domain, which to the best of the authors' knowledge has limited prior work [27]. Utilising the recently proposed Expected Improvement Matrix [28] and an adapted distance metric, the algorithm provides an efficient approach to optimising expensive to evaluate mixed variable multiobjective optimisation problems without the need for reparameterization of the discrete variables.

Gaussian processes
A Gaussian process can be defined as a collection of random variables. A GP is fully defined by a mean function, m( ) , and a covariance function, k , ′ of a real process f ( ) [29].
With being an arbitrary input vector and [⋅] the expectation. The mean function defines the average of the function, with the covariance function or kernel specifying the covariance or the relationship between pairs of random variables. The kernel is composed of several hyperparameters for which the optimal values can be determined through model training. The application of GP models in optimisation literature is principally restricted to the continuous domain utilising a stationary kernel. When considering mixed variable optimisation, one must ensure that a valid kernel can be constructed.

Covariance function
The application of Bayesian techniques is often limited to continuous variable optimisation problems with the covariance being a function of the distance between points in the input domain.
The difficulty for mixed variable optimisation problems when using a GP as the surrogate for the response is formulating a proper correlation structure. As discrete variables can vary in similarity by a large degree, assuming continuous methods can be applied to determine the covariance matrix is not appropriate and may result in an invalid covariance matrix. The covariance function that characterises a GP is customisable by the user, however, to ensure that it is valid it is a requirement for the covariance function to be symmetric and positive semi-definite; this will be true if its eigenvalues are all non-negative.
Qian et al. [21] were amongst the first to implement GP modelling for systems with mixed variable types. The method requires an extensive estimation procedure, requiring the solution of an internal semidefinite programming problem to ensure a valid covariance matrix is computed during model training. Due to the large number of hyperparameters required, it can be difficult to obtain their optimum value, especially in efficient optimisation application, where the number of evaluations is often restricted.
Halstrup [30] developed a novel covariance function based upon the Gower distance metric. This metric requires significantly fewer hyperparameters and exhibits good scalability to higher dimensional problems. Gower [31] initially proposed the metric to measure the similarity between two objects with mixed variable data, with the metric modified to represent distance.
The first term considers the quantitative factors and is the weighted Manhattan distance between the two variables, where Δw k is the range of the k th quantitative variable s w i m w j m considers the m th qualitative variable. This is set to 0 if w i m is equal to w j m and 1 if they are not equal. d and q are the number of qualitative and quantitative variables, respectively.
An applied example for Gower distance calculation for a sample set of experiments is provided below. Consider we have three variables in which two are quantitative and one is qualitative, we have an initial dataset (Table 1) and we want to determine the overall distance between each entry, such that we may be able to model a response.
For each quantitative variable we can initially calculate the variable range normalised Manhattan distance and for the single qualitative variable we can perform a binary comparison between each entry. Finally, we combine each distance matrix using (4), the results of which are displayed in Fig. 1. With a valid distance metric this permits the use of pre-existing correlation functions such as the square exponential or the Matérn kernels. An example of a mixed Matérn 5/2 kernel is given below: where dim is the problem dimension, with and the covariance function's hyperparameters.
Prior work by Halstrup has looked at the application of the Gower distance metric for use in enabling Bayesian single objective optimisation techniques to be used on mixed variable systems. This work was further developed by Pelamatti et al. [32] in which they compared the currently available techniques for optimising mixed variable systems utilising Gaussian approaches. They found the suggested methodology adopted by Halstrup performed comparatively well to other techniques whilst offering the benefit of a reduced number of hyperparameters.

Model training
For effective use and application of Bayesian techniques it is recommended to optimise the hyperparameters of the GP model prior to application of the acquisition function. The hyperparameters in this instance were tuned through maximising the likelihood of the available data points. The Python package GPflow (version 1.3) [33] was utilised alongside a modified version of the Matérn kernel to construct and fit the GP model to the data for all test examples where K y is the covariance matrix of noisy response values, − 1 2 y T K −1 y y assesses how well the model fits the data given the current hyperparameters, . A complexity penalty in the is included to reduce overfitting of the model. − n 2 log 2 is a normalisation constant, where n is the number of training data points.

Acquisition function
To select the next evaluation point Bayesian optimisation utilises an acquisition function which is computed using the surrogate GP model built upon previously collected data. The acquisition function is typically inexpensive so that it can be evaluated extensively to guide the proceeding data acquisition.
One such acquisition function is the expected improvement matrix (EIM) [28]. EIM is a multi-objective adaptation of EI, with the expected improvement for each candidate and each objective calculated in a matrix. This matrix representation of expected improvement resembles the current best solution in multi-objective optimisation, which is also a matrix.
The matrix representations of EI for each objective are then combined for each point using three possible transformations; Euclidean, min/max distance and hypervolume based transformations. In this work we have elected to use a Euclidean based transformation to give a combined expected improvement for a given point with respect to all the objectives and the current non-dominated front: As the EI equation is in its closed form it can be rapidly calculated and optimised utilising evolutionary algorithms or multi-start local search methods. This should allow a global solution for the optimisation of EIM to be found at a reduced computational cost. EIM has been previously implemented in the form of the EIM-EGO algorithm. The implementation was tested on a series of continuous test problems, displaying competitive results to the state-of-the-art multi-objective algorithms, whilst providing efficient scaling when increasing the number of optimised objectives [28]. For the initial space filling design in all cases, we performed a 5 sample Latin hypercube for each combination of the discrete variables. It is noted that this can lead to a large initial cost to the optimisations when there are multiple discrete variables or discrete variable levels in the optimisation problem.

Algorithm overview
Upon completion of the initial data collection a GP model was constructed for each objective function and hyperparameters optimised using GPflow's internal Adam optimiser. The trained GP models are then used with the EIM acquisition function to determine the next point for evaluation. The optimisation of the acquisition function occurs in two stages; initially a large sample of the acquisition function is taken using a Halton sequence for each discrete variable combination. The leading variable combination was then taken forward with additional local optimisation performed using SciPy's [34] implementation of sequential least squares programming (SLSQP) [35].

Test functions
The algorithm has been applied to three test problems [36][37][38] to evaluate the selected approach and provide comparison with existing mixed variable multi-objective optimisation methods.
For the first two test problems, the true Pareto front optimal set was known and has been detailed below. For the final test problem, the true Pareto front has been determined through comprehensive evaluation of the function domain utilising the mixed variable NSGA-II algorithm.
Tables 2 and 3 provide details of the initial dataset size, algorithm evaluation budgets and settings for NSGA-II for all test problems. NSGA-II termination criteria was set to be once the prespecified number of generations had been evaluated.

Discrete VLMOP2
The VLMOP2 test function [36] is a well-known bi-objective test problem. Here we have adapted the original problem to create a mixed variable equivalent: Like the original function, the Pareto front is concave with the Pareto optima line generally from Figure 3 details the Pareto front and Pareto optimal set for the discrete VLMOP2 problem. This provides a useful initial test for the algorithm to determine whether it can effectively switch between the two levels of the discrete variable.

Ordinary differential equation (ODE) catalytic system
The following test problem involved the optimisation of 3 continuous variables and a single qualitative variable for two objective functions obtained through the solution of a system of ordinary differential equations (ODEs). Solution of the ODE system can be obtained through application of an initial value problem algorithm, such as the Runge-Kutta methods [39]. The system of ODEs was used to describe a chemical reaction where the qualitative variable is the choice of catalyst utilised in system and varies the activation energy of the reaction as described in (12), (13) and Table 4. This problem was initially reported as a single objective mixed variable problem by Baumgartner et al. [37]; we have modified the problem to formulate a multi-objective example for this test case. The system involves optimising the catalyst, d , catalyst concentration, C cat , temperature, T , and the reaction time, t res , in which the catalyst effects the activation energy of the reaction, E A i . Equations describing the system are provided below.
where is the residence time of the reactor, k 0 is the Arrhenius constant for the reaction ( 3. The system of ODEs was solved utilising MATLAB's ode45 numerical integration solver with the algorithm optimising for both space time yield and product yield. Figure 4 details the Pareto front for the simulated systems with details of the Pareto optimal set provided.   As the Pareto front in this test problem is comprised solely of a single discrete variable level, this allows for comparison with a continuous multi-objective optimisation algorithm. In this instance comparison versus the Thompson Sampling Efficient Multi-Objective Optimisation (TSEMO) [14] algorithm was performed.

Fuel injector design
This problem is comprised of a mixed integer fuel injector design problem proposed by Burachik et al. [38]. The authors adjusted the problem to impose an integer constraint on one of the design variables. The problem is described as follows (Fig. 5):  As with the discrete VLMOP2 example, the Pareto front is made up of multiple discrete variable levels so should provide a good test problem to determine how effectively the algorithm can switch between different discrete variable combinations in a larger problem.

Results
For the three test cases presented the algorithm was compared against random sampling, a split Latin hypercube (LHC) design and the mixed variable NSGA-II.

Performance indicators
There exists a wide plethora of performance metrics that have been developed over the last few decades for the comparison of different multi-objective optimisation algorithms. Riquelme et al. [40] provide a comprehensive review of many performance metrics highlighting their usage and the associated advantages/disadvantages. To evaluate the performance of the newly proposed algorithm we have utilised three performance metrics. A summary for each of the chosen metrics is provided below.

Hypervolume
The Hypervolume performance metric measures the volume of the objective space dominated by the calculated Pareto front given a specified reference point. The indicator provides a measure for both convergence and diversity and has been widely used as a comparison tool for multi-objective algorithms [41].

Modified inverted generational distance (IGD+)
The modified Inverted Generational Distance (IGD+) performance metric was first proposed by Ishibuchi et al. [42]. Modification of the IGD performance metric [43] was performed to achieve a weakly Pareto compliant version. The metric provides a measure on the convergence towards the true Pareto front as well as the diversity of the calculated front, with a smaller value suggesting the calculated Pareto front is close to that of the true Pareto front. IGD+ is calculated as follows: where d + i = max a i − z i , 0 , with a i being a point from the calculated Pareto front A and z i being a point from the true Pareto front Z.

Worst Attainment Surface
The Worst Attainment surfaces are defined by a set of Pareto points for a specific algorithm and test function that are dominated by all other points for the specific algorithm on a specific test function. They highlight the divide between dominated and non-dominated points for the given set [44]. This provides insight as to where the different algorithms perform differently and to what extend and thus can highlight a specific algorithm's weakness.

Discrete VLMOP2
Firstly, we will consider the results of the optimisation of the discrete VLMOP2 test problem. Summary plots for all performance metrics, as well as iterative summary plots, are shown in Fig. 6. For the hypervolume performance indicator a reference point of R = [1.0, 1.25] was used.
Overall, the MVMOO algorithm performs comparatively well compared to the mixed version of the NSGA-II algorithm with the novel MVMOO algorithm exceeding the performance in terms of both Hypervolume and IGD+, with a 21-fold reduction in iterations for Hypervolume and a 36-fold reduction for IGD+ for comparative performance to NSGA-II.
The iterative results for both Hypervolume and IGD+ (Fig. 6a, b) appear to indicate premature termination of both MVMOO and NSGA-II with MVMOO plateauing to a lesser extent than NSGA-II. The budget for this test problem was kept deliberately small due to only two variables being optimised and the desire for efficient multi-objective optimisation. Figure 6(e) illustrates well how in the MVMOO algorithm was able to achieve points at the extremes of the Pareto front and correlate well for the two outlying points for both Hypervolume and IGD+. Here the algorithm has failed to correctly obtain the mixed central aspect of the Pareto front (see Fig. 6(e)) suggesting the algorithm may at times struggle to switch between discrete variables where the effects of the continuous variables are similar, this behaviour is further explored in the fuel injector design test problem. Again, it is likely this effect, combined with the limited evaluation budget, has likely led to the two outlying points in both IGD+ and Hypervolume. In this case where the evaluation budget is limited the initial training data collection plays a key role in the efficacy of the algorithm.

ODE Catalytic System
Summary plots for all performance metrics, as well as iterative summary plots for the ODE catalytic system test problem are shown in Fig. 7. For the hypervolume performance indicator a reference point of R = [0, 0] was used.
Comparative performance can be observed in Fig. 7(a-d) for the MVMOO algorithm compared with a solely continuous multi-objective optimisation counterpart. Figure 7(e) details how the algorithm in the worse instance struggles to achieve the points at the extremes of the pareto front, possibly accounting for the lower values of Hypervolume observed. NSGA-II was able to achieve superior performance across all metrics with extremely good approximation of the Pareto front in the worst attainment summary surface plot, Fig. 7(e).
Considering both Fig. 7(a, b), we can see there is a significant stall in the iterative progress for both performance metrics. This behaviour is linked to the current initial data gathering procedure for the MVMOO algorithm with a 5 sample Latin hypercube taken for each level of the discrete variable. As the system only has a single optimum discrete variable level this leads to this stalling behaviour. Further work investigating improved techniques for initial data collection may prove beneficial for systems when there is a single optimal discrete variable level. The steepness of the gradient of improvement for Hypervolume in Fig. 7(a) indicates how the MVMOO algorithm efficiently works towards the Pareto front for the system with final results comparable to that of the continuous algorithm which was afforded a larger algorithm iteration budget. This highlights the efficiency of the algorithm particularly given the search domain was 8 times larger than that of the continuous algorithm when you consider the discrete variable levels. Comparing performance with regards to Hypervolume and IGD+ against NSGA-II the algorithm achieves the same level of performance in approximately 7% and 8% of the iterations to that required by NSGA-II.

Fuel injector design
Summary plots for all performance metrics, as well as iterative summary plots for the fuel injector design test problem are shown in Figs. 8, 9 and 10. For the hypervolume performance indicator a reference point of R = [0.8, 1.4, 1.7, 1.0] was used.
Examining Fig. 8(a-d) the results for both performance metrics indicate a significant improvement when using MVMOO over NSGA-II, with the algorithm achieving on par performance to the final mean hypervolume of NSGA-II after 55 iterations and after 42 iterations for mean IGD+ representing a 145-fold and a 190-fold improvement in efficiency with respect to these performance indicators.
A notable difference between the worst attainment surfaces of MVMOO and NSGA-II is the approximation of the bulk body of the Pareto front, with MVMOO providing a more uniform spread of points with some clustering present in NSGA-II. As with the ODE catalytic system example the MVMOO struggles to achieve the extremes of the Pareto front. This, however, may be rectified if the algorithm was allowed to continue with a lack of plateauing in performance metrics evident from the iterative Hypervolume plot (Fig. 8a) suggesting an improvement may be achieved through increasing the budget, this behaviour is echoed to a lesser extent on the iterative plot for IGD+ (Fig. 8b). The greater degree in plateau evident on the IGD+ could be linked to the algorithm struggling to achieve points at the edge of the Pareto front. In both the Hypervolume and IGD+ iterative plots there is clear evidence that the NSGA-II algorithm struggles with this 4-objective system with little improvement apparent for the latter iterations. Observing Figs. 9 and 10, this marked difference in both Hypervolume and IGD+ for NSGA-II could be attributed to the lack of points around the bulk sections of the Pareto front. Additional iterations would likely rectify this; however, this highlights the drastically improved efficiency in attaining near Pareto optimal points presented by the MVMOO algorithm.
Overall, the MVMOO displays competitive performance in term of both efficiency and final performance across all three test problems presented. On all examples the algorithm displays excellent repeatability evident through the narrow quartile ranges on all test problem box plots. The algorithm has been shown to be effective at switching between discrete variable levels should this be required and produces a uniform distribution of points across the front efficiently describing the front for a minimal budget.

Conclusions
In this work we present a novel Bayesian multi-objective algorithm (MVMOO) capable of simultaneously optimising both discrete and continuous input variables. The algorithm was applied to three test cases with comparison to random and genetic algorithmbased alternatives. MVMOO was compared to NSGA-II as well as two random sampling techniques for 3 test functions where it was able to outperform NSGA-II for 2 out of the 3 test problems with a significantly reduced function evaluation budget. For the catalytic ODE test problem, MVMOO compared comparatively well with NSGA-II and a continuous variable multi-objective algorithm, TSEMO, which were able to outperform the new algorithm in terms of IGD+ and hypervolume. Overall, MVMOO was able to perform competitively when compared to NSGA-II with a substantially reduced experimental budget, providing a viable, efficient option when optimising expensive mixed variable multi-objective optimisation problems without the need of a priori knowledge of the objective function.
Acknowledgements JAM thanks the EPSRC (EP/R513258/1) and University of Leeds for funding. This work was funded, in part, by the EPSRC project "Cognitive Chemical Manufacturing" EP/R032807/1. RAB was supported by the Royal Academy of Engineering under the Research Chairs and Senior Research Fellowships scheme (RCSRF1920\9\38).