Surrogate-assisted Multi-objective Optimization via Genetic Programming Based Symbolic Regression

Yang, Kaifeng; Affenzeller, Michael

doi:10.1007/978-3-031-27250-9_13

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13970))

Included in the following conference series:

International Conference on Evolutionary Multi-Criterion Optimization

1842 Accesses
2 Citations

Abstract

Surrogate-assisted optimization algorithms are a commonly used technique to solve expensive-evaluation problems, in which a regression model is built to replace an expensive function. In some acquisition functions, the only requirement for a regression model is the predictions. However, some other acquisition functions also require a regression model to estimate the “uncertainty” of the prediction, instead of merely providing predictions. Unfortunately, very few statistical modeling techniques can achieve this, such as Kriging/Gaussian processes, and recently proposed genetic programming-based (GP-based) symbolic regression with Kriging (GP2). Another method is to use a bootstrapping technique in GP-based symbolic regression to estimate prediction and its corresponding uncertainty. This paper proposes to use GP-based symbolic regression and its variants to solve multi-objective optimization problems (MOPs), which are under the framework of a surrogate-assisted multi-objective optimization algorithm (SMOA). Kriging and random forest are also compared with GP-based symbolic regression and GP2. Experiment results demonstrate that the surrogate models using the GP2 strategy can improve SMOA’s performance.

You have full access to this open access chapter, Download conference paper PDF

Surrogate-assisted evolutionary algorithm for expensive constrained multi-objective discrete optimization problems

Article Open access 04 January 2021

On Dealing with Uncertainties from Kriging Models in Offline Data-Driven Evolutionary Multiobjective Optimization

New Surrogate Approaches Applied to Meta-Heuristic Algorithms

Keywords

1 Introduction

A common remedy to expensive optimization problems is to replace exact evaluations with approximations learned from past evaluations. Theoretically, any supervised learning techniques can be used for building up surrogate models, for instance, Gaussian processes or Kriging [16], random forest [7], supported vector machine [4], genetic programming based (GP-based) symbolic regression [9], to name a few. Among these techniques, the Kriging/Gaussian process is appealing both in academic studies and in applications due to its estimation of both prediction and the corresponding uncertainty as outputs. The utilization of these two outputs allows a balancing property of exploitation and exploration in an acquisition function (a.k.a infill criteria in some papers) during the optimization processes.

Compared with the black-box modeling technique of Kriging, the transparent-box modeling technique has been a hot topic recently due to its ability to interpret the relationship between inputs and predictions. Among many techniques of transparent-box modeling, GP-based symbolic regression (SR) attracts many researchers’ attention, as it searches the space of all possible mathematical formulas for concise or closed-form mathematical expressions that best fit a dataset [1]. To quantify prediction uncertainty in GP-based symbolic regression, a simple technique is to utilize bootstrapping, which is widely used in the machine learning field. Bootstrapping is usually used in ensemble methods to reduce the bias of a (simple) predictor by using several predictors on different samples of a dataset and averaging their predictions. The set of predictions from each predictor can be used to estimate the properties of a predictor (such as prediction mean and its variance) by measuring those properties when sampling from an approximating distribution. For example, this technique is incorporated into a random forest (RF) in Sequential Model-Based Algorithm Configuration (SMAC) [12], where the variance is calculated using the predictions of the trees of the random forest. Following the same idea, the prediction variance of GP-based symbolic regression can also be estimated by the method of bootstrapping. Such similar works can be found in [2, 8, 10]. Another technique to quantify the uncertainty of the prediction’s residual is achieved in [19], in which a so-called GP-based symbolic regression with Kriging (GP2) was proposed by summing a GP-based symbolic expression and one additional residual term to regulate the predictions of the symbolic expression, where the residual term follows a normal distribution and is estimated by Kriging.

The techniques mentioned above allow GP-based symbolic regression to employ acquisition functions that balance exploration and exploitation during the optimization process. Under the framework of surrogate-assisted multi-objective optimization algorithm (SMOA), this paper utilizes the most recently modeling technique GP2 [19] and GP-based symbolic regression incorporated with a bootstrapping technique as the surrogate models to solve multi-objective optimization problems (MOPs). These modeling techniques are compared with two state-of-the-art modeling techniques, Kriging, and random forest.

The main contribution of this paper is extending the surrogate models that allow an acquisition function’s ability to balance exploration and exploitation in the field of SMOA. This paper is structured as follows: Sect. 2 describes the preliminaries of multi-objective optimization problems, and introduces the framework of surrogate-assisted multi-objective optimization algorithm; Sect. 3 explains three different techniques for surrogate models; Sect. 4 introduces the definition of upper confidence bound (UCB) [6, 17] and its relating concepts; Sect. 5 shows the parameter settings and discusses the experiment results.

2 Surrogate-Assisted Multi-objective Optimization

This section introduces the preliminaries of the SMOA, including the MOP’s definition in Sect. 2.1, and the framework of SMOA in Sect. 2.2.

2.1 Multi-objective Optimization Problem

A continuous MOP^{Footnote 1} is defined as minimizing multiple objective functions simultaneously and can be formulated as:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}} \quad \textbf{f}(\textbf{x}):= \big ( {f}_1(\textbf{x}),\cdots ,{f}_m(\textbf{x}) \big ), \qquad \qquad \textbf{x} \in \mathcal {X} \subseteq \mathbb {R}^d \end{aligned}$$

(1)

where m is the number of objective functions, ${f}_i$ stands for the i-th objective functions $f_i: \mathcal {X} \rightarrow \mathbb {R}$, $i = 1, \dots , m$, and $\mathcal {X}$ is a decision vector subset.

2.2 Framework of Surrogate-Assisted Multi-objective Optimization

The fundamental concept of SMOA is to firstly build a surrogate model $\mathcal {M}$ to reflect the relationship between a decision vector $\textbf{x} = (x_1, \cdots , x_d)$ and its each corresponding objective value $y_i = f_i(\textbf{x}), i\in \{1,\cdots ,m\}$ for each objective. In SMOA, it is usually assumed that objective functions are mutually independent in an objective space^{Footnote 2}.

SMOA starts with sampling an initial design of experiment (DoE) with a size of $\eta $ (line 2 in Algorithm 1), $\textrm{X}=\{\textbf{x}^{(1)},\textbf{x}^{(2)},$ $\ldots , \textbf{x}^{(\eta )}\} \subseteq \mathcal {X}$. By using the initial DoE, $\textrm{X}$ and its corresponding objective values, $\textbf{Y}=\{\textbf{f}(\textbf{x}^{(1)}),\textbf{f}(\textbf{x}^{(2)}),\cdots ,\textbf{f}(\textbf{x}^{(\eta )})\} \subseteq \mathbb {R}^{m\times \eta }$ (line 3 in Algorithm 1), can be then utilized to construct surrogate models $\mathcal {M}_i$. Then, an SMOA starts with searching for a decision vector set $ \textbf{x} '$ in the search space $\mathcal {X}$ by maximizing the acquisition function $\mathscr {A}$ with parameters of $\gamma $ and surrogate models (line 7 in Algorithm 1). Here, an acquisition function quantifies how good or bad a point $ \textbf{x} $ is in objective space. In this paper, $\mathscr {A}$ is the upper confidence bound (described in Sec. 4) due to its ability to balance exploration and exploitation, low computational complexity, and popularity in the deep learning fields.

By maximizing the acquisition function, a single-objective optimization algorithm is deployed to search for the optimal decision vector $ \textbf{x} ^*$. In this paper, we use BI-population CMA-ES to search for the optimal $ \textbf{x} ^*$ due to the favorable performance on BBOB function testbed [11]. The optimal decision vector $ \textbf{x} ^*$ will then be evaluated by the ‘true’ objective functions $ \textbf{f} $. The surrogate models will be retrained by using the updated $\textrm{X}$ and $\textbf{Y}$. The main loop (as shown in Algorithm 1 from line 6 to line 12) will not stop until a stopping criterion is satisfied. Common stopping criteria include a number of iterations, convergence velocity, etc. In this paper, we specify the stopping criterion ($T_c$) as a number of function evaluations.

3 Surrogate Models

This section introduces a common surrogate model, Kriging in Sect. 3.1; a canonical GP-based symbolic regression in Sect. 3.2; and the GP-based symbolic regression with Kriging (GP2) in Sect. 3.3. Additionally, the method of quantifying prediction uncertainty is introduced in each subsection.

3.1 Kriging

Kriging is a statistical interpolation method and has been proven to be a popular surrogate model to approximate noise-free data in computer experiments. Considering a realization of y at n locations, expressed as the following vector $\textbf{f}(\textbf{X}) =( {f}(\textbf{x}^{(1)}) \cdots , {f}(\textbf{x}^{(n)}) )$ and $\textbf{X}=\{\textbf{x}^{(1)}, \cdots , \textbf{x}^{(n)}\} \subseteq \mathcal {X}$, Kriging assumes $\textbf{f}(\textbf{X})$ to be a realization of a random process f of the following form [5, 13]:

$$\begin{aligned} f(\textbf{x}) = \mu (\textbf{x}) + \epsilon (\textbf{x}), \end{aligned}$$

(2)

where $\mu (\textbf{x})$ is the estimated mean value and $\epsilon (\textbf{x})$ is a realization of a Gaussian process with zero mean and variance $\sigma ^2$.

The regression part $\mu (\textbf{x})$ approximates the function f(.) globally. The Gaussian process $\epsilon (\textbf{x})$ takes local variations into account and estimates the uncertainty quantification. The correlation between the deviations at two decision vectors ($\textbf{x}$ and $\mathbf {x'}$) is defined as:

$$ Corr[\epsilon (\textbf{x}),\epsilon (\mathbf {x'})] = R(\textbf{x},\mathbf {x'}) = \prod _{i=1}^m R_i(x_i,x_i'), $$

where R(., .) is the correlation function that decreases with the distance between two points ($ \textbf{x} \text { and } \textbf{x} '$).

It is common practice to use a Gaussian correlation function (a.k.a., a squared exponential kernel):

$$ R(\textbf{x},\mathbf {x'}) = \prod ^m_{i=1} \text {exp}(-\theta _i(x_i - x'_i)^2), $$

where $\theta _i \ge 0$ are parameters of the correlation model to reflect the variables’ importance [21]. The covariance matrix can then be expressed through the correlation function: $Cov({\epsilon (\textbf{x})}) = \sigma ^2 \mathbf {\Sigma }$, where $ \mathbf {\Sigma }_{i,j}=R(\textbf{x}_i,\textbf{x}_j)$.

When $\mu (\textbf{x})$ is assumed to be an unknown constant, the unbiased prediction is called ordinary Kriging (OK). In OK, the Kriging model determines the hyperparameters $\mathbf {\theta } = (\theta _1, \theta _2, \cdots , \theta _n)$ by maximizing the likelihood over the observed dataset. The expression of the likelihood function is: $L = -\frac{n}{2}\ln ({\sigma }^2) - \frac{1}{2}\ln (|\mathbf {\Sigma }|) $.

Uncertainty Quantification: The maximum likelihood estimations of the mean $\hat{\mu }$ and the variance $\hat{\sigma }^2$ can be expressed by: $ \hat{\mu } = \frac{\textbf{1}^{\top }_n \mathbf {\Sigma }^{-1} \textbf{y}}{\textbf{1}^{\top }_n \mathbf {\Sigma }^{-1} \textbf{1}_n} $ and $ \hat{\sigma }^2 = \frac{1}{n}(\textbf{y} - \textbf{1}_n\hat{\mu })^{\top }\mathbf {\Sigma }^{-1}(\textbf{y}-\textbf{1}_n\hat{\mu }) $. Then the prediction of the mean and the variance of an OK’s prediction at a target point $\textbf{x}^t$ can be derived as follows [13, 20]:

$$\begin{aligned} \mu _{ok}(\textbf{x}^t) =&\hat{\mu } + \textbf{c}^{\top } \mathbf {\Sigma }^{-1} (\textbf{y} - \hat{\mu }\textbf{1}_n), \end{aligned}$$

(3)

$$\begin{aligned} \sigma _{ok}^2(\textbf{x}^t) =\ {}&\hat{\sigma }^2[1 - \textbf{c}^{\top } \mathbf {\Sigma }^{-1}\textbf{c} + \frac{1-\textbf{c}^{\top }\Sigma ^{\top }\textbf{c}}{\textbf{1}_n^{\top }\Sigma ^{-1}\textbf{1}_n}], \end{aligned}$$

(4)

where $\textbf{c} = (Corr(y(x^t),y(x_1)), \cdots , Corr(y(x^t),y(x_n)))^{\top }$.

3.2 Genetic Programming Based Symbolic Regression

Genetic programming (GP) [15] is a typical evolutionary algorithm to solve optimization problems that can be formulated as: $\mathop {\mathrm {arg\,min}}\limits _{\textbf{x}} f(\textbf{x}), \textbf{x} \in \mathcal {X}$, where $\textbf{x} = (x_1, \cdots , x_n)$ represents a decision vector (also known as individual) in evolutionary algorithms (EAs). Similar to other EAs, GP evolves a population of solution candidates Pop(.) by following the principle of the survival of the fittest and utilizing biologically-inspired operators. Algorithm 2 is the pseudocode of GP, where Variation operator includes crossover/recombination and mutation operators.

The feature that distinguishes GP from other evolutionary algorithms is the variable-length representation for $\textbf{x}$. The search space of GP-based symbolic regression problems is a union of a function space $\mathcal {F}$ and a terminal space $\mathcal {T} = \{\mathcal {S} \cup \mathbb {R} \} $, where $\mathcal {S}$ represents a variable symbol space (e.g., $x_1$) and $c \in \mathbb {R}$ represents a constant number. The function set $\mathcal {F}$ includes available symbolic functions (e.g., $\{+, -, \times , \div , \exp , \log \}$).

To distinguish the objective function and decision variables in MOPs, $f_{sr}(\cdot )$ and $\textbf{t}$ are used to represent the objective function in GP-based symbolic regression and a GP individual for symbolic regression problems, respectively. The objective function $f_{sr}(\cdot )$ can be any performance metrics for supervised learning problems. For instance, mean squared error (MSE), the Pearson correlation coefficient (PCC), Coefficient of determination (R2), etc. In this paper, MSE, PCC, and R2 are used as the objective functions in GP-based symbolic regression, and they are defined as:

$\text{ MSE } = \frac{1}{n}\sum _{i=1}^{n}(y_i - \hat{y}_i)^2$, $\text{ PCC } = \frac{ \sum _{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}}) }{\sqrt{\sum _{i=1}^{n}(y_i-\bar{y})^2}\sqrt{\sum _{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2}}$, $\text{ R2 } = 1 - \frac{\sum _{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum _{i=1}^{n}(y_i - \bar{y})^2}$,

where n is the number of samples, $y_i$ and $\hat{y_i}$ represent the real response and predicted response of the i-th sample, $\bar{y} = \frac{1}{n}\sum _{i=1}^n y_i$, and $\bar{\hat{y}} = \frac{1}{n}\sum _{i=1}^n \hat{y}_i$.

Therefore, a GP-based symbolic regression problem in this paper can be well formulated as:

$$\begin{aligned}&\mathop {\mathrm {arg\,max}}\limits _{\textbf{t}} f_{sr}(\textbf{t}), \quad t_i \in \{\mathcal {F}\cup \mathcal {T} \}|_{i=1,2,\cdots } \qquad \text{ if } \quad f_{sr} \in \{\text{ PCC }, \text{ R2 }\} \text{ or } \\&\mathop {\mathrm {arg\,min}}\limits _{\textbf{t}} f_{sr}(\textbf{t}), \quad t_i \in \{\mathcal {F}\cup \mathcal {T} \}|_{i=1,2,\cdots } \qquad \text{ if } \quad f_{sr} = \text{ MSE }. \end{aligned}$$

Uncertainty Quantification: Similar to other bootstrapping techniques, each individual in the population of the GP-based symbolic regression is trained on a list of bootstrap samples. The variance error is estimated according to the error on the bootstrap samples. In this paper, the detailed procedure of bootstrapping for GP-based symbolic regression is as follows:

1.
Resampling $n_b \le n$ samples (to form a new dataset $D'$) from original training dataset D of n samples;
2.
Train the GP-based symbolic regression model ($ \textbf{t} $) based on dataset $D'$;
3.
Repeat (1) to (2) for $n_{pop}$ times;
4.
The bagging prediction and variance at a query vector $ \textbf{x} ^t$ are then computed by the mean and variance over all $n_{pop}$ predictions:
$$\begin{aligned}&\mu _{sr}( \textbf{x} ^t) = \frac{1}{n_{pop}}\sum _{i=1}^{n_{pop}} \textbf{t} _i( \textbf{x} ^t) \end{aligned}$$
(5)

$$\begin{aligned}&\sigma _{sr}^2( \textbf{x} ^t) = \frac{1}{n_{pop}} \sum _{i=1}^{n_{pop}} \big ( \textbf{t} _i( \textbf{x} ^t) - \mu _{sr}( \textbf{x} ^t ) \big ) ^2 \end{aligned}$$
(6)

3.3 GP-Based Symbolic Regression with Kriging

GP-based symbolic regression with Kriging (GP2) integrates GP-based symbolic regression and Kriging by summing a best-fitted expression ($symreg( \textbf{x} )$) on a training dataset and a residual term. The prediction of GP2 can be expressed as:

$$\begin{aligned} \text {GP}^2_{symreg}(\textbf{x}) = symreg(\textbf{x}) + res(\textbf{x}), \end{aligned}$$

(7)

where $res(\textbf{x})$ is the residual-regulating term that follows a normal distribution and is obtained by Kriging.

The pseudocode of the GP2 is shown in Algorithm 3. The algorithm starts from generating the training dataset D by computing the real responses $\textbf{Y}$ of the decision set X. The best-fitted symbolic expression $symreg( \textbf{x} )$ on dataset D is generated by GP-based symbolic regression (at line 4 in Algorithm 3). The residuals of the best-fitted symbolic expression $symreg( \textbf{x} )$ on $\textbf{X}$ can be computed by $\textbf{Y} - \hat{\textbf{Y}}$. $\textbf{X}_{gp}$ (line 6 in Algorithm 3) represents the variables that affect the residuals of $symreg( \textbf{x} )$. The most intuitive way is to set $\textbf{X}_{gp}$ as $\textbf{X}$. In addition, it is also reasonable to set $\textbf{X}_{gp}$ as $\hat{\textbf{Y}}$ and $\{ \textbf{X} \cup \hat{\textbf{Y}}\}$, due to the fact that residual of $symreg( \textbf{x} )$ highly depends on the the prediction $\hat{\textbf{Y}}$. Then, Kriging models the relationship between the $\textbf{X}_{gp}$ and the residuals of $symreg( \textbf{x} )$ in the dataset D (at line 7 in Algorithm 3). The residual of $symreg( \textbf{x} )$, noted as $res( \textbf{x} )$, also depends on $\textbf{X}$, as $\textbf{X}_{gp}$ depends on $\textbf{X}$. The final surrogate model built by the GP2, noted as $\text {GP}^2_{symreg}( \textbf{x} )$, is the sum of $symreg( \textbf{x} )$ and its corresponding residual distribution $res( \textbf{x} )$.

Uncertainty Quantification: The prediction of the symbolic expression of GP2 is a sum of symbolic expression and residual term that follows a normal distribution. The prediction and variance at a query vector $ \textbf{x} ^t$ by using the GP2 strategy are:

$$\begin{aligned} \mu _{GP2}( \textbf{x} ^t)&= symreg( \textbf{x} ^t) + \mu _{ok}( \textbf{x} ^t), \end{aligned}$$

(8)

$$\begin{aligned} \sigma ^2_{GP2}( \textbf{x} ^t)&= s_{sr}^2( \textbf{x} ^t) + \sigma ^2_{ok}( \textbf{x} ^t), \end{aligned}$$

(9)

where $s_{sr}^2( \textbf{x} ^t)$ is the variance of the symbolic regression expression $symreg( \textbf{x} )$ at $ \textbf{x} ^t$.

In this paper, we argued $s_{sr}^2( \textbf{x} ) = 0$ due to the fact that $symreg( \textbf{x} )$ is an explicit mathematical expression and there is no variance to $symreg( \textbf{x} )$ at an arbitrary point $ \textbf{x} $. Therefore, Eq. (9) can be simplified as $\sigma ^2_{GP2}( \textbf{x} ^t) = \sigma ^2_{ok}( \textbf{x} ^t)$.

4 Acquisition Functions

The Hypervolume Indicator (HV), introduced in [22], is an essentially unary indicator for evaluating the quality of a Pareto-front approximation set in MOPs due to its Pareto compliant property^{Footnote 3} and no requirement of in-advance knowledge of the Pareto front. The maximization of HV leads to a high-qualified and diverse Pareto-front approximation set. The Hypervolume Indicator is defined as follows:

$$\begin{aligned} {\text {HV}} (\mathcal{P}\mathcal{F},\textbf{r}) :=\lambda _m(\cup _{ \textbf{y} \in \mathcal{P}\mathcal{F}}[ \textbf{y} , \textbf{r}]), \end{aligned}$$

(10)

where $\lambda _m$ is the Lebesgue measure and the reference point $ \textbf{r} $ clips the space to ensure a finite volume.

The Hypervolume improvement (HVI) measures the improvement of a new solution in terms of HV and is defined as:

$$\begin{aligned} \text{ HVI } (\textbf{y}; \mathcal{P}\mathcal{F}, \textbf{r}) := \text{ HV }(\mathcal{P}\mathcal{F}\cup \{\textbf{y}\}; \textbf{r}) - \text{ HV }(\mathcal{P}\mathcal{F}; \textbf{r}) \end{aligned}$$

The Upper confidence bound of the HVI (UCB) [6, 17] is an indicator based on the HVI and prediction uncertainty in a naïve way. Denoting $ \boldsymbol{\mu } ( \textbf{x} )=(\mu _1( \textbf{x} ), \ldots , \mu _m( \textbf{x} ))$ as the prediction and ${ \boldsymbol{\sigma } }( \textbf{x} )=({\sigma }_1( \textbf{x} ),\ldots , {\sigma }_m( \textbf{x} ))$ as the uncertainty of m independent surrogate models, the hypervolume improvement of upper confidence bound of a solution at $ \textbf{x} $, i.e., ${ \boldsymbol{\mu } }( \textbf{x} ) - \omega { \boldsymbol{\sigma } }( \textbf{x} )$, $\omega \in \mathbb {R}_{\ge 0}$ specifies the confidence level:

$$\begin{aligned} \text{ UCB }( \textbf{x} ; \mathcal{P}\mathcal{F}, \textbf{r} , \omega ) := \text{ HVI }({ \boldsymbol{\mu } }( \textbf{x} ) - \omega { \boldsymbol{\sigma } }( \textbf{x} ); \mathcal{P}\mathcal{F}, \textbf{r} ). \end{aligned}$$

(11)

The parameter $\omega $ controls the balancing weight between exploration and exploitation, and a positive $\omega $ rewards the high prediction uncertainty.

Example 1

Two examples of the landscape of HVI and UCB are shown in Fig. 1. The Pareto-front approximation set is denoted as $\mathcal{P}\mathcal{F}= \{{\textbf{y}^{(1)}} = (1,3),$ $\textbf{y}^{(2)}=(2, 2), \textbf{y}^{(3)} = (3, 1)\}$, the reference point is $\textbf{r}=(4,4)$, and the same uncertainty ($\boldsymbol{\sigma } = (1,1)$) is used for every point in the objective space, and $\omega $ is set as 1 for the UCB. The left and right figures show the landscape of HVI and UCB, respectively, in the objective space that ranges from −1 to 5 for both $f_1$ and $f_2$.

5 Experiments

In this paper, the algorithms are performed on ZDT series problems [23] with the parameters of $d=5$, $m=2$, and a fixed reference point $\textbf{r}=[15,15]$. In this paper, four categories of surrogate models are compared, including Kriging (in Alg. I), five different GP2 variants by using different configurations (in Alg. II − VI), random forest (in Alg. VII), and GP-based symbolic regression (in Alg. VIII and IX). Each experiment consists of 15 independent runs.

5.1 Parameter Settings

The common parameters in all algorithms include DoE size $\eta =30$ and the function evaluation budget $T_c=200$. The stopping criteria of $f_{sr}$ is 0.1, 0.99, 0.99 for MSE, PCC, and R2 metrics, respectively. The function evaluation of genetic programming in GP-based symbolic regression and GP2 is 4E3 [19] and 2E4 [18], respectively.

The parameter $\omega $ of UCB is set as $\sqrt{g /\log (g)}$ [14], where g is the number of function evaluation. The function space $\mathcal {F}$ for symbolic regression is $\{+, -, \times , \div , \sin , \cos , \text{ pow } \}$. The acquisition function is the UCB and is optimized by CMA-ES in all algorithms. The hyper-parameters of CMA-ES follow: the number of restarts is 3, the number of initialized population size is 7, the maximum generation number is 2000, and the remaining parameters are set in default.

Table 1 shows other parameter settings in each algorithm, where $n_{pop}$ represents the number of estimators in a random forest. The other parameters (e.g., $p_m$, $p_c$, etc.) are set as default values in GPLearn [18].

Table 1. Algorithm Parameter Configuration

Full size table

5.2 Empirical Results

In this section, we evaluate the Pareto-front approximation sets by using the HV indicator. Figure 2 shows the HV convergence curves of 15 independent runs on ZDT problems. To depict a slight difference in HV values, we discuss the average $\log (\varDelta HV)$ convergence of HV relative to the reference HV value of 250 in this paper. Notice that the scale varies in different problems since all problems have different ranges of the Pareto fronts.

From Fig. 2, it is clear to see that Alg. VII of using the random forest as the surrogate model yields the worst experiment results w.r.t. mean HV and its corresponding standard deviation. Compared with Alg. I and other algorithms, the poor performance of Alg. VII is expected, because of the introduction of Kriging in other algorithms (except for Alg. VIII and IX), of which prediction is more accurate than that of the random forest when the training dataset has few samples. Compared with GP-based symbolic regression algorithms (Alg. VIII and IX), which use ‘plain’ GP, Alg. VII still can not outperform them, due to the efficient optimization mechanism in genetic programming.

On single-modal convex Pareto-front problem ZDT1, Fig. 2 shows that GP-based symbolic regression algorithms (Alg. VIII and IX) are only slightly better than random forest in Alg. VII, but are worse than the other algorithms of using Kriging. The performances of the algorithms (Alg. I – VI) are similar w.r.t. mean HV value over 15 runs. In addition, Alg. I, which merely uses Kriging as the surrogate model, converges the fastest at the early stage of optimization among all the test algorithms. Similarly, Alg. I also converges much faster than most of the test algorithms on the other two single-modal concave Pareto-front problems (ZDT2 and ZDT6). An explanation is the relatively simple landscape of these three test problems.

On discontinued Pareto-front problem ZDT3 and a multi-modal Pareto-front problem ZDT4, the algorithms’ performance differs greatly. Firstly, the convergence of Alg. I is much slower than the other algorithms at the beginning stage of optimization. Since a relatively large^{Footnote 4} reference point is used in this paper, the HVI introduced by extreme solutions is definitely much larger than that of a knee point. Thus, we can conclude that the strategy of Kriging with UCB is difficult to search for extreme solutions at the early stage of optimization. Consequently, this will slow the convergence of Alg. I. Additionally, another intuitive explanation is that the landscape of the UCB based on Kriging models on ZDT3 and ZDT4 is not smooth enough to allow the optimizer CMA-ES to jump out of local optima. In other words, the UCB is hard to measure slight differences around the local optima based on Kriging’s prediction because of the landscape’s roughness.

Table 2. Algorithms ranking.

Full size table

All algorithms’ ranking (w.r.t. mean HV value over 15 runs) on each problem is shown in Table 2. From this table, it is easy to conclude that Alg. V, which utilizes GP2 method of $ \textbf{X} _{gp}=\{ \textbf{X} \cup \hat{ \textbf{Y} } \} $ and $f_{sr}=\text{ R2 }$, outperforms the other algorithms, and Alg. IV (GP2 method of $ \textbf{X} _{gp}=\{ \textbf{X} \} $ and $f_{sr}=\text{ R2 }$) is the second best algorithm. Since the search space of $\textbf{x}$ of surrogate models in Alg. IV is the same search space of the objective function $ \textbf{f} ( \textbf{x} )$ of MOPs, Alg. IV is more reasonable to interpret the residual regulation term $res( \textbf{x} )$ (in Eq. (7)) when it is compared with Alg. V. Additionally, there is no significant difference between the performance of Alg. IV and Alg. V on the test problems, see Table 3 for details. Considering these two aspects, it is recommended to use Alg. IV when the acquisition function is the UCB.

Table 3. The pairwise Wilcoxon’s Rank-Sum test ($+/\approx /-$) matrix at a 0.05 significance level was performed among all the test algorithms, where algorithms in all the columns (except the first column) are pairwise compared with the algorithms in the first column.

Full size table

Table 3 shows the performance of pairwise Wilcoxon’s Rank-Sum test ($+/\approx /-$) matrix among all test algorithms on the test problems. The sum of Wilcoxon’s Rank-Sum test (sum of $+/\approx /-$) confirms that Alg. V performs best, as it significantly outperforms 23 pairwise instances between algorithms and problems. Besides, compared with Algo. I, most of the GP2 methods (Alg. II – Alg. V) perform at least similar to or significantly better than Alg. I.

6 Conclusion and Future Work

This paper utilizes two different techniques to quantify the prediction uncertainty of GP-based symbolic regression, namely, a bootstrapping technique in GP-based symbolic regression and using the Kriging to estimate the residual uncertainty in GP2. Two surrogate modeling techniques, GP-based symbolic regression, and GP2 of different configurations, are compared with Kriging and random forest on five state-of-the-art MOP benchmarks. The statistical results show the effectiveness of GP2 w.r.t. mean $\varDelta $HV convergence. Among five different variants of GP2, we recommend the usage of the GP2 by employing $ \textbf{X} $ or $\{ \textbf{X} \cup \mathbf {\hat{Y}} \}$ to predict residuals by setting R2 as the fitness function in genetic programming, because of their good statistical performance.

For future work, it is recommended to compare the performance of bootstrapping and Jackknife to estimate the variance of GP-based symbolic regression. It is also worthwhile to investigate the performance of the proposed methods on real-world applications.

Notes

1.
Constraints are not considered in this paper.
2.
Recently, some work has considered the dependency in constructing the surrogate models, such as the so-called dependent Gaussian processes [3]. This paper does not consider a dependency between objectives for simplicity.
3.
$\forall A,B\subseteq \mathbb {R}^m: \text{ HV }(A, \textbf{r}) > \text{ HV }(B,\textbf{r})\implies A \prec B$.
4.
The ideal Pareto fronts on ZDT problems hold: $\max \big ({f_{1}( \textbf{x} )}\big ) \le 1, \max \big ( {f_{2}( \textbf{x} )} \big ) \le 1$.

References

Affenzeller, M., Winkler, S.M., Kronberger, G., Kommenda, M., Burlacu, B., Wagner, S.: Gaining deeper insights in symbolic regression. In: Riolo, R., Moore, J.H., Kotanchek, M. (eds.) Genetic Programming Theory and Practice XI. GEC, pp. 175–190. Springer, New York (2014). https://doi.org/10.1007/978-1-4939-0375-7_10
Chapter Google Scholar
Agapitos, A., Brabazon, A., O’Neill, M.: Controlling overfitting in symbolic regression based on a bias/variance error decomposition. In: Coello, C.A.C., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M. (eds.) PPSN 2012. LNCS, vol. 7491, pp. 438–447. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32937-1_44
Chapter Google Scholar
Álvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for Vector-Valued Functions. Review. Found. Trends Mach. Learn. 4(3), 195–266 (2012). https://doi.org/10.1561/2200000036
Andres, E., Salcedo-Sanz, S., Monge, F., Pellido, A.: Metamodel-assisted aerodynamic design using evolutionary optimization. In: EUROGEN (2011)
Google Scholar
$\check{Z}$ilinskas, A., Mockus, J.: On one Bayesian method of search of the minimum. Avtomatica i Vicheslitel’naya Teknika 4, 42–44 (1972)
Google Scholar
Emmerich, M.T.M., Yang, K., Deutz, A.H.: Infill criteria for multiobjective Bayesian optimization. In: Bartz-Beielstein, T., Filipič, B., Korošec, P., Talbi, E.-G. (eds.) High-Performance Simulation-Based Optimization. SCI, vol. 833, pp. 3–16. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-18764-4_1
Chapter Google Scholar
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J.T., Blum, M., Hutter, F.: Auto-sklearn: efficient and robust automated machine learning. In: Hutter, F., Kotthoff, L., Vanschoren, J. (eds.) Automated Machine Learning. TSSCML, pp. 113–134. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05318-5_6
Chapter Google Scholar
Fitzgerald, J., Azad, R.M.A., Ryan, C.: A bootstrapping approach to reduce over-fitting in genetic programming. In: Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 1113–1120. GECCO 2013 Companion, Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2464576.2482690
Fleck, P., et al.: Box-type boom design using surrogate modeling: introducing an industrial optimization benchmark. In: Andrés-Pérez, E., González, L.M., Periaux, J., Gauger, N., Quagliarella, D., Giannakoglou, K. (eds.) Evolutionary and Deterministic Methods for Design Optimization and Control With Applications to Industrial and Societal Problems. CMAS, vol. 49, pp. 355–370. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-89890-2_23
Chapter Google Scholar
Folino, G., Pizzuti, C., Spezzano, G.: Ensemble techniques for parallel genetic programming based classifiers. In: Ryan, C., Soule, T., Keijzer, M., Tsang, E., Poli, R., Costa, E. (eds.) EuroGP 2003. LNCS, vol. 2610, pp. 59–69. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36599-0_6
Chapter MATH Google Scholar
Hansen, N.: Benchmarking a bI-population CMA-ES on the BBOB-2009 function testbed. In: Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers. pp. 2389–2396. GECCO 2009, Association for Computing Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1570256.1570333
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Coello, C.A.C. (ed.) LION 2011. LNCS, vol. 6683, pp. 507–523. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25566-3_40
Chapter Google Scholar
Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. J. Global Optim. 13(4), 455–492 (1998)
Article MathSciNet MATH Google Scholar
Lukovic, M.K., Tian, Y., Matusik, W.: Diversity-guided multi-objective Bayesian optimization with batch evaluations. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 17708–17720. Curran Associates, Inc. (2020)
Google Scholar
Koza, J.R.: Hierarchical genetic algorithms operating on populations of computer programs. In: Sridharan, N.S. (ed.) Proceedings of the Eleventh International Joint Conference on Artificial Intelligence IJCAI-89, vol. 1, pp. 768–774. Morgan Kaufmann, Detroit, MI, USA (20–25 Aug 1989). https://www.genetic-programming.com/jkpdf/ijcai1989.pdf
Li, N., Zhao, L., Bao, C., Gong, G., Song, X., Tian, C.: A real-time information integration framework for multidisciplinary coupling of complex aircrafts: an application of IIIE. J. Ind. Inf. Integr. 22, 100203 (2021). https://doi.org/10.1016/j.jii.2021.100203
Article Google Scholar
Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian process optimization in the bandit setting: No regret and experimental design. In: Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022. ICML2010, Omnipress, Madison, WI, USA (2010)
Google Scholar
Stephens, T.: gplearn: Genetic programming in python (2019). https://gplearn.readthedocs.io/en/stable/
Yang, K., Affenzeller, M.: Quantifying uncertainties of residuals in symbolic regression via kriging. Procedia Comput. Sci. 200, 954–961 (2022). 3rd International Conference on Industry 4.0 and Smart Manufacturing. https://doi.org/10.1016/j.procs.2022.01.293
Yang, K., Emmerich, M., Deutz, A., Bäck, T.: Efficient computation of expected hypervolume improvement using box decomposition algorithms. J. Global Optim. 75(1), 3–34 (2019). https://doi.org/10.1007/s10898-019-00798-7
Article MathSciNet MATH Google Scholar
Yang, K., Emmerich, M., Deutz, A., Bäck, T.: Multi-objective Bayesian global optimization using expected hypervolume improvement gradient. Swarm Evol. Comput. 44, 945–956 (2019). https://doi.org/10.1016/j.swevo.2018.10.007
Article Google Scholar
Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)
Article Google Scholar
Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(2), 173–195 (2000)
Article Google Scholar

Download references

Acknowledgements

This work is supported by the Austrian Science Fund (FWF – Der Wissenschaftsfonds) under the project (I 5315, ‘ML Methods for Feature Identification Global Optimization).

Author information

Authors and Affiliations

School of Informatics, Communications and Media University of Applied Sciences Upper Austria Softwarepark 11, 4232, Hagenberg, Austria
Kaifeng Yang & Michael Affenzeller

Authors

Kaifeng Yang
View author publications
You can also search for this author in PubMed Google Scholar
Michael Affenzeller
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kaifeng Yang .

Editor information

Editors and Affiliations

Leiden University, Leiden, The Netherlands
Michael Emmerich
Leiden University, Leiden, The Netherlands
André Deutz
Leiden University, Leiden, The Netherlands
Hao Wang
Leiden University, Leiden, The Netherlands
Anna V. Kononova
TH Köln, Gummersbach, Germany
Boris Naujoks
University of Exeter, Exeter, UK
Ke Li
University of Jyvaskyla, Jyvaskyla, Finland
Kaisa Miettinen
De Montfort University, Leicester, UK
Iryna Yevseyeva

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yang, K., Affenzeller, M. (2023). Surrogate-assisted Multi-objective Optimization via Genetic Programming Based Symbolic Regression. In: Emmerich, M., et al. Evolutionary Multi-Criterion Optimization. EMO 2023. Lecture Notes in Computer Science, vol 13970. Springer, Cham. https://doi.org/10.1007/978-3-031-27250-9_13

Download citation

DOI: https://doi.org/10.1007/978-3-031-27250-9_13
Published: 09 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27249-3
Online ISBN: 978-3-031-27250-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Surrogate-assisted Multi-objective Optimization via Genetic Programming Based Symbolic Regression

Abstract

Similar content being viewed by others

Surrogate-assisted evolutionary algorithm for expensive constrained multi-objective discrete optimization problems

On Dealing with Uncertainties from Kriging Models in Offline Data-Driven Evolutionary Multiobjective Optimization

New Surrogate Approaches Applied to Meta-Heuristic Algorithms

Keywords

1 Introduction