1 Introduction

Preference-Based Optimization (PBO) algorithms seek the global solution of an optimization problem whose objective function is unknown and unmeasurable. Instead, the “goodness” of a decision vector is assessed by a human Decision-Maker (DM), who compares different samples (i.e. tunings of the decision vector) and states which of them he/she prefers. In this work, we consider queries to the DM that involve two options. The output of a query is the preference between the two calibrations of the decision vector expressed by the decision-maker (e.g. “this tuning is better than that one”) [3].

Preference-based optimization is closely related to industry practice. In the context of control systems, the performance achieved by a controller tuning is often evaluated by a decision-maker, who expresses his/her judgment by observing the behavior of the system under control. Multiple experiments must be performed until the DM is satisfied with the closed-loop performance. If a trial-and-error approach is adopted, then there is no guarantee that the final tuning is optimal in some sense. Moreover, the calibration process might be quite time-consuming, since possibly many combinations of the controller parameters need to be tested. PBO algorithms offer a better alternative to the trial-and-error methodology because they drive the experiments by exploiting the information carried by the preferences expressed by the DM. The goal is to seek the calibration of the decision vector that is most preferred by the decision-maker while also minimizing the number of queries, thus performing fewer experiments. Successful applications of preference-based optimization procedures include [39], where the authors use algorithm GLISp [3] to tune the controller for a continuous stirred tank reactor and for autonomous driving vehicles. The same algorithm has been employed in [30] to calibrate the parameters of a velocity planner for robotic sealing tasks.

Although preference-based optimization is a valuable tool for solving those calibration tasks that involve human decision-makers, its applicability is broader. In multi-objective optimization, PBO can be employed to scalarize the multi-objective optimization problem into a single-objective one by tuning the weights for the weighted sum method, or to choose which, among a set of Pareto optimal solutions, is best suited for the DM [3, 21]. PBO can also be used to perform active preference learning. In general, preference learning methods estimate a predictive model that describes the tastes of a human decision-maker [12]. Preference-based optimization methods also rely on a predictive model (called the surrogate model), which is updated after each query to the DM, but its prediction accuracy is not the main concern. Rather, the sole purpose of the surrogate model is to drive the search towards the most preferred tuning of the decision vector.

The preference-based optimization problem can be formalized by applying notions of utility theory [27]. Suppose that there exists a binary relation, called the preference relation, which describes the tastes of the decision-maker (i.e. the outputs of the queries). If the preference relation of the DM exhibits certain properties, then it is possible to represent it with a continuous (latent) utility function [8], which assigns an abstract degree of “goodness” to all possible calibrations of the decision vector. The tuning that is most preferred by the decision-maker is the one with the highest utility and corresponds to the optimizer of the PBO problem.

As previously mentioned, many preference-based optimization algorithms rely on a surrogate model, which is an approximation of the latent utility function built from the preference information at hand. In general, any procedure that uses a surrogate model when solving an optimization problem is said to be a surrogate-based (or response surface) method. These algorithms are mostly used for black-box optimization problems, where the objective function is unknown but measurable (see [19, 36]). Nevertheless, surrogate-based methods can also be employed for PBO problems, provided that the surrogate model is properly defined. In practice, preference-based response surface methods iteratively propose new samples to be compared to the available best candidate by trading off exploitation (or local search), i.e. selecting samples that are likely to offer an improvement based on the data at our disposal, and exploration (or global search), i.e. finding samples in regions of the decision space of which we have little to no knowledge. Typically, new candidate samples are sought by minimizing or maximizing an acquisition function that encapsulates these two aspects.

Most surrogate models rely either on Gaussian Processes (\(\mathcal{G}\mathcal{P}\)s) [38], giving rise to preferential Bayesian optimization [13], or on Radial Basis Functions (RBFs) [15]. For example, in [7] the authors propose a predictive model for the latent utility function that is based on \(\mathcal{G}\mathcal{P}\)s. The latter is used as a surrogate model in [6] to carry out preference-based optimization. Alternative preferential Bayesian optimization algorithms are proposed in [4] and in [13]. Recently, in [3] the authors developed a preference-based optimization method, called GLISp, which is based on an RBF approximation of the latent utility function.

In this work, we propose an extension of the GLISp [3] algorithm (that we will refer to as GLISp-r) which is more robust in finding the global solution of the preference-based optimization problem. To do so, we:

  1.

    Address some limitations of the acquisition function used in [3], which can cause the procedure to miss the global solution and get stuck on local optima;

  2.

    Dynamically vary the exploration–exploitation trade-off weight, allowing GLISp-r to alternate between local and global search. This is commonly done in the black-box optimization framework, see for example [15, 28, 29, 37], but it has not been tried for GLISp [3];

  3.

    Provide a proof of global convergence for GLISp-r, further motivating its robustness. Currently, no such proof is available for GLISp [3] (or for any of the most popular preference-based response surface methods, cf. also [4, 6, 13]).

Regarding this last point, as a minor contribution of this work, we formalize the preference-based optimization problem from a utility theory perspective, allowing us to analyze the existence of a solution and, ultimately, to prove the global convergence of GLISp-r.

This paper is organized as follows. In Sect. 2, we introduce the preference-based optimization problem. Section 3 addresses how to build the surrogate model (as in GLISp [3]) and how to look for the next sample to evaluate, keeping in mind the exploration–exploitation dilemma. We also briefly cover the exploration function used in [3]. The latter is thoroughly analyzed in Sect. 4, where we propose a solution to the shortcomings that we have encountered in GLISp [3]. Section 5 describes algorithm GLISp-r and addresses its convergence. Then, Sect. 6 compares the performances of GLISp-r with those of GLISp [3] and C-GLISp [40], a revised version of the latter method by the same authors, on several benchmark optimization problems. Finally, Sect. 7 gives some concluding remarks.

2 Problem formulation

Let \(\varvec{x} \in \mathbb {R}^n, \varvec{x} = \begin{bmatrix}x^{(1)}&\ldots&x^{(n)}\end{bmatrix}^{\top },\) be the n-dimensional decision vector and suppose that we are interested in finding its calibration that is most preferred by the decision-maker within a subset of \(\mathbb {R}^{n}\), namely \(\Omega \subset \mathbb {R}^{n}\). In particular, we define the constraint set \(\Omega \) as:

$$\begin{aligned} \Omega =\left\{ \varvec{x}: \varvec{l}\le \varvec{x}\le \varvec{u}, \varvec{g_{ineq}}(\varvec{x})\le \varvec{0}_{q_{ineq}}, \varvec{g_{eq}}(\varvec{x})=\varvec{0}_{q_{eq}}\right\} . \end{aligned}$$
(1)

In (1), \(\varvec{l}, \varvec{u} \in \mathbb {R}^{n}, \varvec{l} \le \varvec{u},\) are the lower and upper bounds on the decision vector, while \(\varvec{g_{ineq}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{q_{ineq}}\) and \(\varvec{g_{eq}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{q_{eq}}\) are the constraint functions that define the inequality and equality constraints, of which there are \(q_{ineq} \in \mathbb {N} \cup \left\{ 0 \right\} \) and \(q_{eq} \in \mathbb {N} \cup \left\{ 0 \right\} \) respectively. Notation-wise, \(\varvec{0}_{q_{ineq}}\) represents the \(q_{ineq}\)-dimensional zero column vector (and similarly for \(\varvec{0}_{q_{eq}}\)). We suppose that: (i) all of the constraints in (1) are completely known and (ii) \(\Omega \) includes, at least, the bound constraints \(\varvec{l}\le \varvec{x}\le \varvec{u}\) (the remaining equality and inequality constraints can be omitted, resulting in \(q_{ineq} = q_{eq} = 0\)).

In this work, we formalize the preference-based optimization problem from a utility theory [27] perspective. We start by introducing preference relations, a particular kind of binary relations that play a key role in PBO.

Definition 1

(Binary relation [27]) Consider the constraint set \(\Omega \) in (1); we define a generic binary relation \(\mathcal {R}\) on \(\Omega \) as a subset \(\mathcal {R} \subseteq \Omega \times \Omega \).

Notation-wise, given two samples \(\varvec{x}_{i}, \varvec{x}_{j} \in \Omega \), we denote the ordered pairs for which the binary relation holds, \(\left( \varvec{x}_{i}, \varvec{x}_{j}\right) \in \mathcal {R}\), as \(\varvec{x}_{i} \mathcal {R} \varvec{x}_{j}\).

Definition 2

(Preference relation [27]) A preference relation, \(\succsim \subseteq \Omega \times \Omega \), is a binary relation that describes the tastes of a human decision-maker.

In the context of utility theory, \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) implies that the DM with preference relation \(\succsim \) on \(\Omega \) deems the alternative \(\varvec{x}_{i}\) at least as good as \(\varvec{x}_{j}\). We say that the decision-maker is rational (in an economics sense) if his/her preference relation exhibits certain properties, as highlighted by the following Definition.

Definition 3

(Rational decision-maker [27]) Consider a decision-maker with preference relation \(\succsim \) on \(\Omega \). We say that the DM is rational if \(\succsim \) is a reflexive, transitive and complete binary relation on \(\Omega \).

Now, we briefly review each of the aforementioned properties and give some insights as to why they characterize the rationality of an individual (see [8, 10, 27] for more details):

  • Reflexivity of \(\succsim \) on \(\Omega \) implies that, for the DM, any alternative is as good as itself, i.e. \(\varvec{x}_{i} \succsim \varvec{x}_{i}\) for each \(\varvec{x}_{i} \in \Omega \);

  • A decision-maker whose preference relation \(\succsim \) on \(\Omega \) is transitive is able to express his/her preferences coherently since if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) and \(\varvec{x}_{j} \succsim \varvec{x}_{k}\) hold, then \(\varvec{x}_{i} \succsim \varvec{x}_{k}\), for any \(\varvec{x}_{i}, \varvec{x}_{j}, \varvec{x}_{k} \in \Omega \);

  • Completeness of \(\succsim \) on \(\Omega \) implies that the DM is able to express a preference between any two alternatives in \(\Omega \), i.e. either \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) or \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) holds for each \(\varvec{x}_{i}, \varvec{x}_{j} \in \Omega \).

The preference relation \(\succsim \) on \(\Omega \) of a rational decision-maker is usually “split” into two transitive binary relations [27]:

  • The strict preference relation \(\succ \) on \(\Omega \), i.e. \(\varvec{x}_{i} \succ \varvec{x}_{j}\) if and only if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) but not \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) (\(\varvec{x}_{i}\) is “better than” \(\varvec{x}_{j}\) or, equivalently, \(\varvec{x}_j\) is “worse than” \(\varvec{x}_i\)), and

  • The indifference relation \(\sim \) on \(\Omega \), i.e. \(\varvec{x}_{i} \sim \varvec{x}_{j}\) if and only if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) and \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) (\(\varvec{x}_{i}\) is “as good as” \(\varvec{x}_{j}\)).

One last relevant property for preference relations is continuity.

Definition 4

(Continuous preference relation [27]) A preference relation \(\succsim \) on \(\Omega \) is continuous if the strict upper and lower \(\succsim \)-contour sets:

$$\begin{aligned} \mathcal {U}_{\scriptscriptstyle \succ }\left( \varvec{x}\right)&= \left\{ \tilde{\varvec{x}}: \tilde{\varvec{x}} \in \Omega , \tilde{\varvec{x}} \succ \varvec{x}\right\} \quad \text {and} \\ \mathcal {L}_{\scriptscriptstyle \succ }\left( \varvec{x}\right)&= \left\{ \tilde{\varvec{x}}: \tilde{\varvec{x}} \in \Omega , \varvec{x} \succ \tilde{\varvec{x}}\right\} \quad \text {respectively}, \end{aligned}$$

are open subsets of \(\Omega \) for each \(\varvec{x} \in \Omega \).

Intuitively speaking, if \(\succsim \) on \(\Omega \) is continuous and \(\varvec{x}_i \succ \varvec{x}_j\) holds, then an alternative \(\varvec{x}_k\) which is “very close” to \(\varvec{x}_j\) should also be deemed strictly worse than \(\varvec{x}_i\) by the decision-maker, i.e. \(\varvec{x}_i \succ \varvec{x}_k\).

Having defined the preference relation \(\succsim \) on \(\Omega \) thoroughly, we can finally state the goal of preference-based optimization:

$$\begin{aligned} \text {find } \varvec{x}^{*} \in \Omega \text { such that } \varvec{x}^{*}\succsim \varvec{x},\ \forall \varvec{x}\in \Omega . \end{aligned}$$
(2)

Formally, \(\varvec{x^{*}}\) is called the \(\succsim \)-maximum of \(\Omega \) [27], i.e. the sample that is most preferred by the decision-maker with preference relation \(\succsim \) on \(\Omega \). Concerning the existence of \(\varvec{x}^*\), we can state the following Proposition, which can be seen as a generalization of the Extreme Value Theorem [1] for preference relations.

Proposition 1

(Existence of a \(\succsim \)-maximum of \(\Omega \) [27]) A \(\succsim \)-maximum of \(\Omega \) is guaranteed to exist if \(\Omega \) is a compact subset of a metric space (in our case \(\Omega \subset \mathbb {R}^{n}\)) and \(\succsim \) is a continuous preference relation on \(\Omega \) of a rational decision-maker (see Definitions 3 and 4).

Proposition 1 will be relevant when proving the convergence of the proposed algorithm, in Sect. 5. Lastly, in order to write Problem (2) as a typical global optimization problem, we need to state one of the most important results in utility theory.

Theorem 1

(Debreu’s utility representation Theorem for \(\mathbb {R}^{n}\) [8]) Let \(\Omega \) be any nonempty subset of \(\mathbb {R}^{n}\) and \(\succsim \) be a preference relation on \(\Omega \) of a rational decision-maker (as in Definition 3). If \(\succsim \) on \(\Omega \) is continuous, then it can be represented by a continuous utility function \(u_{\scriptscriptstyle \succsim }:\Omega \rightarrow \mathbb {R}\) such that, for any \(\varvec{x}_i, \varvec{x}_j \in \Omega \):

$$\begin{aligned} \begin{aligned} \varvec{x}_i \succsim \varvec{x}_j \quad&\text {if and only if} \quad u_{\scriptscriptstyle \succsim }\left( \varvec{x}_i\right) \ge u_{\scriptscriptstyle \succsim }\left( \varvec{x}_j\right) , \\ \varvec{x}_i \succ \varvec{x}_j \quad&\text {if and only if} \quad u_{\scriptscriptstyle \succsim }\left( \varvec{x}_i\right) > u_{\scriptscriptstyle \succsim }\left( \varvec{x}_j\right) , \\ \varvec{x}_i \sim \varvec{x}_j \quad&\text {if and only if} \quad u_{\scriptscriptstyle \succsim }\left( \varvec{x}_i\right) = u_{\scriptscriptstyle \succsim }\left( \varvec{x}_j\right) . \end{aligned} \end{aligned}$$

Using Theorem 1, we can build an optimization problem to find the \(\succsim \)-maximum of \(\Omega \). In particular, we define the scoring function, \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\), as \(f\left( \varvec{x}\right) = - u_{\scriptscriptstyle \succsim }\left( \varvec{x}\right) \) and re-write Problem (2) as:

$$\begin{aligned} \varvec{x}^{*}&= \underset{\varvec{x}}{\arg \min }\ f(\varvec{x}) \nonumber \\ \text {s.t.}&\quad \varvec{x}\in \Omega . \end{aligned}$$
(3)

Remark 1

Formally, there could be more than one \(\succsim \)-maximum of \(\Omega \), i.e. Problem (3) could admit multiple global solutions, as described by the set:

$$\begin{aligned} \mathcal {X}^* = \left\{ \varvec{x}^{\varvec{*}}_i: \varvec{x}^{\varvec{*}}_i \in \Omega \text { such that } \not \exists \varvec{x}: \varvec{x} \succ \varvec{x}^{\varvec{*}}_i \right\} . \end{aligned}$$

In this work, without loss of generality, we assume that there exists only one global solution \(\varvec{x}^{\varvec{*}}\) as in (3). In practice, as we will see in Sect. 5, any globally convergent preference-based optimization procedure generates a set of samples that is dense in \(\Omega \) and thus, at least asymptotically, it actually finds all the global minimizers in \(\mathcal {X}^*\). We do not make any assumptions on the local solutions of (3), which can be more than one.

On a side note, we could view preference-based optimization as a particular instance of black-box optimization [19, 36] where the cost function is not measurable in any way. Instead, information on \(f(\varvec{x})\) in (3) comes in the form of preferences, as described in the next section.

2.1 Data available to preference-based optimization procedures

In GLISp [3], instead of considering the preference relation \(\succsim \) on \(\Omega \) explicitly, the authors describe the outputs of the queries to the DM using the preference function \({\pi }_{\scriptscriptstyle \succsim }:\mathbb {R}^{n} \times \mathbb {R}^{n} \rightarrow \{-1,0,1\}\), defined as:

$$\begin{aligned} {\pi }_{\scriptscriptstyle \succsim }\left( \varvec{x}_{i},\varvec{x}_{j} \right) ={\left\{ \begin{array}{ll} -1 &{} \text {if}\quad \varvec{x}_{i} \succ \varvec{x}_{j} \Leftrightarrow f\left( \varvec{x}_{i}\right) <f\left( \varvec{x}_{j}\right) \\ 0 &{} \text {if}\quad \varvec{x}_{i} \sim \varvec{x}_{j} \Leftrightarrow f\left( \varvec{x}_{i}\right) =f\left( \varvec{x}_{j}\right) \\ 1 &{} \text {if}\quad \varvec{x}_{j} \succ \varvec{x}_{i} \Leftrightarrow f\left( \varvec{x}_{i}\right) >f\left( \varvec{x}_{j}\right) \end{array}\right. }. \end{aligned}$$
(4)

In light of the utility theory literature just reviewed, we can see that \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j})\) in (4) is obtained from the utility representation of the preference relation \(\succsim \) on \(\Omega \) (see Theorem 1) and from the fact that \(f(\varvec{x}) = - u_{\scriptscriptstyle \succsim }(\varvec{x})\). In the case of rational decision-makers (Definition 3), reflexivity and transitivity of the preference relation are reflected in the following properties of the preference function:

  • \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{i}) = 0\), for each \(\varvec{x}_{i} \in \mathbb {R}^{n}\),

  • \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j}) = {\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{j},\varvec{x}_{k}) = b \Rightarrow {\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{k}) = b\), for any \(\varvec{x}_{i}, \varvec{x}_{j}, \varvec{x}_{k} \in \mathbb {R}^{n}\).
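When PBO methods are benchmarked (as is done in Sect. 6), the decision-maker is typically simulated by evaluating a known scoring function \(f\left( \varvec{x}\right) \). The following minimal Python sketch mimics \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j})\) in (4) under this assumption (the function name and the optional tolerance argument are our own):

```python
def preference(f, x_i, x_j, tol=0.0):
    """Synthetic DM: emulate the preference function (4) by comparing
    the (normally unmeasurable) scoring function f at the two samples."""
    f_i, f_j = f(x_i), f(x_j)
    if f_i < f_j - tol:
        return -1  # x_i is preferred to x_j
    if f_i > f_j + tol:
        return 1   # x_j is preferred to x_i
    return 0       # indifference
```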

In the context of PBO, surrogate-based methods aim to solve Problem (3) starting from a set of \(N \in \mathbb {N}\) distinct samples of the decision vector:

$$\begin{aligned} \mathcal {X}=\left\{ \varvec{x}_{i}:i=1,\ldots ,N,\varvec{x}_{i}\in \Omega , \varvec{x}_{i} \ne \varvec{x}_{j}, \forall i \ne j\right\} \end{aligned}$$
(5)

and a set of \(M \in \mathbb {N}\) preferences expressed by the decision-maker:

$$\begin{aligned} \mathcal {B}=\left\{ b_{h}:h=1,\ldots ,M,b_{h}\in \{-1,0,1\}\right\} . \end{aligned}$$
(6)

The term \(b_h\) in (6) is the output of the h-th query, where the decision-maker was asked to compare two samples in \(\mathcal {X}\), as highlighted by the following mapping set:

$$\begin{aligned} \mathcal {S}&=\Big \{\left( \ell (h),\kappa (h)\right) : h=1,\ldots ,M,\ell (h),\kappa (h)\in \mathbb {N}, b_{h}={\pi }_{\scriptscriptstyle \succsim }\left( \varvec{x}_{\ell (h)},\varvec{x}_{\kappa (h)}\right) , \nonumber \\&\quad b_{h}\in \mathcal {B},\varvec{x}_{\ell (h)},\varvec{x}_{\kappa (h)}\in \mathcal {X}\Big \}. \end{aligned}$$
(7)

In (7), \(\ell : \mathbb {N} \rightarrow \mathbb {N}\) and \(\kappa : \mathbb {N} \rightarrow \mathbb {N}\) are two mapping functions that associate the indices of the samples in \(\mathcal {X}\) with their respective preferences in \(\mathcal {B}\). The cardinalities of these sets are \(|\mathcal {X}|=N\) and \(|\mathcal {B}|=|\mathcal {S}|=M\). Also note that \(1\le M\le \begin{pmatrix}N \\ 2 \end{pmatrix}\).
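For illustration, the data of this section can be kept in elementary containers. The sketch below (which reuses the `preference` helper above and 0-based indices, unlike the paper's 1-based notation) maintains \(\mathcal {X}\), \(\mathcal {B}\) and \(\mathcal {S}\) as queries are answered:

```python
import numpy as np

# Hypothetical bookkeeping for the data of Sect. 2.1:
X = [np.array([0.2, 0.4]),   # the set of samples X in (5)
     np.array([0.7, 0.1]),
     np.array([0.5, 0.9])]
B, S = [], []                # preference outputs (6) and mapping set (7)

def add_query(i, j, f):
    """Ask the (synthetic) DM to compare X[i] and X[j]; store b_h and (l(h), k(h))."""
    B.append(preference(f, X[i], X[j]))
    S.append((i, j))

f = lambda x: np.sum(x ** 2)  # stand-in scoring function, for illustration only
add_query(1, 0, f)            # records b_1 = pi(x_2, x_1) in the paper's notation
```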

3 Handling exploration and exploitation

In this section, we review some key concepts that are common in most preference-based optimization algorithms. We also cover briefly how exploration and exploitation are handled by algorithm GLISp [3].

Preference-based response surface methods iteratively propose new samples to evaluate with the objective of solving Problem (3) while also minimizing the number of queries. Suppose that, at iteration k, we have at our disposal the set of samples \(\mathcal {X}\) in (5), \(|\mathcal {X}|= N\), and the sets \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7), \(|\mathcal {B}|=|\mathcal {S}|=M\). We define the best sample found so far as:

$$\begin{aligned} \varvec{x_{best}}\left( N\right) \in \Omega : \varvec{x_{best}}\left( N\right) \in \mathcal {X}, |\mathcal {X}|= N,\quad \text {and}\quad \varvec{x_{best}}\left( N\right) \succsim \varvec{x}_i, \forall \varvec{x}_i \in \mathcal {X}. \end{aligned}$$

The new candidate sample,

$$\begin{aligned} \varvec{x}_{N+1} \in \Omega , \end{aligned}$$

is obtained by solving an additional global optimization problem:

$$\begin{aligned} \varvec{x}_{N+1}&= \underset{\varvec{x}}{\arg \min }\ a_N(\varvec{x}) \nonumber \\ \text {s.t.}&\quad \varvec{x}\in \Omega , \end{aligned}$$
(8)

where \(a_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\) is a properly defined acquisition function which trades off exploration and exploitation.

In practice, once \(\varvec{x}_{N+1}\) has been computed, we let the decision-maker express a preference between the best sample found so far and the new one, obtaining:

$$\begin{aligned} b_{M+1} = {\pi }_{\scriptscriptstyle \succsim }\left( \varvec{x}_{N+1},\varvec{x_{best}}\left( N\right) \right) . \end{aligned}$$

After that, \(\varvec{x}_{N+1}\) is added to the set \(\mathcal {X}\) and, similarly, the sets \(\mathcal {B}\) and \(\mathcal {S}\) are also updated with the new preference \(b_{M+1}\). The process is iterated until a certain condition is met. Typically, a budget, or rather a maximum number of samples \(N_{max} \in \mathbb {N}\), is set and the optimization procedure is stopped once it is reached.
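A compact sketch of this loop is given below; `argmin_acquisition` is a hypothetical placeholder for the solver of Problem (8), and the bookkeeping follows the update just described (not the actual GLISp-r implementation):

```python
def pbo_loop(X, B, S, best, f, argmin_acquisition, N_max):
    """Skeleton of a preference-based response surface method. X, B, S are
    Python lists as in (5)-(7); `best` is the index of x_best(N) in X."""
    while len(X) < N_max:                        # stop once the budget N_max is hit
        x_new = argmin_acquisition(X, B, S)      # solve Problem (8)
        X.append(x_new)
        B.append(preference(f, x_new, X[best]))  # query the DM on (x_new, x_best)
        S.append((len(X) - 1, best))
        if B[-1] == -1:                          # the new sample is preferred
            best = len(X) - 1
    return X[best]
```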

In our case, \(a_N\left( \varvec{x}\right) \) in (8) is defined starting from a surrogate model \(\hat{f}_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\), which approximates the scoring function \(f\left( \varvec{x}\right) \) of Problem (3), and a function \(z_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\) that promotes the exploration of those regions of \(\Omega \) where fewer samples have been evaluated. The acquisition function that we will propose in this work is a weighted sum of these two contributions (see Sect. 4.3). \(\hat{f}_N\left( \varvec{x}\right) \) and \(z_N\left( \varvec{x}\right) \) are defined as in GLISp [3], which we now review.

3.1 Surrogate model

Given N samples \(\varvec{x}_i \in \mathcal {X}\) in (5), we define the surrogate model \(\hat{f}_N:\mathbb {R}^{n} \rightarrow \mathbb {R}\) as the radial basis function expansion [9]:

$$\begin{aligned} \begin{aligned} \hat{f}_N\left( \varvec{x}\right)&=\sum _{i=1}^{N}\beta ^{(i)}\cdot \varphi \left( \epsilon \cdot \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2\right) \\&=\sum _{i=1}^{N}\beta ^{(i)}\cdot \phi _{i}\left( \varvec{x}\right) \\&=\varvec{\phi }\left( \varvec{x}\right) ^{\top }\cdot \varvec{\beta }, \end{aligned} \end{aligned}$$
(9)

where \(\varphi :\mathbb {R}_{\ge 0}\rightarrow \mathbb {R}\) is a properly chosen radial function, \(\epsilon \in \mathbb {R}_{>0}\) is the so-called shape parameter (which needs to be tuned) and \(\phi _{i}:\mathbb {R}^{n}\rightarrow \mathbb {R}\) is the radial basis function originated from \(\varphi \left( \cdot \right) \) and center \(\varvec{x}_{i}\in \mathcal {X}\), namely \(\phi _{i}\left( \varvec{x}\right) =\varphi \left( \epsilon \cdot \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2\right) \). Moreover, \(\varvec{\phi }\left( \varvec{x}\right) \in \mathbb {R}^{N}\), \(\varvec{\phi }\left( \varvec{x}\right) =\begin{bmatrix}\phi _{1}\left( \varvec{x}\right)&\ldots&\phi _{N}\left( \varvec{x}\right) \end{bmatrix}^{\top }\), is the radial basis function vector and \(\varvec{\beta }=\begin{bmatrix}\beta ^{(1)}&\ldots&\beta ^{(N)}\end{bmatrix}^{\top }\in \mathbb {R}^{N}\) is a vector of weights that has to be estimated from the preferences in \(\mathcal {B}\) (6) and \(\mathcal {S}\) (7). Given a distance \(r = \left\| {{\varvec{x} - \varvec{x}_i}} \right\| _2\), some commonly used radial functions are [11]:

  • Inverse quadratic: \(\varphi \left( \epsilon \cdot r\right) =\frac{1}{1+\left( \epsilon \cdot r\right) ^{2}}\);

  • Multiquadratic: \(\varphi \left( \epsilon \cdot r\right) =\sqrt{1+\left( \epsilon \cdot r\right) ^{2}}\);

  • Linear: \(\varphi \left( \epsilon \cdot r\right) =\epsilon \cdot r\);

  • Gaussian: \(\varphi \left( \epsilon \cdot r\right) =e^{-\left( \epsilon \cdot r\right) ^{2}}\);

  • Thin plate spline: \(\varphi \left( \epsilon \cdot r\right) =\left( \epsilon \cdot r\right) ^{2}\cdot \log \left( \epsilon \cdot r\right) \);

  • Inverse multiquadratic: \(\varphi \left( \epsilon \cdot r\right) =\frac{1}{\sqrt{1+\left( \epsilon \cdot r\right) ^{2}}}\).
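A direct implementation of the expansion (9) and of the radial functions listed above might look as follows (a sketch; the guard for the thin plate spline at \(r = 0\), where its limit is zero, is our own addition):

```python
import numpy as np

# Radial functions from the list above; eps is the shape parameter,
# r the Euclidean distance ||x - x_i||_2.
radial = {
    "inverse_quadratic": lambda r, eps: 1.0 / (1.0 + (eps * r) ** 2),
    "multiquadratic":    lambda r, eps: np.sqrt(1.0 + (eps * r) ** 2),
    "linear":            lambda r, eps: eps * r,
    "gaussian":          lambda r, eps: np.exp(-((eps * r) ** 2)),
    "thin_plate_spline": lambda r, eps: np.where(
        r > 0, (eps * r) ** 2 * np.log(np.maximum(eps * r, 1e-300)), 0.0),
    "inverse_multiquadratic": lambda r, eps: 1.0 / np.sqrt(1.0 + (eps * r) ** 2),
}

def surrogate(x, X, beta, phi="inverse_quadratic", eps=1.0):
    """RBF expansion (9): f_hat(x) = sum_i beta^(i) * phi(eps * ||x - x_i||_2)."""
    r = np.linalg.norm(np.asarray(X) - x, axis=1)
    return radial[phi](r, eps) @ beta
```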

One advantage of using (9) as the surrogate model is highlighted by the following Proposition.

Proposition 2

\(\hat{f}_N\left( \varvec{x}\right) \) in (9) is differentiable everywhere if and only if the chosen radial basis function \(\phi _i \left( \varvec{x}\right) = \varphi \left( \epsilon \cdot \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2 \right) \) is differentiable everywhere.

The surrogate model in (9) can be used to define the surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}:{\mathbb {R}^{n}\times \mathbb {R}^{n}\rightarrow \{-1,0,1\}}\). Differently from \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j})\) in (4), we consider a tolerance \(\sigma \in \mathbb {R}_{>0}\) to avoid using strict inequalities and equalities and define \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}(\varvec{x}_{i},\varvec{x}_{j})\) as [3]:

$$\begin{aligned} {\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}(\varvec{x}_{i},\varvec{x}_{j})={\left\{ \begin{array}{ll} -1 &{} \text {if}\quad \hat{f}_N\left( \varvec{x}_{i}\right) -\hat{f}_N\left( \varvec{x}_{j}\right) \le -\sigma \\ 0 &{} \text {if}\quad |\hat{f}_N\left( \varvec{x}_{i}\right) -\hat{f}_N\left( \varvec{x}_{j}\right) |\le \sigma \\ 1 &{} \text {if}\quad \hat{f}_N\left( \varvec{x}_{i}\right) -\hat{f}_N\left( \varvec{x}_{j}\right) \ge \sigma \end{array}\right. }. \end{aligned}$$
(10)

Suppose now that we have at our disposal the sets \(\mathcal {X}\) in (5), \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7). Then, we are interested in a surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) that correctly describes the preferences expressed by the decision-maker, i.e. we would like the corresponding surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}(\varvec{x}_{i},\varvec{x}_{j})\) in (10) to be such that:

$$\begin{aligned} b_{h}={\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}\left( \varvec{x}_{\ell (h)},\varvec{x}_{\kappa (h)}\right) , \quad \forall b_{h}\in \mathcal {B}, \quad \left( \ell (h),\kappa (h)\right) \in \mathcal {S},\quad h = 1, \ldots , M. \end{aligned}$$

This, in turn, translates into some constraints on \(\hat{f}_N\left( \varvec{x}\right) \), which can be used to estimate \(\varvec{\beta }\). To do so, the authors of GLISp [3] define the following convex optimization problem:

$$\begin{aligned}&\underset{\varvec{\varepsilon },\varvec{\beta }}{\arg \min }\ \frac{\lambda }{2} \cdot \varvec{\beta }^{\top } \cdot \varvec{\beta } + \varvec{r}^\top \cdot \varvec{\varepsilon } \nonumber \\ \text {s.t.}&\quad \hat{f}_N\left( \varvec{x}_{\ell (h)}\right) -\hat{f}_N\left( \varvec{x}_{\kappa (h)}\right) \le -\sigma +\varepsilon ^{(h)}&\forall h:b_{h}=-1 \nonumber \\&\quad |\hat{f}_N\left( \varvec{x}_{\ell (h)}\right) -\hat{f}_N\left( \varvec{x}_{\kappa (h)}\right) |\le \sigma +\varepsilon ^{(h)}&\forall h:b_{h}=0 \nonumber \\&\quad \hat{f}_N\left( \varvec{x}_{\ell (h)}\right) -\hat{f}_N\left( \varvec{x}_{\kappa (h)}\right) \ge \sigma -\varepsilon ^{(h)}&\forall h:b_{h}=1 \nonumber \\&\quad \varepsilon ^{(h)} \ge 0 \nonumber \\&\quad h=1,\ldots ,M, \end{aligned}$$
(11)

where:

  • \(\varvec{\varepsilon } = \begin{bmatrix} \varepsilon ^{(1)}&\ldots&\varepsilon ^{(M)} \end{bmatrix}^{\top } \in \mathbb {R}_{\ge 0}^{M}\) is a vector of slack variables (one for each preference) which takes into account that: (i) there might be some outliers in \(\mathcal {B}\) and \(\mathcal {S}\) if the decision-maker expresses the preferences in an inconsistent way, and (ii) the selected radial function and/or shape parameter for \(\hat{f}_N\left( \varvec{x}\right) \) in (9) might not allow a proper approximation of the scoring function \(f\left( \varvec{x}\right) \);

  • \(\varvec{r} = \begin{bmatrix} r^{(1)}&\ldots&r^{(M)} \end{bmatrix}^{\top } \in \mathbb {R}^{M}_{>0}\) is a vector of weights that can be used to penalize more heavily the slacks related to the most important preferences. In GLISp [3], the authors give a larger weight to the preferences associated with the current best candidate and define \(\varvec{r}\) as follows:

    $$\begin{aligned} r^{(h)}&= 1, \quad \forall h: \left( \ell (h),\kappa (h)\right) \in \mathcal {S}, \varvec{x}_{\ell (h)} \ne \varvec{x_{best}}(N)\quad \text {and}\quad \varvec{x}_{\kappa (h)} \ne \varvec{x_{best}}(N), \\ r^{(h)}&= 10, \quad \forall h: \left( \ell (h),\kappa (h)\right) \in \mathcal {S}, \varvec{x}_{\ell (h)} = \varvec{x_{best}}(N)\quad \text {or}\quad \varvec{x}_{\kappa (h)} = \varvec{x_{best}}(N). \end{aligned}$$
  • \(\lambda \in \mathbb {R}_{\ge 0}\) plays the role of a regularization parameter. It is easy to see that, for \(\lambda =0\), Problem (11) is a Linear Program (LP) while, for \(\lambda >0\), it is a Quadratic Program (QP).

Problem (11) ensures that, at least approximately, \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is a suitable representation of the unknown preference relation \(\succsim \) on \(\Omega \) which produced the data described in Sect. 2.1 (see Theorem 1).
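Problem (11) can be prototyped with a generic convex solver. The sketch below uses CVXPY (our choice for illustration; the GLISp authors' implementation may differ), where `Phi` is the \(N \times N\) matrix with entries \(\phi _j\left( \varvec{x}_i\right) \), so that \(\hat{f}_N\left( \varvec{x}_i\right) \) equals `Phi[i] @ beta`:

```python
import cvxpy as cp
import numpy as np

def fit_surrogate(Phi, B, S, r, sigma=1e-2, lam=1e-6):
    """Estimate the RBF weights beta by solving (a sketch of) Problem (11)."""
    N, M = Phi.shape[0], len(B)
    beta = cp.Variable(N)
    eps_slack = cp.Variable(M, nonneg=True)      # the slack vector epsilon
    cons = []
    for h, (l, k) in enumerate(S):
        d = Phi[l] @ beta - Phi[k] @ beta        # f_hat(x_l(h)) - f_hat(x_k(h))
        if B[h] == -1:
            cons.append(d <= -sigma + eps_slack[h])
        elif B[h] == 1:
            cons.append(d >= sigma - eps_slack[h])
        else:                                    # indifference
            cons.append(cp.abs(d) <= sigma + eps_slack[h])
    obj = cp.Minimize(0.5 * lam * cp.sum_squares(beta) + r @ eps_slack)
    cp.Problem(obj, cons).solve()
    return beta.value
```

For \(\lambda = 0\) the objective is linear and the problem reduces to an LP, as noted above; a generic conic solver handles both cases transparently.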

3.2 Exploration function

Consider a sample \(\varvec{x}_i \in \mathcal {X}\) in (5). Its corresponding Inverse Distance Weighting (IDW) function \(w_{i}:\mathbb {R}^{n}{\setminus }\left\{ \varvec{x}_{i}\right\} \rightarrow \mathbb {R}_{>0}\) is defined as [31]:

$$\begin{aligned} w_{i}\left( \varvec{x}\right) =\frac{1}{ \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2^{2}}. \end{aligned}$$
(12)

GLISp [3] uses the so-called IDW distance function \(z_N: \mathbb {R}^{n} \rightarrow (-1, 0]\),

$$\begin{aligned} z_N\left( \varvec{x}\right) ={\left\{ \begin{array}{ll} 0 &{} \text {if}\quad \varvec{x}\in \mathcal {X} \\ {-\frac{2}{\pi }}\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N}w_{i}\left( \varvec{x}\right) }\right) &{} \text {otherwise} \end{array}\right. }, \end{aligned}$$
(13)

to promote the exploration in those regions of \(\mathbb {R}^{n}\) where fewer samples have been evaluated. It is possible to prove the following Proposition and Lemma.

Proposition 3

The IDW distance function \(z_N\left( \varvec{x}\right) \) in (13) is differentiable everywhere.

The proof of Proposition 3 is given in [2]. Here, we report the gradient of \(z_N\left( \varvec{x}\right) \) in (13) (see “Appendix A” for its derivation).

Lemma 1

The gradient of the IDW distance function \(z_N\left( \varvec{x}\right) \) in (13) is:

$$\begin{aligned} \nabla _{\varvec{x}} z_N\left( \varvec{x}\right) ={\left\{ \begin{array}{ll} \varvec{0}_{n} &{} \text {if}\quad \varvec{x}\in \mathcal {X} \\ -\frac{4}{\pi }\cdot \frac{\sum _{i=1}^{N}\left( \varvec{x}-\varvec{x}_{i}\right) \cdot w_{i}\left( \varvec{x}\right) ^{2}}{1+\left[ \sum _{i=1}^{N}w_{i}\left( \varvec{x}\right) \right] ^{2}} &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(14)

The gradient \(\nabla _{\varvec{x}} z_N\left( \varvec{x}\right) \) in (14) will be particularly relevant in the following section.
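Both \(z_N\left( \varvec{x}\right) \) in (13) and its gradient (14) translate directly into code; a minimal NumPy sketch:

```python
import numpy as np

def idw_distance(x, X):
    """IDW distance function z_N(x) in (13); X holds the samples row-wise."""
    d2 = np.sum((np.asarray(X) - x) ** 2, axis=1)
    if np.any(d2 == 0.0):          # x coincides with some x_i in X
        return 0.0
    return -(2.0 / np.pi) * np.arctan(1.0 / np.sum(1.0 / d2))

def idw_distance_grad(x, X):
    """Gradient (14) of the IDW distance function; zero at the samples."""
    D = x - np.asarray(X)          # rows are x - x_i
    d2 = np.sum(D ** 2, axis=1)
    if np.any(d2 == 0.0):
        return np.zeros_like(x)
    w = 1.0 / d2                   # IDW weights (12)
    num = np.sum(D * (w ** 2)[:, None], axis=0)
    return -(4.0 / np.pi) * num / (1.0 + np.sum(w) ** 2)
```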

4 Next candidate sample search

As previously mentioned in Sect. 3, a key aspect of surrogate-based methods is the exploration–exploitation dilemma. Typically, new candidate samples are sought by minimizing an acquisition function \(a_N\left( \varvec{x}\right) \) that is a weighted sum of the surrogate model and the exploration function. In practice, \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) often exhibit different ranges and need to be rescaled. In particular, GLISp [3] adopts the following \(a_N\left( \varvec{x}\right) \):

$$\begin{aligned} a_N(\varvec{x})=\frac{\hat{f}_N\left( \varvec{x}\right) }{\Delta \hat{F}} + \delta \cdot z_N\left( \varvec{x}\right) , \end{aligned}$$
(15)

where \(\delta \in \mathbb {R}_{\ge 0}\) is the exploration–exploitation trade-off weight. The division by

$$\begin{aligned} \Delta \hat{F} = \underset{\varvec{x}_i \in \mathcal {X}}{\max }\ \hat{f}_N(\varvec{x}_i) - \underset{\varvec{x}_i \in \mathcal {X}}{\min }\ \hat{f}_N(\varvec{x}_i) \end{aligned}$$
(16)

aims to rescale the surrogate model to make it assume a range that is comparable to that of the IDW distance function in (13), which is \((-1, 0]\).
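Putting (15) and (16) together, and reusing the `surrogate` and `idw_distance` sketches above, the GLISp acquisition can be prototyped as below; the small guard on \(\Delta \hat{F}\) is our own addition to avoid division by zero when all surrogate values coincide:

```python
import numpy as np

def acquisition_glisp(x, X, beta, phi, eps, delta):
    """GLISp acquisition (15): rescaled surrogate plus delta * z_N(x)."""
    f_hat = np.array([surrogate(x_i, X, beta, phi, eps) for x_i in X])
    dF = max(f_hat.max() - f_hat.min(), 1e-12)   # Delta F_hat in (16), guarded
    return surrogate(x, X, beta, phi, eps) / dF + delta * idw_distance(x, X)
```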

In this section, we address some limitations of \(a_N\left( \varvec{x}\right) \) in (15), which might prevent GLISp [3] from reaching the global minimizer of Problem (3), and define an alternative acquisition function. Furthermore, we propose a strategy to iteratively vary the exploration–exploitation trade-off.

4.1 Shortcomings of GLISp

There are two shortcomings of \(a_N\left( \varvec{x}\right) \) in (15) that limit the exploratory capabilities of GLISp [3]. First, the rescaling of \(\hat{f}_N\left( \varvec{x}\right) \) in (15), which relies on \(\Delta \hat{F}\) in (16), only takes into account the previously evaluated samples inside \(\mathcal {X}\) in (5) and thus it can be ineffective in making \(\frac{\hat{f}_N\left( \varvec{x}\right) }{\Delta \hat{F}}\) and \(z_N\left( \varvec{x}\right) \) comparable over all \(\Omega \) (see Problem (8)). Second, the IDW distance function in (13) exhibits two characteristics that can make its contribution negligible in \(a_N\left( \varvec{x}\right) \) in (15) and complicate the selection of \(\delta \):

  1.

    Even though the range of \(z_N\left( \varvec{x}\right) \) is \((-1, 0]\), what we are really interested in when solving Problem (8) and, ultimately, Problem (3), are the values that it assumes for \(\varvec{x} \in \Omega \) and not on its whole domain \(\mathbb {R}^{n}\). In particular, there are some situations for which \(|z_N\left( \varvec{x}\right) |\ll 1, \forall \varvec{x} \in \Omega \). Consider, for example, the case \(\mathcal {X} = \left\{ \varvec{x}_1\right\} \) (\(N = 1\)); then, \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), the IDW distance function simply becomes:

    $$\begin{aligned} z_1\left( \varvec{x}\right) =-\frac{2}{\pi }\cdot \arctan \left( \left\| {{\varvec{x}-\varvec{x}_{1}}} \right\| _2^{2}\right) . \end{aligned}$$

    Suppose that Problem (3) is only bound constrained, i.e. \(\Omega = \left\{ \varvec{x}: \varvec{l}\le \varvec{x}\le \varvec{u} \right\} \). Then, \(z_1\left( \varvec{x}\right) \) assumes its lowest value at one (or more) of the vertices of the box defined by the constraints in \(\Omega \). Define \(\varvec{v}_{\Omega } \in \Omega \) as one of such vertices; if \( \left\| {{\varvec{v}_{\Omega } - \varvec{x}_{1}}} \right\| _2\) is close to zero, then \(|z_N\left( \varvec{x}\right) |\ll 1, \forall \varvec{x} \in \Omega \). Thus, unless \(\delta \gg 1\) in (15), \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) might not be comparable.

  2.

    The (absolute) values assumed by the IDW distance function decrease as the number of samples increases. To clarify this, consider two sets of samples:

    $$\begin{aligned} \mathcal {X}'&= \left\{ \varvec{x}_{1}, \ldots , \varvec{x}_{N} \right\} , \quad |\mathcal {X}'|= N, \\ \mathcal {X}''&= \mathcal {X}' \cup \left\{ \varvec{x}_{N + 1} \right\} , \quad |\mathcal {X}''|= N + 1. \end{aligned}$$

    Given any point \(\tilde{\varvec{x}} \in \mathbb {R}^n {\setminus } \mathcal {X}''\), the IDW distance functions obtained from the previously defined sets are:

    $$\begin{aligned} z_{N}\left( \varvec{\tilde{x}}\right)&= {-\frac{2}{\pi }}\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N}w_{i}\left( \varvec{\tilde{x}}\right) }\right) , \\ z_{N+1}\left( \varvec{\tilde{x}}\right)&= {-\frac{2}{\pi }}\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N+1}w_{i}\left( \varvec{\tilde{x}}\right) }\right) . \end{aligned}$$

    Note that \(w_i\left( \varvec{\tilde{x}} \right) > 0, \forall \varvec{\tilde{x}}\in \mathbb {R}^n {\setminus } \mathcal {X}''\) and \(i = 1, \ldots , N + 1\) (see (12)). Hence:

    $$\begin{aligned} |z_{N}\left( \varvec{\tilde{x}}\right) |> |z_{N+1}\left( \varvec{\tilde{x}}\right) |> 0, \end{aligned}$$

    proving the above point. In practice, unless \(\delta \) in (15) is progressively increased as the iterations go on, GLISp [3] will explore the constraint set \(\Omega \) of Problem (3) less as the number of samples increases, regardless of whether a region that contains the global minimizer \(\varvec{x^*}\) has been located.

A visualization of these two characteristics of \(z_N\left( \varvec{x}\right) \) in (13) is presented in Fig. 1.

Fig. 1

Examples of the IDW distance function \(z_N\left( x\right) \) in (13) for different numbers of points (\(N = 2\) on the left, \(N = 6\) on the right) and \(-3 = l \le x \le u = 3\). Notice how \(z_N\left( x\right) \) does not cover its whole range \((-1, 0]\), at least inside the bound constraints, and its absolute values decrease as the number of samples increases
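The second characteristic is easy to verify numerically with the `idw_distance` sketch above; on the same interval as Fig. 1, the exploration term at a fixed test point shrinks as samples accumulate:

```python
import numpy as np

# |z_N| at a fixed point decreases as samples accumulate (here on [-3, 3]).
x_test = np.array([1.5])
for N in (2, 6):
    X = np.linspace(-3.0, 3.0, N).reshape(-1, 1)
    print(N, idw_distance(x_test, X))
# Prints roughly -0.71 for N = 2 and -0.05 for N = 6: with a fixed delta,
# the exploration contribution in (15) fades away.
```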

In this work, we overcome the limitations of \(a_N\left( \varvec{x}\right) \) in (15) by defining an acquisition function that is similar to the one used by the Metric Stochastic Response Surface (MSRS [29]) algorithm, a popular black-box optimization procedure. In MSRS [29], the surrogate \(\hat{f}_N\left( \varvec{x}\right) \) and the exploration function \(z_N\left( \varvec{x}\right) \) are made comparable by randomly sampling \(\Omega \) and rescaling the two using min–max normalization. Then, Problem (8) is not solved explicitly but by choosing the generated random sample that achieves the lowest value for the acquisition function. Here, we propose to rescale \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) using some insights on the stationary points of the IDW distance function and solve Problem (8) explicitly, using a proper global optimization solver.

4.2 Novel rescaling strategy

In this section, we derive the approximate locations of the stationary points of the IDW distance function in (13) and use them to define an augmented set of samples \(\mathcal {X}_{aug} \supset \mathcal {X}\) that is suited for the min–max normalization of both \(z_N\left( \varvec{x}\right) \) and \(\hat{f}_N\left( \varvec{x}\right) \).

4.2.1 Stationary points of the IDW distance function

The locations of the global maximizers of \(z_N\left( \varvec{x}\right) \) can be deduced immediately from (13) and Lemma 1, as stated by the following Proposition.

Proposition 4

Each \(\varvec{x}_i \in \mathcal {X}\) in (5) is a global maximizer of \(z_N\left( \varvec{x}\right) \) in (13).

Proof

Recall that:

  (i)

    \(\nabla _{\varvec{x}} z_N\left( \varvec{x}_i\right) = \varvec{0}_{n}, \forall \varvec{x}_i \in \mathcal {X}\), see (14);

  (ii)

    \(z_N\left( \varvec{x}\right) < 0, \forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), see (13);

  (iii)

    \(z_N\left( \varvec{x}_i\right) = 0, \forall \varvec{x}_i \in \mathcal {X}\), see (13).

From Item (i) we deduce that each \(\varvec{x}_i \in \mathcal {X}\) is a stationary point of \(z_N\left( \varvec{x}\right) \). Item (ii), in conjunction with Item (iii), implies that such samples are local maximizers of the IDW distance function in (13) since there exists a neighborhood of \(\varvec{x}_i \in \mathcal {X}\), denoted as \(\mathcal {N}\left( \varvec{x}_i\right) \), such that \(z_N\left( \varvec{x}\right) \le z_N\left( \varvec{x}_i \right) , \forall \varvec{x} \in \mathcal {N}\left( \varvec{x}_i\right) \). Moreover, due to Item (ii), \(z_N\left( \varvec{x}\right) \le z_N\left( \varvec{x}_i \right) , \forall \varvec{x} \in \mathbb {R}^n\) and not just in a neighborhood of \(\varvec{x}_i \in \mathcal {X}\). Hence, each \(\varvec{x}_i \in \mathcal {X}\) is a global maximizer of \(z_N\left( \varvec{x}\right) \) in (13). \(\square \)

Reaching similar conclusions for the minimizers of \(z_N\left( \varvec{x} \right) \) is much harder; however, we can consider some simplified situations. Note that we are not necessarily interested in finding the minimizers of the IDW distance function in (13) with high accuracy, but rather in gaining some insight on where they are likely to be located, so that we can rescale both \(z_N\left( \varvec{x} \right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) well enough to make them comparable. Moreover, their approximate locations can be used to solve the following global optimization problem (pure exploration):

$$\begin{aligned} \varvec{x}_{N+1}&= \underset{\varvec{x}}{\arg \min }\ z_N(\varvec{x}) \nonumber \\ \text {s.t.}&\quad \varvec{x}\in \Omega \end{aligned}$$
(17)

by using a multi-start derivative-based optimization method with warm-start [23, 26] (recall that \(z_N\left( \varvec{x}\right) \) is differentiable everywhere, see Proposition 3). Problem (17) is quite relevant for the global convergence of the algorithm that we will propose in Sect. 5.
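A minimal multi-start sketch for the bound-constrained variant of this search, built on the `idw_distance` and `idw_distance_grad` helpers above, could read as follows (the random warm starts are our own simplification; the cluster midpoints derived later in this section could be used instead):

```python
import numpy as np
from scipy.optimize import minimize

def explore(X, l, u, n_starts=10, seed=0):
    """Multi-start, gradient-based minimization of z_N(x) over l <= x <= u."""
    rng = np.random.default_rng(seed)
    best_x, best_z = None, np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(l, u)                   # random warm start in [l, u]
        res = minimize(idw_distance, x0, args=(X,),
                       jac=idw_distance_grad, method="L-BFGS-B",
                       bounds=list(zip(l, u)))
        if res.fun < best_z:
            best_x, best_z = res.x, res.fun
    return best_x
```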

Remark 2

In the following Paragraphs, we analyze where the local minimizers of \(z_N\left( \varvec{x} \right) \) in (13) and the solution(s) of the simplified problem:

$$\begin{aligned} \varvec{x}_{N+1}&= \underset{\varvec{x}}{\arg \min }\ z_N(\varvec{x}) \nonumber \\ \text {s.t.}&\quad \varvec{l} \le \varvec{x} \le \varvec{u} \end{aligned}$$
(18)

are located in some specific cases. Note that \(\left\{ \varvec{x}: \varvec{l} \le \varvec{x} \le \varvec{u}\right\} \supseteq \Omega \) (see (1)) and thus the global minimum of Problem (18) is lower than or equal to the global minimum of Problem (17). Therefore, the minimizers of Problem (18) are better suited to perform min–max rescaling of \(z_N\left( \varvec{x} \right) \) than those of Problem (17).

Case \(\mathcal {X} = \left\{ \varvec{x}_1\right\} \) (\(N = 1\)). The IDW distance function and its gradient \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\) are:

$$\begin{aligned} z_N\left( \varvec{x}\right)&=-\frac{2}{\pi }\cdot \arctan \left( \left\| {{\varvec{x}-\varvec{x}_{1}}} \right\| _2^{2}\right) , \\ \nabla _{\varvec{x}} z_N\left( \varvec{x}\right)&=-\frac{4}{\pi }\cdot \left( \varvec{x}-\varvec{x}_{1}\right) \cdot \frac{w_{1}\left( \varvec{x}\right) ^{2}}{1+w_{1}\left( \varvec{x}\right) ^{2}}. \end{aligned}$$

Clearly, \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), the gradient is never zero since \(w_{1}\left( \varvec{x}\right) > 0\). Therefore, the only stationary point is the global maximizer \(\varvec{x}_1\) (see Proposition 4). However, if we were to consider Problem (18), then its solution would be located at one of the vertices of the box defined by the bound constraints \(\varvec{l} \le \varvec{x} \le \varvec{u}\).

Case \(\mathcal {X} = \left\{ \varvec{x}_1, \varvec{x}_2\right\} \) (\(N = 2\)). The gradient of the IDW distance function \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\) is:

$$\begin{aligned} \nabla _{\varvec{x}}z_N\left( \varvec{x}\right) =-\frac{4}{\pi }\cdot \frac{\left( \varvec{x}-\varvec{x}_{1}\right) \cdot w_{1}\left( \varvec{x}\right) ^{2}+\left( \varvec{x}-\varvec{x}_{2}\right) \cdot w_{2}\left( \varvec{x}\right) ^{2}}{1+\left[ w_{1}\left( \varvec{x}\right) +w_{2}\left( \varvec{x}\right) \right] ^{2}}. \end{aligned}$$

Let us consider the midpoint \(\varvec{x_{\mu }}=\frac{\varvec{x}_{1}+\varvec{x}_{2}}{2}\), which is such that \( \left\| {{\varvec{x_{\mu }}-\varvec{x}_{1}}} \right\| _2= \left\| {{\varvec{x_{\mu }}-\varvec{x}_{2}}} \right\| _2\) and for which \(w_{1}\left( \varvec{x_{\mu }}\right) =w_{2}\left( \varvec{x_{\mu }}\right) \). If we substitute it into the previous expression, we obtain:

$$\begin{aligned} \nabla _{\varvec{x}}z_N\left( \varvec{x_{\mu }}\right) =\varvec{0}_{n}, \end{aligned}$$

which means that \(\varvec{x_{\mu }}\) is a stationary point for \(z_N\left( \varvec{x}\right) \) in (13). It is easy to see by visual inspection that such a point is actually a local minimizer of the IDW distance function (see for example Fig. 2). However, note that \(\varvec{x_{\mu }}\) is not necessarily the global solution of Problem (18); it might just be a local one.

Fig. 2

One-dimensional examples of the IDW distance function \(z_N\left( x\right) \) in (13) and its gradient \(\nabla _{x} z_N\left( x\right) \) in (14) in the four analyzed cases. The red boxes mark the different clusters while the red dashed lines highlight the values of x for which the first derivative of \(z_N\left( x\right) \) is zero. Finally, the black vertical lines mark the midpoints (either between points or centroids of the clusters). Only a portion of all possible midpoints between centroids has been reported in the general case. Notice that the midpoints in Fig. 2c and d are quite close to the local minimizers of \(z_N\left( \varvec{x}\right) \), while the midpoint in Fig. 2b is an exact local solution of Problem (18) (color figure online)

Case \(\mathcal {X} = \mathcal {X}^{(1)} \cup \mathcal {X}^{(2)}\) (\(N > 2\)). Suppose now that the samples contained in \(\mathcal {X}\) (5) can be partitioned into two clusters:

  • \(\mathcal {X}^{(1)} = \left\{ \varvec{x}_1, \ldots , \varvec{x}_{N_1}\right\} \) (\(|\mathcal {X}^{(1)}|= N_1\)),

  • \(\mathcal {X}^{(2)} = \left\{ \varvec{x}_{N_1 + 1}, \ldots , \varvec{x}_{N}\right\} \) (\(|\mathcal {X}^{(2)}|= N - N_1\)),

such that \(\mathcal {X}^{(1)} \cap \mathcal {X}^{(2)} = \emptyset \) and \(\mathcal {X}^{(1)} \cup \mathcal {X}^{(2)} = \mathcal {X}\). Consider the midpoint between the centroids of each cluster:

$$\begin{aligned} \varvec{x_{\mu }} = \frac{1}{2} \cdot \left[ \frac{\sum _{\varvec{x}_{i}\in \mathcal {X}^{(1)}} \varvec{x}_{i}}{N_1}+\frac{\sum _{\varvec{x}_{i}\in \mathcal {X}^{(2)}} \varvec{x}_{i} }{N - N_1} \right] . \end{aligned}$$
(19)

We make the simplifying assumption that all the points contained inside each cluster are quite close to each other, i.e. \(\varvec{x}_{1} \approx \varvec{x}_{2} \approx \cdots \approx \varvec{x}_{N_1}\) and \(\varvec{x}_{N_1 + 1} \approx \varvec{x}_{N_1 + 2} \approx \cdots \approx \varvec{x}_{N}\). Then, the midpoint in (19) is approximately equal to \(\varvec{x_{\mu }} \approx \frac{\varvec{x}_1 + \varvec{x}_N}{2}\). Moreover, we have that \(w_{1}\left( \varvec{x_{\mu }}\right) \approx \cdots \approx w_{N_{1}}\left( \varvec{x_{\mu }}\right) \approx w_{N_{1}+1}\left( \varvec{x_{\mu }}\right) \approx \cdots \approx w_{N}\left( \varvec{x_{\mu }}\right) \). Thus, the gradient of the IDW distance function at \(\varvec{x_{\mu }}\) is approximately equal to:

$$\begin{aligned} \nabla _{\varvec{x}}z_N\left( \varvec{x_{\mu }}\right)&=-\frac{4}{\pi }\cdot \frac{\sum _{i=1}^{N}\left( \varvec{x_{\mu }}-\varvec{x}_{i}\right) \cdot w_{i}\left( \varvec{x_{\mu }}\right) ^{2}}{1+\left[ \sum _{i=1}^{N}w_{i}\left( \varvec{x_{\mu }}\right) \right] ^{2}} \\&\approx -\frac{4}{\pi }\cdot \frac{w_{1}\left( \varvec{x_{\mu }}\right) ^2}{1+\left[ N\cdot w_{1}\left( \varvec{x_{\mu }}\right) \right] ^{2}}\cdot \left[ \left( \frac{N}{2}-N_1\right) \cdot \varvec{x}_1+\left( N_1-\frac{N}{2}\right) \cdot \varvec{x}_{N} \right] . \end{aligned}$$

Clearly, if the clusters are nearly equally sized, i.e. \(N_1 \approx N - N_1 \approx \frac{N}{2}\), then:

$$\begin{aligned} \nabla _{\varvec{x}}z_N\left( \varvec{x_{\mu }}\right) \approx \varvec{0}_{n}, \end{aligned}$$

reaching a similar result to the one that we have seen for the case \(N = 2\).

General case (\(N > 2\)). Any set of samples \(\mathcal {X}\) in (5) can be partitioned into an arbitrary number of disjoint clusters, say \(K \in \mathbb {N}, K \le |\mathcal {X}|= N\), i.e.:

$$\begin{aligned} \mathcal {X}= \mathcal {X}^{(1)}\cup \mathcal {X}^{(2)}\cup \cdots \cup \mathcal {X}^{(K)}, \quad \text {such that}\quad \mathcal {X}^{(i)} \cap \mathcal {X}^{(j)} = \emptyset , \forall i \ne j. \end{aligned}$$

In this case, finding the local solutions of Problem (18) explicitly, or even approximately, is quite complex. Heuristically speaking, if the clusters are “well spread” (i.e. all the points contained inside each cluster \(\mathcal {X}^{(i)}\) are sufficiently far away from the others in \(\mathcal {X}^{(j)}, j = 1, \ldots , K, j \ne i\)), then we can approximately deduce where the local minimizers of \(z_N\left( \varvec{x}\right) \) in (13) are located.

Fig. 3

Two-dimensional example of “well spread” clusters, highlighted with different colors. The circles denote the points contained in \(\mathcal {X}\) while the crosses represent the centroids of \(\mathcal {X}^{(1)}, \mathcal {X}^{(2)}\) and \(\mathcal {X}^{(3)}\). \(\varvec{x_{\mu }}\) is the midpoint between the centroids of clusters \(\mathcal {X}^{(1)}\) and \(\mathcal {X}^{(2)}\). Finally, the blue lines highlight the distances between the samples of cluster \(\mathcal {X}^{(3)}\) and \(\varvec{x_{\mu }}\) (color figure online)

For instance, Fig. 3 depicts a set of samples \(\mathcal {X}\) that has been partitioned into three clusters, \(\mathcal {X}^{(1)}, \mathcal {X}^{(2)}\) and \(\mathcal {X}^{(3)}\), and for which the previous hypothesis is satisfied.

In the general case, given the clusters \(\mathcal {X}^{(i)}\) and \(\mathcal {X}^{(j)}\), we compute their centroids \(\varvec{x}^{(i)}_{\varvec{c}}\), \(\varvec{x}^{(j)}_{\varvec{c}}\), and the corresponding midpoint \(\varvec{x_{\mu }}\) between them as:

$$\begin{aligned} \varvec{x}^{(k)}_{\varvec{c}}&= \frac{\sum _{\varvec{x}_i \in \mathcal {X}^{(k)}} \varvec{x}_i}{|\mathcal {X}^{(k)}|} \quad (\text {centroid of}\,k\text {-th cluster}), \end{aligned}$$
(20a)
$$\begin{aligned} \varvec{x_{\mu }}&= \frac{\varvec{x}^{(i)}_{\varvec{c}} + \varvec{x}^{(j)}_{\varvec{c}}}{2}\quad (\text {midpoint}). \end{aligned}$$
(20b)

Going back to the example depicted in Fig. 3, if we consider the midpoint \(\varvec{x_{\mu }}\) between the centroids of \(\mathcal {X}^{(1)}\) and \(\mathcal {X}^{(2)}\), due to the “well spread” hypothesis we can say that \( \left\| {{\varvec{x_{\mu }} - \varvec{x}_i}} \right\| _2 \gg 0, \forall \varvec{x}_i \in \mathcal {X}^{(3)}\), making the contributions of the points inside the third cluster negligible when evaluating \(z_N\left( \varvec{x}\right) \) in (13) at \(\varvec{x_{\mu }}\). Therefore, the IDW distance function at \(\varvec{x_{\mu }}\) is approximately equal to:

$$\begin{aligned} z_N\left( \varvec{x_{\mu }}\right)&= {-\frac{2}{\pi }}\cdot \arctan \left\{ \left[ \sum _{k = 1}^{3} \left( \sum _{\varvec{x}_i \in \mathcal {X}^{(k)}} \frac{1}{ \left\| {{\varvec{x_{\mu }} - \varvec{x}_i}} \right\| _2^2}\right) \right] ^{-1}\right\} \\&\approx {-\frac{2}{\pi }}\cdot \arctan \left[ \left( \sum _{\varvec{x}_i \in \mathcal {X}^{(1)}} \frac{1}{ \left\| {{\varvec{x_{\mu }} - \varvec{x}_i}} \right\| _2^2} + \sum _{\varvec{x}_i \in \mathcal {X}^{(2)}} \frac{1}{ \left\| {{\varvec{x_{\mu }} - \varvec{x}_i}} \right\| _2^2} \right) ^{-1} \right] . \end{aligned}$$

In general, given K clusters, if these are “well spread”, then we can consider each possible couple of clusters separately and neglect the contributions of the remaining ones. Approximately speaking, we could split the general case into \(\begin{pmatrix}K \\ 2 \end{pmatrix}\) distinct problems that read as follows: find the stationary points of the IDW distance function \(z_{N_{i \cup j}}\left( \varvec{x}\right) \) in (13) defined from the set of samples \(\mathcal {X}^{(i)}\cup \mathcal {X}^{(j)}, i \ne j\) and \(N_{i \cup j} = |\mathcal {X}^{(i)}\cup \mathcal {X}^{(j)}|\). Hence, rough locations of the stationary points of \(z_{N_{i \cup j}}\left( \varvec{x}\right) \) can be found by following the same rationale proposed for the previously analyzed cases.

Some one-dimensional examples of all the previously analyzed situations are reported in Fig. 2.

4.2.2 Min–max rescaling and augmented sample set

Given a generic set of samples \(\mathcal {X} = \left\{ \varvec{x}_1, \ldots , \varvec{x}_N\right\} \) and a multivariable function \(h:\mathbb {R}^{n} \rightarrow \mathbb {R}\), min–max rescaling (or normalization) [16] rescales \(h\left( \varvec{x}\right) \) as:

$$\begin{aligned} \bar{h}\left( \varvec{x}; \mathcal {X}\right) = \frac{h\left( \varvec{x}\right) - h_{min}\left( \mathcal {X}\right) }{\Delta H\left( \mathcal {X}\right) }, \end{aligned}$$
(21)

where:Footnote 3

$$\begin{aligned} h_{min}\left( \mathcal {X}\right)&= \min _{\varvec{x}_i \in \mathcal {X}} h(\varvec{x}_i), \end{aligned}$$
(22a)
$$\begin{aligned} h_{max}\left( \mathcal {X}\right)&= \max _{\varvec{x}_i \in \mathcal {X}} h(\varvec{x}_i), \end{aligned}$$
(22b)
$$\begin{aligned} \Delta H\left( \mathcal {X}\right)&= h_{max}\left( \mathcal {X}\right) -h_{min}\left( \mathcal {X}\right) . \end{aligned}$$
(22c)

The objective of min–max rescaling is to obtain a function with range [0, 1], i.e. we would like to have \(\bar{h}:\mathbb {R}^{n} \rightarrow [0, 1]\). Clearly, the quality of the normalization depends on the information brought by the samples contained inside \(\mathcal {X}\), as pointed out in the following Remark.

Remark 3

We can observe that:

  1.

    If \(\mathcal {X}\) contains the global minimizer(s) and maximizer(s) of \(h\left( \varvec{x}\right) \), then \(\bar{h}\left( \varvec{x}\right) \) defined as in (21) effectively has codomain [0, 1];

  2.

    Otherwise, we can only ensure that \(0 \le \bar{h}\left( \varvec{x}_i\right) \le 1, \forall \varvec{x}_i \in \mathcal {X}\);

  3.

    In general, if we increase the number of distinct samples in \(\mathcal {X}\), then the rescaling of \(h\left( \varvec{x}\right) \) improves (or, at worst, stays the same).
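As a minimal Python sketch of (21)–(22) (the guard against a constant \(h\) on \(\mathcal {X}\) is our own addition):

```python
import numpy as np

def minmax_rescale(h, X):
    """Min-max rescaling (21)-(22): returns a callable h_bar satisfying
    0 <= h_bar(x_i) <= 1 for every x_i in X (cf. Remark 3)."""
    values = np.array([h(x) for x in X])
    h_min = values.min()                     # (22a)
    delta_h = values.max() - h_min           # (22c)
    if delta_h == 0.0:                       # degenerate case: h constant on X
        return lambda x: 0.0
    return lambda x: (h(x) - h_min) / delta_h
```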

Going back to the problem of rescaling the IDW distance function \(z_N\left( \varvec{x}\right) \) in (13), if we were to apply (21) using the set of previously evaluated samples \(\mathcal {X}\) in (5), then it would not be effective since \(z_N\left( \varvec{x}_i\right) = 0, \forall \varvec{x}_i \in \mathcal {X}\) (see Proposition 4). Instead, we have opted to generate a sufficiently expressive augmented sample set \(\mathcal {X}_{aug} \supset \mathcal {X}\) and perform min–max normalization using \(\mathcal {X}_{aug}\) instead of \(\mathcal {X}\).

Consider the general case described in Sect. 4.2.1. Then, the augmented sample set \(\mathcal {X}_{aug}\) can be built in the following fashion:

  1.

    Partition the points in \(\mathcal {X}\) (5) into different clusters. Here, for simplicity, we fix a-priori the number \(K_{aug} \in \mathbb {N}\) of clusters and apply K-means clustering [5, 17, 22] to obtain the sets \(\mathcal {X}^{(1)}, \ldots , \mathcal {X}^{(K_{aug})}\);

  2.

    Compute the centroid of each cluster, using (20a), and group them inside the set \(\mathcal {X}_c = \left\{ \varvec{x}_{\varvec{c}}^{(1)}, \ldots , \varvec{x}_{\varvec{c}}^{(K_{aug})} \right\} \);

  3.

    Calculate all the midpoints \(\varvec{x_\mu }\) between each possible pair of centroids \(\varvec{x}_{\varvec{c}}^{(i)}, \varvec{x}_{\varvec{c}}^{(j)} \in \mathcal {X}_c, \varvec{x}_{\varvec{c}}^{(i)} \ne \varvec{x}_{\varvec{c}}^{(j)},\) using (20b);

  4.

    Build the augmented sample set as \(\mathcal {X}_{aug} = \mathcal {X} \cup \mathcal {X}_\mu \), where \(\mathcal {X}_\mu \) is the set grouping all the previously computed midpoints.

Clearly, as highlighted by (21) and Remark 3, if \(\mathcal {X}_{aug}\) contains points that are close (or equal) to the global minimizer(s) and maximizer(s) of \(z_N\left( \varvec{x}\right) \) in (13), then the quality of the min–max rescaling of the IDW distance function improves.

Algorithm 1 formalizes these steps while also taking into consideration the case \(|\mathcal {X}|\le K_{aug}\) (for which no clustering is performed). Note that we also include the bounds \(\varvec{l}\) and \(\varvec{u}\) inside \(\mathcal {X}_{c}\) and \(\mathcal {X}_{aug}\) for two reasons: (i) \(\varvec{l}\) or \(\varvec{u}\) might actually be the solutions of Problem (18)Footnote 4 and (ii) given that we also want to rescale \(\hat{f}_N\left( \varvec{x}\right) \) in (9), adding further points to the augmented sample set improves the quality of the min–max normalization (see Remark 3). Notice that the number of points contained inside \(\mathcal {X}_{aug}\) obtained from Algorithm 1 is:

$$\begin{aligned} |\mathcal {X}_{aug}|= |\mathcal {X}|+ \begin{pmatrix}K_{aug} + 2 \\ 2 \end{pmatrix} + 2. \end{aligned}$$

Therefore, to avoid excessively large augmented sample sets, \(K_{aug}\) needs to be chosen appropriately.
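For instance, the choice \(K_{aug} = 5\) adopted in Sect. 6 gives \(|\mathcal {X}_c| = 5 + 2 = 7\) points (centroids plus bounds), hence \(\begin{pmatrix}7 \\ 2 \end{pmatrix} = 21\) midpoints and \(|\mathcal {X}_{aug}| = |\mathcal {X}| + 23\), regardless of \(N\).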

As a final remark, we point out that we could perform the min–max normalization in (21) using the true minima and maxima of \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9), which can be obtained by solving four additional global optimization problems. However, we have preferred to stick with the proposed heuristic approach since an extremely accurate rescaling is not needed and solving additional global optimization problems at each iteration could introduce large overhead times.

Algorithm 1: generation of the augmented sample set \(\mathcal {X}_{aug}\)
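A minimal Python sketch of Algorithm 1, assuming scikit-learn’s KMeans for the clustering step (function and variable names are illustrative, not the authors’ implementation):

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def augmented_sample_set(X, l, u, K_aug=5):
    """Build X_aug as in Algorithm 1: cluster X (when |X| > K_aug), collect
    the centroids together with the bounds l and u, then add all pairwise
    midpoints of that set to the original samples."""
    if len(X) > K_aug:
        centroids = KMeans(n_clusters=K_aug, n_init=10).fit(X).cluster_centers_
    else:
        centroids = np.asarray(X)            # too few samples: no clustering
    X_c = np.vstack([centroids, l, u])       # centroids plus bounds
    X_mu = np.array([(a + b) / 2.0 for a, b in combinations(X_c, 2)])
    return np.vstack([X, X_mu, l, u])        # |X| + C(K_aug + 2, 2) + 2 points
```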

4.3 Definition of the acquisition function

In this section, we take advantage of the results on the stationary points of \(z_N\left( \varvec{x}\right) \) presented in Sect. 4.2.1 to rescale the surrogate model and the exploration function. In particular, we define the following acquisition function:

$$\begin{aligned} a_N\left( \varvec{x}\right) = \delta \cdot \hat{\bar{f}}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) + \left( 1 - \delta \right) \cdot \bar{z}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) , \end{aligned}$$
(23)

where \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) have been rescaled using min–max normalization as in (21) and \(\mathcal {X}_{aug}\) is generated by Algorithm 1. \(\delta \in \left[ 0, 1\right] \) is the exploration–exploitation trade-off weight; note that \(\delta = 0\) corresponds to pure exploration, while \(\delta = 1\) results in pure exploitation. \(a_N\left( \varvec{x}\right) \) in (23) is similar to the acquisition function of MSRS [29] (for black-box optimization), but here we use an ad-hoc augmented sample set instead of a randomly generated one, as well as a different exploration function. We will refer to the algorithm that we will propose in Sect. 5, which uses \(a_N\left( \varvec{x}\right) \) in (23), as GLISp-r, where the “r” highlights the min–max rescaling performed for the acquisition function.
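For illustration, once the rescaled surrogate and exploration function are available as callables (e.g. via the minmax_rescale sketch above, applied over \(\mathcal {X}_{aug}\)), \(a_N\left( \varvec{x}\right) \) in (23) reduces to a few lines of Python (names are illustrative):

```python
def acquisition(x, fhat_bar, z_bar, delta):
    """Acquisition function (23): a convex combination of exploitation
    (rescaled surrogate) and exploration (rescaled IDW distance)."""
    assert 0.0 <= delta <= 1.0
    return delta * fhat_bar(x) + (1.0 - delta) * z_bar(x)
```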

A comparison between the terms of the acquisition functions in (23) (GLISp-r) and in (15) (GLISp [3]) is depicted in Fig. 4. As the number of samples N increases, the absolute values of \(z_N\left( \varvec{x}\right) \) in (13) get progressively smaller (see Sect. 4.1) and simply dividing \(\hat{f}_N\left( \varvec{x}\right) \) by \(\Delta \hat{F}\) as in (15) is not enough to make the exploration and exploitation contributions comparable. Thus, unless \(\delta \) in (15) is dynamically varied in between iterations of GLISp [3], solving Problem (8) with \(a_N\left( \varvec{x}\right) \) in (15) becomes similar to performing pure exploitation. This, in turn, can make GLISp [3] more prone to getting stuck on local minima of Problem (3) with no way of escaping (especially if the surrogate model is not expressive enough to capture the location of the global minimizer). Vice-versa, by performing min–max rescaling as proposed in (23), the exploration and exploitation contributions stay comparable throughout the whole optimization process and approximately assume the same range. For this reason, it is also more straightforward to define \(\delta \) in (23) compared to the weight in (15).

From Propositions 2 and 3, we can immediately deduce the following results on the differentiability of \(a_N\left( \varvec{x}\right) \) in (23).

Proposition 5

The acquisition function \(a_N\left( \varvec{x}\right) \) in (23) is differentiable everywhere provided that the surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is differentiable everywhere.

At each iteration of GLISp-r we find the next candidate for evaluation, i.e. \(\varvec{x}_{N+1}\), by solving Problem (8) with the acquisition function in (23). It is possible to use derivative-based optimization solvers since \(a_N\left( \varvec{x}\right) \) in (23) is differentiable everywhere. In general, the acquisition function is multimodal and thus it is better to employ a global optimization procedure. Moreover, \(a_N\left( \varvec{x} \right) \) is cheap to evaluate; therefore, we are not particularly concerned about the number of function evaluations it requires.

Fig. 4

Comparison between the terms of the acquisition functions in (15) (left, GLISp [3]) and in (23) (right, GLISp-r). The scoring function \(f\left( \varvec{x}\right) \) is the gramacy-lee [14] function while \(|\mathcal {X}|= N = 10\). For GLISp-r, the number of clusters used to build \(\mathcal {X}_{aug}\) through Algorithm 1 is \(K_{aug} = 5\). Notice how, for GLISp [3], the exploration function \(z_N\left( \varvec{x}\right) \) is not comparable with the rescaled surrogate \(\frac{\hat{f}_N\left( \varvec{x}\right) }{\Delta \hat{F}}\) since it assumes values that are one to two orders of magnitude lower

4.3.1 Greedy \(\delta \)-cycling

Many black-box optimization algorithms explicitly vary their emphasis on exploration and exploitation in between iterations. Just to cite a few:

  • Gutmann-RBF [15] uses an acquisition function that is a measure of “bumpiness” of the RBF surrogate that depends upon a target value \(\tau \) to aim for.

    The author suggests cycling the value of \(\tau \) between \(\tau = \min _{\varvec{x} \in \Omega } \hat{f}_N\left( \varvec{x}\right) \) (local search) and \(\tau = -\infty \) (global search).

  • The authors of MSRS [29], which uses an acquisition function that is similar to (23), propose to cycle between different values of \(\delta \) so as to prioritize either exploration or exploitation.

  • In algorithm SO-SA [37], which revisits MSRS [29], the weight \(\delta \) is chosen randomly at each iteration. Moreover, the authors adopt a greedy strategy: \(\delta \) is kept unaltered until it fails to produce a significantly better solution.

In GLISp [3], the weight \(\delta \) for \(a_N\left( \varvec{x}\right) \) in (15) is kept constant throughout the whole optimization process. Moreover, defining some form of cycling for such a hyper-parameter can be quite complex since the additive terms that compose the acquisition function are not always comparable. In this work, we propose a strategy that lies in between those of MSRS [29] and SO-SA [37], which we refer to as greedy \(\delta \)-cycling. We define a sequence of \(N_{cycle} \in \mathbb {N}\) weights to cycle:

$$\begin{aligned} \Delta _{cycle} = \langle \delta _0, \ldots , \delta _{N_{cycle}-1}\rangle . \end{aligned}$$
(24)

\(\Delta _{cycle}\) should contain values that are well spread within the \(\left[ 0, 1\right] \) range so as to properly alternate between local and global search. Greedy \(\delta \)-cycling operates as follows. Suppose that, at iteration k of GLISp-r, we have at our disposal \(|\mathcal {X}|= N\) samples and denote the trade-off weight \(\delta \) in (23) as \(\delta \left( k\right) \) to highlight the iteration number. Furthermore, assume \(\delta \left( k\right) = \delta _j \in \Delta _{cycle}\), which is used to find the new candidate sample \(\varvec{x}_{N+1}\) at iteration k by solving Problem (8). Then, if \(\varvec{x}_{N+1} \succ \varvec{x_{best}}\left( N\right) \) (i.e. there has been some improvement), the trade-off weight is kept unchanged, \(\delta \left( k+1\right) = \delta \left( k\right) = \delta _j\). Otherwise, we cycle the values in \(\Delta _{cycle}\), obtaining \(\delta \left( k+1\right) = \delta _{\left( j+1\right) \text {mod}\,{N_{cycle}}}\). Thus:

$$\begin{aligned} \delta \left( k+1\right) = {\left\{ \begin{array}{ll} \delta _j & \text {if}\quad \varvec{x}_{N+1} \succ \varvec{x_{best}}\left( N\right) \\ \delta _{\left( j+1\right) \text {mod}\,{N_{cycle}}} & \text {if}\quad \varvec{x_{best}}\left( N\right) \succsim \varvec{x}_{N+1} \end{array}\right. } \end{aligned}$$
(25)
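The update (25) amounts to the following sketch, where `improved` encodes the DM’s answer \(\varvec{x}_{N+1} \succ \varvec{x_{best}}\left( N\right) \) (illustrative names):

```python
def next_delta_index(j, improved, n_cycle):
    """Greedy delta-cycling (25): keep delta_j upon improvement,
    otherwise advance to the next weight in the cycle."""
    return j if improved else (j + 1) % n_cycle

# e.g., with Delta_cycle = [0.95, 0.7, 0.35, 0.0] and j the current index:
# j = next_delta_index(j, improved=dm_prefers_new_sample, n_cycle=4)
```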

In Sect. 5, we will discuss the choice of the cycling sequence in (24) more in detail and also cover its relationship with the global convergence of GLISp-r.

5 Algorithm GLISp-r and convergence

Algorithm 2 describes each step of the GLISp-r procedure. As with any surrogate-based method, GLISp-r starts from an initial set of samples \(\mathcal {X}, |\mathcal {X}|= N_{init} \in \mathbb {N}, N_{init} \ge 2\), generated by a space-filling experimental design [36], such as a Latin Hypercube Design (LHD) [24]. The sets \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7), as well as the initial best candidate \(\varvec{x_{best}}\left( N_{init}\right) \), are obtained by asking the decision-maker to compare the samples in \(\mathcal {X}\) (5) as proposed in Algorithm 3, which prompts \(M = N_{init} - 1\) queries. Once the initial sampling phase is concluded, the new candidate samples are obtained by solving Problem (8). The procedure is stopped once \(|\mathcal {X}|= N_{max}\), where \(N_{max} \in \mathbb {N}\) is a budget specified by the user. Overall, the decision-maker is queried \(N_{max} - 1\) times.
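Algorithm 3 is not reproduced here; a plausible reading, consistent with the stated query count \(M = N_{init} - 1\), compares each initial sample against the incumbent best. The sketch below rests on that assumption, with `prefer` standing in for the decision-maker:

```python
def initial_comparisons(X, prefer):
    """Query the DM len(X) - 1 times: each sample is compared with the
    incumbent best, which is updated whenever the new sample is preferred."""
    best, prefs = 0, []
    for i in range(1, len(X)):
        if prefer(X[i], X[best]):        # DM strictly prefers x_i over x_best
            prefs.append((i, best))
            best = i
        else:
            prefs.append((best, i))
    return best, prefs
```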

GLISp-r follows the same scheme as GLISp [3] but, additionally, at each iteration, builds the augmented sample set \(\mathcal {X}_{aug}\) using Algorithm 1. New candidate samples are found by minimizing the acquisition function in (23) instead of \(a_N\left( \varvec{x}\right) \) in (15). Moreover, \(\delta \) is cycled as proposed in Sect. 4.3.1. Similarly to GLISp [3], the shape parameter \(\epsilon \) of the surrogate model in (9) is recalibrated at certain iterations of the algorithm, as specified by the set \(\mathcal {K}_{R} \subseteq \left\{ 1, \ldots , N_{max} - N_{init}\right\} \), using grid-search Leave One Out Cross-Validation (LOOCV). In particular, at each iteration \(k \in \mathcal {K}_{R}\), \(\epsilon \) for \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is selected among a set \(\mathcal {E}_{LOOCV{}}\) of possible shape parameters as the one whose corresponding surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}\left( \varvec{x}_i, \varvec{x}_j\right) \) in (10) classifies (out-of-sample) most of the preferences in \(\mathcal {B}\) and \(\mathcal {S}\) correctly [3]. Lastly, consistently with GLISp [3], Problem (3) is rescaled so that each decision variable assumes the \(\left[ -1, 1\right] \) range (at least inside \(\Omega \)).

Algorithm 2: GLISp-r

5.1 Global convergence of GLISp-r

For any global optimization algorithm, convergence to the global minimizer of Problem (3) can be guaranteed by checking whether the conditions of the following Theorem hold.

Theorem 2

(Convergence of a global optimization algorithm [33]) Let \(\Omega \subset \mathbb {R}^{n}\) be a compact set. An algorithm converges to the global minimum of every continuous function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) over \(\Omega \) if and only if its sequence of iterates,

$$\begin{aligned} \langle \varvec{x}_i\rangle _{i \ge 1} = \langle \varvec{x}_1, \varvec{x}_2, \ldots \rangle , \end{aligned}$$

is dense in \(\Omega \).

In what follows, for the sake of clarity, we denote:

  • \(\mathcal {X}_\infty \) as the set containing all the elements of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) (infinite sequence),

  • \(\mathcal {X}_k \subseteq \mathcal {X}_\infty \) as the set containing all the elements of \(\langle \varvec{x}_i\rangle _{i = 1}^k\), which is a subsequence of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) composed of its first \(k \in \mathbb {N}\) entries.

To prove the convergence of GLISp-r, we also make use of the following Theorem, which gives us a sufficient condition that ensures the denseness of the sequence of iterates produced by any global optimization algorithm.

Theorem 3

(A sufficient condition for the denseness of \(\mathcal {X}_\infty \) [28]) Let \(\Omega \) be a compact subset of \(\mathbb {R}^n\) and let \(\langle \varvec{x}_{i}\rangle _{i\ge 1}\) be the sequence of iterates generated by an algorithm \(\texttt {A}\) (when run indefinitely). Suppose that there exists a strictly increasing sequence of positive integers \(\langle i_{t}\rangle _{t\ge 1}, i_{t}\in \mathbb {N},\) such that \(\langle \varvec{x}_{i}\rangle _{i\ge 1}\) satisfies the following condition for some \(\alpha \in (0,1]\):

$$\begin{aligned} \min _{1\le i\le i_{t}-1} \left\| \varvec{x}_{i_{t}}-\varvec{x}_{i} \right\| _2 \ge \alpha \cdot d_{\Omega }\left( \mathcal {X}_{i_t - 1}\right) , \quad \forall t \in \mathbb {N}, \end{aligned}$$
(26)

where:

$$\begin{aligned} d_{\Omega }\left( \mathcal {X}_{i_t - 1}\right) = \max _{\varvec{x}\in \Omega } \, \min _{1\le i\le i_{t}-1} \left\| \varvec{x}-\varvec{x}_{i} \right\| _2. \end{aligned}$$
(27)

Then, \(\mathcal {X}_{\infty }\) generated by \(\texttt {A}\) is dense in \(\Omega \).

The aforementioned Theorem has been used to prove the global convergence of CORS [28], a black-box optimization procedure. Clearly, if \(\Omega \) is compact and (26) holds for some \(\alpha \in (0,1]\) (making \(\mathcal {X}_{\infty }\) dense in \(\Omega \)) then, due to Theorem 2, algorithm \(\texttt {A}\) converges to the global minimum of every continuous function \(f\left( \varvec{x}\right) \) over \(\Omega \). As far as the preference-based framework is concerned, we need to ensure that the scoring function \(f\left( \varvec{x}\right) \), which represents the preference relation \(\succsim \) on \(\Omega \), is continuous. Theorem 1 gives us necessary conditions on \(\succsim \) to achieve such a property. Furthermore, Proposition 1 can be used to check the existence of a \(\succsim \)-maximum of \(\Omega \). The next Theorem addresses the global convergence of GLISp-r (Algorithm 2).

Theorem 4

(Convergence of GLISp-r) Let \(\Omega \subset \mathbb {R}^n\) be a compact set and \(\succsim \) be a continuous preference relation on \(\Omega \) of a rational (as in Definition 3) human decision-maker. Then, provided that \(\exists \delta _j \in \Delta _{cycle}\) in (24) such that \(\delta _j = 0\) and \(N_{max} \rightarrow \infty \), GLISp-r converges to the global minimum of Problem (3) for any choice of its remaining hyper-parameters.Footnote 5

Proof

Compactness of \(\Omega \), continuity of \(\succsim \) on \(\Omega \) and rationality of the decision-maker are conditions that ensure the existence of a solution for Problem (3) (cf. Theorem 1 and Proposition 1).

Consider the sequence of iterates \(\langle \varvec{x}_i\rangle _{i \ge 1}\) produced by Algorithm 2. The first \(N_{init} \in \mathbb {N}, N_{init} \ge 2,\) elements of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) are obtained by the LHD [24]. Instead, each \(\varvec{x}_i \in \mathcal {X}_{\infty }, i > N_{init},\) is selected as the solution of Problem (8) with \(a_N\left( \varvec{x}\right) \) in (23).

Now, suppose that \(\delta \) in (23) is cycled regardless of the improvement that the new candidate samples might bring (non-greedy cycling). Denote the exploration–exploitation trade-off weight at iteration k of Algorithm 2 as \(\delta \left( k\right) \) and assume that \(\delta (k) = \delta _j \in \Delta _{cycle}\). Then, the cycling is performed as:

$$\begin{aligned} \delta \left( k + 1\right) = \delta _{\left( j + 1\right) \text {mod}{N_{cycle}}}, \quad \forall k \in \mathbb {N}, \end{aligned}$$
(28)

instead of (25). Without loss of generality, suppose that \(\Delta _{cycle}\) in (24) is defined as:

$$\begin{aligned} \delta _j \ne 0, \forall j = 0, \ldots , N_{cycle} - 2,\quad \text {and}\quad \delta _{N_{cycle} - 1} = 0. \end{aligned}$$

Then, every \(N_{cycle}\) iterations Algorithm 2 looks for a new candidate sample by minimizing the (min–max rescaled) IDW distance function in (13), regardless of the surrogate model in (9), see \(a_N\left( \varvec{x}\right) \) in (23) and Problem (8). In practice, minimizing \(\bar{z}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) \) over \(\Omega \) is equivalent to solving \(\arg \min _{\varvec{x} \in \Omega } z_N\left( \varvec{x}\right) \) since scaling and shifting the IDW distance function does not change its minimizers [26]. We define the following strictly increasing sequence of positive integers:

$$\begin{aligned} \langle i_{t'} \rangle _{t' \ge 1} = \langle N_{init} + t' \cdot N_{cycle}\rangle _{t' \ge 1}, \end{aligned}$$
(29)

which is such that:

$$\begin{aligned} \varvec{x}_{i_{t'}}&= \arg \min _{\varvec{x}} \; z_{i_{t'} - 1}\left( \varvec{x}\right) , \quad \forall t' \in \mathbb {N} \nonumber \\ \text {s.t.}&\quad \varvec{x}\in \Omega . \end{aligned}$$
(30)

Now, recall from Proposition 4 that each \(\varvec{x}_i \in \mathcal {X}_{i_{t'} - 1}\) is a global maximizer of the objective function of Problem (30). Furthermore, \(z_{i_{t'} - 1}\left( \varvec{x}\right) \) in (13) is differentiable everywhere and thus continuous (see Proposition 3). Then, by the Extreme Value Theorem [1], Problem (30) admits at least one solution. Hence, we can conclude that:

$$\begin{aligned} \varvec{x}_{i_{t'}} \notin \mathcal {X}_{i_{t'} - 1} \implies \exists \tilde{\alpha }' \in \mathbb {R}_{>0} \text { such that } \min _{1\le i\le i_{t'}-1} \left\| \varvec{x}_{i_{t'}}-\varvec{x}_{i} \right\| _2 \ge \tilde{\alpha }', \quad \forall t' \in \mathbb {N}. \end{aligned}$$
(31)

Clearly, due to how new candidate samples are sought (i.e. by minimizing some acquisition function over \(\Omega \)), we have that (recall (27)):

$$\begin{aligned} \alpha '\cdot d_{\Omega }\left( \mathcal {X}_{i_{t'} - 1}\right) \le \min _{1\le i\le i_{t'}-1} \left\| \varvec{x}_{i_{t'}}-\varvec{x}_{i} \right\| _2 \le d_{\Omega }\left( \mathcal {X}_{i_{t'} - 1}\right) , \quad \forall t' \in \mathbb {N}, \end{aligned}$$
(32)

for some \(\alpha ' \in (0, 1]\). Then, by combining (31) and (32), we get:

$$\begin{aligned} 0 < \tilde{\alpha }' \le d_{\Omega }\left( \mathcal {X}_{i_{t'} - 1}\right) , \quad \forall t' \in \mathbb {N}. \end{aligned}$$
(33)

Therefore, \(\exists \alpha ' \in (0, 1]\) such that \(\tilde{\alpha }' = \alpha ' \cdot d_{\Omega }\left( \mathcal {X}_{i_{t'} - 1}\right) \) which satisfies the condition (26) of Theorem 3, \(\forall t' \in \mathbb {N}\). Thus, Algorithm 2 with \(\delta \) cycled as in (28) produces a sequence of iterates that is dense in \(\Omega \).

Next, consider the greedy \(\delta \)-cycling strategy proposed in Sect. 4.3.1, for an arbitrary choice of \(\Delta _{cycle}\) in (24). Let us examine the case in (25) when \(\delta \) is kept unchanged from one iteration of Algorithm 2 to the next. Denote as \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}, t_{max}'' \in \mathbb {N},\) the sequence of indexes of those samples that improve upon the current best candidate, resulting in no change in the exploration–exploitation trade-off weight. We have that:

$$\begin{aligned} \varvec{x}_{i_{t''}} \succ \varvec{x_{best}}\left( i_{t''} - 1\right) , \quad \forall t'': 1 \le t'' \le t_{max}''. \end{aligned}$$

Clearly, \(\varvec{x}_{i_{t''}} \notin \mathcal {X}_{i_{t''} - 1}\) since it is strictly preferred to all the other samples in \(\mathcal {X}_{i_{t''} - 1}\). Thus, we could define a positive constant \(\tilde{\alpha }'' \in \mathbb {R}_{> 0}\) analogously to (31):

$$\begin{aligned} \begin{aligned} \varvec{x}_{i_{t''}} \notin \mathcal {X}_{i_{t''} - 1} \implies&\exists \tilde{\alpha }'' \in \mathbb {R}_{>0} \text { such that } \min _{1\le i\le i_{t''}-1} \left\| \varvec{x}_{i_{t''}}-\varvec{x}_{i} \right\| _2 \ge \tilde{\alpha }'',\\&\forall t'': 1 \le t'' \le t_{max}''. \end{aligned} \end{aligned}$$
(34)

Finally, let us consider the greedy \(\delta \)-cycling strategy in (25) as a whole and assume, as in Theorem 4, that \(\exists \delta _j \in \Delta _{cycle}\) in (24) such that \(\delta _j = 0\). We can build a strictly increasing sequence of positive integers \(\langle i_{t}\rangle _{t \ge 1}\) by merging:

  • The elements of the sequence \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\), which are the indexes of those samples that improve upon the best candidates \(\varvec{x_{best}}\left( i_{t''} - 1\right) , \forall t'': 1 \le t'' \le t_{max}''\);

  • The elements of the sequence \(\langle i_{t'} \rangle _{t' \ge 1}\), which constitute the indexes of those samples found by solving the pure exploration problem in (30). Note that, unless Algorithm 2 always improves upon its current best candidate (in which case \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\) is actually infinite and hence \(\mathcal {X}_{\infty }\) is dense in \(\Omega \)), Problem (8) with \(a_N\left( \varvec{x}\right) \) in (23) is solved using \(\delta = 0\) infinitely often, although not necessarily every \(N_{cycle}\) iterations as in (29).

Hence, we can select a positive constant \(\tilde{\alpha } \in \mathbb {R}_{>0}\) as \(\tilde{\alpha } = \min \left\{ \tilde{\alpha }', \tilde{\alpha }''\right\} \) which, analogously to (33), is such that:

$$\begin{aligned} 0 < \tilde{\alpha } \le d_{\Omega }\left( \mathcal {X}_{i_{t} - 1}\right) , \quad \forall t \in \mathbb {N}. \end{aligned}$$
(35)

Therefore, \(\exists \alpha \in (0, 1]\) such that \(\tilde{\alpha } = \alpha \cdot d_{\Omega }\left( \mathcal {X}_{i_{t} - 1}\right) \) which satisfies the condition (26) of Theorem 3, \(\forall t \in \mathbb {N}\). Thus, GLISp-r with \(\delta \) in (23) cycled following the greedy \(\delta \)-cycling strategy in (25) converges to the global minimum of Problem (3). \(\square \)

Most preference-based response surface methods, such as the ones in [3, 4, 6, 13], do not address their convergence to the global minimum of Problem (3). In this work, we have shown that, by leveraging some results from the utility theory literature (see Sect. 2), we can find sufficient conditions on the preference relation of the human decision-maker (\(\succsim \) on \(\Omega \)) that guarantee the existence of a solution for Problem (3) and allow us to analyze the convergence of any preference-based procedure as we would in the global optimization framework.

We conclude this section with two Remarks on Theorem 4. The first deals with the importance of adding a zero entry inside \(\Delta _{cycle}\) in (24), while the second addresses the selection of the cycling set.

Remark 4

The omission of a zero entry inside \(\Delta _{cycle}\) in (24) does not necessarily preclude the global convergence of Algorithm 2 on all possible preference-based optimization problems. For example, if \(f\left( \varvec{x}\right) \) is a constant function then, after we evaluate the first sample \(\varvec{x}_1\), any other point brings no improvement (i.e. we would have \(\varvec{x}_i \sim \varvec{x}_1, \forall i > 1\)). The caveat is that, if \(\not \exists \delta _j \in \Delta _{cycle}\) such that \(\delta _j = 0\), then the sequence \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\) for which (34) holds is likely to be finite (i.e. GLISp-r does not improve upon its current best candidate infinitely often). Moreover, differently from Problem (30), we have no guarantee that the solutions of:

$$\begin{aligned} \varvec{x}_{N + 1}&= \arg \min _{\varvec{x}} \; a_{N}\left( \varvec{x}\right) , \quad \forall N: N \ne i_{t''} - 1, 1 \le t'' \le t_{max}'' \\ \text {s.t.}&\quad \varvec{x}\in \Omega , \end{aligned}$$

with \(a_N\left( \varvec{x}\right) \) defined as in (23) and for \(\delta \ne 0\), are not already present in \(\mathcal {X}_N\).

Therefore, we cannot ensure the existence of a strictly increasing sequence of positive integers that is infinite and for which (26) holds. In particular, the result in Theorem 3 does not apply to \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\), since that sequence is finite. Consequently, Algorithm 2 does not necessarily produce a sequence of iterates \(\langle \varvec{x}_i \rangle _{i \ge 1}\) that is dense in \(\Omega \), preventing its convergence on some (but not all) preference-based optimization problems.

Remark 5

Theorem 4 guarantees that, under some hypotheses, GLISp-r converges to the global minimum of Problem (3); however, it does not give any indication of its convergence rate. In particular, if \(\Delta _{cycle}\) is actually \(\langle 0 \rangle \), Algorithm 2 amounts to performing an exhaustive search without considering the information carried by the preferences in \(\mathcal {B}\) (6) and \(\mathcal {S}\) (7), which is quite inefficient [1]. Therefore, it is best to include some \(\delta _j\)’s in \(\Delta _{cycle}\) that allow the surrogate model to be taken into consideration. In practice, we suggest choosing terms that are well spread within the \(\left[ 0, 1\right] \) range, with one zero entry to ensure the result in Theorem 4. Intuitively, the rate of convergence will depend on how well \(\hat{f}_{N}\left( \varvec{x}\right) \) in (9) approximates \(f\left( \varvec{x}\right) \) as well as on the choice of \(\Delta _{cycle}\) in (24).

6 Empirical results

In this section, we compare the performances of algorithms GLISp-r and GLISp [3] on a variety of benchmark bound-constrained global optimization problems taken from [2, 14, 18, 25]. Consistently with the preference-based optimization literature, we stick to benchmark problems with fewer than \(n = 10\) decision variables [3, 4, 6, 13]. We also consider the revisited version of the IDW distance function in (13) employed by C-GLISp [40], which is an extension of GLISp [3] proposed by the same authors.Footnote 6 In particular, in C-GLISp [40]:

$$\begin{aligned} z_N(\varvec{x}) =&\left( \frac{N}{N_{max}} - 1\right) \cdot \arctan \left( \frac{\sum _{i = 1, i \ne i_{best}\left( N\right) }^N w_i\left( \varvec{x_{best}}\left( N\right) \right) }{\sum _{i = 1}^N w_i\left( \varvec{x}\right) } \right) + \nonumber \\&- \frac{N}{N_{max}} \cdot \arctan \left( \frac{1}{\sum _{i = 1}^N w_i\left( \varvec{x}\right) } \right) , \quad \forall \varvec{x} \in \mathbb {R}^n \setminus \mathcal {X}, \end{aligned}$$
(36)

while \(z_N(\varvec{x}) = 0, \forall \varvec{x} \in \mathcal {X}\). In (36), \(i_{best}\left( N\right) \in \mathbb {N}, 1 \le i_{best}\left( N\right) \le N,\) represents the index of the best-found candidate when \(|\mathcal {X}|= N\). In practice, \(z_N\left( \varvec{x}\right) \) in (36) improves the exploratory capabilities of GLISp [3] without the need to define an alternative acquisition function from the one in (15) (see [40]).
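As a sketch, (36) can be evaluated as follows, assuming the IDW weights \(w_i\left( \varvec{x}\right) = 1 / \left\| \varvec{x} - \varvec{x}_i \right\| _2^2\) consistent with the arctan form of (13) recalled in Sect. 4.2.1 (illustrative code, not the authors' implementation):

```python
import numpy as np

def z_cglisp(x, X, i_best, N_max):
    """Revisited IDW distance function (36) of C-GLISp;
    returns 0 when x coincides with one of the samples."""
    N = len(X)
    d2 = np.sum((X - x) ** 2, axis=1)
    if np.any(d2 == 0.0):                    # x is one of the samples
        return 0.0
    w_sum = np.sum(1.0 / d2)                 # sum of w_i(x)
    others = np.delete(X, i_best, axis=0)    # all samples but x_best
    d2_best = np.sum((others - X[i_best]) ** 2, axis=1)
    w_best_sum = np.sum(1.0 / d2_best)       # sum of w_i(x_best), i != i_best
    return ((N / N_max - 1.0) * np.arctan(w_best_sum / w_sum)
            - (N / N_max) * np.arctan(1.0 / w_sum))
```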

We point out that we could also consider the preferential Bayesian optimization algorithm in [6] as an additional competitor for GLISp-r. However, in [3], the authors show that GLISp outperforms the aforementioned method. As we will see in the next sections, GLISp-r exhibits convergence speeds that are similar to those of GLISp [3]; hence, we have decided to omit the algorithm in [6] from our analysis. Moreover, in our preliminary tests, the latter method proved not to be on par with the other competitors.

6.1 Experimental setup

All benchmark optimization problems have been solved on a machine with two Intel Xeon E5-2687W @3.00 GHz CPUs and 128 GB of RAM. GLISp-r has been implemented in MATLAB. Similarly, we have used the MATLAB code for GLISp provided in [3] (formally, version 2.4 of the software package) and the one for C-GLISp supplied in [40] (version 3.0 of the same code package). For all the procedures, Problem (8) has been solved using Particle Swarm Optimization (PSWARM). In particular, we have used its MATLAB implementation provided by [20, 34, 35].

To achieve a fair comparison, we have chosen the same hyper-parameters for both GLISp-r and GLISp/C-GLISp [3, 40], whenever possible. This applies, for example, to the shape parameter \(\epsilon \), which is initialized to \(\epsilon = 1\), and to the radial function \(\varphi \left( \cdot \right) \), which is an inverse quadratic. Furthermore, the shape parameter for the surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is recalibrated using LOOCV (see GLISp [3]) at the iterations \(\mathcal {K}_{R} = \left\{ 1, 50, 100\right\} \). Its possible values are \(\mathcal {E}_{LOOCV{}} = \{0.1000, 0.1668, 0.2783, 0.4642, 0.7743, 1, 1.2915, 2.1544, 3.5938, 5.9948, 10\}\). The remaining hyper-parameters shared by GLISp/C-GLISp [3, 40] and GLISp-r are set to \(\lambda = 10^{-6}\) and \(\sigma = 10^{-2}\). Regarding GLISp-r, we have chosen \(\Delta _{cycle} = \langle 0.95,0.7,0.35,0 \rangle \), where we have included a zero term to comply with the convergence result in Theorem 4. The reasoning behind this cycling sequence is that, after the initial sampling phase, we give priority to the surrogate so as to drive the algorithm towards more promising regions of \(\Omega \), for example where local minima are located. In practice, if \(f\left( \varvec{x}\right) \) is a function that can be approximated well by \(\hat{f}_N\left( \varvec{x}\right) \) with little data, starting with a \(\delta \) in (23) that is close to 1 may lead the procedure to converge considerably faster. If that is not the case, the remaining terms contained inside \(\Delta _{cycle}\) promote the exploration of other zones of the constraint set, either in dependence on or independently of \(\hat{f}_N\left( \varvec{x}\right) \). For the sake of completeness, we also analyze the performances of GLISp-r when equipped with two particular cycling sequences: \(\Delta _{cycle} = \langle 0.95 \rangle \) (“pure” exploitation) and \(\Delta _{cycle} = \langle 0 \rangle \) (pure exploration). Concerning GLISp/C-GLISp [3, 40], which use \(a_N\left( \varvec{x}\right) \) in (15), we set \(\delta = 2\), as proposed by the authors in [3]. Lastly, for GLISp-r, we select \(K_{aug} = 5\) for all optimization problems since, empirically, after several preliminary experiments, it has proven good enough to rescale \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) in most cases (see for example Fig. 4).

We run the procedures using a fixed budget of \(N_{max}=200\) samples and solve each benchmark optimization problem \(N_{trial} = 100\) times, starting from \(N_{init} = 4 \cdot n\) points generated by LHDs [24] with different random seeds. To ensure fairness of comparison, all the algorithms are started from the same samples.
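For reference, such initial designs can be generated with a standard LHD routine, e.g. SciPy's quasi-Monte Carlo module (the bounds below are placeholders):

```python
import numpy as np
from scipy.stats import qmc

def initial_design(n, l, u, seed):
    """N_init = 4 * n samples from a Latin Hypercube Design,
    with one random seed per problem instance."""
    sampler = qmc.LatinHypercube(d=n, seed=seed)
    return qmc.scale(sampler.random(4 * n), l, u)

# one design per instance, shared by all algorithms for fairness
designs = [initial_design(2, [-5.0, -5.0], [5.0, 5.0], seed) for seed in range(100)]
```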

For the sake of clarity, we point out that the preferences between the couples of samples are expressed using the preference function in (4), where \(f\left( \varvec{x}\right) \) is the corresponding cost function for the considered benchmark optimization problem. Therefore, the preferences are always expressed consistently, allowing us to compare GLISp-r and GLISp/C-GLISp [3, 40] properly. That would not necessarily be the case if a human decision-maker was involved.

6.2 Results

We compare the performances of GLISp-r and GLISp/C-GLISp [3, 40] on each benchmark optimization problem by means of convergence plots and data profiles [1]. Convergence plots depict the median, best and worst case performances over the \(N_{trial}\) instances with respect to the cost function values achieved by \(\varvec{x_{best}}\left( N\right) \) as N increases. Data profiles are one of the most popular tools for assessing the efficiency and robustness of global optimization methods: efficient surrogate-based methods exhibit steep slopes (i.e. fast convergence speeds), while robust algorithms are able to solve more (instances of) problems within the budget \(N_{max}\). In general, no method is both efficient and robust: a trade-off must be made [32]. Typically, data profiles are used to visualize the performances of several algorithms on multiple benchmark optimization problems simultaneously. However, due to the stochastic nature of LHDs [24], here we will also depict the data profiles for GLISp-r, GLISp [3] and C-GLISp [40] on each benchmark optimization problem on its own, highlighting how the algorithms behave when started from different samples. In practice, data profiles show, for \(1 \le N \le N_{max}\), how many among the \(N_{trial}\) instances of one (or several) benchmark optimization problem(s) have been solved to a prescribed accuracy, defined as [1]:

$$\begin{aligned} acc\left( N\right) = \frac{f\left( \varvec{x_{best}}\left( N\right) \right) -f\left( \varvec{x}_{1}\right) }{f^*-f\left( \varvec{x}_{1}\right) }, \end{aligned}$$
(37)

where \(f^* = \min _{\varvec{x} \in \Omega } f\left( \varvec{x}\right) \). In particular, here we consider a benchmark optimization problem to be solved by some algorithm when \(acc\left( N\right) > t, t = 0.95\). In what follows, the results achieved by GLISp-r equipped with \(\Delta _{cycle} = \langle 0.95 \rangle \) and \(\Delta _{cycle} = \langle 0 \rangle \) are reported only in the data profiles (and not in the convergence plots), to make the graphs more readable.
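A data profile then follows directly from (37), as in this sketch (f_best_hist[k] holds \(f\left( \varvec{x_{best}}\left( N\right) \right) \) for \(N = k + 1\); names are illustrative):

```python
import numpy as np

def accuracy(f_best_hist, f_star):
    """Accuracy measure (37) as a function of the number of samples N."""
    f1 = f_best_hist[0]                      # f(x_1)
    return (np.asarray(f_best_hist) - f1) / (f_star - f1)

def data_profile(all_hists, f_star, t=0.95):
    """Fraction of the N_trial instances solved (acc(N) > t) for each N."""
    acc = np.vstack([accuracy(h, f_star) for h in all_hists])
    return (acc > t).mean(axis=0)
```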

Figure 5 shows the cumulative data profiles of GLISp-r, GLISp [3] and C-GLISp [40], which result from considering all the benchmark optimization problems simultaneously. Instead, Figs. 6 and 7 depict the convergence plots and the data profiles achieved by the considered algorithms on each benchmark. In Table 1 we report the number of samples required to reach the accuracy \(t = 0.95\), i.e.:

$$\begin{aligned} N_{acc > t} = \min _{1 \le N \le N_{max}} N \quad \text {such that } acc\left( N\right) > t. \end{aligned}$$
(38)

In practice, given that each benchmark optimization problem is solved multiple times, \(N_{acc > t}\) in Table 1 is assessed median-wise (over the \(N_{trial}\) instances), giving us an indication of the efficiency of each method. In the same table, we also report the percentages of instances of problems solved by GLISp-r, GLISp [3] and C-GLISp [40] (an indicator of robustness) and the average execution times of each algorithm.

Fig. 5

Cumulative data profiles (\(acc\left( N\right) > 95\%\)) of the considered preference-based optimization algorithms. GLISp-r is depicted in red (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \) continuous line, \(\Delta _{cycle} = \langle 0.95 \rangle \) dashed line, \(\Delta _{cycle} = \langle 0 \rangle \) dotted line), GLISp [3] in blue and C-GLISp [40] in green (color figure online)

Fig. 6

Performances achieved by the considered preference-based optimization algorithms on the benchmark optimization problems: convergence plots on the left and data profiles (\(acc\left( N\right) > 95\%\)) on the right. GLISp-r (with \(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) is depicted in red, GLISp [3] in blue and C-GLISp [40] in green. The dashed black line in the convergence plots represents \(f^*\). We also show the number of initial samples, \(N_{init}\), with a black vertical line. The results obtained by GLISp-r with \(\Delta _{cycle} = \langle 0.95\rangle \) (dashed red line) and GLISp-r with \(\Delta _{cycle} = \langle 0\rangle \) (dotted red line) are shown only in the data profiles (color figure online)

Fig. 7

Figure 6 cont’d

From our experiments, we gather that:

  • GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) can be notably more robust than GLISp [3] without excessively compromising its convergence speed. On several occasions, the latter algorithm gets stuck on local minima of the benchmark optimization problems. That is particularly evident on the bemporad [2] and gramacy-lee [14] benchmarks, in which cases GLISp [3] solves, respectively, only 70% and 31% of the \(N_{trial}\) instances. Instead, GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) is able to solve both of them to the prescribed accuracy. In practice, GLISp [3] shines when exploitation is better suited for the benchmark optimization problem at hand. That is particularly relevant for the bukin 6 [18] problem, on which both GLISp [3] and GLISp-r with \(\Delta _{cycle} = \langle 0.95 \rangle \) (“pure” exploitation) perform quite well. Lastly, C-GLISp [40] is as robust as GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) but is the least efficient among the analyzed procedures.

  • The pure exploration strategy (i.e. GLISp-r with \(\Delta _{cycle} = \langle 0\rangle \)) performs poorly, even for \(n = 2\). When \(n = 5\), it is unable to solve any problem (in fact, the data profiles stay flat after the initial sampling phase). Only for \(n = 1\) is the pure exploration strategy quite robust and relatively efficient.

  • Vice-versa, a “pure” exploitatory approach (i.e. GLISp-r with \(\Delta _{cycle} = \langle 0.95\rangle \)), although not necessarily globally convergent (see Theorem 4), can actually be successful on some benchmark optimization problems. Often, such a strategy exhibits a slightly lower \(N_{acc > t}\) (median-wise) than the other procedures, see Table 1. Notably, the data profiles of GLISp-r (\(\Delta _{cycle} = \langle 0.95\rangle \)) can be quite similar to those of GLISp [3]. Therefore, we could say that GLISp [3] has limited exploratory capabilities.

  • The main disadvantage of GLISp-r, compared to GLISp [3] and C-GLISp [40], is the increased computational time, as reported in Table 1. That is due to the computational overhead of Algorithm 1, which generates the augmented sample set \(\mathcal {X}_{aug}\) for the proposed procedure by performing K-means clustering. On average, GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) is 31% slower than GLISp [3] and 72% slower than C-GLISp [40]. However, we point out that these overheads are practically negligible when the queries to the decision-maker involve running computer simulations or performing real-world experiments which, contrary to the considered benchmark optimization problems, can take from a few minutes up to several hours. This is a common assumption made by surrogate-based methods: at each iteration, the most time-consuming operation is the query to the decision-maker (or, in the context of black-box optimization, the measure of the cost function [19, 36]).

Table 1 Comparison between the considered preference-based optimization algorithms

7 Conclusions

In this work, we introduced the preference-based optimization problem from a utility theory perspective. Then, we extended algorithm GLISp [3], giving rise to GLISp-r. In particular, we have addressed the shortcomings of \(a_N\left( \varvec{x}\right) \) in (15), defined a new acquisition function and proposed the greedy \(\delta \)-cycling strategy, which dynamically varies the exploration–exploitation weight \(\delta \) in (23). Furthermore, we have proven the global convergence of GLISp-r, which is strictly related to the choice of the cycling sequence \(\Delta _{cycle}\). To the best of our knowledge, GLISp-r is the first preference-based surrogate-based method with a formal proof of convergence.

Compared to the original method, the proposed extension is less likely to get stuck on local minima of the scoring function and proves to be more robust on several benchmark optimization problems without particularly compromising its convergence speed. Moreover, we have also considered algorithm C-GLISp [40], which improves the exploratory capabilities of GLISp [3] by employing a different exploration function. In our experiments, we have observed that, even though C-GLISp [40] is as robust as GLISp-r, it often exhibits slower convergence rates compared to GLISp [3] and the proposed extension.

Further research will be devoted to extending GLISp-r to handle black-box constraints, which involve functions whose analytical formulations are not available. One possibility is to follow the same reasoning behind C-GLISp [40], which does so by adding a term to the acquisition function that penalizes the exploration in those regions of \(\Omega \) where the black-box constraints are likely to be violated. Additionally, a compelling line of research concerns preference-based optimization problems in the case where the human decision-maker is “irrational” (in the sense that some of the properties in Definition 3, typically completeness, do not hold). From a practical perspective, when the preference relation \(\succsim \) on \(\Omega \) of the DM is not complete, we would need a surrogate model that is able to handle the answer “I do not know” when the decision-maker cannot decide which, among two calibrations, he/she prefers.