1 Introduction

The generally constrained global optimization problem addressed here has the form

$$\begin{aligned} \min _{x \in \Omega } \; f(x)\quad \text{subject to} \quad g_I(x) \le 0 \quad \text{and} \quad g_E(x) = 0 \end{aligned}$$
(1)

where \(\Omega\) is a finite box defined by upper (U) and lower (L) bounds, as follows

$$\begin{aligned} \Omega = \{ x \in \mathbb {R}^n: L_i \le x_i \le U_i \quad \forall i = 1,\ldots ,n \}. \end{aligned}$$

The objective function f maps \(\mathbb {R}^n\) into \(\mathbb {R}\). The equality constraint function \(g_E(x)\) and inequality constraint function \(g_I(x)\) map \(\mathbb {R}^n\) into \(\mathbb {R}^q\) and \(\mathbb {R}^{m-2q}\), respectively. It is assumed that \(L < U\) and that f, \(g_E\) and \(g_I\) are continuous functions of x on \(\Omega\).

For simplicity, the equality constraints are replaced by the pair of inequality constraints \(g_E(x) \le 0\) and \(-g_E(x) \le 0\). This puts all constraints in the convenient form \(g(x) \le 0\) where g(x) maps \(\mathbb {R}^n\) into \(\mathbb {R}^m\), yielding a simpler expression of problem (1), viz.

$$\begin{aligned} \min _{x \in \Omega } \; f(x) \quad \text{subject to} \quad g(x) \le 0 \end{aligned}$$
(2)

From now on, we work with (2) rather than (1).
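The reduction from (1) to (2) can be sketched as follows; `combine_constraints` and the toy constraint functions are illustrative names, not part of the paper.

```python
import numpy as np

def combine_constraints(g_I, g_E):
    """Build the combined map g with g(x) <= 0 by replacing the
    equality constraints g_E(x) = 0 with the inequality pair
    g_E(x) <= 0 and -g_E(x) <= 0."""
    def g(x):
        gi = np.atleast_1d(g_I(x))
        ge = np.atleast_1d(g_E(x))
        return np.concatenate([gi, ge, -ge])
    return g

# Toy example: one inequality x0 + x1 - 1 <= 0 and one equality x0 - x1 = 0.
g = combine_constraints(lambda x: x[0] + x[1] - 1.0,
                        lambda x: x[0] - x[1])
```

A point is then feasible for (2) exactly when every component of `g(x)` is non-positive.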

A wide variety of algorithms have been proposed for this problem. The majority of these methods are stochastic, although deterministic methods also exist [9].

Many stochastic methods are population based, such as particle swarm [4, 14], differential evolution [17], fish swarm [24], artificial immune system [3], and electromagnetism-like methods [1]. Stochastic methods are often paired with local searches to refine identified minimizers. For example, Hedar and Fukushima [7] pair the derivative-free Nelder–Mead method with simulated annealing. Sequential quadratic programming is paired with multistart clustering in deft-funnel [27], and with particle swarm methods in [11].

Methods can also be differentiated by how expensive each function value is to compute. For expensive objectives, methods using surrogates [15, 25] are effective, with radial basis functions being a common means of generating the surrogates. The surrogate is designed to be much cheaper to evaluate than the objective and constraint functions. The surrogate can then be minimized by a subsidiary global optimization method designed for cheaper-to-evaluate functions, giving a new sample point for (2). A more sophisticated fusion combining accelerated random search [2] with surrogates is given by Nuñez et al. [19] and Regis [23].

These methods employ a variety of strategies to adjudicate between changes in objective function and constraint violations. They include filters [7, 15, 21, 24], penalty functions [1, 9, 26], interval arithmetic [14] and other techniques such as adaptive trade-off models [30]. In addition, sample points can be biased towards feasibility by using constraint gradient information [29] or other processes [31]. A thorough survey of constraint handling techniques is given by Mezura-Montes and Coello [16].

In the next section, we describe a stochastic penalty function method for problems with objective and constraint functions that are cheap to evaluate. Convergence is discussed in Sect. 3, with numerical results given in Sect. 4. The final section concludes the paper.

2 Algorithm development

The feasible region of problem (2) is

$$\begin{aligned} \mathcal {F} = \{ x \in \Omega : \, g_i(x) \le 0 \;\;\; \forall i = 1,\ldots ,m \}. \end{aligned}$$

The proposed algorithm extends the oscars-ii algorithm [22] for bound constrained global optimization to problems of the form (1). Oscars-ii and the new method generate random sample points in \(\Omega\) and various subregions of \(\Omega\). Both methods retain two points at each iteration: the best known point \(b \in \Omega\) and a control point \(c \in \Omega\) used to direct the construction of the sampled subregions of \(\Omega\). The basic structure of each iteration is to randomly choose an iterate from the current subregion of \(\Omega\), calculate f and g at that iterate, and then update b, c and the sampling subregion.

In oscars-ii, general constraints are absent, and b is simply the iterate with the least known value of f. From time to time the control point is reset to b or to a random point in \(\Omega\). Between resets, c is the point with the least f value found at or after the most recent reset.

The presence of general constraints means b and c must be chosen differently. At each iteration the new method chooses b as the best known feasible point or, if no feasible point has been found, the least infeasible point. In contrast, c is chosen to minimize a merit function J over the iterates generated since the most recent reset. The merit function contains a parameter which is adjusted occasionally to obtain an acceptable convergence rate.

If \(\mathcal {F}\) has measure zero, the algorithm will almost surely fail to find a feasible point with any finite number of sample points. To circumvent this issue, violations of the general constraints \(g \le 0\) up to a specified tolerance \(\tau _\textsf{c} \ge 0\) are permitted. The subset of \(\Omega\) satisfying the general constraints within this tolerance is

$$\begin{aligned} \mathcal {F}_\textsf{tol} = \left\{ x \in \Omega : \; g_i(x) \le \tau _\textsf{c} \;\;\; \forall i = 1,\ldots ,m \right\} . \end{aligned}$$

Provided \(\mathcal {F}\) is non-empty, continuity of g implies \(\mathcal {F}_\textsf{tol}\) has positive measure for all \(\tau _\textsf{c} > 0\). Depending on the nature of the constraints, \(\tau _\textsf{c} > 0\) might be necessary, but if \(\mathcal {F}\) has a positive measure, then \(\tau _\textsf{c} = 0\) suffices.

Throughout this paper \(x^*\) and \(x^*_\textsf{tol}\) denote arbitrary global minimizers of f over \(\mathcal {F}\) and \(\mathcal {F}_\textsf{tol}\) respectively. Also \(f^* = f(x^*)\) and \(f^*_\textsf{tol} = f(x^*_\textsf{tol})\) are used.

2.1 The merit function

Jones [9] introduced the auxiliary function

$$\begin{aligned} J_\textsf{orig} = [f(x) - \phi ]_+ + \sum _{j=1}^m w_j [g_j(x)]_+ \end{aligned}$$

to extend the direct algorithm [10] to problems of the form (1). Here \(\phi\) is a possible value of \(f^*\), \(w_j\) are positive weights and \([g_j]_+ = \max (g_j,0)\). Direct subdivides \(\Omega\) into ever finer sets of rectangles, where each such set covers \(\Omega\). At each iteration some rectangles are divided, yielding the next cover. The rectangles which are selected are those which could contain a global minimizer of \(J_\textsf{orig}\) for some value of \(\phi\) and some value of a Lipschitz constant for \(J_\textsf{orig}\). In order to do this, direct calculates and retains the objective and constraint function values at the centre of every rectangle. In contrast, oscars-ii and the new method retain only b and c.

The auxiliary function \(J_\textsf{orig}\) is modified cosmetically to yield the merit function J. Specifically \(\phi\) is added to the first term, and a different measure of infeasibility is used for the second term giving

$$\begin{aligned} J(x,\phi ) = \max \left\{ f(x), \phi \right\} + \frac{ \left( v(x) \right) ^2}{1 + v(x) } \end{aligned}$$

Here v(x) is the unweighted 2–norm of the constraint violations

$$\begin{aligned} v(x) = \left\| [g(x)]_+ \right\| _2, \end{aligned}$$

and the parameter \(\phi\) is an estimate of \(f^*\) that is updated from time to time. The second term in J behaves like \(v^2\) when v is small, and like v for \(v \gg 1\). This gives J some of the characteristics of a rounded \(\ell _2\) exact penalty function.
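The merit function and violation measure just defined can be written directly; this is a minimal sketch, with function names chosen here for illustration.

```python
import numpy as np

def violation(g_vals):
    """v(x) = ||[g(x)]_+||_2, the unweighted 2-norm of the
    constraint violations."""
    return float(np.linalg.norm(np.maximum(g_vals, 0.0)))

def merit(f_val, g_vals, phi):
    """J(x, phi) = max{f(x), phi} + v^2 / (1 + v)."""
    v = violation(g_vals)
    return max(f_val, phi) + v * v / (1.0 + v)
```

For small v the penalty term is approximately \(v^2\) (e.g. `merit(0.0, [0.01], 0.0)` adds roughly \(10^{-4}\)), while for large v it grows roughly linearly, matching the rounded \(\ell_2\) exact penalty behaviour described above.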

When \(\phi = f^*\), J has the property that its global minimizer(s) over \(\Omega\) are precisely the solutions of (2). To see this first note that \(J(x^*,\phi ) = f^*\) when \(\phi = f^*\) and \(x^*\) solves (2). For any \(x \in \Omega\) we have

$$\begin{aligned} J(x,f^*) = \max \{f(x),f^*\} + \frac{ \left( v(x) \right) ^2}{1 + v(x) } \ge f^* + 0 = J(x^*,f^*) \end{aligned}$$

showing that \(x^*\) is a global minimizer of \(J(\cdot ,f^*)\) over \(\Omega\). Conversely \(J(x,f^*) = f^*\) can only hold if \(f(x) \le f^*\) and \(v(x) = 0\). However \(v(x) = 0\) implies \(x \in \mathcal {F}\), which means that \(f(x) \ge f^*\). Hence the set of global minimizers of (2) is the same as the set of global minimizers of \(J(\cdot ,f^*)\) over \(\Omega\), as required.

2.2 An iteration

The algorithm uses a sequence of iterations indexed by k. At iteration k, the algorithm randomly draws one sample point \(x_k\) from a sampling region \(\Omega _k \subseteq \Omega\) and calculates f and g at \(x_k\). The algorithm also updates two points at each iteration: the best known point \(b_k\) and the current control point \(c_k\). A subscript k refers to a quantity’s value at iteration k.

Each \(\Omega _k\) is box-shaped and aligned with the coordinate axes, which makes randomly sampling \(\Omega _k\) straightforward. Specifically, the sampling region has the form

$$\begin{aligned} \Omega _k = \{ x \in \mathbb {R}^n: \ell _k \le x \le u_k \} \end{aligned}$$

where \(\ell _k \in \mathbb {R}^n\) and \(u_k \in \mathbb {R}^n\) are the vectors of lower and upper bounds satisfying \(L \le \ell _k\) and \(u_k \le U\). The bounds \(\ell _k\) and \(u_k\) are adjusted at each iteration so that \(c_k \in \Omega _k\) always holds.
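Because \(\Omega_k\) is an axis-aligned box, drawing a uniform sample from it is a one-liner; the helper name below is illustrative.

```python
import numpy as np

def sample_box(lower, upper, rng):
    """Draw one point uniformly from the axis-aligned box
    {x : lower <= x <= upper}."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return lower + rng.random(lower.shape) * (upper - lower)

rng = np.random.default_rng(0)
x = sample_box([0.0, -1.0], [1.0, 1.0], rng)
```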

If at least one point in \(\mathcal {F}_\textsf{tol}\) has been found, \(b_k\) is the sample point in \(\mathcal {F}_\textsf{tol}\) with the least value of f. Otherwise, \(b_k\) is the least infeasible sample point in the sense that it has the smallest value of v(x). In contrast, the control point is the ‘recently generated’ sample point with the least J value.

From time to time, the control point is reset, allowing the method to alternately search widely across \(\Omega\), and focus attention in the vicinity of the currently best known point.

2.3 The control point and selecting \(\Omega _k\)

The control point \(c_k\) is used to direct how each \(\Omega _{k+1}\) is formed from its predecessor \(\Omega _k\). After selecting the current sample point \(x_k\), if \(J(x_k,\phi ) \ge J(c_k,\phi )\), then \(x_k\) is rejected and \(\Omega _{k+1}\) is chosen so that \(\Omega _{k+1} \subset \Omega _k\) with \(c_{k+1} = c_k \in \Omega _{k+1}\) and \(x_k \not \in \Omega _{k+1}\). If this proposed \(\Omega _{k+1}\) is too small along all coordinate directions, it is reset via \(\Omega _{k+1} = \Omega\). In any case, the current control point is retained via \(c_{k+1} = c_k\).

Alternatively, if \(J(x_k,\phi ) < J(c_k,\phi )\), then \(x_k\) is judged to be superior to \(c_k\) by the merit function, and \(c_{k+1} = x_k\) and \(\Omega _{k+1} = \Omega\) are used. Hence, if a better point is found, this becomes the new control and the sampling box resets to \(\Omega\).
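The accept/reject decision of this subsection can be sketched as below. This is a simplified reading of the text: `cut` stands in for the box-cutting sub-algorithm and is assumed to return a smaller box containing c but not x; `toy_cut` is a stand-in for illustration only.

```python
import numpy as np

def iterate_update(x, c, J_x, J_c, box, full_box, cut, h_min=1e-6):
    """One accept/reject decision. `box` and `full_box` are
    (lower, upper) pairs; `full_box` represents Omega."""
    if J_x < J_c:                 # sample beats control: accept it,
        return x, full_box        # and reset the sampling box to Omega
    lo, hi = cut(box, x, c)       # reject: cut x out of the box
    if np.all(hi - lo < h_min):   # proposed box too small on every axis:
        return c, full_box        # keep c and reset the box to Omega
    return c, (lo, hi)

def toy_cut(box, x, c):
    """Illustrative cut: shrink the box to the span between c and
    the box midpoint (not the paper's Algorithm 2)."""
    lo, hi = box
    mid = 0.5 * (np.asarray(lo, float) + np.asarray(hi, float))
    return np.minimum(c, mid), np.maximum(c, mid)
```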

2.4 Passes and cycles

Iterations in which the sampling box \(\Omega _k\) is reset to \(\Omega\) can be used to group iterations into passes. Each pass starts at an iteration with \(\Omega _k = \Omega\), and ends on the iteration before \(\Omega _k = \Omega\) next occurs. Similarly, iterations in which c is reset can be used to group iterations (and passes) into cycles. Each cycle starts at an iteration in which c is reset, and ends on the iteration before the next reset of c occurs. Since c is only ever reset when \(\Omega _k = \Omega\), each cycle consists of a whole number of passes. This is described in more detail now.

The process used to generate each \(\Omega _k\) produces contiguous subsequences of nested boxes, bracketed by iterations where \(\Omega _k = \Omega\). A sequence of such iterations forms a pass. For example, if iterations \(k,k+1,\ldots ,k+p\) form a pass then \(\Omega _k \supset \Omega _{k+1} \supset \cdots \supset \Omega _{k+p-1} \supset \Omega _{k+p}\) where \(\Omega _k = \Omega _{k+p+1} = \Omega\).

The sequence of passes is divided into contiguous subsequences of passes called cycles. The event which marks the start of a cycle (and hence the end of the previous cycle) is that the standard way of choosing \(c_k = c_{k-1}\) or \(c_k = x_{k-1}\) is suspended for one iteration. Instead, at the start of cycle number \(N_c\), if \(N_c\) is odd, then \(c_k\) is chosen randomly from \(\Omega\). If \(N_c\) is even, \(c_k\) is set equal to the current best known point \(b_k\). In both cases \(\Omega _k = \Omega\) is used.

The motivation for alternately starting cycles with random controls and the best known point is that the former aids exploration of unexplored areas of \(\Omega\) whereas the latter focuses the search in the most promising area found so far. To ensure the method alternates between these two cases on a regular basis, the maximum number of sample points in each cycle is limited, with this limit increasing with increasing \(N_c\). Cycles are also ended if they repeatedly fail a ‘stall test’ which assesses whether or not the current cycle is likely to improve the best known point. The stall test is only performed at the end of each pass. If a cycle is ended for any reason, the current pass is also ended.

In summary, passes and cycles end for the following four reasons:

  1. An improved control point has been found (pass ends only).

  2. The sample box size falls below \(h_\textsf{min}\) along all axes (pass ends only).

  3. \(T_\textsf{stall}\) consecutive stall test failures occur (pass and cycle end).

  4. The maximum permitted number of sample points in the current cycle is reached (pass and cycle end). Herein this maximum is \(30(3+N_c)\) as per Price et al. [22].

In practice, case 4 is checked first. If case 4 does not hold, cases 1 and 2 are checked. If either case 1 or 2 holds, then case 3 is checked for the end of a cycle, otherwise case 3 is skipped.

At the start of each even numbered cycle, the algorithm may take one uphill step that increases J by at most G. This step aids the algorithm in moving along constraint boundaries, and is described in more detail later.

Next, the main algorithm is listed. It counts the number of iterates in the current cycle using j. Also \(c_\textsf{old}\) and \(c_\textsf{older}\) denote the control points at the ends of the previous two passes in the cycle. At the start of each cycle \(c_\textsf{old}\) and \(c_\textsf{older}\) are set equal to that cycle’s initial control point. These two quantities are used in the stall test.

Algorithm 1 (figure a): The main algorithm

2.5 The uphill step

When \(G > 0\), step 5 will accept an iterate \(x_k\) up to G worse than c. Normally \(G = 0\) is used except for cycles which start with the best known point as the initial control point. These cycles can use \(G > 0\) until the control point is first updated; after that \(G = 0\) is used. At the start of each such cycle, G is set to one percent of the average constraint violation (capped at 100) seen by the algorithm so far. This choice permits an uphill step of at most 1 when all sample points generated so far are highly infeasible. As less infeasible or feasible sample points are found, this maximum uphill step seamlessly reduces to its minimum possible value of \(G = 0\). The latter is achieved when only feasible sample points are encountered, irrespective of whether general constraints are present or not.
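The computation of G can be sketched as below. One detail is an interpretation on our part: the cap at 100 is read as applying to the average violation, which is consistent with the stated maximum uphill step of 1.

```python
import numpy as np

def uphill_allowance(violations_seen):
    """G = one percent of the average constraint violation v(x) seen so
    far, with that average capped at 100 (an assumed reading), so the
    permitted uphill step never exceeds 1 and is 0 once only feasible
    points have been seen."""
    if len(violations_seen) == 0:
        return 0.0
    return 0.01 * min(float(np.mean(violations_seen)), 100.0)
```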

This allows a single worse step off the best known point at the start of each evenly numbered cycle. The purpose of this is to enhance the algorithm’s ability to move along constraint boundaries.

Allowing uphill steps has pros and cons. If the best point is wedged in a vee, then an uphill step can escape the notch, and subsequent steps might be able to locate an improved point more rapidly. The risk is that the algorithm will waste time undoing the uphill step without any gain. Numerical experiments with test problems which have no general constraints show that the latter is the dominant effect on such problems. When general constraints are present, numerical results indicate the uphill step is beneficial.

2.6 Box cutting procedure

At each iteration, a sample point \(x_k\) is generated in \(\Omega _k\). If \(x_k\) is not better than the current control point \(c_k\) (specifically \(J(x_k,\phi _k) \ge J(c_k,\phi _k)\)), then part of the region \(\Omega _k\) is cut off, yielding the next sample box \(\Omega _{k+1} \subset \Omega _k\). This cutting process is done by shifting some of the bounds defining \(\Omega _k\) inwards towards \(c_k\). Each such bound shift is equivalent to cutting off part of \(\Omega _k\) with a hyperplane orthogonal to some coordinate axis. Each of these cuts is selected so that \(c_k\) lies in the part of \(\Omega _k\) that is retained, and \(x_k\) lies in the part that is cut off. Since the current control point is retained when \(\Omega _k\) is cut, this yields \(c_{k+1} = c_k \in \Omega _{k+1}\).

In the rest of this subsection, the iteration number k is dropped from the subscripts of all quantities. In some places a subscript i appears. It denotes the \(i^\textsf{th}\) component of the relevant quantity at iteration k.

The cutting process (listed in Algorithm 2) is governed by two parameters: A and \(\beta\). The former governs how close each cut is to \(c_k\) and \(\beta\) affects which faces of \(\Omega _k\) are cut off. A cut is performed perpendicular to each coordinate axis for which the magnitude of the corresponding component of the trial step \(s = x - c\) is at least \(\beta \left\| x - c \right\| _\infty\), where \(0< \beta < 1\). For each such dimension i, the cut passes a fraction \(A |s_i|/\Vert s\Vert _{\infty }\) along the line segment from x to c, where \(0< A < 1\). Hence, for dimensions with maximal \(|s_i|\), the cut is a fraction A of the distance from x to c.

Algorithm 2 (figure b): The box cutting sub-algorithm
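A sketch of the cutting step just described, under our reading of the parameter roles (the listing itself appears only as a figure, so details here are an interpretation): for each qualifying axis the bound on the x side moves to the point a fraction \(A|s_i|/\Vert s\Vert_\infty\) of the way from x to c along that axis.

```python
import numpy as np

def cut_box(lower, upper, x, c, A=0.9, beta=1.0 / 3.0):
    """Cut the box [lower, upper] so that c is retained and x excluded.
    Axes with |s_i| >= beta * ||s||_inf (s = x - c) are cut."""
    lower = np.array(lower, dtype=float)
    upper = np.array(upper, dtype=float)
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    s = x - c
    smax = np.max(np.abs(s))
    for i in range(len(s)):
        if s[i] != 0.0 and abs(s[i]) >= beta * smax:
            # cut point lies strictly between x_i and c_i since A < 1
            cut_pt = x[i] + (A * abs(s[i]) / smax) * (c[i] - x[i])
            if s[i] > 0:                 # x above c: pull the upper bound in
                upper[i] = min(upper[i], cut_pt)
            else:                        # x below c: push the lower bound up
                lower[i] = max(lower[i], cut_pt)
    return lower, upper
```

For axes with maximal \(|s_i|\) the cut sits a fraction A of the distance from x to c, as stated in the text.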

2.7 Stall test

Oscars-ii [22] uses a Kolmogorov–Smirnov (KS) statistic to terminate unprofitable cycles on problems with no general constraints. This test uses the 100 best objective function values in the current cycle, sorted in increasing order, which prevents the KS test from being easily carried over to the generally constrained case. The obvious approach is to apply the same test with values of J in place of the f values. However, J depends on \(\phi\), and \(\phi\) can be updated at any iteration. This means all points in the current cycle must be stored and re-sorted after every change in \(\phi\), making the KS test significantly more expensive to implement. Thus we replace it with a simpler, cheaper to implement stall test, which is described now.

A stall test is done at the end of each pass which is not also the end of the current cycle. This situation occurs whenever a new point with a lower J value is found or the minimum box size is reached. If the stall test indicates that progress is poor for \(T_\textsf{stall}\) consecutive passes, the cycle is considered to have stalled. Poor progress can occur in two ways: either the pass does not improve the current control point c at all, or progress is made but is insignificant.

For the latter case, let \(J^*\) minimize \(J(\cdot ,\phi _k)\) over \(\Omega\). We make the simplifying assumption that the measure of the level sets for J just greater than \(J^*\) can be adequately approximated by a power law of the form \(K (J - J^*)^p\) for some values of K and p. This often occurs in practice: for example an unconstrained minimizer of a \(C^2\) function with a positive definite Hessian has this characteristic. Under this assumption, it is easily seen that the expected reduction in J from an improving step is proportional to \(J(c) - J^*\), provided c is sufficiently close to \(x^*\) for any sample x satisfying \(J(x) < J(c)\) to be drawn randomly from the level set \(\{x \in \Omega : J(x) < J(c)\}\).

The expected value of each reduction in J is unknown. In its place, the actual changes in J are used. This allows a rough estimate \(J_\textsf{est}\) of the limit of J for the current cycle that can be formed after each improvement in J via

$$\begin{aligned} J_\textsf{est} = J(c) - \lambda \frac{J(c_\textsf{old}) - J(c)}{1 - \lambda }\quad \text{where} \quad \lambda = \frac{J(c_\textsf{old}) - J(c)}{J(c_\textsf{older}) - J(c_\textsf{old})} \end{aligned}$$

If either

  (a) no improvement in J has been made in this pass; or

  (b) both \(J(c_\textsf{old}) - J(c) < J(c_\textsf{older}) - J(c_\textsf{old})\) and \(J_\textsf{est} \ge J(b) - \tau _\textsf{stall}\)

hold, then the number of consecutive stalled passes \(N_s\) is incremented, otherwise \(N_s\) is set to zero. Noting that \(J(c) \le J(c_\textsf{old}) \le J(c_\textsf{older})\), the first condition in (b) guarantees that \(\lambda\) is defined and \(\lambda < 1\). If \(\lambda \ge 1\), progress does not appear to be decaying and the method is assumed to not be stalling. Here \(\tau _\textsf{stall}\) is the smallest decrease in J which is considered significant.
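One evaluation of the stall test, combining the \(J_\textsf{est}\) formula with conditions (a) and (b), can be sketched as follows (function and argument names are ours).

```python
def stall_check(J_c, J_c_old, J_c_older, J_b, tau_stall=1e-6, improved=True):
    """Return True if the pass counts as stalled.
    Assumes J_c <= J_c_old <= J_c_older."""
    if not improved:                      # (a): no improvement this pass
        return True
    d1 = J_c_old - J_c                    # latest reduction in J
    d0 = J_c_older - J_c_old              # previous reduction in J
    if d1 >= d0:                          # lambda >= 1: not decaying,
        return False                      # so not considered stalled
    lam = d1 / d0                         # d0 > d1 >= 0 guarantees d0 > 0
    J_est = J_c - lam * d1 / (1.0 - lam)  # estimated limit of J this cycle
    return J_est >= J_b - tau_stall       # (b): limit won't beat J(b)
```

For example, reductions of 1.0 then 0.5 give \(\lambda = 0.5\) and \(J_\textsf{est} = J(c) - 0.5\); whether this counts as stalled then depends on how far \(J_\textsf{est}\) sits above \(J(b) - \tau_\textsf{stall}\).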

Once \(T_\textsf{stall}\) consecutive passes have each yielded poor progress the cycle is ended, and a new cycle starts from either a random point in \(\Omega\), or the best known point.

3 Convergence properties

This section looks at the convergence properties of the method when run in exact arithmetic without halting. Since there is no guarantee that the feasible region has positive measure, it is necessary to frame the results in terms of the essential global minimum, which is as follows.

Definition 1

The essential global minimum \(f^{\sharp }\) of f over a set \(S \subseteq \Omega\) is

$$\begin{aligned} f^{\sharp }(S) = \inf \left\{ \eta \in \mathbb {R}: m\left( \{ x \in S: f(x) \le \eta \} \right) > 0 \right\} \end{aligned}$$

where \(m(\cdot )\) denotes the Lebesgue measure. If \(m(S) = 0\), we set \(f^{\sharp } = \infty\).

The main convergence result shows that the algorithm locates an essential global minimizer of f over \(\mathcal {F}_\textsf{tol}\) almost surely.

Theorem 1

Let \(\mathcal {F}\) be non-empty and let \(b_{\infty }\) be an arbitrary limit of \(\left\{ b_k \right\}\). Then firstly

  (a) \(\tau _\textsf{c} > 0\) implies \(\mathcal {F}_\textsf{tol}\) has positive measure; and secondly

  (b) \(m(\mathcal {F}_\textsf{tol}) > 0\) implies both \(b_{\infty } \in \mathcal {F}_\textsf{tol}\) and \(f(b_{\infty }) \le f^{\sharp }(\mathcal {F}_\textsf{tol})\) almost surely.

Proof

For part (a), let \(z \in \mathcal {F}\) and define the neighborhood

$$\begin{aligned} \mathcal {N}^*_{\epsilon } = \left\{ x \in \Omega : \left\| x - z \right\| _{\infty } < \min (\delta ,\epsilon ) \right\} \quad \text{where} \quad \delta = \min _{i \in 1,\ldots ,n} (U_i-L_i)/2 > 0. \end{aligned}$$

\(\mathcal {N}^*_{\epsilon }\) has Lebesgue measure of at least \(\min (\delta ^n,\epsilon ^n)\), which is positive for all \(\epsilon > 0\). This is easily seen on noting that at least one orthant of the uniform norm ball \(\left\{ x \in \mathbb {R}^n: \left\| x - z \right\| _{\infty } < \min (\delta ,\epsilon ) \right\}\) lies entirely within \(\Omega\). Continuity of g implies \(\exists \epsilon > 0\) such that \(\mathcal {N}^*_{\epsilon } \subset \mathcal {F}_\textsf{tol}\), as required.

For part (b), step 4 of the main algorithm ensures the number of cycles \(N_c \rightarrow \infty\) as the number of points \(k \rightarrow \infty\). At the start of each odd numbered cycle, the control point is drawn randomly from \(\Omega\), hence the number of sample points drawn randomly from \(\Omega\) becomes arbitrarily large as \(k \rightarrow \infty\).

The strategy for updating the best point means once a point in \(\mathcal {F}_\textsf{tol}\) is located, all future best points will lie in \(\mathcal {F}_\textsf{tol}\). Additionally, each such best point in \(\mathcal {F}_\textsf{tol}\) can only be replaced by another point in \(\mathcal {F}_\textsf{tol}\) which has a lower f value. Let

$$\begin{aligned} \epsilon _{\mu } = m\left( \{ x \in \mathcal {F}_\textsf{tol}: f(x) \le \mu \}\right) \end{aligned}$$

Now \(\epsilon _{\mu } > 0\) for all \(\mu > f^{\sharp }\left( {\mathcal F}_\textsf{tol}\right)\) by Definition 1. Hence, after \(N_c\) cycles have been completed at the \(k^\textsf{th}\) iteration

$$\begin{aligned} \text{Prob}\left( f(b_k) > \mu \quad \text{or}\quad b_{k} \not \in {\mathcal F}_\textsf{tol} \right) \le \left( 1-\epsilon _{\mu }\right) ^{N_c/2} \end{aligned}$$

The right hand side tends to zero as \(N_c \rightarrow \infty\), yielding the result. \(\hfill\Box\)

When \(\mathcal {F}_\textsf{tol}\) is the closure of its interior, the continuity of f implies the minimum of f over \(\mathcal {F}_\textsf{tol}\) equals \(f^{\sharp }(\mathcal {F}_\textsf{tol})\). When a non-empty \(\mathcal {F}\) is the closure of its interior, the preferred choice \(\tau _\textsf{c} = 0\) can be made. This gives \(b_\infty\) as a global minimizer of (2) almost surely. The definition of \(\phi\) means \(\phi _k \rightarrow f^*\) almost surely, meaning \(J(x,\phi _k)\) converges to the exact penalty function \(J(x,f^*)\) almost surely.

The absence of equality constraints does not guarantee that \(\mathcal {F}\) is the closure of its interior. The risk with \(\tau _\textsf{c} > 0\) is that the best known point b returned by the algorithm satisfies the constraints within tolerance, but is infeasible and possibly far from the feasible region.

4 Numerical testing

The new method was compared against its predecessors oscars [20] and oscars-ii [22] on 50 bound constrained problems in 2–30 dimensions, and on 21 additional problems in 9–60 dimensions. Comparisons with other methods for generally constrained problems are also done using problems from the G-suite and elsewhere. Finally, some tests are done to explore the value of knowing \(f^*\), and how performance varies with dimension and number of constraints on randomized test problems [28].

The numerical tests herein were all performed with \(A = 0.9\), \(\beta = 1/3\), \(h_\textsf{min} = 10^{-6}\), \(\tau _\textsf{stall} = 10^{-6}\) and \(T_\textsf{stall} = 5\). For tests with the t-cell method of Aragón et al. [3] and stochastic ranking evolutionary search (sres) [26], \(\tau _\textsf{c} = 10^{-4}\) is used.

4.1 Bound constrained only tests

When general constraints are absent, the algorithm minimizes f over \(\Omega\). This follows because \(v \equiv 0\) and \(J(x,\phi ) \equiv \max \{f(x),\phi \}\). Since \(\phi\) is always set equal to the best known f immediately, for any sample point x we have \(f(x) \ge \phi\). Hence \(f(x) = J(x,\phi )\) for all points x at which f has been calculated. This means the algorithm minimizes f when no general constraints are present. The stall test applied to J is identical to the stall test applied to f.

The method was tested on both problem sets used in Price et al. [22]. Test set 1 [20] contains 50 test problems and test set 2 [22] contains an additional 21 largely higher dimensional problems. Ten runs were performed for problems in test set 1, and 30 runs for problems in test set 2. Each run which located a best known point b satisfying

$$\begin{aligned} b \in \mathcal {F}_\textsf{tol} \quad \text{and} \quad f(b) \le f^* + \tau _\textsf{obj} \max \left\{ 1,|f^*|\right\} \end{aligned}$$
(3)

was deemed successful, and halted immediately on satisfying these two conditions. Here \(\tau _\textsf{obj} = 10^{-3}\) gives the maximum permitted absolute error (when \(|f^*| < 1\)) or relative error (when \(|f^*| \ge 1\)) in f.
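The success condition (3) can be written as a short predicate; here membership of \(\mathcal{F}_\textsf{tol}\) is checked via the largest constraint value \(\max_i g_i(b)\), which is an assumption about the implementation.

```python
def run_successful(f_b, g_b_max, f_star, tau_c=1e-4, tau_obj=1e-3):
    """Condition (3): b within the constraint tolerance (g_b_max is
    max_i g_i(b), an assumed check for b in F_tol) and f(b) within
    tau_obj of f*, absolutely when |f*| < 1 and relatively otherwise."""
    in_tol = g_b_max <= tau_c
    close = f_b <= f_star + tau_obj * max(1.0, abs(f_star))
    return in_tol and close
```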

Runs which did not find a point satisfying (3) after 50,000 function evaluations (for test set 1) or 250,000 function evaluations (test set 2) were deemed unsuccessful, and halted at that point.

A summary of the results for both test sets is presented in Table 1. For each method, the number of function evaluations taken to find a solution was averaged across all runs for each problem. For each problem, each method's averages were normalized by dividing by the least of the methods' averages for that problem. The normalized function evaluation counts are averaged across all problems, and listed in the column headed ‘norm nf’. The non-normalized averages of the function counts for all runs of all problems are in the ‘fevals’ column. Columns headed ‘best’ and ‘FR’ list the number of problems on which each method had the lowest average function count, and the total number of runs of all problems which ended in failure. Failed runs are costed out at the maximum number of function evaluations when calculating the normalized and non-normalized averages. Doing so artificially reduces both average function counts, with more failed runs tending to yield greater reductions. The reason for listing both types of average function count is that the non-normalized averages are dominated by problems which take many function evaluations to solve. Some problems need more than 100 times as many function evaluations to solve as others.

Table 1 Summary of results on problems without general constraints

These results show that the method is competitive with oscars-ii [22] and superior to the original oscars [20] algorithm on bound constrained problems.

4.2 Generally constrained problems: the G-suite

The method was tested on 17 problems from the G-suite [13, 17, 18, 30] and compared against Stochastic Ranking Evolutionary Search (sres) [26] and the modified t-cell algorithm [3]. Results are listed in Table 2, where the best, worst and mean objective function values are listed for sets of 30 runs, each using 350,000 function evaluations and \(\tau _\textsf{c} = 10^{-4}\). The stopping condition (3) was not used: all runs were halted at 350,000 function evaluations and the best point reported. These conditions match those used for t-cell [3] and sres [4], allowing a direct comparison with their results. Results for sres and t-cell listed in Table 2 are given to the same number of significant figures as listed by Aragón et al. [3] and Cagnina et al. [4].

Table 2 Comparison with t-cell [3] and sres [4] with all runs using 350,000 function evaluations

We regard any solution as acceptable if it satisfies (3) with \(\tau _\textsf{c} = 10^{-4}\) and \(\tau _\textsf{obj} = 10^{-3}\). Some results from Aragón et al. [3] and Cagnina et al. [4] are not listed with sufficient accuracy to determine if this standard was met. In such cases, it is assumed that the required accuracy was achieved. Firstly, looking at the best points found by each method over the 30 runs, sres, t-cell and the current method had 16, 15 and 15 acceptable solutions respectively. For the mean scores, the new method was acceptable on 11 problems, and the other two methods each on 9. For the worst points found, t-cell and this method were acceptable on 9 problems, and sres on 8. These results show the new method is competitive on generally constrained optimization problems.

The non-zero constraint tolerance permits points which are slightly better than optimal to be returned as the solution, and several such points feature in Table 2.

4.3 Other generally constrained tests

The method was also tested on a wider set of generally constrained problems. This set is the 17 G-suite problems used above, along with the Gomez3 problem [8], problems 3.3, 4.3, 4.4 and 4.5 from Floudas and Pardalos [6], problems 12.2.3 and 12.2.4 from Floudas et al. [5], the pentagon and equil problems from Lukšan and Vlček [12] and the cylinder-sphere problem (below).

The best known points for problems 4.4 and 4.5 [6] were updated to \(f^* = -3.13363591\) at (0, 3, 0, 1) and \(f^* = -13.401903555\) at (1/6, 2, 4, 1/2, 0,  2) respectively.

Problems 12.2.3 and 12.2.4 from Floudas et al. [5] contain a mix of real and binary variables. Each binary variable \(z_i\) was handled by using a real variable \(x_i \in [-0.5,1.5]\) and rounding \(x_i\) to the nearest member of \(\{0,1\}\) to get the binary value \(z_i\) before evaluating the objective and constraint functions.
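The relax-and-round treatment of the binary variables can be sketched as follows (the helper name is ours).

```python
import numpy as np

def round_binaries(x, binary_idx):
    """Map the relaxed real variables (each in [-0.5, 1.5]) at positions
    binary_idx to the nearest member of {0, 1}, leaving the remaining
    components untouched, before evaluating f and g."""
    z = np.array(x, dtype=float)
    z[binary_idx] = np.clip(np.round(z[binary_idx]), 0.0, 1.0)
    return z
```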

The problems in Lukšan and Vlček [12] are nonsmooth local optimization test problems, and as such are not finitely bounded above and below in all dimensions. To rectify this \(\Omega = [-2,2]^6\) was used for the pentagon problem, and \(\Omega = [0,1]^8\) for equil. The last problem (cylinder-sphere) has similar characteristics. It is

$$\begin{aligned} \min \, x_2 \quad \text{subject to} \quad \Vert x\Vert _2^2 \le 1 \quad \text{and} \quad (x_1+a)^2 + (x_2+a)^2 \ge a^2 + (a-1)^2 \end{aligned}$$

with \(x \in [-2,2]^{10}\) and \(a = 0.25\). The solution is \(x^*_2 = -1\) and \(x^*_i = 0\) for all \(i \ne 2\) with an optimal objective function value of \(f^* = -1\). There is one proper local minimizer with \(f = 0\) at \(x_1 = -1\) and \(x_i = 0\) otherwise.
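A minimal encoding of cylinder-sphere, with both constraints rewritten in the \(g(x) \le 0\) form of (2), confirms that the stated minimizers are feasible with both constraints active; the function name is illustrative:

```python
def cylinder_sphere(x, a=0.25):
    """Objective and constraints (as g(x) <= 0) for the
    cylinder-sphere problem, with x in [-2, 2]^10."""
    f = x[1]                                  # minimize x_2
    g1 = sum(xi * xi for xi in x) - 1.0       # ||x||_2^2 <= 1
    g2 = a * a + (a - 1.0) ** 2 - ((x[0] + a) ** 2 + (x[1] + a) ** 2)
    return f, (g1, g2)

# Both the global minimizer (f = -1) and the proper local
# minimizer (f = 0) satisfy both constraints with equality:
x_glob = [0.0] * 10; x_glob[1] = -1.0
x_loc = [0.0] * 10; x_loc[0] = -1.0
for x in (x_glob, x_loc):
    f, g = cylinder_sphere(x)
    assert all(abs(gi) <= 1e-12 for gi in g)
```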

The 10 problems not from the G-suite were tested under the same conditions as the G-suite. Results appear in Table 3. The method found feasible points on all 300 runs, and optimal points (as judged by (3)) on 274 runs. There were 7, 15 and 4 runs with non-optimal points on problems FP3.3, equil and CF12.2.4 respectively. These results show that the method can also be effective on problems that are nonsmooth or have some binary variables.

Table 3 Results for generally constrained problems not in the G-suite

The method was also tested with \(\tau _\textsf{c} = 0\) on the 15 generally constrained problems which do not have equality constraints. The testing regime was otherwise identical to that for the G-suite. With both \(\tau _\textsf{c} = 0\) and \(\tau _\textsf{c} = 10^{-4}\) the method found feasible points on 449 runs out of 450, with one fail each on problem G10. The numbers of successful runs were 358 and 356 respectively, showing the method is effective with zero constraint tolerance on inequality constrained problems.

4.4 Tests using generally constrained Schoen functions

Tests were also performed on a modified set of Schoen test problems [28] of the form

$$\begin{aligned} f(x) = \frac{\sum _{i=1}^N s_i \prod _{j=1, j\ne i}^N \Vert x - z_j \Vert _2^2}{\sum _{i=1}^N \prod _{j=1, j\ne i}^N \Vert x - z_j \Vert _2^2} - \sum _{j=1}^m \max (g_j(x),0) \quad \Omega = [0,1]^n \end{aligned}$$
(4)

subject to the constraints \(g_j \le 0\), \(j = 1,\ldots ,m\). Each \(z_j\) is chosen randomly from \([0,1]^n\) and its associated \(s_j\) is selected from a normal distribution with mean 5 and variance 1. Each constraint takes the form:

$$\begin{aligned} g_j(x) = \pm \left( \Vert x^* - y_j \Vert _2 - \Vert x - y_j \Vert _2 \right) - 0.2 \, \theta _j \Vert x^* - y_j \Vert _2 \end{aligned}$$

where each ± sign and \(y_j \in \Omega\) are chosen randomly to yield either a convex or concave hyperspherical constraint centred on \(y_j\). The \(\theta _j \in \{0,1\}\) are also chosen randomly, where \(\theta _j = 0\) makes the constraint active at \(x^*\); otherwise it is inactive at \(x^*\).

The global minimizer \(x^*\) of the left hand term of f on \(\Omega\) is the \(z_j\) with the least corresponding \(s_j\) [28]. This \(s_j\) is reduced by \(10^{-3} \max (|s_j|,1)\) to reduce the risk that another minimum lies within tolerance of the global minimum. Now \(x^* \in \mathcal {F}\) and the right hand term in (4) is zero on \(\mathcal {F}\), so \(x^*\) solves the generally constrained Schoen problem.
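The construction above can be sketched as a problem generator. The sketch below assumes a standard pseudorandom generator stands in for whatever sampler the experiments used, and the name `make_schoen_problem` and its signature are illustrative:

```python
import math
import random

def make_schoen_problem(n=5, N=40, m=3, seed=0):
    """Random generally constrained Schoen problem of form (4).

    Returns (f, g, x_star), where g(x) lists the m constraint
    values and x_star is the known constrained global minimizer.
    """
    rng = random.Random(seed)
    z = [[rng.random() for _ in range(n)] for _ in range(N)]
    s = [rng.gauss(5.0, 1.0) for _ in range(N)]
    k = min(range(N), key=lambda i: s[i])
    s[k] -= 1e-3 * max(abs(s[k]), 1.0)   # separate the global minimum
    x_star = z[k]

    # Random hyperspherical constraints centred on y_j, made convex or
    # concave by the sign, and active at x_star when theta_j = 0.
    y = [[rng.random() for _ in range(n)] for _ in range(m)]
    sign = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    theta = [rng.choice((0, 1)) for _ in range(m)]

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def g(x):
        return [sign[j] * (dist(x_star, y[j]) - dist(x, y[j]))
                - 0.2 * theta[j] * dist(x_star, y[j])
                for j in range(m)]

    def f(x):
        d2 = [sum((xi - zi) ** 2 for xi, zi in zip(x, zj)) for zj in z]
        prods = []
        for i in range(N):
            p = 1.0
            for j in range(N):
                if j != i:
                    p *= d2[j]
            prods.append(p)
        num = sum(si * pi for si, pi in zip(s, prods))
        den = sum(prods)
        return num / den - sum(max(gj, 0.0) for gj in g(x))

    return f, g, x_star
```

At \(x^*\) each constraint value is \(-0.2\,\theta _j \Vert x^* - y_j \Vert _2 \le 0\), so \(x^*\) is feasible by construction, and \(f(x^*)\) equals the reduced \(s_j\).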

Tests were run on batches of 100 random problems in 5, 10 and 20 dimensions, with \(N = 40\) and 0, 1, 3, 9 or 27 constraints. One run per problem was performed, with all runs using \(\tau _c = 10^{-4}\) and \(\tau _\textsf{obj} = 10^{-3}\). Each run halted successfully once (3) was satisfied, or as a fail if 350,000 function evaluations was reached.

Results appear in Table 4, and show steadily increasing difficulty with dimension. These results also show adding a small number of constraints dramatically increases the problem difficulty. This is because every constraint is active or nearly active at the solution, which significantly reduces the basin of attraction of \(x^*\) compared to other \(z_j \in \mathcal {F}\). On such problems convergence to a proper local minimizer sometimes occurs. As more constraints are added, few or no \(z_j\) remain in \(\mathcal {F}\) and the problem largely reduces to obtaining feasibility. Finding the basin of \(x^*\) is easier, but the constraints make estimating \(x^*\) accurately more computationally expensive.

Table 4 These generally constrained Schoen problem results list the average number of function evaluations over 100 problems, and number of failed runs (FR)

4.5 The value of knowing \(f^*\)

If there are no general constraints, setting \(\phi = f^*\) and using the updating for \(\phi\) both ensure \(J(x) = f(x)\) holds at all sample points which have been used. This means both versions of the algorithm are functionally identical. Hence comparisons between estimating \(\phi\) and using \(\phi = f^*\) are made only on the 27 generally constrained problems given above. The Schoen problems were not used.

For each problem, 30 runs were performed with \(\tau _\textsf{c} = 10^{-4}\) for two versions of the algorithm: one estimated \(\phi\) as described above, and the other used \(\phi = f^*\) at all times. Both versions halted when (3) was satisfied. This allows the relative speeds of both versions to be compared on the easier problems. Results are presented in Table 5. These show clearly that setting \(\phi = f^*\) when \(f^*\) is known is detrimental to the algorithm’s performance. The advantage of not setting \(\phi = f^*\) is that as \(\phi\) is adjusted, the induced kink in J from the \(\max \{f,\phi \}\) term moves around. This movement enhances the algorithm’s ability to traverse general constraint boundaries towards the solution.

Table 5 This table compares the new method when the optimal objective function value \(f^*\) is, and is not known

5 Conclusion

The performance of the new method on generally constrained problems is similar to that of the filter version f-oscars [21] of oscars in terms of function counts. In terms of overheads, the new method is almost twice as efficient. Moreover, on problems without general constraints, or where no general constraints are active in the vicinity of the solution, f-oscars becomes equivalent to oscars, and the latter is markedly inferior to the new method.

The new method has been shown to converge almost surely in exact arithmetic. Numerical results show that the method is effective on a wide range of bound and generally constrained problems, including nonsmooth problems. It compares well against other similar methods on problems from the G-suite. There is scope for improvement in the exploitation phase via a local search. This would be particularly beneficial on generally constrained problems as it would help the method traverse along any general constraint boundaries towards the solution.