1 Introduction

Direct Search Methods (DSM) have been extensively used for solving optimization problems with continuous variables. Despite their simplicity, encouraging numerical results have been reported for small and medium size problems, with or without first order derivative information (see [22] and references therein). DSMs solve the minimization problem, minimize \(f(x), x\in \mathcal{F} \subseteq \mathbb R^n\), as follows: At the i-th iteration an estimate \(x_i \in \mathcal{F}\) of the solution is known. A search direction \(d_i\in \;{\mathop {D}\limits ^{_\infty }}= \{d\in \mathbb R^n: ||d||= 1\}\) and a stepsize \(\tau _i\in \mathbb R_+\) complying with manageable conditions are found, and a new estimate \(x_{i+1}= x_i + \tau _i d_i \in \mathcal{F}\) is generated. Early applications of DSMs to optimization problems with continuous variables were monotone; i.e., they enforced a sufficient decrease in the function values, namely \(f(x_{i+1}) \le f(x_i) - \sigma _i\) for some \(\sigma _i>0\). Steepest descent, variable metric and Newton methods are classical DSM instances which use or approximate first order information. The pattern search method (PSM), introduced in [42], is an instance of a monotone DSM with no derivative information. An extensive list of references for monotone algorithms is given in [5]. Reference [23] stands as the pioneering work on nonmonotone DSMs for problems with continuous variables. Several variants were suggested which inherit the properties of Newton’s method with explicit derivative information [11, 14, 24, 44]. These authors claim that nonmonotone DSMs might improve the performance of a monotone algorithm when the estimates enter a curved narrow valley. In smooth constrained optimization, nonmonotone algorithms are used to avoid the Maratos effect [45, and references therein]. The performance of nonmonotone DSMs on noisy functions is more stable than that of their monotone counterparts [5, 16]. The authors of [14, 16, 17, 19, 20] employ nonmonotone DSMs to try to avoid convergence to nearby local minimizers. Finally, the authors in [21] claim that almost any deterministic monotone optimization algorithm for solving models with continuous variables has a nonmonotone counterpart. This paper analyzes the basic theory needed for solving \(\mathop {\textrm{minimize}}\limits _{(x; z) \in \mathcal{F}} g(x; z)\) with a non-monotone DSM, where \(\mathcal{F}\) is an unspecified feasible set. The results are then adapted to solve the mixed integer optimization problem (1) formulated as

$$\begin{aligned}&\mathop {\textrm{minimize}}\limits _{(x;z) \in \mathcal{F}} \ g(x; z): \mathbb R^{n+p} \rightarrow \mathbb R, \end{aligned}$$
(1a)
$$\begin{aligned}&\mathcal{F}=\{x \in \mathbb R^n, \ z \in \mathbb Z^p: s \le (x; z) \le t\}, \end{aligned}$$
(1b)

where \(g(\cdot )\) might be nonsmooth.

To facilitate our exposition, \((x; z)\) stands for a column vector of \(n + p\) components, with \(x \in \mathbb R^n\) and \(z \in \mathbb Z^p\). Thus, in (1), s and t are known vectors in \(\mathbb R^{n+p}\), which represent, respectively, the lower and upper bounds of all variables in (1). An initial effort to use PSMs for solving nonlinearly constrained problems with discrete -and/or categorical- variables can be found in [1, 5]. This paper is an outgrowth of the nonmonotone DSM proposed in [19], where the discrete variables were assumed to lie on a regular grid of discrete points.

This work presents a convergence theory for a class of monotone and nonmonotone DSMs, which allows us to have at hand a common framework for solving problem (1) and its particular instances: unconstrained and box-constrained optimization problems with mixed integer variables. We assume that we use penalization, Lagrangian functions, filtering, or any other technique that may transform a constrained problem into a single or a sequence of box-constrained problems structured like (1). The authors in [1, 2] propose a simple barrier function, which assumes that the objective function is infinite at all infeasible points. Since there exist nonsmooth functions that do not decrease along any direction from a point that is not a minimum [13, Exercise 2.6], we always consider a finite set of search directions; nonetheless, we show that our algorithm might detect directions of negative curvature. We solve the mixed integer optimization problem (1) under relaxed conditions, regardless of the availability of derivative information. Some degree of randomness is necessary to fully exploit the algorithm’s capabilities, though the numerical tests show that a deterministic version of the algorithm performs well.

For nonsmooth problems, the nonmonotone DSM proposed in this paper shares convergence properties with monotone DSMs frequently cited. As our purpose is to apply this methodology in derivative free optimization (DFO), where differentiability and other function features may be unknown to the user, no effort is devoted to the rate of convergence.

The paper’s primary goal is to analyze non-monotone DSM algorithms, which, in general, have been scarcely considered in the open literature. All convergence results hold with either explicit use of first order derivatives or for DFO. Moreover, if first order information is at hand, the algorithms simplify significantly, and, generally, a singleton search direction is easily identified.

The remainder of this paper is organized as follows. In the next section we prove some lemmas that are essential for the understanding of the algorithms. We state some useful definitions and display pseudocodes that will guide us in the analysis of the optimization problems to be exposed in Sects. 3, 4 and 5. Section 3 describes the DSM proposed for solving problems with continuous variables. We show that non-monotone DSMs generate a sequence converging to a point satisfying a non-smooth stationarity condition; which, to our knowledge, has been overlooked. For Lipschitzian functions the sequence converges to a point satisfying the stationarity condition in the Clarke sense. Section 3 conforms the theory developed in Sect. 2 to unconstrained and box-constrained problems. Compactness, Fréchet differentiability and continuous first order derivatives around limit points are sufficient conditions to achieve convergence. In Sects. 4 and 5 we assume that variables can take discrete values. The methodology persists. Convergence to a stationary point does not require detecting the best local value in the vicinity of a solution estimate. We suggest a variable separation strategy to solve Problem (1). In Sect. 6 we solve a group of low-dimensional academic problems. The results look promising, and seem to validate the theoretical findings. We also solve a real application problem and report a new global solution to the problem. The paper concludes with some remarks and open questions on research topics that are under investigation.

In short, and for the sake of clarity, we analyze convergence of the algorithms that solve the basic unconstrained and box-constrained optimization problems. This analysis is in substance carried over to the mixed variable problem (1), which is formally studied in Sect. 5.

Notation Throughout the paper we use a rather standard notation with some peculiarities.

  • \( \mathbb R^n, \mathbb R^p, \mathbb R^r\), and so forth, are Euclidean spaces.

  • \(\mathcal{F} \subseteq \mathbb R^n\) is the feasible set,

  • superscripts stand for components and sub-indexes represent different vectors,

  • \(d^T w\) is the inner product of vectors d, w defined in the same Euclidean space,

  • \(uu^T\) is a matrix U with components \(U^{jk}= u^j u^k\);

  • \((x; z)\) is a vector with \((n+p)\) components, \(x\in \mathbb R^n, z\in \mathbb Z^p\),

  • When \(\hat{z}\) is fixed, f(x) denotes \(g(x; \hat{z})\). Likewise, when \(\hat{x}\) is fixed, f(z) denotes \(g(\hat{x}; z)\).

  • \(\nabla f(x)\) is the gradient,

  • \(\displaystyle f^0(\hat{x},\hat{d}) {\mathop {=}\limits ^{\varDelta }} \limsup \limits _{x \rightarrow \hat{x},\, \tau \rightarrow 0^+} \frac{f(x+ \tau \hat{d}) - f(x)}{\tau }\) is the Clarke generalized directional derivative at \(\hat{x}\) along \(\hat{d}\),

  • \(g(x; z): \mathbb R^{n+p} \rightarrow \mathbb R, x\in \mathbb R^n, z \in \mathbb Z^p.\)

  • \(\displaystyle g_x^0(\hat{x}; \hat{z}, \hat{d}) {\mathop {=}\limits ^{\varDelta }} \limsup \limits _{x \rightarrow \hat{x},\, \tau \rightarrow 0^+} \frac{g(x+ \tau \hat{d}; \hat{z}) - g(x; \hat{z})}{\tau }\),

  • \(\sigma (\cdot ): \mathbb R_+ \rightarrow \mathbb R_+\) stands for a sigma-function, which is:

    $$\begin{aligned}&\text {forcing: } ( \sigma (\tau _i) \rightarrow 0) \text { iff } (\tau _i \rightarrow 0),\\ &\text {unbounded from above: } (\sigma (\tau _i) \rightarrow \infty ) \text { iff } (\tau _i \rightarrow \infty ), \text { and} \\ &\text {little o}(\tau ^2)\!: \lim \limits _{\tau \rightarrow 0} \sigma (\tau )/\tau ^2= 0. \end{aligned}$$
  • q represents a dummy integer that plays different roles in the paper,

  • I is the identity matrix and \(e_j, j=1, \dots , n\) are its columns,

  • H is the Householder matrix \(\displaystyle I - \frac{2}{u^Tu}uu^T,\ u\not = 0\),

  • \(\;{\mathop {D}\limits ^{_\infty }}= \{d \in \mathbb R^n: ||d||= 1\}\),

  • Other Capital letters are matrices or finite sets,

  • \(\text {If } S \text { is a finite set, } \#S \text { is its cardinality}\),

  • i will be the iteration number. It is usually implicitly assumed.

  • We use the Matlab notation to invoke an algorithm. \([w_1,\dots ,w_q]= {Algname}(v_1,\dots ,v_p)\) invokes function Algname() with p inputs and q outputs.

2 Preliminaries

This work addresses non-smooth functions. However, convergence theory might require some degree of differentiability. Some algorithms assume that g(x; z) is Lipschitz continuous, which implies that the Clarke generalized directional derivative exists and is finite. The Clarke generalized directional derivative is a useful concept in optimization theory.

In this section we define concepts and prove several lemmas needed in the general analysis of our algorithms. The basic convergence assumptions are:

A1: The objective function g(x; z) is continuous and computable at all feasible points. The level set \(L=\{ (x;z) \in \mathcal{F}: g(x;z) \le \varphi _0\}\) is compact for any \( \varphi _0\).

This seems to be the weakest condition required by optimization algorithms that solve (1).

A2: For a fixed z, the objective function g(x; z) in (1) is Fréchet differentiable and continuously differentiable around limit points.

A3: g(x; z) is Lipschitz continuous in its first argument; that is, for a fixed z, \(|g(y;z) - g(x;z)| \le \vartheta ||y-x||\) for some \(\vartheta >0\).

The algorithm for solving the mixed integer problem (1) inherits most of its properties from the algorithms specifically proposed in Sects. 3 and 4 for solving the pure versions.

Definitions 1, 2 and 3 implicitly assume that \((\hat{x};\hat{z}) \in \mathcal{F}\).

Definition 1

The direction \(\hat{d}\) is a feasible direction at \((\hat{x};\hat{z}) \) when

$$\begin{aligned} (\hat{x}+ \lambda \hat{d}; \hat{z}) \in \mathcal{F} \text { for all sufficiently small } \lambda > 0. \end{aligned}$$
(2)

Definition 2

The direction \(\hat{d}\in \mathbb R^n\) is a quasi-descent direction (qdd) at \( (\hat{x}; \hat{z})\) when

$$\begin{aligned} g( \hat{x}+ \tau \hat{d}; \hat{z})- g(\hat{x}; \hat{z})\le - \sigma (\tau )\ \text { for all sufficiently small } \tau > 0. \end{aligned}$$
(3)

Definition 3

The direction d is a feasible qdd at \( (\hat{x}; \hat{z})\) when both (2) and (3) hold.

To facilitate the writing and the reading of this paper, we often denote \(f(x) {\mathop {=}\limits ^{\varDelta }} g(x; \hat{z})\) when \(\hat{z}\) remains fixed.

Lemma 1

Let A1, A2 hold and let the unit vector d be a strict descent direction at x, that is, \(d^T \nabla {f}(x) \le -\alpha ||\nabla {f}(x) ||\), for some \(\alpha >0\). Under these conditions d is a qdd.

Proof

It is obvious:

$$\begin{aligned} f(x + \tau d) - f(x)&= \tau d^T \nabla f(x) + o(\tau ) \\ &\le -\tau \alpha ||\nabla f(x) || + o(\tau ). \end{aligned}$$

The proof is complete if we define \(\sigma (\tau )= \tau \alpha ||\nabla f(x) || - o(\tau ).\) \(\square \)

Lemma 2

Under Assumptions A1–A3, if \(\{x_j\}\rightarrow \hat{x}\in \mathbb R^n\) and \(f^0(\hat{x}, \hat{d})< 0\), then \(\hat{d}\) is a qdd at some x sufficiently close to \(\hat{x}\).

Proof

By A3, \(f^0(\hat{x}, \hat{d})\) exists and is finite. If \(\hat{d}\) is not a qdd at any x close to \(\hat{x}\), we can identify a sequence \(\{\tau _j\}_{j\in \mathbb N} \downarrow 0, \{x_j\} \rightarrow \hat{x}\) for which

$$\begin{aligned} \displaystyle \frac{f( x_j+ \tau _j \hat{d})- f( x_j)}{\tau _j} >- \sigma (\tau _j)/\tau _j. \end{aligned}$$
(4)

By taking limits, it follows that \(f^0(\hat{x}, \hat{d}) \ge 0\), a contradiction. \(\square \)

Corollary 1

If \( \{x_i\} \rightarrow \hat{x} \text { and } f^0(\hat{x},d) \le \alpha <0\), then d is a qdd for all x sufficiently close to \(\hat{x}\).

Lemma 3

Let \(\hat{d} \in \;{\mathop {D}\limits ^{_\infty }}= \{d\in \mathbb R^n:||d||= 1\}\) be a feasible qdd at \(\hat{x} \in \mathcal{F}\). Under Assumption A1, there exist \(\tau> 0, \gamma > 0\) such that

$$\begin{aligned} \begin{array}{c} \hat{x}+\tau \hat{d} + \gamma \tau \hat{d} \not \in \mathcal{F} \\ \text { \textbf{OR}} \\ f(\hat{x}+ \tau \hat{d}+ \gamma \tau \hat{d}) - f(\hat{x}+ \tau \hat{d}) > - \sigma (\tau ). \end{array} \end{aligned}$$
(5)

Proof

Since \(\hat{d}\) is a feasible qdd, (2) and (3) hold by definition for some \(\hat{\tau }> 0\). If (5) holds for \(\tau = \hat{\tau }\) and some \(\hat{\gamma }>0\), the lemma is valid. We now proceed by contradiction: If the statement of the lemma is false, Algorithm 1 generates an infinite loop with a sequence of feasible points with strictly decreasing \(f(\cdot )\) values, which contradicts A1. \(\square \)

[Algorithm 1: pseudocode figure]

We would like to state that a point \( {\mathop {x}\limits ^{_{*}}} \) is non-smooth stationary (nss) on D when there exists no feasible qdd at \( {\mathop {x}\limits ^{_{*}}} \); that is, when

$$\begin{aligned} \left. \begin{array}{c} \lambda> 0 \\ d \in D \end{array} \right\rangle {\Rightarrow }\ \big ( {\mathop {x}\limits ^{_{*}}} + \lambda d \not \in \mathcal{F}\big )\ { \textbf{OR }}\ \big (f( {\mathop {x}\limits ^{_{*}}} + \lambda d) - f( {\mathop {x}\limits ^{_{*}}} )> - o(\lambda ) \big ), \end{aligned}$$
(6)

where \(D \subseteq \;{\mathop {D}\limits ^{_\infty }}\ = \{d \in \mathbb R^n: ||d||= 1\}\).

Corollary 2

Under Assumption A1, any nonempty feasible set \(\mathcal F\) has an nss point satisfying (6) with \(D= \;{\mathop {D}\limits ^{_\infty }}\).

Proof

If there exists no nss point, then given any \(x\in \mathcal{F}\) there exist \(\lambda >0\) and \(d\in \;{\mathop {D}\limits ^{_\infty }}\) defining a qdd by (3). The while loop in Algorithm 1 would generate an infinite sequence of strictly decreasing \(f(\cdot )\) values, contradicting A1. \(\square \)

The implication (6) is, in general, difficult to implement. We resort to finite sets D and discrete \(\lambda \)-values. We define a \(\varLambda \)-set as follows:

Definition 4

A \(\varLambda \)-set is a set of \(\lambda \)-values \(\{\lambda _0, \lambda _1, \dots ,\}\) with the following properties:

  • it is a bounded set,

  • it contains an infinite number of distinct elements in \(\mathbb R_+\) that can be indexed in strictly decreasing order; that is, \((j< k) \Rightarrow (\lambda _j > \lambda _k)\).

  • its elements \(\lambda _0, \lambda _1, \dots \) converge to 0, that is, \(\{\lambda _j\} \downarrow 0.\)

A typical \(\varLambda \)-set that is often implicitly used by the optimization community is given by (7). Given \(0< \mu _s< \mu _t< 1\),

$$\begin{aligned} \text {Pick } \mu \in [\mu _s\ \mu _t], \lambda _0> \varepsilon ,\ \hat{\varLambda }= \{\lambda \in \mathbb R_+: \lambda = (\mu )^k\lambda _0, k= 0,1,2,\dots ,\}. \end{aligned}$$
(7)
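For illustration, a \(\varLambda \)-set of the form (7) is trivial to enumerate in code. The short Python sketch below generates the geometric sequence \(\mu ^k \lambda _0\); the contraction factor and the cutoff at which enumeration stops are illustrative choices, not values prescribed by the paper.

```python
def lambda_set(lam0, mu=0.5, eps=1e-10):
    """Geometric Lambda-set as in (7): lam0 * mu^k, k = 0, 1, ...  The values
    decrease strictly and converge to 0; enumeration stops below eps."""
    lam = lam0
    while lam > eps:
        yield lam
        lam *= mu
```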

Definition 5

The point \( {\mathop {x}\limits ^{_{*}}} \ \in \mathcal{F}\) is nss on \((\varLambda , D)\) when

$$\begin{aligned} \left. \begin{array}{c} \lambda \in \varLambda , \lambda> \varepsilon \\ d \in D \end{array} \right\rangle { \Rightarrow }\ \big ( {\mathop {x}\limits ^{_{*}}} + \lambda d \not \in \mathcal{F}\big )\ { \textbf{OR }}\ \big (f( {\mathop {x}\limits ^{_{*}}} + \lambda d) - f( {\mathop {x}\limits ^{_{*}}} )> - o(\lambda ) \big ). \end{aligned}$$
(8)

Definition 6

The point \( {\mathop {x}\limits ^{_{*}}} \ \in \mathcal{F}\) is totally stationary when (8) holds on \({\mathop {D}\limits ^{_\infty }}= \{d \in \mathbb R^n: ||d||= 1\}\).

For practical convenience, stationarity is defined on the sets \((\varLambda , D)\). Ideally, we would like \(\{x_i\}\) to converge to a totally stationary point. We will prove below that in unconstrained minimization this can occur on a very large set, provided that the sets \(\{D_i\}\) are carefully constructed for large enough i. On the other hand, a singleton direction set is enough to ensure stationarity of smooth functions when derivative information is available.

Algorithm 1 is just one way to prove Lemma 3. Many other alternatives are possible, but this algorithm is easy to implement, and it is close to its monotone counterpart algorithm proposed in [18] and the related nonmonotone versions developed afterwards [14, 16, 17].

The following lemmas will be useful for differentiable functions.

Lemma 4

Let \(\{d_1, \cdots , d_n\}\) be a finite set of n orthogonal directions with \(||d_j||=1,j=1,\cdots ,n\), and let \(D \supseteq \{\pm d_1, \cdots , \pm d_n\}\). It follows that

$$\begin{aligned} \forall w \in \mathbb R^n\ \big (\exists d\in D\!: d^T w \le -(1/\sqrt{n}\; ) ||w|| \big ). \end{aligned}$$
(9)

Moreover, if \(f(\cdot )\) is Fréchet differentiable,

$$\begin{aligned} \exists d \in D\!: d^T \nabla f(x) \le -(1/\sqrt{n}) ||\nabla f(x)||. \end{aligned}$$
(10)

Proof

(9) is a well known fact and its proof is omitted. (10) is an instance of (9). \(\square \)

Lemma 5

Let \(\hat{x} \in \mathbb R^n\) be a fixed point, let D be a set of search directions fulfilling the conditions required by Lemma 4 and let \(\tau _0\in \mathbb R_+\) be a positive number. If \(f(\cdot )\) is differentiable and

$$\begin{aligned} \left\langle \begin{array}{c} d \in D \\ \tau \in (0\ \tau _0) \end{array} \right\rangle \Rightarrow \big (f(\hat{x}+ \tau d) -f(\hat{x}) > -o(\tau )\big ) \end{aligned}$$
(11)

then \(\nabla f(\hat{x})= 0\).

Proof

We proceed by contradiction and assume that \(\nabla f(\hat{x}) \not = 0\). Lemma 4 ensures that \( d^T \nabla f(\hat{x}) \le -(1/\sqrt{n}) ||\nabla f(\hat{x})||\) for some \(d\in D\). It follows by Lemma 1 that d is a qdd; consequently, \(\tau \in (0,\ \tau _0)\) exists satisfying (5); therefore (11) can occur only if \(\nabla f(\hat{x})=0\). \(\square \)

If \(\ \nabla f(\hat{x}) = 0,\) it is not possible to find a direction of descent satisfying \(d^T\nabla f(\hat{x}) < 0\). The next lemma shows that, under more stringent conditions, it is possible to identify directions with negative curvature.

Lemma 6

If \(\displaystyle \lim \limits _{\lambda \rightarrow 0} \sigma (\lambda ) / \lambda ^2 = 0\), and \(f(\cdot ) \in C^2\), the inequality (3) holds for directions with negative curvature.

Proof

If there is no \(\lambda \in (0,\ \lambda _0)\) satisfying (3) and \(\nabla f(x) = 0\), then for all \(\lambda \in (0,\ \lambda _0)\) it follows that

$$\begin{aligned} \displaystyle \frac{1}{2} d^T \nabla ^2 f(x) d+ \frac{o(\lambda ^2)}{\lambda ^2} = \frac{f(x+ \lambda d) - f(x)}{\lambda ^2} \ge \frac{f(x+ \lambda d) - \varphi }{\lambda ^2} > -\frac{\sigma (\lambda )}{\lambda ^2}; \end{aligned}$$

but this cannot hold for small enough \(\lambda \) when \(\ d^T \nabla ^2 f(x) d < 0\). \(\square \)

2.1 The search directions

In the forthcoming sections we will see that the convergence analysis of smooth functions is simplified when (12) holds. So, we always force (12) to be valid when \(\nabla f(\cdot )\) exists.

$$\begin{aligned} (\forall D_i) (\exists d \in D_i: d^T\nabla f(x_i) \le -\alpha ||\nabla f(x_i)||),\ \text { for some } \alpha > 0. \end{aligned}$$
(12)

By Lemma 4, orthogonal directions implicitly enforce (12) with \(\alpha = 1/\sqrt{n}\). It is also known that the cosine value \(\alpha = 1/\sqrt{n}\) cannot be improved by any D that spans \(\mathbb R^n\) positively, with either \(n+1\) or 2n search directions [35]. It has been argued that performance of a DFO algorithm may improve with \(n+1\) search directions [4]. We suggest the directions \(\pm d_1,\dots , \pm d_n\), where \(d_1,\dots , d_n\) are defined in Line 4 of Algorithm 2. They are the columns of the Householder matrix \(H= I-2uu^T\) generated by a unit random vector \(u\in \mathbb R^n\), as suggested in [18]. A list of nice properties of \(D=\{d_1,\dots ,d_n\}\) follows (a short code sketch of the construction is given after the list):

[Algorithm 2: pseudocode figure]
  • D is a set of orthogonal search directions.

  • Algorithm 2 describes an easy way to generate random orthogonal directions obtained from the columns of the Householder matrix.

  • It has been argued that some degree of randomness may benefit the performance of a deterministic algorithm [22]. In [6] the authors claim that they did not find a deterministic strategy to improve their algorithm.

  • It is relatively simple to distribute the workload among processors [18].

  • Line 4 of Algorithm 2 defines \(\ d_j= He_j= e_j- (2u^j/u^Tu) u,\ j=1,\dots ,n\). It is not necessary to explicitly construct the whole matrix H. The vector u contains the information needed to generate the search directions. This feature allows significant memory savings for medium and large problems.

  • Algorithm 2 conforms the search directions to the geometry of the constraints. It merely assigns \(u^j= 0\) in Line 2b whenever the variable \(x^j\) is close to either one of its bounds. Note that

    $$\begin{aligned} \big [ u^j= 0 \big ] \Rightarrow \big [d_j= e_j, \text { and }\ d_k^j= 0, k\not =j\big ]. \end{aligned}$$
    (13)

    Hence, \(x^j + d_k^j= x^j\), for all \(k\not =j\), which prevents the j-th variable from getting closer to its bounds.

  • The published numerical results since its inception in [18] have been highly competitive [14, 16, 17, 19, 20].

  • \(u^j\) can be randomly generated in \([-1\ 1], j= 1, \dots , n\). This is convenient when \(x\in \mathbb R^n\).

  • A suitable vector \(u \in \mathbb Z^n\) can likewise be randomly generated to handle integer variables. From Lines 3a and 4 of Algorithm 2 we obtain \(\ \zeta d_j= \zeta e_j- (2u^j) u\), where \(\zeta =u^Tu\) and \(d_j= He_j\). It follows that

    [Algorithm 3: pseudocode figure]
    $$\begin{aligned} u\in \mathbb Z^n \Rightarrow \big ( \tau d\in \mathbb Z^n \text { for } d\in D \text { and } \tau = \pm \zeta , \ \pm 2\zeta , \dots \big ). \end{aligned}$$
    (14)
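The following Python sketch illustrates the construction just listed: the columns \(d_j = He_j\) of the Householder matrix of a random vector u, built without forming H explicitly, with \(u^j = 0\) for variables close to a bound. It is a minimal sketch under assumed defaults; the tolerance beta and the fallback used when every variable is binding are illustrative choices, not taken from the paper.

```python
import numpy as np

def householder_directions(x, s, t, beta=1e-6, rng=None):
    """Columns d_j = H e_j of the Householder matrix H = I - (2/u^T u) u u^T
    of a random vector u, built without forming H explicitly.  Components u^j
    of variables within beta of a bound are set to zero, so those columns
    reduce to the coordinate directions e_j (cf. (13))."""
    rng = np.random.default_rng() if rng is None else rng
    x, s, t = (np.asarray(a, dtype=float) for a in (x, s, t))
    n = x.size
    u = rng.uniform(-1.0, 1.0, size=n)
    u[np.minimum(x - s, t - x) <= beta] = 0.0   # conform to the box geometry
    if not np.any(u):                           # every variable is binding
        return np.eye(n)
    D = np.eye(n) - (2.0 / (u @ u)) * np.outer(u, u)
    return D                                    # orthonormal columns
```

In a derivative-free run one would typically use the 2n directions \(\pm d_1, \dots , \pm d_n\), e.g. np.hstack([D, -D]).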

2.2 The essential algorithm

The Iteration() algorithm, identified as Algorithm 3, will be the core iteration for those algorithms dealing with variables in \(\mathbb R^n\). It is invoked by \([x_i,\ \varphi _i,\ \tau _i, \ D_i]\) = Iteration(\(x_{i-1},\varphi _{i-1})\). Iterative calls to Iteration() are carried out until convergence criteria are met. The formal input parameters of Iteration() are the current estimate \(x_{i-1} \in \mathbb R^n\) and \(\varphi _{i-1} \in \mathbb R\), an upper bound on \(f(x_{i-1})\). A new estimate \(x_i\), a new functional upper bound \(\varphi _i\), a stepsize \(\tau _i\), and the set of search directions \(D_i\) are returned by Iteration().

Iteration()’s task is to find a feasible qdd. Lines 2a-2c check whether there is a feasible qdd at the input estimate \(x_{i-1}\); if none is found, \(\lambda \le \varepsilon \) is eventually reached at Line 2d and Iteration() returns. Strictly speaking, \(x_{i-1}\) has no feasible qdd when for any \(\lambda > 0\) and any \(d\in \;{\mathop {D}\limits ^{_\infty }}\), it follows that

$$\begin{aligned} x_{i-1} + \lambda d \not \in \mathcal{F}\ \text { \textbf{or }}\ f(x_{i-1} + \lambda d) > \varphi _{i-1} - \sigma (\lambda ). \end{aligned}$$
(15)

In a practical implementation we admit that there is no qdd at \(x_{i-1}\) when (15) holds for \(\lambda > \varepsilon ,\ \lambda \in \varLambda \) and a finite \(D_i\). When Iteration() finds a feasible qdd, it returns \([x_i, \varphi _i, \tau _i, D_i]\) satisfying

$$\begin{aligned} x_i&\in \mathcal{{F}} \end{aligned}$$
(16a)
$$\begin{aligned} f(x_i)&\le \varphi _i \le \varphi _{i-1}- \sigma (\tau _i) \end{aligned}$$
(16b)
$$\begin{aligned} f(x_i+ \tau _i d)&> \varphi _i- \sigma (\tau _i) \text { for all } d\in D_i. \end{aligned}$$
(16c)

Lemma 3 ensures that (16) can be fulfilled. Iteration() also keeps the best estimate \(x_b\) at its Line 4a.

Remark 1

\(\varphi _i\) may be the observed value of a random variable uniformly distributed in \(\big [f(x_i), \ \varphi _{i-1}- \sigma (\tau _i) \big ]\).
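A simplified Python sketch of this core step is given below, assuming an admissible \(\sigma \)-function of the form \(c\tau ^3\). It accepts the first feasible qdd it finds; the paper's Algorithm 3 additionally expands the accepted step (cf. Lemma 3) and keeps track of the best estimate \(x_b\), which this sketch omits.

```python
import numpy as np

def sigma(tau, c=1e-4):
    """One admissible sigma-function: forcing, unbounded from above, o(tau^2)."""
    return c * tau**3

def iteration(f, x_prev, phi_prev, D, lam0=1.0, mu=0.5, eps=1e-8,
              feasible=lambda x: True, rng=None):
    """Scan the columns of D with a shrinking stepsize until a feasible
    quasi-descent direction is found; return the new estimate, a nonmonotone
    upper bound phi drawn as in Remark 1, the accepted stepsize and D."""
    rng = np.random.default_rng() if rng is None else rng
    lam = lam0
    while lam > eps:
        for d in D.T:                              # unit search directions
            y = x_prev + lam * d
            if feasible(y) and f(y) <= phi_prev - sigma(lam):
                phi = rng.uniform(f(y), phi_prev - sigma(lam))   # Remark 1
                return y, phi, lam, D
        lam *= mu                                  # no feasible qdd: shrink the step
    return x_prev, phi_prev, lam, D                # x_prev looks nss on D
```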

3 Nonmonotone DSMs for optimization with variables in \(\mathbb R^n\)

This section applies the theory developed in Sect. 2 to our problems of interest. Section 3.1 deals with the unconstrained optimization problem

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{x \in \mathcal{F}} f(x), \qquad \mathcal{F} = \mathbb R^n. \end{aligned}$$
(17)

This section describes algorithms to find non-smooth stationary points on a finite set of directions. Section 3.2 goes one step further. It is concerned with convergence to totally stationary points characterized by Definition 6. Previous works have dealt with this issue. To ensure convergence, the authors in [6] need to proceed iteratively with an asymptotically dense matrix. We describe an algorithm which does not require explicitly dealing with an asymptotically dense matrix. Section 3.3 deals with the box-constrained optimization problem

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{x\in \mathcal{F}} f_C(x),\qquad \mathcal{F}=&\{x \in \mathbb R^n: s^k \le x^k \le t^k, k=1, \dots , n \}. \end{aligned}$$
(18)

3.1 Unconstrained optimization

In this section we prove several convergence results for unconstrained problems. Under Assumptions A1, A2 convergence to a point fulfilling the classical first order necessary conditions is proved. If A1, A3 hold, a sequence of estimates \(\{x_i\}_{i\in \mathbb N}\) converges to a stationary point fulfilling Definition 5 for \(D= {\mathop {D}\limits ^{_{*}}} \), a limit point of \(\{D_i\}\). With a somewhat more laborious algorithmic implementation, in the limit we prove stationarity on the set \({\mathop {D}\limits ^{_\infty }}= \{d:||d||=1\}\).

Our first task is to impose conditions on the search sets \(\{D_i\}\) that ensure convergence to stationary points of smooth functions.

Theorem 1

Let \(\{d_i\}\) be the sequence of search directions that satisfies (12), that is, \( d_i^T\nabla f(x_i) \le -\alpha ||\nabla f(x_i)||\) for some \(\alpha > 0\).

[Algorithm 4: pseudocode figure]

If A1 and A2 hold, Algorithm 4 generates \(\{\nabla f(x_i)\} \rightarrow 0\).

Proof

From (16c) and (12) it follows that

$$\begin{aligned} \begin{array}{r l} -\alpha ||\nabla f(x_i)|| \ge d_i^T \nabla f(x_i) =&{} \displaystyle \frac{f(x_i+\tau _i d_i)- f(x_i)}{\tau _i} - o(\tau _i) / \tau _i \\ \ge &{} \displaystyle \frac{f(x_i + \tau _i d_i)- \varphi _i}{\tau _i} - o(\tau _i) / \tau _i \\ >&{} - \sigma (\tau _i) / \tau _i - o(\tau _i) / \tau _i. \end{array} \end{aligned}$$
(19)

Since \(\tau _i \rightarrow 0\), we conclude that \(\{\nabla f(x_i)\} \rightarrow ~0\). \(\square \)

We claim that the following proposition is valid.

Proposition 1

Let \(\{D_i\}\) be a sequence of search direction sets, each containing a direction of descent satisfying (12). Under Assumptions A1 and A2, Algorithm 4 applied to (17) generates \(\{\nabla f(x_i)\} \rightarrow 0\). \(\square \)

Corollary 3

If \(f(\cdot )\) is continuously differentiable, any accumulation point \( {\mathop {x}\limits ^{_{*}}} \) of \(\{x_i\}\) is stationary; more specifically, \(\nabla f( {\mathop {x}\limits ^{_{*}}} ) = 0\).

Proof

It is obvious. \(\square \)

In general, there is not a simple way to explicitly find a qdd at x; however, a qdd might appear in D when:

  • \(f(\cdot ) \in C^2\) and there are directions of negative curvature at x, that is, \(d^T \nabla ^2 f(x) d< 0\).

  • There are directions with a negative directional derivative at x.

We should recall that even if \(f(\cdot )\) is differentiable, its first order derivatives might not be computable; and there is no way to explicitly verify (12). Nonetheless, Lemma 4 shows that (12) holds when \(D_i\) is a set of orthogonal directions. Computability of the gradient simplifies the algorithm significantly. Convergence prevails for any strictly positive definite matrix P and \(D_i= \{-P\nabla f(x_i)\}\), a singleton.

Algorithm Continuous(), identified as Algorithm 4, is a non-monotone DSM that returns a stationary point to problem (17) regardless of the presence -or absence- of derivative information. It is invoked by

\([ {\mathop {x}\limits ^{_{*}}} , {\mathop {D}\limits ^{_{*}}} ] =\) Continuous\((x,\varphi ,\varepsilon )\). Continuous() calls Algorithm 3 while \(\tau _i > \varepsilon \). It has 3 input arguments: the starting point \(x_0 \in \mathcal{F}\), the upper function value \(\varphi _0 \ge f(x_0)\) and the accuracy \(\varepsilon \). Continuous() finds an nss point when \(\tau _i < \varepsilon \) at its Line 1; but it returns \(x_i\) only if \(f(x_i) < f(x_b)\) at its Line 21. Otherwise, the algorithm invokes Iteration(\(x_b, f(x_b\))). The input parameters (\(x_b, f(x_b\))) make sure that all subsequent estimates have a function value below \(f(x_b)\). The convergence theory will require \(\{x_i\} \subset \mathcal{F}\), which is obvious in unconstrained minimization.
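For concreteness, a possible outer loop in the spirit of Continuous() is sketched below, reusing the iteration() and householder_directions() helpers sketched earlier; the restart-from-\(x_b\) logic follows the description above, while the defaults and the iteration safeguard are illustrative assumptions.

```python
import numpy as np

def continuous(f, x0, phi0, eps, s=None, t=None, feasible=lambda x: True,
               max_outer=10_000, rng=None):
    """Outer loop: call iteration() with fresh random orthogonal directions
    until the stepsize falls below eps; restart from the best point x_b when
    the final iterate is worse than x_b."""
    rng = np.random.default_rng() if rng is None else rng
    x, phi = np.asarray(x0, dtype=float), float(phi0)
    x_b, f_b = x.copy(), f(x)
    D = np.eye(x.size)
    for _ in range(max_outer):
        if s is not None:                          # box-constrained case
            D = householder_directions(x, s, t, rng=rng)
        else:                                      # unconstrained case
            D, _ = np.linalg.qr(rng.standard_normal((x.size, x.size)))
        D = np.hstack([D, -D])                     # use the 2n directions +/- d_j
        x, phi, tau, D = iteration(f, x, phi, D, eps=eps,
                                   feasible=feasible, rng=rng)
        if f(x) < f_b:
            x_b, f_b = x.copy(), f(x)
        if tau <= eps:
            if f(x) <= f_b:
                return x, D                        # nss point on the last D
            x, phi = x_b.copy(), f_b               # restart from the best point
    return x_b, D
```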

Theorem 2

Let A1, A3 hold. When \(\varepsilon = 0\), Algorithm 4 generates \(\{\tau _i\} \rightarrow ~0\), and we can identify a subsequence \(\{x_i,D_i,\tau _i\}_{i\in K \subseteq \mathbb N}\) converging to \(( {\mathop {x}\limits ^{_{*}}} , {\mathop {D}\limits ^{_{*}}} ,0)\), where \( {\mathop {x}\limits ^{_{*}}} \) satisfies (8) on \( {\mathop {D}\limits ^{_{*}}} \). In addition, \(f^0( {\mathop {x}\limits ^{_{*}}} ,d) \ge 0\) for all \(d\in {\mathop {D}\limits ^{_{*}}} \).

Proof

Line 1 of Algorithm 4 invokes Iteration(). If \(x_i\) has no qdds, the if clause in Line 2b of Iteration() never holds and \(\{\lambda \} \downarrow 0\) within the while loop 2a-2c, while the solution estimate stays fixed; that is, \(x_{j+1}= x_j\) for all \(j \ge i\). By compactness we can identify a subsequence \(\{x_i,D_i,\tau _i\}_{i\in K \subseteq \mathbb N} \rightarrow (x_i, {\mathop {D}\limits ^{_{*}}} ,0).\)

If the whole sequence \(\{x_i\}_{i\in \mathbb N}\) possesses qdds, it follows by (16b) and A1 that

$$\begin{aligned} \min \limits _{x\in \mathcal{F}} f(x) \le f(x_{k+1}) \le \varphi _{k+1} \le \varphi _i - \sum \limits _{k\ge j \ge i} \sigma (\tau _j); \end{aligned}$$
(20)

which implies \(\{\sigma (\tau _i)\} \rightarrow 0\). A fortiori, \(\{\tau _i\} \rightarrow 0\). Since \(f(x_i) \le \varphi _i\), it follows from (16c) that

$$\begin{aligned} \displaystyle \frac{f(x_i + \tau _i d)- f(x_i)}{\tau _i} \ge \frac{f(x_i+ \tau _i d)- \varphi _i}{\tau _i} > - \sigma (\tau _i)/\tau _i, \text { for all } d \in D_i. \end{aligned}$$
(21)

By compactness we can identify a sequence \(\{x_i, D_i,\tau _i\} \rightarrow ( {\mathop {x}\limits ^{_{*}}} , {\mathop {D}\limits ^{_{*}}} ,0)\). Moreover, when A3 holds, it follows that \(f(x_i+\tau _i {\mathop {d}\limits ^{_{*}}} ) \ge f(x_i + \tau _i d)- \vartheta \tau _i ||d- {\mathop {d}\limits ^{_{*}}} ||\), and

$$\begin{aligned} \displaystyle \frac{f(x_i + \tau _i {\mathop {d}\limits ^{_{*}}} )- f(x_i)}{\tau _i} \ge&\frac{f(x_i+ \tau _i d)- \varphi _i}{\tau _i} - \vartheta ||d- {\mathop {d}\limits ^{_{*}}} || \end{aligned}$$
(22a)
$$\begin{aligned} >&- \sigma (\tau _i)/\tau _i- \vartheta ||d- {\mathop {d}\limits ^{_{*}}} || \text { for all } {\mathop {d}\limits ^{_{*}}} \in {\mathop {D}\limits ^{_{*}}} . \end{aligned}$$
(22b)

We conclude that \( {\mathop {x}\limits ^{_{*}}} \) is stationary on the limit set \( {\mathop {D}\limits ^{_{*}}} \). Clearly

$$\begin{aligned} \limsup \limits _{x \rightarrow {\mathop {x}\limits ^{_{*}}},\ \tau \rightarrow 0} \frac{f(x+ \tau d) - f(x)}{\tau } \ge 0. \end{aligned}$$
(22c)

\(\square \)

[Algorithm 5: pseudocode figure]

We just proved that Algorithm 4 generates a sequence \(\{x_i\}\) converging to a stationary point \( {\mathop {x}\limits ^{_{*}}} \) satisfying Definition 5 on a limit set \({ {\mathop {D}\limits ^{_{*}}} }\). We now include material that will guide us to sketch a method to find stationarity on some \(D \supset {\mathop {D}\limits ^{_{*}}} \). We will show stationarity on finite sets \(\{D_i\} \rightarrow \;{\mathop {D}\limits ^{_\infty }}\).

3.2 Non-smooth stationarity on\(\;{\mathop {D}\limits ^{_\infty }}\)

Remark 2

If (6) holds for the sets \(D_1 \subset \;{\mathop {D}\limits ^{_\infty }}\) and \(D_2 \subset \;{\mathop {D}\limits ^{_\infty }}\), then it also holds for the set \((D_1 \cup D_2)\).

As stated earlier, it is more plausible to implement the algorithm with discrete values for \(\lambda \).

Remark 3

Let \(\hat{\varLambda }\) be given complying with Definition 4, and let \(x_i\) be a non-smooth stationary point satisfying (8) on the sets \(D_1 \subset \;{\mathop {D}\limits ^{_\infty }}\) and \(D_2 \subset \;{\mathop {D}\limits ^{_\infty }}\). The point \(x_i\) is also non-smooth stationary satisfying (8) on the set \((D_1 \cup D_2)\). In particular; if \(x_i\) is non-smooth stationary on a set \(D \subset \;{\mathop {D}\limits ^{_\infty }}\) and on a set S of random directions uniformly distributed in\(\;{\mathop {D}\limits ^{_\infty }}\), then \(x_i\) is also non-smooth stationary on \((D\cup S)\). For the remainder of this sub-section \(\{S_i\}\) denotes finite sets of random directions uniformly distributed in \({\mathop {D}\limits ^{_\infty }} =\{d \in \mathbb R^n: ||d||= 1\}\).

Lemma 7

Let Algorithm 5 be invoked by \(\big [\lambda , d, \#S \big ] = {Total }(y, D, \varLambda )\); where \(y \in \mathbb R^n\) satisfies (8) on \((D, \varLambda )\). The input y satisfies (8) on the set \(D \cup S\) if and only if \(\lambda = 0, d= 0\).

Proof

It is elementary. Line 3a of Algorithm 5 returns non-zero values to \(\big [\lambda , d\big ]\) if -and only if- Line 2a holds true, that is, d is a qdd. \(\square \)

We close this section with a brief description of Algorithm 6, which improves the nss point found by Algorithm 5. Theoretically, \(\{x_i\}\) converges with probability 1 to an nss point complying with Definition 6.

[Algorithm 6: pseudocode figure]

Remark 4

A uniformly random set of q directions in\(\;{\mathop {D}\limits ^{_\infty }}\) can be efficiently obtained as follows:

randn() denotes a random variable with standard normal distribution

$$\begin{aligned} \begin{array}{ll} {\textsc {for }} j= 1, \dots , q \\ \quad {\textsc {for} }\, k=1, \dots , n \\ \qquad d_j^k = \textbf{randn}() \\ \quad {\textsc {end-for}} \\ \quad \text {norm } = \sqrt{ (d_j^1)^2 + \dots + (d_j^n)^2} \\ \quad {\textsc {if }} (\text {norm} > \varepsilon ) \quad d_j \leftarrow d_j/\text {norm} \\ {\textsc {end-for}} \end{array} \end{aligned}$$
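In vectorized form, a small Python sketch (not part of the paper) reads:

```python
import numpy as np

def random_unit_directions(q, n, rng=None):
    """q directions uniformly distributed on the unit sphere, obtained by
    normalizing standard normal vectors (columns of the returned matrix)."""
    rng = np.random.default_rng() if rng is None else rng
    S = rng.standard_normal((n, q))
    return S / np.linalg.norm(S, axis=0)
```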

In [32] the author suggested that

$$\begin{aligned} P_r \big (\exists _{i,j,i<j} |d_i^k - d_j^k|< \epsilon \big ) < \delta \approx 1 - (1-\epsilon )^{p(p-1)}, \end{aligned}$$
(23)

where \(P_r(E)\) is the probability of the event E and p is the number of generated directions.
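Combining Remark 4 with Lemma 7, the stationarity test behind Total() can be sketched as follows; the sketch reuses the sigma() and random_unit_directions() helpers shown earlier and returns the pair \((\lambda , d)\) of a feasible qdd when one is found, and (0, None) otherwise, mirroring the dichotomy of Lemma 7. The number of random directions is an arbitrary choice.

```python
import numpy as np

def total_check(f, y, D, Lam, num_random=50, feasible=lambda x: True, rng=None):
    """Augment the columns of D with random unit directions and look for a
    feasible qdd of y over the Lambda-set Lam."""
    rng = np.random.default_rng() if rng is None else rng
    S = random_unit_directions(num_random, y.size, rng=rng)
    directions = np.hstack([D, S])
    fy = f(y)
    for lam in Lam:                                # decreasing Lambda-set values
        for d in directions.T:
            v = y + lam * d
            if feasible(v) and f(v) - fy <= -sigma(lam):   # feasible qdd found
                return lam, d
    return 0.0, None
```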

3.3 Box-constrained optimization

In this subsection we deal with the box-constrained optimization problem (18), which is repeated here for easy reference:

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{ x\in \mathcal{F}} f_C(x), \qquad \mathcal{F}=&\{x \in \mathbb R^n: s^k \le x^k \le t^k, k=1, \dots , n \}, \end{aligned}$$

where \(s^k, t^k\) are, respectively, the lower and the upper bound of the variable \(x^k\). We assume \(s^k <t^k\); otherwise, when \(s^k= t^k\), the variable \(x^k\) has a constant value and it is no longer considered as a variable. An approach for solving (18) is to consider the barrier function

$$\begin{aligned} f(x)= \left\{ \begin{array}{c l} f_C(x), &{}\text {if } x\in \mathcal{F}, \\ \varphi _0+1, &{} \text {otherwise,} \end{array} \right. \end{aligned}$$
(24)

and try to use the same tools used for solving the unconstrained problem (17). The algorithm starts at \(x_0 \in \mathcal{F}\) and applies Algorithm 4 to the function defined by (24). The algorithm also needs the search directions to conform to the geometry of the constraint set. Algorithm 2 takes care of this circumstance. To formalize the proof of convergence we need to introduce the following notation and definitions.

  • Define the index set \(B_i\) of binding variables at the i-th iteration as:

    $$\begin{aligned} B_i= \{1\le k \le n: \min (x_i^k- s^k, t^k- x_i^k) \le \beta \}, \text { for some } \beta > 0. \end{aligned}$$
    (25)
  • Denote by \(\mathcal{S}_i\) the subspace spanned by \(\{x\in \mathbb R^n: x^k= 0, k \in B_i\}\), and let \(v_i \in \mathbb R^n\) be the projected gradient on \(\mathcal{S}_i\) given by \(v_i^k= \left\{ \begin{array}{cl} \nabla f(x_i)^k, &{} k \not \in B_i, \\ 0 &{} otherwise. \end{array} \right. \)

  • Let \(E_i= \{d_1, \cdots , d_q\}\) be a finite set of q vectors in \(\mathcal{S}_i\) satisfying (12), that is,

    $$\begin{aligned} \big (\forall v\in \mathcal{S}_i\big ) (\exists d \in E_i): d^Tv \le -\alpha ||v||, \text { for some } \alpha > 0. \end{aligned}$$
    (26)
  • Let the set \(D_i\) of search directions (a code sketch of this set is given after the list) be

    $$\begin{aligned} D_i= E_i \cup \{\pm e_j, j\in B_i\}. \end{aligned}$$
    (27)
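A minimal sketch of this construction, assuming the householder_directions() helper from Sect. 2.1, where the tolerance beta plays the role of \(\beta \) in (25):

```python
import numpy as np

def binding_set(x, s, t, beta):
    """Index set B_i of (25): variables within beta of one of their bounds."""
    return np.where(np.minimum(x - s, t - x) <= beta)[0]

def box_directions(x, s, t, beta, rng=None):
    """Search set D_i of (27): orthogonal directions spanning the subspace S_i
    of free variables (the role of E_i), plus +/- e_j for binding variables."""
    n = x.size
    B = binding_set(x, s, t, beta)
    H = householder_directions(x, s, t, beta=beta, rng=rng)  # u^j = 0 for j in B_i
    free = np.setdiff1d(np.arange(n), B)
    E = H[:, free]                          # columns lying in S_i
    coord = np.eye(n)[:, B]                 # coordinate directions of binding variables
    return np.hstack([E, -E, coord, -coord])
```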

Essentially, we apply Algorithm 4, slightly modified to deal with the binding variables indicated by \(B_i\). We now proceed to prove convergence.

Let \(B'\) be a set that appears infinitely often in the sequence \(\{B_i\}_{i\in \mathbb N}\).

Let \(K'= \{i: B_i= B'\}\) and let \( {\mathop {x}\limits ^{_{*}}} \) be a limit point, that is, \(\{x_i\}_{i \in K} \rightarrow {\mathop {x}\limits ^{_{*}}} \), for some \(K \subseteq K'\).

Theorem 3

Let \(\{D_i\}\) be the sequence of search directions satisfying (27). Under assumptions A1, A2, Algorithm 4 generates a sequence \(\{x_i\}_{i\in K}\) converging to a stationary point \( {\mathop {x}\limits ^{_{*}}} \) satisfying:

$$\begin{aligned}{} & {} \begin{array}{l} \big ( s^k =\ {\mathop {x}\limits ^{_{*}}} ^k \big ) \Rightarrow \nabla f( {\mathop {x}\limits ^{_{*}}} )^k \ge 0, \\ \big ( {\mathop {x}\limits ^{_{*}}} ^k\ = t^k \big ) \Rightarrow \nabla f( {\mathop {x}\limits ^{_{*}}} )^k \le 0, \text { and } \end{array} \end{aligned}$$
(28a)
$$\begin{aligned}{} & {} \big (s^k<\ {\mathop {x}\limits ^{_{*}}} ^k < t^k\big ) \Rightarrow \nabla f( {\mathop {x}\limits ^{_{*}}} )^k= 0. \end{aligned}$$
(28b)

Proof

Let \(B_i, K, {\mathop {x}\limits ^{_{*}}} \) be defined as above. To prove (28a) we merely prove \(\big (s^k =\ {\mathop {x}\limits ^{_{*}}} ^k\big ) \Rightarrow \nabla f( {\mathop {x}\limits ^{_{*}}} )^k \ge 0\). The case \(\big ( {\mathop {x}\limits ^{_{*}}} ^k = t^k\big ) \Rightarrow \nabla f( {\mathop {x}\limits ^{_{*}}} )^k \le 0\) is quite similar and is omitted. As \(x_i^k \rightarrow \ {\mathop {x}\limits ^{_{*}}} ^k\), it follows that \(x_i^k - s^k \le \beta \) for all large enough i; hence \(k \in B_i\) and \(f(x_i+ \lambda _i e_k) - f(x_i) \ge -\lambda _i \sigma (\lambda _i)\); that is, \(e_k\) satisfies (6). By taking limits we obtain \(\nabla f( {\mathop {x}\limits ^{_{*}}} )^k \ge 0\).

To prove (28b) we need to consider 2 cases:

  a. \(k\in B_i\). By construction \(f(x_i+ \lambda _i e_k) - f(x_i) \ge -\lambda _i \sigma (\lambda _i)\) and \(f(x_i- \lambda _i e_k) - f(x_i) \ge -\lambda _i \sigma (\lambda _i)\). Together these 2 inequalities imply that \(\nabla f( {\mathop {x}\limits ^{_{*}}} )^k = 0\). Besides, both \(e_k, -e_k\) satisfy (6).

  b. \(k \not \in B_i\). By construction \(E_i\) is a finite set that contains a direction that satisfies (12). If A1 and A2 hold, we mimic the convergence proof of Theorem 1, replacing D by \(E_i\) and \(\nabla f(\cdot )\) by \(v_i\), and deduce that Algorithm 4 generates \(\{v_i\}_{i\in K} \rightarrow 0\); which means that \(\nabla f( {\mathop {x}\limits ^{_{*}}} )^k= 0\) for \(k\not \in B_i\). \(\square \)

4 The integer optimization problem

This section is concerned with the integer optimization problem

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{z \in \mathcal{F}} f(z), \qquad \mathcal{F}= \{z\in \mathbb Z^p: s \le z \le t\}, \end{aligned}$$
(29)

where \(f(\cdot ): \mathbb R^p \rightarrow \mathbb R\). Problem (29) is an optimization problem with integer variables subject to bounds. It is combinatorial. An exhaustive enumeration of the feasible points might be computationally expensive. Besides, (29) is generally a nonconvex problem and normally no algorithm ensures convergence to the optimizer. To try to find a global solution several strategies based on relaxations, cutting planes, branch and bound, surrogate models and heuristics have been devised. It seems out of the question to elaborate a common approach for getting the global solution to any instance of (29). However, some attempts have solved specific instances. In [7], the authors adapt MADS for solving a problem with grid variables regularly distributed along the coordinate axes. In [19], the authors applied a discretized version of Algorithms 3 and 4 for solving the same problem. We recall that this paper assumes that constraints other than bounds can be handled via penalization [12, 33], Lagrangian [30], infinite barriers [1] or any other appropriate technique.

Our algorithm is monotone with no \(\sigma \)-function involved. From the outset, we admit that we have at hand a feasible starting point \(z_0 \in \mathcal{F}\) and only feasible points are considered as solution estimates. We also assume that A1 holds, which essentially means that the level set \(L= \{z\in \mathcal{F}: f(z) \le f(z_0)\}\) is finite. We consider convergence to local minimizers. A local neighborhood \(\mathcal{N}(z_i,\varrho )\) can be defined as

[Algorithm 7: pseudocode figure]
$$\begin{aligned} \mathcal{N}(z_i, \varrho )= \{v \in \mathcal{F}: ||v-z_i|| \le \varrho \}, \end{aligned}$$
(30)

where \(||\cdot ||\) stands for any norm and \(\varrho > 0\). Usually the iterate \(z_i\) is considered stationary to (29) if

$$\begin{aligned} v\in \mathcal{N}(z_i,\varrho ) \Rightarrow f(v)\ge f(z_i). \end{aligned}$$
(31a)

A naive artifice to verify (31a) constructs \(\mathcal{N}(z_i,\varrho )\) and then evaluates all \(v\in \mathcal{N}(z_i,\varrho )\). This is often an expensive procedure that precludes the use of \(\varrho >2\). The algorithm Greedy(), identified as Algorithm 7, is described above. It tries to find better estimates that do not belong to the set \(\mathcal{N}(z_i, \varrho )\); \(z_i\) will be accepted as stationary if

$$\begin{aligned} v\in \mathcal{V}(z_i) \Rightarrow f(v) \ge f(z_i), \end{aligned}$$
(31b)

where \(\mathcal{V}(z_i) \supseteq \mathcal{N}(z_i, \varrho )\). The aim is to build \(\mathcal{V}(z_i)\) so that \(\mathcal{V}(z_i)- \mathcal{N}(z_i, \varrho ) \not = \emptyset \). This can be regarded as a strategy to try to escape from the local stationary point defined by (31a).
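For reference, the naive enumeration of \(\mathcal{N}(z_i,\varrho )\) mentioned above looks as follows in the infinity norm (an assumption; any norm is allowed by (30)); the exponential cost of the loop is what precludes large values of \(\varrho \).

```python
import numpy as np
from itertools import product

def neighborhood(z, rho, s, t):
    """All feasible integer points within (integer) rho of z in the infinity
    norm; the cost grows like (2*rho + 1)^p, hence small values of rho."""
    points = []
    for o in product(range(-rho, rho + 1), repeat=z.size):
        v = z + np.array(o, dtype=int)
        if np.all(v >= s) and np.all(v <= t):
            points.append(v)
    return points
```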

We now briefly illustrate the way Greedy() constructs \(\mathcal{V}(z_i)\).

Given \(q \in \{1,\dots ,p\}\) we want

$$\begin{aligned} \mathcal{V}_q(z_i) \subseteq \big \{v\in \mathcal{F}: ||v-z_i||_0= q \big \}. \end{aligned}$$
(32a)

With no loss of generality we assume \(v^k= z_i^k\), for \(k> q\) and rewrite (32a) as:

$$\begin{aligned} \mathcal{V}_q(z_i) \subseteq \big \{v\in \mathcal{F}: v^k = z_i^k, k> q\big \}. \end{aligned}$$
(32b)

In the actual implementation \(v^{k_1} = z_i^{k_1}, \cdots , v^{k_q} = z_i^{k_q}\), with \(k_1, \dots , k_q\) randomly chosen. The generality of (32b) is preserved, but the explanation that follows is more fluid. We make use of search directions and rewrite (32b) as

$$\begin{aligned} \mathcal{V}_q(z_i) \subseteq \big \{v\in \mathcal{F}: v= z_i + \tau d, d^k= 0, k>q \big \}. \end{aligned}$$
(32c)

We have not yet defined the set \(\mathcal{V}_q(z_i)\), but a promising candidate is:

$$\begin{aligned} \mathcal{V}_q(z_i)= \big \{v \in \mathcal{F}: v= z_i + (u^Tu) H e^k, k \le q \big \}, \end{aligned}$$
(32d)

where \(H= I - (2 /u^Tu) u u^T\) is the \(p\times p\) Householder matrix, \(u^k \in \{-2, -1, 1, 2\}\) for \(k\le q\), and \(u^k=0\) for \(k> q\).

The set \(\mathcal{V}_q(z_i)\) is the set E generated by Algorithm 2 specialized to discrete variables. Given \(h \le p\), the point y is accepted as stationary if

$$\begin{aligned} f(y) \le f(v), v \in \mathcal{V}(y)= \bigcup _{q=1}^h \mathcal{V}_q(y). \end{aligned}$$
(33)

Remark 5

In our implementation we define \(\mathcal{V}(y)= \bigcup _{q=1}^h \mathcal{V}_q(y) \cup \mathcal{N}(y,\varrho )\).
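A hedged Python sketch of the candidate set \(\mathcal{V}_q(z_i)\) in (32d) is shown below; it takes the first q coordinates instead of a random selection and discards infeasible points, both of which are simplifications of the implementation described in the text.

```python
import numpy as np

def discrete_Vq(z, q, s, t, rng=None):
    """Candidate set V_q(z_i) of (32d): feasible integer points
    z + (u^T u) H e^k, k <= q, with u^k in {-2, -1, 1, 2} for k <= q and
    u^k = 0 otherwise.  Coordinates are taken in natural order here."""
    rng = np.random.default_rng() if rng is None else rng
    p = z.size
    u = np.zeros(p, dtype=int)
    u[:q] = rng.choice([-2, -1, 1, 2], size=q)
    zeta = int(u @ u)
    V = []
    for k in range(q):
        step = zeta * np.eye(p, dtype=int)[:, k] - 2 * u[k] * u   # (u^T u) H e_k
        v = z + step
        if np.all(v >= s) and np.all(v <= t):                     # keep feasible points
            V.append(v)
    return V
```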

Greedy() is invoked by \(y=\) Greedy \((z_i,\eta )\), where y is stationary satisfying (33), \(z_i\) is the variable under scrutiny, and \(||y-z_i||_0 \le \eta \).

This algorithmic implementation was effective on the numerical tests performed in Sect. 6. A global solution was often returned. A detailed algorithmic description of Greedy() is given just after its pseudocode is displayed in Algorithm 7.

Remark 6

With our implementation \(||z_i- z_{i-1}||_0=p\) is possible.

Lemma 8

If A1 holds, Greedy() returns a stationary point satisfying (33) in a finite number of iterations.

Proof

If Greedy() remains in the while loop, the if clause in Line 4b is valid infinitely often and Line 4c generates an infinite strictly decreasing sequence \(\{f(y)\}\). By A1 this cannot occur; hence, Greedy() returns with \(q= 0\). When \(q= 1\), Greedy()’s Line 3g constructs the search directions \(d_1, \dots , d_{\#\mathcal{N}}\) and resets \(\mathcal{V}\) to \(\mathcal{N}(z,\varrho )\). Hence, \(q=0\) only happens if \(f(v) \ge f(z)\), for all \(v \in \bigcup _{q=1}^h \mathcal{V}_q(z) \cup \mathcal{N}(z,\varrho )\). \(\square \)

5 Mixed integer optimization

In this section we analyze convergence of the Variable Separation (VS) scheme for solving problem (1), which is repeated here for easy reference.

$$\begin{aligned} \mathop {\textrm{minimize}}\limits _{(x;z) \in \mathcal{F}} \ g(x; z): \mathbb R^{n+p} \rightarrow \mathbb R, \\ \quad \mathcal{F}=\{x \in \mathbb R^n, \ z \in \mathbb Z^p: s \le (x; z) \le t\}. \end{aligned}$$

The difficulties, the standard methodology and the software that have been used for solving mixed integer optimization problems are abridged in [8, 28]. The authors in [19] suggested a uniform discretization of the continuous variables on a grid and solved (29). To improve the solution, the problem is again solved on a finer grid until convergence criteria are met. Details can be found in [19].

5.1 Neighborhood

The neighborhood definition is crucial to detect good optimizers. An algorithm that returns \( {\mathop {z}\limits ^{_{*}}} \) as a solution to a discrete problem must ensure that no better point is found in its neighborhood. A recent discussion on this issue was given in [36] in the context of DFO. We extract 2 definitions of local stationarity depending on a specified neighborhood. \(( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ) \in \mathcal{F}\) is a local minimizer of a nonlinearly constrained problem with mixed variables if either A or B holds.

$$\begin{aligned} \mathbf{A:}\quad g( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ) \le g(x; z)\ \text{ for } \text{ all } \ x \in \mathcal{B}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} ,\rho )\ \text { and all } \ z \in \mathcal{N}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} , \varrho ), \end{aligned}$$
(34a)
$$\begin{aligned} \text {where }\ \mathcal{B}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} , \rho ) =\{x: (x; {\mathop {z}\limits ^{_{*}}} ) \in \mathcal{F}, ||x- {\mathop {x}\limits ^{_{*}}} \!|| \le \rho \}, \end{aligned}$$
(34b)
$$\begin{aligned} \text {and }\ \mathcal{N}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} , \varrho ) = \{z: ( {\mathop {x}\limits ^{_{*}}} ; z) \in \mathcal{F}, ||z- {\mathop {z}\limits ^{_{*}}} || \le \varrho \}. \end{aligned}$$
(34c)
$$\begin{aligned} \mathbf{B:}\quad g( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ) \le g(x; z)\ \text{ for } \text{ all } \ x \in \mathcal{B}'( {\mathop {x}\limits ^{_{*}}} ,\rho )\ \text { and all } \ z \in \mathcal{N}'( {\mathop {z}\limits ^{_{*}}} , \varrho ), \end{aligned}$$
(35a)
$$\begin{aligned} \text {where }\ \mathcal{B}'( {\mathop {x}\limits ^{_{*}}} , \rho ) =\{x: (x; z) \in \mathcal{F} \text{ for } \text{ some } z, ||x- {\mathop {x}\limits ^{_{*}}} \!|| \le \rho \}, \end{aligned}$$
(35b)
$$\begin{aligned} \text {and }\ \mathcal{N}'( {\mathop {z}\limits ^{_{*}}} , \varrho ) = \{z: (x; z) \in \mathcal{F} \text{ for } \text{ some } x, ||z- {\mathop {z}\limits ^{_{*}}} || \le \varrho \}. \end{aligned}$$
(35c)

It is easy to see that \(\mathcal{B}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} ,\rho ) \subseteq \mathcal{B}'( {\mathop {x}\limits ^{_{*}}} ,\rho ),\) and \(\mathcal{N}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} , \varrho ) \subseteq \mathcal{N}'( {\mathop {z}\limits ^{_{*}}} , \varrho )\). The authors in [36] claim some benefits from the use of B.

Definition 7

The point \(( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} )\) is locally stationary for problem () if

$$\begin{aligned} \min _{z\in \mathcal{V}( {\mathop {z}\limits ^{_{*}}} )} g( {\mathop {x}\limits ^{_{*}}} ; z)=g( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} )= \min \limits _{x \in \mathcal{B}( {\mathop {x}\limits ^{_{*}}} , {\mathop {z}\limits ^{_{*}}} , \rho )} g(x; {\mathop {z}\limits ^{_{*}}} ), \end{aligned}$$
(36)

where we employ the neighborhood \(\mathcal{V}(z_i)\) defined in the previous section (cf. Remark 5). Notationally, \(\mathcal{V}(z_i) \subseteq \{z: (x_i, z) \in \mathcal{F}\}\).

5.2 Variable separation for solving ()

The standard approach for solving (1) is based on the VS scheme. VS is a powerful technique that has been used in optimization problems in a multi-processing environment. It is especially suitable for solving large systems [1, 31]. Convergence for DFO with VS and continuous variables was first established in [3] for monotone algorithms and later for nonmonotone DSMs [16].

A typical VS scheme splits a problem into two or more subproblems. In mixed integer problems, it seems natural to divide problem (1) into the subproblems (37a, 37b). Each can be worked out with specific tools. In the rest of the paper, we commit a minor abuse of notation and \(\mathcal{B}(x_{i-1}, \rho )\) substitutes for \(\mathcal{B}(x_{i-1}, z_{i-1}, \rho )\).

$$\begin{aligned} \text {Given }&(x_{i-1}; z_{i-1}) \in \mathcal{F} \nonumber \\ \text {find } x_i \text { such that } g(x_i; z_{i-1})&\le g(x; z_{i-1}) \text { for } x \in \mathcal{B}(x_{i-1}, \rho )\end{aligned}$$
(37a)
$$\begin{aligned} \text {and } \nonumber \\ \text {find } z_i \text { such that } g(x_i; z_i)&\le g(x_i; z) \text { for } z \in \mathcal{V}( z_{i-1}). \end{aligned}$$
(37b)

Given \((x_{i-1}; z_{i-1}) \in \mathcal{F}\), (37a) gives \(x_i\), the best local solution for x provided that \(z_{i-1}\in \mathbb Z^p\) stays fixed; then (37b) selects the best discrete variable in \(\mathcal{V}( z_{i-1})\), a finite set of feasible points that includes \(z_{i-1}\). This iterative process is repeated until convergence conditions are met. In [36] the authors suggest looking for a global solution to (37b) and claim some benefits [36, Section 2]. However, their choice may be computationally expensive, since it looks for the global optimum of a problem with discrete variables. A local solution will hopefully require the evaluation of \(g(x_i;z)\) at points in a small discrete set. In some applications the subproblem (37a) reduces to a linear model and \(z_i\) is chosen with heuristics linked to the structure of the problem [39]. It is advisable to use known efficient techniques whenever possible.
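The alternation (37a, 37b) can be sketched as a short loop in Python; solve_continuous and solve_discrete below are placeholders for the subproblem solvers assumed to be available, and the stopping test implements the equality case of (36).

```python
def standard_vs(g, x0, z0, solve_continuous, solve_discrete, max_iter=1000):
    """Alternate the two local solves of (37a) and (37b) until neither block
    decreases g; the stopping test is the equality case of (36)."""
    x, z = x0, z0
    for _ in range(max_iter):
        x_new = solve_continuous(lambda xx: g(xx, z), x)     # (37a): z fixed
        z_new = solve_discrete(lambda zz: g(x_new, zz), z)   # (37b): x fixed
        if g(x_new, z_new) >= g(x, z):                       # no decrease: stop
            return x, z
        x, z = x_new, z_new
    return x, z
```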

[Algorithm 8: pseudocode figure]

standardVS(), identified as Algorithm 8, describes a straightforward implementation of (37). It is invoked with \([ {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ] = \text{ standardVS } (x_0, z_0,\varepsilon )\), where \((x_0; z_0)\) is the initial estimate, \(\varepsilon \) is the accuracy, and \(( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} )\) is the stationary point satisfying (36). We now prove convergence under Assumptions A1, A4. We assume that we have at hand algorithms that solve problems (38a, b).

A1 The function \(g(\cdot ; \cdot )\) is computable at all feasible points and the set

\(L=\{(x; z) \in \mathcal{F}: g(x; z) \le \varphi _0\}\) is compact for any given \( \varphi _0\).

A4 The function \(g(\cdot ;\cdot ): \mathbb R^{n+p} \rightarrow \mathbb R\) satisfies the conditions required by the algorithms employed for solving (38a, b).

Lemma 9

Under assumptions A1, A4 the sequence \(\{g(x_i; z_i)\}\) generated by standardVS() is decreasing.

Proof

By optimality in (38a) and (38b) it follows that:

$$\begin{aligned} \displaystyle \min _{z\in \mathcal{V}(z_{i-1})} g(x_i; z) =&\ g(x_i; z_i) \le g(x_i; z_{i-1}), \end{aligned}$$
(39a)
$$\begin{aligned} {and}&\nonumber \\ \displaystyle g(x_{i-1}; z_{i-1}) \ge g(x_i; z_{i-1}) =&\min _{x\in \mathcal{B}(x_{i-1},\rho )} g(x; z_{i-1}). \end{aligned}$$
(39b)

Hence,

$$\begin{aligned} g(x_{i-1}; z_{i-1}) \ge g(x_i; z_{i-1}) \ge g(x_i; z_i) \end{aligned}$$
(40)

\(\square \)

Corollary 4

If \(g(x_i; z_i) = g(x_{i-1}; z_{i-1})\), then \(g(x_i; z_i)\) satisfies (36).

Proof

If \(g(x_i; z_i) = g(x_{i-1}; z_{i-1})\), then from (40) we deduce that

\( g(x_{i-1}; z_{i-1})= g(x_i; z_{i-1})\) and (39a-b) become a chain of equalities. It follows that \((x_{i-1}; z_{i-1})\) satisfies (36) and Algorithm 8 stops. \(\square \)

Lemma 10

Let \(\rho \) be large enough and let \(z_j= z_{i-1}\) for some \(j>i\). Under Assumptions A1, A4 Algorithm 8 stops at a stationary point.

Proof

At iterations i, j, with \(j>i\), we obtain by construction that

$$\begin{aligned} \begin{array}{ r c l} &{} g (x_{i-1}; z_{i-1})\ge g(x_i; z_{i-1}) \ge g(x_i; z_i) &{} \\ &{} {and} &{} \\ &{} g (x_j; z_j) \ge g(x_{j+1}; z_j) \ge g(x_{j+1}; z_{j+1}). \end{array} \end{aligned}$$
(41)

We now assume \(z_j= z_{i-1}\) and reach a contradiction. It follows that

$$\begin{aligned} \displaystyle g(x_{j+1};z_j)&= \min _{x\in \mathcal{B}(x_j,\rho )} g(x; z_j) \end{aligned}$$
(42a)
$$\begin{aligned} =\min _{x\in \mathcal{B}(x_j,\rho )} g(x; z_{i-1})&= \min _{x\in \mathcal{B}(x_{i-1},\rho )} g(x; z_{i-1}) \end{aligned}$$
(42b)
$$\begin{aligned}&= \ g(x_i; z_{i-1}), \end{aligned}$$
(42c)

where equality (42b) holds because \(\mathcal{B}(x_j,\rho )= \mathcal{B}(x_{i-1},\rho )\) for large enough \(\rho \). It also follows that

$$\begin{aligned} \begin{array}{l l l} g(x_i; z_{i-1}) &{}\ge g(x_i; z_i) &{} \text { by (41)} \\ &{} > g(x_j; z_j) &{} \text { by Corollary 4} \\ &{} \ge g(x_{j+1}; z_j) &{} \text { by (42a)} \\ &{} \ge g(x_i; z_{i-1}) &{} \text { by (42c)}; \end{array} \end{aligned}$$
(43)

which is a clear contradiction. \(\square \)

From the above discussion we conclude that

Proposition 2

Under Assumptions A1, A4 and large enough \(\rho \), Algorithm 8 terminates in a finite number of iterations. \(\square \)

Proof

It is an immediate consequence of Lemma 10 and Corollary 4. On the one hand, if \(z_j = z_{i-1}\) for some \(j>i\), Lemma 10 ensures termination in a finite number of iterations. On the other hand, since the feasible set is bounded by A1, all integer variables are exhausted in a finite number of iterations. \(\square \)

This is a desirable conclusion. However, it requires the global solution of optimization problems at each iteration. Theoretically, this is an infinite process. When either optimization problem, (38a) or (38b), is rather complex, we offer another alternative that avoids global optimization subproblems.

5.3 VS with incomplete optimization

If we denote \(w= \begin{pmatrix} x \\ z \end{pmatrix} \in \mathbb R^{n+p}\), problem (1) can be considered as an optimization problem with \(n+p\) variables. To prove convergence along the lines laid out in Sect. 2, we should construct a set of search directions \(D= \{d_1, \dots , d_q\}, q> (n+p), d_j \in \mathbb R^{n+p}\), with some \(d \in D\) satisfying (12). A lot of information is needed. The new estimate is

$$\begin{aligned} w_i= w_{i-1}+ \tau _i d_j, \text { that is, } \begin{pmatrix} x_i \\ z_i \end{pmatrix} = \begin{pmatrix} x_{i-1} \\ z_{i-1} \end{pmatrix} + \tau _i \begin{pmatrix} d_j^x \\ d_j^z \end{pmatrix} \end{aligned}$$
(44)

for some \(d_j \in D;\ d_j^x\) are the first n components of \(d_j\), and \(d_j^z\) stand for the last p components of \(d_j\). This update is problematic: to keep \(\{z_i\} \in \mathbb Z^p\), we must impose \(\{\tau _i d_j^z\} \in \mathbb Z^p\). This restriction on \(\{\tau _i\}\) is undesirable for the continuous variables, so update (44) is discarded. The VS scheme used by Algorithm 8 overcomes this shortcoming. However, the solution of the global optimization problems (38a, b) is required to prove convergence. In this section we propose a VS scheme that does not require global optimization. It rests strongly upon the theory developed in the previous sections. We first need to define stationarity for Problem ().

Definition 8

\(( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ) \in \mathcal{F}\) is nonsmooth stationary on D for Problem () if \( \exists ( \{x_i\} \rightarrow {\mathop {x}\limits ^{_{*}}} , \{\tau _i\} \downarrow 0 )\ \) such that

$$\begin{aligned} x_i+ \tau _i d \not \in \mathcal{F} { \textbf{or}}\ \big (g(x_i+ \tau _i d; {\mathop {z}\limits ^{_{*}}} ) - g(x_i; {\mathop {z}\limits ^{_{*}}} ) > - \sigma (\tau _i) \big ) \text { for } d\in D, \end{aligned}$$
(45a)

and for some \(\varrho > 0\)

$$\begin{aligned} g( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} ) \le g( {\mathop {x}\limits ^{_{*}}} ,z) \text { for } z \in \mathcal{V}( {\mathop {z}\limits ^{_{*}}} , \varrho ). \end{aligned}$$
(45b)

Algorithm 9 is the Main() algorithm. It uses a VS scheme for solving Problem (). Given \((x_{i-1};z_{i-1})\), the Main() algorithm

  1.

    invokes MixedDiscrete(), identified as Algorithm 11, to obtain the next iterate \((x_{i-1};z_i)\). Its output \(z_i\) satisfies

    $$\begin{aligned} \begin{array}{r l} (x_{i-1};z_i) \in \mathcal{F},\ g(x_{i-1};z_i) &{}\le g(x_{i-1};z)\ \text { for } z\in \mathcal{V}(z_{i-1}), \text { and} \\ g(x_{i-1};z_i) &{} \le g(x_{i-1}; z_{i-1}). \end{array} \end{aligned}$$
    (46)

    The last inequality trivially holds if \(z_{i-1}\in \mathcal{V}(z_{i-1})\).

  2.

    invokes MixedContinuous(), identified as Algorithm 10, within a repeat-until loop. The algorithm leaves the loop when \(g(x_i;z_i) < g(x_{i-1}; z_i)\). This strategy makes sure the algorithm generates a decreasing sequence: \(g(x_i; z_i) > g(x_j; z_j)\) for \(i<j\). This feature is shared with Algorithm 8 and it was very convenient for proving finite termination. Besides, if the decreasing condition does not hold, the algorithm might not converge to the optimum solution \(( {\mathop {x}\limits ^{_{*}}} ; {\mathop {z}\limits ^{_{*}}} )\) returned by standardVS(). A C-style sketch of this loop is given below.

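To fix ideas, the following is a minimal C sketch of the loop just described. The routine names mixed_discrete(), mixed_continuous() and g(), their signatures, and their return conventions are illustrative placeholders for Algorithm 11, Algorithm 10 and the objective; they are not the authors' implementation.

```c
/* Hypothetical sketch of the Main() loop (Algorithm 9) described above. */
extern double g(const double *x, const int *z);                       /* objective g(x; z)                   */
extern void   mixed_discrete(const double *x, int *z, int eta);       /* role of Algorithm 11, cf. (46)      */
extern int    mixed_continuous(double *x, const int *z, double *phi); /* role of Algorithm 10; 0 if no qdd   */

void vs_main(double *x, int *z, double phi, int eta)
{
    for (;;) {
        mixed_discrete(x, z, eta);        /* step 1: local search over the integer neighborhood */

        double g_prev = g(x, z);          /* reference value g(x_{i-1}; z_i)                    */
        for (;;) {                        /* step 2: repeat-until loop                          */
            if (!mixed_continuous(x, z, &phi))
                return;                   /* no qdd found: (x; z) is returned as stationary     */
            if (g(x, z) < g_prev)         /* strict decrease achieved: leave the loop           */
                break;
        }
    }
}
```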

Remark 7

Algorithms 10 and 11 are, respectively, Algorithms 3 and 7 adapted to the VS scheme for solving Problem (). They are explicitly shown to facilitate the writing and reading of this section.


Lemma 11

Given \((x_{i-1};z_{i-1}) \in \mathcal{F}\), let \(z_i\) = MixedDiscrete\((x_{i-1}, z_{i-1},\eta )\). It follows that \(g(x_{i-1};z_i) \le g(x_{i-1};z) \) for all \(z\in \mathcal{V}(z_{i-1}) \).

Proof

This is a replica of Lemma 8, replacing \(f(z)\) by \(g(x; z)\). We reach a similar result, namely, \(g(x_{i-1};z_i) \le g(x_{i-1};v), v\in \bigcup _{q=1}^h \mathcal{V}_q(y)\). \(\square \)

Let \(x_{i+1}\) be returned by MixedContinuous\((x_i, z_{i+1}, \varphi _i)\). If a qdd is not found by MixedContinuous(), then \(\exists \{\lambda _j\} \downarrow 0\) such that

$$\begin{aligned} (x_i + \lambda d, z_i)&\not \in \mathcal{F}, \textbf{or} \end{aligned}$$
(47a)
$$\begin{aligned} g(x_i + \lambda d; z_i)-g(x_i;z_i)&> -o(\lambda ) \text { for } \lambda \in \{\lambda _j\}, d\in D_{i+1}. \end{aligned}$$
(47b)

Main() returns \((x_{i}; z_{i})\) as stationary. In a practical implementation we admit that a qdd at \(x_i\) does not exist when \(\lambda <\varepsilon \) in (). In short, Main() returns \((x_i,z_i)\) satisfying () or it generates an infinite sequence \(\{x_i, z_i, \tau _i, \varphi _i, g_i, D_i \}\) with the following properties:

$$\begin{aligned} (x_{i-1}; z_i)&\in \mathcal{F}, \end{aligned}$$
(48a)
$$\begin{aligned} g(x_{i-1}; z_i)&< g(x_{i-1};z), \text { for all } z \in \mathcal{V}(z_{i-1}). \end{aligned}$$
(48b)
$$\begin{aligned} g(x_{i-1}; z_i)&\le g(x_{i-1};z_{i-1}), \end{aligned}$$
(48c)
$$\begin{aligned} (x_i; z_i)&\in \mathcal{F}, \end{aligned}$$
(48d)
$$\begin{aligned} g(x_i; z_i)&< g(x; z_{i}) \text { for all } x \in \mathcal{B}(x_{i-1},\rho )\end{aligned}$$
(48e)
$$\begin{aligned} g(x_i; z_i)&\le \varphi _i \le \varphi _{i-1}- \sigma (\tau _i), \end{aligned}$$
(48f)
$$\begin{aligned} g(x_i + \tau _id; z_i)&> \varphi _i- \sigma (\tau _i) \text { for all } d \in D. \end{aligned}$$
(48g)

The following facts are inherited from the material analyzed in the previous sections of this paper. We should bear in mind that \(f(x) {\mathop {=}\limits ^{\varDelta }} g(x; {\mathop {z}\limits ^{_{*}}} )\), whenever \( {\mathop {z}\limits ^{_{*}}} \) is fixed.

Lemma 12

Under Assumption A1, \(\{\tau _i\} \rightarrow 0\).

Proof

By (48f) and A1 it follows that

$$\begin{aligned} g(x_k; z_k) \le \varphi _k \le \varphi _i - \sum \limits _{k > j \ge i} \sigma (\tau _j). \end{aligned}$$
(49)

This implies that \(\{\sigma (\tau _i)\} \rightarrow 0\). A fortiori, \(\{\tau _i\} \rightarrow 0\). \(\square \)

Theorem 4

Under Assumptions A1, A4, \(\{x_i\}_{i\in K} \rightarrow {\mathop {x}\limits ^{_{*}}} \) satisfying (45a) and (45b).

Proof

Let \(S= \{x_i, z_i, \tau _i, g_i, \varphi _i, D_i \}_1^\infty \) be the infinite sequence generated by Main() and let \( {\mathop {x}\limits ^{_{*}}} , {\mathop {D}\limits ^{_{*}}} \) be an accumulation point of \(\{x_i, D_i\}\). As the number of feasible integer points is finite by A1, there exists a point, say \(z_k\), that appears infinitely often in S. Let us denote this point as \( {\mathop {z}\limits ^{_{*}}} \) and extract from S the subsequence

$$\begin{aligned} {\mathop {S}\limits ^{_{*}}} = \{x_i, {\mathop {z}\limits ^{_{*}}} , \tau _i, g_i, \varphi _i, D_i\}_{i \in K}, K= \{i: \{x_i, D_i\} \rightarrow ( {\mathop {x}\limits ^{_{*}}} , {\mathop {D}\limits ^{_{*}}} ), z_i= {\mathop {z}\limits ^{_{*}}} \}. \end{aligned}$$

From (48f) and (48g) it follows for \(i \in K\) that

$$\begin{aligned} \begin{array}{r l} \displaystyle \frac{g(x_i + \tau _i d; {\mathop {z}\limits ^{_{*}}} )- g(x_i; {\mathop {z}\limits ^{_{*}}} )}{\tau _i} &{}\ge \displaystyle \frac{g(x_i+ \tau _i d, {\mathop {z}\limits ^{_{*}}} )- \varphi _i}{\tau _i} \\ &{} > - \sigma (\tau _i)/\tau _i, \text { for all } d \in D_i, \end{array} \end{aligned}$$
(50)

which shows that (45a) holds. Furthermore, from Lemma 11 we obtain that \(g(x_i; {\mathop {z}\limits ^{_{*}}} ) \le g(x_i;z),\ z\in \mathcal{V}( {\mathop {z}\limits ^{_{*}}} , \varrho )\), that is, (45b) holds. The proof is complete. \(\square \)

6 Numerical experiments

This section reports some results obtained with a code written in C and compiled with Dev-C++ version 5.11 on a computer equipped with an Intel Core CPU @ 3.3 GHz. These preliminary results address four goals:

  1.

    Continuous vs Discrete Performance. All problems tested have global integer optimizers. We ran two algorithms for each problem: Continuous() with \(x\in \mathbb R^n\) and Greedy() with \(x\in \mathbb Z^n\). We would like to show numerically that both algorithms converge adequately to the global solution. We point out that it is common to include the constraint \(x\in \mathbb Z^n\) to test integer optimization algorithms [37, 38, 43].

  2.

    Global solution (glob). We carried out 100 runs for each problem and report the number of cases where the known global optimum was attained.

  3.

    Search directions. The experiments were performed with the randomly generated directions described by Algorithm 2 and with a coordinate search.

  4.

    Functional evaluations (eval). This figure is often considered a performance index.

The choice of parameters may significantly influence any algorithm’s behavior. The best selection is probably problem dependent. Nonetheless, we ran all examples with a common set of parameters:

  • \(\epsilon =\) 1.e-07 is the required accuracy. The smaller its value, the larger the value of eval. We use \(\epsilon \) mainly to stop the algorithms.

  • \(\eta = \min (n, 6)\), where n is the number of variables. The \(\eta \) value suggests the size of the local neighborhood where a better stationary point may be located. The larger its value, the larger the value of eval.

  • \(\varphi _0= f(x_0; z_0) + 0.8|f(x_0;z_0)|\). This parameter affects the non-monotone behavior of the algorithm.

  • \(\delta =\) 1.e-05. To inhibit stagnation we accept \(x_{i+1}\) as a new estimate when (see the sketch after this list):

    $$\begin{aligned} \begin{array}{c} f(x_{i+1}) \le \varphi _i - \sigma _i, \\ \textbf{and} \\ { |f(x_{i+1}) - f(x_i)| \ge \delta (\delta + |f(x_i)|).} \end{array} \end{aligned}$$
    (51)
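For illustration, the acceptance rule (51) reduces to a two-clause test; the sketch below is a literal transcription with hypothetical names.

```c
#include <math.h>

/* Transcription of the acceptance test (51): accept x_{i+1} when it improves
   the nonmonotone reference value by at least sigma and the change in the
   function value is not negligible (anti-stagnation clause). */
int accept_step(double f_new, double f_old, double phi, double sigma, double delta)
{
    return (f_new <= phi - sigma) &&
           (fabs(f_new - f_old) >= delta * (delta + fabs(f_old)));
}
```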

The test problems have a small number of variables; however, they seem to represent a collection of hard problems. The starting point was a random feasible \(z\in \mathbb Z^n\). Whenever possible, we state some comparison-related issues with other algorithms that solve the same problem. The results are summarized in Tables 1, 2, 3, 4, 5, 6, 7 and 8. Each table shows the reference to the problem studied; its number and kind of variables; the set of search directions; the minimum, maximum and average number of function evaluations; and the number of times the known global solution is attained.

The algorithm also solved a small nonlinear real problem described in [10, Section 3.3]. The idea is to show a device for dealing with discrete variables that are not integer variables. It is worth mentioning that our algorithm unveiled a better solution than that reported in [10]. This section ends with some remarks about the numerical results obtained.

6.1 Branin function [9]. Optimizer: \(y= (-3,\ 13)\)

$$\begin{aligned}{} & {} \begin{array}{c} a= 1, b= 5.1 / (4\pi ^2), c= 5.0/\pi , \\ r= 6.0, s= 10.0, t= 1.0/(8\pi ),\\ x^1= y^1- 0.689, x^2= y^2+ 0.629. \end{array}\nonumber \\{} & {} \begin{array}{l} f(x)= a*\big (x^2- b(x^1)^2 + cx^1-r\big )^2 + s(1-t) \cos (x^1)+ s+ 5x^1. \\ \qquad \qquad -5 \le x^1 \le 10; \quad 0 \le x^2 \le 15. \end{array} \end{aligned}$$
(52)

The Branin function is a classical test problem in global optimization. The version shown above has an integer global solution. The results are given in Table 1.
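For reference, a direct C transcription of (52) as printed reads as follows; y[0], y[1] are the decision variables, the shift to x1, x2 is applied internally, and the box constraints are left to the optimizer.

```c
#include <math.h>

/* Shifted Branin variant (52), transcribed as printed above. */
double branin_shifted(const double y[2])
{
    const double pi = acos(-1.0);
    const double a = 1.0, b = 5.1 / (4.0 * pi * pi), c = 5.0 / pi;
    const double r = 6.0, s = 10.0, t = 1.0 / (8.0 * pi);
    const double x1 = y[0] - 0.689, x2 = y[1] + 0.629;
    const double u  = x2 - b * x1 * x1 + c * x1 - r;

    return a * u * u + s * (1.0 - t) * cos(x1) + s + 5.0 * x1;
}
```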

Table 1 Performance for function (52), n = 2

6.2 Rosenbrock function [9]. Optimizer: \(x^k = 1, k=1,\dots ,50\)

$$\begin{aligned} \begin{array}{l} f(x)=\sum \limits _{k= 1}^{k=n-1} 100(x^{k+1} - (x^k)^2)^2 + (1- x^k)^2, \\ \qquad \qquad -5 \le x^k \le 5, k= 1, \dots , n. \end{array} \end{aligned}$$
(53)

The function (53) has been extensively used as a benchmark. It has long narrow valleys and local minimizers for \(n>4\) [27]. The global solution is all integer, \(x= (1, 1, \dots , 1)\), regardless of the number of variables. The results shown in Table 2 with \(x \in \mathbb Z^{50}\) are remarkable when compared with [38]. The algorithm never gets trapped by any of the numerous local minimizers [27].
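A straightforward C transcription of (53) is given below; indexing is zero-based and the bounds \(-5 \le x^k \le 5\) are handled by the box constraints.

```c
/* Rosenbrock benchmark (53) for n variables. */
double rosenbrock(const double *x, int n)
{
    double f = 0.0;
    for (int k = 0; k < n - 1; ++k) {
        const double a = x[k + 1] - x[k] * x[k];
        const double b = 1.0 - x[k];
        f += 100.0 * a * a + b * b;
    }
    return f;
}
```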

Table 2 Performance for function (53), n = 50

6.3 Lukšan function [34]. Optimizer: \(y= (1,\ 1)\)

$$\begin{aligned}{} & {} \begin{array}{c} x^1= y^1+ 0.14, \quad x^2= y^2- 0.1, \\ w^1 = (x^1)^2 + (x^2)^4, \\ \quad w^2 = (x^1 - 2)^2 + (x^2 - 2)^2, \\ \quad w^3 = 2 \exp (x^2 - x^1). \end{array} \nonumber \\{} & {} f(w)= \max (w^1, w^2, w^3). \end{aligned}$$
(54)

Minimax problems are non-smooth and appear often when testing algorithms.
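A direct C transcription of the shifted minimax function (54) follows; y[0], y[1] are the decision variables.

```c
#include <math.h>

/* Shifted Luksan minimax function (54), transcribed as printed. */
double luksan_minimax(const double y[2])
{
    const double x1 = y[0] + 0.14, x2 = y[1] - 0.1;
    const double w1 = x1 * x1 + x2 * x2 * x2 * x2;
    const double w2 = (x1 - 2.0) * (x1 - 2.0) + (x2 - 2.0) * (x2 - 2.0);
    const double w3 = 2.0 * exp(x2 - x1);
    const double m  = (w1 > w2) ? w1 : w2;
    return (m > w3) ? m : w3;
}
```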

Table 3 Performance for function (54), n = 2

6.4 Pintér function [40]. Optimizer: \(x= y\)

$$\begin{aligned} \begin{array}{c l} f(x) &{}= 0.025n\sum \limits _{k=1}^{k=n} (x^k- y^k)^2 + \sin ^2(x^k- y^k) \\ &{}\quad + \sin ^2 \left( \sum \limits _{k=1}^{k=n}\left[ (x^k-y^k) +(x^k - y^k)^2\right] \right) , \\ &{}\quad -5 \le x^k \le 5, k= 1, \dots , n. \end{array} \end{aligned}$$
(55)

The Pintér problem was a difficult test for our algorithms. The author in [40, Section 4.4] justifies the need for global search strategies. Problem (55) possesses many local minimizers with close function values. The dimension and solution can be arbitrarily selected. We chose \(y^k=1, k=1,\dots ,n\) with \(n=5\) and \(n= 10\). The results are shown in Table 4.

Table 4 Performance for function (55), n = 5

6.5 Ackley function [9]. Optimizer: \(x^k= 0, k=1,\dots ,30\)

$$\begin{aligned} \begin{array}{l} \displaystyle f(x)= 20+ \exp (1) - 20 \exp \left( -0.2 \sqrt{\frac{1}{n}\sum _{k=1}^{k=n} (x^k)^2}\right) - \exp \left( \frac{1}{n}\sum _{k=1}^{k=n} \cos (2\pi x^k)\right) , \\ \qquad \qquad -10 \le x^k \le 10, k=1, \dots , n. \end{array} \nonumber \\ \end{aligned}$$
(56)

The authors in [29] could not solve (56) with any of the heuristic algorithms they tried. In [15] the author reports that his algorithm returns the global solution in 70 out of 100 runs, with 20,000 function evaluations on average. Our algorithms always found the global solution. The number of evaluations for the discrete version was remarkably low in comparison with [38].
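For completeness, a C transcription of (56) reads as follows.

```c
#include <math.h>

/* Ackley function (56) for n variables. */
double ackley(const double *x, int n)
{
    const double pi = acos(-1.0);
    double sq = 0.0, cs = 0.0;
    for (int k = 0; k < n; ++k) {
        sq += x[k] * x[k];
        cs += cos(2.0 * pi * x[k]);
    }
    return 20.0 + exp(1.0)
         - 20.0 * exp(-0.2 * sqrt(sq / n))
         - exp(cs / n);
}
```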

Table 5 Performance for function (56), n = 30

6.6 Shekel function [9]. Optimizer: \(x^k= 4, k=1,\dots ,4\)

$$\begin{aligned}{} & {} \begin{array}{l} b= (0.1, 0.2, 0.2, 0.4, 0.4, 0.6, 0.3, 0.7, 0.5, 0.5), \\ C=\left[ \begin{array}{l} {4, 1, 8, 6, 3, 2, 5, 8, 6, 7} \\ {4, 1, 8, 6, 7, 9, 3, 1, 2, 3} \\ {4, 1, 8, 6, 3, 2, 5, 8, 6, 7} \\ {4, 1, 8, 6, 7, 9, 3, 1, 2, 3} \end{array} \right] . \end{array}\nonumber \\{} & {} \begin{array}{l} \displaystyle f(x)= - \sum _{j=1}^{j=10} \left( \sum _{k=1}^{k=4} (x^k- C^{kj})^2 + b^j \right) ^{-1}, \\ \qquad \qquad 0 \le x^k \le 10, k=1, \dots , 4. \end{array} \end{aligned}$$
(57)

The Shekel function (57) has 10 local discrete minimizers, which is a challenge for the Greedy() algorithm. Again, with regard to function evaluations, the discrete version outperformed both the continuous version and the results reported in [15, 38].
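A C transcription of (57), with the data b and C exactly as printed above, is shown below.

```c
/* Shekel function (57) with 10 wells, data as printed above. */
double shekel(const double x[4])
{
    static const double b[10] = {0.1, 0.2, 0.2, 0.4, 0.4, 0.6, 0.3, 0.7, 0.5, 0.5};
    static const double C[4][10] = {
        {4, 1, 8, 6, 3, 2, 5, 8, 6, 7},
        {4, 1, 8, 6, 7, 9, 3, 1, 2, 3},
        {4, 1, 8, 6, 3, 2, 5, 8, 6, 7},
        {4, 1, 8, 6, 7, 9, 3, 1, 2, 3}
    };
    double f = 0.0;
    for (int j = 0; j < 10; ++j) {
        double s = b[j];
        for (int k = 0; k < 4; ++k) {
            const double d = x[k] - C[k][j];
            s += d * d;
        }
        f -= 1.0 / s;
    }
    return f;
}
```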

Table 6 Performance for function (57), n = 4

6.7 NgZhang function [38]. Optimizer: \(x^k= 1, k=1,\dots ,25\)

$$\begin{aligned} \begin{array}{l} f(x)= (x^1 - 1)^2 + (x^2-1)^2 + n \sum _{k=1}^{n-1} (n-k)(x^{k+1}- (x^k)^2), \\ \qquad \qquad -5 \le x^k \le 5, \ k= 1, \dots , n. \end{array} \end{aligned}$$
(58)
Table 7 Performance for function (58), n = 25

6.8 Beam function [10]. Known solution: \(x=( 7, 0.1, 9.4848, 0.1), f(x)= 97.7\)

$$\begin{aligned}{} & {} \begin{array}{c} a= 36, b= 1000, c= 10^7, \\ h= x^2(x^1 - 2 x^4)^3 / 12 + 2 \Big [x^3 (x^4)^3+ x^4 x^3 (x^1-x^4)^2 /4\Big ], \\ f(x^1, x^2, x^3, x^4)= a \big [2x^4x^3 + (x^1-2x^4)x^2\big ]. \end{array} \nonumber \\{} & {} \begin{array}{c r l} \displaystyle \mathop {\textrm{minimize}}\limits _{x\in G} &{} f(x) &{}: \mathbb R^4 \rightarrow \mathbb R \\ G=\mathbf{\{}x\in \mathbb R^4: &{} g^1(x)&{}= abx^1/(2h) - 5000 \le 0, \\ &{} g^2(x)&{}= 36^3 b/(3ch) - 0.1 \le 0, \\ &{} 3.0 &{} \le x^1 \le 7.0, \\ &{} 0.1 &{} \le x^2 \le 2.0, \\ &{} 2.0 &{} \le x^3 \le 12.0, \\ &{} 0.1 &{} \le x^4 \le 1.0 \mathbf{\}}. \\ \end{array} \end{aligned}$$
(59)

Problem (59) comes from a real application. The goal is to design a minimum-volume beam satisfying physical constraints. A detailed description is given in [10]. To solve (59) we use the exact penalty function

$$\begin{aligned} 500\big [\max (g^1(x), 0)+ \max (g^2(x), 0)\big ]. \end{aligned}$$
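In code, the penalized objective takes the following form; beam_volume(), g1() and g2() are hypothetical names for f, \(g^1\) and \(g^2\) in (59), assumed to be implemented elsewhere.

```c
/* Sketch of the exact penalty handling used for (59). */
extern double beam_volume(const double x[4]);   /* f(x) in (59)   */
extern double g1(const double x[4]);            /* g^1(x) in (59) */
extern double g2(const double x[4]);            /* g^2(x) in (59) */

double penalized_beam(const double x[4])
{
    const double v1 = g1(x), v2 = g2(x);
    const double p1 = (v1 > 0.0) ? v1 : 0.0;    /* max(g^1(x), 0) */
    const double p2 = (v2 > 0.0) ? v2 : 0.0;    /* max(g^2(x), 0) */
    return beam_volume(x) + 500.0 * (p1 + p2);
}
```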

The algorithm’s rate of success was 72% with the search directions given by Algorithm 2. Moreover, the algorithm unveiled a better solution:

x= (6.9999198, 0.1000450, 9.4756010, 0.1000151),

\(f(x)= 92.72525, \ g^1(x)= -0.00351, \ g^2(x)= -0.09383\)

Table 8 Performance for function (59), n = 4

6.9 Mixed beam

We now mention the mixed variable problem suggested in [10]; \(x^4\) becomes discrete with the additional constraint

$$\begin{aligned} x^4 \in V= \{0.1, 0.25, 0.35, 0.5, 0.65, 0.75, 0.9, 1.0\}. \end{aligned}$$
(60)

To handle (60) as a mixed variable problem we employ an integer variable z whose values are linked to \(x^4\) as follows:

\(z=\)     1      2      3      4      5      6      7      8

\(x^4=\)   0.1    0.25   0.35   0.5    0.65   0.75   0.9    1.0

Problem (59) is now a non-linearly constrained problem with 3 continuous variables \((x^1, x^2, x^3)\) and the integer variable z. This kind of problem was not among the aims of this paper. Nonetheless, Main() solved it with fewer than 600 function evaluations.
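In the code, the link between z and \(x^4\) can be implemented as a simple lookup; the sketch below is illustrative.

```c
/* Discrete set (60): the integer variable z in {1,...,8} indexes the
   admissible values of x^4. */
static const double X4_VALUES[8] = {0.1, 0.25, 0.35, 0.5, 0.65, 0.75, 0.9, 1.0};

double x4_from_z(int z)   /* assumes 1 <= z <= 8 */
{
    return X4_VALUES[z - 1];
}
```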

6.10 Comments on the numerical results

Problems 6.1–6.7 were taken from the open literature. They were originally proposed as benchmarks for continuous optimization. Nowadays many researchers have added the constraint \(x\in \mathbb Z^n\) for testing integer optimization algorithms. With this sample we hope to identify features to be improved.

Problem 6.1:

The version of the Branin function shown here has one global minimizer and two more local minimizers. This is the only problem in the sample for which the discrete version was solved in fewer function evaluations than the continuous version.

Problem 6.2:

The authors in [38] report that they successfully solved the discrete version of the Rosenbrock function in \( \approx \)1.6 million function evaluations. Our result is remarkable.

Problem 6.3:

For the Lukšan function, the results in \(\mathbb R^n\) and with coordinate search were not appealing. This is probably linked to the geometry of the problem.

Problem 6.4:

The Pintér problem was the most difficult to solve. Often, the algorithms converge to a local minimizer. Global strategies are needed to improve the algorithm’s performance.

Problem 6.5:

In 400 runs our algorithms always returned the global minimizer of the Ackley function. This is an impressive result because few function evaluations were needed for solving the discrete version.

Problem 6.6:

We had already tested the Shekel function in [15]. The present results improve on those.

Problem 6.7:

This problem was efficiently solved; however, we should pay attention to the dispersion in the number of function evaluations observed in \(\mathbb R^n\) with random directions.

Overall, our algorithms, especially the discrete version, look promising. Better results should be expected by introducing globalization strategies.

7 Conclusion and final remarks

We have formalized and described a common framework for a class of nonmonotone Direct Search Methods. Convergence for non-smooth functions is proved with no differentiability assumptions. A new concept of quasi descent direction is introduced to show nonsmooth convergence. These methods converge under weak assumptions to classical first-order stationary points of unconstrained problems with continuous variables. Convergence prevails for box-constrained optimization with mixed variables. Derivative information is not required. For unconstrained problems with continuous variables we prove convergence to a point that has no quasi descent direction. The role of a finite set of n randomly generated orthogonal search directions was crucial to ensure this result. Orthogonality was also an asset in the design of a variable separation algorithm for solving the mixed-integer optimization problem. All convergence properties imply convergence to a Clarke stationary point, provided the function is Lipschitz continuous.

For the sake of clarity, we have presented the theory complemented with a basic but unspecified implementation of the algorithms. Our framework is open to many variants that deserve further research. Nonetheless, we wrote a preliminary code and carried out numerical experiments on several academic problems and a 5-variable real application. The results are encouraging. The next apparent line of research is the convergence analysis of DSMs applied to nonlinearly constrained mixed-integer optimization problems.

We have left out for future study some important issues that are beyond the aims of this paper, like strategies for accelerating convergence, global minimization, parallelism, and hybrid methods. A clear research topic is the analysis and impact of non-monotone features in other successful approaches to DFO, like MADS [7], surrogate functions [25, 26, 41], and others.

In this paper, and also in [15], feasibility of the sequence of solution estimates was enforced at all iterations. If this condition could be removed, we would probably broaden the class of nonmonotone DSMs that converge under the conditions stated herein.