1 Introduction

In this work, we consider the problem of minimizing a smooth function subject to a sparsity constraint (cardinality constraint). Optimization problems where sparse solutions are sought arise frequently in modern science and engineering. Examples of applications of sparse optimization include compressed sensing in signal processing [1, 2], best subset selection [3,4,5,6] and sparse inverse covariance estimation [7, 8] in statistics, sparse portfolio selection [9] in decision science, and neural network compression in machine learning [10, 11].

It is well known that optimization problems involving the \(\ell _0\) norm are \({{\mathcal {N}}}{\mathcal {P}}\)-hard [12, 13]. Hence, several classes of algorithms have been proposed over the years to approximately solve cardinality-constrained problems. Examples of effective methods are given by the Lasso [14] and other \(\ell _p\)-reformulation approaches [15, 16].

Particularly useful algorithms designed to deal with cardinality-constrained optimization problems are the greedy sparse simplex method [17] and the class of penalty decomposition (PD) methods [18]. The former is specifically designed for problems of the form (1), while the latter has been devised to deal with cardinality-constrained problems characterized by the presence of further standard constraints. These methods, based on different approaches, possess theoretical convergence properties and are computationally efficient in the solution of cardinality-constrained problems. However, they require the exact solution, at each iteration, of suitable subproblems (of dimension 1 in the case of the greedy sparse simplex method, and of dimension n for PD methods). This may be prohibitive when either the objective function is nonconvex or the finite termination of an algorithm applied to a convex subproblem cannot be guaranteed; the latter issue typically occurs when the convex function is not quadratic. Note that there are several applications of sparse optimization involving nonconvex objective functions (see, e.g., [10]).

The aim of the present work is to tackle cardinality-constrained problems by defining convergent algorithms that do not require computing the exact solution of (possibly nonconvex) subproblems. To this aim, we focus on the approach of the PD methods and we present two contributions:

  (a)

    the definition of a PD algorithm performing inexact minimizations by an Armijo-type line search [19] along gradient-related directions;

  (b)

    the definition of a derivative-free PD method for sparse black-box optimization.

The two algorithms share the penalty decomposition approach, but differ significantly in the inexact minimization steps and in the definition of the inner stopping criterion. We perform a theoretical analysis of the proposed methods, and we state convergence results that are equivalent to those of the original PD methods [18] but, in general, weaker than those of the greedy sparse simplex method [17]. Finally, we remark that, to our knowledge, convergent derivative-free methods for cardinality-constrained problems were not known, and this makes the derivative-free algorithm proposed in the present work particularly attractive. The paper is organized as follows. In Sect. 2, we address optimality conditions for problem (1); we also describe the PD method originally introduced in [18]. In Sect. 3, we propose a modified version of the PD algorithm and we state global convergence results. In Sect. 4, we present a derivative-free PD method for black-box optimization and we prove the global convergence of the proposed method. The results of preliminary computational experiments, limited to a class of convex problems, are reported in Sect. 5 and show the validity of the proposed approach. Finally, Sect. 6 contains some concluding remarks.

2 Background

In this work, we consider the following optimization problem

$$\begin{aligned} \min _{x\in {\mathbb {R}}^n} f(x)\quad \text { s.t. }\quad \Vert x\Vert _0\le s, \end{aligned}$$
(1)

where \(f:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) is a continuously differentiable function, \(\Vert x\Vert _0\) is the \(\ell _0\) norm of x, i.e., the number of its nonzero components, and s is an integer such that \(0< s < n\). Throughout the paper, we make the following assumption.

Assumption 2.1

The function \(f:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\) is continuously differentiable and coercive on \({\mathbb {R}}^n\), i.e., for all sequences \(\{x^k\}\) such that \(x^k\in {\mathbb {R}}^n\) and \(\lim _{k\rightarrow \infty }\Vert x^k\Vert = \infty \) we have \(\lim _{k\rightarrow \infty }f(x^k) = \infty .\)

The above assumption implies that problem (1) admits a solution. Necessary optimality conditions for problem (1) have been stated in [17], where the basic feasible (BF) property has been introduced. We recall this notion hereafter.

Definition 2.1

We say that a point \({\bar{x}}\in {\mathbb {R}}^n\) is a BF-vector, if:

  • when \(\Vert {\bar{x}}\Vert _0 = s\), it holds \(\nabla _i f({\bar{x}}) = 0\) for all i s.t. \({\bar{x}}_i\ne 0\);

  • when \(\Vert {\bar{x}}\Vert _0 < s\), it holds \(\nabla _i f({\bar{x}}) = 0\) for all \(i=1,\ldots ,n\).

It can be easily shown that if \(x^\star \) is an optimal solution of problem (1), then \(x^\star \) is a BF-vector.

Necessary optimality conditions for cardinality-constrained problems with additional nonlinear constraints have been studied in [18]. Such conditions have been used to study the convergence of the PD method proposed in the same work. In the case of problem (1), the aforementioned necessary optimality conditions take a simplified form. In particular, on the basis of the convergence analysis performed in [18], we introduce the following definition.

Definition 2.2

We say that a point \({\bar{x}}\in {\mathbb {R}}^n\) satisfies Lu–Zhang first-order optimality conditions if there exists a set \(I\subseteq \{1,\ldots ,n\}\) such that \(|I|=s\), \({\bar{x}}_i = 0\) for all \(i\in {\bar{I}} = \{1,\ldots ,n\} \setminus I\) and \(\nabla _i f({\bar{x}}) = 0\) for all \(i\in I\).

If \(x^\star \) is an optimal solution, then \(x^\star \) satisfies Lu–Zhang first-order optimality conditions [18]. It can be easily verified that a BF-vector satisfies Lu–Zhang conditions. The converse is not necessarily true, i.e., Lu–Zhang conditions are weaker than the optimality conditions expressed by the BF property. We show this with the following example.

Example 2.1

Consider problem (1), letting

$$ f(x) = (x_1-1)^2 + x_2^2 + (x_3-1)^2 $$

and \(s=2\). The point \({\bar{x}}=[1\;0\;0]\) satisfies Lu–Zhang conditions, but it is not a BF-vector. Indeed, let \(J = \{1,2\}\). We have that \({\bar{x}}_j=0\) for all \(j\in {\bar{J}}\) and \(\nabla _j f({\bar{x}}) = 0\) for all \(j\in J\). Thus, \({\bar{x}}\) satisfies Lu–Zhang conditions. On the other hand, \(\Vert {\bar{x}}\Vert _0<2\) and \(\nabla _3 f({\bar{x}})\ne 0\), hence \({\bar{x}}\) is not a BF-vector.
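The example can also be checked numerically. The following small NumPy script verifies the two conditions for the function of Example 2.1; the helper functions and their names are ours, introduced only for illustration.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x) = (x1 - 1)^2 + x2^2 + (x3 - 1)^2 from Example 2.1.
    return np.array([2 * (x[0] - 1), 2 * x[1], 2 * (x[2] - 1)])

def is_bf_vector(x, grad, s, tol=1e-10):
    # BF property (Definition 2.1): if ||x||_0 = s, the gradient must vanish on the
    # support of x; if ||x||_0 < s, the whole gradient must vanish.
    g = grad(x)
    if np.count_nonzero(x) == s:
        return bool(np.all(np.abs(g[x != 0]) <= tol))
    return bool(np.all(np.abs(g) <= tol))

def satisfies_lu_zhang(x, grad, s, tol=1e-10):
    # Lu-Zhang conditions (Definition 2.2): some index set I with |I| = s must contain
    # the support of x and have vanishing partial derivatives; completing the support
    # with the off-support indices of smallest gradient magnitude is sufficient.
    g, support = grad(x), np.flatnonzero(x)
    if support.size > s:
        return False
    off = np.setdiff1d(np.arange(x.size), support)
    fill = off[np.argsort(np.abs(g[off]))][: s - support.size]
    I = np.concatenate([support, fill]).astype(int)
    return bool(np.all(np.abs(g[I]) <= tol))

x_bar = np.array([1.0, 0.0, 0.0])
print(satisfies_lu_zhang(x_bar, grad_f, s=2))  # True  (e.g., J = {1, 2} in 1-based indexing)
print(is_bf_vector(x_bar, grad_f, s=2))        # False (||x_bar||_0 < 2 but grad_3 f != 0)
```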

2.1 The Projection onto the Feasible Set

Consider the problem of computing the orthogonal projection of a vector \({{\bar{x}}}\) onto the feasible set, i.e., the problem

$$\begin{aligned} \min _{x} \; \Vert x - {\bar{x}} \Vert ^2\quad \text { s.t. } \quad \Vert x\Vert _0\le s. \end{aligned}$$
(2)

Since the feasible set is not convex, the solution of (2) is not necessarily unique. A globally optimal solution can be computed in closed form by taking the s components of \({{\bar{x}}}\) with the largest absolute value [17]. To formally characterize the solution, let us define the index set I(x) of the largest nonzero variables (in absolute value) at a generic point \(x \in {\mathbb {R}}^n\), satisfying the following properties:

$$\begin{aligned} \begin{aligned}&I(x) \in \mathop {\mathrm{arg~max}}\limits _{S\subseteq \{1,\ldots ,n\}} |S|\\&\quad \text {s.t.} \quad |S|\le s, \quad i\in S \Rightarrow x_i\ne 0,\\&\qquad \qquad \!\! |x_i|\ge |x_j|\quad \forall \, i\in S, \forall \, j\notin S. \end{aligned} \end{aligned}$$
(3)

In general, the index set I(x) is not uniquely defined. Also, note that \(I(x) = \{i\in \{1,\ldots ,n\}: x_i\ne 0\}\) if \(\Vert x\Vert _0\le s\).

Then, a solution \(x^\star \) of problem (2) is given by

$$\begin{aligned} x_i^\star = {\bar{x}}_i\text { for } i \in I({\bar{x}}),\qquad x_i^\star = 0\text { for } i \notin {I}({\bar{x}}). \end{aligned}$$
(4)
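In code, the projection (4) amounts to keeping the s entries of \({\bar{x}}\) of largest absolute value and setting the others to zero; a minimal NumPy sketch is reported below (ties between equal magnitudes are broken arbitrarily by the sorting routine, consistently with the possible non-uniqueness of I(x)).

```python
import numpy as np

def project_onto_sparsity_set(x_bar, s):
    """Return a globally optimal solution of problem (2): keep the s components of
    x_bar with the largest absolute value and set the remaining ones to zero."""
    x = np.zeros_like(x_bar)
    keep = np.argsort(np.abs(x_bar))[-s:]   # indices of the s largest |x_bar_i|
    x[keep] = x_bar[keep]
    return x

# Example: with s = 2, only the two largest-magnitude entries survive.
print(project_onto_sparsity_set(np.array([0.3, -2.0, 0.0, 1.5]), s=2))
# [ 0.  -2.   0.   1.5]
```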

2.2 The Penalty Decomposition Method

Applying the classical variable splitting technique [20], Problem (1) can be equivalently expressed as

$$\begin{aligned} \min _{x,y\in {\mathbb {R}}^n}\;\;f(x)\quad \text { s.t. }\quad \Vert y\Vert _0\le s,\quad x=y. \end{aligned}$$
(5)

For simplicity, in the following, we will denote \(Y=\{y \in {\mathbb {R}}^n : \Vert y\Vert _0\le s \}\).

The quadratic penalty function associated with (5) is

$$\begin{aligned} q_\tau (x,y) = f(x) + \frac{\tau }{2}\Vert x-y\Vert ^2, \end{aligned}$$

where \(\tau >0\) is the penalty parameter.

In [18], the penalty decomposition (PD) method (see Algorithm 1) was proposed to solve Problem (5). In particular, the approach is that of approximately solving a sequence of penalty subproblems by a two-block decomposition method. The algorithm starts from a point \((x^0, y^0)\) that is feasible for problem (5). At every iteration, the algorithm applies the block coordinate descent (BCD) method [21, 22] to the two blocks of variables x and y, until an approximate stationary point of the penalty function w.r.t. the x block is attained. Then, the penalty parameter \(\tau _k\) is increased for the next iteration, in which a higher degree of accuracy is required in the approximation of a stationary point.

Note that, as discussed in Sect. 2.1, the y-update step can be performed by computing the closed-form solution of the related subproblem. At the beginning of each iteration, before starting the BCD loop, a test is performed to ensure that the points of the generated sequence belong to a compact level set. This is done in order to guarantee that the sequence generated by the PD method is bounded, so that it admits limit points. In [18] it is proved that each limit point is feasible and satisfies Lu–Zhang conditions.

[Algorithm 1: the Penalty Decomposition (PD) method of [18]; pseudocode not reproduced here]
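For illustration, the overall scheme can be sketched as follows, assuming a user-supplied routine argmin_x that returns a global minimizer of \(q_{\tau }(\cdot ,y)\) (e.g., an iterative solver for the convex case). The level-set safeguard of the original method is omitted and the practical stopping rules of Sect. 5 are used, so this is a simplified sketch rather than a transcription of Algorithm 1.

```python
import numpy as np

def pd_method(f, argmin_x, x0, s, tau0=1.0, theta=1.1, eps_in=1e-4, eps_out=1e-4,
              max_outer=1000):
    """Sketch of the penalty decomposition scheme of [18]: two-block coordinate
    descent on q_tau with exact minimizations, closed-form y-update, and an
    increasing penalty parameter. argmin_x(y, tau) must return a global minimizer
    of q_tau(., y) = f(.) + tau/2 ||. - y||^2."""
    def project(z):                            # y-update (Sect. 2.1)
        keep = np.argsort(np.abs(z))[-s:]
        out = np.zeros_like(z)
        out[keep] = z[keep]
        return out

    def q(x, y, tau):
        return f(x) + 0.5 * tau * np.dot(x - y, x - y)

    x = np.asarray(x0, dtype=float)
    y, tau = project(x), tau0
    for _ in range(max_outer):
        while True:                            # inner two-block (BCD) loop
            x_new = argmin_x(y, tau)           # exact x-minimization of q_tau(., y)
            y_new = project(x_new)             # exact y-minimization (projection)
            decrease = q(x, y, tau) - q(x_new, y_new, tau)
            x, y = x_new, y_new
            if decrease <= eps_in:             # practical inner stopping rule (Sect. 5)
                break
        if np.linalg.norm(x - y) <= eps_out:   # practical outer stopping rule (Sect. 5)
            break
        tau *= theta                           # increase the penalty parameter
    return x, y
```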

3 An Inexact Penalty Decomposition Method

Algorithm 1 has been shown to be effective in practice [18]. However, it requires computing, in the inner iterations of the block decomposition method, the exact solution of a sequence of subproblems in the x variables (see steps 5 and 10). This may be prohibitive when either the objective function is nonconvex or the finite termination of an algorithm applied to a convex subproblem cannot be guaranteed. On the other hand, the convergence analysis performed in [18] strongly relies on the assumption that the global minima of the subproblems in the x variables are determined. In order to overcome this nontrivial issue while preserving global convergence properties, we propose a modified version of the algorithm, suitable even for problems with a nonconvex objective function.

[Algorithm 2: the Inexact Penalty Decomposition method; pseudocode not reproduced here]

The proposed procedure is described in Algorithm 2. The exact minimization with respect to the x variables is replaced by an Armijo-type line search along the steepest descent direction of the penalty function. The line search procedure along a descent direction d is shown in Algorithm 3.

We recall some well-known properties of the Armijo-type line search, which will be used later in the convergence analysis. These results can be found, for instance, in [19].

[Algorithm 3: Armijo-type line search along a descent direction d; pseudocode not reproduced here]

It can be easily seen that the algorithm is well defined, i.e., there exists a finite integer j such that \(\beta ^j\) satisfies the acceptability condition (6).
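For concreteness, a backtracking procedure of this kind can be written as follows; the routine is our illustration of a standard Armijo rule with the parameters of Sect. 5, not a verbatim transcription of Algorithm 3.

```python
def armijo_line_search(phi, dphi0, alpha0=1.0, gamma=1e-5, beta=0.5, max_backtracks=60):
    """Armijo backtracking along a descent direction d (sketch).
    phi(alpha) = q_tau(u + alpha * d, v) and dphi0 = grad_x q_tau(u, v)^T d < 0.
    Returns the first step alpha0 * beta^j satisfying the sufficient decrease
    condition phi(alpha) <= phi(0) + gamma * alpha * dphi0."""
    assert dphi0 < 0.0, "d must be a descent direction"
    phi0, alpha = phi(0.0), alpha0
    for _ in range(max_backtracks):
        if phi(alpha) <= phi0 + gamma * alpha * dphi0:
            return alpha
        alpha *= beta                  # reduce the tentative step
    return alpha                       # safeguard cap; not reached for smooth q_tau

# Usage with the steepest descent direction d = -grad_x q_tau(u, v):
#   g = grad_x_q(u, v)
#   alpha = armijo_line_search(lambda a: q(u - a * g, v), dphi0=-g.dot(g))
```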

Moreover the following result holds.

Proposition 3.1

Let \(g:{\mathbb {R}}^n\times {\mathbb {R}}^n\rightarrow {\mathbb {R}}\) be a continuously differentiable function and \(\{x^t,y^t\}\subseteq {\mathbb {R}}^n\times {\mathbb {R}}^n\). Let \(T\subseteq \{0,1,\ldots \}\) be an infinite subset such that

$$ \lim \limits _{\begin{array}{c} t\rightarrow \infty \\ t\in T \end{array}}(x^t,y^t) = ({{\bar{x}}},{{\bar{y}}}). $$

Let \(\{d^t\}\) be a sequence of directions such that \(\nabla _xg(x^t,y^t)^Td^t<0\) and assume that \(\Vert d^t\Vert \le M\) for some \(M>0\) and for all \(t\in T\). If

$$\begin{aligned} \lim \limits _{\begin{array}{c} t \rightarrow \infty \\ t\in T \end{array}} g(x^t, y^t)-g(x^t+ \alpha _t d^t, y^t )= 0, \end{aligned}$$

then we have

$$\begin{aligned} \lim \limits _{\begin{array}{c} t \rightarrow \infty \\ t\in T \end{array}} \nabla _xg(x^t,y^t)^Td^t = 0. \end{aligned}$$

Remark 3.1

Step 12 of Algorithm 2 can be modified in order to make the algorithm more general. More specifically, the steepest descent direction \(-\nabla _x q_{\tau _k}(u^\ell ,v^\ell )\) could be replaced by any gradient-related direction \(d^\ell \). In this sense, we have the possibility of arbitrarily defining the updated point \(u^{\ell +1}\), provided that \(q_{\tau _k}(u^{\ell +1},v^{\ell }) \le q_{\tau _k}(u^\ell + \alpha _\ell d^\ell , v^\ell )\), where \(\alpha _\ell \) is computed by an Armijo line search along the descent direction \(d^\ell \) that, in particular, may be \(-\nabla _x q_{\tau _k}(u^\ell ,v^\ell )\). It can be easily seen that this modification does not spoil the theoretical analysis we are going to carry out hereafter, while it may bring significant benefits from a computational perspective.

Remark 3.2

As outlined by [18], the stopping condition at line 10 of Algorithm 2 is useful for establishing the convergence properties of the algorithm, but, in practice, different rules could be employed with benefits in terms of efficiency. For example, the progress of the decreasing sequence \(\{q_{\tau _k}(u^\ell ,v^\ell )\}\) might be taken into account. As for the main loop, the whole algorithm can be stopped in practice as soon as \(x^k\) and \(y^k\) are sufficiently close.
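Combining the previous ingredients, one inner loop (a single outer iteration) of the inexact PD method can be sketched as follows: an Armijo step along \(-\nabla _x q_{\tau _k}\) for the x-update, the closed-form projection for the y-update, and the gradient-norm test as inner stopping criterion. This is an illustrative sketch under the simplifications discussed in Remarks 3.1 and 3.2; function and variable names are ours.

```python
import numpy as np

def inexact_pd_inner_loop(f, grad_f, x, y, tau, s, eps, gamma=1e-5, beta=0.5,
                          alpha0=1.0, max_inner=10000):
    """One outer iteration of the inexact PD method (sketch): alternate Armijo steps
    on x and closed-form projections on y until ||grad_x q_tau(x, y)|| <= eps."""
    def grad_x_q(u, v):
        return grad_f(u) + tau * (u - v)

    def q(u, v):
        return f(u) + 0.5 * tau * np.dot(u - v, u - v)

    def project(z):                             # y-update (Sect. 2.1)
        keep = np.argsort(np.abs(z))[-s:]
        out = np.zeros_like(z)
        out[keep] = z[keep]
        return out

    u, v = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    for _ in range(max_inner):
        g = grad_x_q(u, v)
        if np.linalg.norm(g) <= eps:            # inner stopping criterion
            break
        q_uv, alpha = q(u, v), alpha0
        while q(u - alpha * g, v) > q_uv - gamma * alpha * np.dot(g, g):
            alpha *= beta                       # Armijo backtracking (Algorithm 3)
        u = u - alpha * g                       # inexact x-update (steepest descent step)
        v = project(u)                          # exact y-update
    return u, v
```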

We now address the properties of the inexact penalty decomposition method. Let us introduce the level set

$$ {\mathcal {L}}_0(f) = \{x : f(x) \le f(x^0)\}. $$

Note that \({\mathcal {L}}_0(f)\) is compact, since f is continuous and coercive on \({\mathbb {R}}^n\). First, we show that \(q_\tau (x,y)\) is also a coercive function.

Lemma 3.1

The function \(q_\tau (x,y)\) is coercive on \({\mathbb {R}}^n\times {\mathbb {R}}^n\).

Proof

Let us consider any pair of sequences \(\{x^k\}\) and \(\{y^k\}\) such that at least one of the following conditions holds

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert x^k\Vert= & {} \infty , \end{aligned}$$
(7)
$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert y^k\Vert= & {} \infty . \end{aligned}$$
(8)

Assume by contradiction that there exists an infinite subset \(K\subseteq \{0,1,\ldots \}\) such that

$$\begin{aligned} \limsup _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}} q_{\tau }(x^k,y^k)\ne \infty . \end{aligned}$$
(9)

Suppose first that there exists an infinite subset \(K_1\subseteq K\) such that

$$\begin{aligned} \Vert x^k-y^k\Vert \le M, \end{aligned}$$
(10)

for some \(M>0\) and for all \(k\in K_1\). Since at least one of conditions (7), (8) holds, (10) implies that \(\Vert x^k\Vert \rightarrow \infty \) for \(k\rightarrow \infty \), \(k\in K_1\); recalling that f is coercive on \({\mathbb {R}}^n\), we thus have that \(f(x^k)\rightarrow \infty \) for \(k\rightarrow \infty ,k\in K_1\). From (10) we obtain

$$ \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}q_{\tau }(x^k,y^k) = \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}} f(x^k)+\frac{\tau }{2} \Vert x^k-y^k\Vert ^2 =\infty , $$

and this contradicts (9). Then, we must have

$$ \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}}\Vert x^k-y^k\Vert =\infty . $$

As f is coercive and continuous, it admits minimum over \({\mathbb {R}}^n\). Let \(f^\star \) be the minimum value of f. Thus, we have

$$ q_{\tau }(x^k,y^k)\ge f^\star +\frac{\tau }{2} \Vert {x^k-y^k}\Vert ^2, $$

which implies that \(q_{\tau }(x^k,y^k)\rightarrow \infty \) for \(k\rightarrow \infty ,k\in K\).

Then, we can conclude that, for any infinite set K, we have

$$ \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}} q_{\tau }(x^k,y^k)=\infty , $$

and this contradicts (9). \(\square \)

Now, we can prove that Algorithm 2 is well defined, i.e., that the cycle between step 10 and step 14 terminates in a finite number of inner iterations.

Proposition 3.2

Algorithm 2 cannot infinitely cycle between step 10 and step 14, i.e., for each outer iteration \(k\ge 0\), the algorithm determines in a finite number of inner iterations a point \((x^{k+1},y^{k+1})\) such that

$$\begin{aligned} \Vert \nabla _x q_{\tau _k}(x^{k+1},y^{k+1}) \Vert \le \epsilon _k. \end{aligned}$$
(11)

Proof

Suppose by contradiction that, at a certain iteration k, the sequence \(\{ u^\ell , v^\ell \}\) is infinite. From the instructions of the algorithm, we have

$$\begin{aligned} q_{\tau _k}(u^{\ell +1}, v^{\ell +1}) \le q_{\tau _k}(u^0, v^0). \end{aligned}$$

Hence, for all \(\ell \ge 0\), the point \((u^\ell , v^\ell )\) belongs to the level set

$$ {\mathcal {L}}_0(q_{\tau _k})=\{(u,v)\in {\mathbb {R}}^n\times {\mathbb {R}}^n: \ q_{\tau _k}(u, v) \le q_{\tau _k}(u^0, v^0)\}. $$

Lemma 3.1 implies that \({\mathcal {L}}_0(q_{\tau _k})\) is a compact set. Therefore, the sequence \(\{ u^\ell , v^\ell \}\) admits cluster points. Let \(K \subseteq \{0,1,\ldots \}\) be an infinite subset such that

$$\begin{aligned} \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K \end{array}} (u^\ell , v^\ell ) = ({\bar{u}}, {\bar{v}}). \end{aligned}$$

Recalling the continuity of the gradient, we have

$$ \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K \end{array}}\nabla _x q_{\tau _k}(u^\ell ,v^\ell ) = \nabla _x q_{\tau _k}({\bar{u}},{\bar{v}}). $$

We now show that \(\nabla _x q_{\tau _k}({\bar{u}},{\bar{v}})=0\). Setting \(d^\ell =-\nabla _x q_{\tau _k}(u^\ell ,v^\ell )\) and taking into account the instructions of the algorithm we can write

$$\begin{aligned} q_{\tau _k}(u^{\ell +1}, v^{\ell +1}) \le q_{\tau _k}(u^{\ell +1}, v^{\ell }) = q_{\tau _k}(u^{\ell } +\alpha _\ell d^\ell , v^\ell ) < q_{\tau _k}(u^{\ell }, v^{\ell }). \end{aligned}$$
(12)

Recalling again the continuity of the gradient, we have that \(d^\ell \rightarrow -\nabla _x q_{\tau _k}({\bar{u}},{\bar{v}})\) for \(\ell \in K\) and \(\ell \rightarrow \infty \), and hence \(\Vert d^\ell \Vert \le M\) for some \(M>0\) and for all \(\ell \in K\).

The sequence \(\{q_{\tau _k}(u^{\ell }, v^{\ell })\}\) is monotonically decreasing and \(q_{\tau _k}(u,v)\) is continuous; hence, we have that

$$ \lim _{\ell \rightarrow \infty }{q_{\tau _k}(u^{\ell }, v^{\ell }) = q_{\tau _k}({\bar{u}}, {\bar{v}})}. $$

From (12), it follows \(\lim \limits _{\ell \rightarrow \infty } q_{\tau _k}(u^{\ell }, v^{\ell }) - q_{\tau _k}(u^{\ell } + \alpha _\ell d^\ell , v^\ell ) = 0.\) Then, the hypotheses of Proposition 3.1 are satisfied and we can write

$$\begin{aligned} \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K \end{array}} \nabla _x q_{\tau _k}(u^\ell ,v^\ell )^Td^\ell = \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K \end{array}} -\Vert \nabla _x q_{\tau _k}(u^\ell ,v^\ell ) \Vert ^2 = 0, \end{aligned}$$

which implies that, for \(\ell \in K\) sufficiently large, we have \(\Vert \nabla _x q_{\tau _k}(u^\ell ,v^\ell ) \Vert \le \epsilon _k\), i.e., the stopping criterion of step 10 is satisfied in a finite number of iterations, and this contradicts the fact that \(\{u^\ell ,v^\ell \}\) is an infinite sequence. \(\square \)

Before stating the global convergence result, we prove that the sequence generated by the algorithm admits limit points and that every limit point \(({\bar{x}}, {\bar{y}})\) is such that \({\bar{x}}\) is feasible for the original problem (1).

Proposition 3.3

Let \(\{x^k,y^k\}\) be the sequence generated by Algorithm 2. Then, \(\{x^k,y^k\}\) admits cluster points and every cluster point \(({\bar{x}},{\bar{y}})\) is such that \({{\bar{x}}}={\bar{y}}\), and \(\Vert {{\bar{x}}}\Vert _0\le s\).

Proof

Consider a generic iteration k. The instructions of the algorithm imply for all \(\ell \ge 0\)

$$ q_{\tau _k}(u^{\ell +1}, v^{\ell +1})\le q_{\tau _k}(u^{\ell +1}, v^{\ell })= q_{\tau _k}(u^{\ell } - \alpha _{\ell }\nabla _x q_{\tau _k}(u^{\ell }, v^{\ell }) , v^\ell )\le q_{\tau _k}(u^{\ell }, v^{\ell }), $$

and hence we can write

$$\begin{aligned} q_{\tau _k}(x^{k+1}, y^{k+1})\le q_{\tau _k}(u^0 - \alpha _0\nabla _x q_{\tau _k}(u^0, v^0) , v^0). \end{aligned}$$
(13)

From the definition of \((u^0, v^0)\), we either have \((u^0, v^0) = (x^k,y^k)\) or \((u^0, v^0) = (x^0,y^0)\). In the former case, we have, by the definition of \(x_{\text {trial}}\), that

$$ q_{\tau _k}(u^0 - \alpha _0\nabla _x q_{\tau _k}(u^0, v^0) , v^0) = q_{\tau _k}(x_\text {trial} , y^k) \le f(x^0), $$

where the last inequality holds, as in this case the condition at line 6 is satisfied. In the latter case, we have

$$\begin{aligned} q_{\tau _k}(u^0 - \alpha _0\nabla _x q_{\tau _k}(u^0, v^0) , v^0)&\le q_{\tau _k}(u^0, v^0) = q_{\tau _k}(x^0, y^0)\\&= f(x^0) + \frac{\tau _k}{2} \Vert x^0-y^0\Vert ^2 = f(x^0). \end{aligned}$$

Then, in both cases from (13) it follows

$$\begin{aligned} q_{\tau _k}(x^{k+1}, y^{k+1})\le f(x^0). \end{aligned}$$
(14)

We also have

$$\begin{aligned} f(x^{k+1})\le q_{\tau _k}(x^{k+1}, y^{k+1})=f(x^{k+1})+{{\tau _k}\over {2}}\Vert x^{k+1}-y^{k+1}\Vert ^2\le f(x^0), \end{aligned}$$
(15)

and hence we can conclude that for all \(k\ge 0\) we have \(f(x^{k+1})\le f(x^0)\). Therefore, the points of the sequence \(\{x^k\}\) belong to the compact set \({\mathcal {L}}_0(f)\), and this implies that \(\{x^k\}\) is a bounded sequence and that, for all \(k\ge 0\), \(f(x^k)\ge f^\star >-\infty \), \(f^\star \) being the minimum value of f over \({\mathbb {R}}^n\).

From (15), rearranging and dividing by \(\tau _k/2\), we get

$$\begin{aligned} \Vert x^{k+1}-y^{k+1}\Vert ^2 \le 2\frac{f(x^0)-f(x^{k+1})}{\tau _k}\le 2\frac{f(x^0)-f^\star }{\tau _k}. \end{aligned}$$

Taking limits for \(k\rightarrow \infty \), recalling that \(\tau _k\rightarrow \infty \) for \(k\rightarrow \infty \), we obtain

$$\begin{aligned} \lim _{k\rightarrow \infty }\Vert x^{k+1}-y^{k+1}\Vert = 0. \end{aligned}$$
(16)

Therefore, since \(\{x^k\}\) is a bounded sequence, from (16), it follows that \(\{y^k\}\) is bounded, and hence the sequence \(\{(x^k,y^k)\}\) admits cluster points. Let \(({\bar{x}},{\bar{y}})\) be any cluster point of \(\{(x^k,y^k)\}\), i.e., there exists an infinite subset \(K\subseteq \{0,1,\ldots \}\) such that

$$ \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}} (x^{k},y^{k}) = ({\bar{x}},{\bar{y}}). $$

Again from (16) it follows \({{\bar{x}}}={\bar{y}}\).

Finally, as \(\Vert y^k\Vert _0\le s\) for all k, recalling the lower semicontinuity of the \(\ell _0\) norm \(\Vert \cdot \Vert _0\), we can conclude that \(\Vert {\bar{x}}\Vert _0=\Vert {\bar{y}}\Vert _0\le s\). \(\square \)

We are ready to state the global convergence result.

Theorem 3.1

Let \(\{x^k,y^k\}\) be the sequence generated by Algorithm 2. Then, \(\{x^k,y^k\}\) admits cluster points and every cluster point \(({\bar{x}},{\bar{y}})\) is such that \({\bar{x}}\) satisfies the Lu–Zhang conditions for problem (1).

Proof

Proposition 3.3 implies that the sequence \(\{x^k,y^k\}\) admits cluster points. Let \(K\subseteq \{0,1,\ldots \}\) be an infinite subset such that

$$ \lim \limits _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}}(x^{k+1},y^{k+1}) = ({\bar{x}},{\bar{y}}). $$

From Proposition 3.3, it follows \({\bar{x}}={\bar{y}}\) and

$$\begin{aligned} \Vert {{\bar{x}}}\Vert _0\le s. \end{aligned}$$
(17)

Using (11) of Proposition 3.2, for all \(k\ge 0\), we have

$$\begin{aligned} \Vert \nabla f(x^{k+1})+\tau _k(x^{k+1}-y^{k+1})\Vert \le \varepsilon _k, \end{aligned}$$

so that, taking limits for \(k\in K\) and \(k\rightarrow \infty \), as \(\epsilon _k\rightarrow 0\), we can write

$$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}}\Vert \nabla f(x^{k+1})+\tau _k(x^{k+1}-y^{k+1})\Vert = 0. \end{aligned}$$
(18)

From the instructions of the algorithm, we have \(y^{k+1} \in \mathop {\mathrm{arg~min}}\limits _{y\in Y} q_{\tau _k} (x^{k+1},y)\), i.e., \(y^{k+1}\) is a solution of the problem

$$\begin{aligned} \min _{y} \; \Vert y - x^{k+1} \Vert ^2\quad \text { s.t. }\quad \Vert y\Vert _0\le s. \end{aligned}$$

From (4) it follows

$$\begin{aligned} y_i^{k+1} = x_i^{k+1}\quad \text {for } i \in I(x^{k+1}),\qquad y_i^{k+1} = 0\quad \text {for } i \notin I(x^{k+1}), \end{aligned}$$

where we recall that the index set \(I(x^{k+1})\) contains at most s elements, namely those corresponding to the nonzero components of \(x^{k+1}\) with the largest absolute value.

Note that \(|I(x^{k+1})|<s\) implies \(\Vert x^{k+1}\Vert _0<s\) and hence \(y^{k+1}=x^{k+1}\). Therefore, we can write

$$\begin{aligned} -\tau _k(x_i^{k+1}-y_i^{k+1}) = 0 \left\{ \begin{array}{ll} \forall \,i\in I(x^{k+1}),&{}\text {if } |I(x^{k+1})| = s,\\ \forall \,i\in \{1,\ldots ,n\},&{}\text {if } |I(x^{k+1})| < s. \end{array}\right. \end{aligned}$$
(19)

The index set \(I(x^{k+1})\) is a subset of the finite set \(\{1,\ldots ,n\}\); therefore, since the number of possible index sets is finite, there exists an infinite subset \(K_1\subseteq K\) such that \(I(x^{k+1}) = I\) for all \(k\in K_1\).

Let \(I^\star = I({\bar{x}})\). We show that \(I^\star \subseteq I\). Indeed, assume by contradiction that there exists \(i\in I^\star \) such that \(i\notin I\). Hence, \({\bar{y}}_i = {\bar{x}}_i \ne 0\), while \(y_i^{k+1} = 0\) for all \(k \in K_1\). This is a contradiction, since \(y^{k+1} \rightarrow {\bar{y}}\) for \(k \rightarrow \infty , k \in K_1\).

Therefore, we have the following possible cases:

$$ \text {(i) }|I|=s,\;I=I^\star ;\qquad \text {(ii) }|I|<s;\qquad \text {(iii) }|I|=s,\;I\supset I^\star . $$

We now prove each case separately:

  (i)

    Let \(i\in I=I^\star \); from (18) we have

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}} \nabla _i f(x^{k+1}) + \tau _k (x^{k+1}_i-y^{k+1}_i)= 0, \end{aligned}$$

    and, using the first condition of (19), it follows \(\tau _k (x_i^{k+1}-y_i^{k+1}) = 0\) for all \(k\in K_1. \) Therefore, recalling the continuity of the gradient, we can write

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) = \nabla _i f({\bar{x}}) = 0 \quad \forall \,i\in I^\star , \end{aligned}$$

    i.e., Lu–Zhang conditions hold with the set \(I=I^\star \).

  (ii)

    Let \(i\in \{1,\ldots ,n\}\); similarly to the previous case, we have that

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) + \tau _k (x^{k+1}_i-y^{k+1}_i)= 0, \end{aligned}$$

    and using the second condition of (19) it follows \(\tau _k (x_i^{k+1}-y_i^{k+1}) = 0\) for all \(k\in K_1.\) Therefore, we obtain

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) = \nabla _i f({\bar{x}}) = 0 \quad \forall \,i\in \{1,\ldots ,n\}, \end{aligned}$$

    i.e., Lu–Zhang conditions hold taking any subset of \(\{1,\ldots ,n\}\) of cardinality s that contains \(I^*\).

  (iii)

    Let \(i\in I\). By the same reasoning as in case (i), we can write

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) = \nabla _i f({\bar{x}}) = 0 \quad \forall \,i\in I, \end{aligned}$$

    i.e., Lu–Zhang conditions hold with the set I.

Putting everything together, we have from (i), (ii) and (iii) that Lu–Zhang conditions are always satisfied. \(\square \)

As we can see, the proposed inexact version of the algorithm enjoys the same convergence properties as the original, exact one. We also provide, in the following remark, a better characterization of the algorithm, showing that the limit points are often BF-vectors.

Remark 3.3

We note that, in both cases (i) and (ii), \({\bar{x}}\) satisfies the BF optimality conditions of Definition 2.1. Moreover, note also that:

  • If there exists a subsequence \({\hat{K}}\subseteq K\) s.t. \(\Vert x^{k+1}\Vert _0 = \Vert {\bar{x}}\Vert _0\) for all \(k\in {\hat{K}}\), the only possible cases are cases (i) and (ii). Indeed, let us consider a further subsequence \(K_2\subseteq {\hat{K}}\), such that \(I(x^{k+1}) = I\) for every \(k\in K_2\), for some \(I\subset \{1,\ldots ,n\}\). We know that \(K_2\) exists and that \(I \supseteq I^\star \). Since \(\Vert x^{k+1}\Vert _0 = \Vert {\bar{x}}\Vert _0\le s\) for every \(k\in K_2\), I and \(I^\star \) are the index sets of the nonzero variables of \(x^{k+1}\) and \({\bar{x}}\), respectively, and thus have the same cardinality. Therefore, it cannot be \(I\supset I^\star \). It follows that \(I=I^\star \), so we fall into either case (i) or case (ii), and thus \({\bar{x}}\) satisfies BF conditions.

  • If there exists a subsequence \({\hat{K}}\subseteq K\) such that \(\Vert x^{k+1}\Vert _0<s\) for all \(k\in {\hat{K}}\), we can again define \(K_2\subseteq {\hat{K}}\) such that \(I(x^{k+1}) = I\) for every \(k\in K_2\), for some \(I\subset \{1,\ldots ,n\}\). In this case, we have \(|I| = \Vert x^{k+1}\Vert _0<s\) and case (ii) applies. It follows that \({\bar{x}}\) is a BF-vector.

4 A Derivative-Free Penalty Decomposition Method

First-order information about the objective function is fundamental for the PD methods we have considered thus far. However, there are applications where the objective function is obtained by direct measurements or it is the result of a complex system of calculations, so that its analytical expression is not available and the computation of its values may be affected by the presence of noise. Hence, in these cases the gradient cannot be explicitly calculated or approximated.

Such lack of information has an impact on the applicability of Algorithm 2. In particular, the x update step and the inner loop stopping criterion can no longer be employed as they are.

In this section, we propose a derivative-free modification of Algorithm 2 that, similarly to [23,24,25], updates x by line search steps along the coordinate axes and employs a stopping criterion based on the length of such steps.

The derivative-free PD method is described by Algorithm 4. At the x update step, we employ as search directions the coordinate directions and their opposites. A tentative step length \({\tilde{\alpha }}_i\) is associated with each of these directions. At every iteration, all search directions are considered one at a time; a derivative-free line search is performed along each direction, according to Algorithm 5. If the tentative step size does not provide a sufficient decrease, it is reduced for the next iteration. If, on the other hand, the tentative step size provides a sufficient decrease, an extrapolation procedure is carried out; the tentative step size for that same direction at the successive iteration will be the longest one tried in the extrapolation phase that provides a sufficient decrease. The same step length is also used to move along the considered direction, provided it is at least \(\varepsilon _k\); otherwise, no movement is made along that direction. The inner loop then stops when all tentative step sizes have become smaller than \(\varepsilon _k\). A code sketch of these steps is reported after Algorithms 4 and 5 below.

[Algorithm 4: the Derivative-Free Penalty Decomposition (DFPD) method; pseudocode not reproduced here]
[Algorithm 5: derivative-free line search with extrapolation (LineSearch); pseudocode not reproduced here]
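A minimal sketch of the derivative-free line search and of one sweep of the x-update is reported below. It is our illustration of the scheme described above (the names and the minor implementation choices are ours), not a verbatim transcription of Algorithms 4 and 5; q is the penalty function \(q_{\tau _k}\) for the current value of the penalty parameter.

```python
import numpy as np

def df_line_search(phi, alpha_init, gamma=1e-5, sigma=2.0):
    """Derivative-free line search with extrapolation (sketch of Algorithm 5).
    phi(alpha) = q_tau(u + alpha * d, v), with ||d|| = 1. Returns the accepted
    step (0 if the tentative step alpha_init gives no sufficient decrease)."""
    phi0 = phi(0.0)
    if phi(alpha_init) > phi0 - gamma * alpha_init ** 2:
        return 0.0                                   # no sufficient decrease
    alpha = alpha_init
    while phi(sigma * alpha) <= phi0 - gamma * (sigma * alpha) ** 2:
        alpha *= sigma                               # expansion (extrapolation) step
    return alpha

def df_x_update(q, u, v, alpha_tilde, eps, delta=0.5, gamma=1e-5, sigma=2.0):
    """One sweep of the derivative-free x-update of Algorithm 4 (sketch).
    Search directions are +/- e_i; alpha_tilde holds the 2n tentative step sizes
    and is updated in place."""
    n = u.size
    directions = np.vstack([np.eye(n), -np.eye(n)])
    u = u.copy()
    for i, d in enumerate(directions):
        alpha = df_line_search(lambda a: q(u + a * d, v), alpha_tilde[i], gamma, sigma)
        if alpha > 0.0:
            alpha_tilde[i] = alpha                   # success: keep the extrapolated step
            if alpha >= eps:
                u = u + alpha * d                    # move only if the step is at least eps_k
        else:
            alpha_tilde[i] = delta * alpha_tilde[i]  # failure: contract the tentative step
    stop = bool(np.max(alpha_tilde) <= eps)          # inner stopping test of Algorithm 4
    return u, alpha_tilde, stop
```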

Hereafter, we show that Algorithm 4 enjoys the same convergence properties as Algorithm 2. First, we prove that the line search procedure does not loop infinitely inside our procedure.

Proposition 4.1

Algorithm 5 cannot infinitely cycle between steps 5 and 8.

Proof

Assume by contradiction that Algorithm 5 does not terminate. Then, for \(j=0,1,\ldots \), we have \(f(x+\sigma ^j\alpha _0d) \le f(x) - \gamma \sigma ^{2j}\alpha _0^2\Vert d\Vert ^2.\) Taking limits for \(j\rightarrow \infty \), since \(\sigma >1\), we obtain that \(f(x+\sigma ^j\alpha _0d)\rightarrow -\infty \), and this contradicts the fact that f is bounded below, f being continuous and coercive. \(\square \)

Note that, as shown in Lemma 3.1, \(q_{\tau _k}\) is coercive on \({\mathbb {R}}^n\times {\mathbb {R}}^n\), so that the argument of Proposition 4.1 also applies to the line searches performed on \(q_{\tau _k}\). We now prove that Algorithm 4 is well defined, i.e., that the inner loop terminates in a finite number of iterations.

Proposition 4.2

Algorithm 4 cannot infinitely cycle between steps 15 and 29.

Proof

Assume by contradiction that the algorithm loops infinitely. Then, for every \(\ell =0,1,\ldots \), there exists \(i\in \{1,\ldots ,2n\}\) such that \({\tilde{\alpha }}_i^\ell >\varepsilon _k\), i.e.,

$$\begin{aligned} \max _{i=1,\ldots ,2n}\{{\tilde{\alpha }}^\ell _i\} > \varepsilon _k. \end{aligned}$$
(20)

The instructions of the algorithm imply

$$ q_{\tau _k}(u^{\ell +1}, v^{\ell +1})\le q_{\tau _k}(u^{\ell +1}, v^\ell )\le q_{\tau _k}(u^{\ell }(i), v^\ell )\le q_{\tau _k}(u^{\ell }(i-1), v^\ell ) \le q_{\tau _k}(u^\ell ,v^\ell ). $$

Then, the decreasing sequence \(\{q_{\tau _k}(u^\ell ,v^\ell )\}\) tends to a finite value, since \(q_{\tau _k}\) is continuous and coercive, and hence bounded below. For any \(i \in \{1,\dots ,2n\}\), we can split the sequence of iterations \(\{0,1,\ldots \}\) into two subsequences \(K_1\) and \(K_2\) such that \(K_1\cup K_2=\{0,1,\ldots \}\), \(K_1\cap K_2=\emptyset \). In particular, we denote by:

  • \(K_1\) the set of iterations where \({\tilde{\alpha }}_i^{\ell +1} = \alpha _i^\ell = {\tilde{\alpha }}_i^\ell \sigma ^p>0\) for some \(p\ge 0\), \(p\in {\mathbb {N}}\);

  • \(K_2\) the set of iterations where \({\tilde{\alpha }}_i^{\ell +1} = \delta {\tilde{\alpha }}_i^\ell \) and \(\alpha _i^\ell =0\).

Note that \(K_1\) and \(K_2\) cannot both be finite. Then, we analyze the following two cases, \(K_1\) infinite (Case I) and \(K_2\) infinite (Case II).

Case (I). We have

$$\begin{aligned} q_{\tau _k}(u^{\ell +1}, v^{\ell +1})&\le q_{\tau _k}(u^{\ell +1}, v^{\ell })\le q_{\tau _k}(u^{\ell }(i), v^{\ell })\le q_{\tau _k}(u^{\ell }(i-1), v^{\ell }) - \gamma ({\tilde{\alpha }}_i^\ell \sigma ^p)^2\\&\le q_{\tau _k}(u^\ell (0), v^\ell ) - \gamma ({\tilde{\alpha }}_i^\ell )^2= q_{\tau _k}(u^\ell , v^\ell ) - \gamma ({\tilde{\alpha }}_i^\ell )^2. \end{aligned}$$

Taking limits for \(\ell \in K_{1}\), \(\ell \rightarrow \infty \), recalling that \(\{q_{\tau _k}(u^\ell ,v^\ell )\}\) tends to a finite limit, we get

$$\begin{aligned} \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K_1 \end{array}}{\tilde{\alpha }}_i^\ell = 0, \end{aligned}$$
(21)

and hence, for \(\ell \in K_1\) sufficiently large, we have \({\tilde{\alpha }}_i^\ell \le \varepsilon _k\).

Case (II). For every \(\ell \in K_2\), let \(m_\ell \) be the largest index such that \(m_\ell \in K_1\) and \(m_\ell <\ell \) (\(m_\ell \) is the index of the last iteration in \(K_1\) preceding \(\ell \)). We can set \(m_\ell =0\) if such an index does not exist, that is, if \(K_1\) is empty. Then, we can write \({\tilde{\alpha }}_i^\ell =\delta ^{\ell -m_\ell }\alpha _i^{m_\ell }.\) As \(\ell \in K_2\) and \(\ell \rightarrow \infty \), either \(m_\ell \rightarrow \infty \) (if \(K_1\) is an infinite subset) or \(\ell -m_\ell \rightarrow \infty \) (if \(K_1\) is finite). Therefore, (21) and the fact that \(\delta \in (0,1)\) imply

$$ \lim _{\begin{array}{c} \ell \rightarrow \infty \\ \ell \in K_2 \end{array}}{\tilde{\alpha }}_i^\ell =0. $$

Thus, for \(\ell \in K_2\) sufficiently large, we have \({\tilde{\alpha }}_i^\ell \le \varepsilon _k\).

We can conclude that \(\lim _{\ell \rightarrow \infty }{\tilde{\alpha }}_i^\ell =0\), so that, recalling that i is arbitrary, we get \(\max _{i=1,\ldots ,2n}\{{\tilde{\alpha }}^\ell _i\} \le \varepsilon _k\) for \(\ell \) sufficiently large, and this contradicts (20). \(\square \)

Next, we prove a technical result used later.

Proposition 4.3

Assume that the initial step sizes \({\tilde{\alpha }}^0_i\), with \(i=1,\ldots ,2n\), are such that \({\tilde{\alpha }}^0_i > \varepsilon _k\) for all k. Then, for every k and for every \(i=1,\dots ,2n,\) there exists \(\rho _i^k \in \,]0, c{\varepsilon _k}[\,\) such that

$$\begin{aligned} \nabla _x q_{\tau _k}(x^{k+1} + \rho _i^k d_i, y^{k+1})^T d_i > -c\varepsilon _k, \end{aligned}$$

with \(c = \max \{\sigma ,1/\delta \}\).

Proof

Given any iteration k, let \(\ell \) be the index of the last inner iteration. By definition of \(\ell \), we must have that \({\tilde{\alpha }}_i^{\ell +1} \le \varepsilon _k\) for all \(i=1,\ldots ,2n\). From the instructions of the algorithm, this implies that we have \(u^{\ell +1} = u^\ell (2n) = \ldots = u^\ell (0) = u^\ell \), and consequently \(v^{\ell +1} = v^\ell \). Consider any \(i\in \{1,\ldots ,2n\}\). We have two cases:

  1.

    \({\tilde{\alpha }}^{\ell +1}_i = \delta {\tilde{\alpha }}_i^\ell \); in this case, \({\tilde{\alpha }}_i^\ell \) did not satisfy the sufficient decrease condition in the LineSearch procedure, i.e.,

    $$\begin{aligned} \begin{aligned} q_{\tau _k}(u^\ell + {\tilde{\alpha }}_i^\ell d_i, v^\ell ) - q_{\tau _k}(u^\ell , v^\ell )>- \gamma ({\tilde{\alpha }}_i^\ell )^2. \end{aligned} \end{aligned}$$
    (22)

    Using the mean value theorem, we can write

    $$\begin{aligned} \begin{aligned} q_{\tau _k}(u^\ell + {\tilde{\alpha }}_i^\ell d_i, v^\ell ) - q_{\tau _k}(u^\ell , v^\ell ) = {\tilde{\alpha }}_i^\ell \nabla _x q_{\tau _k}(u^\ell + \rho _i^\ell d_i, v^\ell )^T d_i, \end{aligned} \end{aligned}$$
    (23)

    where \(\rho _i^\ell \in \,]0, {\tilde{\alpha }}_i^\ell [\,\). From (22) and (23), it follows:

    $$\begin{aligned} \begin{aligned} \nabla _x q_{\tau _k}(u^\ell + \rho _i^\ell d_i, v^\ell )^T d_i > -\gamma {\tilde{\alpha }}_i^\ell = -\frac{\gamma }{\delta }{\tilde{\alpha }}^{\ell +1}_i\ge -\frac{\gamma }{\delta }\varepsilon _k. \end{aligned} \end{aligned}$$

    Observe that \({\tilde{\alpha }}_i^\ell \le \varepsilon _k/\delta \) and hence \(\rho _i^\ell \in \,]0,\varepsilon _k/\delta [\,\).

  2.

    \({\tilde{\alpha }}^{\ell +1}_i = \alpha ^\ell _i\); from the instructions of the LineSearch procedure, we get

    $$\begin{aligned} \begin{aligned} q_{\tau _k}(u^\ell + \sigma \alpha _i^\ell d_i, v^\ell ) - q_{\tau _k}(u^\ell , v^\ell ) > - \gamma (\sigma \alpha _i^\ell )^2. \end{aligned} \end{aligned}$$
    (24)

    Using the mean value theorem, we can write

    $$\begin{aligned} \begin{aligned} q_{\tau _k}(u^\ell + \sigma \alpha _i^\ell d_i, v^\ell ) - q_{\tau _k}(u^\ell , v^\ell ) = \sigma \alpha _i^\ell \nabla _x q_{\tau _k}(u^\ell + \rho _i^\ell d_i, v^\ell )^T d_i, \end{aligned} \end{aligned}$$
    (25)

    where \(\rho _i^\ell \in \,]0, \sigma \alpha _i^\ell [\,\). From (24) and (25), it follows

    $$\begin{aligned} \begin{aligned} \nabla _x q_{\tau _k}(u^\ell + \rho _i^\ell d_i, v^\ell )^T d_i > -\gamma \sigma \alpha _i^\ell = -\gamma \sigma {\tilde{\alpha }}^{\ell +1}_i\ge -\gamma \sigma \varepsilon _k. \end{aligned} \end{aligned}$$

    Observe that \(\sigma \alpha _i^\ell = \sigma {\tilde{\alpha }}_i^{\ell +1} \le \sigma \varepsilon _k\) and hence \(\rho _i^\ell \in \,]0,\sigma \varepsilon _k[\,\).

Thus, recalling that \(\gamma \in \,]0,1[\,\), in both cases we can write

$$\begin{aligned} \nabla _x q_{\tau _k}(u^\ell + \rho _i^\ell d_i, v^\ell )^T d_i > -c\varepsilon _k, \end{aligned}$$
(26)

for some \(\rho _i^\ell \in \,]0,c\varepsilon _k[\,\) and \(c = \max \{\sigma ,1/\delta \}\).

Since \({\tilde{\alpha }}_i^{\ell +1} \le \varepsilon _k\) for all \(i=1,\ldots ,2n\), from the instructions of the algorithm, we have \(u^{\ell +1} = u^\ell \) and consequently \(v^{\ell +1} = v^\ell \). Hence, equation (26) holds with \(u^\ell =x^{k+1}\), and \(v^\ell =y^{k+1}\). \(\square \)

Now, we prove that the sequence generated by the algorithm admits limit points and that every limit point is feasible for the original problem.

Proposition 4.4

Let \(\{x^k,y^k\}\) be the sequence generated by Algorithm 4. Then, \(\{x^k,y^k\}\) admits cluster points and every cluster point \(({\bar{x}},{\bar{y}})\) is such that \({{\bar{x}}}={\bar{y}}\), and \(\Vert {\bar{x}}\Vert _0\le s\).

Proof

Consider a generic iteration k. The instructions of the algorithm imply, for all \(\ell \ge 0\),

$$ q_{\tau _k}(u^{\ell +1}, v^{\ell +1})\le q_{\tau _k}(u^{\ell +1}, v^{\ell })\le q_{\tau _k}(u^{\ell }, v^{\ell }), $$

and hence, \((x^{k+1}, y^{k+1})\) being the last point generated by the inner loop, \(q_{\tau _k}(x^{k+1}, y^{k+1})\le q_{\tau _k}(u^{1}, v^{0})\).

From the definition of \((u^0, v^0)\), we either have \((u^0, v^0) = (x^k,y^k)\) or \((u^0, v^0) = (x^0,y^0)\). In the former case, for some \(i\in \{1,\ldots ,2n\}\) we have, by the definition of \(x_{\text {trial}}\), that

$$ q_{\tau _k}(u^1,v^0)\le q_{\tau _k}(u^0 + {\hat{\alpha }}_i d_i,v^0) = q_{\tau _k}(x_\text {trial} , y^k) \le f(x^0). $$

In the latter case, we have

$$ q_{\tau _k}(u^0, v^0)= q_{\tau _k}(x^0, y^0) = f(x^0) + \frac{\tau _k}{2} \Vert x^0-y^0\Vert ^2 = f(x^0). $$

Then, in both cases it follows

$$\begin{aligned} q_{\tau _k}(x^{k+1}, y^{k+1})\le f(x^0). \end{aligned}$$
(27)

The rest of the proof follows the same reasonings used in the proof of Proposition 3.3, starting from the condition corresponding to (27), i.e., condition (14). \(\square \)

Theorem 4.1

Let \(\{x^k,y^k\}\) be the sequence generated by Algorithm 4. Then, \(\{x^k,y^k\}\) admits cluster points and every cluster point \(({\bar{x}},{\bar{y}})\) is such that \({\bar{x}}\) satisfies the Lu–Zhang conditions for problem (1).

Proof

Proposition 4.4 implies that the sequence \(\{x^k,y^k\}\) admits cluster points. Let \(K\subseteq \{0,1,\ldots \}\) be an infinite subset such that

$$ \lim \limits _{\begin{array}{c} k\rightarrow \infty \\ k\in K \end{array}}(x^{k+1},y^{k+1}) = ({\bar{x}},{\bar{y}}). $$

From Proposition 4.4, it follows \({\bar{x}}={\bar{y}}\) and \(\Vert {{\bar{x}}}\Vert _0\le s\). From the instructions of the algorithm, we have \(y^{k+1} \in \mathop {\mathrm{arg~min}}\limits _{y\in Y} q_{\tau _k} (x^{k+1},y),\) i.e., \(y^{k+1}\) is a solution of the problem

$$\begin{aligned} \min _{y} \; \Vert y - x^{k+1} \Vert ^2\quad \text { s.t. }\quad \Vert y\Vert _0\le s. \end{aligned}$$

From (4) it follows

$$\begin{aligned} y_i^{k+1} = x_i^{k+1}\quad \text {for } i \in I(x^{k+1}),\qquad y_i^{k+1} = 0\quad \text {for } i \notin I(x^{k+1}), \end{aligned}$$

where we recall that the index set \(I(x^{k+1})\) contains at most s elements, namely those corresponding to the nonzero components of \(x^{k+1}\) with the largest absolute value.

Note that \(|I(x^{k+1})|<s\) implies \(\Vert x^{k+1}\Vert _0<s\) and hence \(y^{k+1}=x^{k+1}\). Therefore, we can write

$$\begin{aligned} -\tau _k(x_i^{k+1}-y_i^{k+1}) = 0 \left\{ \begin{array}{ll} \forall \,i\in I(x^{k+1}),&{}\text {if } |I(x^{k+1})| = s,\\ \forall \,i\in \{1,\ldots ,n\},&{}\text {if } |I(x^{k+1})| < s. \end{array}\right. \end{aligned}$$
(28)

The index set \(I(x^{k+1})\) is a subset of the finite set \(\{1,\ldots ,n\}\); therefore, since the number of possible index sets is finite, there exists an infinite subset \(K_1\subseteq K\) such that \(I(x^{k+1}) = I\) for all \(k\in K_1\).

Let \(I^\star = I({\bar{x}})\). We have already shown in the proof of Theorem 3.1 that \(I^\star \subseteq I\). We consider the following possible cases:

$$ \text {(i) }|I|=s,\;I=I^\star ;\qquad \text {(ii) }|I|<s;\qquad \text {(iii) }|I|=s,\;I\supset I^\star . $$

We now prove each case separately:

  (i)

    Let \(i\in I=I^\star \); using the first condition of (28), we get \(\tau _k (x_i^{k+1}-y_i^{k+1}) = 0\) for all \(k\in K_1.\) From Proposition 4.3, recalling that

    $$ {\mathcal {D}}=\{d_1,\ldots ,d_{2n}\}=\{e_1,\ldots e_n,-e_1,\ldots ,-e_n\}, $$

    we have that

    $$\begin{aligned} \begin{aligned} \nabla f(x^{k+1}+ \rho _i^k e_i)^Te_i&=\nabla _x q_{\tau _k}(x^{k+1} + \rho _i^k e_i, y^{k+1})^T e_i> -c\varepsilon _k,\\ -\nabla f(x^{k+1}- \rho _{i+n}^k e_i)^Te_i&=-\nabla _x q_{\tau _k}(x^{k+1} - \rho _{i+n}^k e_i, y^{k+1})^T e_i > -c\varepsilon _k, \end{aligned} \end{aligned}$$

    with \(c = \max \{\sigma ,1/\delta \}\). Taking limits for \(k\rightarrow \infty ,k\in K_1\), recalling that \(\varepsilon _k\rightarrow 0\), \(\rho _i^k,\rho _{i+n}^k\in \,]0,c\varepsilon _k[\,\) and the continuity of the gradient, we get

    $$\begin{aligned} \lim _{k\in K_1,k\rightarrow \infty } \nabla f(x^{k+1}+ \rho _i^k e_i)^Te_i&= \nabla _if({{\bar{x}}})\ge 0,\\ \lim _{k\in K_1,k\rightarrow \infty } -\nabla f(x^{k+1}- \rho _{i+n}^k e_i)^Te_i&=- \nabla _if({{\bar{x}}})\ge 0, \end{aligned}$$

    from which it follows that \(\nabla _i f({\bar{x}}) = 0\) for all \(i\in I^\star \), i.e., Lu–Zhang conditions hold with the set \(I=I^\star \).

  (ii)

    Let \(i\in \{1,\ldots ,n\}\); the second condition of (28) implies \(\tau _k (x_i^{k+1}-y_i^{k+1}) = 0\) for all \(k\in K_1.\) Similarly to the previous case, we can write

    $$\begin{aligned} \begin{aligned} \nabla f(x^{k+1}+ \rho _i^k e_i)^Te_i&=\nabla _x q_{\tau _k}(x^{k+1} + \rho _i^k e_i, y^{k+1})^T e_i> -c\varepsilon _k,\\ -\nabla f(x^{k+1}- \rho _{i+n}^k e_i)^Te_i&=-\nabla _x q_{\tau _k}(x^{k+1} - \rho _{i+n}^k e_i, y^{k+1})^T e_i > -c\varepsilon _k, \end{aligned} \end{aligned}$$

    with \(c = \max \{\sigma ,1/\delta \}\), and we can prove

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) = \nabla _i f({\bar{x}}) = 0 \quad \forall \,i\in \{1,\ldots ,n\}, \end{aligned}$$

    i.e., Lu–Zhang conditions hold taking any subset of \(\{1,\ldots ,n\}\) of cardinality s that contains \(I^*\).

  (iii)

    Let \(i\in I\). By the same reasoning as in case (i), we can write

    $$\begin{aligned} \lim _{\begin{array}{c} k\rightarrow \infty \\ k\in K_1 \end{array}}\nabla _i f(x^{k+1}) = \nabla _i f({\bar{x}}) = 0 \quad \forall \,i\in I, \end{aligned}$$

    i.e., Lu–Zhang conditions hold with the set I.

Putting everything together, we have, from (i), (ii) and (iii), that Lu–Zhang conditions are always satisfied. \(\square \)

Remark 4.1

As in Remark 3.3, if there exists a subsequence \({\hat{K}}\subset K\) s.t. \(\Vert x^{k}\Vert _0 = \Vert {\bar{x}}\Vert _0\) for all \(k\in {\hat{K}}\) or \(\Vert x^{k}\Vert _0<s\) for all \(k\in {\hat{K}}\), \({\bar{x}}\) is a BF-vector.

5 Preliminary Computational Experiments

In this section, we show the results of preliminary computational experiments, performed to assess the validity of the proposed approach.

The purpose of these preliminary experiments is to evaluate the inexact minimization strategy of the proposed algorithm (in both its gradient-based and derivative-free versions), compared with the exact minimization approach of the original PD method. To this aim, we consider the problem of sparse logistic regression, where the objective function is convex, but the solution of the subproblems in the x variables cannot be obtained in closed form and thus requires the adoption of an iterative method.

Test Problems

The problem of sparse logistic regression [26] has important applications, for instance, in machine learning [27, 28]. Given a dataset of N samples \(\{z^1,\ldots ,z^N\}\subset {\mathbb {R}}^n\), each described by n features, and N corresponding labels \(\{d_1,\ldots , d_N\}\) belonging to \(\{-1,1\}\), the sparse logistic regression problem can be formulated as follows:

$$\begin{aligned} \min _w\;L(w) =\sum _{i=1}^{N} \log \left( 1+\exp \left( -d_i(w^Tz^i)\right) \right) \quad \text { s.t. }\quad \Vert w\Vert _0\le s. \end{aligned}$$
(29)
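For reference, the objective of (29) and its gradient (needed by the gradient-based variants) can be implemented as follows; Z denotes the \(N\times n\) data matrix whose rows are the samples \(z^i\), d the label vector, and the function names are ours.

```python
import numpy as np

def logistic_loss(w, Z, d):
    """L(w) = sum_i log(1 + exp(-d_i * w^T z^i)), the objective of problem (29)."""
    margins = d * (Z @ w)
    return np.sum(np.logaddexp(0.0, -margins))      # numerically stable log(1 + exp(-m))

def logistic_loss_grad(w, Z, d):
    """Gradient of L(w): -sum_i d_i / (1 + exp(d_i * w^T z^i)) * z^i."""
    margins = d * (Z @ w)
    coef = -d / (1.0 + np.exp(margins))
    return Z.T @ coef
```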

The benchmark for this experiment is made up of 18 problems of the form (29), obtained as described hereafter. We employed 6 binary classification datasets, listed in Table 1. All the datasets are from the UCI Machine Learning Repository [29]. For each dataset, we removed data points with missing variables; moreover, we one-hot encoded the categorical variables and standardized the other ones to zero mean and unit standard deviation. For every dataset, we chose 3 different values of s, in order to define 3 different problems of the form (29). The considered values of s correspond to \(25\%\), \(50\%\) and \(75\%\) of the number n of features of the dataset.

Table 1 List of datasets used for experiments on sparse logistic regression

Implementation Details

Algorithms 1, 2 and 4 have been implemented in Python 3.6. The algorithms start from the feasible initial point \(x^0=y^0=0\in {\mathbb {R}}^n\). Their common parameters have been set as follows: \(\tau _0=1\) and \(\theta =1.1\). The three algorithms differ only in the x-minimization step. Concerning the line search parameters of Algorithm 2, we set \(\gamma =10^{-5}\) and \(\beta =0.5\). As for the derivative-free Algorithm 4, we set \(\delta =0.5\), \(\gamma =10^{-5}\), \(\sigma =2\).

The x-minimization step for Algorithm 1 has been performed by the BFGS solver [19] included in the SciPy library [30]. In particular, the inner iterations of the BFGS solver have been stopped as soon as the current point \(u^{\ell +1}\) is such that \(\Vert \nabla _x q_{\tau _k}(u^{\ell +1},v^\ell )\Vert \le 10^{-5}\), i.e., when the current point is a good approximation of a stationary point and hence, since the penalty function \(q_{\tau _k}\) is strictly convex with respect to u, of the global minimizer.

For a fair comparison, we employ for the three PD procedures the same stopping criteria for the outer and the inner loop. Specifically, we used the practical stopping criteria proposed in [18]: the inner loop stops when the decrease of the value of the function \(q_{\tau _k}\) is sufficiently small, i.e., when \(q_{\tau _k}(u^\ell ,v^\ell )-q_{\tau _k}(u^{\ell +1},v^{\ell +1})\le \epsilon _\text {in},\) where \(\epsilon _\text {in} = 10^{-4}\); the outer loop is stopped when x and y are sufficiently close, i.e., as soon as \(\Vert x^{k+1}-y^{k+1}\Vert \le \epsilon _\text {out},\) where \(\epsilon _\text {out}=10^{-4}\).
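For reference, the exact x-minimization step described above for Algorithm 1 can be reproduced along the following lines (a sketch, with variable names of our choosing); SciPy's BFGS implementation exposes the gradient-norm tolerance through the gtol option.

```python
import numpy as np
from scipy.optimize import minimize

def exact_x_step(f, grad_f, v, tau, u_start):
    """Exact x-minimization of q_tau(., v) via BFGS, stopped as soon as
    ||grad_x q_tau(u, v)|| <= 1e-5, as described in the text."""
    q = lambda u: f(u) + 0.5 * tau * np.dot(u - v, u - v)
    grad_q = lambda u: grad_f(u) + tau * (u - v)
    res = minimize(q, u_start, jac=grad_q, method="BFGS", options={"gtol": 1e-5})
    return res.x
```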

All the experiments have been carried out on an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz machine with 4 physical cores (8 threads) and 16 GB RAM.

Numerical Results

The three algorithms, namely Algorithm 1 (referred to as exact PD), Algorithm 2 (inexact PD) and Algorithm 4 (DFPD), have been compared using performance profiles [31]. We recall that, in performance profiles, each curve represents, for a given performance metric, the cumulative distribution of the ratio between the result obtained by a solver on a problem instance and the best result obtained by any considered solver on that instance. The results of the comparison are shown in Figure 1.
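As a reminder of how the curves of Figure 1 are built, a performance profile in the above sense can be computed as in the following sketch (the data layout and the helper name are ours; lower metric values are assumed to be better, as is the case for runtimes).

```python
import numpy as np

def performance_profile(results, taus):
    """results[problem][algorithm] = metric value (lower is better).
    Returns, for each algorithm, the fraction of problems on which its ratio to the
    per-problem best value is at most tau, for every tau in taus."""
    algorithms = sorted({a for per_problem in results.values() for a in per_problem})
    ratios = {a: [] for a in algorithms}
    for per_problem in results.values():
        best = min(per_problem.values())
        for a in algorithms:
            ratios[a].append(per_problem[a] / best)
    return {a: [float(np.mean(np.asarray(r) <= t)) for t in taus]
            for a, r in ratios.items()}
```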

From the results in Figure 1b, we can observe that the performances of the three algorithms, in terms of attained objective function values, are quite close, with only slight fluctuations. It is worth remarking that different algorithms can attain different local minima, even when starting from the same point, because of the nonconvex nature of problem (29).

On the other hand, as shown in Figure 1a, the inexact version of the PD algorithm clearly outperforms the other two algorithms in terms of efficiency. This aspect can be valuable in connection with a global optimization strategy, where many local minimizations have to be performed and the availability of an efficient local solver may be useful. The derivative-free algorithm is about an order of magnitude slower than its gradient-based counterpart, which is reasonable, considering that the size of the considered problems is quite large from the perspective of derivative-free optimization. In fact, the speed gap between gradient-based and derivative-free methods on problems of relatively large size is usually even wider; here, the gap is mitigated, since a large part of the computations is shared by all versions of the algorithm.

Fig. 1 Performance profiles of runtime (a) and attained objective value (b) for the exact, inexact and derivative-free penalty decomposition algorithms, on the 18 sparse logistic regression problems

On the whole, the computational experience, although limited to a single class of problems, confirms the validity of the proposed approach. We remark that we tested the simplest implementation of the proposed algorithm, that is, performing, in the x-minimization step, a single line search along the steepest descent direction. Benefits, in terms of attained function values, could be obtained by performing more iterations of a descent method and by introducing a suitable inner stopping criterion. As already observed, this can be done to improve the effectiveness of the algorithm while preserving its global convergence properties.

6 Conclusions

In this paper, we have proposed two penalty decomposition-based methods for smooth cardinality-constrained problems. In the first method, based on gradient information, the exact minimization step of the original penalty decomposition method is replaced by line searches along gradient-related directions. The contribution related to this algorithm thus lies in the fact that it provides a viable technique whenever a closed-form solution of the subproblems in the original variables is not available (in both the convex and nonconvex cases). The second method is a derivative-free algorithm for sparse black-box optimization. We remark that, to our knowledge, no convergent derivative-free algorithms for cardinality-constrained problems were previously available, so that the presented method appears to be a relevant contribution in the field of sparse optimization. We state global convergence results for the new penalty decomposition algorithms. We note that the theoretical analysis is quite different from that of the related literature and that it presents substantial differences between the two proposed algorithms. Although the main focus of the work is theoretical, we have also reported the results of preliminary computational experiments performed with the proposed penalty decomposition methods. The obtained results, although limited to a single class of problems, show the validity of the proposed approach. Further work will concern the extension of the presented algorithms to the case of problems with additional equality and inequality constraints, which, similarly to what is done in [18], might be handled by moving them into the quadratic penalty term. Another interesting theoretical investigation might concern the substitution of the line search step with a trust-region framework. Such a modification, which we consider to be reasonable, would in fact require nontrivial changes to the convergence analysis. Finally, the application of the derivative-free algorithm to real sparse black-box problems would be of great interest.