1 Introduction

Let \(\left\| x \right\| _{0}\) denote the number of nonzero components of the vector x. We consider the \(\ell _{0}\)-minimization problem

$$\begin{aligned} \begin{array}{lcl} &{}\min \limits _{x\in R^{n}}&{}\left\| x \right\| _{0}\\ &{} \hbox {s.t.} &{} \left\| y-Ax \right\| _{2}\le \epsilon ,~ Bx\le b, \end{array} \end{aligned}$$
(1)

where \(A\in R^{m\times n}\) and \(B\in R^{l\times n}\) are two matrices with \(m\ll n\) and \(l\le n\), \(y\in R^{m}\) and \(b\in R^{l}\) are two given vectors, \(\epsilon \ge 0\) is a given parameter, and \(\left\| x \right\| _{2}=(\sum _{i=1}^{n}\left| x_{i}\right| ^{2})^{1/2}\) is the \(\ell _{2}\)-norm of x. In compressed sensing (CS), the parameter \(\epsilon \) denotes the level of the measurement error \(\eta =y-Ax\). Clearly, the problem (1) is to find the sparsest point in the convex set

$$\begin{aligned} T=\lbrace x: ~ \left\| y-Ax\right\| _{2}\le \epsilon , Bx\le b\rbrace . \end{aligned}$$
(2)

The constraint \(Bx\le b\) is motivated by practical applications. For instance, many signal recovery models include extra constraints reflecting special structures or prior information of the target signals. The model (1) is general enough to cover several important applications in compressed sensing [5, 6, 12, 13], 1-bit compressed sensing [19, 23, 35] and statistical regression [21, 24, 29, 31]. The following two models are clearly special cases of (1):

$$\begin{aligned} \begin{array}{ll} \hbox {(C1)}~\min \limits _{x} \lbrace \Vert x\Vert _{0}: ~ y=Ax\rbrace ;&\hbox {(C2)}~ \min \limits _{x} \lbrace \Vert x\Vert _{0}: ~ \left\| y-Ax\right\| _{2}\le \epsilon \rbrace . \end{array} \end{aligned}$$

The problem (C1) is often called the standard \(\ell _{0}\)-minimization problem [7, 16, 35]. Some structured sparsity models, including the nonnegative sparsity model [6, 7, 16, 35] and the monotonic sparsity model (isotonic regression) [32], are also special cases of the model (1).

Clearly, directly solving the problem (1) is generally very difficult since the \(\ell _{0}\)-norm is a nonlinear, nonconvex and discrete function. Moreover, according to the analysis in [33], the problem (1) might have infinitely many optimal solutions, so efficient algorithms are needed to solve it. Over the past decade, algorithms have been developed for special cases of the problem such as (C1) and (C2), including convex optimization and heuristic methods [13, 14, 16, 35]. For instance, by replacing the \(\ell _{0}\)-norm in problem (1) with the \(\ell _{1}\)-norm, we immediately obtain the \(\ell _{1}\)-minimization problem

$$\begin{aligned} \begin{array}{lll}&\min \limits _x&\lbrace \left\| x \right\| _{1}: ~ x\in T\rbrace . \end{array} \end{aligned}$$
(3)

A more efficient class of models than (3) is the so-called weighted \(\ell _{1}\)-minimization model [8, 17, 35, 38]. For (C1) and (C2), the weighted \(\ell _1\)-minimization models can be stated respectively as

$$\begin{aligned} \begin{array}{ll} \hbox {(E1)}~\min \limits _{x} \lbrace \Vert Wx\Vert _{1}: ~ y=Ax\rbrace ;&\hbox {(E2)}~ \min \limits _{x} \lbrace \Vert Wx\Vert _{1}:~ \left\| y-Ax\right\| _{2}\le \epsilon \rbrace , \end{array} \end{aligned}$$

where \(W=\mathrm {diag}(w)\) is a diagonal matrix with \(w\in R_{+}^{n}\) being a weight vector. A single weighted \(\ell _1\)-minimization with a fixed weight is generally not efficient enough to outperform the standard \(\ell _1\)-minimization. As a result, the reweighted \(\ell _{1}\)-algorithm has been developed, which consists of solving a series of individual weighted \(\ell _{1}\)-minimization problems [1, 2, 8, 17, 35, 38]. Taking (C1) as an example, this method solves a series of the following reweighted \(\ell _{1}\)-problems:

$$\begin{aligned} \min \limits _{x} \lbrace (w^{k})^{T}\vert x\vert : ~ y=Ax\rbrace , \end{aligned}$$

where k denotes the kth iteration and the weight \(w^{k}\) is updated by a certain rule. For example, the first-order approximation of a concave merit function yields a good updating scheme for \(w^{k}\) (see Sect. 2). The convergence of some reweighted algorithms was shown under certain conditions [9, 22, 35, 38]. The reweighted \(\ell _{1}\)-minimization may perform better than \(\ell _{1}\)-minimization on sparse signal recovery when the initial point is suitably chosen (see, e.g., [8, 9, 15, 22, 35, 38]). Although this paper focuses on the study of reweighted algorithms, it is worth mentioning that there exist other types of algorithms for \(\ell _0\)-minimization problems, which have also been widely studied in the CS literature, such as orthogonal matching pursuits [14, 25, 30], compressed sampling matching pursuits [16, 27], subspace pursuits [10, 16], thresholding algorithms [3, 11, 14, 16, 26], and the newly developed optimal k-thresholding algorithms [36].

Recently, a new framework of reweighted algorithms for sparse optimization problems was proposed in [35, 37, 39], derived from the perspective of the dual density. The key idea is to use the complementarity between the solutions of the \(\ell _0\)-minimization problem and those of a theoretically equivalent weighted \(\ell _1\)-minimization problem. Such a complementarity property makes it possible to reformulate the \(\ell _0\)-minimization problem as an equivalent bilevel optimization problem that seeks the densest solution of the dual problem of a weighted \(\ell _{1}\)-problem (see [35] for details). In this paper, we generalize this idea to the \(\ell _{0}\)-minimization problem (1) and develop new dual-density-based algorithms through convex relaxation of the bilevel optimization. More specifically, to possibly solve the model (1), we consider the problem

$$\begin{aligned} \begin{array}{lll}&\min \limits _x&\lbrace \left\| Wx \right\| _{1}=w^{T}\vert x\vert : x\in T\rbrace , \end{array} \end{aligned}$$
(4)

which is the weighted \(\ell _{1}\)-minimization problem associated with the problem (1) for a given weight \(w\in R^n_+\) \((W=\mathrm {diag}(w))\). The dual-density-based reweighted \(\ell _{1}\)-algorithms for (1) are directly derived from the relaxation of the bilevel optimization reformulation of the problem (1). To this end, we develop a sufficient condition for the strict complementarity between the solutions of the weighted \(\ell _1\)-minimization problem associated with the problem (1) and the solutions of its dual problem. We propose three types of convex relaxations of the bilevel optimization problem in order to develop our dual-density-based \(\ell _1\)-algorithms for the problem (1).

The paper is organized as follows. In Sect. 2, we recall the merit functions for sparsity, give a few examples of such functions, and introduce the classic reweighted \(\ell _{1}\)-algorithms. Section 3 is devoted to the development of a sufficient condition for the strict complementarity property to hold. In Sect. 4, we show that the \(\ell _0\)-problem (1) can be reformulated equivalently as a bilevel optimization problem which, in theory, can generate an optimal weight for weighted \(\ell _1\)-minimization problems. In Sect. 5, we discuss several new relaxation strategies for such a bilevel optimization problem, based on which we develop the dual-density-based reweighted \(\ell _{1}\)-algorithms for the problem (1). Finally, we demonstrate some numerical results for the proposed algorithms.

Notation

The \(\ell _{p}\)-norm on \(R^{n}\) is defined as \(\left\| x\right\| _{p}=(\sum _{i=1}^{n}\left| x_{i}\right| ^{p})^{1/p}\), where \(p\ge 1\). The n-dimensional Euclidean space is denoted by \(R^{n}\). \(R^{n}_{+}\) and \(R^{n}_{++}\) are the sets of nonnegative and positive vectors respectively. The set of \(m \times n\) matrices is denoted by \(R^{m\times n}\). The identity matrix of a suitable size is denoted by I. The complementary set of \(S\subseteq \left\{ 1,\ldots ,n\right\} \) is denoted by \(\bar{S}\), i.e., \(\bar{S}=\lbrace 1,\ldots ,n\rbrace \setminus S\). For a given vector \(x\in R^{n}\) and \(S\subseteq \left\{ 1,\ldots ,n\right\} , \) \(x_{S} \) is the subvector of x supported on S.

2 Preliminary

In this section, we recall the notion of merit functions for sparsity and list a few examples. We also briefly outline the classic reweighted \(\ell _{1}\)-methods for the problem (1). A function is called a merit function for sparsity if it can approximate the \(\ell _{0}\)-norm in some sense [35, 38]. Certain concave functions have been shown to be good candidates for merit functions for sparsity [8, 20, 35, 37, 38]. As pointed out in [38, 39], we may choose a family of merit functions in the form

$$\begin{aligned} {\Psi }_{\varepsilon }( s)=\sum \limits _{i=1}^{n}\varphi _{\varepsilon }( s_{i}),~s\in R^{n}_{+}, \end{aligned}$$

where \( \varphi _{\varepsilon } \) is a function from \(R_+ \) to \( R_+.\) \({\Psi }_{\varepsilon }( s)\) satisfies the following properties:

  • (P1) for any given \(s\in R_{+}^{n}\), \({\Psi }_{\varepsilon }( s)\) tends to \(\left\| s\right\| _{0}\) as \(\varepsilon \) tends to 0;

  • (P2) \({\Psi }_{\varepsilon }( s)\) is twice continuously differentiable with respect to \(s\) in an open neighborhood of \(R^n_{+}\);

  • (P3) \(\varphi _{\varepsilon }( s_i)\) is concave and strictly increasing with respect to every \(s_{i}\in R_{+}\).

We denote the set of such merit functions by

$$\begin{aligned} \mathbf{F} =\lbrace {\Psi }_{\varepsilon }: {\Psi }_{\varepsilon } ~ \mathrm { satisfies } ~ (P1),(P2)~ \mathrm {and}~(P3) \rbrace . \end{aligned}$$

The following merit functions satisfying (P1)-(P3) have been used in [38, 39]:

$$\begin{aligned} {\Psi }_{\varepsilon }( s)= & {} n-\frac{\sum _{i=1}^{n}\log ( s_{i}+\varepsilon )}{\log \varepsilon },~s\in R_{+}^{n}, \end{aligned}$$
(5)
$$\begin{aligned} {\Psi }_{\varepsilon }( s)= & {} \sum _{i=1}^{n}\frac{s_{i}}{ s_{i}+\varepsilon },~s\in R_{+}^{n}, \end{aligned}$$
(6)
$$\begin{aligned} {\Psi }_{\varepsilon }( s)= & {} \sum _{i=1}^{n}( s_{i}+\varepsilon ^{1/\varepsilon })^{\varepsilon },~s\in R_{+}^{n} \end{aligned}$$
(7)

where \(\varepsilon \in (0,1)\). In this paper, we also consider the following merit function:

$$\begin{aligned} {\Psi }_{\varepsilon }( s)=\frac{2}{\pi }\sum _{i=1}^{n}\arctan (\frac{ s_{i}}{\varepsilon }), ~s\in R_{+}^{n}, \end{aligned}$$
(8)

where \(\varepsilon >0\). It is easy to show that (8) belongs to the set \(\mathbf{F} \).

Lemma 1

The function (8) satisfies (P1)-(P3) on \(R^{n}_{+}\).

Proof

Obviously, the function (8) satisfies (P1) and (P2). We now prove that it also satisfies (P3). In \(R^{n}_{+}\), note that

$$\begin{aligned} \nabla {\Psi }_{\varepsilon }( s)=\left( \nabla \varphi _{\varepsilon }( s_{1}), \ldots , \nabla \varphi _{\varepsilon }( s_{n})\right) ^{T} =\frac{2}{\pi }\biggr ( \frac{\varepsilon }{s_{1}^{2}+\varepsilon ^{2}}, \ldots , \frac{\varepsilon }{s_{n}^{2}+\varepsilon ^{2}}\biggr )^{T}, \end{aligned}$$

and

$$\begin{aligned} \nabla ^{2} {\Psi }_{\varepsilon }( s)=\frac{4}{\pi } \mathrm {diag}\biggr (-\frac{\varepsilon s_{1}}{(s_{1}^{2}+\varepsilon ^{2})^{2}}, \ldots ,-\frac{\varepsilon s_{n}}{(s_{n}^{2}+\varepsilon ^{2})^{2}}\biggr ). \end{aligned}$$

Due to \(s_{i}\ge 0~\mathrm {and}~\varepsilon >0\), we have \(\nabla \varphi _{\varepsilon }( s_{i})>0\) and \(\nabla ^{2} \varphi _{ \varepsilon }(s_{i})\le 0\) for \(i=1,\ldots ,n\) which implies that \({\Psi }_{\varepsilon }(s)\) is concave and strictly increasing with respect to every entry of \(s\in R_{+}^{n}\). Thus (8) satisfies (P1), (P2) and (P3). \(\square \)
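As a concrete check of (P1) (added here for illustration; it is not part of the original text), the following short Python snippet evaluates the merit functions (5), (6) and (8) at a fixed \(s\in R^{n}_{+}\) and shows that their values approach \(\Vert s\Vert _{0}\) as \(\varepsilon \rightarrow 0\); the test vector is arbitrary.

```python
import numpy as np

def psi_log(s, eps):    # merit function (5), eps in (0,1)
    return len(s) - np.sum(np.log(s + eps)) / np.log(eps)

def psi_frac(s, eps):   # merit function (6)
    return np.sum(s / (s + eps))

def psi_atan(s, eps):   # merit function (8)
    return (2.0 / np.pi) * np.sum(np.arctan(s / eps))

s = np.array([0.0, 0.7, 0.0, 2.5, 0.01])   # ||s||_0 = 3
for eps in [1e-1, 1e-3, 1e-6]:
    print(eps, psi_log(s, eps), psi_frac(s, eps), psi_atan(s, eps))
# all three values tend to ||s||_0 = 3 as eps decreases
```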

In order to compare the algorithms proposed in later sections, we briefly introduce the classic reweighted \(\ell _1\)-method. Following the idea in [38] and [35], replacing \(\left\| x\right\| _{0}\) with \({\Psi }_{\varepsilon }(t)\in \mathbf{F} \) leads to the following approximation of the problem (1):

$$\begin{aligned} \min _{(x, t)} \lbrace {\Psi }_{\varepsilon }(t): x\in T, ~\vert x\vert \le t\rbrace . \end{aligned}$$
(9)

By using the first-order approximation of \({\Psi }_{\varepsilon }(t)\in \mathbf{F} \) at the point \(t^{k},\) the problem (9) can be approximated by the optimization problem

$$\begin{aligned} \min _{(x, t)} \lbrace \nabla {\Psi }_{\varepsilon }^{T}(t^{k})t: x\in T, ~\vert x\vert \le t\rbrace , \end{aligned}$$
(10)

which is used to generate the new iterate \((x^{k+1}, t^{k+1}). \) Due to the fact that \({\Psi }_{\varepsilon }(t)\) is strictly increasing with respect to each \(t_{i}\in R_{+},\) it is evident that the iterate \((x^{k}, t^{k})\) must satisfy \(t^{k}=\vert x^{k}\vert \), which implies that

$$\begin{aligned} x^{k+1}\in {\mathop {\hbox {argmin}}\limits }_{x} \lbrace \nabla {\Psi }_{\varepsilon }^{T}(\vert x^{k}\vert )\vert x\vert : x\in T\rbrace . \end{aligned}$$

This is the classic reweighted \(\ell _1\)-minimization method described in [35].

figure a: Algorithm RA (the classic reweighted \(\ell _1\)-algorithm)

Based on the generic convergence of the revised Frank-Wolfe algorithms (FW-RD) for a class of concave functions in [28], the generic convergence of the algorithm RA can be obtained (see [28] for details); that is, there exists a family of merit functions \({\Psi }_{\varepsilon }\in \mathbf{F} \) such that RA converges to a stationary point of the problem. The convergence of RA to a sparse point in the case of linear-system constraints can be found in [35].
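The algorithm RA itself is displayed in figure a, which is not reproduced here. Purely as an illustration, the following Python/cvxpy sketch performs the RA iteration over T with the arctan merit function (8), whose gradient gives the weights \(w^{k}_{i}=\frac{2}{\pi }\frac{\varepsilon }{\vert x^{k}_{i}\vert ^{2}+\varepsilon ^{2}}\). The zero initial point and the parameter values are assumptions of this sketch; the paper's own experiments use Matlab/CVX.

```python
import numpy as np
import cvxpy as cp

def RA(A, B, y, b, eps_noise, eps=1e-3, iters=5):
    """Sketch of the classic reweighted l1-algorithm (RA) over
    T = {x : ||y - Ax||_2 <= eps_noise, Bx <= b}, with weights taken as the
    gradient of the arctan merit function (8) at |x^k|."""
    n = A.shape[1]
    xk = np.zeros(n)                     # x^0 = 0 gives uniform initial weights
    for _ in range(iters):
        w = (2.0 / np.pi) * eps / (np.abs(xk) ** 2 + eps ** 2)
        x = cp.Variable(n)
        cons = [cp.norm(y - A @ x, 2) <= eps_noise, B @ x <= b]
        cp.Problem(cp.Minimize(w @ cp.abs(x)), cons).solve()
        xk = x.value
    return xk
```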

3 Duality, strict complementarity and optimality condition

To develop the dual-density-based reweighted \(\ell _{1}\)-algorithms, we first discuss the duality and the optimality condition of the model (4), and we give a sufficient condition for strict complementarity to hold for the model (4).

3.1 Duality and complementary condition

By introducing two variables \(t\in R^{n}\) and \(\gamma \in R^{m}\) such that

$$\begin{aligned} \vert x \vert \le t ~\mathrm {and} ~\gamma =y-Ax, \end{aligned}$$

we can rewrite (4) as the following problem:

$$\begin{aligned} \begin{array}{lcl} &{} \min \limits _{(x, \gamma , t)} &{} w^T t\\ &{} \mathrm {s.t.} &{} \left\| \gamma \right\| _{2}\le \epsilon , ~ Bx\le b,\\ &{} &{}\gamma =y-Ax,~ \left| x \right| \le t,~ t\ge 0. \end{array} \end{aligned}$$
(11)

Obviously, (11) is equivalent to (4). Additionally, if \(w\in R^{n}_{++}\), then any solution \((x^{*}, t^{*}, \gamma ^{*})\) of (11) must satisfy \(\vert x^{*} \vert = t^{*}\) and \(\gamma ^{*}=y-Ax^{*}\), and the following relation between the solutions of (4) and (11) is obvious.

Lemma 2

If \(x^{*}\) is optimal to the problem (4), then all vectors \((x^{*}, t^{*}, \gamma ^{*})\) satisfying

$$\begin{aligned} \vert x^{*}_{\mathrm {supp}(w)}\vert =t^{*}_{\mathrm {supp}(w)},~\vert x^{*}_{\overline{\mathrm {supp}(w)}}\vert \le t^{*}_{\overline{\mathrm {supp}(w)}}~~\mathrm {and}~~ \gamma ^{*}=y-Ax^{*} \end{aligned}$$

are optimal to the problem (11). Moreover, if \((\bar{x}, \bar{t}, \bar{\gamma })\) is optimal to the problem (11), then \(\bar{x}\) is optimal to the problem (4).

Let \(\lambda =(\lambda _{1},\ldots ,\lambda _{6})\) be the dual variable, then the dual problem of (11) can be stated as follows:

$$\begin{aligned} \begin{array}{lcl} &{} \max \limits _{\lambda } &{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\\ &{} \mathrm {s.t.} &{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0,\\ &{} &{}w=\lambda _{4}+\lambda _{5}+\lambda _{6}, ~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1},\\ &{} &{} \lambda _{i}\ge 0,~ i=1,2,4,5,6. \end{array} \end{aligned}$$
(12)

The strong duality between (11) and (12) can be guaranteed under a suitable condition. Thus the following result follows from classic optimization theory [4].

Lemma 3

Let Slater condition hold for the convex problem (11), i.e., there exists \((x^{*}, \gamma ^{*}, t^{*})\in ri(T)\) such that

$$\begin{aligned} \left\| \gamma ^{*}\right\| _{2}< \epsilon , ~Bx^{*}\le b, ~\vert x^{*}\vert \le t^{*}, ~y=Ax^{*}+\gamma ^{*},~ t^{*}\ge 0, \end{aligned}$$

where ri(T) is the relative interior of T. Then there is no duality gap between (11) and its dual problem (12). Moreover, if the optimal value of (11) is finite, then there exists at least one optimal Lagrangian multiplier such that the dual optimal value can be attained.

In this paper, we assume that Slater condition holds for (11). Clearly, the optimal value of (11) is finite when w is a given vector, and hence the strong duality holds for (11) and (12) and the dual optimal value can be attained. In practice, the set \(\{x : Ax = y, Bx \le b\}\) is nonempty since y and b are measurements of the signals. Thus Slater condition is a very mild sufficient condition for strong duality to hold for the problems (11) and (12).

3.2 Optimality condition for (11) and (12)

It is well-known that for any convex minimization problem with differentiable objective and constraint functions for which the strong duality holds, Karush-Kuhn-Tucker (KKT) condition is the necessary and sufficient optimality condition for the problem and its dual problem [4]. Since Slater condition holds for (11), by Lemma 3, the optimality condition for (11) is stated as follows.

Theorem 1

If Slater condition holds for (11), then \((x^{*}, \gamma ^{*}, t^{*})\) is optimal to (11) and \(\lambda ^{*}=(\lambda _{1}^{*},\ldots ,\lambda _{6}^{*})\) is optimal to (12) if and only if \((x^{*}, \gamma ^{*}, t^{*}, \lambda ^{*})\) satisfies the KKT conditions for (11), i.e.,

$$\begin{aligned} \left\{ \begin{array}{lll} \gamma ^{*}=y-Ax^{*},~ \left\| \gamma ^{*}\right\| _{2}\le \epsilon , ~ x^{*}\le t^{*}, ~ -t^{*}\le x^{*}, \\ Bx^{*}\le b, ~ t^{*}\ge 0,~ \lambda _{i}^{*}\ge 0, ~i=1,2,4,5,6, \\ \lambda ^{*}_{1}(\epsilon -\left\| \gamma ^{*} \right\| _{2})=0, ~\lambda _{2}^{*T}(b-Bx^{*})=0, \\ \lambda _{4}^{*T}(t^{*}-x^{*})=0, ~\lambda _{5}^{*T}(x^{*}+t^{*})=0, ~\lambda _{6}^{*T}t^{*}=0, \\ \partial _{x} L(x^{*}, \gamma ^{*}, t^{*}, \lambda ^{*})=B^{T}\lambda _{2}^{*}-A^{T}\lambda _{3}^{*}+\lambda _{4}^{*}-\lambda _{5}^{*}=0,\\ \partial _{\gamma } L(x^{*}, \gamma ^{*}, t^{*},\lambda ^{*})=(\lambda ^{*}_{1})\nabla (\left\| \gamma ^{*}\right\| _{2})-\lambda ^{*}_{3}=0,\\ \partial _{t} L(x^{*}, \gamma ^{*}, t^{*},\lambda ^{*})=w-\lambda _{4}^{*}-\lambda _{5}^{*}-\lambda _{6}^{*}=0.\\ \end{array} \right. \end{aligned}$$
(13)

where \(L(x^{*}, \gamma ^{*}, t^{*}, \lambda ^{*})=w^{T}t^{*}-\lambda ^{*}_{1}(\epsilon -\left\| \gamma ^{*} \right\| _{2})- \lambda _{2}^{*T}(b-Bx^{*}) -\lambda _{3}^{*T}(Ax^{*}+\gamma ^{*}-y) -\lambda _{4}^{*T}(t^{*}-x^{*})-\lambda _{5}^{*T}(x^{*}+t^{*})-\lambda _{6}^{*T}t^{*}\).

From the optimality condition (13), we see that \(t^{*}\) and \(\lambda ^{*}_{6}\) satisfy the complementarity condition.

Corollary 1

Let Slater condition hold for (11). Then, for any optimal solution pair \(((x^{*}, t^{*}, \gamma ^{*}),\lambda ^{*})\), where \((x^{*}, t^{*}, \gamma ^{*})\) is optimal to (11) and \(\lambda ^{*}=(\lambda _{1}^{*},\ldots , \lambda _{6}^{*})\) is optimal to (12), \(t^{*}\) and \(\lambda _{6}^{*}\) are complementary in the sense that

$$\begin{aligned} (t^{*})^{T}\lambda _{6}^{*}=0, ~ t^{*}\ge 0 ~ \mathrm {and} ~ \lambda _{6}^{*}\ge 0. \end{aligned}$$

Clearly, if \((x^{*}, t^{*}, \gamma ^{*})\) is optimal to (11) and w is positive, then \(\vert x^{*}\vert =t^{*}\) must hold. Hence by Corollary 1, for \(i=1,\ldots ,n\), we have

$$\begin{aligned} \vert x^{*}_{i}\vert (\lambda _{6}^{*})_{i}=0, ~ (\lambda _{6}^{*})_{i}\ge 0. \end{aligned}$$
(14)

When w is nonnegative and \((x^{*}, t^{*}, \gamma ^{*})\) is optimal to (11), we have

$$\begin{aligned} \vert x^{*}_{i}\vert =t_{i}^{*},~ i\in \mathrm {supp}(w);~\vert x^{*}_{i}\vert \le t_{i}^{*}, ~i\in \overline{\mathrm {supp}(w)}. \end{aligned}$$

For \(i\in \mathrm {supp}(w)\), (14) is valid. For \(i\in \overline{\mathrm {supp}(w)}\), due to the constraints \(w=\lambda _{4}+\lambda _{5}+\lambda _{6}\) and \(\lambda _{4}, \lambda _{5}, \lambda _{6} \ge 0\), \(w_{i}=0\) implies that \((\lambda _{6}^{*})_{i}=0\). This means (14) is also valid for \(i\in \overline{\mathrm {supp}(w)}\). Therefore, we have the following result:

Theorem 2

Let w be a nonnegative given vector, and let Slater condition hold for (11). Then, for any optimal solution pair \(((x^{*}, t^{*}, \gamma ^{*}),\lambda ^{*})\), where \((x^{*}, t^{*}, \gamma ^{*})\) is optimal to (11) and \(\lambda ^{*}= (\lambda _{1}^{*},\ldots , \lambda _{6}^{*})\) is optimal to (12), \(\vert x_{i}^{*}\vert \) and \((\lambda _{6}^{*})_{i}\) are complementary in the sense that

$$\begin{aligned} \vert x_{i}^{*}\vert (\lambda _{6}^{*})_{i}=0 ~ \mathrm {and} ~ (\lambda _{6}^{*})_{i}\ge 0, ~i=1,\ldots ,n. \end{aligned}$$
(15)

The relation (15) implies that

$$\begin{aligned} \left\| x^{*} \right\| _{0}+\left\| \lambda _{6}^{*} \right\| _{0}\le n, \end{aligned}$$

where n is the dimension of \(x^{*}\) or \(\lambda _{6}^{*}\). Suppose \(\vert x^{*}\vert \) and \(\lambda ^{*}_{6}\) are strictly complementary, i.e.,

$$\begin{aligned} \vert x^{*}\vert ^{T}\lambda _{6}^{*}=0, ~\lambda _{6}^{*}\ge 0 ~ \mathrm {and}~\vert x^{*} \vert +\lambda _{6}^{*}> 0. \end{aligned}$$

Then

$$\begin{aligned} \left\| x^{*} \right\| _{0}+\left\| \lambda _{6}^{*} \right\| _{0}= n. \end{aligned}$$
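As a quick numerical illustration of Theorem 2 (added here for illustration; it is not part of the paper's experiments), one can solve (11) with a convex solver and read \(\lambda _{6}^{*}\) off as the multiplier of the constraint \(t\ge 0\). The Python/cvxpy sketch below generates a feasible instance by construction; the problem sizes, the random data, and the solver's dual sign convention and tolerance are assumptions of this sketch.

```python
import numpy as np
import cvxpy as cp

np.random.seed(0)
m, p, n, eps = 10, 5, 30, 0.1
A, B = np.random.randn(m, n), np.random.randn(p, n)
xs = np.zeros(n); xs[:4] = np.random.randn(4)        # a sparse point used to build T
y, b = A @ xs, B @ xs + np.abs(np.random.randn(p))   # so T is nonempty and Slater condition holds
w = np.ones(n)

x, t, gam = cp.Variable(n), cp.Variable(n), cp.Variable(m)
cons = [cp.norm(gam, 2) <= eps, B @ x <= b, gam == y - A @ x,
        cp.abs(x) <= t, t >= 0]
cp.Problem(cp.Minimize(w @ t), cons).solve()

lam6 = cons[-1].dual_value                 # multiplier of t >= 0, i.e. lambda_6^*
print(np.abs(x.value) @ lam6)              # complementarity (15): approximately 0
print(np.sum(np.abs(x.value) > 1e-6) + np.sum(lam6 > 1e-6))   # at most n, by (15)
```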

3.3 Strict complementarity

For nonlinear optimization models, the strict complementarity property might not hold. However, it is possible to develop a condition under which strict complementarity holds for the model (4) or (11). We now develop such a condition for the problems (11) and (12) under the following assumption.

Assumption 1

Let \(W=\mathrm {diag}(w)\) satisfy the following properties:

  • \(\langle G1\rangle \)   The problem (4) with w has an optimal solution which is a relative interior point in the feasible set T, denoted by \(x^{*}\in ri(T)\), such that

    $$\begin{aligned} \left\| y-Ax^{*}\right\| _{2}<\epsilon , ~Bx^{*}\le b, \end{aligned}$$
  • \(\langle G2\rangle \)   the optimal value \(Z^{*}\) of (4) is finite and positive, i.e., \(Z^{*}\in (0, \infty )\),

  • \(\langle G3\rangle \)   \(w_{j}\in (0, \infty ]\) for all \(1\le j \le n\).

Example 1

Consider the system \(\left\| y-Ax\right\| _{2}\le \epsilon , Bx\le b\) with \(\epsilon =10^{-1}\), where

$$\begin{aligned} A=\left[ \begin{array}{cccc} 1 &{} 0 &{} -2 &{}5 \\ 0&{} 1&{} 4&{} -9\\ 1&{} 0&{} -2&{}5 \end{array} \right] , B=\left[ \begin{array}{cccc} -0.5 &{} 0 &{} 1 &{}-2.5 \\ 0.5&{} -0.5&{} -1&{} 2\\ -3&{} -3&{} -2&{}3 \end{array} \right] , y=\left[ \begin{array}{c} 1\\ -1\\ 1 \end{array} \right] ,b=\left[ \begin{array}{c} -0.5\\ 1\\ -1 \end{array} \right] . \end{aligned}$$

We can see that the problem (4) with \(w=(1,100,1,100)^{T}\) has an optimal solution \((1/2,0,-1/4,0)^{T}\) which satisfies Assumption 1.
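For the reader's convenience (this check is not in the original text), the feasibility of the point \((1/2, 0, -1/4, 0)^{T}\) and its weighted \(\ell _{1}\)-value can be verified with a few lines of Python; the snippet only checks the constraints and the objective value, it does not re-establish optimality.

```python
import numpy as np

A = np.array([[1., 0., -2., 5.], [0., 1., 4., -9.], [1., 0., -2., 5.]])
B = np.array([[-0.5, 0., 1., -2.5], [0.5, -0.5, -1., 2.], [-3., -3., -2., 3.]])
y = np.array([1., -1., 1.]); b = np.array([-0.5, 1., -1.])
w = np.array([1., 100., 1., 100.]); eps = 1e-1

xstar = np.array([0.5, 0.0, -0.25, 0.0])
print(np.linalg.norm(y - A @ xstar))       # 0.0 < eps: the residual constraint is strict
print(np.all(B @ xstar <= b + 1e-12))      # True: Bx <= b holds (with equality in rows 1 and 3)
print(w @ np.abs(xstar))                   # weighted l1-value w^T |x*| = 0.75
```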

Next we prove the following theorem concerning the strict complementarity for (11) and (12) under Assumption 1.

Theorem 3

Let y and b be two given vectors, \(A\in R^{m\times n}\) and \(B\in R^{l\times n}\) be two given matrices, and w be a given weight which satisfies Assumption 1. Then there exists a pair \(((x^{*}, t^{*}, \gamma ^{*}), \lambda ^{*})\), where \((x^{*}, t^{*}, \gamma ^{*})\) is an optimal solution to (11) and \(\lambda ^{*}=(\lambda _{1}^{*},\ldots ,\lambda _{6}^{*})\) is an optimal solution to (12), such that \(t^{*}\) and \(\lambda _{6}^{*}\) are strictly complementary, i.e.,

$$\begin{aligned} (t^{*})^{T}\lambda _{6}^{*}=0, ~t^{*}+\lambda _{6}^{*}>0, ~(t^{*}, \lambda _{6}^{*})\ge 0. \end{aligned}$$

Proof

Note that (G1) in Assumption 1 implies that Slater condition holds for (11). This, combined with (G2) and Lemma 3, indicates that the duality gap is 0 and that the optimal value \(Z^{*}\) of (12) can be attained. For any given index \(j: 1\le j \le n\), we consider a series of minimization problems:

$$\begin{aligned} \begin{array}{cl} \min \limits _{(x, t, \gamma )}&{} -t_{j}\\ \mathrm {s.t.}&{}\left\| \gamma \right\| _{2}\le \epsilon ,~Bx\le b,~ \gamma =y-Ax,\\ &{}\left| x \right| \le t, ~-w^{T}t\ge -Z^{*}, ~ t\ge 0. \end{array} \end{aligned}$$
(16)

The dual problem of (16) can be obtained by using the same method for developing the dual problem of (11), which is stated as follows:

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(\mu , \tau )}&{}-\mu _{1}\epsilon -\mu _{2}^{T}b+\mu _{3}^{T}y-\tau Z^{*}\\ &{}\mathrm {s.t.}&{} B^{T}\mu _{2}-A^{T}\mu _{3}+\mu _{4}-\mu _{5}=0,~\left\| \mu _{3}\right\| _{2}\le \mu _{1},\\ &{} &{}\tau w=\mu _{4}+\mu _{5}+\mu _{6}+p,~ \mu _{i}\ge 0,~ i=1,2,4,5,6, ~\tau \ge 0, \end{array} \end{aligned}$$
(17)

where p is the vector whose jth component is 1 and whose remaining components are 0, i.e.,

$$\begin{aligned} p_{j}=1;~p_{i}=0,~i\ne j. \end{aligned}$$

Next we show that (16) and (17) satisfy the strong duality property under Assumption 1. It can be seen that \((x,t,\gamma )\) is a feasible solution to (16) if and only if \((x,t,\gamma )\) is an optimal solution of (11), in which case x is optimal to (4). If w satisfies the conditions in Assumption 1, then there exists an optimal solution \(\bar{x}\) of (4) such that \(\left\| y-A\bar{x}\right\| _{2}<\epsilon ,~B\bar{x}\le b \) and \(w^{T}\vert \bar{x}\vert =Z^{*},\) which means there is a relative interior point \((\bar{x}, \bar{t}, \bar{\gamma })\) of the feasible set of (16) satisfying

$$\begin{aligned} \left\| \bar{\gamma } \right\| _{2}<\epsilon ,~ B\bar{x}\le b,~ \bar{\gamma }=y-A\bar{x}, ~\vert \bar{x} \vert \le \bar{t}, ~w^{T}\bar{t}\le Z^{*}, ~\bar{t}\ge 0. \end{aligned}$$

As a result, the strong duality holds between (16) and (17) for all j. Moreover, due to (G2) and (G3), w is positive and \(Z^{*}\) is finite, so \(t_{j}\) cannot be \(\infty \). Thus the optimal value of every problem (16) is finite. It follows from Lemma 3 that for each j, the duality gap between (16) and (17) is 0 and the dual problem (17) attains its optimal value.

We use \(\xi ^{*}_{j}\) to denote the optimal value of the jth problem in (16). Clearly, \(\xi ^{*}_{j}\) is nonpositive, i.e.,

$$\begin{aligned} \xi ^{*}_{j}<0 ~~ \mathrm {or} ~~ \xi ^{*}_{j}=0. \end{aligned}$$

Case 1:    \(\xi ^{*}_{j}<0\). Then (11) has an optimal solution \((x', t', \gamma ')\) whose jth component \(t_{j}'=-\xi ^{*}_{j}>0\) is the largest among all optimal solutions of (11). By Theorem 1, the complementarity condition implies that (12) has an optimal solution \(\lambda '=(\lambda '_{1},\ldots ,\lambda '_{6})\) whose jth component of \(\lambda _{6}'\) is 0. Then we have an optimal solution pair \(((x',t',\gamma '),\lambda ')\) for (11) and (12) such that \(t_{j}'>0\) and \((\lambda _{6}')_{j}=0\). This means that

$$\begin{aligned} t_{j}'=-\xi ^{*}_{j}>0 \quad \mathrm {implies} \quad ( \lambda _{6}')_{j}=0. \end{aligned}$$

Case 2:    \(\xi ^{*}_{j}=0\). Following from the strong duality between (16) and (17), we have an optimal solution \((\mu , \tau )\) of the jth optimization problem (17) such that

$$\begin{aligned} -\mu _{1}\epsilon -\mu _{2}^{T}b+\mu _{3}^{T}y=\tau Z^{*}. \end{aligned}$$

First, we consider \(\tau \ne 0\). The above equality can be reduced to

$$\begin{aligned} -\frac{\mu _{1}\epsilon }{\tau }-\frac{\mu _{2}^{T}}{\tau }b+\frac{\mu _{3}^{T}}{\tau }y=Z^{*}, \end{aligned}$$

and we also have

$$\begin{aligned} B^{T}\frac{\mu _{2}}{\tau }-A^{T}\frac{\mu _{3}}{\tau }+\frac{\mu _{4}}{\tau }-\frac{\mu _{5}}{\tau }=0,~\left\| \frac{\mu _{3}}{\tau }\right\| _{2}\le \frac{\mu _{1}}{\tau }, ~ w=\frac{\mu _{4}}{\tau }+\frac{\mu _{5}}{\tau }+\frac{\mu _{6}}{\tau }+\frac{p}{\tau }. \end{aligned}$$

We set

$$\begin{aligned} \lambda _{1}^{'}=\frac{\mu _{1}}{\tau }, ~\lambda _{2}^{'}=\frac{\mu _{2}}{\tau }, ~\lambda _{3}^{'}=\frac{\mu _{3}}{\tau }, ~\lambda _{4}^{'}=\frac{\mu _{4}}{\tau }, ~\lambda _{5}^{'}=\frac{\mu _{5}}{\tau }, ~\lambda _{6}^{'}=\frac{\mu _{6}}{\tau }+\frac{p}{\tau }. \end{aligned}$$

Due to strong duality of (11) and (12) again, \(\lambda ^{'}=( \lambda _{1}^{'},\ldots , \lambda _{6}^{'})\) is optimal to (12). Note that

$$\begin{aligned} (\lambda _{6})_{j}^{'}=\frac{(\mu _{6})_{j}+1}{\tau }. \end{aligned}$$

Hence \((\lambda _{6})_{j}^{'}>0\), since \(\mu _{6}\ge 0\) and \(\tau >0\). Thus

$$\begin{aligned} t'_{j}=-\xi _{j}^{*}=0~ ~\mathrm {implies} ~~ (\lambda _{6})_{j}^{'}>0. \end{aligned}$$

Note that the third constraint in the jth problem of (17) requires \(\tau \ne 0\): since w, \(\mu _{4}\), \(\mu _{5}\), \(\mu _{6}\) are all nonnegative and \(p_{j}=1\), the jth component of \(\tau w\) must be greater than or equal to 1. Therefore, the jth problem in (17) is infeasible if \(\tau =0\), and hence an optimal solution \((\mu , \tau )\) of (17) with \(\tau =0\) cannot occur. Combining Cases 1 and 2 implies that for each \(1\le j \le n\), there is an optimal solution pair \(((x^{(j)}, t^{(j)}, \gamma ^{(j)}), \lambda ^{(j)})\) such that \(t^{(j)}_{j}>0\) or \((\lambda _{6}^{(j)})_{j}>0\). All these solution pairs satisfy the following properties:

  (i) \((x^{(j)}, t^{(j)}, \gamma ^{(j)})\) is optimal to (11), and \((\lambda _{1}^{(j)}, \lambda _{2}^{(j)}, \lambda _{3}^{(j)}, \lambda _{4}^{(j)}, \lambda _{5}^{(j)}, \lambda _{6}^{(j)})\) is optimal to (12);

  (ii) the jth component of \(t^{(j)}\) and the jth component of \(\lambda _{6}^{(j)}\) are strictly complementary, such that \(t^{(j)}_{j}(\lambda _{6}^{(j)})_{j}=0,~ t^{(j)}_{j}+(\lambda _{6}^{(j)})_{j} >0 \).

Denote \((x^{*}, t^{*}, \gamma ^{*},\lambda ^{*})\) by

$$\begin{aligned} x^{*}=\frac{1}{n}\sum _{j=1}^{n} x^{(j)}, ~ t^{*}=\frac{1}{n}\sum _{j=1}^{n} t^{(j)}, ~ \gamma ^{*}=\frac{1}{n}\sum _{j=1}^{n}\gamma ^{(j)}, ~ \lambda _{i}^{*}=\frac{1}{n}\sum _{j=1}^{n}\lambda ^{(j)}_{i}, ~ i=1,2,\cdots ,6. \end{aligned}$$

Since \((x^{(j)}, t^{(j)}, \gamma ^{(j)}),~ j=1,2,\ldots ,n\) are all optimal solutions of (11), for any j we have

$$\begin{aligned} \left\{ \begin{array}{ll} w^{T}t^{(j)}=Z^{*},~\left\| \gamma ^{(j)} \right\| _{2}\le \epsilon ,\quad Bx^{(j)}\le b, \\ \gamma ^{(j)}=y-Ax^{(j)}, ~ \vert x^{(j)}\vert \le t^{(j)},\quad t^{(j)}\ge 0. \end{array} \right. \end{aligned}$$
(18)

It is easy to see that

$$\begin{aligned} w^{T}t^{*}=Z^{*},~ Bx^{*}\le b, ~\gamma ^{*}=y-Ax^{*}, ~ t^{*}\ge 0. \end{aligned}$$

Moreover,

$$\begin{aligned}&\left\| \gamma ^{*}\right\| _{2}= \left\| \frac{1}{n}\sum _{j=1}^{n}\gamma ^{(j)}\right\| _{2} \le \sum _{j=1}^{n}\left\| \frac{1}{n}\gamma ^{(j)}\right\| _{2}\le \epsilon ,\\&\vert x^{*}\vert =\left| \frac{1}{n}\sum _{j=1}^{n}x^{(j)}\right| \le \frac{1}{n}\sum _{j=1}^{n}\vert x^{(j)}\vert \le \frac{1}{n}\sum _{j=1}^{n}t^{(j)}=t^{*}, \end{aligned}$$

where the first inequality of each equation above follows from the triangle inequality. Then the vector \((x^{*}, t^{*}, \gamma ^{*})\) satisfies

$$\begin{aligned} \left\{ \begin{array}{ll} w^{T}t^{*}=Z^{*},~\left\| \gamma ^{*} \right\| _{2}\le \epsilon , ~ Bx^{*}\le b,\\ \gamma ^{*}=y-Ax^{*}, ~\vert x^{*}\vert \le t^{*}, ~t^{*}\ge 0. \end{array} \right. \end{aligned}$$
(19)

Thus \((x^{*}, t^{*}, \gamma ^{*})\) is optimal to (11), and similarly it can be proven that \(\lambda ^{*}=(\lambda ^{*}_{1},\ldots ,\lambda ^{*}_{6})\) is an optimal solution to (12). By strong duality, \(t^{*}\) and \(\lambda _{6}^{*}\) are complementary. Due to the above-mentioned property (ii), it is impossible that the jth components of \(t^{*}\) and \(\lambda _{6}^{*}\) are both 0 for some j. Thus, \((t^{*}, \lambda ^{*}_{6})\) is a strictly complementary solution pair for (11) and (12). \(\square \)

Remark 1

It can be seen that the following two sets

$$\begin{aligned} P^{*}=\lbrace i: t_{i}^{*}>0\rbrace ~ \mathrm {and}~ Q^{*}=\lbrace i: (\lambda _{6}^{*})_{i}>0 \rbrace \end{aligned}$$

are invariant for all pairs of strictly complementary solutions. Suppose there are two distinct optimal pairs of the solutions of (11) and (12), denoted by \((x_{(k)}\), \(t_{(k)}\), \(\gamma _{(k)}\), \(\lambda _{(k)})\), \(k=1,2\), such that \(( t_{(k)}, \lambda _{6(k)}), k=1,2\) are strictly complementary pairs, where \((x_{(k)}, t_{(k)}, \gamma _{(k)})\) are optimal to (11) and \((\lambda _{(k)})\) are optimal to (12). Due to Theorem 1, we know that

$$\begin{aligned} (\lambda _{6(1)})^{T}t_{(2)}=0~\mathrm {and}~(\lambda _{6(2)})^{T}t_{(1)}=0. \end{aligned}$$

It means that the supports of all strictly complementary pairs of (11) and (12) are invariant. Otherwise, there exists an index j such that \((t_{(1)})_{j}>0\) and \((\lambda _{6(2)})_{j}>0\), leading to a contradiction.

Since the optimal solution \((x^{*}, t^{*}, \gamma ^{*})\) to (11) must have \(t^{*}=\vert x^{*}\vert \) if \(w>0\), the main results of Theorem 3 also imply that \(\vert x^{*}\vert \) and \( \lambda _{6}^{*}\) are strictly complementary under Assumption 1.

4 Bilevel model for optimal weights

For weighted \(\ell _{1}\)-minimization, how to determine a weight to guarantee the exact recovery, sign recovery or support recovery of sparse signals is an important issue in CS theory. Based on the complementary condition and strict complementarity discussed above, we may develop a bilevel optimization model for such a weight, which is called the optimal weight in [37, 39] and [35].

Definition 1

(Optimal Weight) A weight is called an optimal weight if the solution of the weighted \(\ell _{1}\)-problem with this weight is one of the optimal solutions of the \(\ell _{0}\)-minimization problem.

Let \(Z^{*}\) be the optimal value of (4). Notice that the optimal solution of (4) remains the same when w is replaced by \(\alpha w\) for any positive \(\alpha \). When \(Z^{*}\ne 0\), by replacing W by \(W/Z^{*}\), we can obtain

$$\begin{aligned} 1=\min _{x}\lbrace \left\| (W/Z^{*}) x\right\| _{1}:x\in T\rbrace , \end{aligned}$$

where \(W=\mathrm {diag}(w)\). We use \(\zeta \) to denote the set of such weights, i.e.,

$$\begin{aligned} \zeta =\lbrace w\in R^{n}_{+}: ~1=\min _{x}\lbrace \left\| Wx\right\| _{1}, x\in T\rbrace \rbrace . \end{aligned}$$
(20)

Clearly, \(\bigcup \limits _{\alpha > 0}\alpha \zeta \) is the set of weights such that (4) has a finite and positive optimal value, and \(\zeta \) is not necessarily bounded. Under Slater condition, Theorem 2 implies that given any \(w\in \zeta \), any optimal solutions of (11) and (12), denoted by \((x^{*}(w),t^{*}(w),\gamma ^{*}(w))\) and \(\lambda ^{*}(w)=(\lambda ^{*}_{1}(w),\ldots , \lambda ^{*}_{6}(w))\), satisfy that \(\vert x^{*}(w)\vert \) and \(\lambda _{6}^{*}(w)\) are complementary, i.e.,

$$\begin{aligned} \left\| x^{*}(w)\right\| _{0}+\left\| \lambda _{6}^{*}(w)\right\| _{0}\le n. \end{aligned}$$
(21)

If \(w^{*}\) satisfies Assumption 1, then Slater condition is automatically satisfied for (11) with \(w^{*}\) and (21) is also valid. Moreover, by Theorem 3, there exists a strictly complementary pair \((\vert x^{*}(w^{*})\vert , \lambda _{6}^{*}(w^{*}))\) such that

$$\begin{aligned} \left\| x^{*}(w^{*})\right\| _{0}+\left\| \lambda _{6}^{*}(w^{*})\right\| _{0}=n. \end{aligned}$$

If \(w^{*}\) is an optimal weight (see Definition 1), then \(\lambda ^{*}_{6}(w^{*})\) must be the densest slack variable among all \(w\in \zeta \), and locating a sparse vector can be converted to

$$\begin{aligned} \lambda ^{*}_{6}(w^{*})={\mathop {\hbox {argmax}}\limits }\lbrace \left\| \lambda ^{*}_{6}(w)\right\| _{0}:w\in \zeta \rbrace . \end{aligned}$$

Inspired by the above fact, we develop a theorem under Assumption 2 which claims that finding a sparsest point in T is equivalent to seeking a proper weight w such that the dual problem (12) has the densest optimal variable \(\lambda _{6}\). Such weights are optimal weights and can be determined by a certain bilevel optimization problem. This idea was first introduced by Zhao and Kočvara [37] (and also by Zhao and Luo [39]) to solve the standard \(\ell _{0}\)-minimization problem (C1). In this paper, we generalize their idea to solve the model (1) by developing new convex relaxation techniques for the underlying bilevel optimization problem. Before that we make the following assumption:

Assumption 2

Let \(\nu \) be an arbitrary sparsest point in T given in (2). There exists a weight \(\bar{w}\ge 0\) such that

  • \(\langle H1\rangle \) the problem (4) with \(\bar{w}\) has an optimal solution \(\bar{x}\) such that \(\left\| \bar{x}\right\| _{0}=\left\| \nu \right\| _{0}\),

  • \(\langle H2\rangle \) there exists an optimal variable in (12) with \(\bar{w}\), denoted as \(\bar{\lambda }\), such that \(\bar{\lambda _{6}}\) and \(\bar{x}\) are strictly complementary,

  • \(\langle H3\rangle \) the optimal value of (4) with \(\bar{w}\) is finite and positive.

An example for the existence of a weight satisfying Assumption 2 is given in the remark following the next theorem.

Theorem 4

Let Slater condition and Assumption 2 hold. Consider the bilevel optimization

$$\begin{aligned} \begin{array}{lcl} &{} \max \limits _{(w, \lambda )} &{} \left\| \lambda _{6}\right\| _{0}\\ &{} \mathrm {s.t.} &{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0,~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1}, \\ &{} &{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y=\min \limits _{x}\lbrace \left\| Wx\right\| _{1}: x\in T\rbrace ,\\ &{} &{}w= \lambda _{4}+\lambda _{5}+\lambda _{6}\ge 0,~\lambda _{i}\ge 0, ~ i=1,2,4,5,6, \end{array} \end{aligned}$$
(22)

where \(W=\mathrm {diag}(w)\), and T is given as (2). If \((w^{*}, \lambda ^{*})\) is an optimal solution to the above optimization problem (22), then any optimal solution \(x^{*}\) to

$$\begin{aligned} \min _{x}\lbrace \left\| W^{*}x\right\| _{1}: x\in T\rbrace , \end{aligned}$$
(23)

is a sparsest point in T, where \(W^{*}=\mathrm {diag}(w^{*})\).

Proof

Let \(\nu \) be a sparsest point in T. Suppose that \((w^{*}, \lambda ^{*})\) is an optimal solution of (22). We now prove that any optimal solution to (23) is a sparsest point in T under Assumption 2. Let \(w'\) be a weight satisfying Assumption 2. This means that (4) with \(W=\mathrm {diag}(w')\) has an optimal solution \(x'\) such that \(\left\| x'\right\| _{0}=\left\| \nu \right\| _{0}\). Moreover, there exists a strictly complementary pair (\(x', \lambda _{6}'\)) satisfying

$$\begin{aligned} \left\| x'\right\| _{0}+\left\| \lambda _{6}'\right\| _{0}=n= \left\| \lambda _{6}'\right\| _{0}+\left\| \nu \right\| _{0}, \end{aligned}$$
(24)

where \(\lambda '=(\lambda _{1}',\ldots ,\lambda _{6}')\) is the dual optimal solution of (12) with \(w=w'\), i.e.,

$$\begin{aligned} \begin{array}{lcl} &{} \max \limits _{\lambda }&{}-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\\ &{}\mathrm {s.t.}&{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0,~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1},\\ &{} &{}w'=\lambda _{4}+\lambda _{5}+\lambda _{6},~ \lambda _{i}\ge 0,~ i=1,2,4,5,6. \end{array} \end{aligned}$$
(25)

By Lemma 3, Slater condition implies that strong duality holds for the problems (25) and (11) with \(w'\). Note that the optimal values of (11) and (4) with \(w'\) are equal and finite so that \((w', \lambda ')\) is feasible to (22). Let \(x^{*}\) be an arbitrary solution to (23). Note that (11) with \(w^{*}\) is equivalent to (23), to which the dual problem is

$$\begin{aligned} \begin{array}{lcl} &{} \max \limits _{\lambda }&{}-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\\ &{}\mathrm {s.t.}&{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0,~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1},\\ &{} &{}w^{*}=\lambda _{4}+\lambda _{5}+\lambda _{6},~ \lambda _{i}\ge 0,~ i=1,2,4,5,6. \end{array} \end{aligned}$$
(26)

Moreover, \(\lambda ^{*}=(\lambda _{1}^{*},\ldots ,\lambda _{6}^{*})\) is feasible to (26) and the third constraint of (22) implies that there is no duality gap between (11) with \(w^{*}\) and (26). Thus, by strong duality, \(\lambda ^{*}=(\lambda _{1}^{*},\ldots ,\lambda _{6}^{*})\) is an optimal solution to (26). Therefore, by Theorem 2, \(\vert x^{*}\vert \) and \(\lambda _{6}^{*}\) are complementary. Hence, we have

$$\begin{aligned} \left\| x^{*}\right\| _{0}\le n- \left\| \lambda _{6}^{*}\right\| _{0}. \end{aligned}$$
(27)

Since \((w^{*}, \lambda ^{*})\) is optimal to (22), we have

$$\begin{aligned} \left\| \lambda _{6}'\right\| _{0}\le \left\| \lambda _{6}^{*}\right\| _{0}. \end{aligned}$$
(28)

Plugging (24) and (28) into (27) yields

$$\begin{aligned} \left\| x^{*}\right\| _{0}\le n- \left\| \lambda _{6}^{*}\right\| _{0} \le n-\left\| \lambda _{6}'\right\| _{0}=\left\| x'\right\| _{0}=\left\| \nu \right\| _{0}, \end{aligned}$$

which implies \(\left\| x^{*}\right\| _{0}=\left\| \nu \right\| _{0},\) due to the assumption that \(\nu \) is a sparsest point in T. Thus any optimal solution to (23) is a sparsest point in T. \(\square \)

Given Assumption 2 and Slater condition, finding a sparsest point in T is tantamount to finding the densest dual optimal variable \(\lambda _{6}\) via the bilevel model (22).

By the definition of optimal weights, Theorem 4 implies that \(w^{*}\) is an optimal weight by which a sparsest point can be obtained via (4). If there is no weight satisfying the properties in Assumption 2, a heuristic method for finding a sparse point in T can still be developed from (21), since an increase in \(\Vert \lambda _{6}(w)\Vert _{0}\) forces \(\left\| x(w) \right\| _{0}\) to decrease to a certain level. Before we close this section, we make some remarks on Assumption 2.

Remark 2

Consider Example 1. It can be seen that \((0,0,2,1)^{T}\) is a sparsest point in the feasible set T of this example. If we choose the weight \(w=(100, 100, 1, 1)^{T}\), then \((0, 0, 2, 1)^{T}\) is the unique optimal solution of (4), which satisfies \(\langle H1\rangle \) and \(\langle H3\rangle \) in Assumption 2. In addition, \((0, 0, 2, 1)^{T}\) is a relative interior point of the feasible set T. This, combined with the fact that the weights are positive, implies that Assumption 1 is satisfied, and hence the strict complementarity holds, which means that \(\langle H2\rangle \) in Assumption 2 is satisfied. Specifically, we can find an optimal dual solution \(\bar{\lambda }=(\bar{\lambda }_{1},\ldots ,\bar{\lambda }_{6})\) with \(\bar{\lambda }_{6}=(32.27, 31.71, 0, 0)^{T}\). Therefore, in this example, the weight \(w=(100, 100, 1, 1)^{T}\) satisfies Assumption 2.

5 Dual-density-based algorithms

Note that bilevel optimization problems are generally difficult to solve directly. We now develop three types of relaxation models for the bilevel optimization problem (22).

5.1 Relaxation models

Zhao and Luo [39] presented a method to relax a bilevel problem similar to (22). Motivated by their idea, we now relax our bilevel model, focusing on the difficult constraint \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y=\min _{x}\lbrace \left\| Wx\right\| _{1}: x\in T\rbrace \) in (22). By replacing the objective function \(\Vert \lambda _{6}\Vert _{0}\) in (22) with \({\Psi }_{\varepsilon }(\lambda _{6})\in \mathbf{F} , \) where \( \lambda _{6}\ge 0\), we obtain an approximation of (22), i.e.,

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} {\Psi }_{\varepsilon }(\lambda _{6})\\ &{}\mathrm {s.t.}&{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0,~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1}\\ &{} &{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y=\min _{x}\lbrace \left\| Wx\right\| _{1}: x\in T\rbrace ,\\ &{} &{} w= \lambda _{4}+\lambda _{5}+\lambda _{6} \ge 0,~ \lambda _{i}\ge 0,~ i=1,2,4,5,6. \end{array} \end{aligned}$$
(29)

We recall the set of weights \(\zeta \) given in (20). It can be seen that w being feasible to (29) implies that (11) and (12) satisfy the strong duality and have the same finite optimal value, which is equivalent to \(w\in \zeta \) when Slater condition holds for (11). Moreover, the constraints of (29) indicate that for any given \(w\in \zeta \), any \(\lambda \) satisfying the constraints of (29) is optimal to (12). Therefore the purpose of (29) is to find the densest dual optimal variable \(\lambda _{6}\) over all \(w\in \zeta \). Thus (29) can be rewritten as

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} {\Psi }_{\varepsilon }(\lambda _{6})\\ &{}\mathrm {s.t.}&{} w\in \zeta , ~B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0, ~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1},\\ &{} &{} w= \lambda _{4}+\lambda _{5}+\lambda _{6}\ge 0, ~\lambda _{i}\ge 0, ~i=1,2,4,5,6,\\ &{} &{} \mathrm {where} ~ \lambda =(\lambda _{1},\ldots ,\lambda _{6}) ~ \mathrm {is} ~ \mathrm {optimal} ~ \mathrm {to} \\ &{} &{}\max _{\lambda }\lbrace -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y:\left\| \lambda _{3}\right\| _{2}\le \lambda _{1}, ~w= \lambda _{4}+\lambda _{5}+\lambda _{6},\\ &{} &{} B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0, ~ \lambda _{i}\ge 0, ~i=1,2,4,5,6\rbrace . \end{array} \end{aligned}$$
(30)

Denote the feasible set of (12) by

$$\begin{aligned} \begin{aligned} D(w):=&\lbrace \lambda : B^{T}\lambda _{2}-A^{T}\lambda _{3}+\lambda _{4}-\lambda _{5}=0, ~\left\| \lambda _{3}\right\| _{2}\le \lambda _{1}, ~w= \lambda _{4}+\lambda _{5}+\lambda _{6}\ge 0,~\\&\lambda _{i}\ge 0, ~i=1,2,4,5,6\rbrace . \end{aligned} \end{aligned}$$
(31)

Clearly, the problem (30) can be presented as

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} {\Psi }_{\varepsilon }(\lambda _{6})\\ &{}\mathrm {s.t.}&{} w\in \zeta , ~\lambda \in D(w),~ \mathrm {where} ~ \lambda ~ \mathrm {is} ~ \mathrm {optimal} ~ \mathrm {to} \\ &{} &{}\max _{\lambda }\lbrace -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y:\lambda \in D(w)\rbrace . \end{array} \end{aligned}$$
(32)

An optimal solution of (32) is obtained by maximizing \({\Psi }_{\varepsilon }(\lambda _{6})\) over those \(\lambda \) that maximize \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) over the feasible set of (32). Therefore, both \({\Psi }_{\varepsilon }(\lambda _{6})\) and \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) need to be maximized over the dual constraints \(\lambda \in D(w)\) for all \(w\in \zeta \). To maximize both objective functions, we consider the following model as the first relaxation of (22):

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+\alpha {\Psi }_{\varepsilon }(\lambda _{6})\\ &{}\mathrm {s.t.}&{} w\in \zeta , ~\lambda \in D(w).\\ \end{array} \end{aligned}$$
(33)

where \(\alpha >0\) is a given small parameter.

Now we develop the second type of relaxation of the bilevel optimization (22). Note that under Slater condition, for all \(w\in \zeta \), the dual objective \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) must be nonnegative and is homogeneous in \(\lambda =(\lambda _{1},\ldots ,\lambda _{6})\). Moreover, if \(w\in \zeta \), then \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) has a nonnegative upper bound due to the weak duality. Inspired by this observation, in order to maximize both \({\Psi }_{\varepsilon }(\lambda _{6})\) and \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\), we may introduce a small positive \(\alpha \) and consider the following approximation:

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ &{}\mathrm {s.t.}&{} w\in \zeta , ~\lambda \in D(w),~ -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le \alpha {\Psi }_{\varepsilon }(\lambda _{6}).\\ \end{array} \end{aligned}$$
(34)

The constraint

$$\begin{aligned} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le \alpha {\Psi }_{\varepsilon }(\lambda _{6}) \end{aligned}$$
(35)

implies that \({\Psi }_{\varepsilon }(\lambda _{6})\) might be maximized when \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) is maximized if \(\alpha \) is small and suitably chosen.

Finally, we consider the following inequality in order to develop the third type of convex relaxation:

$$\begin{aligned} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+f(\lambda _{6})\le \gamma , \end{aligned}$$
(36)

where \(\gamma \) is a given positive number, \(f(\lambda _{6})\) is a certain function depending on \(\varphi _{\varepsilon }((\lambda _{6})_{i})\), which satisfies the following properties:

(I1): \(f(\lambda _{6})\) is convex and continuous with respect to \(\lambda _{6}\in R_{+}^{n}\);

(I2): maximizing \({\Psi }_{\varepsilon }(\lambda _{6})\) over the feasible set can be equivalently or approximately achieved by minimizing \(f(\lambda _{6})\).

There are many functions satisfying the properties (I1) and (I2). For instance, we may consider the following functions:

(J1) \(e^{-{\Psi }_{\varepsilon }(\lambda _{6})}\); (J2) \(-\log ({\Psi }_{\varepsilon }(\lambda _{6})+\sigma _{1})\); (J3) \(\frac{1}{{\Psi }_{\varepsilon }(\lambda _{6})+\sigma _{1}}\); (J4) \(\frac{1}{n}\sum _{i=1}^{n} \frac{1}{\varphi _{\varepsilon }((\lambda _{6})_{i})+\sigma _{1}}\),

where \(\sigma _{1}\) is a small positive number. Now we claim that the functions (J1)-(J4) satisfy (I1) and (I2). Clearly, the functions (J1), (J2) and (J3) satisfy (I2). Note that

$$\begin{aligned} \frac{1}{{\Psi }_{\varepsilon }(\lambda _{6})+\sigma _{1}} \le \frac{1}{n}\sum _{i=1}^{n} \frac{1}{\varphi _{\varepsilon }((\lambda _{6})_{i})+\sigma _{1}}. \end{aligned}$$

Thus the minimization of \(\frac{1}{n}\sum _{i=1}^{n} \frac{1}{\varphi _{\varepsilon }((\lambda _{6})_{i})+\sigma _{1}}\) is likely to imply the minimization of \(\frac{1}{{\Psi }_{\varepsilon }(\lambda _{6})+\sigma _{1}}\), which means the maximization of \({\Psi }_{\varepsilon }(\lambda _{6})\). It is easy to check that the functions (J1)-(J4) are continuous in \(\lambda _{6}\ge 0\). It is also easy to check that (J1)-(J3) are convex for \(\lambda _{6}\ge 0\). Note that for any \(\varphi _{\varepsilon }((\lambda _{6})_{i})> -\sigma _{1},~i=1,\ldots ,n\), all functions \(\frac{1}{\varphi _{\varepsilon }((\lambda _{6})_{i})+\sigma _{1}}\) are convex. Therefore their sum is convex for \(\lambda _{6}\ge 0\) as well. Thus all functions (J1)-(J4) satisfy the two properties (I1) and (I2). Moreover, the functions (J1), (J3) and (J4) have finite values even when \((\lambda _{6})_{i}\rightarrow \infty \).
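A small numerical illustration (not from the original text) of the property (I2) for (J3) and (J4): using the merit function (6) with arbitrary values of \(\varepsilon \) and \(\sigma _{1}\), both functions decrease as \(\lambda _{6}\) becomes denser, so minimizing them promotes a dense \(\lambda _{6}\).

```python
import numpy as np

def phi(s, eps=1e-2):            # varphi_eps from merit function (6)
    return s / (s + eps)

def J3(lam6, sigma1=1e-2, eps=1e-2):
    return 1.0 / (np.sum(phi(lam6, eps)) + sigma1)

def J4(lam6, sigma1=1e-2, eps=1e-2):
    return np.mean(1.0 / (phi(lam6, eps) + sigma1))

sparse_l6 = np.array([5.0, 0.0, 0.0, 0.0])
dense_l6  = np.array([5.0, 4.0, 3.0, 2.0])
print(J3(sparse_l6), J3(dense_l6))   # J3 decreases as lambda_6 becomes denser
print(J4(sparse_l6), J4(dense_l6))   # so does J4, illustrating property (I2)
```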

Replacing \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le \alpha {\Psi }_{\varepsilon }(\lambda _{6})\) in (34) by (36) leads to the model

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ &{}\mathrm {s.t.}&{} w\in \zeta , ~\lambda \in D(w),~-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+f(\lambda _{6})\le \gamma .\\ \end{array} \end{aligned}$$
(37)

Clearly, the convexity of \(f(\lambda _{6})\) guarantees that (37) is a convex optimization. Moreover, (36) and the property (I2) of \(f(\lambda _{6})\) imply that maximizing \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) is roughly equivalent to minimizing \(f(\lambda _{6})\) over the feasible set, and thus maximizing \({\Psi }_{\varepsilon }(\lambda _{6})\). The properties (I1) and (I2) ensure that the problem (37) is computationally tractable and is a certain relaxation of (32) and (22).

5.2 One-step dual-density-based algorithm

Note that the set \(\zeta \) has no explicit form, and we need to deal with it in order to solve the three relaxation problems (33), (34) and (37). First we relax \(w\in \zeta \) to \(w\in R^{n}_{+}\) and obtain three convex optimization models. In this case, the difficulty in solving the problems (33) and (34) is that \({\Psi }_{\varepsilon }(\lambda _{6})\) might attain an infinite value when \(w_{i}\rightarrow \infty \). We may introduce a bounded merit function \({\Psi }_{\varepsilon }\in \mathbf{F} \) into (33) and (34) so that the value of \({\Psi }_{\varepsilon }(\lambda _{6})\) is finite. Moreover, to avoid an infinite optimal value in the model (33), \(w\in \zeta \) can be relaxed to \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le 1\) due to the weak duality. Based on the above observation, we obtain a solvable relaxation for (33) and (34) respectively as follows:

$$\begin{aligned} \begin{array}{cl} \max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+\alpha {\Psi }_{\varepsilon }(\lambda _{6})\\ \mathrm {s.t.}&{} w\in R^{n}_{+}, ~\lambda \in D(w),~-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le 1.\\ \end{array} \end{aligned}$$
(38)

and

$$\begin{aligned} \begin{array}{cl} \max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ \mathrm {s.t.}&{} w\in R^{n}_{+}, ~\lambda \in D(w),~ -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le \alpha {\Psi }_{\varepsilon }(\lambda _{6}).\\ \end{array} \end{aligned}$$
(39)

Due to the constraint (36), the optimal value of the problem (37) is finite if (37) is feasible. By replacing \(\zeta \) with \(R^{n}_{+}\) in (37), we also obtain a new relaxation of (22):

$$\begin{aligned} \begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ &{}\mathrm {s.t.}&{} w\in R^{n}_{+},~\lambda \in D(w),~-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+f(\lambda _{6})\le \gamma .\\ \end{array} \end{aligned}$$
(40)

Thus, a new weighted \(\ell _{1}\)-algorithm for the model (1) is developed:

figure b: Algorithm DDA (the one-step dual-density-based weighted \(\ell _1\)-algorithm)

In this paper, we consider the forms DDA(I)–DDA(III). The corresponding constants and dual-density-based problems for these algorithms are listed in Table 1.

Table 1 DDA(I)–DDA(III)
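To make the construction concrete, here is a minimal Python/cvxpy sketch of a one-step dual-density-based algorithm in the spirit of DDA, using the relaxation (38) with the bounded merit function (6) written in concave form and then solving the weighted \(\ell _{1}\)-problem (4) with the resulting weight. The choice of merit function, the parameters \(\alpha \) and \(\varepsilon \), and the final weighted \(\ell _{1}\)-step are assumptions of this sketch; the actual DDA(I)–DDA(III) settings are those of Table 1 and figure b, and the paper's implementation is in Matlab/CVX.

```python
import numpy as np
import cvxpy as cp

def DDA_sketch(A, B, y, b, eps_noise, alpha=1e-2, mu=1e-2):
    """One-step dual-density-based algorithm (a sketch of relaxation (38)):
    maximize the dual objective plus alpha * Psi(l6), where Psi is the merit
    function (6) written concavely as n - mu * sum(1/(l6_i + mu)); then take
    w = l4 + l5 + l6 and solve the weighted l1-problem (4)."""
    m, n = A.shape
    p = B.shape[0]
    l1 = cp.Variable(nonneg=True)
    l2 = cp.Variable(p, nonneg=True)
    l3 = cp.Variable(m)
    l4, l5, l6 = (cp.Variable(n, nonneg=True) for _ in range(3))
    dual_obj = -l1 * eps_noise - l2 @ b + l3 @ y
    psi = n - mu * cp.sum(cp.inv_pos(l6 + mu))        # concave surrogate of ||l6||_0
    cons = [B.T @ l2 - A.T @ l3 + l4 - l5 == 0,
            cp.norm(l3, 2) <= l1,
            dual_obj <= 1]                            # relaxation of w in zeta
    cp.Problem(cp.Maximize(dual_obj + alpha * psi), cons).solve()
    w = l4.value + l5.value + l6.value                # the weight w = l4 + l5 + l6

    x = cp.Variable(n)                                # weighted l1-problem (4) with this w
    cp.Problem(cp.Minimize(w @ cp.abs(x)),
               [cp.norm(y - A @ x, 2) <= eps_noise, B @ x <= b]).solve()
    return x.value, w
```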

5.3 Dual-density-based reweighted \(\ell _{1}\)-algorithm

Now we develop reweighted \(\ell _{1}\)-algorithms for (1) based on (32). To this end, we introduce a bounded convex set \(\mathcal {W}\) for w to approximate the set \(\zeta \). By replacing \(\zeta \) with \(\mathcal {W}\) in the models (33), (34) and (37), we obtain the following three types of convex relaxation models of (22):

$$\begin{aligned}&\begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+\alpha {\Psi }_{\varepsilon }(\lambda _{6})\\ &{}\mathrm {s.t.}&{} w\in \mathcal {W}, ~\lambda \in D(w), ~-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le 1, \end{array} \end{aligned}$$
(41)
$$\begin{aligned}&\begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ &{}\mathrm {s.t.}&{} w\in \mathcal {W}, ~\lambda \in D(w),~ -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\le \alpha {\Psi }_{\varepsilon }(\lambda _{6}), \end{array} \end{aligned}$$
(42)
$$\begin{aligned}&\begin{array}{lcl} &{}\max \limits _{(w, \lambda )}&{} -\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y \\ &{}\mathrm {s.t.}&{} w\in \mathcal {W}, \lambda \in D(w),~-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y+f(\lambda _{6})\le \gamma .\\ \end{array} \end{aligned}$$
(43)

Inspired by [37] and [39], we can choose the following bounded convex set:

$$\begin{aligned} \mathcal {W}= \biggr \lbrace w\in R_{+}^{n}: (x^{0})^{T}w\le M, 0\le w\le M^{*}e\biggr \rbrace , \end{aligned}$$
(44)

where \(x^{0}\) is the initial point, which can be the solution of the \(\ell _{1}\)-minimization (3), and M, \(M^{*}\) are two given numbers such that \(1\le M\le M^{*}\). We also consider the set

$$\begin{aligned} \mathcal {W}=\biggr \lbrace w\in R_{+}^{n}: w_{i}\le \frac{M}{\vert x^{0}_{i}\vert +\sigma _{2}}\biggr \rbrace , \end{aligned}$$
(45)

where M and \(\sigma _{2}\) are two given positive numbers. The constraints \((x^{0})^{T}w\le M\) in (44) and \(w_{i}\le \frac{M}{\vert x^{0}_{i}\vert +\sigma _{2}}\) in (45) are motivated by the existing reweighted algorithms in [8, 37, 39]. The set \(\mathcal {W}\) serves not only as a relaxation of \(\zeta \), but also as a means to ensure the boundedness of \({\Psi }_{\varepsilon }(\lambda _{6})\). Based on (44) and (45), we update \(\mathcal {W}\) in the algorithms either as:

$$\begin{aligned} \mathcal {W}^{k}= \biggr \lbrace w\in R_{+}^{n}: (x^{k-1})^{T}w\le M,~ 0\le w\le M^{*}e\biggr \rbrace , \end{aligned}$$
(46)

or

$$\begin{aligned} \mathcal {W}^{k}=\biggr \lbrace w\in R_{+}^{n}: w_{i}\le \frac{M}{\vert x^{k-1}_{i}\vert +\sigma _{2}}\biggr \rbrace . \end{aligned}$$
(47)

This yields the following algorithm (DRA for short).

figure c: Algorithm DRA (the dual-density-based reweighted \(\ell _1\)-algorithm)

The initial step of DRA is to solve DDA to obtain the initial weight \(w^{0}\) and the set \(\mathcal {W}^{1}\). Different choices of the dual-density-based problem, the dual-density-based weighted problem and the set \(\mathcal {W}\) yield different forms of DRA. In this paper, we consider the forms DRA(I)–DRA(VI). The corresponding constants, \(\mathcal {W}\), DDA and the dual-density-based weighted problems for these algorithms are listed in Table 2.

Table 2 DRA(I)–DRA(VI)

Notice that w is restricted to the bounded set \(\mathcal {W}\), so the optimal value of (41) cannot be infinite. Therefore, we can use either bounded or unbounded merit functions \({\Psi }_{\varepsilon }\in \mathbf{F} \), for example, (5), (6), (7) and (8). In addition, M cannot be too small: if M is chosen too small, there might be a gap between the maximum of \(-\lambda _{1}\epsilon -\lambda _{2}^{T}b+\lambda _{3}^{T}y\) and the maximum of \({\Psi }_{\varepsilon }(\lambda _{6})\) over the feasible set.

Existing reweighted \(\ell _{1}\)-algorithms such as RA always need an initial iterate, which is often obtained by solving a simple \(\ell _{1}\)-minimization problem. Unlike these methods, DRA(I)–DRA(VI) generate an initial iterate by themselves.

6 Numerical experiments

In this section, we demonstrate the performance of the dual-density-based reweighted \(\ell _{1}\)-algorithms DRA(I)–DRA(VI) with properly chosen parameters and merit functions. We use randomly generated examples of the convex set T in our experiments. We first set the noise level \(\epsilon \) and the parameter \(\varepsilon \) of the merit functions. The sparse vector \(x^{*}\) and the entries of A and B (if B is not deterministic) are generated from Gaussian random variables with zero mean and unit variance. For each generated \((x^{*}, A, B)\), we set y and b as follows:

$$\begin{aligned} y=Ax^{*}+\frac{c_{1}\epsilon }{\Vert c\Vert _{2}}c, ~Bx^{*}+d=b, \end{aligned}$$
(48)

where the entries of \(d\in R^{l}_{+}\) are the absolute values of Gaussian random variables with zero mean and unit variance, and \(c_{1}\in R\) and the entries of \(c\in R^{m}\) are Gaussian random variables with zero mean and unit variance. This determines the convex set T, and all examples of T are generated in this way. We use

$$\begin{aligned} \left\| x'-x^{*}\right\| / \Vert x^{*}\Vert \le 10^{-5} \end{aligned}$$
(49)

as our default stopping criterion, where \(x'\) is the solution found by the algorithm, and a trial is counted as a success whenever (49) is satisfied. In our experiments, we generate 200 random examples for each sparsity level. All the algorithms are implemented in Matlab 2018a, and all the convex problems are solved by CVX (Grant and Boyd [18]).
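For illustration, a random instance as described above could be generated along the lines of the following sketch (plain NumPy, with hypothetical names; drawing the support of \(x^{*}\) uniformly at random is an assumption not spelled out above), together with the success test (49).

```python
# Illustrative sketch (not the authors' code): one random test instance per (48)
# and the success criterion (49).
import numpy as np

def make_instance(m=50, n=200, l=50, k=10, eps=1e-4, rng=np.random.default_rng(0)):
    x_star = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)        # assumed: random support
    x_star[support] = rng.standard_normal(k)              # Gaussian nonzero entries
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((l, n))
    c1 = rng.standard_normal()                            # scalar c1 in (48)
    c = rng.standard_normal(m)
    d = np.abs(rng.standard_normal(l))                    # d >= 0
    y = A @ x_star + (c1 * eps / np.linalg.norm(c)) * c   # y = A x* + c1*eps*c/||c||_2
    b = B @ x_star + d                                    # b = B x* + d
    return x_star, A, B, y, b

def is_success(x_found, x_star, tol=1e-5):
    # criterion (49): relative l2-error at most 10^{-5}
    return np.linalg.norm(x_found - x_star) / np.linalg.norm(x_star) <= tol
```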

To demonstrate the performance of the dual-density-based reweighted \(\ell _{1}\)-algorithms listed in Table 2, we mainly consider the following two cases in our experiments:

(N1): \(A\in R^{50\times 200}\), \(B=0\) and \(b=0\);

(N2): \(A\in R^{50\times 200}\), \(B\in R^{50\times 200}\).

In both cases, we implement the algorithms DRA(I)–DRA(VI) and compare their performance in finding the sparse vectors in T with that of \(\ell _{1}\)-minimization and of the algorithm RA with different merit functions.
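For reference, the \(\ell _{1}\)-minimization baseline (3) over \(T=\lbrace x:\Vert y-Ax\Vert _{2}\le \epsilon , Bx\le b\rbrace \) can be written down directly in a modeling language; the sketch below uses CVXPY as a stand-in for the CVX package used in our experiments. In case (N1), where \(B=0\) and \(b=0\), the linear constraint is vacuous and the problem is the one associated with the model (C2).

```python
# Illustrative sketch (not the authors' code): the l1-minimization baseline (3).
import cvxpy as cp

def l1_min(A, B, y, b, eps):
    n = A.shape[1]
    x = cp.Variable(n)
    constraints = [cp.norm(y - A @ x, 2) <= eps,   # ||y - Ax||_2 <= eps
                   B @ x <= b]                     # Bx <= b
    cp.Problem(cp.Minimize(cp.norm1(x)), constraints).solve()
    return x.value
```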

6.1 Merit functions and parameters

The default parameters and merit functions in DRA(I) and DRA(II) are set as those of the algorithms in [39]. We set (6) as the default merit function for DRA(III) and DRA(IV), and set (J3) with

$$\begin{aligned} f(\lambda _{6})=\frac{1}{{\Psi }_{\varepsilon }(\lambda _{6})+\sigma _{1}}, ~ {\Psi }_{\varepsilon }(\lambda _{6})=\sum _{i=1}^{n}\frac{(\lambda _{6})_{i}}{(\lambda _{6})_{i}+\varepsilon }, ~\lambda _{6}\in R_{+}^{n} \end{aligned}$$
(50)

as the default function for DRA(V) and DRA(VI). We choose the noise level \(\epsilon =10^{-4}\) for both cases. The default parameters for each dual-density-based reweighted \(\ell _{1}\)-algorithm are summarized in Tables 3 and 4. The algorithms in Table 4 will be compared with DRA(I)–DRA(VI).

Table 3 Default parameters in algorithms
Table 4 Algorithms to be compared
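As a small numerical illustration, the default function (50) for DRA(V) and DRA(VI) can be evaluated as follows (plain NumPy; the values of \(\varepsilon \) and \(\sigma _{1}\) below are placeholders rather than the defaults of Table 3).

```python
# Illustrative sketch: evaluating Psi_eps and f from (50) for lambda_6 >= 0.
import numpy as np

def psi(lam6, varepsilon):
    # Psi_eps(lambda_6) = sum_i (lambda_6)_i / ((lambda_6)_i + eps)
    return np.sum(lam6 / (lam6 + varepsilon))

def f(lam6, varepsilon, sigma1):
    # f(lambda_6) = 1 / (Psi_eps(lambda_6) + sigma_1)
    return 1.0 / (psi(lam6, varepsilon) + sigma1)

lam6 = np.array([0.0, 0.3, 2.0])        # hypothetical nonnegative vector
print(psi(lam6, 1e-1), f(lam6, 1e-1, 1e-2))
```

Each term of \({\Psi }_{\varepsilon }\) is close to 1 when \((\lambda _{6})_{i}\gg \varepsilon \) and equals 0 when \((\lambda _{6})_{i}=0\), so \({\Psi }_{\varepsilon }(\lambda _{6})\) approximates \(\Vert \lambda _{6}\Vert _{0}\) for small \(\varepsilon \).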

Candès, Wakin and Boyd in [8] developed a reweighted algorithm which is referred to as CWB in this section. From the perspective of the reweighted \(\ell _{1}\)-algorithm (RA) in [38], CWB is a special case of RA using the merit function \(\sum _{i=1}^{n}\log (\vert x_{i}\vert +\varepsilon )\). ARCTAN is also a special case of RA, using the function (8) as the merit function for sparsity. CWB, ARCTAN and \(\ell _{1}\)-minimization (3) will be compared with DRA(I)–DRA(VI) in sparse vector recovery in this section. The parameter \(\varepsilon \) in RA is set to \(10^{-1}\) or \(10^{-5}\), and the remaining parameters are the same as those in DRA.
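For completeness, a minimal sketch of the CWB baseline is given below, again with CVXPY assumed in place of CVX. With the merit function \(\sum _{i=1}^{n}\log (\vert x_{i}\vert +\varepsilon )\), the resulting weight update is \(w_{i}=1/(\vert x_{i}\vert +\varepsilon )\) [8]; ARCTAN differs only in the weight update derived from (8), which is omitted here.

```python
# Illustrative sketch (not the authors' code): CWB as a reweighted l1-algorithm.
import cvxpy as cp
import numpy as np

def cwb(A, B, y, b, eps_noise, varepsilon=1e-1, iters=5):
    n = A.shape[1]
    w = np.ones(n)                                   # first pass: plain l1-minimization
    for _ in range(iters):
        x = cp.Variable(n)
        constraints = [cp.norm(y - A @ x, 2) <= eps_noise, B @ x <= b]
        cp.Problem(cp.Minimize(cp.norm1(cp.multiply(w, x))), constraints).solve()
        w = 1.0 / (np.abs(x.value) + varepsilon)     # CWB weight update [8]
    return x.value
```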

6.2 Case \(\mathrm {(N1)}\)

Fig. 1 (i)–(iii) Comparison of the performance of the dual-density-based reweighted \(\ell _{1}\)-algorithms performing 1 iteration and 5 iterations, respectively. (iv) Comparison of DRA and RA

Now we perform numerical experiments to show the behaviors of the dual-density-based reweighted \(\ell _{1}\)-algorithms in the two cases (N1) and (N2). Note that in the case (N1), the model (1) reduces to the sparse model (C2). The numerical results are given in Fig. 1(i)–(iii). Note that there are five legend entries in each of (i)–(iii), corresponding to \(\ell _{1}\)-minimization and the dual-density-based reweighted \(\ell _{1}\)-algorithms with one iteration or five iterations. For instance, in (ii) we compare DRA(III) and DRA(IV), each performing either one iteration or five iterations; (DRA(III),1) and (DRA(III),5) represent DRA(III) with one iteration and five iterations, respectively.

It can be seen that the dual-density-based reweighted \(\ell _{1}\)-algorithms perform better as the number of iterations increases, and in our experiment environment nearly all of them outperform \(\ell _{1}\)-minimization; the exception is DRA(I), whose performance with one or five iterations is similar to that of \(\ell _{1}\)-minimization. Panels (i)–(iii) indicate the same phenomenon: the algorithms based on (47) might achieve more improvement than the ones based on (46) when the number of iterations is increased. For example, in (iii), the success rate of DRA(VI) with five iterations improves by nearly \(25\%\) over that with one iteration at each sparsity level from 14 to 20, while DRA(V) improves by only about \(10\%\) after increasing the number of iterations. We collect the best-performing algorithms from (i)–(iii) and merge them into Fig. 1(iv), together with CWB and ARCTAN. It can be seen that DRA(IV) and DRA(VI) outperform CWB and ARCTAN, especially when \(\varepsilon \) in CWB and ARCTAN is relatively small, and they also outperform \(\ell _{1}\)-minimization.

6.3 Case \((\mathrm {N2})\)

Fig. 2 (i)–(iii) Comparison of the performance of DRA with one iteration and five iterations. (iv) Comparison of the performance of DRA and RA

Although the performance of ARCTAN and DRA(VI) is slightly better than that of DRA(II) and CWB in the case (N2), these algorithms remain competitive with one another in finding sparse vectors at high sparsity levels in many situations (see Fig. 2). The other behaviors are similar to those in the case (N1). We also compare the reweighted \(\ell _{1}\)-algorithms with the updating rules (46) and (47), shown in Fig. 3(i) and (ii), respectively. For the algorithms using (46), when five iterations are executed, Fig. 3(i) shows that DRA(III) and DRA(V) perform much better than DRA(I). For the algorithms using (47), when five iterations are executed, Fig. 3(ii) indicates that the success rates of DRA(II) and DRA(VI) in finding the sparse vectors in T are very similar.

Fig. 3 Comparison of the performance of DRA with (46) or (47)

Finally, we carry out experiments to show how the parameter \(\varepsilon \) of the merit functions affects the performance of the dual-density-based reweighted \(\ell _{1}\)-algorithms in locating the sparse vectors in T. In Fig. 4, the numerical results for the dual-density-based reweighted algorithms with different \(\varepsilon \) indicate that the performance of the DRA-type algorithms is relatively insensitive to the choice of small \(\varepsilon \). The experiments reveal that when \(\varepsilon \le 10^{-10} \), the performance of CWB and ARCTAN is almost identical to that of \(\ell _{1}\)-minimization, which is also observed in Fig. 1(iv) when \(\varepsilon =10^{-5}\).

Fig. 4 Comparison of the performance of DRA with different \(\varepsilon \)

7 Conclusions

In this paper, we have studied a class of algorithms for the \(\ell _{0}\)-minimization problem (1). The one-step dual-density-based algorithms (DDA) and the dual-density-based reweighted \(\ell _{1}\)-algorithms (DRA) are developed. These algorithms are based on a new relaxation of an equivalent bilevel optimization reformulation of the underlying \(\ell _0\)-minimization problem. Unlike RA, the DRA can generate an initial iterate automatically instead of obtaining it by solving an \(\ell _{1}\)-minimization problem. Numerical experiments show that in cases such as (N1) and (N2), the dual-density-based methods proposed in this paper can perform better than \(\ell _{1}\)-minimization in solving the sparse optimization problem (1), and they are comparable to some existing reweighted \(\ell _1\)-methods. Although the experiments have shown that DRA-type algorithms outperform \(\ell _{1}\)-minimization and some classic reweighted \(\ell _{1}\)-algorithms, some questions remain for future work. For example, the convergence and stability of DRA-type algorithms are worthwhile directions for future research, which might be investigated under certain assumptions such as the so-called restricted weak range space property (see, e.g., [34]).