Newton-Type Optimal Thresholding Algorithms for Sparse Optimization Problems

Sparse signals can be reconstructed by an algorithm that merges a traditional nonlinear optimization method with a thresholding technique. Different from existing thresholding methods, a novel technique referred to as optimal k-thresholding was recently proposed by Zhao (SIAM J Optim 30(1):31–55, 2020). This technique simultaneously performs minimization of an error metric for the problem and thresholding of the iterates generated by the classic gradient method. In this paper, we propose the so-called Newton-type optimal k-thresholding (NTOT) algorithm, which is motivated by the appreciable performance of both Newton-type methods and the optimal k-thresholding technique for signal recovery. The guaranteed performance (including convergence) of the proposed algorithms is shown in terms of suitable choices of the algorithmic parameters and the restricted isometry property (RIP) of the sensing matrix, which has been widely used in the analysis of compressive sensing algorithms. Simulation results based on synthetic signals indicate that the proposed algorithms are stable and efficient for signal recovery.


Introduction
The sparse optimization problem arises naturally in a wide range of practical scenarios such as compressed sensing [9,16,18,28], signal and image processing [6,13,15], pattern recognition [24] and wireless communications [11], to name a few. The typical problem of signal recovery via compressed sensing can be formulated as the following sparse optimization problem:

min_x (1/2)‖y − Ax‖₂²  subject to  ‖x‖₀ ≤ k,  (1)

where k is a given integer reflecting the sparsity level of the target signal x*, A ∈ R^{m×n} is a measurement matrix with m ≪ n, ‖x‖₀ is the so-called ℓ₀-norm counting the nonzero entries of the vector x, and y denotes the acquired measurements of the signal x* to be recovered. The vector y is usually represented as y = Ax* + η, where η denotes a noise vector.
Hard thresholding is the simplest thresholding approach used to generate iterates satisfying the constraint of the problem (1). Throughout the paper, we use H_k(·) to denote the hard thresholding operator, which retains the k largest magnitudes of a vector and zeroes out the others. The following iterative hard thresholding (IHT) scheme

x^{p+1} = H_k(x^p + λA^T(y − Ax^p)),

where λ > 0 is a stepsize, was first studied in [2,3]. Incorporating a pursuit (least-squares) step into IHT yields the hard thresholding pursuit (HTP) [17,5], and when λ is replaced by an adaptive stepsize similar to the one used in traditional conjugate gradient methods, it leads to the so-called normalized iterative hard thresholding (NIHT) algorithms in [4,26]. The theoretical performance of these algorithms can be analyzed in terms of the restricted isometry property (RIP) (see, e.g., [2,3,18]).
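For concreteness, the IHT scheme above can be sketched in a few lines of NumPy (a minimal illustration, not the implementation used in the paper; the function names, stepsize and iteration count are our illustrative choices):

```python
import numpy as np

def hard_threshold(u, k):
    """H_k: keep the k largest magnitudes of u and zero out the rest."""
    z = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    z[idx] = u[idx]
    return z

def iht(A, y, k, lam, iters=200):
    """Iterative hard thresholding: x^{p+1} = H_k(x^p + lam * A^T(y - A x^p))."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = hard_threshold(x + lam * A.T @ (y - A @ x), k)
    return x
```

A standard safe choice is lam ≤ 1/‖A‖₂², for which the objective of (1) is non-increasing along the iterates.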
On the other hand, the search direction A^T(y − Ax^p) of the above-mentioned algorithms is the negative gradient of the objective function of the problem (1). Such a search direction can be replaced by another direction provided that it is a descent direction of the objective function. Thus a Newton-type direction was studied in [34,21,35]. The following iterative method was proposed and referred to as NSIHT in [21]:

x^{p+1} = H_k(x^p + λ(A^TA + ǫI)^{−1}A^T(y − Ax^p)),  (2)

where ǫ > 0 is a parameter and λ > 0 is the stepsize. However, as pointed out in [29,31], the weakness of the hard thresholding operator H_k(·) is that, when applied to a non-sparse iterate generated by the classic gradient method, it may cause an ascending value of the objective of (1) at the thresholded vector compared to the objective value at its unthresholded counterpart. As a result, direct application of the hard thresholding operator to a non-sparse or non-compressible vector in the course of an algorithm may lead to significant numerical oscillation and divergence of the algorithm. To overcome this drawback of the hard thresholding operator, Zhao [29] proposed an optimal k-thresholding technique which performs thresholding and objective-value reduction simultaneously. The optimal k-thresholding iterative scheme in [29] can be simply stated as

u^p = x^p + λA^T(y − Ax^p),  x^{p+1} = Z^#_k(u^p),

where λ remains a stepsize and Z^#_k(·) is the so-called optimal k-thresholding operator. Given a vector u, the thresholded vector is Z^#_k(u) = u ⊗ w* (the Hadamard product of two vectors), where w* is the optimal solution to the following quadratic 0-1 optimization problem:

min_w { ‖y − A(u ⊗ w)‖₂² : e^T w = k, w ∈ {0, 1}^n },  (P1)

where e = (1, . . ., 1)^T ∈ R^n is the vector of ones and {0, 1}^n denotes the set of n-dimensional 0-1 vectors. To avoid solving such a binary optimization problem, an alternative approach is to solve its convex relaxation which, as pointed out in [29,31], is the tightest convex relaxation of the above problem:

min_w { ‖y − A(u ⊗ w)‖₂² : e^T w = k, 0 ≤ w ≤ e }.  (3)

Based on this convex relaxation of the operator Z^#_k(·), efficient algorithms called relaxed optimal k-thresholding (ROT) algorithms and their variants have been proposed and investigated in [29,31]. Simulations demonstrate that this new framework of thresholding methods works efficiently and overcomes the drawback of the traditional hard thresholding operator.
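As an illustration, the relaxed problem (3) is a convex quadratic program in w and can be handed to any general-purpose solver; the sketch below uses SciPy's SLSQP method (the paper itself solves such subproblems with CVX/Mosek, so this is only a stand-in, and the function name is ours):

```python
import numpy as np
from scipy.optimize import minimize

def relaxed_optimal_weights(A, y, u, k):
    """Solve min_w ||y - A(u ⊗ w)||_2^2  s.t.  e^T w = k, 0 <= w <= e,
    the tightest convex relaxation of the binary problem (P1).
    Returns the weight vector w; u * w is the thresholded vector."""
    n = len(u)
    Au = A * u  # column scaling: A(u ⊗ w) = (A diag(u)) w
    obj = lambda w: np.sum((y - Au @ w) ** 2)
    grad = lambda w: -2.0 * Au.T @ (y - Au @ w)
    res = minimize(obj, np.full(n, k / n), jac=grad, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - k}])
    return res.x
```

The uniform feasible point w = (k/n)e is used as the starting guess.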
Due to the aforementioned weakness of H_k in the Newton-type iterative method (2), it makes sense to consider a further improvement of such a method. The purpose of this paper is to combine the optimal k-thresholding with a Newton-type search direction in order to develop an algorithm that may alleviate or eliminate the drawback of the hard thresholding operator and hence enhance the numerical performance of the Newton-type method (2). The proposed algorithm is called the Newton-type optimal k-thresholding (NTOT) algorithm. The convex relaxation version of this algorithm is also studied in this paper, referred to as the Newton-type relaxed optimal thresholding (NTROT) algorithm, together with its enhanced version with a pursuit step (NTROTP for short). The guaranteed performance and convergence of these algorithms are shown under the RIP assumption as well as suitable conditions imposed on the algorithmic parameters.
The paper is organized as follows. The algorithms are described in Section 2. The theoretical performance of the proposed algorithms in noisy settings is shown in Section 3. The empirical results are demonstrated in Section 4; they indicate that under appropriate choices of the parameter and stepsize, the proposed algorithms are efficient for signal reconstruction and their performance is comparable to a few existing methods.

Algorithms
Some notations will be used throughout the paper. Let R^n denote the n-dimensional Euclidean space, and R^{m×n} the set of m × n matrices. For a vector x ∈ R^n, the ℓ₂-norm is defined as ‖x‖₂ := (Σ_{i=1}^n x_i²)^{1/2}. We use [N] to denote the set {1, . . ., n}. Given a set Ω ⊆ [N], Ω̄ := [N]\Ω denotes the complement of Ω, and x_Ω denotes the vector obtained from x by retaining the entries indexed by Ω and zeroing out the ones indexed by Ω̄. A^T denotes the transpose of the matrix A. Given a vector u, L_k(u) denotes the index set of the k largest magnitudes of u. Throughout the paper, a vector x is said to be k-sparse if ‖x‖₀ ≤ k.
Note that the gradient and Hessian of the function f(x) = (1/2)‖y − Ax‖₂² are ∇f(x) = −A^T(y − Ax) and ∇²f(x) = A^TA. For the problem (1), the Hessian A^TA is singular, and thus the classic Newton's method cannot be applied to f(x) directly. Modifying the Hessian by adding ǫI leads to the nonsingular matrix A^TA + ǫI, where ǫ is a positive parameter and I ∈ R^{n×n} is the identity matrix. We then immediately obtain the following Newton-type iterative method for the minimization of f(x):

x^{p+1} = x^p + λ(A^TA + ǫI)^{−1}A^T(y − Ax^p),

where λ is a stepsize. Different from the approach (2), we utilize the optimal k-thresholding operator instead of the hard thresholding operator to develop a Newton-type iterative algorithm, which is described as Algorithm 1.
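A single step of this regularized Newton-type iteration can be sketched as follows (a minimal illustration; the function name and parameter values are our choices, not taken from the paper):

```python
import numpy as np

def newton_type_step(A, y, x, lam, eps):
    """One Newton-type step for f(x) = (1/2)||y - Ax||_2^2 with the
    regularized Hessian A^T A + eps*I:
        u = x + lam * (A^T A + eps I)^{-1} A^T (y - A x)."""
    n = A.shape[1]
    d = np.linalg.solve(A.T @ A + eps * np.eye(n), A.T @ (y - A @ x))
    return x + lam * d
```

Since ǫ > 0, the direction d is a descent direction of f, so a sufficiently small stepsize always reduces the residual.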
Since solving the 0-1 problem (P1) is generally expensive, Zhao [29] suggested solving its tightest convex relaxation, i.e., the problem (3). This results in the Newton-type relaxed optimal k-thresholding algorithm, which is described as Algorithm 2.
If the step (P2) generates a 0-1 solution w^p, i.e., w^p is exactly a k-sparse binary vector, then u^p ⊗ w^p is exactly k-sparse, in which case the operator H_k in (P3) is superfluous. However, as the vector w^p may not necessarily be k-sparse, H_k is used in (P3) to truncate the iterate so that it satisfies the constraint of the problem (1). This is quite different from H_k(u^p), which directly performs hard thresholding on u^p, a vector that may not be sparse at all. The vector w^p is either k-sparse or admits a compressible feature, in which case performing hard thresholding on the resulting vector u^p ⊗ w^p avoids significant oscillation of the objective value of (1).
To further stabilize NTROT, a pursuit step can be performed after solving the optimization problem (P2). This leads to the algorithm called NTROTP, which is the main algorithm of concern in this paper.
The step (P5) is a pursuit step at which a least-squares problem is solved on the support of the k largest magnitudes of the vector u^p ⊗ w^p. In the next section, we establish sufficient conditions for the guaranteed performance and convergence of the algorithms NTOT, NTROT and NTROTP.
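The pursuit step can be sketched as follows (an illustrative implementation of the least-squares refit over the support of the k largest magnitudes; the function name is ours):

```python
import numpy as np

def pursuit_step(A, y, v, k):
    """Pursuit step: least-squares fit of y over the support L_k(v)
    of the k largest magnitudes of v; entries off that support are zero."""
    S = np.argsort(np.abs(v))[-k:]
    z = np.zeros(A.shape[1])
    z[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
    return z
```

If the selected support contains the true support and the corresponding columns of A are linearly independent, this refit reproduces the k-sparse signal exactly in the noiseless case.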

Theoretical Analysis
Before going ahead, let us first recall the definition of the restricted isometry constant (RIC).
Definition 1 [7] The q-th order restricted isometry constant (RIC) δ_q of a matrix A ∈ R^{m×n} is the smallest number δ_q ≥ 0 such that

(1 − δ_q)‖x‖₂² ≤ ‖Ax‖₂² ≤ (1 + δ_q)‖x‖₂²

for all q-sparse vectors x, where q is an integer number.
If δ_q < 1, we say that the matrix A satisfies the q-th order restricted isometry property (RIP). It is well known that random matrices, including Bernoulli, Gaussian and more general subgaussian matrices, satisfy the RIP of a certain order with overwhelming probability [7,9,18].

Analysis of NTOT in noisy scenarios
The following two lemmas are instrumental in establishing the main result of this section.
The first one is taken from [21], and the second one can be found in [29].
Lemma 1 [21] Let A ∈ R^{m×n} with m ≪ n be a measurement matrix. Given a vector u ∈ R^n and an index set Ω with |Ω| ≤ t for a certain integer t, and letting σ₁ and σ_m denote the largest and smallest singular values of the matrix A, respectively, one has
Lemma 2 [29] Let y = Ax + η be the measurements of the k-sparse vector x ∈ R^n, and let u ∈ R^n be an arbitrary vector. Let Z^#_k(u) be the optimal k-thresholding vector of u. Then for any k-sparse binary vector ŵ ∈ {0, 1}^n satisfying supp(x) ⊆ supp(ŵ), one has

The bound (4) follows directly from the proof of Theorem 4.3 in [29]. In fact, the inequality (4) is obtained by combining the inequality (4.5) and the first inequality of (4.7) in [29].
We now state and prove a sufficient condition for the guaranteed performance of NTOT in noisy settings.
Theorem 1 Let y = Ax + η be the measurements of the signal x ∈ R^n with measurement error η. Let S = L_k(x), and let σ₁ and σ_m denote, respectively, the largest and smallest singular values of the matrix A ∈ R^{m×n}. Suppose that the restricted isometry constant of A satisfies δ_{2k} < 0.5349 and that the parameter ǫ satisfies the condition (5). If the stepsize λ in NTOT is chosen to satisfy (6), then the sequence {x^p} generated by NTOT obeys the error bound (7), with the constants ρ and τ specified there. In particular, when x is k-sparse and η = 0, the sequence {x^p} converges to x.
Proof. Let ŵ be a k-sparse binary vector such that S ⊆ supp(ŵ), which implies x_S = x_S ⊗ ŵ. From the structure of NTOT, Z^#_k(u^p) = u^p ⊗ w^p, where w^p is the optimal solution to the problem (P1). Note that y = Ax + η = Ax_S + η′, where η′ = Ax_{S̄} + η. By Lemma 2, we immediately have (8). By the definition of u^p in NTOT, we see that (9). By the singular value decomposition of A, for any vector u ∈ R^n it is easy to verify (10). From the choices of (ǫ, λ), we have λ ≤ ǫ + σ_m² and ǫ > σ₁². Thus it follows from (9) and (10) that (11), where the first term on the right-hand side follows from Lemma 1 together with the fact that x^p − x_S is at most 2k-sparse. Combining (8) and (11) leads to (12). From (12), to guarantee the recovery of x_S by NTOT it is sufficient to ensure that ρ < 1, which is guaranteed under the choice of λ given in (6). It remains to show that the range in (6) is nonempty; this is guaranteed if the two conditions (13) and (14) are satisfied. Noting that δ_k ≤ δ_{2k}, it is straightforward to verify that the inequality (13) holds under the condition δ_{2k} < 0.5349. The inequality (14) can be rewritten in a form that is guaranteed under the choice of ǫ given in (5). Thus the desired result follows. In particular, if η = 0 and x is k-sparse, the relation (7) reduces to ‖x^{p+1} − x‖₂ ≤ ρ‖x^p − x‖₂, which implies that {x^p} converges to x as p → ∞. ✷

Analysis of NTROT in noisy scenarios
As before, we denote by S = L_k(x) the index set of the k largest magnitudes of x. The measurements are given as y = Ax + η, where η ∈ R^m is a noise vector. We first recall some useful technical results shown in [29] and [31]. Lemma 3 below is a property of the hard thresholding operator H_k, whereas Lemma 4 is a property of the solution of the optimization problem (P2) in NTROT and (P4) in NTROTP.
Lemma 3 [29] Let z ∈ R^n be a given vector and v ∈ R^n a k-sparse vector with

Lemma 4 [31] Let Λ ⊆ {1, . . ., n} be any given index set, and let w ∈ R^n be any given vector satisfying e^T w = k and 0 ≤ w ≤ e. Decompose the vector w_Λ as w_Λ = w_{Λ₁} + w_{Λ₂} + · · · + w_{Λ_q}, where q is a positive integer such that |Λ| = (q − 1)k + α with 0 ≤ α < k, w_{Λ₁} contains the first k largest magnitudes in {w_i : i ∈ Λ}, w_{Λ₂} contains the second k largest magnitudes in {w_i : i ∈ Λ}, and so on. Then one has Σ_{i=1}^q ‖w_{Λᵢ}‖_∞ < 2. The next result is implied by the proof of Theorem 4.8 in [29]. Item (i) in the lemma below is obtained immediately by combining two inequalities in that proof, so we only outline a short proof of Item (ii).
Lemma 5 Let y = Ax + η be the measurements of x, and let ŵ ∈ {0, 1}^n be a k-sparse binary vector such that S = L_k(x) ⊆ supp(ŵ). Let S^{p+1} = supp(x^{p+1}), and let u^p and w^p be defined as in NTROT. One has

Proof. Set Λ := S ∪ S^{p+1} and w := w^p in Lemma 4, and decompose the vector (w^p)_{S∪S^{p+1}} in the way described in Lemma 4. Since x_S vanishes outside S ∪ S^{p+1}, we have

where the last inequality follows from the definition of δ_k and the fact |supp(v^{(i)})| ≤ k. We also have that

where the last inequality follows from Lemma 4, which claims that Σ_{i=1}^q ‖(w^p)_{Λᵢ}‖_∞ < 2. Combining the above two inequalities yields the relation given in Item (ii) of the lemma. ✷ The main result of this section is stated as follows.
Theorem 2 Let y = Ax + η be the measurements of x ∈ R^n with measurement error η. Let S = L_k(x), and let σ₁ and σ_m denote, respectively, the largest and smallest singular values of the matrix A ∈ R^{m×n}. Suppose that the restricted isometry constant of A satisfies δ_{3k} < 0.2119.
Let ǫ be a given parameter satisfying (15). If the stepsize λ in NTROT satisfies (16), then the sequence {x^p} generated by NTROT obeys the error bound (17), with the constants ρ and τ given in (18) and (19). In particular, when x is k-sparse and η = 0, the sequence {x^p} converges to x.
Proof. Let S, σ₁, σ_m be defined as in the theorem. Note that y := Ax + η = Ax_S + η′, where η′ = Ax_{S̄} + η. Denote S^{p+1} = supp(x^{p+1}). Applying Lemma 3, we immediately have (20). In what follows, we bound each of the terms on the right-hand side of this inequality. By the definition of u^p in NTROT, we have (21). Noting that (x_S)_{S^{p+1}\S} = (x_S ⊗ w^p)_{S^{p+1}\S} = 0, we have

where the last inequality follows from Lemma 1 (with the fact |supp(x^p − x_S) ∪ (S^{p+1}\S)| ≤ 3k) and (10). We now provide an upper bound for the first term on the right-hand side of (20). Let ŵ be a k-sparse binary vector satisfying S ⊆ supp(ŵ). By Lemma 5, we have

Applying Lemma 1 and (10), we have (24). Denote Φ = L_k(x_S − u^p). As |supp(x^p − x_S) ∪ Φ| ≤ 3k, by a proof similar to (24), we also have (25). Combining (22)–(25), we obtain (26). Substituting (26) and (21) into (20) yields (17) with the constants ρ and τ given in (18) and (19), respectively. Due to the fact δ_k ≤ δ_{2k} ≤ δ_{3k}, we see from (18) that, to ensure ρ < 1, it is sufficient to require a condition which, together with λ ≤ ǫ + σ_m², implies that if the range of λ is given as in (16), then ρ < 1 is guaranteed. To ensure the existence of the interval in (16), it is sufficient to choose ǫ suitably; the resulting first condition is ensured by δ_{3k} < 0.2119, and the second by the choice of ǫ given in (15). In particular, if η = 0 and x is k-sparse, the relation (17) reduces to ‖x^{p+1} − x‖₂ ≤ ρ‖x^p − x‖₂, which implies that {x^p} converges to x as p → ∞. This completes the proof. ✷

Analysis of NTROTP in noisy scenarios
Before showing the main result, we introduce a lemma concerning a property of the pursuit step.
Lemma 6 [29] Let y = Ax + ν be the noisy measurements of the k-sparse signal x ∈ R^n, and let u ∈ R^n be an arbitrary k-sparse vector. Then the optimal solution of the pursuit step satisfies

Theorem 3 Let y = Ax + η be the measurements of the signal x ∈ R^n with measurement error η. Let S = L_k(x), and let σ₁ and σ_m denote, respectively, the largest and smallest singular values of the matrix A ∈ R^{m×n}. Suppose that the restricted isometry constant of A satisfies δ_{3k} < 0.2 and that ǫ is a given parameter satisfying (27). If the stepsize λ in NTROTP satisfies (28), then the sequence {x^p} generated by NTROTP obeys the error bound (29), with the constants ρ and τ given in (30) and (31). In particular, when x is k-sparse and η = 0, the sequence {x^p} converges to x.
Proof. NTROTP consists of NTROT followed by a pursuit step. From the proof of Theorem 2, we have (32), where the constants ρ and τ are given by (18) and (19), respectively. From the step (P5), x^{p+1} is the solution to the pursuit step. By Lemma 6, we have (33), where η′ = Ax_{S̄} + η. Using δ_k ≤ δ_{2k} ≤ δ_{3k} and combining (32) and (33) leads to the bound (29), where ρ and τ are defined in (30) and (31), respectively. Note that ρ < 1 is equivalent to a condition ensured by the choice of λ given in (28); that is, the choice of λ in (28) guarantees ρ < 1. To guarantee the existence of the range (28), it is sufficient to require the two conditions in (35). The first condition in (35) is satisfied when δ_{3k} < 0.2, and the second condition in (35) is satisfied provided ǫ is chosen large enough, i.e., satisfying (27). In particular, if η = 0 and x is k-sparse, the relation (29) reduces to ‖x^{p+1} − x‖₂ ≤ ρ‖x^p − x‖₂ ≤ ρ^{p+1}‖x^0 − x‖₂, which implies that {x^p} converges to x as p → ∞. ✷

Numerical Experiments
Simulations were performed to test the performance of the proposed algorithms with respect to residual reduction, the average number of iterations needed for convergence, and the success frequency of signal recovery. The measurement matrices generated for the experiments are Gaussian random matrices whose entries are independent and identically distributed, following the standard normal distribution N(0, 1). Nonzero entries of the realized sparse signals also follow N(0, 1), and their positions follow a uniform distribution. All optimization problems involved in the algorithms were solved by CVX, developed by Grant and Boyd [20], with the solver Mosek.
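A minimal sketch of this synthetic setup (illustrative only; the function name and random-number interface are our choices, not the authors' code):

```python
import numpy as np

def gen_problem(m, n, k, noise=0.0, rng=None):
    """Synthetic compressed-sensing instance: Gaussian A with i.i.d. N(0,1)
    entries, a k-sparse x* with N(0,1) nonzeros at uniformly random
    positions, and measurements y = A x* + noise * theta."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, n))
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    y = A @ x + noise * rng.standard_normal(m)
    return A, x, y
```

Setting `noise=0.001` reproduces the kind of inexact measurements used in the recovery experiments below.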

Residual reduction
This experiment was carried out to compare the residual-reduction performance of the algorithms for a given (ǫ, λ). We set A ∈ R^{256×512}, y = Ax*, ‖x*‖₀ = 70 and x^0 = 0. The stepsize λ and parameter ǫ are set, respectively, as

λ = 10,  ǫ = σ₁² + 1,  (36)

which guarantees that ǫ > σ₁² and λ ≤ ǫ + σ_m², where σ₁ and σ_m denote the largest and smallest singular values of the matrix. Fig. 1 demonstrates the change of the residual value, i.e., ‖y − Ax‖₂, in the course of the iterations of the algorithms. From Fig. 1 (a), it can be seen that NTROTP is more powerful than the other algorithms in residual reduction. In the same experimental environment, we also compare the residual change in the course of NSIHT and NTROT, which use different thresholding operators. Fig. 1 (b) shows that the algorithm with the optimal thresholding operator reduces the residual more efficiently than the one with the hard thresholding operator.
The performance of NTROT and NTROTP is clearly related to the choice of (ǫ, λ). Thus we test the residual-reduction performance of the proposed algorithms for different values of the parameter ǫ and stepsize λ. The results are shown in Fig. 2 and Fig. 3, respectively. In Fig. 2, the stepsize is fixed as λ = 10, and ǫ = ǫ*, 1.1ǫ*, 1.5ǫ* and 2ǫ*, where ǫ* = σ₁² + 1. In Fig. 3, the parameter is fixed as ǫ = σ₁² + 1, and the stepsize is taken as λ = 1, 2, 5, 10, respectively. Such choices of (ǫ, λ) satisfy ǫ > σ₁² and λ ≤ ǫ + σ_m². It can be seen that NTROT is more sensitive to changes of ǫ and λ than NTROTP, which is generally insensitive to the choice of (ǫ, λ). This indicates that NTROTP is a stable algorithm.

Number of iterations
Simulations were also performed to examine the impact of sparsity levels and measurement levels on the average number of iterations needed for signal reconstruction via the Newton-type iterative algorithms. In this experiment, all algorithms start from x^0 = 0 and terminate either when r := ‖x^p − x*‖₂/‖x*‖₂ ≤ 10⁻³ is met or when the maximum number of iterations (i.e., 50 iterations) is reached.
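This stopping rule can be sketched as a generic driver loop (illustrative; `step` stands for one iteration of any of the algorithms above, and the names are ours):

```python
import numpy as np

def run_until_converged(step, x0, x_star, tol=1e-3, max_iter=50):
    """Apply the iteration map `step` until the relative error
    r = ||x^p - x*||_2 / ||x*||_2 drops below tol, or max_iter is reached.
    Returns the final iterate and the number of iterations used."""
    x = x0
    for p in range(1, max_iter + 1):
        x = step(x)
        if np.linalg.norm(x - x_star) / np.linalg.norm(x_star) <= tol:
            return x, p
    return x, max_iter
```

In practice x* is known in these synthetic experiments, which is why the relative recovery error can serve as the stopping criterion.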
Fig. 4 (a) demonstrates the influence of sparsity levels on the number of iterations needed by NSIHT, NSHTP, NTROT and NTROTP to reconstruct a signal. In this experiment, the size of the measurement matrices is still 256 × 512, and the ratio k/n varies from 0.01 to 0.35. The average number of iterations is calculated over 50 random examples for each sparsity level k/n. A common feature of these algorithms is that as the sparsity level increases, the number of iterations required to reconstruct the signal also increases. We also observe that both the optimal thresholding and the pursuit step help reduce the number of iterations the algorithms require. Fig. 4 (b) compares the average number of iterations required by several algorithms at different measurement levels. The target signal is fixed as x* ∈ R⁵⁰⁰ with ‖x*‖₀ = 50, and the length of the observed vector y = Ax*, i.e., the number of measurements, varies from 50 to 300. When m/n < 0.25, no algorithm could recover the target 50-sparse signal within 50 iterations, because the measurement levels are too low for signal reconstruction. The more measurements obtained for the target signal x*, the fewer iterations needed for reconstruction, as shown in Fig. 4 (b). Both NSHTP and NTROTP could recover the signal within a relatively small number of iterations when m/n ≥ 0.35, and NTROTP needs fewer iterations than NSIHT, NSHTP and NTROT.

Performance of signal recovery
Fig. 5 compares the success frequencies of signal reconstruction via several algorithms with both exact and inexact measurements. Five algorithms are taken into account in this comparison: ℓ₁-minimization, orthogonal matching pursuit (OMP), subspace pursuit (SP), NSHTP and NTROTP. The size of the matrices is still 256 × 512. For the noisy case, the measurements of x* are set as y = Ax* + 0.001θ, where θ ∈ R²⁵⁶ is a standard Gaussian random vector. The iterative algorithms start at x^0 = 0 and terminate after 20 iterations, except for OMP, which stops after k iterations owing to the structure of the algorithm, where k = ‖x*‖₀. The choice of (ǫ, λ) is the same as in (36). The ratio k/n increases from 0.01 to 0.35, and the condition ‖x^p − x*‖₂/‖x*‖₂ ≤ 10⁻³ is used as the recovery criterion. The vertical axis in Fig. 5 represents the success rate of reconstruction, calculated over 50 random examples. The results in Fig. 5 indicate that NTROTP is stable and robust for sparse signal recovery compared with the other algorithms used in this experiment.

Conclusion
A class of Newton-type optimal k-thresholding algorithms is proposed in this paper. Under the restricted isometry property (RIP), we have proved that the NTOT, NTROT and NTROTP algorithms are guaranteed to reconstruct sparse signals with a proper choice of the algorithmic parameter and iterative stepsize. Simulations indicate that the proposed algorithms, especially NTROTP, are stable and robust for signal reconstruction.

Fig. 1 (a) Comparison of residual-reduction performances of several algorithms; (b) Residual change in the course of iterations using different thresholding operators

Fig. 5 Comparison of success frequencies of signal recovery via different algorithms