An inexact regularized proximal Newton method for nonconvex and nonsmooth optimization

This paper focuses on the minimization of the sum of a twice continuously differentiable function $f$ and a nonsmooth convex function. We propose an inexact regularized proximal Newton method based on an approximation of the Hessian $\nabla^2\!f(x)$ involving the $\varrho$th power of the KKT residual. For $\varrho=0$, we demonstrate the global convergence of the iterate sequence for a KL objective function and its $R$-linear convergence rate for a KL objective function of exponent $1/2$. For $\varrho\in(0,1)$, we establish the global convergence of the iterate sequence and its superlinear convergence rate of order $q(1\!+\!\varrho)$ under the assumption that cluster points satisfy a H\"{o}lderian local error bound of order $q\in(\max(\varrho,\frac{1}{1+\varrho}),1]$ on the strong stationary point set; and when cluster points satisfy a local error bound of order $q>1+\varrho$ on the common stationary point set, we also obtain the global convergence of the iterate sequence, and its superlinear convergence rate of order $\frac{(q-\varrho)^2}{q}$ if $q>\frac{2\varrho+1+\sqrt{4\varrho+1}}{2}$. A dual semismooth Newton augmented Lagrangian method is developed for seeking an inexact minimizer of the subproblem. Numerical comparisons with two state-of-the-art methods on $\ell_1$-regularized Student's $t$-regression, group penalized Student's $t$-regression, and nonconvex image restoration confirm the efficiency of the proposed method.

Given a data matrix $A \in \mathbb{R}^{m\times n}$ and a vector $b \in \mathbb{R}^m$, we are interested in the following nonconvex and nonsmooth composite optimization problem

$$\min_{x\in\mathbb{R}^n}\ F(x) := f(x) + g(x) \quad\text{with}\quad f(x) := \psi(Ax-b), \tag{1}$$

where $\psi: \mathbb{R}^m \to \overline{\mathbb{R}}$ and $g: \mathbb{R}^n \to \overline{\mathbb{R}}$ with $\overline{\mathbb{R}} := (-\infty, \infty]$ are proper lower semicontinuous (lsc) functions satisfying the following basic assumptions. Assumption 1: (i) ψ is twice continuously differentiable on the set A(O) − b, where O ⊂ R^n is an open subset containing dom g; (ii) g is convex and continuous relative to its domain dom g. By Assumption 1(ii), this model allows g to be the indicator function of a closed convex set in R^n, and it moreover covers the case where g is a weakly convex function.
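As a concrete (hypothetical) instance of model (1), the sketch below takes ψ to be the nonconvex Student's t loss that appears in the numerical section and g = µ‖·‖₁; the function names and the parameter values ν, µ are illustrative, not taken from the paper.

```python
import numpy as np

def f(x, A, b, nu=0.25):
    # smooth part f(x) = psi(Ax - b), with psi the (nonconvex) Student's t loss
    r = A @ x - b
    return float(np.sum(np.log(1.0 + r**2 / nu)))

def grad_f(x, A, b, nu=0.25):
    # gradient A^T psi'(Ax - b), with psi'(r) = 2r / (nu + r^2) componentwise
    r = A @ x - b
    return A.T @ (2.0 * r / (nu + r**2))

def g(x, mu=0.1):
    # nonsmooth convex part: the l1 regularizer
    return mu * float(np.linalg.norm(x, 1))

def F(x, A, b, mu=0.1, nu=0.25):
    # composite objective F = f + g of model (1)
    return f(x, A, b, nu) + g(x, mu)
```

Since ψ here is smooth but nonconvex (its second derivative changes sign), f satisfies Assumption 1(i) while F as a whole stays nonconvex.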
Model (1) has a host of applications in statistics, signal and image processing, machine learning, financial engineering, and so on. For example, the popular lasso [44] and sparse inverse covariance estimation [53] in statistics are special instances of (1) with a convex ψ. In some inverse problems, non-Gaussianity of the noise or a nonlinear relation between measurements and unknowns often leads to (1) with a nonconvex ψ (see [10]). In addition, the higher moment portfolio selection problem (see [37,55]) also takes the form of (1) with a nonconvex ψ.

Related work.
For problem (1), many types of methods have been developed. Fukushima and Mine [19] originally introduced the proximal gradient (PG) method; Tseng and Yun [47] proposed a block coordinate descent method and obtained the subsequence convergence of the iterate sequence, as well as its R-linear convergence rate under the Luo-Tseng error bound; Milzarek [30] developed a class of methods combining semismooth Newton steps, a filter globalization, and thresholding steps for (1) with g(x) = µ‖x‖₁, and achieved subsequence convergence and local q-superlinear convergence for q ∈ (1, 2]; Bonettini et al. [11] extended their variable metric inexact line-search (VMILA) method [9] by incorporating a forward-backward step, and verified the global convergence of the iterate sequence and the linear convergence rate of the objective value sequence under the uniformly bounded positive definiteness of the scaled matrices and the KL property of exponent θ ∈ (0, 1/2] of the forward-backward envelope (FBE) of F; and by using the FBE of F, initially introduced in [36], Stella et al. studied a combination of a PG step and a quasi-Newton step on the FBE of F with a line search at the iterate, verified the global convergence for a KL function F and the superlinear convergence rate under the nonsingularity of the Hessian of the FBE in [42], and obtained the same properties as in [42] for (1) but with a nonconvex g by using an Armijo line search at the PG output of the iterate in [43].
Next we mainly review inexact proximal Newton methods that are closely related to this work. This class of methods, also called inexact successive quadratic approximation (SQA) methods, finds at each iteration an approximate minimizer y^k, satisfying a certain inexactness criterion, of a subproblem of the form

$$\min_{y\in\mathbb{R}^n}\ \Theta_k(y) := \langle\nabla f(x^k), y-x^k\rangle + \tfrac{1}{2}(y-x^k)^{\top} G_k (y-x^k) + g(y), \tag{2}$$

where x^k is the current iterate and G_k is a symmetric positive definite matrix that represents a suitable approximation of the Hessian ∇²f(x^k). The proximal Newton method can be viewed as a special variable metric method, and it reduces to the PG method if G_k = γ_k I with γ_k > 0 related to the Lipschitz constant of ∇f. Note that subproblem (2) seeks a root of 0 ∈ ∇f(x^k) + G_k(x − x^k) + ∂g(x), the partially linearized version at x^k of the stationary point inclusion 0 ∈ ∂F(x), where ∂F(x) denotes the basic (limiting or Mordukhovich) subdifferential of F at x (see Section 2 for its definition). The proximal Newton method belongs to the quite general iterative framework proposed by Fischer [17] if the inexactness criterion there is used; however, that criterion is not implementable, because it involves the unknown stationary point set. As pointed out in [54], the proximal Newton method depends more or less on three key ingredients: the approximation matrix G_k, the inner solver for (2), and the inexactness criterion on y^k (i.e., the stopping criterion of the inner solver, which controls the inexactness of y^k). Since (2) takes the second-order information of f into account, the proximal Newton method enjoys a remarkable superiority over the PG method, namely a faster convergence rate.
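The reduction to the PG method mentioned above can be made explicit: with G_k = γ_k I the unique minimizer of (2) is a proximal gradient step. A minimal sketch (the generic `prox_g` callable and the ℓ₁ instance are illustrative choices, not from the paper):

```python
import numpy as np

def pg_step(x, grad_f_x, gamma, prox_g):
    # with G_k = gamma*I, subproblem (2) reduces to
    #   y = argmin_y <grad f(x), y-x> + (gamma/2)||y-x||^2 + g(y)
    #     = prox_{g/gamma}(x - grad f(x)/gamma)
    return prox_g(x - grad_f_x / gamma, 1.0 / gamma)

def prox_l1(z, t, mu=1.0):
    # prox of t*mu*||.||_1: componentwise soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - t * mu, 0.0)
```

For a general positive definite G_k the subproblem no longer has a closed-form solution, which is why an inner solver and an inexactness criterion are needed.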
Early proximal Newton methods were tailored to special instances of convex ψ and g in problem (1), such as GLMNET [18], newGLMNET [52], QUIC [20] and the Newton-Lasso method [34]. Lee et al. presented a generic version of the exact proximal Newton method in [23], achieved a global convergence result by requiring the uniform positive definiteness of G_k, and established a local linear or superlinear convergence rate, depending on the forcing term in a stopping criterion, for an inexact proximal Newton method with unit step-size. Li et al. [26] extended the exact proximal Newton method proposed in [45] for self-concordant functions f to a proximal Newton method with inexact steps, and achieved a local linear, superlinear or quadratic convergence rate, depending on the parameter in the inexactness criterion, under the positive definiteness of ∇²f on dom f. Recently, Tran-Dinh et al. [46] followed this line to propose a class of homotopy proximal Newton methods by applying a proximal Newton method, with an inexactness criterion relying on the difference between the objective value of (2) and its optimal value, to a parametric formulation of (1), and achieved a global linear convergence rate of the iterate sequence under a self-concordance assumption on f. Yue et al. [54] proposed an inexact proximal Newton method with a regularized Hessian and an inexactness condition depending on the KKT residual of the original problem, and established superlinear and quadratic convergence rates under the Luo-Tseng error bound. As far as we know, their work is the first to achieve superlinear convergence without the strong convexity of F for an implementable proximal Newton method, although Fischer [17] had achieved a superlinear convergence rate for the proposed iterative framework, covering the proximal Newton method with an impractical inexactness criterion, under the calmness of the mapping (∂F)⁻¹. Mordukhovich et al. [32] also studied a similar inexact regularized proximal Newton method, and achieved an R-superlinear convergence rate under the metric q-subregularity of ∂F for q ∈ (1/2, 1), and a quadratic convergence rate under the metric subregularity of ∂F, which is equivalent to the calmness of (∂F)⁻¹. Their metric q-subregularity condition is weaker than the Luo-Tseng error bound.
For problem (1) with g(x) = µ‖x‖₁, Byrd et al. [12] studied an inexact proximal Newton method with an approximate solution criterion determined by the KKT residual of (1). They showed that the KKT residual sequence converges to zero under the uniformly bounded positive definiteness of G_k, and obtained local superlinear and quadratic convergence rates under the positive definiteness and the Lipschitz continuity of ∇²f at stationary points. For (1) with an optimal-set-strongly-convex F, Lee and Wright [22] investigated an inexact proximal Newton method with an approximate solution criterion relying on the difference between the objective value of (2) at the iterate and its optimal value, and established a global linear convergence rate of the objective value sequence under the uniformly bounded positive definiteness of G_k. Recently, Kanzow and Lechner [21] proposed a globalized inexact proximal Newton-type (GIPN) method that switches from a Newton step to a PG step when the proximal Newton direction does not satisfy a sufficient decrease condition, and established global convergence and a superlinear convergence rate under the uniformly bounded positive definiteness of G_k and the local strong convexity of F.
From the above discussions, we see that for the nonconvex problem (1), the existing global convergence results for proximal Newton methods require the uniform positive definiteness of G_k, while the local superlinear (or quadratic) convergence results assume that F is locally strongly convex in a neighborhood of cluster points of the iterate sequence. The local strong convexity of F in a neighborhood of a stationary point implies the isolatedness of this stationary point, and hence the Luo-Tseng error bound and, consequently, the metric subregularity of the mapping ∂F. Inspired by the works [54,32], it is natural to ask whether an inexact proximal Newton method can be designed for problem (1) that possesses a superlinear convergence rate without the local strong convexity of F. In addition, we observe that when the power ϱ = 0 in the regularized Hessian (3) below, the global convergence of the iterate sequence in [54] requires the Luo-Tseng error bound, as does its linear convergence rate; moreover, their linear convergence rate result depends on the parameter of the method being upper bounded by the unknown constant of the error bound. It is therefore natural to ask whether the global convergence and a linear convergence rate of the iterate sequence for (1) can be achieved under a weaker condition. This work aims to resolve these two questions for the nonconvex and nonsmooth problem (1).

Our contributions.
Motivated by the structure of f and the work [48] for smooth unconstrained problems, we adopt the following regularized Hessian of ∇²f(x^k):

$$G_k := \nabla^2 f(x^k) + a_1\big[-\lambda_{\min}\big(\nabla^2\psi(Ax^k-b)\big)\big]_+ A^{\top}A + a_2\,[r(x^k)]^{\varrho}\, I \tag{3}$$

to construct a strongly convex quadratic approximation of F at the iterate x^k, and propose an inexact regularized proximal Newton method (IRPNM) for (1), where a_+ := max(0, a) for a ∈ R, λ_min(H) denotes the smallest eigenvalue of H, r(x^k) is the KKT residual of (1) at x^k (see (4) for its definition), and a_1 ≥ 1, a_2 > 0 and ϱ ∈ [0, 1) are constants. Different from the regularized Hessian in [48], we here use a_1[−λ_min(∇²ψ(Ax^k−b))]_+ A^⊤A rather than a term of the form [−λ_min(∇²f(x^k))]_+ I, in order to avoid the computation of the smallest eigenvalue of ∇²f(x^k) when ψ is separable. The matrix G_k in (3) is uniformly positive definite when ϱ = 0, but not when ϱ ∈ (0, 1), since, as will be shown, r(x^k) → 0 as k → ∞. Our inexactness criterion on y^k depends on the nonincrease of the objective value of (2), together with the KKT residual r(x^k) when ϱ ∈ (0, 1) and the approximate optimality of y^k when ϱ = 0; see criteria (17) and (18) below. In addition, an Armijo line search is performed along the direction d^k := y^k − x^k to achieve a sufficient decrease of F. Our contributions are summarized as follows.
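A sketch of assembling G_k as in (3) from the pieces named in the text; a₁, a₂ and ϱ are the paper's constants, while the dense eigenvalue computation is purely illustrative (for separable ψ the Hessian ∇²ψ is diagonal, so its smallest eigenvalue is read off directly). Note ∇²f(x^k) = A^⊤∇²ψ(Ax^k−b)A.

```python
import numpy as np

def regularized_hessian(A, hess_psi, r_xk, a1=1.0, a2=0.01, rho=0.5):
    # G_k = A^T hess_psi A + a1*[-lambda_min(hess_psi)]_+ A^T A
    #       + a2 * r(x^k)**rho * I          (a sketch of Eq. (3))
    # hess_psi stands for psi''(Ax^k - b); illustrative parameter values
    n = A.shape[1]
    lam_min = np.linalg.eigvalsh(hess_psi).min()
    G = (A.T @ hess_psi @ A
         + a1 * max(-lam_min, 0.0) * (A.T @ A)
         + a2 * r_xk**rho * np.eye(n))
    return G
```

The first two terms equal A^⊤(∇²ψ + a₁[−λ_min(∇²ψ)]₊ I)A, which is positive semidefinite for a₁ ≥ 1, so the residual-powered multiple of I makes G_k positive definite whenever r(x^k) > 0.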
For ϱ = 0, we achieve the global convergence of the iterate sequence for a KL function F, and its R-linear convergence rate for a KL function F of exponent 1/2, which is weaker than the Luo-Tseng error bound. In this case, our regularized proximal Newton method is similar to the VMILA in [10,11], except that a different inexactness criterion and a scaled matrix involving the Hessian of f are used. Compared with the convergence results in [11], which remove the restrictive condition (see [10, Eq. (23)]) imposed on the iterate sequence in the convergence analysis of [10], our R-linear convergence rate is obtained for the iterate sequence instead of the objective value sequence, and the required KL property of exponent 1/2 is for F itself rather than its FBE.
For ϱ ∈ (0, 1), we establish the global convergence of the iterate sequence and its superlinear convergence rate of order q(1+ϱ), under the assumption that cluster points satisfy a Hölderian local error bound of order q ∈ (max(ϱ, 1/(1+ϱ)), 1] on the strong stationary point set. This result not only extends the conclusion of [32, Theorem 5.1] to problem (1), but also discards the local strong convexity required in [12,21] for the convergence analysis of (regularized) proximal Newton methods on this class of nonconvex and nonsmooth problems. When cluster points satisfy a local error bound of order q > 1+ϱ on the common stationary point set, we also obtain the global convergence of the iterate sequence, and its superlinear convergence rate of order (q−ϱ)²/q for q > (2ϱ+1+√(4ϱ+1))/2, which bridges the gap left when the strong stationary point set is empty.
In addition, inspired by the structure of G_k, we also develop a dual semismooth Newton augmented Lagrangian method (SNALM) to compute an approximate minimizer y^k of (2) satisfying the inexactness criterion (17), and compare the performance of our IRPNM armed with SNALM with that of GIPN [21] and ZeroFPR [43] on ℓ₁-regularized Student's t-regression, group penalized Student's t-regression and nonconvex image restoration. The numerical comparisons show that IRPNM is superior to GIPN in objective values and running time, and is comparable with ZeroFPR in terms of objective values; it requires much less running time than ZeroFPR when the obtained stationary point is a strong one (as for ℓ₁-regularized and group penalized Student's t-regression), and otherwise requires more running time than the latter (as for nonconvex image restoration). This numerical performance is entirely consistent with the theoretical results.

Notation.
Throughout this paper, S^n represents the set of all n × n real symmetric matrices, S^n_+ denotes the cone of all positive semidefinite matrices in S^n, and I denotes an identity matrix whose dimension is clear from the context. For a real symmetric matrix H, ‖H‖ denotes the spectral norm of H, and H ⪰ 0 means that H ∈ S^n_+. For a closed set C ⊆ R^n, Π_C denotes the projection operator onto C, dist(x, C) denotes the Euclidean distance of a vector x ∈ R^n to C, and δ_C denotes the indicator function of C. For a vector x ∈ R^n, B(x, δ) denotes the closed ball centered at x with radius δ > 0. For a multifunction F: R^n ⇒ R^n, its graph is defined as gph F := {(x, y) | y ∈ F(x)}. A function h: R^n → (−∞, ∞] is said to be proper if its domain dom h := {x ∈ R^n | h(x) < ∞} is nonempty. For a closed proper h, its proximal mapping P_γ h and Moreau envelope e_γ h associated with a parameter γ > 0 are respectively defined as

$$P_\gamma h(x) := \mathop{\arg\min}_{z\in\mathbb{R}^n}\Big\{\tfrac{1}{2\gamma}\|z-x\|^2 + h(z)\Big\} \quad\text{and}\quad e_\gamma h(x) := \min_{z\in\mathbb{R}^n}\Big\{\tfrac{1}{2\gamma}\|z-x\|^2 + h(z)\Big\},$$

and we write Ph for P_1 h in the sequel.
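For h = ‖·‖₁, for example, both P_γh and e_γh have closed forms (soft-thresholding and a Huber-like envelope); a small sketch for checking the two definitions numerically (our illustration, not from the paper):

```python
import numpy as np

def prox_l1_norm(x, gamma):
    # P_gamma h for h = ||.||_1: componentwise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_env_l1(x, gamma):
    # e_gamma h(x) = min_z { ||z - x||^2/(2*gamma) + ||z||_1 },
    # attained at z = P_gamma h(x)
    z = prox_l1_norm(x, gamma)
    return float(np.sum((z - x)**2) / (2.0 * gamma) + np.linalg.norm(z, 1))
```

For |x| ≤ γ the envelope equals x²/(2γ), and for |x| > γ it equals |x| − γ/2, i.e., the Huber function.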

Preliminaries.
We recall the preliminary knowledge and results used in the subsequent sections.

Stationary points of problem (1).
By [40, Exercise 8.8(c)] and the twice differentiability of f on O, at any x ∈ dom g, $\widehat\partial F(x) = \nabla f(x) + \partial g(x) = \partial F(x)$, where $\widehat\partial F(x)$ and $\partial F(x)$ respectively denote the regular and basic (also called Mordukhovich) subdifferentials of F at x. A vector x ∈ dom g is called a stationary point of (1) if 0 ∈ ∂F(x), and it is called a strong stationary point if, in addition, ∇²ψ(Ax − b) ⪰ 0. We denote by S* and X* the set of stationary points and the set of strong stationary points of (1), respectively. It is not hard to verify that if A has full row rank, then X* = {x ∈ S* | ∇²f(x) ⪰ 0}. By Assumption 1(i) and the outer semicontinuity of ∂g, both S* and X* are closed. Note that X* may be empty even if S* is nonempty. As will be shown in Proposition 3.1(iv) later, S* is nonempty under the boundedness of a level set of F. A local minimizer may not be a strong stationary point, and the converse is not necessarily true either. Define the residual mapping and residual function of (1) respectively as

$$R(x) := x - Pg\big(x - \nabla f(x)\big) \quad\text{and}\quad r(x) := \|R(x)\|. \tag{4}$$

By the convexity of g, it is easy to check that 0 ∈ ∂F(x) if and only if r(x) = 0. Following [5, Definition 4.1], a vector x ∈ R^n is called an L-stationary point of (1) if for some L > 0, x = P_{L⁻¹}g(x − L⁻¹∇f(x)). By invoking [41, Lemma 4], one can check that x is a stationary point of (1) if and only if it is an L-stationary point of (1). Let ϑ: O → R be a continuously differentiable function. Consider the problem (5) and its canonical perturbation (6) induced by a parameter vector u ∈ R^n. The following proposition states the relation between an approximate stationary point of problem (5) and a stationary point of its canonical perturbation (6).
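The residual function in (4) is directly computable from a gradient and a proximal map; a sketch (with the ℓ₁ prox as an illustrative choice of g, not prescribed by the paper):

```python
import numpy as np

def kkt_residual(x, grad_f_x, prox_g):
    # R(x) = x - Pg(x - grad f(x)) and r(x) = ||R(x)||, as in (4);
    # r(x) = 0 if and only if 0 is in the subdifferential of F at x
    R = x - prox_g(x - grad_f_x)
    return float(np.linalg.norm(R))
```

As a sanity check, for f(x) = ½‖x − c‖² and g = µ‖·‖₁ the stationary point is the soft-thresholded vector of c, at which the residual vanishes.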
The metric q-subregularity of a multifunction has been used to analyze the convergence rate of the proximal point algorithm for seeking a zero of a maximally monotone operator [24], and the local superlinear and quadratic convergence rates of proximal Newton-type methods for nonsmooth composite convex optimization [32].
Definition 2.1 (see [24, Definition 3.1]) Let F: R^n ⇒ R^n be a multifunction, and consider any point (x̄, ȳ) ∈ gph F. For a given q > 0, we say that F is (metrically) q-subregular at x̄ for ȳ if there exist κ > 0 and δ > 0 such that for all x ∈ B(x̄, δ),

$$\mathrm{dist}\big(x, \mathcal{F}^{-1}(\bar{y})\big) \le \kappa\,\big[\mathrm{dist}\big(\bar{y}, \mathcal{F}(x)\big)\big]^{q}. \tag{8}$$

When q = 1, this property is called the (metric) subregularity of F at x̄ for ȳ.
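As a toy illustration (our example, not from the paper): the single-valued mapping F(x) = x³ on R is q-subregular at x̄ = 0 for ȳ = 0 with q = 1/3, but not subregular, since

```latex
\mathrm{dist}\big(x,\mathcal{F}^{-1}(0)\big) = |x| = |x^3|^{1/3}
  = \big[\mathrm{dist}\big(0,\mathcal{F}(x)\big)\big]^{1/3},
```

so (8) holds with κ = 1, q = 1/3 and any δ > 0, whereas |x| ≤ κ|x|³ fails for all sufficiently small x ≠ 0, ruling out q = 1.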
By Definition 2.1, if x̄ ∈ dom F is an isolated point, then F is subregular at x̄ for any ȳ ∈ F(x̄); and if F is closed at x̄, the subregularity of F at x̄ for ȳ ∈ F(x̄) implies its q-subregularity at x̄ for ȳ for any q ∈ (0, 1) (also known as q-order Hölderian subregularity). The convergence analysis of this work (see Section 4.2) will employ the q-subregularity of the mapping ∂F, which is equivalent to that of the mapping R defined in (4) by the following lemma. It is worth pointing out that such an equivalence was obtained in [14] and [15, page 21] for q = 1.

Lemma 2.1 Consider any x̄ ∈ S* and q > 0. If the mapping ∂F is q-subregular at x̄ for 0, then the residual mapping R is min{q, 1}-subregular at x̄ for 0. Conversely, if the mapping R is q-subregular at x̄ for 0, so is the mapping ∂F at x̄ for 0.
Definition 2.2 A proper function h: R^n → (−∞, ∞] is said to have the KL property at a point x̄ ∈ dom ∂h if there exist δ > 0, ϖ ∈ (0, ∞] and ϕ ∈ Υ_ϖ such that for all x ∈ B(x̄, δ) with h(x̄) < h(x) < h(x̄) + ϖ,

$$\varphi'\big(h(x)-h(\bar{x})\big)\,\mathrm{dist}\big(0, \partial h(x)\big) \ge 1.$$

If ϕ can be chosen to be the function t ↦ ct^{1−θ} with θ ∈ [0, 1) for some c > 0, then h is said to have the KL property of exponent θ at x̄. If h has the KL property (of exponent θ) at each point of dom ∂h, it is called a KL function (of exponent θ).
By [3, Lemma 2.1], a proper lsc function h: R^n → (−∞, ∞] has the KL property of exponent 0 at every noncritical point (i.e., every point at which the limiting subdifferential of h does not contain 0). Thus, to show that a proper lsc function is a KL function of exponent θ ∈ (0, 1), it suffices to check its KL property of exponent θ ∈ (0, 1) at all critical points. For the calculation of the KL exponent, we refer the reader to [25,51,50].
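A standard one-dimensional illustration (ours, not from the paper): for h(x) = |x|^p with p > 1, the critical point x̄ = 0 has KL exponent θ = 1 − 1/p, since with ϕ(t) = ct^{1−θ} the KL inequality reads

```latex
\varphi'\big(h(x)-h(0)\big)\,\mathrm{dist}\big(0,\partial h(x)\big)
  = c\,(1-\theta)\,|x|^{-p\theta}\cdot p\,|x|^{p-1} \;\ge\; 1,
```

which holds near 0 (for a suitable c > 0) exactly when pθ = p − 1, i.e., θ = 1 − 1/p. In particular, h(x) = x² has KL exponent 1/2, the exponent driving the R-linear rate in this paper.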
Next we discuss the relation between the q-subregularity of ∂F for q > 0 and the KL property of F with exponent θ ∈ (0, 1), which together with Assumption 2 improves the conclusions of [35, Propositions 3.1 & 3.2], an unpublished manuscript. It is worth emphasizing that this assumption is weaker than the one on the stationary values of F used in [29,25] (see Example 2.1), and holds automatically if F is a KL function by Remark 2.2.
Proposition 2.2 (i) Under Assumption 2, the q-subregularity of the mapping ∂F at x̄ for 0 implies that the function F has the KL property of exponent max{1/(2q), 1/(1+q)} at x̄.
(ii) If F has the KL property of exponent 1/(2q) for q ∈ (1/2, 1] at x̄ and x̄ is a local minimizer of (1), then ∂F is (2q−1)-subregular at x̄ for 0.
Proof: (i) From the expression of F and the descent lemma, for any x, y ∈ B(x̄, ε) ∩ dom g and any v ∈ ∂F(x), with ε the same as in (9), an upper estimate of F(y) − F(x) follows, where the first inequality is due to the descent lemma by (9), and the last one uses v − ∇f(x) ∈ ∂g(x) and the convexity of g. Since ∂F is q-subregular at x̄ for 0, there exist ε > 0 and κ > 0 such that (8) holds for all z ∈ B(x̄, ε); the last inequality then uses (8). Consequently, dist(0, ∂F(x)) ≥ c[F(x) − F(x̄)]^{max{1/(2q), 1/(1+q)}} for a constant c > 0 (depending on κ and L′), so the function F has the KL property of exponent max{1/(2q), 1/(1+q)} at x̄.
(ii) Since F has the KL property of exponent 1/(2q) for q ∈ (1/2, 1] at x̄, there exist ε > 0, ϖ > 0 and c > 0 such that the KL inequality with ϕ(t) = ct^{1−1/(2q)} holds for all admissible z ∈ B(x̄, ε). From Assumption 1(i)-(ii), the function F is continuous at x̄ relative to dom g. Together with the local optimality of x̄, there exists δ > 0 such that (12) holds. Define the auxiliary function F̃ and the associated set S_F̃.

By using the expression of F̃ and combining with the last inequality, we obtain (14), so the condition of [49, Corollary 2(ii)] is satisfied with γ = (2q−1)/(2q) and a = 1/c. From [49, Corollary 2(ii)], for any z ∈ B(x̄, δ/2), max(F̃(z), 0) ≥ [c⁻¹dist(z, S_F̃)]^{2q/(2q−1)}, which, by the expressions of F̃ and S_F̃, is equivalent to (15). Note that every point of S_F̃ is a local minimizer of F. Clearly, S_F̃ ⊆ (∂F)⁻¹(0). Now we argue that ∂F is (2q−1)-subregular at x̄ for 0. Pick any x ∈ B(x̄, δ/2). It suffices to consider x ∉ S_F̃: in this case, (16) follows from (12) and (15); otherwise, (16) holds trivially along with (15) and dist(x, (∂F)⁻¹(0)) = 0. Thus, inequality (16) holds for all x ∈ B(x̄, δ/2), and ∂F is (2q−1)-subregular at x̄ for 0. ✷

Remark 2.1 (a) The local optimality of x̄ in Proposition 2.2(ii) is sufficient but not necessary. For example, consider problem (1) with the function f from [33, Section 1.2.3] and g ≡ 0. It is easy to verify that S* = {x^{1,*}, x^{2,*}, x^{3,*}} with x^{1,*} = (0, 0)⊤, x^{2,*} = (0, −1)⊤ and x^{3,*} = (0, 1)⊤. Clearly, Assumption 2 holds at each x^{i,*} for i = 1, 2, 3. Moreover, ∂F = ∇f is q-subregular at x^{1,*} for 0 for any q ∈ (0, 1]. By Proposition 2.2(i), F has the KL property of exponent 1/(2q) at x^{1,*}. However, x^{1,*} is not a local minimizer of F.

(b) Under the assumption of Proposition 2.2(ii), F has the growth at x̄ given in (15). With the example in part (a), one can verify that such growth does not necessarily hold if the local optimality assumption on x̄ is replaced by Assumption 2. The growth of F in (15) has the same order as the one obtained by Mordukhovich et al. [31] under the q-subregularity of ∂F at x̄ for 0 and an additional assumption on F.

(c) When F is locally strongly convex in a neighborhood of x̄ ∈ S*, there exist δ > 0 and c > 0 such that F(x) − F(x̄) ≥ c‖x − x̄‖² for all x ∈ B(x̄, δ), which by [3, Section 4.1] means that F has the KL property of exponent 1/2 at x̄. In fact, in this case dist(0, ∂F(x)) ≥ c(F(x) − F(x̄))^{1/2} for all x ∈ B(x̄, δ), which by Proposition 2.2(ii) means that the mapping ∂F is also subregular at x̄ for 0.

Remark 2.2 If F has the KL property at x̄ ∈ S*, then Assumption 2 holds at x̄. Indeed, if Assumption 2 failed at x̄, there would exist a sequence violating the KL inequality of F at x̄, a contradiction.

Inexact regularized proximal Newton method.

Now we describe our inexact regularized proximal Newton method (IRPNM) for solving (1). Let x^k be the current iterate. As mentioned previously, we adopt G_k defined in (3) to construct the quadratic approximation model (2) at x^k. When ψ is convex, this quadratic model reduces to the one used in [54,32]. Since G_k is positive definite, the cost function of (2) is strongly convex, so it has a unique minimizer, denoted by x̄^k. For model (2), one may use the coordinate gradient descent method [47] as in [54], or the line-search PG method as in [21], to seek an approximate minimizer y^k. Inspired by the structure of G_k, we develop in Section 5 a dual semismooth Newton augmented Lagrangian (SNALM) method to seek an approximate minimizer y^k.
To guarantee that our IRPNM has the desired convergence, we require y^k to satisfy the inexactness criteria (17) and (18), where η ∈ (0, 1) and τ ≥ ϱ are constants, and r_k is the KKT residual of (2), given by

$$r_k(y) := \big\|y - Pg\big(y - \nabla f(x^k) - G_k(y - x^k)\big)\big\|.$$

As will be shown in Lemma 3.1 below, a vector y^k satisfying (18) is actually an exact minimizer of a canonical perturbation of (2). By [25, Lemma 4.1], criterion (17) is the same as the one used in [32]. It is also weaker than the one used in [54]. Indeed, let ℓ_k be the partial first-order approximation of F at x^k:

$$\ell_k(y) := f(x^k) + \langle\nabla f(x^k), y - x^k\rangle + g(y).$$

One can verify that the criterion of [54] holds for some ζ ∈ (0, 1) by using the positive definiteness of G_k and relation (20). The distance involved in (18) seems difficult to compute in practice, but when choosing SNALM as the inner solver, one can easily obtain an element ω^k ∈ ∂Θ_k(y^k) (see Section 5.1.1), and hence the upper bound ‖ω^k‖ for dist(0, ∂Θ_k(y^k)). Thus, criterion (18) is guaranteed to hold by requiring ‖ω^k‖ ≤ ηr(x^k). It is worth pointing out that one could replace the first inequality in (18) with r_k(y^k) ≤ ηr(x^k), but such y^k would be less accurate by (20), leading to more iterations and longer running time.
With an inexact minimizer y^k of subproblem (2), we perform an Armijo line search along the direction d^k := y^k − x^k to obtain a step-size α_k > 0 such that the objective value of problem (1) gains a sufficient decrease. The algorithm then takes as the next iterate whichever of x^k + α_k d^k and y^k has the smaller objective value. Now we are ready to summarize the iteration steps of our IRPNM.
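A sketch of the backtracking loop behind the line search: α_k = β^{m_k} is the first trial step satisfying a sufficient decrease along d^k = y^k − x^k. The exact decrease quantity used in criterion (23) is not reproduced here, so we pass it in abstractly as `delta` (a negative predicted-decrease measure); σ, β and the backtracking cap are illustrative values.

```python
def armijo_search(F, x, d, delta, sigma=1e-4, beta=0.5, max_m=50):
    # find the smallest m with F(x + beta**m * d) <= F(x) + sigma * beta**m * delta,
    # where delta < 0 measures the predicted decrease along the descent direction d
    Fx = F(x)
    alpha = 1.0
    for _ in range(max_m):
        if F(x + alpha * d) <= Fx + sigma * alpha * delta:
            return alpha
        alpha *= beta
    return alpha
```

Because d^k is a descent direction whenever x^k is nonstationary (see Lemma 3.1), the loop terminates after finitely many halvings.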
Set α_k := β^{m_k} and end (for).

Remark 3.1 (a) Different from the proximal Newton methods in [54,32], the next iterate x^{k+1} of Algorithm 1 may take x^k + α_k d^k or y^k, determined by their objective values.
To the best of our knowledge, such a technique first appeared in [10, Algorithm 1]; it is crucial to achieving the global convergence of the iterate sequence {x^k}_{k∈N} for ϱ = 0 under the KL property of F (see Theorem 4.1). Note that a standard abstract convergence scheme adopted in the KL framework [4] usually requires a relative error condition at the iterates, while such a selection allows one to employ the relative error condition at y^k (which might not be the next iterate) in the convergence proof. This also gives Algorithm 1 an advantage in seeking a stationary point with a smaller objective value.
(b) The line search criterion in [54, Eq. (7)] implies the one in (23). Indeed, by equality (22) and the convexity of g, (23) holds. This implication suggests that (23) may require less computation time than the criterion in [54]. (c) From criterion (17) or (18), for each k ∈ N, r_k(y^k) ≤ ηr(x^k), which combined with the expressions of r(x^k) and r_k(y^k) bounds r(x^k) in terms of ‖d^k‖. By the boundedness of the sequence {x^k}_{k∈N} (see Proposition 3.1 below), we have r(x^k) ≤ c_0‖d^k‖ for a constant c_0 > 0. Thus, d^k = 0 implies that x^k is a stationary point of (1). This explains why Algorithm 1 also adopts ‖d^k‖ ≤ ε_0 as a termination condition.
(d) Now we clarify the relation between Algorithm 1 and the globalized inexact proximal Newton-type method (GIPN) proposed in [21]. Assume that H_k and η_k in step (S.1) of [21, Algorithm 3.1] take G_k and η min{1, [r(x^k)]^τ}, respectively. Then, under the unit step-size and the condition that there exist κ > 0 and k_0 ∈ N such that (27) holds, the sequence {x^k}_{k≥k_0} generated by [21, Algorithm 3.1] is the same as the one yielded by Algorithm 1. Indeed, using the same arguments as those for Lemma 3.1 below, one can show that the inexactness criterion in [21, Eq. (13)] is well defined. In addition, by using equality (22), G_k ⪰ a_2[r(x^k)]^ϱ I and condition (27), it follows that [21, condition (14)] is satisfied with ρ = (1/2)a_2κ^{−ϱ/q̂} and p = 2 + ϱ/q̂ when k ≥ k_0. Thus, we conclude that the iterates generated by [21, Algorithm 3.1] for k ≥ k_0 are the same as those yielded by Algorithm 1 with ϱ > 0.
Before analyzing the convergence of Algorithm 1, we need to verify that it is well defined, i.e., that the inexactness conditions in (17) and (18) are feasible and that the line search in (23) terminates in a finite number of steps.

Lemma 3.1 For each iteration of Algorithm 1, the conditions in (17) and (18) are satisfied by any y ∈ dom g sufficiently close to the exact solution x̄^k of (2), and there exists an integer m_k ≥ 0 such that the descent condition in (23) is satisfied.
Proof: Fix any k ∈ N and consider the iterate x^k. Assume that x^k is not a stationary point of (1). Since r_k(x̄^k) = 0, by the continuity of r_k relative to dom g, for any y ∈ dom g sufficiently close to x̄^k, r_k(y) ≤ η min{r(x^k), [r(x^k)]^{1+τ}}. In addition, x̄^k is the unique optimal solution of (2), and Θ_k is continuous relative to dom g by Assumption 1(ii), so for any y ∈ dom g sufficiently close to x̄^k, Θ_k(y) ≤ Θ_k(x^k). Together, these two facts show that criterion (17) is satisfied by any y ∈ dom g sufficiently close to x̄^k.
For the first condition in (18), consider problems (5) and (6) with ϑ given by the smooth part of Θ_k. For such y, letting y_p = Pg(y − ∇ϑ(y)) and using Proposition 2.1, it follows that dist(0, ∂Θ_k(y_p)) ≤ η min{r(x^k), [r(x^k)]^{1+τ}}. This, together with the above discussion, shows that criterion (18) is satisfied by any y ∈ dom g close to x̄^k. For the last part, we only need to consider d^k ≠ 0. From the convexity of g and the definition of the directional derivative, F′(x^k; d^k) < 0 by [39, Theorem 23.1], d^k = y^k − x^k and (25), i.e., d^k is a descent direction of F at x^k. This implies that the line search step in (23) is well defined. ✷

To close this section, we summarize the properties of the sequence {x^k}_{k∈N} generated by Algorithm 1, which will be used in the subsequent convergence analysis.

Proposition 3.1 Let {x^k}_{k∈N} be generated by Algorithm 1, and denote by ω(x^0) its cluster point set. Suppose that L_F(x^0) := {x ∈ dom g | F(x) ≤ F(x^0)} is bounded. Then, (i) the sequence {F(x^k)}_{k∈N} is nonincreasing and convergent; (ii) the sequence {x^k}_{k∈N} is bounded; (iii) lim_{k→∞} r(x^k) = 0 and lim_{k→∞} ‖d^k‖ = 0; (iv) ω(x^0) ⊂ S* is a nonempty compact set, and F ≡ F̄ := lim_{k→∞} F(x^k) on ω(x^0).
From the expressions of G_k and µ_k, G_k ⪰ µ_k I, which together with the last inequality and the expression of R_k(y^k) yields a bound on ‖d^k‖, where the last inequality uses r_k(y^k) ≤ ηr(x^k), implied by (17) or (18). By part (ii) and the continuity of r relative to dom g, there exists a constant τ̄ > 0 such that ‖d^k‖ ≤ τ̄ for all k ∈ N. Let B denote the unit ball in R^n. Assumption 1(i) implies that ∇f is Lipschitz continuous on the compact set L_F(x^0) + τ̄B with some Lipschitz constant L_{∇f}. If the line search criterion (23) is violated for the step-size t_k := α_k/β, then from the convexity of g, the definition of ℓ_k, the mean-value theorem and (25), a lower bound on α_k follows, as recorded in (30). Combining (30) and (29) with (23), part (i), (26) and µ_k = a_2[r(x^k)]^ϱ, and using the boundedness of {‖G_k‖}, we get lim_{k→∞} r(x^k) = 0. Together with (28), it follows that lim_{k→∞} ‖d^k‖ = 0. (iv) By part (ii), the set ω(x^0) is nonempty and compact. Pick any x̄ ∈ ω(x^0). There is an index set K ⊂ N such that lim_{K∋k→∞} x^k = x̄. From part (iii) and the continuity of r relative to dom g, we have x̄ ∈ S*, and hence ω(x^0) ⊂ S*. Note that {x^k}_{k∈K} ⊂ dom g and F is continuous relative to dom g by Assumption 1(i)-(ii). Then, F(x̄) = F̄, which shows that F is constant on the set ω(x^0). ✷

Convergence analysis of Algorithm 1.
The following assumption will often be used in the convergence analysis of this section.

Assumption 3
The level set L F (x 0 ) of the function F is bounded.
We write X := R^n × R^n × S^n and endow the space X with the norm defined by ‖w‖ := (‖x‖² + ‖y‖² + ‖H‖²_F)^{1/2} for w = (x, y, H) ∈ X, where ‖·‖_F denotes the Frobenius norm induced by the trace inner product in S^n. Define the potential function Φ. By the expression of Φ and the inexactness condition (18), the following result holds.
Proof: Since lim_{k→∞} d^k = 0 by Proposition 3.1 (iii), there exists k̄ ∈ N such that where τ̄ is the same as in the proof of Proposition 3.1 (iii). From the expression of w^k and (29), it follows that where the second inequality also uses ‖d^k‖ ≤ 1. By the expression of G_k, inequality (29) and ̺ = 0, we have The desired result then follows from (32). ✷ Theorem 4.1 If F is a KL function, then under Assumption 3, Σ_{k=0}^∞ ‖x^{k+1} − x^k‖ < ∞ and the sequence {x^k}_{k∈N} converges to a stationary point of (1).
Proof: If F(x^{k_1}) = F(x^{k_1+1}) for some k_1 ∈ N, then by Step (4) we have d^{k_1} = 0. Together with (26), r(x^{k_1}) = 0, i.e., x^{k_1} is a stationary point of (1), and Algorithm 1 stops within a finite number of steps. Hence, it suffices to consider the case F(x^k) > F(x^{k+1}) for all k ∈ N. By the expression of Φ and (24), for each k ∈ N, where F̄ is the same as the one in Proposition 3.1 (iv). Consider the set Ω := {(x, x, G(x)) ∈ X | x ∈ ω(x^0)}, where G is the mapping defined by G(x) := ∇²f(x) + a_1[−λ_min(∇²ψ(Ax))]_+ A^⊤A for x ∈ domg. By Proposition 3.1 (iv), the set Ω is nonempty and compact and the function Φ is constant, i.e., Φ ≡ F̄, on Ω. Since F is assumed to be a KL function, Φ is also a KL function. By invoking [8, Lemma 6], there exist ε > 0, ϖ > 0 and ϕ ∈ Υ_ϖ such that for all (x, y, H) By Proposition 3.1 (iii) and the continuity of G relative to domg, it holds that lim_{k→∞} dist((x^k, y^k, G_k), Ω) = 0. By the proof of Lemma 4.1, the sequence {y^k}_{k∈N} is bounded, which along with {y^k}_{k∈N} ⊂ domg and Assumption 2 (i)-(ii) implies that the sequence {F(y^k)}_{k∈N} is bounded. We claim that lim_{k→∞} F(y^k) = F̄. If not, by (33) there necessarily exists an index set K ⊂ N such that lim_{K∋k→∞} F(y^k) = F* > F̄. Since {y^k}_{k∈K} is bounded, there exists an index set K_1 ⊂ K such that lim_{K_1∋k→∞} y^k = y*, which along with lim_{k→∞} d^k = 0 yields y* ∈ ω(x^0). Thus, from the continuity of F, F* = lim_{K∋k→∞} F(y^k) = lim_{K_1∋k→∞} F(y^k) = F(y*) = F̄, which contradicts F* > F̄. Thus, the stated limit lim_{k→∞} F(y^k) = F̄ holds. Along with the expression of Φ, lim_{k→∞} ‖y^k − x^k‖ = 0 and the positive definiteness of G_k, we have lim_{k→∞} Φ(x^k, y^k, G_k) = F̄. Thus, there exists k̃ ∈ N such that for all k ≥ k̃, and By Lemma 4.1, for each k ≥ k̄ where k̄ is the same as the one in Lemma 4.1, there exists In addition, from the proof of Proposition 3.1 (iii), α_k ≥ α for each k ∈ N, which along with (23)-(24) implies that for each k ∈ N, Fix any k ≥ k̂ := max{k̄, k̃} + 1. Since
ϕ′ is nonincreasing on (0, ϖ) by the concavity of ϕ, combining (34) with (33) and using Together with the concavity of ϕ and inequality (35), it follows that By summing this inequality from k to l > k, we immediately obtain that where the second inequality uses the nonnegativity of ϕ(F(x^{l+1}) − F̄). Thus, Passing to the limit l → ∞ in this inequality yields Σ_{i=k}^∞ ‖y^i − x^i‖ < ∞. Note that Now we are ready to establish the linear convergence rate of the sequence {x^k}_{k∈N}.
Theorem 4.2 If F is a KL function of exponent 1/(2q) with q ∈ (1/2, 1], then under Assumption 3 the sequence {x^k}_{k∈N} converges to a point x̄ ∈ S*, and there exist constants γ ∈ (0, 1) and c_1 > 0 such that for all sufficiently large k, for q ∈ (1/2, 1).
Proof: By following the proof of [25, Theorem 3.6], we can verify that the function Φ has the KL property of exponent 1/(2q) at every point of Ω, the set appearing in the proof of Theorem 4.1. For each k ∈ N, write ∆_k := Σ_{i=k}^∞ ‖y^i − x^i‖. Fix any k ≥ k̂, where k̂ is the same as in the proof of Theorem 4.1. From inequality (37), where ϕ(t) = ct^{1−1/(2q)} (t > 0) for some c > 0. From the expression of ϕ and (36), (F(x^k) − F̄) Together with the last inequality, we obtain where the second inequality is due to lim_{k→∞} d^k = 0. When q = 1, we have The conclusion holds with γ = γ_1/(1+γ_1) and c_1 = ∆_{k̂}(γ_1/(1+γ_1))^{−k̂}. When q ∈ (1/2, 1), from the last inequality we have ∆ for all k ≥ k̂. By using this inequality and following the same analysis techniques as those in [2, Page 14], we obtain It is worth emphasizing that the conclusion of Theorem 4.2 requires the KL property of F with exponent 1/(2q), though its proof is completed by using that of the function Φ.

Convergence analysis for 0 < ̺ < 1.
To analyze the global convergence and superlinear convergence rate of the sequence {x^k}_{k∈N} generated by Algorithm 1 with 0 < ̺ < 1, we need to establish several lemmas. First of all, by noting that for each k ∈ N, and using the monotonicity of ∂g and G_k ⪰ μ_k I, one can bound the error between y^k and x^k as follows.
Lemma 4.2 For each k ∈ N, it holds that By the monotonicity of ∂g, we have Combining this inequality with G_k ⪰ μ_k I and using (17) Then, the desired result holds. ✷ Next we bound the distance from x^k to the exact minimizer x̄^k of (2) by dist(x^k, S*), which extends the result of [54, Lemma 4] to the nonconvex setting. Lemma 4.3 Consider any x̄ ∈ ω(x^0). Suppose that Assumption 3 holds and that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ. Then, there exists ε_0 > 0 such that for all Take Fix any x^k ∈ B(x̄, ε_0/2). Pick any x^{k,*} ∈ Π_{S*}(x^k). Since x̄ ∈ S* by Proposition 3.1 (iv), we have From the monotonicity of ∂g, it follows that Together with G_k ⪰ μ_k I and the triangle inequality, we obtain that where the last inequality uses (39) with x = x^k and x′ = x^k + t(x^{k,*} − x^k). The conclusion then follows by using The following lemma bounds Λ_k, defined in Lemma 4.3, in terms of dist(x^k, X*). This result appeared earlier in [48, Lemma 5.2], and we here provide a simpler proof.
where ε_0 and Λ_k are the same as those appearing in Lemma 4.3.
Proof: Fix any x^k ∈ B(x̄, ε_0/2). By the expression of Λ_k, it suffices to consider that λ_min(∇²ψ(Ax^k − b)) < 0. Pick any x^{k,*} ∈ Π_{X*}(x^k). From x̄ ∈ X*, we have ‖x^{k,*} − x̄‖ ≤ 2‖x^k − x̄‖ ≤ ε_0, and consequently x^{k,*} ∈ B(x̄, ε_0) ∩ domg. From x^{k,*} ∈ X*, we have where the first inequality is by the Lipschitz continuity of the function S^n ∋ Z ↦ λ_min(Z) with modulus 1, and the second one uses (38). This shows that the desired result holds. The proof is completed. ✷ Remark 4.1 (a) When X* in Lemma 4.4 is replaced by S*, the conclusion may not hold. For example, consider (1) with g ≡ 0 and f given by Remark 2.1 (a), and x̄ = (0, 0 Next we show that under an error bound condition on X*, when the iterates are close enough to a cluster point, the unit step-size must occur. It is worth pointing out that this result also holds without the error bound condition if the parameter ̺ of Algorithm 1 is restricted to (0, 1/2); see Appendix B. Lemma 4.5 Fix any x̄ ∈ ω(x^0). Suppose that Assumption 3 holds, that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ, and that a local Hölderian error bound of order q > ̺ holds at x̄ on X*, i.e., there exist ε > 0 and κ > 0 such that dist(x, X*) ≤ κ[r(x)]^q for all x ∈ B(x̄, ε). Then, there exists k̄ ∈ N such that α_k = 1 for all x^k ∈ B(x̄, ε_1) with k ≥ k̄, where ε_0 is the same as the one in Lemma 4.3.
Proof: Since x̄ ∈ ω(x^0), there is an index set K such that lim_{K∋k→∞} x^k = x̄, which means that for all k ∈ K large enough, dist(x^k, X*) ≤ κ[r(x^k)]^q, and x̄ ∈ X* then follows by passing K ∋ k → ∞ and using Proposition 3.1 (iii). Recall that d^k = y^k − x^k for each k ∈ N. By invoking Lemmas 4.2-4.4, for all x^k ∈ B(x̄, ε_1), By Proposition 3.1 (iv), dist(x^k, S*) ≤ dist(x^k, ω(x^0)) for each k ∈ N, which along with lim_{k→∞} dist(x^k, ω(x^0)) = 0 implies that Π_{S*}(x^k) ⊂ L_F(x^0) + τ̄B for each k ∈ N (if necessary by increasing τ̄). Thus, for each k ∈ N, with any x^{k,*} ∈ Π_{S*}(x^k), we have From X* ⊆ S* and the given error bound assumption, for all x^k ∈ B(x̄, ε_1), we have In addition, from (40) and the error bound assumption, for all From the last four inequalities, for all x^k ∈ B(x̄, ε_1), it follows that From (41)-(42), the boundedness of {G_k}, and lim_{k→∞} dist(x^k, S*) = 0 by Proposition 3.1 (iv), we conclude that Fix any integer m ≥ 0. By using (39), where the last inequality is due to Λ_k ≥ 0 and ∇²f(x^k) + Λ_k A^⊤A ⪰ 0. In addition, from the convexity of the function g and β^m ∈ (0, 1], it follows that where the last inequality is by (25). Adding the last two inequalities together yields Recall that for all k ≥ k̄. This means that (43) holds with m = 0 if k ≥ k̄ and x^k ∈ B(x̄, ε_1). The desired result then follows. ✷ Inspired by the proof of [32, Theorem 5.1], we establish the global convergence and superlinear convergence rate of {x^k}_{k∈N} generated by Algorithm 1 with ̺ ∈ (0, 1). Theorem 4.3 Let {x^k}_{k∈N} be the sequence generated by Algorithm 1 with ̺ ∈ (0, 1). Consider any x̄ ∈ ω(x^0). Suppose that Assumption 3 holds, that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ, and that a local Hölderian error bound of order q ∈ (max(̺, 1/(1+̺)), 1] holds at x̄ on X*. Then, the sequence {x^k}_{k∈N} converges to x̄ with a Q-superlinear convergence rate of order q(1+̺).
Proof: From the proof of Lemma 4.5, we have x̄ ∈ X*. Let k̄ ∈ N, ε_0 and ε_1 be the same as in Lemma 4.5. Since lim_{k→∞} r(x^k) = 0 by Proposition 3.1 (iii), r(x^k) ≤ 1 for all k ≥ k̄ (if necessary by increasing k̄). Since lim_{k→∞} dist(x^k, S*) = 0 by Proposition 3.1 (iv), from (42) and τ ≥ ̺, there exists c_1 > 0 such that Also, from Lemma 4.5, α_k = 1 for all x^k ∈ B(x̄, ε_1) with k ≥ k̄. We next argue that for all k ≥ k̄, once x^k ∈ B(x̄, ε_1), where the last inequality uses (40). Note that Using the expressions of r and r_k and inequality (44) yields that Combining this inequality with (46) and using Lemma 4.4, the local Hölderian error bound of order q at x̄ on X*, and inequality (40), we obtain Since q + q̺ > 1 and lim_{k→∞} dist(x^k, S*) = 0, this inequality yields the stated relation in (45) (if necessary by increasing k̄). Then, for any σ ∈ (0, 1), there exists ε ∈ (0, ε_1] such that the following holds. We argue by induction that if some iterate x^{k_0} ∈ B(x̄, ε) with k_0 ≥ k̄, then α_k = 1 and x^{k+1} = y^k ∈ B(x̄, ε_2) for all k ≥ k_0. Indeed, as x̄ ∈ ω(x^0), we can find k_0 > k̄ such that x^{k_0} ∈ B(x̄, ε). Using (44) yields that By (44) and (48),
By induction, the stated result holds. Since lim_{k→∞} dist(x^k, S*) = 0, for any ǫ > 0 there exists By (44) and (48), where the first equality uses α_k = 1 for all k ≥ k_0. This shows that {x^k}_{k∈N} is a Cauchy sequence and thus converges to x̄ ∈ X*. Passing to the limit k_1 → ∞ in (49) and using (47), we conclude that for any k > k_0, That is, {x^k}_{k∈N} converges to x̄ with a Q-superlinear rate of order q(1+̺). ✷ The Hölderian error bound in Theorem 4.3 is stronger than the q-subregularity of ∂F at (x̄, 0) by Lemma 2.1, but it does not require the isolatedness of stationary points and is implied by the local strong convexity of F in a neighborhood of x̄. Indeed, if F is locally strongly convex in a neighborhood of x̄, then there exists δ > 0 such that Then, using similar arguments as those for the first part of Lemma 2.1, one can show that dist(x, X*) ≤ c r(x) for all x ∈ B(x̄, δ) for some c > 0.
The error bound assumption in Theorem 4.3 implicitly requires that X* ≠ ∅. When X* = ∅, we still have the global convergence of {x^k} and its superlinear convergence rate under a local error bound on S*, as follows. Its proof is similar to that of Theorem 4.3; see Appendix C. Theorem 4.4 Let {x^k}_{k∈N} be the sequence generated by Algorithm 1 with 0 < ̺ < 1. Fix any x̄ ∈ ω(x^0) and any q > 1 + ̺. Suppose that Assumption 3 holds, that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ, and that there exist ε > 0 and κ > 0 such that dist(x, S*) ≤ κ[r(x)]^q for all x ∈ B(x̄, ε). Then, the sequence {x^k}_{k∈N} converges to x̄, and the convergence is Q-superlinear of order (q−̺)²/q if q > (2̺+1+√(4̺+1))/2. In this case, (27) holds with κ = c_1κ and q = q. Together with Remark 3.1 (c), we can achieve the global convergence and superlinear rate results for the iterate sequence generated by GIPN under the same assumptions as Theorem 4.3 or 4.4, which greatly improve those in [21], where the uniformly bounded positive definiteness of G_k and the local strong convexity of F were assumed.

Numerical experiments.
Before testing the performance of Algorithm 1, we first focus on its implementation (see https://github.com/SCUT-OptGroup/IRPNM for the corresponding code).

Implementation of Algorithm 1.
The core of the implementation of Algorithm 1 is to find an approximate minimizer y^k of subproblem (2) satisfying (17) or (18). By the expression of G_k in (3), we have When the function ψ is separable, it is very cheap to achieve such a reformulation. Then, subproblem (2) can be equivalently written as which is a Lasso problem when g takes the ℓ_1-norm. Inspired by the encouraging results reported in [27], we develop an efficient SNALM (semismooth Newton augmented Lagrangian method) for solving subproblem (50). Note that an SNALM was also developed in [28, Section 4.2] to solve the subproblems of an inexact SQA method for composite DC problems.
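To make the Lasso structure of the subproblem concrete, the following sketch solves a quadratic model of the form min_y ⟨∇f(x^k), y − x^k⟩ + ½(y − x^k)^⊤G_k(y − x^k) + λ‖y‖_1 by plain proximal gradient iterations. This is only an illustrative stand-in for the SNALM developed below; the function names and the use of proximal gradient are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: componentwise shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_newton_subproblem(grad_f, G, x, lam, iters=500):
    """Approximately solve the Lasso-type subproblem
       min_y <grad_f, y - x> + 0.5*(y-x)^T G (y-x) + lam*||y||_1
    by proximal gradient steps (a simple stand-in for SNALM)."""
    L = np.linalg.norm(G, 2)  # spectral norm = Lipschitz constant of the quadratic part
    y = x.copy()
    for _ in range(iters):
        q_grad = grad_f + G @ (y - x)          # gradient of the quadratic model at y
        y = soft_threshold(y - q_grad / L, lam / L)
    return y
```

For well-conditioned G the iteration converges quickly; in the special case G = cI the first step already returns the exact minimizer, since the subproblem then reduces to a single soft-thresholding.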

SNALM for solving subproblems.
Let g*_k be the conjugate function of g_k. The dual problem of (50) takes the form of The basic iteration steps of the augmented Lagrangian method (ALM) for (51) are as follows: where L_σ(·, ·; x) is the augmented Lagrangian function of (51) associated with the penalty factor σ > 0 and the Lagrange multiplier x. For the ALM, the primary computational cost lies in solving (52a). A direct calculation gives By the strong convexity of Φ_j, ξ^{j+1} is a solution of (53) iff it is the unique root of the system ∇Φ_j(ξ) = 0, which is semismooth if g is finite-valued and convex, and strongly semismooth if g is a piecewise linear-quadratic convex function. The Clarke Jacobian [13] of ∇Φ_j is always nonsingular due to its strong monotonicity. Moreover, one can obtain an exact characterization of the Clarke Jacobian for some specific g. Hence, we apply the semismooth Newton method to seek a root of the system ∇Φ_j(ξ) = 0. For more details on semismooth Newton methods, see [38] or [16, Chapter 7]. By combining the optimality conditions of (52a) with equation (52b), we deduce that ζ^{j+1} ∈ ∂g_k(−x^{j+1}), which implies that Inspired by this and criterion (17) or (18), we terminate the ALM at the iterate (ξ^{j+1}, ζ^{j+1}, x^{j+1}) when Θ_k(−x^{j+1}) ≤ Θ_k(x^k) and r_k(−x^{j+1}) ≤ ǫ_ALM := η min{r(x^k), [r(x^k)]^{1+τ}} for ̺ ∈ (0, 1), or when Θ_k(−x^{j+1}) ≤ Θ_k(x^k) and ω_{j+1} ≤ ǫ_ALM := ηr(x^k) for ̺ = 0. In the implementation of the SNALM, we adjust the penalty factor σ_j according to the primal and dual infeasibility.
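The stopping quantities in this rule can be sketched as follows for the ℓ_1 case. The function names are ours, and the residual formula r(x) = ‖x − prox_g(x − ∇f(x))‖ is the standard proximal-gradient KKT residual, which we assume is what r instantiates here.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def kkt_residual(x, grad_f, lam):
    # r(x) = ||x - prox_{lam*||.||_1}(x - grad f(x))|| for g = lam*||.||_1
    return np.linalg.norm(x - soft_threshold(x - grad_f, lam))

def alm_tolerance(r_x, eta=0.9, tau=0.45, varrho=0.45):
    """Inner tolerance for stopping the subproblem solver:
       eps_ALM = eta * min{r(x^k), r(x^k)^(1+tau)}  if 0 < varrho < 1,
       eps_ALM = eta * r(x^k)                        if varrho = 0."""
    if varrho > 0:
        return eta * min(r_x, r_x ** (1 + tau))
    return eta * r_x
```

Since r(x^k) → 0 along the iterates, the tolerance min{r, r^{1+τ}} eventually behaves like r^{1+τ}, i.e., the subproblems are solved with increasing relative accuracy, which is what drives the superlinear rate.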

Choice of parameters for Algorithm 1.
For the subsequent tests, we choose a_1 = 1, which means that our G_k may not be uniformly positive definite if ̺ > 0. The other parameters of Algorithm 1 are chosen as a_2 = min{σ̄, 10^{-2}/max{1, r(x^0)}} with σ̄ = 10^{-4}, β = 0.1, η = 0.9, τ = ̺. From Figure 1(a) below, a larger ̺ corresponds to a better convergence rate, but preliminary tests indicate that Algorithm 1 with a larger ̺ does not necessarily require less running time, and Algorithm 1 with ̺ ∈ [0.4, 0.5] usually works well in terms of running time. Inspired by this, we choose ̺ = 0.45 for the subsequent tests.
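As a rough illustration of how these parameters enter, the following assembles a matrix of the shape suggested by the mapping G in Section 4 plus the regularization μ_k I with μ_k = a_2[r(x^k)]^̺. Since expression (3) is not reproduced in this excerpt, the exact formula and all names below are assumptions of this sketch.

```python
import numpy as np

def regularized_hessian(hess_f, hess_psi, A, r_x, a1=1.0, a2=1e-4, varrho=0.45):
    """Hypothetical assembly of
       G_k = hess f(x^k) + a1*[-lambda_min(hess psi)]_+ * A^T A + mu_k*I,
    with mu_k = a2 * r(x^k)^varrho (shape inferred from the text)."""
    lam_min = np.linalg.eigvalsh(hess_psi)[0]   # smallest eigenvalue of hess psi
    mu = a2 * r_x ** varrho                      # regularization vanishing with r(x^k)
    G = hess_f + a1 * max(-lam_min, 0.0) * (A.T @ A) + mu * np.eye(A.shape[1])
    return G, mu
```

The convexification term activates only when ∇²ψ has a negative eigenvalue, and μ_k → 0 as r(x^k) → 0, which is exactly why G_k need not stay uniformly positive definite for ̺ > 0.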
Figure 1 below plots the convergence rate curves of the iterate sequences yielded by Algorithm 1 with different ̺, for the example in Section 5.2 with d = 80 dB in Figure 1(a) and the example in Section 5.4 with λ = 10^{-2} in Figure 1(b). Every curve in Figure 1(a) shows a superlinear convergence rate, but no curve in Figure 1(b) does. After checking, we find that the sequences {x^k} in Figure 1(a) all converge to a strong stationary point, while those in Figure 1(b) converge to a non-strong one. This coincides with the theoretical results obtained in Section 4. In the rest of this section, we compare the performance of Algorithm 1 equipped with the SNALM (called IRPNM) with that of GIPN and ZeroFPR on ℓ_1-regularized Student's t-regression, group penalized Student's t-regression, and nonconvex image restoration. Among others, GIPN was proposed in [21] and solves subproblem (2) either with FISTA [6] (named GIPN-F) or via the KKT system R_k(x) = 0 with the semismooth Newton method [30] (named GIPN-S), while ZeroFPR was proposed in [43] and seeks a root of the KKT system for minimizing the forward-backward envelope (FBE) of F along a quasi-Newton direction. For GIPN, we choose a_1 = 1.001 to ensure that the approximation matrix G_k is uniformly bounded positive definite, as required by its convergence results. For GIPN and ZeroFPR, we adopt the default settings except that their maximum numbers of iterations are set to 2000 and 20000, respectively. For IRPNM, the maximum numbers of iterations of Algorithm 1 and the SNALM are set to 1000 and 100, respectively. For fairness, the three solvers use the same starting point x^init and the same stopping condition r(x^k) ≤ ǫ_0. All tests are performed on a desktop running 64-bit Windows with an Intel(R) Core(TM) i9-10850K CPU (3.60GHz) and 32.0 GB RAM, using Matlab 2020b.

ℓ 1 -regularized Student's t-regression.
This class of problems takes the form of (1) with ψ(u) := Σ_{i=1}^m log(1 + u_i²/ν) (ν > 0) for u ∈ R^m and g(x) := λ‖x‖_1 for x ∈ R^n, where λ > 0 is the regularization parameter. Such a function was introduced in [1] to deal with data contaminated by heavy-tailed Student's t-errors. The test examples are randomly generated in the same way as in [7, 30]. Specifically, we generate a true sparse signal x^true of length n = 512² with s = ⌊n/40⌋ nonzero entries, whose indices are chosen randomly, and then compute each nonzero component via x^true_i = η_1(i)10^{dη_2(i)/20}, where η_1(i) ∈ {±1} is a random sign and η_2(i) is uniformly distributed in [0, 1]. The signal has a dynamic range of d dB with d ∈ {20, 40, 60, 80}. The matrix A ∈ R^{m×n} takes m = n/8 random cosine measurements, i.e., Ax = (dct(x))_J, where dct is the discrete cosine transform and J ⊂ {1, . . ., n} with |J| = m is an index set chosen randomly. The vector b is obtained by adding Student's t-noise with 4 degrees of freedom, rescaled by 0.1, to Ax^true.
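The data-generation recipe above can be sketched in a few lines. This illustrative version uses a small n and Gaussian measurements in place of the random cosine measurements, so it is an assumption-laden stand-in rather than the paper's exact setup.

```python
import numpy as np

def generate_l1_student_t_instance(n=1024, d=60, rng=None):
    """Small-scale sketch of the test data described above
    (the paper uses n = 512^2 and DCT measurements)."""
    rng = np.random.default_rng(rng)
    m, s = n // 8, max(n // 40, 1)
    # sparse signal with dynamic range d dB
    x_true = np.zeros(n)
    idx = rng.choice(n, s, replace=False)
    x_true[idx] = rng.choice([-1.0, 1.0], s) * 10 ** (d * rng.random(s) / 20)
    # Gaussian measurements stand in for the random cosine measurements
    A = rng.standard_normal((m, n)) / np.sqrt(m)
    # Student's t-noise with 4 degrees of freedom, rescaled by 0.1
    b = A @ x_true + 0.1 * rng.standard_t(4, m)
    return A, b, x_true

def psi(u, nu=0.25):
    # Student's t-loss: sum_i log(1 + u_i^2 / nu)
    return np.sum(np.log1p(u ** 2 / nu))
```

The heavy tails of the t-noise are the reason for replacing the least-squares loss with ψ, which grows only logarithmically in large residuals.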
For each λ = c_λ‖∇f(0)‖_∞ with c_λ = 0.1 and 0.01, we run the three solvers with ν = 0.25, x^init = A^⊤b and ǫ_0 = 10^{-5} for 10 independent trials, i.e., ten groups of data (A, b) generated randomly. Table 1 lists the average objective values, KKT residuals and running times over the 10 trials. We see that IRPNM yields the same objective values as ZeroFPR but requires much less running time, while GIPN-S cannot achieve the desired accuracy within 2000 iterations for most test instances. All three solvers require less running time for c_λ = 0.1 than for c_λ = 0.01, because the associated stationary points are sparser. After checking, we find that IRPNM yields a strong stationary point for all test instances except those with c_λ = 0.1 and d = 20, and the smallest eigenvalue of the Hessian at these strong stationary points is numerically close to 0, so the stationary points are very likely nonisolated.

Group penalized Student's t-regression.
This class of problems takes the form of (1) with ψ being the same as in Section 5.2 and g(x) := λΣ_{i=1}^l ‖x_{J_i}‖ for x ∈ R^n, where the index sets J_1, . . ., J_l satisfy J_i ∩ J_j = ∅ for any i ≠ j and ∪_{i=1}^l J_i = {1, . . ., n}. We generate a true group sparse signal x^true ∈ R^n of length n = 512² with s nonzero groups, whose indices are chosen randomly, and compute each nonzero entry of x^true by the same formula as in Section 5.2. The matrix A ∈ R^{m×n} is generated in the same way as in Section 5.2, and b ∈ R^m is obtained by adding Student's t-noise with 5 degrees of freedom, rescaled by 0.1, to Ax^true.
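For this group penalty, the proximal mapping of λΣ_i‖x_{J_i}‖ is blockwise shrinkage, a standard fact for separable group norms; the helper below is our own illustrative implementation, not code from the paper.

```python
import numpy as np

def group_soft_threshold(x, groups, lam):
    """Prox of lam * sum_i ||x_{J_i}||_2: each block is scaled by
    (1 - lam/||x_{J_i}||)_+ . `groups` is a list of index arrays
    partitioning {0, ..., n-1}."""
    y = np.zeros_like(x)
    for J in groups:
        nrm = np.linalg.norm(x[J])
        if nrm > lam:
            y[J] = (1.0 - lam / nrm) * x[J]
    return y
```

Blocks whose norm falls below λ are zeroed as a whole, which is what produces group sparsity in the recovered signal.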
We run the three solvers with λ = 0.1‖∇f(0)‖, ν = 0.2, x^init = A^⊤b and ǫ_0 = 10^{-5} for 10 independent trials, i.e., ten groups of data (A, b) generated randomly. Table 2 lists the average objective values, KKT residuals and running times over the 10 trials for d = 60, 80 dB and s = 16, 64, 128 nonzero groups. We see that IRPNM yields the same objective values as ZeroFPR but requires much less running time, while GIPN-F cannot achieve the desired accuracy within 2000 iterations for all test instances. Also, after checking, IRPNM yields a strong stationary point for each test instance.

Nonconvex image restoration.

This class of problems has the form of (1) with ψ as described in Section 5.2 and g(x) = λ‖Bx‖_1 for x ∈ R^n, where A ∈ R^{n×n} is a matrix representing a Gaussian blur operator with standard deviation 4 and a filter size of 9, the vector b ∈ R^n represents a blurred image, and B ∈ R^{n×n} is an orthogonal matrix representing a two-dimensional discrete Haar wavelet transform of level 4.
A 256 × 256 image, cameraman.tif, is chosen as the test image x^true ∈ R^n with n = 256², and the blurred noisy image b is obtained by adding Student's t-noise with 1 degree of freedom, rescaled by 10^{-3}, to Ax^true. For each λ ∈ {10^{-2}, 10^{-3}, 10^{-4}}, we run the three solvers with ν = 1, x^init = b and ǫ_0 = 10^{-4} for 10 independent trials, i.e., ten groups of data b generated randomly. As shown in Figure 2, all three solvers produce good restorations. Table 3 reports the average objective values, KKT residuals and running times over the 10 trials. We see that IRPNM and ZeroFPR outperform GIPN-F by achieving the desired accuracy within less running time, and IRPNM yields objective values (and KKT residuals) comparable with those of ZeroFPR, though it needs more time than ZeroFPR. For this class of problems, IRPNM yields a non-strong stationary point for each test instance and displays a slower convergence rate than ZeroFPR, which explains why it requires more running time.
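Because B is orthogonal, the proximal mapping of g(x) = λ‖Bx‖_1 reduces to soft-thresholding in the transform domain, a standard change-of-variables identity; the sketch below is ours and only illustrates this reduction.

```python
import numpy as np

def prox_analysis_l1_orthogonal(x, B, lam):
    """Prox of lam*||Bx||_1 when B is orthogonal: with z = Bx,
    min_y 0.5*||y - x||^2 + lam*||By||_1 becomes a separable problem
    in z, solved by componentwise soft-thresholding of Bx."""
    z = B @ x
    z = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return B.T @ z
```

This is exactly why an orthogonal Haar wavelet transform keeps g prox-friendly: the analysis-form regularizer behaves like a plain ℓ_1 norm in wavelet coordinates.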
Since H ≻ 0, from the expression of the function Φ, it is not difficult to obtain that [dist(0, ∂Φ(x, y, H))] ξ + H(y − x) where the last inequality is by the concavity of the function R_+ ∋ t ↦ t^θ. By [25, Lemma 3.1], there exist γ_1 > 0 and γ_2 ∈ (0, 1) such that ξ + H(y − x), where the third inequality is by ‖H^{1/2}(x − y)‖ ≤ 1. The last inequality shows that Φ has the KL property of exponent θ at (x, y, H) ∈ Ω. ✷

B Unit step-size results without error bound condition

Lemma B.1 Let {x^k}_{k∈N} be the sequence generated by Algorithm 1 with ̺ ∈ (0, 1/2). Consider any x̄ ∈ ω(x^0). Suppose that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ. Then there exists k̄ ∈ N such that α_k = 1 for all x^k ∈ B(x̄, ε_0/2) with k ≥ k̄, where ε_0 is the same as the one in Lemma 4.3.
Proof: From (28) we have where the last inequality is due to Λ_k ≥ 0 and ∇²f(x^k) + Λ_k I ⪰ 0. In addition, from the convexity of the function g and β^m ∈ (0, 1], it follows that where the last inequality is by (25). Adding the last two inequalities together yields Recall that ‖d^k‖/μ_k ≤ 3(1−2σ)/(L_ψ‖A‖³) for all k ≥ k̄. This means that (55) holds with m = 0 if k ≥ k̄ and x^k ∈ B(x̄, ε_0/2). Consequently, the desired result follows. ✷

C Proof of Theorem 4.4

Before providing the proof of Theorem 4.4, we first show that the unit step-size occurs when k is sufficiently large.

Lemma 4.4 Consider any x̄ ∈ X*. Suppose that Assumption 3 holds and that ∇²ψ is strictly continuous at Ax̄ − b relative to A(domg) − b with modulus L_ψ. Then,

Remark Assume that H_k and η_k in step (S.1) of [21, Algorithm 3.1] take G_k and η min{1, [r(x^k)]^τ}, respectively. Then, under the local error bound assumption of Theorem 4.3 or 4.4, the unit step-size always occurs and inequality (27) holds.

Figure 1: Influence of ̺ on the convergence behavior of the iterate sequences of Algorithm 1

Table 2: Numerical comparisons on group penalized Student's t-regression