Globalized inexact proximal Newton-type methods for nonconvex composite functions

Optimization problems with composite functions consist of an objective function which is the sum of a smooth and a (convex) nonsmooth term. This particular structure is exploited by the class of proximal gradient methods and some of their generalizations like proximal Newton and quasi-Newton methods. The current literature on these classes of methods almost exclusively considers the case where the smooth term is also convex. Here we present a globalized proximal Newton-type method which allows the smooth term to be nonconvex. The method is shown to have nice global and local convergence properties, and some numerical results indicate that this method is very promising from a practical point of view as well.


Introduction
In this paper, we deal with the composite problem

min x∈ℝ n ψ(x) ∶= f (x) + φ(x), (1)

where f ∶ ℝ n → ℝ is (twice) continuously differentiable and φ ∶ ℝ n → ℝ ∪ {+∞} is convex, proper, and lower semicontinuous (lsc). In this formulation, the objective function ψ is neither convex nor smooth, so it covers a wide class of applications as described below. Since φ is allowed to take the value +∞, (1) also comprises constrained problems on convex sets.

Background
Optimization problems in the form (1) arise in many applications in statistics, machine learning, compressed sensing, and signal processing.
Common applications are the lasso [42] and related problems, where the function f represents a smooth loss function such as the quadratic loss f (x) ∶= ‖Ax − b‖ 2 2 or the logistic loss f (x) ∶= (1∕m) ∑ i=1 m log(1 + exp(−b i a i T x)) for some given data A ∈ ℝ m×n, b ∈ ℝ m, and a i ∈ ℝ n for i = 1, …, m. A convex regularizer φ is added to incorporate additional constraints or to promote sparsity. Typical regularizers are the ℓ 1 - and ℓ 2 -norm, a weighted ℓ 1 -norm φ(x) ∶= ∑ i=1 n λ i |x i | for some weights λ i > 0, or the total variation φ(x) = ‖∇x‖. Loss problems of this kind are typically used to reconstruct blurred or incomplete data or to classify data.
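For the weighted ℓ 1 -norm mentioned above, the associated proximity operator reduces to componentwise soft-thresholding, which is what makes these regularizers attractive computationally. A minimal sketch in Python (illustrative only; the function name and interface are ours, not from the paper):

```python
import numpy as np

def prox_weighted_l1(x, weights, t=1.0):
    """Proximity operator of phi(x) = sum_i lambda_i * |x_i| with step t,
    i.e. argmin_y phi(y) + (1/(2t)) * ||y - x||^2, evaluated componentwise
    by the well-known soft-thresholding formula."""
    return np.sign(x) * np.maximum(np.abs(x) - t * np.asarray(weights), 0.0)

# each component is shrunk toward zero by its threshold t * lambda_i
print(prox_weighted_l1(np.array([3.0, -0.5, 1.2]), [1.0, 1.0, 1.0]))
```

Components whose magnitude falls below the threshold are set exactly to zero, which is the mechanism that produces sparse solutions.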
Another type of application are inverse covariance estimation problems [3,46]. The aim of this problem class is to find the (sparse) inverse covariance matrix of a probability distribution from independent and identically distributed samples. For further applications, where the function f is assumed to be convex, we refer to the list given by Combettes and Wajs [17] and the references therein. Further problems of the form (1) are constrained problems [10] arising in the above mentioned fields.
Nonconvex applications occur, e.g., in inverse problems where the given data are not related linearly or are perturbed by non-Gaussian errors, as in Student's t-regression [1]; see also Sects. 5.2 and 5.4, and cf. the list by Bonettini et al. [9] for more examples of problems of this type.

Description of the Method
In every step of the proximal Newton-type method, we (inexactly) solve the subproblem (2) for some x ∈ ℝ n and a given matrix H which is either equal to the Hessian ∇ 2 f (x) or represents a suitable approximation of it. The advantage of using proximal Newton-type steps that take into account second order information of f is that, similar to smooth Newton-type methods, one can prove fast local convergence. However, these steps are only guaranteed to be well-defined for convex f, and the convergence theorems typically require some strong convexity assumption.
In contrast, proximal gradient methods perform a backward (proximal) step using only first order information of f. This means that (2) is solved with some positive definite H ∈ ℝ n×n, which is usually a fixed multiple of the identity matrix. The method can therefore be shown to converge globally in the sense that every accumulation point of a sequence generated by this method is a stationary point of ψ, but fast local convergence cannot be expected.
In this paper, we take into account the advantages of both methods and combine them to get a globalized proximal Newton-type method. Since the proximal Newton-type update is preferable, we try to solve the corresponding subproblem

arg min y f (x) + ∇f (x) T (y − x) + 1∕2 (y − x) T H(y − x) + φ(y) (2)

and use a novel descent condition to control whether the current iterate is updated with its solution or a proximal gradient step is performed instead. To achieve global convergence, we further add an Armijo-type line search. As the computation of the Newton-type step defined in (2) can be expensive, our convergence theory allows some freedom in the choice of the matrices H; in particular, one can use quasi-Newton or limited memory quasi-Newton matrices.

Related work
The original proximal gradient method was introduced by Fukushima and Mine [22]. It may be viewed as a special instance of the method described in Tseng and Yun [44], which exploits a block separable structure of φ and performs blockwise descent. Numerous authors [24,35,45] deal with acceleration techniques, all of which require the Lipschitz continuity of the gradient ∇f. Further methods [6,39] also assume that f is convex.
In an intermediate approach between proximal Newton and proximal gradient methods, referred to as variable metric proximal gradient methods, the matrix H in (2) does not need to be a multiple of the identity matrix, but is still positive definite, uniformly bounded, and does not necessarily contain second order information of f. Various line search techniques and inexactness conditions on the subproblem solution can be applied [7-9, 13, 21, 23, 26, 27, 40, 41] to prove global convergence. These references include fast local convergence results for the case that H is replaced by the Hessian of f or some approximation and a suitable boundedness condition holds.
In Lee, Sun, and Saunders [27], a generic version of the proximal Newton method is presented, and several convergence results based on the exactness of the subproblem solutions and the Hessian approximation are stated. For the local convergence theory, they need strong convexity of f. In Yue, Zhou, and So [47], an inexact proximal Newton method with regularized Hessian is presented, which assumes f to be convex, but not strongly convex, together with an error bound condition. Their inexactness criterion is similar to ours. The authors in [28,43] assume that f is convex and self-concordant and apply a damped proximal Newton method.
Bonettini et al. [8,9] consider an inexact proximal gradient method with variable metric and an Armijo-type line search to solve problem (1). The structure of the method in [9] is similar to ours, but they use a different inexactness criterion, do not include a globalization, and add an overrelaxation step to ensure convergence. Their convergence theory covers global convergence as well as local convergence under the assumptions that ∇f is Lipschitz continuous and ψ satisfies the Kurdyka-Łojasiewicz property.
A similar method with various line search criteria is introduced by Lee and Wright [26]. Their inexactness criterion is related to the one from Bonettini et al. Furthermore, they use a line search technique to update the matrix H in (2), if suitable descent is not achieved. Here, convergence rates are proven for nonconvex as well as for convex problems.
Further methods exist for the case where we can write φ = φ̃ ∘ B for a linear mapping B ∶ ℝ n → ℝ p and a convex function φ̃ ∶ ℝ p → ℝ. This formulation is used if the proximity operator of φ̃ is easy to compute whereas that of φ is not. In [15,16,29], fixed point methods are used to solve such problems under different assumptions; the reformulation into a constrained problem is applied in [2,48].
Another class of methods to solve (1) are semismooth Newton methods. Patrinos, Stella, and Bemporad assume in [37] that f is convex and apply a semismooth Newton method combined with a line search strategy. The method MINFBE of Stella, Themelis, and Patrinos [41] is based on the same idea, but uses a different line search strategy, for which they can prove convergence under the assumption that ∇f is Lipschitz continuous. Furthermore, they state linear convergence for convex problems.
For strongly convex f with Lipschitz continuous gradient, Patrinos and Bemporad [36] state a semismooth Newton method that uses a globalization strategy similar to our method and applies a proximal gradient step if the given descent criterion does not hold. A semismooth Newton method with filter globalization is introduced by Milzarek and Ulbrich [32] for φ(x) = λ‖x‖ 1 with some λ > 0 and adapted to arbitrary convex φ by Milzarek [31]. For the semismooth Newton update, they check a filter condition and, if it does not hold, a proximal gradient step with Armijo-type line search is performed.

Outline of the paper
This paper is organized as follows. First, we introduce the proximity operator with some properties, formulate the proximal gradient method, and state a convergence result in Sect. 2. The globalization of the proximal Newton-type method and its inexact variant is deduced in Sect. 3, where we also state some preliminary observations. In Sect. 4, we first prove global convergence under fairly mild assumptions, and then provide a fast local convergence result. We then consider the numerical behaviour of our method(s) on different classes of problems in Sect. 5, also including a comparison with several state-of-the-art solvers. We conclude with some final remarks in Sect. 6.

Notation
The set of all symmetric matrices in ℝ n×n is denoted by S n, and the set of all symmetric positive definite matrices is abbreviated by S n ++. We write H ≻ 0 if the matrix H is symmetric positive definite. The standard inner product of x, y ∈ ℝ n is denoted by ⟨x, y⟩ ∶= x T y. Finally, we write ‖x‖ H ∶= √(x T Hx) for the norm induced by a given matrix H ≻ 0.

The proximal gradient method
This section first recalls the definition and some elementary properties of the proximity operator, and then describes a version of the proximal gradient method which is applicable to possibly nonconvex composite optimization problems. Throughout this section, we assume that f is continuously differentiable and φ is proper, lsc, and convex.

The proximity operator
The proximity operator was introduced by Moreau [34] and turned out to be a very useful tool both from a theoretical and an algorithmic point of view. Here we restate only some of its properties, and refer to the monograph [4] by Bauschke and Combettes for more details.
For a positive definite matrix H ∈ ℝ n×n and a convex, proper, and lsc function φ ∶ ℝ n → ℝ ∪ {+∞}, the mapping

prox φ H (x) ∶= arg min y φ(y) + 1∕2 ‖y − x‖ H 2

is called the proximity operator of φ with respect to H. Here, the minimizer prox φ H (x) is uniquely defined for all x ∈ ℝ n since the expression inside the arg min is a strongly convex function. If H is the identity matrix, we simply write prox φ (x) instead of prox φ I (x). Using Fermat's rule and the sum rule for subdifferentials, the definition of the proximity operator gives p = prox φ H (x) if and only if 0 ∈ ∂φ(p) + H(p − x), or equivalently

H(x − p) ∈ ∂φ(p). (3)

We next restate a result on the continuity of the proximity operator due to Milzarek [31, Corollary 3.1.4], which states that the proximity operator is continuous not only with respect to the argument, but also with respect to the positive definite matrix. We call x * ∈ dom φ a stationary point of the program (1) if 0 ∈ ∇f (x * ) + ∂φ(x * ). Using [4, Proposition 17.14] and (3), we obtain the characterizations

x * is a stationary point of (1) ⟺ x * = prox φ H (x * − H −1 ∇f (x * )) ⟺ x * = prox φ (x * − ∇f (x * )), (4)

where the last reformulation turns out to be independent of the particular matrix H.
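The characterization 0 ∈ ∂φ(p) + H(p − x) can be checked numerically. The following Python sketch (our own illustration, not code from the paper) computes the proximity operator with respect to a diagonal H for φ = ‖·‖ 1, where the arg min separates componentwise, and then verifies the subgradient condition:

```python
import numpy as np

def prox_l1_diagH(x, d):
    """prox of phi = ||.||_1 w.r.t. H = diag(d), d_i > 0: the problem
    argmin_y ||y||_1 + 0.5*(y - x)^T H (y - x) separates componentwise and
    gives soft-thresholding with the componentwise threshold 1/d_i."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / d, 0.0)

x = np.array([2.0, -0.1, 0.7])
d = np.array([1.0, 2.0, 4.0])
p = prox_l1_diagH(x, d)

# optimality check 0 in subdiff(phi)(p) + H(p - x):
# p_i != 0  ->  sign(p_i) + d_i*(p_i - x_i) = 0;   p_i == 0  ->  |d_i*x_i| <= 1
for pi, di, xi in zip(p, d, x):
    if pi != 0.0:
        assert abs(np.sign(pi) + di * (pi - xi)) < 1e-12
    else:
        assert abs(di * xi) <= 1.0
```

The componentwise check mirrors the subdifferential of the absolute value: the full interval [−1, 1] at zero, and the sign elsewhere.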

Proximal gradient method
The proximal gradient method was introduced by Fukushima and Mine [22] as a generalization of the proximal point algorithm, which, in turn, was established by Rockafellar [38]. Note that the existing literature on the proximal gradient method usually assumes f to be smooth with a (globally) Lipschitz continuous gradient. In order to obtain complexity and rate of convergence results, additional assumptions, e.g. the convexity of f, are required, cf. Beck [5] for more details.
Here we present a version of the proximal gradient method which still has nice global convergence properties also in the case where f is only continuously differentiable (not necessarily convex and without assuming any Lipschitz continuity of the corresponding gradient mapping). The method itself is essentially known and may be viewed as a special instance of the method described in Tseng and Yun [44], see also the PhD Thesis by Milzarek [31]. This version differs from the original one in [22] and its variants considered for convex problems by using a different line search globalization strategy. The proximal gradient method described here plays a central role in the globalization of our proximal Newton-type method.
To motivate the proximal gradient method, let us first recall that the classical (weighted) gradient method for the minimization of a smooth objective function f first computes a minimizer d k of the quadratic subproblem

min d ∇f (x k ) T d + 1∕2 d T H k d (5)

for some H k ≻ 0, and then takes x k+1 = x k + t k d k for some suitable stepsize t k > 0. Usually, H k is chosen as a positive multiple of the identity matrix. For H k = I, we get the method of steepest descent, hence d k is given by −∇f (x k ) in this case.
Next consider the composite optimization problem from (1). To solve this nonsmooth problem, we simply add the nonsmooth function φ to the objective of (5) and obtain the subproblem

min d ∇f (x k ) T d + 1∕2 d T H k d + φ(x k + d). (6)

Let d k be a solution of this subproblem. The next iterate is then defined by x k+1 ∶= x k + t k d k for a suitable stepsize t k > 0. A simple calculation shows that the solution d k of (6) is given by

d k = prox φ H k (x k − H k −1 ∇f (x k )) − x k . (7)

We now state our proximal gradient method explicitly. The stepsize rule uses the expression

Δ k ∶= ∇f (x k ) T d k + φ(x k + d k ) − φ(x k ) (8)

for k ∈ ℕ 0, which is an upper bound of the directional derivative ψ ′ (x k ; d k ), see Lemma 2.3. Occasionally, we write Δ instead of Δ k if it is computed at some points x and d instead of x k and d k, respectively.
The algorithm allows H k to be any positive definite matrix. In general, it is chosen as a positive multiple of the identity matrix, independent of the iteration, because in that case the computation of the proximity operator is less costly; in some cases (depending on the mapping φ), even an explicit expression is known.
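The scheme just described can be sketched compactly. The following Python fragment (an illustration under our own naming; the paper's implementation is in MATLAB, and here the multiple-of-identity choice H k = cI is kept fixed) combines the proximal gradient step, the quantity Δ from (8), and the Armijo-type backtracking:

```python
import numpy as np

def prox_gradient(f, grad_f, phi, prox_phi, x0, c=1.0,
                  sigma=1e-4, beta=0.1, tol=1e-8, max_iter=500):
    """Proximal gradient sketch with H_k = c*I and an Armijo-type line
    search on Delta (parameter names sigma, beta, c are our assumptions).

    d     = prox_phi(x - grad_f(x)/c, 1/c) - x
    Delta = grad_f(x)^T d + phi(x + d) - phi(x)   # upper bound on the
                                                  # directional derivative
    accept x + t*d once psi(x + t*d) <= psi(x) + sigma*t*Delta
    """
    psi = lambda z: f(z) + phi(z)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        d = prox_phi(x - g / c, 1.0 / c) - x
        if np.linalg.norm(d) <= tol:         # x is (nearly) stationary
            return x
        delta = g @ d + phi(x + d) - phi(x)  # Delta < 0 whenever d != 0
        t = 1.0
        while psi(x + t * d) > psi(x) + sigma * t * delta:
            t *= beta                        # backtracking
        x = x + t * d
    return x

# tiny lasso-type instance: f(x) = 0.5*||x - b||^2, phi = 0.2*||x||_1
b = np.array([1.0, -0.05])
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - 0.2 * t, 0.0)
x_star = prox_gradient(lambda z: 0.5 * np.sum((z - b) ** 2),
                       lambda z: z - b,
                       lambda z: 0.2 * np.abs(z).sum(),
                       soft, np.zeros(2))
print(x_star)  # the exact minimizer soft(b, 1) = [0.8, 0.0]
```

Since the gradient of this f has Lipschitz constant 1 and c = 1, the full step t = 1 is always accepted here and the method terminates after two iterations.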
We now want to prove that Algorithm 2.2 is well-defined and justify the termination criterion. The analysis is mainly based on [31,44]. Note that we assume implicitly that the algorithm does not terminate after finitely many steps.
We first give an estimate for the value of Δ , which is essentially [32, Lemma 3.5].
Note that this result implies that Δ k is always a negative number as long as d k is nonzero.
The termination criterion in (S.2) is justified by (4). Thus, it ensures that the algorithm terminates in a stationary point of ψ. Together with the next result, it follows that Algorithm 2.2 is well-defined, which means, in particular, that the line search procedure in (S.3) always terminates after finitely many steps.

Corollary 2.4 Algorithm 2.2 is well-defined, and we have ψ(x k+1 ) < ψ(x k ) for all k ∈ ℕ 0.
Proof Consider a fixed iteration index k. Since, by assumption, the algorithm generates an infinite sequence, (S.2) yields d k ≠ 0 for all k. Thus, by Lemma 2.3, we have Δ k < 0. Using the first inequality in Lemma 2.3, we therefore obtain ψ(x k + td k ) ≤ ψ(x k ) + σtΔ k for all sufficiently small t > 0. Hence the step size rule (S.3) and, consequently, the whole algorithm is well-defined. Furthermore, using Δ k < 0 in (S.3) yields ψ(x k+1 ) = ψ(x k + t k d k ) ≤ ψ(x k ) + σt k Δ k < ψ(x k ), and this completes the proof. ◻
The following convergence result is a special case of [44, Theorem 1(e)].

Theorem 2.5
Let {H k } k ⊂ S n ++ be a sequence such that there exist 0 < m < M with mI ⪯ H k ⪯ MI for all k ∈ ℕ 0. Then any accumulation point of a sequence generated by Algorithm 2.2 is a stationary point of ψ.
Theorem 2.5 cannot be applied directly in order to verify global convergence of our inexact proximal Newton-type method since only some of the search directions d k are computed by a proximal gradient method, whereas other directions correspond to an inexact proximal Newton-type step. However, a closer inspection of the proof of [44,Theorem 1] yields that the following slightly stronger convergence result holds.

Remark 2.6
An easy consequence of the proof of Theorem 2.5, cf. [44], is the following more general result: Let {x k } be a sequence such that x k+1 = x k + t k d k holds for all k with some search directions d k ∈ ℝ n (not necessarily generated by a proximal gradient step) and a stepsize t k > 0. Assume further that ψ(x k+1 ) ≤ ψ(x k ) holds for all k. Let {x k } K be a convergent subsequence of the given sequence such that the search directions d k = d H k (x k ) are obtained by proximal gradient steps for all k ∈ K, where mI ⪯ H k ⪯ MI (0 < m ≤ M), and the corresponding step sizes t k > 0 are determined by the Armijo-type rule from (S.3). Then the limit point of the subsequence {x k } K is still a stationary point of ψ. ◊

Globalized inexact proximal Newton-type method
Let us start with the derivation of our globalized inexact proximal Newton-type method. To this end, let us first assume that H k stands for the exact Hessian ∇ 2 f (x k ) (later H k will be allowed to be an approximation of the Hessian only).
In smooth optimization, one step of the classical version of Newton's method for minimizing a function f ∶ ℝ n → ℝ consists in finding a solution of H k (x − x k ) = −∇f (x k ). Assuming for the moment that H k is positive definite, this is equivalent to solving the problem min x f k (x), where

f k (x) ∶= f (x k ) + ∇f (x k ) T (x − x k ) + 1∕2 (x − x k ) T H k (x − x k ) (9)

is a quadratic approximation of f at the current iterate x k. To solve this problem inexactly, one often uses the criterion

‖∇f k (x)‖ ≤ η k ‖∇f (x k )‖ (10)

for some η k ∈ (0, 1). Now we adapt this strategy to the nonsmooth problem (1). In this case, the objective function is f + φ, and the corresponding approximation we use is

ψ k (x) ∶= f k (x) + φ(x). (11)

In view of (4), we may view

F(x) ∶= x − prox φ (x − ∇f (x)) (12)

as a replacement for the derivative of the objective function, since F(x) = 0 if and only if x is a stationary point of ψ.
Since ψ k is another function of the form (1), one can use the same idea and replace the derivative of ψ k by

F k (x) ∶= x − prox φ (x − ∇f k (x)).

This observation motivates replacing the inexactness criterion (10) by a condition like ‖F k (x)‖ ≤ η k ‖F(x k )‖ for suitable constants η k ≥ 0, see [13,27].
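The residual F from (12), its model counterpart F k, and the resulting test are straightforward to implement. A small Python illustration (names and interfaces are ours; we specialise to φ = ‖·‖ 1 so that the proximity operator is soft-thresholding):

```python
import numpy as np

# soft-thresholding with threshold 1 = prox of phi = ||.||_1
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)

def F(x, grad_f):
    """F(x) = x - prox_phi(x - grad f(x)); F(x) = 0 iff x is stationary."""
    return x - soft(x - grad_f(x))

def F_k(x, xk, grad_f, H):
    """Same residual for the quadratic model with Hessian surrogate H:
    grad f_k(x) = grad f(xk) + H (x - xk)."""
    return x - soft(x - (grad_f(xk) + H @ (x - xk)))

def accept_inexact(x, xk, grad_f, H, eta_k):
    """Inexactness test ||F_k(x)|| <= eta_k * ||F(xk)||."""
    return np.linalg.norm(F_k(x, xk, grad_f, H)) <= eta_k * np.linalg.norm(F(xk, grad_f))

# For f(x) = 0.5*||x - b||^2 and H = I the model is exact, so the exact
# subproblem solution soft(xk - grad f(xk)) passes the test for any eta_k >= 0.
grad_f = lambda x: x - np.array([3.0, 0.2])
xk = np.zeros(2)
x_exact = soft(xk - grad_f(xk))
print(accept_inexact(x_exact, xk, grad_f, np.eye(2), eta_k=0.5))  # True
```

Any point returned by an iterative subproblem solver can be screened with the same test, which is exactly how the inexactness is controlled in practice.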
Note that the methods of Bonettini et al. [9] and Lee and Wright [26] use a different inexactness criterion based on the difference ψ k (x) − ψ k (x *,k ) of function values of ψ k, where x *,k is an exact minimizer of ψ k. In contrast, our criterion originates directly from the smooth Newton method and considers a different optimality measure based on the distance of the point x itself from being a solution of the subproblem, not the distance of the function values.
The main idea of our globalized proximal Newton-type method is now similar to a standard globalization of the classical Newton method for smooth unconstrained optimization problems: Whenever the proximal Newton-type direction exists and satisfies a suitable sufficient decrease condition, this direction is accepted and followed by a line search. Otherwise, a proximal gradient step is taken, which always exists and guarantees suitable global convergence properties. The descent criterion used here is motivated by the condition in [18,36]. The line search is based on the Armijo-type condition already used in the proximal gradient method and makes use of the same Δ k that was already defined in (8). The exact statement of our method is as follows, where, now, we allow H k to be an approximation of the Hessian of f at x k.
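The hybrid logic described above can be summarized in a short sketch. The following Python fragment (schematic, with assumed interfaces try_newton_dir and prox_pg; it is not the authors' implementation) performs one outer iteration: try the inexact Newton-type direction, test a sufficient decrease condition of the form Δ ≤ −ρ‖d‖ p as in (14), fall back to a proximal gradient step if the test fails, and finish with the Armijo-type line search:

```python
import numpy as np

def gipn_step(x, grad_f, phi, psi, prox_pg, try_newton_dir,
              rho=1e-8, p=2.1, sigma=1e-4, beta=0.1, c=1.0):
    """One schematic outer iteration. try_newton_dir returns the inexact
    proximal Newton-type direction, or None if the subproblem could not
    be solved; prox_pg is the proximity operator used for the fallback."""
    def delta(d):
        return grad_f(x) @ d + phi(x + d) - phi(x)

    d = try_newton_dir(x)
    if d is None or delta(d) > -rho * np.linalg.norm(d) ** p:
        # descent test failed: fall back to a proximal gradient direction
        d = prox_pg(x - grad_f(x) / c) - x
    dlt = delta(d)
    t = 1.0                                  # Armijo-type backtracking
    while psi(x + t * d) > psi(x) + sigma * t * dlt:
        t *= beta
    return x + t * d

# quadratic f with Hessian I: the Newton subproblem coincides with the
# full problem, so one accepted Newton step hits the minimizer
b = np.array([3.0, 0.2])
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)
phi = lambda z: np.abs(z).sum()
psi = lambda z: 0.5 * np.sum((z - b) ** 2) + phi(z)
grad = lambda z: z - b
newton = lambda z: soft(z - grad(z)) - z
x1 = gipn_step(np.zeros(2), grad, phi, psi, soft, newton)
print(x1)  # [2. 0.]
```

The point of the construction is visible in the branch: the globalization never relies on the Newton subproblem being solvable, since the proximal gradient direction is always available.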
Before we start to analyse the convergence properties of Algorithm 3.1, let us add a few comments regarding the proximal subproblems that we try to solve inexactly in (S.1). Since H k is not necessarily positive definite, these subproblems are not guaranteed to have a solution. The same difficulty arises within the classical Newton method since, in the indefinite case, the quadratic subproblem (9) certainly has no minimizer. Nevertheless, the classical Newton method is often quite successful even if H k is indefinite (at least during some intermediate iterations), and the Newton direction is usually well-defined because it just computes a stationary point of the subproblem (9), which exists also for indefinite matrices H k. Here, the situation is similar since the conditions (13) only check whether we have an (inexact) stationary point (note that these conditions certainly hold for the exact solution of the corresponding subproblem; cf. [27, Proposition 2.4] for the second condition, whose constant is required to be less than 1∕2). Moreover, the situation here is even better than in the classical case since the additional function φ may guarantee the existence of a minimizer even for indefinite H k (e.g. if φ has a bounded domain, as is the case when φ is the indicator function of a bounded feasible set). We therefore believe that our proximal Newton-type direction does exist in many situations (otherwise we switch to the proximal gradient direction).
The properties of Algorithm 3.1 obviously depend on the choice of the matrices H k and the degree of inexactness that is used to compute the inexact proximal Newton-type direction in (S.1). This degree is specified by the test in (13). The local convergence analysis requires some additional conditions regarding the choice of the sequence {η k }, whereas the global convergence analysis depends only on the choice η k ∈ [0, η̄) for some given η̄ ∈ (0, 1) and does not need the second condition in (13). The condition in (14) is a sufficient decrease condition, with ρ > 0 typically being a small constant. For our subsequent analysis, we set

K G ∶= {k ∈ ℕ 0 ∶ x k+1 was generated by the proximal gradient method},
K N ∶= {k ∈ ℕ 0 ∶ x k+1 was generated by the inexact proximal Newton-type method}.

The following result shows that the step size rule in (S.3) is well-defined and that Algorithm 3.1 is a descent method.

Proposition 3.2 Consider a fixed iteration k and suppose that d k ≠ 0. Then the line search in (S.3) is well-defined and yields a new iterate x k+1 satisfying ψ(x k+1 ) < ψ(x k ), regardless of whether x k+1 was generated by the proximal gradient method (k ∈ K G ) or by the inexact proximal Newton-type method (k ∈ K N ).
Proof Since the proximal gradient method is well-defined by Corollary 2.4, the claim holds for k ∈ K G. Now, assume k ∈ K N, in which case (14) holds. Then Δ k < 0, and the remaining part of the proof is identical to that of Corollary 2.4. ◻

Proposition 3.2 requires d k ≠ 0. In view of the following result, this assumption can be made without loss of generality. In particular, this result justifies our termination criterion in (S.2).

Lemma 3.3 An iterate x k generated by GIPN is a stationary point of ψ if and only if d k = 0.
Proof For k ∈ K G, the result follows from (4). Hence assume k ∈ K N, and let d k = 0. Since ∇f k (x k ) = ∇f (x k ), this yields F k (x k ) = F(x k ), so the first condition in (13) gives ‖F(x k )‖ ≤ η k ‖F(x k )‖ with η k < 1; hence F(x k ) = 0, and x k is a stationary point of ψ by (4). ◻

Altogether, the previous results show that Algorithm 3.1 is well-defined.

Convergence theory
In the following, we will prove global and local convergence results for algorithm GIPN. For this purpose, we assume that GIPN generates an infinite sequence and d k ≠ 0 holds for all k ∈ ℕ . The latter is motivated by Lemma 3.3.

Global convergence
The following is the main global convergence result for Algorithm 3.1. It guarantees stationarity of any accumulation point. Hence, if f is also convex, this implies that any accumulation point is a solution of the composite optimization problem from (1).
Theorem 4.1 Assume that the sequence {H k } is bounded and that η k ∈ [0, η̄) holds for all k with some η̄ ∈ (0, 1). Then every accumulation point of a sequence generated by this method is a stationary point of ψ.
Proof Let {x k } be a sequence generated by GIPN and {x k } K a subsequence of {x k } converging to some x *. If there are infinitely many indices k ∈ K with k ∈ K G, i.e. the subsequence contains infinitely many iterates x k such that x k+1 is generated by the proximal gradient method, Proposition 3.2 and the statement of Remark 2.6 yield that x * is a stationary point of ψ.
Hence consider the case where all elements of the subsequence {x k+1 } K are generated by inexact Newton-type steps. Since {ψ(x k )} is monotonically decreasing by Proposition 3.2, {x k } K converges to x *, and since ψ is lsc, we get the convergence of the entire sequence {ψ(x k )} to some finite number ψ *. The line search rule ψ(x k+1 ) ≤ ψ(x k ) + σt k Δ k therefore yields t k Δ k → 0 for k → ∞. We claim that this implies {‖d k ‖} K → 0 (possibly after taking another subsequence). To verify this statement, we distinguish two cases.

Case 1: lim inf k∈K t k > 0. Then Δ k → K 0, and {‖d k ‖} K → 0 follows from the sufficient decrease condition (14).
Case 2: lim inf k∈K t k = 0. Without loss of generality, assume lim k∈K t k = 0. Then, for all k ∈ K sufficiently large, the line search test is violated for the stepsize τ k ∶= t k ∕β. Using the monotonicity of the difference quotient of convex functions, cf. [4, Proposition 9.27], and the definition of Δ k, we therefore obtain, for all k ∈ K sufficiently large, an estimate whose last expression results from the mean value theorem with some ξ k ∈ (x k , x k + τ k d k ). Reordering these expressions, we obtain (16). Using (14) and p > 1, this implies τ k ‖d k ‖ → K 0. Hence the right-hand side of (16) converges to zero due to the uniform continuity of ∇f on compact sets. Consequently, (16) shows that ‖d k ‖ → K 0.
Therefore, d k → K 0 holds in both cases. Since x k → K x *, this also implies x k + d k → K x *. Using the continuity of the proximity operator as well as the boundedness of {H k }, the inexactness criterion (13), and η̄ ∈ (0, 1), taking the limit k → K ∞ implies x * = prox φ (x * − ∇f (x * )), which is equivalent to x * being a stationary point of ψ. ◻

Remark 4.2
Note that the proof of Theorem 4.1 only requires p > 1 and the first condition from (13). The second condition from (13) is only needed in the local convergence theory. ◊

Local convergence
We now turn to the local convergence properties of Algorithm 3.1. To this end, we assume that ψ is locally strongly convex in a neighbourhood of an accumulation point x * of the sequence of iterates and that the sequence {H k } is bounded. Under these assumptions, one first proves the convergence of the complete sequence {x k } to x * (Theorem 4.3). Note that the assumption regarding the local strong convexity of ψ in a neighbourhood of x * certainly holds if the Hessian ∇ 2 f (x * ) is positive definite.
For the following analysis, we assume, in addition, that f is twice continuously differentiable and that the sequence {H k } satisfies the Dennis-Moré condition [19]

‖(H k − ∇ 2 f (x * ))(x k+1 − x k )‖ ∕ ‖x k+1 − x k ‖ → 0 for k → ∞.

Under suitable assumptions, we expect the method to be locally superlinearly or quadratically convergent. The main steps in this direction are summarized in the following observations, which are partly taken from [47].

Proposition 4.4 There exist constants δ > 0 as well as C 1 , C 2 and further positive constants such that, for any iterate x k ∈ B δ (x * ), the following statements hold, where x k ex is the exact solution of the corresponding subproblem in (S.1) of Algorithm 3.1:
Proof We verify each of the three statements separately, using possibly different values of δ.

(a) First, note that the function ψ k is strongly convex and, therefore, has a unique minimizer. Thus, the exact solution of the subproblem exists, which in turn guarantees that there is an inexact solution x k. The definition of ψ k together with the subdifferential sum rule yields the equivalent reformulation (17). Since ψ k is strongly convex with modulus m > 0, its subdifferential is strongly monotone in this neighbourhood with the same modulus. Hence, using (17) together with 0 ∈ ∂ψ k (x k ex ), and applying the Cauchy-Schwarz inequality, we obtain the desired estimate. Using the inexactness criterion (13), we finally get the assertion with C 1 ∶= (1 + M + m)∕m.

(c) The inequality holds trivially for x k ex = x *. Thus, assume x k ex ≠ x *. First, note that (a) implies (18). Since ψ is locally strongly convex in a neighbourhood of x *, the subdifferential ∂ψ is strongly monotone, i.e. there exist μ > 0 and δ > 0 such that

⟨s(x) − s(y), x − y⟩ ≥ μ‖x − y‖ 2

holds for all x, y ∈ B δ (x * ) and s(x) ∈ ∂ψ(x), s(y) ∈ ∂ψ(y). Using the stationarity of x * and x k ex, we have 0 ∈ ∇f (x * ) + ∂φ(x * ) and 0 ∈ ∇f k (x k ex ) + ∂φ(x k ex ). Thus, also noting that x k ex eventually belongs to B δ (x * ) in view of part (b), we obtain the desired bound. From (b) we get {x k ex } → x *. Thus, by reducing δ > 0, if necessary, the remaining estimate follows from the twice differentiability of f. The assertion follows from dividing by ‖x * − x k ex ‖ and using (18). ◻

A suitable combination of the previous results leads to the following (global and) local convergence result for Algorithm 3.1.
Proof Note that we know from Theorem 4.3 that x * is both a stationary point and a strict local minimum of ψ, and that the whole sequence {x k } converges to x *.

(a) Similar to the proof of Proposition 4.4, there exists a solution x̄ k of the subproblem min x ψ k (x) for all k ∈ ℕ. Let Δ k,N be the Δ-function corresponding to the inexact proximal Newton-type direction. Then the second condition in (13) holds for all sufficiently large k ∈ ℕ. Combining these inequalities, we see that the sufficient descent condition (14) is fulfilled, so the search direction d k = d k N is obtained by the inexact proximal Newton-type method.

(b) A Taylor expansion yields, with some ξ k ∈ (x k , x̄ k ), an estimate whose remainder is, by the Dennis-Moré criterion, of order o(‖x̄ k − x k ‖) for x k → x *. As before, it follows from the continuity of F and the results in Proposition 4.4 (a) and (b) that ‖x̄ k − x k ‖ → 0. Thus, using (13), we obtain the required estimate for all sufficiently large k, where the last inequality follows from (19) (note that Δ k = Δ k,N in the current situation). This proves that the full step length t k = 1 is eventually accepted. The twice continuous differentiability of f yields that the second term is o(‖x k − x * ‖), and the Dennis-Moré condition implies that the third term is o(‖x k − x * ‖); together, this yields part (c). Finally, under the assumptions of part (d), also the first term is o(‖x k − x * ‖), which completes the proof. ◻

Note that one can also verify local quadratic convergence under slightly stronger assumptions than in Theorem 4.5 (d), in particular, using a stronger version of the Dennis-Moré condition. The details are left to the reader.

Numerical results
In this section, we report some numerical results for solving problem (1) and demonstrate the competitiveness of our methods compared to several state-of-the-art solvers. All numerical results have been obtained in MATLAB R2018b on a machine running openSUSE Leap 15.1 with an Intel Core i5 processor (3.2 GHz) and 16 GB RAM.
In the following, GPN denotes the globalized (inexact) proximal Newton method, whereas QGPN denotes a globalized (inexact) proximal quasi-Newton method, where the exact Hessian is replaced by a limited memory BFGS-update.

Logistic regression with ℓ 1 -penalty
In this example, we consider the logistic regression problem

min y,v (1∕m) ∑ i=1 m log(1 + exp(−b i (a i T y + v))) + λ‖y‖ 1 , (20)

where a i ∈ ℝ n (i = 1, …, m) are given feature vectors and b i ∈ {±1} the corresponding labels, λ > 0, y ∈ ℝ n, v ∈ ℝ. Usually, we have m ≫ n. Logistic regression is used to separate data by a hyperplane; see [25] for further information. With x ∶= (y, v) ∈ ℝ n+1, f the smooth loss term, and φ(x) ∶= λ‖y‖ 1, we can write (20) equivalently in the form (1). The loss function is convex, but not strictly convex, and its derivative is globally Lipschitz continuous; thus, the same holds for the smooth part f of ψ.
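For reference, the smooth part of this problem and its gradient can be written down directly. The following Python sketch is illustrative only (the experiments in this section were run in MATLAB); it uses np.logaddexp for a numerically stable evaluation of log(1 + exp(·)):

```python
import numpy as np

def logistic_loss(w, A, b, v=0.0):
    """Smooth loss (1/m) * sum_i log(1 + exp(-b_i * (a_i^T w + v))).
    np.logaddexp(0, z) = log(1 + exp(z)) avoids overflow for large z."""
    z = -b * (A @ w + v)
    return np.mean(np.logaddexp(0.0, z))

def logistic_grad(w, A, b, v=0.0):
    """Gradient w.r.t. w: -(1/m) * A^T (b * sigmoid(-b * (A w + v)))."""
    z = -b * (A @ w + v)
    return -(A.T @ (b * (1.0 / (1.0 + np.exp(-z))))) / len(b)

# small sanity instance
w = np.array([0.3, -0.2])
A = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.1]])
b = np.array([1.0, -1.0, 1.0])
g = logistic_grad(w, A, b)
```

A finite-difference check of the gradient is a cheap way to validate such an implementation before plugging it into any of the solvers.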

Algorithmic details
Subproblem solvers. The crucial part of the implementation is the solution of the subproblem in (13). We use two methods for this purpose, which are described below: the fast iterative shrinkage thresholding algorithm (FISTA) [6] and the globalized semismooth Newton method (SNF) [32].
FISTA by Beck and Teboulle [6] is an accelerated first order method for the solution of problems of type (1), where f is convex and has a Lipschitz continuous gradient. In every step, a problem of type (6) is solved for H k = L k I, where L k is an approximation of the Lipschitz constant of ∇f, which is updated by backtracking. After that, a step size is computed, and the next iterate is a convex combination of the old iterate and the computed solution. For the approximation of the Lipschitz constant of f k, we start with L 0 ∶= 1 and use 2 as the increasing factor. The globalized proximal Newton-type method with this subproblem solver is denoted by GPN-F.

SNF by Milzarek and Ulbrich [32] is a semismooth Newton method with filter globalization. Since the subproblems in this example are convex, we use the convex variant of the method. The semismooth Newton method is essentially applied to the equation F(x) = 0 with F(x) defined in (12). After computing a search direction, a filter decides whether the update is applied or a proximal gradient step is performed. All constants are chosen as in [32]. We denote the globalized proximal Newton method with SNF subproblem solver by GPN-S. In both cases, the initial point for the subproblem solvers is the current iterate x k.

Choice of parameters. We use the parameters p = 2.1 and ρ = 10 −8 for the acceptance criterion (14). The line search is performed with β = 0.1 and σ = 10 −4. The constant c k for the proximal gradient step is initialized with c 0 = 1∕6 and adapted in each step to approximate the Lipschitz constant of the gradient of f.
Variant with quasi-Newton-update Assume that the smooth part is locally strongly convex in a neighbourhood of an accumulation point of a sequence generated by GPN, the sequence of matrices {H k } is generated using BFGS-updates, and the subproblems in (13) are solved exactly, i.e. with inexactness tolerance zero. Then, similar to [49], one can prove that the sequence {H k } satisfies the Dennis-Moré condition.
Motivated by this idea, we implemented a variant of the algorithm in which the exact Hessian in the quadratic approximation (11) is replaced by a limited memory BFGS-update with a memory of 10. The implementation follows [14]. We skip the update if (s k ) T y k < 10 −9 for s k = x k − x k−1 and y k = ∇f (x k ) − ∇f (x k−1 ) . As before, we denote these methods by QGPN-F and QGPN-S, respectively.
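The skipping rule for the limited memory pairs can be sketched as follows. This is a hypothetical helper (our own names), assuming the memory is kept as two lists of curvature pairs:

```python
import numpy as np

def update_lbfgs_memory(S, Y, x_new, x_old, g_new, g_old,
                        memory=10, tol=1e-9):
    """Append the curvature pair (s, y) = (x_new - x_old, g_new - g_old)
    to the limited memory, skipping it when s^T y < tol, and keep at
    most `memory` pairs (oldest pair is dropped first)."""
    s, y = x_new - x_old, g_new - g_old
    if s @ y >= tol:
        S.append(s)
        Y.append(y)
        if len(S) > memory:
            S.pop(0)
            Y.pop(0)
    return S, Y
```

The curvature safeguard keeps the resulting BFGS approximation positive definite even when the smooth part is nonconvex.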

State-of-the-art methods
We compare the variants of GPN described above with each other, but also with several state-of-the-art methods, which are listed below.
PG The proximal gradient method is described in Algorithm 2.2. It is a first order method to solve problem (1). We use the same line search parameters 0.1 and 10 −4 as before and set H k = c k I , where c k is updated as before.
FISTA [6] The fast iterative shrinkage thresholding algorithm is an accelerated variant of the proximal gradient method. Details were already given in Sect. 5.1.1.
SpaRSA [45] SpaRSA (sparse reconstruction by separable approximation) is another accelerated first order method to solve problem (1). The main difference from FISTA is the update of the factor c k , which is done by a Barzilai-Borwein approach.
SNF [32] The semismooth Newton method with filter globalization is described in Sect. 5.1.1. As for the subproblem solver, we apply the convex version of the method.

Numerical comparison
We follow the example in [12] and generate test problems with n = 10 4 features and m = 10 6 training samples. Each feature vector a i has approximately 10 nonzero entries, which are generated independently from a standard normal distribution. We choose y true ∈ ℝ n with 100 nonzero entries and v true ∈ ℝ , which are independently generated from a standard normal distribution, and define the labels as b i = sign(a T i y true + v true + v i ) , where the v i (i = 1, … , m) are chosen independently from a normal distribution with variance 0.1. The regularization parameter is set to λ = 0.1 λ max , where λ max is the smallest value such that y * = 0 is a solution of (20). The derivation of this value can be found in [25]. For all methods, we start with the initial value x 0 = 0.
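A small-scale sketch of this data generation might look as follows; the dimensions here are hypothetical and far below n = 10 4 , m = 10 6 to keep the example runnable:

```python
import numpy as np
from scipy import sparse

def make_logreg_data(n=1000, m=5000, nnz_per_row=10, k=20, seed=0):
    """Sparse logistic regression test data in the spirit of the setup
    in the text: sparse features a_i with ~nnz_per_row nonzeros, a
    sparse ground truth y_true with k nonzeros, and labels
    b_i = sign(a_i^T y_true + v_true + v_i)."""
    rng = np.random.default_rng(seed)
    rows = np.repeat(np.arange(m), nnz_per_row)
    cols = rng.integers(0, n, size=m * nnz_per_row)
    vals = rng.standard_normal(m * nnz_per_row)
    A = sparse.csr_matrix((vals, (rows, cols)), shape=(m, n))
    y_true = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)
    y_true[idx] = rng.standard_normal(k)
    v_true = rng.standard_normal()
    noise = np.sqrt(0.1) * rng.standard_normal(m)  # variance 0.1
    b = np.sign(A @ y_true + v_true + noise)
    b[b == 0] = 1.0  # break the (measure-zero) ties
    return A, b, y_true

A, b, y_true = make_logreg_data()
```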
Due to the differences between the methods, their standard termination criteria are not a suitable choice to compare the performance. Thus, we compute the approximate minimal value ψ * of (20) using GPN-F with very high accuracy. We terminate each of the algorithms above when the value ψ(x k ) at the current iterate x k satisfies the criterion (22) with tolerance ε = 10 −6 .

Termination of the subproblems We start with an investigation of the termination of the subproblems (13). As a consequence of Theorem 4.5, we can choose the sequence {η k } to be constant (const.). For our experiments, we computed an upper bound for the admissible inexactness using the constants in the convergence theorem and set η k to 0.9 times this bound. A second possibility is to use a diminishing (dim.) sequence {η k } ; here we investigated the sequence η k = 1∕(k + 1) . Since the inexact termination criterion (13) is not practicable without significant additional computational costs, we also use a third variant: we minimize (11) using the standard termination criterion of the respective solvers with a low maximal number of iterations, more precisely 80 iterations for FISTA and 10 iterations for SNF, which resulted in the best performance. The tolerance is adapted in each step such that the subproblems are solved more exactly when the current iterate is near the solution.
The averaged results of 100 runs for the described variants of our method are listed in Table 1. It can be seen that for the variants with subproblem solver SNF, the computation costs using the diminishing or constant sequence {η k } are much higher than the costs using a maximum of 10 iterations, although, as expected, the total number of iterations is lower. Especially the number of evaluations of the proximity operator illustrates the difference in computation costs between the inexactness criterion in (13) and its approximation by limiting the inner iterations. This is reasonable, since there is one extra evaluation of the proximity operator in every inner iteration to check the inexactness condition. In contrast, the numbers of iterations are within the same range. For the variants using FISTA to solve the subproblems, we observe a similar behaviour, although it is less pronounced here.
To draw a conclusion from these observations, in the following we restrict the experiments and only investigate solving the subproblems with a maximum of 10 iterations (SNF) and 80 iterations (FISTA), which have the lowest computation costs. To ensure comparability for these experiments, we look at the runtime of 100 test examples and document the results using the performance profiles introduced by Dolan and Moré [20]. The results are shown in Fig. 1; the averaged values for some counters are given in Table 2.
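A Dolan-Moré performance profile can be computed from a table of runtimes as follows; `T` and `taus` are our own names for the runtime matrix and the evaluation points:

```python
import numpy as np

def performance_profile(T, taus):
    """Dolan-More performance profiles: T[i, s] is the runtime of
    solver s on problem i (np.inf for failures). Returns, for each
    solver s and each tau in taus, the fraction of problems that
    solver s solves within a factor tau of the best solver."""
    T = np.asarray(T, dtype=float)
    best = T.min(axis=1, keepdims=True)   # per-problem best runtime
    ratios = T / best                     # performance ratios r[i, s]
    return np.array([[np.mean(ratios[:, s] <= tau) for tau in taus]
                     for s in range(T.shape[1])])

# two solvers on three problems
T = [[1.0, 2.0],
     [3.0, 3.0],
     [10.0, 1.0]]
rho = performance_profile(T, taus=[1.0, 2.0, 10.0])
```

Plotting `rho[s]` against `taus` (usually on a log scale) gives the profiles shown in the figures.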
Comparison of GPN-variants We start with a comparison of the variants of the globalized proximal Newton-type methods, namely GPN-F, GPN-S, QGPN-F, and QGPN-S. At first, it can be observed that the iterates obtained by the inexact proximal Newton step are almost always accepted. We see that the semismooth Newton subproblem solver performs much better than the FISTA solver. One reason for this is that we can terminate the subproblem solver in (Q)GPN-S after only 10 iterations to get reasonable results, whereas test runs show that (Q)GPN-F performs best with a maximum of 80 iterations in each subproblem. Nevertheless, note that every iteration of SNF itself needs to solve a linear system by the CG method, but both FISTA and SNF need to evaluate the product ∇ 2 f (x k )z for some z ∈ ℝ n in every iteration, which is the most expensive part of the algorithm since it involves two multiplications with A or A T . Furthermore, the performance of the variants with limited memory BFGS-update for the Hessian of the smooth part is significantly better than that with the exact Hessian, although more outer and inner iterations are needed to reach the termination accuracy. Again, this is due to the number of Hessian-vector-multiplications, which appear in QGPN only once in every iteration to compute the function value and the BFGS-update, whereas in GPN they are needed in every inner iteration.
Both arguments together explain why QGPN-S is the best variant tested, whereas the performance of GPN-F is not competitive.
We see in Table 2 that almost all solutions of the subproblems satisfy the descent condition (14) and, since the number of function evaluations is approximately equal to the number of outer iterations, almost all search directions are applied with full step length. Thus, for this example, the globalization is not necessary in practice. Since problem (21) is globally strongly convex if A has full rank, a slight adaptation of our local convergence theory shows that one can prove convergence also without globalization. The details are left to the reader.
Comparison to other methods Since FISTA and the proximal gradient method are first order methods, it is not surprising that they need considerably more iterations to reach the termination tolerance. Thus, with the same arguments as above, they are not competitive due to the huge number of matrix-vector-products involving the matrices A or A T , although they do not need to evaluate the Hessians. The third first order method, SpaRSA, is far better, because the number of iterations and therefore the number of matrix-vector-products is much smaller, but it is still not able to compete with the second order methods.
The semismooth Newton method with filter globalization is the only second order method we compare our method to. As before, we see a correlation between the runtime and the number of matrix-vector-products with one of the matrices A or A T . As this number is higher than the one of QGPN, the runtime is still larger than the one of QGPN-S for most of the examples. In contrast to our method, we did not implement SNF with a limited memory BFGS-update. The low number of matrix-vector-products given in Table 2 suggests that this would not yield a significantly better performance.
Comparing FISTA with GPN-F and QGPN-F, where FISTA is used to solve the subproblems, we see that GPN-F is not competitive for the mentioned reasons, whereas QGPN-F is far better than FISTA on its own. A similar observation holds for the comparison of SNF with GPN-S and QGPN-S, where the GPN method is still the slowest method, but not significantly so. Thus, the globalized proximal Newton-type method with limited memory BFGS-update for the Hessian accelerates the performance of the underlying subproblem solver.

Student's t-regression with ℓ 1 -penalty
In many applications of inverse problems, the aim is to find a sparse solution x * ∈ ℝ n of the problem Ax = b with A ∈ ℝ m×n and b ∈ ℝ m . Often, b is not known exactly and only a perturbed vector is available. A widespread approach is to consider the penalized problem for some λ > 0 . This works well if the entries of b are corrupted by Gaussian errors; in particular, large errors then have a strong influence. In problems where the influence of large errors should be weighted less, but the influence of errors in a specific domain should be weighted more, it is reasonable to replace the quadratic loss by the Student loss. We obtain the problem with the scalar loss ℝ → ℝ, u ↦ log(1 + u 2 ∕ν) for some ν > 0 . For more information on Student's t-distribution, we refer to [1,32] and references therein. It is easy to see that the derivative of this loss is still Lipschitz continuous and the loss is coercive, but not convex. Thus, many state-of-the-art methods are not applicable to this problem. We expect a solution of (23) to solve the linear system Ax = b , at least approximately. Since the scalar loss is strongly convex on B √ν (0) and the residuals at a solution are expected to be small, the local convergence theory should be applicable near a solution of (23).
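The scalar Student loss and its second derivative, which determines the region of strong convexity mentioned above, can be written down directly (function names are our own):

```python
import numpy as np

def student_loss(u, nu):
    # scalar Student loss: log(1 + u^2 / nu)
    return np.log1p(u * u / nu)

def student_loss_d2(u, nu):
    # second derivative: 2 * (nu - u^2) / (nu + u^2)^2,
    # positive for |u| < sqrt(nu) and negative outside,
    # so the loss is strongly convex only on B_sqrt(nu)(0)
    return 2.0 * (nu - u * u) / (nu + u * u) ** 2
```

The first derivative 2u/(ν + u²) is bounded, which is the reason the gradient of the smooth part stays globally Lipschitz continuous.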

Algorithmic details
Subproblem solvers As seen in Sect. 5.1, the SNF subproblem solver performed much better than the FISTA subproblem solver. Thus, we again use the semismooth Newton method with filter globalization [32] for the solution of the subproblems, apply at most 10 inner iterations per outer iteration, and adapt the tolerance to obtain more exact solutions if the current iterate is close to the solution of the main problem. We denote this method by GPN.
Since the problem in this section is nonconvex, the subproblems might not be bounded from below. To circumvent this problem, we also implemented a variant with regularized Hessians. As the second derivative of the scalar loss is easy to compute and the Hessian of the objective function is of the form A T DA for some diagonal matrix D ∈ ℝ m×m , we replace all diagonal entries d i of D by the maximum of d i and a small positive constant. The subproblem solver remains unchanged, and we denote this regularized method by GPN+.
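The regularization used in GPN+ can be sketched as follows. This is a toy sketch with a dense matrix and our own names; the clamping constant `eps` is an assumption:

```python
import numpy as np

def regularized_hessian(A, d, eps=1e-8):
    """The Hessian of the smooth part has the form A^T D A with a
    diagonal D whose entries d_i may be negative (the second derivative
    of the Student loss changes sign). For the GPN+ variant described
    in the text, each d_i is replaced by max(d_i, eps), which makes the
    model Hessian positive semidefinite."""
    d_reg = np.maximum(d, eps)
    return A.T @ (d_reg[:, None] * A)

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
H = regularized_hessian(A, np.array([0.5, -2.0, 1.0]))
```

Without the clamping, the entry -2.0 would make the quadratic model indefinite and the subproblem possibly unbounded from below.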
Choice of parameters As above, we set p = 2.1 , the acceptance constant to 10 −8 , and the line search parameters to 0.1 and 10 −4 . In this case, we start with c 0 = 100 and again adapt c k to approximate the Lipschitz constant of the gradient of the smooth part in (23).
Quasi-Newton-update In the second of the following test examples, we again use a variant of the globalized proximal Newton method in which the Hessian of f is replaced by a limited memory BFGS-update with a memory of 10. We denote the method by QGPN. As before, we skip the update and use the previous approximation if (s k ) T y k < 10 −9 for s k = x k − x k−1 and y k = ∇f (x k ) − ∇f (x k−1 ) . Since this problem is not convex, one might expect updates to be skipped frequently. However, our experiments show that this happens in less than 10% of the iterations and, if so, mostly in the first iterations. Thus, the limited memory BFGS-updates remain practicable.

State-of-the-art methods
Since problem (23) is nonconvex, most of the methods in Sect. 5.1 do not apply in this case. We therefore compare our algorithm to the following methods.
PG The proximal gradient method as described in Algorithm 2.2 has no convexity requirement. Again, we use the line search parameters 0.1 and 10 −4 and set H k = c k I , where c k is initialized with c 0 = 100 and adapted to approximate the Lipschitz constant of ∇f .
SNF [32] The semismooth Newton method with filter globalization, as described in Sect. 5.1.1, also has a nonconvex variant with additional descent conditions, which are checked for the semismooth Newton update. We choose all constants as described in [32].

Numerical comparison
As mentioned above, we test two sets of examples. We start with the test setting described in [32]. Let n = 512 2 and m = n∕8 = 32768 . The matrix A ∈ ℝ m×n takes m random cosine measurements, i.e. for a random subset I ⊂ {1, … , n} with m elements, we set Ax = (dct(x)) I , where dct denotes the discrete cosine transform.
We generate a true sparse vector x true ∈ ℝ n with k = ⌊n∕40⌋ = 6553 nonzero entries, whose indices are chosen randomly. The nonzero components are computed via x true,i = ω 1 (i) 10 ω 2 (i) , where ω 1 (i) ∈ {±1} is a random sign and ω 2 (i) is chosen independently from a uniform distribution on [0, 1]. The image b ∈ ℝ m is generated by adding Student's t-noise with 4 degrees of freedom, rescaled by 0.1, to Ax true . We set ν = 0.25 and λ = 0.1 λ max , where λ max is the critical value for which the zero vector is already a critical point of (23). Using Fermat's rule for (23), we obtain by a short calculation λ max = ‖∇f (0)‖ ∞ = ‖ ∑ i 2b i ∕(ν + b i 2 ) a i ‖ ∞ , where a T i is the i-th row of A. We start with the initial point x 0 = A T b and, again, terminate each of the algorithms above when the value ψ(x k ) at the current iterate x k satisfies (22) for ε = 10 −6 , where ψ * is computed by GPN with very high accuracy. It is important to mention that all stationary points of problem (23), if there is more than one, have the same function value. Thus, this termination criterion makes sense although the problem is nonconvex.
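A small-scale sketch of this test setting might look as follows; we use n = 512 instead of 512² and SciPy's orthonormal DCT as a stand-in for the measurement operator, so all sizes here are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct

def random_cosine_setup(n=512, frac=8, k_frac=40, seed=0):
    """Sketch of the test setting in the text: A x keeps m = n/frac
    random entries of the DCT of x, and x_true has n/k_frac nonzero
    entries of the form omega1(i) * 10**omega2(i)."""
    rng = np.random.default_rng(seed)
    m = n // frac
    I = rng.choice(n, size=m, replace=False)
    Aop = lambda x: dct(x, norm='ortho')[I]  # matrix-free operator
    k = n // k_frac
    x_true = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)
    x_true[idx] = (rng.choice([-1.0, 1.0], size=k)
                   * 10.0 ** rng.uniform(0.0, 1.0, size=k))
    return Aop, x_true, m

Aop, x_true, m = random_cosine_setup()
y = Aop(x_true)
```

Working with the operator `Aop` instead of an explicit matrix mirrors the fact that the DCT is applied as a fast transform.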
For this example, we do not use QGPN, since test runs have shown that it is significantly slower than GPN here. The reason is that, in contrast to the example in Sect. 5.1.3, the computation of matrix-vector-products involving the matrix A is cheaper than the product with the BFGS-matrix, as the discrete cosine transform is a predefined Matlab function.
To ensure comparability, we look at the runtime of 100 test examples and document the performance using the performance profiles introduced by Dolan and Moré [20]. The results are shown in Fig. 2a; the averaged values for some counters are given in Table 3.
The first observation is that there is no significant difference between the globalized proximal Newton method GPN and the regularized version GPN+. In both methods, almost all updates are performed by proximal Newton steps. Thus, in the following we refer only to GPN.
The proximal gradient method is in all examples significantly slower than the second order methods. As mentioned above, this is not due to the number of matrix-vector-products, which has the same magnitude as the one for GPN. In contrast, the numbers of function evaluations and evaluations of the proximity operator are much higher.
To demonstrate the performance of the limited memory BFGS proximal Newton-type method QGPN, we construct a second test example with higher computation costs for the matrix-vector-products with the matrices A or A T . As there was no significant difference in the performance of GPN and GPN+, we apply GPN, QGPN, SNF, and the proximal gradient method PG to this setting. The results are shown in Fig. 2b and Table 4.
First, we observe that SNF did not converge at all within 1000 iterations for this problem class. A look at the function value shows that it increases in every step. Since SNF is not a descent method with respect to the function value and there is no result guaranteeing convergence in the nonconvex case, this behaviour is not entirely unexpected.
Comparing the remaining methods, we find that the results confirm the observations of the example in Sect. 5.1. The performance of QGPN is by far the best, whereas GPN is not competitive, though it is not as far behind as for the ℓ 1 -regularized logistic regression.

Logistic regression with overlapping group penalty
The main advantage of the globalized proximal Newton method over semismooth Newton methods is that it is also able to solve problems of type (1) where the nonsmooth function is not the ℓ 1 -norm and there is no known formula to compute the proximity operator. In this section, we consider problem (24), where A ∈ ℝ m×n contains the information on feature vectors and corresponding labels and the scalar loss ℝ → ℝ is u ↦ log(1 + exp(−u)) . A group penalty makes sense in many applications here, since some features are related to others. For more information on logistic regression with group penalty, we refer to [30].

(Table 3 shows averaged values of 100 runs for the first example in Sect. 5.2 with tolerance 10 −6 ; the columns have the same meaning as in Table 1.)

Algorithmic details
Subproblem solver As there is no formula to compute the proximity operator of the overlapping group penalty, the subproblem solvers of the previous sections are not directly applicable. We can, however, write the penalty as the composition of a group penalty without overlapping with a linear mapping B, and the proximity operator of the non-overlapping penalty can be computed. Both the proximal Newton subproblem and the proximity operator can then be written as the minimization of a quadratic ½ x T Qx − c T x plus the composed penalty, with a positive definite matrix Q ∈ ℝ n×n and c ∈ ℝ n . We solve both problems with fixed point methods described by Chen et al. in [16]: for the computation of the proximity operator, we use the fixed point algorithm based on the proximity operator (FP 2 O), and for the proximal Newton subproblem the primal-dual fixed point algorithm based on the proximity operator (PDFP 2 O).
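For NON-overlapping groups, the proximity operator of the group penalty is blockwise soft thresholding; a minimal sketch with our own function names:

```python
import numpy as np

def prox_group(v, groups, weights, t):
    """Proximity operator of t * sum_j w_j * ||x_{G_j}||_2 for
    non-overlapping groups G_j: each block is shrunk toward zero
    (blockwise soft thresholding)."""
    x = v.copy()
    for G, w in zip(groups, weights):
        nrm = np.linalg.norm(v[G])
        if nrm <= t * w:
            x[G] = 0.0          # whole group is set to zero
        else:
            x[G] = (1.0 - t * w / nrm) * v[G]
    return x

v = np.array([3.0, 4.0, 0.5])
x = prox_group(v, [[0, 1], [2]], [1.0, 1.0], t=1.0)
```

The fixed point methods mentioned above reduce the overlapping case to repeated evaluations of exactly this operator.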
For both methods, we use a stopping tolerance of 10 −9 and apply at most 10 iterations for each problem. For PDFP 2 O, we also need the largest eigenvalue of BB T , which can be shown to be equal to the largest integer k such that there exists an index i ∈ {1, … , n} that is contained in k groups G j .
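The stated characterization of the largest eigenvalue of BB T can be checked numerically on a small instance; here B is built explicitly from the group restrictions, which the actual implementation avoids:

```python
import numpy as np

def max_overlap(groups, n):
    """Largest number of groups sharing a single index i, which the
    text states equals the largest eigenvalue of B B^T."""
    counts = np.zeros(n, dtype=int)
    for G in groups:
        counts[G] += 1
    return counts.max()

groups = [[0, 1], [1, 2], [2, 3]]
n = 4
# B stacks the group restriction operators (rows of the identity)
B = np.vstack([np.eye(n)[G] for G in groups])
lam_max = np.linalg.eigvalsh(B @ B.T).max()
```

Here index 1 (and index 2) lies in two groups, so both quantities equal 2.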
Choice of parameters As before, we set p = 2.1 , the acceptance constant to 10 −8 , and the line search parameters to 0.1 and 10 −4 . Here, we start with c 0 = 1 and again adapt c k to approximate the Lipschitz constant of the gradient of the smooth part in (24).
Other methods We compare our method, with the above mentioned subproblem solvers, to FISTA [6] with the parameters as in Sect. 5.1.1. For the computation of the proximity operators, we also use FP 2 O. Furthermore, we apply PDFP 2 O directly to problem (24).

Numerical comparison
We follow an example in [2] and generate A ∈ ℝ n×m with n = 1000 , m = 700 from a uniform distribution and normalize the columns of A. The groups G j are defined as follows: the first five groups contain five consecutive indices each, and the last element of one group is, at the same time, the first element of the next group. Each of the next five groups contains one element of one of the first groups. The remaining groups have no overlap and always contain 10 elements. The group weights are chosen to be 1∕ √ |G j | , where |G j | is the number of indices in that group.
The parameter λ is again chosen as 0.1 λ max , where λ max is the critical value such that 0 is a solution of (24) for all λ ≥ λ max . Letting a T i denote the rows of A, a short computation yields the corresponding value λ max . As before, we start with the initial value x 0 = 0.
We terminate each of the algorithms as soon as the current iterate satisfies (22) for ε = 10 −6 , where ψ * is the function value computed by GPN using very high accuracy. Again, we document the results using performance profiles on the runtime of 100 test examples. The results are shown in Fig. 3; the averaged values for some counters are given in Table 5.
We see that in about 15% of the examples FISTA performs better than GPN, but in most examples GPN shows by far the best performance. This can be explained by the number of inner iterations of both methods. In this case, the costs of the inner iterations are almost equal for both methods. Since the average number of inner iterations in FISTA is more than twice as large as the one of GPN, this accounts for the difference in performance.

Nonconvex image restoration
We demonstrate the performance of our method for nonconvex image restoration. Given a noisy blurred image b ∈ ℝ n and a blur operator A ∈ ℝ n×n , the aim is to find an approximation x to the original image satisfying Ax = b . Note that, for simplicity, we assume that the images x, b are vectors in ℝ n . For this purpose, we again use the Student loss from Sect. 5.2 and obtain the problem with the scalar loss u ↦ log(1 + u 2 ∕ν) for some ν > 0 , where B ∶ ℝ n → ℝ n is a two-dimensional discrete Haar wavelet transform, which guarantees antialiasing.
Since B is orthogonal, we get prox ‖B⋅‖ 1 (u) = B T prox ‖⋅‖ 1 (Bu). Thus, the proximity operator can be computed exactly. Similar to Sect. 5.2, we expect the smooth part to be strongly convex in a neighbourhood of a solution, so that our local convergence theory applies here.
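The identity for orthogonal B can be verified on a toy example; here B is a 2×2 Haar-like orthogonal matrix (an assumption for illustration), and we check the prox optimality condition numerically:

```python
import numpy as np

def prox_l1(v, t):
    # componentwise soft thresholding: prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_analysis_l1(u, B, t):
    """For orthogonal B (e.g. a discrete Haar wavelet transform),
    prox_{t*||B.||_1}(u) = B^T prox_{t*||.||_1}(B u), obtained by the
    change of variables w = B x."""
    return B.T @ prox_l1(B @ u, t)

# 2x2 orthogonal 'Haar-like' transform
B = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)
u = np.array([2.0, 0.5])
x = prox_analysis_l1(u, B, t=0.1)
```

The optimality condition of the prox problem requires B(u - x)/t to be a subgradient of the 1-norm at Bx, which the test below checks.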

Algorithmic details
We solve the subproblems using FISTA with a maximum of 50 iterations and a tolerance of 10 −6 . We do not use the SNF solver here, since the occurring linear systems of equations are not separable and we would need to solve a full dimensional system of equations several times; see below for details. The parameters are chosen as in Sect. 5.3.1.
We compare our methods GPN and the limited memory BFGS variant QGPN, where the updating of the BFGS-matrix follows the description in Sect. 5.2.1, to the proximal gradient method PG and the semismooth Newton method with filter globalization SNF [32] with the parameters mentioned in that paper. In this case, the matrix M(x k ) occurring in the linear systems M(x k )d = −F(x k ) involves an approximation H k to the Hessian of the smooth part and a diagonal matrix D k depending on the iterate x k . This matrix neither has a block structure nor is it separable, so a full dimensional linear system of equations has to be solved, which impairs the performance of this method. We solve each of these systems using GMRES(m) with at most 100 iterations and a restart every m = 10 iterations.
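Solving such a system with restarted GMRES might be sketched as follows; the matrix here is a random well-conditioned stand-in for M(x k ), not the actual matrix from the method:

```python
import numpy as np
from scipy.sparse.linalg import gmres

# stand-in linear system M d = -F of moderate size
rng = np.random.default_rng(0)
n = 50
M = np.eye(n) + 0.05 * rng.standard_normal((n, n))  # close to identity
F = rng.standard_normal(n)

# GMRES(m): restart every m = 10 iterations, at most 100 cycles
d, info = gmres(M, -F, restart=10, maxiter=100)
```

`info == 0` signals convergence; for the actual (non-separable) matrix M(x k ), only matrix-vector products with M are required, so the system never has to be formed explicitly.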

Numerical comparison
We follow the example in [41], see also [11]. In detail, A is a Gaussian blur operator with standard deviation 4 and a filter size of 9, ν = 1 , and B is the discrete Haar wavelet transform of level four. Furthermore, we choose λ = 10 −4 . The blurred noisy image b is created by applying A to the test image cameraman of size 256 × 256 and adding Student's t-noise with one degree of freedom, rescaled by 10 −3 . For all methods, the initial point is x 0 = b.
Since the most expensive computations are the applications of A, B and their transposes, we stop each of the algorithms if, after an outer iteration, the sum of these applications exceeds 2 ⋅ 10 4 . The results are shown in Fig. 4 and Table 6.
The reason why the restored images are slightly lighter than the original is that we used the Haar wavelet transform with four levels and not the maximal possible level log 2 (256) = 8 . Furthermore, we mention that for GPN and QGPN almost all iterations are Newton steps, whereas for SNF only half of the iterations are Newton steps. As expected, the performance of the semismooth Newton method with filter globalization is not satisfying here, since the solution of the linear systems is expensive. In contrast, the proximal methods show good restorations. The differences in the corresponding images in Fig. 4 are hard to see, so we study the values in Table 6. The relative error (ψ(x) − ψ * )∕ψ * , where x is the image provided by the algorithm and ψ * is the value of the objective at the original image, is best for GPN, so the corresponding image best approximates the original one. Comparing the inner iterations of GPN and QGPN with the iterations of the proximal gradient method, the ones of GPN and PG are within the same range, whereas the ones of QGPN are almost twice as high. This and the number of calls of B and B T explain why the CPU-time used by QGPN is approximately twice as much as the one of GPN. In this case, the avoidance of calls of A and A T does not yield a better performance, since the price to pay is the higher number of calls of the Haar transform.
Comparing GPN and PG, the numbers of (inner) iterations and applications of A, B and their transposes are almost the same, but the superiority of the second order method GPN over PG can be seen in the values of the CPU-time and the relative error of the function value.

Conclusion
We introduced a globalization of the proximal Newton-type method to solve structured optimization problems whose objective is the sum of a smooth and a convex function. For this purpose, the proximal Newton-type method was combined with a proximal gradient method using a novel descent criterion. We also presented an inexactness approach and the possibility to replace the Hessian of the smooth part by quasi-Newton matrices. We proved global convergence in the convex and nonconvex case and, under suitable conditions, local superlinear convergence.
The numerical part shows that the proposed method is competitive for convex and nonconvex problems, especially when the computation of the Hessian is expensive and limited memory quasi-Newton updates can be used. Furthermore, when there is no efficient way to compute the proximity operator for the nonsmooth function, the globalized proximal Newton-type method outperforms the methods it was compared to.

(Table 6 lists the values of the example in Sect. 5.3 for the four tested algorithms. Abbreviations: time (CPU-time in seconds), iter (total number of outer iterations), optim (optimality criterion (ψ(x) − ψ * )∕ψ * ), subiter (number of inner iterations), A-calls, B-calls (applications of the mappings A and B and their transposes, respectively).)