Convergence Properties of Monotone and Nonmonotone Proximal Gradient Methods Revisited

Composite optimization problems, where the sum of a smooth and a merely lower semicontinuous function has to be minimized, are often tackled numerically by means of proximal gradient methods as soon as the lower semicontinuous part of the objective function is of simple enough structure. The available convergence theory associated with these methods (mostly) requires the derivative of the smooth part of the objective function to be (globally) Lipschitz continuous, and this might be a restrictive assumption in some practically relevant scenarios. In this paper, we readdress this classical topic and provide convergence results for the classical (monotone) proximal gradient method and one of its nonmonotone extensions which are applicable in the absence of (strong) Lipschitz assumptions. This is possible since, at the price of forgoing convergence rates, we avoid descent-type lemmas in our analysis.


Introduction
In this paper, we address the classical problem of minimizing the sum of a smooth function f and a nonsmooth function φ, also known under the name composite optimization. This setting has received much attention in recent years due to its inherent practical relevance in, e.g., machine learning, data compression, matrix completion, and image processing, see e.g. [6,13,14,20,27,28].
A standard technique for the solution of composite optimization problems is the proximal gradient method, introduced by Fukushima and Mine [21] and popularized, e.g., by Combettes and Wajs in [18]. A particular instance of this method is the celebrated iterative shrinkage/thresholding algorithm (ISTA), see, e.g., [5]. A summary of existing results for the case where the nonsmooth term φ is a convex function is given in the monograph by Beck [4].
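To fix ideas, the basic ISTA iteration can be sketched for the sparse least-squares model f(x) = ½‖Ax − b‖², φ = λ‖·‖₁. This is our own minimal illustration, not part of the original text; the constant stepsize `step` is an assumption of the sketch (classically it would be chosen as 1/L with L the Lipschitz constant of ∇f):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, step, iters=100):
    """Basic proximal gradient (ISTA) sketch for
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                         # gradient step on the smooth part f
        x = soft_threshold(x - step * grad, step * lam)  # prox step on the nonsmooth part phi
    return x
```

For A equal to the identity, the iteration reduces to a single soft-thresholding of b, recovering the closed-form solution of the ℓ1-regularized denoising problem.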
The proximal gradient method can also be interpreted as a forward-backward splitting method, see [12,31] for its origins and [3] for a modern view, and it is able to handle problems where the nonsmooth term φ is given by a merely lower semicontinuous function, see, e.g., the seminal works [1,8]. These references also provide convergence and rate-of-convergence results by combining the popular descent lemma with the celebrated Kurdyka-Łojasiewicz property.
To the best of our knowledge, however, the majority of available convergence results for proximal gradient methods assume that the smooth term f is continuously differentiable with a globally Lipschitz continuous gradient (or they require local Lipschitzness together with a bounded level set which, again, implies global Lipschitz continuity on this level set). This requirement, which is the essential ingredient for the classical descent lemma, is often satisfied for standard applications of the proximal gradient method in data science and image processing, where f is a quadratic function.
In this paper, we aim to get rid of this global Lipschitz condition. This is motivated by the fact that the algorithmic application we have in mind does not satisfy this Lipschitz property: the smooth term f corresponds to the augmented Lagrangian function of a general nonlinear constrained optimization problem, which rarely has a globally Lipschitz continuous gradient or a bounded level set. The proximal gradient method will be used to solve the resulting subproblems, which forces us to generalize the convergence theory under reasonable assumptions that are likely to hold in our framework. We refer the interested reader to [15,19,23,25], where such augmented Lagrangian proximal methods are investigated.
Numerically, a nonmonotone version of the proximal gradient method is often preferred. Based on ideas by Grippo et al. [22] in the context of smooth unconstrained optimization problems, Wright et al. [34] developed a nonmonotone proximal gradient method for composite optimization problems known under the name SpaRSA. In their paper, the authors assume that the nonsmooth part φ of the objective function is convex. Almost simultaneously, the authors of [7] presented a nonmonotone projected gradient method for the minimization of a differentiable function over a convex set. Their findings can be interpreted as a special case of the results from [34] where φ equals the indicator function of a convex set. The ideas from [7,34] were subsequently generalized in the papers [15,16], where the proximal gradient method is used as a subproblem solver within an augmented Lagrangian and a penalization scheme, respectively. However, the authors did not address the aforementioned lack of Lipschitzness in these papers, which renders their convergence theory barely applicable in their own algorithmic framework. In [26,33], the authors present nonmonotone extensions of ISTA which can handle merely lower semicontinuous terms in the objective function. Again, the convergence analysis assumes global Lipschitzness of the smooth term's derivative. Due to its practical importance, we therefore aim to provide a convergence theory for the nonmonotone proximal gradient method without using any Lipschitz assumption.
In the seminal paper [2], the authors consider the composite optimization problem with both terms being convex, but without a global Lipschitz assumption on the gradient of the smooth part f. They obtain suitable rate-of-convergence results for the iterates generated by a Bregman-type proximal gradient method using only a local Lipschitz condition. In addition, however, they require that there is a constant L > 0 such that Lh − f is convex, where h is a convex function which defines the Bregman distance (in our setting, h equals the squared norm). Some examples indicate that this convexity-type condition is satisfied in many practically relevant situations. Subsequently, this approach was generalized to the nonconvex setting in [9] using, once again, only a local Lipschitz assumption, as well as the slightly stronger requirement (in order to deal with the nonconvexity) that there exist L > 0 and a convex function h such that both Lh − f and Lh + f are convex. Note that the constant L plays a central role in the design of the corresponding proximal-type methods. In particular, it is used explicitly in the choice of stepsizes. Finally, the very recent paper [17] proves global convergence results under a local Lipschitz assumption (without the additional convexity-type condition), but assumes that the iterates and stepsizes of the underlying proximal gradient method remain bounded.
To the best of our knowledge, this is the current state of the art regarding the convergence properties of proximal gradient methods. The aim of this paper is slightly different, since we do not provide rate-of-convergence results, but conditions which guarantee accumulation points to be suitable stationary points of the composite optimization problem. This is the essential feature of the proximal gradient method which, for example, is exploited in [15,19,25] to develop augmented Lagrangian proximal methods. We also stress that, in this particular situation, the above assumption that Lh ± f is convex for some L > 0 is often violated unless we are dealing with linear constraints only.
Our analysis does not require a global Lipschitz assumption and is not based on the crucial descent lemma, in contrast to [2,9] mentioned above. The results show that stationary accumulation points can be obtained under merely a local Lipschitz assumption and, depending on the properties of φ, sometimes even without any Lipschitz condition. In any case, a convexity-type condition like Lh ± f being convex for some constant L is not required at all. Moreover, the implementation of our proximal gradient method does not need any knowledge of the size of any Lipschitz-type constant.
Since the aim of this paper is to get a better understanding of the theoretical convergence properties of both monotone and nonmonotone proximal gradient methods, and since these methods have already been applied numerically to a large variety of problems, we do not include any numerical results in this paper.
Let us recall that we are mainly interested in conditions ensuring that accumulation points of sequences produced by the proximal gradient method are stationary. The main contributions of this paper show that this property holds (neglecting a few technical conditions) for the monotone proximal gradient method if either the smooth function f is continuously differentiable and the nonsmooth function φ is continuous on its domain (e.g., this assumption holds for a constrained optimization problem where φ corresponds to the indicator function of a nonempty and closed set), or if f is differentiable with a locally Lipschitz continuous derivative and φ is an arbitrary lower semicontinuous function. Corresponding statements for the nonmonotone proximal gradient method require stronger assumptions, basically the uniform continuity of the objective function on a level set. That, however, is a standard assumption in the literature dealing with nonmonotone stepsize rules.
The paper is organized as follows: In Section 2, we give a detailed statement of the composite optimization problem and provide some necessary background material from variational analysis. The convergence properties of the monotone and nonmonotone proximal gradient methods are then discussed in Sections 3 and 4, respectively. We close with some final remarks in Section 5.

Problem Setting and Preliminaries
We consider the composite optimization problem

\[
\min_{x \in X} \; \psi(x) := f(x) + \varphi(x), \tag{P}
\]

where f : X → R is continuously differentiable, φ : X → R̄ := R ∪ {∞} is lower semicontinuous (possibly infinite-valued and nondifferentiable), and X denotes a Euclidean space, i.e., a real and finite-dimensional Hilbert space. We assume that the domain dom φ := {x ∈ X | φ(x) < ∞} of φ is nonempty to rule out trivial situations. In order to minimize the function ψ : X → R̄ in (P), it seems reasonable to exploit the composite structure of ψ, i.e., to rely on the differentiability of f on the one hand, and on some beneficial structural properties of φ on the other. This is the idea behind splitting methods.

Throughout the paper, the Euclidean space X is equipped with the inner product ⟨·, ·⟩ : X × X → R and the associated norm ‖·‖. For a set A ⊂ X and a point x ∈ X, we write A + x = x + A := {x + a | a ∈ A} for simplicity. For a sequence {x^k} ⊂ X and x ∈ X, x^k →_φ x means that x^k → x and φ(x^k) → φ(x). The continuous linear operator f′(x) : X → R denotes the derivative of f at x ∈ X, and we make use of ∇f(x) := f′(x)* 1, where f′(x)* : R → X is the adjoint of f′(x). This way, ∇f is a mapping from X to X, and f′(x)d = ⟨∇f(x), d⟩ holds for all d ∈ X.

The following concepts are standard in variational analysis, see e.g. [29,32]. Let us fix some point x ∈ dom φ. Then

\[
\widehat{\partial}\varphi(x) := \Bigl\{ x^* \in X \;\Bigm|\; \liminf_{y \to x} \frac{\varphi(y) - \varphi(x) - \langle x^*, y - x \rangle}{\|y - x\|} \ge 0 \Bigr\}
\]

is called the regular (or Fréchet) subdifferential of φ at x. Furthermore, the set

\[
\partial\varphi(x) := \bigl\{ x^* \in X \;\bigm|\; \exists \{x^k\}, \{\eta^k\} \subset X \colon \; x^k \to_\varphi x, \; \eta^k \to x^*, \; \eta^k \in \widehat{\partial}\varphi(x^k) \; \forall k \in \mathbb{N} \bigr\}
\]

is well known as the limiting (or Mordukhovich) subdifferential of φ at x.
Clearly, we always have ∂̂φ(x) ⊂ ∂φ(x) by construction. Whenever φ is convex, equality holds, and both subdifferentials coincide with the subdifferential of convex analysis, i.e.,

\[
\widehat{\partial}\varphi(x) = \partial\varphi(x) = \{ x^* \in X \mid \varphi(y) \ge \varphi(x) + \langle x^*, y - x \rangle \; \forall y \in X \}
\]

holds in this situation. It can be seen right from the definition that whenever x* ∈ dom φ is a local minimizer of φ, then 0 ∈ ∂̂φ(x*), which is referred to as Fermat's rule, see [29, Proposition 1.30(i)]. Given x ∈ dom φ, the limiting subdifferential has the important robustness property

\[
\bigl\{ x^* \in X \;\bigm|\; \exists \{x^k\}, \{\eta^k\} \subset X \colon \; x^k \to_\varphi x, \; \eta^k \to x^*, \; \eta^k \in \partial\varphi(x^k) \; \forall k \in \mathbb{N} \bigr\} \subset \partial\varphi(x), \tag{2.1}
\]

see [29, Proposition 1.20]. Clearly, the converse inclusion ⊃ is also valid by definition of the limiting subdifferential. Note that in situations where φ is discontinuous at x, the requirement x^k →_φ x in the definition of the set on the left-hand side of (2.1) is strictly necessary. In fact, the usual outer semicontinuity in the sense of set-valued mappings, which merely demands x^k → x in place of x^k →_φ x, would be a much stronger condition in this situation and does not hold in general.
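As a small illustration of the possible gap between these two constructions (our own example, not part of the surrounding text), consider φ(x) := −|x| on X := R. Then

\[
\widehat{\partial}\varphi(0) = \emptyset,
\qquad
\partial\varphi(0) = \{-1, +1\},
\]

since φ admits no regular subgradient at the downward kink, while the limiting subdifferential collects the limits of the regular subgradients ∂̂φ(x) = {−sign(x)}, x ≠ 0, along x → 0. In particular, the inclusion ∂̂φ(x) ⊂ ∂φ(x) can be strict, and ∂φ(x) may fail to be convex.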
Whenever x ∈ dom φ is fixed, the sum rule

\[
\partial(f + \varphi)(x) = \nabla f(x) + \partial\varphi(x) \tag{2.3}
\]

holds, see [29, Proposition 1.30(ii)], and the analogous formula is valid for the regular subdifferential. Thus, due to Fermat's rule, whenever x* ∈ dom φ is a local minimizer of f + φ, we have 0 ∈ ∇f(x*) + ∂̂φ(x*). This condition is potentially more restrictive than 0 ∈ ∇f(x*) + ∂φ(x*) which, naturally, also serves as a necessary optimality condition for (P). However, the latter is more interesting from an algorithmic point of view, as is well known from the literature on splitting methods comprising nonconvex functions φ. If φ is convex, there is no difference between those stationarity conditions. Throughout the paper, a point x* ∈ dom φ satisfying 0 ∈ ∇f(x*) + ∂φ(x*) will be called a Mordukhovich-stationary (M-stationary for short) point of (P) due to the appearance of the limiting subdifferential. In the literature, the name limiting critical point is used as well. We close this section with two special instances of problem (P) and comment on the corresponding M-stationarity conditions.

Remark 2.1. Let φ := δ_C be the indicator function of a nonempty and closed set C ⊂ X, which vanishes on C and equals ∞ on X \ C. Then (P) amounts to the minimization of f over C, and since ∂δ_C(x) coincides with the limiting normal cone N_C(x) to C at x ∈ C, the associated M-stationarity condition reads 0 ∈ ∇f(x*) + N_C(x*).

Remark 2.2. Consider the minimization of f(x) + ϕ(x) over C, with f : X → R and C ⊂ X as in Remark 2.1, and ϕ : X → R̄ being another lower semicontinuous function (which might represent a regularization, penalty, or sparsity-promoting term, for example). Setting φ := ϕ + δ_C, we obtain once again an optimization problem of the form (P). The corresponding M-stationarity condition is given by 0 ∈ ∇f(x*) + ∂(ϕ + δ_C)(x*). Unfortunately, the sum rule ∂(ϕ + δ_C)(x) ⊂ ∂ϕ(x) + N_C(x) does not hold in general. However, for locally Lipschitz functions ϕ, for example, it applies, see [29, Theorems 1.22, 2.19]. Note that the resulting stationarity condition might be slightly weaker than M-stationarity as introduced above. Related discussions can be found in [24, Section 3].
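As a concrete one-dimensional illustration of the M-stationarity condition (our own worked example, with f and φ chosen for transparency): take X := R, f(x) := ½(x − a)² for some a ∈ R, and φ := λ|·| with λ > 0. Since φ is convex, M-stationarity reads

\[
0 \in x^* - a + \lambda\, \partial|\cdot|(x^*),
\qquad
\partial|\cdot|(x) =
\begin{cases}
\{\operatorname{sign}(x)\}, & x \neq 0,\\
[-1, 1], & x = 0,
\end{cases}
\]

and a short case distinction shows that the unique solution is the soft-thresholding point x* = sign(a) max{|a| − λ, 0}, i.e., exactly the proximal point of λ|·| at a.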

Monotone Proximal Gradient Method
We first investigate a monotone version of the proximal gradient method applied to the composite optimization problem (P) with f being continuously differentiable and φ being lower semicontinuous. Recall that the corresponding M-stationarity condition is given by 0 ∈ ∇f(x) + ∂φ(x).
Our aim is to find, at least approximately, an M-stationary point of (P). The following algorithm is the classical proximal gradient method for this class of problems. Since we will also consider a nonmonotone variant of this algorithm in the following section, we call this the monotone proximal gradient method.

Algorithm 3.1 (Monotone proximal gradient method).
1: Choose x^0 ∈ dom φ, parameters γ_min > 0, τ > 1, δ ∈ (0, 1), and set k := 0.
2: while a suitable termination criterion is violated do
3: Choose γ_{k,0} ∈ [γ_min, ∞).
4: For i = 0, 1, 2, ..., set γ_{k,i} := τ^i γ_{k,0} and compute a solution x^{k,i} of

   min_x ⟨∇f(x^k), x − x^k⟩ + (γ_{k,i}/2) ‖x − x^k‖² + φ(x),    (3.1)

   until the acceptance criterion

   ψ(x^{k,i}) ≤ ψ(x^k) − (δ γ_{k,i}/2) ‖x^{k,i} − x^k‖²    (3.2)

   holds.
5: Denote by i_k := i the terminal value, and set γ_k := γ_{k,i_k} and x^{k+1} := x^{k,i_k}.
6: Set k ← k + 1.
7: end while
8: return x^k

The convergence theory requires some technical assumptions.

Assumption 3.2. (a) The function ψ is bounded from below on dom φ. (b) The function φ is bounded from below by an affine function.

Assumption 3.2 (a) is a reasonable condition regarding the given composite optimization problem, whereas Assumption 3.2 (b) is essentially a statement relevant for the subproblems from (3.1). In particular, Assumption 3.2 (b) implies that the quadratic objective function of the subproblem (3.1) is, for fixed k, i ∈ N, coercive, and therefore always attains a solution x^{k,i} (which, however, may not be unique).
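The overall scheme can be sketched in code as follows. This is our own hedged rendering: `prox(v, gamma)` is an assumed oracle returning some minimizer of φ(x) + (γ/2)‖x − v‖², `psi` evaluates ψ = f + φ, and the parameter names only loosely mirror γ_min, τ, δ from the text (the sketch restarts the prox parameter at `gamma_min` in every outer iteration, which is one admissible choice):

```python
import numpy as np

def monotone_prox_gradient(grad_f, psi, prox, x0, gamma_min=1e-4,
                           tau=2.0, delta=0.5, max_iter=200, tol=1e-8):
    """Sketch of a monotone proximal gradient method with backtracking.

    In each outer iteration, the prox parameter gamma is increased by the
    factor tau until a sufficient-decrease test in the spirit of (3.2)
    holds; the solution of the quadratic subproblem is exactly the
    proximal point of phi at x - grad_f(x) / gamma."""
    x = x0
    for _ in range(max_iter):
        gamma = gamma_min
        while True:
            x_trial = prox(x - grad_f(x) / gamma, gamma)
            decrease = psi(x) - psi(x_trial)
            if decrease >= (delta * gamma / 2.0) * np.linalg.norm(x_trial - x) ** 2:
                break        # sufficient decrease achieved
            gamma *= tau     # backtracking: try a larger prox parameter
        if np.linalg.norm(x_trial - x) <= tol:
            return x_trial
        x = x_trial
    return x
```

Note that no Lipschitz constant of ∇f enters the implementation; the backtracking loop only ever increases γ within one outer iteration, matching the inner loop of Step 4.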
The subsequent convergence theory assumes implicitly that Algorithm 3.1 generates an infinite sequence.
We first establish that the stepsize rule in Step 4 of Algorithm 3.1 is always finite.
Lemma 3.3. Consider a fixed iteration k of Algorithm 3.1, assume that x^k is not an M-stationary point of (P), and suppose that Assumption 3.2 (b) holds. Then the inner loop in Step 4 of Algorithm 3.1 is finite, i.e., we have γ_k = γ_{k,i_k} for some finite index i_k ∈ {0, 1, 2, ...}.
Proof. Suppose that the inner loop of Algorithm 3.1 does not terminate after a finite number of steps in iteration k. Recall that x^{k,i} is a solution of (3.1). Therefore, we get a chain of estimates where the first estimate follows from the lower semicontinuity of φ and the final inequality is a consequence of (3.3). Therefore, we have (3.4). We claim that (3.5) holds. Assume, by contradiction, that there is a subsequence i_l → ∞ along which (3.6) is valid. Since x^{k,i_l} is optimal for (3.1), Fermat's rule and the sum rule (2.3) yield a corresponding inclusion for all l ∈ N. Taking the limit l → ∞ while using (3.4) and (3.6), we obtain 0 ∈ ∇f(x^k) + ∂φ(x^k), which means that x^k is already an M-stationary point of (P). This contradiction shows that (3.5) holds. Hence, there is a constant c > 0 such that (3.7) holds for all large enough i ∈ N. In particular, this implies (3.8) for all sufficiently large i ∈ N. Furthermore, (3.3) shows that (3.9) is valid. Using a Taylor expansion of the function f and exploiting (3.8), (3.9), we obtain that the acceptance criterion (3.2) holds for all i ∈ N sufficiently large, contradicting our assumption. This completes the proof.
Let us note that the above proof actually shows that the inner loop from Step 4 of Algorithm 3.1 is either finite, or the limit (3.10) holds. Recalling that ∇f : X → X is continuous, this motivates to also use an inexactness test with tolerance τ_abs > 0 as a termination criterion of the inner loop, since this encodes, in some sense, approximate M-stationarity of x^{k,i} for (P) (note that taking the limit l → ∞ in (3.10) would recover the limiting subdifferential of φ at x^k since we have x^{k,i_l} →_φ x^k by (3.4)).
A critical step for the convergence theory of Algorithm 3.1 is provided by the following result.

Proposition 3.4. Let Assumption 3.2 hold and let {x^k} be a sequence generated by Algorithm 3.1. Then x^{k+1} − x^k → 0.

Proof. First recall that the sequence {x^k} is well-defined by Lemma 3.3. Using the acceptance criterion (3.2), we get

ψ(x^{k+1}) ≤ ψ(x^k) − (δ γ_k/2) ‖x^{k+1} − x^k‖²    (3.11)

for all k ∈ N. Hence, the sequence {ψ(x^k)} is monotonically decreasing. Since ψ is bounded from below on dom φ by Assumption 3.2 (a) and {x^k} ⊂ dom φ, it follows that this sequence is convergent. Therefore, (3.11) implies γ_k ‖x^{k+1} − x^k‖² → 0. Hence the assertion follows from the fact that, by construction, we have γ_k ≥ γ_min > 0 for all k ∈ N.
A refined analysis gives the following result.
Proposition 3.5. Let Assumption 3.2 hold, let {x^k} be a sequence generated by Algorithm 3.1, and let {x^k}_K be a subsequence converging to some point x*. Then γ_k(x^{k+1} − x^k) →_K 0.

Proof. If the subsequence {γ_k}_K is bounded, the statement follows immediately from Proposition 3.4. The remaining part of this proof therefore assumes that this subsequence is unbounded. Without loss of generality, we may assume that γ_k →_K ∞ and that the acceptance criterion (3.2) is violated in the first iteration of the inner loop for each k ∈ K. Then, for γ̃_k := γ_k/τ, k ∈ K, we also have γ̃_k →_K ∞, but the corresponding vector x̃^k := x^{k,i_k−1} does not satisfy the stepsize condition from (3.2), i.e., we have (3.12). On the other hand, since x̃^k solves the corresponding subproblem (3.1) with γ̃_k = γ_{k,i_k−1}, we have (3.13). We claim that this, in particular, implies x̃^k →_K x*. In fact, using (3.13), the Cauchy-Schwarz inequality, and the monotonicity of {ψ(x^k)}, we obtain (3.14). Since f is continuously differentiable and −φ is bounded from above by an affine function in view of Assumption 3.2 (b), this implies x̃^k − x^k →_K 0. In fact, if {‖x̃^k − x^k‖}_K were unbounded, the left-hand side would grow more rapidly than the right-hand side, and if {‖x̃^k − x^k‖}_K were bounded but stayed away from zero by a positive number, at least along a subsequence, the right-hand side would be bounded, whereas the left-hand side would be unbounded along the corresponding subsequence. Now, by the mean-value theorem, there exists ξ^k on the line segment connecting x^k and x̃^k such that f(x̃^k) = f(x^k) + ⟨∇f(ξ^k), x̃^k − x^k⟩. Exploiting (3.12), we therefore obtain an estimate which can be rewritten as (3.15) (note that x̃^k ≠ x^k in view of (3.12)). Since x^k →_K x* (by assumption) and x̃^k →_K x* (by the previous part of this proof), we also get ξ^k →_K x*. Using δ ∈ (0, 1) and the continuous differentiability of f, it follows from (3.15) that γ̃_k ‖x̃^k − x^k‖ →_K 0.
Finally, exploiting the fact that x^{k+1} and x̃^k are solutions of the subproblems (3.1) with parameters γ_k and γ̃_k, respectively, we find two inequalities. Adding these two inequalities and noting that γ_k = τ γ̃_k > γ̃_k yields ‖x^{k+1} − x^k‖ ≤ ‖x̃^k − x^k‖ and, therefore, γ_k ‖x^{k+1} − x^k‖ = τ γ̃_k ‖x^{k+1} − x^k‖ ≤ τ γ̃_k ‖x̃^k − x^k‖ →_K 0. This completes the proof.
The above technique of proof implies a boundedness result for the sequence {γ k } K if ∇f satisfies a local Lipschitz property around the associated accumulation point of iterates.This observation is stated explicitly in the following result.
Corollary 3.6. Let Assumption 3.2 hold, let {x^k} be a sequence generated by Algorithm 3.1, let {x^k}_K be a subsequence converging to some point x*, and assume that ∇f : X → X is locally Lipschitz continuous around x*. Then the corresponding subsequence {γ_k}_K is bounded.
Proof. We may argue as in the proof of Proposition 3.5. Hence, to the contrary, assume that γ_k →_K ∞. For each k ∈ K, define γ̃_k and x̃^k as in that proof, and let L > 0 denote the local Lipschitz constant of ∇f around x*. Recall that x^k →_K x* (by assumption) and x̃^k →_K x* (from the proof of Proposition 3.5). Exploiting (3.15), we therefore obtain an upper bound for γ̃_k in terms of L and δ for all k ∈ K sufficiently large, using the fact that ξ^k lies on the line segment between x^k and x̃^k. Since γ̃_k →_K ∞ and x̃^k ≠ x^k, see once again (3.12), this gives a contradiction. Hence, {γ_k}_K stays bounded.
The following is the main convergence result for Algorithm 3.1; it requires a slightly stronger smoothness assumption on either f or φ.

Theorem 3.7. Let Assumption 3.2 hold and assume that either φ is continuous on dom φ or ∇f : X → X is locally Lipschitz continuous. Then each accumulation point x* of a sequence {x^k} generated by Algorithm 3.1 is an M-stationary point of (P).
Proof. Let {x^k}_K be a subsequence converging to x*. In view of Proposition 3.4, the subsequence {x^{k+1}}_K also converges to x*. Furthermore, Proposition 3.5 yields γ_k(x^{k+1} − x^k) →_K 0. The minimizing property of x^{k+1}, Fermat's rule, and the sum rule (2.3) imply that

0 ∈ ∇f(x^k) + γ_k(x^{k+1} − x^k) + ∂φ(x^{k+1})    (3.16)

holds for each k ∈ K. Hence, if we can show φ(x^{k+1}) →_K φ(x*), we can take the limit k →_K ∞ in (3.16) to obtain the desired statement 0 ∈ ∇f(x*) + ∂φ(x*). Due to (3.11), we find ψ(x^{k+1}) ≤ ψ(x^0) for each k ∈ K. Taking the limit k →_K ∞ while respecting the lower semicontinuity of φ gives ψ(x*) ≤ ψ(x^0), and due to x^0 ∈ dom φ, we find x* ∈ dom φ. Thus, the condition φ(x^{k+1}) →_K φ(x*) obviously holds if φ is continuous on its domain, since all iterates x^k generated by Algorithm 3.1 as well as x* belong to dom φ.
Hence, it remains to consider the situation where φ is only lower semicontinuous, but ∇f is locally Lipschitz continuous. From x^{k+1} →_K x* and the lower semicontinuity of φ, we find lim inf_{k∈K} φ(x^{k+1}) ≥ φ(x*). It therefore suffices to show that lim sup_{k∈K} φ(x^{k+1}) ≤ φ(x*) holds. Since x^{k+1} solves the subproblem (3.1) with parameter γ_k, we obtain

φ(x^{k+1}) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (γ_k/2) ‖x^{k+1} − x^k‖² ≤ φ(x*) + ⟨∇f(x^k), x* − x^k⟩ + (γ_k/2) ‖x* − x^k‖²

for each k ∈ K. We now take the upper limit over K on both sides. Using the continuity of ∇f, the convergences x^{k+1} − x^k →_K 0 as well as γ_k ‖x^{k+1} − x^k‖² →_K 0 (see Propositions 3.4 and 3.5), and taking into account that γ_k ‖x^k − x*‖² →_K 0 due to the boundedness of the subsequence {γ_k}_K in this situation, see Corollary 3.6, we obtain lim sup_{k∈K} φ(x^{k+1}) ≤ φ(x*). Altogether, we therefore get φ(x^{k+1}) →_K φ(x*), and this completes the proof.
Note that φ being continuous on dom φ is an assumption which holds, e.g., if φ is the indicator function of a closed set, see Remark 2.1. Therefore, Theorem 3.7 provides a global convergence result for constrained optimization problems with an arbitrary continuously differentiable objective function over any closed (not necessarily convex) feasible set. Moreover, the previous convergence result also holds for a general lower semicontinuous function φ provided that ∇f is locally Lipschitz continuous. This includes, for example, sparse optimization problems in X ∈ {R^n, R^{n×m}} involving the so-called ℓ0 quasi-norm, which counts the number of nonzero entries of the input vector, as a penalty term, or optimization problems in X := R^{n×m} comprising rank penalties. Note that we still do not require the global Lipschitz continuity of ∇f. However, it is an open question whether the previous convergence result also holds in the general setting where f is only continuously differentiable and φ is just lower semicontinuous.

Remark 3.8. Let {x^k} be a sequence generated by Algorithm 3.1. In iteration k ∈ N, x^{k+1} satisfies the necessary optimality condition (3.16) of the subproblem (3.1). Hence, from the next iteration's point of view, we obtain

∇f(x^k) − ∇f(x^{k−1}) − γ_{k−1}(x^k − x^{k−1}) ∈ ∇f(x^k) + ∂φ(x^k)

for each k ∈ N with k ≥ 1. This justifies evaluation of the termination criterion

‖∇f(x^k) − ∇f(x^{k−1}) − γ_{k−1}(x^k − x^{k−1})‖ ≤ τ_abs    (3.17)

for some τ_abs > 0, since this means that x^k is, in some sense, approximately M-stationary for (P). Observe that, along a subsequence converging to some x*, Propositions 3.4 and 3.5 yield x^k →_K x* and γ_{k−1}(x^k − x^{k−1}) →_K 0 under appropriate assumptions, which means that (3.17) is satisfied for large enough k ∈ K due to the continuity of ∇f : X → X, see also the discussion after Lemma 3.3.
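The ℓ0-penalized subproblems mentioned above remain tractable because the prox subproblem is separable and solvable in closed form by hard thresholding. A small sketch of our own for φ = λ‖·‖₀ and prox parameter γ (ties at the threshold are broken toward zero, one of the two optimal choices):

```python
import numpy as np

def prox_l0(v, lam, gamma):
    """A minimizer of lam * ||x||_0 + (gamma/2) * ||x - v||^2 (hard thresholding).

    Componentwise, keeping v_i costs lam, while setting x_i = 0 costs
    (gamma/2) * v_i**2, so v_i survives iff |v_i| > sqrt(2 * lam / gamma)."""
    return np.where(np.abs(v) > np.sqrt(2.0 * lam / gamma), v, 0.0)
```

Plugging such an oracle into the proximal gradient method yields iterates whose accumulation points are covered by Theorem 3.7 whenever ∇f is locally Lipschitz continuous.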
Recall that the existence of accumulation points is guaranteed by the coercivity of the function ψ. A simple criterion for the convergence of the entire sequence {x^k} is provided by the following comment.

Remark 3.9. Let {x^k} be any sequence generated by Algorithm 3.1 such that x* is an isolated accumulation point of this sequence. Then the entire sequence converges to x*. This follows immediately from [30, Lemma 4.10] and the property of the proximal gradient method stated in Proposition 3.4. The accumulation point x* is isolated, in particular, if f is twice continuously differentiable with ∇²f(x*) positive definite and φ is convex. In this situation, x* is a strict local minimizer of ψ and therefore the only stationary point of ψ in a neighborhood of x*. Since, by Theorem 3.7, every accumulation point is stationary, it follows that x* is necessarily an isolated accumulation point in this situation and, thus, convergence of the whole sequence {x^k} to x* follows.

Nonmonotone Proximal Gradient Method
The method to be presented here is a nonmonotone version of the proximal gradient method from the previous section. The kind of nonmonotonicity used here was introduced by Grippo et al. [22] for a class of smooth unconstrained optimization problems and then discussed, in the framework of composite optimization problems, by Wright et al. [34] as well as in some subsequent papers. We first state the precise algorithm and investigate its convergence properties. The relation to existing convergence results is postponed to the end of this section.

Algorithm 4.1 (Nonmonotone proximal gradient method).
1: Choose x^0 ∈ dom φ, parameters γ_min > 0, τ > 1, δ ∈ (0, 1), m ∈ N, and set k := 0.
2: while a suitable termination criterion is violated do
3: Set m_k := min{k, m} and choose γ_{k,0} ∈ [γ_min, ∞).
4: For i = 0, 1, 2, ..., set γ_{k,i} := τ^i γ_{k,0} and compute a solution x^{k,i} of the subproblem (4.1), which coincides with (3.1), until the acceptance criterion

   ψ(x^{k,i}) ≤ max_{j=0,...,m_k} ψ(x^{k−j}) − (δ γ_{k,i}/2) ‖x^{k,i} − x^k‖²    (4.2)

   holds.
5: Denote by i_k := i the terminal value, and set γ_k := γ_{k,i_k} and x^{k+1} := x^{k,i_k}.
6: Set k ← k + 1.
7: end while
8: return x^k

The only difference between Algorithm 3.1 and Algorithm 4.1 is the stepsize rule. More precisely, Algorithm 4.1 may be viewed as a generalization of Algorithm 3.1 since the particular choice m = 0 recovers Algorithm 3.1. Numerically, in many examples, the choice m > 0 leads to better results and is therefore preferred in practice. On the other hand, for m > 0, we usually get a nonmonotone behavior of the function values {ψ(x^k)}, which complicates the theory significantly. In addition, the nonmonotone proximal gradient method also requires stronger assumptions in order to prove a suitable convergence result.
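The nonmonotone acceptance test itself is essentially a one-line change compared to the monotone case. A hedged sketch of our own (the names are ours; `psi_history` is assumed to hold the recent function values ψ(x^{k−m_k}), ..., ψ(x^k)):

```python
import numpy as np

def accept_nonmonotone(psi_history, psi_trial, gamma, x_trial, x, delta=0.5):
    """Acceptance test in the spirit of (4.2): the trial value is compared
    with the maximum over the last m_k + 1 objective values instead of
    psi(x^k) alone, which allows temporary increases of psi."""
    ref = max(psi_history)  # max over psi(x^{k - m_k}), ..., psi(x^k)
    return psi_trial <= ref - (delta * gamma / 2.0) * np.linalg.norm(x_trial - x) ** 2
```

With m = 0 the history contains only ψ(x^k), and the monotone criterion (3.2) is recovered.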
In addition to the requirements from Assumption 3.2, we need the following conditions on the data functions in order to proceed.

Assumption 4.2. (a) The function ψ is uniformly continuous on the sublevel set L_ψ(x^0) := {x ∈ X | ψ(x) ≤ ψ(x^0)}. (b) The function φ is continuous on dom φ.
Note that we always have L_ψ(x^0) ⊂ dom φ by the continuity of f. Furthermore, whenever ψ is coercive, Assumption 4.2 (b) already implies Assumption 4.2 (a), since L_ψ(x^0) is a compact subset of dom φ in this situation, and continuous functions are uniformly continuous on compact sets. Observe that coercivity of ψ is an inherent property in many practically relevant settings. We further note that, in general, Assumption 4.2 (a) does not imply Assumption 4.2 (b), and the latter is a necessary requirement since, in our convergence theory, we will also evaluate the function φ at some points resulting from an auxiliary sequence {x̃^k} which may not belong to the level set L_ψ(x^0).
For the convergence theory, we assume implicitly that Algorithm 4.1 generates an infinite sequence {x^k}. We first note that the stepsize rule in the inner loop of Algorithm 4.1 is always finite; since the nonmonotone acceptance criterion (4.2) is weaker than the monotone one (3.2), this observation follows immediately from Lemma 3.3.
Throughout the section, for each k ∈ N, let l(k) ∈ {k − m_k, ..., k} be an index such that

ψ(x^{l(k)}) = max_{j=0,...,m_k} ψ(x^{k−j})    (4.3)

is valid. We already mentioned that {ψ(x^k)} may exhibit a nonmonotone behavior. However, as the following lemma shows, the sequence {ψ(x^{l(k)})} is monotonically decreasing.

Lemma 4.3. Let {x^k} be a sequence generated by Algorithm 4.1. Then the sequence {ψ(x^{l(k)})} is monotonically decreasing.

Proof. The nonmonotone stepsize rule from (4.2) can be rewritten as

ψ(x^{k+1}) ≤ max_{j=0,...,m_k} ψ(x^{k−j}) − (δ γ_k/2) ‖x^{k+1} − x^k‖² = ψ(x^{l(k)}) − (δ γ_k/2) ‖x^{k+1} − x^k‖²,

where the last equality follows from (4.3). This shows the claim.
As a corollary of the above result, we obtain that the iterates of Algorithm 4.1 belong to the level set L_ψ(x^0).

Corollary 4.4. Let {x^k} be a sequence generated by Algorithm 4.1. Then x^k ∈ L_ψ(x^0) for all k ∈ N.
The counterpart of Proposition 3.4 is significantly more difficult to prove in the nonmonotone setting. In fact, it is this central result which requires the uniform continuity of the objective function ψ from Assumption 4.2 (a). Though its proof is essentially the one from [34], we present all details since they turn out to be of some importance for the discussion at the end of this section.

Proposition 4.5. Let Assumptions 3.2 and 4.2 hold and let {x^k} be a sequence generated by Algorithm 4.1. Then x^{k+1} − x^k → 0.

Proof. Since ψ is bounded from below due to Assumption 3.2 (a), Lemma 4.3 implies

lim_{k→∞} ψ(x^{l(k)}) = ψ*    (4.4)

for some finite ψ* ∈ R. From Corollary 4.4, we find {x^k} ⊂ L_ψ(x^0), so that (4.2) can be applied with k replaced by l(k) − n − 1 for n ∈ N (here, we assume implicitly that k is large enough such that no negative indices l(k) − n − 1 occur). More precisely, for n = 0, we have

ψ(x^{l(k)}) ≤ ψ(x^{l(l(k)−1)}) − (δ γ_{l(k)−1}/2) ‖d^{l(k)−1}‖²,

where d^k := x^{k+1} − x^k for all k ∈ N. Taking the limit k → ∞ in the previous inequality and using (4.4) together with γ_k ≥ γ_min, we therefore obtain

lim_{k→∞} d^{l(k)−1} = 0.    (4.5)

Using (4.4) and (4.5), it follows that

lim_{k→∞} ψ(x^{l(k)−1}) = lim_{k→∞} ψ(x^{l(k)}) = ψ*,    (4.6)

where the first equality takes into account the uniform continuity of ψ from Assumption 4.2 (a) and (4.5).
We will now prove, by induction, that the limits

lim_{k→∞} d^{l(k)−j} = 0 and lim_{k→∞} ψ(x^{l(k)−j}) = ψ*    (4.7)

hold for all j ∈ N with j ≥ 1. We already know from (4.5) and (4.6) that (4.7) holds for j = 1. Suppose that (4.7) holds for some j ≥ 1. We need to show that it holds for j + 1. Using (4.2) with k replaced by l(k) − j − 1, we have

ψ(x^{l(k)−j}) ≤ ψ(x^{l(l(k)−j−1)}) − (δ γ_{l(k)−j−1}/2) ‖d^{l(k)−j−1}‖²

(again, we assume implicitly that k is large enough such that l(k) − j − 1 is nonnegative). Rearranging this expression and using γ_k ≥ γ_min for all k yields an upper bound for ‖d^{l(k)−j−1}‖². Taking k → ∞, using (4.4), as well as the induction hypothesis, it follows that

lim_{k→∞} d^{l(k)−(j+1)} = 0,    (4.8)

which proves the induction step for the first limit in (4.7). The second limit then follows from

lim_{k→∞} ψ(x^{l(k)−(j+1)}) = lim_{k→∞} ψ(x^{l(k)−j}) = ψ*,

where the first equation exploits (4.8) together with the uniform continuity of ψ from Assumption 4.2 (a) and {x^{l(k)−j}}, {x^{l(k)−(j+1)}} ⊂ L_ψ(x^0), whereas the final equation is the induction hypothesis.
In the last step of our proof, we now show that lim_{k→∞} d^k = 0 holds. Suppose that this is not true. Then there are a (suitably shifted, for notational simplicity) subsequence {d^{k−m−1}}_{k∈K} and a constant c > 0 such that

‖d^{k−m−1}‖ ≥ c for all k ∈ K.    (4.9)

Now, for each k ∈ K, the corresponding index l(k) is one of the indices k − m, k − m + 1, ..., k. Hence, we can write k − m − 1 = l(k) − j_k for some index j_k ∈ {1, 2, ..., m + 1}.
Since there are only finitely many possible indices j_k, we may assume without loss of generality that j_k = j holds for some fixed index j ∈ {1, ..., m + 1}. Then (4.7) implies lim_{K∋k→∞} d^{k−m−1} = lim_{K∋k→∞} d^{l(k)−j} = 0. This contradicts (4.9) and therefore completes the proof.
Theorem 4.6. Let Assumptions 3.2 and 4.2 hold and let {x^k} be a sequence generated by Algorithm 4.1. Suppose that x* is an accumulation point of {x^k} such that x^k →_K x* holds along a subsequence k ∈ K. Then x* is an M-stationary point of (P), and γ_k(x^{k+1} − x^k) →_K 0 is valid.
Proof. Since {x^k}_K is a subsequence converging to x*, it follows from Proposition 4.5 that the subsequence {x^{k+1}}_K also converges to x*. We note that x* ∈ dom φ follows from Corollary 4.4 by closedness of L_ψ(x^0). The minimizing property of x^{k+1} for (4.1) together with Fermat's rule and the sum rule from (2.3) imply that the necessary optimality condition (3.16) holds for each k ∈ K. We claim that the subsequence {γ_k}_K is bounded. Assume, by contradiction, that this is not true. Without loss of generality, let us assume that γ_k →_K ∞ and that the acceptance criterion (4.2) is violated in the first iteration of the inner loop for each k ∈ K. Setting γ̃_k := γ_k/τ for each k ∈ K, {γ̃_k}_K also tends to infinity, but the corresponding vectors x̃^k := x^{k,i_k−1}, k ∈ K, do not satisfy the stepsize condition from (4.2), i.e., we have (4.10). On the other hand, since x̃^k = x^{k,i_k−1} solves the corresponding subproblem (3.1) with γ̃_k = γ_{k,i_k−1}, we have (4.11). If x̃^k − x^k →_{K′} 0 holds along a subsequence K′ ⊂ K, then, from the optimality condition of x̃^k for this subproblem, which holds for each k ∈ K′ by means of Fermat's rule and the sum rule (2.3), we immediately see that x* is an M-stationary point of (P) by taking the limit k →_{K′} ∞ and exploiting the continuity of φ on dom φ from Assumption 4.2 (b). Thus, for the remainder of the proof, we may assume that there is a constant c > 0 such that ‖x̃^k − x^k‖ ≥ c holds for each k ∈ K. Further, we then also get a corresponding lower bound for all k ∈ K sufficiently large. Rearranging (4.11) then yields an estimate for each k ∈ K.
From the mean-value theorem, we obtain some ξ^k on the line segment between x̃^k and x^k such that the resulting estimate holds for all k ∈ K sufficiently large. This contradiction to (4.10) shows that the sequence {γ_k}_K is bounded. Finally, the continuity of φ from Assumption 4.2 (b) gives φ(x^{k+1}) →_K φ(x^*) due to x^{k+1} →_K x^*. Thus, recalling x^k →_K x^* and the boundedness of {γ_k}_K, we find γ_k(x^{k+1} − x^k) →_K 0, and taking the limit k →_K ∞ in (3.16) gives us M-stationarity of x^* for (P).

The uniform continuity of ψ demanded in Assumption 4.2 (a) is obviously a much stronger assumption than the one used in the previous section for the monotone proximal gradient method. In particular, this assumption rules out applications where φ is given by the ℓ_0-quasi-norm. Nevertheless, the theory still covers the situation where the role of φ is played by an ℓ_p-type penalty function for p ∈ (0, 1) over X ∈ {R^n, R^{n×m}}, which is known to promote sparse solutions. More precisely, this choice is popular in sparse optimization if the more common ℓ_1-norm does not provide satisfactory sparsity results and the application of the ℓ_0-quasi-norm seems too difficult, see [6,14,15,19,27,28] for some applications and numerical results based on the ℓ_p-quasi-norm or closely related expressions. We would like to note that uniform continuity is a standard assumption in the context of nonmonotone stepsize rules involving acceptance criteria of type (4.2), see [22, page 710].
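To make the ℓ_p discussion concrete, the following Python sketch evaluates the proximal subproblem of an ℓ_p penalty coordinate-wise by brute force; the function names, the grid resolution, and the bracketing interval are our own illustrative choices and not part of the algorithms analyzed above.

```python
import numpy as np

def prox_lp_scalar(z, lam, gamma, p, num=4001):
    # Brute-force minimizer of g(t) = (gamma/2) * (t - z)**2 + lam * |t|**p.
    # Any minimizer lies between 0 and z, so [-|z| - 1, |z| + 1] is a safe
    # bracket; t = 0 is appended since it is always a candidate for p < 1.
    ts = np.append(np.linspace(-abs(z) - 1.0, abs(z) + 1.0, num), 0.0)
    vals = 0.5 * gamma * (ts - z) ** 2 + lam * np.abs(ts) ** p
    return ts[np.argmin(vals)]

def prox_lp(z, lam, gamma, p):
    # The subproblem separates, so the prox of lam * ||.||_p^p acts coordinate-wise.
    return np.array([prox_lp_scalar(zi, lam, gamma, p) for zi in z])
```

For p = 1 this reproduces soft thresholding up to grid accuracy; for p ∈ (0, 1) the proximal map may be set-valued at certain inputs, and the grid search simply returns one minimizer.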
We close this section with a discussion of existing convergence results for nonmonotone proximal gradient methods. To the best of our knowledge, the first one can be found in [34]. The authors prove convergence under the assumptions that f is differentiable with a globally Lipschitz continuous gradient and that φ is real-valued and convex, see [34, Section II.G]. Implicitly, however, they also exploit the uniform continuity of ψ = f + φ in their proof of [34, Lemma 4], a result like Proposition 4.5, without stating this assumption explicitly. Taking this into account, our Assumption 4.2 (a) is actually weaker than the requirements used in [34], so the results of this section can be viewed as a generalization of the convergence theory from [34].
Furthermore, [15, Section 3.1] and [16, Appendix A] consider a nonmonotone proximal gradient method which differs slightly from Algorithm 4.1 since the acceptance criterion (4.2) is replaced by a simpler condition. In [16, Theorem 4.1], the authors obtain convergence to M-stationary points whenever ψ is bounded from below as well as uniformly continuous on the level set L_ψ(x^0), f possesses a Lipschitzian derivative on some enlargement of L_ψ(x^0), and φ is continuous. Clearly, our convergence analysis of Algorithm 4.1 does not exploit any Lipschitzianity of ∇f, so our assumptions are weaker than those used in [16].
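For readers who want to experiment, here is a minimal Python sketch of a nonmonotone proximal gradient loop in the spirit of Algorithm 4.1, specialized to φ = λ‖·‖_1 so that the subproblem has a closed-form solution. The acceptance test compares against the maximum of the last m+1 objective values; the parameter names (delta, tau, m) and all default values are illustrative and need not match the exact constants appearing in (4.2).

```python
import numpy as np

def soft_threshold(z, t):
    # Closed-form prox of t * ||.||_1, applied coordinate-wise.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def nonmonotone_prox_grad(grad_f, f, lam, x0, gamma0=1.0, tau=2.0,
                          delta=1e-4, m=5, max_iter=200, tol=1e-8):
    # Minimize psi(x) = f(x) + lam * ||x||_1 without assuming a global
    # Lipschitz constant: gamma is increased until the trial point is accepted.
    psi = lambda x: f(x) + lam * np.sum(np.abs(x))
    x = x0.copy()
    hist = [psi(x)]                          # last m + 1 objective values
    for _ in range(max_iter):
        gamma = gamma0
        while True:
            x_new = soft_threshold(x - grad_f(x) / gamma, lam / gamma)
            # nonmonotone acceptance: compare against the worst recent value
            if psi(x_new) <= max(hist) - 0.5 * delta * gamma * np.linalg.norm(x_new - x) ** 2:
                break
            gamma *= tau                     # backtracking on gamma
            if gamma > 1e12:
                raise RuntimeError("no acceptable stepsize found")
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
        hist = (hist + [psi(x)])[-(m + 1):]
    return x
```

On the toy problem f(x) = ‖x − b‖²/2 with gamma0 = 1, the very first step already lands on the exact minimizer soft_threshold(b, λ), which makes the sketch easy to sanity-check.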
In [15, Theorem 3.3], the authors claim that the results from [16] even hold when the continuity assumption on φ is dropped. The proof of [15, Theorem 3.3], however, relies on the outer semicontinuity property (2.2) of the limiting subdifferential, which does not hold for general discontinuous functions φ, so this result is not reliable. Finally, let us mention that the two references [26,33] also consider nonmonotone (and accelerated) proximal gradient methods. These methods are not directly comparable to our algorithm since they are based on a different kind of nonmonotonicity. In any case, although the analysis in both papers works for merely lower semicontinuous functions φ, the provided convergence theory requires ∇f to be globally Lipschitz continuous.

Conclusions
In this paper, we demonstrated how the convergence analysis for monotone and nonmonotone proximal gradient methods can be carried out in the absence of (global) Lipschitz continuity of the derivative of the smooth function. Our results thus make these algorithms reasonable candidates for subproblem solvers within an augmented Lagrangian framework for the numerical treatment of constrained optimization problems with lower semicontinuous objective functions, see e.g. [15] where this approach has been suggested but suffers from an incomplete analysis, and [19,23,25] where this approach has been corrected and extended.
Let us mention some remaining open problems regarding the investigated proximal gradient methods. First, it might be interesting to find minimum requirements which ensure global convergence of Algorithms 3.1 and 4.1. We already mentioned in Section 3 that it is an open question whether the convergence analysis for Algorithm 3.1 can be generalized to the setting where f is only continuously differentiable while φ is just lower semicontinuous. Second, we did not investigate whether the Kurdyka-Łojasiewicz property could be efficiently incorporated into the convergence analysis in order to obtain stronger results even in the absence of strong Lipschitz assumptions on the derivative of f. Third, our analysis has shown that Algorithms 3.1 and 4.1 compute M-stationary points of (P) in general. In the setting of Remark 2.2, i.e., where constrained programs with a merely lower semicontinuous objective function are considered, the introduced concept of M-stationarity is, to some extent, implicit since it comprises an unknown subdifferential. In general, the latter can be approximated from above in terms of initial problem data only in situations where a qualification condition is valid. The resulting stationarity condition may be referred to as explicit M-stationarity. It seems to be a relevant topic of future research to investigate whether Algorithms 3.1 and 4.1 can be modified such that they compute explicitly M-stationary points in this rather general setting. Fourth, it might be interesting to investigate whether other types of nonmonotonicity, different from the one used in Algorithm 4.1, can be exploited in order to get rid of the uniform continuity requirement from Assumption 4.2 (a).
Finally, we note that there exist several generalizations of proximal gradient methods using, e.g., inertial terms and Bregman distances, see e.g. [2, 9-11] and the references therein. The corresponding convergence theory is also based on a global Lipschitz assumption for the gradient of the smooth term or on additional convexity assumptions which allow the application of a descent-type lemma. It might be interesting to see whether our technique of proof can be adapted to these generalized proximal gradient methods in order to weaken the postulated assumptions.

Remark 2.1.
Consider the constrained optimization problem min_x f(x) subject to x ∈ C for a continuously differentiable function f: X → R and a nonempty and closed (not necessarily convex) set C ⊂ X. This problem is equivalent to the unconstrained problem (P) by setting φ := δ_C, where δ_C: X → R ∪ {∞} denotes the indicator function of the set C, vanishing on C and taking the value ∞ on X∖C, which is lower semicontinuous due to the assumptions regarding C. The corresponding M-stationarity condition is given by 0 ∈ ∇f(x^*) + N_C(x^*), where N_C(x^*) denotes the limiting normal cone to C at x^*.
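In the setting of this remark, the proximal operator of δ_C is the projection onto C, so the proximal gradient method reduces to the classical projected gradient method. A minimal Python sketch, with the nonnegative orthant as a hypothetical example of C:

```python
import numpy as np

def projected_gradient(grad_f, project, x0, gamma=2.0, max_iter=500, tol=1e-10):
    # With phi = delta_C, the proximal subproblem is solved by a projection:
    #   x^{k+1} in P_C(x^k - grad f(x^k) / gamma).
    x = x0.copy()
    for _ in range(max_iter):
        x_new = project(x - grad_f(x) / gamma)
        done = np.linalg.norm(x_new - x) <= tol
        x = x_new
        if done:
            break
    return x

# Example: minimize f(x) = ||x - b||^2 / 2 over C = {x : x >= 0}.
b = np.array([1.0, -2.0])
x_opt = projected_gradient(lambda x: x - b,          # grad f
                           lambda z: np.maximum(z, 0.0),  # projection onto C
                           np.zeros(2))
```

For a nonconvex closed C the projection can be set-valued, in which case any selection from it can be used in the update.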

Assumption 3.2.
(a) The function ψ is bounded from below on dom φ.
(b) The function φ is bounded from below by an affine function.

Assumption 4.2.
(a) The function ψ is uniformly continuous on the sublevel set L_ψ(x^0) := {x ∈ X | ψ(x) ≤ ψ(x^0)}.
(b) The function φ is continuous on dom φ.
(4.11) for each k ∈ K. Due to γ̃_k →_K ∞ and since φ is bounded from below by an affine function due to Assumption 3.2 (b) while φ is continuous on its domain by Assumption 4.2 (b) (which yields boundedness of the right-hand side of (4.11)), this implies x̃^k − x^k →_K 0. Consequently, we have x̃^k →_K x^* as well. Now, if γ̃_k x̃^k

Remark 4.7.
(a) Note that Assumptions 3.2 and 4.2 do not comprise any Lipschitz conditions on ∇f.
(b) The results in this section recover the findings from [23, Section 4] and [25, Section 3] which were obtained in the special situation where φ is the indicator function associated with a closed set, see Remark 2.1 as well.
(c) Based on Theorem 4.6, (3.17) also provides a reasonable termination criterion for Algorithm 4.1, see Remark 3.8 as well.
(d) In view of Proposition 4.5, it follows in the same way as in Remark 3.9 that the entire sequence {x^k} generated by Algorithm 4.1 converges if there exists an isolated accumulation point.