Proximal Gradient Algorithms under Local Lipschitz Gradient Continuity: A Convergence and Robustness Analysis of PANOC

Composite optimization offers a powerful modeling tool for a variety of applications and is often numerically solved by means of proximal gradient methods. In this paper, we consider fully nonconvex composite problems under only local Lipschitz gradient continuity for the smooth part of the objective function. We investigate an adaptive scheme for PANOC-type methods (Stella et al. in Proceedings of the IEEE 56th CDC, 1939--1944, 2017), namely accelerated linesearch algorithms requiring only the simple oracle of proximal gradient. While including the classical proximal gradient method, our theoretical results cover a broader class of algorithms and provide convergence guarantees for accelerated methods with possibly inexact computation of the proximal mapping. These findings also have significant practical impact, as they widen the scope and improve the performance of existing, and possibly future, general purpose optimization software that invokes PANOC as inner solver.


Introduction
Problems involving the minimization of the sum of a smooth and a nonsmooth function are of interest for a wide variety of applications, ranging from optimal and model predictive control (MPC) to signal processing, compressed sensing, and machine learning; see, e.g., [10,19,30] and references therein. Structured problems can also arise as subproblems within other numerical optimization algorithms, e.g., the augmented Lagrangian method (ALM) [23,5,7]. These use cases often yield nonconvex and large-scale problems and can pose stringent requirements in terms of both computation and memory.
In the last few years, these considerations led to a renewed interest in algorithms of splitting nature [10,19] owing to their simple operation oracles and low memory footprint, on top of their amenability to address nonsmooth, possibly nonconvex, constrained problems, making them widely applicable. The price of this flexibility is paid in terms of slow convergence and sensitivity to ill conditioning, hindering their direct employment to real-time applications, such as MPC, where optimal solutions to hard problems have to be retrieved in very limited time.
Inspired by Newton-type methods for smooth optimization, second-order information can be adopted, so as to better scale with problem size and achieve asymptotic superlinear rates. However, only local convergence guarantees can be expected without introducing a globalization strategy, such as a backtracking linesearch procedure. Unfortunately, for nonsmooth problems, even if fast search directions are available, classical linesearch strategies are not applicable. In fact, lacking directional differentiability, the notion of descent direction is not meaningful for possibly extended-real-valued, discontinuous functions.
In this very setting, the recently introduced PANOC [32] demonstrated how these downsides of the proximal gradient (PG) algorithm can be overcome while retaining all its favorable features. Essentially, PANOC is a linesearch method that uses the so-called forward-backward envelope (FBE) [22] as merit function to globalize the convergence of fast local methods. It offers an umbrella framework that includes the PG method as a special instance; other variations are obtained by selecting virtually arbitrary update directions, which are suitably dampened in such a way as to guarantee convergence. A most prominent use case is the employment of directions stemming from methods of quasi-Newton type applied to the nonlinear equation R_γ(x) = 0 that encodes first-order necessary conditions for optimality, where R_γ is a (set-valued) generalization of the gradient mapping for nonsmooth problems, cf. (2.6). In accommodating arbitrary update directions, PANOC does not require differentiability properties on the merit function and waives the need of regularization terms to enforce a descent condition on the update directions. We defer a more detailed analysis to the dedicated Section 3.
Although the algorithm uses the same computational oracle of PG, curvature information enables asymptotic superlinear rates under mild assumptions at the limit point [32]. By employing directions of quasi-Newton type, no inner iterative procedure nor Hessian evaluations are required, making PANOC's iterations simple, lightweight and scalable. Because of these favorable properties, PANOC was originally meant as a nonlinear MPC solver particularly suited for embedded applications subject to limited hardware capabilities, such as land and aerial vehicles [26,28,15] and robotics [3,27,4]; see also [18,13] for extensive surveys and comparisons with other popular methods. Its success in the field led to a reconsideration of the spectrum of problems that the solver could be applied to. On a historical note, this evolution was reflected by a swift rebranding of the acronym over the years, originally meant as Proximal Averaged Newton-type method for Optimal Control in the original publication [32], but then tacitly reproposed as the same method for Optimality Conditions in [2] (and subsequent appearances) to allude to its applicability to the much broader range of composite minimization problems. This flexibility was further exploited in [29], where PANOC is employed as inner solver for ALM minimization subproblems for the general purpose Optimization Engine (OpEn) solver.
This rapid evolution was perhaps neglectful of some aspects, primarily because PG is subject to a binding global Lipschitz differentiability requirement on the smooth term. In the context of MPC, physical bounds on input variables result in optimization problems where the feasible set is bounded, in which case local Lipschitzness can be shown to suffice, making virtually no exclusion to the problems that can be addressed. In more general formulations, and especially so in a fully nonconvex setting, however, all known results are valid under a global Lipschitzness assumption, with the very recent work [14] possibly emerging as unique exception in a vast literature; see also [11,25] for convex problems. Other alternatives are to be found in the Bregman setting [8,17,1], which are however subject to (and thus limited in applicability by) the identification of a distance-generating function that enables a so-called Lipschitz-like convexity condition and, at the same time, makes the induced proximal operations tractable. While this may not seem a major issue in composite minimization, it undeniably constitutes a severe drawback in ALM contexts, where constraint relaxation can produce subproblems with unbounded feasible sets, without this necessarily being the case for the original problem. Although adding large box constraints to ensure convergence may be thought of as a viable solution, unsatisfactory practical performance can persist because of poor geometry estimation, as we will show.

This paper addresses the above-mentioned shortcomings of PANOC, and of PG as a byproduct, by investigating an adaptive stepsize selection rule for its PG oracle. This criterion, in a slightly less general form, was first proposed in [20, Alg. 7], but without theoretical guarantees and driven by a different observation, namely the poor performance of PANOC when initial stepsizes are badly estimated.
After confirming this claim with case study examples, we provide a complete convergence theory showing that the method, here referred to as PANOC+ for clarity, can also cope with local Lipschitzness, while this is not the case for PANOC. Furthermore, we examine the robustness of the improved method with respect to suboptimal solutions of the PG subproblems. These findings significantly impact PANOC(+), both in performance and applicability, propagating to all its dependencies, e.g., by removing stringent assumptions of general purpose optimization solvers such as OpEn [29]. Indeed, the significance and effectiveness of PANOC+ have already been demonstrated in [21,12]. As part of the open-source Julia package ProximalAlgorithms.jl [31], our implementation PANOCplus of PANOC+ is publicly available.
A convergence analysis of PG with a locally Lipschitz smooth term and possibly inexact inner minimizations is obtained as a simple byproduct of the more general theory developed here. Indeed, a vast class of algorithms is covered by the analysis in this work, thanks to the arbitrariness of the selected update directions within the PANOC framework.

Problem Setting and Preliminaries
In this paper we consider structured minimization problems

minimize_{x∈ℝ^n} ϕ(x) ≔ f(x) + g(x), (P)

where x ∈ ℝ^n, n ∈ ℕ, is the decision variable, under the following standing assumptions, assumed throughout.
Blanket assumption. The following hold in problem (P): a1 f : ℝ^n → ℝ has locally Lipschitz-continuous gradient.
Motivated by its efficiency and popularity, yet aware of its inability to address this general problem formulation, this paper studies a robustified variant of the PANOC algorithm with adaptive stepsize selection [32, Rem. III.4], building upon the preliminary work of [20, §6.1]. PANOC and the proposed generalization PANOC+ will be presented and compared in Section 3, after the needed definitions and preliminary material are covered in this section.

Notational Conventions
With ℝ and ℝ̄ ≔ ℝ ∪ {∞} we denote the real and extended-real line, and by ℕ = {0, 1, . . .} the set of natural numbers. The effective domain of an extended-real-valued function h : ℝ^n → ℝ̄ is denoted by dom h ≔ {x ∈ ℝ^n | h(x) < ∞}, and we say that h is: proper if dom h ≠ ∅; lower semicontinuous (lsc) if h(x̄) ≤ lim inf_{x→x̄} h(x) for all x̄ ∈ ℝ^n; coercive if h(x) → ∞ as ‖x‖ → ∞. For α ∈ ℝ, the α-sublevel set of h is lev_{≤α} h ≔ {x ∈ ℝ^n : h(x) ≤ α}.
The notation T : ℝ^n ⇒ ℝ^n indicates a set-valued mapping T that associates to every x ∈ ℝ^n a subset T(x) ⊆ ℝ^n. The graph of T is gph T ≔ {(x, y) | y ∈ T(x)}. Following [24, Def. 8.3], we denote by ∂̂h : ℝ^n ⇒ ℝ^n the regular (Fréchet) subdifferential, and we say that h is prox-bounded if it is proper and h + (1/(2γ))‖·‖² is bounded below on ℝ^n for some γ > 0. The supremum of all such γ is the threshold γ_h of prox-boundedness for h. In particular, if h is bounded below by an affine function, then γ_h = ∞. When h is lsc, for any γ ∈ (0, γ_h) the proximal mapping prox_{γh} is nonempty- and compact-valued, and the Moreau envelope h^γ is finite and locally Lipschitz continuous [24, Thm. 1.25 and Ex. 10.32].
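As a concrete instance of the objects just defined, for h(x) = |x| the proximal mapping is the well-known soft-thresholding operator and the Moreau envelope is the Huber function. The following is a minimal Python sketch (our own illustrative example, not taken from the paper):

```python
import numpy as np

def prox_abs(x, gamma):
    # prox_{gamma |.|}(x): soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_env_abs(x, gamma):
    # Moreau envelope h^gamma(x) = min_w { |w| + (w - x)^2 / (2*gamma) },
    # attained at the prox point; this equals the Huber function, which is
    # finite and locally Lipschitz, as stated above.
    w = prox_abs(x, gamma)
    return np.abs(w) + (w - x) ** 2 / (2 * gamma)
```

For instance, with gamma = 1 one gets prox_abs(3.0, 1.0) = 2.0 and moreau_env_abs(3.0, 1.0) = 2.5, while for |x| ≤ gamma the envelope reduces to the quadratic x²/(2γ).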

Proximal Gradient Iterations
Given a point x ∈ ℝ^n, one iteration of the proximal gradient (PG) method for problem (P) consists in selecting

x̄ ∈ T_γ(x) ≔ prox_{γg}(x − γ∇f(x)), (2.4)

where γ ∈ (0, γ_g) is a stepsize parameter. The necessary optimality condition in the minimization problem defining the proximal mapping then reads

(x − x̄)/γ − ∇f(x) ∈ ∂̂g(x̄), (2.5)

and in particular the fixed-point inclusion x ∈ T_γ(x) implies the stationarity condition 0 ∈ ∂̂ϕ(x). By interpreting (2.4) as a fixed-point iteration, one can also consider the associated (set-valued) fixed-point residual

R_γ ≔ (id − T_γ)/γ, (2.6)

and seek fixed points of T_γ as zeros of the residual R_γ.
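A PG iteration needs only the two oracles named above: a gradient of f and a prox of g. A minimal Python sketch of one step and of the associated residual (function names are ours; `prox_g` is any user-supplied proximal mapping):

```python
def prox_grad_step(x, grad_f, prox_g, gamma):
    # One PG step: xbar in T_gamma(x) = prox_{gamma g}(x - gamma * grad f(x)), cf. (2.4).
    return prox_g(x - gamma * grad_f(x), gamma)

def residual(x, grad_f, prox_g, gamma):
    # Fixed-point residual R_gamma(x) = (x - T_gamma(x)) / gamma, cf. (2.6).
    return (x - prox_grad_step(x, grad_f, prox_g, gamma)) / gamma
```

As a sanity check, for g = 0 the prox is the identity and R_γ(x) reduces to ∇f(x): with f(x) = x²/2 one gets residual(2.0, ...) = 2.0 for any stepsize.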

Forward-Backward Envelope
At the heart of the PANOC rationale is the observation that, under assumptions, the fixed-point residual R_γ in (2.6) is continuous around and even differentiable at critical points [34, §4], and the inclusion problem 0 ∈ R_γ(·) reduces to a well-behaved system of equations when close to solutions. This motivated the adoption of Newton-type directions on R_γ, which enable fast convergence when close to solutions. The key tool enabling convergence regardless of whether or not the initial point happens to be sufficiently close to a solution is the so-called forward-backward envelope (FBE).
Definition 2.1 (forward-backward envelope). Relative to (P), the FBE with stepsize γ ∈ (0, γ_g) is

ϕ^fb_γ(x) ≔ inf_{w∈ℝ^n} { f(x) + ⟨∇f(x), w − x⟩ + g(w) + (1/(2γ))‖w − x‖² } (2.7a)
= f(x) − (γ/2)‖∇f(x)‖² + g^γ(x − γ∇f(x)), (2.7b)

or, equivalently, letting x̄ be any element of T_γ(x),

ϕ^fb_γ(x) = f(x) + ⟨∇f(x), x̄ − x⟩ + g(x̄) + (1/(2γ))‖x̄ − x‖². (2.7c)

Owing to its continuity properties, the FBE has been employed to generalize and improve PG-based algorithms that address the general setting of structured nonconvex optimization [16,34,9]. The following results are well known when f has globally Lipschitz gradient [34, Prop.s 4.2 and 4.3]. A simple proof in the more general setting addressed here is given for completeness.
In particular, ϕ^fb_γ(x) ≤ ϕ(x) holds for any x ∈ ℝ^n, with equality holding iff x ∈ T_γ(x).
Proof. Assertion 2.2(i) follows from the expression (2.7b), owing to the similar property of the Moreau envelope g γ , while 2.2(ii) is obtained by taking w = x in (2.7a). The first inequality in 2.2(iii) owes to item 2.2(ii) (independently of L), and the second one follows from the expression (2.7c) of ϕ fb γ .
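The expression (2.7c) makes the FBE computable with the same oracles as one PG step. A minimal Python sketch of its evaluation (our own helper, with hypothetical argument names):

```python
import numpy as np

def fbe(x, f, grad_f, g, prox_g, gamma):
    # Evaluate the FBE via (2.7c): take any xbar in T_gamma(x) and compute
    # f(x) + <grad f(x), xbar - x> + g(xbar) + ||xbar - x||^2 / (2*gamma).
    xbar = prox_g(x - gamma * grad_f(x), gamma)
    d = xbar - x
    return f(x) + np.dot(grad_f(x), d) + g(xbar) + np.dot(d, d) / (2 * gamma)
```

For f(x) = ‖x‖²/2 and g = 0, the FBE evaluates to (1 − γ)‖x‖²/2, which indeed minorizes ϕ(x) = ‖x‖²/2 for any γ ∈ (0, 1), in line with Lemma 2.2(ii).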

Good and Bad Adaptive Stepsize Selection Rules
As briefly mentioned in Section 2.3, the FBE is the key tool for globalizing the convergence of fast local methods, such as of quasi-Newton type, applied to the nonlinear equation R γ (x) = 0 encoding necessary optimality conditions for (P).
Algorithm 1 (PANOC, excerpt): Initialize k = 0, compute x̄_0 ∈ T_{γ_0}(x_0), and start from step 1.6; 1.7: k ← k + 1 and start the next iteration at step 1.1.

Elaborating on how Newton-type directions can be selected given the nonsmooth, possibly set-valued, nature of R_γ is beyond the scope of this survey, and the interested reader is referred to [34,32]. The core idea is nevertheless the same as in the familiar context of smooth minimization: trying to enforce (supposedly fast) updates x → x + d in place of "nominal" updates x → x̄, where x̄ would amount to a gradient step or, in our nonsmooth setting, a proximal gradient step x̄ ∈ T_γ(x) as in (2.4). Still in complete analogy with the smooth case, accepting a candidate update x + d must be validated by a "quality check", like an Armijo-type condition, in violation of which d is either discarded or dampened with a smaller stepsize. PANOC is precisely a mechanism to dampen and accept update directions in a nonsmooth setting, using the FBE as validation control. Its steps are given in Algorithm 1.
A basic assumption for PANOC is that ∇f be globally L_f-Lipschitz, so that a well-known quadratic upper bound, see e.g., [6, Prop. A.24], ensures that L = L_f can be taken for all x ∈ ℝ^n in Lemma 2.2(iii). Alternatively, if g has bounded domain and the selected directions d_k are bounded, it suffices that ∇f is locally Lipschitz-continuous; see [32, Rem. III.4]. For any α ∈ (0, 1) the choice γ_k = α/L_f then never violates the condition at step 1.6, meaning that γ_k ≡ γ is constant. The dampening of the direction occurs at step 1.2, where starting with τ_k = 1 the candidate update x_{k−1} + d_k is pushed towards x̄_{k−1} ∈ T_γ(x_{k−1}) by reducing the steplength τ_k until the value of the FBE is sufficiently reduced, cf. step 1.4. The process terminates, since ϕ^fb_γ is continuous (at x̄_{k−1}), and its value there is strictly smaller than ϕ^fb_γ(x_{k−1}) whenever x̄_{k−1} ≠ x_{k−1}, cf. Lemma 2.2(iii).
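The dampening mechanism just described can be sketched in a few lines of Python. This is a minimal illustration with a fixed stepsize γ (i.e., the globally Lipschitz case) and exact prox; the helper names, the candidate interpolation between x̄ and x + d, and all parameter defaults are our simplifications, not the authors' exact Algorithm 1:

```python
import numpy as np

def panoc_sketch(x, f, grad_f, g, prox_g, gamma, pick_direction,
                 sigma=1e-4, max_iter=100, tol=1e-8):
    # Minimal PANOC-style loop: a candidate x + d is pushed towards the
    # proximal gradient point xbar until the FBE decreases sufficiently.
    def T(z):
        return prox_g(z - gamma * grad_f(z), gamma)

    def fbe(z):
        zbar = T(z)
        d = zbar - z
        return f(z) + np.dot(grad_f(z), d) + g(zbar) + np.dot(d, d) / (2 * gamma)

    for _ in range(max_iter):
        xbar = T(x)
        if np.linalg.norm(x - xbar) <= tol:
            break
        d = pick_direction(x, xbar)
        merit = fbe(x) - sigma * np.dot(x - xbar, x - xbar)
        tau = 1.0
        while True:
            cand = (1 - tau) * xbar + tau * (x + d)  # tau = 1: full fast step
            if fbe(cand) <= merit or tau < 1e-12:    # Armijo-like FBE check
                break
            tau /= 2                                 # dampen towards xbar
        x = cand
    return x
```

With f(x) = ‖x‖²/2, g = 0 and Newton directions d = −x, the sketch accepts the full step τ = 1 and lands on the minimizer in one iteration.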

PANOC + : the "Good" Adaptive Stepsize Rule
What is presented in Algorithm 1 is actually the "adaptive" variant of PANOC, which still works under the assumption of global Lipschitz differentiability but waives the need of prior knowledge about L_f. The γ-backtracking at step 1.6 decreases (i.e., "adapts") γ_k and terminates as soon as the needed bound as in Lemma 2.2(iii) is satisfied. As first noted in [20, §6.1], however, this adaptive criterion may produce bad estimates of the local Lipschitz constant of ∇f and overall result in poor algorithmic performance. The phenomenon can be attributed to an asynchrony between the two backtracking steps, the one dampening the update direction and the one adaptively adjusting the proximal gradient stepsize. This claim can be verified in the iteration mismatch between the variable x_k and the stepsize γ_{k−1} occurring at step 1.3 (cf. Remark 3.1).
To account for this fact, [20,Alg. 7] proposes to adapt the PG stepsize γ k within the linesearch on the update direction. As recently showcased in [21], not only does this conservatism prove beneficial in preventing the acceptance of poor quality directions, but it often also reduces the overall computational cost. Although numerical simulations indicate superior performance, this refined linesearch lacks a theoretical analysis of its convergence properties.
This modification, which we allusively call the "good" adaptive variant (or PANOC+ for brevity), is depicted in Algorithm 2. In fact, the method presented here features a slight but important generalization, namely in allowing the selection of a new direction d_k every time the stepsize γ_k is reduced, cf. step 2.5, which was not considered in [20, Alg. 7]. This flexibility is crucial: whenever the stepsize γ_k changes so does the PG residual mapping R_γ, and consistently so should directions using its curvature information. Moreover, we provide theoretical guarantees on the finite termination of the backtracking linesearch procedure, even without global Lipschitz gradient continuity and with merely suboptimal proximal computation. These findings uphold the algorithmic framework proposed in [32,20,21] on two aspects: the adaptive linesearch is shown to terminate, and it can cope with a merely locally Lipschitz-differentiable term f. These findings are of high significance also for other methods that rely on PANOC as internal solver, such as the general purpose OpEn [29]. Moreover, it will be shown that all this remains true even if the minimization problem defining the PG mapping T_γ is solved inexactly and/or suboptimally.

Algorithm 2 (PANOC+: the "good" adaptive γ-stepsize rule, excerpt): Require x_0 ∈ ℝ^n; γ_0 ∈ (0, γ_g); D ≥ 0; α, β ∈ (0, 1). Initialize k ← 0, and start from step 2.4; ... go back to step 2.3; 2.7: k ← k + 1 and start the next iteration at step 2.1.
The peculiarity of PANOC + over the bad adaptive rule of original PANOC is that the two backtracking steps, the one on the direction τ k and the one on the PG stepsize γ k , are tightly intertwined. The intricate structure emerges at steps 2.5 and 2.6: the direction stepsize τ k resets every time the proximal stepsize γ k is adjusted and, conversely, the value of γ k is assessed anew when τ k changes. This entanglement allows the evaluation of the FBE at step 2.4 with an up-to-date stepsize γ k , as opposed to (and eliminating) the asynchrony obstructing PANOC's performance. The adaptivity of PANOC + allows the FBE ϕ fb γ to better capture the (local) landscape of ϕ and, ultimately, to relax the assumption of globally Lipschitz gradient.
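The entanglement of the two backtracking loops can be sketched as follows. This is a minimal Python illustration of the control flow only (exact prox, Armijo-like FBE check; `pick_direction`, `quad_bound_ok` and the parameter defaults are our choices, not the authors' Algorithm 2):

```python
import numpy as np

def quad_bound_ok(x, xbar, f, grad_f, gamma, alpha):
    # gamma is deemed valid at x if the local quadratic upper bound holds:
    # f(xbar) <= f(x) + <grad f(x), xbar - x> + alpha/(2*gamma)*||xbar - x||^2
    d = xbar - x
    return f(xbar) <= f(x) + np.dot(grad_f(x), d) + alpha * np.dot(d, d) / (2 * gamma)

def panoc_plus_sketch(x, f, grad_f, g, prox_g, gamma, pick_direction,
                      alpha=0.5, beta=0.5, sigma=1e-4, max_iter=100, tol=1e-8):
    T = lambda z, gam: prox_g(z - gam * grad_f(z), gam)

    def fbe(z, gam):
        zbar = T(z, gam)
        d = zbar - z
        return f(z) + np.dot(grad_f(z), d) + g(zbar) + np.dot(d, d) / (2 * gam)

    xbar = T(x, gamma)
    for _ in range(max_iter):
        while not quad_bound_ok(x, xbar, f, grad_f, gamma, alpha):
            gamma /= 2                 # adapt gamma at the current iterate
            xbar = T(x, gamma)
        if np.linalg.norm(x - xbar) <= tol:
            break
        d = pick_direction(x, xbar, gamma)  # directions may depend on gamma
        merit = fbe(x, gamma) - sigma * np.dot(x - xbar, x - xbar)
        tau = 1.0
        while True:
            cand = (1 - tau) * xbar + tau * (x + d)
            cbar = T(cand, gamma)
            if not quad_bound_ok(cand, cbar, f, grad_f, gamma, alpha):
                gamma /= 2              # gamma shrank: reset the linesearch,
                xbar = T(x, gamma)      # re-select the direction (tau = 1)
                d = pick_direction(x, xbar, gamma)
                merit = fbe(x, gamma) - sigma * np.dot(x - xbar, x - xbar)
                tau = 1.0
                continue
            if fbe(cand, gamma) <= merit or tau < 1e-12:
                break
            tau *= beta
        x, xbar = cand, cbar
    return x
```

Note how the FBE used in the decrease check is always evaluated with the current γ, and how both τ and d are refreshed whenever γ is halved: this is exactly the synchronization missing in the "bad" rule.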
To substantiate these claims, in the following Section 3.2 we first showcase the ineffectiveness of PANOC applied to problem (P) where f has only locally Lipschitz-continuous gradient, and then compare the "good" and the "bad" adaptive strategies on a common ground in Section 3.3.
Remark 3.1 (Algorithm notation). Algorithm 2 operates two linesearch steps within each iteration, one on the "proximal" stepsize γ_k at step 2.5 and one on the "direction" stepsize τ_k at step 2.6. Whenever the respective needed conditions are violated, either γ_k or τ_k is reduced and the iteration restarted from a previous step. As a consequence, variables may be overwritten within each iteration before being accepted. To avoid a heavy double-index notation, which is used only within proofs for the sake of full rigor, the sub- and superscript notation is designed to differentiate temporary and permanent variables; specifically, within iteration k only variables indexed with k are updated, whereas those indexed with k − 1 remain untouched. Similar considerations apply to Algorithm 1.

Failure of "Bad" PANOC without Globally Lipschitz Gradient
Let us consider the minimization of the convex, twice continuously differentiable function f(x) = (2/9)|x|³ (with g ≡ 0), and adopt PANOC as given in Algorithm 1. In particular, we choose the directions as d_k = 3x_{k−1}. As we are about to show, starting from any x_0 > 0 this particular choice of directions complies with the bound ‖d_k‖ ≤ D‖x_{k−1} − x̄_{k−1}‖ for D = 18 and satisfies the τ-linesearch with τ_k = 1 for every k. Moreover, the choice α = 16/27 leads to a conveniently simple expression for the γ-linesearch, namely γ_k ≤ 1/(2x_k). As a result, starting from x_0 > 0 with γ_0 > 1/(4x_0), the algorithm reduces to iterating the following lines

γ_k ∈ (1/(4x_k), 1/(2x_k)], x_{k+1} = x_k + d_{k+1} = 4x_k, (3.3)

and thus produces a sequence x_k = 4^k x_0 that is diverging, and causes the cost to increase unboundedly. Since g ≡ 0, the FBE here reads

ϕ^fb_γ(x) = f(x) − (γ/2)|∇f(x)|² = (2/9)x³(1 − γx) for x ≥ 0. (3.4)

We now show the claims one by one. To this end, denoting y_k ≔ γ_k x_k throughout, observe that x̄_k = x_k − γ_k∇f(x_k) = x_k(1 − (2/3)y_k).

• Linesearch on γ. For x_k > 0 the backtracking on γ_k at step 1.6 (after removing a (2/9)x_k³ factor) terminates when

|1 − (2/3)y_k|³ ≤ 1 − (2 − α)y_k. (3.5)

To simplify the computation, observe that necessarily y_k ≤ 1 for inequality (3.5) to hold, and in particular the argument of the absolute value is necessarily positive: in fact, since y_k = γ_k x_k > 0 and α < 1, (3.5) implies |1 − (2/3)y_k|³ ≤ 1 − y_k, hence y_k ≤ 1. After this simplification and by restricting the analysis to y_k = γ_k x_k > 0, it can be seen that (3.5) has solution 0 < γ_k ≤ (9/(4x_k))(1 − √(1 − (2/3)α)). For α = 16/27, this bound simplifies to 0 < γ_k ≤ 1/(2x_k) as claimed. This shows the validity of the first line in (3.3). Since γ_k is halved (only) until it enters this range, one also has that

γ_k > 1/(4x_k). (3.6)

• Linesearch on τ. Starting with x_k > 0 we show that x_{k+1} = x_k + d_{k+1} = 4x_k satisfies the linesearch condition. Indeed, by using the expression for the FBE in (3.4), the bound (3.6) implies that the inequality always holds. We stress that, although we consider an exemplary problem designed to yield simple computations, similar arguments would still apply for C^∞, strongly convex formulations, e.g., x⁴ + x²; see also Remark 3.2.
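The blow-up above can be probed numerically. The Python sketch below (our own illustration; helper names are ours) checks the γ-linesearch bound γ_k ≤ 1/(2x_k) by testing stepsizes slightly inside and outside the bound, avoiding the exact-equality edge case; the acceptance of the full steps x_{k+1} = 4x_k by the τ-linesearch is taken from the analysis above and not re-verified here:

```python
# Assumes f(x) = (2/9)|x|^3, g = 0 (so the prox is the identity), alpha = 16/27.
alpha = 16 / 27

def f(x):
    return (2 / 9) * abs(x) ** 3

def grad_f(x):
    return (2 / 3) * x * abs(x)

def gamma_accepted(x, gamma):
    # step-1.6 test: f(xbar) <= f(x) + grad_f(x)*(xbar - x) + alpha/(2*gamma)*(xbar - x)^2
    xbar = x - gamma * grad_f(x)
    d = xbar - x
    return f(xbar) <= f(x) + grad_f(x) * d + alpha * d * d / (2 * gamma)

x = 1.0
for k in range(5):
    # slightly inside / outside the bound 1/(2x):
    assert gamma_accepted(x, 0.98 / (2 * x))
    assert not gamma_accepted(x, 1.02 / (2 * x))
    x = 4 * x  # the full step along d = 3x is accepted by the tau-linesearch
# after 5 iterations x = 4**5 = 1024: the cost f(x) grows without bound
```

The acceptance region thus scales like 1/x_k, so the iterates run away four times faster than the stepsize can shrink.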

Robustness against poor directions
In spite of the breakdown demonstrated in Section 3.2, global convergence guarantees for PANOC can be recovered by adding a term g with bounded domain, as is the case of a possibly large but bounded box constraint, and selecting update directions d k that are bounded, see [32,Rem. III.4]. Nonetheless, as noted in [20, §6.1], this would scarcely help in practice: early iterations would be agnostic to the large box and exhibit the same diverging behavior until the boundary is approached, at which point a drastically reduced stepsize γ would be the cause of a painfully slow convergence.
We substantiate these claims by considering the example in Section 3.2 with some amendments. In particular, we let g be the indicator function of the interval [−B, B], namely g(x) = 0 if |x| ≤ B and g(x) = ∞ otherwise, and select directions d_k as above if ‖d_k‖ ≤ E, and E d_k/‖d_k‖ otherwise, with possibly large but bounded B, E ≥ 0. The problem becomes

minimize_{x∈ℝ} (2/9)|x|³ subject to |x| ≤ B. (3.7)

Adopting these precautions, PANOC generates iterates that converge to a solution, starting from any initial point. We set B = E = 100 for the results displayed in Figure 1 with a comparison against PANOC+. Although the latter solves the illustrative problem in its original form (that is, with B = ∞), we stress that it would not be affected by the safeguards put in place to guarantee the convergence of "bad" PANOC. The diverging behavior of PANOC is apparent, until the safeguards activate, as expected from Section 3.2. At step 1.3 PANOC accepts an update x_k based on the sufficient decrease of a merit function defined by the FBE with the previous stepsize γ_{k−1}. Figure 2 illustrates this phenomenon by comparing the merit functions adopted by PANOC and PANOC+ to verify whether a tentative update is to be accepted or not. In this example, PANOC's merit functions are unbounded below (see (3.4)) and full steps along the update directions d_k are accepted, in fact favored, leading to diverging iterates. In turn, this results in a temporary departure from the solution, degrading the overall efficiency of the algorithm. Conversely, at step 2.4 PANOC+ verifies sufficient decrease of the FBE with the current stepsize γ_k, yielding monotone decrease of the (time-varying, but lower bounded) merit function ϕ^fb_γ, as depicted in Figure 1.
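For indicator functions such as the g used here, the proximal mapping is simply the projection onto the set, independently of the stepsize. A minimal Python sketch for the box [−B, B] (our own helper names, with B = 100 as in the experiment):

```python
import numpy as np

B = 100.0  # half-width of the box, matching B = 100 above

def g_box(x):
    # indicator of [-B, B]: 0 inside, +inf outside
    return 0.0 if abs(x) <= B else float("inf")

def prox_box(z, gamma):
    # for an indicator function, prox_{gamma g} is the projection, any gamma > 0
    return float(np.clip(z, -B, B))
```

This is why the safeguard only activates at the boundary: as long as |x − γ∇f(x)| ≤ B, the projection is the identity and the iterates behave exactly as in the unconstrained example.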
Note that the merit function for PANOC + in Figure 2 is only piecewise continuous because its evaluation is always preceded by the γ-stepsize backtracking, i.e., the stepsize γ k = γ k (x k ) in ϕ fb γ depends on the candidate update x k being tested. This adaptivity allows PANOC + to well estimate the geometry of the cost function ϕ and to construct a tighter merit function.
These simulations also show that, despite the more conservative linesearch, PANOC+ does not necessarily require more iterations or function evaluations to deliver a more consistent performance, nor does it lead to a smaller stepsize.
Indeed, when larger box constraints and update directions are considered, i.e., larger values for B and E, the limitations and inadequacy of "bad" PANOC in this setting become all the more apparent, lending further support to the (initially) more conservative adaptive scheme of "good" PANOC+.
Remark 3.2. Noticeably, the "bad" PANOC can exhibit this diverging behavior even when the problem admits just one feasible point. To see this, let us consider once again the illustrative example above with B = 0, so that dom g = dom ϕ = {0}. Then, patterning the proof in Section 3.2, we obtain that the algorithm produces a sequence (x_k)_{k∈ℕ} that is diverging, despite the fact that x̄_k = 0 for every k, since ϕ^fb_γ(x) = x²(1/(2γ_k) − (4/9)|x|) is still unbounded below for any γ_k > 0. This also confirms the necessity of imposing bounded d_k in [32, Rem. III.4], in addition to ‖d_k‖ ≤ D‖x_{k−1} − x̄_{k−1}‖ as in step 1.1, neither of which is needed in the "good" PANOC+.

Robustness against poor initial stepsize estimation
The poor performance of PANOC on problem (3.7) can be attributed to the bad quality of the update directions d_k. We now consider a more meaningful comparison on problem (3.7), this time with directions given by a classical Newton-type approach. We extend f linearly outside of the box [−B, B] so as to make it (convex and) globally Lipschitz differentiable without affecting the problem. We thus consider

minimize_{x∈ℝ} f(x) subject to |x| ≤ B,

the minimization of f being equivalent to that of ϕ^fb_γ when γ < 1/L_f. As such, in the spirit of [33] we may select update directions based on a Newton method on the FBE. We simulate the scenario in which L_f is unknown, thereby selecting an initial stepsize γ_0 larger than 1/L_f. Since the cost function is coercive and has a unique stationary point, both methods are guaranteed to converge to the unique solution x⋆ = 0.
We consider classical regularized Newton directions

d_k = −(∇²ϕ^fb_{γ_k}(x_k) + μ I)⁻¹ ∇ϕ^fb_{γ_k}(x_k),

with μ > 0 as regularization parameter; when not defined, ∇²ϕ^fb_{γ_k} is intended in a Clarke generalized sense. Figure 3 shows that PANOC's iterates initially diverge, even if the starting point x_0 is close to the solution x⋆, if the proximal stepsize γ_0 is poorly estimated, in line with the observations above, and despite the choice of regularized Newton-type directions; as in Figure 2, the poor geometry estimation of PANOC is responsible for an initial divergent behavior that causes slower asymptotic convergence with a small stepsize. Conversely, PANOC+ adaptively constructs a tighter merit function and exhibits monotone decrease of ϕ^fb_γ, as depicted in Figure 3. Once again, these simulations show that PANOC+ provides a more consistent performance without necessarily requiring more iterations or function evaluations; moreover, the nested linesearch procedure does not lead to a smaller stepsize nor does it hinder fast asymptotic convergence.

Algorithmic Analysis under Inexact Proximal Oracles
In this section we analyze the properties of the iterates generated by PANOC+, starting from their well definedness. As a substantial proof of robustness with respect to inexact prox evaluations, we will generalize the setting to an extent that the oracle of the proximal mapping is not required, and instead only a local solution of the proximal subminimization problem is needed. We will refer to this variant as the inexact PANOC+, and emphasize that the exact counterpart described in Algorithm 2 falls as a special case. The investigation in this section originates essentially from three observations. Firstly, in the inexact scenario we cannot avail ourselves of the FBE, as its evaluation requires global optimality in the solution of the proximal subproblem. Secondly, by considering the equivalent reformulation of (P)

minimize_{x,z∈ℝ^n} f(x) + g(z) subject to x = z

and defining the associated augmented Lagrangian function

L_β(x, z, y) ≔ f(x) + g(z) + ⟨y, x − z⟩ + (β/2)‖x − z‖²,

one has that ϕ^fb_γ(x) = min_z L_{1/γ}(x, z, −∇f(x)), where any minimizer x̄ ∈ T_γ(x) is the result of an exact proximal minimization. Thirdly, in the ALM framework, algorithms can be constructed that converge in some sense to stationary points of the optimization problem, even solving the associated subproblems only approximately [7]. Therefore, we seek relaxed (sub)optimality concepts for the evaluation of the proximal mapping. This viewpoint will ultimately highlight how, additionally to being used as a solver within ALMs, as in [29,21,12], PANOC+ can operate as an ALM-type solver itself.
In the broadest possible setting, we do not require any (sub)optimality in the proximal minimization subproblem other than improvement with respect to the previous iteration. Clearly, additional conditions are needed for generating meaningful iterates, but as a proof of robustness of PANOC + we demonstrate that any choice complying with said requirement maintains the well definedness of the algorithm. We will then provide instances of such conditions that, possibly under additional assumptions on the problem, ensure optimality conditions for the limit points of the proposed inexact variant.
Specifically, we consider Algorithm 2 with the following instruction replacing step 2.4 therein, remarking that "exact" x̄_k ∈ T_γ(x_k) as prescribed in Algorithm 2 comply with this relaxed requirement (any such x̄_k is a global minimizer of L_{1/γ}(x_k, · , −∇f(x_k)), and Φ_k = ϕ^fb_γ(x_k) in this case).

Suboptimal prox step for inexact PANOC+ (step 2.4'): Let x̄_k be a suboptimal minimizer of L_{1/γ}(x_k, · , −∇f(x_k)) such that condition (4.4) holds.
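The identity linking the exact prox, the augmented Lagrangian, and the FBE can be verified numerically. The Python sketch below uses an illustrative instance of our own choosing (f(x) = x²/2, g = |·|), assuming the augmented Lagrangian form L_β(x, z, y) = f(x) + g(z) + ⟨y, x − z⟩ + (β/2)‖x − z‖² stated above:

```python
import numpy as np

def f(x): return 0.5 * x * x
def grad_f(x): return x
def g(z): return abs(z)
def prox_g(z, gamma):
    # soft-thresholding: exact prox of |.|
    return float(np.sign(z)) * max(abs(z) - gamma, 0.0)

def aug_lagrangian(x, z, y, beta):
    # L_beta(x, z, y) = f(x) + g(z) + <y, x - z> + (beta/2)|x - z|^2
    return f(x) + g(z) + y * (x - z) + 0.5 * beta * (x - z) ** 2

def fbe(x, gamma):
    zbar = prox_g(x - gamma * grad_f(x), gamma)
    return f(x) + grad_f(x) * (zbar - x) + g(zbar) + (zbar - x) ** 2 / (2 * gamma)

x, gamma = 2.0, 0.4
zbar = prox_g(x - gamma * grad_f(x), gamma)  # exact proximal minimizer
# the exact prox minimizes z -> L_{1/gamma}(x, z, -grad_f(x)), and the
# minimum value is exactly the FBE at x:
assert abs(aug_lagrangian(x, zbar, -grad_f(x), 1 / gamma) - fbe(x, gamma)) < 1e-9
for z in (0.0, 0.5, 1.0, 1.5):  # zbar beats nearby suboptimal candidates
    assert aug_lagrangian(x, z, -grad_f(x), 1 / gamma) >= fbe(x, gamma) - 1e-9
```

A "suboptimal minimizer" in the sense of step 2.4' is then any z achieving a sufficiently small value of this augmented Lagrangian, without requiring the exact argmin.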

Well Definedness and Convergence Results
A crucial complication that the stepsize adjustment in the "good" PANOC+ suffers compared with the original one in the "bad" PANOC is that it creates a nested dependency between γ_k, τ_k, and d_k that could potentially give rise to infinite recursions. While this is fortunately not the case, as we are about to show, the proof is not as straightforward as in [32]. On top of this, while in the "exact" case local boundedness properties of the PG operator T_γ could conveniently be exploited, in accounting also for inexactness, even for a fixed x_k the set of points x̄_k complying with the relaxed requirement (4.4) may be unbounded. The following result will serve as a surrogate of local boundedness for the suboptimal proximal operator.
Lemma 4.1. Let a constant c ∈ ℝ, a sequence (γ_j)_{j∈ℕ} ↘ 0, and two bounded sequences (u_j, z_j)_{j∈ℕ} in ℝ^n be fixed, and for every j ∈ ℕ let z̄_j be such that g(z̄_j) + ⟨u_j, z̄_j − z_j⟩ + (1/(2γ_j))‖z̄_j − z_j‖² ≤ c. Then, (z̄_j)_{j∈ℕ} is bounded.
Proof. An application of Young's inequality on the inner product bounds the left-hand side from below. To arrive at a contradiction, up to extracting a subsequence if necessary, suppose that 0 < ‖z̄_j‖ → ∞. Since lim inf_{j→∞} g(z̄_j)/‖z̄_j‖² > −∞ by [24, Ex. 1.24], dividing by ‖z̄_j‖² and passing to the limit leads to the contradiction 0 ≤ −1.
To avoid trivialities, in what follows we assume that x_k ≠ x̄_k always holds. This is consistent with stopping criteria based on the PG residual (1/γ_k)‖x_k − x̄_k‖, see Section 4.2, in which case x_k = x̄_k would trigger a successful termination.
Lemma 4.2 (well definedness of the "good" (inexact) PANOC+). Consider the iterates generated by Algorithm 2 with inexact proximal evaluation at step 2.4 as given in (4.4). The following hold: (i) Well definedness: at every iteration, the number of backtrackings at steps 2.5 and 2.6 is finite.
Proof. As observed in Remark 3.1, each iteration k defines or updates only variables indexed with a k sub/superscript, while those defined in previous iterations are untouched. In what follows, let us index by (k, j) the variables defined at the j-th attempt within iteration k. Note further that γ_{k,j} L_{k,j} = α ∈ (0, 1) holds for every attempt j within every iteration k, since every time γ_k is halved the estimate L_k is doubled (cf. step 2.5).
• 4.2(i) We proceed by induction on k. If k = 0, there is no backtracking on τ, and from Lemma 4.1 we conclude that all the trials x̄_{0,j} remain confined in a bounded set Ω_0, and therefore any stepsize γ_{0,j} < 1/L_{f,Ω_0} is accepted.
Suppose now that k > 0 and observe that, by the definition of Φ_k in (4.4) and the failure of the condition at step 2.5, the displayed inequality holds. Since ‖d_{k,j}‖ ≤ D‖x_{k−1} − x̄_{k−1}‖ and τ_{k,j} ∈ [0, 1], any attempt x_{k,j} defined at step 2.3 during the k-th iteration satisfies the displayed bound and thus remains in a bounded set, be it Ω_k. To arrive at a contradiction, suppose that γ_{k,j} ↘ 0 as j → ∞, and observe how condition (4.4) reads in this case. Since (x_{k,j})_{j∈ℕ} is bounded, an application of Lemma 4.1 reveals that (x̄_{k,j})_{j∈ℕ} too is bounded. Up to possibly enlarging the set, both sequences remain confined in the bounded set Ω_k, implying that the backtracking at step 2.5 should have terminated in finite time, whence the sought contradiction.
Hence, γ_{k,j} is backtracked finitely many times within iteration k; up to discarding early attempts, we may denote γ_{k,j} = γ_k. Condition (4.4) now reads as displayed. As τ_{k,j} ↘ 0, one has that x_{k,j} → x̄_{k−1}. Since f and ∇f are continuous, the right-hand side of the inequality converges to ϕ(x̄_{k−1}), overall resulting in the displayed limit. Since ‖x_{k−1} − x̄_{k−1}‖ > 0 and β < 1, for j large enough the condition at step 2.6 will be violated and therefore the k-th iteration successfully terminates. □ We next consider an asymptotic analysis of the algorithm (Theorem 4.3). (i) (Φ_k)_{k∈ℕ} converges to a finite value ϕ⋆ ≥ inf ϕ from above.
in particular, the set of limit points of (x_k)_{k∈ℕ} is closed and connected, and coincides with that of (x̄_k)_{k∈ℕ}.
whence the claimed finite sum.
and thus ‖x_k − x_{k−1}‖ vanishes, and in turn so does ‖x_k − x̄_{k−1}‖, as the displayed inequality shows. • 4.3(vi) The first implication follows from Lemma 4.2(iii), and the second one from assertion 4.3(ii). If (x_k)_{k∈ℕ} is bounded, and thus so is (x̄_k)_{k∈ℕ}, the set Ω_k in the proof of Lemma 4.2(i) can be taken independent of k, and asymptotic constancy of γ_k follows from the same arguments therein. Finally, if ∇f is L_f-Lipschitz continuous, the condition at step 2.5 fails to hold as soon as γ_k ≤ α/L_f [6, Prop. A.24], and γ_k is thus asymptotically constant.
• 4.3(iv) By iteratively applying inequality (4.8), we obtain the displayed estimate. Contrary to the claim, suppose that Σ_{k∈ℕ} γ_k < ∞ holds; then (x_k)_{k∈ℕ} is bounded. From assertion 4.3(vi) proven above we then infer that γ_k is asymptotically constant, thus contradicting the finiteness of Σ_{k∈ℕ} γ_k. □ Once γ_k is asymptotically constant, Algorithm 2 eventually reduces to PANOC [32] with constant stepsize, and its convergence results are then readily available, including global convergence (possibly at R-linear rates) under Kurdyka-Łojasiewicz assumptions, and superlinear convergence when converging to a strong local minimum with directions satisfying the Dennis-Moré condition; see [32, 34].
Nevertheless, even when accounting for inexact proximal evaluations it is still possible to derive some qualitative guarantees for the limit points, provided that x̄_k satisfies some local suboptimality requirements. We list two such instances in the following definition and later detail a proof validating the claim.
(ii) Uniformly locally optimal if there exist r > 0 and a sequence ε_k ↘ 0 such that the following local minimality condition holds: Notice that no (approximate) local minimality is required in the approximate stationarity criterion of Definition 4.5(i). Consequently, the output can be retrieved by any descent method starting at the previous iterate and terminating when δ-stationarity is achieved. It is also worth remarking that the prox suboptimality tolerance δ need not be small nor fixed for all iterations, and can instead be replaced by a sequence δ_k ↘ δ ≥ 0. The uniform local optimality requirement of Definition 4.5(ii) is instead more restrictive, and is possibly subject to prior knowledge of the geometry of the augmented Lagrangian. The uniformity is dictated by the value of r > 0, whose role can be appreciated by considering the sequence z_k = 1/k for k > 0, which consists of (isolated) local minimizers for the function h, yet the limit z = 0 is not stationary for h. The pathology arises from the nonuniformity of the radius of local minimality of z_k, which is r_k < 1/(k(k+1)) → 0.
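The quantitative core of this counterexample can be verified directly: consecutive points z_k = 1/k and z_{k+1} = 1/(k+1) are exactly 1/(k(k+1)) apart, so the radius of local minimality at z_k cannot exceed this vanishing gap. A quick numerical check:

```python
# The candidate minimizers z_k = 1/k accumulate at 0; the gap between
# consecutive points is 1/(k(k+1)), so any radius of local minimality
# r_k at z_k must vanish as k grows.
def gap(k: int) -> float:
    """Distance between consecutive points z_k = 1/k and z_{k+1}."""
    return 1.0 / k - 1.0 / (k + 1)

for k in (1, 10, 100, 1000):
    assert abs(gap(k) - 1.0 / (k * (k + 1))) < 1e-15
```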
Theorem 4.6 (Subsequential convergence of inexact PANOC+). Consider the iterates generated by Algorithm 2 with inexact proximal evaluation at step 2.4 as given in (4.4). Suppose that the iterates remain bounded (as is the case when ϕ is coercive), and let ω be the set of limit points of (x̄_k)_{k∈ℕ}. Then: (i) If the (x̄_k)_{k∈ℕ} are δ-stationary as in Definition 4.5(i) and gph ∂g is closed relative to dom g × ℝⁿ, then ω is made of δ-stationary points for ϕ.
(ii) If the sequence (x̄_k)_{k∈ℕ} is (eventually) uniformly locally optimal as in Definition 4.5(ii) (this being true in case of exact prox evaluations, with r = ∞ and ε_k = 0), then the set ω is made of stationary points for ϕ, and on ω the function ϕ is constantly equal to ϕ⋆ as in assertion 4.3(i).
Proof. Up to possibly discarding early iterates, in light of the boundedness of the sequences and the consequent eventual constancy of γ_k by Theorem 4.3(vi), we may assume that γ_k ≡ γ > 0 holds for all k. Let x⋆ ∈ ω be fixed, and let an infinite set of indices K ⊆ ℕ be such that (x̄_k)_{k∈K} → x⋆, so that (x_k)_{k∈K} → x⋆ too, as follows from Theorem 4.3(iii).
• 4.6(ii) Letting ϕ⋆ be as in Theorem 4.3(i) and invoking (4.5), lsc of ϕ yields ϕ(x⋆) ≤ ϕ⋆. For k large enough so that x̄_k is r-close to x⋆, we have the displayed estimate, owing to continuity of f and ∇f, and the fact that both ε_k and ‖x_k − x̄_k‖ vanish (the former by assumption and the latter by Theorem 4.3(iii)). From the arbitrariness of x⋆ ∈ ω we conclude that ϕ is constant on ω with value ϕ⋆. Notice further that this also shows that g(x̄_k) → g(x⋆) as K ∋ k → ∞. Ekeland's variational principle [24, Prop. 1.43] with δ_k = √ε_k ensures, for every k ∈ K large enough so that √ε_k ≤ r, the existence of ξ_k ∈ B(x̄_k; √ε_k), together with a vector η_k as displayed, such that L_{1/γ}(x_k, ξ_k, −∇f(x_k)) ≤ Φ_k and η_k ∈ B(0; √ε_k). By lsc of g and since ξ_k → x⋆, necessarily g(ξ_k) → g(x⋆), and the inclusion −∇f(x⋆) ∈ ∂g(x⋆) is then readily obtained, whence the claimed stationarity of x⋆ for ϕ.
Closedness of gph ∂g relative to dom g × ℝⁿ as required in Theorem 4.6(i) is frequently encountered in applications, and trivially encompasses all functions that are continuous on their domain, such as indicators of closed sets. The 0-norm is instead an example of a function which is not continuous on its domain but nevertheless complies with the requirement of Theorem 4.6(i). Indeed, notice that ∂g(x) = E_1 × ⋯ × E_n, where E_i = ℝ if x_i = 0 and E_i = {0} otherwise. Consider a sequence x_k → x along with ∂g(x_k) ∋ v_k → v; we will show that v ∈ ∂g(x), regardless of whether or not g(x_k) converges to g(x). Indeed, if x_i = 0, then trivially v_i ∈ ℝ = E_i. Otherwise, x^k_i ≠ 0 holds for large enough k, thus necessarily v^k_i = 0, and consequently v_i ∈ {0} = E_i. Either way, since this holds for every component, we conclude that v ∈ ∂g(x).
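The componentwise argument can be illustrated numerically. The helper below (our own, not from the paper) tests membership in ∂‖·‖_0 as characterized above, and the short loop exhibits a sequence along which g(x_k) does not converge to g(x) while the graph-closedness property still holds:

```python
import numpy as np

def in_l0_subdiff(x, v, tol=1e-12):
    """Membership test for the subdifferential of the 0-norm:
    ∂||x||_0 = E_1 × ... × E_n, E_i = R if x_i = 0 and {0} otherwise."""
    return all(abs(vi) <= tol for xi, vi in zip(x, v) if abs(xi) > tol)

l0 = lambda x: int(np.count_nonzero(x))  # g(x) = ||x||_0

# A sequence x_k -> x with v_k in ∂||x_k||_0 and v_k -> v:
x_lim, v_lim = np.array([0.0, 1.0]), np.array([0.0, 0.0])
for k in range(1, 50):
    x_k = np.array([1.0 / k, 1.0 + 1.0 / k])  # all entries nonzero
    v_k = np.array([0.0, 0.0])                # forced: v_i = 0 where x_i != 0
    assert in_l0_subdiff(x_k, v_k)
    assert l0(x_k) == 2 != l0(x_lim)          # g(x_k) does not converge to g(x)
assert in_l0_subdiff(x_lim, v_lim)            # yet the limit pair stays in gph ∂g
```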

Termination Criteria
Algorithm 2 runs indefinitely and generates infinite sequences of iterates (x_k)_{k∈ℕ} and (x̄_k)_{k∈ℕ}. Along its execution, we are compelled to check some suitable conditions for stopping and returning an x̄_k that, in some sense, satisfactorily minimizes ϕ. The assertion of Theorem 4.3(v) guarantees that the standard termination criterion on the residual is verified in finite time. However, considering (2.5), a control on the magnitude of ‖∇f(x_k) − ∇f(x̄_k)‖ must also be imposed in order to guarantee bounds on dist(0, ∂ϕ(x̄_k)). This calls for a strengthened linesearch condition at step 2.5 ensuring also the satisfaction of the displayed inequality, so that, by a triangle inequality argument on (2.5), ε-stationarity of x̄_k (that is, dist(0, ∂ϕ(x̄_k)) ≤ ε) would be guaranteed by (4.11). On the one hand, owing to Assumption a1 the proof of Lemma 4.2(i) (and of all other results) would still apply verbatim, meaning that this criterion would affect neither the well definedness of Algorithm 2 nor any result presented so far. On the other hand, it would require evaluations of ∇f(x̄_k), otherwise not needed, and thus affect the overall complexity. To account for this fact, a viable solution is to trigger this strengthened linesearch only after (4.11) is first satisfied, at which point the algorithm can terminate whenever (4.11) is verified again. Note that the same conclusions can be drawn under suboptimal prox evaluations complying with the uniform local optimality of Definition 4.5(ii), as long as ε_k = 0 for all k. In case of δ-stationarity as in Definition 4.5(i), instead, the same criterion would guarantee (δ + ε)-stationarity of the output.
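The triangle inequality argument can be sketched as a standalone check (helper names are ours): the standard prox optimality conditions bound dist(0, ∂ϕ(x̄_k)) by the PG residual ‖x_k − x̄_k‖/γ_k plus the gradient-variation term ‖∇f(x̄_k) − ∇f(x_k)‖, and the strengthened linesearch controls the latter.

```python
import numpy as np

def stationarity_bound(x, x_bar, gamma, grad_f):
    """Upper bound on dist(0, ∂ϕ(x_bar)): PG residual plus the
    gradient-variation term controlled by the strengthened linesearch."""
    residual = np.linalg.norm(x - x_bar) / gamma
    grad_gap = np.linalg.norm(grad_f(x_bar) - grad_f(x))
    return residual + grad_gap

# illustration with f(x) = 0.5 ||x||^2, so that grad_f is the identity
grad_f = lambda z: z
x, x_bar, gamma = np.array([1.0, 0.0]), np.array([0.5, 0.0]), 0.5
bound = stationarity_bound(x, x_bar, gamma, grad_f)  # 1.0 + 0.5 = 1.5
```

A termination test would then simply compare this bound against the target ε.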

Nonmonotone Variant
Nonmonotone linesearch procedures often prove beneficial in practice, as they can reduce conservatism in the linesearch and favor larger steps. Patterned after the rationale of the ZeroFPR algorithm [34], a nonmonotone linesearch can be readily integrated in PANOC+ at step 2.6 without affecting the finite termination and asymptotic properties asserted in Lemma 4.2 and Theorem 4.3. This is done by changing the definition of Φ_k at step 2.4 into Φ_k = (1 − p_k)Φ_{k−1} + p_k ϕ^fb_γ(x_k) for k > 0 (with ϕ^fb_γ(x_k) being replaced by L_{1/γ}(x_k, x̄_k, −∇f(x_k)) in the inexact case), where (p_k)_{k∈ℕ} ⊂ (0, 1] is any user-selected sequence bounded away from 0. The key observation enabling the replication of all the convergence results is the inequality ϕ^fb_γ(x_k) ≤ Φ_k, which follows from an elementary induction (cf. [34, Lem. 5.1]).
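A minimal sketch of this merit update and its invariant (function name ours): whenever the linesearch guarantees ϕ^fb_γ(x_k) ≤ Φ_{k−1}, the convex combination preserves the sandwich ϕ^fb_γ(x_k) ≤ Φ_k ≤ Φ_{k−1}.

```python
def update_merit(phi_prev, phi_fb_new, p):
    """Nonmonotone merit update: Phi_k = (1 - p_k) Phi_{k-1} + p_k phi_fb(x_k),
    with p_k in (0, 1] bounded away from 0."""
    assert 0.0 < p <= 1.0
    return (1.0 - p) * phi_prev + p * phi_fb_new

# whenever phi_fb_new <= phi_prev (guaranteed by the linesearch), the
# convex combination keeps phi_fb_new <= Phi_k <= phi_prev
Phi = update_merit(10.0, 4.0, 0.25)  # 0.75*10 + 0.25*4 = 8.5
assert 4.0 <= Phi <= 10.0
```

Taking p = 1 recovers the monotone merit Φ_k = ϕ^fb_γ(x_k).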

Adaptive Proximal Gradient Method
By selecting d_k = x̄_{k−1} − x_{k−1} at step 2.2, PANOC+ reduces to the classical proximal gradient method x_k ∈ T_{γ_{k−1}}(x_{k−1}) with an adaptive stepsize. In fact, the descent condition at step 2.6 does not need to be checked, as it is always satisfied for any τ_k, having x_k = (1 − τ_k)x̄_{k−1} + τ_k(x_{k−1} + d_k) = x̄_{k−1} independently of the value of τ_k. For this specific choice of the update direction d_k, the algorithm simplifies and reduces to the proximal gradient method with adaptive stepsize selection given in Algorithm 3. Convergence results developed in the general setting of PANOC+ can thus be readily imported, even in the inexact case.

Algorithm 3 Inexact proximal gradient with adaptive γ-stepsize rule
Require: x_0 ∈ ℝⁿ; γ_0 ∈ (0, γ_g); α ∈ (0, 1). Initialize x̄_{−1} = x_0, k ← 0, and start from step 3.2
3.1: γ_k ← γ_{k−1}, x_k ← x̄_{k−1}
3.2: Let x̄_k be as in (4.4) (e.g., x̄_k ∈ T_{γ_k}(x_k))
3.3: if f(x̄_k) > f(x_k) + ⟨∇f(x_k), x̄_k − x_k⟩ + α/(2γ_k)‖x̄_k − x_k‖² then γ_k ← γ_k/2, and go back to step 3.2
3.4: k ← k + 1 and start the next iteration at step 3.1

We note that the exact version of Algorithm 3, that is, with x̄_k ∈ T_{γ_k}(x_k) in step 3.2, corresponds to a simplified version of the linesearch strategy [25, LS1], with no relaxation and in finite-dimensional spaces, but here analyzed for (fully) nonconvex problems. Alternatively, it can be viewed as the monotone PG method outlined in [14, Alg. 3.1] with a slightly more conservative linesearch, as the displayed chain of inequalities, which follows from step 3.3 and Lemma 2.2(ii), shows. Remarkably, plain continuous differentiability (as opposed to local Lipschitz continuity of the gradient) suffices in the given reference, under a few other technical assumptions. However, the discussion therein is confined to plain PG iterations as in Algorithm 3, while our analysis is more general and captures plain PG as a simple byproduct.
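To fix ideas, here is a minimal Python sketch of the exact variant of Algorithm 3 (our own illustrative implementation, with hypothetical problem data: a least-squares f and g = λ‖·‖₁, whose prox is soft-thresholding). The stepsize is halved whenever the quadratic upper-bound check of step 3.3 fails, so no global Lipschitz constant of ∇f is ever required.

```python
import numpy as np

def adaptive_prox_grad(f, grad_f, prox_g, x0, gamma0=1.0, alpha=0.9,
                       tol=1e-8, max_iter=20000):
    """Exact variant of Algorithm 3: proximal gradient with the adaptive
    gamma-stepsize rule (halve gamma until the local quadratic upper
    bound with modulus alpha/gamma holds)."""
    x, gamma = np.asarray(x0, dtype=float), gamma0
    for _ in range(max_iter):
        gx = grad_f(x)
        while True:
            x_bar = prox_g(x - gamma * gx, gamma)  # step 3.2: x_bar in T_gamma(x)
            d = x_bar - x
            # step 3.3: descent check; on failure halve gamma and retry
            if f(x_bar) <= f(x) + gx @ d + (alpha / (2 * gamma)) * (d @ d) + 1e-12:
                break
            gamma /= 2.0
        if np.linalg.norm(d) / gamma <= tol:       # PG-residual stopping test
            break
        x = x_bar                                  # start next iteration (step 3.1)
    return x_bar, gamma

# hypothetical test problem: f(x) = 0.5 ||Ax - b||^2, g = 0.1 ||x||_1
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)
prox_l1 = lambda z, gam: np.sign(z) * np.maximum(np.abs(z) - 0.1 * gam, 0.0)
x_sol, gamma = adaptive_prox_grad(f, grad_f, prox_l1, np.zeros(5))
```

Note that, as in Algorithm 3, gamma is only ever decreased and is inherited across iterations, so the backtracking is triggered finitely many times overall on regions where ∇f has a finite Lipschitz modulus.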