Global Convergence of Model Function Based Bregman Proximal Minimization Algorithms

Lipschitz continuity of the gradient mapping of a continuously differentiable function plays a crucial role in designing various optimization algorithms. However, many functions arising in practical applications, such as low rank matrix factorization or deep neural network problems, do not have a Lipschitz continuous gradient. This led to the development of a generalized notion known as the $L$-smad property, which is based on generalized proximity measures called Bregman distances. However, the $L$-smad property cannot handle nonsmooth functions; for example, simple nonsmooth functions like $|x^4-1|$, and also many practical composite problems, are out of scope. We fix this issue by proposing the MAP property, which generalizes the $L$-smad property and is also valid for a large class of nonconvex nonsmooth composite problems. Based on the proposed MAP property, we propose a globally convergent algorithm called Model BPG, which unifies several existing algorithms. The convergence analysis is based on a new Lyapunov function. We also numerically illustrate the superior performance of Model BPG on standard phase retrieval problems, robust phase retrieval problems, and Poisson linear inverse problems, when compared to a state-of-the-art optimization method that is valid for generic nonconvex nonsmooth optimization problems.


Introduction
We are interested in solving the following nonconvex optimization problem:

min_{x ∈ R^N} f(x),   (P)

where f : R^N → R is a proper lower semicontinuous function that is lower bounded. Special instances of the above mentioned problem include two broad classes of problems, namely, additive composite problems (Section 4.1) and composite problems (Section 4.2). Such problems arise in numerous practical applications, such as quadratic inverse problems [19], low-rank matrix factorization problems [50], Poisson linear inverse problems [5], robust denoising problems with nonconvex total variation regularization [51], deep linear neural networks [52], and many more.
In this paper, we design an abstract framework for globally convergent algorithms based on suitable approximations of the objective, where the convergence analysis is moreover driven by a requirement on the approximation quality. A classical special case is that of a continuously differentiable f : R^N → R whose gradient mapping is Lipschitz continuous over R^N. For such a function, the following Descent Lemma (cf. Lemma 1.2.3 of [53]) describes the approximation quality of the objective f by its linearization f(x̄) + ⟨∇f(x̄), x − x̄⟩ in terms of a quadratic error estimate with certain L̄, L > 0:

−(L/2) ‖x − x̄‖² ≤ f(x) − f(x̄) − ⟨∇f(x̄), x − x̄⟩ ≤ (L̄/2) ‖x − x̄‖², for all x, x̄ ∈ R^N. (1)

Such inequalities play a crucial role in designing algorithms that are used to minimize f. Gradient Descent is one such algorithm, on which we focus here. We illustrate Gradient Descent in terms of sequential minimization of suitable approximations to the objective, based on the first order Taylor expansion, i.e., the linearization of f around the current iterate x^k ∈ R^N. Consider the following model function at the iterate x^k ∈ R^N:

f(x; x^k) := f(x^k) + ⟨∇f(x^k), x − x^k⟩, (2)

where ⟨·, ·⟩ denotes the standard inner product in the Euclidean vector space R^N of dimension N and f(·; x^k) is the linearization of f around x^k. Set τ > 0. Now, the Gradient Descent update can be written equivalently as follows:

x^{k+1} = argmin_{x ∈ R^N} { f(x; x^k) + (1/(2τ)) ‖x − x^k‖² } = x^k − τ ∇f(x^k).

Its convergence analysis is essentially based on the Descent Lemma (1), which we reinterpret as a bound on the linearization error (model approximation error) of f. However, (1) obviously imposes a quadratic error bound, which cannot be satisfied in general. For example, functions like x⁴ or (x³ + y³)² or (1 − xy)² do not have a Lipschitz continuous gradient. The same is true in several of the aforementioned practical applications, for example, matrix factorization [50] and deep linear neural network [52] problems.
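The equivalence between the classical Gradient Descent update and its model-minimization reformulation can be checked numerically. The following is a minimal sketch (the quadratic toy objective and the use of scipy's general-purpose minimizer are our own illustrative choices, not part of the original development):

```python
import numpy as np
from scipy.optimize import minimize

def model_gd_step(f, grad_f, x, tau):
    # Minimize the model: linearization of f around x plus the
    # Euclidean proximity term (1/(2*tau)) * ||y - x||^2.
    model = lambda y: f(x) + grad_f(x) @ (y - x) + np.sum((y - x) ** 2) / (2 * tau)
    return minimize(model, x0=x).x

# Toy objective f(y) = 0.5 * ||y||^2, so grad_f(y) = y.
f = lambda y: 0.5 * float(y @ y)
grad_f = lambda y: y
x = np.array([2.0, -1.0])
tau = 0.5

y_model = model_gd_step(f, grad_f, x, tau)  # minimizer of the model
y_closed = x - tau * grad_f(x)              # classical update x - tau * grad f(x)
assert np.allclose(y_model, y_closed, atol=1e-5)
```

The model objective is a strongly convex quadratic, so its unique minimizer coincides with the closed-form Gradient Descent step.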
This issue was recently resolved in [19], based on the initial work in [5], by introducing a generalization of the Lipschitz continuity assumption for the gradient mapping of a function, which was termed the "L-smad property". In the context of convex optimization, a similar notion, namely "relative smoothness", was proposed in [46]. Such a notion was also independently considered in [11], before [46]. However, all these approaches rely on the model function (2), which is the linearization of the function. In this paper, we generalize to arbitrary model functions (Definition 3) instead of the linearization of the function.
Now, we briefly recall the "L-smad property". The main restrictiveness of the Lipschitz continuous gradient notion arises as only quadratic model approximation errors are allowed. Even for simple functions like x⁴, such quadratic bounds do not exist. Hence, generalized proximity measures which allow for higher order bounds are needed. In this regard, the L-smad property relies on generalized proximity measures known as Bregman distances. These distances are generated from so-called Legendre functions (Definition 1). Consider a Legendre function h; then the Bregman distance between x ∈ dom h and y ∈ int dom h is given by

D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩.

A continuously differentiable function f : R^N → R is L-smad with respect to a Legendre function h : R^N → R over R^N with L̄, L > 0, if the following condition holds true:

−L D_h(x, x̄) ≤ f(x) − f(x̄) − ⟨∇f(x̄), x − x̄⟩ ≤ L̄ D_h(x, x̄), for all x, x̄ ∈ R^N.

We interpret these inequalities as a generalized distance measure for the linearization error of f. Similar to the Gradient Descent setting, minimization of f(x̄) + ⟨∇f(x̄), x − x̄⟩ + (1/τ) D_h(x, x̄) essentially results in the update step of the Bregman proximal gradient (BPG) algorithm [19].
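As a quick numerical illustration of the L-smad idea, the function f(x) = x⁴ (which has no globally Lipschitz gradient) satisfies such a two-sided bound with respect to h(x) = x⁴; the choice L̄ = L = 1 below is our own worked example (in this special case the linearization error of f equals D_h exactly):

```python
import numpy as np

f = lambda x: x ** 4
df = lambda x: 4 * x ** 3
h = lambda x: x ** 4
dh = lambda x: 4 * x ** 3

def breg(x, y):
    # Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - dh(y) * (x - y)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, xb = rng.uniform(-5, 5, size=2)
    err = f(x) - f(xb) - df(xb) * (x - xb)   # linearization error of f
    D = breg(x, xb)
    # L-smad style sandwich with L = Lbar = 1 (here err == D identically):
    assert -D - 1e-9 <= err <= D + 1e-9
```

The same check fails for any quadratic proximity measure (Euclidean h), which is precisely the motivation for Bregman distances.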
However, the L-smad property relies on the continuous differentiability of the function f , thus nonsmooth functions as simple as |x 4 − 1| or |1 − (xy) 2 | or log(1 + |1 − (xy) 2 |) cannot be captured under the L-smad property. Numerous difficult nonsmooth optimization problems cannot be captured either. This motivates a more general notion than the L-smad property.
This led us to the development of the MAP property (Definition 5), where MAP abbreviates Model Approximation Property. Consider a function f : R^N → R that is proper and lower semicontinuous, and a Legendre function h : R^N → R with dom h = R^N. We abbreviate "lower semicontinuous" as "lsc". For a given x̄ ∈ R^N, we consider a generic model function f(·; x̄) that is proper lsc and approximates the function around the model center x̄, while preserving the local first order information (Definition 3). The MAP property is satisfied with the constants L̄ > 0 and L ∈ R if for any x̄ ∈ R^N the following holds:

f(x; x̄) − L D_h(x, x̄) ≤ f(x) ≤ f(x; x̄) + L̄ D_h(x, x̄), for all x ∈ R^N.

Note that we do not require the continuous differentiability of the function f. Our MAP property is inspired by [25]; however, their work considers only the upper bound, and they also rely on a decomposition of the function into two components.
We illustrate the MAP property with a simple example. Consider a composite problem f(x) = g(F(x)) := |x⁴ − 1|, where F(x) := x⁴ − 1 is a continuously differentiable function over R, and g(x) := |x| is a Lipschitz continuous function over R. Note that neither the Lipschitz continuity of the gradient nor the L-smad property is valid for this problem. However, the MAP property is valid here. For a given x̄ ∈ R, we consider the model function that is given by f(x; x̄) := g(F(x̄) + ∇F(x̄)(x − x̄)), where ∇F(x̄) is the Jacobian of F at x̄. Then, with L̄ = L = 4, the MAP property is satisfied:

f(x; x̄) − 4 D_h(x, x̄) ≤ f(x) ≤ f(x; x̄) + 4 D_h(x, x̄),

where h(x) = 0.25x⁴ and the generated Bregman distance is D_h(x, x̄) = 0.25x⁴ − 0.25x̄⁴ − x̄³(x − x̄). We provide further details in Example 4 and in Example 7.
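This sandwich bound can be probed numerically. The following sketch uses the constants L̄ = L = 4 and h(x) = 0.25x⁴ stated above, with the linearized composite model f(x; x̄) = |x̄⁴ − 1 + 4x̄³(x − x̄)|:

```python
import numpy as np

f = lambda x: abs(x ** 4 - 1)                  # f = g(F), g = |.|, F(x) = x^4 - 1
model = lambda x, xb: abs(xb ** 4 - 1 + 4 * xb ** 3 * (x - xb))   # g(F(xb) + F'(xb)(x - xb))
breg = lambda x, xb: 0.25 * x ** 4 - 0.25 * xb ** 4 - xb ** 3 * (x - xb)  # D_h, h = 0.25 x^4

rng = np.random.default_rng(1)
for _ in range(1000):
    x, xb = rng.uniform(-3, 3, size=2)
    err = f(x) - model(x, xb)   # model error
    D = breg(x, xb)             # nonnegative, since h is convex
    # MAP property with Lbar = L = 4: the model error is sandwiched by 4 * D_h.
    assert -4 * D - 1e-9 <= err <= 4 * D + 1e-9
```

The key step hiding behind the check is that |f(x) − f(x; x̄)| ≤ |F(x) − F(x̄) − ∇F(x̄)(x − x̄)| by the 1-Lipschitz continuity of g, and the right-hand side equals 4 D_h(x, x̄) for this choice of h.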
We considered the above composite problem for illustration purposes, and we emphasize that our framework is applicable to large classes of nonconvex problems (see Section 4). Similar to the BPG setting, minimization of f(x; x̄) + (1/τ) D_h(x, x̄) essentially results in the update step of the Model BPG algorithm. The precise definition of the model function is provided in Definition 3, the MAP property in full generality is provided in Definition 5, and the Model BPG algorithm is provided in Algorithm 1.
We now discuss our main contributions and the related work.

Contributions
Our main contributions are the following.
• We introduce the MAP property, which generalizes the Lipschitz continuity assumption of the gradient mapping and the L-smad property [19,5]. Earlier proposed notions were restricted to additive composite problems. The MAP property is essentially an extended Descent Lemma that is valid for generic composite problems (see Section 4), based on Bregman distances. Our theory is applicable to generic nonconvex nonsmooth objectives, and is not restricted to composite objectives. A MAP-like property was also partially considered in [25], however with a focus on stochastic optimization. The MAP property relies on the notion of a model function, which serves as a function approximation and preserves the local first order information of the function. Our work extends the foundations laid by [29,25], which consider generic model functions (potentially nonconvex), and [63], which considers convex model functions.
• Based on the MAP property, we propose the Model based Bregman Proximal Gradient (Model BPG) algorithm (Algorithm 1). Several existing algorithms, such as the Proximal Gradient Method [23], the Bregman Proximal Gradient Method [19] (or Mirror Descent [8]), the Prox-Linear algorithm [31], and many others, can be seen as special cases. Moreover, novel algorithms arise depending on the definition of the model function. We emphasize that Model BPG is practical, simple to implement, and also does not require special knowledge about the problem such as the so-called information zone [18]. Close variants of Model BPG already exist in the literature, such as the line search based Bregman proximal gradient method [63] and a mirror descent variant [25]; however, the convergence of the full sequence of iterates was not known.
• The standard global convergence analysis, in the sense that the full sequence of iterates converges to a single point, relies on descent properties of function values evaluated at the iterates of an algorithm. However, using function values can be restrictive, and alternatives are sought [66]. To fix this issue, we introduce a new Lyapunov function, through which we prove the global convergence of the full sequence of iterates generated by Model BPG. We eventually show that the sequence generated by Model BPG converges to a critical point of the objective function, which is potentially nonconvex and nonsmooth. Notably, the usage of a Lyapunov function is popular for inertial algorithms [61,51] and through our work we aim to popularize Lyapunov functions also for noninertial algorithms. Usage of Lyapunov functions is also popular in the context of dynamical systems [36].
• The global convergence analysis of Bregman proximal gradient (BPG) [19] relies on the full domain of the Bregman distance. However, there are many Bregman distances for which the domain is restricted. We show in this paper that, under certain assumptions that are typically satisfied in practice, the global convergence of the full sequence of iterates generated by Model BPG using generic Bregman distances can indeed be obtained (Theorems 25, 28). In general, this requires the limit points of the sequence to lie in the interior of the domain of the employed Legendre function. While this is certainly a restriction, the considered setting is nevertheless highly nontrivial and novel in the general context of nonconvex nonsmooth optimization. Moreover, it allows us to avoid the common restriction of requiring (global) strong convexity of the Legendre function, which is a severe drawback that rules out many interesting applications in related approaches (e.g., see Section 5.3).
• We provide a comprehensive numerical section showing the superior performance of Model BPG compared to a state of the art optimization algorithm, namely, Inexact Bregman Proximal Minimization Line Search (IBPM-LS) [62], on standard phase retrieval problems, robust phase retrieval problems and Poisson linear inverse problems.

Related work
Our work is fundamentally based on three pillars, namely, Bregman distances, model functions, and Kurdyka-Łojasiewicz (KL) inequality. Bregman distances are certain generalized proximity measures, which generalize Euclidean distances. Model functions serve as function approximations which preserve local first order information about the function. The KL inequality is a certain regularity property of the function, which is crucial for global convergence analysis, and is typically satisfied by objectives that arise in practice. We provide below the related work based on these three topics.
Bregman distances. Recently, there has been a huge surge of work on Bregman distances [71,38,20,9,24,35,65,32]. This is due to the flexibility one gains in modelling the proximity measures. The seminal Mirror Descent algorithm [8] incorporates Bregman distances in the update step. At times, the special structure of the Bregman distance can result in closed form update steps, the simplest case being Gradient Descent with the Euclidean distance. Also, for instance, in the minimization problem obtained for deblurring an image under Poisson noise, one can obtain a closed form expression for an optimization subproblem using a Bregman distance generated by Burg's entropy [63]. Bregman distances for structured matrix factorization problems were considered in [50,72,27,42,37], and certain extensions to deep linear neural networks were considered in [52]. Bregman distances allow many optimization algorithms, which were previously thought to be completely different, to coexist in a single algorithm, thus making the analysis simpler. The crucial observation that Bregman distances can indeed be used to generalize the notion of Lipschitz continuous gradient was made in [5]. However, their setting was restricted to convex problems. This was later mitigated in [19], via the L-smad property for nonconvex problems. Recently, related notions such as relative smoothness [46] and relative continuity [45] were proposed based on Bregman distances. Before [5] and [46], the work in [11] also considered a generalization of the Lipschitz continuous gradient notion. More related references include [55,40]. As mentioned in the introduction, the L-smad property can also be restrictive, and thus we propose the MAP property to generalize the L-smad property even further. Closely related work is [25]; however, their focus was on developing stochastic algorithms.
Model functions. The MAP property relies on the concept of the model function, which is essentially a function approximation that preserves the local first order information. In smooth optimization, it is common to use the Taylor approximation of a certain order as model function.
In nonsmooth optimization, we can only speak of "Taylor-like" models [58,57,29,63], which are (nonunique) approximations that satisfy a certain error bound given by a growth function [29,63]. The class of model functions used in [58,57] satisfies only a lower bound, and bundle methods are developed there, which form a different class of algorithms that we do not discuss here. The growth functions of [29,63], which measure the approximation quality of the model function and are also used in this paper, can be interpreted as a generalized first-order oracle. It has been shown in [63] that the concept of model functions unifies several algorithms for smooth and nonsmooth optimization, for example, Gradient Descent, Proximal Gradient Descent, the Levenberg-Marquardt method, ProxDescent, certain variable metric versions of these algorithms, and some related majorization-minimization based algorithms. More recently, model functions were considered in the context of the Conditional Gradient method in [64]. A particularly interesting class of model functions is the one for which the approximation quality measure is formed by Bregman distances [5,19,63], which is our main focus in this paper.
Kurdyka-Łojasiewicz inequality. Based on the MAP property, we propose the Model BPG algorithm. In order to prove the global convergence of the full sequence of iterates generated by the Model BPG algorithm, the Kurdyka-Łojasiewicz inequality [39,43,44,13,15] is key. This inequality is satisfied by most functions that appear in practical applications, in particular, semi-algebraic functions [12], globally subanalytic functions [14], or, more generally, functions that are definable in an o-minimal structure [15,26]. Usually, the essential conditions required for global convergence analysis can be collected in an abstract manner, and they are clearly summarized and studied in [4,17]. Basically, the conditions that need to be verified are called the "sufficient descent condition", the "relative error condition", and the "continuity condition". A sequence satisfying such conditions is at times called a gradient-like descent sequence [19], which we detail in Section B in the appendix. In order to prove the global convergence of the full sequence of iterates generated by Model BPG, it suffices to prove that it is a gradient-like descent sequence. Sequences arising from several algorithms, such as Bregman Proximal Gradient (BPG) or the Proximal Gradient method, are gradient-like descent sequences. In the context of additive composite problems, global convergence analysis of BPG was provided in [19]. However, their setting is restrictive, as the employed Legendre function is assumed to be strongly convex with full domain, and the model framework is not considered. In this paper, we do not have such restrictions; thus our framework is highly general and is applicable to broad classes of nonconvex nonsmooth problems (see Section 4).

Preliminaries and notations.
We work in a Euclidean vector space R^N of dimension N ∈ N equipped with the standard inner product ⟨·, ·⟩ and induced norm ‖·‖. For a set C ⊂ R^N, we define ‖C‖₋ := inf_{s ∈ C} ‖s‖. We skip basic definitions here; instead, we provide them in Section A in the appendix, and all notations are primarily taken from [69].
Legendre functions, defined below, generate the Bregman distances, which are generalized proximity measures compared to the Euclidean distance.

Definition 1 (Legendre function). A proper, lsc, convex function h : R^N → R is called a Legendre function if it is essentially smooth (i.e., int dom h ≠ ∅, h is differentiable on int dom h, and ‖∇h(x^k)‖ → ∞ for every sequence (x^k)_{k∈N} in int dom h converging to a boundary point of dom h) and essentially strictly convex (i.e., strictly convex on every convex subset of dom ∂h).

A Legendre function is also referred to as a kernel generating distance [19] or a reference function [46]. Generic reference functions used in [46] are more general than Legendre functions, as they do not require essential smoothness.
The Bregman distance associated with any Legendre function h is defined by

D_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩, for x ∈ dom h and y ∈ int dom h.

In contrast to the Euclidean distance, the Bregman distance lacks symmetry: in general, D_h(x, y) ≠ D_h(y, x).
Examples. Prominent examples of Bregman distances can be found in [5,Example 1,2]. We provide some examples below. For any vector x ∈ R N , the i th coordinate is denoted by x i .
• The Bregman distance generated from h(x) = ½‖x‖² is equivalent to the Euclidean distance: D_h(x, y) = ½‖x − y‖².
• For h(x) = −Σ_{i=1}^N log(x_i) (Burg's entropy), the Bregman distance is given by D_h(x, y) = Σ_{i=1}^N (x_i/y_i − log(x_i/y_i) − 1). Such distances are helpful in Poisson linear inverse problems [5,63].
• For h(x) = Σ_{i=1}^N x_i log(x_i) (Boltzmann-Shannon entropy), with 0 log(0) := 0, the Bregman distance is given by D_h(x, y) = Σ_{i=1}^N (x_i log(x_i/y_i) − x_i + y_i). Such distances are helpful to handle simplex constraints [8].
• Phase retrieval problems [19] use the Bregman distance based on the Legendre function h : R^N → R that is given by h(x) = ¼‖x‖⁴ + ½‖x‖².
• Matrix factorization problems [50,72] use the Bregman distance based on the Legendre function h : R^{N₁} × R^{N₂} → R that is given by h(x, y) = c₁ ((‖x‖² + ‖y‖²)/2)² + c₂ ((‖x‖² + ‖y‖²)/2), with certain c₁, c₂ > 0 and N₁, N₂ ∈ N. Based on the work in [50], related Bregman distances for deep linear neural networks were also explored later in [52].
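The listed Bregman distances are straightforward to implement from the single formula D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩. The following sketch instantiates the Euclidean and Boltzmann-Shannon cases and also exhibits the asymmetry of D_h (the concrete test vectors are our own):

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# Euclidean case: h(x) = 0.5 ||x||^2 recovers 0.5 ||x - y||^2.
h_euc = lambda x: 0.5 * x @ x
gh_euc = lambda x: x

# Boltzmann-Shannon entropy: h(x) = sum_i x_i log x_i yields the
# Kullback-Leibler-type distance sum_i (x_i log(x_i / y_i) - x_i + y_i).
h_ent = lambda x: np.sum(x * np.log(x))
gh_ent = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])

assert np.isclose(bregman(h_euc, gh_euc, x, y), 0.5 * np.sum((x - y) ** 2))
kl = np.sum(x * np.log(x / y) - x + y)
assert np.isclose(bregman(h_ent, gh_ent, x, y), kl)
# Unlike the Euclidean distance, D_h is generally asymmetric:
assert not np.isclose(bregman(h_ent, gh_ent, x, y), bregman(h_ent, gh_ent, y, x))
```

Only h and ∇h need to be supplied; this is the reason a single algorithmic framework can cover all the examples above.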

Problem setting and Model BPG algorithm
We solve possibly nonsmooth and nonconvex optimization problems of the form

min_{x ∈ R^N} f(x),   (P)

satisfying the following assumption, which we impose henceforth.

Assumption 1.
The objective function f : R^N → R is proper, lower semicontinuous (possibly nonconvex and nonsmooth) and coercive, i.e., f(x) → ∞ as ‖x‖ → ∞.

Due to [69, Theorem 1.9], a function f satisfying Assumption 1 is bounded from below, and Argmin_{x ∈ R^N} f(x) is nonempty and compact. We denote the optimal value by v(P) := inf_{x ∈ R^N} f(x). We denote the set of critical points with respect to the limiting subdifferential (see Appendix A) as

crit f := {x ∈ R^N : 0 ∈ ∂f(x)}.

We require the following technical definitions.
An example of a proper growth function is ω(t) = (η/r) t^r for η, r > 0. Lipschitz continuity and Hölder continuity can be interpreted with growth functions or, more generally, with uniform continuity [63]. We use the notion of a growth function to quantify the difference between a model function (defined below) and the objective function.

Definition 3 (Model function). A proper lsc function f(·; x̄) is called a model function for f around the model center x̄ ∈ dom f, if there exists a growth function ω_x̄ such that the following is satisfied:

|f(x) − f(x; x̄)| ≤ ω_x̄(‖x − x̄‖), for all x ∈ dom f. (10)

A model function is essentially a first-order approximation to a function f (see Lemma 38), which explains the naming as "Taylor-like model" by [29]. The qualitative approximation property is represented by the growth function. We refer to (10) as a bound on the model error, and the subscript of ω_x̄ denotes the dependency of the growth function on the model center x̄.
A few remarks are in order:
• Informally, the model function approximates the function well near the model center. Convex model functions are explored in [63,64]; in our setting, however, the model functions can be nonconvex.
• Nonconvex model functions were considered in [29], however only subsequential convergence was shown. Their work focuses on the termination criteria of algorithms, and they do not present an implementable algorithm.
If the growth function constants are independent of x̄, this results in a uniform approximation. However, typically the growth function depends on the model center, as we illustrate below.

Example 4 (Running example). With x̄ ∈ R as the model center, we consider f(x) = |x⁴ − 1| and the following model function:

f(x; x̄) := |x̄⁴ − 1 + 4x̄³(x − x̄)|.

As per the proof provided in Section C in the appendix, the model error is given by

|f(x) − f(x; x̄)| ≤ ω_x̄(|x − x̄|),

where the growth function is ω_x̄(t) = 24x̄²t² + 8t⁴.
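One way to arrive at a growth function of this form (the rigorous argument is deferred to Section C of the appendix; the elementary estimate below, assuming the linearized composite model f(x; x̄) = |x̄⁴ − 1 + 4x̄³(x − x̄)| of the running example, is only a sketch): with d := x − x̄ and t := |d|,

```latex
\begin{align*}
|f(x) - f(x;\bar{x})|
  &\le \bigl|F(x) - F(\bar{x}) - \nabla F(\bar{x})(x-\bar{x})\bigr|
   = \bigl|x^4 - \bar{x}^4 - 4\bar{x}^3 d\bigr| \\
  &= \bigl|6\bar{x}^2 d^2 + 4\bar{x} d^3 + d^4\bigr|
   \le 6\bar{x}^2 t^2 + 4|\bar{x}|\, t^3 + t^4 \\
  &\le 8\bar{x}^2 t^2 + 3 t^4
   \;\le\; 24\bar{x}^2 t^2 + 8 t^4 \;=\; \omega_{\bar{x}}(t),
\end{align*}
```

where the first inequality uses the 1-Lipschitz continuity of g = |·|, the expansion uses the binomial identity for (x̄ + d)⁴, and the mixed term is absorbed via Young's inequality 4|x̄|t³ ≤ 2x̄²t² + 2t⁴.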
The above example illustrates that a constant in the growth function ω_x̄ depends on the model center. It is often of interest to obtain a uniform approximation for the model error |f(x) − f(x; x̄)|, where the growth function does not depend on the model center. In general, obtaining such a uniform approximation is not trivial, and may even be impossible; moreover, even finding an appropriate growth function at all is typically not trivial.
For this purpose, it is preferable to have a global bound on the model error that can be easily verified, in which the dependency on the model center is more structured, and in which the arising constants do not depend on the model center. In the context of additive composite problems, previous works such as [5,46,19] relied on Bregman distances to upper bound the model error, and verified the model error property with a simple convexity test based on second order information (cf. [5, Proposition 1]). Based on this idea, we propose the following MAP property, which is valid for a huge class of generic nonconvex problems and also generalizes the previous works. We emphasize that the MAP property is valid for a large class of nonsmooth functions. A MAP-like property that is valid for composite problems was also explored in [25]. We provide the precise connections to previous works and examples in Section 4.

Definition 5 (MAP property). A model function f(·; x̄) for f around the model center x̄ ∈ dom f ∩ int dom h satisfies the MAP property with constants L̄ > 0 and L ∈ R, if for all x ∈ dom f ∩ int dom h the following holds:

f(x; x̄) − L D_h(x, x̄) ≤ f(x) ≤ f(x; x̄) + L̄ D_h(x, x̄). (11)

Remark 6. We provide the following remarks.
• The design of a model function is independent of an algorithm. However, algorithms can be governed by the model function, for example, Model BPG in Algorithm 1. The model function property is rather an analogue of differentiability or a (uniform) first-order approximation. Note that for x̄ ∈ int dom h, the Bregman distance D_h(x, x̄) is bounded by o(‖x − x̄‖), which is a growth function. Therefore, the MAP property requires additional algorithm specific properties of the model function. In particular, we require the constants L̄ and L to be independent of x̄, which provides a global consistency between the model function approximations.
• The condition dom f ⊂ cl dom h is a minor regularity condition. For example, if dom f = [0, ∞) and dom h = (0, ∞) (e.g., for h being Burg's entropy), such a function h can still be used in the MAP property. However, the L-smad property [19] would require x, x̄ in (11) to lie in int dom h (see also Section 4.1).
• Note that the choice of L is unrestricted in the MAP property. For nonconvex f, L is typically a positive real number. For convex f, typically the condition L ≥ 0 holds true. However, note that the values of L̄, L are governed by the model function. In the context of convex additive composite problems, L < 0 can hold true for relatively strongly convex functions [46].
Example 7 (Running Example, contd.). We continue Example 4 to illustrate the MAP property. Let h(x) = ¼x⁴. We clearly have

|f(x) − f(x; x̄)| ≤ |F(x) − F(x̄) − ∇F(x̄)(x − x̄)|,

which in turn results in the following upper bound for the model error:

|f(x) − f(x; x̄)| ≤ |x⁴ − x̄⁴ − 4x̄³(x − x̄)| = 4 D_h(x, x̄).

The upper bound is obtained in terms of a Bregman distance. Clearly, the arising constants do not depend on the model center.
We now present Model BPG, which we analyze in the setting of Assumption 2.

Algorithm 1 (Model BPG).
• Initialization: Choose x⁰ ∈ dom f ∩ int dom h.
• Update step: For k ≥ 0, choose a step size τ_k > 0 such that 1/τ_k ≥ L̄ + ε_k, where ε_k ≥ ε for some fixed ε > 0, and compute

x^{k+1} ∈ argmin_{x ∈ R^N} { f(x; x^k) + (1/τ_k) D_h(x, x^k) }. (12)

Assumption 2. Let h be a Legendre function that is C² over int dom h. Moreover, the conditions dom f ∩ int dom h ≠ ∅ and crit f ∩ int dom h ≠ ∅ hold true.
(i) There exist L̄ > 0, L ∈ R such that, for any x̄ ∈ dom f ∩ int dom h, the function f with dom f ⊂ cl dom h and the model function f(·; x̄) for f around the model center x̄ satisfy the MAP property at x̄ with the constants L̄, L.
(ii) For any x̄ ∈ dom f ∩ int dom h, the qualification condition (13) holds true.
(iii) For all x, y ∈ dom f, the function (x, y) ↦ f(x; y) is a proper, lsc function and is continuous over dom f × dom f.
By ∂_x f(x; x̄) we mean the limiting subdifferential of the model function x ↦ f(x; x̄) with x̄ fixed, and ∂f(x; y) denotes the limiting subdifferential with respect to (x, y); ditto for the horizon subdifferential.
Discussion on Assumption 2. The condition dom f ⊂ cl dom h is not a restriction, as one can always add an indicator function to f such that the iterates never leave cl dom h. The qualification condition in (13) is required for the applicability of the subdifferential summation rule (see [69, Corollary 10.9]). Assumption 2(iii) and [69, Corollary 10.11] ensure that, for all x, y ∈ dom f, the condition (Assumption 2(iii)') holds true. We emphasize that Assumption 2(iii) is only required for this implication; certain classes of functions mentioned in Section 4 satisfy (Assumption 2(iii)') directly, instead of Assumption 2(iii). Assumption 2(iv) is typically satisfied in practice and plays a key role in Lemma 27. Based on Assumption 2(iii), for any fixed x̄ ∈ dom f, the model function f(x; x̄) is regular at any x ∈ dom f. Using this fact, we deduce that the model function preserves the first order information of the function, in the sense that for x ∈ dom f the condition ∂_y f(y; x)|_{y=x} = ∂f(x) holds true, which we prove in Lemma 38 in the appendix. Many popular algorithms such as Gradient Descent, the Proximal Gradient Method, the Bregman Proximal Gradient Method, and the Prox-Linear method are special cases of Model BPG, depending on the choice of the model function and the choice of Bregman distance, thus making it a unified algorithm (cf. [63]). Examples of model functions are provided in Section 4, for which we verify all the assumptions. Other related model functions can also be found in [63, Section 5].
Proof. Firstly, note that, as a consequence of the MAP property due to Assumption 2 and the nonnegativity of Bregman distances, the following condition is satisfied:

f(x; x̄) + (1/τ) D_h(x, x̄) ≥ f(x), for all x, whenever 1/τ ≥ L̄. (15)

If the set dom f ∩ dom h is bounded, the objective f(·; x̄) + (1/τ) D_h(·, x̄) is coercive. Otherwise, the coercivity of f implies that the objective f(·; x̄) + (1/τ) D_h(·, x̄) is coercive, due to (15). Then, the result follows from a simple application of [...].

We would like to highlight that Model BPG results in monotonically nonincreasing function values, which we prove below.
Proposition 9 (Sufficient Descent Property in Function Values). Let Assumptions 1, 2 hold. Also, let (x^k)_{k∈N} be a sequence generated by Model BPG. Then the following holds for k ≥ 1:

f(x^{k+1}) + ε_k D_h(x^{k+1}, x^k) ≤ f(x^k).

In particular, the function values are monotonically nonincreasing. We provide the proof of Proposition 9 in Section E in the appendix.

Global convergence analysis of Model BPG algorithm
The convergence analysis of most algorithms in nonconvex optimization is based on a descent property. Usually, the objective value is shown to decrease, for example, as in Proposition 9 and in the analysis of additive composite problems [19, Lemma 4.1]. However, function values turn out to be restrictive, primarily because the techniques used for additive composite problems do not carry over to general composite problems, and alternatives like [66] are sought after.

New Lyapunov function
Here, we discuss one of our main contributions. We propose a Lyapunov function as our measure of progress. The Lyapunov function F^h_L is given by

F^h_L(x, y) := f(x; y) + L̄ D_h(x, y), if x ∈ R^N and y ∈ dom f ∩ int dom h, (17)

and F^h_L(x, y) := +∞ otherwise. The set of critical points of the above given Lyapunov function is given by

crit F^h_L := {(x, y) ∈ R^N × R^N : 0 ∈ ∂F^h_L(x, y)}.

Usage of Lyapunov functions is a popular strategy in the analysis of inertial methods [61,51]. Even though our algorithm is noninertial in nature, we show that the above defined Lyapunov function is suitable for the global convergence analysis. Certain previous works such as [48] considered a Lyapunov function based analysis for the (noninertial) Forward-Douglas-Rachford splitting method. Also, Lyapunov function based analysis is popular in the context of dynamical systems [36].
The motivation for using the Lyapunov function F^h_L instead of the function f is the following. In each iteration of Model BPG, we optimize the model function with a proximity measure, and the analysis with our proposed Lyapunov function reflects this explicitly, unlike the function value. The proposed Lyapunov function is related to the Bregman-Moreau envelope [40] of the model function f(·; x̄), where x̄ ∈ dom f ∩ int dom h. Under a certain special case of the model function (Section 4.1), such a Bregman-Moreau envelope is related to the Bregman forward-backward envelope [1]. In the setting where the Bregman distance is the Euclidean distance, related works that consider a value function based analysis are [16,66,70].
We now look at some properties of F^h_L.

Proposition 11. The Lyapunov function defined in (17) satisfies the following properties: [...]

Proof. (i) This follows from the MAP property and the definition in (17).

Furthermore, we obtain the following: [...] The statement follows using inf_{x ∈ R^N} f(x) = v(P) > −∞, due to Assumption 1.
Equipped with the Lyapunov function F h L , we focus now on the global convergence result of Model BPG. Our global convergence analysis is broadly divided into the following five parts.
• Sufficient descent property. In Section 3.2, we show that the sequence generated by Model BPG results in monotonically nonincreasing Lyapunov function values.
• Relative error condition. In Section 3.3, based on certain additional assumptions, we show that the infimal norm of the subdifferential of the Lyapunov function can be upper bounded by an entity that depends on the difference of successive iterates, and that entity tends towards zero asymptotically, implying stationarity in the limit.
• Subsequential convergence. In Section 3.4, we explore the behavior of limit points obtained from the sequence generated by Model BPG. We prove F h L -attentive convergence along converging subsequences. Moreover, we prove that the set of F h L -attentive limit points is compact, connected and F h L is constant on this set. When all limit points of the sequence generated by Model BPG lie in int dom h, we show that all the limit points are critical points of the Lyapunov function.
• Global convergence to a stationary point of the Lyapunov function. Under the condition that the Lyapunov function satisfies the Kurdyka-Łojasiewicz property, we show in Section 3.5 that the full sequence generated by Model BPG converges to a point x such that (x, x) is a critical point of the Lyapunov function. However, the relation of x to the function f is not immediate here.
• Global convergence to a stationary point of the function. In Section 3.6, we prove that the update mapping is continuous and also show that fixed points of the update mapping are critical points of f . We exploit these properties to deduce that the full sequence of iterates generated by Model BPG converges to a critical point of f .

Sufficient descent property
We have already proved the sufficient descent property in terms of function values in Proposition 9.
Here, we prove the sufficient descent property of the Lyapunov function.
Proposition 12 (Sufficient descent property). Let Assumptions 1, 2 hold. Also, let (x k ) k∈N be a sequence generated by Model BPG; then the following holds for k ≥ 1. Proof. By the global optimality of x k+1 as in (12), combined with the inequality provided by the MAP property, the result follows from the definition of F h L in (17).
Proposition 13. Let Assumptions 1, 2 hold and let (x k ) k∈N be a sequence generated by Model BPG.
The following assertions hold: Proof. (i) The nonincreasing property follows trivially from Proposition 12 and from ε k > 0. We know from Proposition 11(iii) that the Lyapunov function is lower bounded, which implies convergence of F h L (x k+1 , x k ) k∈N to a finite value. (ii) Let n be a positive integer. Summing (20) from k = 1 to n and using ε ≤ ε k , then taking the limit as n → ∞, we obtain the first assertion, from which we immediately deduce that {D h (x k+1 , x k )} k∈N converges to zero.
(iii) From (21), after division by n, we obtain the result.
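In the Euclidean special case h = (1/2)‖·‖², the telescoping argument in (ii) can be checked numerically: for gradient descent with step size 1/L on an L-smooth function, the descent lemma gives f (x k+1 ) ≤ f (x k ) − (L/2)‖x k+1 − x k ‖ 2 , so summing yields ∑ k ‖x k+1 − x k ‖ 2 ≤ 2(f (x 0 ) − inf f )/L. A minimal sketch (the quadratic test objective is our own illustrative choice, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
Q = rng.standard_normal((N, N))
Q = Q.T @ Q + np.eye(N)          # positive definite Hessian
L = np.linalg.eigvalsh(Q).max()  # Lipschitz constant of the gradient

f = lambda x: 0.5 * x @ Q @ x    # f is L-smooth, inf f = 0
x = rng.standard_normal(N)
f0 = f(x)

vals, sq_steps = [f0], []
for _ in range(500):
    x_new = x - (1.0 / L) * (Q @ x)     # gradient step, tau = 1/L
    sq_steps.append(np.sum((x_new - x) ** 2))
    vals.append(f(x_new))
    x = x_new

# Monotone decrease of the (Lyapunov-like) values ...
assert all(a >= b - 1e-10 for a, b in zip(vals, vals[1:]))
# ... and summability of squared step lengths (cf. Proposition 13(ii)).
assert sum(sq_steps) <= 2.0 * (f0 - 0.0) / L + 1e-8
print("descent and summability verified")
```

The summability of the squared steps is exactly what forces ‖x k+1 − x k ‖ → 0, mirroring the role of ∑ k D h (x k+1 , x k ) < ∞ in the Bregman setting.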

Relative error condition
For the purposes of analysis, we require the following assumption.
Assumption 3. We have the following conditions: (ii) The function h has a bounded second derivative on any bounded subset B ⊂ int dom h.
(iii) For bounded (u k ) k∈N , (v k ) k∈N in int dom h, the following holds as k → ∞: Choose any x̄ ∈ B and consider the model function given by: The subdifferential of the model function is given by: Using the fact that ‖u‖ ≤ 1 and the definition of c, we have the following, which verifies Assumption 3(i). Now, we look at the relative error condition, which bounds the infimal norm of the subdifferential of the Lyapunov function, i.e., inf v∈∂F h L (x k+1 ,x k ) ‖v‖ , by the term ‖x k+1 − x k ‖ up to a scaling factor. Such a bound is useful to achieve stationarity asymptotically, and plays a crucial role in proving global convergence. Note that with the descent property (Proposition 12) and Assumption 3(iii), we have ‖x k+1 − x k ‖ → 0.
Lemma 15 (Relative error). Let Assumptions 1, 2, 3 hold. Let the sequence (x k ) k∈N be generated by Model BPG; then there exists a constant C > 0 such that for all k ≥ 0 we have where Proof. As per [49, Theorem 2.19], the subdifferential ∂F h L (x k+1 , x k ) is given by because the Bregman distance is continuously differentiable around x k ∈ dom f ∩ int dom h. Using [69, Corollary 10.11], Assumption 2(iv), and the fact that h is C 2 over int dom h (cf. Assumption 2), we obtain Consider the following: where in the first equality we use (23) and in the second equality we use the result in (24). The optimality of x k+1 in (12) implies the existence of ξ k+1 ∈ ∂ f (x k+1 ; x k ) such that the following condition holds: Therefore, the first block coordinate in (24) satisfies Now consider the first term of the right hand side in (25). We have where in the second step we used (28) and in the last step we applied the mean value theorem. Considering the second term of the right hand side in (25), we have where in the last step we used Assumption 3(i) and the fact that ∇ 2 h(x k ) is bounded by L h . The result follows by combining the above bounds.

Subsequential convergence
We now consider results on generic limit points and show that stationarity can indeed be attained by iterates produced by Model BPG. The set of limit points of a sequence (x k ) k∈N is denoted as follows, and its subset of f -attentive limit points is defined below. We explore below certain properties that are generic to any bounded sequence, and that are later helpful to quantify properties of the sequence generated by Model BPG.
Proposition 16. For a bounded sequence (x k ) k∈N such that x k+1 −x k → 0 as k → ∞, the following holds: The proof relies on the same technique as the proof of [17, Lemma 3.5] (also see [17,Remark 3.3]).
We now show that the sequence (x k ) k∈N generated by Model BPG indeed attains x k+1 − x k → 0 as k → ∞, which in turn enables the application of Proposition 16 to deduce properties of the sequence generated by Model BPG; these properties later prove to be crucial for the proof of global convergence.
Proposition 17. Let Assumptions 1, 2, 3 hold. Let (x k ) k∈N be a sequence generated by Model BPG. Then, we have the following: The condition ε > 0 implies that x k+1 − x k → 0 as k → ∞.
Proof. Note that the sequence (x k ) k∈N is a bounded sequence (see Remark 10). By the descent property (Proposition 12) and using ε k ≥ ε, we have after rearranging Summing on both sides, and due to the convergence of the Lyapunov function values (Proposition 13(i)), we obtain which implies (29). For ε > 0, Assumption 3(iii) together with (29) yields x k+1 − x k → 0 as k → ∞. Analyzing the full set of limit points of the sequence generated by Model BPG is difficult, as illustrated in [63]. Obtaining global convergence in this general setting is still an open problem. Moreover, the work in [63] relies on convex model functions.
To simplify the setting slightly, we restrict the set of limit points to the set int dom h. Such a choice may appear restrictive; however, when applied to many practical problems, Model BPG generates sequences with this property, as illustrated in Section 5.
To this end, we denote the following The subset of F h L -attentive (analogous to f -attentive) limit points is Also, we define ω int dom h accordingly. Proposition 18. Let Assumptions 1, 2, 3 hold. Let (x k ) k∈N be a sequence generated by Model BPG. Then, the following holds: By Assumption 3(iii) combined with the fact that x k → x along K, we have D h (x , x k ) → 0 as K ∋ k → ∞, which, together with the lower semicontinuity of f , implies the claim. As a consequence of Proposition 13 and Assumption 3(iii), and with the MAP property, we deduce f (x k+1 ) k∈K → f (x). Note that the value at the limit point is independent of the choice of the subsequence. The result follows directly and by using (i).
The following result shows that F h L -attentive convergent subsequences converge to a stationary point. Theorem 19 (Subsequential convergence to stationary points). Let Assumptions 1, 2, 3 hold. If the sequence (x k ) k∈N is generated by Model BPG, then Proof. From (22), we have inf v∈∂F h L (x k+1 ,x k ) ‖v‖ ≤ C ‖x k+1 − x k ‖ for some constant C > 0. Using x k+1 − x k → 0, convergence of (τ k ) k∈N , and Proposition 18(i) yields (31), by the closedness property of the limiting subdifferential (72).
Discussion. Subsequential convergence to a stationary point was already considered in a few works. In particular, the work in [29] already provides such a result; however, it relies on certain abstract assumptions. Even though such assumptions are valid for some practical algorithms, the authors do not consider a concrete algorithm. Moreover, their abstract update step depends on the minimization of the model function, which can require additional regularity conditions on the problem. For example, if the model function is linear, then the domain must be compact to guarantee the existence of a solution. A related line-search variant of Model BPG was considered in [63], for which subsequential convergence to a stationary point was proven. The subsequential convergence results in [63] are more general than ours, as they analyse the behavior of limit points in dom h, cl dom h, and int dom h (cf. [63, Theorem 22]). Our analysis is restricted to limit points in int dom h, as such an assumption typically holds in practice (see Section 5). Though subsequential convergence is satisfactory, proving global convergence is nontrivial in general. It is not yet clear from our work whether global convergence can be proven if the limit points lie on the boundary of dom h. Both of the above-mentioned works rely on function values to obtain a subsequential convergence result. We depart from this trend: in this paper, we rely on a Lyapunov function and obtain an even stronger result, namely global convergence of the sequence generated by Model BPG to a stationary point.

Global convergence to a stationary point of the Lyapunov function
The global convergence statement of Model BPG relies on the so-called Kurdyka-Łojasiewicz (KL) property. It has become a standard tool in recent years and is satisfied by essentially every function that appears in practice. We only state the definition here and refer to [13,15,4,17,39] for more details. The following definition is from [4].
We abbreviate the Kurdyka-Łojasiewicz property as the KL property. The function ϕ in the KL property is known as a desingularizing function. Many functions arising in practical problems satisfy the KL property, for example, semi-algebraic functions, with a desingularizing function of the following form: for some c > 0 and θ ∈ [0, 1). The KL property is crucial in order to prove the global convergence of sequences generated by many algorithms, for example PALM [17], iPALM [68], BPG [19], CoCaIn BPG [51] and many others. To simplify the analysis, we use the following uniformization lemma for the KL property from [17].

Lemma 21 (Uniformized KL property [17, Lemma 3.6]).
Let Ω be a compact set and let f : R N → R be a proper and lower semicontinuous function. Assume that f is constant on Ω and satisfies the KL property at each point of Ω. Then, there exist ε > 0, η > 0, and a continuous concave function ϕ : [0, η) → R + with ϕ(0) = 0, ϕ ∈ C 1 (0, η), and ϕ′(s) > 0 for all s ∈ (0, η), such that the KL inequality holds for all x̄ ∈ Ω and all x in the following intersection It is well known that the class of functions definable in an o-minimal structure satisfies the KL property [15, Theorem 14]. The exact definition of an o-minimal structure is given in [15, Definition 6], which we record in Section F in the appendix. Numerous functions and sets can be defined in an o-minimal structure, for example, sets and functions that are semi-algebraic or globally subanalytic. For a comprehensive discussion, we refer the reader to [15, Section 4] and [59, Section 4.5].
Assumption 4. Let O be an o-minimal structure. The functions f : The following result shows that functions definable in an o-minimal structure are closed under pointwise addition and multiplication. This is a standard result, which can, for example, be found in [59, Corollary 4.32].
Lemma 22. Let S, T ⊂ R M with S ∩ T ≠ ∅, and let f : S → R N , g : T → R N be maps that belong to O.
Then, pointwise addition and multiplication, f + g and f · g, restricted to S ∩ T belong to O.
The following result connects the KL property to functions that are definable in an o-minimal structure. The analysis in [19,51] relies on strong convexity of h. However, in our setting we relax this requirement on h via the following assumption. Note that imposing Assumption 5 is weaker than imposing strong convexity of h, as we only need the strong convexity property to hold over compact convex sets. Such a property can be satisfied even if h is not globally strongly convex, for example, by Burg's entropy (see Section 5.3).
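To illustrate the last point, consider Burg's entropy h(x) = −∑ i log x i on R N ++ : it is not globally strongly convex, but on a box B = [a, b] N its Hessian diag(1/x i 2 ) is bounded below by (1/b 2 ) I, so one may take σ B = 1/b 2 . A small numerical sketch (the box bounds are our own choice):

```python
import numpy as np

def h(x):
    return -np.sum(np.log(x))

def bregman_h(x, y):
    # D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>, with grad h(y) = -1/y
    return h(x) - h(y) + np.sum((x - y) / y)

rng = np.random.default_rng(1)
a, b, N = 0.5, 2.0, 5
sigma_B = 1.0 / b**2          # strong convexity modulus on the box [a, b]^N

for _ in range(1000):
    x = rng.uniform(a, b, N)
    y = rng.uniform(a, b, N)
    assert bregman_h(x, y) >= 0.5 * sigma_B * np.sum((x - y) ** 2) - 1e-12

# No global modulus: the ratio D_h(x, y) / (0.5 * ||x - y||^2) vanishes
# as the points move away from the origin (1-D example).
x, y = np.array([200.0]), np.array([100.0])
ratio = bregman_h(x, y) / (0.5 * (200.0 - 100.0) ** 2)
assert ratio < 1e-3
print("strong convexity on the box verified; no global modulus")
```

The first loop verifies the restricted strong convexity inequality on the box; the last check shows why no single σ works on all of R N ++ .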
Assumption 5. For any compact convex set B ⊂ int dom h, there exists σ B > 0 such that h is σ B -strongly convex over B, i.e., for any x, y ∈ B the condition D h (x, y) ≥ (σ B /2) ‖x − y‖ 2 holds. Now, we present the global convergence result of the sequence generated by Model BPG.
Moreover, the sequence (x k ) k∈N converges to x such that (x, x) is a critical point of F h L . Proof. Note that the sequence (x k ) k∈N generated by Model BPG is a bounded sequence (see Remark 10). The proof relies on Theorem 37 provided in Section B in the appendix, for which we need to verify the conditions (H1)-(H5). Due to Lemma 24, F h L satisfies the Kurdyka-Łojasiewicz property at each point of dom ∂F h L . Note that as ω int dom h (x 0 ) = ω(x 0 ) holds true, there exists a sufficiently small ε > 0 such that B̄ := {x : dist(x, ω(x 0 )) ≤ ε} ⊂ int dom h. As ω(x 0 ) is compact due to Proposition 16(i), the set B̄ is also compact. Moreover, the convex hull of B̄, denoted by B := conv B̄, is also compact, as the convex hull of a compact set is compact in the finite-dimensional setting. A simple calculation reveals that the set B lies in the set int dom h. Thus, due to Proposition 17 along with Proposition 16(ii), without loss of generality, we assume that the sequence (x k ) k∈N generated by Model BPG lies in the set B. By the definition of σ B as per Assumption 5, we obtain (H1) with a k = 1. We also have the existence of w k+1 ∈ ∂F h L (x k+1 , x k ) due to Lemma 15 such that the bound therein holds for some C > 0, which is (H2) with b = C, since the coefficients of both Euclidean distances are bounded from above. The continuity condition (H3) is deduced from a converging subsequence, whose existence is guaranteed by boundedness of (x k ) k∈N , and Proposition 18 guarantees that such convergent subsequences are F h L -attentive convergent. The distance condition (H4) holds trivially as ε k > 0 and σ B > 0. The parameter condition (H5) holds because b n = 1 in this setting, hence (b n ) n∈N ∈ ℓ 1 , and also we have sup n∈N 1/(b n a n ) = 1 < ∞ and inf n a n = 1 > 0 .
Theorem 37 implies the finite length property from which we deduce that the sequence (x k ) k∈N generated by Model BPG converges to a single point, which we denote by x. As (x k+1 ) k∈N also converges to x, the sequence ((x k+1 , x k )) k∈N converges to (x, x), which is a critical point of F h L due to Theorem 19.

Global convergence to a stationary point of the objective function
The global convergence result in Theorem 25 shows that Model BPG converges to a point, which corresponds to a critical point of the Lyapunov function. However, our goal is to find a critical point of the objective function f . We now establish the connection between critical points of the Lyapunov function and critical points of the objective function. Such a connection can later be exploited to conclude that the sequence generated by Model BPG converges to a critical point of f . Firstly, we need the following result, which establishes the connection between fixed points of the update mapping and critical points of f .

Proof.
Let x̄ ∈ dom f ∩ int dom h be a fixed point of T τ , in the sense that x̄ ∈ T τ (x̄) holds true. By the definition of T τ (x̄), the following condition holds true at x = x̄: which implies that 0 ∈ ∂f (x̄; x̄). As a consequence of Lemma 38, we have ∂f (x̄; x̄) ⊂ ∂f (x̄), thus x̄ is a critical point of the function f .
We also require the following technical result.
Lemma 27 (Continuity property). Let Assumptions 1, 2, 3 hold. Let the sequence (x k ) k∈N be bounded such that Proof. Consider any sequence (y k ) k∈N such that for any k ∈ N, the condition y k ∈ T τ k (x k ) holds true. Recall that the model function f (x; y) is continuous on its domain due to Assumption 2(iv). By optimality of y k ∈ T τ k (x k ), for any z ∈ R N we have the following: As a consequence of the boundedness of the sequence (y k ) k∈N , by the Bolzano-Weierstrass theorem there exists a convergent subsequence. Let y k → π along K ⊂ N, such that π ∈ dom f ∩ int dom h, and note that τ k → τ along K. Taking the limit on both sides of (35), using the continuity of the model function and the Bregman distance, gives which implies that π minimizes the function f (·; x̄) + (1/τ ) D h (·, x̄). This implies that π ∈ T τ (x̄) and the result follows.
The following result establishes that the sequence generated by Model BPG indeed converges to a critical point of the objective function. Proof. The sequence (x k ) k∈N generated by Model BPG under the assumptions of Theorem 25 is globally convergent; thus let x k → x and also x k+1 → x. As x k+1 ∈ T τ k (x k ) and τ k converges to τ , Lemma 27 yields x ∈ T τ (x), i.e., x is a fixed point of the mapping T τ . Then, using Lemma 26 we conclude that x is a critical point of the function f .
It is possible to deduce convergence rates for a certain class of desingularizing functions. Based on [3,17,33], we provide the following result, which gives convergence rates for the sequence generated by Model BPG.
Theorem 29 (Convergence rates). Under the conditions of Theorem 25, let the sequence (x k ) k∈N generated by Model BPG converge to x ∈ dom f ∩ int dom h, and let the Lyapunov function F h L satisfy the Kurdyka-Łojasiewicz property with the following desingularizing function: for some c > 0 and θ ∈ [0, 1). Then, we have the following: • If θ = 0, then (x k ) k∈N converges in a finite number of steps.
• If θ ∈ (0, 1 2 ], then there exist ρ ∈ [0, 1) and G > 0 such that for all k ≥ 0 we have • If θ ∈ ( 1 2 , 1), then there exists G > 0 such that for all k ≥ 0 we have Proof. Here, we consider the same notions as in the proof of Theorem 25. First, using the convexity of the function −s 1−θ we obtain where in the second inequality we used Proposition 12 along with the definition of σ B , and in the last step we used ε k ≥ ε.
Continuing the calculation and following the proof technique of [17, Theorem 3.1], using Lemma 21 with Ω = U , we deduce that there exist l ∈ N and C 1 > 0 such that for any k > l the following holds: Applying Lemma 21 with Ω = U together with Lemma 15, we deduce that there exists C 2 > 0 such that The rest of the proof is a slight modification of the proof of [3, Theorem 5].

Examples
In this section we consider special instances of (P), namely, additive composite problems and a broad class of composite problems. The goal is to quantify assumptions on these problems such that the global convergence result (Theorem 28) of Model BPG is applicable. To this end, we only consider functions that satisfy Assumption 1. Typically, the function f is made up of several components, and these components govern its behavior. Thus, it is beneficial to impose properties on the components of f under which Model BPG is applicable. Henceforth in this section we enforce the following blanket assumptions.
(B1) The function h is a Legendre function that is C 2 over int dom h. For any compact convex set B ⊂ int dom h, there exists σ B > 0 such that h is σ B -strongly convex over B. Also, h has a bounded second derivative on any bounded subset B 1 ⊂ int dom h. Moreover, for bounded (u k ) k∈N , (v k ) k∈N in int dom h, the following holds as k → ∞: (B2) The function f is coercive and additionally the following conditions hold: Note that h satisfying Assumption (B1) amounts to the same conditions on h as in Assumptions 2, 3, 5. The function f satisfies Assumption (B2), which is a consolidation of the function-specific assumptions in Assumptions 1, 2. Clearly, Assumption (B3) implies Assumption 4.

Additive composite problems
We consider the following nonconvex additive composite problem: which is a special case of (P). Additive composite problems arise in several applications, such as standard phase retrieval [19], low rank matrix factorization [50], deep linear neural networks [52], and many more. We impose the following conditions that are common in the analysis of forward-backward algorithms [61], which are used to optimize additive composite problems.
(C1) f 0 : R N → R is a proper, lsc function that is regular at any x ∈ dom f 0 . Also, the following qualification condition holds true: (C2) f 1 : R N → R is a proper, lsc function that is C 2 on an open set containing dom f 0 . Also, there exist L̄, L > 0 such that for any x̄ ∈ dom f 0 ∩ int dom h, the following condition holds true: Note that with Assumptions (C1), (C2) it is easy to deduce that Using the model function in (40) and the condition (39), we deduce that there exist L, L̄ > 0 such that for any x̄ ∈ dom f ∩ int dom h, the MAP property is satisfied at x̄ with L, L̄, as the following holds true for all x ∈ dom f ∩ dom h, thus satisfying Assumption 2(i). The condition in (41) is similar to the popular L-smad property in [19]. The main difference is that here x ∈ dom f ∩ dom h and x̄ ∈ dom f ∩ int dom h, whereas the L-smad property requires x, x̄ ∈ dom f ∩ int dom h. We illustrate this below.
Remark. We present below the Model BPG algorithm as applicable to additive composite problems. Using the model function (40) in Model BPG, we recover the BPG algorithm from [19].
BPG is Model BPG (Algorithm 1) with For h(x) = 1 2 ‖x‖ 2 , Model BPG is equivalent to the proximal gradient method. Assumptions (C1), (C2) along with (B2) imply that f is proper, lsc and lower bounded, thus satisfying Assumption 1. Considering (C1) we deduce that f 0 is regular at any x ∈ dom f 0 . Applying [69, Proposition 10.5] to f 0 , we obtain the following result: Let (x, x̄) ∈ dom f × dom f and consider the following entity: In order for the summation rule of the subdifferential ([69, Corollary 10.9]) to be applicable at (x, x̄), we need finiteness of f 0 at x and continuous differentiability of f̃ 1 (x; x̄) := f 1 (x̄) + ⟨∇f 1 (x̄), x − x̄⟩ (also see [69, Exercise 8.8]). Clearly, f 0 is finite, and f̃ 1 is finite and continuously differentiable around (x, x̄) due to Assumption (C2). Thus, using (43) and [69, Corollary 10.9] we obtain the following conditions: and as a result (Assumption 2(iii)') is satisfied. Using the condition (38) and (44), we deduce that Assumption 2(ii) is satisfied. Now, we verify Assumption 3(i). Consider a bounded subset S of dom f . For fixed x ∈ dom f and for all x̄ ∈ S we have Note that ∇f 1 is Lipschitz continuous on any bounded subset of dom f , as f 1 is C 2 on dom f . This implies that the Hessian is bounded on bounded subsets of dom f . Thus, based on the same notions as in (45), we deduce that there exists a constant M > 0 such that holds true, thus verifying Assumption 3(i). As a simple consequence of Assumptions (C1), (C2), the condition in Assumption 2(iv) is satisfied.
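As noted above, for h(x) = (1/2)‖x‖ 2 the Model BPG update with the model function (40) reduces to the classical proximal gradient step; e.g., for f 1 (x) = (1/2)‖Ax − b‖ 2 and f 0 (x) = λ‖x‖ 1 the subproblem is solved by soft-thresholding. A minimal sketch (data and parameters are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, lam = 30, 10, 0.1
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)

f1 = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # smooth part
grad_f1 = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of grad f1
tau = 1.0 / L

def soft(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(N)
for _ in range(2000):
    # Model BPG step with h = 0.5||.||^2: minimize
    # f1(x^k) + <grad f1(x^k), x - x^k> + f0(x) + (1/(2*tau)) ||x - x^k||^2
    x = soft(x - tau * grad_f1(x), tau * lam)

# First-order optimality at the limit: -grad f1(x) in lam * subdiff ||x||_1
g = grad_f1(x)
for j in range(N):
    if abs(x[j]) > 1e-10:
        assert abs(g[j] + lam * np.sign(x[j])) < 1e-6
    else:
        assert abs(g[j]) <= lam + 1e-6
print("proximal gradient stationarity verified")
```

The stationarity check at the end is exactly the condition 0 ∈ ∇f 1 (x) + ∂f 0 (x) from (Assumption 2(iii)') specialized to this instance.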

Composite problems
We consider the following nonconvex composite problem: which is a special case of the problem (P). Composite problems arise in robust phase retrieval, robust PCA, and censored Z 2 synchronization [28,41,54,30,31]. We require the following conditions.
(D1) f 0 : R N → R is a proper, lsc function that is regular at any x ∈ dom f 0 . Also, the following qualification condition holds true: (D2) g : R M → R is a Q-Lipschitz continuous and regular function. Also, there exists P > 0 such that at any x ∈ R M , the following condition holds true: (D3) F : R N → R M is C 2 over R N . Also, there exists L > 0 such that for any x̄ ∈ dom f 0 ∩ int dom h, the following condition holds true: where ∇F (x̄) is the Jacobian of F at x̄.
Note that when M = 1, g(x) = x, the problem in (46) is a special case of (37). However, for a generic g satisfying (D2), the problem in (46) cannot be captured under the additive composite problem setting given in Section 4.1. Thus, in this section we consider a separate analysis for generic composite problems in (46).
The properties (D1), (D2), (D3) along with (B2) imply that f is proper, lsc and lower bounded, thus satisfying Assumption 1. Note that with Assumptions (D1), (D2), (D3) it is easy to deduce that dom f 0 = dom f . Let x̄ ∈ dom f and consider the following model function which, when evaluated at x ∈ dom f , gives: Using (D2), (D3) we deduce that there exists L̄ := LQ > 0 such that for any x̄ ∈ dom f ∩ int dom h, the following MAP property holds at x̄ with L̄: for all x ∈ dom f ∩ dom h, as g is Q-Lipschitz continuous and (D3) holds true. Thus, Assumption 2(i) is satisfied with L̄ = L = LQ. Before we verify the other assumptions, we present Prox-Linear BPG, a specialization of Model BPG that is applicable to composite problems.
A similar statement also holds for ∂ ∞ g(F (x; x̄)), which on using ∂ ∞ F (x;x̄) g(F (x; x̄)) = {0} due to [69, Theorem 9.13] results in ∂ ∞ g(F (x; x̄)) = {0}. This further implies that the following qualification condition holds true: Using the qualification condition (53) along with [69, Corollary 10.9], we obtain the following: Thus, (Assumption 2(iii)') is satisfied. Additionally, using the condition (47) in (D1), we deduce that Assumption 2(ii) is satisfied. Now, we verify Assumption 3(i). Consider a bounded subset S of dom f . For x̄ ∈ dom f , there exists a constant M S > 0 (dependent on S) such that for all w ∈ ∂ x f (x; x̄) the following condition holds true: where we have used the boundedness of the second order derivatives of the components of F over S, as F is a twice continuously differentiable mapping, and the boundedness of the subgradients of g as per (48). As a simple consequence of Assumptions (D1), (D2), (D3), the condition in Assumption 2(iv) is satisfied.

Experiments
For the purpose of empirical evaluation, we consider several practical problems, namely, standard phase retrieval problems, robust phase retrieval problems, and Poisson linear inverse problems. We compare our algorithms with Inexact Bregman Proximal Minimization Line Search (IBPM-LS) [62], which is a popular algorithm for generic nonsmooth nonconvex problems. Before we provide the empirical results, we comment below on a variant of Model BPG based on a backtracking technique, which we used in the experiments.
Model BPG with backtracking. It is possible that the value of L̄ in the MAP property is unknown. This issue can be solved by using a backtracking technique, where in each iteration a local constant L̄ k is found such that the following condition holds: The value of L̄ k is found by taking an initial guess L̄ 0 k . If the condition (56) fails to hold, then with a scaling parameter ν > 1, we set L̄ k to the smallest value in the set {ν L̄ 0 k , ν 2 L̄ 0 k , ν 3 L̄ 0 k , . . .} such that (56) holds true. Enforcing L̄ k ≥ L̄ k−1 for k ≥ 1 ensures that after a finite number of iterations there is no change in the value of L̄ k , which takes us back to the setting analyzed in this paper. The condition L̄ k ≥ L̄ k−1 can be enforced by choosing L̄ 0 k = L̄ k−1 .
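A minimal sketch of this backtracking loop in the Euclidean special case h = (1/2)‖·‖ 2 , with the additive composite model f (x; x̄) = f 1 (x̄) + ⟨∇f 1 (x̄), x − x̄⟩; the test function f 1 (x) = (1/4)x 4 , which has no globally Lipschitz gradient, is our own illustrative choice:

```python
f1 = lambda x: 0.25 * x**4            # no globally Lipschitz gradient
grad_f1 = lambda x: x**3
D_h = lambda x, y: 0.5 * (x - y)**2   # Euclidean Bregman distance

def model(x, xbar):
    # linearized (additive composite) model function around xbar
    return f1(xbar) + grad_f1(xbar) * (x - xbar)

def bpg_step_backtracking(xbar, L_init, nu=2.0):
    """Scale L_k by nu until the MAP-type condition (56) holds at the step."""
    L_k = L_init
    while True:
        x_new = xbar - (1.0 / L_k) * grad_f1(xbar)  # argmin of model + L_k * D_h
        if f1(x_new) <= model(x_new, xbar) + L_k * D_h(x_new, xbar):
            return x_new, L_k
        L_k *= nu                                   # condition failed: retry

x, L_k = 5.0, 1.0
for _ in range(50):
    # enforcing L_k >= L_{k-1} via the initial guess L^0_k = L_{k-1}
    x_new, L_new = bpg_step_backtracking(x, L_k)
    assert L_new >= L_k and f1(x_new) <= f1(x)      # monotone L_k and descent
    x, L_k = x_new, L_new
print("final iterate:", x)
```

Note that whenever (56) holds, descent follows automatically: the candidate minimizes the model plus L_k D_h, whose value at x̄ is f 1 (x̄).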

Standard phase retrieval
The phase retrieval problem involves approximately solving a system of quadratic equations. Let b i ∈ R and let A i ∈ R N ×N be a symmetric positive semidefinite matrix, for all i = 1, . . . , M . The goal of the standard phase retrieval problem is to find x ∈ R N such that the following system of quadratic equations is satisfied: In standard terminology, the b i 's are measurements and the A i 's are so-called sampling matrices. In the context of Bregman proximal algorithms for the phase retrieval problem, we refer the reader to [19,51]. Further references regarding the phase retrieval problem include [21,73,47]. The standard technique to solve such a system of quadratic equations is to solve the following optimization problem: where R(x) is the regularization term. We consider here L1 regularization with R(x) = λ ‖x‖ 1 and squared L2 regularization with R(x) = (λ/2) ‖x‖ 2 , for some λ > 0. We consider two model functions in order to solve the problem in (58).

Model 1.
Here, the analysis falls under the category of additive composite problems given in Section 4.1, where we set the following: We consider the standard model for additive composite problems from [19], where around y ∈ R N , the model function P 0 (·; y) : R N → R at x ∈ R N is given by Consider the following Legendre function: Then, due to [19, Lemma 5.1], the following L-smad property, i.e., the MAP property, is satisfied: where In this setting, Model BPG subproblems have closed form solutions (see [19,51]).
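For the Legendre function h(x) = (1/4)‖x‖ 4 + (1/2)‖x‖ 2 , the subproblem solution can be sketched as follows: writing p = ∇h(x k ) − τ ∇f 1 (x k ), with L1 regularization one first soft-thresholds u = Soft(p, τ λ) and then inverts ∇h(x) = (‖x‖ 2 + 1)x by solving the scalar cubic s 3 ‖u‖ 2 + s − 1 = 0 and setting x k+1 = s u. The derivation of this update from the subproblem optimality condition is ours; consult [19,51] for the authoritative formulas.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inv_grad_h(u):
    """Solve (||x||^2 + 1) x = u via x = s*u, where s^3 ||u||^2 + s - 1 = 0."""
    nrm2 = np.sum(u ** 2)
    if nrm2 == 0.0:
        return u
    roots = np.roots([nrm2, 0.0, 1.0, -1.0])
    s = min(roots, key=lambda r: abs(r.imag)).real  # unique real root in (0, 1)
    return s * u

def bpg_update(p, tau, lam):
    """Minimizer of tau*lam*||x||_1 - <p, x> + h(x), h = 0.25||x||^4 + 0.5||x||^2,
    i.e. the BPG subproblem after multiplying through by tau."""
    u = soft(p, tau * lam)
    return inv_grad_h(u)

# Verify the first-order optimality condition of the subproblem:
# p in (||x||^2 + 1) x + tau*lam*subdiff ||x||_1 at the returned x.
rng = np.random.default_rng(3)
p = rng.standard_normal(8)
tau, lam = 0.5, 0.3
x = bpg_update(p, tau, lam)
t = np.sum(x ** 2)
for j in range(len(p)):
    if abs(x[j]) > 1e-12:
        assert abs((t + 1) * x[j] + tau * lam * np.sign(x[j]) - p[j]) < 1e-8
    else:
        assert abs(p[j]) <= tau * lam + 1e-8
print("subproblem optimality verified")
```

The cubic has exactly one real root because s ↦ s 3 ‖u‖ 2 + s − 1 is strictly increasing, so the inversion is well defined.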
Model 2. The importance of finding better models suited to a particular problem was emphasized in [2]. The model function provided in (59) is satisfactory; however, we would like to take advantage of the structure of the objective in (58). Taking inspiration from [2], we exploit the simple observation that the objective is nonnegative to create a new model function. We incorporate such a behavior in our second model function provided below. We use the Prox-Linear setting described in Section 4.2, where for any x ∈ R N we set the following: and for any ỹ ∈ R M we set Based on the model function (49), for fixed y ∈ R N , we consider the model function P 1 (·; y) : R N → R which, when evaluated at x ∈ R N , gives Considering the Legendre function h(x) = (1/4) ‖x‖ 4 + (1/2) ‖x‖ 2 and [19, Lemma 5.1], a simple calculation reveals that the following MAP property holds true:
Figure 1: In this experiment we compare the performance of Model BPG, Model BPG with Backtracking (denoted as Model BPG-WB), and IBPM-LS [62] on standard phase retrieval problems, with both L1 and squared L2 regularization. For this purpose, we consider the M1 model function as in (59) without the absolute sign (which is the same setting as [19]), and the M2 model function as in (61).
Model BPG with M2 (61) is faster in both the settings and Model BPG variants perform significantly better than IBPM-LS. By reg, we mean regularization.
We provide empirical results in Figure 1, where we show the superior performance of Model BPG variants compared to IBPM-LS, in particular, with the model function provided in (61). For simplicity, we choose a constant step-size τ in all the iterations, such that τ ∈ (0, 1/L 0 ). We empirically validate Proposition 12 in Figure 2. All the assumptions required to deduce the global convergence of Model BPG are straightforward to verify, and we leave this as an exercise to the reader. Note that here int dom h = R N , thus the condition ω int dom h (x 0 ) = ω(x 0 ) holds trivially.

Robust phase retrieval
Now, we consider the robust phase retrieval problem, where the goal is the same as in the standard phase retrieval problem, that is, to solve the system of quadratic equations in (57). It is well known that the L1 loss is more robust to noise than the squared L2 loss [34]. The problem in (58) uses the squared L2 loss. Here, we consider the L1-loss based robust phase retrieval problem, which involves solving the following optimization problem: where we set R(x) = λ ‖x‖ 1 (L1 regularization) or R(x) = (λ/2) ‖x‖ 2 (squared L2 regularization), for some λ > 0. Such an objective is preferred if the data obtained is noisy and we require a solution that is robust to noise. We use the Prox-Linear setting described in Section 4.2, where for any x ∈ R N we set the following: and for any ỹ ∈ R M we set We consider the following model function. For fixed y ∈ R N , the model function f (x; y) at x ∈ R N is given by With the Legendre function h(x) = (1/2) ‖x‖ 2 and as a consequence of the triangle inequality, a simple calculation reveals that the MAP property holds for all x, y ∈ R N . We use a constant step-size τ k = τ such that τ ∈ (0, 1/L 1 ). All the other assumptions of Model BPG are straightforward to verify and we leave this as an exercise to the reader. In each iteration of Model BPG, subproblems take the following form: which we solve using the Primal-Dual Hybrid Gradient (PDHG) algorithm [67]. The empirical results are reported in Figure 3, where we illustrate the better performance of Model BPG based methods compared to IBPM-LS [62] on robust phase retrieval problems. We empirically validate Proposition 12 in Figure 4. Note that here int dom h = R N , thus the condition ω int dom h (x 0 ) = ω(x 0 ) holds trivially.
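For completeness, a minimal sketch of PDHG applied to a subproblem of the above form with R = 0, i.e., min x ‖Kx + c‖ 1 + (1/(2τ ))‖x − y‖ 2 , where K stacks the gradients ∇F i (y) and c collects the constant terms of the model. The random instance below is our own illustrative setup, not the experimental code of the paper:

```python
import numpy as np

def pdhg_l1_quadratic(K, c, y, tau, iters=5000):
    """PDHG for min_x ||K x + c||_1 + (1/(2*tau)) ||x - y||^2.
    Splitting: F(Kx) + G(x) with F(u) = ||u + c||_1, so that
    prox_{sig F*}(v) = clip(v + sig * c, -1, 1)."""
    M, N = K.shape
    Lop = np.linalg.norm(K, 2)
    sig = tp = 0.9 / Lop                  # step sizes with sig*tp*||K||^2 < 1
    x = y.copy(); xbar = x.copy(); z = np.zeros(M)
    for _ in range(iters):
        z = np.clip(z + sig * (K @ xbar + c), -1.0, 1.0)  # dual step + projection
        x_old = x
        w = x - tp * (K.T @ z)
        x = (w + (tp / tau) * y) / (1.0 + tp / tau)        # prox of the quadratic
        xbar = 2 * x - x_old                               # extrapolation
    return x

rng = np.random.default_rng(4)
M, N, tau = 3, 2, 1.0
K = rng.standard_normal((M, N)); c = rng.standard_normal(M); y = rng.standard_normal(N)

obj = lambda x: np.sum(np.abs(K @ x + c)) + 0.5 / tau * np.sum((x - y) ** 2)
x = pdhg_l1_quadratic(K, c, y, tau)

# Compare against a brute-force grid search on this tiny 2-D instance.
grid = np.arange(-3.0, 3.0, 0.01)
U, V = np.meshgrid(grid, grid)
pts = np.stack([U.ravel(), V.ravel()], axis=1)
vals = np.abs(pts @ K.T + c).sum(axis=1) + 0.5 / tau * ((pts - y) ** 2).sum(axis=1)
best = vals.min()
assert obj(x) <= best + 1e-2
print("PDHG subproblem solution verified against grid search")
```

Since the quadratic term makes the subproblem strongly convex, plain PDHG converges quickly here; the grid search serves only as an independent check on this toy instance.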

Poisson linear inverse problems
We now consider a broad class of problems with varied practical applications, known as Poisson inverse problems [10,5,63,56]. The problem setting is as follows. For all i = 1, . . . , M , let b i > 0, a i ≠ 0 and a i ∈ R N + be known. Moreover, for any x ∈ R N + we have ⟨a i , x⟩ > 0, and ∑ M i=1 (a i ) j > 0 for all j = 1, . . . , N . Equipped with these notions, one can write the optimization problem of Poisson linear inverse problems as follows: where φ is the regularizing function, which is potentially nonconvex. For simplicity, we set φ = 0. The function f 1 : R N → R at any x ∈ R N is defined as follows: Note that the function f 1 is coercive. Since f 1 is a continuous function, its level set restricted to R N + , i.e., C := {x ≥ 0 : f 1 (x) ≤ f 1 (x 0 )}, is compact for any x 0 ∈ R N + . In order to apply Model BPG, we need h such that the MAP property is satisfied. We consider the Legendre function h : R N ++ → R given by where x i is the i-th coordinate of x. The above function h is also known as Burg's entropy. Consider the following lemma.
Lemma 32. Let h be defined as in (65). For L ≥ ∑_{i=1}^M b_i, the functions Lh − f₁ and Lh + f₁ are convex on R^N_{++}, or equivalently, the following L-smad property or MAP property holds true: Proof. The convexity of Lh − f₁ follows from [5, Lemma 7]. The function Lh + f₁ is convex as f₁ is convex.
When Model BPG is applied to solve (64) with h given in (65), our global convergence result is valid if the limit points of the sequence generated by Model BPG lie in int dom h. However, it is difficult to guarantee such a condition. This is because there can exist subsequences for which certain components of the iterates tend to zero. In such a scenario, some components of ∇²h(x^k) tend to ∞, which leads to the failure of the relative error condition in Lemma 15. In that case, our analysis cannot guarantee the global convergence of the sequence generated by Model BPG.
Thus, in such a scenario it is important to guarantee that the iterates of Model BPG lie in R^N_{++}. To this end, we modify the problem (64) by adding a certain constraint set, such that all the limit points lie in int dom h. Then, the global convergence of the sequence generated by Model BPG can be guaranteed. The full objective after the modification is provided below where for certain ε > 0 we denote and δ_{C_ε}(·) is the indicator function of the set C_ε. We consider φ = 0, φ(x) = λ‖x‖₁, or φ(x) = λ‖x‖₂², with certain λ > 0. Note that C_ε ⊂ R^N_+. For practical purposes, C_ε is almost the same as R^N_+ when ε is chosen sufficiently small; the choice of ε is only heuristic. To this end, with x̄ ∈ C_ε, we consider the following model function which, when evaluated at x, gives: The Legendre function in (65) is still valid as C_ε ⊂ R^N_+, and the MAP property holds true as a consequence of Lemma 32. The coercivity of the function f along with Proposition 9 implies that the iterates of Model BPG lie in the compact convex set {x : f(x) ≤ f(x⁰)}. Thus, the sequence generated by Model BPG is bounded. The analysis falls under the category of additive composite problems given in Section 4.1, where we set f₁ := f₁ and f₀(·) := δ_{C_ε}(·) + φ(·). In the discussion above, we have proved the crucial assumptions for applying Model BPG to Poisson linear inverse problems. The rest of the assumptions in Theorem 30 are straightforward to verify, and we leave them as an exercise to the reader. We now provide closed form expressions for the update step (12) in three settings of φ.
Closed form update step - No regularization. Set φ = 0. The update step of Model BPG involves solving the following subproblem: The optimality condition for the i-th component of x^{k+1}, due to Fermat's rule, is given by for some v^{k+1} ∈ N_{C_ε}(x^{k+1}). Thus, we deduce that with τ_k chosen such that 1 + τ_k ∇f(x^k)_i (x^k)_i > 0 for i = 1, . . . , N, the solution is given by where all the operations are performed element-wise.

Figure 6: By Lyapunov f.v. we mean Lyapunov function values. Under the same setting as in Figure 5, we illustrate here that Model BPG results in sequences that have monotonically nonincreasing Lyapunov function value evaluations.
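A minimal sketch of this unregularized update: solving the componentwise stationarity condition ∇f(x^k)_i + (1/τ_k)(−1/x_i + 1/(x^k)_i) = 0 of the Bregman proximal step with Burg's entropy gives x^{k+1}_i = (x^k)_i / (1 + τ_k ∇f(x^k)_i (x^k)_i), valid under the stated positivity condition. The clip to [ε, ∞) is only a heuristic stand-in for the constraint set C_ε, not the paper's exact handling of the normal cone term.

```python
import numpy as np

def bpg_update_burg(x, grad, tau, eps=1e-8):
    """One Model BPG step with Burg's entropy h(x) = -sum_i log(x_i), phi = 0.

    Componentwise: x_new_i = x_i / (1 + tau * grad_i * x_i), valid when the
    denominator is positive. The clip models C_eps heuristically (an
    assumption of this sketch).
    """
    denom = 1.0 + tau * grad * x
    assert np.all(denom > 0), "step size violates 1 + tau*grad_i*x_i > 0"
    return np.clip(x / denom, eps, None)

# sanity check: when the clip is inactive, the update satisfies the
# stationarity condition of the unconstrained subproblem
x = np.array([0.5, 1.0, 2.0])
grad = np.array([0.3, -0.1, 0.2])
x_new = bpg_update_burg(x, grad, tau=0.5)
# residual of grad + (1/tau) * (grad_h(x_new) - grad_h(x)), grad_h(y) = -1/y
residual = grad + (1 / 0.5) * (-1.0 / x_new + 1.0 / x)
```

Since 1/x_new_i = 1/x_i + τ grad_i, the stationarity residual vanishes identically, and positivity of the iterate is preserved.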
Closed form update step - L1 regularization. We consider here the standard L1 regularization setting, where with certain λ > 0 we set φ(x) = λ‖x‖₁. The update step of Model BPG involves solving the following subproblem: Based on [5, Section 5.2] and Fermat's rule, we deduce that with τ_k chosen such that 1 + τ_k (∇f(x^k)_i + λ)(x^k)_i > 0 for i = 1, . . . , N, the closed form solution is given by where all the operations are performed element-wise.

Closed form update step - L2 regularization. We consider here the standard L2 regularization setting, where with certain λ > 0 we set φ(x) = (λ/2)‖x‖₂². The update step of Model BPG involves solving the following subproblem: The optimality condition for the i-th component of x^{k+1}, due to Fermat's rule, is given by for some v^{k+1} ∈ N_{C_ε}(x^{k+1}). Based on [5, Section 5.2], we deduce that with τ_k chosen such that 1 + τ_k ∇f(x^k)_i (x^k)_i + τ_k λ ε > 0 for i = 1, . . . , N, the closed form solution is given by where all the operations are performed element-wise.
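On the nonnegative orthant, ‖x‖₁ = ∑_i x_i, so the L1 term acts as a constant shift λ of the gradient, and the update takes the same form as in the unregularized case with ∇f(x^k)_i replaced by ∇f(x^k)_i + λ. A sketch under that assumption (again with a heuristic clip standing in for C_ε):

```python
import numpy as np

def bpg_update_burg_l1(x, grad, tau, lam, eps=1e-8):
    """Model BPG step with Burg's entropy and phi(x) = lam * ||x||_1.

    Sketch under the assumption that iterates stay in the positive orthant,
    where ||x||_1 = sum_i x_i, so the L1 term shifts the gradient by lam.
    The clip to [eps, inf) is a heuristic stand-in for C_eps.
    """
    denom = 1.0 + tau * (grad + lam) * x
    assert np.all(denom > 0), "step size condition violated"
    return np.clip(x / denom, eps, None)

x = np.array([0.5, 1.0, 2.0])
grad = np.array([0.3, -0.1, 0.2])
x_new = bpg_update_burg_l1(x, grad, tau=0.5, lam=0.05)
# stationarity of <grad, x> + lam * sum(x) + (1/tau) * D_h(x, x^k) at x_new
residual = grad + 0.05 + (1 / 0.5) * (-1.0 / x_new + 1.0 / x)
```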
The empirical results are reported in Figure 5, where we illustrate the better performance of Model BPG based methods compared to IBPM-LS [62] when applied to Poisson linear inverse problems. We empirically validate Proposition 12 in Figure 6. Note that here int dom h = R^N_{++}. Based on the aforementioned closed form solutions, it is clear that the sequence generated by Model BPG lies in C_ε. The condition C_ε ⊂ int dom h implies that ω_{int dom h}(x⁰) = ω(x⁰) holds true.

Conclusion
The Bregman proximal minimization framework is prominent in solving additive composite problems, in particular via the BPG algorithm [19] or its variants [51]. However, extensions to generic composite problems remained an open problem. In this regard, building on the foundations of [29,63], we proposed the Model BPG algorithm, which is applicable to a vast class of nonconvex nonsmooth problems, including generic composite problems. Model BPG relies on a function approximation, known as a model function, which preserves first order information about the function. The model error is bounded via a suitable Bregman distance, which drives the global convergence analysis of the sequence generated by Model BPG. The analysis is nontrivial and requires significant changes compared to the standard analyses of [19,17,3,4]. Moreover, we numerically illustrated the superior performance of Model BPG on various real world applications.

A Additional preliminaries
We work with extended-valued functions f : R^N → R ∪ {+∞}. We use the notation x →_f x̄ ⇔ (x, f(x)) → (x̄, f(x̄)) for f-attentive convergence, and the notation k →_K ∞ for some K ⊂ N to represent k → ∞ with k ∈ K. The indicator function δ_C of a set C ⊂ R^N is defined by δ_C(x) = 0, if x ∈ C, and δ_C(x) = +∞, otherwise. The (orthogonal) projection of x̄ onto C, denoted proj_C(x̄), is given by a minimizer of min_{x∈C} ‖x − x̄‖. A set-valued mapping T : R^N ⇒ R^M is defined by its graph Graph T := {(x, y) ∈ R^N × R^M : y ∈ T(x)}. Following [69, Def. 8.3], we introduce subdifferential notions for nonsmooth functions. The Fréchet subdifferential of f at x̄ ∈ dom f is the set ∂̂f(x̄) of elements v ∈ R^N such that lim inf_{x→x̄, x≠x̄} (f(x) − f(x̄) − ⟨v, x − x̄⟩)/‖x − x̄‖ ≥ 0. For x̄ ∉ dom f, we set ∂̂f(x̄) = ∅. The (limiting) subdifferential of f at x̄ ∈ dom f is defined by ∂f(x̄) := {v ∈ R^N : ∃ x^k →_f x̄ and v^k → v with v^k ∈ ∂̂f(x^k)}, and ∂f(x̄) = ∅ for x̄ ∉ dom f. As a direct consequence of the definition of the limiting subdifferential, we have the following closedness property: at any x̄ ∈ dom f, if x^k →_f x̄, v^k → v, and v^k ∈ ∂f(x^k), then v ∈ ∂f(x̄). A point x̄ ∈ dom f for which 0 ∈ ∂f(x̄) is called a critical point; criticality is a necessary optimality condition (Fermat's rule [69, Thm. 10.1]) for x̄ being a local minimizer. We define the set of (global) minimizers of a function f by Argmin_{x∈R^N} f(x), and the (unique) minimizer of f by argmin_{x∈R^N} f(x) if Argmin_{x∈R^N} f(x) consists of a single element.
As shorthand, we also use Argmin f and argmin f .
In order to conveniently work with Bregman distances, we collect a few properties.
Proposition 33. Let h ∈ L and let D_h be the associated Bregman distance.
(i) D_h is strictly convex on every convex subset of dom ∂h with respect to the first argument.
(ii) For y ∈ int dom h, it holds that D h (x, y) = 0 if and only if x = y.
(iii) For x ∈ R^N and u, v ∈ int dom h, the following three point identity holds: D_h(x, u) = D_h(x, v) + D_h(v, u) + ⟨∇h(v) − ∇h(u), x − v⟩. Proof. The identity follows by expanding each Bregman distance from its definition. Associated with such a distance function is the following proximal mapping.
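Writing out each Bregman distance via D_h(a, b) = h(a) − h(b) − ⟨∇h(b), a − b⟩, the three point identity can be verified term by term:

```latex
\begin{align*}
D_h(x,u) - D_h(x,v) - D_h(v,u)
  &= \bigl(h(x) - h(u) - \langle \nabla h(u), x - u\rangle\bigr)
   - \bigl(h(x) - h(v) - \langle \nabla h(v), x - v\rangle\bigr) \\
  &\quad - \bigl(h(v) - h(u) - \langle \nabla h(u), v - u\rangle\bigr) \\
  &= \langle \nabla h(v), x - v\rangle - \langle \nabla h(u), x - v\rangle
   = \langle \nabla h(v) - \nabla h(u),\, x - v\rangle .
\end{align*}
```

All terms involving h(x), h(u), and h(v) cancel, leaving only the inner product term.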
Definition 34 (Bregman proximal mapping [7, Def. 3.16]). Let f : R^N → R and let D_h be a Bregman distance associated with h ∈ L. The D_h-prox (or Bregman proximal mapping) associated with f is defined by P^h_f(y) := argmin_{x∈R^N} {f(x) + D_h(x, y)}. In general, the proximal mapping is set-valued; however, for a convex function, the following lemma simplifies the situation.
Lemma 35. Let f : R^N → R be a proper, closed, convex function that is bounded from below, and let h ∈ L be such that dom f ∩ int dom h ≠ ∅. Then the associated Bregman proximal mapping P^h_f is single-valued on its domain and maps to dom f ∩ int dom h. Proposition 36. Let f : R^N → R be a proper, closed, convex function that is bounded from below, and let h ∈ L be such that int dom h ∩ dom f ≠ ∅. For y ∈ int dom h, x̄ = P^h_f(y), and any x ∈ dom f, the following inequality holds: Proof. See [22, Lem. 3.2].
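As a known special case, with the Euclidean Legendre function h = (1/2)‖·‖² one has D_h(x, y) = (1/2)‖x − y‖², so P^h_f reduces to the classical proximal mapping; for f = λ‖·‖₁ this is soft-thresholding. A small sketch checking this against a brute-force grid search (the grid resolution is an assumption of the check):

```python
import numpy as np

def soft_threshold(y, lam):
    # closed form of argmin_x lam*||x||_1 + 0.5*||x - y||^2 (Euclidean D_h)
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([1.5, -0.2, 0.7])
lam = 0.5
x_star = soft_threshold(y, lam)

# brute-force check on a fine grid: the objective lam*|x| + 0.5*(x - y_i)^2
# decouples per coordinate, so each coordinate can be minimized separately
grid = np.linspace(-2.0, 2.0, 4001)
grid_minimizers = np.array([
    grid[np.argmin(lam * np.abs(grid) + 0.5 * (grid - yi) ** 2)] for yi in y
])
```

The mapping is single-valued here, consistent with Lemma 35, since f is proper, closed, convex, and bounded below.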
For examples and more useful properties of Bregman functions, we refer the reader to [6,7,5,55].

B Gradient-like Descent Sequence
We briefly review the concept of a gradient-like descent sequence, which is given below. For ease of the global convergence analysis of Model BPG, we use the following results from [60]. Let F : R^N × R^P → R be a proper, lower semicontinuous function that is bounded from below, and assume that the following assumption from [60] holds.
Assumption 6 (Gradient-like Descent Sequence [60]). Let (u_n)_{n∈N} be a sequence of parameters in R^P and let (ε_n)_{n∈N} be an ℓ₁-summable sequence of non-negative real numbers. Moreover, we assume there are sequences (a_n)_{n∈N}, (b_n)_{n∈N}, and (d_n)_{n∈N} of non-negative real numbers, a non-empty finite index set I ⊂ Z and θ_i ≥ 0, i ∈ I, with ∑_{i∈I} θ_i = 1, such that the following holds: (H1) (Sufficient decrease condition) For each n ∈ N, it holds that F(x^{n+1}, u^{n+1}) + a_n d_n² ≤ F(x^n, u^n).
(H2) (Relative error condition) For each n ∈ N, the following holds (set d_j = 0 for j ≤ 0): (H3) (Continuity condition) There exists a subsequence ((x^{n_j}, u^{n_j}))_{j∈N} and (x̃, ũ) ∈ R^N × R^P such that (x^{n_j}, u^{n_j}) →_F (x̃, ũ) as j → ∞.
Such an assumption is crucial in order to obtain global convergence of the sequences generated by Model BPG. Assumption 6 is more general than the conditions that arise for a standard gradient-like descent sequence [19], which essentially relies on the first three conditions.
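As a toy illustration (not from the paper), plain gradient descent on an L-smooth function satisfies the sufficient decrease condition (H1) with d_n = ‖x^{n+1} − x^n‖ and a_n = 1/τ − L/2 > 0 for any step size τ < 2/L, by the standard descent lemma. A sketch on a hypothetical quadratic:

```python
import numpy as np

# L-smooth test function f(x) = 0.5 * x^T Q x, with L = largest eigenvalue of Q
Q = np.diag([1.0, 4.0])            # hypothetical quadratic, L = 4
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

L = 4.0
tau = 1.0 / L                       # any tau < 2/L works for (H1)
a = 1.0 / tau - L / 2.0             # sufficient decrease constant a_n

x = np.array([3.0, -2.0])
slack = []                          # F(x^n) - F(x^{n+1}) - a * d_n^2, per step
for _ in range(50):
    x_next = x - tau * grad(x)
    d = np.linalg.norm(x_next - x)
    slack.append(f(x) - f(x_next) - a * d ** 2)
    x = x_next
```

Nonnegative slack at every iteration is exactly (H1) with constant a_n = a; here the parameter sequence (u_n) and the error sequence (ε_n) of Assumption 6 play no role.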
We now provide the global convergence statement from [60], based on Assumption 6. Firstly, denote the following. The set of limit points of a bounded sequence ((x^n, u^n))_{n∈N} is given by ω(x⁰, u⁰) := lim sup_{n→∞} {(x^n, u^n)}, and the subset of F-attentive limit points is denoted by ω_F(x⁰, u⁰). Theorem 37 (Global convergence [60, Theorem 10]). Suppose F is a proper lower semicontinuous Kurdyka-Łojasiewicz function that is bounded from below. Let (x^n)_{n∈N} be a bounded sequence generated by an abstract algorithm parametrized by a bounded sequence (u^n)_{n∈N} that satisfies Assumption 6. Assume that F-attentive convergence holds along converging subsequences of ((x^n, u^n))_{n∈N}, i.e., ω(x⁰, u⁰) = ω_F(x⁰, u⁰). Then, the following holds: (i) The sequence (d_n)_{n∈N} satisfies ∑_{k=0}^∞ d_k < +∞, i.e., the trajectory of the sequence (x^n)_{n∈N} has finite length with respect to the abstract distance measures (d_n)_{n∈N}.
(ii) Suppose d_k satisfies ‖x^{k+1} − x^k‖ ≤ c̄ d_{k+k̄} for some k̄ ∈ Z and c̄ ∈ R. Then ∑_{k=0}^∞ ‖x^{k+1} − x^k‖ < +∞, i.e., the trajectory of the sequence (x^n)_{n∈N} has finite Euclidean length, and thus (x^n)_{n∈N} converges to x̃ from (H3).
(iii) Moreover, if (u^n)_{n∈N} is a converging sequence, then each limit point of ((x^n, u^n))_{n∈N} is a critical point, which in the situation of (ii) is the unique point (x̃, ũ) from (H3).

C Proof of Example 4
The model error is given by where in the second inequality we use the mean value theorem with s ∈ [0, 1], and the third inequality is a simple application of the Cauchy-Schwarz inequality. On further application of the fundamental theorem of calculus, we have ∇g(x̄ + s(x − x̄)) − ∇g(x̄) = ∫₀¹ ∇²g(x̄ + s(x − x̄))(x − x̄) ds. Using the fact that ∇²g(x) = 4‖x‖² I + 8 x xᵀ, and hence ‖∇²g(x)‖ ≤ 12‖x‖², we obtain ‖∇g(x̄ + s(x − x̄)) − ∇g(x̄)‖ ≤ 12 ∫₀¹ ‖x̄ + s(x − x̄)‖² ‖x − x̄‖ ds ≤ 12 ∫₀¹ (2‖x̄‖² + 2s² ‖x − x̄‖²) ‖x − x̄‖ ds, where in the second step we used the inequality ‖a + b‖² ≤ 2‖a‖² + 2‖b‖² for any a, b ∈ R^N. For any model center x̄ ∈ R^N, the growth function is then given by ω_{x̄}(t) = 24‖x̄‖² t² + 8 t⁴.
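Assuming, as the Hessian ∇²g(x) = 4‖x‖² I + 8 x xᵀ suggests, that g(x) = ‖x‖⁴ with the linear model f(x; x̄) = g(x̄) + ⟨∇g(x̄), x − x̄⟩, the derived bound states that the model error never exceeds ω_x̄(t) = 24‖x̄‖² t² + 8t⁴ with t = ‖x − x̄‖. A quick numerical check of this bound at random points:

```python
import numpy as np

def g(x):
    # g(x) = ||x||^4, whose gradient is not globally Lipschitz
    return np.sum(x ** 2) ** 2

def grad_g(x):
    # grad g(x) = 4 ||x||^2 x
    return 4.0 * np.sum(x ** 2) * x

def model_error(x, xbar):
    # error of the linear model f(x; xbar) = g(xbar) + <grad g(xbar), x - xbar>
    return abs(g(x) - g(xbar) - grad_g(xbar) @ (x - xbar))

def growth_bound(x, xbar):
    # omega_xbar(t) = 24 ||xbar||^2 t^2 + 8 t^4 with t = ||x - xbar||
    t = np.linalg.norm(x - xbar)
    return 24.0 * np.sum(xbar ** 2) * t ** 2 + 8.0 * t ** 4

rng = np.random.default_rng(2)
violations = 0
for _ in range(1000):
    x, xbar = rng.standard_normal(4), rng.standard_normal(4)
    if model_error(x, xbar) > growth_bound(x, xbar) + 1e-9:
        violations += 1
```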

D Model function preserves first order information
Lemma 38. Let Assumptions 1 and 2 hold true. For any x ∈ dom f, the following condition holds true: Proof. We follow the proof strategy of [64, Lemma 14]. Let x̄ ∈ dom f and let v ∈ ∂f(x̄); then, by definition, we have x̄ ∈ dom f. Using Definition 3, with f(x̄; x̄) = f(x̄), we have the following For any t > 0, note that ω_x̄(t) = o(t) as ω_x̄ is a growth function, using which we obtain This implies that v ∈ ∂f(x̄; x̄), and by regularity of f(·; x̄) we also obtain that v ∈ ∂f(x̄; x̄). For the second part of the proof, let v ∈ ∂f(x̄; x̄) with x̄ ∈ dom f, thus satisfying: Using the definition of a model function (Definition 3), we obtain , ∀x ∈ dom f, which on using the fact that ω_x̄(t) = o(t) results in

E Proof of Proposition 9
By the global optimality of x^{k+1} as in (12), we have From the MAP property, we have the following inequality: Thus, the result follows by combining (75) and (76).
Definition 39 (o-minimal structure [15, Definition 6]). An o-minimal structure on (R, +, ·) is a sequence of Boolean algebras O_N of "definable" subsets of R^N, such that for each N ∈ N