Primal subgradient methods with predefined stepsizes

In this paper, we suggest a new framework for analyzing primal subgradient methods for nonsmooth convex optimization problems. We show that the classical step-size rules, based on the normalization of subgradients or on the knowledge of the optimal value of the objective function, need corrections when they are applied to optimization problems with constraints. Their proper modifications allow a significant acceleration of these schemes when the objective function has favorable properties (smoothness, strong convexity). We show how the new methods can be used for solving optimization problems with functional constraints, with the possibility of approximating the optimal Lagrange multipliers. One of our primal-dual methods also works for unbounded feasible sets.


Introduction
Motivation. The first method for unconstrained minimization of a nonsmooth convex function was proposed in [8]. This was a primal subgradient method with constant step sizes h_k ≡ h > 0:

x_{k+1} = x_k − h_k f′(x_k),

where f′(x_k) is a subgradient of the objective function at the point x_k. In subsequent years, several strategies for choosing the steps were developed (see [1] for historical remarks and references). Among them, the most important one is the rule of the first-order divergent series,

h_k → 0,  Σ_{k≥0} h_k = +∞,

with the optimal choice h_k = O(k^{−1/2}). As a variant, it is possible to use in (2.9) the normalized directions f′(x_k)/‖f′(x_k)‖. Another alternative for the step sizes is based on the known optimal value f* [1]:

h_k = (f(x_k) − f*)/‖f′(x_k)‖². (1.4)

In both cases, the corresponding schemes, as applied to functions with bounded subgradients, have the optimal rate of convergence O(k^{−1/2}), established for the best value of the objective function observed during the minimization process [3]. The presence of simple set constraints was handled simply by applying to the minimization sequence a Euclidean projection onto the feasible set.
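As an illustration of these two classical rules, here is a minimal Euclidean sketch (the function and parameter names are ours; the two branches correspond to the normalized divergent-series rule (1.3) and the rule (1.4) based on f*):

```python
import numpy as np

def subgradient_method(subgrad, x0, num_iters, rule="divergent", f=None, f_star=None):
    """Minimal Euclidean sketch of the two classical step-size rules.
    rule="divergent": normalized direction with h_k = 1/sqrt(k+1), cf. (1.3);
    rule="polyak":    h_k = (f(x_k) - f_star)/||g_k||^2, cf. (1.4).
    Tracks the best observed value, since only it is guaranteed to converge."""
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), np.inf
    for k in range(num_iters):
        g = subgrad(x)
        norm_g = np.linalg.norm(g)
        if norm_g == 0.0:      # zero subgradient: x is optimal
            return x
        if rule == "divergent":
            x = x - g / (norm_g * np.sqrt(k + 1))
        else:                  # rule based on the known optimal value f*
            x = x - (f(x) - f_star) / norm_g**2 * g
        if f is not None and f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x if f is not None else x
```

For instance, on f(x) = |x^(1)| + |x^(2)| with f* = 0, the rule (1.4) reaches the optimum exactly in a few steps, while the divergent-series rule only approaches it slowly.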
The next important advancement in this area is related to the development of the mirror descent method (MDM) ([3]; see also [2]). In this scheme, the main information is accumulated in the dual space in the form of aggregated subgradients. For defining the next test point, this object is mapped (mirrored) to the primal space by a special prox-function related to a general norm. Thus, we get an important possibility of describing the topology of convex sets by appropriate norms.
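For concreteness, here is a hedged sketch of an MDM-style scheme with the entropy prox-function on the probability simplex (the fixed-weight aggregation form and all names are ours; the method in [3] is more general): subgradients are accumulated in the dual space, and the softmax map mirrors the aggregate back to the primal space.

```python
import numpy as np

def mirror_descent_simplex(subgrad, n, num_iters, step=0.1):
    """Sketch of mirror descent with the entropy prox-function on the
    probability simplex: subgradients are aggregated in the dual space
    and mapped back to the primal by the softmax (the mirror map for
    entropy).  Returns the average of the generated points."""
    s = np.zeros(n)                 # aggregated subgradients (dual space)
    avg = np.zeros(n)
    for k in range(num_iters):
        x = np.exp(-s - np.max(-s))  # stabilized softmax, mirror step
        x /= np.sum(x)
        s += step * subgrad(x)       # accumulate with weight h_k = step
        avg += x
    return avg / num_iters
```

On a linear objective ⟨c, x⟩ over the simplex, the averaged iterate concentrates on the coordinate with the smallest cost.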
After this discovery, for several decades the research activity in this field was concentrated on the development of dual schemes. One of the drawbacks of the classical MDM is that the new subgradients are accumulated with vanishing weights h_k. This was corrected in the framework of dual averaging [5], where the aggregation coefficients can even be increasing, and the convergence in the primal space is achieved by applying some vanishing scaling coefficients. Another drawback is related to the fact that the convergence guarantees are traditionally established only for the best values of the objective function. This inconvenience was eliminated by the development of quasi-monotone dual methods [6], where the rate of convergence is proved for all points of the minimization sequence.
Thus, at some moment, primal methods were almost forgotten. However, in this paper we are going to show that in some situations the primal schemes are very useful. Moreover, there is still room for improvement of the classical methods. Our optimism is supported by the following observations. Firstly, from the recent developments in Optimization Theory, it has become clear that the size of the subgradients of the objective function for problems with simple set constraints must be defined differently. Hence, the usual norms in the rules (1.3) and (1.4) can be replaced by more appropriate objects.
Secondly, for the important class of quasi-convex functions, linear approximations do not work properly. Hence, for the corresponding optimization problems, only primal schemes can be used. Finally, as we will see, proper primal schemes provide us with a very simple and natural possibility for approximating the optimal Lagrange multipliers for problems with functional constraints, avoiding the heavy machinery typical of methods based on the Augmented Lagrangian.
Contents. In Section 2, we present a new subgradient method for minimizing a quasi-convex function on a simple set. Its justification is based on a new concept of directional proximity measure, which is a generalization of the old technique initially presented in [4]. In this method, we apply an indirect strategy for choosing the step size, which needs the solution of a simple univariate equation. In the unconstrained case, this strategy reduces to the normalization step (1.3). The main advantage of the new method is the possibility of automatic acceleration for functions with Hölder-continuous gradients.
In Section 3, we present a method for solving a composite minimization problem with a max-type objective function. For choosing the step size, we use a proper generalization of the rule (1.4), based on the optimal value of the objective. This method admits a linear rate of convergence for smooth strongly convex functions (see Section 4). Note that a simple example demonstrates that the classical rule does not benefit from strong convexity. The method of Section 3 automatically accelerates on functions with Hölder-continuous gradients.
In Section 5, we consider a minimization problem with a single max-type constraint containing an additive composite term. For this problem, we apply a switching strategy, where the steps for the objective function are based on the rule of Section 2, and for improving feasibility we use the step-size strategy of Section 3. For controlling the step sizes, we suggest a new rule of the second-order divergent series. For bounded feasible sets, it eliminates unpleasant logarithmic factors in the convergence rate. The method automatically accelerates for problems with smooth functional components. It is interesting that the rates of convergence for the objective function and for the constraints can be different.
The remaining sections of the paper are devoted to methods which can approximate the optimal Lagrange multipliers for convex problems with functional inequality constraints. In Section 6, we consider the simplest switching strategy of this type, where for the steps with the objective function we use the rule of Section 2, and for the steps with violated constraints we use the rule of Section 3. In the method of Section 7, both steps are based on the rule of Section 2. In both cases, we obtain the rates of convergence for the infeasibility of the generated points, and an upper bound for the duality gap, computed for simple estimates of the optimal dual multipliers. Such an estimate is formed, for each violated constraint, as the sum of steps at its active iterations divided by the sum of steps at the iterations when the objective function was active.
In Section 8, we provide theoretical guarantees for our estimates of the optimal Lagrange multipliers in terms of the value of the dual function. They depend on the depth of the Slater condition for our problem. Finally, in Section 9, we present a switching method which can generate approximate dual multipliers for problems with an unbounded feasible set.

Notation.
Denote by E a finite-dimensional real vector space, and by E* its dual space, composed of linear functions on E. For such a function s ∈ E*, denote by ⟨s, x⟩ its value at x ∈ E. For measuring distances in E, we use an arbitrary norm ‖·‖; the corresponding dual norm ‖·‖* is defined in the standard way. Sometimes it is convenient to measure distances in E by a Euclidean norm ‖·‖_B, defined by a self-adjoint positive-definite linear operator B : E → E* as ‖x‖_B = ⟨Bx, x⟩^{1/2} (1.5). For a differentiable function f(·) with convex and open domain dom f ⊆ E, denote by ∇f(x) ∈ E* its gradient at a point x ∈ dom f. If f is convex, it can be used for defining the Bregman distance between two points x, y ∈ dom f (1.6). In this paper, we develop new proximal-gradient methods based on a predefined prox-function d(·), which can be restricted to a convex open domain dom d ⊆ E. This domain always contains the basic feasible set of the corresponding optimization problem. We assume that d(·) is continuously differentiable and strongly convex on dom d with parameter one (1.7). Thus, combining the definition (1.6) with inequality (1.7), we get (1.8).

2 Subgradient method for quasi-convex problems

Consider the following constrained optimization problem:

min_{x∈Q} f_0(x), (2.1)

where the function f_0(·) is closed and quasi-convex on dom f_0 and the set Q ⊆ dom f_0 is closed and convex. Denote by x* one of the optimal solutions of (2.1) and let f*_0 = f_0(x*). We assume that

x* ∈ int dom f_0. (2.2)

Let us also assume that at any point x ∈ Q it is possible to compute a vector f′_0(x) ∈ E* \ {0} satisfying condition (2.3) for any y ∈ E. In order to justify the rate of convergence of our scheme, we need the following characteristic of problem (2.1): the function µ(·) of (2.4). If y ∉ dom f_0 and r = ‖y − x*‖, then µ(r) = +∞. In view of assumption (2.2), this function is finite at least in some neighborhood of the point r = 0.
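The Bregman distance induced by a prox-function can be sketched concretely. In the snippet below the argument order β_d(x, y) = d(y) − d(x) − ⟨∇d(x), y − x⟩ is an assumption of ours, and only the two standard prox-functions are illustrated: the squared Euclidean norm (whose Bregman distance is ‖y − x‖²/2) and the entropy (whose Bregman distance on the simplex is the KL divergence).

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance beta_d(x, y) = d(y) - d(x) - <grad d(x), y - x>.
    (The argument order is our assumption for this sketch.)"""
    return d(y) - d(x) - np.dot(grad_d(x), y - x)

# d(x) = ||x||^2/2: the Bregman distance is ||y - x||^2/2.
sq = lambda x: 0.5 * np.dot(x, x)
grad_sq = lambda x: x

# Entropy on the simplex: the Bregman distance is the KL divergence.
entropy = lambda x: np.sum(x * np.log(x))
grad_entropy = lambda x: np.log(x) + 1.0
```

Both prox-functions are strongly convex with parameter one (the squared norm with respect to ‖·‖_B, the entropy with respect to the ℓ1-norm, by Pinsker's inequality).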
We say that a vector d ∈ E defines a direction in E if ‖d‖ = 1. If ⟨f′_0(x), d⟩ > 0, we call it a recession direction of the function f_0(·) at the point x ∈ Q. Using such a direction, we can define the directional proximity measure of the point x ∈ Q as follows:

Proof:
Note that the first-order optimality condition for the problem at Step b) can be written in the form (2.10). In our method, we use an indirect control of the dual step sizes {λ_k}_{k≥0}, which is based on a predefined sequence of primal step-size parameters {h_k}_{k≥0}. As we will prove in Lemma 3, the convergence of the process can be derived, for example, from the standard conditions (2.12). Then, inequality (2.10) justifies the choice of the dual step-size parameter λ_k as a solution of the following equation:

and equation (2.13) gives us λ
In this case, method (2.9) coincides with the classical variant of the subgradient method. However, for Q ≠ E, the rule (2.13) allows a proper scaling of the step size near the boundary of the feasible set. ✷

At each iteration of method (2.9), the corresponding value of λ_k can be found by an efficient one-dimensional search procedure based, for example, on a Newton-type scheme. Since the latter scheme has local quadratic convergence, we make the plausible assumption that it is possible to compute an exact solution of equation (2.13). This solution has an important interpretation in view of (2.8). Our complexity bounds follow from the rate of convergence of the following values:

Lemma 3 Let condition (2.13) be satisfied. Then for any N ≥ 0, inequality (2.16) holds. Summing up the intermediate inequalities for k = 0, . . ., N, we obtain inequality (2.16). ✷

Let us look now at one example of the rate of convergence of method (2.9) for an objective function from a nonstandard problem class. For simplicity, let us measure distances by the Euclidean norm ‖x‖_B (see (1.5)) and choose d(·) accordingly. Let us assume that the function f_0(·) in problem (2.1) is p times continuously differentiable on E and its pth derivative is Lipschitz-continuous with constant L_p. Then we can bound the function µ(·) in (2.4), with all norms for derivatives induced by ‖·‖_B. Let us fix the total number of steps N ≥ 1 and assume that a bound R_0 ≥ ‖x_0 − x*‖_B is available. Then, defining the step sizes by (2.19) and using inequality (2.18), we obtain the corresponding rate. Note that the first p coefficients in estimate (2.16) depend on the local properties of the objective function at the solution, and only the last term employs the global Lipschitz constant for the pth derivative. Clearly, we do not need to know the bounds for all these derivatives in order to define the step-size strategy (2.19).
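Since equation (2.13) is univariate, the search for λ_k can be sketched generically. The snippet below is our own illustration under the assumption that the left-hand side φ is continuous, nondecreasing, and vanishes at zero; it uses bisection for robustness, while the Newton-type scheme mentioned above would converge locally quadratically.

```python
def solve_step_equation(phi, h, lam_max=1.0, tol=1e-10):
    """Bisection for the univariate step-size equation phi(lam) = h,
    assuming phi is continuous and nondecreasing with phi(0) = 0."""
    # Grow the bracket until phi(lam_max) >= h.
    while phi(lam_max) < h:
        lam_max *= 2.0
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < h:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, for φ(λ) = λ² and h = 4, the procedure returns λ = 2.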
3 Step-size control for max-type convex problems

Let us consider now the following problem of composite optimization (3.1), where ψ(·) is a simple closed convex function and f(x) = max_{1≤i≤m} f_i(x), with all f_i(·), 1 ≤ i ≤ m, being closed and convex on dom ψ. Denote by ℓ_f(x; ·) the linearization of the function f(·) at x, and by x* ∈ dom ψ an optimal solution of this problem. Similarly to (2.7), let us define the univariate functions φ_x(λ), λ ≥ 0, for x ∈ dom ψ, by (3.3). The unique optimal solution of the corresponding optimization problem is denoted by T_x(λ). Note that the function φ_x(·) is convex and continuously differentiable, with derivative determined by T = T_x(λ). Thus, φ_x(0) = 0, T_x(0) = x, and φ′_x(0) = 0. Hence, φ_x(λ) ≥ 0 for all λ ≥ 0. Let us prove the following variant of Lemma 2.

Proof:
In view of the first-order optimality condition for the minimization problem in (3.3), valid for all x ∈ dom ψ, we obtain the required inequalities.

In this section, we analyze the following optimization scheme.
Suppose that the optimal value F* is known. In view of the convexity of the function f(·), for the points {x_k}_{k≥0} generated by method (3.7) we obtain a bound that explains the following step-size strategy (3.9). This strategy has a natural optimization interpretation.
Lemma 5 Let λ_k be defined by (3.9). Then (3.10) holds. Lemma 5 has two important consequences, allowing us to estimate the rate of convergence of method (3.7) with the step-size rule (3.9): namely, inequalities (3.11) and (3.12), valid for any k ≥ 0. Note that method (3.7) is not monotone. However, inequality (3.11) gives us a global rate of convergence for the characteristic (3.13), where r_0 = β_d(x_0, x*).
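In the Euclidean setting, the operator T_x(λ) has a familiar concrete form. The sketch below is an assumption-laden illustration of ours: it takes β(x, y) = ‖y − x‖²/2, so that the minimizer of λ[⟨f′(x), y⟩ + ψ(y)] + β(x, y) reduces to the proximal operator of λψ applied to a gradient step (the operator in the text uses a general prox-function).

```python
import numpy as np

def composite_prox_step(x, grad, lam, psi_prox):
    """Euclidean sketch of T_x(lambda): minimize
    lam * (<grad, y> + psi(y)) + ||y - x||^2 / 2 over y,
    which equals prox_{lam*psi}(x - lam*grad)."""
    return psi_prox(x - lam * grad, lam)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

For example, with ψ = ‖·‖_1, the step is a gradient step followed by soft-thresholding.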
Theorem 1 Let all functions f_i(·) have Hölder-continuous gradients with parameters ν ∈ [0, 1] and L_ν ≥ 0 (condition (3.14)). Then, for the step-size rule (3.9) in method (3.7), the corresponding bound holds for any k ≥ 0. It remains to use inequality (3.13).

✷
Note that the step-size strategy (3.9) does not depend on the Hölder parameter ν in condition (3.14). Hence, the number of iterations of method (3.7), (3.9) ensuring an ǫ-accuracy in function value is bounded from above by the corresponding quantity. To conclude the section, let us show that the step-size rule (3.9) can behave much better than the classical one.
Example 2 demonstrates the difference between the two rules: the classical rule, applied at the point x_k, makes no progress toward the optimum, while the rule (3.10), applied to the same point, defines x_{k+1} as an intersection of two lines, and we get a linear rate of convergence to the optimal point x* = 0. ✷

4 Step-size control for problems with smooth strongly convex components

The linear rate of convergence demonstrated by method (3.7) in Example 2 provides us with the motivation to look at the behavior of this method on smooth and strongly convex problems. Let us introduce a Euclidean metric by (1.5). Suppose that all functions f_i(·), i = 1, . . ., m, in (3.2) are continuously differentiable and have Lipschitz-continuous gradients with the same constant L_f ≥ 0 (condition (4.1)). In the notation of Section 3, these inequalities imply an upper bound for the objective function of problem (3.1). Hence, for the sequence of points {x_k}_{k≥0} generated by method (3.7), we can prove the following statement.
Theorem 2 Under condition (4.1), the rate of convergence of method (3.7) can be estimated as follows. If, in addition, the functions f_i(·), i = 1, . . ., m, are strongly convex with parameter µ_f > 0 (condition (4.5)), then the rate of convergence is linear:
If, in addition, conditions (4.5) are satisfied, then the bound can be strengthened. Note that the first-order optimality conditions for problem (3.1) can be written in the corresponding form; therefore, inequality (4.8) implies the bound in the norm ‖·‖_B, and we get inequality (4.6). ✷
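The linear rate of Theorem 2 is easy to observe numerically in the simplest Euclidean setting. The sketch below is ours and uses the Polyak-type quotient driven by the known optimal value; the rule (3.9) in the text is its prox-function generalization.

```python
import numpy as np

def value_driven_method(f, subgrad, f_star, x0, num_iters):
    """Subgradient method driven by the known optimal value:
    x_{k+1} = x_k - (f(x_k) - f_star)/||g_k||^2 * g_k."""
    x = np.asarray(x0, dtype=float)
    vals = []
    for _ in range(num_iters):
        g = subgrad(x)
        gg = np.dot(g, g)
        if gg == 0.0:
            break
        x = x - (f(x) - f_star) / gg * g
        vals.append(f(x))
    return x, vals

# Smooth strongly convex example: f(x) = ||x||^2/2 with f* = 0.
f = lambda x: 0.5 * np.dot(x, x)
g = lambda x: x
x, vals = value_driven_method(f, g, 0.0, np.array([3.0, -4.0]), 30)
# Each step halves x, so f decreases by the factor 1/4 per iteration.
```

The recorded values decrease geometrically, in contrast to the O(k^{−1/2}) behavior of the normalized classical rule on the same problem.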

Convex minimization with max-type composite constraint
Let us show that both step-size strategies described in Sections 2 and 3 can be unified in one scheme for solving constrained optimization problems. In this section, we deal with the problem in the semi-composite form (5.1), where ψ(·) is a simple closed convex function, Q ⊆ dom ψ is a closed convex set, and all functions f_i(·), i = 0, . . ., m, are closed and convex on dom ψ. We assume Q to be bounded (condition (5.2)). In order to solve problem (5.1), we propose a method which combines two different types of iterations: one of them improves the feasibility of the current point, and the other improves its optimality.
An iteration of the first type is based on the machinery developed in Section 3 with the particular value F* = 0. It is applied to some point x_k ∈ Q (see (5.3)). Note that for F(x_k) ≤ 0, we have λ_k = 0 and T_{x_k}(λ_k) = x_k. For an iteration k of the second type, we need to choose a primal step-size bound h_k > 0. Then, at the test point y_k ∈ Q, we define the function ϕ_{y_k}(·) by (2.7) and apply the rule (5.4). Since in both rules the parameters λ_k are functions of the test points, we will use the shorter notations T(y_k) and T(x_k). Consider the following optimization scheme.

Double-Step Subgradient Method for Semi-Composite Problem
Initialization. Choose x_0 ∈ dom ψ and a sequence of step bounds H = {h_k}_{k≥0}. (5.5)

Thus, method (5.5) is defined by a sequence of primal step bounds H = {h_k}_{k≥0}. However, since in (5.5) we apply a switching strategy, it is impossible to say in advance what the type of a particular kth iteration will be. Therefore, as compared with the classical conditions (2.12), we need additional regularity assumptions on H.
It will be convenient to relate this sequence to another sequence of scaling coefficients T = {τ_k}_{k≥0} satisfying the second-order divergence condition (5.6) (compare with (2.12)); in particular, Σ_k τ_k = +∞, so this condition is stronger than (2.12). In order to transform the sequence T into the convergence rate of some optimization process, we need to introduce the following characteristic.

Second-Order Divergence Condition: a) τ_k ≥ τ_{k+1} > 0 for any k ≥ 0; b) the second-order sum of the coefficients τ_k diverges (condition (5.6)).
Definition 1 For a sequence T, the integer-valued function a(k), k ≥ 0, is called the divergence delay (of degree two) if a(k) ≥ 0 is the minimal integer value satisfying condition (5.8). Then condition (5.6)a is valid. b) For the sequence (5.9), we have S_1 = 2 and S_2 = 3. Let us analyze the performance of method (5.5) with an appropriately chosen sequence H.

Namely, let us choose
where the sequence T satisfies condition (5.6) and D is taken from (5.2). We are interested only in the values of the objective function f_0(·) computed at the points with small values of the functional constraint F(·). As we will see, these points are involved in Step 2b). For the total number of steps N ≥ 1 + a(0), define k(N) by (5.11) and the corresponding directional proximity measures. We are interested in the rate of convergence to zero of the following characteristic:

Theorem 3 For any N ≥ 1 + a(0), the number k(N) ≥ 0 is well defined and δ*_N < h_{k(N)}. Moreover, if all functions f_i(·), i = 1, . . ., m, have Hölder-continuous gradients on Q with parameter ν ∈ [0, 1] and constant L_ν > 0, then, for any k ∈ F_N, the bound (5.12) holds.

Proof:
Let us bound the distances between consecutive points. Summing up these inequalities for k = k(N), . . ., N − 1 and using (5.7) and (5.10) with F* = 0, we obtain inequality (5.12). ✷

As a straightforward consequence of Theorem 3, we have the corresponding rate of convergence (5.13) in function value, where the function µ(·) is defined by (2.4). Thus, the actual rate of convergence of method (5.5) depends on the rate of convergence of the sequence T and the magnitude of the divergence delay. For example, for the choice (5.9), in view of the inequality a(k) ≤ k − 1, taking N = 2M ensures k(N) ≥ M = N/2, and we obtain (5.14). It is interesting that the rate of convergence (5.12) for the constraints can be higher than the rate (5.13) for the objective function.
6 Approximating Lagrange Multipliers, I

In this section, we consider a simpler switching strategy for solving convex optimization problems with potentially many functional constraints. Our method is also able to approximate the corresponding optimal Lagrange multipliers. Consider the constrained optimization problem (6.1), where all functions f_i(·), i = 0, . . ., m, are closed and convex, and the set Q is closed, convex, and bounded. Denote by x* one of its optimal solutions. We assume that the functions f_i(·) are subdifferentiable on Q and that it is possible to compute their subgradients with uniformly bounded norms (condition (6.2)). For the set Q, we assume the existence of a prox-function d(·) defining the corresponding Bregman distance β_d(·, ·). In our methods, we need to know a constant D satisfying (6.3). Our first method is based on the two operations presented in Sections 2 and 3. For defining iterations of the first type, we need the functions ϕ_{i,x}(λ), λ ≥ 0, parameterized by x ∈ Q and defined by (6.5). An appropriate value of λ can be found from equation (6.6) and used for setting x_{k+1} = T_{i,x_k}(λ), the optimal solution of problem (6.5). In this section, we perform this iteration only for the objective function (i = 0). The possibility of using the steps T_{i,x_k}(λ) for the inequality constraints is analyzed in Sections 7 and 9. An iteration of the second type tries to improve the feasibility of the current point x_k ∈ Q (compare with (5.3)). It needs the computation of all Bregman projections T̃_i(x_k), i = 1, . . ., m, defined by (6.7). The first-order optimality condition for problem (6.7) involves T = T̃_i(x_k) and the optimal Lagrange multiplier λ = λ_i(x_k) ≥ 0 for the linear inequality constraint in (6.7). We assume that this multiplier can also be computed. Consider the following optimization scheme.

3.
Else, set i_k = 0, compute λ_k by (6.6), and set x_{k+1} = T_{x_k}(λ_k). (6.9)

Let us prove that method (6.9) can find an approximate solution of the primal-dual problem (6.1), (6.4). Let us choose a sequence T satisfying condition (5.6) and define H by (6.10), where D satisfies inequality (6.3). Then, for the number of steps N ≥ 1 + a(0), define the function k(N) by the first equation in (5.11). Now we can define the objects (6.11). Clearly, the vectors of dual multipliers λ*(N) = (λ*_1(N), . . ., λ*_m(N)) are defined only if A_0(N) ≠ ∅. As usual, a sum over an empty set of iterations is assumed to be zero.
Theorem 4 Let the sequence of step bounds H in method (6.9) be defined by (6.10). Then for any N ≥ 1 + a(0), we have A_0(N) ≠ ∅. Moreover, if all functions f_i(·), i = 0, . . ., m, satisfy (6.2), then the bounds (6.12) and (6.13) hold.

Proof: Let us fix some x ∈ Q and denote r_k(x) = β_d(x_k, x). Then, for k ∈ A_0(N), we can bound the decrease of r_k(x) using the first-order optimality condition for problem (6.5). Let us assume now that A_0(N) = ∅. Then B_N < 0, which is impossible since the point x* is feasible. Thus, we have proved that A_0(N) ≠ ∅.
Further, for any k ∈ A_0(N) we have ‖x_{k+1} − x_k‖ > 0, and we conclude that B_N ≤ M_0 σ_0(N) h_{k(N)}. Dividing this inequality by σ_0(N) > 0, we get the bound (6.13).
Finally, note that for the iterations outside A_0(N) we have x_{k+1} = T̃_i(x_k), and the corresponding bound yields inequality (6.12). ✷

Note that for the choice of the scaling sequence (5.9), method (5.5) is globally convergent. In this case, the computation of the Lagrange multipliers by (6.11) for all values of N requires the storage of all coefficients {λ_k}_{k≥0}. This inconvenience can be avoided if we decide to accumulate the sums for the Lagrange multipliers only starting from the moments k(N_q) with N_q = 2^q, q ≥ 1. Then method (5.5) will be allowed to stop only at the moments 2N_q.
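The switching idea together with the multiplier estimates can be illustrated by a hedged Euclidean sketch (all names are ours; the methods in the text use Bregman projections and the step rules of Sections 2 and 3, while this plain projected version only conveys the mechanism): when some constraint is violated we step on it, otherwise we step on the objective, and each dual multiplier is estimated as the sum of steps on that constraint divided by the sum of steps on the objective.

```python
import numpy as np

def switching_subgradient(f0, g0, fs, gs, project, x0, num_iters, eps):
    """Projected switching subgradient sketch with dual-multiplier estimates.
    fs/gs: lists of constraint functions and their subgradients."""
    x = np.asarray(x0, dtype=float)
    m = len(fs)
    sum_obj, sum_con = 0.0, np.zeros(m)
    best_x, best_val = None, np.inf
    for k in range(num_iters):
        viol = [i for i in range(m) if fs[i](x) > eps]
        if viol:
            i = max(viol, key=lambda j: fs[j](x))   # most violated constraint
            g = gs[i](x)
            h = fs[i](x) / np.dot(g, g)             # step toward feasibility
            sum_con[i] += h
        else:
            g = g0(x)
            h = 1.0 / (np.sqrt(k + 1) * np.linalg.norm(g))
            sum_obj += h
            if f0(x) < best_val:                    # record best near-feasible point
                best_x, best_val = x.copy(), f0(x)
        x = project(x - h * g)
    lam = sum_con / sum_obj if sum_obj > 0 else None
    return best_x, lam
```

On a small test problem (minimize x^(1) + x^(2) over the box [0, 2]² subject to 1 − x^(1) ≤ 0), the ratio of accumulated steps approaches the true multiplier λ* = 1.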
7 Approximating Lagrange Multipliers, II

Method (6.9) has one hidden drawback, related to the iterations (6.7) with i_k ≥ 1. This scheme most probably generates infeasible approximations of the optimal point, which violate some of the functional constraints. In order to avoid this tendency, we propose a scheme which uses, for both types of iterations (improving either feasibility or optimality), the same step-size rule (6.6).

Fixed-Step Switching Subgradient Method for Problem (6.1)

Initialization. Choose x_0 ∈ Q and a sequence of step bounds
Thus, for both types of iterations, we use the same step-size strategy (6.6). Note that for any i = 0, . . ., m and T_i = T_{i,x_k}(λ_{i,k}), we have the corresponding optimality condition; hence, since T_i ≠ x_k, the required bound follows. In method (7.1), we choose H in accordance with (6.10). Then, for N ≥ 1 + a(0), we define the function k(N) by the first equation in (5.11) and introduce by (6.11) the approximations of the optimal Lagrange multipliers.

Proof:
As in the proof of Theorem 4, for x ∈ Q we denote r_k(x) = β_d(x_k, x); for k ∈ A_0(N), the same bound as before holds. Note that the first-order optimality condition for problem (2.7) involves its optimal solution T_i. Therefore, for all x ∈ Q and k ∈ A_0(N), we conclude, as in the proof of Theorem 4, that the corresponding inequality holds. Since A_0(N) ≠ ∅ (see the proof of Theorem 4), dividing this inequality by σ_0(N) > 0, we get inequality (7.4). Finally, the bounds for all k ∈ A_0(N) and i = 1, . . ., m yield inequality (7.3). ✷

Accuracy guarantees for the dual problem
In Sections 6 and 7, we developed two convergent methods, (6.9) and (7.1), which are able to approach the optimal solution of the primal problem (6.1), generating in parallel an approximate solution of the dual problem (6.4). Indeed, for N ≥ 1 + a(0), denote the corresponding point x*_N. Then, in view of Theorems 4 and 5, we obtain inequality (8.1). Since φ(λ*(N)) ≤ f*_0, this inequality justifies that the point x*_N is a good approximate solution to the primal problem (6.1).
However, note that in our reasoning we have not yet assumed the existence of an optimal solution of the dual problem (6.4). It turns out that, under our assumptions, such a solution may fail to exist. In this case, since f*_0(N) can be significantly smaller than f*_0, inequality (8.1) cannot guarantee that the vector λ*(N) delivers a good value of the dual objective function.
In this case, L(x, λ) = x^(2) + λ(1 − x^(1)). Hence, there is no duality gap: φ* := sup_{λ≥0} φ(λ) = 0. However, the optimal dual solution λ* does not exist. Let us look now at the perturbed feasible set, where ǫ > 0 is sufficiently small. Note that it contains a point with the second coordinate equal to φ* − ǫ(2 − ǫ). This means that condition (8.1) can guarantee only a weaker bound. Hence, for dual problems with nonexistent optimal solutions, we can expect a significant drop in the quality of approximation in terms of the function value. ✷

Thus, in our complexity bounds, we need to take into account the size of an optimal dual solution λ* ∈ R^m_+. Let the sequence {x_k}_{k≥0} be generated by one of the methods (6.9) or (7.1). Then, for any k ≥ 0, inequality (8.2) holds. Thus, we have proved the following theorem.
Theorem 6 Under the conditions of Theorem 4 or 5, for all N ≥ 1 + a(0) and all k ∈ A_0(N), the bound (8.3) holds.

Proof: Indeed, in view of inequalities (6.12) and (7.3), we get the bound (8.3) from (8.1) and (8.2). ✷

Recall that the size of the optimal dual multipliers can be bounded under the standard Slater condition. Namely, let us assume the existence of a point x̄ ∈ Q such that f_i(x̄) < 0, i = 1, . . ., m. Then, in accordance with Lemma 3.1.21 in [7], we obtain the corresponding bound, and the Slater condition provides us with an estimate valid for all N ≥ 1 + a(0). Note that we are able to compute the vector λ*(N) without computing values of the dual function φ(·), which can be very expensive. In fact, the computational complexity of a single value φ(λ) can be of the same order as the complexity of solving the initial problem (6.1), or even higher.
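For reference, the Slater-type bound on the dual multipliers can be rendered as follows (our rendering of the standard result, cf. Lemma 3.1.21 in [7]; x̄ denotes the Slater point):

```latex
% Standard bound on the optimal dual multipliers under the Slater condition
% (\bar{x} is the Slater point with f_i(\bar{x}) < 0, i = 1, ..., m):
\sum_{i=1}^{m} \lambda_i^*
  \;\le\;
  \frac{f_0(\bar{x}) - f_0^*}
       {\min\limits_{1 \le i \le m}\bigl(-f_i(\bar{x})\bigr)}.
```

Thus, the deeper the Slater point (the larger the constraint margins −f_i(x̄)), the smaller the admissible size of λ*, which is exactly the "depth of the Slater condition" mentioned in the Introduction.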
9 Subgradient method for unbounded feasible set

In the previous sections, we studied optimization methods applicable to bounded feasible sets (see condition (6.3)). If this condition does not hold, the second-order divergence condition (5.6) cannot help, and we need to find another way of justifying the efficiency of subgradient schemes. This is the goal of the current section. We are still working with problem (6.1), satisfying condition (6.2). However, the set Q is not bounded anymore; hence, we cannot count on Sion's theorem for the dual problem (6.4).
In our method, we use a sequence of scaling coefficients Γ = {γ_k}_{k≥0} satisfying condition (9.1), a tolerance parameter ǫ > 0, and a rough estimate D_0 > 0 for the distance to the optimum.
For functions ϕ i,y (•) defined by (6.5) with y ∈ Q, denote by a = a i,k (y) the unique solution of the equation Let us look at the following optimization scheme.
3. Compute a_k = a_{i_k,k}(y_k) by (9.2), and set x_{k+1} = T_{i_k,y_k}(a_k γ_{k+1}). (9.3)

It seems that now the selection of the violated constraint at Step 2 looks more natural than in the methods (6.9) and (7.1). For x ∈ Q, denote ∆_k(x) = γ_k β_d(x_k, x) − β_d(x_0, x); this is a linear function of x. In what follows, we assume that the Bregman distance is convex with respect to its first argument, and this yields inequality (9.5). ✷ Since ∆_0(x) = 0, we can sum up the inequalities (9.5) for k = 0, . . ., N − 1 with N ≥ 1, and get the consequence (9.6), with right-hand side γ_N(β_d(x_0, x) + D_0), which is valid for all x ∈ Q.
In order to approximate the optimal Lagrange multipliers of problem (6.4), we need to introduce the following objects. For our convergence result, we need to assume the existence of an optimal solution x* of problem (6.1). Let us introduce a bound D ≥ β_d(x_0, x*), which is not used in method (9.3). Denote Q_D = {x ∈ Q : β_d(x_0, x) ≤ D}. Clearly, if we replace in problem (6.1) the set Q by Q_D, then its optimal solution does not change. However, now we can correctly define the restricted dual function φ_D(λ) = min_{x ∈ Q_D} L(x, λ).

Clearly, condition (5.6)b ensures that all values a(k), k ≥ 0, are well defined. At the same time, from (5.6)a, we have a(k + 1) ≥ a(k) for any k ≥ 0. Let us give two important examples of such sequences.

Example 3 a) Let us fix an integer N ≥ 0 and define

Example 4
Consider the following problem: min_{x ∈ R²}