Bregman Three-Operator Splitting Methods

The paper presents primal–dual proximal splitting methods for convex optimization, in which generalized Bregman distances are used to define the primal and dual proximal update steps. The methods extend the primal and dual Condat–Vũ algorithms and the primal–dual three-operator (PD3O) algorithm. The Bregman extensions of the Condat–Vũ algorithms are derived from the Bregman proximal point method applied to a monotone inclusion problem. Based on this interpretation, a unified framework for the convergence analysis of the two methods is presented. We also introduce a line search procedure for stepsize selection in the Bregman dual Condat–Vũ algorithm applied to equality-constrained problems. Finally, we propose a Bregman extension of PD3O and analyze its convergence.


Introduction
We discuss proximal splitting methods for optimization problems of the form

minimize  f(x) + g(Ax) + h(x),    (1)

where f, g, and h are convex functions, and h is differentiable. This general problem covers a wide variety of applications in machine learning, signal and image processing, operations research, control, and other fields [11,19,31,40]. In this paper, we consider proximal splitting methods based on Bregman distances for solving (1) and some interesting special cases of (1). Recently, several primal-dual first-order methods have been proposed for the three-term problem (1): the Condat-Vũ algorithm [20,50,53], the primal-dual three-operator (PD3O) algorithm [51], and the primal-dual Davis-Yin (PDDY) algorithm [44]. Algorithms for some special cases of (1) are also of interest. These include the Chambolle-Pock algorithm, also known as the primal-dual hybrid gradient (PDHG) method [10,12] (when h = 0), the Loris-Verhoeven algorithm [15,23,34] (when f = 0), the proximal gradient algorithm (when g = 0), and the Davis-Yin splitting algorithm [21] (when A = I). All these methods handle the nonsmooth functions f and g via the standard Euclidean proximal operator.
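As a concrete instance of this template (anticipating the numerical experiment of Section 8), one can take f to be the indicator of an affine set H, g a scaled ℓ1-norm composed with a difference matrix A, and h a least-squares term:

minimize  δ_H(x) + λ‖Ax‖₁ + (1/2)‖Cx − b‖².

The nonsmooth terms f and g(A·) are handled through (possibly Bregman) proximal operators, and the smooth term h through its gradient.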
To further improve the efficiency of proximal algorithms, proximal operators based on generalized Bregman distances have been proposed and incorporated in many methods [2,3,6,14,24,27,35,46,48]. Bregman distances offer two potential benefits. First, the Bregman distance can help build a more accurate local optimization model around the current iterate. This is often interpreted as a form of preconditioning. For example, diagonal or quadratic preconditioning [29,33,41] has been shown to improve the practical convergence of PDHG, as well as the accuracy of the computed solution [1]. As a second benefit, a Bregman proximal operator of a function may be easier to compute than the standard Euclidean proximal operator, and therefore reduce the complexity per iteration of an optimization algorithm. Recent applications of this kind include optimal transport problems [16], optimization over nonnegative trigonometric polynomials [13], and sparse semidefinite programming [30].
Extending standard proximal methods and their convergence analysis to Bregman distances is not straightforward, because some fundamental properties of the Euclidean proximal operator no longer hold for Bregman proximal operators. An example is the Moreau decomposition, which relates the (Euclidean) proximal operators of a closed convex function and its conjugate [37]. Another example is the simple relation between the proximal operator of a function g and that of the composition g(Ax) with a linear function when AA^T is a multiple of the identity; see, e.g., [4,19]. This composition rule is used in [39] to establish the equivalence between some well-known first-order proximal methods for problem (1) with A = I and with general A.
The purpose of this paper is to present new Bregman extensions and convergence results for the Condat-Vũ and PD3O algorithms. The main contributions are as follows.
• The Condat-Vũ algorithm [20,50] exists in a primal and a dual variant. We discuss extensions of the two algorithms that use Bregman proximal operators in the primal and dual updates.
The Bregman primal Condat-Vũ algorithm first appeared in [12, Algorithm 1], and is also a special case of the algorithm proposed in [52] for a more general convex-concave saddle point problem. We give a new derivation of this method and its dual variant, by applying the Bregman proximal point method to the primal-dual optimality conditions. Based on this interpretation, we provide a unified framework for the convergence analysis of the two variants, and show an O(1/k) ergodic convergence rate, which is consistent with previous results for Euclidean proximal operators in [20,50] and Bregman proximal operators in [12]. We also give a convergence result for the primal and dual iterates.
• We propose an easily implemented backtracking line search technique for selecting stepsizes in the Bregman dual Condat-Vũ algorithm for problems with equality constraints. The proposed backtracking procedure is similar to the technique in [36] for the special setting of PDHG with Euclidean proximal operators, but has some important differences even in this special case. We give a detailed analysis of the algorithm with line search and recover the O(1/k) ergodic rate of convergence established for related algorithms in [30,36].
• We propose a Bregman extension for PD3O and establish an ergodic convergence result.
The paper is organized as follows. Section 2 gives a precise statement of the problem (1) and reviews the duality theory that will be used in the rest of the paper. In Section 3 we review some well-known first-order proximal methods and establish connections between them. Section 4 provides some necessary background on Bregman distances. In Section 5 we discuss the Bregman primal and dual Condat-Vũ algorithms and analyze their convergence. The line search technique and its convergence are discussed in Section 6. In Section 7 we extend PD3O to a Bregman proximal method and analyze its convergence. Section 8 contains the results of a numerical experiment.

Duality theory and merit functions
This section summarizes the facts from convex duality theory that underlie the primal-dual methods discussed in the paper. We also describe primal-dual merit functions that will be used in the convergence analysis. We use the notation ⟨x, y⟩ = x^T y for the standard inner product of vectors x and y, and ‖x‖ = ⟨x, x⟩^{1/2} for the Euclidean norm of a vector x. Other norms will be distinguished by a subscript.

Problem formulation
In (1) the vector x is an n-vector and A is an m × n matrix. The functions f, g, h are closed and convex, and h is differentiable, so that

h(x) ≥ h(x′) + ⟨∇h(x′), x − x′⟩  for all x, x′ ∈ dom h,

where dom h is an open convex set. We assume that f + h and g are proper, i.e., have nonempty domains.
An important example of (1) is the constrained problem

minimize  f(x) + h(x)  subject to  Ax ∈ C,

where g = δ_C is the indicator function of a closed convex set C. For C = {b} the constraints are a set of linear equations Ax = b. This special case actually covers all applications of the more general problem (1), since (1) can be reformulated as

minimize  f(x) + g(y) + h(x)  subject to  Ax = y,

at the expense of increasing the problem size by introducing a splitting variable y.

Dual problem and optimality conditions
The dual of problem (1) is

maximize  −g^*(z) − (f + h)^*(−A^T z),    (2)

where (f + h)^* and g^* are the conjugates of f + h and g. The conjugate (f + h)^* is the infimal convolution of f^* and h^*, denoted by f^* □ h^*:

(f^* □ h^*)(y) = inf_v ( f^*(y − v) + h^*(v) ).

The primal-dual optimality conditions for (1) and (2) are

0 ∈ ∂f(x) + ∇h(x) + A^T z,    0 ∈ ∂g^*(z) − Ax.    (3)

Here ∂f and ∂g^* are the subdifferentials of f and g^*. We often write the optimality conditions as the monotone inclusion 0 ∈ F(x, z), where F(x, z) = ( ∂f(x) + ∇h(x) + A^T z, ∂g^*(z) − Ax ). Throughout the paper, we assume that the optimality conditions (3) are solvable. We will refer to the convex-concave function

L(x, z) = f(x) + h(x) + ⟨z, Ax⟩ − g^*(z)

as the Lagrangian of (1). We follow the convention that L(x, z) = +∞ if x ∉ dom(f + h), and L(x, z) = −∞ if x ∈ dom(f + h) and z ∉ dom g^*. The objective functions in (1) and the dual problem (2) can be expressed as

f(x) + g(Ax) + h(x) = sup_z L(x, z),    −g^*(z) − (f + h)^*(−A^T z) = inf_x L(x, z).

Solutions x⋆, z⋆ of the optimality conditions (3) form a saddle point of L, i.e., satisfy

sup_z L(x⋆, z) = L(x⋆, z⋆) = inf_x L(x, z⋆).

In particular, L(x⋆, z⋆) is the optimal value of (1) and (2).
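As a quick illustration of these definitions, take g = δ_{{b}}, so that the term g(Ax) encodes the equality constraint Ax = b. Then

g^*(z) = sup_y ( ⟨z, y⟩ − δ_{{b}}(y) ) = ⟨b, z⟩,

and the dual (2) becomes maximize −(f + h)^*(−A^T z) − ⟨b, z⟩, the familiar dual of an equality-constrained problem. This special case reappears in Section 6.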

Merit functions
The algorithms discussed in this paper generate primal and dual iterates, and approximate solutions x, z with x ∈ dom(f + h) and z ∈ dom g^*. The feasibility conditions Ax ∈ dom g and −A^T z ∈ dom(f + h)^* are not necessarily satisfied. Hence the duality gap

sup_{z′} L(x, z′) − inf_{x′} L(x′, z) = f(x) + g(Ax) + h(x) + g^*(z) + (f + h)^*(−A^T z)    (5)

may not always be useful as a merit function to measure convergence. If we add constraints x′ ∈ X and z′ ∈ Z to the optimization problems on the left-hand side of (5), where X and Z are compact convex sets, we obtain a function

η(x, z) = sup_{z′ ∈ Z} L(x, z′) − inf_{x′ ∈ X} L(x′, z),    (6)

defined for all x ∈ dom(f + h) and z ∈ dom g^*. This follows from the fact that the functions f + h + δ_X and g^* + δ_Z are closed and co-finite, so their conjugates have full domain [43, Corollary 13.3.1]. If η(x, z) is easily computed, and η(x, z) ≥ 0 for all x ∈ dom(f + h) and z ∈ dom g^* with equality only if x and z are optimal, then the function η can serve as a merit function in primal-dual algorithms for problem (1).
If dom(f + h) and dom g^* are bounded, then X and Z can be chosen to contain dom(f + h) and dom g^*. Then the constraints in (6) are redundant and η(x, z) is the duality gap (5). Boundedness of dom(f + h) and dom g^* is a common assumption in the literature on primal-dual first-order methods.
A weaker assumption is that (1) has an optimal solution x⋆ ∈ int(X) and (2) has an optimal solution z⋆ ∈ int(Z). Then η(x, z) ≥ 0 for all x ∈ dom(f + h) and z ∈ dom g^*, with equality η(x, z) = 0 only if x, z are optimal for (1) and (2). To see this, we first express the two terms in (6) using the support functions σ_X = δ_X^* and σ_Z = δ_Z^* of X and Z. Consider the problem of minimizing η(x, z). By expanding the infimal convolutions in the expressions for the two terms of η, this convex optimization problem can be formulated as a problem (7) with variables x, y, z, w. The dual of this problem is a problem (8) with variables x, z. The optimality conditions for (7) and (8) include the conditions Ax − y ∈ N_Z(z) and −A^T z − w ∈ N_X(x), where N_X(x) = ∂δ_X(x) is the normal cone to X at x, and N_Z(z) = ∂δ_Z(z) the normal cone to Z at z. By assumption, there exist points x⋆ ∈ int(X) and z⋆ ∈ int(Z) that are optimal for the original problem (1) and its dual (2). It can be verified that (x, y, z, w) = (x⋆, Ax⋆, z⋆, −A^T z⋆) and (x, z) = (x⋆, z⋆) are optimal for (7) and (8), and that η(x⋆, z⋆) = 0. Now let (x̂, ẑ) be any other minimizer of η, i.e., η(x̂, ẑ) = 0. Then x̂, ẑ and the corresponding minimizers ŷ, ŵ in (7) must satisfy the optimality conditions with the optimal dual variables x = x⋆, z = z⋆. In particular, ŷ = Ax̂ and ŵ = −A^T ẑ. The objective value of (7) at this point then reduces to

0 = f(x̂) + h(x̂) + g(Ax̂) + g^*(ẑ) + (f + h)^*(−A^T ẑ),

the duality gap associated with the original problem and its dual. This shows that η(x̂, ẑ) = 0 implies that x̂, ẑ are optimal for problems (1) and (2).
Consider for example the primal and dual pair

minimize  f(x) + h(x)  subject to  Ax = b,      maximize  −(f + h)^*(−A^T z) − ⟨b, z⟩.

Here g = δ_{{b}}. If we take Z = {z | ‖z‖ ≤ γ}, the first term of the merit function (6) becomes

sup_{z′ ∈ Z} L(x, z′) = f(x) + h(x) + γ‖Ax − b‖.

The first three terms are the primal objective augmented with an exact penalty for the constraint Ax = b.
As another example, consider

minimize  ‖x‖₁  subject to  Ax ≤ b.

This is an example of (1) with f(x) = ‖x‖₁, h(x) = 0, and g the indicator function of {y | y ≤ b}. The domains dom f and dom g^* are unbounded. If we choose compact sets X and Z as above, the resulting merit function (6) consists of the primal objective, a second term that is an exact penalty for the primal constraint Ax ≤ b, the dual objective, and a last term that is an exact penalty for the dual constraint ‖A^T z‖_∞ ≤ 1.

First-order proximal algorithms: survey and connections
In this section, we discuss several first-order proximal algorithms and their connections. We start with four three-operator splitting algorithms for problem (1): the primal and dual variants of the Condat-Vũ algorithm [20,50], the primal-dual three-operator (PD3O) algorithm [51], and the primal-dual Davis-Yin (PDDY) algorithm [44]. For each of the four algorithms, we make connections with other first-order proximal algorithms, using reduction (i.e., setting some parts of (1) to zero) and the "completion" reformulation (based on extending A to a matrix with orthogonal rows and equal row norms) [39]. We focus on the formal connections between algorithms. The connections do not necessarily provide the best approach for convergence analysis or the best known convergence results.
The proximal operator or proximal mapping of a closed convex function f : R^n → R is defined as

prox_f(y) = argmin_x ( f(x) + (1/2)‖x − y‖² ).    (9)

If f is closed and convex, the minimizer in the definition exists and is unique for all y [37]. We will call (9) the standard or the Euclidean proximal operator when we need to distinguish it from the Bregman proximal operators defined in Section 4.
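As a simple illustration (our example, not from the paper), for f(x) = λ‖x‖₁ the minimization in (9) separates over coordinates and is solved by soft-thresholding:

```python
import numpy as np

def prox_l1(y, lam):
    # Proximal operator of f(x) = lam*||x||_1, i.e. the minimizer in (9):
    # argmin_x lam*||x||_1 + (1/2)*||x - y||^2. The problem separates over
    # coordinates, and each coordinate is solved by soft-thresholding.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # [ 1.  -0.   0.2]
```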

Condat-Vũ three-operator splitting algorithm
We start with the (primal) Condat-Vũ three-operator splitting algorithm, which was proposed independently by Condat [20] and Vũ [50]:

x^{(k+1)} = prox_{τf}( x^{(k)} − τ(∇h(x^{(k)}) + A^T z^{(k)}) )
z^{(k+1)} = prox_{σg*}( z^{(k)} + σA(2x^{(k+1)} − x^{(k)}) ).    (10)

The stepsizes σ and τ must satisfy στ‖A‖₂² + τL ≤ 1, where ‖A‖₂ is the spectral norm of A, and L is the Lipschitz constant of ∇h with respect to the Euclidean norm. Many other first-order proximal algorithms can be viewed as special cases of (10), and their connections are summarized in Figure 1 (which relates the primal Condat-Vũ algorithm (10) to (primal) PDHG, Loris-Verhoeven with shift (11), and the proximal gradient method). When h = 0, algorithm (10) reduces to the (primal) primal-dual hybrid gradient (PDHG) method [10,12,42], or PDHGMu in [26]. When g = 0 in (10) (and assuming z^{(0)} = 0), we obtain the proximal gradient algorithm. When f = 0, we obtain a variant of the Loris-Verhoeven algorithm [15,23,34]:

x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + A^T z^{(k)})
z^{(k+1)} = prox_{σg*}( z^{(k)} + σA(2x^{(k+1)} − x^{(k)}) ).    (11)

We refer to this as Loris-Verhoeven with shift, for reasons that will be clarified later. Furthermore, when A = I in PDHG, we obtain the Douglas-Rachford splitting (DRS) algorithm [18,25,32]. Conversely, the "completion" technique in [39] shows that PDHG coincides with DRS applied to a reformulation of the problem. Similarly, when A = I in the primal Condat-Vũ algorithm (10), we obtain a new algorithm, which we refer to as the reduced primal Condat-Vũ algorithm. Conversely, the reduced primal Condat-Vũ algorithm reverts to (10) via the "completion" trick. We can also set f = 0 in the reduced Condat-Vũ algorithm or A = I in (11), and obtain the reduced Loris-Verhoeven algorithm with shift:

x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + z^{(k)})
z^{(k+1)} = prox_{σg*}( z^{(k)} + σ(2x^{(k+1)} − x^{(k)}) ).    (12)

Finally, due to the absence of f in (12), it is not clear how to apply the "completion" trick to (12) to obtain (11).
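A minimal sketch of (10) in code may be helpful. It assumes the user supplies grad_h and the two proximal oracles prox_tau_f and prox_sigma_gconj (for τf and σg*), with NumPy arrays throughout; the names are ours, not the paper's notation:

```python
import numpy as np

def condat_vu_primal(x, z, A, grad_h, prox_tau_f, prox_sigma_gconj,
                     tau, sigma, iters=100):
    # Primal Condat-Vu iteration (10):
    #   x+ = prox_{tau f}(x - tau*(grad_h(x) + A^T z))
    #   z+ = prox_{sigma g*}(z + sigma*A*(2*x+ - x))
    # The stepsizes should satisfy sigma*tau*||A||_2^2 + tau*L <= 1.
    for _ in range(iters):
        x_new = prox_tau_f(x - tau * (grad_h(x) + A.T @ z))
        z = prox_sigma_gconj(z + sigma * (A @ (2 * x_new - x)))
        x = x_new
    return x, z
```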
Condat [20] also discusses a variant of (10), which we will call the dual Condat-Vũ algorithm:

z^{(k+1)} = prox_{σg*}( z^{(k)} + σA x^{(k)} )
x^{(k+1)} = prox_{τf}( x^{(k)} − τ(∇h(x^{(k)}) + A^T(2z^{(k+1)} − z^{(k)})) ).    (13)

Figure 2 summarizes the proximal algorithms derived from (13) (dual PDHG, the dual Loris-Verhoeven algorithm with shift (14), and the proximal gradient method). When h = 0, algorithm (13) reduces to PDHG applied to the dual of (1) (with h = 0), which is shown to be equivalent to linearized ADMM [40] (also called Split Inexact Uzawa in [26]). Setting g = 0 in (13) yields the proximal gradient algorithm. When f = 0, we obtain a new algorithm:

z^{(k+1)} = prox_{σg*}( z^{(k)} + σA x^{(k)} )
x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + A^T(2z^{(k+1)} − z^{(k)})).    (14)

Following the previous naming convention, we call it the dual Loris-Verhoeven algorithm with shift. Furthermore, setting A = I in (13) gives the reduced dual Condat-Vũ algorithm. Conversely, applying the "completion" trick to this reduced algorithm recovers (13). Similarly, setting A = I in dual PDHG gives dual DRS, i.e., DRS with f and g switched, and conversely, the "completion" trick recovers dual PDHG from dual DRS. We can also set A = I in (14) or f = 0 in the reduced dual Condat-Vũ algorithm, and obtain the reduced dual Loris-Verhoeven algorithm with shift:

z^{(k+1)} = prox_{σg*}( z^{(k)} + σ x^{(k)} )
x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + 2z^{(k+1)} − z^{(k)}).    (15)

Primal-dual three-operator (PD3O) splitting algorithm
The third diagram, Figure 3, starts with the primal-dual three-operator (PD3O) splitting algorithm [51]:

x^{(k+1)} = prox_{τf}( x^{(k)} − τ(∇h(x^{(k)}) + A^T z^{(k)}) )
z^{(k+1)} = prox_{σg*}( z^{(k)} + σA(2x^{(k+1)} − x^{(k)} + τ∇h(x^{(k)}) − τ∇h(x^{(k+1)})) ).    (16)

(Figure 3 relates PD3O (16) to Davis-Yin, (primal) Douglas-Rachford, the proximal gradient method, (primal) PDHG, dual PDHG, and Loris-Verhoeven (17).) Compared with the Condat-Vũ algorithm (10), PD3O seems to have slightly more complicated updates and larger complexity per iteration, but the requirement on the stepsizes is looser: στ‖A‖₂² ≤ 1 and τ ≤ 1/L. When h = 0, (16) reduces to the (primal) PDHG. The classical proximal gradient algorithm can be obtained by setting g = 0. When f = 0, it reduces to the iterations

x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + A^T z^{(k)})
z^{(k+1)} = prox_{σg*}( z^{(k)} + σA(2x^{(k+1)} − x^{(k)} + τ∇h(x^{(k)}) − τ∇h(x^{(k+1)})) ).    (17)

This algorithm was discovered independently as the Loris-Verhoeven algorithm [34], the primal-dual fixed point algorithm based on proximity operator (PDFP2O) [15], and the proximal alternating predictor corrector (PAPC) [23]. Comparison with (11) reveals a minor difference between the two algorithms: the gradient term in the z-update is taken at the newest primal iterate x^{(k+1)} in Loris-Verhoeven (17), and at the previous point x^{(k)} in the shifted version. This difference is inherited by the proximal gradient algorithm and its shifted version (12). Furthermore, when A = I and σ = 1/τ in PD3O, we recover the well-known Davis-Yin splitting (DYS) algorithm [21]. We can also set A = I in (17) and obtain the iterations

x^{(k+1)} = x^{(k)} − τ(∇h(x^{(k)}) + z^{(k)})
z^{(k+1)} = prox_{σg*}( z^{(k)} + σ(2x^{(k+1)} − x^{(k)} + τ∇h(x^{(k)}) − τ∇h(x^{(k+1)})) ).    (18)

The stepsize conditions require στ ≤ 1 and τ ≤ 1/L. Thus we can set σ = 1/τ and apply the Moreau decomposition. The resulting algorithm is exactly the proximal gradient method. The only difference in the z-update between (12) and (18) is the point at which the gradient of h is taken. The second algorithm uses the most up-to-date iterate x^{(k+1)} when evaluating the gradient of h, and this choice allows a larger stepsize τ.
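The same kind of sketch for (16) makes the difference with (10) explicit; again the oracles and names are our assumptions, not the paper's notation:

```python
import numpy as np

def pd3o(x, z, A, grad_h, prox_tau_f, prox_sigma_gconj,
         tau, sigma, iters=100):
    # PD3O iteration (16). Relative to Condat-Vu (10), the argument of the
    # dual prox carries the extra correction tau*(grad_h(x) - grad_h(x+)),
    # which relaxes the stepsize condition to
    # sigma*tau*||A||_2^2 <= 1 and tau <= 1/L.
    g = grad_h(x)
    for _ in range(iters):
        x_new = prox_tau_f(x - tau * (g + A.T @ z))
        g_new = grad_h(x_new)   # reused next iteration: one gradient per pass
        z = prox_sigma_gconj(z + sigma * (A @ (2 * x_new - x + tau * (g - g_new))))
        x, g = x_new, g_new
    return x, z
```

Caching the gradient means PD3O, like Condat-Vũ, needs only one new gradient evaluation per iteration even though ∇h appears twice in (16).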

Primal-dual Davis-Yin (PDDY) splitting algorithm
The core algorithm in Figure 4 is the primal-dual Davis-Yin (PDDY) splitting algorithm [44]. The requirement on the stepsizes is the same as that in PD3O: στ‖A‖₂² ≤ 1 and τ ≤ 1/L. Figure 4 is almost identical to Figure 3, with the roles of f and g exchanged. When h = 0, PDDY reduces to the dual PDHG. In addition, when A = I and σ = 1/τ, PDDY reduces to the Davis-Yin algorithm, but with f and g exchanged. Similarly, when h = 0, A = I, and σ = 1/τ, PDDY reverts to the Douglas-Rachford algorithm with f and g switched.
We have seen that the middle and right parts of Figure 4 are those of Figure 3 with f and g switched. However, when one of the functions f or g is absent, the algorithms reduced from PD3O and PDDY are exactly the same. In particular, when f = 0, PDDY reduces to the Loris-Verhoeven algorithm.

Bregman distances
In this section we give the definition of Bregman proximal operators and the basic properties that will be used in the paper. We refer the interested reader to [9] for an in-depth discussion of Bregman distances, their history, and applications.
Let φ be a convex function with a domain that has nonempty interior, and assume φ is continuous on dom φ and continuously differentiable on int(dom φ). The generalized distance (or Bregman distance) generated by the kernel function φ is defined as the function

d(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,

with domain dom d = dom φ × int(dom φ). The corresponding Bregman proximal operator of a closed convex function f is

prox^φ_f(y, a) = argmin_x ( f(x) + ⟨a, x⟩ + d(x, y) )    (20)
             = argmin_x ( f(x) + ⟨a − ∇φ(y), x⟩ + φ(x) ).    (21)

It is assumed that for every a and every y ∈ int(dom φ) the minimizer x = prox^φ_f(y, a) is unique and in int(dom φ).
The distance generated by the kernel φ(x) = (1/2)‖x‖² is the squared Euclidean distance d(x, y) = (1/2)‖x − y‖². The corresponding Bregman proximal operator is the standard proximal operator applied to y − a: prox^φ_f(y, a) = prox_f(y − a). For this distance, closedness and convexity of f guarantee that the proximal operator is well defined. The questions of existence and uniqueness are more complicated for general Bregman distances. There are no simple general conditions that guarantee that, for every a and every y ∈ int(dom φ), the generalized proximal operator (20) is uniquely defined and in int(dom φ). Some sufficient conditions are available (see, for example, [8, Section 4.1], [3, Assumption A]), but they may be quite restrictive or difficult to verify in practice. In applications, however, the Bregman proximal operator is used with specific combinations of f and φ for which the minimization problem in (20) is particularly easy to solve. In those applications, existence and uniqueness of the solution follow directly from the closed-form solution or the availability of a fast algorithm to compute it. A typical example will be provided in Section 8.
From the expression (21) we see that x̂ = prox^φ_f(y, a) satisfies

∇φ(y) − ∇φ(x̂) − a ∈ ∂f(x̂).

Equivalently, by definition of subgradient,

f(x) ≥ f(x̂) + ⟨∇φ(y) − ∇φ(x̂) − a, x − x̂⟩    (22)

for all x ∈ dom f ∩ dom φ.
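For instance (our example), with the relative entropy kernel and f = 0, the minimizer in (20) is available in closed form and lies in the interior of the domain by inspection:

```python
import numpy as np

def entropy_prox_f0(y, a):
    # Bregman proximal operator (20) for f = 0 and the relative entropy
    # kernel phi(x) = sum_i x_i*log(x_i), whose generated distance is
    # d(x, y) = sum_i (x_i*log(x_i/y_i) - x_i + y_i).
    # Setting the gradient of <a, x> + d(x, y) to zero gives
    # log(x_i/y_i) + a_i = 0, i.e. x = y * exp(-a): a closed form that is
    # automatically in the interior of the domain (x > 0).
    return y * np.exp(-a)

print(entropy_prox_f0(np.array([0.2, 0.5, 0.3]), np.array([1.0, -0.5, 0.0])))
```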
Bregman Condat-Vũ three-operator splitting algorithms

We now discuss two Bregman three-operator splitting algorithms for the problem (1). The algorithms use a generalized distance d_p in the primal space, generated by a kernel φ_p, and a generalized distance d_d in the dual space, generated by a kernel φ_d. The first algorithm is

x^{(k+1)} = argmin_x ( f(x) + ⟨∇h(x^{(k)}) + A^T z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}) )    (23a)
z^{(k+1)} = argmin_z ( g^*(z) − ⟨A(2x^{(k+1)} − x^{(k)}), z⟩ + (1/σ) d_d(z, z^{(k)}) )    (23b)

and will be referred to as the Bregman primal Condat-Vũ algorithm. The second algorithm will be called the Bregman dual Condat-Vũ algorithm:

z^{(k+1)} = argmin_z ( g^*(z) − ⟨A x^{(k)}, z⟩ + (1/σ) d_d(z, z^{(k)}) )    (24a)
x^{(k+1)} = argmin_x ( f(x) + ⟨∇h(x^{(k)}) + A^T(2z^{(k+1)} − z^{(k)}), x⟩ + (1/τ) d_p(x, x^{(k)}) ).    (24b)

The two algorithms need starting points x^{(0)} ∈ int(dom φ_p) ∩ dom h and z^{(0)} ∈ int(dom φ_d).
Conditions on the stepsizes σ, τ will be specified later. When Euclidean distances are used for the primal and dual proximal operators, the two algorithms reduce to the primal and dual variants of the Condat-Vũ algorithm, (10) and (13), respectively. Algorithm (23) has been proposed in [12].
Here we discuss it together with (24) in a unified framework. In Section 5.1 we show that the proposed algorithms can be interpreted as the Bregman proximal point method applied to a monotone inclusion problem. In Section 5.2 we analyze their convergence. In Section 5.3 we discuss the connections between the two algorithms and other Bregman proximal splitting methods.
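A sketch of (23) in code, with the two Bregman proximal operators supplied as oracles, shows how little changes relative to the Euclidean version (10); the argument conventions are ours:

```python
def bregman_primal_condat_vu(x, z, A, grad_h, bprox_f, bprox_gconj,
                             tau, sigma, iters=100):
    # Bregman primal Condat-Vu iteration (23).
    # bprox_f(y, a) must return argmin_x f(x) + <a, x> + (1/tau)*d_p(x, y);
    # bprox_gconj(y, a) is the analogous update for g* with (1/sigma)*d_d.
    # Both oracles are assumed to be cheap (e.g., available in closed form).
    for _ in range(iters):
        x_new = bprox_f(x, grad_h(x) + A.T @ z)        # x-update (23a)
        z = bprox_gconj(z, -(A @ (2 * x_new - x)))     # z-update (23b)
        x = x_new
    return x, z
```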
Assumptions. Throughout Section 5 we make the following assumptions. The kernel functions φ_p and φ_d are 1-strongly convex with respect to norms ‖·‖_p and ‖·‖_d, respectively:

d_p(x, x′) ≥ (1/2)‖x − x′‖_p²,    d_d(z, z′) ≥ (1/2)‖z − z′‖_d²    (25)

for all (x, x′) ∈ dom d_p and (z, z′) ∈ dom d_d. The assumption that the strong convexity constants are equal to one can be made without loss of generality, by scaling the norms (or distances) if needed. We also assume that the function Lφ_p − h is convex for some L > 0. More precisely, dom φ_p ⊆ dom h and

h(x) ≤ h(x′) + ⟨∇h(x′), x − x′⟩ + L d_p(x, x′)  for all (x, x′) ∈ dom d_p.    (26)

Note that this assumption is looser than the one in [12, Equation (4)]. We denote by ‖A‖ the matrix norm

‖A‖ = sup { ⟨z, Ax⟩ | ‖x‖_p ≤ 1, ‖z‖_d ≤ 1 } = sup_{x ≠ 0} ‖Ax‖_{d,*} / ‖x‖_p = sup_{z ≠ 0} ‖A^T z‖_{p,*} / ‖z‖_d,    (27)

where ‖·‖_{p,*} and ‖·‖_{d,*} are the dual norms of ‖·‖_p and ‖·‖_d.
It is also assumed that the primal-dual optimality conditions (3) have a solution (x⋆, z⋆) with x⋆ ∈ dom φ_p and z⋆ ∈ dom φ_d.

Derivation from Bregman proximal point method
The Bregman Condat-Vũ algorithms (23) and (24) can be viewed as applications of the Bregman proximal point algorithm to the optimality conditions (3). This interpretation extends the derivation of the Bregman PDHG algorithm from the Bregman proximal point algorithm given in [30]. The idea originates with He and Yuan's interpretation of PDHG as a "preconditioned" proximal point algorithm [28].
The Bregman proximal point algorithm [9,24,27] is an algorithm for monotone inclusion problems 0 ∈ F(u). The update u^{(k+1)} in one iteration of the algorithm is defined as the solution of the inclusion

∇φ(u^{(k)}) − ∇φ(u^{(k+1)}) ∈ F(u^{(k+1)}),

where φ is a Bregman kernel function. Applied to (3), with a kernel function φ_pd, the algorithm generates a sequence (x^{(k)}, z^{(k)}) defined by

∇φ_pd(x^{(k)}, z^{(k)}) − ∇φ_pd(x^{(k+1)}, z^{(k+1)}) ∈ ( ∂f(x^{(k+1)}) + ∇h(x^{(k+1)}) + A^T z^{(k+1)},  ∂g^*(z^{(k+1)}) − A x^{(k+1)} ).    (28)

Primal-dual Bregman distances
We introduce four possible primal-dual kernel functions: the functions

φ_+(x, z) = (1/τ) φ_p(x) + (1/σ) φ_d(z) + ⟨z, Ax⟩,    φ_−(x, z) = (1/τ) φ_p(x) + (1/σ) φ_d(z) − ⟨z, Ax⟩,

where σ, τ > 0, and the functions

φ_pcv(x, z) = φ_−(x, z) − h(x),    φ_dcv(x, z) = φ_+(x, z) − h(x).

The subscripts in φ_+ and φ_− refer to the sign of the inner product term ⟨z, Ax⟩. The subscripts in φ_pcv and φ_dcv indicate the algorithm (Bregman primal or dual Condat-Vũ) for which these distances will be relevant. If these kernel functions are convex, they generate the following Bregman distances. The distances generated by φ_+ and φ_− are

d_±(x, z; x′, z′) = (1/τ) d_p(x, x′) + (1/σ) d_d(z, z′) ± ⟨z − z′, A(x − x′)⟩,    (29)

respectively, and the distances generated by φ_pcv and φ_dcv are

d_pcv(x, z; x′, z′) = d_−(x, z; x′, z′) − d_h(x, x′),    d_dcv(x, z; x′, z′) = d_+(x, z; x′, z′) − d_h(x, x′),

where d_h(x, x′) = h(x) − h(x′) − ⟨∇h(x′), x − x′⟩. We now show that φ_+ and φ_− are convex if στ‖A‖² ≤ 1 and strongly convex if στ‖A‖² < 1, and that the functions φ_pcv and φ_dcv are convex if

στ‖A‖² + τL ≤ 1    (30)

and strongly convex if στ‖A‖² + τL < 1.
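The verification (our computation, recorded here for completeness) is short. By the strong convexity assumption (25) and the definition (27) of ‖A‖,

d_±(x, z; x′, z′) ≥ (1/(2τ))‖x − x′‖_p² + (1/(2σ))‖z − z′‖_d² − ‖A‖ ‖x − x′‖_p ‖z − z′‖_d,

and the quadratic form (1/(2τ))s² − ‖A‖st + (1/(2σ))t² is nonnegative for all s, t ≥ 0 if and only if ‖A‖² ≤ 1/(στ), i.e., στ‖A‖² ≤ 1 (with positive definiteness under strict inequality). For φ_pcv and φ_dcv, the bound (26) gives d_h(x, x′) ≤ L d_p(x, x′), so that, when τL ≤ 1,

d_pcv(x, z; x′, z′) ≥ ((1 − τL)/(2τ))‖x − x′‖_p² + (1/(2σ))‖z − z′‖_d² − ‖A‖ ‖x − x′‖_p ‖z − z′‖_d,

which is nonnegative if and only if στ‖A‖² ≤ 1 − τL, i.e., the condition (30). The same bound applies to d_dcv.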

Bregman Condat-Vũ algorithms from proximal point method
The Bregman primal Condat-Vũ algorithm (23) is the Bregman proximal point method with the kernel function φ_pd = φ_pcv. If we take φ_pd = φ_pcv in (28), we obtain two coupled inclusions that determine x^{(k+1)}, z^{(k+1)}. The first one is

(1/τ)( ∇φ_p(x^{(k)}) − ∇φ_p(x^{(k+1)}) ) − ∇h(x^{(k)}) − A^T z^{(k)} ∈ ∂f(x^{(k+1)}).

This shows that x^{(k+1)} solves the optimization problem

minimize_x  f(x) + ⟨∇h(x^{(k)}) + A^T z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}).

The solution is the x-update (23a) in the Bregman primal Condat-Vũ method. The second inclusion is

(1/σ)( ∇φ_d(z^{(k)}) − ∇φ_d(z^{(k+1)}) ) + A(2x^{(k+1)} − x^{(k)}) ∈ ∂g^*(z^{(k+1)}).

This shows that z^{(k+1)} solves the optimization problem

minimize_z  g^*(z) − ⟨A(2x^{(k+1)} − x^{(k)}), z⟩ + (1/σ) d_d(z, z^{(k)}).

The solution is the z-update (23b). A similar calculation with φ_pd = φ_dcv yields the Bregman dual Condat-Vũ algorithm (24).

Convergence analysis
The derivation in Section 5.1 allows us to apply existing convergence theory for the Bregman proximal point method to the proposed algorithms (23) and (24). In particular, Solodov and Svaiter [45] have studied Bregman proximal point methods with inexact prox-evaluations for solving variational inequalities, which include the monotone inclusion problem as a special case. The results in [45] can be applied to analyze the convergence of the Bregman Condat-Vũ methods with inexact evaluations of proximal operators. The literature on the Bregman proximal point method for monotone inclusions [24,27,45] focuses on the convergence of iterates, and this generally requires additional assumptions on φ_p and φ_d (beyond the assumptions of convexity made in Section 5.1). In this section we present a self-contained convergence analysis and give a direct proof of an O(1/k) rate of ergodic convergence. We also give a self-contained proof of convergence of the iterates x^{(k)} and z^{(k)}.
We make the assumptions listed in Section 5.1: the strong convexity assumption (25) for the primal and dual kernels φ_p and φ_d, and the relative smoothness property (26) of the function h. We assume that the stepsizes σ, τ satisfy (30), and that the primal-dual optimality conditions (3) have a solution (x⋆, z⋆) ∈ dom φ_p × dom φ_d.
For the sake of brevity, we combine the analysis of the Bregman primal and the Bregman dual Condat-Vũ algorithms. In the following, d and φ stand for d_pcv and φ_pcv in the analysis of the Bregman primal Condat-Vũ algorithm (23), and for d_dcv and φ_dcv in the analysis of the Bregman dual Condat-Vũ algorithm (24). The central inequality states that the iterates of (23) and (24) satisfy

L(x^{(k+1)}, z) − L(x, z^{(k+1)}) ≤ d(x, z; x^{(k)}, z^{(k)}) − d(x, z; x^{(k+1)}, z^{(k+1)}) − d(x^{(k+1)}, z^{(k+1)}; x^{(k)}, z^{(k)})    (31)

for all x ∈ dom f ∩ dom φ_p and z ∈ dom g^* ∩ dom φ_d. Since the last term of (31) is nonnegative, this implies

L(x^{(k+1)}, z) − L(x, z^{(k+1)}) ≤ d(x, z; x^{(k)}, z^{(k)}) − d(x, z; x^{(k+1)}, z^{(k+1)}).    (32)
Proof. We write (23) and (24) in a unified notation as

x^{(k+1)} = argmin_x ( f(x) + ⟨∇h(x^{(k)}) + A^T z̄, x⟩ + (1/τ) d_p(x, x^{(k)}) ),    (33a)
z^{(k+1)} = argmin_z ( g^*(z) − ⟨A x̄, z⟩ + (1/σ) d_d(z, z^{(k)}) ),    (33b)

where x̄ and z̄ are defined in the following table:

                                     x̄                       z̄
Bregman primal Condat-Vũ (23)       2x^{(k+1)} − x^{(k)}     z^{(k)}
Bregman dual Condat-Vũ (24)         x^{(k)}                  2z^{(k+1)} − z^{(k)}

The optimality condition (22) for the proximal operator evaluation (33a) is that

f(x) ≥ f(x^{(k+1)}) + (1/τ) ⟨∇φ_p(x^{(k)}) − ∇φ_p(x^{(k+1)}), x − x^{(k+1)}⟩ − ⟨∇h(x^{(k)}) + A^T z̄, x − x^{(k+1)}⟩    (34a)

for all x ∈ dom f ∩ dom φ_p. The optimality condition for (33b) is that

g^*(z) ≥ g^*(z^{(k+1)}) + (1/σ) ⟨∇φ_d(z^{(k)}) − ∇φ_d(z^{(k+1)}), z − z^{(k+1)}⟩ + ⟨A x̄, z − z^{(k+1)}⟩    (34b)

for all z ∈ dom g^* ∩ dom φ_d. Combining the two inequalities with the definition of L and the three-point identity

⟨∇φ(y) − ∇φ(v), x − v⟩ = d(x, y) − d(x, v) − d(v, y)

for d_p and d_d gives

L(x^{(k+1)}, z) − L(x, z^{(k+1)})
  ≤ (1/τ)( d_p(x, x^{(k)}) − d_p(x, x^{(k+1)}) − d_p(x^{(k+1)}, x^{(k)}) )
    + (1/σ)( d_d(z, z^{(k)}) − d_d(z, z^{(k+1)}) − d_d(z^{(k+1)}, z^{(k)}) )
    + d_h(x^{(k+1)}, x^{(k)}) − d_h(x, x^{(k)})
    + ⟨z̄, A(x − x^{(k+1)})⟩ − ⟨z − z^{(k+1)}, A x̄⟩ + ⟨z, A x^{(k+1)}⟩ − ⟨z^{(k+1)}, A x⟩    (35)

for all x ∈ dom f ∩ dom φ_p and all z ∈ dom g^* ∩ dom φ_d. Substituting the expressions for x̄ and z̄ in the Bregman primal Condat-Vũ algorithm (23), we obtain for the last line of (35)

−⟨z^{(k+1)} − z^{(k)}, A(x − x^{(k+1)})⟩ − ⟨z − z^{(k+1)}, A(x^{(k+1)} − x^{(k)})⟩.
If we substitute the expressions for x̄ and z̄ in the Bregman dual Condat-Vũ algorithm, the last line of (35) becomes

⟨z^{(k+1)} − z^{(k)}, A(x − x^{(k+1)})⟩ + ⟨z − z^{(k+1)}, A(x^{(k+1)} − x^{(k)})⟩.

Therefore, for both algorithms, (35) implies that

L(x^{(k+1)}, z) − L(x, z^{(k+1)}) ≤ (1/τ)( d_p(x, x^{(k)}) − d_p(x, x^{(k+1)}) − d_p(x^{(k+1)}, x^{(k)}) ) + (1/σ)( d_d(z, z^{(k)}) − d_d(z, z^{(k+1)}) − d_d(z^{(k+1)}, z^{(k)}) ) + d_h(x^{(k+1)}, x^{(k)}) − d_h(x, x^{(k)}) ∓ ( ⟨z^{(k+1)} − z^{(k)}, A(x − x^{(k+1)})⟩ + ⟨z − z^{(k+1)}, A(x^{(k+1)} − x^{(k)})⟩ )

if we select the minus sign in ∓ for the Bregman primal Condat-Vũ algorithm, and the plus sign for the Bregman dual Condat-Vũ algorithm. For the primal method, since d_h(x, x^{(k+1)}) ≥ 0 by convexity of h, this shows

L(x^{(k+1)}, z) − L(x, z^{(k+1)}) ≤ d_pcv(x, z; x^{(k)}, z^{(k)}) − d_pcv(x, z; x^{(k+1)}, z^{(k+1)}) − d_pcv(x^{(k+1)}, z^{(k+1)}; x^{(k)}, z^{(k)}).

For the dual method, the same argument gives (31) with d = d_dcv, i.e., with last term d_dcv(x^{(k+1)}, z^{(k+1)}; x^{(k)}, z^{(k)}).

Ergodic convergence
We define averaged iterates

x_avg^{(k)} = (1/k) Σ_{i=1}^k x^{(i)},    z_avg^{(k)} = (1/k) Σ_{i=1}^k z^{(i)}    (36)

for k ≥ 1. We show that

L(x_avg^{(k)}, z) − L(x, z_avg^{(k)}) ≤ (1/k) d(x, z; x^{(0)}, z^{(0)})    (37)

for all x ∈ dom f ∩ dom φ_p and z ∈ dom g^* ∩ dom φ_d.
Proof. From (32), since L(u, v) is convex in u and concave in v,

k ( L(x_avg^{(k)}, z) − L(x, z_avg^{(k)}) ) ≤ Σ_{i=1}^k ( L(x^{(i)}, z) − L(x, z^{(i)}) ) ≤ d(x, z; x^{(0)}, z^{(0)}) − d(x, z; x^{(k)}, z^{(k)}) ≤ d(x, z; x^{(0)}, z^{(0)})

for all x ∈ dom f ∩ dom φ_p and z ∈ dom g^* ∩ dom φ_d. The second step follows from summing (32); the last step follows from the nonnegativity of the distance d. More generally, if X ⊆ dom φ_p and Z ⊆ dom φ_d are compact convex sets that contain optimal solutions x⋆, z⋆ in their interiors, then the merit function (6) is bounded by

η(x_avg^{(k)}, z_avg^{(k)}) ≤ (1/k) sup_{x ∈ X, z ∈ Z} d(x, z; x^{(0)}, z^{(0)}).
Convergence of iterates

Convergence of the iterates follows from the derivation in Section 5.1 and existing results on the Bregman proximal point method [27, Theorem 3.1], [45, Theorem 3.2]. Here we provide a self-contained proof under additional assumptions about the primal and dual distance functions, which are common in the literature on Bregman distances [9,14,24,27]. First, for fixed x and z, the sublevel sets {x′ | d_p(x, x′) ≤ γ} and {z′ | d_d(z, z′) ≤ γ} are closed; in other words, the distances d_p(x, x′) and d_d(z, z′) are closed functions of x′ and z′, respectively. Since a sum of closed functions is closed, the distance d(x, z; x′, z′) is then a closed function of (x′, z′), for fixed (x, z). Second, if x^{(k_i)} → x̂ and z^{(k_i)} → ẑ along a subsequence, then d_p(x̂, x^{(k_i)}) → 0 and d_d(ẑ, z^{(k_i)}) → 0.

To show convergence of the entire sequence (x^{(k)}, z^{(k)}), let (x̂, ẑ) be the limit of a convergent subsequence (x^{(k_i)}, z^{(k_i)}); it can be shown that (x̂, ẑ) is a saddle point of L. We substitute (x, z) = (x̂, ẑ) in (32):

L(x^{(k+1)}, ẑ) − L(x̂, z^{(k+1)}) ≤ d(x̂, ẑ; x^{(k)}, z^{(k)}) − d(x̂, ẑ; x^{(k+1)}, z^{(k+1)}).

Since the left-hand side is nonnegative, we have d(x̂, ẑ; x^{(k+1)}, z^{(k+1)}) ≤ d(x̂, ẑ; x^{(k)}, z^{(k)}) for all k ≥ 1. This further implies that d(x̂, ẑ; x^{(k)}, z^{(k)}) ≤ d(x̂, ẑ; x^{(k_i)}, z^{(k_i)}) for all k ≥ k_i. By the second additional assumption mentioned above, the right-hand side converges to zero. Then the left-hand side also converges to zero and, from (41), x^{(k)} → x̂ and z^{(k)} → ẑ.

Relation to other Bregman proximal algorithms
Following similar steps as in Section 3, we obtain several Bregman proximal splitting methods as special cases of (23) and (24). The connections are summarized in Figures 5 and 6. A comparison of Figures 1 and 5 shows that all the reduction relations (A = I) are still valid. However, it is unclear how to apply the "completion" operation to algorithms based on non-Euclidean Bregman distances. When h = 0, (23) reduces to Bregman PDHG [12]. When g = 0, g^* = δ_{{0}} (and assuming z^{(0)} = 0), we obtain the Bregman proximal gradient algorithm [3]. When f = 0 in (23), we obtain the Bregman Loris-Verhoeven algorithm with shift:

x^{(k+1)} = argmin_x ( ⟨∇h(x^{(k)}) + A^T z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}) )
z^{(k+1)} = argmin_z ( g^*(z) − ⟨A(2x^{(k+1)} − x^{(k)}), z⟩ + (1/σ) d_d(z, z^{(k)}) ).    (43)

Furthermore, when A = I in (23), we recover the reduced Bregman primal Condat-Vũ algorithm. Similarly, setting A = I in Bregman PDHG yields the Bregman Douglas-Rachford algorithm. Last, when we set A = I in (43), we obtain the reduced Bregman Loris-Verhoeven algorithm with shift (44). Similarly, the Bregman dual Condat-Vũ algorithm (24) can be reduced to some other Bregman proximal splitting methods, as summarized in Figure 6. In particular, when f = 0 in (24), we obtain the Bregman dual Loris-Verhoeven algorithm with shift:

z^{(k+1)} = argmin_z ( g^*(z) − ⟨A x^{(k)}, z⟩ + (1/σ) d_d(z, z^{(k)}) )
x^{(k+1)} = argmin_x ( ⟨∇h(x^{(k)}) + A^T(2z^{(k+1)} − z^{(k)}), x⟩ + (1/τ) d_p(x, x^{(k)}) ).    (45)

Moreover, setting A = I in (45) yields the reduced Bregman dual Loris-Verhoeven algorithm with shift.


Bregman dual Condat-Vũ algorithm with line search

The algorithms (23) and (24) use constant parameters σ and τ. The stepsize condition (30) involves the matrix norm ‖A‖ and the Lipschitz constant L in (26). Estimating or bounding ‖A‖ for a large matrix can be difficult. As an added complication, the norms ‖·‖_p and ‖·‖_d in the definition of the matrix norm (27) are assumed to be scaled so that the strong convexity parameters of the primal and dual kernels are equal to one. Close bounds on the strong convexity parameters may also be difficult to obtain. Using conservative bounds for ‖A‖ and L results in unnecessarily small values of σ and τ, and can dramatically slow down the convergence. Even when the estimates of ‖A‖ and L are accurate, the stepsize requirements (30) are still too strict in most iterations, as observed in [1]. In view of the above arguments, line search techniques for primal-dual proximal methods have recently become an active area of research. Malitsky and Pock [36] proposed a line search technique for PDHG and the Condat-Vũ algorithm in the Euclidean case. The algorithm with adaptive parameters in [49] focuses on a special case of (1) (i.e., f = 0) and extends the Loris-Verhoeven algorithm (17). A Bregman proximal splitting method with line search is discussed in [30]; it considers the problem (1) with h = 0 and g = δ_{{b}}. In this section, we extend the Bregman dual Condat-Vũ algorithm (24) with a varying parameter option, in which the stepsizes are chosen adaptively without requiring any estimates or bounds for ‖A‖ or the strong convexity parameters of the kernels. The algorithm is restricted to problems in the equality constrained form

minimize  f(x) + h(x)  subject to  Ax = b.    (46)

This is a special case of (1) with g = δ_{{b}}, the indicator function of the singleton {b}.
The details of the algorithm are discussed in Section 6.1, and a convergence analysis is presented in Section 6.2. The main conclusion is an O(1/k) rate of ergodic convergence, consistent with previous results for related algorithms [30,36].
Assumptions. We make the same assumptions as in Section 5.1, but define

d_d(z, z′) = (1/2)‖z − z′‖²,

where ‖·‖ is the Euclidean norm. The matrix norm ‖A‖ is defined accordingly as

‖A‖ = sup { ‖Ax‖ | ‖x‖_p ≤ 1 }.    (47)
One iteration of the algorithm computes

z̄^{(k+1)} = z^{(k)} + θ_k ( z^{(k)} − z^{(k−1)} )    (48a)
x^{(k+1)} = argmin_x ( f(x) + ⟨∇h(x^{(k)}) + A^T z̄^{(k+1)}, x⟩ + (1/τ_k) d_p(x, x^{(k)}) )    (48b)
z^{(k+1)} = z^{(k)} + σ_k ( A x^{(k+1)} − b ).    (48c)

In the line search algorithm, the parameters θ_k, τ_k, σ_k are determined by a backtracking search. At the start of the algorithm, we set τ_{−1} and σ_{−1} to some positive values, with β = σ_{−1}/τ_{−1}. To start the search in iteration k, we choose θ̄_k ≥ 1. For i = 0, 1, 2, ..., we set

θ_k = 2^{−i} θ̄_k,    τ_k = θ_k τ_{k−1},    σ_k = β τ_k,

and compute z̄^{(k+1)}, x^{(k+1)}, z^{(k+1)} using (48). For some δ ∈ (0, 1], if the backtracking condition (49) holds, we accept the computed iterates z̄^{(k+1)}, x^{(k+1)}, z^{(k+1)} and parameters θ_k, σ_k, τ_k, and terminate the backtracking search. If (49) does not hold, we increment i and continue the backtracking search. The backtracking condition (49) is similar to the condition in the line search algorithm for PDHG with Euclidean proximal operators [36, Algorithm 4], but it is not identical, even in this special case. The proposed condition is weaker and allows larger stepsizes than the condition in [36, Algorithm 4].
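The following sketch illustrates the backtracking loop. Since the exit condition (49) is not reproduced here, we substitute the sufficient stepsize test from the proof of the lower bounds (50); in the actual algorithm, (49) is checked on the computed iterates and accepts larger stepsizes. All names are ours:

```python
import numpy as np

def line_search_step(x, z, z_prev, A, b, grad_h, bprox_f,
                     tau_prev, beta, L, normA, theta_bar=1.2, delta=0.99):
    # One iteration of (48) with backtracking. For illustration we accept a
    # candidate stepsize with the *sufficient* condition from the proof of
    # the lower bounds (50),
    #   beta*tau^2*||A||^2 + tau*L <= delta^2,
    # as a stand-in for the iterate-dependent exit condition (49).
    # bprox_f(y, a, tau) returns argmin_x f(x) + <a, x> + (1/tau)*d_p(x, y).
    i = 0
    while True:
        theta = theta_bar / 2**i                   # halve until accepted
        tau = theta * tau_prev
        sigma = beta * tau
        z_bar = z + theta * (z - z_prev)           # (48a): dual extrapolation
        x_new = bprox_f(x, grad_h(x) + A.T @ z_bar, tau)   # (48b)
        z_new = z + sigma * (A @ x_new - b)        # (48c): explicit dual update
        if beta * tau**2 * normA**2 + tau * L <= delta**2:
            return x_new, z_new, z, tau            # z becomes z_prev next time
        i += 1
```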

Convergence analysis
The proof strategy is the same as in [30, Section 3.3], extended to account for the function h. The main conclusion is an O(1/k) rate of ergodic convergence, shown in equation (57).

Lower bound on algorithm parameters
We first show that the stepsizes are bounded below by

τ_k ≥ τ_min,    σ_k ≥ σ_min = β τ_min,    (50)

where

τ_min = min { τ_{−1},  δ² / ( L + √(L² + 4δ²β‖A‖²) ) }.

The lower bounds imply that the backtracking eventually terminates with positive stepsizes σ_k and τ_k.
Proof. Applying (31), together with the Lipschitz condition (26), we see that the backtracking condition (49) holds at iteration k if 0 < δ < 1 and

β τ_k² ‖A‖² + τ_k L ≤ δ²,  i.e.,  β θ_k² τ_{k−1}² ‖A‖² + θ_k τ_{k−1} L ≤ δ².

Then mathematical induction can be used to prove (50). The two lower bounds (50) hold at k = 0 by the definition of τ_min and σ_min. Now assume τ_{k−1} ≥ τ_min, σ_{k−1} ≥ σ_min, and consider the kth iteration. The first attempt is θ_k = θ̄_k ≥ 1. If this value is accepted, then

τ_k = θ̄_k τ_{k−1} ≥ τ_{k−1} ≥ τ_min,    σ_k = β τ_k ≥ σ_min.

Otherwise, one or more backtracking steps are needed. Denote by θ̃_k the last rejected value. Then

β θ̃_k² τ_{k−1}² ‖A‖² + θ̃_k τ_{k−1} L > δ²,

and the accepted value satisfies θ_k = θ̃_k / 2. Therefore,

τ_k = θ̃_k τ_{k−1} / 2 > δ² / ( L + √(L² + 4δ²β‖A‖²) ) ≥ τ_min,

since the positive root of the quadratic equation β s² ‖A‖² + s L = δ² is s = 2δ² / ( L + √(L² + 4δ²β‖A‖²) ).
Proof. The optimality condition for the primal prox-operator (48b) gives

f(x) ≥ f(x^{(k+1)}) + (1/τ_k) ⟨∇φ_p(x^{(k)}) − ∇φ_p(x^{(k+1)}), x − x^{(k+1)}⟩ − ⟨∇h(x^{(k)}) + A^T z̄^{(k+1)}, x − x^{(k+1)}⟩

for all x ∈ dom f ∩ dom φ_p. Hence we obtain the bound (52); the second inequality in (52) follows from the convexity of h, i.e., h(x) ≥ h(x^{(k)}) + ⟨∇h(x^{(k)}), x − x^{(k)}⟩. The dual update (48c) implies that

A x^{(k+1)} − b = (1/σ_k) ( z^{(k+1)} − z^{(k)} ).    (53)

Evaluating the equality (53) against different dual points yields the identities (54) and (55) used below.
We evaluate this at z = z^{(i)} and add it to the equality at z = z^{(i−2)}, multiplied by θ_{i−1}; this gives (54) and (55). Now we combine (52) for k = i − 1 with (54) and (55). For i ≥ 1, this yields the desired result (51): the first inequality follows from (52), in the second-to-last step we substitute (54) and (55), and the last step uses the line search exit condition (49) at k = i − 1.

Ergodic convergence
We define the averaged primal and dual sequences

x_avg^{(k)} = ( Σ_{i=1}^k τ_{i−1} x^{(i)} ) / ( Σ_{i=1}^k τ_{i−1} ),    z̄_avg^{(k)} = ( Σ_{i=1}^k τ_{i−1} z̄^{(i)} ) / ( Σ_{i=1}^k τ_{i−1} )

for k ≥ 1, and show the ergodic bound (56) for all x ∈ dom f ∩ dom φ_p and all z. This holds for any choice of δ ∈ (0, 1] in (49). If we compare (56) and (37), we note that the two left-hand sides involve different dual iterates (z̄_avg^{(k)} as opposed to z_avg^{(k)}).
Proof. Summing (51) over i = 1, ..., k telescopes the right-hand side. Since L(u, v) is convex in u and affine in v,

Σ_{i=1}^k τ_{i−1} ( L(x^{(i)}, z) − L(x, z̄^{(i)}) ) ≥ ( Σ_{i=1}^k τ_{i−1} ) ( L(x_avg^{(k)}, z) − L(x, z̄_avg^{(k)}) ).

Dividing by Σ_{i=1}^k τ_{i−1} gives (56).
Substituting x = x⋆ and z = z⋆ in (58) yields an O(1/k) bound on the Lagrangian gap, since Ax⋆ = b. More generally, suppose X ⊆ dom f ∩ dom φ_p is a compact convex set containing an optimal solution x⋆ in its interior, and Z = {z | ‖z‖ ≤ γ} contains a dual optimal z⋆. Then the merit function η defined in (6) satisfies the O(1/k) bound (57): the second step in its derivation follows from (56), and the third step from the lower bounds (50), which give Σ_{i=1}^k τ_{i−1} ≥ k τ_min.

Monotonicity properties and convergence of iterates
For x = x⋆, z = z⋆, the left-hand side of (51) is nonnegative, and we obtain a monotonicity property of the distances to the optimal solution, for k ≥ 0. Moreover, summing (51) shows that the successive differences between iterates are summable. These inequalities hold for any value δ ∈ (0, 1]. In particular, the last inequality implies that z^{(i+1)} − z^{(i)} → 0. When δ < 1 it also implies that d_p(x^{(i+1)}, x^{(i)}) → 0 and, by the strong convexity assumption on φ_p, that x^{(i+1)} − x^{(i)} → 0. With additional assumptions similar to those in Section 5.2.3, one can show the convergence of the iterates; see [30, Section 3.3.4].

Bregman PD3O algorithm
In this section we propose the Bregman PD3O algorithm, another Bregman proximal method for the problem (1). Bregman PD3O also involves two generalized distances, d_p and d_d, generated by φ_p and φ_d, respectively, and it consists of the iterations

x^{(k+1)} = argmin_x ( f(x) + ⟨∇h(x^{(k)}) + A^T z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}) )    (59a)
z^{(k+1)} = argmin_z ( g^*(z) − ⟨A( 2x^{(k+1)} − x^{(k)} + τ∇h(x^{(k)}) − τ∇h(x^{(k+1)}) ), z⟩ + (1/σ) d_d(z, z^{(k)}) ).    (59b)

The only difference between Bregman PD3O and the Bregman primal Condat-Vũ algorithm (23) is the additional term τ(∇h(x^{(k)}) − ∇h(x^{(k+1)})) in the z-update. Thus the two algorithms (23) and (59) reduce to the same method when h is absent from problem (1). The additional term allows PD3O to use larger stepsizes than the Condat-Vũ algorithm. If we use the same matrix norm ‖A‖ and Lipschitz constant L in the analysis of the two methods, then the conditions are

στ‖A‖² + τL ≤ 1  (Condat-Vũ),    στ‖A‖² ≤ 1 and τL ≤ 1  (PD3O).    (60)

The range of possible parameters is illustrated in Figure 7. In Section 7.1 we provide the detailed convergence analysis of the Bregman PD3O method. The connections between Bregman PD3O and several other Bregman proximal methods are discussed in Section 7.2.

Assumptions. Throughout Section 7 we make the following assumptions. The kernel functions φ_p and φ_d are 1-strongly convex with respect to the Euclidean norm and an arbitrary norm ‖·‖_d, respectively:

d_p(x, x′) ≥ (1/2)‖x − x′‖²,    d_d(z, z′) ≥ (1/2)‖z − z′‖_d².    (61)

The assumption that the strong convexity constants are one can be made without loss of generality, by scaling the distances. The definition of ‖A‖ follows (27) and reduces to

‖A‖ = sup { ‖Ax‖_{d,*} | ‖x‖ ≤ 1 }.

We also assume that the gradient of h is L-Lipschitz continuous with respect to the Euclidean norm: dom h = R^n and

h(y) − h(x) − ⟨∇h(x), y − x⟩ ≤ (L/2)‖y − x‖²  for all x, y ∈ dom h.    (62)
The parameters τ and σ must satisfy

στ‖A‖² ≤ 1,    τL ≤ 1.    (63)

Finally, we assume that the optimality conditions (3) have a solution (x⋆, z⋆) ∈ dom φ_p × dom φ_d. Note that (62) is a stronger assumption than (26). (Combined with the first inequality in (61), it implies (26).) We will use the following consequence of (62):

h(y) − h(x) − ⟨∇h(x), y − x⟩ ≥ (1/(2L)) ‖∇h(y) − ∇h(x)‖²    (64)

for all x, y [38, Theorem 2.1.5].
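In code, the iteration (59) is a one-line change to the Bregman primal Condat-Vũ sketch of Section 5 (oracle conventions as before, names ours):

```python
def bregman_pd3o(x, z, A, grad_h, bprox_f, bprox_gconj, tau, sigma, iters=100):
    # Bregman PD3O iteration (59). It differs from the Bregman primal
    # Condat-Vu iteration (23) only in the correction
    # tau*(grad_h(x) - grad_h(x+)) inside the z-update, which permits the
    # looser stepsizes sigma*tau*||A||^2 <= 1 and tau*L <= 1.
    g = grad_h(x)
    for _ in range(iters):
        x_new = bprox_f(x, g + A.T @ z)                                  # (59a)
        g_new = grad_h(x_new)
        z = bprox_gconj(z, -(A @ (2 * x_new - x + tau * (g - g_new))))   # (59b)
        x, g = x_new, g_new
    return x, z
```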

A primal-dual Bregman distance
We introduce a primal-dual kernel function φ_pd3o, parametrized by σ, τ > 0. If φ_pd3o is convex, we denote the Bregman distance it generates by d_pd3o, given in (65). We now show that φ_pd3o is convex if στ‖A‖² ≤ 1.
Proof. It is sufficient to show that d_pd3o is nonnegative. In the first step of the computation we use the strong convexity assumption (61), the definition (27) of ‖A‖ with ‖·‖_p = ‖·‖, and the assumption στ‖A‖² ≤ 1; the bound on d_d(z, z′) follows from (61). Note that the convexity of φ_pd3o only requires the first inequality in the stepsize condition (63). Although the Bregman PD3O algorithm (59) is not the Bregman proximal point method for the Bregman kernel φ_pd3o, the distance d_pd3o will appear in the key inequality (67) of the convergence analysis.

One-iteration analysis
We first show that the iterates x^{(k+1)}, z^{(k+1)} generated by Bregman PD3O (59) satisfy

L(x^{(k+1)}, z) − L(x, z^{(k+1)}) ≤ d_pd3o(x, z; x^{(k)}, z^{(k)}) − d_pd3o(x, z; x^{(k+1)}, z^{(k+1)})    (67)

for all x ∈ dom f ∩ dom φ_p and z ∈ dom g^* ∩ dom φ_d.
Proof. Recall that Bregman PD3O differs from the Bregman primal Condat-Vũ algorithm (23) only in an additional term in the dual update. The proof in Section 5.2 therefore applies up to (35), with

x̄ = 2x^{(k+1)} − x^{(k)} + τ( ∇h(x^{(k)}) − ∇h(x^{(k+1)}) ),    z̄ = z^{(k)}.

Substituting this choice of (x̄, z̄) into (34) and applying the definition (29) of d_− yields the intermediate bound. Step 3 follows from the definition (65) of d_pd3o. In step 4 we use the Lipschitz condition (64) and the second inequality in the stepsize condition (63). The last step follows from the fact that d_pd3o is nonnegative (66).

Ergodic convergence
The iterates generated by Bregman PD3O (59) satisfy

L(x_avg^{(k)}, z) − L(x, z_avg^{(k)}) ≤ (1/k) d_pd3o(x, z; x^{(0)}, z^{(0)})    (68)

for all x ∈ dom f ∩ dom φ_p and all z ∈ dom g^* ∩ dom φ_d, where the averaged iterates are defined in (36).
Proof. From (67), since L(u, v) is convex in u and concave in v,

k ( L(x_avg^{(k)}, z) − L(x, z_avg^{(k)}) ) ≤ Σ_{i=1}^k ( L(x^{(i)}, z) − L(x, z^{(i)}) ) ≤ d_pd3o(x, z; x^{(0)}, z^{(0)})

for all x ∈ dom f ∩ dom φ_p and z ∈ dom g^* ∩ dom φ_d. The last step uses the nonnegativity of d_pd3o, which follows from the expression (65).

Relation to other Bregman proximal algorithms
The proposed algorithm (59) can be viewed as an extension of PD3O (16) to generalized distances, and it reduces to several Bregman proximal methods. These algorithms can be organized into a diagram similar to Figure 3; Figure 8 starts from Bregman PD3O (59) and summarizes its connections to several Bregman proximal methods. When h = 0, (59) reduces to Bregman PDHG, and when g = 0, (59) reduces to the Bregman proximal gradient algorithm. The Bregman Loris-Verhoeven algorithm is Bregman PD3O with f = 0:

x^{(k+1)} = argmin_x ( ⟨∇h(x^{(k)}) + A^T z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}) )
z^{(k+1)} = argmin_z ( g^*(z) − ⟨A( 2x^{(k+1)} − x^{(k)} + τ∇h(x^{(k)}) − τ∇h(x^{(k+1)}) ), z⟩ + (1/σ) d_d(z, z^{(k)}) ).

This algorithm has been discussed in [17] under the name NEPAPC. Setting A = I (with σ = 1/τ), we obtain a new variant of the Bregman proximal gradient algorithm:

x^{(k+1)} = argmin_x ( ⟨∇h(x^{(k)}) − z^{(k)}, x⟩ + (1/τ) d_p(x, x^{(k)}) ),    (69a)

together with the corresponding dual update (69b). The difference between (69) and (44) is the additional term τ(∇h(x^{(k)}) − ∇h(x^{(k+1)})), the same as the difference between (23) and (59). When the Euclidean proximal operator is used, (69) reduces to the proximal gradient method. However, the new algorithm (69) does not seem to be equivalent to the Bregman proximal gradient algorithm, due to the lack of a Moreau decomposition in the generalized case. Nevertheless, the new algorithm (69) may still be interesting in its own right, especially when the generalized proximal operator of g^* is easy to compute while the (Euclidean or generalized) proximal operator of g is computationally expensive. Finally, setting A = I (and σ = 1/τ) in Bregman PD3O (59) gives a Bregman Davis-Yin algorithm.

Numerical experiment
In this section we evaluate the performance of the Bregman primal Condat-Vũ algorithm (23), the Bregman dual Condat-Vũ algorithm with line search (48), and Bregman PD3O (59). The main goal of the example is to validate and illustrate the difference in the stepsize conditions (60), and the usefulness of the line search procedure. We consider the convex optimization problem

minimize  ψ(x) = λ‖Ax‖₁ + (1/2)‖Cx − b‖²  subject to  1^T x = 1, x ⪰ 0,    (70)

where x ∈ R^n is the optimization variable, C ∈ R^{m×n}, and A ∈ R^{(n−1)×n} is the difference matrix

A = [ −1   1   0  ⋯   0   0
       0  −1   1  ⋯   0   0
       ⋮              ⋱
       0   0   0  ⋯  −1   1 ].    (71)

This problem is of the form of (1) with

f(x) = δ_H(x),    g(y) = λ‖y‖₁,    h(x) = (1/2)‖Cx − b‖²,

where δ_H is the indicator function of the hyperplane H = {x ∈ R^n | 1^T x = 1}. We use the relative entropy distance

d_p(x, y) = Σ_{i=1}^n ( x_i log(x_i / y_i) − x_i + y_i )

in the primal space. This distance is 1-strongly convex with respect to the ℓ₁-norm [5] (and also the ℓ₂-norm). With the relative entropy distance, all the primal iterates x^{(k)} remain feasible, and the Bregman proximal operator of f has a closed-form solution:

prox^{φ_p}_f(y, a)_i = y_i e^{−a_i} / ( Σ_{j=1}^n y_j e^{−a_j} ),  i = 1, ..., n.

In the dual space we use the Euclidean distance. Thus, the matrix norm (27) in the stepsize condition (30) for the Bregman Condat-Vũ algorithms is the (1,2)-operator norm

‖A‖_{1,2} = max_{i=1,...,n} ‖a_i‖,

where a_i is the ith column of A. In the Bregman PD3O algorithm, we use the squared Euclidean distance d_p(x, y) = (1/2)‖x − y‖², and the matrix norm in the stepsize condition (63) is the spectral norm ‖A‖₂. For the difference matrix (71), ‖A‖₂ is bounded above by 2, and is very close to this upper bound for large n.
The Lipschitz constant of ∇h with respect to the ℓ₁-norm is the largest absolute value of the elements of C^T C, i.e., L₁ = max_{i,j} |(C^T C)_{ij}|. This value is used in the stepsize condition (30) for the Bregman Condat-Vũ algorithms. The Lipschitz constant with respect to the ℓ₂-norm is L₂ = ‖C‖₂², which is used in the stepsize condition (63) for Bregman PD3O. The matrix norms and Lipschitz constants are summarized as follows:

‖A‖_{1,2} = max_i ‖a_i‖ = √2,    ‖A‖₂ ≤ 2,    L₁ = max_{i,j} |(C^T C)_{ij}|,    L₂ = ‖C‖₂².

In the example we use the exact values of L₁ and L₂. The (Euclidean) proximal operator of g^* is the projection onto the infinity-norm ball {z | ‖z‖_∞ ≤ λ}:

prox_{σg^*}(z)_i = min{ λ, max{ −λ, z_i } }.

The experiment is carried out in Python 3.6 on a desktop with an Intel Core i5 2.4GHz CPU and 8GB RAM. We set m = 500 and n = 10,000. The elements of the matrix C ∈ R^{m×n} and the vector b ∈ R^m are randomly generated from independent standard Gaussian distributions. For the constant stepsize option, we choose the two stepsize pairs given in (72). These two choices, as well as the range of possible parameters, are illustrated in Figure 9. The two choices lie on the blue and red curves, respectively, and satisfy the requirement (60) with equality. For the line search algorithm, we set θ̄_k = 1.2 to encourage more aggressive updates, and fix the ratio β = σ_k/τ_k, which is consistent with the choice in (72). We solve the problem (70) using the Bregman primal Condat-Vũ algorithm (23), the Bregman dual Condat-Vũ algorithm with line search (48), and Bregman PD3O (59). Figure 10 shows the relative difference between the function values and the optimal value ψ⋆, which is computed via CVXPY [22]. Comparison between the Bregman primal Condat-Vũ algorithm and Bregman PD3O shows that Bregman PD3O converges faster. Figure 10 also compares the performance of the Bregman primal Condat-Vũ algorithm with constant stepsizes and the Bregman dual algorithm with line search. One can see clearly that the line search significantly improves the convergence. On the other hand, the line search does not add much computational overhead, as the plots of the CPU time and the number of iterations are roughly identical. In these experiments, Bregman PD3O and the Bregman dual Condat-Vũ algorithm with line search have similar performance, without one algorithm being conclusively better than the other.
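For completeness, a sketch of the closed-form Bregman proximal operator of f used in this experiment (a softmax-style update; the numerical stability shift is ours):

```python
import numpy as np

def entropy_prox_simplex(y, a):
    # Bregman proximal operator of f = delta_H (the constraint 1^T x = 1)
    # with the relative entropy kernel:
    #   argmin_{1^T x = 1} <a, x> + sum_i x_i*log(x_i/y_i)
    # has the closed form x_i = y_i*exp(-a_i) / sum_j y_j*exp(-a_j).
    w = y * np.exp(-(a - a.min()))   # shift the exponent for numerical stability
    return w / w.sum()

y = np.full(5, 0.2)                  # uniform starting point in the simplex
x = entropy_prox_simplex(y, np.array([0.3, -1.0, 0.0, 2.0, 0.5]))
print(x, x.sum())                    # iterate stays feasible: x > 0, 1^T x = 1
```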


Figure 7: Acceptable stepsizes in the Condat-Vũ algorithms and PD3O. We assume the same matrix norm ‖A‖ and Lipschitz constant L are used in the analysis of the two algorithms. The light gray region under the blue curve is defined by the inequality for the Condat-Vũ algorithms in (60). The region under the red curve shows the values allowed by the stepsize conditions for PD3O.


Figure 9: The blue and red curves show the boundaries of the stepsize regions for the Bregman Condat-Vũ algorithms and Bregman PD3O, respectively. The blue and red points indicate the chosen parameters in (72) (red for PD3O, blue for Condat-Vũ). In the Bregman dual Condat-Vũ algorithm with line search, the stepsizes are selected on the dashed straight line. The solid line segment shows the range of stepsizes that were selected, with dots indicating the largest, median, and smallest stepsizes.