Abstract
The paper presents primal–dual proximal splitting methods for convex optimization, in which generalized Bregman distances are used to define the primal and dual proximal update steps. The methods extend the primal and dual Condat–Vũ algorithms and the primal–dual three-operator (PD3O) algorithm. The Bregman extensions of the Condat–Vũ algorithms are derived from the Bregman proximal point method applied to a monotone inclusion problem. Based on this interpretation, a unified framework for the convergence analysis of the two methods is presented. We also introduce a line search procedure for stepsize selection in the Bregman dual Condat–Vũ algorithm applied to equality-constrained problems. Finally, we propose a Bregman extension of PD3O and analyze its convergence.
1 Introduction
We discuss proximal splitting methods for optimization problems in the form
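$$\begin{aligned} \text{minimize}\quad f(x) + g(Ax) + h(x) \end{aligned}$$(1)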
where f, g, and h are convex functions and h is differentiable. This general problem covers a wide variety of applications in machine learning, signal and image processing, operations research, control, and other fields [11, 19, 31, 40]. In this paper, we consider proximal splitting methods based on Bregman distances for solving (1) and some interesting special cases of (1).
Recently, several primal–dual first-order methods have been proposed for the three-term problem (1): the Condat–Vũ algorithm [20, 50, 53], the primal–dual three-operator (PD3O) algorithm [51], and the primal–dual Davis–Yin (PDDY) algorithm [44]. Algorithms for some special cases of (1) are also of interest. These include the Chambolle–Pock algorithm, also known as the primal–dual hybrid gradient (PDHG) method [10, 12] (when \(h=0\)), the Loris–Verhoeven algorithm [15, 23, 34] (when \(f=0\)), the proximal gradient algorithm (when \(g=0\)), and the Davis–Yin splitting algorithm [21] (when \(A=I\)). All these methods handle the nonsmooth functions f and g via the standard Euclidean proximal operator.
To further improve the efficiency of proximal algorithms, proximal operators based on generalized Bregman distances have been proposed and incorporated in many methods [2, 3, 6, 14, 24, 27, 35, 46, 48]. Bregman distances offer two potential benefits. First, the Bregman distance can help build a more accurate local optimization model around the current iterate. This is often interpreted as a form of preconditioning. For example, diagonal or quadratic preconditioning [29, 33, 41] has been shown to improve the practical convergence of PDHG, as well as the accuracy of the computed solution [1]. As a second benefit, a Bregman proximal operator of a function may be easier to compute than the standard Euclidean proximal operator and therefore reduce the complexity per iteration of an optimization algorithm. Recent applications of this kind include optimal transport problems [16], optimization over nonnegative trigonometric polynomials [13], and sparse semidefinite programming [30].
Extending standard proximal methods and their convergence analysis to Bregman distances is not straightforward because some fundamental properties of the Euclidean proximal operator no longer hold for Bregman proximal operators. An example is the Moreau decomposition, which relates the (Euclidean) proximal operators of a closed convex function and its conjugate [37]. Another example is the simple relation between the proximal operator of a function g and the proximal operator of the composition g(Ax) with a linear mapping, valid when \(AA^T\) is a multiple of the identity; see, e.g., [4, 19]. This composition rule is used in [39] to establish the equivalence between some well-known first-order proximal methods for problem (1) with \(A=I\) and with general A.
The purpose of this paper is to present new Bregman extensions and convergence results for the Condat–Vũ and PD3O algorithms. The main contributions are as follows.
-
The Condat–Vũ algorithm [20, 50] exists in a primal and a dual variant. We discuss extensions of the two algorithms that use Bregman proximal operators in the primal and dual updates. The Bregman primal Condat–Vũ algorithm first appeared in [12, Algorithm 1] and is also a special case of the algorithm proposed in [52] for a more general convex–concave saddle point problem. We give a new derivation of this method and its dual variant by applying the Bregman proximal point method to the primal–dual optimality conditions. Based on this interpretation, we provide a unified framework for the convergence analysis of the two methods and show an O(1/k) ergodic convergence rate, which is consistent with previous results for Euclidean proximal operators in [20, 50] and Bregman proximal operators in [12]. We also give a convergence result for the primal and dual iterates.
-
We propose an easily implemented backtracking line search technique for selecting stepsizes in the Bregman dual Condat–Vũ algorithm for problems with equality constraints. The proposed backtracking procedure is similar to the technique in [36] for the special setting of PDHG with Euclidean proximal operators, but has some important differences even in this special case. We give a detailed analysis of the algorithm with line search and recover the O(1/k) ergodic rate of convergence for related algorithms in [30, 36].
-
We propose a Bregman extension for PD3O and establish an ergodic convergence result.
The paper is organized as follows. Section 2 gives a precise statement of the problem (1) and reviews the duality theory that will be used in the rest of the paper. In Sect. 3, we review some well-known first-order proximal methods and establish connections between them. Section 4 provides some necessary background on Bregman distances. In Sect. 5, we discuss the Bregman primal and dual Condat–Vũ algorithms and analyze their convergence. The line search technique and its convergence are discussed in Sect. 6. In Sect. 7, we extend PD3O to a Bregman proximal method and analyze its convergence. Section 8 contains results of a numerical experiment.
2 Duality Theory and Merit Functions
This section summarizes the facts from convex duality theory that underlie the primal–dual methods discussed in the paper. We also describe primal–dual merit functions that will be used in the convergence analysis.
We use the notation \(\langle x, y\rangle = x^Ty\) for the standard inner product of vectors x and y, and \(\Vert x\Vert = \langle x, x\rangle ^{1/2}\) for the Euclidean norm of a vector x. Other norms will be distinguished by a subscript.
2.1 Problem Formulation
In (1), the vector x is an n-vector and A is an \(m\times n\) matrix. The functions f, g, and h are closed and convex, and h is differentiable, i.e.,
where \(\mathop {\textbf{dom}}h\) is an open convex set. We assume that \(f+h\) and g are proper, i.e., have nonempty domains.
An important example of (1) is \(g = \delta _C\), the indicator function of a closed convex set C. With \(g=\delta _C\), the problem is equivalent to
For \(C =\{b\}\), the constraints are a set of linear equations \(Ax=b\). This special case actually covers all applications of the more general problem (1), since (1) can be reformulated as
at the expense of increasing the problem size by adding a splitting variable y.
2.2 Dual Problem and Optimality Conditions
The dual of problem (1) is
where \((f+h)^*\) and \(g^*\) are the conjugates of \(f+h\) and g:
The conjugate \((f+h)^*\) is the infimal convolution of \(f^*\) and \(h^*\):
The primal–dual optimality conditions for (1) and (2) are
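$$\begin{aligned} 0 \in \partial f(x) + \nabla h(x) + A^Tz, \qquad 0 \in \partial g^*(z) - Ax. \end{aligned}$$(3)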
Here, \(\partial f\) and \(\partial g^*\) are the subdifferentials of f and \(g^*\). We often write the optimality conditions as
Throughout the paper, we assume the optimality conditions (3) are solvable.
We will refer to the convex–concave function
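$$\begin{aligned} \mathcal {L}(x,z) = f(x) + h(x) + \langle z, Ax\rangle - g^*(z) \end{aligned}$$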
as the Lagrangian of (1). We follow the convention that \(\mathcal {L}(x,z) = +\infty \) if \(x\not \in \mathop {\textbf{dom}}(f+h)\) and \(\mathcal {L}(x,z) = -\infty \) if \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\not \in \mathop {\textbf{dom}}g^*\). The objective functions in (1) and the dual problem (2) can be expressed as
Solutions \(x^\star \), \(z^\star \) of the optimality conditions (3) form a saddle point of \(\mathcal {L}\):
In particular, \(\mathcal {L}(x^\star , z^\star )\) is the optimal value of (1) and (2).
2.3 Merit Functions
The algorithms discussed in this paper generate primal and dual iterates and approximate solutions x, z with \(x \in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\). The feasibility conditions \(Ax \in \mathop {\textbf{dom}}g\) and \(-A^Tz \in \mathop {\textbf{dom}}(f+h)^*\) are not necessarily satisfied. Hence, the duality gap
may not always be useful as a merit function to measure convergence.
If we add constraints \(x'\in X\) and \(z'\in Z\) to the optimization problems on the left-hand side of (5), where X and Z are compact convex sets, we obtain a function
defined for all \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\). This follows from the fact that the functions \(f+h+\delta _X\) and \(g^*+\delta _Z\) are closed and co-finite, so their conjugates have full domain [43, Corollary 13.3.1]. If \(\eta (x,z)\) is easily computed, and \(\eta (x,z) \ge 0\) for all \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\) with equality only if x and z are optimal, then the function \(\eta \) can serve as a merit function in primal–dual algorithms for problem (1).
If \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\) are bounded, then X and Z can be chosen to contain \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\). Then, the constraints in (6) are redundant and \(\eta (x,z)\) is the duality gap (5). Boundedness of \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\) is a common assumption in the literature on primal–dual first-order methods.
A weaker assumption is that (1) has an optimal solution \(x^\star \in \mathop {\textbf{int}}(X)\) and (2) has an optimal solution \(z^\star \in \mathop {\textbf{int}}(Z)\). Then, \(\eta (x,z) \ge 0\) for all \(x \in \mathop {\textbf{dom}}(f+h)\) and \(z \in \mathop {\textbf{dom}}g^*\), with equality \(\eta (x,z) = 0\) only if x, z are optimal for (1) and (2). To see this, we first express the two terms in (6) as
where \(\sigma _X = \delta _X^*\) and \(\sigma _Z = \delta _Z^*\) are the support functions of X and Z, respectively. Consider the problem of minimizing \(\eta (x,z)\). By expanding the infimal convolutions in the expressions for the two terms of \(\eta \), this convex optimization problem can be formulated as
with variables x, y, z, w. The dual of this problem is
with variables \(\bar{x}, \bar{z}\). The optimality conditions for (7) and (8) include the conditions \(Ax-y \in N_Z(\bar{z})\) and \(-A^Tz - w \in N_X(\bar{x})\), where \(N_X(\bar{x}) = \partial \delta _X(\bar{x})\) is the normal cone to X at \(\bar{x}\), and \(N_Z(\bar{z}) = \partial \delta _Z(\bar{z})\) the normal cone to Z at \(\bar{z}\). By assumption, there exist points \(x^\star \in \mathop {\textbf{int}}(X)\) and \(z^\star \in \mathop {\textbf{int}}(Z)\) that are optimal for the original problem (1) and its dual (2). It can be verified that \((x,y,z,w) = (x^\star , Ax^\star , z^\star , -A^Tz^\star )\), \((\bar{x}, \bar{z}) = (x^\star , z^\star )\) are optimal for (7) and (8), and that \(\eta (x^\star ,z^\star ) = 0\). Now let \((\hat{x},\hat{z})\) be any other minimizer of \(\eta \), i.e., \(\eta (\hat{x}, \hat{z}) = 0\). Then, \(\hat{x}, \hat{z}\) and the corresponding minimizers \(\hat{y}, \hat{w}\) in (7), must satisfy the optimality conditions with the optimal dual variables \(\bar{x}= x^\star \), \(\bar{z} = z^\star \). In particular, \(A\hat{x} -\hat{y} \in N_Z(z^\star ) = \{0\}\) and \(-A^T\hat{z} -\hat{w} \in N_X(x^\star ) = \{0\}\). The objective value of (7) at this point then reduces to \(0 = f(\hat{x}) + h(\hat{x}) + g(A\hat{x}) + g^*(\hat{z}) + (f+h)^*(-A^T\hat{w})\), the duality gap associated with the original problem and its dual. This shows that \(\eta (\hat{x}, \hat{z}) =0\) implies that \(\hat{x}, \hat{z}\) are optimal for problem (1) and (2).
Consider for example the primal and dual pair
Here, \(g= \delta _{\{b\}}\). If we take \(Z = \{z \mid \Vert z\Vert \le \gamma \}\), then \(\sigma _Z(y)=\gamma \Vert y\Vert \), and the infimal convolution of g and \(\sigma _Z\) is the exact penalty function \(\gamma \Vert y-b\Vert \). If in addition \(\mathop {\textbf{dom}}(f+h)\) is bounded and we take \(X \supseteq \mathop {\textbf{dom}}(f+h)\), then
with domain \(\mathop {\textbf{dom}}(f+h) \times {\textbf{R}}^m\). The first three terms are the primal objective augmented with an exact penalty for the constraint \(Ax=b\).
As another example, consider
This is an example of (1) with \(f(x)=\Vert x\Vert _1\), \(h(x)=0\), and g the indicator function of \(\{y \mid y \le b\}\). The domains \(\mathop {\textbf{dom}}f\) and \(\mathop {\textbf{dom}}g^*\) are unbounded. If we choose \(X = \{ x \mid \Vert x\Vert _\infty \le \kappa \}\) and \(Z = \{z \mid \Vert z\Vert _\infty \le \lambda \}\), then the merit function (6) for this example is
with domain \({\textbf{R}}^n \times {\textbf{R}}^m_+\). The second term is an exact penalty for the primal constraint \(Ax\le b\). The last term is an exact penalty for the dual constraint \(\Vert A^Tz\Vert _\infty \le 1\).
3 First-Order Proximal Algorithms: Survey and Connections
In this section, we discuss several first-order proximal algorithms and their connections. We start with four three-operator splitting algorithms for problem (1): the primal and dual variants of the Condat–Vũ algorithm [20, 50], the primal–dual three-operator (PD3O) algorithm [51], and the primal–dual Davis–Yin (PDDY) algorithm [44]. For each of the four algorithms, we make connections with other first-order proximal algorithms, using reduction (i.e., setting some of the functions in (1) to zero) and the “completion” reformulation (based on extending A to a matrix with orthogonal rows and equal row norms) [39]. We focus on the formal connections between algorithms. The connections do not necessarily provide the best approach for convergence analysis or the best known convergence results.
The proximal operator or proximal mapping of a closed convex function \(f :{\textbf{R}}^n \rightarrow {\textbf{R}}\) is defined as
If f is closed and convex, the minimizer in the definition exists and is unique for all y [37]. We will call (9) the standard or Euclidean proximal operator when we need to distinguish it from Bregman proximal operators in Section 4.
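For illustration, the proximal operator of \(t\Vert \cdot \Vert _1\) is the componentwise soft-thresholding map; the following is a minimal NumPy sketch (the function name is a placeholder chosen for this example):

```python
import numpy as np

def prox_l1(y, t):
    # Euclidean proximal operator of t*||.||_1, i.e.,
    # argmin_x  t*||x||_1 + (1/2)*||x - y||^2  (componentwise soft-thresholding).
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
```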
3.1 Condat–Vũ Three-Operator Splitting Algorithm
We start with the (primal) Condat–Vũ three-operator splitting algorithm, which was proposed independently by Condat [20] and Vũ [50],
The stepsizes \(\sigma \) and \(\tau \) must satisfy \(\sigma \tau \Vert A\Vert _2^2 + \tau L \le 1\), where \(\Vert A\Vert _2\) is the spectral norm of A, and L is the Lipschitz constant of \(\nabla h\) with respect to the Euclidean norm. Many other first-order proximal algorithms can be viewed as special cases of (10), and their connections are summarized in Fig. 1.
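For readers who wish to prototype the method, the following is a minimal sketch of the iteration (10) in its commonly stated Euclidean form; the callables prox_f, prox_gconj (the proximal operators of \(\tau f\) and \(\sigma g^*\)) and grad_h are placeholders, and the sketch is illustrative rather than a verbatim transcription of (10).

```python
def condat_vu_primal(x, z, A, grad_h, prox_f, prox_gconj, tau, sigma, num_iters):
    # Sketch of the (Euclidean) primal Condat-Vu iteration:
    #   x_{k+1} = prox_{tau f}( x_k - tau*(grad h(x_k) + A^T z_k) )
    #   z_{k+1} = prox_{sigma g*}( z_k + sigma*A*(2*x_{k+1} - x_k) )
    for _ in range(num_iters):
        x_next = prox_f(x - tau * (grad_h(x) + A.T @ z), tau)
        z = prox_gconj(z + sigma * (A @ (2 * x_next - x)), sigma)
        x = x_next
    return x, z
```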
When \(h=0\), algorithm (10) reduces to the (primal) primal–dual hybrid gradient (PDHG) method [10, 12, 42], or PDHGMu in [26]. When \(g=0\) in (10) (and assuming \(z_0=0\)), we obtain the proximal gradient algorithm. When \(f=0\), we obtain a variant of the Loris–Verhoeven algorithm, which will be referred to as the Loris–Verhoeven algorithm with shift, for reasons that will be clarified later. If we further set \(A=I\), we obtain the reduced Loris–Verhoeven algorithm with shift. However, due to the absence of f in the reduced Loris–Verhoeven algorithm with shift, it is not clear how to apply the “completion” trick to it. Furthermore, when \(A=I\) in PDHG, we obtain the Douglas–Rachford splitting (DRS) algorithm [18, 25, 32]. Conversely, the “completion” technique in [39] shows that PDHG coincides with DRS applied to a reformulation of the problem. Similarly, when \(A=I\) in the primal Condat–Vũ algorithm (10), we obtain a new algorithm and refer to it as the reduced primal Condat–Vũ algorithm. Conversely, the reduced primal Condat–Vũ algorithm reverts to (10) via the “completion” trick.
Condat [20] also discusses a variant of (10), which we will call the dual Condat–Vũ algorithm:
Figure 2 summarizes the proximal algorithms derived from (11). When \(h=0\), algorithm (11) reduces to PDHG applied to the dual of (1) (with \(h=0\)), which is shown to be equivalent to linearized ADMM [40] (also called Split Inexact Uzawa in [26]). Setting \(g=0\) in (11) yields the proximal gradient algorithm. When \(f=0\), we obtain the dual Loris–Verhoeven algorithm with shift, following the previous naming convention, and if we further set \(A=I\), we obtain the reduced dual Loris–Verhoeven algorithm with shift. Moreover, setting \(A=I\) in (11) gives the reduced dual Condat–Vũ algorithm. Conversely, applying the “completion” trick to this reduced algorithm recovers (11). Similarly, setting \(A=I\) in dual PDHG gives dual DRS, i.e., DRS with f and g switched, and conversely, the “completion” trick recovers dual PDHG from dual DRS.
3.2 Primal–Dual Three-Operator (PD3O) Splitting Algorithm
The third diagram starts with the primal–dual three-operator (PD3O) splitting algorithm [51]
and is presented in Fig. 3.
Compared with the Condat–Vũ algorithm (10), PD3O seems to have slightly more complicated updates and a higher per-iteration complexity, but the requirement for the stepsizes is looser: \(\sigma \tau \Vert A\Vert _2^2 \le 1\) and \(\tau \le 1/L\). When \(h=0\), (12) reduces to the (primal) PDHG. The classical proximal gradient algorithm can be obtained by setting \(g=0\). The PD3O algorithm (12) with \(f=0\) was discovered independently as the Loris–Verhoeven algorithm [34], the primal–dual fixed point algorithm based on proximity operator (PDFP\(^2\)O) [15], and the proximal alternating predictor corrector (PAPC) [23]. Comparison with the Loris–Verhoeven algorithm with shift reveals a minor difference between these two algorithms: the gradient term in the z-update is taken at the newest primal iterate \(x_{k+1}\) in the Loris–Verhoeven algorithm and at the previous point \(x_k\) in the shifted version. This difference is inherited by the proximal gradient algorithm and its shifted version. Furthermore, when \(A=I\) and \(\sigma =1/\tau \) in PD3O, we recover the well-known Davis–Yin splitting (DYS) algorithm [21]. We can also set \(A=I\) in the Loris–Verhoeven algorithm and obtain the classical proximal gradient algorithm.
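A corresponding sketch of the PD3O iteration (12), again with placeholder callables; relative to the Condat–Vũ sketch above, the only change is the gradient correction term in the dual update, which is what permits the looser stepsize conditions.

```python
def pd3o(x, z, A, grad_h, prox_f, prox_gconj, tau, sigma, num_iters):
    # Sketch of the (Euclidean) PD3O iteration; note the extra term
    # tau*(grad_h(x) - grad_h(x_next)) inside the dual update.
    g = grad_h(x)
    for _ in range(num_iters):
        x_next = prox_f(x - tau * (g + A.T @ z), tau)
        g_next = grad_h(x_next)
        z = prox_gconj(z + sigma * (A @ (2 * x_next - x + tau * (g - g_next))), sigma)
        x, g = x_next, g_next
    return x, z
```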
3.3 Primal–Dual Davis–Yin (PDDY) Splitting Algorithm
The core algorithm in Fig. 4 is the primal–dual Davis–Yin (PDDY) splitting algorithm [44]
The requirement for stepsizes is the same as that in PD3O: \(\sigma \tau \Vert A\Vert _2^2 \le 1\) and \(\tau \le 1/L\). Figure 4 is almost identical to Fig. 3 with the roles of f and g exchanged. When \(h=0\), PDDY reduces to the dual PDHG. In addition, when \(A=I\) and \(\sigma =1/\tau \), PDDY reduces to the Davis–Yin algorithm, but with f and g exchanged. Similarly, when \(h=0\), \(A=I\) and \(\sigma =1/\tau \), PDDY reverts to the Douglas–Rachford algorithm with f and g switched.
We have seen that the middle and right parts of Fig. 4 are those of Fig. 3 with f and g switched. However, when one of the functions f or g is absent, the algorithms reduced from PD3O and PDDY are exactly the same. In particular, when \(f=0\), PDDY reduces to the Loris–Verhoeven algorithm.
4 Bregman Distances
In this section, we give the definition of Bregman proximal operators and the basic properties that will be used in the paper. We refer the interested reader to [9] for an in-depth discussion of Bregman distances, their history, and applications.
Let \(\phi \) be a convex function with a domain that has nonempty interior, and assume \(\phi \) is continuous on \(\mathop {\textbf{dom}}\phi \) and continuously differentiable on \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). The generalized distance (or Bregman distance) generated by the kernel function \(\phi \) is defined as the function
with domain \(\mathop {\textbf{dom}}d=\mathop {\textbf{dom}}\phi \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). The corresponding Bregman proximal operator of a function f is
It is assumed that for every a and every \(y \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\) the minimizer \(\hat{x}=\textrm{prox}_f^\phi (y,a)\) is unique and in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\).
The distance generated by the kernel \(\phi (x)=(1/2)\Vert x\Vert ^2\) is the squared Euclidean distance \(d(x,y) = (1/2) \Vert x-y\Vert ^2\). The corresponding Bregman proximal operator is the standard proximal operator applied to \(y-a\):
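$$\begin{aligned} \textrm{prox}_f^\phi (y,a) = \mathop {\textrm{argmin}}_x \Big (f(x) + \langle a, x\rangle + \frac{1}{2}\Vert x-y\Vert ^2\Big ) = \textrm{prox}_f(y-a). \end{aligned}$$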
For this distance, closedness and convexity of f guarantee that the proximal operator is well defined. The questions of existence and uniqueness are more complicated for general Bregman distances. There are no simple general conditions that guarantee that for every a and every \(y \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\) the generalized proximal operator (15) is uniquely defined and in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). Some sufficient conditions are provided (see, for example, [8, Section 4.1], [3, Assumption A]), but they may be quite restrictive or difficult to verify in practice. In applications, however, the Bregman proximal operator is used with specific combinations of f and \(\phi \), for which the minimization problem in (15) is particularly easy to solve. In those applications, existence and uniqueness of the solution follow directly from the closed-form solution or availability of a fast algorithm to compute it. A typical example will be provided in Sect. 8.
From the expression (16) and the definition of subgradient, we see that \(\hat{x}=\textrm{prox}_f^\phi (y,a)\) satisfies
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi \). The second inequality follows from the definition (14); see, e.g., [45, Lemma 4.1].
5 Bregman Condat–Vũ Three-Operator Splitting Algorithms
We now discuss two Bregman three-operator splitting algorithms for the problem (1). The algorithms use a generalized distance \(d_\mathrm p\) in the primal space, generated by a kernel \(\phi _\mathrm p\), and a generalized distance \(d_\textrm{d}\) in the dual space, generated by a kernel \(\phi _\textrm{d}\). The first algorithm is
and will be referred to as the Bregman primal Condat–Vũ algorithm. The second algorithm will be called the Bregman dual Condat–Vũ algorithm:
The two algorithms need starting points \(x_0 \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \cap \mathop {\textbf{dom}}h\), and \(z_0 \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\). Conditions on stepsizes \(\sigma \), \(\tau \) will be specified later. When Euclidean distances are used for the primal and dual proximal operators, the two algorithms reduce to the primal and dual variants of the Condat–Vũ algorithm (10) and (11), respectively. Algorithm (18) has been proposed in [12]. Here, we discuss it together with (19) in a unified framework.
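To make the structure of (18) concrete, here is a schematic sketch in which the Bregman proximal evaluations are abstracted as placeholder callables: breg_prox_f(y, a, t) is assumed to return \(\mathop {\textrm{argmin}}_x \,(f(x) + \langle a, x\rangle + (1/t)\, d_\mathrm p(x,y))\), and breg_prox_gconj is defined analogously with \(d_\textrm{d}\). With Euclidean kernels the sketch reduces to the primal Condat–Vũ iteration (10).

```python
def bregman_condat_vu_primal(x, z, A, grad_h, breg_prox_f, breg_prox_gconj,
                             tau, sigma, num_iters):
    # Schematic sketch of the Bregman primal Condat-Vu iteration:
    #   x_{k+1} = argmin_x  f(x)  + <A^T z_k + grad h(x_k), x>  + (1/tau)*d_p(x, x_k)
    #   z_{k+1} = argmin_z  g*(z) - <A(2*x_{k+1} - x_k), z>     + (1/sigma)*d_d(z, z_k)
    for _ in range(num_iters):
        x_next = breg_prox_f(x, A.T @ z + grad_h(x), tau)
        z = breg_prox_gconj(z, -(A @ (2 * x_next - x)), sigma)
        x = x_next
    return x, z
```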
In Sect. 5.1, we show that the proposed algorithms can be interpreted as the Bregman proximal point method applied to a monotone inclusion problem. In Sect. 5.2, we analyze their convergence. In Sect. 5.3, we discuss the connections between the two algorithms and other Bregman proximal splitting methods.
Assumption 1
-
(1.1)
The kernel functions \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) are 1-strongly convex with respect to norms \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\), respectively:
$$\begin{aligned} d_\mathrm p(x,x^\prime ) \ge \frac{1}{2}\Vert x-x^\prime \Vert ^2_\mathrm p, \qquad d_\textrm{d}(z,z^\prime ) \ge \frac{1}{2}\Vert z-z^\prime \Vert ^2_\textrm{d}\end{aligned}$$(20) for all \((x,x') \in \mathop {\textbf{dom}}d_\mathrm p\) and \((z,z') \in \mathop {\textbf{dom}}d_\textrm{d}\). The assumption that the strong convexity constants are equal to one can be made without loss of generality, by scaling the norms (or distances) if needed.
-
(1.2)
The function \(L\phi _\mathrm p-h\) is convex for some \(L>0\). More precisely, \(\mathop {\textbf{dom}}\phi _\mathrm p\subseteq \mathop {\textbf{dom}}h\) and
$$\begin{aligned} h(x)-h(x^\prime )-\langle \nabla h(x^\prime ), x-x^\prime \rangle \le Ld_\mathrm p(x,x^\prime ) \quad \text{ for } \text{ all } (x,x^\prime ) \in \mathop {\textbf{dom}}d_\mathrm p. \end{aligned}$$(21)
-
(1.3)
The primal–dual optimality conditions (3) have a solution \((x^\star , z^\star )\) with \(x^\star \in \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z^\star \in \mathop {\textbf{dom}}\phi _\textrm{d}\).
Note that Assumption 1.2 is looser than the one in [12, Equation (4)]. We denote by \(\Vert A\Vert \) the matrix norm
where \(\Vert \cdot \Vert _{\mathrm p,*}\) and \(\Vert \cdot \Vert _{\textrm{d},*}\) are the dual norms of \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\).
5.1 Derivation from Bregman Proximal Point Method
The Bregman Condat–Vũ algorithms (18) and (19) can be viewed as applications of the Bregman proximal point algorithm to the optimality conditions (3). This interpretation extends the derivation of the Bregman PDHG algorithm from the Bregman proximal point algorithm given in [30]. The idea originates with He and Yuan’s interpretation of PDHG as a “preconditioned” proximal point algorithm [28].
The Bregman proximal point algorithm [9, 24, 27] is an algorithm for monotone inclusion problems \(0 \in F(u)\). The update \(u_{k+1}\) in one iteration of the algorithm is defined as the solution of the inclusion
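$$\begin{aligned} 0 \in F(u_{k+1}) + \nabla \phi (u_{k+1}) - \nabla \phi (u_k), \end{aligned}$$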
where \(\phi \) is a Bregman kernel function. Applied to (3), with a kernel function \(\phi _\textrm{pd}\), the algorithm generates a sequence \((x_k, z_k)\) defined by
5.1.1 Primal–Dual Bregman Distances
We introduce four possible primal–dual kernel functions: the functions
where \(\sigma ,\tau > 0\), and the functions
The subscripts in \(\phi _+\) and \(\phi _-\) refer to the sign of the inner product term \(\langle z, Ax\rangle \). The subscripts in \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\) indicate the algorithm (Bregman primal or dual Condat-Vũ) for which these distances will be relevant. If these kernel functions are convex, they generate the following Bregman distances. The distances generated by \(\phi _+\) and \(\phi _-\) are
respectively, and the distances generated by \(\phi _\textrm{dcv}\) and \(\phi _\textrm{pcv}\) are
The following lemma provides sufficient conditions for the (strong) convexity of \(\phi _+\), \(\phi _-\), \(\phi _\textrm{pcv}\), and \(\phi _\textrm{dcv}\).
Lemma 1
The functions \(\phi _+\) and \(\phi _-\) are convex if \(\sigma \tau \Vert A\Vert ^2 \le 1\) and strongly convex if \(\sigma \tau \Vert A\Vert ^2 < 1\). The functions \(\phi _\textrm{dcv}\) and \(\phi _\textrm{pcv}\) are convex if
and strongly convex if \(\sigma \tau \Vert A\Vert ^2 + \tau L < 1\).
Proof
To show that the kernel functions \(\phi _+\) and \(\phi _-\) are convex, we show that \(d_+\) and \(d_-\) are nonnegative. Suppose \(\sigma \tau \Vert A\Vert ^2 \le \delta _1 \delta _2\) with \(\delta _1,\delta _2 > 0\). Then, (20) and the arithmetic–geometric mean inequality imply that
Therefore,
With \(\delta _1=\delta _2=1\), this shows convexity of \(\phi _+\) and \(\phi _-\); with \(\delta _1=\delta _2<1\), strong convexity. Similarly,
With \(\delta _1=1-\tau L\) and \(\delta _2=1\), this shows convexity of \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\); with \(\delta _1=\delta -\tau L\) and \(\delta _2=\delta < 1\), strong convexity. \(\square \)
5.1.2 Bregman Condat-Vũ Algorithms from Proximal Point Method
The Bregman primal Condat–Vũ algorithm (18) is the Bregman proximal point method with the kernel function \(\phi _\textrm{pd} = \phi _\textrm{pcv}\). If we take \(\phi _\textrm{pd} = \phi _\textrm{pcv}\) in (23), we obtain two coupled inclusions that determine \(x_{k+1}\), \(z_{k+1}\). The first one is
This shows that \(x_{k+1}\) solves the optimization problem
The solution is the x-update (18a) in the Bregman primal Condat–Vũ method. The second inclusion is
This shows that \(z_{k+1}\) solves the optimization problem
The solution is the z-update (18b).
Similarly, choosing \(\phi _\textrm{pd} = \phi _\textrm{dcv}\) in (23) yields the Bregman dual Condat–Vũ algorithm (19).
5.2 Convergence Analysis
The derivation in Sect. 5.1 allows us to apply existing convergence theory for the Bregman proximal point method to the proposed algorithms (18) and (19). In particular, Solodov and Svaiter [45] have studied Bregman proximal point methods with inexact prox-evaluations for solving variational inequalities, which include the monotone inclusion problem as a special case. The results in [45] can be applied to analyze convergence of the Bregman Condat–Vũ methods with inexact evaluations of proximal operators.
The literature on the Bregman proximal point method for monotone inclusions [24, 27, 45] focuses on the convergence of iterates, and this generally requires additional assumptions on \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) (beyond Assumption 1.1). In this section, we present a self-contained convergence analysis and give a direct proof of an O(1/k) rate of ergodic convergence (using only Assumption 1). We also give a self-contained proof of convergence of the iterates \(x_k\) and \(z_k\).
For the sake of brevity, we combine the analysis of the Bregman primal and Bregman dual Condat–Vũ algorithms. In the following, d, \(\tilde{d}\), \(\tilde{\phi }\) are defined as
5.2.1 One-Iteration Analysis
Lemma 2
Under Assumption 1 and the stepsize condition (25), the iterates generated by the Bregman Condat–Vũ algorithms (18) and (19) satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).
Proof
We write (18) and (19) in a unified notation as
where \(\tilde{x}\) and \(\tilde{z}\) are defined in the following table:
The optimality condition (17) for the proximal operator evaluation (28a) is
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\). The optimality condition for (28b) is that
for all \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). Combining the two inequalities gives
for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all \(z\in \mathop {\textbf{dom}}g^* \cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The second inequality follows from convexity of h. Substituting the expressions for \(\tilde{x}\) and \(\tilde{z}\) in the Bregman primal Condat–Vũ algorithm (18), we obtain for the last four terms on the right-hand side of (30)
If we substitute the expressions for \(\tilde{x}\) and \(\tilde{z}\) in the Bregman dual Condat–Vũ algorithm, the last four terms on the right-hand side of (30) are equal to the negative of the right-hand side of (31).
Therefore, for both algorithms, (30) implies that
if we select the minus sign in \({\mp }\) for the Bregman primal Condat–Vũ algorithm, and the plus sign for the Bregman dual Condat–Vũ algorithm. \(\square \)
5.2.2 Ergodic Convergence
We define averaged iterates
for \(k \ge 1\). The ergodic convergence of \((x^\textrm{avg}_k,z^\textrm{avg}_k)\) is given in the following theorem.
Theorem 1
Under Assumption 1 and the stepsize condition (25), the averaged iterates \((x^\textrm{avg}_k,z^\textrm{avg}_k)\) satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).
Proof
From (27) in Lemma 2, since \(\mathcal {L}(u,v)\) is convex in u and concave in v,
for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The last step follows from (26) with \(\delta _1=\delta _2=1\). \(\square \)
Substituting \(x=x^\star \), \(z=z^\star \) in (33) gives
More generally, if \(X \subseteq \mathop {\textbf{dom}}\phi _\mathrm p\) and \(Z \subseteq \mathop {\textbf{dom}}\phi _\textrm{d}\) are compact convex sets that contain optimal solutions \(x^\star \), \(z^\star \) in their interiors, then the merit function (6) is bounded by
5.2.3 Monotonicity Properties
We present an auxiliary result that will be used in Sect. 5.2.4 to show convergence of iterates.
Lemma 3
Under Assumption 1 and the stepsize condition (25), we have
for \(k \ge 0\), and
The inequality (34) also implies \(\sum _{i=0}^k \tilde{d}(x_{i+1}, z_{i+1}; x_i, z_i) \le d(x^\star , z^\star ; x_0, z_0)\), and thus \(\tilde{d}(x_{k+1}, z_{k+1}; x_k, z_k) \rightarrow 0\).
Proof
For \((x,z)=(x^\star ,z^\star )\), the left-hand side of (27) is nonnegative and (34) holds. Hence, \(d(x^\star , z^\star ; x_{k+1}, z_{k+1}) \le d(x^\star , z^\star ; x_k, z_k)\) and (35) holds. \(\square \)
5.2.4 Convergence of Iterates
Convergence of iterates can be obtained by combining the derivation in Sect. 5.1 with existing results on the Bregman proximal point method [27, Theorem 3.1], [45, Theorem 3.2]. Here, we provide a self-contained proof under additional assumptions about the primal and dual distance functions.
Assumption 2
-
(2.1)
For fixed x and z, the sublevel sets \(\{x^\prime \mid d_\mathrm p(x,x^\prime ) \le \gamma \}\) and \(\{z^\prime \mid d_\textrm{d}(z,z^\prime ) \le \gamma \}\) are closed. In other words, the distances \(d_\mathrm p(x,x^\prime )\) and \(d_\textrm{d}(z,z^\prime )\) are closed functions of \(x^\prime \) and \(z^\prime \), respectively. Since a sum of closed functions is closed, the distance \(d(x,z; x',z')\) is a closed function of \((x',z')\), for fixed (x, z).
-
(2.2)
If \(\tilde{x}_k \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p)\) converges to \(x \in \mathop {\textbf{dom}}\phi _\mathrm p\), then \(d_\mathrm p(x,\tilde{x}_k) \rightarrow 0\). Similarly, if \(\tilde{z}_k \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\) converges to \(z \in \mathop {\textbf{dom}}\phi _\textrm{d}\), then \(d_\textrm{d}(z,\tilde{z}_k) \rightarrow 0\).
-
(2.3)
The stepsizes \(\sigma \) and \(\tau \) satisfy \(\sigma \tau \Vert A\Vert ^2+\tau L<1\).
The first two assumptions in Assumption 2 are common in the literature on Bregman distances [9, 14, 24, 27]. As shown in Lemma 1, Assumption 2.3 implies that the kernel functions \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\) are strongly convex and that
for some \(\alpha > 0\). Similarly, \(\sigma \tau \Vert A\Vert ^2 < 1\) implies that
for some \(\beta > 0\). Recall that \(d=d_-\), \(\tilde{d}=d_\textrm{pcv}\) for the Bregman primal algorithm (18), and \(d=d_+\), \(\tilde{d}=d_\textrm{dcv}\) for the Bregman dual algorithm (19).
Theorem 2
Under Assumptions 1 and 2, the iterates \((x_k,z_k)\) generated by Bregman primal and dual Condat–Vũ algorithms (18) and (19) converge to an optimal point \((x^\star ,z^\star )\).
Proof
We first note that \(\tilde{d}(x_{k+1}, z_{k+1}; x_k, z_k) \rightarrow 0\) and (36) imply that \(x_{k+1} -x_k \rightarrow 0\) and \(z_{k+1}-z_k \rightarrow 0\).
The inequality (35), together with (37), implies that the sequence \((x_k,z_k)\) is bounded. Let \((x_{k_i},z_{k_i})\) be a convergent subsequence of \((x_k, z_k)\) with limit \((\hat{x}, \hat{z})\). Since \(x_{k_i+1}-x_{k_i} \rightarrow 0\) and \(z_{k_i+1} -z_{k_i} \rightarrow 0\), the sequence \((x_{k_i+1},z_{k_i+1})\) converges to \((\hat{x}, \hat{z})\). We show that \((\hat{x},\hat{z})\) satisfies the optimality condition (3).
From (35), \(d(x^\star ,z^\star ;x_{k_i},z_{k_i})\) is bounded. Since the sublevel sets \(\{(x^\prime ,z^\prime ) \mid d(x^\star ,z^\star ;x^\prime ,z^\prime ) \le \gamma \}\) are closed and contained in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\), the limit \((\hat{x},\hat{z}) \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\). The iterates in the subsequence satisfy
where \(\phi _\textrm{pd}=\phi _\textrm{pcv}\) in the Bregman primal Condat–Vũ algorithm and \(\phi _\textrm{pd}=\phi _\textrm{dcv}\) in the Bregman dual Condat–Vũ algorithm. The left-hand side of (38) converges to \((-A^T\hat{z}, A\hat{x})\) because \(\nabla \phi _\textrm{pd}\) is continuous on \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{pd})\). Since the operator on the right-hand side of (38) is maximal monotone, the limit point \((\hat{x},\hat{z})\) satisfies the optimality condition (see [7, page 27], [47, Lemma 3.2])
To show convergence of the entire sequence, we substitute \((\hat{x},\hat{z})\) in (27):
We have \(d(\hat{x},\hat{z};x_k,z_k) \le d(\hat{x},\hat{z};x_{k-1},z_{k-1})\) for all \(k \ge 1\), since the left-hand side is nonnegative. This further implies that \(d(\hat{x},\hat{z};x_k,z_k) \le d(\hat{x},\hat{z};x_{k_i},z_{k_i})\) for all \(k \ge k_i\). By Assumption (2.2), the right-hand side converges to zero. Then, the left-hand side also converges to zero and from (37), \(x_k \rightarrow \hat{x}\) and \(z_k \rightarrow \hat{z}\). \(\square \)
5.3 Relation to Other Bregman Proximal Algorithms
Following similar steps as in Sect. 3, we obtain several Bregman proximal splitting methods as special cases of (18) and (19). The connections are summarized in Figs. 5 and 6. A comparison of Figs. 1 and 5 shows that all the reduction relations (\(A=I\)) are still valid. However, it is unclear how to apply the “completion” operation to algorithms based on non-Euclidean Bregman distances.
When \(h=0\), (18) reduces to Bregman PDHG [12]. When \(g=0\), \(g^*=\delta _{\{0\}}\) (and assuming \(z_0=0\)), we obtain the Bregman proximal gradient algorithm [3]. When \(f=0\) in (18), we obtain the Bregman Loris–Verhoeven algorithm with shift, and if we further set \(A=I\), we obtain the Bregman reduced Loris–Verhoeven algorithm with shift. Similarly, when \(A=I\) in (18), we recover the reduced Bregman primal Condat–Vũ algorithm, and setting \(A=I\) in Bregman PDHG yields the Bregman Douglas–Rachford algorithm.
Similarly, the Bregman dual Condat–Vũ algorithm (19) can be reduced to some other Bregman proximal splitting methods, as summarized in Fig. 6. In particular, when \(f=0\) in (19), we obtain the Bregman dual Loris–Verhoeven algorithm with shift, and if we further set \(A=I\), we obtain the reduced Bregman Loris–Verhoeven algorithm with shift.
6 Bregman Dual Condat–Vũ Algorithm with Line Search
The algorithms (18) and (19) use constant parameters \(\sigma \) and \(\tau \). The stepsize condition (25) involves the matrix norm \(\Vert A\Vert \) and the Lipschitz constant L in (21). Estimating or bounding \(\Vert A\Vert \) for a large matrix can be difficult. As an added complication, the norms \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\) in the definition of the matrix norm (22) are assumed to be scaled so that the strong convexity parameters of the primal and dual kernels are equal to one. Close bounds on the strong convexity parameters may also be difficult to obtain. Using conservative bounds for \(\Vert A\Vert \) and L results in unnecessarily small values of \(\sigma \) and \(\tau \) and can dramatically slow down the convergence. Even when the estimates of \(\Vert A\Vert \) and L are accurate, the requirements for the stepsizes (25) are still too strict in most iterations, as observed in [1]. In view of the above arguments, line search techniques for primal–dual proximal methods have recently become an active area of research. Malitsky and Pock [36] proposed a line search technique for PDHG and the Condat–Vũ algorithm in the Euclidean case. The algorithm with adaptive parameters in [49] focuses on a special case of (1) (i.e., \(f=0\)) and extends the Loris–Verhoeven algorithm. A Bregman proximal splitting method with line search is discussed in [30] and considers the problem (1) with \(h=0\) and \(g=\delta _{\{b\}}\). In this section, we extend the Bregman dual Condat–Vũ algorithm (19) with a varying parameter option, in which the stepsizes are chosen adaptively without requiring any estimates or bounds for \(\Vert A\Vert \) or the strong convexity parameter of the kernels. The algorithm is restricted to problems in the equality constrained form
This is a special case of (1) with \(g=\delta _{\{b\}}\), the indicator function of the singleton \(\{b\}\).
The details of the algorithm are discussed in Sect. 6.1, and a convergence analysis is presented in Sect. 6.2. The main conclusion is an O(1/k) rate of ergodic convergence, consistent with previous results for related algorithms [30, 36].
Assumption 3
The kernel function, the Bregman distance, and the norm in the dual space are
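$$\begin{aligned} \phi _\textrm{d}(z) = \frac{1}{2}\Vert z\Vert ^2, \qquad d_\textrm{d}(z,z^\prime ) = \frac{1}{2}\Vert z-z^\prime \Vert ^2, \qquad \Vert z\Vert _\textrm{d}= \Vert z\Vert . \end{aligned}$$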
(Recall that \(\Vert \cdot \Vert \) denotes the Euclidean norm.)
The matrix norm \(\Vert A\Vert \) is defined accordingly as
6.1 Algorithm
The algorithm uses the following iteration, with starting points \(x_0 \in \mathop {\textbf{dom}}h \cap \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p)\) and \(z_{-1} = z_0\):
With constant parameters \(\theta _k=1\), \(\sigma _k=\sigma \), \(\tau _k = \tau \), the algorithm reduces to the Bregman dual Condat–Vũ algorithm (19) applied to (39), except for the numbering of the dual iterates.
In the line search algorithm, the parameters \(\theta _k\), \(\tau _k\), \(\sigma _k\) are determined by a backtracking search. At the start of the algorithm, we set \(\tau _{-1}\) and \(\sigma _{-1}\) to some positive values. To start the search in iteration k, we choose \(\bar{\theta }_k \ge 1\). For \(i=0,1,2,\ldots \), we set \(\theta _k = 2^{-i}\bar{\theta }_k\), \(\tau _k=\theta _k\tau _{k-1}\), \(\sigma _k=\theta _k\sigma _{k-1}\), and compute \(\bar{z}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) using (40). For a given \(\delta \in (0,1]\), if
we accept the computed iterates \(\bar{z}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) and parameters \(\theta _k\), \(\sigma _k\), \(\tau _k\), and terminate the backtracking search. If (41) does not hold, we increment i and continue the backtracking search.
The backtracking condition (41) is similar to the condition in the line search algorithm for PDHG with Euclidean proximal operators [36, Algorithm 4], but it is not identical, even in the Euclidean case. The proposed condition is weaker and allows larger stepsizes than the condition in [36, Algorithm 4].
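The backtracking loop itself is straightforward to implement. The following schematic sketch abstracts the update (40) and the acceptance test (41) as placeholder callables, since only their interfaces matter for the logic of the search; the cap on the number of halvings is a safeguard (by Lemma 4 below, the search terminates).

```python
def backtracking_step(x, z, z_prev, tau_prev, sigma_prev, theta_bar,
                      update_40, accept_41, max_halvings=50):
    # One pass of the line search in Sect. 6.1: start from theta_bar >= 1 and
    # halve theta until the trial iterates pass the exit condition (41).
    for i in range(max_halvings):
        theta = theta_bar * 2.0 ** (-i)
        tau, sigma = theta * tau_prev, theta * sigma_prev
        zbar_next, x_next, z_next = update_40(x, z, z_prev, theta, tau, sigma)
        if accept_41(x, x_next, zbar_next, z_next, tau, sigma):
            return x_next, z_next, zbar_next, tau, sigma
    raise RuntimeError("line search did not terminate")
```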
6.2 Convergence Analysis
The proof strategy is the same as in [30, Section 3.3], extended to account for the function h. The main conclusion is an O(1/k) rate of ergodic convergence, as shown in Eq. (49).
6.2.1 Lower Bound on Algorithm Parameters
Lemma 4
Suppose Assumptions 1 and 3 hold. The stepsizes selected by the backtracking process are bounded below by
where \(\beta =\sigma _{-1}/\tau _{-1}\). The lower bounds imply that the backtracking eventually terminates with positive stepsizes \(\sigma _k\) and \(\tau _k\).
Proof
Applying (26) in the proof of Lemma 1 with \(\tau =\tau _k\), \(\sigma =\sigma _k\), \(\delta _1=\delta ^2-\tau _k L\) and \(\delta _2=1\), together with the Lipschitz condition (21) in Assumption 1, we see that the backtracking condition (41) holds at iteration k if \(0 < \delta \le 1\) and
Then, mathematical induction can be used to prove (42). The rest of the proof is similar to that in [30, §3.3.1], extended to account for the smooth function h, and thus is omitted here. \(\square \)
6.2.2 One-Iteration Analysis
Lemma 5
Suppose Assumptions 1 and 3 hold. The iterates \(x_{k+1}\), \(z_{k+1}\), \(\bar{z}_{k+1}\) generated by the algorithm (40) and the backtracking process satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all z. Here, \(\mathcal {L}(x,z)=f(x)+h(x)+\langle z, Ax-b\rangle \).
Proof
The optimality condition for the primal prox-operator (40b) gives
for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\). Hence,
The second inequality follows from the convexity of h, i.e., \(h(x) \ge h(x_k) + \langle \nabla h(x_k), x-x_k\rangle \). The dual update (40c) implies that
This equality at \(k=i-1\) is
The equality (45) at \(k=i-2\) is
We evaluate this at \(z=z_i\) and add it to the equality at \(z=z_{i-2}\) multiplied by \(\theta _{i-1}\):
Now we combine (44) for \(k=i-1\) with (46) and (47). For \(i \ge 1\),
which is the desired result (43). The first inequality follows from (44). In the second to last step we substitute (46) and (47). The last step uses the line search exit condition (41) at \(k=i-1\). \(\square \)
6.2.3 Ergodic Convergence
We define the averaged primal and dual sequences
for \(k \ge 1\). The ergodic convergence of \((x^\textrm{avg}_k, \bar{z}^\textrm{avg}_k)\) is given in the following theorem.
Theorem 3
Suppose Assumptions 1 and 3 hold, and the stepsizes are selected by the backtracking process with line search condition (41). The averaged iterates \((x^\textrm{avg}_k, \bar{z}^\textrm{avg}_k)\) satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all z. This holds for any \(\delta \in (0,1]\) in (41).
If we compare (48) and (33), we note that the two left-hand sides involve different dual iterates (\(\bar{z}^\textrm{avg}_k\) as opposed to \(z^\textrm{avg}_k\)).
Proof
Since \(\mathcal {L}(u,v)\) is convex in u and affine in v,
Dividing by \(\sum _{i=1}^k \tau _{i-1}\) gives (48). \(\square \)
Substituting \(x=x^\star \) and \(z=z^\star \) in (48) yields
since \(Ax^\star =b\). More generally, suppose \(X \subseteq \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) is a compact convex set containing an optimal solution \(x^\star \) in its interior, and \(Z = \{z \mid \Vert z\Vert < \gamma \}\) contains a dual optimal \(z^\star \), then the merit function \(\eta \) defined in (6) satisfies
The second line follows from (48) and the third line follows from Lemma 4.
6.2.4 Monotonicity Properties and Convergence of Iterates
For \(x=x^\star \), \(z=z^\star \), the left-hand side of (43) is nonnegative and we obtain
for \(k \ge 0\). These inequalities hold for any value \(\delta \in (0,1]\). In particular, the second inequality implies that \(\bar{z}_{i+1}-z_i \rightarrow 0\). When \(\delta < 1\) it also implies that \(d_\mathrm p(x_{i+1}, x_i) \rightarrow 0\) and, by the strong convexity assumption on \(\phi _\mathrm p\), that \(x_{i+1}-x_i \rightarrow 0\). With additional assumptions similar to those in Sect. 5.2.3, one can show the convergence of iterates; see [30, Section 3.3.4].
7 Bregman PD3O Algorithm
In this section, we propose the Bregman PD3O algorithm, another Bregman proximal method for the problem (1). Bregman PD3O also involves two generalized distances, \(d_\mathrm p\) and \(d_\textrm{d}\), generated by \(\phi _\mathrm p\) and \(\phi _\textrm{d}\), respectively, and it consists of the iterations
The only difference between Bregman PD3O and Bregman primal Condat–Vũ algorithm (18) is the additional term \(\tau (\nabla h(x_k)-\nabla h(x_{k+1}) )\). Thus, the two algorithms (18) and (50) reduce to the same method when h is absent from problem (1). The additional term allows PD3O to use larger stepsizes than the Condat–Vũ algorithm. If we use the same matrix norm \(\Vert A\Vert \) and Lipschitz constant L in the analysis for the two methods, then the conditions are
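$$\begin{aligned} \sigma \tau \Vert A\Vert ^2 + \tau L \le 1 \qquad \text{ and } \qquad \sigma \tau \Vert A\Vert ^2 \le 1, \quad \tau \le 1/L, \end{aligned}$$(51)
respectively, for the Condat–Vũ algorithm and for PD3O.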
The range of possible parameters is illustrated in Fig. 7.
In Sect. 7.1, we provide the detailed convergence analysis of the Bregman PD3O method. The connections between Bregman PD3O and several other Bregman proximal methods are discussed in Sect. 7.2.
Assumption 4
-
(4.1)
The kernel functions \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) are 1-strongly convex with respect to norms \(\Vert \cdot \Vert \) and \(\Vert \cdot \Vert _\textrm{d}\), respectively:
$$\begin{aligned} d_\mathrm p(x,x^\prime ) \ge \frac{1}{2}\Vert x-x^\prime \Vert ^2, \qquad d_\textrm{d}(z,z^\prime ) \ge \frac{1}{2}\Vert z-z^\prime \Vert _\textrm{d}^2 \end{aligned}$$(52) for all \((x,x') \in \mathop {\textbf{dom}}d_\mathrm p\) and \((z,z') \in \mathop {\textbf{dom}}d_\textrm{d}\). The assumption that the strong convexity constants are one can be made without loss of generality, by scaling the distances.
-
(4.2)
The gradient of h is L-Lipschitz continuous with respect to the Euclidean norm: \(\mathop {\textbf{dom}}h={\textbf{R}}^n\) and
$$\begin{aligned} h(y)-h(x)-\langle \nabla h(x), y-x\rangle \le \frac{L}{2}\Vert y-x\Vert ^2, \quad \text{ for } \text{ any } x,y \in \mathop {\textbf{dom}}h. \end{aligned}$$(53)
-
(4.3)
The parameters \(\tau \) and \(\sigma \) satisfy
$$\begin{aligned} \sigma \tau \Vert A\Vert ^2 \le 1, \qquad \tau \le 1/L. \end{aligned}$$(54)
-
(4.4)
The primal–dual optimality conditions (3) have a solution \((x^\star , z^\star )\) with \(x^\star \in \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z^\star \in \mathop {\textbf{dom}}\phi _\textrm{d}\).
Note that (53) is a stronger assumption than (21). (Combined with the first inequality in (52), it implies (21).) We will use the following consequence of (53) [38, Theorem 2.1.5]:
7.1 Convergence Analysis
7.1.1 A Primal–Dual Bregman Distance
We introduce a primal–dual kernel
where \(\sigma ,\tau > 0\). If \(\phi _\textrm{pd3o}\) is convex, the generated Bregman distance is given by
The following lemma gives a sufficient condition for the convexity of \(\phi _\textrm{pd3o}\).
Lemma 6
Suppose (52) holds. The kernel function \(\phi _\textrm{pd3o}\) is convex if \(\sigma \tau \Vert A\Vert ^2 \le 1\), where \(\Vert A\Vert \) is defined in (22) with \(\Vert \cdot \Vert _\mathrm p=\Vert \cdot \Vert \).
Proof
It is sufficient to show that \(d_\textrm{pd3o}\) is nonnegative:
In step 1, we use the strong convexity assumption (52), the definition of \(\Vert A\Vert \) (22) with \(\Vert \cdot \Vert _\mathrm p=\Vert \cdot \Vert \), and the assumption \(\sigma \tau \Vert A\Vert ^2 \le 1\). The bound on \(d_\textrm{d}(z,z')\) follows from
\(\square \)
Note that the convexity of \(\phi _\textrm{pd3o}\) only requires the first inequality in the stepsize condition (54). Although the Bregman PD3O algorithm (50) is not the Bregman proximal point method for the Bregman kernel \(\phi _\textrm{pd3o}\), the distance \(d_\textrm{pd3o}\) will appear in the key inequality (58) of the convergence analysis.
7.1.2 One-Iteration Analysis
Lemma 7
Under Assumption 4, the iterates \(x_{k+1}\), \(z_{k+1}\) generated by Bregman PD3O (50) satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).
Proof
Recall that Bregman PD3O differs from the Bregman primal Condat–Vũ algorithm (18) only in an additional term in the dual update. The proof of Lemma 2 in Sect. 5.2.1 therefore applies up to (29), with
Substituting the above \((\tilde{x}, \tilde{z})\) into (29) and applying the definition of \(d_-\) (24) yields
Step 3 follows from the definition (56) of \(d_\textrm{pd3o}\). In step 4, we use the Lipschitz condition (55) and the second inequality in the stepsize condition (54). The last step follows from the fact that \(d_\textrm{pd3o}\) is nonnegative (57). \(\square \)
7.1.3 Ergodic Convergence
Theorem 4
Under Assumption 4, the iterates generated by Bregman PD3O (50) satisfy
for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\), where the averaged iterates are defined in (32).
Proof
From (58) in Lemma 7, since \(\mathcal {L}(u,v)\) is convex in u and concave in v,
for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The third inequality follows from (56):
\(\square \)
7.2 Relation to Other Bregman Proximal Algorithms
The proposed algorithm (50) can be viewed as an extension of PD3O (12) to generalized distances, and it yields several Bregman proximal methods as special cases. These algorithms can also be organized into a diagram similar to Fig. 3. Figure 8 starts from Bregman PD3O (50) and summarizes its connections to several Bregman proximal methods.
When \(h=0\), (50) reduces to Bregman PDHG, and when \(g=0\), to the Bregman proximal gradient algorithm. The Bregman Loris–Verhoeven algorithm is Bregman PD3O with \(f=0\), and has been discussed in [17] under the name NEPAPC. Setting \(A=I\) (with \(\sigma =1/\tau \)) in the Bregman Loris–Verhoeven algorithm, we obtain a new variant of the Bregman proximal gradient algorithm. The difference between this variant of the Bregman proximal gradient algorithm and the Bregman reduced Loris–Verhoeven algorithm with shift obtained in Sect. 5.3 is the additional term \(\tau (\nabla h(x_k)-\nabla h(x_{k+1}))\), the same as the difference between (18) and (50). When the Euclidean proximal operator is used, (50) (with \(f=0\), \(A=I\) and \(\sigma =1/\tau \)) reduces to the proximal gradient method. However, the new method does not seem to be equivalent to the classical Bregman proximal gradient algorithm due to the lack of a Moreau decomposition in the generalized case. Finally, setting \(A=I\) (and \(\sigma =1/\tau \)) in (50) gives a Bregman Davis–Yin algorithm.
8 Numerical Experiment
In this section, we evaluate the performance of the Bregman primal Condat–Vũ algorithm (18), Bregman dual Condat–Vũ algorithm with line search (40), and Bregman PD3O (50). The main goal of the example is to validate and illustrate the difference in the stepsize conditions (51) and the usefulness of the line search procedure. We consider the convex optimization problem
where \(x \in {\textbf{R}}^n\) is the optimization variable, \(C \in {\textbf{R}}^{m \times n}\), and \(A \in {\textbf{R}}^{(n-1) \times n}\) is the difference matrix
This problem is of the form (1) with \(f(x)=\delta _H(x)\), \(g(y)=\lambda \Vert y\Vert _1\), and \(h(x)=\tfrac{1}{2}\Vert Cx-b\Vert ^2\), where \(\delta _H\) is the indicator function of the hyperplane \(H=\{x \in {\textbf{R}}^n \mid \textbf{1}^Tx=1\}\). We use the relative entropy distance
in the primal space. This distance is 1-strongly convex with respect to the \(\ell _1\)-norm [5] (and also the \(\ell _2\)-norm). With the relative entropy distance, all the primal iterates \(x_k\) remain feasible. In the dual space we use the Euclidean distance. Thus, the matrix norm (22) in the stepsize condition (25) for the Bregman Condat–Vũ algorithms is the (1,2)-operator norm \(\Vert A\Vert _{1,2}=\max _i \Vert a_i\Vert =\sqrt{2}\), where \(a_i\) is the ith column of A. In the Bregman PD3O algorithm, we use the squared Euclidean distance \(d_\mathrm p(x,y) = \tfrac{1}{2} \Vert x-y\Vert ^2\), and the matrix norm in the stepsize condition (54) is the spectral norm \(\Vert A\Vert _2\). For the difference matrix (60), \(\Vert A\Vert _2\) is bounded above by 2, and very close to this upper bound for large n.
The Lipschitz constant for h with respect to the \(\ell _1\)-norm is the largest absolute value of the elements in \(C^TC\), i.e., \(L_1=\max _{i,j} |(C^TC)_{ij}|\). This value is used in the stepsize condition (25) for the Bregman Condat–Vũ algorithms. The Lipschitz constant with respect to the \(\ell _2\)-norm is \(L_2=\Vert C\Vert _2^2\), which is used in the stepsize condition (54) for Bregman PD3O.
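As an illustration, the following snippet assembles the difference matrix and computes these constants; the helper name and the forward-difference sign convention assumed for (60) are choices made for this sketch only.

```python
import numpy as np
import scipy.sparse as sp

def experiment_constants(C, n):
    # Difference matrix (60), assuming the convention (Ax)_i = x_{i+1} - x_i.
    A = sp.diags([-np.ones(n - 1), np.ones(n - 1)], offsets=[0, 1],
                 shape=(n - 1, n), format="csr")
    G = C.T @ C                       # forms C^T C explicitly (fine for moderate n)
    L1 = np.max(np.abs(G))            # Lipschitz constant of grad h w.r.t. the l1-norm
    L2 = np.linalg.norm(C, 2) ** 2    # Lipschitz constant w.r.t. the l2-norm, ||C||_2^2
    normA_12 = np.sqrt(2.0)           # (1,2)-operator norm: largest column 2-norm of A
    normA_2_bound = 2.0               # upper bound on the spectral norm ||A||_2
    return A, L1, L2, normA_12, normA_2_bound
```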
The matrix norms and Lipschitz constants are summarized as follows:
In the example, we use the exact values of \(L_1\) and \(L_2\),
The Bregman proximal operator of f has a closed-form solution:
where \(y_i\) is the ith entry of y, and the (Euclidean) proximal operator of \(g^*\) is the projection onto the infinity norm ball \(\{z \mid \Vert z\Vert _\infty \le \lambda \}\).
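For illustration, here is a NumPy sketch of these two proximal operators; the closed form in the first follows from a standard Lagrange multiplier calculation for the entropic Bregman proximal operator of the constraint \(\textbf{1}^Tx=1\), and the function names are placeholders.

```python
import numpy as np

def breg_prox_hyperplane(y, a):
    # Bregman prox (15) of f = delta_H, H = {x : 1^T x = 1}, with the relative
    # entropy kernel: minimizing <a, x> + sum_i (x_i*log(x_i/y_i) - x_i + y_i)
    # over 1^T x = 1 gives x_i = y_i*exp(-a_i) / sum_j y_j*exp(-a_j).
    w = y * np.exp(-(a - np.min(a)))   # shift the exponent for numerical stability
    return w / np.sum(w)

def prox_gconj(z, lam):
    # Euclidean prox of g*: projection onto the infinity-norm ball of radius lam.
    return np.clip(z, -lam, lam)
```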
The experiment is carried out in Python 3.6 on a desktop with an Intel Core i5 2.4GHz CPU and 8GB RAM. We set \(m=500\) and \(n=10,000\). The entries of the matrix \(C \in {\textbf{R}}^{m \times n}\) and the vector \(b \in {\textbf{R}}^m\) are generated independently from the standard Gaussian distribution.
These two choices, as well as the range of possible parameters, are illustrated in Fig. 9. The two choices are on the blue and red curve, respectively, and satisfy the requirement (51) with equality. For the line search algorithm, we set \(\bar{\theta }_k=1.2\) to encourage more aggressive updates, and \(\beta =\sigma _{-1}/\tau _{-1}=L_1^2\), which is consistent with the choice in (61).
We solve the problem (59) using the Bregman primal Condat–Vũ algorithm (18), the Bregman dual Condat–Vũ algorithm with line search (40), and Bregman PD3O (50). Figure 10 reports the relative difference between the objective values and the optimal value \(\psi ^\star \), which is computed via CVXPY [22]. Comparison between the Bregman primal Condat–Vũ algorithm and Bregman PD3O shows that Bregman PD3O converges faster. Figure 10 also compares the Bregman primal Condat–Vũ algorithm with constant stepsizes and the Bregman dual Condat–Vũ algorithm with line search. One can see clearly that the line search significantly improves the convergence. On the other hand, the line search does not add much computational overhead, as the convergence plots against CPU time and against the number of iterations are roughly identical. In these experiments, Bregman PD3O and the Bregman dual Condat–Vũ algorithm with line search perform similarly, with neither algorithm conclusively better than the other.
9 Conclusions
We presented two Bregman variants of the Condat–Vũ algorithm, introduced a line search technique for the Bregman dual Condat–Vũ algorithm applied to equality-constrained problems, and proposed a Bregman extension of PD3O. Many open questions remain. It is unclear how to incorporate Bregman distances in PDDY, and how to extend the line search technique to Bregman PD3O, the Bregman primal Condat–Vũ algorithm, and the more general problem (1). Moreover, in the current backtracking technique the ratio of the primal and dual stepsizes is fixed; a further improvement would be to relax this constraint [1, 36].
References
Applegate, D., Díaz, M., Hinder, O., Lu, H., Lubin, M., O’Donoghue, B., Schudy, W.: Practical large-scale linear programming using primal-dual hybrid gradient. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Wortman Vaughan, J. (eds.) Advances in Neural Information Processing Systems, vol. 34 (2021)
Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16(3), 697–725 (2006)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017)
Beck, A.: First-Order Methods in Optimization. Society for Industrial and Applied Mathematics (2017)
Beck, A., Teboulle, M.: Gradient-based algorithms with applications to signal recovery. In: Eldar, Y., Palomar, D. (eds.) Convex Optimization in Signal Processing and Communications. Cambridge University Press, Cambridge (2009)
Bolte, J., Sabach, S., Teboulle, M., Vaisbourd, Y.: First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems. SIAM J. Optim. 28(3), 2131–2151 (2018)
Brézis, H.: Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert, volume 5 of North-Holland Mathematical Studies. North-Holland (1973)
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)
Censor, Y., Zenios, S.A.: Parallel Optimization: Theory, Algorithms, and Applications. Numerical Mathematics and Scientific Computation, Oxford University Press, New York (1997)
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2011)
Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)
Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program., Ser. A 159, 253–287 (2016)
Chao, H.-H., Vandenberghe, L.: Entropic proximal operators for nonnegative trigonometric polynomials. IEEE Trans. Signal Process. 66(18), 4826–4838 (2018)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3, 538–543 (1993)
Chen, P., Huang, J., Zhang, X.: A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems 29(2) (2013)
Clason, C., Lorenz, D.A., Mahler, H., Wirth, B.: Entropic regularization of continuous optimal transport problems. J. Math. Anal. Appl. 494(1), 124432 (2021)
Cohen, E., Sabach, S., Teboulle, M.: Non-Euclidean proximal methods for convex-concave saddle-point problems. J. Appl. Numer. Optim. 3(1) (2021)
Combettes, P.L., Pesquet, J.-C.: A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE J. Sel. Top. Signal Process. 1(4), 564–574 (2007)
Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications, pp. 185–212. Springer, New York (2011)
Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013)
Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. arXiv e-prints, arXiv:1504.01032 (2015)
Diamond, S., Chu, E., Boyd, S.: CVXPY: a Python-embedded modeling language for convex optimization, version 0.2. cvxpy.org (2014)
Drori, Y., Sabach, S., Teboulle, M.: A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Oper. Res. Lett. 43(2), 209–214 (2015)
Eckstein, J.: Nonlinear proximal point algorithms using Bregman functions, with applications to convex programming. Math. Oper. Res. 18(1), 202–226 (1993)
Eckstein, J., Bertsekas, D.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Esser, E., Zhang, X., Chan, T.: A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM J. Imag. Sci. 3(4), 1015–1046 (2010)
Güler, O.: Ergodic convergence in proximal point algorithms with Bregman functions. In: Du, D.-Z., Sun, J. (eds.) Advances in Optimization and Approximation, pp. 155–165. Springer, Cham (1994)
He, B., Yuan, X.: Convergence analysis of primal-dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imag. Sci. 5(1), 119–149 (2012)
Jacobs, M., Leger, F., Li, W., Osher, S.: Solving large-scale optimization problems with a convergence rate independent of grid size. SIAM J. Numer. Anal. 57(3), 1100–1123 (2019)
Jiang, X., Vandenberghe, L.: Bregman primal-dual first-order method and applications to sparse semidefinite programming. Comput. Optim. Appl. 81(1), 127–159 (2022)
Komodakis, N., Pesquet, J.: Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag. 32(6), 31–54 (2015)
Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
Liu, Y., Xu, Y., Yin, W.: Acceleration of primal-dual methods by preconditioning and simple subproblem procedures. J. Sci. Comput. 86(2), 21 (2021)
Loris, I., Verhoeven, C.: On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems 27(12) (2011)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018)
Malitsky, Y., Pock, T.: A first-order primal-dual algorithm with linesearch. SIAM J. Optim. 28(1), 411–432 (2018)
Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Math. Soc. France 93, 273–299 (1965)
Nesterov, Y.: Lectures on Convex Optimization. Springer Publishing Company, Cham (2018)
O’Connor, D., Vandenberghe, L.: On the equivalence of the primal-dual hybrid gradient method and Douglas-Rachford splitting. Math. Program. 179(1–2), 85–108 (2020)
Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Opt. 1(3), 123–231 (2013)
Pock, T., Chambolle, A.: Diagonal preconditioning for first order primal–dual algorithms in convex optimization. In: Metaxas, D., Quan, L., Sanfeliu, A., Van Gool, L. (eds.) International Conference on Computer Vision, vol. 13, pp. 1762–1769 (2011)
Pock, T., Cremers, D., Bischof, H., Chambolle, A.: An algorithm for minimizing the Mumford-Shah functional. In: Matsuyama, T. (ed.) International Conference on Computer Vision, vol. 12, pp. 1133–1140 (2009)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
Salim, A., Condat, L., Mishchenko, K., Richtárik, P.: Dualize, split, randomize: toward fast nonsmooth optimization algorithms. J. Optim. Theory Appl. (2022)
Solodov, M.V., Svaiter, B.F.: An inexact hybrid generalized proximal point algorithm and some new results on the theory of Bregman functions. Math. Oper. Res. 25(2), 214–230 (2000)
Teboulle, M.: A simplified view of first order methods for optimization. Math. Program. 170(1), 67–96 (2018)
Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control. Optim. 38(2), 431–446 (2000)
Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization (2008)
Vladarean, M.-L., Malitsky, Y., Cevher, V.: A first-order primal-dual method with adaptivity to local smoothness. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Wortman Vaughan, J. (eds.) Advances in Neural Information Processing Systems, vol. 34 (2021)
Vu, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38, 667–681 (2013)
Yan, M.: A new primal-dual algorithm for minimizing the sum of three functions with a linear operator. J. Sci. Comput. 76(3), 1698–1717 (2018)
Hamedani, E.Y., Aybat, N.S.: A primal-dual algorithm with line search for general convex-concave saddle point problems. SIAM J. Optim. 31(2), 1299–1329 (2021)
Yu, Y., Elango, P., Topcu, U., Açıkmeşe, B.: Proportional-integral projected gradient method for conic optimization. Automatica 142 (2022)