1 Introduction

We discuss proximal splitting methods for optimization problems in the form

$$\begin{aligned} \text{ minimize } \quad f(x) + g(Ax) + h(x), \end{aligned}$$
(1)

where f, g, and h are convex functions and h is differentiable. This general problem covers a wide variety of applications in machine learning, signal and image processing, operations research, control, and other fields [11, 19, 31, 40]. In this paper, we consider proximal splitting methods based on Bregman distances for solving (1) and some interesting special cases of (1).

Recently, several primal–dual first-order methods have been proposed for the three-term problem (1): the Condat–Vũ algorithm [20, 50, 53], the primal–dual three-operator (PD3O) algorithm [51], and the primal–dual Davis–Yin (PDDY) algorithm [44]. Algorithms for some special cases of (1) are also of interest. These include the Chambolle–Pock algorithm, also known as the primal–dual hybrid gradient (PDHG) method [10, 12] (when \(h=0\)), the Loris–Verhoeven algorithm [15, 23, 34] (when \(f=0\)), the proximal gradient algorithm (when \(g=0\)), and the Davis–Yin splitting algorithm [21] (when \(A=I\)). All these methods handle the nonsmooth functions f and g via the standard Euclidean proximal operator.

To further improve the efficiency of proximal algorithms, proximal operators based on generalized Bregman distances have been proposed and incorporated in many methods [2, 3, 6, 14, 24, 27, 35, 46, 48]. Bregman distances offer two potential benefits. First, the Bregman distance can help build a more accurate local optimization model around the current iterate. This is often interpreted as a form of preconditioning. For example, diagonal or quadratic preconditioning [29, 33, 41] has been shown to improve the practical convergence of PDHG, as well as the accuracy of the computed solution [1]. As a second benefit, a Bregman proximal operator of a function may be easier to compute than the standard Euclidean proximal operator and therefore reduce the complexity per iteration of an optimization algorithm. Recent applications of this kind include optimal transport problems [16], optimization over nonnegative trigonometric polynomials [13], and sparse semidefinite programming [30].

Extending standard proximal methods and their convergence analysis to Bregman distances is not straightforward because some fundamental properties of the Euclidean proximal operator no longer hold for Bregman proximal operators. An example is the Moreau decomposition which relates the (Euclidean) proximal operators of a closed convex function and its conjugate [37]. Another example is the simple relation between the proximal operators of a function g and the composition with a linear function g(Ax) when \(AA^T\) is a multiple of the identity; see, e.g., [4, 19]. This composition rule is used in [39] to establish the equivalence between some well-known first-order proximal methods for problem (1) with \(A=I\) and with general A.

The purpose of this paper is to present new Bregman extensions and convergence results for the Condat–Vũ and PD3O algorithms. The main contributions are as follows.

  • The Condat–Vũ algorithm [20, 50] exists in a primal and a dual variant. We discuss extensions of the two algorithms that use Bregman proximal operators in the primal and dual updates. The Bregman primal Condat–Vũ algorithm first appeared in [12, Algorithm 1] and is also a special case of the algorithm proposed in [52] for a more general convex–concave saddle point problem. We give a new derivation of this method and its dual variant, by applying the Bregman proximal point method to the primal–dual optimality conditions. Based on the interpretation, we provide a unified framework for the convergence analysis of the two variants and show an O(1/k) ergodic convergence rate, which is consistent with previous results for Euclidean proximal operators in [20, 50] and Bregman proximal operators in [12]. We also give a convergence result for the primal and dual iterates.

  • We propose an easily implemented backtracking line search technique for selecting stepsizes in the Bregman dual Condat–Vũ algorithm for problems with equality constraints. The proposed backtracking procedure is similar to the technique in [36] for the special setting of PDHG with Euclidean proximal operators, but has some important differences even in this special case. We give a detailed analysis of the algorithm with line search and recover the O(1/k) ergodic rate of convergence for related algorithms in [30, 36].

  • We propose a Bregman extension for PD3O and establish an ergodic convergence result.

The paper is organized as follows. Section 2 gives a precise statement of the problem (1) and reviews the duality theory that will be used in the rest of the paper. In Sect. 3, we review some well-known first-order proximal methods and establish connections between them. Section 4 provides some necessary background on Bregman distances. In Sect. 5, we discuss the Bregman primal and dual Condat–Vũ algorithms and analyze their convergence. The line search technique and its convergence are discussed in Sect. 6. In Sect. 7, we extend PD3O to a Bregman proximal method and analyze its convergence. Section 8 contains results of a numerical experiment.

2 Duality Theory and Merit Functions

This section summarizes the facts from convex duality theory that underlie the primal–dual methods discussed in the paper. We also describe primal–dual merit functions that will be used in the convergence analysis.

We use the notation \(\langle x, y\rangle = x^Ty\) for the standard inner product of vectors x and y, and \(\Vert x\Vert = \langle x, x\rangle ^{1/2}\) for the Euclidean norm of a vector x. Other norms will be distinguished by a subscript.

2.1 Problem Formulation

In (1), the vector x is an n-vector and A is an \(m\times n\) matrix. The functions f, g, and h are closed and convex, and h is differentiable, i.e.,

$$\begin{aligned} h(x) \ge h(x') + \langle \nabla h(x'), x-x'\rangle \quad \hbox { for all}\ x,x' \in \mathop {\textbf{dom}}h, \end{aligned}$$

where \(\mathop {\textbf{dom}}h\) is an open convex set. We assume that \(f+h\) and g are proper, i.e., have nonempty domains.

An important example of (1) is \(g = \delta _C\), the indicator function of a closed convex set C. With \(g=\delta _C\), the problem is equivalent to

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} f(x) + h(x) \\ \text{ subject } \text{ to } &{} Ax \in C. \end{array} \end{aligned}$$

For \(C =\{b\}\), the constraints are a set of linear equations \(Ax=b\). This special case actually covers all applications of the more general problem (1), since (1) can be reformulated as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} f(x) + g(y) + h(x) \\ \text{ subject } \text{ to } &{} Ax=y, \end{array} \end{aligned}$$

at the expense of increasing the problem size by adding a splitting variable y.

2.2 Dual Problem and Optimality Conditions

The dual of problem (1) is

$$\begin{aligned} \text{ maximize }\quad {-(f+h)^*(-A^Tz) - g^*(z)}, \end{aligned}$$
(2)

where \((f+h)^*\) and \(g^*\) are the conjugates of \(f+h\) and g:

$$\begin{aligned} (f+h)^*(w) = \sup _x{(\langle w, x\rangle - f(x) - h(x))}, \qquad g^*(z) = \sup _y{(\langle z, y\rangle - g(y))}. \end{aligned}$$

The conjugate \((f+h)^*\) is the infimal convolution of \(f^*\) and \(h^*\), denoted by \(f^*\mathbin {\square } h^*\):

$$\begin{aligned} (f+h)^*(w) = (f^*\mathbin {\square } h^*)(w) = \inf _v{\big (f^*(w-v)+h^*(v)\big )}. \end{aligned}$$

The primal–dual optimality conditions for (1) and (2) are

$$\begin{aligned} 0 \in \partial f(x) + \nabla h(x) + A^T z, \qquad 0 \in \partial g^*(z) - Ax. \end{aligned}$$

Here, \(\partial f\) and \(\partial g^*\) are the subdifferentials of f and \(g^*\). We often write the optimality conditions as

$$\begin{aligned} 0 \in \begin{bmatrix} 0 &{} A^T \\ -A &{} 0 \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix} + \begin{bmatrix} \partial f(x)+\nabla h(x) \\ \partial g^*(z) \end{bmatrix}. \end{aligned}$$
(3)

Throughout the paper, we assume the optimality conditions (3) are solvable.

We will refer to the convex–concave function

$$\begin{aligned} \mathcal {L}(x,z) = f(x)+h(x)+ \langle z, Ax\rangle -g^*(z) \end{aligned}$$

as the Lagrangian of (1). We follow the convention that \(\mathcal {L}(x,z) = +\infty \) if \(x\not \in \mathop {\textbf{dom}}(f+h)\) and \(\mathcal {L}(x,z) = -\infty \) if \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\not \in \mathop {\textbf{dom}}g^*\). The objective functions in (1) and the dual problem (2) can be expressed as

$$\begin{aligned} \sup _z \mathcal {L}(x,z) = f(x)+h(x)+g(Ax), \qquad \inf _x \mathcal {L}(x,z) = -(f+h)^*(-A^Tz)-g^*(z). \end{aligned}$$

Solutions \(x^\star \), \(z^\star \) of the optimality conditions (3) form a saddle point of \(\mathcal {L}\):

$$\begin{aligned} \inf _x \sup _z \mathcal {L}(x, z) = \sup _z \mathcal {L}(x^\star , z) = \mathcal {L}(x^\star , z^\star ) = \inf _x \mathcal {L}(x,z^\star ) = \sup _z \inf _x \mathcal {L}(x,z). \end{aligned}$$
(4)

In particular, \(\mathcal {L}(x^\star , z^\star )\) is the optimal value of (1) and (2).

2.3 Merit Functions

The algorithms discussed in this paper generate primal and dual iterates and approximate solutions x, z with \(x \in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\). The feasibility conditions \(Ax \in \mathop {\textbf{dom}}g\) and \(-A^Tz \in \mathop {\textbf{dom}}(f+h)^*\) are not necessarily satisfied. Hence, the duality gap

$$\begin{aligned} \sup _{z'} \mathcal {L}(x, z') - \inf _{x'} \mathcal {L}(x', z) = f(x) + h(x) + g(Ax) + (f+h)^*(-A^Tz) + g^*(z) \end{aligned}$$
(5)

may be infinite and is therefore not always useful as a merit function for measuring convergence.

If we add constraints \(x'\in X\) and \(z'\in Z\) to the optimization problems on the left-hand side of (5), where X and Z are compact convex sets, we obtain a function

$$\begin{aligned} \eta (x,z) = \sup _{z'\in Z} \mathcal {L}(x, z') - \inf _{x'\in X} \mathcal {L}(x', z) \end{aligned}$$
(6)

defined for all \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\). This follows from the fact that the functions \(f+h+\delta _X\) and \(g^*+\delta _Z\) are closed and co-finite, so their conjugates have full domain [43, Corollary 13.3.1]. If \(\eta (x,z)\) is easily computed, and \(\eta (x,z) \ge 0\) for all \(x\in \mathop {\textbf{dom}}(f+h)\) and \(z\in \mathop {\textbf{dom}}g^*\) with equality only if x and z are optimal, then the function \(\eta \) can serve as a merit function in primal–dual algorithms for problem (1).

If \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\) are bounded, then X and Z can be chosen to contain \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\). Then, the constraints in (6) are redundant and \(\eta (x,z)\) is the duality gap (5). Boundedness of \(\mathop {\textbf{dom}}(f+h)\) and \(\mathop {\textbf{dom}}g^*\) is a common assumption in the literature on primal–dual first-order methods.

A weaker assumption is that (1) has an optimal solution \(x^\star \in \mathop {\textbf{int}}(X)\) and (2) has an optimal solution \(z^\star \in \mathop {\textbf{int}}(Z)\). Then, \(\eta (x,z) \ge 0\) for all \(x \in \mathop {\textbf{dom}}(f+h)\) and \(z \in \mathop {\textbf{dom}}g^*\), with equality \(\eta (x,z) = 0\) only if x, z are optimal for (1) and (2). To see this, we first express the two terms in (6) as

$$\begin{aligned} \sup _{z'\in Z} \mathcal {L}(x, z') = f(x) + h(x) + (g \mathbin {\square } \sigma _Z)(Ax), \qquad -\inf _{x'\in X} \mathcal {L}(x', z) = g^*(z) + \big ((f+h)^* \mathbin {\square } \sigma _X\big )(-A^Tz), \end{aligned}$$

where \(\sigma _X = \delta _X^*\) and \(\sigma _Z = \delta _Z^*\) are the support functions of X and Z, respectively. Consider the problem of minimizing \(\eta (x,z)\). By expanding the infimal convolutions in the expressions for the two terms of \(\eta \), this convex optimization problem can be formulated as

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} f(x) + h(x) + g(y) + \sigma _Z(Ax-y) \\ &{} {} + g^*(z) + (f+h)^*(w) + \sigma _X(-A^Tz-w), \end{array} \end{aligned}$$
(7)

with variables x, y, z, w. The dual of this problem is

$$\begin{aligned} \begin{array}{ll} \text{ maximize } &{} -f(\bar{x}) - h(\bar{x}) - g(A\bar{x}) -g^*(\bar{z}) - (f+h)^*(-A^T\bar{z}) \\ \text{ subject } \text{ to } &{} \bar{x} \in X, \; \bar{z} \in Z, \end{array} \end{aligned}$$
(8)

with variables \(\bar{x}, \bar{z}\). The optimality conditions for (7) and (8) include the conditions \(Ax-y \in N_Z(\bar{z})\) and \(-A^Tz - w \in N_X(\bar{x})\), where \(N_X(\bar{x}) = \partial \delta _X(\bar{x})\) is the normal cone to X at \(\bar{x}\), and \(N_Z(\bar{z}) = \partial \delta _Z(\bar{z})\) the normal cone to Z at \(\bar{z}\). By assumption, there exist points \(x^\star \in \mathop {\textbf{int}}(X)\) and \(z^\star \in \mathop {\textbf{int}}(Z)\) that are optimal for the original problem (1) and its dual (2). It can be verified that \((x,y,z,w) = (x^\star , Ax^\star , z^\star , -A^Tz^\star )\), \((\bar{x}, \bar{z}) = (x^\star , z^\star )\) are optimal for (7) and (8), and that \(\eta (x^\star ,z^\star ) = 0\). Now let \((\hat{x},\hat{z})\) be any other minimizer of \(\eta \), i.e., \(\eta (\hat{x}, \hat{z}) = 0\). Then, \(\hat{x}, \hat{z}\) and the corresponding minimizers \(\hat{y}, \hat{w}\) in (7), must satisfy the optimality conditions with the optimal dual variables \(\bar{x}= x^\star \), \(\bar{z} = z^\star \). In particular, \(A\hat{x} -\hat{y} \in N_Z(z^\star ) = \{0\}\) and \(-A^T\hat{z} -\hat{w} \in N_X(x^\star ) = \{0\}\). The objective value of (7) at this point then reduces to \(0 = f(\hat{x}) + h(\hat{x}) + g(A\hat{x}) + g^*(\hat{z}) + (f+h)^*(-A^T\hat{w})\), the duality gap associated with the original problem and its dual. This shows that \(\eta (\hat{x}, \hat{z}) =0\) implies that \(\hat{x}, \hat{z}\) are optimal for problem (1) and (2).

Consider for example the primal and dual pair

$$\begin{aligned}{}\begin{array}{ll} \text{ minimize } &{} f(x) + h(x) \\ \text{ subject } \text{ to } &{} Ax = b \end{array} \qquad \quad \begin{array}{ll} \text{ maximize }&-b^Tz - (f+h)^*(-A^Tz). \end{array} \end{aligned}$$

Here, \(g= \delta _{\{b\}}\). If we take \(Z = \{z \mid \Vert z\Vert \le \gamma \}\), then \(\sigma _Z(y)=\gamma \Vert y\Vert \) and \((g \mathbin {\square } \sigma _Z)(y) = \gamma \Vert y-b\Vert \). If in addition \(\mathop {\textbf{dom}}(f+h)\) is bounded and we take \(X \supseteq \mathop {\textbf{dom}}(f+h)\), then

$$\begin{aligned} \eta (x,z) = f(x) + h(x) + \gamma \Vert Ax-b\Vert + b^Tz + (f+h)^*(-A^Tz) \end{aligned}$$

with domain \(\mathop {\textbf{dom}}(f+h) \times {\textbf{R}}^m\). The first three terms are the primal objective augmented with an exact penalty for the constraint \(Ax=b\).

As another example, consider

$$\begin{aligned}{}\begin{array}{ll} \text{ minimize } &{} \Vert x\Vert _1 \\ \text{ subject } \text{ to } &{} Ax\le b \end{array} \qquad \quad \begin{array}{ll} \text{ maximize } &{} -b^T z\\ \text{ subject } \text{ to } &{} \Vert A^Tz\Vert _\infty \le 1, \; z \ge 0. \end{array} \end{aligned}$$

This is an example of (1) with \(f(x)=\Vert x\Vert _1\), \(h(x)=0\), and g the indicator function of \(\{y \mid y \le b\}\). The domains \(\mathop {\textbf{dom}}f\) and \(\mathop {\textbf{dom}}g^*\) are unbounded. If we choose \(X = \{ x \mid \Vert x\Vert _\infty \le \kappa \}\) and \(Z = \{z \mid \Vert z\Vert _\infty \le \lambda \}\), then the merit function (6) for this example is

$$\begin{aligned} \eta (x,z) = \Vert x\Vert _1 + \lambda \sum _{i=1}^m \max \{0, (Ax-b)_i\} + b^T z + \kappa \sum _{i=1}^n \max \{0, |(A^Tz)_i|-1\} \end{aligned}$$

with domain \({\textbf{R}}^n \times {\textbf{R}}^m_+\). The second term is an exact penalty for the primal constraint \(Ax\le b\). The last term is an exact penalty for the dual constraint \(\Vert A^Tz\Vert _\infty \le 1\).
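For this example the merit function is easy to evaluate at any pair (x, z) with \(z \ge 0\), which makes it convenient for monitoring progress of a primal–dual method. A small sketch (the function name and interface are ours; \(\kappa \) and \(\lambda \) are the radii of X and Z chosen above):

```python
import numpy as np

def merit(x, z, A, b, kappa, lam):
    # Merit function (6) for f = ||.||_1, g the indicator of {y | y <= b},
    # X = {x : ||x||_inf <= kappa}, Z = {z : ||z||_inf <= lam}.
    # Valid for z >= 0 (the domain of g*).
    primal = np.abs(x).sum() + lam * np.maximum(A @ x - b, 0.0).sum()
    dual = b @ z + kappa * np.maximum(np.abs(A.T @ z) - 1.0, 0.0).sum()
    return primal + dual
```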

3 First-Order Proximal Algorithms: Survey and Connections

In this section, we discuss several first-order proximal algorithms and their connections. We start with four three-operator splitting algorithms for problem (1): the primal and dual variants of the Condat–Vũ algorithm [20, 50], the primal–dual three-operator (PD3O) algorithm [51], and the primal–dual Davis–Yin (PDDY) algorithm [44]. For each of the four algorithms, we make connections with other first-order proximal algorithms, using reduction (i.e., setting some parts in (1) to zero) and the “completion” reformulation (based on extending A to a matrix with orthogonal rows and equal row norms) [39]. We focus on the formal connections between algorithms. The connections do not necessarily provide the best approach for convergence analysis or the best known convergence results.

The proximal operator or proximal mapping of a closed convex function \(f :{\textbf{R}}^n \rightarrow {\textbf{R}}\) is defined as

$$\begin{aligned} \textrm{prox}_f(y) = \mathop {\textrm{argmin}}_x{\big (f(x) + \frac{1}{2}\Vert x-y\Vert ^2\big )}. \end{aligned}$$
(9)

If f is closed and convex, the minimizer in the definition exists and is unique for all y [37]. We will call (9) the standard or Euclidean proximal operator when we need to distinguish it from Bregman proximal operators in Section 4.
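For example, for \(f(x) = \lambda \Vert x\Vert _1\) the minimization in (9) separates over the coordinates and gives the familiar soft-thresholding operation. A minimal numerical sketch (the function name is ours):

```python
import numpy as np

def prox_l1(y, lam):
    # Euclidean proximal operator of f(x) = lam * ||x||_1:
    # argmin_x lam*||x||_1 + (1/2)*||x - y||^2, computed coordinate-wise
    # by soft-thresholding.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # approximately [1.0, 0.0, 0.2]
```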

3.1 Condat–Vũ Three-Operator Splitting Algorithm

We start with the (primal) Condat–Vũ three-operator splitting algorithm, which was proposed independently by Condat [20] and Vũ [50],

$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f} \big (x_k-\tau (A^Tz_k+\nabla h(x_k)) \big ) \end{aligned}$$
(10a)
$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*} \big (z_k+\sigma A(2x_{k+1}-x_k) \big ). \end{aligned}$$
(10b)

The stepsizes \(\sigma \) and \(\tau \) must satisfy \(\sigma \tau \Vert A\Vert _2^2 + \tau L \le 1\), where \(\Vert A\Vert _2\) is the spectral norm of A, and L is the Lipschitz constant of \(\nabla h\) with respect to the Euclidean norm. Many other first-order proximal algorithms can be viewed as special cases of (10), and their connections are summarized in Fig. 1.
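For illustration, the following sketch instantiates (10) for the hypothetical choice \(f(x) = \lambda \Vert x\Vert _1\) and \(g = \delta _{\{b\}}\), so that \(g^*(z) = \langle b, z\rangle \) and \(\textrm{prox}_{\sigma g^*}(v) = v - \sigma b\). It is only meant to show the structure of the updates, not a tuned implementation.

```python
import numpy as np

def condat_vu_primal(A, b, grad_h, L_h, lam, x0, z0, iters=500):
    # Primal Condat-Vu iteration (10) for
    #     minimize  lam*||x||_1 + h(x)   subject to  Ax = b,
    # i.e., f = lam*||.||_1 and g the indicator of {b}.
    # grad_h is a callable; L_h > 0 is a Lipschitz constant of grad_h.
    prox_l1 = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
    sigma = 1.0
    tau = 0.99 / (L_h + sigma * np.linalg.norm(A, 2)**2)  # sigma*tau*||A||^2 + tau*L <= 1
    x, z = x0.copy(), z0.copy()
    for _ in range(iters):
        x_new = prox_l1(x - tau * (A.T @ z + grad_h(x)), tau * lam)   # (10a)
        z = z + sigma * (A @ (2 * x_new - x) - b)                     # (10b)
        x = x_new
    return x, z
```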

Fig. 1 Proximal methods derived from primal Condat–Vũ algorithm.

When \(h=0\), algorithm (10) reduces to the (primal) primal–dual hybrid gradient (PDHG) method [10, 12, 42], or PDHGMu in [26]. When \(g=0\) in (10) (and assuming \(z_0=0\)), we obtain the proximal gradient algorithm. When \(f=0\), we obtain a variant of the Loris–Verhoeven algorithm, which will be referred to as the Loris–Verhoeven with shift, for reasons that will be clarified later. If we further set \(A=I\), we obtain the reduced Loris–Verhoeven algorithm with shift. However, due to the absence of f in the reduced Loris–Verhoeven algorithm with shift, it is not clear how to apply the “completion” trick to it. Furthermore, when \(A=I\) in PDHG, we obtain the Douglas–Rachford splitting (DRS) algorithm [18, 25, 32]. Conversely, the “completion” technique in [39] shows that PDHG coincides with DRS applied to a reformulation of the problem. Similarly, when \(A=I\) in the primal Condat–Vũ algorithm (10), we obtain a new algorithm and refer to it as the reduced primal Condat–Vũ algorithm. Conversely, the reduced primal Condat–Vũ algorithm reverts to (10) via the “completion” trick.

Condat [20] also discusses a variant of (10), which we will call the dual Condat–Vũ algorithm:

$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*} (z_k+\sigma Ax_k) \end{aligned}$$
(11a)
$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f} \big (x_k-\tau (A^T(2z_{k+1}-z_k)+\nabla h(x_k)) \big ). \end{aligned}$$
(11b)

Fig. 2 Proximal methods derived from dual Condat–Vũ algorithm.

Figure 2 summarizes the proximal algorithms derived from (11). When \(h=0\), algorithm (11) reduces to PDHG applied to the dual of (1) (with \(h=0\)), which is shown to be equivalent to linearized ADMM [40] (also called Split Inexact Uzawa in [26]). Setting \(g=0\) in (11) yields the proximal gradient algorithm. When \(f=0\), we obtain the dual Loris–Verhoeven algorithm with shift, following the previous naming convention, and if we further set \(A=I\), we obtain the reduced dual Loris–Verhoeven algorithm with shift. Moreover, setting \(A=I\) in (11) gives the reduced dual Condat–Vũ algorithm. Conversely, applying the “completion” trick to this reduced algorithm recovers (11). Similarly, setting \(A=I\) in dual PDHG gives dual DRS, i.e., DRS with f and g switched, and conversely, the “completion” trick recovers dual PDHG from dual DRS.

3.2 Primal–Dual Three-Operator (PD3O) Splitting Algorithm

The third diagram starts with the primal–dual three-operator (PD3O) splitting algorithm [51]

$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f} (x_{k}- \tau (A^Tz_{k}+\nabla h(x_{k}))) \end{aligned}$$
(12a)
$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*} (z_{k}+\sigma A(2x_{k+1}-x_{k}+\tau \nabla h(x_{k}) -\tau \nabla h(x_{k+1}))), \end{aligned}$$
(12b)

and is presented in Fig. 3.

Fig. 3 Proximal algorithms derived from PD3O.

Compared with the Condat–Vũ algorithm (10), PD3O seems to have slightly more complicated updates and larger complexity per iteration, but the requirement for the stepsizes is looser: \(\sigma \tau \Vert A\Vert _2^2 \le 1\) and \(\tau \le 1/L\). When \(h=0\), (12) reduces to the (primal) PDHG. The classical proximal gradient algorithm can be obtained by setting \(g=0\). The PD3O algorithm (12) with \(f=0\) was discovered independently as the Loris–Verhoeven algorithm [34], the primal–dual fixed point algorithm based on proximity operator (PDFP\(^2\)O) [15], and the proximal alternating predictor corrector (PAPC) [23]. Comparison with the Loris–Verhoeven algorithm with shift reveals a minor difference between these two algorithms: the gradient term in the z-update is taken at the newest primal iterate \(x_{k+1}\) in the Loris–Verhoeven algorithm and at the previous point \(x_k\) in the shifted version. This difference is inherited in the proximal gradient algorithm and its shifted version. Furthermore, when \(A=I\) and \(\sigma =1/\tau \) in PD3O, we recover the well-known Davis–Yin splitting (DYS) algorithm [21]. We can also set \(A=I\) in the Loris–Verhoeven algorithm and obtain the classical proximal gradient algorithm.
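In code, the only change relative to the Condat–Vũ sketch above is the argument of the dual proximal step, which includes the gradient correction in (12b); the stepsizes now only need to satisfy the decoupled conditions \(\sigma \tau \Vert A\Vert _2^2 \le 1\) and \(\tau \le 1/L\). A sketch under the same illustrative choices \(f = \lambda \Vert \cdot \Vert _1\) and \(g = \delta _{\{b\}}\):

```python
import numpy as np

def pd3o(A, b, grad_h, L_h, lam, x0, z0, iters=500):
    # PD3O iteration (12) for f = lam*||.||_1 and g the indicator of {b},
    # so that prox_{sigma g*}(v) = v - sigma*b.  Illustrative sketch only;
    # L_h > 0 is a Lipschitz constant of grad_h.
    prox_l1 = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
    tau = 0.99 / L_h                                   # tau <= 1/L
    sigma = 0.99 / (tau * np.linalg.norm(A, 2)**2)     # sigma*tau*||A||^2 <= 1
    x, z = x0.copy(), z0.copy()
    gx = grad_h(x)
    for _ in range(iters):
        x_new = prox_l1(x - tau * (A.T @ z + gx), tau * lam)               # (12a)
        gx_new = grad_h(x_new)
        z = z + sigma * (A @ (2 * x_new - x + tau * (gx - gx_new)) - b)    # (12b)
        x, gx = x_new, gx_new
    return x, z
```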

3.3 Primal–Dual Davis–Yin (PDDY) Splitting Algorithm

The core algorithm in Fig. 4 is the primal–dual Davis–Yin (PDDY) splitting algorithm [44]

$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*} (z_{k}+\sigma Ax_{k}) \end{aligned}$$
(13a)
$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f} \big (x_{k}-\tau A^T(2z_{k+1}-z_{k}) -\tau \nabla h(x_{k}+\tau A^T(z_{k}-z_{k+1})) \big ). \end{aligned}$$
(13b)

The requirement for stepsizes is the same as that in PD3O: \(\sigma \tau \Vert A\Vert _2^2 \le 1\) and \(\tau \le 1/L\). Figure 4 is almost identical to Fig. 3 with the roles of f and g exchanged. When \(h=0\), PDDY reduces to the dual PDHG. In addition, when \(A=I\) and \(\sigma =1/\tau \), PDDY reduces to the Davis–Yin algorithm, but with f and g exchanged. Similarly, when \(h=0\), \(A=I\) and \(\sigma =1/\tau \), PDDY reverts to the Douglas–Rachford algorithm with f and g switched.
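A sketch of (13) under the same illustrative choices of f and g as before; note that the gradient of h is evaluated at the auxiliary point \(x_k + \tau A^T(z_k - z_{k+1})\):

```python
import numpy as np

def pddy(A, b, grad_h, L_h, lam, x0, z0, iters=500):
    # PDDY iteration (13) for f = lam*||.||_1 and g the indicator of {b}.
    # Illustrative sketch only; L_h > 0 is a Lipschitz constant of grad_h.
    prox_l1 = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
    tau = 0.99 / L_h
    sigma = 0.99 / (tau * np.linalg.norm(A, 2)**2)
    x, z = x0.copy(), z0.copy()
    for _ in range(iters):
        z_new = z + sigma * (A @ x - b)                                   # (13a)
        u = x + tau * (A.T @ (z - z_new))     # point where grad h is evaluated
        x = prox_l1(x - tau * (A.T @ (2 * z_new - z)) - tau * grad_h(u),
                    tau * lam)                                            # (13b)
        z = z_new
    return x, z
```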

Fig. 4 Proximal algorithms derived from PDDY.

We have seen that the middle and right parts of Fig. 4 are those of Fig. 3 with f and g switched. However, when one of the functions f or g is absent, the algorithms reduced from PD3O and PDDY are exactly the same. In particular, when \(f=0\), PDDY reduces to the Loris–Verhoeven algorithm.

4 Bregman Distances

In this section, we give the definition of Bregman proximal operators and the basic properties that will be used in the paper. We refer the interested reader to [9] for an in-depth discussion of Bregman distances, their history, and applications.

Let \(\phi \) be a convex function with a domain that has nonempty interior, and assume \(\phi \) is continuous on \(\mathop {\textbf{dom}}\phi \) and continuously differentiable on \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). The generalized distance (or Bregman distance) generated by the kernel function \(\phi \) is defined as the function

$$\begin{aligned} d(x,y) = \phi (x)-\phi (y)-\langle \nabla \phi (y), x-y\rangle , \end{aligned}$$
(14)

with domain \(\mathop {\textbf{dom}}d=\mathop {\textbf{dom}}\phi \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). The corresponding Bregman proximal operator of a function f is

$$\begin{aligned} \textrm{prox}_f^\phi (y,a)&= \mathop {\textrm{argmin}}_x{(f(x)+\langle a, x\rangle +d(x,y))} \end{aligned}$$
(15)
$$\begin{aligned}&= \mathop {\textrm{argmin}}_x{(f(x)+\langle a, x\rangle +\phi (x) -\langle \nabla \phi (y), x\rangle )}. \end{aligned}$$
(16)

It is assumed that for every a and every \(y \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\) the minimizer \(\hat{x}=\textrm{prox}_f^\phi (y,a)\) is unique and in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\).

The distance generated by the kernel \(\phi (x)=(1/2)\Vert x\Vert ^2\) is the squared Euclidean distance \(d(x,y) = (1/2) \Vert x-y\Vert ^2\). The corresponding Bregman proximal operator is the standard proximal operator applied to \(y-a\):

$$\begin{aligned} \textrm{prox}_f^\phi (y,a)=\textrm{prox}_f(y-a). \end{aligned}$$

For this distance, closedness and convexity of f guarantee that the proximal operator is well defined. The questions of existence and uniqueness are more complicated for general Bregman distances. There are no simple general conditions that guarantee that for every a and every \(y \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\) the generalized proximal operator (15) is uniquely defined and in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi )\). Some sufficient conditions are provided (see, for example, [8, Section 4.1], [3, Assumption A]), but they may be quite restrictive or difficult to verify in practice. In applications, however, the Bregman proximal operator is used with specific combinations of f and \(\phi \), for which the minimization problem in (15) is particularly easy to solve. In those applications, existence and uniqueness of the solution follow directly from the closed-form solution or availability of a fast algorithm to compute it. A typical example will be provided in Sect. 8.
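As a simple illustration of a Bregman proximal operator with an inexpensive closed form, take the negative entropy kernel \(\phi (x)=\sum _i x_i\log x_i\) and let f be the indicator of the probability simplex (this pairing is our illustrative choice, not the application of Sect. 8). The minimizer in (15) is then a normalized multiplicative update of y:

```python
import numpy as np

def bregman_prox_simplex(y, a):
    # Bregman proximal operator (15) with kernel phi(x) = sum_i x_i*log(x_i)
    # and f the indicator of the probability simplex:
    #     argmin_x  <a, x> + d(x, y)   subject to  x >= 0, sum(x) = 1,
    # where d is the relative entropy generated by phi.
    # The solution has x_i proportional to y_i * exp(-a_i).
    w = y * np.exp(-(a - a.min()))     # shift a for numerical stability
    return w / w.sum()

y = np.full(4, 0.25)
print(bregman_prox_simplex(y, np.array([0.0, 1.0, 2.0, 3.0])))
```

Uniqueness of the minimizer and its membership in the interior of the domain (all components strictly positive when y > 0) are immediate from the formula, which illustrates the remark above.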

From the expression (16) and the definition of subgradient, we see that \(\hat{x}=\textrm{prox}_f^\phi (y,a)\) satisfies

$$\begin{aligned} f(x) + \langle a, x\rangle&\ge f(\hat{x})+\langle a, \hat{x}\rangle +\langle \nabla \phi (y)-\nabla \phi (\hat{x}), x-\hat{x}\rangle \nonumber \\&= f(\hat{x})+\langle a, \hat{x}\rangle +d(\hat{x}, y)+d(x,\hat{x})-d(x,y) \end{aligned}$$
(17)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi \). The equality in the second line is the three-point identity that follows directly from the definition (14); see, e.g., [45, Lemma 4.1].

5 Bregman Condat–Vũ Three-Operator Splitting Algorithms

We now discuss two Bregman three-operator splitting algorithms for the problem (1). The algorithms use a generalized distance \(d_\mathrm p\) in the primal space, generated by a kernel \(\phi _\mathrm p\), and a generalized distance \(d_\textrm{d}\) in the dual space, generated by a kernel \(\phi _\textrm{d}\). The first algorithm is

$$\begin{aligned} x_{k+1}&= \textrm{prox}^{\phi _\mathrm p}_{\tau f} \big (x_k, \tau A^Tz_k+\tau \nabla h(x_k) \big ) \end{aligned}$$
(18a)
$$\begin{aligned} z_{k+1}&= \textrm{prox}^{\phi _\textrm{d}}_{\sigma g^*} \big (z_k, -\sigma A(2x_{k+1}-x_k)\big ) \end{aligned}$$
(18b)

and will be referred to as the Bregman primal Condat–Vũ algorithm. The second algorithm will be called the Bregman dual Condat–Vũ algorithm:

$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*}^{\phi _\textrm{d}} (z_k, -\sigma Ax_k) \end{aligned}$$
(19a)
$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f}^{\phi _\mathrm p} (x_k, \tau A^T(2z_{k+1}-z_k) + \tau \nabla h(x_k) ). \end{aligned}$$
(19b)

The two algorithms need starting points \(x_0 \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \cap \mathop {\textbf{dom}}h\), and \(z_0 \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\). Conditions on stepsizes \(\sigma \), \(\tau \) will be specified later. When Euclidean distances are used for the primal and dual proximal operators, the two algorithms reduce to the primal and dual variants of the Condat–Vũ algorithm (10) and (11), respectively. Algorithm (18) has been proposed in [12]. Here, we discuss it together with (19) in a unified framework.
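In implementation terms, (18) simply replaces the two Euclidean proximal steps of (10) by evaluations of Bregman proximal operators of the form (15). A generic sketch, where `breg_prox_f` and `breg_prox_gconj` are user-supplied callables (names and interface are ours) evaluating \(\textrm{prox}^{\phi _\mathrm p}_{\tau f}(y,a)\) and \(\textrm{prox}^{\phi _\textrm{d}}_{\sigma g^*}(z,a)\):

```python
def bregman_condat_vu_primal(breg_prox_f, breg_prox_gconj, A, grad_h,
                             x0, z0, tau, sigma, iters=500):
    # Bregman primal Condat-Vu iteration (18).
    # breg_prox_f(y, a) returns prox^{phi_p}_{tau f}(y, a) and
    # breg_prox_gconj(z, a) returns prox^{phi_d}_{sigma g*}(z, a),
    # in the sense of (15); tau and sigma must satisfy (25).
    x, z = x0, z0
    for _ in range(iters):
        x_new = breg_prox_f(x, tau * (A.T @ z + grad_h(x)))          # (18a)
        z = breg_prox_gconj(z, -sigma * (A @ (2 * x_new - x)))       # (18b)
        x = x_new
    return x, z
```

With \(\phi _\mathrm p(x)=\frac{1}{2}\Vert x\Vert ^2\) and \(\phi _\textrm{d}(z)=\frac{1}{2}\Vert z\Vert ^2\), the two callables reduce to \(\textrm{prox}_{\tau f}(x-a)\) and \(\textrm{prox}_{\sigma g^*}(z-a)\), and the sketch coincides with (10).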

In Sect. 5.1, we show that the proposed algorithms can be interpreted as the Bregman proximal point method applied to a monotone inclusion problem. In Sect. 5.2, we analyze their convergence. In Sect. 5.3, we discuss the connections between the two algorithms and other Bregman proximal splitting methods.

Assumption 1

         

  (1.1)

    The kernel functions \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) are 1-strongly convex with respect to norms \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\), respectively:

    $$\begin{aligned} d_\mathrm p(x,x^\prime ) \ge \frac{1}{2}\Vert x-x^\prime \Vert ^2_\mathrm p, \qquad d_\textrm{d}(z,z^\prime ) \ge \frac{1}{2}\Vert z-z^\prime \Vert ^2_\textrm{d}\end{aligned}$$
    (20)

    for all \((x,x') \in \mathop {\textbf{dom}}d_\mathrm p\) and \((z,z') \in \mathop {\textbf{dom}}d_\textrm{d}\). The assumption that the strong convexity constants are equal to one can be made without loss of generality, by scaling the norms (or distances) if needed.

  (1.2)

    The function \(L\phi _\mathrm p-h\) is convex for some \(L>0\). More precisely, \(\mathop {\textbf{dom}}\phi _\mathrm p\subseteq \mathop {\textbf{dom}}h\) and

    $$\begin{aligned} h(x)-h(x^\prime )-\langle \nabla h(x^\prime ), x-x^\prime \rangle \le Ld_\mathrm p(x,x^\prime ) \quad \text{ for } \text{ all } (x,x^\prime ) \in \mathop {\textbf{dom}}d_\mathrm p. \end{aligned}$$
    (21)
  (1.3)

    The primal–dual optimality conditions (3) have a solution \((x^\star , z^\star )\) with \(x^\star \in \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z^\star \in \mathop {\textbf{dom}}\phi _\textrm{d}\).

Note that Assumption 1.2 is looser than the one in [12, Equation (4)]. We denote by \(\Vert A\Vert \) the matrix norm

$$\begin{aligned} \Vert A\Vert =\sup _{u \ne 0, v \ne 0} \frac{\langle v, Au\rangle }{\Vert v\Vert _\textrm{d}\Vert u\Vert _\mathrm p} = \sup _{u\ne 0} \frac{\Vert Au\Vert _{\textrm{d},*}}{\Vert u\Vert _\mathrm p} = \sup _{v\ne 0} \frac{\Vert A^Tv\Vert _{\mathrm p,*}}{\Vert v\Vert _\textrm{d}}, \end{aligned}$$
(22)

where \(\Vert \cdot \Vert _{\mathrm p,*}\) and \(\Vert \cdot \Vert _{\textrm{d},*}\) are the dual norms of \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\).
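As an illustration (our example; it is not needed in what follows), suppose the primal and dual norms are quadratic norms defined by positive definite matrices P and D, as in diagonal or quadratic preconditioning. Then \(\Vert w\Vert _{\mathrm p,*} = (w^TP^{-1}w)^{1/2}\), \(\Vert y\Vert _{\textrm{d},*} = (y^TD^{-1}y)^{1/2}\), and (22) reduces to an ordinary spectral norm:

$$\begin{aligned} \Vert u\Vert _\mathrm p= (u^TPu)^{1/2}, \quad \Vert v\Vert _\textrm{d}= (v^TDv)^{1/2} \qquad \Longrightarrow \qquad \Vert A\Vert = \sup _{u\ne 0}\frac{(u^TA^TD^{-1}Au)^{1/2}}{(u^TPu)^{1/2}} = \Vert D^{-1/2}AP^{-1/2}\Vert _2. \end{aligned}$$

The kernels \(\phi _\mathrm p(x) = \frac{1}{2}x^TPx\) and \(\phi _\textrm{d}(z) = \frac{1}{2}z^TDz\) are then 1-strongly convex with respect to these norms, as required by Assumption 1.1.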

5.1 Derivation from Bregman Proximal Point Method

The Bregman Condat–Vũ algorithms (18) and (19) can be viewed as applications of the Bregman proximal point algorithm to the optimality conditions (3). This interpretation extends the derivation of the Bregman PDHG algorithm from the Bregman proximal point algorithm given in [30]. The idea originates with He and Yuan’s interpretation of PDHG as a “preconditioned” proximal point algorithm [28].

The Bregman proximal point algorithm [9, 24, 27] is an algorithm for monotone inclusion problems \(0 \in F(u)\). The update \(u_{k+1}\) in one iteration of the algorithm is defined as the solution of the inclusion

$$\begin{aligned} \nabla \phi (u_k) - \nabla \phi (u_{k+1}) \in F(u_{k+1}), \end{aligned}$$

where \(\phi \) is a Bregman kernel function. Applied to (3), with a kernel function \(\phi _\textrm{pd}\), the algorithm generates a sequence \((x_k, z_k)\) defined by

$$\begin{aligned}{} & {} {\nabla \phi _\textrm{pd}(x_{k},z_{k}) -\nabla \phi _\textrm{pd}(x_{k+1},z_{k+1})} \nonumber \\{} & {} \quad \in \begin{bmatrix} A^T z_{k+1} + \partial f(x_{k+1})+\nabla h(x_{k+1}) \\ -A x_{k+1} + \partial g^*(z_{k+1}) \end{bmatrix}. \end{aligned}$$
(23)

5.1.1 Primal–Dual Bregman Distances

We introduce four possible primal–dual kernel functions: the functions

$$\begin{aligned} \phi _+(x,z) = \frac{1}{\tau }\phi _\mathrm p(x) +\frac{1}{\sigma }\phi _\textrm{d}(z)+\langle z, Ax\rangle , \quad \phi _-(x,z) = \frac{1}{\tau }\phi _\mathrm p(x) +\frac{1}{\sigma }\phi _\textrm{d}(z)-\langle z, Ax\rangle , \end{aligned}$$

where \(\sigma ,\tau > 0\), and the functions

$$\begin{aligned} \phi _\textrm{dcv}(x,z) = \phi _+(x,z) - h(x), \qquad \phi _\textrm{pcv}(x,z) = \phi _-(x,z) - h(x). \end{aligned}$$

The subscripts in \(\phi _+\) and \(\phi _-\) refer to the sign of the inner product term \(\langle z, Ax\rangle \). The subscripts in \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\) indicate the algorithm (Bregman primal or dual Condat-Vũ) for which these distances will be relevant. If these kernel functions are convex, they generate the following Bregman distances. The distances generated by \(\phi _+\) and \(\phi _-\) are

$$\begin{aligned} d_+ (x,z;x^\prime ,z^\prime )&= \frac{1}{\tau }d_\mathrm p(x,x^\prime )+\frac{1}{\sigma }d_\textrm{d}(z,z^\prime ) +\langle z-z^\prime , A(x-x^\prime )\rangle \nonumber \\ d_-(x,z;x^\prime ,z^\prime )&= \frac{1}{\tau }d_\mathrm p(x,x^\prime )+\frac{1}{\sigma }d_\textrm{d}(z,z^\prime ) -\langle z-z^\prime , A(x-x^\prime )\rangle , \end{aligned}$$
(24)

respectively, and the distances generated by \(\phi _\textrm{dcv}\) and \(\phi _\textrm{pcv}\) are

$$\begin{aligned} d_\textrm{dcv}(x,z;x^\prime ,z^\prime )&= d_+ (x,z;x^\prime ,z^\prime ) -h(x)+h(x^\prime )+\langle \nabla h(x'), x-x^\prime \rangle \\ d_\textrm{pcv}(x,z;x^\prime ,z^\prime )&= d_- (x,z;x^\prime ,z^\prime ) -h(x)+h(x^\prime )+\langle \nabla h(x'), x-x^\prime \rangle . \end{aligned}$$

The following lemma provides sufficient conditions for the (strong) convexity of \(\phi _+\), \(\phi _-\), \(\phi _\textrm{pcv}\), and \(\phi _\textrm{dcv}\).

Lemma 1

The functions \(\phi _+\) and \(\phi _-\) are convex if \(\sigma \tau \Vert A\Vert ^2 \le 1\) and strongly convex if \(\sigma \tau \Vert A\Vert ^2 < 1\). The functions \(\phi _\textrm{dcv}\) and \(\phi _\textrm{pcv}\) are convex if

$$\begin{aligned} \sigma \tau \Vert A\Vert ^2 + \tau L \le 1 \end{aligned}$$
(25)

and strongly convex if \(\sigma \tau \Vert A\Vert ^2 + \tau L < 1\).

Proof

To show that the kernel functions \(\phi _+\) and \(\phi _-\) are convex, we show that \(d_+\) and \(d_-\) are nonnegative. Suppose \(\sigma \tau \Vert A\Vert ^2 \le \delta _1 \delta _2\) with \(\delta _1,\delta _2 > 0\). Then, (20) and the arithmetic–geometric mean inequality imply that

$$\begin{aligned} \left| \langle z-z^\prime , A(x-x^\prime )\rangle \right|&\le \Vert A\Vert \Vert z-z'\Vert _\textrm{d}\Vert x-x'\Vert _\mathrm p\nonumber \\&\le \sqrt{\frac{\delta _1 \delta _2}{\sigma \tau }} \Vert z-z'\Vert _\textrm{d}\Vert x-x'\Vert _\mathrm p\nonumber \\&\le \frac{\delta _1}{2\tau } \Vert x-x'\Vert _\mathrm p^2 + \frac{\delta _2}{2\sigma } \Vert z-z'\Vert _\textrm{d}^2 \nonumber \\&\le \frac{\delta _1}{\tau } d_\mathrm p(x,x') + \frac{\delta _2}{\sigma } d_\textrm{d}(z,z'). \end{aligned}$$
(26)

Therefore,

$$\begin{aligned} d_{\pm } (x,z; x',z')&= \frac{1}{\tau } d_\mathrm p(x,x^\prime ) + \frac{1}{\sigma } d_\textrm{d}(z,z^\prime ) \pm \langle z-z^\prime , A(x-x^\prime )\rangle \\&\ge \frac{1-\delta _1}{\tau } d_\mathrm p(x,x') + \frac{1-\delta _2}{\sigma } d_\textrm{d}(z,z') \\&\ge \frac{1-\delta _1}{2\tau } \Vert x-x'\Vert _\mathrm p^2 + \frac{1-\delta _2}{2\sigma } \Vert z-z'\Vert _\textrm{d}^2. \end{aligned}$$

With \(\delta _1=\delta _2=1\), this shows convexity of \(\phi _+\) and \(\phi _-\); with \(\delta _1=\delta _2<1\), strong convexity. Similarly,

$$\begin{aligned} d_\mathrm {dcv/pcv} (x,z; x', z')&= d_{\pm } (x,z; x',z') - h(x) + h(x') + \langle \nabla h(x'), x-x'\rangle \\&\ge \frac{1-\tau L-\delta _1}{\tau } d_\mathrm p(x,x^\prime ) + \frac{1-\delta _2}{\sigma } d_\textrm{d}(z,z^\prime ). \end{aligned}$$

With \(\delta _1=1-\tau L\) and \(\delta _2=1\), this shows convexity of \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\); with \(\delta _1=\delta -\tau L\) and \(\delta _2=\delta < 1\), strong convexity. \(\square \)

5.1.2 Bregman Condat-Vũ Algorithms from Proximal Point Method

The Bregman primal Condat–Vũ algorithm (18) is the Bregman proximal point method with the kernel function \(\phi _\textrm{pd} = \phi _\textrm{pcv}\). If we take \(\phi _\textrm{pd} = \phi _\textrm{pcv}\) in (23), we obtain two coupled inclusions that determine \(x_{k+1}\), \(z_{k+1}\). The first one is

$$\begin{aligned} 0 \in \frac{1}{\tau }(\nabla \phi _\mathrm p(x_{k+1}) -\nabla \phi _\mathrm p(x_k))+A^Tz_k +\nabla h(x_k)+\partial f(x_{k+1}). \end{aligned}$$

This shows that \(x_{k+1}\) solves the optimization problem

$$\begin{aligned} \text{ minimize } \quad f(x)+\langle A^Tz_k+\nabla h(x_k), x\rangle +\frac{1}{\tau } d_\mathrm p(x,x_k). \end{aligned}$$

The solution is the x-update (18a) in the Bregman primal Condat–Vũ method. The second inclusion is

$$\begin{aligned} 0 \in \frac{1}{\sigma }(\nabla \phi _\textrm{d}(z_{k+1}) -\nabla \phi _\textrm{d}(z_k))-A(2x_{k+1}-x_k) +\partial g^*(z_{k+1}). \end{aligned}$$

This shows that \(z_{k+1}\) solves the optimization problem

$$\begin{aligned} \text{ minimize } \quad g^*(z)-\langle z, A(2x_{k+1}-x_k)\rangle +\frac{1}{\sigma } d_\textrm{d}(z,z_k). \end{aligned}$$

The solution is the z-update (18b).

Similarly, choosing \(\phi _\textrm{pd} = \phi _\textrm{dcv}\) in (23) yields the Bregman dual Condat–Vũ algorithm (19).

5.2 Convergence Analysis

The derivation in Sect. 5.1 allows us to apply existing convergence theory for the Bregman proximal point method to the proposed algorithms (18) and (19). In particular, Solodov and Svaiter [45] have studied Bregman proximal point methods with inexact prox-evaluations for solving variational inequalities, which include the monotone inclusion problem as a special case. The results in [45] can be applied to analyze convergence of the Bregman Condat–Vũ methods with inexact evaluations of proximal operators.

The literature on the Bregman proximal point method for monotone inclusions [24, 27, 45] focuses on the convergence of iterates, and this generally requires additional assumptions on \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) (beyond Assumption 1.1). In this section, we present a self-contained convergence analysis and give a direct proof of an O(1/k) rate of ergodic convergence (using only Assumption 1). We also give a self-contained proof of convergence of the iterates \(x_k\) and \(z_k\).

For the sake of brevity, we combine the analysis of the Bregman primal and Bregman dual Condat–Vũ algorithms. In the following, d, \(\tilde{d}\), \(\tilde{\phi }\) are defined as

$$\begin{aligned} \begin{array}{llll} d = d_- &{}\qquad \tilde{d} = d_\textrm{pcv} &{}\qquad \tilde{\phi }= \phi _\textrm{pcv} &{}\qquad \text{for the Bregman primal Condat--V\~{u} algorithm (18)}, \\ d = d_+ &{}\qquad \tilde{d} = d_\textrm{dcv} &{}\qquad \tilde{\phi }= \phi _\textrm{dcv} &{}\qquad \text{for the Bregman dual Condat--V\~{u} algorithm (19)}. \end{array} \end{aligned}$$

5.2.1 One-Iteration Analysis

Lemma 2

Under Assumption 1 and the stepsize condition (25), the iterates generated by the Bregman Condat–Vũ algorithms (18) and (19) satisfy

$$\begin{aligned}{} & {} {\mathcal {L}(x_{k+1},z)-\mathcal {L}(x,z_{k+1})} \nonumber \\{} & {} \quad \le d(x,z;x_{k},z_{k})- d(x,z;x_{k+1},z_{k+1}) - \tilde{d}(x_{k+1}, z_{k+1}; x_{k}, z_{k}) \end{aligned}$$
(27)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).

Proof

We write (18) and (19) in a unified notation as

$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau f}^{\phi _\mathrm p} (x_k, \tau (A^T \tilde{z} + \nabla h(x_k))) \end{aligned}$$
(28a)
$$\begin{aligned} z_{k+1}&= \textrm{prox}_{\sigma g^*}^{\phi _\textrm{d}} (z_k, -\sigma A\tilde{x}) \end{aligned}$$
(28b)

where \(\tilde{x}\) and \(\tilde{z}\) are defined in the following table:

$$\begin{aligned} \begin{array}{lll} \text{Bregman primal Condat--V\~{u}:} &{} \tilde{x}=2x_{k+1}-x_k &{} \tilde{z}=z_k \\ \text{Bregman dual Condat--V\~{u}:} &{} \tilde{x}=x_k &{} \tilde{z}=2z_{k+1}-z_k. \end{array} \end{aligned}$$

The optimality condition (17) for the proximal operator evaluation (28a) is

$$\begin{aligned} {\tau (f(x_{k+1})- f(x))}\le & {} d_\mathrm p(x,x_k) - d_\mathrm p(x_{k+1},x_k) - d_\mathrm p(x,x_{k+1}) \\{} & {} {} + \tau \langle A^T\tilde{z}+\nabla h(x_k), x-x_{k+1}\rangle \end{aligned}$$

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\). The optimality condition for (28b) is that

$$\begin{aligned} {\sigma (g^*(z_{k+1})- g^*(z))}\le & {} d_\textrm{d}(z,z_k) - d_\textrm{d}(z_{k+1}, z_k) - d_\textrm{d}(z, z_{k+1}) \\{} & {} - \sigma \langle z-z_{k+1}, A\tilde{x}\rangle \end{aligned}$$

for all \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). Combining the two inequalities gives

$$\begin{aligned}{} & {} {\mathcal {L}(x_{k+1}, z) - \mathcal {L}(x,z_{k+1})} \nonumber \\{} & {} \quad = f(x_{k+1}) - f(x) + h(x_{k+1}) - h(x) + g^*(z_{k+1}) - g^*(z) \nonumber \\{} & {} \qquad + \langle A^Tz, x_{k+1}\rangle - \langle z_{k+1}, Ax\rangle \nonumber \\{} & {} \quad \le \frac{1}{\tau } \Big (d_\mathrm p(x,x_k)-d_\mathrm p(x,x_{k+1}) - d_\mathrm p(x_{k+1}, x_k) \Big ) \nonumber \\{} & {} \qquad + \frac{1}{\sigma } \Big (d_\textrm{d}(z,z_k) - d_\textrm{d}(z,z_{k+1}) - d_\textrm{d}(z_{k+1}, z_k) \Big ) \nonumber \\{} & {} \qquad + h(x_{k+1})-h(x)+\langle \nabla h(x_k), x-x_{k+1}\rangle \nonumber \\{} & {} \qquad + \langle A^T \tilde{z}, x-x_{k+1}\rangle - \langle z-z_{k+1}, A\tilde{x}\rangle + \langle A^T z, x_{k+1}\rangle -\langle z_{k+1}, Ax\rangle \end{aligned}$$
(29)
$$\begin{aligned}{} & {} \quad \le \frac{1}{\tau } \Big ( d_\mathrm p(x,x_k)-d_\mathrm p(x,x_{k+1}) - d_\mathrm p(x_{k+1}, x_k) \Big ) \nonumber \\{} & {} \qquad + \frac{1}{\sigma } \Big (d_\textrm{d}(z,z_k)-d_\textrm{d}(z,z_{k+1}) - d_\textrm{d}(z_{k+1}, z_k) \Big ) \nonumber \\{} & {} \qquad + h(x_{k+1})-h(x_k)- \langle \nabla h(x_k), x_{k+1} - x_k\rangle \nonumber \\{} & {} \qquad + \langle A^T \tilde{z}, x-x_{k+1}\rangle - \langle z-z_{k+1}, A\tilde{x}\rangle \nonumber \\{} & {} \qquad + \langle A^T z, x_{k+1}\rangle -\langle z_{k+1}, Ax\rangle \end{aligned}$$
(30)

for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all \(z\in \mathop {\textbf{dom}}g^* \cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The second inequality follows from convexity of h. Substituting the expressions for \(\tilde{x}\) and \(\tilde{z}\) in the Bregman primal Condat–Vũ algorithm (18), we obtain for the last four terms on the right-hand side of (30)

$$\begin{aligned}{} & {} {\langle A^T\tilde{z}, x-x_{k+1}\rangle -\langle z-z_{k+1}, A\tilde{x}\rangle +\langle A^T z, x_{k+1}\rangle - \langle z_{k+1}, Ax\rangle } \nonumber \\{} & {} \quad = \langle z_k, A(x-x_{k+1})\rangle -\langle z-z_{k+1}, A(2x_{k+1}-x_k)\rangle +\langle A^T z, x_{k+1}\rangle - \langle z_{k+1}, Ax\rangle \nonumber \\{} & {} \quad = \langle z_k-z_{k+1}, A(x-x_{k+1})\rangle +\langle z-z_{k+1}, A(x_k-x_{k+1})\rangle \nonumber \\{} & {} \quad = -\langle z-z_k, A(x-x_k)\rangle +\langle z-z_{k+1}, A(x-x_{k+1})\rangle \nonumber \\{} & {} \qquad {} + \langle z_{k+1}-z_k, A(x_{k+1}-x_k)\rangle . \end{aligned}$$
(31)

If we substitute the expressions for \(\tilde{x}\) and \(\tilde{z}\) in the Bregman dual Condat–Vũ algorithm, the last four terms on the right-hand side of (30) are equal to the negative of the right-hand side of (31).

Therefore, for both algorithms, (30) implies that

$$\begin{aligned}{} & {} {\mathcal {L}(x_{k+1}, z) - \mathcal {L}(x,z_{k+1})} \nonumber \\{} & {} \quad \le \frac{1}{\tau } d_\mathrm p(x,x_k) + \frac{1}{\sigma } d_\textrm{d}(z,z_k) {\mp } \langle z-z_k, A(x-x_k)\rangle \nonumber \\{} & {} \qquad {} - \Big (\frac{1}{\tau } d_\mathrm p(x,x_{k+1}) + \frac{1}{\sigma } d_\textrm{d}(z,z_{k+1}) {\mp } \langle z-z_{k+1}, A(x-x_{k+1})\rangle \Big ) \nonumber \\{} & {} \qquad {} -\Big (\frac{1}{\tau } d_\mathrm p(x_{k+1},x_k) + \frac{1}{\sigma } d_\textrm{d}(z_{k+1},z_k) {\mp } \langle z_{k+1} - z_k, A(x_{k+1}-x_k)\rangle \Big ) \nonumber \\{} & {} \qquad {} + h(x_{k+1})-h(x_k) -\langle \nabla h(x_k), x_{k+1}-x_k\rangle , \end{aligned}$$

if we select the minus sign in \({\mp }\) for the Bregman primal Condat–Vũ algorithm, and the plus sign for the Bregman dual Condat–Vũ algorithm. \(\square \)

5.2.2 Ergodic Convergence

We define averaged iterates

$$\begin{aligned} x^\textrm{avg}_k = \frac{1}{k} \sum _{i=1}^k x_i, \qquad z^\textrm{avg}_k = \frac{1}{k} \sum _{i=1}^k z_i \end{aligned}$$
(32)

for \(k \ge 1\). The ergodic convergence of \((x^\textrm{avg}_k,z^\textrm{avg}_k)\) is given in the following theorem.

Theorem 1

Under Assumption 1 and the stepsize condition (25), the averaged iterates \((x^\textrm{avg}_k,z^\textrm{avg}_k)\) satisfy

$$\begin{aligned} \mathcal {L}(x^\textrm{avg}_k, z) - \mathcal {L}(x, z^\textrm{avg}_k) \le \frac{2}{k} \Big (\frac{1}{\tau } d_\mathrm p(x,x_0) + \frac{1}{\sigma } d_\textrm{d}(z,z_0) \Big ) \end{aligned}$$
(33)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).

Proof

From (27) in Lemma 2, since \(\mathcal {L}(u,v)\) is convex in u and concave in v,

$$\begin{aligned} \mathcal {L}(x^\text {avg}_k, z)-\mathcal {L}(x, z^\text {avg}_k)&\le \frac{1}{k} \sum _{i=1}^k \big (\mathcal {L}(x_i,z)-\mathcal {L}(x,z_i)\big ) \nonumber \\ {}&\le \frac{1}{k} \big (d(x,z;x_0,z_0) - d(x,z;x_k,z_k) \big ) \nonumber \\ {}&\le \frac{1}{k} d(x,z;x_0,z_0) \nonumber \\ {}&\le \frac{2}{k} \Big (\frac{1}{\tau } d_\mathrm p(x,x_0) + \frac{1}{\sigma } d_\text {d}(z,z_0) \Big ) \end{aligned}$$

for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The last step follows from (26) with \(\delta _1=\delta _2=1\). \(\square \)

Substituting \(x=x^\star \), \(z=z^\star \) in (33) gives

$$\begin{aligned} \mathcal {L}(x^\textrm{avg}_k, z^\star )-\mathcal {L}(x^\star , z^\textrm{avg}_k) \le \frac{2}{k} \Big (\frac{1}{\tau } d_\mathrm p(x^\star ,x_0) + \frac{1}{\sigma } d_\textrm{d}(z^\star ,z_0) \Big ). \end{aligned}$$

More generally, if \(X \subseteq \mathop {\textbf{dom}}\phi _\mathrm p\) and \(Z \subseteq \mathop {\textbf{dom}}\phi _\textrm{d}\) are compact convex sets that contain optimal solutions \(x^\star \), \(z^\star \) in their interiors, then the merit function (6) is bounded by

$$\begin{aligned} \eta (x^\textrm{avg}_k, z^\textrm{avg}_k) \le \frac{2}{k} \Big (\frac{1}{\tau } \sup _{x \in X} d_\mathrm p(x,x_0) + \frac{1}{\sigma } \sup _{z \in Z} d_\textrm{d}(z,z_0) \Big ). \end{aligned}$$

5.2.3 Monotonicity Properties

We present an auxiliary result that will be used in Sect. 5.2.4 to show convergence of iterates.

Lemma 3

Under Assumption 1 and the stepsize condition (25), we have

$$\begin{aligned} d(x^\star , z^\star ; x_{k+1}, z_{k+1}) \le d(x^\star , z^\star ; x_k, z_k) - \tilde{d}(x_{k+1}, z_{k+1}; x_k, z_k) \end{aligned}$$
(34)

for \(k \ge 0\), and

$$\begin{aligned} d(x^\star , z^\star ; x_k, z_k) \le d(x^\star , z^\star ; x_0, z_0). \end{aligned}$$
(35)

The inequality (34) also implies \(\sum _{i=0}^k \tilde{d}(x_{i+1}, z_{i+1}; x_i, z_i) \le d(x^\star , z^\star ; x_0, z_0)\), and thus \(\tilde{d}(x_{k+1}, z_{k+1}; x_k, z_k) \rightarrow 0\).

Proof

For \((x,z)=(x^\star ,z^\star )\), the left-hand side of (27) is nonnegative by the saddle point property (4), so (34) holds. Since \(\tilde{d}\) is nonnegative, (34) implies \(d(x^\star , z^\star ; x_{k+1}, z_{k+1}) \le d(x^\star , z^\star ; x_k, z_k)\) for all k, and (35) follows. \(\square \)

5.2.4 Convergence of Iterates

Convergence of iterates can be obtained by combining the derivation in Sect. 5.1 and existing results on Bregman proximal point method [27, Theorem 3.1], [45, Theorem 3.2]. Here, we provide a self-contained proof under additional assumptions about the primal and dual distance functions.

Assumption 2

  (2.1)

    For fixed x and z, the sublevel sets \(\{x^\prime \mid d_\mathrm p(x,x^\prime ) \le \gamma \}\) and \(\{z^\prime \mid d_\textrm{d}(z,z^\prime ) \le \gamma \}\) are closed. In other words, the distances \(d_\mathrm p(x,x^\prime )\) and \(d_\textrm{d}(z,z^\prime )\) are closed functions of \(x^\prime \) and \(z^\prime \), respectively. Since a sum of closed functions is closed, the distance \(d(x,z; x',z')\) is a closed function of \((x',z')\), for fixed (xz).

  (2.2)

    If \(\tilde{x}_k \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p)\) converges to \(x \in \mathop {\textbf{dom}}\phi _\mathrm p\), then \(d_\mathrm p(x,\tilde{x}_k) \rightarrow 0\). Similarly, if \(\tilde{z}_k \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\) converges to \(z \in \mathop {\textbf{dom}}\phi _\textrm{d}\), then \(d_\textrm{d}(z,\tilde{z}_k) \rightarrow 0\).

  (2.3)

    The stepsizes \(\sigma \) and \(\tau \) satisfy \(\sigma \tau \Vert A\Vert ^2+\tau L<1\).

The first two assumptions in Assumption 2 are common in the literature on Bregman distances [9, 14, 24, 27]. As shown in Lemma 1, Assumption 2 implies that the kernel functions \(\phi _\textrm{pcv}\) and \(\phi _\textrm{dcv}\) are strongly convex and that

$$\begin{aligned} \tilde{d}(x,z; x',z') \ge \frac{\alpha }{2\tau } \Vert x-x'\Vert _\mathrm p^2 + \frac{\alpha }{2\sigma } \Vert z-z'\Vert _\textrm{d}^2 \end{aligned}$$
(36)

for some \(\alpha > 0\). Similarly, \(\sigma \tau \Vert A\Vert ^2 < 1\) implies that

$$\begin{aligned} d(x,z; x',z') \ge \frac{\beta }{2\tau } \Vert x-x'\Vert _\mathrm p^2 + \frac{\beta }{2\sigma } \Vert z-z'\Vert _\textrm{d}^2 \end{aligned}$$
(37)

for some \(\beta > 0\). Recall that \(d=d_-\), \(\tilde{d}=d_\textrm{pcv}\) for the Bregman primal algorithm (18), and \(d=d_+\), \(\tilde{d}=d_\textrm{dcv}\) for the Bregman dual algorithm (19).

Theorem 2

Under Assumptions 1 and 2, the iterates \((x_k,z_k)\) generated by Bregman primal and dual Condat–Vũ algorithms (18) and (19) converge to an optimal point \((x^\star ,z^\star )\).

Proof

We first note that \(\tilde{d}(x_{k+1}, z_{k+1}; x_k, z_k) \rightarrow 0\) and (36) imply that \(x_{k+1} -x_k \rightarrow 0\) and \(z_{k+1}-z_k \rightarrow 0\).

The inequality (35), together with (37), implies that the sequence \((x_k,z_k)\) is bounded. Let \((x_{k_i},z_{k_i})\) be a convergent subsequence of \((x_k, z_k)\) with limit \((\hat{x}, \hat{z})\). Since \(x_{k_i+1}-x_{k_i} \rightarrow 0\) and \(z_{k_i+1} -z_{k_i} \rightarrow 0\), the sequence \((x_{k_i+1},z_{k_i+1})\) converges to \((\hat{x}, \hat{z})\). We show that \((\hat{x},\hat{z})\) satisfies the optimality condition (3).

From (35), \(d(x^\star ,z^\star ;x_{k_i},z_{k_i})\) is bounded. Since the sublevel sets \(\{(x^\prime ,z^\prime ) \mid d(x^\star ,z^\star ;x^\prime ,z^\prime ) \le \gamma \}\) are closed and contained in \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\), the limit \((\hat{x},\hat{z}) \in \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p) \times \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{d})\). The iterates in the subsequence satisfy

$$\begin{aligned}&\nabla \phi _\textrm{pd} (x_{k_i},z_{k_i}) -\nabla \phi _\textrm{pd} (x_{k_i+1},z_{k_i+1}) +\begin{bmatrix} -A^Tz_{k_i+1} \\ Ax_{k_i+1} \end{bmatrix} \nonumber \\&\in \begin{bmatrix} \partial f(x_{k_i+1})+\nabla h(x_{k_i+1}) \\ \partial g^*(z_{k_i+1}) \end{bmatrix}, \end{aligned}$$
(38)

where \(\phi _\textrm{pd}=\phi _\textrm{pcv}\) in the Bregman primal Condat–Vũ algorithm and \(\phi _\textrm{pd}=\phi _\textrm{dcv}\) in the Bregman dual Condat–Vũ algorithm. The left-hand side of (38) converges to \((-A^T\hat{z}, A\hat{x})\) because \(\nabla \phi _\textrm{pd}\) is continuous on \(\mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\textrm{pd})\). Since the operator on the right-hand side of (38) is maximal monotone, the limit point \((\hat{x},\hat{z})\) satisfies the optimality condition (see [7, page 27], [47, Lemma 3.2])

$$\begin{aligned} \begin{bmatrix} -A^T\hat{z} \\ A\hat{x} \end{bmatrix} \in \begin{bmatrix} \partial f(\hat{x})+\nabla h(\hat{x}) \\ \partial g^*(\hat{z}) \end{bmatrix}. \end{aligned}$$

To show convergence of the entire sequence, we substitute \((\hat{x},\hat{z})\) in (27):

$$\begin{aligned} \mathcal {L}(x_{k+1},\hat{z})-\mathcal {L}(\hat{x},z_{k+1}) \le d(\hat{x},\hat{z};x_k,z_k)-d(\hat{x},\hat{z};x_{k+1},z_{k+1}). \end{aligned}$$

We have \(d(\hat{x},\hat{z};x_k,z_k) \le d(\hat{x},\hat{z};x_{k-1},z_{k-1})\) for all \(k \ge 1\), since the left-hand side of the inequality above is nonnegative. This further implies that \(d(\hat{x},\hat{z};x_k,z_k) \le d(\hat{x},\hat{z};x_{k_i},z_{k_i})\) for all \(k \ge k_i\). By Assumption 2.2, \(d(\hat{x},\hat{z};x_{k_i},z_{k_i}) \rightarrow 0\), so \(d(\hat{x},\hat{z};x_k,z_k) \rightarrow 0\), and from (37), \(x_k \rightarrow \hat{x}\) and \(z_k \rightarrow \hat{z}\). \(\square \)

5.3 Relation to Other Bregman Proximal Algorithms

Following similar steps as in Sect. 3, we obtain several Bregman proximal splitting methods as special cases of (18) and (19). The connections are summarized in Figs. 5 and  6. A comparison of Figs. 1 and 5 shows that all the reduction relations (\(A=I\)) are still valid. However, it is unclear how to apply the “completion” operation to algorithms based on non-Euclidean Bregman distances.

Fig. 5 Proximal algorithms derived from Bregman primal Condat–Vũ algorithm (18).

When \(h=0\), (18) reduces to Bregman PDHG [12]. When \(g=0\), \(g^*=\delta _{\{0\}}\) (and assuming \(z_0=0\)), we obtain the Bregman proximal gradient algorithm [3]. When \(f=0\) in (18), we obtain the Bregman Loris–Verhoeven algorithm with shift, and if we further set \(A=I\), we obtain the Bregman reduced Loris–Verhoeven algorithm with shift. Similarly, when \(A=I\) in (18), we recover the reduced Bregman primal Condat–Vũ algorithm, and setting \(A=I\) in Bregman PDHG yields the Bregman Douglas–Rachford algorithm.

Similarly, the Bregman dual Condat–Vũ algorithm (19) can be reduced to some other Bregman proximal splitting methods, as summarized in Fig. 6. In particular, when \(f=0\) in (19), we obtain the Bregman dual Loris–Verhoeven algorithm with shift, and if we further set \(A=I\), we obtain the reduced Bregman Loris–Verhoeven algorithm with shift.

Fig. 6 Proximal algorithms derived from Bregman dual Condat–Vũ algorithm (19).

6 Bregman Dual Condat–Vũ Algorithm with Line Search

The algorithms (18) and (19) use constant parameters \(\sigma \) and \(\tau \). The stepsize condition (25) involves the matrix norm \(\Vert A\Vert \) and the Lipschitz constant L in (21). Estimating or bounding \(\Vert A\Vert \) for a large matrix can be difficult. As an added complication, the norms \(\Vert \cdot \Vert _\mathrm p\) and \(\Vert \cdot \Vert _\textrm{d}\) in the definition of the matrix norm (22) are assumed to be scaled so that the strong convexity parameters of the primal and dual kernels are equal to one. Close bounds on the strong convexity parameters may also be difficult to obtain. Using conservative bounds for \(\Vert A\Vert \) and L results in unnecessarily small values of \(\sigma \) and \(\tau \) and can dramatically slow down the convergence. Even when the estimates of \(\Vert A\Vert \) and L are accurate, the requirements for the stepsizes (25) are still too strict in most iterations, as observed in [1]. In view of the above arguments, line search techniques for primal–dual proximal methods have recently become an active area of research. Malitsky and Pock [36] proposed a line search technique for PDHG and the Condat–Vũ algorithm in the Euclidean case. The algorithm with adaptive parameters in [49] focuses on a special case of (1) (i.e., \(f=0\)) and extends the Loris–Verhoeven algorithm. A Bregman proximal splitting method with line search is discussed in [30] and considers the problem (1) with \(h=0\) and \(g=\delta _{\{b\}}\). In this section, we extend the Bregman dual Condat–Vũ algorithm (19) with a varying parameter option, in which the stepsizes are chosen adaptively without requiring any estimates or bounds for \(\Vert A\Vert \) or the strong convexity parameter of the kernels. The algorithm is restricted to problems in the equality constrained form

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} f(x) + h(x) \\ \text{ subject } \text{ to } &{} Ax=b. \end{array} \end{aligned}$$
(39)

This is a special case of (1) with \(g=\delta _{\{b\}}\), the indicator function of the singleton \(\{b\}\).

The details of the algorithm are discussed in Sect. 6.1, and a convergence analysis is presented in Sect. 6.2. The main conclusion is an O(1/k) rate of ergodic convergence, consistent with previous results for related algorithms [30, 36].

Assumption 3

The kernel function, the Bregman distance, and the norm in the dual space are

$$\begin{aligned} \phi _\textrm{d}(z) = \tfrac{1}{2} \Vert z\Vert ^2, \qquad d_\textrm{d}(z,z') = \tfrac{1}{2} \Vert z-z'\Vert ^2, \qquad \Vert z\Vert _\textrm{d}= \Vert z\Vert . \end{aligned}$$

(Recall that \(\Vert \cdot \Vert \) denotes the Euclidean norm.)

The matrix norm \(\Vert A\Vert \) is defined accordingly as

$$\begin{aligned} \Vert A\Vert =\sup _{u \ne 0, v \ne 0} \frac{\langle v, Au\rangle }{\Vert v\Vert \Vert u\Vert _\mathrm p} = \sup _{u \ne 0} \frac{\Vert Au\Vert }{\Vert u\Vert _\mathrm p} = \sup _{v \ne 0} \frac{\Vert A^Tv\Vert _{\mathrm p,*}}{\Vert v\Vert }. \end{aligned}$$

6.1 Algorithm

The algorithm uses the following iteration, with starting points \(x_0 \in \mathop {\textbf{dom}}h \cap \mathop {\textbf{int}}(\mathop {\textbf{dom}}\phi _\mathrm p)\) and \(z_{-1} = z_0\):

$$\begin{aligned} \bar{z}_{k+1}&= z_k + \theta _k (z_k - z_{k-1}) \end{aligned}$$
(40a)
$$\begin{aligned} x_{k+1}&= \textrm{prox}_{\tau _k f}^{\phi _\mathrm p} \big ( x_k, \tau _k (A^T\bar{z}_{k+1}+\nabla h(x_k)) \big ) \end{aligned}$$
(40b)
$$\begin{aligned} z_{k+1}&= z_k + \sigma _k (Ax_{k+1}-b). \end{aligned}$$
(40c)

With constant parameters \(\theta _k=1\), \(\sigma _k=\sigma \), \(\tau _k = \tau \), the algorithm reduces to the Bregman dual Condat–Vũ algorithm (19) applied to (39), except for the numbering of the dual iterates.

In the line search algorithm, the parameters \(\theta _k\), \(\tau _k\), \(\sigma _k\) are determined by a backtracking search. At the start of the algorithm, we set \(\tau _{-1}\) and \(\sigma _{-1}\) to some positive values and fix a parameter \(\delta \in (0,1]\). To start the search in iteration k, we choose \(\bar{\theta }_k \ge 1\). For \(i=0,1,2,\ldots \), we set \(\theta _k = 2^{-i}\bar{\theta }_k\), \(\tau _k=\theta _k\tau _{k-1}\), \(\sigma _k=\theta _k\sigma _{k-1}\), and compute \(\bar{z}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) using (40). If

$$\begin{aligned}{} & {} \langle z_{k+1}-\bar{z}_{k+1}, A(x_{k+1}-x_k)\rangle +h(x_{k+1})-h(x_k)-\langle \nabla h(x_k), x_{k+1}-x_k\rangle \nonumber \\{} & {} \le \frac{\delta ^2}{\tau _k} d_\mathrm p(x_{k+1}, x_k) + \frac{1}{2\sigma _k} \Vert \bar{z}_{k+1} - z_{k+1}\Vert ^2, \end{aligned}$$
(41)

we accept the computed iterates \(\bar{z}_{k+1}\), \(x_{k+1}\), \(z_{k+1}\) and parameters \(\theta _k\), \(\sigma _k\), \(\tau _k\), and terminate the backtracking search. If (41) does not hold, we increment i and continue the backtracking search.

The backtracking condition (41) is similar to the condition in the line search algorithm for PDHG with Euclidean proximal operators [36, Algorithm 4], but it is not identical, even in the Euclidean case. The proposed condition is weaker and allows larger stepsizes than the condition in [36, Algorithm 4].
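To make the procedure concrete, the following Python sketch implements the iteration (40) together with the backtracking exit condition (41). It is a minimal illustration, not the implementation used in Sect. 8: the callables `prox_f`, `d_p`, `h`, and `grad_h` (the Bregman proximal operator of f, the primal Bregman distance, and the function h with its gradient) are placeholders that must be supplied by the user.

```python
import numpy as np

def bregman_dual_cv_linesearch(prox_f, d_p, h, grad_h, A, b, x0, z0,
                               tau_init, sigma_init, delta=0.99,
                               theta_bar=1.2, max_iter=500):
    """Sketch of iteration (40) with the backtracking condition (41).
    prox_f(x, a, tau) is assumed to return prox_{tau f}^{phi_p}(x, a),
    and d_p(x, y) the primal Bregman distance; both are placeholders."""
    x, z, z_prev = x0.copy(), z0.copy(), z0.copy()
    tau, sigma = tau_init, sigma_init
    for _ in range(max_iter):
        theta = theta_bar                                   # theta_k = 2^{-i} * theta_bar
        gh = grad_h(x)
        while True:                                         # Lemma 4 guarantees termination
            tau_k, sigma_k = theta * tau, theta * sigma
            z_bar = z + theta * (z - z_prev)                # (40a)
            x_new = prox_f(x, tau_k * (A.T @ z_bar + gh), tau_k)   # (40b)
            z_new = z + sigma_k * (A @ x_new - b)           # (40c)
            # backtracking condition (41)
            lhs = ((z_new - z_bar) @ (A @ (x_new - x))
                   + h(x_new) - h(x) - gh @ (x_new - x))
            rhs = (delta**2 / tau_k) * d_p(x_new, x) \
                + np.sum((z_bar - z_new)**2) / (2 * sigma_k)
            if lhs <= rhs:
                break
            theta *= 0.5                                    # shrink and retry
        x, z_prev, z = x_new, z, z_new
        tau, sigma = tau_k, sigma_k
    return x, z
```

Note that the ratio \(\sigma _k/\tau _k\) stays equal to \(\sigma _{-1}/\tau _{-1}\) throughout the iteration, as assumed in the analysis below.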

6.2 Convergence Analysis

The proof strategy is the same as in [30, Section 3.3], extended to account for the function h. The main conclusion is an O(1/k) rate of ergodic convergence, as shown in Eq. (49).

6.2.1 Lower Bound on Algorithm Parameters

Lemma 4

Suppose Assumptions 1 and 3 hold. The stepsizes selected by the backtracking process are bounded below by

$$\begin{aligned} \tau _k \ge \tau _\textrm{min} \triangleq \min {\Big \{\tau _{-1}, \frac{-L+\sqrt{L^2+4\delta ^2\beta \Vert A\Vert ^2}}{4\beta \Vert A\Vert ^2}\Big \}}, \;\; \sigma _k \ge \sigma _\textrm{min} \triangleq \beta \tau _\textrm{min}, \end{aligned}$$
(42)

where \(\beta =\sigma _{-1}/\tau _{-1}\). The lower bounds imply that the backtracking eventually terminates with positive stepsizes \(\sigma _k\) and \(\tau _k\).

Proof

Applying (26) in the proof of Lemma 1 with \(\tau =\tau _k\), \(\sigma =\sigma _k\), \(\delta _1=\delta ^2-\tau _k L\) and \(\delta _2=1\), together with the Lipschitz condition (21) in Assumption 1, we see that the backtracking condition (41) holds at iteration k if \(0 < \delta \le 1\) and

$$\begin{aligned} \tau _k \sigma _k\Vert A\Vert ^2 + \tau _k L \le \delta ^2. \end{aligned}$$

Then, mathematical induction can be used to prove (42). The rest of the proof is similar to that in [30, §3.3.1], extended to account for the smooth function h, and thus is omitted here. \(\square \)

6.2.2 One-Iteration Analysis

Lemma 5

Suppose Assumptions 1 and 3 hold. The iterates \(x_{k+1}\), \(z_{k+1}\), \(\bar{z}_{k+1}\) generated by the algorithm (40) and the backtracking process satisfy

$$\begin{aligned}{} & {} {\mathcal {L}(x_{k+1},z)-\mathcal {L}(x, \bar{z}_{k+1})} \nonumber \\{} & {} \quad \le \frac{1}{\tau _k} \left( d_\mathrm p(x,x_k) - d_\mathrm p(x,x_{k+1}) - (1-\delta ^2) d_\mathrm p(x_{k+1},x_k)\right) \nonumber \\{} & {} \quad {} + \frac{1}{2\sigma _k} \left( \Vert z-z_k\Vert ^2 - \Vert z-z_{k+1}\Vert ^2 - \Vert \bar{z}_{k+1} - z_k\Vert ^2 \right) \end{aligned}$$
(43)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all z. Here, \(\mathcal {L}(x,z)=f(x)+h(x)+\langle z, Ax-b\rangle \).

Proof

The optimality condition for the primal prox-operator (40b) gives

$$\begin{aligned} {f(x_{k+1})-f(x)}\le & {} \frac{1}{\tau _k} \big (d_\mathrm p(x,x_k)-d_\mathrm p(x,x_{k+1})- d_\mathrm p(x_{k+1}, x_k)\big ) \\{} & {} +\langle A^T\bar{z}_{k+1}+\nabla h(x_k), x-x_{k+1}\rangle \end{aligned}$$

for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\). Hence,

$$\begin{aligned}{} & {} {f(x_{k+1}) + h(x_{k+1}) - f(x) - h(x)} \nonumber \\{} & {} \quad \le \frac{1}{\tau _k} (d_\mathrm p(x,x_k) - d_\mathrm p(x,x_{k+1}) - d_\mathrm p(x_{k+1}, x_k)) + \langle A^T\bar{z}_{k+1}, x - x_{k+1}\rangle \nonumber \\{} & {} \qquad {} + h(x_{k+1})-h(x)+\langle \nabla h(x_k), x-x_{k+1}\rangle \nonumber \\{} & {} \quad \le \frac{1}{\tau _k} (d_\mathrm p(x,x_k) - d_\mathrm p(x,x_{k+1}) - d_\mathrm p(x_{k+1}, x_k)) + \langle A^T\bar{z}_{k+1}, x - x_{k+1}\rangle \nonumber \\{} & {} \qquad {} +h(x_{k+1})-h(x_k) -\langle \nabla h(x_k), x_{k+1}-x_k\rangle . \end{aligned}$$
(44)

The second inequality follows from the convexity of h, i.e., \(h(x) \ge h(x_k) + \langle \nabla h(x_k), x-x_k\rangle \). The dual update (40c) implies that

$$\begin{aligned} \langle z-z_{k+1}, Ax_{k+1}-b\rangle = \frac{1}{\sigma _k} \langle z-z_{k+1}, z_{k+1}-z_k\rangle \quad \text{ for } \text{ all } z. \end{aligned}$$
(45)

This equality at \(k=i-1\) is

$$\begin{aligned} \langle z-z_i, Ax_i-b\rangle&= \frac{1}{\sigma _{i-1}} \langle z-z_i, z_i-z_{i-1}\rangle \nonumber \\&= \frac{1}{2\sigma _{i-1}} \left( \Vert z-z_{i-1}\Vert ^2 - \Vert z-z_i\Vert ^2 - \Vert z_i - z_{i-1} \Vert ^2 \right) . \end{aligned}$$
(46)

The equality (45) at \(k=i-2\) is

$$\begin{aligned} \langle z-z_{i-1}, Ax_{i-1}-b\rangle&= \frac{1}{\sigma _{i-2}} \langle z-z_{i-1}, z_{i-1}-z_{i-2}\rangle \\&= \frac{\theta _{i-1}}{\sigma _{i-1}} \langle z-z_{i-1}, z_{i-1}-z_{i-2}\rangle \\&= \frac{1}{\sigma _{i-1}} \langle z-z_{i-1}, \bar{z}_i-z_{i-1}\rangle . \end{aligned}$$

We evaluate this at \(z=z_i\) and add \(\theta _{i-1}\) times the same equality evaluated at \(z=z_{i-2}\):

$$\begin{aligned} {\langle z_i-\bar{z}_i, Ax_{i-1}-b\rangle }{} & {} = \frac{1}{\sigma _{i-1}} \langle z_i-\bar{z}_i, \bar{z}_i - z_{i-1}\rangle \nonumber \\{} & {} = \frac{1}{2\sigma _{i-1}} \left( \Vert z_i-z_{i-1}\Vert ^2 -\Vert z_i-\bar{z}_i\Vert ^2-\Vert \bar{z}_i-z_{i-1}\Vert ^2 \right) . \end{aligned}$$
(47)

Now we combine (44) for \(k=i-1\) with (46) and (47). For \(i \ge 1\),

$$\begin{aligned}{} & {} {\mathcal {L}(x_i,z) - \mathcal {L}(x, \bar{z}_i)} \\{} & {} = f(x_i)+h(x_i)+\langle z, Ax_i-b\rangle -f(x)-h(x)-\langle \bar{z}_i, Ax-b\rangle \\{} & {} \le \frac{1}{\tau _{i-1}} \Big (d_\mathrm p(x,x_{i-1})-d_\mathrm p(x,x_i) - d_\mathrm p(x_i,x_{i-1})\Big ) + \langle A^T\bar{z}_i, x-x_i\rangle \\{} & {} \quad {} + \langle z, Ax_i-b\rangle - \langle \bar{z}_i, Ax-b\rangle + h(x_i) - h(x_{i-1}) -\langle \nabla h(x_{i-1}), x_i-x_{i-1}\rangle \\{} & {} = \frac{1}{\tau _{i-1}} \Big (d_\mathrm p(x,x_{i-1}) - d_\mathrm p(x,x_i) - d_\mathrm p(x_i,x_{i-1})\Big ) + \langle z-\bar{z}_i, Ax_i-b\rangle \\{} & {} \quad {} +h(x_i)-h(x_{i-1}) -\langle \nabla h(x_{i-1}), x_i-x_{i-1}\rangle \\{} & {} = \frac{1}{\tau _{i-1}} \Big (d_\mathrm p(x,x_{i-1}) - d_\mathrm p(x,x_i) - d_\mathrm p(x_i,x_{i-1})\Big ) \\{} & {} \quad {} +\langle z_i-\bar{z}_i, A(x_i-x_{i-1})\rangle + \langle z-z_i, Ax_i-b\rangle + \langle z_i-\bar{z}_i, Ax_{i-1}-b\rangle \\{} & {} \quad {} +h(x_i)-h(x_{i-1}) -\langle \nabla h(x_{i-1}), x_i-x_{i-1}\rangle \\{} & {} = \frac{1}{\tau _{i-1}} \Big (d_\mathrm p(x,x_{i-1}) - d_\mathrm p(x,x_i) - d_\mathrm p(x_i,x_{i-1})\Big ) \\{} & {} \quad {} +\frac{1}{2\sigma _{i-1}} \Big (\Vert z-z_{i-1}\Vert ^2-\Vert z-z_i\Vert ^2-\Vert \bar{z}_i-z_{i-1}\Vert ^2 -\Vert \bar{z}_i-z_i\Vert ^2\Big ) \\{} & {} \quad {}+\langle A^T(z_i-\bar{z}_i), x_i-x_{i-1}\rangle +h(x_i)-h(x_{i-1}) -\langle \nabla h(x_{i-1}), x_i-x_{i-1}\rangle \\{} & {} \le \frac{1}{\tau _{i-1}} \left( d_\mathrm p(x,x_{i-1}) - d_\mathrm p(x,x_i) - (1-\delta ^2) d_\mathrm p(x_i,x_{i-1})\right) \\{} & {} \quad {} + \frac{1}{2\sigma _{i-1}} \left( \Vert z-z_{i-1}\Vert ^2 - \Vert z-z_i\Vert ^2 - \Vert \bar{z}_i - z_{i-1}\Vert ^2 \right) , \end{aligned}$$

which is the desired result (43). The first inequality follows from (44). In the second to last step we substitute (46) and (47). The last step uses the line search exit condition (41) at \(k=i-1\). \(\square \)

6.2.3 Ergodic Convergence

We define the averaged primal and dual sequences

$$\begin{aligned} x^\textrm{avg}_k = \frac{1}{\sum _{i=1}^k \tau _{i-1}} \sum _{i=1}^k \tau _{i-1} x_i, \qquad \bar{z}^\textrm{avg}_k = \frac{1}{\sum _{i=1}^k \tau _{i-1}} \sum _{i=1}^k \tau _{i-1} \bar{z}_i \end{aligned}$$

for \(k \ge 1\). The ergodic convergence of \((x^\textrm{avg}_k, \bar{z}^\textrm{avg}_k)\) is given in the following theorem.

Theorem 3

Suppose Assumptions 1 and 3 hold, and the stepsizes are selected by the backtracking process with line search condition (41). The averaged iterates \((x^\textrm{avg}_k, \bar{z}^\textrm{avg}_k)\) satisfy

$$\begin{aligned} \mathcal {L}(x^\textrm{avg}_k,z) - \mathcal {L}(x,\bar{z}^\textrm{avg}_k)&\le \frac{1}{\sum _{i=1}^k \tau _{i-1}} \big (d_\mathrm p(x,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert ^2\big ) \end{aligned}$$
(48)
$$\begin{aligned}&\le \frac{1}{k \tau _\textrm{min}} \big (d_\mathrm p(x,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert ^2\big ) \end{aligned}$$
(49)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all z. This holds for any \(\delta \in (0,1]\) in (41).

If we compare (48) and (33), we note that the two left-hand sides involve different dual iterates (\(\bar{z}^\textrm{avg}_k\) as opposed to \(z^\textrm{avg}_k\)).

Proof

From (43) in Lemma 5,

$$\begin{aligned} \mathcal {L}(x_i,z)-\mathcal {L}(x, \bar{z}_i)\le & {} \frac{1}{\tau _{i-1}} \Big (d_\mathrm p(x,x_{i-1}) - d_\mathrm p(x,x_i) \\{} & {} + \frac{1}{2\beta } \Vert z-z_{i-1}\Vert ^2 - \frac{1}{2\beta } \Vert z-z_i\Vert ^2 \Big ). \end{aligned}$$

Since \(\mathcal {L}(u,v)\) is convex in u and affine in v,

$$\begin{aligned} {(\sum _{i=1}^k \tau _{i-1}) \big (\mathcal {L}(x^\textrm{avg}_k, z)-\mathcal {L}(x, \bar{z}^\textrm{avg}_k)\big )}\le & {} \sum _{i=1}^k \tau _{i-1} (\mathcal {L}(x_i, z)-\mathcal {L}(x, \bar{z}_i)) \\\le & {} d_\mathrm p(x,x_0) + \frac{1}{2\beta } \Vert z-z_0\Vert ^2. \end{aligned}$$

Dividing by \(\sum _{i=1}^k \tau _{i-1}\) gives (48). \(\square \)

Substituting \(x=x^\star \) and \(z=z^\star \) in (48) yields

$$\begin{aligned}{} & {} {f(x^\textrm{avg}_k)+h(x^\textrm{avg}_k)+\langle z^\star , Ax^\textrm{avg}_k-b\rangle -f(x^\star )-h(x^\star )} \\{} & {} \quad \le \frac{1}{\sum _{i=1}^k\tau _{i-1}} \big (d_\mathrm p(x^\star ,x_0) + \frac{1}{2\beta } \Vert z^\star -z_0\Vert ^2 \big ), \end{aligned}$$

since \(Ax^\star =b\). More generally, suppose \(X \subseteq \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) is a compact convex set that contains an optimal solution \(x^\star \) in its interior, and \(Z = \{z \mid \Vert z\Vert < \gamma \}\) contains a dual optimal solution \(z^\star \). Then the merit function \(\eta \) defined in (6) satisfies

$$\begin{aligned} \eta (x^\textrm{avg}_k, \bar{z}^\textrm{avg}_k)&= \sup _{z \in Z} \mathcal L(x^\textrm{avg}_k,z) - \inf _{x \in X} \mathcal L(x,\bar{z}^\textrm{avg}_k) \\&\le \frac{1}{\sum _{i=1}^k\tau _{i-1}} \Big ( \sup _{x\in X} d_\mathrm p(x,x_0) + \frac{1}{2\beta } (\gamma + \Vert z_0\Vert )^2 \Big ) \\&\le \frac{1}{k \tau _\textrm{min}} \Big (\sup _{x \in X} d_\mathrm p(x,x_0) + \frac{1}{2\beta } (\gamma + \Vert z_0\Vert )^2 \Big ). \end{aligned}$$

The second line follows from (48) and the third line follows from Lemma 4.

6.2.4 Monotonicity Properties and Convergence of Iterates

For \(x=x^\star \), \(z=z^\star \), the left-hand side of (43) is nonnegative and we obtain

$$\begin{aligned}{} & {} {d_\mathrm p(x^\star ,x_{k+1}) +\frac{1}{2\beta }\Vert z^\star -z_{k+1}\Vert ^2} \\{} & {} \quad \le d_\mathrm p(x^\star ,x_k)+\frac{1}{2\beta }\Vert z^\star -z_k\Vert ^2 -\big ((1-\delta ^2)d_\mathrm p(x_{k+1},x_k) +\frac{1}{2\beta }\Vert \bar{z}_{k+1}-z_k\Vert ^2 \big ) \\{} & {} \quad \le d_\mathrm p(x^\star ,x_k)+\frac{1}{2\beta }\Vert z^\star -z_k\Vert ^2 \end{aligned}$$

for \(k \ge 0\). These inequalities hold for any value \(\delta \in (0,1]\). In particular, summing the first inequality over k shows that \(\bar{z}_{k+1}-z_k \rightarrow 0\). When \(\delta < 1\), it also shows that \(d_\mathrm p(x_{k+1}, x_k) \rightarrow 0\) and, by the strong convexity assumption on \(\phi _\mathrm p\), that \(x_{k+1}-x_k \rightarrow 0\). With additional assumptions similar to those in Sect. 5.2.3, one can show the convergence of the iterates; see [30, Section 3.3.4].

7 Bregman PD3O Algorithm

In this section, we propose the Bregman PD3O algorithm, another Bregman proximal method for the problem (1). Bregman PD3O also involves two generalized distances, \(d_\mathrm p\) and \(d_\textrm{d}\), generated by \(\phi _\mathrm p\) and \(\phi _\textrm{d}\), respectively, and it consists of the iterations

$$\begin{aligned} x_{k+1}&= \textrm{prox}^{\phi _\mathrm p}_{\tau f} (x_k,\tau A^Tz_k+\tau \nabla h(x_k)) \end{aligned}$$
(50a)
$$\begin{aligned} z_{k+1}&= \textrm{prox}^{\phi _\textrm{d}}_{\sigma g^*} \big (z_k, -\sigma A(2x_{k+1}-x_k +\tau (\nabla h(x_k)-\nabla h(x_{k+1})) ) \big ). \end{aligned}$$
(50b)
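As an illustration, a minimal Python sketch of the iteration (50) follows; `prox_f(x, a, tau)` and `prox_gstar(z, a, sigma)` are placeholder callables assumed to evaluate the Bregman proximal operators \(\textrm{prox}^{\phi _\mathrm p}_{\tau f}(x,a)\) and \(\textrm{prox}^{\phi _\textrm{d}}_{\sigma g^*}(z,a)\).

```python
import numpy as np

def bregman_pd3o(prox_f, prox_gstar, grad_h, A, x0, z0, tau, sigma,
                 max_iter=500):
    """Sketch of the Bregman PD3O iteration (50).  The Bregman proximal
    operators prox_f and prox_gstar must be supplied by the user."""
    x, z = x0.copy(), z0.copy()
    gh = grad_h(x)
    for _ in range(max_iter):
        x_new = prox_f(x, tau * (A.T @ z + gh), tau)            # (50a)
        gh_new = grad_h(x_new)
        z = prox_gstar(
            z, -sigma * (A @ (2 * x_new - x + tau * (gh - gh_new))),
            sigma)                                              # (50b)
        x, gh = x_new, gh_new
    return x, z
```

Removing the term `tau * (gh - gh_new)` in the dual update gives the Bregman primal Condat–Vũ iteration (18).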

The only difference between Bregman PD3O and the Bregman primal Condat–Vũ algorithm (18) is the additional term \(\tau (\nabla h(x_k)-\nabla h(x_{k+1}))\). Thus, the two algorithms (18) and (50) reduce to the same method when h is absent from problem (1). The additional term allows PD3O to use larger stepsizes than the Condat–Vũ algorithm. If we use the same matrix norm \(\Vert A\Vert \) and Lipschitz constant L in the analysis of the two methods, the stepsize conditions are

$$\begin{aligned} \begin{array}{ll} \text{ Condat--V }\tilde{u}: &{} \sigma \tau \Vert A\Vert ^2 + \tau L \le 1 \\ \text{ PD3O: } &{} \sigma \tau \Vert A\Vert ^2 \le 1, \;\; \tau \le 1/L. \end{array} \end{aligned}$$
(51)

The range of possible parameters is illustrated in Fig. 7.

Fig. 7 Acceptable stepsizes in Condat–Vũ algorithms and PD3O. We assume the same matrix norm \(\Vert A\Vert \) and Lipschitz constant L are used in the analysis of the two algorithms. The light gray region under the blue curve is defined by the inequality for the Condat–Vũ algorithms in (51). The region under the red curve shows the values allowed by the stepsize conditions for PD3O.

In Sect. 7.1, we provide the detailed convergence analysis of the Bregman PD3O method. The connections between Bregman PD3O and several other Bregman proximal methods are discussed in Sect. 7.2.

Assumption 4

 

  1. (4.1)

    The kernel functions \(\phi _\mathrm p\) and \(\phi _\textrm{d}\) are 1-strongly convex with respect to norms \(\Vert \cdot \Vert \) and \(\Vert \cdot \Vert _\textrm{d}\), respectively:

    $$\begin{aligned} d_\mathrm p(x,x^\prime ) \ge \frac{1}{2}\Vert x-x^\prime \Vert ^2, \qquad d_\textrm{d}(z,z^\prime ) \ge \frac{1}{2}\Vert z-z^\prime \Vert _\textrm{d}^2 \end{aligned}$$
    (52)

    for all \((x,x') \in \mathop {\textbf{dom}}d_\mathrm p\) and \((z,z') \in \mathop {\textbf{dom}}d_\textrm{d}\). The assumption that the strong convexity constants are one can be made without loss of generality, by scaling the distances.

  2. (4.2)

    The gradient of h is L-Lipschitz continuous with respect to the Euclidean norm: \(\mathop {\textbf{dom}}h={\textbf{R}}^n\) and

    $$\begin{aligned} h(y)-h(x)-\langle \nabla h(x), y-x\rangle \le \frac{L}{2}\Vert y-x\Vert ^2, \quad \text{ for } \text{ any } x,y \in \mathop {\textbf{dom}}h. \end{aligned}$$
    (53)
  3. (4.3)

    The parameters \(\tau \) and \(\sigma \) satisfy

    $$\begin{aligned} \sigma \tau \Vert A\Vert ^2 \le 1, \qquad \tau \le 1/L. \end{aligned}$$
    (54)
  4. (4.4)

    The primal–dual optimality conditions (3) have a solution \((x^\star , z^\star )\) with \(x^\star \in \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z^\star \in \mathop {\textbf{dom}}\phi _\textrm{d}\).

Note that (53) is a stronger assumption than (21). (Combined with the first inequality in (52), it implies (21).) We will use the following consequence of (53) [38, Theorem 2.1.5]:

$$\begin{aligned} h(y)-h(x)-\langle \nabla h(x), y-x\rangle \ge \frac{1}{2L} \Vert \nabla h(y)-\nabla h(x)\Vert ^2 \quad \text{ for } \text{ all } x, y. \end{aligned}$$
(55)

7.1 Convergence Analysis

7.1.1 A Primal–Dual Bregman Distance

We introduce a primal–dual kernel

$$\begin{aligned} \phi _\textrm{pd3o}(x,y,z) = \frac{1}{\tau } \phi _\mathrm p(x) +\frac{1}{\sigma } \phi _\textrm{d}(z)+\frac{\tau }{2} \Vert y\Vert ^2 -\langle y, x\rangle -\langle z, A(x-\tau y)\rangle , \end{aligned}$$

where \(\sigma ,\tau > 0\). If \(\phi _\textrm{pd3o}\) is convex, the generated Bregman distance is given by

$$\begin{aligned}{} & {} {d_\textrm{pd3o}(x,y,z;x^\prime ,y^\prime ,z^\prime )} \nonumber \\{} & {} \quad = \frac{1}{\tau } d_\mathrm p(x,x^\prime ) +\frac{1}{\sigma } d_\textrm{d}(z,z^\prime ) +\frac{\tau }{2} \Vert y-y^\prime \Vert ^2 \nonumber \\{} & {} \quad {} - \langle y-y^\prime , x-x^\prime \rangle -\langle z-z^\prime , A(x-x^\prime )\rangle +\tau \langle z-z^\prime , A(y-y^\prime )\rangle . \end{aligned}$$
(56)

The following lemma gives a sufficient condition for the convexity of \(\phi _\textrm{pd3o}\).

Lemma 6

Suppose (52) holds. The kernel function \(\phi _\textrm{pd3o}\) is convex if \(\sigma \tau \Vert A\Vert ^2 \le 1\), where \(\Vert A\Vert \) is defined in (22) with \(\Vert \cdot \Vert _\mathrm p=\Vert \cdot \Vert \).

Proof

It is sufficient to show that \(d_\textrm{pd3o}\) is nonnegative:

$$\begin{aligned}{} & {} {d_\textrm{pd3o}(x,y,z;x^\prime ,y^\prime ,z^\prime )} \nonumber \\{} & {} \quad \ge \frac{1}{2\tau } \Vert x-x^\prime \Vert ^2 +\frac{\tau }{2} \Vert A^T(z-z^\prime )\Vert ^2 +\frac{\tau }{2} \Vert y-y^\prime \Vert ^2 \nonumber \\{} & {} \qquad {} - \langle y-y^\prime , x-x^\prime \rangle -\langle z-z^\prime , A(x-x^\prime )\rangle +\tau \langle z-z^\prime , A(y-y^\prime )\rangle \nonumber \\{} & {} \quad = \frac{1}{2} \Big \Vert \frac{1}{\sqrt{\tau }}(x-x^\prime ) -\sqrt{\tau }(y-y^\prime ) -\sqrt{\tau }A^T(z-z^\prime ) \Big \Vert ^2 \nonumber \\{} & {} \quad \ge 0. \end{aligned}$$
(57)

In step 1, we use the strong convexity assumption (52), the definition of \(\Vert A\Vert \) (22) with \(\Vert \cdot \Vert _\mathrm p=\Vert \cdot \Vert \), and the assumption \(\sigma \tau \Vert A\Vert ^2 \le 1\). The bound on \(d_\textrm{d}(z,z')\) follows from

$$\begin{aligned} \frac{1}{\sigma }d_\textrm{d}(z,z^\prime ) \ge \frac{1}{2\sigma }\Vert z-z^\prime \Vert _\textrm{d}^2 \ge \frac{\Vert A^T(z-z^\prime )\Vert ^2}{2\sigma \Vert A\Vert ^2} \ge \frac{\tau }{2} \Vert A^T(z-z^\prime )\Vert ^2. \end{aligned}$$

\(\square \)

Note that the convexity of \(\phi _\textrm{pd3o}\) only requires the first inequality in the stepsize condition (54). Although the Bregman PD3O algorithm (50) is not the Bregman proximal point method for the Bregman kernel \(\phi _\textrm{pd3o}\), the distance \(d_\textrm{pd3o}\) will appear in the key inequality (58) of the convergence analysis.
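As a quick numerical illustration (not part of the analysis), the nonnegativity claim (57) can be checked for the quadratic kernels \(\phi _\mathrm p(x)=\tfrac{1}{2}\Vert x\Vert ^2\) and \(\phi _\textrm{d}(z)=\tfrac{1}{2}\Vert z\Vert ^2\) by evaluating (56) at random points with stepsizes satisfying \(\sigma \tau \Vert A\Vert ^2 = 1\); all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8
A = rng.standard_normal((m, n))
tau = 0.3
sigma = 1.0 / (tau * np.linalg.norm(A, 2)**2)    # sigma * tau * ||A||^2 = 1

def d_pd3o(x, y, z, xp, yp, zp):
    """Bregman distance (56) generated by phi_pd3o with quadratic kernels."""
    dx, dy, dz = x - xp, y - yp, z - zp
    return (dx @ dx / (2 * tau) + dz @ dz / (2 * sigma) + tau * (dy @ dy) / 2
            - dy @ dx - dz @ (A @ dx) + tau * (dz @ (A @ dy)))

vals = [d_pd3o(rng.standard_normal(n), rng.standard_normal(n),
               rng.standard_normal(m), rng.standard_normal(n),
               rng.standard_normal(n), rng.standard_normal(m))
        for _ in range(1000)]
print(min(vals))   # nonnegative up to roundoff, consistent with Lemma 6
```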

7.1.2 One-Iteration Analysis

Lemma 7

Under Assumption 4, the iterates \(x_{k+1}\), \(z_{k+1}\) generated by Bregman PD3O (50) satisfy

$$\begin{aligned} \mathcal {L}(x_{k+1},z)-\mathcal {L}(x,z_{k+1})&\le d_\textrm{pd3o} \big (x,\nabla h(x), z;x_k,\nabla h(x_k),z_k\big ) \nonumber \\&-d_\textrm{pd3o} \big (x,\nabla h(x), z;x_{k+1},\nabla h(x_{k+1}),z_{k+1} \big ) \nonumber \\&-d_\textrm{pd3o} \big (x_{k+1},\nabla h(x), z_{k+1};x_k,\nabla h(x_k),z_k \big ) \end{aligned}$$
(58)

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\).

Proof

Recall that Bregman PD3O differs from the Bregman primal Condat–Vũ algorithm (18) only in an additional term in the dual update. The one-iteration analysis for (18) in Sect. 5.2.2 therefore applies up to (29), with

$$\begin{aligned} \tilde{x} = 2x_{k+1} - x_k + \tau (\nabla h(x_k) - \nabla h(x_{k+1})), \qquad \tilde{z} = z_k. \end{aligned}$$

Substituting the above \((\tilde{x}, \tilde{z})\) into (29) and applying the definition of \(d_-\) (24) yields

$$\begin{aligned}{} & {} {\mathcal {L}(x_{k+1},z)-\mathcal {L}(x,z_{k+1})} \\{} & {} \quad \le d_-(x,z;x_k,z_k) - d_-(x,z;x_{k+1},z_{k+1}) - d_-(x_{k+1},z_{k+1};x_k,z_k) \nonumber \\{} & {} \qquad {} - \tau \langle A^T(z-z_{k+1}), \nabla h(x_k)-\nabla h(x_{k+1})\rangle \\{} & {} \qquad {} +h(x_{k+1})-h(x) +\langle \nabla h(x_k), x-x_{k+1}\rangle \nonumber \\{} & {} \quad = d_-(x,z;x_k,z_k)+\frac{\tau }{2} \Vert \nabla h(x)-\nabla h(x_k)\Vert ^2 \\{} & {} \qquad {} -\langle (x-\tau A^Tz) -(x_k-\tau A^Tz_k), \nabla h(x)-\nabla h(x_k)\rangle \\{} & {} \qquad {} -\Big (d_-(x,z;x_{k+1},z_{k+1})+\frac{\tau }{2} \Vert \nabla h(x)-\nabla h(x_{k+1})\Vert ^2 \\{} & {} \qquad {} -\big \langle {x-\tau A^Tz -(x_{k+1}-\tau A^Tz_{k+1} )}, {\nabla h(x)-\nabla h(x_{k+1})} \big \rangle \Big ) \\{} & {} \qquad {} -\Big (d_-(x_{k+1}, z_{k+1}; x_k, z_k) +\frac{\tau }{2} \Vert \nabla h(x)-\nabla h(x_k)\Vert ^2 \\{} & {} \qquad {} -\big \langle {(x_{k+1}-\tau A^Tz_{k+1}) -(x_k-\tau A^Tz_k)}, {\nabla h(x)-\nabla h(x_k)} \big \rangle \Big ) \\{} & {} \qquad {} -( h(x)-h(x_{k+1})-\langle \nabla h(x_{k+1}), x-x_{k+1}\rangle -\frac{\tau }{2}\Vert \nabla h(x)-\nabla h(x_{k+1})\Vert ^2) \\{} & {} \quad = d_\textrm{pd3o}(x,\nabla h(x),z;x_k,\nabla h(x_k),z_k) -d_\textrm{pd3o} (x,\nabla h(x),z;x_{k+1},\nabla h(x_{k+1}),z_{k+1}) \\{} & {} \qquad {} -d_\textrm{pd3o} (x_{k+1},\nabla h(x),z_{k+1}; x_k,\nabla h(x_k),z_k) \\{} & {} \qquad {} -( h(x)-h(x_{k+1})-\langle \nabla h(x_{k+1}), x-x_{k+1}\rangle -\frac{\tau }{2}\Vert \nabla h(x)-\nabla h(x_{k+1})\Vert ^2) \\{} & {} \quad \le d_\textrm{pd3o} (x,\nabla h(x),z;x_k,\nabla h(x_k),z_k) -d_\textrm{pd3o} (x,\nabla h(x),z;x_{k+1},\nabla h(x_{k+1}),z_{k+1}) \\{} & {} \qquad {} -d_\textrm{pd3o} (x_{k+1},\nabla h(x),z_{k+1}; x_k,\nabla h(x_k),z_k) \\{} & {} \quad \le d_\textrm{pd3o} (x,\nabla h(x),z;x_k,\nabla h(x_k),z_k) -d_\textrm{pd3o} (x,\nabla h(x),z;x_{k+1},\nabla h(x_{k+1}),z_{k+1}). \end{aligned}$$

Step 3 follows from the definition of \(d_\textrm{pd3o}\) in (56). In step 4, we use inequality (55) and the second inequality in the stepsize condition (54). The last step follows from the fact that \(d_\textrm{pd3o}\) is nonnegative (57). \(\square \)

7.1.3 Ergodic Convergence

Theorem 4

Under Assumption 4, Bregman PD3O iterates (50) satisfy

$$\begin{aligned} \mathcal {L}(x^\textrm{avg}_k, z)-\mathcal {L}(x, z^\textrm{avg}_k) \le \frac{3}{k} \Big (\frac{2}{\tau } d_\mathrm p(x,x_0) + \frac{1}{\sigma } d_\textrm{d}(z,z_0) \Big ), \end{aligned}$$

for all \(x \in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and all \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\), where the averaged iterates are defined in (32).

Proof

From (58) in Lemma 7, since \(\mathcal {L}(u,v)\) is convex in u and concave in v, and with the shorthand \(d(x,z;x^\prime ,z^\prime ) = d_\textrm{pd3o}(x,\nabla h(x),z; x^\prime ,\nabla h(x^\prime ),z^\prime )\),

$$\begin{aligned} \mathcal {L}(x^\text {avg}_k, z)-\mathcal {L}(x, z^\text {avg}_k)&\le \frac{1}{k} \sum _{i=1}^k \big (\mathcal {L}(x_i,z)-\mathcal {L}(x,z_i)\big ) \nonumber \\ {}&\le \frac{1}{k} \big (d(x,z;x_0,z_0) - d(x,z;x_k,z_k) \big ) \nonumber \\ {}&\le \frac{1}{k} d(x,z;x_0,z_0) \nonumber \\ {}&\le \frac{3}{k} \Big (\frac{2}{\tau } d_\mathrm p(x,x_0) + \frac{1}{\sigma } d_\text {d}(z,z_0) \Big )\end{aligned}$$

for all \(x\in \mathop {\textbf{dom}}f \cap \mathop {\textbf{dom}}\phi _\mathrm p\) and \(z \in \mathop {\textbf{dom}}g^*\cap \mathop {\textbf{dom}}\phi _\textrm{d}\). The third inequality uses the nonnegativity of \(d_\textrm{pd3o}\) (see (57)). The last inequality follows from the bound below, obtained from (56) using the Cauchy–Schwarz inequality, the definition of \(\Vert A\Vert \), the first inequality in (54), and (52):

$$\begin{aligned}&d_\textrm{pd3o}(x,y,z;x',y',z') \\&{} \le \frac{1}{\tau }d_\mathrm p(x,x') + \frac{1}{\sigma } d_\mathrm d(z,z') + \frac{\tau }{2} \Vert y-y'\Vert ^2 + \Vert y-y'\Vert \Vert x-x'\Vert \\&\quad {} + \Vert A\Vert \Vert x-x'\Vert \Vert z-z'\Vert _\mathrm d + \tau \Vert A\Vert \Vert y-y'\Vert \Vert z-z'\Vert _\mathrm d \\&{} \le \frac{1}{\tau }d_\mathrm p(x,x') + \frac{1}{\sigma } d_\mathrm d(z,z') + \frac{\tau }{2} \Vert y-y'\Vert ^2 + \frac{1}{2\tau } \Vert x-x'\Vert ^2 + \frac{\tau }{2} \Vert y-y'\Vert ^2 \\&\quad {} + \frac{1}{2\tau } \Vert x-x'\Vert ^2 + \frac{1}{2\sigma } \Vert z-z'\Vert _\mathrm d^2 + \frac{\tau }{2} \Vert y-y'\Vert ^2 + \frac{1}{2\sigma } \Vert z-z'\Vert _\mathrm d^2 \\&{} \le \frac{3}{\tau } d_\mathrm p(x,x') + \frac{3}{\sigma } d_\textrm{d}(z,z') + \frac{3\tau }{2} \Vert y-y'\Vert ^2. \end{aligned}$$

Applying this bound with \(x^\prime =x_0\), \(z^\prime =z_0\), \(y=\nabla h(x)\), \(y^\prime =\nabla h(x_0)\), and noting that \(\tfrac{3\tau }{2}\Vert \nabla h(x)-\nabla h(x_0)\Vert ^2 \le \tfrac{3\tau L^2}{2}\Vert x-x_0\Vert ^2 \le \tfrac{3}{\tau } d_\mathrm p(x,x_0)\) by the Lipschitz continuity of \(\nabla h\), the stepsize condition \(\tau \le 1/L\), and (52), gives the last inequality of the chain. \(\square \)

7.2 Relation to Other Bregman Proximal Algorithms

The proposed algorithm (50) can be viewed as an extension of PD3O (12) to generalized distances, and it reduces to several Bregman proximal methods as special cases. These algorithms can be organized into a diagram similar to Fig. 3. Figure 8 starts from Bregman PD3O (50) and summarizes its connections to other Bregman proximal methods.

Fig. 8 Proximal algorithms derived from Bregman PD3O.

When \(h=0\), (50) reduces to Bregman PDHG, and when \(g=0\), to the Bregman proximal gradient algorithm. The Bregman Loris–Verhoeven algorithm is Bregman PD3O with \(f=0\); it has been discussed in [17] under the name NEPAPC. Setting \(A=I\) (with \(\sigma =1/\tau \)) in this case, we obtain a new variant of the Bregman proximal gradient algorithm. The difference between this variant and the Bregman reduced Loris–Verhoeven algorithm with shift obtained in Sect. 5.3 is the additional term \(\tau (\nabla h(x_k)-\nabla h(x_{k+1}))\), the same as the difference between (18) and (50). When the Euclidean proximal operator is used, (50) (with \(f=0\), \(A=I\) and \(\sigma =1/\tau \)) reduces to the proximal gradient method. However, the new method does not appear to be equivalent to the classical Bregman proximal gradient algorithm, because the Moreau decomposition is not available in the generalized case. Finally, setting \(A=I\) (and \(\sigma =1/\tau \)) in (50) gives a Bregman Davis–Yin algorithm.

8 Numerical Experiment

In this section, we evaluate the performance of the Bregman primal Condat–Vũ algorithm (18), Bregman dual Condat–Vũ algorithm with line search (40), and Bregman PD3O (50). The main goal of the example is to validate and illustrate the difference in the stepsize conditions (51) and the usefulness of the line search procedure. We consider the convex optimization problem

$$\begin{aligned} \begin{array}{ll} \text{ minimize } &{} \psi (x) = \lambda \Vert Ax\Vert _1+\tfrac{1}{2}\Vert Cx-b\Vert ^2 \\ \text{ subject } \text{ to } &{} \textbf{1}^Tx=1, \quad x \succeq 0, \end{array} \end{aligned}$$
(59)

where \(x \in {\textbf{R}}^n\) is the optimization variable, \(C \in {\textbf{R}}^{m \times n}\), and \(A \in {\textbf{R}}^{(n-1) \times n}\) is the difference matrix

$$\begin{aligned} A = \left[ \begin{array}{rrrrrr} -1 &{} 1 &{} 0 &{} \cdots &{} 0 &{} 0 \\ 0 &{} -1 &{} 1 &{} \cdots &{} 0 &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} &{} \vdots &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} -1 &{} 1 \end{array}\right] . \end{aligned}$$
(60)

This problem has the form (1) with \(f(x)=\delta _H(x)\), \(g(y)=\lambda \Vert y\Vert _1\), and \(h(x)=\tfrac{1}{2}\Vert Cx-b\Vert ^2\), where \(\delta _H\) is the indicator function of the hyperplane \(H=\{x \in {\textbf{R}}^n \mid \textbf{1}^Tx=1\}\). We use the relative entropy distance

$$\begin{aligned} d_\mathrm p(x,y) = \sum _{i=1}^n(x_i \log (x_i/y_i) - x_i + y_i), \qquad \mathop {\textbf{dom}}d_\mathrm p= {\textbf{R}}^n_+ \times {\textbf{R}}^n_{++} \end{aligned}$$

in the primal space. This distance is 1-strongly convex with respect to the \(\ell _1\)-norm [5] (and hence also the \(\ell _2\)-norm). With the relative entropy distance, all the primal iterates \(x_k\) remain feasible. In the dual space we use the Euclidean distance. Thus, the matrix norm (22) in the stepsize condition (25) for the Bregman Condat–Vũ algorithms is the (1,2)-operator norm \(\Vert A\Vert _{1,2}=\max _i \Vert a_i\Vert =\sqrt{2}\), where \(a_i\) is the ith column of A. In the Bregman PD3O algorithm, we use the squared Euclidean distance \(d_\mathrm p(x,y) = \tfrac{1}{2} \Vert x-y\Vert ^2\), and the matrix norm in the stepsize condition (54) is the spectral norm \(\Vert A\Vert _2\). For the difference matrix (60), \(\Vert A\Vert _2\) is bounded above by 2, and very close to this upper bound for large n.

The Lipschitz constant for h with respect to the \(\ell _1\)-norm is the largest absolute value of the elements in \(C^TC\), i.e., \(L_1=\max _{i,j} |(C^TC)_{ij}|\). This value is used in the stepsize condition (25) for the Bregman Condat–Vũ algorithms. The Lipschitz constant with respect to the \(\ell _2\)-norm is \(L_2=\Vert C\Vert _2^2\), which is used in the stepsize condition (54) for Bregman PD3O.

The matrix norms and Lipschitz constants are summarized as follows:

$$\begin{aligned} \begin{array}{lcc} &{} \text{ matrix } \text{ norm } &{} \text{ Lipschitz } \text{ constant } \\ \text{ Bregman } \text{ Condat--V }\tilde{u} \quad &{} \Vert A\Vert _{1,2}=\sqrt{2} \quad &{} L_1=\max _{i,j}|(C^TC)_{ij}| \\ \text{ Bregman } \text{ PD3O } &{} \Vert A\Vert _2 \le 2 &{} L_2=\Vert C\Vert _2^2. \end{array} \end{aligned}$$

In the example, we use the exact values of \(L_1\) and \(L_2\).
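For illustration, the following snippet (with hypothetical small dimensions, not those of the experiment) builds the difference matrix (60) and evaluates the four constants in the table above.

```python
import numpy as np

def difference_matrix(n):
    """(n-1) x n forward-difference matrix A from (60)."""
    A = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    A[idx, idx] = -1.0
    A[idx, idx + 1] = 1.0
    return A

rng = np.random.default_rng(1)
m, n = 50, 200                                 # small sizes, illustration only
A = difference_matrix(n)
C = rng.standard_normal((m, n))

norm_A_12 = np.linalg.norm(A, axis=0).max()    # (1,2)-operator norm = sqrt(2)
norm_A_2 = np.linalg.norm(A, 2)                # spectral norm, close to 2
L1 = np.abs(C.T @ C).max()                     # Lipschitz constant, ell_1 norm
L2 = np.linalg.norm(C, 2)**2                   # Lipschitz constant, ell_2 norm
```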

The Bregman proximal operator of f has a closed-form solution:

$$\begin{aligned} \textrm{prox}_f^\phi (y,a)_k = \frac{y_k e^{-a_k}}{\sum _{i=1}^n y_i e^{-a_i}}, \quad k=1,\ldots , n, \end{aligned}$$

where \(y_i\) is the ith entry of y, and the (Euclidean) proximal operator of \(g^*\) is the projection onto the infinity norm ball \(\{z \mid \Vert z\Vert _\infty \le \lambda \}\).
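In code, these two proximal operators and the relative entropy distance take only a few lines. The sketch below evaluates the closed-form expression above in the log domain (the maximum shift is a standard numerical safeguard, not part of the formula), projects onto the \(\ell _\infty \)-norm ball for \(g^*\), and evaluates \(d_\mathrm p\). Since \(\tau \delta _H=\delta _H\), the stepsize does not enter the Bregman proximal operator of f.

```python
import numpy as np

def prox_f_entropy(y, a):
    """Bregman proximal operator of f = delta_H under the entropy kernel:
    prox_f^phi(y, a)_k = y_k * exp(-a_k) / sum_i y_i * exp(-a_i)."""
    w = np.log(y) - a
    w -= w.max()                   # shift for numerical stability
    p = np.exp(w)
    return p / p.sum()

def prox_gstar_linf(z, lam):
    """Euclidean projection onto the infinity-norm ball of radius lam."""
    return np.clip(z, -lam, lam)

def d_p_entropy(x, y):
    """Relative entropy distance; assumes x, y > 0 (the iterates stay positive)."""
    return float(np.sum(x * np.log(x / y) - x + y))
```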

The experiment is carried out in Python 3.6 on a desktop with an Intel Core i5 2.4GHz CPU and 8GB RAM. We set \(m=500\) and \(n=10,000\). The entries of the matrix \(C \in {\textbf{R}}^{m \times n}\) and the vector \(b \in {\textbf{R}}^m\) are generated independently from the standard Gaussian distribution. For the constant stepsize option, we choose

$$\begin{aligned} \begin{array}{lll} \text{ Condat-V }\tilde{u} &{} \sigma =L_1/2 &{} \tau =1/(2L_1) \\ \text{ PD3O } &{} \sigma =L_2/4 &{} \tau =1/L_2. \end{array} \end{aligned}$$
(61)

These two choices, as well as the range of possible parameters, are illustrated in Fig. 9. The two choices are on the blue and red curve, respectively, and satisfy the requirement (51) with equality. For the line search algorithm, we set \(\bar{\theta }_k=1.2\) to encourage more aggressive updates, and \(\beta =\sigma _{-1}/\tau _{-1}=L_1^2\), which is consistent with the choice in (61).
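Assuming \(L_1\) and \(L_2\) have been computed as in the earlier sketch, the constant stepsizes (61) and the line search initialization translate directly into code.

```python
# Constant stepsizes (61); they satisfy the conditions (51) with equality.
sigma_cv, tau_cv = L1 / 2, 1 / (2 * L1)       # Bregman Condat-Vu
sigma_pd3o, tau_pd3o = L2 / 4, 1 / L2         # Bregman PD3O

# Line search initialization: theta_bar = 1.2 and beta = sigma/tau = L1^2.
theta_bar = 1.2
tau_init = tau_cv
sigma_init = (L1**2) * tau_init               # equals sigma_cv, consistent with (61)
```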

Fig. 9 The blue and red curves show the boundaries of the stepsize regions for Bregman Condat–Vũ algorithms and Bregman PD3O, respectively. The blue and red points indicate the chosen parameters in (61) (red for PD3O, blue for Condat–Vũ). In the Bregman dual Condat–Vũ algorithm with line search, the stepsizes are selected on the dashed straight line. The solid line segment shows the range of stepsizes that were selected, with dots indicating the largest, median, and smallest stepsizes.

We solve problem (59) using the Bregman primal Condat–Vũ algorithm (18), the Bregman dual Condat–Vũ algorithm with line search (40), and Bregman PD3O (50). Figure 10 reports the relative error of the function value with respect to the optimal value \(\psi ^\star \), which is computed via CVXPY [22]. A comparison between the Bregman primal Condat–Vũ algorithm and Bregman PD3O shows that Bregman PD3O converges faster. Figure 10 also compares the Bregman primal Condat–Vũ algorithm with constant stepsizes and the Bregman dual Condat–Vũ algorithm with line search. One can see clearly that the line search significantly improves the convergence. On the other hand, the line search does not add much computational overhead, since the plots against CPU time and against the number of iterations are roughly identical. In these experiments, Bregman PD3O and the Bregman dual Condat–Vũ algorithm with line search perform similarly, with neither algorithm conclusively better than the other.

Fig. 10 Comparison of three algorithms (Bregman primal Condat–Vũ, Bregman dual Condat–Vũ with line search, and Bregman PD3O) in terms of objective values. The top two figures plot the relative error of the function value versus CPU time and number of iterations for one problem instance (59), respectively. The bottom two figures correspond to another problem instance.

9 Conclusions

We presented two variants of the Bregman Condat–Vũ algorithm, introduced a line search technique for the Bregman dual Condat–Vũ algorithm applied to equality-constrained problems, and proposed a Bregman extension of PD3O. Many open questions remain. It is unclear how to use Bregman distances in PDDY, and how to extend the line search technique to Bregman PD3O, the Bregman primal Condat–Vũ algorithm, and the more general problem (1). Moreover, in the current backtracking technique the ratio of the primal and dual stepsizes is fixed. A further improvement would be to relax this constraint [1, 36].