# Acceleration of the PDHGM on Partially Strongly Convex Functions


## Abstract

We propose several variants of the primal–dual method due to Chambolle and Pock. Without requiring full strong convexity of the objective functions, our methods are accelerated on subspaces with strong convexity. This yields mixed rates, \(O(1{/}N^2)\) with respect to initialisation and *O*(1 / *N*) with respect to the dual sequence, and the residual part of the primal sequence. We demonstrate the efficacy of the proposed methods on image processing problems lacking strong convexity, such as total generalised variation denoising and total variation deblurring.

### Keywords

Primal–dual, Accelerated, Subspace, Total generalised variation

### Mathematics Subject Classification

90C25 49M29 94A08

## 1 Introduction

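Let *G* and *F* be convex, proper, and lower semicontinuous functionals on Hilbert spaces *X* and *Y*, possibly infinite dimensional. Also let \(K \in \mathcal {L}(X; Y)\) be a bounded linear operator. We then wish to solve the problem

\[ \min _{x \in X} \ G(x) + F(Kx). \qquad \text{(P)} \]

This can under mild conditions on *F* (see, for example, [1, 2]) also be written with the help of the convex conjugate \(F^*\) in the minimax form

\[ \min _{x \in X} \max _{y \in Y} \ G(x) + \langle Kx, y\rangle - F^*(y). \]

One option for the numerical solution of the latter form is the primal–dual method of Chambolle and Pock [3], the modified primal–dual hybrid gradient method or PDHGM. If either *G* or \(F^*\) is strongly convex, the method can be accelerated to \(O(1/N^2)\) convergence rates of the iterates and an ergodic duality gap [3]. But what if we have only partial strong convexity? For example, what if

\[ G(x)=G_0(Px) \]

for a projection operator *P* to a subspace \(X_0 \subset X\), and strongly convex \(G_0: X_0 \rightarrow \mathbb {R}\)? This kind of structure is common in many applications in image processing and data science, as we will more closely review in Sect. 5. Under such *partial strong convexity*, can we obtain a method that would give an accelerated rate of convergence at least for *Px*?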

We provide a partially positive answer: we can obtain mixed rates, \(O(1/N^2)\) with respect to initialisation, and *O*(1 / *N*) with respect to bounds on the ‘residual variables’ *y* and \((I-P)x\). In this respect, our results are similar to the ‘optimal’ algorithm of Chen et al. [5]. Instead of strong convexity, they assume smoothness of *G* to derive a primal–dual algorithm based on backward–forward steps, instead of the backward–backward steps of [3].

In smoothing approaches such as [6, 7], one computes *a priori* a level of smoothing, comparable to \(\Gamma '\), needed to achieve a prescribed solution quality. One then solves a smoothed problem, which can be done at \(O(1/N^2)\) rate. However, to obtain a solution with higher quality than the a priori prescribed one, one needs to solve a new problem from scratch, as the smoothing alters the problem being solved. One can also employ restarting strategies, to take some advantage of the previous solution, see, for example, [8]. Our approach does not depend on restarting and a priori chosen solution qualities: the method will converge to an optimal solution to the original non-smooth problem. Indeed, the introduced additional strong convexity \(\Gamma '\) is controlled automatically.

The ‘fast dual proximal gradient method’, or FDPG [9], also possesses a different type of mixed rates, *O*(1 / *N*) for the primal, and \(O(1/N^2)\) for the dual. This is, however, under standard strong convexity assumptions. Other than that, our work is related to various further developments from the PDHGM, such as variants for nonlinear *K* [10, 11] and non-convex *G* [12]. The PDHGM has been the basis for inertial methods for monotone inclusions [13] and primal–dual stochastic coordinate descent methods without separability requirements [14]. Finally, the FISTA [15, 16] can be seen as a primal-only relative of the PDHGM. Not attempting to do full justice here to the large family of closely related methods, we point to [4, 17, 18] for further references.

The contributions of our paper are twofold: firstly, to paint a bigger picture of what is possible, we derive a very general version of the PDHGM. This algorithm, useful as a basis for deriving other new algorithms besides ours, is the content of Sect. 2. In this section, we provide an abstract bound on the iterates of the algorithm, later used to derive convergence rates. In Sect. 3, we extend the bound to include an ergodic duality gap under stricter conditions on the acceleration scheme and the step length operators. A by-product of this work is the shortest convergence rate proof for the accelerated PDHGM known to us. Afterwards, in Sect. 4, we derive from the general algorithm two efficient mixed-rate algorithms for problems exhibiting strong convexity only on subspaces. The first one employs the penalty or smoothing \(\psi \) on both the primal and the dual. The second one only employs the penalty on the dual. We finish the study with numerical experiments in Sect. 5. The main results of interest for readers wishing to apply our work are Algorithms 3 and 4 along with the respective convergence results, Theorems 4.1 and 4.2.

## 2 A General Primal–Dual Method

### 2.1 Notation

We write \(\mathcal {L}(X; Y)\) for the space of bounded linear operators between Hilbert spaces *X* and *Y*. For \(T, S \in \mathcal {L}(X; X)\), the notation \(T \ge S\) means that \(T-S\) is positive semidefinite; in particular, \(T \ge 0\) means that *T* is positive semidefinite. In this case, we also denote

*I*, as is standard.

### 2.2 Background

*X* and *Y*, as well as a bounded linear operator \(K \in \mathcal {L}(X; Y)\). We then wish to solve the minimax problem

*O*(1 / *N*) rate for the ergodic duality gap [3]. If *G* is strongly convex with factor \(\gamma \), we may use the acceleration scheme [3]

*preconditioning* or *step length operator*
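To make the accelerated scheme of [3] concrete, the following minimal sketch applies it to the toy problem \(\min _x \Vert x-f\Vert ^2/2 + \alpha \Vert x\Vert _1\) with \(K=I\), whose solution is given by soft thresholding. The step ordering and the rule \(\omega _i=1/\sqrt{1+2\gamma \tau _i}\), \(\tau _{i+1}=\tau _i\omega _i\), \(\sigma _{i+1}=\sigma _i/\omega _i\) follow the scheme of [3] as we understand it; the problem data, the function names, and the iteration count are assumptions made purely for illustration.

```python
import math


def soft_threshold(v, alpha):
    """Closed-form minimiser of 0.5*(x - v)**2 + alpha*abs(x)."""
    return math.copysign(max(abs(v) - alpha, 0.0), v)


def pdhgm_accelerated(f, alpha, n_iter=500, tau=1.0, sigma=1.0, gamma=1.0):
    """Accelerated PDHGM sketch for min_x 0.5*||x - f||^2 + alpha*||x||_1, K = I.

    G(x) = 0.5*||x - f||^2 is strongly convex with factor gamma = 1, and
    F^* is the indicator of the box [-alpha, alpha]^n.
    The initial steps satisfy tau * sigma * ||K||^2 <= 1 since ||K|| = 1.
    """
    x = [0.0] * len(f)
    x_bar = list(x)
    y = [0.0] * len(f)
    for _ in range(n_iter):
        # Dual step: the proximal map of sigma*F^* is the projection onto the box.
        y = [max(-alpha, min(alpha, yj + sigma * xb)) for yj, xb in zip(y, x_bar)]
        # Primal step: proximal map of tau*G in closed form.
        x_new = [(xj - tau * yj + tau * fj) / (1.0 + tau)
                 for xj, yj, fj in zip(x, y, f)]
        # Acceleration: shrink tau, grow sigma, keep tau*sigma constant.
        omega = 1.0 / math.sqrt(1.0 + 2.0 * gamma * tau)
        tau, sigma = tau * omega, sigma / omega
        # Over-relaxation of the primal variable.
        x_bar = [xn + omega * (xn - xo) for xn, xo in zip(x_new, x)]
        x = x_new
    return x
```

Calling `pdhgm_accelerated([3.0, -0.5, 0.2, -4.0], 1.0)` should return a vector close to the soft-thresholded data. Setting `omega = 1.0` inside the loop recovers the unaccelerated method with fixed step lengths.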

### 2.3 Abstract Partial Monotonicity

Our plan now is to formulate a general version of (3), replacing \(\tau _i\) and \(\sigma _i\) by operators \(T_i \in \mathcal {L}(X; X)\) and \(\Sigma _i \in \mathcal {L}(Y; Y)\). In fact, we will need two additional operators \({\widetilde{T}}_i \in \mathcal {L}(X; X)\) and \({\hat{T}}_i \in \mathcal {L}(Y; Y)\) to help communicate change in \(T_i\) to \(\Sigma _i\). They replace \(\omega _i\) in (3b) and (7), operating as \({\hat{T}}_{i+1} K {\widetilde{T}}^{-1}_i \approx \omega _i K\) from both sides of *K*. The role of \({\widetilde{T}}_i\) is to split the original primal step length \(\tau _i\) in the space *X* into the two parts \(T_i\) and \({\widetilde{T}}_i\) with potentially different rates. The role of \({\hat{T}}_i\) is to transfer \({\widetilde{T}}_i\) into the space *Y*, to eventually control the dual step length \(\Sigma _i\). In the basic algorithm (3), we would simply have \({\widetilde{T}}_i=T_i=\tau _i I \in \mathcal {L}(X; X)\), and \({\hat{T}}_i=\tau _i I \in \mathcal {L}(Y; Y)\) for the scalar \(\tau _i\).

*partially strongly* \((\psi , \widetilde{\mathcal {T}}, \widetilde{\mathcal {K}})\)*-monotone*, meaning that for all \(x, x' \in X\), \({\widetilde{T}}\in \widetilde{\mathcal {T}}\), and \(\Gamma ' \in [0, \Gamma ]+\widetilde{\mathcal {K}}\) holds

*testing operator*, and the operator \(\Gamma ' \in \widetilde{\mathcal {K}}\) as *introduced strong monotonicity*. The functional \(\psi _{{{\widetilde{T}}}^{-1,*}(\Gamma '-\Gamma )}\) is a *penalty* corresponding to the test and the introduced strong monotonicity. The role of testing will become more apparent in Sect. 2.4.

*-monotone* with respect to \(\hat{\mathcal {T}}\) in the sense that for all \(y, y' \in Y\), \({\hat{T}}\in \hat{\mathcal {T}}\), and \(R \in \hat{\mathcal {K}}\) holds

for some family of functionals \(\{\phi _{T}: Y \rightarrow \mathbb {R}\}\). Again, the inequality in (\(\hbox {F}^*\)-PM) is understood to hold for all elements of the sets \(\partial F^*(y')\) and \(\partial F^*(y)\).

In our general analysis, we do not set any conditions on \(\psi \) and \(\phi \), as their role is simply symbolic transfer of dissatisfaction of strong monotonicity into a penalty in our abstract convergence results.

Let us next look at a few examples on how (G-PM) or (\(\hbox {F}^*\)-PM) might be satisfied. First we have the very well-behaved case of quadratic functions.

### Example 2.1

\(G(x)=\Vert f-Ax\Vert ^2/2\) satisfies (G-PM) with \(\Gamma =A^*A\), \(\widetilde{\mathcal {K}}=\{0\}\), and \(\psi \equiv 0\) for any invertible \({\widetilde{T}}\). Indeed, *G* is differentiable with \(\langle \nabla G(x') - \nabla G(x),{\widetilde{T}}^{-1}(x'-x)\rangle =\langle A^*A(x'-x),{\widetilde{T}}^{-1}(x'-x)\rangle =\Vert x'-x\Vert _{{{\widetilde{T}}}^{-1,*}\Gamma }^2\).
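The identity in Example 2.1 can be checked numerically. The sketch below does so for a small matrix *A* and a scalar test operator \({\widetilde{T}}=tI\), assuming the usual convention \(\Vert z\Vert _M^2=\langle Mz, z\rangle \), so that the right-hand side becomes \(\Vert Az\Vert ^2/t\). The helper names and the data are hypothetical choices for illustration only.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_G(A, f, x):
    """Gradient of G(x) = ||f - A x||^2 / 2, namely A^T (A x - f)."""
    residual = [ax - fi for ax, fi in zip(matvec(A, x), f)]
    return matvec(transpose(A), residual)


def monotonicity_sides(A, f, x, x_prime, t):
    """Return both sides of the identity of Example 2.1 for T~ = t*I."""
    z = [b - a for a, b in zip(x, x_prime)]
    dg = [b - a for a, b in zip(grad_G(A, f, x), grad_G(A, f, x_prime))]
    lhs = dot(dg, [v / t for v in z])   # <grad G(x') - grad G(x), T~^{-1}(x' - x)>
    Az = matvec(A, z)
    rhs = dot(Az, Az) / t               # <A^T A z, z> / t
    return lhs, rhs
```

The two returned values agree up to rounding even when *A* has a non-trivial null-space, in which case both vanish for differences \(x'-x\) in that null-space.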

The next lemma demonstrates what can be done when all the parameters are scalar. It naturally extends to functions of the form \(G(x_1, x_2)=G(x_1)+G(x_2)\) with corresponding product form parameters.

### Lemma 2.1

### Proof

We denote \(A :={{\mathrm{dom}}}G\). If \(x' \not \in A\), we have \(G(x')=\infty \), so (8) holds irrespective of \(\gamma \) and *C*. If \(x \not \in A\), we have \(\partial G(x)=\emptyset \), so (8) again holds. We may therefore compute the constants based on \(x, x' \in A\). Now, there is a constant *M* such that \(\sup _{x \in A} \Vert x\Vert \le M\). Then, \(\Vert x'-x\Vert \le 2M\). Thus, if we pick \(C=4M^2\), then \((\gamma /2)(\Vert x'-x\Vert ^2-C) \le 0\) for every \(\gamma \ge 0\) and \(x, x' \in A\). By the convexity of *G*, (8) holds.\(\square \)

### Example 2.2

An indicator function \(\iota _A\) of a convex bounded set *A* satisfies the conditions of Lemma 2.1. This is generally what we will use and need.

### 2.4 A General Algorithm and the Idea of Testing

*K*. Our proposed algorithm can thus be characterised as solving on each iteration \(i \in \mathbb {N}\) for the next iterate \(u^{i+1}\) the preconditioned proximal point problem

*testing operator*

*H* with respect to testing by \(S_i\). With this, taking a fixed \(\delta >0\), the properties

### 2.5 A Simplified Condition

### 2.6 Basic Convergence Result

Our main result on Algorithm 1 is the following theorem, providing some general convergence estimates. It is, however, important to note that the theorem does not yet directly prove convergence, as its estimates depend on the rate of decrease in \(T_N {\widetilde{T}}_N^*\), as well as the rate of increase in the penalty sum \(\sum _{i=0}^{N-1} D_{i+1}\) coming from the dissatisfaction of strong convexity. Deriving these rates in special cases will be the topic of Sect. 4.

### Theorem 2.1

*X* and *Y*, satisfying (G-PM) and (\(\hbox {F}^*\)-PM). Pick \(\delta \in (0, 1)\), and suppose (C1) and (C2) are satisfied for each \(i \in \mathbb {N}\) for some invertible \(T_{i} \in \mathcal {L}(X; X)\), \({\widetilde{T}}_{i} \in \widetilde{\mathcal {T}}\), \({\hat{T}}_{i+1} \in \hat{\mathcal {T}}\), and \(\Sigma _{i+1} \in \mathcal {L}(Y; Y)\), as well as \(\Gamma _i \in [0, \Gamma ] + \widetilde{\mathcal {K}}\) and \(R_{i+1} \in \hat{\mathcal {K}}\). Suppose that \({{\widetilde{T}}}^{-1,*}_iT^{-1}_i\) and \({\hat{T}}^{-1}_{i+1}\Sigma ^{-1}_{i+1}\) are self-adjoint. Let \({\widehat{u}}=({\widehat{x}}, {\widehat{y}})\) satisfy (OC). Then, the iterates of Algorithm 1 satisfy

### Remark 2.1

The term \({\widetilde{D}}_{i+1}\), coming from the dissatisfaction of strong convexity, penalises the basic convergence, which is represented on the right-hand side of (17) by the constant \(C_0\). If \(T_N {\widetilde{T}}_N\) is of the order \(O(1/N^2)\), at least on a subspace, and we can bound the penalty \({\widetilde{D}}_{i+1} \le C\) for some constant *C*, then we clearly obtain mixed \(O(1/N^2) + O(1/N)\) convergence rates on the subspace. If we can assume that \({\widetilde{D}}_{i+1}\) actually converges to zero at some rate, then it will even be possible to obtain improved convergence rates. Since typically \({\widetilde{T}}_i, {\hat{T}}_{i+1} \searrow 0\) reduce to scalar factors within \({\widetilde{D}}_{i+1}\), this would require prior knowledge of the rates of convergence \(x^i \rightarrow {\widehat{x}}\) and \(y^i \rightarrow {\widehat{y}}\). We can, however, usually ensure boundedness of the iterates \(\{(x^i, y^i)\}_{i=0}^\infty \).

### Proof

*H* from (6), it follows

## 3 Scalar Off-diagonal Updates and the Ergodic Duality Gap

### 3.1 Scalar Specialisation of Algorithm 1

### Example 3.1

Let *G* be strongly convex with factor \(\gamma \ge 0\). We take \(T_i=\tau _i I\), \({\widetilde{T}}_i=\tau _i I\), \({\hat{T}}_i=\tau _i I\), and \(\Sigma _{i+1}=\sigma _{i+1} I\) for some scalars \(\tau _i, \sigma _{i+1}>0\). The conditions (G-pm) and (\(\hbox {F}^*-\hbox {pm}\)) then hold with \(\psi \equiv 0\) and \(\phi \equiv 0\), while (\(\hbox {C2}''\)) and (\(\hbox {C1}''\)) reduce with \(R_{i+1}=0\), \(\Gamma _i=\gamma I\), \(\Omega _i=\omega _i I\), and \({\widetilde{\omega }}_i=\omega _i\) into

### 3.2 The Ergodic Duality Gap and Convergence

*N* is defined as the duality gap for \((x_N, Y_N)\), namely

### Theorem 3.1

*X* and *Y*, satisfying (G-pc) and (\(\hbox {F}^*\hbox {-pc}\)) for some sets \(\widetilde{\mathcal {K}}\), \(\hat{\mathcal {K}}\), and \(0 \le \Gamma \in \mathcal {L}(X; X)\). Pick \(\delta \in (0, 1)\), and suppose (\(\hbox {C2}''\)) and (\(\hbox {C1}''\)) are satisfied for each \(i \in \mathbb {N}\) for some invertible self-adjoint \(T_i \in \mathcal {Q}\), \(\Sigma _{i} \in \mathcal {L}(Y; Y)\), as well as \(\Gamma _i \in \lambda ([0, \Gamma ] + \widetilde{\mathcal {K}})\) and \(R_{i} \in \lambda \hat{\mathcal {K}}\) for \(\lambda =1/2\). Let \({\widehat{u}}=({\widehat{x}}, {\widehat{y}})\) satisfy (OC). Then, the iterates of Algorithm 2 satisfy

### Remark 3.1

For convergence of the gap, we must accelerate less (factor 1 / 2 on \(\Gamma _i\)).

### Example 3.2

(*No acceleration*) Consider Example 3.1, where \(\psi \equiv 0\) and \(\phi \equiv 0\). If \(\gamma =0\), we get ergodic convergence of the duality gap at rate *O*(1 / *N*). Indeed, we are in the scalar step setting, with \({\widetilde{\tau }}_j=\tau _j=\tau _0\). Thus, presently \({\widetilde{q}}_N=N\tau _0\).

### Example 3.3

(*Full acceleration*) With \(\gamma >0\) in Example 3.1, we know from [3, Corollary 1] that

### Remark 3.2

Therefore, \(\tau _N \le C_\tau /N\) for \(C_\tau :=\gamma ^{-1}+\tau _0\). Moreover, the second inequality gives \(\tau ^{-1}_N \le \tau ^{-1}_0 + \gamma N\).
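Both bounds can be confirmed numerically. The sketch below assumes the step length recurrence \(\tau _{i+1}=\tau _i/\sqrt{1+2\gamma \tau _i}\) of the fully accelerated scheme; this recurrence is an assumption here, since it is not restated in the remark, and the function names and test values are illustrative.

```python
import math


def step_lengths(tau0, gamma, n):
    """Return [tau_0, ..., tau_n] for tau_{i+1} = tau_i / sqrt(1 + 2*gamma*tau_i)."""
    taus = [tau0]
    for _ in range(n):
        tau = taus[-1]
        taus.append(tau / math.sqrt(1.0 + 2.0 * gamma * tau))
    return taus


def check_bounds(tau0, gamma, n):
    """Check tau_N <= C_tau / N and 1/tau_N <= 1/tau_0 + gamma*N for N <= n."""
    c_tau = 1.0 / gamma + tau0
    taus = step_lengths(tau0, gamma, n)
    slack = 1.0 + 1e-12  # tolerance for floating-point rounding
    upper_ok = all(taus[N] <= c_tau / N * slack for N in range(1, n + 1))
    inverse_ok = all(1.0 / taus[N] <= (1.0 / tau0 + gamma * N) * slack
                     for N in range(n + 1))
    return upper_ok, inverse_ok
```

For instance, `check_bounds(0.5, 2.0, 1000)` evaluates both inequalities for every \(N \le 1000\). The second bound follows from \(\sqrt{a^2+2\gamma a} \le a+\gamma \) applied to \(a=\tau _i^{-1}\).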

### Proof

(*Theorem* 3.1) The non-gap estimate in the last paragraph of the theorem statement, where \(\lambda =1\) and we set \(\mathcal {G}_N :=0\), is a direct consequence of Theorem 2.1. We therefore concentrate on the estimate that includes the gap, and fix \(\lambda = 1/2\). We begin by expanding

*i* of the step length parameters in comparison with the last step of (29), we thus obtain

*G* and \(F^*\), we observe

## 4 Convergence Rates in Special Cases

### 4.1 An Approach to Updating \(\Sigma \)

### 4.2 When \(\Gamma \) is a Multiple of a Projection

To summarise the findings of this section, we state the following proposition.

### Proposition 4.1

### Proof

### 4.3 Primal and Dual Penalties with Projective \(\Gamma \)

*i*, we, however, replace (45c) by the simpler condition

The next lemma summarises these results for the standard choice of \({\widetilde{\omega }}_i\).

### Lemma 4.1

### Proof

The choice (57) satisfies (45a), so that (45) in its entirety will be satisfied with the right-hand sides of (45b)–(45c) given by (56). The bound \({\widetilde{\tau }}_i \le {\widetilde{\tau }}_0\) follows from \({\widetilde{\omega }}_i \le 1\). Finally, the implication (58) is a simple estimation of (55).

\(\square \)

### Theorem 4.1

### Proof

### Remark 4.1

As a special case of Algorithm 3, if we choose \(\zeta = \tau _0^{\perp , -2}\), then we can show from (55) that \(\tau _i^\perp =\tau _0^\perp =\zeta ^{-1/2}\) for all \(i \in \mathbb {N}\).

### Remark 4.2

The convergence rate provided by Theorem 4.1 is a mixed \(O(1/N^2) + O(1/N)\) rate, similarly to that derived in [5] for a type of forward–backward splitting algorithm for smooth *G*. Ours is, of course, a backward–backward type algorithm. It is interesting to note that using the differentiability properties of infimal convolutions [23, Proposition 18.7], and the presentation of a smooth *G* as an infimal convolution, it is formally possible to derive a forward–backward algorithm from Algorithm 3. The difficulties lie in combining this conversion trick with conditions on the step lengths.

### 4.4 Dual Penalty Only with Projective \(\Gamma \)

*i* if \(a_i \searrow 0\). To ensure satisfaction for all \(i \in \mathbb {N}\), it suffices to take \(\{a_i\}_{i=0}^\infty \) non-increasing, and satisfy the initial condition

### Remark 4.3

If \(\phi \equiv 0\), that is, if \(F^*\) is strongly convex, we may simply pick \({\widetilde{\omega }}_i=\omega _i=1/\sqrt{1+2\gamma \tau _i}\), that is \(a_i=2\gamma \), and obtain from (70) an \(O(1/N^2)\) convergence rate.

*O*(1 / *N*) over both the initialisation and the dual sequence. By choosing \(q=1\), we get \(O(1/N^{3/2})\) convergence with respect to the initialisation, and \(O(1/N^{1/2})\) with respect to the residual sequence.

### Theorem 4.2

### Proof

We apply Proposition 4.1 whose assumptions we have verified during the course of the present section. In particular, \({\widetilde{\tau }}_i \le {\widetilde{\tau }}_0\) through the choice (68) that forces \({\widetilde{\omega }}_i \le 1\). Also, we have already derived the rate (70) from (48). Inserting (72) into (70), noting that the former is only valid for \(N \ge 2\), immediately gives (74).\(\square \)

## 5 Examples from Image Processing and the Data Sciences

### 5.1 Regularised Least Squares

*Tikhonov regularisation* or *empirical loss minimisation* form

We are particularly interested in strongly convex \(G_0\) and *A* with a non-trivial null-space. Examples include the Lasso, a type of regularised regression, with \(G_0=\Vert x\Vert _2^2/2\), \(K=I\), and \(F(x)=\Vert x\Vert _1\), on finite-dimensional spaces. If the data of the Lasso is ‘sparse’, in the sense that *A* has a non-trivial null-space, then, based on accelerating the strongly convex part of the variable, our algorithm can provide improved convergence rates compared to standard non-accelerated methods.

In image processing, examples abound; we refer to [25] for an overview. In total variation (\(TV \)) regularisation, we still take \(F(x)=\Vert x\Vert _1\), but \(K=\nabla \) is the gradient operator. Strictly speaking, this has to be formulated in the Banach space \(BV (\Omega )\), but we will consider the discretised setting to avoid this problem. For denoising of Gaussian noise with \(TV \) regularisation, we take \(A=I\), and again \(G_0=\Vert x\Vert _2^2/2\). This problem is not so interesting to us, as it is fully strongly convex. In a simple form of \(TV \) inpainting (filling in missing regions of an image), we take *A* as a subsampling operator *S* mapping an image \(x \in L^2(\Omega )\) to one in \(L^2(\Omega \setminus \Omega _d)\), for \(\Omega _d \subset \Omega \) the defect region that we want to recreate. Observe that in this case, \(\Gamma =S^*S\) is directly a projection operator. This is therefore a problem for our algorithms! Related problems include reconstruction from subsampled magnetic resonance imaging (MRI) data (see, for example, [11, 26]), where we take \(A=S\mathfrak {F}\) for \(\mathfrak {F}\) the Fourier transform. Still, \(A^*A\) is a projection operator, so the problem perfectly suits our algorithms.
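To see why \(\Gamma =S^*S\) is a projection in the inpainting example, note that in the discrete setting \(S^*S\) acts as multiplication by a 0/1 mask of the known pixels. The following sketch, with a hypothetical one-dimensional ‘image’ and mask, checks idempotence.

```python
def apply_gamma(x, known):
    """Apply S^*S: keep pixels outside the defect region, zero out the rest."""
    return [v if k else 0.0 for v, k in zip(x, known)]


# Hypothetical data: False marks pixels inside the defect region.
image = [0.2, 0.9, 0.4, 0.7, 0.1]
known = [True, True, False, False, True]

once = apply_gamma(image, known)
twice = apply_gamma(once, known)  # applying the operator again changes nothing
```

Strong convexity of \(\Vert f-Sx\Vert ^2/2\) is thus available exactly on the subspace of images supported on the known pixels, and is absent on the defect region.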

Another related problem is total variation deblurring, where *A* is a convolution kernel. This problem is slightly more complicated to handle, as \(A^*A\) is not a projection operator. Assuming periodic boundary conditions on a box \(\Omega =\prod _{i=1}^m [c_i, d_i]\), we can write \(A=\mathfrak {F}^* {\hat{a}} \mathfrak {F}\), multiplying the Fourier transform by some \({\hat{a}} \in L^2(\Omega )\). If \(|{\hat{a}}| \ge \gamma \) on a subdomain, we obtain a projection form \(\Gamma \). It would also be possible to extend our theory to non-constant \(\gamma \), but we have decided not to extend the length of the paper by doing so. Dualisation likewise provides a further alternative.

*Satisfaction of convexity conditions*

In case of our total variation examples, \(F(x)=\Vert x\Vert _1\) and \(K=\nabla \). Provided mean-zero functions are not in the kernel of *A*, one can through Poincaré's inequality [27] on \(BV (\Omega )\) and a two-dimensional connected domain \(\Omega \subset \mathbb {R}^2\) show that even the original infinite-dimensional problems have bounded solutions in \(L^2(\Omega )\). We may therefore again add the artificial constraint (76) with \(Z=L^2\) to (75).

*Dynamic bounds and pseudo-duality gaps*

We seldom know the exact bound *M*, but can derive conservative estimates. Nevertheless, adding such a bound to Algorithm 4 is a simple, easily implemented projection of \(P^\perp (x^i - T_i K^* y^i)\) into the constraint set. In practice, we do not use or need the projection, and update the bound *M* dynamically so as to ensure that the constraint (76) is never active. Indeed, *A* having a non-trivial nullspace also causes duality gaps for (P) to be numerically infinite. In [28], a ‘pseudo-duality gap’ was therefore introduced, based on dynamically updating *M*. We will also use this type of dynamic duality gaps in our reporting.
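The projection mentioned above is inexpensive. As a rough sketch, assuming the constraint set is the Euclidean ball \(\{x : \Vert x\Vert \le M\}\) in the discretised setting, it amounts to a rescaling. The dynamic update rule shown alongside is a hypothetical choice that merely keeps the constraint inactive; it is not necessarily the rule of [28].

```python
import math


def norm2(x):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(v * v for v in x))


def project_ball(x, M):
    """Project x onto the ball {z : ||z|| <= M}."""
    n = norm2(x)
    if n <= M:
        return list(x)
    return [v * (M / n) for v in x]


def update_bound(M, x, slack=2.0):
    """Hypothetical dynamic rule: enlarge M so that ||x|| <= M / slack holds."""
    return max(M, slack * norm2(x))
```

With such an update, every iterate stays strictly inside the ball, so the projection never changes it. This matches the observation that the projection is not needed in practice.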

### 5.2 \(TGV ^2\) Regularised Problems

*G* into ones for \(G_1\) and \(G_2\). The condition (\(\hbox {F}^*\)-pcr) with \(\bar{\rho }=\infty \) is then immediate from Lemma 2.1. Moreover, the Sobolev–Korn inequality [31] allows us to bound on a connected domain \(\Omega \subset \mathbb {R}^2\) an optimal \({\hat{w}}\) to (77) as

*w* is not used in (77). Therefore we may again replace \(G_2=0\) by the artificial constraint \(G_2(w)=\iota _{\Vert \,\varvec{\cdot }\,\Vert _{L^2} \le M}(w)\). By Lemma 2.1, *G* will then satisfy (G-pcr) with \(\bar{\gamma }^\perp =\infty \).

### 5.3 Numerical Results

We demonstrate our algorithms on \(TGV ^2\) denoising and \(TV \) deblurring. Our tests are done on the photographs in Fig. 1, both at the original resolution of \(768 \times 512\), and scaled down by a factor of 0.25 to \(192 \times 128\) pixels. The underlying photograph is image #23 from the free Kodak image suite. Other images from the collection that we have experimented on give analogous computational results. For both of our example problems, we calculate a target solution by taking one million iterations of the basic PDHGM (3). We also tried interior point methods for this, but they are only practical for the smaller denoising problem.

We evaluate Algorithms 3 and 4 against the standard unaccelerated PDHGM of [3], as well as (a) the mixed-rate method of [5], denoted here C-L-O, (b) the relaxed PDHGM of [20, 32], denoted here ‘Relax’, and (c) the adaptive PDHGM of [33], denoted here ‘Adapt’. All of these methods are very closely linked and have comparable low costs for each step. This makes them straightforward to compare.

As we have discussed, for comparison and stopping purposes, we need to calculate a pseudo-duality gap as in [28], because the real duality gap is in practice infinite when *A* has a non-trivial nullspace. We do this dynamically, updating the *M* in (76) every time we compute the duality gap. For both of our example problems, we use for simplicity \(Z=L^2\) in (76). In the calculation of the final duality gaps comparing each algorithm, we then take as *M* the maximum over all evaluations of all the algorithms. This makes the results fully comparable. We always report the duality gap in decibels \(10\log _{10}(\text {gap}^2/\text {gap}_0^2)\) relative to the initial iterate. Similarly, we report the distance to the target solution \({\hat{u}}\) in decibels \(10\log _{10}(\Vert u^i-{\hat{u}}\Vert ^2/\Vert {\hat{u}}\Vert ^2)\), and the primal objective value \(\text {val}(x) :=G(x)+F(Kx)\) relative to the target as \(10\log _{10}(\text {val}(x)^2/\text {val}({\hat{x}})^2)\). Our computations were performed in MATLAB+C-MEX on a MacBook Pro with 16GB RAM and a 2.8 GHz Intel Core i5 CPU.
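The three decibel measures translate directly into code. The following sketch implements the formulas as stated; the function names are ours, and vectors are represented as plain lists for illustration.

```python
import math


def gap_db(gap, gap0):
    """Duality gap in decibels relative to the gap at the initial iterate."""
    return 10.0 * math.log10(gap ** 2 / gap0 ** 2)


def distance_db(u, u_hat):
    """Squared distance to the target u_hat, relative to ||u_hat||^2, in decibels."""
    num = sum((a - b) ** 2 for a, b in zip(u, u_hat))
    den = sum(b ** 2 for b in u_hat)
    return 10.0 * math.log10(num / den)


def value_db(val, val_hat):
    """Primal objective value relative to the value at the target, in decibels."""
    return 10.0 * math.log10(val ** 2 / val_hat ** 2)
```

Because the ratios are squared, a reduction of the gap by a factor of ten corresponds to \(-20\) dB. A threshold such as \(-50\) dB therefore means that the gap has dropped to \(10^{-2.5}\) times its initial value.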

\(TGV ^2\) *denoising.* The noise in our high-resolution test image, with values in the range [0, 255], has standard deviation 29.6 or 12 dB. In the downscaled image, these become, respectively, 6.15 or 25.7 dB. As parameters \((\beta , \alpha )\) of the \(TGV ^2\) regularisation functional, we choose (4.4, 4) for the downscaled image, and translate this to the original image by multiplying by the scaling vector \((0.25^{-2}, 0.25^{-1})\) corresponding to the 0.25 downscaling factor. See [34] for a discussion about rescaling and regularisation factors, as well as for a justification of the \(\beta /\alpha \) ratio.
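The parameter translation between resolutions is a componentwise multiplication. As a small worked example, with a function name of our own choosing:

```python
def rescale_tgv_parameters(beta, alpha, scale):
    """Translate (beta, alpha) chosen at a downscaled resolution to the original one.

    The scaling vector is (scale**-2, scale**-1), as described in the text.
    """
    return beta * scale ** -2, alpha * scale ** -1


# Parameters (4.4, 4) for the image downscaled by the factor 0.25.
beta_hi, alpha_hi = rescale_tgv_parameters(4.4, 4.0, 0.25)
```

For the 0.25 downscaling factor, the scaling vector is \((16, 4)\), so the high-resolution parameters come out as approximately \((70.4, 16)\).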

*c* from [33]. For ‘Relax’, we use the value 1.5 for the inertial \(\rho \) parameter of [32]. For both of these algorithms, we use the same choices of \(\sigma _0\) and \(\tau _0\) as for the PDHGM.

Table 1 \(TGV ^2\) denoising performance, maximum 20,000 iterations

Low resolution

| Method | Gap \(\le -50\) dB: Iter | Gap \(\le -50\) dB: Time (s) | Tgt \(\le -40\) dB: Iter | Tgt \(\le -40\) dB: Time (s) | Val \(\le 1\) dB: Iter | Val \(\le 1\) dB: Time (s) |
|---|---|---|---|---|---|---|
| PDHGM | 30 | 0.40 | 40 | 0.46 | 30 | 0.40 |
| C-L-O | 500 | 4.67 | 1210 | 11.31 | 970 | 9.04 |
| Alg.3 | 20 | 0.29 | 10 | 0.22 | 20 | 0.29 |
| Alg.4 | 20 | 0.47 | 20 | 0.47 | 20 | 0.47 |
| Relax | 20 | 0.34 | 30 | 0.45 | 20 | 0.34 |
| Adapt | 5360 | 106.63 | 2040 | 41.38 | 3530 | 70.78 |

High resolution

| Method | Gap \(\le -40\) dB: Iter | Gap \(\le -40\) dB: Time (s) | Tgt \(\le -30\) dB: Iter | Tgt \(\le -30\) dB: Time (s) | Val \(\le 1\) dB: Iter | Val \(\le 1\) dB: Time (s) |
|---|---|---|---|---|---|---|
| PDHGM | 50 | 8.85 | 30 | 5.13 | 30 | 5.13 |
| C-L-O | 80 | 15.76 | 30 | 5.97 | 80 | 15.76 |
| Alg.3 | 40 | 6.20 | 20 | 3.10 | 40 | 6.20 |
| Alg.4 | 60 | 9.18 | 30 | 4.53 | 60 | 9.18 |
| Relax | 40 | 7.45 | 20 | 3.70 | 20 | 3.70 |
| Adapt | – | – | – | – | – | – |

We take a fixed 20,000 iterations and initialise each algorithm with \(y^0=0\) and \(x^0=0\). To reduce computational overheads, we compute the duality gap and distance to target only every 10 iterations instead of at each iteration. The results are in Fig. 3 and Table 1. As we can see, Algorithm 3 performs extremely well for the low-resolution image, especially in its initial iterations. After about 700 or 200 iterations, depending on the criterion, the standard and relaxed PDHGM start to overtake. This is a general effect that we have seen in our tests: the standard PDHGM performs in practice very well asymptotically, although in principle all that exists is an *O*(1 / *N*) rate on the ergodic duality gap. Algorithm 4, by contrast, does not perform asymptotically so well. It can be extremely fast on its initial iterations, but then quickly flattens out. The C-L-O surprisingly performs better on the high-resolution image than on the low-resolution image, where it does somewhat poorly in comparison with the other algorithms. The adaptive PDHGM performs very poorly for \(TGV ^2\) denoising, and we have indeed excluded the high-resolution results from our reports to keep the scaling of the plots informative. Overall, Algorithm 3 gives good results fast, although the basic and relaxed PDHGM seem to perform, in practice, better asymptotically.

\(TV \) *deblurring.* Our test image has now been distorted by Gaussian blur of standard deviation 4, which we intend to remove. We denote by \({\hat{a}}\) the Fourier presentation of the blur operator as discussed in Sect. 5.1. For numerical stability of the pseudo-duality gap, we zero out small entries, replacing this \({\hat{a}}\) by \({\hat{a}} \chi _{|{\hat{a}}(\,\varvec{\cdot }\,)| \ge \Vert {\hat{a}}\Vert _\infty /1000}(\xi )\). Note that this is only needed for the stable computation of \(G^*\) for the pseudo-duality gap, to compare the algorithms; the algorithms themselves are stable without this modification. To construct the projection operator *P*, we then set \({\hat{p}}(\xi )=\chi _{|{\hat{a}}(\,\varvec{\cdot }\,)| \ge 0.3 \Vert {\hat{a}}\Vert _\infty }(\xi )\), and \(P=\mathfrak {F}^* {\hat{p}} \mathfrak {F}\).
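The construction of *P* can be sketched in one dimension with a naive discrete Fourier transform. Everything below (the periodic Gaussian kernel, the signal length, the helper names) is a hypothetical illustration of the thresholding rule \({\hat{p}}=\chi _{|{\hat{a}}| \ge 0.3\Vert {\hat{a}}\Vert _\infty }\), not the implementation used for the experiments.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
            for j in range(n)]


def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * math.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]


def gaussian_kernel(n, std):
    """Periodic Gaussian blur kernel centred at index 0, normalised to sum 1."""
    w = [math.exp(-0.5 * (min(k, n - k) / std) ** 2) for k in range(n)]
    total = sum(w)
    return [v / total for v in w]


def projection_mask(a_hat, threshold=0.3):
    """0/1 multiplier p_hat: 1 where |a_hat| >= threshold * max|a_hat|."""
    peak = max(abs(v) for v in a_hat)
    return [1.0 if abs(v) >= threshold * peak else 0.0 for v in a_hat]


def apply_projection(x, mask):
    """Apply P = F^* p_hat F to a real signal x."""
    X = dft(x)
    # The mask is symmetric for a symmetric kernel, so the result is real
    # up to rounding; we drop the negligible imaginary parts.
    return [v.real for v in idft([m * c for m, c in zip(mask, X)])]
```

Since \({\hat{p}}\) takes only the values 0 and 1, applying `apply_projection` twice gives the same result as applying it once, which is the projection property required of *P*.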

Table 2 \(TV \) deblurring performance, maximum 10,000 iterations

Low resolution

| Method | Gap \(\le -60\) dB: Iter | Gap \(\le -60\) dB: Time (s) | Tgt \(\le -40\) dB: Iter | Tgt \(\le -40\) dB: Time (s) | Val \(\le 1\) dB: Iter | Val \(\le 1\) dB: Time (s) |
|---|---|---|---|---|---|---|
| PDHGM | 390 | 2.53 | 2630 | 17.41 | 60 | 0.47 |
| C-L-O | 600 | 3.81 | 8930 | 54.20 | 950 | 5.95 |
| Alg.3 | 130 | 1.14 | 880 | 7.22 | 20 | 0.25 |
| Alg.4 | 30 | 0.47 | 90 | 0.97 | 10 | 0.29 |
| Relax | 260 | 1.62 | 1750 | 11.34 | 40 | 0.29 |
| Adapt | 110 | 1.12 | 660 | 5.94 | 10 | 0.16 |

High resolution

| Method | Gap \(\le -60\) dB: Iter | Gap \(\le -60\) dB: Time (s) | Tgt \(\le -30\) dB: Iter | Tgt \(\le -30\) dB: Time (s) | Val \(\le 1\) dB: Iter | Val \(\le 1\) dB: Time (s) |
|---|---|---|---|---|---|---|
| PDHGM | 1180 | 118.30 | 970 | 98.98 | 70 | 6.59 |
| C-L-O | 500 | 48.44 | 1940 | 187.42 | 1000 | 96.60 |
| Alg.3 | 400 | 58.42 | 320 | 46.16 | 40 | 6.13 |
| Alg.4 | 60 | 7.97 | 50 | 6.66 | 30 | 3.98 |
| Relax | 790 | 77.31 | 650 | 63.84 | 50 | 5.29 |
| Adapt | 260 | 39.39 | 150 | 23.30 | 30 | 4.72 |

We use \(TV \) parameter 2.55 for the high-resolution image and the scaled parameter \(2.55*0.15\) for the low-resolution image. We parametrise all the algorithms almost exactly as for \(TGV ^2\) denoising above, of course with appropriate \(\Omega _U\) and \(\Vert K\Vert ^2 \le 8\) corresponding to \(K=\nabla \) [36]. The only difference in parameterisation is that we take \(q=1\) instead of \(q=0.1\) for Algorithm 4.
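The bound \(\Vert K\Vert ^2 \le 8\) for the discrete gradient can be checked by power iteration. The sketch below assumes forward differences with unit grid spacing and zero differences at the far boundary, which is one common discretisation and an assumption here; the grid size, seed, and function names are likewise illustrative.

```python
import math
import random


def grad(u):
    """Forward-difference gradient of a 2D grid (list of rows)."""
    n, m = len(u), len(u[0])
    gx = [[u[i + 1][j] - u[i][j] if i < n - 1 else 0.0 for j in range(m)]
          for i in range(n)]
    gy = [[u[i][j + 1] - u[i][j] if j < m - 1 else 0.0 for j in range(m)]
          for i in range(n)]
    return gx, gy


def grad_adjoint(gx, gy):
    """Adjoint of grad, assembled entry by entry from the same differences."""
    n, m = len(gx), len(gx[0])
    u = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i < n - 1:
                u[i][j] -= gx[i][j]
                u[i + 1][j] += gx[i][j]
            if j < m - 1:
                u[i][j] -= gy[i][j]
                u[i][j + 1] += gy[i][j]
    return u


def estimate_norm_sq(n=8, iters=200, seed=0):
    """Power iteration on K^T K; returns a lower estimate of ||K||^2."""
    rng = random.Random(seed)
    u = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    est = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(v * v for row in u for v in row))
        u = [[v / norm for v in row] for row in u]
        gx, gy = grad(u)
        # Rayleigh quotient ||K u||^2 for the normalised u.
        est = (sum(v * v for row in gx for v in row)
               + sum(v * v for row in gy for v in row))
        u = grad_adjoint(gx, gy)
    return est
```

On an \(8 \times 8\) grid the estimate settles somewhat below 8, and it approaches 8 from below as the grid is refined. This is why the bound can be used independently of the image resolution.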

The results are in Fig. 4 and Table 2. It does not appear numerically feasible to go significantly below \(-100\) or \(-80\) dB gap. Our guess is that this is due to the numerical inaccuracies of the fast Fourier transform implementation in MATLAB. The C-L-O performs very well judged by the duality gap, although the images themselves and the primal objective value appear to take a little bit longer to converge. The relaxed PDHGM is again slightly improved from the standard PDHGM. The adaptive PDHGM performs very well, slightly outperforming Algorithm 3, although not Algorithm 4. This time Algorithm 4 performs remarkably well.

## 6 Conclusion

To conclude, overall, our algorithms are very competitive within the class of proposed variants of the PDHGM. Within our analysis, we have, moreover, proposed very streamlined derivations of convergence rates for even the standard PDHGM, based on the proximal point formulation and the idea of testing. Interesting continuations of this study include whether the condition \({\hat{T}}_i K=K{\widetilde{T}}_i\) can reasonably be relaxed such that \({\hat{T}}_i\) and \({\widetilde{T}}_i\) would not have to be scalars, as well as the relation to block coordinate descent methods, in particular [14, 37].

## Notes

### Acknowledgements

This research was started while T. Valkonen was at the Center for Mathematical Modeling at Escuela Politécnica Nacional in Quito, supported by a Prometeo scholarship of the Senescyt (Ecuadorian Ministry of Science, Technology, Education, and Innovation). In Cambridge, T. Valkonen has been supported by the EPSRC Grant EP/M00483X/1 “Efficient computational tools for inverse imaging problems”. Thomas Pock is supported by the European Research Council under the Horizon 2020 programme, ERC starting Grant Agreement 640156.

### Compliance with ethical standards

### A Data Statement for the EPSRC

This is primarily a theory paper, with some demonstrations on a photograph freely available from the Internet. As this article was written, the used photograph from the Kodak image suite was, in particular, available at http://r0k.us/graphics/kodak/. It has also been archived with our implementations of the algorithms at https://www.repository.cam.ac.uk/handle/1810/253697.

### References

- 1. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1972)
- 2. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. SIAM (1999)
- 3. Chambolle, A., Pock, T.: A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. **40**, 120–145 (2011). doi: 10.1007/s10851-010-0251-1
- 4. Esser, E., Zhang, X., Chan, T.F.: A general framework for a class of first order primal–dual algorithms for convex optimization in imaging science. SIAM J. Imaging Sci. **3**(4), 1015–1046 (2010). doi: 10.1137/09076934X
- 5. Chen, Y., Lan, G., Ouyang, Y.: Optimal primal–dual methods for a class of saddle point problems. SIAM J. Optim. **24**(4), 1779–1814 (2014). doi: 10.1137/130919362
- 6. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. **103**(1), 127–152 (2005). doi: 10.1007/s10107-004-0552-5
- 7. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. **22**(2), 557–580 (2012). doi: 10.1137/100818327
- 8. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. **15**(3), 715–732 (2015). doi: 10.1007/s10208-013-9150-3
- 9. Beck, A., Teboulle, M.: A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. **42**(1), 1–6 (2014). doi: 10.1016/j.orl.2013.10.007
- 10. Valkonen, T.: A primal–dual hybrid gradient method for non-linear operators with applications to MRI. Inverse Probl. **30**(5), 055012 (2014). doi: 10.1088/0266-5611/30/5/055012
- 11. Benning, M., Knoll, F., Schönlieb, C.B., Valkonen, T.: Preconditioned ADMM with nonlinear operator constraint (2015). arXiv:1511.00425
- 12. Möllenhoff, T., Strekalovskiy, E., Moeller, M., Cremers, D.: The primal–dual hybrid gradient method for semiconvex splittings. SIAM J. Imaging Sci. **8**(2), 827–857 (2015). doi: 10.1137/140976601
- 13. Lorenz, D., Pock, T.: An inertial forward-backward algorithm for monotone inclusions. J. Math. Imaging Vis. **51**(2), 311–325 (2015). doi: 10.1007/s10851-014-0523-2
- 14. Fercoq, O., Bianchi, P.: A coordinate descent primal–dual algorithm with large step size and possibly non separable functions (2015). arXiv:1508.04625
- 15. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. **2**(1), 183–202 (2009). doi: 10.1137/080716542
- 16. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process. **18**(11), 2419–2434 (2009). doi: 10.1109/TIP.2009.2028250
- 17. Setzer, S.: Operator splittings, Bregman methods and frame shrinkage in image processing. Int. J. Comput. Vis. **92**(3), 265–280 (2011). doi: 10.1007/s11263-010-0357-3
- 18. Valkonen, T.: Optimising big images. In: A. Emrouznejad (ed.) Big Data Optimization: Recent Developments and Challenges, Studies in Big Data, pp. 97–131. Springer, Berlin (2016). doi: 10.1007/978-3-319-30265-2_5
- 19. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998). doi: 10.1007/978-3-642-02431-3
- 20. He, B., Yuan, X.: Convergence analysis of primal–dual algorithms for a saddle-point problem: from contraction perspective. SIAM J. Imaging Sci. **5**(1), 119–149 (2012). doi: 10.1137/100814494
- 21. Pock, T., Chambolle, A.: Diagonal preconditioning for first order primal–dual algorithms in convex optimization. In: Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 1762–1769 (2011). doi: 10.1109/ICCV.2011.6126441
- 22. Rudin, W.: Functional Analysis. International series in Pure and Applied Mathematics. McGraw-Hill, New York (2006)
- 23. Bauschke, H., Combettes, P.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics. Springer, Berlin (2011)
- 24. Hohage, T., Homann, C.: A generalization of the Chambolle–Pock algorithm to Banach spaces with applications to inverse problems (2014). arXiv:1412.0126
- 25. Chan, T., Shen, J.: Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods. Society for Industrial and Applied Mathematics (SIAM) (2005)
- 26. Benning, M., Gladden, L., Holland, D., Schönlieb, C.B., Valkonen, T.: Phase reconstruction from velocity-encoded MRI measurements—a survey of sparsity-promoting variational approaches. J. Magn. Reson. **238**, 26–43 (2014). doi: 10.1016/j.jmr.2013.10.003
- 27. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity Problems. Oxford University Press, Oxford (2000)
- 28. Valkonen, T., Bredies, K., Knoll, F.: Total generalised variation in diffusion tensor imaging. SIAM J. Imaging Sci. **6**(1), 487–525 (2013). doi: 10.1137/120867172
- 29. Bredies, K., Kunisch, K., Pock, T.: Total generalized variation. SIAM J. Imaging Sci. **3**, 492–526 (2011). doi: 10.1137/090769521
- 30. Bredies, K., Valkonen, T.: Inverse problems with second-order total generalized variation constraints. In: Proceedings of the 9th International Conference on Sampling Theory and Applications (SampTA) 2011, Singapore (2011)
- 31. Temam, R.: Mathematical Problems in Plasticity. Gauthier-Villars (1985)
- 32. Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. (2015). doi: 10.1007/s10107-015-0957-3
- 33. Goldstein, T., Li, M., Yuan, X.: Adaptive primal–dual splitting methods for statistical learning and image processing. Adv. Neural Inf. Process. Syst. **28**, 2080–2088 (2015)
- 34. de Los Reyes, J.C., Schönlieb, C.B., Valkonen, T.: Bilevel parameter learning for higher-order total variation regularisation models. J. Math. Imaging Vis. (2016). doi: 10.1007/s10851-016-0662-8. Published online
- 35. Chen, K., Lorenz, D.A.: Image sequence interpolation using optimal control. J. Math. Imaging Vis. **41**, 222–238 (2011). doi: 10.1007/s10851-011-0274-2
- 36. Chambolle, A.: An algorithm for mean curvature motion. Interfaces Free Bound. **6**(2), 195 (2004)
- 37. Suzuki, T.: Stochastic dual coordinate ascent with alternating direction multiplier method (2013). arXiv:1311.0622v1

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.