Inexact First-Order Primal-Dual Algorithms



Introduction
The numerical solution of nonsmooth optimization problems and the acceleration of their convergence has been regarded as a fundamental issue over the past ten to twenty years. This is mainly due to the development of image reconstruction and processing, data science and machine learning, which require solving very large and highly nonlinear minimization problems. Two of the most popular approaches are forward-backward splittings [40,22,20], in particular the FISTA method [7,6], and first-order primal-dual methods, first introduced in [50,30] and further studied in [13,14]. The common thread of all these methods is that they split the global objective into many elementary bricks which, individually, may be "easy" to optimize.
In their original version, all the above-mentioned approaches require that the mathematical operations necessary in every step of the respective algorithms can be executed without errors. However, one might encounter situations in which these operations can only be performed up to a certain precision, e.g. due to a slightly erroneous computation of a gradient or due to the application of a proximal operator lacking a closed-form solution. Examples where this problem arises are TV-type regularized inverse problems [6,5,31,54,28] or low-rank minimization and matrix completion [10,42]. To address this issue, there has been a lot of work investigating the convergence of proximal point methods [52,37,34,23,53,35] (where the latter two also prove rates) and proximal forward-backward splittings [22] in the presence of errors. The objectives of these publications range from general convergence proofs [41,49,18,60,32] and convergence up to some accuracy level [45,25,26] to convergence rates depending on the errors [55,56,4], also for the accelerated versions (including the FISTA method). The very recent paper [8] follows a similar approach to [56], but also extends to nonconvex problems using variable metric strategies and linesearch.
For the recently popular class of first-order primal-dual algorithms mentioned above the list is shorter: Combettes et al. [19,20,21] and also Vũ [57] explicitly model inaccuracies in the proximal operators of very general primal-dual methods in the context of monotone operators, and prove convergence under rather mild assumptions on the decay of the errors. The work which comes closest to the one at hand is the one by Condat [24].
However, none of the above works show convergence rates for their algorithms. The goal of this paper is twofold: First, and most importantly, we investigate the convergence of the primal-dual algorithms presented in [13,14] for problems of the form
\[ \min_{x \in X} \max_{y \in Y} \; \langle Kx, y \rangle + f(x) + g(x) - h^*(y), \]
for convex and lower semicontinuous $g$ and $h$ and convex and Lipschitz-differentiable $f$, with errors occurring in the computation of $\nabla f$ and the proximal operators for $g$ and $h^*$. Following the line of the preceding works on forward-backward splittings, we consider the different notions of inexact proximal points used in [55] and extended in [53,56,4] and, assuming an appropriate decay of the errors, establish the convergence rates of [13,14] also for perturbed algorithms. More precisely, we prove the well-known $O(1/N)$ rate for the basic version, an $O(1/N^2)$ rate if either $f$, $g$ or $h^*$ is strongly convex, and a linear convergence rate in case both the primal and dual objective are strongly convex. Moreover, we show that rates can still be established under a slower decay of the errors; unsurprisingly, these rates are slower and depend on the errors.
In the spirit of [56] for inexact forward-backward algorithms, the second goal of this paper is to provide an interesting application for such inexact primal-dual algorithms. For some problems one simply might not be able to compute a gradient or proximal operator without errors, in which case this work gives criteria on how large these errors may be to still obtain a certain convergence. We instead put special focus on situations where one takes this path deliberately in order to split the global objective more efficiently. A particular instance are problems of the form
\[ \min_x \; h(K_1 x) + g(K_2 x) = \min_x \max_y \; \langle y, K_1 x \rangle + g(K_2 x) - h^*(y). \tag{1.1} \]
A very popular example is the TV-$L^1$ model with some imaging operator $K_1$, where $g$ and $h$ are chosen to be $L^1$-norms and $K_2 = \nabla$ is a gradient. It has e.g. been studied analytically by [1,2,3] and subsequently by [47,48,36,16]. However, due to the nonlinearity and nondifferentiability of the involved terms, solutions of the model are numerically hard to compute. In the literature one can find a variety of approaches to solve the TV-$L^1$ model, ranging from (smoothed) gradient descent [16] over interior point methods [33] and primal-dual methods using a semi-smooth Newton method [27] to augmented Lagrangian methods [29,58]. Interestingly, the inexact framework we propose in this paper provides a very simple and intuitive algorithmic approach to the solution of the TV-$L^1$ model. More precisely, applying an inexact primal-dual algorithm directly to formulation (1.1), we obtain a nested algorithm in the spirit of [6,17,5,31,54,56,28],
\[ x^{n+1} = \operatorname{prox}_{\tau (g \circ K_2)}\big(x^n - \tau K_1^* y^n\big), \qquad y^{n+1} = \operatorname{prox}_{\sigma h^*}\big(y^n + \sigma K_1 (x^{n+1} + \theta (x^{n+1} - x^n))\big), \]
where $\operatorname{prox}_{\sigma h^*}$ denotes the proximal map with respect to $h^*$ and step size $\sigma$ (cf. Section 2). It requires solving the inner subproblem of the proximal step with respect to $g \circ K_2$, i.e. a TV-denoising problem, which does not have a closed-form solution but has to be solved numerically. It has been observed in [6] that using this strategy on the primal TV-$L^2$ deblurring problem can cause the FISTA algorithm to diverge if the inner step is not executed with sufficient precision. As a remedy, the authors of [56] demonstrated that the theoretical error bounds they established for inexact FISTA can also serve as a criterion for the necessary precision of the inner proximal problem and hence make the nested approach viable. We show that the bounds for inexact primal-dual algorithms established in this paper can be used to make the nested approach viable for entirely non-differentiable problems such as the TV-$L^1$ model, while the results of [56] for partly smooth objectives can also be obtained as a special case of the accelerated versions.
In the context of inexact and nested algorithms it is worth mentioning the very recent 'Catalyst' framework [39,38], which uses nested inexact proximal point methods to accelerate a large class of generic optimization problems in the context of machine learning. The inexactness criterion applied there is the same as in [55,4]. Our approach, however, is much closer to [55,56,4], stating convergence rates for perturbed algorithms, while [39,38] focus entirely on nested algorithms, which we only consider as a particular instance of perturbed algorithms in the numerical experiments.
The remainder of the paper is organized as follows: In the next section we introduce the notions of inexact proxima that we will use throughout the paper, and their respective connections, advantages and disadvantages. In Section 3 we formulate inexact versions of the primal-dual algorithms presented in [13] and [14] and prove their convergence, including rates depending on the decay of the errors. In the numerical Section 4 we outline the above splitting idea for nested algorithms in more detail and show how inexact primal-dual algorithms can be used to improve the convergence for some well-known imaging problems.

Inexact computations of the proximal point
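In this section we introduce and discuss the idea of the proximal point and several ways to approximate it. For a proper, convex and lower semicontinuous (l.s.c.) function $g : X \to \mathbb{R}_\infty$ mapping from a Hilbert space $X$ to the extended real line $\mathbb{R}_\infty = \mathbb{R} \cup \{\infty\}$ and $y \in X$, the proximal point [44,43,51,52] is given by
\[ \operatorname{prox}_{\tau g}(y) = \arg\min_{x \in X} \left\{ \frac{1}{2\tau} \|x - y\|^2 + g(x) \right\}. \tag{2.1} \]
Since $g$ is proper we directly obtain $\operatorname{prox}_{\tau g}(y) \in \operatorname{dom} g$. The objective in (2.1) is $1/\tau$-strongly convex, so its minimizer is unique and the mapping $\operatorname{prox}_{\tau g} : X \to X$, called the proximity operator of $\tau g$, is well defined. Letting
\[ G_\tau : X \to \mathbb{R}_\infty, \qquad G_\tau(x) = \frac{1}{2\tau} \|x - y\|^2 + g(x) \tag{2.2} \]
be the proximity function, the first-order optimality condition for the proximum gives different characterizations of the proximal point:
\[ z = \operatorname{prox}_{\tau g}(y) \iff 0 \in \partial G_\tau(z) \iff \frac{y - z}{\tau} \in \partial g(z). \tag{2.3} \]
Based on these equivalences we introduce four different types of inexact computations of the proximal point, which differ in how restrictive they are. Most of them have already been considered in the literature, and we give references next to the definitions. We also recommend [53] and [56] for some intuitive illustrations of the different notions in the case of a projection. We start with the first expression in (2.3).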
Definition 2.1. Let $\varepsilon \geq 0$. We say that $z \in X$ is a type-0 approximation of the proximal point $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$ if
\[ \|z - \operatorname{prox}_{\tau g}(y)\| \leq \sqrt{2\tau\varepsilon}. \]
This refers to choosing a point from the $\sqrt{2\tau\varepsilon}$-ball around the true proximum. It is important to notice that a type-0 approximation is not necessarily feasible, i.e. it can occur that $z \notin \operatorname{dom} g$. This can easily be verified e.g. for $g$ being the indicator function of a convex set, and requires us to treat this approximation slightly differently from the following ones.
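In order to relax the second or third expression in (2.3), we need the notion of an $\varepsilon$-subdifferential of $g : X \to \mathbb{R}_\infty$ at $z \in X$:
\[ \partial_\varepsilon g(z) = \{ p \in X : g(x) \geq g(z) + \langle p, x - z \rangle - \varepsilon \ \text{ for all } x \in X \}, \]
where as a direct consequence of the definition we obtain a notion of $\varepsilon$-optimality
\[ 0 \in \partial_\varepsilon g(z) \iff g(z) \leq \inf_{x \in X} g(x) + \varepsilon. \tag{2.4} \]
Based on this and the second expression in (2.3), we define a second notion of an inexact proximum. It has e.g. been considered in [55,4] to prove the convergence of inexact (accelerated) proximal gradient methods in the presence of errors.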
Definition 2.2. Let $\varepsilon \geq 0$. We say that $z \in X$ is a type-1 approximation of the proximal point $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$ if
\[ 0 \in \partial_\varepsilon G_\tau(z). \]
Hence, by (2.4), an approximation in the sense of Definition 2.2 minimizes the energy of the proximity function (2.2) up to an error of $\varepsilon$. It turns out that a type-0 approximation is weaker than a type-1 approximation, which can be seen from the following lemma:

Lemma 2.3. Let $z \in X$ be a type-1 approximation of $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$. Then $z \in \operatorname{dom} g$ and $\|z - \operatorname{prox}_{\tau g}(y)\| \leq \sqrt{2\tau\varepsilon}$.

The result is easy to verify and can be found e.g. in [52,34,53]. An even more restrictive version of an inexact proximum can be obtained by relaxing the third expression in (2.3); it has been introduced in [37] and subsequently been used in [23,53] in the context of inexact proximal point methods.
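Definition 2.4. Let $\varepsilon \geq 0$. We say that $z \in X$ is a type-2 approximation of the proximal point $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$ if
\[ \frac{y - z}{\tau} \in \partial_\varepsilon g(z). \]
Letting $\varphi_\tau(z) = \|z - y\|^2/(2\tau)$, the following characterization from [53] of the $\varepsilon$-subdifferential of the proximity function $G_\tau = g + \varphi_\tau$ sheds light on the difference between a type-1 and type-2 approximation:
\[ \partial_\varepsilon G_\tau(z) \subset \bigcup_{0 \leq \varepsilon_1 + \varepsilon_2 \leq \varepsilon} \partial_{\varepsilon_1} g(z) + \partial_{\varepsilon_2} \varphi_\tau(z) = \bigcup_{0 \leq \varepsilon_1 + \varepsilon_2 \leq \varepsilon} \left\{ \partial_{\varepsilon_1} g(z) + \frac{z - y - e}{\tau} \; : \; \frac{\|e\|^2}{2\tau} \leq \varepsilon_2 \right\}. \tag{2.5} \]
Equation (2.5) decomposes the error in the $\varepsilon$-subdifferential of $G_\tau$ into two parts related to $g$ and $\varphi_\tau$, respectively. As a consequence, for a type-1 approximation of the proximum it is not clear how the error is distributed between $g$ and $\varphi_\tau$, while for a type-2 approximation the error occurs solely in $g$. Hence a type-2 approximation can be seen as an extreme case of a type-1 approximation with $\varepsilon_2 = 0$. Another way to see the difference between the two approximations is to rewrite the definition of a type-2 approximation: adding $\varphi_\tau(x)$ to both sides of the defining inequality of $\partial_\varepsilon g(z)$ gives
\[ G_\tau(x) \geq G_\tau(z) + \frac{1}{2\tau} \|x - z\|^2 - \varepsilon \quad \text{for all } x \in X. \]
This reveals that a type-2 approximation also recovers the strong convexity of the proximity function, while a type-1 approximation does not. In view of the decomposition (2.5), we define a fourth notion of an inexact proximum as the extreme case $\varepsilon_1 = 0$, which has been considered e.g. in [52] and [34].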
Definition 2.5. Let $\varepsilon \geq 0$. We say that $z \in X$ is a type-3 approximation of the proximal point $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$ if there exists $e \in X$ with $\|e\| \leq \sqrt{2\tau\varepsilon}$ such that
\[ z = \operatorname{prox}_{\tau g}(y + e). \]
Definition 2.5 gives the notion of a "correct" output for an incorrect input of the proximal operator. Being the two extreme cases, type-2 and type-3 proxima are also proxima of type 1. The decomposition (2.5) further leads to the following lemma from [55], which allows us to treat the type-1, -2 and -3 approximations in the same setting.
Lemma 2.6. If $z \in X$ is a type-1 approximation of $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$, then there exists $r \in X$ with $\|r\| \leq \sqrt{2\tau\varepsilon}$ such that
\[ \frac{y - z - r}{\tau} \in \partial_\varepsilon g(z). \]
Note that letting $r = 0$ in Lemma 2.6 gives a type-2 approximation, while replacing the $\varepsilon$-subdifferential by a normal one gives a type-3 approximation. We mention that there exist further notions of approximations of the proximal point, e.g. used in [34], and refer to [56, Section 2.2] for a compact discussion.
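To make the hierarchy of the four notions concrete, the following minimal sketch tests them for the scalar function $g(x) = |x|$, whose proximal point is given by soft-thresholding. The function names are hypothetical and chosen for illustration only. The sketch assumes the standard fact that for $g = |\cdot|$ one has $p \in \partial_\varepsilon g(z)$ if and only if $|p| \leq 1$ and $|z| - pz \leq \varepsilon$, and it uses that in one dimension the proximal map is monotone, so the type-3 test reduces to an interval check.

```python
import math

TOL = 1e-12  # small slack for floating-point comparisons


def prox_abs(y: float, tau: float) -> float:
    """Exact proximal point of tau*|.| at y (soft-thresholding)."""
    return math.copysign(max(abs(y) - tau, 0.0), y)


def prox_energy(x: float, y: float, tau: float) -> float:
    """Proximity function G_tau(x) = (x - y)^2 / (2 tau) + |x|."""
    return (x - y) ** 2 / (2.0 * tau) + abs(x)


def is_type0(z: float, y: float, tau: float, eps: float) -> bool:
    # z lies in the sqrt(2 tau eps)-ball around the true proximum
    return abs(z - prox_abs(y, tau)) <= math.sqrt(2.0 * tau * eps) + TOL


def is_type1(z: float, y: float, tau: float, eps: float) -> bool:
    # 0 in the eps-subdifferential of G_tau at z, i.e. eps-optimality of z
    best = prox_energy(prox_abs(y, tau), y, tau)
    return prox_energy(z, y, tau) <= best + eps + TOL


def is_type2(z: float, y: float, tau: float, eps: float) -> bool:
    # p = (y - z)/tau must lie in the eps-subdifferential of |.| at z
    p = (y - z) / tau
    return abs(p) <= 1.0 + TOL and abs(z) - p * z <= eps + TOL


def is_type3(z: float, y: float, tau: float, eps: float) -> bool:
    # z = prox(y + e) for some |e| <= sqrt(2 tau eps); prox is monotone in 1D
    r = math.sqrt(2.0 * tau * eps)
    return prox_abs(y - r, tau) - TOL <= z <= prox_abs(y + r, tau) + TOL


if __name__ == "__main__":
    y, tau, eps = 2.0, 1.0, 0.02  # exact proximum is 1.0
    for z in (1.01, 1.1, 1.3):
        print(z, is_type0(z, y, tau, eps), is_type1(z, y, tau, eps),
              is_type2(z, y, tau, eps), is_type3(z, y, tau, eps))
```

In this toy setting the point $z = 1.1$ passes the type-0, type-1 and type-3 tests but fails the type-2 test, which illustrates that the type-2 notion is the most demanding of the four.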
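Even though we prove our results for different types of approximations, the most interesting one in view of practicability is the approximation of type 2. This is due to the following insights obtained by [56]: Without loss of generality let $g = w \circ B$ with proper, convex and l.s.c. $w : Z \to \mathbb{R}_\infty$ and linear $B : X \to Z$. Then the evaluation of the proximum requires solving
\[ \min_{x \in X} \; \frac{1}{2\tau} \|x - y\|^2 + w(Bx). \tag{2.6} \]
Now if there exists $x_0 \in X$ such that $w$ is continuous at $Bx_0$, the Fenchel-Moreau-Rockafellar duality formula [59, Corollary 2.8.5] states that
\[ \min_{x \in X} G_\tau(x) = - \min_{z \in Z} \left\{ \frac{\tau}{2} \Big\| B^* z - \frac{y}{\tau} \Big\|^2 + w^*(z) - \frac{1}{2\tau} \|y\|^2 \right\}, \]
where we refer to the minimization on the right hand side as the "dual" problem and denote its objective by $W_\tau(z)$. Furthermore we can always recover the primal solution $\hat{x}$ from the dual solution $\hat{z}$ via the relation $\hat{x} = y - \tau B^* \hat{z}$. Most importantly, we obtain that $\hat{x}$ and $\hat{z}$ solve the primal respectively dual problem if and only if the duality gap $G(x, z) := G_\tau(x) + W_\tau(z)$ vanishes, i.e.
\[ 0 = \min_{(x,z) \in X \times Z} G(x, z) = G(\hat{x}, \hat{z}) = G_\tau(\hat{x}) + W_\tau(\hat{z}). \]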
The following result in [56] states that the duality gap can also be used as a criterion to assess admissible type-2 approximations of the proximal point:

Proposition 2.7. Let $z \in Z$ be feasible for the dual problem and let $x \in X$ be the primal variable recovered from $z$ via the relation above. If $G(x, z) \leq \varepsilon$, then $x$ is a type-2 approximation of $\operatorname{prox}_{\tau g}(y)$ with precision $\varepsilon$.

Note that by the above discussion Proposition 2.7 also implies a type-1 and type-0 approximation. Proposition 2.7 has an interesting implication: if one can construct a feasible dual variable $z$ during the solution of (2.6), it is easy to check the admissibility of the corresponding primal variable $x$ as a type-2 approximation by simply evaluating the duality gap. We shall make use of this in the numerical experiments in Section 4.
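The duality-gap criterion can be sketched on a small one-dimensional TV-denoising problem. The example below is an illustration under stated assumptions, not the solver used in the paper: it takes $w = \|\cdot\|_1$ and $B$ the forward-difference operator (so $w^*$ is the indicator of the unit sup-norm ball), runs projected gradient descent on the dual variable, recovers the primal variable as $x = y - \tau B^{T} z$, and stops as soon as the gap, which for this pairing reduces to $\|Bx\|_1 - \langle z, Bx \rangle$, drops below the requested precision. All function names are hypothetical.

```python
from typing import List, Tuple


def diff(x: List[float]) -> List[float]:
    """Forward differences (B x)_i = x[i+1] - x[i]."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def diff_adj(z: List[float], n: int) -> List[float]:
    """Adjoint B^T z of the forward-difference operator."""
    out = [0.0] * n
    for i, zi in enumerate(z):
        out[i] -= zi
        out[i + 1] += zi
    return out


def inexact_tv_prox(y: List[float], tau: float, eps: float,
                    max_iter: int = 100000) -> Tuple[List[float], List[float], float, int]:
    """Approximate prox of tau*||B .||_1 at y up to duality gap eps.

    Returns (x, z, gap, iterations); x is accepted once gap <= eps.
    """
    n = len(y)
    z = [0.0] * (n - 1)            # dual variable, kept in [-1, 1]
    step = 1.0 / (4.0 * tau)       # 1 / (tau * ||B||^2), using ||B||^2 <= 4
    for k in range(max_iter):
        btz = diff_adj(z, n)
        x = [yi - tau * bi for yi, bi in zip(y, btz)]   # primal from dual
        bx = diff(x)
        # duality gap for feasible z: ||Bx||_1 - <z, Bx>
        gap = sum(abs(v) for v in bx) - sum(zi * v for zi, v in zip(z, bx))
        if gap <= eps:
            return x, z, gap, k
        # projected gradient step on the dual objective (its gradient is -Bx)
        z = [min(1.0, max(-1.0, zi + step * v)) for zi, v in zip(z, bx)]
    raise RuntimeError("duality gap did not reach eps within max_iter")


if __name__ == "__main__":
    x, z, gap, iters = inexact_tv_prox([0.0, 1.0, 0.5, 3.0, 2.5], tau=0.5, eps=1e-6)
    print(x, gap, iters)
```

The returned gap plays the role of the precision $\varepsilon$ of the type-2 approximation, so an outer algorithm can tighten `eps` from one outer iteration to the next according to the error decay it requires.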

Inexact primal-dual algorithms
Based on the introduced notions of an inexact proximum we can now prove the convergence of some primal-dual algorithms in the presence of the respective errors. We start with the type-1, -2 and -3 approximations and outline in Appendix A.2 how to handle the (odd and least restrictive) type-0 approximation as well. The convergence analysis in this section is based on a combination of techniques derived in previous works on the topic: similar results on the convergence of exact primal-dual algorithms can be found e.g. in [13,15] and [14], while the techniques to obtain error bounds for the inexact proximum are mainly taken from [55] and [4]. Throughout this section we consider the saddle-point problem
\[ \min_{x \in X} \max_{y \in Y} \; \mathcal{L}(x, y) := \langle y, Kx \rangle + f(x) + g(x) - h^*(y), \tag{3.1} \]
where we make the following assumptions:
1. $K : X \to Y$ is a linear and bounded operator between Hilbert spaces $X$ and $Y$ with norm $L = \|K\|$,
2. $f : X \to \mathbb{R}_\infty$ is proper, convex, lower semicontinuous and differentiable with $L_f$-Lipschitz gradient,
3. $g : X \to \mathbb{R}_\infty$ and $h : Y \to \mathbb{R}_\infty$ are proper, convex and lower semicontinuous,
4. problem (3.1) admits at least one saddle point $(x^\star, y^\star) \in X \times Y$.
It is well-known that taking the supremum over $y$ in $\mathcal{L}(x, y)$ leads to the corresponding "primal" formulation of the saddle-point problem (3.1),
\[ \min_{x \in X} \; f(x) + g(x) + h(Kx), \]
which for a lot of variational problems might be the starting point. Analogously, taking the infimum over $x$ leads to the dual problem. Given an algorithm producing iterates $(x^N, y^N)$ for the solution of (3.1), the goal of this section is to obtain estimates of the form
\[ \mathcal{L}(x^N, y) - \mathcal{L}(x, y^N) \leq \frac{C(x, y, x^0, y^0)}{N^\alpha} \tag{3.2} \]
for $\alpha > 0$ and $(x, y) \in X \times Y$. If $(x, y) = (x^\star, y^\star)$ is a saddle point, the left hand side vanishes if and only if the pair $(x^N, y^N)$ is a saddle point itself, yielding a convergence rate in terms of the primal-dual objective in $O(1/N^\alpha)$. Under very weak additional assumptions one can then also derive estimates e.g. for the error in the primal objective. If for fixed $N$ and $x^N$ the supremum over $y$ in $\mathcal{L}(x^N, y)$ is attained at some $\tilde{y}$, one easily sees that
\[ \sup_{y \in Y} \mathcal{L}(x^N, y) - \sup_{y \in Y} \mathcal{L}(x^\star, y) = \mathcal{L}(x^N, \tilde{y}) - \mathcal{L}(x^\star, y^\star) \leq \mathcal{L}(x^N, \tilde{y}) - \mathcal{L}(x^\star, y^N) \leq \frac{C(x^\star, \tilde{y}, x^0, y^0)}{N^\alpha}, \tag{3.3} \]
giving a convergence estimate for the primal problem.
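In the original versions of primal-dual algorithms (e.g. [13,15]), the authors usually require $h^*$ and $g$ to have a simple structure such that their proximal operators (2.1) can be evaluated sufficiently easily. Instances of such proximal operators are simple projections, such as projections onto norm balls, or shrinkage operators. A particular feature of most of these operators is that they have a closed-form solution and can hence be evaluated exactly. We study the situation where the proximal operators for $g$ or $h^*$ can only be evaluated up to a certain precision in the sense of Section 2, and where the gradient of $f$ may contain errors as well. As opposed to the general iteration of an exact primal-dual algorithm [15],
\[ \hat{y} = \operatorname{prox}_{\sigma h^*}(\bar{y} + \sigma K \tilde{x}), \qquad \hat{x} = \operatorname{prox}_{\tau g}\big(\bar{x} - \tau (K^* \tilde{y} + \nabla f(\bar{x}))\big), \tag{3.4} \]
where $(\bar{x}, \bar{y})$ and $(\tilde{x}, \tilde{y})$ are the previous points and $(\hat{x}, \hat{y})$ are the updated exact points, we introduce the general iteration of an inexact primal-dual algorithm
\[ \check{y} \approx \operatorname{prox}_{\sigma h^*}(\bar{y} + \sigma K \tilde{x}), \qquad \check{x} \approx \operatorname{prox}_{\tau g}\big(\bar{x} - \tau (K^* \tilde{y} + \nabla f(\bar{x}) + e)\big). \tag{3.5} \]
Here the updated points $(\check{x}, \check{y})$ denote the inexact proximal points (as opposed to the exact proxima $(\hat{x}, \hat{y})$), which are only computed up to precision $\varepsilon$ respectively $\delta$, in the sense of a type-2 approximation from Section 2 for $\check{y}$ and a type-$i$ approximation with $i \in \{1, 2, 3\}$ for $\check{x}$. The vector $e \in X$ denotes a possible error in the gradient of $f$. Note that we use two different pairs of input points $(\bar{x}, \bar{y})$ and $(\tilde{x}, \tilde{y})$ in order to include possible intermediate overrelaxed input points. It is clear, however, that we require $\tilde{x}$ to depend on $\check{x}$, respectively $\tilde{y}$ on $\check{y}$, in order to couple the primal and dual updates. At first glance it seems counterintuitive that we allow errors of type 1, 2 and 3 in $\check{x}$, while only allowing for type-2 errors in $\check{y}$. The following general descent rule for the iteration (3.5) sheds some more light on this fact and forms the basis for all the following proofs. It can be derived using simple convexity results and resembles the classical energy descent rules for forward-backward splittings (which can be found in almost any paper on first-order descent methods). It can then be used to obtain estimates on the decay of the objective of the form (3.2). We prove the descent rule for a type-1 approximation of the primal proximum since we always obtain the result for a type-2 or type-3 approximation as a special case.

Lemma 3.1. Let $\tau, \sigma > 0$ and $(\check{x}, \check{y})$ be obtained from $(\bar{x}, \bar{y})$ and $(\tilde{x}, \tilde{y})$ and the updates (3.5) for $i = 1$, i.e. $\check{x}$ is a type-1 approximation of the exact primal proximum $\hat{x}$. Then for every $(x, y) \in X \times Y$ we have
\[ \begin{aligned} \mathcal{L}(\check{x}, y) - \mathcal{L}(x, \check{y}) \leq{} & \frac{\|x - \bar{x}\|^2}{2\tau} + \frac{\|y - \bar{y}\|^2}{2\sigma} - \frac{\|x - \check{x}\|^2}{2\tau} - \frac{\|y - \check{y}\|^2}{2\sigma} \\ & - \frac{1 - \tau L_f}{2\tau} \|\bar{x} - \check{x}\|^2 - \frac{\|\bar{y} - \check{y}\|^2}{2\sigma} \\ & + \langle K(x - \check{x}), \tilde{y} - \check{y} \rangle - \langle K(\tilde{x} - \check{x}), y - \check{y} \rangle \\ & + \Big( \|e\| + \sqrt{2\varepsilon/\tau} \Big) \|x - \check{x}\| + \varepsilon + \delta. \end{aligned} \tag{3.6} \]

Proof. For the inexact type-2 proximum $\check{y}$ we have by Definition 2.4 that $(\bar{y} + \sigma K \tilde{x} - \check{y})/\sigma \in \partial_\delta h^*(\check{y})$, so by the definition of the $\delta$-subdifferential we find
\[ h^*(y) \geq h^*(\check{y}) + \Big\langle \frac{\bar{y} - \check{y}}{\sigma}, y - \check{y} \Big\rangle + \langle K \tilde{x}, y - \check{y} \rangle - \delta. \tag{3.7} \]
For the inexact type-1 primal proximum, from Definition 2.2 and Lemma 2.6 we have that there exists a vector $r$ with $\|r\| \leq \sqrt{2\tau\varepsilon}$ such that
\[ \frac{\bar{x} - \check{x} - r}{\tau} - \big( K^* \tilde{y} + \nabla f(\bar{x}) + e \big) \in \partial_\varepsilon g(\check{x}). \]
Hence we find that
\[ g(x) \geq g(\check{x}) + \Big\langle \frac{\bar{x} - \check{x}}{\tau}, x - \check{x} \Big\rangle - \langle K^* \tilde{y} + \nabla f(\bar{x}), x - \check{x} \rangle - \Big( \|e\| + \sqrt{2\varepsilon/\tau} \Big) \|x - \check{x}\| - \varepsilon, \tag{3.8} \]
where we applied the Cauchy-Schwarz inequality to the error term. Further, by the Lipschitz property of $\nabla f$ we have (cf. [46])
\[ f(x) \geq f(\check{x}) + \langle \nabla f(\bar{x}), x - \check{x} \rangle - \frac{L_f}{2} \|\check{x} - \bar{x}\|^2. \tag{3.9} \]
Now we add the inequalities (3.7), (3.8) and (3.9), expand the inner products $\langle \bar{x} - \check{x}, x - \check{x} \rangle$ and $\langle \bar{y} - \check{y}, y - \check{y} \rangle$ into squared norms, and rearrange to arrive at the result.

We point out that, as a special case, choosing a type-2 approximation for the primal proximum in Lemma 3.1 corresponds to dropping the square root in the estimate (3.6), while choosing a type-3 approximation corresponds to dropping the additional $\varepsilon$ at the end. Also note that the above descent rule is the same as the one in [13,15] except for the additional error terms in the last line of (3.6).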
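The step from the subgradient inequalities to squared norms in the proof of Lemma 3.1 rests on a three-point identity, which is spelled out here as a short worked equation. The accents are a notational assumption of this sketch: $\bar{x}$ denotes the previous primal point, $\check{x}$ the inexact update and $x$ an arbitrary point of $X$.

```latex
% Three-point identity: expand \|x - \bar{x}\|^2 around the update \check{x}.
\begin{align*}
\|x - \bar{x}\|^2
  &= \|(x - \check{x}) - (\bar{x} - \check{x})\|^2 \\
  &= \|x - \check{x}\|^2 - 2\,\langle \bar{x} - \check{x},\, x - \check{x} \rangle
     + \|\bar{x} - \check{x}\|^2 ,
\end{align*}
% hence, after dividing by 2\tau,
\begin{equation*}
\Big\langle \frac{\bar{x} - \check{x}}{\tau},\, x - \check{x} \Big\rangle
  = \frac{\|\bar{x} - \check{x}\|^2}{2\tau}
  + \frac{\|x - \check{x}\|^2}{2\tau}
  - \frac{\|x - \bar{x}\|^2}{2\tau} .
\end{equation*}
```

The same identity with $(\bar{y}, \check{y}, y, \sigma)$ in place of $(\bar{x}, \check{x}, x, \tau)$ handles the dual inner product, and this is where the telescoping distance terms in the descent rule come from.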
Lemma 3.1 has several immediate implications: from the second to last line of (3.6) we can see that we need to have either $\tilde{x} = \check{x}$ or $\tilde{y} = \check{y}$, since otherwise the algorithm would be fully implicit and every iteration as difficult as the original problem (see also [13]). Furthermore, in order to control the errors $e$ and $\varepsilon$ in the last line of (3.6), we need to find a useful bound on $\|x - \check{x}\|$.
We shall obtain this bound using a linear extrapolation in the primal variable $x$ (already used in [13]). However, if we allow e.g. a type-1 approximation also in $\check{y}$, we obtain an additional error term in (3.6) involving $\|y - \check{y}\|$ that we need to bound as well. Even though we shall be able to obtain a bound in most cases, it will be arbitrarily weak due to the asymmetric nature of the primal-dual algorithms used, or come with severe restrictions on the step sizes. This fact will become more obvious from the proofs in the following. We would however like to emphasize that the considered scenarios in particular include the most interesting situation of a type-2 approximation in both the primal and dual proximum, which can be checked for many nested algorithms by evaluating a primal-dual gap (cf. the end of Section 2).

The convex case: no acceleration
We start with a proof for the most basic version of algorithm (3.5), which makes use of a technical lemma taken from [55] that can be found in the appendix. The following inexact primal-dual algorithm corresponds to the choice
\[ (\bar{x}, \bar{y}) = (x^n, y^n), \qquad (\check{x}, \check{y}) = (x^{n+1}, y^{n+1}), \qquad \tilde{x} = 2x^n - x^{n-1}, \qquad \tilde{y} = y^{n+1} \tag{3.10} \]
in algorithm (3.5), i.e. a semi-explicit choice with overrelaxation on the primal variable $x$: in every iteration, $y^{n+1}$ is a type-2 approximation with precision $\delta_{n+1}$ and $x^{n+1}$ a type-$i$ approximation with precision $\varepsilon_{n+1}$ of the respective proximal point in
\[ y^{n+1} \approx \operatorname{prox}_{\sigma h^*}\big(y^n + \sigma K (2x^n - x^{n-1})\big), \qquad x^{n+1} \approx \operatorname{prox}_{\tau g}\big(x^n - \tau (K^* y^{n+1} + \nabla f(x^n) + e^{n+1})\big). \tag{3.11} \]

Theorem 3.2. Let $L = \|K\|$ and choose a small $\beta > 0$ and $\tau, \sigma > 0$ such that $\tau L_f + \sigma\tau L^2 + \tau\beta L < 1$, and let the iterates $(x^n, y^n)$ be defined by Algorithm (3.11) for $i \in \{1, 2, 3\}$. Then for any $N \geq 1$ and where

Proof. Using the choices (3.10) in Lemma 3.1 leads us to The goal of the proof is, as usual, to manipulate this inequality such that we obtain a recursion where most of the terms cancel when summing the inequality. The starting point is an extension of the scalar product on the right hand side: where we used (for $\alpha > 0$) that by Young's inequality for every $x, x' \in X$ and $y, y' \in Y$, and $\alpha = \sigma L + \beta$. This gives (3.15) Now we let $x^0 = x^{-1}$ and sum (3.15) from $n = 0, \dots, N-1$ to obtain Now as before using Young's inequality on the inner product with $\alpha =$ This inequality can now be used to bound the sum on the left hand side as well as $\|x - x^N\|$ by only the initialization $(x^0, y^0)$ and the errors $e^n$, $\varepsilon_n$ and $\delta_n$. We start with the latter and let $(x, y) = (x^\star, y^\star)$ such that the sum on the left hand side is nonnegative, hence (note that the third and fourth sum on the right hand side are negative). We multiply the inequality by $2\tau$ and continue with a technical result by [55, p. 12]. Using Lemma A.1 with we obtain a bound on $\|x^\star - x^N\|$, which depends only on the initialization and the error terms: where we set Since $A_N$ and $B_N$ are increasing we find for all $n \leq N$: This finally gives and bounds the error terms. We now obtain from (3.16) that which gives the assertion using the convexity of $g$, $f$ and $h^*$ and the definition of the ergodic sequences $X^N$ and $Y^N$. More precisely, by Jensen's inequality we have It remains to note that for a type-2 approximation the square root in $A_N$ can be dropped and for a type-3 approximation $B_N = 0$, which gives the different $A_{N,i}$, $B_{N,i}$.
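The behaviour of such a perturbed iteration can be explored on a scalar toy problem. The sketch below is an assumption-laden illustration rather than the paper's experiment: it takes $K = 1$, $f = 0$, $g(x) = \frac{\mu}{2} x^2$ and $h(z) = |z - b|$, performs the dual step first with the extrapolated point $2x^n - x^{n-1}$ and then the primal step, keeps the dual proximum exact, and simulates a type-3 primal error by perturbing the input of the primal proximal map with a summable sequence $\pm c / n^{\alpha}$. The function name and all parameter values are hypothetical.

```python
from typing import Tuple


def clip(v: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, v))


def inexact_primal_dual(b: float = 3.0, mu: float = 0.5,
                        tau: float = 0.5, sigma: float = 1.0,
                        n_iter: int = 2000, c: float = 0.1,
                        alpha: float = 2.0) -> Tuple[float, float]:
    """Perturbed primal-dual iteration for min_x |x - b| + (mu/2) x^2.

    Returns the last primal iterate and the ergodic (averaged) primal iterate.
    Step sizes must satisfy tau * sigma * L^2 < 1 with L = |K| = 1.
    """
    x_prev = x = 0.0
    y = 0.0
    x_sum = 0.0
    for n in range(1, n_iter + 1):
        x_bar = 2.0 * x - x_prev                      # primal overrelaxation
        # exact prox of sigma*h^*, h^*(y) = b*y + indicator(|y| <= 1)
        y = clip(y + sigma * (x_bar - b), -1.0, 1.0)
        # simulated type-3 error: perturb the input of the primal prox
        pert = ((-1) ** n) * c / n ** alpha
        x_prev, x = x, (x - tau * y + pert) / (1.0 + tau * mu)
        x_sum += x
    return x, x_sum / n_iter


if __name__ == "__main__":
    x_last, x_erg = inexact_primal_dual()
    print(x_last, x_erg)  # the exact minimizer for b = 3, mu = 0.5 is x = 2
```

With the summable perturbation used here both the last iterate and the ergodic average approach the minimizer $x = 2$; choosing a smaller `alpha` is a simple way to observe how slower error decay affects the iterates.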
We comment on two particularities of Theorem 3.2 before we establish a convergence result from Theorem 3.2 in case the right hand side of (3.12) converges to zero as the number of iterations $N$ grows.

Remark 3.3. It is clear that, in particular for linear problems or problems with growth 1, one would like to obtain inequality (3.12) for all $(x, y) \in X \times Y$. The issue is that, as can be seen from the proof, one needs a bound on $\|x^n - x\|$ in order to control the errors. The typical situation for linear problems, however, is that the solution $x$ is restricted to a bounded set $D$, such that $f$ and/or $g$ have bounded domain. In this case one can estimate $\|x^n - x\| \leq \operatorname{diam}(D)$, and following the line of the proof (cf. inequality (3.16)) we obtain for all $(x, y) \in X \times Y$ that

Remark 3.4. Another insight can be obtained by realizing that for $(x, y) = (x^\star, y^\star)$ the left hand side of (3.12) actually is a generalized Bregman distance between the ergodic iterates $(X^N, Y^N)$ and the saddle point $(x^\star, y^\star)$

Proof. The argumentation follows the same line as [13] and [15]. Since by assumption $A_{N,i}$ and $B_{N,i}$ are summable, let Now equation (3.17) establishes the boundedness of the sequence $x^N$ for all $N \in \mathbb{N}$, which also implies the boundedness of the ergodic average $X^N$. Note that by the same argumentation as for $x^N$, (3.16) and (3.20) also establish a global bound on $y^N$ and $Y^N$. Hence there exists a subsequence $(X^{N_k}, Y^{N_k})$ weakly converging to a cluster point $(x^*, y^*)$. Then, since $f$, $g$ and $h^*$ are l.s.c. (thus also weakly l.s.c.), we deduce from equation (3.12) that which implies that $(x^*, y^*)$ is a saddle point itself and establishes the first assertion. Now, if $X$ and $Y$ are finite dimensional, the boundedness of $(x^n, y^n)$ gives a subsequence $(x^{n_k}, y^{n_k})$ (strongly) converging to a cluster point $(x^*, y^*)$. Using $(x, y) = (x^*, y^*)$ in (3.16) and the boundedness of the error terms established in (3.18) we find that $\|x^{n-1} - x^n\| \to 0$ and $\|y^{n-1} - y^n\| \to 0$ (note that this is precisely the reason for the introduction of $\beta$ and the strict inequality in $\tau L_f + \tau\sigma L^2 + \tau\beta L < 1$). As a consequence we also have $\|x^{n_k - 1} - x^{n_k}\| \to 0$ and i.e. also $x^{n_k - 1} \to x^*$. Let now $T$ denote the primal update of the exact algorithm (3.4), i.e. $\hat{x}^{n+1} = T(x^n)$, and $T_{\varepsilon_n}$ denote the primal update of the inexact algorithm (3.11), i.e. $x^{n+1} = T_{\varepsilon_n}(x^n)$. Then due to the continuity of $T$ we obtain We apply the same argumentation to $y^n$, which together implies that $(x^*, y^*)$ is a fixed point of the (exact) iteration (3.11) and hence a saddle point of our original problem (3.1). We now use $(x, y) = (x^*, y^*)$ in inequality (3.15) and sum from $n = n_k, \dots, N-1$ (leaving out negative terms on the right hand side) to obtain for It remains to notice that, since $\|e^n\| \to 0$, $\varepsilon_n \to 0$, $\delta_n \to 0$ and by the above observations, the right hand side tends to zero for $k \to \infty$, which implies that also $x^N \to x^*$ and $y^N \to y^*$ for $N \to \infty$.
We finally comment on necessary conditions on the errors $\{\|e^n\|\}$, $\{\varepsilon_n\}$ and $\{\delta_n\}$ to ensure convergence of the algorithm (3.11). As already stated in Theorem 3.5, a sufficient condition to ensure both a decay of (3.12) with the rate $O(1/N)$ and a convergence of the iterates (in finite dimensions) is summability of the errors, i.e. we require $\{\varepsilon_n\}$ and $\{\|e^n\|\}$ to decrease like $O(1/n^{1+\alpha})$ for some $\alpha > 0$. Then Theorem 3.2 establishes the well-known $O(1/N)$ convergence rate of nonaccelerated primal-dual methods also in the presence of errors. It is worth noticing that a faster decay of the errors does not affect the convergence rate of the algorithm, but simply yields a better constant. Another interesting fact is that the decay to zero of the right-hand side of (3.12) still holds if the partial sums $A_{N,i}$, $B_{N,i}$ are only in $o(\sqrt{N})$. This includes cases where the sequences $\{\varepsilon_n\}$, $\{\delta_n\}$ and $\{\|e^n\|\}$ are not summable. If e.g. $\{\varepsilon_n\}$ and $\{\|e^n\|\}$ behave like $O(1/n)$ we obtain that $A_N$ increases like $O(\log(N))$, and we obtain convergence of the primal-dual gap in $O(\log(N)^2/N)$. In this case, however, we lose the boundedness of the iterates, and the convergence of the iterates can no longer be guaranteed by Theorem 3.5. We summarize the results more precisely in the next corollary.
Proof.Under the assumptions of the corollary, if α > 1/2, the sequences { e n }, {ε n } and {δ n } are summable and the error term on the right hand side of equation (3.12) is bounded, hence we obtain a convergence rate of ), which gives the third assertion.
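The interplay between error decay and the resulting rate can be made tangible with a few lines of code. The following sketch is a stand-in only: it uses the plain partial sums $S_N = \sum_{n \leq N} n^{-\alpha}$ as a proxy for the accumulated error terms (the exact definitions of the partial sums in the theorem are not reproduced here) and evaluates the quantity $S_N^2 / N$, which mimics a bound of the form "squared accumulated error divided by $N$". The function names are hypothetical.

```python
import math


def partial_sum(alpha: float, n_terms: int) -> float:
    """S_N = sum_{n=1}^{N} n^(-alpha), a proxy for accumulated errors."""
    return sum(1.0 / n ** alpha for n in range(1, n_terms + 1))


def gap_bound_proxy(alpha: float, n_terms: int) -> float:
    """Proxy S_N^2 / N for a primal-dual gap bound with non-summable errors."""
    return partial_sum(alpha, n_terms) ** 2 / n_terms


if __name__ == "__main__":
    for n_terms in (100, 10000):
        print(n_terms,
              partial_sum(2.0, n_terms),      # summable: stays bounded
              partial_sum(1.0, n_terms),      # harmonic: grows like log(N)
              gap_bound_proxy(1.0, n_terms))  # still decays, like log(N)^2 / N
```

For $\alpha = 2$ the sums stay below $\pi^2/6$, for $\alpha = 1$ they grow logarithmically, and the proxy bound for $\alpha = 1$ still decreases with $N$, in line with the logarithmic rate discussed in the text.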

The convex case: a stronger version
In fact, if we restrict ourselves to type-2 approximations of the proximum, one can state a stronger version for the reduced problem with $f = 0$:
\[ \min_{x \in X} \max_{y \in Y} \; \langle y, Kx \rangle + g(x) - h^*(y), \tag{3.21} \]
again assuming it has at least one saddle point $(x^\star, y^\star)$. We mention again that this formulation comprises a large class of problems, as already outlined in Section 1. We consider the algorithm in which, in every iteration, $y^{n+1}$ and $x^{n+1}$ are type-2 approximations with precision $\delta_{n+1}$ respectively $\varepsilon_{n+1}$ of the proximal points in
\[ y^{n+1} \approx \operatorname{prox}_{\sigma h^*}\big(y^n + \sigma K (2x^n - x^{n-1})\big), \qquad x^{n+1} \approx \operatorname{prox}_{\tau g}\big(x^n - \tau K^* y^{n+1}\big), \tag{3.22} \]
which is the inexact analog of the basic exact primal-dual algorithm presented in [13]. Following their line of proof, we can state the following result:

Theorem 3.7. Let $L = \|K\|$ and $\tau, \sigma > 0$ such that $\tau\sigma L^2 < 1$, and let the sequence $(x^n, y^n)$ be defined by algorithm (3.22). Then for Furthermore, if

Proof. The proof can be done exactly along the lines of [13, Theorem 1] (or along the proof of Theorem 3.2), so we just give the main steps. Letting $f = 0$ and choosing a type-2 approximation gives $L_f = 0$ and lets us drop the term $\big( \|e^{n+1}\| + \sqrt{2\varepsilon_{n+1}/\tau} \big) \|x - x^{n+1}\|$ in inequality (3.13). This is the essential difference, since we do not have to establish a bound on $\|x - x^{n+1}\|$. Choosing $\alpha = \sqrt{\sigma/\tau}$ in Young's inequality and proceeding as before then gives The definition of the ergodic sequences and Jensen's inequality then yield the assertion.
As before we can state convergence of the iterates if the errors {ε n } and {δ n } decay fast enough.The proof is the same as for Theorem 3.5.
Theorem 3.8. Let the sequences $(x^n, y^n)$ and $(X^N, Y^N)$ be defined by (3.22) respectively. If the sequences $\{\varepsilon_n\}$ and $\{\delta_n\}$ in Theorem 3.7 are summable, then every weak cluster point $(x^*, y^*)$ of $(X^N, Y^N)$ is a saddle point of problem (3.21). Moreover, if the dimension of $X$ and $Y$ is finite, there exists a saddle point $(x^*, y^*) \in X \times Y$ such that $x^n \to x^*$ and $y^n \to y^*$.

Remark 3.9. The main difference between Theorem 3.7 and Theorem 3.2 is that inequality (3.23) bounds the left hand side for all $(x, y) \in X \times Y$ and not only for a saddle point $(x^\star, y^\star)$. Following [13, Remark 2], and if $\{\varepsilon_n\}$, $\{\delta_n\}$ are summable, we can state the same $O(1/N)$ convergence of the primal energy, dual energy or the global primal-dual gap under the additional assumption that $h$ has full domain, $g^*$ has full domain or both have full domain. More precisely, if e.g. $h$ has full domain, then it is classical that $h^*$ is superlinear and that the supremum appearing in the conjugate is attained at some $\tilde{y}^N \in \partial h(KX^N)$, which is uniformly bounded in $N$ due to (3.24) Now invoking inequality (3.23) and proceeding exactly along (3.3) we can state that for a primal minimizer $x^\star \in X$ and we have with a constant $C$ depending on $x^0$, $y^0$, $h$ and $\|K\|$. This proves the convergence of the primal energy.
Analogously we can establish the convergence rates for the dual problem and also the global gap.
Remark 3.10. If $h^*$ has bounded domain, e.g. if $h$ is a norm, we can even state "mixed" rates for the primal energy if the errors are not summable. Since in this case $\|y - y^0\| \leq \operatorname{diam}(\operatorname{dom} h^*)$ we may take the supremum over all $y \in \operatorname{dom} h^*$ and obtain The above result in particular holds for the aforementioned TV-$L^1$ model, which we shall consider in the numerical section.

The strongly convex case: primal acceleration
We now turn our focus to possible accelerations of the scheme and consider again the full problem (3.1) with the additional assumption that $g$ is $\gamma$-strongly convex, i.e. for any $x \in \operatorname{dom} \partial g$
\[ g(x') \geq g(x) + \langle p, x' - x \rangle + \frac{\gamma}{2} \|x' - x\|^2 \quad \text{for all } p \in \partial g(x), \; x' \in X. \]
It is a known fact that if $g$ is strongly convex, its convex conjugate $g^*$ has Lipschitz gradient, which usually guarantees the possibility to accelerate the algorithm. We mention that we obtain the same result if $f$ (or both $g$ and $f$) are strongly convex, since it is possible to transfer the strong convexity from $f$ to $g$ and vice versa [15, Section 5]. Hence for simplicity we focus on the case where $g$ is strongly convex. Choosing in algorithm (3.5) we can define an accelerated version of the inexact primal-dual algorithm: We prove the following theorem in Appendix A.3.
Theorem 3.11. Let $L = \|K\|$ and $\tau_n, \sigma_n, \theta_n$ such that Let $(x^n, y^n) \in X \times Y$ be defined by the above algorithm for $i \in \{1, 2, 3\}$. Then for any saddle point $(x^\star, y^\star) \in X \times Y$ of (3.1) and we have that and where As a direct consequence of Theorem 3.11 we can state convergence rates of the accelerated algorithm (3.27) in dependence on the errors $\{\|e^n\|\}$, $\{\delta_n\}$ and $\{\varepsilon_n\}$.

Corollary 3.12. Let $\tau_0 = 1/(2L_f)$, $\sigma_0 = L_f/L^2$ and $\tau_n$, $\sigma_n$ and $\theta_n$ be given by (3.27). Let $\alpha > 0$, $i \in \{1, 2, 3\}$ and Then

Proof. In [13] it has been shown that with this choice we have $\tau_n \sim 2/(n\gamma)$. Since the product $\tau_n \sigma_n = \tau_0 \sigma_0 = 1/(2L^2)$ stays constant over the course of the iterations, this implies that $\sigma_n \sim (n\gamma)/(4L^2)$, from which we directly deduce that $T_N \sim (\gamma N^2)/(8L_f)$, hence $T_N = O(N^2)$. Moreover we find that $\sqrt{\tau_N/\sigma_N} \sim (\sqrt{8} L)/(\gamma N)$. Now let $i = 1$ and $\alpha \in (0, 1)$, then we have By analogous reasoning we find $B_{N,1} = O(N^{2-2\alpha})$. Summing up we obtain that yielding the last row of the assertion. For $\alpha = 1$ we see that $\sqrt{\tau_N/\sigma_N}\, A_{N,1}$ is finite and $B_{N,1} = O(\log(N))$; for $\alpha > 1$ also $B_{N,1}$ is summable, implying the other two convergence rates. It then remains to notice that the cases $i \in \{2, 3\}$ can be obtained as special cases.
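The asymptotics used in the proof of Corollary 3.12 can be checked numerically. The update rule (3.27) is not reproduced in the text above, so the sketch below assumes the common acceleration rule $\theta_n = 1/\sqrt{1 + \gamma\tau_n}$, $\tau_{n+1} = \theta_n \tau_n$, $\sigma_{n+1} = \sigma_n/\theta_n$, which keeps the product $\tau_n \sigma_n$ constant and is consistent with $\tau_n \sim 2/(n\gamma)$. The function name and the parameter values are hypothetical.

```python
import math
from typing import List, Tuple


def accelerated_steps(L: float, L_f: float, gamma: float,
                      n_steps: int) -> Tuple[List[float], List[float], List[float]]:
    """Step sizes under the ASSUMED rule theta_n = 1/sqrt(1 + gamma*tau_n)."""
    tau = 1.0 / (2.0 * L_f)        # tau_0 as in Corollary 3.12
    sigma = L_f / L ** 2           # sigma_0 as in Corollary 3.12
    taus, sigmas, thetas = [tau], [sigma], []
    for _ in range(n_steps):
        theta = 1.0 / math.sqrt(1.0 + gamma * tau)
        tau, sigma = theta * tau, sigma / theta   # product tau*sigma is preserved
        taus.append(tau)
        sigmas.append(sigma)
        thetas.append(theta)
    return taus, sigmas, thetas


if __name__ == "__main__":
    L, L_f, gamma, N = 2.0, 1.0, 1.0, 10000
    taus, sigmas, _ = accelerated_steps(L, L_f, gamma, N)
    print(taus[-1] * sigmas[-1], 1.0 / (2.0 * L ** 2))           # constant product
    print(N * gamma * taus[-1] / 2.0)                             # tends to 1
    print(math.sqrt(taus[-1] / sigmas[-1]) * gamma * N / (math.sqrt(8.0) * L))
```

Under this assumed rule the three printed ratios confirm the constant product $1/(2L^2)$, the decay $\tau_N \sim 2/(N\gamma)$ and the behaviour $\sqrt{\tau_N/\sigma_N} \sim \sqrt{8} L/(\gamma N)$ used in the proof.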

The strongly convex case: dual acceleration
This section is devoted to the comparison of inexact primal-dual algorithms and the inexact forward-backward splittings established in [55,56,4], considering the problem with $h$ having a $1/\gamma$-Lipschitz gradient and proximable $g$. The above-mentioned works establish convergence rates for an inexact forward-backward splitting (or proximal gradient method) on this problem, where both the computation of the proximal operator with respect to $g$ and the gradient of $h$ might contain errors ([56] only considers errors in the proximum). They show that using a type-1 approximation [55], respectively a type-2 approximation [56], of the proximum, essentially the same convergence rates for the decrease of the objective (3.28) as for the error-free version can be obtained. Subsequently the authors of [4] also state convergence of the iterates for an inexact FISTA-type forward-backward splitting in the spirit of [11] for both a type-1 and a type-2 approximation. Due to the strong convexity of $h^*$ we know that the algorithm can be accelerated "à la" [13,15] or as in the previous section, and we shall be able to essentially recover the results on inexact forward-backward splittings/inexact FISTA obtained by [55,56,4]. Choosing (note $f = 0$ and $e = 0$) in algorithm (3.5) we can define an accelerated version of the inexact primal-dual algorithm: $\operatorname{prox}_{\sigma_n h^*}\big(y^n + \sigma_n K(x^n + \theta_n (x^n - x^{n-1}))\big)$ We prove the following theorem in Appendix A.4.
Theorem 3.13. Let $L = \|K\|$ and $\tau_n, \sigma_n, \theta_n$ such that Let $(x^n, y^n) \in X \times Y$ be defined by the above algorithm for $i \in \{1, 2, 3\}$. Then for a saddle point $(x^*, y^*) \in X \times Y$ and we have that , and , where We can once more establish convergence rates depending on the decay of the errors.
Corollary 3.14 essentially recovers the results given in [55,56,4], though the comparison is not entirely straightforward. For an optimal $O(N^{-2})$ convergence in objective with a type-1 approximation, the authors of [55] require $\varepsilon_n = O(1/n^{4+\kappa})$ for any $\kappa > 0$; for the error $d_n$ in the gradient of $h \circ K$ they need $d_n = O(1/n^{4+\kappa})$. Since a type-2 approximation of the proximum is more demanding, the authors of [56] obtain a weaker dependence of the convergence on the error and only require $\varepsilon_n = O(1/n^{3+\kappa})$. Note that they only consider the case $d_n = 0$. The work in [4] essentially refines both results under the same assumptions on the errors. Corollary 3.14 now states that for an optimal $O(N^{-2})$ convergence we require $\varepsilon_n = O(1/n^{3+\kappa})$ in case of a type-1 approximation and $\varepsilon_n = O(1/n^{2+\kappa})$ in case of an error of type 2, which seems to be one order less than the other results. We do not have a precise mathematical explanation at this point. The main difference appears to be the changing step sizes $\tau_n$, $\sigma_n$ in the proximal operators for the inexact primal-dual algorithm in Theorem 3.13, which behave like $n$ respectively $1/n$, while the step sizes remain fixed for inexact forward-backward. The numerical section, however, indeed confirms the weaker dependence of the inexact primal-dual algorithm on the errors.

Remark 3.15. We want to highlight that, in the spirit of Section 3.2, it is also possible to state a stronger version in case the approximations are of type 2 in both the primal and the dual proximal point, which then bounds the "gap" for all $(x, y) \in X \times Y$ instead of for a saddle point $(x^*, y^*)$ as in Theorem 3.13 (cf. inequality (A.9)): Under some additional assumptions we can then again derive estimates on the primal energy for every fixed $N \in \mathbb{N}$.
If again $h$ has full domain, the supremum appearing in the conjugate is attained at some $\tilde{y}^N$ and exactly along (3.3) we derive In case the errors are summable we again obtain that also $\tilde{y}^N$ is globally bounded (cf. Remark 3.9) and we obtain convergence in $O(1/N^2)$. If the errors are not summable there is no similar argument to obtain the global boundedness of the $\tilde{y}^N$; however, at least on a heuristic level one can expect a convergence to $y^*$ at a similar rate as $X^N$ such that the above bound is useful. This is indeed confirmed in the numerical section, where we observe the $O(N^{-2\alpha})$ decay from Corollary 3.14 also for the primal objective for nonsummable errors.

The smooth case
For the sake of completeness we also discuss a final accelerated version of the primal-dual algorithm, where $g$ and $h^*$ are $\gamma$- respectively $\mu$-strongly convex. In this setting the primal objective is both smooth and strongly convex, and it is well known that first-order algorithms can be accelerated to linear convergence in this case. We consider the algorithm $\operatorname{prox}_{\sigma h^*}\big(y^n + \sigma K(x^n + \theta (x^n - x^{n-1}))\big)$ and prove the following result in Appendix A.5.

Theorem 3.16. Let $L = \|K\|$ and $\tau, \sigma, \theta$ such that Let $(x^n, y^n) \in X \times Y$ be defined by algorithm (3.31) for $i \in \{1, 2, 3\}$. Then for the unique saddle point $(x^*, y^*)$ and As before, we can now state convergence rates if the decay of the errors is also linear. This leads us to the following corollary:

Corollary 3.17. Let $\alpha > 0$, $i \in \{1, 2, 3\}$ and for $0 < q < 1$

Proof. It is clear that we need to investigate the decay of the term to obtain a convergence rate. In view of the specific form of $A_{N,i}$ and $B_{N,i}$ and the rate of $\varepsilon_n$, $\delta_n$ and $e^n$ we consider For $A_{N,i}$ we note that the factor $\theta^N$ is squared, as opposed to the factor of $B_{N,i}$, which implies that the decay of $e^n$ and $\sqrt{\varepsilon_n}$ can be less restrictive for $A_{N,i}$ and explains the square root on the constant $q$ for $e^n$. Due to the square root on $q$ we have to distinguish whether $\sqrt{q} < \theta$ or $\sqrt{q} > \theta$. In the former case we have by Equation (3.33), now with $\sqrt{q}$ instead of $q$, that while in the latter we obtain $\theta^{2N} A_{N,i}^2 = O(q^N) = O(\theta^N)$, which in sum gives $C_{N,i} = O(\theta^N)$. If $\theta < q < 1$, we have by analogous argumentation and (3.33) that $\theta^N B_{N,i} = O(q^N)$ and since $\theta < q < \sqrt{q} < 1$ also It remains to give some explicit formulation of the step sizes that fulfill the conditions (3.32). Solving (3.32) for $\tau$, $\sigma$ and $\theta$ gives [15].
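The case distinction in the proof rests on an elementary geometric-sum estimate, which can be sketched as follows. We assume here, purely for illustration, that the error sum has the generic form $B_N = \sum_{n=1}^{N} \theta^{-n} \varepsilon_n$ with $\varepsilon_n \le c\, q^n$; this form and the constant $c$ are assumptions and are not copied from the statement of Theorem 3.16.

```latex
\theta^N B_N \;\le\; c\,\theta^N \sum_{n=1}^{N} \Big(\frac{q}{\theta}\Big)^{n}
\;\le\;
\begin{cases}
  c\,\dfrac{q}{q-\theta}\; q^N,        & \theta < q < 1, \\[2ex]
  c\,N\,\theta^N,                      & q = \theta, \\[1ex]
  c\,\dfrac{q}{\theta-q}\;\theta^N,    & 0 < q < \theta .
\end{cases}
```

Under this assumption the linear rate $\theta^N$ survives whenever the errors decay faster than $\theta^N$, while a slower geometric decay with $q > \theta$ dictates the rate $q^N$.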

Numerical experiments
There exists a large variety of interesting optimization problems, e.g. in imaging, that could be investigated in the context of inexact primal-dual algorithms, and even creating numerical examples for all the discussed notions of inexact proxima and different versions of algorithms clearly goes beyond the scope of this paper. Instead, we want to discuss two different questions on two classical imaging problems and leave further studies to the interested reader. First of all, we want to confirm numerically that the convergence rates we proved above are "sharp" in some sense, meaning that if the errors are close to the upper bounds we obtain the convergence rates predicted by the theory. As an example, taking the (non-accelerated) algorithm from Section 3.2 and a decay of the errors $\delta_n = \varepsilon_n = O(n^{-\alpha})$, the theory states a convergence in $O(N^{-\alpha})$, and indeed we shall observe this result. As already announced in the introduction, the second question we want to answer is whether one can actually benefit from the theory and employ different splitting strategies in order to obtain nested algorithms, which can then only be solved in an inexact fashion (cf. [56]). We investigate both questions using problems of the form where we assume that the proximal operators of both $g$ and $h^*$ (or $g^*$ and $h$ by Moreau's decomposition) have an exact closed-form solution. The formulation on the right hand side of (4.1) leads to a nested inexact primal-dual algorithm Hence the dual proximal operator can be evaluated exactly (i.e. $\delta_n = 0$), while the inner subproblem does not have a closed-form solution and hence has to be computed in an inner loop up to the necessary precision $\varepsilon_n$. We choose the type-2 approximation since in this case, according to Proposition 2.7, the precision of the proximum can be assessed by means of the duality gap. In order to be able to evaluate the gap, we solve the dual problem of the $1/\tau$-strongly convex formulation given in (4.2), i.e.
using FISTA [7]. To distinguish between outer and inner problems for the splittings we shall always denote the iteration number of the outer problem by $n$, while the iteration number of the inner problem is $k$. In order to achieve the necessary precision, we iterate the proximal problem until the primal-dual gap (cf. also Section 2) satisfies where for the last experiment. According to [56, Theorem 6.1], the convergence rate of the gap $G$ using FISTA is in $O(1/k)$, although the energy converges like $O(1/k^2)$.
While the constant $C$ of the rate does not matter for the asymptotic results we proved in the previous section, it does matter in practice. In order to use Proposition 2.7 as a criterion, $C$ should somehow correspond to the "natural" size of the duality gap of (4.2). In order to choose the constraint not too restrictive but still active, we follow [56] and choose $C = G(y^0 - \tau B^* y^0, 0)$, which is the duality gap of the first proximal subproblem for $n = 1$ evaluated at $z = 0$.
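The stopping rule described above (iterate the inner problem until its duality gap falls below $\varepsilon_n = C/n^\alpha$, with $C$ taken as the gap of the first subproblem at $z = 0$) can be sketched on a toy problem. The Python example below is a minimal illustration under several assumptions: it uses a one-dimensional total variation proximal problem $\min_x \|x - v\|^2/(2\tau) + \|Dx\|_1$, plain projected gradient ascent on the dual instead of FISTA, and a fixed input $v$ in place of a changing outer iterate. All function names are hypothetical.

```python
import numpy as np

def D(x):
    """Forward differences, mapping R^n to R^(n-1)."""
    return x[1:] - x[:-1]

def D_adj(z):
    """Adjoint of D, mapping R^(n-1) to R^n."""
    out = np.zeros(len(z) + 1)
    out[:-1] -= z
    out[1:] += z
    return out

def duality_gap(v, z, tau):
    """Gap of min_x |x - v|^2 / (2 tau) + |D x|_1 at x = v - tau D^T z."""
    x = v - tau * D_adj(z)
    primal = np.sum((x - v) ** 2) / (2.0 * tau) + np.sum(np.abs(D(x)))
    dual = np.dot(D(v), z) - 0.5 * tau * np.sum(D_adj(z) ** 2)
    return primal - dual

def inexact_tv_prox(v, tau, eps, z0=None, max_iter=100000):
    """Iterate on the dual variable until the duality gap is at most eps."""
    z = np.zeros(len(v) - 1) if z0 is None else z0.copy()
    t = 1.0 / (4.0 * tau)  # step 1 / (tau * |D|^2), using |D|^2 <= 4
    k = 0
    while duality_gap(v, z, tau) > eps and k < max_iter:
        x = v - tau * D_adj(z)
        z = np.clip(z + t * D(x), -1.0, 1.0)  # projected gradient ascent
        k += 1
    return v - tau * D_adj(z), z, k

# Toy outer loop: tolerance eps_n = C / n**alpha, warm start of the dual variable.
v = np.concatenate([np.zeros(10), np.ones(10)]) + 0.1 * np.sin(np.arange(20))
tau, alpha = 0.2, 1.5
C = duality_gap(v, np.zeros(len(v) - 1), tau)  # gap of first subproblem at z = 0
z, inner_counts = None, []
for n in range(1, 6):
    x, z, k = inexact_tv_prox(v, tau, C / n ** alpha, z0=z)
    inner_counts.append(k)
print(inner_counts)
```

With this choice of $C$ the tolerance is met without any inner iteration at $n = 1$ and becomes active from $n = 2$ on, which mirrors the intention of choosing the constraint "not too restrictive but still active".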
For the sake of brevity we discuss only three problems: we start with the non-differentiable TV-L 1 model for deblurring, a problem which cannot be accelerated, and continue with "standard" TV-L 2 deblurring, which also serves as a prototype for a manifold of applications with a general operator instead of a blurring kernel (cf.e.g.[54,28]).Since in this case the objective is Lipschitz-differentiable, the convex conjugate is strongly convex, which allows to accelerate the algorithm.The third problem we investigate is a "smoothed" version of the TV-L 2 model, which can be accelerated to linear convergence.
We investigate two different setups: as already announced above, we want to confirm the convergence rates predicted by the theory numerically. We hence require the inexact proximal problem (4.2) to be solved with an error close to the accuracy level $\varepsilon_n$. To achieve this we deliberately solve the inner problem suboptimally where necessary, meaning that we use a cold start and reduced step sizes $t_k$ for the inner problem, ensuring that the inner problem is not "accidentally" solved at a higher precision. We shall see that this is indeed necessary for the "slow" TV-$L^1$ problem. In a second setup we investigate whether the obtained error bounds can also be used as a criterion to ensure (optimal) convergence of the nested algorithm (4.2). As observed in e.g. [6] for the TV-$L^2$ model and the FISTA algorithm, insufficient precision of the inner proximum can cause the algorithm to diverge. Instead of performing a fixed high number of inner iterations as a remedy, we solve the inner problem only up to precision $\varepsilon_n$ in every step, which by the theory ensures that the algorithm converges with the same rate as the decay of the errors. We now use the best possible step sizes $t_k$ and a warm start strategy in order to minimize the computational costs of the inner loop; it has already been stated in [56] that this strategy significantly speeds up the process. We use a standard primal-dual reconstruction (PDHG) after $10^5$ iterations as a numerical "ground truth" $u^*$ to compute the optimal energy $F^* = F(u^*)$.

Nondifferentiable deblurring with the TV-L 1 model
In this section we study the numerical solution of the TV-$L^1$ model with a discrete blurring operator $A : X \to X$. As already outlined in the introduction, there exists a variety of methods to solve the problem (e.g. [16,33,27,58]), most of which make use of the fact that the operator $A$ can be written as a convolution. We use a simple strategy which does not rely on the structure of the operator and is hence also applicable to operators other than convolutions. Due to the nondifferentiability of both the data term and the regularizer, a very simple approach is to dualize both terms in the primal-dual formulation (similar to ADMM [9] or 'PD-Split' in [14]): where $P_\lambda$ denotes the convex set $P_\lambda = \{x \in X \mid \|x\|_\infty \le \lambda\}$. One can then employ a standard primal-dual method (PDHG [13]) which reads Unfortunately one can observe that whenever there is no explicit primal term in the formulation of the problem, the energy tends to oscillate and convergence can be quite slow (even though of course in $O(1/N)$, cf. Figure 1(b)). As an alternative we propose to split the problem differently and operate on the following primal-dual formulation: We employ algorithm (3.22), i.e. the non-accelerated basic inexact primal-dual algorithm (iPD) with type-2 errors, and obtain Note that the dual proximum in this case can be evaluated error-free.
As a numerical study we perform deblurring on MATLAB's Lily image in $[0, 1]$ with resolution $256 \times 192$, which has been corrupted by a Gaussian blur of approximately 12 pixels full width at half maximum (where we assume a pixel size of 1) and 50 percent salt-and-pepper noise, i.e. 50 percent of the pixels have been randomly set to either 0 or 1. Furthermore, we performed power iterations to determine the operator norm of $(A, \nabla)$ as $L \approx \sqrt{8}$ and set $\sigma = \tau = 0.99/\sqrt{8}$ for (PDHG). For (iPD) $L$ can be determined analytically as $L = \|A\| = 1$, hence $\tau = \sigma = 0.99$ for (iPD).
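The power iteration mentioned above can be sketched in a few lines of Python. The example below is an illustration under simplifying assumptions: it only treats the forward-difference gradient on a small grid (no blurring operator $A$ is included), and the names `grad`, `grad_adjoint` and `operator_norm` are hypothetical. For the gradient alone the estimate stays below the well-known bound $\sqrt{8}$ and approaches it as the grid grows.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of a 2D array, stacked as (2, rows, cols)."""
    g = np.zeros((2,) + u.shape)
    g[0, :-1, :] = u[1:, :] - u[:-1, :]
    g[1, :, :-1] = u[:, 1:] - u[:, :-1]
    return g

def grad_adjoint(p):
    """Adjoint of grad (the negative discrete divergence)."""
    out = np.zeros(p.shape[1:])
    out[:-1, :] -= p[0, :-1, :]
    out[1:, :] += p[0, :-1, :]
    out[:, :-1] -= p[1, :, :-1]
    out[:, 1:] += p[1, :, :-1]
    return out

def operator_norm(K, K_adj, shape, num_iters=300, seed=0):
    """Estimate |K| by power iteration on K^T K; returns a lower bound."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x = K_adj(K(x))
        x /= np.linalg.norm(x)
    return np.linalg.norm(K(x))  # |K x| with |x| = 1

L_est = operator_norm(grad, grad_adjoint, (32, 32))
tau = sigma = 0.99 / L_est  # then tau * sigma * L_est**2 < 1
print(L_est, np.sqrt(8.0))
```

Since the returned value is a lower bound on the true norm, a safety factor such as the 0.99 above (or a few more iterations) is advisable before using it in the step size condition.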
At first, as already announced above, we want to confirm the convergence rates predicted by the theory numerically. We hence require the inexact proximal problem (4.5) to be solved with an error close to the accuracy level $\varepsilon_n$ and use a cold start and reduced step sizes $t_k$ for the inner problem. We choose $\varepsilon_n = O(n^{-\alpha})$, in which case the theory of Section 3.2 states a decay of the energy in $O(N^{-\alpha})$. The results can be found in Figure 1(a). One can easily observe that the decay of the relative objective is almost exactly as predicted: with higher $\alpha$ it approaches $O(N^{-1})$; in fact, for summable errors it even seems a little better.
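Decay rates such as those read off from the log-log plots can be quantified by a least-squares fit of $\log(\text{error})$ against $\log(N)$. The following Python sketch shows one way to do this; the function name `empirical_rate`, the `burn_in` parameter and the synthetic error sequence are assumptions made for illustration and do not reproduce the data behind Figure 1.

```python
import numpy as np

def empirical_rate(errors, burn_in=10):
    """Fit errors[N-1] ~ c * N**(-rate) in log-log scale and return rate.

    The first `burn_in` iterations are skipped, since the asymptotic
    regime is usually not yet reached there.
    """
    errors = np.asarray(errors, dtype=float)
    n = np.arange(1, len(errors) + 1, dtype=float)
    slope, _ = np.polyfit(np.log(n[burn_in:]), np.log(errors[burn_in:]), 1)
    return -slope

# Synthetic relative objective errors decaying like N**(-0.7).
N = np.arange(1, 1001, dtype=float)
errors = 3.0 * N ** (-0.7)
print(empirical_rate(errors))
```

Applied to a recorded sequence of relative objective errors, the fitted exponent can be compared directly with the predicted $\alpha$.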
In the second setup we investigate whether the obtained error bounds can also be used as a criterion to ensure (optimal) convergence of the nested algorithm (4.5). We now use a warm start and optimal step sizes $t_k$ for the proximal subproblem. The results for varying parameter $\alpha$ can be found in Figure 1(b). Interestingly, for this problem the error bounds from the theory are indeed too pessimistic or, vice versa, the TV-$L^1$ problem is "easier" than expected. As can be observed in Figure 1(b), the convergence rate for all choices of $\alpha$ tends towards $O(1/N)$, with slight advantages for higher $\alpha$, while the number of required inner iterations $k$ (Figure 1(c) and (d)) to reach the necessary precision is remarkably low. In fact, performing just a single inner iteration in every step of the algorithm resulted in an $O(1/N)$ convergence rate (cf. also Figure 1(d)). Actually, the required number of inner iterations even decreases over the course of the outer iterations, which suggests that the dual variable of the inner problem "converges" as well. Note that this does not contradict the theoretical findings of this paper, quite the contrary: while the first study clearly confirms that the proved worst-case estimates are attained in the worst case, the second implies that in practice one might perform far better. More precisely, the given strategy provides a sufficient upper bound for the errors which ensures convergence, which does not imply that it is necessary. The same is true for the criterion of [56]: a small gap implies that the type-2 error is within the worst-case bounds, but it might as well be smaller. Interestingly, the same behavior of the TV-$L^1$ model (however without an operator $A$) has been observed in [13], where the problem seemed to converge like $O(1/N^2)$ in the end, even though only $O(1/N)$ can be theoretically guaranteed.

Differentiable deblurring with the TV-L 2 model
The second problem we investigate is the TV-$L^2$ model for image deblurring Again, the easiest approach to solve problem (4.6) is to write down a primal-dual formulation $\min_{u \in X} \max_{y_1 \in X,\, y_2 \in Y}$ Since the above problem is not strongly convex in $y_2$ it cannot be accelerated, so a basic primal-dual algorithm [13] (PDHG) for the solution reads We need to choose the step sizes $\tau$ and $\sigma$ such that $\tau \sigma L^2 \le 1$, where $L$ denotes the norm of the operator $(A, \nabla)$. We remark that, due to the special relation between the Fourier transform and a convolution, the same problem can be solved without dualizing the data term, since the primal proximal operator admits a closed-form solution [13]. The problem, however, stays non-strongly convex, and in order to keep this a general prototype for $L^2$-type problems, we do not use this formulation.
The "inexact" approach instead operates on a different primal-dual formulation given by min which is now 1-strongly convex in $y$ and can be accelerated. Using the inexact primal-dual algorithm from Section 3.4 leads to the following algorithm with $\tau_n$, $\sigma_n$, $\theta_n$ as given in Theorem 3.13. We again perform deblurring on MATLAB's Lily image in $[0, 1]$ with resolution $256 \times 192$, which has been corrupted by a Gaussian blur of approximately 12 pixels full width at half maximum (where we assume a pixel size of 1), and in this case Gaussian noise with standard deviation $s = 0.01$ and zero mean. We allow errors of the size $\varepsilon_n = C/n^{2\alpha}$, which by Corollary 3.14 should result in an $O(N^{-2\alpha})$ rate for $\alpha \in (0, 1)$, respectively $O(N^{-2})$ for $\alpha > 1$. The results can be found in Figure 2. In contrast to the TV-$L^1$ problem, in this experiment it was not necessary to employ a cold start strategy and reduced step sizes for the inner problem in order to obtain the "worst case" rates. Instead, also for a warm start and the best possible step sizes for the inner problem, the bounds for the gap (4.3) were active for all choices of $\alpha$. Figure 2 shows the error in relative objective for the ergodic sequence $U^N$ (a) and the iterates $u^n$ (b) for increasing $\alpha$. It can be observed that the rate of the decay is almost exactly the one predicted, while not surprisingly the iterates themselves decay even a little faster than the ergodic sequence. The number of inner iterations necessary to obtain the required precision of the proximum is unsurprisingly higher than in the non-accelerated case, even though it stays reasonable for rather low outer iteration numbers. It is worth mentioning that of course for $\alpha \le 0.5$ the convergence rates coincide with those of the non-accelerated version.

Smooth deblurring with the TV-L 2 model
The last problem we consider is a smoothed version of the TV-$L^2$ model from the previous experiments: for small $\gamma$, with primal-dual formulation $\min_{u \in X} \max_{y_1 \in X,\, y_2 \in Y}$ Since the above problem is $\gamma$-strongly convex in $u$ (note that it is also $L_f = \gamma$-Lipschitz differentiable in the primal), a possible accelerated primal-dual algorithm [15] (PDHGacc) for the solution reads with $\tau_n$, $\sigma_n$, $\theta_n$ given by Theorem 3.11 (see also [15]). We choose $\tau_0 = 0.99/L$, $\sigma_0 = (1 - \tau_0 L_f)/(\tau_0 L^2)$ such that $\tau_0 L_f + \tau_0 \sigma_0 L^2 = 1$ as required, with $L = \|(A, \nabla)\| \approx \sqrt{8}$ (see also the previous section). We remark that the differentiable primal term involving $\gamma$ could also be handled implicitly, leading to a linear proximal step instead of the explicit evaluation of the gradient. In our experiments, however, this did not substantially affect the results. In the spirit of the previous experiments we employ a different splitting also on this problem: (4.9)
accelerated PD version has barely reached 1e−2. It should however be mentioned that also (PDHGacc) reaches the $O(N^{-2})$ rate soon after these 250 iterations. Figure 3(c) shows the price we pay for the inner loop, i.e. the number of inner iterations which is necessary over the course of the 250 outer iterations. As one would expect for linear convergence, the number of inner iterations explodes for high outer iteration numbers, which substantially slows down the algorithm. However, the algorithm reaches an error of 1e−6 in relative objective already after approximately 100 iterations, in which case the number of inner iterations is still remarkably low (around 10-20), which makes the approach viable in practice. This could be particularly interesting for problems with a very costly operator $A$, where the tradeoff between outer and inner iterations is high.

Conclusion and Outlook
In this paper we investigated the convergence of the class of primal-dual algorithms developed in [50,13,15] under the presence of errors occurring in the computation of the proximal points and/or gradients. Following [55,56,4] we studied several types of errors and showed that under a sufficiently fast decay of these errors we can establish the same convergence rates as for the error-free algorithms. More precisely, we proved the (optimal) $O(1/N)$ convergence to a saddle point in finite dimensions for the class of nonsmooth problems considered in this paper, and proved an $O(1/N^2)$ or even linear $O(\theta^N)$ convergence rate for partly smooth respectively entirely smooth problems. We demonstrated both the performance and the practical use of the approach on the example of nested algorithms, which can be used to split the global objective more efficiently in many situations. A particular example is the nondifferentiable TV-$L^1$ model, which can be solved very easily by our approach.

A few questions remain open for the future. A very practical one is whether one can use the idea of nested algorithms to (heuristically) speed up the convergence of real-life problems which cannot be accelerated directly, such as TV-type methods in medical imaging. As demonstrated in the numerical section, using an inexact primal-dual algorithm one can often "introduce" strong convexity by splitting the problem differently and hence obtain the possibility to accelerate. This can be particularly interesting for problems with operators of very different costs, where the trade-off between inner and outer iterations is high and hence a lot of inner iterations are still feasible. Along the same lines, it would furthermore be interesting to combine the convergence results for inexact algorithms with stochastic approaches as done in [12], which are also designed to speed up the convergence in this particular situation and could provide an additional boost. Another point to investigate is whether one can combine the inexact approach with linesearch and variable metric strategies similar to [8].
Proof.We can easily verify the assertion by dropping f and simply interchanging the roles of x and y (and thus τ and σ) in Theorem 3.2.
As for Theorem 3.
which implies strong convergence of XN to XN with the same rate as the decrease of the objective.Hence we can essentially handle the situation of a type-0 approximation by the same means as before.The major difference is still that none of the xn need to be feasible, which could impose problems in practice.Since type-0 approximations are the weakest among the introduced notions, they should technically impose the least restrictive error criteria.It however is an open question how to check x − x ≤ √ 2τ ε effectively.It is easy to see that the duality gap bounds this quantity, in which situation Proposition 2.7 "unfortunately" states that x is already a stronger type-2 approximation.Hence it remains to find a different criterion for the precision of a type-0 approximation to make this approach feasible in practice.

A.3 Proof of Theorem 3.11
Proof. Using Lemma 3.1, we proceed exactly as in the proof of Theorem 3.2 (now only including the $\gamma$-strong convexity of $g$ as well as $\tau = \tau_n$, $\sigma = \sigma_n$ and introducing $\theta_n$), to arrive at the basic inequality The goal of the proof is, again, to manipulate this inequality such that we obtain a recursion where most of the terms cancel when summing the inequality. For the sake of clarity let us denote In order to get a useful recursion in the first line it is clear that we require such that we obtain the estimate For a useful recursion for the second line we expand and compute (cf. Equation (3.14) with now $\alpha = \sigma_n \theta_n L$) where we used (A.4) such that $\sigma_n \theta_n = \sigma_{n-1}$. We note that since Putting everything together and rearranging (note that the terms $\|y^{n+1} - y^n\|^2/(2\sigma_n)$ cancel and $\sigma_{n+1}/\sigma_n = 1/\theta_{n+1}$) we arrive at We multiply the inequality by $\sigma_n/\sigma_0$ to reveal the recursion and sum from $n = 0, \dots, N-1$: Now, as above, we use that This equation can now be used as before to bound all terms on the left hand side. Again for a saddle point $(x^*, y^*) \in X \times Y$ the sum is nonnegative, hence we obtain the inequality: For the sake of readability let us denote Then as before with Lemma A.1 we find, .
Since $A_N$ and $B_N$ are increasing we have for all $n \le N$ Now invoking equation (A.5) we obtain The convexity of $(\xi, \zeta) \mapsto \mathcal{L}(\xi, y^*) - \mathcal{L}(x^*, \zeta)$ and the definition of the ergodic averages yields the assertion (cf. the proof of Theorem 3.13). The estimate on $\|x^* - x^N\|^2$ follows analogously. It remains to note that for a type-2 approximation the square root in $A_N$ can be dropped and for a type-3 approximation $B_N = 0$, which gives the different $A_{N,i}$, $B_{N,i}$.
A.4 Proof of Theorem 3.13 Proof.We proceed exactly as in the proof of Theorem 3.11 with interchanged roles of x, y, τ n and σ n to arrive at the basic inequality The goal of the proof is, as usual, to manipulate this inequality such that we obtain a recursion where most of the terms cancel when summing the inequality.For the sake of clarity let us denote In order to get a useful recursion for the first two lines it is clear that we need to require  In order to obtain a recursion for the first term on the right hand side of (A.8) we note that in the fourth line.Putting everything together and rearranging we arrive at (recall that τ n+1 /τ n = 1/θ n+1 ) L(x n+1 , y) − L(x, y n+1 ) ≤ ∆ n (x, y) − θ n K(x n − x n−1 ), y n − y + 1 2τ n x n − x n−1 2 − τ n+1 τ n ∆ n+1 (x, y) − θ n+1 K(x n+1 − x n ), y n+1 − y + 1 2τ n+1 x n+1 − x n 2 − (1 − τ n σ n θ 2 n L 2 ) Requiring that τ n σ n θ 2 n L 2 ≤ 1 we can discard the related term and multiply the inequality by τ n /τ 0 to reveal the recursion: We now sum the above inequality from n = 0, . . ., N − 1: This equation can now be use as before to bound all terms on the left hand side and hence gives the necessary bound on x − x N appearing in the error term.For a saddle point (x , y ) ∈ X × Y the sum on the left hand side is nonnegative and we find: Hence, again with Lemma A.1, , where we denote Since A N and B N are increasing we have for all n ≤ N x − x n ≤ A n + x − x 0 2 + τ 0 σ 0 y − y 0 2 + 2B n + A 2 Then we find (again by equation (A.9)) which gives the first assertion.The estimate on x − x N 2 and y − y N 2 then follows analogously from inequality (A.9).It remains to note that for a type-2 approximation the square root in A N can be dropped and for a type-3 approximation B N = 0, which gives the different A N,i , B N,i .
A.5 Proof of Theorem 3.16 Proof.We again start with the general descent rule in Lemma 3.1: We now multiply by θ −n and sum from n = 0, . . ., N − 1: which (again by Young's inequality) implies .
By monotonicity we have the same bound for all n ≤ N : We now again use inequality (A.10) to obtain a bound for the sum: Eventually we let to deduce the assertion by convexity and Jensen's inequality.By the same argumentation as above we can also use inequality (A.10) to obtain the convergence of the iterates: and h : X → R are proper, lower semicontinuous and convex functions, 4. problem (3.1) admits at least one solution (x , y ) ∈ X × Y.

Figure 1: Inexact primal-dual on the TV-$L^1$ problem. (a) and (b) loglog plots of the relative objective error vs. the outer iteration number for different decay rates $\alpha$ of the errors. (a) cold start, error close to the bound $O(1/n^\alpha)$, (b) warm start. (c) and (d) number of inner iterations respectively sum of inner iterations vs. number of outer iterations for different decay rates $\alpha$. One can observe in (a) that the predicted rates in the worst case are attained, while in practice the problem also converges for very few inner iterations (b), (c) and (d).

Figure 3: Inexact primal-dual on the smoothed TV-$L^2$ problem. (a) and (b) loglog plots of the relative objective error respectively relative error in norm vs. the outer iteration numbers for accelerated primal-dual (PDHGacc) and inexact primal-dual (iPD) for $q = 0.9$, (c) loglog plot of the inner iteration number vs. outer iteration number for $q = 0.9$. One can observe that the predicted convergence rate of $O(\theta^N)$ is exactly attained, while for lower outer iteration numbers the necessary number of inner iterations stays reasonably low.

2 xN 2 A N + 2B N 2 .
1 (L(x n , y ) − L(x , y n )) ≤ 1 2τ x − x 0 2 + τ σ y − y 0 2 + 2B N +2A N 2θ N A N + θ N − x 0 + θ y ), which can bound by C/N , yielding a convergence rate in this distance.Theorem 3.5.Let the sequences (x n , y n ) and (X N , Y N ) be defined by (3.11) respectively Theorem 3.2.If the partial sums A N,i and B N,i in Theorem 3.2 are summable, then every weak cluster point (x * , y * ) of (X N , Y N ) is a saddle point of problem (3.1).Moreover, if the dimension of X and Y is finite, there exists a saddle point (x * , y * ) ∈ X × Y such that x n → x * and y n → y * .
2 we can now state a rate for ( XN , Y N ) if the partial sums A N and √ B N are in o( √ N ).Since the result still relies on the unknown true proxima xn , it then remains to note that for XN := ( A.6) (1 + γσ n )σ n+1 θ n+1 ≥ σ n ,