A nested primal–dual FISTA-like scheme for composite convex optimization problems

We propose a nested primal–dual algorithm with extrapolation on the primal variable suited for minimizing the sum of two convex functions, one of which is continuously differentiable. The proposed algorithm can be interpreted as an inexact inertial forward–backward algorithm equipped with a prefixed number of inner primal–dual iterations for the proximal evaluation and a “warm–start” strategy for starting the inner loop, and generalizes several nested primal–dual algorithms already available in the literature. By appropriately choosing the inertial parameters, we prove the convergence of the iterates to a saddle point of the problem, and provide an O(1/n) convergence rate on the primal–dual gap evaluated at the corresponding ergodic sequences. Numerical experiments on some image restoration problems show that the combination of the “warm–start” strategy with an appropriate choice of the inertial parameters is strictly required in order to guarantee the convergence to the real minimum point of the objective function.


Introduction
In this paper, we address the following composite convex optimization problem

(1) min_{u∈ℝ^d} F(u) ≡ f(u) + g(Au),

where f : ℝ^d → ℝ is convex and differentiable, A ∈ ℝ^{d′×d} and g : ℝ^{d′} → ℝ ∪ {+∞} is convex. Problems of the form (1) typically arise from variational models in image processing, computer vision and machine learning, such as the classical Total Variation (TV) [38], Total Generalized Variation (TGV) [11] or Directional TGV models [22,25,26] for image denoising, TV-ℓ1 models for optical flow estimation [15], regularized logistic regression in multinomial classification problems [10] and several others.
One standard approach to tackle problem (1) is the forward-backward (FB) method [5,8,19,20], which iteratively combines a gradient step on the differentiable part f with a proximal step on the convex (possibly nondifferentiable) term g•A, thus generating an iterative sequence {u_n}_{n∈ℕ} of the form

(2) u_{n+1} = prox_{τ g•A}(u_n − τ∇f(u_n)), n = 0, 1, …

where τ > 0 is an appropriate steplength along the descent direction −∇f(u_n) and the proximal operator prox_{τ g•A} is defined by

(3) prox_{τ g•A}(z) = argmin_{u∈ℝ^d} { g(Au) + ‖u − z‖²/(2τ) }.

Convergence of the iterates (2) to a solution of problem (1) can be obtained by assuming that ∇f is Lipschitz continuous and choosing τ ∈ (0, 2/L), where L is the Lipschitz constant of the gradient [20]. Accelerated versions of the standard FB scheme can be obtained by introducing an inertial step inside (2), thus obtaining

(4) ū_n = u_n + β_n(u_n − u_{n−1}), u_{n+1} = prox_{τ g•A}(ū_n − τ∇f(ū_n)), n = 0, 1, …

where ū_n is usually called the inertial point and β_n ∈ [0, 1] is the inertial parameter. Such an approach was first considered in a pioneering work by Nesterov [32] for differentiable problems (i.e. assuming g ≡ 0), and later generalized in [5] to the more general problem (1) under the name of FISTA, the "Fast Iterative Soft-Thresholding Algorithm". FISTA attains an optimal O(1/n²) convergence rate on the function values [2,5], matching the lower bound for first-order methods [33], and convergence of the iterates still holds under appropriate assumptions on the inertial parameter [14].
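As an illustration, the FB iteration (2) and its inertial variant (4) can be sketched in a few lines of Python with NumPy; the snippet below (with function names of our choosing, not the paper's) solves a toy ℓ1-regularized least squares instance, for which the proximal operator is the closed-form soft-thresholding map (i.e. A = I in the regularizer).

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form proximal operator of t*||.||_1 (the composed operator A is the identity here)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def forward_backward(grad_f, prox_g, L, u0, n_iter=500, inertial=True):
    """Scheme (2) when inertial=False, or its FISTA-like inertial variant (4) when inertial=True."""
    tau = 1.0 / L                       # steplength tau in (0, 1/L]
    u_prev = u = u0.copy()
    t = 1.0                             # FISTA sequence t_0 = 1
    for _ in range(n_iter):
        if inertial:
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
            beta = (t - 1.0) / t_next   # inertial parameter beta_n
            t = t_next
        else:
            beta = 0.0                  # plain forward-backward
        u_bar = u + beta * (u - u_prev)                  # inertial point, as in (4)
        u_prev = u
        u = prox_g(u_bar - tau * grad_f(u_bar), tau)     # forward-backward step
    return u

# toy instance: min_u 0.5*||H u - y||^2 + lam*||u||_1
rng = np.random.default_rng(0)
H = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
lam = 0.5
L = np.linalg.norm(H, 2) ** 2           # Lipschitz constant of the gradient
grad_f = lambda u: H.T @ (H @ u - y)
prox_g = lambda z, tau: soft_threshold(z, tau * lam)
u_fista = forward_backward(grad_f, prox_g, L, np.zeros(10))
```

Both variants converge to the same minimizer on this strongly convex instance; the inertial version merely reaches a given accuracy in fewer iterations.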
The original versions of methods (2) and (4) require that the proximal operator of g•A is computed exactly. However, such an assumption is hardly realistic, as the proximal operator might not admit a closed-form formula or its computation might be too costly in several practical situations, for instance in TV-regularized imaging problems or low-rank minimization (see [3,30,35,39] and references therein). In these cases, assuming that prox_g and the matrix A are explicitly available, the proximal operator can be approximated at each outer iteration n by means of an inner iterative algorithm, which is applied to the dual problem associated to (3), see [9,13,17,39]. The choice of an appropriate number of inner iterations for the evaluation of the proximal operator is thus a crucial issue, as it affects the numerical performance of FB algorithms as well as their theoretical convergence. In this regard, we can identify two main strategies in the literature for addressing the inexact computation of the FB iterate. On the one hand, several works have devised inexact versions of the FB algorithm by requiring that the FB iterate is computed with sufficient accuracy up to a prefixed (or adaptive) tolerance [8,9,20,39,41]. In these works, the inner iterations are usually run until some stopping condition is met, in order to ensure that the inexact FB iterate is sufficiently close to the exact proximal point.
Since the tolerances are required to be vanishing, the number of inner iterations will grow unbounded as the outer iterations increase, resulting in an increasing computational cost per proximal evaluation and hence an overall slowdown of the algorithm. On the other hand, one could fix the number of inner iterations in advance, thus losing control on the accuracy of the proximal evaluation, while employing an appropriate starting condition for the inner numerical routine. In particular, the authors in [17] consider a nested primal-dual algorithm that can be interpreted as an inexact version of the standard FB scheme (2), where the proximal operator is approximated by means of a prefixed number of primal-dual iterates, and each inner primal-dual loop is "warm-started" with the outcome of the previous one. As proved in [17,Theorem 3.1], such a "warm-start" strategy is sufficient for proving the convergence of the iterates to a solution of (1) when the accuracy in the proximal evaluation is prefixed.
In this paper, we propose an accelerated variant of the nested primal-dual algorithm devised in [17] that performs an inertial step on the primal variable. The proposed algorithm can be interpreted as an inexact version of the FISTA-like method (4) equipped with a prefixed number of inner primal-dual iterations for the proximal evaluation and a "warm-start" strategy for the initialization of the inner loop. Under an appropriate choice of the inertial parameters, we prove the convergence of the iterates towards a solution of (1) and an ergodic O(1/n) convergence rate on the primal-dual gap. We note that other nested primal-dual algorithms with inertial steps on the primal variable have been studied in the previous literature; however, they are either shown to diverge in numerical counterexamples [4,18] or their convergence relies on additional overrelaxation steps [28] in the same fashion as in the popular "Chambolle-Pock" primal-dual algorithm [15]. We claim that our proposed method is the first inexact inertial FB method whose convergence relies solely on its "warm-start" procedure, without requiring any increasing accuracy in the proximal evaluation. Furthermore, our numerical experiments on a Poisson image deblurring test problem show that the theoretical assumption made on the inertial parameters cannot be dispensed with in practice, as the proposed nested inertial primal-dual algorithm equipped with the same inertial parameter rule as FISTA (which does not comply with the theoretical assumption) is observed to diverge from the solution of the minimization problem.
The paper is organized as follows. In Sect. 2 we recall some useful notions of convex analysis, the problem setting, and a practical procedure for approximating the proximal operator by means of a primal-dual loop. In Sect. 3 we present the algorithm and the related convergence analysis. Section 4 is devoted to numerical experiments on some image restoration test problems. Finally, the conclusions and future perspectives are reported in Sect. 5.

Preliminaries
In the remainder of the paper, the symbol ⟨⋅, ⋅⟩ denotes the standard inner product on ℝ^d. If v ∈ ℝ^d and A ∈ ℝ^{d′×d}, ‖v‖ is the usual Euclidean vector norm, and ‖A‖ represents the largest singular value of A.
B(x, ρ) is the ball of center x and radius ρ, and aff(Ω) is the affine hull of Ω. Finally, the symbol ℝ_{≥0} stands for the set of nonnegative real numbers.

Basic notions of convex analysis
We start by recalling some well-known definitions and properties of convex analysis that will be employed throughout the paper. (i) If ψ(u) = ψ₁(u) + ψ₂(u) and there exists u₀ ∈ ℝ^{d′} such that u₀ ∈ relint(dom(ψ₁)) ∩ relint(dom(ψ₂)), where relint denotes the relative interior of a set, then ∂ψ(u) = ∂ψ₁(u) + ∂ψ₂(u). The following two lemmas involving the proximal operator will be extensively employed in the convergence analysis of Sect. 3.

Problem setting
In the remainder, we consider the optimization problem

(5) min_{u∈ℝ^d} F(u) ≡ f(u) + g(Au),

under the following blanket assumptions on the involved functions.
Note that the hypothesis on A is needed only to ensure that the subdifferential rule in Lemma 1(ii) holds, so that we can write ∂(g•A)(u) = A^T ∂g(Au). Such a rule allows us to interpret the minimum points of (5) as solutions of appropriate variational equations, as stated in the following lemma: a point û ∈ ℝ^d is a solution of problem (5) if and only if the following conditions hold for any τ, σ > 0:

(6) ∇f(û) + A^T v̂ = 0, v̂ = prox_{σ⁻¹g*}(v̂ + σ⁻¹Aû).
Proof Observe that û is a solution of problem (5) if and only if 0 ∈ ∂F(û). According to Lemma 1, this is equivalent to writing ∇f(û) + A^T v̂ = 0 with v̂ ∈ ∂g(Aû). An application of Lemma 3 shows that the inclusion v̂ ∈ ∂g(Aû) is equivalent to the equality v̂ = prox_{σ⁻¹g*}(v̂ + σ⁻¹Aû), which gives the thesis. ◻

Using Lemma 2, the convex problem (5) can be equivalently reformulated as the following convex-concave saddle-point problem [37]

(7) min_{u∈ℝ^d} max_{v∈ℝ^{d′}} L(u, v), L(u, v) := f(u) + ⟨Au, v⟩ − g*(v),

where L(u, v) denotes the primal-dual function. Under Assumptions 1, we can switch the minimum and maximum operators [37, Corollary 31.2.1] and use again Lemma 2, thus obtaining that (û, v̂) is a saddle point of L if and only if ∇f(û) + A^T v̂ = 0 with v̂ ∈ ∂g(Aû).

Inexact proximal evaluation via primal-dual iterates
A function of the form g•A does not necessarily admit a closed-form formula for its proximal operator, as is the case for several sparsity-inducing priors appearing in image and signal processing applications. However, when a closed-form formula is not accessible, the proximal operator can still be approximately evaluated by means of a primal-dual scheme, as we show in the following.
Theorem 1 Let f, g, and A be defined as in problem (5) under Assumptions 1.
Proof Theorem 1 is a special case of [17, Lemma 2.3], where the authors prove the same result for objective functions of the form F(u) = f(u) + g(Au) + h(u), with h a lower semicontinuous convex function. The proof is based on the differential inclusion characterizing the proximal point ẑ = prox_{τ g•A}(z), that is,

(10) ẑ − z + τA^T v̄ = 0, with v̄ ∈ ∂g(Aẑ).

Using Lemma 3, the above variational equation can be equivalently written as

(11) ẑ = z − τA^T v̄, v̄ = prox_{σ⁻¹g*}(v̄ + σ⁻¹Aẑ),

where τ, σ > 0 and g* is the Fenchel conjugate of g. By replacing the first equation of (11) into the second one, we obtain that v̄ is a fixed point of the operator

T(v) = prox_{σ⁻¹g*}(v + σ⁻¹A(z − τA^T v)).

Consequently, the dual sequence {v^k}_{k∈ℕ} defined in (9) can be interpreted as the fixed-point iteration applied to the operator T. From this remark, the thesis easily follows (see [17, Lemma 2.3] for the complete proof). ◻

Remark 1 By combining items (i)-(ii) of Theorem 1, we conclude that the primal sequence generated by the procedure converges to the proximal point prox_{g•A}(z) as the number of inner iterations grows. Hence, the proximal operator of g•A can be approximated by a finite number of steps of a primal-dual procedure, provided that the operators prox_{g*} and A are easily computable.
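A minimal sketch of this construction follows, under our own parameterization: for simplicity we approximate the prox of g•A itself (i.e. τ = 1) and write the dual step as prox applied with a steplength `sigma`, which may differ from the σ⁻¹-scaling used in the paper's (9). In the sanity check, A = I and g = λ‖·‖₁, so the result can be compared with the closed-form soft-thresholding operator.

```python
import numpy as np

def prox_g_composed_A(z, A, prox_g_conj, sigma, k_max):
    """Approximate prox_{g o A}(z) by iterating the fixed-point map
    v <- prox_{sigma g*}(v + sigma * A (z - A^T v)) on the dual variable,
    then returning the primal point z - A^T v (cf. Theorem 1)."""
    v = np.zeros(A.shape[0])
    for _ in range(k_max):
        v = prox_g_conj(v + sigma * (A @ (z - A.T @ v)), sigma)
    return z - A.T @ v, v

# sanity check: with A = I and g = lam*||.||_1, prox_{g o A} is soft-thresholding
lam = 0.7
A = np.eye(5)
# prox of sigma*g* = projection onto the l_inf ball of radius lam (independent of sigma)
prox_g_conj = lambda w, sigma: np.clip(w, -lam, lam)
z = np.array([2.0, -0.3, 1.1, -2.5, 0.05])
zhat, v = prox_g_composed_A(z, A, prox_g_conj, sigma=0.9, k_max=50)
soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # exact prox, for comparison
```

The dual iteration is simply projected gradient on the dual problem, so it converges for steplengths below 1/‖A‖² (here ‖A‖ = 1).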

Algorithm and convergence analysis
In this section, we propose an accelerated version of the nested primal-dual algorithm considered in [17], which features an extrapolation step in the same fashion as in the popular FISTA and other Nesterov-type forward-backward algorithms [2,5,9]. The resulting scheme can itself be considered as an inexact inertial forward-backward algorithm in which the backward step is approximately computed by means of a prefixed number of primal-dual steps.
We report the proposed scheme in Algorithm 1. Given the initial guess u_0 ∈ ℝ^d, the steplengths 0 < τ < 1/L and 0 < σ < 1/‖A‖², and a prefixed positive integer k_max, we first compute the extrapolated point ū_n (STEP 1) and initialize the primal-dual loop with the outcome of the previous inner loop, i.e. v_n^0 = v_{n−1}^{k_max} (STEP 2). This is the so-called "warm-start" strategy borrowed from [17]. Subsequently, we perform k_max inner primal-dual steps (STEP 3), compute the additional primal iterate u_n^{k_max} (STEP 4), and average the primal iterates {u_n^k}_{k=1}^{k_max} over the number of inner iterations (STEP 5). Note that STEPS 3-4 represent an iterative procedure for computing an approximation of the proximal-gradient point û_{n+1} = prox_{τ g•A}(ū_n − τ∇f(ū_n)) (see Theorem 1 and Remark 1). In that spirit, we can consider Algorithm 1 as an inexact inertial forward-backward scheme where the inexact computation of the proximal operator is made explicit in the description of the algorithm itself.
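For concreteness, the overall structure just described (inertial point, warm-started inner loop of k_max primal-dual steps, averaging) can be sketched as follows. The inner updates are our paraphrase built on the fixed-point construction of Sect. 2.3, and the capped inertial rule is our own illustrative choice; this is a sketch of the mechanism, not a verbatim transcription of STEPS 1-5 or of the parameter rule (36).

```python
import numpy as np

def nested_inertial_pd(grad_f, prox_g_conj, A, tau, sigma, u0, k_max=5, n_iter=300):
    """Sketch of a nested inertial primal-dual scheme: extrapolate the primal
    variable, warm-start the dual variable across outer iterations, run k_max
    inner primal-dual steps, and average the inner primal iterates."""
    u_prev = u = u0.copy()
    v = np.zeros(A.shape[0])       # dual variable, warm-started across outer loops
    t = 1.0
    for n in range(n_iter):
        # STEP 1: inertial point with a capped FISTA-like parameter (our choice,
        # enforcing summability of beta_n as required by the analysis)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        beta = min((t - 1.0) / t_next, 1.0 / (n + 1) ** 1.1)
        t = t_next
        u_bar = u + beta * (u - u_prev)
        u_prev = u
        # STEP 2: warm start -- v is simply carried over from the previous inner loop
        z = u_bar - tau * grad_f(u_bar)
        u_avg = np.zeros_like(u)
        # STEPs 3-4: k_max inner primal-dual iterations approximating prox_{tau g o A}(z)
        for _ in range(k_max):
            u_inner = z - tau * (A.T @ v)
            v = prox_g_conj(v + sigma * (A @ u_inner), sigma)
            u_avg += u_inner
        # STEP 5: average the inner primal iterates
        u = u_avg / k_max
    return u
```

With `beta = 0` this reduces to a non-inertial nested scheme in the spirit of [17]; the warm start is what lets a small fixed k_max suffice, since the target proximal subproblem changes little between consecutive outer iterations.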

Algorithm 1 includes some nested primal-dual schemes previously proposed in the literature as special cases:

• when β_n ≡ 0, k_max = 1, and f(u) = ‖Hu − y‖²/2, Algorithm 1 becomes the Generalized Iterative Soft-Thresholding Algorithm (GISTA), which was first devised in [29] for regularized least squares problems;
• when β_n ≡ 0 and k_max = 1, we recover the Proximal Alternating Predictor-Corrector (PAPC) algorithm proposed in [23], which can be seen as an extension of GISTA to a general data-fidelity term f(u);
• when β_n ≡ 0, Algorithm 1 reduces to the nested (non-inertial) primal-dual scheme proposed in [17], which generalizes GISTA and PAPC by introducing multiple inner primal-dual iterations.
In the case β_n ≠ 0, Algorithm 1 can be interpreted as an inexact version of FISTA, which differs from other inexact FISTA-like schemes existing in the literature (see e.g. [9,43]) due to the warm-start strategy, the prefixed number of primal-dual iterates, and the averaging of the inner primal sequence at the end of each outer iteration. Note that similar nested inertial primal-dual algorithms were proposed in [4,18] for solving least squares problems with total variation regularization. However, neither of these extensions adopts a "warm-start" strategy for the inner loop: indeed, the authors in [4] set v_n^0 = 0, whereas in [18] the initial guess is computed by extrapolating the two previous dual variables. Interestingly, in the numerical experiments of [4,18], the function values of the methods were observed to diverge, or to converge to a higher limit value than those of their non-extrapolated counterparts. These numerical "failures" might be due to the lack of the warm-start strategy, which appears to be crucial also for the convergence of inertial schemes, as we will demonstrate in the next section.
We also note that Algorithm 1 differs significantly from the classical Chambolle-Pock primal-dual algorithm [15] for minimizing sums of convex functions of the form (5), and from its generalizations to the case where f is differentiable [16,21,35,44]. Indeed, while Algorithm 1 performs the extrapolation step on the primal variable ū_n before the inner primal-dual loop begins, the algorithms proposed in [15,16,21,35,44] include extrapolation as an intermediate step between the primal and the dual iteration (or vice versa). For instance, the Chambolle-Pock algorithm described in [15, Algorithm 1] can be written as

u_{n+1} = prox_{τf}(u_n − τA^T v_n),
ū_{n+1} = u_{n+1} + θ(u_{n+1} − u_n),
v_{n+1} = prox_{σg*}(v_n + σA ū_{n+1}),

where ū_{n+1} represents the extrapolated iterate, which is computed after the primal iterate u_{n+1} and then employed inside the dual iterate v_{n+1}. All the algorithms in [16,21,35,44] are built upon similar extrapolation strategies. Furthermore, we remark that the works [21,35,44] also take into account the possible inexact computation of the involved proximal operators, either by introducing additive errors [21,44] or by considering ε_n-approximations of the proximal points [35], as well as errors in the computation of the gradient of f. By contrast, we put ourselves in the easier scenario where the proximal operators and the gradient are all computable in closed form.

Convergence analysis
The aim of this section is to prove the convergence of the iterates generated by Algorithm 1, as well as an O(1/n) convergence rate on the primal-dual gap evaluated at the average (ergodic) sequences associated to {u n } n∈ℕ and {v 0 n } n∈ℕ . Similar results in the (non strongly) convex setting have been proved for several other primal-dual methods, see e.g. [15-17, 23, 35].
We start by proving some technical descent inequalities holding for the primal-dual function (7), in which an approximate forward-backward point of the form u = ū − τ∇f(ū) − τA^T v and its associated dual variable v appear. The result is similar to the one given in [23, Lemma 3.1] for the iterates of the PAPC algorithm, which in turn is a generalization of a technical result employed for GISTA [29, Lemma 3]. Such inequalities also recall the key inequality of FISTA [14], which however involves the primal function values instead of the primal-dual ones. For our purposes, we need to introduce the function

‖v‖²_A := σ⁻¹‖v‖² − ‖A^T v‖²,

which is a norm whenever 0 < σ < 1/‖A‖², as is the case for Algorithm 1.
The following facts hold.
Proof (i) We first apply the classical three-point equality with a = ū, b = u, c = z, and then divide the result by the factor 2τ. Starting from L(z, v), we can then write a chain of inequalities in which the first inequality follows from the convexity of f, and the second one from Eq. (22) combined with the descent lemma on f. The result follows by reordering the terms in (23).
(ii) Let us consider the inequality obtained by applying Lemma 4 with x = v̄, e = σ⁻¹Aũ, y = v, multiplied by σ⁻¹/2. Starting from L(u, v), we can then write the following chain of inequalities.

Plugging (27) inside (25) yields (28).
Now observe that the scalar product in the previous inequality can be suitably rewritten. Finally, inserting this fact inside (28) and recalling the definition of the norm ‖·‖_A leads to (23); by reordering the terms in the resulting inequality, we get the thesis.
Our setting corresponds to that of [35, Lemma 3] up to notation and, most importantly, ỹ = ȳ and g ≡ 0. In particular, the condition ỹ = ȳ makes the inner product ⟨K(x − x̄), ỹ − ȳ⟩ disappear in [35, Lemma 3]; the assumption g ≡ 0 allows us to rewrite (25) as (28) through relation (27) to get the result. Hence, we can say that our Lemma 7(iii) is a specialized version of [35, Lemma 3], appropriately rewritten by employing our specific assumptions ỹ = ȳ and g ≡ 0.
The next two lemmas are also needed to prove the convergence of Algorithm 1. The former has been employed in the convergence analysis of several FISTA variants with inexact proximal evaluations (see e.g. [9,41]), whereas the latter is a classical tool for proving the convergence of first-order algorithms.
be sequences of real nonnegative numbers, with {q_n}_{n∈ℕ} being a monotone nondecreasing sequence, satisfying the following recursive property. Then the following inequality holds. In the following theorem, we state and prove the convergence of the primal-dual sequence {(u_n, v_n^0)}_{n∈ℕ} generated by Algorithm 1 to a solution of the saddle-point problem (7), provided that the inertial parameters satisfy an appropriate technical assumption. Our result extends the convergence theorem in [17, Theorem 3.1], which holds only for the case β_n ≡ 0. As in [17], the line of proof adopted here can be summarized according to the following points: (i) prove that the sequence {(u_n, v_n^0)}_{n∈ℕ} is bounded and that suitable distance sequences converge; (ii) show that each limit point of {(u_n, v_n^0)}_{n∈ℕ} is a solution to the saddle-point problem (7); (iii) conclude that the primal-dual sequence converges to a saddle point. However, some major modifications are required here in order to take into account the presence of the inertial step on the primal variable.
Theorem 2 Suppose that f, g, A are defined as in problem (5) under Assumptions 1, and let the inertial parameters {β_n}_{n∈ℕ} be chosen so that

(31) ∑_{n=1}^∞ β_n ‖u_n − u_{n−1}‖ < ∞.

Then the sequence {(u_n, v_n^0)}_{n∈ℕ} generated by Algorithm 1 converges to a solution of the saddle-point problem (7).
Proof (i) By directly applying Eq. (21) and discarding the terms proportional to ‖u_n^{k+1} − ū_n‖² and ‖v_n^{k+1} − v_n^k‖²_A, we obtain a first inequality. Recalling that ū_n = u_n + β_n(u_n − u_{n−1}) and using the Cauchy-Schwarz inequality, we can further bound it; summing the resulting relation for k = 0, …, k_max − 1 allows us to write (32).

By exploiting the update rule (16), together with the convexity of the function φ(u) = ‖û − u‖² and the warm-start strategy v_{n+1}^0 = v_n^{k_max}, it easily follows that the analogous bound holds for the averaged iterate. Plugging the previous inequality inside (32) leads to (33). Applying inequality (33) recursively, multiplying both sides by 2τ/k_max, and adding the term 2β_{n+1}‖u_{n+1} − u_n‖² + 2β_{n+1}‖û − u_{n+1}‖‖u_{n+1} − u_n‖ to the right-hand side, we come to (34). From (34) combined with condition (31), it follows that the sequence {u_n}_{n∈ℕ} is bounded. By discarding the term proportional to ‖û − u_{n+1}‖² in Eq. (34), and using again (31) together with the boundedness of {u_n}_{n∈ℕ}, we also deduce the boundedness of {v_n^0}_{n∈ℕ}.

(ii) Point (i) guarantees the existence of a constant M > 0 such that ‖û − u_n‖ ≤ M for all n ∈ ℕ. Then, from Eq. (33), the resulting inequality and condition (31) allow us to apply Lemma 9, which shows that the sequence {k_max‖û − u_n‖² + 2τ‖v̂ − v_n^0‖²_A}_{n∈ℕ} converges for any saddle point (û, v̂). Furthermore, based on condition (31), the inertial steps vanish, so that lim_{n∈I} ‖ū_n − u_n‖ = 0, which implies lim_{n∈I} ū_n = u†, where u† is the limit of {u_n}_{n∈I} along a converging subsequence indexed by I. By writing down again Eq. (33) at a point (û, v̂) that is a solution of (7), employing the definition of ū_n, the Cauchy-Schwarz inequality, and the boundedness of {u_n}_{n∈ℕ}, summing the resulting inequality for k = 0, …, k_max − 1, combining it with the update rule (16), the convexity of the function φ(u) = ‖û − u‖², and the warm-start strategy v_{n+1}^0 = v_n^{k_max}, and finally summing over n = 1, …, N and taking the limit for N → ∞ while recalling condition (31), it follows that the differences between successive inner iterates vanish along the subsequence. By combining this fact with the continuity of prox_{σ⁻¹g*} inside the steps (14)-(15), we conclude that (u†, v†) satisfies the fixed-point Eq. (6), which is equivalent to saying that (u†, v†) is a solution of (7) (see Lemma 6).

(iii) Since (u†, v†) is a saddle point, it follows from point (ii) that the sequence {k_max‖u† − u_n‖² + 2τ‖v† − v_n^0‖²_A}_{n∈ℕ} converges and, by definition of limit point, it admits a subsequence converging to zero. Then the sequence {(u_n, v_n^0)}_{n∈ℕ} must converge to (u†, v†). ◻

Remark 3 Condition (31) is quite similar to the one adopted by Lorenz and Pock in [28, Eq. 23] for their primal-dual forward-backward algorithm, which applies extrapolation on both the primal and the dual variable and then performs a primal-dual iteration in the same fashion as in the popular "Chambolle-Pock" algorithm [15,16]. Indeed, the inertial parameter β_n adopted in [28] needs to satisfy the condition

(35) ∑_n β_n (‖x_n − x_{n−1}‖² + ‖y_n − y_{n−1}‖²) < ∞,

where {x_n}_{n∈ℕ}, {y_n}_{n∈ℕ} denote the primal and dual iterates of the algorithm, respectively. Differently from (31), condition (35) depends also on the dual variable and squares the norm of the gap between two successive iterates.
As also noted in [28], conditions (31) and (35) can be easily enforced "online", since they depend only on the past iterates. One possible strategy to compute β_n according to (31) consists in modifying the usual FISTA inertial parameter as follows

(36) β_n = min{ β_n^FISTA, C ε_n / ‖u_n − u_{n−1}‖ },

where C > 0 is a positive constant, {ε_n}_{n∈ℕ} is a prefixed summable sequence, and β_n^FISTA is computed according to the usual FISTA rule, namely [5]

(37) t_0 = 1, t_{n+1} = (1 + √(1 + 4t_n²))/2, β_n^FISTA = (t_n − 1)/t_{n+1}.

In the numerical section, we will see that the modified FISTA-like rule (36) has the practical effect of shrinking the inertial parameter after a certain number of iterations, thus avoiding the possible divergence of the algorithm from the true solution of (5).
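The two rules can be sketched in a few lines. Since the display of rule (36) is only partially recoverable from the text, the cap below reflects our reading of it: β_n is shrunk so that β_n‖u_n − u_{n−1}‖ ≤ Cε_n, which makes condition (31) hold whenever {ε_n} is summable.

```python
import numpy as np

def fista_betas(n_iter):
    """Classical FISTA rule (37): t_0 = 1, t_{n+1} = (1 + sqrt(1 + 4 t_n^2))/2,
    beta_n = (t_n - 1)/t_{n+1}."""
    t = 1.0
    betas = []
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        betas.append((t - 1.0) / t_next)
        t = t_next
    return betas

def modified_beta(beta_fista, C, eps_n, primal_gap):
    """Our reading of rule (36): cap the FISTA parameter so that
    beta_n * ||u_n - u_{n-1}|| <= C * eps_n, with {eps_n} summable,
    which enforces the summability condition (31) online."""
    if primal_gap == 0.0:
        return beta_fista
    return min(beta_fista, C * eps_n / primal_gap)
```

The FISTA parameters increase monotonically towards 1, which is precisely why, without the cap, condition (31) may fail whenever the iterate gaps do not decay fast enough.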
We now prove an ergodic O(1/N) convergence rate for Algorithm 1 in terms of the quantity L(U_{N+1}, z′) − L(z, V_{N+1}), where (z, z′) ∈ ℝ^d × ℝ^{d′} and U_{N+1}, V_{N+1} are the ergodic sequences. On the one hand, our result generalizes the convergence rate obtained in [23, Theorem 3.1] under the assumptions β_n ≡ 0 and k_max = 1, which in turn extends a previous result due to [29, Theorem 1] holding for Algorithm 1 in the case f(u) = ½‖Hu − y‖², β_n ≡ 0, k_max = 1. On the other hand, the result is novel also in the case β_n ≡ 0 and k_max > 1, for which we recover the nested primal-dual algorithm presented in [17].
Proof Considering Eq. (21) with u = u_n^{k+1}, ū = ū_n, v = v_n^{k+1}, v̄ = v_n^k, and discarding the terms proportional to ‖u_n^{k+1} − ū_n‖² and ‖v_n^{k+1} − v_n^k‖²_A, we can write a first inequality, where the last relation follows from the Cauchy-Schwarz inequality and the boundedness of the sequence {u_n}_{n∈ℕ}, with M > 0 such that ‖z − u_n‖ ≤ M for all n ∈ ℕ.
Summing the previous inequality for k = 0, …, k_max − 1 and exploiting the update rule (16), the convexity of the function φ(u) = ‖z − u‖², and the warm-start strategy v_n^{k_max} = v_{n+1}^0 in the right-hand side, we obtain (39). For the left-hand side of this inequality, it is possible to derive a lower bound, where the second inequality follows from the convexity of the function L(⋅, z′) − L(z, ⋅), the update rule (16), and the definition of the sequence {v̄_n}_{n∈ℕ}. Combining the last lower bound with (39), summing the resulting inequality for n = 0, …, N, multiplying and dividing the left-hand side by N + 1, and exploiting the convexity of the function L(⋅, z′) − L(z, ⋅) together with condition (31) yields the thesis. ◻

Remark 4 Our rate can be compared with the one proved in [35] for the inexact primal-dual algorithm (17) considered therein, where ε_n-approximations of the proximal operators prox_f and prox_{g*} are employed. In particular, assuming that ε_n = O(1/n^γ) with γ > 0, the authors in [35] derive a convergence rate on the primal-dual gap that depends on γ. Furthermore, they also derive improved convergence rates for a couple of accelerated versions of the same algorithm, under the assumption that one of the terms in the objective function is strongly convex [35, Corollary 2-3]. Although we were not able to show accelerated rates in the strongly convex scenario, a linear rate for Algorithm 1 with β_n ≡ 0 has already been given in [17, Theorem 3.2]; hence the extension of this result to the case β_n ≠ 0 is a possible topic for future research.

Remark 5
Note that the numerator of (38) gets smaller as k_max increases, suggesting that a higher number of inner primal-dual iterations corresponds to a better convergence rate factor. Such a result is coherent with the one in [17, Theorem 3.2], where a convergence rate including a 1/k_max factor is provided for Algorithm 1 with β_n ≡ 0, under the assumption that f is strongly convex. However, one has to consider that a larger k_max also corresponds to a higher computational cost per iteration. Hence, the choice of an appropriate k_max is crucial for the practical performance of the algorithm. We refer the reader to Sect. 4 for some preliminary numerical investigation of this issue.
Theorem 3 allows us to derive an O(1/N) convergence rate on the primal function values, as shown in the following corollary.

Corollary 1 Suppose that the same assumptions of Theorem 3 hold and, additionally, that the function g has full domain. Then we have F(U_{N+1}) − F(û) = O(1/N), where û is any solution of (5).
Proof Following e.g. [35, Remark 3], if g has full domain, then g* is superlinear, i.e. g*(v)/‖v‖ → ∞ as ‖v‖ → ∞. As a result, the supremum of L(U_{N+1}, z′) over z′ is attained at some point z′_{N+1} ∈ ∂g(AU_{N+1}). Then, given a solution û to problem (5), we can write a chain of inequalities where the second equality is due to Lemma 2, the third inequality follows from (41) and the definition of supremum, and the fourth inequality is an application of the rate (38). Since U_N is a bounded sequence (due to Theorem 2(i)), z′_{N+1} ∈ ∂g(AU_{N+1}), and g is locally Lipschitz, it follows that z′_{N+1} is uniformly bounded over N, which implies that the sequence {‖z′_{N+1} − v_0^0‖²_A}_{N∈ℕ} is bounded by a constant M′ > 0. Hence, the thesis follows. ◻

Analogously, one could prove an O(1/N) convergence rate for the primal-dual gap.

Numerical experiments
We now illustrate the numerical performance of Algorithm 1 on some image deblurring test problems where the observed image has been corrupted by Poisson noise. All the experiments are carried out by running the routines in MATLAB R2019a on a laptop equipped with a 2.60 GHz Intel Core i7-4510U processor and 8 GB of RAM.

Weighted ℓ2-TV image restoration under Poisson noise
Variational models encoding signal-dependent Poisson noise have become ubiquitous in microscopy and astronomy imaging applications [6]. Let z ∈ ℝ^d with

z = P(Hu),

where u ∈ ℝ^d is the ground truth, H ∈ ℝ^{d×d} is the blurring matrix associated to a given Point Spread Function (PSF), and P(w) denotes a realization of a Poisson-distributed random vector with parameter w ∈ ℝ^d, w_i > 0 for i = 1, …, d. According to the Maximum a Posteriori approach, the problem of recovering the unknown image u from the acquired image z consists in solving a penalized optimization problem whose objective combines the generalized Kullback-Leibler (KL) divergence, playing the role of the data-fidelity term, with a penalty term R(u) encoding some a-priori information on the object to be reconstructed, weighted by a regularization parameter λ > 0. In the following, we consider an approximation of the KL divergence obtained by truncating its Taylor expansion around the vector z at the second order. Introducing the diagonal positive definite matrix W := diag(1/z), whose elements on the diagonal are defined as W_{i,i} = 1/z_i for i = 1, …, d, the desired KL approximation can be written as the weighted least squares data-fidelity term

(44) f(u) = ½‖Hu − z‖²_W,

where the weighted norm is defined as ‖u‖_W = √⟨u, Wu⟩. The data term (44) promotes high fidelity in low-intensity pixels and large regularization in high-intensity pixels (i.e. in the presence of high levels of noise). Such an approximation has already been considered for applications such as positron emission tomography and superresolution microscopy, see e.g. [12,27,36,40].

As we want to impose edge-preserving regularization on the unknown image, we couple the data-fidelity function (44) with the classical isotropic Total Variation (TV) function [38]

TV(u) = ∑_{i=1}^d ‖∇_i u‖,

where ∇_i u ∈ ℝ² represents the standard forward difference discretization of the image gradient at pixel i. In conclusion, we aim at solving the following minimization problem

(46) min_{u∈ℝ^d} ½‖Hu − z‖²_W + λ TV(u).
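The second-order Taylor argument behind the weighted least squares term can be verified numerically. The snippet below (with helper names of our choosing) compares the generalized KL divergence with ½‖w − z‖²_W, W = diag(1/z), for w close to z; the two quantities agree up to third-order terms in the perturbation.

```python
import numpy as np

def kl_divergence(z, w):
    """Generalized Kullback-Leibler divergence KL(z, w) = sum_i z_i*log(z_i/w_i) + w_i - z_i."""
    return np.sum(z * np.log(z / w) + w - z)

def weighted_ls(z, w):
    """Second-order Taylor approximation of KL(z, .) around w = z:
    0.5 * ||w - z||_W^2 with W = diag(1/z)."""
    return 0.5 * np.sum((w - z) ** 2 / z)

z = np.array([4.0, 9.0, 16.0, 25.0])                       # strictly positive data
w = z * (1.0 + 1e-2 * np.array([1.0, -1.0, 1.0, -1.0]))    # small perturbation of z
kl = kl_divergence(z, w)
wls = weighted_ls(z, w)
```

Since the Hessian of KL(z, ·) at w = z is exactly diag(1/z), the relative mismatch between the two values scales linearly with the perturbation size.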

Problem (46) can be framed in (5) by setting

f(u) = ½‖Hu − z‖²_W, g(v) = λ ∑_{i=1}^d ‖v_i‖, A = (∇_1; …; ∇_d),

where u ∈ ℝ^d and v = (v_1; …; v_d) ∈ ℝ^{2d}, v_i ∈ ℝ². Note that g is convex and continuous, and ∇f is L-Lipschitz continuous on ℝ^d, with the constant L being estimated as L = ‖H^T W H‖ ≤ ‖W‖‖H‖². Furthermore, problem (46) admits at least one solution under mild assumptions on H, for instance when H is such that H_{ij} ≥ 0 for all i, j = 1, …, d and has at least one strictly positive entry in each row and column [1]. Hence Assumptions 1 are satisfied, Theorem 2 holds, and the iterates {u_n}_{n∈ℕ} of Algorithm 1 are guaranteed to converge to a solution of problem (46), provided that the inertial parameters {β_n}_{n∈ℕ} satisfy condition (31).
Note that each step of Algorithm 1 can be computed in closed form when applied to problem (46): indeed, A is simply the gradient operator (48) and A^T its associated (negative) divergence, the spectral norm ‖A‖ can be explicitly upper bounded by the value √8, and prox_{g*} is the projection operator onto the Cartesian product B(0, λ) × B(0, λ) × ⋯ × B(0, λ), where B(0, λ) ⊆ ℝ² is the ball of center 0 and radius λ (see [13]).
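These two ingredients can be sketched as follows (helper names are ours; the image is flattened so that the dual variable carries one 2-vector per pixel): prox_{g*} reduces to a row-wise projection onto B(0, λ), and the forward-difference gradient satisfies ‖A‖² ≤ 8.

```python
import numpy as np

def prox_g_conj(v, lam):
    """Prox of the conjugate of g = lam * sum_i ||v_i|| (isotropic TV):
    projection of each 2-vector v_i onto the ball B(0, lam) in R^2.
    Here v has shape (d, 2), one row per pixel."""
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.maximum(1.0, norms / lam)

def grad_2d(u):
    """Forward-difference discrete gradient of a 2-D image u (zero at the
    border), returned with shape (d, 2); its operator norm satisfies ||A||^2 <= 8."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

lam = 0.12
v = np.array([[3.0, 4.0], [0.05, -0.05], [-1.0, 0.0]])
p = prox_g_conj(v, lam)
```

Rows already inside B(0, λ) are left untouched, while the others are radially rescaled onto the sphere of radius λ, exactly as required by the dual update of the algorithm.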

Implementation and parameters choice
For our numerical tests, we consider four grayscale images, which are displayed in Fig. 1:

Fig. 1 Blurred and noisy test images used in our numerical experiments
• moon, phantom, mri are demo images of size 537 × 358 , 256 × 256 , and 128 × 128 respectively, which are taken from MATLAB Image Processing Toolbox; • micro is the 128 × 128 confocal microscopy phantom described in [46].
The regularization parameter has been fixed to λ = 0.15 for moon, λ = 0.12 for phantom, λ = 0.09 for micro, and λ = 0.12 for mri. The blurring matrix H is associated to a Gaussian blur with window size 9 × 9 and standard deviation equal to 4, and products with H (respectively H^T) are performed by assuming reflexive boundary conditions. We simulate the blurred and noisy acquired images numerically by convolving the true object with the given PSF and then using the imnoise function of the MATLAB Image Processing Toolbox to simulate the presence of Poisson noise.
We implement several versions of Algorithm 1, each one equipped with a different rule for computing the inertial parameter α_n at STEP 1 and a different strategy for the initialization of the inner loop. The resulting algorithms are denoted as follows.
• FB (warm): the nested non-inertial primal-dual algorithm formerly presented in [17], which is obtained as a special case of Algorithm 1 by setting α_n ≡ 0;
• FB (no-warm): a nested non-inertial primal-dual algorithm obtained from Algorithm 1 by setting α_n ≡ 0 and replacing STEP 2 with the initialization v_n^0 = 0; due to the lack of the "warm-start" strategy, Theorem 2 does not apply to this algorithm, and hence its convergence is not guaranteed;
• FISTA (warm): a particular instance of Algorithm 1 obtained by choosing α_n as the FISTA inertial parameter α_n^FISTA defined in (37); note that we have no information on whether α_n^FISTA complies with condition (31), hence it is unclear a priori whether the iterates converge to a minimum point or not;
• FISTA (no-warm): here we set α_n = α_n^FISTA and use the null dual vector v_n^0 = 0 as initialization of the inner loop at STEP 2;
• FISTA-like (warm): this is Algorithm 1 with the inertial parameter α_n computed according to the rule (36), in which we set C = 10‖u_1 − u_0‖ and the sequence {1/n^{1.1}}; as explained in Remark 3, such a practical rule ensures that condition (31) holds, therefore by Theorem 2 the iterates of the algorithm are guaranteed to converge to a solution of (5);
• FISTA-like (no-warm): the same as "FISTA-like (warm)", but with the "warm-start" strategy at STEP 2 replaced by v_n^0 = 0; again, no theoretical guarantees are available in this case.
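For illustration, the two inertial rules can be sketched as follows. The cap C/(n^{1.1}‖u_n − u_{n−1}‖) is the quantity monitored in Fig. 3, but the exact combination prescribed by rule (36) is not reproduced here, so the min-based rule below should be read as one plausible, hedged reading of it:

```python
import math

def fista_inertia(t_prev):
    """Classical FISTA update: returns (alpha_n, t_next) from t_{n-1}."""
    t_next = (1.0 + math.sqrt(1.0 + 4.0 * t_prev**2)) / 2.0
    return (t_prev - 1.0) / t_next, t_next

def fista_like_inertia(n, C, diff_norm, t_prev):
    """FISTA inertia capped by the bound C / (n^1.1 * ||u_n - u_{n-1}||).

    The cap forces the inertial term to be summable in n, which is the
    practical effect of rule (36) discussed in the text.
    """
    alpha_fista, t_next = fista_inertia(t_prev)
    cap = C / (n**1.1 * max(diff_norm, 1e-15))  # guard against a zero step
    return min(alpha_fista, cap), t_next
```

In early iterations the cap is large and the FISTA value is returned; once the iterates slow down, the cap takes over and shrinks the inertia, matching the behavior reported in Fig. 3.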
For all algorithms, we set the initial primal iterate as the observed image, namely u_0 = u_{−1} = z, the initial dual iterate as the null vector, i.e. v_{−1}^{k_max} = 0, the primal steplength as τ = (1 − ε)/(8‖W‖²), and the dual steplength as σ = (1 − ε)/8, where ε > 0 is the machine precision. Regarding the number of inner primal-dual iterations, we will test different values of k_max in the next section.
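To make the nested structure concrete, the following NumPy sketch applies the non-inertial "FB (warm)" variant to a hypothetical 1D TV denoising problem min_u ½‖u − z‖² + λ‖Du‖₁. The inner dual update is a standard projected-gradient step and is only meant to illustrate the warm-start mechanism (the dual vector v survives across outer iterations), not to reproduce STEPS 2-3 of Algorithm 1 verbatim:

```python
import numpy as np

def nested_fb_tv1d(z, lam, kmax=1, n_iter=5000):
    """Nested forward-backward scheme for min_u 0.5||u - z||^2 + lam*||D u||_1.

    Outer loop: gradient step on f(u) = 0.5||u - z||^2 (L = 1).
    Inner loop: kmax dual projected-gradient steps approximating the prox
    of tau*lam*||D .||_1, warm-started with the dual vector left over
    from the previous outer iteration.
    """
    d = z.size
    D = np.diff(np.eye(d), axis=0)       # 1D forward-difference operator
    tau = 0.5                            # primal steplength, tau < 2/L
    sigma = 0.25                         # dual steplength, <= 1/||D||^2
    u = z.copy()
    v = np.zeros(d - 1)                  # dual memory: the "warm start"
    for _ in range(n_iter):
        w = u - tau * (u - z)            # forward (gradient) step
        for _ in range(kmax):            # inexact prox via projected dual ascent
            v = np.clip(v + sigma * (D @ (w - D.T @ v)),
                        -tau * lam, tau * lam)
        u = w - D.T @ v                  # backward (approximate prox) step
    return u
```

With a null initialization v = 0 at every outer iteration, a single inner step (kmax = 1) would be a very poor prox approximation; carrying v over is exactly what makes kmax = 1 viable.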

Results
As a first test, we run all previously described algorithms with k_max = 1 for 2000 iterations. In Fig. 2, we show the decrease of the primal function values F(u_n) with respect to the iteration number for all test images. We observe that the "FB" and "FISTA-like" implementations of Algorithm 1 that make use of the "warm-start" strategy converge to a smaller function value than those provided by the other competitors. These two instances of Algorithm 1 are indeed the only ones whose convergence to a minimum point is guaranteed by Theorem 2. On the contrary, "FISTA (warm)" is the only warm-started algorithm that tends to a higher function value, and in fact it provides the worst reconstructions for all test problems, as can be seen in the comparison of the reconstructed phantom images in Fig. 4. The algorithms using a null initialization for the inner loop all fail to converge to the smallest function value. Such numerical failures are typical of nested inertial primal-dual algorithms that do not employ a "warm-start" initialization in the inner loop, see e.g. [4,18]. Focusing on the "FISTA-like" variants, we observe that these algorithms perform identically to the "FISTA" variants in the first iterations, and then move away towards smaller function values. This is due to the computational rule (36), which has the practical effect of shrinking the inertial parameter after a certain number of iterations. Indeed, Fig. 3 shows that the parameter C/(n^{1.1}‖u_n − u_{n−1}‖) becomes smaller than α_n^FISTA after the first 10-20 iterations, and then seemingly stabilizes around a value smaller than 1. However, we also remark that condition (36) alone is not enough to ensure convergence to the true minimum value, as the "FISTA-like (no-warm)" implementation stabilizes at a higher function value than the one provided by the warm-started FISTA-like algorithm for all test images.
Hence, we conclude that combining condition (36) with the "warm-start" strategy on the inner loop is strictly necessary in order to prevent Algorithm 1 from drifting away from the true solution of (5).
We now evaluate the impact of the number of inner primal-dual iterations on the overall performance of Algorithm 1. In Table 1, we report the function value provided by the different algorithms after 2000 iterations on the test image phantom, as k_max varies from 1 to 20. On the one hand, we see that all algorithms yield roughly the same function value if k_max is large, however at the expense of a much higher computational cost per outer iteration. On the other hand, the warm-started FISTA-like implementation provides the best accuracy level already with k_max = 1. While the plots on the left show a similar performance in terms of iterations as k_max varies, the plots on the right confirm that a higher value of k_max often results in an increased computational effort to achieve the desired accuracy level.

KL-TV image restoration under Poisson noise
In this section, we consider again the problem of image restoration under Poisson noise, i.e. the problem of recovering the original image u ∈ ℝ^d from a distorted image z ∈ ℝ^d, z_i ≥ 0, i = 1, …, d, that is a realization of a Poisson random vector of the form

z = P(Hu + be),

where H ∈ ℝ^{d×d} is again the blurring matrix, e is the vector of all ones, and b > 0 is a constant term taking into account the existence of some background emission. This time, we follow the Maximum a Posteriori approach without approximating the KL divergence, and we impose both TV regularization and non-negativity on the image pixels. Therefore, we address the following optimization problem

(49)  min_{u∈ℝ^d} KL(Hu + be; z) + λ TV(u) + ι_Ω(u),

where the functions KL and TV are defined in (43) and (45), respectively, Ω = {u ∈ ℝ^d : u_i ≥ 0, i = 1, …, d} denotes the non-negative orthant, and ι_Ω is the indicator function of Ω, i.e. ι_Ω(u) = 0 if u ∈ Ω and ι_Ω(u) = +∞ otherwise.
Note that problem (49) can be included in (5) by setting f, g and A appropriately, where u ∈ ℝ^d, v ∈ ℝ^{4d}, and I_d ∈ ℝ^{d×d} is the identity matrix of size d. We remark that f is convex, continuously differentiable, and its gradient is L-Lipschitz continuous for any constant L > 0, whereas g is convex and lower semicontinuous. In addition, the problem admits at least one solution under mild assumptions on H, for instance the same ones assumed for problem (46) [24, Proposition 3]. Thus, Theorem 2 once again guarantees the convergence of the iterates {u_n}_{n∈ℕ} of Algorithm 1 to a solution of problem (49). As in the previous test, we remark that each step of Algorithm 1 can be performed explicitly. Indeed, the blocks of the matrix A are the discrete gradient operator ∇, the identity matrix I, and the blurring matrix H, all of which can be applied in closed form. Furthermore, the spectral norm of A can be upper bounded as ‖A‖² ≤ ‖∇‖² + ‖H^T H‖ + 1, where ‖∇‖² ≤ 8 and ‖H^T H‖ ≤ ‖H‖_1 ‖H‖_∞. Finally, the term g can be written as the sum of separable functions g = g_1 + g_2 + g_3, so that prox_{g*} can be block-decomposed accordingly: prox_{g_1*} is the projection operator onto the Cartesian product B(0, λ) × ⋯ × B(0, λ), prox_{g_2*} is the projection operator onto the orthant Ω* = {v ∈ ℝ^d : v_i ≤ 0, i = 1, …, d}, and prox_{g_3*} can be computed according to a closed-form formula [19, Table 2]. Therefore, prox_{g*} is overall computable in closed form.
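The spectral-norm bound ‖A‖² ≤ ‖∇‖² + 1 + ‖H‖_1 ‖H‖_∞ can be checked numerically on a small example. The stacked operator below mirrors the block structure of A, with a random row-normalized matrix standing in for the blurring matrix H (all names and sizes are illustrative):

```python
import numpy as np

def spectral_bound_check(n=8, seed=0):
    """Verify ||A||^2 <= ||grad||^2 + 1 + ||H||_1 * ||H||_inf on a toy example.

    A stacks the 2D forward-difference gradient, the identity, and a
    random nonnegative row-normalized 'blurring' matrix H, mirroring
    the block structure of the operator in problem (49).
    """
    rng = np.random.default_rng(seed)
    d = n * n
    D1 = np.diff(np.eye(n), axis=0)        # 1D forward difference
    Dx = np.kron(np.eye(n), D1)            # horizontal differences
    Dy = np.kron(D1, np.eye(n))            # vertical differences
    H = rng.random((d, d))
    H /= H.sum(axis=1, keepdims=True)      # row-normalized "blur"
    A = np.vstack([Dx, Dy, np.eye(d), H])
    norm_A_sq = np.linalg.norm(A, ord=2) ** 2          # sigma_max(A)^2
    # ||A||^2 = ||Dx^T Dx + Dy^T Dy + I + H^T H|| <= 8 + 1 + ||H||_1 ||H||_inf
    bound = 8.0 + 1.0 + np.linalg.norm(H, 1) * np.linalg.norm(H, np.inf)
    return norm_A_sq, bound
```

The bound is what enters the dual steplength σ = 0.9/(9 + ‖H‖_1 ‖H‖_∞) used in the experiments below: it avoids estimating σ_max(A) by power iteration at runtime.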

Implementation and parameter choices
For the following tests, we consider the grayscale images phantom, mri, micro used in the tests of Sect. 4.1, and two different types of blur: a Gaussian blur with window size 9 × 9, and an out-of-focus blur with radius 4. The products with the blurring matrix H (respectively H^T) are performed by assuming periodic boundary conditions. For each test image and blur, we convolve the true object with the corresponding PSF, add a constant background term b = 10^{−6}, and then apply the imnoise function of the MATLAB Image Processing Toolbox to simulate the presence of Poisson noise. We consider two different signal-to-noise ratios (SNR) of 66 and 72 dB, which are imposed by suitably pre-scaling the images before applying the imnoise function. We recall that the SNR in the presence of Poisson noise is estimated by [6]

SNR = 10 log_10 ( n_exact / √(n_exact + n_background) ),

where n_exact and n_background represent the total number of photons in the image and in the background term, respectively.
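Assuming the SNR estimate above, pre-scaling to a prescribed SNR admits a closed form, since multiplying all intensities by c multiplies both photon counts by c and hence shifts the SNR by 5 log_10 c. A sketch (function names are illustrative):

```python
import numpy as np

def poisson_snr_db(n_exact, n_background):
    """SNR (dB) under Poisson noise: 10*log10(N_exact / sqrt(N_exact + N_bg))."""
    return 10.0 * np.log10(n_exact / np.sqrt(n_exact + n_background))

def prescale_for_target_snr(u, b, target_db):
    """Scale image u (and background b) so the Poisson SNR hits target_db.

    Scaling every intensity by c scales both photon counts by c, so
    SNR(c) = SNR(1) + 5*log10(c); solve for c in closed form.
    """
    current = poisson_snr_db(u.sum(), b * u.size)
    c = 10.0 ** ((target_db - current) / 5.0)
    return c * u, c * b
```

The scaled image is then fed to the Poisson noise generator, exactly as done with imnoise in the experiments.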
We compare Algorithm 1 with some well-known optimization methods suited for the deblurring of Poissonian images. We detail the implemented algorithms below.
• Algorithm 1: we equip our proposed algorithm with k_max ∈ {1, 5, 10}, α_n computed according to the rule (36) with the same choices of C and of the sequence {1/n^{1.1}} adopted in Sect. 4.1, the initial dual iterate v_{−1}^{k_max} = 0, the primal steplength τ = 10^{−2}, and the dual steplength σ = 0.9/(9 + ‖H‖_1 ‖H‖_∞). We note that, although the sequence {u_n}_{n∈ℕ} is guaranteed to converge to a solution of problem (49), some of the iterates might fall outside the domain Ω, as the primal sub-iterates u_n^k in STEP 3 are not projected onto Ω. As a partial remedy to this issue, we report the results for the projected sequence {P_Ω(u_n)}_{n∈ℕ}, where P_Ω denotes the projection operator onto Ω. Such a sequence is both feasible and convergent to a solution of (49), even though we cannot state anything regarding its convergence rate, as Theorem 3 is applicable only to the non-projected sequence {u_n}_{n∈ℕ}.
• CP: this is the popular Chambolle-Pock primal-dual algorithm [15], whose general iteration is reported in (17), applied here to problem (49). Since the steplength parameters τ, σ need to satisfy the condition τσ < 1/‖A‖² to guarantee the convergence of the sequence, we choose the same primal steplength as in Algorithm 1, i.e., τ = 10^{−2}, and σ = 0.9 τ^{−1}/(8 + ‖H‖_1 ‖H‖_∞).
• PIDSplit+: this is an efficient algorithm based on an alternating split Bregman technique, which was first proposed in [42]. The PIDSplit+ algorithm depends on a parameter γ > 0, which is set here as γ = 50/λ, as suggested in [42]. Note that there also exist adaptive strategies for computing this parameter, see e.g. [47].
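For reference, the Chambolle-Pock scheme can be sketched in generic form as follows. This is the standard textbook iteration for min_u f(u) + g(Au) with extrapolation parameter 1, not a reproduction of (17); the prox operators and the 1D TV denoising setup used in the test below are illustrative:

```python
import numpy as np

def chambolle_pock(prox_tau_f, prox_sigma_gstar, A, tau, sigma, u0,
                   n_iter=2000):
    """Generic Chambolle-Pock iteration for min_u f(u) + g(Au).

    Requires tau*sigma*||A||^2 < 1. prox_tau_f and prox_sigma_gstar are
    the proximal operators of tau*f and sigma*g*, respectively.
    """
    u, u_bar = u0.copy(), u0.copy()
    v = np.zeros(A.shape[0])
    for _ in range(n_iter):
        v = prox_sigma_gstar(v + sigma * (A @ u_bar))  # dual ascent step
        u_new = prox_tau_f(u - tau * (A.T @ v))        # primal descent step
        u_bar = 2.0 * u_new - u                        # extrapolation (theta = 1)
        u = u_new
    return u
```

Unlike Algorithm 1, CP performs exactly one primal and one dual update per iteration, so there is no inner loop (and hence no warm-start issue) to manage.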
For all algorithms, we set the initial primal iterate as the observed image, namely u_0 = z. For each test problem, the regularization parameter λ is tuned by a trial-and-error strategy. In particular, we run the CP algorithm by varying λ in the set {10^{−3}, 5 ⋅ 10^{−3}, 10^{−2}, 5 ⋅ 10^{−2}, 10^{−1}, 5 ⋅ 10^{−1}}, performing 5000 iterations and using u_0 = z at each run. The value of λ providing the smallest relative error ‖u_n − u†‖/‖u†‖ is retained as the regularization parameter, u† being the ground truth and u_n the last iterate.
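The trial-and-error selection described above amounts to a small grid search over the regularization parameter. A sketch, with `solver` standing in for any of the algorithms run for a fixed iteration budget (the helper name and default grid are illustrative):

```python
import numpy as np

def tune_lambda(solver, z, u_true,
                grid=(1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1)):
    """Trial-and-error selection of the regularization parameter.

    solver(z, lam) must return a restored image; the lam minimizing the
    relative error w.r.t. the ground truth u_true is retained.
    """
    def rel_err(u):
        return np.linalg.norm(u - u_true) / np.linalg.norm(u_true)
    errors = {lam: rel_err(solver(z, lam)) for lam in grid}
    return min(errors, key=errors.get), errors
```

Of course this oracle selection requires the ground truth and is only meaningful in simulated experiments; on real data one would resort to a parameter-choice rule such as discrepancy principles.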

Results
We apply Algorithm 1, PIDSplit+, and CP to 12 test problems, which are generated by varying the test image (phantom, mri, micro), the blur (Gaussian or out-of-focus), and the noise level (66 or 72 dB). For each test problem, we run all algorithms for 1000 iterations. In Figs. 9, 10, 11, and 12, we report the relative decrease of the function values (F(u_n) − F(u*))/F(u*), where u* is a precomputed solution obtained by running the CP algorithm for 5000 iterations. We can appreciate the acceleration achieved by Algorithm 1 with respect to the iteration number, whereas a higher k_max usually increases the computational time without a significant gain in terms of accuracy. With k_max = 1, the computational times of Algorithm 1 are comparable to those of CP, even though our proposed algorithm is sometimes faster in the first seconds (see e.g. Fig. 11). Overall, Algorithm 1 exhibits a similar rate of convergence towards the minimum point for each combination of noise level and blur.
In Table 2, we report the values of the structural similarity (SSIM) index [45] as a measure of the quality of the reconstructions provided by the algorithms. As can be seen, all methods provide acceptable SSIM values and are quite stable with respect to the noise level and blur. The quality of the reconstructions provided by Algorithm 1 does not seem particularly sensitive to the choice of k_max, although a much wider experimentation would be needed to confirm this remark.

Conclusions
We have presented a nested primal-dual algorithm with an inertial step on the primal variable for solving composite convex optimization problems. At each outer iteration, the proposed method starts the inner loop with the outcome of the previous one, employing the so-called "warm-start" strategy, and then performs a prefixed number of primal-dual iterations for inexactly computing the proximal-gradient point. We have proved the convergence of the primal iterates towards a minimum point, as well as an O(1/n) convergence rate for the ergodic sequences associated with the primal and dual iterates, under the assumption that the inertial parameter satisfies an appropriate (implementable) condition. Numerical experiments on some Poisson image deblurring problems show that the proposed inertial primal-dual method preserves the typical acceleration effect of FISTA-like algorithms in the first iterations, while ensuring convergence to the true minimum value as the iterations proceed. Possible future developments include the introduction of line searches and variable steplengths and metrics inside the proposed scheme, as well as the study of a suitable practical rule for choosing the "optimal" number k_max of inner primal-dual iterations.

Conflict of interests
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.