Stochastic Primal Dual Hybrid Gradient Algorithm with Adaptive Step-Sizes



Introduction
The stochastic primal-dual hybrid gradient (SPDHG) algorithm introduced in [8] is a stochastic version of the primal-dual hybrid gradient (PDHG) algorithm, also known as the Chambolle-Pock algorithm [9]. SPDHG has proved more efficient than PDHG for a variety of problems in the framework of large-scale non-smooth convex inverse problems [13,22,24,27]. Indeed, SPDHG only uses a subset of the data at each iteration, hence reducing the computational cost of evaluating the forward operator and its adjoint; as a result, for the same computational budget, SPDHG attains convergence faster than PDHG. This is especially relevant in the context of medical imaging, where there is a need for algorithms whose convergence speed is compatible with clinical standards, and which at the same time are able to deal with convex, non-smooth priors like Total Variation (TV), which are well-suited to ill-posed imaging inverse problems but preclude recourse to scalable gradient-based methods.
Like PDHG, SPDHG is provably convergent under the assumption that the product of its primal and dual step-sizes is bounded by a constant depending on the problem to solve. On the other hand, the ratio between the primal and dual step-sizes is a free parameter whose value needs to be chosen by the user. The value of this parameter, which can be interpreted as a control on the balance between primal and dual convergence, can have a severe impact on the convergence speed of PDHG, and the same holds true for SPDHG [12]. This leads to an important challenge in practice, as there is no known theoretical or empirical rule to guide the choice of this parameter. Manual tuning is computationally expensive, as it requires running and comparing the algorithm for a range of values, and there is no guarantee that a value leading to fast convergence for one dataset remains a good choice for another dataset. For PDHG, [14] have proposed an online primal-dual balancing strategy to address this issue, where the values of the step-sizes evolve along the iterations. More generally, adaptive step-sizes have been used for PDHG with backtracking in [14,20] and with adaptation to local smoothness in [25], and are widely used for a variety of other algorithms, namely gradient methods in [19], subgradient methods in [3] and splitting methods in [4,6,5,7,18], to improve convergence speed and bypass the need for explicit model constants, like Lipschitz constants or operator norms. For SPDHG, an empirical adaptive scheme has been used for Magnetic Particle Imaging, but without a convergence proof [27].
On the theoretical side, a standard procedure to prove the convergence of proximal-based algorithms for convex optimization is to use the notion of Fejér monotonicity [2]. Constant step-sizes lead to a fixed metric setting, while adaptive step-sizes lead to a variable metric setting. The work [11] establishes the convergence of deterministic Fejér-monotone sequences in the variable metric setting, while [10] is concerned with the convergence of random Fejér-monotone sequences in the fixed metric setting.
In this work, we introduce and study an adaptive version of SPDHG. More precisely: • We introduce a broad class of strategies to adaptively choose the step-sizes of SPDHG. This class includes, but is not limited to, the adaptive primal-dual balancing strategy, where the ratio of the step-sizes, which controls the balance between the convergence of the primal and dual variables, is tuned online.
• We prove the almost-sure convergence of SPDHG under the schemes of this class. In order to do so, we introduce the concept of C-stability, which generalizes the notion of Fejér monotonicity, and we prove the convergence of random C-stable sequences in a variable metric setting, hence generalizing results from [11] and [10]. We then show that our proposed algorithm falls within this novel theoretical framework by following strategies similar to those in the almost-sure convergence proofs of [16,1].
• We compare the performance of SPDHG for various adaptive schemes and the known fixed step-size scheme on large-scale imaging inverse problems (sparse-view CT, limited-angle CT, low-dose CT). We observe that the primal-dual balancing adaptive strategy is always as fast as or faster than all the other strategies. In particular, it consistently leads to substantial gains in convergence speed over the fixed strategy if the fixed step-sizes, while in the theoretical convergence range, are badly chosen. This is especially relevant as it is impossible to know whether the fixed step-sizes are well or badly chosen without running expensive comparative tests. Even in the cases where SPDHG's fixed step-sizes are well-tuned, meaning that they are in the range to which the adaptive step-sizes are observed to converge, our adaptive scheme still provides convergence acceleration over standard SPDHG after a certain number of iterations. Finally, we pay special attention to the hyperparameters used in the adaptive schemes. These hyperparameters essentially control the degree of adaptivity of the algorithm, and each of them has a clear interpretation and is easy to choose in practice. We observe in our extensive numerical tests that the convergence speed of our adaptive scheme is robust to the choice of these parameters within the empirical range we provide; hence the scheme can be applied directly to the problem at hand without fine-tuning, and solves the step-size choice challenge encountered by the user.
The rest of the paper is organized as follows. In Section 2, we introduce SPDHG with adaptive step-sizes, state the convergence theorem, and carry out the proof. In Section 3, we propose concrete schemes to implement the adaptivity, followed by numerical tests on CT data in Section 4. We conclude in Section 5. Finally, Section 6 collects some useful lemmas and proofs.

Convergence theorem
The variational problem to solve takes the form (2.1), where X and (Y i ) i∈{1,...,n} are Hilbert spaces and f * i stands for the Fenchel conjugate of f i . The set of solutions to (2.1) is denoted by C, the set of non-negative integers by N, and 1, n stands for {1, . . ., n}. Elements (x * , y * ) of C are called saddle-points and are characterized by (2.2). In order to solve the saddle-point problem, we introduce the adaptive stochastic primal-dual hybrid gradient (A-SPDHG) algorithm in Algorithm 2.1. At each iteration k ∈ N, A-SPDHG involves the following five steps:
• update the primal step-size τ k and the dual step-sizes (σ k i ) i∈ 1,n (line 4);
• update the primal variable x k by a proximal step with step-size τ k+1 (line 5);
• randomly choose an index i with probability p i (line 6);
• update the dual variable y k i by a proximal step with step-size σ k+1 i (line 7);
• compute the extrapolated dual variable (line 8).
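As a concrete illustration of the five steps, here is a minimal sketch of one A-SPDHG-style loop on a toy least-squares problem with the trivial constant step-size rule. The problem, the closed-form proximal maps and all parameter values are illustrative assumptions, not the paper's general setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy saddle-point problem: g(x) = 0.5||x||^2, f_i(y) = 0.5||y - b_i||^2, n = 2.
# Both proximal maps below are the closed forms for this particular choice.
n = 2
A = [rng.standard_normal((3, 4)) for _ in range(n)]
b = [rng.standard_normal(3) for _ in range(n)]
p = [1.0 / n] * n                                  # uniform serial sampling

def prox_g(x, tau):                                # prox of tau * g
    return x / (1.0 + tau)

def prox_f_conj(y, sigma, bi):                     # prox of sigma * f_i^*
    return (y - sigma * bi) / (1.0 + sigma)

def objective(x):                                  # primal objective, for monitoring
    return 0.5 * x @ x + sum(0.5 * np.sum((A[i] @ x - b[i]) ** 2) for i in range(n))

x = np.zeros(4)
y = [np.zeros(3) for _ in range(n)]
ybar = [np.zeros(3) for _ in range(n)]
tau, sigma = 0.1, [0.1] * n
obj_start = objective(x)

for k in range(300):
    # line 4: update the step-sizes (here: a trivial constant rule)
    # line 5: primal proximal step using the extrapolated dual variables
    x = prox_g(x - tau * sum(A[i].T @ ybar[i] for i in range(n)), tau)
    # line 6: sample one index i with probability p_i
    i = int(rng.choice(n, p=p))
    # line 7: dual proximal step on the sampled block only
    y_old = y[i].copy()
    y[i] = prox_f_conj(y[i] + sigma[i] * (A[i] @ x), sigma[i], b[i])
    # line 8: dual extrapolation with factor 1/p_i; unsampled blocks revert to y_j
    ybar = [yj.copy() for yj in y]
    ybar[i] = y[i] + (1.0 / p[i]) * (y[i] - y_old)
```

With these step-sizes the standard SPDHG step-size condition holds for this toy problem, and the primal objective decreases from its value at the zero initialization.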
A-SPDHG is adaptive in the sense that the step-size values are updated at each iteration according to an update rule which takes into account the values of the primal and dual iterates x l and y l up to the current iteration. As the iterates are stochastic, the step-sizes are themselves stochastic, which must be carefully accounted for in the theory.

4: Determine (σ k+1 i ) i∈ 1,n and τ k+1 according to the update rule and the values of (σ l i ) i∈ 1,n , τ l , x l and y l for l ∈ 0, k .

6: Randomly pick i ∈ 1, n with probability p i .

Before turning to the convergence of A-SPDHG, let us recall some facts about the state-of-the-art SPDHG. Each iteration of SPDHG involves the selection of a random subset of 1, n . In the serial sampling case, where the random subset is a singleton, the SPDHG algorithm [8] is a special case of Algorithm 2.1 with the update rule

Under the condition
SPDHG iterates converge almost surely to a solution of the saddle-point problem (2.1) [1,16].
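Condition (2.3) is elided above; for serial sampling it is commonly stated in the form τ σ i ‖A i ‖ 2 ≤ β p i 2 with β < 1, which is our reading here (consistent with condition (ii) of Theorem 2.1 below). A standard way to pick fixed step-sizes satisfying it, with a free balancing parameter γ, can be sketched as follows; the specific formulas are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
A = [rng.standard_normal((5, 8)) for _ in range(4)]   # blocks A_1, ..., A_4
p = [0.25] * 4                  # uniform serial sampling probabilities
gamma, beta = 10.0, 0.99        # gamma: free primal/dual ratio; beta < 1: safety

L = [np.linalg.norm(Ai, 2) for Ai in A]               # operator norms ||A_i||
sigma = [gamma * p[i] / L[i] for i in range(4)]       # dual step-sizes
tau = beta * min(p[i] / (gamma * L[i]) for i in range(4))   # primal step-size
```

By construction, τ σ i ‖A i ‖ 2 = β p i ‖A i ‖ min j (p j /‖A j ‖) ≤ β p i 2 for every block, whatever the value of γ; γ only shifts the primal-dual balance.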
Let us now turn to the convergence of A-SPDHG. The main theorem, Theorem 2.1 below, gives conditions on the update rule under which A-SPDHG is provably convergent. Plainly speaking, these conditions are threefold: (i) the step-sizes for step k + 1, (σ k+1 i ) i∈ 1,n and τ k+1 , depend only on the iterates up to step k; (ii) the step-sizes satisfy a uniform version of condition (2.3); (iii) the step-sizes sequences (τ k ) k≥0 and (σ k i ) k≥0 for i ∈ 1, n do not decrease too fast. More precisely, they are uniformly almost surely quasi-increasing in the sense defined below.
In order to state the theorem rigorously, let us introduce some useful notation and definitions. For all k ∈ N, the σ-algebra generated by the iterates (x l , y l ) up to step k, l ∈ 0, k , is denoted by F k . We say that a sequence (u k ) k∈N is (F k ) k∈N -adapted if for all k ∈ N, u k is measurable with respect to F k .
A positive real sequence (u k ) k∈N is said to be quasi-increasing if there exists a sequence (η k ) k∈N with values in [0, 1), called the control on (u k ) k∈N , such that (2.4) holds. By extension, we call a random positive real sequence (u k ) k∈N uniformly almost surely quasi-increasing if there exists a deterministic sequence (η k ) k∈N with values in [0, 1) such that Σ k≥1 η k < ∞ and equation (2.4) above holds almost surely (a.s.).
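Assuming equation (2.4) has the form u k+1 ≥ (1 − η k ) u k (the natural reading given the definition, stated here as our assumption), the property is easy to check numerically: a quasi-increasing sequence may decrease, but the summable control bounds the total multiplicative loss.

```python
def is_quasi_increasing(u, eta):
    """Check u[k+1] >= (1 - eta[k]) * u[k] for all k (assumed form of (2.4))."""
    return all(u[k + 1] >= (1.0 - eta[k]) * u[k] for k in range(len(u) - 1))

eta = [1.0 / (k + 2) ** 2 for k in range(9)]      # control in [0, 1) with finite sum
u = [1.0]
for k in range(9):
    u.append((1.0 - eta[k]) * u[-1])              # worst admissible decrease at each step
```

Because Σ η k < ∞, the product Π (1 − η k ) stays bounded away from 0, so even the worst admissible sequence above remains bounded from below by a positive constant, which is exactly what assertion (d) of Lemma 2.3 below exploits.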
Theorem 2.1 (Convergence of A-SPDHG). Let X and Y be separable Hilbert spaces, A i : X → Y i bounded linear operators, and f i : Y i → R ∪ {+∞} and g : X → R ∪ {+∞} proper, convex and lower semi-continuous functions for all i ∈ 1, n . Assume that the set of saddle-points C is non-empty and that the sampling is proper, that is to say p i > 0 for all i ∈ 1, n . If the following conditions are met: (i) the step-sizes sequences (τ k+1 ) k∈N and (σ k+1 i ) k∈N for all i ∈ 1, n are (F k ) k∈N -adapted, (ii) there exists β ∈ (0, 1) such that condition (2.5) holds for all indices i ∈ 1, n and iterates k ∈ N, (iii) the initial step-sizes τ 0 and σ 0 i for all indices i ∈ 1, n are positive and the step-sizes sequences (τ k ) k∈N and (σ k i ) k∈N for all indices i ∈ 1, n are uniformly almost surely quasi-increasing, then the sequence of iterates (x k , y k ) k∈N converges almost surely to an element of C.
While conditions (i)-(iii) are general enough to cover a large range of step-size update rules, we will focus in practice on the primal-dual balancing strategy, which consists in scaling the primal and the dual step-sizes by an inverse factor at each iteration. In that case, the update rule depends on a random positive sequence (γ k ) k∈N and reads as:

Lemma 2.2. Assume that the initial step-sizes are positive, and that there exists a deterministic sequence (ε k ) k∈N with values in [0, 1) such that Σ k ε k < ∞ and such that the stated bound holds for all k ∈ N and i ∈ 1, n . Then, the step-sizes sequences satisfy assumptions (i)-(iii) of Theorem 2.1.
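The balancing rule (following [14]) multiplies the primal step-size by γ k+1 and divides the dual step-sizes by the same factor, so the product τ k σ k i is invariant and condition (ii) is automatically preserved. A small sketch, with balancing factors chosen so that |γ k − 1| is summable, in the spirit of the lemma's assumption:

```python
tau, sigma = 0.2, 0.05
product0 = tau * sigma

# Hypothetical balancing factors gamma_k with |gamma_k - 1| summable;
# the alternating sign exercises both scaling directions.
gammas = [1.0 + (-1) ** k * 0.5 / (k + 1) ** 2 for k in range(20)]

for g in gammas:
    tau, sigma = g * tau, sigma / g   # tau scaled up by gamma, sigma down by gamma
```

After any number of such updates the product τ σ is unchanged, while the ratio τ/σ has drifted towards a better primal-dual balance.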
Lemma 2.2 is proved in Section 6.
Connection with the literature: • The primal-dual balancing strategy was introduced in [14] for PDHG; indeed, for n = 1 we recover with Lemma 2.2 the non-backtracking algorithm presented in [14]. As a consequence, our theorem also implies the pointwise convergence of this algorithm, whose convergence was established in the sense of vanishing residuals in [14].
• Still for PDHG, [20] proposes, without proof, an update rule where the ratio of the step-sizes is either quasi non-increasing or quasi non-decreasing. This requirement is similar to, but not directly connected with, ours, where we ask the step-sizes themselves to be quasi-increasing.
Outline of the proof: Theorem 2.1 is proved in the following subsections. We first define in Section 2.2 metrics related to the algorithm step-sizes on the primal-dual product space. As the step-sizes are adaptive, we obtain a sequence of metrics. The proof of Theorem 2.1 is then similar in strategy to those of [1] and [16], but requires novel elements to deal with the variability of the metrics. In Theorem 2.5, we state convergence conditions for an abstract random sequence in a Hilbert space equipped with random variable metrics. In Section 2.4 and Section 2.5 we show that A-SPDHG falls within the scope of Theorem 2.5. We collect all elements and conclude the proof in Section 2.6.

Variable metrics
For a Hilbert space H, we call S(H) the set of bounded self-adjoint linear operators from H to H, and for all M ∈ S(H) we introduce the notation:

By an abuse of notation we write
Coming back to A-SPDHG, let us define for every iteration k ∈ N and every index i ∈ 1, n two block operators of S(X × Y i ) as: and a block operator of S(X × Y ) as: (2.8) The following lemma translates assumptions (i)-(iii) of Theorem 2.1 into properties of the variable metric sequences.
(c) Assumptions (ii) and (iii) of Theorem 2.1 imply that (M k i ) k∈N , (N k i ) k∈N for i ∈ 1, n , and (N k ) k∈N are uniformly a.s. quasi-decreasing.
(d) Assumptions (ii) and (iii) of Theorem 2.1 imply that the sequences (τ k ) k∈N and (σ k i ) k∈N for all i ∈ 1, n are a.s. bounded from above and from below by positive constants. In particular, this implies that there exists α > 0 such that the metrics belong to S α .

Remark 2.4 (Step-sizes induced metrics on the primal-dual product space). The lemma implies that M k i , N k i and N k are positive definite, hence induce a metric on the corresponding spaces. If n = 1 and for constant step-sizes, M k i corresponds to the metric used in [17], where PDHG is reformulated as a proximal point algorithm for a non-trivial metric on the primal-dual product space.
Proof of Lemma 2.3. Assertion (a) of the lemma follows from the fact that for every iterate k ∈ N, the operators M k+1 i , N k+1 i and N k+1 are measurable with respect to the σ-algebra generated by τ k+1 and σ k+1 i , i ∈ 1, n . Assertion (b) follows from equation (6.2) of Lemma 6.1, to be found in the complementary material. The proof of assertion (c) is a bit more involved. Let us assume that assumption (iii) of Theorem 2.1 holds and let (η k 0 ) k∈N and (η k i ) k∈N be the controls of (τ k ) k∈N and (σ k i ) k∈N for i ∈ 1, n respectively. We define the sequence (η k ) k∈N by (2.9), which is a common control on (τ k ) k∈N and (σ k i ) k∈N for i ∈ 1, n as the maximum of a finite number of controls. Let us fix k ∈ N and i ∈ 1, n . Because the intersection of a finite number of measurable events of probability one is again a measurable event of probability one, the corresponding quasi-decrease inequality holds almost surely for all (x, y). (To see that the new control has a bounded sum, consider that (η k ) k∈N is summable, hence converges to 0, hence is smaller than 1/2 for all integers k larger than a certain K; in turn, for all integers k larger than K, the term η k (1 − η k ) −1 is bounded from below by 0 and from above by 2η k , hence is summable.) One can see by a similar proof that (N k ) k∈N is uniformly quasi-decreasing with the same control. To follow with the case of (M k i ) k∈N , we have, as before: Let us conclude with the proof of assertion (d). By assumption (iii), the sequences (τ k ) k∈N and (σ k i ) k∈N are uniformly a.s. quasi-increasing. We define a common control (η k ) k∈N as in (2.9). Then, the sequences (τ k ) k∈N and (σ k i ) k∈N are a.s. bounded from below by the same deterministic constant, which is positive as the initial step-sizes are positive and (η k ) k∈N takes values in [0, 1) and has finite sum. Furthermore, by assumption (ii), the product of the sequences (τ k ) k∈N and (σ k i ) k∈N is almost surely bounded from above. As a consequence, each sequence (τ k ) k∈N and (σ k i ) k∈N is a.s. bounded from above. The equivalence with N k i ∈ S α (X × Y i ) for all i ∈ 1, n , and with N k ∈ S α (X × Y ), is straightforward.
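For intuition, here is the weighted norm ‖u‖ M = √⟨M u, u⟩ induced by a positive definite M ∈ S α (H), illustrated with the diagonal primal-dual metric diag(1/τ, 1/σ) for n = 1, a simplified stand-in for the block operators above (the specific M is an illustrative assumption):

```python
import numpy as np

tau, sigma = 0.5, 0.25
M = np.diag([1.0 / tau, 1.0 / sigma])   # diagonal primal-dual metric, n = 1

def weighted_norm(u, M):
    """||u||_M = sqrt(<M u, u>) for symmetric positive definite M."""
    return float(np.sqrt(u @ (M @ u)))

u = np.array([1.0, 1.0])
# M belongs to S_alpha(H) with alpha equal to its smallest eigenvalue, so
# <M u, u> >= alpha * ||u||^2 for every u:
alpha = float(np.min(np.linalg.eigvalsh(M)))
```

Bounding the step-sizes from above (assertion (d)) is precisely what keeps α bounded away from 0 along the iterations, so all the variable metrics remain uniformly equivalent to the canonical norm.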

Convergence of random C-stable sequences in random variable metrics
Let H be a Hilbert space and C ⊂ H a subset of H. Let (Ω, σ(Ω), P) be a probability space.All random variables in the following are assumed to be defined on Ω and measurable with respect to σ(Ω) unless stated otherwise.Let (Q k ) k∈N be a random sequence of S(H).
A random sequence (u k ) k∈N with values in H is said to be stable with respect to the target C relative to (Q k ) k∈N if for all u ∈ C, the sequence (‖u k − u‖ Q k ) k∈N converges almost surely. The following theorem then states sufficient conditions for the convergence of such sequences.

Theorem 2.5 (Convergence of C-stable sequences). Let H be a separable Hilbert space, C a closed non-empty subset of H, (Q k ) k∈N a random sequence of S(H), and (u k ) k∈N a random sequence of H. If the following conditions are met: (i) (Q k ) k∈N takes values in S α (H) for a given α > 0 and is uniformly a.s. quasi-decreasing, (ii) (u k ) k∈N is stable with respect to the target C relative to (Q k ) k∈N , (iii) every weak sequential cluster point of (u k ) k∈N is almost surely in C, meaning that there exists a measurable subset Ω (iii) of Ω of probability one such that for all ω ∈ Ω (iii) , every weak sequential cluster point of (u k (ω)) k∈N is in C, then (u k ) k∈N almost surely converges weakly to a random variable with values in C.
Stability with respect to a target set C is implied by Fejér and quasi-Fejér monotonicity with respect to C, which have been studied either for random sequences [10] or in the framework of variable metrics [11], but, to the best of our knowledge, not both at the same time. The proof of Theorem 2.5 follows the same lines as [10, Proposition 2.3 (iii)] and uses two results from [11].
Proof. The set C is a subset of the separable Hilbert space H, hence is separable. As C is closed and separable, there exists a countable subset {c n , n ∈ N} of C whose closure is equal to C. Thanks to assumption (ii), there exists for all n ∈ N a measurable subset Ω n (ii) of Ω of probability one such that the sequence (‖u k − c n ‖ Q k ) k∈N converges on Ω n (ii) . Furthermore, let Ω (i) be a measurable subset of Ω of probability one corresponding to the almost sure property of assumption (i). Let Ω̃ = Ω (i) ∩ Ω (iii) ∩ ⋂ n Ω n (ii) . As the intersection of a countable number of measurable subsets of probability one, Ω̃ is itself a measurable subset of Ω with P(Ω̃) = 1. Fix ω ∈ Ω̃ for the rest of the proof.
The sequence (Q k (ω)) k∈N takes values in S α (H) for α > 0 and is quasi-decreasing with control (η k (ω)) k∈N . Furthermore, for all x ∈ C, there exists a sequence (x n ) n∈N with values in {c n , n ∈ N} converging strongly to x. By assumption, for all n ∈ N, the sequence (‖u k (ω) − x n ‖ Q k (ω) ) k∈N converges to a limit, which shall be called l n (ω). For all n ∈ N and k ∈ N, we can write, thanks to the triangle inequality: By taking the limit k → +∞, it follows that: Taking now the limit n → +∞ shows that the sequence (‖u k (ω) − x‖ Q k (ω) ) k∈N converges for all x ∈ C. On the other hand, because ω ∈ Ω (iii) , the weak cluster points of (u k (ω)) k∈N lie in C. Hence, by [11, Theorem 3.3], the sequence (u k (ω)) k∈N converges weakly to a point ū(ω) ∈ C.
We are now equipped to prove Theorem 2.1. We show in Section 2.4 and Section 2.5 that A-SPDHG satisfies points (ii) and (iii) of Theorem 2.5 respectively, and conclude the proof in Section 2.6. Interestingly, the proofs of point (ii) and of point (iii) rely on two different ways of apprehending A-SPDHG. Point (ii) relies on a convex optimisation argument: by taking advantage of the measurability of the primal variable at step k + 1 with respect to F k , one can write a contraction-type inequality relating the conditional expectation of the iterates' norm at step k + 1 to the iterates' norm at step k. Point (iii) relies on monotone operator theory: we use the fact that the update from the half-shifted iterates (y k , x k+1 ) to (y k+1 , x k+2 ) can be interpreted as a step of a proximal-point algorithm on X × Y i , conditionally on i being the index randomly selected at step k.

A-SPDHG is stable with respect to the set of saddle-points
In this section, we show that (x k , y k ) k∈N is stable with respect to C relative to the variable metrics sequence (N k ) k∈N defined in equation (2.8) above. We introduce the operators P ∈ S(Y ) and Σ k ∈ S(Y ), defined respectively by and the functionals (U k ) k∈N , (V k ) k∈N , defined for all (x, y) ∈ X × Y as: We begin by recalling the cornerstone inequality satisfied by the iterates of SPDHG, stated first in [8] and reformulated in [1].
Lemma 2.6 ([1], Lemma 4.1). For every saddle-point (x * , y * ), it a.s. holds that inequality (2.10) is satisfied for all k ∈ N\{0}. The second step is to relate the assumptions of Theorem 2.1 to properties of the functionals appearing in (2.10). Let us introduce Y sparse ⊂ Y , the set of elements (y 1 , . . ., y n ) having at most one non-vanishing component.
Lemma 2.7 (Properties of functionals of interest). Under the assumptions of Theorem 2.1, there exists a non-negative, summable sequence (η k ) k∈N such that a.s. for every iterate k ∈ N and every x ∈ X, y ∈ Y, z ∈ Y sparse : (2.11a) (2.11e) Proof. Let (η k i ) k∈N and (η̃ k i ) k∈N be the controls of (M k i ) k∈N and (N k i ) k∈N respectively, for all i ∈ 1, n . We define the common control (η k ) k∈N by: For all y ∈ Y , we can write which proves (2.11a). Let us now fix x ∈ X, z ∈ Y sparse and k ∈ N. By definition, there exists i ∈ 1, n such that z j = 0 for all j ≠ i. We obtain the inequalities (2.11b)-(2.11d) by writing: Finally, we obtain inequality (2.11e) by writing: where the last inequality is a consequence of (2.5).
Lemma 2.8 (A-SPDHG is C-stable). Under the assumptions of Theorem 2.1, (i) the sequence (x k , y k ) k∈N of Algorithm 2.1 is stable with respect to C relative to (N k ) k∈N , (ii) the following summability and convergence results hold a.s.:
Proof. Let us begin with the proof of point (i). By definition of A-SPDHG with serial sampling, the difference between two consecutive dual iterates is almost surely sparse: Let us define the sequences (a k ) k∈N and (b k ) k∈N , which are a.s. non-negative thanks to (2.11c) and (2.11d). Notice that the primal iterates x l from l = 0 up to l = k + 1 are measurable with respect to F k , whereas the dual iterates y l are measurable with respect to F k only from l = 0 up to l = k. Hence a k and b k are measurable with respect to F k . Furthermore, inequalities (2.10), (2.11a) and (2.11b) imply that, almost surely, for all k ∈ N \ {0}: From the last point in particular, we can write, thanks to (2.11d) and the monotone convergence theorem, that the sequences (‖y k − y k−1 ‖ (P Σ k+1 ) −1 ) k∈N\{0} converge almost surely to 0. Furthermore, sup k E a k < ∞, hence sup k ‖x k − x * ‖ 2 (τ k ) −1 , and in turn sup k ‖x k − x * ‖ (τ k ) −1 , are finite, and by (2.11e) one can write that for k ∈ N \ {0}: We know that (η k ) k∈N is summable, hence converges to 0. As a consequence: To conclude, thanks to the identity the almost sure convergence of (a k ) k∈N implies in turn that of (‖(x k − x * , y k − y * )‖ 2 N k ) k∈N . Let us now turn to point (ii). The first assertion is a straightforward consequence of the above and of bounds (2.11c) and (2.11d). Furthermore, it implies that Σ k ‖(x k+1 − x k , y k − y k−1 )‖ 2 is a.s. finite, hence (x k+1 − x k , y k − y k−1 ) a.s. converges to 0, and so does x k+1 − x k .

Weak cluster points of A-SPDHG are saddle-points
The goal of this section is to prove that A-SPDHG satisfies point (iii) of Theorem 2.5. On the event {I k = i}, the A-SPDHG update procedure can be rewritten as follows. We define T σ,τ i : (x, y) → (x̂, ŷ i ) by: so that (x k+2 , y k+1 i ) = T σ k+2 i ,τ k+2 (x k+1 , y k ) on the event {I k = i} (and y k+1 j = y k j for j ≠ i).

Lemma 2.9 (Cluster points of A-SPDHG are saddle-points). Let (x̄, ȳ) a.s. be a weak cluster point of (x k , y k ) k∈N (meaning that there exists a measurable subset Ω̄ of Ω of probability one such that for all ω ∈ Ω̄, (x̄(ω), ȳ(ω)) is a weak sequential cluster point of (x k (ω), y k (ω)) k∈N ), and assume that the assumptions of Theorem 2.1 hold. Then (x̄, ȳ) is a.s. in C.
Proof. Thanks to Lemma 2.8-(ii) and the monotone convergence theorem, Hence we can deduce that It follows that the series in the expectation is a.s. finite, and since p i > 0, we deduce that (2.13) holds almost surely for all i = 1, . . ., n.
We now consider a sample path (x k , y k ) k∈N which is bounded and such that (2.13) holds.
For each i, by (2.13), the corresponding residuals tend to 0 for i = 1, . . ., n. Then, one has the optimality relations up to errors δ i,k x,y → 0 as k → ∞. Given a test point (x, y), one may write for any k: and, summing all these inequalities, we obtain: where δ k → 0 as k → ∞. We deduce that if (x̄, ȳ) is the weak limit of a subsequence (x k l , y k l −1 ) (as well as, of course, of (x k l , y k l )), then: Since (x, y) is arbitrary, we find that (2.2) holds for (x̄, ȳ).

Proof of Theorem 2.1
Under the assumptions of Theorem 2.1, the set C of saddle-points is closed and non-empty, and X × Y is a separable Hilbert space. By Lemma 2.3, the variable metrics sequence (N k ) k∈N defined in (2.8) satisfies condition (i) of Theorem 2.5. Furthermore, the iterates of Algorithm 2.1 comply with conditions (ii) and (iii) of Theorem 2.5 by Lemma 2.8 and Lemma 2.9 respectively, hence converge almost surely to a point in C.

Algorithmic Design and Practical Implementations
In this section, we present practical instances of our A-SPDHG algorithm, where we specify step-size adjustment rules which satisfy the assumptions of our convergence proof. We extend the adaptive step-size balancing rule for deterministic PDHG proposed in [14] to our stochastic setting, with a minibatch approximation to minimize the computational overhead.

A-SPDHG rule (a) -Tracking & balancing the primal-dual progress
Let us first briefly introduce the foundation of our first numerical scheme, which is built upon the deterministic adaptive PDHG algorithm proposed by Goldstein et al [14], with the iterates: In this foundational work, Goldstein et al [14] proposed to evaluate two sequences (denoted here v * k and d * k ) in order to track and balance the progress of the primal and dual iterates of deterministic PDHG: These two sequences measure the lengths of the primal and dual subgradients for the objective min x∈X max y∈Y g(x) + ⟨Ax, y⟩ − f * (y), as can be seen from the definition of the proximal operator. The primal update of deterministic PDHG can be written as: The optimality condition of the above objective reads: By adding −A * y k+1 to both sides and rearranging the terms, one can derive: and similarly, for the dual update, one can derive: which indicates that the sequences v * k and d * k given by (3.1) effectively track the primal and dual progress of deterministic PDHG; hence Goldstein et al [14] propose to use them as the basis for balancing the primal and dual step-sizes of PDHG.
In light of this, we propose our first practical implementation of A-SPDHG in Algorithm 3.1 as our rule (a), where we use a single dual step-size σ k = σ k j for all iterations k and indices j, and where we estimate the progress towards optimality in the primal and dual variables via the two sequences v k and d k , defined at each iteration k with I k = i as: These are minibatch extensions of (3.1) tailored to our stochastic setting. By balancing them on the fly, adjusting the primal-dual step-size ratio when appropriate, we encourage the algorithm to achieve similar progress in the primal and dual steps, hence improving convergence. To be more specific, as shown in Algorithm 3.1, in each iteration the values of v k and d k are evaluated and compared. If the value of v k (which tracks the primal subgradients) is significantly larger than d k (which tracks the dual subgradients), then the primal progress is slower than the dual progress, and the algorithm boosts the primal step-size while shrinking the dual step-size. If v k is noticeably smaller than d k , the algorithm does the opposite.
Note that here we adopt the ℓ 1 -norm as the length measure for v k and d k , as done by Goldstein et al [14,15], since we also observe numerically a benefit over the more intuitive choice of the ℓ 2 -norm.
In the full-batch case (n = 1), the scheme reduces to the adaptive PDHG proposed in [14,15]. We adjust the ratio between the primal and dual step-sizes according to the ratio between v k and d k , and whenever the step-sizes change, we shrink α (which controls the amplitude of the changes) by a factor η ∈ (0, 1); we typically choose η = 0.995 in our experiments. For s, we choose s = ‖A‖ as our default.
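The balancing decision of rule (a) can be sketched as follows; the function name, the multiplicative update form and the default values are illustrative assumptions, and Algorithm 3.1 is the authoritative version:

```python
def balance_step_sizes(v_norm, d_norm, tau, sigma, alpha, s=1.0, delta=1.5, eta=0.995):
    """One balancing decision (sketch): v_norm and d_norm are the l1-norms
    of the primal and dual progress estimates v_k and d_k."""
    if v_norm > s * delta * d_norm:          # primal lagging: boost tau, shrink sigma
        tau, sigma, alpha = tau / (1.0 - alpha), sigma * (1.0 - alpha), alpha * eta
    elif v_norm < d_norm / (s * delta):      # dual lagging: do the opposite
        tau, sigma, alpha = tau * (1.0 - alpha), sigma / (1.0 - alpha), alpha * eta
    return tau, sigma, alpha                 # the product tau * sigma is invariant
```

When the two residual norms are within a factor s·δ of each other, the step-sizes are left unchanged; every change also damps α by η, so the adjustments die out over the iterations, consistent with the summability assumptions of the convergence theory.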

Reducing the overhead with subsampling:
Unlike the deterministic case, which requires no extra matrix-vector multiplications since A * y k and Ax k can be memorized, our stochastic extension requires the computation of A i x k , since with high probability different subsets are sampled in back-to-back iterations. When using this strategy, we incur at most a 50% overhead in terms of FLOP counts, which is moderate compared to the significant acceleration it brings to SPDHG, especially when the primal-dual step-size ratio is suboptimal, as we will demonstrate in the experiments. Moreover, we found numerically that we can significantly reduce this overhead by approximation tricks such as subsampling: with S k being a random subsampling operator such that E[(S k ) ⊤ S k ] = (1/ρ) Id. In our experiments, we choose 10% subsampling for this approximation; hence the overhead is reduced from 50% to only 5%, which is negligible, without compromising the convergence rate in practice.
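One way to realize a subsampling operator with E[(S k ) ⊤ S k ] = (1/ρ) Id, as stated above, is to keep each coordinate independently with probability ρ and rescale it by 1/ρ; this particular construction is our assumption. A quick Monte-Carlo check:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, dim, trials = 0.1, 50, 20000
v = np.ones(dim)

# S_k keeps each coordinate with probability rho and rescales it by 1/rho.
acc = np.zeros(dim)
for _ in range(trials):
    mask = rng.random(dim) < rho
    Sv = np.where(mask, v / rho, 0.0)    # S_k applied to v
    acc += Sv * Sv                        # diagonal of (S_k)^T S_k, since v = 1
acc /= trials                             # Monte-Carlo estimate of E[(S_k)^T S_k]
```

Each diagonal entry averages to E[bernoulli(ρ)]/ρ² = 1/ρ, matching the stated property, while each individual application of S k touches only about ρ·dim coordinates, which is where the FLOP savings come from.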

A-SPDHG rule (b) -Exploiting angle alignments
More recently, Yokota and Hontani [26] proposed a variant of the adaptive step-size balancing scheme for PDHG, utilizing the angle between an element of ∂g(x k+1 ) + A * y k+1 and the update difference x k − x k+1 .
If these two directions are highly aligned, the primal step-size can be increased for a bigger step; if they form a large angle, the primal step-size should be shrunk. By extending this scheme to the stochastic setting, we obtain another adaptive scheme for SPDHG.

Algorithm 3.1: A-SPDHG, rule (a)
Input: dual step-size σ 0 , primal step-size τ 0 , α 0 ∈ (0, 1), η ∈ (0, 1), δ > 1, probabilities (p i ) 1≤i≤n ; primal variable x 0 , dual variable y 0 .

We present this scheme in Algorithm 3.2 as our rule (b). At iteration k with I k = i, compute: as an estimate of ∂g(x k+1 ) + A * y k+1 , then measure the cosine of the angle between this estimate and x k − x k+1 : The threshold c for the cosine value (which triggers the increase of the primal step-size) typically needs to be very close to 1 (we use c = 0.999), due to the fact that we mostly apply this type of algorithm to high-dimensional problems, following the choice in [26], which was made for deterministic PDHG.
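The alignment test of rule (b) can be sketched as follows; the helper names and the grow/shrink factors are illustrative assumptions, with Algorithm 3.2 as the authoritative version:

```python
import numpy as np

def cosine_alignment(q, dx, eps=1e-12):
    """Cosine of the angle between the subgradient estimate q^{k+1}
    and the update direction x^k - x^{k+1}."""
    return float(q @ dx) / (np.linalg.norm(q) * np.linalg.norm(dx) + eps)

def adjust_primal_step(tau, cos_val, c=0.999, shrink=0.5, grow=2.0):
    # grow tau when the two directions are almost perfectly aligned (cos >= c),
    # shrink it when they form a larger angle; factors are illustrative
    return tau * grow if cos_val >= c else tau * shrink
```

The very tight threshold c reflects the fact that in high-dimensional problems two random directions are almost orthogonal, so only near-perfect alignment is evidence that a larger primal step is safe.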
Recently, Zdun et al [27] proposed a heuristic similar to our rule (b), but they choose q k+1 to be an approximation of an element of ∂g(x k+1 ) instead of ∂g(x k+1 ) + A * y k+1 . Our choice follows more closely the original scheme of Yokota and Hontani [26]. We found numerically that their scheme is not competitive in our settings.

Numerical Experiments
In this section, we present numerical studies of the proposed schemes on one of the most typical imaging inverse problems, Computed Tomography (CT). We compare the A-SPDHG algorithm with the original SPDHG for different choices of the starting ratio of the primal and dual step-sizes. In our CT imaging examples, we seek to reconstruct tomographic images from fan-beam X-ray measurement data by solving the following TV-regularized objective: where D denotes the 2D differential operator, A ∈ R m×d and x ∈ R d . We consider three fan-beam CT imaging modalities: sparse-view CT, low-dose CT and limited-angle CT. We test A-SPDHG and SPDHG on two images of different sizes (Example 1 uses a phantom image of size 1024 × 1024, while Example 2 uses an image from the Mayo Clinic dataset [21] of size 512 × 512) and on 4 different starting ratios (10 −3 , 10 −5 , 10 −7 and 10 −9 ). For both algorithms, we partition the measurement data and operator into n = 10 minibatches in an interleaved fashion: we first collect all the X-ray measurements, listed consecutively from 0 to 360 degrees, to form the full A and b, and then group every 10th measurement into one minibatch to form the partition {A i } 10 i=1 and {b i } 10 i=1 . For A-SPDHG, we use the approximation step for d k presented in (3.7) with 10% subsampling, hence the computational overhead is negligible in this experiment. We initialize all algorithms from a zero image.
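The interleaved partition described above can be sketched as follows; indexing the measurements by one angle per degree is an illustrative simplification:

```python
import numpy as np

n = 10
angles = np.arange(360)                       # one measurement index per degree
# minibatch i gets angles i, i+10, i+20, ... so each minibatch spans the full
# angular range, rather than a contiguous wedge of angles
batches = [angles[i::n] for i in range(n)]
```

Interleaving keeps every minibatch geometrically representative of the whole scan, which is generally preferable for block methods to partitioning into contiguous angular sectors.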
We present our numerical results in Figures 1, 2, 3 and 6. In these plots, we compare the convergence rates of the algorithms in terms of the number of iterations (the execution time per iteration is almost the same for the two algorithms, as the overhead of A-SPDHG is numerically trivial). Figures 1 and 2 report the results for large-scale sparse-view CT experiments on a phantom image and on a lung CT image from the Mayo Clinic dataset [21]; Figure 3 reports the results for low-dose CT experiments, where we simulate a large number of measurements corrupted with a significant amount of Poisson noise; and Figure 6 reports the results for limited-angle CT, where only measurement angles from 0 to 150 degrees are present, while the measurements from the remaining (150, 360] degrees are missing. In all these examples, we consistently observe that no matter how we initialize the primal-dual step-size ratio, A-SPDHG automatically adjusts the step-size ratio towards the optimal choice, which is around either 10 −5 or 10 −7 for these four CT problems, and significantly outperforms vanilla SPDHG in the cases where the starting ratio is far from the optimal range. Meanwhile, even in the cases where the starting ratio of the SPDHG algorithm is near-optimal, we observe consistently in most of these examples that our scheme outperforms the vanilla SPDHG algorithm locally after a certain number of iterations (highlighted by the vertical dashed lines in the relevant subfigures), which further indicates the benefit of adaptivity for this class of algorithms. Note that throughout these different examples we use only one fixed set of parameters for A-SPDHG, suggested in the previous section, which again indicates the strong practicality of our scheme.
For the low-dose CT example, we run two extra sets of experiments: one with a larger number of minibatches (40), reported in Figure 4, and one warm-started from a better initialization obtained via filtered backprojection, reported in Figure 5. In all these extra examples we consistently observe superior performance of A-SPDHG over vanilla SPDHG, especially when the primal-dual step-size ratio is suboptimal. Interestingly, we found that warm-starting has no noticeable impact on the comparative performance of SPDHG and A-SPDHG. This is mainly because SPDHG with a suboptimal primal-dual step-size ratio converges very slowly in the high-accuracy regime in practice (see Figure 5(d) for example), hence warm-starting does not help much here.
We should also note that, conceptually, all the hyperparameters in our adaptive schemes control the degree of adaptivity of the algorithm (and for extreme choices we recover vanilla SPDHG). In Figures 7 and 9, we present numerical studies on the choices of the hyperparameters of rule (a) and rule (b) of the A-SPDHG algorithm, with the starting primal-dual step-size ratio fixed at 10^−7. For rule (a), we found that the performance is robust to the choice of the starting shrinking rate α_0, the shrinking speed η and the gap δ. Overall, these parameters have a weak impact on the convergence of rule (a) and are easy to choose.
For rule (b), we found that the performance is more sensitive to the choice of the parameters c and η than for rule (a), although the dependence is still weak. Our numerical studies suggest that rule (a) is the better-performing choice, but each rule has certain mild weaknesses (rule (a) incurs a slight computational overhead, which can be partially addressed with the subsampling scheme, while rule (b) is often slower than rule (a)); these require further study and improvement. Nevertheless, we emphasize that all these parameters essentially control the degree of adaptivity of the algorithm and are fairly easy to choose: for all the CT experiments, with varying sizes, dimensions and modalities, we use only one fixed set of hyperparameters in A-SPDHG and already consistently observe numerical improvements over vanilla SPDHG.
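The role these hyperparameters play can be illustrated with a minimal sketch of a ratio-balancing update. The decay schedule α_k = α_0/(1 + ηk), the residual-based trigger with gap δ, and the function below are illustrative assumptions for exposition only; they are not the exact rules (a) and (b) of the paper.

```python
def adapt_step_sizes(tau, sigma, k, primal_res, dual_res,
                     alpha0=1.0, eta=0.1, delta=1.5):
    """One sketchy adaptive rebalancing step (illustrative, not the paper's rule).

    The adjustment rate alpha_k shrinks with k, so the step-size ratio can
    move quickly at first but stabilizes later on. Choosing alpha0 = 0
    disables adaptivity entirely, recovering fixed step-sizes.
    """
    alpha = alpha0 / (1.0 + eta * k)          # shrinking adjustment rate
    if primal_res > delta * dual_res:         # primal progress lagging
        tau, sigma = tau * (1 + alpha), sigma / (1 + alpha)
    elif dual_res > delta * primal_res:       # dual progress lagging
        tau, sigma = tau / (1 + alpha), sigma * (1 + alpha)
    return tau, sigma                         # product tau * sigma is preserved

# The product of the step-sizes is invariant under the update, so a
# convergence condition imposed on tau * sigma keeps holding along iterations.
tau, sigma = 1e-3, 1e2
new_tau, new_sigma = adapt_step_sizes(tau, sigma, k=5, primal_res=3.0, dual_res=1.0)
assert abs(new_tau * new_sigma - tau * sigma) < 1e-12
assert new_tau > tau  # primal step enlarged when the primal residual dominates
```

The design point this sketch captures is the one stated above: the hyperparameters (α_0, η, δ here) only tune how aggressively and for how long the ratio is allowed to move, which is why extreme choices fall back to vanilla SPDHG.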

Conclusion
In this work we propose a new framework (A-SPDHG) for adaptive step-size balancing in stochastic primal-dual hybrid gradient methods. We first derive theoretical sufficient conditions on the adaptive primal and dual step-sizes that ensure convergence in the stochastic setting. We then propose a number of practical schemes satisfying these conditions, and our numerical results on imaging inverse problems support the effectiveness of the proposed approach.
To our knowledge, this work constitutes the first theoretical analysis of adaptive step-sizes for a stochastic primal-dual algorithm. Our ongoing work includes the theoretical analysis and algorithmic design of further accelerated stochastic primal-dual methods with line-search schemes for even faster convergence rates.

Complementary material for Section 2
We begin with a useful lemma (Lemma 6.1). The non-negativity of the quadratic form associated with M proves the direct implication of (6.1). For the converse implication, consider x ∈ X \ {0} such that ‖Px‖ = ‖P‖‖x‖, and set y = −λPx for a scalar λ. Then the non-negativity of the polynomial

⟨(x, y), M(x, y)⟩ / ‖x‖^2 = b‖P‖^2 λ^2 − 2‖P‖^2 λ + a for all λ ∈ R

implies that ‖P‖^4 − ab‖P‖^2 ≤ 0, which is equivalent to the desired conclusion (ab)^{−1/2}‖P‖ ≤ 1. This proves the equivalence (6.1).

Let us now turn to the proof of Lemma 2.2.

Proof of Lemma 2.2. Assume that the step-sizes satisfy the assumptions of the lemma. Then Assumption (i) of Theorem 2.1 is straightforwardly satisfied. Moreover, for i ∈ {1, …, n}, the product sequence (τ^k σ_i^k)_{k∈N} is constant along the iterations by equation (2.6); since it satisfies equation (2.5) at iterate k = 0, it satisfies (2.5) for all k ∈ N with β = max_i τ^0 σ_i^0 ‖A_i‖^2 / p_i, which proves Assumption (ii). Finally, equation (2.7) implies that Assumption (iii) is satisfied.
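As a sanity check on the equivalence used in this proof, one can verify numerically that for a, b > 0 the block matrix [[aI, Pᵀ], [P, bI]] is positive semidefinite exactly when (ab)^{−1/2}‖P‖ ≤ 1. The finite-dimensional instance below assumes this standard block form for M; it is an illustration, not part of the paper's argument.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 0.7, 1.3
d, m = 5, 4
P = rng.standard_normal((m, d))

# Rescale P so that ||P|| = sqrt(ab) exactly: the boundary case of the lemma.
P = P * (np.sqrt(a * b) / np.linalg.norm(P, 2))

def M(scale):
    """Symmetric block matrix [[a I, (scale*P)^T], [scale*P, b I]]."""
    Q = scale * P
    return np.block([[a * np.eye(d), Q.T], [Q, b * np.eye(m)]])

# (ab)^{-1/2} ||P|| <= 1  <=>  M is positive semidefinite.
assert np.linalg.eigvalsh(M(0.99)).min() > -1e-10  # strictly inside: PSD
assert np.linalg.eigvalsh(M(1.00)).min() > -1e-8   # boundary: PSD up to rounding
assert np.linalg.eigvalsh(M(1.01)).min() < 0       # outside: not PSD
```

The check mirrors the proof: reducing M by the singular vectors of P gives 2×2 blocks [[a, σ], [σ, b]], whose smallest eigenvalue crosses zero precisely when σ^2 = ab.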

Ethical approval
This declaration is not applicable.

Competing interests
There are no competing interests to declare.

Authors' contribution
CD, MJE and AC elaborated the proof strategy, and CD wrote parts 1, 2 and 6. JT worked on the algorithmic design, performed the numerical experiments and wrote parts 3-5. All authors reviewed the manuscript.

Availability of data and materials
The related implementation of the algorithms and the image data used in the experiments will be made available at https://junqitang.com. For the phantom image example we use the phantom from the experimental section of [8], while for the lung CT image example we use an image from the Mayo Clinic Dataset [21], which is publicly available.

Figure 2: Comparison between SPDHG and A-SPDHG on Sparse-View CT (Example 2), with a variety of starting primal-dual step-size ratios. Here the forward operator A ∈ R^{m×d} with m = 92160 and d = 262144. We include the images reconstructed by the algorithms at termination (50th epoch).

Lemma 6.1. Let a, b be positive scalars, β ∈ (0, 1), and P a bounded linear operator from a Hilbert space.

Figure 7: Test on different choices of parameters of A-SPDHG (rule (a)) on the X-ray Low-Dose fan-beam CT example, with starting primal-dual step-size ratio 10^−7. We can observe that the performance of A-SPDHG has only a minor dependence on these parameter choices.

Figure 8: Test on the choices for δ.

Figure 9: Test on different choices of parameters of A-SPDHG (rule (b)) on the X-ray Low-Dose fan-beam CT example, with starting primal-dual step-size ratio 10^−7.