Cocoercivity, Smoothness and Bias in Variance-Reduced Stochastic Gradient Methods

With the purpose of examining biased updates in variance-reduced stochastic gradient methods, we introduce SVAG, a SAG/SAGA-like method with adjustable bias. SVAG is analyzed in a cocoercive root-finding setting, a setting which yields the same results as the usual smooth convex optimization setting for the ordinary proximal-gradient method. We show that the same is not true for SVAG when biased updates are used: the step-size requirements when the operators are gradients are significantly less restrictive than when they are not. This highlights the need to not rely solely on cocoercivity when analyzing variance-reduced methods meant for optimization. Our analysis either matches or improves on previously known convergence conditions for SAG and SAGA. However, in the biased cases these conditions still do not correspond well with practical experience, and we therefore examine the effect of bias numerically on a set of classification problems. The choice of bias seems to primarily affect the early stages of convergence and in most cases the differences vanish in the later stages. The effect of the bias choice is, however, still significant in a couple of cases.


Introduction
Variance-reduced stochastic gradient (VR-SG) methods are a family of iterative optimization algorithms that combine the low per-iteration computational cost of ordinary stochastic gradient descent with the attractive convergence properties of gradient descent. Just as ordinary stochastic gradient descent, VR-SG methods solve smooth optimization problems on finite sum form,

minimize_{x ∈ R^N} (1/n) Σ_{i=1}^n f_i(x),   (1)

where, for all i ∈ {1, . . . , n}, f_i : R^N → R is a convex function that is L-smooth, i.e., f_i is differentiable with L-Lipschitz continuous gradient. These types of problems are common in model fitting, supervised learning, and empirical risk minimization which, together with the nice convergence properties of VR-SG methods, has led to a great amount of research on VR-SG methods and the development of several different variants, e.g., [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15].
Broadly speaking, VR-SG methods form a stochastic estimate of the objective gradient by combining one or a few newly evaluated terms of the gradient with all previously evaluated terms. Classic examples of this can be seen in the SAG [1,2] and SAGA [3] algorithms. Given some initial iterates x^0, y^0_1, . . . , y^0_n ∈ R^N and step-size λ > 0, SAGA samples i_k uniformly from {1, . . . , n} and then updates the iterates as

x^{k+1} = x^k − λ( ∇f_{i_k}(x^k) − y^k_{i_k} + (1/n) Σ_{i=1}^n y^k_i ),
y^{k+1}_{i_k} = ∇f_{i_k}(x^k),   y^{k+1}_j = y^k_j for all j ≠ i_k,

for k ∈ {0, 1, . . . }. The update of x^{k+1} is said to be unbiased since the expected value of x^{k+1} at iteration k is equal to an ordinary gradient descent update. This is in contrast to the biased SAG, which is identical to SAGA except that the update of x^{k+1} is

x^{k+1} = x^k − λ( (1/n)(∇f_{i_k}(x^k) − y^k_{i_k}) + (1/n) Σ_{i=1}^n y^k_i ),

and the expected value of x^{k+1} now includes a term containing the old gradients (1/n) Σ_{i=1}^n y^k_i. Although SAG shows that unbiasedness is not essential for the convergence of VR-SG methods, the effects of this bias are unclear. The majority of VR-SG methods are unbiased, but existing works have not established any clear advantage of either the biased SAG or the unbiased SAGA. This paper will examine the effect of bias and its interplay with different problem assumptions for SAG/SAGA-like methods.
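For concreteness, the two update rules can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation; the function names and the least-squares usage below are our own.

```python
import numpy as np

def saga_step(x, y, grads, lam, rng):
    """One SAGA iteration (unbiased gradient estimate).

    x     : current iterate
    y     : (n, N) table of stored gradients y_i (updated in place)
    grads : list of n callables, grads[i](x) = grad f_i(x)
    lam   : step-size
    """
    n = y.shape[0]
    i = rng.integers(n)
    g_new = grads[i](x)
    # Unbiased: averaging the estimate over i recovers the full gradient.
    estimate = (g_new - y[i]) + y.mean(axis=0)
    x_next = x - lam * estimate
    y[i] = g_new  # refresh the table entry for the sampled index
    return x_next

def sag_step(x, y, grads, lam, rng):
    """One SAG iteration: the innovation is down-weighted by 1/n (biased)."""
    n = y.shape[0]
    i = rng.integers(n)
    g_new = grads[i](x)
    estimate = (g_new - y[i]) / n + y.mean(axis=0)
    x_next = x - lam * estimate
    y[i] = g_new
    return x_next
```

A typical usage is a least-squares problem with f_i(x) = (1/2)(a_i^T x − b_i)^2, where both steps drive the iterates to the same minimizer; only the per-iteration estimate differs.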

Problem and Algorithm
Instead of solving (1) directly, we consider a closely related but more general root-finding problem. Throughout the paper, we consider the Euclidean space R^N and the problem of finding x ∈ R^N such that

0 = Rx := (1/n) Σ_{i=1}^n R_i x,   (2)

where R_i : R^N → R^N is (1/L)-cocoercive-see Section 2-for all i ∈ {1, . . . , n}. Since L-smoothness of a convex function is equivalent to (1/L)-cocoercivity of its gradient [16, Corollary 18.17], the smooth optimization problem in (1) can be recovered by setting R_i = ∇f_i for all i ∈ {1, . . . , n} in (2). Problem (2) is also interesting in its own right, with it and the closely related fixed point problem of finding x ∈ R^N such that x = (Id − αR)x, where α ∈ (0, 2L^{−1}), both having applications in, for instance, feasibility and non-linear signal recovery problems, see [17][18][19] and the references therein. To solve this problem, we present the Stochastic Variance Adjusted Gradient (SVAG) algorithm.

Algorithm 1 SVAG
input: single-valued operators R_i : R^N → R^N, initial state x^0 ∈ R^N and y^0_1, . . . , y^0_n ∈ R^N, step-size λ > 0, innovation weight θ ∈ R
for k = 0, 1, . . . do
    Sample i_k uniformly from {1, . . . , n}
    x^{k+1} = x^k − λ( (θ/n)(R_{i_k} x^k − y^k_{i_k}) + (1/n) Σ_{j=1}^n y^k_j )
    y^{k+1}_{i_k} = R_{i_k} x^k and y^{k+1}_j = y^k_j for all j ≠ i_k
end for

SVAG is heavily inspired by SAG and SAGA, with both being special cases: θ = 1 and θ = n respectively. Just like SAG and SAGA, in each iteration SVAG evaluates one operator R_{i_k} and stores the result in y^{k+1}_{i_k}. An estimate of the full operator is then formed as Rx^k ≈ R^k = (θ/n)(R_{i_k} x^k − y^k_{i_k}) + (1/n) Σ_{j=1}^n y^k_j.
The scalar θ determines how much weight should be put on the new information gained from evaluating R_{i_k} x^k. If the innovation, R_{i_k} x^k − y^k_{i_k}, is highly correlated with the total innovation, Rx^k − (1/n) Σ_{j=1}^n y^k_j, a large innovation weight θ can be chosen, and vice versa. The innovation weight θ also determines the bias of SVAG. Taking the expected value of R^k given the information at iteration k gives

E[R^k] = (θ/n)( Rx^k − (1/n) Σ_{j=1}^n y^k_j ) + (1/n) Σ_{j=1}^n y^k_j,

which reveals that R^k is an unbiased estimate of Rx^k if θ = n, i.e., in the SAGA case. Any other choice, for instance SAG where θ = 1, yields a bias towards (1/n) Σ_{j=1}^n y^k_j.
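The bias calculation above is easy to check numerically. The sketch below is illustrative only, with hypothetical linear operators standing in for the R_i; averaging the estimate over all n choices of the sampled index reproduces the conditional expectation.

```python
import numpy as np

def svag_estimate(x, y, ops, theta, i):
    """SVAG estimate of Rx for sampled index i.

    x     : current iterate
    y     : (n, N) table of stored operator evaluations y_j
    ops   : list of n callables; ops[j](x) plays the role of R_j x
    theta : innovation weight (theta = n gives SAGA, theta = 1 gives SAG)
    """
    n = len(ops)
    return (theta / n) * (ops[i](x) - y[i]) + y.mean(axis=0)
```

With θ = n the average over all indices equals Rx exactly (unbiased); with any other θ it is pulled towards the table average (1/n) Σ_j y_j.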

Contribution
The theory behind finding roots of monotone operators in general, and cocoercive operators in particular, has been put to good use when analyzing first order optimization methods; examples include [16,[20][21][22][23][24]. For instance, both the proximal-gradient and ADMM methods can be seen as instances of classic root-finding fixed-point iterations and analyzed as such, namely forward-backward and Douglas-Rachford splitting respectively. The resulting analyses are often simple and intuitive and, even though the root-finding formulation is more general-not all cocoercive operators are gradients of convex functions-they are not necessarily more conservative. For example, analyzing proximal-gradient as forward-backward splitting yields the same rates and step-size conditions as analyzing it as a minimization method in the smooth/cocoercive setting, see for instance [25, Theorem 2.1.14] and [16, Example 5.18 and Proposition 4.39]. However, the main contribution of this paper is to show that the same is not true for VR-SG methods; in particular, it is not true for SVAG when it is biased.
The results consist of two main convergence theorems for SVAG: one in the cocoercive operator case and one in the cocoercive gradient case, the latter being equivalent to the minimization of a smooth and convex finite sum. Both of these theorems match or improve upon previously known results for the SAG and SAGA special cases. Comparing the two settings reveals that SVAG can use significantly larger step-sizes, with faster convergence as a result, in the cocoercive gradient case compared to the general cocoercive operator case. In the operator case, an upper bound on the step-size that scales as O(n^{−1}) is found, where n is the number of terms in (2). However, the restrictions on the step-size loosen with reduced bias and the unfavorable O(n^{−1}) scaling disappears completely when SVAG is unbiased. In the gradient case, this bad scaling never occurs, regardless of bias. We provide examples in which SVAG diverges with step-sizes larger than the theoretical upper bounds in the operator case. Since the gradient case is proven to converge with much larger step-sizes, this verifies the difference between the convergence behavior of cocoercive operators and gradients.
These results indicate that it is inadvisable to rely only on more general monotone operator theory, without explicitly using the gradient property, when analyzing VR-SG methods meant for optimization. However, the large impact of bias in the cocoercive operator setting also raises the question of its importance in other non-gradient settings. One such setting of interest, where the operators are not gradients of convex functions, is that of saddle-point problems. These problems are important in optimization due to their use in primal-dual methods, but recently they have also gained a lot of attention due to their applications in the training of GANs in machine learning. Because of this, and due to the attractive properties of VR-SG methods in the convex optimization setting, efforts have gone into applying VR-SG methods to saddle-point problems as well [26][27][28][29][30]. Most of these efforts have been unbiased, something our analysis suggests is wise. With that said, it is important to note that our analysis is often not directly applicable due to the fact that saddle-point problems are rarely cocoercive.
The main reason for the recent rise in popularity of variance-reduced stochastic methods is their use in the optimization setting but, although bias plays a big role in the cocoercive operator case, our results are not as clear in this setting. For instance, the theoretical results for the SAG and SAGA special cases yield identical rates and step-size conditions, with no clear advantage to either special case. Further experiments are therefore performed in which several different choices of bias in SVAG are examined on a set of logistic regression and SVM optimization problems. The results of these experiments are in line with existing works, with no significant advantage to any particular bias choice in SVAG, these choices including both SAG and SAGA. Although the performance difference is significant in some cases, no single choice of bias performs best for all problems, and all bias choices eventually converge with the same rate in the majority of the cases. Furthermore, the theoretical maximal step-size can routinely be exceeded in these experiments, indicating that there is room for further theoretical improvements.

Related Work
There is a large array of options for solving (2). For n ∈ {1, 2, 3, 4}, several operator splitting methods exist with varying assumptions on the operator properties, see for instance [22,24,[31][32][33][34] and the references therein. While these methods can also be applied for larger n by simply regrouping the terms, they do not utilize the finite sum structure of the problem. Algorithms have therefore been designed to utilize this structure for arbitrarily large n with the hope of reducing the total computational cost, e.g., [19,[35][36][37]. In particular, the problem and method in [19] are closely related to the root-finding problem and algorithm considered in this paper.
Using the notation of [19], when T_0 = Id, the fixed point problem of [19] can be mapped to (2) via R_i = ω_i(Id − T_i) and vice versa. Many applications considered in [19] can therefore, at least in part, be tackled with our algorithm as well. In particular, the problem of finding common fixed points of firmly nonexpansive operators can be solved directly by our algorithm. However, [19] is more general in that it allows for T_0 ≠ Id and works in general real Hilbert spaces. Comparing with the algorithm of [19] we see that, just as our algorithm is a generalization of SAG/SAGA, it can be seen as a generalization of Finito [8], another classic VR-SG method. It generalizes Finito in several ways; for instance, it allows for an additional proximal/backward step and it replaces the stochastic selection with a different selection criterion. However, in the optimization setting it still suffers from the same drawback as Finito when compared to SAG/SAGA-like algorithms: it needs to store a full copy of the iterate for each term in the objective. Since SAG, SAGA, and SVAG only need to store the gradient of each term, they can utilize any potential structure of the gradients to reduce the storage requirements [1]. Although the differences above are interesting in their own right, the notion of bias we examine in this paper is not applicable to Finito-like algorithms.
SAG and SAGA were compared in [3] but with no direct focus on the effects of bias. Other examples of research on SAG and SAGA include acceleration, sampling strategy selection, and ways to reduce the memory requirement [6,[38][39][40][41][42]. However, none of these works, including [41] which was written by the authors, analyze the biased case we consider in this paper. Even the works considering non-uniform sampling of gradients [38][39][40][41] perform some sort of bias correction in order to remain unbiased. Furthermore, in order to keep the focus on the effects of the bias, we have refrained from bringing such generalizations into this work, making it distinct from the above research. To the authors' knowledge, the only theoretical convergence results for biased VR-SG methods are the ones for SAG [1,2]. But, since they only consider SAG, they fail to capture the breadth of SVAG, and our proof is the first to simultaneously capture SAG, SAGA, and more.
Since the release of the first preprint of this paper, [43] has also provided a proof covering the gradient case of both SAG and SAGA, and some choices of bias in SVAG. Although [43] does not consider cocoercive operators, it is in some sense more general since it considers a general biased stochastic estimator of the gradient. This generality comes at the cost of a more conservative analysis, with their step-size scaling as O(n^{−1}) in all cases.

Preliminaries and Notation
Let R denote the real numbers and let the natural numbers be denoted N = {0, 1, 2, . . . }. Let ⟨·, ·⟩ denote the standard Euclidean inner product and ‖·‖ = √⟨·, ·⟩ the standard 2-norm. The scaled inner product and norm we denote as ⟨·, ·⟩_Σ = ⟨Σ(·), ·⟩ and ‖·‖_Σ = √⟨·, ·⟩_Σ, where Σ is a positive definite matrix. If Σ is not positive definite, ‖·‖_Σ is not a norm but we keep the notation for convenience.
Let n be the number of operators in (2). The vector 1 is the vector of all ones in R^n and e_i is the vector in R^n of all zeros except the i:th element, which contains a 1. The matrix I is an identity matrix with the size derived from context and E_i = e_i e_i^T. The symbol ⊗ denotes the Kronecker product of two matrices. The Kronecker product is linear in both arguments and the following properties hold:

(A ⊗ B)^T = A^T ⊗ B^T,   (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).

In the last property it is assumed that the dimensions are such that the matrix multiplications are well defined. The eigenvalues of A ⊗ B are given by

τ_i μ_j,   (3)

where τ_i and μ_j are the eigenvalues of A and B respectively.
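These Kronecker identities are easy to sanity-check numerically. The following sketch (our own illustration, using random symmetric matrices so the spectra are real) verifies the mixed-product rule and the eigenvalue formula:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A + A.T  # symmetric -> real spectrum
B = rng.standard_normal((2, 2)); B = B + B.T

# Eigenvalues of A kron B are all products tau_i * mu_j.
tau = np.linalg.eigvalsh(A)
mu = np.linalg.eigvalsh(B)
kron_eigs = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
products = np.sort(np.outer(tau, mu).ravel())
assert np.allclose(kron_eigs, products)

# Transpose rule and mixed-product rule.
C = rng.standard_normal((3, 3))
D = rng.standard_normal((2, 2))
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
```

The mixed-product rule is what later lets expressions such as (U_i ⊗ I)B be manipulated blockwise without ever forming large matrices explicitly.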
The Cartesian product of two sets C_1 and C_2 is defined as C_1 × C_2 = {(x, y) : x ∈ C_1, y ∈ C_2}. From this definition we see that if C_1 and C_2 are closed and convex, so is C_1 × C_2. Let X be the set of all solutions of (2), and define Z as the set of primal-dual solutions,

Z = {(x, R_1 x, . . . , R_n x) : x ∈ X }.

Assuming they exist, x denotes a solution to (2) and z denotes a primal-dual solution, i.e., x ∈ X and z ∈ Z.
A single valued operator R : R^N → R^N is (1/L)-cocoercive if

⟨Rx − Ry, x − y⟩ ≥ (1/L)‖Rx − Ry‖²

holds for all x, y ∈ R^N. An operator that is (1/L)-cocoercive is L-Lipschitz continuous. The set of zeros of a cocoercive operator R is closed and convex.
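As a numerical illustration of this definition (a sketch of our own; the quadratic example and the rotation counterexample are not from the paper), the gradient of a convex quadratic is (1/L)-cocoercive, while a pure rotation is 1-Lipschitz but not cocoercive for any L:

```python
import numpy as np

rng = np.random.default_rng(2)

# R = Q for PSD Q is the gradient of f(x) = 0.5 x^T Q x and is
# (1/L)-cocoercive with L = ||Q||_2 (the largest eigenvalue).
Q = rng.standard_normal((4, 4)); Q = Q.T @ Q
L = np.linalg.norm(Q, 2)
for _ in range(100):
    x, z = rng.standard_normal(4), rng.standard_normal(4)
    lhs = (Q @ x - Q @ z) @ (x - z)
    rhs = np.linalg.norm(Q @ x - Q @ z) ** 2 / L
    assert lhs >= rhs - 1e-9

# A 90-degree rotation is 1-Lipschitz, but <Rx - R0, x - 0> = 0
# while ||Rx|| = ||x||, so no cocoercivity inequality can hold.
R90 = np.array([[0.0, -1.0], [1.0, 0.0]])
u = np.array([1.0, 0.0])
assert abs((R90 @ u) @ u) < 1e-12
assert abs(np.linalg.norm(R90 @ u) - 1.0) < 1e-12
```

The rotation example is the prototype of a Lipschitz operator that is far from being a gradient, and it reappears in the numerical experiments of Section 4.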
A differentiable function f : R^N → R is convex if

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩

holds for all x, y ∈ R^N.
For more details regarding monotone operators and convex functions see [16,25].
To establish almost sure sequence convergence of the stochastic algorithm, the following propositions will be used. The first is from [44] and establishes convergence of non-negative almost super-martingales. The second is based on [45] and provides the tool to show almost sure sequence convergence.
Proposition 2.1. Let (Ω, F, P) be a probability space and F_0 ⊂ F_1 ⊂ . . . be a sequence of sub-σ-algebras of F. For all k ∈ N, let z_k, β_k, ξ_k and ζ_k be non-negative F_k-measurable random variables. If

E[z_{k+1} | F_k] ≤ (1 + ξ_k)z_k + β_k − ζ_k,   Σ_{k=0}^∞ ξ_k < ∞,   Σ_{k=0}^∞ β_k < ∞

hold almost surely for all k ∈ N, then z_k converges a.s. to a finite valued random variable and Σ_{k=0}^∞ ζ_k < ∞ almost surely.

Proposition 2.2. Let Z be a non-empty closed subset of a finite dimensional Hilbert space H, let φ : [0, ∞) → [0, ∞) be a strictly increasing function such that φ(t) → ∞ as t → ∞, and let (x^k)_{k∈N} be a sequence of H-valued random variables. If φ(‖x^k − z‖) converges a.s. to a finite valued non-negative random variable for all z ∈ Z, then the following hold: (i) (x^k)_{k∈N} is bounded almost surely; (ii) if the cluster points of (x^k)_{k∈N} are a.s. in Z, then (x^k)_{k∈N} converges a.s. to a Z-valued random variable.
Proof. In finite dimensional Hilbert spaces, these two statements coincide with the corresponding statements of [45].

Convergence
Throughout the analysis we will use the following two assumptions on the operators of (2).

Assumption 3.1. For all i ∈ {1, . . . , n}, the operator R_i : R^N → R^N is (1/L)-cocoercive.

Assumption 3.2. Assumption 3.1 holds and, for all i ∈ {1, . . . , n}, R_i = ∇f_i for some differentiable convex function f_i : R^N → R.

Reformulation
We begin by formalizing and reformulating Algorithm 1 into a more convenient form. Let (Ω, F, P) be the underlying probability space of Algorithm 1. The index selected at iteration k is then a uniformly distributed random variable i_k : Ω → {1, . . . , n} and z^k = (x^k, y^k_1, . . . , y^k_n), k ∈ N, are the iterates of Algorithm 1. Let F_0 ⊂ F_1 ⊂ . . . be a sequence of sub-σ-algebras of F such that z^k is F_k-measurable and i_k is independent of F_k. With the operator B : R^{N(n+1)} → R^{2Nn} defined as B(x, y_1, . . . , y_n) = (R_1 x, . . . , R_n x, y_1, . . . , y_n), one iteration of Algorithm 1 can be written as

z^{k+1} = z^k − (U_{i_k} ⊗ I)Bz^k,   (6)

where z^0 ∈ R^{N(n+1)} is given and

U_i = [ (λθ/n)e_i^T   (λ/n)1^T − (λθ/n)e_i^T ;  −E_i   E_i ] ∈ R^{(n+1)×2n}.

The vector e_i and the matrix E_i are defined in Section 2.
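The reformulation can be sanity-checked numerically. The sketch below is our own verification code: it builds a U_i consistent with Algorithm 1 (first row carrying the x-update, remaining rows swapping y_i for R_i x), with random linear operators standing in for the R_i, and confirms that z − (U_i ⊗ I)Bz reproduces one SVAG iteration.

```python
import numpy as np

def U(i, n, lam, theta):
    """(n+1) x 2n block matrix; a reconstruction consistent with Algorithm 1."""
    e = np.zeros(n); e[i] = 1.0
    top = np.concatenate([(lam * theta / n) * e,
                          (lam / n) * np.ones(n) - (lam * theta / n) * e])
    E = np.diag(e)
    return np.vstack([top, np.hstack([-E, E])])

def B(z, ops, N):
    """B(x, y_1, ..., y_n) = (R_1 x, ..., R_n x, y_1, ..., y_n)."""
    x, y = z[:N], z[N:]
    Rx = np.concatenate([op(x) for op in ops])
    return np.concatenate([Rx, y])

# Compare against the explicit SVAG update on a random linear problem.
rng = np.random.default_rng(3)
n, N, lam, theta, i = 4, 2, 0.1, 2.5, 1
mats = [rng.standard_normal((N, N)) for _ in range(n)]
ops = [lambda v, M=M: M @ v for M in mats]
z = rng.standard_normal(N * (n + 1))

z_next = z - np.kron(U(i, n, lam, theta), np.eye(N)) @ B(z, ops, N)

x, y = z[:N], z[N:].reshape(n, N)
est = (theta / n) * (ops[i](x) - y[i]) + y.mean(axis=0)
x_ref = x - lam * est                 # x-update of Algorithm 1
y_ref = y.copy(); y_ref[i] = ops[i](x)  # table update of Algorithm 1
assert np.allclose(z_next[:N], x_ref)
assert np.allclose(z_next[N:].reshape(n, N), y_ref)
```

Writing the iteration this way turns the algorithm into a fixed-point iteration on the stacked variable z, which is what makes the Lyapunov arguments below tractable.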
The following lemma characterizes the zeros of (U_i ⊗ I)B and hence the fixed points of (6) and Algorithm 1.

Lemma 3.1. Let z ∈ Z, then 0 = (U_i ⊗ I)Bz for all i ∈ {1, . . . , n}. Furthermore, the set Z is closed and convex and R_i x = R_i x̄ for all x, x̄ ∈ X and for all i ∈ {1, . . . , n}.
Proof of Lemma 3.1. The zero statement, 0 = (U_i ⊗ I)Bz, follows from the definition of z ∈ Z, since y_i = R_i x and (1/n) Σ_{j=1}^n R_j x = 0. For closedness and convexity of Z, we first prove that R_i x is unique for each i ∈ {1, . . . , n}. Taking x, y ∈ X gives Rx = Ry = 0, which together with (1/L)-cocoercivity of each R_i implies

0 = ⟨Rx − Ry, x − y⟩ = (1/n) Σ_{i=1}^n ⟨R_i x − R_i y, x − y⟩ ≥ (1/(nL)) Σ_{i=1}^n ‖R_i x − R_i y‖²,

and hence R_i x = R_i y for all i ∈ {1, . . . , n}. The set Z is a Cartesian product of X and the points r_i = R_i x for i ∈ {1, . . . , n} and any x ∈ X. A set consisting of only one point is closed and convex, and X is closed and convex since (1/n) Σ_{i=1}^n R_i is cocoercive [16, Proposition 23.39]; hence Z is closed and convex.
The operator B in the reformulated algorithm can be used to enforce the following property on the sequence (z k ) k∈N . Lemma 3.2. Let (Ω, F, P ) be a probability space and (z k ) k∈N be a sequence of random variables z k : Ω → R N (n+1) . If Bz k → Bz a.s. where z ∈ Z , then any cluster point of (z k ) k∈N will almost surely be in Z .
Proof of Lemma 3.2. Let z̄ be a cluster point of (z^k)_{k∈N} and take ω ∈ Ω such that Bz^k(ω) → Bz. For this ω, write z̄(ω) = (x̄, ȳ_1, . . . , ȳ_n) and let (z^{k_l}(ω))_{l∈N} be a subsequence converging to z̄(ω). The convergence Bz^{k_l}(ω) → Bz gives ȳ_i = lim_{l→∞} y^{k_l}_i(ω) = R_i x for all i ∈ {1, . . . , n} and

(1/n) Σ_{i=1}^n R_i x^{k_l}(ω) → (1/n) Σ_{i=1}^n R_i x̄ = 0

as l → ∞, where L-Lipschitz continuity of (1/n) Σ_{i=1}^n R_i was used. This concludes that x̄ ∈ X and, since ȳ_i = R_i x = R_i x̄ for all i ∈ {1, . . . , n} by Lemma 3.1, we have z̄(ω) ∈ Z. Since this holds for any ω such that Bz^k(ω) → Bz, and the set in F of all such ω has probability one due to the almost sure convergence Bz^k → Bz, any cluster point is in Z almost surely.
The reformulation (6) further allows us to concisely formulate two Lyapunov inequalities.
Lemma 3.3. Let Assumption 3.1 hold. The update (6) then satisfies a Lyapunov inequality in which the matrices H and M are given by

Lemma 3.4. Let Assumption 3.2 hold. The update (6) then satisfies

Proof of Lemma 3.3. The matrix E[U_{i_k}] is given by

see the supplementary material for verification of this and other matrix identities. We also note that

Taking ξ ∈ [0, 2λ/(nL)] and putting these two expressions together yields

Using (1/L)-cocoercivity of R_i for each i ∈ {1, . . . , n} gives

where M̄ = (1/2)(M + M^T) is the matrix in the lemma. Finally, using this inequality and 0 = (U_{i_k} ⊗ I)Bz from Lemma 3.1 gives the inequality of the lemma.

Proof of Lemma 3.4. Take k ∈ N and note that

With D = [0 1^T] we have (K ⊗ I)z^k = x^k + (λ/n)(D ⊗ I)Bz^k. Using the first order convexity condition on F and 0 = (D ⊗ I)Bz = (G ⊗ I)Bz yields

where S_C = (λ/n)(D^T G + G^T D). Combining these two inequalities gives the inequality of the lemma.

Convergence Theorems
We are now ready to state the main convergence theorems for SVAG. They are stated with the notation from Algorithm 1 but are proved at the end of this section with the help of the reformulation in (6) and the lemmas above.
for any x ∈ X .
Both Theorem 3.1 and 3.2 give the step-size condition λ ∈ (0, 1/(2L)) for the SAGA special case, i.e., θ = n. This is the same as the largest upper bound found in the literature [3] and appears to be tight [41]. Theorem 3.2 also gives this step-size condition when θ = 1, i.e., SAG in the optimization case. This improves on the upper bound of λ ≤ 1/(16L) presented in [2]. In the cocoercive operator setting with θ = 1, Theorem 3.1 gives a step-size condition that scales with n^{−1}. This step-size scaling is significantly worse than in the gradient case of Theorem 3.2, in which the step-size's dependence on n is O(1) for all θ. This difference is indeed real and not an artifact of the analysis, since we in Section 4 present a problem for which the cocoercivity result appears to be tight. A consequence of this unfavorable step-size scaling in the operator setting is slow convergence. There is therefore little reason to use anything other than θ = n in SVAG when the operators R_i are not gradients of smooth functions.
The rates of Theorems 3.1 and 3.2 are of O(1/(t+1)) type with two sets of multiplicative factors: one factor that depends only on the algorithm parameters, n/(λ(L^{−1} − λc)), and one set that depends on how the algorithm initialization relates to the solution set, C_R and C_R + C_F. The initialization dependent factors also depend on the algorithm parameters but, since knowing the exact dependency requires knowing the solution set, we will not attempt to tune the parameters to decrease these factors. Considering only the first factor, the rate becomes better if c is decreased and, since c is independent of λ, the best choice of step-size is λ = (2Lc)^{−1}. This means that λ = (4L)^{−1} and θ = n are the best parameter choices in the cocoercive operator setting. In the optimization case the best step-size is also λ = (4L)^{−1}, but the innovation weight can be selected as either θ = n or θ = 1.
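As a quick sanity check of this step-size choice (a sketch of our own, with arbitrary illustrative values for n, L, and the constant c), minimizing the parameter-dependent factor n/(λ(L⁻¹ − λc)) over a grid of step-sizes recovers λ = (2Lc)⁻¹:

```python
import numpy as np

# Factor from the rate: n / (lam * (1/L - lam * c)), finite for lam < 1/(L*c).
n, L, c = 100, 3.0, 2.0
lams = np.linspace(1e-4, 1 / (L * c) - 1e-4, 100001)
factor = n / (lams * (1 / L - lams * c))
best = lams[np.argmin(factor)]

# The analytic minimizer maximizes lam/L - lam^2 * c, i.e. lam = 1/(2*L*c).
assert abs(best - 1 / (2 * L * c)) < 1e-3
```

For c = 2 this gives λ = (4L)⁻¹, matching the parameter recommendation above.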
However, in the optimization case we do not believe that these theoretical rates reflect real world performance, and parameter choices based on them might therefore not perform particularly well. We base this belief on our experience with numerical experiments. For θ = n and θ = 1, we have not found any optimization problem where the step-size condition in Theorem 3.2 appears to be tight. Also, using λ = (2Lc)^{−1} as suggested by Theorem 3.2 can in some cases lead to impractically small step-sizes. For instance, if λ = (2Lc)^{−1} were used in the experiments in Section 4, a couple of the experiments would have step-sizes over 1000 times smaller than the ones used now. One can of course not disprove a worst case analysis with experiments, but we still feel they indicate a conservative analysis, even though the analysis improves on the previous best results.
Proof of Theorem 3.1. Applying Lemma 3.3 with ξ = 0, the iterates given by (6) satisfy the following for all z ∈ Z,

Assuming H ≻ 0 and 2M − E[U^T_{i_k} H U_{i_k}] ≻ 0, Proposition 2.1 can be applied; we will later prove that this assumption indeed holds. Proposition 2.1 gives a.s. summability of ‖Bz^k − Bz‖²_{(2M − E[U^T_{i_k} H U_{i_k}]) ⊗ I} and hence Bz^k → Bz almost surely. Lemma 3.2 then gives that all cluster points of (z^k)_{k∈N} are in Z almost surely. Finally, since Proposition 2.1 ensures the a.s. convergence of ‖z^k − z‖²_{H⊗I} and since R^{N(n+1)} with the inner product ⟨(H ⊗ I)·, ·⟩ is a finite dimensional Hilbert space, Proposition 2.2 gives the almost sure convergence of z^k to a point in Z.
There always exists a λ > 0 such that 2M − E[U^T_{i_k} H U_{i_k}] and H are positive definite. First we show that H ≻ 0 always holds for λ > 0. Taking the Schur complement of 1 in H gives

Hence H ≻ 0 since the Schur complement is positive definite.
Straightforward algebra, see the supplementary material, yields

Positive definiteness of this matrix is established by ensuring positivity of its smallest eigenvalue σ_min. The smallest eigenvalue σ_min is greater than the sum of the smallest eigenvalues of the terms; for the eigenvalues of the Kronecker products, see (3). This gives that

Since λ > 0 by assumption, if

we have that σ_min > 0 and that 2M − E[U^T_{i_k} H U_{i_k}] is positive definite. Rates are obtained by taking the total expectation of (8) and summing the inequalities from k = 0 to k = t, yielding

Inserting the lower bound on σ_min and rearranging yields

From the definition of H in Lemma 3.3 and the fact that z = (x, R_1 x, . . . , R_n x), this bound holds for any z ∈ Z and hence any x ∈ X, and the result of the theorem follows by minimizing the RHS over x ∈ X. Note, since R_i x is constant for all x ∈ X, the objective is convex and, since X is closed and convex, the minimum is attained.

Proof of Theorem 3.2. Combining Lemmas 3.3 and 3.4 yields
which holds for all k ∈ N, ξ ∈ [0, 2λ/(nL)], and z ∈ Z. Since H ≻ 0 for λ > 0, see the proof of Theorem 3.1, the first term is non-negative, while the second term is non-negative if θ ≤ n. From cocoercivity of ∇F, the last term is non-positive, and we assume, for now, that there exist λ > 0 and 2λ/(nL) ≥ ξ > 0 such that 2M − E[U^T_{i_k} H U_{i_k}] + λ(n − θ)S − ξI ≻ 0, making the third term non-positive.
Applying Proposition 2.1 gives the a.s. summability of

Since both terms are non-negative, each is a.s. summable. From the first term we have the a.s. convergence Bz^k → Bz, and Lemma 3.2 then gives that all cluster points of (z^k)_{k∈N} are almost surely in Z. For the second term we note that, by convexity,

is then summable a.s. since ξnL > 0. Using smoothness of F, (5) and the notation from (7) gives

From Proposition 2.1 we can also conclude that ‖z^k − z‖²_{H⊗I} + 2λ(n − θ)(F((K ⊗ I)z^k) − F(x)) a.s. converges to a non-negative random variable. Since the second term converges a.s., ‖z^k − z‖²_{H⊗I} must also a.s. converge to a non-negative random variable. Proposition 2.2 then gives the almost sure convergence of (z^k)_{k∈N} to Z.
We now show that there exist λ > 0 and ξ > 0 such that

We show positive definiteness by ensuring that the smallest eigenvalue is positive. The smallest eigenvalue σ_min is greater than the sum of the smallest eigenvalues of the terms. Assuming λ ≤ 1/(2L) yields the following lower bound on the smallest eigenvalue

Selecting

which satisfies the assumption 2λ/(nL) ≥ ξ > 0, yields σ_min ≥ ξ. Since λ > 0 by assumption, if

we have that σ_min ≥ ξ > 0 and hence that the examined matrix is positive definite. Furthermore, if λ satisfies the above inequality it also satisfies the assumption λ ≤ 1/(2L). Rates are obtained in the same way as for Theorem 3.1: the total expectation is taken of the Lyapunov inequality at the beginning of the proof and the inequalities are summed from k = 0 to k = t.
Inserting the lower bound on σ_min, rearranging, and minimizing over x ∈ X yields the result of the theorem.

Numerical Experiments
A number of experiments, outlined below, were performed to verify the tightness of the theory in the cocoercive operator case and examine the effect of bias in the cocoercive gradient case. The experiments were implemented in Julia [46] and, together with several other VR-SG methods, can be found at https://github.com/mvmorin/VarianceReducedSG.jl.

Cocoercive Operators Case
In order for the difference between cocoercive operators and cocoercive gradients to not be an artifact of our analysis, the results in the operator case cannot be overly conservative. We therefore construct a cocoercive operator problem for which the results appear to be tight, thereby verifying the difference. Consider problem (2) where, for all i ∈ {1, . . . , n}, the operator R_i : R² → R² is an averaged rotation,

R_i = (1/2)(Id + Q(τ)),

where Q(τ) is the rotation by some angle τ ∈ [0, 2π). The operators are 1-cocoercive and the zero vector is the only solution to (2) if τ ≠ π. The step-size condition from Theorem 3.1 appears to be tight for θ ∈ [0, n] when the angle of rotation τ approaches π. We therefore let τ = (179/180)π and solve the problem with different configurations of step-size λ and innovation weight θ. Figure 1 displays the relative distance to the solution after 100n iterations of SVAG together with the upper bound on the step-size. When θ ∈ [0, n] and λ exceeds the upper bound, the distance to the solution increases for both n = 100 and n = 10000, i.e., the method does not converge. Hence, for θ ∈ [0, n], the step-size bound in Theorem 3.1 appears to be tight. However, it is noteworthy that for this particular problem it seems beneficial to exceed the step-size bound when θ > n.
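A minimal sketch of this experiment follows. This is our own reconstruction: the function names are illustrative, the averaged-rotation form is our reading of the setup, and the convergence demonstration uses a smaller rotation angle, since the contraction near τ = π is extremely slow.

```python
import numpy as np

def averaged_rotation(tau):
    """R = (1/2)(Id + Q(tau)) with Q(tau) a rotation by angle tau.

    R is 1-cocoercive (the cocoercivity inequality holds with equality)
    and invertible, hence with unique zero 0, whenever tau != pi."""
    Q = np.array([[np.cos(tau), -np.sin(tau)],
                  [np.sin(tau),  np.cos(tau)]])
    return 0.5 * (np.eye(2) + Q)

def run_svag(Rmats, lam, theta, iters, seed=0):
    """Run SVAG on 0 = (1/n) sum_i R_i x and return ||x|| at the end."""
    rng = np.random.default_rng(seed)
    n = len(Rmats)
    x = np.ones(2)
    y = np.stack([R @ x for R in Rmats])
    for _ in range(iters):
        i = rng.integers(n)
        g = Rmats[i] @ x  # evaluate R_i at the current iterate
        x = x - lam * ((theta / n) * (g - y[i]) + y.mean(axis=0))
        y[i] = g
    return np.linalg.norm(x)
```

As τ approaches π the symmetric part of R nearly vanishes while the rotational part dominates, which is what pushes the operator far from any gradient and makes the operator-case bound bind.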

Cocoercive Gradients Case
Since, as we stated in Section 3.2, we do not believe that the theoretical rates are particularly tight in the optimization case, we examine the effects of the bias numerically. These experiments can of course not be exhaustive and we choose to focus only on the bias parameter θ, performing all experiments with the same step-size. This also demonstrates why we believe the analysis to be conservative, since the chosen step-size in some cases is 1000 times larger than the upper bound from Theorem 3.2. Convergence with this large a step-size has also been seen elsewhere, with both [2] and [43] disregarding their own theoretical step-size conditions. The experiments are performed as a rough parameter sweep over the innovation weight θ on two different binary classification problems, and we look for patterns in how the convergence is affected. The first problem is logistic regression,

minimize_x (1/n) Σ_{i=1}^n log(1 + e^{−y_i a_i^T x}).

The second is SVM with a squared hinge loss,

minimize_x (1/n) Σ_{i=1}^n max(0, 1 − y_i a_i^T x)² + (γ/2)‖x‖²,

where γ > 0 is a regularization parameter. In both problems, y_i ∈ {−1, 1} is the label and a_i ∈ R^N are the features of the i:th training data point. Note, although not initially obvious, max(0, ·)² is convex and differentiable with Lipschitz continuous derivative, and the second problem is therefore indeed smooth. The logistic regression problem does not necessarily have a unique solution and the distance to the solution set is therefore hard to estimate. For this reason, we examine the convergence of ∇F(x^k) → 0 instead of the distance to the solution set.
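The gradients of the two objectives can be sketched as follows. This is a reconstruction under our assumptions (unregularized logistic loss, Tikhonov-regularized squared hinge loss); the exact forms are our guess at the paper's formulation and are verified against finite differences rather than taken from the paper's code.

```python
import numpy as np

def logreg_grad(x, A, b):
    """Gradient of (1/n) sum_i log(1 + exp(-b_i a_i^T x)).

    A : (n, N) feature matrix (rows a_i), b : labels in {-1, +1}.
    """
    s = 1.0 / (1.0 + np.exp(b * (A @ x)))  # sigmoid(-b_i a_i^T x)
    return A.T @ (-b * s) / len(b)

def sq_hinge_grad(x, A, b, gamma):
    """Gradient of (1/n) sum_i max(0, 1 - b_i a_i^T x)^2 + (gamma/2)||x||^2.

    max(0, .)^2 is differentiable with Lipschitz derivative, so this
    objective is smooth even though the plain hinge loss is not.
    """
    m = np.maximum(0.0, 1.0 - b * (A @ x))
    return A.T @ (-2.0 * b * m) / len(b) + gamma * x
```

Each term of either sum gives the per-index gradient ∇f_i needed by SVAG by restricting A and b to a single row.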
The datasets for both these classification problems are taken from the LibSVM [47] collection of datasets. The number of examples in the datasets varies between n = 683 and n = 60,000, while the number of features is between N = 10 and N = 5,000. Two of the datasets, mnist.scale and protein, consist of more than two classes. These are converted to binary classification problems by grouping the different classes into two groups. For the digit classification dataset mnist.scale, the digits are divided into the groups 0-4 and 5-9. For the protein dataset, the classes are grouped as 0 and 1-2. The results of solving the classification problems above can be found in Figures 2 and 3.
From Figures 2 and 3 it appears that the biggest difference between the innovation weights lies in the early stages of the convergence. Most innovation weight choices appear to eventually converge with the same rate. In the cases where this does not happen, the fastest converging choice of innovation weight actually reaches machine precision; it is therefore not possible to say whether these cases would eventually reach the same rate as well. Since no choice of θ appears to be at a consistent and significant disadvantage, even though the step-size used exceeds the upper bound in Theorem 3.2 when θ = 0.1n and θ = 0.01n, we conjecture that the asymptotic rate for a given step-size is independent of θ.
The initial phase can clearly have a large impact on the convergence and it can therefore still be beneficial to tune the bias. However, comparing the different choices of innovation weight yields no clear conclusion since no single choice consistently outperforms another. In most cases, the lower bias choices, θ = n (SAGA) or θ = 0.1n, seem to perform best but, when they do not, the high bias choices, θ = 1 (SAG) and θ = 0.01n, perform significantly better. Another observation is that lowering θ increases any oscillations if they are present. We speculate that this is due to the increased inertia and we also believe that this inertia is what allows the lower innovation weights to sometimes perform better.

Conclusion
We presented SVAG, a variance-reduced stochastic gradient method with adjustable bias that has SAG and SAGA as special cases. It was analyzed in two scenarios: one being the minimization of a finite sum of functions with cocoercive gradients and the other being finding a root of a finite sum of cocoercive operators. The analysis improves on the previously best known analyses in both settings and, more significantly, the two scenarios give significantly different convergence conditions on the step-size. In the cocoercive operator setting a much more restrictive condition was found, and it was verified numerically. This difference is not present for ordinary gradient descent and can therefore easily be overlooked; our results suggest that overlooking it is inadvisable in the variance-reduced stochastic gradient setting.
The theoretical results in the minimization case were further examined with numerical experiments. Several choices of bias were examined but we did not find the same dependence on the bias that the theory suggests. In fact, the asymptotic convergence behavior was similar for the different choices of bias, indicating that further improvements of the theory are still needed. The bias mainly impacted the early stages of the convergence and in a couple of cases this impact was significant. There might therefore still be benefits to tuning the bias to the particular problem, but further work is needed to do so efficiently.
Funding This work is funded by the Swedish Research Council via grant number 2016-04646.

A Matrix Identities
Here we verify the matrix identities used in the proofs.

A.3
For E[U^T_{i_k} H U_{i_k}], first note that U_{i_k} can be written as

The third and fourth terms are

This results in