Optimal Algorithms for Differentially Private Stochastic Monotone Variational Inequalities and Saddle-Point Problems

In this work, we conduct the first systematic study of stochastic variational inequality (SVI) and stochastic saddle point (SSP) problems under the constraint of differential privacy (DP). We propose two algorithms: Noisy Stochastic Extragradient (NSEG) and Noisy Inexact Stochastic Proximal Point (NISPP). We show that a stochastic approximation variant of these algorithms attains risk bounds vanishing as a function of the dataset size, with respect to the strong gap function, and that a sampling-with-replacement variant achieves optimal risk bounds with respect to a weak gap function. We also show lower bounds of the same order on the weak gap function; hence, our algorithms are optimal. Key to our analysis are algorithmic stability bounds for both methods, which are new even in the nonprivate case. The running time of the sampling-with-replacement variants, as a function of the dataset size $n$, is $O(n^2)$ for NSEG and $\tilde{O}(n^{3/2})$ for NISPP.


Introduction
Stochastic variational inequalities (SVI) and stochastic saddle-point (SSP) problems have become a central part of the modern machine learning toolbox. The main motivation behind this line of research is the design of algorithms for multiagent systems and adversarial training, which are more suitably modeled in the language of games, rather than pure (stochastic) optimization. Applications that rely on these methods may often involve the use of sensitive user data, so it becomes important to develop algorithms for these problems with provable privacy-preserving guarantees. In this context, differential privacy (DP) has become the gold standard for privacy-preserving algorithms, thus a natural question is whether it is possible to design DP algorithms for SVI and SSP that attain high accuracy.
Motivated by these considerations, this work provides the first systematic study of differentially private SVI and SSP problems. Before proceeding to the specific results, we present more precisely the problems of interest. The stochastic variational inequality (SVI) problem is: given a monotone operator $F : \mathcal{W} \to \mathbb{R}^d$ in expectation form, $F(w) = \mathbb{E}_{\beta \sim \mathcal{P}}[F_\beta(w)]$, find $w^\ast \in \mathcal{W}$ such that
$$\langle F(w^\ast), w - w^\ast \rangle \ge 0 \quad \forall w \in \mathcal{W}. \tag{VI(F)}$$
The closely related stochastic saddle point (SSP) problem is: given a convex-concave real-valued function $f : \mathcal{W} \to \mathbb{R}$ (here $\mathcal{W} = \mathcal{X} \times \mathcal{Y}$ is a product space), given in expectation form $f(x,y) = \mathbb{E}_{\beta \sim \mathcal{P}}[f_\beta(x,y)]$, the goal is to find $(x^\ast, y^\ast)$ that solves
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y). \tag{SP(f)}$$
In both of these problems, the input to the algorithm is an i.i.d. sample $S = (\beta_1, \ldots, \beta_n) \sim \mathcal{P}^n$. Uncertainty introduced by a finite random sample renders the computation of exact solutions infeasible, so gap (a.k.a. population risk) measures of accuracy are considered instead. For SVI, given an algorithm $\mathcal{A} : \mathcal{Z}^n \to \mathcal{W}$, the strong VI-gap is
$$\mathrm{Gap}_{VI}(\mathcal{A}, F) := \mathbb{E}_{\mathcal{A},S}\Big[ \sup_{w \in \mathcal{W}} \langle F(w), \mathcal{A}(S) - w \rangle \Big]. \tag{1.1}$$
We also define the weak VI-gap as
$$\mathrm{WeakGap}_{VI}(\mathcal{A}, F) := \mathbb{E}_{\mathcal{A}} \sup_{w \in \mathcal{W}} \mathbb{E}_S\big[\langle F(w), \mathcal{A}(S) - w \rangle\big]. \tag{1.2}$$
Here, expectation is taken over both the sample data $S$ and the internal randomization of $\mathcal{A}$. For SSP (SP(f)), given an algorithm $\mathcal{A} : \mathcal{Z}^n \to \mathcal{X} \times \mathcal{Y}$, and letting $\mathcal{A}(S) = (x(S), y(S))$, a natural gap function is the following saddle-point (a.k.a. primal-dual) gap
$$\mathrm{Gap}_{SP}(\mathcal{A}, f) := \mathbb{E}_{\mathcal{A},S}\Big[ \sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} \big[ f(x(S), y) - f(x, y(S)) \big] \Big]. \tag{1.3}$$
Analogously as above, we define the weak SSP gap as
$$\mathrm{WeakGap}_{SP}(\mathcal{A}, f) := \mathbb{E}_{\mathcal{A}} \sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} \mathbb{E}_S\big[ f(x(S), y) - f(x, y(S)) \big]. \tag{1.4}$$
It is easy to see that in both cases the gap is always nonnegative, and any exact solution must have zero gap. For examples and applications of SVI and SSP we refer to Section 2.1. Despite the fact that the strong VI gap is a more classical and well-studied quantity, the weak VI gap has been observed to be useful in various contexts. We refer the reader to [41] for more discussion of the weak VI gap.
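To make the gap notions above concrete, the following sketch (our own illustrative toy example, not from the paper) evaluates the strong saddle-point gap (1.3) for a candidate solution of the bilinear problem $f(x,y) = xy$ on $[-1,1]^2$, whose unique saddle point is $(0,0)$; the suprema are approximated on a grid.

```python
import numpy as np

# Toy bilinear SSP: f(x, y) = x*y on X = Y = [-1, 1]; saddle point (0, 0).
# The saddle-point gap of a candidate (xh, yh) is
#   sup_{y in Y} f(xh, y) - inf_{x in X} f(x, yh),
# approximated here by a grid search over the feasible intervals.

def sp_gap(f, xh, yh, X, Y):
    """Grid approximation of sup_y f(xh, y) - inf_x f(x, yh)."""
    return max(f(xh, y) for y in Y) - min(f(x, yh) for x in X)

f = lambda x, y: x * y
grid = np.linspace(-1.0, 1.0, 201)

exact = sp_gap(f, 0.0, 0.0, grid, grid)   # at the saddle point: zero gap
off = sp_gap(f, 0.5, -0.25, grid, grid)   # suboptimal point: positive gap
print(exact, off)                          # 0.0 and 0.75
```

As the example shows, the gap certifies optimality: it vanishes exactly at the saddle point and is strictly positive elsewhere.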
On the other hand, we are interested in designing algorithms that are differentially private. These algorithms build a solution based on a given dataset $S$ of random i.i.d. examples from the target distribution, and output a (randomized) feasible solution, $\mathcal{A}(S)$. We say that two datasets $S = (\beta_i)_i$, $S' = (\beta_i')_i$ are neighbors, denoted $S \simeq S'$, if they only differ in a single entry $i$. We say that an algorithm $\mathcal{A}$ is $(\varepsilon, \eta)$-differentially private if for every event $E$ in the output space,
$$\mathbb{P}_{\mathcal{A}}[\mathcal{A}(S) \in E] \le e^{\varepsilon}\, \mathbb{P}_{\mathcal{A}}[\mathcal{A}(S') \in E] + \eta \quad (\forall S \simeq S'). \tag{1.5}$$
Here $\varepsilon, \eta \ge 0$ are prescribed parameters that quantify the privacy guarantee. Designing DP algorithms for particular data analysis problems is an active area of research. Optimal risk algorithms for stochastic convex optimization have only very recently been developed, and it is unclear whether these methods extend to the SVI and SSP settings.

Summary of Contributions
Our work is the first to provide population risk bounds for DP-SVI and DP-SSP problems. Moreover, our algorithms attain provably optimal rates and are computationally efficient. We summarize our contributions as follows: 1. We provide two different algorithms for DP-SVI and DP-SSP: namely, the noisy stochastic extragradient (NSEG) method and a noisy inexact stochastic proximal-point (NISPP) method. The NSEG method is a natural DP variant of the well-known stochastic extragradient method [22], where privacy is obtained by Gaussian noise addition; on the other hand, the NISPP method is an approximate proximal-point algorithm [33,20] in which every proximal iterate is made noisy to make it differentially private. The basic variants of both of these methods are based on iterations involving disjoint sets of datapoints (a.k.a. single-pass methods), which are known to typically lead to highly suboptimal rates in DP (see the Related Work section for further discussion).
2. We derive novel uniform stability bounds for the NSEG and NISPP methods. For NSEG, our stability upper bounds are inspired by the interpretation of the extragradient method as a (second-order) approximation of the proximal point algorithm. In particular, we provide expansion bounds for the extragradient iterates, and solve a (stochastic) linear recursion. The stability bounds for the NISPP method are based on stability of the (unique) SVI solution in the strongly monotone case. Finally, we investigate the risk attained by multipass versions of the NSEG and NISPP methods, leveraging known generalization bounds for stable algorithms [26]. Here, we show that the optimal risk for DP-SVI and DP-SSP can be attained by running these algorithms in their sampling-with-replacement variant. The NSEG method requires $n^2$ stochastic operator evaluations, while the NISPP method requires a much smaller $\tilde{O}(n^{3/2})$ operator evaluations for both DP-SVI and DP-SSP problems. These upper bounds also determine the running time of each of these algorithms as a function of the dataset size.
3. Finally, we prove lower bounds on the weak gap function for any DP-SSP and DP-SVI algorithm, showing the risk optimality of the aforementioned multipass algorithms. The main challenge in these lower bounds is showing that existing constructions of lower bounds for DP convex optimization [5,36,4] lead to lower bounds on the weak gap of a related SP/VI problem.
The following table provides details of population risk and operator evaluation complexity. Here $n$ is the dataset size, and $d$ is the dimension of the solution search space. We omit the dependence on other problem parameters (e.g., Lipschitz constants and diameter), as well as the privacy parameter $\eta$.

Related Work
We divide our discussion of related work into three main areas. Each of these areas has been extensively investigated, so a thorough description of existing work is not possible. We focus on the work most directly related to our own.
It is important to note that a naive adaptation of these methods to the DP setting requires adding noise to the operator evaluations at every iteration, which substantially degrades the accuracy of the obtained solution. Careful privacy accounting and a minibatch schedule can lead to optimal guarantees for single-pass methods [13]; however, this requires accuracy guarantees for the last iterate, which is currently an open problem for SVI and SSP (aside from specific cases, typically involving strong monotonicity conditions, e.g., [16,25]). We circumvent this problem by providing population risk guarantees for multipass methods.

2. Stability and Generalization: Deriving generalization (or population risk) bounds for general-purpose algorithms is a challenging task, actively studied in theoretical machine learning. Bousquet and Elisseeff [6] provided a systematic treatment of this question for algorithms that are stable with respect to changes of a single element in the training dataset, and a sequence of works has refined these generalization guarantees (see [14,7] and references therein). This idea has been applied to investigate the generalization properties of regularized empirical risk minimization [6,34], and more recently of iterative methods, such as stochastic gradient descent [15,3].
Using stability to obtain population risk bounds in SVI and SSP is substantially more challenging, due to the presence of a supremum in the accuracy measure (see eqns. (1.1) and (1.3)). Recently, Zhang et al. [41] established "stability implies generalization" results for the strong SP gap under strong monotonicity assumptions.
Their proof strategy applies analogously to the SVI setting, although this is not carried out in their work. More recently, Lei et al. [26] proved generalization bounds on the weak SP gap without strong monotonicity assumptions. We leverage this result for our algorithms, and further elaborate on its implications for SVI in Section 2.2.
To the best of our knowledge, our work is the first to address DP algorithms for SVI and SSP. Our approach to generalization of multipass algorithms is inspired by the noisy SGD analysis in [3]. However, our stability analysis differs crucially from [3]: in the case of NSEG, we need to carefully address the double operator evaluation of the extragradient step, which is done by using the fact that the extragradient operator is approximately nonexpansive. In the case of NISPP, we leverage the contraction properties of strongly monotone VI solutions. By contrast, SGD in the nonsmooth case is far from nonexpansive [3]. Alternative approaches to obtaining optimal risk in DP-SCO, including privacy amplification by iteration [13] and phased regularization or phased SGD [13], appear to run into fundamental limitations when applied to DP-SVI and DP-SSP. It is an interesting future research direction to obtain faster running times with optimal population risk in DP-SVI and DP-SSP, which may benefit from these alternative approaches.
The main body of this paper is organized as follows. In Section 2, we provide the necessary background on SVI/SSP, uniform stability, and differential privacy. In Section 3 we introduce the NSEG method, together with its basic privacy and accuracy guarantees for a single-pass version. Section 4 provides stability bounds for the NSEG method, along with the consequent optimal rates for SVI and SSP. In Section 5, we introduce the single-pass differentially private NISPP method with a bound on the expected SVI gap. Section 6 presents the stability analysis of NISPP, together with the resulting optimal rates for the SVI/SSP gap. We conclude in Section 7 with lower bounds that prove the optimality of the obtained rates.

Notation and Preliminaries
We work in the Euclidean space $(\mathbb{R}^d, \langle \cdot, \cdot \rangle)$, where $\langle \cdot, \cdot \rangle$ is the standard inner product and $\|u\| = \sqrt{\langle u, u \rangle}$ is the $\ell_2$-norm. Throughout, we consider a compact convex set $\mathcal{W} \subseteq \mathbb{R}^d$ with diameter $D > 0$. We denote the standard Euclidean projection operator onto the set $\mathcal{W}$ by $\Pi_{\mathcal{W}}(\cdot)$. The identity matrix on $\mathbb{R}^d$ is denoted by $I_d$.
We let $\mathcal{P}$ denote an unknown distribution supported on an arbitrary set $\mathcal{Z}$, from which we have access to exactly $n$ i.i.d. datapoints, which we denote by the sample set $S \sim \mathcal{P}^n$. Throughout, we will use boldface characters to denote sources of randomness (coming from the data, or internal algorithmic randomization). We say that two datasets $S, S'$ are adjacent (or neighbors), denoted by $S \simeq S'$, if they differ in a single data point. We denote subsets (a.k.a. batches) of $S$ or $\mathcal{P}$ by $B$, and single data points by $\beta$. Whether $\beta$ or $B$ is sampled from $\mathcal{P}$ or $S$ is specified explicitly unless it is clear from the context. For a batch $B$, we denote its size by $|B|$; in particular, $|S| = n$. Throughout, we will denote Gaussian random variables by $\xi$.
We say that $F : \mathcal{W} \to \mathbb{R}^d$ is a monotone operator if $\langle F(w_1) - F(w_2), w_1 - w_2 \rangle \ge 0$ for all $w_1, w_2 \in \mathcal{W}$.
Given $L > 0$, we say that $F$ is $L$-Lipschitz continuous if $\|F(w_1) - F(w_2)\| \le L \|w_1 - w_2\|$ for all $w_1, w_2 \in \mathcal{W}$.
Finally, we say that $F$ is $M$-bounded if $\sup_{w \in \mathcal{W}} \|F(w)\| \le M$. We denote the set of monotone, $L$-Lipschitz and $M$-bounded operators by $\mathcal{M}^1_{\mathcal{W}}(M, L)$. In this work, we will focus on the case where $F$ is an expectation operator, i.e., $F(w) := \mathbb{E}_{\beta \sim \mathcal{P}}[F_\beta(w)]$, where $\mathcal{P}$ is an arbitrary distribution supported on $\mathcal{Z}$, and $F_\beta(\cdot) \in \mathcal{M}^1_{\mathcal{W}}(M, L)$, $\beta$-a.s. In the stochastic saddle point problem (SP(f)), we modify the notation slightly. Here, $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_2}$ are compact convex sets, and we will assume that the saddle point functions $f_\beta(\cdot, \cdot) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ satisfy the following conditions $\beta$-a.s.:
• $\nabla_x f_\beta(\cdot, \cdot)$ is $L_x$-Lipschitz continuous and $\nabla_y f_\beta(\cdot, \cdot)$ is $L_y$-Lipschitz continuous; and • $f_\beta(\cdot, y)$ is convex for any given $y \in \mathcal{Y}$, and $f_\beta(x, \cdot)$ is concave for any given $x \in \mathcal{X}$ (we will say in this case the function is convex-concave).
If the assumptions above are met, we will denote $L := \sqrt{L_x^2 + L_y^2}$. Under these assumptions, it is well known that SSP [32,35] (and SVI [12], respectively) have a solution.
In the case of saddle-point problems, given the convex-concave function $f_\beta(\cdot, \cdot) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, we define the operator
$$F_\beta(x, y) := \big( \nabla_x f_\beta(x, y),\, -\nabla_y f_\beta(x, y) \big). \tag{2.1}$$
We will call this operator the monotone operator associated with $f_\beta(\cdot, \cdot)$. Furthermore, if $f_\beta(\cdot, y)$ has $L_x$-Lipschitz continuous gradient and $f_\beta(x, \cdot)$ has $L_y$-Lipschitz continuous gradient, then $F$ is $\sqrt{L_x^2 + L_y^2}$-Lipschitz continuous. It is easy to see that, given an SSP problem with function $f_\beta(\cdot, \cdot)$ and sets $\mathcal{X}, \mathcal{Y}$, an (exact) SVI solution (VI(F)) for the monotone operator associated with $f(x,y) = \mathbb{E}_\beta[f_\beta(x,y)]$ over the set $\mathcal{W} = \mathcal{X} \times \mathcal{Y}$ yields an exact SSP solution for the original problem. Unfortunately, such a reduction does not directly work for approximate solutions to (1.1) and (1.3), so the analysis must be done separately for both problems.
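The monotonicity of the associated operator can be verified numerically. The following sketch (an illustrative check with a function of our own choosing, not from the paper) confirms that $F(x,y) = (\nabla_x f, -\nabla_y f)$ is monotone for the convex-concave $f(x,y) = \tfrac{1}{2}\|x\|^2 + \langle x, Ay \rangle - \tfrac{1}{2}\|y\|^2$.

```python
import numpy as np

# Check monotonicity of the operator associated with the convex-concave
# function f(x, y) = 0.5*||x||^2 + <x, A y> - 0.5*||y||^2:
#   F(x, y) = (grad_x f, -grad_y f) = (x + A y, -(A^T x - y)).

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

def F(w):
    x, y = w[:3], w[3:]
    gx = x + A @ y          # grad_x f(x, y)
    gy = A.T @ x - y        # grad_y f(x, y)
    return np.concatenate([gx, -gy])

# Monotonicity: <F(w1) - F(w2), w1 - w2> >= 0 for all w1, w2.
ok = True
for _ in range(1000):
    w1, w2 = rng.standard_normal(6), rng.standard_normal(6)
    if np.dot(F(w1) - F(w2), w1 - w2) < -1e-10:
        ok = False
print(ok)  # True
```

For this $f$, the inner product equals $\|x_1 - x_2\|^2 + \|y_1 - y_2\|^2$: the bilinear coupling terms cancel, which is exactly why flipping the sign of the $y$-gradient produces a monotone operator.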
For a batch $B$, we denote the empirical (a.k.a. sample average) operator by $F_B(w) := \frac{1}{|B|} \sum_{\beta \in B} F_\beta(w)$. Likewise, for a batch $B$, the empirical saddle point function is denoted by $f_B(x,y) := \frac{1}{|B|} \sum_{\beta \in B} f_\beta(x,y)$. Given a distribution $\mathcal{P}$, the expectation operator and function are denoted by $F_{\mathcal{P}}(w) := \mathbb{E}_{\beta \sim \mathcal{P}}[F_\beta(w)]$ and $f_{\mathcal{P}}(x,y) := \mathbb{E}_{\beta \sim \mathcal{P}}[f_\beta(x,y)]$, respectively. For brevity, whenever it is clear from context we will drop the dependence on $\mathcal{P}$.

Examples and Applications of SVI and SSP
An interesting problem which can be formulated as an SSP problem is the minimization of a max-type convex function:
$$\min_{x \in \mathcal{X}} \max_{j \in [m]} \varphi_j(x),$$
where $\varphi_j : \mathcal{X} \to \mathbb{R}$ is a stochastic convex function, $\varphi_j(x) := \mathbb{E}_{\zeta_j \sim \mathcal{P}_j}[\varphi_{j,\zeta_j}(x)]$, for all $j \in [m]$. This is essentially a structured nonsmooth optimization problem which can be reformulated as a convex-concave saddle point problem:
$$\min_{x \in \mathcal{X}} \max_{y \in \Delta_m} \sum_{j=1}^m y_j \varphi_j(x),$$
where $\Delta_m$ denotes the probability simplex. Here, $\beta = (\zeta_j)_{j=1}^m$ is the random input to the saddle point problem: $f_\beta(x,y) = \sum_{j=1}^m y_j \varphi_{j,\zeta_j}(x)$. Note that a substantial generalization of the max-type problem above is the so-called compositional optimization problem: $\min_{x \in \mathcal{X}} \varphi(x) := \Phi(\varphi_1(x), \ldots, \varphi_m(x))$, where the $\varphi_j(x)$ are convex maps and $\Phi(u_1, \ldots, u_m)$ is a real-valued convex function whose Fenchel-type representation is assumed to have the form $\Phi(u_1, \ldots, u_m) = \max_{y \in \mathcal{Y}} \sum_{j=1}^m \langle u_j, A_j y \rangle + b_j y - \Phi^\ast(y)$, where $\Phi^\ast$ is convex, Lipschitz and smooth. Then, the overall optimization problem can be reformulated as a convex-concave saddle point problem, where stochasticity is introduced through the constituent functions $\varphi_j(x) = \mathbb{E}_{\zeta_j}[\varphi_{j,\zeta_j}(x)]$.
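The key identity behind the max-type reformulation, that $\max_j \varphi_j(x)$ equals the maximum of $\sum_j y_j \varphi_j(x)$ over the probability simplex, can be checked numerically. The sketch below uses three toy convex functions of our own choosing (not from the paper).

```python
import numpy as np

# Sanity check: for any x, max_j phi_j(x) = max_{y in simplex} <y, phi(x)>,
# since a linear function over the simplex is maximized at a vertex e_j.

rng = np.random.default_rng(1)

def phis(x):
    # three toy convex functions of a scalar x
    return np.array([x**2, (x - 1.0)**2, abs(x)])

for x in rng.standard_normal(100):
    vals = phis(x)
    lhs = vals.max()
    # maximum over simplex vertices e_1, e_2, e_3
    rhs = max(np.dot(y, vals) for y in np.eye(3))
    assert abs(lhs - rhs) < 1e-12
    # any interior simplex point gives a weighted average, hence no more
    for _ in range(10):
        y = rng.dirichlet(np.ones(3))
        assert np.dot(y, vals) <= lhs + 1e-12
print("ok")
```

This is the sense in which the nonsmooth max-type objective becomes a convex-concave saddle function that is linear in $y$.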
To conclude, we remark that models of this type have recently been proposed in machine learning to address approximate fairness [39] and federated learning on heterogeneous populations [27]. In these examples, the different indices $j \in [m]$ may denote different subgroups of a population, and we are interested in bounding the (excess) population risk on these subgroups uniformly (with the motivation of preventing discrimination against any subgroup). This clearly cannot be achieved by a stochastic convex program, and a stochastic saddle-point formulation is effective in certifying accuracy across the different subgroups separately.
For further examples and applications of stochastic variational inequalities and saddle-point problems, we refer the reader to [22,21,41].

Algorithmic Stability
In general, an algorithm is a randomized function mapping datasets to candidate solutions, $\mathcal{A} : \mathcal{Z}^n \to \mathbb{R}^d$, which is measurable w.r.t. the dataset. Two datasets $S = (\beta_1, \ldots, \beta_n), S' = (\beta_1', \ldots, \beta_n') \in \mathcal{Z}^n$ are said to be neighbors (denoted $S \simeq S'$) if they differ in at most one data point, namely there exists $i \in [n]$ such that $\beta_j = \beta_j'$ for all $j \ne i$. Algorithmic stability is a notion of sensitivity analysis of an algorithm under neighboring datasets. Of particular interest to our work is the notion of uniform argument stability (UAS).

Definition 2.1 (Uniform Argument Stability). Let $\mathcal{A} : \mathcal{Z}^n \to \mathbb{R}^d$ be a randomized mapping and $\delta > 0$. We say that $\mathcal{A}$ is $\delta$-uniformly argument stable (for short, $\delta$-UAS) if $\sup_{S \simeq S'} \mathbb{E}_{\mathcal{A}} \|\mathcal{A}(S) - \mathcal{A}(S')\| \le \delta$.

Occasionally, we may denote $\delta_{\mathcal{A}}(S, S') := \|\mathcal{A}(S) - \mathcal{A}(S')\|$, for convenience. The importance of algorithmic stability in machine learning comes from the fact that stability implies generalization in stochastic optimization and stochastic saddle point (SSP) problems [6,7,41]. Below, we restate existing "stability implies generalization" results for SSP problems. Before doing so, we briefly introduce the (strong) empirical gap functions: given a dataset $S$ and an algorithm $\mathcal{A}$, the empirical gap functions for saddle point and variational inequality problems are, respectively,
$$\mathrm{EmpGap}_{SP}(\mathcal{A}, f_S) := \mathbb{E}_{\mathcal{A}}\Big[ \sup_{x \in \mathcal{X},\, y \in \mathcal{Y}} \big[ f_S(x(S), y) - f_S(x, y(S)) \big] \Big], \qquad \mathrm{EmpGap}_{VI}(\mathcal{A}, F_S) := \mathbb{E}_{\mathcal{A}}\Big[ \sup_{w \in \mathcal{W}} \langle F_S(w), \mathcal{A}(S) - w \rangle \Big].$$
Notice that in these definitions the dataset $S$ is fixed.
Proposition 2.1 ([41,26]). Consider the stochastic saddle point problem (SP(f)) with functions $f_\beta(\cdot, y)$ and $f_\beta(x, \cdot)$ being $M$-Lipschitz for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $\beta$-a.s. Let $\mathcal{A} : \mathcal{Z}^n \to \mathcal{X} \times \mathcal{Y}$ be an algorithm, where $\mathcal{A}(S) = (x(S), y(S))$. If $x(\cdot)$ is $\delta_x$-UAS and $y(\cdot)$ is $\delta_y$-UAS, and both are integrable, then
$$\mathrm{WeakGap}_{SP}(\mathcal{A}, f) \le \mathbb{E}_S[\mathrm{EmpGap}_{SP}(\mathcal{A}, f_S)] + M[\delta_x + \delta_y]. \tag{2.4}$$
This result can be extended to SVI problems as well. We provide a formal statement below and prove it in the Appendix.

Proposition 2.2. Consider a stochastic variational inequality with $M$-bounded operators. If $\mathcal{A}$ is $\delta$-UAS, then
$$\mathrm{WeakGap}_{VI}(\mathcal{A}, F) \le \mathbb{E}_S[\mathrm{EmpGap}_{VI}(\mathcal{A}, F_S)] + M\delta. \tag{2.5}$$

Background on differential privacy
Differential privacy is an algorithmic-stability type of guarantee for randomized algorithms, certifying that the output distribution of the algorithm "does not change too much" under changes of a single element of the dataset. The formal definition is provided in eqn. (1.5). Next we provide some basic results in differential privacy which we will need for our work. For further information on the topic, we refer the reader to the monograph [10].

Basic privacy guarantees
In this work, most of our privacy guarantees will be obtained by the well-known Gaussian mechanism, which performs Gaussian noise addition on a function with bounded sensitivity. Given a function $\mathcal{A} : \mathcal{Z}^n \to \mathbb{R}^d$, its $\ell_2$-sensitivity is $s := \sup_{S \simeq S'} \|\mathcal{A}(S) - \mathcal{A}(S')\|$. If $\mathcal{A}$ is randomized, then the supremum must hold with high probability over the randomization of $\mathcal{A}$ (this will not be a problem in this work, since our randomized algorithms enjoy sensitivity bounds w.p. 1). The Gaussian mechanism (associated with the function $\mathcal{A}$) is defined as $\mathcal{A}_\sigma(S) := \mathcal{A}(S) + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$. Then, for $\sigma^2 = 2 s^2 \ln(1/\eta)/\varepsilon^2$, the Gaussian mechanism is $(\varepsilon, \eta)$-DP.
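A minimal sketch of the Gaussian mechanism, using the calibration $\sigma^2 = 2s^2\ln(1/\eta)/\varepsilon^2$ stated above; the mean query and its sensitivity bound are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Gaussian mechanism: release A(S) + N(0, sigma^2 I_d), with sigma
# calibrated from the l2-sensitivity s via sigma^2 = 2 s^2 ln(1/eta) / eps^2.

def gaussian_mechanism(A_of_S, sensitivity, eps, eta, rng):
    sigma = np.sqrt(2.0 * sensitivity**2 * np.log(1.0 / eta)) / eps
    return A_of_S + rng.normal(0.0, sigma, size=A_of_S.shape)

# Example query: the mean of n vectors with ||z_i|| <= 1, whose
# l2-sensitivity under replacement of one point is at most 2/n.
rng = np.random.default_rng(0)
n, d = 1000, 5
Z = rng.uniform(-1, 1, size=(n, d)) / np.sqrt(d)   # rows have norm <= 1
mean = Z.mean(axis=0)
private_mean = gaussian_mechanism(mean, 2.0 / n, eps=1.0, eta=1e-6, rng=rng)
print(np.linalg.norm(private_mean - mean))          # small additive error
```

Note how the noise scale decays as $1/n$: larger datasets admit the same privacy guarantee with far less distortion.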
Our algorithms will adaptively use DP mechanisms such as the above. Certifying privacy of a composition can be achieved in different ways. The most basic result establishes that if we use disjoint batches of data at each iteration, then the composition preserves the largest privacy parameter among its building blocks. This result is known as parallel composition, and its proof is a direct application of the post-processing property of DP.

Proposition 2.4 (Parallel composition of differential privacy). Let $S = (S_1, \ldots, S_K) \in \mathcal{Z}^n$ be a dataset partitioned into blocks of sizes $n_1, \ldots, n_K$, respectively. Let $\mathcal{A}_k$, for $k = 1, \ldots, K$, be a sequence of mechanisms, where each $\mathcal{A}_k$ operates only on block $S_k$ and is $(\varepsilon_k, \eta_k)$-DP. Then the composition $(\mathcal{A}_1(S_1), \ldots, \mathcal{A}_K(S_K))$ is $(\max_k \varepsilon_k, \max_k \eta_k)$-DP.
Some of the algorithms we develop in this work make repeated use of the data, and certifying privacy for these algorithms requires adaptive composition results in DP (see, e.g., [11,10]). For our algorithms, it is particularly important to leverage the sampling-with-replacement procedure to select the data used at each iteration, for which sharp DP bounds can be obtained by the moments accountant method [1]. Below we summarize a specific version of this method that suffices for our purposes.

Theorem 2.2 ([1]). Consider a sequence of functions $\mathcal{A}_1, \ldots, \mathcal{A}_K$, where each $\mathcal{A}_k$ is a function whose sensitivity is bounded as a function of the last data batch size. Consider the mechanism obtained by sampling a random subset of size $m$ from the dataset, i.e., letting $S_k = (\mathrm{Unif}([S]))^m$, and composing it with a Gaussian mechanism with noise variance $\sigma^2$.

The Noisy Stochastic Extragradient Method
To solve the DP-SVI problem we propose a noisy stochastic extragradient method (NSEG) in Algorithm 1.

Algorithm 1 Noisy Stochastic Extragradient (NSEG) Method
1: Input: Starting point $u_0 \in \mathcal{W}$, dataset $S = (\beta_i)_{i \in [n]} \sim \mathcal{P}^n$, stepsizes $(\gamma_t)_{t \in [T]}$
2: for $t = 1, \ldots, T$ do
3:   $v_t = \Pi_{\mathcal{W}}\big[u_{t-1} - \gamma_t F_{1,t}(u_{t-1})\big]$
4:   $u_t = \Pi_{\mathcal{W}}\big[u_{t-1} - \gamma_t F_{2,t}(v_t)\big]$
5: end for
6: Output: $\bar{v}_T = \frac{1}{\Gamma_T} \sum_{t=1}^T \gamma_t v_t$, where $\Gamma_T = \sum_{t=1}^T \gamma_t$

The names noisy and stochastic in Algorithm 1 are justified by the sequence of operators $F_{1,t}, F_{2,t}$ that we use:
$$F_{1,t}(\cdot) = F_{B^1_t}(\cdot) + \xi^1_t, \qquad F_{2,t}(\cdot) = F_{B^2_t}(\cdot) + \xi^2_t,$$
where $B^1_t, B^2_t$ are batches extracted from the dataset $S$, and $\xi^1_t, \xi^2_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_t^2 I_d)$. We will denote the batch size by $B_t$. The exact details of the sampling method for the batches depend on the variant of the algorithm. Here, we detail some key features of the above algorithm. Stochastic extragradient was proposed in [22] without noise addition in $F_{1,t}, F_{2,t}$ (stochasticity arising only from the dataset randomness), and with disjoint batches used across iterations, as well as within iterations. This choice is motivated by the goal of extracting population risk bounds for their algorithm.
Another important consideration is that this algorithm can also be applied to an SSP problem by using as stochastic oracle the monotone operator associated to the stochastic convex-concave function (2.1), over the set W " X ˆY.From here onwards, when we say that a certain SVI algorithm is applied to an SSP, we mean using the choices above for the operator and feasible set, respectively.
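The following sketch runs a single-pass noisy stochastic extragradient loop on a toy problem of our own choosing (not the paper's setting): $F_\beta(w) = w - \beta$ over a Euclidean ball, so the exact SVI solution is the (projected) population mean $\mu$. All parameter values here are illustrative.

```python
import numpy as np

# Schematic single-pass NSEG: extrapolation step with F_{1,t}, update step
# with F_{2,t}, both evaluated on disjoint batches plus Gaussian noise;
# the output is the weighted ergodic average of the extrapolation iterates.

def project_ball(w, radius=1.0):
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def nseg(batches, d, gamma, sigma, rng):
    u = np.zeros(d)
    avg, weight = np.zeros(d), 0.0
    for B1, B2 in batches:  # one pair of disjoint batches per iteration
        F1 = lambda w: w - B1.mean(axis=0) + rng.normal(0, sigma, d)
        F2 = lambda w: w - B2.mean(axis=0) + rng.normal(0, sigma, d)
        v = project_ball(u - gamma * F1(u))   # extrapolation step
        u = project_ball(u - gamma * F2(v))   # update step
        avg, weight = avg + gamma * v, weight + gamma
    return avg / weight                        # ergodic average

rng = np.random.default_rng(0)
d, T, m = 3, 200, 10
mu = np.array([0.3, -0.2, 0.1])
data = mu + 0.1 * rng.standard_normal((2 * T * m, d))
batches = [(data[2*t*m:(2*t+1)*m], data[(2*t+1)*m:(2*t+2)*m]) for t in range(T)]
w = nseg(batches, d, gamma=0.1, sigma=0.01, rng=rng)
print(np.linalg.norm(w - mu))  # close to the population solution mu
```

Averaging the extrapolation iterates, rather than returning the last one, is what the gap-function analysis of extragradient-type methods certifies.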
We start by stating the convergence guarantees for the single-pass NSEG method. This is obtained as a direct corollary of [22, Thm. 1], where we use an explicit bound on the oracle error involving the variance of the Gaussian noise.

Theorem 3.1 ([22]). Consider a stochastic variational inequality (VI(F)) with operators $F_\beta \in \mathcal{M}^1(M, L)$. Let $\mathcal{A}$ be the NSEG method (Algorithm 1), where $0 < \gamma_t \le 1/[\sqrt{3}L]$ and $(B^1_t, B^2_t)_t$ are independent random variables from a product distribution $B^1_t, B^2_t \sim \mathcal{P}^{B_t}$. Then $\mathcal{A}$ satisfies $\mathrm{Gap}_{VI}(\mathcal{A}, F) \le K_0(T)/\Gamma_T$, where $\mathcal{A}(S)$ is the output of Algorithm 1 on the dataset $S = \bigcup_t B_t \sim \mathcal{P}^n$, and the expectation on the left-hand side is taken over the dataset draws, the random batch choices, and the noise $\xi^1_t, \xi^2_t$. On the other hand, $\mathcal{A}$ applied to a stochastic (SP(f)) problem attains saddle point gap $\mathrm{Gap}_{SP}(\mathcal{A}, f) \le K_0(T)/\Gamma_T$.

Differential privacy analysis of NSEG method
We now proceed to establish the privacy guarantees for the single-pass variant of Algorithm 1. This is a direct consequence of Propositions 2.3 and 2.4, and the fact that each operator evaluation has sensitivity bounded by $2M/B_t$.
We now apply the previous results to obtain population risk bounds for DP-SVI by the NSEG method.
is $(\varepsilon, \eta)$-differentially private and achieves the stated bound on $\mathrm{Gap}_{VI}(\mathcal{A}, F)$ (for SVI) or $\mathrm{Gap}_{SP}(\mathcal{A}, f)$ (for SSP). Remark 3.2. Notice that in the corollary above, the gap is nontrivial iff $\sqrt{d \ln(1/\eta)}/[n\varepsilon] < 1$, which means that the left-hand side attains the max in the range where the gap is nontrivial.
Proof. Consider an SVI or SSP problem. Recall that by Theorem 3.1, Algorithm 1 achieves the stated expected gap.

We observe that excess risk bounds of the same order for DP-SCO, based on noisy SGD and the uniform stability of differential privacy, have been established [5]. Improving these bounds in DP-SCO required substantial effort, and was only achieved recently [4,13,3]. Furthermore, to the best of our knowledge, the upper bounds on the risk above are the first of their type for DP-SVI and DP-SSP, respectively. To improve upon them, we will follow the approach of [3], based on multipass empirical error convergence combined with weak-gap generalization bounds based on uniform stability.

Stability of NSEG and Optimal Risk for DP-SVI and DP-SSP
The bounds established above for DP-SVI are potentially suboptimal, and many of the approaches previously used to attain optimal rates for DP-SCO, such as privacy amplification by iteration and phased regularization, appear to encounter substantial barriers in their application to DP-SVI. To resolve this gap, we show that for both DP-SVI and DP-SSP we can indeed obtain optimal rates, matching those of DP-SCO. To achieve this, we develop a multipass variant of the NSEG method, which enjoys good generalization performance due to its stability.

Stability of NSEG method
To analyze the stability of NSEG it is useful to interpret the extragradient method as an approximation of the proximal point algorithm. This connection has been known at least since [29]. Given a monotone and 1-Lipschitz operator $G : \mathbb{R}^d \to \mathbb{R}^d$, we define the $s$-extragradient operator inductively as follows. First, $R_0(u; G) := u$, and for $s \ge 0$,
$$R_{s+1}(u; G) := \Pi_{\mathcal{W}}\big[u - G(R_s(u; G))\big]. \tag{4.1}$$
Given such an operator, the (deterministic) extragradient method [24] corresponds to, starting from $u_0 \in \mathcal{W}$, iterating $u_{t+1} = R_2(u_t; \gamma F)$ for all $t \in [T-1]$.
It is known that if $G$ is contractive, the recursion (4.1) leads to a fixed point $R(u; G)$, satisfying $R(u; G) = \Pi_{\mathcal{W}}[u - G(R(u; G))]$. It is also easy to see that $R(\cdot; G) : \mathbb{R}^d \to \mathcal{W}$ is nonexpansive.
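The fixed-point view can be illustrated numerically: for a contractive $G = \gamma F$, iterating the recursion converges to the proximal-type fixed point, and the $s = 2$ truncation used by extragradient is already a reasonable approximation. The operator, stepsize, and feasible set below are illustrative choices of ours.

```python
import numpy as np

# Recursion R_0(u; G) = u, R_{s+1}(u; G) = Pi_W[u - G(R_s(u; G))],
# over W = unit ball, with F(w) = A w for a skew-symmetric (monotone) A
# and G = gamma * F with gamma = 0.5, so G is contractive.

def project_ball(w, radius=1.0):
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def R(s, u, G):
    r = u
    for _ in range(s):
        r = project_ball(u - G(r))
    return r

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: F(w) = A w is monotone
G = lambda w: 0.5 * (A @ w)               # contractive since gamma * ||A|| < 1
u = np.array([0.8, -0.3])

fixed = R(200, u, G)                      # essentially the fixed point R(u; G)
assert np.allclose(fixed, project_ball(u - G(fixed)), atol=1e-8)
print(np.linalg.norm(R(2, u, G) - fixed)) # R_2 already approximates it
```

The gap between $R_2(u; G)$ and the true fixed point is third order in $\gamma$, which is the sense in which extragradient is a second-order approximation of the proximal point step.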
The next lemma shows an expansion upper bound for extragradient iterations.This type of bound will be later used to establish the uniform argument stability of the NSEG algorithm.
Next, to prove (4.6), we proceed as follows. The Expansion Lemma above allows us to bound how much two trajectories of the NSEG method may deviate, given two pairs of sequences of operators $F_{1,t}, F_{2,t}$ and $F'_{1,t}, F'_{2,t}$. The bounds obtained from this analysis give direct bounds on the UAS of the NSEG method. Proof. Clearly, $\nu_0 = 0$. Let us now derive a recurrence for both $\nu_t$ and $\delta_t$.
where in the last inequality we used inequality (4.5). Let us now bound the rightmost term above (4.9). In inequality (i) we used Lemma 4.1 (more precisely, inequality (4.6)), and in inequality (ii) we used (4.9). Unraveling the above recursion yields a bound for all $t \in [T]$, which combined with (4.10) gives the claimed recursion bound for all $t \in [T]$.

The next theorem provides in-expectation and high-probability upper bounds on the UAS of the NSEG method. Although we will not directly apply the latter bounds, we believe they may be of independent interest.

Theorem 4.3. The NSEG method with operators in $\mathcal{M}^1_{\mathcal{W}}(M, L)$ and stepsizes $0 < \gamma_t \le 1/L$ satisfies the following uniform argument stability bounds:
1. Batch method $\mathcal{A}_{\text{batch-EG}}$: here, given dataset $S$, $F_{1,t} = F_S + \xi^1_t$ and $F_{2,t} = F_S + \xi^2_t$. The UAS is bounded in expectation; and for constant stepsize $\gamma_t = \gamma$, there exists a universal constant $K > 0$ such that for any $0 < \theta < 1$, with probability $1 - \theta$, the bound (4.12) holds.
2. Sampled-with-replacement method $\mathcal{A}_{\text{repl-EG}}$: here $F_{1,t} = F_{B^1_t} + \xi^1_t$ and $F_{2,t} = F_{B^2_t} + \xi^2_t$, and for constant stepsize $\gamma_t = \gamma$, there exists a universal constant $K > 0$ such that for any $0 < \theta < 1/[2n]$, the corresponding bound holds with probability $1 - \theta$. Hence, $(\Delta_{1,t})_{t \in [T]}$ and $(\Delta_{2,t})_{t \in [T]}$ are sequences of independent random variables with expectation bounded by $2M/n$. Therefore, by Lemma 4.2 (eqn. (4.8)), and following the steps leading to inequality (4.14), we obtain the in-expectation bound. Finally, for the high-probability bound, note that for any realization of the algorithm randomness, assuming in addition a constant stepsize $\gamma_t = \gamma > 0$, we can rely on concentration of sums of Bernoulli random variables to bound $\Delta_{1,t}$ with high probability. An analogous bound can be established for $\Delta_{2,t}$, which together with bound (4.15) yields the stated guarantee. Notice this bound only depends on our choice of $i$, and it is otherwise uniform over all $S \simeq S'$. Finally, a union bound over $i \in [n]$ (together with a renormalization of $\theta$) concludes the proof.

Optimal risk for DP-SVI and DP-SSP by NSEG method
Now we use our stability and risk bounds for NSEG to derive optimal risk bounds for DP-SVI and DP-SSP. For this, we use the sampling-with-replacement variant, $\mathcal{A}_{\text{repl-EG}}$.
This is quite a mild sample-size requirement. In this range, when $M \ge LD$, our upper bound matches the excess risk bounds for DP-SCO [4], and we will show these rates are indeed optimal for DP-SVI and DP-SSP as well.

Proof. Given that our bounds for SVI and SSP are analogous, we proceed indistinctly for both problems. First, let us bound the empirical accuracy of the method. By Theorem 3.1, together with the fact that sampling with replacement is an unbiased stochastic oracle for the empirical operator, we obtain a bound on $\mathrm{EmpGap}(\mathcal{A}, S)$, which denotes $\mathrm{EmpGap}_{VI}(\mathcal{A}, F_S)$ or $\mathrm{EmpGap}_{SP}(\mathcal{A}, f_S)$ for an SVI or SSP problem, respectively. Next, by Theorem 4.3, we have that $\mathcal{A}$ (or $x(S)$ and $y(S)$, in the SSP case) is UAS with the stated parameter. Hence, noting that the empirical gap upper bounds the weak empirical gap, and using Proposition 2.1 or Proposition 2.2 (depending on whether the problem is an SSP or SVI, respectively), we have that the risk is upper bounded by the empirical risk plus $M\delta$, where $\delta$ is the UAS parameter of the algorithm. A similar claim can be made for $\mathrm{WeakGap}_{SP}(\mathcal{A}, f_S)$. Hence, we conclude the proof.

The Noisy Inexact Stochastic Proximal Point Method
In the previous sections, we presented the NSEG method with its single-pass and multipass variants, and provided optimal risk guarantees for DP-SVI and DP-SSP problems in $O(n^2)$ stochastic operator evaluations. In the rest of the paper, our aim is to provide another algorithm that can achieve the optimal risk for both of these problems with much less computational effort. Towards that end, consider the following algorithm:

Algorithm 2 Noisy Inexact Stochastic Proximal Point (NISPP) Method
1: Input: $w_0 \in \mathcal{W}$
2: for $k = 0, 1, \ldots, K$ do
3: Sample a batch $B_{k+1} \subseteq S$.

4: $u_{k+1} \leftarrow \mathrm{VI}_\nu\big(\mathcal{W}, F_{B_{k+1}}(\cdot) + \lambda_k(\cdot - w_k)\big)$.
5: $w_{k+1} \leftarrow u_{k+1} + \xi_{k+1}$, where $\xi_{k+1} \sim \mathcal{N}(0, \sigma_{k+1}^2 I_d)$.

In the above algorithm, we leave a few things unspecified, which will be stated later during the convergence and privacy analysis. Here, we detail some key features of the above algorithm. In line 3, we sample a batch $B_{k+1}$ of size $B_{k+1} = |B_{k+1}|$. Similar to NSEG, we will consider two different variants of the NISPP method: single-pass and multipass. Depending on the type of the method, we will specify the sampling mechanism. In line 4 of Algorithm 2, $u_{k+1}$ is a $\nu$-approximate strong VI solution of the stated VI problem, for some $\nu \ge 0$, i.e.,
$$\langle F_{B_{k+1}}(u_{k+1}) + \lambda_k (u_{k+1} - w_k), w - u_{k+1} \rangle \ge -\nu \quad (\forall w \in \mathcal{W}). \tag{5.1}$$
Note that if $\nu = 0$ then this is an exact solution satisfying (VI(F)) with operator $F$ replaced by $F(\cdot) + \lambda_k(\cdot - w_k)$.
For $\nu > 0$, $u_{k+1}$ is an inexact solution satisfying the solution criterion up to additive error $\nu$. In line 5, we add Gaussian noise to $u_{k+1}$ in order to preserve privacy. The resulting iterate $w_{k+1}$ can potentially lie outside the set $\mathcal{W}$. Hence, in line 7, the ergodic average $\bar{w}_K$ can also lie outside $\mathcal{W}$. In order to preserve feasibility of the solution, we project $\bar{w}_K$ onto the set $\mathcal{W}$ and output it as the solution in line 8. Projecting the average in line 8, as opposed to projecting each individual $w_{k+1}$ in line 5, is crucial for the convergence guarantee of Algorithm 2.
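The proximal structure of the iteration can be sketched on the same toy operator used earlier, $F_\beta(w) = w - \beta$ (our illustrative choice, not the paper's setting): the regularized subproblem in line 4 is strongly monotone and, for this operator, admits a closed-form solution, so we solve it exactly ($\nu = 0$); line 5 adds Gaussian noise, and only the ergodic average is projected at the end.

```python
import numpy as np

# Schematic NISPP: each iteration solves the strongly monotone VI with
# operator F_B(u) + lam*(u - w_k), then perturbs the solution with noise.
# For F_B(u) = u - mean(B), the subproblem solution over the unit ball is
# the projection of (mean(B) + lam*w_k) / (1 + lam).

def project_ball(w, radius=1.0):
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def nispp(batches, d, lam, sigma, rng):
    w = np.zeros(d)
    iterates = []
    for B in batches:
        u = project_ball((B.mean(axis=0) + lam * w) / (1.0 + lam))  # line 4
        w = u + rng.normal(0.0, sigma, d)                            # line 5
        iterates.append(w)
    return project_ball(np.mean(iterates, axis=0))  # project only the average

rng = np.random.default_rng(0)
d, K, m = 3, 100, 20
mu = np.array([0.3, -0.2, 0.1])
data = mu + 0.1 * rng.standard_normal((K * m, d))
batches = [data[k*m:(k+1)*m] for k in range(K)]
w_bar = nispp(batches, d, lam=1.0, sigma=0.01, rng=rng)
print(np.linalg.norm(w_bar - mu))  # close to the population solution mu
```

In general the subproblem has no closed form and is solved to accuracy $\nu$ by an inner solver; strong monotonicity (with modulus $\lambda_k$) is what makes that inner problem cheap.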
In the rest of this section, we exclusively deal with the single-pass version of the NISPP method, i.e., we assume that the batches $\{\mathcal{B}_{k+1}\}_{k=0,\ldots,K}$ are disjoint subsets of the dataset $S$. We start with the convergence guarantees of the single-pass NISPP method. In order to prove convergence, we show a useful bound on $\mathrm{dist}_{\mathcal{W}}(\bar{w}_K) := \min_{w\in\mathcal{W}} \|w - \bar{w}_K\|$.
Moreover, we have the following. Proof. Note that $\bar{u}_K \in \mathcal{W}$. Hence, the first relation in (5.2) follows by definition of the $\mathrm{dist}_{\mathcal{W}}(\cdot)$ function, and the equality follows from the definitions of $\bar{u}_K$ and $\bar{w}_K$. To obtain (5.3), note that $\{\xi_k\}_{k=1}^{K+1}$ are i.i.d. random variables with mean $0$. Expanding $\|\sum_{k=0}^{K}\gamma_k\xi_{k+1}\|^2$, using linearity of expectation and noting that $\mathbb{E}[\xi_i^\top \xi_j] = 0$ for all $i \neq j$, we conclude (5.3). Hence, we conclude the proof.
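The independence argument in this proof (cross terms $\mathbb{E}[\xi_i^\top\xi_j]$ vanish, so only the diagonal terms $\sum_k \gamma_k^2$ survive) can be checked numerically. The sketch below, with hypothetical toy parameters, compares a Monte Carlo estimate of $\mathbb{E}\|\sum_k \gamma_k\xi_{k+1}\|^2$ against the closed form $d\,\sigma^2\sum_k\gamma_k^2$:

```python
import random

def weighted_noise_second_moment(gammas, sigma, d, rng, trials=4000):
    # Monte Carlo estimate of E || sum_k gamma_k * xi_{k+1} ||^2 for
    # independent xi_{k+1} ~ N(0, sigma^2 I_d)
    total = 0.0
    for _ in range(trials):
        s = [0.0] * d
        for g in gammas:
            for i in range(d):
                s[i] += g * rng.gauss(0.0, sigma)
        total += sum(x * x for x in s)
    return total / trials

rng = random.Random(0)
gammas, sigma, d = [1.0, 2.0, 0.5], 0.3, 4
estimate = weighted_noise_second_moment(gammas, sigma, d, rng)
# Cross terms vanish in expectation, leaving d * sigma^2 * sum_k gamma_k^2
exact = d * sigma ** 2 * sum(g * g for g in gammas)
```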
We prove the following convergence rate result for Algorithm 2 for the risk of the SVI/SSP problem. In particular, we assume that the algorithm performs a single pass over the dataset $S \sim P^n$ containing $n$ i.i.d. datapoints.
Theorem 5.1. Consider the stochastic (VI(F)) problem with operators $F_\beta \in \mathcal{M}^1(L, M)$. Let $\mathcal{A}$ be the single-pass NISPP method (Algorithm 2) where the sequences $\{\gamma_k\}_{k\ge 0}$, $\{\lambda_k\}_{k\ge 0}$ satisfy $\gamma_k\lambda_k = \gamma_0\lambda_0$ (5.4) for all $k \ge 0$. Moreover, the batches $\mathcal{B}_{k+1}$ are independent samples from a product distribution, $\mathcal{B}_{k+1} \sim P^{B_{k+1}}$, and $\mathcal{B}_{k+1} \subset S$.
Then, we have the bounds (5.5) and (5.6), where $Z_0(K)$ is as defined in the statement.

Proof. Let $w \in \mathcal{W}$. Then
$$\langle F(w), w_{k+1} - w\rangle = \langle F(w), u_{k+1} - w\rangle + \langle F(w), w_{k+1} - u_{k+1}\rangle. \quad (5.7)$$
We analyze each term above separately. First, we obtain (5.8), where the first inequality follows from monotonicity and the second inequality follows from (5.1). Next, the last inequality follows from the $L$-Lipschitz continuity of $F$ and $F_{\mathcal{B}_{k+1}}$. Noting that $\|u_{k+1} - w\| \le D$ for all $w \in \mathcal{W}$ and using the above bound in (5.9), we obtain (5.10). Letting $u_0 := w_0$, and consequently $\xi_0 = 0$, we have from (5.10) the bound (5.11). Let us define an auxiliary sequence $\{z_k\}_{k\ge 0}$, where $z_0 = w_0$, with the stated recursion for all $k \ge 1$. Then, due to the mirror-descent bound, we have (5.12). Combining the subsequent relation with (5.12), we obtain (5.13). Now, multiplying (5.7) by $\gamma_k$ and summing from $k = 0$ to $K$; noting the definitions of $\bar{w}_K$ and $\Gamma_K$; and using (5.8), (5.11) and (5.13) along with assumption (5.4), we obtain (5.14). Moreover, define an auxiliary sequence $\{h_k\}_{k\ge 0}$ with $h_0 := w_0$ and $h_{k+1} := \Pi_{\mathcal{W}}[h_k - \Delta_k]$.
Then, due to the mirror-descent bound, we obtain the analogous inequality for $\{h_k\}$. Using this relation along with (5.16), and then using the resulting bound in (5.15), we obtain (5.17). Finally, note that for all valid $k$ we have (5.18)-(5.21), where in (5.19) we used the fact that $F_\beta$ is $M$-bounded for all $\beta \in S$. Now, using (5.17) in relation (5.14), noting the bound on $\mathrm{dist}_{\mathcal{W}}(\bar{w}_K)$ from Proposition 5.1 (in particular (5.3)), taking the supremum with respect to $w \in \mathcal{W}$, then taking expectations and noting (5.18)-(5.21), we obtain the claim, where in the first inequality we used the fact that $\mathbb{E}[\mathrm{dist}_{\mathcal{W}}(\bar{w}_K)] \le \sqrt{\mathbb{E}[\mathrm{dist}_{\mathcal{W}}(\bar{w}_K)^2]}$. Hence, we conclude the proof of (5.5). Now, we extend this to (SP(f)). Denote $u_{k+1} = (\tilde{x}_{k+1}, \tilde{y}_{k+1})$. Using the stated bound in (5.11), then using Proposition 5.1 to bound the distance between the output and $(\Pi_{\mathcal{X}}[\bar{x}_K], \Pi_{\mathcal{Y}}[\bar{y}_K])$, applying Jensen's inequality, and retracing the steps of this proof from (5.11), we obtain (5.6). Hence, we conclude the proof.

Differential privacy of the NISPP method
First, we show a simple bound on the $\ell_2$-sensitivity of the updates of the NISPP method.
Proof. Let $w_k$ be the iterate at the start of the $k$-th iteration of Algorithm 2. Let $\mathcal{B}_{k+1}$ and $\mathcal{B}'_{k+1}$ be two different batches used in the $k$-th iteration to obtain $u_{k+1}$ and $u'_{k+1}$, respectively, and note that $\mathcal{B}_{k+1}$ and $\mathcal{B}'_{k+1}$ differ in only a single datapoint. Then, due to (5.1), we have for all $w \in \mathcal{W}$,
$$\langle F_{\mathcal{B}_{k+1}}(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - u_{k+1}\rangle \ge -\nu,$$
and similarly for $u'_{k+1}$. Using $w = u'_{k+1}$ in the first relation and $w = u_{k+1}$ in the second relation, and then summing, we obtain (5.22). Now, we use that $F_{\mathcal{B}_{k+1}}$ is a monotone operator and denote $a_{k+1}$ as stated. Finally, noting that if $\beta$ and $\beta'$ are the differing datapoints in $\mathcal{B}_{k+1}$ and $\mathcal{B}'_{k+1}$, then using the resulting relation in (5.22) yields the claimed bound on the $\ell_2$-sensitivity. Hence, we conclude the proof.
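To illustrate how such a sensitivity bound is typically consumed, the sketch below calibrates per-iteration Gaussian noise via the classical Gaussian mechanism. The $4M/(\lambda B)$ form of the sensitivity follows the discussion above, but the calibration rule shown is the textbook one, not necessarily the exact privacy accounting used in the paper; all numeric parameters are hypothetical.

```python
import math

def nispp_update_sensitivity(M, lam, batch_size):
    # l2-sensitivity bound for one prox update: swapping a single datapoint
    # in a batch of size B moves u_{k+1} by at most 4M/(lam * B)
    return 4.0 * M / (lam * batch_size)

def gaussian_mechanism_sigma(sensitivity, eps, delta):
    # Classical Gaussian mechanism: sigma = Delta * sqrt(2 ln(1.25/delta)) / eps
    # suffices for (eps, delta)-DP of a single release (per-iteration budget)
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

sens = nispp_update_sensitivity(M=1.0, lam=10.0, batch_size=50)
sigma = gaussian_mechanism_sigma(sens, eps=0.5, delta=1e-6)
```

Larger batches or stronger regularization $\lambda$ shrink the sensitivity, and hence the noise needed per iteration, which is the mechanism behind the parameter policies that follow.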
Now, we provide a policy for setting $\gamma_k$, $\lambda_k$ and $B_{k+1}$ to obtain population risk bounds for the DP-SVI and DP-SSP problems via the NISPP method.
Proof. Note that the values of $\nu$, $\sigma_{k+1}$ and the other required conditions proposed in Propositions 5.2 and 5.3 are satisfied; hence, this algorithm is $(\varepsilon, \eta)$-differentially private. Moreover, all requirements of Theorem 5.1 are satisfied. In order to maintain a single pass over the dataset, we require $K = n/B = n^{2/3}$ iterations. We then bound the individual terms of (5.5) ((5.6), respectively) and conclude the corollary using Theorem 5.1.
Note that we are using a constant parameter policy; hence, $\sigma_{k+1} = \sigma = \frac{4M}{\rho B \lambda_0}$ for all $k \ge 0$. Substituting the appropriate parameter values yields the stated bounds.
Substituting these bounds in Theorem 5.1, we conclude the proof.
Remark 5.2. We have the following remarks for the NISPP method: 1. In order to obtain a $\nu$-approximate solution of the NISPP subproblem satisfying (5.1), we can use the Operator Extrapolation (OE) method (see Theorem 2.3 of [25]). The OE method outputs a solution $u_{k+1}$ satisfying $\|u_{k+1} - w_{k+1}\| \le \zeta$ in $\frac{L+\lambda_0}{\lambda_0}\ln(\frac{D}{\zeta})$ iterations, where $w_{k+1}$ denotes the exact SVI solution of problem (5.1). Furthermore, we have for all $w \in \mathcal{W}$,
$$\begin{aligned}
0 &\le \langle F(w_{k+1}) + \lambda_k(w_{k+1} - w_k),\, w - w_{k+1}\rangle\\
&= \langle F(u_{k+1}) + F(w_{k+1}) - F(u_{k+1}) + \lambda_k(u_{k+1} - w_k) + \lambda_k(w_{k+1} - u_{k+1}),\, w - w_{k+1}\rangle\\
&\le \langle F(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - w_{k+1}\rangle + (L + \lambda_k)\,\|u_{k+1} - w_{k+1}\|\,\|w - w_{k+1}\|\\
&\le \langle F(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - w_{k+1}\rangle + (L + \lambda_k)D\,\|u_{k+1} - w_{k+1}\|\\
&= \langle F(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - u_{k+1} + u_{k+1} - w_{k+1}\rangle + (L + \lambda_k)D\,\|u_{k+1} - w_{k+1}\|\\
&\le \langle F(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - u_{k+1}\rangle + \|F(u_{k+1}) + \lambda_k(u_{k+1} - w_k)\|\,\|u_{k+1} - w_{k+1}\| + (L + \lambda_k)D\,\|u_{k+1} - w_{k+1}\|\\
&\le \langle F(u_{k+1}) + \lambda_k(u_{k+1} - w_k),\, w - u_{k+1}\rangle + [LD + M + 2\lambda_k D]\,\|u_{k+1} - w_{k+1}\|.
\end{aligned}$$
Setting $\zeta = \nu/[LD + M + 2\lambda_k D]$, we obtain that $u_{k+1}$ is a $\nu$-approximate solution satisfying (5.1). Using the convergence rate above, we require $\frac{L+\lambda_0}{\lambda_0}\ln\big(\frac{MD + LD^2 + 2\lambda_k D^2}{\nu}\big)$ operator evaluations.
Note that since $\lambda_0 \ge L$, we have $\frac{L+\lambda_0}{\lambda_0} \le 2$. Hence, each iteration of the NISPP method requires $O(\log n)$ iterations of the OE method for solving the subproblem. Moreover, each iteration of the OE method requires $2B$ stochastic operator evaluations. Hence, we require $O(KB\log n)$ stochastic operator evaluations in the entire run of NISPP (Algorithm 2). Noting that $KB = n$, we conclude that this is a near-linear-time algorithm that also performs only a single pass over the data in the stochastic outer loop. We provide the details of the OE method in Appendix B. In view of Corollary 5.1, it seems that running the NISPP method for $n^{3/2}$ stochastic operator evaluations may provide optimal risk bounds. However, running that many stochastic operator evaluations requires multiple passes over the dataset, so, in principle, this would only provide bounds on the empirical risk. In order to bound the population risk of this multi-pass version, we analyze the stability of NISPP and provide generalization guarantees which result in optimal population risk.
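A minimal stand-in for the inner solver is sketched below: it runs plain extragradient (rather than the OE method of [25], which we do not reproduce) on the shifted operator $F(\cdot) + \lambda(\cdot - w_k)$, which the prox term makes strongly monotone; all names and parameters are illustrative.

```python
def extragradient_subsolver(F, w_k, lam, project, steps, step_size):
    # Inner solver sketch for the NISPP subproblem VI(W, F(.) + lam*(. - w_k)):
    # plain extragradient on the lam-strongly monotone shifted operator G
    d = len(w_k)

    def G(u):
        Fu = F(u)
        return [Fu[i] + lam * (u[i] - w_k[i]) for i in range(d)]

    u = list(w_k)
    for _ in range(steps):
        g = G(u)
        mid = project([u[i] - step_size * g[i] for i in range(d)])
        g_mid = G(mid)
        u = project([u[i] - step_size * g_mid[i] for i in range(d)])
    return u

# Demo: the rotation operator F(u) = (u2, -u1) is monotone but not strongly
# monotone; the prox term conditions the subproblem. With w_k = (1, 0) and
# lam = 1, the exact solution of F(u) + lam*(u - w_k) = 0 is (0.5, 0.5).
sol = extragradient_subsolver(lambda u: [u[1], -u[0]], [1.0, 0.0],
                              lam=1.0, project=lambda v: list(v),
                              steps=200, step_size=0.2)
```

The regularization is what buys the fast (linear) convergence of the inner loop, mirroring the $O(\log n)$ inner-iteration count discussed above.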

Stability of NISPP and Optimal Risk for DP-SVI and DP-SSP
In this section, we develop a multi-pass variant of the NISPP method and prove its stability, in order to extrapolate empirical performance to population risk bounds.

Stability of NISPP method
Let us start with two adjacent datasets $S \simeq S'$. Suppose we run the NISPP method on both datasets starting from the same point $w_0 \in \mathcal{W}$. In the following lemma, we bound how far apart the trajectories of these two runs can drift.

Lemma 6.1. Let $(u_{k+1}, w_{k+1})_{k\ge 0}$ and $(u'_{k+1}, w'_{k+1})_{k\ge 0}$ be the two trajectories of the NISPP method (Algorithm 2) for any adjacent datasets $S \simeq S'$, whose batches are denoted by $\mathcal{B}_{k+1}$ and $\mathcal{B}'_{k+1}$, respectively. Moreover, denote $a_{k+1}$ as stated.

Proof. It is clear from the definition of $i$ that $\mathcal{B}_j = \mathcal{B}'_j$ for all $j \le i$. This implies $u_j = u'_j$ and $w_j = w'_j$ for all $j \le i$. Hence, we conclude the first case of (6.1).
Using (5.1) for the $\nu$-approximate strong VI solutions, we obtain (6.2) and (6.3). Adding (6.2) with $w = u'_{k+1}$ and (6.3) with $w = u_{k+1}$, we obtain (6.4). Also note the subsequent relation, whose last inequality follows from the monotonicity of $F_{\mathcal{B}'_{k+1}}$. Using this relation along with (6.4), we obtain a quadratic inequality in $\delta_{k+1}$, where we used the definition of $a_k$ along with the Cauchy-Schwarz inequality. Solving this quadratic inequality in $\delta_{k+1}$, we obtain a recursion, which can be further simplified. Solving the simplified recursion and noting the base case $\delta_i = 0$, we obtain (6.1).
A direct consequence of the previous analysis is in-expectation and high-probability uniform argument stability upper bounds for the sampling-with-replacement variant of Algorithm 2.

Theorem 6.2. Let $\mathcal{A}$ denote the sampling-with-replacement NISPP method (Algorithm 2), where $\mathcal{B}_k$ is chosen uniformly at random from subsets of $S$ of a given size $B_k$. Then $\mathcal{A}$ satisfies the stated uniform argument stability bounds. Furthermore, if $|\mathcal{B}_k| = B$ and $\lambda_k = \lambda$ for all $k$ (i.e., constant batch size and regularization parameter throughout the iterations), then the second bound holds with probability at least $1 - n\exp(-KB/[4n])$ (over both sampling and noise addition).

Proof. Let $S \simeq S'$, and let $(u_{k+1})_k$, $(u'_{k+1})_k$ be the trajectories of the algorithm on $S$ and $S'$, respectively. By Lemma 6.1, letting $\delta_{k+1} = \|w_{k+1} - w'_{k+1}\|$, we obtain the in-expectation bound. We proceed now to the high-probability bound. Let $r_k \sim \mathrm{Ber}(p)$ for $k \in [K]$. Then, for any $0 < \theta < 1/2$, the stated tail bound holds. Choosing $\theta = \tau/[2Kp] < 1/2$, we get that the probability above is upper bounded by $\exp(-\tau^2/[4Kp])$. Finally, choosing $\tau = Kp$, we obtain the claim. Next, fix the coordinate $i$ where $S \simeq S'$ may differ. Noticing that $a_k$ is a.s. upper bounded by $(2M/B)\,r_k$ with $r_k \sim \mathrm{Ber}(p)$ and $p = B/n$, we obtain the stated bound. In particular, w.p. at least $1 - \exp(-\frac{KB}{4n})$, we have $a_k \le \frac{4M}{\lambda n} + \sqrt{4\nu/\lambda}$. Using a union bound over $i \in [n]$ (and noticing that averaging preserves the stability bound), we conclude the claim. Hence, we conclude the proof.
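The Bernoulli tail argument in this proof says that the swapped datapoint is touched in iteration $k$ with probability $p = B/n$, and that the total number of hits concentrates below $2Kp$ with failure probability roughly $\exp(-Kp/4)$. This can be checked by simulation; the values of $K$ and $p$ below are hypothetical:

```python
import random

def tail_fraction(K, p, threshold, rng, trials=20000):
    # Fraction of runs in which the swapped datapoint is sampled more than
    # `threshold` times across K iterations (each hit ~ Bernoulli(p))
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(K) if rng.random() < p)
        if s > threshold:
            hits += 1
    return hits / trials

rng = random.Random(0)
K, p = 200, 0.05  # e.g. K iterations; batch hits the swapped point w.p. p = B/n
frac = tail_fraction(K, p, threshold=2 * K * p, rng=rng)
```

In this regime `frac` is on the order of $10^{-3}$, well below the Chernoff-type bound, consistent with the concentration step of the proof.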
Remark 6.3. An important observation is that, for the high-probability guarantee to be useful, it is necessary that the algorithm is run for sufficiently many iterations; in particular, we require $K = \omega(n/B)$. Whether this assumption can be avoided entirely is an interesting question. Nevertheless, as we will see in the next section, our parameter policy for the DP-SVI and DP-SSP problems satisfies this requirement.
6.2 Optimal risk for DP-SVI and DP-SSP by the NISPP method

In the previous three sections, we provided bounds on the optimization error, the generalization error, and the value of $\sigma$ required for $(\varepsilon, \eta)$-differential privacy. In this section, we specify a policy for selecting $\lambda_k, B_k, \gamma_k, \sigma_k$ and $\nu$ such that the requirements of the previous three sections are satisfied and we obtain optimal risk bounds while maintaining $(\varepsilon, \eta)$-privacy. In particular, consider the multi-pass NISPP method where each sample batch $\mathcal{B}_k$ is chosen randomly from subsets of $S$ with replacement. Then, we have the following theorem.

Theorem 6.4. Let $\mathcal{A}$ be the multi-pass NISPP method (Algorithm 2), run with the stated constant stepsize and batchsize policy. Then the stated guarantees hold.

Proof. Note that since $\nu$ satisfies the assumption in Proposition 5.2, the $\ell_2$-sensitivity of the update of $u_{k+1}$ is $\frac{4M}{\lambda_0 B_{k+1}}$. Then, in view of Theorem 2.2 along with the value of $\sigma_{k+1}$, we conclude that Algorithm 2 is $(\varepsilon, \eta)$-differentially private. Now, for convergence, we first bound the empirical gap. Given that our bounds for (VI(F)) and (SP(f)) are analogous, we argue both problems simultaneously. By Theorem 5.1, along with the fact that sampling with replacement is an unbiased stochastic oracle for the empirical operator, we obtain the stated empirical bound for any $S$.

This is a known lower bound for DP-SCO [4]. It is important to note as well that this reduction applies to the weak generalization gap, as defined in (1.4), since in this case $\mathcal{Y} = \{\bar{y}\}$ is a singleton. Next, we study the case of DP-SVI. The situation is more subtle here. Our approach is to first prove a reduction from the population weak VI gap to the empirical strong VI gap for the case where the operators are constant with respect to $w$. In fact, it seems unlikely that this reduction works for more general monotone operators; however, this suffices for our purposes, as we will later prove that a lower bound construction used for DP-ERM [5] leads to a lower bound for the strong VI gap with constant operators. The formal reduction to
the empirical version of the problem is presented in the following lemma. Its proof closely follows the reduction from DP-SCO to DP-ERM in [4]. Below, given a dataset $S \in \mathcal{Z}^n$, let $P_S = \frac{1}{n}\sum_{\beta\in S}\delta_\beta$ be the empirical distribution associated with $S$.
Lemma 7.1. Let $\mathcal{A}$ be an $(\varepsilon/[4\log(1/\eta)],\, e^{-\varepsilon}\eta/[8\log(1/\eta)])$-DP algorithm for SVI problems. Then there exists an $(\varepsilon, \eta)$-DP algorithm $\mathcal{B}$ such that, for any empirical VI problem with constant operators, $\mathrm{EmpGap}_{VI}(\mathcal{B}, F_S) \le \mathrm{WeakGap}_{VI}(\mathcal{A}, F_{P_S})$ for all $S \in \mathcal{Z}^n$.
Proof. Consider the algorithm $\mathcal{B}$ that does the following: first, it draws a sample $T \sim (P_S)^n$; next, it executes $\mathcal{A}$ on $T$; and finally, it outputs $\mathcal{A}(T)$. We claim that this algorithm is $(\varepsilon, \eta)$-DP with respect to $S$, which follows by bounding the number of repeated examples with high probability, together with the group privacy property applied to $\mathcal{A}$ (for a more detailed proof, see Appendix C in [4]). Now, given a constant operator $F_\beta(w)$, let $R(\beta) \in \mathbb{R}^d$ be its unique evaluation. Let $R_S$ be the unique evaluation of $F_S$ and, given a distribution $P$, let $R_P$ be the unique evaluation of $F_P(w) = \mathbb{E}_{\beta\sim P}[F_\beta(w)]$.
Noting that $\mathbb{E}_T[R_T] = R_S$, we compute $\mathrm{EmpGap}_{VI}(\mathcal{B}(S), F_S)$ via the displayed chain of equalities, where the third equality holds since the optimal choice of $w$ is independent of $T$, and the last equality holds by the definition of the weak gap function and the fact that $T \sim (P_S)^n$.
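The resampling construction of algorithm $\mathcal{B}$ can be sketched as follows. This is a toy stand-in: the black-box DP algorithm $\mathcal{A}$ is replaced by a dataset mean and the privacy accounting is omitted; only the $T \sim (P_S)^n$ resampling step from the proof is illustrated.

```python
import random

def resampling_reduction(private_algorithm, S, rng):
    # Algorithm B of the reduction: draw T ~ (P_S)^n, i.e. n i.i.d. samples
    # from the empirical distribution of S, then run the DP algorithm A on T
    n = len(S)
    T = [S[rng.randrange(n)] for _ in range(n)]
    return private_algorithm(T)

# Toy stand-in: "A" returns the dataset mean (no actual privacy mechanism);
# B simply inherits A's behavior on the resampled dataset T.
rng = random.Random(0)
S = [1.0, 2.0, 3.0, 4.0]
out = resampling_reduction(lambda T: sum(T) / len(T), S, rng)
```

Because $T$ may repeat elements of $S$, the privacy of $\mathcal{B}$ follows from group privacy of $\mathcal{A}$ applied to the (whp small) number of repetitions, as noted in the proof above.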
Next, we prove a lower bound for the empirical VI problem over constant operators. Before proving the result, let us observe that the presented lower bound shows the optimality of our algorithms in the regime $M \ge LD$. Obtaining a matching lower bound for arbitrary $M, L, D$ is an interesting question which, unfortunately, our proof technique does not address: a limitation is that the lower bound is based on constant operators, whose Lipschitz constants are always zero.
Proof. Let $\mathcal{A}$ be any algorithm for SVI. By the classical (nonprivate) lower bounds for SVI [30, 22], the minimax SVI gap attainable is lower bounded by $\Omega(MD/\sqrt{n})$. On the other hand, using Lemma 7.1, the accuracy of any $(\varepsilon, \eta)$-DP algorithm for the weak SVI gap is lower bounded by the strong gap achieved by $(4\varepsilon\ln(1/\eta),\, e^{\varepsilon}\tilde{O}(\eta))$-DP algorithms on empirical VI problems with constant operators. Finally, by Proposition 7.2, the latter class of problems enjoys a lower bound $\Omega(\min\{1, \sqrt{d\ln(1/[e^{\varepsilon}\tilde{O}(\eta)])}/[\varepsilon n\ln(1/\eta)]\}) = \Omega(\min\{1, \sqrt{d}/[n\varepsilon]\})$, which implies a lower bound of the same order on the former class. We conclude by combining the private and nonprivate lower bounds established above.

Theorem 4.3. The NSEG method (Algorithm 1), for a closed and convex domain $\mathcal{W} \subseteq \mathbb{R}^d$ with diameter $D$ and operators in $\mathcal{M}^1$

2. For the non-DP version of the NISPP method, i.e., $\sigma_k = 0$ for all $k$, we can easily obtain a population risk bound of $O(\frac{MD}{\sqrt{n}})$ by setting $\lambda_0 = \frac{M\sqrt{n}}{D}$, $B = 1$ (equivalently, $K = n$) and $\nu = \frac{MD}{\sqrt{n}}$ in Corollary 5.1.

Proposition 7.1. Let $n, d \in \mathbb{N}$, $L, M, D, \varepsilon > 0$ and $\delta = o(1/n)$. The class of DP-SSP problems with gradient operators within the class $\mathcal{M}^1_{\mathcal{W}}(L, M)$, and domain $\mathcal{W}$ containing a Euclidean ball of diameter $D/2$, satisfies the lower bound $\Omega\big(MD\big(\frac{1}{\sqrt{n}} + \min\{1, \frac{\sqrt{d}}{n\varepsilon}\}\big)\big)$.

Indeed, when $\mathcal{Y} = \{\bar{y}\}$ is a singleton,
$$\mathrm{WeakGap}_{SP}(\mathcal{A}, f) = \mathbb{E}_{\mathcal{A}}\Big[\sup_{y\in\mathcal{Y}} \mathbb{E}_S[f(x(S), y)] - \inf_{x\in\mathcal{X}} \mathbb{E}_S[f(x, y(S))]\Big] = \mathbb{E}_{\mathcal{A}}\,\mathbb{E}_S[f(x(S), \bar{y})] - \inf_{x\in\mathcal{X}} f(x, \bar{y}) = \mathbb{E}_{\mathcal{A},S}\big[f(x(S), \bar{y}) - \inf_{x\in\mathcal{X}} f(x, \bar{y})\big],$$
which is simply the expected optimality gap. Using this reduction, together with a lower bound for DP-SCO [4], we conclude the result. Similarly, for the VI reduction of Lemma 7.1,
$$\mathbb{E}_{\mathcal{B}}\Big[\sup_{w\in\mathcal{W}} \langle R_S, \mathcal{B}(S) - w\rangle\Big] = \mathbb{E}_{\mathcal{A},T}[\langle R_S, \mathcal{A}(T)\rangle] - \inf_{w\in\mathcal{W}}\langle R_S, w\rangle = \mathbb{E}_{\mathcal{A}} \sup_{w\in\mathcal{W}} \mathbb{E}_T[\langle R_S, \mathcal{A}(T) - w\rangle] = \mathrm{WeakGap}_{VI}(\mathcal{A}, F_{P_S}).$$

Proposition 7.2. Let $n, d \in \mathbb{N}$, $L, M, D, \varepsilon > 0$ and $2^{-o(n)} \le \delta \le o(1/n)$. The class of DP empirical VI problems with constant operators within the class $\mathcal{M}^1_{\mathcal{W}}(L, M)$, and domain $\mathcal{W}$ containing a Euclidean ball of diameter $D/2$, satisfies the lower bound $\Omega\big(MD\min\{1, \frac{\sqrt{d\log(1/\eta)}}{\varepsilon n}\}\big)$.

Proof. Consider the following empirical VI problem: $F_\beta(u) = M\beta$, $\mathcal{W} = B(0, D)$, and a dataset $S$ with points contained in $\{-1/\sqrt{d}, +1/\sqrt{d}\}^d$. Notice that, since the operator in this case is constant, the VI gap of any $u \in \mathcal{W}$, $\mathrm{EmpGap}_{VI}(u, F_S)$, coincides with the excess risk of the associated convex optimization problem. Combining this with the lower bounds on excess risk proved for this problem in [5, Appendix C] and [36, Theorem 5.1] shows that any $(\varepsilon, \eta)$-DP algorithm for this problem must incur a worst-case VI gap of $\Omega\big(MD\min\{1, \frac{\sqrt{d\log(1/\eta)}}{\varepsilon n}\}\big)$, which proves the result.

The two results above provide the claimed lower bound for the weak SVI gap of any differentially private algorithm.

Theorem 7.2. Let $n, d \in \mathbb{N}$, $L, M, D, \varepsilon > 0$ and $2^{-o(n)} \le \delta \le o(1/n)$. The class of DP-SVI problems with operators within the class $\mathcal{M}^1_{\mathcal{W}}(L, M)$, and domain $\mathcal{W}$ containing a Euclidean ball of diameter $D/2$, satisfies a lower bound for the weak VI gap of $\Omega\big(MD\big(\frac{1}{\sqrt{n}} + \min\{1, \frac{\sqrt{d}}{n\varepsilon}\}\big)\big)$.

From Appendix B (nonexpansiveness of the OE update):
$$\|\Pi_{\mathcal{W}}(u - \gamma F_2(w)) - \Pi_{\mathcal{W}}(v - \gamma F_2(z))\| \le \|R(u; \gamma F_2) - R(v; \gamma F_2)\| + \|\Pi_{\mathcal{W}}(u - \gamma F_2(w)) - \Pi_{\mathcal{W}}(u - \gamma F_2(R(u; \gamma F_2)))\| + \|\Pi_{\mathcal{W}}(v - \gamma F_2(z)) - \Pi_{\mathcal{W}}(v - \gamma F_2(R(v; \gamma F_2)))\| \le \|u - v\| + \gamma L\|w - R(u; \gamma F_2)\| + \gamma L\|z - R(v; \gamma F_2)\|.$$
Algorithm 2 is $(\varepsilon, \eta)$-differentially private. Moreover, the output $\mathcal{A}(S)$ satisfies the stated bound on $\mathbb{E}_{\mathcal{A}}[\mathrm{WeakGap}_{VI}(\mathcal{A}(S), F)]$ for the SVI problem (or $\mathbb{E}_{\mathcal{A}}[\mathrm{WeakGap}_{SP}(\mathcal{A}(S), f)]$ for the SSP problem). Next, by Theorem 6.2, we have that $\mathcal{A}(S)$ (or $x(S)$ and $y(S)$, in the SSP case) is UAS with the stated parameter. Hence, noting that the empirical risk upper bounds the weak VI or SP gap, i.e., using Proposition 2.1 or Proposition 2.2 (depending on whether the problem is an SSP or SVI, respectively), the risk is upper bounded by the empirical risk plus $M\delta$, where $\delta$ is the UAS parameter of the algorithm; in particular, if $\mathrm{WeakGap}(\mathcal{A}; S)$ denotes the (SVI or SSP, respectively) gap function for the expectation objective, then the claimed bound follows.