1 Introduction

Stochastic variational inequalities (SVI) and stochastic saddle-point (SSP) problems have become a central part of the modern machine learning toolbox. The main motivation behind this line of research is the design of algorithms for multiagent systems and adversarial training, which are more suitably modeled in the language of games than as pure (stochastic) optimization. Applications that rely on these methods often involve sensitive user data, so it is important to develop algorithms for these problems with provable privacy-preserving guarantees. In this context, differential privacy (DP) has become the gold standard for privacy-preserving algorithms, so a natural question is whether it is possible to design DP algorithms for SVI and SSP that attain high accuracy.

Motivated by these considerations, this work provides the first systematic study of differentially private SVI and SSP problems. Before proceeding to the specific results, we state the problems of interest more precisely. The stochastic variational inequality (SVI) problem is: given a monotone operator \(F:\mathcal {W}\mapsto \mathbb {R}^d\) in expectation form \(F(w)=\mathbb {E}_{\varvec{\beta }\sim {\mathcal P}}[F_{\varvec{\beta }}(w)]\), find \(w^* \in \mathcal {W}\) such that

$$\begin{aligned} \langle F(w^*),w-w^*\rangle \geqslant 0 \quad \forall w \in \mathcal {W}. \end{aligned}$$
(VI(F))

The closely related stochastic saddle point (SSP) problem is: given a convex-concave real-valued function \(f:{\mathcal W}\mapsto \mathbb {R}\) (here \({\mathcal {W}}={{\mathcal {X}}}\times {{\mathcal {Y}}}\) is a product space) in expectation form \(f(x,y) = \mathbb {E}_{\varvec{\beta }\sim \mathcal {P}}[f_{\varvec{\beta }}(x, y)]\), the goal is to find \((x^{*},y^{*})\) that solves

$$\begin{aligned} \min _{x \in \mathcal {X}} \max _{y \in \mathcal {Y}} f(x,y) . \end{aligned}$$
(SP(f))

In both of these problems, the input to the algorithm is an i.i.d. sample \(\textbf{S}=(\varvec{\beta }_1,\ldots ,\varvec{\beta }_n)\sim {{\mathcal {P}}}^n\). Uncertainty introduced by a finite random sample renders the computation of exact solutions infeasible, so gap (a.k.a. population risk) functions are used to quantify the quality of solutions. Let \({\mathcal {A}}:{{\mathcal {Z}}}^n\mapsto {\mathcal {W}}\) be an algorithm for SVI problems (VI(F)).

We define the strong VI-gap associated with \({\mathcal {A}}\) as

$$\begin{aligned} \text{ Gap}_{\textrm{VI}}({\mathcal {A}},F):= {\mathbb {E}_{{\mathcal {A}},\textbf{S}}} \left[ \sup _{w\in \mathcal {W}}\langle F(w),{\mathcal {A}}(\textbf{S})-w \rangle \right] . \end{aligned}$$
(1.1)

We also define the weak VI-gap as

$$\begin{aligned} \text{ WeakGap}_{\textrm{VI}}({\mathcal {A}},F) := \mathbb {E}_{\mathcal {A}} \sup _{w \in \mathcal {W}} \mathbb {E}_{\textbf{S}}\left[ \langle F(w),{\mathcal {A}}(\textbf{S}) -w\rangle \right] . \end{aligned}$$
(1.2)

Here, expectation is taken over both the sample data \(\textbf{S}\) and the internal randomization of \({\mathcal {A}}\). For SSP (SP(f)), given an algorithm \({\mathcal {A}}:{{\mathcal {Z}}}^n\mapsto {{\mathcal {X}}}\times {\mathcal Y}\), and letting \({\mathcal {A}}(\textbf{S})=(x(\textbf{S}),y(\textbf{S}))\), a natural gap function is the following saddle-point (a.k.a. primal-dual) gap

$$\begin{aligned} \text{ Gap}_{\textrm{SP}}({\mathcal {A}},f):= \mathbb {E}_{{\mathcal {A}},\textbf{S}} \left[ \sup _{x\in \mathcal {X},\, y\in \mathcal {Y}} [f(x(\textbf{S}), y)-f(x,y(\textbf{S}))] \right] . \end{aligned}$$
(1.3)

Analogously to the above, we define the weak SSP gap as

$$\begin{aligned} \text{ WeakGap}_{\textrm{SP}}({\mathcal {A}},f):= \mathbb {E}_{\mathcal {A}} \sup _{x\in \mathcal {X},\, y\in \mathcal {Y}} \mathbb {E}_{\textbf{S}}[f(x(\textbf{S}), y)-f(x,y(\textbf{S}))] . \end{aligned}$$
(1.4)

It is easy to see that in both cases the gap is always nonnegative, and any exact solution must have zero gap. For examples and applications of SVI and SSP we refer to Sect. 2.1. Although the strong VI gap is the more classical and well-studied quantity, the weak VI gap has been observed to be useful in various contexts. We refer the reader to [50] for further discussion of the weak VI gap.
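As a point of comparison between the two notions, the weak gap never exceeds the strong gap, since the supremum of an expectation is bounded by the expectation of the supremum; e.g., for the VI gaps,

$$\begin{aligned} \text{ WeakGap}_{\textrm{VI}}({\mathcal {A}},F) = \mathbb {E}_{\mathcal {A}} \sup _{w\in \mathcal {W}} \mathbb {E}_{\textbf{S}}\left[ \langle F(w),{\mathcal {A}}(\textbf{S})-w\rangle \right] \leqslant \mathbb {E}_{{\mathcal {A}},\textbf{S}}\left[ \sup _{w\in \mathcal {W}}\langle F(w),{\mathcal {A}}(\textbf{S})-w\rangle \right] = \text{ Gap}_{\textrm{VI}}({\mathcal {A}},F). \end{aligned}$$

In particular, upper bounds on the weak gap are a weaker accuracy guarantee than upper bounds on the strong gap, whereas lower bounds on the weak gap (as in Sect. 7) immediately imply the same lower bounds on the strong gap.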

On the other hand, we are interested in designing algorithms that are differentially private. These algorithms build a solution based on a given dataset \(\textbf{S}\) of i.i.d. examples from the target distribution, and output a (randomized) feasible solution \({\mathcal {A}}(\textbf{S})\). We say that two datasets \(\textbf{S}=(\varvec{\beta }_i)_i,\textbf{S}^{\prime }=(\varvec{\beta }_i^{\prime })_i\) are neighbors, denoted \(\textbf{S}\simeq \textbf{S}^{\prime }\), if they differ in at most a single entry i. We say that an algorithm \({\mathcal {A}}(\textbf{S})\) is \((\varepsilon ,\eta )\)-differentially private if for every event E in the output space

$$\begin{aligned} \mathbb {P}_{\mathcal {A}}[{\mathcal {A}}(\textbf{S})\in E] \leqslant e^{\varepsilon }\mathbb {P}_{\mathcal {A}}[{\mathcal {A}}(\textbf{S}^{\prime })\in E] + \eta \quad (\forall \textbf{S}\simeq \textbf{S}^{\prime }). \end{aligned}$$
(1.5)

Here \(\varepsilon ,\eta \geqslant 0\) are prescribed parameters that quantify the privacy guarantee. Designing DP algorithms for particular data analysis problems is an active area of research. Optimal risk algorithms for stochastic convex optimization have only very recently been developed, and it is unclear whether these methods extend to the SVI and SSP settings.

1.1 Summary of contributions

Our work is the first to provide population risk bounds for DP-SVI and DP-SSP problems. Moreover, our algorithms attain provably optimal rates and are computationally efficient. We summarize our contributions as follows:

  1.

    We provide two different algorithms for DP-SVI and DP-SSP: the noisy stochastic extragradient (NSEG) method and the noisy inexact stochastic proximal-point (NISPP) method. The NSEG method is a natural DP variant of the well-known stochastic extragradient method [30], where privacy is obtained by Gaussian noise addition; the NISPP method, on the other hand, is an approximate proximal point algorithm [28, 43] in which every proximal iterate is perturbed with noise to make it differentially private. The more basic variants of both methods are based on iterations involving disjoint sets of datapoints (a.k.a. single-pass methods), which are known to typically lead to highly suboptimal rates in DP (see the Related Work section for further discussion).

  2.

    We derive novel uniform stability bounds for the NSEG and NISPP methods. For NSEG, our stability upper bounds are inspired by the interpretation of the extragradient method as a (second order) approximation of the proximal point algorithm. In particular, we provide expansion bounds for the extragradient iterates and solve a (stochastic) linear recursion. The stability bounds for the NISPP method are based on the stability of the (unique) SVI solution in the strongly monotone case. Finally, we investigate the risk attained by multipass versions of the NSEG and NISPP methods, leveraging known generalization bounds for stable algorithms [35]. Here, we show that the optimal risk for DP-SVI and DP-SSP can be attained by running the sampling-with-replacement variants of these algorithms. In particular, the NSEG method requires \(n^2\) stochastic operator evaluations, whereas the NISPP method requires a much smaller \({\widetilde{O}}(n^{3/2})\) number of operator evaluations, for both DP-SVI and DP-SSP. These upper bounds also show the dependence of the running time of each algorithm on the dataset size.

  3.

    Finally, we prove lower bounds on the weak gap function for any DP-SSP and DP-SVI algorithm, showing the risk optimality of the aforementioned multipass algorithms. The main challenge in these lower bounds is showing that existing constructions of lower bounds for DP convex optimization [5, 7, 46] lead to lower bounds on the weak gap of a related SP/VI problem.

Table 1 provides details of the population risk and operator evaluation complexity achieved by our methods.

1.2 Related work

We divide our discussion of related work into three main areas. Each of these areas has been extensively investigated, so a thorough description of existing work is not possible; we focus on the work most directly related to ours.

Table 1 Different levels of risk and complexity achieved by NSEG and NISPP methods for \((\varepsilon ,\eta )\)-differentially private SVI/SSP. Here n is the dataset size, and d is the dimension of the solution search space. We omit the dependence on other problem parameters (e.g., Lipschitz constants and diameter), as well as the privacy parameter \(\eta \)
  1.

    Stochastic Variational Inequalities and Saddle-Point Problems: Variational inequalities and saddle-point problems are classical topics in applied mathematics, operations research and engineering (e.g., [3, 18, 33, 38, 40,41,42,43, 45]). Their stochastic counterparts have only gained traction recently, mainly motivated by their applications in machine learning (e.g., [24, 25, 29, 30, 34] and references therein). For the stochastic version of (SP(f)), [39] proposed a robust stochastic approximation method. The first optimal algorithm for SVI with monotone Lipschitz operators was obtained by Juditsky, Nemirovski and Tauvel [30], and very recently Kotsalis, Lan and Li [34] developed optimal variants for the strongly monotone case (in terms of distance to the optimum criterion, rather than VI gap).

    It is important to note that a naive adaptation of these methods to the DP setting requires adding noise to the operator evaluations at every iteration, which substantially degrades the accuracy of the obtained solution. Careful privacy accounting and a minibatch schedule can lead to optimal guarantees for single-pass methods [19]; however, this requires accuracy guarantees for the last iterate, which is currently an open problem for SVI and SSP (aside from specific cases, typically involving strong monotonicity conditions, e.g., [24, 34]). We circumvent this problem by providing population risk guarantees for multipass methods.

  2.

    Stability and Generalization: Deriving generalization (or population risk) bounds for general-purpose algorithms is a challenging task, actively studied in theoretical machine learning. Bousquet and Elisseeff [8] provided a systematic treatment of this question for algorithms that are stable with respect to changes of a single element of the training dataset, and a sequence of works has refined these generalization guarantees (see [9, 21] and references therein). This idea has been applied to investigate the generalization properties of regularized empirical risk minimization [8, 44], and more recently of iterative methods, such as stochastic gradient descent [4, 23].

    Using stability to obtain population risk bounds for SVI and SSP is substantially more challenging, due to the presence of a supremum in the accuracy measure (see Eqs. (1.1) and (1.3)). Recently, Zhang et al. [50] established that stability implies generalization for the strong SP gap under strong monotonicity assumptions. Their proof strategy applies analogously to the SVI setting, although this is not carried out in their work. More recently, Lei et al. [35] proved generalization bounds on the weak SP gap without strong monotonicity assumptions. We leverage this result for our algorithms, and further elaborate on its implications for SVI in Sect. 2.2.

  3.

    Differential Privacy: Differential privacy is the gold standard for private data analysis, and it has been studied for nearly 20 years [15, 16]. Beyond its classical definition, multiple variants have been introduced, including local [14, 31], concentrated [10], Rényi [37], and Gaussian [13]. Relevant to the optimization community are the applications of differential privacy to combinatorial optimization [22].

    Differentially private empirical risk minimization and stochastic convex optimization have been extensively studied for over a decade (see, e.g. [5, 7, 11, 12, 19, 26, 27, 32, 47]). Relevant to our work are the first optimal risk algorithms for DP-ERM [7] and DP-SCO [5]. Non-Euclidean extensions have also been obtained recently [2, 6]. To the best of our knowledge, our work is the first to address DP algorithms for SVI and SSP. Our approach for generalization of multipass algorithms is inspired by the noisy SGD analysis in [4]. However, our stability analysis differs crucially from [4]: in the case of NSEG, we need to carefully address the double operator evaluation of the extragradient step, which is done by using the fact that the extragradient operator is approximately nonexpansive. In the case of NISPP, we leverage the contraction properties of strongly monotone VI solutions. By contrast, SGD in the nonsmooth case is far from nonexpansive [4]. Alternative approaches to obtain optimal risk in DP-SCO, including privacy amplification by iteration [19, 20], and phased regularization or phased SGD [19], appear to run into fundamental limitations when applied to DP-SVI and DP-SSP. It is an interesting future research direction to obtain faster running times with optimal population risk in DP-SVI and DP-SSP, which may benefit from these alternative approaches.

The main body of this paper is organized as follows. In Sect. 2, we provide the background on SVI/SSP, uniform stability, and differential privacy that is necessary for the rest of the paper. In Sect. 3 we introduce the NSEG method, together with its basic privacy and accuracy guarantees for a single-pass version. Section 4 provides stability bounds for the NSEG method, along with the consequent optimal rates for SVI and SSP. In Sect. 5, we introduce the single-pass differentially private NISPP method with a bound on its expected SVI gap. Section 6 presents the stability analysis of the NISPP method, together with the resulting optimal rates for the SVI/SSP gap. We conclude in Sect. 7 with lower bounds that prove the optimality of the obtained rates.

2 Notation and preliminaries

We work on the Euclidean space \((\mathbb {R}^d, \langle \cdot ,\cdot \rangle )\), where \(\langle \cdot ,\cdot \rangle \) is the standard inner product, and \(\Vert u\Vert =\sqrt{\langle u,u\rangle }\) is the \(\ell _2\)-norm. Throughout, we consider a compact convex set \(\mathcal {W}\subseteq \mathbb {R}^d\) with diameter \(D>0\). We denote the standard Euclidean projection operator on set \(\mathcal {W}\) by \(\varPi _{\mathcal {W}}(\cdot )\). The identity matrix on \(\mathbb {R}^d\) is denoted by \({\mathbb {I}}_d\).

We let \(\mathcal {P}\) denote an unknown distribution supported on an arbitrary set \(\mathcal {Z}\), from which we have access to exactly n i.i.d. datapoints which we denote by sample set \(\textbf{S}\sim {{\mathcal {P}}}^n\). Throughout, we will use boldface characters to denote sources of randomness (coming from the data, or internal algorithmic randomization). We say that two datasets \(\textbf{S}, \textbf{S}'\) are adjacent (or neighbors), denoted by \(\textbf{S}\simeq \textbf{S}'\), if they differ in a single data point. We also denote subsets (a.k.a. batches), or single data points, of \(\textbf{S}\) or \(\mathcal {P}\) by \(\textbf{B}\) and \(\varvec{\beta }\), respectively. Whether \(\varvec{\beta }\) or \(\textbf{B}\) is sampled from \(\mathcal {P}\) or \(\textbf{S}\) is specified explicitly unless it is clear from the context. For a batch \(\textbf{B}\), we denote its size by \(|\textbf{B}|\). Therefore, we have \(|\textbf{S}| = n\). Throughout, we will denote Gaussian random variables by \(\varvec{\xi }\).

We say that \(F:\mathcal {W}\rightarrow \mathbb {R}^d\) is a monotone operator if

$$\begin{aligned} \langle F(w_1) -F(w_2),w_1-w_2\rangle \geqslant 0, \quad \forall w_1, w_2 \in \mathcal {W}. \end{aligned}$$

Given \(L>0\), we say that F is L-Lipschitz continuous, if

$$\begin{aligned} \Vert F(w_1) - F(w_2)\Vert _{}^{} \leqslant L\Vert w_1-w_2\Vert _{}^{}, \quad \forall w_1, w_2 \in \mathcal {W}. \end{aligned}$$

Finally, we say that F is M-bounded if \(\sup _{w\in \mathcal {W}}\Vert F(w)\Vert \leqslant M\). We denote the set of monotone, L-Lipschitz and M-bounded operators by \({\mathcal {M}}_{\mathcal {W}}^1(L,M)\). In this work, we will focus on the case where F is an expectation operator, i.e., \(F(w):= \mathbb {E}_{\varvec{\beta }\sim \mathcal {P}} [F_{\varvec{\beta }}(w)]\), where \(\mathcal {P}\) is an arbitrary distribution supported on \(\mathcal {Z}\), and \(F_{\varvec{\beta }}(\cdot )\in {\mathcal {M}}_{\mathcal {W}}^1(L,M)\), \(\varvec{\beta }\)-a.s.

In the stochastic saddle point problem (SP(f)), we modify the notation slightly. Here, \({{\mathcal {X}}}\subseteq \mathbb {R}^{d_1}\) and \({{\mathcal {Y}}}\subseteq \mathbb {R}^{d_2}\) are compact convex sets, and we will assume that the saddle point functions \(f_{\varvec{\beta }}(\cdot ,\cdot ):{{\mathcal {X}}}\times {{\mathcal {Y}}}\mapsto \mathbb {R}\), satisfy the following conditions \(\varvec{\beta }\)-a.s.

  • \(\nabla _x f_{\varvec{\beta }}(\cdot ,\cdot )\) is \(L_x\)-Lipschitz continuous and \(\nabla _y f_{\varvec{\beta }}(\cdot ,\cdot )\) is \(L_y\)-Lipschitz continuous; and

  • \(f_{\varvec{\beta }}(\cdot ,y)\) is convex, for any given \(y\in \mathcal {Y}\), and \(f_{\varvec{\beta }}(x,\cdot )\) is concave, for any given \(x\in \mathcal {X}\) (we will say in this case the function is convex-concave).

If the assumptions above are met, we will denote \(L\triangleq \sqrt{L_x^2 + L_y^2}\). Under these assumptions, it is well known that the SSP [42, 45] (respectively, the SVI [18]) problem has a solution.

In the case of saddle-point problems, given the convex-concave function \(f_{\varvec{\beta }}(\cdot ,\cdot ):{{\mathcal {X}}}\times {{\mathcal {Y}}}\mapsto \mathbb {R}\), it is well known that the operator \(F_{\varvec{\beta }}:{{\mathcal {X}}}\times {{\mathcal {Y}}}\mapsto \mathbb {R}^{d_1}\times \mathbb {R}^{d_2}\) below is monotone

$$\begin{aligned} F_{\varvec{\beta }}(x,y) = (\nabla _x f_{\varvec{\beta }}(x,y), -\nabla _y f_{\varvec{\beta }}(x,y)). \end{aligned}$$
(2.1)

We will call this operator the monotone operator associated with \(f_{\varvec{\beta }}(\cdot ,\cdot )\). Furthermore, if \(\nabla _x f_{\varvec{\beta }}\) is \(L_x\)-Lipschitz continuous and \(\nabla _y f_{\varvec{\beta }}\) is \(L_y\)-Lipschitz continuous, then \(F_{\varvec{\beta }}\) is \(\sqrt{L_x^2+L_y^2}\)-Lipschitz continuous.

It is easy to see that, given an SSP problem with function \(f_{\varvec{\beta }}(\cdot ,\cdot )\) and sets \({{\mathcal {X}}}\), \({{\mathcal {Y}}}\), an (exact) SVI solution (VI(F)) for the monotone operator associated with \(f(x,y)=\mathbb {E}_{\varvec{\beta }}[f_{\varvec{\beta }}(x,y)]\) over the set \(\mathcal {W}={{\mathcal {X}}}\times {{\mathcal {Y}}}\) yields an exact SSP solution for the starting problem. Unfortunately, such a reduction does not directly work for approximate solutions to (1.1) and (1.3), so the analysis must be done separately for both problems.

For batch \(\textbf{B}\), we denote the empirical (a.k.a. sample average) operator \(F_{\textbf{B}}(w):= \frac{1}{|\textbf{B}|} \sum _{\varvec{\beta }\in \textbf{B}} F_{\varvec{\beta }}(w).\) On the other hand, for a batch \(\textbf{B}\), the empirical saddle point function is denoted as \(f_{\textbf{B}}(x, y) = \frac{1}{|\textbf{B}|} \sum _{\varvec{\beta }\in \textbf{B}} f_{\varvec{\beta }}(x, y).\) Given a distribution \({{\mathcal {P}}}\), the expectation operator and function are denoted by \(F_{{\mathcal {P}}}(w):={\mathbb {E}}_{\varvec{\beta }\sim {\mathcal {P}}}[F_{\varvec{\beta }}(w)]\), and \(f_{{\mathcal {P}}}(x,y)=\mathbb {E}_{\varvec{\beta }\sim {{\mathcal {P}}}}[f_{\varvec{\beta }}(x,y)]\), respectively. For brevity, whenever it is clear from context we will drop the dependence on \({{\mathcal {P}}}\).

2.1 Examples and applications of SVI and SSP

An interesting problem which can be formulated as an SSP problem is the minimization of a max-type convex function:

$$\begin{aligned} \min _{x \in \mathcal {X}} \Big \{\phi (x):=\max _{1 \leqslant j \leqslant m} \phi _j(x)\Big \}, \end{aligned}$$

where \(\phi _j: \mathcal {X}\rightarrow \mathbb {R}\) is a stochastic convex function \(\phi _j(x):= \mathbb {E}_{\varvec{\zeta }_j\sim \mathcal {P}_j}[\phi _{j,\varvec{\zeta }_j}(x)]\) for all \(j \in [m]\). This problem is essentially a structured nonsmooth optimization problem which can be reformulated into a convex-concave saddle point problem:

$$\begin{aligned} \min _{x \in \mathcal {X}} \max _{y \in \varDelta _m} \mathbb {E}_{\varvec{\zeta }_1\dots \varvec{\zeta }_m} [\textstyle {\sum }_{j=1}^m y_j\phi _{j,\varvec{\zeta }_j}(x)]. \end{aligned}$$

Here, \(\varvec{\beta }= (\varvec{\zeta }_j)_{j=1}^m\) is the random input to the saddle point problem: \(f_{\varvec{\beta }}(x,y) = \sum _{j = 1}^my_j\phi _{j,\varvec{\zeta }_j}(x)\). Note that a substantial generalization of the max-type problem above is the so called compositional optimization problem:

$$\begin{aligned} \min _{x \in \mathcal {X}}\phi (x):= \varPhi (\phi _1(x), \dots , \phi _m(x)), \end{aligned}$$

where \(\phi _j(x)\) are convex maps and \(\varPhi (u_1, \dots , u_m)\) is a real-valued convex function whose Fenchel-type representation is assumed to have the form

$$\begin{aligned} \varPhi (u_1, \dots , u_m) = \max _{y \in \mathcal {Y}} \textstyle {\sum }_{j=1}^m \langle u_j,A_jy+ b_j\rangle - \varPhi _{*}(y), \end{aligned}$$

where \(\varPhi _{*}\) is convex, Lipschitz and smooth. Then, the overall optimization problem can be reformulated as a convex-concave saddle point problem:

$$\begin{aligned} \min _{x \in \mathcal {X}} \max _{y \in \mathcal {Y}} \textstyle {\sum }_{j=1}^m \langle \phi _j(x),A_jy+ b_j\rangle - \varPhi _{*}(y), \end{aligned}$$

where stochasticity is introduced due to constituent functions \(\phi _j(x) = \mathbb {E}_{\varvec{\zeta }_j} [\phi _{j,\varvec{\zeta }_j}(x)].\)
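To make these reformulations concrete, the following sketch (our own illustration; the quadratic per-group losses, the data, and the uniform point of the simplex are hypothetical choices) instantiates the max-type example above as a saddle-point problem over \(\mathcal {X}\times \varDelta _m\) and forms the associated monotone operator from (2.1).

```python
import numpy as np

# Hypothetical per-group loss: phi_{j, zeta_j}(x) = 0.5 * ||x - zeta_j||^2, convex in x.
def phi(x, zeta_j):
    return 0.5 * np.sum((x - zeta_j) ** 2)

def grad_phi(x, zeta_j):
    return x - zeta_j

def f_beta(x, y, zetas):
    """Saddle-point objective f_beta(x, y) = sum_j y_j * phi_{j, zeta_j}(x), with y in the simplex."""
    return sum(y[j] * phi(x, zetas[j]) for j in range(len(zetas)))

def F_beta(x, y, zetas):
    """Associated monotone operator (2.1): (grad_x f_beta, -grad_y f_beta)."""
    gx = sum(y[j] * grad_phi(x, zetas[j]) for j in range(len(zetas)))
    gy = -np.array([phi(x, zetas[j]) for j in range(len(zetas))])
    return gx, gy

# One random draw beta = (zeta_1, ..., zeta_m) with m = 3 groups in dimension 2.
rng = np.random.default_rng(0)
zetas = [rng.normal(size=2) for _ in range(3)]
x, y = np.zeros(2), np.ones(3) / 3          # y is the uniform point of Delta_3
print(f_beta(x, y, zetas))
print(F_beta(x, y, zetas))
```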

To conclude, we remark that these types of models have been recently proposed in machine learning to address approximate fairness [49] and federated learning on heterogeneous populations [36]. In these examples, the different indices \(j\in [m]\) may denote different subgroups from a population, and we are interested in bounding the (excess) population risk on these subgroups uniformly (with the motivation of preventing discrimination against any subgroup). This clearly cannot be achieved by a stochastic convex program, and a stochastic saddle-point formulation is effective in certifying accuracy across the different subgroups separately.

For further examples and applications of stochastic variational inequalities and saddle-point problems, we refer the reader to [29, 30, 50].

2.2 Algorithmic stability

In general, an algorithm is a randomized function mapping datasets to candidate solutions, \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathbb {R}^d\), which is measurable w.r.t. the dataset. Two datasets, \(\textbf{S}=(\varvec{\beta }_1,\ldots ,\varvec{\beta }_n),\,\textbf{S}^{\prime }=(\varvec{\beta }_1^{\prime },\ldots ,\varvec{\beta }_n^{\prime })\in \mathcal {Z}^n\) are said to be neighbors (denoted \(\textbf{S}\simeq \textbf{S}^{\prime }\)) if they only differ in at most one data point, namely

$$\begin{aligned} \varvec{\beta }_j=\varvec{\beta }_j^{\prime } \qquad (\exists i\in [n])(\forall j\ne i). \end{aligned}$$

Algorithmic stability is a notion of sensitivity analysis of an algorithm under neighboring datasets. Of particular interest to our work is the notion of uniform argument stability (UAS).

Definition 1

(Uniform Argument Stability) Let \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathbb {R}^d\) be a randomized mapping and \(\delta >0\). We say that \({\mathcal {A}}\) is \(\delta \)-uniformly argument stable (for short, \(\delta \)-UAS) if

$$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }} \mathbb {E}_{\mathcal {A}}\Vert {\mathcal {A}}(\textbf{S})-{\mathcal {A}}(\textbf{S}^{\prime })\Vert \leqslant \delta . \end{aligned}$$

Occasionally, we may denote \(\delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })\triangleq \Vert {\mathcal {A}}(\textbf{S})-{\mathcal {A}}(\textbf{S}^{\prime })\Vert \), for convenience. The importance of algorithmic stability in machine learning comes from the fact that stability implies generalization in stochastic optimization and stochastic saddle point (SSP) problems [8, 9, 50]. Below, we restate existing stability-implies-generalization results for SSP problems. Before doing so, we briefly introduce the (strong) empirical gap functions: given a dataset \(\textbf{S}\) and an algorithm \(\mathcal {A}\), we define the empirical gap functions for the saddle point and variational inequality problems, respectively, as

$$\begin{aligned} \text{ EmpGap}_{\textrm{SP}}(\mathcal {A},f_{\textbf{S}}):= & {} \mathbb {E}_{\mathcal {A}}[\sup _{x,y} f_{\textbf{S}}(x(\textbf{S}),y)-f_{\textbf{S}}(x,y(\textbf{S})) ] \end{aligned}$$
(2.2)
$$\begin{aligned} \text{ EmpGap}_{\textrm{VI}}(\mathcal {A},F_{\textbf{S}}):= & {} \mathbb {E}_{\mathcal {A}}[\sup _{w} \langle F_{\textbf{S}}(w),\mathcal {A}(\textbf{S})-w\rangle ]. \end{aligned}$$
(2.3)

Notice that in these definitions the dataset \(\textbf{S}\) is fixed.

Proposition 1

[35, 50] Consider the stochastic saddle point problem (SP(f)) with functions \(f_{\varvec{\beta }}(\cdot , y)\) and \(f_{\varvec{\beta }}(x, \cdot )\) being M-Lipschitz for all \(x \in \mathcal {X}, y \in \mathcal {Y}\) and \(\varvec{\beta }\)-a.s. Let \(\mathcal {A}: \mathcal {Z}^n \rightarrow \mathcal {X}\times \mathcal {Y}\) be an algorithm, where \(\mathcal {A}(\textbf{S})=(x(\textbf{S}),y(\textbf{S}))\). If \(x(\cdot )\) is \(\delta _x\)-UAS and \(y(\cdot )\) is \(\delta _y\)-UAS, and both are integrable, then

$$\begin{aligned} \text{ WeakGap}_{\textrm{SP}}(\mathcal {A},f) \leqslant \mathbb {E}_{\textbf{S}}[ \text{ EmpGap}_{\textrm{SP}}(\mathcal {A},f_{\textbf{S}}) ] + M[\delta _x+\delta _y]. \end{aligned}$$
(2.4)

This result can be extended to SVI problems as well. We provide a formal statement below and prove it in Appendix A.

Proposition 2

Consider a stochastic variational inequality with M-bounded operators \(F_{\varvec{\beta }}(\cdot ):\mathcal {W}\mapsto \mathbb {R}^d\). If \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathcal {W}\) is integrable and \(\delta \)-UAS, then

$$\begin{aligned} \text{ WeakGap}_{\textrm{VI}}({\mathcal {A}},F) \leqslant \mathbb {E}_{\textbf{S}}[\text{ EmpGap}_{\textrm{VI}}({\mathcal {A}},F_\textbf{S}) ] +M\delta . \end{aligned}$$
(2.5)

2.3 Background on differential privacy

Differential privacy is an algorithmic-stability type of guarantee for randomized algorithms, certifying that the output distribution of the algorithm "does not change too much" under changes of a single element of the dataset. The formal definition is provided in Eq. (1.5). Next we provide some basic results on differential privacy which we will need for our work. For further information on the topic, we refer the reader to the monograph [16].

2.3.1 Basic privacy guarantees

In this work, most of our privacy guarantees will be obtained by the well-known Gaussian mechanism, which performs Gaussian noise addition on a function with bounded sensitivity. Given a function \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathbb {R}^d\), we define its \(\ell _2\)-sensitivity as

$$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\Vert {\mathcal {A}}(\textbf{S})-{\mathcal {A}}(\textbf{S}^{\prime })\Vert . \end{aligned}$$
(2.6)

If \({\mathcal {A}}\) is randomized, then the sensitivity bound must hold with high probability over the randomization of \({\mathcal {A}}\) (this will not be a problem in this work, since our randomized algorithms enjoy sensitivity bounds w.p. 1). The Gaussian mechanism (associated with the function \({\mathcal {A}}\)) is defined as \({\mathcal {A}}_{{\mathcal {G}}}(\textbf{S})\sim {{\mathcal {N}}}({\mathcal {A}}(\textbf{S}),\sigma ^2{\mathbb {I}}_d)\).

Proposition 3

Let \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathbb {R}^d\) be a function with \(\ell _2\)-sensitivity \(s>0\). Then, for \(\sigma ^2=2s^2\ln (1/\eta )/\varepsilon ^2\), the Gaussian mechanism is \((\varepsilon ,\eta )\)-DP.
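For illustration, here is a minimal sketch of the Gaussian mechanism with the calibration of Proposition 3; the statistic being released (a mean over a set of diameter at most one, whose \(\ell _2\)-sensitivity is at most D/n) is a hypothetical example, not one used in the paper.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, eta, rng):
    """Release value + N(0, sigma^2 I) with sigma^2 = 2 * sensitivity^2 * ln(1/eta) / eps^2,
    the calibration of Proposition 3."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.0 / eta)) / eps
    return value + rng.normal(scale=sigma, size=np.shape(value))

# Hypothetical statistic: the mean of n points lying in a set of diameter D = 1,
# so changing a single point moves the mean by at most D / n in l2-norm.
rng = np.random.default_rng(1)
n, d, D = 1000, 5, 1.0
S = rng.uniform(-0.5 / np.sqrt(d), 0.5 / np.sqrt(d), size=(n, d))
private_mean = gaussian_mechanism(S.mean(axis=0), sensitivity=D / n, eps=1.0, eta=1e-6, rng=rng)
print(private_mean)
```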

Our algorithms will adaptively use a DP mechanism such as the above. Certifying privacy of a composition can be achieved in different ways. The most basic result establishes that if we use disjoint batches of data at each iteration, then the composition will preserve the largest privacy parameter among its building blocks. This result is known as parallel composition, and its proof is a direct application of the post-processing property of DP.

Proposition 4

(Parallel composition of differential privacy) Let \(\textbf{S}=(\textbf{S}_1,\ldots ,\textbf{S}_K)\in \mathcal {Z}^n\) be a dataset partitioned into blocks of sizes \(n_1,\ldots ,n_K\), respectively. Let \({\mathcal {A}}_k:\mathcal {Z}^{n_k}\times \mathbb {R}^{d\times (k-1)}\mapsto \mathbb {R}^d\), \(k=1,\ldots ,K\), be a sequence of mechanisms, and let \({\mathcal {A}}:\mathcal {Z}^n\mapsto \mathbb {R}^d\) be given by

$$\begin{aligned} {\mathcal {B}}_1(\textbf{S})= & {} {\mathcal {A}}_1(\textbf{S}_1)\\ {\mathcal {B}}_{k}(\textbf{S})= & {} {\mathcal {A}}_{k}(\textbf{S}_{k}, {\mathcal {B}}_{1}(\textbf{S}),{\mathcal {B}}_{2}(\textbf{S}),\ldots ,{\mathcal {B}}_{k-1}(\textbf{S})) \quad (\forall k=2,\ldots ,K-1)\\ {\mathcal {A}}(\textbf{S})= & {} {\mathcal {A}}_{K}(\textbf{S}_{K},{\mathcal {B}}_{1}(\textbf{S}),{\mathcal {B}}_{2}(\textbf{S}),\ldots ,{\mathcal {B}}_{K-1}(\textbf{S})). \end{aligned}$$

Then, if each \({\mathcal {A}}_k\) is \((\varepsilon _k,\eta _k)\)-DP in its first argument (i.e., w.r.t. \(\textbf{S}_k\)), then \({\mathcal {A}}\) is \((\max _k \varepsilon _k, \max _k \eta _k)\)-DP.

Some of the algorithms we develop in this work make repeated use of the data, and certifying privacy for these algorithms requires the use of adaptive composition results in DP (see, e.g. [16, 17]). For our algorithms, it is particularly important to leverage the sampling-with-replacement procedure used to select the data at each iteration, for which sharp DP bounds can be obtained by the moments accountant method [1]. Below we summarize a specific version of this method that suffices for our purposes.

Theorem 1

[1] Consider a sequence of functions \({\mathcal {A}}_1,\ldots ,{\mathcal {A}}_K\), where \({\mathcal {A}}_k:\mathcal {Z}^{n_k}\times \mathbb {R}^{d\times (k-1)}\mapsto \mathbb {R}^d\) is a function whose sensitivity with respect to its data batch is bounded as follows

$$\begin{aligned} \sup _{L\in \mathbb {R}^{d\times (k-1)},\textbf{S}_k\simeq \textbf{S}_k^{\prime }}\Vert {\mathcal {A}}_k(\textbf{S}_k, L)-{\mathcal {A}}_k(\textbf{S}_k^{\prime }, L)\Vert \leqslant {s}. \end{aligned}$$

Consider the mechanism obtained by sampling a random subset of size m from the dataset, i.e., letting \(\textbf{S}_k\sim (\text{ Unif }([\textbf{S}]))^m\), and composing it with a Gaussian mechanism with noise \(\sigma ^2\), i.e.

$$\begin{aligned} {\mathcal {B}}_1(\textbf{S})= & {} ({\mathcal {A}}_1)_{{\mathcal {G}}}(\textbf{S}_{1}) \\ {\mathcal {B}}_{k}(\textbf{S})= & {} ({\mathcal {A}}_{k})_{{\mathcal {G}}}(\textbf{S}_{k}, {\mathcal {B}}_{1}(\textbf{S}),{\mathcal {B}}_{2}(\textbf{S}),\ldots ,{\mathcal {B}}_{k-1}(\textbf{S})) \quad (\forall k=2,\ldots ,K). \end{aligned}$$

There exists an absolute constant \(c_1>0\), such that if \(\varepsilon <c_1K(m/n)^2\) and the noise parameter \(\sigma \geqslant \sqrt{2K\ln (1/\eta )}sm/[n\varepsilon ]\), then \(\mathcal {A}(\textbf{S}):= \{{\mathcal {B}}_1(\textbf{S}), \ldots , {\mathcal {B}}_K(\textbf{S})\}\) is \((\varepsilon ,\eta )\)-differentially private.

3 The noisy stochastic extragradient method

To solve the DP-SVI problem we propose a noisy stochastic extragradient method (NSEG) in Algorithm 1.

[Algorithm 1: Noisy stochastic extragradient (NSEG) method]

The names noisy and stochastic in Algorithm 1 are justified by the sequence of operators \(F_{1,t},F_{2,t}\) we use:

$$\begin{aligned}{} & {} F_{1,t}(\cdot ) \triangleq F_{\textbf{B}_t^1}(\cdot )+\varvec{\xi }_t^1, \quad F_{2,t}(\cdot )\triangleq F_{\textbf{B}_t^2}(\cdot )+\varvec{\xi }_t^2. \end{aligned}$$
(3.1)

where \(\textbf{B}_t^1,\textbf{B}_t^2\) are batches extracted from the dataset \(\textbf{S}\), and \(\varvec{\xi }_t^1,\varvec{\xi }_t^2{\mathop {\sim }\limits ^{i.i.d.}}{{\mathcal {N}}}(0,\sigma _t^2{\mathbb {I}}_d)\). We will denote the common size of batches \(\textbf{B}_t^1\) and \(\textbf{B}_t^2\) by \(B_t:=|\textbf{B}_t^1|=|\textbf{B}_t^2|\). The exact details of the sampling method for the batches will depend on the variant of the algorithm. Let us detail some key features of the above algorithm. The stochastic extragradient method was proposed in [30], where no noise is added to \(F_{1,t}, F_{2,t}\) (stochasticity only arises from the dataset randomness), and where disjoint batches are used across iterations, as well as within each iteration. This choice is motivated by the goal of extracting population risk bounds for their algorithm.

Another important consideration is that this algorithm can also be applied to an SSP problem by using as stochastic oracle the monotone operator associated to the stochastic convex-concave function (2.1), over the set \(\mathcal {W}={{\mathcal {X}}}\times {{\mathcal {Y}}}\). From here onwards, when we say that a certain SVI algorithm is applied to an SSP, we mean using the choices above for the operator and feasible set, respectively.
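To fix ideas, here is a minimal sketch of the NSEG iteration just described (our own illustration, not the paper's pseudocode): each iteration makes two noisy stochastic operator evaluations as in (3.1), an extrapolation step and an update step, each followed by projection onto \(\mathcal {W}\). The bilinear test problem, the ball constraint standing in for \(\mathcal {W}\), and the use of the averaged iterate as output are assumptions made for this example only.

```python
import numpy as np

def project(w, radius=1.0):
    """Euclidean projection onto a ball of the given radius (a stand-in for Pi_W)."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def nseg(sample_operator, w0, T, gamma, sigma, rng):
    """Noisy stochastic extragradient: at step t, evaluate two independently sampled
    stochastic operators, each perturbed by Gaussian noise as in Eq. (3.1), then take
    an extrapolation step followed by an update step.  The averaged intermediate
    iterate is returned as output (an assumption made for this sketch)."""
    u = np.array(w0, dtype=float)
    d = u.size
    avg = np.zeros(d)
    for _ in range(T):
        F1 = sample_operator(rng)               # F_{B_t^1}, built from a fresh batch
        F2 = sample_operator(rng)               # F_{B_t^2}, built from a fresh batch
        xi1 = rng.normal(scale=sigma, size=d)   # privacy noise xi_t^1
        xi2 = rng.normal(scale=sigma, size=d)   # privacy noise xi_t^2
        w = project(u - gamma * (F1(u) + xi1))  # extrapolation step
        u = project(u - gamma * (F2(w) + xi2))  # update step
        avg += w
    return avg / T

# Hypothetical SSP instance: f(x, y) = x * y, whose associated monotone operator
# (2.1) is F(x, y) = (y, -x); the unique saddle point is (0, 0).
def sample_operator(rng):
    return lambda w: np.array([w[1], -w[0]])

rng = np.random.default_rng(0)
print(nseg(sample_operator, w0=[0.8, -0.6], T=2000, gamma=0.05, sigma=0.1, rng=rng))
```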

We start by stating the convergence guarantees for the single-pass NSEG method. This is obtained as a direct corollary of [30, Thm. 1], where we bound the oracle error explicitly using the variance of the Gaussian noise.

Theorem 2

[30] Consider a stochastic variational inequality (VI(F)), with operators \(F_{\varvec{\beta }}\) in \({\mathcal {M}}_{\mathcal {W}}^1(L,M)\). Then the NSEG method \(\mathcal {A}\) (Algorithm 1), where \(0<\gamma _t\leqslant 1/[\sqrt{3} L]\) and \((\textbf{B}_t^1,\textbf{B}_t^2)_{t}\) are independent random variables with \(\textbf{B}_t^1,\textbf{B}_t^2\sim {{\mathcal {P}}}^{B_t}\), satisfies

$$\begin{aligned} \text{ Gap}_{\textrm{VI}}({\mathcal {A}},F) \leqslant \frac{K_0(T)}{\varGamma _T}, \end{aligned}$$

where \(K_0(T)\triangleq \Big ( D^2 +7\sum _{t\in [T]}\gamma _t^2[M^2/2+d\sigma _t^2]\Big )\), \(\varGamma _T = \textstyle {\sum }_{t=1}^T\gamma _t\), \({\mathcal {A}}(\textbf{S})\) is the output of Algorithm 1 on the dataset \(\textbf{S}=\bigcup _t \textbf{B}_t \sim {{\mathcal {P}}}^n\), and the expectation on the left-hand side is taken over the dataset draws, the random batch choices, as well as the noise \(\varvec{\xi }^t_1, \varvec{\xi }^t_2\).

On the other hand, \(\mathcal {A}\) applied to a stochastic (SP(f)) problem attains saddle point gap

$$\begin{aligned} \text{ Gap}_{\textrm{SP}}({\mathcal {A}},f) \leqslant \frac{K_0(T)}{\varGamma _T}. \end{aligned}$$

3.1 Differential privacy analysis of NSEG  method

We now proceed to establish the privacy guarantees for the single-pass variant of Algorithm 1. This is a direct consequence of Propositions 3 and 4, and the fact that each operator evaluation has sensitivity bounded by \(2M/B_t\).

Proposition 5

Algorithm 1 with batch sizes \((B_t)_{t\in [T]}\) and variance \(\sigma _t^2=\frac{8M^2}{B_{t}^2}\frac{\ln (1/\eta )}{\varepsilon ^2}\) is \((\varepsilon , \eta )\)-differentially private.

We now apply the previous results to obtain population risk bounds for DP-SVI and DP-SSP by the NSEG method.

Corollary 1

Algorithm 1 with disjoint batches of size \(B_t=B=\min \{\sqrt{d\ln (1/\eta )}/\varepsilon , n\}\), constant stepsize \(\gamma _t\equiv \gamma =D/\big [M\sqrt{7T\big (1+\frac{8d}{B^2}\frac{\ln (1/\eta )}{\varepsilon ^2}\big )}\big ]\) and variance \(\sigma _t^2=\sigma ^2=\frac{8M^2}{B^2}\frac{\ln (1/\eta )}{\varepsilon ^2}\) is \((\varepsilon , \eta )\)-differentially private and achieves \(\text{ Gap}_{\textrm{VI}}({\mathcal {A}},F)\) (for SVI) or \(\text{ Gap}_{\textrm{SP}}(\mathcal {A},f)\) (for SSP) of

$$\begin{aligned} O\left( MD \max \left\{ \frac{[d \ln (1/\eta )]^{1/4}}{\sqrt{n\varepsilon }}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon } \right\} \right) . \end{aligned}$$

Remark 1

Notice that in the corollary above, the gap is nontrivial iff \(\sqrt{d\ln (1/\eta )}/[n\varepsilon ]<1\), which means that the first term attains the maximum in the regime where the gap is nontrivial.

Proof

Consider an SVI or SSP problem. Let us recall that by Theorem 2, Algorithm 1 achieves expected gap

$$\begin{aligned} \frac{D^2}{\gamma T}+7M^2\gamma \Big (1+\frac{8d}{B^2}\frac{\ln (1/\eta )}{\varepsilon ^2} \Big ). \end{aligned}$$

Choosing \(\gamma =D/\big [M\sqrt{7T\big (1+\frac{8d}{B^2}\frac{\ln (1/\eta )}{\varepsilon ^2}\big )}\big ]\), we obtain an expected gap

$$\begin{aligned} \frac{2\sqrt{7}MD}{\sqrt{T}}\Big (1+\frac{\sqrt{8d}}{B\varepsilon }\sqrt{\ln (1/\eta )}\Big ) =\frac{2\sqrt{14}MD\sqrt{B}}{\sqrt{n}}+\frac{8\sqrt{7}MD\sqrt{d}}{\sqrt{nB}\varepsilon }\sqrt{\ln (1/\eta )}, \end{aligned}$$

where we used that for a single-pass algorithm, \(n=2TB\) (this choice of T exhausts the data when disjoint batches are chosen).

Recall that \(B=\min \{\sqrt{d\ln (1/\eta )}/\varepsilon , n\}\). Then the expected gap is bounded by

$$\begin{aligned} O\left( MD \max \Big \{ \frac{[d\ln (1/\eta )]^{1/4}}{\sqrt{n\varepsilon }}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon } \Big \}\right) . \end{aligned}$$

Hence, we conclude the proof. \(\square \)

We observe that excess risk bounds of the same order have been established for DP-SCO, based on noisy SGD and the uniform stability of differential privacy [7]. Improving these bounds in DP-SCO required substantial effort and was only achieved recently [4, 5, 19]. Furthermore, to the best of our knowledge, the upper bounds on the risk above are the first of their type for DP-SVI and DP-SSP, respectively. To improve upon them, we will follow the approach of [4], based on multi-pass empirical error convergence combined with weak-gap generalization bounds derived from uniform stability.

4 Stability of NSEG and optimal risk for DP-SVI and DP-SSP

The bounds established so far for DP-SVI are potentially suboptimal, and many of the past approaches used to attain optimal rates for DP-SCO, such as privacy amplification by iteration and phased regularization, appear to encounter substantial barriers when applied to DP-SVI. In order to resolve this gap, we show that for both DP-SVI and DP-SSP we can indeed obtain optimal rates, which match those of DP-SCO. To achieve this we develop a multi-pass variant of the NSEG method, which generalizes well due to its stability.

4.1 Stability of NSEG method

To analyze the stability of NSEG  it is useful to interpret the extragradient method as an approximation of the proximal point algorithm. This connection has been established at least since [38]. Given a monotone and 1-Lipschitz operator \(G:\mathbb {R}^d\mapsto \mathbb {R}^d\), we define the s-extragradient operator inductively as follows. First, \(R_{0}(\cdot ; G):\mathbb {R}^d\mapsto \mathbb {R}^d\) is defined as \(R_{0}(u; G)= \varPi _{\mathcal {W}}(u)\). Then, for \(s\geqslant 0\)

$$\begin{aligned} R_{s+1}(u; G)= \varPi _{\mathcal {W}}(u- G(R_{s}(u; G))). \end{aligned}$$
(4.1)

Given such an operator, the (deterministic) extragradient method [33] corresponds to, starting from \(u_0\in \mathcal {W}\), iterating

$$\begin{aligned} u_{t+1} = R_{2}(u_t; \gamma F) \qquad (\forall t\in [T-1]). \end{aligned}$$

It is known that if G is contractive, the recursion (4.1) leads to a fixed point \(R_{}(u; G)\), satisfying

$$\begin{aligned} R_{}(u; G)= \varPi _{\mathcal {W}}(u- G(R_{}(u; G))). \end{aligned}$$
(4.2)

It is also easy to see that \(R_{}(\cdot ; G):\mathbb {R}^d\mapsto \mathcal {W}\) is nonexpansive.
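The recursion (4.1) and its fixed point (4.2) are straightforward to sketch; in the illustration below (our own, with a ball projection standing in for \(\varPi _{\mathcal {W}}\) and \(G=\gamma F\) for a simple monotone F with \(\gamma L<1\)), \(R_2\) is one extragradient step and the fixed point is approximated by iterating the recursion until it stabilizes.

```python
import numpy as np

def project(w, radius=1.0):
    """Euclidean projection onto a ball (a stand-in for Pi_W)."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def R(s, u, G):
    """s-extragradient operator (4.1): R_0(u;G) = Pi_W(u) and
    R_{k+1}(u;G) = Pi_W(u - G(R_k(u;G)))."""
    r = project(u)
    for _ in range(s):
        r = project(u - G(r))
    return r

def R_fixed(u, G, iters=200):
    """Approximate fixed point of (4.2); the recursion contracts when G is contractive."""
    return R(iters, u, G)

# G = gamma * F with F(x, y) = (y, -x) (monotone, 1-Lipschitz) and gamma = 0.5 < 1/L.
gamma = 0.5
G = lambda w: gamma * np.array([w[1], -w[0]])
u = np.array([2.0, 1.0])
print(R(2, u, G))        # one (deterministic) extragradient step from u
print(R_fixed(u, G))     # approximately solves the fixed-point equation (4.2) at u
```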

Proposition 6

(Near nonexpansiveness of the extragradient operator) Let \(F\in {\mathcal {M}}_{\mathcal {W}}^1(L,M)\) and let \(\mathcal {W}\subseteq \mathbb {R}^d\) be a compact convex set with diameter \(D>0\). Then, for every nonnegative integer s and all \(u,v\in \mathbb {R}^d\),

$$\begin{aligned} \Vert R_{s}(u; \gamma F)- R_{}(u; \gamma F)\Vert \leqslant (\gamma L)^s\Vert R_{0}(u; \gamma F)-R_{}(u; \gamma F) \Vert \end{aligned}$$
(4.3)

and

$$\begin{aligned} \Vert R_{s}(u; \gamma F)- R_{s}(v; \gamma F)\Vert \leqslant \Vert u-v\Vert +2D(\gamma L)^s. \end{aligned}$$
(4.4)

Proof

The first part, Eq. (4.3), is proved by induction on s. The result clearly holds for \(s=0\), and if \(s\geqslant 1\), we use (4.1) and (4.2) to obtain

$$\begin{aligned}{} & {} \Vert R_{s}(u; \gamma F) - R_{}(u; \gamma F) \Vert \\{} & {} \quad = \Vert \varPi _{\mathcal {W}} (u-\gamma F(R_{s-1}(u; \gamma F)) - \varPi _{\mathcal {W}}(u-\gamma F(R_{}(u; \gamma F)))\Vert \\{} & {} \quad \leqslant \gamma \Vert F(R_{s-1}(u; \gamma F))-F( R_{}(u; \gamma F) )\Vert \,\,\leqslant \,\, \gamma L\Vert R_{s-1}(u; \gamma F)- R_{}(u; \gamma F) \Vert \\{} & {} \quad \leqslant (\gamma L)^s\Vert R_{0}(u; \gamma F)-R_{}(u; \gamma F)\Vert , \end{aligned}$$

where in the first inequality we used the nonexpansiveness of the projection operator, next we used the L-Lipschitzness of F, and finally we used the inductive hypothesis to conclude.

The second part, Eq. (4.4), is a direct consequence of (4.3), the triangle inequality, the nonexpansiveness of \(R_{}(\cdot ; \gamma F)\), and the fact that \(R_{0}(u; \gamma F)\), \(R_{}(u; \gamma F)\), \(R_{0}(v; \gamma F)\), \(R_{}(v; \gamma F)\in \mathcal {W}\). \(\square \)

The next lemma shows an expansion upper bound for extragradient iterations. This type of bound will later be used to establish the uniform argument stability of the NSEG algorithm.

Lemma 1

(Expansion of the extragradient iteration) Let \(F_1,F_2:\mathbb {R}^d\mapsto \mathbb {R}^d\) be monotone L-Lipschitz operators, and let \(0\leqslant \gamma < 1/L\). Let \(u,v\in \mathcal {W}\), and \(w,z,u^{\prime },v^{\prime }\in \mathcal {W}\) be such that

$$\begin{aligned} w= & {} \varPi _{\mathcal {W}}(u-\gamma F_1(u)) \qquad \qquad \, z \,\,\,=\,\, \varPi _{\mathcal {W}}(v-\gamma F_1(v)) \\ u^{\prime }= & {} \varPi _{\mathcal {W}}(u-\gamma F_2(w)) \qquad \qquad v^{\prime } \,\,=\,\, \varPi _{\mathcal {W}}(v-\gamma F_2(z)). \end{aligned}$$

Then,

$$\begin{aligned} \Vert w-z\Vert\leqslant & {} \Vert u-v\Vert +2 L D\gamma , \end{aligned}$$
(4.5)
$$\begin{aligned} \Vert u^{\prime }-v^{\prime }\Vert\leqslant & {} \Vert u-v\Vert +(\widetilde{M}_1+\widetilde{M}_2+2LD)L \gamma ^2, \end{aligned}$$
(4.6)

where \(\widetilde{M}_1\triangleq \Vert F_1(R_{}(u; \gamma F_1))-F_2(R_{}(u; \gamma F_2))\Vert \) and \(\widetilde{M}_2\triangleq \Vert F_1(R_{}(v; \gamma F_1))-F_2(R_{}(v; \gamma F_2))\Vert \).

Proof

By definition of w and z, we have,

$$\begin{aligned} \Vert w-z\Vert= & {} \Vert \varPi _{\mathcal {W}}(u-\gamma F_1(u)) -\varPi _{\mathcal {W}}(v-\gamma F_1(v)) \Vert \\\leqslant & {} \Vert R_{}(u; \gamma F_1)- R_{}(v; \gamma F_1)\Vert +\Vert \varPi _{\mathcal {W}}(u-\gamma F_1(u)) - R_{}(u; \gamma F_1)\Vert \\{} & {} +\Vert \varPi _{\mathcal {W}}(v-\gamma F_1(v)) - R_{}(v; \gamma F_1) \Vert \\\leqslant & {} \Vert u-v\Vert + \gamma L [\Vert R_{0}(u; \gamma F_1) - R_{}(u; \gamma F_1) \Vert +\Vert R_{0}(v; \gamma F_1)-R_{}(v; \gamma F_1)\Vert ], \end{aligned}$$

where we used the nonexpansiveness of the operator \(R_{}(\cdot ; \gamma F_1)\) and Proposition 6. Moreover, since \(R_{0}(u; \gamma F_1), R_{}(u; \gamma F_1), R_{0}(v; \gamma F_1), R_{}(v; \gamma F_1) \in \mathcal {W}\), we have \(\Vert w-z\Vert \leqslant \Vert u-v\Vert +2LD\gamma \), proving (4.5).

Next, to prove (4.6), we proceed as follows:

$$\begin{aligned} \Vert u^{\prime }-v^{\prime }\Vert= & {} \Vert \varPi _{\mathcal {W}}(u-\gamma F_2(w))-\varPi _{\mathcal {W}}(v-\gamma F_2(z))\Vert \\\leqslant & {} \Vert R_{}(u; \gamma F_2) - R_{}(v; \gamma F_2)\Vert \\{} & {} + \Vert \varPi _{\mathcal {W}}(u-\gamma F_2(w))-\varPi _{\mathcal {W}}(u-\gamma F_2(R_{}(u; \gamma F_2)))\Vert \\{} & {} + \Vert \varPi _{\mathcal {W}}(v-\gamma F_2(z))-\varPi _{\mathcal {W}}(v-\gamma F_2(R_{}(v; \gamma F_2)))\Vert \\\leqslant & {} \Vert u-v\Vert +\gamma L\Vert w- R_{}(u; \gamma F_2)\Vert +\gamma L \Vert z- R_{}(v; \gamma F_2)\Vert . \end{aligned}$$

Using again Proposition 6, we have that

$$\begin{aligned} \Vert w- R_{}(u; \gamma F_2) \Vert= & {} \Vert R_{1}(u; \gamma F_1)- R_{}(u; \gamma F_2) \Vert \\\leqslant & {} \Vert R_{1}(u; \gamma F_1)-R_{}(u; \gamma F_1) \Vert + \Vert R_{}(u; \gamma F_1) - R_{}(u; \gamma F_2) \Vert \\\leqslant & {} L D\gamma +\Vert \varPi _{\mathcal {W}}\big (u-\gamma F_1( R_{}(u; \gamma F_1))\big )\\{} & {} -\varPi _{\mathcal {W}}\big (u-\gamma F_2( R_{}(u; \gamma F_2))\big )\Vert \\\leqslant & {} L D\gamma +\gamma \Vert F_1(R_{}(u; \gamma F_1))-F_2(R_{}(u; \gamma F_2))\Vert \\\leqslant & {} L D\gamma +\widetilde{M}_1\gamma . \end{aligned}$$

An analogous bound can be obtained for \(\Vert z-R_{}(v; \gamma F_2)\Vert \):

$$\begin{aligned} \Vert z-R_{}(v; \gamma F_2)\Vert \leqslant LD\gamma +\widetilde{M}_2\gamma , \end{aligned}$$

concluding the claimed bound (4.6):

$$\begin{aligned} \Vert u^{\prime }-v^{\prime }\Vert \leqslant \Vert u-v\Vert + L[\widetilde{M}_1+\widetilde{M}_2+ 2LD]\gamma ^2. \end{aligned}$$

\(\square \)

The Expansion Lemma above allows us to bound how much two trajectories of the NSEG method may deviate, given two pairs of sequences of operators \(F_{1,t},F_{2,t}\) and \(F_{1,t}^{\prime },F_{2,t}^{\prime }\). The bounds we obtain from this analysis directly yield UAS bounds for the NSEG method.

Lemma 2

Let \(F_{1,t},F_{2,t}\) and \(F_{1,t}^{\prime },F_{2,t}^{\prime }\) be L-Lipschitz operators, and \(0\leqslant \gamma _t< 1/L\) for all \(t\in [T]\). Let \(\{(u_t,w_t)\}_{t\in [T]}\) and \(\{(v_t,z_t)\}_{t\in [T]}\) be the sequences resulting from Algorithm 1, with operators \(\{(F_{1,t},F_{2,t})\}_{t\in [T]}\) and \(\{(F_{1,t}^{\prime },F_{2,t}^{\prime })\}_{t\in [T]}\), respectively, both starting from the same point \(u_0=v_0\). Let

$$\begin{aligned} \varDelta _{1,t}\triangleq & {} \sup _{u\in \mathcal {W}}\Vert F_{1,t}(u)-F_{1,t}^{\prime }(u)\Vert , \\ \varDelta _{2,t}\triangleq & {} \sup _{u\in \mathcal {W}}\Vert F_{2,t}(u)-F_{2,t}^{\prime }(u)\Vert ,\\ \widetilde{M}_{1,t}\triangleq & {} \Vert F_{1,t}(R_{}(u_{t-1}; \gamma F_{1,t}))-F_{2,t}(R_{}(u_{t-1}; \gamma F_{2,t}))\Vert , \text{ and } \\ \widetilde{M}_{2,t}\triangleq & {} \Vert F_{1,t}(R_{}(v_{t-1}; \gamma F_{1,t}))-F_{2,t}(R_{}(v_{t-1}; \gamma F_{2,t}))\Vert ; \end{aligned}$$

then, for all \(t=0,\ldots ,T\),

$$\begin{aligned} \nu _t \triangleq \Vert u_t-v_t\Vert\leqslant & {} \sum _{s=1}^{t}\Big ([\widetilde{M}_{1,s}+\widetilde{M}_{2,s}+2LD]L\gamma _s^2+L\varDelta _{1,s} \gamma _s^2+\varDelta _{2,s}\gamma _s\Big ) \end{aligned}$$
(4.7)
$$\begin{aligned} \delta _t \triangleq \Vert w_t-z_t\Vert\leqslant & {} \sum _{s=1}^{t-1}\Big ( [\widetilde{M}_{1,s}+\widetilde{M}_{2,s}+2LD]L\gamma _s^2+L\varDelta _{1,s}\gamma _s^2+\varDelta _{2,s}\gamma _s\Big )\nonumber \\{} & {} +\varDelta _{1,t}\gamma _t+2LD\gamma _t. \end{aligned}$$
(4.8)

Proof

Clearly, \(\nu _0=0\). Let us now derive a recurrence for both \(\nu _t\) and \(\delta _t\).

$$\begin{aligned} \delta _{t}= & {} \Vert R_{1}(u_{t-1}; \gamma _t F_{1,t}) - R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime }) \Vert \\\leqslant & {} \Vert R_{1}(u_{t-1}; \gamma _t F_{1,t}) - R_{1}(v_{t-1}; \gamma _t F_{1,t}) \Vert + \Vert R_{1}(v_{t-1}; \gamma _t F_{1,t}) - R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime }) \Vert \\\leqslant & {} \Vert u_{t-1}-v_{t-1}\Vert +2LD\gamma _t + \Vert R_{1}(v_{t-1}; \gamma _t F_{1,t}) - R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime })\Vert , \end{aligned}$$

where in the last inequality we used inequality (4.5). Let us now bound the rightmost term above,

$$\begin{aligned}{} & {} \Vert R_{1}(v_{t-1}; \gamma _t F_{1,t}) - R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime }) \Vert \nonumber \\{} & {} \quad = \Vert \varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{1,t}(v_{t-1}))-\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{1,t}^{\prime }(v_{t-1}))\Vert \nonumber \\{} & {} \quad \leqslant \gamma _t \Vert F_{1,t}(v_{t-1})- F_{1,t}^{\prime }(v_{t-1})\Vert \nonumber \\{} & {} \quad \leqslant \varDelta _{1,t}\gamma _t. \end{aligned}$$
(4.9)

We conclude that

$$\begin{aligned} \delta _{t} \leqslant \nu _{t-1} + 2LD\gamma _t + \varDelta _{1,t}\gamma _t. \end{aligned}$$
(4.10)

Now,

$$\begin{aligned} \nu _{t} =&\Vert u_{t}-v_{t}\Vert \leqslant \Vert \varPi _{\mathcal {W}}(u_{t-1}-\gamma _t F_{2,t}(w_t))-\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}^{\prime }(z_t))\Vert \\ \leqslant&\Vert \varPi _{\mathcal {W}}(u_{t-1}-\gamma _t F_{2,t}(w_{t}))-\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(z_t))\Vert \\&+\Vert \varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(z_t))-\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}^{\prime }(z_t))\Vert \\ \leqslant&\Vert \varPi _{\mathcal {W}}(u_{t-1}-\gamma _t F_{2,t}( R_{1}(u_{t-1}; \gamma _t F_{1,t}))) -\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime })))\Vert \\&+\gamma _t \Vert F_{2,t}(z_t)-F_{2,t}^{\prime }(z_t)\Vert \\ \leqslant&\Vert \varPi _{\mathcal {W}}(u_{t-1}-\gamma _t F_{2,t}(R_{1}(u_{t-1}; \gamma _t F_{1,t}))) -\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(R_{1}(v_{t-1}; \gamma _t F_{1,t})))\Vert \\&+\Vert \varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(R_{1}(v_{t-1}; \gamma _t F_{1,t})))\\&-\varPi _{\mathcal {W}}(v_{t-1}-\gamma _t F_{2,t}(R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime })))\Vert +\gamma _t \varDelta _{2,t} \\ {\mathop {\leqslant }\limits ^{{\mathrm{(i)}}}}&\Vert u_{t-1}-v_{t-1}\Vert +[\widetilde{M}_{1,t}+\widetilde{M}_{2,t}+2LD]L\gamma _t^2 +\gamma _tL\Vert R_{1}(v_{t-1}; \gamma _t F_{1,t}) \\&- R_{1}(v_{t-1}; \gamma _t F_{1,t}^{\prime })\Vert +\gamma _t \varDelta _{2,t} \\ {\mathop {\leqslant }\limits ^{{\mathrm{(ii)}}}}&\nu _{t-1} + [\widetilde{M}_{1,t}+\widetilde{M}_{2,t}+2LD]L\gamma _t^2+L\varDelta _{1,t}\gamma _t^2+\varDelta _{2,t}\gamma _t, \end{aligned}$$

where in inequality (i), we used Lemma 1 (more precisely, inequality (4.6)), and in inequality (ii), we used (4.9). Unraveling the above recursion, we get that for all \(t\in [T]\),

$$\begin{aligned} \nu _t \leqslant \sum _{s=1}^{t}\Big ( [\widetilde{M}_{1,s}+\widetilde{M}_{2,s}+2LD]L\gamma _s^2+L\varDelta _{1,s}\gamma _s^2+\varDelta _{2,s}\gamma _s\Big ). \end{aligned}$$

Finally, we combine the bound above with (4.10), to conclude that for all \(t\in [T]\):

$$\begin{aligned} \delta _t \leqslant \sum _{s=1}^{t-1}\Big ( [\widetilde{M}_{1,s}+\widetilde{M}_{2,s}+2LD]L\gamma _s^2+L\varDelta _{1,s}\gamma _s^2+\varDelta _{2,s}\gamma _s\Big ) +\varDelta _{1,t}\gamma _t+2LD\gamma _t. \end{aligned}$$

\(\square \)

The next theorem provides in-expectation and high-probability uniform argument stability bounds for the NSEG method. Although we will not directly apply the high-probability bounds, we believe they may be of independent interest.

Theorem 3

The NSEG  method (Algorithm 1) for closed and convex domain \(\mathcal {W}\subseteq \mathbb {R}^d\) with diameter D, operators in \({\mathcal {M}}_{\mathcal {W}}^1(L, M)\) and stepsizes \(0<\gamma _t\leqslant 1/L\), satisfies the following uniform argument stability bounds:

  1.

    Let \({\mathcal {A}}_{\mathrm{batch-EG}}\) denote the batch method where, given the dataset \(\textbf{S}\), \(F_{1,t}=F_{\textbf{S}}+\varvec{\xi }_1^t\) and \(F_{2,t}=F_{\textbf{S}}+\varvec{\xi }_2^t\). Then, in expectation,

    $$\begin{aligned}&\sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\mathbb {E}_{{\mathcal {A}}_{\mathrm{batch-EG}}}[\delta _{{\mathcal {A}}_{\mathrm{batch-EG}}}(\textbf{S},\textbf{S}^{\prime })]\\&\quad \leqslant \sum _{t=0}^{T-1}\Big ( [4M+2LD+4\sqrt{d}\sigma ]L\gamma _t^2+\frac{2ML}{n}\gamma _t^2+\frac{2M}{n}\gamma _t\Big )\\&\qquad + \frac{1}{T}\sum _{t=1}^{T}\left( \frac{2ML}{n}+2LD\right) \gamma _t, \end{aligned}$$

    and for constant stepsize \(\gamma _t\equiv \gamma \), there exists a universal constant \(K>0\), such that for any \(0<\theta <1\), with probability \(1-\theta \):

    $$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\delta _{{\mathcal {A}}_{\mathrm{batch-EG}}}(\textbf{S},\textbf{S}^{\prime })&\leqslant 4[T\sqrt{d}\sigma +\sigma \sqrt{Kd\ln (1/\theta )}]L\gamma ^2+ [4M+2LD]LT\gamma ^2\nonumber \\&\quad +\frac{2ML}{n}T\gamma ^2+\frac{2M}{n}T\gamma + \left( \frac{2ML}{n}+2LD\right) \gamma . \end{aligned}$$
    (4.11)
  2.

    Let \({\mathcal {A}}_{\mathrm{repl-EG}}\) denote the sampled-with-replacement method where, given the dataset \(\textbf{S}\), \(F_{1,t}=F_{\varvec{\beta }_{i(1,t)}}+\varvec{\xi }_1^t\) and \(F_{2,t}=F_{\varvec{\beta }_{i(2,t)}}+\varvec{\xi }_2^t\), for \(i(1,t),i(2,t)\sim \text{ Unif }([n])\) drawn independently. Then, in expectation,

    $$\begin{aligned}{} & {} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\mathbb {E}[\delta _{{\mathcal {A}}_{\mathrm{repl-EG}}}(\textbf{S},\textbf{S}^{\prime })]\nonumber \\{} & {} \quad \leqslant \sum _{t=0}^{T-1}\Big ( [4M+2LD+4\sqrt{d}\sigma ]L\gamma _t^2+\frac{2 ML}{n}\gamma _t^2+\frac{2 M}{n}\gamma _t\Big )\nonumber \\{} & {} \qquad + \frac{1}{T}\sum _{t=1}^{T}\left( \frac{2ML}{n}+2LD\right) \gamma _t. \end{aligned}$$
    (4.12)

    And for constant stepsize \(\gamma _t\equiv \gamma \), there exists a universal constant \(K>0\), such that for any \(0<\theta <1/[2n]\), with probability \(1-\theta \):

    $$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\delta _{{\mathcal {A}}_{\mathrm{repl-EG}}}(\textbf{S},\textbf{S}^{\prime })\leqslant & {} 4[T\sqrt{d}\sigma +\sigma \sqrt{Kd\ln (2/\theta )}]L\gamma ^2\nonumber \\{} & {} + [4M+2LD]LT\gamma ^2 +2LD\gamma \nonumber \\{} & {} +\big (1+3\log \big (\frac{2n}{\theta }\big )\big )\frac{2MT}{n}(L\gamma ^2+\gamma /T+\gamma ). \end{aligned}$$
    (4.13)

Proof

Let \(\textbf{S}\simeq \textbf{S}^{\prime }\). Then

  1.

    Batch method. Notice that for the batch case \(F_{1,t}=F_{\textbf{S}}+\varvec{\xi }_1^t\) and \(F_{1,t}^{\prime }=F_{\textbf{S}^{\prime }}+\varvec{\xi }_1^t\); and \(F_{2,t}=F_{\textbf{S}}+\varvec{\xi }_2^t\) and \(F_{2,t}^{\prime }=F_{\textbf{S}^{\prime }}+\varvec{\xi }_2^t\). Then, it is easy to see that \(\varDelta _{1,t}\leqslant 2M/n\) and \(\varDelta _{2,t}\leqslant 2M/n\). On the other hand, since the operators are M-bounded and the noise addition is Gaussian,

    $$\begin{aligned} \mathbb {E}[\widetilde{M}_{1,t}]= & {} \mathbb {E}[\Vert F_{1,t}(R_{}(u_{t-1}; \gamma F_{1,t})) -F_{2,t}(R_{}(u_{t-1}; \gamma F_{2,t}))\Vert ] \nonumber \\\leqslant & {} \mathbb {E}[\Vert F_{\textbf{B}_t^1}(R_{}(u_{t-1}; \gamma F_{1,t}))+\varvec{\xi }_1^t\Vert ] +\mathbb {E}[\Vert F_{\textbf{B}_t^2}(R_{}(u_{t-1}; \gamma F_{2,t}))+\varvec{\xi }_2^t\Vert ] \nonumber \\\leqslant & {} 2M+\mathbb {E}[\Vert \varvec{\xi }_1^t\Vert +\Vert \varvec{\xi }_2^t\Vert ] \leqslant 2[M+\sqrt{d}\sigma ], \end{aligned}$$
    (4.14)

    and an analogous bound holds for \(\mathbb {E}[\widetilde{M}_{2,t}]\). Hence, by Lemma 2:

    $$\begin{aligned} \mathbb {E}_{{\mathcal {A}}_{\mathrm{batch-EG}}}[\delta _{{\mathcal {A}}_{\mathrm{batch-EG}}}(S,S^{\prime })]\leqslant & {} \sum _{t=0}^{T-1}\Big ( [4M+2LD+4\sqrt{d}\sigma ]L\gamma _t^2+\frac{2ML}{n}\gamma _t^2+\frac{2M}{n}\gamma _t\Big )\\{} & {} + \frac{1}{T}\sum _{t=1}^{T}\big (\frac{2ML}{n}+2LD\big )\gamma _t, \end{aligned}$$

    which proves the claimed bound.

    For the high probability bound, we use that the norm of a Gaussian vector is \(Kd\sigma ^2\)-subgaussian, for a universal constant \(K>0\) (see, e.g. [48, Thm. 3.1.1]), and therefore \(\mathbb {E}[\exp \{\lambda (\Vert \varvec{\xi }_i^t\Vert -\sigma \sqrt{d})\}]\leqslant \exp \{Kd\sigma ^2\lambda ^2\}\); hence, by the Chernoff–Cramér bound, for any \(\alpha >0\)

    $$\begin{aligned} \mathbb {P}\left[ \sum _{t\in [T]} \big (\Vert \varvec{\xi }_1^t\Vert +\Vert \varvec{\xi }_2^t\Vert \big ) >(2+\alpha )T\sqrt{d}\sigma \right]\leqslant & {} \exp \{-\lambda \alpha T\sqrt{d}\sigma \}\Big ( \exp \{2Kd\sigma ^2\lambda ^2\} \Big )^T\\= & {} \exp \{T(2Kd\sigma ^2\lambda ^2-\alpha \sqrt{d}\sigma \lambda )\}. \end{aligned}$$

    Choosing \(\lambda =\alpha /[4K\sqrt{d}\sigma ]\) and \(\alpha =\frac{2\sqrt{K}}{T}\sqrt{\ln (1/\theta )}\), we get

    $$\begin{aligned} \mathbb {P}\left[ \sum _{t\in [T]} \big (\Vert \varvec{\xi }_1^t\Vert +\Vert \varvec{\xi }_2^t\Vert \big ) > 2T\sqrt{d}\sigma +2\sigma \sqrt{Kd\ln (1/\theta )}\right] \leqslant \theta . \end{aligned}$$
    (4.15)

    This guarantee, together with the rest of the terms appearing in our previous stability bound (which hold w.p. 1) proves (4.11).

  2.

    Sampled with replacement. Let \(i\in [n]\) be the coordinate where \(\textbf{S}\) and \(\textbf{S}^{\prime }\) may differ. Let \(i(1,t),i(2,t)\sim \text{ Unif }([n])\) i.i.d., for \(t\in [T]\). Now we apply Lemma 2 with \(F_{1,t}=F_{\varvec{\beta }_{i(1,t)}}+\varvec{\xi }_1^t\), and \(F_{1,t}^{\prime }=F_{\varvec{\beta }_{i(1,t)}^{\prime }}+\varvec{\xi }_1^t\); and \(F_{2,t}=F_{\varvec{\beta }_{i(2,t)}}+\varvec{\xi }_2^t\), and \(F_{2,t}^{\prime }=F_{\varvec{\beta }_{i(2,t)}^{\prime }}+\varvec{\xi }_2^t\). Hence we have that \((\varDelta _{1,t})_{t\in [T]}\) and \((\varDelta _{2,t})_{t\in [T]}\) are sequences of independent r.v. with expectation bounded by 2M/n. Therefore, by Lemma 2 (Eq. (4.8)), and following the steps that lead to inequality (4.14), we have:

    $$\begin{aligned} \mathbb {E}[\delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })]\leqslant & {} \sum _{t=0}^{T-1}\left( [4M+2LD+4\sqrt{d}\sigma ]L\gamma _t^2+\frac{2ML}{n}\gamma _t^2+\frac{2M}{n}\gamma _t\right) \\{} & {} + \frac{1}{T}\sum _{t=1}^{T}\left( \frac{2ML}{n}+2LD\right) \gamma _t. \end{aligned}$$

    Finally for the high-probability bound, note that for any realization of the algorithm randomness, we have

    $$\begin{aligned} \delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })\leqslant & {} \sum _{t=1}^T[4M+2(\Vert \varvec{\xi }_1^t\Vert +\Vert \varvec{\xi }_2^t\Vert )+2LD]L\gamma _t^2\\{} & {} +\frac{2LD}{T}\sum _{t=1}^T\gamma _t +L\sum _{t=1}^T\gamma _t^2\varDelta _{1,t}+\frac{1}{T}\sum _{t=1}^T\varDelta _{1,t}\gamma _t+\sum _{t=1}^T\varDelta _{2,t}\gamma _t. \end{aligned}$$

    We additionally assume a constant stepsize, \(\gamma _t\equiv \gamma >0\). Hence, we can appeal to concentration of sums of Bernoulli random variables, which guarantees that

    $$\begin{aligned} \mathbb {P}\left[ \sum _{t=1}^T\varDelta _{1,t} > (1+3\log (2/\theta ))\frac{2MT}{n} \right] \leqslant \exp \{-\log (2/\theta )\}=\frac{\theta }{2}. \end{aligned}$$

    An analogous bound can be established for \(\varDelta _{2,t}\), which together with bound (4.15) leads to

    $$\begin{aligned}{} & {} \mathbb {P}_{{\mathcal {A}}_{\mathrm{repl-EG}}}\Big [ \delta _{{\mathcal {A}}_{\mathrm{repl-EG}}}(\textbf{S},\textbf{S}^{\prime })> 4[T\sqrt{d}\sigma +\sigma \sqrt{Kd\ln (1/\theta )}]L\gamma ^2\\{} & {} \quad + [4M+2LD]LT\gamma ^2 +2LD\gamma \\{} & {} \quad +\big (1+3\log \big (\frac{2}{\theta }\big )\big )\frac{2MT}{n}(L\gamma ^2+\gamma /T+\gamma ) \Big ]\leqslant 2\theta . \end{aligned}$$

    Notice this bound only depends on our choice of i, and it is otherwise uniform over all \(\textbf{S}\simeq \textbf{S}^{\prime }\). Finally, by a union bound on \(i\in [n]\) (together with a renormalization of \(\theta \)), we have that

    $$\begin{aligned}{} & {} \mathbb {P}_{{\mathcal {A}}_{\mathrm{repl-EG}}}\Big [ \sup _{\textbf{S}\simeq \textbf{S}^{\prime }} \delta _{{\mathcal {A}}_{\mathrm{repl-EG}}}(\textbf{S},\textbf{S}^{\prime })> 4[T\sqrt{d}\sigma +\sigma \sqrt{Kd\ln (2/\theta )}]L\gamma ^2\\{} & {} \quad + [4M+2LD]LT\gamma ^2 +2LD\gamma \\{} & {} \quad +\big (1+3\log \big (\frac{4n}{\theta }\big )\big )\frac{2MT}{n}(L\gamma ^2+\gamma /T+\gamma ) \Big ]\leqslant \theta . \end{aligned}$$

\(\square \)
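The concentration step used in (4.15) can be illustrated numerically. The following sketch is purely illustrative (the values of T, d and the noise scale are assumed placeholders, and the universal constant K is not estimated): it simulates the sum of the Gaussian noise norms and compares it with the reference term \(2T\sqrt{d}\sigma \).

```python
import numpy as np

# Illustrative Monte Carlo check: sums of norms of N(0, sigma^2 I_d) vectors
# concentrate around 2*T*sqrt(d)*sigma, as used in (4.15).
rng = np.random.default_rng(0)
T, d, sigma, trials = 100, 20, 0.1, 500          # assumed, illustrative values

noise = rng.normal(scale=sigma, size=(trials, 2 * T, d))   # xi_1^t and xi_2^t for all t
sums = np.linalg.norm(noise, axis=2).sum(axis=1)           # sum of 2T norms per trial

mean_term = 2 * T * np.sqrt(d) * sigma
print("empirical mean of the summed norms:", sums.mean())
print("reference term 2*T*sqrt(d)*sigma  :", mean_term)
print("99th percentile minus reference   :", np.quantile(sums, 0.99) - mean_term)
```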

4.2 Optimal risk for DP-SVI and DP-SSP by NSEG method

Now we use our stability and risk bounds for NSEG to derive optimal risk bounds for DP-SVI and DP-SSP. For this, we use the sampled-with-replacement variant \({\mathcal {A}}_{\mathrm{repl-EG}}\), which at iteration \(t\) uses the operators

$$\begin{aligned} F_{1,t}(\cdot )=F_{\varvec{\beta }_{i(1,t)}}(\cdot )+\varvec{\xi }_1^t; \qquad F_{2,t}(\cdot )=F_{\varvec{\beta }_{i(2,t)}}(\cdot )+\varvec{\xi }_2^t. \end{aligned}$$
(4.16)

Using the moments accountant method (Theorem 1) one can show the following.

Proposition 7

(Privacy of sampled with replacement NSEG) Algorithm 1 with operators given by Eq. (4.16) and \(\sigma _t^2=8M^2\log (1/\eta )/\varepsilon ^2\), is \((\varepsilon ,\eta )\)-differentially private.

Theorem 4

(Excess risk of sampled with replacement NSEG) Consider an instance of the (VI(F)) or (SP(f)) problem. Let \(\mathcal {A}\) be the sampled with replacement variant (4.16) of NSEG method (Algorithm 1), with \(\gamma _t=\gamma =\min \{D/M,1/L\}/[n\max \{\sqrt{n}, \sqrt{d \ln (1/\eta )}/\varepsilon \}]\), \(\sigma _t^2=8\,M^2\log (1/\eta )/\varepsilon ^2\), \(T=n^2\). Then, \(\text{ WeakGap}_{\textrm{VI}}(\mathcal {A},F)\) (for SVI) or \(\text{ WeakGap}_{\textrm{SP}}(\mathcal {A},f)\) (for SSP) are bounded by

$$\begin{aligned} O\Big ( (MD+LD^2)\max \Big \{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon }\Big \} +\frac{MLD}{n^{5/2}}\Big ). \end{aligned}$$

Remark 2

Notice that, assuming \(n=\varOmega (\min \{\sqrt{L},\sqrt{M/D}\})\), the bound of the theorem simplifies to

$$\begin{aligned} O\left( (MD+LD^2)\max \Big \{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon }\Big \} \right) . \end{aligned}$$

This is quite a mild sample size requirement. In this range, when \(M\geqslant LD\), our upper bound matches the excess risk bounds for DP-SCO [5], and we will show that these rates are indeed optimal for DP-SVI and DP-SSP as well.

Proof

Given that our bounds for SVI and SSP are analogous, we treat both problems simultaneously.

First, let us bound the empirical accuracy of the method. By Theorem 2, together with the fact that sampling with replacement is an unbiased stochastic oracle for the empirical operator:

$$\begin{aligned} \mathbb {E}_{\textbf{S}}\Big [\text{ EmpGap }(\mathcal {A},\textbf{S})\Big ]\leqslant & {} \frac{1}{\gamma T}\left( \frac{D^2}{2}+7M^2T\gamma ^2\left( 1+\frac{8d \log (1/\eta )}{\varepsilon ^2}\right) \right) \\\leqslant & {} \frac{D^2}{2n}\max \{M/D,L\} \max \{\sqrt{n},\sqrt{d\ln (1/\eta )}/\varepsilon \}\\{} & {} +\frac{7M^2\min \{D/M,1/L\}}{n\max \{\sqrt{n},\sqrt{d\ln (1/\eta )}/\varepsilon \}} \frac{9d\ln (1/\eta )}{\varepsilon ^2} \\= & {} O\left( (MD+LD^2)\max \Big \{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon }\Big \} \right) , \end{aligned}$$

where \(\text{ EmpGap }(\mathcal {A},\textbf{S})\) is \(\text{ EmpGap}_{\textrm{VI}}(\mathcal {A},F_{\textbf{S}})\) or \(\text{ EmpGap}_{\textrm{SP}}(\mathcal {A},f_{\textbf{S}})\) for an SVI or SSP problem, respectively.

Next, by Theorem 3, we have that \({\mathcal {A}}\) (or \(x(\textbf{S})\) and \(y(\textbf{S})\), for the SSP case) are UAS with parameter

$$\begin{aligned} \delta= & {} [4M+2LD+4\sqrt{d}\sigma ]LT\gamma ^2+\frac{2 ML}{n}T\gamma ^2+\frac{2 M}{n}T\gamma + \left( \frac{2ML}{n}+2LD\right) \gamma \\\leqslant & {} \Big (\frac{4LD^2}{M}+2D\Big )\frac{1}{n}+\frac{8LD^2\sqrt{2d\ln (1/\eta )}}{M\varepsilon n}+\frac{2LD^2}{M}\frac{1}{n^{3/2}}+\frac{2D}{\sqrt{n}}+\frac{2LD}{n^{5/2}}+\frac{2D}{n^{3/2}}\\= & {} O\Big (\frac{1}{M}\cdot \Big (\frac{MD+LD^2}{n}+\frac{LD^2\sqrt{d\ln (1/\eta )}}{\varepsilon n} +\frac{MLD}{n^{5/2}}\Big ) \Big ). \end{aligned}$$

Hence, noting that the empirical gap upper bounds the weak empirical gap, and using Proposition 1 or Proposition 2 (depending on whether the problem is an SSP or an SVI, respectively), the population risk is upper bounded by the empirical risk plus \(M\delta \), where \(\delta \) is the UAS parameter of the algorithm; in particular,

$$\begin{aligned} \text{ WeakGap}_{\textrm{VI}}(\mathcal {A},F)\leqslant & {} \mathbb {E}_{\textbf{S}}[\text{ EmpGap}_{\textrm{VI}}(\mathcal {A},F_{\textbf{S}})]+M\delta \\= & {} O\Big ( (MD+LD^2)\max \Big \{\frac{1}{\sqrt{n}}, \frac{\sqrt{d\ln (1/\eta )}}{n\varepsilon }\Big \} +\frac{MLD}{n^{5/2}}\Big ), \end{aligned}$$

A similar claim holds for \(\text{ WeakGap}_{\textrm{SP}}(\mathcal {A},f)\). Hence, we conclude the proof. \(\square \)
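As a quick illustration of the parameter policy in Theorem 4, the following sketch computes the constant stepsize, the noise variance and the iteration count from the problem constants. It is only a sketch: the numerical inputs are assumed placeholders, and the formulas simply transcribe the statement of the theorem.

```python
import math

def nseg_repl_params(M, L, D, n, d, eps, eta):
    """Parameter policy of Theorem 4 for sampled-with-replacement NSEG (sketch).

    gamma   = min{D/M, 1/L} / (n * max{sqrt(n), sqrt(d ln(1/eta))/eps}),
    sigma^2 = 8 M^2 ln(1/eta) / eps^2,
    T       = n^2.
    """
    priv = math.sqrt(d * math.log(1.0 / eta)) / eps
    gamma = min(D / M, 1.0 / L) / (n * max(math.sqrt(n), priv))
    sigma_sq = 8.0 * M ** 2 * math.log(1.0 / eta) / eps ** 2
    return gamma, sigma_sq, n * n

# Assumed, illustrative constants:
print(nseg_repl_params(M=1.0, L=1.0, D=1.0, n=10_000, d=100, eps=1.0, eta=1e-6))
```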

5 The noisy inexact stochastic proximal point method

In the previous sections, we presented the NSEG method with its single-pass and multi-pass variants and provided optimal risk guarantees for DP-SVI and DP-SSP problems in \(O(n^2)\) stochastic operator evaluations. In the rest of the paper, our aim is to provide another algorithm that can achieve the optimal risk for both of these problems with much less computational effort. Towards that end, consider the following algorithm:

Algorithm 2 (Noisy Inexact Stochastic Proximal Point (NISPP) method)

In the above algorithm, we leave a few things unspecified; these will be stated later during the convergence and privacy analysis. Here, we detail some key features of the algorithm. In line 3, we sample a batch \(\textbf{B}_{k+1}\) of size \(B_{k+1}=|\textbf{B}_{k+1}|\). As with NSEG, we will consider two different variants of the NISPP method: single-pass and multi-pass. Depending on the type of the method, we will specify the sampling mechanism. In line 4 of Algorithm 2, \(u_{k+1}\) is a \(\nu \)-approximate strong VI solution of the corresponding VI problem for some \(\nu \geqslant 0\), i.e.,

$$\begin{aligned} \langle F_{\textbf{B}_{k+1}}(u_{k+1}) + \lambda _k (u_{k+1}-w_k),w-u_{k+1}\rangle \geqslant -\nu \qquad (\forall w \in \mathcal {W}). \end{aligned}$$
(5.1)

Note that if \(\nu = 0\) then this is an exact solution of (VI(F)) with operator F replaced by \(F_{\textbf{B}_{k+1}}(\cdot ) + \lambda _k(\cdot - w_k)\). For \(\nu >0\), the point \(u_{k+1}\) is an inexact solution satisfying this criterion up to an additive error of \(\nu \). In line 5, we add Gaussian noise to \(u_{k+1}\) in order to preserve privacy. The resulting iterate \(w_{k+1}\) can potentially lie outside the set \(\mathcal {W}\); hence, in line 7, the ergodic average \({\bar{w}}_K\) can also lie outside \(\mathcal {W}\). In order to preserve feasibility of the solution, we project \({\bar{w}}_K\) onto the set \(\mathcal {W}\) and output it as a solution in line 8. Projecting the average in line 8, as opposed to projecting each individual \(w_{k+1}\) in line 5, is crucial for the convergence guarantee of Algorithm 2.
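Since the pseudocode of Algorithm 2 is only summarized above, the following Python sketch records the outer loop as described in this paragraph. It is an illustration only: sample_batch, solve_prox_vi and project are hypothetical helpers (not part of the paper), and solve_prox_vi is assumed to return a \(\nu \)-approximate solution of the regularized VI subproblem (5.1).

```python
import numpy as np

def nispp(data, w0, K, lam, gamma, sigma, sample_batch, solve_prox_vi, project):
    """Illustrative sketch of the NISPP outer loop (Algorithm 2).

    lam, gamma, sigma are sequences of length K+1; solve_prox_vi(batch, w, lam_k)
    is assumed to return u with <F_B(u) + lam_k (u - w), v - u> >= -nu for all v in W.
    sample_batch, solve_prox_vi and project are placeholder helpers.
    """
    w = np.asarray(w0, dtype=float)
    weighted_sum, total_weight = np.zeros_like(w), 0.0
    for k in range(K + 1):
        batch = sample_batch(data, k)                           # line 3: draw batch B_{k+1}
        u = solve_prox_vi(batch, w, lam[k])                     # line 4: nu-approximate prox VI step
        w = u + sigma[k] * np.random.standard_normal(w.shape)   # line 5: Gaussian perturbation
        weighted_sum += gamma[k] * w                            # accumulate the ergodic average
        total_weight += gamma[k]
    w_bar = weighted_sum / total_weight                         # line 7: ergodic average
    return project(w_bar)                                       # line 8: project onto W and output
```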

In the rest of this section, we deal exclusively with the single-pass version of the NISPP method, i.e., we assume that the batches \(\{\textbf{B}_{k+1}\}_{k=0,\ldots ,K}\) are disjoint subsets of the dataset \(\textbf{S}\). We start with the convergence guarantees of the single-pass NISPP method. In order to prove convergence, we show a useful bound on \({\text {dist}}_\mathcal {W}({\bar{w}}_K):= \min _{w\in \mathcal {W}}\Vert w-{\bar{w}}_K\Vert _{}^{}\).

Proposition 8

Let \({\bar{u}}_K:= \frac{1}{\varGamma _K} \textstyle {\sum }_{k = 0}^K\gamma _ku_{k+1}\). Then,

$$\begin{aligned} {\text {dist}}_\mathcal {W}({\bar{w}}_K)^2\leqslant \Vert {\bar{u}}_K - {\bar{w}}_K\Vert _{}^{2} = \frac{1}{\varGamma _K^2}\Vert \textstyle {\sum }_{k = 0}^K\gamma _k\varvec{\xi }_{k+1}\Vert _{}^{2}. \end{aligned}$$
(5.2)

Moreover, we have

$$\begin{aligned} \mathbb {E}[{\text {dist}}_\mathcal {W}({\bar{w}}_K)^2] \leqslant \frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\mathbb {E}\Vert \varvec{\xi }_{k+1}\Vert _{}^{2} \end{aligned}$$
(5.3)

Proof

Note that \({\bar{u}}_K \in \mathcal {W}\). Hence, the first relation in (5.2) follows from the definition of the \({\text {dist}}_\mathcal {W}(\cdot )\) function, and the equality follows from the definitions of \({\bar{u}}_K\) and \({\bar{w}}_K\). To obtain (5.3), note that \(\{\varvec{\xi }_k\}_{k = 1}^{K+1}\) are i.i.d. random variables with mean 0. Expanding \(\Vert \textstyle {\sum }_{k = 0}^K \gamma _k\varvec{\xi }_{k+1}\Vert _{}^{2}\), using linearity of expectation and noting that \(\mathbb {E}[\varvec{\xi }_i^T\varvec{\xi }_j] = 0\) for all \(i \ne j\), we conclude (5.3). This concludes the proof. \(\square \)

We prove the following convergence rate result for Algorithm 2 for the risk of SVI/SSP problem. In particular, we assume that the algorithm performs a single-pass over the dataset \(\textbf{S}\sim \mathcal {P}^n\) containing n i.i.d. datapoints.

Theorem 5

Consider the stochastic (VI(F)) problem with operators \(F_{\varvec{\beta }} \in \mathcal {M}^1_\mathcal {W}(L, M)\). Let \(\mathcal {A}\) be the single-pass NISPP method (Algorithm 2) where sequence \(\{\gamma _k\}_{k \geqslant 0}, \{\lambda _k\}_{k \geqslant 0}\) satisfy

$$\begin{aligned} \gamma _k \lambda _k = \gamma _0\lambda _0 \end{aligned}$$
(5.4)

for all \(k \geqslant 0\). Moreover, \(\textbf{B}_{k+1}\) are independent samples from a product distribution \(\textbf{B}_{k+1} \sim \mathcal {P}^{B_{k+1}}\) and \(\textbf{B}_{k+1} \subset \textbf{S}\). Then, we have

$$\begin{aligned} \text{ Gap}_{\textrm{VI}}(\mathcal {A},F) \leqslant \nu + \frac{Z_0(K)}{\varGamma _K} + M \sqrt{\frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\sigma _{k+1}^2d}, \end{aligned}$$
(5.5)

where, \(Z_0(K):= \frac{3\gamma _0\lambda _0}{2}D^2 + \frac{4\,M^2 + 3\,L^2D^2}{\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K\gamma _k^2 + \frac{5\gamma _0\lambda _0d}{2}\textstyle {\sum }_{k =1}^K\sigma _{k+1}^2\) and \(\varGamma _K:= \textstyle {\sum }_{k = 0}^K \gamma _k\).

Similarly, \(\mathcal {A}\) applied to stochastic (SP(f)) problem achieves

$$\begin{aligned} \text{ Gap}_{\textrm{SP}}(\mathcal {A},f) \leqslant \nu + \frac{Z_0(K)}{\varGamma _K} + M \sqrt{\frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\sigma _{k+1}^2d}. \end{aligned}$$
(5.6)

Proof

Let \(w\in \mathcal {W}\). Then

$$\begin{aligned} \langle F(w),w_{k+1}-w\rangle&= \langle F(w),u_{k+1}-w\rangle + \langle F(w),w_{k+1}-u_{k+1}\rangle . \end{aligned}$$
(5.7)

We will analyze each term above separately. First, note that

$$\begin{aligned} \langle F(w),w_{k+1}-u_{k+1}\rangle&\leqslant \frac{1}{2\lambda _k}\Vert F(w)\Vert _{}^{2} + \frac{\lambda _k}{2}\Vert \varvec{\xi }_{k+1}\Vert _{}^{2} \leqslant \frac{M^2}{2\lambda _k} + \frac{\lambda _k}{2}\Vert \varvec{\xi }_{k+1}\Vert _{}^{2}. \end{aligned}$$
(5.8)

Note that

$$\begin{aligned}&\langle F(w),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \langle F(u_{k+1}),u_{k+1}-w\rangle \nonumber \\&\quad = \langle F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-w\rangle + \langle F(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \lambda _k\langle u_{k+1}-w_k,w - u_{k+1}\rangle + \nu + \langle F(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-w\rangle \nonumber \\&\quad =\frac{\lambda _k}{2} \left[ \Vert w-w_k\Vert _{}^{2} - \Vert w-u_{k+1}\Vert _{}^{2} - \Vert u_{k+1} - w_k\Vert _{}^{2}\right] + \nu \nonumber \\&\qquad + \langle F(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-w\rangle , \end{aligned}$$
(5.9)

where the first inequality follows from monotonicity and the second from (5.1). Now note that

$$\begin{aligned}&\langle F(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-w\rangle \\&\quad = \langle F(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1})- [F(w_k) - F_{\textbf{B}_{k+1}}(w_{k})],u_{k+1}-w\rangle \\&\quad \quad + \langle F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}),u_{k+1} - w_k\rangle + \langle F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}),w_k-w\rangle \\&\quad \leqslant \frac{\lambda _k}{6L^2}\Vert F(u_{k+1}) - F(w_k)\Vert _{}^{2} + \frac{3L^2}{2\lambda _k}\Vert u_{k+1}-w\Vert _{}^{2}\\&\quad \quad + \frac{\lambda _k}{6L^2}\Vert F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}_{k+1}}(w_{k})\Vert _{}^{2} + \frac{3L^2}{2\lambda _k}\Vert u_{k+1}-w\Vert _{}^{2}\\&\quad \quad + \frac{3}{2\lambda _k}\Vert F(w_k) - F_{\textbf{B}_{k+1}}(w_{k})\Vert _{}^{2}\\&\qquad + \frac{\lambda _k}{6} \Vert u_{k+1}- w_k\Vert _{}^{2}+ \langle F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}),w_k-w\rangle \\&\quad \leqslant \frac{\lambda _k}{2}\Vert u_{k+1}- w_k\Vert _{}^{2} + \frac{3L^2}{\lambda _k}\Vert u_{k+1}-w\Vert _{}^{2} + \frac{3}{2\lambda _k}\Vert F(w_k) - F_{\textbf{B}_{k+1}}(w_{k})\Vert _{}^{2} \\&\quad \quad + \langle F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}),w_k-w\rangle , \end{aligned}$$

where the last inequality follows from the L-Lipschitz continuity of F and \(F_{\textbf{B}_{k+1}}\). Noting that \(\Vert u_{k+1}-w\Vert _{}^{} \leqslant D\) for all \(w \in \mathcal {W}\) and using the above bound in (5.9), we have

$$\begin{aligned}&\langle F(w),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \langle F(u_{k+1}),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \frac{\lambda _k}{2}\left[ \Vert w-w_k\Vert _{}^{2} - \Vert w-u_{k+1}\Vert _{}^{2} \right] \nonumber \\&\quad + \underbrace{\nu + \frac{3L^2D^2}{\lambda _k} + \frac{3}{2\lambda _k}\Vert F(w_k) - F_{\textbf{B}_{k+1}}(w_{k})\Vert _{}^{2} + \langle F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}),w_k-w\rangle }_{E_k} \end{aligned}$$
(5.10)

Letting \(u_0:= w_0\) and consequently \(\varvec{\xi }_0 = \textbf{0}\), we have from (5.10)

$$\begin{aligned}&\langle F(w),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \langle F(u_{k+1}),u_{k+1}-w\rangle \nonumber \\&\quad \leqslant \frac{\lambda _k}{2} \left[ \Vert w-u_k\Vert _{}^{2} - \Vert w-u_{k+1}\Vert _{}^{2} + 2\langle w-u_k,u_k-w_k\rangle + \Vert u_k-w_k\Vert _{}^{2}\right] + E_k \nonumber \\&\quad =\frac{\lambda _k}{2} \left[ \Vert w-u_k\Vert _{}^{2} - \Vert w-u_{k+1}\Vert _{}^{2} \right] + \lambda _k \langle \varvec{\xi }_k,u_k-w\rangle + \frac{1}{2}\lambda _k\Vert \varvec{\xi }_k\Vert _{}^{2} + E_k \end{aligned}$$
(5.11)

Let us define an auxiliary sequence \(\{z_k\}_{k \geqslant 0}\), where \(z_0 = w_0\) and for all \( k \geqslant 1\), we have

$$\begin{aligned} z_k = \varPi _{\mathcal {W}}[z_{k-1} - \varvec{\xi }_k]. \end{aligned}$$

Then, due to the mirror-descent bound, we have

$$\begin{aligned} \textstyle {\sum }_{k =1}^K \langle \varvec{\xi }_k,z_k - w\rangle \leqslant \frac{1}{2}\Vert w-w_0\Vert _{}^{2} - \frac{1}{2}\Vert w-z_K\Vert _{}^{2} + \textstyle {\sum }_{k =1}^K \Vert \varvec{\xi }_k\Vert _{}^{2}. \end{aligned}$$
(5.12)

Moreover, noting that

$$\begin{aligned} \langle \varvec{\xi }_k,u_k - z_k\rangle&= \langle \varvec{\xi }_k,u_k - z_{k-1}\rangle +\langle \varvec{\xi }_k, z_{k-1} -z_k\rangle \leqslant \langle \varvec{\xi }_k,u_k - z_{k-1}\rangle + \Vert \varvec{\xi }_k\Vert _{}^{}\Vert z_k - z_{k-1}\Vert _{}^{}\\&\leqslant \langle \varvec{\xi }_k,u_k - z_{k-1}\rangle + \Vert \varvec{\xi }_k\Vert _{}^{2}. \end{aligned}$$

Combining the above relation with (5.12), we have

$$\begin{aligned} \textstyle {\sum }_{k =1}^K\langle \varvec{\xi }_k,u_k - w\rangle \leqslant \frac{1}{2}\Vert w-w_0\Vert _{}^{2} + 2\textstyle {\sum }_{k =1}^K\Vert \varvec{\xi }_k\Vert _{}^{2} + \textstyle {\sum }_{k =1}^K\langle \varvec{\xi }_k,u_k - z_{k-1}\rangle . \end{aligned}$$
(5.13)

Now multiplying (5.7) by \(\gamma _k\) then summing from \(k = 0\) to K; noting the definition of \({\bar{w}}_K\) and \(\varGamma _K\); and using (5.8), (5.11) and (5.13) along with assumption (5.4) implies

$$\begin{aligned}&\varGamma _k\langle F(w),{\bar{w}}_K - w\rangle \nonumber \\&\quad \leqslant \gamma _0\lambda _0 \Vert w-w_0\Vert _{}^{2} \nonumber \\&\quad +\textstyle {\sum }_{k = 0}^K \left[ \frac{\gamma _kM^2}{2\lambda _k} + \frac{\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_{k+1}\Vert _{}^{2} + \frac{5\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_k\Vert _{}^{2} +\gamma _kE_k + \gamma _0\lambda _0\langle \varvec{\xi }_k,u_k - z_{k-1}\rangle \right] \nonumber \\&\quad \Rightarrow \varGamma _k\langle F(w),\varPi _{\mathcal {W}}[{\bar{w}}_K] - w\rangle \nonumber \\&\quad \leqslant \gamma _0\lambda _0 \Vert w-w_0\Vert _{}^{2} +\varGamma _K \langle F(w),\varPi _{\mathcal {W}}[{\bar{w}}_K] - {\bar{w}}_K\rangle \nonumber \\&\quad +\textstyle {\sum }_{k = 0}^K \left[ \frac{\gamma _kM^2}{2\lambda _k} + \frac{\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_{k+1}\Vert _{}^{2} + \frac{5\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_k\Vert _{}^{2} +\gamma _kE_k + \gamma _0\lambda _0\langle \varvec{\xi }_k,u_k - z_{k-1}\rangle \right] \nonumber \\&\quad \leqslant \gamma _0\lambda _0 \Vert w-w_0\Vert _{}^{2} + \varGamma _KM{\text {dist}}_\mathcal {W}({\bar{w}}_K) \nonumber \\&\quad +\textstyle {\sum }_{k = 0}^K \left[ \frac{\gamma _kM^2}{2\lambda _k} + \frac{\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_{k+1}\Vert _{}^{2} + \frac{5\gamma _0\lambda _0}{2} \Vert \varvec{\xi }_k\Vert _{}^{2} +\gamma _kE_k + \gamma _0\lambda _0\langle \varvec{\xi }_k,u_k - z_{k-1}\rangle \right] \end{aligned}$$
(5.14)

Now note that

$$\begin{aligned} \textstyle {\sum }_{k = 0}^K \gamma _kE_k&= \varGamma _K \nu + \frac{3L^2D^2}{\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2 + \frac{3}{2\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2\Vert F_{\textbf{B}_{k+1}}(w_{k})-F(w_k)\Vert _{}^{2}\nonumber \\&\quad + \gamma _0\lambda _0 \textstyle {\sum }_{k = 0}^K \langle \frac{1}{\lambda _k}(F(w_k) - F_{\textbf{B}_{k+1}}(w_{k})),w_k-w\rangle \end{aligned}$$
(5.15)

Define \(\varvec{\varDelta }_k:= \frac{1}{\lambda _k}(F(w_k) - F_{\textbf{B}_{k+1}}(w_{k}))\). Note that \(\mathbb {E}_{\textbf{B}_{k+1}}[\varvec{\varDelta }_k| w_k] = 0\). Moreover, define an auxiliary sequence \(\{h_k\}_{k \geqslant 0}\) with \(h_0:= w_0\) and

$$\begin{aligned} h_{k+1}:= \varPi _{\mathcal {W}}[h_k - \varvec{\varDelta }_k]. \end{aligned}$$

Then, due to the mirror-descent bound, we have

$$\begin{aligned} \textstyle {\sum }_{k = 0}^K \langle \varvec{\varDelta }_k,h_{k+1}-w\rangle \leqslant \frac{1}{2}\Vert w-w_0\Vert _{}^{2} - \frac{1}{2}\Vert w-h_{K+1}\Vert _{}^{2} + \textstyle {\sum }_{k = 0}^K\Vert \varvec{\varDelta }_k\Vert _{}^{2}. \end{aligned}$$
(5.16)

Moreover,

$$\begin{aligned} \langle \varvec{\varDelta }_k,h_k-h_{k+1}\rangle \leqslant \Vert \varvec{\varDelta }_k\Vert _{}^{}\Vert h_k-h_{k+1}\Vert _{}^{} \leqslant \Vert \varvec{\varDelta }_k\Vert _{}^{2}. \end{aligned}$$

Using the above relation along with (5.16), we have

$$\begin{aligned}&\textstyle {\sum }_{k = 0}^K\langle \varvec{\varDelta }_k,h_k - w\rangle \\&\quad = \textstyle {\sum }_{k = 0}^K[ \langle \varvec{\varDelta }_k,h_{k+1} - w\rangle + \langle \varvec{\varDelta }_k,h_k - h_{k+1}\rangle ]\\&\quad \leqslant \frac{1}{2}\Vert w-w_0\Vert _{}^{2} - \frac{1}{2}\Vert w-h_{K+1}\Vert _{}^{2} + 2\textstyle {\sum }_{k = 0}^K\Vert \varvec{\varDelta }_k\Vert _{}^{2}\\&\Rightarrow \textstyle {\sum }_{k = 0}^K\langle \varvec{\varDelta }_k,w_k - w\rangle \\&\quad = \textstyle {\sum }_{k = 0}^K[ \langle \varvec{\varDelta }_k,w_k-h_k\rangle + \langle \varvec{\varDelta }_k,h_k-w\rangle ]\\&\quad \leqslant \textstyle {\sum }_{k=0}^K\langle \varvec{\varDelta }_k,w_k-h_k\rangle + \frac{1}{2}\Vert w-w_0\Vert _{}^{2} - \frac{1}{2}\Vert w-h_{K+1}\Vert _{}^{2} + 2\textstyle {\sum }_{k = 0}^K\Vert \varvec{\varDelta }_k\Vert _{}^{2}. \end{aligned}$$

Using the above relation in (5.15), we have

$$\begin{aligned} \textstyle {\sum }_{k = 0}^K \gamma _kE_k&= \varGamma _K \nu + \frac{3L^2D^2}{\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2 + \frac{3}{2\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2\Vert F_{\textbf{B}_{k+1}}(w_{k})-F(w_k)\Vert _{}^{2}\nonumber \\&\quad +\gamma _0\lambda _0 \textstyle {\sum }_{k =0}^K \langle \varvec{\varDelta }_k,w_k-w\rangle \nonumber \\&\leqslant \frac{\gamma _0\lambda _0}{2}\Vert w-w_0\Vert _{}^{2} + \varGamma _K\nu + \frac{3L^2D^2}{\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2 \nonumber \\&\quad + \frac{3}{2\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K \gamma _k^2\Vert F_{\textbf{B}_{k+1}}(w_{k})-F(w_k)\Vert _{}^{2}\nonumber \\&\quad + 2\gamma _0\lambda _0\textstyle {\sum }_{k = 0}^{K} \Vert \varvec{\varDelta }_k\Vert _{}^{2} + \gamma _0\lambda _0\textstyle {\sum }_{k = 0}^K \langle \varvec{\varDelta }_k,w_k-h_k\rangle . \end{aligned}$$
(5.17)

Finally, note that for all valid k, we have

$$\begin{aligned} \mathbb {E}[\Vert \varvec{\xi }_k\Vert _{}^{2}]&= \sigma _k^2d, \end{aligned}$$
(5.18)
$$\begin{aligned} \mathbb {E}[\Vert \varvec{\varDelta }_k\Vert _{}^{2}]&=\mathbb {E}_{w_k}[\mathbb {E}_{\textbf{B}_{k+1}}[\Vert \varvec{\varDelta }_k\Vert _{}^{2}| w_k]] \nonumber \\&= \frac{1}{\lambda _k^2}\mathbb {E}_{w_k} \mathbb {E}_{\textbf{B}_{k+1}}\left[ \Vert F_{\textbf{B}_{k+1}}(w_{k}) - F(w_k)\Vert _{}^{2} | w_k\right] \nonumber \\&\leqslant \frac{1}{\lambda _k^2}\mathbb {E}_{w_k}\mathbb {E}_{\textbf{B}_{k+1}}\Vert F_{\textbf{B}_{k+1}}(w_{k})\Vert _{}^{2} \leqslant \frac{M^2}{\lambda _k^2}, \end{aligned}$$
(5.19)
$$\begin{aligned} \mathbb {E}[\langle \varvec{\varDelta }_k,w_k-h_k\rangle ]&= \mathbb {E}[\langle \mathbb {E}[\varvec{\varDelta }_k|w_k, h_k],w_k-h_k\rangle ] = 0 , \end{aligned}$$
(5.20)
$$\begin{aligned} \mathbb {E}[\langle \varvec{\xi }_k,u_k-z_{k-1}\rangle ]&= \mathbb {E}[\langle \mathbb {E}[\varvec{\xi }_k| u_k, z_{k-1}],u_k-z_{k-1}\rangle ] = 0, \end{aligned}$$
(5.21)

where, in (5.19), we used the fact that \(F_{\varvec{\beta }}\) is M-bounded for all \(\varvec{\beta } \in \textbf{S}\).

Now, using (5.17) in relation (5.14), noting the bound on \({\text {dist}}_\mathcal {W}({\bar{w}}_K)\) from Proposition 8 (in particular (5.3)), taking supremum with respect to \(w \in \mathcal {W}\), then taking expectation and noting (5.18)-(5.21), we have

$$\begin{aligned}&\varGamma _k\mathbb {E}\sup _{w \in \mathcal {W}}\langle F(w),\varPi _{\mathcal {W}}[{\bar{w}}_K] - w\rangle \\&\quad \leqslant \frac{3\gamma _0\lambda _0}{2} D^2 + \textstyle {\sum }_{k =0}^K\left[ \frac{\gamma _k(4M^2+3L^2D^2)}{\lambda _k} + \frac{5\gamma _0\lambda _0}{2}\sigma _{k+1}^2d+ \gamma _k\nu \right] \\&\quad + \varGamma _K M \sqrt{\frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\sigma _{k+1}^2d}.\\&\Rightarrow \mathbb {E}\sup _{w \in \mathcal {W}}\langle F(w),\varPi _{\mathcal {W}}[{\bar{w}}_K] - w\rangle \\&\quad \leqslant \nu + \frac{1}{\varGamma _K} \left[ \frac{3\gamma _0\lambda _0}{2}D^2 + \frac{4M^2 + 3L^2D^2}{\gamma _0\lambda _0}\textstyle {\sum }_{k = 0}^K\gamma _k^2 + \frac{5\gamma _0\lambda _0d}{2}\textstyle {\sum }_{k =1}^K\sigma _{k+1}^2 \right] \\&\quad + M \sqrt{\frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\sigma _{k+1}^2d}, \end{aligned}$$

where in the first inequality, we used the fact that \(\mathbb {E}[{\text {dist}}_\mathcal {W}({\bar{w}}_K)] \leqslant \sqrt{\mathbb {E}[{\text {dist}}_\mathcal {W}({\bar{w}}_K)^2]}\). Hence, we conclude the proof of (5.5).

Now, we extend this argument to (SP(f)). Denote \(u_{k+1} = ({\widetilde{x}}_{k+1}, {\widetilde{y}}_{k+1})\). Then, we have

$$\begin{aligned}&\langle F(u_{k+1}),u_{k+1} -w\rangle \\&\quad = \langle \nabla _x f({\widetilde{x}}_{k+1}, {\widetilde{y}}_{k+1} ),{\widetilde{x}}_{k+1} - x\rangle + \langle -\nabla _y f({\widetilde{x}}_{k+1}, {\widetilde{y}}_{k+1} ),{\widetilde{y}}_{k+1} - y\rangle \\&\quad \geqslant f({\widetilde{x}}_{k+1}, {\widetilde{y}}_{k+1}) - f(x, {\widetilde{y}}_{k+1}) + [-f({\widetilde{x}}_{k+1}, {\widetilde{y}}_{k+1}) + f({\widetilde{x}}_{k+1}, y)]\\&\quad = f({\widetilde{x}}_{k+1}, y) - f(x, {\widetilde{y}}_{k+1}). \end{aligned}$$

Using the above in (5.11), we obtain,

$$\begin{aligned} f({\widetilde{x}}_{k+1}, y) - f(x, {\widetilde{y}}_{k+1})&\leqslant \frac{\lambda _k}{2} \left[ \Vert w-u_k\Vert _{}^{2} - \Vert w-u_{k+1}\Vert _{}^{2} \right] + \lambda _k \langle \varvec{\xi }_k,u_k-w\rangle \\&\qquad + \frac{1}{2}\lambda _k\Vert \varvec{\xi }_k\Vert _{}^{2} + E_k. \end{aligned}$$

Now, we use Proposition 8 to bound the distance between the points \(\tfrac{1}{\varGamma _K}(\textstyle {\sum }_{k =0}^K\gamma _k{\widetilde{x}}_{k+1}, \textstyle {\sum }_{k =0}^K\gamma _k{\widetilde{y}}_{k+1})\) and \((\varPi _\mathcal {X}[{\bar{x}}_K], \varPi _\mathcal {Y}[{\bar{y}}_K])\), together with Jensen’s inequality, to conclude that

$$\begin{aligned}&\frac{1}{\varGamma _K}\textstyle {\sum }_{k = 0}^K\gamma _k [f({\widetilde{x}}_{k+1}, y) - f(x, {\widetilde{y}}_{k+1})]\\&\quad \geqslant f(\tfrac{1}{\varGamma _K}\textstyle {\sum }_{k = 0}^K\gamma _k {\widetilde{x}}_{k+1}, y) - f(x, \tfrac{1}{\varGamma _K} \textstyle {\sum }_{k = 0}^K\gamma _k{\widetilde{y}}_{k+1})\\&\quad \geqslant f(\varPi _\mathcal {X}[{\bar{x}}_K],y) - f(x, \varPi _\mathcal {Y}[{\bar{y}}_K]) - M\left\Vert \begin{bmatrix} \tfrac{1}{\varGamma _K}\textstyle {\sum }_{k = 0}^K\gamma _k {\widetilde{x}}_{k+1} - \varPi _\mathcal {X}[{\bar{x}}_K]\\ \tfrac{1}{\varGamma _K}\textstyle {\sum }_{k = 0}^K\gamma _k {\widetilde{y}}_{k+1} - \varPi _\mathcal {Y}[{\bar{y}}_K] \end{bmatrix}\right\Vert \end{aligned}$$

Retracing the steps of this proof from (5.11) with this bound, we obtain (5.6). Hence, we conclude the proof. \(\square \)

5.1 Differential privacy of the NISPP method

First, we show a simple bound on \(\ell _2\)-sensitivity for updates of NISPP method.

Proposition 9

Suppose \(\nu \leqslant \frac{2M^2}{\lambda _kB_{k+1}^2}\). Then the \(\ell _2\)-sensitivity of the updates of Algorithm 2 is at most \(\frac{4M}{\lambda _kB_{k+1}}\),

where \(B_{k+1} = |\textbf{B}_{k+1}|\) is the batch size of the k-th iteration.

Proof

Let \(w_k\) be the iterate at the start of the k-th iteration of Algorithm 2. Suppose \(\textbf{B}_{k+1}\) and \(\textbf{B}_{k+1}'\) are two batches used in the k-th iteration to obtain \(u_{k+1}\) and \(u_{k+1}'\), respectively, and that \(\textbf{B}_{k+1}\) and \(\textbf{B}_{k+1}'\) differ in only a single datapoint. Then, due to (5.1), we have for all \(w \in \mathcal {W}\)

$$\begin{aligned} \langle F_{\textbf{B}_{k+1}}(u_{k+1}) + \lambda _k(u_{k+1} - w_k) ,w - u_{k+1}\rangle&\geqslant -\nu \\ \langle F_{\textbf{B}_{k+1}'}(u_{k+1}') + \lambda _k(u_{k+1}' - w_k) ,w - u_{k+1}'\rangle&\geqslant -\nu \end{aligned}$$

Using \(w = u_{k+1}'\) in the first relation and \(w = u_{k+1}\) in the second relation above and then summing, we obtain

$$\begin{aligned}&\langle F_{\textbf{B}_{k+1}}(u_{k+1}) -F_{\textbf{B}_{k+1}'}(u_{k+1}') , u_{k+1} - u_{k+1}'\rangle \\&\quad \leqslant 2\nu - \lambda _k \Vert u_{k+1} - u_{k+1}'\Vert _{}^{2}\\&\Rightarrow \langle F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}') , u_{k+1} - u_{k+1}'\rangle \\&\quad \leqslant \langle F_{\textbf{B}_{k+1}'}(u_{k+1}')- F_{\textbf{B}_{k+1}}(u_{k+1}') , u_{k+1} - u_{k+1}'\rangle \\&\quad + 2\nu - \lambda _k \Vert u_{k+1} - u_{k+1}'\Vert _{}^{2} \end{aligned}$$

Now, noting that \(F_{\textbf{B}_{k+1}}\) is a monotone operator, and denoting \(\textbf{a}_{k+1}:= \Vert F_{\textbf{B}_{k+1}'}(u_{k+1}')- F_{\textbf{B}_{k+1}}(u_{k+1}')\Vert _{}^{}\) and \(p_{k+1}:= \Vert w_{k+1} - w'_{k+1}\Vert _{}^{} = \Vert u_{k+1} - u'_{k+1}\Vert _{}^{}\), we have

$$\begin{aligned} 0&\leqslant \langle F_{\textbf{B}_{k+1}'}(u_{k+1}')- F_{\textbf{B}_{k+1}}(u_{k+1}') , u_{k+1} - u_{k+1}'\rangle + 2\nu - \lambda _k \Vert u_{k+1} - u_{k+1}'\Vert _{}^{2}\nonumber \\&\leqslant \textbf{a}_{k+1} p_{k+1} - \lambda _k p_{k+1}^2 + 2\nu . \end{aligned}$$
(5.22)

Finally noting that if \(\varvec{\beta }\) and \(\varvec{\beta }'\) are the differing datapoints in \(\textbf{B}_{k+1}\) and \(\textbf{B}_{k+1}'\), then

$$\begin{aligned} \textbf{a}_{k+1}&= \frac{1}{B_{k+1}}\Vert F_{\varvec{\beta }'}(u'_{k+1}) - F_{\varvec{\beta }}(u'_{k+1}) \Vert _{}^{} \leqslant \frac{2M}{B_{k+1}}. \end{aligned}$$

Using the above relation in (5.22), we have that the \(\ell _2\)-sensitivity \(p_{k+1} = \Vert w_{k+1}-w_{k+1}'\Vert _{}^{} = \Vert u_{k+1} -u_{k+1}'\Vert _{}^{}\) satisfies

$$\begin{aligned} p_{k+1}^2 -\frac{2M}{\lambda _k B_{k+1}} p_{k+1} - \frac{2\nu }{\lambda _k} \leqslant 0. \end{aligned}$$

This implies

$$\begin{aligned} p_{k+1} \leqslant \frac{M}{\lambda _kB_{k+1}} + \sqrt{\frac{M^2}{\lambda _k^2B_{k+1}^2} + \frac{2\nu }{\lambda _k}} \leqslant \frac{2M}{\lambda _kB_{k+1}} + \sqrt{\frac{2\nu }{\lambda _k}}. \end{aligned}$$

Setting \(\nu \leqslant \frac{2M^2}{\lambda _kB_{k+1}^2}\), we have \(p_{k+1} \leqslant \frac{4\,M}{\lambda _kB_{k+1}}.\) Hence, we conclude the proof. \(\square \)

Using the \(\ell _2\)-sensitivity result above along with Propositions 3 and 4, we immediately obtain the following:

Proposition 10

Algorithm 2 with batch sizes \((B_{k+1})_{k \in [K]_0}\), parameters \((\lambda _k)_{k \in [K]_0}\), variance \(\sigma _{k+1}^2 = \frac{32\,M^2}{\lambda _k^2B_{k+1}^2}\frac{\ln (1/\eta )}{\varepsilon ^2}\) and \(\nu \) satisfying assumptions of Proposition 9 is \((\varepsilon , \eta )\)-differentially private.
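To make the calibration in Propositions 9 and 10 concrete, the following small sketch computes the per-iteration sensitivity bound and the corresponding Gaussian variance from \(M\), \(\lambda _k\), the batch size and the privacy parameters; the numerical inputs are assumed placeholders.

```python
import math

def nispp_noise_calibration(M, lam_k, batch_size, eps, eta):
    """Per-iteration sensitivity and Gaussian variance for NISPP (sketch of Props. 9-10)."""
    sensitivity = 4.0 * M / (lam_k * batch_size)                      # Proposition 9
    sigma_sq = (32.0 * M ** 2 * math.log(1.0 / eta)
                / (lam_k ** 2 * batch_size ** 2 * eps ** 2))          # Proposition 10
    return sensitivity, sigma_sq

# Illustrative values only:
print(nispp_noise_calibration(M=1.0, lam_k=50.0, batch_size=32, eps=1.0, eta=1e-6))
```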

Now, we provide a policy for setting \(\gamma _k, \lambda _k\) and \(B_{k+1}\) to obtain population risk bounds for the DP-SVI and DP-SSP problems via the NISPP method.

Corollary 2

Algorithm 2 with disjoint batches \(\textbf{B}_{k+1}\) of size \(B_{k+1} = B:= n^{1/3}\) for all \(k \geqslant 0\), and with the following parameters,

$$\begin{aligned} \gamma _k&= 1, {} & {} \lambda _k = \lambda _0 := \max \left\{ \frac{M}{D}, L\right\} \max \left\{ n^{1/3}, \frac{\sqrt{d\ln (1/\eta )}}{\varepsilon }\right\} ,\\ \sigma _{k+1}^2&= \frac{32M^2}{ B^2\lambda _0^2} \frac{\ln (1/\eta )}{\varepsilon ^2},{} & {} \nu = \frac{2M^2}{\lambda _0B^2}, \end{aligned}$$

is \((\varepsilon , \eta )\)-differentially private and achieves expected SVI-gap (SSP-gap, respectively)

$$\begin{aligned} O\left( (M+LD)D\left[ \frac{1}{n^{1/3}} + \frac{\sqrt{d\ln (1/\eta )}}{\varepsilon n^{2/3}}\right] \right) . \end{aligned}$$

Proof

Note that the values of \(\nu \) and \(\sigma _{k+1}\), as well as the other conditions required in Propositions 9 and 10, are satisfied. Hence, this algorithm is \((\varepsilon , \eta )\)-differentially private.

Moreover, all requirements of Theorem 5 are satisfied. In order to maintain a single pass over the dataset, we require \(K = \frac{n}{B} = n^{2/3}\) iterations. We then bound the individual terms of (5.5) ((5.6), respectively) and conclude the corollary using Theorem 5.

Note that we are using a constant parameter policy. Hence, \(\sigma _{k+1} = \sigma = \frac{4\,M}{\rho B\lambda _0}\) for all \(k \geqslant 0\). Substituting appropriate parameter values, we have

$$\begin{aligned} \nu&= \frac{2MD}{n^{2/3}\max \{n^{1/3}, \sqrt{d\ln (1/\eta )}/\varepsilon \} } \leqslant \frac{2MD}{n},\\ M \sqrt{\frac{1}{\varGamma _K^2}\textstyle {\sum }_{k = 0}^K\gamma _k^2\sigma _{k+1}^2d}&= \frac{M\sqrt{d}\sigma _{k+1}}{\sqrt{K}} = \frac{4M^2\sqrt{2d\ln (1/\eta )}}{\varepsilon n^{2/3}\lambda _0} \leqslant \frac{4\sqrt{2}MD}{n^{2/3}}, \\ \frac{3\lambda _0D^2}{2K}&\leqslant \frac{3(M+LD)D}{2} \left( \frac{1}{n^{1/3}} + \frac{\sqrt{d\ln (1/\eta )}}{\varepsilon n^{2/3}}\right) ,\\ \frac{4M^2 + 3L^2D^2}{\lambda _0}&\leqslant \frac{4MD}{n^{1/3}} + \frac{3LD^2}{n^{1/3}},\\ \frac{5\lambda _0d\sigma ^2}{2}&= \frac{40M^2d\ln (1/\eta )}{\varepsilon ^2B^2 \lambda _0} \leqslant \frac{40MD \sqrt{d\ln (1/\eta )}}{\varepsilon n^{2/3}}. \end{aligned}$$

Substituting these bounds in Theorem 5, we conclude the proof. \(\square \)
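The single-pass policy of Corollary 2 is easy to instantiate; the sketch below transcribes the stated parameters (with the noise variance following the calibration of Proposition 10). The numerical inputs are assumed placeholders.

```python
import math

def nispp_single_pass_policy(M, L, D, n, d, eps, eta):
    """Single-pass NISPP parameter policy of Corollary 2 (illustrative sketch)."""
    B = round(n ** (1.0 / 3.0))                            # batch size ~ n^{1/3}
    K = n // B                                             # single pass: K = n / B ~ n^{2/3}
    lam0 = max(M / D, L) * max(n ** (1.0 / 3.0),
                               math.sqrt(d * math.log(1.0 / eta)) / eps)
    nu = 2.0 * M ** 2 / (lam0 * B ** 2)                    # subproblem accuracy
    sigma_sq = 32.0 * M ** 2 * math.log(1.0 / eta) / (lam0 ** 2 * B ** 2 * eps ** 2)
    return {"B": B, "K": K, "lambda0": lam0, "nu": nu, "sigma_sq": sigma_sq}

print(nispp_single_pass_policy(M=1.0, L=1.0, D=1.0, n=10_000, d=100, eps=1.0, eta=1e-6))
```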

Remark 3

We have the following remarks for NISPP method:

  1.

    In order to obtain a \(\nu \)-approximate solution of the subproblem of the NISPP method satisfying (5.1), we can use the Operator Extrapolation (OE) method (see Theorem 2.3 of [34]). The OE method outputs a solution \(u_{k+1}\) satisfying \(\Vert u_{k+1} -w^*_{k+1}\Vert _{}^{} \leqslant \zeta \) in \( \frac{L+\lambda _0}{\lambda _0}\ln (\frac{D}{\zeta })\) iterations, where \(w_{k+1}^*\) is an exact SVI solution for problem (5.1). Furthermore, we have for all \(w \in \mathcal {W}\),

    $$\begin{aligned} 0&\leqslant \langle F(w^*_{k+1}) + \lambda _k(w^*_{k+1}-w_k),w-w^*_{k+1}\rangle \\&= \langle F(u_{k+1}) + F(w^*_{k+1}) - F(u_{k+1}) + \lambda _k (u_{k+1}- w_k)\\&\qquad + \lambda _k(w^*_{k+1}- u_{k+1}) ,{w -w^*_{k+1}}\rangle \\&\leqslant \langle F(u_{k+1}) + \lambda _k (u_{k+1}- w_k),w -w^*_{k+1}\rangle \\&\qquad + (L+\lambda _k)\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{}\Vert w -w^*_{k+1}\Vert _{}^{}\\&\leqslant \langle F(u_{k+1}) + \lambda _k (u_{k+1}- w_k),w -w^*_{k+1}\rangle \\&\qquad + (L+\lambda _k)D\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{}\\&= \langle F(u_{k+1}) + \lambda _k (u_{k+1}- w_k),w- u_{k+1} + u_{k+1}-w^*_{k+1}\rangle \\&\qquad + (L+\lambda _k)D\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{}\\&\leqslant \langle F(u_{k+1}) + \lambda _k (u_{k+1}- w_k),w- u_{k+1}\rangle + \Vert F(u_{k+1})\\&\qquad + \lambda _k (u_{k+1}- w_k)\Vert _{}^{}\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{}\\&\quad + (L+\lambda _k)D\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{}\\&\leqslant \langle F(u_{k+1}) + \lambda _k (u_{k+1}- w_k),w- u_{k+1}\rangle \\&\qquad + [LD+M+2\lambda _kD]\Vert u_{k+1}-w^*_{k+1}\Vert _{}^{} \end{aligned}$$

    Setting \(\zeta =\nu /[LD+M+2\lambda _kD]\), we obtain that \(u_{k+1}\) is a \(\nu \)-approximate solution satisfying (5.1). Using the convergence rate above, we require \(\frac{L+\lambda _0}{\lambda _0}\ln \frac{MD+LD^2+2\lambda _kD^2}{\nu }\) iterations of the OE method.

    Note that since \(\lambda _0 \geqslant L\), we have \(\frac{L+\lambda _0}{\lambda _0} \leqslant 2\). Moreover,

    $$\begin{aligned} \ln {\frac{MD+LD^2 + 2\lambda _kD^2}{\nu }}&\leqslant \ln {\frac{4\lambda _kD^2}{\nu }}\nonumber \\&= \ln \left( \frac{2\lambda _0^2D^2B^2}{M^2}\right) \nonumber \\&= \ln \left( n^{2/3}\max \left\{ n^{2/3}, \frac{d\ln (1/\eta )}{\varepsilon ^2}\right\} \max \left\{ 1, \frac{L^2D^2}{M^2}\right\} \right) \end{aligned}$$
    (5.23)

    Hence, each iteration of the NISPP method requires \(O(\log {n})\) iterations of the OE method for solving the subproblem. Moreover, each iteration of the OE method requires 2B stochastic operator evaluations. Hence, we require \(O(KB\log {n})\) stochastic operator evaluations over the entire run of NISPP (Algorithm 2). Noting that \(KB = n\), we conclude that this is a near-linear-time algorithm that also performs only a single pass over the data in the stochastic outer loop. We provide the details of the OE method in Appendix B.

  2.

    For the non-DP version of the NISPP method, i.e., \(\sigma _k = 0\) for all k, we can easily obtain a population risk bound of \(O(\frac{MD}{\sqrt{n}})\) by setting \(\lambda _0 = \frac{M}{D}\sqrt{n}\), \(B = 1\) (so \(K = n\)) and \(\nu = \frac{MD}{\sqrt{n}}\) in Corollary 2.

In view of Corollary 2, it seems that running the NISPP method for \(n^{3/2}\) stochastic operator evaluations may provide optimal risk bounds. However, running that many stochastic operator evaluations requires multiple passes over the dataset, so, in principle, this would only provide bounds on the empirical risk. In order to bound the population risk of this multi-pass version, we analyze the stability of NISPP and provide generalization guarantees which result in optimal population risk.

6 Stability of NISPP and optimal risk for DP-SVI and DP-SSP

In this section, we develop a multi-pass variant of the NISPP method and prove its stability, which allows us to convert its empirical performance into population risk bounds.

6.1 Stability of NISPP method

Let us start with two adjacent datasets \(\textbf{S}\simeq \textbf{S}'\). Suppose we run the NISPP method on both datasets starting from the same point \(w_0 \in \mathcal {W}\). Then, in the following lemma, we provide a bound on how far apart the trajectories of these two runs can drift.

Lemma 3

Let \((u_{k+1}, w_{k+1})_{k\geqslant 0}\) and \((u'_{k+1}, w'_{k+1})_{k\geqslant 0}\) be two trajectories of the NISPP method (Algorithm 2) for adjacent datasets \(\textbf{S}\simeq \textbf{S}'\), whose batches are denoted by \(\textbf{B}_{k+1}\) and \(\textbf{B}'_{k+1}\), respectively. Moreover, denote \(\textbf{a}_{k+1}:= \Vert F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u_{k+1})\Vert _{}^{}\) and \(\delta _{k+1}:= \Vert u_{k+1}-u'_{k+1}\Vert _{}^{} (= \Vert w_{k+1}-w'_{k+1}\Vert _{}^{})\) for the k-th iteration of Algorithm 2. Then, if \(i = \inf \{k: \textbf{B}_{k+1} \ne \textbf{B}'_{k+1}\}\),

$$\begin{aligned} \delta _{j+1} {\left\{ \begin{array}{ll} =0 &{}\text {if } j+1 \leqslant i\\ \leqslant \sum _{k = i}^j\Big (\frac{2\textbf{a}_{k+1}}{\lambda _k} + \sqrt{\frac{4\nu }{\lambda _k}}\Big ) &{}\text {otherwise}. \end{array}\right. } \end{aligned}$$
(6.1)

Proof

It is clear from the definition of i that \(\textbf{B}_j = \textbf{B}_j'\) for all \(j \leqslant i\). This implies \(u_j = u'_j\) and \(w_j = w_j'\) for all \(j \leqslant i\). Hence, we conclude the first case of (6.1).

Using (5.1) for \(\nu \)-approximate strong VI solution, we have,

$$\begin{aligned} \langle F_{\textbf{B}_{k+1}}(u_{k+1}) + \lambda _k(u_{k+1}-w_k),w - u_{k+1}\rangle&\geqslant -\nu , \end{aligned}$$
(6.2)
$$\begin{aligned} \langle F_{\textbf{B}'_{k+1}}(u'_{k+1}) + \lambda _k(u'_{k+1}-w'_k),w - u'_{k+1}\rangle&\geqslant -\nu . \end{aligned}$$
(6.3)

Then, adding (6.2) with \(w = u'_{k+1}\) and (6.3) with \(w = u_{k+1}\), we have

$$\begin{aligned}&\langle F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u'_{k+1}),u_{k+1}-u'_{k+1}\rangle \nonumber \\&\quad \leqslant 2\nu - \lambda _k\langle u_{k+1} -u'_{k+1},(u_{k+1}-w_k) - (u'_{k+1}-w'_k)\rangle \nonumber \\&\quad =2\nu - \lambda _k \delta _{k+1}^2 + \lambda _k\langle u_{k+1} -u'_{k+1},w_k - w_k'\rangle \nonumber \\&\quad \leqslant 2\nu - \lambda _k\delta _{k+1}^2 + \frac{\lambda _k}{2}[\delta _{k}^2 + \delta _{k+1}^2]\nonumber \\&\quad \leqslant 2\nu -\frac{\lambda _k}{2}\delta _{k+1}^2 + \frac{\lambda _k}{2}\delta _{k}^2 . \end{aligned}$$
(6.4)

Also note that

$$\begin{aligned}&\langle F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u'_{k+1}),u_{k+1}-u'_{k+1}\rangle \\&\quad =\langle F_{\textbf{B}'_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u'_{k+1}),u_{k+1}-u'_{k+1}\rangle \\&\quad + \langle F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u_{k+1}),u_{k+1}-u'_{k+1}\rangle \\&\quad \geqslant \langle F_{\textbf{B}_{k+1}}(u_{k+1}) - F_{\textbf{B}'_{k+1}}(u_{k+1}),u_{k+1}-u'_{k+1}\rangle \end{aligned}$$

where the last inequality follows from monotonicity of \(F_{\textbf{B}'_{k+1}}\). Using the above relation along with (6.4), we obtain

$$\begin{aligned} \frac{\lambda _k}{2} \delta _{k+1}^2&\leqslant \frac{\lambda _k}{2}\delta _{k}^2 + 2\nu + \langle F_{\textbf{B}'_{k+1}}(u_{k+1}) - F_{\textbf{B}_{k+1}}(u_{k+1}),u_{k+1}-u'_{k+1}\rangle \\ \Rightarrow \delta _{k+1}^2&\leqslant \delta _{k}^2 + \frac{4\nu }{\lambda _k} + \frac{2}{\lambda _k}\textbf{a}_{k+1}\delta _{k+1}, \end{aligned}$$

where we used the definition of \(\textbf{a}_{k+1}\) along with the Cauchy–Schwarz inequality. Solving the quadratic inequality in \(\delta _{k+1}\), we obtain the following recursion

$$\begin{aligned} \delta _{k+1} \leqslant \frac{\textbf{a}_{k+1}}{\lambda _k} + \sqrt{\frac{\textbf{a}_{k+1}^2}{\lambda _k^2} + \delta _{k}^2 + \frac{4\nu }{\lambda _k}} \end{aligned}$$

which can be further simplified to

$$\begin{aligned} \delta _{k+1} \leqslant \delta _{k} + \frac{2\textbf{a}_{k+1}}{\lambda _k} + \sqrt{\frac{4\nu }{\lambda _k}}. \end{aligned}$$

Solving this recursion and noting the base case that \(\delta _i = 0\), we obtain (6.1). \(\square \)

A direct consequence of the previous analysis is the following pair of in-expectation and high-probability uniform argument stability upper bounds for the sampling-with-replacement variant of Algorithm 2.

Theorem 6

Let \(\mathcal {A}\) denote the sampling with replacement NISPP method (Algorithm 2) where \(\textbf{B}_k\) is chosen uniformly at random from subsets of \(\textbf{S}\) of a given size \(B_k\). Then \(\mathcal {A}\) satisfies the following uniform argument stability bounds:

$$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\mathbb {E}_{\mathcal {A}}[\delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })] \leqslant \sum _{k=1}^{K}\left( \frac{2M}{n\lambda _k}+\sqrt{\frac{4\nu }{\lambda _k}} \right) . \end{aligned}$$

Furthermore, if \(|\textbf{B}_{k}|= B\) and \(\lambda _k= \lambda \) for all k (i.e., constant batch size and regularization parameter throughout iterations) then w.p. at least \(1-n\exp \{-KB/[4n]\}\) (over both sampling and noise addition)

$$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}[\delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })] \leqslant \frac{4MK}{\lambda n}+K\sqrt{\frac{4\nu }{\lambda }}. \end{aligned}$$

Proof

Let \(\textbf{S}\simeq \textbf{S}^{\prime }\), and let \((u_{k+1})_k\), \((u_{k+1}^{\prime })_k\) be the trajectories of the algorithm on \(\textbf{S}\) and \(\textbf{S}^{\prime }\), respectively. By Lemma 3, letting \(\delta _{k+1}=\Vert u_{k+1}-u_{k+1}^{\prime }\Vert \) \((=\Vert w_{k+1}-w_{k+1}^{\prime }\Vert )\), we get \(\delta _{j} \leqslant \sum _{k=1}^{j}\Big ( \frac{2\textbf{a}_k}{\lambda _k}+\sqrt{\frac{4\nu }{\lambda _k}} \Big ),\) where \(\textbf{a}_k=\Vert F_{\textbf{B}_{k+1}}(u_{k+1})-F_{\textbf{B}_{k+1}'}(u_{k+1})\Vert _{}^{}\) is a random variable. By the law of total probability, \(\mathbb {E}[\textbf{a}_k] \leqslant \frac{|\textbf{B}_{k+1}|}{n}\frac{2\,M}{|\textbf{B}_{k+1}|}+\big (1-\frac{|\textbf{B}_{k+1}|}{n}\big )\cdot 0=\frac{2\,M}{n}.\) Hence, \(\mathbb {E}[\delta _{j}]\leqslant \sum _{k=1}^{j}\Big ( \frac{2\,M}{n\lambda _k}+\sqrt{\frac{4\nu }{\lambda _k}} \Big ) \leqslant \sum _{k=1}^{K}\Big ( \frac{2\,M}{n\lambda _k}+\sqrt{\frac{4\nu }{\lambda _k}} \Big ).\) Since \( \Vert \varPi _{\mathcal {W}}({\bar{w}}_K)-\varPi _{\mathcal {W}}({\bar{w}}_K^{\prime })\Vert _{}^{} \leqslant \Vert {\bar{w}}_K - {\bar{w}}'_K\Vert _{}^{} \leqslant \max _{k\in [K]}\delta _k\), and since \(\textbf{S}\simeq \textbf{S}^{\prime }\) are arbitrary,

$$\begin{aligned} \sup _{\textbf{S}\simeq \textbf{S}^{\prime }}\mathbb {E}_{\mathcal {A}}[\delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime })] \leqslant \sum _{k=1}^{K}\left( \frac{2M}{n\lambda _k}+\sqrt{\frac{4\nu }{\lambda _k}} \right) . \end{aligned}$$

We proceed now to the high-probability bound. Let \(\varvec{r}_k\sim \text{ Ber }(p)\), for \(k\in [K]\), with \(Kp< 1\). Then, for any \(0<\theta <1/2\),

$$\begin{aligned} \mathbb {P}\left[ \sum _{k=1}^K\varvec{r}_k \geqslant Kp+\tau \right]\leqslant & {} \exp \big (-\theta (\tau +Kp))\Big [ 1+p(e^{\theta }-1) \Big ]^K \leqslant \exp \{Kp\theta ^2-\theta \tau \}. \end{aligned}$$

Choosing \(\theta =\tau /[2Kp]<1/2\), we get that the probability above is upper bounded by \(\exp \{-\tau ^2/[4Kp]\}\). Finally, choosing \(\tau =Kp\), we get

$$\begin{aligned} \mathbb {P}\left[ \sum _{k=1}^K\varvec{r}_k \geqslant 2Kp \right] \leqslant \exp \{-Kp/4\}. \end{aligned}$$

Next, fix the coordinate i where \(S\simeq S^{\prime }\) may differ. Noticing that \(\textbf{a}_k\) is a.s. upper bounded by \((2M/B)\varvec{r}_k\) with \(\varvec{r}_k\sim \text{ Ber }(p)\), with \(p=B/n\), we get

$$\begin{aligned} \mathbb {P}\left[ \sum _{k=1}^K\frac{2\textbf{a}_k}{\lambda } \geqslant \frac{2}{\lambda }\frac{2MK}{n}\right] \leqslant \exp \{-\frac{KB}{4n}\}. \end{aligned}$$

In particular, w.p. at least \(1-\exp \{-\frac{KB}{4n}\}\), we have \( \delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime }) \leqslant \frac{4MK}{\lambda n}+K\sqrt{\frac{4\nu }{\lambda }}.\) Using a union bound over \(i\in [n]\) (and noticing that averaging preserves the stability bound), we conclude that

$$\begin{aligned} \mathbb {P}\left[ \sup _{\textbf{S}\simeq \textbf{S}^{\prime }} \delta _{\mathcal {A}}(\textbf{S},\textbf{S}^{\prime }) >\frac{4MK}{\lambda n}+K\sqrt{\frac{4\nu }{\lambda }}\right] \leqslant n\exp \{-KB/4n\}. \end{aligned}$$

Hence, we conclude the proof. \(\square \)
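The key randomness in the proof of Theorem 6 is how often the differing datapoint actually enters a sampled batch. The following Monte Carlo sketch (with assumed, illustrative values of n, B, K and M) checks empirically that the per-iteration bound has expectation at most \(2M/n\) and that the Bernoulli sum rarely exceeds twice its mean.

```python
import numpy as np

# Illustration for Theorem 6: r_k ~ Ber(B/n) indicates whether the differing
# datapoint enters batch k, and a_k <= (2M/B) * r_k.
rng = np.random.default_rng(1)
n, B, K, M, trials = 1000, 32, 2000, 1.0, 5000      # assumed, illustrative values

r = rng.random((trials, K)) < B / n                 # Bernoulli(B/n) indicators
a_bound = (2.0 * M / B) * r                         # per-iteration bound on a_k

print("empirical E[a_k]      :", a_bound.mean(), "  vs  2M/n =", 2.0 * M / n)
print("P[sum_k r_k >= 2KB/n] :", (r.sum(axis=1) >= 2 * K * B / n).mean(),
      "  vs  exp(-KB/(4n)) =", float(np.exp(-K * B / (4 * n))))
```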

Remark 4

An important observation is that, for the high-probability guarantee to be useful, it is necessary that the algorithm be run for sufficiently many iterations; in particular, we require \(K=\omega (n/B)\). Whether this assumption can be completely avoided is an interesting question. Nevertheless, as we will see in the next subsection, our policy for the DP-SVI and DP-SSP problems satisfies this requirement.

6.2 Optimal risk for DP-SVI and DP-SSP by the NISPP method

In the preceding sections, we provided bounds on the optimization error, the generalization error, and the value of \(\sigma \) needed to obtain \((\varepsilon ,\eta )\)-differential privacy. In this section, we specify a policy for selecting \(\lambda _k, B_k, \gamma _k, \sigma _k \) and \(\nu \) such that the requirements of those sections are satisfied and we obtain optimal risk bounds while maintaining \((\varepsilon ,\eta )\)-privacy. In particular, consider the multi-pass NISPP method where each sample batch \(\textbf{B}_k\) is chosen randomly from subsets of \(\textbf{S}\) with replacement. Then, we have the following theorem:

Theorem 7

Let \(\mathcal {A}\) be the multi-pass NISPP method (Algorithm 2). Set the following constant stepsize and batchsize policy for \(\mathcal {A}\):

$$\begin{aligned}&\gamma _k = 1, \quad{} & {} \lambda _k = \lambda _0= \max \left\{ \frac{M}{D},L\right\} \max \left\{ \sqrt{n}, \frac{\sqrt{d\log (1/\eta )}}{\varepsilon }\right\} , \quad{} & {} B_k = B= \sqrt{n},\\&K = n,{} & {} \nu = \frac{M^2}{\lambda _0n^2}, \quad{} & {} \sigma _{k+1} = \frac{8M}{B \lambda _0}\frac{\sqrt{\ln (1/\eta )}}{\varepsilon } . \end{aligned}$$

Then, Algorithm 2 is \((\varepsilon , \eta )\)-differentially private. Moreover, the output \(\mathcal {A}(\textbf{S})\) satisfies the following bound on \(\mathbb {E}_{{\mathcal {A}}}[\text{ WeakGap}_{\textrm{VI}}(\mathcal {A}(\textbf{S}),F)]\) for the SVI problem (or \(\mathbb {E}_{{\mathcal {A}}}[\text{ WeakGap}_{\textrm{SP}}(\mathcal {A}(\textbf{S}),f)]\) for the SSP problem):

$$\begin{aligned} O\left( (M+LD)D \left( \frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln {1/\eta }}}{\varepsilon n}\right) \right) . \end{aligned}$$

Moreover, such a solution is obtained in a total of \({\widetilde{O}}(n\sqrt{n})\) stochastic operator evaluations.

Proof

Note that since \(\nu \) satisfies the assumption of Proposition 9, the \(\ell _2\)-sensitivity of the update \(u_{k+1}\) is at most \(\frac{4M}{\lambda _0B_{k+1}}\). Then, in view of Theorem 1 along with the value of \(\sigma _{k+1}\), we conclude that Algorithm 2 is \((\varepsilon , \eta )\)-differentially private.

Now, for convergence, we first bound the empirical gap. Given that our bounds for (VI(F)) and (SP(f)) are analogous, we treat both problems simultaneously. By Theorem 5, along with the fact that sampling with replacement is an unbiased stochastic oracle for the empirical operator, we have for any \(\textbf{S}\)

$$\begin{aligned} \mathbb {E}_{\mathcal {A}}[\text{ EmpGap }(\mathcal {A},F_\textbf{S})]&\leqslant \nu + \frac{\lambda _0D^2}{n} + \frac{7M^2}{\lambda _0} + \frac{160M^2d}{\varepsilon ^2B^2\lambda _0}\ln {\frac{1}{\eta }} + M\sqrt{2d}\frac{8M}{Bn \varepsilon \lambda _0}\\&\leqslant O\left( (M+LD)D \left( \frac{1}{\sqrt{n}} + \frac{\sqrt{d \ln {1/\eta }}}{\varepsilon n} \right) \right) . \end{aligned}$$

A similar claim can be made for the empirical gap of the (SP(f)) problem, \(\mathbb {E}_{{\mathcal {A}}}\big [ \text{ EmpGap }(\mathcal {A}, f_{\textbf{S}})\big ]\), where the output of \(\mathcal {A}\) is \((x(\textbf{S}), y(\textbf{S}))\).

Next, by Theorem 6, we have that \(\mathcal {A}(\textbf{S})\) (or \(x(\textbf{S})\) and \(y(\textbf{S})\) for the SSP case) are UAS with parameter

$$\begin{aligned} \delta = \frac{2M}{\lambda _0} + n\sqrt{\frac{4\nu }{\lambda _0}} = \frac{4M}{\lambda _0} \leqslant \frac{4D}{\sqrt{n}}. \end{aligned}$$

Hence, noting that the empirical gap upper bounds the weak empirical gap, and using Proposition 1 or Proposition 2 (depending on whether the problem is an SSP or an SVI, respectively), the population weak gap is upper bounded by the empirical gap plus \(M\delta \), where \(\delta \) is the UAS parameter of the algorithm; in particular, for the SVI case,

$$\begin{aligned}&\mathbb {E}_{\mathcal {A}, \textbf{S}} \big [ \text{ WeakGap}_{\textrm{VI}}(\mathcal {A},F)\big ] \leqslant O\Big ((M+LD)D\Big (\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln (1/\eta )} }{\varepsilon n}\Big )\Big ) \\&\quad + \frac{20MD}{\sqrt{n}} = O\Big ((M+LD)D\Big (\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln (1/\eta )}}{\varepsilon n}\Big )\Big ). \end{aligned}$$

A similar claim can be made for \(\text{ WeakGap}_{\textrm{SP}}(\mathcal {A},f)\).

Finally, we analyze the running time. As in Remark 3, the number of OE method iterations for obtaining a \(\nu \)-approximate solution is \(O\big (\frac{L+\lambda _0}{\lambda _0}\ln \big (\frac{LD^2+MD+\lambda _0D^2}{\nu }\big )\big )\). Now note that \(\frac{L+\lambda _0}{\lambda _0} \leqslant \frac{\sqrt{n}+1}{\sqrt{n}} \leqslant 2\) since \(n \geqslant 1\). Moreover, in view of (5.23), we have \(\ln \big (\frac{LD^2+MD+\lambda _0D^2}{\nu }\big ) \leqslant \ln \left( \frac{4\lambda _0D^2}{\nu }\right) \leqslant \ln \left( n^2\max \lbrace 1, \frac{L^2D^2}{M^2}\rbrace \max \lbrace n, \frac{d\ln (1/\eta )}{\varepsilon ^2}\rbrace \right) \). Each iteration of the OE method costs B stochastic operator evaluations and we run the outer loop of the NISPP method K times. Hence, the total number of stochastic operator evaluations (ignoring logarithmic factors) is \({\widetilde{O}}(KB) = {\widetilde{O}}(n\sqrt{n})\). Hence, we conclude the proof. \(\square \)
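Analogously to the single-pass case, the multi-pass policy of Theorem 7 can be instantiated directly from the theorem statement; the sketch below simply transcribes those parameters, with assumed placeholder inputs.

```python
import math

def nispp_multi_pass_policy(M, L, D, n, d, eps, eta):
    """Multi-pass NISPP parameter policy of Theorem 7 (illustrative sketch)."""
    B = round(math.sqrt(n))                               # batch size ~ sqrt(n)
    K = n                                                 # n outer iterations
    lam0 = max(M / D, L) * max(math.sqrt(n),
                               math.sqrt(d * math.log(1.0 / eta)) / eps)
    nu = M ** 2 / (lam0 * n ** 2)                         # subproblem accuracy
    sigma = 8.0 * M * math.sqrt(math.log(1.0 / eta)) / (B * lam0 * eps)
    return {"B": B, "K": K, "lambda0": lam0, "nu": nu, "sigma": sigma}

print(nispp_multi_pass_policy(M=1.0, L=1.0, D=1.0, n=10_000, d=100, eps=1.0, eta=1e-6))
```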

7 Lower bounds and optimality of our algorithms

In this section, we show the optimality of the rates obtained in Sects. 4.2 and 6.2. The first observation is that, since DP-SCO corresponds to a DP-SSP problem where \({{\mathcal {Y}}}\) is a singleton, the complexity of DP-SSP is lower bounded by \(\varOmega \big (MD \big ( \frac{1}{\sqrt{n}}+\min \big \{1,\frac{\sqrt{d}}{\varepsilon n}\big \} \big )\big )\): this is a known lower bound for DP-SCO [5]. It is important to note as well that this reduction applies to the weak generalization gap, as defined in (1.4), since in the case where \({{\mathcal {Y}}}=\{\bar{y}\}\) is a singleton:

$$\begin{aligned} \text{ WeakGap}_{\textrm{SP}}({\mathcal {A}},f)&={\mathbb {E}}_{\mathcal {A}}[ \sup _{y\in {{\mathcal {Y}}}} {\mathbb {E}}_{\textbf{S}}[f(x(\textbf{S}),y)] -\inf _{x\in {{\mathcal {X}}}} {\mathbb {E}}_{\textbf{S}}[f(x, y(\textbf{S}))] ]\\&= {\mathbb {E}}_{\mathcal {A}}{\mathbb {E}}_{\textbf{S}}[f(x(\textbf{S}),\bar{y})]-\inf _{x\in {{\mathcal {X}}}} f(x, \bar{y})\\&= {\mathbb {E}}_{{\mathcal {A}},\textbf{S}}[f(x(\textbf{S}),\bar{y})-\inf _{x\in {{\mathcal {X}}}} f(x, \bar{y})], \end{aligned}$$

which is simply the expected optimality gap. Using this reduction, together with a lower bound for DP-SCO [5], we conclude that

Proposition 11

Let \(n,d\in \mathbb {N}\), \(L,M,D,\varepsilon >0\) and \(\eta =o(1/n)\). The class of DP-SSP problems with gradient operators within the class \({\mathcal {M}}_{\mathcal {W}}^1(L,M)\), and domain \({\mathcal {W}}\) containing a Euclidean ball of diameter D/2, satisfies the lower bound

$$\begin{aligned} \varOmega \Big (MD\Big ( \frac{1}{\sqrt{n}}+\min \Big \{1,\frac{\sqrt{d}}{\varepsilon n}\Big \} \Big )\Big ). \end{aligned}$$

Next, we study the case of DP-SVI. The situation is more subtle here. Our approach is to first prove a reduction from the population weak VI gap to the empirical strong VI gap for the case where the operators are constant w.r.t. w. It seems unlikely that this reduction works for more general monotone operators; however, it suffices for our purposes, since we will later prove that a lower bound construction used for DP-ERM [7] leads to a lower bound for the strong VI gap with constant operators.

The formal reduction to the empirical version of the problem is presented in the following lemma. Its proof follows closely the reduction from DP-SCO to DP-ERM from [5]. Below, given a dataset \(\textbf{S}\in {{\mathcal {Z}}}^n\), let \({{\mathcal {P}}}_{\textbf{S}}=\frac{1}{n}\sum _{\varvec{\beta }\in \textbf{S}}\delta _{\varvec{\beta }}\) be the empirical distribution associated with \(\textbf{S}\).

Lemma 4

Let \({\mathcal {A}}\) be an \((\varepsilon /[4\log (1/\eta )], e^{-\varepsilon }\eta /[8\log (1/\eta )])\)-DP algorithm for SVI problems. Then there exists an \((\varepsilon ,\eta )\)-DP algorithm \({\mathcal {B}}\) such that for any empirical VI problem with constant operators,

$$\begin{aligned} \text{ EmpGap}_{\textrm{VI}}({\mathcal {B}},F_{\textbf{S}}) \leqslant \text{ WeakGap}_{\textrm{VI}}({\mathcal {A}},F_{{{\mathcal {P}}}_{\textbf{S}}}) \qquad (\forall \textbf{S}\in {{\mathcal {Z}}}^n). \end{aligned}$$

Proof

Consider the algorithm \({\mathcal {B}}\) that does the following: first, it extracts a sample \(\textbf{T}\sim ({{\mathcal {P}}}_{\textbf{S}})^n\); next, executes \({\mathcal {A}}\) on \(\textbf{T}\); and finally, outputs \({\mathcal {A}}(\textbf{T})\). We claim that this algorithm is \((\varepsilon ,\eta )\)-DP w.r.t. \(\textbf{S}\), which follows easily by bounding the number of repeated examples with high probability, together with the group privacy property applied to \({\mathcal {A}}\) (for a more detailed proof, see Appendix C in [5]). Now, given a constant operator \(F_{\varvec{\beta }}(w)\), let \(R(\varvec{\beta })\in {\mathbb {R}}^d\) be its unique evaluation. Let now \(R_{\textbf{S}}\) be the unique evaluation of \(F_{\textbf{S}}\), and given a distribution \({{\mathcal {P}}}\) let \(R_{{\mathcal {P}}}\) be the unique evaluation of \(F_{{{\mathcal {P}}}}(w)={\mathbb {E}}_{\varvec{\beta }\sim {{\mathcal {P}}}}[F_{\varvec{\beta }}(w)]\).

Noting that \({\mathbb {E}}_{\textbf{T}}[R_{\textbf{T}}]=R_{\textbf{S}},\) we have that

$$\begin{aligned} \text{ EmpGap}_{\textrm{VI}}({\mathcal {B}}(\textbf{S}),F_{\textbf{S}})&= {\mathbb {E}}_{\mathcal {B}}[\sup _{w\in {\mathcal {W}}} \langle R_{\textbf{S}},{\mathcal {B}}(\textbf{S})-w \rangle ]\\&= {\mathbb {E}}_{{\mathcal {A}},\textbf{T}}[ \langle R_{\textbf{S}},{\mathcal {A}}(\textbf{T})\rangle - \inf _{w\in {\mathcal {W}}} \langle R_{\textbf{S}},w \rangle ]\\&={\mathbb {E}}_{{\mathcal {A}}} \sup _{w\in {\mathcal {W}}} {\mathbb {E}}_{\textbf{T}} [ \langle R_{\textbf{S}},{\mathcal {A}}(\textbf{T})-w \rangle ]\\&= \text{ WeakGap}_{\textrm{VI}}({\mathcal {A}},F_{{{\mathcal {P}}}_{\textbf{S}}}), \end{aligned}$$

where third equality holds since the optimal choice of w is independent of \(\textbf{T}\), and the last equality holds by definition of the weak gap function and the fact that \(\textbf{T}\sim ({{\mathcal {P}}}_{\textbf{S}})^n\).

\(\square \)
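The reduction algorithm \({\mathcal {B}}\) in the proof of Lemma 4 is simply a resampling wrapper around \({\mathcal {A}}\); a minimal sketch follows, where run_A is a hypothetical placeholder for the given DP-SVI algorithm \({\mathcal {A}}\).

```python
import numpy as np

def reduction_B(S, run_A, seed=None):
    """Sketch of the reduction in Lemma 4: resample n points i.i.d. from the
    empirical distribution P_S and run the given DP-SVI algorithm A on the
    resampled dataset.  run_A is a hypothetical placeholder."""
    rng = np.random.default_rng(seed)
    n = len(S)
    T = [S[i] for i in rng.integers(0, n, size=n)]   # T ~ (P_S)^n
    return run_A(T)
```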

Next, we prove a lower bound for the empirical VI problem over constant operators.

Proposition 12

Let \(n,d\in \mathbb {N}\), \(L,M,D,\varepsilon >0\) and \(2^{-o(n)}\leqslant \eta \leqslant o(1/n)\). The class of DP empirical VI problems with constant operators within the class \({\mathcal {M}}_{\mathcal {W}}^1(L,M)\), and domain \({\mathcal {W}}\) containing a Euclidean ball of diameter D/2, satisfies the lower bound

$$\begin{aligned} \varOmega \Big (MD\Big ( \min \Big \{1,\frac{\sqrt{d\log (1/\eta )}}{\varepsilon n}\Big \} \Big )\Big ). \end{aligned}$$

Proof

Consider the following empirical VI problem: \(F_{\varvec{\beta }}(u)= M\varvec{\beta }\), \(\mathcal {W}={\mathcal {B}}(0,D)\) and dataset \(\textbf{S}\) with points contained in \(\{-1/\sqrt{d},+1/\sqrt{d}\}^d\). Notice that, since the operator in this case is constant, the VI gap coincides with the excess risk of the associated convex optimization problem

$$\begin{aligned} (P)~~~ \min _{u\in \mathcal {W}} \Big \langle \frac{M}{n}\sum _{i\in [n]} \varvec{\beta }_i, u\Big \rangle . \end{aligned}$$

Indeed, for any \(u\in {\mathcal {W}}\),

$$\begin{aligned} \text{ EmpGap}_{\textrm{VI}}(u,F_\textbf{S})= & {} \sup _{v\in {\mathcal {B}}(0,D)} \Big \langle \frac{M}{n} \sum _{i\in [n]} \varvec{\beta }_i, u- v\Big \rangle = \Big \langle \frac{M}{n} \sum _{i\in [n]} \varvec{\beta }_i, u+ \frac{D\sum _i \varvec{\beta }_i}{\Vert \sum _i \varvec{\beta }_i\Vert }\Big \rangle \\= & {} \Big \Vert \frac{MD}{n}\sum _{i\in [n]} \varvec{\beta }_i\Big \Vert +MD\big \langle \frac{u}{D}, \frac{1}{n}\sum _i \varvec{\beta }_i\big \rangle . \end{aligned}$$

This, together with the lower bounds on excess risk proved for this problem in [7, Appendix C] and [46, Theorem 5.1], shows that any \((\varepsilon ,\eta )\)-DP algorithm for this problem must incur a worst-case VI gap of \(\varOmega (MD\min \{1,\frac{\sqrt{d\log (1/\eta )}}{\varepsilon n}\})\), which proves the result. \(\square \)

The two results above provide the claimed lower bound for the weak SVI gap of any differentially private algorithm.

Theorem 8

Let \(n,d\in \mathbb {N}\), \(L,M,D,\varepsilon >0\) and \(2^{-o(n)}\leqslant \eta \leqslant o(1/n)\). The class of DP-SVI problems with operators within the class \({\mathcal {M}}_{\mathcal {W}}^1(L,M)\), and domain \({\mathcal {W}}\) containing a Euclidean ball of diameter D/2, satisfies the following lower bound on the weak VI gap:

$$\begin{aligned} \tilde{\varOmega }\Big (MD\Big ( \frac{1}{\sqrt{n}}+\min \Big \{1,\frac{\sqrt{d}}{\varepsilon n}\Big \} \Big )\Big ). \end{aligned}$$

Before we prove the result, let us observe that the presented lower bound shows the optimality of our algorithms in the range where \(M\geqslant LD\). Obtaining a matching lower bound for any choice of M, L, D is an interesting question, which unfortunately our proof technique does not address: this limitation comes from the fact that the lower bound is based on constant operators, whose Lipschitz constants are always zero.

Proof

Let \({\mathcal {A}}\) be any algorithm for SVI. By the classical (nonprivate) lower bounds for SVI [30, 40], the minimax attainable SVI gap is lower bounded by \(\varOmega (MD/\sqrt{n})\). On the other hand, using Lemma 4, the accuracy of any \((\varepsilon ,\eta )\)-DP algorithm for the weak SVI gap is lower bounded by the strong gap achieved by \((4\varepsilon \ln (1/\eta ),e^{\varepsilon }{\tilde{O}}(\eta ))\)-DP algorithms on empirical VI problems with constant operators. Finally, by Proposition 12, the latter class of problems enjoys a lower bound \(\varOmega (\min \{1,\sqrt{d\ln (1/[e^{\varepsilon }{\tilde{O}}(\eta )])}/[\varepsilon n\ln (1/\eta )]\})=\tilde{\varOmega }(\min \{1,\sqrt{d}/[n\varepsilon ]\})\), which implies a lower bound of this order for the former class. We conclude by combining the private and nonprivate lower bounds established above. \(\square \)