Differentially private Riemannian optimization

In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework for differentially private Riemannian optimization that adds noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space, compatible with this generalized Gaussian distribution, and show that this strategy admits a simpler analysis than directly adding noise on the manifold. We further establish privacy guarantees for the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex and general nonconvex objectives, as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We demonstrate the efficacy of the proposed framework in several applications.


Introduction
With the ever-increasing complexity of statistics and machine learning models, data privacy has become a primary concern, as it is increasingly difficult to safeguard against the disclosure of private information during model training. Differential privacy [21,22] provides a framework for quantifying privacy loss as well as for designing algorithms with privacy-preserving guarantees.
Many problems in machine learning fall under the paradigm of empirical risk minimization (ERM), where the loss is expressed as (1/n) Σ_{i=1}^n f(w; z_i) with independent and identically distributed (i.i.d.) samples z_1, . . ., z_n drawn from a data distribution D. Differentially private ERM, originally studied in [18], aims to safeguard against privacy disclosure of the samples through the solution w*. There exist many approaches to achieve this goal. The first class of methods perturbs the output of a non-differentially-private algorithm by adding Laplace or Gaussian noise [17,18,56]. Another approach adds a linear random perturbation term to the objective and is known as objective perturbation [17,18,34,28,8]. A third type of approach injects noise into gradient-based algorithms at each iteration [9,50,49,1,7]. In addition, there exist various specialized methods for problems such as linear regression and statistical estimation [20,39,5,31,11].
Among the aforementioned approaches, gradient perturbation receives the most attention due to its generality for arbitrary loss functions and its scalability to large datasets. Furthermore, it only requires bounding the sensitivity of the gradients computed at each iteration, rather than of the entire process. There is an ongoing line of work that aims to improve the utility of gradient-perturbed algorithms while maintaining the same privacy budget. Such improvements have been shown under (strongly) convex losses [50,28,53,7,35,6], nonconvex losses [50,56,49,51], and also structured losses such as those satisfying the Polyak-Łojasiewicz condition [50].
In this paper, we consider the following ERM problem in a differentially private setting where the parameter is constrained to lie on a Riemannian manifold:

min_{w∈M} F(w) := (1/n) Σ_{i=1}^n f(w; z_i),   (1)

where M is a d-dimensional Riemannian manifold and f : M × D → R is a loss function over samples. Riemannian manifolds commonly occur in statistics and machine learning where the parameters naturally possess additional nonlinear structure, such as orthogonality [3], positive definiteness [10], unit norm, and hyperbolic structure [13], among others. Popular applications involving the above manifold structures include matrix completion [14,25], metric learning, covariance estimation [27], principal component analysis [2], and taxonomy embedding [38], to name a few.
While a few recent works address specific Riemannian optimization problems under differential privacy, such as private Fréchet mean computation [42], there exists no systematic study of general-purpose strategies to guarantee differential privacy for (1) on Riemannian manifolds. On the other hand, differentially private (non-Riemannian) approaches have been studied for ERM problems with specific constraints [19,9,30,1,7,36]. Such approaches typically employ the projected gradient algorithm, i.e., taking the gradient step and adding the noise in the Euclidean space, and then projecting onto the constraint set. Extending such a strategy to Riemannian manifolds may result in looser sensitivity and utility bounds that scale poorly with the dimension of the ambient space, which can be much larger than the intrinsic manifold dimension [42].
Contributions. In this work, we propose a general framework via Riemannian optimization to achieve differential privacy for (1) by adding noise to the Riemannian gradient adhering to the Riemannian metric (an inner product formally defined in Section 2). To this end, we generalize the Gaussian mechanism to the tangent space of Riemannian manifolds and also adapt the moments accountant technique to track the privacy loss. We study the privacy guarantees of the differentially private Riemannian (stochastic) gradient descent method. Additionally, we show its utility guarantees for a variety of interesting function classes on Riemannian manifolds, including geodesic (strongly) convex functions, general nonconvex functions, and functions satisfying the Riemannian Polyak-Łojasiewicz (PL) condition. A summary of the utility bounds of the proposed framework is given in Table 1. In addition, we show that projected gradient methods for (1) ignore the intrinsic geometry when taking the update and thus hamper the utility under structured loss functions on manifolds (discussed in Section 4). Finally, we provide illustrative examples and empirical results in Section 5.

A geodesic is a locally distance-minimizing curve on the manifold with zero acceleration. For any ξ ∈ T_wM, the exponential map is defined as Exp_w(ξ) = γ(1), where γ is the geodesic with γ(0) = w and γ′(0) = ξ. If there exists a unique geodesic connecting two points w, w′ ∈ M, the exponential map has a smooth inverse and the Riemannian distance is given by dist(w, w′) = ‖Exp_w^{-1}(w′)‖_w = ‖Exp_{w′}^{-1}(w)‖_{w′}. We call a neighbourhood W totally normal if the exponential map is invertible between any two of its points. The Riemannian gradient of a real-valued function F, denoted gradF(w), is the tangent vector satisfying, for any ξ ∈ T_wM, ⟨gradF(w), ξ⟩_w = D_ξF(w) = ⟨∇F(w), ξ⟩_2, where D_ξF(w) is the directional derivative of F at w along ξ and ∇F(w) is the Euclidean gradient.
Riemannian optimization. In non-private settings, Riemannian optimization [3,13] provides a class of methods to efficiently solve problem (1) for arbitrary loss functions by treating the constrained problem as an unconstrained problem over the manifold. Given the Riemannian gradient, Riemannian steepest descent [48] takes a gradient update via the exponential map so that the iterates stay on the manifold, i.e., w ← Exp_w(−η gradF(w)) for some stepsize η. Other, more advanced solvers include Riemannian conjugate gradient [44], trust region methods [2], as well as many recent stochastic optimizers [12,54,46,33,26,25].
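The steepest-descent update above can be sketched concretely on the sphere (a minimal Python illustration; the paper's experiments use Matlab/ManOpt, and the quadratic objective, matrix A, stepsize, and iteration count here are assumptions made for the example):

```python
import numpy as np

def sphere_exp(w, xi):
    """Exponential map on the sphere: follow the great circle from w along xi."""
    nrm = np.linalg.norm(xi)
    if nrm < 1e-12:
        return w
    return np.cos(nrm) * w + np.sin(nrm) * (xi / nrm)

def sphere_grad(w, egrad):
    """Riemannian gradient: project the Euclidean gradient onto T_w S^d."""
    return egrad - (w @ egrad) * w

def rgd(egradF, w0, eta, iters):
    """Riemannian steepest descent: w <- Exp_w(-eta * gradF(w))."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(iters):
        w = sphere_exp(w, -eta * sphere_grad(w, egradF(w)))
    return w

# Hypothetical objective: F(w) = -w^T A w on the sphere, minimized by the
# leading eigenvector of A (e_1 by construction here).
A = np.diag([1.0, 0.5, 0.3, 0.2, 0.1])
rng = np.random.default_rng(0)
w_hat = rgd(lambda w: -2.0 * A @ w, rng.standard_normal(5), eta=0.3, iters=500)
```

Because the update moves along the great circle, every iterate remains exactly on the manifold, in contrast to add-then-project schemes.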
Function classes on Riemannian manifolds. The notion of Lipschitz continuity has been generalized to Riemannian manifolds [13,55]. Geodesic convexity [55] is an extension of convexity in the Euclidean space. A set W ⊆ M is geodesic convex if for any two points in the set there exists a geodesic in the set joining them. A function F : W → R is called geodesic convex if for any w, w′ ∈ W, it satisfies F(γ(t)) ≤ (1 − t)F(w) + tF(w′) for all t ∈ [0, 1], where γ is the geodesic with γ(0) = w, γ(1) = w′. If F is differentiable, an equivalent characterization of geodesic convexity is F(w′) ≥ F(w) + ⟨gradF(w), ξ⟩_w for any w ∈ W, w′ = Exp_w(ξ). In addition, a function F is called geodesic β-strongly convex if for any w, w′ = Exp_w(ξ) ∈ W, it satisfies F(w′) ≥ F(w) + ⟨gradF(w), ξ⟩_w + (β/2) dist²(w, w′) for some β > 0. Further, we introduce the Riemannian Polyak-Łojasiewicz (PL) condition [54,33,25], which is weaker than geodesic strong convexity. A function F : M → R is said to satisfy the Riemannian PL condition if there exists τ > 0 such that for any w ∈ M, F(w) − F(w*) ≤ τ‖gradF(w)‖²_w, where w* is a global minimizer of F on M.
Finally, we recall a trigonometric distance bound for Riemannian manifolds with lower bounded sectional curvature, which is crucial in convergence analysis for geodesic convex optimization.
Lemma 1 (Trigonometric distance bound [12,55,54,27]). Let w_0, w_1, w_2 ∈ W ⊆ M lie in a totally normal neighbourhood of a Riemannian manifold with sectional curvature lower bounded by κ_min, and let l_0 = dist(w_0, w_1), l_1 = dist(w_1, w_2), and l_2 = dist(w_0, w_2). Denote by θ the angle on T_{w_0}M such that cos(θ) = (1/(l_0 l_2))⟨Exp_{w_0}^{-1}(w_1), Exp_{w_0}^{-1}(w_2)⟩_{w_0}. Let D_W be the diameter of W, i.e., D_W := max_{w,w′∈W} dist(w, w′). Define the curvature constant ς = √|κ_min| D_W / tanh(√|κ_min| D_W) if κ_min < 0 and ς = 1 otherwise. Then l_1² ≤ ς l_0² + l_2² − 2 l_0 l_2 cos(θ).

Differential privacy. Let D = {z_1, . . ., z_n} ⊂ D^n be a dataset. A neighbouring dataset of D, denoted D′, is a dataset that differs from D in only one sample; the neighbouring relation is denoted D ∼ D′. We first recall the definition of (ε, δ)-differential privacy (DP) [22], which is defined on an arbitrary measurable space M (not necessarily a Riemannian manifold).
Definition 1 ((ε, δ)-differential privacy). A randomized mechanism R : D^n → M is called (ε, δ)-differentially private on M if for any neighbouring datasets D, D′ ⊂ D^n and any measurable set A ⊆ M, we have P[R(D) ∈ A] ≤ e^ε P[R(D′) ∈ A] + δ.

In addition, we make use of Rényi differential privacy (RDP) [37], which enjoys a tighter privacy bound under composition and subsampling [52]. Given neighbouring datasets D, D′, we first define the cumulant generating function of a mechanism R as K_{R,(D,D′)}(λ) := log E_{o∼R(D)}[exp(λ c(o))], where c(o) := log(p_{R(D)}(o)/p_{R(D′)}(o)) is known as the privacy loss random variable at o ∼ R(D) [23,22]. When maximized over all neighbouring datasets, K_R(λ) := sup_{D∼D′} K_{R,(D,D′)}(λ) is called the λ-th moment of the mechanism [1].
The notions of differential privacy introduced above are well-defined on Riemannian manifolds, since a Riemannian manifold is a measurable space under the Borel sigma algebra [41]. However, a systematic approach for preserving differential privacy when the parameters of interest lie on Riemannian manifolds has not been studied. A recent work [42] proposes differentially private Fréchet mean computation over general Riemannian manifolds by output perturbation. Nevertheless, computing the Fréchet mean is a special problem instance of (1), which we can solve via our proposed general framework under differential privacy. See more detailed discussions and comparisons in Section 5.

Differential privacy on Riemannian manifolds
This section develops the tools for preserving and analyzing differential privacy on Riemannian manifolds. Proofs for the results in this section are deferred to the supplementary material.
First, we generalize the Gaussian mechanism [22] from the Euclidean space to Riemannian manifolds. One approach is to directly add noise on the manifold following an intrinsic Gaussian distribution [41]. However, this strategy faces two challenges. First, it requires bounding the sensitivity in terms of the Riemannian distance, which can be difficult, particularly for negatively curved manifolds such as the symmetric positive definite manifold. Second, the generalization suffers from metric distortion by curvature and requires a nontrivial adaptation of the proof strategy used in the Euclidean space [22]. Nevertheless, it is worth mentioning that the Laplace mechanism [21] can be generalized via the triangle inequality of the Riemannian distance, as has been done recently in [42].
Instead, we consider directly adding noise to the tangent space of the manifold following an isotropic Gaussian distribution with respect to the Riemannian metric. In this case, we can measure the sensitivity on the tangent space. We highlight that although the tangent space can be identified with a Euclidean space, the proposed strategy differs from the classic Gaussian mechanism, which adds isotropic noise to each coordinate in the Euclidean space.
To this end, we define the tangent space Gaussian distribution as follows.
Definition 3 (Tangent space Gaussian distribution). For any w ∈ M, a tangent vector ξ ∈ T_wM follows a tangent space Gaussian distribution at w, denoted ξ ∼ N_w(µ, σ²) with mean µ ∈ T_wM and standard deviation σ > 0, if its density is given by p(ξ) = (1/C_{w,σ}) exp(−‖ξ − µ‖²_w/(2σ²)), where C_{w,σ} is the normalizing constant.
Remark 1. In a normal coordinate system of the tangent space, N_w(µ, σ²) is equivalent to a multivariate Gaussian whose covariance is a function of the metric tensor of the tangent space. Denote by ξ̄ ∈ R^d the vectorization of the tangent vector ξ ∈ T_wM ≅ R^d in the normal coordinates. The density can then be written as p(ξ̄) ∝ exp(−(ξ̄ − µ̄)ᵀ G_w (ξ̄ − µ̄)/(2σ²)), where G_w is the (symmetric positive definite) metric tensor at w. This is a Gaussian distribution with mean µ̄ and covariance σ² G_w^{-1}.

Next, we introduce a generalization of the Gaussian mechanism on the tangent space. We stress that the following Gaussian mechanism depends on the sensitivity measured in the Riemannian metric.
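The coordinate view of Remark 1 can be checked numerically: in normal coordinates, N_w(µ, σ²) is a Gaussian with covariance σ²G_w^{-1}. A sketch (the metric tensor G below is a hypothetical example, not taken from the paper):

```python
import numpy as np

def sample_tangent_gaussian(mu, G, sigma, n, rng):
    """Draw n samples from the coordinate form of N_w(mu, sigma^2): a Gaussian
    with mean mu and covariance sigma^2 * G^{-1}, where G is the (SPD) metric
    tensor; the density is proportional to exp(-(x-mu)^T G (x-mu)/(2 sigma^2))."""
    L = np.linalg.cholesky(np.linalg.inv(G))  # L L^T = G^{-1}
    z = rng.standard_normal((n, len(mu)))
    return mu + sigma * z @ L.T

rng = np.random.default_rng(0)
G = np.array([[2.0, 0.3], [0.3, 1.0]])  # hypothetical metric tensor
mu = np.array([1.0, -1.0])
X = sample_tangent_gaussian(mu, G, sigma=0.5, n=200_000, rng=rng)
emp_cov = np.cov(X, rowvar=False)  # should approach 0.25 * G^{-1}
```

The empirical covariance of the samples concentrates around σ²G_w^{-1}, matching the remark.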
Proposition 2 (Tangent space Gaussian mechanism). Given a query function H : D^n → T_wM with sensitivity ∆ := sup_{D∼D′} ‖H(D) − H(D′)‖_w, the mechanism R(D) = H(D) + ξ with ξ ∼ N_w(0, σ²) is (ε, δ)-differentially private for σ ≥ √(2 log(1.25/δ)) ∆/ε.

To show privacy guarantees in the subsequent sections, we adapt the moments accountant technique from the Euclidean space [1] to Riemannian manifolds, which yields a tighter bound than advanced composition [22]. To achieve this, we first provide lemmas that bound the moments of the tangent space Gaussian mechanism (Proposition 2). The proof strategy is motivated by the connection between (ε, δ)-differential privacy (Definition 1) and Rényi differential privacy [37] established in [52].
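For intuition, the calibration of the noise scale mirrors the classic Euclidean Gaussian mechanism, with the sensitivity now measured in the Riemannian norm on the tangent space. A minimal sketch (the calibration constant is the classic Euclidean one, assumed here to carry over):

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Classic Gaussian-mechanism calibration sigma >= sqrt(2 log(1.25/delta))
    * sensitivity / eps; assumed here with sensitivity measured in the
    Riemannian norm ||.||_w on the tangent space."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / eps

sigma = gaussian_sigma(sensitivity=1.0, eps=0.1, delta=1e-3)
```

As expected, a looser privacy budget (larger ε) requires less noise.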
The next two lemmas show upper bounds on K R (λ) under full datasets (Lemma 2) as well as under subsampling (Lemma 3).

Lemma 2 (Moments bound). Consider a query function H : D^n → T_wM with sensitivity ∆ and the mechanism R(D) = H(D) + ξ, where ξ ∼ N_w(0, σ²). Then the λ-th moment satisfies K_R(λ) ≤ λ(λ + 1)∆²/(2σ²).
Lemma 3 (Moments bound under subsampling). Under the same settings as in Lemma 2, consider D_sub to be a subset of size b whose samples are selected from D uniformly without replacement, and apply the mechanism R to D_sub.

It is worth highlighting that the bound given in Lemma 3 asymptotically matches the bound derived from the moments accountant [1] when b/n → 0. In addition, we observe that under small λ and large σ, subsampling does not improve the bound compared to using the full dataset.

Differentially private Riemannian (stochastic) gradient descent
In this section, we introduce differentially private Riemannian (stochastic) gradient descent (Algorithm 1), where we add noise following the tangent space Gaussian distribution N_w(0, σ²). We show that under a proper choice of parameters, the algorithm preserves both the privacy guarantee and utility guarantees under various function classes on Riemannian manifolds. Proofs for this section are included in the supplementary material.
In particular, in Algorithm 1, the samples are selected without replacement following [52]; thus, when b = n, we recover full gradient descent. The noise variance is chosen as σ² = c T log(1/δ) L_0²/(n²ε²) for some constant c, which ensures (ε, δ)-differential privacy (Theorem 1). We remark that this choice of σ matches the standard results in the Euclidean space for gradient descent [50] and for stochastic gradient descent [1,7,51], up to constants that may depend on the manifold of interest. The output of the algorithm depends on the function class of the objective, discussed in Section 4.2.
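The overall loop can be sketched on the sphere (a Python illustration of the full-gradient variant b = n; the objective, constant c, stepsize, and iteration count are illustrative assumptions, not the paper's exact calibration):

```python
import numpy as np

def sphere_exp(w, xi):
    """Exponential map on the sphere."""
    nrm = np.linalg.norm(xi)
    if nrm < 1e-12:
        return w
    return np.cos(nrm) * w + np.sin(nrm) * (xi / nrm)

def dp_rgd_sphere(Z, eps, delta, T, eta, L0, c=1.0, rng=None):
    """Sketch of noisy Riemannian GD for F(w) = -(1/n) sum_i (w^T z_i)^2 on the
    sphere, with noise variance sigma^2 = c T log(1/delta) L0^2 / (n^2 eps^2)."""
    rng = rng or np.random.default_rng(0)
    n, d1 = Z.shape
    sigma = np.sqrt(c * T * np.log(1.0 / delta) * L0**2 / (n**2 * eps**2))
    w = rng.standard_normal(d1)
    w /= np.linalg.norm(w)
    for _ in range(T):
        egrad = -2.0 * Z.T @ (Z @ w) / n
        g = egrad - (w @ egrad) * w              # Riemannian gradient
        noise = sigma * rng.standard_normal(d1)
        noise -= (w @ noise) * w                 # tangent space Gaussian noise
        w = sphere_exp(w, -eta * (g + noise))
    return w

rng = np.random.default_rng(1)
Z = rng.standard_normal((500, 10))
w_priv = dp_rgd_sphere(Z, eps=1.0, delta=1e-3, T=50, eta=0.1, L0=2.0, rng=rng)
```

Note that projecting an ambient isotropic Gaussian onto the tangent space yields an isotropic tangent space Gaussian on the sphere, and the exponential-map update keeps every noisy iterate exactly on the manifold.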
Proof. The idea is to bound the moment K_{R_t}(λ) of the randomized mapping at every iteration, using Lemma 2 for gradient descent and Lemma 3 for stochastic gradient descent. Then, by the composability theorem [1, Theorem 2.1] as well as the connection between RDP and DP in Proposition 1, we can ensure differential privacy. The detailed proof can be found in the supplementary material.

Convergence guarantees
For the convergence analysis, we start by assuming that all the iterates w_0, . . ., w_T stay within a compact support W ⊆ M that contains a stationary point w* (i.e., gradF(w*) = 0). Let c_l > 0 be such that G_w ⪰ c_l I_d holds for all w ∈ W. We first show that the expected norm of the noise-injected gradient can be bounded.

Next we show utility guarantees under various function classes on Riemannian manifolds, including geodesic (strongly) convex functions, general nonconvex functions, and functions satisfying the Riemannian PL condition. It has been shown that many nonconvex problems in the Euclidean space are in fact geodesic (strongly) convex or satisfy the Riemannian PL condition on the manifold. This allows tighter utility bounds compared to differentially private projected gradient methods. Some examples are given in Section 5.
Geodesic convex optimization. When f(w; z) is geodesic convex over W for any z ∈ D, any stationary point w* is a global minimum of F(w). The utility of Algorithm 1 is measured by the expected empirical excess risk E[F(w_priv)] − F(w*).
Theorem 2 (Utility under geodesic convex optimization). Suppose f(w; z) is geodesic convex and geodesic L_0-Lipschitz over W. Assume W is a totally normal neighbourhood with diameter D_W. Let ς be the curvature constant of W defined in Lemma 1. Consider Algorithm 1 with output w_priv = w̄_{T−1} computed by geodesic averaging as follows: set w̄_1 = w_1 and w̄_{t+1} = Exp_{w̄_t}((1/(t + 1)) Exp_{w̄_t}^{-1}(w_{t+1})). Then, for the choice of σ² in Algorithm 1 and T = n², the output satisfies E[F(w_priv)] − F(w*) ≤ Õ(√(ς d log(1/δ)) L_0 D_W/(nε)).
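The geodesic running average used for the output can be sketched on the sphere (a Python illustration; the 1/(t+2) folding weight is the standard running-average convention and is assumed to match the paper's recursion):

```python
import numpy as np

def sphere_exp(w, xi):
    nrm = np.linalg.norm(xi)
    if nrm < 1e-12:
        return w
    return np.cos(nrm) * w + np.sin(nrm) * (xi / nrm)

def sphere_log(w, wp):
    """Inverse exponential map Log_w(w') on the sphere."""
    c = np.clip(w @ wp, -1.0, 1.0)
    v = wp - c * w
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return np.zeros_like(w)
    return np.arccos(c) * (v / nv)

def geodesic_average(ws):
    """Running geodesic average: fold in the next iterate with weight 1/(t+2),
    so after two points the average is the geodesic midpoint."""
    wbar = ws[0]
    for t, w in enumerate(ws[1:]):
        wbar = sphere_exp(wbar, sphere_log(wbar, w) / (t + 2))
    return wbar

# Averaging points clustered near e_1 stays on the sphere and near the cluster.
rng = np.random.default_rng(0)
pts = []
for _ in range(50):
    v = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal(3)
    pts.append(v / np.linalg.norm(v))
wbar = geodesic_average(pts)
```

Unlike arithmetic averaging followed by projection, the average is built entirely from on-manifold operations (Exp and its inverse).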
First, Theorem 2 indicates the same utility bound under both gradient descent and stochastic gradient descent. This matches the results in the Euclidean space for convex functions [9] up to a square-root factor of the curvature constant ς. In addition, we highlight that the bound depends on the intrinsic dimension of the manifold, rather than that of the ambient space.
Geodesic strongly convex optimization.Under geodesic strong convexity, the global optimizer w * is unique and we use the same measure for bounding the utility.
Optimization under Riemannian PL condition. Next, we consider the case where the objective F satisfies the Riemannian Polyak-Łojasiewicz (PL) condition. It is known that the Polyak-Łojasiewicz condition is sufficient to establish linear convergence to a global minimum [32]. In addition, the Riemannian PL condition includes geodesic strong convexity as a special case [54].
General nonconvex optimization. Under general nonconvex objectives, we show a utility bound with respect to the expected squared gradient norm, i.e., E‖gradF(w_priv)‖²_{w_priv}, which has also been considered in the Euclidean space [56,50].
Theorem 5 (Utility under general nonconvex optimization). Suppose f(w; z) is geodesic L_0-Lipschitz and L_1-smooth, possibly nonconvex. Consider Algorithm 1 with output 2 and set η_t = η = 1/L_1; the resulting bound on E‖gradF(w_priv)‖²_{w_priv} is summarized in Table 1.

Remark 2 (Extending utility guarantees to other Riemannian optimization methods). In this section, we have shown utility guarantees for vanilla Riemannian gradient descent and stochastic gradient descent under differential privacy. We remark that this strategy can be applied to more advanced solvers while preserving the same privacy budget. One example is to use line search methods to select the stepsize; under the current parameter settings, the same privacy guarantees are preserved. In addition, provided that the update direction ζ_t is 'close' to the full gradient gradF(w_t) (as in [15]), the same utility bounds also hold.

Applications
Here we explore two applications to demonstrate the efficacy of the proposed framework of differentially private Riemannian optimization. The experiments are implemented in Matlab using the ManOpt package [16] on an i7-8750H CPU. The code is available at https://github.com/andyjm3.

Principal eigenvector computation over sphere manifold
We first consider the problem of computing the leading eigenvector of a sample covariance matrix [47,54]:

min_{w∈S^d} F(w) = −(1/n) Σ_{i=1}^n (wᵀz_i)²,   (2)

where S^d = {w ∈ R^{d+1} : ‖w‖₂² = wᵀw = 1} is the sphere manifold of intrinsic dimension d and z_1, . . ., z_n ∈ R^{d+1} are zero-centered samples. The sphere is an embedded submanifold of R^{d+1} with tangent space T_wS^d = {ξ ∈ R^{d+1} : wᵀξ = 0}.

Utility. First, the metric tensor is G_w = I_d and hence c_l = 1. This is because, given an orthonormal basis of the tangent space T_wS^d, denoted B ∈ R^{(d+1)×d}, we have ⟨ξ, ζ⟩_w = ξᵀζ = ξ̄ᵀζ̄, where ξ̄, ζ̄ ∈ R^d are the vectorizations under the coordinate transformation given by B. In addition, the geodesic Lipschitz constant L_0 is bounded as ‖gradf(w; z_i)‖_w ≤ 2‖z_i‖₂² =: ϑ. Then, applying Theorem 4 for utility under the Riemannian PL condition, we can show that, if properly initialized, the expected empirical excess risk of Algorithm 1 is bounded as Õ(d ϑ²/(n²ε²)), where c_l = 1, L_0 = ϑ, and Õ hides logarithmic factors.

Remark 3 (Comparison with perturbed projected gradient descent). Considering the alternative approach of projected gradient descent with gradient perturbation in the ambient space for solving problem (2), one can only guarantee a utility of E‖∇F(w_priv)‖₂² ≤ Õ(√d ϑ/(nε)) due to the nonconvexity of the problem [50]. This results in a looser bound in n compared to our utility guarantee above.
Sampling from the tangent space Gaussian distribution. Based on the argument above, the tangent space Gaussian distribution N_w(0, σ²) from Definition 3 reduces to the classic isotropic Gaussian distribution N(0, σ²I_d) in R^d. Hence, sampling from N_w(0, σ²) can be achieved by first sampling from N(0, σ²I_d) and then transforming through the basis matrix B.
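This sampling procedure can be sketched as follows (a Python illustration; constructing B from the SVD of wᵀ is one convenient choice, not necessarily the paper's implementation):

```python
import numpy as np

def tangent_basis(w):
    """Orthonormal basis B (columns) of T_w S^d = {xi : w^T xi = 0}, obtained
    from the SVD of the 1 x (d+1) matrix w^T: the trailing right singular
    vectors span the null space of w^T."""
    _, _, Vt = np.linalg.svd(w.reshape(1, -1))
    return Vt[1:].T                              # shape (d+1, d)

def sample_tangent(w, sigma, n, rng):
    """Sample N_w(0, sigma^2): draw isotropic Gaussians in R^d, map through B."""
    B = tangent_basis(w)
    return sigma * rng.standard_normal((n, B.shape[1])) @ B.T

rng = np.random.default_rng(0)
w = rng.standard_normal(6)
w /= np.linalg.norm(w)
S = sample_tangent(w, sigma=1.0, n=1000, rng=rng)
```

Because B has orthonormal columns orthogonal to w, the mapped samples are exactly tangent and remain isotropic with respect to the induced metric.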
Experiment settings and results. We follow the same procedure as in [47] to generate the sample matrix Z = [z_1; . . .; z_n] ∈ R^{n×(d+1)}. Specifically, we construct a (d + 1) × (d + 1) diagonal matrix Σ = diag(1, 1 − 1.1ν, . . ., 1 − 1.4ν, |x_1|/(d + 1), |x_2|/(d + 1), . . .), where ν is the eigengap defined in Theorem 6 and x_1, x_2, . . . are standard Gaussian random variables. We then construct Z = UΣVᵀ, where U ∈ R^{n×(d+1)} and V ∈ R^{(d+1)×(d+1)} are random column-orthonormal matrices; thus Z has the same spectrum as Σ. We generate noise following Algorithm 1, with parameters T = log(n²ε²/((d + 1)L_0² log(1/δ))) and σ² = T log(1/δ)L_0²/(n²ε²) according to Theorem 4 for the Riemannian PL condition. We set ε = 0.1, δ = 10⁻³, ν = 10⁻³, d + 1 = 50, and estimate L_0 from the samples. We compare Algorithm 1 with full gradient (b = n), denoted DP-RGD, against projected gradient descent (with noise added in the ambient space), denoted DP-PGD. In fact, projected gradient descent on the sphere approximates Riemannian gradient descent because the former updates by w_{t+1} = (w_t − η_tζ_t)/‖w_t − η_tζ_t‖₂, where ζ_t = ∇F(w_t) + ε_t. This approximates the exponential map to first order and is known as a retraction [4,13]. For this reason, and for comparability, we set the same noise variance and maximum number of iterations for both DP-RGD and DP-PGD. We show the expected empirical excess risk under different sample sizes in Figure 1a, with the stepsize tuned and fixed to η = 0.2 for both algorithms, and results averaged over 20 runs with different initializations. From the figure, we see an improved utility of the proposed DP-RGD, demonstrating the benefit of the intrinsic Riemannian update.
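The first-order agreement between the retraction (normalize after an ambient step) and the exponential map can be verified numerically (a Python sketch; the random test direction is an illustrative assumption):

```python
import numpy as np

def exp_retr(w, xi):
    """Exponential map on the sphere."""
    nrm = np.linalg.norm(xi)
    if nrm < 1e-12:
        return w
    return np.cos(nrm) * w + np.sin(nrm) * (xi / nrm)

def proj_retr(w, xi):
    """Projected-gradient style update: step in the ambient space, renormalize."""
    v = w + xi
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
w = rng.standard_normal(4)
w /= np.linalg.norm(w)
xi = rng.standard_normal(4)
xi -= (w @ xi) * w                               # tangent direction
# Discrepancy between the two updates for a small and a large step:
gap_small = np.linalg.norm(exp_retr(w, 1e-3 * xi) - proj_retr(w, 1e-3 * xi))
gap_large = np.linalg.norm(exp_retr(w, 0.5 * xi) - proj_retr(w, 0.5 * xi))
```

Both updates stay on the sphere, and the gap shrinks rapidly (cubically in the step norm on the sphere) as the step gets small, which is why the two methods are comparable under matched noise and stepsize.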

Fréchet mean computation over symmetric positive definite manifold
The second application we consider is Fréchet mean computation [10,29] over the manifold of r × r symmetric positive definite (SPD) matrices, denoted S^r_{++}. Specifically, given a set of SPD matrices X_1, . . ., X_n ∈ S^r_{++}, the goal is to find a center W ∈ S^r_{++} by minimizing the empirical average of squared Riemannian distances to the samples:

min_{W∈S^r_{++}} F(W) = (1/n) Σ_{i=1}^n dist²(W, X_i) = (1/n) Σ_{i=1}^n ‖logm(W^{-1/2}X_iW^{-1/2})‖²_F,   (3)

where logm(•) denotes the principal matrix logarithm. The tangent space of the set of SPD matrices is T_WS^r_{++} = {U ∈ R^{r×r} : U = Uᵀ}, i.e., the set of symmetric matrices. The function f(W; X_i) in problem (3) is the squared Riemannian distance dist²(W, X_i) associated with the affine-invariant Riemannian metric [10], ⟨U, V⟩_W = tr(W^{-1}UW^{-1}V), whose exponential map is Exp_W(U) = W^{1/2} expm(W^{-1/2}UW^{-1/2}) W^{1/2}, where expm(•) denotes the principal matrix exponential. The Riemannian gradient is computed as gradF(W) = W∇F(W)W. Notably, problem (3) is known to be nonconvex in the Euclidean sense while being geodesic strongly convex on the SPD manifold [54,55].
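The Fréchet mean computation under the affine-invariant metric can be sketched with Riemannian gradient descent (a Python illustration, not the paper's Matlab/ManOpt code; the two test matrices and stepsize are assumptions for the example, and for two points the mean reduces to the well-known geometric mean A # B, which we use as a check):

```python
import numpy as np

def _sym_fun(M, fun):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * fun(vals)) @ vecs.T

def spd_sqrt(M): return _sym_fun(M, np.sqrt)
def spd_logm(M): return _sym_fun(M, np.log)
def spd_expm(M): return _sym_fun(M, np.exp)

def log_map(W, X):
    """Exp_W^{-1}(X) = W^{1/2} logm(W^{-1/2} X W^{-1/2}) W^{1/2}."""
    Wh = spd_sqrt(W)
    Whi = np.linalg.inv(Wh)
    return Wh @ spd_logm(Whi @ X @ Whi) @ Wh

def exp_map(W, U):
    """Exp_W(U) = W^{1/2} expm(W^{-1/2} U W^{-1/2}) W^{1/2}."""
    Wh = spd_sqrt(W)
    Whi = np.linalg.inv(Wh)
    return Wh @ spd_expm(Whi @ U @ Whi) @ Wh

def frechet_mean(Xs, iters=100, eta=0.5):
    """Riemannian GD on F(W) = (1/n) sum_i dist^2(W, X_i), whose Riemannian
    gradient is gradF(W) = -(2/n) sum_i Exp_W^{-1}(X_i)."""
    W = Xs[0].copy()
    for _ in range(iters):
        G = -2.0 * sum(log_map(W, X) for X in Xs) / len(Xs)
        W = exp_map(W, -eta * G)
    return W

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, -0.2], [-0.2, 3.0]])
M = frechet_mean([A, B])
# For two matrices, the Fréchet mean is the geodesic midpoint A # B:
Ah = spd_sqrt(A)
Ahi = np.linalg.inv(Ah)
geo = Ah @ spd_sqrt(Ahi @ B @ Ahi) @ Ah
```

Every iterate stays SPD by construction, since the exponential map lands on the manifold; no projection or eigenvalue clipping is needed.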
Utility. To show the utility guarantees, we first bound the Lipschitz constant. Specifically, the Riemannian gradient of problem (3) can be derived as gradf(W; X_i) = −2 Exp_W^{-1}(X_i), whose norm satisfies ‖gradf(W; X_i)‖_W = 2 dist(W, X_i) ≤ 2D_W, giving L_0 ≤ 2D_W. Also, since the SPD manifold with the affine-invariant metric is negatively curved, the curvature constant satisfies ς ≤ 2√(|κ_min|)D_W + 1 according to Lemma 1. Then, applying Theorem 3, the utility is Õ(d D_W²/(n²ε²)), where the intrinsic dimension of the SPD manifold S^r_{++} is d = r(r + 1)/2.

Remark 4 (Comparison to utility obtained in [42]). Here we compare the utility of the proposed algorithm to the result in [42]. First, we highlight that in [42], by output perturbation with Laplace noise, W_priv is shown to satisfy ε-pure differential privacy (with δ = 0). Their utility follows by applying the triangle inequality to the expected empirical excess risk and ignoring other factors.
Sampling from the tangent space Gaussian distribution. Sampling from N_W(0, σ²) can be performed via a standard implementation of random walk Metropolis-Hastings [45]. In particular, we repeatedly sample from a proposal Gaussian distribution on R^{r(r+1)/2} conditional on the previous iterate and evaluate the unnormalized density exp(−‖U‖²_W/(2σ²)).
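This sampler can be sketched directly over symmetric matrices (a Python illustration; the symmetric-matrix random-walk proposal and step size are illustrative choices rather than the paper's exact implementation):

```python
import numpy as np

def sq_norm_W(U, Winv):
    """Affine-invariant squared norm ||U||_W^2 = tr(W^{-1} U W^{-1} U)."""
    T = Winv @ U
    return np.trace(T @ T)

def rw_mh_tangent(W, sigma, n_samples, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings targeting the tangent space Gaussian
    p(U) prop. to exp(-||U||_W^2 / (2 sigma^2)) over symmetric r x r matrices."""
    rng = rng or np.random.default_rng(0)
    r = W.shape[0]
    Winv = np.linalg.inv(W)
    U = np.zeros((r, r))
    logp = -sq_norm_W(U, Winv) / (2.0 * sigma**2)
    out = []
    for _ in range(n_samples):
        E = rng.standard_normal((r, r))
        V = U + step * (E + E.T) / 2.0           # symmetric random-walk proposal
        logp_new = -sq_norm_W(V, Winv) / (2.0 * sigma**2)
        if rng.random() < np.exp(min(0.0, logp_new - logp)):
            U, logp = V, logp_new
        out.append(U.copy())
    return out

W = np.array([[2.0, 0.3], [0.3, 1.0]])
samples = rw_mh_tangent(W, sigma=1.0, n_samples=2000)
```

Because the target density only needs to be known up to the normalizing constant C_{W,σ}, Metropolis-Hastings avoids computing it altogether.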
Experiment settings and results. We follow the steps in [42] to generate synthetic samples on the SPD manifold S^r_{++} following the Wishart distribution W(I_r/r, r) with a diameter bound D_W. The optimal solution W* is obtained by running Riemannian gradient descent (RGD) on problem (3) until the gradient norm falls below 10⁻¹⁴. For this example, we choose r = 2 and set T = n, which we find empirically better than n² (as suggested in Theorem 3). We choose σ² = T log(1/δ)D_W²/(n²ε²) with ε = 0.1, δ = 10⁻³, D_W = 1, η = 0.01. We compare the proposed DP-RGD with the differentially private Fréchet mean (DP-FM) by output perturbation in [42]. Following the procedures in [42], we first obtain a non-privatized Fréchet mean W̄ by running RGD. Then we sample from an intrinsic Laplace distribution on S^r_{++} (by the steps in [24]) with footprint W̄ and σ = ∆_FM/ε, where ∆_FM = 2D_W/n = 2/n is the sensitivity of the Fréchet mean on the SPD manifold ([42, Theorem 2]). We plot in Figure 1b the expected empirical excess risk against the sample size for DP-RGD and DP-FM, averaged over 20 runs. We observe a better utility of our proposed method, particularly when the sample size is small.

Concluding remarks
We propose a general framework to ensure differential privacy for ERM problems in the Riemannian optimization setting. We develop a general strategy to add noise that adheres to the intrinsic Riemannian geometry. To this end, we generalize the Gaussian mechanism to the tangent space, compatible with the Riemannian metric. We also prove privacy as well as utility guarantees for a differentially private version of the Riemannian (stochastic) gradient descent method. Finally, we highlight that the generalized Gaussian mechanism, as well as the analysis toolkit in this paper, allows one to safeguard differential privacy on Riemannian manifolds beyond the context of Riemannian optimization, as long as the operations are defined on tangent spaces.

A.1 Proofs from Section 3
Proof of Proposition 2. For any ξ ∼ N_w(0, σ²), we have R(D) ∼ N_w(H(D), σ²). Set H(D) − H(D′) = ζ and consider the privacy loss random variable [23,22] at the output. By [22, Lemma 3.17] (via vectorization as in Remark 1), the privacy loss can be analyzed in the normal coordinates. The rest of the proof directly follows from [22, Theorem A.1] by applying the tail bound of the Gaussian distribution.
Proof of Lemma 2. First, recall that the Rényi divergence [43] between two distributions P, Q with parameter β ≥ 2 is defined as Div_β(P‖Q) = (1/(β − 1)) log E_{x∼Q}[(P(x)/Q(x))^β]. By standard results on the Rényi divergence between multivariate Gaussian distributions, such as [40], we obtain the stated bound, where we assume without loss of generality that D and D′ differ only in sample j, and the last inequality follows from the Lipschitzness of h.
Proof of Lemma 3. The proof combines Lemma 2 with the subsampling theorem ([52, Theorem 9]) for Rényi differential privacy. From Lemma 2 and the definition of Rényi differential privacy, the mechanism R applied to D_sub satisfies an RDP guarantee of order λ + 1; the subsampling theorem [52, Theorem 9] then yields the stated bound.
By choosing the parameters σ², b and ensuring λ is small, it is possible to show the claim as in [51, Lemma 3.7], where we use the fact that the sensitivity without subsampling is bounded by ∆.

A.2 Proofs from Section 4
Proof of Theorem 1. Let R_t(D) = w_{t+1} = Exp_{w_t}(−η_t R̃_t(D)), where R̃_t(D) = ζ_t = (1/b) Σ_{z∈B} gradf(w_t; z) + ε_t. The final output R(D) is a composition of R_T, . . ., R_1. First, from the post-processing lemma [22, Proposition 7.1], the differential privacy of R_t is equivalent to that of R̃_t. To show privacy guarantees of R̃_t, we consider gradient descent and stochastic gradient descent separately.
Under the setting of gradient descent, i.e., b = n, we bound K_{R̃_t}(λ) by Lemma 2. Then, from the composability theorem [1, Theorem 2.1], the moments accumulate additively over the T iterations. Similar to [1,50], for some constant c_1 the stated choice of σ² guarantees the required conditions, and hence the algorithm satisfies (ε, δ)-differential privacy.
Under the setting of stochastic gradient descent, i.e., b < n, we can bound K_{R̃_t}(λ) as in Lemma 3. After T iterations, the cumulative bound holds under sufficiently large σ². By similar reasoning as in the full gradient case, under the stated choice of parameters, the algorithm is (ε, δ)-differentially private.
Proof of Lemma 4. First, using the unbiasedness of gradf(w; z), we have E[ζ] = gradF(w). Consider the vectorization and express the tangent space Gaussian distribution as a multivariate Gaussian. The cross term vanishes since E[⟨gradf(w; z), ε⟩_w] = 0, as the noise is independent of the sample.
Proof of Theorem 2. The proof is adapted from that of [55, Theorem 10]. Since each f(w; z) is geodesic convex, F(w) = (1/n) Σ_{i=1}^n f(w; z_i) is geodesic convex. From the first-order characterization of geodesic convex functions [13], we have F(w_t) − F(w*) ≤ −⟨gradF(w_t), Exp_{w_t}^{-1}(w*)⟩_{w_t}. Further, applying Lemma 1 to the geodesic triangle formed by w_t, w_{t+1}, w* yields a distance recursion, where we use Lemma 4 and note that Exp_{w_t}^{-1}(w_{t+1}) = −η_tζ_t. The expectation is over both the randomness of the noise and the subsampling (if b < n) at iteration t. Combining the recursion with the first-order characterization, telescoping from t = 0, . . ., T − 1, and choosing η_t appropriately yields the claim, where we take expectation over all iterations. Finally, it follows from the properties of geodesic averaging that F(w_priv) ≤ (1/T) Σ_{t=0}^{T−1} F(w_t).
Proof of Theorem 3. The proof adapts that of [55, Theorem 12]. Applying the one-step result recursively with η_t = η < min{1/L_1, 1/τ} and taking full expectation yields the claim, where the last inequality follows from the limit of a geometric series and η < min{1/L_1, 1/τ}. For the Riemannian PL case, let η_t = η < min{1/L_1, 1/τ} and apply the result recursively; the bound then scales with F(w_0) − F(w*) under the same choice of T, matching the bound in the full gradient setting. For the general nonconvex case, we choose η_t = η = 1/L_1, telescope the descent inequality for t = 0, . . ., T − 1, and take expectations; following the same argument and choices of σ² and T, we achieve the same bound as in the full gradient case.
Riemannian geometry. A Riemannian manifold M of dimension d is a smooth manifold equipped with an inner product structure ⟨•, •⟩_w (i.e., a Riemannian metric) on every tangent space T_wM. Given an orthonormal basis (∂_1, . . ., ∂_d) of T_wM, the metric can be expressed as a (symmetric positive definite) matrix G_w, and the inner product can be written as ⟨ξ, ζ⟩_w = ξ̄ᵀG_wζ̄, where ξ̄, ζ̄ ∈ R^d are the vectorizations of the tangent vectors ξ, ζ ∈ T_wM in the normal coordinate system. The induced norm is ‖ξ‖_w = √⟨ξ, ξ⟩_w for any ξ ∈ T_wM. A geodesic γ : [0, 1] → M is a locally distance-minimizing curve on the manifold with zero acceleration.

Figure 1: Expected empirical excess risk against sample size on leading eigenvector computation and Fréchet mean computation, both averaged over 20 runs. We observe improved utility of the proposed DP-RGD on both problems.

Table 1: Utility guarantees of the proposed (ε, δ)-differentially private Riemannian (stochastic) gradient descent under different function classes. L_0, L_1 are the geodesic Lipschitz and smoothness constants of f, and β, τ are the constants of geodesic strong convexity and the Riemannian PL condition of F, respectively. n is the size of the dataset, d is the intrinsic dimension of the manifold, and ς is the curvature constant of the domain (defined in Lemma 1). All utility bounds are measured in the expected empirical excess risk E[F(w_priv)] − F(w*), where w* is a global minimizer, except for the general nonconvex case, where the bound is measured in E‖gradF(w_priv)‖²_{w_priv}. The bounds hide dependence on c_l (lower bound on the metric tensor) and D_W (diameter bound of the domain). Please also refer to Section 4.2.