Abstract
Let \(\mu _N\) be the empirical measure associated to an \(N\)-sample of a given probability distribution \(\mu \) on \(\mathbb {R}^d\). We are interested in the rate of convergence of \(\mu _N\) to \(\mu \), when measured in the Wasserstein distance of order \(p>0\). We provide some satisfying non-asymptotic \(L^p\)-bounds and concentration inequalities, for any values of \(p>0\) and \(d\ge 1\). We also extend the non-asymptotic \(L^p\)-bounds to stationary \(\rho \)-mixing sequences, Markov chains, and to some interacting particle systems.
1 Introduction and results
1.1 Notation
Let \(d\ge 1\) and \({\mathcal P}({{\mathbb {R}}^d})\) stand for the set of all probability measures on \({{\mathbb {R}}^d}\). For \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\), we consider an i.i.d. sequence \((X_k)_{k\ge 1}\) of \(\mu \)-distributed random variables and, for \(N \ge 1\), the empirical measure
\[\mu _N:=\frac{1}{N}\sum _{k=1}^N \delta _{X_k}.\]
As is well known, by the Glivenko–Cantelli theorem, \(\mu _N\) tends weakly to \(\mu \) as \(N\rightarrow \infty \) (for example in probability, see van der Vaart and Wellner [40] for details and various modes of convergence). The aim of the paper is to quantify this convergence, when the error is measured in some Wasserstein distance. Let us set, for \(p>0\) and \(\mu ,\nu \) in \({\mathcal P}({{\mathbb {R}}^d})\),
\[{\mathcal T}_p(\mu ,\nu ):=\inf \left\{ \int _{{{\mathbb {R}}^d}\times {{\mathbb {R}}^d}} |x-y|^p \, \pi (dx,dy) \;:\; \pi \in {\mathcal H}(\mu ,\nu )\right\},\]
where \({\mathcal H}(\mu ,\nu )\) is the set of all probability measures on \({{\mathbb {R}}^d}\times {{\mathbb {R}}^d}\) with marginals \(\mu \) and \(\nu \). See Villani [41] for a detailed study of \({\mathcal T}_p\). The Wasserstein distance \({\mathcal W}_p\) on \({\mathcal P}({{\mathbb {R}}^d})\) is defined by \({\mathcal W}_p(\mu ,\nu )={\mathcal T}_p(\mu ,\nu )\) if \(p\in (0,1]\) and \({\mathcal W}_p(\mu ,\nu )=({\mathcal T}_p(\mu ,\nu ))^{1/p}\) if \(p>1\).
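As a concrete illustration (ours, not from the paper): in dimension \(d=1\) and for \(p\ge 1\), an optimal coupling between two empirical measures with the same number of atoms matches order statistics, so \({\mathcal T}_p\) can be computed by sorting. The helper below is a minimal sketch under this classical reduction.

```python
import numpy as np

def T_p(x, y, p):
    """T_p between the empirical measures of two same-size samples.

    In dimension d = 1 and for p >= 1, an optimal coupling pairs order
    statistics (a classical fact), so T_p is an average of p-th powers.
    """
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean(np.abs(xs - ys) ** p))

# W_p is then T_p itself for p in (0,1] and T_p^(1/p) for p > 1.
print(T_p([0.0, 2.0], [1.0, 3.0], p=1))  # pairs 0<->1 and 2<->3, giving 1.0
```

For \(d\ge 2\) no such shortcut exists, and computing \({\mathcal T}_p\) exactly requires solving a transportation problem.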
The present paper studies the rate of convergence to zero of \({\mathcal T}_p(\mu _N,\mu )\). This can be done in an asymptotic way, finding e.g. a sequence \(\alpha (N)\rightarrow 0\) such that \(\limsup _N \alpha (N)^{-1}{\mathcal T}_p(\mu _N,\mu ) <\infty \) a.s. or \(\limsup _N\alpha (N)^{-1} \mathbb {E}({\mathcal T}_p(\mu _N,\mu ))<\infty \). Here we will rather derive some non-asymptotic moment estimates such as
as well as some non-asymptotic concentration estimates (also often called deviation inequalities)
They are naturally related to moment (or exponential moment) conditions on the law \(\mu \), and we hope to derive an interesting interplay between the dimension \(d\ge 1\), the cost parameter \(p>0\) and these moment conditions. Let us introduce these moment conditions precisely. For \(q>0\), \(\alpha >0\), \(\gamma >0\) and \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\), we define
\[M_q(\mu ):=\int _{{{\mathbb {R}}^d}} |x|^q \,\mu (dx) \qquad \text {and}\qquad {\mathcal E}_{\alpha ,\gamma }(\mu ):=\int _{{{\mathbb {R}}^d}} e^{\gamma |x|^\alpha }\,\mu (dx).\]
We now present our main estimates; the comparison with existing results and methods will be developed after this presentation. Let us however mention at once that our paper relies on some recent ideas of Dereich et al. [16].
1.2 Moment estimates
We first give some \(L^p\) bounds.
Theorem 1
Let \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\) and let \(p>0\). Assume that \(M_q(\mu )<\infty \) for some \(q>p\). There exists a constant \(C\) depending only on \(p,d,q\) such that, for all \(N\ge 1\),
\[\mathbb {E}\big ({\mathcal T}_p(\mu _N,\mu )\big ) \le C M_q^{p/q}(\mu )\times {\left\{ \begin{array}{ll} N^{-1/2}+N^{-(q-p)/q} &{}\quad \text {if } p>d/2 \text { and } q\ne 2p,\\ N^{-1/2}\log (1+N)+N^{-(q-p)/q} &{}\quad \text {if } p=d/2 \text { and } q\ne 2p,\\ N^{-p/d}+N^{-(q-p)/q} &{}\quad \text {if } p\in (0,d/2) \text { and } q\ne dp/(d-p). \end{array}\right. }\]
Observe that when \(\mu \) has sufficiently many moments (namely if \(q>2p\) when \(p\ge d/2\) and \(q> dp/(d-p)\) when \(p\in (0,d/2)\)), the term \(N^{-(q-p)/q}\) is small and can be removed. We could easily treat, for example, the case \(p>d/2\) and \(q=2p\), but this would lead to some logarithmic terms, and the paper is technical enough.
This generalizes [16], in which only the case \(p\in [1,d/2)\) (whence \(d\ge 3\)) and \(q>dp/(d-p)\) was treated. The argument is also slightly simplified.
To show that Theorem 1 is really sharp, let us give examples where lower bounds can be derived quite precisely.
(a) If \(a\ne b \in {{\mathbb {R}}^d}\) and \(\mu =(\delta _a+\delta _b)/2\), one easily checks (see e.g. [16, Remark 1]) that \(\mathbb {E}({\mathcal T}_p(\mu _N,\mu )) \ge c N^{-1/2}\) for all \(p\ge 1\). Indeed, we have \(\mu _N=Z_N\delta _a+(1-Z_N)\delta _b\) with \(Z_N=N^{-1}\sum _1^N {\mathbf{1}}_{\{X_i=a\}}\), so that \({\mathcal T}_p(\mu _N,\mu )=|a-b|^p\,|Z_N-1/2|\), of which the expectation is of order \(N^{-1/2}\).
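This computation is easy to check by simulation; the sketch below (a Monte Carlo illustration, with \(|a-b|=1\)) relies on the exact identity \({\mathcal T}_p(\mu _N,\mu )=|a-b|^p\,|Z_N-1/2|\), so the observed rate does not depend on \(p\).

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_T_p(N, reps=4000):
    # Z_N = fraction of the N sample points equal to the atom a; with
    # |a - b| = 1, T_p(mu_N, mu) = |Z_N - 1/2| exactly, for every p.
    Z = rng.binomial(N, 0.5, size=reps) / N
    return float(np.mean(np.abs(Z - 0.5)))

for N in (100, 400, 1600):
    # sqrt(N) * E(T_p) should stabilize near 1/sqrt(2*pi) ~ 0.40 by the CLT
    print(N, round(N ** 0.5 * mean_T_p(N), 3))
```

The rescaled means stay near a constant, confirming the \(N^{-1/2}\) lower-bound rate.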
(b) Such a lower bound in \(N^{-1/2}\) can easily be extended to any \(\mu \) (possibly very smooth) of which the support is of the form \(A\cup B\) with \(d(A,B)>0\) (simply note that \({\mathcal T}_p(\mu _N,\mu )\ge d^p(A,B)\,|Z_N-\mu (A)|\), where \(Z_N=N^{-1}\sum _1^N {\mathbf{1}}_{\{X_i \in A\}}\)).
(c) If \(\mu \) is the uniform distribution on \([-1,1]^d\), it is well known and not difficult to prove that for \(p>0\), \(\mathbb {E}({\mathcal T}_p(\mu _N,\mu )) \ge c N^{-p/d}\). Indeed, consider a partition of \([-1,1]^d\) into (roughly) \(N\) cubes with side length \(N^{-1/d}\). A quick computation shows that with probability greater than some \(c>0\) (uniformly in \(N\)), half of these cubes will not be charged by \(\mu _N\). But on this event, we clearly have \({\mathcal T}_p(\mu _N,\mu )\ge a N^{-p/d}\) for some \(a>0\), because each time a cube is not charged by \(\mu _N\), a (fixed) proportion of the mass of \(\mu \) (in this cube) is at distance at least \(N^{-1/d}/2\) from the support of \(\mu _N\). One easily concludes.
(d) When \(p=d/2=1\), it has been shown by Ajtai et al. [2] that for \(\mu \) the uniform measure on \([-1,1]^d\), \({\mathcal T}_1(\mu _N,\mu ) \simeq c (\log N / N)^{1/2}\) with high probability, implying that \(\mathbb {E}({\mathcal T}_1(\mu _N,\mu )) \ge c (\log N / N)^{1/2}\).
(e) Let \(\mu (dx)= c\, |x|^{-q-d}{\mathbf{1}}_{\{|x|\ge 1\}}dx\) for some \(q>0\). Then \(M_r(\mu )<\infty \) for all \(r\in (0,q)\) and for all \(p\ge 1\), \(\mathbb {E}({\mathcal T}_p(\mu _N,\mu )) \ge c N^{-(q-p)/q}\). Indeed, \(\mathbb {P}(\mu _N(\{|x|\ge N^{1/q}\})=0)= (\mu (\{|x|< N^{1/q}\}))^N =(1-c/N)^N \ge c>0\) and \(\mu (\{|x|\ge 2 N^{1/q}\})\ge c/N\). One easily gets convinced that \({\mathcal T}_p(\mu _N,\mu )\ge N^{p/q}{\mathbf{1}}_{\{\mu _N(\{|x|\ge N^{1/q}\})=0\}}\, \mu (\{|x|\ge 2 N^{1/q}\})\), from which the claim follows.
As far as general laws are concerned, Theorem 1 is really sharp: the only possible improvements are the following. The first one, quite interesting, would be to replace \(\log (1+N)\) by something like \(\sqrt{\log (1+N)}\) when \(p=d/2\) (see point (d) above). It is however not clear whether this is feasible in full generality. The second one, which should be a mere (and not very interesting) refinement, would be to sharpen the bound in \(N^{-(q-p)/q}\) when \(M_q(\mu )<\infty \): point (e) only shows that there is some \(\mu \) with \(M_q(\mu )<\infty \) for which we have a lower bound in \(N^{-(q-p)/q - {\varepsilon }}\) for all \({\varepsilon }>0\).
However, some improvements are possible when restricting the class of laws \(\mu \). First, when \(\mu \) is the uniform distribution on \([-1,1]^d\), the results of Talagrand [38, 39] strongly suggest that when \(d\ge 3\), \(\mathbb {E}({\mathcal T}_p(\mu _N,\mu )) \simeq N^{-p/d}\) for all \(p>0\), and this is much better than \(N^{-1/2}\) when \(p\) is large. Such a result would of course immediately extend to any distribution \(\mu =\lambda \circ F^{-1}\), for \(\lambda \) the uniform distribution on \([-1,1]^d\) and \(F:[-1,1]^d\rightarrow {{\mathbb {R}}^d}\) Lipschitz continuous. In any case, a smoothness assumption on \(\mu \) cannot be sufficient, see point (b) above.
Second, for irregular laws, the convergence can be much faster than \(N^{-p/d}\) when \(p<d/2\), see point (a) above where, in an extreme case, we get \(N^{-1/2}\) for all values of \(p>0\). It is shown by Dereich et al. [16] (see also Barthe and Bordenave [3]) that indeed, for a singular law, \(\lim _N N^{p/d}\,\mathbb {E}({\mathcal T}_p(\mu _N,\mu ))=0\).
1.3 Concentration inequalities
We next state some concentration inequalities.
Theorem 2
Let \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\) and let \(p>0\). Assume one of the three following conditions:
(1) \({\mathcal E}_{\alpha ,\gamma }(\mu )<\infty \) for some \(\alpha >p\) and some \(\gamma >0\);
(2) \({\mathcal E}_{\alpha ,\gamma }(\mu )<\infty \) for some \(\alpha \in (0,p)\) and some \(\gamma >0\);
(3) \(M_q(\mu )<\infty \) for some \(q>2p\).
Then for all \(N\ge 1\), all \(x\in (0,\infty )\),
where
and
The positive constants \(C\) and \(c\) depend only on \(p,d\) and either on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu )\) (under (1)) or on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu ),{\varepsilon }\) (under (2)) or on \(q,M_q(\mu ),{\varepsilon }\) (under (3)).
We could also treat the critical case where \({\mathcal E}_{\alpha ,\gamma }(\mu )<\infty \) with \(\alpha =p\), but the result we could obtain is slightly more intricate and not very satisfying for small values of \(x\) (even if good for large ones).
Remark 3
When assuming (2) with \(\alpha \in (0,p)\), we actually also prove that
with \(\delta =2p/\alpha -1\), see Step 5 of the proof of Lemma 13 below. This allows us to extend the inequality \(b(N,x)\le C\exp (-c (Nx)^{\alpha /p})\) to all values of \(x\ge x_N\), for some (rather small) \(x_N\) depending on \(N,\alpha ,p\). But for very small values of \(x>0\), this formula is less interesting than that of Theorem 2. Despite much effort, we have not been able to get rid of the logarithmic term.
We believe that these estimates are quite satisfying. To get convinced, first observe that the scales seem to be the right ones. Recall that \(\mathbb {E}({\mathcal T}_p(\mu _N,\mu ))=\int _0^\infty \mathbb {P}( {\mathcal T}_p(\mu _N,\mu )\ge x)\,dx\).
(a) One easily checks that \(\int _0^\infty a(N,x)\,dx \le CN^{-p/d}\) if \(p<d/2\), \(CN^{-1/2}\log (1+N)\) if \(p=d/2\), and \(CN^{-1/2}\) if \(p>d/2\), as in Theorem 1.
(b) When integrating \(b(N,x)\) (or rather \(b(N,x)\wedge 1\)), we find \(N^{-(q-{\varepsilon }-p)/(q-{\varepsilon })}\) under (3) and something smaller under (1) or (2). Since we can take \(q-{\varepsilon }>2p\), this is smaller than \(N^{-1/2}\) (and thus also smaller than \(N^{-p/d}\) if \(p<d/2\) and than \(N^{-1/2}\log (1+N)\) if \(p=d/2\)).
The rates of decrease are also satisfying in most cases. Recall that in deviation estimates, we never get something better than \(\exp (-Ng(x))\) for some function \(g\). Hence \(a(N,x)\) is probably optimal. Next, for \(\bar{Y}_N\) the empirical mean of a family of centered i.i.d. random variables, it is well known that the good deviation inequalities are the following.
(a) If \(\mathbb {E}[\exp (a|Y_1|^\beta )]<\infty \) with \(\beta \ge 1\), then \(\Pr [|\bar{Y}_N|\ge x ] \le Ce^{-cNx^2}{\mathbf{1}}_{\{x\le 1\}} + Ce^{-cNx^{\beta }}{\mathbf{1}}_{\{x> 1\}}\), see for example Djellout et al. [18], Gozlan [24] or Ledoux [27], using transportation cost inequalities.
(b) If \(\mathbb {E}[\exp (a|Y_1|^\beta )]<\infty \) with \(\beta <1\), then \(\Pr [|\bar{Y}_N|\ge x ] \le Ce^{-cNx^2} + Ce^{-c(Nx)^{\beta }}\), see Merlevède et al. [31, Formula (1.4)], which is based on results by Borovkov [8].
(c) If \(\mathbb {E}[|Y_1|^r]<\infty \) for some \(r>2\), then \(\Pr [|\bar{Y}_N|\ge x ] \le Ce^{-cNx^2} + CN (Nx)^{-r}\), see Fuk and Nagaev [23], using usual truncation arguments.
Our result is in perfect adequacy with these facts (up to some arbitrarily small loss due to \({\varepsilon }\) under (2) and (3)), since \({\mathcal T}_p(\mu _N,\mu )\) should behave very roughly as the mean of the \(|X_i|^p\)'s, which e.g. has an exponential moment with power \(\beta :=\alpha /p\) under (1) and (2).
1.4 Comments
The control of the distance between the empirical measure of an i.i.d. sample and its true distribution is of course a long-standing problem, central in probability, statistics and computer science, with a wide range of applications: quantization (see Delattre et al. [14] and Pagès and Wilbertz [33] for recent results), optimal matching (see Ajtai et al. [2], Dobrić and Yukich [19], Talagrand [39], Barthe and Bordenave [3]), density estimation, clustering (see Biau et al. [5] and Laloë [26]), MCMC methods (see [36] for bounds on ergodic averages), particle systems and approximations of partial differential equations (see Bolley et al. [11] and Fournier and Mischler [22]). We refer to these papers for an extensive introduction to this vast topic.
While many distances can be used to consider the problem, the Wasserstein distance is quite natural, in particular in quantization or for particle approximations of P.D.E.'s. However, the depth of the problem was discovered only recently by Ajtai et al. [2], who considered the uniform measure on the square, later investigated thoroughly by Talagrand [39]. As a review of the literature is somewhat impossible, let us just say that the techniques used so far rely on the two definitions of the Wasserstein distance: the construction of an explicit coupling, or duality, i.e. the control of a particular empirical process.
Concerning moment estimates (as in Theorem 1), some results can be found in Horowitz and Karandikar [25], Rachev and Rüschendorf [35] and Mischler and Mouhot [32]. But these results are far from optimal, even when assuming that \(\mu \) is compactly supported. Very recently, strikingly clever alternatives were considered by Boissard and Le Gouic [7] and by Dereich et al. [16]. Unfortunately, the construction of Boissard and Le Gouic, based on iterative trees, was a little too complicated to yield sharp rates. On the contrary, the method of [16], explained in detail in the next section, is extremely simple, robust, and leads to the almost optimal results presented here. Some sharp moment estimates were already obtained in [16] for a limited range of parameters.
Concerning concentration estimates, only a few results are available. Let us mention the works of Bolley et al. [11] and, very recently, of Boissard [6], which we considerably improve. Our assumptions are often much weaker (the reference measure \(\mu \) was often assumed to satisfy some functional inequalities, which may be difficult to verify and usually encode more “structure” than mere integrability conditions), and \(\Pr [{\mathcal T}_p (\mu _N,\mu )\ge x]\) was estimated only for rather large values of \(x\). In particular, when integrating the concentration estimates of [11], one never recovers the good moment estimates, meaning that the scales are not the right ones.
Moreover, the approach of [16] is robust enough that we can also give some good moment bounds for the Wasserstein distance between the empirical measure of a Markov chain and its invariant distribution (under some conditions). This could be useful for MCMC methods because our results are non-asymptotic. We can also study very easily some \(\rho \)-mixing sequences (see Doukhan [20]), for which only very few results exist, see Biau et al. [7]. Finally, we show on an example how to use Theorem 1 to study some particle systems. For all these problems, we might also obtain some concentration inequalities, but this would need further refinements which are out of the scope of the present paper, already technical enough, and are left for future works.
1.5 Plan of the paper
In the next section, we state some general upper bounds on \({\mathcal T}_p(\mu ,\nu )\), for any \(\mu ,\nu \in {\mathcal P}({{\mathbb {R}}^d})\), essentially taken from [16]. Section 3 is devoted to the proof of Theorem 1. Theorem 2 is proved in three steps: in Sect. 4 we study the case where \(\mu \) is compactly supported and where \(N\) is replaced by a Poisson(\(N\))-distributed random variable, which yields some pleasant independence properties. We show how to remove this randomization in Sect. 5, concluding the case where \(\mu \) is compactly supported. The non-compact case is studied in Sect. 6. The final Sect. 7 is devoted to dependent random variables: \(\rho \)-mixing sequences, Markov chains and a particular particle system.
2 Coupling
The following notion of distance, essentially taken from [16], is the main ingredient of the paper.
Notation 4
(a) For \(\ell \ge 0\), we denote by \({\mathcal P}_\ell \) the natural partition of \((-1,1]^d\) into \(2^{d\ell }\) translations of \((-2^{-\ell },2^{-\ell }]^d\). For two probability measures \(\mu ,\nu \) on \((-1,1]^d\) and for \(p>0\), we introduce
which obviously defines a distance on \({\mathcal P}((-1,1]^d)\), always bounded by \(1\).
(b) We introduce \(B_0:=(-1,1]^d\) and, for \(n\ge 1\), \(B_n:=(-2^n,2^n]^d{\setminus }(-2^{n-1},2^{n-1}]^d\). For \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\) and \(n\ge 0\), we denote by \({\mathcal R}_{B_n} \mu \) the probability measure on \((-1,1]^d\) defined as the image of \(\mu \vert _{B_n}/\mu (B_n)\) by the map \(x \mapsto x/2^n\). For two probability measures \(\mu ,\nu \) on \({{\mathbb {R}}^d}\) and for \(p>0\), we introduce
A little study, using that \({\mathcal D}_p \le 1\) on \({\mathcal P}((-1,1]^d)\), shows that this defines a distance on \({\mathcal P}({{\mathbb {R}}^d})\).
Having a look at \({\mathcal D}_p\) in the compact case, one sees that, in some sense, it measures the distance between the two probability measures simultaneously at all scales. The optimization is carried out over all scales at once, which outperforms the approach based on a covering of the state space at a fixed diameter (more or less the approach of Horowitz and Karandikar [25]). Moreover, one sees that the principal control is on \(|\mu (F)-\nu (F)|\), which is a quite simple quantity. The next results are slightly modified versions of estimates found in [16], see [16, Lemma 2] for the compact case and [16, proof of Theorem 3] for the non-compact case. They contain the crucial observation that \({\mathcal D}_p\) is an upper bound (up to a constant) for the Wasserstein distance.
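To make the multiscale idea concrete, here is a small numerical sketch (ours, not from [16]): it counts mass differences over the partitions \({\mathcal P}_\ell \) of \((-1,1]^d\) and forms a truncated weighted sum. The weights \(2^{-p\ell }\) and the absence of any normalizing constant are illustrative assumptions; the paper's exact \({\mathcal D}_p\) is not reproduced here.

```python
import numpy as np
from collections import Counter

def scale_discrepancy(x, y, ell):
    """Sum over F in P_ell of |mu_x(F) - mu_y(F)| for samples in (-1,1]^d.

    P_ell splits (-1,1]^d into 2**(d*ell) cubes of side 2**(1-ell); each
    point is mapped to the integer multi-index of its cube (boundary
    points, a probability-zero event for continuous laws, are ignored).
    """
    h = 2.0 ** (1 - ell)
    def counts(z):
        idx = np.clip(np.floor((z + 1.0) / h).astype(int), 0, 2 ** ell - 1)
        return Counter(map(tuple, idx))
    cx, cy = counts(x), counts(y)
    return sum(abs(cx[k] / len(x) - cy[k] / len(y)) for k in set(cx) | set(cy))

def multiscale_distance(x, y, p, L=8):
    # Truncated sum of 2**(-p*ell) times the scale-ell discrepancy.
    return sum(2.0 ** (-p * ell) * scale_discrepancy(x, y, ell)
               for ell in range(L + 1))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = rng.uniform(-1.0, 1.0, size=(200, 2))
print(multiscale_distance(x, y, p=1.0))  # small, and shrinks as samples grow
```

The quantity vanishes for identical samples and decreases as the sample sizes grow, as the moment estimates below predict.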
Lemma 5
Let \(d\ge 1\) and \(p>0\). For all pairs of probability measures \(\mu ,\nu \) on \({{\mathbb {R}}^d}\), \({\mathcal T}_p(\mu ,\nu )\le \kappa _{p,d} {\mathcal D}_p(\mu ,\nu )\), with \(\kappa _{p,d}:= 2^p d^{p/2}(2^p+1)/(2^p-1)\).
Proof
We separate the proof into two steps.
Step 1. We first assume that \(\mu \) and \(\nu \) are supported in \((-1,1]^d\). We infer from [16, Lemma 2], in which the conditions \(p\ge 1\) and \(d\ge 3\) are clearly not used, that, since the diameter of \((-1,1]^d\) is \(2\sqrt{d}\),
where “\(C\) child of \(F\)” means that \(C\in {\mathcal P}_{\ell +1}\) and \(C \subset F\). Consequently,
which is nothing but \(\kappa _{p,d}{\mathcal D}_p(\mu ,\nu )\). We used that \(\sum _{F \in {\mathcal P}_0} |\mu (F)-\nu (F)|=0\).
Dereich et al. [16] directly use the formula with the children to study the rate of convergence of empirical measures. This leads to some (small) technical complications, and does not seem to improve the estimates.
Step 2. We next consider the general case. We consider, for each \(n\ge 1\), the optimal coupling \(\pi _n(dx,dy)\) between \({\mathcal R}_{B_n}\mu \) and \({\mathcal R}_{B_n}\nu \) for \({\mathcal T}_p\). We define \(\xi _n(dx,dy)\) as the image of \(\pi _n\) by the map \((x,y)\mapsto (2^nx,2^ny)\), which clearly belongs to \({\mathcal H}(\mu \vert _{B_n}/\mu (B_n),\nu \vert _{B_n}/\nu (B_n))\) and satisfies \(\int \!\!\int |x-y|^p \,\xi _n(dx,dy) = 2^{np} \int \!\!\int |x-y|^p \,\pi _n(dx,dy) =2^{np} {\mathcal T}_p({\mathcal R}_{B_n}\mu ,{\mathcal R}_{B_n}\nu )\).
Next, we introduce \(q:=\frac{1}{2} \sum _{n\ge 0} |\nu (B_n)-\mu (B_n)|\) and we define
where
Using that
it is easily checked that \(\xi \in {\mathcal H}(\mu ,\nu )\). Furthermore, we have, setting \(c_p=1\) if \(p\in (0,1]\) and \(c_p=2^{p-1}\) if \(p>1\),
Recalling that \(\int \!\!\int |x-y|^p \,\xi _n(dx,dy) \le 2^{np} {\mathcal T}_p({\mathcal R}_{B_n}\mu ,{\mathcal R}_{B_n}\nu )\), we deduce that
We conclude using Step 1 and that \(c_p\le \kappa _{p,d}\). \(\square \)
When proving the concentration inequalities, which is very technical, it will be convenient to break the proof into several steps to separate the difficulties; we will first treat the compact case. On the contrary, when dealing with moment estimates, the following formula will be easier to work with.
Lemma 6
Let \(p>0\) and \(d\ge 1\). For all \(\mu ,\nu \in {\mathcal P}({{\mathbb {R}}^d})\),
with the notation \(2^n F = \{2^n x\;:\; x\in F\}\) and where \(C_p=1+2^{-p}/(1-2^{-p})\).
Proof
For all \(n\ge 1\), we have \(|\mu (B_n)-\nu (B_n)|\le \sum _{F\in {\mathcal P}_0}\left| \mu (2^nF \cap B_n)-\nu (2^n F \cap B_n)\right| \) and
This last term is smaller than \(2^{-p} \left| \mu (B_n)-\nu (B_n)\right| /(1-2^{-p})\), and this ends the proof. \(\square \)
3 Moment estimates
The aim of this section is to give the
Proof of Theorem 1
We thus assume that \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\) and that \(M_q(\mu )<\infty \) for some \(q>p\). By a scaling argument, we may assume that \(M_q(\mu )=1\). This implies that \(\mu (B_n)\le 2^{-q(n-1)}\) for all \(n\ge 0\). By Lemma 5, we have \({\mathcal T}_p(\mu _N,\mu )\le \kappa _{p,d} {\mathcal D}_p(\mu _N,\mu )\), so that it suffices to study \(\mathbb {E}({\mathcal D}_p(\mu _N,\mu ))\). In the whole proof, the positive constant \(C\), whose value may change from line to line, depends only on \(p,d,q\).
For a Borel subset \(A\subset {{\mathbb {R}}^d}\), since \(N\mu _N(A)\) is Binomial\((N,\mu (A))\)-distributed, we have
\[\mathbb {E}\,|\mu _N(A)-\mu (A)| \le \min \big \{2\mu (A),\,(\mu (A)/N)^{1/2}\big \}.\]
Using the Cauchy–Schwarz inequality and that \(\#({\mathcal P}_\ell )=2^{d\ell }\), we deduce that for all \(n\ge 0\), all \(\ell \ge 0\),
Using finally Lemma 6 and that \(\mu (B_n)\le 2^{-q(n-1)}\), we find
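The binomial bound at the start of this derivation, \(\mathbb {E}\,|\mu _N(A)-\mu (A)|\le \min \{2\mu (A),\sqrt{\mu (A)/N}\}\) (the square-root part coming from the variance via Cauchy–Schwarz), can be checked exactly for small \(N\); a quick sketch:

```python
from math import comb, sqrt

def mean_abs_dev(N, q):
    # Exact E|X/N - q| for X ~ Binomial(N, q), computed from the pmf.
    return sum(comb(N, k) * q ** k * (1.0 - q) ** (N - k) * abs(k / N - q)
               for k in range(N + 1))

# the min of the two bounds: 2q dominates for tiny cells, sqrt(q/N) otherwise
for N, q in [(50, 0.3), (200, 0.01), (10, 0.001)]:
    assert mean_abs_dev(N, q) <= min(2 * q, sqrt(q / N))
print("binomial L1 bound verified")
```

The two regimes of the minimum are exactly what Step 1 below exploits when summing over the cells of \({\mathcal P}_\ell \).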
Step 1. Here we show that for all \({\varepsilon }\in (0,1)\), all \(N\ge 1\),
First of all, the bound by \(C{\varepsilon }\) is obvious in all cases (because \(p>0\)). Next, the case \(p>d/2\) is immediate. If \(p\le d/2\), we introduce \(\ell _{N,{\varepsilon }}:= \lfloor \log (2+{\varepsilon }N)/(d\log 2)\rfloor \), for which \(2^{d \ell _{N,{\varepsilon }}} \simeq 2+{\varepsilon }N\) and get an upper bound in
If \(p=d/2\), we find an upper bound in
as desired. If \(p\in (0,d/2)\), we get an upper bound in
If \({\varepsilon }N \ge 1\), then \((2+{\varepsilon }N)^{1/2-p/d}\le (3{\varepsilon }N)^{1/2-p/d}\) and the conclusion follows. If now \({\varepsilon }N \in (0,1)\), the result is obvious because \(\min \{{\varepsilon },{\varepsilon }({\varepsilon }N)^{-p/d}\}={\varepsilon }\).
Step 2: \(p>d/2\). By (4) and Step 1 (with \({\varepsilon }=2^{-qn}\)), we find
Indeed, this is obvious if \(q>2p\), while the case \(q\in (p,2p)\) requires to separate the sum in two parts \(n \le n_N\) and \(n>n_N\) with \(n_N=\lfloor \log N /(q\log 2)\rfloor \). This ends the proof when \(p>d/2\).
Step 3: \(p=d/2\). By (4) and Step 1 (with \({\varepsilon }=2^{-qn}\)), we find
If \(q>2p\), we immediately get a bound in
which ends the proof (when \(p=d/2\) and \(q>2p\)).
If \(q\in (p,2p)\), we easily obtain, using that \(\log (2+x)\le 2\log x\) for all \(x\ge 2\), an upper bound in
where \(n_N=\lfloor \log (N/2) / (q \log 2)\rfloor \). A tedious exact computation shows that
Using that the contribution of the middle term of the second line is negative, and the inequality \(\log N - (n_N+1)q\log 2 \le \log 2\) (because \((n_N+1)q\log 2 \ge \log (N/2)\)), we find
We finally have checked that \(\mathbb {E}({\mathcal D}_p(\mu _N,\mu ))\le CN^{-(q-p)/q}+CN^{-1/2}N^{p/q-1/2} \le C N^{-(q-p)/q} \), which ends the proof when \(p=d/2\).
Step 4: \(p\in (0,d/2)\). We then have, by (4) and Step 1,
If \(q>dp/(d-p)\), which implies that \(q(1-p/d)>p\), we immediately get an upper bound by \(C N^{-p/d}\), which ends the proof when \(p<d/2\) and \(q>dp/(d-p)\).
If finally \(q\in (p,dp/(d-p))\), we separate the sum in two parts \(n \le n_N\) and \(n>n_N\), with \(n_N=\lfloor \log N /(q\log 2)\rfloor \), and we find a bound in \(C N^{-(q-p)/q}\) as desired. \(\square \)
4 Concentration inequalities in the compact Poissonized case
It is technically advantageous to first consider the case where the sample size is Poisson distributed, which provides some independence properties. Replacing \(N\) (large) by a Poisson(\(N\))-distributed random variable should be harmless, because a Poisson(\(N\))-distributed random variable is close to \(N\) with high probability.
Notation 7
We introduce the functions \(f\) and \(g\) defined on \((0,\infty )\) by
\[f(x):=(1+x)\log (1+x)-x \qquad \text {and}\qquad g(x):=x\log x-x+1,\]
so that \(g(1+x)=f(x)\).
Observe that \(f\) is increasing, nonnegative, of order \(x^2\) near \(0\) and of order \(x\log x\) at infinity. The function \(g\) is positive and increasing on \((1,\infty )\).
The goal of this section is to check the following.
Proposition 8
Assume that \(\mu \) is supported in \((-1,1]^d\). Let \(\Pi _N\) be a Poisson measure on \({{\mathbb {R}}^d}\) with intensity measure \(N\mu \) and introduce the associated empirical measure \(\Psi _N=(\Pi _N({{\mathbb {R}}^d}))^{-1}\Pi _N\). Let \(p\ge 1\) and \(d\ge 1\). There are some positive constants \(C,c\) (depending only on \(d,p\)) such that for all \(N\ge 1\), all \(x\in (0,\infty )\),
We start with some easy and wellknown concentration inequalities for the Poisson distribution.
Lemma 9
For \(\lambda >0\) and \(X\) a Poisson\((\lambda )\)-distributed random variable, we have

(a)
\(\mathbb {E}(\exp (\theta X)) = \exp (\lambda (e^\theta -1))\) for all \(\theta \in {\mathbb {R}}\);

(b)
\(\mathbb {E}(\exp (\theta |X-\lambda |)) \le 2 \exp (\lambda (e^\theta -1-\theta ) )\) for all \(\theta >0\);

(c)
\(\mathbb {P}(X>\lambda x) \le \exp (-\lambda g(x))\) for all \(x>1\);

(d)
\(\mathbb {P}(|X-\lambda |>\lambda x) \le 2 \exp (-\lambda f(x))\) for all \(x>0\);

(e)
\(\mathbb {P}(X>\lambda x) \le \lambda \) for all \(x>0\).
Proof
Point (a) is straightforward. For point (b), write \(\mathbb {E}(\exp (\theta |X-\lambda |))\le e^{-\theta \lambda }\mathbb {E}(\exp (\theta X))+ e^{\theta \lambda } \mathbb {E}(\exp (-\theta X))\), use (a) and that \(\lambda (e^{-\theta }-1+\theta )\le \lambda (e^\theta -1-\theta )\). For point (c), write \(\mathbb {P}(X>\lambda x)\le e^{ -\theta \lambda x}\mathbb {E}[\exp (\theta X)]\), use (a) and optimize in \(\theta \). Use the same scheme to deduce (d) from (b). Finally, for \(x>0\), \(\mathbb {P}(X>\lambda x)\le \mathbb {P}(X>0)=1-e^{-\lambda }\le \lambda \). \(\square \)
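These bounds are easy to test numerically. The sketch below assumes the Bennett-type choices \(f(x)=(1+x)\log (1+x)-x\) and \(g(x)=x\log x-x+1\) (which satisfy \(g(1+x)=f(x)\) and match the properties stated in Notation 7) and checks points (c) and (d) against exact Poisson tails:

```python
from math import exp, log, lgamma

def poisson_sf(lam, m, terms=2000):
    # P(X > m) for X ~ Poisson(lam): sum pmf(m+1), pmf(m+2), ...,
    # starting from a log-space evaluation to avoid huge factorials.
    p = exp(-lam + (m + 1) * log(lam) - lgamma(m + 2))
    total = 0.0
    for k in range(m + 1, m + 1 + terms):
        total += p
        p *= lam / (k + 1)
    return total

f = lambda x: (1.0 + x) * log(1.0 + x) - x        # assumed form of f
g = lambda x: x * log(x) - x + 1.0                # assumed form of g

lam = 40.0
for x in (1.2, 1.5, 2.0, 3.0):                    # point (c), x > 1
    assert poisson_sf(lam, int(lam * x)) <= exp(-lam * g(x))
two_sided = poisson_sf(lam, 60) + (1.0 - poisson_sf(lam, 19))
assert two_sided <= 2.0 * exp(-lam * f(0.5))      # point (d) with x = 0.5
print("Lemma 9 (c), (d) verified at lambda = 40")
```

The exact tails sit well below the Chernoff-type bounds, with the gap widening as \(x\) grows.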
We can now give the
Proof of Proposition 8
During the proof, the constants may only depend on \(p\) and \(d\). We fix \(x>0\) for the whole proof. Recalling Notation 4(a), we have
for any choice of \(\ell _0\in {\mathbb {N}}\). We will choose \(\ell _0\) later, depending on the value of \(x\). For any nonnegative family \(r_\ell \) such that \(\sum _1^{\ell _0} r_\ell \le 1\), we thus have
By Lemma 9(c), (d), since \(\Pi _N({{\mathbb {R}}^d})\) is Poisson\((N)\)-distributed, \(\mathbb {P}(\Pi _N({{\mathbb {R}}^d})\ge N(c x 2^{p \ell _0} -1))\le \exp (-N g(c x 2^{p \ell _0} -1))\) and \(\mathbb {P}(|\Pi _N({{\mathbb {R}}^d})-N| \ge c Nx) \le 2 \exp (-Nf(cx))\). Next, using that the family \((\Pi _N(F))_{F\in {\mathcal P}_\ell }\) is independent, with \(\Pi _N(F)\) Poisson\((N\mu (F))\)-distributed, we use Lemma 9(a) and that \(\#({\mathcal P}_\ell )=2^{\ell d}\) to obtain, for any \(\theta >0\),
Hence
Choosing \(\theta = \log (1+c x2^{p\ell } r_\ell )\), we find
We have checked that
At this point, the value of \(c>0\) is not allowed to vary anymore. We introduce another positive constant \(a\), whose value may change from line to line.
Case 1: \(cx>2\). Then we choose \(\ell _0=1\) and \(r_1=1\). We have \(cx2^{p\ell _0}-1 = 2^pcx -1 \ge (2^p-1)cx+1\), whence \(g(c x 2^{p \ell _0} -1)\ge g((2^p-1)cx+1) =f((2^p-1)cx)\). We also have \(\sum _{\ell =1}^{\ell _0}2^{2^{d\ell }} \exp (-Nf(c x 2^{p\ell }r_\ell )) =2^{2^d}\exp (-Nf(2^pc x))\). We finally get \({\varepsilon }(N,x) \le C \exp (-Nf(ax))\), which proves the statement (in the three cases, when \(cx>2\)).
Case 2: \(cx\le 2\). We choose \(\ell _0\) so that \((1+2/(cx))\le 2^{p\ell _0}\le 2^p (1+2/(cx))\), i.e.
This implies that \(c x 2^{p \ell _0} \ge 2 + cx\). Hence \(g(c x 2^{p \ell _0} -1)\ge g(1+cx)=f(cx)\). Furthermore, we have \(c x 2^{p \ell }r_\ell \le c x 2^{p \ell _0}\le 2^p( 2+cx)\le 2^{p+2}\) for all \(\ell \le \ell _0\), whence \(f(c x 2^{p\ell }r_\ell ) \ge a x^2 2^{2p\ell } r_\ell ^2\) (because \(f(y)\ge a y^2\) for all \(y\in [0,2^{p+2}]\)). We thus end up with (we use that \(2^{2^{d\ell }}\le \exp (2^{d\ell })\))
Now the value of \(a>0\) is not allowed to vary anymore, and we introduce \(a^{\prime }>0\), whose value may change from line to line.
Case 2.1: \(p>d/2\). We take \(r_\ell :=(1-2^{-\eta })2^{- \eta \ell }\) for some \(\eta >0\) such that \(2(p-\eta )>d\). If \(Nx^2\ge 1\), we easily get
The last inequality uses that \(y^2\ge f(y)\) for all \(y>0\). If finally \(Nx^2\le 1\), we obviously have
We thus always have \({\varepsilon }(N,x)\le C \exp (Nf(a^{\prime }x))\) as desired.
Case 2.2: \(p=d/2\). We choose \(r_\ell :=1/\ell _0\). Thus, if \(a N(x/\ell _0)^2 \ge 2\), we easily find
because \(\ell _0 \ge 1\) and \(f\) is increasing. If now \(a N(x/\ell _0)^2 < 2\), we just write
We thus always have \({\varepsilon }(N,x)\le C \exp (Nf(a^{\prime }x/\ell _0))\). Using that \(\ell _0 \le C \log (2+1/x)\), we immediately conclude that \({\varepsilon }(N,x)\le C \exp (Nf(a^{\prime }x/\log (2+1/x)))\) as desired.
Case 2.3: \(p\in (0,d/2)\). We choose \(r_\ell := \kappa 2^{(d/2-p)(\ell -\ell _0)}\) with \(\kappa =1-2^{p-d/2}\). For all \(\ell \le \ell _0\),
where the constant \(b>0\) is such that \(2^{-(d-2p)\ell _0}\ge b x^{d/p-2}\) (the existence of \(b\) is easily checked). Hence if \(N a \kappa ^2 x^{d/p}\ge 2/b\), we find
and thus, still using that \(N x^{d/p}\ge 2/(ab\kappa ^2)\),
Consequently, we have \({\varepsilon }(N,x) \le 3\exp (-Nf(cx)) +C\exp (-a^{\prime } N x^{d/p})\) if \(N a \kappa ^2 x^{d/p}\ge 2/b\). As usual, the case where \(N a \kappa ^2 x^{d/p}\le 2/b\) is trivial, since then
This ends the proof. \(\square \)
5 Depoissonization in the compact case
We next check the following compact version of Theorem 2.
Proposition 10
Assume that \(\mu \) is supported in \((-1,1]^d\). Let \(p>0\) and \(d\ge 1\) be fixed. There are some positive constants \(C\) and \(c\) (depending only on \(p,d\)) such that for all \(N\ge 1\), all \(x\in (0,\infty )\),
We will need the following easy remark.
Lemma 11
For all \(N\ge 1\), for \(X\) Poisson\((N)\)-distributed, for all \(k\in \{0,\dots , \lfloor \sqrt{N} \rfloor \}\), \(\mathbb {P}[X=N+k] \ge e^{-2}/\sqrt{2N}\).
Proof
By Perrin [34], we have \(N! \le e \sqrt{N} (N/e)^N\). Thus
Since \(\log (1+x)\le x\) on \((0,1)\), we have \(((N+k)/N)^{N+k}\le \exp (k+k^2/N)\le \exp (k+1)\), so that \(\mathbb {P}[X=N+k] \ge e^{-2}/\sqrt{2N}\). \(\square \)
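This lower bound is easy to confirm numerically; a quick sketch evaluating the Poisson pmf in log-space:

```python
from math import exp, log, lgamma, sqrt, isqrt

def poisson_pmf(lam, n):
    # e^{-lam} * lam^n / n!, evaluated in log-space to avoid overflow
    return exp(-lam + n * log(lam) - lgamma(n + 1))

for N in (25, 100, 400):
    for k in range(isqrt(N) + 1):
        # Lemma 11: P[X = N + k] >= e^{-2} / sqrt(2N) for 0 <= k <= sqrt(N)
        assert poisson_pmf(N, N + k) >= exp(-2) / sqrt(2 * N)
print("Lemma 11 verified for N = 25, 100, 400")
```

The pmf near the mean is of order \(1/\sqrt{2\pi N}\), so the stated constant \(e^{-2}/\sqrt{2N}\) leaves a comfortable margin over the whole range \(k\le \sqrt{N}\).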
Proof of Proposition 10
The probability indeed vanishes if \(x>1\), since \({\mathcal D}_p\) is smaller than \(1\) when restricted to probability measures on \((-1,1]^d\). In the sequel, the constants may depend only on \(p\) and \(d\).
Step 1. We introduce a Poisson measure \(\Pi _N\) on \({{\mathbb {R}}^d}\) with intensity measure \(N\mu \) and the associated empirical measure \(\Psi _N=\Pi _N/\Pi _N({{\mathbb {R}}^d})\). Conditionally on \(\{\Pi _N({{\mathbb {R}}^d})=n\}\), \(\Psi _N\) has the same law as \(\mu _n\) (the empirical measure of \(n\) i.i.d. random variables with law \(\mu \)).
By Lemma 11 (since \(\Pi _N({{\mathbb {R}}^d})\) is Poisson\((N)\)distributed),
which of course implies that (for all \(N\ge 1\), all \(x>0\)),
Step 2. Here we prove that there is a constant \(A>0\) such that for any \(N\ge 1\), any \(k \in \{0,\ldots , \lfloor \sqrt{N} \rfloor \}\), any \(x > A N^{-1/2}\),
Build \(\mu _n\) for all values of \(n\ge 1\) with the same i.i.d. family of \(\mu \)distributed random variables \((X_k)_{k\ge 1}\). Then a.s.,
This obviously implies (recall Notation 4(a)) that \({\mathcal D}_p(\mu _{N},\mu _{N+k}) \le C N^{-1/2}\) a.s. (where \(C\) depends only on \(p\)). By the triangle inequality, \({\mathcal D}_p(\mu _N,\mu ) \le {\mathcal D}_p(\mu _{N+k},\mu ) + C N^{-1/2}\), whence
if \(x- CN^{-1/2}\ge x/2\), i.e. \(x\ge 2C N^{-1/2}\).
Step 3. Gathering Steps 1 and 2, we deduce that for all \(N\ge 1\), all \(x>AN^{-1/2}\),
We next apply Proposition 8. Observing that, for \(x \in (0,1]\),

(i)
\(\exp (-Nf(cx/2))\le \exp (-cNx^2)\) (case \(p>d/2\)),

(ii)
\(\exp (-Nf(cx/(2\log (2+2/x))))\le \exp (-cN(x/\log (2+1/x))^2)\) (case \(p=d/2\)),

(iii)
\(\exp (-Nf(cx/2)) + \exp (-c N (x/2)^{d/p})\le C\exp (-cNx^{d/p})\) (case \(p\in (0,d/2)\))
concludes the proof when \(x>AN^{-1/2}\). But the other case is trivial, because for \(x \le A N^{-1/2}\),
which is also smaller than \(C\exp (-N(x/\log (2+1/x))^2)\) and than \(C\exp (-Nx^{d/p})\) (if \(d>2p\)). \(\square \)
6 Concentration inequalities in the non compact case
Here we conclude the proof of Theorem 2. We will need some concentration estimates for the Binomial distribution.
Lemma 12
Let \(X\) be Binomial\((N,p)\)-distributed. Recall that \(f\) was defined in Notation 7.

(a)
\(\mathbb {P}[|X-Np|\ge N p z]\le ({\mathbf{1}}_{\{p(1+z) \le 1\}}+{\mathbf{1}}_{\{z \le 1\}}) \exp (-Npf(z))\) for all \(z>0\).

(b)
\(\mathbb {P}[|X-Np|\ge N p z]\le Np\) for all \(z>1\).

(c)
\(\mathbb {E}(\exp (-\theta X))=(1-p+pe^{-\theta })^N \le \exp (-N p (1-e^{-\theta }))\) for \(\theta >0\).
Proof
Point (c) is straightforward. Point (b) follows from the fact that for \(z>1\), \(\mathbb {P}[|X-Np|\ge N p z]=\mathbb {P}[X\ge Np(1+z)]\le \mathbb {P}[X\ne 0]=1-(1-p)^N \le pN\). For point (a), we use Bennett's inequality [4], see Devroye and Lugosi [17, Exercise 2.2 page 11], together with the obvious facts that \(\mathbb {P}[X-Np\ge N p z]=0\) if \(p(1+z)>1\) and \(\mathbb {P}[X-Np\le -N p z]=0\) if \(z>1\). The following elementary but tedious computation also works: write \(\mathbb {P}[|X-Np|\ge N p z] = \mathbb {P}(X\ge Np(1+z)) + \mathbb {P}(N-X\ge N(1-p+zp)) =:\Delta (p,z)+\Delta (1-p,zp/(1-p))\), and observe that \(N-X\sim \) Binomial\((N,1-p)\). Use that \(\Delta (p,z)\le {\mathbf{1}}_{\{p(1+z)\le 1\}}\exp (-\theta N p(1+z))(1-p+pe^\theta )^N\) and choose \(\theta =\log ((1-p)(1+z)/(1-p-pz))\); this gives \(\Delta (p,z)\le {\mathbf{1}}_{\{p(1+z)\le 1\}} \exp (-N[p(1+z)\log (1+z)+(1-p-pz)\log ((1-p-pz)/(1-p)) ] )\). A tedious study shows that \(\Delta (p,z)\le {\mathbf{1}}_{\{p(1+z)\le 1\}} \exp (-Npf(z))\) and that \(\Delta (1-p,zp/(1-p))\le {\mathbf{1}}_{\{z \le 1\}}\exp (-Npf(z))\). \(\square \)
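The two-sided bound of point (a) can also be checked numerically against the exact Binomial tail. In the sketch below we assume for concreteness that \(f\) is the classical Bennett function \(f(z)=(1+z)\log (1+z)-z\) (Notation 7 is not reproduced in this section, so this form is an assumption on our part), and the helper names are ours:

```python
import math

def f(z: float) -> float:
    # Bennett function; assumed form of f from Notation 7
    return (1 + z) * math.log(1 + z) - z

def two_sided_tail(N: int, p: float, z: float) -> float:
    # exact P[|X - Np| >= Npz] for X ~ Binomial(N, p)
    return sum(
        math.comb(N, k) * p**k * (1 - p) ** (N - k)
        for k in range(N + 1)
        if abs(k - N * p) >= N * p * z
    )

N = 40
for p in (0.1, 0.3):
    for z in (0.25, 0.5, 0.75, 1.0, 1.5):
        coeff = (p * (1 + z) <= 1) + (z <= 1)  # the two indicators of point (a)
        assert two_sided_tail(N, p, z) <= coeff * math.exp(-N * p * f(z))
```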
We next estimate the first term when computing \({\mathcal D}_p(\mu _N,\mu )\).
Lemma 13
Let \(\mu \in {\mathcal P}({{\mathbb {R}}^d})\) and \(p>0\). Assume (1), (2) or (3). Recall Notation 4 and put \(Z_N^p:=\sum _{n\ge 0}2^{pn}|\mu _N(B_n)-\mu (B_n)|\). Let \(x_0>0\) be fixed. For all \(x>0\),
The positive constants \(C\) and \(c\) depend only on \(p,d,x_0\) and either on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu )\) (under (1)) or on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu ),{\varepsilon }\) (under (2)) or on \(q,M_q(\mu ),{\varepsilon }\) (under (3)).
Proof
During the proof, the constants are only allowed to depend on the same quantities as in the statement, unless explicitly stated. Under (1) or (2), we assume that \(\gamma =1\) without loss of generality (by scaling), whence \({\mathcal E}_{\alpha ,1}(\mu )<\infty \) and thus \(\mu (B_n)\le C e^{-2^{(n-1)\alpha }}\) for all \(n\ge 0\). Under (3), we have \(\mu (B_n)\le C 2^{-qn}\) for all \(n\ge 0\). For \(\eta >0\) to be chosen later (observe that \(\sum _{n\ge 0} (1-2^{-\eta })2^{-\eta n}=1\)), putting \(c:=1-2^{-\eta }\) and \(z_n:= c x 2^{-(p+\eta )n}/\mu (B_n)\),
From now on, the value of \(c>0\) is not allowed to vary anymore. We introduce another positive constant \(a>0\) whose value may change from line to line.
Step 1: bound of \(I_n\). Here we show that under (3) (which is of course implied by (1) or (2)), if \(\eta \in (0,q/2-p)\), there is \(A_0>0\) such that
This will obviously imply that for all \(N\ge 1\), all \(x>0\),
First, \(\sum _{n\ge 0} I_n(N,x)=0\) if \(z_n>2\) for all \(n\ge 0\). Recalling that \(\mu (B_n)\le C2^{-qn}\), this is the case if \(x\ge (2C/c)\sup _{n\ge 0}2^{(p+\eta -q)n}=2C/c=:A_0\). Next, since \(N\mu _N(B_n)\sim \) Binomial\((N,\mu (B_n))\), Lemma 12(a) leads us to
because \(f(x) \ge x^2/4\) for \(x\in [0,2]\). Since finally \(\mu (B_n) z_n^2/4 \ge a x^2 2^{(q-2p-2\eta )n}\), we easily conclude, since \(q-2p-2\eta >0\) and since \(Nx^2 \ge 1\), that
Step 2: bound of \(J_n\) under (1) or (2) when \(x\le A\). Here we fix \(A>0\) and prove that if \(\eta >0\) is small enough, for all \(x\in (0,A]\) such that \(Nx^2\ge 1\),
Here the positive constants \(C\) and \(a\) are allowed to depend additionally on \(A\). This will imply, as usual, that for all \(N\ge 1\), all \(x\in (0,A]\),
By Lemma 12(a), (b) (since \(z_n> 2\) implies \({\mathbf{1}}_{\{\mu (B_n)(1+z_n)\le 1\}}+{\mathbf{1}}_{\{z_n\le 1\}} \le {\mathbf{1}}_{\{z_n\le 1/\mu (B_n)\}}\)),
because \(f(y)\ge a y \log y \ge a y \log [2\vee y]\) for \(y> 2\). Since \(\mu (B_n) \le Ce^{-2^{(n-1)\alpha }}\), we get
A straightforward computation shows that there is a constant \(K\) such that for \(n \ge n_1:=\lfloor K(1+\log \log (K/x))\rfloor \), we have \(\log (a x2^{-(p+\eta )n}e^{2^{(n-1)\alpha }})\ge 2^{(n-1)\alpha }/2\). Consequently,
We first show that \(J^1(N,x)\le Ce^{-aNx^2}\) (here we actually could get something much better). First, since \(n_1=\lfloor K+K\log \log (K/x)\rfloor \) and \(x \in [0,A]\), we clearly have e.g. \(x2^{-(p+\eta )n_1} \ge a x^{3/2}\). Next, \(Nx^2\ge 1\) implies that \(1/x \le (Nx^{3/2})^2\). Thus
We now treat \(J^2(N,x)\).
Step 2.1. Under (1), we immediately get, if \(\eta \in (0, \alpha -p)\) (recall that \(x\in [0,A]\)),
where we used that \(x \le A\) and \(Nx^2\ge 1\) (whence \(Nx\ge 1/A\)).
Step 2.2. Under (2), we first write
We choose \(n_2:=\lfloor \log (Nx)/((p+\eta )\log 2) \rfloor \), which yields \(2^{(n_2-1)\alpha }\ge (Nx)^{\alpha /(p+\eta )}/2^{2\alpha }\) and \((Nx) 2^{(\alpha -p-\eta )n_2}\le (Nx)^{\alpha /(p+\eta )}\). Consequently (recall that \(x\in (0,A]\)),
For any fixed \({\varepsilon }\in (0,\alpha )\), we choose \(\eta >0\) small enough so that \(\alpha /(p+\eta )\ge (\alpha -{\varepsilon })/p\) and we conclude that (recall that \(Nx\ge 1/A\) because \(Nx^2\ge 1\) and \(x\le A\))
The last inequality is easily checked, using that \(Nx^2\ge 1\) implies that \(N\le (Nx)^2\).
Step 3: bound of \(J_n\) under (3). Here we show that for all \({\varepsilon }\in (0,q)\), if \(\eta >0\) is small enough,
As usual, this will imply that for all \(x>0\), all \(N\ge 1\),
Exactly as in Step 2, we get from Lemma 12(a)–(b) that
Hence for \(n_3\) to be chosen later, since \(a N \mu (B_n)z_n= a Nx2^{-(p+\eta )n}\),
We choose \(n_3:= \lfloor (q-{\varepsilon })\log (Nx)/(pq\log 2) \rfloor \), which implies that \(2^{qn_3} \le 2^q (Nx)^{(q-{\varepsilon })/p}\) and that \(2^{-(p+\eta )n_3} \ge (Nx)^{-(q-{\varepsilon })(p+\eta )/(pq)}\). Hence
If \(\eta \in (0,p{\varepsilon }/(q-{\varepsilon }))\), then \(1-(q-{\varepsilon })(p+\eta )/(pq)>0\), and thus
This ends the step.
Step 4. We next assume (1) and prove that for all \(x\ge A_1:=2^{p}[M_p(\mu )+(2 \log {\mathcal E}_{\alpha ,1}(\mu ))^{p/\alpha }]\),
A simple computation shows that for any \(\nu \in {\mathcal P}({{\mathbb {R}}^d})\), \(\sum _{n\ge 0} 2^{pn}\nu (B_n) \le 2^p M_p(\nu )\), whence \(Z_N^{p} \le 2^p M_p(\mu )+ 2^p N^{-1}\sum _1^N |X_i|^p\le 2^p M_p(\mu )+ 2^p [N^{-1}\sum _1^N |X_i|^\alpha ]^{p/\alpha }\). Thus
Next, we note that for \(y\ge 2 \log {\mathcal E}_{\alpha ,1}(\mu )\),
The conclusion easily follows, since \(x\ge A_1\) implies that \(y:=[x2^{-p} - M_p(\mu )]^{\alpha /p}\ge 2 \log {\mathcal E}_{\alpha ,1}(\mu )\) and since \(y \ge [x2^{-p-1}]^{\alpha /p}-[M_p(\mu )]^{\alpha /p}\).
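The second inequality used in Step 4 is Jensen's inequality for the concave map \(t\mapsto t^{p/\alpha }\) (recall that \(\alpha >p\) under (1)): \(N^{-1}\sum _1^N |X_i|^p\le [N^{-1}\sum _1^N |X_i|^\alpha ]^{p/\alpha }\). A quick numerical check, with illustrative exponents and data of our choosing:

```python
import random

# Jensen's inequality for the concave map t -> t**(p/alpha) when alpha >= p:
# the empirical p-th moment is at most the (p/alpha)-th power of the alpha-th one.
random.seed(1)
p, alpha = 1.0, 2.0  # illustrative exponents with alpha > p, as in condition (1)
xs = [random.expovariate(1.0) for _ in range(1000)]
emp_p = sum(x**p for x in xs) / len(xs)
emp_alpha = sum(x**alpha for x in xs) / len(xs)
assert emp_p <= emp_alpha ** (p / alpha)
```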
Step 5. Assume (2) and put \(\delta := 2p/\alpha -1\). Here we show that for all \(x>0\), \(N\ge 1\),
Step 5.1. For \(R>0\) (large) to be chosen later, we introduce the probability measure \(\mu ^R\) as the law of \(X_1{\mathbf{1}}_{\{|X_1|\le R\}}\). We also denote by \(\mu _N^R\) the corresponding empirical measure (coupled with \(\mu _N\) in that the \(X_i\)'s are used for \(\mu _N\) and the \(X_i{\mathbf{1}}_{\{|X_i|\le R\}}\)'s are chosen for \(\mu _N^R\)). We set \(Z_N^{p,R}:=\sum _{n\ge 0}2^{pn}\left| \mu _N^R(B_n)-\mu ^R(B_n)\right| \) and first observe that \(\left| Z^p_N-Z^{p,R}_N\right| \le 2^p N^{-1}\sum _1^N |X_i|^p{\mathbf{1}}_{\{|X_i|>R\}} + 2^p \int _{\{|x|>R\}}|x|^p\mu (dx)\). On the one hand, \(\int _{\{|x|>R\}}|x|^p\mu (dx)\le \exp (-R^\alpha /2) \int |x|^p e^{|x|^\alpha /2} \mu (dx) \le C \exp (-R^\alpha /2)\) by (2) (with \(\gamma =1\)). On the other hand, since \(\alpha \in (0,p]\), \(\sum _1^N |X_i|^p{\mathbf{1}}_{\{|X_i|>R\}}\le \left( \sum _1^N |X_i|^\alpha {\mathbf{1}}_{\{|X_i|>R\}}\right) ^{p/\alpha }\). Hence if \(x\ge A \exp (-R^\alpha /2)\), where \(A:=2^{p+1}C\),
Observing that \(\mathbb {E}[\exp (|X_1|^\alpha {\mathbf{1}}_{\{|X_1|>R\}}/2)]\le 1+ \mathbb {E}[\exp (|X_1|^\alpha /2){\mathbf{1}}_{\{|X_1|>R\}}] \le 1 + C\exp (-R^\alpha /2)\) by (2) and using that \(\log (1+u)\le u\), we deduce that for all \(x\ge 2^{p+1} C \exp (-R^\alpha /2)\),
With the choice
we finally find
provided \(x\ge A \exp (-R^\alpha /2)\), i.e. \((N+1)x \ge A\). As usual, this immediately extends to any value of \(x>0\).
Step 5.2. To study \(Z_N^{p,R}\), we first observe that since \(\mu ^R(B_n)=0\) if \(2^{n-1}\ge R\), we have \(2^{pn}\mu ^R(B_n) \le (2R)^{p-\alpha /2}2^{\alpha n/2}\mu ^R(B_n)\) for all \(n\ge 0\). Hence \(Z_N^{p,R}\le (2R)^{p-\alpha /2}Z_N^{\alpha /2,R}\). But \(\mu ^R\) satisfies \({\int _{{{\mathbb {R}}^d}}}\exp (|x|^\alpha ) \mu ^R(dx)<\infty \) uniformly in \(R\), so that we may use Steps 1, 2 and 4 (with \(p=\alpha /2<\alpha \)) to deduce that for all \(x>0\), \(\Pr \left( Z_N^{\alpha /2,R}\ge x\right) \le C \exp (-a N x^2)\). Consequently, \(\Pr \left( Z_N^{p,R}\ge x\right) \le C \exp (-a N (x/R^{p-\alpha /2})^2)\). Recalling (5) and that \(\delta := 2p/\alpha -1\), we see that \(\Pr \left( Z_N^{p,R}\ge x\right) \le C \exp \left( -a Nx^2 (\log (1+N))^{-\delta }\right) \). This ends the step.
Conclusion. Recall that \(x_0>0\) is fixed.
First assume (1). By Step 4, \(\Pr \left[ Z_N^{p} \ge x \right] \le C\exp (-aNx^{\alpha /p})\) for all \(x\ge A_1\). We deduce from Steps 1 and 2 that for \(x\in (0,A_1)\), \(\Pr \left[ Z_N^{p} \ge x \right] \le C\exp (-aNx^2)\). We easily conclude that for all \(x>0\), \(\Pr \left[ Z_N^{p} \ge x \right] \le C\exp (-aNx^2){\mathbf{1}}_{\{x\le x_0\}} + C\exp (-aNx^{\alpha /p}){\mathbf{1}}_{\{x>x_0\}}\) as desired.
Assume next (2). By Step 5, \(\Pr \left[ Z_N^{p} \ge x \right] \le C \exp (-a Nx^2 (\log (1+N))^{-\delta })+C\exp (-a(Nx)^{\alpha /p})\). But if \(x \ge x_0\), we clearly have \((Nx)^{\alpha /p} \le C Nx^2 (\log (1+N))^{-\delta }\) because \(\alpha <p\), so that \(\Pr \left[ Z_N^{p} \ge x \right] \le C\exp (-a(Nx)^{\alpha /p})\). If now \(x \le x_0\), we use Steps 1 and 2 to write \(\Pr \left[ Z_N^{p} \ge x\right] \le C\exp (-aNx^2)+C\exp (-a(Nx)^{(\alpha -{\varepsilon })/p})\).
Assume finally (3). By Steps 1 and 3, \(\Pr [Z_N^{p} \ge x ]\le C\exp (-aNx^2) + C N (Nx)^{-(q-{\varepsilon })/p}\) for all \(x>0\). But if \(x\ge x_0\), \(\exp (-aNx^2)\le \exp (-aNx)\le C (Nx)^{-(q-{\varepsilon })/p} \le C N (Nx)^{-(q-{\varepsilon })/p}\). We conclude that for all \(x>0\), \(\Pr [Z_N^{p} \ge x ]\le C\exp (-aNx^2){\mathbf{1}}_{\{x\le x_0\}} + C N (Nx)^{-(q-{\varepsilon })/p}\) as desired.
We can now give the
Proof of Theorem 2
Let us recall that the constants during this proof may depend only on \(p,d\) and either on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu )\) (under (1)) or on \(\alpha ,\gamma ,{\mathcal E}_{\alpha ,\gamma }(\mu ),{\varepsilon }\) (under (2)) or on \(q,M_q(\mu ),{\varepsilon }\) (under (3)).
Using Lemma 5, we write
Hence
By Lemma 13 (choosing \(x_0:=1/(2\kappa _{p,d})\)), we easily find \(\Pr (Z_N^p \ge x/(2\kappa _{p,d})) \le Ce^{-cNx^2}{\mathbf{1}}_{\{x\le 1\}} +b(N,x) \le a(N,x){\mathbf{1}}_{\{x\le 1\}}+b(N,x)\), these quantities being defined in the statement of Theorem 2. We now check that there is \(A>0\) such that for all \(x>0\),
This will end the proof, since one easily checks that \(a(N,x){\mathbf{1}}_{\{x \le A\}}\le a(N,x){\mathbf{1}}_{\{x \le 1\}}+b(N,x)\) (when allowing the values of the constants to change).
Let us thus check (6). For \(\eta >0\) to be chosen later, we set \(c:=(1-2^{-\eta })/(2\kappa _{p,d})\) and \(z_n:= c x 2^{-(p+\eta )n}/\mu (B_n)\). Observing that \(\sum _{n\ge 0} (1-2^{-\eta })2^{-\eta n}=1\), we write
From now on, the value of \(c>0\) is not allowed to vary anymore. We introduce another positive constant \(a>0\) whose value may change from line to line. We only assume (3) (which is implied by (1) or (2)). We now show that if \(\eta >0\) is chosen small enough, there is \(A>0\) such that
where \(h(x)=x^2\) if \(p>d/2\), \(h(x)=(x/\log (2+1/x))^2\) if \(p=d/2\) and \(h(x)=x^{d/p}\) if \(p<d/2\). This will obviously imply as usual that for all \(x>0\),
and thus conclude the proof of (6). We thus only have to prove (7).
Conditionally on \(\mu _N(B_n)\), \({\mathcal R}_{B_n}\mu _N\) is the empirical measure of \(N\mu _N(B_n)\) points which are \({\mathcal R}_{B_n}\mu \)-distributed. Since \({\mathcal R}_{B_n}\mu \) is supported in \((-1,1]^d\), we may apply Proposition 10 and obtain
by Lemma 12(c). But the condition \(z_n\le 1\) implies that \(h(z_n)\) is bounded (by a constant depending only on \(p\) and \(d\)), whence
By (3), we have \(\mu (B_n)\le C 2^{-qn}\). Hence if \(x>A:=C/c\), we have \(z_n\ge (c/C)x2^{(q-p-\eta )n} >1\) for all \(n\ge 0\) (if \(\eta \in (0,q-p)\)) and thus \(\sum _{n\ge 0} K_n(N,x)=0\) as desired.
Next, we see that \(\theta \mapsto \theta h(x/\theta )\) is decreasing, whence for all \(x\le A\),
We now treat separately the three cases.
Step 1: case \(p>d/2\). Since \(h(x)=x^2\), we have, if \(\eta \in (0,q/2-p)\),
if \(Nx^2\ge 1\).
Step 2: case \(p=d/2\). Since \(h(x)=(x/\log (2+1/x))^2\), we have, if \(\eta \in (0,q/2-p)\),
if \(N h(x) \ge 1\). The third inequality only uses that \(\log ^2(2+1/(x2^{(q-p-\eta )n})) \le \log ^2(2+1/x)\).
Step 3: case \(p<d/2\). Here \(h(x)=x^{d/p}\). Since \(p<d/2\) and \(q>2p\), it holds that \(q(1-p/d)-p>0\). We thus may take \(\eta \in (0,q(1-p/d)-p)\) (so that \(q(d/p-1)-d-d\eta /p>0\)) and we get
if \(N x^{d/p} \ge 1\). \(\square \)
7 The dependent case
We finally study a few classes of dependent sequences of random variables. We only give moment estimates; concentration inequalities might also be obtained, but this would be much more complicated.
7.1 \(\rho \)-mixing stationary sequences
A stationary sequence of random variables \((X_n)_{n\ge 1}\) with common law \(\mu \) is said to be \(\rho \)-mixing, for some \(\rho :{\mathbb {N}}\rightarrow {\mathbb {R}}^+\) with \(\rho _n\rightarrow 0\), if for all \(f,g \in L^2(\mu )\) and all \(i,j\ge 1\)
We refer for example to Rio [37], Doukhan [20] or Bradley [10].
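For concreteness, the covariance inequality in this definition takes the following standard form (supplied here from the classical references such as Rio [37], not from the text above):

```latex
\left|\operatorname{Cov}\bigl(f(X_i),\,g(X_j)\bigr)\right|
\;\le\; \rho_{|i-j|}\,
\sqrt{\operatorname{Var} f(X_i)}\;\sqrt{\operatorname{Var} g(X_j)}.
```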
Theorem 14
Consider a stationary sequence of random variables \((X_n)_{n\ge 1}\) with common law \(\mu \) and set \(\mu _N:=N^{-1}\sum _1^N \delta _{X_i}\). Assume that this sequence is \(\rho \)-mixing, for some \(\rho :{\mathbb {N}}\rightarrow {\mathbb {R}}^+\) satisfying \(\sum _{n\ge 0} \rho _n<\infty \). Let \(p>0\) and assume that \(M_q(\mu )<\infty \) for some \(q>p\). There exists a constant \(C\) depending only on \(p,d,q, M_q(\mu ),\rho \) such that, for all \(N\ge 1\),
This is very satisfying: we get the same estimate as in the independent case. The case \(\sum _{n\ge 0} \rho _n=\infty \) can also be treated (but the resulting upper bounds are weaker and depend on the rate of decrease of \(\rho \)). Actually, the \(\rho \)-mixing condition is slightly too strong (we only need the covariance inequality when \(f=g\) is an indicator function), but it is the best-adapted notion of mixing we found in the literature.
Proof
We first check that for any Borel subset \(A \subset {{\mathbb {R}}^d}\),
But this is immediate: \(\mathbb {E}[\mu _N(A)]=\mu (A)\) (whence \(\mathbb {E}[|\mu _N(A)-\mu (A)|]\le 2\mu (A)\)) and
This is smaller than \(C \mu (A)/N\) as desired, since \(\sum _{i,j\le N} \rho _{|i-j|}\le 2N \sum _{k\ge 0} \rho _k= C N\). Once this is done, it suffices to copy (without any changes) the proof of Theorem 1. \(\square \)
7.2 Markov chains
Here we consider an \({{\mathbb {R}}^d}\)-valued Markov chain \((X_n)_{n\ge 1}\) with transition kernel \(P\) and initial distribution \(\nu \in {\mathcal P}({{\mathbb {R}}^d})\), and we set \(\mu _N:=N^{-1}\sum _{1}^N \delta _{X_n}\). We assume that it admits a unique invariant probability measure \(\pi \) and satisfies the following \(L^2\)-decay property (usually related to a Poincaré inequality)
for some sequence \(\rho =(\rho _n)_{n\ge 1}\) decreasing to \(0\).
Theorem 15
Let \(p\ge 1\), \(d\ge 1\) and \(r> 2\) be fixed. Assume that our Markov chain \((X_n)_{n\ge 1}\) satisfies (8) with a sequence \((\rho _n)_{n\ge 1}\) satisfying \(\sum _{n\ge 1} \rho _n<\infty \). Assume also that the initial distribution \(\nu \) is absolutely continuous with respect to \(\pi \) and satisfies \(\Vert d\nu /d\pi \Vert _{L^r(\pi )}<\infty \). Assume finally that \(M_q(\pi )<\infty \) for some \(q>p r/(r-1)\). Setting \(q_r:=q(r-1)/r\) and \(d_r=d(r+1)/r\), there is a constant \(C\), depending only on \(p,d,r,q,\rho ,M_q(\pi )\) and \(\Vert d\nu /d\pi \Vert _{L^r(\pi )}\), such that for all \(N\ge 1\),
Once again, we might adapt the proof to get a complete picture corresponding to other decays than \(L^2\)–\(L^2\) and to slower mixing rates \((\rho _n)_{n\ge 1}\).
Proof
We only have to show that for any \(\ell \ge 0\), any \(n\ge 0\),
Since \(M_q(\pi )<\infty \) (whence \(\pi (B_n)\le C 2^{-qn}\)), we will deduce that
Then the rest of the proof is exactly the same as that of Theorem 1, replacing everywhere \(q\) and \(d\) by \(q_r\) and \(d_r\).
We first check that \(\Delta _{n,\ell }^N \le C (\pi (B_n))^{(r-1)/r}\). Using that \(\Vert d\nu /d\pi \Vert _{L^r(\pi )}<\infty \), we write
We next consider a Borel subset \(A\) of \({{\mathbb {R}}^d}\) and check that
To do so, as is usual when working with Markov chains or covariance properties (see [7]), we introduce \(f={\mathbf{1}}_A-\pi (A)\) and write
For \(j\ge i\), it holds that
Using the Hölder inequality (recall that \(\Vert d\nu /d\pi \Vert _{L^r(\pi )}<\infty \) with \(r>2\)) and (8), we get
Since for \(s>1\), \(\Vert f\Vert _{L^s(\pi )}\le C_s(\pi (A)+(\pi (A))^s)^{1/s}\le C_s (\pi (A))^{1/s}\), we find \(|\mathbb {E}_\nu (f(X_i)f(X_j))|\le C \rho _{j-i}(\pi (A))^{(r-1)/r}\) and thus
as desired. We used that \(\sum _{i,j=1}^N \rho _{|i-j|}\le C N\).
We can finally conclude that
by the Hölder inequality (and because \(\# {\mathcal P}_\ell =2^{d\ell }\)), where \(d_r=d (r+1)/r\) as in the statement. \(\square \)
7.3 McKean–Vlasov particle systems
Particle approximations of nonlinear equations have attracted a lot of attention in the past thirty years. We will focus here on the following \({{\mathbb {R}}^d}\)-valued nonlinear S.D.E.
where \(u_t= \mathrm{Law}(X_t)\) and \((B_t)\) is an \({{\mathbb {R}}^d}\)-valued Brownian motion. This is a probabilistic representation of the so-called McKean–Vlasov equation, which has been studied in particular by Carrillo et al. [12], Malrieu [28] and Cattiaux et al. [13], to which we refer for further motivations and for existence and uniqueness of solutions. We will mainly consider here the case where \(V\) and \(W\) are convex (and if \(V=0\) the center of mass is fixed) and \(W\) is even. To fix ideas, let us consider only two cases:

(a)
\(Hess\, V\ge \beta Id>0\), \(Hess\, W\ge 0\).

(b)
\(V(x)=|x|^\alpha \) for \(\alpha >2\), \(Hess\, W\ge 0\).
The particle system introduced to approximate the nonlinear equation is the following. Let \((B^i_t)_{t\ge 0}\) be \(N\) independent Brownian motions. For \(i=1,\ldots ,N\), set \(X^{i,N}_0=x\) and
The usual propagation of chaos property is concerned with controlling
uniformly (or not) in time. It is however very natural to consider rather a control of
where \(\hat{u}^N_t=\frac{1}{N}\sum _{i=1}^N\delta _{X^{i,N}_t}\), as in Bolley et al. [11].
To do so, and inspired by the usual proof of propagation of chaos, let us consider nonlinear independent particles
(driven by the same Brownian motions as the particle system) and the corresponding empirical measure \(u^N_t=\frac{1}{N}\sum _{i=1}^N\delta _{X^{i}_t}\). We then have
Then, following [28] in case (a) and [13] in case (b), one easily gets (for some time-independent constant \(C\))
where \(\alpha (N)=N^{-1}\) in case (a) and \(\alpha (N)=N^{-1/(\alpha -1)}\) in case (b). It is not hard to prove here that the nonlinear particles have infinitely many moments (uniformly in time), so that combining Theorem 1 with the previous estimates gives
where \(\beta (N)=N^{-1/2}\) if \(d=1\), \(\beta (N)=N^{-1/2}\log (1+N)\) if \(d=2\) and \(\beta (N)=N^{-1/d}\) if \(d\ge 3\).
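As an illustration (not part of the argument above), the one-dimensional rate \(N^{-1/2}\) can be observed numerically: for \(\mu \) uniform on \([0,1]\), \({\mathcal W}_1(\mu _N,\mu )=\int _0^1 |F_N(x)-x|\,dx\) with \(F_N\) the empirical c.d.f., which is computable in closed form from the order statistics. A sketch (all names are ours):

```python
import random

def w1_uniform(sample):
    # W_1 between the empirical measure of `sample` and Uniform[0,1]:
    # the L^1 distance between the empirical CDF and the identity.
    xs = sorted(sample)
    n = len(xs)
    knots = [0.0] + xs + [1.0]
    total = 0.0
    for i in range(n + 1):
        a, b, c = knots[i], knots[i + 1], i / n  # F_N = c on (a, b)
        # closed-form integral of |c - x| over [a, b]
        if c <= a:
            total += (a - c) * (b - a) + (b - a) ** 2 / 2
        elif c >= b:
            total += (c - b) * (b - a) + (b - a) ** 2 / 2
        else:
            total += ((c - a) ** 2 + (b - c) ** 2) / 2
    return total

random.seed(0)

def avg_w1(n, reps=200):
    return sum(w1_uniform([random.random() for _ in range(n)]) for _ in range(reps)) / reps

small, large = avg_w1(50), avg_w1(800)
assert 0 < large < small < 1  # the error shrinks as N grows
```

Since \(800/50=16\), the averaged distance should shrink by roughly \(\sqrt{16}=4\) between the two sample sizes, consistent with the \(N^{-1/2}\) rate.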
References
Adamczak, R.: A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 13, 1000–1034 (2008)
Ajtai, M., Komlós, J., Tusnády, G.: On optimal matchings. Combinatorica 4, 259–264 (1984)
Barthe, F., Bordenave, C.: Combinatorial optimization over two random point sets. Séminaire de Probabilités XLV, Lecture Notes in Mathematics 2078, pp. 483–535, Springer, Berlin (2013)
Bennett, G.: Probability inequalities for the sum of independent random variables. J. Am. Statist. Assoc. 57, 33–45 (1962)
Biau, G., Devroye, L., Lugosi, G.: On the performance of clustering in Hilbert spaces. IEEE Trans. Inf. Theory 54, 781–790 (2008)
Boissard, E.: Simple bounds for the convergence of empirical and occupation measures in 1Wasserstein distance. Electron. J. Probab. 16, 2296–2333 (2011)
Boissard, E., Le Gouic, T.: On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. arXiv:1105.5263
Borovkov, A.A.: Estimates for the distribution of sums and maxima of sums of random variables when the Cramér condition is not satisfied. Sib. Math. J. 41, 811–848 (2000)
Bradley, R.C.: A central limit theorem for stationary \(\rho \)-mixing sequences with infinite variance. Ann. Probab. 16, 313–332 (1988)
Bradley, R.C.: Introduction to Strong Mixing Conditions, vol. 1, 2, 3. Kendrick Press, Heber City (2007)
Bolley, F., Guillin, A., Villani, C.: Quantitative concentration inequalities for empirical measures on noncompact spaces. Probab. Theory Relat. Fields 137, 541–593 (2007)
Carrillo, J.A., McCann, R., Villani, C.: Kinetic equilibration rates for granular media and related equations: entropy dissipation and mass transportation. Rev. Mat. Iberoam. 19, 971–1018 (2003)
Cattiaux, P., Guillin, A., Malrieu, F.: Probabilistic approach for granular media equations in the non uniformly convex case. Probab. Theory Relat. Fields 140, 19–40 (2008)
Delattre, S., Graf, S., Luschgy, H., Pagès, G.: Quantization of probability distributions under normbased distortion measures. Stat. Decis. 22, 261–282 (2004)
Dereich, S.: Asymptotic formulae for coding problems and intermediate optimization problems: a review. In: Trends in Stochastic Analysis. pp. 187–232, Cambridge University Press, Cambridge (2009)
Dereich, S., Scheutzow, M., Schottstedt, R.: Constructive quantization: approximation by empirical measures. Ann. Inst. Henri Poincaré Probab. Stat. 49, 1183–1203 (2013)
Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, Berlin (2001)
Djellout, H., Guillin, A., Wu, L.: Transportation cost-information inequalities and applications to random dynamical systems and diffusions. Ann. Probab. 32, 2702–2732 (2004)
Dobrić, V., Yukich, J.E.: Asymptotics for transportation cost in high dimensions. J. Theor. Probab. 8, 97–118 (1995)
Doukhan, P.: Mixing: Properties and Examples. Springer, New York (1995)
Dudley, R.M.: Central limit theorems for empirical measures. Ann. Probab. 6, 899–929 (1978)
Fournier, N., Mischler, S.: Rate of convergence of the Nanbu particle system for hard potentials. arXiv:1302.5810
Fuk, D.H., Nagaev, S.V.: Probability inequalities for sums of independent random variables. Theory Probab. Appl. 16, 660–675 (1971)
Gozlan, N.: Integral criteria for transportation cost inequalities. Electron. Commun. Probab. 11, 64–77 (2006)
Horowitz, J., Karandikar, R.L.: Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math. 55, 261–273 (1994)
Laloë, T.: \(L_1\)-Quantization and clustering in Banach spaces. Math. Method Stat. 19, 136–150 (2009)
Ledoux, M.: The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs 89. American Math. Society, Providence (2001)
Malrieu, F.: Convergence to equilibrium for granular media equations. Ann. Appl. Probab. 13, 540–560 (2003)
Massart, P.: Concentration Inequalities and Model Selection: Ecole d'Été de Probabilités de Saint-Flour XXXIII. Springer, Berlin (2003)
Merlevède, F., Peligrad, M.: Rosenthaltype inequalities for the maximum of partial sums of stationary processes and examples. Ann. Probab. 41, 914–960 (2013)
Merlevède, F., Peligrad, M., Rio, E.: A Bernstein type inequality and moderate deviations for weakly dependent sequences. Probab. Theory Relat. Fields 151, 435–474 (2011)
Mischler, S., Mouhot, C.: Kac's program in kinetic theory. Invent. Math. 193, 1–147 (2013)
Pagès, G., Wilbertz, B.: Optimal Delaunay and Voronoi quantization schemes for pricing American style options. In: Carmona, R., Hu, P., Del Moral, P., Oudjane, N. (eds.) Numerical Methods in Finance, pp. 171–217. Springer, Berlin (2012)
Perrin, D.: Une variante de la formule de Stirling. http://www.math.u-psud.fr/~perrin/CAPES/analyse/Suites/Stirling
Rachev, S.T., Rüschendorf, L.: Mass Transportation Problems. Vol. I. and II. Probability and its Applications. Springer, Berlin (1998)
Roberts, G., Rosenthal, J.S.: Shift-coupling and convergence rates of ergodic averages. Commun. Stat. Stoch. Models 13, 147–165 (1996)
Rio, E.: Théorie asymptotique des processus aléatoires faiblement dépendants, Mathématiques et Applications 31. Springer, Paris (2000)
Talagrand, M.: Matching random samples in many dimensions. Ann. Appl. Probab. 2, 846–856 (1992)
Talagrand, M.: The transportation cost from the uniform measure to the empirical measure in dimension \(\ge 3\). Ann. Probab. 22, 919–959 (1994)
Van der Vaart, A., Wellner, J.A.: Weak Convergence of Empirical Processes. Springer, Berlin (1996)
Villani, C.: Topics in Optimal Transportation. Graduate Studies in Mathematics, 58. American Mathematical Society, Providence (2003)
Fournier, N., Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probab. Theory Relat. Fields 162, 707–738 (2015). https://doi.org/10.1007/s00440-014-0583-7
Keywords
 Empirical measure
 Sequence of i.i.d. random variables
 Wasserstein distance
 Concentration inequalities
 Quantization
 Markov chains
 \(\rho \)-mixing sequences
 McKean–Vlasov particle systems