Abstract
In this note, we provide upper bounds on the expectation of the supremum of empirical processes indexed by Hölder classes of any smoothness, for any distribution supported on a bounded set in \(\mathbb{R}^{d}\). These results can alternatively be seen as non-asymptotic risk bounds when the unknown distribution is estimated by its empirical counterpart, based on \(n\) independent observations, and the error of estimation is quantified by integral probability metrics (IPMs). In particular, IPMs indexed by Hölder classes are considered and the corresponding rates are derived. These rates interpolate between two well-known extreme cases: the rate \(n^{-1/d}\) corresponding to the Wasserstein-1 distance (the least smooth case) and the fast rate \(n^{-1/2}\) corresponding to very smooth functions (for instance, functions from an RKHS defined by a bounded kernel).
Notes
See [21, Section 2.5] for the link between definitions of sub-Gaussian random variables (bound on moment-generating function, tail inequalities, …) and the Orlicz norm \(\psi_{2}\).
We refer the reader to https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf for a simple proof of this lemma.
REFERENCES
M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein generative adversarial networks,’’ in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research, Ed. by D. Precup and Y. W. Teh (PMLR, Sydney, Australia, 2017), pp. 214–223.
F. Bassetti, A. Bodini, and E. Regazzini, ‘‘On minimum Kantorovich distance estimators,’’ Statistics and Probability Letters 76 (12), 1298–1302 (2006).
F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami, Statistical inference for generative models with maximum mean discrepancy (2019). arXiv preprint arXiv:1906.05944.
M. Chen, W. Liao, H. Zha, and T. Zhao, Statistical guarantees of generative adversarial networks for distribution estimation (2020). arXiv preprint arXiv:2002.03938.
E. del Barrio, P. Deheuvels, and S. van de Geer, Lectures on Empirical Processes: Theory and Statistical Applications, EMS Series of Lectures in Mathematics (European Mathematical Society (EMS), Zürich, 2007), with a preface by Juan A. Cuesta Albertos and Carlos Matrán.
R. M. Dudley, ‘‘The speed of mean Glivenko–Cantelli convergence,’’ Ann. Math. Statist. 40, 40–50 (1969).
E. Giné and R. Nickl, Mathematical Foundations of Infinite-Dimensional Statistical Models, Vol. 40 (Cambridge University Press, 2016).
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Advances in Neural Information Processing Systems (2014), pp. 2672–2680.
V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, Vol. 2033 (Springer Science and Business Media, 2011).
T. Liang, On how well generative adversarial networks learn densities: Nonparametric and parametric results (2018). arXiv preprint arXiv:1811.03179.
R. Nickl and B. M. Pötscher, ‘‘Bracketing metric entropy rates and empirical central limit theorems for function classes of Besov- and Sobolev-type,’’ Journal of Theoretical Probability 20 (2), 177–199 (2007).
A. Rakhlin, K. Sridharan, and A. B. Tsybakov, ‘‘Empirical entropy, minimax regret and minimax risk,’’ Bernoulli 23 (2), 789–824 (2017).
F. Santambrogio, Optimal Transport for Applied Mathematicians (Birkhäuser, New York, 2015).
M. Scetbon, L. Meunier, J. Atif, and M. Cuturi, ‘‘Equitable and optimal transport with multiple agents,’’ in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Vol. 130, Proceedings of Machine Learning Research (2021), pp. 2035–2043; arXiv: 2006.07260 (2020).
A. Shiryayev (Ed.), Selected Works of A. N. Kolmogorov, Vol. III: Information Theory and the Theory of Algorithms (Springer, 1993).
N. Srebro and K. Sridharan, Note on refined Dudley integral covering number bound, unpublished note (2010). http://ttic.uchicago.edu/karthik/dudley.pdf
N. Srebro, K. Sridharan, and A. Tewari, ‘‘Smoothness, low noise and fast rates,’’ in Advances in Neural Information Processing Systems (2010), pp. 2199–2207.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet, ‘‘On the empirical estimation of integral probability metrics,’’ Electronic Journal of Statistics 6, 1550–1599 (2012).
A. B. Tsybakov, Introduction to Nonparametric Estimation (Springer Science and Business Media, 2008).
A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics (Springer-Verlag, New York, 1996).
R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Vol. 47 (Cambridge University Press, 2018).
J. Weed and F. Bach, ‘‘Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance,’’ Bernoulli 25 (4A), 2620–2648 (2019).
APPENDIX
PROOFS
This section contains the proofs of the main results, Theorems 3 and 4, stated in the main body of the note.
A.1. Proof of Theorem 3
The proof of Theorem 3 can be found in [16]; we include it here for completeness.
Let \(\gamma_{0}=S_{n}(\mathcal{F})=\sup_{f\in\mathcal{F}}||f||_{L_{2}(P_{n})}\). Define \(\gamma_{j}=2^{-j}\gamma_{0}\) for every \(j\in\mathbb{N}\), and let \(T_{j}\) be a minimal \(\gamma_{j}\)-cover of \(\mathcal{F}\) with respect to \(L_{2}(P_{n})\). For any function \(f\in\mathcal{F}\), we denote by \(\widehat{f}_{j}\) an element of \(T_{j}\) which is a \(\gamma_{j}\)-approximation of \(f\). For any positive integer \(N\) we can decompose the function \(f\) as
\[f=f-\widehat{f}_{N}+\sum_{j=1}^{N}\big(\widehat{f}_{j}-\widehat{f}_{j-1}\big),\]
where \(\widehat{f}_{0}=0\in\mathcal{F}\). Hence, for any positive integer \(N\), we have
For any positive integer \(j\), the triangle inequality gives
We need the following classic lemma, which controls the expectation of a Rademacher average over a finite set.
Lemma A.1 (Massart’s finite class lemma). Let \(\mathcal{X}\) be a finite subset of \(\mathbb{R}^{n}\) and let \(\sigma_{1},\dots,\sigma_{n}\) be independent Rademacher random variables. Denote the radius of \(\mathcal{X}\) by \(R=\sup_{x\in\mathcal{X}}||x||\). Then, we have
\[\mathbb{E}\left[\sup_{x\in\mathcal{X}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}x_{i}\right]\leqslant\frac{R\sqrt{2\log|\mathcal{X}|}}{n}.\]
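As a quick numerical sanity check (an illustration, not part of the proof), one can compare a Monte Carlo estimate of the Rademacher average over a randomly generated finite class with Massart's bound \(R\sqrt{2\log|\mathcal{X}|}/n\):

```python
import numpy as np

# Monte Carlo sanity check of Lemma A.1 on a random finite class (illustration only).
rng = np.random.default_rng(0)
n, m = 50, 20                          # ambient dimension and size of the finite class
X = rng.normal(size=(m, n))            # a finite class of m vectors in R^n
R = np.linalg.norm(X, axis=1).max()    # radius R = sup_{x in X} ||x||

trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))   # independent Rademacher signs
# Estimate E[ sup_{x in X} (1/n) * sum_i sigma_i x_i ]
sup_avg = (sigma @ X.T / n).max(axis=1).mean()

massart_bound = R * np.sqrt(2 * np.log(m)) / n
print(sup_avg, massart_bound)   # the empirical average stays below the bound
```

The class size, dimension, and Gaussian sampling are arbitrary choices made only to exercise the inequality.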
Applying this lemma to \(\mathcal{X}_{j}=\left\{(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))_{i=1}^{n}\in\mathbb{R}^{n}:f\in\mathcal{F}\right\}\) for any \(j=1,\dots,N\) and using (3), we get
Therefore we have
For any \(\tau>0\), pick \(N=\sup\{j:\gamma_{j}>2\tau\}\). Then \(\gamma_{N}=2\gamma_{N+1}\leqslant 4\tau\) and \(\gamma_{N+1}=\gamma_{N}/2\geqslant\tau\). Hence, we conclude that
Since \(\tau\) can take any positive value we can take the infimum over all positive \(\tau\) and this concludes the proof.
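As an illustration (not part of the original argument), the final infimum over \(\tau\) can be evaluated numerically. The sketch below assumes a polynomial covering number \(N(\varepsilon)=(C/\varepsilon)^{d}\) and the generic constants 4 and 12 that this type of chaining argument typically yields; both the entropy model and the constants are assumptions made for illustration only.

```python
import numpy as np

def refined_dudley(n, d, gamma0=1.0, C=1.0):
    """Evaluate inf_tau [ 4*tau + (12/sqrt(n)) * int_tau^gamma0 sqrt(log N(eps)) d eps ]
    for the assumed polynomial covering number N(eps) = (C/eps)**d."""
    eps = np.linspace(1e-4, gamma0, 4000)
    integrand = np.sqrt(d * np.clip(np.log(C / eps), 0.0, None))
    # cumulative integral int_{eps[0]}^{eps[k]} via the trapezoidal rule
    cum = np.concatenate(([0.0],
                          np.cumsum((integrand[1:] + integrand[:-1]) / 2 * np.diff(eps))))
    taus = eps
    tail = cum[-1] - cum                 # int_tau^gamma0 sqrt(log N(eps)) d eps
    bound = 4 * taus + 12 / np.sqrt(n) * tail
    return bound.min()

# The bound shrinks as the sample size grows.
print(refined_dudley(100, d=3), refined_dudley(10_000, d=3))
```

Increasing \(\tau\) cheapens the entropy integral at the price of the additive \(4\tau\) term; the grid search above simply locates the best trade-off.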
A.2. Proof of Theorem 4
Without loss of generality, we prove the theorem in the case \(L=1\); the general case follows by homogeneity. For simplicity, we write \(\mathcal{H}^{\alpha}=\mathcal{H}^{\alpha}(1)\), \(Ph=\int_{\mathcal{X}}h\,dP\) and \(P_{n}h=\int_{\mathcal{X}}h\,dP_{n}\). A symmetrization argument (Lemma 1) gives
where the empirical Rademacher process \(\widehat{R}_{n}(\mathcal{H}^{\alpha})\) is given by
Noting that, for any \(h\in\mathcal{H}^{\alpha}\),
the improved Dudley bound (Theorem 3) coupled with Lemma 2 yields,
Applying Lemma A.2 with \(\beta=\frac{d}{2\alpha}\) and \(a=3\sqrt{\frac{K\lambda}{n}}\) where \(K=K_{\alpha,d}\) is the constant depending only on \(\alpha\) and \(d\) borrowed from Theorem 1 and \(\lambda:=\lambda_{d}(\mathcal{X}^{1})\), we get
The proof is finished since the upper bound stated in Theorem 4 is a direct consequence of (A.2).
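To make the resulting rates concrete: up to the logarithmic factor arising at the boundary \(\alpha=d/2\) (ignored here), the exponent can be summarized as \(\min(\alpha/d,1/2)\), matching the interpolation described in the abstract. The helper below is an illustrative sketch, not a function from the paper.

```python
def holder_ipm_rate_exponent(alpha: float, d: int) -> float:
    """Exponent r such that the Holder-alpha IPM estimation error scales as n**(-r),
    ignoring logarithmic factors at the boundary alpha = d/2."""
    return min(alpha / d, 0.5)

# alpha = 1 recovers the Wasserstein-1 rate n**(-1/d);
# alpha > d/2 gives the parametric rate n**(-1/2).
print(holder_ipm_rate_exponent(1, 4), holder_ipm_rate_exponent(3, 4))
```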
A.3. Additional Lemma
The following lemma makes it possible to upper bound Dudley’s refined bound (Theorem 3) for any bounded class whose entropy grows polynomially in \(1/\varepsilon\).
Lemma A.2. For any real positive numbers \(a\) and \(\beta\), it holds
Proof. Let \(a\) and \(\beta\) be real positive numbers. Define the function
One can easily check that
In the case \(a<1\), using the fact that \(1-x^{\alpha}\leqslant\log(x^{-\alpha})\) for any \(\alpha>0\) and \(x\in(0,1]\), we have
Finally, since the RHS of (A.3) is greater than \(1\) for any \(a>1\), (A.3) holds for any positive real \(a\) and this concludes the proof. \(\Box\)
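The elementary inequality invoked in the proof, \(1-x^{\alpha}\leqslant\log(x^{-\alpha})\) for \(\alpha>0\) and \(x\in(0,1]\), is just \(\log y\leqslant y-1\) after the substitution \(y=x^{\alpha}\). A quick numerical check (illustration only):

```python
import numpy as np

# Check 1 - x**alpha <= log(x**(-alpha)) = -alpha*log(x) on a grid of x in (0, 1]
# for a few values of alpha; equivalent to log(y) <= y - 1 with y = x**alpha.
x = np.linspace(1e-6, 1.0, 10_000)
ok = all(
    np.all(1 - x**alpha <= -alpha * np.log(x) + 1e-12)
    for alpha in (0.5, 1.0, 2.0, 5.0)
)
print(ok)
```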
ACKNOWLEDGMENTS
The author thanks Arnak Dalalyan for his diligent proofreading of this note, Yannick Guyonvarch for interesting references and Alexander Tsybakov for suggesting to present an extension of the main result.
Schreuder, N. Bounding the Expectation of the Supremum of Empirical Processes Indexed by Hölder Classes. Math. Meth. Stat. 29, 76–86 (2020). https://doi.org/10.3103/S1066530720010056