
Bounding the Expectation of the Supremum of Empirical Processes Indexed by Hölder Classes


Abstract

In this note, we provide upper bounds on the expectation of the supremum of empirical processes indexed by Hölder classes of any smoothness and for any distribution supported on a bounded set in \(\mathbb{R}^{d}\). These results can alternatively be seen as non-asymptotic risk bounds when the unknown distribution is estimated by its empirical counterpart, based on \(n\) independent observations, and the estimation error is quantified by integral probability metrics (IPMs). In particular, IPMs indexed by Hölder classes are considered and the corresponding rates are derived. These results interpolate between two well-known extreme cases: the rate \(n^{-1/d}\) corresponding to the Wasserstein-1 distance (the least smooth case) and the fast rate \(n^{-1/2}\) corresponding to very smooth functions (for instance, functions from an RKHS defined by a bounded kernel).


Notes

  1. See [21, Section 2.5] for the link between definitions of sub-Gaussian random variables (bound on moment-generating function, tail inequalities, …) and the Orlicz norm \(\psi_{2}\).

  2. We refer the reader to https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf for a simple proof of this lemma.

REFERENCES

  1. M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein generative adversarial networks,’’ in Proceedings of the 34th International Conference on Machine Learning, Ed. by D. Precup and Y. W. Teh, Vol. 70 of Proceedings of Machine Learning Research (PMLR, Sydney, Australia, 2017), pp. 214–223.

  2. F. Bassetti, A. Bodini, and E. Regazzini, ‘‘On minimum Kantorovich distance estimators,’’ Statistics and Probability Letters 76 (12), 1298–1302 (2006).


  3. F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami, Statistical inference for generative models with maximum mean discrepancy, arXiv preprint arXiv:1906.05944 (2019).

  4. M. Chen, W. Liao, H. Zha, and T. Zhao, Statistical guarantees of generative adversarial networks for distribution estimation, arXiv preprint arXiv:2002.03938 (2020).

  5. E. del Barrio, P. Deheuvels, and S. van de Geer, Lectures on Empirical Processes: Theory and Statistical Applications, EMS Series of Lectures in Mathematics (European Mathematical Society, Zürich, 2007).

  6. R. M. Dudley, ‘‘The speed of mean Glivenko-Cantelli convergence,’’ Ann. Math. Statist. 40, 40–50 (1969).


  7. E. Giné and R. Nickl, Mathematical Foundations of Infinite-Dimensional Statistical Models, Vol. 40 (Cambridge University Press, 2016).


  8. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Advances in Neural Information Processing Systems (2014), pp. 2672–2680.

  9. V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d’Été de Probabilités de Saint-Flour XXXVIII-2008, Vol. 2033 (Springer Science and Business Media, 2011).


  10. T. Liang, On how well generative adversarial networks learn densities: Nonparametric and parametric results (2018). arXiv preprint arXiv:1811.03179.

  11. R. Nickl and B. M. Pötscher, ‘‘Bracketing metric entropy rates and empirical central limit theorems for function classes of Besov- and Sobolev-type,’’ Journal of Theoretical Probability 20 (2), 177–199 (2007).


  12. A. Rakhlin, K. Sridharan, and A. B. Tsybakov, ‘‘Empirical entropy, minimax regret and minimax risk,’’ Bernoulli 23 (2), 789–824 (2017).


  13. F. Santambrogio, Optimal Transport for Applied Mathematicians (Birkhäuser, New York, 2015).


  14. M. Scetbon, L. Meunier, J. Atif, and M. Cuturi, ‘‘Equitable and optimal transport with multiple agents,’’ in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Vol. 130 of Proceedings of Machine Learning Research (2021), pp. 2035–2043; arXiv:2006.07260 (2020).

  15. A. N. Shiryayev (Ed.), Selected Works of A. N. Kolmogorov, Vol. III: Information Theory and the Theory of Algorithms (Springer, 1993).

  16. N. Srebro and K. Sridharan, Note on refined Dudley integral covering number bound, unpublished results (2010). http://ttic.uchicago.edu/karthik/dudley.pdf

  17. N. Srebro, K. Sridharan, and A. Tewari, ‘‘Smoothness, low noise and fast rates,’’ in Advances in Neural Information Processing Systems (2010), pp. 2199–2207.

  18. B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet, ‘‘On the empirical estimation of integral probability metrics,’’ Electronic Journal of Statistics 6, 1550–1599 (2012).


  19. A. B. Tsybakov, Introduction to Nonparametric Estimation (Springer Science and Business Media, 2008).


  20. A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics (Springer-Verlag, New York, 1996).

  21. R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Vol. 47 (Cambridge University Press, 2018).


  22. J. Weed and F. Bach, ‘‘Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance,’’ Bernoulli 25 (4A), 2620–2648 (2019).



APPENDIX

PROOFS

This section contains the proofs of the main results, Theorems 3 and 4, stated in the main body of the note.

A.1. Proof of Theorem 3

The proof of Theorem 3 can be found in [16]; we reproduce it here for completeness.

Let \(\gamma_{0}=S_{n}(\mathcal{F})=\sup_{f\in\mathcal{F}}||f||_{L_{2}(P_{n})}\). Define \(\gamma_{j}=2^{-j}\gamma_{0}\) for every \(j\in\mathbb{N}\), and let \(T_{j}\) be a minimal \(\gamma_{j}\)-cover of \(\mathcal{F}\) with respect to \(L_{2}(P_{n})\). For any function \(f\in\mathcal{F}\), we denote by \(\widehat{f}_{j}\) an element of \(T_{j}\) which is a \(\gamma_{j}\)-approximation of \(f\). For any positive integer \(N\), we can decompose the function \(f\) as

$$f=f-\widehat{f}_{N}+\sum_{j=1}^{N}(\widehat{f}_{j}-\widehat{f}_{j-1}),$$

where \(\widehat{f}_{0}=0\in\mathcal{F}\). Hence, for any positive integer \(N\), we have

$$\widehat{R}_{n}(\mathcal{F})=\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}\left(f(X_{i})-\widehat{f}_{N}(X_{i})+\sum_{j=1}^{N}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right)\right]$$

$${}\leqslant\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(f(X_{i})-\widehat{f}_{N}(X_{i}))\right]+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right]$$

$${}\leqslant\frac{1}{n}\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}|f(X_{i})-\widehat{f}_{N}(X_{i})|+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right]$$

$${}\leqslant\sup_{f\in\mathcal{F}}||f-\widehat{f}_{N}||_{L_{2}(P_{n})}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right]$$

$${}\leqslant\gamma_{N}+\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right].$$

For any positive integer \(j\), the triangle inequality together with \(\gamma_{j-1}=2\gamma_{j}\) gives

$$||\widehat{f}_{j}-\widehat{f}_{j-1}||_{L_{2}(P_{n})}\leqslant||\widehat{f}_{j}-f||_{L_{2}(P_{n})}+||f-\widehat{f}_{j-1}||_{L_{2}(P_{n})}\leqslant\gamma_{j}+\gamma_{j-1}=3\gamma_{j}.$$
(A.1)

We need the following classical lemma, which controls the expectation of a Rademacher average over a finite set (see Note 2 for a simple proof).


Lemma A.1 (Massart’s finite class lemma). Let \(\mathcal{X}\) be a finite subset of \(\mathbb{R}^{n}\) and let \(\sigma_{1},\dots,\sigma_{n}\) be independent Rademacher random variables. Denote the radius of \(\mathcal{X}\) by \(R=\sup_{x\in\mathcal{X}}||x||\). Then, we have,

$$\mathbb{E}\left[\sup_{x\in\mathcal{X}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}x_{i}\right]\leqslant R\frac{\sqrt{2\log|\mathcal{X}|}}{n}.$$
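For intuition, Lemma A.1 is easy to check by simulation. The following minimal Python sketch (the Gaussian point set, sizes, and seed are arbitrary illustrative choices, not part of the note) estimates the Rademacher average by Monte Carlo and compares it with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50                        # ambient dimension and cardinality |X|
X = rng.normal(size=(m, n))           # finite set X in R^n, one point per row
R = np.linalg.norm(X, axis=1).max()   # radius R = sup_{x in X} ||x||_2

# Monte Carlo estimate of E[ sup_{x in X} (1/n) sum_i sigma_i x_i ]
sigma = rng.choice([-1.0, 1.0], size=(10_000, n))   # Rademacher signs
estimate = (sigma @ X.T / n).max(axis=1).mean()

bound = R * np.sqrt(2 * np.log(m)) / n              # Massart's bound
print(f"Monte Carlo {estimate:.4f} <= bound {bound:.4f}")
```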

Applying this lemma to \(\mathcal{X}_{j}=\left\{(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))_{i=1}^{n}\in\mathbb{R}^{n}:f\in\mathcal{F}\right\}\) for any \(j=1,\dots,N\), whose cardinality is at most \(|T_{j}|\cdot|T_{j-1}|\) and whose radius is at most \(3\gamma_{j}\sqrt{n}\) by (A.1), we get

$$\sum_{j=1}^{N}\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_{i}(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))\right]\leqslant\sum_{j=1}^{N}3\gamma_{j}\frac{\sqrt{2\log(|T_{j}|\cdot|T_{j-1}|)}}{\sqrt{n}}.$$

Therefore we have

$$\widehat{R}_{n}(\mathcal{F})\leqslant\gamma_{N}+\sum_{j=1}^{N}3\gamma_{j}\frac{\sqrt{2\log(|T_{j}|\cdot|T_{j-1}|)}}{\sqrt{n}}\leqslant\gamma_{N}+\frac{6}{\sqrt{n}}\sum_{j=1}^{N}\gamma_{j}\sqrt{\log|T_{j}|}$$

$${}=\gamma_{N}+\frac{12}{\sqrt{n}}\sum_{j=1}^{N}(\gamma_{j}-\gamma_{j+1})\sqrt{\log|T_{j}|}=\gamma_{N}+\frac{12}{\sqrt{n}}\sum_{j=1}^{N}(\gamma_{j}-\gamma_{j+1})\sqrt{\log\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\gamma_{j})}$$

$${}\leqslant\gamma_{N}+\frac{12}{\sqrt{n}}\int\limits_{\gamma_{N+1}}^{\gamma_{0}}\sqrt{\log\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\varepsilon)}\,d\varepsilon.$$

For any \(\tau>0\), pick \(N=\sup\{j:\gamma_{j}>2\tau\}\). By maximality of \(N\) we have \(\gamma_{N+1}\leqslant 2\tau\), so that \(\gamma_{N}=2\gamma_{N+1}\leqslant 4\tau\) and \(\gamma_{N+1}=\gamma_{N}/2>\tau\). Hence, we conclude that

$$\widehat{R}_{n}(\mathcal{F})\leqslant 4\tau+\frac{12}{\sqrt{n}}\int\limits_{\tau}^{\gamma_{0}}\sqrt{\log\mathcal{N}(\mathcal{F},L_{2}(P_{n}),\varepsilon)}d\varepsilon.$$

Since \(\tau\) can take any positive value, taking the infimum over all positive \(\tau\) concludes the proof.
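The resulting bound is straightforward to evaluate numerically once a covering-number bound is available. Below is a sketch, assuming an entropy bound of polynomial form \(\log\mathcal{N}(\varepsilon)\leqslant C\varepsilon^{-d/\alpha}\) (as delivered by Lemma 2 for Hölder balls); the constant \(C\), the grids, and the sample sizes are illustrative:

```python
import numpy as np

def refined_dudley(n, sqrt_log_N, gamma0=1.0):
    """Evaluate inf_tau (4*tau + (12/sqrt(n)) * int_tau^gamma0 sqrt(log N(eps)) d eps),
    the bound of Theorem 3, over a grid of tau values."""
    eps = np.geomspace(1e-6, gamma0, 4000)
    f = sqrt_log_N(eps)
    # integral from eps[k] to gamma0 for every k, by the trapezoidal rule
    pieces = 0.5 * (f[1:] + f[:-1]) * np.diff(eps)
    tails = np.append(np.cumsum(pieces[::-1])[::-1], 0.0)
    return (4 * eps + 12 / np.sqrt(n) * tails).min()

# Hoelder-type entropy: log N(eps) <= C * eps^(-d/alpha)   (C illustrative)
C, alpha, d = 1.0, 1.0, 2
for n in (10**3, 10**4, 10**5):
    print(n, refined_dudley(n, lambda e: np.sqrt(C) * e ** (-d / (2 * alpha))))
```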

A.2. Proof of Theorem 4

Without loss of generality, we prove the theorem in the case \(L=1\); the general case follows by homogeneity. For simplicity, we write \(\mathcal{H}^{\alpha}=\mathcal{H}^{\alpha}(1)\), \(Ph=\int_{\mathcal{X}}h\,dP\), and \(P_{n}h=\int_{\mathcal{X}}h\,dP_{n}\). A symmetrization argument (Lemma 1) gives

$$\mathbb{E}\Big[\sup_{h\in\mathcal{H}^{\alpha}}|Ph-P_{n}h|\Big]\leqslant 2\mathbb{E}\big[\widehat{R}_{n}(\mathcal{H}^{\alpha})\big],$$

where the empirical Rademacher process \(\widehat{R}_{n}(\mathcal{H}^{\alpha})\) is given by

$$\widehat{R}_{n}(\mathcal{H}^{\alpha})=\frac{1}{n}\mathbb{E}\left[\sup_{h\in\mathcal{H}^{\alpha}}\sum_{i=1}^{n}\sigma_{i}h(X_{i})\,\Big|\,X_{1},\ldots,X_{n}\right].$$
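As a side illustration (not part of the proof), this conditional expectation can be approximated by Monte Carlo once \(\mathcal{H}^{\alpha}\) is replaced by a finite family of functions; the 1-Lipschitz family below is an arbitrary stand-in for a Hölder ball with \(\alpha=L=1\) on \([0,1]\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(size=n)              # observations X_1, ..., X_n in [0, 1]

# finite stand-in for the Hoelder ball: h_t(x) = |x - t| - 1/2 (1-Lipschitz, |h_t| <= 1/2)
ts = np.linspace(0.0, 1.0, 25)
H = np.abs(X[None, :] - ts[:, None]) - 0.5    # H[k, i] = h_{t_k}(X_i)

# empirical Rademacher complexity, averaged over 5000 sign draws given X_1, ..., X_n
sigma = rng.choice([-1.0, 1.0], size=(5000, n))
R_hat = (sigma @ H.T / n).max(axis=1).mean()
print(f"hat R_n over the finite family: {R_hat:.4f}")
```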

Noting that, for any \(h\in\mathcal{H}^{\alpha}\),

$$P_{n}h^{2}:=\frac{1}{n}\sum_{i=1}^{n}h^{2}(X_{i})\leqslant||h^{2}||_{\infty}\leqslant 1,$$

the improved Dudley bound (Theorem 3) coupled with Lemma 2 yields,

$$\mathbb{E}\Big[\sup_{h\in\mathcal{H}^{\alpha}}|P_{n}h-Ph|\Big]\leqslant\inf_{\tau>0}\left(4\tau+\frac{12}{\sqrt{n}}\int\limits_{\tau}^{1}\sqrt{\log\mathcal{N}(\mathcal{H}^{\alpha},||\cdot||_{\infty},\varepsilon)}\,d\varepsilon\right)$$

$${}\leqslant\inf_{\tau>0}\left(4\tau+\frac{12\sqrt{K\lambda_{d}(\mathcal{X}^{1})}}{\sqrt{n}}\int\limits_{\tau}^{1}\varepsilon^{-d/(2\alpha)}\,d\varepsilon\right).$$

Applying Lemma A.2 with \(\beta=\frac{d}{2\alpha}\) and \(a=3\sqrt{\frac{K\lambda}{n}}\), where \(K=K_{\alpha,d}\) is the constant from Theorem 1, depending only on \(\alpha\) and \(d\), and \(\lambda:=\lambda_{d}(\mathcal{X}^{1})\), we get

$$\mathbb{E}\Big[\sup_{h\in\mathcal{H}^{\alpha}}|P_{n}h-Ph|\Big]\leqslant 12\begin{cases}\left(\frac{K\lambda}{n}\right)^{\alpha/d}\left[\frac{d}{d-2\alpha}\wedge\left(1+\tfrac{1}{2}\log\tfrac{n}{9K\lambda}\right)\right]&\text{if }\alpha<d/2,\\[4pt] \left(\frac{K\lambda}{n}\right)^{1/2}\left[\frac{2\alpha}{2\alpha-d}\wedge\left(1+\tfrac{\alpha}{d}\log\tfrac{n}{9K\lambda}\right)\right]&\text{if }\alpha\geqslant d/2.\end{cases}$$
(A.2)

The proof is complete, since the upper bound stated in Theorem 4 is a direct consequence of (A.2).
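For concreteness, the right-hand side of (A.2) is easy to evaluate; in the sketch below the product \(K\lambda\) is set to an illustrative value rather than the actual constant \(K_{\alpha,d}\lambda_{d}(\mathcal{X}^{1})\) from Theorem 1:

```python
import numpy as np

def holder_ipm_bound(n, alpha, d, K_lam=1.0):
    """Right-hand side of (A.2); K_lam stands in for K * lambda_d(X^1)."""
    log_n = np.log(n / (9 * K_lam))
    if alpha < d / 2:
        rate = (K_lam / n) ** (alpha / d)
        bracket = min(d / (d - 2 * alpha), 1 + 0.5 * log_n)
    else:
        rate = (K_lam / n) ** 0.5
        ratio = 2 * alpha / (2 * alpha - d) if 2 * alpha > d else np.inf
        bracket = min(ratio, 1 + (alpha / d) * log_n)
    return 12 * rate * bracket

# the rate interpolates between n^(-alpha/d) and n^(-1/2)
for alpha in (0.5, 1.5, 3.0):
    print(alpha, [round(holder_ipm_bound(n, alpha, d=3), 4) for n in (10**3, 10**5, 10**7)])
```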

A.3. Additional Lemma

The following lemma provides an upper bound on the refined Dudley bound (Theorem 3) for any bounded class whose entropy grows polynomially in \(1/\varepsilon\).

Lemma A.2. For any real positive numbers \(a\) and \(\beta\), it holds

$$\min_{0\leqslant\tau\leqslant 1}\left(\tau+a\int\limits_{\tau}^{1}\varepsilon^{-\beta}\,d\varepsilon\right)\leqslant(a^{1/\beta}\vee a)\left[\left(\frac{\beta\vee 1}{|\beta-1|}\right)\wedge\left(1+\frac{\log(1/a)}{\beta\vee 1}\right)\right].$$

Proof. Let \(a\) and \(\beta\) be real positive numbers. Define the function

$$f\colon[0,1]\to\mathbb{R},\qquad\tau\mapsto\tau+a\int\limits_{\tau}^{1}\varepsilon^{-\beta}\,d\varepsilon.$$

One can easily check that, for \(\beta\neq 1\),

$$f^{*}:=\min_{0\leqslant\tau\leqslant 1}f(\tau)=\begin{cases}1&\text{if }a>1,\\ a^{1/\beta}+\frac{a}{1-\beta}\left(1-a^{1/\beta-1}\right)&\text{if }a<1.\end{cases}$$
(A.3)

In the case \(a<1\), using the fact that \(1-x^{\alpha}\leqslant\log(x^{-\alpha})\) for any \(\alpha>0\) and \(x\in(0,1]\), we have

$$f^{*}\leqslant(a^{1/\beta}\vee a)\left[\left(\frac{\beta\vee 1}{|\beta-1|}\right)\wedge\left(1+\frac{\log(1/a)}{\beta\vee 1}\right)\right].$$
(A.4)

When \(\beta=1\), direct integration gives \(f^{*}=a(1+\log(1/a))\) for \(a\leqslant 1\), which coincides with the right-hand side of (A.4). Finally, since \(f^{*}=1\) for \(a>1\) by (A.3) while the RHS of (A.4) is greater than \(1\) in that case, (A.4) holds for any positive real \(a\), and this concludes the proof. \(\Box\)
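As a numerical sanity check of Lemma A.2 (a sketch, not part of the proof), one can minimize \(f\) over a fine grid and compare the minimum with the claimed upper bound:

```python
import numpy as np

def f(tau, a, beta):
    """f(tau) = tau + a * int_tau^1 eps^(-beta) d eps, via the closed-form integral."""
    if np.isclose(beta, 1.0):
        return tau - a * np.log(tau)
    return tau + a * (1 - tau ** (1 - beta)) / (1 - beta)

def lemma_bound(a, beta):
    """Right-hand side of Lemma A.2."""
    lead = max(a ** (1 / beta), a)
    first = np.inf if np.isclose(beta, 1.0) else max(beta, 1.0) / abs(beta - 1.0)
    second = 1.0 + np.log(1.0 / a) / max(beta, 1.0)
    return lead * min(first, second)

taus = np.linspace(1e-6, 1.0, 100_000)
for a, beta in [(0.01, 0.5), (0.05, 1.0), (0.05, 2.0), (0.3, 3.0)]:
    f_star = f(taus, a, beta).min()
    print(f"a={a}, beta={beta}: min f = {f_star:.4f} <= bound = {lemma_bound(a, beta):.4f}")
```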

ACKNOWLEDGMENTS

The author thanks Arnak Dalalyan for his diligent proofreading of this note, Yannick Guyonvarch for interesting references, and Alexander Tsybakov for suggesting the extension of the main result presented here.

Cite this article

Schreuder, N. Bounding the Expectation of the Supremum of Empirical Processes Indexed by Hölder Classes. Math. Meth. Stat. 29, 76–86 (2020). https://doi.org/10.3103/S1066530720010056
