Abstract
In this note, we provide upper bounds on the expectation of the supremum of empirical processes indexed by Hölder classes of any smoothness, for any distribution supported on a bounded set in \(\mathbb{R}^{d}\). These results can alternatively be seen as non-asymptotic risk bounds when the unknown distribution is estimated by its empirical counterpart, based on \(n\) independent observations, and the error of estimation is quantified by integral probability metrics (IPMs). In particular, IPMs indexed by Hölder classes are considered and the corresponding rates are derived. These rates interpolate between two well-known extreme cases: the rate \(n^{-1/d}\) corresponding to the Wasserstein-1 distance (the least smooth case) and the fast rate \(n^{-1/2}\) corresponding to very smooth functions (for instance, functions from an RKHS defined by a bounded kernel).
Notes
See [21, Section 2.5] for the link between definitions of sub-Gaussian random variables (bound on moment-generating function, tail inequalities, …) and the Orlicz norm \(\psi_{2}\).
We refer the reader to https://ttic.uchicago.edu/~tewari/lectures/lecture10.pdf for a simple proof of this lemma.
REFERENCES
M. Arjovsky, S. Chintala, and L. Bottou, ‘‘Wasserstein generative adversarial networks,’’ in Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research, Ed. by D. Precup and Y. W. Teh (PMLR, Sydney, Australia, 2017), pp. 214–223.
F. Bassetti, A. Bodini, and E. Regazzini, ‘‘On minimum Kantorovich distance estimators,’’ Statistics and Probability Letters 76 (12), 1298–1302 (2006).
F.-X. Briol, A. Barp, A. B. Duncan, and M. Girolami, Statistical inference for generative models with maximum mean discrepancy (2019). arXiv preprint arXiv:1906.05944.
M. Chen, W. Liao, H. Zha, and T. Zhao, Statistical guarantees of generative adversarial networks for distribution estimation (2020). arXiv preprint arXiv:2002.03938.
E. del Barrio, P. Deheuvels, and S. van de Geer, Lectures on Empirical Processes: Theory and Statistical Applications, EMS Series of Lectures in Mathematics (European Mathematical Society (EMS), Zürich, 2007), with a preface by Juan A. Cuesta Albertos and Carlos Matrán.
R. M. Dudley, ‘‘The speed of mean Glivenko–Cantelli convergence,’’ Ann. Math. Statist. 40, 40–50 (1969).
E. Giné and R. Nickl, Mathematical Foundations of Infinite-Dimensional Statistical Models, Vol. 40 (Cambridge University Press, 2016).
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Advances in Neural Information Processing Systems (2014), pp. 2672–2680.
V. Koltchinskii, Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, Vol. 2033 (Springer Science and Business Media, 2011).
T. Liang, On how well generative adversarial networks learn densities: Nonparametric and parametric results (2018). arXiv preprint arXiv:1811.03179.
R. Nickl and B. M. Pötscher, ‘‘Bracketing metric entropy rates and empirical central limit theorems for function classes of Besov- and Sobolev-type,’’ Journal of Theoretical Probability 20 (2), 177–199 (2007).
A. Rakhlin, K. Sridharan, and A. B. Tsybakov, ‘‘Empirical entropy, minimax regret and minimax risk,’’ Bernoulli 23 (2), 789–824 (2017).
F. Santambrogio, Optimal Transport for Applied Mathematicians (Birkhäuser, New York, 2015).
M. Scetbon, L. Meunier, J. Atif, and M. Cuturi, ‘‘Equitable and optimal transport with multiple agents,’’ in Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Vol. 130, Proceedings of Machine Learning Research (2021), pp. 2035–2043; arXiv: 2006.07260 (2020).
A. Shiryayev (Ed.), Selected Works of A. N. Kolmogorov, Vol. III: Information Theory and the Theory of Algorithms (Springer, 1993).
N. Srebro and K. Sridharan, Note on refined Dudley integral covering number bound, unpublished note (2010). http://ttic.uchicago.edu/karthik/dudley.pdf
N. Srebro, K. Sridharan, and A. Tewari, ‘‘Smoothness, low noise and fast rates,’’ in Advances in Neural Information Processing Systems (2010), pp. 2199–2207.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet, ‘‘On the empirical estimation of integral probability metrics,’’ Electronic Journal of Statistics 6, 1550–1599 (2012).
A. B. Tsybakov, Introduction to Nonparametric Estimation (Springer Science and Business Media, 2008).
A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Series in Statistics (Springer-Verlag, New York, 1996).
R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Vol. 47 (Cambridge University Press, 2018).
J. Weed and F. Bach, ‘‘Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance,’’ Bernoulli 25 (4A), 2620–2648 (2019).
APPENDIX
PROOFS
This section contains the proofs of the main results, Theorems 3 and 4, stated in the main body of the note.
A.1. Proof of Theorem 3
The proof of Theorem 3 can be found in [16]; we include it here for completeness.
Let \(\gamma_{0}=S_{n}(\mathcal{F})=\sup_{f\in\mathcal{F}}||f||_{L_{2}(P_{n})}\). Define \(\gamma_{j}=2^{-j}\gamma_{0}\) for every \(j\in\mathbb{N}\), and let \(T_{j}\) be a minimal \(\gamma_{j}\)-cover of \(\mathcal{F}\) with respect to \(L_{2}(P_{n})\). For any function \(f\in\mathcal{F}\), we denote by \(\widehat{f}_{j}\) an element of \(T_{j}\) which is a \(\gamma_{j}\)-approximation of \(f\). For any positive integer \(N\) we can decompose the function \(f\) as
\[f=f-\widehat{f}_{N}+\sum_{j=1}^{N}\big(\widehat{f}_{j}-\widehat{f}_{j-1}\big),\]
where \(\widehat{f}_{0}=0\in\mathcal{F}\). Hence, for any positive integer \(N\), we have
For any positive integer \(j\), the triangle inequality gives
We need the following classic lemma, which controls the expectation of a Rademacher average over a finite set.
Lemma A.1 (Massart’s finite class lemma). Let \(\mathcal{X}\) be a finite subset of \(\mathbb{R}^{n}\) and let \(\sigma_{1},\dots,\sigma_{n}\) be independent Rademacher random variables. Denote the radius of \(\mathcal{X}\) by \(R=\sup_{x\in\mathcal{X}}||x||\). Then, we have
\[\mathbb{E}\left[\sup_{x\in\mathcal{X}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}x_{i}\right]\leqslant\frac{R\sqrt{2\log|\mathcal{X}|}}{n}.\]
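As a quick numerical sanity check (an illustration, not part of the proof), one can compare a Monte Carlo estimate of the Rademacher average over a randomly generated finite class with Massart's bound \(R\sqrt{2\log|\mathcal{X}|}/n\):

```python
import numpy as np

# Monte Carlo sanity check of Lemma A.1 on a random finite class (illustration only).
rng = np.random.default_rng(0)
n, m = 50, 20                          # ambient dimension and size of the finite class
X = rng.normal(size=(m, n))            # a finite class of m vectors in R^n
R = np.linalg.norm(X, axis=1).max()    # radius R = sup_{x in X} ||x||

trials = 2000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))   # independent Rademacher signs
# Estimate E[ sup_{x in X} (1/n) * sum_i sigma_i x_i ]
sup_avg = (sigma @ X.T / n).max(axis=1).mean()

massart_bound = R * np.sqrt(2 * np.log(m)) / n
print(sup_avg, massart_bound)   # the empirical average stays below the bound
```

The class size, dimension, and Gaussian sampling are arbitrary choices made only to exercise the inequality.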
Applying this lemma to \(\mathcal{X}_{j}=\left\{(\widehat{f}_{j}(X_{i})-\widehat{f}_{j-1}(X_{i}))_{i=1}^{n}\in\mathbb{R}^{n}:f\in\mathcal{F}\right\}\) for any \(j=1,\dots,N\) and using (3), we get
Therefore we have
For any \(\tau>0\), pick \(N=\sup\{j:\gamma_{j}>2\tau\}\). Then \(\gamma_{N}=2\gamma_{N+1}\leqslant 4\tau\) and \(\gamma_{N+1}=\gamma_{N}/2\geqslant\tau\). Hence, we conclude that
Since \(\tau\) can take any positive value we can take the infimum over all positive \(\tau\) and this concludes the proof.
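As an illustration (not part of the original argument), the final infimum over \(\tau\) can be evaluated numerically. The sketch below assumes a polynomial covering number \(N(\varepsilon)=(C/\varepsilon)^{d}\) and the generic constants 4 and 12 that this type of chaining argument typically yields; both the entropy model and the constants are assumptions made for illustration only.

```python
import numpy as np

def refined_dudley(n, d, gamma0=1.0, C=1.0):
    """Evaluate inf_tau [ 4*tau + (12/sqrt(n)) * int_tau^gamma0 sqrt(log N(eps)) d eps ]
    for the assumed polynomial covering number N(eps) = (C/eps)**d."""
    eps = np.linspace(1e-4, gamma0, 4000)
    integrand = np.sqrt(d * np.clip(np.log(C / eps), 0.0, None))
    # cumulative integral int_{eps[0]}^{eps[k]} via the trapezoidal rule
    cum = np.concatenate(([0.0],
                          np.cumsum((integrand[1:] + integrand[:-1]) / 2 * np.diff(eps))))
    taus = eps
    tail = cum[-1] - cum                 # int_tau^gamma0 sqrt(log N(eps)) d eps
    bound = 4 * taus + 12 / np.sqrt(n) * tail
    return bound.min()

# The bound shrinks as the sample size grows.
print(refined_dudley(100, d=3), refined_dudley(10_000, d=3))
```

Increasing \(\tau\) cheapens the entropy integral at the price of the additive \(4\tau\) term; the grid search above simply locates the best trade-off.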
A.2. Proof of Theorem 4
Without loss of generality, we prove the theorem in the case \(L=1\); the general case follows by homogeneity. For simplicity, we write \(\mathcal{H}^{\alpha}=\mathcal{H}^{\alpha}(1)\), \(Ph=\int_{\mathcal{X}}h\,dP\) and \(P_{n}h=\int_{\mathcal{X}}h\,dP_{n}\). A symmetrization argument (Lemma 1) gives
where the empirical Rademacher process \(\widehat{R}_{n}(\mathcal{H}^{\alpha})\) is given by
Noting that, for any \(h\in\mathcal{H}^{\alpha}\),
the improved Dudley bound (Theorem 3) coupled with Lemma 2 yields,
Applying Lemma A.2 with \(\beta=\frac{d}{2\alpha}\) and \(a=3\sqrt{\frac{K\lambda}{n}}\) where \(K=K_{\alpha,d}\) is the constant depending only on \(\alpha\) and \(d\) borrowed from Theorem 1 and \(\lambda:=\lambda_{d}(\mathcal{X}^{1})\), we get
The proof is finished since the upper bound stated in Theorem 4 is a direct consequence of (A.2).
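To make the resulting rates concrete: up to the logarithmic factor arising at the boundary \(\alpha=d/2\) (ignored here), the exponent can be summarized as \(\min(\alpha/d,1/2)\), matching the interpolation described in the abstract. The helper below is an illustrative sketch, not a function from the paper.

```python
def holder_ipm_rate_exponent(alpha: float, d: int) -> float:
    """Exponent r such that the Holder-alpha IPM estimation error scales as n**(-r),
    ignoring logarithmic factors at the boundary alpha = d/2."""
    return min(alpha / d, 0.5)

# alpha = 1 recovers the Wasserstein-1 rate n**(-1/d);
# alpha > d/2 gives the parametric rate n**(-1/2).
print(holder_ipm_rate_exponent(1, 4), holder_ipm_rate_exponent(3, 4))
```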
A.3. Additional Lemma
The following lemma makes it possible to upper bound Dudley’s refined bound (Theorem 3) for any bounded class whose entropy grows polynomially in \(1/\varepsilon\).
Lemma A.2. For any real positive numbers \(a\) and \(\beta\), it holds
Proof. Let \(a\) and \(\beta\) be real positive numbers. Define the function
One can easily check that
In the case \(a<1\), using the fact that \(1-x^{\alpha}\leqslant\log(x^{-\alpha})\) for any \(\alpha>0\) and \(x\in(0,1]\), we have
Finally, since the RHS of (A.3) is greater than \(1\) for any \(a>1\), (A.3) holds for any positive real \(a\) and this concludes the proof. \(\Box\)
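The elementary inequality invoked in the proof, \(1-x^{\alpha}\leqslant\log(x^{-\alpha})\) for \(\alpha>0\) and \(x\in(0,1]\), is just \(\log y\leqslant y-1\) after the substitution \(y=x^{\alpha}\). A quick numerical check (illustration only):

```python
import numpy as np

# Check 1 - x**alpha <= log(x**(-alpha)) = -alpha*log(x) on a grid of x in (0, 1]
# for a few values of alpha; equivalent to log(y) <= y - 1 with y = x**alpha.
x = np.linspace(1e-6, 1.0, 10_000)
ok = all(
    np.all(1 - x**alpha <= -alpha * np.log(x) + 1e-12)
    for alpha in (0.5, 1.0, 2.0, 5.0)
)
print(ok)
```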
ACKNOWLEDGMENTS
The author thanks Arnak Dalalyan for his diligent proofreading of this note, Yannick Guyonvarch for interesting references and Alexander Tsybakov for suggesting to present an extension of the main result.
Schreuder, N. Bounding the Expectation of the Supremum of Empirical Processes Indexed by Hölder Classes. Math. Meth. Stat. 29, 76–86 (2020). https://doi.org/10.3103/S1066530720010056