1 Background

Given data generated independently from the same underlying distribution and a model class, we are interested in how close the model trained on the data is to the best possible model for the underlying distribution. In the context of supervised learning, this gap is known as the generalization error, and the model class is called the hypothesis space. The generalization error can be decomposed into two parts. The first is the difference between the best possible model and the best model in the hypothesis space, known as the approximation error. The second is the estimation error, the difference between the best model in the hypothesis space and the model trained on the data. In this paper, we focus on the estimation error.

To begin with, we will use the following notation. We denote the data set by \(\{Z_i=(X_i,Y_i)\}_{i=1}^n\), generated independently from the same underlying distribution \(\mu \), where \(X_i\) is the i-th input and \(Y_i\) is the corresponding output. L is the loss function and \(\mathcal {H}\) is the hypothesis space, which contains functions from X to Y. Let \(h^*\) be the minimizer of the risk over \(\mathcal {H}\) and \(\hat{h}\) be the minimizer of the empirical risk:

$$\begin{aligned}&h^*\,:=\,\underset{{h \in \mathcal {H}} }{\text {argmin}}\, \mathbb {E}_{\mu }[L(h)],\\&\hat{h}\,:=\,\underset{{h \in \mathcal {H}} }{\text {argmin}}\,\mathbb {E}_{\mu _n}[ L(h)]. \end{aligned}$$

Here, for simplicity, we write L(h) in place of L(h(X), Y), \(\mu _n= \frac{1}{n}\sum _{i=1}^n\delta _{Z_i} \) for the empirical measure, and \(\mathbb {E}_{\mu _n}[L(h)]=\frac{1}{n}\sum _{i=1}^nL(h(X_i),Y_i)\) for the empirical risk. The estimation error is defined to be \(\mathbb {E}_{\mu }[L(\hat{h})-L(h^*)]\).
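
For concreteness, the following short Python sketch (purely illustrative; the threshold class, the data model and all numerical values are arbitrary choices, not part of the analysis) computes \(h^*\), \(\hat{h}\) and the resulting estimation error for a toy one-dimensional classification problem.

```python
# Illustrative sketch (not part of the paper): empirical risk minimization for
# a toy class of threshold classifiers h_t(x) = 1{x >= t} under the 0-1 loss.
import numpy as np

rng = np.random.default_rng(0)

def true_risk(t, p_flip=0.1):
    # Assumed data model: X ~ Uniform[0,1], Y = 1{X >= 0.5} with the label
    # flipped with probability p_flip; the risk of h_t is then explicit.
    return p_flip + (1 - 2 * p_flip) * abs(t - 0.5)

def empirical_risk(t, X, Y):
    return np.mean((X >= t).astype(int) != Y)

n = 200
X = rng.uniform(0, 1, n)
Y = ((X >= 0.5).astype(int) ^ (rng.uniform(0, 1, n) < 0.1)).astype(int)

thresholds = np.linspace(0, 1, 1001)                                          # discretized H
t_star = thresholds[np.argmin([true_risk(t) for t in thresholds])]            # h*
t_hat = thresholds[np.argmin([empirical_risk(t, X, Y) for t in thresholds])]  # h-hat
print("estimation error:", true_risk(t_hat) - true_risk(t_star))
```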

To estimate the estimation error, instead of working with the space \(\mathcal {H}\) itself, we work with the excess loss class associated with \(\mathcal {H}\), denoted by \(\mathcal {F}\), see [4]:

$$\begin{aligned} \mathcal {F}:=\{Z=(X,Y)\rightarrow L(h)-L(h^*):h\in \mathcal {H}\}. \end{aligned}$$

Every function \(h \in \mathcal {H}\) corresponds to an element in \(\mathcal {F}\). Let \(\hat{f}\) and \(f^*\) in \(\mathcal {F}\) be the corresponding elements of \(\hat{h}\) and \(h^*\) in \(\mathcal {H}\), respectively. Obviously \(f^*\equiv 0\). Now the estimation error can be written as \(\mathbb {E}_{\mu }[\hat{f}]\). Since \(f^*\) is the minimizer of \(\mathbb {E}_\mu [f]\) and \(\mathbb {E}_\mu [f^*]=0\), we know that \(\mathbb {E}_\mu [\hat{f}]\ge 0\). Similarly, we know that \(\hat{f}\) is the minimizer of \(\mathbb {E}_{\mu _n}[f]\) and because \(\mathbb {E}_{\mu _n}[f^*]=0\), we have \(\mathbb {E}_{\mu _n}[\hat{f}]\le 0\). Therefore, we have

$$\begin{aligned} 0\le \mathbb {E}_{\mu }[\hat{f}]\le \mathbb {E}_{\mu }[\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]. \end{aligned}$$
(1.1)

To bound \(\mathbb {E}_{\mu }[\hat{f}]\), it is therefore enough to bound \(\mathbb {E}_{\mu }[\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\). Intuitively, for any fixed function f, if we blindly apply the Law of Large Numbers and the Central Limit Theorem, we get

$$\begin{aligned}&\mathbb {E}_{\mu }[f]-\mathbb {E}_{\mu _n}[f]\rightarrow 0\ \text {almost surely},\\&\quad \sqrt{n}(\mathbb {E}_{\mu }[f] -\mathbb {E}_{\mu _n}[f]) \rightarrow N(0,\mathbb {E}_\mu (f-\mathbb {E}_\mu [f])^2)\ \text {in distribution}. \end{aligned}$$

However, we cannot apply the Law of Large Numbers or the Central Limit Theorem to \(\hat{f}\): since \(\hat{f}\) is the empirical minimizer, it depends on the data, so the i.i.d. assumption no longer holds.

The following example is informative. Suppose \(\mathcal {F}\) contains all continuous functions that are bounded below by 0. Then the empirical minimizer \(\hat{f}\) can be any function that interpolates the data set with value 0. This implies that \(\mathbb {E}_{\mu _n}[\hat{f}]=0\). But there is no guarantee that \(\mathbb {E}_{\mu }[\hat{f}]=0\), and hence no guarantee that \(\mathbb {E}_{\mu }[\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\) converges to 0 as n goes to infinity.
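
A small numerical sketch of this phenomenon (all choices below, including the particular interpolating function, are arbitrary illustrations):

```python
# Illustrative sketch of the example above: a continuous, nonnegative function
# that vanishes at every data point, so its empirical mean is 0 while its true
# mean stays far from 0.
import numpy as np

rng = np.random.default_rng(1)

def interpolating_minimizer(Z_train, width=1e-3):
    # Distance to the nearest training point, rescaled and capped at 1.
    def f_hat(z):
        d = np.min(np.abs(np.asarray(z)[..., None] - Z_train), axis=-1)
        return np.minimum(1.0, d / width)
    return f_hat

n = 100
Z = rng.uniform(0, 1, n)                     # the data set
f_hat = interpolating_minimizer(Z)
print("empirical mean:", np.mean(f_hat(Z)))                         # exactly 0
print("true mean     :", np.mean(f_hat(rng.uniform(0, 1, 10**4))))  # roughly 0.9, far from 0
```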

The solution to this dilemma is to study the difference between the true and the empirical expectation uniformly over the whole excess loss class, rather than focusing only on \(\hat{f}\). Thus we consider the empirical process \(\{(\mathbb {E}_{\mu _n}-\mathbb {E}_\mu )[f]: f\in \mathcal {F}\}\), the family of random variables indexed by \(f\in \mathcal {F}\). Instead of bounding \(\mathbb {E}_\mu [\hat{f}]-\mathbb {E}_{\mu _n}[\hat{f}]\) directly, we bound the supremum of the empirical process. Define \(||Q||_\mathcal {F}=\sup \{|Qf|:f\in \mathcal {F}\}\). The quantity \(||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) will be called the empirical process supremum, and its expectation \(\mathbb {E}_{\mu }||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) will be called the \(\mu \)-estimation error; it naturally provides a bound for the estimation error.
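
The \(\mu \)-estimation error can be estimated numerically for simple classes. The following sketch (illustrative only; the finite class and the choice \(\mu =N(0,1)\) are arbitrary) approximates \(\mathbb {E}_{\mu }||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) by Monte Carlo.

```python
# Illustrative sketch: Monte Carlo estimate of the mu-estimation error
# E_mu || E_{mu_n} - E_mu ||_F for a small finite class F and mu = N(0,1).
import numpy as np

rng = np.random.default_rng(2)
F = [np.sin, np.cos, np.tanh]                    # a toy finite class

def true_mean(f, m=10**6):
    return np.mean(f(rng.standard_normal(m)))    # approximate E_mu[f]

mu_means = np.array([true_mean(f) for f in F])

def empirical_process_sup(n):
    Z = rng.standard_normal(n)                   # one data set of size n
    emp = np.array([np.mean(f(Z)) for f in F])
    return np.max(np.abs(emp - mu_means))        # ||E_{mu_n} - E_mu||_F

for n in (100, 1000, 10000):
    est = np.mean([empirical_process_sup(n) for _ in range(200)])
    print(n, est, est * np.sqrt(n))              # last column is roughly constant
```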

Next we define an \(\mathcal {F}\)-indexed empirical process \(G_n\) by

$$\begin{aligned} f\mapsto G_n f=\sqrt{n}(\mathbb {E}_{\mu _n}-\mathbb {E}_\mu )[f]=\frac{1}{\sqrt{n}}\sum _{i=1}^n(f(Z_i)-\mathbb {E}_\mu (f)). \end{aligned}$$
(1.2)

We now make the assumption that

$$\begin{aligned} \sup \limits _{f\in \mathcal {F}}|f(Z)-\mathbb {E}_{\mu }(f)|<\infty \end{aligned}$$
(1.3)

for all Z. Under this condition, the empirical process \(\{G_n f:f\in \mathcal {F}\}\) can be viewed as a map into \(l^{\infty } (\mathcal {F})\). Consequently, it makes sense to investigate conditions under which

$$\begin{aligned} G_n=\sqrt{n}(\mathbb {E}_{\mu _n}-\mathbb {E}_\mu )\rightarrow G \text { in distribution}, \end{aligned}$$
(1.4)

where G is a tight process in \(l^{\infty } (\mathcal {F})\). This is the \(\mathcal {F}\)-version of the Central Limit Theorem. Function classes that satisfy this property are called Donsker classes [10]. Moreover, a class \(\mathcal {F}\) is called a Glivenko–Cantelli class (GC) [10] if the \(\mathcal {F}\)-version of the Law of Large Numbers

$$\begin{aligned} ||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F} \rightarrow 0\text { almost surely} \end{aligned}$$

holds.

We now simplify assumption (1.3). If we let \(g=f-\mathbb {E}_{\mu }[f]\), we have

$$\begin{aligned} \mathbb {E}_{\mu }[g] -\mathbb {E}_{\mu _n}[g]= \mathbb {E}_{\mu }[f] -\mathbb {E}_{\mu _n}[f]. \end{aligned}$$
(1.5)

Thus we can assume \(\mathbb {E}_{\mu }[f]=0\) for any \(f \in \mathcal {F}\). Then (1.3) simplifies to

$$\begin{aligned} \sup _{f\in \mathcal {F}}|f(Z)|<\infty . \end{aligned}$$
(1.6)

Without loss of generality, we further assume that

$$\begin{aligned} \sup _{f\in \mathcal {F}}|f(Z)|\le 1. \end{aligned}$$

Equivalently, we are interested in the following class of distributions

$$\begin{aligned} \mathcal {P}=\left\{ \mu : |f(Z)|\le 1 \text { for any } f\in \mathcal {F} \text { and any } Z \text { generated from } \mu \right\} . \end{aligned}$$

Since \(\mu \) is unknown (otherwise we would have achieved our goal of learning), we study the worst case of the \(\mu \)-estimation error; we therefore define

$$\begin{aligned} \sup _{\mu \in \mathcal {P}} \mathbb {E}_{\mu } ||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F} \end{aligned}$$

to be the universal estimation error.

2 Preliminaries

There are many classical ways to describe the complexity of a class of functions. For instance, the growth function and the VC dimension can be used for binary classification hypothesis spaces. In more general settings, one can also use the Rademacher complexity. However, these quantities are not very transparent: when using them, one cannot tell how fast the empirical risk minimizer approaches the risk minimizer as the data size increases. In this paper, we will use the entropy and the Fat-shattering dimension to describe the complexity.

2.1 Rademacher average

The first step in studying the \(\mu \)-estimation error is to study the Rademacher average: for a fixed empirical measure \(\mu _n\), we define the Rademacher average [3, 14] by

$$\begin{aligned} R(\mathcal {F}/\mu _n)= \mathbb {E}_r \sup _{f\in \mathcal {F}} \left| \frac{1}{n}\sum _{i=1}^n r_i f(Z_i)\right| \end{aligned}$$
(2.1)

where \(r_1, \ldots , r_n\) are i.i.d. Rademacher random variables satisfying \(P(r=-1)=P(r=1)=1/2\) and \(\mathbb {E}_r\) denotes the expectation with respect to the Rademacher variables. We also define the Rademacher process associated with the empirical measure \(\mu _n\) by

$$\begin{aligned} X_{rad}(f)=\frac{1}{n} \sum _{i=1}^n r_i f(Z_i). \end{aligned}$$
(2.2)
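
The Rademacher average (2.1) is straightforward to estimate numerically. The following sketch (illustrative only; the class and sample are arbitrary choices) does so by Monte Carlo for a class represented by its values on a fixed sample.

```python
# Illustrative sketch: Monte Carlo estimate of the Rademacher average (2.1)
# for a class represented by its values on a fixed sample Z_1, ..., Z_n.
import numpy as np

rng = np.random.default_rng(3)

def rademacher_average(values, n_draws=2000):
    """values: array of shape (|F|, n) with entries f(Z_i); returns a Monte
    Carlo estimate of E_r sup_f |(1/n) sum_i r_i f(Z_i)|."""
    m, n = values.shape
    r = rng.choice([-1.0, 1.0], size=(n_draws, n))     # i.i.d. Rademacher signs
    return np.mean(np.max(np.abs(values @ r.T), axis=0)) / n

# Example: F = {sin, cos, tanh} evaluated on 500 points (arbitrary choices).
Z = rng.standard_normal(500)
values = np.stack([np.sin(Z), np.cos(Z), np.tanh(Z)])
print(rademacher_average(values))
```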

It is known that the Rademacher average controls the \(\mu \)-estimation error:

Theorem 2.1

[10] If \(\mathcal {F}\) is a class of functions mapping into \([-M,M]\), then for every integer n, we have

$$\begin{aligned} \mathbb {E}_\mu \sup _{f\in \mathcal {F}} \left| \frac{1}{n}\sum _{i=1}^n (f(Z_i)-\mathbb {E}_{\mu }f )\right| \le 2 \mathbb {E}_{\mu \times r} \sup _{f\in \mathcal {F}} \left| \frac{1}{n} \sum _{i=1}^n r_i f(Z_i)\right| \nonumber \\ \le M\mathbb {E}_\mu \sup _{f\in \mathcal {F}} \left| \frac{1}{n}\sum _{i=1}^n (f(Z_i)-\mathbb {E}_{\mu }f )\right| +\mathbb {E}_{r}\left| \frac{1}{n} \sum _{i=1}^n r_i\right| . \end{aligned}$$
(2.3)

From this, we see that the term \(\mathbb {E}_{\mu } ||\mathbb {E}_{\mu _n}-\mathbb {E}_\mu ||_\mathcal {F}\) is comparable to the expectation of the Rademacher average up to a term of \(O(n^{-1/2})\).

2.2 Covering number and fat-shattering dimension

To get more explicit bounds, we need two more concepts. In what follows, logarithms are taken base 2, and the \(L_p(\mu _n)\) norm of f is defined as \(\big ( \frac{1}{n}\sum _{i=1}^n |f(Z_i)|^p \big )^{1/p}\).

Definition 2.2

For an arbitrary semi-metric space (T, d), the covering number \(\mathbb {N}(\epsilon , T,d)\) is the minimal number of closed d-balls of radius \(\epsilon \) required to cover T; see [8, 10]. The associated entropy \(\log \mathbb {N}(\epsilon , T,d)\) is the logarithm of the covering number.
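
For a class given by its values on a sample, an upper bound on \(\mathbb {N}(\epsilon , \mathcal {F}, L_2(\mu _n))\) can be computed greedily, as in the following sketch (illustrative only; the class of clipped linear functions is an arbitrary choice).

```python
# Illustrative sketch: a greedy upper bound on the covering number
# N(eps, F, L_2(mu_n)) for a class given by its values on the sample.
import numpy as np

def covering_number_greedy(values, eps):
    """values: (|F|, n) array of f(Z_i).  Greedily picks centers until every
    function is within eps of some center in L_2(mu_n); the result is an
    upper bound on the minimal covering number."""
    uncovered = np.ones(values.shape[0], dtype=bool)
    centers = 0
    while uncovered.any():
        c = np.flatnonzero(uncovered)[0]                   # next uncovered f
        dist = np.sqrt(np.mean((values - values[c]) ** 2, axis=1))
        uncovered &= dist > eps                            # remove its eps-ball
        centers += 1
    return centers

rng = np.random.default_rng(4)
Z = rng.uniform(-1, 1, 50)
values = np.stack([np.clip(w * Z, -1, 1) for w in np.linspace(-3, 3, 200)])
print([covering_number_greedy(values, eps) for eps in (0.5, 0.25, 0.1)])
```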

We also define another concept, which is often easier to compute: the Fat-shattering dimension.

Definition 2.3

For every \(\epsilon >0\), a set \(A=\{Z_1,\ldots ,Z_n\}\) is said to be \(\epsilon \)-shattered by \(\mathcal {F}\) if there exists some real function \(s:A\rightarrow \mathbb {R}\) such that for every \(I\subset \{1,\ldots ,n\}\) there exists some \(f_I\in \mathcal {F}\) such that \(f_I(Z_i)\ge s(Z_i)+\epsilon \) if \(i\in I\), and \(f_I(Z_i)\le s(Z_i)-\epsilon \) if \(i\notin I\).

$$\begin{aligned} \text {fat}_{\epsilon }(\mathcal {F}):=\sup \big \{|A| \,\big |\, A\subset \Omega , A\ \text {is}\ \epsilon \text {-shattered by} \ \mathcal {F}\big \} \end{aligned}$$

is called the Fat-shattering dimension; \(f_I\) is called the shattering function of the set I, and the set \(\{s(Z_i)\,|\, Z_i\in A\}\) is called a witness to the \(\epsilon \)-shattering.
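
Definition 2.3 can be tested directly on small examples. The sketch below (illustrative only) checks \(\epsilon \)-shattering for a class given by its values on a candidate set A, using the midpoint witness; success certifies shattering, while failure does not rule it out, since other witnesses might work.

```python
# Illustrative sketch: a brute-force certificate of eps-shattering
# (Definition 2.3) using the midpoint witness.
import itertools
import numpy as np

def certifies_eps_shattering(values, eps):
    """values: (|F|, |A|) array of f(Z_i) for Z_i in A."""
    k = values.shape[1]
    s = (values.max(axis=0) + values.min(axis=0)) / 2       # candidate witness
    for bits in itertools.product([False, True], repeat=k):
        I = np.array(bits)
        realized = any(
            np.all(f[I] >= s[I] + eps) and np.all(f[~I] <= s[~I] - eps)
            for f in values
        )
        if not realized:
            return False
    return True

# A toy class realizing every sign pattern on 3 points: eps-shattered for eps <= 1.
values = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
print(certifies_eps_shattering(values, 0.5))                 # True
```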

Note that both the Fat-shattering dimension and the covering number are non-decreasing as \(\epsilon \) decreases and since \(||f||_{L_{1}(\mu _n)}\le ||f||_{L_{2}(\mu _n)} \le ||f||_{L_{\infty }(\mu _n)}\), we know that

$$\begin{aligned} \mathbb {N}(\epsilon ,\mathcal {F}, L_1(\mu _n))\le \mathbb {N}(\epsilon ,\mathcal {F}, L_2(\mu _n))\le \mathbb {N}(\epsilon ,\mathcal {F}, L_{\infty }(\mu _n)). \end{aligned}$$
(2.4)

The Fat-shattering dimension is in fact equivalent to the entropy up to a logarithmic factor involving the Fat-shattering dimension; see [14, pp. 252–253]:

Lemma 2.4

If \(|f|\le 1\) for any \(f\in \mathcal {F}\), then

$$\begin{aligned} \sup _{\mu _n}\log \mathbb {N}(\epsilon , \mathcal {F},L_1(\mu _n)) \ge \text {fat}_{16\epsilon } (\mathcal {F})/8. \end{aligned}$$
(2.5)

Lemma 2.5

For every empirical measure \(\mu _n\) and \(p\ge 1\), there is a constant \(c_p\) such that

$$\begin{aligned} \log \mathbb {N}(\epsilon ,\mathcal {F}, L_p(\mu _n))\le c_p \text {fat}_{\frac{\epsilon }{8}}(\mathcal {F})\log ^2\left( \frac{2\text {fat}_{\frac{\epsilon }{8}}(\mathcal {F})}{\epsilon }\right) . \end{aligned}$$
(2.6)

2.3 Maximal inequality

In order to study the maxima of a class of random variables, we begin with the simple case when the class is finite. In this case, we have

$$\begin{aligned} ||\text {max}_{1\le i\le m} X_i||_p\le (\mathbb {E} \text {max}_{1\le i\le m} |X_i|^p)^{1/p}\le m^{1/p} \text {max}_{1\le i\le m}||X_i||_p. \end{aligned}$$
(2.7)

As m increases, this type of bound grows quickly, so it does not give a satisfactory result. To overcome this, we introduce the Orlicz 2-norm and the corresponding maximal inequality:

Definition 2.6

Let \(\psi _2(x)=e^{x^2}-1\). The Orlicz norm \(||\cdot ||_{\psi _2}\) of a random variable is defined by (see [10] for more details)

$$\begin{aligned} ||X||_{\psi _2}:=\text {inf}\left\{ c>0: \mathbb {E}\psi _2\left( \frac{|X|}{c}\right) \le 1\right\} . \end{aligned}$$
(2.8)
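
The Orlicz norm (2.8) can be approximated numerically by replacing the expectation with an average over a large i.i.d. sample and bisecting over the scale c, as in the following sketch (illustrative only).

```python
# Illustrative sketch: approximating the Orlicz norm (2.8) from samples of X.
import numpy as np

def psi2_norm(samples, tol=1e-4):
    def violates(c):                    # True if E[psi_2(|X|/c)] > 1 (estimated)
        t = np.minimum((np.abs(samples) / c) ** 2, 700.0)   # avoid overflow
        return np.mean(np.exp(t)) - 1.0 > 1.0
    lo, hi = tol, 10.0 * (np.max(np.abs(samples)) + tol)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if violates(mid):
            lo = mid                    # c too small, increase it
        else:
            hi = mid                    # c feasible, try a smaller one
    return hi

rng = np.random.default_rng(5)
print(psi2_norm(rng.standard_normal(10**6)))   # for N(0,1) the exact value is sqrt(8/3) ~ 1.63
```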

Note that \(||X||_{\psi _2}\ge ||X||_{L_1}\) since \(\psi _2(x) \ge x\). The Orlicz norm is more sensitive to the tail behavior of X, which makes it possible to obtain a better bound when we bound the maximum of many variables with light tails. The following lemma gives such a bound [11, Chapter 8]:

Lemma 2.7

Let \(X_1,X_2,\ldots ,X_m\) be random variables. Then

$$\begin{aligned} || \sup _{1\le i\le m} X_i||_{\psi _2}\le 4 \sqrt{ \log (m+1)} \sup _{1\le i\le m} ||X_i||_{\psi _2}. \end{aligned}$$
(2.9)

Random variables arising from the Rademacher process have the nice property that their tails decrease very fast. The following result can be found in Kosorok [11, Chapter 8]:

Lemma 2.8

Define

$$\begin{aligned} X(a)=\sum _{i=1}^n r_i a_i, \ a\in \mathbb {R}^n, \end{aligned}$$

where \(r_1, \ldots , r_n\) are i.i.d. Rademacher random variables satisfying \(P(r=-1)=P(r=1)=1/2\). Let \(a=(a_1,\ldots ,a_n)\in \mathbb {R}^n\). Then we have

$$\begin{aligned} P\left( \left| \sum _{i=1}^n r_i a_i\right| >x\right) \le 2 e^{-\frac{1}{2}x^2/||a||^2} \end{aligned}$$
(2.10)

for the Euclidean norm \(||\cdot ||\). Hence \(||X(a)||_{\psi _2}\le \sqrt{6}||a||.\)
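
The tail bound (2.10) is easy to check empirically, as in the following sketch (illustrative only; the coefficient vector a is an arbitrary choice).

```python
# Illustrative sketch: empirical check of the tail bound (2.10) for a
# Rademacher sum with an arbitrary fixed coefficient vector a.
import numpy as np

rng = np.random.default_rng(6)
a = rng.uniform(-1, 1, 50)
norm_a = np.linalg.norm(a)

r = rng.choice([-1.0, 1.0], size=(100000, a.size))
S = r @ a                                        # samples of sum_i r_i a_i

for x in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(S) > x * norm_a)
    hoeffding = 2 * np.exp(-0.5 * x ** 2)        # (2.10) at level x * ||a||
    print(x, empirical, hoeffding)               # empirical tail stays below the bound
```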

Our main technique comes from Mendelson [14], who studied the Gaussian average rather than the Rademacher average; the Gaussian average is defined by

$$\begin{aligned} l(\mathcal {F}/\mu _n)=\frac{1}{\sqrt{n}} \mathbb {E}_g \underset{f\in \mathcal {F}}{\text {sup}} \big |\sum _{i=1}^n g_if(Z_i)\big |, \end{aligned}$$

where the \(g_i\) are independent standard Gaussian random variables and \(\mathbb {E}_g\) denotes the expectation with respect to these Gaussian variables. Note that the factor \(1/\sqrt{n}\) is used in his definition instead of \(1/n\). Mendelson proved that if \(p<2\) (p being the exponent in the estimate \(\text {fat}_\epsilon (\mathcal {F})\le \gamma \epsilon ^{-p}\)), the Gaussian averages are uniformly bounded, while if \(p>2\), they may grow at the rate \(n^{\frac{1}{2}-\frac{1}{p}}\), and this bound is tight for Gaussian averages. In [13, 17], it was shown that the Gaussian and the Rademacher averages are closely related and have the following connection:

Theorem 2.9

There are absolute constants c and C such that for every n and \(\mathcal {F}\)

$$\begin{aligned} c(1+\log n)^{-\frac{1}{2}}\mathbb {E}_g \underset{f\in \mathcal {F}}{\sup } \big |\sum _{i=1}^n g_if(Z_i)\big |\le \mathbb {E}_r \underset{f\in \mathcal {F}}{\sup } \big |\sum _{i=1}^n r_if(Z_i)\big |\le C\mathbb {E}_g \underset{f\in \mathcal {F}}{\sup } \big |\sum _{i=1}^n g_if(Z_i)\big |. \end{aligned}$$
(2.11)

Using the above theorem and the results in [14], one obtains an upper bound for the expectation of the Rademacher average, but it does not tell us whether the bound is tight. In the following section, we give a direct proof of the upper bound for the expectation of the Rademacher average, and in section 4 we argue that the bound is tight.

3 Upper bound

To bound the empirical Rademacher average, we use the following theorem, which follows from the standard “chaining” method; see [11, Chapter 8].

Theorem 3.1

Let \(\mu _n\) be an empirical measure, let \(|f|\le 1\) for all \(f\in \mathcal {F}\) with \(f_0\equiv 0\in \mathcal {F}\), and let \((\epsilon _k)_{k=0}^{\infty }\) be a monotone sequence decreasing to 0 with \(\epsilon _0=1\). Then there exists an absolute constant C such that for any integer N,

$$\begin{aligned} R(\mathcal {F}/\mu _n)\le Cn^{-\frac{1}{2}}\sum _{k=1}^N \epsilon _{k-1} \log ^{\frac{1}{2}} \mathbb {N}(\epsilon _k, \mathcal {F}, L_2(\mu _n)) +\epsilon _N. \end{aligned}$$
(3.1)

Proof

Note that if \(\mathbb {N}(\epsilon _i, \mathcal {F},L_2(\mu _n))\) is infinite for some \(\epsilon _i\), the inequality trivially holds. Hence we can, without loss of generality, assume that the covering numbers appearing in the inequality are all finite.

Construct a sequence of finite covering sets \(\mathcal {F}_0,\mathcal {F}_1,\ldots , \mathcal {F}_N\) such that \(\mathcal {F}_i \subset \mathcal {F}\) and \(\mathcal {F}_i\) is a minimal \(\epsilon _i\)-cover of the semi-metric space \((\mathcal {F}, L_2(\mu _n))\). For each \(f\in \mathcal {F}\) we can find \(f_N\in \mathcal {F}_N\) such that \(||f-f_N||_{L_2(\mu _n)}\le \epsilon _N\). Now we fix the empirical measure \(\mu _n\) and study the associated Rademacher process \(X_{rad}(f)=\frac{1}{n}\sum _{i=1}^nr_i f(Z_i)\). Applying the triangle inequality to the Rademacher average, we get

$$\begin{aligned} R(\mathcal {F}/\mu _n)=\mathbb {E}_r\underset{f\in \mathcal {F}}{\text {sup}}|X_{rad}(f)|\le \mathbb {E}_r \underset{f\in \mathcal {F}}{\text {sup}} |X_{rad}(f-f_N)|+\mathbb {E}_r\underset{f_N\in \mathcal {F}_N}{\text {sup}}|X_{rad}(f_N)|. \end{aligned}$$
(3.2)

The first term on the right-hand side can be bounded as follows

$$\begin{aligned}&\mathbb {E}_r \underset{f\in \mathcal {F}}{\text {sup}} \frac{1}{n} \big |\sum _{i=1}^n r_i (f-f_N)(Z_i)\big |\le \underset{f\in \mathcal {F}}{\text {sup}} \sqrt{\frac{1}{n} \sum _{i=1}^n (f-f_N)^2(Z_i)}\nonumber \\&\quad =\underset{f\in \mathcal {F}}{\text {sup}} ||f-f_N||_{L_2(\mu _n)}\le \epsilon _N. \end{aligned}$$
(3.3)

The magnitude of the second term \(\mathbb {E}_r \sup _{f_N\in \mathcal {F}_N}|X_{rad}(f_N)|\) is determined by the size of \(\mathcal {F}_N\). We now use the following chaining argument: for any \(f_k\in \mathcal {F}_k\), there is an \(f_{k-1}\in \mathcal {F}_{k-1}\) such that \(f_k\) lies in the \(\epsilon _{k-1}\)-ball centered at \(f_{k-1}\) in the semi-metric space \((\mathcal {F}, L_2(\mu _n))\). We say that \(f_{k-1}\) is chained to \(f_k\), denoted \(f_{k-1}\rightarrow f_{k}\). Using the triangle inequality, we have

$$\begin{aligned} \underset{f_N\in \mathcal {F}_N}{\text {sup}} |X_{rad}(f_N)|\le \sum _{k=1}^N \underset{f_{k-1}\rightarrow f_k}{\text {sup}} |X_{rad}(f_k)-X_{rad}(f_{k-1})|+|X_{rad}(f_0)| \end{aligned}$$
(3.4)

Since for any \(f\in \mathcal {F}\), \(||f-f_0||_{L_2(\mu _n)}\le 1\), and \(\mathcal {F}_0=\{f_0\equiv 0\}\), the term \(|X_{rad}(f_0)|\) vanishes. Taking the \(\psi _2\) norm on both sides and using the triangle inequality again for the \(\psi _2\) norm, we obtain

$$\begin{aligned} \big |\big |\underset{f_N\in \mathcal {F}_N}{\text {sup}} |X_{rad}(f_N)| \big |\big |_{\psi _2}\le \sum _{k=1}^N \big |\big |\underset{f_{k-1}\rightarrow f_k}{\text {sup}} |X_{rad}(f_k-f_{k-1})| \big |\big |_{\psi _2}. \end{aligned}$$
(3.5)

Since \(\mathbb {N}(\epsilon _k, \mathcal {F}, L_2(\mu _n)) \ge \mathbb {N}(\epsilon _{k-1}, \mathcal {F}, L_2(\mu _n))\), the number of choices of the chaining pair \((f_{k-1}\rightarrow f_{k})\) is bounded by \(\mathbb {N}^2(\epsilon _k, \mathcal {F}, L_2(\mu _n))\). Applying Lemma 2.7, the maximal inequality, to each term on the right-hand side of (3.5), we have

$$\begin{aligned} \big |\big |\underset{f_{k-1}\rightarrow f_k}{\text {sup}}|X_{rad}(f_k-f_{k-1})|\big |\big |_{\psi _2} \le 4 \text {log}^{\frac{1}{2}} (\mathbb {N}^2(\epsilon _k, \mathcal {F}, L_2(\mu _n))+1) (\underset{f_{k-1}\rightarrow f_k}{\text {sup}}||X_{rad}(f_k-f_{k-1})||_{\psi _2}).\nonumber \\ \end{aligned}$$
(3.6)

As long as the covering number is bigger than 1, the factor

$$\begin{aligned} 4\text {log}^{1/2}(\mathbb {N}^{2}(\epsilon _k, \mathcal {F}, L_2(\mu _n))+1) \end{aligned}$$

is bounded by \(9 \text {log}^{1/2}(\mathbb {N}(\epsilon _k, \mathcal {F}, L_2(\mu _n)))\). Moreover, by Lemma 2.8, we have

$$\begin{aligned} \underset{f_{k-1}\rightarrow f_k}{\text {sup}} ||X_{rad}(f_k-f_{k-1})||_{\psi _2} \le \underset{f_{k-1}\rightarrow f_k}{\text {sup}} \sqrt{6}n^{-1/2}||f_k-f_{k-1}||_{L_2(\mu _n)} \end{aligned}$$

By construction, it is bounded by \(\sqrt{6}n^{-1/2}\epsilon _{k-1}\). So we have

$$\begin{aligned}&\mathbb {E}_r \underset{f_N\in \mathcal {F}_N}{\text {sup}} |X_{rad}(f_N)| \le \big |\big |\underset{f_N\in \mathcal {F}_N}{\text {sup}} |X_{rad}(f_N)| \big |\big |_{\psi _2}\nonumber \\&\quad \le Cn^{-1/2}\sum _{k=1}^N\epsilon _{k-1}\text {log}^{\frac{1}{2}} \mathbb {N}(\epsilon _k, \mathcal {F},L_2(\mu _n)). \end{aligned}$$
(3.7)

\(\square \)

In [14], Mendelson derived a similar upper bound for the Gaussian average; details of this chaining technique can also be found in [15].
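
To make the chaining bound (3.1) concrete, the following sketch (illustrative only; the entropy model and the constants are placeholders, not part of the analysis) evaluates its right-hand side numerically.

```python
# Illustrative sketch: numerical evaluation of the right-hand side of the
# chaining bound (3.1) under the assumed entropy model
# log N(eps, F, L_2(mu_n)) = gamma * eps^{-p}, with eps_k = 2^{-k} and the
# truncation level N chosen to minimize the bound.
import numpy as np

def chaining_bound(n, gamma, p, C=1.0, max_N=60):
    best = np.inf
    for N in range(1, max_N + 1):
        eps = 2.0 ** -np.arange(0, N + 1)            # eps_0, ..., eps_N
        entropy = gamma * eps[1:] ** (-p)            # log N(eps_k), k = 1..N
        value = C / np.sqrt(n) * np.sum(eps[:-1] * np.sqrt(entropy)) + eps[-1]
        best = min(best, value)
    return best

for p in (1.0, 2.0, 4.0):
    print(p, [chaining_bound(n, gamma=1.0, p=p) for n in (10**3, 10**5, 10**7)])
# The values decay roughly like n^{-1/2} for p < 2 and n^{-1/p} for p > 2,
# in line with the rates stated below.
```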

We now present the bound for the Rademacher average using the Fat-shattering dimension:

Theorem 3.2

Assume that for some \(\gamma >1\), \(\text {fat}_\epsilon (\mathcal {F})\le \gamma \epsilon ^{-p}\) holds for all \(\epsilon >0\). Then there exists a constant \(C_p\), which depends only on p, such that for any empirical measure \(\mu _n\)

$$\begin{aligned} R(\mathcal {F}/\mu _n) \le {\left\{ \begin{array}{ll} C_p\gamma ^{\frac{1}{2}}\log \gamma \ n^{-1/2} &{} \mathrm{if}\ 0<p<2,\\ C_2\gamma ^{\frac{1}{2}}\log \gamma n^{-1/2}\ \log ^2 n&{} \mathrm{if} \ p=2,\\ C_p\gamma ^{\frac{1}{2}}\log \gamma n^{-1/p}\ &{} \mathrm{if}\ p>2. \end{array}\right. } \end{aligned}$$
(3.8)

Proof

Let \(\mu _n\) be an empirical measure. When \(p<2\), we know the sum on the right-hand side of inequality (3.1) can be bounded using Lemma 2.5 as follows:

$$\begin{aligned}&n^{-1/2}\sum _{k=1}^N \epsilon _{k-1} \log ^{\frac{1}{2}} \mathbb {N}(\epsilon _k, \mathcal {F}, L_2(\mu _n))\le n^{-1/2}\int _0^{\infty }\text {log}^{\frac{1}{2}} \mathbb {N}(\epsilon , \mathcal {F},L_2(\mu _n)) d\epsilon \nonumber \\&\quad \le C_p \gamma ^{\frac{1}{2}} \text {log} \gamma \ n^{-1/2}. \end{aligned}$$
(3.9)

Assume that \(p\ge 2\). Let \(\epsilon _k=2^{-k}\) and \(N=p^{-1}\text {log}\ n\). Using Theorem 3.1 and Lemma 2.5, we have

$$\begin{aligned} R(\mathcal {F}/\mu _n)&\le C_p n^{-1/2}\gamma ^{\frac{1}{2}}\text {log}\,\gamma \sum _{k=1}^N \epsilon _k^{1-\frac{p}{2}} \text {log}\big (\frac{2}{\epsilon _k}\big )+2\epsilon _N\nonumber \\&\le C_pn^{-1/2}\gamma ^{\frac{1}{2}} \text {log} \gamma \sum _{k=1}^N k2^{k\left( \frac{p}{2}-1\right) }+2n^{-\frac{1}{p}}. \end{aligned}$$
(3.10)

If \(p=2\), the sum is bounded by:

$$\begin{aligned} C_pn^{-1/2}(\gamma ^{\frac{1}{2}} \text {log}\gamma )N^{2}\le C_p(\gamma ^{\frac{1}{2}} \text {log} \gamma ) n^{-1/2}\text {log}^2 n. \end{aligned}$$

If \(p>2\), it is bounded by

$$\begin{aligned} C_p(\gamma ^{\frac{1}{2}} \text {log} \gamma ) n^{-1/p}. \end{aligned}$$

\(\square \)

We also present the entropy version of the upper bound; the proof follows from the same argument.

Theorem 3.3

Assume that for some \(\gamma >1\), \(\log \mathbb {N}(\epsilon ,\mathcal {F},L_2(\mu _n))\le \gamma \epsilon ^{-p}\) holds for all \(\epsilon >0\). Then there exists a constant \(C_p\), which depends only on p, such that for any empirical measure \(\mu _n\)

$$\begin{aligned} R(\mathcal {F}/\mu _n) \le {\left\{ \begin{array}{ll} C_p\gamma ^{\frac{1}{2}} n^{-1/2} &{} \mathrm{if}\ 0<p<2, \\ C_2\gamma ^{\frac{1}{2}} n^{-1/2} \log n&{} \mathrm{if} \ p=2,\\ C_p\gamma ^{\frac{1}{2}} n^{-1/p} &{} \mathrm{if}\ p>2. \end{array}\right. } \end{aligned}$$
(3.11)

By taking the expectation of \(R(\mathcal {F}/\mu _n)\) in Theorems 3.2 and 3.3 and then applying Theorem 2.1, we also obtain upper bounds for the corresponding \(\mu \)-estimation error and the universal estimation error.

4 Lower bound

In this section, we prove that for a suitable underlying distribution \(\mu \), the Fat-shattering dimension provides a lower bound for the Rademacher average (and hence for the universal estimation error), and that this bound is tight. A similar lower bound for the Gaussian average can be found in [14].

Theorem 4.1

If \(\text {fat}_\epsilon (\mathcal {F})\ge \gamma \epsilon ^{-p}\) for some \(\gamma >0\) and all \(\epsilon >0\), then there exists a measure \(\mu \in \mathcal {P}\) and a constant c such that

$$\begin{aligned} \mathbb {E}_{\mu }R(\mathcal {F}/\mu _n) \ge c n^{-\frac{1}{p}}. \end{aligned}$$

Proof

By the definition of the Fat-shattering dimension, for every integer n, setting \(\epsilon =(\gamma /n)^{1/p}\), there exists a set \(\{Z_1,Z_2,\ldots , Z_n\}\) which is \(\epsilon \)-shattered by \(\mathcal {F}\); by the definition of shattering, all the \(Z_i\) are distinct. Let \(\mu \) be the measure uniformly distributed on \(\{Z_1,Z_2,\ldots , Z_n\}\).

Let \(Z_1^*,\ldots , Z^*_n\) be data generated independently from \(\mu \) and let \(\mu _n\) be the corresponding empirical measure. Assume that \(Z_i\) appears \(n_i\) times among \(Z_1^*,\ldots ,Z_n^*\). Then we have:

$$\begin{aligned} R(\mathcal {F}/\mu _n)= & {} \frac{1}{n} \mathbb {E}_r \sup _{f\in \mathcal {F}} \left| \sum _{i=1}^n\sum _{k=1}^{n_i} r_{i,k}f(Z_i)\right| \end{aligned}$$
(4.1)
$$\begin{aligned}\ge & {} \frac{1}{2n} \mathbb {E}_r \sup _{f,f'\in \mathcal {F}} \sum _{i=1}^n\sum _{k=1}^{n_i} r_{i,k}(f(Z_i)-f'(Z_i)) \end{aligned}$$
(4.2)

where the \(r_{i,k}\) are independent Rademacher random variables.

For each i with \(n_i>0\), we have \(P(\sum _{k=1}^{n_i} r_{i,k}=0)\le \frac{1}{2}\). For a realization of the \(r_{i,k}\), set \(A=\{i: \sum _{k=1}^{n_i} r_{i,k}>0 \}\). Let \(f_A\) be the shattering function of the set A, and \(f_{A^c}\) the shattering function of its complement \(A^c\). Also, denote by \(n^*\) the number of i’s for which \(n_i>0\). Then we have

$$\begin{aligned} \sup _{f,f'\in \mathcal {F}} \sum _{i=1}^n \sum _{k=1}^{n_i} r_{i,k}(f(Z_i)-f'(Z_i)) \ge \sum _{i=1}^n \sum _{k=1}^{n_i} ( r_{i,k}(f_A(Z_i)-f_{A^c}(Z_i))). \end{aligned}$$
(4.3)

Whenever \(\sum _{k=1}^{n_i} r_{i,k}\ne 0\), we have \(\sum _{k=1}^{n_i} r_{i,k}(f_A(Z_i)-f_{A^c}(Z_i))\ge 2\epsilon \) for that i. So we know

$$\begin{aligned} R(\mathcal {F}/\mu _n)\ge \frac{1}{2n}\mathbb {E}_r \sum _{i=1}^n \sum _{k=1}^{n_i} ( r_{i,k}(f_A(Z_i)-f_{A^c}(Z_i))) \ge \frac{1}{2n} \epsilon n^*. \end{aligned}$$
(4.4)

The last inequality holds because for each i with \(n_i>0\), the probability that \(\sum _{k=1}^{n_i} r_{i,k}=0\) is at most 1/2.

Taking the expectation of inequality (4.4), we have

$$\begin{aligned} \mathbb {E}_{\mu }R(\mathcal {F}/\mu _n)\ge \mathbb {E}_\mu \left( \frac{1}{2n}\epsilon n^*\right) , \end{aligned}$$
(4.5)

where \(n^*\) is the number of \(Z_i\)’s that appear among \(Z_1^*, \ldots ,Z_n^*\). We know

$$\begin{aligned} \mathbb {E}_\mu (n^*)=n \left( 1-\left( \frac{n-1}{n}\right) ^n\right) >\left( 1-\frac{1}{e}\right) n. \end{aligned}$$
(4.6)

For \(\epsilon ={(\gamma /n)}^{1/p}\), we obtain the following lower bound

$$\begin{aligned} \mathbb {E}_{\mu }R(\mathcal {F}/\mu _n) \ge \left( \frac{1}{2}-\frac{1}{2e}\right) \gamma ^{1/p} n^{-\frac{1}{p}}. \end{aligned}$$
(4.7)

Combining this with Theorem 2.1, for \(p\ge 2\) we conclude that there also exists a constant \(c_1\) such that

$$\begin{aligned} \mathbb {E}_{\mu } \sup _{f\in \mathcal {F}} \left| \frac{1}{n}\sum _{i=1}^n f(Z_i)-\mathbb {E}_{\mu }f\right| >c_1n^{-\frac{1}{p}}. \end{aligned}$$

\(\square \)

In the previous section and this one, we have proved that for \(p>2\), the expectation of the Rademacher average is bounded above and below by \(O(n^{-1/p})\). Since \(O(n^{-1/2})\) is negligible compared with \(O(n^{-1/p})\), Theorem 2.1 implies that the universal estimation error is bounded by \(n^{-1/p}\) and that this bound is tight.

For \(p<2\), the upper bound gives a convergence rate of \(O(n^{-1/2})\), and in this case \(\mathcal {F}\) is a Donsker class [10]. As long as the limit of the empirical process is non-trivial, the rate \(O(n^{-1/2})\) is optimal.

5 Excess loss class or hypothesis class

It may seem somewhat indirect to study the excess loss class \(\mathcal {F}\) rather than \(\mathcal {H}\) itself. However, for the most common loss functions L, the complexity of the excess loss class \(\mathcal {F}\) can be controlled by the complexity of the hypothesis space \(\mathcal {H}\). For example, assume that the loss function L is K-Lipschitz in its first argument, i.e. for all \(\hat{y}_1,\hat{y}_2,y\), we have

$$\begin{aligned} |L(\hat{y}_1,y)-L(\hat{y}_2,y)|\le K|\hat{y}_1-\hat{y}_2|. \end{aligned}$$
(5.1)

Since we also have \(f^*\equiv 0\in \mathcal {F}\), it is not hard to prove that the Rademacher average of the excess loss class can be bounded in terms of the average of the hypothesis space:

$$\begin{aligned} R(\mathcal {F}/\mu _n)\le KR(\mathcal {H}/\mu _n). \end{aligned}$$
(5.2)
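
The two sides of (5.2) can be compared numerically. The sketch below (illustrative only; the linear class, the absolute loss, and the reference hypothesis h0 standing in for \(h^*\) are all arbitrary choices) estimates both Rademacher averages by Monte Carlo.

```python
# Illustrative sketch: Monte Carlo comparison of the two sides of (5.2) for
# the absolute loss (K = 1) on a toy linear class; h0 is an arbitrary
# stand-in for h*.
import numpy as np

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-1, 1, n)
Y = 0.3 * X + 0.1 * rng.standard_normal(n)

W = np.linspace(-1, 1, 201)                        # H = {x -> w x : |w| <= 1}
H_vals = np.outer(W, X)                            # values h(X_i)
w0 = 0.3                                           # stand-in for h*
F_vals = np.abs(H_vals - Y) - np.abs(w0 * X - Y)   # excess absolute losses

def rademacher_average(values, n_draws=2000):
    r = rng.choice([-1.0, 1.0], size=(n_draws, values.shape[1]))
    return np.mean(np.max(np.abs(values @ r.T), axis=0)) / values.shape[1]

print("R(F/mu_n):", rademacher_average(F_vals))    # compare with K * R(H/mu_n), cf. (5.2)
print("R(H/mu_n):", rademacher_average(H_vals))
```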

Thus the Rademacher average of \(\mathcal {H}\) bounds the Rademacher average of \(\mathcal {F}\). We also have the following lemma, which shows how to bound the entropy of \(\mathcal {F}\) by the entropy of \(\mathcal {H}\) when the q-loss function is used. The proof can be found in [14].

Lemma 5.1

If \(\mathcal {H}\) is uniformly bounded by 1, then for every \(1\le q\le \infty \) there is a constant \(C_q\) such that for every \(\epsilon >0\), every g bounded by 1, and every probability measure \(\mu \), we have

$$\begin{aligned} \log \mathbb {N}(\epsilon ,|\mathcal {H}-g|^q,L_2(\mu ))\le \log \mathbb {N}(C_q\epsilon ,\mathcal {H},L_2(\mu )). \end{aligned}$$
(5.3)

In the following case, we can further claim that the complexity of the excess loss class controls that of the hypothesis space.

Lemma 5.2

Assume \(\mathcal {H}\) has a uniform bound of 1. Let \(\mathcal {H}^{*}=\{h/4+3/4: h\in \mathcal {H}\}\). If

$$\begin{aligned} \mathcal {H}^{*}\subset \mathcal {H}, \end{aligned}$$

then there exists a constant c such that

$$\begin{aligned} \log \mathbb {N}(c\epsilon ,\mathcal {H},L_2(\mu ))\le \log \mathbb {N}(\epsilon ,(\mathcal {H}-g)^2,L_2(\mu )). \end{aligned}$$
(5.4)

Proof

It is easily seen from the definition that the covering number is translation invariant:

$$\begin{aligned} \mathbb {N}(\epsilon , \mathcal {H}, L_2(\mu _n))=\mathbb {N}(\epsilon ,\mathcal {H}-g,L_2(\mu _n)). \end{aligned}$$
(5.5)

Also, by the property \(\mathcal {H}^*\subset \mathcal {H}\), one can show that by enlarging the radius of the covering balls, the covering number of \(\mathcal {H}\) can be bounded by that of \(\mathcal {H}^{*}\):

$$\begin{aligned} \mathbb {N}(4\epsilon , \mathcal {H}, L_2(\mu _n))\le \mathbb {N}(\epsilon ,\mathcal {H}^{*},L_2(\mu _n)). \end{aligned}$$
(5.6)

Moreover, since \(\mathcal {H}^*\) is bounded below by 1/2, we have \(|h_1^2-h_2^2|\ge |h_1-h_2|\) for \(h_1,h_2\in \mathcal {H}^*\); therefore the covering number of \(\mathcal {H}^*\) can be bounded by the covering number of \((\mathcal {H}^*) ^2\). And because \(\mathcal {H}^{*}\subset \mathcal {H}\), the covering number of \((\mathcal {H})^2\) bounds the covering number of \((\mathcal {H}^*) ^2\), and hence the covering numbers of \(\mathcal {H}^*\) and \(\mathcal {H}\). Together with the translation invariance property, the result follows. \(\square \)

We will see in later applications that the condition \(\mathcal {H}^{*}\subset \mathcal {H}\) can actually be achieved in many scenarios.

6 Application

6.1 VC classes for classification

We consider the binary classification problem. Assume \(\mathcal {F}\) has finite VC dimension V. Then there exists a constant C such that the estimation error is bounded by \(C\sqrt{V/n}\), which is optimal in the minimax sense; see [7] for more details.

From the definition of the VC dimension, we know that \(\text {fat}_{\epsilon }(\mathcal {F})=V\) for \(\epsilon <1\). In this case, we can take \(\gamma =V\) and \(p=1\). Under this setting, Theorem 3.2 shows that the associated Rademacher average is bounded above by \(C_1 \text {log} V\sqrt{V/n}\). This is optimal in terms of the data size and only a logarithmic factor in V worse than the best bound.

Remark 6.1

Faster rates can be achieved under margin assumptions on the distribution \(\mu \); see [12].

6.2 Regularized linear class

Assume that the input \(X\in \mathbb {R}^d\) satisfies \(||X||_q\le a\) and the linear weight vector satisfies the regularization condition \(||W||_p\le b\), where \(1/p+1/q=1\) and \(1\le p\le 2\). Consider the linear hypothesis space \(\mathcal {H}_p\) containing all functions of the form \(W\cdot X\). In [19], Zhang derived the following bound:

$$\begin{aligned} \log \mathbb {N}(\epsilon , \mathcal {H}_p,L_2(\mu _n))\le \left\lceil \frac{a^2b^2}{\epsilon ^2} \right\rceil \log (2d+1). \end{aligned}$$
(6.1)

He then obtained a bound on the estimation error for classification. We can use his result (6.1) in a more general setting, for example, for real-valued problems.

Fix the regularization condition \(||W||_p\le b\) and let \(\mathcal {H}_1\) be the hypothesis space for lasso regression and \(\mathcal {H}_2\) that for ridge regression, as follows:

$$\begin{aligned} \mathcal {H}_1= & {} \left\{ W\cdot X: ||W||_1\le b \text { and } ||X||_{\infty }\le 1/b\right\} \text { and}\\ \mathcal {H}_2= & {} \left\{ W\cdot X: ||W||_2\le b \text { and } ||X||_2\le 1/b\right\} . \end{aligned}$$

By Hölder's inequality, we have \(|W\cdot X|\le 1 \) for \(W\cdot X\in \mathcal {H}_1, \mathcal {H}_2\). The bound on the entropy together with Theorem 3.3 gives the upper bound for the Rademacher average:

$$\begin{aligned} R(\mathcal {H}_p/\mu _n) \le C_2 \sqrt{\frac{{\log (2d+1)}}{n}}\log n. \end{aligned}$$
(6.2)

Here \(C_2\) is the constant from Theorem 3.3. This bound yields a convergence rate for the regression estimation error.
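
For the ridge class the Rademacher average has a closed-form supremum over the weight ball, which makes a direct numerical comparison with (6.2) easy. The sketch below is illustrative only; the dimension, sample size and constant are arbitrary.

```python
# Illustrative sketch: Monte Carlo estimate of R(H_2/mu_n) for the ridge
# class, using sup_{||w||_2 <= b} |(1/n) sum_i r_i w.x_i| = (b/n)||sum_i r_i x_i||_2,
# compared with the rate appearing in (6.2) (up to the constant C_2).
import numpy as np

rng = np.random.default_rng(8)
d, n, b = 20, 2000, 1.0

X = rng.standard_normal((n, d))
X /= b * np.linalg.norm(X, axis=1, keepdims=True)      # enforce ||x||_2 <= 1/b

def ridge_rademacher(n_draws=2000):
    r = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return np.mean(b * np.linalg.norm(r @ X, axis=1)) / n

print("Monte Carlo R(H_2/mu_n):", ridge_rademacher())
print("rate in (6.2)          :", np.sqrt(np.log2(2 * d + 1) / n) * np.log2(n))
```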

6.3 Non-decreasing class and bounded variation class

Let \(\mathcal {H}_1\) and \(\mathcal {H}_2\) be the sets of all functions on \([0,T]\) taking values in \([-1,1]\) such that \(h_1\) is non-decreasing for any \(h_1\in \mathcal {H}_1\) and the total variation of \(h_2\) is bounded by V for any \(h_2\in \mathcal {H}_2\). If \(V\ge 2\), we have \(\mathcal {H}_1\subset \mathcal {H}_2\), so the Rademacher average of \(\mathcal {H}_2\) provides an upper bound for that of \(\mathcal {H}_1\). In [5], Bartlett proved the following theorem:

Theorem 6.2

For all \(\epsilon \le V/12\)

$$\begin{aligned} \log \mathbb {N}(\epsilon ,\mathcal {H}_2,L_1(\mu ))\le \frac{13V}{\epsilon }. \end{aligned}$$
(6.3)

From Lemma 2.4, we know that the Fat-shattering dimension has the bound:

$$\begin{aligned} fat_\epsilon (\mathcal {H}_2)\le \frac{128V}{\epsilon }. \end{aligned}$$
(6.4)

From Theorem 3.2, we know that the Rademacher average of \(\mathcal {H}_2\) achieves the convergence rate \(O(n^{-1/2})\), and hence so does that of \(\mathcal {H}_1\).

6.4 Multiple layer neural nets

We now present some evidence for why deep learning works. We assume that the magnitude of the input to each neuron is bounded and consider the following input domain for the neural net:

$$\begin{aligned} \Omega =\left\{ X\in \mathbb {R}^d:||X||_{\infty }\le B\right\} . \end{aligned}$$

Let \(\mathcal {H}_0\) be the class of functions on \(\Omega \) defined by

$$\begin{aligned} \mathcal {H}_0=\left\{ X=(X^1, X^2\ldots , X^d)\ \rightarrow X^i: 1\le i\le d\right\} . \end{aligned}$$

Let \(\sigma \) be the standard logistic sigmoid function, which is 1-Lipschitz. Define the hypothesis space recursively by:

$$\begin{aligned} \mathcal {H}_l=\bigg \{\sigma \big ( \sum _{i=1}^N w_i h_i\big ): N\in \mathbb {N}, h_i\in \mathcal {H}_{l-1} , \sum _{i=1}^N|w_i|\le C \bigg \} \end{aligned}$$

Define the C-convex hull of \(\mathcal {H}\) as

$$\begin{aligned} \text {conv}_C({\mathcal {H}})=\left\{ \sum c_ih_i:h_i \in \mathcal {H}, \sum |c_i|\le C \right\} . \end{aligned}$$

By the definition of Rademacher average, one can show

$$\begin{aligned} CR(\mathcal {H}/\mu _n)=R(\text {conv}_C(\mathcal {H})/\mu _n). \end{aligned}$$
(6.5)

One can also check that by composing \(\mathcal {H}\) with an L-Lipschitz function \(\sigma \), we have

$$\begin{aligned} R((\sigma \circ \mathcal {H})/\mu _n)\le LR(\mathcal {H}/\mu _n). \end{aligned}$$
(6.6)

Since the space \(\mathcal {H}_0\) contains only d functions, its \(\epsilon \)-covering number is bounded by d for any \(\epsilon \). Applying Theorem 3.3 with \(\gamma =\text {log}\ d\) and \(p=1\), we can bound \(R(\mathcal {H}_0/\mu _n)\) by \(C_1\sqrt{\text {log}d/n}\) for a positive constant \(C_1\). By induction on the number of layers, applying (6.5) and (6.6) alternately at each layer, we get

$$\begin{aligned} R(\mathcal {H}_l/\mu _n)\le C_1C^{l} \sqrt{\frac{\text {log}d}{n}}. \end{aligned}$$
(6.7)

Note that \(\mathcal {H}_l\) satisfies the requirement of Lemma 5.2. Hence for the \(L_2\) loss function, the Rademacher average of \(\mathcal {F}\) has a similar upper bound, differing only by a constant factor, and so does the universal estimation error.
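
The layer-wise recursion behind (6.7) can be traced numerically, as in the following sketch (illustrative only; the network sizes, the weight budget C and the number of layers are arbitrary choices).

```python
# Illustrative sketch: the recursion behind (6.7).  R(H_0/mu_n) is estimated
# by Monte Carlo for the d coordinate functions, and each layer multiplies the
# bound by C via (6.5) and (6.6) with a 1-Lipschitz sigma.
import numpy as np

rng = np.random.default_rng(9)
d, n, B, C, layers = 10, 5000, 1.0, 1.5, 3

X = rng.uniform(-B, B, size=(n, d))               # a sample from Omega

def R_H0(n_draws=2000):
    # E_r max_j |(1/n) sum_i r_i X_i^j| over the finite class of coordinates
    r = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return np.mean(np.max(np.abs(r @ X), axis=1)) / n

bound = R_H0()
print("layer 0:", bound)
for l in range(1, layers + 1):
    bound *= C                                    # one layer of the recursion
    print(f"layer {l}:", bound)
```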

Our result can be compared with the result of Bartlett [2]:

$$\begin{aligned} \text {log} \mathbb {N}(\epsilon ,\mathcal {H}_l, L_2(\mu _n))\le a\left( \frac{b}{\epsilon }\right) ^{2l}. \end{aligned}$$
(6.8)

Here a, b are factors independent of \(\epsilon \). From this bound, we can only obtain a universal estimation error bound of the form \(O(n^{-1/(2l)})\), which means that the convergence rate deteriorates rapidly as more layers are used.

Deep neural nets often use hundreds of layers. One might think that this may lead to large estimation error and overfitting. However, our result shows that as long as we control the magnitude of the weights, overfitting is not a problem.

6.5 Boosting

Using a simple function class such as decision stumps as the hypothesis space usually leads to a low estimation error but a high approximation error. In order to reduce the approximation error, we can enrich the hypothesis space. Boosting [9] has proven to be an attractive strategy in this regard, both in theory and in practice. At each step t, based on the error made by the current function \(h_{t-1}\), boosting greedily chooses a function \(g_t\) from the base function space \(\mathcal {B}\), multiplies it by a learning rate \(\gamma _t\), and adds it to \(h_{t-1}\) in order to reduce the error. We denote by T the total number of steps. Let us consider the following hypothesis space:

$$\begin{aligned} \mathcal {H}=\left\{ \sum _{t=1}^T \gamma _t g_t\ \big | \sum _{t=1}^T |\gamma _t|\le C, g_t \in \mathcal {B}\right\} , \end{aligned}$$

which contains all possible functions produced by boosting under a constraint on the learning rates.

In [16], Schapire et al. showed that for AdaBoost, the margin error on the training data decreases exponentially fast in T. They also provided a bound on the generalization error under the assumption that the VC dimension is finite.

In the following we derive a bound for boosting in a more general setting. Note that the hypothesis space \(\mathcal {H}\) considered here can be regarded as the C-convex hull of \(\mathcal {B}\), defined in the last section:

$$\begin{aligned} R(\mathcal {H}/\mu _n)=R(\text {conv}_C(\mathcal {B})/\mu _n)=CR(\mathcal {B}/\mu _n). \end{aligned}$$
(6.9)

As we argued previously, the Rademacher average bounds the estimation error. This result tells us that the estimation error of boosting can be bounded by \(C\mathbb {E}_{\mu }R(\mathcal {B}/\mu _n)\). Since the base function space \(\mathcal {B}\) is fixed in boosting, the bound is determined by C, the \(L_1\) norm of the learning rates.

Here C controls the complexity of \(\mathcal {H}\). When one uses too many steps and the corresponding learning rates do not decay fast enough, C becomes too large and overfitting becomes a problem.
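
The identity (6.9) is easy to explore numerically. The sketch below (illustrative only; the base class of decision stumps, the thresholds, the sample size and C are arbitrary choices) estimates \(R(\mathcal {B}/\mu _n)\) by Monte Carlo and scales it by C.

```python
# Illustrative sketch: the boosting bound (6.9), R(conv_C(B)/mu_n) = C * R(B/mu_n),
# with a Monte Carlo estimate of R(B/mu_n) for a toy base class of decision stumps.
import numpy as np

rng = np.random.default_rng(10)
n, C = 1000, 5.0
X = rng.uniform(0, 1, n)

thresholds = np.linspace(0.05, 0.95, 19)
B_vals = np.sign(X[None, :] - thresholds[:, None])     # stumps x -> sign(x - t)

def rademacher_average(values, n_draws=4000):
    r = rng.choice([-1.0, 1.0], size=(n_draws, values.shape[1]))
    return np.mean(np.max(np.abs(values @ r.T), axis=0)) / values.shape[1]

R_B = rademacher_average(B_vals)
print("R(B/mu_n)    :", R_B)
print("C * R(B/mu_n):", C * R_B)
```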

6.6 Convex functions

This example illustrates the fact that if \(\mathcal {H}\) is rich enough, the rate of \(O(n^{-1/2})\) cannot be achieved. Consider the hypothesis space \(\mathcal {H}\) containing all the real-valued convex functions defined on \([a,b]^d\subset \mathbb {R}^d\), which are uniformly bounded by B and uniformly L-Lipschitz.

In Bronshtein’s paper [6], it was proved that for \(\epsilon \) sufficiently small, the logarithm of the covering number \(\mathbb {N}(\epsilon ,\mathcal {H},L_{\infty }(\mu ))\) is bounded from above and below by a positive constant times \(\epsilon ^{-d/2}\), where \(\mu \) is the ordinary Lebesgue measure.

We use both the Fat-shattering dimension and the entropy in this case. By Lemma 2.4, we have

$$\begin{aligned} \text {log} \mathbb {N}(\epsilon ,\mathcal {H},L_{\infty }(\mu ))\ge \underset{\mu _n}{\text {sup}}\ \text {log}\mathbb {N}(\epsilon ,\mathcal {H},L_2(\mu _n))\ge \text {fat}_{16\epsilon }(\mathcal {H})/8. \end{aligned}$$
(6.10)

From Theorem 3.3, we conclude that \(R(\mathcal {H}/\mu _n)\) is bounded above by \(Cn^{-2/d}\) for some constant C.

To bound the associated Rademacher average from below, we use the inequality from Lemma 2.5:

$$\begin{aligned}&\text {fat}_{\frac{\epsilon }{8}}(\mathcal {H}) \text {log}^2 \bigg ( \frac{2\text {fat}_{\frac{\epsilon }{8}}(\mathcal {H})}{\epsilon } \bigg ) \ge \sup _{\mu _n}\text {log}\,\mathbb {N}(\epsilon ,\mathcal {H},L_\infty (\mu _n))\nonumber \\&\quad = \text {log}\,\mathbb {N}(\epsilon ,\mathcal {H},L_\infty (\mu )) \ge c\epsilon ^{-d/2}. \end{aligned}$$
(6.11)

Solving this inequality for the Fat-shattering dimension, we conclude that there exists a function \(\delta (\epsilon )\), decreasing to 0 as \(\epsilon \) goes to 0, such that

$$\begin{aligned} \text {fat}_\epsilon (\mathcal {H})\ge c\epsilon ^{-d/2+\delta (\epsilon )}. \end{aligned}$$
(6.12)

Now applying Theorem 4.1, we conclude that there exists \(\gamma (n)\), going to 0 as n goes to infinity, such that the Rademacher average is bounded below by a constant times \(n^{-(2/d+\gamma (n))}\).

Note that \(\mathcal {H}\) also satisfies the requirement of Lemma 5.2. Hence, if we use the \(L_2\) loss function, the universal estimation error has a rate between \(O(n^{-(2/d+\gamma (n))})\) and \(O(n^{-2/d})\). This shows that the class of convex functions in high dimension can be very complex for learning problems.

Acknowledgements

The work presented here is supported in part by the Major Program of NNSFC under grant 91130005, DOE grant DE-SC0009248, and ONR grant N00014-13-1-0338. This paper is dedicated to Professor Bjorn Engquist on the occasion of his 70th birthday.