1 Introduction

Let \((\varOmega ,{{\mathcal A}},{\mathbb {P}})\) be an arbitrary complete probability space. Let \({\mathcal Z}\) be a separable metric space and \({\mathcal F}=\{f: {\mathcal Z}\rightarrow {\mathbb R}\}\) be a set of bounded real-valued functions on \({\mathcal Z}\). Consider independent and identically distributed random variables \(Z_1,\ldots , Z_n, \ldots \) in \({\mathcal Z}\) with the common distribution \(\mathbf{P}\). The empirical process indexed by \(f\in {\mathcal F}\) is defined as

$$\begin{aligned} f\mapsto {\mathbb G}_n(f)\, :=\, \frac{1}{n}\sum _{t=1}^n ({\mathbb {E}}f(Z) - f(Z_t)). \end{aligned}$$

The study of the behavior of the supremum of this process is a central topic in empirical process theory, and it is well known that this behavior depends on the “richness” of \({\mathcal F}\). Statements about convergence of the supremum to zero are known as uniform Laws of Large Numbers (LLN). More precisely, a class \({\mathcal F}\) is said to be (strong) Glivenko–Cantelli for the distribution \({\mathbf P}\) if the supremum of \({\mathbb G}_n(f)\) converges to zero almost surely as \(n\rightarrow \infty \). Of particular interest are classes for which this convergence happens uniformly for all distributions. A class \({\mathcal F}\) is said to be uniform Glivenko–Cantelli if

$$\begin{aligned} \forall \delta >0,\quad \lim _{n'\rightarrow \infty } \sup _{{\mathbf P}} ~{\mathbb {P}}\left( \sup _{n\ge n'}\sup _{f\in {\mathcal F}} |{\mathbb G}_n(f)| > \delta \right) =0 \end{aligned}$$
(1)

where \({\mathbb {P}}\) is the product measure \({\mathbf P}^\infty \). As a classical example, consider i.i.d. real-valued random variables \(Z_1,\ldots ,Z_n\) and a class \({\mathcal F}= \{z\mapsto \mathbf{1}\left\{ z\le \theta \right\} : \theta \in {\mathbb R}\}\), where \(\mathbf{1}\left\{ \cdot \right\} \) is the indicator function. For this class, (1) holds by the well-known results of Glivenko and Cantelli: almost surely, the supremum of the difference between the cumulative distribution function and the empirical distribution function converges to zero. A number of necessary and sufficient conditions for the Glivenko–Cantelli and the uniform Glivenko–Cantelli properties have been derived over the past several decades [11].
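As a quick numerical companion to the classical example, the sup distance between the empirical and true distribution functions can be computed exactly from the order statistics. The following sketch assumes Uniform(0,1) samples, so that \(F(\theta )=\theta \); the helper name `ks_distance_uniform` is ours, not from the paper.

```python
import random

def ks_distance_uniform(sample):
    """sup over theta of |F_n(theta) - F(theta)| for F(theta) = theta on [0,1],
    computed exactly from the order statistics of the sample."""
    xs = sorted(sample)
    n = len(xs)
    # On the interval between consecutive order statistics, the sup is attained
    # at an endpoint, so checking i/n - x_(i) and x_(i) - (i-1)/n suffices.
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, 1))

random.seed(0)
for n in (100, 1_000, 10_000):
    d = ks_distance_uniform([random.random() for _ in range(n)])
    print(n, round(d, 4))  # converges to zero almost surely as n grows
```

The exact formula avoids discretizing \(\theta \): since \(F_n\) is a step function and \(F\) is continuous, the supremum is attained at one of the sample points.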

In this paper, we are interested in the martingale analogues of the uniform LLN, as well as in the analogues to the various notions of complexity that appear in empirical process theory. Specifically, consider a sequence of random variables \((Z_t)_{t \ge 1}\) adapted to a filtration \(({{\mathcal A}}_t)_{t \ge 1}\). We are interested in the following process indexed by \(f \in {\mathcal F}\):

$$\begin{aligned} f\mapsto {\mathbb M}_n(f) := \frac{1}{n}\sum _{t=1}^n ({\mathbb {E}}[f(Z_t)|{{\mathcal A}}_{t-1}] - f(Z_t)). \end{aligned}$$

The central object of study in this paper is the supremum of the process \({\mathbb M}_n(f)\), and in particular we address the question of whether a uniform convergence similar to (1) holds. Evidently, \({\mathbb M}_n(f)\) coincides with \({\mathbb G}_n(f)\) in the case when \(Z_1, Z_2,\ldots \) are i.i.d. random variables. More generally, for any fixed \(f\in {\mathcal F}\), the sequence \(({\mathbb {E}}[f(Z_t)|{{\mathcal A}}_{t-1}] - f(Z_t))_{t \ge 1}\) is a martingale difference sequence. Similar to the notion of uniform Glivenko–Cantelli class \({\mathcal F}\), we can define the notion of uniform convergence for dependent random variables over function class \({\mathcal F}\) as follows.

Definition 1

A function class \({\mathcal F}\) satisfies Sequential Uniform Convergence if,

$$\begin{aligned} \forall \delta > 0,\quad \lim _{n' \rightarrow \infty } \sup _{{\mathbb {P}}} ~{\mathbb {P}}\left( \sup _{n \ge n'} \sup _{f \in \mathcal {F}} \left| {\mathbb M}_n(f) \right| > \delta \right) =0, \end{aligned}$$
(2)

where the supremum is over all distributions \({\mathbb {P}}\) on the space \((\varOmega ,{{\mathcal A}})\).

The gap between properties (1) and (2) is already witnessed by the example of the class \({\mathcal F}= \{z\mapsto \mathbf{1}\left\{ z\le \theta \right\} : \theta \in {\mathbb R}\}\) of functions on \({\mathbb R}\), discussed earlier. In contrast to the uniform Glivenko–Cantelli property, the martingale analogue (2) does not hold for this class. On the positive side, the necessary and sufficient conditions for a class \({\mathcal F}\) to satisfy sequential uniform convergence, as derived in this paper, can be verified for a wide range of interesting classes.

2 Summary of the results

One of the main results in this paper is the following equivalence.

Theorem 1

Let \({\mathcal F}\) be a class of \([-1,1]\)-valued functions. Then the following statements are equivalent.

  1.

    \({\mathcal F}\) satisfies Sequential Uniform Convergence.

  2.

    For any \(\alpha > 0\), the sequential fat-shattering dimension \({\mathrm {fat}}_\alpha ({\mathcal F})\) is finite.

  3.

    Sequential Rademacher complexity \({\mathfrak R}_n({\mathcal F})\) satisfies \(\lim _{n\rightarrow \infty } {\mathfrak R}_n({\mathcal F}) = 0\).

Theorem 1 yields a characterization of the uniform convergence property in terms of two quantities. The first one is a combinatorial “dimension” of the class at scale \(\alpha \) (Definition 7). The second is a measure of complexity of the class through random averages (Definition 3). In addition to these quantities, we define sequential versions of covering numbers and the associated Dudley-type entropy integral. En route to proving Theorem 1, we obtain key relationships between the introduced covering numbers, the combinatorial dimensions, and random averages. These relationships constitute the bulk of the paper, and can be considered as martingale extensions of the results in empirical process theory. Specifically, we show

  • A relationship between the empirical process with dependent random variables and the sequential Rademacher complexity (Theorem 2), obtained through sequential symmetrization.

  • An upper bound of sequential Rademacher complexity by a Dudley-type entropy integral through the chaining technique (Theorem 3).

  • An upper bound on sequential covering numbers in terms of the combinatorial dimensions (Theorems 4 and 5, as well as Corollary 1). In particular, Theorem 5 is a sequential analogue of the celebrated Vapnik–Chervonenkis–Sauer–Shelah lemma.

  • A relationship between the combinatorial dimension and sequential Rademacher complexity (Lemma 2) and, as a consequence, equivalence of many of the introduced complexity notions up to a poly-logarithmic factor.

  • Properties of sequential Rademacher complexity and, in particular, the contraction inequality (Lemma 7).

  • An extension of the above results to high-probability statements (Lemmas 4, 5, and 6) and an application to concentration of martingales in Banach spaces (Corollary 2).

This paper is organized as follows. In the next section we place the present paper in the context of previous work. In Sects. 4–6 we introduce sequential complexities. A characterization of sequential uniform convergence appears in Sect. 7. We conclude the paper with some structural results in Sect. 8 and an application to exponential inequalities for sums of martingale difference sequences in Banach spaces in Sect. 9. Most proofs are deferred to the Appendix.

3 Related literature

The seminal work of Vapnik and Chervonenkis [37] provided the first necessary and sufficient conditions—via a notion of random VC entropy—for a class \({\mathcal F}\) of binary-valued functions to be a Glivenko–Cantelli (GC) class. These results were strengthened by Steele [29], who showed almost sure convergence. A similar characterization of the GC property via a notion of a covering number in the case of uniformly bounded real-valued functions appears in [38]. For the binary-valued case, a distribution-independent version of the VC entropy (termed the growth function) was shown by Vapnik and Chervonenkis [37] to yield a sufficient condition for the uniform GC property. The “necessary” direction was first shown (according to [11, p. 229]) in an unpublished manuscript of Assouad, 1982. For real-valued classes of functions, the necessary and sufficient conditions for the uniform GC property were established in [12] through a notion of a covering number similar to the Koltchinskii–Pollard entropy. A characterization of GC classes for a fixed distribution was also given by Talagrand [30, 31] through a notion of a “witness of irregularity”. Similar in spirit, the pseudo-dimension introduced in [24] was shown by Pollard to be sufficient, though not necessary, for the uniform GC property. A scale-sensitive version of pseudo-dimension (termed the fat-shattering dimension by [5]) was introduced by Kearns and Schapire [16]. Finiteness of this dimension at all scales was shown in [3] to characterize the uniform GC classes. We refer the reader to [11, Chapter 6] and [33, 34] for a much more detailed account of the results.

The GC-type theorems have also been extended to the case of weakly dependent random variables. For instance, Yukich [40] relies on a \(\phi \)-mixing assumption, while Nobel and Dembo [21] and Yu [39] consider \(\beta \)-mixing sequences. For a countable class with a finite VC dimension, a GC theorem has been recently shown by Adams and Nobel [2] for ergodic sequences. We refer the reader to [2, 10] for a more comprehensive survey of results for non-i.i.d. data. Notably, the aforementioned papers prove a GC-type property under much the same type of complexity measures as in the i.i.d. case. This is in contrast to the present paper, where the classical notions do not provide answers to the questions of convergence.

In this paper, we do not make mixing or ergodicity assumptions on the sequence of random variables. However, the definition of \({\mathbb M}_n(f)\) imposes a certain structure which is not present when an average is compared with a single expected value. Thus, our results yield an extension of the GC property to non-i.i.d. data in a direction that is different from the papers mentioned above. Such an extension has already been considered in the literature: the quantity \(\sup _{f \in \mathcal {F}} {\mathbb M}_n(f)\) has been studied by van de Geer [34] (see Chapter 8.2). Dudley integral type upper bounds for a given distribution \({\mathbb {P}}\) were provided in terms of the so called generalized entropy with bracketing, corresponding to the particular distribution \({\mathbb {P}}\). This is a sufficient condition for convergence of the supremum of \({\mathbb M}_n(f)\) for the given distribution. In this work, however, we are interested in providing necessary and sufficient conditions for the uniform analogue of the GC property, as well as in extending the ideas of symmetrization, covering numbers, and scale-sensitive dimensions to the non-i.i.d. case. Towards the end of Sect. 7, we discuss the relationship between the generalized entropy with bracketing of [34] and the tools provided in this work.

We also stress that this paper studies martingale uniform laws of large numbers rather than a convergence of \(n{\mathbb M}_n(f)\), which only holds under stringent conditions; such a convergence for reverse martingales has been studied in [36]. The question of the limiting behavior of \(\sqrt{n}{\mathbb M}_n(f)\) (that is, the analogue of the Donsker property [11]) is also outside of the scope of this paper.

The study of the supremum of the process \({\mathbb M}_n(f)\) has many potential applications. For instance, in [35], the quantity \(\sup _{f \in {\mathcal F}} {\mathbb M}_n(f)\) is used to provide bounds on estimation rates for autoregressive models. In [1, 25] connections between minimax rates of sequential prediction problems and the supremum of the process \({\mathbb M}_n(f)\) over the associated class of predictors \({\mathcal F}\) are established. In Sect. 9 of this work, we show how the supremum of \({\mathbb M}_n(f)\) over class of linear functionals can be used to derive exponential inequalities for sums of martingale differences in general Banach spaces.

4 Symmetrization and the tree process

A key tool in deriving classical uniform convergence theorems (for i.i.d. random variables) is symmetrization. The main idea behind symmetrization is to compare the empirical process \({\mathbb G}_n(f)\) over a probability space \((\varOmega ,{{\mathcal A}},{\mathbb {P}})\) to a symmetrized empirical process, called the Rademacher process, over the probability space \((\varOmega ^\epsilon ,{\mathcal B},{\mathbb {P}}_\epsilon )\), where \(\varOmega ^\epsilon =\{-1,1\}^{{\mathbb N}}\), \({\mathcal B}\) is the Borel \(\sigma \)-algebra, and \({\mathbb {P}}_\epsilon \) is the uniform probability measure. We use the notation \({\mathbb {E}}_\epsilon \) to represent expectation under the measure \({\mathbb {P}}_\epsilon \), and \(({\mathcal B}_t)_{t \ge 0}\) to denote the dyadic filtration on \(\varOmega ^\epsilon \) given by \({\mathcal B}_t=\sigma (\epsilon _1,\ldots ,\epsilon _t)\), where the \(\epsilon _t\)’s are independent symmetric \(\{\pm 1\}\)-valued Rademacher random variables and \({\mathcal B}_0 = \{\{\}, \varOmega ^\epsilon \}\).

Given \(z_1,\ldots ,z_n \in {\mathcal Z}\), the Rademacher process \( {\mathbb S}^{(z_{1:n})}_n(f)\) is defined as

$$\begin{aligned} f\mapsto {\mathbb S}^{(z_{1:n})}_n(f) := \frac{1}{n}\sum _{t=1}^n \epsilon _t f(z_t). \end{aligned}$$
(3)

It is well-known (e.g. [33]) that the behavior of the supremum of the symmetrized process \( {\mathbb S}^{(z_{1:n})}_n(f)\) is closely related to the behavior of the supremum of the empirical process as

$$\begin{aligned} {\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb G}_n(f) \le \, 2\, \sup _{z_1,\ldots ,z_n \in {\mathcal Z}} {\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb S}^{(z_{1:n})}_n(f) \end{aligned}$$
(4)

and a similar high-probability statement can also be proved. Note that the Rademacher process is defined on the probability space \((\varOmega ^\epsilon ,{\mathcal B},{\mathbb {P}}_\epsilon )\), which is potentially easier to handle than the original probability space for the empirical process.

In the non-i.i.d. case, however, a similar symmetrization argument requires significantly more care and relies on the notion of decoupled tangent sequences [9, Def. 6.1.4]. Fix a sequence of random variables \((Z_t)_{t \ge 1}\) adapted to the filtration \(({{\mathcal A}}_t)_{t \ge 1}\). A sequence of random variables \((Z'_{t})_{t \ge 1}\) is said to be a decoupled sequence tangent to \((Z_t)_{t \ge 1}\) if for each \(t\), conditioned on \(Z_1,\ldots ,Z_{t-1}\), the random variables \(Z_t\) and \(Z'_t\) are independent and identically distributed. Thus, the random variables \((Z'_t)_{t \ge 1}\) are conditionally independent given \((Z_{t})_{t \ge 1}\). In Theorem 2 below, a sequential symmetrization argument is applied to the decoupled sequences, leading to a tree process—an analogue of the Rademacher process for the non-i.i.d. case. First, let us define the notion of a tree.

A \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\) is a rooted complete binary tree with nodes labeled by elements of \({\mathcal Z}\). We identify the tree \({\mathbf z}\) with the sequence \(({\mathbf z}_1,\ldots ,{\mathbf z}_n)\) of labeling functions \({\mathbf z}_i : \{\pm 1\}^{i-1} \rightarrow \mathcal {Z}\) which provide the labels for each node. Here, \({\mathbf z}_1\in {\mathcal Z}\) is the label for the root of the tree, while \({\mathbf z}_i\) for \(i>1\) is the label of the node obtained by following the path of length \(i-1\) from the root, with \(+1\) indicating ‘right’ and \(-1\) indicating ‘left’. A path of length \(n\) is given by the sequence \(\epsilon = (\epsilon _1,\ldots ,\epsilon _{n}) \in \{\pm 1\}^{n}\). For brevity, we shall often write \({\mathbf z}_t(\epsilon )\), but it is understood that \({\mathbf z}_t\) only depends on the prefix \((\epsilon _1,\ldots ,\epsilon _{t-1})\) of \(\epsilon \). Given a tree \({\mathbf z}\) and a function \(f:{\mathcal Z}\rightarrow {\mathbb R}\), we define the composition \(f\circ {\mathbf z}\) as a real-valued tree given by the labeling functions \((f\circ {\mathbf z}_1,\ldots ,f\circ {\mathbf z}_n)\).
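To make the encoding of a tree as labeling functions concrete, here is a minimal sketch in Python (our own names, not from the paper; \({\mathcal Z}\) is taken to be strings, with each node labeled by the path that reaches it):

```python
def make_path_tree(n):
    """A depth-n tree as a list of labeling functions z_t : {±1}^{t-1} -> Z,
    where the label of a node is simply the path string leading to it."""
    def z_t(t):
        # z_t(eps) depends only on the prefix (eps_1, ..., eps_{t-1})
        return lambda eps: "root" + "".join("R" if e == +1 else "L"
                                            for e in eps[: t - 1])
    return [z_t(t) for t in range(1, n + 1)]

z = make_path_tree(3)
eps = (+1, -1, +1)
print([zt(eps) for zt in z])  # ['root', 'rootR', 'rootRL']
```

Note that evaluating the whole tree along a single path \(\epsilon \) touches only \(n\) of the \(2^n-1\) node labels, which is the mechanism exploited by the \(0\)-cover in the next section.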

Observe that if \(\epsilon _1,\ldots ,\epsilon _n\) are i.i.d. Rademacher random variables, then

$$\begin{aligned} \left( \epsilon _t f({\mathbf z}_t(\epsilon _1,\ldots ,\epsilon _{t-1})) \right) _{t=1}^n \end{aligned}$$

is a martingale-difference sequence for any given function \(f:{\mathcal Z}\rightarrow {\mathbb R}\).

Definition 2

Let \(\epsilon _1,\ldots ,\epsilon _n\) be independent Rademacher random variables. Given a \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\), the stochastic process

$$\begin{aligned} f\mapsto {\mathbb T}_n^{({\mathbf z})}(f):=\frac{1}{n} \sum _{t=1}^n \epsilon _t f({\mathbf z}_t(\epsilon _1,\ldots ,\epsilon _{t-1})) \end{aligned}$$

will be called the tree process indexed by \({\mathcal F}\).

We may view the tree process \({\mathbb T}^{({\mathbf z})}_n(f)\) as a generalization of the Rademacher process \( {\mathbb S}^{(z_{1:n})}_n(f)\). Indeed, suppose \(({\mathbf z}_1,\ldots ,{\mathbf z}_n)\) is a sequence of constant labeling functions such that for any \(t\in [n]\), \({\mathbf z}_t(\epsilon _1,\ldots ,\epsilon _{t-1}) = z_t\) for any \((\epsilon _1,\ldots ,\epsilon _{t-1})\). In this case, \({\mathbb T}^{({\mathbf z})}_n(f)\) and \( {\mathbb S}^{(z_{1:n})}_n(f)\) coincide. In general, however, the tree process can behave differently (in a certain sense) from the Rademacher process.

Given \(z_1,\ldots ,z_n\), the expected supremum of the Rademacher process in (4) is known as the (empirical) Rademacher average, or Rademacher complexity, of the function class. We propose the following definition for the tree process:

Definition 3

The sequential Rademacher complexity of a function class \({\mathcal F}\subseteq {\mathbb R}^{\mathcal Z}\) on a \({\mathcal Z}\)-valued tree \({\mathbf z}\) is defined as

$$\begin{aligned} {\mathfrak R}_n({\mathcal F},{\mathbf z}) = {\mathbb {E}}\sup _{f\in {\mathcal F}}{\mathbb T}^{({\mathbf z})}_n(f) = {\mathbb {E}}\left[ \sup _{f \in {\mathcal F}} \frac{1}{n}\sum _{t=1}^n \epsilon _t f({\mathbf z}_t(\epsilon )) \right] \end{aligned}$$

and

$$\begin{aligned} {\mathfrak R}_n({\mathcal F}) = \sup _{\mathbf z}{\mathfrak R}_n({\mathcal F},{\mathbf z}) \end{aligned}$$

where the outer supremum is taken over all \({\mathcal Z}\)-valued trees of depth \(n\), and \(\epsilon =(\epsilon _1,\ldots , \epsilon _n)\) is a sequence of i.i.d. Rademacher random variables.
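For small \(n\), the expectation in Definition 3 can be computed exactly by enumerating all \(2^n\) sign sequences. The sketch below (our own code; the class \({\mathcal F}\) and tree \({\mathbf z}\) are toy choices) also illustrates the earlier observation that on a constant tree the tree process coincides with the Rademacher process:

```python
from itertools import product

def seq_rademacher(F, z, n):
    """Exact R_n(F, z): average over all 2^n sign sequences eps of
    sup_{f in F} (1/n) * sum_t eps_t f(z_t(eps))."""
    total = 0.0
    for eps in product((-1, +1), repeat=n):
        total += max(sum(e * f(zt(eps)) for e, zt in zip(eps, z)) / n
                     for f in F)
    return total / 2 ** n

n = 3
# Constant tree z_t(eps) = t: the tree process reduces to the classical
# Rademacher process on the points 1, ..., n.
z = [(lambda c: (lambda eps: c))(t) for t in range(1, n + 1)]
F = [lambda x: 1.0, lambda x: -1.0]   # two constant functions on Z
r = seq_rademacher(F, z, n)
print(r)  # equals E|eps_1 + eps_2 + eps_3| / 3 = 0.5
```

For this two-element class the supremum is \(|\epsilon _1+\epsilon _2+\epsilon _3|/3\) regardless of the tree labels, so the exact value \(1/2\) can be checked by hand.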

Theorem 2

The following relation holds between the empirical process with dependent random variables and the sequential Rademacher complexity:

$$\begin{aligned} {\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb M}_n(f) \le 2\, {\mathfrak R}_n({\mathcal F}). \end{aligned}$$
(5)

Furthermore, this bound is tight, as we have

$$\begin{aligned} \frac{1}{2} \left( {\mathfrak R}_n({\mathcal F}) - \frac{B}{2\sqrt{n}} \right) \le \sup _{{\mathbb {P}}}\, {\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb M}_n(f) \end{aligned}$$
(6)

where \(B = \inf _{z\in {\mathcal Z}} \sup _{f,f' \in {\mathcal F}}(f(z) - f'(z)) \ge 0\).

We would like to point out that in general \({\mathfrak R}_n({\mathcal F}) = \varOmega \left( \frac{B}{\sqrt{n}}\right) \), and so in the worst case the behavior of the expected supremum of \({\mathbb M}_n\) is precisely given by the sequential Rademacher complexity. Further, we remark that for a class \({\mathcal F}\) of linear functions on some subset \({\mathcal Z}\) of a vector space such that \(0 \in {\mathcal Z}\), we have \(B = 0\) and the lower bound becomes \(\frac{1}{2} {\mathfrak R}_n({\mathcal F})\).

The proof of Theorem 2 requires more work than the classical symmetrization proof [11, 14, 19] due to the non-i.i.d. nature of the sequences. To readers familiar with the notion of martingale type in the theory of Banach spaces we would like to point out that the tree process can be viewed as an analogue of Walsh–Paley martingales. The upper bound of Theorem 2 is a generalization of the fact that the expected norm of a sum of martingale difference sequences can be upper bounded by the expected norm of sum of Walsh–Paley martingale difference sequences, as shown in [23].

As mentioned earlier, the sequential Rademacher complexity is an object that is easier to study than the original empirical process \({\mathbb M}_n\). The following sections introduce additional notions of complexity of a function class that provide control of the sequential Rademacher complexity. Specific relations between these complexity notions will be shown, leading to the proof of Theorem 1.

5 Finite classes, covering numbers, and chaining

The first step in upper bounding sequential Rademacher complexity is the following result for a finite collection of trees.

Lemma 1

For any finite set \(V\) of \({\mathbb R}\)-valued trees of depth \(n\) we have that

$$\begin{aligned} {\mathbb {E}}\left[ \max _{{\mathbf v}\in V} \sum _{t=1}^n \epsilon _t {\mathbf v}_t(\epsilon ) \right] \le \sqrt{2 \log (|V|) \max _{{\mathbf v}\in V} \max _{\epsilon \in \{\pm 1\}^n} \sum _{t=1}^n {\mathbf v}_t(\epsilon )^2} \end{aligned}$$

where \(|V|\) denotes the cardinality of the set \(V\).

A simple consequence of the above lemma is that if \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\) is a finite class, then for any tree \({\mathbf z}\), we have that

$$\begin{aligned} {\mathbb {E}}\left[ \max _{f \in {\mathcal F}} \frac{1}{n}\sum _{t=1}^n \epsilon _t f({\mathbf z}_t(\epsilon ))\right] \le {\mathbb {E}}\left[ \max _{{\mathbf v}\in {\mathcal F}({\mathbf z}) } \frac{1}{n}\sum _{t=1}^n \epsilon _t {\mathbf v}_t(\epsilon )\right] \le \sqrt{\frac{2 \log (|{\mathcal F}|)}{n}}, \end{aligned}$$
(7)

where \({\mathcal F}({\mathbf z}) = \{f\,\circ \, {\mathbf z}: f\in {\mathcal F}\}\) is the projection of \({\mathcal F}\) onto \({\mathbf z}\). It is clear that \(|{\mathcal F}({\mathbf z})|\le |{\mathcal F}|\) which explains the second inequality above.
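Lemma 1 and the resulting bound (7) can be sanity-checked by exact enumeration for small \(n\). A sketch with our own function names (a tree in \(V\) is encoded as a list of labeling functions, as in Sect. 4):

```python
import math
from itertools import product

def expected_max(V, n):
    # E max_{v in V} sum_t eps_t v_t(eps), exact over all 2^n sign sequences
    total = 0.0
    for eps in product((-1, +1), repeat=n):
        total += max(sum(e * vt(eps) for e, vt in zip(eps, v)) for v in V)
    return total / 2 ** n

def lemma1_bound(V, n):
    # sqrt(2 log|V| * max_v max_eps sum_t v_t(eps)^2)
    worst = max(sum(vt(eps) ** 2 for vt in v)
                for v in V for eps in product((-1, +1), repeat=n))
    return math.sqrt(2 * math.log(len(V)) * worst)

n = 3
plus = [lambda eps: 1.0] * n       # constant tree labeled +1 everywhere
minus = [lambda eps: -1.0] * n     # constant tree labeled -1 everywhere
V = [plus, minus]
lhs, rhs = expected_max(V, n), lemma1_bound(V, n)
print(lhs, "<=", rhs)  # here lhs = E|S_3| = 1.5 and rhs = sqrt(6 log 2)
```

For this pair of trees the maximum over \(V\) is \(|\epsilon _1+\epsilon _2+\epsilon _3|\), with expectation \(3/2\), comfortably below \(\sqrt{6\log 2}\approx 2.04\).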

To illustrate the next idea, consider a binary-valued function class \({\mathcal F}\subseteq \{\pm 1\}^{\mathcal Z}\). In the i.i.d. case, the cardinality of the coordinate projection

$$\begin{aligned} \{(f(z_1), \ldots ,f(z_n)): f\in {\mathcal F}\} \end{aligned}$$

immediately yields a control of the supremum of the empirical process. For the tree-based definition, however, it is easy to see that the cardinality of \({\mathcal F}({\mathbf z})\) is exponential in \(n\) for any interesting class \({\mathcal F}\), leading to a vacuous upper bound.

A key observation is that the first inequality in (7) holds with \({\mathcal F}({\mathbf z})\) replaced by a potentially smaller set \(V\) of \({\mathbb R}\)-valued trees with the property that

$$\begin{aligned} \forall f \in {\mathcal F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists {\mathbf v}\in V\ \mathrm {s.t.}\quad {\mathbf v}_t(\epsilon ) = f({\mathbf z}_t(\epsilon )) \end{aligned}$$
(8)

for all \(t\in [n]\). Crucially, the choice of \({\mathbf v}\) may depend on \(\epsilon \). A set \(V\) of \({\mathbb R}\)-valued trees satisfying (8) is termed a \(0\)-cover of \({\mathcal F}\subseteq {\mathbb R}^{\mathcal Z}\) on a tree \({\mathbf z}\) of depth \(n\). We denote by \({\mathcal N}(0,{\mathcal F},{\mathbf z})\) the size of a smallest \(0\)-cover on \({\mathbf z}\) and define \({\mathcal N}(0,{\mathcal F},n) = \sup _{\mathbf z}{\mathcal N}(0,{\mathcal F},{\mathbf z})\).

To illustrate the gap between the size of a \(0\)-cover and the cardinality of \({\mathcal F}({\mathbf z})\), consider a tree \({\mathbf z}\) of depth \(n\) and suppose for simplicity that \(|\hbox {Img}({\mathbf z})| = 2^n-1\) where \(\hbox {Img}({\mathbf z}) = \cup _{t\in [n]}\hbox {Img}({\mathbf z}_t)\) and \(\hbox {Img}({\mathbf z}_t) = \{{\mathbf z}_t(\epsilon ): \epsilon \in \{\pm 1\}^n\}\). Suppose \({\mathcal F}\) consists of \(2^{n-1}\) binary-valued functions defined as zero on all of \(\hbox {Img}({\mathbf z})\) except for a single value of \(\hbox {Img}({\mathbf z}_n)\). In plain words, each function is zero everywhere on the tree except for a single leaf. While the projection \({\mathcal F}({\mathbf z})\) contains \(2^{n-1}\) distinct trees, the size of a \(0\)-cover is only \(2\): it is enough to take an all-zero function \(g_0\) along with a function \(g_1\) which is zero on all of \(\hbox {Img}({\mathbf z})\) except \(\hbox {Img}({\mathbf z}_n)\) (i.e. on the leaves). It is easy to verify that \(g_0\circ {\mathbf z}\) and \(g_1\circ {\mathbf z}\) provide a \(0\)-cover for \({\mathcal F}\) on \({\mathbf z}\). Unlike \(|{\mathcal F}({\mathbf z})|\), the size of the cover does not grow with \(n\), capturing the fact that the function class is “simple” on any given path.
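This gap can be verified mechanically. Below is a sketch in our own encoding (path strings as labels, as in Sect. 4; nothing here is from the paper's proofs) with \(n=4\): the class of \(2^{n-1}\) "single nonzero leaf" functions is \(0\)-covered by the two trees \(g_0\circ {\mathbf z}\) and \(g_1\circ {\mathbf z}\).

```python
from itertools import product

n = 4
# Distinct labels: the path string leading to the node (length t-1 at level t).
z = [(lambda t: (lambda eps: "".join("R" if e == +1 else "L"
                                     for e in eps[: t - 1])))(t)
     for t in range(1, n + 1)]
leaves = sorted({z[n - 1](eps) for eps in product((-1, +1), repeat=n)})

# F: for each leaf label, the function equal to 1 there and 0 elsewhere.
F = [(lambda leaf: (lambda x: 1.0 if x == leaf else 0.0))(leaf)
     for leaf in leaves]
assert len(F) == 2 ** (n - 1)     # projection F(z) has 2^{n-1} distinct trees

g0 = lambda x: 0.0                               # zero everywhere
g1 = lambda x: 1.0 if len(x) == n - 1 else 0.0   # zero except on the leaves
V = [[(lambda g, t: (lambda eps: g(z[t](eps))))(g, t) for t in range(n)]
     for g in (g0, g1)]

# Property (8): every f agrees with some v in V along every path eps.
for f in F:
    for eps in product((-1, +1), repeat=n):
        assert any(all(v[t](eps) == f(z[t](eps)) for t in range(n))
                   for v in V)
print("0-cover of size", len(V), "for", len(F), "functions")
```

Along any single path, each \(f\in {\mathcal F}\) is either identically zero or zero until the leaf, so the two trees suffice, exactly as argued above.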

For real-valued function classes, the notion of a \(0\)-cover needs to be relaxed to incorporate scale. We propose the following definitions.

Definition 4

A set \(V\) of \({\mathbb R}\)-valued trees of depth \(n\) is a (sequential) \(\alpha \)-cover (with respect to the \(\ell _p\)-norm) of \({\mathcal F}\subseteq {\mathbb R}^{\mathcal Z}\) on a tree \({\mathbf z}\) of depth \(n\) if

$$\begin{aligned} \forall f \in {\mathcal F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists {\mathbf v}\in V\ \mathrm {s.t.}\quad \left( \frac{1}{n} \sum _{t=1}^n |{\mathbf v}_t(\epsilon ) - f({\mathbf z}_t(\epsilon ))|^p \right) ^{1/p} \le \alpha . \end{aligned}$$

The (sequential) covering number of a function class \({\mathcal F}\) on a given tree \({\mathbf z}\) is defined as

$$\begin{aligned} {\mathcal N}_p(\alpha , {\mathcal F}, {\mathbf z}) = \min \left\{ |V| : V \ \text { is an }\alpha \text {-cover w.r.t. }\ell _p\text {-norm of }{\mathcal F}\text { on } {\mathbf z}\right\} . \end{aligned}$$

Further define \({\mathcal N}_p(\alpha , {\mathcal F}, n) = \sup _{\mathbf z}{\mathcal N}_p(\alpha , {\mathcal F}, {\mathbf z}) \), the maximal \(\ell _p\) covering number of \({\mathcal F}\) over depth-\(n\) trees.

In the study of the supremum of a stochastic process indexed by a set \(S\), it is natural to endow the set with a pseudo-metric \(d\). The structure and “richness” of the index set \(S\) (as given by covering numbers or, more generally, via the chaining technique [32, 34]) yield precise control on the supremum of the stochastic process. It is natural to ask whether we can endow the projection \({\mathcal F}({\mathbf z})\) with a metric and appeal to known results. This turns out to be not quite the case, as the pseudo-metric needs to be random. Indeed, observe that the tree \({\mathbf v}\) providing the cover may depend on the path \(\epsilon \) itself. We may define the random pseudo-metric between the \({\mathbb R}\)-valued trees \({\mathbf v}',{\mathbf v}\) as

$$\begin{aligned} d^{p}_\epsilon ({\mathbf v}',{\mathbf v}) = \left( \frac{1}{n}\sum _{t=1}^n \left| {\mathbf v}'_t(\epsilon )-{\mathbf v}_t(\epsilon ) \right| ^p\right) ^{1/p}. \end{aligned}$$

An \(\alpha \)-cover \(V\) then guarantees that, for any \(\epsilon \in \{\pm 1\}^n\),

$$\begin{aligned} \sup _{{\mathbf v}'\in {\mathcal F}({\mathbf z})}\inf _{{\mathbf v}\in V} d^p_\epsilon ({\mathbf v}',{\mathbf v})\le \alpha . \end{aligned}$$

Therefore, our results below can be seen as extending the chaining technique to the case of a random pseudo-metric \(d^p_\epsilon \).

With the definition of an \(\alpha \)-cover with respect to \(\ell _1\) norm, it is immediate (using Lemma 1) that for any \({\mathcal F}\subset [-1,1]^{\mathcal Z}\), for any \(\alpha >0\),

$$\begin{aligned} {\mathfrak R}_n({\mathcal F},{\mathbf z}) \le \alpha + \sqrt{\frac{2\log {\mathcal N}_1(\alpha , {\mathcal F}, {\mathbf z})}{n}}. \end{aligned}$$
(9)

It is recognized, however, that a tighter control is obtained by integrating the covering numbers at different scales. To this end, consider the following analogue of the Dudley entropy integral bound.

Definition 5

For \(p \ge 2\), the integrated complexity of a function class \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\) on a \({\mathcal Z}\)-valued tree of depth \(n\) is defined as

$$\begin{aligned} {\mathfrak D}^p_n ({\mathcal F}, {\mathbf z}) = \inf _{\alpha > 0}\left\{ 4 \alpha + \frac{12}{\sqrt{n}}\int _{\alpha }^{1} \sqrt{\log \ \mathcal {N}_p(\delta , {\mathcal F}, {\mathbf z}) \ } d \delta \right\} \end{aligned}$$

and

$$\begin{aligned} {\mathfrak D}^p_n({\mathcal F}) = \sup _{{\mathbf z}} {\mathfrak D}^p_n({\mathcal F},{\mathbf z}). \end{aligned}$$

We write \({\mathfrak D}_n({\mathcal F},{\mathbf z})\) for \({\mathfrak D}^2_n({\mathcal F},{\mathbf z})\). Clearly, \({\mathfrak D}^p_n({\mathcal F},{\mathbf z})\le {\mathfrak D}^q_n({\mathcal F},{\mathbf z})\) for \(p\le q\).
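As a numerical illustration of Definition 5, the infimum over \(\alpha \) can be approximated by a grid sweep. The covering-number bound \(\mathcal {N}_2(\delta ,{\mathcal F},{\mathbf z})\le \delta ^{-d}\) used below is a hypothetical assumption for the sketch (typical of \(d\)-parameter classes), not a result of this paper:

```python
import math

def integrated_complexity(log_cover, n, grid=2000):
    """Approximate inf_{alpha>0} [ 4*alpha
       + (12/sqrt(n)) * integral_alpha^1 sqrt(log_cover(delta)) d delta ]
    by a midpoint rule, sweeping alpha over the grid from 1 down to 0."""
    mids = [(i + 0.5) / grid for i in range(grid)]
    vals = [math.sqrt(max(log_cover(delta), 0.0)) for delta in mids]
    best, tail = float("inf"), 0.0
    for alpha, v in zip(reversed(mids), reversed(vals)):
        tail += v / grid                       # running integral from alpha to 1
        best = min(best, 4 * alpha + 12 / math.sqrt(n) * tail)
    return best

d, n = 5, 10_000
bound = integrated_complexity(lambda delta: d * math.log(1 / delta), n)
print(bound)  # finite: the entropy integral converges even as alpha -> 0
```

For this choice \(\int _0^1\sqrt{d\log (1/\delta )}\,d\delta = \sqrt{d\pi }/2 < \infty \), so the optimal \(\alpha \) is essentially \(0\) and the bound decays at the rate \(n^{-1/2}\); for faster-growing entropies the trade-off against the \(4\alpha \) term becomes nontrivial.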

Theorem 3

For any function class \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\), we have that

$$\begin{aligned} {\mathfrak R}_n({\mathcal F},{\mathbf z}) \le {\mathfrak D}_n({\mathcal F},{\mathbf z}) \end{aligned}$$

for any \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\).

We conclude this section by mentioning that two distinct notions of a packing (or, \(\alpha \)-separated set) exist for trees, according to the order of quantifiers in the definition. In one definition, it must be that every member of the packing set is \(\alpha \)-separated from every other member on some path. For the other, there must be a path on which every member of the packing is \(\alpha \)-separated from every other member. In the classical case the distinction does not arise, and the packing number is known to be closely related to the covering number. For the tree case, however, the two notions are distinct, one providing an upper bound and one a lower bound on the covering number. Due to this discrepancy, difficulties arise in attempting to replicate proofs that pass through the packing number, such as Dudley’s extraction technique [19] for obtaining estimates on the \(\ell _2\) covering numbers.

6 Combinatorial parameters

For i.i.d. data, the uniform Glivenko–Cantelli property for classes of binary-valued functions is characterized by the Vapnik–Chervonenkis combinatorial dimension [37]. For real-valued function classes, the corresponding notions are the scale-sensitive dimensions, such as the fat-shattering dimension [3, 6]. In this section, we recall the definition of Littlestone dimension [8, 18] and propose its scale-sensitive versions for real-valued function classes. Subsequently, these combinatorial parameters are shown to control the growth of sequential covering numbers.

Definition 6

A \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(d\) is shattered by a function class \({\mathcal F}\subseteq \{\pm 1\}^{{\mathcal Z}}\) if

$$\begin{aligned} \forall \epsilon \in \{\pm 1\}^d, \ \exists f \in {\mathcal F}\ \ \text {s.t. }\quad \forall t \in [d],\quad f({\mathbf z}_t(\epsilon )) = \epsilon _t. \end{aligned}$$

The Littlestone dimension \({\mathrm {Ldim}}({\mathcal F}, {\mathcal Z})\) is the largest \(d\) such that \({\mathcal F}\) shatters a \({\mathcal Z}\)-valued tree of depth \(d\).

We propose the following scale-sensitive version of Littlestone dimension.

Definition 7

A \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(d\) is \(\alpha \)-shattered by a function class \({\mathcal F}\subseteq {\mathbb R}^{\mathcal Z}\) if there exists an \({\mathbb R}\)-valued tree \({\mathbf s}\) of depth \(d\) such that

$$\begin{aligned} \forall \epsilon \in \{\pm 1\}^d, \ \exists f \in {\mathcal F}\ \ \text {s.t. }\quad \forall t \in [d],\quad \epsilon _t (f({\mathbf z}_t(\epsilon )) - {\mathbf s}_t(\epsilon )) \ge \alpha /2. \end{aligned}$$

The tree \({\mathbf s}\) will be called a witness to shattering. The (sequential) fat-shattering dimension \({\mathrm {fat}}_\alpha ({\mathcal F}, {\mathcal Z})\) at scale \(\alpha \) is the largest \(d\) such that \({\mathcal F}\) \(\alpha \)-shatters a \({\mathcal Z}\)-valued tree of depth \(d\).

With these definitions it is easy to see that \({\mathrm {fat}}_\alpha ({\mathcal F}, {\mathcal Z}) = {\mathrm {Ldim}}({\mathcal F}, {\mathcal Z})\) for a binary-valued function class \({\mathcal F}\subseteq \{\pm 1\}^{\mathcal Z}\) for any \(0 <\alpha \le 2\).

When \({\mathcal Z}\) and/or \({\mathcal F}\) are understood from the context, we will simply write \({\mathrm {fat}}_\alpha \) or \({\mathrm {fat}}_\alpha ({\mathcal F})\) instead of \({\mathrm {fat}}_\alpha ({\mathcal F}, {\mathcal Z})\). Furthermore, we will write \({\mathrm {fat}}_\alpha ({\mathcal F}, {\mathbf z})\) for \({\mathrm {fat}}_\alpha ({\mathcal F}, \hbox {Img}({\mathbf z}))\). Hence, \({\mathrm {fat}}_\alpha ({\mathcal F},{\mathbf z})\) is the largest \(d\) such that \({\mathcal F}\) \(\alpha \)-shatters a tree \({\mathbf z}'\) of depth \(d\) with \(\hbox {Img}({\mathbf z}')\subseteq \hbox {Img}({\mathbf z})\).

If trees \({\mathbf z}\) are defined by constant mappings \({\mathbf z}_t(\epsilon ) = z_t\), the combinatorial parameters introduced in the definitions above coincide with the Vapnik–Chervonenkis dimension and its scale-sensitive version, the fat-shattering dimension. Therefore, the notions we are studying lead to a theory that can be viewed as a sequential generalization of the Vapnik–Chervonenkis theory.
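The gap between the classical and sequential parameters can be seen concretely on the threshold class. The brute-force sketch below is our own illustration (the name `vc_dim` and the tuple encoding are ours): it computes the VC dimension, i.e., shattering by constant trees.

```python
from itertools import combinations

def vc_dim(F, Z):
    """Classical VC dimension by brute force: the largest d such that some
    d-point subset of Z realizes all 2**d sign patterns.  This is exactly
    shattering by *constant* trees z_t(eps) = z_t."""
    n = len(Z)
    best = 0
    for d in range(1, n + 1):
        for idx in combinations(range(n), d):
            patterns = {tuple(f[i] for i in idx) for f in F}
            if len(patterns) == 2 ** d:
                best = d
                break  # some d-subset is shattered; move on to d+1
    return best

# Thresholds 1{z <= theta} (in +-1 form) on seven points: eight functions.
Z = list(range(7))
thresholds = [tuple(+1 if z <= th else -1 for z in Z) for th in range(-1, 7)]
```

For this class the VC dimension is \(1\) (the pattern \((-1,+1)\) on an ordered pair is never realized), yet a complete binary-search tree of depth \(3\) is sequentially shattered by the eight functions, so the Littlestone dimension is \(3\): the sequential parameter can be exponentially larger.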

We now relate the combinatorial parameters to the size of a sequential cover. In the binary case (\(k=1\) below), a reader might notice a similarity of Theorems 4 and 5 to the classical results due to Sauer [27], Shelah [28], and Vapnik and Chervonenkis [37]. There are several approaches to proving what is often called the Vapnik–Chervonenkis–Sauer–Shelah lemma. We opt for the inductive-style proof (e.g., see the book by Alon and Spencer [4]), which becomes more natural for the case of trees because of their recursive structure.

Theorem 4

Let \({\mathcal F}\subseteq {\{0,\ldots , k\}}^{\mathcal Z}\) be a class of functions with \({\mathrm {fat}}_2({\mathcal F}, {\mathcal Z}) = d\). Then for any \(n> d\),

$$\begin{aligned} {\mathcal N}_\infty (1/2, {\mathcal F}, n)\le \sum _{i=0}^d {n\atopwithdelims ()i} k^i\le \left( \frac{ekn}{d}\right) ^d. \end{aligned}$$

Furthermore, for \(n\le d\),

$$\begin{aligned} {\mathcal N}_\infty (1/2, {\mathcal F}, n) \le k^n. \end{aligned}$$

Consequently, the upper bound \({\mathcal N}_\infty (1/2, {\mathcal F}, n) \le (ekn)^d\) holds for any \(n\ge 1\).
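The two estimates in Theorem 4 are easy to sanity-check numerically. The snippet below is our own illustration: it verifies the chain \(\sum _{i=0}^d \binom{n}{i} k^i \le (ekn/d)^d \le (ekn)^d\) over a small grid of parameters in the regime \(n>d\).

```python
from math import comb, e

def sauer_sum(n, d, k):
    """Left-hand side of Theorem 4's estimate: sum_{i=0}^{d} C(n, i) k^i."""
    return sum(comb(n, i) * k ** i for i in range(d + 1))

# Spot-check the chain  sum <= (e k n / d)^d <= (e k n)^d  for n > d.
for n in range(2, 30):
    for d in range(1, n):
        for k in (1, 2, 5):
            assert sauer_sum(n, d, k) <= (e * k * n / d) ** d <= (e * k * n) ** d
```

The standard proof of the first inequality multiplies the sum by \((d/(kn))^d\) and compares with \((1+d/n)^n \le e^d\); the second inequality holds since \(d\ge 1\).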

Armed with Theorem 4, we can approach the problem of bounding the size of a sequential cover at scale \(\alpha \) through discretization. For the classical case of a cover based on a set of points, the discretization idea appears in [3, 20]. When passing from the combinatorial result to the cover at scale \(\alpha \) in Corollary 1, it is crucial that the statement of Theorem 4 is in terms of \({\mathrm {fat}}_2({\mathcal F})\) rather than \({\mathrm {fat}}_1({\mathcal F})\). This point can be seen in the proof of Corollary 1: unavoidably, the discretization process can map almost identical function values to distinct discrete values which differ by \(1\), forcing us to demand shattering at scale \(2\).

We now show that the sequential covering numbers are bounded in terms of the sequential fat-shattering dimension.

Corollary 1

Let \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\). For any \(\alpha >0\) and any \(n\ge 1\), we have that

$$\begin{aligned} {\mathcal N}_\infty (\alpha , {\mathcal F}, n) \le \left( \frac{2e n}{\alpha }\right) ^{{\mathrm {fat}}_{\alpha } ({\mathcal F})}. \end{aligned}$$

In the classical (non-sequential) case, it is known that the \(\ell _\infty \) covering numbers cannot be dimension-independent, suggesting that the dependence on \(n\) in the above estimate cannot be removed altogether. It is interesting to note that the clean upper bound of Corollary 1 is not yet known for the classical case of \(\ell _\infty \) covering numbers (see [26]). Finally, we remark that the question of proving a dimension-free bound for the sequential \(\ell _2\) covering number in the spirit of [20] is still open.

We state one more result, with a proof similar to that of Theorem 4. It provides a bound on the \(0\)-cover in terms of the \({\mathrm {fat}}_1({\mathcal F})\) combinatorial parameter. Of particular interest is the case \(k=1\), when \({\mathrm {fat}}_1 ({\mathcal F})= {\mathrm {Ldim}}({\mathcal F})\).

Theorem 5

Let \({\mathcal F}\subseteq {\{0,\ldots , k\}}^{\mathcal Z}\) be a class of functions with \({\mathrm {fat}}_1({\mathcal F},{\mathcal Z}) = d\). Then for any \(n>d\),

$$\begin{aligned} {\mathcal N}(0, {\mathcal F}, n) \le \sum _{i=0}^d {n\atopwithdelims ()i} k^i \le \left( \frac{ekn}{d} \right) ^d. \end{aligned}$$

Furthermore, for \(n\le d\),

$$\begin{aligned} {\mathcal N}(0, {\mathcal F}, n) \le (k+1)^n. \end{aligned}$$

Consequently, the upper bound \({\mathcal N}(0, {\mathcal F}, n) \le (ekn)^d\) holds for any \(n\ge 1\).

In addition to the above connections between the combinatorial dimensions and covering numbers, we can also relate the scale-sensitive dimension to sequential Rademacher averages. This will allow us to “close the loop” and show equivalence of the introduced complexity measures. The following lemma asserts that the fat-shattering dimensions at “large enough” scales cannot be too large and provides a lower bound for the Rademacher complexity.

Lemma 2

Let \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\). For any \(\beta > 2{\mathfrak R}_n({\mathcal F})\), we have that \({\mathrm {fat}}_\beta ({\mathcal F}) < n\). Furthermore, for any \(\beta > 0\), it holds that

$$\begin{aligned} \min \left\{ {\mathrm {fat}}_\beta ({\mathcal F}), n\right\} \le \frac{32\, n\,{\mathfrak R}_n({\mathcal F})^2}{\beta ^2}. \end{aligned}$$

The following lemma complements Theorem 3.

Lemma 3

For any function class \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\), we have that

$$\begin{aligned} {\mathfrak D}^\infty _n({\mathcal F}) \le 8\ {\mathfrak R}_n({\mathcal F})\left( 1 + 4 \sqrt{2} \ \log ^{3/2}(e n^2)\right) \end{aligned}$$

as long as \({\mathfrak R}_n({\mathcal F}) \ge 1/n\).

Theorems 2 and 3, together with Lemma 3, imply that the quantities \({\mathfrak D}^\infty _n({\mathcal F}), {\mathfrak D}^2_n({\mathcal F}), {\mathfrak R}_n({\mathcal F})\), and \(\sup _{{\mathbb {P}}}{\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb M}_n(f)\) are equivalent up to poly-logarithmic in \(n\) factors:

$$\begin{aligned} \frac{1}{2} \left( {\mathfrak R}_n({\mathcal F}) - \frac{B}{2\sqrt{n}} \right)&\le \sup _{{\mathbb {P}}}\, {\mathbb {E}}\sup _{f\in {\mathcal F}} {\mathbb M}_n(f) \le 2{\mathfrak R}_n({\mathcal F}) \le 2{\mathfrak D}_n({\mathcal F})\nonumber \\&\le 16{\mathfrak R}_n({\mathcal F})\left( 1 + 4 \sqrt{2} \ \log ^{3/2}(e n^2)\right) \end{aligned}$$
(10)

as long as \({\mathfrak R}_n({\mathcal F}) \ge 1/n\), with \(B\) defined as in Theorem 2. Additionally, the upper and lower bounds in terms of the fat-shattering dimension follow, respectively, from the integrated complexity bound and Corollary 1, and from Lemma 2.

At this point, we have introduced all the key notions of sequential complexity and shown the fundamental connections between them. In the next section, we turn to the proof of Theorem 1.

7 Sequential uniform convergence

In order to prove Theorem 1, we will need to show in-probability (rather than in-expectation) versions of some of the earlier results. Luckily, the proof techniques are not significantly different. First, we prove Lemma 4, an in-probability version of the sequential symmetrization technique of Theorem 2. Let us use the shorthand \({\mathbb {E}}_{t}\left[ \cdot \right] = {\mathbb {E}}[\cdot \, |\, Z_1,\ldots ,Z_t]\).

Lemma 4

Let \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\). For any \(\alpha >0\), it holds that

$$\begin{aligned}&{\mathbb {P}}\left( \sup _{f \in \mathcal {F}} \left| \frac{1}{n}\sum _{t=1}^n \left( f(Z_t) - {\mathbb {E}}_{t-1}\left[ f(Z_t) \right] \right) \right| > \alpha \right) \\&\quad \le 4 \sup _{{\mathbf z}}\ {\mathbb {P}}_{\epsilon }\left( \sup _{f \in \mathcal {F}} \left| \frac{1}{n}\sum _{t=1}^{n} \epsilon _t f({\mathbf z}_t(\epsilon )) \right| > \alpha /4\right) . \end{aligned}$$

The next result is an analogue of Eq. (9).

Lemma 5

Let \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\). For any \(\alpha >0\), we have that

$$\begin{aligned} {\mathbb {P}}_{\epsilon }\!\left( \sup _{f \in \mathcal {F}} \left| \frac{1}{n} \sum _{t=1}^{n} \epsilon _t f({\mathbf z}_t(\epsilon )) \right| > \alpha /4 \right) \le 2 \mathcal {N}_1(\alpha /8, {\mathcal F}, {\mathbf z}) e^{- n \alpha ^2 /128} \le 2\left( \frac{16e n}{\alpha }\right) ^{{\mathrm {fat}}_{\alpha /8}} e^{- n \alpha ^2 /128} \end{aligned}$$

for any \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\).

Proof of Theorem 1

Let \(E_n^\alpha \) denote the event

$$\begin{aligned} \frac{1}{n} \sup _{f \in {\mathcal F}}\left| \sum _{t=1}^n \left( f(Z_t) - \mathbb {E}_{t-1}[f(Z_t)] \right) \right| > \alpha . \end{aligned}$$

Combining Lemma 4 and Lemma 5, for any distribution,

$$\begin{aligned} {\mathbb {P}}\left( E_n^\alpha \right) \le 8 \left( \frac{16e n}{\alpha }\right) ^{{\mathrm {fat}}_{\alpha /8}} e^{- n \alpha ^2 /128}. \end{aligned}$$

We have, for a fixed \(n'\),

$$\begin{aligned} {\mathbb {P}}\left( \sup _{n \ge n'} \sup _{f \in \mathcal {F}} \left| {\mathbb M}_n(f) \right| > \alpha \right) \le \sum _{n \ge n'} {\mathbb {P}}\left( E_{n}^\alpha \right) \le \sum _{n \ge n'} 8 \left( \frac{16e n}{\alpha }\right) ^{{\mathrm {fat}}_{\alpha /8}} e^{- n \alpha ^2 /128}. \end{aligned}$$

Since the last sum does not depend on \({\mathbb {P}}\), we may take the supremum over \({\mathbb {P}}\) and then let \(n' \rightarrow \infty \) to conclude that, if \({\mathrm {fat}}_{\alpha /8}\) is finite, then

$$\begin{aligned} \limsup _{n' \rightarrow \infty } \sup _{{\mathbb {P}}} ~{\mathbb {P}}\left( \sup _{n \ge n'} \sup _{f \in \mathcal {F}} \left| {\mathbb M}_n(f) \right| > \alpha \right) \le \limsup _{n' \rightarrow \infty } \sum _{n \ge n'} 8 \left( \frac{16e n}{\alpha }\right) ^{{\mathrm {fat}}_{\alpha /8}} e^{- n \alpha ^2 /128}=0. \end{aligned}$$

Therefore, if \({\mathrm {fat}}_{\alpha }\) is finite for all \(\alpha > 0\) then \({\mathcal F}\) satisfies sequential uniform convergence. This proves \(2 \Rightarrow 1\).

Next, notice that if \({\mathfrak R}_n({\mathcal F}) \rightarrow 0\) then for any \(\alpha > 0\) there exists \(n_\alpha < \infty \) such that \(\alpha > 2 {\mathfrak R}_{n_\alpha }({\mathcal F})\). By Lemma 2, we then have \({\mathrm {fat}}_\alpha ({\mathcal F}) < n_\alpha < \infty \), and we conclude that \(3\Rightarrow 2\). To see that \(1 \Rightarrow 3\), notice that by Theorem 2,

$$\begin{aligned} \sup _{{\mathbb {P}}}\, {\mathbb {E}}\sup _{f\in {\mathcal F}} \left| {\mathbb M}_n(f)\right| \ge \frac{1}{2} \left( {\mathfrak R}_n({\mathcal F}) - \frac{B}{2\sqrt{n}} \right) \end{aligned}$$

and so \(\lim _{n \rightarrow \infty } \sup _{{\mathbb {P}}}\, {\mathbb {E}}\sup _{f\in {\mathcal F}} \left| {\mathbb M}_n(f)\right| = 0\) implies that \({\mathfrak R}_n({\mathcal F}) \rightarrow 0\). Since almost sure convergence implies convergence in expectation, we conclude \(1 \Rightarrow 3\). \(\square \)
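The step \(2\Rightarrow 1\) above uses only the convergence of the series \(\sum _n 8(16en/\alpha )^{{\mathrm {fat}}_{\alpha /8}} e^{-n\alpha ^2/128}\): the exponential factor eventually dominates the polynomial one. As a numerical sanity check (our own illustration, not part of the proof; the parameter values \(\alpha =1\), \(d=2\) below are arbitrary), one can evaluate the tail and watch it vanish as \(n'\) grows.

```python
from math import exp, e

def tail_sum(n_start, alpha, d, n_max=200_000):
    """Partial evaluation of sum_{n >= n_start} 8 (16 e n / alpha)^d
    exp(-n alpha^2 / 128), the bound on P(sup_{n >= n'} ... > alpha) from
    the proof of Theorem 1, truncated at n_max.  The exponential factor
    dominates the polynomial one, so the tail vanishes as n_start grows."""
    return sum(8 * (16 * e * n / alpha) ** d * exp(-n * alpha ** 2 / 128)
               for n in range(n_start, n_max))

# With alpha = 1 and fat_{alpha/8} = d = 2, the summand peaks near
# n = 256 d / alpha^2 and the tail is decreasing in n' beyond that point.
```

Beyond the peak of the summand the tail decreases monotonically in \(n'\) and tends to zero, which is exactly what the last display of the proof requires.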

The final result of this section is a stronger version of Lemma 5, showing that sequential Rademacher complexity is, in some sense, the “right” complexity measure even when one considers high probability statements. This lemma will be used in Sect. 9.

Lemma 6

Let \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\). Suppose \({\mathrm {fat}}_\alpha ({\mathcal F})\) is finite for all \(\alpha > 0\) and that the following mild assumptions hold: \({\mathfrak R}_n({\mathcal F})\ge 1/n\), \({\mathcal N}_\infty (2^{-1}, {\mathcal F}, n)\ge 4\), and there exists a constant \(L\) such that \(L > \sum _{j=1}^\infty {\mathcal N}_\infty (2^{-j},{\mathcal F},n)^{-1}\). Then for any \(\theta >\sqrt{12/n}\) and any \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\),

$$\begin{aligned}&{\mathbb {P}}_\epsilon \left( \sup _{f\in {\mathcal F}} \left| \frac{1}{n}\sum _{t=1}^n \epsilon _t f({\mathbf z}_t(\epsilon ))\right| > 8\left( 1+\theta \sqrt{8n \log ^{3}(en^2) }\right) \cdot {\mathfrak R}_n({\mathcal F}) \right) \\&\quad \le {\mathbb {P}}_\epsilon \left( \sup _{f\in {\mathcal F}} \left| \frac{1}{n}\sum _{t=1}^n \epsilon _t f({\mathbf z}_t(\epsilon ))\right| > \inf _{\alpha > 0}\left\{ 4 \alpha + 6 \theta \int _{\alpha }^{1} \sqrt{\log {\mathcal N}_\infty (\delta ,{\mathcal F},n)} d \delta \right\} \right) \\&\quad \le 2L e^{- \frac{n \theta ^2}{4}}. \end{aligned}$$

We established that sequential Rademacher complexity, as well as the sequential versions of covering numbers and fat-shattering dimensions, provide necessary and sufficient conditions for sequential uniform convergence. Let us now make a connection to the results of [34], where the notion of generalized entropy with bracketing was studied. In [34], this complexity measure was shown to provide upper bounds on uniform deviations of martingale difference sequences for a given distribution \({\mathbb {P}}\). We do not know whether having a “small” generalized entropy with bracketing (more precisely, decay of the Dudley integral with generalized bracketing entropy) with respect to all distributions is also a necessary condition for sequential uniform convergence. Nevertheless, using the results of this work, we can show that the generalized entropy with bracketing can be used as a tool to establish sequential uniform convergence for tree processes. Specifically, by Theorem 1, to establish this convergence it is enough to show uniform convergence for the tree process \(\sup _{f \in \mathcal {F}} {\mathbb T}_n^{({\mathbf z})}(f)\) for any \({\mathcal Z}\)-valued tree \({\mathbf z}\). It is therefore sufficient to consider only the generalized entropy with bracketing for tree processes. For a tree process on any \({\mathbf z}\), however, the notion of the generalized entropy coincides with the notion of a sequential cover in the \(\ell _\infty \) sense. Indeed, the brackets for the tree process are pairs of real-valued trees. By taking the center of each bracket, one obtains a sequential cover; conversely, by using a covering tree as a center one obtains a bracket. We conclude that convergence of the Dudley-type integral with the generalized entropy with bracketing for all tree processes is a necessary and sufficient condition for sequential uniform convergence.

We end this section by mentioning that measurability of \(\sup _{f \in {\mathcal F}} {\mathbb M}_n(f)\) can be ensured with some regularity conditions on \({\mathcal Z}\) and \({\mathcal F}\). For instance, we may assume that \({\mathcal F}\) is a class of uniformly bounded measurable functions that is image admissible Suslin (there is a map \(\varGamma \) from a Polish space \({\mathcal Y}\) to \({\mathcal F}\) such that the composition of \(\varGamma \) with the evaluation map \((y, z) \mapsto \varGamma (y)(z)\) is jointly measurable). In this case, it is easy to check that \(\sup _{f \in {\mathcal F}} {\mathbb M}_n(f)\) is indeed measurable (see, for instance, Corollary 5.3.5 in [11]).

8 Structural results

Being able to bound the complexity of a function class by the complexity of a simpler class is of great utility for proving bounds. In empirical process theory, such structural results are obtained through properties of Rademacher averages [7, 19]. In particular, the contraction inequality due to Ledoux and Talagrand [17, Corollary 3.17] allows one to pass from the composition of a Lipschitz function with a class to the function class itself. This wonderful property permits easy convergence proofs for a vast array of problems.

We show that the notion of sequential Rademacher complexity also enjoys many of the same properties. In particular, the following is a sequential analogue of the Ledoux–Talagrand contraction inequality [7, 17].

Lemma 7

Fix a class \({\mathcal F}\subseteq [-1,1]^{\mathcal Z}\) with \({\mathfrak R}_n({\mathcal F})\ge 1/n\). Let \(\phi :{\mathbb R}\rightarrow {\mathbb R}\) be a Lipschitz function with constant \(L\). Then

$$\begin{aligned} {\mathfrak R}_n (\phi \circ {\mathcal F}) \le 8\,L\,\left( 1+4\sqrt{2}\log ^{3/2}(en^2)\right) \cdot {\mathfrak R}_n({\mathcal F}). \end{aligned}$$

In comparison to the classical result, here we get an extra logarithmic factor in \(n\). Whether the result without this logarithmic factor can be proved for the tree process remains an open question.

In the next proposition, we summarize some other useful properties of sequential Rademacher complexity (see [7, 19] for the results in the i.i.d. setting).

Proposition 1

Sequential Rademacher complexity satisfies the following properties. For any \({\mathcal Z}\)-valued tree \({\mathbf z}\) of depth \(n\):

  1.
    If \({\mathcal F}\subseteq {\mathcal G}\), then \({\mathfrak R}_n({\mathcal F},{\mathbf z}) \le {\mathfrak R}_n({\mathcal G},{\mathbf z})\).

  2.
    \({\mathfrak R}_n({\mathcal F},{\mathbf z}) = {\mathfrak R}_n (\mathrm{conv}({\mathcal F}),{\mathbf z})\).

  3.
    \({\mathfrak R}_n(c{\mathcal F},{\mathbf z}) = |c|\,{\mathfrak R}_n({\mathcal F},{\mathbf z})\) for all \(c\in {\mathbb R}\).

  4.
    For any \(h\), \({\mathfrak R}_n({\mathcal F}+h,{\mathbf z}) = {\mathfrak R}_n({\mathcal F},{\mathbf z})\), where \({\mathcal F}+h = \{f+h: f\in {\mathcal F}\}\).
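On constant trees, sequential Rademacher complexity reduces to the classical Rademacher average, so Properties 2–4 can be spot-checked by exact enumeration over sign sequences. The sketch below is our own illustration (the toy class and the tree are arbitrary choices of ours, not from the paper).

```python
from itertools import product

def rad_const_tree(F, z):
    """Sequential Rademacher complexity R_n(F, z) for the *constant* tree
    z_t(eps) = z[t], computed by exact enumeration of eps in {-1,+1}^n.
    (On constant trees the sequential notion reduces to the classical
    Rademacher average.)"""
    n = len(z)
    return sum(max(sum(e * f(zt) for e, zt in zip(eps, z)) / n for f in F)
               for eps in product([-1, 1], repeat=n)) / 2 ** n

# A toy class of real-valued functions and a depth-3 constant tree.
F = [lambda x: x, lambda x: -x, lambda x: x * x - 0.5]
z = [0.3, -0.7, 0.9]
R = rad_const_tree(F, z)

# Property 3: R_n(cF, z) = |c| R_n(F, z), here with c = -2.
cF = [lambda x, f=f: -2.0 * f(x) for f in F]
assert abs(rad_const_tree(cF, z) - 2.0 * R) < 1e-12

# Property 4: translation invariance, R_n(F + h, z) = R_n(F, z).
h = lambda x: 0.25
Fh = [lambda x, f=f: f(x) + h(x) for f in F]
assert abs(rad_const_tree(Fh, z) - R) < 1e-12

# Property 2: adding a convex combination does not change the complexity,
# since the supremum of a linear functional over conv(F) is attained on F.
mix = lambda x: 0.5 * F[0](x) + 0.5 * F[2](x)
assert abs(rad_const_tree(F + [mix], z) - R) < 1e-12
```

All three identities hold exactly under enumeration (up to floating-point error): the translation term averages to zero over symmetric signs, and scaling by \(c<0\) is absorbed by the sign symmetry of \(\epsilon \).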

The structural results developed in this section are crucial for analyzing sequential prediction problems, a topic we explore in a separate paper.

9 Application: concentration of martingales in Banach spaces

As a consequence of the uniform convergence results, one can obtain concentration inequalities for martingale difference sequences in Banach spaces. We first state a rather straightforward lemma that follows from Lemmas 4 and 6.

Lemma 8

For \({\mathcal F}\subseteq [-1,1]^{{\mathcal Z}}\), for \(n\ge 2\) and any \(\alpha >0\), we have that

$$\begin{aligned} {\mathbb {P}}\left( \sup _{f \in \mathcal {F}} \left| \frac{1}{n}\sum _{t=1}^n \left( f(Z_t) - {\mathbb {E}}_{t-1}\left[ f(Z_t) \right] \right) \right| > \alpha \right) \le 8L \exp \left( -\frac{\alpha ^2}{c \log ^{3}(n)\, {\mathfrak R}^2_n({\mathcal F})} \right) \end{aligned}$$

under the mild assumptions \({\mathfrak R}_n({\mathcal F})\ge 1/n\) and \({\mathcal N}_\infty (2^{-1}, {\mathcal F}, n)\ge 4\). Here \(c\) is an absolute constant and \(L > e^4\) is such that \(L > \sum _{j=1}^\infty {\mathcal N}_\infty (2^{-j},{\mathcal F},n)^{-1}\).

Let us now consider the case of a unit ball in a Banach space and discuss the conditions under which the main result of this section (Corollary 2 below) is stated. Let \({\mathcal Z}\) be the unit ball of a Banach space with norm \(\Vert {\cdot }\Vert \) and consider the class \({\mathcal F}\) of continuous linear mappings \(z \mapsto \left<f,z\right>\) with \(\Vert {f}\Vert _{*}\le 1\), where \(\Vert {\cdot }\Vert _{*}\) is the dual to the norm \(\Vert {\cdot }\Vert \). By definition of the norm,

$$\begin{aligned} \sup _{f \in \mathcal {F}} \left| \frac{1}{n}\sum _{t=1}^n \left( f(Z_t) - {\mathbb {E}}_{t-1}\left[ f(Z_t) \right] \right) \right| = \left\| \frac{1}{n}\sum _{t=1}^n \left( Z_t - {\mathbb {E}}_{t-1}\left[ Z_t \right] \right) \right\| \end{aligned}$$

and

$$\begin{aligned} {\mathfrak R}_n({\mathcal F}) = \sup _{{\mathbf z}}\, {\mathbb {E}}_{\epsilon } \left\| \frac{1}{n}\sum _{t=1}^{n} \epsilon _t\, {\mathbf z}_t(\epsilon ) \right\| . \end{aligned}$$
Further note that for any linear class and any \(\gamma > 0\), \({\mathcal N}_\infty (\gamma ,{\mathcal F},n) \ge 1/\gamma \), and so \(\sum _{j=1}^\infty {\mathcal N}_\infty (2^{-j},{\mathcal F},n)^{-1} \le 2\). In view of Lemma 8, under the mild condition that \({\mathcal N}_\infty (2^{-1}, {\mathcal F}, n)\ge 4\),

$$\begin{aligned} {\mathbb {P}}\left( \left\| \frac{1}{n}\sum _{t=1}^n \left( Z_t - {\mathbb {E}}_{t-1}\left[ Z_t \right] \right) \right\| > \alpha \right) \le C \exp \left( -\frac{\alpha ^2}{c \log ^{3}(n)\, {\mathfrak R}^2_n({\mathcal F})} \right) \end{aligned}$$

for absolute constants \(c,C>0\). It remains to provide an upper bound on the sequential Rademacher complexity \({\mathfrak R}_n({\mathcal F})\). To this end, recall that a function \(\varPsi : {\mathcal F}\rightarrow {\mathbb R}\) is \((\sigma ,q)\)-uniformly convex (for \(q\in [2,\infty )\)) with respect to a norm \(\left\| \cdot \right\| _{*}\) if, for all \(\theta \in [0,1]\) and \(f_1,f_2 \in {\mathcal F}\),

$$\begin{aligned} \varPsi ( \theta f_1 + (1-\theta ) f_2 ) \le \theta \varPsi (f_1) + (1-\theta ) \varPsi (f_2) - \frac{\sigma \,\theta \,(1-\theta )}{q} \big \Vert f_1 - f_2 \big \Vert _{*}^{q}. \end{aligned}$$
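As a concrete instance of this definition (our own illustration, not part of the paper), \(\varPsi (f) = \frac{1}{2}\Vert f\Vert _2^2\) is \((1,2)\)-uniformly convex with respect to the Euclidean norm; in fact, for this \(\varPsi \) the defining inequality holds with equality, \(\theta \varPsi (f_1)+(1-\theta )\varPsi (f_2)-\varPsi (\theta f_1+(1-\theta )f_2) = \frac{\theta (1-\theta )}{2}\Vert f_1-f_2\Vert _2^2\). The sketch below verifies the identity numerically.

```python
import random

def gap(psi, f1, f2, theta):
    """theta psi(f1) + (1-theta) psi(f2) - psi(theta f1 + (1-theta) f2)."""
    mix = [theta * a + (1 - theta) * b for a, b in zip(f1, f2)]
    return theta * psi(f1) + (1 - theta) * psi(f2) - psi(mix)

def psi_sq(f):
    """psi(f) = ||f||_2^2 / 2, (1,2)-uniformly convex w.r.t. the l2 norm."""
    return sum(x * x for x in f) / 2.0

# For this psi the defining inequality is an identity with sigma = 1, q = 2:
#   gap == (theta (1-theta) / 2) ||f1 - f2||_2^2.
random.seed(0)
for _ in range(100):
    f1 = [random.uniform(-1, 1) for _ in range(5)]
    f2 = [random.uniform(-1, 1) for _ in range(5)]
    theta = random.random()
    rhs = theta * (1 - theta) / 2.0 * sum((a - b) ** 2 for a, b in zip(f1, f2))
    assert abs(gap(psi_sq, f1, f2, theta) - rhs) < 1e-9
```

The identity follows by expanding \(\Vert \theta f_1 + (1-\theta )f_2\Vert _2^2\); for a general \((\sigma ,q)\)-uniformly convex \(\varPsi \), only the inequality is required.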

Proposition 2

Suppose that \(\varPsi \) is \((\sigma ,q)\)-uniformly convex with respect to a given norm \(\left\| \cdot \right\| _{*}\) on \({\mathcal F}\) and \(0 \le \varPsi (f) \le \varPsi _{max}\) for all \(f \in {\mathcal F}\). Then we have

where \(p>1\) is such that \(1/p+1/q=1\), and \(C_p = (p/(p-1))^{\frac{p-1}{p}}\).

The following is the main result of this section:

Corollary 2

Given any function \(\varPsi \) that is \((\sigma ,q)\)-uniformly convex with respect to \(\left\| \cdot \right\| _{*}\) such that \(0 \le \varPsi (f) \le \varPsi _{max}\) for all \(f \in {\mathcal F}\), and any \(\alpha > 0\), we have

where \(\frac{1}{q} + \frac{1}{p} = 1\), \(q\in [2, \infty )\), and \(n\ge 2\). Here, \(C\) is an absolute constant, and \(c_p\) depends only on \(p\).

We suspect that the \(\log ^{3}(n)\) term in the above bound is an artifact of the proof technique and can probably be avoided. For instance, when the norm is equivalent to a \(2\)-smooth norm (that is, \(p=2\)), Pinelis [22] shows concentration of martingale difference sequences in the Banach space without the extra \(\log ^{3}(n)\) term and with better constants. However, his argument is specific to the \(2\)-smooth case. As a rather direct consequence of the uniform convergence results for dependent random variables, we are able to provide concentration of martingales in Banach spaces for general norms.

Let \((W_t)_{t\ge 1}\) be a martingale difference sequence in a Banach space such that for any \(t\), \(\Vert {W_t}\Vert \le 1\). The celebrated result of Pisier [23] states that

$$\begin{aligned} {\mathbb {E}}\left\| \frac{1}{n}\sum _{t=1}^{n} W_t \right\| \rightarrow 0 \quad \text {as } n\rightarrow \infty \end{aligned}$$
(11)

if and only if the Banach space can be given an equivalent \(p\)-smooth norm for some \(p >1\). Using duality, this can equivalently be restated as follows: (11) holds for any martingale difference sequence \((W_t)_{t\ge 1}\) with \(\Vert {W_t}\Vert \le 1\) if and only if there exists a function \(\varPsi : {\mathcal F}\rightarrow {\mathbb R}\) that is \((1,q)\)-uniformly convex for some \(q < \infty \) with respect to the dual norm \(\Vert {\cdot }\Vert _{*}\) and satisfies \(\varPsi _{\max } \le C < \infty \). Furthermore, it can be shown that the rate of convergence of the expected norm of the martingale is tightly governed by the smallest such \(q\). Hence, combined with Corollary 2, we conclude that whenever the expected norms of martingales in a Banach space converge, exponential bounds like the one in Corollary 2 must also hold.