1 Introduction

Many high dimensional learning problems require sample sizes that, in general, must grow with the dimension of the data representation. Examples include learning with scale-insensitive loss functions such as the 0–1 loss, learning on unbounded input or parameter domains (Mohri et al., 2012; Shalev-Shwartz & Ben-David, 2014), learning Lipschitz classifiers (Gottlieb & Kontorovich, 2014), metric learning (Verma & Branson, 2015), and others. A common approach to such problems is to employ some form of regularisation constraint that reflects prior knowledge about the problem, when available. Indeed, natural data sources and real-world learning problems tend to possess some hidden low complexity structure, which can in principle permit effective learning from relatively small samples. However, knowing these structures in advance, so as to devise appropriate learning algorithms, can be a challenge.

In this work, we are interested in the general question of how to discover and exploit such hidden benign traits when problem-specific prior knowledge is insufficient, based on just a general-purpose low complexity conjecture.

We address this question through random projection’s ability to expose structure—an ability previously studied in contexts as distinct as high dimensional phenomena (Bartl & Mendelson, 2021), geometric functional analysis (Liaw et al., 2017), and brain research (Papadimitriou & Vempala, 2019). Random projection (RP) is a simple, computationally efficient linear dimensionality reduction technique that preserves Euclidean structure with high probability. In machine learning, this can speed up computations at the price of a controlled loss of accuracy—this is generally referred to as compressive learning, in analogy with compressive sensing. Moreover, RP has a regularisation effect, and it has also been used as an analytic tool to better understand high dimensional learning in an early conference version of this work (Kabán, 2019).

The remainder of this section sets up the problem and gives a motivating example. In Sect. 2 we give simple PAC bounds in the agnostic setting, both for compressive learning and for high dimensional learning. Our goal here is to work under minimal assumptions and isolate interpretable structural quantities that help gain intuitive insights into generalisation in high dimensional, small sample situations. We term these the compressive distortion and the compressive complexity in the compressed and uncompressed settings respectively, and we show that our bounds can be tight when these quantities are small.

In Sect. 3 we instantiate the above by bounding the problem-specific quantities that appear in these bounds for several widely-used model classes. These worked examples demonstrate how these quantities unearth structural characteristics that make these specific problems solvable to good approximation in a random linear subspace. In the examples considered, these turn out to take the form of some familiar benign traits such as the margin, the margin distribution, the intrinsic dimension, the spectral decay of the data covariance, or the norms of parameters—all of which remove dimensionality-dependence from error-guarantees in settings where such dependence is known to be essential in general. At the same time, our general notions of compressive distortion and compressive complexity serve to unify these characteristics, and may be used beyond the examples pursued here. We also show how one can use unlabelled data to estimate these general quantities when analytic bounds are infeasible, and this procedure recovers a form of consistency regularisation (Laine & Aila, 2017), which is a semi-supervised technique widely used in practice.

1.1 Problem setting

1.1.1 High dimensional learning

Let \({{\mathcal {X}}}_d\subset {\mathbb {R}}^{d}\) be an input domain, and \({{\mathcal {Y}}}\) the target domain—e.g. \({{\mathcal {Y}}}=\{-1,1\}\) in classification, \({{\mathcal {Y}}}\subseteq {\mathbb {R}}\) in regression. We are interested in high dimensional problems, so d can be arbitrarily large.

Let \({\mathcal {H}}_d\) be a function class (hypothesis class) with elements \(h: {{\mathcal {X}}}_d\rightarrow {{\mathcal {Y}}}\). The loss function \(\ell :{{\mathcal {Y}}}\times {{\mathcal {Y}}}\rightarrow [0,\bar{\ell }]\) quantifies the mismatch between predictions and targets. Throughout this work we assume that the loss is bounded i.e. \(\bar{\ell }<\infty\). This simplifying assumption is often made in algorithm-independent theoretical analyses, either by clipping the loss, or by working with bounded functions \(h\in {\mathcal {H}}_d\) e.g. by constraining both the parameter and input spaces to bounded sets. Several examples may be found in (Rosasco et al., 2004). Boundedness is often natural too, since classification losses in use are typically surrogates for the 0–1 loss, which is bounded by \(\bar{\ell }=1\).

We are given a set of labelled examples \({\mathcal {T}}_N=\{(X_1,Y_1),\dots ,(X_N,Y_N) \}\) drawn i.i.d. from some unknown distribution \({\mathbb {P}}\) over \({{\mathcal {X}}}_d\times {{\mathcal {Y}}}\). The learning problem is to select a function from \({\mathcal {H}}_d\) with smallest generalisation error \(E_{(X,Y)\sim {\mathbb {P}}}[\ell (h(X),Y)]\), using the sample \({\mathcal {T}}_N\).

Let \({\mathcal {G}}_d=\ell \circ {\mathcal {H}}_d= \{(x,y)\rightarrow g(x,y)=\ell (h(x),y): h\in {\mathcal {H}}_d\}\) denote the loss class under study. Expectations with respect to (w.r.t.) the unknown data distribution \({\mathbb {P}}\) will be denoted by the shorthand \(E[g]:=E_{(X,Y)\sim {\mathbb {P}}}[g(X,Y)] = \int _{{{\mathcal {X}}}\times {{\mathcal {Y}}}}g d{\mathbb {P}}\). Sample averages, i.e. expectations w.r.t. the empirical measure \(\hat{{\mathbb {P}}}_N\) defined by a sample \({\mathcal {T}}_N\), will be denoted as \(\hat{E}_{{\mathcal {T}}_N}[g]:= \hat{E}_{{\mathcal {T}}_N}[g(X,Y)] = \frac{1}{N}\sum _{n=1}^N g(X_n,Y_n) = \int _{{{\mathcal {X}}}\times {{\mathcal {Y}}}}gd\hat{{\mathbb {P}}}_N\), where \(\hat{{\mathbb {P}}}_N=\frac{1}{N}\sum _{n=1}^N \delta _{(X_n,Y_n)}\), and \(\delta _{(X,Y)}\) is the probability distribution concentrated at (X, Y). A best element of \({\mathcal {H}}_d\) is denoted by \(h^*\in \underset{h\in {\mathcal {H}}_d}{{\text {arg inf}}}~E[\ell \circ h]\), \(g^*:=\ell \circ h^*\); a sample error minimiser is \({\hat{h}}\in \underset{h\in {\mathcal {H}}_d}{\text {arg min}}~ \hat{E}_{{\mathcal {T}}_N}[\ell \circ h]\), and \(\hat{g}:=\ell \circ \hat{h}\).

1.1.2 Compressive learning

Let \(k \le d\) be integers, and \(R \in {\mathbb {R}}^{k \times d}\) a random matrix with independent and identically distributed (i.i.d.) entries from a 0-mean 1/k-variance distribution, chosen to satisfy the Johnson–Lindenstrauss (JL) property (Property 5.1). This is referred to as a random projection (RP) (Arriaga & Vempala, 1999; Matoušek, 2008). For instance, a random matrix with i.i.d. Gaussian entries is known to satisfy JL. For simplicity, throughout this paper we will work with Gaussian RP, which serves as a simple dimensionality reduction method. While RP is not a projection in a strict linear-algebraic sense, the rows of R have approximately identical lengths and are approximately orthogonal to each other with high probability—hence the established nomenclature of "random projection".
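For concreteness, the following minimal Python sketch (illustrative only; the data, dimensions and variable names are our own choices) draws a Gaussian RP matrix with i.i.d. \({\mathcal {N}}(0,1/k)\) entries and checks empirically that squared Euclidean norms are approximately preserved, in line with the JL property.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 200, 50          # ambient dimension, projected dimension, number of points

X = rng.standard_normal((n, d))    # toy data, one point per row
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # i.i.d. entries, mean 0, variance 1/k

Z = X @ R.T                        # compressed points R X_n, one per row

# For k large enough, ||R x||^2 / ||x||^2 concentrates around 1 (JL property).
ratios = (Z ** 2).sum(axis=1) / (X ** 2).sum(axis=1)
print(ratios.min(), ratios.max())  # typically within a few percent of 1
```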

We denote the compressed input domain by \({{\mathcal {X}}}_R\equiv R({{\mathcal {X}}}) \subseteq {\mathbb {R}}^k\), and have analogous definitions, indexed by R, as follows. The compressed function class \({\mathcal {H}}_R\) contains functions of the form \(h_R: {{\mathcal {X}}}_R\rightarrow {{\mathcal {Y}}}\). The learning algorithm receives the compressed training set, denoted \({\mathcal {T}}_R^N=\{( RX_{n},Y_{n})\}_{n=1}^{N}\), and selects a function from \({\mathcal {H}}_R\).

We denote a sample error minimiser in this reduced class by \({\hat{h}}_R \in \underset{h_R\in {\mathcal {H}}_R}{{\text {arg inf}}}\; \hat{E}_{{\mathcal {T}}_R^N}[\ell \circ h_R]\), where \(\hat{E}_{{\mathcal {T}}_R^N}[\ell \circ h_R] = \frac{1}{N}\sum _{n=1}^N \ell (h_R(RX_n),Y_n)\) is the empirical error of the compressed learning problem, and denote \(\hat{g}_R:=\ell \circ \hat{h}_R\). Likewise, \(h^*_R\in \underset{h_R\in {\mathcal {H}}_R}{{\text {arg inf}}}~E[\ell \circ h_R]\) denotes a best function in \({\mathcal {H}}_R\), \(g^*_R:=\ell \circ h^*_R\).

We are interested in the generalisation error of the compressed sample minimiser \(\hat{h}_R\), that is \(E_{(X,Y)\sim {\mathbb {P}}}[\ell ({\hat{h}}_R(RX),Y)]\), relative to the best \(h^*\in {\mathcal {H}}_d\).
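The following sketch illustrates this pipeline end to end. It is purely schematic: the data are synthetic and linearly separable, and an ordinary least-squares fit stands in for the ERM step; none of these choices are prescribed by the theory above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, N = 2_000, 50, 200

# Synthetic linearly separable data with a planted direction h_true.
h_true = rng.standard_normal(d)
X = rng.standard_normal((N, d))
Y = np.sign(X @ h_true)

R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # Gaussian RP, fixed before learning
Z = X @ R.T                                          # compressed training inputs R X_n

# Sample-error minimisation in the compressed class (least squares as a simple surrogate).
h_R, *_ = np.linalg.lstsq(Z, Y, rcond=None)

# Generalisation is measured on fresh points, compressed with the same R.
X_test = rng.standard_normal((5_000, d))
Y_test = np.sign(X_test @ h_true)
test_error = np.mean(np.sign((X_test @ R.T) @ h_R) != Y_test)
print(f"test error of the compressive classifier: {test_error:.3f}")
```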

Let us end this introduction with an example that showcases the regularisation effect of RP, and demonstrates a failure of empirical risk minimisation (ERM) without regularisation. This will motivate our approach of introducing novel quantities in Sect. 2, and the instantiations of these quantities later in Sect. 3 may be regarded as a strategy to derive model-specific regularisers from the structure-preserving ability of RP. In our bounds, these quantities will be responsible for dimension-independence.

1.2 A motivating example

Random projection based dimensionality reduction is most commonly motivated by computational speed-up and storage savings, and these benefits may come at the expense of a slight deterioration in accuracy. But this is just part of the story. In this section we make the picture more complete with a simple example which highlights that RP has a regularisation effect, without which ERM can actually fail.

Theorem 1

(ERM can be arbitrarily bad) Let \(e_i\) be the i-th canonical basis vector, suppose the data distribution is uniform on the finite set \({{\mathcal {X}}}\times {{\mathcal {Y}}}:=S\equiv \{(e_1+e_i, 1), (-e_1-e_i, -1): i=2,\dots ,d\}\), and let \({\mathcal {T}}_N\) be an i.i.d. sample of size N. Then,

  1.

    There exists a classifier \(h_{\text {bad}}\) such that \(\hat{E}_{(X,Y)\sim {\mathcal {T}}_N}[\textbf{1}(h_{\text {bad}}^TXY\le 0)]=0\), but

    $$\begin{aligned} E_{X,Y}[\textbf{1}(h_{\text {bad}}^TXY \le 0)] \ge 1-\frac{N}{d-1}. \end{aligned}$$
  2.

    Given any \(\delta \in (0,1)\), let R be a \(k\times d\) random projection matrix with i.i.d. sub-gaussian entries independent of \({\mathcal {T}}_N\), with \(d \ge k\ge \lceil 16\log \frac{4N}{\delta }\rceil\), where \(\gamma >0\) is the normalised margin of \(h^*\) in S. Then, w.p. at least \(1-\delta\), the generalisation error of any compressive ERM \(\hat{h}_R\in {\mathbb {R}}^k\) is upper bounded as follows:

    $$\begin{aligned} E_{X,Y}\left\{ \textbf{1}\left( \hat{h}_R^TRXY\le 0\right) \right\} \le \frac{2}{N} \left( k\log \frac{2eN}{k} + \log \frac{4}{\delta } \right) \end{aligned}$$

The proof is given in Appendix Sect. 1. The construction exploits the fact that some ERM classifiers perform badly in small sample problems even when a large-margin solution exists; in contrast, RP keeps the sample separable with high probability (at a somewhat reduced margin), so in this construction any compressive ERM enjoys a dimension-free generalisation guarantee.
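The construction is easy to simulate. The sketch below builds one explicit witness of the first statement: a linear classifier that fits the training set perfectly yet has true error at least \(1-\frac{N}{d-1}\). It is our own illustration and is not claimed to reproduce the classifier used in the appendix proof.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 1_000, 50

# Support of the distribution: {(e_1 + e_i, +1), (-e_1 - e_i, -1) : i = 2, ..., d}.
idx = rng.integers(2, d + 1, size=N)        # which e_i each training point involves
signs = rng.choice([-1, 1], size=N)         # +1 picks (e_1 + e_i, +1), -1 picks (-e_1 - e_i, -1)

def point(i, s):
    x = np.zeros(d)
    x[0] = s       # coordinate along e_1
    x[i - 1] = s   # coordinate along e_i
    return x

X_train = np.stack([point(i, s) for i, s in zip(idx, signs)])
Y_train = signs.astype(float)

# A "bad" interpolating classifier: h_bad = -e_1 + 2 * sum of the e_i seen in training.
h_bad = np.zeros(d)
h_bad[0] = -1.0
h_bad[np.unique(idx) - 1] = 2.0

train_error = np.mean(np.sign(X_train @ h_bad) != Y_train)   # = 0 by construction

# h_bad misclassifies both points associated with every index i unseen in training.
unseen = d - 1 - len(np.unique(idx))
print(f"train error: {train_error:.2f}, true error: {unseen / (d - 1):.2f} "
      f">= {1 - N / (d - 1):.2f}")
```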

2 Error bounds for compressible problems

2.1 Learning with compressive ERM

We introduce the following definition, which later we use to bound the error of compressive ERM.

Definition 1

(Compressive distortion of a function) Given a function \(g\in {\mathcal {G}}_d\), we define its compressive distortion as the following:

$$\begin{aligned} D_R(g) \equiv \inf _{g_R\in {\mathcal {G}}_R}E_{X,Y}\vert {(g_R\circ R - g)(X,Y)}\vert ; \;\;\; D_k(g) \equiv E_R[D_R(g)] \end{aligned}$$
(1)

Property 2.1

The following properties are immediate:

  1.

    For all \(g\in {\mathcal {G}}_d\) and all \(k\in {\mathbb {N}}, D_k(g)\ge 0\).

  2.

    There exists \(k\le d\) s.t. \(D_k(g)=0\).

  3.

    For any k, if \(g(x,y)\in [0,\bar{\ell }]\) for all \((x,y)\in {{\mathcal {X}}}\times {{\mathcal {Y}}}\), then \(D_k(g) \in [0,\bar{\ell }]\).

  4.

    If \(\ell\) is L-Lipschitz in its first argument, then \(\forall h\in {\mathcal {H}}_d, D_k(g)\le L\cdot D_k(h)\), where \(g=\ell \circ h\).

Moreover, these properties also hold for \(D_R\).

Due to the first two properties above, as \(k\rightarrow d\), the generalisation bounds for compressive ERM will recover those for the original ERM. The last property implies that for many loss functions of interest, the compressive distortion can be bounded independently of label information.
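When analytic bounds are unavailable, the last property also suggests a label-free numerical route: the compressive distortion of a given predictor can be upper-estimated from unlabelled inputs alone. The sketch below does this for a fixed linear predictor; it is our own illustration, it relaxes the infimum in Definition 1 by the particular candidate \(z\mapsto (Rh)^Tz\), and it replaces the population expectation over the inputs by a sample average, so it returns an upper estimate rather than the exact quantity.

```python
import numpy as np

def distortion_upper_estimate(h, X, k, n_draws=100, rng=None):
    """Monte Carlo upper estimate of D_k(h) for a linear predictor h.

    Relaxes the infimum over the compressed class by the candidate compressed
    predictor z -> (R h)^T z, and averages the resulting mimicking error over
    independent draws of R. X holds unlabelled inputs, one per row.
    """
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    vals = []
    for _ in range(n_draws):
        R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))
        vals.append(np.mean(np.abs((X @ R.T) @ (R @ h) - X @ h)))
    return float(np.mean(vals))
```

By the fourth property above, multiplying the returned value by the Lipschitz constant of the loss gives an upper estimate of \(D_k(\ell \circ h)\).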

It is natural to conjecture that learning problems whose target function has small compressive distortion are easier for compressive learning. This is indeed the case, as we shall see shortly. Recall that the empirical Rademacher complexity of a function class \({\mathcal {G}}\) is defined as \({\hat{{\mathcal {R}}}}_N({\mathcal {G}})=\frac{1}{N}E_{\sigma }\sup _{g\in {\mathcal {G}}}\sum _{n=1}^N\sigma _n g(X_n,Y_n)\), where \(\sigma =(\sigma _1,\dots ,\sigma _N)\overset{\text {\tiny i.i.d}}{\sim }\text {Uniform}(\pm 1)\). Let us denote by \(\hat{g}_R=\ell \circ \hat{h}_R\) the loss of the compressive ERM predictor. We have the following generalisation bound.

Theorem 2

(Generalisation of compressive ERM) Let \({\mathcal {G}}_R\) be the loss class associated with the compressive class of functions \({\mathcal {H}}_R\), and assume that \(\ell\) is uniformly bounded above by \(\bar{\ell }\). For any \(k\in {\mathbb {N}}\) and \(\delta >0\), w.p. \(1-2\delta\),

$$\begin{aligned} E[\hat{g}_R]&\le E[g^*]+ D_k(g^*) + 2{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R) +\bar{\ell }\cdot \xi (k,g^*,\delta ) +4\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}} \end{aligned}$$
(2)

where \(\xi (k,g^*,\delta )\equiv \min \left\{ \frac{1-\delta }{\delta }D_k(g^*),\sqrt{\frac{1}{2}\log \frac{1}{\delta }} \right\}\). In particular, if \(D_k(g^*) \le \theta\) for some \(\theta \in [0,\bar{\ell }]\), then the compressive ERM satisfies

$$\begin{aligned} E[\hat{g}_R]&\le E[g^*]+ \theta + 2{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R) +\bar{\ell }\cdot \xi (k,g^*,\delta ) +4\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}}. \end{aligned}$$
(3)

Proof

Fixing R we have an ERM over the compressive class. Hence, we can bound the generalisation error of the function learned, \(\hat{g}_R\in {\mathcal {G}}_R\), using classic uniform bounds such as (Mohri et al, 2012, Lemma 3.3) (Theorem 29 in Appendix 5) combined with the Hoeffding bound. This gives w.p. \(1-\delta\) that

$$\begin{aligned} E[\hat{g}_R]&\le E[g_R^*] + 2 {\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)+4\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}} \end{aligned}$$
(4)

This bound is relative to \(g^*_R\in {\mathcal {G}}_R\), that is the best achievable in the reduced class, while we want a bound relative to the best achievable in the original class, i.e. \(g^*\in {\mathcal {G}}_d\). To this end, we write

$$\begin{aligned} E[g_R^*] = E[g^*]+ E[g_R^*-g^*] \le E[g^*]+ \inf _{g_R\in {\mathcal {G}}_R}E\vert g_R-g^*\vert = E[g^*]+D_R(g^*), \end{aligned}$$
(5)

where the inequality follows from the optimality of \(g_R^*\) in \({\mathcal {G}}_R\): for every \(g_R\in {\mathcal {G}}_R\) we have \(E[g_R^*-g^*]\le E[g_R\circ R-g^*]\le E\vert g_R\circ R-g^*\vert\), and taking the infimum over \(g_R\in {\mathcal {G}}_R\) yields \(D_R(g^*)\) by Definition 1.

Now, since the loss is bounded, and recalling that \(D_k(g^*)=E_R[D_R(g^*)]\), we can bound the last term on the r.h.s. as \(D_R(g^*)\le D_k(g^*)+\sqrt{\frac{1}{2}\log (1/\delta )}\) w.p. \(1-\delta\) using Hoeffding’s inequality (Lemma 27), or alternatively as \(D_R(g^*)\le \frac{1}{\delta }D_k(g^*) = D_k(g^*)+ \frac{1-\delta }{\delta }D_k(g^*)\) w.p. \(1-\delta\) using Markov’s inequality (Lemma 26). Each of these two bounds can be tighter than the other depending on the magnitude of \(D_k(g^*)\). By taking the minimum, we have

$$\begin{aligned} D_R(g^*)\le D_k(g^*)+\xi (k,g^*,\delta ). \end{aligned}$$
(6)

Finally, by the union bound, both (4) and (6) hold simultaneously w.p. \(1-2\delta\), hence we conclude the statement (2). Equation (3) follows from (2) by substituting the upper bound \(\theta\) for \(D_k(g^*)\). \(\square\)

The error of the uncompressed ERM is recovered when \(D_k(g^*)=0\), which in the worst case will happen for \(k=d\). Moreover, depending on the structure of the problem, \(D_k(g^*)\) can become negligible even for \(k<d\). Theorem 2 implies that compressive learning will work better on problems where the target function \(g^*\) has small compressive distortion.

The benefit of this simple result is that it unifies the analysis of compressive learning of various models into one framework, parameterised by problem-specific quantities. In particular, the compressive distortion appearing in the bound depends on the particular model class, and analysing this quantity further will give us a handle on discovering problem-specific characteristics that contribute to the ease of learning from compressed data.

Here we assumed that the distortion threshold \(\theta\) and the compression dimension k are fixed in advance. The latter may be set to a fraction of the available sample size N, so that the function class complexity remains small. Later in Sect. 3 we develop some intuition about the geometric meaning of compressive distortion in some concrete function classes, and demonstrate how it can be used to learn about benign problem characteristics.

2.2 Learning compressible problems in the dataspace

The main quantity in our analysis of compressive learning in the previous section was the compressive distortion of the target function, \(D_k(g^*)\). In this section we return to the original high dimensional problem, and define a notion of distortion for the entire function class, which we refer to as the compressive complexity of the class. We shall then focus on function classes that have low compressive complexity. The intuition behind this approach is that such classes are in fact smaller in some sense, which should allow easier learning—albeit this will have to be achieved by a non-ERM algorithm that avoids the pitfalls of ERM exemplified earlier in Sect. 1.2, and this will indeed follow from our analysis. To this end, in this section we give a uniform bound in terms of compressive complexity.

We introduce an auxiliary construction that involves a random projection for analytic purposes, while the learning problem stays in the original data space without any dimensionality reduction. As before, \(R\in {\mathbb {R}}^{k\times d},k\le d\) is a RP matrix, but this time it will serve a purely analytic role. We define an auxiliary function class, \({\mathcal {G}}_R=\ell \circ {\mathcal {H}}_R\) with elements \(g_R=\ell \circ h_R\)—again for analytic purposes. This class may be chosen freely. A natural choice is to have the same functional form as the elements of \({\mathcal {G}}_d\), but operating on k (rather than d) dimensional inputs, as then from a compressive learning guarantee one can readily infer a dataspace guarantee, as we shall see shortly. However, other choices can be more convenient to work with when the dataspace bound is sought. Next, we define compressive complexity with the aid of an unspecified auxiliary class \({\mathcal {G}}_R\), as follows.

Definition 2

(Compressive complexity of a function class) Given a function class \({\mathcal {G}}_d\) and a function \(g\in {\mathcal {G}}_d\), we let \(\hat{D}_{R,N}(g) \equiv \inf _{g_R\in {\mathcal {G}}_R} \hat{E}_{{\mathcal {T}}_{N}}\vert g_R(RX,Y)-g(X,Y)\vert\), and \(\hat{D}_{k,N}(g) \equiv E_R[\hat{D}_{R,N}(g)]\). We define the compressive complexity of \({\mathcal {G}}_d\) as follows.

$$\begin{aligned} \hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) \equiv \sup _{g\in {\mathcal {G}}_d} \hat{D}_{k,N}(g);\;\;\; \;\;\; {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) \equiv E_{{\mathcal {T}}_N\sim {\mathbb {P}}^{N}} [\hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)] \end{aligned}$$
(7)

We may think of the compressive complexity as the largest (w.r.t. \(g\in {\mathcal {G}}_d\)) ‘mimicking error’ (on average over training sets) of compressive learners that each receive a randomly compressed version of the inputs and learn to behave like g. With the use of Definition 2, we can decompose the Rademacher complexity of the original class as the following.

Lemma 3

(Decomposition of Rademacher complexities) Let \({\mathcal {G}}_d\) be a class of uniformly bounded real valued functions on \({{\mathcal {X}}}_d\times {{\mathcal {Y}}}\). We have

$$\begin{aligned} {{\hat{{\mathcal {R}}}}}_N({\mathcal {G}}_d)&\le \hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) + E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {R}}_N({\mathcal {G}}_d)&\le {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) + E_R[{\mathcal {R}}_N({\mathcal {G}}_R)] \end{aligned}$$
(9)
$$\begin{aligned} {{\mathcal {R}}}_N({\mathcal {G}}_d)&\le \hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) + E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]+ \bar{\ell }\sqrt{\frac{\log (1/\delta )}{2N}} \text {~w.p. }1-\delta \end{aligned}$$
(10)
$$\begin{aligned} {{\mathcal {R}}}_N({\mathcal {G}}_d)&\le \hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) + E_R[{\mathcal {R}}_N({\mathcal {G}}_R)]+ \bar{\ell }\sqrt{\frac{\log (1/\delta )}{2N}} \text {~w.p. }1-\delta \end{aligned}$$
(11)
$$\begin{aligned} {{\mathcal {R}}}_N({\mathcal {G}}_d)&\le {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) + E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]+ \bar{\ell }\sqrt{\frac{\log (1/\delta )}{2N}} \text {~w.p. }1-\delta \end{aligned}$$
(12)

Proof of Lemma 3

By the definition,

$$\begin{aligned} {\hat{{\mathcal {R}}}}_N({\mathcal {G}}_d)&=E_{\sigma }\sup _{g\in {\mathcal {G}}_d}\frac{1}{N}\sum _{n=1}^N \sigma _n g(X_n,Y_n). \end{aligned}$$

We add and subtract \(E_{\sigma }\sup _{g\in {\mathcal {G}}_d} E_R\inf _{g_R\in {\mathcal {G}}_R}\left\{ \frac{1}{N}\sum _{n=1}^N \sigma _n g_R(RX_n,Y_n) \right\}\), so

$$\begin{aligned} {\hat{{\mathcal {R}}}}_N({\mathcal {G}}_d)&\le E_{\sigma }\sup _{g\in {\mathcal {G}}_d} E_R\inf _{g_R\in {\mathcal {G}}_R} \left\{ \frac{1}{N}\sum _{n=1}^N \sigma _n ( g(X_n,Y_n) -g_R(RX_n,Y_n))\right\} \\&\hspace{3.8cm} + E_{\sigma } E_R\sup _{g_R\in {\mathcal {G}}_R} \left\{ \frac{1}{N}\sum _{n=1}^N \sigma _n g_R(RX_n,Y_n)\right\} \\&\le \hat{{{\mathcal {C}}}}_{k,N}({\mathcal {G}}_d) + E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]. \end{aligned}$$

This completes the proof of (8). Taking expectation w.r.t. the distribution of \({\mathcal {T}}_N\) we obtain (9). Using these, we obtain inequalities (10)–(12) by employing McDiarmid’s inequality (Lemma 28), as follows.

Since the loss function is bounded by \(\bar{\ell }\), changing one point of \({\mathcal {T}}_N\) can only change \({\hat{{\mathcal {R}}}}_N({\mathcal {G}}_d)\) (or \(\hat{{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)\)), viewed as functions of the set of N points, by at most \(c=\bar{\ell }/N\). Hence, applying one side of McDiarmid’s inequality gives each of the following

$$\begin{aligned} {{\mathcal {R}}}_N({\mathcal {G}}_d)&\le {\hat{{\mathcal {R}}}}_N({\mathcal {G}}_d)+{\bar{\ell }}\sqrt{\frac{\log (1/\delta )}{2N}} \text { w.p. } 1-\delta ; \end{aligned}$$
(13)
$$\begin{aligned} {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)&\le {\hat{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)+{\bar{\ell }}\sqrt{\frac{\log (1/\delta )}{2N}} \text { w.p. } 1-\delta . \end{aligned}$$
(14)

Now, combining (13) with (8) gives (10). Combining (9) with (14) gives (11). Finally, using (9) and then applying (13) with the class \({\mathcal {G}}_R\) gives (12). \(\square\)

The reason the above decompositions will be useful for our purposes is that, whenever \({{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)\) is sufficiently small, the Rademacher complexity of the original function class becomes essentially the complexity of a k rather than a d dimensional function class—therefore, inspecting \(\mathcal {C}_{k,N}({\mathcal {G}}_d)\) for the class \({\mathcal {G}}_d\) at hand will help us gain intuitive insight about the structures that make some high dimensional problems effectively less high dimensional than they appear. As such, our focus is on problems where \({\mathcal {R}}_N({\mathcal {G}}_d)\) grows with d while \({\mathcal {C}}_{k,N}({\mathcal {G}}_d)\) is small, and examples will follow in the next section. In such problems, when prior knowledge does not justify any further assumptions, the smallness of the compressive complexity represents a general-purpose simplicity conjecture that may be used to derive conditions for a high dimensional problem to be solvable in low dimensions. The particular form of these conditions will depend on the function class associated with the learning problem, but for now we keep the formalism general and simple.

Theorem 4

(Uniform bounds for problems with small compressive complexity) Fix some \(\theta \in [0,\bar{\ell }]\). Suppose that \({\tilde{{\mathcal {G}}}}_d\subseteq {\mathcal {G}}_d\) is a function class that satisfies \({{\mathcal {C}}}_{k,N}({\tilde{{\mathcal {G}}}}_d)\le \theta\). Then, for any \(\delta >0\), w.p. \(1-\delta\) the following holds uniformly for all \(g\in {\tilde{{\mathcal {G}}}}_d\):

$$\begin{aligned} E[g]\le \hat{E}_{{\mathcal {T}}_{N}}[g] + 2\theta + 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] +3\bar{\ell }\sqrt{\frac{\log (2/\delta )}{2N}} \end{aligned}$$
(15)

Furthermore, w.p. \(1-\delta\), \(\hat{g}:=\underset{g\in {\tilde{{\mathcal {G}}}}_d}{\text {arg min}} \;\hat{E}_{{\mathcal {T}}_N}[g]\) satisfies

$$\begin{aligned} E[\hat{g}] \le E[g^*] + 2\theta + 2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] +4\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}}. \end{aligned}$$
(16)

Proof

By the classic Rademacher bound (Theorem 29) applied to \({\tilde{{\mathcal {G}}}}_d\), we have w.p. \(1-\delta /2\) for all \(g\in {\tilde{{\mathcal {G}}}}_d\) that

$$\begin{aligned} E[g]&\le \hat{E}_{{\mathcal {T}}_{N}}[g] + 2{{\mathcal {R}}}_N({\tilde{{\mathcal {G}}}}_d) +\bar{\ell }\sqrt{\frac{\log (2/\delta )}{2N}}. \end{aligned}$$
(17)

Applying (12) from Lemma 3 to \({\tilde{{\mathcal {G}}}}_d\), we further have \({{\mathcal {R}}}_N({\tilde{{\mathcal {G}}}}_d)\le E_R[{\hat{{\mathcal {R}}}}_N({\tilde{{\mathcal {G}}}}_R)]+{{\mathcal {C}}}_{k,N}({\tilde{{\mathcal {G}}}}_d) + \bar{\ell }\sqrt{\frac{\log (2/\delta )}{2N}}\) w.p. \(1-\delta /2\), where \({\tilde{{\mathcal {G}}}}_R \subseteq {\mathcal {G}}_R\). This combined with (17) using the union bound gives w.p. \(1-\delta\)

$$\begin{aligned} E[g]&\le \hat{E}_{{\mathcal {T}}_{N}}[g] + 2E_R[{\hat{{\mathcal {R}}}}_N({\tilde{{\mathcal {G}}}}_R)]+2{{\mathcal {C}}}_{k,N}({\tilde{{\mathcal {G}}}}_d) +3\bar{\ell }\sqrt{\frac{\log (2/\delta )}{2N}}. \end{aligned}$$
(18)

Finally, \({\tilde{{\mathcal {G}}}}_R \subseteq {\mathcal {G}}_R\) implies \({\hat{{\mathcal {R}}}}_N({\tilde{{\mathcal {G}}}}_R)\le {\hat{{\mathcal {R}}}}_N({{\mathcal {G}}}_R)\), and using that \({{\mathcal {C}}}_{k,N}({\tilde{{\mathcal {G}}}}_d)\le \theta\) completes the proof of (15).

Equation (16) follows from (15). Indeed, as (15) holds uniformly for all \(g\in {\tilde{{\mathcal {G}}}}_d\), it also holds with \(\hat{g}\) in the place of g, and we apply this w.p. \(1-2\delta /3\) yielding

$$\begin{aligned} E[\hat{g}]\le \hat{E}_{{\mathcal {T}}_{N}}[\hat{g}] + 2\theta + 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] +3\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}}. \end{aligned}$$
(19)

By definition of \(\hat{g}\), we also have \(\hat{E}_{{\mathcal {T}}_N}[\hat{g}]\le \hat{E}_{{\mathcal {T}}_N}[g^*]\), and by Hoeffding’s inequality we further have \(\hat{E}_{{\mathcal {T}}_N}[g^*]\le E[g^*]+\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}}\) w.p. \(1-\delta /3\). Finally, we combine this with (19) via the union bound to complete the proof. \(\square\)

Theorem 4 implies that, if the compressive complexity of the function class is sufficiently small, then the d-dimensional problem is solvable with a guarantee that is almost as good as a \(k \ll d\)-dimensional version of the problem. This is of interest in problems where the available sample size N is too small relative to d to permit a meaningful guarantee. Observe that k manages a tradeoff, as \(\theta\) decreases with k while the Rademacher complexity in general may increase with k. As before, k and \(\theta\) are considered to be fixed before seeing the data. A sensible choice is to set k proportional to N—which is typically known—in other words, in small sample settings we are prepared to take a bias \(\theta\) and in return gain control over the affordable complexity of the class. The classic bounds are recovered when \(k=d\). However, the intuition is that often the geometry of the problem may be favourable for \(\theta\) to be sufficiently small while \(k \ll d\). Our bounds express this intuition, and Sect. 3 will make it more concrete.

Note that the restriction of the function class to obey \(\mathcal {C}_{k,N}({\tilde{{\mathcal {G}}}}_d)\le \theta\) is necessary for the above guarantee. This is important, as in practice it is often easier to specify a large class \({\mathcal {G}}_d\), and we have seen earlier in Theorem 1 that an unconstrained ERM can be arbitrarily bad. Hence, in order to exploit the guarantee provided by Theorem 4, the learning algorithm must ensure this constraint.

The compressive complexity has similar properties to those of compressive distortion.

Property 2.2

The following properties hold.

  1.

    For all \(g\in {\mathcal {G}}_d\) and all \(k\in {\mathbb {N}}\), \({{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)\ge 0\).

  2.

    There exists \(k\le d\) s.t. \({{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)=0\).

  3.

    For any k, if \(g(x,y)\in [0,\bar{\ell }]\) for all \((x,y)\in {{\mathcal {X}}}\times {{\mathcal {Y}}}\), then \(\mathcal {C}_{k,N}({\mathcal {G}}_d)\in [0,\bar{\ell }]\).

  4.

    If \(\ell\) is L-Lipschitz in its first argument, then \({{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)\le L \cdot {{\mathcal {C}}}_{k,N}({\mathcal {H}}_d)\).

Moreover, these properties also hold for \(\hat{D}_{R,N}(\cdot ), \hat{D}_{k,N}(\cdot )\), and \(\hat{{\mathcal {C}}}_{k,N}(\cdot )\).

Furthermore, we can link compressive distortion with compressive complexity, and this facilitates insights about high dimensional dataspace learning from guarantees obtained on compressive learning.

Property 2.3

(From compressive distortion to compressive complexity) Let \(\hat{{\mathbb {P}}}\) denote the counting probability measure over the training sample. Suppose we have a bound \(D_R(h)\le \psi _R(h,{\mathbb {P}})\) for all \(h\in {\mathcal {H}}_d\), where \(\psi _R\) is some expression that depends on R. Then, we also have \(\mathcal {C}_{k,N}({\mathcal {H}}_d) \le E_{{\mathcal {T}}_N\sim {\mathbb {P}}^N}[\sup _{h\in {\mathcal {H}}_d} E_R[\psi _R(h,\hat{{\mathbb {P}}})]]\). In particular, if \(D_R(h)\le \phi ({{\mathbb {P}}})\cdot \varphi _R(h)\) for all \(h\in {\mathcal {H}}_d\) with some expressions \(\phi\) and \(\varphi _R\), then \({{\mathcal {C}}}_{k,N}({\mathcal {H}}_d) \le E_{{\mathcal {T}}_N\sim {\mathbb {P}}^N}[\phi (\hat{{\mathbb {P}}})]\cdot \sup _{h\in {\mathcal {H}}_d}E_R[\varphi _R(h)]\).

Proof of Property 2.3

Since \(D_R(h)\le \psi _R(h,{\mathbb {P}})\) for all \(h\in {\mathcal {H}}_d\), we also have \(\hat{D}_{R,N}(h)\le \psi _{R}(h,\hat{{\mathbb {P}}})\) for all \(h\in {\mathcal {H}}_d\). Hence,

$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {H}}_d)= E_{{\mathcal {T}}_N\sim {\mathbb {P}}^N}\sup _{h\in {\mathcal {H}}_d} E_R[\hat{D}_{R,N}(h)] \le E_{{\mathcal {T}}_N\sim {\mathbb {P}}^N}\sup _{h\in {\mathcal {H}}_d} E_R[\psi _R(h,\hat{{\mathbb {P}}})]. \end{aligned}$$
(20)

Applying this to the special case when \(\psi _R(h,{\mathbb {P}})=\phi ({{\mathbb {P}}})\cdot \varphi _R(h)\) for all \(h\in {\mathcal {H}}_d\), the second statement follows. \(\square\)

Below in Lemma 5 we give a simple example of a compressible problem, i.e. a distribution and function class pair where we have both a low compressive distortion and a low compressive complexity.

Definition 3

(Almost low-rank distributions) Given \(\theta \in [0,1]\) and \(k\le d\) we say that a probability measure \(\mu\) is \(\theta\)-almost k-rank on \({\mathbb {R}}^d\), if there exists a k-dimensional linear subspace \(V_k\subseteq {\mathbb {R}}^d\) such that \(\mu (V_k) > 1-\theta\).

Lemma 5

(Compressive distortion and compressive complexity in almost low-rank distributions) Let \({\mathcal {G}}_d\) be the linear function class with an \(\bar{\ell }\)-bounded loss function. Suppose that the marginal \({\mathbb {P}}_X\) is a \(\theta\)-almost k-rank distribution on \({\mathbb {R}}^d\), and R is a \(k\times d\) RP matrix having full row-rank a.s. For any \(N\in {\mathbb {N}}\), we have

$$\begin{aligned} D_k(g^*)&\le \bar{\ell }\theta \end{aligned}$$
(21)
$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {G}}_d)&\le \bar{\ell }\theta . \end{aligned}$$
(22)

Lemma 5 will be useful in the construction of a lower bound in Sect. 2.3. The idea of the proof is that, knowing that the marginal distribution is almost k-rank, and since R maps the k-dimensional subspace \(V_k\) bijectively onto \({\mathbb {R}}^k\) a.s., we can choose the auxiliary class \({\mathcal {G}}_R\) so that the behaviour of functions in \({\mathcal {G}}_d\) on \(V_k\) is reproduced exactly after projection.

The proof of Lemma 5 is given in Appendix Sect. 2.

2.3 Tightness of the bounds

The upper bounds of Theorems 2 and 4 are attractive when \(\theta\) is small, i.e. for compressible problems. Our goal in this section is to show the tightness of these bounds under the same conditions as those upper bounds. More precisely, we will show that there exists a function class for which the dependence of the bound on the parameters \(\theta ,k\) and N cannot be improved without imposing extra conditions.

First, we need to make explicit the dependence of the relevant quantities on the unknown data distribution \({\mathbb {P}}_d\). To this end, we shall use the notations \(D_k(g^*,{\mathbb {P}}_d)\) and \({\mathcal {C}}_{k,N}({\mathcal {G}}_d,{\mathbb {P}}_d)\) for the compressive distortion and the compressive complexity respectively. We drop the index d as it stays the same throughout this section, so \({\mathcal {G}}\) will stand for \({\mathcal {G}}_d\), and \({\mathcal {H}}\) will stand for \({\mathcal {H}}_d\). As in the previous sections, we assume \(\bar{\ell }\)-bounded loss functions.

Next, we define the class of distributions for which these quantities are below a specified threshold.

Definition 4

(Compressible distributions) Let \(k\le d\) be an integer, and \(\theta \in [0,1]\).

  1.

    Given a learning problem with target function \(g^*(\cdot ,\cdot )=\ell (h^*(\cdot ),\cdot )\), we say that a distribution \({\mathbb {P}}\) is D-compressible with parameters \((\theta ,k)\), if the compressive distortion of \(g^*\) satisfies \(D_k(g^*,{\mathbb {P}}) \le \bar{\ell }\theta\). We denote by \({\mathcal {P}}_{g^*}(\theta ,k):=\{{\mathbb {P}}: D_k(g^*,{\mathbb {P}})\le \bar{\ell }\theta \}\) the set of all D-compressible distributions with parameters \((\theta ,k)\).

  2.

    Given a function class \({\mathcal {G}}\), we say that a distribution \({\mathbb {P}}\) is C-compressible with parameters \((\theta ,k)\), if the compressive complexity of \({\mathcal {G}}\) satisfies \({\mathcal {C}}_{k,N}({\mathcal {G}},{\mathbb {P}}) \le \bar{\ell }\theta\). We denote by \({\mathcal {P}}_{{\mathcal {G}}}(\theta ,k):=\{{\mathbb {P}}: C_{k,N}({\mathcal {G}},{\mathbb {P}})\le \bar{\ell }\theta \}\) the set of all C-compressible distributions with parameters \((\theta ,k)\).

For a distribution \({\mathbb {P}}\), we denote by \(h_{{\mathbb {P}}}^*\in \underset{h\in {\mathcal {H}}}{{\text {arg inf}}}\; E[\ell (h(X),Y)]\) a best classifier of the class \({\mathcal {H}}\) in the underlying distribution \({\mathbb {P}}\). In the construction of the proof of the forthcoming Theorem 6, \(h_{{\mathbb {P}}}^*\) will coincide with the Bayes-optimal classifier. A learning algorithm \({\mathcal {A}}: ({{\mathcal {X}}}\times {{\mathcal {Y}}})^N \rightarrow {\mathcal {H}}\) takes a training set of size N and returns a classifier. The loss of this classifier is denoted by \(g_{{\mathcal {A}}({\mathcal {T}}_N)}(X,Y):= \ell (({\mathcal {A}}({\mathcal {T}}_N))(X),Y)\).

We have the following lower bound in the high-dimensional small sample setting.

Theorem 6

(Lower bound) Consider the 0–1 loss. For any \(\theta \in [0,1]\), any integers \(k\le N\le d\), and any algorithm \({\mathcal {A}}: ({{\mathcal {X}}}\times {{\mathcal {Y}}})^N\rightarrow {\mathcal {H}}\), there exists a D-compressible and C-compressible distribution \({\mathbb {P}}\in {\mathcal {P}}_{g^*}(\theta ,k)\;\cap \;{\mathcal {P}}_{{\mathcal {G}}}(\theta ,k)\) (which depends on \(\theta , k, d, N\) and \({\mathcal {A}}\)) such that:

$$\begin{aligned} E_{{\mathcal {T}}_N\sim {\mathbb {P}}^N}[E[g_{{\mathcal {A}}({\mathcal {T}}_N)}]] - E[g_{{\mathbb {P}}}^*] \ge \frac{1}{32} \left( \theta + \sqrt{\frac{k}{N}} \right) . \end{aligned}$$
(23)

The proof is deferred to Appendix 4. Theorem 6 says that, in the high dimensional setting (\(k\le N \le d\)), for any choice of algorithm there is a bad distribution which, despite being compressible (i.e. satisfying the same condition as our upper bounds), forces the error of the classifier returned by the algorithm from an i.i.d. sample of size N to be large.

We note that the bad distribution is allowed to depend on the sample size. Therefore Theorem 6 does not imply that, for some fixed distribution, the excess risk converges at a rate no faster than that of the upper bound. However, studying faster rates is beyond the scope of this paper, as it requires additional assumptions; such rates are pursued elsewhere (Reeve & Kabán, 2021).

The important point here is that, there are function classes for which the lower bound of Theorem 6 matches the upper bound up to a constant factor—for instance in k-dimensional linear classification it is well-known that \({\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R) \in \Theta \left( \sqrt{{k}/{N}} \right)\) (Bartlett & Mendelson, 2002). Hence, the lower bound of Theorem 6 implies that Theorem 4 cannot be improved in general by more than a constant factor. To see this more clearly, we rearrange the upper bound from Theorem 4 to have the same left-hand side as (23). Setting \(\epsilon :=4\bar{\ell }\sqrt{\frac{\log (3/\delta )}{2N}}\) gives \(2\delta =6\exp \left( -\frac{N\epsilon ^2}{8\bar{\ell }^2}\right)\), and we have

$$\begin{aligned} {\mathbb {P}}_{{\mathcal {T}}_N}\left\{ E[\hat{g}] >E[g^*]+2\theta +2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]+\epsilon \right\} \le 6\exp \left( -\frac{N\epsilon ^2}{8\bar{\ell }^2}\right) . \end{aligned}$$

This implies that

$$\begin{aligned} E_{{\mathcal {T}}_N}[E[\hat{g}]]&-E[g^*]-2\theta -2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]\\&\le \int _{0}^{\infty } {\mathbb {P}}_{{\mathcal {T}}_N}\left\{ E[\hat{g}] -E[g^*]-2\theta -2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]>\epsilon \right\} d\epsilon \\&\le 6 \int _{0}^{\infty }\exp \left( -\frac{N\epsilon ^2}{8\bar{\ell }^2}\right) d\epsilon =3\sqrt{\frac{8\pi \bar{\ell }^2}{N}} < \frac{16\bar{\ell }}{\sqrt{N}}. \end{aligned}$$

Hence, noting that \(\bar{\ell }\) is a constant independent of kNd and \(\theta\), we have for the linear class that

$$\begin{aligned} E_{{\mathcal {T}}_N}[E[\hat{g}]]- E[g^*]&\le 2\theta +2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]+\frac{16\bar{\ell }}{\sqrt{N}} \nonumber \\&\le 2\theta +62\sqrt{\frac{k}{N}}+\frac{16\bar{\ell }}{\sqrt{N}} ={{\mathcal {O}}}\left( \theta +\sqrt{\frac{k}{N}}\right) . \end{aligned}$$
(24)

This matches the lower bound up to a constant factor.

In the compressed ERM bound of Theorem 2, the term \(\xi (k,g^*,\delta )\) reflects the variability of the error due to working in a lower dimensional random subspace of \({{\mathcal {X}}}\). This term does not shrink with N; instead it decays with k through \(D_k(g^*)\), which is model-specific. The next section will analyse this quantity for several learning problems. Moreover, by the second statement of Property 2.1, there is always some integer \(k^*\le d\) such that whenever \(k\ge k^*\) we have \(\xi (k,g^*,\delta )=0\), making the upper bound again match the lower bound up to a constant factor.

3 Discovering problem-specific benign traits

The previous section focused on bounds of a general form, and we argued that these are tight when the problem is compressible. In this section we study the question of what makes learning problems compressible. The answers will depend on the particular learning problem, and we demonstrate how the novel quantities we introduced (the compressive distortion and the compressive complexity) can exploit the structure-exposing ability of random projections to reveal more answers to this question.

The forthcoming subsections are devoted to instantiating these quantities in several models associated to learning tasks, in order to demonstrate their use in revealing structural insights. The proofs of the forthcoming propositions are relegated to Appendix 3, where we also give details on how to use the obtained expressions in the general form of our bounds from the previous sections.

3.1 Thresholded linear models

We start with the classical example of binary classification with linear functions \({\mathcal {H}}_d=\{x\rightarrow h^Tx: h,x\in {\mathbb {R}}^d\}\), and where the loss function of interest is the 0–1 loss, that is \(\ell _{01}: {{\mathcal {Y}}}\times {{\mathcal {Y}}}\rightarrow \{0,1\}, \ell _{01}(\hat{y},y)=\textbf{1}(\hat{y}y \le 0)\). By a slight abuse of notation, we identify the linear classifiers with their weight vectors. As before, we let \({\mathcal {G}}_d = \ell _{01}\circ {\mathcal {H}}_d\), and \({\mathcal {G}}_R=\ell _{01}\circ {\mathcal {H}}_R\) its compressive counterpart. In this setting, we have the following, proved in Appendix Section “Thresholded linear models”.

Proposition 7

Consider the linear function class with the 0–1 loss, as above. We have

$$\begin{aligned} D_k(g^*)&\le E_X\left[ \exp \left( \frac{-k\cos ^2(\measuredangle _{X}^{h})}{8}\right) \right] \cdot \textbf{1}(k<d)\end{aligned}$$
(25)
$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {G}}_d)&\le E_X\left[ \sup _{h\in {\mathcal {H}}_d} \exp \left( \frac{-k\cos ^2(\measuredangle _{X}^{h})}{8}\right) \right] \cdot \textbf{1}(k<d). \end{aligned}$$
(26)

In the above, \(\measuredangle _{X}^{h}\) is the angle, in radians, between the vectors X and h, so \(\cos (\measuredangle _{X}^{h})\) is the normalised margin of a point X in terms of its distance to the hyperplane with normal vector h. Consequently, we see that in the case of halfspace learning, the compressive distortion is bounded by the moment generating function of the squared margin distribution, evaluated at \(-k/8\). This example recovers, in a nutshell, the main findings of Kabán and Durrant (2020) as a special case.
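The right-hand side of (25) is straightforward to estimate from a sample once a candidate classifier h is given. The following sketch (ours; it simply replaces the population expectation by a sample average) computes the empirical counterpart \(\frac{1}{N}\sum _{n=1}^N\exp (-k\cos ^2(\measuredangle _{X_n}^{h})/8)\).

```python
import numpy as np

def margin_mgf_bound(h, X, k):
    """Empirical estimate of E_X[exp(-k cos^2(angle(X, h)) / 8)] from Proposition 7.

    cos(angle(X, h)) = <X, h> / (||X|| ||h||) is the normalised margin of each
    point with respect to the hyperplane with normal vector h.
    """
    cos_angles = (X @ h) / (np.linalg.norm(X, axis=1) * np.linalg.norm(h))
    return float(np.mean(np.exp(-k * cos_angles ** 2 / 8)))
```

Small values indicate that most points have a normalised margin that is large relative to \(1/\sqrt{k}\), i.e. their classification is rarely affected by the compression.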

3.2 Linear model with Lipschitz loss

Next we consider the linear function class \({\mathcal {H}}_d=\{x\rightarrow h^Tx: h,x\in {\mathbb {R}}^d \}\), with \(\ell : {{\mathcal {Y}}}\times {{\mathcal {Y}}}\rightarrow [0,\bar{\ell }]\) a bounded loss function that is \(L_{\ell }\)-Lipschitz in its first argument. Common examples of bounded Lipschitz loss functions may be found e.g. in (Rosasco et al., 2004), several of which are surrogates for the 0–1 loss. As before, we let \({\mathcal {G}}_d = \ell \circ {\mathcal {H}}_d\), and \({\mathcal {G}}_R=\ell \circ {\mathcal {H}}_R\) its compressive version. Let \(\Sigma :=E_X[XX^T]\), and we require that \(\text {Tr}(\Sigma ) < \infty\). In this setting, we have the following.

Proposition 8

Consider the linear model class described above. For any \(p\in {{\mathbb {N}}}\) s.t. \(2\le p\le k-2\), and \(k\le \text {rank}(\Sigma )\), we have

$$\begin{aligned} D_k(g^*)&\le L_{\ell } \Vert h^*\Vert _2 \cdot \Xi (k,p,\{\lambda _j(\Sigma )\}_{j}) \end{aligned}$$
(27)
$$\begin{aligned} {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)&\le L_{\ell } E_{{\mathcal {T}}_N\sim {{\mathbb {P}}}^N}[ \Xi (k,p,\{\lambda _j(\hat{\Sigma })\}_j)]\cdot \sup _{h\in {\mathcal {H}}_d} \Vert h\Vert _2 \end{aligned}$$
(28)

where

$$\begin{aligned} \Xi (k,p,\{\lambda _j(\Sigma )\}_{j})&:= \left( 1+\sqrt{\frac{k-p}{p -1}}\right) \sqrt{\lambda _{k-p+1}(\Sigma )} +\frac{e\sqrt{k}}{p} \sqrt{\sum _{j>k-p}\lambda _j(\Sigma )} \end{aligned}$$
(29)

From the form of (27)–(28) we infer that both the compressive distortion and the compressive complexity scale with the parameter norm (an inverse margin) and with a quantity governed by the rate of decay of the eigen-spectrum of the data covariance. In conjunction with Theorem 2 this means that the larger the margin of \(h^*\), and/or the faster the eigen-decay of the data covariance, the better the chance that compressive classification with the considered linear model class succeeds. Likewise, in the light of Theorem 4, learning the model in high dimensional settings is eased in situations where the compressive complexity is small—i.e. when the margin is large, and the eigen-spectrum has a fast decay.

The proof is deferred to Appendix Section “Linear models with bounded Lipschitz loss”. Essentially, we relate the problem to a weighted OLS problem, which was previously analysed (Kabán, 2013; Slawski, 2018), and then manipulate the expressions to apply a seminal result of Halko et al. (2011).

It may be interesting to note that a coarser alternative, which nevertheless retains the main characteristics, can be obtained with less sophisticated tools, as follows.

Proposition 9

Consider the linear model class described above. We have

$$\begin{aligned} D_k(g^*)&\le L_{\ell }\sqrt{\frac{2}{k}}\sqrt{\text {Tr}(\Sigma )}\Vert h^*\Vert _2 \end{aligned}$$
(30)
$$\begin{aligned} {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d)&\le L_{\ell } \sqrt{\frac{2}{k}}\sqrt{\text {Tr}(\Sigma )}\sup _{h\in {\mathcal {H}}_d}\Vert h\Vert _2. \end{aligned}$$
(31)

Proof

Using the Lipschitz property of the loss, and relaxing the infimum in the definition of \(D_k(g^*)\), we have \(D_k(g^*)\le L_{\ell }E_XE_R \left[ \vert {h^*}^TR^TRX-{h^*}^TX\vert \right] \le L_{\ell }\left\{ E_XE_R\left[ ( {h^*}^TR^TRX-{h^*}^TX)^2\right] \right\} ^{1/2} \le L_{\ell }\sqrt{\frac{2}{k}}\sqrt{\text {Tr}(\Sigma )}\Vert h^*\Vert _2.\) Here we used Lemma 2 of Kabán (2014) to compute the matrix expectation, which in our case of R with i.i.d. entries from \({\mathcal {N}}(0,1/k)\) says that \(E_R[R^TR\Sigma R^TR]=\frac{1}{k}((k+1)\Sigma +\text {Tr}(\Sigma )I_d)\). We also have \(E_R[R^TR]=I_d\), and the final expression (30) then follows by rearranging, and using the Cauchy–Schwarz and Jensen inequalities.

By the factorised form of (30), Property 2.3 immediately gives (31). \(\square\)

We see that the expressions in Propositions 8 and 9 are driven by the eigen-decay of the unknown true covariance, the margin of the classifier, and the projection dimension k, with a decay of order \(1/\sqrt{k}\) in the latter.
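Both spectrum-dependent factors are easy to evaluate numerically for a given (empirical) spectrum; recall that they enter the respective bounds multiplied by \(L_{\ell }\) and a parameter norm. The sketch below (ours; eigenvalues are assumed sorted in decreasing order, and the exponentially decaying toy spectrum is an arbitrary choice) computes \(\Xi (k,p,\{\lambda _j\}_{j})\) from Eq. (29), optimised over the admissible p, alongside the coarser factor \(\sqrt{2\,\text {Tr}(\Sigma )/k}\) from Proposition 9.

```python
import numpy as np

def xi(k, p, eigvals):
    """Xi(k, p, {lambda_j}) from Eq. (29); eigvals sorted in decreasing order."""
    lam = np.asarray(eigvals, dtype=float)
    head = (1 + np.sqrt((k - p) / (p - 1))) * np.sqrt(lam[k - p])   # lambda_{k-p+1}
    tail = (np.e * np.sqrt(k) / p) * np.sqrt(lam[k - p:].sum())     # sum_{j > k-p} lambda_j
    return head + tail

def coarse_factor(k, eigvals):
    """Spectrum-dependent factor sqrt(2 Tr(Sigma) / k) from Proposition 9."""
    return float(np.sqrt(2.0 * np.sum(eigvals) / k))

eigvals = 2.0 ** (-np.arange(1, 501, dtype=float))   # exponentially decaying toy spectrum
k = 50
best_xi = min(xi(k, p, eigvals) for p in range(2, k - 1))  # scan p over 2, ..., k-2
print(best_xi, coarse_factor(k, eigvals))
```

With such a fast decaying spectrum the optimised \(\Xi\) falls orders of magnitude below the coarse factor, illustrating that Proposition 8 exploits spectral decay which Proposition 9 only captures through the trace.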

3.3 Two-layer perceptron

The purpose of this section is to examine the effect of adding a hidden layer by considering the class of classic fully-connected two-layer perceptrons. It turns out that the distortion bounds can still be expressed in terms of structures that we encountered in the simpler model of the previous section.

Let \({\mathcal {H}}_d=\{x\rightarrow \sum _{i=1}^m v_i \phi (w_i^Tx): x\in {{\mathcal {X}}}_d, \Vert v\Vert _1 \le 1 \}\) be the class of classic two-layer perceptrons, where \(\phi (\cdot )\) is an \(L_{\phi }\)-Lipschitz activation function. We do not regularise the first layer weights, as the RP has a regularisation effect on these. Let \(\ell : {{\mathcal {Y}}}\times {{\mathcal {Y}}}\rightarrow [0,\bar{\ell }]\) be an \(L_{\ell }\)-Lipschitz and \(\bar{\ell }\)-bounded loss function as before, and let \({\mathcal {G}}_d = \ell \circ {\mathcal {H}}_d\), and \({\mathcal {G}}_R=\ell \circ {\mathcal {H}}_R\) its compressive version. The \(\ell _1\)-regularisation on the higher-layer weights has the practical benefit of pruning unnecessary components. Again we will assume \(\text {Tr}(E_X[XX^T])<\infty\). In this setting we obtain the following, proved in Appendix Section “Two-layer perceptron”.

Proposition 10

Consider the feed-forward neural network class above. For any \(p\in {{\mathbb {N}}}\) s.t. \(2\le p\le k-2\), and any \(k\le \text {rank}(\Sigma )\), we have

$$\begin{aligned} D_k(g^*)&\le L_{\ell }L_{\phi } \Vert v^*\Vert _2\Vert W^*\Vert _{F}\cdot \Xi (k,p,\{\lambda _j(\Sigma )\}_{j})\cdot \textbf{1}(k<d) \end{aligned}$$
(32)
$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {G}}_d)&\le L_{\ell }L_{\phi } E_{{\mathcal {T}}_{N}\sim {\mathbb P}^{N}}[\Xi (k,p,\{\lambda _j(\hat{\Sigma })\}_j)]\cdot \sup _{v,W} \Vert v\Vert _2\Vert W\Vert _{F}\cdot \textbf{1}(k<d) \end{aligned}$$
(33)

where \(\Xi (k,p,\{\lambda _j(\hat{\Sigma })\}_j)\) is defined in Eq. (29).
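To make the structure of the compressed class concrete, a member of \({\mathcal {H}}_R\) is simply the same network evaluated on Rx in place of x. A minimal sketch (ours; ReLU is used as a 1-Lipschitz activation, and the \(\ell _1\) constraint on v is merely checked rather than enforced):

```python
import numpy as np

def two_layer_predict(x, v, W, R=None):
    """Prediction sum_i v_i * relu(w_i^T z), with z = x in the dataspace class
    and z = R x in the compressed class (W then has k columns instead of d)."""
    assert np.abs(v).sum() <= 1 + 1e-9, "the class assumes ||v||_1 <= 1"
    z = x if R is None else R @ x
    return float(v @ np.maximum(W @ z, 0.0))
```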

We have not considered adding further hidden layers, as the RP only affects the input layer, so deeper networks are unlikely to present further insights on the effect of compressing the data. We have also not attempted to extend our analysis to other types of neural nets in this fast developing field, as analytic bounds of the specific quantities we are interested in would quickly become difficult to obtain and interpret. However, we will return with a generally applicable approach later in Sect. 3.7, where we show how one can use additional unlabelled data to estimate the compressive complexity instead of analytically bounding it. Finally, in the light of multiple equivalent formulations of bounds for layered networks (Munteanu et al., 2022) (under certain conditions), one can argue that the question of what exactly the bounds depend on becomes less interesting for the study of neural nets. Indeed, our only purpose in this section was to demonstrate the intuition that structures which help learning the linear model also help learning the two-layer model—hence, learning has at least as many (and probably more) benign structures to exploit in the richer class.

3.4 Quadratic model learning

Another interesting non-linear learning problem where we can showcase the ability of RP to discover meaningful structure and eliminate dimension-dependence is learning quadratic models, including Mahalanobis metric learning. Let \({\mathcal {M}}_d\) be the set of \(d\times d\) symmetric matrices, and consider the quadratic function class \({\mathcal {H}}_d=\{x\rightarrow x^T A x: A\in {\mathcal {M}}_d, x\in {\mathbb {R}}^d\}\), with \(\ell\) an \(\bar{\ell }\)-bounded \(L_{\ell }\)-Lipschitz loss, and we denote by \({\mathcal {G}}_d=\ell \circ {\mathcal {H}}_d\) the loss class of \({\mathcal {H}}_d\). It is known from the related analysis of Verma and Branson (2015) that the error of learning a Mahalanobis metric tensor \(A\in {\mathcal {M}}_d\) necessarily grows with \(\sqrt{d}\) if no structural assumptions are imposed on the metric tensor. We will use our RP-based analysis to discover a benign structural condition that eliminates the dependence of the error on d.

Let \({\mathcal {H}}_R\) be the compressive version of \({\mathcal {H}}_d\), with R having i.i.d. Gaussian entries with 0-mean and variance 1/k, as before, and \({\mathcal {G}}_R=\ell \circ {\mathcal {H}}_R\). In Appendix section “Quadratic models” we show the following.

Proposition 11

In the quadratic function class above, for any \(k\le d\), we have

$$\begin{aligned} D_k(g^*)&\le \sqrt{\frac{4}{k^2}+\frac{3}{k}}L_{\ell } \text {Tr}(\Sigma ) \Vert A^*\Vert _* \cdot \textbf{1}(k < d) \end{aligned}$$
(34)
$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {G}}_d)&\le \sqrt{\frac{4}{k^2}+\frac{5}{k}}L_{\ell } \text {Tr}(\Sigma ) \sup _{A\in {\mathcal {M}}_d}\Vert A\Vert _* \cdot \textbf{1}(k < d) \end{aligned}$$
(35)

where \(\Vert \cdot \Vert _*\) is the nuclear norm of the matrix in its argument.

Equation (34), in conjunction with Theorem 2, highlights that the smaller the nuclear norm of the true parameter matrix \(A^*\), the better the generalisation guarantee for compressively learning the quadratic model. Equation (35) further suggests that learning a quadratic model in high dimensions becomes easier when the nuclear norm of the parameter matrix is small. In addition, both bounds of Proposition 11 scale with the trace of the true covariance of the data distribution, suggesting that spectral decay of the data source is a benign trait.

We find it interesting to relate our findings to recent results of Latorre et al. (2021), which show for the quadratic class of classifiers that nuclear norm regularisation in the original data space (no dimensionality reduction considered) has the ability to take advantage of low intrinsic dimensionality of the data to achieve better accuracy, which other regularisers studied therein do not. The fact that the nuclear norm appears in our distortion bounds further validates the ability of our RP-based approach to find meaningful structural traits for the learning problem at hand. In fact, Theorem 4 essentially turns the expression (35) into a regulariser, which is realised by the nuclear norm in this case, since all the other factors are independent of the model’s parameters. Therefore the RP-based analysis, following the same recipe as in the previous sections for other function classes, again succeeds in revealing a meaningful benign trait for the function class under study.
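If one wished to enforce a small nuclear norm algorithmically, a standard route is proximal-gradient optimisation, whose nuclear-norm proximal step is soft-thresholding of singular values. The sketch below is a generic implementation of that operator; it is not an algorithm from this paper, and symmetrising the output is our own choice to keep iterates in the class of symmetric matrices.

```python
import numpy as np

def prox_nuclear(A, tau):
    """Proximal operator of tau * ||.||_*: soft-thresholds the singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_new = (U * np.maximum(s - tau, 0.0)) @ Vt
    # Symmetrise (a no-op for exactly symmetric input; guards against round-off,
    # and does not increase the nuclear norm, by the triangle inequality).
    return 0.5 * (A_new + A_new.T)
```

A proximal-gradient loop would then alternate a gradient step on the empirical risk of the quadratic model with a call to prox_nuclear.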

3.5 Nearest neighbour classification

The previous sections concerned various parametric classes. Here we take a representative of a nonparametric class, namely a simplified version of the nearest neighbour classifier proposed by Kontorovich and Weiss (2015).

The nearest neighbour rule can be expressed as follows (Crammer et al., 2002; Kontorovich & Weiss, 2015; von Luxburg & Bousquet, 2004). Denote by \({\mathcal {T}}_N^+,{\mathcal {T}}_N^- \subset {\mathcal {T}}_N\), \({\mathcal {T}}_N^+ \cup {\mathcal {T}}_N^- = {\mathcal {T}}_N\), the positively and negatively labelled training points respectively. Define the distance of a point \(x\in {{\mathcal {X}}}\) to a set S as \(d(x,S)= \inf _{z\in S}\{\Vert x-z\Vert \}\). Then, \(N^+(x) \in \underset{z\in {\mathcal {T}}_N^+}{{\text {arg min}}}\Vert x-z\Vert\) and \(N^-(x)\in \underset{z\in {\mathcal {T}}_N^-}{{\text {arg min}}}\Vert x-z\Vert\) are the nearest positive and nearest negative neighbours of x respectively, so \(d(x,{\mathcal {T}}_N^+)=\Vert x-N^+(x)\Vert\) and \(d(x,{\mathcal {T}}_N^-)=\Vert x-N^-(x)\Vert\), and the label prediction for \(x\in {{\mathcal {X}}}\) is given by the sign of the following:

$$\begin{aligned} h(x : {\mathcal {T}}_N^+,{\mathcal {T}}_N^-) =\frac{1}{2}(d(x,{\mathcal {T}}_N^-) - d(x,{\mathcal {T}}_N^+)) = \frac{1}{2}\left( \Vert x-N^-(x)\Vert -\Vert x-N^+(x)\Vert \right) \end{aligned}$$
(36)

Throughout this section we use Euclidean norms. Like Kontorovich and Weiss (2015), we assume a bounded input domain, \({{\mathcal {X}}}_d\subseteq {{\mathcal {B}}}(0,B)\). (This can be relaxed, as we will do in the next subsection for a more general case.) We consider the class of classifiers \({\mathcal {H}}_d=\{x\rightarrow h(x: {\mathcal {T}}_N^+,{\mathcal {T}}_N^-)=\frac{1}{2}\left( \Vert x-N^-(x)\Vert -\Vert x-N^+(x)\Vert \right) \}\), and \({\mathcal {G}}_d=\ell \circ {\mathcal {H}}_d\), where we take \(\ell (\cdot )\) to be the ramp-loss defined as \(\ell (h(x),y)=\min \{\max \{0,1-h(x)y/\gamma \},1\}\), which is \(1/\gamma\)-Lipschitz.

In the RP-ed domain, we use subscripts: \(N^+_R(Rx)\) and \(N^-_R(Rx)\) denote the nearest positively and negatively labelled projected training points to Rx, i.e. the nearest neighbours of Rx in \(R{\mathcal {T}}_N^+\) and \(R{\mathcal {T}}_N^-\) respectively. So the compressive class \({\mathcal {H}}_R\) contains functions of the form:

$$\begin{aligned} h_R(Rx:R{\mathcal {T}}_N^+,R{\mathcal {T}}_N^-) :=\frac{1}{2}\left( \Vert Rx-N^-_R(Rx)\Vert -\Vert Rx-N_R^+(Rx)\Vert \right) \end{aligned}$$
(37)

Composed with the \(1/\gamma\)-Lipschitz loss, we have by construction that \({\mathcal {G}}_d\subseteq \{x\rightarrow g(x): x\in {{\mathcal {X}}}_d, g \text {~is~}1/\gamma \text {-Lipschitz}\}\), and \({\mathcal {G}}_R\subseteq \{(Rx)\rightarrow g_R(Rx): x\in {{\mathcal {X}}}_d,g_R \text {~is~}1/\gamma \text {-Lipschitz}\}\). That is, the function classes of interest are subsets of the d and k-dimensional classes of \(1/\gamma\)-Lipschitz functions respectively. By the Lipschitz extension theorem (von Luxburg & Bousquet, 2004), for any \(\gamma\)-separated labelled sample there exists a 1-Lipschitz function that has the same predictions as the 1-NN induced by that sample, for all points of the input domain \({{\mathcal {X}}}\).

For a given value of \(\gamma\), the ERM classifier in the class of \(1/\gamma\)-Lipschitz functions of the form defined above is obtained by choosing a sub-sample from the training points such that this sub-sample is \(\gamma\)-separated, and the 1-NN induced by it makes the fewest errors on the full training set (including the points left out). This procedure was proposed by Kontorovich and Weiss (2015) along with an efficient algorithmic implementation.
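A direct implementation of the rules (36) and (37) is given below for illustration; the \(\gamma\)-separated sub-sample selection of Kontorovich and Weiss (2015) is not reproduced here, and the prototype sets are simply taken as given.

```python
import numpy as np

def nn_margin(x, pos, neg):
    """h(x : T^+, T^-) = 0.5 * (d(x, T^-) - d(x, T^+)), rule (36).

    Rows of pos / neg are the positive / negative prototypes; the predicted
    label is the sign of the returned value.
    """
    d_plus = np.min(np.linalg.norm(pos - x, axis=1))
    d_minus = np.min(np.linalg.norm(neg - x, axis=1))
    return 0.5 * (d_minus - d_plus)

def compressed_nn_margin(x, pos, neg, R):
    """Rule (37): the same statistic computed on R x and the projected prototypes."""
    return nn_margin(R @ x, pos @ R.T, neg @ R.T)
```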

Let \(g^*\) be the best d-dimensional \(1/\gamma\)-Lipschitz function of the form (36), and \(g_R\) the best k-dimensional \(1/\gamma\)-Lipschitz function of the form (37). We have the following, proved in Appendix Section “Nearest neighbours classification”.

Proposition 12

Let \(T=\left\{ \frac{x-x'}{\Vert x-x'\Vert }: x,x'\in {{\mathcal {X}}}_d,x\ne x'\right\}\). For the class of nearest neighbour classifiers described above, we have

$$\begin{aligned} D_k(g^*)\le \frac{2B \cdot w(T)}{\gamma \sqrt{k}};\;\;\;\;\;\;\;\;\; {{\mathcal {C}}}_{k,N}({\mathcal {G}}_d) \le \frac{2B \cdot w(T)}{\gamma \sqrt{k}}. \end{aligned}$$
(38)

where \(w(T)=E_{r\sim {\mathcal {N}}(0,I_d)}\sup _{t\in T} \{\langle r,t\rangle \}\) is the Gaussian width of the set T.

In this example, we have the same upper bound on both the compressive distortion and the compressive complexity, featuring the Gaussian width of the normalised pairwise difference vectors on the support set. The Gaussian width (see e.g. Vershynin, 2018, sec. 7.5 and references therein) is a measure of complexity for sets, which justifies the name ‘compressive complexity’. It is sensitive not just to the algebraic intrinsic dimension of the support set: it can take fractional values that reflect weakly represented directions, and it picks up structure embedded in Euclidean spaces, such as the existence of a sparse representation, smooth manifold structure, spectral decay, and so on.

The bound we obtain by instantiating Theorems 2 and 4 with the expressions from Proposition 12 and the Rademacher complexity of \({\mathcal {G}}_R\) holds true for any integer value of k chosen before seeing the data. An interesting connection is obtained if we set k to a value that ensures the compressive complexity term is below some specified \(\eta \in (0,1)\), i.e. \(k\gtrsim \frac{w^2(T)}{\eta ^2\gamma ^2}\). With this choice, the associated generalisation bound (Eq. 123) recovers a bound of the form obtained previously for this classifier in doubling metric spaces (Gottlieb et al., 2016; Kontorovich & Weiss, 2015), with the squared Gaussian width taking the place of the doubling dimension. Indeed, there is a known link between the doubling dimension and the squared Gaussian width (Indyk, 2007): in a Euclidean metric space of algebraic dimension d both are of order \(\Theta (d)\), but they are otherwise more general and can take fractional values. However, if w(T) is unknown, or the sample size N is too small relative to \(w(T)^2\), one may instead set k proportional to N, which is always known, whereas the Gaussian width or the doubling dimension may be unknown in practice.
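To make these quantities concrete, the sketch below estimates w(T) by Monte Carlo from a finite sample standing in for the support (which can only under-estimate the width of the full support), and then sets k according to \(k\gtrsim w^2(T)/(\eta ^2\gamma ^2)\) as discussed above. Sample sizes, constants and names are ours; for isotropic toy data w(T) grows like \(\sqrt{d}\), so little is gained by compression, whereas structured (e.g. near-low-dimensional) data yields a much smaller width.

```python
import numpy as np

def gaussian_width_estimate(X, n_mc=200, seed=0):
    """Monte Carlo estimate of w(T) for T = {(x - x') / ||x - x'|| : x != x'},
    with the rows of X standing in for the support of the data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    iu = np.triu_indices(n, k=1)
    diffs = X[iu[0]] - X[iu[1]]                      # all pairwise differences
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    mask = norms[:, 0] > 0                           # guard against duplicate points
    diffs = diffs[mask] / norms[mask]
    G = rng.normal(size=(n_mc, d))                   # r ~ N(0, I_d)
    # T contains both t and -t, so the supremum equals the max absolute projection
    return float(np.mean(np.max(np.abs(G @ diffs.T), axis=1)))

# choose k so that the compressive complexity term of Proposition 12 is (roughly) below eta
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))
w_hat, gamma, eta = gaussian_width_estimate(X), 1.0, 0.5
k = int(np.ceil(w_hat ** 2 / (eta ** 2 * gamma ** 2)))
print(w_hat, k)
```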

3.6 General Lipschitz classifiers

The nearest neighbour example from the previous section generalises to the class of all Lipschitz classifiers (Gottlieb & Kontorovich, 2014; von Luxburg & Bousquet, 2004), examples of which, besides nearest neighbours, also include the support vector machine and others (von Luxburg & Bousquet, 2004). Let \({\mathcal {H}}_d\) and \({\mathcal {H}}_R\) be the sets of \(L_h\)-Lipschitz functions on \({{\mathcal {X}}}_d\) and \({{\mathcal {X}}}_R\) respectively. We take the exact same setting as previous margin-based analyses (Gottlieb et al., 2016), including an \(L_{\ell }\)-Lipschitz loss function bounded by \(\bar{\ell }\). For instance, \(\bar{\ell }\) can be taken to be 1: since classification losses (e.g. the hinge loss) are surrogates of the 0–1 loss, clipping at 1 makes sense, as was done by Gottlieb et al. (2016). We restrict ourselves to Euclidean space to leverage the computational advantages of random projections. In addition, we relax the requirement that the input space \({{\mathcal {X}}}_d\) be bounded, and instead only require that most of the probability mass lies in a bounded subset. This relaxation is also applicable to the previous section.

Let \({\mathbb {P}}_X\) denote the marginal distribution of X, and for each \(\epsilon \ge 0\) define

$$\begin{aligned} w_{\epsilon }({{\mathcal {X}}}_d,{\mathbb {P}}_X)&:= \inf _{A\subseteq {{\mathcal {X}}}_d : {\mathbb {P}}_X(X\in A)\ge 1-\epsilon } w(A) \end{aligned}$$
(39)

This lets us relax the boundedness assumption on the domain \({{\mathcal {X}}}_d\): instead, we only need it to contain a bounded subset A of probability mass \(1-\epsilon\) for \(w_{\epsilon }({{\mathcal {X}}}_d,{\mathbb {P}}_X)\) to be finite. The familiar Gaussian width is recovered when \(\epsilon =0\), i.e. \(w_{0}({{\mathcal {X}}}_d,{\mathbb {P}}_X)=w({{\mathcal {X}}}_d)\). In the sequel, we use the shorthand

$$\begin{aligned} {{\mathcal {X}}}_d^{\epsilon }\in \left\{ A\subseteq {{\mathcal {X}}}_d: {\mathbb {P}}_X(X\in A)\ge 1-\epsilon ,\; w(A)=w_{\epsilon }({{\mathcal {X}}}_d,{\mathbb {P}}_X)\right\} \end{aligned}$$

for a subset that attains the infimum in (39), so that \({\mathbb {P}}_X(X\in {{\mathcal {X}}}_d^{\epsilon })\ge 1-\epsilon\) and \(w({{\mathcal {X}}}_d^{\epsilon })=w_{\epsilon }({{\mathcal {X}}}_d,{\mathbb {P}}_X)\).
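Since the infimum in (39) ranges over all subsets of sufficient probability mass, it is not directly computable; as a heuristic proxy on a finite sample, one may simply discard the \(\epsilon\) fraction of points furthest from the centroid and estimate the Gaussian width of the rest, as in the sketch below. The trimming rule, names and toy data are ours and purely illustrative.

```python
import numpy as np

def width_of_points(A, n_mc=200, seed=0):
    """Monte Carlo estimate of the Gaussian width E_r sup_{a in A} <r, a>
    of a finite point set A (rows of an (m, d) array)."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(n_mc, A.shape[1]))
    return float(np.mean(np.max(G @ A.T, axis=1)))

def truncated_width_proxy(X, eps, n_mc=200, seed=0):
    """Heuristic stand-in for w_eps of Eq. (39): keep the 1 - eps fraction of
    sample points closest to the centroid and estimate the width of those.
    (The true definition takes an infimum over *all* subsets of mass 1 - eps.)"""
    c = X.mean(axis=0)
    r = np.linalg.norm(X - c, axis=1)
    keep = X[r <= np.quantile(r, 1.0 - eps)]
    return width_of_points(keep, n_mc=n_mc, seed=seed), keep

# toy usage: a bulk of points plus a few far-out ones that the trimming removes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(190, 20)), 10.0 * rng.normal(size=(10, 20))])
w_eps_hat, kept = truncated_width_proxy(X, eps=0.05)
print(w_eps_hat, kept.shape)
```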

In this setting, we have the following, proved in Appendix Section “General Lipschitz classifiers”.

Proposition 13

Consider the class of Lipschitz classifiers described above. We have

$$\begin{aligned} D_k(g^*)&\le L_{\ell }L_h \text {diam}({{\mathcal {X}}}_d^{\epsilon }) \frac{w({{\mathcal {X}}}_d^{\epsilon })}{\sqrt{k}} + \epsilon \cdot \bar{\ell }\end{aligned}$$
(40)
$$\begin{aligned} {\mathcal {C}}_{k,N}({\mathcal {G}}_d)&\le L_{\ell }L_h \text {diam}({{\mathcal {X}}}_d^{\epsilon }) \frac{w({{\mathcal {X}}}_d^{\epsilon })}{\sqrt{k}} + \epsilon \cdot \bar{\ell } \end{aligned}$$
(41)
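For orientation, once the constants are fixed or estimated (e.g. via the proxies sketched above), the right hand side of (40)–(41) is a one-line computation; the numbers below are hypothetical.

```python
import numpy as np

def lipschitz_bound_rhs(L_ell, L_h, diam_A, width_A, k, eps, ell_bar=1.0):
    """Right hand side of Eqs. (40)-(41) for given problem constants."""
    return L_ell * L_h * diam_A * width_A / np.sqrt(k) + eps * ell_bar

print(lipschitz_bound_rhs(L_ell=2.0, L_h=1.0, diam_A=6.0, width_A=5.0, k=400, eps=0.05))
```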

Originally, the Lipschitz classifier (Gottlieb & Kontorovich, 2014) was proposed as a classification approach in doubling metric spaces. The analysis of Gottlieb and Kontorovich (2014) highlighted that the generalisation error can be expressed in terms of the doubling dimension of the metric space. As we commented in the nearest neighbour section, a particular choice of k proportional to the squared Gaussian width makes this connection explicit, while we remain free to choose other values of k. Another difference is in the methodological focus. In Gottlieb and Kontorovich (2014), Kontorovich and Weiss (2015), and Gottlieb et al. (2016), bounding the error in terms of a notion of intrinsic dimension was made possible by a specific property of the Lipschitz class, whereby the covering numbers of the function class are upper bounded in terms of the covering numbers of the input space. By contrast, our strategy starts from exploiting random projection to obtain an auxiliary class of lower complexity, so the Lipschitz property of the classifier functions is not in general required in our framework. Indeed, throughout the examples in this section, the same starting point has drawn together some widely used regularisation schemes in the case of parametric models, as well as the Gaussian width in the nearest neighbour and Lipschitz classifier examples.

3.7 Turning compressive complexity into a regulariser

In several examples of the previous sections, the upper bound on \({\mathcal {C}}_{k,N}\) has taken the form \(\sup _{g\in {\mathcal {G}}_d}{\mathcal {C}}_k(g)\), where \({\mathcal {C}}_k\) is some function that depends on the data only through g. Structural Risk Minimisation (SRM) (Vapnik, 1998) is a classic approach that can be applied to turn the expression of \({\mathcal {C}}_k\) into a regulariser; this ensures that ERM is confined to an appropriate subset of \({\mathcal {G}}_d\) that satisfies the compressibility constraint in our theorems.

For more complicated models, however, bounding the compressive complexity in a useful way may be difficult or out of reach. In the absence of a suitable analytic upper bound, in this section we show that one can instead estimate it from unlabelled data, whenever the loss function is Lipschitz, yielding semi-supervised regularisation algorithms that learn the regularisation term from an independent unlabelled data set. This recovers a form of consistency regularisation (Laine & Aila, 2017)—a semi-supervised technique widely used in practice—giving it a theoretical justification. We describe this in the sequel.

Exploiting the uniform nature of the bound of Theorem 4, we use SRM. This will give us a regulariser whose general form comes from the compressive distortion of the function class, and which takes care of the required low-distortion constraint, so the resulting predictor enjoys the guarantee stated in Theorem 4. The reason this works is that a bound which holds uniformly over the class can itself be minimised as a training objective, and, by construction, the minimiser then enjoys the generalisation guarantee indicated by the bound.

Suppose we have an independent unlabelled data set drawn i.i.d. from the marginal distribution of the data. For each \(\theta \in [0,\bar{\ell }]\), we define the class

$$\begin{aligned} {\mathcal {G}}_d^{\theta }:= \left\{ g\in {\mathcal {G}}_d: \hat{D}_k(g)\le \theta \right\} \subseteq {\mathcal {G}}_d. \end{aligned}$$
(42)

Note that these classes depend on the independent unlabelled data set, but not on the labelled data. Fix an increasing sequence \((\theta _i)_{i\in {\mathbb {N}}}\). This defines a nested sequence of subsets of the function class \({\mathcal {G}}_d\), as we have \({\mathcal {G}}_d^{\theta _1} \subseteq {\mathcal {G}}_d^{\theta _2}\subseteq ... \subseteq {\mathcal {G}}_d\). Let \((\mu _i)_{i\in {\mathbb {N}}}\) be an associated sequence of probability weights s.t. \(\sum _{i\in {\mathbb {N}}}\mu _i \le 1\). By Theorem 4 applied to \({\mathcal {G}}_d^{\theta }\), for any fixed value of \(\theta\), we have, uniformly for all \(g\in {\mathcal {G}}_d^{\theta }\), w.p. \(1-\delta\), that

$$\begin{aligned} E[g]&\le \hat{E}_{{\mathcal {T}}_N}[g] + 2\theta + 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R^{\theta })] + 3\sqrt{\frac{\log (2/\delta )}{2N}} \end{aligned}$$
(43)

where \({\mathcal {G}}_R^{\theta }\) is the RP-ed version of \({\mathcal {G}}_d^{\theta }\), and note that \({\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R^{\theta })\le {\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)\). We now use this bound for each \(i\in {\mathbb {N}}\) with failure probabilities \(\delta \mu _i\). By the union bound, w.p. \(1-\delta\) uniformly for all \(i\in {\mathbb {N}}\) and all \(g\in {\mathcal {G}}_d^{\theta _i}\),

$$\begin{aligned} E[g] \le \hat{E}_{{\mathcal {T}}_N}[g] + 2\theta _i + 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 3\sqrt{\frac{\log (2/(\delta \mu _i))}{2N}}. \end{aligned}$$
(44)

This suggests the following algorithm. For each \(g\in {\mathcal {G}}_d\), let i(g) denote the smallest integer such that \(g\in {\mathcal {G}}_d^{\theta _{i(g)}}\); more precisely, \(i(g):=\min \{i\in {\mathbb {N}}: \hat{D}_k(g) \le \theta _{i}\}\). Define the following minimisation objective as a learning algorithm:

$$\begin{aligned} g^{reg}:= \underset{g \in {\mathcal {G}}_d}{\text {arg min}} \;\; \left\{ \hat{E}_{{\mathcal {T}}_N}[g] +2\theta _{i(g)} + 3\sqrt{\frac{\log (1/\mu _{i(g)})}{2N}} \right\} . \end{aligned}$$
(45)

In practice, one can set \((\mu _i)_{i\in {\mathbb {N}}}\) as a uniform distribution on a finite sequence, so the last term becomes a constant and can be omitted. Regarded as a guiding principle, the above suggests a practical algorithm that uses \(\hat{D}_k(g)\) directly in place of its discretised version \(\theta _{i(g)}\), as illustrated in the sketch below. We have the following guarantee about \(g^{reg}\).
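As a concrete illustration of the selection rule (45), the sketch below takes \((\mu _i)\) uniform on a finite grid, so the last term is a constant and is dropped, and picks among a finite set of candidate predictors whose empirical risks and \(\hat{D}_k\) estimates are assumed to have been computed already; the estimation of \(\hat{D}_k\) from unlabelled data is as defined earlier in the paper and is not re-implemented here. Names and numbers are ours.

```python
import numpy as np

def srm_select(emp_errors, dk_hats, thetas):
    """Select a candidate by the objective of Eq. (45), with (mu_i) uniform on
    the finite grid `thetas` so that the last term is constant and dropped.
    emp_errors[j]: empirical risk of candidate j on the labelled sample;
    dk_hats[j]:    estimate of hat-D_k for candidate j from unlabelled data."""
    thetas = np.asarray(thetas)
    # i(g): index of the smallest theta_i with hat-D_k(g) <= theta_i
    i_of_g = np.searchsorted(thetas, dk_hats, side='left')
    i_of_g = np.minimum(i_of_g, len(thetas) - 1)   # guard: clip to the last grid point
    objective = np.asarray(emp_errors) + 2.0 * thetas[i_of_g]
    return int(np.argmin(objective))

# toy usage: three candidates with (hypothetical) precomputed quantities
best = srm_select(emp_errors=[0.10, 0.07, 0.05],
                  dk_hats=[0.02, 0.08, 0.30],
                  thetas=np.linspace(0.0, 1.0, 21))
print(best)
```

In the continuous variant mentioned above, one would instead add \(2\hat{D}_k(g)\) directly to the empirical risk as a regularisation term during training.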

Theorem 14

With probability at least \(1-\delta\),

$$\begin{aligned} E[g^{reg}]\le E[g^*] + 2\theta _{i(g^*)} + 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 4 \sqrt{\frac{\log (4/(\delta \mu _{i(g^*)}))}{2N}}. \end{aligned}$$
(46)

Proof of Theorem 14

We apply the uniform bound of Eq. (44) with the choice \(\theta :=\theta _{i(g^{reg})}\), so

$$\begin{aligned} E[g^{reg}]&\le _{1-\delta /2} \hat{E}_{{\mathcal {T}}_N}[g^{reg}] + 2\theta _{i(g^{reg})} + 2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)]+3\sqrt{\frac{\log (4/(\delta \mu _{i(g^{reg})}))}{2N}} \end{aligned}$$
(47)

By the definition of \(g^{reg}\), for any other \(g\in {\mathcal {G}}_d\), in particular for \(g=g^*\), the right hand side is further upper bounded as

$$\begin{aligned} \le \hat{E}_{{\mathcal {T}}_N}[g^*] +2\theta _{i(g^*)} + 2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 3\sqrt{\frac{\log (4/(\delta \mu _{i(g^*)}))}{2N}} \end{aligned}$$
(48)

We subtract \(E[g^*]\) from both sides, and use Hoeffding’s inequality to bound \(\hat{E}_{{\mathcal {T}}_N}[g^*]-E[g^*]\), yielding

$$\begin{aligned} E[g^{reg}]-E[g^*]&\le \hat{E}_{{\mathcal {T}}_N}[g^*] +2\theta _{i(g^*)} + 2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 3\sqrt{\frac{\log (4/(\delta \mu _{i(g^*)}))}{2N}} -E[g^*] \end{aligned}$$
(49)
$$\begin{aligned}&\le _{1-\delta /2} 2\theta _{i(g^*)} + 2E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 4 \sqrt{\frac{\log (4/(\delta \mu _{i(g^*)}))}{2N}} \end{aligned}$$
(50)

Combining (47) and (50) by the union bound completes the proof. \(\square\)

Comments. The bound contains \(\theta _{i(g^*)}\), which is an upper estimate of \(\hat{D}_k(g^*)\). This might not be a quantity of particular interest in itself, but we can relate it to \(D_k(g^*)\) as follows. Provided there is sufficient unlabelled data to ensure, for a given \(\eta \in (0,1)\), that \(\sup _{g\in {\mathcal {G}}_d}\vert \hat{D}_k(g)-D_k(g)\vert \le \eta\) w.p. \(1-\delta\), then \(\hat{D}_k(g^*)\le \theta _{i(g^*)}\) implies \(D_k(g^*)\le \theta _{i(g^*)}+\eta\) w.p. \(1-\delta\); consequently, with overall probability \(1-2\delta\), we have

$$\begin{aligned} E[g^{reg}]\le E[g^*] + 2 \theta ^*+ 2 E_R[{\hat{{\mathcal {R}}}}_N({\mathcal {G}}_R)] + 4 \sqrt{\frac{\log (4/(\delta \mu _{i(g^*)}))}{2N}}. \end{aligned}$$
(51)

where \(\theta ^* = \theta _{i(g^*)} +\eta\) is our high probability upper estimate on \(D_k(g^*)\). Thus, for the chosen \(k<d\), if a learning problem exhibits small \(D_k(g^*)\), and provided we have a large enough unlabelled set, then the algorithm (45) adapts to take advantage of this structure.

We have not elaborated here on how much unlabelled data would be needed. One can leverage and adapt the findings of Turner and Kabán (2023), where it was found (albeit in a deterministic model-compression setting) that the problem of ensuring that \(\eta\) is as small as we like is in general statistically as difficult as the original learning problem, but it becomes surprisingly easy in many natural problem settings, namely when the compression only affects the predictions for a small number of sample points.

As a final comment, we assumed throughout that the choice of k is made before seeing the data, e.g. based on the available sample size N. If desired, one can instead pursue a hierarchical SRM to allow the value of k to also be determined from the training sample. The parameter k needs to be large enough to ensure that \(\theta _{i(g^*)}\) is sufficiently small, and small enough to match the available sample size N in order to keep the Rademacher complexity term small.

4 Conclusions

We presented a framework to study the general question of how to discover and exploit hidden benign traits of learning problems when problem-specific prior knowledge is insufficient, using random projection’s ability to expose structure. We considered both compressive learning and high dimensional learning, and gave simple and general PAC bounds in the agnostic setting in terms of two general notions that we introduced: compressive distortion and compressive complexity. We also showed that our bounds are tight when these quantities are small. These quantities take different forms in different learning tasks, and we instantiated them for several widely used model classes. This demonstrated their ability to capture and discover interpretable structural characteristics that make high dimensional instances of these problems solvable to good approximation in a random linear subspace. In the examples considered, these characteristics turned out to resemble the margin, the margin distribution, the intrinsic dimension, the spectral decay of the data covariance, or the norms of parameters. In future work it will be interesting to use this strategy to discover benign structural traits in further PAC-learnable problems, and to develop regularised algorithms suggested by the bounds.