1 Introduction

The last decade has seen a tremendous increase in interest in complex learning problems, such as deep neural networks, and in learning in very high dimensional spaces. Such models have a large number of parameters that must be learned from data, which is typically very resource-intensive in terms of memory, computation, and labelled training data, and consequently infeasible to deploy on devices with limited resources such as mobile phones, wearable devices, and the Internet of Things. Therefore, a plethora of model-compression and approximation techniques have been proposed, such as quantisation, pruning, factorisation, random projection, hashing, and others (Choudhary et al., 2020). Rather intriguingly, many empirical findings on realistic benchmark problems indicate that, despite a drastic compression of the complex model, such techniques often perform impressively well, with predictive accuracy comparable to that of full precision models. Below we mention just a few illustrative landmarks.

Quantisation of the weights of deep neural networks was introduced in BinaryConnect (Courbariaux et al., 2015), where a neural network with weights constrained to a single bit (\(\pm 1\)) was shown empirically to achieve results comparable to a full precision network of the same size. These results were further refined by the Quantised Neural Networks (QNN) training algorithm (Hubara et al., 2017), and the idea was also extended to convolutional networks in XNOR-Net (Rastegari et al., 2016). Another compression scheme, introduced in Han et al. (2016) and called Deep Compression, employed a combination of pruning, quantisation, and Huffman coding to achieve similar results to the original network with a significant reduction in memory usage.

Factorisation of the weights into low-rank matrices has been another common technique to reduce the size of a deep neural network (DNN), see Denil et al. (2013, 2014) for details. Recent survey articles on a variety of model-compression techniques specific to deep neural networks may be found in Choudhary et al. (2020), Cheng et al. (2017) and Menghani (2021).

In a related work (Ravi, 2019), the author proposes to learn the high- and low-complexity networks simultaneously through a joint objective function that minimises not only their individual sample errors but also their disagreement. They found experimentally that this approach improves the accuracy of both models, regardless of the model-compression technique employed. While a theoretical explanation remains elusive, this was among the first attempts to shift focus from the compressed model back to the fuller picture of the original model, and to consider these objectives in tandem.

Theoretical studies of model-compression are much scarcer, and the interplay between model approximation and generalisation is not very well understood. Work taking an information-theoretic approach (Gao et al., 2019) studied the trade-off between the compression granularity (rate) and the change it induces in the empirical error, using rate-distortion theory. Follow-on work (Bu et al., 2021) extended their analysis to show that it is possible (on occasion) for compressed versions of pre-trained models to generalise even better than the original.

Another line of research exploited a notion of compression (Arora et al., 2018; Zhou et al., 2019). In Arora et al. (2018), a new compression framework was introduced for proving generalisation bounds, and their analysis indicated that resilience to noise implies better generalisation for deep neural networks. In Zhou et al. (2019), a PAC-Bayes bound was then proposed, giving a non-vacuous generalisation bound for the compressed network. This was further built upon in Baykal et al. (2019), inspiring a new algorithm along with a generalisation bound for fully connected networks.

In Suzuki et al. (2020b), compression-based bounds for a new DNN pruning method were established, and more recently the authors also gave bounds for the full network (Suzuki et al., 2020a). This latter work allows the compression-based bound to be converted into a bound for the full network, using the local Rademacher complexity of the Minkowski difference between the loss class of the full networks and the loss class of the compressed networks. This is therefore another instance, entirely complementary to the work of Ravi (2019), where the performance of the approximate model is linked back in some way to that of the full model, although a joint treatment was not attempted.

In Ashbrock and Powell (2021), a stochastic Markov gradient descent method was introduced to learn directly in the discrete parameter space in memory-limited settings. The authors provide a convergence analysis for their optimisation algorithm, but generalisation is only demonstrated experimentally.

The general trend and focus on compressing deep neural networks (DNNs) is remarkable. However, we conjecture a more fundamental connection between approximability and generalisation that is not specific to deep networks. In contrast to the increasingly sophisticated and specialised tools being developed for DNNs, our aim here is to study the connection between approximability and generalisation from first principles, and to ensure generalisation guarantees for learning with approximate models in general.

We also hypothesise that target concepts that have low sensitivity to approximation may represent a benign trait of learning problems in general, which would imply easier learnability of the full precision model too. To substantiate this, we shall seek learning algorithms whose generalisation ability depends on the approximability of the target concept, irrespective of the form of the learned predictor being used in the full or approximated setting.

1.1 Contributions

In the following roadmap we summarise the main contributions and findings of this paper:

  • We define a notion of approximability of a predictor, which quantifies the average extent of sensitivity of its predictions when subjected to a given approximation operator (Sect. 2.1). This quantity will feature heavily in our generalisation bounds.

  • In Sect. 2.2 we show that low sensitivity target functions may require less labelled training data, provided we have access to an independent unlabelled set of sufficient size (Theorem 1). This sets the stage for approximability to be viewed as a benign trait for learning.

  • In Sect. 2.3 we develop a practical theory, showing that a constrained empirical risk minimisation algorithm with a modified loss function, which enforces approximability up to a given threshold, learns a predictor that is guaranteed to generalise well both in its full precision and its approximate forms (Proposition 2.4). Furthermore, we construct an objective function that also implicitly optimises the trade-off managed by the sensitivity threshold (Proposition 2.5). These results then give rise to a learning algorithm that is able to take advantage of additional unlabelled data without the requirement for it to be independent from the labelled set (Theorem 2).

  • For learning a good approximate predictor, we also give two variants of our algorithm that allow the user to control the above trade-off directly (Theorem 3, Remark 2.6). This may be useful in certain settings, for example when low memory requirements prevail over prediction accuracy.

  • Section 3 is devoted to studying our unlabelled data requirements. We show that, while the worst case unlabelled sample size requirement is necessarily large (Proposition 3.1), natural examples of structure may arise from the data source interacting with the model, which may reduce, or even eliminate, the requirement for an additional unlabelled sample (Propositions 3.3, 3.4). This analysis is independent of the hypothesis class employed, and leads to some general conditions under which sensitivity estimation enjoys favourable convergence (Theorem 4). In addition, we also point out that the structural restrictions of the hypothesis class can in themselves bring further insights – in particular, for generalised linear models, the weight sensitivity turns out to be sufficient for dimension-independent learning (Proposition 3.5).

  • We discuss implications of our theoretical results for real problems, including binarisation with depth-independent error bounds and on-device deep network classification, in Sect. 4.

Throughout the exposition of the main sections, we only consider deterministic approximation operators, keeping the reasoning and the formalism simple, and rooted in first principles. We discuss extensions in Sect. 4, including the use of stochastic approximation operators.

1.2 Related work

We have already highlighted two existing studies that considered both sides of model-compression, namely the approximate predictor as well as the full predictor. Below we further discuss these in the light of our aims, approach, and findings, along with existing works that relate to ours in terms of either high-level ideas or technical aspects.

In a similar spirit to Ravi (2019), our inquiry simultaneously concerns both the approximate model and the full precision model. However, contrary to the empirical approach taken in Ravi (2019), where the heuristic nature of the algorithms makes a theoretical understanding somewhat elusive, our approach is analytic. We employ Rademacher complexity analysis of the generalisation error (Bartlett & Mendelson, 2002) to give algorithm-independent uniform bounds on the generalisation for both approximate and approximable function classes. The uniform nature of these bounds justifies algorithms that minimise them; therefore, our algorithms come with guarantees of good generalisation. Our framework is general, and can be used to analyse approximability and generalisation in tandem for any PAC-learnable machine learning problem.

Our findings are consistent with those of Suzuki et al. (2020a), though the approaches differ, resulting in a different and more general angle. In particular, their focus is on translating already-known bounds on compressed neural networks to the full uncompressed class. In contrast, we focus on showing that good approximability (i.e. low sensitivity to approximation) improves generalisation bounds in PAC-learnable classes. In addition, we pursue a joint treatment of learning both the approximate and the full predictor simultaneously.

The works of Arora et al. (2018) and Zhou et al. (2019), based on the idea of compression and resilience to noise, are also somewhat related to ours at a high level. However, in both Arora et al. (2018) and Zhou et al. (2019) the generalisation bounds are for the compressed model only; our treatment provides both sides of the coin: algorithms that learn a predictor that generalises both in its full precision and its approximate form. In Arora et al. (2018), the focus is on bounding the classification error of the compressed predictor with the \(\gamma\)-margin loss (with \(\gamma > 0\)) of the full model for multi-class classification. This corresponds to our general bounded Lipschitz loss function. Moreover, in Zhou et al. (2019) a PAC-Bayes approach is taken, so numerical tightness comes from data-dependent quantities in the bound that do not necessarily identify or shed light on the structural traits of the problem responsible for good generalisation. In contrast, by employing Rademacher analysis we are able to highlight structural properties responsible for low complexity and good generalisation, so our approach and findings are complementary to these works.

Our starting point in Sect. 2.2 is the semi-supervised framework of Bǎlcan and Blum (2010), where our approximability, or sensitivity of functions to approximation, plays the role of an unlabelled error, and we replace VC entropy with Rademacher complexity to facilitate the use of our bounds outside the classification setting. However, from Sect. 2.3 onward we depart from this framework in favour of simpler and more straightforwardly implementable bounds that fit our specific goals at the expense of a negligible additive term. In return, we obtain some advantages: (1) for our purposes, the unlabelled data need not be independent from the labelled set, (2) the sensitivity threshold is optimised implicitly and automatically by our algorithm without appeal to structural risk minimisation, and (3) we are able to study structural regularities that reduce or even eliminate the need for unlabelled data, which was not attempted in the previous work.

2 Generalisation through approximability

2.1 Notations and preliminaries

Consider the input domain \({\mathcal {X}}\subseteq {\mathbb {R}}^d\), where d denotes the dimensionality of the feature representation, and output domain \({\mathcal {Y}}\subseteq {\mathbb {R}}\). Let \(m \in {\mathbb {N}}\) and consider a sample \(S \in ({\mathcal {X}}\times {\mathcal {Y}})^m\) of size m drawn i.i.d. from an unknown distribution D. Let \({\mathcal {H}}\) be the hypothesis class; this is a set of functions mapping from \({\mathcal {X}}\) to \({\mathcal {Y}}\). We consider a loss function \(\ell :{\mathcal {Y}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}_+\). Then we define the generalisation and empirical error of a function \(f \in {\mathcal {H}}\) as

$$\begin{aligned} {{\,\textrm{err}\,}}(f) {:}{=}{{\,\mathrm{{\mathbb {E}}}\,}}_{(x,y) \sim D} [\ell (f(x),y)],\;\; \text { and }\;\; {{\,\mathrm{{\widehat{err}}}\,}}(f) {:}{=}\frac{1}{m} \sum _{(x,y) \in S} \ell (f(x),y). \end{aligned}$$

The best function in the class will be denoted as \(f^* {:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}} \{ {{\,\textrm{err}\,}}(f) \}\).

We let \({\mathcal {H}}_A\) be the set of approximate functions from \({\mathcal {X}}\) to \({\mathcal {Y}}\). Note that \({\mathcal {H}}_A\) need not be a subset of \({\mathcal {H}}\). We define an approximation operator \(A :{\mathcal {H}}\rightarrow {\mathcal {H}}_A\), which maps a hypothesis to its approximation. Here A is considered to be deterministic; the extension to stochastic approximation operators is discussed later in Sect. 4.

Definition 2.1

(Approximation-sensitivity of a function) Fix \(p \ge 1\). Given a sample \(S \in {\mathcal {X}}^{m}\) of size m drawn i.i.d. from the marginal distribution \(D_x\), we define the true and empirical sensitivity as

$$\begin{aligned} {\mathcal {D}}_{A}^p (f) {:}{=}{{\,\mathrm{{\mathbb {E}}}\,}}_{x \sim D_x} [ \vert f(x) - Af (x) \vert ^p ]^{\frac{1}{p}},\;\; \text { and } \;\;\widehat{{\mathcal {D}}}_{A}^p (f) {:}{=}\left( \frac{1}{m} \sum _{x \in S} \vert f(x) - Af (x) \vert ^p \right) ^{\frac{1}{p}}. \end{aligned}$$

The choice of p-norm will be left to the user in our forthcoming bounds. Formally, it is sufficient to work with \(p=1\), as by Jensen’s inequality, for all \(p\ge 1\), we have \({\mathcal {D}}_{A}^{1} (f) \le {\mathcal {D}}_{A}^p (f)\) and \(\widehat{{\mathcal {D}}}_{A}^1 (f) \le \widehat{{\mathcal {D}}}_{A}^p (f)\), for all \(f \in {\mathcal {H}}\). So the forthcoming bounds will be tightest with the choice \(p=1\). However, sometimes the user might like to specify a constraint on the sensitivity of functions in terms of the more familiar Euclidean norm (\(p=2\)), or some other member of the family of p-norms. Our results apply to any specification of p, so we will state results for general p-norms. An example where \(p=2\) is advantageous will be encountered later in Theorem 4. When the choice of p is arbitrary, we may omit the upper index in our notation.

The approximating class \({\mathcal {H}}_A\) is typically chosen to be much smaller than the original class \({\mathcal {H}}\), implying a reduced complexity term in our generalisation bounds, at the expense of a larger empirical error and the appearance of an additional sensitivity term \({\mathcal {D}}_{A}(f)\). We can think of \({\mathcal {H}}_A\) as a compressed model class whose elements occupy less memory, yet are still expressive enough to represent the essence of \({\mathcal {H}}\). Examples include quantisation and other model-compression schemes. The granularity of approximation that we can afford is considered fixed; in memory-constrained settings it is dictated by the available hardware.
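To make these objects concrete, the following is a minimal sketch in Python, assuming a linear hypothesis class and taking a rescaled sign-quantisation of the weights as a hypothetical instance of A; `empirical_sensitivity` implements \(\widehat{{\mathcal {D}}}_{A}^p\) from Definition 2.1.

```python
import numpy as np

def linear_predictor(w):
    """Full-precision linear predictor f(x) = <w, x>."""
    return lambda X: X @ w

def A_sign(w):
    """A hypothetical approximation operator A: sign-quantise the weights
    (1 bit each), rescaled by the mean magnitude to preserve scale."""
    scale = np.mean(np.abs(w))
    return lambda X: X @ (scale * np.sign(w))

def empirical_sensitivity(f, Af, X, p=1):
    """Empirical sensitivity of Definition 2.1 over an unlabelled sample X."""
    return np.mean(np.abs(f(X) - Af(X)) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # sample from the marginal D_x
w = rng.normal(size=20)
f, Af = linear_predictor(w), A_sign(w)
print(empirical_sensitivity(f, Af, X, p=1))  # at most the p=2 value, by Jensen
```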

We now define sensitivity-restricted hypothesis classes

$$\begin{aligned} {\mathcal {H}}_t {:}{=}\{ f \in {\mathcal {H}}: {\mathcal {D}}_{A}(f) \le t \} \;\;\text { and }\;\; {\widehat{{\mathcal {H}}}}_t {:}{=}\{ f \in {\mathcal {H}}: \widehat{{\mathcal {D}}}_{A}(f) \le t \}. \end{aligned}$$

We also define the class of sensitivities to be

$$\begin{aligned} {\mathcal {D}}_{A}{\mathcal {H}}{:}{=}\{ x \mapsto \vert f(x) - Af (x) \vert : f \in {\mathcal {H}}\}. \end{aligned}$$

We begin by stating the assumptions that we employ throughout the remainder of the paper. The first is that the loss function is bounded and Lipschitz. These assumptions allow us to invoke the theory of Rademacher complexity, as well as to make the connection between the generalisation error and the sensitivity of a function.

Recall, for a sample S of size m, the empirical Rademacher complexity of the class \({\mathcal {H}}\) is defined as

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {H}}) {:}{=}{{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{f \in {\mathcal {H}}} \frac{1}{m} \sum _{k=1}^m \sigma _k f(x_k), \end{aligned}$$

where \(\sigma \in \{-1,1\}^m\) is a vector of independent Rademacher variables, i.e. each coordinate is distributed uniformly on \(\{-1,1\}\). The Rademacher complexity is

$$\begin{aligned} {\mathcal {R}}_m ({\mathcal {H}}) {:}{=}{{\,\mathrm{{\mathbb {E}}}\,}}_S \widehat{{\mathcal {R}}}_S ({\mathcal {H}}). \end{aligned}$$
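When the class is finite (as \({\mathcal {H}}_A\) is under quantisation, for example), the supremum becomes a maximum over finitely many functions, and \(\widehat{{\mathcal {R}}}_S\) can be estimated by Monte Carlo over draws of \(\sigma\). A minimal sketch, assuming the candidate function values on S are collected in a matrix:

```python
import numpy as np

def empirical_rademacher(F, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite class, where F[j, k] = f_j(x_k) holds the values of the j-th
    function on the sample S of size m."""
    n_funcs, m = F.shape
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher draws
    # for each draw: sup over f of (1/m) * sum_k sigma_k f(x_k)
    return np.mean(np.max(sigma @ F.T, axis=1)) / m
```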

A classic result (Bartlett & Mendelson, 2002), Theorem 8 [see also Mohri et al. (2018), Lemma 3.3] shows that the generalisation gap scales as the Rademacher complexity – that is, we have with probability at least \(1 - \delta\) that

$$\begin{aligned} \vert {{\,\textrm{err}\,}}(f) - {{\,\mathrm{{\widehat{err}}}\,}}(f) \vert&\le 2 {\mathcal {R}}_m ({\mathcal {H}}) + \sqrt{\frac{\ln (\frac{2}{\delta })}{2m}}, \quad \text {and} \\ \vert {{\,\textrm{err}\,}}(f) - {{\,\mathrm{{\widehat{err}}}\,}}(f) \vert&\le 2 \widehat{{\mathcal {R}}}_S ({\mathcal {H}}) + 3 \sqrt{\frac{\ln (\frac{2}{\delta })}{2m}}. \end{aligned}$$

We make two assumptions that let us leverage the theory of Rademacher complexities. The first one is standard.

Assumption 1

\(\ell :{\mathcal {Y}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}_+\) is a bounded and \(\rho\)-Lipschitz loss function. That is, there exists \(B>0\) such that

$$\begin{aligned} \ell (x,y) \le B \text{ and } \vert \ell (x,y) - \ell (z,y)\vert \le \rho \vert x - z \vert , \end{aligned}$$

for all \(x,z \in {\mathcal {X}}, y \in {\mathcal {Y}}\). By re-scaling we may assume without loss of generality that \(B=1\).

Assumption 1 lets us bound the empirical Rademacher complexity of the loss class \(\ell \circ {\mathcal {H}}\) by that of \({\mathcal {H}}\) using Talagrand's contraction lemma (Mohri et al., 2018), Lemma 5.7, that is, \(\widehat{{\mathcal {R}}}_S(\ell \circ {\mathcal {H}})\le \rho \widehat{{\mathcal {R}}}_S({\mathcal {H}})\). Classic examples of Lipschitz loss functions include the (clipped) hinge loss and the logistic loss (\(\rho =1\) for both). The 0-1 loss for \(\pm 1\) valued classifiers also satisfies Assumption 1 (\(\rho =1/2\)), and indeed we have \(\widehat{{\mathcal {R}}}_S(\ell _{01}\circ {\mathcal {H}})=\frac{1}{2}\widehat{{\mathcal {R}}}_S({\mathcal {H}})\) by Mohri et al. (2018), Lemma 3.4.
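For instance, a one-line sketch of the clipped hinge loss for \(\pm 1\) labels, which satisfies Assumption 1 with \(B=1\) and \(\rho =1\):

```python
import numpy as np

def clipped_hinge(yhat, y):
    """Clipped hinge loss: values in [0, 1] (so B = 1) and 1-Lipschitz in
    its first argument (so rho = 1), satisfying Assumption 1."""
    return np.clip(1.0 - y * yhat, 0.0, 1.0)
```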

The second assumption we make is the uniform boundedness of the sensitivities. This will let us extend Rademacher analysis to the class of sensitivities \({\mathcal {D}}_{A}({\mathcal {H}})\), which then allows us to shift the complexity terms from the full models to the approximate models.

Assumption 2

The set of sensitivities, \({\mathcal {D}}_{A}{\mathcal {H}}\), is uniformly bounded. That is, there exists \(C>0\) such that,

$$\begin{aligned} \Vert f - Af\Vert _{\infty } \le C \end{aligned}$$

for all \(f \in {\mathcal {H}}\).

Assumption 2 is weaker than assuming that the functions in \({\mathcal {H}}\) and \({\mathcal {H}}_A\) are bounded. The latter is often assumed in analyses, either by constraining the norms of parameters and taking \({\mathcal {X}}\) to be bounded, or by passing linear outputs through a bounded nonlinearity—for instance, a sigmoidal function, or a threshold function—in the case of classification.

We start by giving a lemma that compares the true and empirical sensitivities. It is from this result that our estimates of the required unlabelled sample size derive. We explore this topic further in Sect. 3.

Lemma 2.2

With probability at least \(1-\delta\) we have

$$\begin{aligned} \vert {\mathcal {D}}_{A}^{1} (f) - \widehat{{\mathcal {D}}}_{A}^1 (f) \vert \le 2 \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}}) + 3 C \sqrt{\frac{\ln (\frac{2}{\delta })}{2 m}}, \end{aligned}$$

for all \(f \in {\mathcal {H}}\).

Proof

By classic Rademacher bounds (Bartlett & Mendelson, 2002), Theorem 8, it holds with probability at least \(1-\delta\) that

$$\begin{aligned} \vert {\mathcal {D}}_{A}^{1} (f) - \widehat{{\mathcal {D}}}_{A}^1 (f) \vert&= \left| {{\,\mathrm{{\mathbb {E}}}\,}}_{x \sim D_x} [ \vert f(x) - Af (x) \vert ] - \frac{1}{m} \sum _{x \in S} \vert f(x) - Af (x) \vert \right| \\&\le 2 \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}}) + 3 C \sqrt{\frac{\ln (\frac{2}{\delta })}{2 m}}, \end{aligned}$$

as required. \(\square\)

We now relate the generalisation error of the full model with the generalisation error of the approximate model through our notion of approximation sensitivity. The following is a key lemma as it allows us to shift from the complexity of the full precision models to the low precision models.

Lemma 2.3

Fix \(t\ge 0\). We have the following bound

$$\begin{aligned} \vert {{\,\textrm{err}\,}}(f) - {{\,\textrm{err}\,}}(A f) \vert \le \rho {\mathcal {D}}_{A}^{1} (f), \end{aligned}$$

for all \(f \in {\mathcal {H}}_t\).

Proof

Let \(f \in {\mathcal {H}}_t\). Then, by Jensen’s inequality and using the Lipschitz property of \(\ell\) we have

$$\begin{aligned} \vert {{\,\textrm{err}\,}}(f) - {{\,\textrm{err}\,}}(A f)\vert&\le {{\,\mathrm{{\mathbb {E}}}\,}}_{(x,y)\sim D} [\vert \ell (f(x),y)- \ell (A f (x) , y)\vert ]\\ {}&\le \rho {{\,\mathrm{{\mathbb {E}}}\,}}_{x \sim D_x} [ \vert f(x) - Af (x) \vert ] = \rho {\mathcal {D}}_{A}^{1} (f). \end{aligned}$$

This completes the proof. \(\square\)

2.2 Learning of low approximation-sensitive predictors

Learning in high-dimensional settings or with complex model classes requires enormous training sets in general, or some fairly specific prior knowledge about the problem structure. However, many real-world problems possess benign traits that are hard to know in advance. Inspired by the practical success of approximate algorithms created by various model-compression methods, in this section we investigate approximability as a potential benign trait for learning, by quantifying its effect on the generalisation error. More precisely, we elaborate on our intuition that, if a relatively complex target concept admits a simpler approximation that makes little alteration to its predictive behaviour, then it should be learnable from smaller training sets.

The rationale is easy to see, as follows. Fix some approximation operator A and associated sensitivity threshold \(t\ge 0\). Then by the classic Rademacher bound (Bartlett & Mendelson, 2002), Theorem 8, for any \(\delta >0\), with probability at least \(1-\delta\) over the draw of the training sample, we have for all \(f\in {\mathcal {H}}_t\) that

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\widehat{err}}}\,}}(f)+2\rho \widehat{{\mathcal {R}}}_S({\mathcal {H}}_t)+3\sqrt{\frac{\log (2/\delta )}{2m}}. \end{aligned}$$
(1)

Let \(f_t^*={{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}_t} \{ {{\,\textrm{err}\,}}(f) \}\). To learn this function, we consider a hypothetical Empirical Risk Minimiser (ERM) in the restricted class \({\mathcal {H}}_t\) – that is, we define the following minimiser:

$$\begin{aligned} {\hat{f}}{:}{=}{{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}_t} \{ {{\,\mathrm{{\widehat{err}}}\,}}(f) \}. \end{aligned}$$
(2)

Applying (1) to the function \({\hat{f}}\) from (2) with a failure probability of at most \(2\delta /3\), we note that \({{\,\mathrm{{\widehat{err}}}\,}}({\hat{f}})\le {{\,\mathrm{{\widehat{err}}}\,}}(f_t^*)\) by definition of \({\hat{f}}\), and further note that \({{\,\mathrm{{\widehat{err}}}\,}}(f_t^*)\le {{\,\textrm{err}\,}}(f_t^*)+\sqrt{\frac{\log (3/\delta )}{2m}}\) with probability at least \(1-\delta /3\) by Hoeffding’s inequality. Combining these with the use of a union bound yields that, with probability at least \(1-\delta\), \({\hat{f}}\) satisfies

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}})&\le {{\,\textrm{err}\,}}(f_t^*)+2\rho \widehat{{\mathcal {R}}}_S({\mathcal {H}}_t)+4\sqrt{\frac{\log (3/\delta )}{2m}}. \end{aligned}$$
(3)

Clearly, since \({\mathcal {H}}_t\subseteq {\mathcal {H}}\), then by a property of Rademacher complexities (Bartlett & Mendelson, 2002), Theorem 12 part 1 we have \(\widehat{{\mathcal {R}}}_S({\mathcal {H}}_t)\le \widehat{{\mathcal {R}}}_S({\mathcal {H}})\). So, whenever the concept we try to learn is actually in \({\mathcal {H}}_t\) (i.e. a low-sensitivity target function) then, depending on \(t\ge 0\), we can have a tighter guarantee compared to that of an empirical risk minimiser over the larger class \({\mathcal {H}}\).

Unfortunately, the minimisation in (2) is not implementable, because the specification of the function class \({\mathcal {H}}_t\) depends on the sensitivity function \({\mathcal {D}}_{A}\), which in turn depends on the true marginal distribution of the input data. It is often much easier to specify a larger function class \({\mathcal {H}}\) independent of the distribution, but this would ignore the sensitivity property and consequently lose out on the tighter guarantee.

The first approach that we consider is based on observing that the sensitivity function only depends on inputs and is independent of the target values. Hence, we can make use of additional unlabelled data to estimate it, as such data is typically more widely available in applications. To this end, our first line of attack is similar in flavour to the classic semi-supervised framework proposed in Bǎlcan and Blum (2010). In that work, the authors augmented the standard PAC model with a notion of compatibility to encode a prior belief about the target function in terms of an expectation over the marginal distribution. As a first approach, we instantiate their compatibility notion with our notion of approximation-sensitivity. Similarly to Bǎlcan and Blum (2010), this approach also allows us to use structural risk minimisation (SRM) to adapt the threshold parameter t. Balancing the reduced complexity of the class against the potentially increased error of the best function in this reduced class yields the following result.

Theorem 1

Fix an approximation operator A. Suppose we have an independent i.i.d. unlabelled sample \(S'_x\sim D_x^{m_u}\) of size \(m_u\), and let \(\epsilon _u>0\) be such that \(\sup _{f\in {\mathcal {H}}} \vert {\mathcal {D}}_{A}(f)-\widehat{{\mathcal {D}}}_{A}(f)\vert \le \epsilon _u\) with probability at least \(1-\delta /2\) with respect to the random draw of \(S'_x\). Take an increasing sequence \((t_k)_{k\in {\mathbb {N}}} \subset {\mathbb {R}}_+\), and for each \(k\in {\mathbb {N}}\) define \(f_k^* {:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}_{t_k}} \{ {{\,\textrm{err}\,}}(f) \}\). Let \(w :{\mathbb {N}}\rightarrow {\mathbb {R}}\) be such that \(w_k\ge 0\) for all \(k\in {\mathbb {N}}\) and \(\sum _{k \in {\mathbb {N}}} w_k \le 1\). Then, with probability at least \(1-\delta\), for all \(k\in {\mathbb {N}}\) and all \(f\in {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\) we have:

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\widehat{err}}}\,}}(f)+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_{k}+\epsilon _u})+3\sqrt{\frac{\log (1/w_k)}{2m}} +3\sqrt{\frac{\log (4/\delta )}{2m}}. \end{aligned}$$
(4)

Furthermore, for each \(f \in {\mathcal {H}}\) define \({\hat{k}}(f) {:}{=}\min \{ k \in {\mathbb {N}}: \widehat{{\mathcal {D}}}_{A}(f) \le t_{k}+\epsilon _u \}\), and consider the following algorithm

$$\begin{aligned} {\hat{f}} {:}{=}{{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}} \left\{ {{\,\mathrm{{\widehat{err}}}\,}}(f) + 2\rho \widehat{{\mathcal {R}}}_S \left( {\hat{{\mathcal {H}}}}_{t_{{\hat{k}}(f)} + \epsilon _u}\right) + 3\sqrt{\frac{\log \left( 1/w_{{\hat{k}}(f)} \right) }{2m}} \right\} . \end{aligned}$$
(5)

Then, with probability at least \(1-\delta\) we have

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}}) \le \min _{k\in {\mathbb {N}}}\left\{ {{\,\textrm{err}\,}}(f_{k}^*)+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})+4\sqrt{\frac{\log (1/w_{k})}{2m}}\right\} +3\sqrt{\frac{\log (6/\delta )}{2m}}. \end{aligned}$$
(6)

Before giving the proof, we make a few comments. Firstly, with a large enough \(m_u\) (i.e. sufficient additional unlabelled data), Lemma 2.2 gives, with probability \(1-\delta /2\), that \(\epsilon _u\le 2\widehat{{\mathcal {R}}}_{S'_x}({\mathcal {D}}_{A}{\mathcal {H}})+3C\sqrt{\frac{\log (4/\delta )}{2m_u}}\), which can be made arbitrarily small – this is the only role of \(S'_x\). A detailed account of the possible ranges of magnitude of this quantity will be discussed in Sect. 3, along with some natural factors that make it small. For now, let us point out that, by construction, whenever both \({\mathcal {H}}\) and \({\mathcal {H}}_A\) are PAC-learnable, and without further conditions, the complexity of our sensitivity class is determined by the complexities of \({\mathcal {H}}\) and \({\mathcal {H}}_A\) [see discussion around (29)]. By contrast, the general setting of semi-supervised learning in Bǎlcan and Blum (2010) allows arbitrarily complex compatibility classes, which, in a worst case scenario, can backfire and blow up the required labelled data size (Bǎlcan & Blum, 2010), Theorem 22.

The objective of the minimisation algorithm in (5) follows the idea of minimising the uniform bound (4). It finds a good predictor along with the appropriate subclass of \({\mathcal {H}}\) to which it belongs. The sequence of sensitivity threshold candidates \((t_k)_{k\in {\mathbb {N}}}\), and the associated weights \((w_k)_{k\in {\mathbb {N}}}\), with \(w_k\ge 0\) for all \(k\in {\mathbb {N}}\) and \(\sum _{k\in {\mathbb {N}}}w_k\le 1\), must be chosen before seeing any data (for instance, \(w_k:=2^{-k}\)), with \(w_k\) representing an a-priori belief in a particular \(t_k\).
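As an illustration of how the selection in (5) could be operationalised, the following minimal sketch assumes a finite pool of candidate predictors in place of all of \({\mathcal {H}}\), user-supplied estimates of the level-wise empirical Rademacher complexities, and the a-priori weights \(w_k = 2^{-(k+1)}\):

```python
import numpy as np

def srm_select(emp_errs, sens_hats, t_grid, rad_levels, rho, m):
    """A sketch of the selection step in Eq. (5) over a finite candidate pool.

    emp_errs[i]   : empirical error of candidate f_i on the labelled sample
    sens_hats[i]  : empirical sensitivity estimate of f_i
    t_grid[k]     : increasing thresholds t_k (with eps_u already added)
    rad_levels[k] : user-supplied estimate of the empirical Rademacher
                    complexity of the level-k subclass
    Weights w_k = 2^{-(k+1)} for k = 0, 1, ..., fixed before seeing any data.
    """
    best_i, best_obj = None, np.inf
    for i, (err_i, sens_i) in enumerate(zip(emp_errs, sens_hats)):
        k = int(np.searchsorted(t_grid, sens_i))  # smallest k with sens_i <= t_grid[k]
        if k == len(t_grid):
            continue  # sensitivity exceeds every threshold: candidate not covered
        w_k = 2.0 ** -(k + 1)
        obj = err_i + 2 * rho * rad_levels[k] + 3 * np.sqrt(np.log(1 / w_k) / (2 * m))
        if obj < best_obj:
            best_i, best_obj = i, obj
    return best_i  # index of the selected predictor
```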

As a further observation, the function classes \({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\) that feature in the high probability guarantee (6) are dependent on the unlabelled data. This dependence can be removed if desired, by noting that, with high probability, \(\widehat{{\mathcal {D}}}_{A}^1 (f) \le t + \epsilon _u\) implies \({\mathcal {D}}_{A}^1 (f) \le t + 2 \epsilon _u\) for all \(f \in {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\) – hence we have \({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u} \subseteq {{\mathcal {H}}}_{t_k+2\epsilon _u}\), which in turn implies \(\widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})\le \widehat{{\mathcal {R}}}_S({{\mathcal {H}}}_{t_k+2\epsilon _u})\) with high probability. In fact, the failure probability of this bound is already accounted for in the proof of (6), so replacing \(\widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})\) by \(\widehat{{\mathcal {R}}}_S({{\mathcal {H}}}_{t_k+2\epsilon _u})\) in (6) holds with the same probability as stated.

Lastly, but most importantly, since \({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\subseteq {\mathcal {H}}\), we have \(\widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})\le \widehat{{\mathcal {R}}}_S({\mathcal {H}})\). The extent of this reduction of complexity depends on several factors even for specific approximation choices, including the sensitivity of the unknown target function, the magnitude of \(\epsilon _u\) and the threshold estimate \(t_k+\epsilon _u\), the original class \({\mathcal {H}}\), and the data distribution. For the sake of intuition, suppose that availability of unlabelled data is not a barrier, so the potential gain is down to the interaction between the unknown data distribution and the unknown target function. Low sensitivity asserts that, for the particular approximation A, only a small mass fraction of the input points is affected by subjecting a predictor to A. If the target function satisfies this, and the marginal distribution is such that most functions of \({\mathcal {H}}\) do not, then \({\mathcal {H}}_t\) (i.e. the remaining set of functions that have low sensitivity) will be small. Let us consider some informal examples.

Example 1. We can think of a model approximation as a perturbation of the model. In classification, this induces a perturbation of the decision boundary. If the true classes are well separated by a large margin, then there is leeway for such perturbation. Hence, just as in the framework of Bǎlcan and Blum (2010), dense classes separated by a large margin will rule out all functions that cut across dense regions, leaving only a handful – especially if \({\mathcal {H}}\) was a simple class, such as linear predictors.

Furthermore, in the extreme case of zero sensitivity we can use the simpler class \({\mathcal {H}}_A\) instead of \({\mathcal {H}}\), as in the following.

Example 2. Consider a relatively complex parametric class \({\mathcal {H}}\), and a coarse quantisation as A. Then \({\mathcal {H}}_A\) simply becomes a finite hypothesis class. If the target function is insensitive to this approximation, then it is enough to work with \({\mathcal {H}}_A\) and have the guarantees enjoyed by the finite class.
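To quantify Example 2: if A quantises each of N parameters to b bits, then \(\vert {\mathcal {H}}_A \vert \le 2^{Nb}\). Assuming further, for this illustration, that the functions in \({\mathcal {H}}_A\) take values in \([-1,1]\), Massart's finite class lemma (see e.g. Mohri et al., 2018) gives

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) \le \sqrt{\frac{2 \ln \vert {\mathcal {H}}_A \vert }{m}} \le \sqrt{\frac{2 N b \ln 2}{m}}, \end{aligned}$$

so the complexity term inherited by our bounds scales with the description length Nb of the quantised class, rather than with the complexity of \({\mathcal {H}}\).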

Example 3. Suppose the functions in \({\mathcal {H}}\) have a large number of parameters, so \({\mathcal {H}}\) has high complexity, but the data distribution is supported on a simple restricted set that leaves much of the representational capacity of \({\mathcal {H}}\) dormant. Then the effect of model-compression will be spread across both relevant and irrelevant parameters, making less of a noticeable difference to the function values.

While these examples are both simplistic and informal, ample empirical evidence in the literature demonstrates that many model approximation methods do work surprisingly well in practice. In the next section we aim to develop an approach that helps to untangle and shed more light onto the various contributing factors that influence the error when learning involves approximate predictors. But first we prove Theorem 1.

Proof of Theorem 1

For a fixed \(t\ge 0\), by the definition of \(\epsilon _u\), with probability \(1-\delta /2\), we have that \(f\in {\mathcal {H}}_t\) implies \(f\in {\hat{{\mathcal {H}}}}_{t+\epsilon _u}\). We shall pursue SRM by exploiting the independent unlabelled sample to define a nested sequence of function classes \({\hat{{\mathcal {H}}}}_{t_1+\epsilon _u}\subseteq {\hat{{\mathcal {H}}}}_{t_2+\epsilon _u} \subseteq \dots \subseteq {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\subseteq {\hat{{\mathcal {H}}}}_{t_{k+1}+\epsilon _u}\subseteq \dots \subseteq {\mathcal {H}}\), where \(k\in {\mathbb {N}}\). These classes depend on the unlabelled sample, but not on the labelled sample. For any fixed \(k\in {\mathbb {N}}\), the classic Rademacher bound (Bartlett & Mendelson, 2002), Theorem 8 implies with probability at least \(1- (w_k\delta / 2)\) that all \(f\in {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\) satisfy

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\widehat{err}}}\,}}(f)+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})+3\sqrt{\frac{\log (4/(\delta w_k))}{2m}}. \end{aligned}$$

Since \(k \in {\mathbb {N}}\) is arbitrary, and the non-negative weights satisfy \(\sum _{k \in {\mathbb {N}}} w_k \le 1\), we take a union bound and it follows with probability at least \(1 - \frac{\delta }{2}\) that, uniformly for all \(k\in {\mathbb {N}}\) and all \(f\in {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\), we have

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\widehat{err}}}\,}}(f)+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_k+\epsilon _u})+3\sqrt{\frac{\log (1/w_k)}{2m}}+3\sqrt{\frac{\log (4/\delta )}{2m}}. \end{aligned}$$

This proves (4).

To obtain (6) for \({\hat{f}}\) defined in (5), we apply (4) to \({\hat{f}}\). By construction, \({\hat{f}}\in {\hat{{\mathcal {H}}}}_{t_{{\hat{k}}({\hat{f}})}+\epsilon _u}\). Recall also that with probability at least \(1-\frac{\delta }{2}\) we have \(f^*_{k}\in {\hat{{\mathcal {H}}}}_{t_k+\epsilon _u}\) as \(f^*_{k}\in {\mathcal {H}}_{t_k}\). Therefore, with probability at least \(1-\delta /3\),

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}})&\le {{\,\mathrm{{\widehat{err}}}\,}}({\hat{f}})+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_{{\hat{k}}({\hat{f}})}+\epsilon _u})+3\sqrt{\frac{\log (1/w_{{\hat{k}}({\hat{f}})})}{2m}}+3\sqrt{\frac{\log (2\cdot 3/\delta )}{2m}} \end{aligned}$$
(7)
$$\begin{aligned}&\le {{\,\mathrm{{\widehat{err}}}\,}}(f^*_{k})+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_{k}+\epsilon _u})+3\sqrt{\frac{\log (1/w_{k})}{2m}}+3\sqrt{\frac{\log (6/\delta )}{2m}} , \end{aligned}$$
(8)

for all \(k\in {\mathbb {N}}\). In the last inequality we used the definition of \({\hat{f}}\) noting that the right hand side of (7) is minimised by \({\hat{f}}\). In addition, by Hoeffding’s inequality, we also have \({{\,\mathrm{{\widehat{err}}}\,}}(f^*_{k})\le {{\,\textrm{err}\,}}(f^*_{k})+\sqrt{\frac{\log (6/(w_k \delta ))}{2m}}\) with probability at least \(1-(w_k \delta )/6\). Combining with (8) and using the union bound, it follows with probability at least \(1-\delta\) that

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}}) \le {{\,\textrm{err}\,}}(f^*_{k})+2\rho \widehat{{\mathcal {R}}}_S({\hat{{\mathcal {H}}}}_{t_{k}+\epsilon _u})+4\sqrt{\frac{\log (1/w_{k})}{2m}}+3\sqrt{\frac{\log (6/\delta )}{2m}}, \end{aligned}$$

for all \(k\in {\mathbb {N}}\). Finally, choosing k to minimise the bound concludes the proof. \(\square\)

2.3 A joint approach to sensitivity and generalisation

The conceptually straightforward approach of the previous subsection implies that a target concept that is robust to the effects of approximation by a low-complexity predictor may require fewer labelled examples to be learned. In particular, the regularised ERM algorithm defined in (5) can accomplish this learning task, the regulariser being the empirical Rademacher complexity of the restricted class \({\hat{{\mathcal {H}}}}_{t_{{\hat{k}}(f)}+\epsilon _u}\), along with a penalty for estimating \({\hat{k}}(f)\). In effect, this algorithm adaptively trims the original function class to the relevant subset of low-sensitivity predictors, and consequently returns a low-sensitivity element of an otherwise potentially much larger function class.

The appeal of this finding lies not only in offering a possible explanation of what makes some instances of a learning problem easier than others. By the low-sensitivity property, such a predictor should also be usable in its approximated form in memory-constrained settings. Indeed, for any \(t\ge 0\), if \({\hat{f}}\in {\mathcal {H}}_t\), then by Lemma 2.3 we have

$$\begin{aligned} {{\,\textrm{err}\,}}(A{\hat{f}}) = {{\,\textrm{err}\,}}({\hat{f}})+ \left( {{\,\textrm{err}\,}}(A{\hat{f}})-{{\,\textrm{err}\,}}({\hat{f}})\right) \le {{\,\textrm{err}\,}}({\hat{f}})+\rho {\mathcal {D}}_{A}({\hat{f}})\le {{\,\textrm{err}\,}}({\hat{f}})+\rho t. \end{aligned}$$
(9)

In other words, for a predictor with low approximation-sensitivity, using \(A{\hat{f}}\) instead of \({\hat{f}}\) will only incur an additive error of up to \(\rho t\). This additional term is the price to pay for predicting with the simplified function \(A{\hat{f}}\) instead of the full-precision function \({\hat{f}}\) – it will not improve with more data, but t is small precisely when the target function we try to learn has low sensitivity.

In this section we are interested in a more practical formulation of the tandem goal of learning an approximate predictor as well as a full precision predictor. The approach presented so far, beyond its conceptual elegance, has some practical drawbacks: (1) it requires an additional independent unlabelled data set; and (2) it requires computing the empirical Rademacher complexity of the restricted class. The latter is typically a hard combinatorial optimisation problem for interesting hypothesis classes (Bartlett & Mendelson, 2002), as it amounts to computing an empirical risk minimiser under the 0-1 loss.

To get around these limitations, we take a different approach: we modify the loss function to explicitly encode the fact that we are interested in a good low-complexity approximate predictor. More precisely, for a given threshold value \(t\ge 0\), we define the minimiser of the following constrained problem:

$$\begin{aligned} {\hat{f}}_t {:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}_t} \{ {{\,\mathrm{{\widehat{err}}}\,}}(A f) \} \end{aligned}$$
(10)

This is defined for the purpose of theoretical analysis; it is not computable, since checking \(f\in {\mathcal {H}}_t\) would require knowledge of the data distribution.

The following result shows that the function \({\hat{f}}_t\) in (10) achieves two different functionalities simultaneously: it not only produces a good approximate predictor with a quantified error guarantee, including the price to pay for the approximation, but \({\hat{f}}_t\) itself is also a good predictor whenever the problem admits an approximable target function.

Proposition 2.4

Fix an approximation operator A and \(t\ge 0\). Define \(f_t^* {:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}_t} \{ {{\,\textrm{err}\,}}(f) \}\) and \(g_t^* {:}{=}{{\,\textrm{argmin}\,}}_{g \in A {\mathcal {H}}_t} \{ {{\,\textrm{err}\,}}(g) \}\). Then, with probability at least \(1-\delta\), the function \({\hat{f}}_t\) from (10) satisfies all of the following simultaneously:

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}}_t)&\le \min \{ {{\,\textrm{err}\,}}(A f_t^*), {{\,\textrm{err}\,}}(g_t^*) \} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}}, \end{aligned}$$
(11)
$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}}_t)&\le {{\,\textrm{err}\,}}(f_t^*) + \rho t + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}} \end{aligned}$$
(12)
$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}}_t)&\le {{\,\textrm{err}\,}}(f_t^*) + 2 \rho t + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}} . \end{aligned}$$
(13)

We note that \(\rho t \rightarrow 0\) as \(t \rightarrow 0\); however, as t decreases, the pool of predictors in \({\mathcal {H}}_t\) shrinks too, and so \({{\,\textrm{err}\,}}(f_t^*)\) would be expected to increase. That is, the choice of t balances the trade-off between the sensitivity term \(\rho t\) and the error term \({{\,\textrm{err}\,}}(f_t^*)\).

Proposition 2.4 allows us to view learning and model-compression as two sides of the same coin. Eq. (13) suggests that low-sensitivity target functions are easier to learn, and that a constrained ERM algorithm is able to learn them up to a constant factor of their sensitivity. Indeed, suppose \(f^*=f_t^*\), i.e. the target function has sensitivity below t. Then the error of \({\hat{f}}_t\) is guaranteed to be much smaller than the worst case error of finding \(f^*\) in the whole class \({\mathcal {H}}\). At the same time, (12) provides a guarantee for the approximate predictor \(A{\hat{f}}_t\), which can potentially be deployed in low-memory settings by paying the additive term \(\rho t\), proportional to the extent of approximability. Remarkably, both of these seemingly different goals are accomplished by the same function \({\hat{f}}_t\) defined in (10). Moreover, (11) gives guarantees for \(A {\hat{f}}_t\) relative to both \(A f_t^*\) (the approximation of the best predictor in \({\mathcal {H}}\) with sensitivity of at most t) and \(g^*_t\) (the best approximate predictor in \(A{\mathcal {H}}_t\)).

Proof of Proposition 2.4

By Rademacher bounds (Bartlett & Mendelson, 2002), Theorem 8 and Talagrand’s contraction lemma (Mohri et al., 2018), Lemma 5.7, we have with probability at least \(1-\frac{2\delta }{9}\), that

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}}_t)&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) + 2 \widehat{{\mathcal {R}}}_S (\ell \circ A {\mathcal {H}}_t) + 3 \sqrt{\frac{\ln (\frac{2 \cdot 9}{2\delta })}{2m}}\nonumber \\ {}&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) + 2 \rho \widehat{{\mathcal {R}}}_S (A {\mathcal {H}}_t) + 3 \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}}. \end{aligned}$$
(14)

By definition of \({\hat{f}}_t\) we have \({{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) \le \min \{ {{\,\mathrm{{\widehat{err}}}\,}}(A f_t^*), {{\,\mathrm{{\widehat{err}}}\,}}(g_t^*) \}\). Using this together with Hoeffding's inequality, each of the following inequalities holds, separately, with probability at least \(1-\frac{\delta }{9}\):

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t)&\le {{\,\mathrm{{\widehat{err}}}\,}}(A f_t^*) \le {{\,\textrm{err}\,}}(A f_t^*) + \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}}, \text { and } \\ {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t)&\le {{\,\mathrm{{\widehat{err}}}\,}}(g_t^*) \le {{\,\textrm{err}\,}}(g_t^*) + \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}} \end{aligned}$$

Therefore, by the union bound and the fact that \(\widehat{{\mathcal {R}}}_S (A {\mathcal {H}}_t) \le \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A)\), we have with probability at least \(1 -\frac{4\delta }{9}\) that (11) holds. Similarly, as \({{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) \le {{\,\mathrm{{\widehat{err}}}\,}}(A f_t^*)\) and by Lemma 2.3, we have with probability at least \(1 -\frac{\delta }{9}\), that

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) \le {{\,\mathrm{{\widehat{err}}}\,}}(A f_t^*)&\le {{\,\textrm{err}\,}}(A f_t^*) + \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}} \le {{\,\textrm{err}\,}}(f_t^*) + \rho {\mathcal {D}}_{A}^{1} (f_t^*) + \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}} \end{aligned}$$
(15)
$$\begin{aligned}&\le {{\,\textrm{err}\,}}(f_t^*) + \rho t + \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}}. \end{aligned}$$
(16)

Combining the above three inequalities and the fact that \(\widehat{{\mathcal {R}}}_S (A {\mathcal {H}}_t) \le \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A)\) we have with probability at least \(1 -\frac{2\delta }{9}\), that

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}}_t) \le {{\,\textrm{err}\,}}(f_t^*) + \rho t + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{9}{\delta })}{2m}}. \end{aligned}$$
(17)

This proves (12). The second part follows by using Lemma 2.3, Jensen’s inequality and \({\hat{f}}_t \in {\mathcal {H}}_t\), so we have

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}}_t) \le \rho {\mathcal {D}}_{A}^{1} ({\hat{f}}_t) + {{\,\textrm{err}\,}}(A {\hat{f}}_t) \le \rho {\mathcal {D}}_{A}({\hat{f}}_t) + {{\,\textrm{err}\,}}(A {\hat{f}}_t) \le \rho t + {{\,\textrm{err}\,}}(A {\hat{f}}_t). \end{aligned}$$

Taking the union bound for each of the equations completes the proof. \(\square\)

Next, we show that in this formulation we can relax the fixed parameter t that constrains the function class, without the use of SRM. To avoid clutter, here we suppose the functional form of \(f\mapsto {\mathcal {D}}_{A}(f)\) is known – in practice it can be estimated from an independent unlabelled data set as in the previous section.

To this end, consider the minimiser of the following (hypothetical) objective function, used for theoretical analysis.

$$\begin{aligned} {\hat{f}} {:}{=}{{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}} \{ {{\,\mathrm{{\widehat{err}}}\,}}(A f) + \rho {\mathcal {D}}_{A}(f) \} \end{aligned}$$
(18)

Here the first term is our modified loss function as before, and the second term acts as a regulariser that implicitly constrains the function class. The following result shows that \({\hat{f}}\) from (18) behaves as the previous minimiser from (10), while it also automatically adapts the class-constraining sensitivity threshold t.

Proposition 2.5

Fix an approximation operator A. For \(t\ge 0\), let \(f_t^*:={{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}_t}\{ {{\,\textrm{err}\,}}(f) \}\). For the function \({\hat{f}}\) defined in (18), with probability at least \(1-\delta\) we have both of the following

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}})&\le \min _{t\ge 0}\left\{ {{\,\textrm{err}\,}}(f^*_t) + 2 \rho t\right\} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}, \text { and} \end{aligned}$$
(19)
$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}})&\le \min _{t\ge 0}\left\{ {{\,\textrm{err}\,}}(f^*_t) + 2 \rho t\right\} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}, \end{aligned}$$
(20)

simultaneously.

Proof of Proposition 2.5

Using Lemma 2.3 and Rademacher bounds (Bartlett & Mendelson, 2002), Theorem 8, we have with probability at least \(1-\frac{\delta }{4}\), that

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}})&\le {{\,\textrm{err}\,}}(A {\hat{f}}) + \rho {\mathcal {D}}_{A}^{1} ({\hat{f}}) \end{aligned}$$
(21)
$$\begin{aligned}&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}) + \rho {\mathcal {D}}_{A}^{1} ({\hat{f}}) + 2 \widehat{{\mathcal {R}}}_S (\ell \circ {\mathcal {H}}_A) + 3 \sqrt{\frac{\ln (\frac{2 \cdot 4}{\delta })}{2m}}. \end{aligned}$$
(22)

Let \(g {:}{=}{{\,\textrm{argmin}\,}}_{f \in \{ f_t^* : t \ge 0 \}} \{ {{\,\textrm{err}\,}}(f) + 2\rho {\mathcal {D}}_{A}^{1} (f) \}\). Then by the definition of \({\hat{f}}\) and the Hoeffding bound we obtain, with probability at least \(1- \frac{\delta }{8}\), that

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}) + \rho {\mathcal {D}}_{A}^{1} ({\hat{f}}) \le {{\,\mathrm{{\widehat{err}}}\,}}(A g)+\rho {\mathcal {D}}_{A}^{1} (g) \le {{\,\textrm{err}\,}}(A g)+\rho {\mathcal {D}}_{A}^{1} (g) +\sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}. \end{aligned}$$
(23)

Then, by Lemma 2.3 and definition of g we have

$$\begin{aligned} {{\,\textrm{err}\,}}(A g)+\rho {\mathcal {D}}_{A}^{1} (g) \le {{\,\textrm{err}\,}}(g) + 2\rho {\mathcal {D}}_{A}^{1} (g) \le {{\,\textrm{err}\,}}(f_t^*)+2\rho {\mathcal {D}}_{A}^{1} (f_t^*), \end{aligned}$$

for all \(t\ge 0\). Hence, \({{\,\textrm{err}\,}}(A g)+\rho {\mathcal {D}}_{A}^{1} (g) \le \min _{t\ge 0}\{{{\,\textrm{err}\,}}(f_t^*) + 2\rho {\mathcal {D}}_{A}^{1} (f_t^*) \}\), and substituting into (23) yields

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}) + \rho {\mathcal {D}}_{A}^{1} ({\hat{f}}) \le \min _{t\ge 0} \{{{\,\textrm{err}\,}}(f^*_t)+ 2\rho {\mathcal {D}}_{A}^{1} (f^*_t)\}+ \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}, \end{aligned}$$

with probability at least \(1-\frac{\delta }{8}\). By the Talagrand contraction lemma (Mohri et al., 2018), Lemma 5.7 we have \(\widehat{{\mathcal {R}}}_S (\ell \circ {\mathcal {H}}_A) \le \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A)\), and so combining with (22) and then by a union bound we have with probability at least \(1-\frac{\delta }{2}\) that

$$\begin{aligned} {{\,\textrm{err}\,}}({\hat{f}}) \le \min _{t\ge 0} \{{{\,\textrm{err}\,}}(f^*_t)+ 2\rho {\mathcal {D}}_{A}^{1} (f^*_t)\} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 4 \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}. \end{aligned}$$

Noting that \({\mathcal {D}}_{A}(f_t^*)\le t\) completes the proof of (20). Eq. (19) also follows, with probability at least \(1-\frac{\delta }{2}\), since \({{\,\textrm{err}\,}}(A{\hat{f}})\) is upper bounded by the right hand side of (21), which differs from it only by the non-negative term \(\rho {\mathcal {D}}_{A}^{1}({\hat{f}})\). \(\square\)

From Proposition 2.5 we see again that, for any fixed approximation function A such that \({\mathcal {H}}_A\) has smaller complexity than \({\mathcal {H}}\), if the target function has a low sensitivity (i.e. \({\mathcal {D}}_{A}(f^*)\) is small), then it is learnable from fewer labels than an arbitrary target from \({\mathcal {H}}\) would be. Of course, there may be learning problems where \(f^*\) has low error but high sensitivity for the pre-defined A, but the minimiser in (18) is a function that automatically balances between generalisation error and sensitivity.

It is now straightforward to use an estimate of \({\mathcal {D}}_{A}(f)\), giving rise to a learning algorithm that is an implementable version of the construct analysed in Proposition 2.5.

Theorem 2

(Joint learning of full and approximate predictors) Fix an approximation operator A, and consider the following algorithm.

$$\begin{aligned} {\hat{f}} {:}{=}{{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}}\{{{\,\mathrm{{\widehat{err}}}\,}}(Af)+\rho \widehat{{\mathcal {D}}}_{A}(f)\}. \end{aligned}$$
(24)

Let \(\epsilon _u>0\) be such that \(\sup _{f\in {\mathcal {H}}} \vert {\mathcal {D}}_{A}(f)-\widehat{{\mathcal {D}}}_{A}(f)\vert \le \epsilon _u\) with probability at least \(1-\frac{\delta }{2}\) with respect to \(D_x^{m_u}\) where \(m_u\ge m\). For \(t\ge 0\), let \(f_t^*:={{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}_t}\{ {{\,\textrm{err}\,}}(f) \}\). Then with probability at least \(1-\delta\), the function \({\hat{f}}\) satisfies both

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\hat{f}})&\le \min _{t\ge 0}\left\{ {{\,\textrm{err}\,}}(f^*_t) + 2 \rho t\right\} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + (4+\rho ) \sqrt{\frac{\ln (\frac{16}{\delta })}{2m}}+\rho \epsilon _u, \text { and} \\ {{\,\textrm{err}\,}}({\hat{f}})&\le \min _{t\ge 0}\left\{ {{\,\textrm{err}\,}}(f^*_t) + 2 \rho t\right\} + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + (4+\rho ) \sqrt{\frac{\ln (\frac{16}{\delta })}{2m}}+\rho \epsilon _u, \end{aligned}$$

simultaneously.

Proof

This follows by the same steps as the proof of Proposition 2.5 combined with Lemma 2.2. \(\square\)

Let us compare Theorem 2 with Theorem 1. With sufficient unlabelled data \(\epsilon _u\) can be made arbitrarily small, in both theorems. However, in Theorem 1 the unlabelled sample for estimating \(\epsilon _u\) must be independent of the labelled sample; this is because in that construction the function class depends on the unlabelled data through the sensitivity estimate. By contrast, in Theorem 2 we have an implicit adaptation of t, so the function class does not depend on the unlabelled sample. This enables us to reuse the labelled points also for estimating the sensitivity, and any additional unlabelled data just contributes to further shrinking \(\epsilon _u\). Hence, in Theorem 2 whenever \(\epsilon _u\) is already small enough using the m training points of S, we do not even require any additional unlabelled points. In later sections we will see natural conditions where this is easily the case.
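To illustrate how readily the algorithm (24) can be implemented, the following minimal self-contained sketch assumes a linear class with the clipped hinge loss (\(\rho =1\)), a rescaled sign-quantisation as A, and a finite candidate pool standing in for the minimisation over \({\mathcal {H}}\); following the observation above, the labelled inputs are reused for the sensitivity estimate.

```python
import numpy as np

def joint_objective(w, X, y, rho):
    """Objective of the algorithm in Eq. (24) for a linear candidate w:
    empirical error of the approximate predictor Af under the clipped
    hinge loss, plus rho times the empirical sensitivity D_A^1(f)."""
    f_vals = X @ w                                    # full-precision predictions
    Af_vals = X @ (np.mean(np.abs(w)) * np.sign(w))   # sign-quantised predictions
    emp_err_Af = np.mean(np.clip(1.0 - y * Af_vals, 0.0, 1.0))
    sens_hat = np.mean(np.abs(f_vals - Af_vals))      # reuses the labelled inputs
    return emp_err_Af + rho * sens_hat

# selection over a finite candidate pool (random here, purely for illustration;
# in practice the pool could come from ERM runs with different hyperparameters)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = rng.choice([-1.0, 1.0], size=500)
pool = [rng.normal(size=20) for _ in range(50)]
w_hat = min(pool, key=lambda w: joint_objective(w, X, y, rho=1.0))
```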

The advantage of the algorithm analysed in Theorem 1 is its statistical consistency, since given enough labelled data the generalisation error converges to that of the best predictor of the class. However, if the goal is to obtain an approximate predictor, we pay the price of an additive sensitivity term (9), and Theorem 2 shows that allowing such term enables a much more implementation-friendly algorithm without sacrificing the essence of the theoretical guarantee on generalisation.

Comparing the algorithm from Theorem 2 with that of Theorem 1, observe the difference in the regularisation term. Regularising with the sensitivity estimate was not justified in the formulation of Theorem 1, and indeed the authors of Bǎlcan and Blum (2010) have pointed out that regularising with their general compatibility estimate was not theoretically justified – despite it being used in practice (Chapelle et al., 2006). By contrast, in the formulation of Theorem 2, we have been able to justify it within our approximability objective.

2.4 Managing the trade-off between sample error and sensitivity for the approximate predictor

The analysis of Proposition 2.5 and Theorem 2 has shown that the associated algorithm has an implicit ability to realise the optimal trade-off between the sample error of \(A{\hat{f}}\) and the sensitivity term t, without any effort or tuning parameter from the user.

However, there may be situations when a different trade-off is desired, and in such a case we want to manage this trade-off as a tuning parameter. This is especially relevant for practical applications in memory-constrained settings, where obtaining a good approximate predictor \(A{\hat{f}}\) is the sole interest. For instance, we may only care about very low sensitivity functions at the expense of a slightly raised error, or vice-versa. Or we might like to explore multiple trade-offs as in a multi-objective approach. Another instance of this is when unlabelled data is also scarce but an analytic upper bound can be derived on the sensitivity function up to an unknown constant.

Conceptually, a good way to address issues of this sort would be to take back control of the threshold parameter t using the learning algorithm in (10) (with or without estimating the sensitivity). However, the constrained optimisation formulation can be awkward to work with in practice. Below we suggest a more user-friendly form of the algorithm, and show that its solution is close to that of (10).

For each \(\lambda \ge 0\) consider the following algorithm

$$\begin{aligned} {\tilde{f}}_{\lambda }&{:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}} \{ {{\,\mathrm{{\widehat{err}}}\,}}(A f) + \lambda \widehat{{\mathcal {D}}}_{A}(f) \}. \end{aligned}$$
(25)

Algorithms of this form, including the exploitation of unlabelled data in the regularisation term, have been in use in practice for a long time (Chapelle et al., 2006), see also (van Engelen & Hoos, 2020). The regularisation parameter \(\lambda\) balances the two terms of the objective function, and in addition to potential availability of prior knowledge, there is a wide range of well-established model selection methods available to set this parameter in practice.
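To make this concrete, below is a minimal numerical sketch of algorithm (25), assuming a linear hypothesis class and a weight-binarisation operator A in the spirit of BinaryConnect. The quantiser Q, the random-search minimiser, and all names here are illustrative choices for the purpose of the example, not prescriptions of the paper.

```python
import numpy as np

def binarise(w):
    # one hypothetical choice of Q: Q(w) = sign(w) * mean|w|
    return np.sign(w) * np.abs(w).mean()

def emp_err(w, X, y):
    # empirical 0-1 error of the linear classifier x -> sign(<w, x>)
    return float(np.mean(np.sign(X @ w) != y))

def sensitivity_hat(w, X_u):
    # estimated sensitivity: empirical mean of |f(x) - Af(x)| over a sample
    return float(np.mean(np.abs(X_u @ w - X_u @ binarise(w))))

def objective(w, X, y, X_u, lam):
    # the regularised objective of (25): err-hat(Af) + lambda * D-hat_A(f)
    return emp_err(binarise(w), X, y) + lam * sensitivity_hat(w, X_u)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # labelled inputs
y = np.sign(X @ rng.normal(size=5))          # labels from a random linear rule
X_u = rng.normal(size=(1000, 5))             # unlabelled points for D-hat

best_w, best_val = None, np.inf              # crude random-search minimisation
for _ in range(2000):
    w = rng.normal(size=5)
    val = objective(w, X, y, X_u, lam=0.5)
    if val < best_val:
        best_w, best_val = w, val
print(best_val)
```

In practice the 0-1 error would be replaced by a differentiable surrogate and the random search by gradient-based training; the sketch only illustrates how the two terms of (25) interact.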

To this end, we shall compare the error of \({\tilde{f}}_{\lambda }\) from algorithm (25) with that for \({\hat{f}}_t\) from the algorithm given in (10). The following proposition shows that, for any specification of \(\lambda\), there is a value of \(t\ge 0\) such that the errors of these two predictors are close, up to additive terms that decay with the sample size.

Theorem 3

(Balancing sample error & sensitivity) Let \(\epsilon _u > 0\) be such that \(\sup _{f\in {\mathcal {H}}} \vert {\mathcal {D}}_{A}(f)-\widehat{{\mathcal {D}}}_{A}(f)\vert \le \epsilon _u\) with probability at least \(1-\delta /4\) with respect to \(D_x^{m_u}\), where \(m_u\ge m\). For any \(\lambda >0\), there exists \(t > 0\) such that with probability at least \(1-\delta\) we have

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\tilde{f}}_{\lambda }) - {{\,\textrm{err}\,}}(A {\hat{f}}_t)&\le 4 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 6 \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}} + 2\lambda \epsilon _u . \end{aligned}$$
(26)

Proof of Theorem 3

Take \(t \le {\mathcal {D}}_{A}({\tilde{f}}_{\lambda })\). Then from the definition of algorithm (10) we have \({\mathcal {D}}_{A}({\hat{f}}_t) \le t \le {\mathcal {D}}_{A}({\tilde{f}}_{\lambda })\). Using this, the definition of \({\tilde{f}}_{\lambda }\), and Lemma 2.2, it follows with probability at least \(1-\frac{\delta }{4}\) that

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\tilde{f}}_{\lambda }) + \lambda \widehat{{\mathcal {D}}}_{A}({\tilde{f}}_{\lambda })&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) + \lambda \widehat{{\mathcal {D}}}_{A}({\hat{f}}_t) \\&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) + \lambda {\mathcal {D}}_{A}({\hat{f}}_t) + \lambda \epsilon _u \\&\le {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) + \lambda {\mathcal {D}}_{A}({\tilde{f}}_{\lambda }) + \lambda \epsilon _u. \end{aligned}$$

Rearranging, and using Lemma 2.2 again, we have with probability at least \(1-\delta /2\) that

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A {\tilde{f}}_{\lambda }) - {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) \le \lambda ({\mathcal {D}}_{A}({\tilde{f}}_{\lambda }) - \widehat{{\mathcal {D}}}_{A}({\tilde{f}}_{\lambda })) + \lambda \epsilon _u \le 2\lambda \epsilon _u. \end{aligned}$$

This shows that the sample errors of the two predictors are close.

Now, to prove (26), we again use the Rademacher bound of Bartlett and Mendelson (2002), Theorem 8, applied to \({\mathcal {H}}_A\) twice, each time with probability at least \(1 - \delta /4\), combined with the bound on the sample errors derived above and the union bound. We have, with probability \(1-\delta\),

$$\begin{aligned} {{\,\textrm{err}\,}}(A {\tilde{f}}_{\lambda }) - {{\,\textrm{err}\,}}(A {\hat{f}}_t)&= ({{\,\textrm{err}\,}}(A {\tilde{f}}_{\lambda }) - {{\,\mathrm{{\widehat{err}}}\,}}(A {\tilde{f}}_{\lambda })) + ({{\,\mathrm{{\widehat{err}}}\,}}(A {\tilde{f}}_{\lambda }) - {{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t)) + ({{\,\mathrm{{\widehat{err}}}\,}}(A {\hat{f}}_t) - {{\,\textrm{err}\,}}(A {\hat{f}}_t)) \\&\le 2\lambda \epsilon _u + 2 \left( 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 3 \sqrt{\frac{\ln \left( \frac{8}{\delta }\right) }{2m}}\right) , \end{aligned}$$

as required. \(\square\)

The comments we made on Theorem 2 also apply to Theorem 3. In particular, the training points also contribute to estimating the sensitivity, unlike in the approach of Theorem 1, which required a separate, independent unlabelled sample.

As a further remark, let us also address the case when, instead of estimating the sensitivity from unlabelled data, we have an analytic upper bound on this function for some specific choice of function class and approximation operator, up to an unknown absolute constant. The constant will be subsumed into the tuning parameter \(\lambda\).

Let \({\overline{{\mathcal {D}}_{A}}}(\cdot )\) be a mapping from \({\mathcal {H}}\) to \({\mathbb {R}}_+\) for which there exists \(c>0\) such that \({\mathcal {D}}_{A}(f)\le c\cdot {\overline{{\mathcal {D}}_{A}}}(f)\) for all \(f \in {\mathcal {H}}\). Note that \({\overline{{\mathcal {D}}_{A}}}(\cdot )\) does not depend on the sample. Now, for each \(\lambda \ge 0\) define the following algorithm

$$\begin{aligned} {\overline{f}}_{\lambda } {:}{=}{{\,\textrm{argmin}\,}}_{f \in {\mathcal {H}}} \{ {{\,\mathrm{{\widehat{err}}}\,}}(A f) + \lambda {\overline{{\mathcal {D}}_{A}}}(f) \}. \end{aligned}$$
(27)

Furthermore, let \({\hat{f}}_t\) be the predictor returned by algorithm (10), and \(\breve{f}_t\) the predictor from a version of the same algorithm (10) in which the unknown \({\mathcal {D}}_{A}(\cdot )\) is replaced with \({\overline{{\mathcal {D}}_{A}}}(\cdot )\). Then \(\breve{f}_t\) will have a guarantee of the same form as in Proposition 2.4, where t is now a threshold on \({\overline{{\mathcal {D}}_{A}}}(\cdot )\) rather than \({{\mathcal {D}}_{A}}(\cdot )\). The following remark shows that the error of \({\overline{f}}_{\lambda }\) is close to that of \(\breve{f}_t\).

Remark 2.6

For any \(\lambda >0\), there exists \(t > 0\) such that with probability at least \(1-\delta\) we have

$$\begin{aligned} {{\,\textrm{err}\,}}(A{\overline{f}}_{\lambda })-{{\,\textrm{err}\,}}(A\breve{f}_{t}) \le 4 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_A) + 6 \sqrt{\frac{\ln (\frac{8}{\delta })}{2m}}. \end{aligned}$$
(28)

Proof

Let \(t\ge 0\) be such that \(t \le {\overline{{\mathcal {D}}_{A}}}({\overline{f}}_{\lambda })\). Then \({\overline{{\mathcal {D}}_{A}}}(\breve{f}_t) \le t \le {\overline{{\mathcal {D}}_{A}}}({\overline{f}}_{\lambda })\). Consequently, by the definition of \({\overline{f}}_{\lambda }\), we have

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(A{\overline{f}}_{\lambda })+\lambda {\overline{{\mathcal {D}}_{A}}}({\overline{f}}_{\lambda }) \le {{\,\mathrm{{\widehat{err}}}\,}}(A\breve{f}_{t})+\lambda {\overline{{\mathcal {D}}_{A}}}(\breve{f}_{t}) \le {{\,\mathrm{{\widehat{err}}}\,}}(A\breve{f}_{t})+\lambda {\overline{{\mathcal {D}}_{A}}}({\overline{f}}_{\lambda }). \end{aligned}$$

Therefore, \({{\,\mathrm{{\widehat{err}}}\,}}(A {\overline{f}}_{\lambda }) \le {{\,\mathrm{{\widehat{err}}}\,}}(A \breve{f}_t)\). Using this, we have

$$\begin{aligned} {{\,\textrm{err}\,}}(A{\overline{f}}_{\lambda }) - {{\,\textrm{err}\,}}(A\breve{f}_{t})&= ({{\,\textrm{err}\,}}(A{\overline{f}}_{\lambda }) - {{\,\mathrm{{\widehat{err}}}\,}}(A{\overline{f}}_{\lambda })) + ({{\,\mathrm{{\widehat{err}}}\,}}(A{\overline{f}}_{\lambda }) - {{\,\mathrm{{\widehat{err}}}\,}}(A\breve{f}_{t})) + ({{\,\mathrm{{\widehat{err}}}\,}}(A\breve{f}_{t}) - {{\,\textrm{err}\,}}(A\breve{f}_{t})) \\&\le 2\left( 2\rho \widehat{{\mathcal {R}}}_{S} ({\mathcal {H}}_{A}) + 3\sqrt{\frac{\ln \left( \frac{8}{\delta }\right) }{2m}}\right) , \end{aligned}$$

with probability at least \(1-\delta\), by applying the usual Rademacher bounds (Bartlett & Mendelson, 2002), Theorem 8 to the class \({\mathcal {H}}_A\) twice. \(\square\)

We should note that Theorem 3 and Remark 2.6 require that \(\lambda\) is specified before seeing the data. However, we can use structural risk minimisation (SRM) to allow an exploration of a countable number of candidate values for this parameter before making the choice, at the cost of a small additional error term. Specifically, take a sequence of candidate values \(\{\lambda _k\}_{k\in {\mathbb {N}}}\) weighted by \(\{w_k\}_{k\in {\mathbb {N}}}\) with \(\sum _{k\in {\mathbb {N}}}w_k\le 1\). Then the same bounds hold for all \(\lambda _k\), \(k\in {\mathbb {N}}\), simultaneously, at the expense of an additional term of \(3\sqrt{\frac{\log (1/w_k)}{2m}}\).
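As a small illustration of this SRM step, the snippet below evaluates the additional term for a geometric grid of candidate values with prior weights \(w_k=2^{-k}\), a hypothetical but common choice satisfying \(\sum _k w_k\le 1\):

```python
import numpy as np

m = 1000
ks = np.arange(1, 11)
lambdas = 2.0 ** (-ks)        # candidate regularisation values lambda_k
w = 2.0 ** (-ks)              # prior weights, sum_k w_k <= 1
extra = 3 * np.sqrt(np.log(1 / w) / (2 * m))   # additional error term per k
for lam, e in zip(lambdas, extra):
    print(f"lambda_k={lam:.4f}  additional term={e:.4f}")
```

Candidates deemed a priori more plausible can be given larger weights, so that selecting them after seeing the data costs less.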

3 Rademacher complexity of the class of sensitivities

The generalisation bounds of Sect. 2 that include estimated values of the sensitivity rely on the empirical Rademacher complexity of the class of sensitivities \({\mathcal {D}}_{A}{\mathcal {H}}\). In Theorem 1 this was estimated on a separate unlabelled set, independent of the labelled sample, while in Theorems 2 and 3 it was estimated on the input points of the labelled training set, possibly augmented with further unlabelled data. To unify notation, in this section we will write S for a (generic) sample in both cases, and m for its cardinality, with a view that, if the empirical Rademacher complexity of \({\mathcal {D}}_{A}{\mathcal {H}}\) converges sufficiently fast with the cardinality of the labelled sample m, then the labelled data S may actually be sufficient.

However, arguably, the complexity of the sensitivity class, \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})\), can be at least as large as that of the original function class \({\mathcal {H}}\) in the worst case, so one may wonder whether the bounds are actually useful. In this section we look at this quantity more closely. Indeed, using a property of the empirical Rademacher complexities (Bartlett & Mendelson, 2002), Theorem 12, part 7 gives

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}}) \le \widehat{{\mathcal {R}}}_{S} ({\mathcal {H}}) + \widehat{{\mathcal {R}}}_{S} ({\mathcal {H}}_A). \end{aligned}$$
(29)

Moreover, this bound is tight, since equality holds when the approximating class \({\mathcal {H}}_A\) is a singleton. However, a singleton \({\mathcal {H}}_A\) is quite contrived, and far from what approximate algorithms are designed for.

For a fixed (possibly unlabelled) sample S, the set of interest in this section is the restriction of \({\mathcal {D}}_{A}{\mathcal {H}}\) to S,

$$\begin{aligned} {\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} {:}{=}\left\{ \begin{pmatrix} \vert f(x_1) - Af (x_1) \vert \\ \vdots \\ \vert f(x_m) - Af (x_m) \vert \end{pmatrix} :f \in {\mathcal {H}}\right\} \end{aligned}$$

We use \(R_p {:}{=}\sup _{f \in {\mathcal {H}}}\widehat{{\mathcal {D}}}_{A}^p (f)\) for the worst sensitivity in the chosen p-norm on the sample S. Note that from Assumption 2 we have \(R_p \le C\) for all \(p>0\). We shall also use the shorthand

$$\begin{aligned} u_k=u(x_k)=\vert f(x_k)-Af(x_k)\vert \text { and } u=(u_k)_{k\in [m]}. \end{aligned}$$

Note that \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subseteq B_p(0,m^{1/p}R_p)\) for all \(p\ge 1\), where \(B_p(c,r)\) denotes the p-ball centered at c with radius r.

We start by putting a crude magnitude bound on \(\widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}})\), which holds irrespective of the choices of \({\mathcal {H}}\) and \({\mathcal {H}}_A\), and is tight up to a constant factor for all choices of \(p\ge 1\). The following proposition shows that, whenever \(R_p\) is small, the empirical Rademacher complexity of the sensitivity class must also be small in magnitude. This magnitude bound does not imply a decay as m increases, since we make no assumptions beyond an i.i.d. sample at this point. However, it will be a useful reference in later subsections, and it can also be taken in conjunction with other bounds, since one can always take the minimum of all upper bounds.

Proposition 3.1

(Crude magnitude bound) For any \(p\ge 1\), we have \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}}) \le R_p\). Moreover, a lower bound of the same order holds as follows. Given p as chosen above, suppose that \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\) nearly fills the p-ball of radius \(R_p m^{1/p}\), in the sense that the convex hull of \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\) contains the p-ball of radius \(\frac{m^{1/p}}{2} R_p\) intersected with the positive orthant. Then there exists a constant \(C_p>0\) that only depends on the choice of the p-norm, such that \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})\ge C_p\cdot R_p\).

Proof

By Hölder’s inequality,

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}})&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{f \in {\mathcal {H}}} \sum _{k=1}^{m} \sigma _k \vert f(x_k) - Af (x_k) \vert \\&\le \frac{1}{m} \sup _{f \in {\mathcal {H}}} \sum _{k=1}^{m} \vert f(x_k) - Af (x_k) \vert \\&\le \sup _{f \in {\mathcal {H}}} \left( \frac{1}{m} \sum _{k=1}^{m} \vert f(x_k) - Af (x_k) \vert ^p \right) ^{\frac{1}{p}} \\&= \sup _{f \in {\mathcal {H}}} \widehat{{\mathcal {D}}}_{A}^{p}(f) = R_p \end{aligned}$$

for all \(p \in [1, \infty )\). This proves the upper bound.

We denote by \(K_+\) the positive orthant, and let \(B_p^+\left( 0,{\frac{m^{1/p}}{2} R_p}\right) := K_+ \cap B_p \left( 0,{\frac{m^{1/p}}{2} R_p}\right)\). To prove the lower bound, we recall Moreau’s decomposition theorem (Moreau, 1965) (see also (Wei et al., 2019), Sec. 2.1 & Sec. 3.1.5), which is the following: Given a closed convex cone \(K\subset {\mathbb {R}}^m\), denote its polar cone by \(K^*=\{u\in {\mathbb {R}}^m : \langle u,u' \rangle \le 0\text { for all } u'\in K\}\). Then, every vector \(v\in {\mathbb {R}}^m\) can be decomposed as

$$\begin{aligned} v=\Pi _{K}(v)+\Pi _{K^*}(v)\text { such that } \langle \Pi _{K}(v),\Pi _{K^*}(v)\rangle = 0, \end{aligned}$$
(30)

where \(\Pi _{K}(v) {:}{=}{{\,\textrm{argmin}\,}}_{u'\in K}\Vert v-u' \Vert _2\) denotes the orthogonal projection of v onto K. Hence we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{f \in {\mathcal {H}}} \sum _{k=1}^m \sigma _k \vert f(x_k) - Af (x_k) \vert \end{aligned}$$
(31)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in \text {conv} ({\mathcal {D}}_{A}{\mathcal {H}}\vert _S)} \sum _{k=1}^m \sigma _k u_k \end{aligned}$$
(32)
$$\begin{aligned}&\ge \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in B_p^+\left( 0,\frac{m^{1/p}}{2} R_p\right) } \sum _{k=1}^m \sigma _k u_k \end{aligned}$$
(33)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in B_p^+\left( 0,\frac{m^{1/p}}{2} R_p\right) } u^T(\Pi _{K_+}(\sigma )+\Pi _{K_+^*}(\sigma )) \end{aligned}$$
(34)
$$\begin{aligned}&=\frac{1}{m}\cdot \frac{m^{1/p}}{2}R_p\cdot {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \Vert \Pi _{K_+} \sigma \Vert _{p'} \end{aligned}$$
(35)
$$\begin{aligned}&\ge \frac{1}{m}\cdot \frac{m^{1/p}}{2}R_p \cdot \frac{1}{2}\left( \frac{m}{2}\right) ^{1/p'} \end{aligned}$$
(36)
$$\begin{aligned}&=m^{1/p+1/p'-1}\cdot 2^{-2-1/p'} \cdot R_p = \frac{R_p}{4\root {p'} \of {2}}. \end{aligned}$$
(37)

where \(p'\) is the Hölder conjugate of p, i.e. \(1/p+1/p'=1\). In line (34) we applied (30) to \(\sigma\), and (35) follows from the fact that \(u\) is in the positive orthant \(K_+\), so \(\langle u,\Pi _{K_+^*}(\sigma )\rangle \le 0\), and because the supremum is attained when \(u\) is a nonnegative scalar multiple of \(\Pi _{K_+}(\sigma )\), in which case \(\langle u,\Pi _{K_+^*}(\sigma )\rangle =0\). The inequality (36) holds because \(\Vert \Pi _{K_+}\sigma \Vert _{p'}=N^{1/p'}\), where N, the number of positive coordinates of \(\sigma\), is a \(\text {Binomial}(m,1/2)\) variable, so \({{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma }N^{1/p'}\ge (m/2)^{1/p'}\Pr (N\ge m/2)\ge \frac{1}{2}(m/2)^{1/p'}\). This completes the proof of the lower bound. \(\square\)

The lower bound highlights the fact that one cannot tighten the complexity bound by more than a constant factor without making extra assumptions. In addition, we also see that the non-negativity of the elements of \({\mathcal {D}}_{A}{\mathcal {H}}\) only affects this constant. Therefore, in the next few sections we set out to find and exploit other structures in order to gain more transparency and insight into the effective magnitude of this quantity in some natural settings. Specifically, we shall discuss examples of non-restrictive structural models from which one can read off benign conditions that give better bounds on \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})\). A lower magnitude of this complexity implies a smaller unlabelled data set size requirement for accurate estimation of the sensitivity, and in the case of our bounds in Sects. 2.3 and 2.4 it may even permit solving the learning problem without the need for an additional unlabelled sample.

3.1 Exploiting structural models of the sensitivity set

Throughout this section we make no assumption about either the function class \({\mathcal {H}}\) or the approximating class \({\mathcal {H}}_A\). So the results of this section are equally relevant to very rich classes like deep neural networks, all the way to very restricted ones like linear classes. We also make no assumption about the form of the approximating function, and indeed the approximating class is not required to be of the same architectural type as the original class.

We demonstrate the benign effects of some structural traits that the set \({\mathcal {D}}_{A}{\mathcal {H}}\) may naturally exhibit regardless of the linear or nonlinear nature of the actual predictors. Such benign structures will manifest themselves through a reduced complexity \(\widehat{{\mathcal {R}}}_S({\mathcal {D}}_{A}{\mathcal {H}})\), which in turn allows the bounds of Sect. 2 to provide a better understanding of what makes some instances of a learning problem easier than others.

Our strategy in the next subsections will be to study the complexity of the set \({\mathcal {D}}_{A}{\mathcal {H}}\) restricted to the sample S (as it appears in the empirical Rademacher bounds presented in Sect. 2) by inscribing it into various parametrised geometric shapes. These include natural structures such as the points of \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\) being near-sparse, or exhibiting clusters, or having some structured sparsity type model. For this we will not actually impose any extra conditions, instead our strategy is to use these constructs to reveal how the Rademacher complexity depends on the parameters of these models. In other words, our bounds will always hold with some parameter values, as in the worst case we just recover the crude magnitude bound in Proposition 3.1, while at the same time the effects of parameters convey more insight.

3.1.1 Near-sparse sensitivity set

A very natural situation is when some points in S have little effect on the sensitivity of the approximation, or in other words the approximation has little effect on the predictions for part of the points of S. For instance in classification, points that are far from the boundary will often have the approximating function Af predict in agreement with the original f.

A simple way to model this situation is by having the vectors in \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\) lie near the axes corresponding to the points in S which are less affected by the approximation, such as taking the shape of an axis-aligned ellipsoid in some Minkowski norm, defined as

$$\begin{aligned} {\mathcal {E}}_p(\mu ) {:}{=}\left\{ u \in {\mathbb {R}}^{m} :\sum _{k=1}^{m} \frac{\vert u_k\vert ^p}{\mu _k^p} \le 1 \right\} , \end{aligned}$$
(38)

for \(p\ge 1\), where \(\mu {:}{=}(\mu _1 , \dots , \mu _m) \in (0,\infty )^{m}\) are the semi-axes of the ellipsoid.

Note that this model is not restrictive, since we have \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subset B_{p} (0,R_p m^{1/p})\), so \(\mu _k \le R_p m^{1/p}\) for all \(k \in [m]\). However, the added flexibility of this model allows us to infer the effect of the magnitudes of the semi-axes, yielding some simple and natural conditions that improve on the worst-case magnitude guarantee in Proposition 3.1.

The following lemma gives the exact expression for the Rademacher complexity of an ellipsoid in any p-norm.

Lemma 3.2

Let \(\mu \in (0,\infty )^m\) and \(p\ge 1\), and consider \({\mathcal {E}}_p(\mu )\) as defined in (38). Then,

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {E}}_p(\mu )) = \frac{\Vert \mu \Vert _{\frac{p}{p-1}}}{m}. \end{aligned}$$

Proof

Using Hölder’s inequality, \(\sigma _k \in \{ -1 , 1\}\), and the definition of \({\mathcal {E}}_p(\mu )\) we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {E}}_p(\mu ))&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in {\mathcal {E}}_p(\mu )} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(39)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in {\mathcal {E}}_p(\mu )} \sum _{k=1}^{m} (\sigma _k \mu _k) \frac{u_k}{\mu _k} \end{aligned}$$
(40)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in {\mathcal {E}}_p(\mu )} \left( \sum _{k=1}^{m} \vert \sigma _k \mu _k\vert ^{p'} \right) ^{\frac{1}{p'}} \left( \sum _{k=1}^{m} \frac{\vert u_k\vert ^p}{\mu _k^p} \right) ^{\frac{1}{p}} \end{aligned}$$
(41)
$$\begin{aligned}&= \frac{1}{m} \left( \sum _{k=1}^{m} \mu _k^{p'} \right) ^{\frac{1}{p'}}, \end{aligned}$$
(42)

where \(p'\) is the Hölder conjugate of p, i.e. \(1/p+1/p'=1\). The identities (41) and (42) hold because the supremum attains equality in Hölder's inequality and saturates the constraint defining \({\mathcal {E}}_p(\mu )\). This completes the proof. \(\square\)

For more intuition, consider the case \(p=2\), which corresponds to the usual Euclidean-norm ellipsoid; here we can relate the right-hand side of the bound in Lemma 3.2 to the volume of the ellipsoid. Indeed, using the inequality between arithmetic and geometric means,

$$\begin{aligned} \frac{1}{m}\Vert \mu \Vert _2= \frac{1}{\sqrt{m}} \left( \frac{1}{m} \sum _{k=1}^{m} \mu _k^2 \right) ^{\frac{1}{2}} \ge \frac{1}{\sqrt{m}}\left( \prod _{k=1}^m \mu _k \right) ^{\frac{1}{m}} = C_m \text {Vol} ({\mathcal {E}}_2(\mu ))^{\frac{1}{m}}, \end{aligned}$$

where \(C_m>0\) is a constant depending only on m. Hence, for a fixed sample size m, if the quadratic mean of the \(\mu _k\)’s is small then the ellipsoid has a small volume.

If \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subseteq {\mathcal {E}}_p(\mu )\), then in the worst case \(\mu _k=R_p m^{1/p}\) for all \(k\in [m]\), and so

$$\begin{aligned} \frac{\Vert \mu \Vert _{\frac{p}{p-1}}}{m} \le \frac{1}{m}\left( \sum _{k\in [m]} (R_p m^{1/p})^{\frac{p}{p-1}}\right) ^{\frac{p-1}{p}} = R_p. \end{aligned}$$

Hence the bound of Lemma 3.2 recovers that of Proposition 3.1 in the worst case; whenever \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subseteq {\mathcal {E}}_p(\mu )\) with semi-axes below this worst case, Lemma 3.2 is already an improvement on Proposition 3.1.
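The following snippet gives an illustrative numerical comparison of the two bounds when the sensitivity set sits in a near-sparse axis-aligned ellipsoid; the semi-axis profile below is made up for the example.

```python
import numpy as np

m, p = 100, 2.0
p_conj = p / (p - 1)                            # Hölder conjugate p'
# a few large semi-axes, many tiny ones (near-sparse profile)
mu = np.concatenate([np.full(5, 1.0), np.full(m - 5, 0.01)])

ellipsoid_bound = np.sum(mu ** p_conj) ** (1 / p_conj) / m   # Lemma 3.2
# largest empirical p-norm sensitivity attainable inside E_p(mu):
R_p = mu.max() / m ** (1 / p)                   # crude bound, Proposition 3.1
print(ellipsoid_bound, R_p)                     # ~0.0224 versus 0.1
```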

As a model of the sensitivity set, an ellipsoid with high eccentricity posits that most sensitivity vectors reside near a linear subspace of \({\mathbb {R}}^m\). It is interesting to note that this has no implication for the form of the predictors. Indeed, even with highly nonlinear predictors (nonlinear classification boundaries, for example), the fraction of points for which the predictions are distorted under the action of approximation may be expected to be small.

However, it might be unrealistic to expect that, for every good function in \({\mathcal {H}}\), the approximation changes the predictions on the same points and leaves the same points unaffected. Hence, instead of assuming that \({\mathcal {D}}_{A}{\mathcal {H}}\) is contained in a single ellipsoid, for a more realistic model we consider a union of multiple axis-aligned ellipsoids that cover \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\). This allows the set of points whose predictions are relatively unaffected by the approximation to differ from one \(f\in {\mathcal {H}}\) to another.

The following proposition shows that in this model the Rademacher complexity of \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\) is bounded by the Rademacher complexity of the largest ellipsoid in the union and, remarkably, does not depend on the number of ellipsoids in the union; indeed, we can have countably many in this model, so the diversity of sensitivity profiles of the predictors of \({\mathcal {H}}\) on the span of the sample is accounted for at no expense. The vector of axis lengths for the i-th ellipsoid will be denoted by \(\mu _i\). We refer to individual components of this vector by adding a second index, for example \(\mu _{i,k}\) for the k-th semi-axis of the i-th ellipsoid.

Proposition 3.3

(Complexity of near-sparse sensitivity set) Let \(S \subseteq {\mathcal {X}}\) be an i.i.d. unlabelled sample drawn from \(D_x\), of size m. Let \(l\in {\mathbb {N}}\), and suppose that there exist \(\mu _i\in (0,\infty )^m, i\in [l]\) with \(\mu _{i,k}\le R_pm^{1/p}\), such that \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subset \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)\) for ellipsoids \({\mathcal {E}}_p(\mu _i)\). Then we have the following bound

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}}) \le \frac{1}{m} \max _i \Vert \mu _i\Vert _{\frac{p}{p-1}}. \end{aligned}$$

The proof makes use of similar steps to those in the proof of Lemma 3.2, but it does not apply the result of Lemma 3.2, since a direct approach yields the exact Rademacher complexity of the union of axis-aligned ellipsoids.

Proof of Proposition 3.3

As \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subset \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)\), using the fact that \(\sup (U_1 \cup U_2) = \max \{ \sup U_1 , \sup U_2 \}\) for any two bounded sets \(U_1, U_2\), taking absolute values, Hölder's inequality, \(\sigma _k \in \{ -1 , 1\}\), and the definition of \({\mathcal {E}}_p(\mu _i)\), we obtain

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}})&\le \widehat{{\mathcal {R}}}_S \left( \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)\right) \end{aligned}$$
(43)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(44)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _i \sup _{u\in {\mathcal {E}}_p(\mu _i)} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(45)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}_p(\mu _i)} \sum _{k=1}^{m} \vert u_k\vert \end{aligned}$$
(46)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}_p(\mu _i)} \sum _{k=1}^{m} \mu _{i,k} \frac{\vert u_k\vert }{\mu _{i,k}} \end{aligned}$$
(47)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}_p(\mu _i)} \left( \sum _{k=1}^{m} \mu _{i,k}^{p'} \right) ^{\frac{1}{p'}} \left( \sum _{k=1}^{m} \frac{\vert u_k\vert ^p}{\mu _{i,k}^p} \right) ^{\frac{1}{p}} \end{aligned}$$
(48)
$$\begin{aligned}&= \frac{1}{m} \max _i \left( \sum _{k=1}^{m} \mu _{i,k}^{p'} \right) ^{\frac{1}{p'}}, \end{aligned}$$
(49)

where \(p'\) is the Hölder conjugate of p. The equality in (46) is due to the symmetry of the set \(\bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)\) around each axis, and in (48) Hölder's inequality holds with equality due to the supremum. \(\square\)

We remark that the above proposition is true for a countably infinite number of ellipsoids by noticing that the sequence

$$\begin{aligned} \left( \sup _{u\in \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)} \sum _{k=1}^{m} \sigma _k u_k \right) _{l \in {\mathbb {N}}} \end{aligned}$$

is non-decreasing in l. Thus, by the monotone convergence theorem we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S \left( \bigcup _{i=1}^{\infty } {\mathcal {E}}_p(\mu _i)\right)&= \lim _{l \rightarrow \infty } \widehat{{\mathcal {R}}}_S \left( \bigcup _{i=1}^l {\mathcal {E}}_p(\mu _i)\right) \\&= \lim _{l \rightarrow \infty } \frac{1}{m} \max _{i \in [l]} \left( \sum _{k=1}^{m} \mu _{i,k}^{p'} \right) ^{\frac{1}{p'}} \\&= \frac{1}{m} \sup _{i\in {\mathbb {N}}} \left( \sum _{k=1}^{m} \mu _{i,k}^{p'} \right) ^{\frac{1}{p'}}. \end{aligned}$$

Thus Proposition 3.3 remains true for a countably infinite union of ellipsoids.

It may be interesting to note that the model of a union of axis-aligned ellipsoids has an intuitive meaning of near-sparsity of sensitivities. This may also be interpreted as a kind of near-compression bound, since Proposition 3.3 tells us that, when fewer points are affected by the approximation, the guarantee on the sensitivity estimation quality will be tighter, and hence the generalisation bound will be tighter as well.

However, beyond the intuitive meaning above, our structural modelling approach has the potential to reveal additional benign conditions that might be harder to find by intuition alone. To see this, we shall modify Proposition 3.3 to get an upper bound for a non-axis-aligned union of ellipsoids. As long as the ellipsoids share the same center (for instance, at the origin), the upper bound will still be independent of the number of ellipsoids in the union.

To this end, in addition to the axis-length parameters, for each ellipsoid in the union, take a rotation matrix \(V_i\in {\mathbb {R}}^{m\times m}\) where \(V_i^TV_i=V_iV_i^T=I_m\) for \(i\in [l]\). The columns of \(V_i\) are the principal directions for the i-th ellipsoid. We will refer to the k-th column of \(V_i\) by \((V_i)_k\), and \((V_{i})_{k,k'}\) will denote its \((k,k')\)-th element. The i-th ellipsoid is then defined as

$$\begin{aligned} {\mathcal {E}}_p^{V_i}(\mu _i) {:}{=}\left\{ u\in {\mathbb {R}}^{m} :\sum _{k=1}^{m} \frac{\vert (V_{i})_k^Tu\vert ^p}{\mu _{i,k}^p} \le 1 \right\} . \end{aligned}$$
(50)

By a change of variables, we have that \(u\in {\mathcal {E}}_p^{V_i}(\mu _i)\) is equivalent to \(V_i^Tu\in {\mathcal {E}}_p(\mu _i)\). Let \(\Lambda _i\) be the diagonal matrix with elements \(\mu _{i,k} \in (0,\infty )\) for \(k\in [m]\), so \(\Lambda _i^{-1}V_i^Tu\in B_p(0,1)\).

We no longer have symmetry around the axes, so (46) becomes an inequality, and we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S (\bigcup _{i=1}^l {\mathcal {E}}_p^{V_i}(\mu _i))&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]}\sup _{u\in {\mathcal {E}}_p^{V_i}(\mu _i)} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(51)
$$\begin{aligned}&\le \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}^{V_i}_p(\mu _i)} \sum _{k=1}^{m} \vert u_k\vert \end{aligned}$$
(52)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}^{V_i}_p(\mu _i)} \Vert u\Vert _1 \end{aligned}$$
(53)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{u\in {\mathcal {E}}^{V_i}_p(\mu _i)} \Vert (V_i\Lambda _i)(\Lambda _i^{-1}V_i^Tu)\Vert _1 \end{aligned}$$
(54)
$$\begin{aligned}&= \frac{1}{m} \max _i \sup _{\Lambda _i^{-1}V_i^Tu\in B_p(0,1)} \Vert (V_i\Lambda _i) (\Lambda _i^{-1}V_i^Tu) \Vert _1 \end{aligned}$$
(55)
$$\begin{aligned}&= \frac{1}{m} \max _i \Vert V_i\Lambda _i\Vert _{p\rightarrow 1}. \end{aligned}$$
(56)

Equation (55) used the assumption that \(\Lambda _i\) and \(V_i\) are full rank square matrices. The last line (56) holds by the definition of \(\Vert \cdot \Vert _{p\rightarrow 1}\), called the operator norm (or induced matrix norm) with domain p and co-domain 1. Such norms can only be computed explicitly in a few special cases. In particular,

  1. 1.

    Whenever \(V_i=I_m\), then \(\Vert V_i\Lambda _i\Vert _{p\rightarrow 1}=\Vert \mu _i\Vert _{\frac{p}{p-1}}\), since \(\Lambda _i\) is the diagonal matrix with elements \(\mu _{i,k}\). This recovers precisely the axis-aligned setting.

  2. 2.

    With \(p=1\), the expression of the induced norm is known to be \(\Vert V_i\Lambda _i\Vert _{1\rightarrow 1}=\max _{k\in [m]} \mu _{i,k} \Vert (V_i)_k\Vert _1\); both special cases are verified numerically in the sketch below.
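The following sketch numerically checks both special cases; the rotation and semi-axes are randomly generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
mu = rng.uniform(0.1, 1.0, size=m)
V, _ = np.linalg.qr(rng.normal(size=(m, m)))    # a random orthogonal matrix
M = V * mu                                      # V @ diag(mu): column k is mu_k (V)_k

# ||M||_{1->1} is the maximum column 1-norm
op_norm = np.abs(M).sum(axis=0).max()
closed_form = max(mu[k] * np.abs(V[:, k]).sum() for k in range(m))
print(np.isclose(op_norm, closed_form))         # case 2: True

# case 1 with V = I_m and p = 1: the norm reduces to max_k mu_k = ||mu||_inf
print(np.isclose(np.abs(np.eye(m) * mu).sum(axis=0).max(), mu.max()))  # True
```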

We see that the non-axis alignment has led to somewhat less intuitive expressions, but nevertheless the main quantity that governs the empirical Rademacher complexity remains some notion of size of the largest ellipsoid. To interpret this in the context of interest here, it is enough for the sensitivity vectors in \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S}\subset {\mathbb {R}}^m\) to reside mainly in low-dimensional linear subspaces of \({\mathbb {R}}^m\) for the Rademacher complexity of \({\mathcal {D}}_{A}{\mathcal {H}}\) to be small. For the estimation of sensitivities, this means that fewer unlabelled points are required for accurate sensitivity estimates (not to be confused with small sensitivity values).

3.1.2 Clustered sensitivity set

In this section we consider another natural structure, namely when the elements of \({\mathcal {D}}_{A}{\mathcal {H}}\vert _S\) form clusters. A cluster is a subset of \({\mathcal {H}}\) with a similar sensitivity profile on the sample S. We can model each cluster with a p-norm ellipsoid, each with its own center, as follows:

$$\begin{aligned} {\mathcal {E}}_p(c_i,\mu _i,V_i) {:}{=}\left\{ u\in {\mathbb {R}}^m : \sum _{k=1}^m \frac{\vert (V_i)_{k}^T(u- c_{i})\vert ^p}{\mu _{i,k}^p} \le 1 \right\} . \end{aligned}$$

The components of the vector \(\mu _i\) are the semi-axes, and the vector \(c_i\) is the center of the i-th cluster. This model is again non-restrictive, as there exist worst-case parameter values (\(c_i=0\), \(\mu _{i,k}=R_p m^{1/p}\) for all \(k\in [m]\), and \(V_i=I_m\), for all \(i \in [l]\)) that recover the ball \(B_p(0,R_p m^{1/p})\) used previously in the crude bound of Proposition 3.1.

The following proposition shows that in this model, \(\widehat{{\mathcal {R}}}_S({\mathcal {D}}_{A}{\mathcal {H}})\) is bounded by the Rademacher complexity of the largest cluster plus an additive term that grows logarithmically with the number of clusters and linearly with the largest displacement of a cluster from the origin.

Proposition 3.4

(Complexity of clustered sensitivity set) Let \(S \subset {\mathcal {X}}\) be an unlabelled sample of size m drawn i.i.d. from \(D_x\). Let \(l\in {\mathbb {N}}\), and suppose that there exist \(\mu _i\in (0,\infty )^m\), \(c_i\in {\mathbb {R}}^m\) and \(V_i \in {\mathbb {R}}^{m \times m}\) with \(\mu _{i,k}\le R_pm^{1/p}\) and \(V_i^TV_i=V_iV_i^T=I_m\), such that \({\mathcal {D}}_{A}{\mathcal {H}}\vert _{S} \subseteq \bigcup _{i=1}^l {\mathcal {E}}_p(c_i,\mu _i,V_i)\) for p-ellipsoids. Then,

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}}) \le \frac{1}{m} \max _i \Vert V_i\Lambda _i\Vert _{p\rightarrow 1} + \max _i \{\Vert c_i\Vert _2\} \frac{ \sqrt{2 \ln l}}{m}, \end{aligned}$$

where \(\Lambda _i\) is the diagonal matrix with elements \(\mu _{i,k} \in (0,\infty )\) for \(k\in [m]\).

This cluster model highlights a trade-off concerning the effect of large sensitivities: if a cluster only contains functions whose approximation leads to large sensitivity values, then the first term of the bound can still be small, but a penalty is incurred in the second term if not all functions fit in the same cluster.

Proof

Let \(c :{\mathcal {D}}_{A}{\mathcal {H}}\rightarrow \{ c_1 , \ldots , c_l \}\) be the function that sends \(u\in {\mathcal {D}}_{A}{\mathcal {H}}\) to the center of its best-fitting ellipsoid, \(c(u):={{\,\textrm{argmin}\,}}_{c_i : i\in [l]} \sum _{k\in [m]}\vert (V_i)_k^T(u-c_{i})\vert ^p/\mu _{i,k}^p\). Ties are broken arbitrarily.

Now, adding and subtracting \(c(u)\) and noting that, by construction, \(\{c(u) : u\in {\mathcal {D}}_{A}{\mathcal {H}}\} = \{ c_1 , \ldots , c_l \}\), we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_{S} ({\mathcal {D}}_{A}{\mathcal {H}})&\le \widehat{{\mathcal {R}}}_S \left( \bigcup _{i=1}^l {\mathcal {E}}_p (c_i,\mu _i,V_i)\right) \end{aligned}$$
(57)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{u\in \bigcup _{i=1}^l {\mathcal {E}}_p(c_i,\mu _i,V_i)} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(58)
$$\begin{aligned}&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sup _{u\in {\mathcal {E}}_p(c_i,\mu _i,V_i)} \sum _{k=1}^{m} \sigma _k u_k \end{aligned}$$
(59)
$$\begin{aligned}&\le \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sup _{u\in {\mathcal {E}}_p(c_i,\mu _i,V_i)} \sum _{k=1}^{m} \sigma _k (V_i)_k^T(u- c_{i})\nonumber \\&\quad + \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sup _{u\in {\mathcal {E}}_p (c_i,\mu _i,V_i)} \sum _{k=1}^{m} \sigma _k (V_i)_k^T c_{i} \end{aligned}$$
(60)
$$\begin{aligned}&=\frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sup _{V_{i}^T(u-c_{i}) \in {\mathcal {E}}_p(0,\mu _i)} \sum _{k=1}^{m} \sigma _k(V_i)_k^T(u- c_{i})\nonumber \\&\quad + \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sum _{k=1}^{m} \sigma _k (V_i)_k^T c_{i}. \end{aligned}$$
(61)

We proceed by bounding the above two terms separately.

We bound the first term by applying Proposition 3.3, or its extension, Eq. (56). To bound the second term, we use Massart’s lemma to get

$$\begin{aligned} \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \max _{i\in [l]} \sum _{k=1}^{m} \sigma _k (V_i)_k^Tc_i \le \max _{i\in [l]} \{\Vert c_i\Vert _2\} \frac{ \sqrt{2 \ln l}}{m}, \end{aligned}$$

since \(V_i\) is a rotation matrix, so \(\Vert V_i^Tc_i\Vert _2=\Vert c_i\Vert _2\). Combining the two bounds completes the proof. \(\square\)

This bound is similar in flavour to the bound on the complexity of a union given in Golowich et al. (2020), Lemma 7.4, in the sense that there is a logarithmic price to pay for the number of clusters. However, by contrast, here we have an explicit constant in the second term with a clear relation to the position of the ellipsoids, and our bound reduces to that of Proposition 3.3 if \(c_i=0\) for all \(i \in [l]\). Therefore, the above bound gives more information as to what helps decrease the Rademacher complexity. More specifically, the benign structures identified are: a small number of clusters, cluster centers close to the origin, and highly concentrated (low volume) clusters. We will summarise these positive findings and discuss their implications in Sect. 4.1.
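For concreteness, the sketch below evaluates the bound of Proposition 3.4 for \(p=1\), where \(\Vert V_i\Lambda _i\Vert _{1\rightarrow 1}\) has the closed form given earlier; all cluster parameters are hypothetical.

```python
import numpy as np

def clustered_bound(mus, Vs, cs, m):
    # first term: Rademacher complexity of the largest (rotated) ellipsoid
    shape = max(
        max(mu[k] * np.abs(V[:, k]).sum() for k in range(m))
        for mu, V in zip(mus, Vs)
    ) / m
    # second term: Massart-type penalty for the cluster centers
    l = len(cs)
    centres = max(np.linalg.norm(c) for c in cs) * np.sqrt(2 * np.log(l)) / m
    return shape + centres

rng = np.random.default_rng(1)
m, l = 50, 4
mus = [rng.uniform(0.05, 0.5, size=m) for _ in range(l)]
Vs = [np.linalg.qr(rng.normal(size=(m, m)))[0] for _ in range(l)]
cs = [0.1 * rng.normal(size=m) for _ in range(l)]
print(clustered_bound(mus, Vs, cs, m))
```

As the expression suggests, shrinking the cluster volumes, moving the centers towards the origin, or merging clusters all decrease the bound.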

3.2 Effect of the structural form of predictors

Our analysis so far was completely independent of the specification of \({\mathcal {H}}\) and \({\mathcal {H}}_A\), and applies to any PAC-learnable hypothesis class. From the crude bound in (29) we know that a low complexity \({\mathcal {H}}\) always implies a low complexity \({\mathcal {D}}_{A}{\mathcal {H}}\). In this section we give a worked example of how this effect plays out in the case of hypothesis classes that are linear in the parameters. Linear models represent a well-weathered object of study at the foundation of machine prediction (Vapnik, 1998), whose high-dimensional / low sample size version has been of much interest for the puzzle of over-parameterisation, see e.g. Bartlett et al. (2020). These models also allow for nonlinearity effortlessly through a feature map or a kernel.

Let \({\mathbb {H}}\) be a reproducing kernel Hilbert space with reproducing kernel \(k :{\mathcal {X}}\times {\mathcal {X}}\rightarrow {\mathbb {R}}\) associated with the feature map \(\Phi :{\mathcal {X}}\rightarrow {\mathbb {H}}\), so for any \(x_1,x_2\in {\mathcal {X}}\), we have \(k(x_1,x_2)=\langle \Phi (x_1),\Phi (x_2) \rangle _{{\mathbb {H}}}\). Then our hypothesis class is

$$\begin{aligned} {\mathcal {H}}{:}{=}\{ x \mapsto \langle w , \Phi (x) \rangle _{{\mathbb {H}}} : w \in {\mathbb {H}} \}. \end{aligned}$$

The familiar Euclidean space setting corresponds to \(\Phi\) being the identity map and \({\mathbb {H}}={\mathbb {R}}^d\).

We take the approximation operator \(A :{\mathcal {H}}\rightarrow {\mathcal {H}}_A\) to be defined by \(A f_w (x) = \langle Q(w) , \Phi (x) \rangle _{{\mathbb {H}}}\), where \(f_w (x) = \langle w , \Phi (x) \rangle _{{\mathbb {H}}}\) and \(Q :{\mathbb {H}} \rightarrow {\mathbb {H}}\) is some approximation of the weights w of the predictor \(f_w\).

Proposition 3.5

Let \(m \in {\mathbb {N}}\) and \(S = \{ x_1 , \ldots , x_m \} \subset {\mathcal {X}}\). Then we have the following bound

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}}) \le \frac{\sup _{w \in {\mathbb {H}}} \Vert w - Q(w)\Vert _{{\mathbb {H}}}}{m} \sqrt{\sum _{k=1}^m k(x_k,x_k)}. \end{aligned}$$
(62)

This is of course upper bounded by the sum of the familiar bounds for the linear classes \({\mathcal {H}}\) and \({\mathcal {H}}_A\) by the triangle inequality, as indeed already implied by the crude bound (29); however, the important observation from the special-case analysis of Proposition 3.5 is that (62) does not explicitly depend on the norm of the weight vectors, but only on how the approximation A (through Q) distorts the weights. In other words, we do not need the norms \(\Vert w\Vert _{{\mathbb {H}}}\) to be bounded for \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})\) to be bounded, as long as the weight sensitivity \(\Vert w - Q(w)\Vert _{{\mathbb {H}}}\) is bounded for the chosen operator Q.

Therefore the finding we conclude from Proposition 3.5 is that, in the generalised-linear model class considered, small weight-sensitivity is sufficient for dimension-independent learning when the approximating class \({\mathcal {H}}_A\) has dimension-free complexity. This is in contrast with existing dimension-free bounds that required a bounded norm constraint.
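As an illustrative instantiation of (62), take \({\mathbb {H}}={\mathbb {R}}^d\) with the identity feature map, and let Q round each weight to the nearest multiple of a grid spacing \(\delta\) (a hypothetical quantiser), so that \(\sup _w \Vert w-Q(w)\Vert _2 \le \delta \sqrt{d}/2\) regardless of \(\Vert w\Vert\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, delta = 20, 500, 0.1
X = rng.normal(size=(m, d))

weight_sens = delta * np.sqrt(d) / 2                # sup_w ||w - Q(w)||_2
gram_trace = np.sum(np.einsum('ij,ij->i', X, X))    # sum_k k(x_k, x_k) = sum_k ||x_k||^2
bound = weight_sens / m * np.sqrt(gram_trace)       # right-hand side of (62)
print(bound)
```

The bound shrinks with a finer grid (smaller \(\delta\)) and with more data, with no reference to \(\Vert w\Vert\).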

We have not found an analogous property for other hypothesis classes, and it remains an open question as to whether analyses of the sensitivity class tailored to specific classes would unearth additional insights.

Proof of Proposition 3.5

Since the \(\sigma _k\) are uniform on \(\{-1,1\}\), we can remove the absolute value; then, by the linearity of inner products and the Cauchy-Schwarz inequality, we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{f \in {\mathcal {H}}} \sum _{k=1}^m \sigma _k \vert f (x_k) - A f (x_k) \vert \\&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{w \in {\mathbb {H}}} \sum _{k=1}^m \sigma _k ( \langle w , \Phi (x_k) \rangle _{{\mathbb {H}}} - \langle Q(w) , \Phi (x_k) \rangle _{{\mathbb {H}}} ) \\&= \frac{1}{m} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \sup _{w \in {\mathbb {H}}} \langle w - Q(w) , \sum _{k=1}^m \sigma _k \Phi (x_k) \rangle _{{\mathbb {H}}} \\&\le \frac{1}{m} \sup _{w \in {\mathbb {H}}} \Vert w-Q(w)\Vert _{{\mathbb {H}}} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \left\| \sum _{k=1}^m \sigma _k \Phi (x_k) \right\| _{{\mathbb {H}}}. \end{aligned}$$

Finally, it is known (Mohri et al., 2018, Theorem 6.12) that

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\sigma } \left\| \sum _{k=1}^m \sigma _k \Phi (x_k) \right\| _{{\mathbb {H}}} \le \left[ \sum _{k = 1}^m k(x_k,x_k) \right] ^{\frac{1}{2}}. \end{aligned}$$

This completes the proof. \(\square\)

4 Discussion of implications, and potential extensions

In this section we elaborate on the significance of our theoretical results. The next Sect. 4.1 shows how to use our analysis of the sensitivity set to obtain a natural and very general structural condition that yields favourable convergence rates on the sensitivity estimation error. Hence, under this condition, the generalisation error bounds in our previous sections become dominated by the complexity of the reduced approximate class, irrespective of the form or size of the original class.

In Sect. 4.2 we discuss consequences related to real problems by revisiting the original motivation of understanding model compression in deep networks. In particular, we consider a concrete case of approximation by weight binarisation, as in BinaryConnect (Courbariaux et al., 2015), or parameter quantisation in deep network classifiers (Hubara et al., 2017), where applying our results yields a depth-independent bound. We also discuss a potential way to relate our approach to a previously successful but theoretically unjustified on-device deep net approach, Neural Projections (Ravi, 2019), an interpretation which brings some insight into its working.

Finally, in Sect. 4.3 we describe how our framework can be extended to stochastic approximation schemes.

4.1 Favourable rates for sensitivity estimation in approximable hypothesis classes

We already commented that whenever the target function admits a small sensitivity threshold t, this can usefully restrict the hypothesis class under favourable data distributions. Here we show that a uniformly small t, with approximation sensitivity specified in the \(p=2\) norm, can even speed up the convergence rate of sensitivity estimation, based on the findings of Sect. 3.

First, we extract from our analysis in Sect. 3 the fortuitous conditions that enable fast convergence of the Rademacher complexity of the class of sensitivities. More precisely, if the sample sensitivity set \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\) resides in a countable union of near-sparse sets and a finite number \(l\in {\mathbb {N}}\) of dense clusters, then the Rademacher complexity of the sensitivity set decays at the fast rate of 1/m, up to a logarithmic factor.

Condition 4.1

(Structured sensitivity condition) Suppose that \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\subset \left( \bigcup _{i=1}^{\infty } {\mathcal {E}}_p(\mu _i)\right) \bigcup \left( \bigcup _{i=1}^l {\mathcal {E}}_p({\tilde{c}}_i,{\tilde{\mu }}_i,V_i)\right)\), where \({\mathcal {E}}_p({\tilde{c}}_i,{\tilde{\mu }}_i,V_i)\) are ellipsoids centered at \({\tilde{c}}_i\) having side-lengths concatenated in the vector \({\tilde{\mu }}_i\) and orientation \(V_i\); and \({\mathcal {E}}_p(\mu _i), i\ge 1\), are ellipsoids centered at the origin, having side-lengths concatenated in \({\mu }_i\). Let \(\Lambda _i\) be a diagonal matrix with elements \({\tilde{\mu }}_{i,k} \in (0,\infty )\) for \(k\in [m]\). If there are constants \(\kappa ,\kappa _1,\kappa _2\ge 0\), independent of m, such that \(\max _{i\ge 1}{\Vert \mu _i\Vert _2}\le \kappa\), \(\max _i \Vert V_i\Lambda _i\Vert _{2\rightarrow 1} \le \kappa _1\) and \(\max _i \{\Vert {\tilde{c}}_i\Vert _2\} \le \kappa _2\), we say that the sensitivity set \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\) is structured, with parameters \(\kappa ,\kappa _1\) and \(\kappa _2\).

Lemma 4.2

If \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\) satisfies Condition 4.1 with parameters \(\kappa ,\kappa _1\), and \(\kappa _2\), then we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S({\mathcal {D}}_{A}{\mathcal {H}})&\le \frac{\kappa +\kappa _1+\kappa _2\sqrt{2\ln l}}{m} ={{\mathcal {O}}}\left( \sqrt{\log (l)}/m \right) . \end{aligned}$$
(63)

Proof

In the near-sparse subset of \({\mathcal {D}}_{A}{\mathcal {H}}_{\vert S}\) that resides in \(\bigcup _{i=1}^{\infty } {\mathcal {E}}_p(\mu _i)\), we have by Proposition 3.3 that the corresponding contribution to \(\widehat{{\mathcal {R}}}_S({\mathcal {D}}_{A}{\mathcal {H}})\) is at most \(\kappa /m\).

In the remaining subset that resides in the elliptic clusters \(\bigcup _{i=1}^l {\mathcal {E}}_p({\tilde{c}}_i,{\tilde{\mu }}_i,V_i)\), we have by Proposition 3.4 that the corresponding contribution is at most \((\kappa _1+\kappa _2\sqrt{2\ln l})/m\). The union has complexity no larger than the sum of the complexities of its constituent subsets. \(\square\)

This implies the following for the estimation of sensitivities.

Theorem 4

(Sensitivity estimation bound for uniformly approximable classes) Let \(S \in {\mathcal {X}}^m\) be a sample of size m drawn i.i.d. from the marginal distribution \(D_x\). Suppose there exists \(t\ge 0\) such that \({\mathcal {D}}_{A}^{2} (f) \le t\) for all \(f \in {\mathcal {H}}\). Then, with probability at least \(1-\delta\), we have

$$\begin{aligned} \sup _{f \in {\mathcal {H}}} \vert {\mathcal {D}}_{A}^{1} (f) - \widehat{{\mathcal {D}}}_{A}^{1} (f) \vert \le 6 \widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}}) + t \sqrt{\frac{2 \ln (\frac{1}{\delta })}{m}} + \frac{6 C \ln (\frac{1}{\delta })}{m}. \end{aligned}$$
(64)

Furthermore, if Condition 4.1 holds, then \(\widehat{{\mathcal {R}}}_S ({\mathcal {D}}_{A}{\mathcal {H}})={\tilde{{\mathcal {O}}}}(1/m)\).

Proof

First note that from Assumption 2 we have \(\Vert f - Af\Vert _{\infty } < C\) and we have the following bound on the variance of the function \(f - A f\),

$$\begin{aligned} {{\,\textrm{Var}\,}}_X [f(X) - A f(X)]&= {{\,\mathrm{{\mathbb {E}}}\,}}_X [(f(X) - Af (X))^2] - {{\,\mathrm{{\mathbb {E}}}\,}}_X [f(X) - A f(X)]^2\\&\le {{\,\mathrm{{\mathbb {E}}}\,}}_X [(f(X) - A f (X))^2]\\&= ({\mathcal {D}}_{A}^{2} (f))^2 \le t^2, \end{aligned}$$

where the last inequality uses the assumption that \({\mathcal {D}}_{A}^{2} (f) \le t\) for all \(f\in {\mathcal {H}}\). The result then follows from Bartlett et al. (2005), Theorem 2.1 by setting \(\alpha =\frac{1}{2}\). The second statement is proved in Lemma 4.2. \(\square\)

Theorem 4 bounds the deviation between the true sensitivity and its sample estimate in terms of the global sensitivity threshold t of functions in \({\mathcal {H}}\). Whenever t is sufficiently small, the t-dependent term becomes dominated by the last term, so the overall bound decays with m at the faster rate of 1/m.

The observation that the sensitivity threshold t acts as a variance to control the rate could be further refined using localisation, replacing the global sensitivity threshold with the sensitivities of individual functions and relaxing the requirement that the entire class \({\mathcal {H}}\) is well approximable. This comes at the expense of the more involved machinery of local Rademacher complexities (Bartlett et al., 2005), which we do not pursue here, and which would likely need a specialised treatment to bound the local complexity for particular choices of \({\mathcal {H}}\), similarly to the approach taken in Suzuki et al. (2020a).

The key difference from the approach of Suzuki et al. (2020a) is the following. Their bounds depend on the local Rademacher complexity of the Minkowski difference between the loss classes of the full and the approximate predictors, which they are able to bound for some specific hypothesis classes; whereas our bounds depend on the Rademacher complexity of the set of sensitivities of predictors from the hypothesis class. The Minkowski difference loses the coupling between the full and approximate predictor pairs, which in our approach is the key to taking advantage of structure in the set of sensitivities. The structures that we identified and exploited are not specific to the form of functions in the chosen hypothesis class, and instead uncover new general insights, as well as tightening our bounds effortlessly, with elementary tools.

Indeed, we highlighted that even in the simple global analysis of Theorem 4, from the findings of Sect. 3 we were able to readily extract some general favourable rate conditions for sensitivity estimation. Note that Lemma 4.2 is general, and holds for any PAC-learnable class. It says that whenever the target function admits a small t and the interplay of data and model satisfies Condition 4.1, the error from sensitivity estimation becomes negligible very quickly (even without any additional unlabelled data), hence the dominant term of our generalisation bounds (Theorems 1, 2, 3) now becomes the complexity of the approximate class, regardless of how big the original class \({\mathcal {H}}\) was.

4.2 Implications related to real problems

In this section we discuss the significance of our theoretical results by revisiting some of our motivating examples related to real problems.

4.2.1 From BinaryConnect to a depth-independent bound

We consider a specific example. Take \({\mathcal {H}}\) to be the class of L-layer feed-forward neural network classifiers with ReLU activations in the hidden layers and binary output. Let \(\vert W\vert\) be the total number of parameters (including all weights and bias terms). It was shown in Bartlett et al. (2019) that the VC dimension of this class is \({{\mathcal {O}}}(\vert W\vert L \log (\vert W\vert ))\), and this is near-tight with a lower bound of \(\Omega (\vert W\vert L \log (\vert W\vert /L))\). A well-known relation between Rademacher complexity and VC dimension (Bartlett & Mendelson, 2002), Theorem 6 implies that for this class we have

$$\begin{aligned} \widehat{{\mathcal {R}}}_S({\mathcal {H}}) = {{\mathcal {O}}}\left( \sqrt{\frac{\vert W\vert L \log (\vert W\vert )}{m}}\right) . \end{aligned}$$
(65)

Let us consider the approximation operator A of parameter binarisation – that is, we retain only the signs of all parameters while keeping the same architecture, as it has been done in practice in BinaryConnect (Courbariaux et al., 2015). Hence \({\mathcal {H}}_A\) is a finite class of cardinality \(\vert {\mathcal {H}}_A\vert = 2^{\vert W\vert }\). By Massart’s finite lemma (Mohri et al., 2018), Theorem 3.7 the Rademacher complexity of this class of approximate classifiers is

$$\begin{aligned} \widehat{{\mathcal {R}}}_S({\mathcal {H}}_A) \le \sqrt{\max _{f\in {\mathcal {H}}_A}\frac{1}{m}\sum _{i\in [m]}f^2(x_i)}\sqrt{\frac{\vert W\vert \log (2)}{m}} = {{\mathcal {O}}}\left( \sqrt{\frac{\vert W\vert }{m}}\right) . \end{aligned}$$
(66)

Observe that this is independent of the network depth L. The same reasoning holds if quantisation is pursued into q bins, since then \(\vert {\mathcal {H}}_A\vert =q^{\vert W\vert }\). By contrast, the complexity of the original class \({\mathcal {H}}\) grows with L. Applying our Theorem 2 combined with Lemma 4.2 under Condition 4.1, we obtain the following depth-independent error bound for both the full-precision and the binarised network (i.e. a guarantee on \(\max \{{{\,\textrm{err}\,}}(A{\hat{f}}),{{\,\textrm{err}\,}}({\hat{f}})\}\)). Condition 4.1 in this setting is implied whenever the number of points on which the binarised network disagrees with the full-precision network is of constant order with respect to the training set size.
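Before stating the corollary, a quick numerical sense-check (with made-up sizes) of the gap between the complexity terms (65) and (66):

```python
import numpy as np

W, L, m = 10**6, 20, 10**8
full_class = np.sqrt(W * L * np.log(W) / m)   # O-term of (65), grows with depth
binarised = np.sqrt(W / m)                    # O-term of (66), depth-independent
print(full_class, binarised)                  # ~1.66 versus 0.1
```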

Corollary 4.3

(Learning with binarised deep nets) Let \({\mathcal {H}}\) be the class of arbitrary-depth neural network classifiers having \(\vert W\vert\) parameters, ReLU activations in the hidden layers, 0–1 outputs, and 0–1 loss. Let the approximation operator A be parameter binarisation. Suppose the structured sensitivity Condition 4.1 holds. For \(t\ge 0\), let \(f_t^*:={{\,\textrm{argmin}\,}}_{f\in {\mathcal {H}}_t}\{ {{\,\textrm{err}\,}}(f) \}\), and let \(t^*:={{\,\textrm{argmin}\,}}_{t\ge 0}\{{{\,\textrm{err}\,}}(f_t^*)+2t\}\). Then, with probability at least \(1-\delta\), the network \({\hat{f}}\) trained by minimising (24) on a labelled sample of size m satisfies

$$\begin{aligned} \max \{{{\,\textrm{err}\,}}(A {\hat{f}}), {{\,\textrm{err}\,}}({\hat{f}})\}&\le {{\,\textrm{err}\,}}(f^*_{t^*}) + 2 t^* + 2 \sqrt{\frac{\vert W\vert }{m}} + 5 \sqrt{\frac{\ln (\frac{16}{\delta })}{2m}} \nonumber \\&\quad +\frac{c}{m} + t^* \sqrt{\frac{2 \ln (\frac{2}{\delta })}{m}} + \frac{6 C \ln (\frac{2}{\delta })}{m}, \end{aligned}$$

where \(c>0\) is a constant depending only on the parameters of Condition 4.1.

Proof

In Theorem 2 we use \(m_u=m\), substitute (66) for \(\widehat{{\mathcal {R}}}_S({\mathcal {H}}_A)\), use Lemma 4.2 for \(\epsilon _u\), and set \(\rho =1\). \(\square\)

We include in the Appendix a numerical illustration of algorithm (24) with binarised deep nets, as in Corollary 4.3.

Here we would like to discuss a potential interpretation of the result of Corollary 4.3. BinaryConnect and quantised deep nets are known from the previous literature to be empirically successful (Courbariaux et al., 2015; Hubara et al., 2017), e.g. in image classification problems. Our theory suggests that there must be something fortuitous about many natural data sources that, in our context, makes the complex function class of deep nets behave as a low complexity class. We can only speculate on this, and to this end we identified Condition 4.1. It is intriguing that the same condition also turned out to explain the depth-independence of the error in this example. There have been many attempts in the literature to obtain depth-independent error bounds for deep nets by making various assumptions, for instance by imposing norm constraints on the weights (Golowich et al., 2020). Our interpretation provides a complementary view on this, simply as a byproduct of our general pursuit of understanding approximate predictors.

4.2.2 Towards understanding neural projections

Having developed an analytic approach to the twofold problem of learning a good full-precision predictor and a good approximate predictor, it is now interesting to relate the training objective function we obtained in (25) to the training objective of Neural Projections, proposed in Ravi (2019). The latter has been a practical approach to on-device deep networks. While it has no theoretical backing, ample empirical evidence has demonstrated its impressive success in real-world image classification problems (Ravi, 2019). It minimises a weighted sum of three terms – the empirical errors of the full and approximate models plus their disagreement – with the ultimate goal of deploying the approximate model on-device.

Take any \(\eta \in (0,1]\). Using \({{\,\mathrm{{\widehat{err}}}\,}}(Af)\le {{\,\mathrm{{\widehat{err}}}\,}}(f)+\widehat{{\mathcal {D}}}_{A}(f)\), which holds for the 0–1 loss, our training objective can be bounded as follows.

$$\begin{aligned} {{\,\mathrm{{\widehat{err}}}\,}}(Af)+\lambda \widehat{{\mathcal {D}}}_{A}(f)&=\eta \,{{\,\mathrm{{\widehat{err}}}\,}}(Af)+(1-\eta ){{\,\mathrm{{\widehat{err}}}\,}}(Af) +\lambda \widehat{{\mathcal {D}}}_{A}(f)\\&\le \eta \,{{\,\mathrm{{\widehat{err}}}\,}}(f)+\eta \,\widehat{{\mathcal {D}}}_{A}(f)+(1-\eta ){{\,\mathrm{{\widehat{err}}}\,}}(Af) +\lambda \widehat{{\mathcal {D}}}_{A}(f)\\&=\eta \,{{\,\mathrm{{\widehat{err}}}\,}}(f)+(1-\eta ){{\,\mathrm{{\widehat{err}}}\,}}(Af) +(\eta +\lambda ) \widehat{{\mathcal {D}}}_{A}(f)\\&\quad \propto {{\,\mathrm{{\widehat{err}}}\,}}(f)+\lambda _1{{\,\mathrm{{\widehat{err}}}\,}}(Af) +\lambda _2\widehat{{\mathcal {D}}}_{A}(f), \end{aligned}$$

where \(\lambda _1= 1/\eta -1\ge 0\) and \(\lambda _2=1+\lambda /\eta \ge 0\).

Now, if we relax Af to range over \({\mathcal {H}}_A\), i.e. replace it with some \(g\in {\mathcal {H}}_A\), then we arrive precisely at the training objective of Neural Projections. Indeed, in Ravi (2019) this modified objective is minimised in the parameters of f and g, with both \(\lambda _1\) and \(\lambda _2\) tuned independently. Thus, we may interpret the training objective function of Neural Projections as an approximate version of our objective function (25). While the former has no theoretical justification, our objective function has a similar flavour and follows from a rigorous theory. Hence, although we do not claim this as a complete explanation of why Neural Projections (Ravi, 2019) are so effective in practice, we believe this interpretation nevertheless offers some insight into their workings.
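As an illustration of this three-term objective, here is a minimal PyTorch-style sketch (our own, with hypothetical architectures; Ravi (2019) obtains g via projection-based networks rather than the plain small network used here). It minimises \({{\,\mathrm{{\widehat{err}}}\,}}(f)+\lambda _1{{\,\mathrm{{\widehat{err}}}\,}}(g)+\lambda _2\widehat{{\mathcal {D}}}(f,g)\) jointly in the parameters of f and g, with differentiable surrogates for the 0–1 error and the disagreement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical models: f is the full-precision network, g a small one
# standing in for some element of the low-complexity class H_A.
f = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 2))
g = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2))

opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)
lam1, lam2 = 0.5, 1.0  # the two independently tuned trade-off weights

def joint_loss(x, y):
    """err(f) + lam1 * err(g) + lam2 * disagreement(f, g), with
    cross-entropy as a surrogate for the 0-1 loss and a KL term as a
    surrogate for the disagreement D."""
    logits_f, logits_g = f(x), g(x)
    err_f = F.cross_entropy(logits_f, y)
    err_g = F.cross_entropy(logits_g, y)
    dis = F.kl_div(F.log_softmax(logits_g, dim=1),
                   F.softmax(logits_f, dim=1), reduction="batchmean")
    return err_f + lam1 * err_g + lam2 * dis

# One training step on a synthetic batch.
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
opt.zero_grad()
joint_loss(x, y).backward()
opt.step()
```

Tuning lam1 and lam2 independently in this sketch corresponds to the relaxation discussed above.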

4.3 Potential extensions to stochastic approximate predictors

The approximation schemes assumed so far were deterministic. Many approximation schemes are in fact stochastic in nature; in this section we therefore discuss how our framework adapts straightforwardly to stochastic approximation schemes.

Let \((\Omega , {\mathcal {F}},{\mathbb {P}})\) be a probability space. We define a stochastic approximation scheme as a map \(A :\Omega \times {\mathcal {H}}\rightarrow {\mathcal {H}}_{{\mathcal {A}}}\), where \({\mathcal {H}}_{{\mathcal {A}}} {:}{=}\{ A_{\omega } f : \omega \in \Omega \text { and } f \in {\mathcal {H}}\}\). For a fixed \(\omega \in \Omega\) we then have an approximation operator \(A_{\omega } :{\mathcal {H}}\rightarrow {\mathcal {H}}_{\omega }\), where \({\mathcal {H}}_{\omega } {:}{=}\{ A_{\omega } f : f \in {\mathcal {H}}\}\); that is, each fixed \(\omega\) yields one approximation operator. Thus, when \(\vert \Omega \vert = 1\) we recover the deterministic setting. Also, for a fixed \(f \in {\mathcal {H}}\), the set \(\{ A_{\omega } f : \omega \in \Omega \}\) is the collection of possible approximations to f.
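As a concrete instance, stochastic binarisation schemes of the kind common in the quantisation literature fit this definition: each draw of \(\omega\) (a random seed below) fixes one deterministic operator \(A_{\omega }\). The following sketch is our own illustration with hypothetical names.

```python
import numpy as np

def stochastic_binarise(weights, rng):
    """One draw A_omega of a stochastic binarisation scheme: each weight
    w (clipped to [-1, 1]) maps to +1 with probability (1 + w) / 2 and to
    -1 otherwise, so that E_omega[A_omega w] = w (unbiasedness)."""
    out = []
    for w in weights:
        p_plus = (1.0 + np.clip(w, -1.0, 1.0)) / 2.0
        out.append(np.where(rng.random(w.shape) < p_plus, 1.0, -1.0))
    return out

W = [np.array([[0.3, -0.8], [0.0, 0.5]])]
# Two fixed omegas (seeds) give two deterministic approximations of f.
Af_1 = stochastic_binarise(W, np.random.default_rng(1))
Af_2 = stochastic_binarise(W, np.random.default_rng(2))
```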

Now we define \({\mathcal {D}}_{\omega } (f) {:}{=}{\mathcal {D}}_{A_{\omega }} (f)\); then, for a fixed arbitrary \(\omega \in \Omega\), we have with probability at least \(1-\delta\) that

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) + \rho {\mathcal {D}}_{\omega } (f) + 2 \rho \widehat{{\mathcal {R}}}_S ({\mathcal {H}}_{\omega }) + 3 \sqrt{\frac{\ln \left({\frac{2}{\delta }}\right)}{2m}}, \end{aligned}$$
(67)

for all \(f \in {\mathcal {H}}\). This uniform bound follows directly from Lemma 2.3 combined with a standard Rademacher bound; for fixed \(\omega\), the first two terms on its right-hand side correspond to the objective function of Algorithm (18) in Sect. 2.3.

We can make this independent of a particular random instance, e.g. by considering expectations. Although we cannot take expectations on both sides of (67), as this would incur a union bound over infinitely many sets, we can simply write

$$\begin{aligned} {{\,\textrm{err}\,}}(f)&= {{\,\textrm{err}\,}}(f) - {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\textrm{err}\,}}(A_{\omega } f) + {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\textrm{err}\,}}(A_{\omega } f) \\&\le \rho {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {\mathcal {D}}_{\omega } (f) + {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\textrm{err}\,}}(A_{\omega } f) - {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) + {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) \\&\le \rho {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {\mathcal {D}}_{\omega } (f) + \sup _{f \in {\mathcal {H}}} \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\textrm{err}\,}}(A_{\omega } f) - {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) \right] + {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f). \end{aligned}$$

Now applying Jensen’s inequality, we have

$$\begin{aligned} \sup _{f \in {\mathcal {H}}} \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\textrm{err}\,}}(A_{\omega } f) - {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) \right] \le {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } \sup _{f \in {\mathcal {H}}} \left[ {{\,\textrm{err}\,}}(A_{\omega } f) - {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) \right] , \end{aligned}$$

and the argument of the expectation can be bounded in terms of the Rademacher complexity \({\mathcal {R}}_m ({\mathcal {H}}_{\omega })\). Thus, we obtain the following uniform bound, expressed in terms of the expected sensitivity, the expected Rademacher complexity of the small approximating class, and a new empirical error term which, due to the expectation, may be interpreted as a data-augmentation loss. That is, with probability at least \(1-\delta\), we have

$$\begin{aligned} {{\,\textrm{err}\,}}(f) \le {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\widehat{err}}}\,}}(A_{\omega } f) + \rho {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {\mathcal {D}}_{\omega } (f) + 2 \rho {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {\mathcal {R}}_m ({\mathcal {H}}_{\omega }) + \sqrt{\frac{\ln \left({\frac{1}{\delta }}\right)}{2m}}. \end{aligned}$$
(68)

Minimising the first two terms on its right-hand side could be used to justify a regularised data-augmentation algorithm, in analogy with our previous algorithm in (18).
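In the spirit of (68), the expectations over \(\omega\) can be approximated by Monte Carlo sampling. The sketch below is our own illustration, reusing the hypothetical `predict` and `stochastic_binarise` helpers from the earlier sketches; it estimates the first two terms of the bound, with the disagreement term computed from unlabelled data.

```python
import numpy as np

def mc_objective(weights, x_lab, y_lab, x_unlab, rng, rho=1.0, k=10):
    """Monte Carlo estimate, over k draws of omega, of
    E_omega[err-hat(A_omega f)] + rho * E_omega[D-hat_omega(f)]."""
    f_unlab = predict(weights, x_unlab)  # full-precision predictions
    err_sum, dis_sum = 0.0, 0.0
    for _ in range(k):
        Wk = stochastic_binarise(weights, rng)  # one draw of A_omega
        err_sum += np.mean(predict(Wk, x_lab) != y_lab)      # err-hat(A_omega f)
        dis_sum += np.mean(predict(Wk, x_unlab) != f_unlab)  # D-hat_omega(f)
    return (err_sum + rho * dis_sum) / k
```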

Likewise, one can introduce estimates of the expected distortion \({\mathcal {D}}_{\omega } (f)\) from unlabelled data. Alternatively, suppose the approximation operator A satisfies a variance condition, namely that \(\left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } \Vert A_{\omega } f - f \Vert _{L^2}^2 \right] ^{\frac{1}{2}} \le \alpha {\mathcal {C}} (f)\) for all \(f \in {\mathcal {H}}\), where \({\mathcal {C}} (f)\) is some property of \(f \in {\mathcal {H}}\). Then, by Jensen's inequality and the variance condition,

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } [{\mathcal {D}}_{\omega } (f)] \le {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } [{\mathcal {D}}_{\omega } (f)^2]^{\frac{1}{2}} = \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } {{\,\mathrm{{\mathbb {E}}}\,}}_{x \sim D_x} \vert A_{\omega } f(x) - f(x)\vert ^2 \right] ^{\frac{1}{2}} = \left[ {{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } \Vert A_{\omega } f - f\Vert _{L^2}^2 \right] ^{\frac{1}{2}} \le \alpha {\mathcal {C}} (f). \end{aligned}$$

So this variance condition on A provides another instance where the need for additional unlabelled data is eliminated in the case of stochastic approximation operators. A similar condition, formulated on the level of the parameters, is frequently encountered in the literature on quantisation for learning and optimisation, such as in stochastic rounding (Alistarh et al., 2017; Wen et al., 2017).
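For the stochastic binarisation sketched earlier, a parameter-level version of this variance condition can be checked directly: per weight, \({{\,\mathrm{{\mathbb {E}}}\,}}_{\omega } (A_{\omega } w - w)^2 = 1 - w^2 \le 1\). A quick numerical sanity check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.3  # a single weight in [-1, 1]
p_plus = (1.0 + w) / 2.0
draws = np.where(rng.random(100_000) < p_plus, 1.0, -1.0)

print(draws.mean())               # ~ 0.3  : E[A_omega w] = w (unbiased)
print(np.mean((draws - w) ** 2))  # ~ 0.91 : E[(A_omega w - w)^2] = 1 - w^2
```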

5 Conclusions

We end our study with a high-level summary. Inspired by the recent surge of interest in model-compression and approximate learning algorithms in the context of small-device settings, we studied the role of approximability in generalisation, both in the full-precision and in the approximated settings. Our main findings can be summarised as follows: (1) For any given PAC-learnable problem, and any approximation scheme, target concepts that have low sensitivity to the approximation can be learned from a smaller labelled sample, provided sufficient unlabelled data. This is achieved by using approximation to modify the loss function and isolating a sensitivity term in the generalisation error. The modified loss function has a lower complexity in comparison with the original, pushing the complexity of the learning problem onto the class of sensitivity functions – which in turn only requires unlabelled data for estimation whenever the original loss is Lipschitz. (2) Our analysis yielded algorithms showing that it is possible to learn a good predictor whose approximation has the same generalisation guarantee as the full-precision predictor. Owing to the generality of our approach, such provably accurate approximate predictors can be used with a variety of model-compression and approximation schemes, and potentially deployed in memory-constrained settings. (3) Our algorithms use unlabelled data to estimate the sensitivity of predictors to the given approximation operator, and this need not be independent from the labelled training set. Moreover, while the required unlabelled sample complexity can be large in general, we highlighted several examples of natural structure in the class of sensitivities that significantly reduce, and possibly even eliminate, the need for additional unlabelled data. At the same time, structural properties of the sensitivity class shed new light on the question of what makes certain instances of learning problems easier than others.

Several open questions remain. As our upper bounds highlighted structural traits that explain good performance in model-compression settings, it will be interesting to develop lower bounds under the same structural traits, in order to assess the tightness of our bounds. From the practical perspective, it would also be worthwhile to develop efficient implementations and to study their computational complexity. Another interesting line of future work is to explore adversarial settings (Chowdhury et al., 2022; Montasser et al., 2019), in which the approximation operator A is in the hands of an adversary and the learner needs to find a predictor that is robust to it. Furthermore, it would be interesting to study model-compression and approximate algorithms in other learning-theory frameworks such as PAC-Bayes, and perhaps even in non-uniform frameworks.