Heterogeneous Sets in Dimensionality Reduction and Ensemble Learning

We present a general framework for dealing with set heterogeneity in data and learning problems, which is able to exploit low complexity components. The main ingredients are (i) a definition of complexity for elements of a convex union that takes into account the complexities of their individual composition, which we use to cover the heterogeneous convex union, and (ii) upper bounds on the complexities of restricted subsets. We demonstrate this approach in two different application areas, highlighting their conceptual connection. (1) In random projection based dimensionality reduction, we obtain improved bounds on the uniform preservation of Euclidean norms and distances when low complexity components are present in the union. (2) In statistical learning, our generalisation bounds justify heterogeneous ensemble learning methods that were incompletely understood before. We exemplify empirical results with boosting type random subspace and random projection ensembles that implement our bounds.


Introduction
We are interested in data and learning problems of a heterogeneous nature, which we will describe shortly. Let m be a positive integer, and consider a sequence S = (S_j)_{j∈[m]} consisting of bounded subsets of a vector space. The convex union S is defined to be the convex hull of the union of these sets,

S := conv(∪_{j∈[m]} S_j) = { ∑_{t∈[τ]} α_t s_t : τ ∈ N, (α_t)_{t∈[τ]} ∈ ∆_τ, s_t ∈ ∪_{j∈[m]} S_j },    (1)

where ∆_τ := {(α_t)_{t∈[τ]} ∈ [0, 1]^τ : ∑_{t∈[τ]} α_t = 1} is the simplex for τ ∈ N.
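As a concrete illustration of definition (1), the following short sketch builds one element of the convex union of a few finite point clouds; the component sets, their sizes and the mixing length τ are arbitrary choices for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m = 10, 3

# Three bounded component sets S_1, S_2, S_3, here finite point clouds in R^d.
components = [rng.standard_normal((20, d)) for _ in range(m)]

# An element of the convex union: pick points s_t from (possibly different)
# components and convex weights (alpha_t) lying on the simplex Delta_tau.
tau = 4
alphas = rng.dirichlet(np.ones(tau))          # a point of the simplex
picks = [components[rng.integers(m)][rng.integers(20)] for _ in range(tau)]
s = sum(a * p for a, p in zip(alphas, picks))

print(alphas.sum(), s.shape)
```

Any such weighted combination lies in S; the element-wise complexity introduced later depends on which components carry the weight.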
In dimensionality reduction, random projection (RP) is a universal and computationally convenient method that enjoys near-isometry. The distortion of Euclidean norms and distances depends on the complexity of the set being projected (see [1] and references therein). Now suppose some high dimensional data resides in a set of the form (1). What can be said about simultaneous preservation of norms and distances? The complexity of the union grows with its highest complexity component. We would like to take advantage of heterogeneity to better exploit the presence of low complexity components.
In statistical learning (SL), suppose we want to learn a weighted ensemble where base learners belong to different complexity classes. The ensemble predictor then belongs to a function class of the form (1); for instance, learning a weighted ensemble of random subspace classifiers, as raised in the future work section of [2]. What simultaneous (i.e. worst-case) generalisation guarantees can be given?
To tackle these problems, it is helpful to observe their common structure. Both problems can be described by a certain stochastic process: an infinite collection of random variables {X_s}_{s∈S}, indexed by the elements of a bounded set S. In the RP task, the index-set S ⊂ R^d is a set of points in a high dimensional space, and the source of randomness is the RP map R, a random matrix taking values in R^{k×d} with independent rows drawn from a known distribution. We are interested in norm-preservation, i.e. the discrepancy between the norm of a point before and after RP, so the collection of random variables of interest is {√k ‖s‖₂ − ‖Rs‖₂}_{s∈S}, and we would like to guarantee that all of these discrepancies are small simultaneously, with high probability.
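The norm-preservation process above can be simulated directly. The sketch below projects points lying on a low dimensional subspace (a low complexity index set, chosen here purely for illustration) with a Gaussian random matrix and computes the discrepancies √k‖s‖₂ − ‖Rs‖₂; the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_points = 1000, 100, 200

# Points drawn from a 5-dimensional subspace of R^d (a "low complexity" index set S).
basis = rng.standard_normal((d, 5))
S = rng.standard_normal((n_points, 5)) @ basis.T
S /= np.linalg.norm(S, axis=1, keepdims=True)   # unit norms for readability

# RP map: a k x d matrix with i.i.d. N(0, 1) entries (isotropic rows).
R = rng.standard_normal((k, d))

# The process of interest: sqrt(k)*||s||_2 - ||Rs||_2 for every s in S.
discrepancies = np.sqrt(k) * np.linalg.norm(S, axis=1) - np.linalg.norm(S @ R.T, axis=1)
rel = np.abs(discrepancies) / np.sqrt(k)        # relative distortion of each norm
print(rel.max())
```

With k = 100 and a 5-dimensional support, the worst relative distortion over all 200 points is small, consistent with complexity-dependent guarantees of the kind discussed below.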
In the SL task, the index-set is a set of functions H (the hypothesis class), and the source of randomness is a training sample {(X_1, Y_1), …, (X_n, Y_n)}, drawn i.i.d. from an unknown distribution. We are interested in generalisation, i.e. the discrepancy between true error and sample error, so the infinite collection of random variables of interest is { E[L(h(X), Y)] − (1/n) ∑_{i∈[n]} L(h(X_i), Y_i) }_{h∈H}, where L is a loss function. Again, we want all of these discrepancies to be small simultaneously, with high probability.
This analogy suggests dealing with the problem of index-set heterogeneity in both tasks in a unified way. The index-sets will be of the form (1), and we appeal to empirical process theory to link the processes of interest with canonical processes whose suprema can be bounded.

Related work
The use of empirical process theory to provide low distortion guarantees for random projections of bounded sets was pioneered by the work of Klartag and Mendelson [3], and further refined by others; see [1] and references therein for a relatively recent treatment. These results extend the celebrated Johnson-Lindenstrauss (JL) lemma from finite sets to infinite sets. They allow simultaneous high probability statements to be made about Euclidean norm preservation of all points of the set being projected, and these guarantees depend on a notion of metric complexity of the set. For this reason, these bounds are more capable of explaining empirically observed low distortion in application areas where the underlying data support has a low intrinsic dimension, or a simple intrinsic structure, such as images or text data [4]. However, these existing bounds do not cater to the heterogeneity of the data support, so the presence of any small high-complexity component still renders them loose. This observation will be made precise in the sequel.
Empirical process theory is also a cornerstone in statistical learning, where it is widely used to provide uniform generalisation guarantees for learning problems [5] via Rademacher and Gaussian complexities. Uniform generalisation bounds ascertain, under certain conditions, that with high probability the training data does not mislead the learning algorithm. However, for complex models like heterogeneous ensembles of interest to practitioners, theory is scarce [2,6,7] and a general unifying treatment is missing.
In both of the above domains, classic theory considers a single homogeneous index-set in the underlying empirical process, ignoring any heterogeneity of its subsets. The complexity of a union of sets of differing complexities grows linearly with the complexity of the most complex one; consequently, by this approach one obtains uniform bounds that grow linearly with the complexity of the most complex component set. However, in many natural situations one would expect predominantly lower complexity components; for instance, data may lie mostly (though not exclusively) on low dimensional structures, or the required hypothesis class may have mostly (though not exclusively) low complexity. This class of problems motivates our approach.

Contributions
In this paper, we develop a general unifying framework that allows us to formulate simultaneous high-probability bounds over all elements of a convex union of sets of differing complexity, taking advantage of any low-complexity components. The main contributions are summarised below.
• We introduce a notion of complexity for elements of the convex union, defined as a weighted average of complexities of constituent sets. This serves to cover the convex union with sets of increasing complexity and treat each individually.
• We bound the supremum of a weighted combination of canonical subgaussian processes, which serves as a tool to bound the complexities of restricted subsets of the convex union.
We demonstrate our approach in two different areas, highlighting their conceptual connection in our framework, namely random projection based dimensionality reduction, and statistical learning of heterogeneous ensembles.
• In dimensionality reduction, heterogeneity of the data support brings improvement in simultaneous norm-preservation guarantees when points have some low complexity constitution, improving on results from [1].
• In statistical learning, our bounds justify and guide principled heterogeneous weighted ensemble construction more generally than previous work has, and we exemplify regularised gradient boosting type random subspace & random projection ensembles for high dimensional learning.

Theory
We begin with preliminaries, and develop some theory in Sections 2.1-2.2.
Definition 1 (Sub-Gaussian right tail). A random variable X is said to have a sub-Gaussian right tail with parameter σ > 0 if P(X > ξ) ≤ e^{−ξ²/(2σ²)} for all ξ > 0. Let R(σ²) denote the collection of such random variables.
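Definition 1 can be checked numerically for the prototypical example X ∼ N(0, σ²), which satisfies the tail condition with that same σ. The Monte Carlo sketch below (sample size and test points are arbitrary choices) compares empirical tail frequencies against the bound e^{−ξ²/(2σ²)}.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, n_mc = 2.0, 200_000

# Empirical check of the sub-Gaussian right tail for X ~ N(0, sigma^2):
# P(X > xi) should be at most exp(-xi^2 / (2 sigma^2)) for every xi > 0.
X = sigma * rng.standard_normal(n_mc)
for xi in [1.0, 2.0, 4.0]:
    emp = (X > xi).mean()
    bound = np.exp(-xi ** 2 / (2 * sigma ** 2))
    assert emp <= bound
print("tail bound holds at the tested points")
```

The Gaussian tail is in fact a factor 1/2 (and more) below this bound, which is why the empirical frequencies sit comfortably under it.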
In the sequel, we shall be concerned with canonical stochastic processes {X_s}_{s∈S} indexed by a bounded set S, and their suprema Z := sup_{s∈S} X_s. In many useful cases these suprema turn out to have a sub-Gaussian right tail. Given a bounded set S ⊂ R^n and a sequence γ = (γ_i)_{i∈[n]} of i.i.d. Rademacher variables, consider the canonical process X_s := ⟨γ, s⟩; its expected supremum R(S) := E sup_{s∈S} ⟨γ, s⟩ is referred to as the Rademacher width of S. By McDiarmid's inequality, sup_{s∈S} X_s − R(S) ∈ R(∑_{i∈[n]} sup_{s∈S} s_i²). We also have sup_{s∈S} {X_s − R(S)} ∈ R(8 · sup_{s∈S} ∑_{i∈[n]} s_i²) [9, Example 3.5]. Whilst the latter bound is sometimes tighter, we will rely primarily on the former bound in what follows.
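For a finite index set, the Rademacher width R(S) = E sup_{s∈S} ⟨γ, s⟩ and the McDiarmid sub-Gaussian parameter ∑_i sup_{s∈S} s_i² can both be estimated directly; the sketch below does so by Monte Carlo, with all dimensions chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_pts, n_mc = 50, 30, 2000

S = rng.standard_normal((n_pts, n))   # a finite index set S in R^n (rows are points s)

# Monte Carlo estimate of the Rademacher width R(S) = E sup_{s in S} <gamma, s>.
gammas = rng.choice([-1.0, 1.0], size=(n_mc, n))
sups = (gammas @ S.T).max(axis=1)     # sup over s of the canonical process, per draw
R_S = sups.mean()

# The sub-Gaussian parameter from McDiarmid: sum_i sup_{s in S} s_i^2.
sigma2 = (S ** 2).max(axis=0).sum()
print(R_S, sigma2)
```

The spread of `sups` around `R_S` across draws illustrates the concentration of the supremum around the width, as guaranteed by the tail bound with parameter `sigma2`.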

Empirical processes over heterogeneous sets
Here we give a general result that will allow us to bound the complexity of certain subsets of a convex union. Consider the canonical stochastic process whose index set is the convex union of our interest. The next lemma bounds the supremum and its expectation for the resulting mixture process, subject to constraints, by showing that this supremum has sub-Gaussian right tail.

Element-wise complexity-restricted subsets
We define a notion of complexity for elements of a convex union as follows.
Definition 2 (Gaussian widths for elements of the convex hull of a union). Given S = (S_j)_{j∈[m]} consisting of bounded sets S_j ⊂ R^d and s ∈ S, we define

G_S(s) := inf { ∑_{t∈[τ]} α_t G(S_{j_t}) : τ ∈ N, (α_t)_{t∈[τ]} ∈ ∆_τ, j_t ∈ [m], s_t ∈ S_{j_t}, s = ∑_{t∈[τ]} α_t s_t }.

This will be useful in obtaining high probability bounds that hold simultaneously for all elements of the convex union, yet provide individual guarantees for each, a key idea in our approach. Note that an element s ∈ S may have multiple representations as a convex combination; the infimum breaks ties in favour of the most parsimonious one. The convex coefficients (α_t)_{t∈[τ]} that realise the infimum in this definition depend on the individual element s. Note also that the complexity G_S(s) of an element s ∈ S depends crucially upon the sequence of sets S with respect to which the complexity is quantified. Indeed, if the sequence of sets contains the singleton {s}, we would have G_S(s) = 0.
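The weighted average in Definition 2 can be much smaller than the largest component width: any particular representation of s gives an upper bound on G_S(s). The sketch below (with arbitrarily chosen dimensions) contrasts a 3-dimensional subspace sphere with the full-dimensional sphere; for subspace spheres the supremum sup_{s} ⟨s, g⟩ equals the norm of the projected Gaussian vector, so the width is easy to estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

def subspace_sphere_width(dim, n_mc=3000):
    # For the unit sphere of a dim-dimensional subspace, sup_s <s, g> = ||P g||_2,
    # so G(S) = E ||g_dim||_2, which is close to sqrt(dim).
    g = rng.standard_normal((n_mc, dim))
    return np.linalg.norm(g, axis=1).mean()

# S_1: unit sphere of a 3-dimensional subspace; S_2: unit sphere of R^200.
G1, G2 = subspace_sphere_width(3), subspace_sphere_width(200)

# For an element s = 0.9*s1 + 0.1*s2 (s1 in S_1, s2 in S_2), this representation
# certifies G_S(s) <= 0.9*G1 + 0.1*G2, far below max(G1, G2) = G2 when most of the
# weight sits on the simple component.
bound = 0.9 * G1 + 0.1 * G2
print(G1, G2, bound)
```

Here the element-wise certificate is roughly 2.8 against a maximal component width of about 14, which is exactly the gap our bounds exploit.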
The following result shows the utility of element-wise complexities.
Proof Both bounds are instances of Lemma 1, using the sub-Gaussian right tail properties described in Examples 2 and 3. Take g ∼ N(0, I_d); for each t ∈ [τ] and j_t ∈ [m], take the canonical process {X^{j_t}_{s_t}}_{s_t∈S_{j_t}} with X^{j_t}_{s_t} := ⟨s_t, g⟩. By Example 2, sup_{s_t∈S_{j_t}} X^{j_t}_{s_t} has a sub-Gaussian right tail with parameter σ_{j_t} = sup_{s∈S_{j_t}} ‖s‖₂. By Definition 2 and Lemma 1 with σ := b, µ_{j_t} := G(S_{j_t}) and µ := κ, (2) follows. Now, let γ be a sequence of n i.i.d. Rademacher variables.
Furthermore, using element-wise complexities we can cover the convex union with sets of increasing complexity, allowing us to deal with each in turn. A similar result holds with R in place of G.
The next sections rely on Theorem 1 combined with the covering approach of Lemma 2.

Dimension reduction for heterogeneous sets
Here we consider random projection (RP) based dimensionality reduction of sets of the form (1) in some high dimensional Euclidean ambient space, with component regions each having their own predominantly simple structure together with various higher complexity noise components. This is a realistic scenario in real world data [10]. Dimensionality reduction is often desirable before time-consuming processing of the data, and RP is a convenient approach, oblivious to the data, with useful distance-preservation guarantees. However, there is a gap in understanding what makes RP preserve structure more accurately. We apply our theory to this problem.
Recall that a k × d random matrix R is said to be isotropic if each of its rows r_i satisfies E[r_i r_i^⊤] = I_d. We shall make use of the following result.
Lemma 3 (Liaw et al. [1]). There exists a universal constant C_L > 0 such that for any isotropic k × d random matrix R, any set S ⊂ R^d and δ ∈ (0, 1), the following holds with probability at least 1 − δ:

sup_{s∈S} | ‖Rs‖₂ − √k ‖s‖₂ | ≤ C_L ( G(S) + sup_{s∈S} ‖s‖₂ · √(log(1/δ)) ).

The main result of this section is the following simultaneous bound for norm preservation.
Theorem 2 (Norm preservation in the convex union). Suppose we have an isotropic k × d random matrix R and a sequence of sets S = (S_j)_{j∈[m]} with S_j ⊆ R^d, and let S denote the convex union (1). Suppose further that max_{j∈[m]} sup_{s∈S_j} ‖s‖₂ ≤ b for some b > 0. Given any δ ∈ (0, 1), with probability at least 1 − δ, the following holds simultaneously for all points s ∈ S: the discrepancy |‖Rs‖₂ − √k ‖s‖₂| is bounded by a constant multiple of the element-wise complexity G_S(s) plus terms logarithmic in the number of component sets m and in 1/δ, scaled by b. The two dominant terms are in a tradeoff in the above bound; these are the element-wise complexity G_S(s) (cf. Definition 2), and a logarithmic function of the number of component sets m in the union. Indeed, if the union consists of many low complexity sets, then the latter quantity will increase, while if it consists of fewer high complexity sets then the former will increase.
Proof of Theorem 2 Let ε > 0 (to be chosen later), set L := ⌈max_{j∈[m]} G(S_j)/ε⌉, and define the sets T_l := {s ∈ S : (l − 1)·ε ≤ G_S(s) ≤ l·ε} for l ∈ [L], as in Lemma 2, so that S ⊆ ∪_{l=1}^L T_l. By the first part of Theorem 1, for each l ∈ [L] we have a bound on the Gaussian width G(T_l). We apply Lemma 3 to each T_l and take a union bound, so the following holds with probability at least 1 − δ.
Finally, we take ε = b, note that 1 < π/2, and simplify.

The log(m) term in Theorem 2 is the price to pay for a bound which holds simultaneously over all convex combinations. Let us compare the obtained bound with the alternative of applying the result of Liaw et al. [1] directly to the convex union, which would give a bound scaling with G(S), and by (2) this is of the order of max_{j∈[m]} G(S_j). Crucially, in Theorem 2 the maximal complexity max_{j∈[m]} G(S_j) only appears under a logarithm in our bound. By contrast, the direct bound scales linearly with this quantity. Figure 1 exemplifies the tightening of our bound in low complexity regions of the data support in comparison with the previous uniform bound of [1].

Learning in heterogeneous function classes
In this section we apply the second part of Theorem 1 to heterogeneous function classes. Let X be the instance space (a measurable space). Throughout, we denote by M(X, V) the set of (measurable) functions with domain X and co-domain V. First, let us recall some classic complexity measures for function classes. Given a class H ⊆ M(X, R), and a sequence of points x = (x_1, …, x_n) ∈ X^n, the empirical Gaussian width Ĝ_n(H, x) and empirical Rademacher width R̂_n(H, x) are defined as

Ĝ_n(H, x) := E_g sup_{h∈H} (1/n) ∑_{i∈[n]} g_i h(x_i),   R̂_n(H, x) := E_γ sup_{h∈H} (1/n) ∑_{i∈[n]} γ_i h(x_i),

where g is a standard Gaussian vector and γ a sequence of n i.i.d. Rademacher variables. The uniform widths R*_n(H) := sup_{x∈X^n} R̂_n(H, x) and G*_n(H) := sup_{x∈X^n} Ĝ_n(H, x) are useful in obtaining faster rates than O(n^{−1/2}). Finally, given a distribution P_X on X, and H ⊆ M(X, R), the Gaussian width G_n(H, P_X) and Rademacher width R_n(H, P_X) are G_n(H, P_X) := E_X Ĝ_n(H, X) and R_n(H, P_X) := E_X R̂_n(H, X), where the expectation is taken over a random sample X = (X_1, …, X_n) consisting of n independent random variables X_i with distribution P_X.

We can now define our element-wise complexities. Suppose we have a sequence H = (H_j)_{j∈[m]} of function classes H_j ⊆ M(X, R) and let H := conv(∪_{j∈[m]} H_j) be the convex union. Given a function f ∈ H, we define

R̂_{H,n}(f, x) := inf { ∑_{t∈[τ]} α_t R̂_n(H_{j_t}, x) : (α_t)_{t∈[τ]} ∈ ∆_τ, j_t ∈ [m], h_t ∈ H_{j_t}, f = ∑_{t∈[τ]} α_t h_t },

where each infimum runs over all τ ∈ N. We also make the corresponding definitions for Ĝ_{H,n}(f, x), G_{H,n}(f, P_X) and G*_{H,n}(f); the results that follow hold unchanged. The following lemma extends Theorem 1 to these element-wise complexities.
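The empirical Rademacher width of a function class is straightforward to estimate by Monte Carlo once the class is evaluated on the sample. The sketch below does this for an illustrative finite class of threshold functions on [-1, 1]; the class, the grid of thresholds, and the sample size are all assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_mc = 100, 2000

x = rng.uniform(-1, 1, size=n)   # sample points in X = [-1, 1]

# A small function class H: threshold functions h_t(x) = sign(x - t) over a grid of t.
thresholds = np.linspace(-1, 1, 25)
H = np.sign(x[None, :] - thresholds[:, None])   # row t holds (h_t(x_1), ..., h_t(x_n))

# Empirical Rademacher width: R_hat_n(H, x) = E_gamma sup_h (1/n) sum_i gamma_i h(x_i).
gammas = rng.choice([-1.0, 1.0], size=(n_mc, n))
R_hat = ((gammas @ H.T) / n).max(axis=1).mean()
print(R_hat)
```

For a finite class of this size one expects a width of order √(log|H| / n), i.e. well below 1, which the estimate confirms.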
Lemma 4 (Element-wise complexity bounds for function classes). Take n, m ∈ N and β > 0. Given any x ∈ X^n and κ > 0,

R̂_n({f ∈ H : R̂_{H,n}(f, x) < κ}, x) ≤ κ + β · (√(2 log m / n) + √(π/(2n))).   (4)

Moreover, the bound (4) also holds with any one of Ĝ_{H,n}(·, x), R*_{H,n}(·), G*_{H,n}(·) in place of R̂_{H,n}(·, x). In addition, given any distribution P_X on X,

R_n({f ∈ H : R_{H,n}(f, P_X) < κ}, P_X) ≤ κ + 2β · (√(2 log m / n) + √(π/(2n))).   (5)

Moreover, the bound (5) also holds with G_{H,n}(·, P_X) in place of R_{H,n}(·, P_X).
The proof is given in the Appendix. The bound for empirical widths follows directly from Theorem 1, and the others are reduced to it: for the average widths by using concentration of the empirical widths around their expectations, and for the uniform widths simply from the definition.

Learning with a Lipschitz loss
In this section we focus on the problem of supervised learning. Suppose we have a measurable input data space X and an output space Y ⊆ R. Suppose further that we have a tuple of random variables (X, Y), where X takes values in X and Y takes values in Y, with joint distribution P and marginal P_X over X. The learning task is defined in terms of a loss function L : R × Y → [0, B], where B denotes the largest value of L. The goal of the learner is to obtain a measurable mapping f : X → R with low risk E_L(f) := E[L(f(X), Y)]. Whilst the distribution P is unknown, the learner does have access to a data set D := {(X_i, Y_i)}_{i∈[n]} drawn i.i.d. from P, from which the empirical risk Ê_L(f) := (1/n) ∑_{i∈[n]} L(f(X_i), Y_i) can be computed. The main result of this section is the following simultaneous upper bound for weighted heterogeneous ensembles, given in terms of our element-wise Rademacher width of individual predictors.
Thus, by the classic Rademacher bound [11, Theorem 3.3] combined with a union bound, the following holds with probability at least 1 − δ for all l ∈ [n] and f ∈ F_l:

E_L(f) ≤ Ê_L(f) + 2Λ · R_n(F_l, P_X) + B·√(log(n/δ)/(2n)),

and, with ε := B/(2Λn),

2Λ · R_n(F_l, P_X) ≤ 2Λ · R_{H,n}(f, P_X) + 2Λε + 2Λβ · (√(2 log m / n) + √(π/(2n)))
  = 2Λ · R_{H,n}(f, P_X) + 2Λβ · (√(2 log m / n) + √(π/(2n))) + B/n
  ≤ 2Λ · R_{H,n}(f, P_X) + 4Λβ · √(2(1 + log m)/n) + B/n,

and log(n/δ)/(2n) + 1/n ≤ 2 log(en/δ)/n for all n ≥ 3. This proves the first bound in Theorem 3 for all f ∈ ∪_{l=1}^n F_l and n ≥ 3. On the other hand, if f ∈ H \ ∪_{l=1}^n F_l or n ≤ 2, then max{2Λ · R_{H,n}(f, P_X), B·√(2 log(en/δ)/n)} ≥ B, in which case the bound in Theorem 3 follows from sup_{(u,y)∈R×Y} L(u, y) ≤ B. This completes the proof of the first bound in Theorem 3. The second bound may be proved by a similar argument exploiting (4).

Learning with a self-bounding Lipschitz loss
To further demonstrate the generality of our theory, here we apply Theorem 4 to multi-output learning, and show how to obtain a heterogeneous ensemble with good generalisation as well as favourable rates.
We begin with the problem-specific preliminaries. The main result of this section is Theorem 5.
The label space is Y ⊆ {0, 1}^Q, where Q, the number of classes, can be very large in applications, but the number of simultaneous positive labels for an instance is typically much smaller, resulting in q-sparse binary label vectors: Y(q) := {y ∈ {0, 1}^Q : ∑_{ℓ∈[Q]} y_ℓ ≤ q} denotes the set of label vectors with at most q ≤ Q non-zeros. The following definition from [12] was shown to explain favourable rates for learning multi-output problems, ranging from the slow rate n^{−1/2} in the case of general Lipschitz losses to fast rates of order n^{−1}.
A nice example associated with fast rates is the pick-all-labels loss [13], which generalises the multinomial logistic loss to multi-label problems.
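As a sketch of this loss, the implementation below treats each positive label in turn as if it were the single true class of a multinomial logistic problem and sums the resulting softmax cross-entropy terms; this is my reading of the pick-all-labels reduction, and the function name and the example scores are illustrative, not from [13].

```python
import numpy as np

def pick_all_labels_loss(u, y):
    """Pick-all-labels style loss: sum over the positive labels of the
    multinomial logistic (softmax cross-entropy) loss, treating each positive
    label in turn as the true class.  u: score vector in R^Q, y: binary labels."""
    log_norm = np.log(np.exp(u - u.max()).sum()) + u.max()   # stable log-sum-exp
    return float(np.sum(y * (log_norm - u)))

u = np.array([2.0, -1.0, 0.5, 0.0])
y = np.array([1, 0, 1, 0])   # a 2-sparse label vector (q = 2 out of Q = 4)
print(pick_all_labels_loss(u, y))
```

With a one-hot label vector the expression collapses to the ordinary multinomial logistic loss, which is the sense in which it generalises it.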
To capture the complexity of a multi-output function class H ⊆ M(X, R^Q), its projected class is defined as Π ∘ H := {x ↦ π_ℓ(f(x)) : f ∈ H, ℓ ∈ [Q]}, where π_ℓ : R^Q → R is the ℓ-th coordinate projection.
We shall make use of the following optimistic-rate bound from [12], in which K is a numerical constant and Γ^{λ,θ}_{n,Q,δ}(H) denotes the associated complexity term. With these preliminaries in place, we consider convex combinations of multi-output functions; our main result, Theorem 5, replaces Γ^{λ,θ}_{n,Q,δ}(H) with an element-wise version Γ^{λ,θ}_{n,Q,δ}(f), again with K a numerical constant. We note that often R*_{nQ}(Π ∘ F) = Õ((nQ)^{−1/2}), so the dependence on the number of classes Q is, up to the mild factor log^{3/(2(1−θ))}(Q), only through the self-bounding Lipschitz constant λ. Hence in Example 4 there is no further dependence on Q, but only on q. Since θ = 1/2, we also have fast rates for multilabel heterogeneous ensembles with very large numbers of labels, provided the individual label vectors are sufficiently sparse.
Proof of Theorem 5 Take ε > 0 and L ∈ N (to be determined later), and for each l ∈ [L] define F_l := {f ∈ H : (l − 1)·ε ≤ R*_{Π∘H,nQ}(Π ∘ f) ≤ l·ε}. By Theorem 4 combined with the union bound, the following holds with probability at least 1 − δ for all l ∈ [L] and f ∈ F_l. Moreover, since f ∈ F_l, we have l·ε ≤ R*_{Π∘H,nQ}(Π ∘ f) + ε. Hence, choosing L = n and ε = B/n yields the required bound when n is sufficiently large that ε ≤ 1. On the other hand, if ε > 1, so that n < B, then the bound is immediate.

Algorithmic consequences and numerical experiments
We exemplify and assess the use of our generalisation bounds empirically by turning Theorems 3 and 5 into learning algorithms for binary and multi-label classification problems, by minimising the bounds. We implement these as regularised gradient boosting with random subspace and random projection based base learners. Such ensembles are heterogeneous, since each base class is defined on a different subspace of the ambient input space.
For concreteness and simplicity, we consider generalised linear model base learners. Denoting by Θ := {a, b, v, w} the parameters, a base learner has the form h(x, Θ) = a·tanh(⟨w, x⟩ + v) + b, where a, b, v ∈ R, and w ∈ X. For multilabel problems, w ∈ X^Q and tanh(·) is computed component-wise. To ensure that h has bounded outputs, we constrain the magnitudes of a and b. We do not regularise the weight vectors w, as the random dimensionality reduction itself performs a regularisation role. Thus, with k-dimensional inputs, a binary classification base class of this form has Rademacher width of order (k/n)^{1/2}, a multi-label base class has its Γ^{λ,θ}_{n,Q,δ} of order (λk/n)^{1/(2(1−θ))}, and neither the exponents nor n affect the minimisation. This translates into easy-to-compute individual penalties for each base learner. The pseudo-code of the resulting algorithm is given in Algorithm 1. Other base learners are of course possible, and their Rademacher width would then replace this penalty term. However, our goal is to assess in principle the ability of our bounds to turn into competitive learning algorithms. We generated k_t for t ∈ [τ] for the base learners independently from a skewed distribution proportional to −log(U) where U ∼ Uniform(0, 1), re-scaled these between 1 and half of the rank of the data matrix, and rounded them to the closest integers. This favours simpler base models, both for efficiency and to avoid large penalty terms.
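The base learner, its complexity penalty, and the skewed sampling of compressed dimensions can be sketched as below; the function names, the bound on |a| + |b|, and the rescaling of the sampled dimensions are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def base_learner(x, a, b, v, w):
    """Generalised linear base learner h(x) = a*tanh(<w, x> + v) + b; the
    magnitudes of a and b are assumed constrained so outputs stay bounded."""
    return a * np.tanh(x @ w + v) + b

def rademacher_penalty(k, n):
    # Complexity penalty of a k-dimensional binary base class, of order sqrt(k/n);
    # constants are absorbed into the tuned regularisation parameter eta.
    return np.sqrt(k / n)

def sample_dims(tau, k_max, rng=rng):
    # Skewed dimensions proportional to -log(U), U ~ Uniform(0,1), rescaled to
    # [1, k_max] and rounded -- favouring simpler (lower dimensional) base models.
    raw = -np.log(rng.uniform(size=tau))
    scaled = 1 + (k_max - 1) * (raw - raw.min()) / (raw.max() - raw.min())
    return np.round(scaled).astype(int)

dims = sample_dims(tau=20, k_max=50)
print(dims.min(), dims.max(), rademacher_penalty(dims[0], n=500))
```

In the boosting loop each candidate base learner is scored by its fitted loss plus η times its penalty, so lower dimensional learners win unless the extra dimensions buy a real reduction in loss.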

Algorithm 1 Heterogeneous gradient boosting with compressive learners
Require: Loss function L, training set D = {(x_i, y_i)}_{i=1}^n, regularisation parameter η, shrinkage parameter, number of rounds T.
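A minimal runnable sketch of the loop behind Algorithm 1 is given below, under simplifying assumptions: the base fit is reduced to a penalised least-squares step on the projected features (rather than the tanh learner), the loss is logistic, and the skewed dimension sampling is approximated inline. It is an illustration of the structure, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def heterogeneous_boosting(X, y, T=50, eta=1e-3, shrink=0.1):
    """Sketch of heterogeneous gradient boosting with compressive learners:
    round t draws a random k_t-dimensional projection of the inputs and fits a
    base learner there, with a complexity penalty of order eta*sqrt(k_t/n)."""
    n, d = X.shape
    F = np.zeros(n)                  # current ensemble scores
    ensemble = []
    for t in range(T):
        residual = y - 1.0 / (1.0 + np.exp(-F))       # negative gradient, logistic loss
        k_t = min(d, 1 + int(-np.log(rng.uniform()) * d / 4))  # skewed random dimension
        R = rng.standard_normal((k_t, d)) / np.sqrt(k_t)       # this round's projection
        Z = X @ R.T
        lam = eta * np.sqrt(k_t / n) + 1e-8           # penalty acts as ridge strength here
        w = np.linalg.solve(Z.T @ Z + lam * np.eye(k_t), Z.T @ residual)
        ensemble.append((R, w))
        F += shrink * (Z @ w)                         # shrinkage step
    return ensemble, F

X = rng.standard_normal((200, 30))
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0).astype(float)
ensemble, F = heterogeneous_boosting(X, y)
acc = ((F > 0) == (y > 0.5)).mean()
print(acc)
```

Each round works in its own compressed space, so the ensemble is heterogeneous in exactly the sense analysed above; on this toy problem the boosted scores recover the signal direction well.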
In addition to our heterogeneous ensembles, we also tested regularised gradient boosting on the original data; this is a homogeneous ensemble that performs all computations in the original high dimensional space. For comparisons we chose the closest related existing methods as follows. For binary classification we compare with Adaboost, Logitboost, and with the top results obtained in [2] by the methods RASE 1 -LDA, RASE 1 -kNN, RP-ens-LDA, RP-ens-kNN, as well as the classic Random Forest. For multi-label classification, we compare with existing multi-label ensembles: COCOA [14], ECC [15], and fRAkEL [16] provided by the MLC-Toolbox [17]. We use data sets previously employed by our competitors: the largest two real-world data sets from [2], and 5 benchmark multi-label data sets from [14-16]. The data characteristics are given in Tables 1-2. We standardised all data sets to zero mean and unit variance. In binary problems we tested different training set sizes, following [2], leaving the rest of the data for testing. In multi-label problems we used 80% of the data for training and 20% for testing.

Table 3 First 6 rows: Binary classification error rates (average ± standard deviation computed from 10 independent repetitions) on the Mice and Musk data sets after 1000 regularised gradient-boosting rounds. Our regularised heterogeneous ensembles are 's reg' and 'g reg' (underlined). The method descriptors specify the loss function used (exp = exponential; log = logistic) and the input type (s = subspace ensemble; g = random projection ensemble with Gaussian RPs in base learners; HD = original uncompressed inputs). The competing methods do not regularise their base learners. The last 5 rows are taken from [2] for comparison. Bold font indicates best performance; the second best is marked in italic if its performance is within one standard deviation of the best performer.
We did not do any feature selection, to avoid external effects in assessing the informativeness of our bounds; the RASE methods do perform feature selection and hence might have some advantage in the comparisons. In particular, the RASE algorithms use 200 evenly weighted base learners, each selected from 500 trained candidates while collecting information for feature selection, totalling 10,000 trained base learners, whereas we just train 1000 base learners in gradient boosting fashion. We set η by 5-fold cross-validation in {10^{−7}, 10^{−5}, 10^{−3}, 10^{−1}, 0} · n^{−1/2}. The misclassification rates obtained on the binary problems are summarised in Table 3 with both exponential and logistic loss functions. The shrinkage parameter was set to 0.1, a common choice in gradient boosting algorithms. The multi-label results are given in Table 4, with the pick-all-labels loss function; here the values represent the average area under the ROC curve (AUC) over the labels (higher is better). We present results with shrinkage 0.1 as well as without shrinkage (i.e. shrinkage set to 1); our heterogeneous ensembles appear more robust to the setting of this parameter than homogeneous gradient boosting, where shrinkage is known to have a role in preventing overfitting.
From Tables 3-4 we see that our regularised heterogeneous ensembles (s reg = regularised random subspace gradient boosting; g reg = regularised random projection gradient boosting) consistently display good performance, even the best performance in several cases. The regularised high-dimensional gradient boosting (HD reg) is only sometimes better, and only marginally, despite performing the computations to train all base learners in the full dimensional input space. The logistic loss worked better than the exponential on these data, likely because of noise. Interestingly, the random subspace setting of our ensembles tended to work better than random projections, which is good news both computationally and from interpretability considerations. We also see that un-regularised models (Adaboost and Logitboost) sometimes display erratic behaviour, especially in the small sample regime. RASE performs very well in general, as its in-built feature selection also has a regularisation effect. One could mimic this with our boosting-type random subspace ensemble, especially when interpretability is at a premium, although we have not pursued this here. Based on these results, we conclude that our heterogeneous random subspace ensemble is a safe-bet competitive approach.

Relation to previous work and discussion
The following corollary shows that, with a specific example loss function, our Theorem 3 recovers a result of [7], termed "deep boosting".
A similar result holds with the empirical widths R̂_n(H_{j_t}, x) in place of R_n(H_{j_t}, P_X).
Proof This follows straightforwardly from Theorem 3 applied to Example 5, relaxing the infimum in our definition of element-wise complexities.
Corollary 1 is closely related to [7, Theorem 1], which contains a similar result with a different proof. We can also relate our Theorem 5 to the multiclass "deep boosting" of [18] in the special case of q = 1. Their bound grows linearly with the number of classes Q, while ours can exploit label sparsity; their rate is n^{−1/2}, while ours allows significantly tighter bounds when the empirical error is sufficiently low and the sample size sufficiently large.
Foremost, our theoretical framework is general and widely applicable whenever heterogeneous geometric sets are of interest. The main benefit of our approach is to allow for a unified analysis which can be straightforwardly extended, and it justifies heterogeneous ensemble constructions beyond the previous theory. For instance, SnapBoost [6] considered a mix of trees and kernel methods in gradient boosting and was empirically found to be very successful.
The bound suggests that a regularisation term should be included in the training of each base learner, proportional to the Rademacher complexity of its class. Of course, the more data we have for training, the smaller the effect of this will be; SnapBoost did not include regularisation but trained on very large data sets. In relatively small sample settings (as we consider in Sec. 4.3) the regularisation suggested by the bound is expected to be more essential. However, we must reckon with the fact that Rademacher complexity is hard to compute in practice; one typically resorts to upper bounds, so over-regularising can be a concern. This may be somewhat countered by including a balancing regularisation parameter that may be tuned by cross-validation.

Conclusions
We presented a general approach to deal with set heterogeneity in high probability uniform bounds, which is able to exploit low complexity components. We applied this to tighten norm preservation guarantees in random projections, and to justify and guide heterogeneous ensemble construction in statistical learning. We also exemplified concrete use cases by turning our generalisation bounds into practical learning algorithms with competitive performance.

Acknowledgements This work was supported by EPSRC through Fellowship grant EP/P004245/1. Part of the computations for Section 4.3 were performed using the University of Birmingham's BlueBEAR HPC service (http://www.birmingham.ac.uk/bear).

Declarations
• Funding: EPSRC grant EP/P004245/1
• Conflicts of interest: The authors declare that they have no conflicts of interest or competing interests relating to the content of this article.

Taking a supremum over all x ∈ X^n we deduce the bound

R*_n({f ∈ H : R*_{H,n}(f) < κ}) ≤ κ + β · (√(2 log m / n) + √(π/(2n))).
Hence, if we define a random variable Z by

Z := R̂_n({f ∈ H : R_{H,n}(f, P) < κ}, X) − κ − β · (√(2 log m / n) + √(π/(2n))),

we have Z ∈ R(β²/n). By integrating the tail bound we deduce E(Z) ≤ β · √(π/(2n)). It follows from the definition of the average Rademacher width that

R_n({f ∈ H : R_{H,n}(f, P) < κ}, P) = E_X R̂_n({f ∈ H : R_{H,n}(f, P) < κ}, X) ≤ κ + 2β · (√(2 log m / n) + √(π/(2n))),

as required. The proof of the corresponding bound with G_n(·, P) in place of R_n(·, P) is similar, except for replacing McDiarmid's inequality with the Borell-TIS inequality.

Notation for heterogeneous ensembles:
R̂_{H,n}(f, x) — element-wise empirical Rademacher complexity of f ∈ H
R_{H,n}(f, P) — element-wise Rademacher complexity of f ∈ H
R*_{H,n}(f) — element-wise uniform Rademacher complexity of f ∈ H
Ĝ_{H,n}(f, x) — element-wise empirical Gaussian complexity of f ∈ H
G_{H,n}(f, P) — element-wise Gaussian complexity of f ∈ H
G*_{H,n}(f) — element-wise uniform Gaussian complexity of f ∈ H
P — probability distribution on X × Y
(X, Y) — a random tuple from X × Y drawn from P
L — loss function
B — largest value of L
Λ — Lipschitz constant of L
E_L(f) — generalisation error (risk) of f
n — sample size
D — training set drawn i.i.d. from P
Ê_L(f) — training error (empirical risk) of f
(F_l)_{l∈[L]} — sets of increasing complexity that cover H
Q — number of classes in multi-label problems
q — maximum number of non-zero labels for an instance
Y(q) — set of all label vectors with at most q ≤ Q non-zeros
(λ, θ) — self-bounding Lipschitz parameters
π_ℓ(f) — f_ℓ, the ℓ-th coordinate projection of a multi-output f
Π ∘ H — projected class of H