Abstract
The \(\mathcal {C}\)-bound is a tight bound on the true risk of a majority vote classifier that relies on the individual quality and pairwise disagreement of the voters and provides PAC-Bayesian generalization guarantees. Based on this bound, MinCq is a classification algorithm that returns a dense distribution on a finite set of voters by minimizing it. Introduced later and inspired by boosting, CqBoost uses a column generation approach to build a sparse \(\mathcal {C}\)-bound-optimal distribution on a possibly infinite set of voters. However, both approaches have a high computational learning time because they minimize the \(\mathcal {C}\)-bound by solving a quadratic program. Yet, one advantage of CqBoost is its experimental ability to provide sparse solutions. In this work, we address the problem of accelerating the \(\mathcal {C}\)-bound minimization process while keeping the sparsity of the solution and without losing accuracy. We present CB-Boost, a computationally efficient classification algorithm relying on a greedy, boosting-based \(\mathcal {C}\)-bound optimization. An in-depth analysis proves the optimality of the greedy minimization process and quantifies the decrease of the \(\mathcal {C}\)-bound achieved by the algorithm. Generalization guarantees are then drawn from existing PAC-Bayesian theorems. In addition, we experimentally evaluate the relevance of CB-Boost in terms of the three main properties we expect of it: accuracy, sparsity, and computational efficiency, compared to MinCq, CqBoost, Adaboost and other ensemble methods. As observed in these experiments, CB-Boost not only achieves results comparable to the state of the art, but also provides \(\mathcal {C}\)-bound-suboptimal weights at a very low computational cost while keeping the sparsity property of CqBoost.
Introduction
Ensemble-based supervised classification consists of training a combination of many classifiers learnt from various algorithms and/or subsamples of the initial dataset. Some of the most prominent ensemble learning frameworks include Boosting, with the seminal and elegant Adaboost (Freund and Schapire 1997), which has inspired numerous other algorithms, Bagging (Breiman 1996), leading to the successful Random Forests (Breiman 2001), but also Multiple Kernel Learning (MKL) approaches (Sonnenburg et al. 2006; Lanckriet et al. 2004) or even the Set Covering Machines (Marchand and Shawe-Taylor 2003). Most of these approaches are founded on theoretical aspects of PAC learning (Valiant 1984). Among them, the PAC-Bayesian theory studies the properties of the majority vote used to combine the classifiers (McAllester 1999) according to the distributions over them. Experimentally, valuable ad hoc studies have been conducted in specific application domains in order to build relevant sets of classifiers. We address here the problem of learning such a set independently from the priors relevant to the application domain, together with a weighting scheme that defines a majority vote over the members of that set of classifiers.
Introduced by McAllester (1999), the PAC-Bayesian theory provides some of the tightest Probably Approximately Correct (PAC) learning bounds. These bounds are often used for a better understanding of the learning capability of various algorithms (Seeger 2002; McAllester 2003; Langford and Shawe-Taylor 2003; Catoni 2007; Seldin et al. 2012; Dziugaite and Roy 2018). Because PAC-Bayesian bounds gave rise to a powerful analysis of many algorithms' behavior, they have incited a research direction that consists in developing new algorithms (or new variants of existing ones) that simply are bound minimizers (Germain et al. 2009; Parrado-Hernández et al. 2012; Dziugaite and Roy 2018; Germain et al. 2015). In this paper, we revisit one such algorithm, MinCq (Germain et al. 2015), which focuses on the minimization of the \(\mathcal {C}\)-bound and comes with PAC-Bayesian guarantees. The \(\mathcal {C}\)-bound, introduced in Lacasse et al. (2006), bounds the true risk of a weighted majority vote based on the averaged true risk of the voters, coupled with their averaged pairwise disagreement. According to the \(\mathcal {C}\)-bound, the quality of each individual voter can be compensated if the voting community tends to balance out the individual errors by having plural opinions on “difficult” examples.
Although MinCq achieves state-of-the-art performance on many tasks, it computes the output distribution on a set of voters through a quadratic program, which is not tractable beyond medium-sized datasets. To overcome this, CqBoost (Roy et al. 2016) was then proposed. It iteratively builds a sparse majority vote from a possibly infinite set of classifiers, within a column generation setting. However, CqBoost's approach only partially tackles the computational challenge. In order to overcome this drawback, we propose CB-Boost, a greedy, boosting-based \(\mathcal {C}\)-bound minimization algorithm designed to greatly reduce the computational cost of CqBoost and MinCq while maintaining the attractive peculiarities of the \(\mathcal {C}\)-bound on a finite set of hypotheses.
CB-Boost is somewhat similar to CqBoost, while closer to boosting in the sense that at each iteration, it selects a voter, finds its associated weight by minimizing an objective quantity (the \(\mathcal {C}\)-bound in the case of CB-Boost, the exponential loss in the case of Adaboost), and adds it to the vote.
The main advantage of CB-Boost comes from the fact that at each iteration, it solves a \(\mathcal {C}\)-bound minimization problem along a single direction. Interestingly, this problem can be solved analytically with only a few light operations. Furthermore, we derive a guarantee that the \(\mathcal {C}\)-bound decreases throughout CB-Boost's iterations.
This paper is organized as follows. Section 2 sets up basic notation and definitions, reviews the \(\mathcal {C}\)-bound and its PAC-Bayesian framework, and briefly introduces MinCq and CqBoost, two existing algorithms that learn an ensemble of classifiers based on the minimization of the \(\mathcal {C}\)-bound. Section 3 introduces our new boosting-based algorithm, named CB-Boost, which aims at keeping the benefits of these two algorithms while reducing their drawbacks. Finally, Sect. 4 addresses the theoretical properties of CB-Boost, while Sect. 5 focuses on experiments that not only validate the theoretical aspects, but also show that CB-Boost performs well empirically on major aspects.
Context
After setting up basic notations and definitions, the context of PAC-Bayesian learning is introduced through the \(\mathcal {C}\)-bound and two theorems, which are pivotal components of our contribution.
Basic notations and definitions
Let us consider a supervised binary classification task, where \(\mathcal {X}\) is the input space and \({\mathcal {Y}}= \{-1,1\}\) is the output space. The learning sample \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\) consists of m examples drawn i.i.d. from a fixed but unknown distribution D over \({\mathcal {X}}\times {\mathcal {Y}}\). Let \(\mathcal {H}= \{h_1, \ldots , h_n\}\) be a set of n voters \(h_s: {\mathcal {X}}\rightarrow \{-1,1\}\), and \(\text {Conv}(\mathcal {H})\) the convex hull of \(\mathcal {H}\).
Definition 1
\(\forall x \in {\mathcal {X}}\), the majority vote classifier (Bayes classifier) \(B_{Q}\) over a distribution \(Q\) on \(\mathcal {H}\) is defined by \(B_{Q}(x) = \mathrm {sign}\left[ \underset{h \sim Q}{\mathbf {E}}\, h(x)\right] .\)
Definition 2
The true risk of the Bayes classifier \(B_Q\) over Q on \(\mathcal {H}\) is defined as the expected zero-one loss over D, a distribution on \({\mathcal {X}}\times {\mathcal {Y}}\): \(R_D(B_Q) = \underset{(x,y) \sim D}{\mathbf {E}}\, \mathbb {1}\left[ B_Q(x) \ne y\right] .\)
Definition 3
The training error of the Bayes classifier \(B_Q\) over Q on \(\mathcal {H}\) is defined as the empirical risk associated with the zero-one loss on \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\): \(R_{\mathcal {S}}(B_Q) = \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} \mathbb {1}\left[ B_Q(x_i) \ne y_i\right] .\)
Definition 4
The Kullback-Leibler (KL) divergence between distributions \(Q\) and P is defined as \(KL(Q\Vert P) = \underset{h \sim Q}{\mathbf {E}} \ln \frac{Q(h)}{P(h)}\,.\)
In the following, let P denote the prior distribution on \(\mathcal {H}\), which incorporates pre-existing knowledge about the task, and let \(Q\) denote the posterior distribution, an update of P after observing the task's data.
Previous work: the \(\mathcal {C}\)-bound & PAC-Bayesian guarantees
Here, we state the main context of our work by presenting the \(\mathcal {C}\)-bound and its properties, as introduced in Lacasse et al. (2006). Let us first define one core concept: the margin of the majority vote.
Definition 5
Given an example \(x \in {\mathcal {X}}\) and its label \(y \in {\mathcal {Y}}\) drawn according to a distribution D, \(M^{D}_{Q}\) is the random variable that gives the margin of the majority vote \(B_Q\), defined as \(M^{D}_{Q} = y \underset{h\sim Q}{\mathbf {E}} h(x)\,.\)
Given the margin’s definition, the \(\mathcal {C}\)-bound is presented in Lacasse et al. (2006) as follows.
Definition 6
For any distribution \(Q\) on \(\mathcal {H}\) and any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), let \(\mathscr {C}_Q^{D}\) be the \(\mathcal {C}\)-bound of \(B_Q\) over D, defined as \(\mathscr {C}_Q^{D} \triangleq 1 - \frac{\left( \mu _1(M^{D}_{Q})\right) ^2}{\mu _2(M^{D}_{Q})}\,,\)
with \(\mu _1(M^{D}_{Q}) = \underset{(x,y) \sim D}{\mathbf {E}}\, y \underset{h \sim Q}{\mathbf {E}}\, h(x)\) being the first moment of the margin,
and \(\mu _2(M^{D}_{Q}) = \underset{(x,y) \sim D}{\mathbf {E}} \left( y \underset{h \sim Q}{\mathbf {E}}\, h(x)\right) ^2 = \underset{(x,y) \sim D}{\mathbf {E}} \left( \underset{h \sim Q}{\mathbf {E}}\, h(x)\right) ^2\) being the second moment of the margin,
where the last equality comes from the fact that \(y \in \{-1, 1\}\), so \(y^2 = 1\).
Definition 7
For any distribution \(Q\) on \(\mathcal {H}\) and any \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\), let \(\mathscr {C}_Q^{\mathcal {S}}\) be the empirical \(\mathcal {C}\)-bound of \(B_Q\) on \(\mathcal {S}\), defined as \(\mathscr {C}_Q^{\mathcal {S}} \triangleq 1 - \frac{\left( \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}}\, h(x_i)\right) ^2}{\frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} \left( \underset{h \sim Q}{\mathbf {E}}\, h(x_i)\right) ^2}\,.\)
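The empirical \(\mathcal {C}\)-bound is straightforward to compute from voter outputs. The sketch below is ours, not the paper's code: it assumes a hypothetical matrix `H` of voter outputs and computes \(1 - \mu _1^2 / \mu _2\) on a sample.

```python
import numpy as np

def empirical_c_bound(H, y, q):
    """Empirical C-bound 1 - mu_1^2 / mu_2 of the Q-weighted majority vote.

    H : (m, n) array of voter outputs in {-1, +1}
    y : (m,) array of labels in {-1, +1}
    q : (n,) array of posterior weights summing to 1
    """
    margins = y * (H @ q)          # per-example margin of the vote
    mu1 = margins.mean()           # first moment of the margin
    mu2 = (margins ** 2).mean()    # second moment (y_i^2 = 1)
    assert mu1 > 0, "the C-bound requires a positive first moment"
    return 1.0 - mu1 ** 2 / mu2
```

For instance, a single voter that is correct on 3 of 4 examples has margin moments \(\mu _1 = 0.5\) and \(\mu _2 = 1\), giving an empirical \(\mathcal {C}\)-bound of 0.75.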
The following theorem, established and proven in Lacasse et al. (2006), shows that the \(\mathcal {C}\)-bound is an upper bound on the true risk of the majority vote classifier.
Theorem 1
Lacasse et al. (2006) For any distribution \(Q\) on a set \(\mathcal {H}\) of hypotheses, and for any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), if \(\mu _1(M^{D}_{Q})>0\), we have \(R_D(B_Q) \le \mathscr {C}_Q^{D}\,.\)
From this result, we derive a corollary for the empirical risk, i.e., the training error that is used in Sect. 4:
Corollary 1
For any distribution \(Q\) on a set \(\mathcal {H}\) of hypotheses, and for \(\mathcal {S}= \{(x_i, y_i)\}_{i=1}^m\) a training sample, if \(\frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}} h(x_i) > 0\), we have \(R_{\mathcal {S}}(B_Q) \le \mathscr {C}_Q^{\mathcal {S}}\,.\)
In terms of generalization guarantees, the PAC-Bayesian framework (McAllester 1999) provides a way to bound the true risk of \(B_{Q}\), given the empirical \(\mathcal {C}\)-bound and the prior and posterior distributions P and \(Q\). The following important theorem, established by Roy et al. (2016), is used in Sect. 4.3; it gives an upper bound on the true risk of the majority vote, which depends on the first and second moments of the margin as introduced in Definitions 5 and 6, and on the Kullback-Leibler divergence between the prior and posterior distributions.
Theorem 2
Roy et al. (2016) For any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), for any set \(\mathcal {H}\) of voters \(h: \ {\mathcal {X}}\rightarrow \{-1, 1\}\), for any prior distribution P on \(\mathcal {H}\) and any \(\delta \in ]0,1]\), over the choice of the sample \(\mathcal {S}= \{(x_i, y_i)\}_{i=1}^m\sim D^m\), and for every posterior distribution \(Q\) on \(\mathcal {H}\), we have, with probability at least \(1-\delta \),
Existing algorithms: MinCq & CqBoost
Let us focus on two algorithms that rely on the minimization of the empirical \(\mathcal {C}\)-bound in order to learn an accurate ensemble of classifiers. MinCq (Germain et al. 2015) finds the weights that minimize the empirical \(\mathcal {C}\)-bound on a given set of voters, while CqBoost (Roy et al. 2016) uses column generation and boosting to iteratively build a set of voters by minimizing the \(\mathcal {C}\)-bound.
MinCq^{Footnote 1} (Germain et al. 2015) The principle of MinCq is to create a majority vote over a finite set of voters whose weights minimize the \(\mathcal {C}\)-bound. To avoid overfitting, it considers a restriction on the posterior distribution named quasi-uniformity (Germain et al. 2015), and adds an equality constraint on the first moment of the margin.
As a benefit, it is guaranteed to perform in generalization nearly as well as in training, according to Corollary 1 and Theorem 2. The main drawback of MinCq is its computational training time: its algorithmic complexity is \(\mathcal {O}(m \times n^2 + n^3)\), which prevents it from scaling up to large datasets.
CqBoost^{Footnote 2} (Roy et al. 2016) Like MinCq, CqBoost is an algorithm based on the minimization of the \(\mathcal {C}\)-bound. It was designed to accelerate MinCq and has proven to be a sparser \(\mathcal {C}\)-bound minimizer, hence enabling better interpretability and faster training.
CqBoost is based on a column generation process that iteratively builds the majority vote. It is similar to boosting in the way that, at each iteration t, the choice of the voter to add is made by greedily optimizing the edge criterion (Demiriz et al. 2002). Once a voter has been added to the currently selected set, CqBoost finds the optimal weights by minimizing the \(\mathcal {C}\)-bound similarly to MinCq, solving a quadratic program.
The sparsity of CqBoost is explained by its ability to stop the iterative vote building early enough to avoid overfitting. Nevertheless, even though CqBoost is faster and sparser than MinCq, it is still not applicable to large datasets, because its algorithmic complexity is \(\mathcal {O}(m \times T^2 + T^3)\), where T is the number of boosting iterations.
CB-Boost: A fast \(\mathcal {C}\)-bound minimization algorithm
In this section, we present the mono-directional \(\mathcal {C}\)-bound minimization problem and its solution, which are central to CB-Boost. Then, we introduce CB-Boost's pseudo-code in Algorithm 1.
The empirical \(\mathcal {C}\)-bound of a distribution \(Q= \{\pi _1, \dots , \pi _n\}\) of n weights over a set \(\mathcal {H}= \{h_1, \dots , h_n\}\) of n voters, on a learning sample \(\mathcal {S}\) of m examples, is as follows: \(\mathscr {C}_Q^{\mathcal {S}} = 1 - \frac{\left( \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \mathop {\sum }\limits _{s=1}^{n} \pi _s h_s(x_i)\right) ^2}{\frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} \left( \mathop {\sum }\limits _{s=1}^{n} \pi _s h_s(x_i)\right) ^2}\,.\)
It is proven in Appendix D.1 that if we use positive weights \(\{\alpha _1, \ldots , \alpha _n\} \in (\mathbb {R}^+)^n\) instead of a distribution, the resulting empirical \(\mathcal {C}\)-bound is the same as with the distribution \(Q= \{\frac{\alpha _1}{\sigma }, \dots , \frac{\alpha _n}{\sigma }\}\), where \(\sigma = \mathop {\sum }\limits _{s=1}^{n}\alpha _s\). From here on, we use weights that do not sum to one in order to simplify the proofs.
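This normalization invariance is easy to check numerically. The sketch below is ours (synthetic data, hypothetical names): it verifies that rescaling the positive weights, in particular dividing by \(\sigma \), leaves the empirical \(\mathcal {C}\)-bound unchanged.

```python
import numpy as np

def c_bound_from_weights(H, y, alpha):
    """Empirical C-bound of the unnormalized vote F = sum_s alpha_s h_s."""
    vote = H @ alpha                 # F(x_i) for each example
    mu1 = np.mean(y * vote)          # first moment of the (unnormalized) margin
    mu2 = np.mean(vote ** 2)         # second moment
    return 1.0 - mu1 ** 2 / mu2

rng = np.random.default_rng(0)
H = rng.choice([-1.0, 1.0], size=(50, 5))   # synthetic voter outputs
y = rng.choice([-1.0, 1.0], size=50)        # synthetic labels
alpha = rng.random(5) + 0.1                 # positive weights, not summing to 1
# Normalizing by sigma = sum(alpha) leaves the bound unchanged, since the
# scaling cancels between the squared numerator and the denominator:
assert np.isclose(c_bound_from_weights(H, y, alpha),
                  c_bound_from_weights(H, y, alpha / alpha.sum()))
```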
Optimizing the \(\mathcal {C}\)-bound in one direction
We outline some basic definitions to clarify the contributions of the following work: the agreement ratio, the margin, and the norm.
Definition 8
The agreement ratio between two voters \(h,h' \in \mathcal {H}\) (or two combinations of voters in \(\text {Conv}(\mathcal {H})\)) is defined as \(\tau (h,h') \triangleq \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} h(x_i) h'(x_i)\,.\)
Definition 9
The empirical margin of a single voter \(h\in \mathcal {H}\) (or of a combination of voters in \(\text {Conv}(\mathcal {H})\)) is \( \gamma (h) \triangleq \frac{1}{m} \mathop {\sum }\limits _{i=1}^{m}y_i h(x_i)\,\).
Definition 10
The squared and normalized L2-norm of \(h\in \mathcal {H}\) (or of a combination of voters in \(\text {Conv}(\mathcal {H})\)) is defined as \( \nu \left( h\right) \triangleq \frac{1}{m} \mathop {\sum }\limits _{i=1}^{m} h(x_i)^2 = \frac{1}{m}\left\Vert h\right\Vert _2^2\,\).
Here, we consider the \(\mathcal {C}\)-bound optimization in a single direction, meaning that all weights except one are fixed. For readability, we introduce, \(\forall i \in [m]\), \(F_{k}(x_i) = \mathop {\sum }\limits _{\begin{array}{c} s=1 \\ s\ne k \end{array}}^{n} \alpha _s h_s(x_i)\), which denotes the majority vote built from all the fixed weights and their corresponding voters, and \((\alpha , h_{k})\), the weight that varies during the optimization and its corresponding voter. We can thus rewrite the empirical \(\mathcal {C}\)-bound with respect to \(k\), the varying direction, as \(\mathscr {C}_k(\alpha ) = 1 - \frac{\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}{\nu \left( F_{k}\right) + 2 \alpha \tau (F_{k}, h_{k})+ \alpha ^2 \nu \left( h_{k}\right) }\,.\)
Our goal here is to find the \(\mathcal {C}\)-bound-optimal \(\alpha \), denoted by \(\alpha _k^*= \mathop {\text {arg}\,\text {min}}\limits _{\alpha \in \mathbb {R}^+} \mathscr {C}_k(\alpha )\). The following theorem is the central contribution of our work, as it provides an analytical solution to this problem.
Theorem 3
\(\forall k \in [n]\), with the previously introduced notations, if \(\gamma (h_{k})> 0\), \(\gamma (F_{k})> 0\), and \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), then \(\alpha _k^*= \frac{\gamma (h_{k}) \nu \left( F_{k}\right) - \gamma (F_{k}) \tau (F_{k}, h_{k})}{\gamma (F_{k}) - \gamma (h_{k}) \tau (F_{k}, h_{k})}\,,\) where \(\nu (h_{k}) = 1\) since \(h_{k}\) outputs values in \(\{-1,1\}\).
The proof is provided in the Appendix, in Sect. D.1.
Theorem 3 states that in a specific direction, the \(\mathcal {C}\)-bound has a global minimum, provided three conditions hold. The first two (\(\gamma (h_{k})> 0\) and \(\gamma (F_{k})>0\)) are trivially met within our framework, as \(h_{k}\) is a weak classifier^{Footnote 3} and \(F_{k}\) is a positive linear combination of weak classifiers. The third one, \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), means that \(F_{k}\) and \(h_{k}\) must not be collinear, which, as we show in the next section, is not restrictive.
This theoretical result is the main step in building a greedy \(\mathcal {C}\)-bound minimization algorithm. Moreover, as long as there is a direction \(k\) in which \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), the \(\mathcal {C}\)-bound can be optimized in this direction, whereas every direction in which \(\tau (F_{k}, h_{k})\ge \frac{\gamma (F_{k})}{\gamma (h_{k})}\) is a dead end.
In terms of complexity, the solution to the minimization problem is obtained in \(\mathcal {O}(m)\), as \(\gamma (h_{k})\), \(\nu \left( F_{k}\right) \), \(\gamma (F_{k})\), and \(\tau (F_{k}, h_{k})\) are sums over the m examples of the training set \(\mathcal {S}\).
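To make the closed form concrete, the sketch below (ours, on synthetic voters; all names are hypothetical) computes the four \(\mathcal {O}(m)\) statistics, applies the step obtained by zeroing the derivative of \(\mathscr {C}_k\) for a \(\pm 1\)-valued voter (so \(\nu (h_{k}) = 1\)), and checks against a grid of step sizes that the analytic step is optimal in its direction.

```python
import numpy as np

def direction_stats(F, h, y):
    """The four O(m) statistics of Definitions 8-10: gamma(h), gamma(F), nu(F), tau(F, h)."""
    m = len(y)
    return np.dot(y, h) / m, np.dot(y, F) / m, np.dot(F, F) / m, np.dot(F, h) / m

def best_step(F, h, y):
    """Closed-form step minimizing the C-bound of F + alpha * h, for h in {-1,1}^m,
    obtained by setting the derivative of the one-dimensional bound to zero."""
    g_h, g_F, nu_F, tau = direction_stats(F, h, y)
    assert g_h > 0 and g_F > 0 and tau < g_F / g_h   # conditions of Theorem 3
    return (nu_F * g_h - tau * g_F) / (g_F - tau * g_h)

def c_bound(F, y):
    """Empirical C-bound of an unnormalized vote F."""
    return 1.0 - np.mean(y * F) ** 2 / np.mean(F ** 2)

# Synthetic check: two weak voters correlated with the labels.
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
h1 = np.where(rng.random(200) < 0.70, y, -y)   # ~70% accurate voter (current vote)
h2 = np.where(rng.random(200) < 0.65, y, -y)   # ~65% accurate candidate voter
alpha = best_step(h1, h2, y)
# The analytic step is at least as good as every point of a fine grid:
grid_best = min(c_bound(h1 + a * h2, y) for a in np.linspace(0.0, 5.0, 2001))
assert c_bound(h1 + alpha * h2, y) <= grid_best + 1e-12
```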
Optimally choosing the direction
In the previous subsection, we presented a theoretical result proving that, for a given direction, the \(\mathcal {C}\)-bound minimization problem has a unique solution. Here, we propose a way to optimally choose this direction and compare it to the main existing method.
Exhaustive search In our framework, \(\mathcal {H}\) is finite with cardinality n, implying that we have a finite number of available directions to choose from. As stated before, the minimum \(\mathcal {C}\)-bound in one direction is available in \(\mathcal {O}(m)\). So, by computing these minima in each direction, in \(\mathcal {O}(n\times m)\), we are able to choose the optimal direction, i.e., the one in which the \(\mathcal {C}\)-bound decreases the most.
Comparison with gradient boosting In the gradient boosting framework (Friedman 2001), the optimization direction is chosen by gradient minimization. Coupled with an adequate method to choose the step size, it is a very efficient way of optimizing a loss function. However, thanks to our theoretical analysis, we know that at each iteration of CB-Boost, the best direction is chosen and the optimal step size is known analytically.
Nonetheless, we present a comparison between our exhaustive method and a gradient boosting version that we call GB-CB-Boost. We show in the experiments (Sects. 5.1 and 5.2) that it has no significant advantage and is less stable than CB-Boost. The details of the gradient boosting variant are given in Appendix B, and a toy example gives an intuition of the difference between the two processes in Appendix C.
Presenting CB-Boost
Armed with the theoretical and practical results presented in the previous subsections, we are now ready to present the overall view of CB-Boost, which optimizes the training error of the majority vote through the iterative minimization of the mono-directional \(\mathcal {C}\)-bound presented in Theorem 3.
For the sake of clarity, we define \(I_1, \dots , I_T\) as a list that is initialized with zeros (Line 2) and that contains the indices of the chosen directions (updated in Lines 3 and 12). To initialize CB-Boost, we use \(h_{I_1} \in \mathcal {H}\), the hypothesis with the best margin, set its weight to 1, and set all the others to zero (Lines 1, 3 and 4). This aims at accelerating the convergence by starting the vote building with the strongest available hypothesis.
Then, at each iteration t, we compute the \(\mathcal {C}\)-bound-optimal weights in every available direction by solving multiple mono-directional optimization problems (Lines 7 to 11). The direction is then chosen exhaustively (Line 12).
After the initialization, the weights on \(\mathcal {H}\) form a Dirac distribution, the best-margin hypothesis's weight being the only non-zero one, and at each iteration t, one more element of \(\mathcal {H}\) receives a non-zero weight \(\alpha _{I_t}\).
One major advantage of CB-Boost compared to MinCq and CqBoost is the simplicity of Line 9, where its predecessors solve quadratic programs. Indeed, the algorithmic complexity of CB-Boost only depends on the number of iterations T, the number of examples m, and the number of hypotheses n. As the mono-directional \(\mathcal {C}\)-bound optimization over all directions is solved in \(\mathcal {O}(n \times m)\), CB-Boost's complexity is \(\mathcal {O}(n \times m \times T)\).
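The greedy loop can be sketched as follows. This is our simplified reading of Algorithm 1, not the paper's implementation: all names are ours, and a direction selected twice simply accumulates weight here instead of being fully re-optimized.

```python
import numpy as np

def cb_boost(H, y, T):
    """Greedy sketch of CB-Boost's main loop.

    H : (m, n) matrix of voter outputs in {-1, +1}; y : (m,) labels in {-1, +1}.
    Returns unnormalized weights alpha with at most T + 1 non-zero entries.
    """
    m, n = H.shape
    alpha = np.zeros(n)
    margins = (H * y[:, None]).mean(axis=0)   # gamma(h_k) for every voter
    first = int(np.argmax(margins))           # initialization: best-margin voter
    alpha[first] = 1.0
    F = H[:, first].astype(float)             # current unnormalized majority vote
    for _ in range(T):
        g_F = np.mean(y * F)
        nu_F = np.mean(F ** 2)
        best = (None, 0.0, np.inf)            # (direction, step, resulting C-bound)
        for k in range(n):                    # exhaustive O(n * m) direction search
            g_h = margins[k]
            tau = np.mean(F * H[:, k])
            if g_h <= 0 or tau >= g_F / g_h:  # dead-end direction, skip it
                continue
            a = (nu_F * g_h - tau * g_F) / (g_F - tau * g_h)  # closed-form step
            if a <= 0:
                continue
            cb = 1.0 - (g_F + a * g_h) ** 2 / (nu_F + 2 * a * tau + a ** 2)
            if cb < best[2]:
                best = (k, a, cb)
        if best[0] is None:
            break                             # no direction can decrease the bound
        alpha[best[0]] += best[1]
        F = F + best[1] * H[:, best[0]]
    return alpha
```

Each iteration costs one pass over the n voters with \(\mathcal {O}(m)\) work per voter, matching the \(\mathcal {O}(n \times m \times T)\) complexity stated above.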
Remarks
On the \(\mathcal {C}\)-bound's indirect example reweighting To bring diversity into the majority vote, Adaboost (Freund and Schapire 1997) updates weights over the examples at each iteration, exponentially emphasizing the examples on which it previously failed.
In CB-Boost, by considering both the first and second moments of the margin, the \(\mathcal {C}\)-bound takes into account the individual performance of each voter and their disagreement. Therefore, minimizing the \(\mathcal {C}\)-bound requires keeping a trade-off between maximizing the vote's margin and its internal disagreement. This is the reason why CB-Boost does not include any example weighting. Indeed, the mono-directional \(\mathcal {C}\)-bound minimization problem is equivalent to minimizing the following quantity: \(\frac{\nu \left( F_{k}\right) + \frac{2\alpha }{m} \mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i) h_{k}(x_i) + \alpha ^2}{\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}\,.\)
Intuitively, in this expression, \(\mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i) h_{k}(x_i)\) is proportional to \(\tau (F_{k}, h_{k})\), so it decreases as the disagreement between \(h_{k}\) and \(F_{k}\) increases. This encourages CB-Boost to choose directions that perform well on hard examples. Moreover, \(\alpha ^2\) can be interpreted as a regularization term, and \(\frac{1}{\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}\) encapsulates the quality of the vote.
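The equivalence claimed in this remark is a monotone reparametrization: the reweighting quantity equals \(1/(1 - \mathscr {C}_k(\alpha ))\), so both objectives share the same minimizer. The quick numeric check below is ours (synthetic data, hypothetical names).

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
y = rng.choice([-1.0, 1.0], size=m)
F = np.where(rng.random(m) < 0.70, y, -y)   # current vote (one synthetic voter)
h = np.where(rng.random(m) < 0.65, y, -y)   # candidate voter for the direction

def c_bound_dir(a):
    """One-dimensional empirical C-bound along the direction h."""
    v = F + a * h
    return 1.0 - np.mean(y * v) ** 2 / np.mean(v ** 2)

def reweighting_quantity(a):
    """The quantity of the remark; equal to 1 / (1 - C_k(a))."""
    g_F, g_h = np.mean(y * F), np.mean(y * h)
    return (np.mean(F ** 2) + 2 * a * np.mean(F * h) + a ** 2) / (g_F + a * g_h) ** 2

grid = np.linspace(0.0, 5.0, 2001)
# Both objectives are minimized at the same step size:
assert np.argmin([c_bound_dir(a) for a in grid]) == \
       np.argmin([reweighting_quantity(a) for a in grid])
```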
On the difference between the majority votes Intuitively, the concession made in CB-Boost to accelerate CqBoost and MinCq concerns the weights of the majority vote. Indeed, CqBoost returns the majority vote that exactly minimizes the \(\mathcal {C}\)-bound for the considered set of voters, whereas CB-Boost returns sub-optimal weights, since they have been optimized greedily throughout the iterations. Nevertheless, the \(\mathcal {C}\)-bound computed during the training phase is not an approximation for the considered majority vote, which explains the theoretical results of the next section. Moreover, in Fig. 5 (page 19), we empirically show that the weight-by-weight optimization has accuracy similar to the quadratic programs of MinCq and CqBoost.
On the stopping criterion In Sect. 3.1, we stated that as long as there is still a direction in which \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), the \(\mathcal {C}\)-bound can be optimized by CB-Boost. However, this is a very loose stopping criterion. In fact, as experimentally seen in Sect. 5, and as in CqBoost, it is far more interesting to use a fixed number of iterations as a hyperparameter of the algorithm, since the main improvements are made during the first iterations. Restricting the number of iterations in this way helps to reach a sparse model.
Theoretical results on training and generalization aspects
Quantifying the empirical \(\mathcal {C}\)-bound decrease
In this section, we quantify the decrease of the empirical \(\mathcal {C}\)-bound at each iteration of CB-Boost, depending on the previous iteration and the considered direction.
Property 1
During iteration t of CB-Boost, if \(I_t\) is the chosen direction's index, \(h_{I_t}\) its corresponding voter, and \(F_{I_t} = \mathop {\sum }\limits _{\begin{array}{c} s=1 \\ s\ne I_t \end{array}}^{n} \alpha _s h_s\) the majority vote over all the other directions, then the empirical \(\mathcal {C}\)-bound decreases exactly by
The proof is provided in the Appendix, in Sect. D.4.
Deriving the training error bound
Corollary 2
The training error of the majority vote built by CB-Boost at iteration \(t>2\) is bounded by
with \(\mathscr {S}_{j}\) being the quantity introduced in Property 1.
The proof is straightforward by combining Corollary 1 and Property 1.
This training error bound allows us to assess CB-Boost's capacity to learn relevant models from the available pool of voters.
Generalization guarantees
Theorem 2 presents a PAC-bound that gives generalization guarantees based on the empirical \(\mathcal {C}\)-bound. In order to apply it to CB-Boost's output \({F= \mathop {\sum }\limits _{s=1}^{n} h_s \alpha _s}\), we use \(Q= \{\frac{\alpha _1}{\sigma }, \dots , \frac{\alpha _n}{\sigma }\}\) with \(\sigma = \mathop {\sum }\limits _{s=1}^{n} \alpha _s\). Note that only the T weights corresponding to the chosen directions are non-zero. So, according to Theorem 2, with probability \(1-\delta \), for any sample \(\mathcal {S}\) drawn according to D,
These guarantees are tighter when the empirical \(\mathcal {C}\)-bound of a majority vote is small, which is exactly what CB-Boost aims at returning. Moreover, as seen in Sect. 2.2, returning a majority vote with a small \(KL(Q\Vert P)\) is essential in order to have good generalization guarantees. In Roy et al. (2016), the authors established that if the number of voters is far lower than the number of examples, \(n \ll m\), then minimizing \(KL(Q\Vert P)\) is negligible in comparison with minimizing the \(\mathcal {C}\)-bound of the majority vote.
If the case \(n \ll m\) does not apply, we need to characterize \(KL(Q\Vert P)\) intuitively. We use a uniform prior on \(\mathcal {H}\), and since at each iteration of CB-Boost one more weight of \(Q\) becomes non-zero, \(KL(Q\Vert P)\) will evolve with the number of non-zero weights of the posterior. Moreover, we proved that the \(\mathcal {C}\)-bound of \(Q\) decreases over the iterations of CB-Boost. Thus, in order to keep the trade-off between \(KL(Q\Vert P)\) and the \(\mathcal {C}\)-bound in the bound of Theorem 2, it is relevant to use early stopping by choosing a maximal number of iterations. Consequently, based on the generalization guarantees, we set the maximum number of iterations of CB-Boost as a hyperparameter that can be chosen using hold-out data.
Comparison to non-PAC-Bayesian generalization bounds In Cortes et al. (2014), a tight bound based on the Rademacher complexity is given for majority votes \(F\) in \(\text {Conv}(\mathcal {H})\). This bound depends on \(\hat{R}_{{\mathcal {S}}, \rho }(F) = \underset{(x,y) \sim {\mathcal {S}}}{\mathbf {E}}[1_{yF(x) \le \rho }]\), the empirical rate at which \(F\)'s margin falls below \(\rho \) over the sample \(\mathcal {S}\) drawn according to D, and on \(\mathfrak {R}_{m}(\mathcal {H})\), the Rademacher complexity of \(\mathcal {H}\), as
As explained in the previous paragraph, the size of \(KL(Q\Vert P)\) is either negligible or depends on the complexity of the distribution Q, and can then be reduced by early stopping. So, intuitively, both expressions rely on \(\sqrt{\frac{1}{m}}\) and are fairly equivalent. The main difference is that the Rademacher-based bound relies on the training error of the majority vote and on the Rademacher complexity of the hypothesis space \(\mathcal {H}\) (in our case \(\mathfrak {R}_{m}(\mathcal {H}) \le \sqrt{\frac{2 \ln |\mathcal {H}|}{m}}\), as our classifiers only output \(-1\) or 1), whereas the \(\mathcal {C}\)-bound's PAC-Bayesian bound relies on maximizing \(\frac{\mu _1(M^{\mathcal {S}}_Q)}{\mu _2(M^{\mathcal {S}}_Q)}\), which CB-Boost explicitly does.
Experiments
The code used to run each of the following experiments is available here: https://gitlab.lislab.fr/baptiste.bauvin/cbboostexps.
In order to experimentally study CB-Boost, we first compare it to CqBoost (Roy et al. 2016), MinCq (Germain et al. 2015) and GB-CB-Boost (Sect. 3.2), focusing on efficiency, sparsity and performance on chosen datasets featuring various properties. Then, we compare it to Adaboost (Freund and Schapire 1997) and other ensemble methods: Random Forest (Breiman 2001), Bagging (Breiman 1996), and Gradient Boosting (Friedman 2001).
Computational time improvement
In order to highlight the main advantage of CB-Boost, we first analyze the computational time improvement obtained by greedily minimizing the \(\mathcal {C}\)-bound, comparing CB-Boost and GB-CB-Boost to CqBoost's and MinCq's quadratic programs. Moreover, to compare them with a fast, broadly used boosting algorithm, we challenge CB-Boost's computational efficiency against Adaboost's.
Protocol In this experiment, we want to compare MinCq, CqBoost, GB-CB-Boost, CB-Boost and Adaboost by varying three factors: the training set size, the hypothesis set size, and the number of boosting iterations, respectively denoted m, n and T. To do so, we use MNist 4vs9 (LeCun and Cortes 2010), as it provides 11791 examples and 784 features. Moreover, to be as fair as possible to the quadratic programs, we allowed 8 threads for the solver. For comparative purposes, the only difference between GB-CB-Boost's and CB-Boost's implementations is the gradient computation, as they are based on the same code.
Results In Fig. 1a, we vary n, the number of available hypotheses, with constant m and T, and, as expected, MinCq has a very long computational time. Moreover, even if CqBoost's boosting-like structure reduces its time consumption, it remains much slower than the three other algorithms.
In Fig. 1b, we vary m. It has a small effect on MinCq, but a huge effect on CqBoost, as the latter runs \(T=50\) iterations, solving one increasingly complex quadratic program per iteration, whereas MinCq solves only one. Here, the apparently small difference between the other algorithms is due to the scale imposed by CqBoost's and MinCq's long durations.
Figure 1c only plots the iterative algorithms, and the duration difference between CqBoost and the greedy ones is clear even on this small learning set.
To sum up the results of these three sub-experiments: Adaboost, CB-Boost and GB-CB-Boost are far faster than the other \(\mathcal {C}\)-bound algorithms. However, Adaboost and GB-CB-Boost seem to be slightly more time-consuming than CB-Boost. To compare them, we analyze Fig. 1d, showing that even though they are close, CB-Boost is consistently the fastest. Moreover, it is visible that GB-CB-Boost's gradient computation is more time-consuming than CB-Boost's exhaustive search.
To conclude, CB-Boost is a substantial acceleration of CqBoost and MinCq, and is faster than Adaboost and GB-CB-Boost, which means that it is able to scale up to much bigger datasets than the other \(\mathcal {C}\)-bound-based algorithms.
Performance comparison: \(\mathcal {C}\)-bound-based algorithms equivalence
Here, we compare CB-Boost's performance in zero-one loss with the other \(\mathcal {C}\)-bound minimizers. Our goal is to measure the potential decrease in accuracy that greedy minimization might imply.
Protocol

Datasets All the datasets used in this experiment are presented in Table 1; an in-depth presentation is given in Appendix E.

Classifiers For each classifier, hyperparameters were found with a randomized search over 30 draws, using a distribution that incorporates prior knowledge about the algorithm but is independent from the dataset. They were validated by 5-fold cross-validation, and each experiment was run ten times, the results in Table 2 being the mean and standard deviation over these runs. These results are not statistically significant, but multiplying the experiments helps avoid an outlier split that could bias the results.
Results Table 2 shows that, for all datasets except Ionosphere and MNist5v6, CB-Boost performs at least as well as the other \(\mathcal {C}\)-bound algorithms, considering the standard deviation. It is quite clear that MinCq has the best zero-one loss on nearly all the datasets, as it uses all the available hypotheses. However, CB-Boost is competitive with both CqBoost and MinCq, and is even better on three datasets. On the more complex Animals with Attributes, CB-Boost is the best of the \(\mathcal {C}\)-bound algorithms. As expected, GB-CB-Boost shows no significant accuracy improvement, and is even less stable on australian, MNist7v9 and 5v6.
CBBoost is competitive with both the quadratic \(\mathcal {C}\)bound minimization algorithms 11 times out of 13, and has the best mean 3 times. Considering its shorter computational time, it is more efficient at extracting the information.
Sparsity conservation
Protocol We analyze here the convergence speed of the five previously studied algorithms, with grid-searched optimal hyperparameters when needed (MinCq’s margin hyperparameter set to 0.05, and CqBoost’s margin and error parameters set to 0.001 and \(10^{-6}\), respectively). The chosen dataset is MNist 0v9, as it is more complex than the UCI datasets but still small enough for CqBoost and MinCq to have reasonable computing time on it. We use the same hypotheses, training and testing sets as previously.
Results Figure 2 shows that, even if CBBoost converges slightly less quickly than Adaboost and CqBoost on the training set, its performance on the test set is slightly better than that of all the other algorithms. Moreover, it is visible that CqBoost’s and CBBoost’s sparsity prevents them from the slight overfitting observed for MinCq. Furthermore, CqBoost seems to benefit more from early stopping than CBBoost, even if the sparsity of both is clear, as they have a better test performance than MinCq from the 30th voter onwards. Finally, the experiments do not reveal any visible difference in sparsity between CBBoost and GBCBBoost.
So, thanks to the closed-form solution used in CBBoost’s computation, its computing time will always have an edge over GBCBBoost. This, together with the fact that GBCBBoost is, as expected and empirically shown, less stable, leads us to concentrate our analysis on CBBoost for the remainder of the paper.
Performance comparison: noise robustness equivalence
In this section, we compare CBBoost and Adaboost in terms of accuracy. The aim is (1) to quantify how far CBBoost is from Adaboost and (2) to evaluate their robustness to noise. Please note that the version of Adaboost used in these experiments is Adaboost.SAMME (Zhu et al. 2006), which is more robust to noise than the original.
Protocol We use the same framework as in Sect. 5.2 regarding datasets, train/test splits and hyperparameter optimization. A Gaussian distribution is used to add noise to the data (see Appendix F.1 for a detailed protocol and an example). The MNist dataset requires a special preprocessing step that reduces its contrast in order for the noise to have an effect on classification; see Appendix F.2 for details.
Results In Fig. 3, we show a view of the results, with each matrix presenting the comparison between Adaboost and CBBoost for each dataset (rows) and noise level (columns). It shows that, except for the first four noise levels of Ionosphere, where the difference is noteworthy, no significant loss of accuracy is observed when using CBBoost, as the maximum difference between the other scores is 2.5%. The numerical results are provided in Appendix F.3.
Performance comparison: ensemble methods equivalence
We present the results of the performance comparison between CBBoost and four other ensemble methods: Adaboost (Freund and Schapire 1997), Bagging (Breiman 1996), Random Forests (Breiman 2001), and Gradient Boosting (Friedman 2001), for which we use the same protocol as in Sect. 5.4. This experiment differs from Sect. 5.4 in that we do not average the results on each dataset; instead, for legibility, we plot in Fig. 4 one dot for each train/test split on each dataset.
Results In Fig. 4, the distance to the \(x=y\) line represents the difference in zero-one loss. So, a dot under the line means that the ensemble method has a lower loss, and one over the line means it has a higher loss. One can see that CBBoost is similar to the state of the art for most of the datasets and train/test splits. The only perceptible tendency is that Bagging is frequently worse than CBBoost; the other methods are equivalent to CBBoost.
\(\mathcal {C}\)bound approximation characterization
Here, we aim at empirically analyzing the approximation made by greedily minimizing the \(\mathcal {C}\)bound instead of using a quadratic program solver. To do so, at each iteration of CBBoost, we run MinCq on the selected subset of voters to compute its optimal \(\mathcal {C}\)bound.
Protocol We use the MNist 4v9 dataset with the same sets as previously and predefined hyperparameters: CBBoost ’s maximum number of iterations is set to 200, and MinCq’s margin parameter to 0.05.
Results Figure 5 shows that there is a slight but noticeable difference between the \(\mathcal {C}\)bound of CBBoost’s majority vote and that of MinCq. However, this difference only has a small impact on CBBoost’s performance for the first 30 iterations in train and 80 iterations in test. So, even if the expected difference in \(\mathcal {C}\)bound optimality is noteworthy, it does not impact the performance of CBBoost. Finally, the small gap between both \(\mathcal {C}\)bounds empirically suggests that CBBoost keeps the qualities of CqBoost and MinCq.
Conclusion
In this paper, we presented CBBoost, a greedy \(\mathcal {C}\)bound minimization algorithm. While maintaining its predecessors’ sparsity and accuracy properties, it has much lighter computational demands. Its optimization process relies on a theoretical result allowing CBBoost to efficiently minimize the \(\mathcal {C}\)bound in one direction at a time. This algorithm keeps the training and generalization guarantees given by the \(\mathcal {C}\)bound (Lacasse et al. 2006) and has the interesting property of allowing a quantification of the decrease of its bound and training error.
Experimentally, the comparison of CBBoost with relevant methods shows a real improvement in computational demand, without loss of accuracy. Furthermore, the experiments show that CBBoost slightly improves the sparsity of the models, which is the main property of CqBoost. Finally, it is competitive with four state-of-the-art ensemble methods with regard to performance, and with Adaboost in computational efficiency, sparsity and noise robustness.
In future work, we will analyze deeper theoretical properties of CBBoost, focusing on its dual form, finding a stronger stopping criterion and adapting CBBoost to infinite hypothesis spaces.
Notes
The pseudocode of MinCq is given in Appendix A.1.
CqBoost’s pseudocode is given in Appendix A.2.
A weak classifier is a classifier that is slightly better than random classification.
Available at https://archive.ics.uci.edu/ml/datasets.php.
Available at http://yann.lecun.com/exdb/mnist/.
Available at https://cvml.ist.ac.at/AwA/.
References
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Catoni, O. (2007). PAC-Bayesian supervised classification: The thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248.
Cortes, C., Mohri, M., & Syed, U. (2014). Deep boosting. In Proceedings of the thirty-first international conference on machine learning (ICML).
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1), 225–254.
Dua, D., & Graff, C. (2017). UCI machine learning repository.
Dziugaite, G. K., & Roy, D. M. (2018). Data-dependent PAC-Bayes priors via differential privacy. In Advances in neural information processing systems, pp. 8440–8450.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2009). PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th ICML, pp. 353–360. ACM.
Germain, P., Lacasse, A., Laviolette, F., Marchand, M., & Roy, J.-F. (2015). Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16(1), 787–860.
Lacasse, A., Laviolette, F., Marchand, M., Germain, P., & Usunier, N. (2006). PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In B. Schölkopf, J. C. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems 19, pp. 769–776. MIT Press.
Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE conference on computer vision and pattern recognition, pp. 951–958. IEEE.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Langford, J., & Shawe-Taylor, J. (2003). PAC-Bayes & margins. In Advances in neural information processing systems, pp. 439–446.
LeCun, Y., & Cortes, C. (2010). MNIST handwritten digit database.
Marchand, M., & Shawe-Taylor, J. (2003). The set covering machine. Journal of Machine Learning Research, 3, 723–746.
McAllester, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3), 355–363.
McAllester, D. A. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 5–21.
Parrado-Hernández, E., Ambroladze, A., Shawe-Taylor, J., & Sun, S. (2012). PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13, 3507–3531.
Roy, J.-F., Marchand, M., & Laviolette, F. (2016). A column generation bound minimization approach with PAC-Bayesian generalization guarantees. In A. Gretton & C. C. Robert (Eds.), Proceedings of the 19th international conference on artificial intelligence and statistics, Proceedings of Machine Learning Research, vol. 51, pp. 1241–1249. PMLR, Cadiz, Spain. http://proceedings.mlr.press/v51/roy16.html.
Schapire, R. E., & Freund, Y. (2012). Boosting: Foundations and algorithms. Cambridge: The MIT Press.
Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233–269.
Seldin, Y., Cesa-Bianchi, N., Auer, P., Laviolette, F., & Shawe-Taylor, J. (2012). PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. Proceedings of the Workshop on Online Trading of Exploration and Exploitation, 2, 98–111.
Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Zhu, J., Rosset, S., Zou, H., & Hastie, T. (2006). Multi-class AdaBoost. Statistics and Its Interface. https://doi.org/10.4310/SII.2009.v2.n3.a8.
Acknowledgements
This work has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 262067 and by the French National Research Agency (Grant ANR-15-CE23-0026). We warmly thank Robert Sadler and Sokol Koço for their proofreading.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Appendices
A Appendix: algorithms
Here, we briefly introduce MinCq (Germain et al. 2015) and CqBoost (Roy et al. 2016). The notation is adapted to our paper’s, instead of the original authors’, to avoid any confusion.

\(\mathbf {M} \triangleq \left[ \begin{array}{cccc} f_{1}\left( x_{1}\right) & f_{2}\left( x_{1}\right) & \dots & f_{2 n}\left( x_{1}\right) \\ f_{1}\left( x_{2}\right) & f_{2}\left( x_{2}\right) & \dots & f_{2 n}\left( x_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ f_{1}\left( x_{m}\right) & f_{2}\left( x_{m}\right) & \dots & f_{2 n}\left( x_{m}\right) \end{array}\right] \) denotes the vote matrix, containing the vote of each weak classifier in \(\mathcal {H}\) and its complementary voter \(f_{n+j} = -f_{j},\ j = 1 .. n\), on each example \( x_i,\ i=1 .. m\) of the training set \({\mathcal {S}}\); it is the matrix representing \(\mathcal {H}\),

\(\mathbf {q}\) denotes the weight vector on the voters, corresponding to the weights \(\alpha \) in CBBoost,

\(\mathbf {1}_{m}\) denotes the all-ones vector of size m,

\(\mu \) denotes the exact margin of the majority vote and \(\tilde{\mu }\), its minimum margin,

\(\omega \) denotes a weight vector over the samples,

\(\epsilon \) denotes a small positive real.
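As an illustration of this notation, the vote matrix \(\mathbf {M}\) can be assembled as follows. This is a minimal numpy sketch with toy voters of our own; the function names are not from the original implementations:

```python
import numpy as np

def vote_matrix(voters, X):
    """Build M with M[i, s] = f_s(x_i): the first n columns are the
    voters of H, the last n their complements f_{n+j} = -f_j."""
    H = np.column_stack([f(X) for f in voters])  # shape (m, n)
    return np.hstack([H, -H])                    # shape (m, 2n)

# Two toy decision stumps on a one-attribute dataset.
voters = [
    lambda X: np.where(X[:, 0] > 0.5, 1, -1),
    lambda X: np.where(X[:, 0] > 1.5, 1, -1),
]
X = np.array([[0.0], [1.0], [2.0]])
M = vote_matrix(voters, X)  # shape (3, 4), entries in {-1, 1}
```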
A.1 MinCq
MinCq is simply a bound minimizer; its pseudocode amounts to solving a quadratic program that outputs a weight vector over the hypothesis space. For details on the problem itself, see Germain et al. (2015).
A.2 CqBoost
CqBoost is a bit more complex than MinCq, so we analyze the main steps of its pseudocode:

Lines 1 to 3 initialize the problem with a null weight vector over the voters and a uniform distribution over the examples.

For each iteration,

Line 5 finds the voter with the best weighted edge. The edge being the dual notion of the margin, this step selects the voter with the best decision given the current weights of the examples.

Line 10 solves a \(\mathcal {C}\)bound minimization problem to update the weights on the voters and the examples.

Let us introduce the quadratic program that is solved at Line 10,
And its dual with \(\omega \), \(\beta \), \(\nu \) being Lagrange multipliers
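The voter-selection step of Line 5 can be sketched as follows. This is a minimal numpy sketch of the weighted-edge computation; the function name and signature are ours, not the original CqBoost code:

```python
import numpy as np

def best_edge_voter(M, y, omega):
    """Pick the column of the vote matrix M maximizing the weighted edge
    sum_i omega_i * y_i * f_s(x_i), i.e. the voter whose decisions agree
    most with the labels under the current example weights omega."""
    edges = M.T @ (omega * y)
    return int(np.argmax(edges))

# Toy usage: 3 examples, 2 voters, uniform example weights.
M = np.array([[1, -1], [1, 1], [-1, -1]])
y = np.array([1, 1, -1])
omega = np.full(3, 1 / 3)
best = best_edge_voter(M, y, omega)  # voter 0 agrees on all 3 examples
```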
B Appendix: gradient boosting comparison
B.1 Getting a gradient boosting version of CBBoost
To build the gradient boosting version of CBBoost, we follow the method presented in Schapire and Freund (2012), with the \(\mathcal {C}\)bound as the choice of loss function, defined for any \(F\in \text {Conv}(\mathcal {H})\) as
The gradient \(\nabla \mathscr {C}(F) = \left( \frac{\partial \mathscr {C}\left( F(x_1)\right) }{\partial F(x_1)}, \dots , \frac{\partial \mathscr {C}\left( F(x_i)\right) }{\partial F(x_i)}, \dots , \frac{\partial \mathscr {C}\left( F(x_m)\right) }{\partial F(x_m)}\right) \) is given, \(\forall i \in [1 .. m]\), by
Given this result, at iteration t, the gradient boosting algorithm finds the following direction of optimization, with \(F_{k}\) and \(h_{k}\) defined as in Sect. 3.1:
Once this direction is found, the next goal is to find the best weight for the voter (or optimization step). Here, thanks to Theorem 3, the optimal weight is given by
Armed with these results, we present GBCBBoost in Algorithm 4 as the gradient boosting variant of CBBoost.
B.2 Difference between CBBoost and GBCBBoost
The only difference between the two greedy algorithms is the choice of the optimization direction, made in step 12 of Algorithm 1:
and step 4 of Algorithm 4:
In CBBoost, the direction is chosen as the one in which the \(\mathcal {C}\)bound has the lowest minimum whereas in GBCBBoost, the choice is based on the maximum of the negative gradient.
As seen in Sects. 5.1 and 5.2, it has an empirical impact on the computational efficiency and on the stability of the algorithm, as CBBoost is faster and more stable than its gradient boosting counterpart. However, it does not have a significant impact on performance nor sparsity.
Concerning sheer \(\mathcal {C}\)bound optimization, Fig. 6 shows that CBBoost is as optimal as gradient boosting, and even slightly better, particularly on MNist5v3. It would therefore be interesting to conduct a more in-depth study of the differences between these variants. For example, in Appendix C, a toy example points out a case where gradient boosting needs one more optimization step than our method to find the minimum of a toy loss.
C Appendix: intuitive understanding of direction choice on a toy example
Here, we present a toy example that is meant to give the reader an intuition on the difference between gradient boosting bound minimization, and CBBoost ’s \(\mathcal {C}\)bound minimization process. To do so, we use a toy loss of two variables \(\mathcal {L} : x,y \mapsto z\) and analyze how the two approaches tackle the optimization problem. In Fig. 7, we plot it in two dimensions, a 3D plot is available here.^{Footnote 4} This loss is convex and defined as
with \(c_0, c_1, c_2\) constants. In Fig. 7, we can see that, when starting at the same place on the loss function’s surface (the “Start” point), Gradient Boosting (GB) chooses the direction where the gradient is the steepest (x) and, even with a perfectly tuned step size, has to use two iterations to reach the loss’s minimum, as it is not attainable in this direction. However, as CBBoost’s strategy is to find the direction in which the attainable minimum is the lowest (y), regardless of the gradient, CBBoost finds a minimum in just one iteration. In Fig. 8, we project the gradient directions and the function in 2D, in order to have a clearer point of view.
This example is not meant to prove a theoretical nor empirical superiority, but to point out the difference between the two methods.
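The two direction choices can be reproduced numerically. Below is a sketch with a made-up convex quadratic loss (our own constants, not the paper's \(c_0, c_1, c_2\)); it only mimics the situation of the toy example, where the steepest coordinate is not the one with the lowest attainable minimum:

```python
import numpy as np

# Hypothetical convex two-variable loss: L(x, y) = 4*(x-1)^2 + (y-3)^2,
# minimized at (1, 3). Constants are illustrative only.
def loss(x, y):
    return 4.0 * (x - 1.0) ** 2 + (y - 3.0) ** 2

start = (0.0, 0.0)

# Gradient-boosting-style choice: the steepest coordinate at the start.
grad = np.array([8.0 * (start[0] - 1.0),   # dL/dx at (0,0) = -8
                 2.0 * (start[1] - 3.0)])  # dL/dy at (0,0) = -6
gb_direction = int(np.argmax(np.abs(grad)))  # 0: GB moves along x

# CBBoost-style choice: the coordinate whose attainable line-minimum is lowest.
min_along_x = loss(1.0, start[1])  # best value reachable moving only in x: 9
min_along_y = loss(start[0], 3.0)  # best value reachable moving only in y: 4
cb_direction = 0 if min_along_x < min_along_y else 1  # 1: moves along y
```

Here gradient descent first moves along x (the steepest slope) and needs a second step, while the attainable-minimum rule reaches the lower value along y in one step.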
D Appendix: proofs
D.1 Proof of the equivalence between the distribution, and the weights version of the \(\mathcal {C}\)bound
Let us consider a set of weights \(\{\alpha _1, \ldots , \alpha _n\} \in (\mathbb {R}^+)^n\) that do not sum to one, and the distribution version \(Q' = \{\frac{\alpha _1}{\sigma }, \dots , \frac{\alpha _n}{\sigma }\}\) with \(\sigma =\mathop {\sum }\limits _{s=1}^{n}\alpha _s\).
D.2 Proof of Theorem 3
The proof that follows is not technically complex; it mainly relies on second order polynomial analysis, but it is quite long. Therefore we propose in Sect. D.3 a graph version of it that is easier to read and provides the main steps and implications.
In this proof, we will use the same notations as in the theorem,

\(\forall i \in [m], F_{k}(x_i) = \mathop {\sum }\limits _{\begin{array}{c} s=1 \\ s\ne k \end{array}}^{n} \alpha _s h_s(x_i)\) is the fixedweight majority vote,

\((\alpha , h_{k})\) is the variable weight and its corresponding voter.
So
With \(F_{k}(x_i) \in \mathbb {R}\) and \(h_{k}(x_i) \in \{-1, 1\}\,\, \forall x_i,\ i=1 \dots m\), and
Deriving the \(\mathcal {C}\)bound with respect to \(\alpha \), we obtain
as the third order terms of the numerator are simplified.
With
So, with our notations
For the sake of brevity, we define \(\mathscr {P}(\alpha ) = C_2 \alpha ^2 + C_1 \alpha + C_0\).
Eventually, we only use voters \(h_{k}\) that are weak classifiers, so \(\gamma (h_{k})> 0\). This implies that \(\gamma (F_{k})>0\), because \(F_{k}\) is a linear combination, with only positive coefficients, of weak classifiers.
D.2.1 Analysis of \(B_2 \alpha ^2 + B_1 \alpha + B_0\)
Let us recall the definition of the denominator,
The only way to cancel this sum of squares is to have \(\forall i \in [m],\ F_{k}(x_i) = -\alpha h_{k}(x_i)\). Yet, we supposed that \(\gamma (h_{k})>0\). So, either \(h_{k}= -\frac{F_{k}}{\alpha }\) with \(\alpha >0\), which is absurd, as we supposed \(\gamma (F_{k})>0\) and \(\gamma (h_{k})>0\), or \(h_{k}= -\frac{F_{k}}{\alpha }\) with \(\alpha \le 0\), which is absurd, as we supposed \(\alpha \ge 0\).
So our hypotheses lead to \(B_2 \alpha ^2 + B_1 \alpha + B_0 \ne 0, \forall \alpha \in \mathbb {R}\).
We will now analyse the behaviour of \(B_2 \alpha ^2 + B_1 \alpha + B_0\) without any knowledge or hypothesis on \(\mathscr {C}_k(\alpha )\). We compute the discriminant
So as we proved earlier that \(B_2 \alpha ^2 + B_1 \alpha + B_0\) could not be cancelled on \(\mathbb {R}\), we have
D.2.2 Analysis of \(\mathscr {C}'_k(\alpha )\) and \(\mathscr {C}_k(\alpha )\)
Even if, in CBBoost, \(\alpha \in [0,+\infty [\), we will analyse these functions on \(\mathbb {R}\).
If we look closer at \(\mathscr {C}_k(\alpha )\), we can see that its limit in \(\pm \infty \) is \(1-\frac{A_2}{B_2}\). Therefore, \(\mathscr {C}_k(\alpha )\) has an asymptotic line in \(\pm \infty \).
Moreover, thanks to (5) we know that the sign of \(\mathscr {C}'_k(\alpha )\) only depends on the sign of \(\mathscr {P}(\alpha )\).
Consequently, we will analyse the sign of \(\mathscr {P}(\alpha )\) by exhaustion.

If \(C_2 > 0\), then \(\mathscr {P}(\alpha )\) is an upward parabola, as represented in Fig. 9. Let us analyse the possibilities concerning its roots.

If \(\mathscr {P}(\alpha )\) has no real roots, we get a table as represented in Fig. 10.
This is impossible: \(\mathscr {C}_k(\alpha )\) would then be continuous and strictly increasing on \(\mathbb {R}\), while having the same finite limit \(1-\frac{A_2}{B_2}\) at both \(-\infty \) and \(+\infty \).
So \(\mathscr {P}(\alpha )\) having no real roots is ABSURD.

If \(\mathscr {P}(\alpha )\) has exactly one real root, then
$$\begin{aligned} \begin{aligned}&\varDelta _\mathscr {P}= 0\\ \Leftrightarrow&C_1^2 - 4 C_2 C_0 = 0\\ \Leftrightarrow&4 \left( \gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k}) - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] \right) ^2=0\\ \Leftrightarrow&\gamma (h_{k})^2 \nu \left( F_{k}\right) + \gamma (F_{k})^2 - 2 \gamma (F_{k})\gamma (h_{k})\tau (F_{k}, h_{k})= 0.\\ \end{aligned} \end{aligned}$$(10)We can see this equality as a second-order polynomial root problem (seen as a polynomial in \(\gamma (F_{k})\)), so it can be analysed through this polynomial’s discriminant,
$$\begin{aligned} \begin{aligned}&\varDelta _\mathscr {P}= 0\\ \Rightarrow&4\gamma (h_{k})^2 \tau (F_{k}, h_{k})^2 - 4 \gamma (h_{k})^2 \nu \left( F_{k}\right) \ge 0 \\ \Leftrightarrow&4 \gamma (h_{k})^2 \left( \tau (F_{k}, h_{k})^2 - \nu \left( F_{k}\right) \right) \ge 0.\\ \end{aligned} \end{aligned}$$(11)However, we proved in Eq. (9) that this is ABSURD.
So \(\mathscr {P}(\alpha )\) cannot have exactly one real root.

Consequently, \(\mathscr {P}(\alpha )\) has two real roots:
$$\begin{aligned} \begin{aligned} \alpha ^{-}&= \frac{-C_1 - \sqrt{\varDelta _\mathscr {P}}}{2 C_2},\\ \alpha ^{+}&= \frac{-C_1 +\sqrt{\varDelta _\mathscr {P}}}{2 C_2}, \end{aligned} \end{aligned}$$which leads to the table in Fig. 11. Thanks to the result on the asymptotic line, this implies that \(\mathscr {C}_k(\alpha )\) looks like Fig. 12. So \(\alpha ^{+}\) is the global argminimum of \(\mathscr {C}_k(\alpha )\). In the next section, we will prove that it is admissible in CBBoost’s framework.


If \(C_2<0\), we use similar methods to prove that \(\mathscr {P}(\alpha )\) has two real roots, obtaining Figs. 13 and 14. So, in this case, \(\alpha ^{-}\) is the global argminimum.
D.2.3 Possible argminima analysis
Preliminary results Following the previous analysis, let us recall
Moreover, we will present two results before the proof by exhaustion.

First of all, let us analyze \(\mathscr {C}_k(0)\) with respect to the asymptotic line \(1-\frac{A_2}{B_2}\):
$$\begin{aligned} \begin{aligned}&\mathscr {C}_k(0)< 1-\frac{A_2}{B_2} \\ \Leftrightarrow&1-\frac{A_0}{B_0}< 1-\frac{A_2}{B_2} \\ \Leftrightarrow&\frac{A_2}{B_2}< \frac{A_0}{B_0} \\ \Leftrightarrow&A_2 B_0< A_0 B_2 \, \# \text { because }B_0 \text { and } B_2 \text { are positive} \\ \Leftrightarrow&C_1 < 0.\\ \end{aligned} \end{aligned}$$So
$$\begin{aligned} \mathscr {C}_k(0)< 1-\frac{A_2}{B_2} \Leftrightarrow C_1 < 0. \end{aligned}$$(13)Let us note that, with the same method, we can prove
$$\begin{aligned} \mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2} \Leftrightarrow C_1 \ge 0. \end{aligned}$$(14)
Secondly, we will focus on \(\varDelta _\mathscr {P}\) and its square root:
$$\begin{aligned}\varDelta _\mathscr {P}= \left[ 2\left( \gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k}) - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] \right) \right] ^2. \end{aligned}$$In order to deduce \(\sqrt{\varDelta _\mathscr {P}}\), we have to know the sign of
$$\begin{aligned}\gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k}) - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] . \end{aligned}$$It can be developed as a second-order polynomial expression in \(\gamma (h_{k})\):
$$\begin{aligned}\gamma (h_{k})^2 \nu \left( F_{k}\right) - 2 \gamma (h_{k}) \gamma (F_{k})\tau (F_{k}, h_{k})+ \gamma (F_{k})^2 . \end{aligned}$$Its discriminant is
$$\begin{aligned}4\gamma (F_{k})^2 \tau (F_{k}, h_{k})^2 - 4 \nu \left( F_{k}\right) \gamma (F_{k})^2 = 4 \gamma (F_{k})^2 \left( \tau (F_{k}, h_{k})^2 - \nu \left( F_{k}\right) \right) , \end{aligned}$$which has the same sign as \(\tau (F_{k}, h_{k})^2 - \nu \left( F_{k}\right) \), which is negative, as seen earlier in Eq. (9).
So, \(\forall \gamma (h_{k})\in \mathbb {R}\), \(\gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k}) - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] > 0\), as its discriminant is negative and its evaluation at 0 is positive. So
$$\begin{aligned} \sqrt{\varDelta _\mathscr {P}} = 2\left( \gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k}) - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] \right) . \end{aligned}$$(15)
Let us now pursue with the proof by exhaustion.
If \(C_2>0\)

Let us suppose \(\mathscr {C}_k(0) \ge 1 - \frac{A_2}{B_2}\). Then \(\mathscr {C}_k(0)\) is over the asymptotic line, that is, somewhere on the blue side of the function in Fig. 15. So, necessarily, the red part of the abscissa axis is positive. Consequently, \(\alpha ^{+} > 0\).

Now, let us suppose that \(\mathscr {C}_k(0) < 1 - \frac{A_2}{B_2}\), which is equivalent to \(C_1 < 0\) thanks to Eq. (13). So \(\alpha ^{+} = \frac{-C_1 +\sqrt{\varDelta _\mathscr {P}}}{2 C_2}\), with \(C_2 >0\), \(-C_1>0\) and \(\sqrt{\varDelta _\mathscr {P}}> 0 \), which implies \(\alpha ^{+} > 0 \).
As a conclusion for \(C_2 > 0\),
If \(C_2 < 0\)

Symmetrically, if we suppose \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2}\), then \(\mathscr {C}_k(0)\) is on the blue side of the curve in Fig. 16, so the red side of the abscissa axis is negative. So, \(\alpha ^{-}<0\).
Let us keep in mind that \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2} \Rightarrow \alpha ^{-}<0\). However, when we use Eq. (14), we have \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2} \Leftrightarrow C_1 \ge 0\). So
$$\begin{aligned} \begin{aligned}&\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2}\\ \Leftrightarrow&C_1 \ge 0\\ \Leftrightarrow&-C_1 \le 0 \\ \Rightarrow&-C_1 - \sqrt{\varDelta _\mathscr {P}} \le 0\\ \Rightarrow&\frac{-C_1 - \sqrt{\varDelta _\mathscr {P}}}{2 C_2} \ge 0\,\# \text { because } C_2 < 0\\ \Rightarrow&\alpha ^{-} \ge 0. \end{aligned} \end{aligned}$$Consequently, if \(C_2 < 0\), then \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2} \Rightarrow \alpha ^{-} \ge 0\) and \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2} \Rightarrow \alpha ^{-}<0\), which is absurd. So \(\mathscr {C}_k(0) \ge 1-\frac{A_2}{B_2}\) is absurd.

Let us suppose that \(\mathscr {C}_k(0) < 1-\frac{A_2}{B_2}\).
We also suppose that \(\alpha ^{-}\ge 0\), so
$$\begin{aligned} \begin{aligned}&\alpha ^{-}\ge 0\\ \Rightarrow&\frac{-C_1 - \sqrt{\varDelta _\mathscr {P}}}{2 C_2} \ge 0 \\ \Rightarrow&-C_1 - \sqrt{\varDelta _\mathscr {P}} \le 0\,\# \text { because we supposed that } C_2 < 0 \\ \Rightarrow&C_1 \ge -\sqrt{\varDelta _\mathscr {P}} \\ \Rightarrow&2\left( \gamma (h_{k})^2 \nu \left( F_{k}\right) +\gamma (F_{k})\left[ \gamma (F_{k})m - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\right] \right) \ge 2 \left( m \gamma (F_{k})^2 - \gamma (h_{k})^2 \nu \left( F_{k}\right) \right) \\ \Rightarrow&2 \gamma (F_{k})^2 m - 2 \gamma (h_{k})\tau (F_{k}, h_{k})\gamma (F_{k})\ge 0\\ \Rightarrow&\gamma (F_{k})m - \tau (F_{k}, h_{k})\gamma (h_{k})\ge 0\\ \Rightarrow&C_2 \ge 0, \end{aligned} \end{aligned}$$(17)which is absurd. So \(\alpha ^{-} < 0\). So, if \(\mathscr {C}_k(0) < 1-\frac{A_2}{B_2}\) and \(\alpha ^{-} < 0\), then the lowest admissible \(\mathcal {C}\)bound value in CBBoost is attained at zero.
Conclusion So if \(C_2 < 0\), the hypothesis does not add value in terms of \(\mathcal {C}\)bound so the optimal choice is to weigh it with 0.
On the other hand, if in a direction \(C_2 > 0\), \(\alpha ^{+}\) is the global argminimum of \(\mathscr {C}_k(\alpha )\).
On Fig. 17, we draw a graph of this proof.
D.3 Graph of the proof
D.4 Proof of Property 1
D.4.1 Proof of the quantification
Let us analyze the \(\mathcal {C}\)bound of the vote at the beginning of iteration \(t+1\),
And at the beginning of iteration t,
So the numerators and denominators are
So we obtain \(\mathscr {C}_{t+1}^{F}= 1-\frac{N_{t+1}}{D_{t+1}} = 1-\frac{N_t + c}{D_t + c'}\). Yet, to be able to quantify the \(\mathcal {C}\)bound ’s decrease, we need to focus on \(\mathscr {C}_{t+1}^{F}- \mathscr {C}_{t}^{F}= \frac{N_{t}}{D_{t}} - \frac{N_t + c}{D_t + c'} = \frac{N_t c' - D_t c }{(D_t + c') D_t} \).
Yet,
So
We subtract
So, combining with \(\alpha _{I_t}\)’s expression found in Theorem 3,
Similarly,
So,
Then
In conclusion
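The incremental update above rests on the algebraic identity \(\mathscr {C}_{t+1}^{F}- \mathscr {C}_{t}^{F}= \frac{N_t c' - D_t c }{(D_t + c') D_t}\). A quick numerical sanity check, with arbitrary made-up positive values standing in for \(N_t, D_t, c, c'\):

```python
# Check C_{t+1} - C_t = (N_t*c' - D_t*c) / ((D_t + c') * D_t),
# where C_t = 1 - N_t/D_t and C_{t+1} = 1 - (N_t + c)/(D_t + c').
N_t, D_t, c, c_p = 3.0, 7.0, 0.5, 1.25  # illustrative values only
C_t = 1 - N_t / D_t
C_t1 = 1 - (N_t + c) / (D_t + c_p)
delta = (N_t * c_p - D_t * c) / ((D_t + c_p) * D_t)
assert abs((C_t1 - C_t) - delta) < 1e-12
```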
D.4.2 Proof of positiveness
We have established that
So, \(sg(\mathscr {S}_{t}) = sg(\nu \left( F_{I_t}\right) - \tau (F_{I_t}, h_{I_t})^2)\). Yet, in Appendix D.2, Eq. (9), we had the property that \(\tau (F_{I_t}, h_{I_t})^2 - \nu \left( F_{I_t}\right) < 0\). So \(\nu \left( F_{I_t}\right) - \tau (F_{I_t}, h_{I_t})^2 > 0 \), and thus \(\mathscr {S}_{t}> 0\).
E Appendix: datasets
E.1 UCI datasets
We used the following datasets, from the UCI repository (Dua and Graff 2017)^{Footnote 5} to test CBBoost ’s capacity to solve simple problems with few examples and/or a low dimension. We chose Australian \((690 \times 14)\), Balance \((625 \times 4)\), Bupa \((345 \times 6)\), Cylinder \((540 \times 35)\), Hepatitis \((155 \times 19)\), Ionosphere \((351 \times 34)\), Pima \((768 \times 8)\), Yeast \((1484 \times 8)\).
On every dataset, we generated 10 pairs of complementary decision stumps for each attribute to build our hypothesis space. For each experiment, we used one half of the dataset to train our algorithms and the other half to test on unseen data.
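The stump-generation step can be sketched as follows. This is a minimal numpy sketch; the evenly spaced threshold placement is an assumption of ours, as the paper does not specify how thresholds are chosen:

```python
import numpy as np

def stump_pairs(X, pairs_per_attribute=10):
    """For each attribute, create `pairs_per_attribute` complementary
    pairs of decision stumps (a stump and its negation)."""
    voters = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        # interior thresholds, avoiding the attribute's extreme values
        for t in np.linspace(lo, hi, pairs_per_attribute + 2)[1:-1]:
            stump = lambda X, j=j, t=t: np.where(X[:, j] > t, 1, -1)
            voters.append(stump)
            voters.append(lambda X, s=stump: -s(X))  # complementary voter
    return voters

# e.g. a dataset with 14 attributes, as Australian:
X = np.random.default_rng(0).uniform(size=(20, 14))
voters = stump_pairs(X)  # 14 attributes * 10 pairs * 2 = 280 voters
```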
E.2 MNist
To be able to analyse CBBoost ’s behaviour on larger data, we used MNist (LeCun and Cortes 2010)^{Footnote 6}, in which we selected four difficult pairs of classes: (0,9), (5,3), (5,6) and (7,9). As it has 784 attributes, we used one pair of complementary decision stumps per attribute for the hypothesis space, 4% of the dataset (\(\sim \) 471 examples) to train, and the remaining examples (\(\sim \) 11,000) to test.
E.3 Animals with attributes
Finally, to have a dataset with a large number of attributes, we used Animals with Attributes (AwA) (Lampert et al. 2009)^{Footnote 7}, fusing two descriptors (surfhist and cqhist) to generate 4688 attributes. As each descriptor on its own is barely relevant, we learned 2000 decision trees of depth 3 as the hypothesis space. They were trained on randomly subsampled attributes (\(60\%\) of the original dimension, with replacement). We selected two classes, tiger and wolf, that are easily confused and provide some challenge to differentiate (we denote this dataset AwATvW). One half of the dataset is used to train (546 examples) and the other half to test (546 examples).
F Appendix: noise analysis
F.1 Data noising
In order to noise our datasets, we use a normal distribution, centred at 0 and with a variable standard deviation. This distribution is scaled by the attribute’s range, added to the data, and the result is capped to the limits of the attribute.
For example, for z, a grayscale pixel attribute varying in [0, 255], to generate a noise level of 0.5, a 0-centered normal distribution with standard deviation 0.5, \(\mathcal {N}(0,0.5)\), was used.
For each attribute (column) of each dataset, we used its upper and lower limit to generate an adequate noise.
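The noising procedure above can be sketched as follows. This is a minimal numpy sketch of the described protocol; the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(X, level):
    """Add 0-centred Gaussian noise, scaled by each attribute's range,
    then cap the noisy values to the attribute's original limits."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    noisy = X + rng.normal(0.0, level, size=X.shape) * (hi - lo)
    return np.clip(noisy, lo, hi)

# e.g. 5 pixel-like attributes varying in [0, 255], noise level 0.5:
X = rng.uniform(0, 255, size=(100, 5))
X_noisy = add_noise(X, 0.5)
```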
F.2 MNist noising
The MNist dataset required a supplementary step in the noising process, as it mostly consists of black or white pixels. Indeed, this peculiarity made the previously introduced noising process ineffective. Thus, the contrast of each image in the dataset was halved in order for the noise to have a bigger impact on the performance of the studied algorithms. In Fig. 18, the original and low-contrast images are shown with and without noise, to picture the impact of the contrast decrease. It is clear that reducing the contrast leads to a more difficult classification task once the dataset has been through the noising process.
F.3 Numerical results
About this article
Cite this article
Bauvin, B., Capponi, C., Roy, JF. et al. Fast greedy \(\mathcal {C}\)bound minimization with guarantees. Mach Learn 109, 1945–1986 (2020). https://doi.org/10.1007/s10994020059027
Keywords
 PACBayes
 Boosting
 Ensemble methods
 \(\mathcal {C}\)bound
 Greedy optimization