1 Introduction

Ensemble-based supervised classification consists in training a combination of many classifiers learnt from various algorithms and/or sub-samples of the initial dataset. Some of the most prominent ensemble learning frameworks include Boosting, with the seminal and elegant Adaboost (Freund and Schapire 1997), which has inspired numerous other algorithms, Bagging (Breiman 1996), which led to the successful Random Forests (Breiman 2001), but also Multiple Kernel Learning (MKL) approaches (Sonnenburg et al. 2006; Lanckriet et al. 2004) and even the Set Covering Machines (Marchand and Taylor 2003). Most of these approaches are founded on theoretical aspects of PAC learning (Valiant 1984). Among them, the PAC-Bayesian theory studies the properties of the majority vote that is used to combine the classifiers (McAllester 1999) according to the distributions among them. Experimentally, valuable ad-hoc studies have been made over specific application domains in order to build relevant sets of classifiers. We address here the problem of learning such a set independently of the priors relevant to the application domain, together with a weighting scheme that defines a majority vote over the members of that set of classifiers.

Introduced by McAllester (1999), the PAC-Bayesian theory provides some of the tightest Probably Approximately Correct (PAC) learning bounds. These bounds are often used for a better understanding of the learning capability of various algorithms (Seeger 2002; McAllester 2003; Langford and Shawe-Taylor 2003; Catoni 2007; Seldin et al. 2012; Dziugaite and Roy 2018). Because PAC-Bayesian bounds gave rise to a powerful analysis of many algorithms’ behavior, they have spurred a research direction that consists in developing new algorithms (or new variants of existing ones) that are simply bound minimizers (Germain et al. 2009; Parrado-Hernández et al. 2012; Dziugaite and Roy 2018; Germain et al. 2015). In this paper, we revisit one such algorithm, MinCq (Germain et al. 2015), which focuses on the minimization of the \(\mathcal {C}\)-bound and comes with PAC-Bayesian guarantees. The \(\mathcal {C}\)-bound, introduced in Lacasse et al. (2006), bounds the true risk of a weighted majority vote based on the averaged true risk of the voters, coupled with their averaged pairwise disagreement. According to the \(\mathcal {C}\)-bound, the limited quality of each individual voter can be compensated for if the voting community tends to balance the individual errors by having plural opinions on “difficult” examples.

Although MinCq has state-of-the-art performance on many tasks, it computes the output distribution on a set of voters through a quadratic program, which is not tractable for more than medium-sized datasets. To overcome this, CqBoost (Roy et al. 2016) was then proposed. It iteratively builds a sparse majority vote from a possibly infinite set of classifiers, within a column generation setting. However, CqBoost’s approach only partially tackles the computational challenge. In order to overcome this drawback, we propose CB-Boost, a greedy, boosting-based, \(\mathcal {C}\)-bound minimization algorithm designed to greatly reduce the computational cost of CqBoost and MinCq while maintaining the attractive peculiarities of the \(\mathcal {C}\)-bound on a finite set of hypotheses.

CB-Boost is somewhat similar to CqBoost, while closer to boosting in the sense that at each iteration, it selects a voter, finds its associated weight by minimizing an objective quantity (the \(\mathcal {C}\)-bound in the case of CB-Boost, and the exponential loss in the case of Adaboost) and adds it to the vote.

The main advantage of CB-Boost comes from the fact that at each iteration, it solves a \(\mathcal {C}\)-bound minimization problem by considering only one direction. Interestingly, it is possible to solve it analytically and with only a few inexpensive operations. Furthermore, we derive a guarantee that the \(\mathcal {C}\)-bound decreases throughout CB-Boost’s iterations.

This paper is organised as follows. Section 2 sets up basic notation and definitions, reviews the \(\mathcal {C}\)-bound and its PAC-Bayesian framework, and briefly introduces MinCq and CqBoost, two existing algorithms that aim at learning an ensemble of classifiers based on the minimization of the \(\mathcal {C}\)-bound. Section 3 introduces our new boosting-based algorithm named CB-Boost, which aims at keeping the benefits of these two algorithms while reducing their disadvantages. Finally, Sect. 4 addresses the theoretical properties of CB-Boost, while Sect. 5 focuses on experiments that not only validate the theoretical aspects, but also show that CB-Boost performs well empirically on major aspects.

2 Context

After setting up basic notations and definitions, the context of PAC-Bayesian learning is introduced through the \(\mathcal {C}\)-bound and two theorems, which are pivotal components of our contribution.

2.1 Basic notations and definitions

Let us consider a supervised bi-class learning classification task, where \(\mathcal {X}\) is the input space and \({\mathcal {Y}}= \{-1,1\}\) is the output space. The learning sample \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\) consists of m examples drawn i.i.d. from a fixed, but unknown distribution D over \({\mathcal {X}}\times {\mathcal {Y}}\). Let \(\mathcal {H}= \{h_1, \ldots , h_n\}\) be a set of n voters \(h_s: {\mathcal {X}}\rightarrow \{-1,1\}\), and \(\text {Conv}(\mathcal {H})\) the convex hull of \(\mathcal {H}\).

Definition 1

\(\forall x \in {\mathcal {X}}\), the majority vote classifier (Bayes classifier) \(B_{Q}\) over a distribution \(Q\) on \(\mathcal {H}\) is defined by

$$\begin{aligned}B_{Q}(x) \triangleq sg\left[ \underset{h\sim Q}{\mathbf {E}} h(x)\right] \text {where }sg(a) = 1 \text { if }a>0\text { and }-1\text { otherwise.}\end{aligned}$$

Definition 2

The true risk of the Bayes classifier \(B_Q\) over Q on \(\mathcal {H}\), is defined as the expected zero one loss over D, a distribution on \({\mathcal {X}}\times {\mathcal {Y}}\):

$$\begin{aligned}R_{D}(B_Q) \triangleq \underset{(x,y) \sim D}{\mathbf {E}} I\left( \underset{h\sim Q}{\mathbf {E}} y \cdot h(x) < 0\right) \text {, with }I(a) = 1\text { if }a\text { is true, }0\text { otherwise} .\end{aligned}$$

Definition 3

The training error of the Bayes classifier \(B_Q\) over Q on \(\mathcal {H}\), is defined as the empirical risk associated with the zero one loss on \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\).

$$\begin{aligned} R_{\mathcal {S}}(B_Q) = \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} I\left( \underset{h\sim Q}{\mathbf {E}} y_i \cdot h(x_i) < 0\right) .\end{aligned}$$
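As an illustration of Definitions 1 and 3, the following minimal sketch computes the majority vote and its training error in plain Python. The function names, the list-of-lists layout `H[s][i]` for \(h_s(x_i)\), and the toy data are assumptions made for this example only.

```python
def expected_votes(H, q):
    """E_{h~Q} h(x_i) for every example; H[s][i] holds h_s(x_i) in {-1, 1}."""
    return [sum(w * h[i] for w, h in zip(q, H)) for i in range(len(H[0]))]


def majority_vote(H, q):
    """B_Q(x_i) = sg(E_{h~Q} h(x_i)), with sg(a) = 1 if a > 0 and -1 otherwise."""
    return [1 if v > 0 else -1 for v in expected_votes(H, q)]


def training_error(H, y, q):
    """R_S(B_Q): fraction of examples whose margin y_i * E_{h~Q} h(x_i) is negative."""
    margins = [yi * v for yi, v in zip(y, expected_votes(H, q))]
    return sum(1 for v in margins if v < 0) / len(y)


# Hypothetical toy data: three voters, four examples.
H = [[1, 1, -1, 1], [1, -1, 1, 1], [-1, 1, 1, 1]]
y = [1, 1, 1, -1]
q = [0.4, 0.3, 0.3]
print(majority_vote(H, q), training_error(H, y, q))
```

On this toy sample the vote predicts 1 everywhere, so only the last example (labelled -1) is misclassified.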

Definition 4

The Kullback-Leibler (KL) divergence between distributions \(Q\) and P is defined as

$$\begin{aligned}KL(Q||P) = \underset{h\sim Q}{\mathbf {E}} \ln \frac{Q(h)}{P(h)}\,.\end{aligned}$$

In the following study, let P denote the prior distribution on \(\mathcal {H}\) that incorporates pre-existing knowledge about the task, and let \(Q\) denote the posterior distribution, which is an update of P after observing the task’s data.
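For a finite set of voters, Definition 4 reduces to a finite sum. The sketch below is one possible way to compute it, assuming that distributions are given as lists of probabilities, that \(P(h) > 0\) wherever \(Q(h) > 0\), and the usual convention \(0 \ln 0 = 0\); the function name is hypothetical.

```python
import math


def kl_divergence(q, p):
    """KL(Q||P) = E_{h~Q} ln(Q(h)/P(h)), with the usual convention 0 * ln 0 = 0."""
    return sum(qs * math.log(qs / ps) for qs, ps in zip(q, p) if qs > 0)


# Hypothetical example: uniform prior over four voters.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.5, 0.5, 0.0, 0.0]
print(kl_divergence(prior, prior), kl_divergence(posterior, prior))
```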

2.2 Previous work: the \(\mathcal {C}\)-Bound & PAC-Bayesian guarantees

Here, we state the main context of our work by presenting the \(\mathcal {C}\)-bound and its properties, as introduced in Lacasse et al. (2006). Let us first define one core concept: the margin of the majority vote.

Definition 5

Given an example \(x \in {\mathcal {X}}\) and its label \(y \in {\mathcal {Y}}\) drawn according to a distribution D, \(M^{D}_{Q}\) is the random variable that gives the margin of the majority vote \(B_Q\), defined as \(M^{D}_{Q} = y \underset{h\sim Q}{\mathbf {E}} h(x)\,.\)

Given the margin’s definition, the \(\mathcal {C}\)-bound is presented in Lacasse et al. (2006) as follows.

Definition 6

For any distribution \(Q\) on \(\mathcal {H}\), for any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), let \(\mathscr {C}_Q^{D}\) be the \(\mathcal {C}\)-bound of \(B_Q\) over D, defined as

$$\begin{aligned} \mathscr {C}_Q^{D} = 1 - \frac{\left( \mu _1(M^{D}_{Q})\right) ^2}{\mu _2(M^{D}_{Q})}\,, \end{aligned}$$

with \(\mu _1(M^{D}_{Q})\) being the first moment of the margin

$$\begin{aligned}\mu _1(M^{D}_{Q}) = \underset{(x,y) \sim D}{\mathbf {E}} M^{}_{Q}(x,y),\end{aligned}$$

and \(\mu _2(M^{D}_{Q})\) being the second moment of the margin

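$$\begin{aligned}\mu _2(M^{D}_{Q}) = \underset{(x,y) \sim D}{\mathbf {E}}\left[ M^{}_{Q}(x,y)^2\right] = \underset{(x,y) \sim D}{\mathbf {E}}\left[ \left( \underset{h\sim Q}{\mathbf {E}} h(x)\right) ^2\right] ,\end{aligned}$$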

where the last equality comes from the fact that \(y \in \{-1, 1\}\), so \(y^2 = 1\).

Definition 7

For any distribution \(Q\) on \(\mathcal {H}\) and any \({\mathcal {S}}= \{(x_i, y_i)\}_{i=1}^m\), let \(\mathscr {C}_Q^{\mathcal {S}}\) be the empirical \(\mathcal {C}\)-bound of \(B_Q\) on \(\mathcal {S}\), defined as

$$\begin{aligned}\mathscr {C}_{Q}^{\mathcal {S}} = 1 - \frac{1}{m}\frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}} h(x_i) \right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \underset{h \sim Q}{\mathbf {E}} h(x_i)\right) ^2}\end{aligned}$$
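The empirical \(\mathcal {C}\)-bound of Definition 7 can be evaluated directly from the voters' outputs. The following is a minimal sketch under the assumption that `H[s][i]` stores \(h_s(x_i)\) and that the empirical first moment of the margin is positive; the function name and the toy data are illustrative only.

```python
def empirical_c_bound(H, y, q):
    """Empirical C-bound of B_Q on S, following Definition 7."""
    m = len(y)
    margins = [yi * sum(w * h[i] for w, h in zip(q, H)) for i, yi in enumerate(y)]
    return 1 - sum(margins) ** 2 / (m * sum(v * v for v in margins))


# Hypothetical toy data: each voter errs on a different example.
H = [[1, 1, 1, -1], [1, 1, -1, 1], [1, -1, 1, 1]]
y = [1, 1, 1, 1]
single = empirical_c_bound(H, y, [1.0, 0.0, 0.0])
uniform = empirical_c_bound(H, y, [1 / 3, 1 / 3, 1 / 3])
print(single, uniform)
```

In this toy case every voter has the same individual error, yet spreading the weight over voters that disagree on their mistakes lowers the empirical \(\mathcal {C}\)-bound, which is the compensation effect described in the introduction.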

The following theorem, established and proven in Lacasse et al. (2006), shows that the \(\mathcal {C}\)-bound is an upper bound of the true risk of the majority vote classifier.

Theorem 1

Lacasse et al. (2006) For any distribution \(Q\) on a set \(\mathcal {H}\) of hypotheses, and for any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), if \(\mu _1(M^{D}_{Q})>0\), we have

$$\begin{aligned}R_{D}(B_{Q}) \le \mathscr {C}_Q^{D}\,.\end{aligned}$$

From this result, we derive a corollary for the empirical risk, i.e., the training error that is used in Sect. 4:

Corollary 1

For any distribution \(Q\) on a set \(\mathcal {H}\) of hypotheses, and for \(\mathcal {S}= \{(x_i, y_i)\}_{i=1}^m\) a training sample, if \(\frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}} h(x_i) > 0\), we have

$$\begin{aligned}R_{\mathcal {S}}(B_{Q}) \le \mathscr {C}_Q^{\mathcal {S}}\,.\end{aligned}$$

In terms of generalization guarantees, the PAC-Bayesian framework (McAllester 1999) provides a way to bound the true risk of \(B_{Q}\), given the empirical \(\mathcal {C}\)-bound, and P and \(Q\) the prior and posterior distributions. The following important theorem, established by Roy et al. (2016), is used in Sect. 4.3; it gives an upper bound of the true risk of the majority vote, which depends on the first and second moments of the margin as introduced in Definitions 5 and 6, and on the Kullback-Leibler divergence between the prior and posterior distributions.

Theorem 2

Roy et al. (2016) For any distribution D on \({\mathcal {X}}\times {\mathcal {Y}}\), for any set \(\mathcal {H}\) of voters \(h: \ {\mathcal {X}}\rightarrow \{-1, 1\}\), for any prior distribution P on \(\mathcal {H}\) and any \(\delta \in ]0,1]\), we have, with probability at least \(1-\delta \) over the choice of the sample \(\mathcal {S}= \{(x_i, y_i)\}_{i=1}^m\sim D^m\), for every posterior distribution \(Q\) on \(\mathcal {H}\),

$$\begin{aligned} R_{D}(B_{Q})\le & {} 1 - \frac{\left( max(0, \underline{\mu _1})\right) ^2}{min(1, \overline{\mu _2})}\,, \text{ where: } \\ \underline{\mu _1}= & {} \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}} h(x_i) - \sqrt{\frac{2}{m}\left[ KL(Q||P)+\ln \left( \frac{2\sqrt{m}}{\delta /2}\right) \right] }\\ \overline{\mu _2}= & {} \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} \left( y_i \underset{h \sim Q}{\mathbf {E}} h(x_i)\right) ^2+\sqrt{\frac{2}{m}\left[ 2 KL(Q||P)+\ln \left( \frac{2 \sqrt{m}}{\delta /2}\right) \right] }\,. \end{aligned}$$
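To give a feel for how the bound of Theorem 2 behaves, the sketch below evaluates its right-hand side from the empirical margins \(y_i \mathbf {E}_{h \sim Q} h(x_i)\), a value of KL(Q||P) and a confidence parameter \(\delta \). The function name and the numerical inputs are assumptions chosen for illustration.

```python
import math


def pac_bayes_c_bound(margins, kl, delta):
    """Right-hand side of Theorem 2 from the empirical margins y_i * E_{h~Q} h(x_i)."""
    m = len(margins)
    log_term = math.log(2 * math.sqrt(m) / (delta / 2))
    # Lower bound on the first moment and upper bound on the second moment.
    mu1_low = sum(margins) / m - math.sqrt(2 / m * (kl + log_term))
    mu2_up = sum(v * v for v in margins) / m + math.sqrt(2 / m * (2 * kl + log_term))
    return 1 - max(0.0, mu1_low) ** 2 / min(1.0, mu2_up)


# Hypothetical inputs: a tiny sample, then a large one with constant margin 0.5.
small = pac_bayes_c_bound([1, 1 / 3, 1 / 3, 1 / 3], kl=0.0, delta=0.05)
large = pac_bayes_c_bound([0.5] * 100000, kl=1.0, delta=0.05)
print(small, large)
```

With only four examples the penalty term exceeds the empirical first moment and the bound is vacuous (equal to 1), whereas it becomes informative once m is large.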

2.3 Existing algorithms: MinCq & CqBoost

Let us focus on two algorithms that rely on the minimization of the empirical \(\mathcal {C}\)-bound in order to learn an accurate ensemble of classifiers. MinCq (Germain et al. 2015) finds the weights that minimize the empirical \(\mathcal {C}\)-bound on a given set of voters, and CqBoost (Roy et al. 2016), inversely, uses column generation and boosting to iteratively build a set of voters by minimizing the \(\mathcal {C}\)-bound.

MinCq (Germain et al. 2015)   The principle of MinCq is to create a majority vote over a finite set of voters, whose weights minimize the \(\mathcal {C}\)-bound. To avoid overfitting, it considers a restriction on the posterior distribution named quasi-uniformity (Germain et al. 2015), and adds an equality constraint on the first moment of the margin.

A benefit of MinCq is that, according to Corollary 1 and Theorem 2, it comes with guarantees on both its training error and its generalization error. The main drawback of MinCq is its computational training time: its algorithmic complexity is \(\mathcal {O}(m \times n^2 + n^3)\), which prevents it from scaling up to large datasets.

CqBoost (Roy et al. 2016)   Like MinCq, CqBoost is an algorithm based on the minimization of the \(\mathcal {C}\)-bound. It was designed to accelerate MinCq and has proven to be a sparser \(\mathcal {C}\)-bound minimizer, hence enabling better interpretability and faster training.

CqBoost is based on a column generation process that iteratively builds the majority vote. It is similar to boosting in the way that, at each iteration t, the choice of the voter to add is made by greedily optimizing the edge criterion (Demiriz et al. 2002). Once a voter has been added to the current selected set, CqBoost finds the optimal weights by minimizing the \(\mathcal {C}\)-bound similarly to MinCq, solving a quadratic program.

The sparsity of CqBoost is explained by its ability to stop the iterative vote building early enough to avoid overfitting. Nevertheless, even though CqBoost is faster and sparser than MinCq, it is still not applicable to large datasets because its algorithmic complexity is \(\mathcal {O}(m \times T^2 + T^3)\), where T is the number of boosting iterations.

3 CB-Boost: A fast \(\mathcal {C}\)-Bound minimization algorithm

In this section, we present the mono-directional \(\mathcal {C}\)-bound minimization problem and its solution, which are central in CB-Boost. Then, we introduce CB-Boost ’s pseudo-code in Algorithm 1.

The empirical \(\mathcal {C}\)-bound of a distribution \(Q= \{\pi _1, \dots , \pi _n\}\) of n weights over \(\mathcal {H}= \{h_1, \dots , h_n\}\) a set of n voters, on a learning sample \(\mathcal {S}\) of m examples, is as follows.

$$\begin{aligned}{\mathscr{C}}_Q^{\mathcal{S}}= 1 - \frac{1}{m}\frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \mathop {\sum }\limits _{s=1}^{n} \pi _s h_s(x_i)\right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \mathop {\sum }\limits _{s=1}^{n} \pi _s h_s(x_i)\right) ^2}\,. \end{aligned}$$

It is proven in Appendix D.1 that if we use positive weights \(\{\alpha _1, \ldots , \alpha _n\} \in (\mathbb {R}^+)^n\) instead of a distribution, the empirical \(\mathcal {C}\)-bound is equivalent to using the distribution \(Q= \{\frac{\alpha _1}{\sigma }, \dots , \frac{\alpha _n}{\sigma }\}\), with \(\sigma \) being \(\mathop {\sum }\limits _{s=1}^{n}\alpha _s\). From here, we use the weights that do not sum to one in order to simplify the proofs.
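The intuition behind this equivalence can be sketched in one line, assuming \(\sigma > 0\): replacing \(\pi _s\) by \(\frac{\alpha _s}{\sigma }\) introduces the same factor \(\frac{1}{\sigma ^2}\) in the numerator and in the denominator of the empirical \(\mathcal {C}\)-bound, so it cancels out.

$$\begin{aligned} \frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \mathop {\sum }\limits _{s=1}^{n} \frac{\alpha _s}{\sigma } h_s(x_i)\right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \mathop {\sum }\limits _{s=1}^{n} \frac{\alpha _s}{\sigma } h_s(x_i)\right) ^2} = \frac{\frac{1}{\sigma ^2}\left( \mathop {\sum }\limits _{i=1}^{m} y_i \mathop {\sum }\limits _{s=1}^{n} \alpha _s h_s(x_i)\right) ^2}{\frac{1}{\sigma ^2}\mathop {\sum }\limits _{i=1}^{m} \left( y_i \mathop {\sum }\limits _{s=1}^{n} \alpha _s h_s(x_i)\right) ^2} = \frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \mathop {\sum }\limits _{s=1}^{n} \alpha _s h_s(x_i)\right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \mathop {\sum }\limits _{s=1}^{n} \alpha _s h_s(x_i)\right) ^2}\,. \end{aligned}$$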

3.1 Optimizing the \(\mathcal {C}\)-Bound in one direction

We outline some basic definitions to clarify the contributions of the following work: agreement ratio, margin and norm.

Definition 8

The agreement ratio between two voters \(h,h' \in \mathcal {H}\) (or two combinations of voters in \(\text {Conv}(\mathcal {H})\)) is defined as \(\tau (h,h') \triangleq \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} h(x_i) h'(x_i)\,.\)

Definition 9

The empirical margin of a single voter \(h\in \mathcal {H}\) (or combination of voters in \(\text {Conv}(\mathcal {H})\)) is \( \gamma (h) \triangleq \frac{1}{m} \mathop {\sum }\limits _{i=1}^{m}y_i h(x_i)\,\).

Definition 10

The squared and normalized L2-norm of \(h\in \mathcal {H}\) (or combination of voters in \(\text {Conv}(\mathcal {H})\)), is defined as \( \nu \left( h\right) \triangleq \frac{1}{m} \mathop {\sum }\limits _{i=1}^{m} h(x_i)^2 = \frac{1}{m}\left\Vert h\right\Vert _2^2\,\).
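Definitions 8 to 10 are simple averages over the learning sample. A minimal sketch, assuming that a voter (or a combination of voters) is represented by the list of its outputs on \(\mathcal {S}\), with hypothetical function names:

```python
def agreement(h, g):
    """tau(h, h'): average product of the two outputs over the sample."""
    return sum(a * b for a, b in zip(h, g)) / len(h)


def margin(h, y):
    """gamma(h): empirical margin of h with respect to the labels y."""
    return sum(yi * hi for yi, hi in zip(y, h)) / len(y)


def sq_norm(h):
    """nu(h): squared L2-norm of h, normalized by the sample size."""
    return sum(hi * hi for hi in h) / len(h)


# Hypothetical toy voters on four examples, all labelled +1.
y = [1, 1, 1, 1]
h1 = [1, 1, 1, -1]
h2 = [1, 1, -1, 1]
print(agreement(h1, h2), margin(h1, y), sq_norm(h1))
```

Each quantity costs one pass over the m examples.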

Here, we consider the \(\mathcal {C}\)-bound optimization in a single direction, meaning that all weights, except one, are fixed. For readability reasons, we introduce \(\forall i \in [m]\), \(F_{k}(x_i) = \mathop {\sum }\limits _{\begin{array}{c} s=1 \\ s\ne k \end{array}}^{n} \alpha _s h_s(x_i)\) which denotes the majority vote built by all the fixed weights and their corresponding voters, and \((\alpha , h_{k})\) the weight that varies during the optimization and its corresponding voter. We can thus rewrite the empirical \(\mathcal {C}\)-bound with respect to \(k\), the varying direction, as

$$\begin{aligned} \mathscr {C}_k(\alpha ) = 1 - \frac{1}{m}\frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \left( F_{k}(x_i) + \alpha h_{k}(x_i)\right) \right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \left( F_{k}(x_i) + \alpha h_{k}(x_i)\right) \right) ^2}. \end{aligned}$$

Our goal here is to find the optimal \(\alpha \) in terms of \(\mathcal {C}\)-bound, denoted by \(\alpha _k^*= \mathop {\text {arg}\,\text {min}}\limits _{\alpha \in \mathbb {R}^+} \mathscr {C}_k(\alpha )\). The following theorem is the central contribution of our work as it provides an analytical solution to this problem.

Theorem 3

\(\forall k \in [n]\) with the previously introduced notations, if \(\gamma (h_{k})> 0\) and \(\gamma (F_{k})> 0\), then

$$\begin{aligned} \alpha _k^*= \mathop {\text {arg}\,\text {min}}\limits _{\alpha \in \mathbb {R}^+} \mathscr {C}_k(\alpha ) = {\left\{ \begin{array}{ll}\frac{\gamma (h_{k})\nu \left( F_{k}\right) - \gamma (F_{k})\tau (F_{k}, h_{k})}{\left( \gamma (F_{k})- \gamma (h_{k})\tau (F_{k}, h_{k})\right) } \text { if }\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}, \\ 0\text { otherwise.}\end{array}\right. } \end{aligned}$$

The proof is provided in the Appendix, in Sect. D.1.
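The closed form of Theorem 3 can be checked numerically. The sketch below transcribes the formula and compares it with a coarse grid search on a toy example; the function names, the toy data and the final clamp to \(\mathbb {R}^+\) are assumptions of this illustration, not part of the theorem's statement.

```python
def c_bound_direction(alpha, F, h, y):
    """C_k(alpha): empirical C-bound of the vote F + alpha * h."""
    m = len(y)
    margins = [yi * (fi + alpha * hi) for yi, fi, hi in zip(y, F, h)]
    return 1 - sum(margins) ** 2 / (m * sum(v * v for v in margins))


def optimal_alpha(F, h, y):
    """Closed-form minimizer of C_k over alpha >= 0, transcribed from Theorem 3."""
    m = len(y)
    gamma_F = sum(yi * fi for yi, fi in zip(y, F)) / m
    gamma_h = sum(yi * hi for yi, hi in zip(y, h)) / m
    if gamma_F <= 0 or gamma_h <= 0:
        raise ValueError("Theorem 3 assumes gamma(F_k) > 0 and gamma(h_k) > 0")
    nu_F = sum(fi * fi for fi in F) / m
    tau = sum(fi * hi for fi, hi in zip(F, h)) / m
    if tau < gamma_F / gamma_h:
        alpha = (gamma_h * nu_F - gamma_F * tau) / (gamma_F - gamma_h * tau)
        return max(0.0, alpha)  # extra guard of this sketch: stay in R+
    return 0.0


# Hypothetical toy direction: F is a single voter, h errs on another example.
y = [1, 1, 1, 1]
F = [1, 1, 1, -1]
h = [1, 1, -1, 1]
alpha_star = optimal_alpha(F, h, y)
print(alpha_star, c_bound_direction(alpha_star, F, h, y))
```

Here the analytical solution gives both voters the same weight, and no point of the grid reaches a lower value of \(\mathscr {C}_k\).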

Theorem 3 states that in a specific direction, the \(\mathcal {C}\)-bound has a global minimum, provided three conditions hold. The first two (\(\gamma (h_{k})> 0\) and \(\gamma (F_{k})>0\)) are met trivially within our framework as \(h_{k}\) is a weak classifier and \(F_{k}\) is a positive linear combination of weak classifiers. The third one \(\big (\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\big )\) means that \(F_{k}\) and \(h_{k}\) are not supposed to be collinear, which, as we show in the next section, is not restrictive.

This theoretical result is the main step in building a greedy \(\mathcal {C}\)-bound minimization algorithm. Moreover, as long as there is a direction \(k\) in which \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), the \(\mathcal {C}\)-bound can be optimized in this direction, and every other one in which \(\tau (F_{k}, h_{k})\ge \frac{\gamma (F_{k})}{\gamma (h_{k})}\) is a dead end.

In terms of complexity, the solution to the minimization problem is obtained in \(\mathcal {O}(m)\) as \(\gamma (h_{k})\), \(\nu \left( F_{k}\right) \), \(\gamma (F_{k})\), and \(\tau (F_{k}, h_{k})\) are sums over the m examples of the training set \(\mathcal {S}\).

3.2 Optimally choosing the direction

In the previous subsection, we presented a theoretical result proving that, for a given direction, the \(\mathcal {C}\)-bound minimization problem has a unique solution. Here, we propose a way to optimally choose this direction and compare it to the main existing method.

Exhaustive search   In our framework, \(\mathcal {H}\) is finite and has a cardinality of n, implying that we have a finite number of available directions to choose from. As stated before, the minimum \(\mathcal {C}\)-bound in one direction is available in \(\mathcal {O}(m)\). So by computing these minima in each direction, in \(\mathcal {O}(n\times m)\), we are able to choose the optimal direction, in which the \(\mathcal {C}\)-bound decreases the most.
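The exhaustive search can be sketched as a loop over the candidate voters, each step costing \(\mathcal {O}(m)\). In the sketch below, the helper names, the treatment of voters with a non-positive margin as dead ends, and the clamp of the weight to \(\mathbb {R}^+\) are assumptions of this example.

```python
def c_bound(values, y):
    """Empirical C-bound of a (possibly unnormalized) vote with outputs `values`."""
    margins = [yi * vi for yi, vi in zip(y, values)]
    return 1 - sum(margins) ** 2 / (len(y) * sum(v * v for v in margins))


def best_alpha(F, h, y):
    """Weight given by Theorem 3 for direction h, or 0.0 for a dead end."""
    m = len(y)
    gamma_F = sum(yi * fi for yi, fi in zip(y, F)) / m
    gamma_h = sum(yi * hi for yi, hi in zip(y, h)) / m
    nu_F = sum(fi * fi for fi in F) / m
    tau = sum(fi * hi for fi, hi in zip(F, h)) / m
    if gamma_h > 0 and tau < gamma_F / gamma_h:
        return max(0.0, (gamma_h * nu_F - gamma_F * tau) / (gamma_F - gamma_h * tau))
    return 0.0


def best_direction(F, candidates, y):
    """Return (index, alpha, c_bound) of the candidate lowering the C-bound the most."""
    best = None
    for k, h in enumerate(candidates):
        alpha = best_alpha(F, h, y)
        value = c_bound([fi + alpha * hi for fi, hi in zip(F, h)], y)
        if best is None or value < best[2]:
            best = (k, alpha, value)
    return best


# Hypothetical toy data: the second candidate duplicates the current vote.
y = [1, 1, 1, 1]
F = [1, 1, 1, -1]
candidates = [[1, 1, -1, 1], [1, 1, 1, -1]]
print(best_direction(F, candidates, y))
```

The duplicated voter is collinear with the current vote, so its direction is a dead end and the search selects the other candidate.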

Comparison with gradient boosting   In the gradient boosting framework, the optimization direction is chosen by gradient minimization. Coupled with an adequate method to choose the step size, it is a very efficient way of optimizing a loss function. However, thanks to our theoretical analysis, we know that at each iteration of CB-Boost, the best direction is chosen and the optimal step size is known analytically.

Nonetheless, we present a comparison between our exhaustive method and a gradient boosting version that we call GB-CB-Boost. We show in the experiments (Sects. 5.1 and 5.2) that it has no significant advantage and it is less stable than CB-Boost. The details about the gradient boosting variant are explained in Appendix B, and a toy example gives an intuition on the difference between the two processes in Appendix C.

3.3 Presenting CB-Boost

Armed with the theoretical and practical results presented in the previous subsections, we are now ready to present the overall view of CB-Boost, which optimizes the training error of the majority vote through the iterative minimization of the mono-directional \(\mathcal {C}\)-bound presented in Theorem 3.

[Algorithm 1: pseudo-code of CB-Boost]

For the sake of clarity, we define \(I_1, \dots , I_T\) as a list that is initialized with zeros (Line 2), and that contains each of the chosen directions’ indices (Updated in Lines 3 and 12). To initialize CB-Boost, we use \(h_{I_1} \in \mathcal {H}\) the hypothesis with the best margin, we set its weight to 1, and all the others to zero (Lines 1, 3 and 4). This aims at accelerating the convergence by starting the vote building with the strongest available hypothesis.

Then, for each iteration t, we compute the \(\mathcal {C}\)-bound-optimal weights in every available direction, by solving multiple mono-directional optimization problems (Lines 7 to 11). The direction is then exhaustively chosen (Line 12).

After the initialization, the weights on \(\mathcal {H}\) are a Dirac distribution with the best-margin hypothesis’s weight being the only non-zero one, and at each iteration t, one more element of \(\mathcal {H}\) will have a non-zero weight \(\alpha _{I_t}\).

One major advantage of CB-Boost when compared to MinCq and CqBoost is the simplicity of Line 9, where its predecessors solve quadratic programs. Indeed, the algorithmic complexity of CB-Boost only depends on the number of iterations T, the number of examples m, and the number of hypotheses n. As each mono-directional \(\mathcal {C}\)-bound optimization is solved in \(\mathcal {O}(m)\) and n directions are explored at each iteration, one iteration costs \(\mathcal {O}(n \times m)\) and CB-Boost’s complexity is \(\mathcal {O}(n \times m \times T)\).
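The overall procedure can be sketched as follows. This is a minimal illustration reconstructed from the prose description of this section, not a transcription of Algorithm 1: the function name `cb_boost`, the restriction to directions that have not been chosen yet, the tie-breaking by first index, and the skipping of non-positive weights are all assumptions.

```python
def cb_boost(H, y, T):
    """Greedy C-bound minimization sketch; H[s][i] holds h_s(x_i) in {-1, 1}."""
    m = len(y)

    def gamma(v):
        return sum(yi * vi for yi, vi in zip(y, v)) / m

    def c_bound(v):
        margins = [yi * vi for yi, vi in zip(y, v)]
        return 1 - sum(margins) ** 2 / (m * sum(x * x for x in margins))

    # Initialization: best-margin voter with weight 1, all other weights at 0.
    first = max(range(len(H)), key=lambda s: gamma(H[s]))
    alphas = [0.0] * len(H)
    alphas[first] = 1.0
    chosen = [first]
    F = list(H[first])

    for _ in range(T - 1):
        g_F = gamma(F)
        nu_F = sum(f * f for f in F) / m
        best = None
        for k, h in enumerate(H):
            if k in chosen:
                continue  # assumption: each direction is used at most once
            g_h = gamma(h)
            tau = sum(f * hi for f, hi in zip(F, h)) / m
            if g_h <= 0 or tau >= g_F / g_h:
                continue  # dead-end direction
            alpha = (g_h * nu_F - g_F * tau) / (g_F - g_h * tau)
            if alpha <= 0:
                continue  # assumption: keep weights strictly positive
            value = c_bound([f + alpha * hi for f, hi in zip(F, h)])
            if best is None or value < best[2]:
                best = (k, alpha, value)
        if best is None:
            break  # no direction improves the C-bound
        k, alpha, _ = best
        alphas[k] = alpha
        chosen.append(k)
        F = [f + alpha * hi for f, hi in zip(F, H[k])]
    return alphas, chosen


# Hypothetical toy data: three voters erring on different examples.
H = [[1, 1, 1, -1], [1, 1, -1, 1], [1, -1, 1, 1]]
y = [1, 1, 1, 1]
alphas, chosen = cb_boost(H, y, T=3)
print(alphas, chosen)
```

Each iteration performs one pass over the n voters with \(\mathcal {O}(m)\) work per voter, which matches the \(\mathcal {O}(n \times m \times T)\) complexity stated above.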

3.4 Remarks

On the \(\mathcal {C}\)-bound indirect example re-weighting   To bring diversity in the majority vote, Adaboost (Freund and Schapire 1997) updates weights over the examples at each iteration, exponentially emphasizing the examples on which it previously failed.

In CB-Boost, by considering both the first and second moments of the margin, the \(\mathcal {C}\)-bound takes into account the individual performance of each voter and their disagreement. Therefore, minimizing the \(\mathcal {C}\)-bound requires keeping a trade-off between maximizing the vote’s margin and its internal disagreement. This is the reason why CB-Boost does not include any example weighting. Indeed, the mono-directional \(\mathcal {C}\)-bound minimization problem is equivalent to minimizing the following quantity

$$\begin{aligned} \left( \mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i)^2 + 2\alpha \mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i) h_{k}(x_i) + \alpha ^2 m \right) \frac{1}{\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}\,. \end{aligned}$$

Intuitively, in this expression, \(\mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i) h_{k}(x_i)\) is proportional to \(\tau (F_{k}, h_{k})\), so it decreases as the disagreement between \(h_{k}\) and \(F_{k}\) increases. It encourages CB-Boost to choose directions that perform well on hard examples. Moreover, \(\alpha ^2\) can be interpreted as a regularization term and \(\frac{1}{\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}\) encapsulates the quality of the vote.
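The equivalence can be made explicit with a short derivation sketch. Using \(y_i^2 = 1\), \(h_{k}(x_i)^2 = 1\) and Definition 9, the mono-directional empirical \(\mathcal {C}\)-bound can be rewritten as

$$\begin{aligned} \mathscr {C}_k(\alpha ) = 1 - \frac{m\left( \gamma (F_{k})+ \alpha \gamma (h_{k})\right) ^2}{\mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i)^2 + 2\alpha \mathop {\sum }\limits _{i=1}^{m} F_{k}(x_i) h_{k}(x_i) + \alpha ^2 m}\,, \end{aligned}$$

so that, whenever \(\gamma (F_{k})+ \alpha \gamma (h_{k}) \ne 0\), minimizing \(\mathscr {C}_k(\alpha )\) amounts to minimizing the inverse of the fraction, up to the constant factor m.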

On the difference between the majority votes   Intuitively, the concession made in CB-Boost to accelerate CqBoost and MinCq is focused on the weights of the majority vote. Indeed, CqBoost returns the majority vote that exactly minimizes the \(\mathcal {C}\)-bound for the considered set of voters, whereas CB-Boost returns sub-optimal weights because they have been optimized greedily throughout the iterations. Nevertheless, the \(\mathcal {C}\)-bound computed during the training phase is not an approximation for the considered majority vote, which explains the theoretical results of the next section. Moreover, in Fig. 5, we empirically show that the weight-by-weight optimization has similar accuracy to the quadratic programs of MinCq and CqBoost.

On the stopping criterion   In Sect. 3.1, we stated that as long as there is still a direction in which \(\tau (F_{k}, h_{k})< \frac{\gamma (F_{k})}{\gamma (h_{k})}\), the \(\mathcal {C}\)-bound can be optimized by CB-Boost. However, this is a very loose stopping criterion. In fact, as experimentally seen in Sect. 5, and as in CqBoost, it is far more interesting to use a fixed number of iterations as a hyper-parameter of the algorithm, as the main improvements are made during the first iterations of the algorithm. Restricting the number of iterations in this way helps to reach a sparse model.

4 Theoretical results on training and generalization aspects

4.1 Quantifying the empirical \(\mathcal {C}\)-Bound decrease

In this section, we quantify the decrease rate of the empirical \(\mathcal {C}\)-bound for each iteration of CB-Boost, depending on the previous one and the considered direction.

Property 1

During iteration t of CB-Boost, if \(I_t\) is the chosen direction’s index, \(h_{I_t}\) its corresponding voter, and \(F_{I_t} = \mathop {\sum }\limits _{\begin{array}{c} s=1 \\ s\ne I_t \end{array}}^{n} \alpha _s h_s\) the majority vote of all the other directions, then the empirical \(\mathcal {C}\)-bound decreases exactly by  

$$\begin{aligned}\mathscr {S}_{t}= \frac{\left( \gamma (h_{I_t})\nu \left( F_{I_t}\right) - \gamma (F_{I_t})\tau (F_{I_t}, h_{I_t})\right) ^2}{\nu \left( F_{I_t}\right) \left( \nu \left( F_{I_t}\right) - \tau (F_{I_t}, h_{I_t})^2 \right) } > 0\,.\end{aligned}$$

The proof is provided in the Appendix, in Sect. D.4.
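Property 1 can be checked numerically on a toy direction by comparing \(\mathscr {S}_{t}\) with the observed difference between the empirical \(\mathcal {C}\)-bound before and after the update. The sketch below assumes the first case of Theorem 3 applies (a non-dead-end direction); the function names and the data are hypothetical.

```python
def moments(F, h, y):
    """Return gamma(F), gamma(h), nu(F) and tau(F, h) from Definitions 8 to 10."""
    m = len(y)
    gamma_F = sum(yi * fi for yi, fi in zip(y, F)) / m
    gamma_h = sum(yi * hi for yi, hi in zip(y, h)) / m
    nu_F = sum(fi * fi for fi in F) / m
    tau = sum(fi * hi for fi, hi in zip(F, h)) / m
    return gamma_F, gamma_h, nu_F, tau


def c_bound(values, y):
    """Empirical C-bound of a vote with outputs `values`."""
    margins = [yi * vi for yi, vi in zip(y, values)]
    return 1 - sum(margins) ** 2 / (len(y) * sum(v * v for v in margins))


def predicted_decrease(F, h, y):
    """S_t as given in Property 1."""
    gamma_F, gamma_h, nu_F, tau = moments(F, h, y)
    return (gamma_h * nu_F - gamma_F * tau) ** 2 / (nu_F * (nu_F - tau ** 2))


def observed_decrease(F, h, y):
    """C-bound before the update minus C-bound after adding alpha* times h."""
    gamma_F, gamma_h, nu_F, tau = moments(F, h, y)
    alpha = (gamma_h * nu_F - gamma_F * tau) / (gamma_F - gamma_h * tau)
    return c_bound(F, y) - c_bound([fi + alpha * hi for fi, hi in zip(F, h)], y)


# Hypothetical toy data: current vote F and a new voter h.
y = [1, 1, 1, 1]
F = [2, 2, 0, 0]
h = [1, -1, 1, 1]
print(predicted_decrease(F, h, y), observed_decrease(F, h, y))
```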

4.2 Deriving the training error bound

Corollary 2

The training error of the majority vote built by CB-Boost at iteration \(t>2\) is bounded by

$$\begin{aligned} 1 - \frac{1}{m}\frac{\left( \mathop {\sum }\limits _{i=1}^{m} y_i \left[ h_{I_1}(x_i) + \alpha _{I_2} h_{I_2}(x_i)\right] \right) ^2}{\mathop {\sum }\limits _{i=1}^{m} \left( y_i \left[ h_{I_1}(x_i) + \alpha _{I_2} h_{I_2}(x_i)\right] \right) ^2} - \mathop {\sum }\limits _{j=3}^{t} \mathscr {S}_{j}\,, \end{aligned}$$

with \(\mathscr {S}_{j}\) being the quantity introduced in Property 1.

The proof is straightforward by combining Corollary 1 and Property 1.

This training error bound allows us to assess CB-Boost’s capacity to learn relevant models based on the available pool of voters.

4.3 Generalization guarantees

Theorem 2 presents a PAC-bound that gives generalization guarantees based on the empirical \(\mathcal {C}\)-bound. In order to apply it to CB-Boost’s output \({F= \mathop {\sum }\limits _{s=1}^{n} h_s \alpha _s}\), we use \(Q= \{\frac{\alpha _1}{\sigma }, \dots , \frac{\alpha _n}{\sigma }\}\) with \(\sigma = \mathop {\sum }\limits _{s=1}^{n} \alpha _s\). Note that only the T weights corresponding to the chosen directions are non-zero. So, according to Theorem 2, with probability at least \(1-\delta \) over the choice of the sample \(\mathcal {S}\) drawn according to D,

$$\begin{aligned}R_{D}(F) \le 1 - \frac{\left( max\left( 0, \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} y_i \underset{h \sim Q}{\mathbf {E}} h(x_i) - \sqrt{\frac{2}{m}\left[ KL(Q||P)+\ln \left( \frac{2\sqrt{m}}{\delta /2}\right) \right] }\right) \right) ^2}{min\left( 1, \frac{1}{m}\mathop {\sum }\limits _{i=1}^{m} \left( y_i \underset{h \sim Q}{\mathbf {E}} h(x_i)\right) ^2+\sqrt{\frac{2}{m}\left[ 2 KL(Q||P)+\ln \left( \frac{2 \sqrt{m}}{\delta /2}\right) \right] }\right) }, \end{aligned}$$

These guarantees are tighter when the empirical \(\mathcal {C}\)-bound of a majority vote is small, which is exactly what CB-Boost aims at returning. Moreover, as seen in Sect. 2.2, returning a majority vote with a small KL(Q||P) is essential in order to have good generalization guarantees. In Roy et al. (2016), the authors established that if the number of voters is far lower than the number of examples, \(n \ll m\), then minimizing KL(Q||P) is negligible in comparison with minimizing the \(\mathcal {C}\)-bound of the majority vote.

If the case \(n \ll m\) is not applicable, we need to characterize KL(Q||P) intuitively. We use a uniform prior on \(\mathcal {H}\), and since at each iteration of CB-Boost one more weight of \(Q\) becomes non-zero, KL(Q||P) will increase as the number of non-zero weights of the posterior grows. Moreover, we proved that the \(\mathcal {C}\)-bound of \(Q\) decreases over the iterations of CB-Boost. Thus, in order to keep the trade-off between KL(Q||P) and the \(\mathcal {C}\)-bound for the bound of Theorem 2, it is relevant to use early-stopping by choosing a maximal number of iterations. Consequently, based on the generalization guarantees, we set the maximum number of iterations of CB-Boost as a hyper-parameter that can be chosen using hold-out data.

Comparison to non PAC-Bayesian generalization bounds   In Cortes et al. (2014), a tight bound based on the Rademacher complexity is given for majority votes \(F\) of \(\text {Conv}(\mathcal {H})\). This bound depends on \(\hat{R}_{{\mathcal {S}}, \rho }(F) = \underset{(x,y) \sim {\mathcal {S}}}{\mathbf {E}}[1_{yF(x) \le \rho }]\) the training error of \(F\)’s margin being lower than \(\rho \), over the sample \(\mathcal {S}\) drawn according to D and \(\mathfrak {R}_{m}(\mathcal {H})\) the Rademacher complexity of \(\mathcal {H}\), as

$$\begin{aligned} R_{D}(F) \le \hat{R}_{{\mathcal {S}},\rho }(F) + \frac{4}{\rho } \mathop {\sum }\limits _{t=1}^{T} \alpha _t \mathfrak {R}_{m}(\mathcal {H}) + \frac{2}{\rho } \sqrt{\frac{\ln n}{m}} + \sqrt{\lceil \frac{4}{\rho ^2}\ln \left[ \frac{\rho ^2m}{\ln n}\right] \rceil \frac{\ln n}{m} + \frac{\ln \frac{2}{\delta }}{2m}}\,. \end{aligned}$$

As explained in the previous paragraph, the size of KL(Q||P) is either negligible or depends on the complexity of the distribution Q, and can then be reduced by early stopping. So, intuitively, both expressions rely on \(\sqrt{\frac{1}{m}}\) and are thus fairly equivalent. The main difference is that the Rademacher-based bound relies on the training error of the majority vote and the Rademacher complexity of the hypothesis space \(\mathcal {H}\) (in our case \(\mathfrak {R}_{m}(\mathcal {H}) \le \sqrt{\frac{2 \ln (|\mathcal {H}|)}{m}}\), as our classifiers only output \(-1\) or 1) whereas the \(\mathcal {C}\)-bound’s PAC-Bayesian bound relies on maximizing \(\frac{\mu _1(M^{\mathcal {S}}_Q)^2}{\mu _2(M^{\mathcal {S}}_Q)}\), which CB-Boost explicitly does.

5 Experiments

The code used to run each of the following experiments is available here: https://gitlab.lis-lab.fr/baptiste.bauvin/cb-boost-exps.

In order to experimentally study CB-Boost, we first compare it to CqBoost (Roy et al. 2016), MinCq (Germain et al. 2015) and GB-CB-Boost (Sect. 3.2), focusing on efficiency, sparsity and performance on chosen datasets featuring various properties. Then, we compare it to Adaboost (Freund and Schapire 1997) and other ensemble methods: Random Forest (Breiman 2001), Bagging (Breiman 1996), and Gradient Boosting (Friedman 2001).

5.1 Computational time improvement

In order to highlight the main advantage of CB-Boost, we first analyze the computational time improvement obtained by greedily minimizing the \(\mathcal {C}\)-bound, comparing CB-Boost and GB-CB-Boost to CqBoost’s and MinCq’s quadratic programs. Moreover, to compare them with a fast, broadly used boosting algorithm, we compare CB-Boost’s computational efficiency with Adaboost’s.

Protocol   In this experiment, we compare MinCq, CqBoost, GB-CB-Boost, CB-Boost and Adaboost by varying three factors: the training set size, the hypothesis set size, and the number of boosting iterations, denoted m, n and T, respectively. To do so, we use MNist 4v9 (LeCun and Cortes 2010), as it provides 11791 examples and 784 features. Moreover, to be as fair as possible with the quadratic programs, we allowed 8 threads for the solver. For comparative purposes, GB-CB-Boost and CB-Boost are based on the same code, so the only difference between their implementations is the gradient computation.
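A timing sweep over one of these factors can be sketched as follows. This is only an illustration of the measurement principle, not our benchmark code: `fit` stands for any hypothetical training routine, `dummy_fit` is a placeholder, and taking the first m examples as the sub-sample is an assumption.

```python
import time
from statistics import mean


def time_fit(fit, X, y, repeats=3):
    """Average wall-clock duration of fit(X, y) over several repetitions."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def sweep_training_size(fit, X, y, sizes, repeats=3):
    """Time the same routine on the first m examples, for each m in sizes."""
    return {m: time_fit(fit, X[:m], y[:m], repeats) for m in sizes}


def dummy_fit(X, y):
    """Placeholder learner (hypothetical): sums the features of each example."""
    return [sum(row) for row in X]
```

The same pattern applies to n and T by changing which argument is varied while the two others are kept constant.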

Fig. 1 Efficiency of \(\mathcal {C}\)-bound algorithms and Adaboost on MNist 0v9. In panels a, b and c, the ordinate axis is broken to highlight the differences between the fast algorithms while still showing the most time-consuming ones

Results   In Fig. 1a, we vary n, the number of available hypotheses, with constant m and T. As expected, MinCq has a very long computational time. Moreover, even if CqBoost’s boosting-like structure reduces its time consumption, it is still much slower than the three other algorithms.

In Fig. 1b, we vary m. It has a small effect on MinCq, but a huge effect on CqBoost, as CqBoost runs \(T=50\) iterations and solves one increasingly complex quadratic program per iteration, whereas MinCq solves only one. Here, the apparently small difference between the other algorithms is due to the scale imposed by CqBoost’s and MinCq’s long durations.

Figure 1c only plots the iterative algorithms, and the duration difference between CqBoost and the greedy ones is clear even for this small learning set.

To sum up the results of these three sub-experiments, Adaboost, CB-Boost and GB-CB-Boost are far faster than the other \(\mathcal {C}\)-bound algorithms. However, Adaboost and GB-CB-Boost seem to be slightly more time consuming than CB-Boost. To compare them, we analyze Fig. 1d, which shows that even though they are close, CB-Boost is consistently faster than the others. Moreover, GB-CB-Boost’s gradient computation is visibly more time consuming than CB-Boost’s exhaustive search.

To conclude, CB-Boost is a substantial acceleration of CqBoost and MinCq, and is faster than Adaboost and GB-CB-Boost, which means that it is able to scale up to much bigger datasets than the other \(\mathcal {C}\)-bound-based algorithms.

5.2 Performance comparison: \(\mathcal {C}\)-bound-based algorithms equivalence

Here, we compare CB-Boost’s performance in zero-one loss with that of the other \(\mathcal {C}\)-bound minimizers. Our goal is to measure the potential decrease in accuracy that greedy minimization would imply.


Protocol

  • Datasets All the datasets used in this experiment are presented in Table 1; an in-depth presentation is given in Appendix E.

  • Classifiers For each classifier, hyper-parameters were found with a randomized search with 30 draws, over a distribution that incorporates prior knowledge about the algorithm but is independent of the dataset. They were validated by 5-fold cross-validation, and each experiment was run ten times, the results in Table 2 being the mean and standard deviation over these runs. These results are not statistically significant, but repeating the experiments helps avoid an outlier split that could bias the results.
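The protocol above can be sketched as follows, under stated assumptions: `sample_params` and `cv_loss` are hypothetical callables standing for the dataset-independent hyper-parameter distribution and for the training and validation of one classifier on one fold. This is a simplified illustration, not the code used for Table 2.

```python
import random
from statistics import mean, stdev


def kfold_indices(n_samples, k, rng):
    """Shuffle the example indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def randomized_search(sample_params, cv_loss, n_samples, n_draws=30, k=5, seed=0):
    """Keep the drawn hyper-parameters with the lowest mean k-fold validation loss.

    sample_params(rng) -> params                    (hypothetical distribution)
    cv_loss(params, train_idx, val_idx) -> float    (hypothetical train/validate)
    """
    rng = random.Random(seed)
    folds = kfold_indices(n_samples, k, rng)
    best_params, best_loss = None, float("inf")
    for _ in range(n_draws):
        params = sample_params(rng)
        losses = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            losses.append(cv_loss(params, train_idx, val_idx))
        loss = mean(losses)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def summarize(test_losses):
    """Mean and standard deviation of the test losses over the repeated runs."""
    return mean(test_losses), stdev(test_losses)
```

Calling `summarize` on the ten test losses of one algorithm on one dataset yields the kind of mean and standard deviation pair that a table cell reports.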

Results   Table 2 shows that, for all datasets except Ionosphere and MNist-5v6, CB-Boost performs at least as well as the other \(\mathcal {C}\)-bound algorithms when the standard deviation is taken into account. MinCq clearly has the best zero-one loss on nearly all the datasets, as it uses all the available hypotheses. However, CB-Boost is competitive with both CqBoost and MinCq, and is even better on three datasets. On the more complex Animals with Attributes, CB-Boost is the best of the \(\mathcal {C}\)-bound algorithms. As expected, GB-CB-Boost shows no significant improvement in accuracy, and is even less stable on australian, MNist-7v9 and MNist-5v6.

Table 1 Datasets used in the different experiments
Table 2 Zero one loss on test

CB-Boost is competitive with both the quadratic \(\mathcal {C}\)-bound minimization algorithms 11 times out of 13, and has the best mean 3 times. Considering its shorter computational time, it is more efficient at extracting the information.

5.3 Sparsity conservation

Protocol   We analyze here the convergence speed of the five previously studied algorithms with grid-searched optimal hyper-parameters, when needed (MinCq’s margin hyper-parameter set to 0.05, and CqBoost’s margin and error parameters, respectively set to 0.001 and \(10^{-6}\)). The chosen dataset is MNist 0v9, as it is more complex than the UCI datasets but still small enough for CqBoost and MinCq to have a reasonable computing time on it. We use the same hypotheses, training and testing sets as previously.

Results   Figure 2 shows that, even if CB-Boost converges a bit less quickly than Adaboost and CqBoost on the training set, its performance on the test set is slightly better than that of all the other algorithms. Moreover, CqBoost’s and CB-Boost’s sparsity visibly prevents them from the slight overfitting of MinCq. Furthermore, CqBoost seems to profit more from early stopping than CB-Boost, although the sparsity of both is clear, as they have a better test performance than MinCq from the 30th voter onward. Finally, the experiments do not reveal any visible difference in sparsity between CB-Boost and GB-CB-Boost.

So, thanks to the closed-form solution used in CB-Boost, its computing time will always have an edge over that of GB-CB-Boost. This, together with the fact that, as expected (and empirically shown), GB-CB-Boost is less stable, leads us to concentrate our analysis on CB-Boost for the remainder of the paper.

Fig. 2 Zero-one loss on train (left) and \(-\log (\text {zero-one loss})\) on test (right) throughout the learning process on MNist 0v9

5.4 Performance comparison: noise robustness equivalence

In this section, we compare CB-Boost and Adaboost considering accuracy. The aim here is (1) to quantify how far CB-Boost is from Adaboost and (2) to evaluate their compared robustness to noise. Please note that the version of Adaboost used for these experiments is Adaboost.SAMME (Zhu et al. 2006), which is more robust to noise than the original.

Protocol   We use the same framework as in Sect. 5.2 regarding datasets, train-test splits and hyper-parameter optimization. A Gaussian distribution is used to add noise to the data (see Appendix F.1 for a detailed protocol and an example). The MNist dataset requires a special pre-processing to reduce its contrast in order for the noise to have an effect on classification. See Appendix F.2 for details.
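As a rough illustration of the noise injection, the sketch below adds zero-mean Gaussian noise independently to every feature. The exact protocol is the one of Appendix F.1; the function name, the `sigma` parameter and the per-feature independence are assumptions made for this example only.

```python
import random


def add_gaussian_noise(X, sigma, seed=0):
    """Return a noisy copy of X; each entry receives N(0, sigma^2) noise.

    X is a list of feature vectors; sigma plays the role of the noise level
    (assumption: the same level is applied to every feature).
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in X]
```

Fixing the seed makes a given noise level reproducible across the compared algorithms, so that they are all evaluated on the same perturbed data.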

Results   In Fig. 3, we show a view of the results, with each matrix presenting the comparison between Adaboost and CB-Boost for each dataset (rows) and noise level (columns). It shows that, except for the first four noise levels of ionosphere, where the difference is noteworthy, no significant loss of accuracy is observed when using CB-Boost, as the maximum difference between the other scores is 2.5%. In Appendix F.3, the numerical results are provided.
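The cell-by-cell comparison of Fig. 3 can be sketched as below. The criterion used here, namely that two scores are equivalent when their gap does not exceed the sum of their standard deviations, is an assumption made for the illustration; the figure only states that equivalence is assessed by considering the standard deviations.

```python
def compare_losses(mean_ada, std_ada, mean_cb, std_cb):
    """Compare the zero-one losses of Adaboost and CB-Boost on one cell.

    Returns ("equivalent", 0.0) when the gap is within the summed standard
    deviations (assumed criterion), else the name of the better method and
    the absolute difference between the two mean losses.
    """
    gap = mean_cb - mean_ada
    if abs(gap) <= std_ada + std_cb:
        return "equivalent", 0.0
    winner = "adaboost" if gap > 0 else "cb-boost"
    return winner, abs(gap)
```

Under this criterion, a cell is reported as a difference only when the two mean losses are separated by more than their combined spread.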

Fig. 3 Zero-one loss of CB-Boost and Adaboost. For each dataset (rows), we used several levels of noise (columns). An orange square means that Adaboost is better than CB-Boost in zero-one loss (\(\mathcal {L}(\text {Ada}) < \mathcal {L}(\text {CB-B})\)), and the difference between their scores is printed inside. A blue square means that they are equivalent, when considering their standard deviation

5.5 Performance comparison: ensemble methods equivalence

We present the results of the performance comparison between CB-Boost and four other ensemble methods: Adaboost (Freund and Schapire 1997), Bagging (Breiman 1996), Random Forests (Breiman 2001), and Gradient Boosting (Friedman 2001), for which we use the same protocol as in Sect. 5.4. This experiment differs from Sect. 5.4 in that we do not average the results on each dataset; instead, in Fig. 4, we plot one dot for each train/test split of each dataset, for the sake of legibility.

Results   In Fig. 4, the distance to the \(x=y\) line represents the difference in zero-one loss. So, a dot under the line means that the ensemble method has a lower loss than CB-Boost, and a dot over the line means that it has a higher loss. One can see that CB-Boost is similar to the state of the art for most of the datasets and train/test splits. The only perceptible tendency is that Bagging is frequently worse than CB-Boost, while the other methods are equivalent to it.

Fig. 4 Zero-one loss of each ensemble method against that of CB-Boost on the test set. A dot represents one of the 10 splits of each dataset of Table 1

5.6 \(\mathcal {C}\)-bound approximation characterization

Here, we aim at empirically analyzing the approximation made by greedily minimizing the \(\mathcal {C}\)-bound instead of using a quadratic program solver. To do so, at each iteration of CB-Boost, we run MinCq on the selected subset of voters to compute its optimal \(\mathcal {C}\)-bound.
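For reference, the quantity compared in this experiment can be computed as in the following sketch, which assumes the usual empirical form of the \(\mathcal {C}\)-bound, \(1 - \frac{\mu _1(M^{\mathcal {S}}_Q)^2}{\mu _2(M^{\mathcal {S}}_Q)}\), where \(\mu _1\) and \(\mu _2\) are the first and second moments of the margin of the vote on the training set. The function and argument names are hypothetical.

```python
def empirical_c_bound(votes, weights, labels):
    """Empirical C-bound 1 - mu_1^2 / mu_2 of a weighted majority vote.

    votes[i][j]: output in {-1, 1} of voter j on example i
    weights[j]:  non-negative weight of voter j (normalized to sum to 1 here)
    labels[i]:   true label in {-1, 1} of example i
    """
    total = sum(weights)
    q = [w / total for w in weights]
    margins = [
        y * sum(w * v for w, v in zip(q, row)) for row, y in zip(votes, labels)
    ]
    mu1 = sum(margins) / len(margins)                     # first moment
    mu2 = sum(mg * mg for mg in margins) / len(margins)   # second moment
    if mu1 <= 0:
        raise ValueError("the C-bound requires a positive first moment")
    return 1.0 - mu1 ** 2 / mu2
```

Evaluating this function on the weights produced by the greedy procedure and on those returned by a quadratic program over the same voters gives the two values whose gap is studied here.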

Protocol   We use the MNist 4v9 dataset with the same sets as previously and predefined hyper-parameters: CB-Boost’s maximum number of iterations is set to 200, and MinCq’s margin parameter to 0.05.

Results   Figure 5 shows that there is a slight but noticeable difference between the \(\mathcal {C}\)-bound of CB-Boost’s majority vote and that of MinCq. However, this difference only has a small impact on CB-Boost’s performance, during the first 30 iterations in train and the first 80 iterations in test. So even if the expected difference in \(\mathcal {C}\)-bound optimality is noteworthy, it does not substantially impact the performance of CB-Boost. Finally, the small gap between the two \(\mathcal {C}\)-bounds empirically suggests that CB-Boost keeps the qualities of CqBoost and MinCq.

Fig. 5 Zero-one loss and \(\mathcal {C}\)-bound for CB-Boost and the optimal \(\mathcal {C}\)-bound version MinCq(CB-Boost) on MNist 4v9. The dotted, dashed and solid lines represent the \(\mathcal {C}\)-bound, the training error and the testing error, respectively

6 Conclusion

In this paper, we presented CB-Boost, a greedy \(\mathcal {C}\)-bound minimization algorithm. While maintaining its predecessors’ sparsity and accuracy properties, it has much lighter computational demands. Its optimization process relies on a theoretical result allowing CB-Boost to efficiently minimize the \(\mathcal {C}\)-bound in one direction at a time. This algorithm keeps the training and generalization guarantees given by the \(\mathcal {C}\)-bound (Lacasse et al. 2006) and has the interesting property of allowing a quantification of the decrease of its bound and training error.

Experimentally, the comparison of CB-Boost with relevant methods shows a real improvement in computational demand, without loss of accuracy. Furthermore, the experiments show that CB-Boost slightly improves the sparsity of the models, which is the main property of CqBoost. Finally, it is competitive with four state-of-the-art ensemble methods with regard to performance, and with Adaboost in computational efficiency, sparsity and noise robustness.

In future work, we will analyze deeper theoretical properties of CB-Boost, focusing on its dual form, finding a stronger stopping criterion and adapting CB-Boost to infinite hypothesis spaces.