1 Introduction

A weighted majority vote is an ensemble method (Dietterich 2000; Re and Valentini 2012) in which each of several classifiers (or voters) is assigned a specific weight. Such approaches are motivated by the idea that a careful combination can compensate for the individual classifiers’ errors and thus achieve better robustness and performance. For this reason, ensemble learning has been a prominent research area in machine learning and many methods have been proposed in the literature, including Bagging (Breiman 1996), Boosting (Schapire and Singer 1999) and Random Forests (Breiman 2001). The problem has also been studied from a Bayesian learning perspective, for instance with Bayesian model averaging (Haussler et al. 1994; Domingos 2000). Multimedia analysis is a prolific application area, for instance for combining classifiers learned from different modalities of the data (Atrey et al. 2010).

Even though combining weak classifiers such as in Boosting (Freund and Schapire 1996) is supported by a solid theory, understanding when weighted majority votes perform better than a classic averaging of the voters is still a difficult question. In this context, PAC-Bayesian theory (McAllester 1999) offers an appropriate framework to study majority votes and learn them in a principled way and with generalization guarantees. In particular, the recently-proposed MinCq (Laviolette et al. 2011) optimizes the weights of a set of voters \(\mathcal{H}\) by minimizing a bound—the \(C\)-bound (Lacasse et al. 2007)—involving the first two statistical moments of the margin achieved on the training data. The authors show that minimizing this bound allows one to minimize the true risk of the weighted majority vote and boils down to a simple quadratic program. MinCq returns a posterior distribution on \(\mathcal{H}\) that gives the weight of each voter. It is based on an a priori uniform belief on the relevance of the voters, which is well-suited when combining weak classifiers. For instance, it has been successfully applied to weighted majority votes of decision stumps and RBF kernel functions. However, this uniform prior is not appropriate when one wants to combine efficiently various classifiers with different levels of performance.

In this paper, we claim that MinCq can be extended to deal with variable-performing classifiers when one has an a priori belief on the voters. We generalize MinCq in two respects. First, we propose a new formulation by extending the original notion of aligned distribution (Germain et al. 2011) to \({\mathbf {P}}\)-aligned distributions. \({\mathbf {P}}\) models a constraint over the distribution on the weights of the voters, allowing us to incorporate an a priori belief on each voter that constrains the posterior distribution. Our extension, called P-MinCq, does not induce any loss of generality, and we show that this new problem can still be formulated efficiently as a quadratic program. Second, we extend the proofs of convergence of Laviolette et al. (2011) to the sample compression setting (Graepel et al. 2005), where the voters are built from training examples, such as NN classifiers. Our results use arguments similar to those proposed in (Germain et al. 2011; Laviolette and Marchand 2007) but our setting requires a specific proof, since the results of Germain et al. (2011) are only valid for surrogate losses bounding the \(0-1\) loss.

The second part of the paper makes use of these two general contributions to optimize a weighted majority vote over a set of \(k\)-NN classifiers (\(k\in \{1,2,\ldots \}\)) to highlight the benefit of an a priori belief on the voters. We propose a suitable a priori constraint \({\mathbf {P}}\) modeling the fact that we have more confidence in close neighborhoods. The idea is to impose a priori larger (resp. smaller) weights on classifiers with small (resp. large) values of \(k\) to reflect the belief that local neighborhoods convey more relevant information than distant ones, which cannot be modeled by the uniform belief used in MinCq. Using P-MinCq in this context constitutes an original approach to learning a robust combination of NN classifiers that achieves better accuracy. This is confirmed by experiments conducted on twenty benchmark datasets: P-MinCq clearly outperforms \(k\)-NN, a symmetric version of it (Nock et al. 2003), as well as MinCq based on the same voters. Moreover, for high-dimensional problems, P-MinCq turns out to be quite robust to overfitting. We also show that it is competitive with the metric learning algorithm LMNN (Weinberger and Saul 2009) and that plugging the learned distance into P-MinCq can further improve the results. Finally, we apply our approach to an object categorization dataset, on which P-MinCq again achieves good performance.

This paper is organized as follows. Section 2 reviews MinCq and its theoretical basis. In Sect. 3, we introduce P-MinCq, our extension of MinCq to \({\mathbf {P}}\)-aligned distributions. We derive generalization bounds for the sample compression case in Sect. 4. Section 5 shows that MinCq does not perform well when using NN-based voters and presents a \({\mathbf {P}}\)-aligned distribution that is suitable to this context. Experiments are presented in Sect. 6.

2 Notations and background

2.1 Preliminaries

Throughout this paper, we consider the framework of the algorithm MinCq (Laviolette et al. 2011) for learning a weighted majority vote over a set of real-valued voters for binary classification problems. Let \(\mathcal{X}\subseteq \mathbb {R}^d\) be the input space of dimension \(d\) and \(\mathcal{Y}=\{-1,+1\}\) be the output space (i.e., the set of possible labels). \(S\) denotes the training sample made of \(m\) labeled examples \(({\mathbf {x}},y)\) drawn i.i.d. over \(\mathcal{X} \times \mathcal{Y}\) according to a fixed and unknown distribution \(D\). The distribution of a sample \(S\) of size \(m\) is denoted by \(D^m\). MinCq has its roots in the PAC-Bayesian theory, first introduced by McAllester (1999). Given a set of voters \(\mathcal {H}\), this theory is based on a prior distribution \(P\) and a posterior distribution \(Q\), both of support \(\mathcal {H}\). \(P\) models the a priori information on the relevance of the voters: those that are believed to perform best have larger weights in \(P\) (Footnote 1). By taking into account the information carried by \(S\), the learner aims at adapting \(P\) to get the posterior distribution \(Q\) that implies the \(Q\)-weighted majority vote with the best generalization performance.

Definition 1

Let \(\mathcal {H}=\{h_1,\ldots ,h_{n}\}\) be a set of voters (or classifiers) from \(\mathcal {X}\) to \(\mathbb {R}\). Let \(Q\) be a distribution over \(\mathcal{H}\). A \(Q\)-weighted majority vote classifier \(B_{Q}\) (sometimes called the Bayes classifier) is defined by:

$$\begin{aligned} \displaystyle \forall \mathbf{x}\in \mathcal{X},\ B_Q(\mathbf {x}) =\mathrm{sign }\left[ \underset{h \sim Q}{\mathbf {E}}h(\mathbf {x})\right] = \mathrm{sign }\left[ \sum _{h\in \mathcal{H}} Q(h) h(\mathbf {x})\right] . \end{aligned}$$

The true risk \(R_{D}(B_Q)\) over the pairs \((\mathbf {x},y)\) i.i.d. according to \(D\) is:

$$\begin{aligned} \displaystyle R_{D}(B_Q)=\underset{(\mathbf {x},y) \sim D}{\mathbf {E}} I[ B_Q(\mathbf {x})\ne y], \end{aligned}$$

where \(I[.]\) is an indicator function.
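As a minimal sketch of Definition 1, the following Python snippet evaluates a \(Q\)-weighted majority vote and its empirical risk on a finite sample. The threshold voters, the weights and the toy sample are hypothetical and serve only as an illustration; a zero weighted sum is mapped to 0 and therefore counted as an error.

```python
def majority_vote(voters, q, x):
    """Return B_Q(x) = sign(sum_h Q(h) h(x)); a zero sum is mapped to 0."""
    score = sum(w * h(x) for h, w in zip(voters, q))
    return (score > 0) - (score < 0)


def empirical_risk(voters, q, sample):
    """Fraction of pairs (x, y) in `sample` on which B_Q(x) != y."""
    errors = sum(majority_vote(voters, q, x) != y for x, y in sample)
    return errors / len(sample)


# Hypothetical toy problem: three threshold voters on the real line.
voters = [
    lambda x: 1.0 if x > 0.0 else -1.0,
    lambda x: 1.0 if x > 1.0 else -1.0,
    lambda x: 1.0 if x > -1.0 else -1.0,
]
q = [0.5, 0.25, 0.25]  # a distribution Q over the three voters
sample = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]

risk = empirical_risk(voters, q, sample)
print(risk)
```

Here the risk is computed on a sample, so it corresponds to an empirical estimate rather than to the true risk \(R_D(B_Q)\), which depends on the unknown distribution \(D\).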

Laviolette et al. (2011) and Lacasse et al. (2007) make the link between the risk \(R_D(B_Q)\) and the following notion of \(Q\)-margin which models the confidence of \(B_Q\) in its labeling.

Definition 2

(Laviolette et al. 2011) The \(Q\)-margin of an example \((\mathbf {x},y)\) over \(Q\) is:

$$\begin{aligned} \mathcal{M}_Q(\mathbf {x},y) = y\underset{h \sim Q}{\mathbf {E}}h(\mathbf {x}). \end{aligned}$$

The first and second moments of the \(Q\)-margin are:

$$\begin{aligned} \mathcal{M}_Q^{D}&=\underset{(\mathbf {x},y) \sim D}{\mathbf {E}}\mathcal{M}_Q(\mathbf {x},y) =\underset{h\sim Q}{\mathbf {E}}\underset{(\mathbf {x},y) \sim D}{\mathbf {E}}yh(\mathbf {x}),\ \ \text{ and }\\ \mathcal{M}_{Q^2}^{D}&=\!\!\!\underset{(\mathbf {x},y) \sim D}{\mathbf {E}}\!\!(\mathcal{M}_Q(\mathbf {x},y))^2 =\!\!\! \underset{(h,h') \sim Q^2}{\mathbf {E}}\!\underset{(\mathbf {x},y) \sim D}{\mathbf {E}}\!\!\!h(\mathbf {x})h'(\mathbf {x}). \end{aligned}$$

It is easy to see that \(B_Q\) correctly classifies an example \({\mathbf {x}}\) if the \(Q\)-margin is strictly positive. Thus, under the convention that if \(\mathcal{M}_Q(\mathbf {x},y)=0\), then \(B_Q\) errs on \((\mathbf{x},y)\), we get:

$$\begin{aligned} R_{D}(B_Q) =\underset{(\mathbf{x},y)\sim D}{\mathbf {Pr}}\left( \mathcal{M}_Q(\mathbf{x},y)\le 0\right) . \end{aligned}$$
(1)

Let us finally introduce the following necessary notations:

$$\begin{aligned} \displaystyle \mathcal{M}_h^{D}=\underset{(\mathbf {x},y)\sim D}{\mathbf {E}}yh(\mathbf {x}),\ \ \text{ and }\ \ \displaystyle \mathcal{M}_{h,h'}^{D}=\underset{(\mathbf {x},y)\sim D}{\mathbf {E}}h(\mathbf {x})h'(\mathbf {x}). \end{aligned}$$
(2)

If we use the training sample \(S\!\sim \! D^m\) instead of the unknown distribution \(D\), we get the empirical risk \(R_S(B_Q)\), the empirical first and second moments of the \(Q\)-margin \(\mathcal{M}_{Q}^S\) and \(\mathcal{M}_{Q^2}^S\), and the associated \(\mathcal{M}_h^{S}\) and \(\mathcal{M}_{h,h'}^{S}\).
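The empirical moments can be sketched as follows. The snippet computes \(\mathcal{M}_{Q}^S\) and \(\mathcal{M}_{Q^2}^S\) directly from the \(Q\)-margins, and checks that the second moment agrees with its pairwise form based on \(\mathcal{M}_{h,h'}^{S}\). The voters and the sample are hypothetical toy choices.

```python
def q_margin(voters, q, x, y):
    """Q-margin of (x, y): y times the Q-weighted average of the voters' outputs."""
    return y * sum(w * h(x) for h, w in zip(voters, q))


def first_moment(voters, q, sample):
    return sum(q_margin(voters, q, x, y) for x, y in sample) / len(sample)


def second_moment(voters, q, sample):
    return sum(q_margin(voters, q, x, y) ** 2 for x, y in sample) / len(sample)


def second_moment_pairwise(voters, q, sample):
    """Same quantity written as sum over (h, h') of Q(h) Q(h') M_{h,h'}^S."""
    m = len(sample)
    total = 0.0
    for h, w in zip(voters, q):
        for h2, w2 in zip(voters, q):
            total += w * w2 * sum(h(x) * h2(x) for x, _ in sample) / m
    return total


# Hypothetical toy problem: three threshold voters on the real line.
voters = [
    lambda x: 1.0 if x > 0.0 else -1.0,
    lambda x: 1.0 if x > 1.0 else -1.0,
    lambda x: 1.0 if x > -1.0 else -1.0,
]
q = [0.5, 0.25, 0.25]
sample = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]

m1 = first_moment(voters, q, sample)
m2 = second_moment(voters, q, sample)
m2_pairwise = second_moment_pairwise(voters, q, sample)
```

The two expressions of the second moment coincide because \(y^2=1\) for labels in \(\{-1,+1\}\).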

2.2 MinCq and theoretical results

We now review three recent results of Laviolette et al. (2011) and Lacasse et al. (2007), which constitute the building blocks of our contributions. The first one takes the form of a bound, the \(C\)-bound (Theorem 1), over \(R_D(B_Q)\). It shows that the true risk can be minimized by only considering the first two moments of the \(Q\)-margin. Then, following some PAC-Bayesian generalization bounds, Theorem 2 justifies that the posterior distribution \(Q\) can be learned by minimizing the empirical \(C\)-bound. Finally, the authors show that learning an optimal \(Q\)-weighted majority vote boils down to a simple quadratic program called MinCq.

The \(C\)-bound is obtained by making use of Eq. (1) and the Cantelli-Chebyshev inequality (Devroye et al. 1996) applied to the random variable \(\mathcal{M}_Q(\mathbf{x},y)\).

Theorem 1

(The \(C\)-bound (Laviolette et al. 2011)) For any distributions \(Q\) over a class \(\mathcal{H}\) of functions and \(D\) over \(\mathcal{X}\times \mathcal{Y}\), if \(\mathcal{M}_Q^{D} > 0\) then \(R_{D}(B_Q) \le C_Q^{D}\) where:

$$\begin{aligned} \displaystyle C_Q^{D} = \frac{\mathbf {Var}_{(\mathbf {x},y) \sim D}\left( \mathcal{M}_Q(\mathbf {x},y)\right) }{ {\mathbf {E}_{(\mathbf {x},y) \sim D}}\left( \mathcal{M}_Q(\mathbf {x},y)\right) ^2}=1-\frac{\left( \mathcal{M}_Q^{D}\right) ^2}{\mathcal{M}_{Q^2}^{D}}. \end{aligned}$$

\(C_Q^S = 1-\frac{\left( \mathcal{M}_{Q}^S\right) ^2}{\mathcal{M}_{Q^2}^S}\) is its empirical counterpart.
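To illustrate Theorem 1, the sketch below computes the empirical \(C\)-bound from a list of per-example \(Q\)-margins and compares it with the empirical risk obtained from Eq. (1). The margin values are hypothetical.

```python
def empirical_c_bound(margins):
    """C_Q^S = 1 - (M_Q^S)^2 / M_{Q^2}^S, computed from per-example Q-margins."""
    m = len(margins)
    first = sum(margins) / m
    second = sum(v * v for v in margins) / m
    if first <= 0:
        raise ValueError("the C-bound requires a strictly positive first moment")
    return 1.0 - first ** 2 / second


# Hypothetical Q-margins on five examples; the last one is misclassified.
margins = [1.0, 0.5, 0.5, 1.0, -0.5]

c_bound = empirical_c_bound(margins)
risk = sum(v <= 0 for v in margins) / len(margins)  # Eq. (1) on the sample
print(risk, c_bound)
```

Since the theorem holds for any distribution, it applies in particular to the empirical distribution of the sample, so the empirical risk cannot exceed \(C_Q^S\) whenever \(\mathcal{M}_Q^S>0\).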

Thus, minimizing the \(C\)-bound appears to be a nice strategy for learning a \(Q\) that implies a \(Q\)-weighted majority vote \(B_Q\) with low true risk. To justify this strategy, the authors derive a PAC-Bayesian generalization bound for \(C_Q^D\). To do so, they assume a quasi-uniform distribution \(Q\) over an auto-complemented set of \(2n\) voters \(\mathcal{H}=\{h_1,\ldots ,h_n,h_{n+1},\ldots ,h_{2n}\}\), where \(h_{k+n}\! =\! -h_k\) (auto-complementation) and \(Q(h_k)+Q(h_{k+n})\!=\!\frac{1}{n}\) (quasi-uniformity) for every \(k\in \{1,\ldots ,n\}\). Note that, for the sake of simplicity, we will denote \(Q(h_k)\) by \(Q_k\). They claim that this assumption is not too strong a restriction and characterizes situations where, in the absence of ground truth, one gives the same a priori belief on the voters. Moreover, such distributions have two advantages. On the one hand, they allow us to get rid of the classic term which captures the complexity of \(\mathcal {H}\) (Footnote 2). This is a clear advantage since such a term can be a bad regularization (Laviolette et al. 2011). On the other hand, this assumption plays the role of a regularization by giving the same a priori belief on the voters and provides a simple way to avoid overfitting.

The generalization bound is then obtained by taking the lower (resp. upper) bound on \(\mathcal{M}_Q^D\) together with the upper (resp. lower) bound on \(\mathcal{M}_{Q^2}^D\) from the following theorem.

Theorem 2

(Laviolette et al. 2011) For any distribution \(D\) over \(\mathcal{X}\times \mathcal{Y}\), any \(m\ge 8\), any auto-complemented family \(\mathcal{H}\) of \(B\)-bounded real-valued voters, for every quasi-uniform distribution \(Q\) on \(\mathcal {H}\), and for any \(\delta \in (0,1]\), we have:

$$\begin{aligned}&\underset{S\sim D^m}{\mathbf {Pr}}\left( \left| \mathcal {M}_Q^D - \mathcal {M}_Q^S\right| \le \frac{2B\sqrt{\ln \frac{2\sqrt{m}}{\delta }}}{\sqrt{2m}}\right) \ge 1-\delta ,\\&\text{ and }\\&\underset{S\sim D^m}{\mathbf {Pr}}\left( \left| \mathcal {M}_{Q^2}^D-\mathcal {M}_{Q^2}^S\right| \le \frac{2B^2\sqrt{\ln \frac{2\sqrt{m}}{\delta }}}{\sqrt{2m}}\right) \ge 1-\delta . \end{aligned}$$
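To give an idea of the magnitude of these deviations, the following sketch evaluates the right-hand sides of Theorem 2. The function name and the parameter `order` (1 for the first moment, 2 for the second) are illustrative choices, and the values of \(m\), \(\delta\) and \(B\) are hypothetical.

```python
import math


def moment_deviation(m, delta, B=1.0, order=1):
    """Right-hand side of Theorem 2: 2 B^order sqrt(ln(2 sqrt(m) / delta)) / sqrt(2 m)."""
    if m < 8 or not 0 < delta <= 1:
        raise ValueError("Theorem 2 assumes m >= 8 and delta in (0, 1]")
    log_term = math.log(2 * math.sqrt(m) / delta)
    return 2 * B ** order * math.sqrt(log_term) / math.sqrt(2 * m)


eps = moment_deviation(1000, 0.05)         # first moment, B = 1
eps_large = moment_deviation(4000, 0.05)   # same setting with a larger sample
print(eps, eps_large)
```

As expected from the \(1/\sqrt{m}\) rate (up to a logarithmic factor), the deviation shrinks as the sample size grows.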

The authors have proved that their setting does not induce any loss of generality. From Theorems 1 and 2, they suggest the minimization of the empirical \(C\)-bound under the constraint \(\mathcal{M}_Q^{S} \ge \mu \). Due to the quasi-uniformity assumption, they show that this minimization problem is equivalent to solving a simple quadratic program involving only the first \(n\) voters of \(\mathcal {H}\). Their algorithm MinCq is given in Algorithm 1. It consists in minimizing the denominator \(\mathcal{M}_{Q^2}^{S}\), i.e., the second moment of the \(Q\)-margin (Line 3), under the constraints \(\mathcal{M}_Q^{S}\!=\!\mu \) (Line 4) and \(Q\) is quasi-uniform (Line 5). This leads to minimizing the \(C\)-bound and thus the true risk of the majority vote by only taking into account the diversity between the voters expressed by the empirical second moment.

[Algorithm 1: MinCq]

The \(Q\)-weighted majority vote learned by MinCq is:

$$\begin{aligned} \displaystyle B_{Q}(\mathbf{x})=\mathrm{sign }\left[ \sum _{k=1}^{n} \left( 2Q_k -\frac{1}{n}\right) h_k(\mathbf{x})\right] . \end{aligned}$$
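The reduction to the first \(n\) voters can be checked numerically. In the sketch below, a quasi-uniform \(Q\) is completed to the \(2n\) auto-complemented voters, and the weighted sum over all \(2n\) voters is compared with the reduced sum using the weights \(2Q_k-\frac{1}{n}\). The voters' outputs and the values of \(Q_k\) are hypothetical.

```python
def complete_quasi_uniform(q_first):
    """Extend Q_1..Q_n to the 2n auto-complemented voters: Q_{k+n} = 1/n - Q_k."""
    n = len(q_first)
    if any(not 0.0 <= qk <= 1.0 / n for qk in q_first):
        raise ValueError("quasi-uniformity requires 0 <= Q_k <= 1/n")
    return list(q_first) + [1.0 / n - qk for qk in q_first]


def reduced_weights(q_first):
    """Weights 2 Q_k - 1/n applied to the first n voters only."""
    n = len(q_first)
    return [2 * qk - 1.0 / n for qk in q_first]


# Hypothetical outputs h_1(x), h_2(x), h_3(x) of three voters on one input x.
outputs = [1.0, -1.0, 0.5]
q_first = [0.3, 0.1, 0.2]

q_full = complete_quasi_uniform(q_first)
all_outputs = outputs + [-o for o in outputs]  # h_{k+n} = -h_k
full_score = sum(q * o for q, o in zip(q_full, all_outputs))
reduced_score = sum(w * o for w, o in zip(reduced_weights(q_first), outputs))
```

Both scores coincide, so the sign of the reduced sum gives the same prediction as the vote over the full auto-complemented set.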

3 Generalization of MinCq to \({\mathbf { P}}\)-aligned distributions

Rather than constraining \(Q\) to be quasi-uniform on the auto-complemented set of \(2n\) voters \(\mathcal {H}\) (\(\forall k \in \{1,\ldots ,n\},\ Q_k+Q_{k+n}=\frac{1}{n}\)) as done in MinCq, we generalize this approach to any \({\mathbf P}\)-aligned distribution \(Q\): \(\forall k \in \{1,\ldots ,n\}\), \(Q_k+Q_{k+n}=P_k\), where the entries of \({\mathbf P}=(P_1,\ldots ,P_n)^{\top }\) sum to 1. In this context, \({\mathbf P}\) plays the role of an a priori belief on the voters.

3.1 Expressiveness of \({\mathbf {P}}\)-aligned distributions

We generalize the setting of Laviolette et al. (2011) for quasi-uniform distributions to any \({\mathbf {P}}\)-aligned distribution on a set of auto-complemented classifiers \(\mathcal {H}\). In fact, this constraint does not restrict the possible outcomes of an algorithm that would minimize \(C_Q^S\).

Proposition 1

For all distributions \(Q\) on \(\mathcal{H}\), there exists a \({\mathbf { P}}\)-aligned distribution \(Q'\) on the auto-complemented \(\mathcal{H}\) that provides the same majority vote as \(Q\), and that has the same empirical and true \(C\)-bound values.

Proof

It follows from Proposition 4 of Germain et al. (2011) and is given in section ‘Proof of Proposition 1’ of the Appendix.

From this proposition, similarly as for MinCq, it is then justified that under the constraint \(\mathcal{M}_Q^{S}=\mu \), the \(C\)-bound can be optimized by minimizing the second moment \(\mathcal{M}_{Q^2}^S\) of the \(Q\)-margin. This is done by solving the quadratic program P-MinCq described in the following.

3.2 The quadratic program P-MinCq

P-MinCq is described in Algorithm 2. Similarly to MinCq, thanks to the \({\mathbf P}\)-aligned assumption, we only need to cope with the first \(n\) voters in \(\mathcal{H}\). The objective function (Line (6)) minimizes the second moment of the \(Q\)-margin while the first constraint (Line (7)) enforces a margin equal to \(\mu \). Note that the left-hand side of this constraint is the weighted average (with weights of \(2Q_k\!-\!P_k\)) of the individual margins (\(\mathcal{M}_{h_k}\)). Finally, Line (8) restricts \(Q\) to be \({\mathbf P}\)-aligned. The derivation of the algorithm can be found in section ‘Proof of Algorithm 2 : P-MinCq’ of the Appendix.

[Algorithm 2: P-MinCq]

The \(Q\)-weighted majority vote learned by P-MinCq is:

$$\begin{aligned} \displaystyle B_{Q}(\mathbf{x})\!=\!\mathrm{sign }\!\left[ \sum _{k=1}^{n}\! \left( 2Q_k\! -\!P_k\right) h_k(\mathbf{x})\right] \!. \end{aligned}$$
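The quantities manipulated by P-MinCq can be sketched as follows. The snippet does not solve the quadratic program; under the reparametrization \(w_k=2Q_k-P_k\) (which follows from \(h_{k+n}=-h_k\) and \(Q_k+Q_{k+n}=P_k\)), it only evaluates the first moment constrained by Line (7) and the second moment minimized in Line (6), and checks that \(Q\) is \({\mathbf P}\)-aligned. The function name, the toy data and the candidate \(Q\) are hypothetical.

```python
def p_mincq_terms(q_first, prior, voter_margins, gram):
    """Evaluate the first and second margin moments for a P-aligned Q.

    q_first:       Q_1..Q_n (weights of the first n voters)
    prior:         P_1..P_n, summing to 1
    voter_margins: voter_margins[k] = M_{h_k}^S
    gram:          gram[k][l] = M_{h_k,h_l}^S
    """
    if any(not 0.0 <= qk <= pk for qk, pk in zip(q_first, prior)):
        raise ValueError("a P-aligned distribution needs 0 <= Q_k <= P_k")
    n = len(prior)
    w = [2 * qk - pk for qk, pk in zip(q_first, prior)]
    first = sum(wk * mk for wk, mk in zip(w, voter_margins))
    second = sum(w[k] * w[l] * gram[k][l] for k in range(n) for l in range(n))
    return w, first, second


# Hypothetical toy data: outputs of n = 2 voters on m = 4 examples.
labels = [1, 1, -1, -1]
outputs = [[1, 1, -1, 1], [1, -1, -1, -1]]
m = len(labels)
voter_margins = [sum(y * o for y, o in zip(labels, h)) / m for h in outputs]
gram = [[sum(a * b for a, b in zip(h, g)) / m for g in outputs] for h in outputs]

prior = [2 / 3, 1 / 3]  # an a priori belief favoring the first voter
weights, first, second = p_mincq_terms([0.6, 0.3], prior, voter_margins, gram)
print(weights, first, second)
```

A solver for the quadratic program would search, among the \(Q\) satisfying the \({\mathbf P}\)-alignment and a first moment equal to \(\mu\), for the one with the smallest second moment.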

The next section addresses the generalization guarantees for P-MinCq.

4 PAC-Bayesian generalization guarantees under sample compression

The proof of the generalization bounds of Theorem 2 is still valid for any \({\mathbf P}\)-aligned distribution \(Q\) over data-independent voters. Indeed, it only makes use of the \({\mathbf P}\)-aligned assumption corresponding to \(Q_k+Q_{k+n}=P_k+P_{k+n}\) (Footnote 3). This theorem is nevertheless not valid in the sample compression setting, where the set of voters is data-dependent (such as the set of \(k\)-NN classifiers). Laviolette et al. (2011) have argued that it can be extended to this setting by using techniques from Laviolette and Marchand (2007). This section is devoted to deriving generalization bounds for P-MinCq in this sample compression setting, allowing us to deal with data-dependent voters. Our result is rather general (and not restricted to \(k\)-NN voters). It differs from previous PAC-Bayesian results with sample compressed classifiers (Graepel et al. 2005; Laviolette and Marchand 2007; Germain et al. 2011) in that it is tailored to the first two moments of the \(Q\)-margin with \({\mathbf {P}}\)-aligned distributions.

4.1 Sample compression setting

In the sample compression framework (Floyd and Warmuth 1995) the learning algorithm \(\mathcal{A}\) has access to a data-dependent set of classifiers. Each classifier is then represented by two elements: a compression sequence which is a sequence of examples, and a message representing the additional information needed to obtain the classifier from the compression sequence. Then, we can define a reconstruction function able to output a classifier from a compression sequence and a message. More formally, a learning algorithm \(\mathcal{A}\) is called a compression scheme if it is defined as follows.

Definition 3

Let \(S \in (\mathcal{X} \times \mathcal{Y})^m=\mathcal{Z}^m\) be the learning sample of size \(m\). We define \(\mathbf{J}_m\) to be the set containing all the possible vectors of indices:

$$\begin{aligned} \displaystyle \mathbf{J}_m = \bigcup _{i=1}^m \left\{ (j_1,\dots ,j_i)\in \{1,\dots ,m\}^i\right\} . \end{aligned}$$

Given a family of hypotheses \(\mathcal{H}^S\) from \(\mathcal{X}\) to \(\mathcal{Y}\) and an index vector \(\mathbf{j}\in \mathbf{J}_m\), let \(S_{\mathbf{j}}\) be the subsequence of \(S\) indexed by \(\mathbf{j}\), called the compression sequence:

$$\begin{aligned} \displaystyle S_{\mathbf{j}}= (\mathbf{z}_{j_1},\dots ,\mathbf{z}_{j_i}). \end{aligned}$$

An algorithm \(\mathcal{A}:\mathcal{Z}^{(\infty )}\mapsto \mathcal{H}^S\) is a compression scheme if, and only if, there exists a triplet \((\mathcal{C},\mathcal{R},\omega )\) such that for every training sample \(S\):

$$\begin{aligned} \displaystyle \mathcal{A}(S) = \mathcal{R}\left( S_{\mathcal{C}(S)},\omega \right) , \end{aligned}$$

where \(\mathcal{C}:\mathcal{Z}^{(\infty )}\mapsto \bigcup _{m=1}^{\infty } \mathbf{J}_m \) is the compression function, \(\mathcal{R}:\mathcal{Z}^{(\infty )}\times \varOmega _{S_{\mathcal{C}(S)}}\mapsto \mathcal{H}^S\) the reconstruction function, and \(\omega \) is a message chosen from the set \(\varOmega _{S_{\mathcal{C}(S)}}\) (a priori defined) of all messages that can be supplied with the compression sequence \(S_{\mathcal{C}(S)}\).

Put into words, given a learning sample \(S\sim D^m\), a sample compression scheme is a reconstruction function \(\mathcal{R}\) mapping a compression sequence \(S_{\mathcal{C}(S)}=S_{\mathbf{j}}\) and a message \(\omega \) to a function \(h^{\omega }_{S_{\mathbf{j}}}\) of some set \(\mathcal{H}^S\), such that \(\mathcal{A}(S) = \mathcal{R}\left( S_{\mathbf{j}},\omega \right) =h^{\omega }_{S_{\mathbf{j}}}\). For example, \(k\)-NN classifiers can be reconstructed from a compression sequence only, which encodes the nearest neighbors (Floyd and Warmuth 1995; Graepel et al. 2005). Other classifiers, such as the decision list machines (Marchand and Sokolova 2005), need both a compression sequence and a message string. In the next section, we consider the general setting to avoid any loss of generality.
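As a small illustration of Definition 3, the sketch below enumerates the index vectors of \(\mathbf{J}_m\) for a tiny \(m\), extracts a compression sequence \(S_{\mathbf{j}}\), and reconstructs a 1-NN voter from it without any message. The one-dimensional sample and the function names are hypothetical.

```python
from itertools import product


def index_vectors(m):
    """Enumerate J_m: all index vectors (j_1, ..., j_i) in {1..m}^i for i = 1..m."""
    for i in range(1, m + 1):
        yield from product(range(1, m + 1), repeat=i)


def compression_sequence(sample, j):
    """S_j: the examples of `sample` selected by the 1-based index vector j."""
    return [sample[idx - 1] for idx in j]


def reconstruct_1nn(s_j):
    """Reconstruction function of a 1-NN voter: the compression sequence suffices."""
    def h(x):
        _, label = min(s_j, key=lambda z: abs(z[0] - x))
        return label
    return h


# Hypothetical one-dimensional learning sample of (x, y) pairs.
sample = [(-2.0, -1), (-0.5, -1), (0.5, 1), (2.0, 1)]

n_vectors = len(list(index_vectors(3)))        # 3 + 9 + 27 index vectors
s_j = compression_sequence(sample, (2, 3))     # second and third examples
h = reconstruct_1nn(s_j)
print(n_vectors, s_j, h(1.7), h(-3.0))
```

The number of index vectors grows as \(\sum _{i=1}^m m^i\), which is why the enumeration is only shown for a very small \(m\).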

4.2 PAC-Bayesian generalization bounds under sample compression

Let \(S_{\mathbf{j}}\) be a sample compression sequence consisting of \(|\mathbf{j}|\) elements of the learning sample \(S\). In the PAC-Bayesian sample compression setting, the risks \(R_D\) and \(R_S\) can be biased by these elements: we often prefer to compute the empirical risk \(R_S\) from \(S\backslash S_{\mathbf{j}}\) (Laviolette and Marchand 2007). However, in order to derive risk bounds in such a situation, Germain et al. (2011) have proposed another strategy by directly considering the bias. As mentioned in the introduction, we cannot apply their result to our setting. Indeed, it is valid for loss functions defining a surrogate of the \(0-1\) loss, which is not suited for the second moment of the margin we have to consider. Moreover, it depends on the value of the surrogate at \(-1\), which may lead to a degenerate bound (this does not occur in our bounds).

The derivation of our result is nevertheless based on a similar setting: given a sample \(S\), we consider \(\mathcal{H}^S\) the set of all possible classifiers \(h^{\omega }_{S_{\mathbf{j}}}=\mathcal{R}(S_{\mathbf{j}},\omega )\) such that \(\omega \in \varOmega _{S_{\mathbf{j}}}\). We denote by \(Q_{\mathbf{J}_{m}}(\mathbf{j})\), the probability that a compression sequence \(S_{\mathbf{j}}\) is chosen by \(Q\), and \(Q_{S_{\mathbf{j}}}(\omega )\) the probability of choosing the message \({\omega }\) given \(S_{\mathbf{j}}\). Then, we have:

$$\begin{aligned} \displaystyle Q_{\mathbf{J}_m}(\mathbf{j}) = \int \limits _{\omega \in \varOmega _{S_\mathbf{j}}}\!\!\!\! Q(h^{\omega }_{S_{\mathbf{j}}})d\omega ,\ \ \ \text{ and }\ \ \ Q_{S_{\mathbf{j}}}(\omega ) = Q(h^{\omega }_{S_{\mathbf{j}}}|S_{\mathbf{j}}). \end{aligned}$$

In the usual PAC-Bayesian setting, the risk bounds depend on the prior distribution \(P\) over the set \(\mathcal{H}^S\). This prior distribution is supposed to be known before observing the learning sample \(S\), which implies that \(P\) is independent of \(S\). However, in our setting the classifiers in \(\mathcal{H}^S\) are data-dependent. To tackle this problem, we propose to follow the principle of Laviolette and Marchand (2007) and Germain et al. (2011) by considering a prior distribution defined by a pair: \(\left( P_{\mathbf{J}_m},(P_{S_{\mathbf{j}}})_{\mathbf{j}\in \mathbf{J}_m}\right) , \) where \(P_{\mathbf{J}_m}\) is a distribution over \(\mathbf{J}_m\), and for every possible compression sequence \(S_{\mathbf{j}}\), \(P_{S_{\mathbf{j}}}\) is a distribution over \({\varOmega }_{S_\mathbf{j}}\). Given a training sample \(S\), the data-independent prior distribution \(P\) corresponds to the distribution on \(\mathcal{H}^S\) associated with the prior \(\left( P_{\mathbf{J}_m},(P_{S_{\mathbf{j}}})_{\mathbf{j}\in \mathbf{J}_m}\right) \). We then have: \(\displaystyle P(h^{\omega }_{S_\mathbf{j}}) = P_{\mathbf{J}_m}(\mathbf{j})\, P_{S_{\mathbf{j}}}(\omega ).\)

Definition 4

In the sample compression setting, the \(Q\)-margin of a point \((\mathbf {x},y)\) over \(Q\) is:

$$\begin{aligned} \mathcal{M}_{Q}(\mathbf {x},y) = y\ \underset{h_{S_{\mathbf{j}}}^{\omega } \sim Q}{\mathbf {E}}\ h_{S_{\mathbf{j}}}^{\omega }(\mathbf {x}). \end{aligned}$$

The first two moments \(\mathcal{M}_Q^D\) and \(\mathcal{M}_{Q^2}^D\) of the \(Q\)-margin are defined similarly as before:

$$\begin{aligned} \mathcal{M}_Q^{D} = \underset{(\mathbf {x},y) \sim D}{\mathbf {E}} \mathcal{M}_{Q}(\mathbf {x},y)\ \ \text{ and }\ \ \mathcal{M}_{Q^2}^{D} = \underset{(\mathbf {x},y) \sim D}{\mathbf {E}}(\mathcal{M}_Q(\mathbf {x},y))^2. \end{aligned}$$

In our setting, we assume \({\mathbf P}\)-aligned distributions on an auto-complemented set \(\mathcal{H}^S\). For each classifier \(h_{S}^{\omega }\!\in \! \mathcal{H}^S\), we denote its complement by \(-h_{S}^{\omega }\). Given \(S\), the associated message set is \(\varOmega _{S}\!\times \!\{+,-\}\) and \(\forall \sigma \! \in \!\varOmega _{S},\) \(h_{S}^{(\sigma ,+)}=-h_{S}^{(\sigma ,-)}\). We now give the main result of this section.

Theorem 3

For any distribution \(D\) over \(\mathcal{X}\times \mathcal{Y}\), any \(m\ge 8\), any auto-complemented set \(\mathcal{H}^S\) of \(B\)-bounded real-valued voters of sample compression size at most \(|\mathbf{j}^{\max }|<\tfrac{m}{2}\), for every \({\mathbf P}\)-aligned distribution \(Q\) on \(\mathcal {H}^S\), and for any \(\delta \in (0,1]\), we have:

$$\begin{aligned}&\underset{S\sim D^m}{\mathbf {Pr}}\left( \left| \mathcal {M}_Q^D - \mathcal {M}_Q^S\right| \!\le \! \frac{2B\sqrt{\frac{|\mathbf{j}^{\max }|}{B}+ \ln \left( \frac{2\sqrt{m}}{\delta }\right) }}{\sqrt{2(m-|\mathbf{j}^{\max }|)}} \right) \ge 1-\delta ,\end{aligned}$$
(9)
$$\begin{aligned}&\underset{S\sim D^m}{\mathbf {Pr}}\left( \left| \mathcal {M}_{Q^2}^D - \mathcal {M}_{Q^2}^S\right| \!\le \! \frac{2B^2\sqrt{\frac{2|\mathbf{j}^{\max }|}{B^2}+ \ln \left( \frac{2\sqrt{m}}{\delta }\right) }}{\sqrt{2(m-2|\mathbf{j}^{\max }|)}} \right) \ge 1-\delta . \end{aligned}$$
(10)

Proof

Deferred to section ‘Proof of Theorem 3’ of the Appendix.

For data-independent classifiers, i.e. \(|\mathbf{j}^{\max }|=0\), we recover Theorem 2. As expected, the theorem indicates that when the compression size \(|\mathbf{j}^{\max }|\) is large, the bound becomes looser, suggesting that the compression size should not be too large to preserve consistency. Note that the bound \(B\) over the classifiers’ output can generally be controlled by the use of appropriate normalization.
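The behavior of the bounds of Theorem 3 can be sketched numerically. The function below evaluates the right-hand sides of Eqs. (9) and (10) as they are stated above; its name, the parameter `order` (1 for Eq. (9), 2 for Eq. (10)) and the numerical values are illustrative assumptions.

```python
import math


def compressed_deviation(m, delta, j_max, B=1.0, order=1):
    """Right-hand side of Eq. (9) (order=1) or Eq. (10) (order=2)."""
    if m < 8 or not 0 < delta <= 1 or not 0 <= j_max < m / 2:
        raise ValueError("need m >= 8, delta in (0, 1] and 0 <= j_max < m / 2")
    scale = B ** order
    log_term = math.log(2 * math.sqrt(m) / delta)
    numerator = 2 * scale * math.sqrt(order * j_max / scale + log_term)
    return numerator / math.sqrt(2 * (m - order * j_max))


no_compression = compressed_deviation(1000, 0.05, 0)   # data-independent voters
small = compressed_deviation(1000, 0.05, 10)
large = compressed_deviation(1000, 0.05, 100)
print(no_compression, small, large)
```

With a compression size of zero the value coincides with the deviation of Theorem 2, and it increases with \(|\mathbf{j}^{\max }|\), in line with the remark above.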

In the next section, we instantiate P-MinCq in the specific \(k \text{-NN }\) setting by introducing a rather intuitive but statistically well-founded a priori constraint \({\mathbf P}\).

5 Instantiation of P-MinCq for nearest neighbor classifiers

5.1 Limitations of MinCq in the context of nearest neighbor classifiers

At first sight, one may think that MinCq is a good way to overcome two limitations of \(k\)-NN classifiers. First, while the theory tells us that the higher \(k\), the better the convergence to the optimal Bayes risk, this holds only asymptotically. In practice the choice of \(k\) requires special care. Therefore, optimizing a \(Q\)-weighted majority vote, where the set of voters \(\mathcal{H}\) consists of the \(k\)-NN classifiers (\(k\in \{1,2,\ldots \}\)), would spare us from tuning \(k\) while offering a principled way to combine these classifiers (Footnote 4). Second, by making use of the PAC-Bayesian setting, the minimization of the \(C\)-bound provides generalization guarantees that cannot be obtained with a standard \(k \text{-NN }\) algorithm in finite-sample situations.

We conduct a preliminary experimental study to compare a standard \(k \text{-NN }\) classifier (where \(k\) is tuned by cross-validation) with MinCq (see Sect. 6 for details on the setup). Over twenty datasets, MinCq achieves an average classification error of 18.18 % against 17.88 % for \(k\)-NN (see Table 1 for more details). It is worth noting that, using a paired Student’s t-test, we cannot statistically distinguish between the two approaches. This is also confirmed by a sign test, which gives a win/loss/tie record of 7/6/7, leading to a p-value of about 0.5, as illustrated by Fig. 1. This series of experiments clearly shows that MinCq performs no better than a single well-tuned \(k\)-NN classifier.
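The reported p-value is consistent with a simple sign test computation. The sketch below assumes a one-sided test in which the 7 ties are discarded, leaving 13 datasets; the exact variant of the test is not specified in the text, so this choice is an assumption.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 1/2)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


p_value = sign_test_p_value(7, 6)  # ties are discarded (assumption)
print(p_value)
```

With 7 wins out of 13 non-tied comparisons, the upper tail of the binomial distribution covers exactly half of the probability mass.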

Table 1 Error rates of NN, SNN, LMNN, MinCq and P-MinCq on twenty datasets

Fig. 1 Comparison of MinCq vs. NN. Each point in the scatter plot shows the test error rate of the algorithms on a single dataset. A dataset above the bisecting line is in favor of MinCq

We claim that these disappointing results can be explained by the fact that the quasi-uniformity assumption on \(Q\) is not appropriate to settings where one has an a priori belief on the relevance of the voters, which is typically the case in NN classification. Indeed, for obvious reasons, close neighborhoods are likely to provide more relevant information than distant ones. We propose to overcome these limitations by using an instantiation of P-MinCq based on a constraint \({\mathbf P}\) suitable for NN classification.

5.2 A statistically well-founded constraint \({\mathbf {P}}\)

In standard \(k\)-NN classification, the theory tells us that the higher \(k\), the better the convergence to the optimal Bayes risk. However, this property holds only asymptotically, i.e., when the size \(m\) of the training sample goes to infinity. In practice, training data is limited and one has to set \(k\) carefully. On the one hand, we want to use a large value of \(k\) to obtain a reliable estimate. On the other hand, only points in a very close neighborhood lead to an accurate classification rule. Several theoretical and experimental studies in the literature have tried to analyze this trade-off between small and large values of \(k\). As suggested by Duda et al. (2001), a good solution consists in using a small fraction of the training examples, equal to about \(\sqrt{m/|\mathcal{Y}|}\) neighbors, where \(|\mathcal{Y}|\) is the number of classes.

The context is slightly different in P-MinCq, since we aim at linearly combining \(k\)-NN classifiers (\(k=1, 2,\ldots \)). Rather than setting \(k\), we aim at choosing a suitable constraint \({\mathbf P}\), which plays the role of an a priori belief on the voters. As suggested by Devroye et al. (1996), in a weighted nearest neighbor rule, nearer neighbors should provide more information than distant ones. Following this, we propose the following constraint \({\mathbf P }\) (normalized so that its entries sum to \(1\)):

$$\begin{aligned} \forall k \ge 1,\quad P_k=1/k. \end{aligned}$$
(11)
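As a minimal sketch, the normalized constraint of Eq. (11) over \(n\) voters can be computed as follows; the function name and the choice \(n=4\) are illustrative.

```python
def harmonic_prior(n):
    """P_k proportional to 1/k for k = 1..n, normalized so that the entries sum to 1."""
    h_n = sum(1.0 / k for k in range(1, n + 1))
    return [1.0 / (k * h_n) for k in range(1, n + 1)]


prior = harmonic_prior(4)
print(prior)
```

For \(n=4\), the normalization constant is \(H_4=25/12\), so that the 1-NN voter receives a prior weight of \(12/25\) and the weights decrease with \(k\).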

\({\mathbf P}\) concentrates the weights on voters that are based on a small fraction of the training data, i.e., points in a close neighborhood (as suggested by Duda et al. (2001)), but also takes into account (to a smaller extent) the information provided by (potentially) the entire training set. To justify this choice, we establish in the following a strong relationship between Eq. (11) and the popular choice \(\sqrt{m/2}\) for \(k\) in \(k \text{-NN }\) binary classification. Our analysis is based on the characterization of \({\mathbf P }\) by its median \(M\), which corresponds to the number of neighbors involved in the voters accumulating half of the total weight. While defining the median of a continuous distribution is rather straightforward, finding it in the discrete case of interest (i.e., where \(x \in \{1,\dots ,m\}\)) is slightly trickier and requires an approximation. Let us define \(H_M= \sum _{x=1}^M \frac{1}{x}\) and \(H_m= \sum _{x=1}^m \frac{1}{x}\). They are partial sums of the harmonic series, for which no closed form is available. However, for all \(n\) we can write: \(H_n=\sum _{x=1}^n \frac{1}{x} = \ln (n) + \gamma + \epsilon _n,\ \) where \(\gamma \) is the Euler-Mascheroni constant (\(\gamma \simeq 0.5772156\)) and \(\epsilon _n \sim \frac{1}{2n}\). Therefore, we have:

$$\begin{aligned} H_M&=\frac{1}{2} H_m \nonumber \\&\Leftrightarrow \sum _{x=1}^M \frac{1}{x} = \frac{1}{2} \sum _{x=1}^m \frac{1}{x} \nonumber \\&\Leftrightarrow \ln (M)+\gamma + \epsilon _M = \frac{1}{2}(\ln (m)+\gamma )+ \frac{1}{2} \epsilon _{m}\nonumber \\&\Leftrightarrow \ln (M) = \ln (\sqrt{m}) - \frac{1}{2} \gamma + \frac{1}{2} \epsilon _m - \epsilon _M \nonumber \\&\Rightarrow \ln (M) \cong \ln (\sqrt{m}) - \frac{1}{2} \gamma + \frac{1}{4m} - \frac{1}{2M}\ \ \small (\text{ using } \epsilon _n \sim \frac{1}{2n}) \nonumber \\&\Rightarrow \ln (M) \le \ln (\sqrt{m}) - \frac{1}{2} \gamma - \frac{1}{4m}\small (\text{ since } \text{ Equation } \text{(11) }\ {\scriptstyle \Rightarrow M \le m/2)}\nonumber \\&\Rightarrow M \le \sqrt{m \exp (-\gamma ) \exp \left( -\frac{1}{4m}\right) } \simeq \sqrt{\frac{m}{2}}. \end{aligned}$$
(12)

The main information provided by Eq. (12) is that the approximation of the median of \({\mathbf P}\) is very close to \(\sqrt{m/2}\), the value suggested for \(k\) in the \(k \text{-NN }\) rule for binary classification problems. Figure 2 shows a graphical illustration of the closeness between the median of the harmonic series and \(\sqrt{m/2}\). We have thus established a strong relationship between a classic choice for \(k\) in standard \(k\text{-NN }\) classification and our \({\mathbf P}\) constraint in a weighted majority vote of \(k \text{-NN }\) voters. The next section will feature a large comparative experimental study that validates our choice for \({\mathbf P}\).

Fig. 2 Comparison between the median of the harmonic series \(\sum _{x=1}^m \frac{1}{x}\) and \(\sqrt{m/2}\)
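As a quick numerical check of Eq. (12), the following Python sketch computes a discrete median of the weights \(P_i \propto 1/i\) and prints it next to \(\sqrt{m/2}\) and \(\sqrt{m \exp (-\gamma )}\). The function name `harmonic_median` and the convention of taking the smallest integer \(M\) with \(H_M \ge H_m/2\) are assumptions made for this illustration, not definitions taken from the analysis above. Because of this rounding up, the returned value can sit slightly above the continuous approximation.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def harmonic_median(m: int) -> int:
    """Smallest M such that H_M >= H_m / 2, with H_n = sum_{x=1}^n 1/x.

    This 'round up' convention is an assumption of the sketch.
    """
    half = sum(1.0 / x for x in range(1, m + 1)) / 2.0
    partial = 0.0
    for M in range(1, m + 1):
        partial += 1.0 / M
        if partial >= half:
            return M
    return m


if __name__ == "__main__":
    for m in (100, 1000, 10000):
        print(
            m,
            harmonic_median(m),
            round(math.sqrt(m / 2), 2),
            round(math.sqrt(m * math.exp(-EULER_GAMMA)), 2),
        )
```

Note that \(\sqrt{\exp (-\gamma )} \approx 0.749\) while \(\sqrt{1/2} \approx 0.707\), so the two reference values differ by roughly 6 % for any \(m\).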

Before that, recall that the generalization bound derived in Sect. 4 suggests limiting the size of the prototype set for the \(k\text{-NN }\) classifiers. A first approach could be to divide the learning sample into two sets: one for defining the \(k\text{-NN }\) classifiers and one for learning the parameters of the model. However, this strategy does not fit the sample compression setting and has the disadvantage of discarding useful information. Another solution is to apply, for each \(k \text{-NN } \text{ voter }\), some prototype selection or reduction technique (Duda et al. 2001) in order to remove training examples that do not change the labeling of any test example. This implies that each \(k \text{-NN }\) voter must use its own compressed sample, corresponding to a subset of the training sample \(S\). However, in addition to its computational cost, this strategy is not always relevant in the context of NN, since it may be difficult to obtain a good (i.e., small) compression scheme for some distributions. Nevertheless, in the particular setting we consider for \(k \text{-NN }\), we have noticed that using a large \(|\mathbf{j}^{\max }|\) (even equal to \(m\)) does not influence the practical performance of P-MinCq.

6 Experimental results

In this section, we propose a comparative study of P-MinCq applied to the context of NN classification (as described in Sect. 3). We compare it against four different approaches:

  • The standard Nearest Neighbor algorithm (NN) which plays the role of the baseline.

  • The Symmetric Nearest Neighbor algorithm (Nock et al. 2003) (SNN), a variant of NN where the class of an instance \(x\) is determined by the majority class among the training points that belong to the \(k\)-neighborhood of \(x\) (as in NN) plus those that include \(x\) in their own \(k\)-neighborhood.

  • Large Margin Nearest Neighbor (Weinberger and Saul 2009) (LMNN), which learns a Mahalanobis distance by optimizing the \(k \text{-NN }\) training error (with a safety margin). Then, \(k \text{-NN }\) is applied using the learned distance. Note that LMNN has been shown to be competitive with an RBF kernel SVM.

  • MinCq (Laviolette et al. 2011) which considers a quasi-uniform distribution.

We evaluate these methods on twenty benchmark datasets and an object categorization task.
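To make the symmetric neighborhood used by SNN concrete, here is a minimal pure-Python sketch of the voting rule as described in the list above. It is only an illustration under stated assumptions, not the implementation of Nock et al. (2003): the function name `snn_predict`, the use of the Euclidean distance, labels in \(\{-1,+1\}\), the strict inequality used to decide whether the query enters a training point's \(k\)-neighborhood, and the tie-breaking toward \(+1\) are all choices made here. It also assumes \(k\) is smaller than the number of training points.

```python
import math


def snn_predict(train_points, train_labels, query, k):
    """Symmetric k-NN vote for a single query point (labels are -1 or +1)."""
    n = len(train_points)
    d_query = [math.dist(p, query) for p in train_points]

    # (a) Training points that belong to the k-neighborhood of the query.
    nearest = set(sorted(range(n), key=d_query.__getitem__)[:k])

    # (b) Training points that would include the query in their own
    #     k-neighborhood: the query is closer to them than their current
    #     k-th nearest training neighbor (assumed convention).
    reverse = set()
    for i, p in enumerate(train_points):
        others = sorted(
            math.dist(p, q) for j, q in enumerate(train_points) if j != i
        )
        kth_distance = others[k - 1]
        if d_query[i] < kth_distance:
            reverse.add(i)

    # Majority vote over the union of both sets; ties go to +1 (assumption).
    vote = sum(train_labels[i] for i in nearest | reverse)
    return 1 if vote >= 0 else -1
```

For example, with training points `[(0.0,), (1.0,), (10.0,), (11.0,)]` labeled `[1, 1, -1, -1]` and `k = 1`, a query at `(0.5,)` collects votes from both positive points, since each of them would count the query as its nearest neighbor.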

6.1 Benchmark datasets

Experimental setup. These twenty binary classification datasets are of varying domain and difficulty, and are mostly taken from the UCI Machine Learning Repository (see Footnote 5). We compute neighborhoods using the standard Euclidean distance. We randomly split each dataset into 50 % training and 50 % test data, except for letterAB, letterDO and letterOQ, for which we split 20 %/80 %. We tune the following parameters by 10-fold cross-validation on the training set: the margin parameter \(\mu \) for MinCq and P-MinCq (among 14 values between .0001 and .5) and the parameter \(k\) for \(k \text{-NN }\) and LMNN (among \(\{1,\ldots ,10\}\)). The trade-off parameter of LMNN was set to .5, as done by Weinberger and Saul (2009).

Results. We report the results in Table 1. We make the following remarks. First, P-MinCq significantly outperforms a standard NN classifier. On average over the datasets, P-MinCq achieves a classification error of 16.89 % while NN reaches a level of 17.88 %. Using a paired Student t-test, this difference is statistically significant with a \(p\) value of .06. This is further supported by a sign test, which gives a win/loss/tie record of 12/5/3, leading to a \(p\) value of .07. P-MinCq also outperforms SNN despite the fact that the latter performs well on a few datasets (\(p\) value of .01 with a Student test and .24 with a sign test). Furthermore, P-MinCq performs significantly better than MinCq, with a \(p\) value of .02 using a Student test. With a sign test, the \(p\) value is about .03, with a win/loss/tie record of 12/4/4. This shows the usefulness of our generalization of MinCq to \({\mathbf P}\)-aligned distributions, and that \(P_i=\frac{1}{i}\) is a suitable a priori distribution in the context of NN. Finally, despite the fact that P-MinCq is not a metric learning algorithm, it is competitive with LMNN (.1689 vs .1607, with a \(p\) value of about .10 with a Student test). A sign test leads to a \(p\) value of .5, indicating that one method is equally likely to perform better than the other.
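The sign-test \(p\) values quoted above can be approximately recovered from the win/loss/tie records. The sketch below assumes a one-sided binomial sign test in which ties are discarded; the text does not state which variant was used, so this is an assumption, and the function name `sign_test_p_value` is hypothetical. Under that assumption, the 12/5/3 record gives about .07, matching the reported value, and the 12/4/4 record gives about .04, close to the reported "about .03".

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """One-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 1/2).

    Ties are assumed to be discarded before calling this function.
    """
    n = wins + losses
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n


if __name__ == "__main__":
    print(sign_test_p_value(12, 5))  # P-MinCq vs NN record (ties dropped)
    print(sign_test_p_value(12, 4))  # P-MinCq vs MinCq record (ties dropped)
```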

In fact, we claim that P-MinCq and LMNN are rather complementary. On the one hand, LMNN is a metric learning algorithm that can tweak the neighborhoods of the points (sometimes with great success, e.g., heart, parkinsons or sonar) but may perform worse than NN, especially because it often overfits when dimensionality is high (e.g., colon or musk1). On the other hand, P-MinCq does not change the neighborhoods of the points but combines several nearest neighbor rules and, as a combination of classifiers, appears to be quite stable (as shown at the bottom of Table 1, it achieves the best average rank) and robust to overfitting. To highlight how P-MinCq and LMNN complement each other, we perform an additional series of experiments aiming at combining LMNN and P-MinCq when this seems relevant. To do so, we make use of the validation performance: if LMNN performs better than P-MinCq, then we plug the distance learned by LMNN into P-MinCq (otherwise we keep the standard Euclidean distance). We report the results in Table 2. The combination LMNN+P-MinCq outperforms all other methods, including LMNN alone (\(p\) values of .05 with a Student test and .17 with a sign test). Notice that on some datasets where LMNN was by far the best performing method in the first series of experiments (e.g., on heart, parkinsons or voting), LMNN+P-MinCq is able to further improve these results (Fig. 3).

Table 2 Error rates of LMNN and LMNN+P-MinCq on twenty datasets

Fig. 3 Comparison of P-MinCq versus LMNN (left) and P-MinCq+LMNN versus LMNN (right)

6.2 Object categorization

Experimental setup. We provide additional experiments on Graz-01 (Opelt et al. 2004), a popular object categorization database that has two object classes (bike and person) and a background class. It is known to have large intra-class variation and significant background clutter (see Fig. 4). The tasks are bike versus non-bike and person versus non-person, and we follow the experimental setup of Opelt et al. (2004): for each object, we randomly sample 100 positive examples and 100 negative examples (of which 50 are drawn from the other object and 50 from the background). Images are represented as frequency histograms of 200 visual words built from SIFT interest points. We thus compute neighborhoods using two popular histogram distances: the \(\chi ^2\) and the intersection distances.
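For reference, the two histogram distances can be sketched as follows. Several conventions exist for both, and the text does not specify which one was used, so the formulas below are assumptions: the \(\chi ^2\) distance is taken as \(\frac{1}{2}\sum _i (h_i-g_i)^2/(h_i+g_i)\), and the intersection distance as \(1-\sum _i \min (h_i,g_i)\), both for histograms normalized to sum to one. The function names are hypothetical.

```python
def chi2_distance(h, g):
    """Chi-square distance between two L1-normalized histograms.

    Bins that are empty in both histograms contribute nothing.
    """
    total = 0.0
    for a, b in zip(h, g):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def intersection_distance(h, g):
    """One minus the histogram intersection of two L1-normalized histograms."""
    return 1.0 - sum(min(a, b) for a, b in zip(h, g))


if __name__ == "__main__":
    h = [0.5, 0.3, 0.2, 0.0]
    g = [0.1, 0.3, 0.2, 0.4]
    print(chi2_distance(h, g), intersection_distance(h, g))
```

With these conventions, both distances are 0 for identical histograms and 1 for histograms with disjoint support, which makes them directly comparable when building neighborhoods.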

Fig. 4 Some examples of bikes (left column), persons (middle) and background (right) taken from Graz-01. Only parts of the objects of interest may be visible, and the background class features difficult counter-examples to the bike class, such as motorbikes

Table 3 Error rates of NN, SNN, MinCq and P-MinCq on the Graz-01 database, averaged over 10 runs

Results. We report the results in Table 3, averaged over 10 runs. P-MinCq is again the most stable method and also the best on average across tasks and distance measures. Indeed, it significantly outperforms MinCq (\(p\) value smaller than .01 with a Student test), again illustrating the importance of a good prior \({\mathbf {P}}\) for learning the majority vote. Moreover, P-MinCq performs significantly better than NN (\(p\) value smaller than .01 with a Student test) and, to a smaller extent, better than SNN (\(p\) value of .13). It is worth noting that SNN performs rather well on this database: with large intra-class variation, it seems that extending the neighborhood can pay off. However, while the symmetry heuristic used by SNN is not relevant for all datasets, P-MinCq provides a principled and robust alternative.

7 Conclusion and future research

In this work, we have proposed a novel approach called P-MinCq for learning a weighted majority vote over classifiers of varying performance. It builds on MinCq, a recent algorithm grounded in PAC-Bayesian theory. Our method is based on a generalization of MinCq to \({\mathbf P}\)-aligned distributions, allowing us to incorporate a priori knowledge in the form of a distribution on the voters. This approach does not restrict the expressiveness of the majority vote, and we have provided generalization guarantees for data-dependent voters such as \(k\)-NN classifiers. Moreover, we have defined a specific \({\mathbf P}\)-aligned distribution adapted to the case of \(k \text{-NN }\) and provided experimental evidence of its good behavior.

Many promising perspectives arise from this work. First, the setting proposed in this paper is general enough to be used to combine virtually any set of classifiers (provided that they are bounded). For instance, our approach allows one to combine strong and weak classifiers and incorporate some a priori knowledge about their performance. Another interesting application is multi-view learning (Xu et al. 2013; Sun 2013), where P-MinCq could be used to combine classifiers (such as SVM) trained on multi-modal data coming from different sources and/or feature types (Morvant et al. 2014). In this case, \({\mathbf P}\) could encode the prior knowledge about the relative relevance of each modality for the task at hand. In general, in the absence of background knowledge, we note that defining a relevant \({\mathbf P}\) distribution for a set of learners can be difficult. Developing strategies to automatically assess \({\mathbf P}\) from (held-out) data could be very helpful in practice (Lever et al. 2013).

It would also be interesting to combine P-MinCq with other metric learning algorithms, such as the recent \(\chi ^2\) distance learning method for histogram data (Kedem et al. 2012). Lastly, extending P-MinCq to a multi-class setting is also of high interest. However, this requires margin and loss definitions tailored to multi-class problems, which raise technical difficulties and call for different theoretical tools, such as those in Morvant et al. (2012).