1 Introduction

In the last three decades, ensemble learning has been investigated by many researchers. The technique has been applied to diverse tasks such as classification, regression, and clustering, among others. Research has been conducted at multiple levels: from feature selection, through component classifier generation and selection, to the ensemble model itself (Jurek et al., 2014; Dong et al., 2020). Some ensemble approaches have been very successful in international machine learning competitions such as Kaggle and the KDD-Cups, and the technology has been used extensively in various application areas (Oza & Tumer, 2008).

Although some ensemble models (such as stacking, bagging, random forests, AdaBoost, gradient boosting machines, deep neural network-based models, and others) are more complicated, two relatively simple combination schemes, majority voting and weighted majority voting, have been widely used for the ensemble model (Sagi & Rokach, 2018). Even in those more complicated models, majority voting and weighted majority voting are still used frequently at intermediate or final combination stages.

How many component classifiers to use and how to select a subset from a large group of candidates are two related questions. Zhou et al. (2002) investigated the impact of the number of component classifiers on ensemble performance. They found that the number of component classifiers is not always a positive factor for improving performance when majority voting is used for combination. Their finding is now often referred to as the “many-could-be-better-than-all” theorem.

However, this finding is not echoed by many others. More empirical evidence indicates that the size of an ensemble has a positive impact on performance (Hernández-Lobato et al., 2013). To balance ensemble performance and efficiency, many papers investigate how to achieve the best possible performance by combining a fixed or small number of classifiers selected from a large pool of candidates (Latinne et al., 2001; Oshiro et al., 2012; Xiao et al., 2010; Dias & Windeatt, 2014; Bhardwaj et al., 2016; Ykhlef & Bouchaffra, 2017; Zhu et al., 2019).

Possibly inspired by the work of Zhou et al. (2002), Bonab and Can (2019) asserted that, for weighted majority voting, the ideal condition for the ensemble to achieve maximum accuracy is that the number of component classifiers equals the number of output classes. However, their theoretical analysis is limited by the strength of the assumptions it relies on, and experimental results do not always support it.

Diversity among component classifiers is a factor that may influence ensemble performance (Bi, 2012; Jain et al., 2020; Zhang et al., 2020). However, there is no generally accepted definition of diversity (Kuncheva & Whitaker, 2003). Many measures, such as Yule’s Q statistic, the correlation coefficient, the disagreement measure, the F-score, the double-fault measure, Kohavi-Wolpert variance, and Kuncheva’s entropy, have been proposed and investigated (Tang et al., 2006; Visentini et al., 2016). Moreover, many diversity measures, such as the F-score, the disagreement measure, and the double-fault measure, either are performance measures or are closely related to them. This explains why the effect of diversity on ensemble performance is unclear in some cases (Tang et al., 2006; Bi, 2012).

In short, although ensemble learning has garnered considerable attention from researchers, much of the literature comprises methodological and empirical studies, while theoretical work is underrepresented. Many fundamental questions remain open. We list some of them below:

  1. What is the difference between majority voting and weighted majority voting?

  2. When should we use majority voting rather than weighted majority voting, or vice versa?

  3. There are many weight assignment methods for weighted majority voting; which one is the best?

  4. How does the number of component classifiers affect ensemble performance?

  5. How does each component classifier affect ensemble performance?

  6. How does the performance of component classifiers affect ensemble performance?

  7. How does the diversity of component classifiers affect ensemble performance?

The goal of this paper is to set up a geometric framework for ensemble classifiers, thereby enabling us to answer the above questions clearly. In this framework, the result from each base classifier is represented as a point in a multi-dimensional space. We can use the same measure, the Euclidean distance, for both performance and diversity. Ensemble learning becomes a deterministic problem. Many interesting theorems about the relation between component classifiers and the ensemble classifier can be proven. Armed with this framework, we discover and present some useful features of majority voting and weighted majority voting. The major contributions of this paper are as follows:

  1. A geometric framework is presented for ensemble classifiers.

  2. An optimal weighting scheme is given for weighted majority voting.

  3. A Euclidean distance-based measure of diversity is given. Unlike all other diversity measures proposed so far, it is orthogonal to performance.

  4. Experiments on twenty data sets validate the theory for practical use.

The rest of this paper is organized as follows: Sect. 2 presents some related work. In Sect. 3, we present the geometric framework for ensemble classifiers, and some characteristics of majority voting and weighted majority voting are presented and compared. Discussion regarding the questions raised previously is given in Sect. 4, supported by the theory. In Sect. 5, we present some empirical investigation results to confirm our findings in Sect. 3. Finally, Sect. 6 concludes the paper.

2 Related work

Majority voting and weighted majority voting are commonly used in many ensemble models. Their features and some related tasks have been investigated by many researchers.

Zhou et al. (2002) investigated majority voting theoretically and empirically through ensembles of neural networks. They found that selecting a subset of classifiers may achieve better performance than combining all available classifiers. Note that in both their theoretical and empirical studies, weighted majority voting is mentioned but not considered.

Fumera et al. (2008) set up a probabilistic framework to analyse the bagging misclassification rate when the ensemble size increases. Majority voting is used for combination. Both their theoretical analysis and empirical investigation show that the bagging misclassification rate decreases as more and more component classifiers are combined. This is somewhat inconsistent with the finding of Zhou et al. (2002).

Latinne et al. (2001) used the McNemar test to decide whether two sets of component classifiers are significantly different. Classifiers are added to the pool one after another and the McNemar test is carried out repeatedly; the process stops when no significant difference can be observed.

Oshiro et al. (2012) raised the problem of how many component classifiers (trees) should be used for an ensemble (random forest). Experimenting with 29 real data sets in the biomedical domain, they observed that higher accuracy is achievable when more trees are combined by majority voting. However, when the number of trees is relatively large (say, over 32 or 64), the improvement is no longer significant. Therefore, a good choice is to consider both ensemble accuracy and computing cost. This issue is also addressed in Hernández-Lobato et al. (2013); Adnan and Islam (2016); Probst and Boulesteix (2017).

Many researchers find that combining more base classifiers enables the ensemble to achieve better performance, but fusing more base classifiers also costs more. Quite a few papers investigate the ensemble pruning problem, which aims to increase efficiency by reducing the number of base classifiers without losing much performance. Various kinds of methods have been proposed. See Xiao et al. (2010); Dias and Windeatt (2014); Bhardwaj et al. (2016); Ykhlef and Bouchaffra (2017); Zhu et al. (2019); Mohammed et al. (2022) for some of them.

How to assign proper weights for weighted majority voting has been investigated by quite a few researchers. Some methods consider different aspects of base classifiers for the weights: performance-based in Opitz and Shavlik (1996); Wozniak (2008), Bayesian model-based in Duan et al. (2007), prediction confidence-based in Schapire and Singer (1999), and Matthews correlation coefficient-based in Haque et al. (2016). Some methods have different goals for weight optimisation: minimizing classification error in Kuncheva and Diez (2014); Mao et al. (2015); Bashir et al. (2015), reducing variance while preserving a given accuracy in Derbeko et al. (2002), and minimizing Euclidean distance in Bonab and Can (2018). Different search methods are also used: a linear programming-based method is presented in Zhang and Zhou (2011), Georgiou et al. (2006) applied game theory to weight assignment, and Liu et al. (2014) applied a genetic algorithm. Dynamic adjustment of weights is investigated in Valdovinos and Sánchez (2009).

Wu and Crestani (2015) proposed a geometric framework for data fusion in information retrieval. In this query-level framework, for a certain query each component retrieval system assigns scores to all the documents in the collection, indicating their relevance to the query. In addition, predefined values are given to the documents that are relevant or irrelevant to the query, which serve as ideal scores for the documents involved. Therefore, the scores from each component retrieval system can be regarded as a point in a multi-dimensional space, and the ideal scores form an ideal point. Under this framework, performance and dissimilarity can be measured by the same metric, the Euclidean distance. Both majority voting and weighted majority voting can be explained and calculated in the geometric space. However, the framework is set up at the query level, which is equivalent to the instance level in ensemble classifiers.

Bonab and Can (2018, 2019) adapted the model in Wu and Crestani (2015) to ensemble classifiers. Some properties of majority voting and weighted majority voting are presented. However, as in Wu and Crestani (2015), their framework is at the instance level. This means that the theorems hold for each individual instance, but it is unclear whether they remain true for multiple instances collectively. The latter is a more important and realistic situation we should consider. As we know, a training data set or a test data set usually comprises a group of instances. It is desirable to know the collective properties of an ensemble classifier over all the instances, rather than those of any individual instance. This generalization is the major goal of this paper.

To do this, we first define a dataset-level framework and then go about proving a number of useful theorems.

3 The geometric framework

Table 1 Notation used in this paper

In this section, we introduce the dataset-level geometric framework.

Suppose for a machine learning problem we have p classes CL={\(cl_1\),\(cl_2\),\(\cdots\),\(cl_p\)}, the ensemble has m component base classifiers CF={\(cf_1\),\(cf_2\),\(\cdots\),\(cf_m\)}, and the dataset DT has n instances T={\(t_1\),\(t_2\),\(\cdots\),\(t_n\)}. For every instance \(t_i\) and every class \(cl_j\), each base classifier \(cf_k\) provides a score \(s_{ij}^{k}\), which indicates the estimated probability, given by \(cf_k\), that \(t_i\) is an instance of class \(cl_j\). \(s_{ij}^{k} \in [0,1]\). Each instance \(t_i\) also has a real label for each class \(cl_j\), which is 0 or 1. We may set up an \(n*p\)-dimensional space for the above problem. There are m points \(\{S^1,S^2,\cdots ,S^m\}\), each of which represents the scores given by a specific classifier to all the instances in the dataset for all the classes. Point \(S^k\) is:

$$\begin{aligned} S^k= & {} \{(s_{11}^k,s_{12}^k,\cdots ,s_{1p}^k),(s_{21}^k,s_{22}^k,\cdots ,s_{2p}^k),\cdots (s_{n1}^k,s_{n2}^k,\cdots ,s_{np}^k)\} \\= & {} \{s_{1}^k,s_{2}^k,\cdots ,s_{n*p}^k\} \\ \end{aligned}$$

As such, we can always organize all the scores first by instances and then by classes. Thus a two-dimensional array with n elements in one dimension and p in the other is transformed into a list of \(n*p\) elements: \(s_{ij}^{k}\) becomes \(s_{(i-1)*p+j}^k\). The ideal point is represented in the same style:

$$\begin{aligned} O= & {} \{(o_{11},o_{12},\cdots ,o_{1p}),(o_{21},o_{22},\cdots ,o_{2p}),\cdots (o_{n1},o_{n2},\cdots ,o_{np})\} \\= & {} \{o_{1},o_{2},\cdots ,o_{n*p}\} \\ \end{aligned}$$

which indicates the real labels of every instance in the dataset to each of the classes involved: 1 for a true label and 0 for a false label. The notation used in this paper is summarized in Table 1.

This framework is a generalization of the one presented in Bonab and Can (2018, 2019). If the dataset has only one instance, then the above framework is the same as the one in Bonab and Can (2018, 2019). This framework is suitable for soft voting (Cao et al., 2015), in which each component classifier provides probability scores for every instance relating to each class. If no such probability scores are provided (hard voting), it is still possible to apply the geometric framework by transforming the estimated class labels into proper scores. The geometric framework is applicable to both single-label and multi-label classification problems.

Example 1

A dataset includes two instances \(t_1\) and \(t_2\), and three classes \(cl_1\), \(cl_2\), and \(cl_3\). \(t_1\) is an instance of \(cl_2\) but not of the other two, and \(t_2\) is an instance of both \(cl_1\) and \(cl_3\) but not of \(cl_2\). The scores obtained from the three component classifiers \(cf_1\), \(cf_2\), and \(cf_3\) are as follows:

Instance | Classifier | Class \(cl_1\) | Class \(cl_2\) | Class \(cl_3\)
\(t_1\) | \(cf_1\) | \(s_{11}^1=0.5\) | \(s_{12}^1=0.6\) | \(s_{13}^1=0.3\)
\(t_1\) | \(cf_2\) | \(s_{11}^2=0.4\) | \(s_{12}^2=0.7\) | \(s_{13}^2=0.2\)
\(t_1\) | \(cf_3\) | \(s_{11}^3=0.6\) | \(s_{12}^3=0.8\) | \(s_{13}^3=0.4\)
\(t_1\) | Real label | \(o_{11}=0\) | \(o_{12}=1\) | \(o_{13}=0\)
\(t_2\) | \(cf_1\) | \(s_{21}^1=0.7\) | \(s_{22}^1=0.3\) | \(s_{23}^1=0.9\)
\(t_2\) | \(cf_2\) | \(s_{21}^2=0.3\) | \(s_{22}^2=0.6\) | \(s_{23}^2=0.7\)
\(t_2\) | \(cf_3\) | \(s_{21}^3=0.2\) | \(s_{22}^3=0.6\) | \(s_{23}^3=0.8\)
\(t_2\) | Real label | \(o_{21}=1\) | \(o_{22}=0\) | \(o_{23}=1\)

In a 6-dimensional geometric space, we may set up four points to represent this scenario: \(S^1=\{0.5,0.6,0.3,0.7,0.3,0.9\}\), \(S^2=\{0.4,0.7,0.2,0.3,0.6,0.7\}\), \(S^3=\{0.6,0.8,0.4,0.2,0.6,0.8\}\), and \(O=\{0,1,0,1,0,1\}\). \(\square\)

In the following we use point and component result interchangeably when no confusion arises. The Euclidean distance between two points \(S^{u}\) and \(S^{v}\) is

$$\begin{aligned} ed(S^u,S^v)=\sqrt{{\sum _{i=1}^{n}\sum _{j=1}^{p}{(s_{ij}^u-s_{ij}^v)}^2}}=\sqrt{\sum _{l=1}^{n*p}{(s_{l}^u-s_{l}^v)}^2} \end{aligned}$$
(1)

We may use the Euclidean distance to evaluate the performance of classifier \(cf_u\) over n instances and p classes.

$$\begin{aligned} ed(S^u,O)=\sqrt{{\sum _{i=1}^{n}\sum _{j=1}^{p}{(s_{ij}^u-o_{ij})}^2}}=\sqrt{\sum _{l=1}^{n*p}{(s_l^u-o_l)}^2} \end{aligned}$$
(2)

An advantage of the geometric framework is that both the performance of a component result and the dissimilarity between two component results are evaluated with the same metric. They will be referred to as the performance distance and the dissimilarity distance later in this paper.

Definition 1

(Performance distribution). In a geometric space X, there are m points \({\mathcal {S}}\)=\(\{S^1,S^2,\cdots ,S^m\}\) (\(m \ge 1\)). O is the ideal point. Performance distribution of these m points in \({\mathcal {S}}\) is defined as \(ed(S^i,O)\) for (\(1 \le i \le m\)).

Definition 2

(Dissimilarity distribution). In a geometric space X, there are m points \({\mathcal {S}}\)=\(\{S^1,S^2,\cdots ,S^m\}\) (\(m \ge 1\)). Dissimilarity distribution of these m points in \({\mathcal {S}}\) is \(ed(S^i,S^j)\) for (\(1 \le i \le m\), \(1 \le j \le m\), \(i \ne j\)).

From their definitions, we can see that the performance distribution and the dissimilarity distribution describe two completely different aspects of \({\mathcal {S}}\) and are not related to each other. This independence makes it possible to investigate the properties of ensembles, especially the effect of each of the two aspects on ensemble performance, separately.

3.1 Majority voting

In an \(n*p\)-dimensional space, there are m points \({\mathcal {S}}\)={\(S^1\),\(S^2\),\(\cdots\),\(S^m\)} (\(m \ge 2\)). Combining them by majority voting can be understood as finding the centroid of these m points. This is referred to as the centroid-based fusion method in Wu and Crestani (2015).

Theorem 1

In a geometric space X, there are m points \({\mathcal {S}}\)=\(\{S^1,S^2,\cdots ,S^m\}\) (\(m \ge 2\)). C is the centroid of these m points and O is the ideal point. The distance between C and O is no greater than the average distance between the m points and O:

$$\begin{aligned} \sqrt{\sum _{l=1}^{n*p}{(c_l-o_l)^2}} \le \frac{1}{m}\sum _{k=1}^{m}\sqrt{\sum _{l=1}^{n*p}{(s_{l}^{k}-o_l)^2}} \end{aligned}$$
(3)

Proof

Replacing C with its definition \(c_l=\frac{1}{m}\sum _{k=1}^{m}{s_l^k}\) (for \(1 \le l \le n*p\)) in Eq. (3) and moving the factor \(\frac{1}{m}\) from the right side to the left side as m, we get

$$\begin{aligned} m\sqrt{\sum _{l=1}^{n*p}{(\frac{1}{m}\sum _{k=1}^{m}{s_l^k}-o_l)^2}} \le \sum _{k=1}^{m}\sqrt{\sum _{l=1}^{n*p}{(s_{l}^{k}-o_l)^2}} \end{aligned}$$
(4)

The Minkowski Sum Inequality (Minkowski, 2020) is

$$\begin{aligned} {\left(\sum _{l=1}^{n*p}{{(a_l+b_l)}^q}\right)}^{1/q} \le {\left(\sum _{l=1}^{n*p}{a_l^q}\right)}^{1/q}+{\left(\sum _{l=1}^{n*p}{b_l^q}\right)}^{1/q} \end{aligned}$$

where \(q>1\) and \(a_l\), \(b_l\) > 0 (the inequality also holds for arbitrary real values, since it is the triangle inequality for the \(L^q\) norm). For our problem, we let \(q=2\) and \(a_l^k=s_l^k-o_l\). Then we have:

$$\begin{aligned}{} & {} \sum _{k=1}^m{\sqrt{\sum _{l=1}^{n*p}{{(s_l^k-o_l)}^2}}} = \sum _{k=1}^m\sqrt{\sum _{l=1}^{n*p}{(a_l^k)}^2} \\ {}{} & {} \quad \ge \sqrt{\sum _{l=1}^{n*p}{{(a_l^1+a_l^2)}^2}} + \sum _{k=3}^m\sqrt{\sum _{l=1}^{n*p}{(a_l^k)}^2} \\ {}{} & {} \quad \ge \cdots \ge \sqrt{\sum _{l=1}^{n*p}{{(a_l^1+a_l^2+\cdots +a_l^m)}^2}} = \sqrt{\sum _{l=1}^{n*p}{\left(\sum _{k=1}^m{a_l^k}\right)^2}} \end{aligned}$$

Now consider the left side of Inequality (4):

$$\begin{aligned} m\sqrt{\sum _{l=1}^{n*p}{{\left(\frac{1}{m}\sum _{k=1}^m{s_l^k}-o_l\right)^2}}} = \sqrt{\sum _{l=1}^{n*p}{\left(\sum _{k=1}^{m}{s_l^k}-m*o_l\right)^2}} = \sqrt{\sum _{l=1}^{n*p}{\left(\sum _{k=1}^m{a_l^k}\right)^2}} \end{aligned}$$

which equals the lower bound derived above, so Inequality (4) holds.

\(\square\)

This theorem tells us that, if we take all the instances together and use the Euclidean distance as the performance metric, then the performance of the ensemble by majority voting is at least as good as the average performance of all the component classifiers involved. This theorem confirms the observation by many researchers that the results from an ensemble are stable and good.

Example 2

Consider the points in Example 1. \(C=\{0.5,0.7,0.3,0.4,0.5,0.8\}\), \(ed(S^1,O)=0.83\), \(ed(S^2,O)=1.11\), \(ed(S^3,O)=1.26\), \(ed(C,O)=1.04\), \((ed(S^1,O)+ed(S^2,O)+ed(S^3,O))/3=1.06\). \(ed(C,O)\) is slightly smaller than the average of \(ed(S^1,O)\), \(ed(S^2,O)\), and \(ed(S^3,O)\). \(\square\)
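
The numbers in Example 2 are easy to reproduce. The following minimal sketch (a hypothetical illustration in Python with NumPy, using the points of Example 1) recomputes the centroid and the performance distances, and checks Theorem 1 numerically.

```python
import numpy as np

# Points from Example 1: each row is S^k, the flattened scores of classifier cf_k
S = np.array([[0.5, 0.6, 0.3, 0.7, 0.3, 0.9],
              [0.4, 0.7, 0.2, 0.3, 0.6, 0.7],
              [0.6, 0.8, 0.4, 0.2, 0.6, 0.8]])
O = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # ideal point (real labels)

C = S.mean(axis=0)                             # centroid, i.e. the majority voting result
dists = np.linalg.norm(S - O, axis=1)          # ed(S^k, O) for each component classifier
print(np.round(C, 2))                          # [0.5 0.7 0.3 0.4 0.5 0.8]
print(np.round(dists, 2))                      # [0.83 1.11 1.26]
# Theorem 1: ed(C,O) is no greater than the average of the component distances
print(round(np.linalg.norm(C - O), 2), "<=", round(dists.mean(), 2))
```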

Theorem 2

In a space X, suppose that \({\mathcal {S}}\)={\(S^1\), \(S^2\),\(\cdots\), \(S^m\)} and O are known points. C is the centroid of \(S^1\), \(S^2\),\(\cdots\), \(S^m\). The distance between C and O can be represented as

$$\begin{aligned} ed(C,O)=\frac{1}{m}\sqrt{m\sum _{i=1}^m{ed(S^i,O)}^2-\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(S^i,S^j)}^2} \end{aligned}$$
(5)

Proof

Assume that \(O=(0,\ldots,0)\); this can always be achieved by a coordinate translation, which does not change any of the distances involved. According to its definition \(C=(\frac{1}{m}\sum _{i=1}^m{s_1^i},..., \frac{1}{m}\sum _{i=1}^m{s_{n*p}^i})\), we have

$$\begin{aligned} \begin{aligned} ed(C,O)&=\sqrt{{(\frac{1}{m}\sum _{i=1}^m{s_1^i})^2}+...+{(\frac{1}{m}\sum _{i=1}^m{s_{n*p}^i})^2}} \\&=\frac{1}{m}\sqrt{\sum _{i=1}^m\sum _{k=1}^{n*p}{(s_k^i)}^2+2\sum _{k=1}^{n*p}\sum _{i=1}^{m-1}\sum _{j=i+1}^m{s_k^i*s_k^j}} \end{aligned} \end{aligned}$$
(6)

Note that the distance between \(S^i\) and \(S^j\) is \(ed(S^i,S^j)=\sqrt{\sum _{k=1}^{n*p}{(s_k^i-s_k^j)}^2}\) or \({ed(S^i,S^j)}^2=\sum _{k=1}^{n*p}{(s_k^i)^2}+\sum _{k=1}^{n*p}{(s_k^j)^2}-2\sum _{k=1}^{n*p}{s_k^i*s_k^j}\)

Because \({ed(S^i,O)}^2=\sum _{k=1}^{n*p}{(s_k^i)}^2\) and \({ed(S^j,O)}^2=\sum _{k=1}^{n*p}{(s_k^j)}^2\), we get

$$\begin{aligned} 2\sum _{k=1}^{n*p}{s_k^i*s_k^j}={ed(S^i,O)}^2+{ed(S^j,O)}^2-{ed(S^i,S^j)}^2 \end{aligned}$$

Considering all possible pairs of points \(S^i\) and \(S^j\), we get

$$\begin{aligned} 2\sum _{k=1}^{n*p}\sum _{i=1}^{m-1}\sum _{j=i+1}^m{s_k^i*s_k^j}=(m-1)\sum _{i=1}^m{ed(S^i,O)}^2-\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(S^i,S^j)}^2 \end{aligned}$$

In Eq. (6), we use the right side of the above equation to replace

$$\begin{aligned} 2\sum _{k=1}^{n*p}\sum _{i=1}^{m-1}\sum _{j=i+1}^m{s_k^i*s_k^j} \end{aligned}$$

Also noting that \(\sum _{i=1}^m\sum _{k=1}^{n*p}({s_k^i})^2=\sum _{i=1}^m{ed(S^i,O)}^2\), we obtain

$$\begin{aligned} ed(C,O)=\frac{1}{m}\sqrt{m\sum _{i=1}^m{ed(S^i,O)}^2-\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(S^i,S^j)}^2} \end{aligned}$$

\(\square\)

This theorem tells us that the ensemble performance is completely decided by \(ed(S^i,O)\) (for \(1 \le i \le m\)) and \(ed(S^i, S^j)\) (for \(1 \le i \le m\), \(1 \le j \le m\), \(i \ne j\)). The impact of both the performance of component classifiers and the dissimilarity of all pairs of component classifiers on ensemble performance can be seen clearly. According to Theorem 2, in order to minimize \(ed(C,O)\), we need to minimize \(\sum _{i=1}^{m}{ed(S^i,O)^2}\) and maximize \(\sum _{i=1}^{m-1}\sum _{j=i+1}^{m}{ed(S^i,S^j)^2}\) at the same time. Therefore, for the performance distribution, both the average and the variance affect ensemble performance: a lower average and a lower variance lead to better performance. For the dissimilarity distribution, both the average and the variance also affect ensemble performance: a higher average and a higher variance lead to better performance.
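
As a quick sanity check of Eq. (5), the identity can be verified numerically. The sketch below (hypothetical random points, Python with NumPy) compares the directly computed \(ed(C,O)\) with the value given by Theorem 2.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m, dim = 5, 8                                    # m component points in an (n*p)-dimensional space
S = rng.random((m, dim))                         # hypothetical component points
O = rng.integers(0, 2, size=dim).astype(float)   # ideal point of 0/1 labels

C = S.mean(axis=0)                               # centroid (majority voting)
direct = np.linalg.norm(C - O)

perf = sum(np.linalg.norm(S[i] - O) ** 2 for i in range(m))                          # sum of ed(S^i,O)^2
diss = sum(np.linalg.norm(S[i] - S[j]) ** 2 for i, j in combinations(range(m), 2))   # sum over pairs
via_eq5 = np.sqrt(m * perf - diss) / m           # right-hand side of Eq. (5)

print(round(direct, 6), round(via_eq5, 6))       # the two values coincide
```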

Example 3

Assume that \({\mathcal {S}}^1\)={\(S^1\), \(S^2\)}, \({\mathcal {S}}^2\)={\(S^3\), \(S^4\)}, \(ed(S^1,S^2)=ed(S^3,S^4)\), \(ed(S^1,O)\)=0.5, \(ed(S^2,O)\)=0.5, \(ed(S^3,O)\)=0.4, \(ed(S^4,O)\)=0.6. The centroid of \(S^1\) and \(S^2\) is \(C^1\), and the centroid of \(S^3\) and \(S^4\) is \(C^2\). We have \(ed(C^1,O)\) < \(ed(C^2,O)\).

In this example, because dissimilarity distance is the same for both \({\mathcal {S}}^1\) and \({\mathcal {S}}^2\), we only need to consider performance distribution. The average of \(ed(S^1,O)\) and \(ed(S^2,O)\) is 0.5, and the average of \(ed(S^3,O)\) and \(ed(S^4,O)\) is also 0.5. The variance of \({\mathcal {S}}^1\) is smaller than that of \({\mathcal {S}}^2\). \(ed(S^1,O)^2+ed(S^2,O)^2\)=0.50, \(ed(S^3,O)^2+ed(S^4,O)^2\)=0.52. Because \(ed(S^1,S^2)=ed(S^3,S^4)\), we get \(ed(C^1,O)\) < \(ed(C^2,O)\). \(\square\)

Example 4

In the figure below, there are six points \(S^i\) \((1 \le i \le 6)\). Among these points, \(S^1\) is the closest to O. It is followed by \(S^2\) and \(S^3\), which are equally distant to O. Finally, \(S^4\), \(S^5\) and \(S^6\) are equally distant to O and they are all further from O than the other three points. Now we try to work out a subset of these to maximize performance.

figure a

Now, if we select one point only, then \(S^1\) is the best option; if we select two points, then combining \(S^2\) and \(S^3\) is the best option, and the centroid in this case is \(C^2\); if we select three points, then combining \(S^4\), \(S^5\) and \(S^6\) is the best option, which gives the centroid O. Fusing all six points is also a good option, but it may not be as good as fusing \(S^4\), \(S^5\) and \(S^6\). The “many-could-be-better-than-all” theorem (Zhou et al., 2002) seems reasonable in this case. Because the centroid of a group of points is decided by the positions of all the points collectively, each with an equal weight, removing or adding even a single point may change the position of the centroid significantly. This also indicates that it is not an easy task to find the best subset from a large group of base classifiers. As a matter of fact, it is an NP-hard problem, as our next theorem proves. \(\square\)

Theorem 3

\({\mathcal {S}}\) = {\(S^1\), \(S^2\),\(\cdots\), \(S^m\)} for \((m \ge 3)\) is a group of points and O is an ideal point. For a given number \(m'\) (\(2 \le m'<m\)), the problem is to find a subset of \(m'\) points from \({\mathcal {S}}\) to minimize the distance of the centroid of these \(m'\) points to the ideal point O. The above problem is NP-hard.

Proof

First let us have a look at the maximum diversity problem (MDP), which is a known NP-hard problem (Wang et al., 2014). The MDP is to identify a subset \(\mathcal {E'}\) of \(m'\) elements from a set \({\mathcal {E}}\) of m elements, such that the sum of the pairwise distance between the elements in the subset is maximized. More precisely, let \({\mathcal {E}}\) = {\(e_1\), \(e_2\),\(\cdots\), \(e_m\)} be a group of elements, and \(d_{ij}\) be the distance between elements \(e_i\) and \(e_j\). The objective of the MDP can be formulated as:

Maximize

$$\begin{aligned} f(x)=\frac{1}{2}\sum _{i=1}^{m}\sum _{j=1}^{m}d_{ij}*x_{i}*x_{j} \end{aligned}$$
(7)

subject to \(\sum _{i=1}^{m}x_{i}=m', x_{i} \in \{0,1\}, i=1,\cdots , m\)

where each \(x_i\) is a binary (zero–one) variable indicating whether an element \(e_i \in {\mathcal {E}}\) is selected to be a member of \(\mathcal {E'}\).

On the other hand, according to Theorem 2, our problem may be written as minimizing \({ed(C,O)}^2\). Here C is the centroid of \(m'\) selected points.

$$\begin{aligned} {ed(C,O)}^2=\frac{1}{m'}\sum _{i=1}^m{ed(S^i,O)}^2*x_i-\frac{1}{2*m'*m'}\sum _{i=1}^{m}\sum _{j=1}^{m}{ed(S^i,S^j)}^2*x_i*x_j \end{aligned}$$

subject to \(\sum _{i=1}^{m}x_{i}=m', x_{i} \in \{0,1\}, i=1,\cdots , m\)

where each \(x_i\) is also a binary (zero–one) variable indicating whether a point \(S^i \in {\mathcal {S}}\) is selected to be a member of \(\mathcal {S'}\). The factor \(\frac{1}{2}\) in the second term appears because the double sum counts every pair of points twice.

If we assume that \(ed(S^i,O)\) is the same for all \(i=1,\cdots , m\), then \(\frac{1}{m'}\sum _{i=1}^m{ed(S^i,O)}^2*x_i\) becomes a constant and minimizing \({ed(C,O)}^2\) is equivalent to maximizing \(div(x)\), where

$$\begin{aligned}div(x)=\frac{1}{2*m'*m'}\sum _{i=1}^{m}\sum _{j=1}^{m}{ed(S^i,S^j)}^2*x_i*x_j\end{aligned}$$

Let \(g(x)=div(x)\) and \(d_{ij}=\frac{1}{m'*m'}ed(S^i,S^j)^2\); then the above equation can be rewritten as

$$\begin{aligned} g(x)=\frac{1}{2}\sum _{i=1}^{m}\sum _{j=1}^{m}d_{ij}*x_i*x_j \end{aligned}$$
(8)

Comparing Eqs. (7) and (8), we can see that this restricted special case of our problem is exactly an instance of the MDP. Since the MDP is NP-hard, our problem is NP-hard as well. \(\square\)

Theorem 3 tells us: for a given number of classifiers, choosing a subset for best ensemble performance by majority voting is an NP-hard problem.
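
Since the selection problem is NP-hard, an exact solution in general requires, in the worst case, examining all subsets of size \(m'\), which quickly becomes infeasible as m grows. The brute-force sketch below (hypothetical random points, Python with NumPy) illustrates this exhaustive search for a small m.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
m, m_sub, dim = 10, 3, 6
S = rng.random((m, dim))                          # candidate component points
O = rng.integers(0, 2, size=dim).astype(float)    # ideal point

# Exhaustive search over all C(m, m') subsets; the number of subsets grows
# combinatorially with m, reflecting the hardness of the general problem.
best = min(combinations(range(m), m_sub),
           key=lambda idx: np.linalg.norm(S[list(idx)].mean(axis=0) - O))
print(best, round(np.linalg.norm(S[list(best)].mean(axis=0) - O), 4))
```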

Although the “many-could-be-better-than-all” theorem may be applicable in some cases, it does not tell the whole story. Now let us have a look at how the size of the ensemble impacts its performance. Because the performance of an ensemble is affected by several different factors, we need a way to separate the effect of ensemble size from the other factors.

Theorem 4

In a space X, \({\mathcal {S}}\) = \(\{S^1,S^2,\cdots ,S^m\}\) and O are known points. C is the centroid of \(S^1,S^2,\cdots ,S^m\), \(C^1\) is the centroid of \(m-1\) points \(S^2,S^3,\cdots ,S^m\), \(C^2\) is the centroid of \(m-1\) points \(S^1,S^3,\cdots ,S^m\),\(\cdots\), \(C^m\) is the centroid of \(m-1\) points \(S^1,S^2,\cdots ,S^{m-1}\). We have

$$\begin{aligned} ed(C,O)\le \frac{1}{m}\sum _{i=1}^{m}{ed(C^i,O)} \end{aligned}$$
(9)

Proof

According to the definitions of \(C^1\), \(C^2\), \(\cdots\), \(C^m\), C is also the centroid of these m points, because \(\frac{1}{m}\sum _{i=1}^m C^i=\frac{1}{m(m-1)}\sum _{i=1}^m\sum _{j \ne i} S^j=\frac{1}{m}\sum _{j=1}^m S^j=C\). The theorem then follows by applying Theorem 1 to the points \(C^1\), \(C^2\), \(\cdots\), \(C^m\). \(\square\)

Theorem 4 can be applied repeatedly to prove more general statements in which the subsets include \(m-2\), \(m-3\), \(\cdots\), or 2 points. This demonstrates that the number of component results has a positive effect on ensemble performance.

Example 5

In a space X, \(\{S^1,S^2,S^3,S^4\}\) and O are known points. There are four different combinations of three points and six combinations of two points. We use \(C^{1234}\) to represent the centroid of all 4 points. Similarly, \(C^{23}\) represents the centroid of \(S^2\) and \(S^3\), and so on. Applying Theorem 4 repeatedly, we have

$$\begin{aligned} ed(C^{1234},O)\le & {} \frac{1}{4}[ed(C^{123},O)+ed(C^{124},O)+ed(C^{134},O)+ed(C^{234},O)] \\\le & {} \frac{1}{6}[ed(C^{12},O)+ed(C^{13},O)+ ed(C^{14},O) \\{} & {} +ed(C^{23},O)+ed(C^{24},O)+ed(C^{34},O)] \\\le & {} \frac{1}{4}[ed(S^{1},O)+ed(S^{2},O)+ed(S^{3},O)+ed(S^{4},O)] \quad \\ \end{aligned}$$

\(\square\)

Although Theorem 4 shows that the number of component classifiers has a positive effect on ensemble performance, it is not clear how significant the effect is. Next let us look at this matter quantitatively. In order to focus on the number of component classifiers, we make a few simplifying assumptions. Suppose that \(S^1\), \(S^2\),\(\cdots\), \(S^m\) and O are known points in space X. ed(\(S^i\), O)=\(c_p\) for any (\(1 \le i \le m\)), ed(\(S^i\), \(S^j\))=\(c_d\) for any (\(1 \le i \le m\), \(1 \le j \le m\), \(i \ne j\)), and \(c_d= \theta *c_p\). According to Theorem 2 and the above assumptions, we have

$$\begin{aligned} {ed(C,O)}^2= & {} \frac{1}{m^2}[m\sum _{i=1}^m{ed(S^{i},O)}^2-\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(S^{i},S^{j})}^2] \\= & {} {c_p}^2-\frac{m-1}{2m}{c_d}^2 \\= & {} (1-\frac{m-1}{2m}{\theta }^2){c_p}^2 \end{aligned}$$

Therefore,

$$\begin{aligned} ed(C,O)=\sqrt{(1-\frac{m-1}{2m}{\theta }^2)}*c_p \end{aligned}$$
(10)

By definition, \(ed(C,O)\) cannot be negative, so \((1-\frac{m-1}{2m}{\theta }^2) \ge 0\) must hold. Therefore, for a given m, \(\theta\) has an upper limit. If \(m=2\) and \(\theta =2\), then \(ed(C,O)=0\), so \(\theta =2\) is the maximal value in this case. Likewise, when \(m=3\), the maximal value of \(\theta\) is \(\sqrt{3}\).

Fig. 1 The impact of the number of component classifiers on ensemble performance

Figure 1 shows the values of \(ed(C,O)\) in \(c_p\) units for \(\theta =0.25, 0.5, 0.75, 1\) and \(m=2, 3, \cdots ,100\). From Fig. 1 we can see that in all four cases, \(ed(C,O)\) decreases with m. However, as m becomes larger and larger, the rate of decrease becomes smaller and smaller. When m tends to infinity, \(ed(C,O)\) approaches \(\sqrt{1-0.5{\theta }^2}\) \(c_p\) units, which is 0.984, 0.935, 0.848, and 0.707 when \(\theta\) equals 0.25, 0.5, 0.75, and 1, respectively. Adding more component results does not help much if there is already a large number of them. For example, when \(\theta =1.0\) and \(m=27\), \(ed(C,O)\) is 0.720 \(c_p\) units, which is already close to the limit of 0.707 \(c_p\) units. This suggests that fusing 30 or more component results may not be very useful for further improving ensemble performance, which has been observed in some earlier empirical studies, such as Oshiro et al. (2012).
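
The curves in Fig. 1 follow directly from Eq. (10). The following short sketch (Python; the values of m are chosen only for illustration) prints \(ed(C,O)\) in \(c_p\) units for several ensemble sizes together with the limit reached as m tends to infinity.

```python
import numpy as np

def relative_distance(m, theta):
    """ed(C,O) in c_p units under the assumptions of Eq. (10)."""
    return np.sqrt(1.0 - (m - 1) / (2.0 * m) * theta ** 2)

for theta in (0.25, 0.5, 0.75, 1.0):
    values = [round(relative_distance(m, theta), 3) for m in (2, 3, 10, 27, 100)]
    limit = round(np.sqrt(1 - 0.5 * theta ** 2), 3)   # value as m tends to infinity
    print(theta, values, "limit:", limit)
# For theta = 1.0 this prints 0.866, 0.816, ..., 0.72 (m = 27), approaching the limit 0.707
```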

Theorem 1 tells us that the ensemble performance is at least as good as the average performance of all the component classifiers involved. This guarantee may not be strong enough for many applications of the technique. Theorem 3 indicates that it may take too much time to choose a subset from a large group of candidates for good ensemble performance. This problem may be approached in other ways. In particular, if we want the ensemble to perform better than the best component classifier, more favorable conditions are required; that is, some restrictions need to be applied to the component classifiers involved. Theorem 5 is useful for this.

Theorem 5

In a space X, \({\mathcal {S}}\) = \(\{S^1\), \(S^2\),\(\cdots\), \(S^m\}\) and O are known points. At least one of the points in \({\mathcal {S}}\) is different from the others. C is the centroid of \(S^1\), \(S^2\),\(\cdots\), \(S^m\). If \(ed(S^1,O)=ed(S^2,O)=\cdots =ed(S^m,O)\), then \(ed(C,O)<ed(S^1,O)\) must hold.

Proof

According to Theorem 2, we have

$$\begin{aligned} {ed(C,O)}^2= & {} \frac{1}{m}\sum _{i=1}^m{ed(S^i,O)}^2-\frac{1}{m*m}\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(S^i,S^j)}^2 \\< & {} \frac{1}{m}\sum _{i=1}^m{ed(S^i,O)}^2={ed(S^1,O)}^2 \end{aligned}$$

Because at least one point in \({\mathcal {S}}\) is different from the others, at least one \(ed(S^i,S^j)>0\), so the inequality above is strict. Therefore, we obtain \(ed(C,O)<ed(S^1,O)\). \(\square\)

Theorem 5 tells us that if all the component classifiers are equally effective, then majority voting is able to do better than the guarantee of Theorem 1. In practice, this situation can be realised in various ways. For example, by using bagging with random forests or neural networks (Oshiro et al., 2012; Yang et al., 2013), we can generate a large number of almost equally effective component classifiers. Good performance is achievable by fusing such classifiers. See Sect. 4 for more discussion.
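
Theorem 5 is also easy to check numerically. The sketch below (a hypothetical geometric illustration in Python; it ignores the [0,1] range of real probability scores) places m distinct points at exactly the same distance \(c_p\) from the ideal point and confirms that their centroid is strictly closer to it.

```python
import numpy as np

rng = np.random.default_rng(2)
m, dim, c_p = 4, 6, 0.9
O = rng.integers(0, 2, size=dim).astype(float)      # ideal point

# Build m distinct points that all lie at exactly the distance c_p from O
directions = rng.normal(size=(m, dim))
S = O + c_p * directions / np.linalg.norm(directions, axis=1, keepdims=True)

C = S.mean(axis=0)
print(np.round(np.linalg.norm(S - O, axis=1), 3))    # all equal to c_p
print(round(np.linalg.norm(C - O), 3))               # strictly smaller than c_p (Theorem 5)
```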

3.2 Weighted majority voting

Weighted majority voting is a generalization of majority voting. It is more flexible than its counterpart because different weighting schemes can be defined. One might believe that the two are similar, and this is indeed true when the weights across component classifiers are largely similar. However, for weighted majority voting, we have little interest in the universe of all possible weighting schemes; rather, we focus on the optimum weighting scheme. We look into the following two questions in particular.

  1. How to find the optimum weights for a group of component classifiers?

  2. What are the properties of weighted majority voting with the optimum weights?

Let us begin with the first question. In a space X, \({\mathcal {S}}\) = \(\{S^1,S^2,\cdots ,S^m\}\) and O are known points. Let F be the fused point of the linear combination of the m points in \({\mathcal {S}}\) with weights \(w^1,w^2,\cdots ,w^m\).

$$\begin{aligned} ed(F,O)^2=\sum _{i=1}^{n}\sum _{j=1}^{p}{\left(\sum _{k=1}^{m}{(w^k*s_{ij}^{k})}-o_{ij}\right)^2} \end{aligned}$$
(11)

Our goal is to minimize \(ed(F,O)^2\). Assuming \(f(w^1,w^2,\cdots ,w^m)=ed(F,O)^2\), we have

$$\begin{aligned} \frac{\partial f}{\partial w^q}= \sum _{i=1}^{n}\sum _{j=1}^{p}2\left(\sum _{k=1}^{m}(w^k*s_{ij}^{k})-o_{ij}\right)s_{ij}^q\end{aligned}$$

Let \(\frac{\partial f}{\partial w^q}=0\) for (q=1,2,\(\cdots\),m), then

$$\begin{aligned}\sum _{k=1}^{m}w^k\sum _{i=1}^{n}\sum _{j=1}^{p}{(s_{ij}^ks_{ij}^q)}=\sum _{i=1}^{n}\sum _{j=1}^{p}{o_{ij}s_{ij}^q} \end{aligned}$$

Let \(a_{qk}=\sum _{i=1}^{n}\sum _{j=1}^{p}s_{ij}^{k}s_{ij}^{q}\) for \((1 \le q \le m)\) and \((1 \le k \le m)\), and \(b_q=\sum _{i=1}^{n}\sum _{j=1}^{p}o_{ij}s_{ij}^q\) for \((1 \le q \le m)\). Thus we obtain the following m linear equations with m variables \(w^1,w^2,\cdots ,w^m\):

$$\begin{aligned} \begin{array}{c} a_{11}w^1+a_{12}w^2+\cdots +a_{1m}w^m=b_1 \\ a_{21}w^1+a_{22}w^2+\cdots +a_{2m}w^m=b_2 \\ \cdots \\ a_{m1}w^1+a_{m2}w^2+\cdots +a_{mm}w^m=b_m \end{array} \end{aligned}$$
(12)

The optimum weights can be calculated by solving these m linear equations. Note that minimizing \(ed(F,O)^2\) and minimizing \(ed(F,O)\) are equivalent for finding the optimum weights because \(ed(F,O)\) cannot be negative.
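
In matrix form, Eq. (12) is just a set of normal equations and can be solved with any linear-algebra routine. The sketch below (hypothetical helper name, Python with NumPy) computes the optimum weights for the three component points of Example 1 and the resulting fused distance.

```python
import numpy as np

def optimal_weights(S, O):
    """Solve Eq. (12): A w = b, where a_qk = S^q . S^k and b_q = O . S^q.

    S has shape (m, n*p), one row per component point; O has shape (n*p,).
    """
    A = S @ S.T                   # a_qk: inner products between component points
    b = S @ O                     # b_q:  inner products with the ideal point
    return np.linalg.solve(A, b)  # unique solution if the points are linearly independent

# The three component points and the ideal point of Example 1
S = np.array([[0.5, 0.6, 0.3, 0.7, 0.3, 0.9],
              [0.4, 0.7, 0.2, 0.3, 0.6, 0.7],
              [0.6, 0.8, 0.4, 0.2, 0.6, 0.8]])
O = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

w = optimal_weights(S, O)
F = S.T @ w                                    # fused point under weighted majority voting
print(np.round(w, 3), round(np.linalg.norm(F - O), 3))
```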

Theorem 6

In an \(n*p\) dimensional space X, \({\mathcal {S}}\)={\(S^1\), \(S^2\),\(\cdots\), \(S^m\)} and O are known points. If the points in \({\mathcal {S}}\) are linearly independent, then the above process and Eq. (12) find the unique solution to the problem.

Proof

The linear independence of the points in \({\mathcal {S}}\) implies that \(m \le n*p\) holds. For the same reason, any point that can be represented linearly by these m points has a unique representation. We may write \(ed(F,O)\) as \(f(w^1,w^2,\cdots ,w^m)\), which is a continuous function. In the whole space, there is only one minimum and no other saddle points or maxima. The point at which all partial derivatives of the function with respect to all variables equal zero must be the minimum point. The m equations set up in Eq. (12) find this point, together with its unique representation as weights. \(\square\)

Intuitively, in an \(n*p\) dimensional space X, the m points in \({\mathcal {S}}\)={\(S^1\), \(S^2\),\(\cdots\), \(S^m\)} span a subspace \(X'\) of X. For any point O, there exists one and only one point in \(X'\) that has the shortest distance to O. This point can be linearly represented by \(S^1\), \(S^2\),\(\cdots\), \(S^m\).

Theorem 6 can be explained as follows. For a (training) dataset with n instances, p classes, and m classifiers, each classifier gives a score for each instance and each class, and for each instance we also have the real labels relating to all the classes. Then we are able to find a group of weights \(w^1,w^2,\cdots ,w^m\) for \(S^1\), \(S^2\),\(\cdots\), \(S^m\) that achieves the best ensemble performance by weighted majority voting.

The following Theorem 7 answers the second question.

Theorem 7

In a \(n*p\) dimensional space X, \({\mathcal {S}}^1\)={\(S^1\), \(S^2\),\(\cdots\), \(S^m\)}, \({\mathcal {S}}^2\)={\(S^1\), \(S^2\),\(\cdots\), \(S^m\), \(S^{m+1}\)}, and O is an ideal point. If the optimum weights are used for both \({\mathcal {S}}^1\) and \({\mathcal {S}}^2\), then the performance of Group \({\mathcal {S}}^2\) is at least as effective as that of Group \({\mathcal {S}}^1\).

Proof

Assume that \(w^1, w^2,\cdots ,w^m\) are the optimum weights for \(S^1\), \(S^2\),\(\cdots\), \(S^m\) of \({\mathcal {S}}^1\) to obtain the best performance. For \({\mathcal {S}}^2\), if we use the same weights \(w^1, w^2,\cdots ,w^m\) for \(S^1\), \(S^2\),\(\cdots\), \(S^m\), and 0 for \(S^{m+1}\), then the ensemble performance of \({\mathcal {S}}^2\) is the same as that of \({\mathcal {S}}^1\). This weighting scheme is not necessarily optimal for \({\mathcal {S}}^2\), and the optimum weights for \({\mathcal {S}}^2\) can only perform at least as well. \(\square\)

Theorem 7 can be explained as follows: for a given dataset, consider two groups of classifiers. Group 1 has m classifiers, \(cf_1\),\(cf_2\),\(\cdots\),\(cf_m\), and Group 2 has \(m+1\) classifiers, \(cf_1\),\(cf_2\),\(\cdots\),\(cf_m\),\(cf_{m+1}\). The first m classifiers in the two groups are the same. If weighted majority voting with the optimum weights is used, then the ensemble performance of Group 2 is at least as good as that of Group 1.

Corollary 7.1

In a \(n*p\) dimensional space X, assume that weighted majority voting is applied with the optimum weights. When more and more points are added, the ensemble performance is monotonically non-decreasing.

Proof

It can be proven by applying Theorem 7 repeatedly. \(\square\)

Intuitively, when more and more points are added, the subspace becomes bigger and bigger. During this process, it becomes possible to find points that are even closer to the ideal point.

Corollary 7.1 can be explained in this way. Assume that, for a given dataset, an ensemble is implemented by weighted majority voting with the optimum weights. When more and more classifiers are added to such an ensemble, its performance is monotonically non-decreasing.
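
Corollary 7.1 can also be observed numerically: with the optimum (least-squares) weights, adding component points never increases the distance between the fused point and the ideal point. A minimal sketch with hypothetical random points:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, m = 12, 6                                   # n*p dimensions, m candidate classifiers
S = rng.random((m, dim))                         # hypothetical component points
O = rng.integers(0, 2, size=dim).astype(float)   # ideal point of 0/1 labels

prev = np.inf
for k in range(1, m + 1):
    # optimum weights for the first k points: least-squares fit of O on S^1..S^k
    w, *_ = np.linalg.lstsq(S[:k].T, O, rcond=None)
    d = np.linalg.norm(S[:k].T @ w - O)          # ed(F, O) with the optimum weights
    assert d <= prev + 1e-9                      # Corollary 7.1: non-increasing distance
    prev = d
    print(k, round(d, 4))
```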

Theorem 2 tells us how the performance of individual classifiers (\(ed(S^i,O)\)) and the dissimilarity between classifiers (\(ed(S^i,S^j)\)) affect ensemble performance under the majority voting scheme. We may regard weighted majority voting as a variation of majority voting: before applying Theorem 2, weighted majority voting moves each component result to a new position through the linear weighting scheme, and thus Eq. (5) becomes

$$\begin{aligned} ed(C,O)=\frac{1}{m}\sqrt{m\sum _{i=1}^m{ed(w^i \cdot S^i,O)}^2-\sum _{i=1}^{m-1}\sum _{j=i+1}^m{ed(w^i \cdot S^i,w^j \cdot S^j)}^2} \end{aligned}$$
(13)

where \(w^i \cdot S^i\) denotes the point whose coordinates are \(w^i*s_{jk}^i\) (for \(1 \le j \le n\) and \(1 \le k \le p\)), i.e., \(S^i\) scaled by its weight. After that, both schemes can be treated in the same way.

4 Discussion

In Sect. 3 we have set up a geometric framework and presented the properties of majority voting and weighted majority voting. Now we are in a good position to compare the two different levels of geometric frameworks and answer the questions raised in the first section of this paper.

4.1 Dataset-level vs. instance-level frameworks

A major objective of the ensemble problem is to provide a solution for all the instances in a dataset. The level at which a framework is defined affects how we can deal with this problem. With an instance-level framework, we need an additional step to extend the results to all the instances in the dataset, whereas the dataset-level framework covers them directly.

In the dataset-level framework, all the instances in the whole dataset are concatenated to form a super-instance. Thus, all the properties that hold in the instance-level framework also hold in the dataset-level framework.

However, a few differences need to be noted. One is the dimensionality of the geometric space involved. For a classification problem with p classes and a dataset with n instances, the dimensionality of the instance-level geometric space is p, while that of the dataset-level geometric space is \(p*n\).

How to calculate the optimal weights for weighted majority voting is another place where the two levels may lead to different solutions. The solution given in Sect. 3 of this paper is to minimize the Euclidean distance between the linear combination of all the component points and the ideal point (refer to Eq. (11)). Recall that all the instances in the dataset are concatenated into a single super-instance. For the instance-level framework, however, we still need to consider multiple instances together. One possible way is to minimize the sum of the distances over all instances, or

$$\begin{aligned} \sum _{i=1}^n{ed(F_i,O_i)}=\sum _{i=1}^{n}\sqrt{\sum _{j=1}^{p}{\left(\sum _{k=1}^{m}{(w^k*s_{ij}^{k})}-o_{ij}\right)^2}} \end{aligned}$$
(14)

where \(F_i\) and \(O_i\) are the fused point and the ideal point of instance \(t_i\), respectively. It is tricky to optimise Eq. (14) directly. To simplify, we may optimise \(\sum _{i=1}^n{ed(F_i,O_i)^2}\) instead (Wu & Crestani, 2015; Bonab & Can, 2018). In this way, the problem becomes the same as Eq. (11). This demonstrates the connection between the two levels of framework. For the instance-level framework, optimising \(\sum _{i=1}^n{ed(F_i,O_i)^2}\) approximates optimising \(\sum _{i=1}^n{ed(F_i,O_i)}\). On the other hand, for the dataset-level framework, optimising \(ed(F,O)^2\) is exactly the same as optimising \(ed(F,O)\), which means that the weights obtained are optimum.

4.2 The size of ensemble

As discussed in Sect. 3, we proved that the number of component classifiers has a positive impact on ensemble performance for both majority voting and weighted majority voting. However, this contradicts the assertion in Bonab and Can (2019) that, for a multi-class classification problem with k classes, k is the ideal number of base classifiers to constitute an optimum ensemble by weighted majority voting. Let us analyse this further.

Example 6

Consider a classification problem with two classes and three base classifiers. One instance is shown in the figure below. Weighted majority voting is used for combination.

figure b

The figure above shows a two-dimensional space with three points \(S^1\), \(S^2\), \(S^3\) to be combined, and the ideal point O. We can see that combining all three points can lead to the optimal result of zero distance, which cannot be achieved by fusing any two of them. Therefore, in this example the assertion in Bonab and Can (2019) does not even hold at the instance level. However, as this example shows, for an n-dimensional space, \(n+1\) independent points are enough. \(\square\)

We may add some restrictions to the points involved. For example, for a binary classification problem, we may require all the points to lie on the line segment between [0,1] and [1,0]. The ideal point is either [1,0] or [0,1]. In this case, at most two points are needed to obtain the optimal fusion result. In the figure below, any two of the three points \(S^1\), \(S^2\), and \(S^3\) are sufficient for this task.

figure c

Anyhow, the assertion at the instance level is not very useful. To consider the problem in a more realistic way, we need to look at it at the dataset level. A dataset usually comprises at least a good number of instances. If we consider three instances of a binary classification problem, then the dimensionality of the dataset-level geometric framework is up to 2*3=6, not 2 any more. If the dataset has more instances, then we may include even more independent classifiers. With an increased number of base classifiers, we are likely to get better ensemble performance (Corollary 7.1).

On the other hand, we can obtain the same conclusion as Corollary 7.1 even under the instance-level framework. Assume that the whole dataset has n instances. \(F_i\) and \(O_i\) are the fused point and the ideal point for instance \(t_i\), respectively. The optimal weights for the m base classifiers are \(w^1\),\(w^2\),\(\cdots\),\(w^m\). Then we have

$$\begin{aligned} \sum _{i=1}^n{ed(F_i,O_i)}=\sum _{i=1}^{n}\sqrt{\sum _{j=1}^{p}{(\sum _{k=1}^{m}{(w^k*s_{ij}^{k})}-o_{ij})^2}} \end{aligned}$$
(15)

Now one more base classifier is added. We can set a new weighting scheme as \(w_{new}^1=w^1\),\(w_{new}^2=w^2\),\(\cdots\),\(w_{new}^m=w^m\), \(w_{new}^{m+1}=0\).

$$\begin{aligned} \sum _{i=1}^n{ed_{new}(F_i,O_i)}=\sum _{i=1}^{n}\sqrt{\sum _{j=1}^{p}{(\sum _{k=1}^{m+1}{(w_{new}^k*s_{ij}^{k})}-o_{ij})^2}} \end{aligned}$$
(16)

Then \(\sum _{i=1}^n{ed(F_i,O_i)}=\sum _{i=1}^n{ed_{new}(F_i,O_i)}\). \(w^1\),\(w^2\),\(\cdots\),\(w^m\) is the optimal weighting scheme for m base classifiers, while \(w_{new}^1\),\(w_{new}^2\),\(\cdots\),\(w_{new}^m\), \(w_{new}^{m+1}\) may not be optimal for \(m+1\) base classifiers. Therefore, if the optimal weighting scheme is used, then fusing \(m+1\) base classifiers can achieve at least the same performance as fusing m base classifiers. This is exactly what Corollary 7.1 tells us.

4.3 Answer to some questions

In Sect. 1, we listed some outstanding questions. Now let us discuss them one by one.

Question 1: What is the difference between majority voting and weighted majority voting?

Both combination schemes can potentially enhance ensemble performance, but they differ in the strength of the guarantee they can offer. Weighted majority voting can be better than the best component classifier if the optimum weights are used, while majority voting is only guaranteed to be better than the average of all component classifiers. Majority voting is a “mild” method because all the component results are treated equally and the centroid is the solution, while weighted majority voting is an “extreme” method because it does not treat all component results equally and it tries to pick the most effective solution from all possible ones.

Question 2: When should we use majority voting rather than weighted majority voting, or vice versa?

Theoretically, weighted majority voting is always better than, or at least as good as, majority voting if the optimal weights are applied and the Euclidean distance is used for performance evaluation. However, in reality these two conditions may not be satisfied; the weights learnt on the training dataset may not be the best for the test dataset. Therefore, a general answer is: when majority voting does not work well (e.g., the ensemble performs worse than the best component classifier), weighted majority voting should be used.

Now a further question is: when is majority voting a good method? The performance of all component results, the dissimilarity of all pairs of component results, and the number of component results all have a positive impact on ensemble performance, and a judicious decision should consider these factors thoroughly. More specifically, Theorem 2 answers this question quantitatively. One easily noticeable situation is when all the component classifiers have equal or very close performance; in that case, majority voting may achieve better ensemble performance than the best component classifier (Theorem 5).

Question 3: There are many weight assignment methods for weighted majority voting; which one is the best?

Optimality is a complicated issue and it depends on the goal of optimisation. There are many different measures, and no single method can be the best for all of them. Least squares is the best weight assignment method for the Euclidean distance measure. Compared with many others, it is both efficient and effective. Almost all other weight assignment methods are either heuristic or general optimisation methods; the effectiveness of the former is not guaranteed, and the latter are time-consuming.

Question 4: How does the number of component classifiers affect ensemble performance?

The number of component classifiers has a positive effect on ensemble performance. The situation is straightforward for weighted majority voting: when more component results are added to an ensemble, its performance does not decrease, assuming that the optimal weights are used in the ensemble.

On the other hand, the situation for majority voting is more complicated. Adding more component classifiers to an ensemble does not always improve performance, and finding the subset of a group of candidates with the best ensemble performance is an NP-hard problem (Theorem 3).

Question 5: How does each component classifier affect ensemble performance?

This question is related to Question 1. For majority voting, each classifier contributes equally; for weighted majority voting, each contributes differently so as to obtain the optimal result for the whole data set. If a very good component classifier is added, then weighted majority voting can take full advantage of it. On the other hand, for majority voting, if many component classifiers are poor, then a few good ones may not be able to improve performance very much.

Question 6: How does performance of component classifiers affect ensemble performance?

The performance of component classifiers is the most important aspect that affects ensemble performance. For majority voting, a high-performance point is able to move the centroid of the group closer to the ideal point. For weighted majority voting, a high-performance point very likely expands the subspace towards the ideal point, so that fused points closer to it can be found.

Question 7: How does diversity of component classifiers affect ensemble performance?

Many different definitions of diversity have been proposed. In this paper, we define it as the dissimilarity distribution over all pairs of component results. Apart from performance, diversity is another aspect that impacts ensemble performance significantly. For majority voting, a comparative investigation of diversity and performance has been carried out in Sect. 3.1. Based on a simplified situation, the importance ratio between diversity and performance is calculated to be in the range of (0.25,0.5], varying with the number of component classifiers. For weighted majority voting, high diversity among component results makes the subspace bigger, and thus it is more likely that points closer to the ideal point can be found in such a space.

The advantage of the diversity measure defined in this paper is its orthogonality to performance, which makes its effect on ensemble performance very clear. It is also helpful when we try to select a subset of base classifiers from all available ones (classifier pruning (Mohammed et al., 2022)) for effective and efficient ensembling.

One final comment about the framework: all the theorems in the geometric framework hold when the Euclidean distance is used for measuring performance. When other metrics are used, the conclusions we obtain may hold for many of the instances, but not for every single instance. However, there are strong correlations between any other meaningful performance metric and the Euclidean distance. If enough instances are observed, we may expect consistent conclusions.

5 Empirical investigation

In this section we investigate how the theoretical conclusions presented in Sect. 3 can be confirmed for practical use. Within the geometric framework, all the theorems hold exactly when the Euclidean distance is used as the performance metric on the same dataset. We would like to see how they behave when these conditions are only partially satisfied. Specifically, two points are considered:

  • Quite a few weighting schemes have been proposed before for ensemble learning. We are going to see how the newly proposed optimal scheme is related to two typical weighting schemes proposed before. We also compare their performance and efficiency through experiments.

  • Usually classification accuracy or some other metrics, rather than Euclidean distance, is used for performance evaluation. It is interesting to find the strength of the correlation between the Euclidean distance and other commonly used metrics.

To allow for reproducibility, the datasets, source code, and raw results of this study are available on GitHub.Footnote 1

5.1 Weighting schemes for ensemble learning

Before presenting the results of empirical investigation, let us further discuss the weighting schemes proposed in this paper and two others proposed before.

Recall that the optimal weighting scheme is obtained by minimizing the Euclidean distance between the ensemble point and the ideal point (see Eq. (11)). In practice, this can be implemented by applying multiple linear regression.Footnote 2 We illustrate it using an example taken from Seewald (2002) with some necessary modifications. Suppose that the training data set includes m base classifiers, a group of n instances, and three class labels a, b, and c. Table 2 shows an example of the original training set (left side) and the class probability distribution of a base classifier \(cf_i\) (right side).

Stacking (Ting & Witten, 1999) and StackingC with multiple linear regression (MLR) (Seewald, 2002) are two typical weighting schemes for ensemble learning. Multiple linear regression is used in both of them. It is also noticeable that the optimal weights obtained by minimizing the Euclidean distance can be found by multiple linear regression as well. This method is referred to as ED (Euclidean Distance) with MLR later in this paper. Although many other weighting schemes have also been proposed (e.g., in Caruana et al. (2004); Mao et al. (2015); Nguyen et al. (2019); Sen and Erdogan (2013); Ting and Witten (1999); Zhang and Zhou (2011)), these three are very closely related to each other. The training sets used by these three methods are shown in Tables 3, 4, 5, respectively.

In both Stacking and StackingC, a multiple linear regression model is set up to learn the weights for each class label. Therefore p models are required for a data set with p classes. However, Stacking uses all available \(m \times p\) variables in each regression model, while StackingC uses only the m variables that are related to the given class. ED uses a single regression model for all the classes, with m variables involved. In a sense, StackingC is a simplified version of Stacking and ED is a simplified version of StackingC.

Note that for binary classification problems, the three weighting schemes are almost the same. Although more variables are used in Stacking, half of them are redundant. This is because the probability scores of any instance j obtained from a base classifier \(cf_i\) sum to 1: \(S_{ja}^i+S_{jb}^i=1\). Using either \(S_{ja}^i\) or \(S_{jb}^i\) is enough. For the same reason, the weights learnt by either Stacking or StackingC are the same for both classes a and b. Therefore, it is enough to learn weights for one of the classes.

For multiclass classification problems, with regard to the training dataset the model trained by Stacking is more accurate than the one trained by StackingC, and the model trained by StackingC is more accurate than the one trained by ED. However, this may not always be the case on the test dataset, because more complex models have a bigger chance of overfitting. Such a phenomenon is observed in Seewald (2002). It is also confirmed in our experiments; see later sections for detailed results.
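
To make the difference between the three meta-level training sets concrete, the sketch below (hypothetical data and variable names, Python with NumPy) builds the regression problems for Stacking, StackingC, and ED from an array of base-classifier probability scores, matching the variable counts described above (\(m \times p\) per class model, m per class model, and m in a single model, respectively).

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p = 100, 4, 3                                # instances, base classifiers, classes (toy sizes)
P = rng.dirichlet(np.ones(p), size=(n, m))         # P[i, k, j]: score of classifier k for class j on instance i
Y = np.eye(p)[rng.integers(0, p, size=n)]          # one-hot real labels, shape (n, p)

# ED with MLR: a single regression with m weights shared by all classes
X_ed = P.transpose(0, 2, 1).reshape(n * p, m)      # row (i*p + j) holds s_{ij}^1 ... s_{ij}^m
w_ed, *_ = np.linalg.lstsq(X_ed, Y.reshape(n * p), rcond=None)

# StackingC: one regression per class, using only the m class-specific variables
w_stackingc = [np.linalg.lstsq(P[:, :, j], Y[:, j], rcond=None)[0] for j in range(p)]

# Stacking: one regression per class, using all m*p variables
X_stacking = P.reshape(n, m * p)
w_stacking = [np.linalg.lstsq(X_stacking, Y[:, j], rcond=None)[0] for j in range(p)]

print(w_ed.shape, len(w_stackingc), w_stackingc[0].shape, len(w_stacking), w_stacking[0].shape)
# (4,) 3 (4,) 3 (12,): m weights in total, p*m weights, and p*(m*p) weights, respectively
```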

Table 2 A data set used for training weights; it includes m base classifiers, n instances, and three classes (a, b, and c); \(s_{jk}^i\) refers to the probability score given by base classifier \(cf_i\) for class k on instance number j
Table 3 Meta training set for class a, Stacking with MLR
Table 4 Meta training set for class a, StackingC with MLR
Table 5 Meta training set for all the classes, ED with MLR; the last column is not a part of the training data

5.2 Comparison of three weighting schemes

In this subsection we empirically investigate the three weighting schemes: Stacking, StackingC, and ED. 26 datasets downloaded from the UCI Machine Learning Repository,Footnote 3 are used for this purpose. The main statistics of these 26 datasets are listed in Table 6. These datasets vary in the number of instances, features, and classes.

Table 6 Statistics of the datasets used in the study

Three types of base classifiers, namely decision tree, support vector machine, and logistic regression-based classifiers, were used to build the ensembles.

For each dataset, we randomly divided the instances into five equal subsets. Two of the subsets were used to generate base classifiers, another two were used to train the weights for the ensemble, and the remaining one was used for testing. All 30 possible role assignments were tried (a sketch of their enumeration is given below). To reduce the impact of random partitioning, we repeated the above process 20 times for each dataset. The results are shown in Table 7; each figure is the average over 20 iterations and 30 combinations for the corresponding dataset.
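The 30 role assignments arise from choosing 2 of the 5 subsets for base-classifier training and 2 of the remaining 3 for weight training, i.e., \(\binom{5}{2}\binom{3}{2} = 30\). A hypothetical sketch of the enumeration, with subset indices 0-4 standing for the five equal parts:

```python
from itertools import combinations

subsets = range(5)
splits = []
for base in combinations(subsets, 2):                     # 2 subsets for base classifiers
    rest = [s for s in subsets if s not in base]
    for weight in combinations(rest, 2):                  # 2 subsets for weight training
        test = next(s for s in rest if s not in weight)   # 1 subset for testing
        splits.append((base, weight, test))
print(len(splits))  # 30 = C(5,2) * C(3,2)
```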

From Table 7 we can see that StackingC and ED are close in performance, while Stacking is not as good as the other two. However, the differences are small and not significant if we consider all the cases together. On one of the datasets (Ba), all four schemes (the three weighting schemes plus majority voting) perform equally. Of the remaining 25 datasets, majority voting wins eight; Stacking, StackingC, and ED are the winners on 4, 8, and 10 datasets, respectively. Note that on two datasets (Ca and Pe) the three weighting schemes are joint winners, and on one dataset (An) Stacking and StackingC are joint winners.

In some cases, the weighted ensembles do not perform as well as simple majority voting. This happens when the conditions are unfavorable to weighted ensembles: if all the base classifiers are close in performance, it is harder for a weighted ensemble to beat majority voting. Therefore, we measured the performance of the base classifiers and calculated R, the difference ratio between the best and the worst. Two groups were formed based on this difference, with 10% as the threshold. If we only consider the group of 16 datasets in which R > 10% (labelled “yes” in the column “R>10%” of Table 7), the differences between Stacking and StackingC and between Stacking and ED are significant (paired-samples t test, two-sided, p = 0.010 and p = 0.008, respectively), while the p value for the difference between StackingC and ED is 0.055, just outside the significance level of 0.05. A sketch of this test is given below.
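The significance test can be reproduced along the following lines (a sketch with hypothetical variable names; the actual per-dataset accuracies come from Table 7):

```python
from scipy.stats import ttest_rel

def compare_schemes(acc_a, acc_b):
    """Paired-samples, two-sided t test over per-dataset accuracies
    of two weighting schemes on the R > 10% group of datasets."""
    t, p = ttest_rel(acc_a, acc_b)
    return t, p

# Hypothetical usage: one accuracy value per dataset in the R > 10% group.
# t, p = compare_schemes(acc_stacking, acc_stackingc)   # reported p = 0.010
```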

Table 7 Performance (in Accuracy) comparison of three weighting schemes (R denotes the performance difference between the best and worst base classifiers)

We also measured the time required to train the weights under the three schemes. A desktop computer with an i7-11700 CPU and 16 GB of RAM was used. Table 8 shows the results on seven selected datasets. All three methods run very fast. Understandably, ED is the fastest and Stacking is the slowest, with StackingC in between. When a dataset has a small number of instances and a small number of classes, the differences are also small; this is the case for Ba, Ca, De, En, and GL. When a dataset has a larger number of instances or classes, the differences become larger. Pe has over 10,000 instances and 10 classes, and So has only 683 instances but a relatively large number of classes (18). As a result, Stacking takes more than 8 times as long as ED on Pe (Pen-Based Recognition) and more than 18 times as long on So (Soybean).

Table 8 Time (in seconds) for weight training by different methods

5.3 Correlation between Euclidean distance and other metrics

In this subsection we investigate the strength of correlation between Euclidean distance and other commonly used metrics.

In fact, Euclidean distance and all other reasonable metrics are directly related to probability scores. Consider a single instance in a binary classification problem: a probability score for the true class below 0.5 always means a wrong classification, while a score above 0.5 always means a correct classification, regardless of the exact value. For example, with true class a, scores of 0.6 and 0.9 for a both yield a correct classification (and the same Accuracy contribution), but the Euclidean distances to the ideal point (1, 0) differ: \(\sqrt{2}\times 0.4\) versus \(\sqrt{2}\times 0.1\). An observable property of Euclidean distance is therefore that it is more sensitive to changes in the probability scores than the other metrics. However, when a large number of instances are considered together, we can expect a strong (either negative or positive) correlation between Euclidean distance and other metrics such as precision and recall.

For each dataset, we randomly divide the instances into two equal partitions. One partition is used to train a group of classifiers; then all the instances in the other partition are evaluated with several metrics, including Euclidean distance. Finally, the correlations between the different metric pairs are calculated. The roles of the two partitions as training and testing sets are then exchanged.

Two types of classifiers, decision tree-based (20) and support vector machine-based (20), were generated. To increase the diversity of the generated classifiers, different settings were tried: rather than using all the features, we randomly selected a subset (2/3) of them, and we also varied a few parameters, including the criterion and maximum depth for the CART decision trees, and the kernel and gamma for the SVMs. A sketch of this generation process is given below.
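The pool generation can be sketched as follows (our own illustration using scikit-learn's CART trees and SVMs; the particular parameter values drawn here are assumptions, not the paper's exact settings):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_base_classifiers(X, y, n_per_type=20):
    # Each classifier sees a random 2/3 subset of the features and a randomly
    # drawn hyperparameter setting, to increase diversity within the pool.
    n_features = X.shape[1]
    k = max(1, (2 * n_features) // 3)
    models = []
    for _ in range(n_per_type):
        feats = rng.choice(n_features, size=k, replace=False)
        tree = DecisionTreeClassifier(
            criterion=str(rng.choice(["gini", "entropy"])),
            max_depth=int(rng.integers(3, 15)),
        ).fit(X[:, feats], y)
        svm = SVC(
            kernel=str(rng.choice(["rbf", "poly"])),
            gamma=float(rng.choice([0.01, 0.1, 1.0])),
            probability=True,   # probability scores are needed by the framework
        ).fit(X[:, feats], y)
        models.append((feats, tree))
        models.append((feats, svm))
    return models
```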

Apart from Euclidean distance, Accuracy, Precision, Recall, and F1 are considered. All 26 datasets in Table 6 are used in this experiment as well. For each dataset, we repeated the above process 20 times. Table 9 presents the Pearson correlation coefficients between each pair of these five metrics; a sketch of the computation follows.
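The correlation computation can be sketched as below (hypothetical helper names; the macro-averaged versions of Precision, Recall, and F1 and the exact definition of the per-classifier Euclidean distance are our assumptions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def euclidean_distance(prob, y_true, n_classes):
    # Distance between a classifier's point (all its probability scores on the
    # evaluation partition) and the ideal point (one-hot encoded true labels).
    ideal = np.eye(n_classes)[y_true]
    return np.linalg.norm(prob - ideal)

def metric_table(probs_per_clf, y_true, n_classes):
    # One row per classifier: Accuracy, Precision, Recall, F1, Euclidean distance.
    rows = []
    for prob in probs_per_clf:                     # each prob has shape (n, p)
        pred = prob.argmax(axis=1)
        rows.append([
            accuracy_score(y_true, pred),
            precision_score(y_true, pred, average="macro", zero_division=0),
            recall_score(y_true, pred, average="macro", zero_division=0),
            f1_score(y_true, pred, average="macro", zero_division=0),
            euclidean_distance(prob, y_true, n_classes),
        ])
    return np.array(rows)

# Pearson correlations between every metric pair, across all generated classifiers:
# corr = np.corrcoef(metric_table(probs_per_clf, y_true, n_classes).T)
```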

Table 9 Correlation (by Pearson correlation coefficient) of different metric pairs (DS denotes Dataset, A denotes Accuracy, P denotes Precision, R denotes Recall, F denotes F1, and E denotes Euclidean distance)

From Table 9, we can see that the correlation is very strong for all metric pairs in most cases. In 12 datasets, the correlation coefficients are above 0.9 for every metric pair. In a few datasets, the correlation is relatively weaker; the weakest is 0.56 (P & E on Ly). In general, this demonstrates that Euclidean distance is strongly correlated with all four metrics considered in the experiment. Therefore, we conclude that the theorems obtained from the geometric framework remain meaningful even when other metrics, such as Accuracy, are used for performance evaluation.

6 Conclusions

In this paper, we have presented a dataset-level geometric framework for ensemble classifiers. The most important advantage of the framework is that it turns ensemble learning into a deterministic problem. Euclidean distance has several good properties. One is its continuity, which makes it a good candidate as a target variable for optimization, for example when regression is used to deal with classification problems. Another is that it can measure both performance and dissimilarity, so it provides a good platform for understanding the fundamental properties of ensembles and for investigating many issues in ensemble classifiers, such as the impact of various factors on ensemble performance, predicting ensemble performance, and selecting a small number of base classifiers to obtain efficient and effective ensembles. Without such a framework, it is very challenging to grasp even an incomplete picture; this is why, up to now, some of the properties of majority voting and weighted majority voting have not been fully understood.

Compared with the instance-level framework in Wu and Crestani (2015) and Bonab and Can (2019), the dataset-level framework presented in this paper is a step forward. It maps the ensemble classification problem for a whole dataset into a single multi-dimensional space, which makes it more convenient to investigate the properties of ensembles. Otherwise, we would have to deal with multiple spaces at the same time, one for each instance, and discovering collective properties across those spaces is more complicated. Based on the dataset-level framework, we have deduced several useful theorems that had not been found before.

An empirical investigation has also been conducted to see how well the theorems of the geometric framework hold when Accuracy, rather than Euclidean distance, is used for performance evaluation. The experimental results show that the theorems remain meaningful for other metrics.

In this geometric framework, the exact relationship between ensemble performance and two major factors, the performance of all base classifiers and the diversity of the group, can be expressed by a mathematical equation, so the profitability of ensembling a group of base classifiers can be computed very quickly. Ensemble pruning (Mohammed et al., 2022), which selects a subset of component classifiers from all available ones for better performance and efficiency, has been investigated widely; this framework can be used for that task in different ways.

In this paper, the proposed framework assumes the traditional setting of a batch of training data. In recent years, data stream classification has attracted considerable attention (Gomes et al., 2017); how to adapt the geometric framework to this setting, in particular how to incorporate dynamic updates, is worth further research. Another research topic is multi-modal data fusion (Gao et al., 2020). Again, how to adapt the framework to support multi-modal data fusion is an interesting research issue; one possible solution is to use a separate framework for each modality and then combine them. These issues remain as our future work.