Optimising Diversity in Classifier Ensembles

Ensembles of predictors have generally been found to perform better than single predictors. Although diversity is widely thought to be an important factor in building successful ensembles, there have been contradictory results in the literature regarding the influence of diversity on the generalisation error. Fundamental to this may be the way diversity itself is defined. We present two new diversity measures, based on the idea of ambiguity, obtained from the bias-variance decomposition using the cross-entropy error or the hinge loss. If random sampling is used to select patterns on which ensemble members are trained, we find that the generalisation error is negatively correlated with diversity at high sampling rates; conversely, generalisation error is positively correlated with diversity when the sampling rate is low and the diversity high. We use evolutionary optimisers for small ensembles to select the subsets of patterns for predictor training by maximising these diversity measures on training data. Evaluation of their generalisation performance on a range of classification datasets from the literature shows that the ensembles obtained by maximising the cross-entropy diversity measure generalise well, enhancing the performance of small ensembles. Regarding big ensembles, we define tree selection methods that favour ambiguous ensembles over unambiguous ensembles. Our results show that the approach that prefers ambiguous ensembles reduces the generalisation error the most and considerably reduces the number of trees needed to obtain good generalisation performance.


Introduction
A principal concern of supervised machine learning is to ensure a predictor demonstrates good generalisation. A predictor is considered to generalise well if it performs well when predicting on unseen data drawn from the same process that generated its training data [1,2]. Ensembles are collections of predictors, each of which is trained on a different subset of patterns or features. Some ensemble methods such as bagging [3] or boosting [4] have proved very successful in pattern classification tasks [5], and ensembles have in general been shown to predict better than a single predictor [6,7].
In this paper we consider classification of patterns $x_n$, $n = 1, \dots, N$ into two classes, the positive and the negative class. Each of the $M$ members of the ensemble yields a score $y_{in} \equiv y_i(x_n)$, $i = 1, \dots, M$ indicating how likely it is that $x_n$ belongs to the positive class. The ensemble score $Y_n \equiv Y(x_n)$, which may be converted to a decision by thresholding, is in general the weighted average of the constituent predictor scores [6]:
$$Y_n = \sum_{i=1}^{M} c_i \, y_{in},$$
where $c_i$ are the non-negative weights assigned to the constituent ensemble members, $\sum_{i=1}^{M} c_i = 1$. Here we assume throughout that the ensemble members carry equal weight, so that $c_i = 1/M$ for all $i$. When the constituent classifiers produce a hard decision and the weights are equal, this amounts to the often-used majority voting.
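The weighted combination above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the function name `ensemble_score` is ours.

```python
import numpy as np

def ensemble_score(member_scores, weights=None):
    """Combine member scores y_in into the ensemble score Y_n.

    member_scores: array of shape (M, N) -- one row per ensemble member.
    weights: non-negative weights c_i summing to 1; uniform (1/M) if None.
    """
    member_scores = np.asarray(member_scores, dtype=float)
    M = member_scores.shape[0]
    if weights is None:
        weights = np.full(M, 1.0 / M)   # equal weighting, c_i = 1/M
    return weights @ member_scores      # Y_n = sum_i c_i * y_in

# With hard decisions in {-1, +1} and equal weights, thresholding the
# averaged score at 0 recovers majority voting.
scores = [[+1, -1, +1],
          [+1, +1, -1],
          [-1, +1, +1]]
decisions = np.sign(ensemble_score(scores))   # majority vote per pattern
```

Each column of `scores` holds the three members' votes on one pattern; here every pattern has a 2-to-1 majority for the positive class.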
Various methods for assigning the classifier weights have been developed in [8][9][10][11]. Linear combinations have been mathematically investigated in [12,13], together with nonlinear methods utilising rank-based information in [14], belief-based methods in [15][16][17] and voting schemes in [18,19]. Here, however, we assume that the predictors are equally weighted and focus on the choice of patterns on which the ensemble members are trained.

(This article is part of the topical collection "Applications of bioinspired computing (to real world problems)" guest edited by Aniko Ekart, Pedro Castillo and Juanlu Jiménez-Laredo.)
Clearly, an accurate ensemble requires accurate members. However, Krogh and Vedelsby [20] have proven that an ensemble with good generalisation performance consists of members which disagree in their predictions [21]. As a result, diversity and accuracy are key factors in building successful ensembles.
Although the role of diversity has long been recognised, many ways of quantifying the diversity of an ensemble have been proposed. Kuncheva and Whitaker [22] empirically compared different diversity measures to assess the impact that diversity has on an ensemble's generalisation performance. However, their results did not support a clear influence of diversity on the overall performance of the ensembles. This was partially explained in [23], which showed that different diversity measures have different degrees of correlation with generalisation error. It was also shown that there tends only to be high (negative) correlation between diversity and generalisation error when diversity is low and generalisation error is high; as diversity increases, the correlation with generalisation error decreases [23]. We explore this aspect in more detail below.
In [20] Krogh and Vedelsby introduced a new diversity measure based on the ambiguity decomposition of regression ensembles and the bias-variance decomposition. The ambiguity term is obtained by subtracting the ensemble error from the average error of the predictors. Since the ambiguity is necessarily non-negative, the ensemble error is never greater than the average error of the predictors, which demonstrates the usefulness of ensembles. The ambiguity measures how much the predictions of the ensemble members differ from the ensemble prediction and as a result can be considered a type of diversity. Chen [23] defined another ambiguity measure in a similar fashion to [20], but for classifiers and using the 0-1 loss. In his work, Chen demonstrated that out of all the diversity measures tested (Q-statistics, Kappa statistics, correlation coefficient, disagreement, entropy, Kohavi-Wolpert variance, the measure of difficulty, generalised diversity, coincident failure diversity), the ambiguity measure had the highest correlation with the generalisation error [23]. In this paper we use the term ambiguity to refer to a measure of ensemble diversity.
Another aspect to take into consideration when building an ensemble is the amount of time needed for the classification process. In general a large number of classifiers ensures good classification performance; however, the time needed for this process increases linearly with the number of classifiers. As a result, pruning approaches have been studied to reduce the number of classifiers used in an ensemble without affecting the overall performance too much. Several pruning approaches that incorporate diversity have been proposed in [24][25][26], which showed that diversity has a significant role in reducing generalisation error.
Here we further explore the connection between ensemble diversity and generalisation error. Following [20,23], we define and characterise new ambiguity measures appropriate for the log loss and hinge loss. We investigate empirically the relationship between the ambiguity and the generalisation error. We do this by employing an evolutionary algorithm for the direct maximisation of the ambiguity of small ensembles, selecting the patterns that each ensemble member is trained on. In the case of big ensembles, we define tree selection methods that reduce the size of the ensemble. Amongst these, we define schemes based on ensemble ambiguity and on ensemble error, and analyse their performance empirically in terms of generalisation error.
The principal contributions of our work are as follows:

1. the derivation of a cross-entropy-based ambiguity measure for ensemble diversity;
2. the derivation of a hinge-loss-based ambiguity measure for ensemble diversity;
3. the empirical assessment of the ambiguity/generalisation error trade-off on a number of widely used classification data sets, using decision tree ensembles;
4. the exploration of the effect of ensemble sampling rates on this trade-off;
5. the exploration of the direct maximisation of ensemble ambiguity via an evolutionary optimisation of the training patterns to maximise generalisation performance;
6. an investigation of algorithms for pruning large ensembles to derive smaller ensembles that generalise well;
7. a new algorithm for constructing small, well-generalising ensembles by greedily selecting randomly generated trees.
This paper unifies and extends work initially presented in [27], particularly contributions 6 and 7 on pruning and greedy construction.
In the next section we present different diversity measures for ensembles using log and hinge losses. "An Evolutionary Algorithm to Optimise Ambiguity" presents an evolutionary algorithm for the optimisation of the cross-entropy diversity. "Experiments" illustrates the performance of the evolutionary optimiser on a range of classification problems. "Optimising Diversity by Tree Selection" details tree selection methods and analyses their generalisation performance. "Conclusion" presents the conclusions and the future work.

Ambiguity Measures
Extending the idea of quantifying diversity in regression ensembles [20], Chen [23] defined a new classifier ensemble diversity measure in terms of how diverse the outputs of the constituent classifiers are compared with the ensemble prediction. Following this line, we define new diversity measures as the difference between the average error of the individual classifiers forming the ensemble and the ensemble error; that is, we define the ambiguity through the simple relation:
$$\mathrm{amb}(Y_n) = \frac{1}{M}\sum_{i=1}^{M} E(y_{in}) - E(Y_n), \tag{2}$$
where $E(\cdot)$ denotes the loss. In line with [23], we call these measures of diversity ambiguity measures.
We first review the ambiguity for the 0-1 loss [23], before defining new ambiguities for the log loss and hinge loss.

Ambiguity Measure for 0-1 Loss
Here we assume that the targets, the true classes against which the classifiers are trained, are $t_n \in \{-1, +1\}$, $n = 1, \dots, N$. Then the ensemble prediction for pattern $x_n$ is
$$\hat{t}_n = \mathrm{sign}(Y_n) = \mathrm{sign}\left(\frac{1}{M}\sum_{i=1}^{M} y_{in}\right)$$
and the error or loss for the ensemble classifying $x_n$ is thus
$$E_{01}(Y_n) = \frac{1}{2}\left(1 - t_n \,\mathrm{sign}(Y_n)\right).$$
We denote the outputs of the ensemble members when classifying pattern $x_n$ by $Y_n = \{y_{in} = y_i(x_n)\}_{i=1}^{M}$. Then, using (2), the corresponding ambiguity in the ensemble when classifying a single $(x_n, t_n)$ pair is thus [23]:
$$\mathrm{amb}_{01}(Y_n) = \frac{1}{M}\sum_{i=1}^{M} E_{01}(y_{in}) - E_{01}(Y_n).$$
The ambiguity of the ensemble for a dataset of $N$ patterns is just the ambiguity for each pattern averaged over the $N$ patterns,
$$\mathrm{amb}(\mathbf{Y}) = \frac{1}{N}\sum_{n=1}^{N} \mathrm{amb}(Y_n), \tag{6}$$
for the 0-1 loss and the other losses which we consider. It can be shown (see Appendix) that the 0-1 ambiguity is zero if and only if all the ensemble members agree on the classification of a pattern, that is $\mathrm{amb}_{01}(Y_n) = 0 \Leftrightarrow y_{in} = y_{jn}\ \forall\, 1 \le i, j \le M$. We note, however, that $\mathrm{amb}_{01}(Y_n) < 0$ when $\mathrm{sign}(Y_n) \ne t_n$, so that the ambiguity is negative if the ensemble classification is incorrect.
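The 0-1 ambiguity for a single pattern can be sketched directly from relation (2). This is an illustrative sketch (the helper names `loss01` and `amb01` are ours), assuming hard member decisions in $\{-1, +1\}$:

```python
import numpy as np

def loss01(pred, target):
    """0-1 loss: 1 if the (signed) prediction disagrees with the target in {-1,+1}."""
    return 0.5 * (1 - target * np.sign(pred))

def amb01(member_preds, target):
    """amb_01 = average member 0-1 loss minus ensemble 0-1 loss (equal weights)."""
    member_preds = np.asarray(member_preds, dtype=float)
    ensemble_pred = member_preds.mean()
    avg_member_loss = loss01(member_preds, target).mean()
    return avg_member_loss - loss01(ensemble_pred, target)

# All members agree -> zero ambiguity.
assert amb01([+1, +1, +1], +1) == 0.0
# Majority correct, one dissenter -> positive ambiguity (1/3 - 0).
print(amb01([+1, +1, -1], +1))
# Ensemble wrong -> ambiguity is negative (2/3 - 1).
print(amb01([-1, -1, +1], +1))
```

The last call illustrates the remark above: when the ensemble misclassifies the pattern, the ensemble loss is 1 while the average member loss is at most 1, so the ambiguity is non-positive.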

Ambiguity Measure for Log Loss
The cross-entropy error or log loss measures the discrepancy between the output of the classifier and the true class when the classifier produces an output between 0 and 1 which may be interpreted as a posterior probability; for convenience we denote the classes as 0 and 1, $t_n \in \{0, 1\}$. We can express the loss for the $i$th classifier on the $n$th pattern as:
$$E_{CE}(y_{in}) = -\left[t_n \log y_{in} + (1 - t_n)\log(1 - y_{in})\right], \tag{7}$$
where $y_{in}$ is the probability predicted by the $i$th classifier that the $n$th pattern belongs to the positive class. The error made by the ensemble for the $n$th pattern is therefore quantified as:
$$E_{CE}(Y_n) = -\left[t_n \log Y_n + (1 - t_n)\log(1 - Y_n)\right], \qquad Y_n = \frac{1}{M}\sum_{i=1}^{M} y_{in}. \tag{8}$$
Again defining the ambiguity as the difference between the average loss of each member of the ensemble and the ensemble loss, we obtain the cross-entropy ambiguity for a single pattern:
$$\mathrm{amb}_{CE}(Y_n) = \frac{1}{M}\sum_{i=1}^{M} E_{CE}(y_{in}) - E_{CE}(Y_n). \tag{9}$$
Using Eqs. (7), (8) and (9), we obtain:
$$\mathrm{amb}_{CE}(Y_n) = t_n \log\frac{\frac{1}{M}\sum_{i} y_{in}}{\left(\prod_{i} y_{in}\right)^{1/M}} + (1 - t_n)\log\frac{\frac{1}{M}\sum_{i} (1 - y_{in})}{\left(\prod_{i} (1 - y_{in})\right)^{1/M}}. \tag{10}$$
Note that for any $t_n$ only one of the terms is non-zero, so $\mathrm{amb}_{CE}(Y_n)$ is the logarithm of the ratio between the arithmetic and geometric means of the proximity of the classifiers' outputs to the desired targets. The cross-entropy ambiguity for many patterns is just the ambiguity averaged over patterns (6).
We note Woodhouse [28] shows that the ratio of the arithmetic mean to the geometric mean is equivalent to a cross-entropy quantifying the amount of information added in an image processing problem. In addition in [29] the ratio of the arithmetic to geometric mean is used to measure homogeneity.
Using the inequality between arithmetic and geometric means, namely that the arithmetic mean is greater than or equal to the geometric mean, it can be seen that amb CE (Y n ) ≥ 0 for any input pattern. It can also be shown that amb CE (Y n ) = 0 if and only if all the constituent classifiers agree, y in = y jn ∀1 ≤ i, j ≤ M.
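The arithmetic-geometric mean identity for $\mathrm{amb}_{CE}$ can be checked numerically. A minimal sketch (the function name `amb_ce` is ours), assuming member outputs strictly inside (0, 1):

```python
import numpy as np

def amb_ce(probs, target):
    """Cross-entropy ambiguity for one pattern: average member log loss
    minus ensemble log loss.

    probs: member probabilities y_in in (0, 1) for the positive class;
    target: t_n in {0, 1}.
    """
    probs = np.asarray(probs, dtype=float)
    p = probs if target == 1 else 1.0 - probs   # proximity to the desired target
    avg_member_loss = -np.log(p).mean()
    ensemble_loss = -np.log(p.mean())
    return avg_member_loss - ensemble_loss

# Equals log(arithmetic mean / geometric mean) of the proximities ...
probs = np.array([0.9, 0.6, 0.8])
am = probs.mean()
gm = probs.prod() ** (1 / len(probs))
assert np.isclose(amb_ce(probs, 1), np.log(am / gm))
# ... and is therefore non-negative (AM >= GM), vanishing when all agree.
print(amb_ce([0.7, 0.7, 0.7], 0))   # ~0
```

The AM-GM inequality guarantees the returned value is non-negative, matching the statement above that $\mathrm{amb}_{CE}(Y_n) \ge 0$ with equality iff all members agree.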

Ambiguity Measure for Hinge Loss
Following the same route, an ambiguity measure can be obtained appropriate for the hinge loss. The hinge loss is defined as:
$$E_{HL}(y_{in}) = \max(0,\; 1 - t_n y_{in}). \tag{11}$$
Here $y_{in}$ is the $i$th classifier score for the $n$th pattern and $t_n$ is the target, where it is convenient to label the targets as $\{\pm 1\}$. The ambiguity measure for the hinge loss is obtained by straightforward substitution, resulting in the following:
$$\mathrm{amb}_{HL}(Y_n) = \frac{1}{M}\sum_{i=1}^{M} \max(0,\; 1 - t_n y_{in}) - \max(0,\; 1 - t_n Y_n). \tag{12}$$
As for $\mathrm{amb}_{CE}$, the hinge loss ambiguity is non-negative: $\mathrm{amb}_{HL}(Y_n) \ge 0$ for all $Y_n$. However, while it is easy to verify that if all the component classifiers have the same score ($y_{in} = y_{jn}$ for all $1 \le i, j \le M$) then $\mathrm{amb}_{HL}(Y_n) = 0$, the converse is not true: the ambiguity also vanishes when
$$t_n y_{in} \le 1 \quad \forall\, 1 \le i \le M, \tag{13}$$
since every hinge loss is then linear in the score and the average of the losses equals the loss of the average. Inequality (13) can be satisfied when one of the component classifiers predicts the class incorrectly ($\exists\, i \in \{1, \dots, M\}:\ t_n y_{in} < 0$), whereas the others classify the class correctly, but with a score of absolute value at most 1 ($\forall\, j \in \{1, \dots, M\},\ j \ne i:\ t_n y_{jn} > 0$ and $|y_{jn}| \le 1$).
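The zero-ambiguity-without-agreement case can be demonstrated numerically. A minimal sketch (the names `hinge` and `amb_hl` are ours), assuming targets in {-1, +1}:

```python
import numpy as np

def hinge(score, target):
    """Hinge loss max(0, 1 - t*y) for a score (or array of scores)."""
    return np.maximum(0.0, 1.0 - target * score)

def amb_hl(scores, target):
    """Hinge-loss ambiguity: average member hinge loss minus ensemble hinge loss."""
    scores = np.asarray(scores, dtype=float)
    return hinge(scores, target).mean() - hinge(scores.mean(), target)

# Agreement gives zero ambiguity ...
assert amb_hl([0.5, 0.5, 0.5], +1) == 0.0
# ... but the converse fails: whenever t*y_in <= 1 for every member, each
# hinge loss is linear (1 - t*y_in) and the ambiguity vanishes, even though
# one member misclassifies (score -0.2) while the rest are correct.
print(amb_hl([-0.2, 0.4, 0.7], +1))   # effectively 0 (up to float rounding)
```

The second example is exactly the situation described after inequality (13): one incorrect member with the others correct but with scores of magnitude at most 1.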
Proofs for the formulae of the ambiguity measures and their properties are presented in the Appendix.

Correlation Between Ambiguity and Generalisation Error
Previous studies have investigated the relationship between diversity (measured in a variety of ways) and the error/loss [22,23]. A negative correlation between generalisation error and ambiguity has been reported [23]. However, it is clear that this cannot be true across the entire range of ambiguity because it would imply that choosing the ensemble with the maximum diversity would minimise the generalisation error, but a maximally diverse ensemble (with no predictive power) could be constructed from learners that make random predictions. We therefore empirically investigate the relationship between the ambiguity measured on a training data set and the error/loss on a test data set (approximating the generalisation error).
Bagging was used to control the diversity, with independent samples drawn to train the classifiers in the ensemble. We used 30 sampling rates in the range [0.01, 1]. For each sampling rate an ensemble of decision trees, forming a random forest [3], was trained on the sampled patterns. From the 2000 available observations, 1000 were drawn at random and used for training, while the remaining 1000 were used for evaluating the generalisation error; the roles of the training and testing sets were then swapped and the corresponding ambiguities and losses calculated. This process was repeated 50 times and the ambiguities and errors averaged over the resulting 100 instances.
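The sampling protocol can be sketched as follows. This is a self-contained illustration of training each member on a fraction of the patterns sampled without replacement; `train_member` is a trivial placeholder standing in for the decision trees used in the experiments, and all names here are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_member(X, y):
    """Placeholder for fitting one ensemble member (a decision tree in the
    paper). Here: a trivial class-prior 'classifier' so the sketch runs
    standalone."""
    p = y.mean()
    return lambda Xq: np.full(len(Xq), p)

def bagged_ensemble(X, y, M=5, rate=0.5):
    """Train M members, each on a fraction `rate` of the patterns,
    sampled without replacement."""
    n = max(1, round(rate * len(y)))
    members = []
    for _ in range(M):
        idx = rng.choice(len(y), size=n, replace=False)
        members.append(train_member(X[idx], y[idx]))
    return members

def ensemble_probs(members, X):
    """Equally weighted average of the member scores."""
    return np.mean([m(X) for m in members], axis=0)

# Toy data in place of the GMM5 observations.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
members = bagged_ensemble(X, y, M=5, rate=0.2)
probs = ensemble_probs(members, X)
```

Sweeping `rate` over [0.01, 1] and recording training ambiguity against test loss reproduces the shape of the experiment described above.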
We used the GMM5 dataset [30], which comprises two-dimensional features generated by a Gaussian mixture model with 5 components (an extension of the 4-component model of [31]), allowing a large quantity of data to be synthesised and the Bayes error rate to be calculated exactly. Figure 1 shows the variation of the generalisation error with the diversity of the ensemble measured on the training dataset for each of the ambiguity measures discussed. The first column of panels in Fig. 1 corresponds to a small ensemble of M = 5 trees; the second column shows the variation for a large ensemble of M = 100 trees. Although there is considerable variation between the curves for the different ambiguity measures, they all display common characteristics. At high sampling rates the ambiguity and test error are negatively correlated, as also reported by [23]. In this regime, as the sampling rate increases, member classifiers are trained on increasingly similar views of the data and therefore diversity decreases. Since the average error per classifier is approximately constant (because adding more data does not appreciably increase their accuracy), equation (2) shows that the ensemble error increases.
Decreasing the sampling rate means that the members of the ensemble are trained on different views of the data, leading to increasing diversity/ambiguity and therefore a smaller ensemble error, cf. (2). However, as the sampling rate is reduced to even lower levels, each component classifier is trained on a very small number of patterns and therefore starts to become inaccurate. In (2) the average error increases more rapidly than the diversity and the result is that the ensemble error begins to rise again. Unfortunately, determining the sampling rate that yields the best generalisation error is not straightforward or susceptible to a priori analysis. In "An Evolutionary Algorithm to Optimise Ambiguity" we therefore describe an evolutionary algorithm to determine this rate.
The same pattern is apparent for both small ( M = 5 , Fig. 1 left column) and large ( M = 100 , Fig. 1 right column) ensembles, although the larger ensemble achieves a lower generalisation error. This generalisation error is very close to the Bayes error (0.11 misclassification rate) for this data set. It might be expected that the optimum sampling rate would be at least 1/M, so that each classifier in the ensemble is trained on N/M examples and each example is used on average in the training of at least one classifier. However, as the panels in Fig. 1 show, the optimum sampling rate is well below 1/M, meaning that some of the data is not used at all by the ensemble. This indicates the significant role played by diversity: to achieve the best generalisation performance it is better to ensure diversity by exposing classifiers to very different views of the data than to better train them by providing more data.
Although only shown here for the GMM5 dataset we emphasise that very similar relationships between ambiguity and generalisation error were observed on a number of additional datasets (Table 1). We also repeated the experiments using sampling with replacement, but bagging without replacement in general yielded lower generalisation errors.
We also investigated the variation of generalisation error with the number M of classifiers forming the ensemble. This was achieved by generating ensembles with 2 to 100 members and training them, as before, with samples at a given rate. This was repeated 20 times for each ensemble size and sampling rate. The average (test) cross entropy error plotted against ensemble size and sampling rate is shown in the left panel of Fig. 2 for the Sonar data set (Table 1, [32,33]). This figure plainly shows the benefit of a large ensemble: the optimum generalisation error with a large ensemble is obtained over a wide range of sampling rates. The average training cross entropy ambiguity is plotted against ensemble size and sampling rate in the right panel of Fig. 2. These two figures together show the relationship between generalisation error and training ambiguity: high ambiguities yield lower test errors, provided the sampling rate is not too small. However, these two plots also show the difficulty of predicting, from the training ambiguity, the optimal rate that will yield the lowest generalisation error.

Fig. 1 Curves of the three types of ambiguities versus the corresponding losses derived from the ambiguity measures detailed in "Ambiguity Measures". The test error versus the training ambiguity is plotted for different sampling rates for ensembles of 5 trees (left column) and 100 trees (right column) for the gmm5test dataset. The first row shows the test cross entropy versus the training cross-entropy ambiguity; in the second row the test 0-1 loss is plotted versus its corresponding training ambiguity; the behaviour of the hinge loss is presented in the third row of panels. The optimal sampling rate (r) is indicated in red

An Evolutionary Algorithm to Optimise Ambiguity
As we have shown, provided that the sampling rate is not too low, the generalisation error is reduced for ensembles with high diversity. We therefore use an evolutionary algorithm to maximise the ambiguity of an ensemble of classifiers by selecting the patterns, that is the particular training examples, on which the constituent classifiers are trained. Pseudocode for the algorithm is presented in Algorithm 1. We use ensembles of M classifiers, each of which is trained on a fraction $\rho$ of the N available training patterns. In common with standard bagging ensembles, each of the classifiers is trained on all the available features. The patterns on which each classifier is trained are represented by a string of N 0s and 1s, where a 1 indicates that the corresponding pattern is used to train the classifier, so that there are exactly $[\rho N]$ 1s in each string, where $[\cdot]$ indicates rounding to the nearest integer. The strings representing the training patterns are initialised using stratified random sampling without replacement so that the class ratios are preserved.
A single ensemble is evolved through mutation. Between 1 and M strings are selected for mutation (line 3 in Algorithm 1). Then one of two types of mutation is chosen with equal probability (line 5):

1. A proportion of up to N/2 of the 1s and 0s is flipped at random. This is performed in a stratified manner, so as to preserve the class ratio and maintain the sampling rate (line 6).
2. The current string is discarded and replaced with a new string chosen in the same way as at initialisation, preserving the class ratio and the sampling rate (line 8).
Following mutation the $N_{\mathrm{pop}}$ members with the largest ambiguity are retained to proceed into the next generation. In the case of ties, the forest with the lower error is preferred (line 10).
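The stratified initialisation and the count-preserving flip mutation can be sketched as boolean-mask operations. This is an illustrative sketch of those two steps only, not Algorithm 1 itself; the function names are ours, and flipping matched numbers of 1s and 0s within each class is one simple way to satisfy the preservation constraints.

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mask(y, rate):
    """Stratified initialisation: select about rate*N patterns,
    preserving the class ratios."""
    mask = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        k = round(rate * len(idx))
        mask[rng.choice(idx, size=k, replace=False)] = True
    return mask

def mutate_flip(mask, y, max_frac=0.5):
    """Mutation type 1: flip matched numbers of 1s and 0s within each
    class, so both the sampling rate and the class ratio are preserved."""
    mask = mask.copy()
    for c in np.unique(y):
        ones = np.flatnonzero(mask & (y == c))
        zeros = np.flatnonzero(~mask & (y == c))
        k = rng.integers(0, int(max_frac * min(len(ones), len(zeros))) + 1)
        mask[rng.choice(ones, size=k, replace=False)] = False
        mask[rng.choice(zeros, size=k, replace=False)] = True
    return mask

y = np.array([0] * 60 + [1] * 40)        # 60/40 class ratio
m0 = init_mask(y, rate=0.3)
m1 = mutate_flip(m0, y)
```

Both the total number of selected patterns and the per-class counts are invariant under the mutation, which is what keeps every string a valid training-set selection.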

Experiments
We ran our algorithm on six standard classification datasets from the UCI Machine Learning Repository: Australian, Cancer, Liver, Heart, Sonar, Ionosphere [34] and an additional synthetic dataset GMM5 [31,35]. Table 1 summarises the dataset characteristics. Since the results shown in Fig. 2 indicate that for large ensembles the generalisation error is small for sufficiently low sampling rates, we concentrate here on small ensembles. We used ensembles of M = 5 trees, implemented using the DecisionTreeClassifier class from the sklearn library [36] in Python, and the ambiguity measure amb_CE(⋅) derived from the log loss (10).

Fig. 3 Example results on the Liver dataset, using an evolutionary algorithm to optimise the cross-entropy ambiguity

Evolutionary Algorithm
Data was partitioned into stratified parts as follows: one half for testing, one quarter for training and the remaining quarter for validation. The evolutionary algorithm was run using the training data and the resulting ensemble evaluated on the validation data. The forest with the sampling rate that yields the lowest validation error was evaluated on the test data to assess the algorithm's performance. Figure 3 shows example results obtained on the Liver dataset. The optimisation was repeated 50 times for each sampling rate and the figure shows the mean and interquartile range of the cross entropy generalisation error.
We compared the ensemble's validation error for the initial generation with the optimised ensemble's validation error, for the following sampling rates: 0.05, 0.1, 0.2, 0.3, 0.5. The green dashed line in Fig. 3 corresponds to the mean of the 50 runs for the initial population, whereas the purple dashed line represents the mean for the final population. Shading indicates the interquartile range. The blue box plot corresponds to the test error for the initial populations, whereas the red box plots represent the test error for the corresponding final populations. These box plots were generated only for the sampling rate that yielded the lowest average validation error.
We also performed non-parametric statistical tests to assess the significance of the results, using the two-tailed Wilcoxon signed-rank test at p = 0.05. In Table 2 the mean test error of the initial ensemble for the sampling rate that yielded the smallest validation error is shown, along with the mean test error of the corresponding final evolved ensemble. The values in parentheses correspond to the 25th and 75th percentiles. These results show that, in general, the EA performs significantly better than the random sampling from the initial population, and never worse. The ambiguity-optimised ensembles have lower test errors on average than the initial ensemble across all test problems.
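A paired Wilcoxon signed-rank comparison of this kind can be run with `scipy.stats.wilcoxon`. The error values below are toy numbers for illustration, not the paper's results:

```python
from scipy.stats import wilcoxon

# Paired test errors over repeated runs (illustrative values only):
initial = [0.30, 0.28, 0.31, 0.29, 0.33, 0.27, 0.32, 0.30, 0.29, 0.31]
evolved = [0.27, 0.26, 0.28, 0.28, 0.30, 0.25, 0.29, 0.28, 0.27, 0.28]

# Two-tailed test of the null hypothesis that the paired differences
# are symmetric about zero.
stat, p = wilcoxon(initial, evolved, alternative="two-sided")
significant = p < 0.05
```

With 50 paired runs, as in the experiments, the same call applies unchanged; only the significance threshold (0.05) comes from the text.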

Optimising Diversity by Tree Selection
In the previous section the focus was on optimising the patterns on which the trees forming small ensembles were built. It has been shown that the more trees in an ensemble, the better its performance will be; however, the amount of time needed for training the ensemble increases linearly with the number of trees. In [37] the authors have shown that the ensemble size can be reduced substantially while still obtaining similar performance. In this section we define different pruning methods to find the best performance for a given number of trees. We also define tree selection methods for a fixed number of trees and compare their performances.

Pruning Methods
The process of ensemble pruning refers to reducing the number of predictors by discarding those that contribute little.
We will use the following notation. Let $t$ be the target vector and $Y_m = \{y_i\}_{i=1}^{m}$ be a collection of $m$ classifiers, where $2 \le m \le M$. We denote the ambiguity of the ensemble by $\mathrm{amb}_{CE}(Y_m)$. A new ensemble $Y_{m-1}$ is formed by removing one of the classifiers. We define different schemes to select the classifier to be removed from $Y_m$:

1. The first approach at each iteration removes the tree that would yield the most ambiguous ensemble, evaluated on the training data (labelled "Keep most ambiguous ensemble"). Formally:
$$Y_{m-1} = \operatorname*{arg\,max}_{Y' \subset Y_m,\ |Y'| = m-1} \mathrm{amb}_{CE}(Y').$$
2. The second approach at each iteration removes the tree that would yield the least ambiguous ensemble, evaluated on the training data (labelled "Keep least ambiguous ensemble"):
$$Y_{m-1} = \operatorname*{arg\,min}_{Y' \subset Y_m,\ |Y'| = m-1} \mathrm{amb}_{CE}(Y').$$
3. The third approach at each iteration removes a tree chosen uniformly at random (labelled "Random"): $Y_{m-1} = Y_m \setminus \{y_k\}$, with $k$ drawn uniformly from $\{1, \dots, m\}$.
4. The fourth approach removes the tree with the highest training error (labelled "Remove tree highest error"): $Y_{m-1} = Y_m \setminus \{y_k\}$, $k = \operatorname*{arg\,max}_i E(y_i, t)$.
5. Conversely, the fifth approach removes the tree with the lowest training error (labelled "Remove tree lowest error"): $Y_{m-1} = Y_m \setminus \{y_k\}$, $k = \operatorname*{arg\,min}_i E(y_i, t)$.
6. The next approach discards the tree that would yield the ensemble with the highest training error (labelled "Keep ensemble highest error"): $Y_{m-1} = \operatorname*{arg\,max}_{Y' \subset Y_m,\ |Y'| = m-1} E(Y', t)$.
7. Conversely, the last approach keeps the ensemble with the lowest training error (labelled "Keep ensemble lowest error"): $Y_{m-1} = \operatorname*{arg\,min}_{Y' \subset Y_m,\ |Y'| = m-1} E(Y', t)$.

In our experiments we used M = 100 trees and at each iteration a tree is discarded according to one of the above methods. The function getAmbEns (see below) at each iteration receives an ensemble of size m (m decreases after every iteration), discards each tree in turn to form subsets of m − 1 trees, and evaluates the ambiguity of each of these m sub-ensembles. It returns the list containing the ambiguities of all m subsets.
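The leave-one-out evaluation that getAmbEns performs can be sketched as follows. This is our own minimal reconstruction of the described behaviour, with member outputs represented as an (m, N) array of probabilities; the helper names are ours.

```python
import numpy as np

def amb_ce(probs, targets):
    """Cross-entropy ambiguity of an ensemble, averaged over patterns.
    probs: (m, N) member probabilities for the positive class;
    targets: array of N labels in {0, 1}."""
    p = np.where(targets == 1, probs, 1.0 - probs)   # proximity to the target
    # average member loss minus ensemble loss, per pattern, then averaged
    return float(np.mean(-np.log(p).mean(axis=0) + np.log(p.mean(axis=0))))

def get_amb_ens(probs, targets):
    """For an ensemble of m members, return the m ambiguities obtained by
    leaving each member out in turn (the behaviour described for getAmbEns)."""
    m = probs.shape[0]
    return [amb_ce(np.delete(probs, i, axis=0), targets) for i in range(m)]

# One pruning step of "Keep most ambiguous ensemble":
rng = np.random.default_rng(2)
probs = rng.uniform(0.05, 0.95, size=(6, 20))     # 6 trees, 20 patterns
targets = rng.integers(0, 2, size=20)
ambs = get_amb_ens(probs, targets)
pruned = np.delete(probs, int(np.argmax(ambs)), axis=0)   # discard one tree
```

Taking the arg max of the returned list implements scheme 1, the arg min scheme 2; the m evaluations per iteration are what makes the full pruning pass expensive.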
Then, according to the pruning scheme chosen ("Keep most ambiguous ensemble" or "Keep least ambiguous ensemble"), the algorithm discards the tree at the index of the arg max or arg min respectively.
Similarly, the function getErrEns (see below) calculates the training error of the m sub-ensembles. At each iteration these approaches make m comparisons, which makes the method expensive in terms of time. By analysing the plot of the test cross entropy (Fig. 4), we can see that the best approaches are "Keep most ambiguous ensemble", "Remove tree lowest error" and "Keep ensemble lowest error". We suspect that the reason for the similar behaviour of the first two approaches is that, by removing the tree with the lowest cross entropy, the average cross entropy of the remaining classifiers increases much faster than the ensemble cross entropy, leading to an increase of the ensemble ambiguity, as seen in Fig. 5. Figure 5 also shows the average tree training/test cross entropy versus the ensemble training/test cross entropy and the ensemble training/test amb_CE for three approaches: "Remove tree lowest error", "Remove tree highest error", and "Random". The first row of each set of figures shows results for the training data, the second for the test data. We can see from these plots that the average tree cross entropy is increasing for the "Remove tree lowest error" method, decreasing for "Remove tree highest error" and almost constant for the "Random" scheme. Also, in the "Random" case the ensemble cross entropy increases slightly faster (as trees are removed) than when removing the lowest-error trees.
Another explanation for why the "Remove tree lowest error" approach has such a good generalisation performance, could be that the lowest error trees are overfitted to the training data and so removing them from the ensemble leaves an ensemble of "less overfitted" trees, that are therefore better able to generalise. Conversely, we could argue that removing the tree with the highest error is actually removing the tree that is least overfitted to the training data.
The first plot of Fig. 4 also reveals that amongst the worst approaches is "Keep least ambiguous ensemble". These results demonstrate once more the influence that diversity has on reducing test error.

A One In, One Out Approach
The pruning method mentioned in the previous subsection ("Pruning Methods"), which starts from M trees and reduces the ensemble size down to two trees, has the disadvantage of being too costly in terms of time. Let $m_n$ be the size of the ensemble at iteration n. Then for the pruning schemes that keep at each iteration an ensemble satisfying a certain criterion, $m_n$ sub-ensembles must be compared at iteration n; summed over the iterations from $m_1 = M$ down to 2, the total number of comparisons grows quadratically with M. An alternative to this method is to keep at each iteration a fixed number of trees, m, selected a priori, with m ≪ M. At each iteration a random tree is added from a pool of remaining trees, and a tree is discarded according to the approaches presented in "Pruning Methods" (it is possible for the discarded tree to be the new entrant). Hence we have named this method the "One In, One Out" (OIOO) approach. The total number of comparisons made by the OIOO method is (m + 1)(M − m). As before, we consider $Y_n = \{y_i\}_{i=1}^{m}$ to be the collection of m classifiers, where 2 ≤ m ≤ M. We form a new ensemble by adding a new randomly generated classifier y′ to the current ensemble and then removing one of the classifiers to form $Y_{n+1}$. One important aspect to remark is that, in contrast to the previous method, the index n of $Y_n$ does not denote the size of the ensemble (the size m is fixed); it denotes the current iteration.
We consider in our experiments the same types of approach as in the previous section. Writing $Y' = Y_n \cup \{y'\}$ for the enlarged ensemble of $m + 1$ classifiers, the next ensemble $Y_{n+1}$ is chosen as follows:

1. Keep the most ambiguous ensemble: $Y_{n+1} = \operatorname*{arg\,max}_{Y'' \subset Y',\ |Y''| = m} \mathrm{amb}_{CE}(Y'')$.
2. Keep the least ambiguous ensemble: $Y_{n+1} = \operatorname*{arg\,min}_{Y'' \subset Y',\ |Y''| = m} \mathrm{amb}_{CE}(Y'')$.
3. Remove a random classifier from $Y'$.
4. Remove the classifier with the largest training error: $Y_{n+1} = Y' \setminus \{\operatorname*{arg\,max}_{y \in Y'} E(y, t)\}$.
5. Remove the classifier with the lowest training error: $Y_{n+1} = Y' \setminus \{\operatorname*{arg\,min}_{y \in Y'} E(y, t)\}$.
6. Keep the ensemble with the highest training error: $Y_{n+1} = \operatorname*{arg\,max}_{Y'' \subset Y',\ |Y''| = m} E(Y'', t)$.
7. Keep the ensemble with the lowest training error: $Y_{n+1} = \operatorname*{arg\,min}_{Y'' \subset Y',\ |Y''| = m} E(Y'', t)$.

We tested this approach on forests of m ∈ {5, 10, 20} trees, randomly chosen from the same pool of M = 100 trees as in "Pruning Methods". We ran the algorithm for 50 runs and produced box plots of the test data, see Fig. 6. The same training and test data as in the previous experiments were used.
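The OIOO loop for the "Keep most ambiguous ensemble" variant can be sketched as follows. This is our own illustrative reconstruction, again representing each tree by its vector of output probabilities; the function names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def amb_ce(probs, targets):
    """Cross-entropy ambiguity averaged over patterns (probs: (m, N))."""
    p = np.where(targets == 1, probs, 1.0 - probs)
    return float(np.mean(-np.log(p).mean(axis=0) + np.log(p.mean(axis=0))))

def oioo_most_ambiguous(pool, targets, m):
    """One In, One Out: keep m members; at each step add one tree from the
    pool, then drop whichever member leaves the most ambiguous ensemble.
    The discarded tree may be the new entrant itself."""
    current = pool[:m]
    for newcomer in pool[m:]:
        candidate = np.vstack([current, newcomer[None, :]])   # m + 1 members
        ambs = [amb_ce(np.delete(candidate, i, axis=0), targets)
                for i in range(m + 1)]
        current = np.delete(candidate, int(np.argmax(ambs)), axis=0)
    return current

pool = rng.uniform(0.05, 0.95, size=(20, 30))    # 20 trees' probabilities
targets = rng.integers(0, 2, size=30)
final = oioo_most_ambiguous(pool, targets, m=5)
```

Each of the M − m steps performs m + 1 leave-one-out evaluations, giving the (m + 1)(M − m) comparisons stated above; swapping `argmax` for `argmin`, or ambiguity for training error, yields the other six variants.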

Evolving Ensemble Membership
Building on the ideas of the last two subsections, we define an evolutionary algorithm which maximises the training amb_CE by selecting a fixed number of trees, m, at each generation; see Algorithm 2. The trees selected are represented by a string of 0s and 1s, where a 1 in the i-th position signifies that the i-th tree is selected and a 0 that it is not. At each generation we mutate the current string according to a mutation rate $\mu$: for each bit of the string a random number between 0 and 1 is generated, and if the value is less than $\mu$ the bit is flipped. If the total number of trees selected after the mutation differs from m, then random trees are either added or removed to restore the total number of trees (line 4 of Algorithm 2). The training ambiguity of the newly formed ensemble is calculated and, if it is higher than that of the previous ensemble, the current ensemble is kept and the old one discarded. In the case of ties, the ensemble with the lower training error is kept.

Fig. 6 Box plots of the test CE for the tree selection approaches versus the pattern selection approach (the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity") on the German dataset for the 0.05 sampling rate. The first plot in the panel displays the results for 5 trees, the second plot for 10 trees, the third for 20 trees, and the last for 50 trees
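The mutate-then-repair step (line 4 of Algorithm 2) can be sketched on a boolean membership string. This is an illustrative sketch; the function name and the example rate $\mu = 0.02$ are ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def mutate_membership(bits, m, mu=0.02):
    """Flip each bit independently with probability mu, then repair: add or
    remove random trees so that exactly m remain selected."""
    bits = bits ^ (rng.random(bits.size) < mu)   # per-bit flip, new array
    excess = int(bits.sum()) - m
    if excess > 0:     # too many selected: drop random selected trees
        on = np.flatnonzero(bits)
        bits[rng.choice(on, size=excess, replace=False)] = False
    elif excess < 0:   # too few: add random unselected trees
        off = np.flatnonzero(~bits)
        bits[rng.choice(off, size=-excess, replace=False)] = True
    return bits

M, m = 100, 10
bits = np.zeros(M, dtype=bool)
bits[rng.choice(M, size=m, replace=False)] = True   # initial selection
child = mutate_membership(bits, m)
```

The child is then accepted if its training amb_CE exceeds the parent's, with ties broken in favour of the lower training error, as described above.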

SN Computer Science
The evolutionary algorithm is run for g = 25,000 generations, 50 times, for each of the sampling rates {0.05, 0.1, 0.2, 0.3, 0.5}. The pool of trees from which the ensemble is composed at each generation is the same as that used in "Pruning Methods" and "A One In, One Out Approach".
We compared the performance of the evolutionary algorithm with the approaches described in "Pruning Methods" and "A One In, One Out Approach" and with the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity". Box plots of the test errors for all the approaches over the 50 runs are shown in Fig. 6. In each group of box plots, the first is always the approach that starts from M classifiers and reduces the total number of trees at each iteration; the second (the darker colour in the group) represents the corresponding OIOO approach.

Table 3 Statistical comparisons of the tree selection schemes and the evolutionary algorithm that selects patterns from "An Evolutionary Algorithm to Optimise Ambiguity". For each method the median of the test cross-entropy over the 50 runs is displayed. The columns of the tree selection methods from "Pruning Methods" and "A One In, One Out Approach" have two values: the first denotes the median of the test cross-entropy for the pruning method, and the second that of the corresponding OIOO approach. Dark shading indicates the best approach across all methods, and lighter grey those statistically indistinguishable from it. Blue underlining denotes the best approach across the tree selection methods, and red underlining the approaches statistically similar to it. The results shown are for the German dataset

Fig. 7 Comparisons of the tree selection approaches versus the pattern selection approach (the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity") on the German dataset for the 0.05 sampling rate, for 5 trees. The top plot shows the ranking and the statistical similarities for all the approaches except the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity"; the bottom plot analyses all the approaches
The black horizontal line separates the tree selection algorithms from the evolutionary algorithm that selects patterns (from "An Evolutionary Algorithm to Optimise Ambiguity"). These box plots were obtained for the German dataset and a sampling rate of 0.05.
We can see from these plots that in general the best approaches are "Keep most ambiguous ensemble", "Keep most ambiguous ensemble OIOO", "Remove tree lowest error", "Remove tree lowest error OIOO" and the evolutionary algorithms, followed by "Keep ensemble lowest error", "Keep ensemble lowest error OIOO", "Random" and "Random OIOO". Another aspect visible from these plots is that, as we increase the number of trees, the test errors given by the best approaches approach those of the unpruned ensemble.
To rank all these schemes correctly, we performed statistical tests. We used the two-tailed Wilcoxon signed-rank test along with the Holm-Bonferroni correction for a p-value of 0.05. The results of the statistical tests are displayed in Table 3. This table contains two analyses. The first determines the best of the tree selection approaches and all those statistically indistinguishable from it: the best one is underlined in red, and the statistically similar ones in blue. The second analysis compares all approaches, the tree selection ones and the pattern selection one (the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity"): the best one is shaded dark grey, and the statistically similar ones lighter grey.

Fig. 8 Comparisons of the tree selection approaches versus the pattern selection approach (the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity") on the German dataset for the 0.3 sampling rate, for 10 trees. The top plot shows the ranking and the statistical similarities for all the approaches except the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity"; the bottom plot analyses all the approaches

Fig. 9 Comparisons of the tree selection approaches versus the pattern selection approach (the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity") on the German dataset for the 0.5 sampling rate, for 20 trees. The top plot shows the ranking and the statistical similarities for all the approaches except the evolutionary algorithm from "An Evolutionary Algorithm to Optimise Ambiguity"; the bottom plot analyses all the approaches

We also present a visual comparison, using critical difference diagrams as in [38]. In these diagrams the classifiers are ranked in ascending order and connected with each other if they are statistically similar; see Figs. 7, 8 and 9 for the German dataset.
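For reference, the Holm-Bonferroni step-down procedure used in these pairwise comparisons can be sketched in a few lines. The p-values below are invented for illustration; in our experiments they come from the pairwise Wilcoxon signed-rank tests.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down correction: sort the p-values in ascending order
    and compare the k-th smallest against alpha / (n - k); stop rejecting
    at the first failure.  Returns a reject flag per original position."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (n - k):
            reject[i] = True
        else:
            break          # all larger p-values are also accepted
    return reject

# Illustrative p-values, e.g. from four pairwise comparisons.
p = [0.01, 0.04, 0.03, 0.005]
flags = holm_bonferroni(p, alpha=0.05)
```

The step-down thresholds make Holm's method uniformly more powerful than the plain Bonferroni correction while still controlling the family-wise error rate.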
Although these comparisons are only for the German dataset and the {0.05, 0.3, 0.5} sampling rates, the ranking of the approaches is similar across all datasets. In general the best approaches are the evolutionary algorithms, which are statistically similar to the "Keep most ambiguous ensemble", "Keep most ambiguous ensemble OIOO", "Remove tree lowest error" and "Remove tree lowest error OIOO" approaches. Since both the evolutionary algorithms and the pruning approaches are computationally expensive, the best practical choices are probably "Keep most ambiguous ensemble OIOO" and "Remove tree lowest error OIOO".

Conclusion
In this paper we introduced two ambiguity measures using the bias-variance decomposition and the cross-entropy error or the hinge loss. Together with the ambiguity corresponding to the 0-1 loss, we established the properties of these new diversity measures. We evolved the training patterns of the classifiers to maximise the ambiguity obtained from the cross-entropy (amb CE), and our results show that the evolved ensemble generally has a better generalisation error than the initial ensemble. Hence, our results support the influence that diversity has on minimising generalisation error. Moreover, the ambiguity measure obtained from the cross-entropy error satisfies all the required properties of a diversity measure: it is always non-negative, and it is zero if and only if the predictions of the classifiers are all the same. This property does not hold for the ambiguity obtained from the 0-1 loss (see [23]), which we find can be negative.
We have defined different methods of tree selection, some that favoured ambiguous/accurate ensembles and some that promoted less ambiguous/accurate ensembles. We compared the performance of these methods with the results of the above-mentioned evolutionary algorithm. Our results show that in general the evolutionary algorithm achieves good performance; however, its performance is similar to that of the tree selection approaches that favour ambiguity, a result which demonstrates once more the usefulness of ambiguity in error reduction.
Our results also show that if random sampling is used to select the patterns on which ensemble members are trained, generalisation error is negatively correlated with diversity at high sampling rates; conversely, generalisation error is positively correlated with diversity when the sampling rate is low and the diversity high.
Our experiments were based on random forests, therefore a possible extension of our work would be to use other types of ensembles and classifiers. In addition, other methods of inducing diversity, such as selection of features and different models, could be investigated.
In our experiments the weights c_i of the classifiers were equal; our future work will therefore aim to optimise the weights of the classifiers to maximise ambiguity without compromising the average error. Also, since patterns have different ambiguities, future work will focus on how to effectively select the most ambiguous patterns.

In this case, $y_{in}$ is the $i$th classifier's probability prediction that the $n$th pattern belongs to class 1. From equation (24), we can express the average error of the classifiers as:

$$\bar{e}_n = -\sum_{i=1}^{M} c_i \left[ t_n \ln y_{in} + (1 - t_n) \ln (1 - y_{in}) \right].$$

Then from (24) we can define the ensemble error as:

$$e_n = -\left[ t_n \ln \bar{y}_n + (1 - t_n) \ln (1 - \bar{y}_n) \right].$$

From equations (14) and (26) we can write the ensemble error as:

$$e_n = -\left[ t_n \ln \sum_{i=1}^{M} c_i y_{in} + (1 - t_n) \ln \left( 1 - \sum_{i=1}^{M} c_i y_{in} \right) \right].$$

Using the above equations, the difference between the average error of the classifiers and the ensemble error can be expressed as:

$$\bar{e}_n - e_n = t_n \left[ \ln \sum_{i=1}^{M} c_i y_{in} - \sum_{i=1}^{M} c_i \ln y_{in} \right] + (1 - t_n) \left[ \ln \left( 1 - \sum_{i=1}^{M} c_i y_{in} \right) - \sum_{i=1}^{M} c_i \ln (1 - y_{in}) \right].$$

Taking into account the properties of the logarithm:

$$\bar{e}_n - e_n = t_n \ln \frac{\sum_{i=1}^{M} c_i y_{in}}{\prod_{i=1}^{M} y_{in}^{c_i}} + (1 - t_n) \ln \frac{1 - \sum_{i=1}^{M} c_i y_{in}}{\prod_{i=1}^{M} (1 - y_{in})^{c_i}}.$$

As a result, the ambiguity at $\mathbf{x}_n$ can be defined as:

$$\mathrm{amb}_{CE}(Y_n) = t_n \ln \frac{\sum_{i=1}^{M} c_i y_{in}}{\prod_{i=1}^{M} y_{in}^{c_i}} + (1 - t_n) \ln \frac{1 - \sum_{i=1}^{M} c_i y_{in}}{\prod_{i=1}^{M} (1 - y_{in})^{c_i}}. \qquad (31)$$

Taking into account equation (31), the formula of the diversity over all the patterns is:

$$\mathrm{amb}_{CE} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{amb}_{CE}(Y_n). \qquad \square$$

Theorem 2 The CE ambiguity has the following properties:
1. $\mathrm{amb}_{CE}(Y_n) \ge 0, \; \forall n \in \{1, \dots, N\}$;
2. $\mathrm{amb}_{CE}(Y_n) = 0 \iff y_{in} = y_{jn}, \; \forall i, j \in \{1, \dots, M\}$.

Proof First we show that $\mathrm{amb}_{CE}(Y_n) \ge 0, \; \forall n \in \{1, \dots, N\}$. Consider the weighted arithmetic-geometric mean inequality:

$$\frac{\sum_{i=1}^{n} \lambda_i x_i}{\lambda} \ge \left( \prod_{i=1}^{n} x_i^{\lambda_i} \right)^{1/\lambda}, \quad \text{where } \sum_{i=1}^{n} \lambda_i = \lambda, \; \lambda_i \ge 0, \; \forall i \in \{1, \dots, n\}.$$

The equality holds if and only if all the $x_i$ are equal. In our case $\lambda_i = c_i$ and $\sum_{i=1}^{M} c_i = 1$; as a result, the arguments of both logarithms in (31) are greater than or equal to one, so both logarithms are non-negative. Since $t_n \in \{0, 1\}$, the ambiguity is a sum of two non-negative terms and is therefore non-negative.
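The non-negativity argument can also be checked numerically. The short sketch below computes amb_CE for a single pattern directly as the difference between the average member cross-entropy and the cross-entropy of the weighted-mean prediction; the member probabilities and uniform weights are invented for the example.

```python
import math

def amb_ce(t, y, c):
    """CE ambiguity for one pattern: average member cross-entropy minus
    the cross-entropy of the weighted-mean prediction.  t is the target
    in {0, 1}, y[i] the i-th member's probability for class 1, c[i] its
    weight (the c[i] sum to 1)."""
    ce = lambda t, p: -(t * math.log(p) + (1 - t) * math.log(1 - p))
    avg_err = sum(ci * ce(t, yi) for ci, yi in zip(c, y))
    ens_err = ce(t, sum(ci * yi for ci, yi in zip(c, y)))
    return avg_err - ens_err

c = [1 / 3, 1 / 3, 1 / 3]
a = amb_ce(1, [0.9, 0.6, 0.7], c)        # disagreeing members: positive ambiguity
b = amb_ce(0, [0.5, 0.5, 0.5], c)        # identical members: zero ambiguity
```

By the weighted AM-GM inequality, `a` is non-negative for any valid inputs, and `b` is zero (up to floating-point error) because all members predict identically.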

Consent for publication Yes
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.