Introduction

With the arrival of the big data era, data of all kinds are growing rapidly, and the need for data mining is becoming increasingly urgent. Data mining is the core means of extracting valuable information from messy big data. With the rapid development of machine learning in recent years, machine learning has become one of the main data mining methods. Machine learning [1] lets a computer learn the rules and patterns hidden in massive data, and it is widely used for classification, regression, clustering, and related problems. As one of the four research directions in machine learning, ensemble learning [2] uses multiple machine learning algorithms to produce weak predictive results from features extracted through diverse projections of the data, and fuses these results with various voting mechanisms to achieve better performance than any constituent algorithm alone. The basic theory, algorithms, and applications of ensemble learning have therefore been research hot spots in recent years. Ensemble learning has also achieved strong results in international machine learning competitions such as Kaggle and the KDD Cup.

Currently, the mainstream algorithms in ensemble learning include AdaBoost, bagging, random subspace, and random forest. In 1990, Schapire [3] proved that weak classifiers can be combined into a strong classifier by the Boosting method. Freund and Schapire [4] then proposed AdaBoost, in which the weights of samples misclassified by the previous classifier are increased and all the weighted samples are used to train the next base classifier. This method made ensemble learning an important research field of machine learning, and scholars have since studied Boosting in depth, producing variants such as LogitBoost [5], XGBoost [6], and MF-AdaBoost [7]. Bagging was proposed in 1996 by Breiman [8]. It uses bootstrap sampling to generate sample subsets from the training dataset and then trains base classifiers on these subsets. Random subspace was proposed in 1998 by Ho [9]. It randomly extracts several attribute subsets from the initial attribute set, trains a base learner on each attribute subset, and obtains the final classification results through voting. Since then, random subspace has been widely used in ensemble learning research [10, 11]. Random forest, proposed in 2001 by Breiman [12], is known as one of the best algorithms in machine learning. It is an ensemble learning algorithm that combines bagging with random subspace and takes decision trees as the base learners. By integrating the voting results of multiple decision trees, random forest can alleviate the overfitting problem of decision trees. In addition to the above models, there are other excellent ensemble learning algorithms [13,14,15].

After all the base classifiers of an ensemble are trained, selective ensemble [16] reduces the scale of the ensemble and obtains better performance by removing some base learners. Classifier selection can be regarded as searching for a specific subset in the finite space of base classifiers. The selection operation involves two essential factors. The first is the selection criterion, that is, the standard used to evaluate the quality of a base classifier or of the ensemble system. The second is the search mode, that is, the process of searching for an appropriate subset of base classifiers under the selection criterion. The various selective ensemble schemes proposed by researchers are combinations of different selection criteria and search modes.

Currently, according to the main process of the algorithm, selective ensemble methods are usually divided into sorting-based, clustering-based, and optimization-based methods. Selective ensemble based on sorting ranks the base learners by a certain standard and selects the top-ranked base learners to join the final ensemble. In the past decades, researchers have proposed various sorting strategies, such as reduce-error pruning [17, 18] and kappa pruning [19, 20]. Selective ensemble based on clustering uses a clustering method to divide the base learners into several groups and then selects representative base learners from each group to form the final ensemble. Many different schemes have been proposed. For example, Lin et al. [21] selected the base learners from the original ensemble system through K-means clustering, and Zhang and Cao [22] utilized spectral clustering to realize ensemble pruning. Selective ensemble based on optimization transforms the pruning problem into an optimization problem whose goal is to find a subset of base learners that maximizes or minimizes an objective related to the generalization ability of the final ensemble. Scholars have proposed ensemble pruning based on various heuristic optimization algorithms, such as the genetic algorithm [23, 24] and particle swarm optimization [25, 26].

Although scholars have put forward many ensemble learning and selection schemes, some problems still need to be overcome:

  • Most of the above ensemble models consider only the feature space or only the sample space, not both. This wastes information and may affect the final ensemble performance.

  • The random acquisition of subspaces does not consider the importance of features, so important features are easily lost, which affects the classification performance.

  • Ensemble learning is composed of multiple base learners, so it has better generalization ability than a single base learner. However, too many base learners cause the following problems: (1) The more base learners there are, the greater the computational and storage overhead and the slower the learning speed. (2) The more base learners there are, the smaller the diversity, which degrades the performance of ensemble learning. (3) Enough base learners can ensure the performance of ensemble learning, but some of them will inevitably be redundant or perform poorly, which harms the ensemble. (4) In the label prediction process, neither the majority voting method nor the weighted voting method based on the classification ability of base learners can obtain the optimal voting weight for each base learner.

In order to solve these problems, we propose SELA. The main steps are as follows. Firstly, the subspaces are generated by our random subspace generation algorithm, and the initial base learners are trained on them. Secondly, through the proposed local selection process, a set of appropriate base learners is selected from the initial ensemble as the new ensemble set. Finally, the new ensemble set is used to predict the test samples by the weighted voting method. The main contributions of this paper are as follows:

  • The random subspace generation algorithm we designed can not only retain important features but also ensure the randomness of subspaces.

  • This scheme considers both feature and sample spaces. In the initial and selective ensemble stages, we consider the feature space and the sample space respectively.

  • In the local selection process, the idea of AdaBoost and the local information of samples are combined to train the base learners, and we design a loss function to balance the accuracy and diversity of base learners to select an appropriate new ensemble set.

  • In the process of using the weighted voting scheme to predict new sample labels, the quantum genetic algorithm is introduced to search optimal weights for base learners.

Preliminaries

Ensemble learning combines multiple base learners to obtain stronger predictive capability than any single learner. In this paper, we mainly study the ensemble selection process based on AdaBoost.

AdaBoost is characterized by training a base classifier in each iteration. In each iteration, the weights of incorrectly classified samples by the previous classifier are increased while the weights of correctly classified samples are reduced. Finally, AdaBoost takes the linear combination of base classifiers as a strong classifier and uses the weighted voting method to predict test samples. The specific process of AdaBoost is shown in Fig. 1 and Algorithm 1.

Fig. 1 The overall framework of AdaBoost

Algorithm 1 AdaBoost
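The AdaBoost procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: it assumes labels in {-1, +1}, uses one-feature threshold stumps as base classifiers, and clips the weighted error away from 0 and 1 to keep the logarithm finite. The function names `train_stump`, `stump_predict`, `adaboost`, and `predict` are hypothetical.

```python
import math

def train_stump(X, y, D):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[f] <= thr else -pol for x in X]
                # Weighted error: total weight of misclassified samples.
                err = sum(d for d, p, t in zip(D, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, x):
    _, f, thr, pol = stump
    return pol if x[f] <= thr else -pol

def adaboost(X, y, rounds):
    n = len(X)
    D = [1.0 / n] * n                      # uniform initial sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, D)
        # Clipping is an assumption of this sketch (avoids log(0)).
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise weights of misclassified samples, lower the others.
        D = [d * math.exp(-alpha * t * stump_predict(stump, x))
             for d, x, t in zip(D, X, y)]
        Z = sum(D)                         # normalization factor
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all base classifiers."""
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1
```

The strong classifier is the sign of the weighted sum of stump outputs, which matches the linear combination described in the text.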

Selective quantum ensemble learning based on local AdaBoost

The overall framework of SELA is shown in Fig. 2. SELA is roughly divided into two phases: the generation of the original ensemble classifier and the generation of the new ensemble classifier. The first phase mainly involves the generation of random subspaces. We combine the importance and randomness of features to generate a set of high-quality subspaces for the initial ensemble classifier. In the second phase, we combine the local information of samples with the AdaBoost algorithm, which allows the weights of misclassified and correctly classified samples to be adjusted more reasonably so as to improve the final classification performance. A loss function that balances accuracy and diversity is designed for selecting high-quality base classifiers. In addition, a quantum genetic algorithm is designed to generate optimal weights for base learners in order to improve the ability of the ensemble model. The specific implementation steps are introduced in the following.

Fig. 2 The overall framework of SELA

Fig. 3 The framework of local selection process

Generation of random subspace

Generally, the random subspace technique aims to reduce the similarity between base classifiers by randomly selecting features. However, random selection often leaves out features that are important for the classification task, which degrades the classification performance of the base classifiers. In order to preserve important features and improve the diversity of base classifiers, we design the random subspace generation scheme shown in Algorithm 2. We use information gain, which is computed from information entropy, to measure the importance of features and place the important features in a fixed subset. We then use bootstrap sampling to generate random subsets of the remaining features, and finally merge the fixed subset with each random subset to form the final subspaces.

Algorithm 2 Random subspace generation

Firstly, we use information gain to measure the importance of features. The greater the information gain, the more important the feature is. Suppose the training sample set is \(X=\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{N},y_{N})\}\), where \(y_{i}\) is the class label of \(x_{i}\) and \(y_{i}\in \{C_{1},C_{2},\ldots ,C_{K}\}\). Let \(|C_{k}|\) denote the number of samples in X with label \(C_{k}\). The information entropy of X is as follows:

$$\begin{aligned} \text {Ent}(X)=-\sum _{k=1}^{K}\frac{|C_{k}|}{N}\log _{2}\frac{|C_{k}|}{N}. \end{aligned}$$
(1)

Suppose the discrete attribute a of X has M different values. If a is used to partition X, M subsets are obtained, denoted \(X^m(m=1,2,\ldots ,M)\). Then the information gain of attribute a is as follows:

$$\begin{aligned} \text {Gain}(X,a)=\text {Ent}(X)-\sum _{m=1}^{M}\frac{|X^m|}{N}\text {Ent}(X^m). \end{aligned}$$
(2)
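Eqs. (1) and (2) translate directly into code. The following is a minimal sketch for a single discrete attribute; the function names `entropy` and `information_gain` are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(X) from Eq. (1): Shannon entropy (base 2) of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(X, a) from Eq. (2) for one discrete attribute.

    values[i] is the attribute value of sample i, labels[i] its class label.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)   # X^m: samples sharing value m
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute that splits the samples into pure groups attains the full entropy of X as its gain, while an attribute that is independent of the labels has a gain of zero.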

We select the features with the highest information gain as a fixed feature set \(S^{'}\) and denote the remaining features as \(S^{''}\). Bootstrap sampling is then used to extract O feature subsets \(S^{''}_{1},S^{''}_{2},\ldots ,S^{''}_{O}\) from \(S^{''}\). Finally, we obtain the subspace set \(S=\{S_{1},S_{2},\ldots ,S_{O}\}\), where

$$\begin{aligned} S_{o}=S^{'} \cup S^{''}_{o}. \end{aligned}$$
(3)
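The construction in Eq. (3) can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: the fixed set takes the top `round(s * d)` features out of d, each bootstrap draw samples `subset_size` remaining features with replacement and collapses duplicates, and the names `generate_subspaces`, `subset_size`, and `seed` are hypothetical.

```python
import random

def generate_subspaces(gains, s, O, subset_size, seed=0):
    """Build O feature subspaces S_o = S' ∪ S''_o.

    gains       : information gain of each feature (index = feature id)
    s           : proportion of features kept as the fixed set S'
    O           : number of subspaces to generate
    subset_size : number of bootstrap draws from the remaining features S''
    """
    rng = random.Random(seed)
    ranked = sorted(range(len(gains)), key=lambda f: gains[f], reverse=True)
    n_fixed = round(s * len(gains))
    fixed, rest = ranked[:n_fixed], ranked[n_fixed:]
    subspaces = []
    for _ in range(O):
        # Bootstrap sampling: draw with replacement, duplicates collapse.
        drawn = {rng.choice(rest) for _ in range(subset_size)} if rest else set()
        subspaces.append(sorted(set(fixed) | drawn))
    return subspaces
```

Every subspace contains the high-gain features, while the randomly drawn part keeps the subspaces different from one another.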

Local selection process of ensemble learning

The local selection is an iterative process in which an appropriate base classifier is selected in each round. We use the idea of AdaBoost to focus the base classifiers selected in each round on the samples that are difficult to classify, so as to improve the classification accuracy.

Local AdaBoost

The idea of AdaBoost is to increase the weights of samples misclassified in the previous round and reduce the weights of samples classified correctly. In the q-th iteration of AdaBoost, the training error \(\epsilon _{q}\) of the base classifier is computed as the weighted misclassification rate over all training samples (see Algorithm 1). Because every training sample contributes to \(\epsilon _{q}\) independently of the others, \(\epsilon _{q}\) can be regarded as a global measure of classifier performance. However, the performance of a classifier at one sample is generally closely related to its performance at the samples in the neighborhood of that sample. In order to explore the classification performance of the classifier at each sample, we design a local error to measure the classification ability. Based on the local error, we further obtain the local weight of the classifier and use it to update the sample weights. The key steps of the designed local AdaBoost algorithm are introduced in the following.

In the local AdaBoost algorithm, the local error of base classifier \(\chi _{q}\) with respect to sample \(x_{i}\), computed over the neighborhood \(N(x_{i})\) of \(x_{i}\), is designed as follows:

$$\begin{aligned} \epsilon _{q}(x_{i})=\sum _{x_{m}\in N(x_{i})}D_{q}^{i}(x_{m}){\mathbb {1}}(\chi _{q}(x_{m})\ne y_{m}). \end{aligned}$$
(4)

\(D^{i}_{q}(x_{m})\) is the local weight of \(x_{m}\) which represents the proportion of the weight of \(x_{m}\) in the neighborhood of \(x_{i}\), as follows:

$$\begin{aligned} D^{i}_{q}(x_{m})=\frac{D_{q}(x_{m})}{\sum _{x_{l}\in N(x_{i})}D_{q}(x_{l})}. \end{aligned}$$
(5)

Then, the local weight of base classifier \(\chi _{q}\) with respect to sample \(x_{i}\) is as follows:

$$\begin{aligned} \alpha _{q}(x_{i})=\frac{1}{2}\ln \left( \frac{1-\epsilon _{q}(x_{i})}{\epsilon _{q}(x_{i})}\right) . \end{aligned}$$
(6)

The weight update rule of training samples is as follows:

$$\begin{aligned} D_{q+1}(x_{i})=\frac{D_{q}(x_{i})}{Z_{q}}\exp (-\alpha _{q}(x_{i})y_{i}\chi _{q}(x_{i})), \end{aligned}$$
(7)

where \(Z_{q}\) is the normalization factor. Because the local weight \(\alpha _{q}(x_{i})\) reflects the classification ability of the classifier around each sample, the local AdaBoost algorithm makes the weight update of each sample more reliable.
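One round of the local weight update in Eqs. (4) to (7) can be sketched as below. The sketch assumes labels and predictions in {-1, +1}, takes the neighborhoods as precomputed index lists (how \(N(x_{i})\) is built is not fixed here), and clips the local error away from 0 and 1 so that the logarithm in Eq. (6) stays finite. The clipping and the name `local_adaboost_update` are assumptions of this illustration.

```python
import math

def local_adaboost_update(D, y, pred, neighbors, eps_floor=1e-10):
    """Return (new sample weights, local classifier weights).

    D         : current sample weights D_q
    y, pred   : true labels and base-classifier predictions in {-1, +1}
    neighbors : neighbors[i] lists the indices of the samples in N(x_i)
    """
    n = len(D)
    alphas = []
    for i in range(n):
        mass = sum(D[m] for m in neighbors[i])
        # Eqs. (4)-(5): share of neighborhood weight that is misclassified.
        err = sum(D[m] for m in neighbors[i] if pred[m] != y[m]) / mass
        err = min(max(err, eps_floor), 1 - eps_floor)   # assumed clipping
        # Eq. (6): local weight of the classifier at x_i.
        alphas.append(0.5 * math.log((1 - err) / err))
    # Eq. (7): exponential update followed by normalization with Z_q.
    new_D = [D[i] * math.exp(-alphas[i] * y[i] * pred[i]) for i in range(n)]
    Z = sum(new_D)
    return [d / Z for d in new_D], alphas
```

A sample whose neighborhood is classified well receives a large local weight, so its own sample weight moves strongly, whereas a local error of 0.5 gives a local weight of zero and leaves the unnormalized sample weight unchanged.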

Local selection process

We embed the local AdaBoost algorithm into the selection process of base classifiers in ensemble learning, and the specific steps of the local selection process are shown in Fig. 3 and Algorithm 3. In each iteration, we use the local AdaBoost algorithm to generate a set of base classifiers. By designing a loss function that can balance the diversity and classification ability of base classifiers, an appropriate base classifier can be selected from the group of base classifiers in this round. The sample weight is updated based on the selected base classifier, and then a new set of base classifiers will be obtained by the local AdaBoost algorithm in the next iteration.

Algorithm 3 Local selection process

In order to select an appropriate classifier from a group of classifiers generated by local AdaBoost, we design a loss function to measure the importance of each classifier. The loss function of base classifier \(\chi _{j}\) is designed as follows:

$$\begin{aligned} \Delta (\chi _{j})=\beta _{1}\text {Div}({\hat{\chi }},\chi _{j})+\beta _{2}\text {Acc}({ \hat{\chi }},\chi _{j}), \end{aligned}$$
(8)

where \({\hat{\chi }}\) is the currently selected new ensemble set, and \(\text {Div}({\hat{\chi }},\chi _{j})\) and \(\text {Acc}({\hat{\chi }},\chi _{j})\) are the diversity and classification accuracy of \({\hat{\chi }} \cup \alpha _{j}\chi _{j}\), respectively. \(\beta _{1}\) and \(\beta _{2}\) are the parameters used to balance diversity and accuracy, satisfying \(\beta _{1}+\beta _{2}=1 (\beta _{1},\beta _{2}>0)\). We have

$$\begin{aligned} \text {Div}({\hat{\chi }},\chi _{j})&=\frac{1}{N}\sum _{i=1}^{N} \frac{1}{n-\lceil \frac{n}{2} \rceil }\min \{l(x_{i}),n-l(x_{i})\},\\ \text {Acc}({\hat{\chi }},\chi _{j})&=\frac{1}{N}\sum _{i=1}^{N}{\mathbb {1}}(y_{i}=f(x_{i})), \end{aligned}$$
(9)

where n is the total number of classifiers in \({\hat{\chi }} \cup \alpha _{j}\chi _{j}\), \(l(x_{i})\) is the number of those classifiers that classify \(x_{i}\) correctly, and \(f(x_{i})\) is the label predicted for \(x_{i}\) by \({\hat{\chi }} \cup \alpha _{j}\chi _{j}\). Besides, we have

$$\begin{aligned} f(x_{i})=\text {sign}\left( \sum _{\chi ^{t}\in {\hat{\chi }} }\alpha _{t} \chi ^{t}(x_{i})+\alpha _{j}\chi _{j}(x_{i})\right) . \end{aligned}$$
(10)

In the first iteration, that is \(q=1\), we select the base classifier with the highest classification accuracy as follows:

$$\begin{aligned} \chi ^{1}=\text {arg} \mathop {\max }\limits _{\chi _{o}}\{\text {Acc}(\chi _{1}),\text {Acc}(\chi _{2}),\ldots ,\text {Acc}(\chi _{O})\}, \end{aligned}$$
(11)

where \(\text {Acc}(\chi _{o})\) is the classification accuracy of classifier \(\chi _{o}\), and

$$\begin{aligned} \text {Acc}(\chi _{o})=\frac{1}{N}\sum _{i=1}^{N}{\mathbb {1}} ( y_{i}=\chi _{o}(x_{i})),\quad o=1,2,\ldots ,O. \end{aligned}$$
(12)

When \(q\ge 2\), we add the classifier with the largest value of the loss function \(\Delta \) to the new ensemble set \({\hat{\chi }}\). After Q iterations, the new ensemble set is \({\hat{\chi }}=\{\chi ^{1},\chi ^{2},\ldots ,\chi ^{Q}\}\).
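The selection criterion of Eqs. (8) to (10) can be sketched as a scoring function for one candidate. The sketch assumes labels in {-1, +1}, represents each classifier by its weight and its vector of predictions on the training set, and resolves a zero vote sum as +1. The default \(\beta _{1}=0.3\) follows the value that performs best in the parameter study. The names `loss` and `sign` are illustrative.

```python
import math

def sign(v):
    # Ties (v == 0) are resolved as +1; this is an assumption of the sketch.
    return 1 if v >= 0 else -1

def loss(selected, candidate, y, beta1=0.3):
    """Delta(chi_j) from Eq. (8) for one candidate classifier.

    selected  : list of (alpha_t, predictions_t) already in the new ensemble
    candidate : (alpha_j, predictions_j) of the classifier being scored
    y         : true labels in {-1, +1}
    """
    members = selected + [candidate]
    n, N = len(members), len(y)
    denom = n - math.ceil(n / 2)       # positive whenever n >= 2
    div = acc = 0.0
    for i in range(N):
        # l(x_i): number of members that classify x_i correctly.
        l = sum(1 for _, p in members if p[i] == y[i])
        if denom > 0:
            div += min(l, n - l) / denom            # Eq. (9), diversity
        f = sign(sum(a * p[i] for a, p in members))  # Eq. (10)
        acc += (f == y[i])                           # Eq. (9), accuracy
    return beta1 * div / N + (1 - beta1) * acc / N
```

In each round with \(q\ge 2\), the candidate with the largest returned value would be added to the new ensemble set. A candidate that duplicates an already selected classifier contributes no diversity and therefore scores lower than an equally accurate candidate that disagrees on some samples.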

Quantum weighted voting scheme

The prediction stage of ensemble learning usually uses the majority voting method or allocates the voting weights by evaluating the accuracy of each base classifier. In order to obtain weights that are more suitable for each base classifier, we design a quantum genetic algorithm in SELA to further optimize each weight.

The genetic algorithm [27] is a random search method inspired by the law of biological evolution. It is good at solving global optimization problems and can efficiently jump out of local optima to find the global optimum. However, the genetic algorithm also has some shortcomings, such as the difficulty of choosing a coding method, premature convergence, and slow convergence.

In recent years, quantum theory has attracted more and more attention, and using quantum theory to improve the genetic algorithm has become a popular research direction. The quantum genetic algorithm [28] is a new evolutionary algorithm that combines quantum computation with genetic algorithm technology. The genetic algorithm uses binary, symbol, or real-number codes to represent genes, while the quantum genetic algorithm uses the probability amplitudes of qubits to encode chromosomes. The introduction of qubits makes each chromosome realize quantum entanglement, so that each chromosome can express multiple states. The quantum genetic algorithm can be applied to the same group of problems as the genetic algorithm, but it can significantly accelerate the evolution process through quantum parallelization and quantum entanglement. The probability mechanism of quantum computing combined with the evolutionary algorithm provides global search capability with rapid convergence and a small population size. With the help of the quantum genetic algorithm, we can generate appropriate weights for each base learner, thus further improving the performance of SELA.

When we use the quantum genetic algorithm to optimize the Q weights of base classifiers, each chromosome is coded by qubits as follows:

$$\begin{aligned} q_{j}^{t}=\left( \begin{array}{cccccccccc} \alpha _{11}^{t} &{}\cdots &{}\alpha _{1k}^{t} &{} \alpha _{21}^{t} &{}\cdots &{}\alpha _{2k}^{t} &{}\cdots &{} \alpha _{Q1}^{t} &{} \cdots &{} \alpha _{Qk}^{t} \\ \beta _{11}^{t} &{}\cdots &{}\beta _{1k}^{t} &{} \beta _{21}^{t} &{}\cdots &{}\beta _{2k}^{t} &{}\cdots &{} \beta _{Q1}^{t} &{} \cdots &{} \beta _{Qk}^{t} \end{array} \right) , \end{aligned}$$
(13)

where \(q_{j}^{t}\) is the j-th chromosome in the population of the t-th generation, k is the number of qubits coding each gene, and Q is the number of genes on the chromosome, which is equal to the number of weights. The fitness function in the quantum genetic algorithm is as follows:

$$\begin{aligned} f(w)=\frac{1}{E}, \end{aligned}$$
(14)

where

$$\begin{aligned} E=\frac{1}{N}\sum _{i=1}^{N}{\mathbb {1}}\left( y_{i}\ne \text {sign}\left( \sum _{q=1}^{Q}w_{q}\cdot \chi ^{q}(x_{i})\right) \right) . \end{aligned}$$
(15)

E is the error of the new ensemble model, and \(w_{q}\) is the weight of the q-th base learner. Therefore, the larger the fitness, the smaller the error, which indicates better weights. Here, we use the quantum rotation gate to update the quantum population, and it is defined as follows:

$$\begin{aligned} R(\theta )= \left( \begin{array}{cc} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{array} \right) . \end{aligned}$$
(16)
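The qubit encoding of Eq. (13) and the rotation gate of Eq. (16) can be illustrated with a short sketch. A chromosome is represented as a list of amplitude pairs \((\alpha ,\beta )\) with \(\alpha ^{2}+\beta ^{2}=1\). Two points are assumptions of this sketch and are not specified in the text: observation yields bit 1 with probability \(\beta ^{2}\) (a common convention in quantum genetic algorithms), and each group of k observed bits is decoded linearly into a weight in [0, 1]. The names `rotate`, `observe`, and `decode_weights` are hypothetical.

```python
import math
import random

def rotate(alpha, beta, theta):
    """Apply the rotation gate R(theta) of Eq. (16) to one qubit."""
    c, s = math.cos(theta), math.sin(theta)
    return c * alpha - s * beta, s * alpha + c * beta

def observe(chromosome, rng):
    """Collapse each qubit to a bit; P(bit = 1) = beta^2 (assumed convention)."""
    return [1 if rng.random() < beta ** 2 else 0 for _, beta in chromosome]

def decode_weights(bits, k):
    """Map every k bits to one weight in [0, 1] (assumed linear decoding)."""
    return [int("".join(map(str, bits[g:g + k])), 2) / (2 ** k - 1)
            for g in range(0, len(bits), k)]
```

Since \(R(\theta )\) is an orthogonal matrix, a rotated qubit still satisfies the normalization condition, so the update changes only the probabilities of observing 0 and 1.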

When the iteration reaches one of the termination conditions, the optimal weight vector \(w^{*}\) is obtained. Finally, the predicted label of test sample x is as follows:

$$\begin{aligned} {\hat{\chi }}(x)=\text {sign}\left( \sum _{q=1}^{Q}w^{*}_{q}\chi ^{q}(x)\right) . \end{aligned}$$
(17)
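The fitness of Eqs. (14) and (15) and the weighted vote of Eq. (17) can be sketched as follows. Labels and predictions are assumed to lie in {-1, +1}, a zero vote sum is resolved as +1, and the error is floored at a small constant so that a perfect ensemble does not cause a division by zero. The floor and the function names are assumptions of this illustration.

```python
def weighted_vote(weights, votes):
    """Eq. (17): sign of the weighted sum of base-classifier votes."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1 if score >= 0 else -1

def ensemble_error(weights, predictions, y):
    """Eq. (15): fraction of samples misclassified by the weighted vote.

    predictions[q][i] is the prediction of the q-th base classifier on x_i.
    """
    wrong = sum(
        1 for i, t in enumerate(y)
        if weighted_vote(weights, [p[i] for p in predictions]) != t
    )
    return wrong / len(y)

def fitness(weights, predictions, y, floor=1e-12):
    """Eq. (14): reciprocal of the error, floored to avoid division by zero."""
    return 1.0 / max(ensemble_error(weights, predictions, y), floor)
```

A quantum genetic algorithm would evaluate `fitness` for each decoded weight vector and keep the vector with the largest value as \(w^{*}\).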

Experiment

In order to test the performance of SELA, we choose ten datasets from UCI [29], which are shown in Table 1. In the following, we first analyze the impact of several parameters of SELA on classification performance. Secondly, the effect of the local selection process in SELA is studied. Finally, SELA is compared with a single classifier and several ensemble models. In all experiments, the base classifiers in the ensemble model are decision trees.

Table 1 Specific information about datasets

The effect of the parameters

The value of \(\beta _{1}\), the number of original classifiers, the number of samples contained in \(N(\cdot )\), and the proportion of fixed features are all important parameters in SELA. Therefore, it is necessary to study the influence of these parameters on the classification performance of SELA. In the following, we run SELA on the datasets in Table 1.

The performance of SELA for different values of \(\beta _{1}\) is shown in Fig. 4(1). When \(\beta _{1}=0.3\), SELA has the highest accuracy in most cases. When diversity accounts for a large proportion and accuracy for a small proportion of the loss function, the overall performance of SELA is reduced. Therefore, it is necessary to use \(\beta _{1}\) and \(\beta _{2}\) to balance the accuracy and diversity of base classifiers.

Fig. 4 The effect of the parameters

The impact of O (the number of original classifiers) on the classification performance of SELA is shown in Fig. 4(2). When the value of O is between 160 and 180, the performance of the ensemble model is stable. When the value of O is too large or too small, the classification accuracy decreases. A possible reason is that the diversity and accuracy of the new ensemble set can be ensured only when there are enough initial base classifiers and the number of redundant base classifiers is small.

Table 2 Experimental analysis of different algorithms [average accuracy \(\pm \) standard deviation and time consuming (second)]

Figure 4(3) shows the effect of u, the number of training samples contained in \(N(\cdot )\), on the classification performance of SELA. When \(u=15\), the accuracy of the ensemble model is the highest. In SELA, we use the information in the sample neighborhood to assist the training of base classifiers. When the value of u is small, there is less information in the sample neighborhood, resulting in weaker classification performance. When the value of u is greater than 15, the classification accuracy cannot be further improved. A possible reason is that once the amount of information referenced by the sample is sufficient, additional neighbors contribute only redundant information. In actual use, if some loss of prediction accuracy is acceptable in exchange for lower computational cost, a smaller value of u can be chosen.

The impact of s (the proportion of fixed features) on the classification performance of SELA is shown in Fig. 4(4). When the fixed features strategy is not used in generating random subspaces, that is, when \(s=0\), the average accuracy of SELA is the lowest. Therefore, it is necessary to add a certain proportion of fixed features to each subspace. When the value of s is between 0.2 and 0.3, the accuracy of the ensemble model is the highest. When the value of s is too large or too small, the classification accuracy decreases. In SELA, we add some important features to all subspaces, which assists the training of base learners. When the value of s is too small, the accuracy of the base learners may decline due to the lack of important features in the training process. When the value of s is too large, each subspace has enough important features, but the randomness of each subspace is directly reduced, which affects the classification ability of SELA. In general, it is necessary to use the fixed features strategy and to select an appropriate proportion of fixed features to balance accuracy and diversity.

The effect of the local selection process

The local selection process is an important part of SELA. In order to verify its necessity, we compare the original ensemble learning (OEL) with SELA on the datasets in Table 1. The experimental results are shown in Fig. 5. The performance of SELA is clearly better than that of OEL.

Fig. 5 The effect of the local selection process

Comparison with different algorithms

We compare Decision Tree, Bagging, Random Forest, AdaBoost, and SELA on the datasets in Table 1. These classifiers are all well-established algorithms in pattern recognition. The experimental results are shown in Table 2. The performance of the four ensemble learning algorithms is much better than that of the Decision Tree. Moreover, the performance of SELA is the best on these datasets. Possible reasons are that SELA retains particularly important features when generating subspaces, and that the base classifiers are further selected to remove redundant and poorly performing ones, which improves the performance of the ensemble model. SELA requires more time because of the intelligent optimization algorithm, the subspace selection process, and the other processing steps added to the classification process. SELA consumes 4 to 10 s more than Bagging and Random Forest. The core improvements of SELA are based on AdaBoost, yet SELA consumes only 3 to 4 s more than AdaBoost while achieving higher classification accuracy. Therefore, SELA has a strong learning ability without requiring much additional time.

In addition, we use the rank sum test to assess the statistical significance of the experimental results in Table 2. The P-values of SELA against the other four algorithms are shown in Table 3. A P-value below 0.05 indicates that SELA is significantly different from the comparison algorithm. Table 3 shows that the P-values for all four comparison algorithms are below 0.05. Therefore, the rank sum test confirms that the performance of SELA is significantly improved compared with the other algorithms.
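For readers who want to reproduce this kind of comparison, a two-sided rank sum test can be sketched with the standard normal approximation. This is an illustrative implementation, not the exact procedure used for Table 3: it assigns average ranks to ties, applies no tie or continuity correction to the variance, and the name `rank_sum_test` is hypothetical.

```python
import math

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    a, b : two independent samples, e.g. accuracies of two algorithms.
    Returns (z statistic, two-sided p-value).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1            # average rank of a tied block
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of a
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p
```

The function would be called once per comparison algorithm, with the accuracies of SELA as one sample and those of the comparison algorithm as the other, and a returned p-value below 0.05 would be read as a significant difference.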

Conclusion

In this paper, we design a random subspace generation algorithm, a local selection process, and a quantum genetic algorithm to implement SELA. Compared with general ensemble learning, SELA ensures the randomness of subspaces without losing important features. Based on the idea of AdaBoost, the local information of samples is introduced to train base classifiers, and the best base classifier in each round is selected according to the designed loss function. SELA also utilizes the quantum genetic algorithm to search for optimal weights for the base learners in the new ensemble set. In addition, we have carried out several experiments on relevant datasets and obtained the following results: (1) The local selection process can effectively improve ensemble learning performance. (2) SELA is generally superior to other competitive ensemble learning methods. In the future, we plan to continue to study selective ensembles based on AdaBoost. Building on SELA, we will try to further improve the ensemble classification performance through intelligent optimization algorithms. We will also study whether there is a better classifier evaluation criterion than the loss function we designed for selecting base classifiers.

Table 3 P-value of SELA with other four algorithms based on rank sum test