Using dual evolutionary search to construct decision tree based ensemble classifier

A typical ensemble learning process uses a forward integration mechanism to construct the ensemble classifier from a large number of base classifiers. Under this mechanism, it is difficult to adjust the diversity among base classifiers and to optimize the internal structure of the ensemble because the generation process involves a certain amount of randomness, which makes the performance of ensemble classifiers heavily dependent on human design decisions. To address this issue, we propose an automatic ensemble classifier construction method based on a dual-layer evolutionary search mechanism, which includes a tree coding-based base classifier population and a binary coding-based ensemble classifier population. Through a collaborative search process between the two populations, the proposed method is driven by the training data to update the base classifier population and optimize the ensemble classifiers globally. To verify the effectiveness of the dual evolutionary ensemble learning method (DEEL), we tested it on 22 classification tasks from 4 data repositories. The results show that the proposed method can generate a diverse decision tree population on the training data while searching for and constructing ensemble classifiers from it. Compared with 9 competitor algorithms, the proposed method achieved the best performance on 17 of the 22 test tasks and improved the average accuracy by 0.97–7.65% over the second-place method. In particular, the generated ensemble classifiers have a compact structure consisting of a small number of diverse decision trees. This increases the transparency of the ensembles and facilitates interpretability analysis.


Introduction
Ensemble learning [1] methods are generally classified into three types: bagging [2], boosting [3] and stacking [4]. An ensemble is usually constructed in two steps: first, the base classifiers are generated, and then they are strategically combined to obtain the ensemble that performs the computational task. The problem with this process is that it generates redundant base classifiers that increase the computational cost and storage space and affect the classification accuracy of the ensemble, which results in diminishing returns [5]. In addition, the redundancy within the ensemble also increases the complexity of the system structure, which decreases the model interpretability. Studies show that the generalization ability of ensemble subsets can be better than that of ensembles composed of all base classifiers [6]. When diversity and accuracy reach a balance, ensembles can obtain more accurate predictions than any individual member. Moreover, optimizing the internal components of the model can make its calculation mechanism more transparent, which is the basis of interpretability analysis of an ensemble. Therefore, for an ensemble learning process, optimizing the structure of the ensemble while improving its accuracy is a desirable outcome.
Currently, the selective ensemble method is used to recruit a subset of classifiers from the base classifier population to improve the generalization ability of the ensemble [7]. However, a selective ensemble can only choose among the existing base classifiers and cannot globally optimize the structure of the ensemble. Moreover, the selective ensemble treats base classifier generation and ensemble construction as two independent stages, so the advantages of interaction and cooperation among different base classifiers are not fully exploited. Thus, the ensemble process remains unidirectional and static, and the ensemble quality depends on the reliability of manual settings and usually reaches only a local optimum.
To solve the above problems, we propose an automatic ensemble learning method based on a dual evolutionary search structure. The core is a two-level evolutionary process: the first level is a tree coding-based base classifier population search process, and the second level is a binary coding-based ensemble classifier population optimization process. Mutual promotion and collaborative evolution are achieved through intermediate information interactions and feedback mechanisms. By globally optimizing the ensemble population, this method can perform automated ensemble construction while laying the foundation of interpretability analysis of the ensemble model. In this paper, we focus on the following: (1) A multiobjective search mechanism for the base classifier population is proposed. The tree coding-based base classifiers are constructed by a greedy search model, where each node will select the local optimal solution. However, local "good" decisions may lead to quality degradation of the whole tree since the structure cannot be updated or modified. We attempt to dynamically adjust the decision tree population from a global perspective through the multiobjective optimization process, thus maintaining the balance between the diversity and accuracy of the tree population. (2) A dual evolutionary synergy mechanism is proposed.
The base classifier search and the ensemble search are two independent but interacting evolutionary processes. Information sharing is used to form a dynamic, complementary and progressive relationship between the two search populations. This relationship yields a decision tree population with better accuracy and diversity and an ensemble with lower redundancy and higher accuracy. In addition, this construction mechanism does not require the ensemble structure to be set in advance, and the construction process is more automated, thus reducing the influence of human decision-making.
The remainder of this paper is organized as follows: the "Related works" section reviews the related research. The "Dual evolutionary ensemble learning" section introduces the dual evolutionary ensemble learning (DEEL) mechanism in detail. The "Experiment and discussion" section analyzes and discusses the experimental results of the algorithm. The "Conclusion" section summarizes the work of this paper.

Precision-oriented approaches
In traditional studies, enhancing the calculation precision is the major objective in applications of ensemble learning (EL) [8][9][10]. In the bagging method, each base classifier learns on a dataset randomly sampled from the training set. By randomly disturbing the training set, the base classifiers obtain high accuracy and diversity. Typical ensemble learning methods, such as random forest [11], use a self-sampling method to disturb samples and features, thus constructing diverse decision trees. Using bagging, Zhang et al. [12] clustered the training samples, constructed the submodule structure of modular neural networks by clustering, and then ensembled the submodules according to a distance measure.
Unlike bagging, boosting is a serial process; for example, adaptive boosting (AdaBoost) [3] adjusts the distribution of training sets in each iteration so that subsequent base learners pay more attention to samples that are difficult to classify correctly. In addition, Li et al. [13] used k-nearest neighbors (KNN) as the base learner for unbalanced distribution samples and used the AdaBoost method to ensemble the optimal feature subset after extracting the samples using binary particle swarm optimization (BPSO).
Stacking usually uses the entire training set to generate the base model, and then the metamodel is trained using the output of the base model as input. Liu et al. [14] proposed a spatiotemporal ensemble framework for predicting parking demand. The prediction results of the base learner for time features are considered channels of images and then ensembled using the features of convolutional neural networks (CNNs) that can capture spatial information. Wang et al. [15] proposed an ensemble method for load forecasting based on long short-term memory (LSTM) networks and clustering. A set of LSTM models was generated by data clustering, and then the prediction results of LSTM were fused and improved using fully connected cascade (FCC) networks. DeepForest [16] uses random forests as nondifferentiable modules for serial ensembles, thereby achieving representation learning through multigranularity scanning and enhancement of the input data.
Some researchers classify ensemble construction as an optimization problem and combine evolutionary algorithms to improve the accuracy of ensembles. Wang et al. [17] ran different evolutionary operators on multiple populations. Operators with good adaptability obtain more computing resources, and then the offspring produced by different populations are exchanged; this exchange produces an exchange of information and enhances model accuracy. Sheng et al. [18] proposed an evolutionary ensemble algorithm for niched neural networks with adaptive negative correlation learning. The ensemble is optimized by sharing the fitness value to improve the accuracy. Zhao et al. [19] performed multiobjective sparse ensemble using evolutionary algorithms; the authors used the sparse ratio, false positive rate, and false negative rate as evaluation metrics. Asafuddoula et al. [20] clustered imbalanced data to obtain a diversity of heterogeneous base classifiers and subsequently used a multiobjective optimization algorithm to optimize the ensemble structure. Wang et al. [21] proposed a multiobjective evolutionary ensemble model based on an extreme learning machine (ELM), which used an evolutionary algorithm to achieve feature selection and network structure search. Wen et al. [22] proposed a constraint voting ensemble framework based on variants of the differential evolutionary algorithm. The designed differential evolutionary variants were used as a search engine. Each constraint was voted according to its own rules, and the individual with the most votes entered the next generation. Wang et al. [8] proposed a random forest-based stacking method that combines a multiobjective evolutionary algorithm and local search to obtain an ensemble model with enhanced performance.
In most applications of the EL method, the primary goal is to quickly construct a high-precision ensemble. To achieve this goal, many base classifiers are integrated or stacked together. However, recent studies have shown that adding too many base classifiers to an ensemble negatively affects its accuracy and increases computational complexity [23]. Because of the law of diminishing returns, continuing to add base classifiers after the optimal ensemble size has been reached will not improve the accuracy. Moreover, the interior of a complex model is almost a black box, which is not conducive to interpretable analyses of its computational logic.

Improving precision by optimizing the structure of the ensemble
To reduce the redundancy and complexity of the ensemble, Zhou et al. [6] proposed a "selective ensemble" method, which finds a better subset of classifiers by model selection instead of assembling all classifiers. In the past few years, such approaches have been widely used. These approaches differ mainly in the strategy for selecting base classifiers and can be divided into sorting-, clustering-, and optimization-based methods.
For example, Guo et al. [24] proposed margin-based ordered aggregation for ensemble pruning (MOAG), which adjusts the ensemble structure by an unsupervised marginal selection method. Dai et al. [25] proposed a reverse reduction error (RRE) pruning algorithm that incorporates a subtraction operation to subtract the votes of the worse classifiers at the time of voting. Zhou et al. [26] presented the Pareto optimization algorithm for selective ensembles (POSE), which combines multiobjective evolutionary algorithms with local search operators to improve accuracy while optimizing the ensemble structure. Cavalcanti et al. [27] proposed a pruning method that combines diversity measures for ensemble pruning (DivP), which combines different paired diversity matrices with a genetic algorithm. The combined diversity matrices are transformed into a graph, and then the graph coloring method is used to optimize the ensemble structure. Zhu et al. [28] proposed a selective ensemble based on an ELM and the improved discrete artificial fish swarm algorithm (IDAFSEN), which overcomes the drawback that a single ELM is unstable in terms of its classification. Ykhlef et al. [29] proposed a novel ensemble pruning methodology using nonmonotone simple coalitional games (SCG-P), which evaluates the diversity of classifiers based on the Banzhaf power index and selects classifiers with the minimum winning coalition.
Overall, the selective ensemble is still an additive model with only local optimization of the combination of base classifiers. The selective ensemble improves accuracy by reducing the ensemble complexity and removing redundant structures. Moreover, in selective ensembles, the generation of base classifiers and ensemble construction are two independent stages, and if the diversity of the generated base classifiers is low, the generalization ability of the ensemble is affected.

Diversity-oriented approach
In recent years, increasing attention has been given to the optimization of ensemble structures, which is mainly reflected in the diversity-oriented approach. Ensemble diversity is reflected in the similarity between base classifiers [20,27]. For example, Zhou et al. [30] studied ensemble diversity from the view of multi-information. Most ensemble methods generate classifiers implicitly by perturbing the inputs. For example, bagging and boosting promote diversity by perturbing the input data during training. Random forests perturb subsets of the input data and features to produce diversity, and rotating forests apply principal component analysis (PCA) to each subset as an improved method. These methods of increasing diversity aim to reduce the redundant structure in the ensemble. The measurement of ensemble diversity is mainly focused on the output space [31]. Existing diversity measures are usually divided into paired and unpaired diversity measures. Paired diversity measures include the Q-statistic, K-statistic, correlation coefficient, disagreement measure, and double-fault measure. Nonpairwise measures include the Kohavi-Wolpert variance, the entropy measure, the measure of difficulty, and generalized diversity. Jan et al. [32] proposed misclassification diversity (MD) and demonstrated that MD improves accuracy better than other pairwise diversity metrics. Liu et al. [33] proposed an anti-spoofing ensemble model based on a deep neural network (DNN) in which the generalization of the model is improved by the different views provided by the ensemble diversity. These diversity-oriented approaches focus on the behavior of base classifiers, and the structure of base classifiers also has a significant impact on diversity [34,35].
Essentially, the ultimate aim of the diversity-oriented approach is to improve accuracy. However, we note that in some applications, the interior structure of the ensemble has become an equally important optimization goal. For example, computing on mobile devices requires models that consume as few computing and storage resources as possible; additionally, medical data analyses require models with interpretable analytical capabilities. Thus, a method that can optimize both the interior structure and the accuracy of an ensemble offers research value.

Method
The bagging method tends to construct a diverse and accurate population of base classifiers in parallel. In contrast, the boosting method focuses on correcting the current deficiencies of the ensemble by generating and stacking base classifiers serially; this mechanism concentrates on the main difficulties to rapidly improve the computational effectiveness of the ensemble. The selective ensemble, in turn, optimizes the combination of existing base classifiers to reduce the redundancy and complexity of the ensemble while improving its classification performance. Each of these three models has its own characteristics, and it would be ideal to combine their advantages. Our idea is to generate many base classifiers in parallel at initialization and then introduce an evolutionary mechanism that adjusts the structure of the base classifiers and globally optimizes their combinations, thus combining the advantages of the three ensemble models to improve the overall performance of the ensemble.
By specifying the above research ideas, we propose dual evolutionary ensemble learning (DEEL), which has the following overall process: in the initialization stage, the decision tree population and binary coding ensemble population are initialized, and the offspring population is generated through evolutionary operations. In the fitness evaluation stage, the two evolutionary processes exchange and communicate information for the purpose of coevolution. Then, multiobjective selection is performed to construct the next generation population according to the fitness value, and this process is repeated. When the maximum number of iterations is reached, the loop will be exited and the optimal ensemble will be outputted. The DEEL method proposed in this paper just uses decision trees as the base classifiers. This process is shown in Fig. 1.
The advantage of DEEL is that it can search for high-quality base classifiers in parallel while evaluating and optimizing the combined effect of the base classifiers. Moreover, DEEL can break the limitation of a fixed base classifier structure, generate new forms of base classifiers, and explore the performance of ensembles with more varied structures. In this paper, the DEEL method is explained using decision trees.

Tree coding-based base classifier
We designed a classification tree to solve the classification problem. Each node is represented by the quadruplet in Fig. 2a, where each element is a numerical value. The first element is the index of the attribute, the second element is the threshold at which the attribute splits the samples, the third element is the terminal node identifier (1 means the node is a terminal node, and 0 means the node is an intermediate node), and the fourth element is the class label (the class predicted by the tree). For example, a binary classification problem with class labels 0 and 1, sample attributes $C = \{c_1, c_2, c_3, c_4\}$ and corresponding indices $\{1, 2, 3, 4\}$ is encoded as shown in Fig. 2b.

Evaluation metrics for trees
a) Diversity: Tree diversity is an assessment of the differences between trees. The double-fault measure (DF) correlates strongly with the accuracy of the majority voting ensemble, so DF is used as the diversity metric of the tree. It is computed by Eq. (1), where $f_{DF}(t_i)$ represents the similarity of tree $t_i$ to the other trees in the population, $DF(t_i, t_j)$ is the number of samples that both $t_i$ and $t_j$ predict incorrectly, and $Spl$ is the total number of samples. The lowest diversity is indicated when $f_{DF}(t_i) = 1$. The purpose of $f_{DF}(t_i)$ is to reduce the repetition of common errors among trees and make the outputs of different trees differ.
b) Complexity: Because the base classifier is a binary tree whose lateral growth is limited, limiting the tree depth can reduce the tree complexity. The tree complexity is defined by Eq. (2), where $max\_depth(t_i)$ is the maximum depth of tree $t_i$.
c) Accuracy: The accuracy is defined as the fraction of the samples that the tree predicts correctly:
$$f_{acc}(t_i) = \frac{Acc(t_i)}{Spl} \qquad (3)$$
where $f_{acc}(t_i)$ is the accuracy of tree $t_i$ and $Acc(t_i)$ is the number of correctly classified samples.
d) Contribution: The collaboration degree between trees is the degree of contribution to the ensemble. This metric measures the average performance of a tree when it participates in ensembles. It is computed by Eq. (4), where $f_{coop}(t_i)$ indicates the degree to which $t_i$ collaborates with other trees in the population, $num$ is the number of ensembles in which $t_i$ participates, and $e_j$ is an ensemble containing $t_i$. $f_{acc}(t_i \in e_j)$ denotes the accuracy of $e_j$ when it contains $t_i$, and $f_{acc}(t_i \notin e_j)$ denotes the accuracy of $e_j$ when it does not contain $t_i$. The difference between the two accuracies is the contribution of $t_i$ to $e_j$. A smaller $f_{coop}(t_i)$ indicates better collaboration of $t_i$. If $t_i$ is not involved in any ensemble, then $f_{coop}(t_i)$ defaults to 0.5.
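As a rough illustration of the diversity and accuracy metrics above, the following Python sketch computes both from per-sample correctness vectors. The helper names (`tree_accuracy`, `double_fault`, `f_df`), the data layout, and in particular the choice to average the pairwise double-fault ratio over the other trees are assumptions made for illustration; the paper's exact normalisation in Eq. (1) may differ.

```python
from typing import List

# correct[i][s] is True when tree i classifies sample s correctly (hypothetical layout).
Correct = List[List[bool]]


def tree_accuracy(correct_i: List[bool]) -> float:
    """Fraction of samples the tree predicts correctly: Acc(t_i) / Spl."""
    return sum(correct_i) / len(correct_i)


def double_fault(correct_i: List[bool], correct_j: List[bool]) -> int:
    """DF(t_i, t_j): number of samples that both trees misclassify."""
    return sum(1 for a, b in zip(correct_i, correct_j) if not a and not b)


def f_df(correct: Correct, i: int) -> float:
    """Assumed normalisation: mean of DF(t_i, t_j) / Spl over all other trees j."""
    spl = len(correct[i])
    others = [j for j in range(len(correct)) if j != i]
    total = sum(double_fault(correct[i], correct[j]) for j in others)
    return total / (spl * len(others))


correct = [
    [True, True, False, False],
    [True, False, False, True],
    [False, True, False, True],
]
print(tree_accuracy(correct[0]))
print(f_df(correct, 0))
```

Under this assumed normalisation the score reaches 1 only when a tree and all of its peers misclassify every sample, which matches the statement that $f_{DF}(t_i) = 1$ marks the lowest diversity.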

Binary encoding of ensemble
The ensemble population can be expressed as $E = \{e_1, \ldots, e_{N_e}\}$. As shown in Fig. 3, an ensemble individual $e$ is expressed by a binary string: if $e(i) = 1$, then tree $t_i$ is selected, and if $e(i) = 0$, then tree $t_i$ is not selected. Thus, $e$ expresses a subset of the tree population $T = \{t_1, \ldots, t_{N_t}\}$.
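The binary encoding can be illustrated with a few lines of Python. This is only a sketch: `decode_ensemble` is a hypothetical helper name, and the trees are stood in for by strings.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def decode_ensemble(e: Sequence[int], trees: Sequence[T]) -> List[T]:
    """Return the trees t_i whose bit e(i) equals 1."""
    if len(e) != len(trees):
        raise ValueError("the binary string must have one bit per tree")
    return [tree for bit, tree in zip(e, trees) if bit == 1]


trees = ["t1", "t2", "t3", "t4", "t5"]   # placeholders for decision trees
e = [1, 0, 1, 1, 0]                      # one ensemble individual
selected = decode_ensemble(e, trees)
print(selected)  # ['t1', 't3', 't4']
```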

Evaluation metrics for ensemble
a. Diversity: Diversity is one of the key aspects of ensembles. The purpose of any ensemble method is for the base classifiers to be as accurate and diverse as possible. The difficulty measure is used as the diversity metric of the ensemble. It is computed by Eq. (5), where $Z = \{z_1, z_2, \ldots, z_{Spl}\}$ collects, for each sample, the proportion of trees that classify it correctly; $z_j$ is the proportion of trees in ensemble $e_i$ that correctly classify sample $j$ and takes values in $\{0, 1/N_t, 2/N_t, \ldots, 1\}$; and $Var(\cdot)$ denotes the variance of $Z$. A larger variance indicates that some of the samples are difficult to classify and the rest are easy to classify, which means that the classification behavior of the trees is more similar and the diversity is lower. Conversely, the smaller the variance is, the higher the diversity.
b. Complexity: The complexity of the ensemble is the number of decision trees it contains:
$$f_{num}(e_i) = \sum_{k=1}^{N_t} e_i(k) \qquad (6)$$
where $f_{num}(e_i)$ indicates the number of decision trees in the ensemble $e_i$.
c. Accuracy: The ensemble obtains the prediction result by majority voting, and the accuracy is then calculated from these predictions by Eq. (7).
(Figure caption: The tree construction process)
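The three ensemble metrics can be sketched in plain Python as follows. The function names are hypothetical, and several details are assumptions rather than the paper's definitions: the population variance is used for $Var(Z)$, $z_j$ is normalised by the number of member trees, and voting ties go to the label encountered first.

```python
from collections import Counter
from statistics import pvariance
from typing import List, Sequence


def difficulty(correct: Sequence[Sequence[bool]]) -> float:
    """Var(Z): z_j is the fraction of member trees that classify sample j correctly."""
    n_trees = len(correct)
    z = [sum(column) / n_trees for column in zip(*correct)]
    return pvariance(z)  # population variance (assumption)


def ensemble_size(e: Sequence[int]) -> int:
    """Number of selected trees in the binary string e."""
    return sum(e)


def majority_vote_accuracy(preds: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """preds[i][s] is the label predicted by member tree i for sample s."""
    hits = 0
    for s, y in enumerate(labels):
        votes = Counter(p[s] for p in preds)
        winner = votes.most_common(1)[0][0]  # ties: first-encountered label (assumption)
        hits += int(winner == y)
    return hits / len(labels)


labels = [0, 1, 0, 1]
preds = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]   # three member trees
correct: List[List[bool]] = [[p == y for p, y in zip(row, labels)] for row in preds]
print(difficulty(correct), ensemble_size([1, 0, 1, 1, 0]), majority_vote_accuracy(preds, labels))
```

A low `difficulty` value means the per-sample agreement levels are similar across samples, which the text above interprets as higher diversity.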

Algorithm
In this section, we introduce the algorithm in detail. The pseudocode is shown in Algorithm 1.
Step 1: First, the tree population $T = \{t_1, t_2, \ldots, t_{N_t}\}$ is initialized, where $N_t$ is the size of the tree population. The initial value of the feature selection probability is $P_{feature} = 1/C_{feature}$ (where $C_{feature}$ is the sum of feature frequencies), and the node generation probability is $P_{node} = 1/dp$ (where $dp$ is the depth of the current node), so $P_{node}$ is an adaptive value. To avoid generating a tree that is too complex, the probability of intermediate node generation decreases as the tree height increases. The tree construction process is as follows:

Algorithm 1 Dual Evolutionary Ensemble Learning
Step 1.1: Generate the root node, select a feature randomly according to $P_{feature}$, and then generate the split threshold randomly, as shown in Fig. 4a.
Step 1.2: The tree structure is generated recursively. When the random number $Rand$ is less than $P_{node}$, an intermediate node is generated, and the feature and split threshold are randomly selected according to $P_{feature}$; if $Rand$ is greater than $P_{node}$, a terminal node is generated, and the class label is randomly selected as its attribute, as shown in Fig. 4.
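Steps 1.1 and 1.2 can be sketched as a short recursive generator. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the 0-based feature indices, the uniform starting value of $P_{feature}$, the hypothetical feature ranges, and the convention that samples with a value at or below the threshold go left are all choices made here.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    feature: int                 # attribute index (-1 on terminal nodes)
    threshold: float             # split threshold (unused on terminal nodes)
    terminal: int                # 1 = terminal node, 0 = intermediate node
    label: int                   # class label (-1 on intermediate nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def grow(depth: int, p_feature: Sequence[float],
         ranges: Sequence[Tuple[float, float]], n_classes: int,
         rng: random.Random) -> Node:
    p_node = 1.0 / depth         # P_node = 1/dp, so the root is always a split
    if rng.random() < p_node:
        f = rng.choices(range(len(p_feature)), weights=p_feature)[0]
        lo, hi = ranges[f]
        node = Node(f, rng.uniform(lo, hi), 0, -1)
        node.left = grow(depth + 1, p_feature, ranges, n_classes, rng)
        node.right = grow(depth + 1, p_feature, ranges, n_classes, rng)
        return node
    return Node(-1, 0.0, 1, rng.randrange(n_classes))   # random class label


def predict(node: Node, x: Sequence[float]) -> int:
    while node.terminal == 0:
        # Assumed convention: values <= threshold follow the left branch.
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def max_depth(node: Node) -> int:
    if node.terminal == 1:
        return 1
    return 1 + max(max_depth(node.left), max_depth(node.right))


rng = random.Random(0)
n_features = 4
p_feature = [1.0 / n_features] * n_features      # uniform start (assumption)
ranges = [(0.0, 1.0)] * n_features               # hypothetical feature ranges
tree = grow(1, p_feature, ranges, n_classes=2, rng=rng)
print(max_depth(tree), predict(tree, [0.2, 0.8, 0.5, 0.1]))
```

Because the split probability falls as $1/dp$, deep branches become increasingly unlikely, which is the adaptive complexity control described in Step 1.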
Step 2: Initialize the ensemble population, denoted as $E = \{e_1, \ldots, e_{N_e}\}$, where $N_e$ is the ensemble population size; each $e$ is a random string of 0s and 1s whose length equals the decision tree population size $N_t$.
Step 3: Ensemble sharing and flip operation. This procedure is based on binary encoding to achieve combinatorial optimization of trees, as shown in Fig. 5. This operation allows the offspring to better learn and inherit the strengths of all parents and reduces the sensitivity to the number of parents.
Step 3.1: Shared operation. The element at each position in the shared structure is the element with the most occurrences in the parents. The blue part in Fig. 5 shows the shared structure.
Step 3.2: Binary flip. Each element of the shared structure is randomly flipped with probability $1/N_t$, as shown in the red part of Fig. 5.
Step 4: The search process of decision trees. During the dual evolutionary search, the offspring are generated by split and fusion operations based on the decision trees in the ensemble shared structure.
Step 4.1: Split operation. It has been shown that reducing the search space boundary for continuous numerical variables can reduce the search time and improve the decision tree accuracy [36,37]. We prune the decision tree search space with the ensemble shared structure to increase the search speed. First, all nodes in the decision tree except the root node are traversed, and the feature index of each node is used as the key. Then, the subtree of each node, the feature frequency, and the split threshold limits are incorporated into the split set as values, as shown in Fig. 6.
Step 4.2: Fusion operation. $P_{feature}$ is updated according to the feature frequency. A decision tree $t$ is randomly selected from the decision tree population $T$ and fused with the split set to optimize its structure. All nodes of the tree $t$ are traversed, and a random value $Rand$ is generated at each node. If $Rand$ is less than $P_{fusion}$, the node is used for fusion; otherwise, it is left unchanged.
Each fusion node is modified by one randomly selected fusion operation. The fusion operations are as follows: (1) Feature index and split threshold changes for fusion nodes. New feature indexes are randomly selected based on $P_{feature}$, and new thresholds are randomly generated based on the range of feature thresholds in the split set, as shown in Fig. 7a. (2) Fusion node replacement. The replacement node is randomly selected from the branches of the split set based on the feature index of the fusion node, as shown in Fig. 7b. (3) Split threshold change for fusion nodes. Based on the feature index of the fusion node, a new threshold value is randomly generated within the range of thresholds in the split set, as shown in Fig. 7c. Each of these three operations is selected with probability 1/3.
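The split set built in Step 4.1 can be pictured as a dictionary keyed by feature index. The sketch below is a hedged illustration: the `Node` and `SplitEntry` classes and the `build_split_set` helper are hypothetical, and skipping terminal nodes (which carry no split feature) is an assumption not stated in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    feature: int
    threshold: float
    terminal: int                # 1 = terminal node, 0 = intermediate node
    label: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


@dataclass
class SplitEntry:
    subtrees: List[Node] = field(default_factory=list)  # branches rooted at this feature
    frequency: int = 0                                   # times the feature appears
    low: float = float("inf")                            # lower threshold limit
    high: float = float("-inf")                          # upper threshold limit


def build_split_set(trees: List[Node]) -> Dict[int, SplitEntry]:
    split_set: Dict[int, SplitEntry] = {}
    for root in trees:
        stack = [root.left, root.right]          # traverse everything except the root
        while stack:
            node = stack.pop()
            if node is None or node.terminal == 1:
                continue                         # terminal nodes are skipped (assumption)
            entry = split_set.setdefault(node.feature, SplitEntry())
            entry.subtrees.append(node)
            entry.frequency += 1
            entry.low = min(entry.low, node.threshold)
            entry.high = max(entry.high, node.threshold)
            stack.extend([node.left, node.right])
    return split_set


def leaf(label: int) -> Node:
    return Node(-1, 0.0, 1, label)


t1 = Node(0, 0.5, 0, -1, Node(1, 0.3, 0, -1, leaf(0), leaf(1)), leaf(1))
t2 = Node(1, 0.7, 0, -1, leaf(0),
          Node(1, 0.6, 0, -1, Node(2, 0.1, 0, -1, leaf(1), leaf(0)), leaf(1)))
split_set = build_split_set([t1, t2])
for key in sorted(split_set):
    entry = split_set[key]
    print(key, entry.frequency, entry.low, entry.high)
```

The threshold limits stored per feature are what bound the random thresholds drawn by fusion operations (1) and (3), and the stored subtrees are the candidates for operation (2).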
Step 5.1: Tree fitness assessment. Tree fitness should not only evaluate the tree's own complexity but also ensure the diversity of the tree population. Additionally, the tree accuracy cannot indicate the collaboration degree in the ensemble, so the contribution in the ensemble must be evaluated as well. Therefore, the tree fitness is evaluated using formulas (1)-(4) in the "Evaluation metrics for trees" section.
Step 5.2: Ensemble fitness assessment. The optimization objective of the ensemble is to find the best combination of trees. The ensemble fitness is evaluated using formulas (5)-(7) in the "Evaluation metrics for ensemble" section.
(Figure caption: Tree split process. The white nodes in the figure are the root nodes without splitting, the blue nodes are the nodes split by $t_1$, the yellow nodes are the nodes split by $t_2$, threshold is the range of feature splitting, and frequency is the number of times the feature appears in the node.)
Step 6: Multiobjective selection for the tree population. The parent and offspring individuals form a population of size $2N_t$, and the next generation population of size $N_t$ is generated by nondominance sorting. The specific steps are as follows:
Step 6.1: First, the dominance rank (i.e., dominance relationship) of all individuals in the population is calculated according to the four objective functions, and the population is then stratified according to the dominance rank. The dominance rank of an individual is the number of individuals that dominate it. A dominance rank of 0 indicates that no individual in the population dominates this tree.
Step 6.2: Other multiobjective algorithms sort individuals at the same rank according to density with the aim of maintaining diversity. The tree diversity in DEEL is calculated according to Eq. (1). The individuals in the same rank are sorted according to $f_{DF}$ in ascending order.
Step 6.3: Starting from the dominance rank of 0, the first $N_t$ trees are selected as the next generation population. Trees with the same rank are selected from front to back according to the ordering in Step 6.2. To maintain the correspondence between the ensemble encoding and the trees, the indexes of trees from the parent population remain unchanged, and the trees from the offspring replace the eliminated parent trees. Because the joined offspring are superior to the eliminated parents, the replacement does not reduce ensemble performance.
Step 7: Ensemble population multiobjective selection. Parent and offspring individuals are merged, and the individuals on the Pareto front are retained as the next generation ensemble population based on fitness. The operation is as follows: the merged population of parents and offspring is placed in a set $P$, and each ensemble $e$ is taken from $P$ in turn. If no individual in $P$ dominates $e$, then $e$ is retained for the next generation ensemble population $E$, and the individuals dominated by $e$ in $P$ are deleted. If an individual in $P$ dominates $e$, then $e$ is dropped. This operation is repeated until the set $P$ is empty. Finding a common structure from the individuals on the Pareto front reduces the search space and improves the search efficiency.
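The selection rules of Steps 6 and 7 can be sketched with a few small functions. This is a minimal illustration with hypothetical names, and it assumes every objective is expressed so that smaller is better (for example, error rather than accuracy). `pareto_front` returns the rank-0 set, which is the same set that the iterative procedure of Step 7 retains.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_ranks(objs: Sequence[Sequence[float]]) -> List[int]:
    """Rank of an individual = number of individuals that dominate it (Step 6.1)."""
    return [sum(dominates(q, p) for q in objs) for p in objs]


def select_trees(objs: Sequence[Sequence[float]], f_df: Sequence[float],
                 n_keep: int) -> List[int]:
    """Steps 6.2-6.3: order by rank, break ties by ascending f_DF, keep n_keep."""
    ranks = dominance_ranks(objs)
    order = sorted(range(len(objs)), key=lambda i: (ranks[i], f_df[i]))
    return order[:n_keep]


def pareto_front(objs: Sequence[Sequence[float]]) -> List[int]:
    """Step 7: indices of the non-dominated individuals."""
    return [i for i, r in enumerate(dominance_ranks(objs)) if r == 0]


# Toy objectives (error, size); both are minimised in this sketch.
objs = [(0.1, 3), (0.2, 2), (0.3, 3), (0.2, 4), (0.05, 5)]
f_df = [0.3, 0.1, 0.2, 0.05, 0.4]
print(dominance_ranks(objs))
print(select_trees(objs, f_df, n_keep=4))
print(pareto_front(objs))
```

The quadratic pairwise comparison is adequate for the population sizes discussed here; a faster nondominated sort would be a drop-in replacement.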
Step 8: If the current iteration number is equal to the maximum iteration number, output the current tree population and ensemble; otherwise, go to Step 3.
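The shared and flip operations of Step 3, which Step 8 loops back to, can be sketched as follows. The function names are hypothetical, and resolving a tie between 0 and 1 (possible with an even number of parents) in favour of 0 is an assumption; the text does not specify tie handling.

```python
import random
from typing import List, Sequence


def shared_structure(parents: Sequence[Sequence[int]]) -> List[int]:
    """Step 3.1: keep, at each position, the bit that occurs most often in the parents."""
    n = len(parents)
    return [1 if 2 * sum(bits) > n else 0 for bits in zip(*parents)]  # ties -> 0 (assumption)


def binary_flip(shared: Sequence[int], rng: random.Random) -> List[int]:
    """Step 3.2: flip each bit independently with probability 1/N_t."""
    p = 1.0 / len(shared)        # the string length equals the tree population size N_t
    return [1 - b if rng.random() < p else b for b in shared]


parents = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
]
shared = shared_structure(parents)
child = binary_flip(shared, random.Random(0))
print(shared, child)
```

With a flip probability of $1/N_t$, one bit changes per offspring on average, so each child stays close to the structure shared by its parents.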

Experiment and discussion
In this section, we test DEEL on 22 classification tasks selected from 4 data repositories (UCI, OpenML, Kaggle and KEEL); details are presented in Table 1. The test datasets were selected as randomly as possible from the 4 data repositories. At the same time, the selected datasets were required to include both binary and multiclass classification problems and to cover a large range of sample sizes and feature dimensions so that the composition of the test datasets is diversified. The experiments were implemented in Python 3.6 using NumPy 1.19 and scikit-learn 0.21. The algorithm was tested on a computer running Windows 10 with a 2.5 GHz processor and 8 GB RAM.
Each test dataset is subjected to fivefold nested cross-validation [38,39], repeated 20 independent times. In the outer loop of the nested cross-validation, 20% of the dataset is used to test the ensemble; in the inner loop, 20% and 60% of the dataset are used to train the ensemble and the trees, respectively. Both the proposed algorithm and the comparison algorithms use grid search to optimize the hyperparameters. The parameters of DEEL are as follows: the population size of the decision trees is $N_t = 100$, the population size of the ensembles is $N_e = 100$, the fusion node probability is $P_{fusion} = 0.5$, and the number of dual evolution iterations is $G = 200$. Tables 2 and 3 show DEEL's classification accuracy, time cost and final ensemble structure on the 22 test datasets, including the best, worst and average results and their standard deviations over 20 independent runs. Generally, the average accuracy of DEEL is over 70% on all test datasets except the CMC, Abalone, Cirrhosis, and Segmentation datasets. Notably, on BigMartSales and Sentiment, two multiclass classification tasks, the average accuracies of DEEL are close to 100%. Moreover, the ensembles generated by DEEL show a high-quality structure: the average number of decision trees involved is below 12, and for most test datasets, the diversity between them is below 0.5.
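The 60/20/20 partition used in the nested cross-validation can be sketched with the standard library alone. This is a simplified illustration: `nested_splits` is a hypothetical helper, the split is not stratified, and only a single inner split per outer fold is shown, whereas the inner loop in the experiments may rotate the ensemble-training part.

```python
import random
from typing import Iterator, List, Tuple


def nested_splits(n_samples: int, n_folds: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (tree_train, ensemble_train, test) index lists for each outer fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]                                   # 20% held out for testing
        rest = [i for j, f in enumerate(folds) if j != k for i in f]
        cut = len(rest) // 4                              # 20% of all data = 25% of the rest
        ensemble_train = rest[:cut]                       # trains the ensemble
        tree_train = rest[cut:]                           # 60% trains the trees
        yield tree_train, ensemble_train, test


splits = list(nested_splits(100))
for tree_train, ensemble_train, test in splits:
    print(len(tree_train), len(ensemble_train), len(test))
```

Keeping the tree-training and ensemble-training parts disjoint means the ensemble fitness is measured on samples the trees were not fitted to.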

Experimental statistical results
The diversity of the final ensemble is calculated by Eq. (5), where smaller values indicate higher diversity. As shown in Table 1, the numbers of samples, classes, and features vary enormously across the 22 test datasets. To observe how increasing numbers of features, samples and classes affect the performance of the proposed method, we divided the test datasets into groups from each of these three angles. The basic grouping rules are as follows: for the number of features, the boundary values used for partitioning were set to 10, 20, 50, 100 and 150; for the number of samples, they were set to 500, 1000, 2000, 10,000 and 20,000; for the number of classes, they were set to 2, 3, 4, 5 and 10. From the statistical results, we can summarize some general characteristics of DEEL's performance. Figure 8 shows the influence of the numbers of samples, classes, and features on the classification accuracy and running time. First, we find that the numbers of samples, classes, and features all affect the achieved accuracy. Within a single dataset, increasing the number of samples generally improves the accuracy. However, each group in Fig. 8 contains multiple datasets, so the aggregated results of DEEL fluctuate to some extent, most distinctly in the group with more than 10,000 samples. A large sample size produces a huge search space, which is likely to increase the complexity of the global search and cause some fluctuation in the results. In addition, when the number of features increases, each sample provides more information, which enables the model to discover more association rules. However, it also enlarges the search space, which is likely to reduce the prediction accuracy of the model. An increasing number of classes is likely to raise the difficulty of the classification task, which generally degrades the prediction performance of DEEL. Finally, the number of samples dominates the time cost, while increasing numbers of classes and features have smaller impacts on it.
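The grouping rule above can be sketched as a simple binning step. The boundary values are those given in the text; whether a value equal to a boundary belongs to the lower or the upper group is not stated, so the sketch assumes the lower group. The names group_index and group_label are hypothetical.

```python
from bisect import bisect_left

# Boundary values taken from the grouping rules in the text.
FEATURE_BOUNDS = [10, 20, 50, 100, 150]
SAMPLE_BOUNDS = [500, 1000, 2000, 10000, 20000]
CLASS_BOUNDS = [2, 3, 4, 5, 10]


def group_index(value, bounds):
    """Return the group index of value for sorted boundary values.

    Assumption: group k holds values in (bounds[k-1], bounds[k]], so a value
    equal to a boundary falls into the lower group; the last group holds
    everything above the largest boundary.
    """
    return bisect_left(bounds, value)


def group_label(value, bounds):
    """Return a readable label for the group that value falls into."""
    k = group_index(value, bounds)
    if k == 0:
        return f"<= {bounds[0]}"
    if k == len(bounds):
        return f"> {bounds[-1]}"
    return f"({bounds[k - 1]}, {bounds[k]}]"


if __name__ == "__main__":
    # Example: a hypothetical dataset with 15,000 samples, 8 features, 3 classes.
    print(group_label(15000, SAMPLE_BOUNDS))
    print(group_label(8, FEATURE_BOUNDS))
    print(group_label(3, CLASS_BOUNDS))
```

With five boundaries per angle, this convention yields up to six groups; under it, binary problems (2 classes) form the first class group.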
As shown in Fig. 9, the ensembles generated by DEEL include approximately 4 to 12 decision trees for the 22 test datasets. Meanwhile, the diversity of the involved decision trees does not change in any obviously regular way across the test datasets. We believe that changes in the numbers of samples, classes and features do not have a great impact on the structure of the final ensemble. Even for test datasets with a large number of samples, DEEL can still produce a small and highly diverse ensemble, although its performance may fluctuate somewhat. Therefore, from the viewpoint of optimizing the interior structure of the ensemble, DEEL shows excellent ability.

Comparison of the experimental results
We compared DEEL against 9 competitive methods on the 22 test datasets: 3 precision-oriented approaches (DeepForest [16], RF [11] and Bagging [2]), 4 selective ensemble methods (POSE [26], IDAFSEN [28], GASEN [6], and RRE [25]), and 2 diversity-oriented algorithms (ECL [20] and DivP [27]). IDAFSEN uses an improved discrete artificial fish swarm algorithm to optimize ensembles according to diversity, runs on MATLAB 2012a and uses classification accuracy for evaluation. GASEN, implemented in MATLAB, uses a genetic algorithm to optimize the weights of base classifiers and selects base classifiers to constitute an ensemble according to the generalization error. ECL uses a clustering process to obtain a group of varied base classifiers and then prunes them using a multiobjective optimization algorithm, implemented in MATLAB 2015b, that considers the accuracy of each class and the diversity of classifiers. Tables 2 and 3. The initial population size of the above methods is 100, except for DeepForest, which does not require an initial population size. In addition, the other hyperparameters of the above methods are optimized with grid search. As shown in Fig. 10a, DEEL obtained the best average accuracy on 17 of the 22 test problems, increasing the average accuracy by 0.97–7.65% over the second-place method. In particular, for Abalone, DEEL improved the average accuracy over the second-ranked IDAFSEN by 7.65%. For Spambase, the average accuracy of DEEL was lower than that of the first-ranked POSE by approximately 3.16%. For BigMartSales, Sentiment, Movement and Optdigits, the average accuracy of DEEL was lower than that of the first-ranked DeepForest by approximately 0.36%, 0.88%, 7.56% and 7.49%, respectively.
In addition, RRE, DeepForest, RF, and Bagging are 4 noniterative methods that have significantly shorter runtimes than the iterative methods (e.g., DEEL, IDAFSEN, GASEN, DivP, and POSE), as shown in Fig. 10b. Meanwhile, the time cost of DEEL is significantly lower than those of IDAFSEN, GASEN, DivP, and POSE. We believe the main reason is that the shared structure reduces the number of search operations and the split-fusion operation reduces the number of tree traversals. Figure 11a shows that the average number of decision trees involved in the ensemble is approximately 4 to 12, which is much smaller than that of all competitors except DivP. Although DivP uses the fewest base classifiers, its average accuracies are lower than those of the proposed method by approximately 1.84% to 11%. Furthermore, Fig. 11b reveals that DEEL has the best average diversity on all 22 test datasets: its diversity curve lies clearly below those of all competitors. Figure 11 thus indicates that, relative to the 9 competing algorithms, the ensemble generated by DEEL involves fewer base classifiers with higher diversity. Therefore, the proposed method can effectively reduce the redundancy inside the ensemble and globally optimize the structure of the involved decision trees. In summary, DEEL shows stable performance on various types of high-dimensional, small-sample classification problems. Meanwhile, it can be driven by data to automatically produce a high-quality ensemble with excellent internal structure.

Discussion
In this section, we investigate the effect of several key parameters and of the dual evolutionary search mechanism.

Effect of G
In DEEL, G limits the number of iterations of the entire search process. We set 3 different G values (100, 200 and 400) to compare the accuracy of the final ensembles. As shown in Fig. 12, DEEL clearly has not converged when G = 100. When G = 200, the system obtains the best accuracies in 6 of 10 binary-class problems and 8 of 12 multi-class problems. Meanwhile, the accuracies for G = 400 exceed those for G = 200 in only 8 of the 22 tasks, and the accuracy gains are about 0.17% to 3.12%.
Overall, a larger G incurs a higher cost. Furthermore, excessive search operations are likely to cause overfitting. For example, when G = 400, the accuracies of DEEL on the test data show no obvious change compared with those for G = 200 on 14 of the 22 test datasets. A dynamic stopping mechanism for DEEL is therefore desirable. Since the stopping criterion used in this work is still a static parameter, we believe G = 200 is a relatively appropriate setting.
Figure 13 shows the effect of N_t and N_e on the results when both are set to 50, 100 and 200. In general, N_t and N_e limit the sizes of the two populations. When N_t and N_e are 50, the average accuracy is approximately 4.88% lower than with the other two settings, whereas the average accuracies for N_t and N_e set to 100 and 200 differ only slightly (by approximately 0.48%). If N_t and N_e are set too small, the richness of the base classifiers suffers, which impacts the quality of the ensembles. In contrast, a set of N_t and N_e with large numbers will
Fig. 11 Comparison of the size and diversity of ensembles between DEEL and 9 competitors for 22 datasets: (a) comparison of the number of involved base classifiers with 9 algorithms; (b) comparison of the diversity of involved base classifiers with 9 algorithms
Table 3 The statistical results of DEEL in terms of size and diversity for 22 test datasets

Effect of P_fusion
In the search process of DEEL, the fusion probability P_fusion is a key parameter that affects the final structure of the decision trees. Since the fusion operation is a top-down operation, nodes closer to the root have a greater impact on the tree structure. Hence, a large P_fusion is likely to do more damage to the current structure. Conversely, a small P_fusion will delay the convergence of the optimization. As shown in Fig. 14

Effect of the dual evolutionary search mechanism
The impact of the dual evolutionary mechanism on the ensemble is mainly reflected in three aspects: precision, complexity, and diversity. Figures 15 and 16 compare the accuracy and the structure of the involved base classifiers between DEEL and POSE on Cirrhosis and KCHouse. POSE initializes its decision trees with a greedy algorithm, whereas the decision trees of DEEL are randomly initialized, so the initial ensemble accuracy of DEEL is lower than that of POSE. However, the accuracy of DEEL increases significantly faster than that of POSE during the iterations. After 200 iterations, the accuracy of DEEL is higher than that of POSE by 7.8% and 7.4% on Cirrhosis and KCHouse, respectively. Meanwhile, the two curves of the number of decision trees differ distinctly. First, by and large, the curve of DEEL lies below that of POSE by about 5 and 8 trees, respectively. Second, the curve of POSE always decreases, whereas the curve of DEEL fluctuates irregularly. This difference indicates that the dual evolutionary mechanism is constantly adjusting the internal structure of the ensembles. This global search capability enables DEEL to optimize the accuracy and the interior structure of the ensembles synchronously. Figures 17 and 18 show the diversity comparison of the involved base classifiers between DEEL and POSE on Cirrhosis and KCHouse. First, the populations of decision trees are clustered using the affinity propagation clustering algorithm. Then, each tree uses its own features to make a prediction. In these figures, the circles represent trees, the clusters represent the classes of trees, the largest circle in each cluster represents the cluster's center, and a black circle represents a tree that has been selected by an ensemble.
On the whole, POSE improves the diversity of the trees with a self-sampling algorithm, so it has more clusters than DEEL. However, POSE does not optimize the structure of the base classifiers; therefore, the cluster structure of the trees inside POSE does not change during the iterations. In contrast, comparing generations 50 and 200, we notice a marked difference in the cluster structure of the trees inside DEEL, since DEEL constantly updates its populations of decision trees and ensembles. The diversity value of an ensemble is an evaluation index of its interior redundancy. DEEL shows better ability in this regard.

Conclusion
In this paper, we proposed an ensemble construction method based on a dual evolutionary search mechanism. DEEL uses evolutionary pressure to find highly collaborative combinations, thereby achieving coevolution through the information interaction between the base classifier population and the ensemble population. The dual evolution mechanism reduces the dependency on human design decisions through data-driven mining of feature association rules, and increases the reliability and generality of the model. The experimental results on 22 test datasets demonstrate the following. (1) DEEL produced high-quality ensembles for most of the test datasets, with excellent interior structure and competitive accuracy. Overall, DEEL can handle various binary-class and multi-class classification problems, and it displays outstanding performance on high-dimensional, small-sample problems. (2) DEEL uses only a few parameters to adjust the generation process, and, other than the number of generations G, they have no significant effect on the model output. Too many iterations are likely to produce overfitting. In the discussion, we analyzed the influence of changes in parameter values on DEEL's performance and provided a set of parame-
In this work, the base classifier used in DEEL is only the decision tree, for the following reasons. First, its coding design is easy to implement. In addition, the tree coding-based optimization mechanism can make the ensemble more internally transparent, which aids analysis of the internal decision logic of the ensemble.
In the future, we would like to pursue two topics. First, we note that DEEL provides a way to perform interpretability analysis of the ensemble model, and we will explore methods of knowledge extraction from the output of DEEL. Second, we will improve the encoding mechanism so that the proposed method can contain more types of base classifiers, and seek ways to make the optimization of the base classifier population more efficient.