A dynamic ensemble learning algorithm for neural networks

This paper presents a novel dynamic ensemble learning (DEL) algorithm for designing ensembles of neural networks (NNs). The DEL algorithm determines the size of the ensemble (the number of individual NNs) by employing a constructive strategy, the number of hidden nodes of the individual NNs by employing a constructive-pruning strategy, and different training samples for each individual NN's learning. For diversity, negative correlation learning is used, and the training samples are varied across the individual NNs so that the ensemble learns better from the whole training set. The major benefits of the proposed DEL compared to existing ensemble algorithms are (1) automatic design of the ensemble; (2) maintaining accuracy and diversity of the NNs at the same time; and (3) a minimum number of parameters to be defined by the user. The DEL algorithm is applied to a set of real-world classification problems: the cancer, diabetes, heart disease, thyroid, credit card, glass, gene, horse, letter recognition, mushroom, and soybean datasets. The experimental results confirm that DEL produces dynamic NN ensembles of appropriate architecture and diversity that demonstrate good generalization ability.


Introduction
Neural network (NN) structures have been used for knowledge representation [1], modelling [2][3][4], prediction [5,6], design automation [7], classification [8,9], identification [10], and nonlinear control [11] applications in many domains. All these applications mainly used a monolithic NN structure. In a monolithic structure, a single NN architecture represents the whole task to be performed [12][13][14]. Scalability is a major limitation of monolithic NNs for a wide range of applications. Incremental learning is also not possible, as the addition of new elements to the NN requires retraining the NN with old and new data [15,16]. An inevitable phenomenon in the retraining of an NN is catastrophic forgetting (also known as crosstalk), which was first reported by McCloskey and Cohen [17]. Two types of crosstalk can arise during retraining: temporal crosstalk and spatial crosstalk. In temporal crosstalk, learned knowledge is lost during retraining on a new task. In spatial crosstalk, the NN cannot learn two or more tasks simultaneously [18]. Kemker et al. [19] demonstrated that the catastrophic forgetting problem in the incremental learning paradigm has not been resolved despite many claims, and showed how such forgetting can be measured. A number of attempts have been made to mitigate the phenomenon, such as regularization, rehearsal and pseudorehearsal, life-long learning-based dynamic combination, dual-memory models and ensemble methods [16][20][21][22][23]. A collection or committee of individual NNs can also be advantageous, since a new NN can be added to store new knowledge, mitigating forgetting where tasks can be subdivided [24]. Instead of employing a large NN for a complex problem, researchers have been attracted by the idea of decomposing the problem into smaller subtasks, leading to smaller architectures, shorter training time and increased performance [24,25].
NN ensemble-based classifiers can also improve generalization ability [25,26]. The structure of an NN ensemble is illustrated in Fig. 1. Each NN in the ensemble (1 through n) is first trained on the training instances. The output of the ensemble is calculated from the predicted outputs, $O_i$, $i = 1, 2, \ldots, n$, of the individual NNs [26]. The challenge here is to design a learning algorithm for the NN ensemble. The initial weights, topology of NNs, training datasets, and training algorithms also play decisive roles in the design of ensembles [23,25]. In general, NN ensembles are designed mostly by varying these parameters.
Many algorithms similar to NN ensembles [25] have been reported in the literature, such as mixture of experts [27], boosting [28] and bagging [29]. The main drawbacks of these algorithms are the manual design and the predefined number of neurons in the hidden layer and of NNs in the ensemble.
In general, ensemble and modular approaches are employed for combining NNs. The ensemble approach attempts to generate a reliable and accurate output by combining the outputs of a set of trained NNs rather than selecting the best NN, whereas the modular approach strives to have each NN as self-contained or autonomous [14,24]. In modular approach, the problem is divided into a number of tasks. Each task is assigned to an individual NN to be accomplished. It is not possible to know the best size of NN a priori. The size of NN is defined by the number of layers and the number of neurons in each layer. Moreover, the backpropagation (BP) [30,31] algorithm is not useful for training NN unless the topology is known. Therefore, finding the correct topology is the foremost design issue. In order to define the topology of an NN, a number of parameters such as the number of layers, the number of hidden neurons, activation functions, and degree of connectivity have to be determined. A second issue is to determine the training parameters that include the initial weights of the NN, the learning rate, acceleration term, momentum term and weight update rule. The choice of the topological and training parameters has significant impact on the training time and the performance of the NN. Unfortunately, there is no straightforward method of selecting the parameters; rather the designer has to depend on the expert knowledge or employ empirical method.
The performance of NNs in an ensemble depends on a number of factors, such as (1) the topology of the NNs and the initial structure; (2) the training method; (3) the learning rate; (4) the input and output representations; and (5) the content of the training sample [32]. Eventually, the number of NNs and the number of neurons in the hidden layers of the NNs determine the performance of an ensemble. In most cases, these are predefined by human experts based on available a priori information. Formal learning theory is used to estimate the size of the ensemble system based on the complexity and the number of examples required to learn the particular function. In such cases, the generalization error becomes high if the number of examples is small. Consequently, choosing an appropriate NN topology is still something of an art. The data examples play a crucial role in learning, where learning is sensitive to initial weights and learning parameters [33][34][35].
The purpose of this research is to design an NN ensemble that addresses the following issues: (1) automatic determination of the NN ensemble architecture (i.e. the number of NNs in the ensemble), (2) automatic determination of the size of individual NNs (i.e. the number of hidden neurons in individual NNs), and (3) variation of the training examples so that each individual NN learns better. Real-world classification problems are used to verify the effectiveness and the generalization ability of the ensemble.
The paper is organized as follows: Sect. 2 presents the related works. Section 3 contains the description of DEL algorithm. Section 4 presents the datasets description, experimental results and comparison. Section 5 presents a discussion. Some conclusions are made in Sect. 6.

Related works
In ensemble learning, the individual NNs are called base learners. They are single classifiers that are trained and then combined so that individual errors are reduced and generalization is improved. Hitherto, efforts have been made to design ensembles by combining NNs based on either accuracy or diversity [25,36,37]. There is evidence that accurate and diverse NNs can produce a good ensemble that distributes errors over different regions of the input space [38,39]. Rosen [40] proposed an ensemble learning algorithm that trains individual NNs sequentially, where the individual NNs minimize training errors as well as de-correlate previous training errors. Sequential training of an NN does not affect the NNs that were previously trained, which is a major disadvantage in ensemble learning. Consequently, there is no correlation between the errors of the individual NNs [41]. The topology of the mixtures-of-experts (ME) [27] can produce biased individual NNs which may be negatively correlated [32]. The disadvantage of ME is that it needs a separate gating NN and cannot provide balanced control over the bias-variance-covariance tradeoff [34]. A two-stage design approach is employed in most of the architectures mentioned above, where individual NNs are generated first and then combined. As the combination stage does not provide any feedback to the design stage, some individual NNs designed independently may not contribute significantly to the ensemble [34]. Therefore, some researchers proposed a one-stage design process and introduced a penalty term into the error function of each NN. The researchers also proposed simultaneous and interactive training of all NNs in the ensemble instead of independent and sequential training [41]. NNs with negative correlation can be created by encouraging specialization and cooperation among the NNs in an ensemble. This enables the NNs to learn different regions of the training data space and ensures that the ensemble learns the whole data space.
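The penalty term mentioned above can be illustrated with a short sketch. It assumes the commonly cited NCL form, in which the error of NN i on one example is half its squared error plus a strength-weighted penalty that multiplies its deviation from the ensemble mean by the summed deviations of the other NNs. The function name `ncl_errors` and the argument names are hypothetical, and the ensemble output is assumed to be the simple average of the individual outputs.

```python
def ncl_errors(outputs, target, strength):
    """Per-network NCL error for one training example (illustrative sketch).

    outputs  -- scalar outputs F_i, one per NN in the ensemble
    target   -- desired output d
    strength -- correlation strength parameter; 0 gives independent training
    """
    # Assumption: the ensemble output is the simple average of the NN outputs.
    mean_output = sum(outputs) / len(outputs)
    errors = []
    for i, f_i in enumerate(outputs):
        # Summed deviations of all other NNs from the ensemble mean.
        others = sum(f_j - mean_output
                     for j, f_j in enumerate(outputs) if j != i)
        penalty = (f_i - mean_output) * others
        errors.append(0.5 * (f_i - target) ** 2 + strength * penalty)
    return errors


if __name__ == "__main__":
    print(ncl_errors([0.2, 0.6, 1.0], target=1.0, strength=1.0))
```

Because the deviations from the mean sum to zero, the penalty for NN i equals the negative of its squared deviation from the ensemble mean, so a larger strength rewards outputs that differ from the ensemble average.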
To ensure interaction between NNs and simultaneous learning in an ensemble, some researchers employed evolutionary computing [32]. Liu et al. [32] applied evolutionary algorithm for ensemble learning of NNs with negative correlation. This approach can determine the optimal number of NNs and the combinations of NNs in an ensemble using fitness sharing mechanism.
Chen and Yao [33] employed multi-objective genetic algorithm [42] for regularized negative correlation learning (NCL) optimizing errors of the base NNs and their diversity in ensemble. Mousavi and Eftekhari [43] proposed static ensemble selection and deployed the popular multiobjective genetic algorithm NSGA-II [42]. This combination of static ensemble selection and NSGA-II ensures selecting the best classifiers and their optimal combination.
There are two other widely popular approaches to ensemble learning, namely constructive NN ensemble (CNNE) [44] and pruning NN ensemble (PNNE) [45]. CNNE determines the number of NNs in the ensemble and the hidden neurons of the individual NNs by employing NCL [34,41] in an incremental fashion. On the other hand, PNNE employs a competitive decay approach. PNNE uses a neuron cooperation function in each NN for the hidden neurons and a selective deletion of NNs in the ensemble based on the criterion of over-fitting. PNNE employs NCL to ensure diversity of the NNs in the ensemble.
Islam et al. [29] proposed two incremental learning algorithms for NNs in ensemble using NCL: NegBagg and NegBoost. NegBagg fixes the number of hidden neurons of NNs in ensemble by constructive method. NegBoost also uses constructive method to fix the number of hidden neurons of NNs as well as the number of NNs in the ensemble.
Yin et al. [46] proposed a two-stage hierarchical approach to ensemble learning called dynamic ensemble of ensembles (DE$^2$). DE$^2$ comprises component classifiers and interim ensembles. The final DE$^2$ is obtained by weighted averaging. Cruz et al. [47] used a two-phase dynamic ensemble selection (DES) framework. In the first phase, DES extracts meta-features from training data. In the second phase, DES uses a meta-classifier to estimate the competence of the base classifier to be added to the ensemble.
Chen and Yao [48] show that NCL considers the entire ensemble as a single machine with the objective of minimizing the mean square error (MSE), and that NCL does not employ regularization while training. They proposed a regularized NCL (RNCL) that incorporates a regularization term for the ensemble, which enables RNCL to decompose the training objective into sub-objectives, each of which is implemented by an individual NN. RNCL shows improved performance over NCL even when the noise level in the datasets is higher.
Semi-supervised learning is the mechanism of learning using a large amount of unlabelled data and a small amount of labelled data. Chen and Wang [49] proposed a semi-supervised boosting framework that takes three assumptions, namely smoothness, cluster and manifold, into consideration, using a cost function comprising the margin cost on labelled data and the regularization penalty on unlabelled data. Experiments on benchmarks and real-world classification tasks reveal consistent improvement by the algorithm. Semi-supervised learning is a widely popular method due to its higher accuracy at a lower effort.
The generalization of an ensemble is related to the accuracy of the base NNs and the diversity among the NNs [37,38]. Higher accuracy of the base NNs leads to lower diversity among them. To strike a balance between accuracy and diversity in an ensemble, Chen et al. [50] proposed a semi-supervised NCL (SemiNCL), in which a correlation penalty term on labelled and unlabelled data is incorporated into the cost function of each individual NN in the ensemble.
Though semi-supervised learning has been very successful for labelled and unlabelled data, its generalization ability is sensitive to incorrectly labelled data. To mitigate this limitation, Soares et al. [51] proposed a cluster-based boosting (CBoost) with cluster regularization. In CBoost, the base NNs in the ensemble jointly perform a cluster-based semi-supervised optimization. Extensive experimentation shows that CBoost has significantly better generalization ability than the other ensembles.
Recently, Rafiei and Adeli [52] reported a new neural dynamic classification algorithm. A comprehensive review of multiple classifier systems based on the dynamic selection of classifiers was reported by Britto et al. [53]. Recent developments in ensemble methods are analysed by Ren et al. [54]. Cruz et al. [55] reported a review on the recent advances on dynamic classifier selection techniques. Dynamic mechanism is used in the generalization phase in those studies, while the dynamic mechanism is employed in the training phase in DEL.

Main steps of the algorithm
Unlike fixed ensemble architecture, DEL automatically determines the number of base learner NNs and their architectures in an ensemble during the training phase. The DEL algorithm is presented in 8 steps in the sequel. The flow diagram of the DEL algorithm is shown in Fig. 2.
Step 1 Create an ensemble with minimum architecture comprising two NNs. Each NN consists of an input layer, two hidden layers, and an output layer. The number of neurons in the input and output layers is determined by the system. Next, apply a constructive algorithm [56] based on Ash's [57] dynamic node creation method for the first (later on the odd number of NNs in sequence in the ensemble) NN training. Initially, this NN starts with a small architecture containing one node in each hidden layer. For the second (later on even number of NNs in sequence in the ensemble) NN training, apply Reed's pruning algorithm [58]. In the pruning phase of NN training, the number of neurons in the hidden layer is larger than necessary (i.e. it starts with a bulky architecture). Initialize the connection weights for each NN randomly within a small interval.
Step 3 Train the NNs in the ensemble partially on the examples for a fixed number of epochs specified by the user using NCL [34,41], regardless of whether the NNs converge or not [59].
Step 4 Compute the training error $E_i$ for the ith NN in the ensemble according to the rule in Eq. (1), where $O_{\max}$ and $O_{\min}$ are the maximum and minimum values of the target outputs, respectively, N is the total number of examples, S is the number of output neurons, d(n, s) is the desired output, and $F_i(n, s)$ is the actual output of neuron s for the nth training example. The rule in Eq. (1) is a combination of the rule proposed by Reed [58] and NCL for an NN error. The error $E_i$ is independent of the size of the training set and the number of output neurons.
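Eq. (1) itself is not reproduced in this text, so the following is only a sketch of one error measure that has the independence property stated in Step 4. It assumes a squared-error percentage normalized by the number of examples N, the number of outputs S, and the target output range; it omits any NCL term and should not be read as the authors' exact rule. The function and argument names are hypothetical.

```python
def squared_error_percentage(desired, actual, o_min, o_max):
    """Normalized squared-error percentage (assumed form, not Eq. (1) itself).

    desired, actual -- lists of rows, one row of S output values per example
    o_min, o_max    -- minimum and maximum values of the target outputs
    """
    n_examples = len(desired)
    n_outputs = len(desired[0])
    total = sum((d - f) ** 2
                for d_row, f_row in zip(desired, actual)
                for d, f in zip(d_row, f_row))
    # Dividing by N * S makes the value independent of dataset size
    # and of the number of output neurons.
    return 100.0 * (o_max - o_min) / (n_examples * n_outputs) * total


if __name__ == "__main__":
    targets = [[1.0, 0.0], [0.0, 1.0]]
    outputs = [[1.0, 0.0], [0.0, 0.0]]
    print(squared_error_percentage(targets, outputs, 0.0, 1.0))
```

Repeating the same examples leaves the value unchanged, which is the sense in which such a measure does not depend on the size of the training set.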
Step 5 … acceptable, then either the ensemble architecture or the individual base learner NNs undergo change.
Step 6 Check the neuron addition and/or deletion criterion of individual NNs. In this criterion, hidden neurons are added or deleted if the error of individual NNs does not change after a specified number of epochs chosen by the user (see Sect. 3.2). If the criterion is not met, then the individual NNs are not good enough and the ensemble undergoes addition of new learner NN.
Step 7 Add and/or delete hidden neurons to/from the NNs to meet the addition and/or deletion criterion (see Sect. 3.2) and continue training using NCL.
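The check in Step 6 relies on detecting that the error of an individual NN has not changed after a user-specified number of epochs. A minimal sketch of such a check is given below. The function name `error_has_stalled`, the `window` argument and the `tolerance` threshold are hypothetical; the text does not say how "does not change" is quantified, so the tolerance is an assumption.

```python
def error_has_stalled(error_history, window, tolerance=1e-4):
    """Return True if the error changed by less than `tolerance`
    over the last `window` epochs (illustrative criterion only).

    error_history -- training error of one NN, one value per epoch
    window        -- user-specified number of epochs to look back
    """
    if len(error_history) <= window:
        return False  # not enough epochs observed yet
    # Compare the current error with the error `window` epochs earlier.
    return abs(error_history[-1 - window] - error_history[-1]) < tolerance


if __name__ == "__main__":
    print(error_has_stalled([0.5, 0.4, 0.3, 0.3, 0.3], window=2))
```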

Nodes addition/deletion to/from individual NNs
Both constructive and pruning algorithms provide some benefits as well as some drawbacks. During the training of individual NNs, there may be phases that are critical or stable for either the constructive or the pruning algorithm. If all the NNs in the ensemble learn only by the constructive algorithm or only by the pruning algorithm, then their learning will be very similar. Even though NCL forces the NNs to learn from different regions of the data space, the learning will not be perfect if the NNs in the ensemble have the same architecture. Different architectures of the NNs in the ensemble will place different weights on accuracy and diversity, which justifies the deployment of the hybrid constructive-pruning strategy in DEL.

NN addition to the ensemble
In DEL, a constructive algorithm is used to add NNs to the ensemble. A new NN is added to the ensemble if the previous addition improved the performance of the ensemble. This addition process continues until the minimum ensemble error criterion has been met.
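The constructive growth of the ensemble can be sketched as a simple loop. This is an illustration under stated assumptions rather than the DEL procedure itself: `evaluate_error` is a hypothetical callback that returns the ensemble error for a given number of NNs, and `max_size` and `target_error` are hypothetical stopping parameters standing in for the minimum ensemble error criterion.

```python
def grow_ensemble(evaluate_error, initial_size=2, max_size=10,
                  target_error=0.0):
    """Add NNs while each addition lowers the ensemble error (sketch).

    evaluate_error -- callable mapping ensemble size to ensemble error
    Returns the final size and its error.
    """
    size = initial_size
    best_error = evaluate_error(size)
    while size < max_size and best_error > target_error:
        candidate_error = evaluate_error(size + 1)
        if candidate_error >= best_error:
            break  # the latest addition did not help; keep the previous size
        size += 1
        best_error = candidate_error
    return size, best_error


if __name__ == "__main__":
    # Made-up error values, used only to exercise the loop.
    errors = {2: 0.30, 3: 0.25, 4: 0.20, 5: 0.22}
    print(grow_ensemble(errors.__getitem__))
```

The default `initial_size=2` mirrors the minimum architecture of two NNs described in Step 1.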

Different training sets for individual NNs
Step 6: Check node addition/deletion criterion
If (addition/deletion criterion is not met)
    Add NN to ensemble
Else
    Add/delete hidden nodes to NN
    Go to Step 3
Endif

Experimental analysis
The effectiveness and performance of DEL are verified on real-world benchmark problems. The datasets of the selected benchmark problems are taken from the UCI machine learning repository [60].
Different tests were carried out on the DEL algorithm with varying parameter settings. When the correlation strength parameter k is nonzero, DEL performs as described in Sect. 3. When k is equal to zero, the individual NNs are trained independently, using the standard backpropagation algorithm [30].
Table 1 shows the summary of the benchmark datasets.

Medical datasets
The medical datasets comprise four datasets from the medical domain: the cancer, the diabetes, the heart disease, and the thyroid datasets. These datasets have some characteristics in common:
• DEL uses input attributes similar to those an expert uses for diagnosis.
• The datasets pose a classification problem, in which DEL has to assign examples to a number of classes or predict a set of quantities.
• Acquisition of examples from human subjects is expensive, which results in small datasets for training.
• Very often the datasets have missing attribute values and contain a small sample of noisy data [59], which makes the classification or prediction challenging.

The breast cancer dataset
… patients are normal, which implies that the classifier accuracy must be significantly higher than 92%.

Non-medical datasets
The non-medical datasets comprise seven datasets from other domains: the credit card, glass, gene, horse, letter, mushroom, and soybean datasets.

Experimental setup
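Datasets are divided into training and testing sets, and no validation set is used in the experimentation. The classification error rate is calculated according to Eq. (2) as the percentage of wrong classifications in the test set. The benchmark datasets are summarized in Table 1.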
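The classification error rate, understood as the percentage of wrong classifications in the test set, can be computed as in the following sketch. The function name and the list-based inputs are illustrative choices and are not taken from the paper.

```python
def classification_error_rate(predicted, actual):
    """Percentage of test examples whose predicted class is wrong."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)


if __name__ == "__main__":
    # Arithmetic illustration: 14 wrong answers out of 53 examples.
    labels = [0] * 53
    guesses = [1] * 14 + [0] * 39
    print(round(classification_error_rate(guesses, labels), 3))
```

As a purely arithmetic illustration, 14 misclassified examples out of 53 give an error rate of about 26.415%.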

Experimental results
A summary of the experimental results of the DEL algorithm on the 11 datasets described in Sects. 4.1 and 4.2 is presented in Table 2. The classification error is defined as the percentage of wrong classifications in the test set, as given by Eq. 2. Table 3 compares DEL with its component individual networks in terms of classification error rates for the glass dataset. The error rates for the glass dataset are relatively higher than those for the other datasets. This is due to the error rates of the individual NNs, which led to the higher error rate of the ensemble. Table 4a shows the accuracy of the NNs and their common intersection, and Table 4b shows the diversity of the NNs of the ensemble for the glass dataset. The accuracy $\Omega$ refers to the correct response sets of the individual NNs, whereas the diversity $\varsigma$ refers to the number of different examples correctly classified by individual NNs. If $S_i$ is the correct response set of the ith NN in the testing set, $\Omega_i$ is the size of $S_i$, and $\Omega_{i_1, i_2, \ldots, i_k}$ is the size of the set $S_{i_1} \cap S_{i_2} \cap \cdots \cap S_{i_k}$, then the diversity $\varsigma$ of the ensemble is $\Omega_i = \Omega_{i_1, i_2, \ldots, i_k}$. For the glass dataset, DEL produced an ensemble of four NNs.

Table 3 Classification error rates of the DEL ensemble and its component NNs for the glass dataset
Network | Architecture | Classification error (%)
Ensemble | (9-10-7-6), (9-9-7-6), (9-11-6), (9-9-6) | 26.415
NN1 | 9-10-7-6 | 30.189
NN2 | 9-9-7-6 | 37.736
NN3 | 9-11-6 | 32.075
NN4 | 9-9-6 | 32.075

It is demonstrated here that DEL uses a smaller number of training cycles to find the dynamic ensemble architecture with a small classification error. For example, for the glass dataset, DEL with dynamic architecture produces a final ensemble with only four individual networks. Only five hidden nodes were added to the individual networks trained with the constructive algorithm, and two hidden nodes were deleted from the individual networks trained with the pruning algorithm. DEL achieved a classification error of 26.415% for this dataset. According to the comparison with other algorithms shown in Table 5, DEL achieves the lowest percentage of classification error.
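The correct response sets, their sizes and the size of their common intersection can be computed with ordinary set operations, as sketched below. The function names and the toy predictions are hypothetical and serve only to show the bookkeeping; the sketch stops short of a single diversity figure, because the defining expression is not fully recoverable from this text.

```python
def correct_response_sets(predictions, labels):
    """Return, for each NN, the set of test indices it classifies correctly.

    predictions -- one list of predicted classes per NN
    labels      -- true classes of the test examples
    """
    return [{n for n, (p, y) in enumerate(zip(preds, labels)) if p == y}
            for preds in predictions]


def sizes_and_common_intersection(predictions, labels):
    """Sizes of the correct response sets and of their common intersection."""
    sets = correct_response_sets(predictions, labels)
    sizes = [len(s) for s in sets]
    common = set.intersection(*sets) if sets else set()
    return sizes, len(common)


if __name__ == "__main__":
    labels = [0, 1, 1, 0, 1]
    predictions = [
        [0, 1, 0, 0, 1],  # NN1 is wrong on example 2
        [0, 0, 1, 0, 1],  # NN2 is wrong on example 1
        [1, 1, 1, 0, 1],  # NN3 is wrong on example 0
    ]
    print(sizes_and_common_intersection(predictions, labels))
```

In this toy case each NN is correct on four examples, but only two examples are answered correctly by all three, so the NNs cover different parts of the test set.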
To demonstrate how a hidden neuron's output changes during the entire training period, the hidden neurons' outputs for the cancer dataset are shown in Fig. 3. The constructive algorithm was used for training one network. The individual network started the training with one node in its first hidden layer and two nodes in its second hidden layer. During the training period, four nodes were added to the first hidden layer of the network, and the second hidden layer was kept fixed at two nodes. The outputs stabilize and the convergence curve becomes smooth after about 100 iterations, indicating that the learning may not require a very large number of iterations.
Figure 4 shows the training error profile of the ensemble for the cancer, heart disease, glass, and soybean datasets; two medical and two non-medical datasets were chosen. During the intermediate period of the training, individual networks were added to the ensemble by the constructive strategy, and hidden nodes were added to as well as deleted from the corresponding individual networks using the hybrid constructive-pruning strategy. For example, for the cancer dataset in Fig. 4, the ensemble started with two individual networks with architectures (9-4-2-2) and (9-12-2-2). The NN architecture (9-4-2-2) has 9 inputs, two hidden layers with 4 and 2 neurons, respectively, and 2 outputs. The NN architecture (9-12-2-2) has 9 inputs, two hidden layers with 12 and 2 neurons, respectively, and 2 outputs. The constructive algorithm was applied to individual network (9-4-2-2) and the pruning algorithm to individual network (9-12-2-2) during training. During the training, individual NNs with architectures (9-4-2) and (9-12-2) were added to the ensemble. Hidden nodes were added to individual networks (9-4-2-2) and (9-4-2), as constructive algorithms were used to train them. Hidden nodes were deleted from individual networks (9-12-2-2) and (9-12-2), as these two were trained using the pruning algorithm.
After the addition of individual networks and hidden nodes by the constructive strategy and the deletion of hidden nodes by the pruning strategy, the final ensemble with individual NN architectures of (9-8-2-2), (9-10-2-2), (9-8-2), and (9-10-2) was attained.
Figure 5a, b shows the training error profiles of individual NNs with the constructive algorithm. For example, Fig. 5a shows the curves of the individual networks to which constructive algorithms were applied, starting with architectures (9-4-2-2) (indicated by a solid line) and (9-4-2) (indicated by a dashed line) for the cancer dataset. In the intermediate period of training, hidden nodes were added to individual networks by the dynamic node creation (DNC) method as long as node addition increased the performance of the ensemble. Finally, these constructive networks in the ensemble completed training with (9-8-2-2) and (9-8-2) architectures. In Figs. 5, 6, 7 and 8, solid lines indicate NNs with two hidden layers, and dashed lines indicate NNs with a single hidden layer.
Figure 6a, b shows the training error profiles of individual NNs with the pruning algorithm. The pruning algorithm has an impact on the error profiles, which is visible from the non-smooth curves. Figure 6a shows the curves of individual NNs applied to the cancer dataset starting with (9-12-2-2) and (9-12-2) architectures. In the intermediate training period, hidden nodes were deleted from individual networks by the sensitivity calculation method as long as node deletion increased the performance of the ensemble. Finally, these pruning networks in the ensemble ended up training with (9-10-2-2) and (9-10-2) architectures.
Figure 7a, b shows the curves of hidden node addition to the individual NNs trained with the constructive algorithm. In this case, individual networks with a small architecture started training, and in the intermediate training period hidden nodes were added to the first hidden layer of the individual network by the dynamic node creation method. For example, Fig. 7a shows the curves of hidden node addition to individual networks trained using the constructive algorithm for the cancer dataset. Here, the individual networks started training with (9-4-2-2) and (9-4-2) architectures and finally ended up training with (9-8-2-2) and (9-8-2) architectures.
Figure 8a, b shows the curves of hidden node deletion from the individual NNs trained with the pruning algorithm. In this case, individual networks with an architecture larger than necessary started training, and in the intermediate training period hidden nodes deemed unnecessary were deleted from the first hidden layer of the individual network by the sensitivity calculation method. The hidden node with the lowest sensitivity was deleted. If the deleted node does not possess the lowest sensitivity, … Individual networks here started training with (9-12-2-2) and (9-12-2) architectures and finally completed training with (9-10-2-2) and (9-10-2) architectures.
Figure 9a, b shows the curves of individual network addition during the training period. Individual networks were added to the ensemble by the constructive strategy. Initially, the number of NNs in the ensemble was two. When an addition increased the performance of the ensemble, the number was increased. For example, Fig. 9a shows the curve of individual network addition to the ensemble for the cancer dataset. The curve shows that the ensemble completed training with four networks.
Table 5 Test set error rates for the datasets: comparison of DEL with results of (1) a single NN classifier (Stan); (2) an ensemble created by varying random initial weights (Simp); (3) an ensemble created by Bagging method; (4) an ensemble created by Arcing method; (5) an ensemble created by Ada method [63]; (6) a semi-supervised ensemble learning algorithm, i.e. SemiNCL [50]; and (7) …

Correlations among the individual NNs
In Tables 6, 7, 8, and 9, $Cor_{ij}$ denotes the correlation between individual networks i and j in the ensemble.
Tables 6 and 7 differ in the correlation strength parameter k. In Table 6, k = 0.2, so that the correlation between any two networks is positive. In Table 7, k = 1.0, so that the correlation between any two networks is negative in almost all cases.
Tables 8 and 9 differ in the same way. In Table 8, k = 0.2, so that the correlation between any two networks is positive. In Table 9, k = 1.0, which results in negative correlation between any two networks in many cases.
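The text does not state how the correlation between two networks is computed. One common choice, assumed here purely for illustration, is the Pearson correlation between the outputs of the two networks over the same set of examples. The function name and inputs are hypothetical.

```python
import math


def pearson_correlation(outputs_i, outputs_j):
    """Pearson correlation between two NNs' outputs on the same examples.

    Assumption: this is one plausible reading of the correlation between
    networks i and j; the paper may define it differently.
    """
    n = len(outputs_i)
    if n != len(outputs_j) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_i = sum(outputs_i) / n
    mean_j = sum(outputs_j) / n
    cov = sum((a - mean_i) * (b - mean_j)
              for a, b in zip(outputs_i, outputs_j))
    var_i = sum((a - mean_i) ** 2 for a in outputs_i)
    var_j = sum((b - mean_j) ** 2 for b in outputs_j)
    if var_i == 0 or var_j == 0:
        raise ValueError("correlation is undefined for constant outputs")
    return cov / math.sqrt(var_i * var_j)


if __name__ == "__main__":
    print(pearson_correlation([0.1, 0.5, 0.9], [0.8, 0.5, 0.2]))
```

A negative value indicates that the two networks tend to deviate in opposite directions, which is the behaviour the text attributes to training with a larger correlation strength parameter.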

Comparison
To verify the performance of the DEL algorithm, the results are compared with the popular empirical study of ensemble networks by Opitz and Maclin [62], a semi-supervised ensemble learning algorithm, i.e. SemiNCL by Chen et al. [50], and a fully semi-supervised ensemble approach to multiclass semi-supervised classification in two versions, i.e. CBoost-Sup and CBoost-Semi by Soares et al. [51]. Opitz and Maclin studied a number of networks, such as a simple NN, an ensemble with varying initial weights, a Bagging ensemble, and Boosting ensembles. They used resampling based on the Arcing and Ada methods. With a confidence level of 95%, an ensemble method can perform better than a single-component classifier [34]. Opitz and Maclin did not use the thyroid, gene, horse, and mushroom datasets in their experiments; therefore, these results are not available for comparison and are marked as '-' in the table. Chen et al. [50] and Soares et al. [51] both presented test errors as mean ± standard deviation (%) with 5%, 10%, and 20% of labelled data. They did not use the cancer, diabetes, heart, thyroid, gene, letter, mushroom and soybean datasets in their experiments; therefore, these results are not available for comparison and are marked as '-' in the table.

Discussions
Most of the existing ensemble learning methods use a trial-and-error method to determine the number and architecture of NNs in the ensemble. Most of them use a two-stage design process for designing an ensemble. In the first stage, individual NNs are created, and in the second stage these NNs are combined. In the ensemble, the number of NNs and the number of hidden neurons in the individual networks are predefined and fixed. These existing methods use two cost functions for designing the ensemble: one for accuracy and another for diversity. In most of the existing ensemble methods, individual NNs are trained independently or sequentially rather than simultaneously, which leads to a loss of interaction among the NNs in the ensemble. In such training, the previously trained network is not affected.
In DEL, we presented a dynamic approach to determine the topology of an ensemble. This dynamic approach determines the number and architecture of the individual NNs in the ensemble. Such a dynamic approach is entirely new to designing NN ensembles. In DEL, better diversity among the NNs has also been maintained. A constructive strategy has been used for automatic determination of the number of NNs, and a constructive-pruning strategy has been used for automatic determination of the architecture of the NNs in the ensemble. The hybrid constructive-pruning strategy has provided better diversity for the whole ensemble (Table 4b). NCL has been used for diversity of the NNs in the ensemble, encouraging individual networks to learn different regions and aspects of the data space. However, if different NNs attempt to learn different regions with an inaccurate architecture, the learning will be insufficient or improper. Different training sets are created for the individual networks, which also helps to maintain diversity among the NNs in the ensemble (Table 4b). In some cases, different training sets were …
DEL uses a minimum number of parameters, i.e. only one correlation strength parameter k. An incremental training approach has been used in DEL because, even after choosing the appropriate architecture of the ensemble, DEL has to be trained several times to find the correct values of the learning rate parameter and the correlation strength parameter k. DEL uses only one cost function (the ensemble error E) during training, rather than the two cost functions, one for accuracy and one for diversity, used in some other ensemble methods in the literature. DEL uses a one-stage design process: individual networks are created and combined at the same design stage. An advantage of DEL is that it does not need any separate gating block. DEL uses the parameter k as a balancing mechanism for the bias-variance-covariance tradeoff. Since DEL generates uncorrelated networks in the ensemble, the individual networks in this ensemble are well diversified.
The DEL algorithm uses both simple averaging and majority voting as combination methods. For some problems the simple averaging method performed better, and for other problems the majority voting method performed better. Although it is problem dependent, the choice of the correlation strength parameter k is important in DEL. To delete hidden nodes from individual networks in an ensemble, a network larger than necessary is considered initially. However, assessing the initial size of the NN is challenging, and it remains an unknown parameter in the DEL algorithm.
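The two combination methods can give different answers on the same example, which is one reason their relative performance is problem dependent. The sketch below assumes that each NN produces one score per class and that its individual decision is the class with the highest score; the function names and the example scores are hypothetical.

```python
from collections import Counter


def simple_average(outputs):
    """Class with the highest mean score across all NNs."""
    n_classes = len(outputs[0])
    means = [sum(o[c] for o in outputs) / len(outputs)
             for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: means[c])


def majority_vote(outputs):
    """Class chosen by the largest number of NNs.

    Ties are resolved in favour of the class that was voted for first.
    """
    votes = Counter(max(range(len(o)), key=lambda c: o[c]) for o in outputs)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # One confident NN for class 0, two mildly confident NNs for class 1.
    scores = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
    print(simple_average(scores), majority_vote(scores))
```

With these made-up scores, averaging selects class 0 because one NN is very confident, whereas voting selects class 1 because two of the three NNs prefer it.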

Conclusions
DEL is a new algorithm for designing and training NN ensembles. The traditional way of designing ensembles is still a manual trial-and-error process, whereas DEL is an automatic design approach. The number of NNs and their architectures are determined by the DEL algorithm.
The major benefits of the proposed DEL algorithm compared to existing ensemble algorithms are (1) automatic creation of ensemble architectures; (2) preservation of accuracy and diversity among the NNs in the ensemble; and (3) minimum number of parameters specified by designer.
DEL emphasizes both accuracy and diversity of NNs in ensemble to improve the performance. Constructive and constructive-pruning strategies are used in DEL to achieve the accuracy of individual NNs. To maintain diversity of NNs, NCL and different training sets are used. The performance of DEL algorithm was confirmed on benchmark problems. In almost all cases, DEL outperformed the others. However, the performance of DEL needs to be evaluated further on some regression and time series problems.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.