Introduction

Artificial Neural Networks (ANNs) have been widely used in scientific problems and have attracted many researchers as a popular tool for pattern classification, regression, and recognition because of their nonlinearity. The most challenging issue in ANN models is selecting appropriate weights, the number of layers, and the number of nodes in each layer. The number of layers and nodes determines the complexity of the network, and a larger network makes the training process more difficult. Therefore, a suitable ANN model must be selected: it should be neither so small that it lacks the capacity to characterize the real state, nor so large that training becomes complex and the network fits noise in the training data, losing its generalization capability1,2,3,4.

Feature selection (FS) provides a way to reduce a large set of available features to a smaller subset that achieves better classification performance than using all features, by removing or reducing irrelevant and redundant features5,6. In classification problems, a dataset usually includes a large number of features; irrelevant and redundant features contribute nothing to classification and may degrade its performance because of the large search space. FS strategies can generally be divided into three categories: filter, wrapper, and hybrid techniques. Filter techniques operate on the data itself, independently of any classifier, using designated methods such as Principal Component Analysis (PCA). Wrapper techniques, on the other hand, are beneficial in finding feature subsets that satisfy a predetermined classifier and are therefore broadly examined for classification accuracy. However, wrapper approaches are expensive and can break down with a very large number of features, because a learning algorithm must evaluate every candidate feature subset. Hybrid techniques try to combine the merits of filter and wrapper techniques by exploiting their complementary strengths7,8. FS plays a very important role in areas such as pattern classification, multimedia information retrieval, data mining, and machine learning, where it can improve the classification accuracy rate and reduce the time required for training. Pattern classification itself assigns a given input feature vector to one of a pre-defined set of classes9,10.

Meta-heuristics have proven very dependable over the last two decades for solving diverse optimization problems, including the challenging problem of searching for an optimal subset of the original feature set. New FS approaches have been built on evolutionary optimization techniques because they can find optimal solutions faster. Moreover, with an effective fitness function, high-dimensional data can be handled with a limited number of training samples. Whereas a complete search generates all possible solutions to the problem, meta-heuristics offer outstanding performance compared to conventional search techniques11,12,13.

Background

Related works

In recent years, several meta-heuristics have been utilized by many researchers in the field of optimization to search the feature-subset space for an optimal feature set. A meta-heuristic strategy may determine a satisfactory solution in a reasonable time, although it does not guarantee finding the best solution in every run. These algorithms, whether original, modified, or hybrid, have shown superior performance in solving many practical problems14,15,16,17,18,19,20. In21, Menghour and Meslati introduced a hybrid feature selection algorithm based on Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO) and applied it to six well-known datasets. A combined Artificial Bee Colony (ABC) and DE method for feature selection in classification tasks was proposed in22 and evaluated on fifteen public datasets. In23, a new hybrid ACO-ABC algorithm was validated on thirteen datasets, in which the ants determine the best ant and best feature subset by exploiting the bees' feature subsets as their food sources. In addition, a feature selection approach based on two different implementations of a multi-objective ABC algorithm, combined with a non-dominated sorting procedure and genetic operators, was examined on twelve benchmark datasets24. In25, a method called PSO-DFS, which uses bare-bones particle swarm optimization (BBPSO) for discretization and feature selection in a single stage, was proposed and applied to ten high-dimensional datasets. In26, a comprehensive study of Genetic Programming (GP) for feature construction and selection on high-dimensional classification problems was presented and tested on seven high-dimensional gene expression problems. Chen et al. proposed two novel Bacterial Foraging Optimization (BFO) algorithms, named Adaptive Chemotaxis Bacterial Foraging Optimization (ACBFO) and Improved Swarming and Elimination-Dispersal Bacterial Foraging Optimization (ISEDBFO), which create a mapping between the bacterium and the feature subset and evaluate the importance of features; the method addresses feature selection problems and was tested on ten public UCI datasets27. Majdi et al. proposed the Grasshopper Optimization Algorithm (GOA) as a search strategy for a wrapper-based feature selection method, in four different variants designed to mitigate the premature convergence and stagnation drawbacks of the conventional GOA; these approaches were benchmarked on twenty-two public UCI datasets28.

In29, a chaotic antlion multi-objective wrapper-based feature selection method (CALO) using different chaotic maps was proposed and tested on eighteen datasets to balance exploration and exploitation in the search space. The method converged to the optimal solution better than the PSO and GA methods and was more effective than the original ALO method.

Li et al. introduced a multi-objective ranking binary artificial bee colony method for gene selection on eight microarray datasets. They first used the Fisher-Markov selector to rank and choose the features used as inputs to the binary ranking artificial bee colony, which then selected the gene subset. This method achieved the best performance compared to other methods with different classifiers, and it also selected a smaller number of features30.

Despite the advantages of the above-mentioned heuristic algorithms for feature selection in classification problems, one may ask whether yet another heuristic algorithm is needed. The No-Free-Lunch (NFL) theorem states that no single optimizer can solve all optimization problems31. Thus, no single heuristic feature selection method can solve all classification/feature selection problems, and there is always room to improve current methods for new problems. This is our motivation for proposing another optimization algorithm for feature selection in classification.

Cat Swarm Optimization (CSO)

The Cat Swarm Optimization algorithm was proposed in 2007 by Chu and Tsai32. The CSO algorithm has two modes: seeking mode and tracing mode. At the start, the number of cats is specified and the cats are distributed randomly in the M-dimensional search space. The cats are then applied to solve the problem, where every cat has a position and a velocity for each dimension, a fitness value, and a flag determining whether it is in seeking or tracing mode. The cat with the best solution found so far keeps the best position, and at the end of the iterations the best solution is returned33.

Whale Optimization Algorithm (WOA)

The Whale Optimization Algorithm was presented by Mirjalili and Lewis in 201634. The algorithm comprises two main stages: encircling the prey and spiral position updating in the first stage (exploitation), and a random search for prey in the second stage (exploration). Initially, the whales are assigned random solutions, and the minimum or maximum value of the objective function is taken as the current best value, depending on the problem being solved. The objective function of every search agent is then evaluated, and every search agent updates its position with respect to either the best solution or a randomly chosen search agent at each iteration.

Sine Cosine Algorithm (SCA)

The Sine Cosine Algorithm was proposed in 2015 by Mirjalili35. SCA starts with a set of random solutions. The objective function evaluates this set repeatedly, and a set of update rules, which is the core of the method, is used to improve it. The algorithm consists of two phases: in the exploration phase, the random solutions in the set are perturbed abruptly to locate promising regions of the search space; in the exploitation phase, the solutions are changed gradually and the random variations are much smaller than in the first phase.

Differential Evolution Algorithm (DE)

The Differential Evolution (DE) algorithm was introduced by Storn and Price in 199636. It is one of the most popular evolutionary algorithms for solving global optimization problems, which are essential in fields such as engineering, statistics, and finance. DE is a stochastic, population-based optimization algorithm developed to optimize real-valued functions of real parameters. A population of candidate solutions for the optimization problem is randomly initialized. In each generation of the evolutionary process, new individuals are created by mutation and crossover: the target individual is recombined with a mutant individual to create a trial individual, which incorporates successful solutions from the previous generation. The target individual is then compared with the trial individual, and the one with the lower objective value is admitted to the next generation. Mutation, recombination, and selection continue until a stopping criterion is reached37,38.

Mutation is performed by computing the vector difference between two randomly selected individuals of the same population. The mutant individual \(V_{i,g}\) is generated by adding the weighted difference of two vectors, \(F\left( X_{r_{2},g} - X_{r_{3},g} \right)\), to the base vector \(X_{r_{1},g}\) to perturb it:

$$ V_{i,g} = X_{r_{1},g} + F\left( X_{r_{2},g} - X_{r_{3},g} \right) $$
(1)

where \(r_{1}\), \(r_{2}\), and \(r_{3}\) are distinct indices selected randomly from [1, N], N is the number of individuals in the population, g is the current generation, and F is a constant mutation factor in [0, 2].

The trial vector \(U_{i,g}\) is constructed in the recombination step from elements of the target vector \(X_{i,g}\) and elements of the mutant vector \(V_{i,g}\). The crossover factor \(CR\) gives the probability that an element of the mutant vector enters the trial vector. The trial vector is constructed as:

$$ U_{i,j,g} = \begin{cases} V_{i,j,g} & \text{if } rand_{i,j} \le CR \text{ or } j = j_{rand} \\ X_{i,j,g} & \text{if } rand_{i,j} > CR \text{ and } j \ne j_{rand} \end{cases} $$
(2)

where i = 1, 2, …, N, with N the population size; j = 1, 2, …, D, with D the dimension of a single vector; \(rand_{i,j}\) is a random number in the range [0, 1]; and \(j_{rand}\) is a random integer from {1, 2, …, D}.

Finally, the trial vector \(U_{i,g}\) is compared with the target vector \(X_{i,g}\), and the one with the lower function value becomes a member of the next generation g + 1 according to the selection rule:

$$ X_{i,g+1} = \begin{cases} U_{i,g} & \text{if } f\left( U_{i,g} \right) < f\left( X_{i,g} \right) \\ X_{i,g} & \text{otherwise} \end{cases} $$
(3)
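
To make Eqs. (1)-(3) concrete, the following Python sketch runs one DE/rand/1/bin generation on a toy sphere function; the objective, bounds, and the F and CR settings are illustrative assumptions, not the configuration used later in this paper.

```python
import numpy as np

def de_generation(pop, fitness, objective, F=0.5, CR=0.9, rng=None):
    """One generation of classic DE/rand/1/bin: mutation (Eq. 1),
    binomial crossover (Eq. 2), and one-to-one greedy selection (Eq. 3)."""
    rng = rng or np.random.default_rng()
    n, d = pop.shape
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(n):
        # three distinct random indices, all different from i
        r1, r2, r3 = rng.choice([k for k in range(n) if k != i], 3, replace=False)
        v = pop[r1] + F * (pop[r2] - pop[r3])          # mutant vector (Eq. 1)
        j_rand = rng.integers(d)
        mask = rng.random(d) <= CR
        mask[j_rand] = True                            # guarantee at least one mutant gene
        u = np.where(mask, v, pop[i])                  # trial vector (Eq. 2)
        fu = objective(u)
        if fu < fitness[i]:                            # greedy selection (Eq. 3)
            new_pop[i], new_fit[i] = u, fu
    return new_pop, new_fit

# usage on a toy sphere function
rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, size=(20, 10))
sphere = lambda x: float(np.sum(x ** 2))
fit = np.array([sphere(x) for x in pop])
for _ in range(50):
    pop, fit = de_generation(pop, fit, sphere, rng=rng)
print("best fitness after 50 generations:", fit.min())
```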

Recently, DE has emerged as a promising approach for several real-world challenges. Effectiveness, robustness, the ability to deal with complex high-dimensional optimization problems, and the need for few control parameters are some merits of the DE algorithm over other meta-heuristics. Furthermore, because the fitness of each offspring is compared one-to-one with the fitness of its parent, DE converges quickly. Although this greedy scheme helps find good solutions rapidly, it increases the risk of becoming trapped in local optima (suboptimal solutions) and of premature convergence. Another demerit of DE is that its control parameters need fine-tuning yet remain fixed throughout the optimization process.

To overcome these drawbacks, DE has been integrated with other optimization algorithms in hybridized forms that improve its performance. Nevertheless, DE remains effective on a wide range of classic optimization problems; it is one of the most popular heuristic algorithms for single-objective optimization and has been extended to multi-objective optimization problems39.

In this work, a novel optimization algorithm is proposed to optimize both the weights and the structure of an ANN simultaneously through a new solution representation. The proposed method consists of two main phases: structure optimization and weight update. In addition, the proposed MVDE algorithm, with multi-variant mutation and an adaptive scaling factor, is developed to choose the optimal number of features used for classification, which affects classification accuracy.

The proposed MVDE is designed to overcome the drawbacks of DE mentioned above. Five different mutation strategies, incorporated with two highly random scaling factors based on the cosine and logistic distributions, are used to maintain population diversity during the optimization process and thus prevent premature convergence. Moreover, the proposed adaptive crossover and adaptive selection reduce the need for tuning the control parameters.

In addition, to evaluate the performance of the proposed algorithm, the experimental results are compared with results from the literature and with five other optimization algorithms: DE, CSO32,33, WOA34, SCA35, and PSO21.

The rest of this paper is organized as follows: the methodology of the proposed approach is outlined in the "Methodology" section, the "Results and discussion" section introduces and analyses the experimental results, and conclusions are given in the "Conclusion" section.

Methodology

Proposed multi-variant differential evolution algorithm

The differential evolution (DE) algorithm36 is a promising tool for dealing with complicated high-dimensional problems with enhanced search qualities. DE is selected here as the search engine because it has meaningful merits over other meta-heuristic methods in terms of robustness, effectiveness, and fast convergence when searching large-scale problems. In contrast, premature convergence to local optima and fixed control parameters are known problems of DE37,38,39, so additional improvements are essential before using it on practical problems. In this regard, a new algorithm named Multi-Variant Differential Evolution (MVDE) is proposed as an almost parameter-free optimization method. The approach is designed mainly to enhance the global search ability of the original DE, i.e., to reduce the probability of trapping in local optima and to prevent premature convergence. MVDE uses five different mutation strategies integrated with two highly random scaling factors (based on the cosine and logistic distributions) to maintain population diversity during the optimization process and thereby prevent premature convergence. Additionally, the proposed adaptive crossover is integrated with adaptive selection to reduce the need for tuning the control parameters. The overall steps of the MVDE algorithm are shown in Fig. 1, and its main stages are described below.

Figure 1. Pseudo code of the MVDE method.

Initialization

The proposed MVDE is designed as a population-based meta-heuristic algorithm. It begins solving a d-dimensional optimization problem with initial solutions (\(x_{i,j}(0)\), the initial candidate population) produced randomly to cover the bounded search space as well as possible:

$$ x_{i,j} \left( 0 \right) = x_{j}^{min} + rand\left[ {0, 1} \right] \cdot \left( {x_{j}^{max} - x_{j}^{min} } \right)\quad \forall i \in \left\{ {1,2, \ldots ,n} \right\}, j \in \left\{ {1,2, \ldots ,d} \right\} $$
(4)

where \(n\) is the population size, \(x_{j}^{min}\) and \(x_{j}^{max}\) are the minimum and maximum limits of the \(j\)-th dimension, and \(rand[0,1]\) is a uniformly distributed random variable between 0 and 1.
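
As a minimal illustration of Eq. (4), the NumPy sketch below initializes a random population within given bounds; the bounds and population size are placeholder values.

```python
import numpy as np

def init_population(n, x_min, x_max, rng=None):
    """Random initial population spread over the bounded search space (Eq. 4)."""
    rng = rng or np.random.default_rng()
    x_min, x_max = np.asarray(x_min, float), np.asarray(x_max, float)
    return x_min + rng.random((n, x_min.size)) * (x_max - x_min)

pop = init_population(40, x_min=[-1] * 5, x_max=[1] * 5)
print(pop.shape)  # (40, 5)
```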

In subsequent generations, MVDE runs an iterative cycle using a single new operation, the multi-variant mutation-crossover process, to create a new population. This process is associated with two proposed operators: a self-adaptive scaling factor (incorporating two different distributions and the adaptive crossover operator) and adaptive parent selection. Before the mutation-crossover process itself, it is essential to describe these associated operators.

Self-adaptive scaling factor

The scaling factor \(F\) is an important control parameter with a substantial effect on the convergence speed. As the iterations progress, convergence can generally be accelerated by gradually reducing \(F\). In40, a self-adaptive scaling factor for DE was introduced based on an elitist learning-rate scheme that keeps track of the fittest vector (i.e., the best individual is copied into the next generation). Despite the widespread use of elitist schemes in GAs to speed up convergence, they may suffer from premature convergence. In41, a biologically inspired genetic strategy was used to derive another adaptive scaling factor by changing F exponentially, increasing or decreasing it between pre-specified initial and maximum values, together with two adjustable elements. However, this adaptive scaling factor introduces extra parameters that need to be chosen or adjusted.

In this work, a new adaptive scaling factor is proposed to enhance the adaptive and dynamic behavior of MVDE. Random values generated from two different probability distributions, the cosine and logistic distributions, are used to update the scale factor of each individual in each generation.

Proposed distributions

The cosine and logistic distributions are two parametric distributions with tails of different thickness that are completely specified by their parameters. The probability of selecting each random generator is set according to the progress of the optimization.

In the first generations, F is produced with high probability according to the cosine distribution, whereas toward the end of the optimization the probability of using the logistic distribution increases; both distributions are therefore used simultaneously. Each distribution (both symmetrical and bell-shaped) can generate small and large values, so the proposed scheme supports both exploration (with large values) and exploitation (with small values), thereby controlling the exploration-exploitation balance.

As shown in Fig. 2, the logistic distribution has a lighter tail than the cosine distribution. Applying the heavy-tailed cosine distribution at the beginning of the optimization therefore provides sufficient disturbance to spread the population over the wide search space. The cosine distribution is used instead of the normal distribution because it better models the heavier tail needed to preserve population diversity, which helps avoid premature convergence. The lighter-tailed logistic distribution is used at the end of the optimization to increase the exploitation ability. Hence, the proposed adaptation of the scale factor gives a meaningful advantage in handling the exploration-exploitation trade-off over the complete optimization process. The two types of distributions used by the proposed MVDE were introduced in41.

Figure 2. The cosine and logistic distributions.

The cosine distribution

The cosine (or half-cosine) distribution is able to deal with cases where the central tendency is not pronounced. It presents a low central tendency with a heavy tail, so samples are more likely to appear near the limits of the distribution than with, for example, the normal distribution, as shown in Fig. 2. In the first iterations, the scale factor is developed to increase population diversity and to ensure that the possible paths of the individuals are stretched/flattened over the search area (i.e., low central tendency); the cosine distribution can achieve this objective. In addition, this distribution is well behaved over its finite range and limits and is therefore more practical than discontinuous distributions such as quadratic or trigonometric ones. Hence, the cosine distribution is appropriate in cases that are restricted by finite limits and require low density around the center (i.e., samples scattered more evenly). The probability density function (PDF) of the cosine distribution is given by42,43:

$$ f\left( {y|a,b} \right) = \frac{1}{2b}\cos \left( {\frac{y - a}{b}} \right)\quad y^{min} \le y \le y^{max} $$
(5)

where \(y^{max}\) and \(y^{min}\) are the maximum and minimum values of the random variable, respectively, \(a = \left( y^{max} + y^{min} \right)/2\) is the mean (also the median, location, and mode), and \(b = \left( y^{max} - y^{min} \right)/\pi\) is the scale parameter. The two parameters of the distribution are chosen to suit the problem of interest; here, the output of the cosine distribution is limited to the range [−1, 1]. The variance of the proposed cosine distribution, calculated as \(\left( y^{max} - y^{min} \right)^{2} \left( \pi^{2} - 8 \right)/4\pi^{2}\), is about 0.19.

The logistic distribution

The logistic density curve, on the other hand, is symmetrical and bell-shaped, with a larger amplitude in the central area and lighter tails toward the limits compared with the cosine curve, as shown in Fig. 2. The probability density function (PDF) of the logistic distribution is given by44,45:

$$ f\left( {y|a,b} \right) = \frac{1}{b}e^{{\left( { - \frac{y - a}{b}} \right)}} \left( {1 + e^{{\left( { - \frac{y - a}{b}} \right)}} } \right)^{ - 2} \quad - \infty < y < \infty $$
(6)

This stage increases exploitation strength while preserving some exploration for problems with a large search space toward the end of the optimization. The location and scale parameters fully specify the logistic distribution. In this work, the scale parameter is set to 0.1 to achieve a lighter tail with more exploitation, while the location parameter is shifted away from zero by values drawn from a Gaussian distribution in the range [0, 1] to preserve population diversity and offer adequate exploration until the end of the optimization process. The variance of the proposed logistic distribution, calculated as \(b^{2}\pi^{2}/3\), is about 0.033, which is smaller than that of the cosine distribution.
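
The sketch below shows one possible way to draw such scale factors: the cosine samples are obtained by inverting the CDF of Eq. (5), the logistic samples use Eq. (6) with scale 0.1, and the linear rule for switching between the two generators over the generations is an illustrative assumption rather than the exact schedule of the released implementation.

```python
import numpy as np

def cosine_sample(rng, y_min=-1.0, y_max=1.0, size=None):
    """Draw from the cosine distribution of Eq. (5) via its inverse CDF:
    y = a + b*arcsin(2u - 1), with a = (y_max+y_min)/2 and b = (y_max-y_min)/pi."""
    a = 0.5 * (y_max + y_min)
    b = (y_max - y_min) / np.pi
    u = rng.random(size)
    return a + b * np.arcsin(2.0 * u - 1.0)

def logistic_sample(rng, scale=0.1, size=None):
    """Draw from the logistic distribution of Eq. (6) with scale 0.1 and a
    location shifted by a Gaussian value clipped to [0, 1] (illustrative)."""
    loc = np.clip(rng.normal(0.0, 0.3), 0.0, 1.0)
    return rng.logistic(loc, scale, size)

def scale_factor(rng, g, G, size):
    """Blend the two generators: early generations favour the heavy-tailed
    cosine samples (exploration), late ones the light-tailed logistic samples
    (exploitation). The linear switching probability is an assumption."""
    p_cosine = 1.0 - g / G
    use_cos = rng.random(size) < p_cosine
    return np.where(use_cos, cosine_sample(rng, size=size),
                    logistic_sample(rng, size=size))

rng = np.random.default_rng(1)
F = scale_factor(rng, g=10, G=100, size=(40, 5))  # mostly cosine-drawn early on
```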

Adaptive crossover operator integrated with scaling factor

A low crossover rate or a high scale factor produces large disturbances, which benefit population diversity but slow down convergence; the opposite settings make convergence rapid but also premature. Integrating these two operators can therefore be valuable30. Hence, the crossover probability is chosen to increase linearly over the iterations from near zero to 0.5, as follows:

$$ CR = g/\left( {2G} \right) $$
(7)

where \(g\) is the current generation (iteration/time) and \(G\) is the maximum number of generations. The active parts of the scaling factor are then chosen by producing a binary crossover mapping matrix A of size n × d, as follows:

$$ A_{ij} = Ind_{ij} \ge CR\quad \forall i \in \left\{ {1, 2, \ldots , n} \right\}, j \in \left\{ {1, 2, \ldots ,d} \right\} $$
(8)

where \(Ind_{ij}\) is a random number drawn from the uniform distribution in the range [0, 1] for the jth element of the ith individual. The continuous scaling factor described above is then masked by the (0-1) matrix A as follows:

$$ F_{ij} = F_{ij} \cdot A_{ij} \quad \forall i \in \left\{ {1, 2, \ldots , n} \right\}, j \in \left\{ {1, 2, \ldots ,d} \right\} $$
(9)

Observe that if \(A_{ij} = 1\) the scale factor keeps its value, otherwise it becomes zero. Therefore, as the number of iterations increases, the number of zero elements in the scaling factor matrix grows, increasing the exploitation over time.
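
A compact sketch of Eqs. (7)-(9), assuming the scale factor is stored as an n × d NumPy matrix; the names and toy values are illustrative.

```python
import numpy as np

def masked_scale_factor(F, g, G, rng):
    """Apply Eqs. (7)-(9): CR grows linearly from ~0 to 0.5 (Eq. 7), the binary
    map A keeps an element of F only when a uniform draw is >= CR (Eq. 8),
    and F is zeroed elsewhere (Eq. 9)."""
    CR = g / (2.0 * G)                  # Eq. (7)
    A = rng.random(F.shape) >= CR       # Eq. (8), boolean 0/1 map
    return F * A, CR                    # Eq. (9)

rng = np.random.default_rng(2)
F = rng.random((40, 5))
F_masked, CR = masked_scale_factor(F, g=80, G=100, rng=rng)
print(CR, (F_masked == 0).mean())       # at g = 80 of 100, CR = 0.4, ~40% zeroed
```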

Adaptive elite selection mechanism

The mutation schemes of the MVDE method need a suitable number of selected individuals to act as parents and lead the search. The pool of qualified parents, of size \(N_{p}\), is therefore formed from the top-ranked solutions, with its size governed by the adaptive crossover rate as follows:

$$ N_{p} = n \times \left( {1 - CR} \right) $$
(10)

Depending on the size required by the mutation strategy, the necessary number of parents is drawn from this pool of qualified parents. The MVDE algorithm is constructed to integrate and handle five mutation schemes.
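
A small sketch of Eq. (10), assuming a minimization problem so that lower fitness ranks higher; names are illustrative.

```python
import numpy as np

def elite_parent_pool(pop, fitness, CR):
    """Keep the top-ranked N_p = n * (1 - CR) individuals as candidate parents
    (Eq. 10); minimization is assumed, so smaller fitness values rank first."""
    n = pop.shape[0]
    n_p = max(1, int(round(n * (1.0 - CR))))
    order = np.argsort(fitness)          # best (smallest) first
    return pop[order[:n_p]]

parents = elite_parent_pool(np.random.rand(40, 5), np.random.rand(40), CR=0.3)
print(parents.shape)                      # (28, 5)
```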

Multi-variant mutation-crossover process

After initialization, the time horizon of the optimization process is divided evenly into five successive iterative sub-processes (phases/periods). In each phase, the population is modified according to one of the five mutation schemes. In the early phases, solutions are updated using the less greedy evolution variants, such as DE/rand/k, whose explorative capability covers the search area and identifies promising regions. The most promising region discovered by these less greedy variants then leads the search toward the end of the optimization, when more information is shared among the schemes through the greedy strategies, such as DE/best/k. The low disturbances of the greedy variants are essential for focusing on the best solution and driving the desired rapid convergence.

Following these considerations, the five chosen mutation schemes, incorporated with the updated scaling factor (which already includes the crossover), are applied through the mutation-crossover operation as follows.

Purely explorative stage

The differences between population individuals are the basis of the disturbance used in all evolution variants. The "DE/rand/2" variant is the well-known least greedy mutation scheme, giving top priority to exploration, and is applied in the first 20% of iterations as follows:

$$ x_{i,j} \left( {g + 1} \right) = x_{{r_{1} ,j}} \left( g \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{2} ,j}} \left( g \right) - x_{{r_{3} ,j}} \left( g \right)} \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{4} ,j}} \left( g \right) - x_{{r_{5} ,j}} \left( g \right)} \right) $$
(11)

where \(r_{1}\), \(r_{2}\), \(r_{3}\), \(r_{4}\), and \(r_{5}\) are five mutually exclusive random integers chosen from the range [1, \(N_{p}\)], all of which differ from the mutated index \(i\).

Two binary-masked scaling factors are produced independently in this variant, so each element \(x_{i,j}\) of a population member may be updated by either one of the difference terms in the previous equation, by both terms, or remain unchanged (the latter becoming more likely as the optimization progresses). Several of these update possibilities also appear in the following variants.

More explorative stage

The less greedy and explorative “DE/rand/1” variant is implemented in the next group of iterations as follows:

$$ x_{i,j} \left( {g + 1} \right) = x_{{r_{1} ,j}} \left( g \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{2} ,j}} \left( g \right) - x_{{r_{3} ,j}} \left( g \right)} \right) $$
(12)

Because there is only a single binary-masked scaling factor, this scheme offers only two update possibilities.

Balanced stage

The best current solution is integrated into the "DE/current-to-best/1" strategy to lead the search toward the global optimum with more rapid convergence (exploitative behavior). Differences between random solutions are used to balance this behavior and preserve robustness through explorative behavior. This variant is therefore applied in the middle portion of the iterations to attain a good balance between exploration and exploitation, as follows:

$$ x_{i,j} \left( {g + 1} \right) = x_{{r_{1} ,j}} \left( g \right) + F_{ij} \left( {g + 1} \right).\left( {x_{best,j} \left( g \right) - x_{i,j} \left( g \right)} \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{1} ,j}} \left( g \right) - x_{{r_{2} ,j}} \left( g \right)} \right) $$
(13)

where \(x_{best}\) is the best individual of the current generation \(g\). This scheme offers the same four update possibilities as "DE/rand/2".

More exploitative stage

The population in the next sub-period is handled by the greedy and exploitative "DE/best/2" variant, where the same four update possibilities are available:

$$ x_{i,j} \left( {g + 1} \right) = x_{best,j} \left( g \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{1} ,j}} \left( g \right) - x_{{r_{2} ,j}} \left( g \right)} \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{3} ,j}} \left( g \right) - x_{{r_{4} ,j}} \left( g \right)} \right) $$
(14)
Purely exploitative stage

Finally, the positions of the population are updated by the highly greedy and exploitative "DE/best/1" variant during the last sub-period, as follows:

$$ x_{i,j} \left( {g + 1} \right) = x_{best,j} \left( g \right) + F_{ij} \left( {g + 1} \right).\left( {x_{{r_{1} ,j}} \left( g \right) - x_{{r_{2} ,j}} \left( g \right)} \right) $$
(15)

Depending on the description and difficulty of the problem, the above mutation schemes can be redefined or reduced. At the end of each generation, the mutated individuals are compared with the previous population, and the best solutions are allowed to survive into the next generation. We are the original source and owners of this new algorithm; the original source code is available at (https://www.mathworks.com/matlabcentral/fileexchange/70997-mvde)46.
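
The sketch below illustrates how the five variants of Eqs. (11)-(15) could be scheduled over equal fifths of the iteration budget. It is a simplified illustration, not the released MATLAB code: the phase boundaries, the parent sampling, and the handling of the two element-wise masked scale factors F1 and F2 are assumptions consistent with the description above.

```python
import numpy as np

def mutate(pop, best, F1, F2, g, G, rng):
    """Multi-variant mutation: the iteration horizon is split into five equal
    phases that move from the least greedy variant (DE/rand/2, Eq. 11) to the
    greediest one (DE/best/1, Eq. 15). F1 and F2 are element-wise (possibly
    zero-masked) scale factor matrices with the same shape as pop."""
    n, d = pop.shape
    phase = min(4, int(5 * g / G))       # 0..4, one phase per fifth of the run
    new = np.empty_like(pop)
    for i in range(n):
        r = rng.choice([k for k in range(n) if k != i], 5, replace=False)
        x = [pop[k] for k in r]
        if phase == 0:    # DE/rand/2            (Eq. 11)
            v = x[0] + F1[i] * (x[1] - x[2]) + F2[i] * (x[3] - x[4])
        elif phase == 1:  # DE/rand/1            (Eq. 12)
            v = x[0] + F1[i] * (x[1] - x[2])
        elif phase == 2:  # DE/current-to-best/1 (Eq. 13)
            v = x[0] + F1[i] * (best - pop[i]) + F2[i] * (x[0] - x[1])
        elif phase == 3:  # DE/best/2            (Eq. 14)
            v = best + F1[i] * (x[0] - x[1]) + F2[i] * (x[2] - x[3])
        else:             # DE/best/1            (Eq. 15)
            v = best + F1[i] * (x[0] - x[1])
        new[i] = v
    return new

# usage with toy data
rng = np.random.default_rng(4)
pop = rng.uniform(-1, 1, (20, 6))
F1, F2 = rng.random((20, 6)), rng.random((20, 6))
offspring = mutate(pop, best=pop[0], F1=F1, F2=F2, g=30, G=100, rng=rng)
```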

ANN model

In this work, the weights and structure of the ANN are optimized by the MVDE algorithm, where each solution in the MVDE population holds both a weight solution and a structure solution. In this strategy, a specific fitness function that depends on the weights and structure of the ANN is the basis for evaluating the selected inputs, the weights, the number of hidden layers, and the number of nodes in each hidden layer47.

The training phase consists of two main stages: structure optimization and weight update. The network structure is optimized during the training process by selecting important hidden nodes that minimize the output error. The weights of the network are updated by maximizing the diversity of the output from hidden nodes. The weight update stage is proposed to enhance the performance of the structure optimization48,49.

  • Structure optimization: the goal of this stage is to find a compact topology with the minimum number of hidden nodes. This is achieved by choosing the important nodes of the network while discarding the remainder.

  • Weight update: after the most important hidden nodes have been selected to produce a compact network structure during structure optimization, the weight update stage adjusts the weights of these selected nodes.

A selected hidden node is then divided into two new hidden nodes, each having the same number of weight connections as its parent. The new weight connections are calculated as follows:

$$ w_{1} = \left( {1 + \theta } \right) \times w $$
(16)
$$ w_{2} = - \theta \times w $$
(17)

where \(w\) is the weight of the existing node, and \(w_{1}\) and \(w_{2}\) are the weights of the two produced nodes. To avoid a large change in the existing network functionality, \(\theta\) should be kept within a small range.

The search process can be accelerated by identifying a suitable number of hidden nodes in the ANN architecture.
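
A minimal sketch of the node-splitting rule of Eqs. (16)-(17); the value of θ and the example weights are illustrative.

```python
import numpy as np

def split_hidden_node(w, theta=0.05):
    """Split one hidden node into two children whose combined effect matches
    the parent: w1 = (1 + theta) * w and w2 = -theta * w (Eqs. 16-17).
    A small theta keeps the network function nearly unchanged."""
    w = np.asarray(w, float)
    return (1.0 + theta) * w, -theta * w

w1, w2 = split_hidden_node([0.3, -0.8, 0.1])
print(w1 + w2)   # equals the original weights, so the mapping is preserved
```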

Solution representation

In this work, two one-dimensional vectors are used for solution representation. One vector describes the structure solution and holds binary values of 0 and 1, while the other contains the weights and biases as real numbers in the range [−1, 1]. The structure vector has three sections: two of them encode the number of hidden layers and the number of nodes in each layer, occupying three cells each, which store these quantities in binary form47.

The third section of the structure vector is the feature selection section, added for classification because the input nodes and their number are essential; its dimension equals the number of features in the dataset. The ith feature of the complete feature set is included in the selected feature subset if the ith value of this section equals 1, and excluded if the cell holds 0.

The second vector contains the weights and biases as real values, and their number is calculated from the structure solution48.
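
The two-vector representation could be encoded as in the sketch below; the three-cell binary fields follow the description above, while the decoding into layer and node counts and the way the number of connections is derived are illustrative assumptions.

```python
import numpy as np

def random_solution(n_features, rng):
    """One candidate solution: a binary structure vector (3 bits for the number
    of hidden layers, 3 bits for the nodes per layer, plus one bit per input
    feature) and a real-valued weight/bias vector in [-1, 1] whose length is
    derived from the decoded structure. The decoding here is illustrative."""
    structure = rng.integers(0, 2, size=3 + 3 + n_features)
    n_layers = max(1, int("".join(map(str, structure[:3])), 2))
    n_nodes  = max(1, int("".join(map(str, structure[3:6])), 2))
    n_inputs = max(1, int(structure[6:].sum()))
    # fully connected layers: inputs -> hidden layers -> one output, plus biases
    sizes = [n_inputs] + [n_nodes] * n_layers + [1]
    n_weights = sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))
    weights = rng.uniform(-1.0, 1.0, size=n_weights)
    return structure, weights

structure, weights = random_solution(n_features=13, rng=np.random.default_rng(3))
print(structure.shape, weights.shape)
```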

Fitness function

An effective fitness function is necessary to select solutions that minimize the objective and to evaluate their quality over successive iterations. In this paper, the fitness function is designed to minimize the classification error, the number of selected features, and the size of the ANN while retaining good generalization capability; it is computed from the error, the ratio of selected features, and the ratio of the number of weights and biases (connections). The classification error, which represents the percentage of misclassified training samples, is calculated as follows:

$$ E\left( p \right) = \frac{100}{{N_{p} }}\mathop \sum \limits_{i = 1}^{{N_{p} }} (y_{i} - \overline{y}_{i} ) $$
(18)

where \(N_{p}\) is the number of samples, and \(y_{i}\) and \(\overline{y}_{i}\) are the target and actual outputs of the network, respectively. The second part of the fitness function evaluates the ratio of selected features as follows:

$$ R_{f} = \frac{{n_{s} }}{{n_{f} }} $$
(19)

where \(n_{s}\) is the number of selected features and \(n_{f}\) is the size of the complete feature set. The third part evaluates the ratio of connections (weights and biases) used by the network as follows:

$$ P_{c} = \frac{1}{{n_{c} }}\mathop \sum \limits_{i = 1}^{{n_{c} }} (w_{i} + b_{i} ) $$
(20)

where \(n_{c}\) is the total possible number of connections in the network, and \(w_{i}\) and \(b_{i}\) indicate the active weight and bias connections, respectively. Thus, the fitness function \(f(s)\) of solution s is calculated as follows:

$$ f\left( s \right) = E\left( p \right) + a_{1} *R_{f} + a_{2} *P_{c} $$
(21)

where \(a_{1}\) and \(a_{2}\) are user-defined constants in the range [0, 1] that control the relative importance of the three terms of the fitness calculation; both are set to 0.1 in this work.
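
A sketch of the fitness of Eq. (21) with its three terms from Eqs. (18)-(20), assuming the error term is the percentage of misclassified samples and that the connection ratio is the count of active connections over the total possible; a1 = a2 = 0.1 as stated above.

```python
import numpy as np

def fitness(y_true, y_pred, n_selected, n_features, n_active, n_total,
            a1=0.1, a2=0.1):
    """Combined fitness of Eq. (21): classification error (Eq. 18, here the
    percentage of misclassified samples), ratio of selected features (Eq. 19),
    and ratio of active network connections (Eq. 20), with a1 = a2 = 0.1."""
    error = 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))  # Eq. (18)
    r_f = n_selected / n_features                                      # Eq. (19)
    p_c = n_active / n_total                                           # Eq. (20)
    return error + a1 * r_f + a2 * p_c

print(fitness([0, 1, 1, 0], [0, 1, 0, 0], n_selected=5, n_features=13,
              n_active=40, n_total=120))
```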

Results and discussion

The classification problems are implemented in MATLAB R2017a on Windows 10 and executed on a PC with an Intel Core i7-5600U processor at 2.6 GHz with 8.0 GB of RAM. The maximum number of iterations is set to 100 and the population size to 40.

Fifteen classification datasets from different sources49,50,51,52 are evaluated using the MVDE method. Datasets with various numbers of instances and attributes are chosen to validate MVDE. Five heuristic algorithms, DE, CSO, WOA, SCA, and PSO, are used for comparison and evaluation against MVDE. Table 1 outlines the description of the datasets.

Table 1 Description of datasets.

The performance of the proposed algorithm and the compared methods is measured using eight metric indices: the average classification accuracy, the best, mean, and worst fitness, the standard deviation (Std), the number of selected features, the complexity of the network (number of active connections / total number of connections), and the time cost (seconds), which together indicate the robustness and stability of all implemented methods. The chance of finding optimal solutions increases when candidate solutions are spread over a wide area of the search space. Moreover, an average ranking test is used to judge the performance of MVDE against the other methods. The best results for each metric index are shown in bold.

Figure 3 shows the convergence rates of the different methods on the various datasets; the MVDE algorithm achieves better performance than the other methods. MVDE converges faster than the other algorithms except on the arrhythmia, ionosphere, and WBCD datasets, where DE is faster and competes with MVDE. The compared algorithms CSO, WOA, and SCA contend with one another on the various benchmarks and converge more slowly than MVDE. Overall, MVDE is an effective algorithm on the datasets used in this study.

Figure 3. Convergence curves for the datasets obtained by the different optimizers.

Table 2 outlines the average classification accuracy, which indicates the performance on test data for all datasets using the features selected by the different algorithms. MVDE performs best among all the optimizers except on the Australian and zoo datasets, where DE does better. The solutions with the best average fitness, which maximize classification accuracy and minimize the number of selected features, are presented in Table 3, which indicates that the best solutions are attained by MVDE. These results show that MVDE, with its capability to search the space adaptively, outperforms the other algorithms. The mean and worst fitness values are presented in Tables 4 and 5 and confirm the superiority of MVDE over the compared algorithms.

Table 2 Average classification accuracy of different optimizers.
Table 3 Best fitness function of different optimizers.
Table 4 Mean fitness function of different optimizers.
Table 5 Worst fitness function of different optimizers.

Table 6 shows the standard deviations of the achieved fitness values, used to investigate the stability and robustness of the compared methods relative to MVDE. The results show that MVDE outperforms the compared methods, owing to their weaker ability to explore and exploit the search space.

Table 6 Standard deviation of different optimizers.

We can conclude that MVDE is an outstanding method on the various benchmark problems, attaining the best results on the different metric indices while avoiding entrapment in local minima.

The average number of selected features for all compared methods is outlined in Table 7. The SCA algorithm minimizes the number of selected features better than the other algorithms, while MVDE also does well and ranks second in the number of selected features.

Table 7 Average selected features of different optimizers.

The execution time is reported in Table 8. The SCA algorithm executes in the shortest time compared with the other methods, making it the fastest, while MVDE, DE, and PSO compete closely with short execution times. In contrast, CSO and WOA are time-consuming.

Table 8 Execution time in seconds of different optimizers.

Regarding network complexity, Table 9 outlines the network complexity obtained by the different optimizers. The results show that the MVDE and SCA methods use the fewest network connections. Therefore, MVDE can train the ANN with a less complex network model while maintaining high accuracy.

Table 9 Complexity of the network for different optimizers.

In conclusion, the overall judgment of the different optimizers is presented in Table 10, which ranks every compared algorithm. From this table, the MVDE algorithm is ranked first for all metric indices except the average number of selected features, the execution time, and the complexity of the network.

Table 10 Average of the ranks for different optimizers.

Table 11 shows the results obtained with the KNN classifier. The average classification accuracy of MVDE with the KNN classifier is better than with the ANN classifier on some datasets and worse on others. The ranges of the average fitness, mean, worst, and standard deviation values differ between the KNN and ANN classifiers because of the differences between the two classifiers. The execution time with the KNN classifier is longer than with the ANN classifier, meaning that KNN is more time-consuming in this setting. Some classification methods simply work better with certain data or applications than others.

Table 11 Average values of different metrics using KNN classifier for MVDE.

To validate the performance of the MVDE technique, a comparison between MVDE and recently published algorithms by other researchers is presented in Table 1223,24,27,28,29,30. Eleven datasets are common to this study and the compared studies. The comparison between MVDE and the methods from the literature indicates that MVDE outperforms them on the iris, Australian credit, German credit, hillvalley, and waveform datasets. In contrast, its classification performance is slightly worse on the ionosphere, vehicle, wine, zoo, WBC, and heart datasets; on the heart benchmark, MVDE classifies better than one compared method but worse than another. On the other hand, MVDE selects fewer features than all other algorithms on all benchmarks, except that it selects slightly more features than a single method on the ionosphere dataset. In addition, the computational time achieved by the MVDE algorithm is typically shorter for all datasets except WBC, since the number of selected features is smaller. These results confirm that the MVDE algorithm is competitive with state-of-the-art algorithms. MVDE benefits from the advantages of the DE algorithm together with the modifications applied to overcome DE's drawbacks. MVDE is a nearly parameter-free optimization method with high performance that can deal with complex high-dimensional problems. The design of the MVDE algorithm, with multi-variant mutation and an adaptive scaling factor, helps it select the optimal number of features for classification, which affects classification accuracy.

Table 12 Comparison with other methods.

Conclusion

This work presents a novel optimization algorithm (MVDE) featuring multi-variant mutation with an adaptive scaling factor, developed by integrating an adaptive crossover rate with the mutation factors and adaptive parent selection to achieve better performance. The performance of the MVDE algorithm is verified on fifteen real-world problems to confirm its stability, quality, and simplicity. In this paper, the performance and the complexity of the ANN are optimized simultaneously during the training process. Different weights, biases, numbers of hidden-layer nodes, and input selections can be explored during the search, so an effective ANN model with low complexity and low classification error can be found. The challenging goal of balancing exploration (diversification) and exploitation (intensification) is achieved by MVDE, and a diverse population is preserved through the iterations.

The evaluation uses a set of criteria covering different aspects of the MVDE algorithm. In addition, the DE, CSO, WOA, and SCA methods are applied to the same problems and their results are compared with those of MVDE, and the results of MVDE are also compared with methods from the literature.

The investigative results indicate that the MVDE optimization method is a useful and appropriate technique for classifying data across the benchmark group considered. The results also show that the MVDE algorithm avoids local minima better than the compared DE, CSO, WOA, and SCA methods. The performance obtained with the features selected by MVDE is promising and better than that of the compared methods. Furthermore, the superiority of MVDE for training ANNs is clearly observed in terms of the evaluation metrics.