Introduction

Recently, the data mining field has become an active research area due to the presence of huge amounts of data in digital format that need to be transformed into useful information. The main task of data mining is to build models for the discovery of useful hidden patterns in huge collections of data, and it is considered an essential step in the knowledge discovery process [16]. Preprocessing of data is a critical step in data mining: it has a direct impact on data mining techniques such as classification, affecting both the quality of the discovered patterns and the accuracy of the classification models [16, 25]. Feature selection is one of the main preprocessing steps in data mining; it aims to discard noisy and irrelevant features while retaining the useful and informative ones. Selecting the ideal or near-ideal subset of the given features leads to accurate classification results at a lower computational cost [6, 25]. Feature selection approaches are classified, based on the criterion used to evaluate the selected subset of features, into two classes: filter and wrapper approaches [6]. Wrapper techniques rely heavily on the accuracy of machine learning classifiers such as KNN or SVM to guide their search for the optimal subset of features, whereas filter techniques use scoring metrics such as chi-square and information gain to assess the goodness of the selected subset. More precisely, in filter approaches, attributes are ranked using a measure such as chi-square, and the attributes whose scores fall below a predefined threshold are removed [1, 6, 14].
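To make the filter strategy concrete, the following minimal sketch (Python; the data set, the scaler, and the median threshold are illustrative choices of ours, not prescriptions from the cited works) ranks features with the chi-square score and keeps those above the threshold:

```python
# Minimal sketch of a filter-based feature selection step: rank features
# by chi-square score and keep only those above an illustrative threshold.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)      # chi2 requires non-negative inputs

scores, _ = chi2(X, y)                   # score each feature independently
threshold = np.median(scores)            # illustrative threshold choice
selected = np.where(scores >= threshold)[0]
X_reduced = X[:, selected]
print(f"kept {len(selected)} of {X.shape[1]} features")
```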

Generally speaking, finding an optimal subset of features is a challenging task, and FS has gained the interest of many researchers in the data mining and machine learning fields [8, 14]. The literature shows that meta-heuristic techniques have been very effective in tackling many optimization problems arising in machine learning, engineering design, data mining, production planning, and feature selection [38]. The Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) have been successfully utilized for solving many feature selection problems [6, 7, 13, 17]. Moreover, many other meta-heuristic approaches such as Simulated Annealing (SA) [29], Tabu Search (TS) [45], Ant Colony Optimization (ACO) [22], the Binary Bat Algorithm [32], the Moth-Flame Optimization Algorithm [44], the Antlion Optimization Algorithm [43], the Dragonfly Optimization Algorithm [26], and the Whale Optimization Algorithm (WOA) [35] have been efficiently applied to discover the best subsets of features for many classification problems.

As discussed in [37, 38], when designing and using meta-heuristic techniques, two main criteria must be considered: diversification, which refers to the exploration of the search space, and intensification, which means the exploitation of the best solution found so far (e.g., the best subset of features). Based on these criteria, meta-heuristic techniques can be divided into two branches: population-based meta-heuristics (e.g., PSO), which emphasize exploration, and single-solution based techniques such as SA and TS, which are biased towards exploitation. The performance of a search algorithm can be improved if an appropriate balance between exploration and exploitation is achieved. The desired balance can be obtained by combining two techniques, e.g., a population-based approach and a single-solution based approach.

Various hybrid meta-heuristic approaches have been proposed for solving a wide range of optimization problems. Hybrid approaches are popular because they benefit from the advantages of two or more algorithms [37]. In [18], local search approaches were embedded in a GA to control the search procedure; the obtained results show better performance in comparison with classical versions of GA. In [42], the simulated annealing algorithm in conjunction with Genetic Algorithms was used to optimize an industrial production management problem. In addition, a hybrid approach based on Markov chains and simulated annealing was designed to tackle the traveling salesman problem [28]. Furthermore, in [4], Particle Swarm Optimization was hybridized with Simulated Annealing to prevent PSO from getting trapped in local optima; the designed approach was applied to complex multi-dimensional optimization problems. Finally, in [3], the ACO algorithm was used in conjunction with a GA as a hybrid approach for feature selection in the text classification domain, and the proposed approach recorded better results in comparison with filter approaches and the classic version of ACO. Despite the effectiveness that meta-heuristic algorithms have shown in solving many optimization problems, they have some shortcomings, such as their sensitivity to parameter tuning: finding suitable parameters for different optimization problems is crucial. Furthermore, in terms of performance, many existing optimization algorithms have difficulties in dealing with high-dimensional optimization problems [5]. In addition, the computational cost of using meta-heuristics in general, and hybrid meta-heuristic approaches in particular, is relatively high.

In this work, SA is used to enhance the best solution (subset of features) found so far by the Binary Dragonfly Optimizer. The classic version of the Binary Dragonfly Algorithm was used for feature selection in [26], but to the best of our knowledge, this is the first time that the Binary Dragonfly algorithm hybridized with Simulated Annealing has been utilized in the feature selection domain. The remainder of this paper is organized as follows: related work is reviewed in section “Related Work”, whereas section “Algorithms” presents the BDA and SA algorithms. Section “Proposed Feature Selection Method” explains the proposed FS approach. In section “Experiments”, the experimental results are presented and discussed. Conclusions and future work are given in section “Conclusion”.

The key contributions of this paper are as follows:

  • BDA-SA: an improved version of the standard binary Dragonfly Algorithm (BDA) is proposed.

  • The main improvement is the combination of BDA with SA to address the local optima problem of the standard BDA.

  • A wrapper feature selection model based on BDA-SA is developed.

  • BDA-SA is evaluated and compared with a number of well-known algorithms, including BDA, BPSO, BGWO, and BALO.

  • The results are also compared with using all features, on 18 benchmark data sets from the UCI repository that are frequently used in feature selection research. These results clearly confirm the superiority of BDA-SA over the baseline algorithms.

Related Work

Based on a literature investigation, many optimization algorithms have been improved by combining them with local search algorithms (LSA). For example, in a study by Elgamal et al. [11], the Harris Hawks Optimization (HHO) algorithm was improved by SA and applied to the feature selection problem. In [2], the authors improved water cycle optimization with SA and applied it to spam email detection. In [41], the performance of the Salp Swarm Algorithm (SSA) was enhanced by combining it with a newly developed LSA and applied to the feature selection problem. Also, in [40], WOA was combined with a new local search algorithm to overcome the problem of local optima and applied to a rule selection problem. Furthermore, in [21], Jia et al. improved spotted hyena optimization using SA and applied it to the feature selection problem. Also, Simulated Annealing was hybridized with GA in [27] to be utilized as a feature selection method for the classification of power disturbances in the power quality problem. In [33], the Genetic Algorithm (GA) was used with SA to extract features to tackle the examination timetabling problem. A local search strategy was embedded in Particle Swarm Optimization to guide PSO during the search for the best subset of features in classification [31]. Mafarja and Mirjalili [25] proposed two hybrid wrapper feature selection approaches based on the Whale Optimization and Simulated Annealing algorithms, where the aim of using SA is to enhance the exploitation of WOA. Results showed that the proposed approaches improved the classification accuracy in comparison with other wrapper feature selection techniques.

The literature also reveals many successful efforts to improve the performance of the Dragonfly optimization algorithm. For instance, in [19], Sayed et al. used several chaotic maps to adjust the movement parameters of the dragonflies in DA through the iterations to accelerate its convergence rate. Hammouri et al. [15] adopted several functions, such as linear, quadratic, and sinusoidal ones, for updating the main coefficients of the Binary Dragonfly algorithm. Also, a hyper learning strategy was utilized in [39] to boost the Binary Dragonfly algorithm, helping it avoid local optima and enhancing its search for an ideal subset of features. The proposed method was applied to a coronavirus disease (COVID-19) data set, and experimental results demonstrated the ability of hyper-learning-based BDA to improve the classification accuracy. Finally, in [34], Qasim et al. proposed a feature selection approach in which the binary dragonfly algorithm (BDA) was hybridized with statistical dependence (SD); the proposed hybrid approach confirmed its efficiency in increasing the classification accuracy. The results and improvements achieved in these studies motivated us to combine SA, as an LSA, with DA to improve its search ability for the feature selection problem.

Algorithms

Dragonfly Algorithm

The Dragonfly Algorithm is a recent biologically inspired optimization approach proposed by Seyedali Mirjalili in 2015 [30]. Dragonfly swarming depends on two sorts of behavior: hunting and migration [30, 36]. A hunting swarm of dragonflies moves in small subgroups over a restricted area to find and hunt prey; this behavior is used to simulate the exploration part of the optimization process. In the migration behavior, in contrast with the hunting swarm, dragonflies move along one direction in bigger subgroups; this behavior is exploited to simulate the exploitation part of the optimization [30, 36]. Generally, the aim of swarm members is to cooperate to discover food places and to protect themselves from the danger of enemies. Based on these two aims, a set of factors is mathematically modeled for adjusting the positions of members in the swarm. The mathematical models implementing the swarming behavior of dragonfly insects are given as follows [12, 26, 30]:

Separation indicates the mechanism that flying dragonflies follow to avoid collisions with each other. It can be mathematically written as in Eq. (1):

$$\begin{aligned} S_{i}=-\sum _{j=1}^{N}(X-X_{j}) \end{aligned}$$
(1)

where X refers to the position of the current search agent, \(X_j\) denotes the \(j\)-th neighbor of X, and N represents the number of neighbors. Alignment refers to the way an individual adjusts its velocity with respect to the velocity vectors of the other close dragonflies in the swarm. It can be mathematically written using Eq. (2):

$$\begin{aligned} A_{i}=\frac{\sum _{j=1}^{N}V_{j}}{N} \end{aligned}$$
(2)

where \(V_{j}\) refers to the j-th neighbor’s velocity vector.

Cohesion is a position-update factor that represents the desire of search agents to travel towards the center of mass of the neighborhood. It is mathematically written as in Eq. (3):

$$\begin{aligned} C_{i}=\frac{\sum _{j=1}^{N}X_{j}}{N}-X \end{aligned}$$
(3)

Attraction denotes the interest of search agents in traveling towards the food location. The tendency of the \(i\)-th member of the swarm to move towards the food source is obtained using Eq. (4):

$$\begin{aligned} F_{i}=F_{location}-X \end{aligned}$$
(4)

where \(F_{location}\) refers to the location of food source, and X refers to the current member.

Distraction refers to the mechanism that dragonflies follow to flee from an enemy. The distraction of the \(i\)-th dragonfly is defined as in Eq. (5):

$$\begin{aligned} E_{i}=E_{location}+X \end{aligned}$$
(5)

where \(E_{location}\) represents the current position of the enemy, and X is the position of the current member.

To find the optimum solution of a given optimization problem, DA defines a position vector and a step vector for each search agent in the swarm. These vectors are utilized to update the positions of the search agents in the search space. The step vector, which represents the traveling direction of the dragonflies, is formulated as follows [26, 30]:

$$\begin{aligned} \varDelta X_{t+1}= (sS_{i}+aA_{i}+cC_{i}+fF_{i}+eE_{i} )+w\varDelta X_{t} \end{aligned}$$
(6)

where s, a, c, f, and e are known as weighting factors for separation (\(S_i\)), alignment (\(A_i\)), cohesion (\(C_i\)), attraction (\(F_i\)) and distraction (\(E_i\)) of the i-th search agent, respectively. w refers to the inertia weight.

The obtained step vector (\(\varDelta X\)) is then used to update the position vector of search agent X as follows:

$$\begin{aligned} X_{t+1}= X_{t} + \varDelta X_{t+1} \end{aligned}$$
(7)

where t indicates the current iteration.
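The following minimal sketch shows how Eqs. (1)–(7) combine into one continuous update for a single agent. The weighting factors are illustrative constants here, whereas the original algorithm adapts them over the iterations [30]; all function and variable names are ours:

```python
import numpy as np

def da_step(X, dX, neighbors, V_neigh, food, enemy,
            s=0.1, a=0.1, c=0.7, f=1.0, e=1.0, w=0.9):
    """One continuous DA update for a single agent, following Eqs. (1)-(7).
    neighbors and V_neigh hold the positions and step vectors of the
    N neighboring agents as (N, dim) arrays."""
    S = -np.sum(X - neighbors, axis=0)   # separation, Eq. (1)
    A = V_neigh.mean(axis=0)             # alignment, Eq. (2)
    C = neighbors.mean(axis=0) - X       # cohesion, Eq. (3)
    F = food - X                         # attraction to food, Eq. (4)
    E = enemy + X                        # distraction from enemy, Eq. (5)
    dX_new = s*S + a*A + c*C + f*F + e*E + w*dX   # step vector, Eq. (6)
    return X + dX_new, dX_new                     # new position, Eq. (7)

# Illustrative call with random data:
rng = np.random.default_rng(0)
X, dX = rng.random(4), np.zeros(4)
nb, Vn = rng.random((3, 4)), rng.random((3, 4))
X, dX = da_step(X, dX, nb, Vn, food=rng.random(4), enemy=rng.random(4))
```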

The basic version of the Dragonfly optimizer was proposed for problems in continuous search spaces, where the dragonflies update their positions by adding the step vector to the position vector. Feature selection, however, is a binary optimization problem, so the update strategy in Eq. (7) is not applicable in a binary search space. Mirjalili [30] utilized the following transfer function to convert the step vector values to a number restricted to [0, 1]:

$$\begin{aligned} T(\varDelta x)=\,\mid \frac{\varDelta x}{\sqrt{\varDelta x^2+1}}\mid \end{aligned}$$
(8)

The above transfer function is used to find the probability of updating the position of dragonflies in the swarm, and then the following equation is employed to update the positions of dragonflies (search agents):

$$\begin{aligned} X_{t+1}=\left\{ \begin{matrix} \lnot X_{t} &{} r < T(\varDelta x_{t+1}) \\ X_{t} &{} r\ge T(\varDelta x_{t+1}) \end{matrix}\right. \end{aligned}$$
(9)

where r is a random number in the range [0, 1].
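A short sketch of this binary update (the function name is ours):

```python
import numpy as np

def binary_update(X, dX, rng):
    """Binary BDA position update (Eqs. 8-9): each bit of X is flipped
    with probability T(dx) given by the transfer function of Eq. (8)."""
    T = np.abs(dX / np.sqrt(dX**2 + 1.0))    # Eq. (8)
    flip = rng.random(X.shape) < T           # Eq. (9): flip where r < T
    X_new = X.copy()
    X_new[flip] = 1 - X_new[flip]            # negate the selected bits
    return X_new

# Example: update a 6-bit solution with a random step vector.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=6)
print(binary_update(x, rng.normal(size=6), rng))
```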

Algorithm 1 presents the pseudocode of the Binary Dragonfly algorithm.

[Algorithm 1: Pseudocode of the Binary Dragonfly algorithm]

Simulated Annealing

SA is a single-solution based meta-heuristic optimization algorithm introduced by Kirkpatrick et al. [23] in 1983. It has been widely used to tackle discrete and continuous optimization problems. SA is a local search approach that, unlike pure hill climbing, uses a certain probability to decide whether to accept a worse solution or not [23]. SA starts from an initial solution (in our case, the best solution found so far by BDA is used as the SA initial solution). A neighbour of the best solution found so far is generated according to a specific neighbourhood structure and its fitness is evaluated. If the fitness of the neighbour is better than (less than or equal to) the fitness of the best solution, the neighbour is selected as the new best solution. Otherwise, the Boltzmann probability \(P=e^{-\frac{\Phi }{T}}\) is applied as the acceptance condition for the neighbour solution, where \(\Phi\) is the difference between the fitness of the neighbour and the best solution, and T is the temperature, which is gradually reduced according to a cooling schedule throughout the search procedure [20, 23, 25]. In this paper, as adopted in [25], the initial temperature equals \(2*|N|\), where N is the number of features in each data set, and the cooling schedule \(T= 0.93* T\) is applied. Algorithm 2 presents the pseudocode of the SA algorithm [25].

[Algorithm 2: Pseudocode of the Simulated Annealing algorithm]
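Since the pseudocode is not reproduced here, the following minimal sketch outlines the SA refinement as we apply it (minimization is assumed; the single-bit-flip neighbourhood is our illustrative choice, while the temperature settings follow [25]):

```python
import math
import numpy as np

def simulated_annealing(best_sol, fitness, n_iters=100, rng=None):
    """SA refinement of the best solution found by BDA (minimization).
    Initial temperature 2*N and cooling T <- 0.93*T follow [25]; the
    single-bit-flip neighbourhood is an assumed structure."""
    rng = rng or np.random.default_rng()
    N = len(best_sol)
    T = 2.0 * N
    current, f_cur = best_sol.copy(), fitness(best_sol)
    best, f_best = current.copy(), f_cur
    for _ in range(n_iters):
        neigh = current.copy()
        neigh[rng.integers(N)] ^= 1          # flip one randomly chosen bit
        f_n = fitness(neigh)
        if f_n <= f_cur or rng.random() < math.exp(-(f_n - f_cur) / T):
            current, f_cur = neigh, f_n      # accept better or lucky worse
            if f_cur < f_best:
                best, f_best = current.copy(), f_cur
        T *= 0.93                            # cooling schedule
    return best, f_best
```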

Proposed Feature Selection Method

The feature selection problem is a binary optimization problem, so binary vectors are used to represent the solutions: if the value of a specific cell in the binary vector is set to 1, the corresponding feature is retained; otherwise, that feature is ignored. The size of the binary vector equals the number of features in the data set. The basic version of the binary DA algorithm was used for feature selection in [26]. The main aim of this work is to improve the performance of BDA for the feature selection problem. To achieve that purpose, the best solution obtained so far by BDA is passed to the SA algorithm to be used as an initial solution, instead of generating the initial solution randomly. SA then conducts a local search starting from the best solution found so far by BDA in an attempt to find a better one; a sketch of the resulting hybrid loop is given after Fig. 1. Figure 1 presents the flowchart of the proposed approach.

[Fig. 1: Flowchart of the BDA-SA algorithm]
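The sketch below composes the pieces above into the hybrid loop. The swarm dynamics are deliberately simplified (only food attraction, enemy distraction, and inertia drive the step vector), so it illustrates the BDA-SA control flow rather than reproducing the exact implementation; it assumes the binary_update and simulated_annealing sketches defined earlier:

```python
import numpy as np

def bda_sa(fitness, n_agents, n_feat, max_iter, rng):
    """Simplified BDA-SA loop: BDA-style exploration, then SA refinement
    of the best subset found (fitness is minimized)."""
    X = rng.integers(0, 2, size=(n_agents, n_feat))
    dX = np.zeros((n_agents, n_feat))
    fits = np.array([fitness(x) for x in X])
    food = X[fits.argmin()].copy()      # best solution found so far
    enemy = X[fits.argmax()].copy()     # worst solution found so far
    for t in range(max_iter):
        w = 0.9 - t * (0.5 / max_iter)  # linearly decreasing inertia
        for i in range(n_agents):
            # Simplified step vector: attraction, distraction, inertia.
            dX[i] = (food - X[i]) + 0.1 * (enemy + X[i]) + w * dX[i]
            X[i] = binary_update(X[i], dX[i], rng)
        fits = np.array([fitness(x) for x in X])
        if fits.min() < fitness(food):
            food = X[fits.argmin()].copy()
        if fits.max() > fitness(enemy):
            enemy = X[fits.argmax()].copy()
    # SA starts from the best BDA solution instead of a random one.
    return simulated_annealing(food, fitness, rng=rng)
```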

Feature selection is a multi-objective optimization problem in which two related objectives need to be achieved: maximum classification accuracy and the least number of selected features. Eq. (10) is commonly used as a fitness function for feature selection [6, 25, 26]:

$$\begin{aligned} Fitness=(\alpha * er)+(\beta *(\frac{m}{N})) \end{aligned}$$
(10)

where er denotes the error rate of the KNN classifier using the selected subset of features. \(\alpha\) and \(\beta\) are two parameters used to balance the classification accuracy against the size of the selected subset of features, where \(\alpha\) is a number in the range [0, 1] and \(\beta = 1 - \alpha\). N refers to the total number of features in the data set, and m indicates the cardinality of the subset of features selected by the search agent. In this work, since we are mostly interested in obtaining the highest classification accuracy, \(\alpha\) is set to 0.99 as in the previous work [26]: we improve the binary version of the DA algorithm developed in [26] and adopt the same settings in our experiments so that our results are directly comparable with those reported for BDA [26].
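A sketch of Eq. (10) as a wrapper fitness function is given below; the 5-NN evaluator follows [26], while the helper name and the explicit train/validation split are ours:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def make_fitness(X_train, y_train, X_val, y_val, alpha=0.99):
    """Wrapper fitness of Eq. (10): alpha * error rate + beta * m / N.
    A 5-NN classifier evaluates each candidate subset, as in [26]."""
    N = X_train.shape[1]
    def fitness(mask):
        idx = np.flatnonzero(mask)
        if idx.size == 0:                 # empty subsets get the worst score
            return 1.0
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(X_train[:, idx], y_train)
        err = 1.0 - knn.score(X_val[:, idx], y_val)
        return alpha * err + (1 - alpha) * idx.size / N
    return fitness
```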

In general, diversification is more important than intensification for exploring potentially useful areas of the feature space, especially at the beginning of the search process. In later phases, exploitation becomes more important, because the search must focus around the best solutions found during the exploration phase [24]. Hybrid approaches, such as BDA-SA in our case, can be used to achieve the desired balance between exploring and exploiting the search space. However, in comparison with classic wrapper approaches, where a single heuristic technique and an evaluator are used, the computational cost of hybrid approaches is higher.

Experiments

Data Sets

In this work, 18 data sets from the UCI repository were used to assess the performance of the proposed feature selection approach [10]. These are the same data sets used by many researchers to evaluate various feature selection approaches. Table 1 outlines the details of the data sets used for evaluating the proposed Binary Dragonfly based feature selection approach.

Table 1 Data sets used in the experiments

Parameter Settings

As in [26], each data set is split into three equal sets: a training set, a validation set, and a test set. In addition, a K-fold cross-validation procedure is used to evaluate the KNN classifier (the parameter K of the KNN classifier is set to five, as adopted in [26]). Results of several meta-heuristic based wrapper feature selection algorithms, including Binary Particle Swarm Optimization (BPSO), Binary Ant Lion Optimization (BALO), and Binary Gray Wolf Optimization (BGWO), were also used for comparison. For the Binary Dragonfly algorithm, the original paper of the Dragonfly algorithm [30] comprehensively studied appropriate values of the swarming factors and the inertia weight; for that reason, the best parameter settings reported in that paper were adopted in this work. Moreover, the same parameter values adopted in [25] for the SA algorithm were used. The parameters of the BGWO, BPSO, and BALO algorithms were selected based on the settings recommended in the original publications and related studies in the feature selection domain. In all conducted experiments, common parameters were set as in Table 2; these values were chosen following a range of initial experiments. Comparisons between approaches were made based on three criteria: classification accuracy, number of selected features, and best fitness. Each approach was run 20 times with random initial solutions on a machine with an Intel Core i5 2.2 GHz processor and 4 GB of RAM; a sketch of this protocol is given after Table 2.

Table 2 Parameter setting of algorithms
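The sketch below outlines this protocol under stated assumptions: X and y hold one data set, make_fitness and bda_sa are the sketches given earlier, and the population size and iteration count are placeholders rather than the Table 2 values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 20 independent runs, each with a fresh three-way equal split.
accs = []
for run in range(20):
    rng = np.random.default_rng(run)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=2/3,
                                                  random_state=run)
    X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest,
                                                test_size=0.5,
                                                random_state=run)
    fit = make_fitness(X_tr, y_tr, X_val, y_val)
    best, _ = bda_sa(fit, n_agents=10, n_feat=X.shape[1],
                     max_iter=100, rng=rng)      # placeholder settings
    idx = np.flatnonzero(best)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, idx], y_tr)
    accs.append(knn.score(X_te[:, idx], y_te))
print(f"mean accuracy: {np.mean(accs):.4f} +/- {np.std(accs):.4f}")
```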

Results and Discussion

This section presents the results obtained by the proposed FS approach. The proposed hybrid approach BDA-SA was first compared to the original BDA-based approach. In terms of classification accuracy, it is clear from Table 3 that BDA-SA classifies most accurately on all data sets; it can be stated that SA succeeded in enhancing the best solutions found by the BDA algorithm. In terms of best fitness, Table 3 also reveals the averages of best fitness for the applied FS approaches on each data set. In most cases, BDA-SA obtained the lowest average fitness value; BDA is slightly better than BDA-SA in only two cases (the SonarEW and M-of-n data sets), although, as we will see later, these differences are not statistically significant. Table 3 further shows that BDA-SA selects fewer features than BDA on average in some cases, while the averages recorded by BDA are lower in other cases. Since BDA-SA is superior on all cases in terms of classification accuracy, when the average number of features selected by BDA-SA is greater than that of BDA, BDA-SA has managed to find informative and relevant features ignored by BDA; and when the average is lower, BDA-SA may have removed some noisy or irrelevant features selected by BDA.

Table 3 Averages of classification accuracy, best fitness and selected features obtained from BDA and BDA-SA

Since the main aim of this work is to enhance the performance of the BDA algorithm by hybridizing it with SA, we conducted further statistical analysis to demonstrate that the hybrid BDA-SA approach is better than using the basic BDA alone for the feature selection problem. Table 4 presents the standard deviations of classification accuracy, best fitness, and number of selected features for the BDA and BDA-SA approaches on each data set. In terms of classification accuracy, it can be observed from Table 4 that BDA-SA behaves more robustly than the BDA-based approach on almost all data sets. In terms of best fitness and number of selected features, Table 4 shows that BDA-SA is better than BDA in half of the cases.

Table 4 Standard deviation of the averages of classification accuracy, best fitness and selected features obtained from BDA and BDA-SA

The average and standard deviation were used to compare the overall results obtained from BDA and BDA-SA. To determine whether the differences in the results are statistically significant, the non-parametric Wilcoxon rank-sum test with a significance level of 0.05 was applied (a sketch of this test is given after Table 5); this test is appropriate for comparing algorithms with stochastic behaviour [9]. As shown in Table 5, the p values for accuracy and fitness show that BDA-SA recorded significantly better results than BDA on most of the data sets. In terms of selected features, the differences are statistically significant in eight cases, while on the other data sets, including BreastEW, CongressEW, WaveformEW, SpectEW, and IonosphereEW, the p values show that the differences are not statistically significant. The superiority of BDA-SA, particularly in terms of classification accuracy, is expected, since it utilizes two powerful search algorithms: DA, which is efficient in exploration, and SA, which has a strong exploitation capability. DA explores the highly relevant regions of the feature space while avoiding the trap of local optima, and SA then intensifies the search in the regions near the best subset of features discovered by BDA. However, in terms of computational time, as revealed in Fig. 2, the computational cost of BDA-SA is higher than that of BDA in all cases.

Table 5 P values of the Wilcoxon rank-sum test over 20 runs for classification accuracy, best fitness, and selected features of BDA and BDA-SA (p \(\ge\) 0.05 have been underlined)
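For one data set, the test reduces to a rank-sum comparison of the 20 per-run accuracies of the two algorithms; the arrays below hold illustrative values, not our reported results:

```python
import numpy as np
from scipy.stats import ranksums

# acc_bda and acc_bdasa: per-run accuracies of each algorithm over the
# 20 independent runs on one data set (illustrative values only).
acc_bda = np.array([0.91, 0.90, 0.92] + [0.91] * 17)
acc_bdasa = np.array([0.94, 0.93, 0.95] + [0.94] * 17)
stat, p = ranksums(acc_bdasa, acc_bda)   # Wilcoxon rank-sum test
print(f"p = {p:.4g}; significant at 0.05: {p < 0.05}")
```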
[Fig. 2: Averages of computational time for BDA and BDA-SA]

The performance of BDA-SA was also compared with three meta-heuristic based feature selection approaches: Binary Particle Swarm Optimization, Binary Ant Lion Optimization, and Binary Gray Wolf Optimization. In terms of accuracy, as revealed in Table 6, BDA-SA outperformed all of its competitors. In terms of best fitness, as presented in Table 7, BDA-SA recorded the lowest averages on fifteen out of eighteen data sets. Furthermore, in terms of the number of selected features, Table 8 shows that BDA-SA outperformed the other algorithms on more than 50% of the tested cases. The average computational time of each approach was also considered: Fig. 3 shows the computational cost of BDA-SA, BPSO, BGWO, and BALO, where it can be observed that BPSO is the best in terms of lowest computational time.

Table 6 Comparison between BDA-SA and other algorithms in terms of classification accuracy
Table 7 Comparison between BDA-SA and other algorithms in terms of best fitness
Table 8 Comparison between BDA-SA and other algorithms in terms of selected features
[Fig. 3: Comparison between BDA-SA and other algorithms in terms of computational time]

Conclusion

This work introduced BDA-SA, a hybrid feature selection approach. The main goal was to enhance the performance of the Binary Dragonfly algorithm, especially in terms of classification accuracy. The best solution found so far by the BDA algorithm was used as the initial solution of the SA algorithm, which conducts a local search to find a better solution than the one obtained by BDA. The proposed approach was assessed on a set of frequently used data sets from the UCI machine learning repository. The performance of BDA-SA was compared to the native BDA algorithm as well as to various algorithms comprising BGWO, BPSO, and BALO. Experimental results show that BDA-SA outperformed BDA and the other algorithms. In the future, it is worth evaluating the proposed hybrid approach on more complex data sets.