1 Introduction

The rapid growth of computer applications and information technologies generates a tremendous amount of data from various devices. This vast amount of data poses a critical problem for data mining, which requires practical data pre-processing steps using different techniques. Pre-processing is a necessary step employed to prepare and clean the data for the subsequent processing steps of machine learning [1, 2]. Feature selection (FS) is an essential pre-processing step that reduces the size of the dataset by selecting a small subset of relevant features that capture the characteristics of the input data [3, 4]. Generally, FS methods remove noisy, unnecessary, and repeated features. Thus, an effective FS technique can boost the efficiency of data mining applications and various machine learning classification applications [5]. In general, FS methods can be classified into two types, wrapper-based and filter-based [6]. Wrapper-based techniques apply a classifier to evaluate candidate feature subsets, whereas filter-based methods use data-dependent criteria to evaluate the merits of the features [6, 7]. Filter-based methods are therefore faster, because they do not involve a classifier in the FS process. Obtaining a good subset of features is nonetheless challenging, and different search strategies are applied to find the best features, including depth-first search, breadth-first search, random search, and hybrid search. An exhaustive search, however, is prohibitively time-consuming for large datasets.
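To make the wrapper/filter distinction concrete, the following minimal sketch contrasts the two approaches using scikit-learn; the dataset, scoring function, and classifier here are illustrative choices, not the ones used later in this study.

```python
# Filter vs. wrapper feature selection, sketched with scikit-learn.
# Dataset and parameter choices are placeholders for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, chi2
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: scores each feature from data statistics alone; no classifier involved.
X_filter = SelectKBest(chi2, k=10).fit_transform(X, y)

# Wrapper: repeatedly trains a classifier to judge candidate feature subsets.
knn = KNeighborsClassifier(n_neighbors=5)
X_wrapper = SequentialFeatureSelector(knn, n_features_to_select=10).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)  # both reduce 30 features to 10
```

The wrapper run is noticeably slower than the filter run, which is exactly the trade-off discussed above.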

Recently, with the great development of metaheuristic (MH) optimization algorithms inspired by nature, various optimization problems, including FS, can be solved using these MH algorithms. In the literature, different MH algorithms have been employed for this purpose, such as particle swarm optimization (PSO) [8], the genetic algorithm (GA) [9], artificial bee colony (ABC) [10], the firefly algorithm (FA) [11], the grey wolf optimizer (GWO) [12], the sine cosine algorithm (SCA) [13], the salp swarm algorithm (SSA) [14], the multi-verse optimizer (MVO) [15], the Arithmetic Optimization Algorithm (AOA) [16], and others [17, 18]. However, individual MH algorithms may face severe limitations, such as slow convergence and trapping at local optima. Therefore, the hybridization concept has recently been implemented to overcome these limitations. It works by combining the operators of two MH algorithms to leverage their properties and advantages and avoid their shortcomings. Thus, in the literature, we can find various hybrid MH methods for FS, such as a hybrid of PSO and SSA [19], differential evolution (DE) and ABC [20], the grasshopper optimization algorithm (GOA) and the crow search algorithm (CSA) [21], DE and SCA [22], moth flame optimization (MFO) and DE [23], SSA and SCA [24], and many other hybrid MH methods [25].

Following the concept of MH hybridization, this study proposes a new and efficient FS technique using a version of the slime mould algorithm (SMA) modified by the marine predators algorithm (MPA). The SMA was developed in [26] as a new MH optimizer that can be utilized to solve various optimization problems; it is inspired by the oscillation mode of slime mould in nature. It has been adopted to solve several optimization problems in the literature, such as finding optimal parameters in energy applications [27, 28], air quality forecasting [29], and other engineering applications [30,31,32]. In addition, the MPA was recently proposed in [33] by simulating the behaviour of marine prey and predators. It has received wide attention due to its efficiency and has been adopted in various domains, for example, time series forecasting [34, 35], image segmentation [36], medical image classification [37], parameter estimation [38], and other applications [39, 40].

However, the performance of the SMA requires further improvement, mainly when applied to real-world applications, which motivated us to develop a new version of SMA that improves its local search process using the operators of the MPA. The main aim of using MPA operators is to enhance the exploitation ability of the SMA during the process of finding the optimal solution inside the feasible region. The MPA is applied as a local search method since its performance has been established in several applications, including forecasting cases of COVID-19 [35] and photovoltaic array reconfiguration [41].

The contribution of this study can be summarized as follows:

  • Develop a feature selection technique using an enhanced version of the SMA.

  • Boost the local search capability of the SMA using the operators of the MPA.

  • Assess the efficiency of the developed SMAMPA method using a set of twenty UCI datasets and comparing it with other FS methods.

  • Verify the applicability of the SMAMPA by implementing it on real-world applications, such as QSAR modelling.

The structure of this study is as follows. The related works are presented in Sect. 2, whereas the preliminaries of the applied techniques, SMA and MPA, are described in Sect. 3. In Sect. 4, we describe the proposed SMAMPA approach, and in Sect. 5, the experimental evaluation is presented, including different benchmark datasets and comparisons to existing methods. Finally, the conclusions and future directions are highlighted in Sect. 6.

2 Related works

In this section, we summarize a number of existing FS methods based on modified and improved optimization algorithms proposed in recent years. In [4], a modified version of the ABC algorithm, called binary ABC, is proposed for FS. The search capability of the ABC is improved using an evolutionary-based similarity search mechanism, which is integrated into the existing binary ABC variants. It was evaluated using several datasets and compared to the original PSO and ABC in addition to several modified versions of PSO and ABC. In [42], the authors suggested an FS method based on a hybrid of the Flower Pollination Algorithm (FPA) and the Clonal Selection Algorithm (CSA). The proposed BCFA was evaluated using the optimum-path forest classifier, and it showed significant performance on three different datasets. It also showed better performance in comparison to several optimization methods.

In [43], two binary variants of the whale optimization algorithm (WOA) were proposed for FS. The first variant improves the search process using Tournament and Roulette Wheel selection mechanisms. In the second variant, the exploitation of the WOA is improved by using crossover and mutation operators. Sayed et al. [44] proposed a chaotic crow search algorithm (CCSA) to overcome the limitations of the original CSA, such as trapping at local optima and a low convergence rate. The new modified version was applied as an FS method and evaluated using twenty datasets. The CCSA was also compared to different optimization techniques, achieving superior performance against several previous FS methods.

The authors in [45] suggested two binary versions of the butterfly optimization algorithm (BOA) for FS. They used two transfer functions for mapping continuous search spaces to discrete ones (a sketch of this binarization step follows this paragraph). Several UCI benchmark datasets were used to evaluate the proposed method, and extensive comparisons to some existing FS methods were performed. Evaluation outcomes showed the superior performance of the BOA. Too and Abdullah [46] proposed an FS method using a new and fast rival variant of the genetic algorithm (GA). They applied a competition strategy combining crossover schemes and a new selection operator to boost the global search ability of the GA. Twenty-three UCI benchmark datasets were utilized to test the performance of the modified GA.
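Transfer functions of this kind can be written compactly. Below is a minimal sketch of the common S-shaped (sigmoid) variant; the function name and the stochastic thresholding scheme are illustrative assumptions, not details taken from [45].

```python
import numpy as np

def s_shaped_binarize(position, rng=np.random.default_rng(0)):
    """Map a continuous position vector to a 0/1 feature mask using the
    common S-shaped (sigmoid) transfer function."""
    prob = 1.0 / (1.0 + np.exp(-position))           # squash each dimension to (0, 1)
    return (rng.random(position.shape) < prob).astype(int)

# Example: a 5-dimensional continuous solution becomes a binary feature mask.
mask = s_shaped_binarize(np.array([-2.0, 0.5, 3.0, -0.1, 1.2]))
```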

Zhang et al. [47] presented an improved variant of the Harris hawks optimization algorithm, called IHHO, for FS. The main idea of the IHHO is to apply the salp swarm algorithm to enhance the search ability of the HHO. Several UCI datasets were used to evaluate the IHHO, and it achieved competitive performance compared to several FS methods. Another modified HHO, called Chaotic HHO (CHHO), was proposed for FS by Elgamal et al. [48]. Chaotic maps are applied to improve the population diversity of the HHO in the search space. Moreover, simulated annealing (SA) is applied to the best solution to enhance the exploitation of the HHO. They used fourteen datasets to evaluate the CHHO against several optimization algorithms; overall results showed that CHHO obtained the best outcomes.

The authors of [49] proposed an FS method, called ECSA, using a modified version of the crow search algorithm (CSA). They proposed three modifications to the traditional CSA to enhance its search capability. Sixteen UCI benchmark datasets were applied to evaluate the ECSA against the traditional CSA and several existing FS methods, and the ECSA showed competitive performance in all experiments. Too and Mirjalili [6] suggested an FS method called the hyper learning binary dragonfly algorithm. They applied a hyper learning strategy to improve the binary dragonfly algorithm and avoid its limitations, such as trapping at local optima. They evaluated the proposed method using different UCI datasets and a new COVID-19 dataset. Zhong et al. [7] proposed a new FS method based on a modified Tree Growth Algorithm (TGA). A binary TGA is applied for FS applications, and the evolutionary population dynamics strategy is employed to enhance the search capability of the TGA. Different UCI benchmark datasets were utilized to test the TGA performance.

Several works in the previous review addressed FS problems by developing new methods that overcome the drawbacks of the algorithms' original versions, using benchmark and real datasets. The proposed methods showed good abilities to escape local optima, improve the convergence rate, and improve population diversity. However, no single optimization technique can solve all problems, as stated by the No-Free-Lunch (NFL) theorem. Accordingly, this paper proposes a new optimization method that improves the slime mould algorithm's local search ability using the MPA operators to solve different feature selection problems on benchmark and real datasets. This improvement helps balance the search phases and avoid local search problems such as trapping in a local optimum and a degraded convergence rate.

3 Background

This section presents the basic definitions of the SMA and MPA, as follows.

3.1 Slime mould algorithm

The SMA was first introduced in [26] as a novel mechanism for global optimization. The SMA simulates the natural oscillation behaviour of slime mould. The mathematical formulation of SMA is given as:

  1.

    Phase 1 (The food approach): This step models how the slime mould approaches food. The following equation describes this phase:

    $$\begin{aligned} Z=\left\{ \begin{array}{ll} Z_{b}+v_{b} \cdot \left( W \cdot Z_{A}-Z_{B} \right) & r<p \\ v_{c} \cdot Z & r\ge p \end{array}\right. \end{aligned}$$
    (1)

    where \(v_{b}\) is defined in the range \([-a,a]\) and \(v_{c}\) decreases from 1 to 0. \(Z_{b}\) corresponds to the best solution found so far. Additionally, \(Z_{A}\) and \(Z_{B}\) are two solutions selected randomly from the population, whereas W represents the weight of the slime mould. The parameter p is computed as:

    $$\begin{aligned} p= \tanh \left| S(i)-DF\right| , \, i=1,2,...,N \end{aligned}$$
    (2)

    In Eq. 2, S(i) corresponds to the fitness value of the i-th solution, and DF is the best fitness value obtained so far. The value a that defines \(v_{b}\) in Eq. 1 is computed as:

    $$\begin{aligned} a= \mathrm {arctanh} \left( 1-\frac{t}{max_t} \right) \end{aligned}$$
    (3)

    where t is the current iteration and \({max_t}\) is the maximum number of iterations. The value of W is obtained as follows:

    $$\begin{aligned} W(S_{Ind}(i))=\left\{ \begin{array}{ll} 1+r \log \left( \frac{b_F-S(i)}{b_F-w_F}+1\right) & Cond \\ 1-r \log \left( \frac{b_F-S(i)}{b_F-w_F}+1\right) & otherwise \end{array} \right. \end{aligned}$$
    (4)

    in which Cond denotes that S(i) ranks in the first half of the population. Moreover, \(r\in [0,1]\) is randomly generated, and \(b_F\) and \(w_F\) represent the best and worst fitness values, respectively. Finally, \(S_{Ind}\) stores the sorted fitness values, as defined in the following formula:

    $$\begin{aligned} S_{Ind}=sort(S) \end{aligned}$$
    (5)
  2.

    Phase 2 (Wrap food): In this step, the SMA imitates the position update of the slime mould, computed by the following equation:

    $$\begin{aligned} Z^{*}=\left\{ \begin{array}{ll} rand \cdot (UB-LB)+LB & rand<z \\ Z_b (t)+v_b(WZ_A(t)-Z_B (t)) & r<p \\ v_c Z(t) & r\ge p \end{array}\right. \end{aligned}$$
    (6)

    where LB and UB represent the lower and upper bounds of the search space, respectively, r and rand are random numbers drawn uniformly from [0, 1], and z is a predefined restart threshold.

  3.

    Phase 3 (Oscillation): At this step, the value of \(v_b\) is updated within \([-a,a]\) and \(v_c\) within \([-1, 1]\). A compact sketch of one full SMA generation is given below.
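Putting the three phases together, the following is a minimal Python sketch of one SMA generation for a minimization problem, following Eqs. (1)-(6). The restart threshold z = 0.03, the sampling of \(v_c\) in a shrinking interval, and the boundary clipping are common conventions assumed here rather than details fixed by this section.

```python
import numpy as np

def sma_step(Z, fitness, t, max_t, z=0.03, lb=0.0, ub=1.0,
             rng=np.random.default_rng()):
    """One SMA generation for minimization, following Eqs. (1)-(6).
    Boundary handling is simplified; assumes 1 <= t <= max_t."""
    n, d = Z.shape
    order = np.argsort(fitness)                        # best solution first
    bF, wF, Zb = fitness[order[0]], fitness[order[-1]], Z[order[0]]
    # Argument of the log in Eq. (4); epsilon guards against bF == wF.
    ratio = (fitness - bF) / (wF - bF + 1e-12) + 1.0
    r = rng.random((n, d))
    W = np.empty((n, d))
    best_half, rest = order[: n // 2], order[n // 2:]  # Cond: better half ranks
    W[best_half] = 1.0 + r[best_half] * np.log(ratio[best_half, None])
    W[rest] = 1.0 - r[rest] * np.log(ratio[rest, None])

    a = np.arctanh(1.0 - t / max_t)                    # Eq. (3)
    b = 1.0 - t / max_t                                # vc range shrinks from 1 to 0
    Z_new = np.empty_like(Z)
    for i in range(n):
        if rng.random() < z:                           # random restart branch, Eq. (6)
            Z_new[i] = lb + rng.random(d) * (ub - lb)
            continue
        p = np.tanh(abs(fitness[i] - bF))              # Eq. (2)
        A, B = rng.integers(n, size=2)                 # two random solutions
        if rng.random() < p:
            vb = rng.uniform(-a, a, d)
            Z_new[i] = Zb + vb * (W[i] * Z[A] - Z[B])  # Eq. (1), first branch
        else:
            vc = rng.uniform(-b, b, d)
            Z_new[i] = vc * Z[i]                       # Eq. (1), second branch
    return np.clip(Z_new, lb, ub)
```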

3.2 Marine predators algorithm

The MPA is a global optimization mechanism introduced in [33]. The MPA mimics the behaviour of marine predators and prey during hunting. Like other metaheuristics, the MPA begins by generating random solutions within the search space, as in Eq. 7:

$$\begin{aligned} Z=LB+rand \times (UB-LB) \end{aligned}$$
(7)

where rand is a random variable generated in the range [0, 1], and LB and UB are the lower and upper bounds that define the search space. Once the candidate solutions are generated, two matrices are formulated: the Elite matrix, built from the fittest solution (the top predator), and the prey matrix:

$$\begin{aligned} Elite=\left[ \begin{array}{cccc} Z_{11}^1 & Z_{12}^1 & \ldots & Z_{1d}^1\\ Z_{21}^1 & Z_{22}^1 & \ldots & Z_{2d}^1\\ \vdots & \vdots & \ddots & \vdots \\ Z_{n1}^1 & Z_{n2}^1 & \ldots & Z_{nd}^1 \end{array}\right], \quad Prey=\left[ \begin{array}{cccc} Z_{11} & Z_{12} & \ldots & Z_{1d}\\ Z_{21} & Z_{22} & \ldots & Z_{2d}\\ \vdots & \vdots & \ddots & \vdots \\ Z_{n1} & Z_{n2} & \ldots & Z_{nd} \end{array}\right] \end{aligned}$$
(8)

The three phases of the MPA modify the candidate solutions according to the velocity ratio between predator and prey. Each phase is described below.

  1.

    Phase 1 (High-velocity ratio): here, the prey moves much faster than the predator, so the predator stays still. This phase occurs at the beginning of the optimization process, and the movement of the prey is modeled as follows:

    $$\begin{aligned}&S_i=R_B \times (Elite_i-R_B\times Z_i), i=1,2,...,N \end{aligned}$$
    (9)
    $$\begin{aligned}&Z_i=Z_i+P\times R \times S_i \end{aligned}$$
    (10)

    in which \(R\in [0,1]\) refers to a vector of random numbers, \(P=0.5\) is a constant, and \(R_B\) is the Brownian motion vector.

  2.

    Phase 2 (Unit velocity ratio): at this phase, the velocities of the prey and the predator are the same. This case occurs in the middle of the iterative procedure. Here, the predator updates its position using Brownian movements, while the prey uses Lévy flights. In this phase, the population Z is divided into two halves: the first half is updated using Eqs. (11)-(12), and the second half using Eqs. (13)-(14).

    $$\begin{aligned} S_i=R_L \times (Elite_i-R_L\times Z_i), \, i=1,2,...,N \end{aligned}$$
    (11)
    $$\begin{aligned} Z_i=Z_i+P \times R \times S_i \end{aligned}$$
    (12)

    where \(R_L\) is generated randomly from a Lévy distribution.

    $$\begin{aligned}&S_i=R_B \times (R_B \times Elite_i- Z_i), i=1,2,...,N \end{aligned}$$
    (13)
    $$\begin{aligned}&Z_i=Elite_i+P \times CF\times S_i, \nonumber \\&\quad CF=\left( 1-\frac{t}{max_{t}} \right) ^{2\frac{t}{max_{t}}} \end{aligned}$$
    (14)

    In Eqs. 13 and 14, t and \(max_{t}\) are the current iteration and the total number of iterations, respectively.

  3.

    Phase 3 (Low-velocity ratio): within this phase, the predator moves faster than the prey. This occurs in the last third of the updating process, using Eqs. (15)-(16):

    $$\begin{aligned}&S_i=R_L \times (R_L \times Elite_i- Z_i), i=1,2,...,N \end{aligned}$$
    (15)
    $$\begin{aligned}&Z_i=Elite_i+P \times CF\times S_i \end{aligned}$$
    (16)

According to [33], the MPA includes two additional mechanisms.

  • The first is related to eddy formation and the effect of fish aggregating devices (FADs), which can modify the behavior of the predators. The MPA employs the following equation to handle these situations:

    $$\begin{aligned} Z_i=\left\{ \begin{array}{ll} Z_i+CF [Z_{min}+R \times (Z_{max}-Z_{min})]\times U & r_5 < FAD \\ Z_i+[FAD(1-r)+r](Z_{r1}-Z_{r2}) & r_5 > FAD \end{array}\right. \end{aligned}$$
    (17)

    In Eq. 17, U refers to a binary vector, \(FAD=0.2\), and \(r, r_5\in [0,1]\) are random numbers. \(Z_{r1}\) and \(Z_{r2}\) denote randomly selected prey.

  • The second is the marine memory: each solution remembers its previous position, which gives the MPA the ability to save the previous best solution \(Z_b\) and compare it with the new \(Z_b\). A condensed sketch of the MPA update is given below.
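As referenced above, the following sketch condenses Eqs. (9)-(17) into one update function. It is an illustration under simplifying assumptions: the marine-memory bookkeeping is omitted, the FADs branch is drawn once for the whole population rather than per solution, and the Lévy steps use Mantegna's algorithm with \(\beta = 1.5\).

```python
import numpy as np
from math import gamma, pi, sin

def levy_steps(shape, beta=1.5, rng=np.random.default_rng()):
    """Levy-distributed steps via Mantegna's algorithm."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.normal(0, sigma, shape) / np.abs(rng.normal(0, 1, shape)) ** (1 / beta)

def mpa_step(Z, Elite, t, max_t, lb, ub, P=0.5, FAD=0.2,
             rng=np.random.default_rng()):
    """One MPA generation following Eqs. (9)-(17); a simplified sketch."""
    Z = Z.copy()                                          # do not mutate the caller's array
    n, d = Z.shape
    CF = (1 - t / max_t) ** (2 * t / max_t)               # Eq. (14)
    RB = rng.normal(size=(n, d))                          # Brownian motion vectors
    RL = levy_steps((n, d), rng=rng)                      # Levy flight vectors
    R = rng.random((n, d))
    if t < max_t / 3:                                     # Phase 1: Eqs. (9)-(10)
        Z = Z + P * R * (RB * (Elite - RB * Z))
    elif t < 2 * max_t / 3:                               # Phase 2: Eqs. (11)-(14)
        h = n // 2                                        # prey half moves in Levy style
        Z[:h] = Z[:h] + P * R[:h] * (RL[:h] * (Elite[:h] - RL[:h] * Z[:h]))
        Z[h:] = Elite[h:] + P * CF * (RB[h:] * (RB[h:] * Elite[h:] - Z[h:]))
    else:                                                 # Phase 3: Eqs. (15)-(16)
        Z = Elite + P * CF * (RL * (RL * Elite - Z))
    if rng.random() < FAD:                                # FADs / eddy effect, Eq. (17)
        U = (rng.random((n, d)) < FAD).astype(float)
        Z = Z + CF * (lb + R * (ub - lb)) * U
    else:
        r = rng.random()
        Z = Z + (FAD * (1 - r) + r) * (Z[rng.permutation(n)] - Z[rng.permutation(n)])
    return np.clip(Z, lb, ub)
```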

4 The SMAMPA method

The SMAMPA method is described in this section. It combines the SMA and MPA algorithms to improve performance: the MPA is applied as a local search operator for the original SMA to improve its ability to solve optimization problems. This combination adds flexibility for exploring the search space and improves diversity.

The basic structure of the SMAMPA is shown in Fig. 1. It starts by defining the parameters and creating the search space by initializing the population. After this step, the best solution is determined and saved by evaluating the fitness function. Each solution is then updated by either the SMA or the MPA operators; this switching is based on the quality of the fitness function value, calculated as in Eq. (19). If the probability of the solution is greater than \(\alpha\), the solution is updated by SMA; otherwise, it is updated by MPA. In this paper, the probability threshold (\(\alpha\)) is set to 0.5. These steps are iterated for all solutions, and then the best solution among them is selected. This sequence loops until the stop condition is reached, after which the final results are presented. In detail, the SMAMPA begins by initializing the parameters of both SMA and MPA. Then the SMA generates a random binary population Z = [\(x_i, i=1, 2, \ldots, N\)] of size N and dimension D. Next, the initial fitness values are computed using the operators of the SMA. The fitness function value is calculated with Eq. (18):

$$\begin{aligned} f (x_i(t))=\xi {E}_{x_i(t)}+(1-\xi ) (\frac{|x_i(t)|}{|C|}) \end{aligned}$$
(18)

where \(E_{x_i(t)}\) defines the classification error (in this study, kNN is used as the classifier), \(|x_i(t)|\) is the number of selected features, and |C| is the total number of features. \(\xi \in [0,1]\) balances the classification error against the number of selected features. The proposed method calculates the probability (\(Pro_i\)) by Eq. 19 to decide whether a solution is updated by the operators of the SMA or the MPA (i.e., if \(Pro_i>0.5\), the SMA is used; otherwise, the MPA is used):

$$\begin{aligned} Pro_i=\frac{F_i}{\sum _{j=1}^N F_j} \end{aligned}$$
(19)

where \(F_i\) is the fitness value of solution i. These steps are iterated until the stop condition is met. In the final step, the best solution is returned as the output of the proposed method.
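As a concrete reading of Eqs. (18)-(19), the sketch below evaluates a binary feature mask with a kNN classifier and decides which operator updates each solution. The weight \(\xi = 0.99\), the 5-fold cross-validation, and k = 5 neighbours are common conventions in the FS literature assumed here, not values stated in this paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, xi=0.99):
    """Eq. (18): weighted sum of the kNN classification error and the
    fraction of selected features. xi = 0.99 and 5-fold CV are assumptions."""
    if mask.sum() == 0:
        return 1.0                                  # worst fitness for an empty subset
    knn = KNeighborsClassifier(n_neighbors=5)       # k = 5 is an assumed choice
    acc = cross_val_score(knn, X[:, mask.astype(bool)], y, cv=5).mean()
    return xi * (1.0 - acc) + (1.0 - xi) * mask.sum() / mask.size

def choose_operator(F, alpha=0.5):
    """Eq. (19): normalize the fitness values; a solution with
    Pro_i > alpha is updated by SMA, the rest by MPA."""
    pro = F / F.sum()
    return np.where(pro > alpha, "SMA", "MPA")
```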

Fig. 1

The SMAMPA structure and workflow

5 Experiment results and discussion

5.1 Performance metrics

The minimum (Min) and maximum (Max) fitness values are computed using Eqs. 20 and 21, respectively.

$$\begin{aligned} Min= \min _{1 \le k\le N} F_k \end{aligned}$$
(20)
$$\begin{aligned} Max= \max _{1 \le k\le N} F_k \end{aligned}$$
(21)

where \(F_k\) is the fitness value of the k-th run.

Accuracy: it measures the classification accuracy in the experiments and is calculated using Eq. 22.

$$\begin{aligned} Accuracy=\frac{TP+ TN}{TP + FP + FN + TN} \end{aligned}$$
(22)

where TP and TN denote true positives and true negatives, and FP and FN denote false positives and false negatives, respectively.

Standard deviation (Std): it is computed using Eq. 23 and evaluates the stability of the algorithms. The fitness function results are used to compute this measure (\({\overline{F}}\) is the mean of F).

$$\begin{aligned} Std = \sqrt{\frac{1}{{{N}}}\sum \limits _{k=1}^{{N}} {{{( {F_k - {\overline{F}}} )}^2}}} \end{aligned}$$
(23)
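The four metrics translate directly into code. In the short sketch below, `runs` is a placeholder array of best fitness values from N independent runs, and the confusion counts are illustrative values, not results from this study.

```python
import numpy as np

runs = np.array([0.112, 0.098, 0.105, 0.101, 0.110])   # placeholder fitness values

f_min = runs.min()                                     # Eq. (20)
f_max = runs.max()                                     # Eq. (21)
f_std = np.sqrt(np.mean((runs - runs.mean()) ** 2))    # Eq. (23), population form

TP, TN, FP, FN = 50, 40, 5, 5                          # illustrative confusion counts
accuracy = (TP + TN) / (TP + FP + FN + TN)             # Eq. (22)
```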

5.2 Compared techniques and parameter settings

The SMAMPA is evaluated and compared to nine recently published metaheuristic algorithms (i.e., MPA, GA, SMA, PSO, HHO, SSA, MFO, WOA, and GOA) in terms of fitness values (minimum and maximum), standard deviation, classification accuracy, and computational time. The SMAMPA method is also compared with eight advanced metaheuristic algorithms (i.e., BDA [50], BSSAS3 [14], bGWO2 [12], GLR [51], SbBOA [45], BGOAM [52], Das [53], and S-bBOA [45]).

The parameter settings of these algorithms are identical to those declared in their original studies. Table 1 presents the parameter settings of all applied methods. All algorithms are implemented in MATLAB 2015a and run on an Intel Core i7 processor (1.8 GHz, up to 2.3 GHz) with 16 GB RAM. The number of solutions is set to 30, and the maximum number of iterations is set to 500. Each algorithm is run 30 independent times, and the average results are presented in the tables.

Table 1 Parameters setting of the applied methods

5.3 Experiment series 1: UCI datasets

In this section, twenty benchmark datasets are tested to demonstrate the SMAMPA optimizer's efficiency. These datasets were taken from the UCI Machine Learning Repository [61]. Table 2 describes the tested datasets, including their numbers of features, instances, and classes. The applied datasets come from different areas, including biology, games, physics, and biomedicine.

Table 2 Detailed descriptions of the used UCI datasets

The results obtained by the given SMAMPA method in terms of the average fitness function value, as stated in Eq. (18), are recorded in Table 3. SMAMPA beats the other well-known comparative methods on 85% of the tested datasets, with the PSO algorithm as the second-best method. SMAMPA performed better than the other comparative methods on all tested datasets except Sonar, ExactlyD, and krvskpD. According to the average fitness value measure, the results demonstrate that the given SMAMPA has a promising ability to address this kind of problem.

Table 3 Results of the fitness values measure

The results given by the introduced SMAMPA in terms of minimum fitness values, as stated in Eq. (20), are recorded in Table 4. SMAMPA outperforms the other well-known comparative methods on 75% of the tested datasets, with the PSO algorithm again the second-best method. Based on minimum fitness values, SMAMPA achieved promising results on most datasets compared to the rival algorithms; it obtained better results on almost all tested datasets except glassD, WaveformD, SpectD, Exactly2D, and krvskpD. The results confirm that the proposed SMAMPA can solve different feature selection challenges according to the minimum fitness values.

The results achieved by the SMAMPA for the maximum fitness values, as declared in Eq. (21), are shown in Table 5. SMAMPA outperforms the other comparative methods on 95% of the tested datasets, with PSO again the second-best method. Except for the ExactlyD dataset, the proposed SMAMPA achieved better performance than the other comparative approaches on all tested datasets. The outcomes demonstrate that the proposed integration of the SMA and MPA search processes has a powerful ability to deal with complicated feature selection problems.

Figure 2 displays the average, minimum, and maximum fitness values for the comparative methods over all used datasets. It can be seen that the developed SMAMPA reached the best results in terms of all three measures (average, minimum, and maximum fitness values). SMAMPA obtained the smallest values under all measures on the tested datasets, which is strong evidence of the ability of SMAMPA to solve FS problems. The proposed modification proved its search ability by finding better solutions than the original SMA and MPA, and it also achieved the best outcomes among the comparative algorithms.

Table 4 Min measure results
Table 5 Results of the Max measure
Fig. 2

Average error values for all algorithms

Table 6 displays each algorithm's accuracy values over all used datasets. The proposed SMAMPA achieved the highest accuracy values on 95% of the tested datasets, followed by PSO. The PSO obtained the best values on three datasets (i.e., ExactlyD, Exactly2, and M-of-n). In general, the SMAMPA exhibited an excellent ability to select the most vital features in the selection stage and produce the highest accuracy values in the classification stage. Figure 3 illustrates the average accuracy values for all methods. We can see that the proposed method achieved the highest accuracy values compared to all comparative techniques; this supports our claim that the proposed SMAMPA works more efficiently than the traditional methods as well as the other comparative algorithms. The second-best method is the PSO algorithm, which obtained more reliable results than the rest of the comparison techniques on these widespread problems.

Table 6 Results of the Accuracy measure
Fig. 3

Average accuracy for all algorithms

Table 7 displays each algorithm's Std of the fitness function values over all given datasets. The proposed SMAMPA obtained the most stable results, according to the Std values, on 50% of the tested datasets, followed by PSO, WOA, HHO, GA, and finally the MPA. This result indicates that the SMAMPA's stability is better than that of the other comparative methods. The distribution of the obtained results is smaller than that of the other comparative methods over all tested datasets. Figure 4 illustrates the average Std of the fitness function values for all compared methods. It is clear that the suggested SMAMPA obtained the smallest Std values compared to all comparative techniques; this again supports our claim regarding the performance of the proposed SMAMPA, as it achieves promising results with a low spread and similar outcomes across a wide range of executions. The next-best method is the PSO algorithm, followed by HHO.

Table 7 Results of the Std measure
Fig. 4

Average standard deviation for all algorithms

Table 8 lists the numbers of features selected by all tested methods, showing the length of the optimal feature subsets obtained by the comparative techniques. Investigating the results, SMA produced the smallest feature subset on ten datasets, followed by SMAMPA (six datasets). Compared with MPA, SMA, GA, HHO, PSO, SSA, WOA, MFO, and GOA, the SMAMPA can typically find a minimal subset of features that adequately represents the data, as shown in Fig. 5. Owing to the MPA operators, SMAMPA can escape the local optima problem and reliably identify the most useful feature subset.

Table 8 Evaluation of the number of selected features on all benchmark datasets
Fig. 5

Selected features of each optimization method

According to the computational time given in Table 9 and Fig. 6, the proposed SMAMPA required comparable computational time to solve the given problems. In these experiments, the evaluation measures, such as accuracy, matter more than runtime for the FS problem, because the problem only needs to be solved once.

Table 9 Results of the computational time
Fig. 6

Average computational time of all algorithms

5.4 Comparison with the state-of-the-art

This part evaluates the SMAMPA further by comparing it with different advanced and well-known methods published in the literature. These methods are BDA [50], BSSAS3 [14], bGWO2 [12], GLR [51], SbBOA [45], BGOAM [52], Das [53], and S-bBOA [45].

Table 10 shows all tested methods on various benchmark datasets. The values of the comparative methods in this table are taken from their original papers, and the \(``-''\) sign denotes that no results were reported for that case. The proposed SMAMPA obtained better results on 70% of the tested datasets according to the given values. It achieved the best results on almost all tested datasets except ionosphereD, BreastcancerD, LymphographyD, ExactlyD, Exactly2D, and VoteD. The next-best method is BDA, which achieved the best results in 53% of the cases (this method reports results for 15 datasets), followed by BSSAS3.

Table 10 Accuracy comparison between SMAMPA and the other methods in the literature

To recap, SMAMPA shows more reliable exploration than the other comparative optimization techniques, investigating areas of the search space that the other tested algorithms failed to reach. Moreover, the proposed SMAMPA maintains solution diversity remarkably better than other feature selection methods. Besides, SMAMPA consistently obtained superior fitness values, proving its ability to evade local optima, whereas the other methods may quickly fall into the local optima problem. The number of selected features also indicates that SMAMPA has stronger exploration capability than the other comparative algorithms, as shown by its selection of fewer features over the tested benchmark datasets.

5.5 Experiment series 2: real-world quantitative structure-activity relationship application

In this section, we evaluate the ability of the proposed method to select the most relevant features on real-world problems. Quantitative structure-activity relationship (QSAR) modelling is a mathematical framework in chemometrics for explaining the structural relationship between chemical compounds and biological activity [62,63,64,65]. QSAR modelling is conducted here to study the proposed algorithm and verify its effectiveness. Six high-dimensional datasets are adopted. The first dataset concerns inhibitors of influenza A viruses (H1N1). Influenza is an RNA virus that causes a respiratory infection, a dangerous illness associated with high rates of mortality and morbidity. The influenza virus has two main glycoproteins on its surface: neuraminidase and haemagglutinin. Thus, compounds that block neuraminidase can prevent host cells from becoming infected and stop the virus from spreading across cells. According to IC50, this dataset contains two classes: active compounds (IC50 < 20 \(\mu \hbox {M}\)) and weakly active compounds (IC50 > 20 \(\mu \hbox {M}\)). This dataset consists of 2644 features and 479 instances [66].
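A hypothetical sketch of the labelling rule just described for the H1N1 dataset follows; the column name and values are invented for illustration and are not the published descriptor names.

```python
import pandas as pd

# Hypothetical labelling by the IC50 cut-off described in the text:
# active if IC50 < 20 uM, weakly active otherwise.
df = pd.DataFrame({"IC50_uM": [3.2, 45.0, 18.7, 120.5]})
df["label"] = (df["IC50_uM"] < 20).map({True: "active", False: "weakly active"})
```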

The second dataset represents the anti-hepatitis C virus activity (hepatitis). Hepatitis C virus (HCV)-related liver conditions are among the most prevalent medical issues in the world today. The compounds employed were thiourea derivatives with anti-hepatitis C virus action. This dataset contains 2952 features and 121 instances. According to EC50, the compounds were split into two sets, active and inactive, for EC50 < 0.1 \(\mu \hbox {M}\) and EC50 \(\ge\) 0.1 \(\mu \hbox {M}\), respectively [67].

The third dataset, called Chalcone, relates to a wide range of antibiotics with unique bioactivities against Candida albicans. The minimum inhibitory concentration (MIC) against C. albicans in mM/L was used to measure the antibacterial activities, expressed as pMIC, the logarithm of the reciprocal of MIC. Based on the bioactivity distribution over the entire dataset, the median of all 212 pMIC values (1.30) was taken as the cut-off to categorize these antimicrobial drugs into two groups: 108 active compounds with pMIC values greater than 1.30, and the remaining 104 inactive compounds. The fourth, fifth, and sixth datasets are publicly available in the UCI repository [61].

Table 11 Description of the real-world datasets

The results of the proposed algorithm, SMAMPA, are evaluated in terms of classification accuracy, number of selected features, and standard deviation (Std). All results are summarized in Tables 12, 13, 14, and 15.

Table 12 shows the superiority of the proposed algorithm compared to other well-known algorithms. SMAMPA can be described as stable on most datasets, except the Biodeg dataset, on which the GA performs better.

As can be seen from Table 13, the proposed algorithm, SMAMPA, achieves significantly higher accuracy. These results demonstrate that the reduction in features contributes to higher accuracy than that obtained by the other algorithms. In terms of the selected features (Table 14), it can be seen that the SMAMPA obtained better values than the compared methods, selecting fewer features with high classification accuracy. Regarding the Std in Table 15, the SMAMPA algorithm achieved the lowest Std on the H1N1, OralToxicity, and AndrogenReceptor datasets and was the most stable algorithm. On the hepatitis and Chalcone datasets, SMAMPA presented results competitive with the other algorithms. In general, the SMAMPA algorithm can be considered stable. From the above analysis, the SMAMPA method showed a strong ability to select the essential features with high accuracy and good stability.

Table 12 Real application: Average of the fitness functions values
Table 13 Real application: The accuracy percentage
Table 14 Real application: Selected features number
Table 15 Real application: The standard deviation values

6 Conclusion and future work

This study developed a new feature selection (FS) method by enhancing the original slime mould algorithm (SMA). We leverage the exploitation ability of the marine predators algorithm (MPA), which works as a local search method for the proposed approach. The modified version, namely SMAMPA, was evaluated on twenty well-known UCI benchmark datasets using different evaluation metrics. Moreover, it was compared to the traditional SMA, the MPA, and several state-of-the-art optimization methods. The developed SMAMPA showed superior performance over several original and modified optimization algorithms. Furthermore, to verify the efficiency of the SMAMPA on more complicated and high-dimensional real-world problems, six datasets related to chemometrics were used. Evaluation outcomes also showed the high performance of the SMAMPA, which obtained the best results compared to the other optimization algorithms. Given the superior results of the developed SMAMPA, future work could investigate it on more complicated problems, such as multi-objective optimization, big data mining, and medical image processing.