1 Introduction

The Software Development Life Cycle (SDLC) represents the phases that software passes through while it is being developed. It starts with requirements elicitation, followed by the analysis and design of the collected requirements. After that, the programmers develop the proposed software based on the analysis and design phases. A vital phase in the SDLC is software testing. This phase follows the development phase and consists of a set of activities that assure the team is developing the right software at a high level of quality [1]. Numerous testing types are available to test various aspects of a software product, including but not limited to unit testing, component testing, integration testing, regression testing, and user acceptance testing. Many software development methodologies are available to development teams; the most popular SDLC models are the waterfall, agile, and spiral models.

The testing stage plays an essential role in the development process. It is usually performed within a traditional linear model (e.g., waterfall) or a cyclic model (e.g., agile). The testing process is concerned with enhancing software quality and reducing the total cost [2, 3]. However, many factors affect the results of the testing process, such as limited resources (e.g., time or software testers). Therefore, early-stage procedures such as Software Fault Prediction (SFP) are utilized to facilitate the testing process in an optimal way [4]. In SFP, the faulty components of the software are detected prior to system deployment, in the early stages of the SDLC. This is achieved by utilizing software fault datasets collected from previous projects or predefined software metrics. It is worth mentioning that the SFP process became more straightforward after the adoption of the agile software development (ASD) model in 2001 [5] as a replacement for the waterfall model, which was introduced in 1970 [6]. Adopting the ASD methodology has many benefits since the software is developed incrementally. Moreover, ASD opens the door to accommodating volatile requirements, optimizing resources (time and cost), bridging the gap between the development team and business owners [7], and performing software engineering tasks such as review, maintenance, and testing on a regular basis [2].

The early prediction of faults in software components such as modules and classes has a significant impact on reducing the time and effort needed to deliver the project outcomes to the end-user. SFP is one of the approaches that helps optimize the development process by reducing the number of potential faults in the early stages of the SDLC [4]. Various SFP approaches have been recorded in the literature; the main ones include, but are not limited to, Soft Computing (SC) and Machine Learning (ML) [8]. These methods need data to be able to predict software faults. Design features (metrics) gathered during the design stage, and historical fault datasets accumulated during the implementation of previous versions of similar projects, are two essential resources to be used with SFP approaches for benchmarking [9].

Various types of metrics, such as method-level and class-level metrics, have been proposed for the SFP problem [10]. Method-level metrics can be collected from source code written in either the structured or the object-oriented programming paradigm. The Halstead [11] and McCabe [12] metrics are the most common method-level measures used by researchers. Class-level metrics are only appropriate when developing SFP models for object-oriented projects. Examples of class-level metric suites for object-oriented design are CK (Chidamber-Kemerer) [13], L&K (Lorenz-Kidd) [14], and the quality metrics for object-oriented design (QMOOD) [15]. In comparison with the other suites, the CK metrics are the most commonly applied when class-level metrics are chosen [10].

Automated systems have become available in almost all fields of real life. With the advancement of software development and the availability of large-scale projects, analyzing the collected software metrics has become complicated and poses a significant challenge. Thus, ML techniques have been proposed as SFP solutions and have shown good performance [16]. The main purpose of these techniques is to predict the faulty components of software based on the supplied datasets. Examples of ML techniques that have been used as SFP approaches are K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF) [4, 17, 18].

Among the various ML models, ensemble learning has shown excellent performance in dealing with various complex classification problems [19]. Ensemble learning combines a number of ML models to create an ensemble learner that improves performance by providing a more general and robust model. RF is a well-regarded ensemble technique originally introduced by Breiman [20]. In RF, a number of DT classifiers are fitted on various sub-samples of the dataset, and the outputs of all the trees are combined. RF has several merits that make it superior to other traditional ML models: it controls the over-fitting problem of DTs, reduces the variance within the forest, and thus enhances predictive accuracy [21, 22].

The performance of ML-based SFP approaches depends mainly on two factors: the applied ML technique and the quality of the utilized dataset (in terms of noise, irrelevant features, and imbalanced representation of data) [23]. Therefore, dimensionality reduction (e.g., feature selection) and data resampling (e.g., the Synthetic Minority Oversampling Technique (SMOTE)) are needed before applying the ML technique. Identifying the relevant features can be framed as a feature selection problem, which can be tackled by feature selection techniques.

In feature selection (FS), problems with a high-dimensional feature space increase the hardness of the search process. In general, various search strategies, including complete, random, and heuristic search, are available for searching the feature space to obtain the optimal subset of features [24]. A complete search requires generating and evaluating all possible subsets of features; for a set of m features, \(2^{m}\) feature subsets must be formed. For example, if the given problem has four features, sixteen subsets of features will be produced. In a random search, the next candidate solution (subset of features) is generated randomly, while heuristic search strategies conduct the search in an adaptive way and generate possible solutions (feature subsets) for the problem [24,25,26,27].

Recently, metaheuristics have been widely used by the research community as successful FS methods. Metaheuristics are conventionally categorized, based on the number of initial solutions, into population-based and trajectory-based methods [28]. A trajectory-based metaheuristic starts with a single solution; the search follows a trajectory in the search space based on local modifications of the current solution until a local optimum is reached. Examples of these methods are tabu search [29], β-hill climbing [30], stochastic local search [31], and variable neighborhood search [31]. In contrast, a population-based algorithm starts with a population of individuals; iteratively, the population passes on its strong elements in order to come up with an optimal solution. Normally, population-based algorithms are classified into evolutionary algorithms (EAs) and swarm intelligence approaches. The Genetic Algorithm (GA) [32], Genetic Programming (GP) [33], and Differential Evolution (DE) [34] are the base EAs for feature selection. A swarm-based algorithm is normally built on the idea of a group of solutions whose members are divided into leaders and followers. The Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) algorithms are the base swarm intelligence methods. Quite recently, several swarm intelligence methods have been proposed for FS, such as PSO [35], the Salp Swarm Algorithm (SSA) [36, 37], the Dragonfly Algorithm (DA) [38], the Rat Swarm Optimizer [39], the Ant Lion Optimizer (ALO) [40], Harmony Search [41], the Coronavirus Herd Immunity Optimizer [42], ACO [43], the β-hill climbing optimizer [44], the Crow Search Algorithm (CSA) [45], the JAYA algorithm [46], the Firefly Algorithm (FFA) [47], the Artificial Bee Colony (ABC) algorithm [48], the Coyote Optimization Algorithm (COA) [49], and the Grasshopper Optimization Algorithm (GOA) [50], as well as hybrids of these methods, such as Genetic-Whale-Ant Colony algorithms [51], the Grey Wolf Optimizer with Random Forest [52], a hybrid Salp Swarm Algorithm [53], and genetic and coral reefs algorithms [54]. There are also real opportunities to adapt newly established optimization algorithms to FS problems, such as the starling murmuration optimizer [55], the quantum-based avian navigation optimizer [56], the farmland fertility algorithm [57], the African vultures optimization algorithm [58], and the artificial gorilla troops optimizer [59].

The Whale Optimization Algorithm (WOA) is a recent swarm intelligence algorithm that imitates the behavior of humpback whales hunting fish in the ocean [60]. It has impressive characteristics compared with other optimization methods: it has few control parameters, is easy to implement, has a simple structure, and is able to maneuver to find a suitable balance between local exploitation and global exploration. Due to these successful attributes, WOA has been widely utilized to deal with feature selection problems [25, 61,62,63,64,65]. The original version of WOA was designed to handle continuous search spaces. In this paper, to match the binary search space of the FS problem, WOA is equipped with eight transfer functions from the S-shaped and V-shaped families. According to the No Free Lunch theorem [66], there is no single superb optimization algorithm that can excel over all others on all optimization problems; therefore, the opportunity remains to investigate modifications of efficient methods to handle SFP and improve algorithm efficiency.

In this paper, a systematic SFP approach that considers several ML techniques with different pre-processing methods is proposed. The major contributions are summarized as follows:

  • Several pre-processing and re-sampling techniques are applied to prepare the SFP datasets to be suitable for the ML techniques.

  • Various classification techniques, namely KNN, LDA, SVM, LR, DT, NB, and RF, are applied. Their performance is compared in the same environment in order to adopt one technique for further experiments. As a result, the RF classifier is adopted at this stage.

  • A dimensionality reduction method based on a binary version of WOA is utilized to eliminate irrelevant/redundant features and enhance the performance of the RF classifier. The newly proposed method, called BWOA, evaluates eight transfer functions and adopts the one that yields the best results.

  • An enhanced WOA version (EWOA) is introduced, where exploration strategies from the Grey Wolf Optimizer (GWO) and Harris Hawks Optimization (HHO) algorithms are used to enhance the diversity of the WOA. By means of this enhancement mechanism, the performance of WOA is improved to deal more efficiently with the search space of the FS problem. This yields a superior optimization framework for the faulty-software prediction problem.

The newly proposed EWOA reveals very successful outcomes in terms of choosing the most informative features in the area of SFP. The findings prove that the classification performance can be significantly improved by removing useless features. The performance is compared with nine state-of-the-art methods, and the comparison shows the viability of the proposed method in terms of accuracy, number of selected features, and fitness values.

The remainder of the paper is structured as follows: a review of the related works is presented in Section 2. In Section 3, a theoretical background of the related aspects to this paper is introduced. Section 4 presents the proposed methodology. The experimental design and the obtained results are discussed in Section 5. Finally, Section 6 includes a conclusion about the main findings of this paper in addition to some future work directions.

2 Related works

Recently, different ML approaches have been considered to solve the SFP problem with remarkable success [4]. Accordingly, different datasets (e.g., the PROMISE repository, NASA datasets, and the Qualitas corpus) have become publicly available to researchers [4, 67]. This section presents the most relevant related work in the field of SFP. A general overview of the SFP techniques is provided, and the related ML approaches are investigated, followed by the related work on ML approaches enhanced by preprocessing techniques such as feature selection.

2.1 ML based SFP

Different supervised and unsupervised ML techniques have been applied as prediction models in SFP. Examples of the ML techniques that have been used for SFP are: SVM [68], DT [69], Bayesian Networks (BN) [70], NB [71], KNN [72], Multi-Layer Perceptron (MLP) [73], Artificial Neural Networks (ANN) [2, 74], LR [75], Multinomial Logistic Regression (MLR) [73], RF [76], and ensemble MLP [77].

Singh and Malhotra [68] conducted an empirical study to evaluate the performance of an SVM classifier in determining the relationship between some object-oriented (OO) design metrics and fault proneness. A dataset from the NASA repository (KC1) and Receiver Operating Characteristic (ROC) analysis were used to evaluate the proposed model. The study shows that the SVM classifier is feasible and helpful in predicting faulty classes in OO-based systems.

Moreover, Cahill et al. [78] introduced a data representation approach named Rank Sum to improve the performance of fault proneness prediction modules. The proposed approach was evaluated by applying the well-known ML classifiers SVM and NB to various datasets from the NASA repository. It was found that NB performed better than the SVM classifier. Erturk and Sezer [79] introduced an SFP model that combines Fuzzy Inference System (FIS) and Artificial Neural Network (ANN) classifiers. FIS was applied at the beginning of the project to make predictions depending on expert opinion, because it does not need historical data for prediction, and then ANN was employed in the later iterations when some data about the software project became obtainable. The proposed iterative system was tested using a set of datasets, including various versions of many projects from the PROMISE repository. The selected datasets consist of common OO metrics such as coupling between objects, response for a class, and weighted methods per class. The evaluation of the results using ROC analysis with the area under the curve (AUC) shows that the iterative module is capable of locating fault-prone modules in the software.

An approach named the multi-strategy classifier (RB2CBL) was introduced by Khoshgoftaar et al. [80] for the SFP problem, where a Rule-Based (RB) classifier was hybridized with two variants of the Case-Based Learning (CBL) model. Moreover, an embedded GA was utilized to optimize the parameters of the CBL models. The experimental results reveal that the proposed RB2CBL classifier is superior to the RB model alone. Carrozza et al. [81] proposed a new set of software metrics for detecting Mandelbugs in complex software systems. In addition, considering both the newly proposed metrics and the conventional software metrics, several algorithms, including DT, SVM, BN, NB, and MLR, were applied to various datasets from the NASA repository. The authors reported that MLR and SVM are the best among all examined algorithms in finding Mandelbug-prone modules.

A model based on the principle of ensemble learning was employed by Rathore and Kumar [82] to predict software faults, in which linear regression-based combination rule (LRCR) and gradient boosting regression-based combination rule (GBRCR) approaches were used to combine the outputs of Genetic Programming (GP), MLP, and LR algorithms. Moreover, eleven datasets belonging to six software projects were accumulated from the PROMISE data repository to assess the performance of the proposed ensemble models. Results of different performance evaluation measures, including Average Absolute Error (AAE) and Average Relative Error (ARE), provided evidence that ensemble techniques can produce better results for predicting software faults than individual fault prediction techniques. Choudhary et al. [83] defined a set of change metrics, in addition to the existing ones, to enhance the performance of SFP modules.

Various ML classifiers were applied along with code metrics and change metrics. Experimental results on different releases of Eclipse projects demonstrate that the newly introduced change metrics can improve the performance of fault prediction modules. In [84], Shatnawi used ROC analysis to examine the relationship between software metrics (features) and faults, where threshold values of the metrics were identified accordingly. A threshold value is defined for each metric to decide whether a software module is faulty or not. Moreover, the results of the ROC analysis were also considered for selecting the metrics most correlated with faults. Only the selected metrics were used to train and test a set of ML classifiers (LR, NB, KNN, and the C4.5 decision tree).

From the previously investigated related work, researchers confirmed that having a considerable number of features in a dataset affects the performance of the ML technique. Therefore, many researchers have considered dimensionality reduction methods to eliminate the irrelevant/redundant features from the datasets. The most popular dimensionality reduction technique is FS.

2.2 Preprocessing ML methods

FS is a well-known preprocessing step in the data mining process that aims to eliminate noisy, irrelevant, and redundant features to reduce data dimensionality and, hence, improve the performance of the employed ML technique [24, 85]. In many works in the field of SFP, different filter and wrapper FS approaches have been investigated. Catal and Diri [86] employed a correlation-based feature selection approach to select the most relevant metrics for various ML techniques (i.e., RF, DT, NB, and AIRS). They found that FS positively affected the performance of the employed ML approaches and that the RF classifier outperformed the other classification techniques. In [87], eighteen filter FS methods were employed on five datasets from the NASA repository with various classification techniques. The obtained results revealed that using FS enhanced the performance of the prediction models.

As presented in [88], a set of filter FS methods, including Chi-square (CS), information gain (IG), and the Pearson Correlation Coefficient (PCC), was used to develop a hybrid feature selection method to improve the performance of Software Defect Prediction (SDP). In the hybrid FS method, the features were ranked and selected according to their values under these filter ranking methods. In addition, for comparison purposes, each of the three filter methods was applied separately. Using five NASA datasets to validate the FS method and an RF classifier to build the prediction model, the AUC results show that the hybrid FS approach is superior to the individual filter FS methods.

Moreover, many wrapper FS methods have been applied in the SFP field. A GA-based FS approach with a bagging technique was proposed in [89]. In this approach, two preprocessing techniques were used: FS (i.e., GA) and resampling (i.e., bagging). A similar approach was proposed in [90], in which two metaheuristic algorithms (i.e., GA and PSO) were applied as selection mechanisms in the FS process, in addition to a bagging technique used to rebalance the nine employed datasets.

Another wrapper FS approach was recently proposed by Turabieh, Mafarja, and Li [2]. The authors used several FS approaches to improve the efficiency of a Layered Recurrent Neural Network (L-RNN) classifier in predicting faulty software components. Three metaheuristic algorithms (i.e., GA, PSO, and ACO) were considered as FS approaches. A set of extensive experiments was conducted, and the performance of the proposed approach was compared with several ML classifiers (i.e., NB, LR, ANN, C4.5 DT, and KNN), with the area under the curve (AUC) as the evaluation measure. The AUC results confirmed that the proposed wrapper approach outperforms the other approaches. In a related work, a Binary Queuing Search Algorithm (BQSA) was proposed for the first time in the SFP literature. Moreover, the SMOTE technique was applied to rebalance the datasets, which were obtained from the PROMISE repository. The presented results revealed the positive effects of dimensionality reduction and resampling techniques on the obtained datasets.

In 2020, Tumar et al. [3] proposed a modified binary Moth Flame Optimization algorithm named Enhanced Binary MFO (EBMFO) as a wrapper FS approach for SFP, along with the Adaptive Synthetic sampling method (ADASYN) as a resampling technique. Three ML classifiers were used (i.e., LDA, KNN, and DT), and the results confirmed that the performance of these classifiers was improved by the preprocessing techniques. Recently, an FS approach based on the Harris Hawks Optimization algorithm (called EBHHO) was proposed for the SFP field in [91]. Again, the obtained results proved the positive influence of the employed preprocessing techniques on the used ML classifiers.

From the previously mentioned approaches, it is clear that preparing the datasets by employing preprocessing techniques (e.g., FS and resampling) greatly influences the performance of the prediction model in the SFP problem. It can be concluded that SFP becomes feasible for large-scale projects when proper preprocessing techniques are used. These observations, together with the No-Free-Lunch (NFL) theorem for optimization [66], which states that there is no single best classifier for all possible classification problems [92], motivated our attempt to propose an advanced SFP approach that considers the RF classifier as the ML technique, improved by SMOTE as a resampling technique, along with an advanced wrapper FS approach that uses a novel WOA variant as the search strategy.

3 Preliminaries

This section briefly describes the main theoretical concepts utilized in this research: the RF classifier, the oversampling technique (i.e., SMOTE), feature subset selection, and the WOA used to tackle the FS problem. In the SFP problem, the aim is to predict fault-prone software modules in the early stages of the SDLC based on the design metrics of the software project. SFP is considered a binary classification problem since each software component has two options in the target class: faulty or non-faulty. Several supervised classification paradigms can be used to tackle this problem. After conducting deep experimental studies, we adopted the RF classifier in this research.

3.1 Random Forest classification paradigm

Random Forest (RF) is a classification algorithm that was initially proposed in [20]. RF combines decision trees in a model called ensemble learning, where each tree in the forest depends on an independently sampled vector of random values. The main advantages of the RF classifier are its few tunable parameters, its high generalization capability, and its shorter training time compared with other classifiers, regardless of the size of the dataset [93]. Ensemble learners combine different classification algorithms to obtain a generalized model that enhances prediction performance, so that the number of wrongly classified instances is very low. Ensemble methods can be divided into three families, namely bagging, boosting, and stacking methods.

RF classifiers use the bagging method. As demonstrated in Fig. 1, a random sample of the original training dataset is provided to each model by applying a bootstrap re-sampling technique. After applying the bagging process, x features are selected randomly from the full feature set, and one of them is selected as a split node. The splitting process is repeated, with a fresh selection of features, until a specified depth d is reached and a decision tree is completed [94]. After the multiple splits, a random forest of decision trees is constituted. Every new instance is passed to all trees in the forest, and each tree predicts a class label (termed a vote); then the majority voting strategy is applied to select the class label for this instance.

Fig. 1
figure 1

Ensemble learning (bagging method)
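To make this concrete, the following sketch trains a bagged forest with scikit-learn (the library used in this work); the synthetic dataset is an assumption standing in for an SFP dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for an SFP dataset: 20 features mirror the 20 OO metrics used here.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=42)  # 66%/34% split as in Section 5

rf = RandomForestClassifier(
    n_estimators=10,      # number of bagged trees in the forest
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # each tree trains on a bootstrap re-sample
    random_state=42)
rf.fit(X_train, y_train)

# Each tree casts a vote; predict() returns the majority-vote class label.
print("test accuracy:", rf.score(X_test, y_test))
```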

3.2 Data sampling for imbalanced classification

The quality of data is considered a significant factor that has a profound impact on the performance of ML techniques. Imbalanced datasets are recognized as a challenging aspect that may degrade the prediction quality of classification methods. This issue emerges in most real-world problems in which the target classes are not represented equally. In other words, in binary-class datasets, most of the instances are labeled with the first class (called the majority class), while few of them are labeled with the other one (called the minority class). In such a case, the classifier is trained on highly imbalanced data and thus tends to pick up the patterns of the majority class, which leads to inaccurate prediction of the minority class [95].

The class imbalance problem poses a significant challenge in the field of software defect prediction since the available datasets are highly imbalanced. That is to say, the occurrences of defective cases are very low compared to normal cases (see Fig. 3). Various strategies can be employed to handle this problem, such as cost-sensitive, kernel-based, and sampling methods [95]. Sampling methods are categorized into two types: oversampling, which increases the rate of the minority class, and under-sampling, which reduces the frequency of the majority class. The latter causes information to be lost, which leads to poor prediction quality. In this research, we utilized an oversampling technique called SMOTE to rebalance the used SFP datasets.

SMOTE is a promising oversampling method that has proven its superiority in dealing with imbalanced data. It was originally introduced by Chawla et al. [96]. This technique preserves the original data without losing information, and it increases the rate of the minority class without duplication. For each minority sample \(x_{i}\), new synthetic samples (\(\hat{x}_{ij}\), j = 1, 2, ..., k) labeled with the minority class are generated using the k-nearest neighbors method with the Euclidean distance. The new synthetic samples are generated along the lines joining the minority sample and its selected neighbors, as in (1).

$$ x_{new} = x_{i} + (\hat{x}_{i} - x_{i} ) * r $$
(1)

where r is a random vector between 0 and 1, \(\hat {x}_{i}\) denotes one of the k neighbors. The value of k depends on the desired amount of oversampling.
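To illustrate (1), the following toy sketch generates one synthetic sample from a minority instance and one of its neighbors (an illustration of the interpolation step only, not the full SMOTE algorithm):

```python
import numpy as np

x_i = np.array([0.2, 0.5, 0.1])        # a minority-class sample
x_hat = np.array([0.3, 0.4, 0.2])      # one of its k nearest minority neighbors
r = np.random.default_rng(0).random()  # random number in [0, 1]

# Eq. (1): interpolate along the line joining x_i and its neighbor.
x_new = x_i + (x_hat - x_i) * r
print(x_new)
```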

3.3 Feature Selection (FS)

One of the most common questions when applying ML algorithms is whether all features (factors) are relevant to the classification rule. In response to this question, a problem called FS emerged. FS is defined as the process of reducing the dimension of the data by eliminating irrelevant, noisy, and redundant features. In other words, it is the task of searching for the most informative subset of features. It is an essential pre-processing technique that aims to enhance the performance of ML tasks [25, 26, 97].

FS approaches are classified into wrapper and filter methods based on the evaluation function used to assess the selected subset of features [98]. In wrapper-based methods, a search algorithm (deterministic or heuristic) is employed to generate subsets of features for examination. The effectiveness of each suggested subset of features is then evaluated by a given classifier (learning algorithm). The evaluation is conducted in terms of several measures, such as accuracy or the area under the ROC curve. FS is treated as a binary optimization problem in which the search algorithm is guided by the error reported by the classifier [99].

In the filter-based approach, the learning algorithm is not involved in the evaluation function. The effectiveness of a subset of features is evaluated based on the intrinsic properties of the data: statistical measures are used to quantify the dependency or correlation between features, which can then be filtered to select the most informative ones. Several ranking techniques have been introduced for feature evaluation, such as the gain ratio and information gain [100]. The filter-based approach is more efficient than the wrapper-based approach in terms of the required computational time.

In this paper, we propose a wrapper FS approach that considers WOA as a selection mechanism and RF classifier as an evaluation method. In the following subsection, the WOA is introduced, followed by the description of the enhanced approach of the original WOA.

3.4 An overview of the WOA

WOA is a recent Swarm Intelligence (SI) algorithm that mimics the behavior of humpback whales hunting fish in the ocean. The hunting process starts by constructing bubble nets to constrict the prey, and then the whale swims towards the prey in a spiral shape to attack. According to [60], WOA can balance its stochastic exploratory and exploitative tendencies effectively. In the exploitation phase, WOA simulates the encircling mechanism of the whales in nature, where the prey represents the best solution found so far and the other solutions in the population represent the candidate whales. The whales change their positions by moving toward the prey's location as modeled in (2) and (3):

$$ D = \mid \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \mid $$
(2)
$$ \vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot D $$
(3)

where t represents the current iteration, \(\vec{X}^{*}\) represents the prey's location (the best solution found so far), and \(\vec{X}\) represents the location of the candidate solution (whale). The vectors \(\vec{A}\) and \(\vec{C}\) are defined in (4) and (5).

$$ \vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a} $$
(4)
$$ \vec{C} = 2 \cdot \vec{r} $$
(5)

where \(\vec{r}\) is generated randomly in the interval [0, 1], and \(\vec{a}\) simulates the shrinking encircling behavior of the whales by decreasing linearly from 2 to 0 as in (6).

$$ a=2\left( 1-\frac{t}{T}\right) $$
(6)

According to (7), which represents the bubble-net attacking process (exploitation phase), a solution's position is changed based on one of two different approaches: shrinking encircling (when p < 0.5) or the spiral updating mechanism (when p ≥ 0.5), where a probability of 50% is used to select between these two approaches.

$$ \vec{X}(t+1)= \left\{\begin{array}{ll} \vec{X}^{*}(t)- \vec{A} \cdot D & p< 0.5\\ D^{\prime} \cdot e^{bl} \cdot \cos(2\pi l)+\vec{X}^{*}(t) & p \geq 0.5 \end{array}\right. $$
(7)

where \(D^{\prime}\) represents the distance between the ith solution and the prey's location, b is a constant that defines the shape of the logarithmic spiral, and l is a random number in the interval [-1, 1].

Based on the variation of \(\vec{A}\), a solution is forced to move towards or away from the best solution. If \(|\vec{A}| < 1\), the solution is moved towards the prey's location (exploitation), while it is moved towards a randomly selected solution from the population (represented as \(\vec{X}_{rand}\) in (8) and (9)) when \(|\vec{A}| \geq 1\) (exploration).

$$ D = \mid \vec{C} \cdot \vec{X}_{rand}(t) - \vec{X}(t) \mid $$
(8)
$$ \vec{X}(t+1) = \vec{X}_{rand}(t) - \vec{A} \cdot D $$
(9)
Algorithm 1
figure a

Pseudo-code of WOA.

As with all population-based metaheuristic algorithms (MAs), WOA starts the optimization process by generating N random solutions, each of which represents a whale. Then, each solution is evaluated using the adopted fitness function. The solution with the lowest fitness value is denoted as the best solution, since the problem at hand is a minimization problem, and the coefficients of the algorithm are calculated. The search then proceeds with the parameter a decreasing from 2 to 0. Each solution is updated based on the value of \(\vec{A}\): it is moved towards a randomly selected solution from the current population when \(|\vec{A}| \geq 1\), and towards the best solution when \(|\vec{A}| < 1\). The WOA switches between a spiral and a circular movement based on the value of p. The pseudo-code of WOA is shown in Algorithm 1.
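For concreteness, a minimal continuous WOA sketch implementing the update rules above is given below (a simplified illustration in Python, not the exact implementation used in this paper; in particular, the per-dimension vector \(\vec{A}\) is reduced to a single exploration/exploitation test):

```python
import numpy as np

def woa(fitness, dim, n_whales=20, max_iter=100, lb=-1.0, ub=1.0, b=1.0):
    """Minimal continuous WOA following Eqs. (2)-(9)."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, (n_whales, dim))   # random initial whales
    fit = np.apply_along_axis(fitness, 1, X)
    best = X[fit.argmin()].copy()              # prey = best solution so far

    for t in range(max_iter):
        a = 2 * (1 - t / max_iter)             # Eq. (6)
        for i in range(n_whales):
            A = 2 * a * rng.random(dim) - a    # Eq. (4)
            C = 2 * rng.random(dim)            # Eq. (5)
            if rng.random() < 0.5:             # shrinking encircling
                # Exploit the best whale, or explore a random whale (Eq. 9).
                ref = best if np.all(np.abs(A) < 1) else X[rng.integers(n_whales)]
                X[i] = ref - A * np.abs(C * ref - X[i])
            else:                              # spiral updating, Eq. (7)
                l = rng.uniform(-1, 1)
                X[i] = np.abs(best - X[i]) * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lb, ub)
        fit = np.apply_along_axis(fitness, 1, X)
        if fit.min() < fitness(best):
            best = X[fit.argmin()].copy()
    return best

# Example: minimize the sphere function in 5 dimensions.
print(woa(lambda x: np.sum(x ** 2), dim=5))
```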

4 The proposed methodology

The main objective of this paper is to build a well-performing classification model that is able to predict faulty software components. The datasets were selected from the PROMISE repository and normalized to set a proper scale for all data. Moreover, resampling techniques were applied to balance the data and obtain more accurate results. Then, further experiments were conducted to select the most appropriate classifier for this problem, followed by extensive experiments to tune the parameters of the selected classifier. In the last phase, a set of wrapper feature selection methods was applied to enhance the performance of the adopted classifier. Figure 2 represents the proposed methodology.

Fig. 2
figure 2

Software fault prediction process

4.1 Preprocessing techniques

Data preprocessing is a vital step in the mining process. It aims to prepare the dataset to be suitable for the mining techniques in order to achieve high performance. The datasets are visualized in 2D using Principal Component Analysis (PCA), as shown in Fig. 3. The figure demonstrates that the datasets are highly imbalanced and not linearly separable. Therefore, all datasets should be balanced before a learning algorithm is adopted in order to obtain proper results. Moreover, a complex learning algorithm is required to provide better performance, because the data are not linearly separable.

  • Data Normalization: The collected datasets are complete, with no missing data. Their structure is well-suited for mining, and all attributes are numeric. However, the numeric data are of different scales. Therefore, to avoid bias towards some dominant features, the min-max normalization method (as can be seen in (10)) was applied to scale the data into the interval [0, 1]. A sketch of both preprocessing steps is given after Fig. 3.

    $$ x^{n} = \frac{x - \min}{ \max -\min} $$
    (10)

    where \(x^{n}\) is the normalized value of x, and min and max are the minimum and maximum values of the corresponding attribute.

  • Data Balancing: After investigating the adopted datasets, we noticed that they are highly imbalanced, as the rate of faulty instances is very low compared to normal ones (see Fig. 3). Thus, the datasets should be balanced before using them with the classification technique to avoid any decrease in performance. In this paper, we applied three variants of the SMOTE oversampling technique (i.e., SMOTE, Borderline SMOTE, and SVM SMOTE) to select the one that most positively affects the performance of the learning algorithm.

Fig. 3
figure 3

Visualization of target class distribution based on the first 2 principal components of the dataset features
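A minimal sketch of these two preprocessing steps, assuming scikit-learn and the imbalanced-learn library (the variant names match those compared in Section 5.3):

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from sklearn.preprocessing import MinMaxScaler

def preprocess(X, y, variant="SMOTE"):
    """Min-max normalize (Eq. 10), then rebalance with a SMOTE variant."""
    X = MinMaxScaler().fit_transform(X)  # scale every feature into [0, 1]
    sampler = {"SMOTE": SMOTE(random_state=0),
               "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
               "SVMSMOTE": SVMSMOTE(random_state=0)}[variant]
    return sampler.fit_resample(X, y)    # oversample the faulty class
```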

4.2 Classifier selection

Investigating the data visualization in Fig. 3, it can be seen that the data in most datasets are not linearly separable. Thus, simple classifiers may not be suitable to handle this kind of problem. Therefore, we conducted extensive experiments to compare the performance of different classifiers on the same datasets and under the same computational system.

4.3 Binary variant of WOA (BWOA)

As mentioned earlier, FS is a binary optimization problem, while WOA was originally designed to deal with continuous optimization problems. This requires a conversion function that converts the continuous solutions to binary to make them suitable for binary problems. We used the Transfer Functions (TFs) that are widely used to convert continuous metaheuristic populations to binary [101, 102]. TFs can be categorized based on their shapes [103] into S-shaped and V-shaped functions (see Fig. 4). The proposed binary WOA for FS is called BWOA.

Fig. 4
figure 4

Transfer functions families (a) S-shaped and (b) V-shaped

The S-shaped TF [104] was first used to convert the continuous PSO algorithm into binary. The TF produces the probability of flipping an element's value from 0 to 1 or from 1 to 0, as in (11). It takes as input the elements of the step vector (solution x) generated by the algorithm.

$$ S({x_{i}^{j}}(t))=\frac{1}{1+e^{-{x_{i}^{j}}(t)}} $$
(11)

where \({x_{i}^{j}}\) represents the jth element in the ith solution x, and t indicates the current iteration. An element of a solution in the next iteration is updated by (12).

$$ {X_{i}^{j}}(t+1)= \left\{\begin{array}{ll} 0 & \text{if } rand<S({x_{i}^{j}}(t+1))\\ 1 & \text{if } rand\geq S({x_{i}^{j}}(t+1)) \end{array}\right. $$
(12)

where \({X_{i}^{j}}(t+1)\) is the binary value of the real-valued \({x_{i}^{j}}\), and \(S({x_{i}^{j}}(t))\) is the probability value obtained via (11).

Another TF, belonging to the V-shaped family [105], was used to convert the continuous Gravitational Search Algorithm (GSA) into binary. Equation (13) represents the V-shaped TF, and (14) represents the rule used to convert to binary.

$$ V({x_{i}^{j}}(t))=|\tanh({x_{i}^{j}}(t))| $$
(13)
$$ {X_{i}^{j}}(t+1)= \left\{\begin{array}{ll} \neg {X_{i}^{j}}(t) & r<V({\Delta} {x_{i}^{j}}(t+1))\\ {X_{i}^{j}}(t) & r\geq V({\Delta} {x_{i}^{j}}(t+1)) \end{array}\right. $$
(14)

In this paper, eight TFs were adopted to convert the original WOA into binary: the original S2 TF proposed in [104], the V2 TF proposed in [105], and the six TFs proposed in [103]. The mathematical formulation of all TFs is shown in Table 1.

Table 1 S-shaped and V-shaped transfer functions
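As an illustration of how a TF binarizes a continuous WOA position, the following sketch implements a generic sigmoid S-shaped TF and the tanh V-shaped TF of (13), together with the update rules of (12) and (14); the exact eight TFs are those listed in Table 1, so the two used here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def s_shaped(x):            # sigmoid TF of Eq. (11)
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):            # V-shaped TF of Eq. (13)
    return np.abs(np.tanh(x))

def binarize_s(x_cont):
    """Eq. (12): threshold the S-shaped probability with a random number."""
    return np.where(rng.random(x_cont.shape) < s_shaped(x_cont), 0, 1)

def binarize_v(x_cont, x_bin):
    """Eq. (14): flip each current bit with probability V(x)."""
    flip = rng.random(x_cont.shape) < v_shaped(x_cont)
    return np.where(flip, 1 - x_bin, x_bin)

x = rng.normal(size=8)                        # a continuous WOA position
print(binarize_s(x))                          # S-shaped binary solution
print(binarize_v(x, np.zeros(8, dtype=int)))  # V-shaped update of all-zeros
```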

4.4 Formulation of FS problem

One of the critical aspects that should be considered when designing an optimization algorithm is the fitness function. Since FS is a multi-objective optimization problem, both objectives should be considered when evaluating a solution: the suitability of a feature subset is determined by the number of selected features (to be minimized) and the classification accuracy (to be maximized). Aggregation is one of the most popular techniques for multi-objective formulation [106]. In this technique, the objectives are integrated into a single-objective formula in which a preset weight identifies each objective's importance. In this work, we adopt a fitness function that combines both objectives of FS, as shown in (15).

$$ Fitness(X)=\alpha \cdot E(X) +\beta \cdot \frac{|R|}{|N|} $$
(15)

where Fitness(X) represents the fitness value of a subset X, E(X) represents the classification error rate obtained using the selected features in the subset X, |R| and |N| are the number of selected features and the number of original features in the dataset, respectively, and α and β are the weights of the classification error and the feature reduction term, with α ∈ [0, 1] and \(\beta = (1-\alpha)\), adopted from [36, 97, 107].

Another aspect that should be considered when designing an optimization approach to tackle an FS problem is the solution representation. In this research, a feature subset is represented as a binary vector of N elements, where N is the total number of features in the original dataset. Each dimension in the vector contains a binary value (0 or 1): a 0 indicates that the corresponding feature is not selected in the feature subset, while a 1 indicates that it is selected. Figure 5 shows a sample solution for a dataset of N features.

Fig. 5
figure 5

Binary solution representation
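A sketch of this wrapper evaluation, pairing the binary mask of Fig. 5 with the RF classifier and the fitness of (15), is shown below (the error weight α = 0.99 is a common choice in the cited works and is assumed here; the inputs are NumPy arrays):

```python
from sklearn.ensemble import RandomForestClassifier

def fitness(mask, X_train, y_train, X_test, y_test, alpha=0.99):
    """Eq. (15): weighted sum of RF error rate and feature-selection ratio."""
    if mask.sum() == 0:          # guard: an empty feature subset is invalid
        return 1.0
    cols = mask.astype(bool)     # binary vector of Fig. 5 selects the columns
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    rf.fit(X_train[:, cols], y_train)
    error = 1.0 - rf.score(X_test[:, cols], y_test)
    return alpha * error + (1 - alpha) * mask.sum() / mask.size
```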

4.4.1 Enhanced BWOA (BEWOA)

The performance of the binary version of WOA (BWOA) can be improved to deal with the complex nature of the SFP problem more efficiently, because the SFP search space is high-dimensional and rugged. Therefore, the enhanced version of BWOA (BEWOA) borrows efficient exploration strategies from GWO [108] and HHO [109]. The flowchart of the proposed BEWOA is illustrated in Fig. 6.

Fig. 6
figure 6

The flowchart of the proposed BEWOA

We propose a new approach that employs multiple exploration strategies instead of a single one, as follows:

  • The exploration strategy used in GWO, where the best three solutions are selected and the position is updated based on the values of the A and C parameters, as shown in (16)-(19).

    $$ \vec{X}_{1} = \vec{X}_{\alpha} - \vec{A}_{1} . |\vec{C}_{1} . \vec{X}_{\alpha} - \vec{X}(t)| $$
    (16)
    $$ \vec{X}_{2} = \vec{X}_{\beta} - \vec{A}_{2} . |\vec{C}_{2} . \vec{X}_{\beta} - \vec{X}(t)| $$
    (17)
    $$ \vec{X}_{3} = \vec{X}_{\delta} - \vec{A}_{3} . |\vec{C}_{3} . \vec{X}_{\delta} - \vec{X}(t)| $$
    (18)
    $$ \vec{X}(t+1) = \frac{\vec{X}_{1} + \vec{X}_{2} +\vec{X}_{3}}{3} $$
    (19)

    Note that \(\vec{X}_{\alpha}\), \(\vec{X}_{\beta}\), and \(\vec{X}_{\delta}\) are the best, second-best, and third-best solutions in the population. The vectors \(\vec{C}_{1}\), \(\vec{C}_{2}\), and \(\vec{C}_{3}\) are calculated based on (5), while the vectors \(\vec{A}_{1}\), \(\vec{A}_{2}\), and \(\vec{A}_{3}\) are calculated based on (4).

  • The exploration strategy used in HHO, where the average of all positions in the population guides the solution being updated, as shown in (20)-(21).

    $$ \vec{X}(t+1) = r \cdot \vec{X}^{*}(t) - \vec{X}_{avg}(t) $$
    (20)
    $$ X_{avg}(t)=\frac{1}{N}\sum\limits_{i=1}^{N}X_{i}(t) $$
    (21)

Algorithm 2 shows the new modifications. Instead of updating the position of the current solution based on the WOA exploration rule ((9) in Algorithm 1), the position of the current solution in BEWOA is updated based on either the GWO update (19) or the HHO update (20).

The main rationale of this enhancement is to improve the exploration stage of the original WOA by employing the survival-of-the-fittest principle rather than random search. Since the search space of SFP is very deep, the search requires more guided exploration to reach the best possible solution. Therefore, concentrating on the best solutions in the search space regions can help the search converge to the optimal solution quickly.

In order to provide more insight into the proposed BEWOA, its time complexity is analyzed. As can be noticed from Algorithm 2, it has several statements with loops. The statement in Line 1 needs \(\mathcal{O}(N \times n)\), where N is the population size and n is the solution dimension. In Line 3, the complexity depends on the complexity of the evaluation function, as calculated in (15). For the main loop from Line 5 to Line 18, the time complexity is \(\mathcal{O}(L \times N \times n)\), where L is the number of iterations. Therefore, the time complexity of the proposed method is \(\mathcal{O}(LNn)\), which is similar to that of the original WOA.

Algorithm 2
figure b

Pseudo-code of the proposed BEWOA.
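A sketch of the modified exploration step (replacing the random-walk exploration of the original WOA) is given below; choosing between the two borrowed strategies with equal probability is an assumption for illustration, and the binarization of Section 4.3 is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def gwo_explore(X, fit, i, a):
    """Eqs. (16)-(19): move whale i towards the three best solutions."""
    leaders = X[np.argsort(fit)[:3]]            # X_alpha, X_beta, X_delta
    candidates = []
    for leader in leaders:
        A = 2 * a * rng.random(X.shape[1]) - a  # Eq. (4)
        C = 2 * rng.random(X.shape[1])          # Eq. (5)
        candidates.append(leader - A * np.abs(C * leader - X[i]))
    return np.mean(candidates, axis=0)          # Eq. (19)

def hho_explore(X, best):
    """Eqs. (20)-(21): move relative to the best whale and the population mean."""
    return rng.random() * best - X.mean(axis=0)

def bewoa_explore(X, fit, best, i, a):
    # Multi-exploration: pick one of the two borrowed strategies at random.
    if rng.random() < 0.5:
        return gwo_explore(X, fit, i, a)
    return hho_explore(X, best)
```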

5 Experimental results and discussion

A set of experiments is conducted in this research to demonstrate the efficiency of the proposed approach. This section presents the results in detail. The experiments are carried out in four phases, as follows:

  • First, different SMOTE variants are adopted to address the imbalanced dataset issue.

  • Second, the hyper-parameter settings of the RF classifier are investigated, and those parameters that provided the optimal results are selected.

  • Third, the performance is tested using seven popular machine learning methods (RF, KNN, NB, LDA, DT, LR, and SVM).

  • Fourth, a novel BEWOA approach is introduced as a feature selection method to tackle the data dimensionality problem. In addition, the proposed approach is compared with nine state-of-the-art methods.

In the experiments, the train/test model is adopted. The datasets are split randomly into two parts: 66% is used for training, while the remaining 34% is used for blind testing. Due to the random factors in the classification algorithm (i.e., RF) and the optimization technique (i.e., WOA), we report the average values of 30 runs for each employed method. Please note that we use boldface to represent the best results. All the experiments are conducted on one computer under the same environment and conditions to guarantee a fair comparison. The details of the utilized system are presented in Table 2. It is worth mentioning that the Python programming language is used to implement the proposed classification framework, together with open-source libraries (e.g., Pandas, NumPy, Matplotlib, and Scikit-learn).
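A sketch of this evaluation protocol (a 66/34 split averaged over 30 runs) is shown below; the `run_method` callback stands in for any of the compared pipelines and is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, run_method, n_runs=30):
    """Average a method's test score over 30 independent random splits."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.66, random_state=seed)  # 66% train / 34% test
        scores.append(run_method(X_tr, y_tr, X_te, y_te))
    return np.mean(scores)
```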

Table 2 The system properties

5.1 Datasets: investigated software projects

In this work, we adopt 16 well-known datasets related to the SFP problem to assess the performance of the proposed approach. The datasets are collected from the PROMISE repository [110, 111] and have been recommended by many researchers in the field. The adopted datasets consist of several software metrics (e.g., object-oriented metrics) that are usually employed to investigate the quality of a software project (see Table 4 for more information about these metrics) [112]. The details of the adopted datasets are presented in Table 3. It can be seen that these datasets have various sizes (from 109 to 909 instances), and all of them have 20 object-oriented metrics.

Table 3 Details of the 16 software projects (datasets) from PROMISE repository
Table 4 Description of object-oriented metrics

From Table 3, it can be noticed that most of the datasets are imbalanced: the rate of defective instances is much lower than the rate of normal instances. In some datasets, defective instances make up only 2.2% of all instances. This observation motivates us to use an oversampling technique to rebalance the datasets before applying the ML approaches. Table 4 illustrates the main features provided in the datasets of Table 3.

5.2 Evaluation measures

Since SFP is a binary classification problem, we use the confusion matrix to calculate the evaluation measures. Table 5 represents the confusion matrix, followed by the equations for calculating the evaluation metrics. The True Positive Rate (TPR) is calculated according to (22), the True Negative Rate (TNR) is calculated as in (23), and the Area Under the Curve (AUC) is calculated based on (24). Since the available datasets are imbalanced, we do not consider accuracy as an evaluation measure because it can be profoundly misleading when judging a model. A short sketch computing these measures follows the list below.

Table 5 Confusion matrix for binary classification
  • TPR: The percentage of the positive cases that are correctly classified as positive.

    $$ TPR = {TP}/({TP + FN}) $$
    (22)
  • TNR: The percentage of negative cases that are correctly classified as negative.

    $$ TNR = {TN}/({TN + FP}) $$
    (23)
  • AUC: a measure of how well a model can distinguish between the defective and normal groups.

    $$ AUC = (TPR+TNR)/2 $$
    (24)
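The following sketch computes these measures from the confusion matrix with scikit-learn (the labels are toy values for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # toy labels (1 = faulty)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)      # Eq. (22)
tnr = tn / (tn + fp)      # Eq. (23)
auc = (tpr + tnr) / 2     # Eq. (24)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}, AUC={auc:.2f}")
```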

5.3 Handling imbalanced data using different SMOTE variants

In this section, we are interested in assessing the effect of rebalancing the datasets on the RF classifier. Three SMOTE variants (namely: SMOTE, BorderlineSMOTE, and SVMSMOTE) are applied, and the AUC, TPR, and TNR results are reported in Table 6. Observing the results in Table 6, one can conclude that SMOTE recorded the best performance. Moreover, the prediction quality of the RF classifier has been enhanced after rebalancing the datasets. Without applying the SMOTE technique, the TPR was very high compared with the TNR. However, after applying the SMOTE techniques, the results became more reasonable, and the TNR became more realistic.

Table 6 Evaluation results of RF classifier using different SMOTE variants in terms of testing TPR and TNR

Table 7 displays the AUC rates obtained by the RF classifier using different SMOTE variants. Inspecting the results in Table 7, it is clear that the AUC rates with the SMOTE techniques were much better than those without. Among the SMOTE variants, BorderlineSMOTE was the best (rank of 1.88), followed by SMOTE (rank of 2.06). Our findings indicate that BorderlineSMOTE was the most suitable technique for rebalancing the datasets. In sum, applying a resampling technique is important for imbalanced datasets, since it can significantly improve the performance of the classifier. Accordingly, we apply the BorderlineSMOTE technique to all datasets in the subsequent experiments.

Table 7 AUC rates obtained by RF classifier using different SMOTE variants

5.4 Random forest hyperparameter tuning

This section investigates the hyperparameter settings of the RF classifier. As is well known, the performance of the RF classifier is highly affected by the used parameter values. Therefore, we conducted a set of comprehensive experiments to find the most appropriate parameters, those that best reveal the performance of the classifier.

The following hyperparameters of the RF classifier were examined:

  • n_estimators: number of trees in the forest

  • max_depth: max number of levels in each decision tree

  • min_samples_split: min number of data points placed in a node before the node is split

  • min_samples_leaf: min number of data points allowed in a leaf node

  • bootstrap: a method for sampling data points (with or without replacement)

  • max_samples: If bootstrap is True, the number of samples to draw from X to train each base estimator

Table 8 presents the hyperparameter settings of the RF classifier. As can be observed, we test the RF classifier using multiple combinations of parameters. Initially, we tune n_estimators while fixing the other parameters. The process of hyperparameter tuning is then repeated, replacing n_estimators with max_depth, min_samples_split, min_samples_leaf, bootstrap, and max_samples, respectively.

Table 8 Hyperparameter tuning of the RF classifier

Figure 7 illustrates the training and testing curves of the AUC score for different hyperparameter settings. With proper tuning of these parameters, the training AUC score increases substantially. However, in some cases, an optimal training AUC may lead to over-fitting, which yields a very low AUC score on the testing set. In this regard, we evaluate the hyperparameters using the testing AUC score rather than the training AUC in the present work. Based on the results obtained, the optimal performance of the RF classifier can be achieved with n_estimators= 10, max_depth= 7, min_samples_split= 2, min_samples_leaf= 1, bootstrap=TRUE, and max_samples= 1.

Fig. 7
figure 7

The performance of RF model after adjusting the considered set of hyperparameters in terms of training and testing curves of AUC score
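These tuned settings correspond to the following scikit-learn configuration (a sketch; note that max_samples controls the bootstrap sample size, so the reported value of 1 is read here as drawing the full sample, the library default):

```python
from sklearn.ensemble import RandomForestClassifier

rf_tuned = RandomForestClassifier(
    n_estimators=10,      # number of trees in the forest
    max_depth=7,          # max number of levels per decision tree
    min_samples_split=2,  # min data points in a node before splitting
    min_samples_leaf=1,   # min data points allowed in a leaf node
    bootstrap=True,       # sample data points with replacement
    max_samples=None,     # None = draw n_samples per tree (full sample)
)
```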

5.5 Comparison with other classification techniques

After performing the hyperparameter tuning of the RF classifier, we compare it with a set of well-known classification algorithms (i.e., KNN, NB, LDA, DT, LR, and SVM) in solving SFP problems. We adopt this set of classifiers because they come from different categories of supervised ML techniques, and each one differs from the others in its structure. For example, the KNN classifier does not build an explicit model; it predicts the output based on the distance between training and testing samples. Also, LR, NB, and LDA are easy-to-implement algorithms since they do not contain extra parameters.

In this study, the min-max normalization technique is applied to all datasets, in addition to the oversampling technique. After investigating the performance of each classifier, a deep comparison in terms of AUC rates among the classifiers is presented in Table 9. In order to obtain the overall rank, the average ranking values of the Friedman test (F-Test) are calculated.

Table 9 Comparison of RF against traditional classifiers in terms of testing AUC rates

From Table 9, it is observed that the RF classifier performed better than other classifiers in most cases. Among rivals, RF obtained the highest AUC scores in 63% of the datasets while the KNN, which came in second place, obtained the best results in 25% of the datasets. Besides, the results of the F-Test reveal that the RF classifier outperformed other classifiers with the rank of 1.81, and it was ranked as the best classifier in terms of the AUC measurement. Ultimately, it can be inferred that the RF classifier is a useful and powerful learning algorithm for SFP problems.

5.6 Feature selection based on proposed BWOA approaches

In the previous sections, the impact of the re-sampling technique on this work was presented. Accordingly, the RF classifier achieved the best AUC value in most of the SFP problems. Nevertheless, the presence of irrelevant and redundant features might degrade the performance of the classifier, which limits the learning ability of the algorithm in SFP prediction. As such, feature selection is an excellent way to resolve this issue. In this work, we propose a new variant of the BWOA algorithm as a feature selection method to select the significant features.

5.6.1 Performance of BWOA using different TFs

In this first sub-section, we study the performance of the BWOA algorithm with different TFs. Following [61], we adopt four S-shaped TFs (S1 to S4) and four V-shaped TFs (V1 to V4). In total, eight binary versions of WOA, namely SBWOA1, SBWOA2, SBWOA3, SBWOA4, VBWOA1, VBWOA2, VBWOA3, and VBWOA4, are introduced. Table 10 reports the AUC results of the BWOA using the different TFs. From Table 10, it can be seen that SBWOA1 achieved the optimal AUC score in at least 11 datasets. Our findings reveal that SBWOA1 outperformed the other variants in selecting the most informative feature subset. This argument is supported by the results of the F-Test in Table 10. By observing the results in Table 11, one can see that SBWOA1 was not very good at reducing the feature size. However, SBWOA1 can often retain the significant features that best describe the target classes, which benefits the learning process. In Table 12, SBWOA1 again ranked first (rank of 2.41) in terms of the fitness value. All in all, we can conclude that the BWOA with the S-shaped transfer function S1 is the most appropriate for SFP problems.

Table 10 Evaluation results of BWOA using 8 transfer functions in terms of AUC rates
Table 11 Evaluation results of BWOA using 8 transfer functions in terms of number of selected features
Table 12 Evaluation results of BWOA using 8 transfer functions in terms of fitness values

5.6.2 Performance of enhanced WOA

After investigating the best TF for the BWOA algorithm, we inspect the performance of the proposed SBEWOA (the enhanced WOA with S-shaped TF S1) against SBWOA (the original binary WOA with S-shaped TF S1) on all datasets. Table 13 outlines the performance comparison of SBEWOA and SBWOA in terms of AUC rate, feature size, and fitness value. From Table 13, the proposed SBEWOA achieved the optimal AUC rate and fitness for most of the datasets (15 datasets). As for the feature size, SBEWOA consistently selects a minimal number of features during the selection process. In comparison with SBWOA, SBEWOA is more capable of finding the positive features that contribute to the highest AUC rate.

Table 13 Comparison of SBEWOA versus the conventional SBWOA in terms of AUC rates, number of selected features, and fitness values

The foremost cause of the improved efficiency of SBEWOA is that it employs multiple exploration strategies rather than the random operator in the global search phase. Thus, in case of premature convergence, the search agents can explore untried areas and escape the local optimum. Besides, SBEWOA enables the search agents to learn from the best three solutions, thereby increasing the chance of exploring the promising areas.

5.6.3 Population diversity analysis

In this sub-section, a diversity measurement is considered to check the behavior of the proposed method during the search. Maintaining diversity during the search enables the optimization algorithm to escape local optima and enhances the final solution. To measure the diversity, the Hamming distance between the population members in each iteration of the proposed method (i.e., SBEWOA) as well as the original method (i.e., SBWOA) is recorded and plotted, as shown in Fig. 8. As can be seen from the figure, the enhanced version of BWOA (i.e., SBEWOA) is able to preserve good diversity during the search, with the Hamming distance steadily decreasing as the search converges towards the optimal results. In contrast, SBWOA exhibits almost stable Hamming distance results during the search and converges slowly. From Fig. 8, one can see that SBEWOA maintained good diversity on all datasets with a smaller Hamming distance. At the initial stage, SBEWOA keeps exploring the search regions to find the global minimum; it then searches locally for the local optimum. Even though SBEWOA was less robust than SBWOA, it can offer better solutions that provide high AUC performance.

Fig. 8 The population diversity curves of BWOA and BEWOA

5.6.4 Deep analysis of the modifications

This sub-section presents a deeper analysis of the impact of each modification on the proposed method. Table 14 presents the fitness values obtained by the proposed modifications. In Table 14, SBWOA-G denotes the modification using the GWO algorithm (Eq. 19), SBWOA-H refers to the modification using the HHO algorithm (Eq. 20), and SBEWOA is the modification using both the GWO and HHO algorithms. Based on the results obtained, the SBEWOA achieved the lowest fitness in most datasets. The convergence analysis in Figs. 9 and 10 again verifies the convergence power of the SBEWOA in finding a near-optimal feature subset. The AUC performance in Table 15 shows that the SBEWOA can usually score the highest AUC values, followed by SBWOA-H.
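For reference, the fitness minimized in wrapper FS studies of this kind is typically a weighted sum of classification error and feature-subset size; the weights below (alpha = 0.99, beta = 0.01) are a common convention and an assumption on our part, not necessarily the paper's exact values.

def fs_fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    # Weighted feature-selection fitness: lower is better.
    # alpha weights the classification error (1 - AUC could serve here),
    # beta weights the fraction of features kept; alpha + beta = 1.
    return alpha * error_rate + beta * (n_selected / n_total)

# Example: 0.12 error with 8 of 20 metrics kept.
print(fs_fitness(0.12, 8, 20))  # -> 0.1228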

Table 14 Impact of proposed modifications in terms of fitness results
Fig. 9 Convergence analysis of basic and modified versions on camel-1.0, camel-1.4, camel-1.6, jedit-3.2, jedit-4.0, and jedit-4.1 datasets

Fig. 10 Convergence analysis of basic and modified versions on log4j-1.0, log4j-1.1, lucene-2.0, xalan-2.4, xalan-2.5, and xalan-2.6 datasets

Table 15 Impact of proposed modifications in terms of AUC results

On the other hand, the proposed SBEWOA did not show the best results in feature reduction. From Table 16, the number of features chosen by the SBEWOA was slightly higher than that of BGWO. However, the feature subsets produced by the SBEWOA were able to maintain high AUC results.

Table 16 Impact of proposed modifications in terms of number of features

5.6.5 Comparison of SBEWOA with other optimizers

In this sub-section, we further compare the performance of the proposed SBEWOA with nine state-of-the-art feature selection methods: the binary firefly algorithm (BFFA), binary moth-flame optimization (BMFO), binary multi-verse optimizer (BMVO), binary grey wolf optimization (BGWO), binary bat algorithm (BBAT), binary cuckoo search (BCS), binary Harris hawks optimization (BHHO), binary Jaya algorithm (BJAYA), and the genetic algorithm (GA). In this study, we adjusted the optimizers' parameters in accordance with the settings recommended in the original publications and related feature selection studies [99, 113,114,115,116]. The parameter configurations used in this paper are outlined in Table 17.

Table 17 Parameter settings for the optimization algorithms

Table 18 presents the AUC rates of the SBEWOA and the other methods. As can be seen, the SBEWOA outperformed its competitors by scoring the highest AUC rates in 12 datasets, achieving the best (lowest) rank of 1.47. The second-best algorithm was BHHO (rank of 2.41), followed by the BMFO method (rank of 3.59). The Wilcoxon test results in Table 19 show that the performance of the SBEWOA was significantly better than that of the other methods for most of the datasets.
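The significance check in Table 19 corresponds to a paired, two-tailed Wilcoxon signed-rank test over per-dataset AUCs; a minimal sketch with hypothetical scores follows.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
auc_sbewoa = rng.uniform(0.80, 0.95, size=16)   # hypothetical per-dataset AUCs
auc_rival = auc_sbewoa - rng.uniform(0.0, 0.05, size=16)

stat, p = wilcoxon(auc_sbewoa, auc_rival)       # two-tailed by default
print(f"W = {stat:.1f}, p = {p:.4f}")           # p <= 0.05 -> significant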

Table 18 Comparison between the proposed SBEWOA and other optimizers based on AUC rates
Table 19 2-tailed P-values of the Wilcoxon signed-rank test based on the AUC results reported in Table 18 (P-values ≤ 0.05 are shown in bold and indicate significance)

Table 20 reports the number of selected features for the SBEWOA and the other methods, while the results of the Wilcoxon test based on the number of selected features are shown in Table 21. As can be observed, the BBAT outperformed the other methods in minimal feature selection, while the SBEWOA ranked 8th with an F-Test rank of 7.72. Our findings reveal that the SBEWOA was not particularly strong at reducing the number of features. However, it is worth noting that an algorithm that excels at feature reduction may degrade the performance of the classifier through the extensive elimination of positive features (see Tables 18 and 20). The SBEWOA, by contrast, is excellent at maintaining the positive features while removing the negative ones. Hence, the SBEWOA can often offer better AUC rates when dealing with SFP problems.

Table 20 Comparison between the proposed SBEWOA and other optimizers based on the number of selected features
Table 21 2-tailed P-values of the Wilcoxon signed-rank test based on the number of features reported in Table 20 (P-values ≤ 0.05 are shown in bold and indicate significance)

Table 22 shows the fitness values of the SBEWOA and the other methods. From Table 22, the SBEWOA contributed the best fitness values in 11 datasets. Besides, the SBEWOA often offers consistent results, as indicated by its smaller standard deviation. Compared with the other methods, the SBEWOA retained the best rank of 1.69, exhibiting a better search tendency when dealing with FS problems in SFP analysis. The Wilcoxon test results in Table 23 support these arguments. Table 24 shows the computational times of the proposed SBEWOA and the other optimizers. Based on the results obtained, GA was the fastest algorithm in terms of running time. Although the SBEWOA is not the fastest algorithm, it runs faster than BGWO, BBAT, BCS, and BJAYA in many cases. This argument is further supported by the statistical results shown in Table 25.

Table 22 Comparison between the proposed SBEWOA and other optimizers based on fitness values
Table 23 2-tailed P-values of the Wilcoxon signed-rank test based on the fitness results reported in Table 22 (P-values ≤ 0.05 are shown in bold and indicate significance)
Table 24 Comparison between the proposed SBEWOA and other optimizers based on running time
Table 25 2-tailed P-values of the Wilcoxon signed-rank test based on the running time results reported in Table 24 (P-values ≤ 0.05 are shown in bold and indicate significance)

Figures 11 and 12 illustrate the convergence curves of the SBEWOA for all datasets. It can be seen that the SBEWOA showed an excellent acceleration rate in most cases. Taking the jedit-3.2 and lucene-2.0 datasets as examples, the SBEWOA converged faster and deeper toward the global minimum. The results imply that the SBEWOA offers good convergence ability in solving the FS problem, thus leading to satisfactory achievements.

Fig. 11 Convergence curves for compared algorithms on ant-1.7, camel-1.2, camel-1.4, camel-1.6, jedit-3.2, and jedit-4.0 datasets

Fig. 12 Convergence curves for compared algorithms on jedit-4.1, log4j-1.0, log4j-1.1, lucene-2.0, xalan-2.5, and xalan-2.6 datasets

Based on the results obtained, it can be inferred that the proposed SBEWOA is a powerful and useful FS tool for SFP analysis. The superior performance of the SBEWOA can be attributed to the multi-exploration strategies which take full advantage of the best three leaders in exploring the feature spaces. Moreover, the utilization of the average of all positions and the current best position allows the search agents to explore the untried regions effectively. In case of premature convergence, the search agents can escape the local optimum and seek out promising solutions. Hence, SBEWOA can usually achieve better results than other methods.

5.6.6 Relevant features selected by SBEWOA

In the final sub-section, we investigate the top features for SFP analysis. Table 26 details the features selected by the SBEWOA that scored the best AUC results. From Table 26, it can be noticed that different sets of features were chosen by the SBEWOA on different datasets. Table 27 outlines the number of times each feature was selected by the SBEWOA algorithm, and the importance of the features is illustrated in Fig. 13. Accordingly, the top five features were ca (80.938%), dit (77.188%), moa (75.938%), max_cc (75.625%), and mfa (75%). In contrast, the least frequently selected features were wmc and lcom (54.063%).
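The selection frequencies behind Table 27 and Fig. 13 can be derived as in the sketch below, which counts how often each metric appears in the best masks across runs; the mask data and metric list here are randomly generated placeholders, not the paper's results.

import numpy as np

feature_names = ["wmc", "dit", "noc", "cbo", "rfc", "lcom",
                 "ca", "ce", "moa", "mfa", "cam", "max_cc"]   # CK-style metrics
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(320, len(feature_names)))   # hypothetical masks

freq = masks.sum(axis=0) / masks.shape[0] * 100              # selection percentage
for name, pct in sorted(zip(feature_names, freq), key=lambda t: -t[1]):
    print(f"{name}: {pct:.3f}%")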

Table 26 Details of the features selected by SBEWOA that scored the best AUC result for each dataset [best result out of 20 runs]
Table 27 The number of times each feature has been selected by SBEWOA for all datasets [over 320 runs]
Fig. 13 Importance of features in terms of the number of times the SBEWOA algorithm has selected them

Table 28 shows the details of the features chosen for each dataset. In the ant-1.7 dataset, the most relevant feature was cbo, followed by cam. As for the xalan-2.6 dataset, the moa feature was frequently selected by the algorithm. Across all datasets, the ca feature was selected 259 times, while the second-best feature, dit, was selected 247 times. Our findings suggest that these features have high discriminative power when dealing with the SFP problem.

Table 28 Details of the number of times the features have been selected for each data

6 Conclusion and future works

Software Fault Prediction (SFP) helps developers identify the faulty components of software prior to system deployment. In this article, we developed a well-performing classification model that is able to predict faulty software components. Sixteen software project datasets were selected from the PROMISE repository and normalized to bring all features onto a common scale. First, the class-imbalance problem in the datasets was resolved by applying and comparing several SMOTE variants. The results revealed that the BorderlineSMOTE technique offered the best AUC rate, as it properly re-balances the datasets.
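As a pointer for reproduction, the re-balancing step can be performed with the imbalanced-learn library; the min-max normalization and synthetic data below are illustrative assumptions.

import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # hypothetical metric matrix
y = (rng.random(200) < 0.15).astype(int)    # imbalanced fault labels

X_scaled = MinMaxScaler().fit_transform(X)  # normalize all metrics to [0, 1]
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X_scaled, y)
print(np.bincount(y), np.bincount(y_res))   # class counts before vs. after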

The hyper-parameter settings of the Random Forest (RF) classifier were investigated, and the parameters that provided the best results were selected. The RF classifier was also compared against other traditional classifiers and obtained superior outcomes in most cases. Furthermore, eight transfer functions (TFs) were adopted to convert the original WOA into a binary search space. Based on the results obtained, the BWOA with the S-shaped transfer function S1 is the most appropriate for SFP problems. Moreover, the performance of the BWOA was improved by integrating exploration strategies from the GWO and HHO algorithms; the resulting method is called SBEWOA. The main rationale of this enhancement is to improve the exploration stage of the WOA by employing a survival-of-the-fittest principle rather than random search. This modification enables the proposed SBEWOA to select a set of positive features from complex datasets.

Lastly, the proposed SBEWOA was compared against nine state-of-the-art feature selection methods. The results show that the proposed SBEWOA is a powerful and useful FS tool for the SFP problem. Its superior performance can be attributed to the multi-exploration strategies that help in exploring the feature space. Among its rivals, the SBEWOA achieved the highest AUC scores while selecting feature subsets that retain the informative metrics. The proposed multi-stage approach thus yields a fruitful solution for tackling SFP problems.

This work has several limitations. First, the proposed SBEWOA suffers from low feature-reduction power, although it often attains high accuracy. Second, the SBEWOA has a relatively complex structure, which results in a high computational cost. In the future, the SBEWOA can be applied to other applications such as parameter estimation for photovoltaic solar cells and intrusion detection systems. Furthermore, powerful mechanisms such as chaotic maps and opposition-based learning can be integrated into the SBEWOA for performance enhancement.