1 Introduction

The Software Development Life Cycle (SDLC) represents the phases that software passes through while it is being developed. It starts with requirements elicitation, followed by the analysis and design of the collected requirements. After that, the programmers develop the proposed software based on the analysis and design phases. A vital phase in the SDLC is software testing. This phase follows the development phase and consists of a set of activities that assure the team is developing the right software at a high level of quality [1]. Numerous testing types are available to test various aspects of a software product, including but not limited to unit testing, component testing, integration testing, regression testing, and user acceptance testing. Many software development methodologies are available to development teams; the most popular SDLC models are the waterfall, agile, and spiral models.

The testing stage plays an essential role in the development process. It is usually performed within a traditional linear model (e.g., waterfall) or a cyclic model (e.g., agile). The testing process is concerned with enhancing software quality and reducing the total cost [2, 3]. However, many factors affect the results of the testing process, such as limited resources (e.g., time or software testers). Therefore, early-stage procedures such as Software Fault Prediction (SFP) are utilized to facilitate the testing process in an optimal way [4]. In SFP, the faulty components of the software are detected prior to system deployment, in the early stages of the SDLC. This is achieved by utilizing software fault datasets collected from previous projects or predefined software metrics. It is worth mentioning that the SFP process became more straightforward after the adoption of the agile software development (ASD) model in 2001 [5] as a replacement for the waterfall model, which was introduced in 1970 [6]. Adopting the ASD methodology has many benefits since the software is developed incrementally. Moreover, ASD opens the door to accommodating volatile requirements, optimizing resources (time and cost), bridging the gap between the development team and business owners [7], and performing software engineering tasks such as review, maintenance, and testing on a regular basis [2].

The early prediction of faults in software components such as modules and classes has a significant impact on reducing the time and effort needed to deliver the project outcomes to the end-user. SFP is one of the approaches that helps optimize the development process by reducing the number of potential faults in the early stages of the SDLC [4]. Various SFP approaches have been recorded in the literature; the main ones include, but are not limited to, Soft Computing (SC) and Machine Learning (ML) [8]. These methods need data to be able to predict software faults. Design features (metrics) gathered during the design stage, and historical fault datasets accumulated during the implementation of previous versions of similar projects, are two essential resources to be used with SFP approaches for benchmarking [9].

Various types of metrics, such as method-level and class-level metrics, have been proposed for the SFP problem [10]. Method-level metrics can be collected from source code written in either the structured or the object-oriented programming paradigm. The Halstead [11] and McCabe [12] metrics are the most common method-level measures used by researchers. Class-level metrics are only appropriate when developing SFP models for object-oriented projects. Examples of class-level metric suites for object-oriented design are CK (Chidamber-Kemerer) [13], L&K (Lorenz-Kidd) [14], and the quality metrics for object-oriented design (QMOOD) [15]. In comparison with the other suites, the CK metrics are the most commonly applied when class-level metrics are chosen [10].

Automated systems have become available in almost all fields of real life. With the advancement of software development and the availability of large-scale projects, analyzing the collected software metrics has become complicated and poses a significant challenge. Thus, ML techniques have been proposed as SFP solutions and have shown good performance [16]. The main purpose of these techniques is to predict the faulty components of software based on the supplied datasets. Examples of ML techniques that have been used as SFP approaches are K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF) [4, 17, 18].

Among the various ML models, ensemble learning has shown excellent performance in dealing with various complex classification problems [19]. Ensemble learning combines a number of ML models to create an ensemble learner that improves performance by providing a more general and robust model. RF is a well-regarded ensemble technique originally introduced by Breiman [20]. In RF, a number of DT classifiers are fitted on various sub-samples of the dataset, and the outputs of all the trees are combined. RF has several merits that make it superior to other traditional ML models: it controls the over-fitting problem of DTs, reduces the variance within the forest, and thus enhances predictive accuracy [21, 22].

The performance of ML-based SFP approaches depends mainly on two factors: the applied ML technique and the quality of the utilized dataset (in terms of noise, irrelevant features, and imbalanced representation of data) [23]. Therefore, dimensionality reduction (e.g., feature selection) and data resampling (e.g., the Synthetic Minority Oversampling Technique (SMOTE)) are needed before applying the ML technique. Identifying the relevant features can be framed as a feature selection problem, which can be tackled by feature selection techniques.

In feature selection (FS), problems with a high-dimensional feature space increase the hardness of the search process. In general, various search strategies, including complete, random, and heuristic search, are available for searching the feature space to obtain the optimal subset of features [24]. A complete search requires generating and evaluating all possible subsets of features; for a set of m features, \(2^{m}\) feature subsets must be formed. For example, if the given problem has four features, sixteen subsets of features will be produced. In a random search, the next candidate solution (subset of features) is generated randomly, while heuristic search strategies conduct the search in an adaptive way and generate possible solutions (feature subsets) for the problem [24,25,26,27].

Recently, metaheuristics have been widely used by the research community as successful FS methods. Metaheuristics are conventionally categorized, based on the number of initial solutions, into population-based and trajectory-based methods [28]. A trajectory-based metaheuristic starts with a single solution; the search follows a trajectory in the search space based on local modifications of the current solution until a local optimum is reached. Examples of these methods are tabu search [29], β-hill climbing [30], stochastic local search [31], and variable neighborhood search [31]. In contrast, a population-based algorithm starts with a population of individuals; iteratively, the population passes on its strong elements in order to come up with an optimal solution. Normally, population-based algorithms are classified into evolutionary algorithms (EAs) and swarm intelligence approaches. The Genetic Algorithm (GA) [32], Genetic Programming (GP) [33], and Differential Evolution (DE) [34] are the base EAs for feature selection. A swarm-based algorithm is normally built on the idea of a group of solutions whose members are divided into leaders and followers. The Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) algorithms are the base swarm intelligence methods. Quite recently, several swarm intelligence methods have been proposed for FS, such as PSO [35], the Salp Swarm Algorithm (SSA) [36, 37], the Dragonfly Algorithm (DA) [38], the Rat Swarm Optimizer [39], the Ant Lion Optimizer (ALO) [40], Harmony Search [41], the Coronavirus Herd Immunity Optimizer [42], ACO [43], the β-hill climbing optimizer [44], the Crow Search Algorithm (CSA) [45], the JAYA algorithm [46], the Firefly Algorithm (FFA) [47], the Artificial Bee Colony (ABC) algorithm [48], the Coyote Optimization Algorithm (COA) [49], and the Grasshopper Optimization Algorithm (GOA) [50], as well as hybrids of these methods, such as Genetic-Whale-Ant Colony algorithms [51], the Grey Wolf Optimizer with Random Forest [52], a hybrid Salp Swarm Algorithm [53], and genetic and coral reefs algorithms [54]. There are also real opportunities to adapt newly established optimization algorithms to FS problems, such as the starling murmuration optimizer [55], the quantum-based avian navigation optimizer [56], the farmland fertility algorithm [57], the African vultures optimization algorithm [58], and the artificial gorilla troops optimizer [59].

The Whale Optimization Algorithm (WOA) is a recent swarm intelligence algorithm that imitates the behavior of humpback whales hunting fish in the ocean [60]. It has impressive characteristics compared with other optimization methods: it has few control parameters, is easy to implement, has a simple structure, and is able to maneuver to find a suitable balance between local exploitation and global exploration. Due to these successful attributes, WOA has been widely utilized to deal with feature selection problems [25, 61,62,63,64,65]. The original version of WOA was designed to handle continuous search spaces. In this paper, to match the binary search space of the FS problem, WOA is equipped with eight transfer functions from the S-shaped and V-shaped families. According to the No Free Lunch theorem [66], there is no single superb optimization algorithm that can excel over all others on all optimization problems; therefore, the opportunity remains to investigate modifications of efficient methods to handle SFP and improve algorithm efficiency.

In this paper, a systematic SFP approach that considers several ML techniques with different pre-processing methods is proposed. The major contributions are summarized as follows:

  • Several pre-processing and re-sampling techniques are applied to prepare the SFP datasets to be suitable for the ML techniques.

  • Various classification techniques, namely KNN, LDA, SVM, LR, DT, NB, and RF, are applied. Their performance is compared in the same environment in order to adopt one technique for further experiments. As a result, the RF classifier is adopted at this stage.

  • A dimensionality reduction method based on a binary version of WOA is utilized to eliminate irrelevant/redundant features and enhance the performance of the RF classifier. The newly proposed method, called BWOA, evaluates eight transfer functions and adopts the one that yields the best results.

  • An enhanced WOA version (EWOA) is introduced, where exploration strategies from the Grey Wolf Optimizer (GWO) and Harris Hawks Optimization (HHO) algorithms are used to enhance the diversity of the WOA. By means of this enhancement mechanism, the performance of WOA is improved to deal more efficiently with the search space of the FS problem. This yields a superior optimization framework for the faulty-software prediction problem.

The newly proposed EWOA reveals very successful outcomes in terms of choosing the most informative features in the area of SFP. The findings prove that the classification performance can be significantly improved by removing useless features. The performance is compared with nine state-of-the-art methods, and the comparison shows the viability of the proposed method in terms of accuracy, number of selected features, and fitness values.

The remainder of the paper is structured as follows: a review of the related works is presented in Section 2. In Section 3, a theoretical background of the related aspects to this paper is introduced. Section 4 presents the proposed methodology. The experimental design and the obtained results are discussed in Section 5. Finally, Section 6 includes a conclusion about the main findings of this paper in addition to some future work directions.

2 Related works

Recently, different ML approaches have been considered to solve the SFP problem with remarkable success [4]. Accordingly, different datasets (e.g., the PROMISE repository, NASA datasets, and the Qualitas corpus) have become publicly available to researchers [4, 67]. This section presents the most relevant related work in the field of SFP. A general overview of the SFP techniques is provided, and the related ML approaches are investigated, followed by the related work on ML approaches enhanced by preprocessing techniques such as feature selection.

2.1 ML based SFP

Different supervised and unsupervised ML techniques have been applied as prediction models in SFP. Examples of the ML techniques that have been used for SFP are: SVM [68], DT [69], Bayesian Networks (BN) [70], NB [71], KNN [72], Multi-Layer Perceptron (MLP) [73], Artificial Neural Networks (ANN) [2, 74], LR [75], Multinomial Logistic Regression (MLR) [73], RF [76], and ensemble MLP [77].

Singh and Malhotra [68] conducted an empirical study to evaluate the performance of an SVM classifier in determining the relationship between some object-oriented (OO) design metrics and fault proneness. A dataset from the NASA repository (KC1) and Receiver Operating Characteristic (ROC) analysis were used to evaluate the proposed model. The study shows that the SVM classifier is feasible and helpful in predicting faulty classes in OO-based systems.

Moreover, Cahill et al. [78] introduced a data representation approach named Rank Sum to improve the performance of fault proneness prediction modules. The proposed approach was evaluated by applying the well-known ML classifiers SVM and NB to various datasets from the NASA repository. It was found that NB performed better than the SVM classifier. Erturk and Sezer [79] introduced an SFP model that combines Fuzzy Inference System (FIS) and Artificial Neural Network (ANN) classifiers. FIS was applied at the beginning of the project to make predictions depending on expert opinion, because it does not need historical data for prediction, and then ANN was employed in the later iterations when some data about the software project became obtainable. The proposed iterative system was tested using a set of datasets, including various versions of many projects from the PROMISE repository. The selected datasets consist of common OO metrics such as coupling between objects, response for a class, and weighted methods per class. The evaluation of the results using ROC analysis with the area under the curve (AUC) shows that the iterative module is capable of locating fault-prone modules in the software.

An approach named the multi-strategy classifier (RB2CBL) was introduced by Khoshgoftaar et al. [80] for the SFP problem, where a Rule-Based (RB) classifier was hybridized with two variants of the Case-Based Learning (CBL) model. Moreover, an embedded GA was utilized to optimize the parameters of the CBL models. The experimental results reveal that the proposed RB2CBL classifier is superior to the RB model alone. Carrozza et al. [81] proposed a new set of software metrics for detecting Mandelbugs in complex software systems. In addition, considering both the newly proposed metrics and the conventional software metrics, several algorithms, including DT, SVM, BN, NB, and MLR, were applied to various datasets from the NASA repository. The authors reported that MLR and SVM are the best among all examined algorithms in finding Mandelbug-prone modules.

A model based on the principle of ensemble learning was employed by Rathore and Kumar [82] to predict software faults, in which linear regression-based combination rule (LRCR) and gradient boosting regression-based combination rule (GBRCR) approaches were used to combine the outputs of Genetic Programming (GP), MLP, and LR algorithms. Moreover, eleven datasets belonging to six software projects were accumulated from the PROMISE data repository to assess the performance of the proposed ensemble models. Results of different performance evaluation measures, including Average Absolute Error (AAE) and Average Relative Error (ARE), provided evidence that ensemble techniques can produce better results for predicting software faults than individual fault prediction techniques. Choudhary et al. [83] defined a set of change metrics, in addition to the existing ones, to enhance the performance of SFP modules.

Various ML classifiers were applied along with code metrics and change metrics. Experimental results on different releases of Eclipse projects demonstrate that the newly introduced change metrics can improve the performance of fault prediction modules. In [84], Shatnawi used ROC analysis to examine the relationship between software metrics (features) and faults, where threshold values of the metrics were identified accordingly. A threshold value is defined for each metric to decide whether a software module is faulty or not. Moreover, the results of the ROC analysis were also considered for selecting the metrics most correlated with faults. Only the selected metrics were used to train and test a set of ML classifiers (LR, NB, KNN, and the C4.5 decision tree).

From the previously investigated related work, researchers confirmed that having a considerable number of features in a dataset affects the performance of the ML technique. Therefore, many researchers have considered dimensionality reduction methods to eliminate the irrelevant/redundant features from the datasets. The most popular dimensionality reduction technique is FS.

2.2 Preprocessing ML methods

FS is a well-known preprocessing step in the data mining process that aims to eliminate noisy, irrelevant, and redundant features to reduce data dimensionality and, hence, improve the performance of the employed ML technique [24, 85]. In many works in the field of SFP, different filter and wrapper FS approaches have been investigated. Catal and Diri [86] employed a correlation-based feature selection approach to select the most relevant metrics for various ML techniques (i.e., RF, DT, NB, and AIRS). They found that FS positively affected the performance of the employed ML approaches and that the RF classifier outperformed the other classification techniques. In [87], eighteen filter FS methods were employed on five datasets from the NASA repository with various classification techniques. The obtained results revealed that using FS enhanced the performance of the prediction models.

As presented in [88], a set of filter FS methods, including Chi-square (CS), information gain (IG), and the Pearson Correlation Coefficient (PCC), was used to develop a hybrid feature selection method to improve the performance of Software Defect Prediction (SDP). In the hybrid FS method, the features were ranked and selected according to their values under these filter ranking methods. In addition, for comparison purposes, each of the three filter methods was applied separately. Using five NASA datasets to validate the FS method and an RF classifier to build the prediction model, the AUC results show that the hybrid FS approach is superior to the individual filter FS methods.

Moreover, many wrapper FS methods have been applied in the SFP field. A GA-based FS approach with a bagging technique was proposed in [89]. In this approach, two preprocessing techniques were used: FS (i.e., GA) and resampling (i.e., bagging). A similar approach was proposed in [90], in which two metaheuristic algorithms (i.e., GA and PSO) were applied as selection mechanisms in the FS process, in addition to a bagging technique used to rebalance the nine employed datasets.

Another wrapper FS approach was recently proposed by Turabieh, Mafarja, and Li [2]. The authors used several FS approaches to improve the efficiency of a Layered Recurrent Neural Network (L-RNN) classifier in predicting faulty software components. Three metaheuristic algorithms (i.e., GA, PSO, and ACO) were considered as FS approaches. A set of extensive experiments was conducted, and the performance of the proposed approach was compared with several ML classifiers (i.e., NB, LR, ANN, C4.5 DT, and KNN), with the area under the curve (AUC) as the evaluation measure. The AUC results confirmed that the proposed wrapper approach outperforms the other approaches. In a related work, a Binary Queuing Search Algorithm (BQSA) was proposed for the first time in the SFP literature. Moreover, the SMOTE technique was applied to rebalance the datasets, which were obtained from the PROMISE repository. The presented results revealed the positive effects of dimensionality reduction and resampling techniques on the obtained datasets.

In 2020, Tumar et al. [3] proposed a modified binary Moth Flame Optimization algorithm named Enhanced Binary MFO (EBMFO) as a wrapper FS approach for SFP, along with the Adaptive Synthetic sampling method (ADASYN) as a resampling technique. Three ML classifiers were used (i.e., LDA, KNN, and DT), and the results confirmed that the performance of these classifiers was improved by the preprocessing techniques. Recently, an FS approach based on the Harris Hawks Optimization algorithm (called EBHHO) was proposed for the SFP field in [91]. Again, the obtained results proved the positive influence of the employed preprocessing techniques on the used ML classifiers.

From the previously mentioned approaches, it is clear that preparing the datasets by employing preprocessing techniques (e.g., FS and resampling) greatly influences the performance of the prediction model in the SFP problem. It can be concluded that SFP becomes feasible for large-scale projects when proper preprocessing techniques are used. These observations, together with the No-Free-Lunch (NFL) theorem for optimization [66], which states that there is no single best classifier for all possible classification problems [92], motivated our attempt to propose an advanced SFP approach that considers the RF classifier as the ML technique, improved by SMOTE as a resampling technique, along with an advanced wrapper FS approach that uses a novel WOA variant as the search strategy.

3 Preliminaries

This section briefly describes the main theoretical concepts utilized in this research: the RF classifier, the oversampling technique (i.e., SMOTE), feature subset selection, and the WOA used to tackle the FS problem. In the SFP problem, the aim is to predict fault-prone software modules in the early stages of the SDLC based on the design metrics of the software project. SFP is considered a binary classification problem since each software component has two options in the target class: faulty or non-faulty. Several supervised classification paradigms can be used to tackle this problem. After conducting deep experimental studies, we adopted the RF classifier in this research.

3.1 Random Forest classification paradigm

Random Forest (RF) is a classification algorithm that was initially proposed in [20]. RF combines decision trees in a model called ensemble learning, where each tree in the forest depends on an independently sampled vector of random values. The main advantages of the RF classifier are its few tunable parameters, its high generalization capability, and its shorter training time compared with other classifiers, regardless of the size of the dataset [93]. Ensemble learners combine different classification algorithms to obtain a generalized model that enhances prediction performance, so that the number of wrongly classified instances is very low. Ensemble methods can be divided into three families, namely bagging, boosting, and stacking methods.

RF classifiers use the bagging method. As demonstrated in Fig. 1, a random sample of the original training dataset is provided to each model by applying a bootstrap re-sampling technique. After applying the bagging process, x features are selected randomly from the full feature set, and one of them is selected as a split node. The splitting process is repeated, with a fresh selection of features, until a specified depth d is reached and a decision tree is completed [94]. After the multiple splits, a random forest of decision trees is constituted. Every new instance is passed to all trees in the forest, and each tree predicts a class label (termed a vote); then the majority voting strategy is applied to select the class label for this instance.

Fig. 1
figure 1

Ensemble learning (bagging method)
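To make this concrete, the following sketch trains a bagged forest with scikit-learn (the library used in this work); the synthetic dataset is an assumption standing in for an SFP dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for an SFP dataset: 20 features mirror the 20 OO metrics used here.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=42)  # 66%/34% split as in Section 5

rf = RandomForestClassifier(
    n_estimators=10,      # number of bagged trees in the forest
    max_features="sqrt",  # random subset of features tried at each split
    bootstrap=True,       # each tree trains on a bootstrap re-sample
    random_state=42)
rf.fit(X_train, y_train)

# Each tree casts a vote; predict() returns the majority-vote class label.
print("test accuracy:", rf.score(X_test, y_test))
```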

3.2 Data sampling for imbalanced classification

The quality of data is considered a significant factor that has a profound impact on the performance of ML techniques. Imbalanced datasets are recognized as a challenging aspect that may degrade the prediction quality of classification methods. This issue emerges in most real-world problems in which the target classes are not represented equally. In other words, in binary-class datasets, most of the instances are labeled with the first class (called the majority class), while few of them are labeled with the other one (called the minority class). In such a case, the classifier is trained on highly imbalanced data and thus tends to pick up the patterns of the majority class, which leads to inaccurate prediction of the minority class [95].

The class imbalance problem poses a significant challenge in the field of software defect prediction since the available datasets are highly imbalanced. That is to say, the occurrences of defective cases are very low compared to normal cases (see Fig. 3). Various strategies can be employed to handle this problem, such as cost-sensitive, kernel-based, and sampling methods [95]. Sampling methods are categorized into two types: oversampling, which increases the rate of the minority class, and under-sampling, which reduces the frequency of the majority class. The latter causes information to be lost, which leads to poor prediction quality. In this research, we utilized an oversampling technique called SMOTE to rebalance the used SFP datasets.

SMOTE is a promising oversampling method that has proven its superiority in dealing with imbalanced data. It was originally introduced by Chawla et al. [96]. This technique preserves the original data without losing information, and it increases the rate of the minority class without duplication. For each minority sample \(x_{i}\), new synthetic samples (\(\hat{x}_{ij}\), j = 1, 2, ..., k) labeled with the minority class are generated using the k-nearest neighbors method with the Euclidean distance. The new synthetic samples are generated along the lines joining the minority sample and its selected neighbors, as in (1).

$$ x_{new} = x_{i} + (\hat{x}_{i} - x_{i} ) * r $$
(1)

where r is a random vector between 0 and 1, \(\hat {x}_{i}\) denotes one of the k neighbors. The value of k depends on the desired amount of oversampling.
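To illustrate (1), the following toy sketch generates one synthetic sample from a minority instance and one of its neighbors (an illustration of the interpolation step only, not the full SMOTE algorithm):

```python
import numpy as np

x_i = np.array([0.2, 0.5, 0.1])        # a minority-class sample
x_hat = np.array([0.3, 0.4, 0.2])      # one of its k nearest minority neighbors
r = np.random.default_rng(0).random()  # random number in [0, 1]

# Eq. (1): interpolate along the line joining x_i and its neighbor.
x_new = x_i + (x_hat - x_i) * r
print(x_new)
```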

3.3 Feature Selection (FS)

One of the most common questions when applying ML algorithms is whether all features (factors) are relevant to the classification rule. In response to this question, a problem called FS emerged. FS is defined as the process of reducing the dimension of the data by eliminating irrelevant, noisy, and redundant features. In other words, it is the task of searching for the most informative subset of features. It is an essential pre-processing technique that aims to enhance the performance of ML tasks [25, 26, 97].

FS approaches are classified into wrapper and filter methods based on the evaluation function used to assess the selected subset of features [98]. In wrapper-based methods, a search algorithm (deterministic or heuristic) is employed to generate subsets of features for examination. The effectiveness of each suggested subset of features is then evaluated by a given classifier (learning algorithm). The evaluation is conducted in terms of several measures, such as accuracy or the area under the ROC curve. FS is treated as a binary optimization problem in which the search algorithm is guided by the error reported by the classifier [99].

In the filter-based approach, the learning algorithm is not involved in the evaluation function. The effectiveness of a subset of features is evaluated based on the intrinsic properties of the data: statistical measures are used to quantify the dependency or correlation between features, which can then be filtered to select the most informative ones. Several ranking techniques have been introduced for feature evaluation, such as the gain ratio and information gain [100]. The filter-based approach is more efficient than the wrapper-based approach in terms of the required computational time.

In this paper, we propose a wrapper FS approach that considers WOA as a selection mechanism and RF classifier as an evaluation method. In the following subsection, the WOA is introduced, followed by the description of the enhanced approach of the original WOA.

3.4 An overview of the WOA

WOA is a recent Swarm Intelligence (SI) algorithm that mimics the behavior of humpback whales hunting fish in the ocean. The hunting process starts by constructing bubble nets to constrict the prey, and then the whale swims towards the prey in a spiral shape to attack. According to [60], WOA can balance its stochastic exploratory and exploitative tendencies effectively. In the exploitation phase, WOA simulates the encircling mechanism of the whales in nature, where the prey represents the best solution found so far and the other solutions in the population represent the candidate whales. The whales change their positions by moving toward the prey's location as modeled in (2) and (3):

$$ D = \mid \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \mid $$
(2)
$$ \vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot D $$
(3)

where t represents the current iteration, \(\vec{X}^{*}\) represents the prey's location (the best solution found so far), and \(\vec{X}\) represents the location of the candidate solution (whale). The vectors \(\vec{A}\) and \(\vec{C}\) are defined in (4) and (5).

$$ \vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a} $$
(4)
$$ \vec{C} = 2 \cdot \vec{r} $$
(5)

where \(\vec{r}\) is generated randomly in the interval [0, 1], and \(\vec{a}\) simulates the shrinking encircling behavior of the whales by decreasing linearly from 2 to 0 as in (6).

$$ a=2\left( 1-\frac{t}{T}\right) $$
(6)

According to (7), which represents the bubble-net attacking process (exploitation phase), a solution's position is changed based on one of two different approaches: shrinking encircling (when p < 0.5) or the spiral updating mechanism (when p ≥ 0.5), where a probability of 50% is used to select between these two approaches.

$$ \vec{X}(t+1)= \left\{\begin{array}{ll} \vec{X}^{*}(t)- \vec{A} \cdot D & p< 0.5\\ D^{\prime} \cdot e^{bl} \cdot \cos(2\pi l)+\vec{X}^{*}(t) & p \geq 0.5 \end{array}\right. $$
(7)

where \(D^{\prime}\) represents the distance between the ith solution and the prey's location, b is a constant that defines the shape of the logarithmic spiral, and l is a random number in the interval [-1, 1].

Based on the variation of \(\vec{A}\), a solution is forced to move towards or away from the best solution. If \(|\vec{A}| < 1\), the solution is moved towards the prey's location (exploitation), while it is moved towards a randomly selected solution from the population (represented as \(\vec{X}_{rand}\) in (8) and (9)) when \(|\vec{A}| \geq 1\) (exploration).

$$ D = \mid \vec{C} \cdot \vec{X}_{rand}(t) - \vec{X}(t) \mid $$
(8)
$$ \vec{X}(t+1) = \vec{X}_{rand}(t) - \vec{A} \cdot D $$
(9)
Algorithm 1
figure a

Pseudo-code of WOA.

As with all population-based metaheuristic algorithms (MAs), WOA starts the optimization process by generating N random solutions, each of which represents a whale. Then, each solution is evaluated using the adopted fitness function. The solution with the lowest fitness value is denoted as the best solution, since the problem at hand is a minimization problem, and the coefficients of the algorithm are calculated. The search then proceeds with the parameter a decreasing from 2 to 0. Each solution is updated based on the value of \(\vec{A}\): it is moved towards a randomly selected solution from the current population when \(|\vec{A}| \geq 1\), and towards the best solution when \(|\vec{A}| < 1\). The WOA switches between a spiral and a circular movement based on the value of p. The pseudo-code of WOA is shown in Algorithm 1.
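For concreteness, a minimal continuous WOA sketch implementing the update rules above is given below (a simplified illustration in Python, not the exact implementation used in this paper; in particular, the per-dimension vector \(\vec{A}\) is reduced to a single exploration/exploitation test):

```python
import numpy as np

def woa(fitness, dim, n_whales=20, max_iter=100, lb=-1.0, ub=1.0, b=1.0):
    """Minimal continuous WOA following Eqs. (2)-(9)."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, (n_whales, dim))   # random initial whales
    fit = np.apply_along_axis(fitness, 1, X)
    best = X[fit.argmin()].copy()              # prey = best solution so far

    for t in range(max_iter):
        a = 2 * (1 - t / max_iter)             # Eq. (6)
        for i in range(n_whales):
            A = 2 * a * rng.random(dim) - a    # Eq. (4)
            C = 2 * rng.random(dim)            # Eq. (5)
            if rng.random() < 0.5:             # shrinking encircling
                # Exploit the best whale, or explore a random whale (Eq. 9).
                ref = best if np.all(np.abs(A) < 1) else X[rng.integers(n_whales)]
                X[i] = ref - A * np.abs(C * ref - X[i])
            else:                              # spiral updating, Eq. (7)
                l = rng.uniform(-1, 1)
                X[i] = np.abs(best - X[i]) * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lb, ub)
        fit = np.apply_along_axis(fitness, 1, X)
        if fit.min() < fitness(best):
            best = X[fit.argmin()].copy()
    return best

# Example: minimize the sphere function in 5 dimensions.
print(woa(lambda x: np.sum(x ** 2), dim=5))
```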

4 The proposed methodology

The main objective of this paper is to build a well-performing classification model that is able to predict faulty software components. The datasets were selected from the PROMISE repository and normalized to set a proper scale for all data. Moreover, resampling techniques were applied to balance the data and obtain more accurate results. Then, further experiments were conducted to select the most appropriate classifier for this problem, followed by extensive experiments to tune the parameters of the selected classifier. In the last phase, a set of wrapper feature selection methods was applied to enhance the performance of the adopted classifier. Figure 2 represents the proposed methodology.

Fig. 2
figure 2

Software fault prediction process

4.1 Preprocessing techniques

Data preprocessing is a vital step in the mining process. It aims to prepare the dataset to be suitable for the mining techniques in order to achieve high performance. The datasets are visualized in 2D using Principal Component Analysis (PCA), as shown in Fig. 3. The figure demonstrates that the datasets are highly imbalanced and not linearly separable. Therefore, all datasets should be balanced before a learning algorithm is adopted in order to obtain proper results. Moreover, a complex learning algorithm is required to provide better performance, because the data are not linearly separable.

  • Data Normalization: The collected datasets are complete, with no missing data. Their structure is well-suited for mining, and all attributes are numeric. However, the numeric data are of different scales. Therefore, to avoid bias towards some dominant features, the min-max normalization method (as can be seen in (10)) was applied to scale the data into the interval [0, 1]. A sketch of both preprocessing steps is given after Fig. 3.

    $$ x^{n} = \frac{x - \min}{ \max -\min} $$
    (10)

    where \(x^{n}\) is the normalized value of x, and min and max are the minimum and maximum values of the corresponding attribute.

  • Data Balancing: After investigating the adopted datasets, we noticed that they are highly imbalanced, as the rate of faulty instances is very low compared to normal ones (see Fig. 3). Thus, the datasets should be balanced before using them with the classification technique to avoid any decrease in performance. In this paper, we applied three variants of the SMOTE oversampling technique (i.e., SMOTE, Borderline SMOTE, and SVM SMOTE) to select the one that most positively affects the performance of the learning algorithm.

Fig. 3
figure 3

Visualization of target class distribution based on the first 2 principal components of the dataset features
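A minimal sketch of these two preprocessing steps, assuming scikit-learn and the imbalanced-learn library (the variant names match those compared in Section 5.3):

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from sklearn.preprocessing import MinMaxScaler

def preprocess(X, y, variant="SMOTE"):
    """Min-max normalize (Eq. 10), then rebalance with a SMOTE variant."""
    X = MinMaxScaler().fit_transform(X)  # scale every feature into [0, 1]
    sampler = {"SMOTE": SMOTE(random_state=0),
               "BorderlineSMOTE": BorderlineSMOTE(random_state=0),
               "SVMSMOTE": SVMSMOTE(random_state=0)}[variant]
    return sampler.fit_resample(X, y)    # oversample the faulty class
```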

4.2 Classifier selection

Investigating the data visualization in Fig. 3, it can be seen that the data in most datasets are not linearly separable. Thus, simple classifiers may not be suitable to handle this kind of problem. Therefore, we conducted extensive experiments to compare the performance of different classifiers on the same datasets and under the same computational system.

4.3 Binary variant of WOA (BWOA)

As mentioned earlier, FS is a binary optimization problem, while WOA was originally designed to deal with continuous optimization problems. This requires a conversion function that converts the continuous solutions to binary to make them suitable for binary problems. We used the Transfer Functions (TFs) that are widely used to convert continuous metaheuristic populations to binary [101, 102]. TFs can be categorized based on their shapes [103] into S-shaped and V-shaped functions (see Fig. 4). The proposed binary WOA for FS is called BWOA.

Fig. 4
figure 4

Transfer functions families (a) S-shaped and (b) V-shaped

The S-shaped TF [104] was first used to convert the continuous PSO algorithm into binary. The TF produces the probability of flipping an element's value from 0 to 1 or from 1 to 0, as in (11). It takes as input the elements of the step vector (solution x) generated by the algorithm.

$$ S({x_{i}^{j}}(t))=\frac{1}{1+e^{-{x_{i}^{j}}(t)}} $$
(11)

where \({x_{i}^{j}}\) represents the jth element in the ith solution x, and t indicates the current iteration. An element of a solution in the next iteration is updated by (12).

$$ {X_{i}^{j}}(t+1)= \left\{\begin{array}{ll} 0 & \text{if } rand<S({x_{i}^{j}}(t+1))\\ 1 & \text{if } rand\geq S({x_{i}^{j}}(t+1)) \end{array}\right. $$
(12)

where \({X_{i}^{j}}(t+1)\) is the binary value of the real-valued \({x_{i}^{j}}\), and \(S({x_{i}^{j}}(t))\) is the probability value obtained via (11).

Another TF, belonging to the V-shaped family [105], was used to convert the continuous Gravitational Search Algorithm (GSA) into binary. Equation (13) represents the V-shaped TF, and (14) represents the rule used to convert to binary.

$$ V({x_{i}^{j}}(t))=|\tanh({x_{i}^{j}}(t))| $$
(13)
$$ {X_{i}^{j}}(t+1)= \left\{\begin{array}{ll} \neg {X_{i}^{j}}(t) & r<V({\Delta} {x_{i}^{j}}(t+1))\\ {X_{i}^{j}}(t) & r\geq V({\Delta} {x_{i}^{j}}(t+1)) \end{array}\right. $$
(14)

In this paper, eight TFs were adopted to convert the original WOA into binary: the original S2 TF proposed in [104], the V2 TF proposed in [105], and the six TFs proposed in [103]. The mathematical formulation of all TFs is shown in Table 1.

Table 1 S-shaped and V-shaped transfer functions
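As an illustration of how a TF binarizes a continuous WOA position, the following sketch implements a generic sigmoid S-shaped TF and the tanh V-shaped TF of (13), together with the update rules of (12) and (14); the exact eight TFs are those listed in Table 1, so the two used here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def s_shaped(x):            # sigmoid TF of Eq. (11)
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):            # V-shaped TF of Eq. (13)
    return np.abs(np.tanh(x))

def binarize_s(x_cont):
    """Eq. (12): threshold the S-shaped probability with a random number."""
    return np.where(rng.random(x_cont.shape) < s_shaped(x_cont), 0, 1)

def binarize_v(x_cont, x_bin):
    """Eq. (14): flip each current bit with probability V(x)."""
    flip = rng.random(x_cont.shape) < v_shaped(x_cont)
    return np.where(flip, 1 - x_bin, x_bin)

x = rng.normal(size=8)                        # a continuous WOA position
print(binarize_s(x))                          # S-shaped binary solution
print(binarize_v(x, np.zeros(8, dtype=int)))  # V-shaped update of all-zeros
```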

4.4 Formulation of FS problem

One of the critical aspects that should be considered when designing an optimization algorithm is the fitness function. Since FS is a multi-objective optimization problem, both objectives should be considered when evaluating a solution: the suitability of a feature subset is determined by the number of selected features (to be minimized) and the classification accuracy (to be maximized). Aggregation is one of the most popular techniques for multi-objective formulation [106]. In this technique, the objectives are integrated into a single-objective formula in which a preset weight identifies each objective's importance. In this work, we adopt a fitness function that combines both objectives of FS, as shown in (15).

$$ Fitness(X)=\alpha \cdot E(X) +\beta \cdot \frac{|R|}{|N|} $$
(15)

where Fitness(X) represents the fitness value of a subset X, E(X) represents the classification error rate obtained using the selected features in the subset X, |R| and |N| are the number of selected features and the number of original features in the dataset, respectively, and α and β are the weights of the classification error and the feature reduction term, with α ∈ [0, 1] and \(\beta = (1-\alpha)\), adopted from [36, 97, 107].

Another aspect that should be considered when designing an optimization approach to tackle an FS problem is the solution representation. In this research, a feature subset is represented as a binary vector of N elements, where N is the total number of features in the original dataset. Each dimension in the vector contains a binary value (0 or 1): a 0 indicates that the corresponding feature is not selected in the feature subset, while a 1 indicates that it is selected. Figure 5 shows a sample solution for a dataset of N features.

Fig. 5
figure 5

Binary solution representation
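A sketch of this wrapper evaluation, pairing the binary mask of Fig. 5 with the RF classifier and the fitness of (15), is shown below (the error weight α = 0.99 is a common choice in the cited works and is assumed here; the inputs are NumPy arrays):

```python
from sklearn.ensemble import RandomForestClassifier

def fitness(mask, X_train, y_train, X_test, y_test, alpha=0.99):
    """Eq. (15): weighted sum of RF error rate and feature-selection ratio."""
    if mask.sum() == 0:          # guard: an empty feature subset is invalid
        return 1.0
    cols = mask.astype(bool)     # binary vector of Fig. 5 selects the columns
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    rf.fit(X_train[:, cols], y_train)
    error = 1.0 - rf.score(X_test[:, cols], y_test)
    return alpha * error + (1 - alpha) * mask.sum() / mask.size
```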

4.4.1 Enhanced BWOA (BEWOA)

The performance of the binary version of WOA (BWOA) can be improved to deal with the complex nature of the SFP problem more efficiently, because the SFP search space is high-dimensional and rugged. Therefore, the enhanced version of BWOA (BEWOA) borrows efficient exploration strategies from GWO [108] and HHO [109]. The flowchart of the proposed BEWOA is illustrated in Fig. 6.

Fig. 6
figure 6

The flowchart of the proposed BEWOA

We propose a new approach that employs multiple exploration strategies instead of a single one, as follows:

  • The exploration strategy used in GWO, where the best three solutions are selected and the position is updated based on the values of the A and C parameters, as shown in (16)-(19).

    $$ \vec{X}_{1} = \vec{X}_{\alpha} - \vec{A}_{1} . |\vec{C}_{1} . \vec{X}_{\alpha} - \vec{X}(t)| $$
    (16)
    $$ \vec{X}_{2} = \vec{X}_{\beta} - \vec{A}_{2} . |\vec{C}_{2} . \vec{X}_{\beta} - \vec{X}(t)| $$
    (17)
    $$ \vec{X}_{3} = \vec{X}_{\delta} - \vec{A}_{3} . |\vec{C}_{3} . \vec{X}_{\delta} - \vec{X}(t)| $$
    (18)
    $$ \vec{X}(t+1) = \frac{\vec{X}_{1} + \vec{X}_{2} +\vec{X}_{3}}{3} $$
    (19)

    Note that \(\vec{X}_{\alpha}\), \(\vec{X}_{\beta}\), and \(\vec{X}_{\delta}\) are the best, second-best, and third-best solutions in the population. The vectors \(\vec{C}_{1}\), \(\vec{C}_{2}\), and \(\vec{C}_{3}\) are calculated based on (5), while the vectors \(\vec{A}_{1}\), \(\vec{A}_{2}\), and \(\vec{A}_{3}\) are calculated based on (4).

  • The exploration strategy used in HHO, where the average of all positions in the population guides the solution being updated, as shown in (20)-(21).

    $$ \vec{X}(t+1) = r \cdot \vec{X}^{*}(t) - \vec{X}_{avg}(t) $$
    (20)
    $$ X_{avg}(t)=\frac{1}{N}\sum\limits_{i=1}^{N}X_{i}(t) $$
    (21)

Algorithm 2 shows the new modifications. Instead of updating the position of the current solution based on the WOA exploration rule ((9) in Algorithm 1), the position of the current solution in BEWOA is updated based on either the GWO update (19) or the HHO update (20).

The main rationale of this enhancement is to improve the exploration stage of the original WOA by employing the survival-of-the-fittest principle rather than random search. Since the search space of SFP is very deep, the search requires more guided exploration to reach the best possible solution. Therefore, concentrating on the best solutions in the search space regions can help the search converge to the optimal solution quickly.

In order to provide more insight into the proposed BEWOA, its time complexity is analyzed. As can be noticed from Algorithm 2, it has several statements with loops. The statement in Line 1 needs \(\mathcal{O}(N \times n)\), where N is the population size and n is the solution dimension. In Line 3, the complexity depends on the complexity of the evaluation function, as calculated in (15). For the main loop from Line 5 to Line 18, the time complexity is \(\mathcal{O}(L \times N \times n)\), where L is the number of iterations. Therefore, the time complexity of the proposed method is \(\mathcal{O}(LNn)\), which is similar to that of the original WOA.

Algorithm 2
figure b

Pseudo-code of the proposed BEWOA.
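A sketch of the modified exploration step (replacing the random-walk exploration of the original WOA) is given below; choosing between the two borrowed strategies with equal probability is an assumption for illustration, and the binarization of Section 4.3 is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def gwo_explore(X, fit, i, a):
    """Eqs. (16)-(19): move whale i towards the three best solutions."""
    leaders = X[np.argsort(fit)[:3]]            # X_alpha, X_beta, X_delta
    candidates = []
    for leader in leaders:
        A = 2 * a * rng.random(X.shape[1]) - a  # Eq. (4)
        C = 2 * rng.random(X.shape[1])          # Eq. (5)
        candidates.append(leader - A * np.abs(C * leader - X[i]))
    return np.mean(candidates, axis=0)          # Eq. (19)

def hho_explore(X, best):
    """Eqs. (20)-(21): move relative to the best whale and the population mean."""
    return rng.random() * best - X.mean(axis=0)

def bewoa_explore(X, fit, best, i, a):
    # Multi-exploration: pick one of the two borrowed strategies at random.
    if rng.random() < 0.5:
        return gwo_explore(X, fit, i, a)
    return hho_explore(X, best)
```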

5 Experimental results and discussion

A set of experiments is conducted in this research to demonstrate the efficiency of the proposed approach. This section presents the results in detail. The experiments are carried out in four phases, as follows:

  • First, different SMOTE variants are adopted to address the imbalanced dataset issue.

  • Second, the hyper-parameter settings of the RF classifier are investigated, and those parameters that provided the optimal results are selected.

  • Third, the performance is tested using seven popular machine learning methods (RF, KNN, NB, LDA, DT, LR, and SVM).

  • Fourth, a novel BEWOA approach is introduced as a feature selection method to tackle the data dimensionality problem. In addition, the proposed approach is compared with nine state-of-the-art methods.

In the experiments, the train/test model is adopted. The datasets are split randomly into two parts: 66% is used for training, while the remaining 34% is used for blind testing. Due to the random factors in the classification algorithm (i.e., RF) and the optimization technique (i.e., WOA), we report the average values of 30 runs for each employed method. Please note that we use boldface to represent the best results. All the experiments are conducted on one computer under the same environment and conditions to guarantee a fair comparison. The details of the utilized system are presented in Table 2. It is worth mentioning that the Python programming language is used to implement the proposed classification framework, together with open-source libraries (e.g., Pandas, NumPy, Matplotlib, and Scikit-learn).
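A sketch of this evaluation protocol (a 66/34 split averaged over 30 runs) is shown below; the `run_method` callback stands in for any of the compared pipelines and is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, run_method, n_runs=30):
    """Average a method's test score over 30 independent random splits."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.66, random_state=seed)  # 66% train / 34% test
        scores.append(run_method(X_tr, y_tr, X_te, y_te))
    return np.mean(scores)
```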

Table 2 The system properties

5.1 Datasets: investigated software projects

In this work, we adopt 16 well-known datasets related to the SFP problem to assess the performance of the proposed approach. The datasets are collected from the PROMISE repository [110, 111] and have been recommended by many researchers in the field. The adopted datasets consist of several software metrics (e.g., object-oriented metrics) that are usually employed to investigate the quality of a software project (see Table 4 for more information about these metrics) [112]. The details of the adopted datasets are presented in Table 3. It can be seen that these datasets have various sizes (from 109 to 909 instances), and all of them have 20 object-oriented metrics.

Table 3 Details of the 16 software projects (datasets) from PROMISE repository
Table 4 Description of object-oriented metrics

From Table 3, it can be noticed that most of the datasets are imbalanced: the rate of defective instances is much lower than the rate of normal instances. In some datasets, defective instances make up only 2.2% of all instances. This observation motivates us to use an oversampling technique to rebalance the datasets before applying the ML approaches. Table 4 illustrates the main features provided in the datasets of Table 3.

5.2 Evaluation measures

Since SFP is a binary classification problem, we use the confusion matrix to calculate the evaluation measures. Table 5 represents the confusion matrix, followed by the equations for calculating the evaluation metrics. The True Positive Rate (TPR) is calculated according to (22), the True Negative Rate (TNR) is calculated as in (23), and the Area Under the Curve (AUC) is calculated based on (24). Since the available datasets are imbalanced, we do not consider accuracy as an evaluation measure because it can be profoundly misleading when judging a model. A short sketch computing these measures follows the list below.

Table 5 Confusion matrix for binary classification
  • TPR: The percentage of the positive cases that are correctly classified as positive.

    $$ TPR = {TP}/({TP + FN}) $$
    (22)
  • TNR: The percentage of negative cases that are correctly classified as negative.

    $$ TNR = {TN}/({TN + FP}) $$
    (23)
  • AUC: a measure of how well a model can distinguish between the defective and normal groups.

    $$ AUC = (TPR+TNR)/2 $$
    (24)
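The following sketch computes these measures from the confusion matrix with scikit-learn (the labels are toy values for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # toy labels (1 = faulty)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)      # Eq. (22)
tnr = tn / (tn + fp)      # Eq. (23)
auc = (tpr + tnr) / 2     # Eq. (24)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}, AUC={auc:.2f}")
```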

5.3 Handling imbalanced data using different SMOTE variants

In this section, we are interested in assessing the effect of rebalancing the datasets on the RF classifier. Three SMOTE variants (namely: SMOTE, BorderlineSMOTE, and SVMSMOTE) are applied, and the AUC, TPR, and TNR results are reported in Table 6. Observing the results in Table 6, one can conclude that SMOTE recorded the best performance. Moreover, the prediction quality of the RF classifier has been enhanced after rebalancing the datasets. Without applying the SMOTE technique, the TPR was very high compared with the TNR. However, after applying the SMOTE techniques, the results became more reasonable, and the TNR became more realistic.

Table 6 Evaluation results of RF classifier using different SMOTE variants in terms of testing TPR and TNR

Table 7 displays the AUC rates obtained by the RF classifier using different SMOTE variants. Inspecting the results in Table 7, it is clear that the AUC rates with the SMOTE techniques were much better than those without. Among the SMOTE variants, BorderlineSMOTE was the best (rank of 1.88), followed by SMOTE (rank of 2.06). Our findings indicate that BorderlineSMOTE was the most suitable technique for rebalancing the datasets. In sum, applying a resampling technique is important for imbalanced datasets, since it can significantly improve the performance of the classifier. Accordingly, we apply the BorderlineSMOTE technique to all datasets in the subsequent experiments.

Table 7 AUC rates obtained by RF classifier using different SMOTE variants

5.4 Random forest hyperparameter tuning

This section investigates the hyperparameter settings of the RF classifier. As is well known, the performance of the RF classifier is highly affected by the used parameter values. Therefore, we conducted a set of comprehensive experiments to find the most appropriate parameters, those that best reveal the performance of the classifier.

The following hyperparameters of the RF classifier were examined:

  • n_estimators: number of trees in the forest

  • max_depth: max number of levels in each decision tree

  • min_samples_split: min number of data points placed in a node before the node is split

  • min_samples_leaf: min number of data points allowed in a leaf node

  • bootstrap: a method for sampling data points (with or without replacement)

  • max_samples: If bootstrap is True, the number of samples to draw from X to train each base estimator

Table 8 presents the hyperparameter settings of the RF classifier. As can be observed, we test the RF classifier using multiple combinations of parameters. Initially, we tune n_estimators while fixing the other parameters. The process of hyperparameter tuning is then repeated, replacing n_estimators with max_depth, min_samples_split, min_samples_leaf, bootstrap, and max_samples, respectively.

Table 8 Hyperparameter tuning of the RF classifier

Figure 7 illustrates the training and testing curves of the AUC score for different hyperparameter settings. With proper tuning of these parameters, the training AUC score increases substantially. However, in some cases, an optimal training AUC may lead to over-fitting, which yields a very low AUC score on the testing set. In this regard, we evaluate the hyperparameters using the testing AUC score rather than the training AUC in the present work. Based on the results obtained, the optimal performance of the RF classifier can be achieved with n_estimators= 10, max_depth= 7, min_samples_split= 2, min_samples_leaf= 1, bootstrap=TRUE, and max_samples= 1.

Fig. 7
figure 7

The performance of RF model after adjusting the considered set of hyperparameters in terms of training and testing curves of AUC score
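These tuned settings correspond to the following scikit-learn configuration (a sketch; note that max_samples controls the bootstrap sample size, so the reported value of 1 is read here as drawing the full sample, the library default):

```python
from sklearn.ensemble import RandomForestClassifier

rf_tuned = RandomForestClassifier(
    n_estimators=10,      # number of trees in the forest
    max_depth=7,          # max number of levels per decision tree
    min_samples_split=2,  # min data points in a node before splitting
    min_samples_leaf=1,   # min data points allowed in a leaf node
    bootstrap=True,       # sample data points with replacement
    max_samples=None,     # None = draw n_samples per tree (full sample)
)
```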

5.5 Comparison with other classification techniques

After performing the hyperparameter tuning of the RF classifier, we compare it with a set of well-known classification algorithms (i.e., KNN, NB, LDA, DT, LR, and SVM) in solving SFP problems. We adopt this set of classifiers because they come from different categories of supervised ML techniques, and each one differs from the others in its structure. For example, the KNN classifier does not build an explicit model; it predicts the output based on the distance between training and testing samples. Also, LR, NB, and LDA are easy-to-implement algorithms since they do not contain extra parameters.

In this study, the min-max normalization technique is applied to all datasets, in addition to the oversampling technique. After investigating the performance of each classifier, a deep comparison in terms of AUC rates among the classifiers is presented in Table 9. In order to obtain the overall rank, the average ranking values of the Friedman test (F-Test) are calculated.

Table 9 Comparison of RF against traditional classifiers in terms of testing AUC rates

From Table 9, it is observed that the RF classifier performed better than other classifiers in most cases. Among rivals, RF obtained the highest AUC scores in 63% of the datasets while the KNN, which came in second place, obtained the best results in 25% of the datasets. Besides, the results of the F-Test reveal that the RF classifier outperformed other classifiers with the rank of 1.81, and it was ranked as the best classifier in terms of the AUC measurement. Ultimately, it can be inferred that the RF classifier is a useful and powerful learning algorithm for SFP problems.

5.6 Feature selection based on proposed BWOA approaches

In the previous sections, the impact of the re-sampling technique on this work was presented. Accordingly, the RF classifier achieved the best AUC value in most of the SFP problems. Nevertheless, the presence of irrelevant and redundant features might degrade the performance of the classifier, which limits the learning ability of the algorithm in SFP prediction. As such, feature selection is an excellent way to resolve this issue. In this work, we propose a new variant of the BWOA algorithm as a feature selection method to select the significant features.

5.6.1 Performance of BWOA using different TFs

In this first sub-section, we study the performance of the BWOA algorithm with different TFs. Following [61], we adopt four S-shaped TFs (S1 to S4) and four V-shaped TFs (V1 to V4). In total, eight binary versions of WOA, namely SBWOA1, SBWOA2, SBWOA3, SBWOA4, VBWOA1, VBWOA2, VBWOA3, and VBWOA4, are introduced. Table 10 reports the AUC results of the BWOA using the different TFs. From Table 10, it can be seen that SBWOA1 achieved the optimal AUC score in at least 11 datasets. Our findings reveal that SBWOA1 outperformed the other variants in selecting the most informative feature subset. This argument is supported by the results of the F-Test in Table 10. By observing the results in Table 11, one can see that SBWOA1 was not very good at reducing the feature size. However, SBWOA1 can often retain the significant features that best describe the target classes, which benefits the learning process. In Table 12, SBWOA1 again ranked first (rank of 2.41) in terms of the fitness value. All in all, we can conclude that the BWOA with the S-shaped transfer function S1 is the most appropriate for SFP problems.

Table 10 Evaluation results of BWOA using 8 transfer functions in terms of AUC rates
Table 11 Evaluation results of BWOA using 8 transfer functions in terms of number of selected features
Table 12 Evaluation results of BWOA using 8 transfer functions in terms of fitness values

5.6.2 Performance of enhanced WOA

After investigating the best TF for the BWOA algorithm, we inspect the performance of the proposed SBEWOA (the enhanced WOA with S-shaped TF S1) against SBWOA (the original binary WOA with S-shaped TF S1) on all datasets. Table 13 outlines the performance comparison of SBEWOA and SBWOA in terms of AUC rate, feature size, and fitness value. From Table 13, the proposed SBEWOA achieved the optimal AUC rate and fitness for most of the datasets (15 datasets). As for the feature size, SBEWOA consistently selects a minimal number of features during the selection process. In comparison with SBWOA, SBEWOA is more capable of finding the positive features that contribute to the highest AUC rate.

Table 13 Comparison of SBEWOA versus the conventional SBWOA in terms of AUC rates, number of selected features, and fitness values

The foremost cause of the improved efficiency of SBEWOA is that it employs multiple exploration strategies rather than the random operator in the global search phase. Thus, in case of premature convergence, the search agents can explore untried areas and escape the local optimum. Besides, SBEWOA enables the search agents to learn from the best three solutions, thereby increasing the chance of exploring the promising areas.

5.6.3 Population diversity analysis

In this sub-section, a diversity measurement is considered to check the behavior of the proposed method during the search. Maintaining diversity during the search enables the optimization algorithm to escape local optima and enhances the final solution. To measure the diversity, the Hamming distance between the population members in each iteration of the proposed method (i.e., SBEWOA) as well as the original method (i.e., SBWOA) is recorded and plotted, as shown in Fig. 8. As can be seen from the figure, the enhanced version of BWOA (i.e., SBEWOA) is able to preserve good diversity during the search, with the Hamming distance steadily decreasing as the search converges towards the optimal results. In contrast, SBWOA exhibits almost stable Hamming distance results during the search and converges slowly. From Fig. 8, one can see that SBEWOA maintained good diversity on all datasets with a smaller Hamming distance. At the initial stage, SBEWOA keeps exploring the search regions to find the global minimum; it then searches locally for the local optimum. Even though SBEWOA was less robust than SBWOA, it can offer better solutions that provide high AUC performance.

Fig. 8 The population diversity curves of BWOA and BEWOA

5.6.4 Deep analysis of the modifications

This sub-section presents a deeper analysis of the impact of each modification on the proposed method. Table 14 presents the fitness values obtained by the proposed modifications. In Table 14, SBWOA-G denotes the modification using the GWO algorithm (Eq. 19), SBWOA-H refers to the modification using the HHO algorithm (Eq. 20), and SBEWOA is the modification using both the GWO and HHO algorithms. Based on the results obtained, the SBEWOA achieved the lowest fitness in most datasets. The convergence analysis in Figs. 9 and 10 again verifies the convergence power of the SBEWOA in finding a near-optimal feature subset. The AUC performance in Table 15 shows that the SBEWOA can usually score the highest AUC values, followed by SBWOA-H.
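For reference, the fitness minimized in wrapper FS studies of this kind is typically a weighted sum of classification error and feature-subset size; the weights below (alpha = 0.99, beta = 0.01) are a common convention and an assumption on our part, not necessarily the paper's exact values.

def fs_fitness(error_rate, n_selected, n_total, alpha=0.99, beta=0.01):
    # Weighted feature-selection fitness: lower is better.
    # alpha weights the classification error (1 - AUC could serve here),
    # beta weights the fraction of features kept; alpha + beta = 1.
    return alpha * error_rate + beta * (n_selected / n_total)

# Example: 0.12 error with 8 of 20 metrics kept.
print(fs_fitness(0.12, 8, 20))  # -> 0.1228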

Table 14 Impact of proposed modifications in terms of fitness results
Fig. 9 Convergence analysis of basic and modified versions on camel-1.0, camel-1.4, camel-1.6, jedit-3.2, jedit-4.0, and jedit-4.1 datasets

Fig. 10 Convergence analysis of basic and modified versions on log4j-1.0, log4j-1.1, lucene-2.0, xalan-2.4, xalan-2.5, and xalan-2.6 datasets

Table 15 Impact of proposed modifications in terms of AUC results

On the other hand, the proposed SBEWOA did not show the best results in feature reduction. From Table 16, the number of features chosen by the SBEWOA was slightly higher than that of BGWO. However, the feature subsets produced by the SBEWOA were able to maintain high AUC results.

Table 16 Impact of proposed modifications in terms of number of features

5.6.5 Comparison of SBEWOA with other optimizers

In this sub-section, we further compare the performance of the proposed SBEWOA with nine state-of-the-art feature selection methods: the binary firefly algorithm (BFFA), binary moth-flame optimization (BMFO), binary multi-verse optimizer (BMVO), binary grey wolf optimization (BGWO), binary bat algorithm (BBAT), binary cuckoo search (BCS), binary Harris hawks optimization (BHHO), binary Jaya algorithm (BJAYA), and the genetic algorithm (GA). In this study, we adjusted the optimizers' parameters in accordance with the settings recommended in the original publications and related feature selection studies [99, 113,114,115,116]. The parameter configurations used in this paper are outlined in Table 17.

Table 17 Parameter settings for the optimization algorithms

Table 18 presents the AUC rates of the SBEWOA and the other methods. As can be seen, the SBEWOA outperformed its competitors by scoring the highest AUC rates in 12 datasets, achieving the best (lowest) rank of 1.47. The second-best algorithm was BHHO (rank of 2.41), followed by the BMFO method (rank of 3.59). The Wilcoxon test results in Table 19 show that the performance of the SBEWOA was significantly better than that of the other methods for most of the datasets.
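The significance check in Table 19 corresponds to a paired, two-tailed Wilcoxon signed-rank test over per-dataset AUCs; a minimal sketch with hypothetical scores follows.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
auc_sbewoa = rng.uniform(0.80, 0.95, size=16)   # hypothetical per-dataset AUCs
auc_rival = auc_sbewoa - rng.uniform(0.0, 0.05, size=16)

stat, p = wilcoxon(auc_sbewoa, auc_rival)       # two-tailed by default
print(f"W = {stat:.1f}, p = {p:.4f}")           # p <= 0.05 -> significant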

Table 18 Comparison between the proposed SBEWOA and other optimizers based on AUC rates
Table 19 2-tailed P-values of the Wilcoxon signed-rank test based on the AUC results reported in Table 18 (P-values ≤ 0.05 are shown in bold and indicate significance)

Table 20 reports the number of selected features for the SBEWOA and the other methods, while the results of the Wilcoxon test based on the number of selected features are shown in Table 21. As can be observed, the BBAT outperformed the other methods in minimal feature selection, while the SBEWOA ranked 8th with an F-Test rank of 7.72. Our findings reveal that the SBEWOA was not particularly strong at reducing the number of features. However, it is worth noting that an algorithm that excels at feature reduction may degrade the performance of the classifier through the extensive elimination of positive features (see Tables 18 and 20). The SBEWOA, by contrast, is excellent at maintaining the positive features while removing the negative ones. Hence, the SBEWOA can often offer better AUC rates when dealing with SFP problems.

Table 20 Comparison between the proposed SBEWOA and other optimizers based on the number of selected features
Table 21 2-tailed P-values of the Wilcoxon signed-rank test based on the number of features reported in Table 20 (P-values ≤ 0.05 are shown in bold and indicate significance)

Table 22 shows the fitness values of the SBEWOA and the other methods. From Table 22, the SBEWOA contributed the best fitness values in 11 datasets. Besides, the SBEWOA often offers consistent results, as indicated by its smaller standard deviation. Compared with the other methods, the SBEWOA retained the best rank of 1.69, exhibiting a better search tendency when dealing with FS problems in SFP analysis. The Wilcoxon test results in Table 23 support these arguments. Table 24 shows the computational times of the proposed SBEWOA and the other optimizers. Based on the results obtained, GA was the fastest algorithm in terms of running time. Although the SBEWOA is not the fastest algorithm, it runs faster than BGWO, BBAT, BCS, and BJAYA in many cases. This argument is further supported by the statistical results shown in Table 25.

Table 22 Comparison between the proposed SBEWOA and other optimizers based on fitness values
Table 23 2-tailed P-values of the Wilcoxon signed-rank test based on the fitness results reported in Table 22 (P-values ≤ 0.05 are shown in bold and indicate significance)
Table 24 Comparison between the proposed SBEWOA and other optimizers based on running time
Table 25 2-tailed P-values of the Wilcoxon signed-rank test based on the running time results reported in Table 24 (P-values ≤ 0.05 are shown in bold and indicate significance)

Figures 11 and 12 illustrate the convergence curves of the SBEWOA for all datasets. It can be seen that the SBEWOA showed an excellent acceleration rate in most cases. Taking the jedit-3.2 and lucene-2.0 datasets as examples, the SBEWOA converged faster and deeper toward the global minimum. The results imply that the SBEWOA offers good convergence ability in solving the FS problem, thus leading to satisfactory achievements.

Fig. 11 Convergence curves for compared algorithms on ant-1.7, camel-1.2, camel-1.4, camel-1.6, jedit-3.2, and jedit-4.0 datasets

Fig. 12 Convergence curves for compared algorithms on jedit-4.1, log4j-1.0, log4j-1.1, lucene-2.0, xalan-2.5, and xalan-2.6 datasets

Based on the results obtained, it can be inferred that the proposed SBEWOA is a powerful and useful FS tool for SFP analysis. The superior performance of the SBEWOA can be attributed to the multi-exploration strategies which take full advantage of the best three leaders in exploring the feature spaces. Moreover, the utilization of the average of all positions and the current best position allows the search agents to explore the untried regions effectively. In case of premature convergence, the search agents can escape the local optimum and seek out promising solutions. Hence, SBEWOA can usually achieve better results than other methods.

5.6.6 Relevant features selected by SBEWOA

In the final sub-section, we investigate the top features for SFP analysis. Table 26 details the features selected by the SBEWOA that scored the best AUC results. From Table 26, it can be noticed that different sets of features were chosen by the SBEWOA on different datasets. Table 27 outlines the number of times each feature was selected by the SBEWOA algorithm, and the importance of the features is illustrated in Fig. 13. Accordingly, the top five features were ca (80.938%), dit (77.188%), moa (75.938%), max_cc (75.625%), and mfa (75%). In contrast, the least frequently selected features were wmc and lcom (54.063%).
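The selection frequencies behind Table 27 and Fig. 13 can be derived as in the sketch below, which counts how often each metric appears in the best masks across runs; the mask data and metric list here are randomly generated placeholders, not the paper's results.

import numpy as np

feature_names = ["wmc", "dit", "noc", "cbo", "rfc", "lcom",
                 "ca", "ce", "moa", "mfa", "cam", "max_cc"]   # CK-style metrics
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(320, len(feature_names)))   # hypothetical masks

freq = masks.sum(axis=0) / masks.shape[0] * 100              # selection percentage
for name, pct in sorted(zip(feature_names, freq), key=lambda t: -t[1]):
    print(f"{name}: {pct:.3f}%")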

Table 26 Details of the features selected by SBEWOA that scored the best AUC result for each dataset [best result out of 20 runs]
Table 27 The number of times each feature has been selected by SBEWOA for all datasets [over 320 runs]
Fig. 13 Importance of features in terms of the number of times the SBEWOA algorithm has selected them

Table 28 shows the details of the features chosen for each dataset. In the ant-1.7 dataset, the most relevant feature was cbo, followed by cam. As for the xalan-2.6 dataset, the moa feature was frequently selected by the algorithm. Across all datasets, the ca feature was selected 259 times, while the second-best feature, dit, was selected 247 times. Our findings suggest that these features have high discriminative power when dealing with the SFP problem.

Table 28 Details of the number of times the features have been selected for each data

6 Conclusion and future works

Software Fault Prediction (SFP) helps developers identify the faulty components of software prior to system deployment. In this article, we developed a well-performing classification model that is able to predict faulty software components. Sixteen software project datasets were selected from the PROMISE repository and normalized to bring all features onto a common scale. First, the class-imbalance problem in the datasets was resolved by applying and comparing several SMOTE variants. The results revealed that the BorderlineSMOTE technique offered the best AUC rate, as it properly re-balances the datasets.
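As a pointer for reproduction, the re-balancing step can be performed with the imbalanced-learn library; the min-max normalization and synthetic data below are illustrative assumptions.

import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))              # hypothetical metric matrix
y = (rng.random(200) < 0.15).astype(int)    # imbalanced fault labels

X_scaled = MinMaxScaler().fit_transform(X)  # normalize all metrics to [0, 1]
X_res, y_res = BorderlineSMOTE(random_state=0).fit_resample(X_scaled, y)
print(np.bincount(y), np.bincount(y_res))   # class counts before vs. after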

The hyper-parameter settings of the Random Forest (RF) classifier were investigated, and the parameters that provided the best results were selected. The RF classifier was also compared against other traditional classifiers and obtained superior outcomes in most cases. Furthermore, eight transfer functions (TFs) were adopted to convert the original WOA into a binary search space. Based on the results obtained, the BWOA with the S-shaped transfer function S1 is the most appropriate for SFP problems. Moreover, the performance of the BWOA was improved by integrating exploration strategies from the GWO and HHO algorithms; the resulting method is called SBEWOA. The main rationale of this enhancement is to improve the exploration stage of the WOA by employing a survival-of-the-fittest principle rather than random search. This modification enables the proposed SBEWOA to select a set of positive features from complex datasets.

Lastly, the proposed SBEWOA was compared against nine state-of-the-art feature selection methods. The results show that the proposed SBEWOA is a powerful and useful FS tool for the SFP problem. Its superior performance can be attributed to the multi-exploration strategies that help in exploring the feature space. Among its rivals, the SBEWOA achieved the highest AUC scores while selecting feature subsets that retain the informative metrics. The proposed multi-stage approach thus yields a fruitful solution for tackling SFP problems.

This work has several limitations. First, the proposed SBEWOA suffers from low feature-reduction power, although it often attains high accuracy. Second, the SBEWOA has a relatively complex structure, which results in a high computational cost. In the future, the SBEWOA can be applied to other applications such as parameter estimation for photovoltaic solar cells and intrusion detection systems. Furthermore, powerful mechanisms such as chaotic maps and opposition-based learning can be integrated into the SBEWOA for performance enhancement.