1 Introduction

The presence of defects within a software system poses a considerable risk to its overall quality. Software defect prediction is a critical aspect of software engineering, aimed at improving software quality by identifying and addressing defects early [1]. Detecting defects before the software is released is vital for enhancing its overall quality. The Pareto principle applies here: the majority of software defects arise within a small subset of modules. Hence, forecasting and spotting defects in the early stages of software development significantly enhances the resulting software's quality. Software defect prediction entails identifying modules susceptible to defects before the testing phase, thus reducing testing time and costs. As software systems grow in size and complexity, testing every module comprehensively becomes infeasible, underscoring the importance of predicting defect-prone modules to enhance software quality. In this pursuit, machine learning (ML) methods provide valuable models that empower developers to classify faulty software modules effectively [2].

Recently, there has been a surge in the adoption of software defect prediction approaches to bolster software quality. This study addresses software defect prediction, concentrating on classifying software components (modules) into two classes: defect-prone and non-defect-prone [1, 2]. The classification technique hinges on extracting a model from the history of defective modules, which is subsequently employed to predict defects in new modules more accurately. Past research reveals a robust correlation between software module metrics and defect prediction [3]. Multiple algorithms exist for software module classification, including Decision Trees (DT), the K-Nearest Neighbour algorithm (KNN), the Naive Bayes (NB) algorithm, Support Vector Machines (SVM), and Artificial Neural Networks (ANN). However, a prevalent challenge in classification lies in handling a vast array of features, which compromises classification accuracy. Feature selection methods mitigate this challenge by reducing feature dimensionality. Identifying the effective features for classification is an NP-hard problem that can be addressed using evolutionary algorithms [4].

This paper introduces a novel approach to predicting software defects. Initially, a Binary variant of the Chaos-based Olympiad Optimisation Algorithm (BCOOA) was developed to select the most impactful features from the training dataset. Subsequently, various ML algorithms were employed to construct a classification model using this optimal training set. BCOOA draws inspiration from swarm intelligence and is designed to emulate the learning process of a group of students preparing for an Olympiad examination. Cycles of teaching and learning amongst the students drive the evolution of the population. The primary objective here is to leverage BCOOA's capability to select the features crucial for predicting software defects with ML algorithms, so that defects can be detected and addressed before software release. Diverse ML algorithms, namely KNN, DT, SVM, NB, and ANN, were utilised for faulty module classification. By selecting the most influential features, there is potential to enhance the precision and accuracy of the software module classifier. Ultimately, in the testing phase, the effectiveness of the new feature selection method was assessed using test data.

The following are the main objectives of the current study:

  • Determining the most effective features of software defect prediction datasets.

  • Increasing the accuracy, precision, and sensitivity of software defect predictors.

  • Enhancing the performance and stability of software defect predictors.

The primary contributions of this study are:

  • Proposing a novel binary and hybrid version of the Olympiad Optimisation Algorithm (OOA) to select the most effective features of the defect prediction dataset. To achieve greater population diversity, the operators of the Genetic Algorithm (GA) were embedded into the OOA.

  • Developing and adapting the theory of chaos in the OOA to improve its convergence with regard to both exploration and exploitation. Chaos maps have been used for population initialisation. The developed binary and chaos-based OOA was adapted to address the challenge of software defect prediction.

  • Developing an effective and efficient classification model to detect faulty software modules.

  • Increasing the efficiency of software defect detection methods by selecting the smallest subset and the most effective features.

The remainder of the paper is organised as follows: Sect. 2 reviews related work on the problem of software defect prediction. Section 3 presents the details of the proposed method in two subsections; the first suggests and utilises BCOOA to select the most effective features in the training dataset, and the second discusses the development of the desired classifier using the optimal training set and different machine learning algorithms. Section 4 presents all the relevant results from the tests conducted with the specified criteria on real-world datasets, and Sect. 5 concludes the article and recommends guidelines for future work.

2 Related works

During software development, the testing phase holds significant importance [5,6,7]. The reliability of software hinges on the presence of bugs (faults) within the software [8, 9]. This phase incurs substantial expenses in terms of both budget and time, underscoring the criticality of predicting software modules prone to defects. This prediction occurs before the testing phase commences, aiding in identifying and rectifying modules susceptible to defects. Drawing from the historical data of problematic software modules in prior project implementations or similar projects, a model is derived to facilitate accurate defect prediction in newly developed modules. Research in defect prediction and estimation indicates that the underlying hypothesis used in constructing the model significantly influences the efficiency of the prediction model [10]. Various approaches exist for software defect prediction. This section scrutinises four categories of ML methods: normalisation-based, unbalanced learning-based, feature selection-based, and blended learning-based methods.

2.1 Normalisation learning methods

Data normalisation involves pre-processing datasets prior to training and testing to standardise the values of independent features within the dataset. This process ensures consistency across the diverse range of feature values present in datasets, which is crucial as some ML algorithms' objective functions may not perform optimally without normalisation. This pre-processing technique plays a pivotal role in enhancing algorithm accuracy by remapping the measured values to a consistent range [11]. In the realm of software defect prediction research, three primary normalisation methods—logarithmic normalisation, min-max, and Z-standardisation—have been widely adopted.

Logarithmic normalisation transforms all sample feature values into their logarithmic equivalents, effectively managing skewed data distributions and reducing the impact of outliers. The min–max method normalises features within a specified range by identifying the minimum and maximum values for each feature and scaling accordingly. In contrast, Z-standardisation adjusts feature values based on their mean and standard deviation, remapping them to a range where the mean is 0 and the standard deviation is 1. All three techniques have been applied across various defect prediction studies [12]. It is important to note that in classification problems, the number of samples in each class may vary, presenting additional challenges in data pre-processing and model training. By employing these normalisation techniques, researchers and practitioners can mitigate some of these challenges, ensuring more robust and accurate software defect prediction models.
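For illustration, the three schemes can be expressed in a few lines of code. The following minimal Python sketch (an illustration on a toy metric matrix, not the study's implementation) applies each normalisation; the use of log1p to keep zero-valued metrics defined is our assumption.

```python
import numpy as np

def log_normalise(X):
    # Logarithmic normalisation: compress skewed metric distributions.
    # log1p (log(1 + x)) keeps zero-valued metrics defined -- our assumption.
    return np.log1p(X)

def min_max_normalise(X):
    # Min-max scaling: remap every feature column into the [0, 1] range.
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_standardise(X):
    # Z-standardisation: per-feature mean 0 and standard deviation 1.
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy matrix: three modules described by two metrics (e.g. LOC and complexity).
X = np.array([[120.0, 4.0], [3500.0, 18.0], [560.0, 9.0]])
print(min_max_normalise(X))
print(z_standardise(X))
```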

2.2 Unbalanced learning-based methods

In binary classification scenarios, the challenge of imbalance occurs when there is a significant disparity between the number of samples in one class compared to the other. This imbalance can lead to suboptimal performance of learning algorithms, as they generally assume an equal distribution of samples across classes. To mitigate the adverse effects of dataset imbalance on prediction accuracy, several strategies have been developed, broadly categorised into data-level, algorithm-level, and cost-sensitive learning approaches. Data-level approaches focus on pre-processing techniques that aim to rebalance the dataset before the learning process begins, without directly altering the algorithm itself. These methods adjust the data distribution at the pre-processing stage to counteract the imbalance. They achieve this by either augmenting the number of samples in the underrepresented class (oversampling) or reducing the samples in the overrepresented class (undersampling), thus fostering a more balanced class distribution for the training process [13]. Oversampling enhances the representation of the minority class by duplicating existing samples or generating new ones, whereas undersampling reduces the imbalance by removing some samples from the majority class.
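As an illustration of these data-level approaches, the following minimal Python sketch implements naive random oversampling and undersampling; the function names and the fixed random seed are hypothetical, and production work would typically rely on more refined schemes (e.g. synthetic oversampling).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

def random_oversample(X, y, minority_label=1):
    # Duplicate randomly chosen minority samples until both classes are equal.
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label=1):
    # Discard randomly chosen majority samples down to the minority count.
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    return X[idx], y[idx]

# 8 non-defective vs 2 defective modules -> balanced to 8 vs 8, or 2 vs 2.
X = rng.normal(size=(10, 3))
y = np.array([0] * 8 + [1] * 2)
print(np.bincount(random_oversample(X, y)[1]))   # [8 8]
print(np.bincount(random_undersample(X, y)[1]))  # [2 2]
```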

On the other hand, algorithm-level approaches modify the learning algorithms to make them more sensitive to the minority class. These methods do not alter the distribution of the data but instead adjust the learning process to focus more on correctly classifying the underrepresented class. Techniques such as bagging and boosting are examples of algorithm-level approaches that can improve classification performance in the context of imbalanced datasets by enhancing the model's focus on the minority class. Together, these strategies provide a multifaceted toolkit for addressing the challenges posed by imbalanced datasets in binary classification scenarios, ensuring more accurate and equitable predictions across both classes.

2.3 Feature selection-based methods

Machine learning methods dealing with high-dimensional data, i.e., datasets with a large number of features, encounter several challenges. These include increased computational complexity, difficulty in extracting meaningful insights, and the need to manage model complexity to avoid overfitting during training [14]. To address these issues, dimensionality reduction techniques aim to simplify datasets by representing them in a lower-dimensional space while preserving essential characteristics of the original data. Dimension reduction strategies can be broadly categorised into feature extraction-based methods and feature subset selection-based methods. Feature extraction methods transform the original high-dimensional data into a lower-dimensional space, effectively combining existing features to create a new, smaller set of features that capture the core information of the original dataset. A well-known technique in this category is Principal Component Analysis (PCA), which identifies the principal components that account for the most variance in the data. Feature subset selection is another critical approach in machine learning, especially relevant for tasks such as classification and regression. Despite the presence of numerous features in these tasks, not all contribute equally to the learning process; some may be redundant or even detrimental to model accuracy. By removing these non-contributory features, computational efficiency is improved and model accuracy is enhanced. The objective of feature selection is to find the smallest subset of features that is sufficient for the task at hand. Within the realm of defect prediction, feature selection methods are categorised into filtering and classification (wrapper) techniques [15].

The filtering approach operates independently of the machine learning algorithm, without incorporating a classification function. This method evaluates features based on specific criteria, assigning a score to each feature. Features are then ranked according to their scores, and those with the lowest rankings, typically below a certain threshold, are removed. The selected feature subset is then used for classification, as illustrated in Fig. 1, which outlines the steps involved in the filtering process. Several filter-based feature ranking techniques, such as Information Gain (IG), Information Gain Rate (GR), Symmetric Uncertainty (SU), the Chi-Square Test (CS), and two variants of Relief, have been extensively studied [16]. The CS filter evaluates the distribution of classes and feature correlations. The IG filter measures the information a particular feature (feature Y) provides about the target class based on the value of another feature (feature X). However, IG has a tendency to favour features with a large number of values, which may not always be the most informative. The GR and SU methods overcome this bias by adjusting for the value count of features; GR penalises features with many values, while SU calculates the combined entropies of features X and Y to provide a balanced measure. The ReliefF algorithm, an enhancement of the original Relief method, excels in handling noisy and multiclass datasets by effectively identifying relevant features. A minimal code sketch of such filter-based ranking follows Fig. 1.

Fig. 1 The procedure of the filter-based method for feature selection
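To make the filter approach concrete, the following Python sketch ranks features with scikit-learn's chi-square and mutual-information scorers as stand-ins for the CS and IG criteria; the synthetic 21-feature dataset and the cut-off of eight features are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Synthetic stand-in for a 21-feature defect dataset; shifted to be
# non-negative because the chi-square scorer requires non-negative inputs.
X, y = make_classification(n_samples=200, n_features=21, n_informative=6,
                           random_state=0)
X = X - X.min(axis=0)

# CS-style filter: score features by the chi-square statistic and keep the
# top 8 (the cut-off threshold is an arbitrary illustrative choice).
selector = SelectKBest(score_func=chi2, k=8).fit(X, y)
print("chi2 keeps:", np.flatnonzero(selector.get_support()))

# IG-style filter: rank features by mutual information with the class label.
mi = mutual_info_classif(X, y, random_state=0)
print("top 8 by MI:", np.argsort(mi)[::-1][:8])
```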

Techniques that use an evaluation function based on the error rate of the learning algorithm are referred to as classification or wrapper methods. This strategy involves generating new feature subsets through a generator function, which are then evaluated using an ML technique. The effectiveness of each subset is determined by the number of errors in the test set or the error rate of the learning method. Typically, the classification (wrapper) method provides superior performance compared to the filter method, albeit at a higher computational cost. Two primary techniques for selecting the optimal set of features are forward selection and backward selection. Forward selection evaluates each feature for potential inclusion step by step, while backward selection starts with all features and gradually eliminates them based on a predefined stopping criterion, efficiently determining the essential features for software defect prediction [15]. Figure 2 demonstrates the workflow of classification methods, including both forward and backward selection. These selection strategies have been utilised in [17], with greedy forward selection, a specific form of forward selection, beginning with no features and adding them progressively to improve predictive accuracy [18]. A code sketch of greedy forward selection follows Fig. 2.

Fig. 2 The procedure of the classification (wrapper) methods for feature selection
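A greedy forward selection of the kind described above might be sketched as follows; the KNN base learner, the three-fold cross-validation, and the cap of eight features are illustrative assumptions rather than choices made in [17, 18].

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=21, n_informative=6,
                           random_state=0)

def greedy_forward_selection(X, y, model, max_features=8):
    # Start with no features; at every step add the single feature that most
    # improves cross-validated accuracy, and stop once no candidate helps.
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=3).mean()
                  for f in candidates}
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_score:
            break  # stopping criterion: the wrapper score no longer improves
        selected.append(best_f)
        best_score = scores[best_f]
    return selected, best_score

print(greedy_forward_selection(X, y, KNeighborsClassifier()))
```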

2.4 Blended learning-based methods

Blended learning, within the context of machine learning, is a potent methodology [19]. This technique integrates the predictions of several classifiers to bolster overall learning precision. Blended learning is characterised by two principal applications. In the first, various classification algorithms are applied to tackle defect prediction challenges. The initial step often involves identifying the most effective classification algorithm. Yet, this approach does not leverage the potential insights available from employing multiple algorithms and struggles with the task of pinpointing the single best classifier. The second application addresses situations involving extensive and varied features, which make it impractical to integrate all features within a single classifier. A crucial element in creating an effective blended classifier is the selection of the underlying classification principles. The absence of appropriate classifiers diminishes the potential benefits of diversity among classifiers, thereby limiting the effectiveness of the blended approach.

3 The proposed method

3.1 Feature selection

This section outlines the proposed method for software defect prediction, which leverages the Binary Chaos-based Olympiad Optimisation Algorithm (BCOOA) in conjunction with ML algorithms. BCOOA is utilised to identify the most impactful features within defect prediction datasets for subsequent analysis using ML algorithms. The suite of ML algorithms applied in this study comprises Artificial Neural Networks (ANN), Decision Trees (DT), K-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). Figure 3 depicts the workflow of the proposed method. Standard datasets were used to train the ML algorithms and evaluate the performance of the resulting software defect predictors. Consequently, BCOOA plays a crucial role in selecting features that notably improve the accuracy and precision of the developed defect predictor, thereby enhancing its classification capabilities.

Fig. 3 The workflow of the proposed defect prediction method

3.2 Training datasets

The datasets employed in this study were sourced from the NASA repository [20], publicly accessible since 2005. The software metrics extracted from this dataset include McCabe's complexity metrics, Halstead's metrics, branch count, and five distinct metrics related to lines of code. Table 1 presents the datasets used in this research, while Table 2 details the 21 features within these datasets. The final feature (the 22nd feature) serves as the dependent variable, indicating the presence of defects in the program code.

Table 1 Specification of NASA datasets utilised in this study
Table 2 Description of features used in NASA datasets

In BCOOA, each participant (referred to as a "student") is modelled as a binary vector of length 21, matching the number of features in the training dataset, labelled F1 to F21. Each element in this binary vector is directly associated with a corresponding feature in the training dataset, as illustrated in Fig. 4. A bit value of zero means the related feature is excluded from the training set, whereas a bit value of one signifies the inclusion of that feature in the model training phase. After executing a cycle of BCOOA with a selected set of features, a fitness score is calculated for each binary vector. This score combines the training error and the number of features chosen, with the objective of reducing both simultaneously. Consequently, through iterative refinement, the algorithm seeks an optimal set of features that minimises the training error while using as few features as possible.

Fig. 4 The structure of a student in BCOOA: a binary vector specifying the selected features
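The encoding just described can be illustrated with a short Python sketch (the helper name is hypothetical; the original study was implemented in MATLAB): a 21-bit student simply masks the columns of the training matrix.

```python
import numpy as np

N_FEATURES = 21  # F1..F21, matching the NASA datasets

def apply_student(student, X_train):
    # A student is a 21-bit vector: bit i == 1 keeps feature F(i+1),
    # bit i == 0 drops the corresponding column of the training matrix.
    mask = np.asarray(student, dtype=bool)
    return X_train[:, mask]

student = np.ones(N_FEATURES, dtype=int)  # initially every feature is included
student[[2, 5, 13]] = 0                   # hypothetical exclusions after learning
X_train = np.random.default_rng(0).normal(size=(100, N_FEATURES))
print(apply_student(student, X_train).shape)  # (100, 18)
```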

3.3 Structure of the Olympiad optimisation algorithm

Following the assembly of the training dataset, the next step introduces a binary adaptation of the Olympiad Optimisation Algorithm (BCOOA) for feature selection. This adaptation is pivotal for pinpointing the most advantageous features within the training dataset prior to developing the desired classification model using machine learning algorithms. BCOOA, drawing inspiration from swarm intelligence, leverages both local and global search strategies. This method, characterised by its population-based and collaborative metaheuristic approach, tackles optimisation challenges through a divide-and-conquer strategy. Within this framework, each entity in the BCOOA population mimics the learning process of a student preparing for an Olympiad competition, with each "student" represented by a 21-element vector corresponding to the number of features in the dataset. Initially, every element of the vector is set to 1, indicating that all features are considered for training. This vector undergoes evolution within a population matrix consisting of nPOP individuals across 21 features. The population progresses through cycles of interactive teaching and learning, continually refining the feature selection. As outlined in Algorithm 1, BCOOA combines local and global search: the population is periodically divided into smaller, equally sized teams, facilitating a focused and efficient search for the optimal feature set. The time complexity of the proposed algorithm is O(MaxIter × TeamNum × TeamSize), where TeamNum denotes the number of teams and TeamSize the number of students per team; all teams contain the same number of students. A simplified code sketch of this loop is given after Algorithm 1.

Algorithm 1 Pseudocode of the proposed BCOOA
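The divide-and-conquer loop of Algorithm 1, and its O(MaxIter × TeamNum × TeamSize) cost, might be sketched as follows. This is a simplified reading that keeps only the sort-partition-imitate skeleton and omits the mutation fallbacks described in Sect. 3.5; all function names and the toy fitness are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def imitate(worse, better, learn_count):
    # Crossover-style imitation: copy learn_count randomly chosen bits
    # from the better student into a copy of the worse one.
    idx = rng.choice(worse.size, size=learn_count, replace=False)
    out = worse.copy()
    out[idx] = better[idx]
    return out

def ooa_main_loop(fitness, pop, max_iter=50, team_num=5):
    # pop: (nPop, 21) binary matrix; nPop must divide evenly into team_num teams.
    team_size = pop.shape[0] // team_num
    learn_count = int(0.3 * pop.shape[1])           # 30% of the array length
    for _ in range(max_iter):                       # O(MaxIter
        pop = pop[np.argsort([fitness(s) for s in pop])]  # best students first
        teams = pop.reshape(team_num, team_size, -1)
        for t in range(1, team_num):                #   x TeamNum
            for s in range(team_size):              #   x TeamSize)
                # every student imitates the local best of the better neighbour team
                teams[t, s] = imitate(teams[t, s], teams[t - 1, 0], learn_count)
        pop = teams.reshape(-1, pop.shape[1])
    return pop[0]  # the global best student

# Toy run: minimise the number of selected features.
pop0 = rng.integers(0, 2, size=(20, 21))
print(ooa_main_loop(lambda s: s.sum(), pop0))
```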

3.4 Chaos-based population

To ensure an even distribution of the population across the search space, two key characteristics to consider are the initial population's ergodicity and diversity. In this study, the initial populations are generated using chaotic maps, specifically the logistic map, which can produce a solution space with a favourable distribution. Chaotic sequences promote diversity within the initial population, expedite convergence, and enhance global search capabilities. Two sets of experiments were conducted to explore the use of chaos-based initial populations in the defect prediction problem. In the first experiment, the initial population was randomly generated, while in the second, chaos theory was utilised. The logistic map, a second-degree polynomial mapping and a primary chaotic map in optimisation, was applied to generate the initial population in the second set of experiments. In Eq. (1), Xn denotes the population value at iteration n, and r is a positive constant indicating the growth rate. When r is very small, the sequence collapses towards zero, whereas larger growth rates yield richer dynamics. In this study, r is experimentally calibrated to 4, at which the map is fully chaotic. In Eq. (1), the value of Xn falls within the range of 0 to 1.

$$X_{n + 1} = r \times X_{n} \times \left( {1 - X_{n} } \right), \quad r = 4$$
(1)
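A chaos-based initial population following Eq. (1) could be generated as below; the seed value x0 = 0.7 and the 0.5 threshold used to binarise the chaotic sequence into students are our assumptions, as the paper does not state them.

```python
import numpy as np

def logistic_map_population(n_pop, n_features, r=4.0, x0=0.7):
    # Iterate Eq. (1), x_{n+1} = r * x_n * (1 - x_n), to fill the population
    # with a chaotic sequence in (0, 1); r = 4 gives fully chaotic behaviour.
    values = np.empty(n_pop * n_features)
    x = x0
    for i in range(values.size):
        x = r * x * (1.0 - x)
        values[i] = x
    # Threshold at 0.5 to obtain binary students (our assumption; the paper
    # does not state how the chaotic values are binarised).
    return (values.reshape(n_pop, n_features) > 0.5).astype(int)

print(logistic_map_population(n_pop=5, n_features=21))
```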

3.5 Learning operation in OOA

Algorithm 2 presents the pseudo-code for the core segment of the Olympiad Optimisation Algorithm (OOA), with line 9 detailing the MATLAB implementation for dividing the population into subgroups. These subgroups utilise unique imitation strategies to probe different areas of the extensive solution space. A competitive learning atmosphere is cultivated among the individuals, each possessing a memory to record their progress. Within OOA, every participant, referred to as a student, is modelled and executed as a numerical array, together constituting the student population that represents potential solutions. As depicted in Algorithm 1, the process begins by dividing the student population into n teams after sorting. Each team consists of m students, with the foremost team acknowledged as the global best and the final team as the global worst. Teams embark on exploring their designated local solution spaces, with the lead student of each team acting as its local best.

Students, organised into teams, strive to assimilate knowledge from either the local best student of a neighbouring team or from the global best team. The learning operator introduced here integrates both local and global search approaches within OOA, and is the key mechanism for identifying the optimal solution. This operator's goal is to augment the collective knowledge (fitness) of the population, classifying individuals by their knowledge (fitness function). Through the deployment of this learning operator, OOA seeks to enhance the overall knowledge of the population across four critical phases. Initially, students in each team gain insights from their peers in the neighbouring team, so that the foremost team's knowledge flows to the others, similar to a bubble effect. This phase involves sequential knowledge transfer from one team to the next. Should the lesser-performing students within a team fail to advance their knowledge, the next step involves mutating the best student within that team to introduce variety. This mutation acts as a local search targeting the best student of the less proficient team. In the absence of improvement from this intervention, the third phase is initiated, in which all students in the less proficient teams receive instruction from their counterparts in the globally leading (first) team.

In situations where the top-performing students are unable to transfer knowledge to those in the least performing team, a mutation strategy is applied to the student with the highest global performance. This mutation, targeting the globally best student, is designed to prevent convergence to local optima by introducing a degree of variability. As a result, agents from various teams come together to form a refreshed population. This process of renewing the population is iteratively managed by the application of the learning operator, as detailed in Algorithm 2, which outlines the pseudocode for each iteration of the OOA. The learning process, also referred to as the imitation mechanism, is implemented through a crossover technique. This process entails substituting specific bits (elements) of the least performing individual with those from the highest performing individual. The parameter LearnCount specifies the number of bits to be exchanged, with the optimal proportion identified empirically as 30% of the array's length. A small sketch of the mutation step follows Algorithm 2.

Algorithm 2 Pseudocode of the OOA's learning operator
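The mutation fallback invoked in the second and fourth phases might look like the following bit-flip sketch; the number of flipped bits is an assumed parameter, since the paper does not specify a mutation rate.

```python
import numpy as np

rng = np.random.default_rng(1)

def mutate(student, n_bits=2):
    # Bit-flip mutation applied to a team's best (phase two) or the global
    # best (phase four) student when imitation stalls; flipping a few bits
    # injects diversity and helps the search escape local optima.
    out = student.copy()
    idx = rng.choice(out.size, size=n_bits, replace=False)
    out[idx] = 1 - out[idx]
    return out

print(mutate(np.ones(21, dtype=int)))
```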

3.6 Binary OOA

In this section, a novel binary strategy based on the Sigmoid function for OOA is presented. Each solution (student) in the population of this method is represented as a vector of continuous values. Since the task is to select or reject each feature, the corresponding binary solution must contain only 0s and 1s: a feature is included in the new dataset if its value is 1 and excluded if its value is 0. The Sigmoid function [21] was used to map the values of an individual's array in the OOA into the binary space. In the proposed method, BCOOA is utilised to select effective features for ML algorithms. OOA repositions entities within a continuous state space, whereas in problems such as feature selection, solutions are constrained to the binary values 0 and 1, necessitating a binary variant. In BCOOA, the learning operator (Algorithm 2) updates the student array, attracting each student towards the first (best) solution. The vector \({\overrightarrow{X}}_{{\text{i}}}^{d}(t)\) denotes the position of the i-th student in dimension d at iteration t. Accordingly, the Sigmoid function is applied to binarise the solution's location in the OOA, as shown in Eq. (2). The output of the Sigmoid transfer function is a value between zero and one; to convert it into a binary (discrete) value, a threshold is required. The Sigmoid function incorporates the random threshold given in Eq. (3) to decide whether a feature is chosen, converting the position to a binary solution. Here, rand is a uniformly distributed random number in [0, 1]. The solutions of the BCOOA population are thus constrained to move inside a binary (discrete) search space by using Eqs. (2) and (3), which are embedded directly into BCOOA.

$$SG\left( {X_{i}^{d} \left( t \right)} \right) = \frac{1}{{1 + e^{{ - X_{i}^{d} \left( t \right)}} }}$$
(2)
$$X_{i}^{d} \left( {t + 1} \right) = \left\{ {\begin{array}{*{20}c} 0 & {\text{if}\;rand < SG\left( {X_{i}^{d} \left( t \right)} \right)} \\ 1 & {\text{if}\;rand \ge SG\left( {X_{i}^{d} \left( t \right)} \right)} \\ \end{array} } \right.$$
(3)
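Equations (2) and (3) translate directly into code. The sketch below follows the paper's convention of assigning 0 when rand < SG and 1 otherwise.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    # Eq. (2): squash a continuous OOA position into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def binarise(position):
    # Eq. (3): compare against a uniform random threshold; note the paper's
    # convention of assigning 0 when rand < SG and 1 otherwise.
    rand = rng.uniform(size=np.shape(position))
    return np.where(rand < sigmoid(position), 0, 1)

print(binarise(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
```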

To determine the fitness of feature vectors, Eqs. (4) and (5) are proposed. These equations assess the fitness of a solution (the selected features) as the sum of the training error and the normalised count of features, functioning as a minimisation objective. As given in Eq. (5), the training error, which ranges from zero to one, is calculated by subtracting the model's accuracy from 100 and then normalising this value to the [0, 1] interval. Additionally, the term "Used Features" in Eq. (4) refers to the count of selected features for training, also normalised to fall within the [0, 1] range.

$${\text{Fitness}}\left( i \right) = {\text{Error}}\;{\text{Percentage}}\;{\text{Test}}\left( i \right) + {\text{Used}}\;{\text{Features}}$$
(4)
$${\text{Error}}\;{\text{Percentage}}\;{\text{Test}}\left( i \right) = \left( {100{-}{\text{Accuracy}}\left( i \right)} \right)/100$$
(5)
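A fitness evaluation in the sense of Eqs. (4) and (5) might be coded as below; the KNN stand-in classifier and the three-fold cross-validation are illustrative assumptions (the study scores candidates on a held-out split), and the worst-case score for an empty feature set is a guard we add.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(student, X, y, model=None):
    # Eqs. (4) and (5): fitness = normalised error + normalised feature
    # count, both in [0, 1], so smaller values are better.
    model = model if model is not None else KNeighborsClassifier()
    mask = np.asarray(student, dtype=bool)
    if not mask.any():
        return 2.0  # guard we add: an empty feature set gets the worst score
    accuracy = cross_val_score(model, X[:, mask], y, cv=3).mean() * 100
    error = (100 - accuracy) / 100        # Eq. (5)
    used = mask.sum() / mask.size         # normalised number of used features
    return error + used                   # Eq. (4)

# Toy check with synthetic data.
X, y = make_classification(n_samples=150, n_features=21, random_state=0)
print(fitness(np.ones(21, dtype=int), X, y))
```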

4 Experiments and results

4.1 Experiments platform and datasets

This section evaluates the performance of the proposed method for predicting software defects. Initially, results were gathered from machine learning algorithms (ANN, DT, K-NN, NB, and SVM) using all available features in a conventional approach. Subsequently, we applied the proposed method using BCOOA, which is specifically tailored for effective feature selection. A comparative analysis of the outcomes from both methodologies is presented. The proposed method, together with the binary Ant Colony Optimisation (ACO) algorithm used as a comparison baseline, was implemented in MATLAB 2022 and executed on a Windows 10 PC equipped with an Intel Core i7 processor at 2 GHz and 4 GB of RAM. Table 3 details the calibration parameters used in both the BCOOA and ACO methods for effective feature selection in software defect prediction. The binary version of ACO was introduced as a feature selection tool in [22].

Table 3 The BCOOA parameter values used to select the effective features

The training and testing phases were conducted using five datasets as detailed in Table 1. Initially, 70% of the data were randomly allocated for training the ANN, DT, K-NN, NB, and SVM models. The testing phase then utilised the remaining 30% of the data, which was not involved in the training phase, to evaluate the performance of the learning models. Table 4 presents the characteristics of both the training and testing data. The Confusion Matrix, provided in Table 5, offers the essential metrics required to calculate sensitivity, specificity, and accuracy.

Table 4 Features of the used training and test data
Table 5 Confusion matrix

4.2 Evaluation criteria

In this study, ‘accuracy’, ‘precision’, ‘recall’, and ‘F1’ are used as evaluation criteria for the software module classification problem; Eqs. (6), (7), (8) and (9) were used to calculate these criteria.

$$Accuracy = \frac{True\;Positive + True\;Negative}{{All\;test\;samples}}$$
(6)
$$Precision = \frac{True\;Positive}{{True\;Positive + False\;Positive}}$$
(7)
$$Recall = \frac{True\;Positive}{{True\;Positive + False\;Negative}}$$
(8)
$$F_{1} = \frac{{2*\left( {Precision*Recall} \right)}}{Precision + Recall}$$
(9)
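Equations (6)-(9) map directly onto the confusion-matrix counts of Table 5, as in the following small sketch with hypothetical counts.

```python
def classification_metrics(tp, fp, tn, fn):
    # Eqs. (6)-(9) computed straight from the confusion-matrix counts of Table 5.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 100-module test set.
print(classification_metrics(tp=20, fp=5, tn=70, fn=5))
# -> (0.9, 0.8, 0.8, 0.8)
```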

Generally, the confusion matrix serves as a tool to assess the accuracy and effectiveness of classification models. Its analysis in classification and prediction involves delineating four outcomes: True Positives (TP), where the model correctly predicts the positive class; False Positives (FP), where the model incorrectly predicts the positive class; True Negatives (TN), where the model correctly predicts the negative class; and False Negatives (FN), where the model incorrectly predicts the negative class. The confusion matrix provides three critical metrics—precision, accuracy, and recall—used to assess the performance of software module classification models. A comprehensive series of experiments was conducted on classifiers developed using various ML algorithms, including ANN, DT, K-NN, NB, and SVM. These experiments aimed to answer specific Research Questions (RQs) related to each dataset.

  • RQ1: Does the proposed method increase the accuracy, precision, recall, and F1 of ML algorithms in the software defect prediction problem?

  • RQ2: Does the proposed method identify and eliminate the ineffective features of the training datasets?

  • RQ3: Are the convergence speed and success rate of the proposed BCOOA higher than those of the other methods?

4.3 Discussion of the results from the first experiment (CM1 dataset)

The initial experiment utilised the CM1 dataset for both training and testing to address RQ1. Table 6 shows the results obtained from applying the ML algorithms to the CM1 dataset, evaluating the performance metrics for ANN, DT, K-NN, NB, and SVM without using BCOOA for feature selection. This table compares the classifiers developed using the aforementioned ML algorithms in terms of accuracy, precision, recall, and F1 score. These experiments were first conducted on the raw training set, which included all available features.

Table 6 Performance criteria of ML algorithms on the CM1 dataset without feature selection

Redundant and irrelevant features within the training dataset can lead to overfitting in machine learning (ML) algorithms. The primary objective of feature selection is to identify a concise set of features that are sufficient for accurate prediction of the target label. Redundant features have the potential to mislead ML algorithms, while feature selection methods aid in reducing costs by eliminating unnecessary information. Subsequent experiments were conducted on each dataset using BCOOA in conjunction with ML algorithms. BCOOA was employed to identify the optimal features within the training sets. Table 7 demonstrates the outcomes of applying ML algorithms to the CM1 dataset, utilising BCOOA for feature selection. A comparative analysis of the results presented in Tables 6 and 7 reveals that the efficiency of predicting software defects, based on the features selected by BCOOA, surpasses that of the previous method. Specifically, the accuracy and precision of the classifier created by ANN without using BCOOA were 88.15% and 88.46%, respectively; these figures increased to 92.03% and 98.02% when BCOOA was applied for feature selection. The experimental results show that the integration of BCOOA enhances the performance of ML algorithms in predicting software defects.

Table 7 Performance of ML algorithms on the CM1 dataset using the selected effective features by BCOOA

Figure 5 displays the average performance of classifiers generated by ML algorithms, both with and without feature selection. The average accuracy of classifiers created without the feature selection method is 87.02%, which significantly improves to 93.75% when applying the proposed feature selection method across the same ML algorithms (ANN, DT, K-NN, NB, and SVM). Similar improvements are observed in other performance metrics. For example, the average precision for classifiers developed without and with the proposed feature selection algorithm is approximately 92.63% and 95.38%, respectively. Moreover, the average recall metric for classifiers developed using the proposed method markedly exceeds that of classifiers generated solely by ML algorithms. Thus, the proposed method exhibits greater effectiveness in comparison to the use of ML algorithms without feature selection.

Fig. 5 The average values of the performance criteria of ML algorithms (ANN, DT, KNN, SVM and NB) with and without BCOOA as the feature selection method in the CM1 dataset

Figure 6 illustrates a comparison between the performance metrics of traditional ML algorithms and those optimised by BCOOA (a feature selection method) on the CM1 dataset. The simulation results reveal that ML algorithms, when applied to a training dataset with selectively chosen features, outperform those using a conventional ML approach. The incorporation of selected effective features through BCOOA leads to the ML + BCOOA method, showcasing superior performance compared to standalone ML algorithms. Overall, the proposed BCOOA method for selecting effective features demonstrates notable performance improvements when applied to ML algorithms using the CM1 dataset.

Fig. 6 The impact of BCOOA on the performance of ML algorithms in the CM1 dataset

Another critical aspect of feature selection-based defect prediction methods is the number of features selected by the BCOOA algorithm. Ineffective features can undermine a model's ability to generalise, potentially diminishing overall classifier accuracy. Moreover, as the quantity of variables increases, so does the model's complexity. The proposed feature selection algorithm identifies effective features, thereby improving defect prediction accuracy. Table 8 displays the number of effective features identified by BCOOA across various ML algorithms. For the CM1 dataset—a real-world defect prediction dataset consisting of 21 independent and one dependent feature—BCOOA selects a concise subset of the most effective features.

Table 8 The average number of features selected by BCOOA and ACO for different ML algorithms in the CM1 dataset over ten executions

The proposed method's average performance, calculated over ten runs, showed consistent detection effectiveness. Table 8 compares the average number of features selected by BCOOA and binary ACO across ten runs. For the ANN classifier, BCOOA and ACO selected 9 and 8 features, respectively. For the DT classification algorithm, BCOOA chose seven features, while ACO picked eight. Similarly, for the KNN classification algorithm, BCOOA identified nine features, in contrast to ACO's 15. The number of features selected by BCOOA remained consistently small, highlighting BCOOA's efficiency in pinpointing the most effective features compared to ACO. Figure 7 shows how often the various features of the CM1 dataset were selected by BCOOA across ten runs.

Fig. 7 The selection rate of the 21 features by BCOOA in the CM1 dataset during ten executions

Table 9 lists the features selected by BCOOA for different ML algorithms in the CM1 dataset, showcasing the optimal results from 10 runs. Each ML algorithm, alongside the feature selectors, was executed ten times, and the features chosen during the most successful runs (based on the highest evaluation criteria) were deemed the most effective. This table shows the unique features identified by different algorithms, revealing variations in feature selection. The proposed method's application on the CM1 dataset was repeated ten times to verify its reliability, and the findings were documented. These results demonstrate significant improvements in accuracy and precision. With a total of 21 features, the algorithm leverages these effective features for both training and testing, thus shortening the execution time of the learning algorithm and boosting its efficiency. Some features were consistently selected during the training by BCOOA, indicating their higher effectiveness. Figure 7 illustrates the comparative importance of each feature in the CM1 dataset predictions. Table 9 specifies the features BCOOA selected in the most successful run on the CM1 dataset.

Table 9 The selected features by BCOOA and ACO in different ML algorithms in CM1 in the best result from 10 executions

4.4 Discussion of the results from the second experiment (KC1 dataset)

The second series of experiments was conducted using the KC1 dataset to address the research questions. The experiment employed the KC1 dataset for both training and testing purposes, representing features from a real-world software application. KC1 comprises 2107 records and 21 features, with 1475 records allocated for training and 633 records for testing. Table 10 shows the performance metrics of classifiers generated by various ML algorithms without BCOOA. The defect predictors (classifiers) created by ANN, DT, KNN, NB, and SVM, utilising the raw training set encompassing all 21 features, are depicted. These classifiers have been assessed based on accuracy, precision, recall, and F1 metrics. In these initial trials on the raw training set, KNN's classifier demonstrated the highest accuracy and precision, while the defect predictor generated by ANN outperformed the other models in terms of recall and F1.

Table 10 Performance criteria of ML algorithms on the KC1 dataset without feature selection

The presence of unnecessary features in the training dataset can lead to overfitting in ML algorithms, potentially misguiding the models. To evaluate the impact on ML algorithm performance, we utilised the KC1 dataset as another benchmark. In the second round of tests conducted on the KC1 dataset, we employed BCOOA alongside ML algorithms. Table 11 presents the outcomes derived from applying the ML method to the KC1 dataset to ascertain the most effective features. For instance, the classifier built using ANN without BCOOA exhibited an accuracy and precision of 84.67% and 85.01%, respectively. However, upon employing BCOOA as a feature selection method, these metrics notably increased to 89.41% and 91.20%. These results indicate that the utilisation of BCOOA significantly enhances the performance of ML algorithms in the software defect prediction domain when applied to the KC1 dataset. Comparing Tables 10 and 11, it is evident that the efficiency criteria for predicting software defects via the selection of effective features using BCOOA surpass the prior method.

Table 11 Performance criteria of ML algorithms on the KC1 dataset with BCOOA feature selection

Figure 8 illustrates the average performance of classifiers generated by ML algorithms, both with and without feature selection, applied to the KC1 dataset. When employing the suggested feature selection strategy with the ML algorithms (ANN, DT, KNN, NB, and SVM), the average accuracy of classifiers created by standalone ML and ML + BCOOA stands at 84.59% and 89.63%, respectively. Comparable trends are observed across the other performance criteria. Specifically, classifiers generated by ML algorithms without (ML) and with the suggested feature selector (ML + BCOOA) exhibit average precisions of approximately 87.05% and 91.27%, respectively. Notably, the classifiers developed using the proposed technique show significantly higher average recall metrics compared to standalone ML algorithms. In summary, classifiers derived through the suggested technique outperform those produced by ML algorithms lacking feature selection. BCOOA enhances defect prediction model accuracy by identifying and eliminating ineffective features. Over ten executions, the suggested approach's average performance on the KC1 dataset shows commendable detection capabilities.

Fig. 8 The average values of the performance criteria of ML algorithms (ANN, DT, KNN, NB, and SVM) with and without BCOOA as the feature selection method in the KC1 dataset

Figure 9 presents a comparison between the performance metrics of enhanced ML algorithms, utilising BCOOA as a feature selection technique, and traditional ML methods on the KC1 dataset. Notably, the ML algorithms operating on the training dataset with selected features outperform the traditional ML approach. This figure provides a comprehensive comparison of performance measures between the conventional and optimised ML models applied to the KC1 dataset. The ML + BCOOA approach exhibits superior performance, capitalising on the chosen effective features. Using BCOOA for feature selection demonstrates satisfactory performance when applied to ML algorithms using the KC1 dataset. Notably, the accuracy, precision, recall, and F1 metrics of the optimised ML algorithms employing the BCOOA method surpass all other models. In summary, BCOOA for selecting effective features results in commendable performance on the KC1 dataset.

Fig. 9 The impact of BCOOA on the performance of the ML algorithms in the KC1 dataset

Table 12 illustrates the average number of features selected by binary ACO and BCOOA across ten executions on the KC1 dataset. As shown in Table 12, when ANN was employed as the classification algorithm, BCOOA selected 9 features while ACO selected 16. In the case of the DT classification algorithm, BCOOA and ACO selected 9 and 12 features, respectively. For the KNN classification algorithm, BCOOA and ACO selected 11 and 9 features, respectively. Moreover, for the NB classification algorithm, BCOOA selected 8 features, whereas ACO selected 3. In summary, the average number of features selected by the proposed BCOOA remains far lower than the full feature set used by ML algorithms without feature selection.

Table 12 The average number of features selected by BCOOA and ACO for different ML algorithms in the KC1 dataset over ten executions

Figure 10 illustrates the selection frequency of features in the KC1 dataset across ten executions by BCOOA. Table 13 presents the features selected by BCOOA across various ML algorithms, yielding the most optimal outcomes after ten executions. Each ML method underwent ten iterations, and the most effective features were those selected in the best-performing implementations, marked by the highest evaluation criteria. This table shows the most efficient features obtained from diverse techniques in their most successful executions.

Fig. 10 The selection rate of the 21 features by BCOOA in the KC1 dataset during ten executions

Table 13 The selected features by BCOOA and ACO in different ML algorithms in KC1 in the best result from 10 executions

The proposed approach was executed ten times on the KC1 dataset to establish its stability, exhibiting noticeable enhancements in precision and accuracy metrics. This method considers 21 features and selectively utilises the most effective features to train the classifier. During BCOOA's training, a consistent selection of specific features occurred repeatedly, highlighting their significant impact on the accuracy and precision of the generated classifier. Table 13 indicates the selected features by BCOOA in the best-case scenario in the KC1 dataset.

4.5 Discussion of the results from the third experiment (JM1 dataset)

The third series of experiments, targeting specific research inquiries, employed the JM1 dataset for both training and testing purposes. The JM1 dataset consists of 21 features and 10,885 records, with the training set comprising 7619 records and the test dataset containing 3266 records. Table 14 shows the performance metrics of classifiers developed using various ML algorithms without BCOOA. These classifiers were built using the raw training set, which includes 21 features and 7619 records, utilising algorithms such as ANN, DT, KNN, NB, and SVM. All 21 features were considered in this raw training set. Notably, the KNN classifier emerged as the most accurate and precise among the prediction models, outperforming others in terms of accuracy and precision.

Table 14 Performance criteria of ML algorithms on the JM1 dataset without feature selection

Ineffective features within the training dataset can often lead to overfitting in ML algorithms. In the second phase of the third experiment, the JM1 dataset underwent evaluation using BCOOA in conjunction with ML algorithms. Table 15 presents the test results of classifiers developed by ML algorithms leveraging BCOOA to determine the most effective features for the JM1 dataset. Classifiers developed using KNN and SVM, integrated with BCOOA, achieved accuracies of 93.00% and 92.08%, respectively. Notably, employing BCOOA in ML algorithms enhanced the accuracy and precision of the generated classifiers.

Table 15 Performance criteria of ML algorithms on the JM1 dataset with the BCOOA feature selection

For example, the recall and F1 metrics of the defect predictor built by ANN escalated from 85.82% and 84.89%, respectively, to 98.92% and 91.07% when BCOOA was employed as the feature selection method. Consistent with other datasets, these experiments underscore that ML algorithms perform significantly better when BCOOA is utilised as a feature selector within the JM1 dataset. Comparing Tables 14 and 15, it is evident that the efficiency of software defect predictors leveraging BCOOA as a feature selector surpasses the previous approach. The proposed feature selector significantly enhances the efficacy of defect predictors developed by various ML algorithms. Figure 11 provides a comparison of performance metrics between ML techniques with and without the feature selector algorithm on the JM1 dataset.

Fig. 11 The impact of BCOOA on the ML algorithms' performance in the JM1 dataset

The ML algorithms utilising selected features on the JM1 dataset outperform typical ML techniques across the accuracy, precision, recall, and F1 metrics. Specifically, these algorithms demonstrate superior performance when equipped with the selected features. Additionally, Fig. 11 contrasts the performance metrics of defect predictors generated by optimised (ML + BCOOA) and standard ML algorithms on the JM1 dataset. Notably, ML + BCOOA achieves superior results by leveraging the chosen effective features. The BCOOA method for selecting valuable features demonstrates effective performance across the ML algorithms employed on the JM1 dataset. The defect predictors developed by the optimised ML (ML + BCOOA) exhibit higher accuracy, precision, recall, and F1 metrics than any other model. In summary, the outcomes of the experiments conducted on the JM1 dataset highlight the efficacy of BCOOA for efficient feature selection, showing its significant impact on enhancing the performance of ML algorithms.

Figure 12 presents the average performance of defect predictors generated by ML algorithms on the JM1 dataset, both with and without feature selection. The average accuracy of defect predictors produced by standalone ML and ML + BCOOA algorithms stands at 79.57% and 86.16%, respectively. Comparable trends were observed across the other performance criteria. Specifically, the average precision of classifiers generated by ML algorithms without and with the suggested feature selector (ML + BCOOA) hovers around 83.14% and 87.12%, respectively. Notably, classifiers developed using the recommended method (ML + BCOOA) exhibit a markedly higher average recall metric in comparison with standalone ML techniques. Ultimately, classifiers generated by the proposed method surpass those produced by ML algorithms that do not involve feature selection. BCOOA efficiently identifies and eliminates ineffective features, enhancing the accuracy of the generated defect predictor. Moreover, Table 16 shows the average number of effective features identified by BCOOA under different ML techniques, selectively choosing the most impactful features from the 21 available in the JM1 dataset.

Fig. 12 The average values of the performance criteria of ML algorithms (ANN, DT, KNN, NB and SVM) with and without BCOOA as the feature selection method in the JM1 dataset

Table 16 The average number of features selected by ACO and BCOOA for different ML algorithms in the JM1 dataset

The average performance of the new approach was assessed over ten executions on the JM1 dataset, demonstrating commendable detection performance. Table 16 illustrates the average number of features selected by binary ACO and BCOOA across ten runs on the JM1 dataset. When using ANN as the ML algorithm, BCOOA and ACO each selected 12 features. For the DT ML algorithm, BCOOA and ACO opted for 8 and 9 features, respectively. In the KNN classification process, BCOOA and ACO selected 9 and 8 features, respectively. Similarly, for the NB classification, BCOOA and ACO selected 8 and 6 features, respectively. Overall, BCOOA selects a significantly smaller average number of features than the full feature set used by conventional ML algorithms.

Figure 13 illustrates the feature selection rate of BCOOA across ten executions in the JM1 dataset, showcasing the method's consistency in feature selection. Table 17 outlines the features from the JM1 dataset that were selected and refined by BCOOA using various ML algorithms, resulting in optimal outcomes after ten iterations. Each ML approach underwent ten runs, with features meeting the highest assessment criteria being identified as the most effective.

Fig. 13 The selection rate of the 21 features by BCOOA in the JM1 dataset during ten executions

Table 17 The selected features by BCOOA and ACO in different ML algorithms in JM1 in the best result from 10 executions

To assess the reliability of the proposed technique, it underwent ten iterations on the JM1 dataset. The outcomes revealed enhancements in metrics including accuracy, precision, recall, and F1 scores. Employing all 21 features, the method trained the ML algorithm using selectively identified features by BCOOA. A distinct subset of features consistently surfaced in BCOOA's output, shaping the defect predictor and significantly influencing the resulting classifier's accuracy and precision. The most effective features selected by BCOOA across different ML algorithms are presented in Table 17.

4.6 Discussion of the results from the fourth experiment (KC2 dataset)

The fourth series of experiments focused on the KC2 dataset to address specific research inquiries. This dataset was utilised for both training and test sets, featuring 522 records and 21 distinct features. Within the KC2 dataset, the training set consists of 365 records, while the test set comprises 157 records. Table 18 presents the performance metrics of defect predictors (classifiers) generated by ANN, DT, KNN, NB, and SVM using the raw training set, which includes all 21 features and 522 records. This table provides insights into the defect predictors' performance derived from various ML algorithms without the involvement of BCOOA. Notably, the KNN-based classifier exhibits the highest accuracy and precision among the prediction models.

Table 18 Performance criteria of ML algorithms on the KC2 dataset without feature selection

As discussed earlier, ineffective features within the KC2 training dataset may contribute to overfitting in ML algorithms. To address this concern, the KC2 dataset underwent a second phase of experimentation, utilising BCOOA in conjunction with ML algorithms. Table 19 presents the test results of defect predictors developed by ML algorithms using BCOOA to identify the most beneficial features within the KC2 dataset. With the implementation of BCOOA, classifiers constructed through KNN and SVM exhibit accuracies of 94.20% and 87.51%, respectively. Incorporating BCOOA into ML algorithms significantly enhances the precision and accuracy of the resulting classifiers. For example, the recall and F1 scores of the defect predictor generated by ANN without BCOOA stand at 90.16% and 86.27%, respectively. However, with the utilisation of BCOOA for feature selection, these metrics notably increase to 98.91% and 94.28%. Consistent with the other datasets, the studies demonstrate improved performance on the KC2 dataset when employing BCOOA as the feature selector within ML algorithms. A comparison between Tables 18 and 19 shows that software defect predictors leveraging BCOOA as the feature selector display superior efficiency compared to the previous method. Overall, the introduced feature selection method significantly enhances the effectiveness of defect predictors produced by various ML algorithms.

Table 19 Performance criteria of ML algorithms on the KC2 dataset with BCOOA feature selection

Figure 14 provides a comparative analysis of the performance metrics of ML algorithms with and without feature selection on the KC2 dataset. The ML + BCOOA methods outperform traditional ML algorithms across accuracy, precision, recall, and F1 measures, owing to the selected features from the KC2 dataset. This figure illustrates the contrast between defect predictors generated by ML + BCOOA and conventional ML techniques. ML + BCOOA demonstrates superior performance by leveraging the selected effective features. The BCOOA method, aimed at optimal feature selection, demonstrates effectiveness when applied to ML algorithms using the KC2 dataset. The defect predictors derived from the enhanced ML (ML + BCOOA) exhibit the highest F1, recall, accuracy, and precision in comparison with previous models. Overall, the results obtained from experiments on the KC2 dataset underscore the effectiveness of BCOOA for efficient feature selection.

Fig. 14 The impact of BCOOA on the ML algorithms' performance in the KC2 dataset

Figure 15 presents the average performance of defect predictors generated by ML methods on the KC2 dataset, both with and without feature selection. The defect predictors generated by standalone ML and ML + BCOOA exhibit average accuracies of 82.76% and 89.47%, respectively. Comparable outcomes were achieved for the remaining performance criteria. The classifiers produced by ML techniques, both without and with the suggested feature selection (ML + BCOOA), had average precisions of around 88.14% and 92.20%, respectively. In terms of outcomes, the classifiers developed using the proposed approach (ML + BCOOA) demonstrate an average recall metric value that is notably higher than those generated by standalone ML methods. Ultimately, the defect predictor produced by the suggested technique (ML + BCOOA) outperforms those generated by ML algorithms without feature selection. By identifying and removing ineffective features, BCOOA enhances the defect predictor's accuracy. The quantity of useful features discovered by BCOOA using various ML methods is displayed in Table 20. Out of the 21 features in the KC2 dataset, BCOOA effectively selects the most valuable ones.

Fig. 15 The average values of the performance criteria of the ML algorithms (ANN, DT, KNN, NB and SVM) with and without BCOOA as the feature selector in the KC2 dataset over ten executions

Table 20 The average number of features selected by ACO and BCOOA with different ML algorithms on the KC2 dataset over ten executions

The average performance of the suggested method on the KC2 dataset was evaluated over ten executions. Table 20 reports the average number of features chosen during these ten runs by the binary ACO and by BCOOA. When ANN was employed as the ML method, BCOOA and ACO selected 10 and 7 features, respectively; for KNN the counts were 10 and 9, and for SVM 8 and 7. In every case, the average number of selected features is substantially smaller than the full set of 21 features used by the ML algorithms without feature selection. The feature selection rate of BCOOA over the ten runs on the KC2 dataset is depicted in Fig. 16. Table 21 lists the KC2 features selected by BCOOA with the various ML algorithms in the best of the ten runs; each ML strategy was run ten times in order to identify the most beneficial features under the evaluation criteria.
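
The wrapper objective that drives such a search can be sketched as follows. This is not the paper's exact fitness function; it assumes a common formulation that rewards cross-validated accuracy and lightly penalises subset size (the weight `alpha` is hypothetical).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.01):
    """mask: binary vector of length 21 (1 = feature kept); higher is better."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                       # an empty subset is invalid
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(),
                          X[:, mask], y, cv=5).mean()
    return acc - alpha * mask.sum() / mask.size   # accuracy minus size penalty
```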

Fig. 16 The selection rate of the 21 features by BCOOA in the KC2 dataset over ten executions

Table 21 The features selected by BCOOA and ACO with different ML algorithms on the KC2 dataset in the best result from ten executions

The suggested strategy was applied ten times to the KC2 dataset to demonstrate its stability; the outcomes consistently show improved accuracy, precision, recall, and F1 measures. Starting from the full set of 21 features, the method trains the ML algorithm on the subset selected by BCOOA, and this subset predominantly determines the accuracy and precision of the resulting defect predictor. The relative contribution of each feature to the defect predictor in the KC2 dataset is depicted in Fig. 16.

4.7 Discussion of the results from the fifth experiment (PC1 dataset)

In response to the study's research question, the final series of experiments was conducted on the PC1 dataset, which comprises 21 features and 1109 entries; the training set consists of 776 entries and the test set of 333. Table 22 displays the performance of the defect predictors generated by the various ML algorithms in the absence of BCOOA. In terms of accuracy and precision, the defect predictor produced by KNN outperforms the other prediction models.

Table 22 Performance criteria of ML algorithms on the PC1 dataset without feature selection
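
For reference, the reported PC1 split sizes (776 training rows and 333 test rows out of 1109) can be reproduced as in the sketch below; the `stratify` argument is an assumption made here to keep the defect ratio comparable across the two sets.

```python
from sklearn.model_selection import train_test_split

# X, y: PC1 feature matrix (1109 x 21) and defect labels, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=333, stratify=y, random_state=0)   # 1109 - 333 = 776
```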

The second phase of the fifth (final) experiment was conducted on the PC1 dataset, employing BCOOA in conjunction with the ML algorithms; the test outcomes are reported in Table 23. With BCOOA integrated, the classifiers produced by ANN and KNN reach accuracies of 96% and 98.80%, respectively. For instance, the defect predictor generated by ANN alone exhibits a recall of 99.66% and an F1 score of 95.54%; with BCOOA as the feature selector, these metrics increase to 100.00% and 98.19%. Mirroring its performance on the other datasets, BCOOA proves an effective feature selector for the ML algorithms on PC1, and a comparison between Tables 22 and 23 shows that the defect predictors leveraging BCOOA are markedly more effective than those built without it.

Table 23 Performance criteria of ML algorithms on the PC1 dataset with BCOOA feature selection

Figure 17 compares the performance metrics of the ML algorithms with and without feature selection on the PC1 dataset. Leveraging the features selected from PC1, the ML + BCOOA approaches surpass the conventional ML algorithms across the accuracy, precision, recall, and F1 measures, with the improved ML (ML + BCOOA) predictors attaining the highest values of all four criteria. By discerning and eliminating ineffective features, BCOOA significantly elevates defect predictor performance on this dataset.

Fig. 17 The impact of BCOOA on the ML algorithms’ performance in the PC1 dataset

Figure 18 illustrates the average performance of defect predictors generated by ML techniques on the PC1 dataset, both with and without feature selection. The defect predictors produced by ML + BCOOA exhibit an average accuracy of 96.68%, while those from standalone ML methods show 88.82%. Comparable trends were observed across other performance metrics. The results highlight a substantially higher average recall metric for defect predictors developed with the suggested technique compared to those from standalone ML approaches. In summary, the defect predictor derived from the proposed method (ML + BCOOA) outperforms those from ML methods without feature selection.

Fig. 18 The average values of the performance criteria of the ML algorithms (ANN, DT, KNN, NB and SVM) with and without BCOOA as the feature selector in the PC1 dataset

The average performance of the new method on the PC1 dataset was assessed over ten executions. Table 24 reports the average number of features selected over the ten runs by the binary ACO and by BCOOA. With ANN as the ML technique, BCOOA and ACO each selected 9 features; DT + BCOOA selected 8 features versus 11 for DT + ACO; with KNN, both selected 10 features; and NB + BCOOA selected 10 features versus 8 for NB + ACO. Overall, the average number of features selected by BCOOA remains well below the full set of 21 features used by the ML techniques without feature selection.

Table 24 The number of features selected by ACO and BCOOA with different ML algorithms on the PC1 dataset

Figure 19 illustrates BCOOA's feature selection rate on PC1. The PC1 features selected and refined by BCOOA across the ten runs with the various ML algorithms are detailed in Table 25; each ML technique underwent ten iterations, and the most beneficial features meeting the evaluation criteria were identified. To demonstrate the stability of the new technique, it was applied ten times to the PC1 dataset, and the results consistently showed improved F1, recall, accuracy, and precision. Starting from all 21 features, the method trains the machine learning algorithm on the subset selected by BCOOA, and this subset greatly influences the accuracy and precision of the resulting defect predictor.
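
A per-feature selection rate of the kind plotted in Fig. 19 can be computed by stacking the ten binary masks returned by the selector and averaging column-wise, as in this sketch; `run_feature_selection` is a hypothetical stand-in for one BCOOA run.

```python
import numpy as np

masks = np.array([run_feature_selection() for _ in range(10)])  # shape (10, 21)
selection_rate = masks.mean(axis=0)   # fraction of runs that kept each feature
for i, rate in enumerate(selection_rate, start=1):
    print(f"feature {i:2d}: selected in {rate:.0%} of runs")
```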

Fig. 19 The selection rate of the 21 features by BCOOA in the PC1 dataset over ten executions

Table 25 The features selected by BCOOA and ACO with different ML algorithms on the PC1 dataset in the best result from ten executions

Figure 20a shows the convergence of the proposed BCOOA towards the most effective features in the CM1 dataset. BCOOA's behaviour varies across the ML algorithms: on CM1, BCOOA + ANN attains the best fitness value and outperforms the other algorithms in software defect prediction. Figure 20b depicts the convergence of BCOOA towards the optimal results in the KC1 dataset, where ANN and KNN combined with BCOOA achieve the best fitness values and the highest convergence speed amongst the defect prediction techniques. Figure 20c illustrates how BCOOA converges towards the optimal feature subsets in the JM1 dataset, with the NB + BCOOA and SVM + BCOOA setups performing best; NB and SVM likewise outperform the other algorithms in terms of defect prediction accuracy.

Fig. 20 Convergence of the BCOOA for finding effective features with different ML algorithms in different datasets

Figure 20d depicts the convergence of BCOOA to the optimal features in the KC2 dataset; here the ANN + BCOOA and NB + BCOOA setups converge fastest, and ANN, NB, and SVM predict defects more accurately than the other algorithms. Finally, Fig. 20e illustrates BCOOA's convergence towards the optimal features in the PC1 dataset, where KNN + BCOOA and ANN + BCOOA exhibit the highest convergence speed and ANN and KNN give the most precise defect prediction. Overall, BCOOA, serving as the feature selector, significantly enhances the performance of the ML algorithms in software defect prediction tasks.
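
Convergence curves of the kind shown in Fig. 20 are typically produced by recording the best fitness found so far at every iteration, as in the sketch below; the `step()` interface is hypothetical, standing in for one teaching-and-learning cycle of the optimiser (flip the comparison if the fitness is minimised).

```python
def run_with_history(optimiser, iterations):
    """Return the best-so-far fitness after each iteration."""
    history, best = [], float("-inf")
    for _ in range(iterations):
        best = max(best, optimiser.step())   # step() returns this cycle's fitness
        history.append(best)                 # yields a monotone convergence curve
    return history
```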

Figure 21 displays the accuracy of defect predictors created by various ML algorithms using different datasets. The defect predictor generated by the proposed method in the PC1 dataset achieves 96% accuracy, surpassing the accuracy of defect predictors created by other methods. In the KC2 dataset, the defect predictor by the proposed method achieves 89.50% accuracy, while in the JM1 and CM1 datasets, these figures are about 94% and 91.50%, respectively. Overall, the proposed method demonstrates superiority over other methods in terms of accuracy and performance. This improvement is attributed to the developed BCOOA, which identifies the most effective features of the training dataset. The elimination of ineffective attributes enhances the performance and accuracy of the created software defect predictor.

Fig. 21 Comparison of defect predictors created by different methods in terms of accuracy

Table 26 reports the training time of the proposed method, which comprises the time required for feature selection by BCOOA plus the time needed to build the fault predictor with the ML algorithm. In the proposed method, the ML algorithms are trained on the reduced dataset containing only the selected features.

Table 26 The training time of the different algorithms on the different datasets
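
The timings in Table 26 can be reproduced in outline by wrapping both stages in a monotonic clock, as below; `select_features` is a hypothetical stand-in for the BCOOA search.

```python
import time
from sklearn.tree import DecisionTreeClassifier

start = time.perf_counter()
mask = select_features(X_train, y_train)          # BCOOA search (hypothetical)
model = DecisionTreeClassifier().fit(X_train[:, mask], y_train)
print(f"training time: {time.perf_counter() - start:.2f} s")
```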

5 Conclusion

Developing a software defect predictor significantly impacts software development and quality. A notable drawback of machine learning-based defect predictors lies in their inability to discern the impact of each independent variable on the dependent one, often resulting in poorly trained models. This paper introduced a binary variant of the chaos-based Olympiad optimisation algorithm (BCOOA), specifically designed for selecting the most impactful features from defect prediction datasets. Using this algorithm, an effective and efficient classifier was created from a refined training set containing the most informative features. The method enhanced the efficiency of the defect predictor by identifying the smallest yet most effective feature subset from standard defect prediction datasets, exemplified here by the PROMISE Software Engineering Repository [20]. Starting from all 21 features, the approach selects a subset using BCOOA and trains the machine learning algorithm on that subset. Experimental results demonstrate the effectiveness of BCOOA as a feature selector for creating machine learning-based defect predictors. Key features identified by BCOOA include McCabe design complexity, the total number of operators and operands in the code, Halstead program length and difficulty, the number of branches in the source code, lines of code and comments, and the number of unique operators and operands. It is essential to note that the removal of a feature by BCOOA does not discount its potential impact on prediction; rather, it signifies the discovery of more effective features that the learning algorithms can exploit with greater generalisation power. In this research, the defect predictor formed by combining BCOOA with ML algorithms achieved an average accuracy of 91.13%, precision of 92.74%, recall of 97.61%, and F1 of 94.26%.

Acknowledging the limitations of the proposed BCOOA, one notable constraint is that it was evaluated only on datasets drawn from a single repository; future studies will assess its performance on various real-world defect prediction datasets. Moreover, this study utilised datasets with homogeneous features, and future research could explore the effectiveness of the proposed feature selector on raw datasets. The performance of the proposed feature selector with deep learning algorithms was not examined and could be a subject for future investigation, as could evaluating the proposed defect predictor alongside the test generation and test prioritisation methods outlined in [23, 24]. Incorporating BCOOA into other learning problems prior to the training stage is another avenue for future work, including applying the method to different software datasets and components. There is further potential in combining novel binary metaheuristic algorithms with ML techniques for software defect prediction, and in integrating reinforcement learning methods and chaotic maps to enhance ML-based defect predictors. The binary versions of recent metaheuristic algorithms, such as those referenced in [25,26,27,28], could serve as feature selectors in software defect prediction methods. Lastly, leveraging hybrid ML algorithms, as discussed in [17, 29], and hybrid heuristic algorithms [30] to enhance the accuracy of software defect predictors represents another potential avenue for future research.