1 Introduction

As a software system grows in size, the number of defects in it also increases. The presence of defects may lead to severe economic losses or even endanger people's lives [1,2,3]. It has been found that nearly four-fifths of the total cost across the development cycle is spent on defect repair [4,5,6]; if defects can be identified and corrected early in software development, the cost of fixing them is significantly reduced. Researchers have therefore devoted themselves to building various defect prediction models [7,8,9] that help developers identify potential defects as early as possible and reduce the losses defects cause.

Software defect prediction can be divided into within-project and cross-project prediction according to the data source: within-project prediction uses historical data from a single project, while cross-project prediction uses historical data from multiple projects. Most research focuses on the binary classification problem of file-level defect proneness [10]. In early work, machine learning methods were widely used to construct defect prediction models. Researchers first design artificial statistical features related to defects (code metrics, process metrics) based on the source code, then obtain a label for each source file from the software's history, and finally train a prediction model on the resulting features and labels. Most existing software defect metrics fall into two main categories: software code metrics and software process metrics [11]. Code metrics (such as LOC [12], Halstead [13], and McCabe [14]) capture the complexity of the program code, and process metrics (such as CK [15], Martin [16], and MOOD [17]) capture the complexity of the development process.

By its nature, defect prediction is a complex process, and in some cases artificial statistical features are insufficient for the task. Earlier research focused on manually designing defect metrics closely tied to the source program, but manual feature extraction is inefficient and the defect information contained in artificial statistical features is limited. Deep learning [18, 19], now widely applied, can automatically capture highly complex nonlinear features and has strong feature extraction capability, so it has gradually been applied to defect prediction with satisfactory results. Wang et al. [20] proposed a DBN-based model that learns semantic information from software programs end-to-end and then builds a better-performing predictor on the extracted semantic features; their experiments show that models based on semantic features outperform existing models based on artificial statistical features. Li et al. [21] used the local feature extraction capability of CNNs to automatically extract semantic information from code and combined it with artificial statistical features to construct a prediction model called DP-CNN; their results show that DP-CNN outperforms the DBN-based model, confirming both the feature extraction ability of CNNs and the effectiveness of feature combination. Xia et al. [22] proposed a deep learning method called DeepFL that automatically learns the most effective existing or latent features for precise fault localization.

Artificial statistical and semantic features are both extracted from the code, but they only reflect information internal to each file and lack the structural information that exists between files; this single source of features leads to unsatisfactory defect prediction results.

To address this problem, network representation learning [23] has been applied to software defect prediction. A software network graph is first constructed from the dependencies between software modules, and representation learning then effectively characterizes the graph and extracts the more important structural features. Typical methods are DeepWalk [24], LINE [25], Node2Vec [26], and SDNE [27], all based on the proximity-similarity hypothesis. In practice, however, two vertices that are not near neighbors may still be highly similar, and these methods cannot capture that kind of similarity. This paper therefore uses Struc2vec [28], a newer network representation learning technique that constructs node sequences from a different perspective and focuses on the structural information of nodes, overcoming this limitation of traditional methods. Compared with state-of-the-art techniques such as DeepWalk, Node2Vec, and RolX, Struc2vec [28] shows an exceptional ability to capture node structural features and superior performance in a variety of classification tasks.

Software defect prediction models are built from features, so feature selection directly affects prediction results. There are two main classes of classical feature selection methods, filter methods and wrapper methods [29,30,31]. Afzal and Torkar [32] empirically compared eight feature selection methods on five defect datasets in the PROMISE repository and showed that feature selection can effectively improve the performance of defect prediction models. Rodriguez et al. [33] compared three filter-based and two wrapper-based feature selection methods on four defect datasets and found the wrapper methods to be superior overall. However, the indicators used by filter methods to score feature importance (such as variance and frequency) derive largely from the decision-maker's prior knowledge, which is subjective. The importance scores produced by wrapper methods, in turn, are determined by the base model; since machine learning models are black boxes, such unexplainable models cannot show the learner where the importance scores come from. The importance scores produced by traditional feature selection methods therefore cannot fully explain each feature's contribution to the model output.

We therefore use generated counterfactual samples for feature selection [34,35,36]. The general idea is that by minimally changing the input features so as to change the model output, a set of different counterfactual samples can be generated for one piece of data under the same algorithm. Intuitively, a feature that is changed more frequently when generating counterfactuals is more important. The rate at which a feature is changed across all generated counterfactual samples is used as that feature's importance score, so we obtain a score for every feature. These importance scores directly explain how necessary each feature is to the model output, and guided by these interpretable scores we can select the best performing feature subset.

In summary, this paper proposes a new hybrid defect prediction model, DPS-CNN-STR, based on artificial statistical, semantic, and structural features. The artificial statistical features are optimized by a feature selection method based on counterfactual explanations, and the structural features are extracted by Struc2vec. Our main contributions are as follows:

  • To the best of our knowledge, we are among the few to apply Struc2vec's structural feature extraction to software defect prediction. In addition, we are the first to propose a feature selection method based on counterfactual explanations; it achieves good results and offers a new direction for feature selection research.

  • We present a new model called DPS-CNN-STR, whose F1 score is on average 3.3% higher than that of the previously best DP-CNN model.

The paper is organized as follows: Sect. 2 presents the related work, while Sect. 3 outlines the procedure for constructing the hybrid defect prediction model. Section 4 details the experimental setup, while Sect. 5 analyses the experimental results. Section 6 provides a brief description of potential threats to the validity of the results. Finally, Sect. 7 provides a summary of the work and outlines directions for future research.

2 Related Work

In this section, we introduce the relevant background of software defect prediction, the counterfactual generation framework, and Struc2vec.

2.1 Software Defect Prediction

Software defect prediction techniques are primarily used to identify potential defects in a timely manner and to help testers perform targeted testing activities [37]. Figure 1 shows the file-level software defect prediction process common in the literature [38, 39], which consists of the following three steps.

Fig. 1 Software defect prediction process

  1. First, label each source file in the project according to the software history repository: defective files are marked as buggy and defect-free files as clean.

  2. Then, by analyzing the source code or historical data, extract defect-related features from the source files; the most common are artificial statistical features. The obtained features and labels are used to train machine learning algorithms [40] (such as LR, SVM, and RF) to construct the classifier.

  3. Finally, extract the same features from the files of the project to be predicted; feeding these features into the defect prediction model yields a prediction result for each file.

2.2 Counterfactual Generation Framework

Machine learning models are black boxes whose internal workings people cannot explain. This lack of explainability can lead to irreparable consequences, so it is necessary to provide explanations for machine learning models to reduce the potential threats they pose.

The most popular form of explanation is the counterfactual explanation proposed by Wachter et al. [36]:

$$C = \arg \min_{C} \; \mathrm{yloss}\left( f\left( C \right),y \right) + \left| x - C \right|$$
(1)

In short, given input features \(x\) and the corresponding output of the model \(f\), the output \(y\) can be changed by modifying the features, while keeping the change to the features as small as possible. The \(\mathrm{yloss}\) term pushes the counterfactual \(C\) toward a prediction different from that of the original instance \(x\), while the distance term keeps the counterfactual sample \(C\) close to the original sample \(x\).

For example, in the Adult Income dataset [41], to change a person predicted as low-income to high-income, one can increase the value of working hours. There must be a critical value at which the model's prediction flips, and the counterfactual chooses exactly this critical value. People can then explain the model's decision through the generated counterfactual samples.

More recently, Mothilal et al. [42] have extended Wachter's work, arguing that the generated counterfactual samples should be diverse, proximal, and feasible.

To ensure diversity, Mothilal et al. used the Determinantal Point Process (DPP) algorithm, which ensures that the selected subsets are diverse and that similar subsets are unlikely to be chosen together. Here \(\mathrm{dist}\left({C}_{i},{C}_{j}\right)\) denotes the distance between two counterfactual samples:

$$Diversity = \frac{1}{1 + \mathrm{dist}\left( C_{i} ,C_{j} \right)}$$
(2)

Of the multiple counterfactual samples generated for an original piece of data, the counterfactual closest to the original data is likely the most helpful to the user in making a decision. \({C}_{i}\) denotes the \(i\)th counterfactual sample and \(\mathrm{dist}\left({C}_{i},x\right)\) the distance between \({C}_{i}\) and \(x\); this distance determines the proximity of the generated counterfactuals to the original data.

$$Proximity = - \frac{1}{k}\sum_{i = 1}^{k} \mathrm{dist}\left( C_{i} ,x \right)$$
(3)

Mothilal et al. ensured the feasibility of counterfactuals through domain knowledge about the features: for example, gender (a discrete feature) is limited to male and female, and working hours (a continuous feature) are limited to a normal interval.

Figure 2 shows multiple counterfactual instances generated for a single piece of original data; the higher a counterfactual is ranked, the closer it is to the original data.

Fig. 2 Counterfactual generation instances

The rate at which each feature changes across all counterfactual samples is its importance score. Inspired by the above, and to overcome the limitations of traditional feature selection methods, we use the generated counterfactual samples for feature selection, producing a better feature subset guided by these importance scores. Details are given in Sect. 3.2.
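As a minimal illustration of this scoring scheme (a sketch only: it assumes the counterfactuals have already been generated and that all features are numeric; the function name is ours, not from any library):

```python
import numpy as np

def counterfactual_importance(x, counterfactuals, atol=1e-8):
    """Score each feature by how often it was changed across the
    counterfactual samples generated for the instance x.

    x               : 1-D array of the original feature values
    counterfactuals : 2-D array, one generated counterfactual per row
    Returns the per-feature change rates in [0, 1].
    """
    cfs = np.asarray(counterfactuals, dtype=float)
    changed = ~np.isclose(cfs, np.asarray(x, dtype=float), atol=atol)
    return changed.mean(axis=0)  # fraction of counterfactuals altering each feature

# Example: 3 counterfactuals for a 4-feature instance.
x = [2.0, 7.0, 1.0, 0.0]
cfs = [[2.0, 9.0, 1.0, 0.0],
       [2.0, 8.5, 3.0, 0.0],
       [2.0, 9.2, 1.0, 1.0]]
print(counterfactual_importance(x, cfs))  # [0.  1.  0.333...  0.333...]
```

Here the second feature was changed in every counterfactual and so receives the highest importance score.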

2.3 Struc2vec

Traditional network representation learning algorithms share a limitation: because the sampling length of a walk is bounded, they cannot effectively model structurally similar nodes that are far apart. Previous algorithms nevertheless perform well because most datasets favor homophily, i.e., nodes that are close in the graph are also similar in feature space, which covers most datasets. When constructing its graph, Struc2vec requires neither node location information nor label information; it relies solely on node degree to build a multilayer graph. The intuition is that if two nodes have similar degree, they are structurally close, and if all of their neighbors also have similar degree, they are structurally even more similar. In short, nodes whose neighborhoods have similar degree sets are expected to have similar latent representations; Struc2vec learns node representations specifically from structural identity and achieves good results.

In Fig. 3, nodes m and n have similar local structures: node m has degree 4 and node n degree 3, and they take part in 3 and 2 triangles of the network, respectively. The two nodes thus have high structural similarity, but because their neighborhoods share no common nodes, traditional network representation techniques cannot learn similar latent representations for them; Struc2vec solves this problem.

Fig. 3 Example of two nodes (m and n) with similar structure

In our work, to address the problem of a single feature source and to improve prediction performance, we use Struc2vec to extract the relevant structural features.

3 Proposed Method

This section describes our proposed hybrid defect prediction model in detail. Our DPS-CNN-STR model is based on artificial statistical, semantic, and structural features. It is based on the DP-CNN [21] and has made the following improvements:

  1. The artificial statistical features are optimized by the feature selection method based on counterfactual explanations.

  2. Struc2vec is used to learn the structural features of the software network, and a new hybrid defect prediction model is constructed by combining the optimized DP-CNN model with these newly learned structural features.

3.1 Hybrid Defect Prediction Model

To improve the performance of defect prediction models, Fig. 4 shows the new hybrid defect prediction model. First, the artificial statistical features are optimized through counterfactual explanations; then semantic features are learned from the source program using a CNN; finally, structural features are learned from the software network using Struc2vec. Together these form the hybrid defect prediction model DPS-CNN-STR.

Fig. 4 Hybrid defect prediction model

The optimized artificial statistical features, the semantic features learned end-to-end by the CNN, and the structural features learned unsupervised by Struc2vec are concatenated and fed into a Softmax layer to obtain the prediction. The output layer computes:

$$P_{i} = \frac{\exp \left( y_{i} \right)}{\sum_{j} \exp \left( y_{j} \right)},\quad i,j \in \left\{ 0,1 \right\}$$
(4)

where \({P}_{i}\) denotes the predicted probability of class \(i\) (buggy or clean). We build the neural network with Keras, whose biggest advantages are simplicity and speed, and we keep exactly the same parameter settings as in the literature [21].
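A rough Keras sketch of this fusion step follows; the feature dimensions and hidden-layer size are assumptions for illustration, not the settings of [21]:

```python
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

# Assumed dimensions for illustration only.
N_STAT, N_SEM, N_STR = 7, 100, 128

stat_in = Input(shape=(N_STAT,), name="statistical_features")
sem_in = Input(shape=(N_SEM,), name="semantic_features")
str_in = Input(shape=(N_STR,), name="structural_features")

merged = Concatenate()([stat_in, sem_in, str_in])  # fuse the three feature sources
hidden = Dense(64, activation="relu")(merged)
output = Dense(2, activation="softmax")(hidden)    # yields P_0 and P_1 of Eq. (4)

model = Model(inputs=[stat_in, sem_in, str_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```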

3.2 Feature Selection Based on Counterfactual Explanations

Figure 5 shows a feature selection method based on counterfactual explanations, consisting of the following three main steps:

  1. Defect prediction in this paper is based on version iterations within a project, so the artificial statistical feature table of the old version is input into the counterfactual generation framework, which assigns each feature a corresponding importance score.

  2. Features are combined one by one in descending order of importance score; with n features there are n corresponding feature subsets.

  3. The old version's artificial statistical features (restricted to each subset from step 2) and their labels are used to build a classifier, the new version's features and labels serve as the test set, and the subset with the highest F1 score is selected as the optimal feature subset (a minimal sketch follows Fig. 5).

Fig. 5 Feature selection based on counterfactual explanations

3.3 Structural Features Extraction

Using the dependencies that exist between software modules, we construct a software network with modules as the basic unit and then use Struc2vec to learn latent representations of node structure, extracting structural features in an unsupervised manner.

First, based on the data flow relationships between modules, the software network G = (V, E) is constructed, where V = {v_i | i = 1, 2, 3, …, n} is the set of nodes, v_i represents a node (module), n = |V| is the number of nodes, and k* denotes the network's diameter. E = {e_ij | i, j ∈ [1, n]} is the set of edges: e_ij = 1 indicates that there is a relationship between node i and node j, and e_ij = 0 that there is none. The relationships considered are as follows:

  1. There is a dependency between node i and node j;

  2. There is a composition relationship between node i and node j;

  3. There is an inheritance relationship between node i and node j.

Figure 6 shows software networks built according to the above rules for some Apache open-source projects; a minimal construction sketch follows the figure.

Fig. 6 The software network of poi, lucene and synapse

Extracting structural features from the software network using Struc2vec is divided into four main steps [28]:

(1) Measuring structural similarity: \({R}_{k}\left(u\right)\) denotes the set of nodes at distance \(k\) from node \(u\), so \({R}_{1}\left(u\right)\) is the set of \(u\)'s directly connected neighbors, and \(s\left(S\right)\) denotes the ordered degree sequence of a node set \(S\). A hierarchy of distances \({f}_{k}\left(u,v\right)\) between node pairs is computed, reflecting their structural similarity, defined as:

$$f_{k} \left( u,v \right) = f_{k - 1} \left( u,v \right) + g\left( s\left( R_{k} \left( u \right) \right),s\left( R_{k} \left( v \right) \right) \right),\quad k \ge 0 \ \text{and} \ \left| R_{k} \left( u \right) \right|, \left| R_{k} \left( v \right) \right| > 0$$
(5)

Here \(g\left(D_1,D_2\right) \ge 0\) measures the distance between the ordered degree sequences \(D_1\) and \(D_2\), with \(f_{-1} = 0\). Since \(s\left({R}_{k}\left(u\right)\right)\) and \(s\left({R}_{k}\left(v\right)\right)\) may have different lengths and contain duplicate elements, Dynamic Time Warping (DTW) is used to compare them, with the element-wise cost between degrees \(a\) and \(b\) defined as:

$$d\left( a,b \right) = \frac{\max \left( a,b \right)}{\min \left( a,b \right)} - 1$$
(6)

(2) Constructing the context graph: A multilayer weighted graph M is constructed based on the obtained node-pair distances, which is mainly intended to encode the structural similarity between nodes. The edge weight of two nodes in a certain layer \(k\) is defined as:

$$w_{k} \left( u,v \right) = e^{ - f_{k} \left( u,v \right)},\quad k = 0, \ldots ,k^{*}$$
(7)

Each node's copies in adjacent layers are connected by directed edges, with weights defined as:

$$w\left( u_{k} ,u_{k + 1} \right) = \log \left( \Gamma_{k} \left( u \right) + e \right),\quad k = 0, \ldots ,k^{*} - 1$$
(8)
$$w\left( u_{k} ,u_{k - 1} \right) = 1,\quad k = 1, \ldots ,k^{*}$$
(9)

\({\Gamma }_{k}\left(u\right)\) is the number of edges incident to node \(u\) in layer \(k\) whose weight exceeds the average edge weight of the complete graph in that layer, defined as:

$$\Gamma_{k} \left( u \right) = \sum_{v \in V} \mathbb{1}\left( w_{k} \left( u,v \right) > \overline{w_{k}} \right)$$
(10)

(3) Generating context for nodes: a biased random walk over the multilayer graph M generates a contextual sequence for each node. At each step, if the walk stays in the current layer \(k\), the probability of moving from node \(u\) to node \(v\) is:

$$P_{k} \left( u,v \right) = \frac{e^{ - f_{k} \left( u,v \right)}}{Z_{k} \left( u \right)}$$
(11)

\({Z}_{k}\left(u\right)\) is the normalization factor of node \(u\) in layer \(k\), which is obtained by the following formula:

$$Z_{k} \left( u \right) = \sum_{v \in V, v \ne u} e^{ - f_{k} \left( u,v \right)}$$
(12)

If the walk instead switches layers, it moves to layer \(k+1\) or layer \(k-1\) with the following probabilities:

$$P_{k} \left( u_{k} ,u_{k + 1} \right) = \frac{w\left( u_{k} ,u_{k + 1} \right)}{w\left( u_{k} ,u_{k + 1} \right) + w\left( u_{k} ,u_{k - 1} \right)}$$
(13)
$$P_{k} \left( u_{k} ,u_{k - 1} \right) = 1 - P_{k} \left( u_{k} ,u_{k + 1} \right)$$
(14)

(4) Learning a language model: finally, the Skip-Gram technique learns a latent representation of each node from the generated contexts.
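A sketch of this last step with gensim's Skip-Gram (sg=1); load_walks is a hypothetical helper standing in for step (3), and the hyperparameter values are illustrative:

```python
from gensim.models import Word2Vec

# walks: list of node-ID sequences produced by the biased random walk
# over the multilayer graph M, e.g. [["12", "5", "5", "8"], ...].
walks = load_walks()  # hypothetical helper; see step (3) above

model = Word2Vec(
    sentences=walks,
    vector_size=128,  # dimensionality of the structural features
    window=5,         # context window over the walk
    sg=1,             # sg=1 selects the Skip-Gram architecture
    min_count=0,
    workers=4,
)
structural_features = {node: model.wv[node] for node in model.wv.index_to_key}
```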

3.4 Semantic Features Extraction

Software defects originate mainly in the source program. Each source file can be parsed into a sequence of words, and these word sequences are converted into semantic features by the CNN's efficient feature extraction, as shown in Fig. 7.

Fig. 7 The semantic features extraction process

(1) Iterate over each file in the source project and parse its code into AST [43] nodes with the open-source Python package javalang. Following the established approach [20], only three main categories of node types [44, 45] are kept as the module's word sequence: method invocations and class instance creations, whose specific method and class names are added to the sequence; declarations, such as method declarations, interface declarations, and constructor declarations, whose names are added to the sequence; and control-flow nodes, such as ForStatement, IfStatement, and WhileStatement, which are added by node type. Some of these node types are listed in Table 1, and a parsing sketch follows the table.

Table 1 Part of the node types
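A minimal parsing sketch with javalang, keeping the three node-type categories described above (the node-type list here is illustrative, not the paper's full Table 1):

```python
import javalang

DECLARATIONS = (javalang.tree.MethodDeclaration,
                javalang.tree.InterfaceDeclaration,
                javalang.tree.ConstructorDeclaration)
CONTROL_FLOW = (javalang.tree.ForStatement,
                javalang.tree.IfStatement,
                javalang.tree.WhileStatement)

def file_to_tokens(source_code):
    """Parse one Java source file into the word sequence fed to the CNN."""
    tokens = []
    tree = javalang.parse.parse(source_code)
    for _, node in tree:                           # walk every AST node
        if isinstance(node, javalang.tree.MethodInvocation):
            tokens.append(node.member)             # invoked method's name
        elif isinstance(node, javalang.tree.ClassCreator):
            tokens.append(node.type.name)          # instantiated class's name
        elif isinstance(node, DECLARATIONS):
            tokens.append(node.name)               # declared name
        elif isinstance(node, CONTROL_FLOW):
            tokens.append(type(node).__name__)     # e.g. "IfStatement"
    return tokens
```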

(2) Since the CNN model only accepts numeric input, all words in the sequences must be converted to numbers. Words are encoded without repetition, starting from 1 and increasing; each word corresponds to exactly one code, and repeated occurrences reuse the previous code. To distinguish word encoding from label encoding, labels are One-Hot encoded. After these steps, each source file is parsed into a numerical vector.

(3) In addition, the CNN requires input vectors of consistent length, which raw files cannot provide since the code of every source file differs. We therefore fix the input length: vectors longer than this value are truncated, and shorter ones are padded with 0 (a sketch of steps 2 and 3 follows). In short, the CNN takes these vectors as input and applies embedding, convolutional, activation, pooling, and fully connected layers to extract semantic features.
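Steps (2) and (3) can be sketched as follows; post-padding/truncation is an assumption, since the paper only specifies that overlong vectors are truncated and short ones zero-padded:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_and_pad(token_lists, max_len):
    """Map each distinct token to an integer starting from 1 (step 2),
    then truncate/zero-pad every sequence to max_len (step 3).
    """
    vocab = {}
    encoded = []
    for tokens in token_lists:
        seq = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # first unseen word gets the next code
            seq.append(vocab[tok])           # repeated words reuse their code
        encoded.append(seq)
    # 0 is reserved for padding; overlong sequences are truncated.
    padded = pad_sequences(encoded, maxlen=max_len,
                           padding="post", truncating="post")
    return padded, vocab
```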

4 Experimental Setup

In this section, a series of experiments are designed to evaluate the effectiveness of our proposed hybrid defect prediction model.

4.1 Datasets

This paper uses experimental data from six open-source projects hosted by the Apache Foundation, selected on criteria such as dataset stability. Since defect prediction in this paper is based on version iteration, each project contributes two successive versions: the old version serves as the training data source and the new version as the test data source. Table 2 gives detailed descriptions, versions, average file counts, and defect rates for the six projects.

Table 2 Software project related information

The number of files in the selected projects varies from 210 to 892, ensuring data diversity, and the projects cover defect rates from a minimum of 15.5% to a maximum of 64.7%, allowing us to test model performance under different conditions. The first column of Table 2 lists the names of the six datasets, the second briefly describes each project, the third and fourth give the versions and average file counts, and the last gives the percentage of defective instances.

In addition, we collected 20 artificial statistical features and defect statistics for these six projects, with the statistical feature data coming from the tera-PROMISE repository. The dataset contains metrics based on software size and software complexity, listed in Table 3.

Table 3 The statistical metrics

4.2 Evaluation Measures

The \(F1\) score [46] combines the precision and recall of a classifier and is better suited than other metrics to evaluating models on imbalanced datasets, because it balances the classifier's performance on the positive and negative classes. Precision \(P\) is calculated as follows:

$$P = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}$$
(15)

where \(\mathit{TP}\) is the number of true positives, \(\mathit{FP}\) the number of false positives, and \(\mathit{FN}\) the number of false negatives. Recall \(R\) is calculated as follows:

$$R = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}$$
(16)

The formula for calculating an \(F1\) score is as follows:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
(17)

For datasets with a balanced class distribution, precision and recall alone measure model performance well; but for classically imbalanced tasks such as software defect prediction, where the decision threshold does not correspond to a prior class probability of 0.5, the \(F1\) score is the better choice. The three metrics can be computed as sketched below.
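With scikit-learn, Eqs. (15)-(17) are computed directly (toy labels shown for illustration; 1 = buggy, 0 = clean):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # Eq. (15): TP / (TP + FP)
r = recall_score(y_true, y_pred)     # Eq. (16): TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # Eq. (17): 2PR / (P + R)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```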

4.3 Baselines

To demonstrate the superiority of the feature selection method based on counterfactual explanations, we compared it with a baseline that performs no feature selection and with the best traditional feature selection approach, the Wrapper method, performing feature selection on the 20 artificial statistical features of the six open-source projects described in Sect. 4.1.

To show that our feature selection method and the structural features improve model prediction performance, we set up the following six models for comparison:

  1. LR: a traditional defect prediction model built from the 20 artificial statistical features using the logistic regression algorithm.

  2. CNN: semantic features are learned from the projects' source programs with a CNN, and the defect prediction model is built without traditional artificial feature indicators.

  3. DP-CNN: a defect prediction model combining the semantic features with the 20 traditional artificial feature indicators.

  4. DPS-CNN: based on the DP-CNN model, with the artificial statistical features optimized by the feature selection method based on counterfactual explanations.

  5. DP-CNN-STR: based on the DP-CNN model, combined with the newly mined structural features.

  6. DPS-CNN-STR: based on the DP-CNN model, combining both of the above improvements.

4.4 Research Questions

When constructing prediction models, the main concern is model performance. This paper proposes a new prediction model, DPS-CNN-STR, which builds on the DP-CNN model, additionally performs feature selection based on counterfactual explanations, and incorporates the structural features mined by Struc2vec. We therefore ask three questions:

  • RQ1: Is the feature selection method based on counterfactual explanations effective?

  • RQ2: Can the structural features mined by Struc2vec effectively improve the performance of defect prediction models?

  • RQ3: Does the hybrid defect prediction model DPS-CNN-STR outperform all the other models?

4.5 Setup

For RQ1, we compare the feature selection method based on counterfactual explanations with the Wrapper method and with a baseline that performs no feature selection. The baseline does not optimize the features and simply uses logistic regression as the prediction model. Both the counterfactual-based method and the Wrapper method require a specific classifier as a carrier; to ensure comparability, both also use logistic regression. The specific procedures are as follows:

  • The feature selection method based on counterfactual explanations (LRBOC): first, the old version's artificial statistical feature table is input into the counterfactual generation framework, and each feature receives an importance score, returned as a dictionary sorted from high to low, e.g. { 'feature1': 0.542, 'feature2': 0.384, …, 'feature20': 0.112}. Features are then combined one by one in descending order of importance; with 20 features there are 20 feature subsets of the form [feature1], [feature1, feature2], …, [feature1, feature2, …, feature20]. The old version's artificial statistical features (restricted to each subset) and labels are used to build the classifier, the new version's features and labels form the test set, and the subset with the highest F1 score is selected as the optimal feature subset.

  • The Wrapper method (WRAP): we select the RFECV algorithm, as its procedure is consistent with that of LRBOC. RFECV first uses RFE (recursive feature elimination) to derive an importance ranking for each feature; based on this ranking, subsets of different sizes are trained and cross-validated, and the best-performing subset becomes the final feature set. The number of features to retain is controlled via the 'n_features_to_select' parameter of the underlying RFE step. The number of features in the old version's optimal subset is first determined from the learning curve, after which the specific subset is determined from the feature-ranking matrix. For example, if the optimal subset has 8 features and the ranking matrix is [1,2,2,3,1,1,2,3,1,2,1,3,1,2,2,3,1,3,2,1], the 8 features ranked 1 form the optimal subset. The old version's best feature subset and labels are used to build the classifier, the new version's best subset and labels form the test set, and the F1 score is recorded (a sketch follows this list).

  • Baseline without feature selection (LR): without any feature selection, all features and labels of the old version are used to build the classifier, while all features and labels of the new version form the test set; the F1 score is recorded.
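A minimal sketch of the WRAP pipeline with scikit-learn's RFECV, assuming the feature tables X_old and X_new are pandas DataFrames:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                    # drop one feature per elimination round
    cv=5,
    scoring="f1",
    min_features_to_select=1,
)
selector.fit(X_old, y_old)                       # old version: training data
X_old_best = X_old.loc[:, selector.support_]     # the selected feature subset
X_new_best = X_new.loc[:, selector.support_]

clf = LogisticRegression(max_iter=1000).fit(X_old_best, y_old)
print(f1_score(y_new, clf.predict(X_new_best)))  # F1 on the new version
```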

The F1 scores obtained by the three methods are then compared to demonstrate the superiority of the feature selection method based on counterfactual explanations.

In RQ2-3, we mainly set up five sets of comparison experiments. More details will be given in the experimental analysis in Sect. 5.2.

5 Experimental Results

In this section, we give experimental results to answer the three questions.

5.1 RQ1: Validity of Feature Selection

5.1.1 Comparison

As shown in Table 4, our proposed feature selection method LRBOC performs considerably better than the WRAP feature selection method on all six datasets.

Table 4 F1 scores of different feature selection methods

Table 4 shows that LRBOC's F1 score is on average 2.9% higher than WRAP's and 5.1% higher than LR's, while WRAP is only 2.2% higher than LR on average. These results illustrate the effectiveness of the feature selection method based on counterfactual explanations.

The advantages of the feature selection method proposed in this paper lie in the following two points:

  1. Important features are changed frequently when generating counterfactual instances and thus receive higher scores. This provides an objective indicator for determining feature importance and addresses the limitation of the Filter method.

  2. Across all generated counterfactual instances, learners can directly observe how readily each feature changes and derive the importance score from the change frequency, addressing the limitation of the Wrapper method.

5.1.2 Extension

As described in Sect. 3.2, the feature selection method based on counterfactual explanations consists of three main steps. Figure 8 illustrates the process of selecting the best feature subset for the different projects.

Fig. 8 The process of selecting the best feature subset

Figure 8 shows that using all 20 features is equivalent to performing no feature selection. In the poi project, the F1 score peaks when 7 artificial statistical features are used; these seven features have the highest importance scores and constitute the project's optimal subset of artificial statistical features.

5.2 RQ2-3: Validity of Structural Features

To answer RQ2 and RQ3, we designed five sets of comparison experiments, and the results are shown in Table 5.

Table 5 F1 scores of six defect prediction models

Table 5 shows that the F1 score of the CNN model is on average 7.1% higher than that of the LR model, indicating that automated feature mining benefits defect prediction model construction more than manual feature mining and verifying the feasibility of deep learning for software defect prediction. Comparing the third and second columns, the F1 score of DP-CNN is on average 2.2% higher than that of CNN, indicating the effectiveness of feature combination. In addition, the F1 score of DPS-CNN is on average 0.9% higher than that of DP-CNN, further verifying the effectiveness of the feature selection method based on counterfactual explanations.

To better answer RQ2, we compared the DP-CNN-STR model with the DP-CNN model: the former's average F1 is 2.6% higher, showing that the software structural features extracted by Struc2vec are useful for constructing defect prediction models.

The overall performance of DPS-CNN-STR is higher than that of DP-CNN-STR, yet DPS-CNN-STR falls behind on two projects (lucene and jedit). We attribute this to the composition of DPS-CNN-STR from artificial statistical, semantic, and structural features: the comparison between the CNN and LR models shows that artificial statistical features influence the model far less than semantic features, i.e., features mined by deep learning dominate the defect prediction model. Therefore, although DPS-CNN-STR adds artificial statistical feature optimization on top of DP-CNN-STR, this does not guarantee that it always outperforms DP-CNN-STR.

Finally, to answer RQ3: the average F1 score of DPS-CNN-STR is 3.3% higher than that of DP-CNN and is the highest among the six models, proving that both the proposed feature selection method and the structural features promote model construction.

6 Threats to Validity

The main internal threats come from the construction of the experimental environment and the parameter settings. Our experiments run in a Python environment; to reduce uncontrollable implementation factors, we adopted sufficiently mature third-party libraries (various Python packages) to meet our requirements. We also adopted the documentation's default values as the defect prediction parameters, and parameter settings often directly affect prediction performance.

For example, the feature selection experiments based on counterfactual explanations involve a regularization (penalty) parameter. During training on large amounts of data, many dimensions arise, some decisive and others irrelevant; while more training data should yield more accurate results, more judgment dimensions worsen the model's generalization ability. A balance must be struck, hence the penalty term. In the experiment in Sect. 5.1.1 we use its default value of 1, but different penalty values yield different results, as shown in Table 6.

Table 6 Results of LRBOC under different parameter settings

In addition, when constructing the DPS-CNN-STR model, different internal parameter settings lead to different prediction performance, so appropriate parameters must be selected to obtain the best performance.

The main external threat stems from project generality: we evaluated our model on only 6 projects, which cannot represent all types of software. We selected only 6 datasets because constructing our defect prediction model requires artificial statistical features, semantic features, and structural features, meaning that both manually mined features and the corresponding source code must be available; some projects contain source files with syntax errors that are difficult to correct, so after careful consideration only 6 projects form our final dataset.

7 Conclusion and Future Work

Researchers have developed various software defect prediction models to reduce the losses caused by defects. This paper presents a new model, DPS-CNN-STR, built on the DP-CNN model with two main improvements. First, we build a software network from the data flow between modules and use the new network representation technique Struc2vec to extract the more important structural features, combining these newly mined structural features with the DP-CNN model. Second, we determine feature importance scores from generated counterfactual samples and use these scores as heuristic search information to optimize the artificial statistical features in the model; because these importance scores are interpretable, the method produces a better feature subset than traditional feature selection. Our experiments on six public datasets show that the proposed multi-feature-fusion model DPS-CNN-STR offers a new approach to constructing software defect prediction models.

In the future, we will explore the potential of DPS-CNN-STR for defect prediction on more projects. We will also explore how to combine features more effectively given their importance scores, and how to extract the structural features of the software network more effectively, to further improve the performance of defect prediction models.