1 Introduction

As a software system grows in size, the number of defects in it also increases. The presence of defects may lead to severe economic losses or even endanger people's lives [1,2,3]. It has been found that nearly four-fifths of the total cost across the development cycle is spent on defect repair [4,5,6]; if defects can be identified and corrected early in software development, the cost of fixing them is significantly reduced. Researchers have therefore devoted themselves to building various defect prediction models [7,8,9] that help developers identify potential defects as early as possible and reduce the losses defects cause.

Software defect prediction can be divided into within-project and cross-project prediction according to the data source: within-project prediction uses historical data from a single project, while cross-project prediction uses historical data from multiple projects. Most research focuses on the binary classification problem of file-level defect proneness [10]. In early work, machine learning methods were widely used to construct defect prediction models. Researchers first design artificial statistical features related to defects (code metrics, process metrics) based on the source code, then obtain a label for each source file from the software's history, and finally train a prediction model on the resulting features and labels. Most existing software defect metrics fall into two main categories: software code metrics and software process metrics [11]. Code metrics (such as LOC [12], Halstead [13], and McCabe [14]) capture the complexity of the program code, and process metrics (such as CK [15], Martin [16], and MOOD [17]) capture the complexity of the development process.

By its nature, defect prediction is a complex process, and in some cases artificial statistical features are insufficient for the task. Earlier research focused on manually designing defect metrics closely tied to the source program, but manual feature extraction is inefficient and the defect information contained in artificial statistical features is limited. Deep learning [18, 19], now widely applied, can automatically capture highly complex nonlinear features and has strong feature extraction capability, so it has gradually been applied to defect prediction with satisfactory results. Wang et al. [20] proposed a DBN-based model that learns semantic information from software programs end-to-end and then builds a better-performing predictor on the extracted semantic features; their experiments show that models based on semantic features outperform existing models based on artificial statistical features. Li et al. [21] used the local feature extraction capability of CNNs to automatically extract semantic information from code and combined it with artificial statistical features to construct a prediction model called DP-CNN; their results show that DP-CNN outperforms the DBN-based model, confirming both the feature extraction ability of CNNs and the effectiveness of feature combination. Xia et al. [22] proposed a deep learning method called DeepFL that automatically learns the most effective existing or latent features for precise fault localization.

Artificial statistical and semantic features are both extracted from the code, but they only reflect information internal to each file and lack the structural information that exists between files; this single source of features leads to unsatisfactory defect prediction results.

To address this problem, network representation learning [23] has been applied to software defect prediction. A software network graph is first constructed from the dependencies between software modules, and representation learning then effectively characterizes the graph and extracts the more important structural features. Typical methods are DeepWalk [24], LINE [25], Node2Vec [26], and SDNE [27], all based on the proximity-similarity hypothesis. In practice, however, two vertices that are not near neighbors may still be highly similar, and these methods cannot capture that kind of similarity. This paper therefore uses Struc2vec [28], a newer network representation learning technique that constructs node sequences from a different perspective and focuses on the structural information of nodes, overcoming this limitation of traditional methods. Compared with state-of-the-art techniques such as DeepWalk, Node2Vec, and RolX, Struc2vec [28] shows an exceptional ability to capture node structural features and superior performance in a variety of classification tasks.

Software defect prediction models are built from features, so feature selection directly affects prediction results. There are two main classes of classical feature selection methods, filter methods and wrapper methods [29,30,31]. Afzal and Torkar [32] empirically compared eight feature selection methods on five defect datasets in the PROMISE repository and showed that feature selection can effectively improve the performance of defect prediction models. Rodriguez et al. [33] compared three filter-based and two wrapper-based feature selection methods on four defect datasets and found the wrapper methods to be superior overall. However, the indicators used by filter methods to score feature importance (such as variance and frequency) derive largely from the decision-maker's prior knowledge, which is subjective. The importance scores produced by wrapper methods, in turn, are determined by the base model; since machine learning models are black boxes, such unexplainable models cannot show the learner where the importance scores come from. The importance scores produced by traditional feature selection methods therefore cannot fully explain each feature's contribution to the model output.

We therefore use generated counterfactual samples for feature selection [34,35,36]. The general idea is that by minimally changing the input features so as to change the model output, a set of different counterfactual samples can be generated for one piece of data under the same algorithm. Intuitively, a feature that is changed more frequently when generating counterfactuals is more important. The rate at which a feature is changed across all generated counterfactual samples is used as that feature's importance score, so we obtain a score for every feature. These importance scores directly explain how necessary each feature is to the model output, and guided by these interpretable scores we can select the best performing feature subset.

In summary, this paper proposes a new hybrid defect prediction model, DPS-CNN-STR, based on artificial statistical, semantic, and structural features. The artificial statistical features are optimized by a feature selection method based on counterfactual explanations, and the structural features are extracted by Struc2vec. Our main contributions are as follows:

  • To the best of our knowledge, we are among the few to apply Struc2vec's structural feature extraction to software defect prediction. In addition, we are the first to propose a feature selection method based on counterfactual explanations; it achieves good results and offers a new direction for feature selection research.

  • We present a new model called DPS-CNN-STR, whose F1 score is on average 3.3% higher than that of the previously best DP-CNN model.

The paper is organized as follows: Sect. 2 presents the related work, while Sect. 3 outlines the procedure for constructing the hybrid defect prediction model. Section 4 details the experimental setup, while Sect. 5 analyses the experimental results. Section 6 provides a brief description of potential threats to the validity of the results. Finally, Sect. 7 provides a summary of the work and outlines directions for future research.

2 Related Work

In this section, we introduce the relevant background of software defect prediction, the counterfactual generation framework, and Struc2vec.

2.1 Software Defect Prediction

Software defect prediction techniques are primarily used to identify potential defects in a timely manner and to help testers perform targeted testing activities [37]. Figure 1 shows the file-level software defect prediction process common in the literature [38, 39], which consists of the following three steps.

Fig. 1 Software defect prediction process

  1. First, label each source file in the project according to the software history repository: defective files are marked as buggy and defect-free files as clean.

  2. Then, by analyzing the source code or historical data, extract defect-related features from the source files; the most common are artificial statistical features. The obtained features and labels are used to train machine learning algorithms [40] (such as LR, SVM, and RF) to construct the classifier.

  3. Finally, extract the same features from the files of the project to be predicted; feeding these features into the defect prediction model yields a prediction result for each file.

2.2 Counterfactual Generation Framework

Machine learning models are black boxes whose internal workings people cannot explain. This lack of explainability can lead to irreparable consequences, so it is necessary to provide explanations for machine learning models to reduce the potential threats they pose.

The most popular form of explanation is the counterfactual explanation proposed by Wachter et al. [36]:

$$C = \arg \min_{C} \; \mathrm{yloss}\left( f\left( C \right),y \right) + \left| x - C \right|$$
(1)

In short, given input features \(x\) and the corresponding output of the model \(f\), the output \(y\) can be changed by modifying the features, while keeping the change to the features as small as possible. The \(\mathrm{yloss}\) term pushes the counterfactual \(C\) toward a prediction different from that of the original instance \(x\), while the distance term keeps the counterfactual sample \(C\) close to the original sample \(x\).

For example, in the Adult Income dataset [41], to change a person predicted as low-income to high-income, one can increase the value of working hours. There must be a critical value at which the model's prediction flips, and the counterfactual chooses exactly this critical value. People can then explain the model's decision through the generated counterfactual samples.

More recently, Mothilal et al. [42] have extended Wachter's work, arguing that the generated counterfactual samples should be diverse, proximal, and feasible.

To ensure diversity, Mothilal et al. used the Determinantal Point Process (DPP) algorithm, which ensures that the selected subsets are diverse and that similar subsets are unlikely to be chosen together. Here \(\mathrm{dist}\left({C}_{i},{C}_{j}\right)\) denotes the distance between two counterfactual samples:

$$Diversity = \frac{1}{1 + \mathrm{dist}\left( C_{i} ,C_{j} \right)}$$
(2)

Of the multiple counterfactual samples generated for an original piece of data, the counterfactual closest to the original data is likely the most helpful to the user in making a decision. \({C}_{i}\) denotes the \(i\)th counterfactual sample and \(\mathrm{dist}\left({C}_{i},x\right)\) the distance between \({C}_{i}\) and \(x\); this distance determines the proximity of the generated counterfactuals to the original data.

$$Proximity = - \frac{1}{k}\sum_{i = 1}^{k} \mathrm{dist}\left( C_{i} ,x \right)$$
(3)

Mothilal et al. ensured the feasibility of counterfactuals through domain knowledge about the features: for example, gender (a discrete feature) is limited to male and female, and working hours (a continuous feature) are limited to a normal interval.

Figure 2 shows multiple counterfactual instances generated for a single piece of original data; the higher a counterfactual is ranked, the closer it is to the original data.

Fig. 2 Counterfactual generation instances

The rate at which each feature changes across all counterfactual samples is its importance score. Inspired by the above, and to overcome the limitations of traditional feature selection methods, we use the generated counterfactual samples for feature selection, producing a better feature subset guided by these importance scores. Details are given in Sect. 3.2.
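As a minimal illustration of this scoring scheme (a sketch only: it assumes the counterfactuals have already been generated and that all features are numeric; the function name is ours, not from any library):

```python
import numpy as np

def counterfactual_importance(x, counterfactuals, atol=1e-8):
    """Score each feature by how often it was changed across the
    counterfactual samples generated for the instance x.

    x               : 1-D array of the original feature values
    counterfactuals : 2-D array, one generated counterfactual per row
    Returns the per-feature change rates in [0, 1].
    """
    cfs = np.asarray(counterfactuals, dtype=float)
    changed = ~np.isclose(cfs, np.asarray(x, dtype=float), atol=atol)
    return changed.mean(axis=0)  # fraction of counterfactuals altering each feature

# Example: 3 counterfactuals for a 4-feature instance.
x = [2.0, 7.0, 1.0, 0.0]
cfs = [[2.0, 9.0, 1.0, 0.0],
       [2.0, 8.5, 3.0, 0.0],
       [2.0, 9.2, 1.0, 1.0]]
print(counterfactual_importance(x, cfs))  # [0.  1.  0.333...  0.333...]
```

Here the second feature was changed in every counterfactual and so receives the highest importance score.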

2.3 Struc2vec

Traditional network representation learning algorithms share a limitation: because the sampling length of a walk is bounded, they cannot effectively model structurally similar nodes that are far apart. Previous algorithms nevertheless perform well because most datasets favor homophily, i.e., nodes that are close in the graph are also similar in feature space, which covers most datasets. When constructing its graph, Struc2vec requires neither node location information nor label information; it relies solely on node degree to build a multilayer graph. The intuition is that if two nodes have similar degree, they are structurally close, and if all of their neighbors also have similar degree, they are structurally even more similar. In short, nodes whose neighborhoods have similar degree sets are expected to have similar latent representations; Struc2vec learns node representations specifically from structural identity and achieves good results.

In Fig. 3, nodes m and n have similar local structures: node m has degree 4 and node n degree 3, and they take part in 3 and 2 triangles of the network, respectively. The two nodes thus have high structural similarity, but because their neighborhoods share no common nodes, traditional network representation techniques cannot learn similar latent representations for them; Struc2vec solves this problem.

Fig. 3 Example of two nodes (m and n) with similar structure

In our work, to address the problem of a single feature source and to improve prediction performance, we use Struc2vec to extract the relevant structural features.

3 Proposed Method

This section describes our proposed hybrid defect prediction model in detail. Our DPS-CNN-STR model is based on artificial statistical, semantic, and structural features. It is based on the DP-CNN [21] and has made the following improvements:

  1. The artificial statistical features are optimized by the feature selection method based on counterfactual explanations.

  2. Struc2vec is used to learn the structural features of the software network, and a new hybrid defect prediction model is constructed by combining the optimized DP-CNN model with these newly learned structural features.

3.1 Hybrid Defect Prediction Model

To improve the performance of defect prediction models, Fig. 4 shows the new hybrid defect prediction model. First, the artificial statistical features are optimized through counterfactual explanations; then semantic features are learned from the source program using a CNN; finally, structural features are learned from the software network using Struc2vec. Together these form the hybrid defect prediction model DPS-CNN-STR.

Fig. 4 Hybrid defect prediction model

The optimized artificial statistical features, the semantic features learned end-to-end by the CNN, and the structural features learned unsupervised by Struc2vec are concatenated and fed into a Softmax layer to obtain the prediction. The output layer computes:

$$P_{i} = \frac{\exp \left( y_{i} \right)}{\sum_{j} \exp \left( y_{j} \right)},\quad i,j \in \left\{ 0,1 \right\}$$
(4)

where \({P}_{i}\) denotes the predicted probability of class \(i\) (buggy or clean). We build the neural network with Keras, whose biggest advantages are simplicity and speed, and we keep exactly the same parameter settings as in the literature [21].
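A rough Keras sketch of this fusion step follows; the feature dimensions and hidden-layer size are assumptions for illustration, not the settings of [21]:

```python
from tensorflow.keras.layers import Concatenate, Dense, Input
from tensorflow.keras.models import Model

# Assumed dimensions for illustration only.
N_STAT, N_SEM, N_STR = 7, 100, 128

stat_in = Input(shape=(N_STAT,), name="statistical_features")
sem_in = Input(shape=(N_SEM,), name="semantic_features")
str_in = Input(shape=(N_STR,), name="structural_features")

merged = Concatenate()([stat_in, sem_in, str_in])  # fuse the three feature sources
hidden = Dense(64, activation="relu")(merged)
output = Dense(2, activation="softmax")(hidden)    # yields P_0 and P_1 of Eq. (4)

model = Model(inputs=[stat_in, sem_in, str_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```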

3.2 Feature Selection Based on Counterfactual Explanations

Figure 5 shows a feature selection method based on counterfactual explanations, consisting of the following three main steps:

  1. Defect prediction in this paper is based on version iterations within a project, so the artificial statistical feature table of the old version is input into the counterfactual generation framework, which assigns each feature a corresponding importance score.

  2. Features are combined one by one in descending order of importance score; with n features there are n corresponding feature subsets.

  3. The old version's artificial statistical features (restricted to each subset from step 2) and their labels are used to build a classifier, the new version's features and labels serve as the test set, and the subset with the highest F1 score is selected as the optimal feature subset (a minimal sketch follows Fig. 5).

Fig. 5 Feature selection based on counterfactual explanations

3.3 Structural Features Extraction

Using the dependencies that exist between software modules, we construct a software network with modules as the basic unit and then use Struc2vec to learn latent representations of node structure, extracting structural features in an unsupervised manner.

First, based on the data flow relationships between modules, the software network G = (V, E) is constructed, where V = {v_i | i = 1, 2, 3, …, n} is the set of nodes, v_i represents a node (module), n = |V| is the number of nodes, and k* denotes the network's diameter. E = {e_ij | i, j ∈ [1, n]} is the set of edges: e_ij = 1 indicates that there is a relationship between node i and node j, and e_ij = 0 that there is none. The relationships considered are as follows:

  1. There is a dependency between node i and node j;

  2. There is a composition relationship between node i and node j;

  3. There is an inheritance relationship between node i and node j.

Figure 6 shows software networks built according to the above rules for some Apache open-source projects; a minimal construction sketch follows the figure.

Fig. 6 The software network of poi, lucene and synapse

Extracting structural features from the software network using Struc2vec is divided into four main steps [28]:

(1) Measuring structural similarity: \({R}_{k}\left(u\right)\) denotes the set of nodes at distance \(k\) from node \(u\), so \({R}_{1}\left(u\right)\) is the set of \(u\)'s directly connected neighbors, and \(s\left(S\right)\) denotes the ordered degree sequence of a node set \(S\). A hierarchy of distances \({f}_{k}\left(u,v\right)\) between node pairs is computed, reflecting their structural similarity, defined as:

$$f_{k} \left( u,v \right) = f_{k - 1} \left( u,v \right) + g\left( s\left( R_{k} \left( u \right) \right),s\left( R_{k} \left( v \right) \right) \right),\quad k \ge 0 \ \text{and} \ \left| R_{k} \left( u \right) \right|, \left| R_{k} \left( v \right) \right| > 0$$
(5)

Here \(g\left(D_1,D_2\right) \ge 0\) measures the distance between the ordered degree sequences \(D_1\) and \(D_2\), with \(f_{-1} = 0\). Since \(s\left({R}_{k}\left(u\right)\right)\) and \(s\left({R}_{k}\left(v\right)\right)\) may have different lengths and contain duplicate elements, Dynamic Time Warping (DTW) is used to compare them, with the element-wise cost between degrees \(a\) and \(b\) defined as:

$$d\left( a,b \right) = \frac{\max \left( a,b \right)}{\min \left( a,b \right)} - 1$$
(6)

(2) Constructing the context graph: A multilayer weighted graph M is constructed based on the obtained node-pair distances, which is mainly intended to encode the structural similarity between nodes. The edge weight of two nodes in a certain layer \(k\) is defined as:

$$w_{k} \left( u,v \right) = e^{ - f_{k} \left( u,v \right)},\quad k = 0, \ldots ,k^{*}$$
(7)

Each node's copies in adjacent layers are connected by directed edges, with weights defined as:

$$w\left( u_{k} ,u_{k + 1} \right) = \log \left( \Gamma_{k} \left( u \right) + e \right),\quad k = 0, \ldots ,k^{*} - 1$$
(8)
$$w\left( u_{k} ,u_{k - 1} \right) = 1,\quad k = 1, \ldots ,k^{*}$$
(9)

\({\Gamma }_{k}\left(u\right)\) is the number of edges incident to node \(u\) in layer \(k\) whose weight exceeds the average edge weight of the complete graph in that layer, defined as:

$$\Gamma_{k} \left( u \right) = \sum_{v \in V} \mathbb{1}\left( w_{k} \left( u,v \right) > \overline{w_{k}} \right)$$
(10)

(3) Generating context for nodes: a biased random walk over the multilayer graph M generates a contextual sequence for each node. At each step, if the walk stays in the current layer \(k\), the probability of moving from node \(u\) to node \(v\) is:

$$P_{k} \left( u,v \right) = \frac{e^{ - f_{k} \left( u,v \right)}}{Z_{k} \left( u \right)}$$
(11)

\({Z}_{k}\left(u\right)\) is the normalization factor of node \(u\) in layer \(k\), which is obtained by the following formula:

$$Z_{k} \left( u \right) = \sum_{v \in V, v \ne u} e^{ - f_{k} \left( u,v \right)}$$
(12)

If the walk instead switches layers, it moves to layer \(k+1\) or layer \(k-1\) with the following probabilities:

$$P_{k} \left( u_{k} ,u_{k + 1} \right) = \frac{w\left( u_{k} ,u_{k + 1} \right)}{w\left( u_{k} ,u_{k + 1} \right) + w\left( u_{k} ,u_{k - 1} \right)}$$
(13)
$$P_{k} \left( u_{k} ,u_{k - 1} \right) = 1 - P_{k} \left( u_{k} ,u_{k + 1} \right)$$
(14)

(4) Learning a language model: finally, the Skip-Gram technique learns a latent representation of each node from the generated contexts.
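A sketch of this last step with gensim's Skip-Gram (sg=1); load_walks is a hypothetical helper standing in for step (3), and the hyperparameter values are illustrative:

```python
from gensim.models import Word2Vec

# walks: list of node-ID sequences produced by the biased random walk
# over the multilayer graph M, e.g. [["12", "5", "5", "8"], ...].
walks = load_walks()  # hypothetical helper; see step (3) above

model = Word2Vec(
    sentences=walks,
    vector_size=128,  # dimensionality of the structural features
    window=5,         # context window over the walk
    sg=1,             # sg=1 selects the Skip-Gram architecture
    min_count=0,
    workers=4,
)
structural_features = {node: model.wv[node] for node in model.wv.index_to_key}
```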

3.4 Semantic Features Extraction

Software defects originate mainly in the source program. Each source file can be parsed into a sequence of words, and these word sequences are converted into semantic features by the CNN's efficient feature extraction, as shown in Fig. 7.

Fig. 7 The semantic features extraction process

(1) Iterate over each file in the source project and parse its code into AST [43] nodes with the open-source Python package javalang. Following the established approach [20], only three main categories of node types [44, 45] are kept as the module's word sequence: method invocations and class instance creations, whose specific method and class names are added to the sequence; declarations, such as method declarations, interface declarations, and constructor declarations, whose names are added to the sequence; and control-flow nodes, such as ForStatement, IfStatement, and WhileStatement, which are added by node type. Some of these node types are listed in Table 1, and a parsing sketch follows the table.

Table 1 Part of the node types
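A minimal parsing sketch with javalang, keeping the three node-type categories described above (the node-type list here is illustrative, not the paper's full Table 1):

```python
import javalang

DECLARATIONS = (javalang.tree.MethodDeclaration,
                javalang.tree.InterfaceDeclaration,
                javalang.tree.ConstructorDeclaration)
CONTROL_FLOW = (javalang.tree.ForStatement,
                javalang.tree.IfStatement,
                javalang.tree.WhileStatement)

def file_to_tokens(source_code):
    """Parse one Java source file into the word sequence fed to the CNN."""
    tokens = []
    tree = javalang.parse.parse(source_code)
    for _, node in tree:                           # walk every AST node
        if isinstance(node, javalang.tree.MethodInvocation):
            tokens.append(node.member)             # invoked method's name
        elif isinstance(node, javalang.tree.ClassCreator):
            tokens.append(node.type.name)          # instantiated class's name
        elif isinstance(node, DECLARATIONS):
            tokens.append(node.name)               # declared name
        elif isinstance(node, CONTROL_FLOW):
            tokens.append(type(node).__name__)     # e.g. "IfStatement"
    return tokens
```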

(2) Since the CNN model only accepts numeric input, all words in the sequences must be converted to numbers. Words are encoded without repetition, starting from 1 and increasing; each word corresponds to exactly one code, and repeated occurrences reuse the previous code. To distinguish word encoding from label encoding, labels are One-Hot encoded. After these steps, each source file is parsed into a numerical vector.

(3) In addition, the CNN requires input vectors of consistent length, which raw files cannot provide since the code of every source file differs. We therefore fix the input length: vectors longer than this value are truncated, and shorter ones are padded with 0 (a sketch of steps 2 and 3 follows). In short, the CNN takes these vectors as input and applies embedding, convolutional, activation, pooling, and fully connected layers to extract semantic features.
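Steps (2) and (3) can be sketched as follows; post-padding/truncation is an assumption, since the paper only specifies that overlong vectors are truncated and short ones zero-padded:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_and_pad(token_lists, max_len):
    """Map each distinct token to an integer starting from 1 (step 2),
    then truncate/zero-pad every sequence to max_len (step 3).
    """
    vocab = {}
    encoded = []
    for tokens in token_lists:
        seq = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1  # first unseen word gets the next code
            seq.append(vocab[tok])           # repeated words reuse their code
        encoded.append(seq)
    # 0 is reserved for padding; overlong sequences are truncated.
    padded = pad_sequences(encoded, maxlen=max_len,
                           padding="post", truncating="post")
    return padded, vocab
```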

4 Experimental Setup

In this section, a series of experiments are designed to evaluate the effectiveness of our proposed hybrid defect prediction model.

4.1 Datasets

This paper uses experimental data from six open-source projects hosted by the Apache Foundation, selected on criteria such as dataset stability. Since defect prediction in this paper is based on version iteration, each project contributes two successive versions: the old version serves as the training data source and the new version as the test data source. Table 2 gives detailed descriptions, versions, average file counts, and defect rates for the six projects.

Table 2 Software project related information

The number of files in the selected projects varies from 210 to 892, ensuring data diversity, and the projects cover defect rates from a minimum of 15.5% to a maximum of 64.7%, allowing us to test model performance under different conditions. The first column of Table 2 lists the names of the six datasets, the second briefly describes each project, the third and fourth give the versions and average file counts, and the last gives the percentage of defective instances.

In addition, we collected 20 artificial statistical features and defect statistics for these six projects, with the statistical feature data coming from the tera-PROMISE repository. The dataset contains metrics based on software size and software complexity, listed in Table 3.

Table 3 The statistical metrics

4.2 Evaluation Measures

The \(F1\) score [46] combines the precision and recall of a classifier and is better suited than other metrics to evaluating models on imbalanced datasets, because it balances the classifier's performance on the positive and negative classes. Precision \(P\) is calculated as follows:

$$P = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}$$
(15)

where \(\mathit{TP}\) is the number of true positives, \(\mathit{FP}\) the number of false positives, and \(\mathit{FN}\) the number of false negatives. Recall \(R\) is calculated as follows:

$$R = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}$$
(16)

The formula for calculating an \(F1\) score is as follows:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
(17)

For datasets with a balanced class distribution, precision and recall alone measure model performance well; but for classically imbalanced tasks such as software defect prediction, where the decision threshold does not correspond to a prior class probability of 0.5, the \(F1\) score is the better choice. The three metrics can be computed as sketched below.
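With scikit-learn, Eqs. (15)-(17) are computed directly (toy labels shown for illustration; 1 = buggy, 0 = clean):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # Eq. (15): TP / (TP + FP)
r = recall_score(y_true, y_pred)     # Eq. (16): TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # Eq. (17): 2PR / (P + R)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```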

4.3 Baselines

To demonstrate the superiority of the feature selection method based on counterfactual explanations, we compared it with a baseline that performs no feature selection and with the best traditional feature selection approach, the Wrapper method, performing feature selection on the 20 artificial statistical features of the six open-source projects described in Sect. 4.1.

To show that our feature selection method and the structural features improve model prediction performance, we set up the following six models for comparison:

  1. LR: a traditional defect prediction model built from the 20 artificial statistical features using the logistic regression algorithm.

  2. CNN: semantic features are learned from the projects' source programs with a CNN, and the defect prediction model is built without traditional artificial feature indicators.

  3. DP-CNN: a defect prediction model combining the semantic features with the 20 traditional artificial feature indicators.

  4. DPS-CNN: based on the DP-CNN model, with the artificial statistical features optimized by the feature selection method based on counterfactual explanations.

  5. DP-CNN-STR: based on the DP-CNN model, combined with the newly mined structural features.

  6. DPS-CNN-STR: based on the DP-CNN model, combining both of the above improvements.

4.4 Research Questions

When constructing prediction models, the main concern is model performance. This paper proposes a new prediction model, DPS-CNN-STR, which builds on the DP-CNN model, additionally performs feature selection based on counterfactual explanations, and incorporates the structural features mined by Struc2vec. We therefore ask three questions:

  • RQ1: Is the feature selection method based on counterfactual explanations effective?

  • RQ2: Can the structural features mined by Struc2vec effectively improve the performance of defect prediction models?

  • RQ3: Does the hybrid defect prediction model DPS-CNN-STR outperform all the other models?

4.5 Setup

For RQ1, we compare the feature selection method based on counterfactual explanations with the Wrapper method and with a baseline that performs no feature selection. The baseline does not optimize the features and simply uses logistic regression as the prediction model. Both the counterfactual-based method and the Wrapper method require a specific classifier as a carrier; to ensure comparability, both also use logistic regression. The specific procedures are as follows:

  • The feature selection method based on counterfactual explanations (LRBOC): first, the old version's artificial statistical feature table is input into the counterfactual generation framework, and each feature receives an importance score, returned as a dictionary sorted from high to low, e.g. { 'feature1': 0.542, 'feature2': 0.384, …, 'feature20': 0.112}. Features are then combined one by one in descending order of importance; with 20 features there are 20 feature subsets of the form [feature1], [feature1, feature2], …, [feature1, feature2, …, feature20]. The old version's artificial statistical features (restricted to each subset) and labels are used to build the classifier, the new version's features and labels form the test set, and the subset with the highest F1 score is selected as the optimal feature subset.

  • The Wrapper method (WRAP): we select the RFECV algorithm, as its procedure is consistent with that of LRBOC. RFECV first uses RFE (recursive feature elimination) to derive an importance ranking for each feature; based on this ranking, subsets of different sizes are trained and cross-validated, and the best-performing subset becomes the final feature set. The number of features to retain is controlled via the 'n_features_to_select' parameter of the underlying RFE step. The number of features in the old version's optimal subset is first determined from the learning curve, after which the specific subset is determined from the feature-ranking matrix. For example, if the optimal subset has 8 features and the ranking matrix is [1,2,2,3,1,1,2,3,1,2,1,3,1,2,2,3,1,3,2,1], the 8 features ranked 1 form the optimal subset. The old version's best feature subset and labels are used to build the classifier, the new version's best subset and labels form the test set, and the F1 score is recorded (a sketch follows this list).

  • Baseline without feature selection (LR): without any feature selection, all features and labels of the old version are used to build the classifier, while all features and labels of the new version form the test set; the F1 score is recorded.
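A minimal sketch of the WRAP pipeline with scikit-learn's RFECV, assuming the feature tables X_old and X_new are pandas DataFrames:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,                    # drop one feature per elimination round
    cv=5,
    scoring="f1",
    min_features_to_select=1,
)
selector.fit(X_old, y_old)                       # old version: training data
X_old_best = X_old.loc[:, selector.support_]     # the selected feature subset
X_new_best = X_new.loc[:, selector.support_]

clf = LogisticRegression(max_iter=1000).fit(X_old_best, y_old)
print(f1_score(y_new, clf.predict(X_new_best)))  # F1 on the new version
```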

The F1 scores obtained by the three methods are then compared to demonstrate the superiority of the feature selection method based on counterfactual explanations.

In RQ2-3, we mainly set up five sets of comparison experiments. More details will be given in the experimental analysis in Sect. 5.2.

5 Experimental Results

In this section, we give experimental results to answer the three questions.

5.1 RQ1: Validity of Feature Selection

5.1.1 Comparison

As shown in Table 4, our proposed feature selection method LRBOC performs considerably better than the WRAP feature selection method on all six datasets.

Table 4 F1 scores of different feature selection methods

Table 4 shows that LRBOC's F1 score is on average 2.9% higher than WRAP's and 5.1% higher than LR's, while WRAP is only 2.2% higher than LR on average. These results illustrate the effectiveness of the feature selection method based on counterfactual explanations.

The advantages of the feature selection method proposed in this paper lie in the following two points:

  1. Important features are changed frequently when generating counterfactual instances and thus receive higher scores. This provides an objective indicator for determining feature importance and addresses the limitation of the Filter method.

  2. Across all generated counterfactual instances, learners can directly observe how readily each feature changes and derive the importance score from the change frequency, addressing the limitation of the Wrapper method.

5.1.2 Extension

As described in Sect. 3.2, the feature selection method based on counterfactual explanations consists of three main steps. Figure 8 illustrates the process of selecting the best feature subset for the different projects.

Fig. 8 The process of selecting the best feature subset

Figure 8 shows that using all 20 features is equivalent to performing no feature selection. In the poi project, the F1 score peaks when 7 artificial statistical features are used; these seven features have the highest importance scores and constitute the project's optimal subset of artificial statistical features.

5.2 RQ2-3: Validity of Structural Features

To answer RQ2 and RQ3, we designed five sets of comparison experiments, and the results are shown in Table 5.

Table 5 F1 scores of six defect prediction models

Table 5 shows that the F1 score of the CNN model is on average 7.1% higher than that of the LR model, indicating that automated feature mining benefits defect prediction model construction more than manual feature mining and verifying the feasibility of deep learning for software defect prediction. Comparing the third and second columns, the F1 score of DP-CNN is on average 2.2% higher than that of CNN, indicating the effectiveness of feature combination. In addition, the F1 score of DPS-CNN is on average 0.9% higher than that of DP-CNN, further verifying the effectiveness of the feature selection method based on counterfactual explanations.

To better answer RQ2, we compared the DP-CNN-STR model with the DP-CNN model: the former's average F1 is 2.6% higher, showing that the software structural features extracted by Struc2vec are useful for constructing defect prediction models.

The overall performance of DPS-CNN-STR is higher than that of DP-CNN-STR, yet DPS-CNN-STR falls behind on two projects (lucene and jedit). We attribute this to the composition of DPS-CNN-STR from artificial statistical, semantic, and structural features: the comparison between the CNN and LR models shows that artificial statistical features influence the model far less than semantic features, i.e., features mined by deep learning dominate the defect prediction model. Therefore, although DPS-CNN-STR adds artificial statistical feature optimization on top of DP-CNN-STR, this does not guarantee that it always outperforms DP-CNN-STR.

Finally, to answer RQ3: the average F1 score of DPS-CNN-STR is 3.3% higher than that of DP-CNN and is the highest among the six models, proving that both the proposed feature selection method and the structural features promote model construction.

6 Threats to Validity

The main internal threats come from the construction of the experimental environment and the parameter settings. Our experiments run in a Python environment; to reduce uncontrollable implementation factors, we adopted sufficiently mature third-party libraries (various Python packages) to meet our requirements. We also adopted the documentation's default values as the defect prediction parameters, and parameter settings often directly affect prediction performance.

For example, the feature selection experiments based on counterfactual explanations involve a regularization (penalty) parameter. During training on large amounts of data, many dimensions arise, some decisive and others irrelevant; while more training data should yield more accurate results, more judgment dimensions worsen the model's generalization ability. A balance must be struck, hence the penalty term. In the experiment in Sect. 5.1.1 we use its default value of 1, but different penalty values yield different results, as shown in Table 6.

Table 6 Results of LRBOC under different parameter settings

In addition, when constructing the DPS-CNN-STR model, different internal parameter settings lead to different prediction performance, so appropriate parameters must be selected to obtain the best performance.

The main external threat stems from project generality: we evaluated our model on only 6 projects, which cannot represent all types of software. We selected only 6 datasets because constructing our defect prediction model requires artificial statistical features, semantic features, and structural features, meaning that both manually mined features and the corresponding source code must be available; some projects contain source files with syntax errors that are difficult to correct, so after careful consideration only 6 projects form our final dataset.

7 Conclusion and Future Work

Researchers have developed various software defect prediction models to reduce the losses caused by defects. This paper presents a new model, DPS-CNN-STR, built on the DP-CNN model with two main improvements. First, we build a software network from the data flow between modules and use the new network representation technique Struc2vec to extract the more important structural features, combining these newly mined structural features with the DP-CNN model. Second, we determine feature importance scores from generated counterfactual samples and use these scores as heuristic search information to optimize the artificial statistical features in the model; because these importance scores are interpretable, the method produces a better feature subset than traditional feature selection. Our experiments on six public datasets show that the proposed multi-feature-fusion model DPS-CNN-STR offers a new approach to constructing software defect prediction models.

In the future, we will explore the potential of DPS-CNN-STR for defect prediction on more projects. We will also explore how to combine features more effectively given their importance scores, and how to extract the structural features of the software network more effectively, to further improve the performance of defect prediction models.