Introduction

A software defect is a flaw or fault in a computer program that produces incorrect or unexpected outcomes or fails to meet users' actual requirements [1]. Nowadays, as the scale of software increases, its structure becomes more complicated. Ensuring the safety and reliability of code during software development has therefore become a focus of major software development companies. In the small-scale development stage, developers can find and resolve most software defects through testing and code review. However, software development has shifted from small-scale development to centralized, large-scale development. Software structure has become complex, and defects are hidden deeper and are harder to find. Moreover, the cost of relying on manual defect detection has increased while its timeliness has gradually worsened [2]. According to a survey [3], 50% to 70% of software development costs are spent finding and fixing software defects.

The goal of Software Defect Prediction (SDP) is to discover which modules in a software project may have defects. SDP runs through the entire life cycle of software development and is a vital link in the software development process [4,5,6,7]. SDP includes cross-version defect prediction (CVDP) and cross-project defect prediction (CPDP). In CVDP, training data and test data come from the same project: typically, historical version data serves as training data and current version data as test data, or the data of two adjacent versions of the project is selected as training and test data [8,9,10,11,12]. Compared with CPDP, CVDP is more useful in practice since a project team is usually composed of a fixed group of developers throughout the development process [11]. Specifically, due to the correlation between developers' programming habits and the business code, the defect code distribution across historical versions of a project is consistent, so a machine learning model can achieve a good learning effect.

When the version changes, both prediction settings mentioned above detect all components to reduce missed defects. This selection strategy certainly reduces the rate of missed alarms due to lack of coverage. Unfortunately, for some large and complex software systems, even small changes lead to full-coverage detection. This causes significant testing overhead, which is unacceptable before a release. Moreover, in terms of code feature representation and model selection, existing works still use traditional statistical features to characterize code and leverage traditional machine learning models for training. However, the actual defect detection effect is not ideal.

To improve detection efficiency, we propose BugPre, which uses structural change information to extract code differences caused by changes between versions. Compared to prior CVDP methods [10, 13, 14], this difference information substantially reduces the scope of code to be detected. To discriminate the difference code, BugPre utilizes an innovative graph network structure, and the input of multi-relational graphs enhances the detection capability of the deep learning model. In building the BugPre detection framework, we had to address the following challenges.

The key challenge is how to decrease the false-positive rate of the impact set in change impact analysis (CIA). Traditional change impact analysis based on the Data Flow Graph (DFG) or the Function Call Graph (CG) is coarse-grained and relies on a single relationship. This can cause functions that are called but not actually affected by a modification to be mistaken for changed functions. Besides, a change in a variable can affect the execution of a function [15]. When a certain part of the code changes, the code execution path may be affected, or further impact may be caused because variables in the changed code may be used in other functions. Therefore, other functions affected by a variable can be found according to the definition, use, and propagation of that variable. Based on the above analysis, this paper proposes a changed code generation method based on the variable propagation tree (VPT), which describes, at the granularity of individual variables in the class space, the specific propagation behavior of all variables among functions and performs incremental code generation between versions based on this behavior. Secondly, in defect detection, the version change information and the changed code features of the project are fused to enrich the feature information of the code. Finally, a graph is constructed from the fused features, and we use a Graph Convolutional Neural Network (GCNN) [16, 17] to learn the graph features of the code for quick defect detection after a version change.

We design extensive experiments to verify the proposed method. The experimental results indicate that defect detection performance is significantly improved compared with traditional methods, on both public datasets and open-source projects. BugPre leverages changed code analysis to greatly reduce the size of the test data between versions, further improving the efficiency of defect detection. The main contributions of this paper include the following:

  • We propose a variable propagation path-based changed code generation approach for method-level cross-version defect detection, which accurately determines the scope of code changes during version iteration.

  • Graph representations that combine software code version change information are proposed and applied in practice. We built the BugPre system and conducted extensive experiments on open-source Apache projects. Compared with three existing state-of-the-art defect detection methods, BugPre improves the F1-score by over 16%.

Related work

This paper performs associated code analysis leveraging code changes between software versions to generate changed code and then employs a defect detection model to detect defects in the generated changed code. Around this process, this section introduces related work in three aspects: cross-version defect prediction, change impact analysis, and graph neural networks.

Cross-version defect prediction

Software defect prediction (SDP) helps software project developers spot defect-prone source files of software systems. The software project team can carry out rigorous quality assurance activities on these predicted defect files before release and deliver high-quality software. SDP is usually a binary classification task that trains a prediction model using data related to defective code from software repositories. According to the source of training data, SDP can be divided into two categories: cross-version defect prediction (CVDP) and cross-project defect prediction (CPDP). Cross-version prediction constructs a prediction model from a previous version of a software project and predicts defects in the current version. Compared with cross-project defect prediction, cross-version defect prediction is more practical. Although software development is a dynamic process that can change data structure characteristics between versions, the new version usually retains much information from the old versions. Thus, the previous version of the same software project has similar parameter distributions among files [10].

In recent years, several cross-version defect prediction approaches have emerged [10, 13, 14]. Shukla et al. [10] proposed multi-objective defect prediction models for cross-version defect prediction and compared them with four traditional machine learning algorithms. They pointed out that multi-objective logistic regression was more cost-effective than single-objective algorithms. Zhang et al. [14] proposed a cross-version model with a data selection method (CDS) to solve the problem of version data label overlap. Bennin et al. [13] used an effort-aware performance measure to perform a cross-release evaluation of 11 fault prediction models. However, these works conduct CVDP in a non-incremental way, that is, they directly use all the data of the previous version to construct a prediction model. Compared with non-incremental detection methods, detection based on differences between versions can substantially narrow the scope of code to be detected and improve detection efficiency.

Change impact analysis

Software modification has become an indispensable part of the software development process to cope with frequent technology changes and customers' frequently changing demands. Software changes pose challenges to cross-version defect detection because of the continuous and intensive nature of changes. Therefore, how to efficiently detect software defects is particularly important. Compared with traditional defect detection on entire files, changed code-based detection examines only the code that may be affected, which can greatly reduce time, effort, and cost.

More recently, some works [18,19,20] have explored Change Impact Analysis (CIA) [21] to identify areas potentially impacted by changes made to the software. Sun et al. [22] provided a static CIA method based mainly on class and member calls. They first categorize the types of changes, some of which are based on [23]. To improve the estimation accuracy of the impact set, the authors start from the following three aspects: the change type of the classified modified entity, the dependence relationships between the modified entity and other entities, and the initial change set, which directly affects the final impact set. Bogdan et al. [24] constructed a static change impact analysis tool (ImpactMiner) that uses an adaptive combination of static text analysis, dynamic execution tracing, and mining software repositories to estimate the impact set. Ufuktepe et al. [25] proposed a model named Change Effect Graph. They first parsed and encoded changed code by program slicing, then encoded the probability value of the sliced data combined with the function call relationships and change information, and finally input it into a Bayesian network to predict the final change impact set. Hanam et al. [26] proposed a semantic change impact analysis technique, constructed a Sem-CIA tool to extract the semantic change information of code, and defined four semantic change impact relations. This work helps developers better understand the relationships between software code changes while determining the impact set.

Existing works perform change impact analysis on changed code from multiple dimensions such as classes, functions, and statements. However, in function-based associated analysis, most methods rely only on function calls to identify affected code, ignoring other interactions such as the use of variables across functions. When the analysis is coarse-grained, the results tend to have a high false-positive rate.

Fig. 1 Overview of BugPre. BugPre utilizes the data of the prior version for model training and conducts defect detection on the current version within the same project

Graph neural network

A graph is a data structure consisting of nodes and node relationships, with powerful information representation capabilities. Research on graphs leveraging machine learning methods has received wide attention, especially in social networks [16, 27], drug molecules [28], knowledge graphs [29], and other fields. Furthermore, code itself is highly irregular graph data, containing complex structural and semantic information. This deep-level information cannot be perceived by traditional machine learning. Therefore, much research on defective code detection has begun to adopt graph network models to learn deep code features.

One common approach is graph embedding [30,31,32,33,34,35], which expresses a graph as low-dimensional feature information by learning the nodes, edge relationships, or subgraph structure of the graph. For example, the node2vec framework proposed by Grover et al. [36] preserves the network neighborhood information of nodes as much as possible by learning a low-dimensional mapping from nodes to features. Tang et al. [37] proposed the LINE model in 2015, which maps the nodes of a large network into a vector space according to the closeness of their relationships, so that closely connected nodes are projected to nearby positions; an important indicator of the closeness of two nodes is the weight of the edge between them. The authors consider not only first-order but also second-order relationships when modeling. Ribeiro et al. [38] proposed the Structure2vec algorithm, which uses supervised learning combined with convolution to learn iteratively on the graph and obtain a vector representation of the graph.

System design

System overview

Figure 1 shows a high-level overview of BugPre. BugPre collects bug-fix logs from the commit records of projects hosted on GitHub to label project data. It extracts the AST and then builds a graph network detection model. When applying BugPre for defect prediction, we use the VPT for associated code analysis to determine the regression test scope. BugPre contains the following parts:

Data Collection: BugPre collects data of different releases of projects from GitHub, including source code and Commit information submitted by users.

Data Preprocessing: This module labels data of the prior version with defective codes and non-defective codes according to the corresponding Commit information for subsequent classification model training.

Graph Construction: BugPre leverages the code context dependency to construct a graph based on AST and extracts the code features and version change features for graph nodes. The objects used for graph construction in the training phase and prediction phase are different, while the construction process of both phases is the same.

Changed Code Generation: BugPre applies CIA on the variable propagation tree to extract the differences between versions, that is, functions that changed and functions that may be affected by changes.

Training Model: We use all the labeled data of the prior version to build a classification model through GCNN.

Defect Prediction: We conduct defect prediction on the changed code of the current version.

Data processing

The purpose of data processing is to extract defective and non-defective code fragments based on historical project version changes. We extract keywords (i.e., “fix,” “defect,” and “bug”) from the commit records and then set labels for code segments according to these keywords. Commits containing these keywords indicate a defect in a previous version of the software that has been fixed in the current version. In general, version change records in open-source repositories track the process of bug fixes. By comparing versions with associated relationships, we can accurately distinguish defective code from normal code. The extracted code fragments are subsequently used to construct the graph structure. A minimal sketch of this keyword screening is shown below.
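As a minimal illustration of this labeling step (the keyword list follows the paper; the class and method names are our own illustrative choices), a commit message can be screened as follows:

```java
import java.util.regex.Pattern;

public class CommitLabeler {
    // Keywords from the paper; matching is case-insensitive and word-bounded
    // so that e.g. "prefix" does not trigger on "fix".
    private static final Pattern BUG_KEYWORDS =
            Pattern.compile("\\b(fix(es|ed)?|defect|bug)\\b", Pattern.CASE_INSENSITIVE);

    /** Returns true if the commit message indicates a bug-fixing change. */
    public static boolean isBugFixCommit(String message) {
        return BUG_KEYWORDS.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(isBugFixCommit("Fix NPE in FileStream close path")); // true
        System.out.println(isBugFixCommit("Add release notes for 2.7.0"));      // false
    }
}
```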

Graph construction

A key challenge is how to learn a distributed representation of source code that captures its semantics effectively. Existing approaches using sequential representations of code consider only shallow features such as method names and code tokens and ignore structured features such as the abstract syntax tree (AST), which contains rich and well-defined semantics of the source code. The AST represents the hierarchical syntactic structure of a program and preserves code information such as identifiers, characters, and variables, as well as structural information. However, the structural relationships in an AST are only parent-child relationships between nodes and cannot represent the actual logical execution of the code. To tackle this problem, we construct graphs of node structure relationships based on the tokens that form the primary nodes of the AST. Next, based on the node relationship graph, we extract static node information and fuse version change information to enrich node features. These graph-building operations allow the model to learn high-level information in the code.

AST-based node relationship construction

Software defects are often obscured by complex and unclear context dependencies, so execution logic and data dependencies in the code are indispensable features for defect detection. Therefore, BugPre constructs a node relationship graph of the code, similar to the COMFORT system [39], incorporating structural information such as control flow and data flow.

Control Flow Statements. The control flow in code mainly refers to sequential execution, branch jumps, and loop statements, which represent the execution logic and record all available paths during the execution of the code. According to these three structures, we build a control flow graph based on the AST of each function. Specifically, we use JavaParser to parse a Java file into an AST and then treat each function declaration node named “MethodDeclaration” as the root node of the control flow structure. From the root node, the control flow relationship is attached to all child nodes via a depth-first traversal.

Figure 2 shows an example of building a control flow relationship based on the AST (only the necessary nodes are kept in the graph). Here, next-token is the original edge of the AST, representing the inclusion relationship between nodes, and the next-exec edge is defined to indicate the execution order of the program. We build a control flow relationship for the AST according to the following three rules for adding edges. (1) For a branch jump statement with a node named “IfStmt,” because the specific jump path is not known, the judgment condition statement must be connected to all branches through next-exec. For example, in Fig. 2 there are two next-exec edges from the condition node named “BinaryExpr” to the “ExpressionStmt” node and the “ForStmt” node, respectively; the former corresponds to the assignment statement “r = 1.0” in the source code, and the latter represents the for-loop structure. (2) For a loop statement, the loop judgment node “BinaryExpr” and the loop body “BlockStmt” form a loop by adding two next-exec edges to simulate the execution of the loop. (3) Finally, sequentially executed statements are connected through next-exec edges according to their execution order. A simplified sketch of these rules is given below.
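The following is a simplified sketch of rules (1)–(3) using JavaParser; the Edge record and the overall structure are our own illustrative choices, not BugPre's actual implementation:

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.Node;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.stmt.*;
import java.util.ArrayList;
import java.util.List;

public class ControlFlowBuilder {
    /** A directed next-exec edge between two AST nodes. */
    record Edge(Node from, Node to) {}

    static List<Edge> buildNextExecEdges(String javaSource) {
        CompilationUnit cu = StaticJavaParser.parse(javaSource);
        List<Edge> edges = new ArrayList<>();
        for (MethodDeclaration m : cu.findAll(MethodDeclaration.class)) {
            // Rule (3): connect sequential statements inside each block.
            for (BlockStmt block : m.findAll(BlockStmt.class)) {
                List<Statement> stmts = block.getStatements();
                for (int i = 0; i + 1 < stmts.size(); i++) {
                    edges.add(new Edge(stmts.get(i), stmts.get(i + 1)));
                }
            }
            // Rule (1): connect an if-condition to every branch.
            for (IfStmt ifStmt : m.findAll(IfStmt.class)) {
                edges.add(new Edge(ifStmt.getCondition(), ifStmt.getThenStmt()));
                ifStmt.getElseStmt().ifPresent(
                        elseStmt -> edges.add(new Edge(ifStmt.getCondition(), elseStmt)));
            }
            // Rule (2): close the loop between a for-condition and its body.
            for (ForStmt forStmt : m.findAll(ForStmt.class)) {
                forStmt.getCompare().ifPresent(cond -> {
                    edges.add(new Edge(cond, forStmt.getBody()));
                    edges.add(new Edge(forStmt.getBody(), cond));
                });
            }
        }
        return edges;
    }
}
```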

Data Flow Dependencies. Similar to memory leaks, incorrect variable usage can cause program defects. Listing 1 shows a defective code fragment in the upper part, in which a file stream is not freed after use; the sample code in the lower part fixes this defect by releasing the file stream.

Listing 1 A defective code fragment (top) and its fix (bottom)

Fig. 2 Example of AST-based control flow relationship construction

The process from variable definition to use is complex and changeable, which may cause errors and defects. Therefore, constructing the data flow of variables contributes to learning the characteristics of defective code and better exposes defects. For this purpose, we define the following three types of edges to accurately describe the use of variables in functions (a sketch of their construction follows the list):

  • VarDefinition-Edge: It represents the definition of a variable. VarDefinition-Edge is used to connect the parent node with the variable definition node.

  • VarUse-Edge: It indicates that a variable is used. When a node uses a currently defined variable node, we add an edge of this type between the two nodes.

  • VarModify-Edge: We use this edge type to describe the relationships between modified/reassigned variables and their callers or passers.
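Below is a hedged sketch of how these three edge types could be collected with JavaParser; the EdgeType enum and the simple name-based definition lookup are simplifying assumptions (a real data flow analysis must also handle scoping and shadowing):

```java
import com.github.javaparser.ast.Node;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.body.VariableDeclarator;
import com.github.javaparser.ast.expr.AssignExpr;
import com.github.javaparser.ast.expr.NameExpr;
import java.util.*;

public class DataFlowBuilder {
    enum EdgeType { VAR_DEFINITION, VAR_USE, VAR_MODIFY }
    record Edge(Node from, Node to, EdgeType type) {}

    static List<Edge> buildDataFlowEdges(MethodDeclaration m) {
        List<Edge> edges = new ArrayList<>();
        // Map each variable name to its declaration node (no scoping for brevity).
        Map<String, VariableDeclarator> defs = new HashMap<>();
        for (VariableDeclarator d : m.findAll(VariableDeclarator.class)) {
            defs.put(d.getNameAsString(), d);
            // VarDefinition-Edge: parent node -> declaration node.
            d.getParentNode().ifPresent(p -> edges.add(new Edge(p, d, EdgeType.VAR_DEFINITION)));
        }
        for (NameExpr use : m.findAll(NameExpr.class)) {
            VariableDeclarator def = defs.get(use.getNameAsString());
            if (def == null) continue;
            // A name that is the target of an assignment is a modification.
            boolean isAssignTarget = use.getParentNode()
                    .filter(p -> p instanceof AssignExpr a && a.getTarget() == use)
                    .isPresent();
            edges.add(new Edge(def, use,
                    isAssignTarget ? EdgeType.VAR_MODIFY : EdgeType.VAR_USE));
        }
        return edges;
    }
}
```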

Function calls are considered in many studies on graph representations of code [40, 41]. However, the defect detection granularity of BugPre is method-level, and the function call relationship cannot be established because the node relationship graph is built from the AST of a single function. Although called functions cannot be associated in the graph, we extract function call information as node features in the subsequent feature extraction.

Feature extraction

In this part, we focus on extracting two types of features: statistical features and version change features. Statistical features represent the static abstraction of the node itself, and version change features enrich the information of the code.

Statistical features: We design the following eight kinds of characteristics to capture the semantic information of the code, as shown in Table 1. Among them, the in-degree and out-degree of a node measure its centrality, and critical nodes have high centrality. The number of function calls is obtained from the MethodCallExpr nodes in the graph, which reflects the structural complexity of the code. In addition, the statistical information also covers characteristics of the code structure, such as counts of control flow nodes and comparison statements. A sketch of two of these features is shown after Table 1.

Table 1 Statistical features of nodes
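As a small illustration, two of the Table 1 features can be computed directly from the JavaParser AST; the method names here are ours:

```java
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.expr.MethodCallExpr;
import com.github.javaparser.ast.stmt.ForStmt;
import com.github.javaparser.ast.stmt.IfStmt;
import com.github.javaparser.ast.stmt.WhileStmt;

public class NodeFeatures {
    /** Number of function calls, counted via MethodCallExpr nodes (cf. Table 1). */
    static int functionCallCount(MethodDeclaration m) {
        return m.findAll(MethodCallExpr.class).size();
    }

    /** Number of control flow statements (if / for / while) in the method. */
    static int controlFlowNodeCount(MethodDeclaration m) {
        return m.findAll(IfStmt.class).size()
             + m.findAll(ForStmt.class).size()
             + m.findAll(WhileStmt.class).size();
    }
}
```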

Version change features: Version change features capture refined information about the version evolution of software projects. Commits can easily be obtained from open-source repositories. This kind of information provides a complete record of code fixes, the people who made them, and whether certain defects were resolved during the version change process, which is typically related to the occurrence of defects. A basic intuition is that the more often a block of code is fixed, the more likely it is to contain defects, and development experience shows that such code tends to be complex and poorly structured. In addition, because inconsistencies easily arise during collaborative development, code fixed by multiple people is also prone to defects. By analyzing the existing commit records, we summarize six defect-related version change features, as shown in Table 2.

Table 2 Version change features of nodes

Changed code generation

To achieve efficient defect detection, we use the CIA method to examine the prior version and the current version and pick up the changed code. The set of changed code includes not only the directly modified code but also the parts associated with it. Next, we describe the two steps of the CIA method, namely obtaining the change set and automatically generating the corresponding impact set.

Obtaining change set

The first step of CIA is to correctly extract the change set, which is the critical input for the subsequent associated code analysis. We first use the command git diff to find the changed code between two versions of a project. However, the matching results may contain many false positives due to changes in logs, formatting, etc. Therefore, we propose a fine-grained approach that matches function pairs to identify the actual change set. If a file in the prior version has changed, we compare it with the file in the current version to form function pairs. Then, we filter invalid code and match the function pairs to locate the truly changed functions, generating an accurate change set. Note that new functions (including functions in new files) do not need any comparison; they are added directly to the change set. Next, we detail the method of obtaining the change set, including AST pruning and locating modified functions based on graph matching.

(i) AST pruning. Considering that structured information has a more pronounced relationship with code defects, we propose a graph-based matching method that uses the ASTs of function pairs to locate differences rapidly. However, if the comparison is made directly, the system overhead increases sharply once the graph structure becomes complex. In practice, many changes unrelated to the code logic (e.g., changes to comments, logs, etc.) are removable by AST pruning [42]. Based on this view, we propose an AST pruning algorithm (Algorithm 1). Specifically, the algorithm calls the function getMethodDeclaration to extract the AST of each function according to the node type “MethodDeclaration” and then filters meaningless code segments through the function filter.

Algorithm 1 AST pruning
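A minimal sketch of Algorithm 1's pruning step using JavaParser is shown below; it removes only comments, whereas the paper's filter function also drops other logic-irrelevant code such as log statements (omitted here):

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;
import com.github.javaparser.ast.comments.Comment;
import java.util.List;

public class AstPruner {
    /** Returns per-method ASTs with comments removed (a simple stand-in for filter()). */
    static List<MethodDeclaration> pruneMethods(String javaSource) {
        CompilationUnit cu = StaticJavaParser.parse(javaSource);
        // Drop comments: they cannot affect program logic but inflate the diff.
        for (Comment c : cu.getAllContainedComments()) {
            c.remove();
        }
        // getMethodDeclaration step of Algorithm 1: one AST per function.
        return cu.findAll(MethodDeclaration.class);
    }
}
```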

(ii) Locating changed functions based on graph matching. After pruning the AST of each function, we use the simplified ASTs to pick up changed functions by graph-based similarity comparison. As described in the literature [43, 44], we use a random walk-based graph matching algorithm, which takes the number of matching paths as the similarity value S of the two graphs. A higher value of S indicates a higher degree of similarity.

The size of the change set is closely related to the similarity threshold. We use two adjacent versions of three Apache open-source projects (https://people.apache.org/phonebook.html), namely Camel, Ant, and Xalan, to calculate the threshold. We chose Apache projects instead of random GitHub projects because the former have higher-quality defect annotations [45]. We first collect the Git log and commit information of these projects, manually filter all changed files to remove invalid changed code, and locate all modified functions. Table 3 shows the version information and the number of Java files of the three projects, as well as the number of changed function pairs and unchanged function pairs between two releases of the same project. For function pairs without changes between adjacent releases, we employ the random walk-based graph matching algorithm to calculate the similarity value of their ASTs and normalize it to the range 0 to 1. Then, we obtain the quartiles of the similarity values of all unchanged functions and take their mean as the threshold for confirming changed functions. Specifically, if a function pair's similarity falls below the threshold, the function is considered changed code, and vice versa. A sketch of this threshold computation follows Table 3.

Table 3 Three open-source projects of Apache for calculating threshold
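The threshold computation described above can be sketched as follows, assuming "quartiles" means Q1, Q2, and Q3 of the normalized similarity values; the nearest-rank quantile and the sample numbers are illustrative:

```java
import java.util.Arrays;

public class SimilarityThreshold {
    /** Mean of the three quartiles (Q1, Q2, Q3) of normalized similarity values. */
    static double threshold(double[] similarities) {
        double[] s = similarities.clone();
        Arrays.sort(s);
        return (quantile(s, 0.25) + quantile(s, 0.50) + quantile(s, 0.75)) / 3.0;
    }

    // Simple nearest-rank quantile on a sorted array.
    private static double quantile(double[] sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    public static void main(String[] args) {
        // Similarities of unchanged function pairs (illustrative values).
        double[] sims = {0.91, 0.95, 0.97, 0.98, 0.99, 0.99, 1.0};
        System.out.printf("threshold = %.3f%n", threshold(sims)); // ~0.973
        // A function pair whose similarity falls below this threshold is
        // treated as changed code.
    }
}
```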

After confirming which functions changed during the version change, the next step is to analyze these modified functions to generate a finer-grained change set. Unlike the traditional change set, we use variables, including local variables and global variables, as the basic unit. We apply Spoon GumTree [46] to compare the ASTs of the two changed functions to determine the changed variables and their specific locations and add them to the Change Set.

Generating impact set

The essence of the association between functions is the transfer of variable values between functions. Unlike previous associated code analysis using function call relationships, we propose a fine-grained method based on the variable propagation tree (VPT) so that the associated functions affected by a change can be accurately identified in the changed software release. Figure 3 shows an example of associated code analysis using the variable v1. Although functions M1, M2, M6, M7, and M8 all have a function call relationship with M3, only functions M2 and M7 are affected when the variable v1 is changed.

Fig. 3 Code analysis based on function call relationship

(i) Building the variable propagation tree. To describe the VPT, we define four node types, as shown in Table 4, which represent class, global variable, method body, and statement, respectively. For the AST of each Java file, BugPre traverses the AST nodes in a depth-first way and parses them into the corresponding VPT nodes. Specifically, the class node, class member variable, class member method, and statement node within a method in the AST are parsed into VPT_C, VPT_F, VPT_M, and VPT_S, respectively.

Table 4 Summary of VPT nodes
Algorithm 2 VPT construction
Algorithm 3 buildControlFlow
Algorithm 4 buildDataFlow
Algorithm 5 buildFunCall

Algorithm 2 shows the detailed construction process of the VPT. Algorithm 2 first calls the function getNodes to return all child nodes via a depth-first traversal. Then, each AST node is parsed into the corresponding VPT node according to its type. The nodetype is an attribute of the AST node that represents the node's type in the AST, mainly covering declaration types and statement types. For instance, ClassOrInterfaceDeclaration and MethodDeclaration are class and function declarations, respectively, and IfStmt denotes an if-statement. The node type is obtained through the function getNodeType.

After parsing all VPT nodes, we construct the VPT utilizing control flow, data dependencies, and function call relationships. The construction of control flow and data dependencies is similar to the node relationship construction (cf. Section AST-based Node Relationship Construction). Specifically, Algorithm 2 calls the functions buildControlFlow, buildDataFlow, and buildFunCall to build edge relationships between VPT nodes and generate the final VPT. The function buildControlFlow in Algorithm 3 constructs control flow by adding next-token and next-exec edges to represent the inclusion relationship and the sequential execution between nodes, respectively. The function buildDataFlow in Algorithm 4 describes the passing of values in the VPT by adding VarDefinition, VarUse, and VarModify edges. Algorithm 5 shows the process of building calling relationships between functions: the function buildFunCall adds a FunCall edge between the calling function node and the called function node. Figure 4 shows a Java class and its corresponding VPT.

Fig. 4 Example of VPT construction

Compared with existing token-based analysis methods, the VPT contains less data and can improve the efficiency of associated code analysis. BugPre builds the VPT with statements (i.e., VPT_S nodes) as the basic elements so that the VPT covers all variables in the class space. Each statement can be abstracted into the structure {Assigned, List<Assigner>}, corresponding to the assigned or modified object and the set of referenced or used variables. Based on VPT_S, the specific data flow direction can easily be determined, as the sketch below illustrates.
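A minimal sketch of this statement abstraction (the field names are ours):

```java
import java.util.List;

/** Minimal sketch of the VPT_S statement abstraction {Assigned, List<Assigner>}. */
public record VptStatement(String assigned, List<String> assigners) {

    public static void main(String[] args) {
        // For the statement "total = base + tax;" the data flows
        // from the assigners {base, tax} into the assigned variable total.
        VptStatement s = new VptStatement("total", List.of("base", "tax"));
        System.out.println(s.assigned() + " <- " + s.assigners());
    }
}
```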

(ii) Generating the impact set through associated code analysis. We use the VPT to describe the transfer of variable values between functions in the class space. After completing the VPT construction, we can perform associated code analysis to generate an impact set. Based on the VPT, we can easily obtain the context information of a given variable and find the function nodes associated with it. Yet associated code analysis is not a simple context search; it must be performed according to the location and type of the variable. We therefore devise the following three rules to analyze the transmission of impact (a sketch follows the list).

  • Using a global variable. We consider member variables and public static variables of a class as global variables, and variables participating in the modification of global variables as generalized global variables. When a global variable of a class or a generalized global variable in a function is modified, the scope of its influence is not limited to the current function; we must consider its propagation and use throughout the class in which the variable is located.

  • Using return. A “return” indicates a function call relationship: the called function computes a result and returns it to the calling location. If a function with a return value is changed, BugPre analyzes the propagation path of the modified variable within this method and finds the code associated with it. Besides, BugPre also determines the position to which the “return” statement returns (i.e., the calling location) and then analyzes the associated code of the calling function.

  • Using a local variable. A local variable is defined only within a function, and its life cycle is limited to the function body. Therefore, it is only necessary to analyze the data flow relationships within that function.
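The following sketch expresses the three rules as a dispatch on the variable kind; the VPT lookup helpers (usesInClass, usesInFunction, callersOf) are hypothetical stand-ins for walks over the VPT edges:

```java
import java.util.HashSet;
import java.util.Set;

public class ImpactAnalyzer {
    enum VarKind { GLOBAL, RETURN_VALUE, LOCAL }

    /**
     * Sketch of the three propagation rules. The VPT query methods below are
     * hypothetical helpers, not part of any published BugPre API.
     */
    Set<String> impactedFunctions(String variable, VarKind kind, String definingFunction) {
        Set<String> impacted = new HashSet<>();
        switch (kind) {
            case GLOBAL ->
                // Rule 1: a (generalized) global variable propagates class-wide.
                impacted.addAll(usesInClass(variable));
            case RETURN_VALUE -> {
                // Rule 2: follow the variable inside the function, then
                // continue the analysis at every calling location.
                impacted.addAll(usesInFunction(variable, definingFunction));
                impacted.addAll(callersOf(definingFunction));
            }
            case LOCAL ->
                // Rule 3: a local variable only affects its own function body.
                impacted.addAll(usesInFunction(variable, definingFunction));
        }
        return impacted;
    }

    // Hypothetical VPT lookups; real implementations would walk the VPT edges.
    Set<String> usesInClass(String variable) { return Set.of(); }
    Set<String> usesInFunction(String variable, String function) { return Set.of(); }
    Set<String> callersOf(String function) { return Set.of(); }
}
```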

Recall that we generated the change set between adjacent versions of a project (cf. Section Obtaining Change Set), which records the changed variables and their specific locations. We take this variable change set as the entrance to the associated code analysis and apply the above three rules to each changed variable to pick up the functions associated with it. Finally, the changed functions in the Change Set and the affected functions in the Impact Set are used together as the changed code for defect detection. For changed functions, we apply the graph construction method (cf. Section Graph Construction) to construct the node relationship graph and extract features for defect detection.

Training the predictive model

A node relationship graph with code node features was constructed in the preceding graph construction step; its adjacency matrix and node feature matrix are used as the input of the GCNN model for training. In this section, we introduce how to construct the graph convolutional neural network (GCNN) model and how to use it to learn the defect features in code for defect detection.

We first define a graph as \(\mathcal {G} = (\mathcal {V},\mathcal {E},\mathcal {X})\), where \(\mathcal {V}\) is the set of vertices, \(\mathcal {E}\) is the set of edges, and \(\mathcal {X}\) contains the feature vectors of the nodes. Each vertex is a node of the node relationship graph, and each edge represents a relationship between nodes, including control flow and data flow.

Fig. 5 The structure of graph convolutional network

As shown in Fig. 5, the graph convolutional neural network includes three parts: the input layer, the hidden layer, and the output layer. The input layer receives a node relationship graph, where \(A\sim F\) represent the nodes in the graph and the directed edges represent the relationships between them. The node relationship graph is fed into the model in the form of an adjacency matrix. The \(A^{0}\sim F^{0}\) in the hidden layer represent the initial feature vectors of the vertices, which are iterated and updated with each propagation through the hidden layer. We choose the softmax activation function for the output layer of the model. In our defect detection task, softmax converts the GCNN output into a probability distribution over the two classes, and the class with the larger probability determines whether the input code is defective.

The core part of the graph neural network model is the hidden layer, mainly the update of the feature vectors. The feature value \(x_v^{t+1}\) of the node v at update step t+1 is given by:

$$\begin{aligned} x_v^{t+1}=F\left( x_{v}^t,\sum x_u^t\right) ,\quad \forall v\in \mathcal {V},\ \forall u\in \mathrm {Adjacent}(v) \end{aligned}$$
(1)

where \(\mathrm {Adjacent}(v)\) is the set of nodes adjacent to node v, and F denotes the feature update function, which can be formulated as follows:

$$\begin{aligned} F\left( x_{v}^t,\sum x_u^t\right) =\tanh \left( Wx_v^t+\sigma \left( \sum x_u^t\right) \right) \end{aligned}$$
(2)

where W is a \(d\times p\) matrix, d is the dimension of the feature vector input to the model, and p is the dimension of the feature vector during propagation. \(\sigma \) denotes a non-linear function, which in this paper is an n-layer fully connected neural network; the specific formula is as follows:

$$\begin{aligned} \sigma (l)=P_1\,\mathrm {ReLU}(P_2\,\mathrm {ReLU}(P_3 \cdots \mathrm {ReLU}(P_{n}l))) \end{aligned}$$
(3)

where \(P_{i}\ (i=1,2,\ldots ,n)\) is a \(p\times p\) matrix and p is the dimension used during the iterative update of the feature vector. After n iterations, the softmax function is used to classify all nodes to determine whether the code contains defects. A minimal numeric sketch of this update step is given below.
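To make the update concrete, the sketch below implements one propagation step of Eqs. (1)–(2) with plain arrays, using a one-layer variant of Eq. (3) for \(\sigma \); all dimensions and the single-layer choice are illustrative assumptions:

```java
public class GcnnStep {
    /**
     * One propagation step per Eqs. (1)-(2):
     *   x_v^{t+1} = tanh(W x_v^t + sigma(sum of neighbor features)).
     * Here w maps the input dimension d to the propagation dimension p,
     * and p1 is a p-by-p matrix acting on the (already p-dimensional)
     * neighbor sum, as a one-layer stand-in for Eq. (3).
     */
    static double[] update(double[] xv, double[] neighborSum, double[][] w, double[][] p1) {
        double[] sigma = relu(matVec(p1, neighborSum)); // sigma(): Eq. (3), n = 1
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            out[i] = Math.tanh(dot(w[i], xv) + sigma[i]);
        }
        return out;
    }

    static double[] matVec(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++) r[i] = dot(m[i], v);
        return r;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double[] relu(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) r[i] = Math.max(0, v[i]);
        return r;
    }
}
```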

Prediction

After the model is trained, BugPre utilizes VPT-based associated code analysis to pick up the changed module in the current version. For each changed or affected function, BugPre constructs control flow and data flow to enrich AST and extract node features. The graphs with node features are fed into the trained model to perform defect prediction.

Experimental evaluation

In this section, we evaluate BugPre in the following four parts:

Efficiency of filtering invalid code. This experiment was conducted to verify whether AST pruning can remove invalid code and to explore the effect of filtering invalid code on the final defect detection results.

Efficiency of VPT-based associated code analysis. This experiment was performed to verify the effect of VPT-based associated code analysis on changed code generation, with comparisons against existing change code generation methods.

Efficiency of defect detection based on GCNN. By comparison with traditional machine learning methods and deep learning algorithms, the effect of GCNN for defect detection on changed code was verified. In addition, the performance of BugPre on open-source datasets was also evaluated.

Comparison with prior works. By comparing with prior cross-version defect detection approaches, the effectiveness of BugPre was verified.

Experiment environment

For the experimental environment, we use a workstation with a 3.3 GHz Intel Core CPU, an NVIDIA GeForce GTX 1080 GPU, and 128 GB RAM, running Ubuntu 18.04 with Linux kernel 4.15.

Dataset

Dataset 1: We collect data from 8 versions of 4 Apache projects to explore the effect of invalid code on defect detection results. They are all widely used, publicly available Apache projects with high-quality bug annotations, which facilitates analyzing changes between versions according to the commit records. The source code of these projects is readily available on GitHub. The projects are described as follows:

  • Hadoop: It is an open-source distributed computing and storage framework.

  • Lucene: It is a high-performance, full-featured text search engine library.

  • Log4j: It is a Java logging tool.

  • Ant: It is a Java-based build tool.

Dataset 2: We select four Java open-source projects or components for the evaluation of VPT-based associated code analysis: NanoXML, JUnit, HttpCore, and Heritrix. They are all Java tools or components commonly used in practice, for XML parsing, unit testing, HTTP transport, and web crawling, respectively. For each project, we collect data from three releases and perform associated code analysis on the Change Set of two adjacent releases.

Dataset 3: To make the model comparison experiment more representative, we use the public PROMISE [47] repository, which is widely used in prior software defect prediction studies. The PROMISE dataset provides a spreadsheet for each project release that records the attribute information of all Java files; in addition, each file has a defect mark indicating whether the file has defects. We choose the six most used Java projects from the PROMISE repository, with a total of 20 releases, to evaluate detection performance. Table 5 shows the version numbers, the number of Java files, and the defect rates of the projects involved in the experiment.

Table 5 Descriptions of the data sets used in the GCNN evaluation

Dataset 4: To verify the effect of BugPre in a real-world scenario, we collect data on two widely used Apache Java projects from GitHub: Kafka (releases V2.5.0, V2.6.1, and V2.7.0) and Druid (releases V0.19.0, V0.20.0, and V0.22.1). Kafka is a distributed publish/subscribe messaging system, and Druid is a connection pool component for relational databases.

Efficiency of filtering invalid code

In this experiment, our goal is to verify the effect of filtering invalid code through AST pruning. We perform statistics to understand the distribution of invalid change code among all changed code. This experiment involves the four Apache Java open-source projects of Dataset 1. We obtained two adjacent versions of each project and downloaded all the commit information. Then, we use the git diff command to compare the specific changes between commits and define the statistical indicators shown in Table 6 to describe the changed code. Table 7 gives the detailed statistical results according to these indicators. We can see that all four Apache open-source projects contain invalid code changes, of which comments and code irrelevant to logic account for a relatively large proportion. For example, in the change from version 2.1-rc2 to version 2.1-rc3 of Log4j, the changed comments and code irrelevant to logic reached 71 lines, accounting for 7.14% of all changed code.

Table 6 Statistical indicators of software version changes
Table 7 The information on software version change
Fig. 6 The precision and recall of defect detection before and after filtering invalid code

The statistics above verify that some invalid code changes occur during the version change process, and these invalid changed codes cannot be removed by the version control system. Our proposed AST pruning method can effectively remove these ineffective changed codes. To verify the necessity of removing invalid code changes, we use the datasets before and after AST pruning to evaluate defect detection performance. We use the four open-source projects mentioned above for this experiment. We first train a GCNN-based detection model for each project using data from the historical version and then obtain two sets of changed code, with and without AST pruning. Finally, we use the trained model to conduct defect detection on both sets. We use precision (i.e., \(\frac{TP}{TP+FP}\)) and recall (i.e., \(\frac{TP}{TP+FN}\)) to evaluate the impact of invalid code on defect detection. Precision measures how many of the samples detected as defective are actually defective, and high precision means the model achieves better defect prediction performance. In addition, due to the imbalance of defective code data [48], recall is needed to measure how much of the actual defective code is detected. The result is shown in Fig. 6. After removing invalid code, the precision of defect detection does not change much, with an average increase of 2.6%, while the recall is significantly improved, by an average of 17.7%. The experimental results show that the existence of invalid code does interfere with defect detection.

To further explain the effect of invalid code on defect detection, we compare the data sizes of the datasets (i.e., changed code) before and after invalid code filtering for the four projects, as shown in Fig. 7. The results show that AST pruning does reduce the size of the change set, so the changed code generated by the associated code analysis based on the change set is also reduced. The recall rate reflects how many of the real defects appear in the predicted results. The recall rate increases after removing invalid code, indicating that the false-negative rate decreases, that is, the probability that a defective code is missed decreases. Therefore, removing invalid code by AST pruning can effectively improve defect detection performance.

Fig. 7 The size of test data before and after invalid code filtering

Table 8 The size of the actually changed function set (i.e., Actual Set) and the size of the changed function set generated by change impact analysis (i.e., Impact Set) with our proposed VPT-based method, InfluenceCIA, and VM-CIA
Fig. 8 The performance of associated code analysis with three tools

Efficiency of VPT-based associated code analysis

To verify the effect of VPT-based associated code analysis on changed code generation, we compare our method with two CIA methods, InfluenceCIA [49] and VM-CIA [15]. These two tools are similar to our VPT-based approach in that they take the correlations of variables and methods into account for CIA.

In change impact analysis, the size of the impact set is an important measure. If the impact set is too large, the credibility of the result is reduced because it may contain many incorrect association results. Therefore, we first compare the sizes of the changed code generated by the three methods. We use the four Java projects in Dataset 2 for this experiment. For each project, we obtain two cross-version pairs and conduct CIA on them. Specifically, we use the graph matching algorithm to find the change set between adjacent versions of a project and then take the global variables in classes and the parameter variables in methods as the input data for the three CIA methods: VPT-based CIA, InfluenceCIA, and VM-CIA. Table 8 shows the size of the actually changed function set (i.e., Actual Set) and the size of the Impact Set generated by change impact analysis with the different methods. Compared with InfluenceCIA and VM-CIA, our proposed VPT-based associated code analysis method generates a smaller impact set because only the functions actually affected by the Change Set are retained.

To further analyze how many functions in the Impact Set are actually associated with the changed code (i.e., ActualSet), we define two evaluation indicators, AFPrecision and AFRecall (AF is the abbreviation of Associated Functions), as follows.

$$\begin{aligned} AFPrecision= & {} \frac{|ActualSet\cap ImpactSet|}{|ImpactSet|} \end{aligned}$$
(4)
$$\begin{aligned} AFRecall= & {} \frac{|ActualSet\cap ImpactSet|}{|ActualSet|} \end{aligned}$$
(5)

where ActualSet represents the set of functions actually changed between adjacent versions of a project, and ImpactSet represents the set of functions obtained through associated code analysis. A small sketch of these two metrics follows.
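A small sketch of Eqs. (4)–(5) over sets of function names (the sample sets are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class CiaMetrics {
    /** AFPrecision = |Actual ∩ Impact| / |Impact| (Eq. 4). */
    static double afPrecision(Set<String> actual, Set<String> impact) {
        return intersectionSize(actual, impact) / (double) impact.size();
    }

    /** AFRecall = |Actual ∩ Impact| / |Actual| (Eq. 5). */
    static double afRecall(Set<String> actual, Set<String> impact) {
        return intersectionSize(actual, impact) / (double) actual.size();
    }

    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> i = new HashSet<>(a);
        i.retainAll(b);
        return i.size();
    }

    public static void main(String[] args) {
        Set<String> actual = Set.of("m1", "m2", "m3");
        Set<String> impact = Set.of("m2", "m3", "m4", "m5");
        System.out.printf("AFPrecision=%.2f AFRecall=%.2f%n",
                afPrecision(actual, impact), afRecall(actual, impact)); // 0.50 0.67
    }
}
```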

Figure 8 shows the AFPrecision and AFRecall of the three associated code analysis methods, calculated from the statistical results in Table 8. We can see that the VPT-based associated code analysis method has higher precision than the other two methods, while the recall rates of the three methods are similar, around 95%. The overall results show that the VPT-based associated code analysis method outperforms the other two methods: it ensures effective associated code analysis while reducing the size of the changed code for defect detection.

Efficiency of defect detection based on GCNN

To verify the effect of the proposed method in the defect detection process, we compare the performance of GCNN-based BugPre with traditional machine learning and deep learning models commonly used for defect detection. Besides, we also verify the practical defect detection capability of BugPre on open-source projects.

To make the experiment more convincing, we use the six Java projects from the PROMISE repository in Dataset 3. According to the defect marks provided by the PROMISE dataset, each Java file can be easily labeled. The labeled data is then divided into training data and validation data: the training data is input to the model for iterative training, and the validation data is used to tune the model parameters during training. In this experiment, the training and validation data both come from the same historical version of a project and are split 8:2. The testing data is the changed version adjacent to this historical version. For example, version 1.0 of the project Camel is used for model training, and the data of version 1.2 is used as testing data to verify the effect of the model.

For the machine learning comparison, we choose algorithms commonly used in defect detection, namely Logistic Regression (LR) [50], Transfer Naive Bayes (TNB) [51], Support Vector Machine (SVM) [52], and Decision Tree (DT) [41]. The attribute information provided by the PROMISE dataset is used as the machine learning model input, as in much traditional machine learning-based defect detection work. To generate the input required by the graph network model, we crawl the source code of the corresponding project, represent it as a graph, and extract features (cf. Section Graph Construction). We use the historical version of a project to train the model and the adjacent new version for prediction. We use the precision and recall defined in Section Efficiency of Filtering Invalid Code and their harmonic mean (i.e., F1-score) in this evaluation. PofB20 has been employed by some SDP methods [53, 54]; however, it does not apply to our evaluation since BugPre performs defect detection on the changed code between versions, not the entire project.

The prediction performance of the models is shown in Fig. 9. We observe that the GCNN model achieves an average precision of 67.5% over the six projects, which is 5.7%–22.3% higher than the four machine learning methods. The average recall of the GCNN model is 71.8%, approximately 20%–30% higher than the other four methods. The results reveal that the GCNN model has a stronger ability for defect recognition. Since the GCNN model captures the structural information of the graph and better represents the characteristics of the nodes, it achieves better performance than machine learning.

Fig. 9 Comparison results of GCNN and traditional machine learning models

In addition to the comparison with traditional machine learning models, BugPre is also compared with deep learning models widely used in defect detection, including the Deep Belief Network (DBN) [55] and the transfer learning convolutional neural network (TCNN) [56]. Different from traditional machine learning methods that use code attribute information, both DBN and TCNN use AST node information, which is vectorized and then input into the model for training. Specifically, we first parse the source code into an AST, extract all tokens in the AST as a token vector, and then vectorize the token vector into an integer vector by mapping each token to a corresponding integer. Note that, to eliminate the influence of the changed code generation method on defect detection across models, we also use the VPT-based associated code analysis method to generate the final testing data. Figure 10 shows the prediction performance of the deep learning models. Compared with traditional machine learning algorithms, DBN and TCNN have richer characterization capabilities using AST node information; therefore, their precision and recall are slightly improved, with average values above 57% and 59%, respectively. However, lacking the complex structural relationships between code elements, DBN and TCNN perform worse than GCNN.

Fig. 10 Comparison results of GCNN and deep learning models

To further verify the effect of BugPre in a real-world scenario, we perform an experiment using the data of the two open-source projects in Dataset 4. We use adjacent versions of the same project for defect prediction. Figure 11 shows the performance of BugPre compared with DBN and TCNN on the two open-source projects. We can clearly see that BugPre achieves the best performance in terms of precision, recall, and F-measure. This is because BugPre takes full advantage of the semantic structure information in the code when constructing the graph, and thus the GCNN model can learn deeper information about the code.

Fig. 11 The result of defect detection on open source projects

Comparison with prior works

We compare BugPre with three cross-version defect prediction works in Table 9. We choose a unified metric (i.e., F1-score) that is adopted by all works involved in the comparison to evaluate defect detection performance. BugPre achieves an average F1-score of 67.2% on five datasets, outperforming the other three approaches. In addition, we conducted a comparison with a CPDP method named ADCNN [57]; the average F1-score of BugPre is 27% higher than that of ADCNN on the same dataset. Our work adopts a graph convolutional neural network to capture the context dependencies of code and a deeper code representation. The comparison results indicate that our proposed BugPre effectively predicts software defects. Future improvements to BugPre may yield even better results.

Table 9 Comparison of cross-version defect detection approaches based on the five open-source datasets. The F1-score is reported for each approach

Limitation and discussion

Although we have implemented BugPre, there are still some threats to validity that need to be addressed.

  • Data labeling: The current version of BugPre uses keywords such as “fix,” “bug,” and “defect” in commit records to set labels for code. The reason behind this is that, according to the git commit specification, the commit type “fix” represents the modification of a bug. We try to obtain accurately labeled data from high-quality projects, but out-of-spec commits are difficult to handle. We believe more advanced bug tracking approaches [60] can help label data accurately.

  • Expandability: The proposed system is currently applicable to projects written in Java. In our approach, the AST, control flow, and data flow are used to build node relationship graphs, all of which exist in most programming languages. To extend defect detection to projects in other languages, we only need to implement program parsing for the target language and follow the graph construction process in this paper.

  • Insufficient historical data: Insufficient historical data in the early stage of software development is a well-known challenge in cross-version defect detection. For future work, we believe that combining cross-project defect detection can mitigate this limitation.

Conclusion

We presented BugPre, a novel framework for detecting software defects between versions. BugPre leverages the structure of the VPT for similarity analysis to accurately extract version change code. It then constructs multiple node and edge relationships to generate graphs that are fed into the GCNN network. We evaluated BugPre by applying it to Apache projects and commonly used Java tools. BugPre showed good performance on four different datasets (cf. Section Dataset).