1 Introduction

Bug reports are generated when software fails to behave as expected or to follow the technical requirements of the system. Unfortunately, it is costly for developers to manually locate the corresponding buggy source files from a bug report, especially when the software system is large. To reduce maintenance cost, bug localization, which aims to automatically locate the buggy source files according to the bug report, has drawn significant attention in the software mining community, and many models have been proposed (Huo et al. 2020; Poshyvanyk et al. 2007; Ye et al. 2014; Zhou et al. 2012).

Traditional works treat the source code as pure text and locate buggy files by measuring the lexical similarity between the bug report and the source code (Gay et al. 2009; Lukins et al. 2008; Zhou et al. 2012). Recent research indicates that the program structure of source code carries more semantics reflecting the program behavior, which should be exploited in feature learning and is beneficial for bug localization. Such structures include correlations among neighboring statements (Huo et al. 2016), the long-term sequential dependency of source code (Huo and Li 2017), and the abstract syntax tree (AST) of source code (Youm et al. 2017). However, each of these structures captures only part of the program structure. A more essential and widely used representation is the control flow graph (CFG). Recently, Huo et al. (2020) propose the CG-CNN model to exploit more complex program structures such as branches and loops from the CFG to improve bug localization. However, CG-CNN decomposes the CFG into multiple paths and merges all the representations with average-pooling, which breaks the integrity of the graph and fails to consider the inherent correlation among paths. Therefore, to comprehensively exploit the program structure and account for the correlation among execution paths, the integrity of the CFG should not be broken.

A straightforward way to learn features from the entire CFG is to use a graph neural network (GNN). Although GNNs have been widely used in various software mining problems and proven effective in embedding the AST (Allamanis et al. 2018; Mou et al. 2016; Wei and Li 2017; Zhou et al. 2019), they are inappropriate for embedding the CFG. In general, GNN models assume that two adjacent nodes in the graph are related (Grover and Leskovec 2016; Kipf and Welling 2017; Niepert et al. 2016; Veličković et al. 2018), which means that the learned features of adjacent nodes should be closer than those of non-adjacent nodes. GNNs perform well in embedding the AST, since two adjacent nodes in the AST are inherently semantically related and their learned semantic features ought to be similar.

However, this assumption no longer holds in the CFG. Edges in the CFG only represent the successive-execution relationship, and two adjacent nodes in the CFG may be semantically unrelated. Instead, previous statements may affect the semantics of subsequent statements along the execution path, which we call the flowing nature of the CFG. Figure 1 illustrates an example of source code and the corresponding CFG, where each statement in the source code corresponds to one node in the CFG. It can be observed that two adjacent statements (nodes 1 and 2) are unrelated in semantics. On the other hand, although node 6 is far away from nodes 1 and 4, its semantics is jointly determined by them. In general, a statement p can affect the semantics of another statement q only if there exists an execution path (i.e., a walk in the CFG) where p is executed before q. Therefore, instead of aggregating semantics from neighbors as a GNN does, the semantics of statements should flow directionally from the entry node to the exit nodes along execution paths.

Fig. 1
An example of a source code snippet (left) and the corresponding CFG (right). Nodes 1 and 2 are adjacent but semantically unrelated. Nodes 1 and 4 together determine the semantics of node 6, even though they are far apart

In this paper, we claim that the flowing nature of the CFG should be explicitly considered in feature learning, and propose a novel model named cFlow for bug localization. In cFlow, a specially designed flow-based GRU transmits the semantics of statements along the execution paths, so that the program structure represented by the CFG is fully exploited. To further consider the inherent correlation among different paths, the flow-based GRU merges paths when they converge on the same statement. Experimental results on widely used real-world software projects show that cFlow significantly outperforms state-of-the-art bug localization methods, indicating that exploiting the program structure represented by the CFG with respect to its flowing nature is beneficial for improving bug localization.

The contributions of our work are summarized as follows:

  • We claim that the control flow graph has a unique flowing nature: previous statements may affect the semantics of subsequent statements along the execution path, while the semantics of adjacent nodes may be unrelated. This flowing nature is essential and should be explicitly considered in feature learning.

  • We propose a novel model named cFlow for bug localization, in which a specially designed flow-based GRU is employed for feature learning from the CFG. The design of the flow-based GRU explicitly considers the flowing nature and the inherent correlations among paths: the semantics of statements are transmitted along the execution paths, and paths are merged when they converge on the same statement.

The rest of this paper is organized as follows. In Sect. 2, the proposed cFlow model is introduced in detail. Experiments are presented and discussed in Sect. 3, and related work is reviewed in Sect. 4. Sect. 5 concludes the paper and discusses future work.

2 The proposed method

The goal of bug localization is to locate potentially buggy source files that produce the program behaviors described in a given bug report. We formulate the bug localization problem as follows:

Let \({\mathcal {R}} = \{r_1, r_2, \dots , r_{N_r} \}\) denote the set of bug reports received by the developer and \({\mathcal {C}} = \{c_1, c_2, \dots , c_{N_c} \}\) denote the set of source code files of a software project, where \(N_r\) and \(N_c\) denote the number of bug reports and source code files, respectively. The learning task of bug localization aims to learn a prediction function \(f : {\mathcal {R}} \times {\mathcal {C}} \mapsto {\mathcal {Y}}\). Let \(y_{ij} \in {\mathcal {Y}} = \{ +1, -1 \}\) indicate whether a source code file \(c_j \in {\mathcal {C}}\) is related to a bug report \(r_i \in {\mathcal {R}}\), which can be obtained by investigating software commit logs and bug report descriptions. The prediction function f can be learned by minimizing the following objective function:

$$\begin{aligned} \min _f \sum _{i,j} {\mathcal {L}} (f(r_i, c_j),y_{ij}) + \lambda \varOmega (f) \text {,} \end{aligned}$$
(1)

where \({\mathcal {L}}(\cdot ,\cdot )\) is the empirical loss and \(\varOmega (f)\) is a regularization term imposed on the prediction function. The trade-off between \({\mathcal {L}}(\cdot ,\cdot )\) and \(\varOmega (f)\) is balanced by a hyper-parameter \(\lambda\).

2.1 The general framework of cFlow

The learning task of bug localization is instantiated by the proposed cFlow model. cFlow consists of four layers: a source code feature extraction layer, a bug report feature extraction layer, a fusion layer and a prediction layer. We design two independent layers to extract features from bug reports and source files separately, since they are heterogeneous. The general framework of cFlow is shown in Fig. 2.

Fig. 2
The general framework of cFlow. The semantic features of the bug report and the source code are extracted by two separate layers since they are heterogeneous. Two subsequent layers then fuse them into a unified feature and make the final prediction, respectively

The bug report feature extraction layer takes the bug report \(r_i\) as input and extracts the semantic feature \({\mathbf {u}}_i^r\) of it:

$$\begin{aligned} {\mathbf {u}}_i^r = f_{report}(r_i) \text {.} \end{aligned}$$
(2)

Since bug reports are written in natural language, we follow standard natural language preprocessing. We first tokenize the words in the bug report and remove stop words, then each token is embedded with word2vec (Mikolov et al. 2013). Following Kim (2014), we use a 1D-CNN with max-pooling to extract the semantic feature of the bug report.
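To make this pipeline concrete, the following is a minimal PyTorch sketch of such a report encoder, not the exact implementation: it assumes the word2vec lookup has already produced a token-embedding tensor, and the embedding size, filter count, and single window size are illustrative (the full model uses window sizes 3, 4 and 5; see Sect. 3.2).

```python
import torch
import torch.nn as nn

class ReportEncoder(nn.Module):
    """1D-CNN with max-pooling over word2vec embeddings, in the style of Kim (2014)."""
    def __init__(self, embed_dim=100, num_filters=100, window=3):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window)

    def forward(self, tokens):
        # tokens: (batch, seq_len, embed_dim), word2vec embeddings of report tokens
        x = tokens.transpose(1, 2)          # -> (batch, embed_dim, seq_len)
        h = torch.relu(self.conv(x))        # convolve over token positions
        return h.max(dim=2).values          # max-pool -> (batch, num_filters)
```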

The source code feature extraction layer takes the raw data of source code \(c_j\) as input and extracts its semantic feature \({\mathbf {u}}_j^c\), carefully exploiting the program structure:

$$\begin{aligned} {\mathbf {u}}_j^c = f_{code}(c_j) \text {.} \end{aligned}$$
(3)

This layer can be further divided into three sub-layers. The first sub-layer is designed for the initial statement-level feature learning. In the second sub-layer, we design a flow-based GRU to enhance the statement-level feature, where the program structure represented by the CFG is exploited to transmit the semantics of statements along the execution paths. The third sub-layer merges all the enhanced statement-level features into the code-level semantic feature \({\mathbf {u}}_j^c\). This layer is the core of cFlow and its details are introduced in Sect. 2.2.

In the fusion layer, the source code feature \({\mathbf {u}}_j^c\) and the bug report feature \({\mathbf {u}}_i^r\) are taken as input and fused into a unified feature for prediction. We employ a fully connected network to fuse these two language-specific features. Finally, in the prediction layer, another fully connected network predicts whether the source code \(c_j\) is related to the bug report \(r_i\) based on the unified feature:

$$\begin{aligned} {\hat{y}}_{ij} = f_{prediction}( f_{fusion}({\mathbf {u}}_i^r, {\mathbf {u}}_j^c) ) \text {.} \end{aligned}$$
(4)

The empirical loss function used in cFlow is the cross-entropy loss, and \(L_2\) regularization is employed to avoid overfitting.
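A minimal sketch of the fusion and prediction layers under these assumptions (the hidden dimension and the two-class logit output are illustrative choices, not the exact configuration):

```python
import torch
import torch.nn as nn

class FusionAndPrediction(nn.Module):
    """Fully connected fusion of the two features, then a fully connected classifier."""
    def __init__(self, dim_report, dim_code, dim_fused=128):
        super().__init__()
        self.fusion = nn.Sequential(nn.Linear(dim_report + dim_code, dim_fused), nn.ReLU())
        self.predict = nn.Linear(dim_fused, 2)   # logits for related / unrelated

    def forward(self, u_r, u_c):
        fused = self.fusion(torch.cat([u_r, u_c], dim=1))
        return self.predict(fused)

loss_fn = nn.CrossEntropyLoss()   # empirical loss
# L2 regularization can be applied via the optimizer, e.g.
# torch.optim.Adam(model.parameters(), weight_decay=1e-4)
```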

In most cases of bug localization, a reported bug is related to only one or a few source code files, while a large number of source code files are irrelevant. This highly imbalanced nature should be carefully handled. Following previous bug localization work (Huo et al. 2016), several negative instances are randomly dropped, which decreases the computational cost and counteracts the imbalance. Specifically, we randomly select a subset of all negative instances for training and discard the rest; the number of selected negative instances is a hyper-parameter of cFlow. In our implementation, we select as many negative instances as there are positive instances, resulting in a relatively balanced distribution, as sketched below.
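A simple sketch of this undersampling step, assuming each training instance carries a `label` attribute with values +1/-1 (the instance representation is hypothetical):

```python
import random

def undersample(instances):
    """Keep all positive (report, file) pairs; randomly keep as many negatives."""
    positives = [x for x in instances if x.label == +1]
    negatives = [x for x in instances if x.label == -1]
    kept = random.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept
```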

2.2 The source code feature extraction layer

The source code feature extraction layer takes the raw data of source code \(c_j\) as input and aims to learn its semantic feature \({\mathbf {u}}_j^c\). For simplicity of notation, we omit the subscript j in this subsection when there is no ambiguity.

As shown in Fig. 1, each statement in the program corresponds to a node in the CFG. Let \(G = (V, E, {\mathbf {X}})\) denote the CFG of source code c. The matrix \({\mathbf {X}} \in {\mathbb {R}}^{|V| \times d}\) is the statement(node)-level feature matrix, where each node \(v_l \in V\) is represented by a d-dimensional real-valued vector \({\mathbf {x}}_l \in {\mathbb {R}}^d\). Note that G is a directed graph.

The first sub-layer extracts the initial statement-level feature \({\mathbf {X}}^0\) from the raw text of each statement. We tokenize each statement, split each token by camel-case nomenclature, and remove unimportant punctuation such as braces, commas and quotation marks (a preprocessing sketch is given below). Then, word2vec is used to embed each token. After that, a 1D-CNN with d filters is employed to incorporate the semantic meaning of tokens, followed by max-pooling. The filters slide within each statement and stop at the end of the statement, explicitly respecting the semantic atomicity of the statement. In this sub-layer, we focus on the semantic meaning of a single statement, so all statements are processed individually.
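A small sketch of this statement preprocessing, assuming regex-based camel-case splitting; the exact token rules of the implementation may differ:

```python
import re

def tokenize_statement(statement):
    """Drop unimportant punctuation, then split identifiers by camel case."""
    statement = re.sub('["\'{}(),;]', ' ', statement)  # braces, commas, quotes, ...
    tokens = []
    for tok in statement.split():
        # "getItemCount" -> ["get", "Item", "Count"]; all-caps runs kept whole
        tokens.extend(re.findall(r'[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])', tok))
    return tokens
```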

After extracting the initial statement-level feature \({\mathbf {X}}^0\), the second sub-layer further exploits the program structure to enhance the statement-level feature. To explicitly consider the flowing nature of the CFG, we design a flow-based GRU to transmit the semantics of statements from the entry node to the exit nodes along the execution paths. In the rest of this subsection, we assume that G is connected, i.e., there is only one entry node \(v_{entry}\), and \(v_{entry}\) can reach all the other nodes in the CFG; otherwise, the flow-based GRU processes each connected component independently. We begin with three definitions:

Definition 1

(InNode) For an arbitrary node \(p \in V\), the set InNode(p) is defined as \(InNode(p) = \{ q | (q, p) \in E \}\).

Definition 2

(OutNode) For an arbitrary node \(p \in V\), the set OutNode(p) is defined as \(OutNode(p) = \{ q | (p, q) \in E \}\).

Definition 3

(ReachableFrom) For an arbitrary Node \(p \in V\), the set ReachableFrom(p) is defined as:

  • \(q \in InNode(p) \Rightarrow q \in ReachableFrom(p)\) ,

  • \(r \in ReachableFrom(q) \wedge q \in ReachableFrom(p) \Rightarrow r \in ReachableFrom(p)\).

For an arbitrary node p, any node \(q \in ReachableFrom(p)\) indicates that there is an execution path (i.e., a walk in the CFG) in the source code along which the corresponding statement q is executed before statement p, so statement q may affect the semantics of statement p. Therefore, when enhancing the feature of node p, the semantic information of all the nodes in ReachableFrom(p) should be taken into consideration.
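Since ReachableFrom(p) is simply the set of nodes that can reach p, it can be computed by a breadth-first search over predecessor edges. A sketch, assuming a `cfg` object with an `in_nodes` accessor (a hypothetical interface):

```python
from collections import deque

def reachable_from(cfg, p):
    """All nodes q from which p is reachable, via BFS over InNode edges."""
    seen = set()
    queue = deque(cfg.in_nodes(p))
    while queue:
        q = queue.popleft()
        if q not in seen:
            seen.add(q)
            queue.extend(cfg.in_nodes(q))
    return seen
```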

The information transmission process of the flow-based GRU is inspired by the Breadth First Search (BFS) algorithm. The process has T time steps in total, where T is a hyper-parameter discussed later. Let \({\mathbf {z}}_l^t\) denote the hidden state of node \(v_l\), and \(Act^t \subseteq V\) denote the set of activated nodes at time step t. At the beginning, only the entry node \(v_{entry}\) is activated, and all hidden states \({\mathbf {z}}_l^0\) are initialized to 0. At each time step t, each activated node p aggregates the non-zero hidden states of all nodes in InNode(p) (Eq. 5) and updates its node feature \({\mathbf {x}}_p^t\) and hidden state \({\mathbf {z}}_p^t\) with a GRU (Eq. 6). Inactivated nodes keep their node features and hidden states unchanged from the last time step (Eq. 6).

$$\begin{aligned} \widetilde{{\mathbf {z}}}_{p}^t= & {} AVG\left( \left\{ {\mathbf {z}}_q^{t-1} | q \in InNode(p) \wedge {\mathbf {z}}_q^{t-1} \ne {\mathbf {0}} \right\} \right) \text {.} \end{aligned}$$
(5)
$$\begin{aligned} {\mathbf {x}}_p^t, {\mathbf {z}}_p^t= & {} \left\{ \begin{aligned}&\textit{GRU} \left( {\mathbf {x}}_p^{t-1}, \widetilde{{\mathbf {z}}}_p^t \right) ,&p \in Act^t \text {,} \\&{\mathbf {x}}_p^{t-1}, {\mathbf {z}}_p^{t-1},&otherwise \text {.} \end{aligned} \right. \end{aligned}$$
(6)

The OutNode of activated nodes will become activated nodes at the next time step:

$$\begin{aligned} Act^t = \left\{ \begin{aligned}&\{v_{entry}\},&t = 1 \text {,} \\&\bigcup _{p \in Act^{t-1}}OutNode(p),&otherwise \text {.} \end{aligned} \right. \end{aligned}$$
(7)

Intuitively, each hidden state represents an execution path starting from the entry node. It is worth noting that as long as one InNode was activated at the last time step (i.e., a new execution path arrives), all the non-zero hidden states from InNodes (whether or not they were activated at the last time step) will be aggregated to update the node feature and hidden state. If a node has never been activated, its hidden state remains 0 and does not affect the update procedure. The advantage of this design is that we no longer need to determine the length of each execution path, and all possible execution paths are aggregated simultaneously when the last never-activated InNode becomes activated (i.e., when the longest path arrives). Therefore, cFlow can comprehensively consider the inherent correlations among different execution paths.
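The following is a minimal PyTorch sketch of Eqs. 5-7, assuming a `cfg` object with `in_nodes`/`out_nodes` accessors (a hypothetical interface). Following one plausible reading of Eq. 6, the updated node feature and hidden state share the output of a standard GRU cell; for the entry node, whose InNode aggregate is empty, a zero vector is used:

```python
import torch
import torch.nn as nn

def flow_gru(cfg, X0, entry, T, cell):
    """Flow-based GRU sketch (Eqs. 5-7). X0: (|V|, d) initial statement features;
    cfg.in_nodes(p) / cfg.out_nodes(p) return predecessor / successor node ids."""
    n, d = X0.shape
    X = X0.clone()
    Z = torch.zeros(n, d)                   # all hidden states start at 0
    active = {entry}                        # only the entry node is activated
    for _ in range(T):
        new_X, new_Z = X.clone(), Z.clone()
        for p in active:
            # Eq. 5: average the non-zero hidden states of InNode(p)
            zs = [Z[q] for q in cfg.in_nodes(p) if bool(Z[q].abs().sum() > 0)]
            z_in = torch.stack(zs).mean(dim=0) if zs else torch.zeros(d)
            # Eq. 6: updated feature and hidden state share the GRU cell output
            h = cell(X[p].unsqueeze(0), z_in.unsqueeze(0)).squeeze(0)
            new_X[p], new_Z[p] = h, h
        X, Z = new_X, new_Z                 # inactivated nodes stay unchanged
        # Eq. 7: OutNodes of activated nodes are activated at the next step
        active = {q for p in active for q in cfg.out_nodes(p)}
    return X                                # enhanced statement features X^T

# cell = nn.GRUCell(input_size=d, hidden_size=d)
```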

Figure 3 illustrates an example of how the flow-based GRU exploits the program structure to enhance the statement-level feature. The upper left shows the source code and the upper right the corresponding CFG, where each statement corresponds to a node and a directed edge indicates that two statements may be executed consecutively. Only node 3 is activated at \(t = 3\), so the hidden states of InNode(3) are aggregated to update \({\mathbf {x}}_3^3\) and \({\mathbf {z}}_3^3\) according to Eqs. 5 and 6, and the other nodes keep their features and hidden states. At \(t = 4\), OutNode(3) (i.e., nodes 4 and 5) becomes activated, and these nodes update their features and hidden states. However, node 4 has never been activated before \(t=4\), so its hidden state is 0 and node 5 only aggregates the hidden state of node 3. At \(t=5\), node 5 is still activated since it is in OutNode(4) and node 4 was activated at \(t=4\). At this moment, node 5 is able to aggregate hidden states from all of InNode(5), which means the two different execution paths (1, 2, 3, 5) and (1, 2, 3, 4, 5) are combined here and the correlation between them is considered.

Fig. 3
An example of the flow-based GRU. At each time step, activated nodes (red) aggregate hidden states from their InNodes and update node features and hidden states with the GRU (lines with arrows). Their OutNodes become activated at the next time step. Inactivated nodes keep their node features and hidden states (lines without arrows). Note that the hidden states of never-activated nodes (e.g., \({\mathbf {z}}_4^3\)) are not aggregated (Color figure online)

At each time step, the flow-based GRU steps forward only one node along the execution path. Therefore, the maximum number of time steps T determines how far semantic information can be transmitted. To ensure that each node is able to receive semantic information from all of its ReachableFrom nodes, an upper bound on the required T is provided in Proposition 1.

Proposition 1

Given the control flow graph \(G = (V, E, {\mathbf {X}})\) and the entry node s, for all \(p, q \in V\), if \(p \in ReachableFrom(q)\), then there is a walk \(|(s = v_1, v_2, \dots , v_{k_1} = p, v_{k_1+1}, \dots , v_{k_1 + k_2} = q)| \le 2|V|\).

Proof

To prove this proposition, we only need to show that there is a walk \(|(s = v_1, v_2, \dots , v_{k_1} = p)| \le |V|\) and a walk \(|(p = v_1, v_2, \dots , v_{k_2} = q)| \le |V|\). Since G is connected and s is the entry node, it is obvious that \(\forall v \in V \backslash \{s\} , s \in ReachableFrom(v)\); otherwise, the statement would never be executed and could be ignored. Thus, we only need to show that for \(\forall v \in V\), \(w \in ReachableFrom(v)\) implies that there exists a walk \(|(w = v_1, v_2, \dots , v_k = v)| \le |V|\). By Definition 3, w reaches v through a chain of InNode relations, and a shortest such walk visits no node twice, so its length is at most |V|. \(\square\)

Proposition 1 gives a theoretical guarantee that any semantic information from ReachableFrom(p) takes at most 2|V| time steps to be transmitted to node p, where |V| is the number of nodes in the CFG. In other words, when \(T \ge 2|V|\), the flow-based GRU ensures that each node can receive semantic information from all of its ReachableFrom nodes.

We further provide the pseudo code of the flow-based GRU in Algorithm 1. The flow-based GRU takes the CFG, the initial node features and the total number of time steps as input. Lines 1-2 are the initialization, and the update equations for activated and inactivated nodes are given in lines 6-7 and lines 9-10, respectively. Lines 13-17 merge the OutNodes of all activated nodes to generate the set of activated nodes for the next time step. Finally, the flow-based GRU outputs the enhanced node features.

After the statement-level feature has been enhanced in the second sub-layer, the third sub-layer merges all statement-level features into the code-level feature. To preserve the original semantics of statements, we concatenate the initial statement-level features \({\mathbf {X}}^0\) and the enhanced statement-level features \({\mathbf {X}}^T\). Then, a standard GRU with average-pooling is employed to extract the code-level feature:

Algorithm 1 The flow-based GRU
$$\begin{aligned} {\mathbf {u}}^c = \frac{1}{|V|} \sum _{i = 1}^{|V|} \textit{GRU}([{\mathbf {x}}_i^T; {\mathbf {x}}_i^0]) \text {.} \end{aligned}$$
(8)
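A sketch of this readout under the same assumptions, with a standard GRU reading the concatenated statement sequence in order:

```python
import torch
import torch.nn as nn

def code_readout(X0, XT, gru):
    """Eq. 8: concatenate initial and enhanced statement features, run a
    standard GRU over the statement sequence, then average-pool."""
    seq = torch.cat([XT, X0], dim=1).unsqueeze(0)   # (1, |V|, 2d)
    out, _ = gru(seq)                               # per-statement GRU outputs
    return out.mean(dim=1).squeeze(0)               # code-level feature u^c

# gru = nn.GRU(input_size=2 * d, hidden_size=d, batch_first=True)
```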

After being processed by the source code feature extraction layer, the extracted source code feature \({\mathbf {u}}^c\) is fed into the fusion layer together with the bug report feature \({\mathbf {u}}^r\) for further fusion and prediction (see Sect. 2.1).

3 Experiments

We conducted a comparative evaluation of cFlow against state-of-the-art bug localization methods on four widely used real-world software projects. This section summarizes our experimental setup, reports the evaluation results, and briefly analyzes them.

3.1 Data sets

The data sets used in the experiments are extracted from four real-world software projects. All of these projects are open source, and ground-truth correlations between bug reports and source code files can be extracted from the bug tracking system (Bugzilla) and the version control system (Git) (Fischer et al. 2003).

Eclipse Platform defines the set of frameworks and common services that make up the Eclipse infrastructure. The first data set, Platform, is its “UI” component. Plug-in Development Environment is a tool to create and deploy Eclipse plug-ins; we use its “UI” component as the second data set, denoted PDE. Java Development Tools is an Eclipse project supporting the development of Java applications; the third data set, JDT, is its “UI” component. The AspectJ project is an aspect-oriented extension to the Java programming language, which forms the last data set, denoted AspectJ.

The statistics of the four data sets are shown in Table 1. All of them have been widely used in previous bug localization studies (Huo et al. 2016; Lam et al. 2015; Zhou et al. 2012). Specifically, as suggested by Kochhar et al. (2014), we filtered out fully localized bug reports, i.e., those whose descriptions already contain the names of all the corresponding buggy source files. Such bug reports no longer need a machine learning model to locate the buggy source files.

Table 1 Statistics of the four real-world data sets

3.2 Baseline methods and experiment settings

To evaluate the effectiveness of cFlow, we compare it against the following state-of-the-art bug localization methods:

  • Buglocator (Zhou et al. 2012) A classical information retrieval (IR) based bug localization method, which employs a revised vector space model (rVSM) to represent bug reports and source code as vectors from the lexical perspective, and identifies potential buggy files by measuring the lexical similarity between bug reports and source files.

  • LS-CNN (Huo and Li 2017) A state-of-the-art deep learning based bug localization method, which considers only the sequential nature of source code. LS-CNN employs an LSTM network to enhance the statement-level feature without considering more complex program structures such as branches and loops. The network structure of LS-CNN is equivalent to cFlow without the flow-based GRU.

  • CG-CNN (Huo et al. 2020) A state-of-the-art deep learning based bug localization method, which learns semantic features from the CFG of the source code. CG-CNN first exploits the structural information of the CFG to enhance the statement-level feature with the DeepWalk (Perozzi et al. 2014) model. It then decomposes the CFG into different execution paths and averages all the representations without considering the inherent correlations between different paths.

  • cFlow-GAT A variant of cFlow, which utilizes a three-layer graph attention network (Veličković et al. 2018) to directly embed the CFG without considering the flowing nature.

For these baselines, we follow the hyper-parameter settings suggested in their original studies. For the hyper-parameters of cFlow, the window sizes of the convolutional filters are set to 3, 4 and 5 (with 100 filters of each size). Batch normalization is applied after the fusion layer to prevent over-fitting. Adam (Kingma and Ba 2014) is employed to optimize the parameters of cFlow. For each training run, we first divide the training data into a training set and a validation set with a ratio of 8:1. All training hyper-parameters, such as the batch size, the total number of time steps \(T\), and the number of epochs, are then determined by the performance on the validation set. After that, the model is retrained on all the training data with the best hyper-parameters.

3.3 Performance evaluation

We consider four evaluation metrics: Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), AUC and Top-k Rank, which have been widely used in previous bug localization studies (Huo and Li 2017; Huo et al. 2020; Zhou et al. 2012). For each data set, 10-fold cross validation is used and the average performance is reported.
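For reference, here is a self-contained Python sketch of how MAP and MRR are computed from per-report ranked label lists (the list-of-labels representation is ours, for illustration):

```python
def mean_reciprocal_rank(ranked_labels):
    """ranked_labels[i]: labels (1 = buggy) of the files ranked for report i."""
    total = 0.0
    for labels in ranked_labels:
        first = next((k for k, y in enumerate(labels, 1) if y == 1), None)
        total += 1.0 / first if first else 0.0
    return total / len(ranked_labels)

def mean_average_precision(ranked_labels):
    """Average precision at each buggy file, averaged over reports."""
    aps = []
    for labels in ranked_labels:
        hits, precisions = 0, []
        for k, y in enumerate(labels, 1):
            if y == 1:
                hits += 1
                precisions.append(hits / k)
        aps.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(aps) / len(aps)
```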

Table 2 MAP of the compared methods on all data sets
Table 3 MRR of the compared methods on all data sets

The evaluation results in terms of MAP and MRR are listed in Tables 2 and 3, respectively. The performance on a data set is boldfaced only if the model outperforms all other baselines on every fold. It can be observed from Table 2 that cFlow achieves the best MAP among all methods on all four data sets. Moreover, cFlow achieves the best average MAP (0.486), improving over Buglocator (0.297) by 63.6%, LS-CNN (0.436) by 11.6%, CG-CNN (0.466) by 4.3%, and cFlow-GAT (0.440) by 10.4%. The superiority of cFlow is significant.

Similarly, cFlow achieves the best MRR among all methods on all four data sets. cFlow also achieves the best average MRR (0.594), improving over Buglocator (0.352) by 68.7%, LS-CNN (0.536) by 10.7%, CG-CNN (0.571) by 3.9%, and cFlow-GAT (0.540) by 10.0%. This evaluation once again reflects the superiority of cFlow.

The performance in terms of AUC and Top-10 Rank is shown in Figs. 4 and 5, respectively. cFlow outperforms all the baselines on all data sets in terms of both AUC and Top-10 Rank. A higher AUC means that cFlow ranks the related buggy source files higher than the baseline methods, and a higher Top-10 Rank means that cFlow locates more buggy source files when the same number of candidate files is examined.

Fig. 4
AUC of the compared methods on all the data sets. cFlow achieves the best performance among all the baselines on all the data sets

Fig. 5
Top-10 Rank of the compared methods on all the data sets. A higher metric value means better performance. cFlow outperforms all the baselines on all the data sets

Among the compared state-of-the-art bug localization models, Buglocator completely ignores the program structure, and LS-CNN considers only the long-term sequential dependency. cFlow performs better because it exploits more complex program structures such as branches and loops from the CFG of source code, indicating that exploiting the program structure improves the performance of bug localization. Although both cFlow-GAT and CG-CNN exploit the program structure of the CFG for feature learning, cFlow-GAT directly uses a GNN, which ignores the flowing nature of the CFG, and CG-CNN ignores the inherent correlation among different execution paths, resulting in inferior performance. These results indicate that the flowing nature and the correlation among different execution paths are critical to the performance of bug localization.

Figure 6 provides a case study of a bug report and the corresponding source code. Statements that were modified to fix the bug are shown in red boxes. It can be observed from the source code that the variable “items” plays an important role in this bug, and the statements containing “items” are far apart in the CFG. Notably, only cFlow successfully locates the buggy file, since cFlow explicitly considers the flowing nature in feature learning, whereas the baseline models fail to discover the impact of the variable “items” across these distant statements. This case study again shows that explicitly considering the flowing nature is important when learning features from the CFG for bug localization.

Fig. 6
An example of a bug report and the corresponding buggy file. Red boxes mark the statements that were modified to fix the bug

In summary, experimental results on widely-used real-world data sets indicate that cFlow outperforms the state-of-the-art bug localization methods and cFlow-GAT (a variant of cFlow that directly embeds the CFG with GNN) in terms of all the four commonly used metrics, demonstrating that exploiting the program structure from the CFG with respect to the flowing nature is beneficial for improving the performance of bug localization.

3.4 Ablation study

We conducted an ablation study to show that the design of cFlow is effective. The key component of cFlow is the flow-based GRU (i.e., the second sub-layer of the source code feature extraction layer), which is designed to enhance the original statement-level feature by exploiting the program structure represented by the CFG.

To show that the enhanced statement-level feature is beneficial for bug localization, we compare cFlow with LS-CNN. cFlow shares a similar network structure with LS-CNN, except that LS-CNN directly merges the initial statement-level feature without employing the flow-based GRU to enhance it. The results in Tables 2 and 3 and Figs. 4 and 5 show that cFlow outperforms LS-CNN in terms of all metrics (MAP, MRR, AUC and Top-10 Rank) on all four data sets, indicating that the flow-based GRU is beneficial and the enhanced statement-level feature is effective for improving bug localization.

4 Related work

In this section, we summarize the literature related to bug localization and deep software mining models.

4.1 Bug localization

Bug localization aims to automatically locate buggy source files according to the textual description in the bug report, which remains a big challenge in software maintenance. Traditional methods treat the source code as pure text and locate potential buggy source files by measuring the lexical similarity between the bug report and the source code. Lukins et al. (2008) use a Latent Dirichlet Allocation (LDA) model to represent source code and bug reports and locate potential buggy files by measuring their similarities. Zhou et al. (2012) propose a revised Vector Space Model (rVSM), where buggy files corresponding to similar historical bug reports are explored to improve the bug localization result. However, as pointed out by Huo et al. (2016), these traditional approaches ignore the rich program structure in source code. To overcome this, Huo et al. (2016) propose an NP-CNN model to generate high-level semantics by exploiting the local relationships among statements with a 1D-CNN. Huo and Li (2017) further exploit the long-term sequential dependency of source code with an LSTM network. Youm et al. (2017) exploit the abstract syntax tree to classify source code tokens into different categories for further analyzing the method information; however, this work only utilizes the token attribution to build a vector space model and does not fully exploit the tree structure. Recently, Huo et al. (2020) exploit more complex structures such as branches and loops from the CFG of source code, in which DeepWalk is employed to learn the representation of each statement, and then the source code is decomposed into different paths for multi-instance learning. Besides exploiting the structural information in source code, some deep learning methods approach the bug localization problem from other perspectives. Lam et al. (2017) combine an auto-encoder model with the revised vector space model to deal with the lexical mismatch problem in traditional IR-based approaches. Rahman and Roy (2018) classify bug reports into three categories according to their quality and incorporate context-aware query reformulation for bug localization. Huo et al. (2019) propose the TRANP-CNN model for cross-project bug localization to address the problem of insufficient historical data. Zhang et al. (2020) propose the KGBuglocator model, which exploits interrelation information via a code knowledge graph for bug localization.

4.2 Deep software mining models

In recent years, deep learning models have become very popular and have achieved great success in various software mining tasks. White et al. (2015) first identify avenues to use deep learning models for mining software repositories. Li et al. (2019) employ an attention-based network to learn the context-based code representation for improving bug detection. Shi et al. (2019) propose a specific network for code review, where an auto-encoder is employed to learn the revision feature from the original-new source code pair. Alon et al. (2019) design an attention-based neural network to represent arbitrary-sized code snippets as continuous distributed vectors. Zhang and Li (2019) exploit the contest between the plagiarist and the detector for software clone detection. Feng et al. (2020) propose a bimodal pre-trained model, CodeBERT, which achieves state-of-the-art performance on code search and code summarization.

Among all the deep learning models, mining from graph representations of source code has received extensive attention, especially mining from the abstract syntax tree (AST). Mou et al. (2016) propose a tree-based CNN (TBCNN) model to learn features from the AST representation of source code for code functionality classification. Wei and Li (2017) propose a CDLH model for code clone detection, where an AST-based LSTM model is employed for embedding the AST representation of source code. Allamanis et al. (2018) utilize ASTs with additional data flow edges to represent the program for variable-naming and variable-misuse problems, and the feature representation of source code is learned with a gated graph neural network (GGNN) (Li et al. 2015). Zhou et al. (2019) propose a Devign model for vulnerability identification, where the source code is represented as a mixture graph containing AST edges, DFG edges, CFG edges and natural sequence edges, and a GGNN is applied to learn the node representation. Li et al. (2020) propose a DLFix model for automated program repair, where a tree-based RNN encoder-decoder model is employed to learn local contexts.

5 Conclusion and future work

In this paper, we claim that the flowing nature of control flow graph is essential and should be explicitly considered, and propose a novel model named cFlow for bug localization by exploiting the program structure represented by the CFG. cFlow employs a particularly designed flow-based GRU for feature learning from the CFG, where the flowing nature is explicitly considered by transmitting the semantics of statements along the execution paths. Experimental results on widely-used real-world software projects show that cFlow significantly outperforms the state-of-the-art bug localization methods, indicating that exploiting the program structure from the CFG with respect to the flowing nature is beneficial for improving the performance of bug localization.

For future work, it is interesting to exploit the program structure from the control flow graph for other software mining problems like clone detection or code summarization.