1 Introduction

Software testing based on defect patterns (Quinlan et al. 2007) is a source code static analysis technology developed in this century. Owing to their high efficiency and accuracy, various static analysis tools, such as Coverity (Bessey et al. 2010), PREfix (Bush et al. 2000), Defect Testing System (DTS) (Yang et al. 2008) and FindBugs (Ayewah and Pugh 2010), have been widely applied to automatically detect potential source code defects at an early phase of software development.

Table 1 Evaluated projects

Although static analysis tools have proven useful and significant in some domains, several studies (Johnson et al. 2013; Kumar and Nori 2013; Beller et al. 2016; Christakis and Bird 2016) demonstrate that such tools face challenges in practice. One of the most crucial challenges involves false positives, a common problem of software testing based on defect patterns. Because static analysis cannot obtain dynamic execution information, static analysis tools have to speculate on how the program will actually behave (Ruthruff et al. 2008). As a result, a large proportion of the alarms reported by the tools inevitably turn out to be false positives (Dillig et al. 2012). Therefore, manual inspection of the reported alarms is costly yet unavoidable work for developers.

To mitigate the effort of manual inspection, efficient defect identification techniques for handling static analysis alarms have been put forward by numerous studies and summarized in a few literature reviews (Heckman and Williams 2011; Muske and Serebrenik 2016). One promising approach is to design a set of artifact characteristics and use them to classify alarms as actionable or unactionable, to rank alarms by their probability of being true, or to cluster alarms by their similarity. The artifact characteristics are based on semantic, structural or historical information about the source code and the reported alarms.

The objective of our research is to build a binary classification model based on machine learning methods for automatically identifying the reported alarms. The reported alarms are classified either as actionable (true defects) or unactionable (false positives), and the unactionable ones are pruned and not reported to developers, thus reducing the workload of manual inspection. Although numerous machine learning methods have been applied to alarm classification with positive results, no individual learner always achieves the best performance because every machine learning algorithm has its limitations. Therefore, ensemble learning methods, which combine multiple classifiers through a combination strategy, are a better choice for classification tasks. We select 13 base classifiers from five categories built into Weka to build defect identification models; this selection is intended to reflect the diversity of machine learning algorithms and to keep the evaluation fair. Two ensemble learning methods are then introduced to improve the classification performance over the base classifiers.

In this paper, we discuss the potential defects of four open-source C projects (listed in Table 1 with full descriptions) detected by DTS, a tool that catches defects in source code using static testing techniques (Yang et al. 2008). The number of alarms reported for each project, called inspection points (IPs) in DTS, is listed in the fifth column of Table 1. Since the characteristics of different defect patterns are specific to each pattern, the model proposed in this paper targets only null pointer dereference (NPD), which accounts for the majority of the IPs reported by DTS, as shown in the NPD Rate column.

To establish a defect-specific model, we first need to capture the discrepancy between true defects and false positives from the source code related to the reported alarms. Existing approaches have manually designed various features to classify alarms, such as software metrics, file- or module-level source code history and churn features, and alarm-based features. However, these features lack the precision to represent the distinct semantics of alarms, leading to a large number of false positives. For example, Fig. 1 shows a C function str_add_char from spell with two alarms reported by DTS. Alarm NPD1 is determined to be actionable after manual inspection because the dereferenced parameter str may be null, while alarm NPD2 is determined to be unactionable. Under traditional features, the feature vectors of these two alarms are identical, because the two alarms have the same characteristics in terms of if statements, assignment statements, lines of code, etc. However, their manual inspection results are opposite. Therefore, false positives may occur when traditional features are used to classify the reported alarms.

Fig. 1 A motivating example

To bridge the gap between the semantics of reported alarms and the features used for defect identification, this paper proposes a set of novel features at the variable level, named variable characteristics (VCs), which are based on information about the variables that cause defects. Each reported alarm can be transformed into one feature vector of the designed VCs via a mapping function. A predictive model can then be trained with machine learning methods in an ensemble way to automatically identify newly reported alarms as either actionable or unactionable. The contributions of this paper are fourfold:

  • Two ensemble learning methods and 13 base classifiers are selected to build the automated defect identification models in order to mitigate the effort of manual inspection.

  • A set of novel artifact characteristics at variable level, named variable characteristics, are designed to build the proposed model.

  • Experiments are conducted on four open-source C projects to evaluate the performance of our approach in the case of identifying alarms reported from the same project.

  • Variable characteristics are ranked by three different single attribute evaluators to identify the impact of these features on our proposed model.

The rest of this paper is organized as follows. We survey related work in Sect. 2. Sections 3 and 4 describe the variable characteristics and the model building process, respectively. We provide the experimental setup in Sect. 5. Section 6 presents the results of the model evaluation. We conclude the paper and present future work in Sect. 7.

2 Related work

There have been an increasing number of approaches for handling static analysis alarms, and these approaches are categorized in a few literature reviews (Heckman and Williams 2011; Muske and Serebrenik 2016). One promising line of work reduces the inspection effort by designing a set of features from information about the source code and the reported alarms for clustering, ranking and classification tasks.

Clustering Static analysis alarms are clustered by the dependences among them. Since the grouped alarms are dominated by the alarm on which they depend, not only is the number of alarms that need to be inspected reduced, but superfluous inspection effort is also eliminated (Le and Soffa 2010; Lee et al. 2012; Zhang et al. 2013; Podelski et al. 2016; Muske et al. 2018). Podelski et al. (2016) proposed a set of semantic-based features for each alarm, and alarms with the same feature values were grouped. Le and Soffa (2010) constructed a correlation graph by collecting data about the characteristics of fault correlations, and this graph could integrate fault correlations on different paths and among multiple faults. Zhang et al. (2013) presented a sound alarm correlation algorithm based on trace semantics to automate alarm identification. Muske et al. (2018) described a novel technique that reduces alarms by repositioning, which uses control-flow information to group related alarms.

Table 2 Semantic metrics

Ranking One ranking approach is to prioritize the alarms that have a high probability of being true defects, and artifact characteristics are used to compute the likelihood of each alarm being actionable. Jung et al. (2005) used syntactic alarm context as input to a Bayesian analysis to compute the probability of each alarm being true and ranked the alarms by that probability before reporting. Kim and Ernst (2007a, b) put forward a warning prioritization algorithm based on software change history features mined from the source code repository, their underlying intuition being that alarms eliminated by fix-changes are important. Along similar lines, Williams and Hollingsworth (2005) proposed a method that utilizes the source code change history of a software project to drive and refine the search for defects. Addressing a weakness of other ranking schemes, namely that rankings should adapt as reports are inspected, Kremenek et al. (2004) took advantage of correlation behavior among reports and user feedback for alarm ranking. Compared with the clustering approach, this approach still requires inspecting all the ranked alarms.

Classification The classification approach identifies alarms as either actionable or unactionable. The unactionable alarms are not reported to users because they are more likely to be false positives. Ayewah et al. (2007) discussed the kinds of generated alarms and classified them into false positives, trivial bugs and serious bugs. Ruthruff et al. (2008) proposed a logistic regression model based on 33 features extracted from the alarms themselves to predict actionable alarms found by FindBugs, and a screening methodology was used to quickly discard features with low predictive power in order to build cost-effective predictive models. Reynolds et al. (2017) used a set of descriptive attributes to standardize the patterns of false positives. Several studies (Brun and Ernst 2004; Yi et al. 2007; Heckman and Williams 2009; Liang et al. 2010; Yuksel and Sözer 2013; Hanam et al. 2014; Yoon et al. 2014; Flynn et al. 2018) have utilized machine learning classification models to capture the difference between actionable and unactionable alarms for automatic defect identification. Brun and Ernst (2004) presented a machine learning-based technique that builds models of program properties and used the built models to classify program properties that may lead to latent defects. Heckman and Williams (2009) evaluated 15 machine learning algorithms on distinct sets of alarm characteristics drawn from 51 candidate characteristics, which is one of the most comprehensive studies in predicting actionable alarms and achieves high performance. Additionally, they proposed a benchmark in Heckman and Williams (2008) for evaluating and comparing automated defect identification models. Liang et al. (2010) constructed a training set automatically to compute learning weights effectively for different features, and the reported alarms were then ranked and classified by computing scores from these weights. Hanam et al. (2014) put forward a method for differentiating actionable and unactionable alarms by finding similar code patterns based on the code surrounding each static analysis alarm. Flynn et al. (2018) developed and tested four classification models for static analysis alarms mapped to CERT rules, using a novel combination of multiple static analysis tools and 28 features extracted from the alarms.

To the best of our knowledge, no existing research provides a set of artifact characteristics at the variable level. In this paper, alarms detected by DTS are reported with a detailed description, including the variable information of each static analysis alarm. Information from the source code and the reported alarms is extracted to represent variable characteristics, which are described in Sect. 3.

3 Variable characteristics

A growing number of features have been designed in existing studies (Heckman and Williams 2009; Podelski et al. 2016; Hanam et al. 2014; Yuksel and Sözer 2013; Yoon et al. 2014) to classify alarms. In this paper, a set of novel artifact characteristics, called VCs, are designed for each reported alarm based on information about the variables that cause potential defects. These VCs are derived from three sources: data-flow and conditional-predicate information about the related variable in the source code, lines of code (LOC) metrics, and the defect pattern definition in DTS. Details of the VCs are given in the following three subsections.

3.1 Semantic metrics

In our approach, we first utilize the abstract syntax tree (AST) to extract source code semantics. When analyzing source code files, we focus on the data-flow and conditional-predicate statements of the related variable, which have a great impact on potential NPD defects (Wang et al. 2013). Then, we reduce the number of statements to inspect by generating a backwards program slice. A backwards program slice takes the statement containing the related variable as the seed statement and extracts three types of statements that could have affected the outcome of the seed statement as characteristics: assignment statements, reference statements and control-flow statements. In total, we design 17 VCs based on these three types of statements, listed in Table 2 with detailed descriptions.

For example, consider the code in Fig. 2, where the pointer variable pzchat (defined at line 95) causes a potential NPD defect at line 160. Line 160 is used as the seed statement for computing a backwards program slice, which produces the set of statements at lines \(\{95,146,148,160\}\). In the code, the related variable pzchat is defined at line 95, is assigned from a variable qchat->uuconf_pzchat at line 146, occurs in a while statement at line 148, and is referenced by the library function strlen at line 160. The following VCs are then extracted (a small tallying sketch follows the list):

  • AS_V: line 146

  • CS_WHILE: line 148

  • RS_LFUNC: line 160
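As an illustration of how the semantic metrics could be accumulated once the backwards slice is available, the following minimal Java sketch simply tallies the VC labels attached to the sliced statements. It assumes that the slicing and statement labeling of Sect. 3.1 have already been performed elsewhere; the class name and the use of plain string labels are hypothetical conveniences, while the labels themselves (AS_V, CS_WHILE, RS_LFUNC) come from Table 2.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SemanticMetricTally {
    /** Counts how often each semantic VC label occurs in the backwards slice. */
    static Map<String, Integer> tally(List<String> sliceLabels) {
        Map<String, Integer> vc = new HashMap<>();
        for (String label : sliceLabels) {
            vc.merge(label, 1, Integer::sum);
        }
        return vc;
    }

    public static void main(String[] args) {
        // The pzchat example from Fig. 2: sliced statements at lines 146, 148 and 160.
        List<String> slice = List.of("AS_V", "CS_WHILE", "RS_LFUNC");
        System.out.println(tally(slice)); // {AS_V=1, RS_LFUNC=1, CS_WHILE=1}
    }
}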

Fig. 2 An example of variable characteristics extraction

The initial status of the variable after its definition is also considered as a variable characteristic called I_STATE. As an example, consider again the pointer variable pzchat in Fig. 2, which occurs in an assignment statement after it is defined; we therefore regard the initial status of this variable as Assignment. Five initial cases are considered in total and mapped to integers to construct feature vectors as input for the classification model, as given in Table 3.

Table 3 Mapping between initial status cases and integers

3.2 LOC metrics

Heckman and Williams (2009) collected LOC metrics at three levels of granularity: method, file and package. Podelski et al. (2016) utilized LOC metrics to classify bugs. Since LOC metrics can be collected at different levels of granularity, our model collects them at the variable and method levels. The LOC metrics designed in this paper are listed below (a small computation sketch follows the list):

  • IP_LOC: the number of source code lines counted from the definition statement of the related variable to the statement where the variable causes a potential NPD defect. As shown in Fig. 2, the pointer variable pzchat is defined at line 95 and causes a potential NPD defect at line 160; thus, the value of IP_LOC is 66.

  • METHOD_LOC: the number of source code lines within the method containing the variable. As shown in Fig. 2, the pointer variable pzchat is contained in the method fchat, and thus, the value of METHOD_LOC is 214 (counted from line 80 to line 293).
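The two counts are simple inclusive line differences. The sketch below, with hypothetical class and method names, reproduces the Fig. 2 values (definition at line 95, alarm at line 160, method spanning lines 80 to 293) and yields IP_LOC = 66 and METHOD_LOC = 214, matching the text; in practice the line numbers would come from the AST.

final class LocMetrics {
    /** Lines from the variable's definition to the statement causing the alarm (inclusive). */
    static int ipLoc(int definitionLine, int alarmLine) {
        return alarmLine - definitionLine + 1;
    }

    /** Lines of the method that contains the related variable (inclusive). */
    static int methodLoc(int methodStartLine, int methodEndLine) {
        return methodEndLine - methodStartLine + 1;
    }

    public static void main(String[] args) {
        System.out.println(ipLoc(95, 160));      // 66
        System.out.println(methodLoc(80, 293));  // 214
    }
}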

3.3 Defect pattern definition metric

In DTS, NPD is further divided into five categories for analyzing C projects, and the category is used as a variable characteristic called CLASS. A mapping between categories and integers is built to construct feature vectors as input for the classification model, as shown below (a compact enum sketch follows the list):

  • NPD: dereferencing of a local pointer that may be null and mapped into the value of 0.

  • NPD_CHECK: dereferencing of a checked parameter or global pointer that may be null and mapped into the value of 1.

  • NPD_EXP: dereferencing of an expression that may be null and mapped into the value of 2.

  • NPD_PARAM: dereferencing of a functional returned parameter that may be null and mapped into the value of 3.

  • NPD_PRE: dereferencing of a parameter or global pointer that may be null and mapped into the value of 4.
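The CLASS characteristic can be captured compactly as an enumeration. The Java sketch below is only illustrative; the category names and integer codes are taken directly from the list above, while the enum type itself is our own convenience.

enum NpdCategory {
    NPD(0),        // dereferencing of a local pointer that may be null
    NPD_CHECK(1),  // dereferencing of a checked parameter or global pointer
    NPD_EXP(2),    // dereferencing of an expression that may be null
    NPD_PARAM(3),  // dereferencing of a function-returned parameter
    NPD_PRE(4);    // dereferencing of a parameter or global pointer

    final int code; // integer used in the feature vector

    NpdCategory(int code) { this.code = code; }
}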

4 Model building process

Witten et al. (2016) outline a strategy for classification tasks using machine learning, and Fig. 3 illustrates the complete procedure of building a classification model for automatically identifying defects based on that strategy. In the following four subsections, we describe the model building process proposed in this paper.

Fig. 3 Model building process

4.1 DTS’s architecture

DTS is a defect pattern-driven tool. Each defect pattern is defined using a defect pattern state machine (DPSM), which is stored in an XML file. A DPSM can be represented as a triple (S, T, C), in which (an illustrative encoding follows the list):

  • S is a state set and \(S=\{S_\mathrm{start}, S_\mathrm{error}, S_\mathrm{end}, S_\mathrm{other}\}\), where \(S_\mathrm{start}\) denotes the initial state, \(S_\mathrm{error}\) denotes the error state, \(S_\mathrm{end}\) denotes the end state and \(S_\mathrm{other}\) denotes other intermediate states,

  • T is a state transition set, which is defined as \(T:S \times C \rightarrow S\),

  • C is a transition condition set, which denotes the state transition conditions.
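To make the triple concrete, the following hedged Java sketch encodes S as an enum, C as condition strings, and T as a map from (state, condition) to the next state; a defect would be reported when the machine reaches the error state. Everything beyond the four state names given in the text (the condition strings, class and method names) is hypothetical and not DTS's actual implementation.

import java.util.HashMap;
import java.util.Map;

final class Dpsm {
    enum State { START, ERROR, END, OTHER }

    // T : S x C -> S, stored as a nested map keyed by state and condition.
    private final Map<State, Map<String, State>> transitions = new HashMap<>();
    private State current = State.START;

    void addTransition(State from, String condition, State to) {
        transitions.computeIfAbsent(from, s -> new HashMap<>()).put(condition, to);
    }

    /** Feeds one condition to the machine; returns true if the ERROR state is reached. */
    boolean step(String condition) {
        State next = transitions.getOrDefault(current, Map.of()).get(condition);
        if (next != null) {
            current = next;
        }
        return current == State.ERROR;
    }
}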

DTS’s architecture is shown in Fig. 4. First, DTS transforms the source code file into a program model that represents the analyzed code with a set of data structures. As the AST is built for the analyzed code, the tool constructs a symbol table alongside it. The symbol table links each identifier in the analyzed code to its type and a pointer to its declaration or definition.

DTS applies an interval computation technique in control-flow and data-flow analysis, the purpose of which is to compute the state of the DPSM. Defect patterns tell the defect pattern analysis engine how to model the environment and the effects of library and system calls. If a DPSM transitions to an error state, a defect is reported by DTS. Since static analysis tools produce false positives, the IPs reported by DTS are reviewed by our testing team.

Fig. 4 DTS’s architecture

Fig. 5 Feature vector construction

4.2 Data preparation

The four evaluated projects are first analyzed statically by DTS, and the reported IPs are then inspected manually by the developers to determine whether each IP is actionable or unactionable. After that, the information about the variable causing a potential NPD defect is extracted from the reporting log of each IP and parsed against the source code files, as explained in Sect. 3.

4.3 Feature vector construction

As described in Sect. 3, the feature vector is a mapping between each reported IP and the VCs we designed, and the construction process shown in Fig. 3 can be represented by the following mapping function:

$$\begin{aligned} \mathbb {IP} \longmapsto \mathbb {FV} = \left( \mathbb {VC}, R\left( \mathbb {IP}\right) \right) \end{aligned}$$
(1)

where \(\mathbb {IP}\) is the inspection point reported by DTS, \(\mathbb {FV}\) is the mapped feature vector via Eq. (1), \(\mathbb {VC}\) is a list of integer numbers denoting the value of the 21 variable characteristics calculated from \(\mathbb {IP}\), and \(R\left( \mathbb {IP}\right) \) is the manual inspection result of \(\mathbb {IP}\). The identification rule of \(R\left( \mathbb {IP}\right) \) is defined below:

$$\begin{aligned} R(\mathbb {IP})=\left\{ \begin{array}{ll} \mathbf TRUE , &{}\quad \text{ if } \mathbb {IP} \text{ is } \text{ actionable; } \\ \mathbf FALSE , &{}\quad \text{ otherwise. } \\ \end{array} \right. \end{aligned}$$
(2)

Figure 5 demonstrates an example of constructing a feature vector from the code in Fig. 2. The upper left corner of the figure shows the source code lines of the related variable, obtained from the IP’s reporting log in the upper right corner. The ellipsis in the sample feature vector represents the variable characteristics whose value is 0. The detailed mapping process is described in Sect. 3.
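The mapping of Eq. (1) can be sketched as a small routine that concatenates the 21 VC values with the manual inspection label R(IP). The Java snippet below is a minimal, hypothetical sketch: the InspectionPoint fields and the builder name are our own, and in practice the VC values would be produced by the extraction steps of Sect. 3 before being fed to the learners.

final class InspectionPoint {
    int[] vcValues;      // the 21 variable characteristics, in a fixed order
    boolean actionable;  // R(IP): true if the IP is actionable, false otherwise
}

final class FeatureVectorBuilder {
    /** Maps an IP to the vector (VC_1, ..., VC_21, R(IP)) used for model training. */
    static double[] toFeatureVector(InspectionPoint ip) {
        double[] fv = new double[ip.vcValues.length + 1];
        for (int i = 0; i < ip.vcValues.length; i++) {
            fv[i] = ip.vcValues[i];
        }
        fv[fv.length - 1] = ip.actionable ? 1.0 : 0.0; // class label as the last attribute
        return fv;
    }
}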

4.4 Classification and VC ranking

To classify new IPs from the same source code project, we need to keep track of the IPs that have already been classified by our testing team, along with the feature vector of each IP. This first requires a training phase in which our testing team inspects a number of IPs and classifies them manually as actionable or unactionable. We then use the inspected IPs to build and train defect identification models based on machine learning methods that differentiate actionable from unactionable IPs. Finally, the models automatically classify the remaining IPs in the project, thus reducing our burden of manual inspection.
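The within-project workflow just described can be sketched with the Weka API: train on the manually inspected IPs and let the model label the remaining ones. The ARFF file names below are placeholders, and J48 stands in for any of the 15 methods evaluated later; this is a hedged sketch, not the exact pipeline used in our experiments.

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WithinProjectClassification {
    public static void main(String[] args) throws Exception {
        Instances inspected = DataSource.read("uucp-inspected.arff"); // labeled by the testing team
        Instances remaining = DataSource.read("uucp-remaining.arff"); // still to be classified
        inspected.setClassIndex(inspected.numAttributes() - 1);
        remaining.setClassIndex(remaining.numAttributes() - 1);

        Classifier cls = new J48();       // default Weka parameters, as in the paper
        cls.buildClassifier(inspected);

        for (int i = 0; i < remaining.numInstances(); i++) {
            double label = cls.classifyInstance(remaining.instance(i));
            // label indexes the class attribute's values (TRUE / FALSE)
            System.out.println(i + " -> " + remaining.classAttribute().value((int) label));
        }
    }
}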

Moreover, we use machine learning techniques to rank the variable characteristics in order to determine how much these features contribute to the classification performance of our proposed model (Sect. 6.4).

5 Experimental setup

To evaluate our proposed approach, Weka (Witten et al. 2016), open-source software developed by the Machine Learning Group at the University of Waikato in New Zealand, is used for model building. As shown in Fig. 6, the framework consists of three major steps. First, 13 base classifiers built into Weka are selected to build individual classification models. Second, two ensemble learning methods are trained based on the output of the 13 base classifiers. Finally, we evaluate all of the classification models, including the base classifiers and the ensemble learning methods, by carrying out tenfold cross-validation on four open-source projects. We use the default parameters for all classifiers in Weka. Our experiments are run on a 3.7 GHz Intel Core i3-6100 machine with 4 GB RAM.

Fig. 6 Framework of alarms classification using ensemble learning methods

5.1 Base classifiers

Weka contains a series of machine learning algorithms for classification tasks whose output is readily interpretable by developers. We select 13 classification algorithms from five categories built into Weka, which are fully described in Table 4. The selection of these classifiers is based on their popularity and diversity (Ruthruff et al. 2008; Heckman and Williams 2009; Yuksel and Sözer 2013; Hanam et al. 2014; Yoon et al. 2014; Flynn et al. 2018).

Table 4 The selected 13 base classifiers

5.2 Ensemble learning methods

Ensemble learning methods, also called multi-classifier systems (MCSs), use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. A machine learning ensemble first trains each base classifier with an individual machine learning algorithm, and these base classifiers are then integrated into one MCS by a combination strategy. In general, ensemble learning methods perform better than base classifiers (Dietterich 2000). To improve defect identification performance, we choose two different kinds of MCSs for building models in this paper: one combines base classifiers of different types (majority voting), and the other combines base classifiers of the same type (random forest). The two MCSs are described below.

5.2.1 Majority voting

Majority voting (MV) is the most common combination strategy used in ensemble learning. The voting method derives from the hypothesis that the decision of a group is superior to that of an individual. The flowchart of MV is presented in Fig. 7. For the binary classification model proposed in this paper, the ensemble consists of 13 base classifiers \(\{h_1,h_2,\ldots ,h_{13}\}\), and base classifier \(h_i\) predicts a label for one test instance x from the set of class labels \(\{c_1,c_2\}\), where \(c_1\) denotes TRUE and \(c_2\) denotes FALSE. If x is identified as the same class by most base classifiers, x is labeled with this class. Since the number of base classifiers is odd, the two classes cannot receive the same number of votes for x, so x can always be labeled with one certain class. The rule of class identification for majority voting is shown in Eq. (3), where \(h_i^j(x)\) denotes the prediction output of base classifier \(h_i\) on class label \(c_j\).

$$\begin{aligned} H(x)=\left\{ \begin{array}{ll} c_1, &{}\quad \text{ if } \sum _{i=1}^{13}h_i^1(x)>0.5\sum _{j=1}^2\sum _{i=1}^{13}h_i^j(x)\text{; } \\ c_2, &{}\quad \text{ otherwise. } \\ \end{array} \right. \nonumber \\ \end{aligned}$$
(3)
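Equation (3) amounts to counting votes over the two classes. The sketch below mirrors the equation with a manual vote over 13 already-trained Weka classifiers; Weka also provides weka.classifiers.meta.Vote for the same purpose, so this is only an illustrative alternative, and the class index convention (0 = TRUE, 1 = FALSE) is our assumption.

import weka.classifiers.Classifier;
import weka.core.Instance;

final class MajorityVote {
    private final Classifier[] baseClassifiers; // the 13 trained base classifiers

    MajorityVote(Classifier[] baseClassifiers) {
        this.baseClassifiers = baseClassifiers;
    }

    /** Returns the index of the winning class; an odd number of voters rules out ties. */
    double classify(Instance x) throws Exception {
        int[] votes = new int[2];
        for (Classifier h : baseClassifiers) {
            votes[(int) h.classifyInstance(x)]++;
        }
        return votes[0] > votes[1] ? 0.0 : 1.0;
    }
}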
Fig. 7 Flowchart of majority voting

5.2.2 Random forest

Random forest (RF) is an ensemble learning method in which the base classifiers have no strong dependency on one another and can be generated simultaneously and in parallel. Differing from traditional decision tree algorithms, the procedure of building a random forest model randomly selects k attributes at each node of an individual tree in the forest. Breiman (2001) suggested \(k=\log _{2}d\), where d is the size of the attribute set.
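As a small illustration of Breiman's heuristic, the sketch below configures Weka's RandomForest so that k = log2(d) attributes are investigated per split via the -K option; note that our experiments actually keep Weka's default parameters, so this is only a sketch of the heuristic, and the ARFF file name is a placeholder.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("uucp.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int d = data.numAttributes() - 1;  // 21 VCs
        int k = (int) Utils.log2(d);       // Breiman's k = log2(d), here 4

        RandomForest rf = new RandomForest();
        rf.setOptions(new String[] {"-K", String.valueOf(k)}); // attributes tried per split
        rf.buildClassifier(data);
    }
}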

5.3 Experimental design

For within-project defect identification, datasets from the same project are split into a training set and a test set. When building models, we carry out tenfold cross-validation to evaluate the effectiveness of classification. In cross-validation, the dataset is randomly split into ten approximately equal subsets; nine of the subsets are used to train a model and the remaining subset is used to test it. The process is repeated ten times so that each of the ten subsets is tested once. We repeat the tenfold cross-validation 100 times for each model, since randomness is inevitably introduced when splitting datasets (Arcuri and Briand 2011).
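The repeated cross-validation can be sketched with Weka's Evaluation class, varying the random seed across repetitions. The ARFF file name is a placeholder and NaiveBayes stands in for any of the 15 evaluated methods; the snippet is a hedged sketch of the evaluation loop rather than our exact harness.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RepeatedCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("spell.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int repetitions = 100;
        double accuracySum = 0.0;
        for (int run = 0; run < repetitions; run++) {
            Evaluation eval = new Evaluation(data);
            // Tenfold cross-validation with a different shuffling seed per run.
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(run));
            accuracySum += eval.pctCorrect();
        }
        System.out.printf("mean accuracy over %d runs: %.2f%%%n",
                          repetitions, accuracySum / repetitions);
    }
}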

Furthermore, attribute selection in Weka (Witten et al. 2016), which selects a subset of attributes using an attribute evaluator together with a search method, is important for avoiding the loss of classifier performance caused by redundant and irrelevant attributes. In this paper, we are concerned with ranking the variable characteristics to show the merit of each VC for our proposed model. The variable characteristics designed in this paper are evaluated using three single-attribute evaluators of Weka with the Ranker search method (Witten et al. 2016).
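For reference, the sketch below shows how one such ranking could be obtained with Weka's attribute selection API. InfoGainAttributeEval is used here only as an example single-attribute evaluator (the specific evaluators used in our experiments are not fixed by this snippet), and the ARFF file name is a placeholder.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class VcRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("antiword.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        // Each row holds an attribute index and its merit value, in ranked order.
        double[][] ranked = selector.rankedAttributes();
        for (double[] entry : ranked) {
            System.out.printf("%-15s merit = %.4f%n",
                              data.attribute((int) entry[0]).name(), entry[1]);
        }
    }
}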

5.4 Evaluation metrics

To assess the performance of models trained in this paper, the following metrics are adopted to evaluate defect identification techniques.

5.4.1 Accuracy

Accuracy is one of the most common evaluation metrics for classification tasks. For an individual model M built on dataset D, the definition of accuracy is given in Eq. (4), where S is the size of D, and \(M(x_i)\) and \(y_i\) represent the model prediction and the manual inspection result, respectively.

$$\begin{aligned} \hbox {accuracy}(M;D)=\frac{1}{S}\sum _{i=1}^S{\mathbb {I}}(M(x_i)=y_i) \end{aligned}$$
(4)

5.4.2 Kappa statistic

The kappa statistic is a coefficient for consistency testing, that is, for determining whether the model predictions and the actual results are consistent. The kappa coefficient ranges from 0 to 1; when the coefficient is larger than 0.6, the model prediction result is considered reliable.

5.4.3 Indicators derived from confusion matrix

For a binary classification problem, instances can be divided into true positives, false positives, true negatives and false negatives by combining the predicted results of the model with the actual results of the manual inspection; together these counts form the confusion matrix. Three metrics, precision (P), recall (R) and F-measure (F1), are derived from the confusion matrix. They are defined as follows:

$$\begin{aligned}&P =\frac{\hbox {true}\, \hbox {positive}}{\hbox {true} \, \hbox {positive}\,+\,\hbox {false} \, \hbox {positive}} \end{aligned}$$
(5)
$$\begin{aligned}&R =\frac{\hbox {true} \, \hbox {positive}}{\hbox {true} \, \hbox {positive}\,+\,\hbox {false} \, \hbox {negative}} \end{aligned}$$
(6)
$$\begin{aligned}&F1 =\frac{2*P*R}{P+R} \end{aligned}$$
(7)

Precision and recall are a pair of contradictory metrics. In general, higher precision means that a larger fraction of the alarms classified as actionable are true defects, while higher recall means that more of the true defects are revealed. The F-measure is the harmonic mean that takes both precision and recall into consideration.

5.4.4 ROC curve and area under ROC curve

The receiver operating characteristic (ROC) curve, widely used in machine learning, is an effective tool for visualizing the generalization performance of classifiers. However, the ROC curve has a drawback: when the curves of two classifiers intersect, it is hard to judge which classifier is superior. Thus, the area under the ROC curve (AUC), which sums the area of each part under the ROC curve, is introduced as an evaluation metric. Assuming that the ROC curve is connected by a series of points \(\{(x_1,y_1),(x_2,y_2),\ldots ,(x_n,y_n)\}\) in sequence, AUC can be calculated by the trapezoidal rule shown in Eq. (8):

$$\begin{aligned} AUC=\frac{1}{2}\sum _{i=1}^{n-1}(x_{i+1}-x_i)\cdot (y_{i}+y_{i+1}) \end{aligned}$$
(8)
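A direct implementation of this trapezoidal sum is short; the sketch below assumes the ROC points are sorted by increasing false positive rate x, and the helper name is our own.

final class RocAuc {
    static double auc(double[] x, double[] y) {
        double area = 0.0;
        for (int i = 0; i < x.length - 1; i++) {
            area += 0.5 * (x[i + 1] - x[i]) * (y[i] + y[i + 1]); // trapezoid between consecutive points
        }
        return area;
    }

    public static void main(String[] args) {
        // A perfect classifier: (0,0) -> (0,1) -> (1,1) gives AUC = 1.
        System.out.println(auc(new double[] {0, 0, 1}, new double[] {0, 1, 1}));
    }
}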

6 Experimental results and analysis

6.1 Ground truth

To evaluate our proposed models, we first need to classify the reported NPD IPs as actionable or unactionable accurately. As described in Sect. 4.4, we inspect the reported NPD IPs manually and classify them into the two classes. The result of the manual inspection is given in Table 5. The TRUE column indicates the number of NPD IPs that are classified as actionable, and the number of unactionable NPD IPs is listed in the FALSE column.

Table 5 The manual inspection result on evaluated projects
Table 6 Accuracy and kappa statistic results on evaluated projects
Table 7 Weighted precision, recall and F-measure results on evaluated projects

We classify 890 NPD IPs in total and use these IPs in our experiments: 582 of them are actionable and 308 are unactionable. The machine learning methods mentioned in Sects. 5.1 and 5.2 are used to classify these IPs, and we repeat the experiment designed in Sect. 5.3 100 times for each classifier built for each project so as to evaluate the performance of the models based on our designed VCs. Furthermore, for every evaluation metric in Sect. 5.4 except accuracy and kappa statistic, we calculate a weighted average across the two classes (TRUE and FALSE), weighted by the number of NPD IPs in each class. The weighted average (WA) is given in Eq. (9), where \([M]_T\) denotes the metric value of precision, recall, F-measure or AUC for class TRUE, \([M]_F\) denotes the corresponding metric value for class FALSE, \(A_\mathrm{IP}\) denotes the number of actionable NPD IPs, and \(U_\mathrm{IP}\) denotes the number of unactionable NPD IPs.

$$\begin{aligned} WA=\frac{[M]_{T}*A_\mathrm{IP}+[M]_{F}*U_\mathrm{IP}}{A_\mathrm{IP}+U_\mathrm{IP}} \end{aligned}$$
(9)
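As a tiny worked sketch of Eq. (9) using the overall counts above (582 actionable and 308 unactionable NPD IPs): the per-class metric values in the example call are made-up placeholders, not results from the paper.

final class WeightedAverage {
    static double weighted(double metricTrue, double metricFalse, int actionable, int unactionable) {
        return (metricTrue * actionable + metricFalse * unactionable)
                / (double) (actionable + unactionable);
    }

    public static void main(String[] args) {
        // e.g. a precision of 0.90 on class TRUE and 0.80 on class FALSE (illustrative values)
        System.out.println(weighted(0.90, 0.80, 582, 308)); // ~0.8654
    }
}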
Fig. 8 ROC curves on evaluated projects: a antiword, b spell, c sphinxbase, d uucp

6.2 Evaluation metrics analysis

6.2.1 Accuracy and kappa statistic

The results for accuracy and kappa statistic are listed in Table 6, where the best accuracy and kappa statistic on each evaluated project are highlighted in bold. Each column represents a machine learning method applied to the project (by row). The machine learning methods are divided by double vertical lines according to their categories, as described in Sects. 5.1 and 5.2. Overall, the average accuracy and kappa statistic (across all projects and classifiers) are 83.36% and 0.5986, respectively. According to the statistics in Table 6, half of the models achieve an accuracy above 85% and 60% of the models have a kappa statistic above 0.6, indicating that our machine learning-based approach is precise and credible enough to provide reliable predictions. As shown in Table 6, the best model in terms of accuracy and kappa statistic differs across the four projects. The best model for antiword is SMO; SMO is a support vector machine classifier with an advantage on small datasets, and thus it performs better than the other machine learning methods for antiword. PART, a rule-based learner, is the best model for spell, and RF is the best model for both sphinxbase and uucp. Furthermore, the two MCSs improve the accuracy and kappa statistic over the 13 base classifiers in most cases.

6.2.2 Weighted precision, recall and F-measure

Table 7 shows the weighted precision, recall and F-measure of the models, calculated as described in Eq. (9). The values highlighted in bold are the highest precision, recall and F-measure on each evaluated project. In general, the average weighted precision (across all projects and classifiers) is 0.8367, the average weighted recall is 0.8336, and the average weighted F-measure is 0.8349. The statistics in Table 7 show that nearly half of the models exceed 0.85 on all three indicators (weighted precision, recall and F-measure), indicating the high performance of our proposed approach. According to Table 7, the best model for each evaluated project is the same as in the analysis of accuracy and kappa statistic. In most instances, the performance of the two MCSs surpasses that of the 13 base classifiers.

6.2.3 ROC curve and weighted AUC

ROC curves of the different classifiers for each project are shown in Fig. 8, and Table 8 presents the weighted AUC, where the best weighted AUC of the classifiers built on each project is highlighted in bold. Overall, our proposed approach achieves an average weighted AUC of 0.8273 (across all projects and classifiers). Based on the statistics in Table 8, one-third of the models obtain a weighted AUC above 0.85. Among all machine learning methods, RF delivers the best performance for all of the evaluated projects. However, the weighted AUC of MV is inferior to that of the base classifiers in most cases, so we cannot draw a firm conclusion about the AUC performance of MV relative to the base classifiers. Generally speaking, the two ensemble learning methods perform well, especially RF.

Table 8 Weighted AUC result on evaluated projects

6.3 Baseline comparison analysis

6.3.1 A baseline for comparison

To evaluate the performance of the semantic metrics proposed in Sect. 3.1 for defect identification, we compare the semantic metrics with traditional metrics. Our baseline of traditional metrics consists of three VCs: IP_LOC, METHOD_LOC and CLASS (described in Sects. 3.2 and 3.3).

6.3.2 Evaluation setup

The project for the baseline comparison is uucp, which is large enough to contain many IPs. We select six of the 15 machine learning methods (J48, NaiveBayes, KStar, PART, SimpleLogistic and RF) to run the experiment designed in Sect. 5.3, because of their large variance and high evaluation performance according to our analysis in Sect. 6.2. The six metrics described in Sect. 5.4 are used to evaluate the baseline comparison.

6.3.3 Result analysis

As shown in Table 9, each column represents a machine learning method applied to the project uucp, and the evaluation results for each metric are listed by row. The machine learning methods are divided by double vertical lines according to their categories. The Average column shows a significant improvement on all six evaluation metrics from the baseline to the approach proposed in this paper. In all cases, our approach is more accurate than the baseline; that is, fewer IPs are classified incorrectly, which shows that our models minimize the number of unactionable IPs (false positives) developers need to inspect. Since the reported IPs classified as unactionable are pruned to reduce the workload of manual inspection, the increase in precision, recall and F-measure indicates that our models can reveal more true defects as well as lower the false negative rate. Overall, our proposed model with semantic metrics at the variable level improves defect identification performance.

Table 9 The comparison of evaluation metrics results on uucp
Table 10 The top 5 VC ranking

6.4 VC ranking analysis

A single-attribute evaluator with Ranker evaluates each VC individually and returns an ordered list of VCs with merit values (Witten et al. 2016). Based on the merit values calculated by each evaluator for each project, the top 5 of the 21 designed VCs for each case are presented in Table 10. Among the ranked VCs in Table 10, the merit values range from 0.0356 to 0.6829. Overall, 16 of the 21 designed VCs are ranked in the top 5 at least once, which demonstrates that over three quarters of the designed VCs contribute to the classification performance of our proposed model.

Based on the statistics from Table 10, Fig. 9 presents, in different colors, the number of times each VC is ranked in the top 5 by one of the three single-attribute evaluators for each project. METHOD_LOC, I_STATE and AS_V appear to be the most relevant variable characteristics for classification. Moreover, METHOD_LOC appears for every project at least once, which implies that the number of statements within the method containing the alarm is predictive of the actionability of the alarm. Additionally, five variable characteristics, namely AS_LFUNC, AS_IF, AS_FOR, AS_WHILE and CS_WHILE, appear for none of the four evaluated projects and may therefore be less important to our proposed model.

Fig. 9 Variable characteristics ranking results (ordered)

6.5 Threats to validity

The three main threats to validity in this paper, namely external validity, internal validity and construct validity, are discussed as follows.

6.5.1 External validity

The principal threat to external validity is that the evaluated projects in this paper may not generalize to all software projects. As a result, projects other than the four evaluated ones might yield better or worse performance with our approach. Running our proposed model on additional projects would mitigate this threat. Moreover, since our model is only evaluated on open-source C projects, its performance on projects written in other programming languages is unknown.

6.5.2 Internal validity

In our paper, dataset preparation is the main concern for internal validity. As described in Sect. 4, the evaluated projects are analyzed by DTS and the reported IPs are inspected manually. Oversights during manual inspection could invalidate some of the model results. Multiple examinations by different developers mitigate this threat to internal validity.

6.5.3 Construct validity

Our proposed model may not generalize because the variable characteristics designed in this paper are specific to the NPD defect pattern, which may not be representative, resulting in a threat to construct validity. However, we also consider various shared features, including the LOC metrics applied in a multitude of existing research papers, which can help improve the generalization performance of our proposed model.

7 Conclusion and future work

To mitigate the effort of manual inspection, this paper presents a machine learning-based model for automatically identifying null pointer dereference (NPD) defects using a set of novel and more fine-grained features. Specifically, these features, called variable characteristics (VCs), are extracted from the source code related to the variable leading to a potential NPD defect analyzed by the defect testing system (DTS), as well as from the DTS reporting log. The designed VCs are then leveraged to build models for classifying the reported alarms as actionable or unactionable. Since the unactionable alarms are pruned and only the actionable alarms need to be inspected, the workload of manual inspection is reduced.

Our evaluation on four open-source C projects shows that the proposed variable-level models are promising and can be a useful approach for automatically classifying static analysis alarms. We also perform a baseline comparison between semantic metrics and traditional metrics, evaluating the effectiveness of our proposed model with semantic metrics for defect identification. Thus, our proposed approach can be applied for automated defect identification when new alarms are reported by static analysis tools. Additionally, we use single-attribute evaluators with Ranker in Weka to rank the relevance of each VC individually and return an ordered list of ranked VCs.

In the future, we will extend our automated defect identification approach to more defect patterns. Moreover, designing a set of artifact characteristics shared by the majority of defect patterns is considered to be of great significance for our work. Although not addressed in this paper, a major challenge remains in raising the accuracy of cross-project defect identification, that is, using a given project to train a model that identifies defects in another project without manual inspection. We plan to combine our model with transfer learning methods to automatically identify defects across projects, which is promising for increasing the accuracy of cross-project defect identification.