Which process metrics can significantly improve defect prediction models? An empirical study

Madeyski, Lech; Jureczko, Marian

doi:10.1007/s11219-014-9241-7

Open access
Published: 17 June 2014

Volume 23, pages 393–422, (2015)
Software Quality Journal

Lech Madeyski¹ &
Marian Jureczko¹

Abstract

Knowledge of the software metrics that serve as defect indicators is vital for the efficient allocation of quality assurance resources. Process metrics, although sometimes difficult to collect, have recently become popular in defect prediction. However, in order to correctly identify the process metrics that are actually worth collecting, we need evidence validating their ability to improve product metric-based defect prediction models. This paper presents an empirical evaluation in which several process metrics were investigated in order to identify the ones which significantly improve defect prediction models based on product metrics. Data from a wide range of software projects (both industrial and open source) were collected. The predictions of the models that use only product metrics (simple models) were compared with the predictions of the models which used product metrics as well as one of the process metrics under scrutiny (advanced models). To decide whether the improvements were significant, statistical tests were performed and effect sizes were calculated. The advanced defect prediction models trained on a data set containing product metrics and, additionally, Number of Distinct Committers (NDC) were significantly better than the simple models without NDC, with a medium effect size and a high probability of superiority (PS) of the advanced models over the simple ones ($p=.016$, $r=-.29$, $\hbox {PS}=.76$), which is a substantial finding useful in defect prediction. A similar result with slightly smaller PS was achieved by the advanced models trained on a data set containing product metrics and, additionally, all of the investigated process metrics ($p=.038$, $r=-.29$, $\hbox {PS}=.68$). The advanced models trained on a data set containing product metrics and, additionally, Number of Modified Lines (NML) were significantly better than the simple models without NML, but the effect size was small ($p=.038$, $r=.06$). Hence, it is reasonable to recommend the NDC process metric in building defect prediction models.

1 Introduction

Software development companies are seeking ways to improve the quality of software systems without allocating too many resources to quality assurance activities such as testing. Applying the same testing effort to all modules of a software system is not an optimal approach, since the distribution of defects among individual parts of a system is not uniform. According to a Pareto-Zipf-type law (Boehm and Papaccio 1988; Denaro and Pezzè 2002; Endres and Rombach 2003), the 80:20 empirical rule operates here, i.e., a small amount of code (often quantified as 20 % of the code) is responsible for the majority of software faults (often quantified as 80 % of the faults). Therefore, it is possible to test only a small part of a software system and still find most of the defects. Defect prediction models, in turn, may be used to find the defect-prone classes. Hence, the quality assurance efforts should be focused (except in critical projects) on the most defect-prone classes in order to save valuable time and financial resources and, at the same time, to increase the quality of delivered software products.
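The 80:20 rule mentioned above can be checked against per-class defect counts. The following is a minimal sketch, not part of the study itself: the function name `share_of_classes_for_defects` and the example counts are hypothetical and serve only to illustrate how the share of classes that accounts for a given share of defects could be computed.

```python
def share_of_classes_for_defects(defects_per_class, target_share=0.8):
    """Return the smallest fraction of classes that together contain at least
    target_share of all defects, taking the most defective classes first."""
    total = sum(defects_per_class)
    if total == 0:
        return 0.0  # no defects at all, so no class needs to be inspected
    needed = target_share * total
    covered = 0
    ranked = sorted(defects_per_class, reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered >= needed:
            return rank / len(defects_per_class)
    return 1.0


# Hypothetical defect counts for ten classes (illustrative data only).
example_counts = [13, 8, 1, 1, 1, 1, 0, 0, 0, 0]
example_share = share_of_classes_for_defects(example_counts)
print(example_share)
```

With these invented counts, the two most defective classes out of ten already hold 21 of the 25 defects, so the function reports a share of 0.2.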

The defect prediction models built on the basis of product metrics are already well known (Basili et al. 1996; Denaro and Pezzè 2002; Gyimothy et al. 2005; Tang et al. 1999); however, process metrics have also recently become popular^{Footnote 1}. Fenton was not only among the first to criticize the product metric-based approach (Fenton and Ohlsson 2000), but also the one who suggested a model based only on the project and the process metrics (Fenton et al. 2007). There are also other studies in which the process metrics are investigated (Illes-Seifert and Paech 2010; Schröter et al. 2006), as well as used in the model (Graves et al. 2000; Weyuker et al. 2008, 2010). Nevertheless, there are no conclusive results. Usually, only the correlations between some process metrics and the defect count are investigated, e.g. (Illes-Seifert and Paech 2010; Schröter et al. 2006). When defect prediction models are built, they are either not compared with a product-based approach (e.g., Bell et al. 2006; Hassan 2009; Ostrand et al. 2005; Weyuker et al. 2006, 2007), or built on a small sample (e.g., Graves et al. 2000; Moser et al. 2008), or not accompanied by the statistical tests and effect size calculations needed to conclude whether the improvements obtained through adding the process metrics were of both statistical and practical significance, even when the improvements were impressive (e.g., Nagappan et al. 2008). Effect size is an index that quantifies the degree of practical significance of study results, i.e., the degree to which the study results should be considered important or negligible, regardless of the size of the study sample. Related work is discussed in detail in Sect. 3.

This paper presents the results of an empirical study exploring the relationship between the process metrics and the number of defects. For that purpose, the correlations between particular process metrics and the number of defects were calculated. Subsequently, simple defect prediction models were built on the basis of the product metrics. Starting from those simple models, we built advanced defect prediction models by additionally introducing one of the process metrics at a time. As a result, we were able to compare the simple and the advanced models and answer the question of whether the introduction of the selected process metric improved the adequacy of the predictions. Statistical methods were used to evaluate the significance of that improvement. The approach used in this study can be easily put into practice, which is its distinct advantage. Moreover, no sophisticated methods were used to build the prediction models, only ordinary stepwise linear regression. Even though stepwise linear regression methods are probably neither the best nor the most effective for this purpose, they are widely known and, therefore, reduce the learning effort.

The derivation of the baseline model, as well as the experiments presented in this paper, are intended to reflect the industrial reality. Since the product metrics have a very long history (e.g., McCabe 1976), they enjoy good tool support (e.g., the Ckjm tool used in this study) and are well understood by practitioners. We may assume that there are companies interested in defect prediction which have already launched a metric program and collect the product metrics. The assumption is plausible, as such companies are already known to the authors of this paper. A hypothetical company as described above is using product metrics for the aforementioned reasons (mainly tool support). Unfortunately, the prediction results are often unsatisfactory; therefore, new metrics may be employed in order to improve the prediction. The process metrics can be particularly useful, since they reflect attributes different from those associated with the product metrics, namely the product history, which is (hopefully) an extra source of information. Nevertheless, it is still not obvious what the company should do, as a number of process metrics are being investigated with regard to defect prediction. Furthermore, the results are sometimes contradictory (see Sect. 3 for details). Moreover, the tool support for the process metrics is far from perfect, e.g., for the sake of this study, the authors had to develop their own solution to calculate these metrics. Bearing in mind that hypothetical situation in an industrial environment and relying on their direct and indirect experience, the authors chose as the main objective of this study to provide assistance in making key decisions regarding which metric (or metrics) should be chosen and added to the metric program in order to improve the predictions without wasting financial resources on checking all the possibilities.
Therefore, we have analyzed which of the frequently used process metrics can significantly improve defect prediction, on the basis of a wide range of software projects from different environments. The construction of the models made use solely of data which were historically older than the data used in prediction (model evaluation). For example, the model built on the data from release $i$ was used to make predictions in release $i+1$. The data from the $i$th release are usually (or at least can be) available during the development of the $(i+1)$th release. Hopefully, on the basis of the empirical evaluations presented in this paper, development teams may make informed decisions (at least to some extent, as the number of analyzed projects, although large, is not infinite) about the process metrics which may be worth collecting in order to improve the defect prediction models based on product metrics. Additionally, the framework of the empirical evaluation of the models presented in this paper can be reused in different environments to evaluate new kinds of metrics and to improve the defect prediction models even further.
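The evaluation scheme described above (train on release $i$, predict for release $i+1$, and compare a simple model with advanced models that add one process metric at a time) can be outlined as follows. This is only a sketch of the experimental layout under stated assumptions: the helper names `consecutive_release_pairs` and `model_feature_sets` are hypothetical, the product metric names are a small illustrative subset, and the model fitting itself (stepwise linear regression in the study) is deliberately left out.

```python
def consecutive_release_pairs(releases):
    """Pair each release with its successor: (training release, evaluation release)."""
    return list(zip(releases, releases[1:]))


def model_feature_sets(product_metrics, process_metrics):
    """Return the feature list of the simple model and, for each process metric,
    the feature list of the advanced model that adds that single metric."""
    simple = list(product_metrics)
    advanced = {metric: simple + [metric] for metric in process_metrics}
    return simple, advanced


# Illustrative subset of product metrics (assumption: the study uses many more).
PRODUCT_METRICS = ["WMC", "CBO", "LCOM3", "LOC"]
PROCESS_METRICS = ["NR", "NDC", "NML", "NDPV"]

# Release numbers of Ant as listed among the analyzed projects.
ant_releases = ["1.3", "1.4", "1.5", "1.6", "1.7"]

simple, advanced = model_feature_sets(PRODUCT_METRICS, PROCESS_METRICS)
for train, test in consecutive_release_pairs(ant_releases):
    for metric, features in advanced.items():
        # A real experiment would fit both models on `train` and score them on `test`.
        print(f"train={train} test={test} simple={simple} advanced(+{metric})={features}")
```

Keeping the pairing and the feature sets separate from the fitting step makes it straightforward to plug in another modeling technique or another candidate metric.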

This paper is organized as follows: The investigated product and process metrics, the tools employed for data collection and the investigated software projects are described in Sect. 2. Related empirical studies concerning the process metrics are presented in Sect. 3. Section 4 contains the detailed description of our empirical investigation aimed at identifying the process metrics which may significantly improve the defect prediction models based on the product metrics. The obtained results are reported in Sect. 5, while threats to validity are discussed in Sect. 6. The discussion of results in Sect. 7 is followed by the conclusions and contributions in Sect. 8.

2 Data collection

This section describes all the investigated product and process metrics (Sect. 2.1), the tools used to compute these metrics (Sect. 2.2) and the investigated software projects (Sect. 2.3).

2.1 Studied metrics

The investigation entailed two types of metrics: the product metrics, which describe the size and design complexity of software, served as the basis and the point of departure, whereas the process metrics were treated as the primary object of this study. The product metrics were used to build simple defect prediction models, while the product metrics, together with the selected process metrics (one at a time), were used to build the advanced models. Subsequently, both models were compared in order to determine whether the selected process metrics improve the prediction efficiency. The classification of the product and the process metrics was thoroughly discussed in (Henderson-Sellers 1996).

2.1.1 Product metrics

The following metrics have been used in this study:

The metrics suite suggested by Chidamber and Kemerer (1994).
Lack of Cohesion in Methods (LCOM3) suggested by Henderson-Sellers (1996).
The QMOOD metrics suite suggested by Bansiya and Davis (2002).
The quality oriented extension to Chidamber and Kemerer metrics suite suggested by Tang et al. (1999).
Coupling metrics suggested by Martin (1994).
Class level metrics built on the basis of McCabe’s (1976) complexity metric.
Lines of Code (LOC).

A separate report by Jureczko and Madeyski (2011c), available online, presents definitions of the aforementioned metrics.

2.1.2 Process metrics

Considerable research has been performed on identifying the process metrics which influence the efficiency of defect prediction. Among them, the most widely used are the metrics similar to NR, NDC, NML and NDPV (cf. Sect. 3):

Number of Revisions (NR). The NR metric constitutes the number of revisions (retrieved from a main line of development in a version control system, e.g., trunk in SVN) of a given Java class during development of the investigated release of a software system. The metric (although using different names) has already been used by several researchers (Graves et al. 2000; Illes-Seifert and Paech 2010; Moser et al. 2008; Nagappan and Ball 2007; Nagappan et al. 2010; Ostrand and Weyuker 2002; Ostrand et al. 2004; Ratzinger et al. 2007; Schröter et al. 2006; Shihab et al. 2010; Weyuker et al. 2006, 2007, 2008).
Number of Distinct Committers (NDC). The NDC metric returns the number of distinct authors who committed their changes in a given Java class during the development of the investigated release of a software system. The metric has already been used or analyzed by researchers (Bell et al. 2006; Weyuker et al. 2007, 2008, 2010; Graves et al. 2000; Illes-Seifert and Paech 2010; Matsumoto et al. 2010; Moser et al. 2008; Nagappan et al. 2008, 2010; Ratzinger et al. 2007; Schröter et al. 2006; Zimmermann et al. 2009).
Number of Modified Lines (NML). The NML metric calculates the sum of all lines of source code which were added or removed in a given Java class. Each of the committed revisions during the development of the investigated release of a software system is taken into account. According to the CVS version-control system, a modification in a given line of source code is equivalent to removing the old version and subsequently adding a new version of the line. Similar metrics have already been used or analyzed by various researchers (Graves et al. 2000; Hassan 2009; Purushothaman and Perry 2005; Layman et al. 2008; Moser et al. 2008; Nagappan and Ball 2005, 2007; Nagappan et al. 2008, 2010; Ratzinger et al. 2007; Śliwerski et al. 2005; Zimmermann et al. 2009).
Number of Defects in Previous Version (NDPV). The NDPV metric returns the number of defects repaired in a given class during the development of the previous release of a software system. Similar metrics have already been investigated by a number of researchers (Arisholm and Briand 2006; Hassan 2009; Ostrand et al. 2005; Weyuker et al. 2006, 2008; Graves et al. 2000; Gyimothy et al. 2005; Illes-Seifert and Paech 2010; Kim et al. 2007; Khoshgoftaar et al. 1998; Moser et al. 2008; Nagappan et al. 2008, 2010; Ostrand and Weyuker 2002; Ratzinger et al. 2007; Schröter et al. 2006; Shihab et al. 2010; Śliwerski et al. 2005; Wahyudin et al. 2008).
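The definitions of NR, NDC and NML above can be illustrated with a short sketch that aggregates them from a list of commits. It is not the implementation used in the study: the `Commit` and `FileChange` records, the author names and the class names are hypothetical, and the sketch assumes that each class appears at most once per commit and that all commits belong to the investigated release. NDPV is omitted because it additionally requires defect data from the previous release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChange:
    class_name: str
    lines_added: int
    lines_removed: int


@dataclass(frozen=True)
class Commit:
    author: str
    changes: tuple  # tuple of FileChange records


def process_metrics(commits):
    """Aggregate NR, NDC and NML per class from the commits of one release."""
    revisions = defaultdict(int)
    committers = defaultdict(set)
    modified_lines = defaultdict(int)
    for commit in commits:
        for change in commit.changes:
            revisions[change.class_name] += 1
            committers[change.class_name].add(commit.author)
            # A modified line counts as one removed plus one added line.
            modified_lines[change.class_name] += change.lines_added + change.lines_removed
    return {
        name: {
            "NR": revisions[name],
            "NDC": len(committers[name]),
            "NML": modified_lines[name],
        }
        for name in revisions
    }


# Hypothetical commit history (illustrative data only).
history = [
    Commit("alice", (FileChange("Parser", 10, 2), FileChange("Lexer", 3, 0))),
    Commit("bob", (FileChange("Parser", 1, 1),)),
    Commit("alice", (FileChange("Parser", 0, 5),)),
]
print(process_metrics(history))
```

In this invented history, the class Parser is changed in three revisions by two distinct committers, with 19 added or removed lines in total.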

2.2 Tools

All product metrics were calculated with the Ckjm tool^{Footnote 2}. The tool calculates all the aforementioned product metrics by processing the byte code of the compiled Java files.

The fact that the metrics are collected from byte code is not considered here as a threat to the experiment since, as explained in the case of LOC by Fenton and Neil (1999), a metric calculated directly from the source code and the same metric calculated from the byte code are alternative measures of the same attribute. The Ckjm version reported by Jureczko and Spinellis (2010) was used in this study.

The process metrics and the defect count were collected with a tool called BugInfo^{Footnote 3}. BugInfo analyzes the logs from the source code repository (SVN or CVS) and, according to the log content, decides whether a commit is a bugfix. A commit is considered to be a bugfix when it solves an issue reported in a bug tracking system. Each of the projects was investigated in order to identify the bugfix commenting guidelines which had been used in the source code repository. The guidelines were formalized into regular expressions. BugInfo compares the regular expressions with the comments of the commits. When a comment matches the regular expression, BugInfo increments the defect count for all the classes which have been modified in the commit. The tool has recently been incorporated into a more complex one, i.e., QualitySpy (Jureczko and Magott 2012), which is under development. The QualitySpy tool was used to collect the NML metric from projects that use SVN repositories, as this feature is not supported by BugInfo. Unfortunately, some time passed before QualitySpy was ready to use, and hence, we faced obstacles (mostly significant changes in project structure that made it impossible to match the newly collected data with the previously collected data) that prevented us from collecting the NML metric in some of the investigated projects.
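The bugfix identification described above can be sketched in a few lines. This is a simplified illustration and not BugInfo's actual code: the regular expression `BUGFIX_PATTERN` encodes a hypothetical commenting guideline (an issue key such as PROJ-123 somewhere in the comment), the commit log is invented, and the sketch assumes that a pattern found anywhere in the comment counts as a match.

```python
import re
from collections import Counter

# Hypothetical guideline: bugfix comments mention an issue key such as "PROJ-123".
BUGFIX_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")


def count_defects(commits, pattern=BUGFIX_PATTERN):
    """Count, per class, the commits whose comment matches the bugfix pattern.

    `commits` is an iterable of (comment, modified_classes) pairs.
    """
    defects = Counter()
    for comment, modified_classes in commits:
        if pattern.search(comment):
            # Every class touched by a bugfix commit gets one more defect.
            for class_name in set(modified_classes):
                defects[class_name] += 1
    return defects


# Hypothetical commit log (illustrative data only).
log = [
    ("Fix PROJ-12: null check in parser", ["Parser", "Lexer"]),
    ("Refactor build script", ["Parser"]),
    ("PROJ-40 handle empty input", ["Parser"]),
]
defect_counts = count_defects(log)
print(dict(defect_counts))
```

Because each project follows its own commenting guideline, the pattern is passed as a parameter so that a project-specific expression can be supplied.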

Even though there is no formal evaluation of BugInfo regarding the efficiency in mapping defects yet, comprehensive tests have already been conducted. Most of them are available online as JUnit tests in the BugInfo source code package.

2.3 Analyzed projects

Forty-three releases of 12 open source and 27 releases of 6 industrial software projects were investigated in this study.

In order to ensure consistent measurement of product metrics, all of the analyzed projects were written in Java. It is worth mentioning that we were not able to collect all of the process metrics for all of the projects. Therefore, some of the analyses were conducted on a subset of the projects listed below. Each project is annotated with the experiments in which it was used, e.g., NR denotes that the project was analyzed in the experiment that investigated the NR metric. The experiment where all four metrics were investigated was executed on those projects that are annotated with all four metrics. The following projects were examined:

POI version 1.5, 2.0, 2.5.1 and 3.0 (http://poi.apache.org, NR, NDC, NDPV).
Synapse version 1.0, 1.1 and 1.2 (http://synapse.apache.org, NR, NDC, NML, NDPV).
Xalan-Java version 2.4, 2.5, 2.6 and 2.7 (http://xml.apache.org/xalan-j, NR, NDC, NML, NDPV).
Xerces version 1.1, 1.2, 1.3 and 1.4.4 (http://xerces.apache.org/xerces-j, NR, NDC, NML, NDPV).
Ant version 1.3, 1.4, 1.5, 1.6 and 1.7 (http://ant.apache.org, NR, NDC, NML, NDPV).
PBeans version 1.0 and 2.0 (http://pbeans.sourceforge.net, only descriptive statistics).
Ivy version 1.1, 1.2 and 2.0 (http://ant.apache.org/ivy, NR, NDC).
Camel version 1.0, 1.2, 1.4 and 1.6 (http://camel.apache.org, NR, NDC, NML, NDPV).
Log4j version 1.0, 1.1 and 1.2 (http://logging.apache.org/log4j, NR, NDC, NDPV).
Lucene version 2.0, 2.2 and 2.4 (http://lucene.apache.org, NR, NDC, NML, NDPV).
Velocity version 1.4, 1.5 and 1.6.1 (http://velocity.apache.org, NR, NDC, NDPV).
JEdit version 3.2.1, 4.0, 4.1, 4.2 and 4.3 (http://www.jedit.org, NR, NDC, NML, NDPV).
Five industrial projects belonging to the insurance domain (NR, NDC, NML, NDPV).
One industrial project that is a tool that supports quality assurance in software development (NR, NDC).

All six industrial projects are developed by the same software development company, by international teams, in a plan-driven manner. However, only the insurance projects have a fixed scope and deadline (for each release), whereas the sixth one has much more flexible plans. The five projects from the insurance domain are custom-built solutions with more than five years of development history. They implement different feature sets according to the individual customer requirements. All of them are developed using Java enterprise technologies and frameworks and are already installed in customer environments.
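The metric annotations in the project list above determine which projects enter which experiment. As a minimal sketch, the annotations can be recorded in a mapping and filtered by the metrics an experiment requires. The dictionary below transcribes the list above; the names `METRICS_BY_PROJECT` and `projects_with`, as well as the labels used for the industrial projects, are hypothetical.

```python
# Process metrics available per project, transcribed from the project list.
ALL_FOUR = {"NR", "NDC", "NML", "NDPV"}
METRICS_BY_PROJECT = {
    "POI": {"NR", "NDC", "NDPV"},
    "Synapse": set(ALL_FOUR),
    "Xalan-Java": set(ALL_FOUR),
    "Xerces": set(ALL_FOUR),
    "Ant": set(ALL_FOUR),
    "PBeans": set(),  # only descriptive statistics
    "Ivy": {"NR", "NDC"},
    "Camel": set(ALL_FOUR),
    "Log4j": {"NR", "NDC", "NDPV"},
    "Lucene": set(ALL_FOUR),
    "Velocity": {"NR", "NDC", "NDPV"},
    "JEdit": set(ALL_FOUR),
    "Industrial insurance projects": set(ALL_FOUR),
    "Industrial QA tool": {"NR", "NDC"},
}


def projects_with(required_metrics):
    """Return the projects for which all of the required metrics were collected."""
    required = set(required_metrics)
    return sorted(
        name for name, available in METRICS_BY_PROJECT.items() if required <= available
    )


print(projects_with({"NML"}))
print(projects_with(ALL_FOUR))
```

A subset test (`required <= available`) is used so that the same function serves both the single-metric experiments and the experiment that combines all four metrics.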

A separate report by Jureczko and Madeyski (2011b), available online, presents the descriptions of the open-source projects for which the authors have built software defect prediction models. The collected data are available online in our Metric Repository^{Footnote 4}. Moreover, the repository also contains some metadata, e.g., the regular expressions used to identify defects. In order to obtain exactly the same data as used in this study, the following URL should be used: http://purl.org/MarianJureczko/MetricsRepo/WhichMetricsImprovePrediction. Furthermore, the archive containing the same collection of data sets (called MJ12A) is available online^{Footnote 5}. It is worth mentioning that a number of analyses had already been conducted previously (Jureczko and Madeyski 2010; Jureczko and Spinellis 2010), but they were neither focused on process metrics nor did they use all of the data which we were able to collect and analyze in this paper (not to mention a loosely related earlier study (Madeyski 2006) focused on external code quality instead of defects).

3 Related work and background

Comprehensive surveys on defect prediction were presented by Purao and Vaishnavi (2003), Catal and Diri (2009), Kitchenham (2010), and Hall et al. (2012). Hall et al. (2012) investigated how the context of prediction models (the modeling techniques applied and the independent variables used) affects the performance of fault prediction models. They synthesised the results of 36 studies and concluded that the methodology used to build models seems to influence predictive performance. For example, the models that perform comparatively well tend to be based on relatively simple modeling techniques (e.g., Naive Bayes, Logistic Regression); models that perform well have used combinations of independent variables; and feature selection has been applied to these combinations in the models that perform particularly well (Hall et al. 2012).

Moreover, considerable research has been performed on identifying the process metrics (similar to NR, NDC, NML and NDPV) which influence the efficiency of defect prediction. The research papers in which the aforementioned process metrics were employed or analyzed have already been mentioned in Sect. 2.1.2. This section describes a subset of these works which not only use but also focus on analyses of the aforementioned process metrics. A more comprehensive description is available in (Jureczko and Madeyski 2011a).

The NR metric (or a similar one) was recommended in all of the works presented in Table 1. However, it is worth mentioning that some of them reported only correlation coefficients with the number of defects (e.g., Illes-Seifert and Paech 2010; Schröter et al. 2006), while the other studies were carried out on limited data sets (i.e., a single project): a telephone switching system (Graves et al. 2000), Eclipse (Moser et al. 2008), Windows 2003 (Nagappan and Ball 2007) and Windows Vista (Nagappan et al. 2010).

Table 1 Findings related to Number of Revisions
