Comparing the effectiveness of several modeling methods for fault prediction

Weyuker, Elaine J.; Ostrand, Thomas J.; Bell, Robert M.

doi:10.1007/s10664-009-9111-2

Published: 10 June 2009

Empirical Software Engineering, Volume 15, pages 277–295 (2010)

Elaine J. Weyuker¹, Thomas J. Ostrand¹ & Robert M. Bell¹

Abstract

We compare the effectiveness of four modeling methods—negative binomial regression, recursive partitioning, random forests and Bayesian additive regression trees—for predicting the files likely to contain the most faults for 28 to 35 releases of three large industrial software systems. Predictor variables included lines of code, file age, faults in the previous release, changes in the previous two releases, and programming language. To compare the effectiveness of the different models, we use two metrics—the percent of faults contained in the top 20% of files identified by the model, and a new, more general metric, the fault-percentile-average. The negative binomial regression and random forests models performed significantly better than recursive partitioning and Bayesian additive regression trees, as assessed by either of the metrics. For each of the three systems, the negative binomial and random forests models identified 20% of the files in each release that contained an average of 76% to 94% of the faults.
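
The two evaluation metrics named above can be illustrated with a minimal Python sketch. The abstract does not define the fault-percentile-average, so the version here rests on an assumption: that it is the mean, over every cutoff m from 1 to the number of files, of the fraction of actual faults found in the m files ranked highest by the model. The function names and the sample data are hypothetical and do not come from the article.

```python
from typing import Sequence


def rank_by_prediction(predicted: Sequence[float]) -> list[int]:
    """Return file indices ordered from highest to lowest predicted fault count."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)


def percent_faults_in_top(
    predicted: Sequence[float], actual: Sequence[int], fraction: float = 0.20
) -> float:
    """Percent of actual faults contained in the top `fraction` of ranked files."""
    order = rank_by_prediction(predicted)
    n_top = int(len(order) * fraction)  # number of files in the top slice
    total = sum(actual)
    if total == 0:
        return 0.0
    return 100.0 * sum(actual[i] for i in order[:n_top]) / total


def fault_percentile_average(
    predicted: Sequence[float], actual: Sequence[int]
) -> float:
    """Assumed definition: average over all cutoffs m of the fraction of
    actual faults found in the top m predicted files (value in [0, 1])."""
    order = rank_by_prediction(predicted)
    total = sum(actual)
    if total == 0 or not order:
        return 0.0
    running = 0      # faults captured so far while walking down the ranking
    accumulated = 0.0
    for i in order:
        running += actual[i]
        accumulated += running / total
    return accumulated / len(order)


# Hypothetical release with ten files: model predictions and observed faults.
predicted = [5.0, 1.0, 3.0, 0.5, 0.2, 4.0, 0.1, 0.3, 2.0, 0.0]
actual = [6, 0, 1, 0, 0, 2, 0, 0, 1, 0]

top20 = percent_faults_in_top(predicted, actual)
fpa = fault_percentile_average(predicted, actual)
print(f"faults in top 20% of files: {top20:.1f}%")
print(f"fault-percentile-average:   {fpa:.2f}")
```

Under this reading, the top-20% metric scores a single cutoff, while the fault-percentile-average rewards a ranking for placing faulty files early at every cutoff, which is one way to understand why the abstract calls it more general.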

Author information

Authors and Affiliations

AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ, 07932, USA
Elaine J. Weyuker, Thomas J. Ostrand & Robert M. Bell

Corresponding author

Correspondence to Elaine J. Weyuker.

Additional information

Editor: G. Ruhe

About this article

Cite this article

Weyuker, E.J., Ostrand, T.J. & Bell, R.M. Comparing the effectiveness of several modeling methods for fault prediction. Empir Software Eng 15, 277–295 (2010). https://doi.org/10.1007/s10664-009-9111-2

Published: 10 June 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10664-009-9111-2
