Abstract
Fault prediction by negative binomial regression models is shown to be effective for four large production software systems from industry. A model developed originally with data from systems with regularly scheduled releases was successfully adapted to a system without releases to identify 20% of that system’s files that contained 75% of the faults. A model with a pre-specified set of variables derived from earlier research was applied to three additional systems, and proved capable of identifying averages of 81, 94 and 76% of the faults in those systems. A primary focus of this paper is to investigate the impact on predictive accuracy of using data about the number of developers who access individual code units. For each system, including the cumulative number of developers who had previously modified a file yielded no more than a modest improvement in predictive accuracy. We conclude that while many factors can “spoil the broth” (lead to the release of software with too many defects), the number of developers is not a major influence.
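The accuracy figures quoted above (for example, 20% of the files containing 75% of the faults) can be read as a ranking measure: order the files by predicted fault count, take the top fraction, and ask what share of the actual faults falls in those files. The following is a minimal sketch of that calculation, not the authors' implementation. The function name `percent_faults_in_top_files`, the 20% default cutoff, and the sample numbers are illustrative assumptions rather than data from the paper.

```python
def percent_faults_in_top_files(predicted, actual, fraction=0.20):
    """Return the percentage of actual faults found in the top `fraction`
    of files, ranked by predicted fault count (highest first).

    `predicted` and `actual` are parallel sequences, one entry per file.
    This is a hypothetical helper written for illustration only.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one file is required")

    total_faults = sum(actual)
    if total_faults == 0:
        return 0.0

    # Number of files selected; always inspect at least one file.
    n_top = max(1, int(len(predicted) * fraction))

    # File indices sorted by predicted fault count, largest first.
    ranked = sorted(range(len(predicted)),
                    key=lambda i: predicted[i],
                    reverse=True)

    captured = sum(actual[i] for i in ranked[:n_top])
    return 100.0 * captured / total_faults


# Made-up example: ten files, predicted counts from some fitted model,
# and the faults later observed in each file.
predicted = [5.0, 0.1, 3.2, 0.4, 0.2, 0.3, 0.1, 0.0, 1.1, 0.6]
actual = [6, 0, 3, 0, 1, 0, 0, 0, 1, 1]

share = percent_faults_in_top_files(predicted, actual)
print(f"Top 20% of files contain {share:.0f}% of the faults")
```

In this sketch the predictions could come from any model that outputs an expected fault count per file, such as a negative binomial regression; the measure itself depends only on the ranking the predictions induce.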
Additional information
Editor: Tim Menzies
Cite this article
Weyuker, E.J., Ostrand, T.J. & Bell, R.M. Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models. Empir Software Eng 13, 539–559 (2008). https://doi.org/10.1007/s10664-008-9082-8
DOI: https://doi.org/10.1007/s10664-008-9082-8