Automated Software Engineering, Volume 17, Issue 4, pp 375–407

Defect prediction from static code features: current results, limitations, new approaches

  • Tim Menzies
  • Zach Milton
  • Burak Turhan
  • Bojan Cukic
  • Yue Jiang
  • Ayşe Bener


Building quality software is expensive, and software quality assurance (QA) budgets are limited. Data miners can learn defect predictors from static code features, and those predictors can be used to control QA resources; e.g. to focus on the parts of the code predicted to be more defective.

Recent results show that better data mining technology is not leading to better defect predictors. We hypothesize that we have reached the limits of the standard learning goal of maximizing “AUC(pd, pf)”; i.e. the area under the curve of the probability of false alarm (pf) versus the probability of detection (pd).
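As a rough illustration of the standard goal, the following sketch computes AUC(pd, pf) for a set of scored modules by sweeping a decision threshold and applying the trapezoidal rule. The function name auc_pd_pf and its inputs (a list of predicted scores and a list of 0/1 defect labels) are assumptions made for this example; they are not taken from the paper.

```python
def auc_pd_pf(scores, labels):
    """Area under the curve of pf (x-axis) versus pd (y-axis).

    scores: predicted defect scores, higher means "more likely defective"
    labels: 1 for a defective module, 0 for a clean one
    Both inputs are hypothetical; any scoring learner could supply them.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one defective and one clean module")

    # Visit modules from the highest score to the lowest, as if the
    # decision threshold were being lowered step by step.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    area = 0.0
    true_pos = false_pos = 0
    prev_pd = prev_pf = 0.0
    i = 0
    while i < len(ranked):
        # Modules with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                true_pos += 1
            else:
                false_pos += 1
            j += 1
        pd = true_pos / positives    # probability of detection
        pf = false_pos / negatives   # probability of false alarm
        area += (pf - prev_pf) * (pd + prev_pd) / 2.0  # trapezoid
        prev_pd, prev_pf = pd, pf
        i = j
    return area
```

Under these assumptions, a predictor that ranks every defective module above every clean one scores 1.0, and a predictor that assigns the same score to all modules scores 0.5.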

Accordingly, we explore changing the standard goal. Learners that maximize “AUC(effort, pd)” find the smallest set of modules that contain the most errors. WHICH is a meta-learner framework that can be quickly customized to different goals. When customized to AUC(effort, pd), WHICH out-performs all the data mining methods studied here. More importantly, measured in terms of this new goal, certain widely used learners perform much worse than simple manual methods.
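The alternative goal can be sketched in the same way. The example below computes AUC(effort, pd) for modules listed in the order they would be inspected. It assumes that effort is measured as the fraction of lines of code read so far; this abstract does not define effort, so that choice, the function name auc_effort_pd, and the (loc, defective) input format are all assumptions made for illustration.

```python
def auc_effort_pd(inspection_order):
    """Area under the curve of effort (x-axis) versus pd (y-axis).

    inspection_order: list of (loc, defective) pairs, already sorted in
    the order a predictor says the modules should be inspected.
    Effort is ASSUMED here to be the fraction of total lines of code read.
    """
    total_loc = sum(loc for loc, _ in inspection_order)
    total_defective = sum(1 for _, defective in inspection_order if defective)
    if total_loc <= 0 or total_defective == 0:
        raise ValueError("need some code and at least one defective module")

    area = 0.0
    effort = pd = 0.0
    for loc, defective in inspection_order:
        new_effort = effort + loc / total_loc
        new_pd = pd + (1.0 / total_defective if defective else 0.0)
        # Trapezoid between the previous and the current curve point.
        area += (new_effort - effort) * (pd + new_pd) / 2.0
        effort, pd = new_effort, new_pd
    return area


# Hypothetical data: one small defective module, one large clean module.
small_first = [(10, True), (90, False)]
large_first = [(90, False), (10, True)]
small_first_area = auc_effort_pd(small_first)
large_first_area = auc_effort_pd(large_first)
```

With this toy data, inspecting the small defective module first yields a much larger area than inspecting the large clean module first, which is the sense in which this goal rewards finding the most errors in the least code.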

Hence, we advise against the indiscriminate use of learners. Learners must be chosen and customized to the goal at hand. With the right architecture (e.g. WHICH), tuning a learner to specific local business goals can be a simple task.


Keywords: Defect prediction, Static code features, WHICH





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Tim Menzies (1)
  • Zach Milton (1)
  • Burak Turhan (2)
  • Bojan Cukic (1)
  • Yue Jiang (1)
  • Ayşe Bener (3)

  1. West Virginia University, Morgantown, USA
  2. University of Oulu, Oulu, Finland
  3. Boğaziçi University, Istanbul, Turkey
