Empirical Software Engineering

Volume 14, Issue 5, pp 540–578

On the relative value of cross-company and within-company data for defect prediction

  • Burak Turhan
  • Tim Menzies
  • Ayşe B. Bener
  • Justin Di Stefano


We propose a practical defect prediction approach for companies that do not track defect-related data. Specifically, we investigate the applicability of cross-company (CC) data for building localized defect predictors using static code features. First, we analyze the conditions under which CC data can be used as is. Those conditions turn out to be quite limited. We then apply the principles of analogy-based learning (i.e. nearest-neighbor (NN) filtering) to CC data in order to fine-tune these models for localization. We compare the performance of these models with that of defect predictors learned from within-company (WC) data. As expected, we observe that defect predictors learned from WC data outperform those learned from CC data. However, our analyses also yield defect predictors learned from NN-filtered CC data whose performance is close to, but still not better than, that of WC predictors. Therefore, we perform a final analysis to determine the minimum number of local defect reports needed to learn WC defect predictors. We demonstrate in this paper that this minimum number of data samples can be quite small and can be collected within a few months. Hence, for companies with no local defect data, we recommend a two-phase approach that allows them to employ the defect prediction process immediately. In phase one, companies should use NN-filtered CC data to initiate the defect prediction process and simultaneously start collecting WC (local) data. Once enough WC data is collected (i.e. after a few months), organizations should switch to phase two and use predictors learned from WC data.
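
The NN filtering step admits a compact sketch. The following is a minimal illustration of the idea rather than the authors' implementation: it assumes cc_X and cc_y hold the cross-company static code features and defect labels, wc_X holds the local feature vectors, and a classifier (e.g. Naive Bayes) is trained afterwards on the selected subset. The array names are hypothetical, and k = 10 mirrors the nearest-neighbor setting reported for these experiments.

    # Minimal sketch of nearest-neighbor (NN) filtering of cross-company
    # (CC) data, under the assumptions stated above; not the authors' code.
    import numpy as np

    def nn_filter(cc_X, wc_X, k=10):
        """Return indices of CC rows that rank among the k nearest
        (Euclidean) neighbors of at least one WC instance."""
        selected = set()
        for row in wc_X:
            dists = np.linalg.norm(cc_X - row, axis=1)  # distance to every CC row
            selected.update(np.argsort(dists)[:k].tolist())
        return sorted(selected)

    # Phase one: train only on the filtered CC subset (any classifier
    # exposing a fit() interface would do):
    # idx = nn_filter(cc_X, wc_X)
    # model.fit(cc_X[idx], cc_y[idx])

In phase two, once enough local defect reports have accumulated, the same learner would simply be retrained on the WC data alone.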


Keywords: Defect prediction, learning, metrics (product metrics), cross-company, within-company, nearest-neighbor filtering



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Burak Turhan (1), email author
  • Tim Menzies (2)
  • Ayşe B. Bener (1)
  • Justin Di Stefano (2)

  1. Department of Computer Engineering, Bogazici University, Istanbul, Turkey
  2. Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, USA
