Abstract
Effort-aware just-in-time (JIT) defect prediction aims to find more defective software changes under a limited code inspection cost. Traditionally, supervised models have been used; however, they require sufficient labelled training data, which is difficult to obtain, especially for new projects. Recently, Yang et al. proposed an unsupervised model (i.e., LT) and applied it to projects with rich historical bug data. Interestingly, they reported that, under the same inspection cost (i.e., 20 percent of the total lines of code modified by all changes), it could find about 12%–27% more defective changes than a state-of-the-art supervised model (i.e., EALR) across different evaluation settings. This is surprising, as supervised models that benefit from historical data are expected to outperform unsupervised ones. Their finding suggests that previous studies on defect prediction had made a simple problem too complex. Considering the potentially high impact of Yang et al.’s work, in this paper we perform a replication study and present the following new findings: (1) Under the same inspection budget, LT requires developers to inspect a large number of changes, necessitating many more context switches. (2) Although LT finds more defective changes, many of its highly ranked changes are false alarms; these initial false alarms may erode practitioners’ patience and confidence. (3) LT does not outperform EALR when the harmonic mean of Recall and Precision (i.e., the F1-score) is considered. Beyond highlighting these findings, we propose a simple but improved supervised model, CBS+, which leverages the ideas behind both EALR and LT. We investigate the performance of CBS+ under three evaluation settings: time-wise cross-validation, 10-times 10-fold cross-validation, and cross-project validation. Compared with EALR, CBS+ detects about 15%–26% more defective changes, while keeping the number of context switches and initial false alarms close to those of EALR. Compared with LT, CBS+ detects a comparable number of defective changes, while significantly reducing context switches and initial false alarms before the first success. Finally, we discuss how to balance the trade-off between the number of inspected defects and the number of context switches, and present the implications of our findings for practitioners and researchers.
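To make the evaluation protocol concrete, the sketch below is our own illustration, not the authors’ code: it assumes each change carries hypothetical fields loc (modified lines), defective (label), and score (model output), ranks changes by score, inspects them until 20% of the total modified LOC is spent, and reports the fraction of defects found under the budget, the number of changes inspected (a proxy for context switches), and the false alarms encountered before the first real defect. The F1-score in finding (3) is the usual harmonic mean, F1 = 2 · Precision · Recall / (Precision + Recall).

```python
# A minimal sketch of effort-aware JIT evaluation, assuming hypothetical
# fields: 'loc' (modified LOC), 'defective' (bool), 'score' (rank key;
# higher means inspect earlier; LT would instead rank by ascending LOC).

def effort_aware_eval(changes, budget=0.20):
    ranked = sorted(changes, key=lambda c: c["score"], reverse=True)
    total_loc = sum(c["loc"] for c in changes)
    total_defects = sum(c["defective"] for c in changes)

    inspected_loc = hits = inspected = initial_false_alarms = 0
    found_first_defect = False
    for c in ranked:
        if inspected_loc + c["loc"] > budget * total_loc:
            break  # the 20% inspection budget is exhausted
        inspected_loc += c["loc"]
        inspected += 1  # every inspected change is one context switch
        if c["defective"]:
            hits += 1
            found_first_defect = True
        elif not found_first_defect:
            initial_false_alarms += 1  # false alarm before first success

    recall = hits / total_defects if total_defects else 0.0
    return recall, inspected, initial_false_alarms

# Toy usage: two small risky changes and one large clean one.
changes = [
    {"loc": 10, "defective": True, "score": 0.9},
    {"loc": 400, "defective": False, "score": 0.8},
    {"loc": 5, "defective": True, "score": 0.4},
]
print(effort_aware_eval(changes))  # -> (0.5, 1, 0): budget stops after change 1
```

Note how a model that crams many tiny changes under the budget can raise recall while inflating the number of inspected changes, which is exactly the recall-versus-context-switch trade-off the paper examines.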
Notes
The amount of inspected code in an individual change is much less than the code in a file, package, or module.
“Insigma Global Service,” http://www.insigmaservice.com/.
“Hengtian,” http://www.hengtiansoft.com/.
References
Abdi H (2007) Bonferroni and Šidák corrections for multiple comparisons. Encyclopedia of Measurement and Statistics 3:103–107
Agrawal A, Menzies T (2018) Is better data better than better data miners?: on the benefits of tuning SMOTE for defect prediction. In: Proceedings of the 40th International Conference on Software Engineering, ACM, pp 1050–1061
Arisholm E, Briand LC, Fuglerud M (2007) Data mining techniques for building fault-proneness models in telecom java software. In: The 18th IEEE International Symposium on Software Reliability Engineering (ISSRE’07), IEEE, pp 215–224
Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83(1):2–17
Cliff N (1996) Ordinal methods for behavioral data analysis. Lawrence Erlbaum Associates
da Costa DA, McIntosh S, Shang W, Kulesza U, Coelho R, Hassan AE (2017) A framework for evaluating the results of the SZZ approach for identifying bug-introducing changes. IEEE Trans Softw Eng 43(7):641–657
D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: 2010 7th IEEE working conference on mining software repositories (MSR), IEEE, pp 31–41
Fu W, Menzies T (2017) Revisiting unsupervised learning for defect prediction. In: Proceedings of the 2017 25th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, to appear
Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proceedings of the 37th international conference on software engineering-volume 1, IEEE Press, pp 789–800
Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661
Guo PJ, Zimmermann T, Nagappan N, Murphy B (2010) Characterizing and predicting which bugs get fixed: an empirical study of microsoft windows. In: 2010 ACM/IEEE 32nd international conference on software engineering, IEEE, vol 1, pp 495–504
Gyimothy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Softw Eng 31(10):897–910
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18
Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304
Hamill M, Goseva-Popstojanova K (2009) Common trends in software fault and failure data. IEEE Trans Softw Eng 35(4):484–496
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering, IEEE computer society, pp 78–88
Hintze JL, Nelson RD (1998) Violin plots: a box plot-density trace synergism. The American Statistician 52(2):181–184
Huang Q, Xia X, Lo D (2017) Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction. In: IEEE International Conference on Software maintenance and evolution (ICSME), IEEE
Huang Q, Shihab E, Xia X, Lo D, Li S (2018) Identifying self-admitted technical debt in open source projects using text mining. Empir Softw Eng 23(1):418–451
Jiang T, Tan L, Kim S (2013) Personalized defect prediction. In: 2013 IEEE/ACM 28th International conference on automated software engineering (ASE), IEEE, pp 279–289
Kamei Y, Shihab E, Adams B, Hassan AE, Mockus A, Sinha A, Ubayashi N (2013) A large-scale empirical study of just-in-time quality assurance. IEEE Trans Softw Eng 39(6):757–773
Kim S, Zimmermann T, Pan K, Whitehead EJ Jr (2006) Automatic identification of bug-introducing changes. In: 21st IEEE/ACM International Conference on Automated Software Engineering (ASE’06), IEEE, pp 81–90
Kim S, Whitehead EJ Jr, Zhang Y (2008) Classifying software changes: Clean or buggy? IEEE Trans Softw Eng 34(2):181–196
Kochhar PS, Xia X, Lo D, Li S (2016) Practitioners’ expectations on automated fault localization. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, ACM, pp 165–176
Koru AG, Zhang D, El Emam K, Liu H (2009) An investigation into the functional form of the size-defect relationship for software modules. IEEE Trans Softw Eng 35(2):293–304
Koru G, Liu H, Zhang D, El Emam K (2010) Testing the theory of relative defect proneness for closed-source software. Empir Softw Eng 15(6):577–598
Li PL, Herbsleb J, Shaw M, Robinson B (2006) Experiences and results from initiating field defect prediction and product test prioritization efforts at ABB Inc. In: Proceedings of the 28th international conference on Software engineering, ACM, pp 413–422
Matsumoto S, Kamei Y, Monden A, Matsumoto K, Nakamura M (2010) An analysis of developer metrics for fault prediction. In: Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ACM, p 18
Mende T, Koschke R (2010) Effort-aware defect prediction models. In: 2010 14th European conference on software maintenance and reengineering (CSMR), IEEE, pp 107–116
Menzies T, Di Stefano JS (2004) How good is your blind spot sampling policy? In: Proceedings of the 8th IEEE International Symposium on High Assurance Systems Engineering, IEEE, pp 129–138
Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17(4):375–407
Meyer AN, Fritz T, Murphy GC, Zimmermann T (2014) Software developers’ perceptions of productivity. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, pp 19–29
Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5(2):169–180
Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th international conference on Software engineering, ACM, pp 181–190
Munson JC, Khoshgoftaar TM (1992) The detection of fault-prone programs. IEEE Trans Softw Eng 18(5):423–433
Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings 27th International conference on software engineering, 2005. ICSE 2005. IEEE, pp 284–292
Nagappan N, Ball T, Murphy B (2006a) Using historical in-process and product metrics for early estimation of software failures. In: 17th International symposium on software reliability engineering, 2006. ISSRE’06. IEEE, pp 62–74
Nagappan N, Ball T, Zeller A (2006b) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on Software engineering, ACM, pp 452–461
Nam J, Kim S (2015) CLAMI: Defect prediction on unlabeled datasets (T). In: 2015 30th IEEE/ACM International conference on automated software engineering (ASE), IEEE, pp 452–463
Neto EC, da Costa DA, Kulesza U (2018) The impact of refactoring changes on the SZZ algorithm: An empirical study. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER), IEEE, pp 380–390
Ostrand TJ, Weyuker EJ, Bell RM (2004) Where the bugs are. In: ACM SIGSOFT Software Engineering Notes, ACM, vol 29, pp 86–96
Parnin C, Orso A (2011) Are automated debugging techniques actually helping programmers? In: Proceedings of the 2011 international symposium on software testing and analysis, ACM, pp 199–209
Purushothaman R, Perry DE (2005) Toward understanding the rhetoric of small source code changes. IEEE Trans Softw Eng 31(6):511–526
Rahman F, Devanbu P (2013) How, and why, process metrics are better. In: Proceedings of the 2013 International conference on software engineering, IEEE Press, pp 432–441
Rahman F, Posnett D, Devanbu P (2012) Recalling the imprecision of cross-project defect prediction. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, p 61
Shihab E, Hassan AE, Adams B, Jiang ZM (2012) An industrial study on the risk of software changes. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, p 62
Shihab E, Ihara A, Kamei Y, Ibrahim WM, Ohira M, Adams B, Hassan AE, Matsumoto K (2013) Studying re-opened bugs in open source software. Empir Softw Eng 18(5):1005–1042
Śliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes? In: ACM SIGSOFT Software Engineering Notes, ACM, vol 30, pp 1–5
Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: Proceedings of the 38th International Conference on Software Engineering, ACM, pp 321–332
Thongmak M, Muenchaisri P (2003) Predicting faulty classes using design metrics with discriminant analysis. In: Software engineering research and practice, pp 621–627
Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14(5):540–578
Valdivia Garcia H, Shihab E (2014) Characterizing and predicting blocking bugs in open source projects. In: Proceedings of the 11th working conference on mining software repositories, ACM, pp 72–81
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83
Xia X, Bao L, Lo D, Li S (2016a) “Automated debugging considered harmful” considered harmful: a user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems. In: 2016 IEEE International conference on software maintenance and evolution (ICSME), IEEE, pp 267–278
Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016b) HYDRA: Massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng 42(10):977–998
Xia X, Lo D, Wang X, Yang X (2016c) Collective personalized change classification with multiobjective search. IEEE Trans Reliab 65(4):1810–1829
Yan M, Fang Y, Lo D, Xia X, Zhang X (2017) File-level defect prediction: Unsupervised vs. supervised models. In: Proceedings of the 11th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, to appear
Yang X, Lo D, Xia X, Zhang Y, Sun J (2015) Deep learning for just-in-time defect prediction. In: 2015 IEEE International conference on software quality, reliability and security (QRS), IEEE, pp 17–26
Yang X, Lo D, Xia X, Sun J (2017) TLEL: a two-layer ensemble learning approach for just-in-time defect prediction. Inf Softw Technol 87:206–220
Yang Y, Zhou Y, Liu J, Zhao Y, Lu H, Xu L, Xu B, Leung H (2016) Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In: Proceedings of the 2016 24th ACM SIGSOFT International symposium on foundations of software engineering, ACM, pp 157–168
Yin Z, Yuan D, Zhou Y, Pasupathy S, Bairavasundaram L (2011) How do fixes become bugs?. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ACM, pp 26–36
Zhou Y, Yang Y, Lu H, Chen L, Li Y, Zhao Y, Qian J, Xu B (2018) How far we have progressed in the journey? An examination of cross-project defect prediction. ACM Trans Softw Eng Methodol (TOSEM) 27(1):1
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ACM
Acknowledgments
We would like to thank Kamei et al. (2013) and Yang et al. (2016) for providing us with the datasets and source code used in their studies. To enable other researchers to replicate and extend our study, we have published our replication package on Zenodo. This research was partially supported by the National Key Research and Development Program of China (2018YFB1003904) and the NSFC Program (No. 61602403).
Additional information
Communicated by: Lu Zhang, Thomas Zimmermann, Xin Peng and Hong Mei
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Huang, Q., Xia, X. & Lo, D. Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction. Empir Software Eng 24, 2823–2862 (2019). https://doi.org/10.1007/s10664-018-9661-2