
Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction

Empirical Software Engineering

Abstract

Effort-aware just-in-time (JIT) defect prediction aims to find more defective software changes with limited code inspection cost. Traditionally, supervised models have been used; however, they require sufficient labelled training data, which is difficult to obtain, especially for new projects. Recently, Yang et al. proposed an unsupervised model (i.e., LT) and applied it to projects with rich historical bug data. Interestingly, they reported that, under the same inspection cost (i.e., 20% of the total lines of code modified by all changes), it could find about 12%-27% more defective changes than a state-of-the-art supervised model (i.e., EALR) across different evaluation settings. This is surprising, as supervised models that benefit from historical data are expected to perform better than unsupervised ones. Their finding suggests that previous studies on defect prediction had made a simple problem too complex. Considering the potentially high impact of Yang et al.’s work, in this paper we perform a replication study and present the following new findings: (1) Under the same inspection budget, LT requires developers to inspect a large number of changes, necessitating many more context switches. (2) Although LT finds more defective changes, many highly ranked changes are false alarms. These initial false alarms may negatively impact practitioners’ patience and confidence. (3) LT does not outperform EALR when the harmonic mean of Recall and Precision (i.e., F1-score) is considered. Beyond highlighting the above findings, we propose a simple but improved supervised model called CBS+, which leverages the ideas of both EALR and LT. We investigate the performance of CBS+ using three different evaluation settings: time-wise cross-validation, 10-times 10-fold cross-validation, and cross-project validation. Compared with EALR, CBS+ detects about 15%-26% more defective changes, while keeping the number of context switches and initial false alarms close to those of EALR. Compared with LT, CBS+ detects a comparable number of defective changes while significantly reducing context switches and initial false alarms before the first success. Finally, we discuss how to balance the tradeoff between the number of inspected defects and context switches, and present the implications of our findings for practitioners and researchers.
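
To make the comparison in the abstract concrete, the sketch below walks through an effort-aware evaluation at a 20% line-of-code inspection budget: changes are ranked by a model score, inspected until the budget is exhausted, and the three quantities contrasted above are recorded (defective changes found, number of changes inspected as a proxy for context switches, and false alarms before the first success). This is a minimal illustration under stated assumptions, not the authors' implementation; the Change record, the evaluate_at_20_percent helper, and the LT-style score of 1/LOC are introduced here purely for illustration.

```python
# Minimal sketch (not the authors' implementation) of effort-aware evaluation
# at a 20% line-of-code inspection budget. Data structures and the LT-style
# score of 1/LOC are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Change:
    loc_modified: int    # lines of code modified by the change (inspection effort)
    is_defective: bool   # ground-truth label from the dataset


def evaluate_at_20_percent(changes: List[Change], scores: List[float]) -> Tuple[float, int, int]:
    """Rank changes by descending score, inspect until 20% of the total modified
    LOC is spent, and report the quantities discussed in the abstract."""
    budget = 0.2 * sum(c.loc_modified for c in changes)
    ranked = [c for _, c in sorted(zip(scores, changes), key=lambda p: -p[0])]

    spent, inspected, found, first_hit_rank = 0, 0, 0, None
    for rank, c in enumerate(ranked, start=1):
        if spent + c.loc_modified > budget:
            break
        spent += c.loc_modified
        inspected += 1                 # each inspected change is one context switch
        if c.is_defective:
            found += 1
            if first_hit_rank is None:
                first_hit_rank = rank  # false alarms before first success = rank - 1

    total_defective = sum(c.is_defective for c in changes)
    recall_at_20 = found / total_defective if total_defective else 0.0
    initial_false_alarms = (first_hit_rank - 1) if first_hit_rank else inspected
    return recall_at_20, inspected, initial_false_alarms


# Example: an LT-like unsupervised ranking simply prefers smaller changes,
# e.g. score = 1 / loc_modified (an assumption for illustration).
changes = [Change(400, False), Change(10, True), Change(30, False), Change(60, True)]
scores = [1.0 / c.loc_modified for c in changes]
print(evaluate_at_20_percent(changes, scores))  # -> (1.0, 3, 0)
```

Because an LT-style ranking prefers many small changes, it tends to inspect far more changes within the same budget, which is exactly the context-switch cost highlighted above; CBS+ combines ideas from EALR and LT to keep that cost closer to EALR's while still finding a comparable number of defects.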

Notes

  1. The amount of inspected code in an individual change is much less than the code in a file, package, or module.

  2. Some previous studies (Hall et al. 2012; Jiang et al. 2013; Rahman and Devanbu 2013) also denoted this evaluation measure as cost-effectiveness.

  3. “Insigma Global Service,” http://www.insigmaservice.com/.

  4. “Hengtian,” http://www.hengtiansoft.com/.

  5. https://zenodo.org/record/1432582#.W6YyU2gzaUl

References

  • Abdi H (2007) Bonferroni and Šidák corrections for multiple comparisons. Enc Measur Stat 3:103–107

  • Agrawal A, Menzies T (2018) Is better data better than better data miners?: on the benefits of tuning smote for defect prediction. In: Proceedings of the 40th International Conference on Software Engineering, ACM, pp 1050–1061

  • Arisholm E, Briand LC, Fuglerud M (2007) Data mining techniques for building fault-proneness models in telecom java software. In: The 18th IEEE International Symposium on Software Reliability (ISSRE’07), IEEE, pp 215–224

  • Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83 (1):2–17

  • Cliff N (1996) Ordinal methods for behavioral data analysis. Lawrence Erlbaum Associates

  • da Costa DA, McIntosh S, Shang W, Kulesza U, Coelho R, Hassan AE (2017) A framework for evaluating the results of the szz approach for identifying bug-introducing changes. IEEE Trans Softw Eng 43(7):641–657

  • D’Ambros M, Lanza M, Robbes R (2010) An extensive comparison of bug prediction approaches. In: 2010 7th IEEE working conference on mining software repositories (MSR), IEEE, pp 31–41

  • Fu W, Menzies T (2017) Revisiting unsupervised learning for defect prediction. In: Proceedings of the 2017 25th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, p to appear

  • Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proceedings of the 37th international conference on software engineering-volume 1, IEEE Press, pp 789–800

  • Graves TL, Karr AF, Marron JS, Siy H (2000) Predicting fault incidence using software change history. IEEE Trans Softw Eng 26(7):653–661

  • Guo PJ, Zimmermann T, Nagappan N, Murphy B (2010) Characterizing and predicting which bugs get fixed: an empirical study of microsoft windows. In: 2010 ACM/IEEE 32nd international conference on software engineering, IEEE, vol 1, pp 495–504

  • Gyimothy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Softw Eng 31 (10):897–910

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD explorations newsletter 11(1):10–18

  • Hall T, Beecham S, Bowes D, Gray D, Counsell S (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38(6):1276–1304

  • Hamill M, Goseva-Popstojanova K (2009) Common trends in software fault and failure data. IEEE Trans Softw Eng 35(4):484–496

  • Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering, IEEE computer society, pp 78–88

  • Hintze JL, Nelson RD (1998) Violin plots: a box plot-density trace synergism. The American Statistician 52(2):181–184

  • Huang Q, Xia X, Lo D (2017) Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction. In: IEEE International Conference on Software maintenance and evolution (ICSME), IEEE

  • Huang Q, Shihab E, Xia X, Lo D, Li S (2018) Identifying self-admitted technical debt in open source projects using text mining. Empir Softw Eng 23(1):418–451

  • Jiang T, Tan L, Kim S (2013) Personalized defect prediction. In: 2013 IEEE/ACM 28th International conference on automated software engineering (ASE), IEEE, pp 279–289

  • Kamei Y, Shihab E, Adams B, Hassan AE, Mockus A, Sinha A, Ubayashi N (2013) A large-scale empirical study of just-in-time quality assurance. IEEE Trans Softw Eng 39(6):757–773

  • Kim S, Zimmermann T, Pan K, Whitehead EJ Jr (2006) Automatic identification of bug-introducing changes. In: 21st IEEE/ACM International Conference on Automated Software Engineering (ASE’06), IEEE, pp 81–90

  • Kim S, Whitehead EJ Jr, Zhang Y (2008) Classifying software changes: Clean or buggy? IEEE Trans Softw Eng 34(2):181–196

  • Kochhar PS, Xia X, Lo D, Li S (2016) Practitioners’ expectations on automated fault localization. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, ACM, pp 165–176

  • Koru AG, Zhang D, El Emam K, Liu H (2009) An investigation into the functional form of the size-defect relationship for software modules. IEEE Trans Softw Eng 35(2):293–304

  • Koru G, Liu H, Zhang D, El Emam K (2010) Testing the theory of relative defect proneness for closed-source software. Empir Softw Eng 15(6):577–598

  • Li PL, Herbsleb J, Shaw M, Robinson B (2006) Experiences and results from initiating field defect prediction and product test prioritization efforts at abb inc. In: Proceedings of the 28th international conference on Software engineering, ACM, pp 413–422

  • Matsumoto S, Kamei Y, Monden A, Matsumoto K, Nakamura M (2010) An analysis of developer metrics for fault prediction. In: Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ACM, p 18

  • Mende T, Koschke R (2010) Effort-aware defect prediction models. In: 2010 14th European conference on software maintenance and reengineering (CSMR), IEEE, pp 107–116

  • Menzies T, Di Stefano JS (2004) How good is your blind spot sampling policy. In: Proceedings 8th IEEE International symposium on high assurance systems engineering, 2004, IEEE, pp 129–138

  • Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17(4):375–407

  • Meyer AN, Fritz T, Murphy GC, Zimmermann T (2014) Software developers’ perceptions of productivity. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ACM, pp 19–29

  • Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5(2):169–180

  • Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the 30th international conference on Software engineering, ACM, pp 181–190

  • Munson JC, Khoshgoftaar TM (1992) The detection of fault-prone programs. IEEE Trans Softw Eng 18(5):423–433

  • Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings 27th International conference on software engineering, 2005. ICSE 2005. IEEE, pp 284–292

  • Nagappan N, Ball T, Murphy B (2006a) Using historical in-process and product metrics for early estimation of software failures. In: 17th International symposium on software reliability engineering, 2006. ISSRE’06. IEEE, pp 62–74

  • Nagappan N, Ball T, Zeller A (2006b) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on Software engineering, ACM, pp 452–461

  • Nam J, Kim S (2015) Clami: Defect prediction on unlabeled datasets (t). In: 2015 30th IEEE/ACM International conference on automated software engineering (ASE), IEEE, pp 452–463

  • Neto EC, da Costa DA, Kulesza U (2018) The impact of refactoring changes on the szz algorithm: An empirical study. In: 2018 IEEE 25Th international conference on software analysis, evolution and reengineering (SANER), IEEE, pp 380–390

  • Ostrand TJ, Weyuker EJ, Bell RM (2004) Where the bugs are. In: ACM SIGSOFT Software engineering notes, ACM, vol 29, pp 86–96

  • Parnin C, Orso A (2011) Are automated debugging techniques actually helping programmers?. In: Proceedings of the 2011 international symposium on software testing and analysis, ACM, pp 199–209

  • Purushothaman R, Perry DE (2005) Toward understanding the rhetoric of small source code changes. IEEE Trans Softw Eng 31(6):511–526

  • Rahman F, Devanbu P (2013) How, and why, process metrics are better. In: Proceedings of the 2013 International conference on software engineering, IEEE Press, pp 432–441

  • Rahman F, Posnett D, Devanbu P (2012) Recalling the imprecision of cross-project defect prediction. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, p 61

  • Shihab E, Hassan AE, Adams B, Jiang ZM (2012) An industrial study on the risk of software changes. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, ACM, p 62

  • Shihab E, Ihara A, Kamei Y, Ibrahim WM, Ohira M, Adams B, Hassan AE, Matsumoto K (2013) Studying re-opened bugs in open source software. Empir Softw Eng 18(5):1005–1042

  • Śliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes?. In: ACM Sigsoft software engineering notes, ACM, vol 30, pp 1–5

  • Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Automated parameter optimization of classification techniques for defect prediction models. In: Proceedings of the 38th International Conference on Software Engineering, ACM, pp 321–332

  • Thongmak M, Muenchaisri P (2003) Predicting faulty classes using design metrics with discriminant analysis. In: Software engineering research and practice, pp 621–627

  • Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14 (5):540–578

  • Valdivia Garcia H, Shihab E (2014) Characterizing and predicting blocking bugs in open source projects. In: Proceedings of the 11th working conference on mining software repositories, ACM, pp 72–81

  • Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1 (6):80–83

  • Xia X, Bao L, Lo D, Li S (2016a) Automated debugging considered harmful considered harmful: a user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems. In: 2016 IEEE International conference on software maintenance and evolution (ICSME), IEEE, pp 267–278

  • Xia X, Lo D, Pan SJ, Nagappan N, Wang X (2016b) Hydra: Massively compositional model for cross-project defect prediction. IEEE Trans Softw Eng 42 (10):977–998

  • Xia X, Lo D, Wang X, Yang X (2016c) Collective personalized change classification with multiobjective search. IEEE Trans Reliab 65(4):1810–1829

  • Yan M, Fang Y, Lo D, Xia X, Zhang X (2017) File-level defect prediction: Unsupervised vs. supervised models. In: Proceedings of the 11th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, p to appear

  • Yang X, Lo D, Xia X, Zhang Y, Sun J (2015) Deep learning for just-in-time defect prediction. In: 2015 IEEE International conference on software quality, reliability and security (QRS), IEEE, pp 17–26

  • Yang X, Lo D, Xia X, Sun J (2017) Tlel: a two-layer ensemble learning approach for just-in-time defect prediction. Inf Softw Technol 87:206–220

  • Yang Y, Zhou Y, Liu J, Zhao Y, Lu H, Xu L, Xu B, Leung H (2016) Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In: Proceedings of the 2016 24th ACM SIGSOFT International symposium on foundations of software engineering, ACM, pp 157–168

  • Yin Z, Yuan D, Zhou Y, Pasupathy S, Bairavasundaram L (2011) How do fixes become bugs?. In: Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ACM, pp 26–36

  • Zhou Y, Yang Y, Lu H, Chen L, Li Y, Zhao Y, Qian J, Xu B (2018) How far we have progressed in the journey? an examination of cross-project defect prediction. ACM Trans Softw Eng Methodol (TOSEM) 27(1):1

  • Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ACM

Acknowledgments

We would like to thank Kamei et al. (2013) and Yang et al. (2016) for providing the datasets and source code used in their studies. Finally, to enable other researchers to replicate and extend our study, we have published the replication package on Zenodo (see Note 5). This research was partially supported by the National Key Research and Development Program of China (2018YFB1003904) and the NSFC Program (No. 61602403).

Author information

Corresponding author

Correspondence to Xin Xia.

Additional information

Communicated by: Lu Zhang, Thomas Zimmermann, Xin Peng and Hong Mei

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Huang, Q., Xia, X. & Lo, D. Revisiting supervised and unsupervised models for effort-aware just-in-time defect prediction. Empir Software Eng 24, 2823–2862 (2019). https://doi.org/10.1007/s10664-018-9661-2

