On the time-based conclusion stability of cross-project defect prediction models

Abstract

Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases these claims are generalized beyond the data sets that have been evaluated. Will the researcher’s conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that, in the area of Software Analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of cross-project defect prediction remain stable over time. Our investigation applies a time-aware evaluation approach in which models are trained only on the past and evaluated only on the future. Through this time-aware evaluation, we show that the performance of defect predictors, in terms of F-Score, area under the curve (AUC), and Matthews correlation coefficient (MCC), varies depending on the time period in which they are evaluated, and that their results are not consistent. The next release of a product, if it differs significantly from its prior release, can drastically change defect prediction performance. Therefore, without knowing about conclusion stability, empirical software engineering researchers should limit their claims of performance to the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next release of a product under analysis.
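To make the time-aware evaluation concrete, the sketch below illustrates the general idea under stated assumptions: a pandas DataFrame with one row per module, a 'project' column, a 'release_date' timestamp, a binary 'defective' label, and a logistic regression learner. All column names and the choice of learner are illustrative assumptions, not the authors’ actual pipeline.

```python
# A minimal sketch of a time-aware cross-project evaluation: train only on
# data from the past, test only on data from the future. Column names and
# the learner are illustrative assumptions, not the authors' actual setup.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, matthews_corrcoef

def time_aware_eval(df, feature_cols, target_project, cutoff):
    # Train only on the past: releases of *other* projects before the cutoff.
    train = df[(df["project"] != target_project) & (df["release_date"] < cutoff)]
    # Test only on the future: the target project's releases from the cutoff on.
    test = df[(df["project"] == target_project) & (df["release_date"] >= cutoff)]

    model = LogisticRegression(max_iter=1000)
    model.fit(train[feature_cols], train["defective"])

    pred = model.predict(test[feature_cols])
    prob = model.predict_proba(test[feature_cols])[:, 1]

    return {
        "f_score": f1_score(test["defective"], pred),
        "auc": roc_auc_score(test["defective"], prob),
        "mcc": matthews_corrcoef(test["defective"], pred),
    }
```

Repeating this evaluation over successive cutoff dates shows how much F-Score, AUC, and MCC drift from one release period to the next, which is the kind of conclusion (in)stability the article investigates.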

Notes

  1. Yang et al. (2015) used 10-fold cross-validation in their study. Yang et al. (2016), on the other hand, used time-wise cross-validation for within-project models; however, for cross-project prediction they trained on one project and tested on another without ordering the data set time-wise. Kamei et al. (2016) trained JIT cross-project models using the data from one project and tested the prediction performance using the data from every other project, irrespective of their time order (see the sketch after these notes).

  2. In the rest of the paper, we do not use the rankings reported in the original study of Herbold et al. (2018), but instead use the results of our re-implementation of their methodology on open-source projects in the Jureczko data set.

  3. Herbold’s replication kit (https://crosspare.informatik.uni-goettingen.de/)
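For reference, the sketch below (referred to in Note 1) contrasts an unordered cross-project split with a time-ordered one. The DataFrame layout and column names ('project', 'timestamp') are illustrative assumptions rather than the setup of the cited studies.

```python
# Sketch of the two evaluation setups contrasted in Note 1. Column names
# are illustrative assumptions, not taken from the cited studies.
import pandas as pd

def unordered_cross_project_split(df, train_project, test_project):
    # Train on one project and test on another while ignoring time order:
    # training data may post-date the test data, so the model can implicitly
    # "see the future" of the project it is evaluated on.
    return (df[df["project"] == train_project],
            df[df["project"] == test_project])

def time_ordered_cross_project_split(df, train_project, test_project, cutoff):
    # Time-wise variant: only training data recorded before the cutoff is
    # used, and only test data recorded on or after the cutoff is evaluated.
    train = df[(df["project"] == train_project) & (df["timestamp"] < cutoff)]
    test = df[(df["project"] == test_project) & (df["timestamp"] >= cutoff)]
    return train, test
```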

References

  • Amasaki S, Kawata K, Yokogawa T (2015) Improving cross-project defect prediction methods with data simplification. In: 2015 41st euromicro conference on software engineering and advanced applications, pp 96–103. https://doi.org/10.1109/SEAA.2015.25

  • Bangash AA (2020) Abdulali/replication-kit-emse-2020-benchmark: First release. https://doi.org/10.5281/ZENODO.3715485

  • Basili VR, Briand LC, Melo WL (1996) A validation of object-oriented design metrics as quality indicators. IEEE Trans Softw Eng 22(10):751–761

  • Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Softw Eng 20(6):476–493

  • Cruz AEC, Ochimizu K (2009) Towards logistic regression models for predicting fault-prone code across software projects. In: 2009 3rd international symposium on empirical software engineering and measurement, pp 460–463. https://doi.org/10.1109/ESEM.2009.5316002

  • D’Ambros M, Lanza M, Robbes R (2012) Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir Softw Eng 17(4-5):531–577

  • Ekanayake J, Tappolet J, Gall HC, Bernstein A (2009) Tracking concept drift of software projects using defect prediction quality. In: 2009 6th IEEE international working conference on mining software repositories, IEEE, pp 51–60

  • Ekanayake J, Tappolet J, Gall HC, Bernstein A (2012) Time variance and defect prediction in software projects. Empir Softw Eng 17:348–389. Springer, New York

  • Fenton N, Neil M, Marsh W, Hearty P, Radlinski L, Krause P (2007) Project data incorporating qualitative factors for improved software defect prediction. In: Third international workshop on predictor models in software engineering (PROMISE’07: ICSE workshops), pp 2–2. https://doi.org/10.1109/PROMISE.2007.11

  • Fischer M, Pinzger M, Gall H (2003) Populating a release history database from version control and bug tracking systems. In: International conference on software maintenance, 2003 (ICSM 2003), proceedings, IEEE, pp 23–32

  • Hassan AE (2009) Predicting faults using the complexity of code changes. In: Proceedings of the 31st international conference on software engineering, IEEE computer society, pp 78–88

  • Herbold S (2015) Crosspare: a tool for benchmarking cross-project defect predictions. In: 2015 30th IEEE/ACM international conference on automated software engineering workshop (ASEW), IEEE, pp 90–96

  • Herbold S (2017a) Sherbold/replication-kit-tse-2017-benchmark: Release of the replication kit

  • Herbold S (2017b) A systematic mapping study on cross-project defect prediction. arXiv:1705.06429

  • Herbold S, Trautsch A, Grabowski J (2018) A comparative study to benchmark cross-project defect prediction approaches. IEEE Trans Softw Eng 44 (9):811–833. https://doi.org/10.1109/TSE.2017.2724538

  • Hindle A, Onuczko C (2019) Preventing duplicate bug reports by continuously querying bug reports. Empir Softw Eng 24(2):902–936

  • Huang Q, Xia X, Lo D (2017) Supervised vs unsupervised models: A holistic look at effort-aware just-in-time defect prediction. In: 2017 IEEE international conference on software maintenance and evolution, ICSME, IEEE, pp 159–170

  • Jimenez M, Rwemalika R, Papadakis M, Sarro F, Le Traon Y, Harman M (2019) The importance of accounting for real-world labelling when predicting software vulnerabilities. In: Joint european software engineering conference and symposium on the foundations of software engineering ESEC/FSE

  • Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction. In: Proceedings of the 6th international conference on predictive models in software engineering, ACM, New York, NY, USA, PROMISE ’10, pp 9:1–9:10. https://doi.org/10.1145/1868328.1868342

  • Kamei Y, Fukushima T, McIntosh S, Yamashita K, Ubayashi N, Hassan AE (2016) Studying just-in-time defect prediction using cross-project models. Empir Softw Eng 21(5):2072–2106

  • Koru AG, Liu H (2005) An investigation of the effect of module size on defect prediction using static measures. SIGSOFT Softw Eng Notes 30(4):1–5. https://doi.org/10.1145/1082983.1083172

  • Krishna R, Menzies T (2018) Bellwethers: A baseline method for transfer learning. IEEE Trans Softw Eng

  • Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496

  • Ma Y, Luo G, Zeng X, Chen A (2012) Transfer learning for cross-company software defect prediction. Inf Softw Technol 54(3):248–256. https://doi.org/10.1016/j.infsof.2011.09.007

  • Martin R (1994) OO design quality metrics: an analysis of dependencies. In: Proceedings of the workshop on pragmatic and theoretical directions in object-oriented software metrics, OOPSLA’94

  • McIntosh S, Kamei Y (2017) Are fix-inducing changes a moving target? A longitudinal case study of just-in-time defect prediction. IEEE Trans Softw Eng 44(5):412–428

  • Menzies T, Di Stefano JS (2004) How good is your blind spot sampling policy? In: Eighth IEEE international symposium on high assurance systems engineering, 2004, proceedings, pp 129–138. https://doi.org/10.1109/HASE.2004.1281737

  • Menzies T, DiStefano J, Orrego A, Chapman R (2004) Assessing predictors of software defects. In: Proceedings of the workshop on predictive software models

  • Menzies T, Milton Z, Turhan B, Cukic B, Jiang Y, Bener A (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17(4):375–407

  • Menzies T, Butcher A, Marcus A, Zimmermann T, Cok D (2011) Local vs. global models for effort estimation and defect prediction. In: 2011 26th IEEE/ACM international conference on automated software engineering (ASE 2011), IEEE, pp 343–351

  • Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Technical J 5(2):169–180

  • Morasca S, Ruhe G (2000) A hybrid approach to analyze empirical software engineering data and its application to predict module fault-proneness in maintenance. J Syst Softw 53(3):225–237

  • Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on Software engineering, ACM, pp 284–292

  • Nam J, Kim S (2015) CLAMI: Defect prediction on unlabeled datasets (T). In: Proceedings of the 2015 30th IEEE/ACM international conference on automated software engineering (ASE), IEEE computer society, Washington, DC, USA, ASE ’15, pp 452–463. https://doi.org/10.1109/ASE.2015.56

  • Nam J, Pan SJ, Kim S (2013) Transfer defect learning. In: 2013 35th international conference on software engineering (ICSE), IEEE, pp 382–391

  • Peters F, Menzies T, Marcus A (2013) Better cross company defect prediction. In: Proceedings of the 10th working conference on mining software repositories. IEEE Press, Piscataway, pp 409–418

  • Rahman F, Devanbu P (2013) How and why process metrics are better. In: 2013 35th international conference on software engineering (ICSE), IEEE, pp 432–441

  • Rakha MS, Bezemer C, Hassan AE (2018) Revisiting the performance evaluation of automated approaches for the retrieval of duplicate issue reports. IEEE Trans Softw Eng 44(12):1245–1268. https://doi.org/10.1109/TSE.2017.2755005

  • Romano J, Kromrey JD, Coraggio J, Skowronek J, Devine L (2006) Exploring methods for evaluating group differences on the NSSE and other surveys: are the t-test and Cohen’s d indices the most appropriate choices? In: Annual meeting of the southern association for institutional research. Citeseer, Princeton, pp 1–51

  • Śliwerski J, Zimmermann T, Zeller A (2005) When do changes induce fixes? In: ACM SIGSOFT software engineering notes, ACM, vol 30, pp 1–5

  • Tan M, Tan L, Dara S, Mayeux C (2015) Online defect prediction for imbalanced data. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering (ICSE), IEEE, vol 2, pp 99–108

  • Tang MH, Kao MH, Chen MH (1999) An empirical study on object-oriented metrics. In: Proceedings sixth international software metrics symposium (Cat. No. PR00403), IEEE, pp 242–249

  • Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering (ICSE), IEEE, vol 1, pp 812–823

  • Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) An empirical comparison of model validation techniques for defect prediction models. IEEE Trans Softw Eng 43(1):1–18

  • Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2018) The impact of automated parameter optimization on defect prediction models. IEEE Trans Softw Eng

  • Turhan B (2012) On the dataset shift problem in software engineering prediction models. Empir Softw Eng 17(1-2):62–74

  • Turhan B, Menzies T, Bener AB, Di Stefano J (2009) On the relative value of cross-company and within-company data for defect prediction. Empir Softw Eng 14(5):540–578

  • Watanabe S, Kaiya H, Kaijiri K (2008) Adapting a fault prediction model to allow inter language reuse. In: Proceedings of the 4th international workshop on predictor models in software engineering, ACM, New York, NY, USA, PROMISE ’08, pp 19–24. https://doi.org/10.1145/1370788.1370794

  • Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann

  • Yang X, Lo D, Xia X, Zhang Y, Sun J (2015) Deep learning for just-in-time defect prediction. In: 2015 IEEE International conference on software quality, reliability and security, IEEE, pp 17–26

  • Yang Y, Zhou Y, Liu J, Zhao Y, Lu H, Xu L, Xu B, Leung H (2016) Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, ACM, pp 157–168

  • Yap BW, Rani KA, Rahman HAA, Fong S, Khairudin Z, Abdullah NN (2014) An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In: Proceedings of the first international conference on advanced data and information engineering (DaEng-2013). Springer, New York, pp 13–22

  • Zhang F, Mockus A, Keivanloo I, Zou Y (2014) Towards building a universal defect prediction model. In: Proceedings of the 11th working conference on mining software repositories, ACM, pp 182–191

  • Zimmermann T, Nagappan N (2007) Predicting subsystem failures using dependency graph complexities. In: The 18th IEEE international symposium on software reliability (ISSRE’07), IEEE, pp 227–236

  • Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Third international workshop on predictor models in software engineering (PROMISE’07: ICSE Workshops 2007), IEEE, pp 9–9

  • Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ACM, pp 91–100

Author information

Corresponding author

Correspondence to Abdul Ali Bangash.

Additional information

Communicated by: Romain Robbes

About this article

Cite this article

Bangash, A.A., Sahar, H., Hindle, A. et al. On the time-based conclusion stability of cross-project defect prediction models. Empir Software Eng 25, 5047–5083 (2020). https://doi.org/10.1007/s10664-020-09878-9
