The impact of automated feature selection techniques on the interpretation of defect models

Abstract

The interpretation of defect models relies heavily on the software metrics that are used to construct them. Prior work often applies feature selection techniques to remove correlated and irrelevant metrics in order to improve model performance. Yet, the conclusions that are derived from defect models may be inconsistent if the selected metrics are themselves inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques along five dimensions: consistency, correlation, performance, computational cost, and impact on interpretation. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, these inconsistent most important metrics are highly correlated, with Spearman correlations of 0.85–1. Since the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman, which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.
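
To make the correlation analysis above concrete, the R sketch below illustrates the Spearman-based filtering idea that underlies the first stage of AutoSpearman (Jiarpakdee et al. 2018b). It is a minimal sketch under our own assumptions, not the authors' implementation: the function name spearman_filter, the 0.7 threshold (a common cut-off for a strong correlation, cf. Hinkle et al. 2003), and the rule for deciding which metric of a correlated pair to drop are illustrative choices; the original implementation is available in the Rnalytica R package (Tantithamthavorn and Jiarpakdee 2018).

    # Minimal sketch of Spearman-based correlation filtering, in the spirit of
    # the first stage of AutoSpearman. Illustration only -- not the authors'
    # implementation; see the Rnalytica package for the original.
    spearman_filter <- function(metrics, threshold = 0.7) {
      # Absolute pairwise Spearman rank correlations between candidate metrics
      # (metrics must be a data frame or matrix of numeric columns).
      rho <- abs(cor(metrics, method = "spearman"))
      diag(rho) <- 0
      # Greedily drop one metric from the most strongly correlated pair until
      # no pair exceeds the threshold. Which metric of the pair to drop is a
      # design choice; here, the one with the higher mean correlation goes.
      while (nrow(rho) > 1 && max(rho) > threshold) {
        pair <- which(rho == max(rho), arr.ind = TRUE)[1, ]
        drop <- pair[which.max(rowMeans(rho)[pair])]
        rho  <- rho[-drop, -drop, drop = FALSE]
      }
      colnames(rho)  # names of the surviving, mutually uncorrelated metrics
    }

    # Hypothetical usage on a defect dataset whose columns are numeric metrics:
    # surviving <- spearman_filter(dataset[, metric_columns], threshold = 0.7)

Note that AutoSpearman itself goes further: after this Spearman-based step, it applies variance inflation factor (VIF) analysis (Fox and Monette 1992) to also remove metrics that are correlated with a combination of other metrics.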

References

  1. Agrawal A, Menzies T (2018) Is better data better than better data miners? In: Proceedings of the 40th International Conference on Software Engineering (ICSE), pp 1050–1061

  2. Alckmin G, Kooistra L, Lucieer A, Rawnsley R (2019) Feature filtering and selection for dry matter estimation on perennial ryegrass: a case study of vegetation indices. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 42(2/W13)

  3. Alzubi R, Ramzan N, Alzoubi H, Amira A (2017) A hybrid feature selection method for complex diseases SNPs. IEEE Access 6:1292–1301. https://doi.org/10.1109/ACCESS.2017.2778268

  4. Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83(1):2–17

  5. Berry WD (1993) Understanding regression assumptions, vol 92. Sage Publications

  6. Bettenburg N, Hassan AE (2010) Studying the impact of social structures on software quality. In: Proceedings of the International Conference on Program Comprehension (ICPC), pp 124–133

  7. Bird C, Bachmann A, Aune E, Duffy J, Bernstein A, Filkov V, Devanbu P (2009) Fair and balanced?: Bias in bug-fix datasets. In: Proceedings of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering (ESEC/FSE), pp 121–130

  8. Blake C, Merz C (1998) UCI repository of machine learning databases. University of California, Irvine, CA

  9. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  10. Breiman L, Cutler A, Liaw A, Wiener M (2006) randomForest: Breiman and Cutler’s random forests for classification and regression. R package version 4.6-12. Software available at https://cran.r-project.org/package=randomForest

  11. Cahill J, Hogan JM, Thomas R (2013) Predicting fault-prone software modules with rank sum classification. In: Proceedings of the Australian Software Engineering Conference (ASWEC), pp 211–219

  12. Cai Y, Chow M, Lu W, Li L (2010) Statistical feature selection from massive data in distribution fault diagnosis. IEEE Trans Power Syst 25(2):642–648

  13. Canfora G, De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2013) Multi-objective cross-project defect prediction. In: Proceedings of the International Conference on Software Testing, Verification and Validation (ICST), pp 252–261

  14. Chambers JM (1992) Statistical models in S. Wadsworth, Pacific Grove, California

  15. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  16. D’Ambros M, Lanza M, Robbes R (2010) An Extensive Comparison of Bug Prediction Approaches. In: Proceedings of the International Conference on Mining Software Repositories (MSR), pp 31–41

  17. D’Ambros M, Lanza M, Robbes R (2012) Evaluating defect prediction approaches: a benchmark and an extensive comparison. Emp Softw Eng (EMSE) 17(4-5):531–577

  18. Dash M, Liu H, Motoda H (2000) Consistency based Feature Selection. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp 98–109

  19. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Springer, Boston

  20. Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81(5):649–660

  21. Fox J (2015) Applied regression analysis and generalized linear models. Sage Publications

  22. Fox J, Monette G (1992) Generalized collinearity diagnostics. J Am Statis Assoc (JASA) 87(417):178–183

  23. Friedman J, Hastie T, Tibshirani R (2001) The Elements of Statistical Learning, vol 1. Springer series in statistics

  24. Fu W, Menzies T, Shen X (2016) Tuning for software analytics: is it really necessary? Inf Softw Technol 76:135–146

  25. Garner SR, et al. (1995) Weka: the Waikato environment for knowledge analysis. In: Proceedings of the New Zealand Computer Science Research Students Conference (NZCSRSC), pp 57–64

  26. Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 789–800

  27. Ghotra B, McIntosh S, Hassan AE (2017) A large-scale study of the impact of feature selection techniques on defect classification models. In: Proceedings of the 14th International Conference on Mining Software Repositories, pp 146–157

  28. Gil Y, Lalouche G (2017) On the correlation between size and metric validity. Emp Softw Eng (EMSE) 22(5):2585–2611

  29. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

  30. Hair JF, Black WC, Babin BJ, Anderson RE, Tatham RL, et al. (2006) Multivariate data analysis, vol 6

  31. Hall MA (1999) Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Hamilton

  32. Hall MA, Smith LA (1997) Feature Subset Selection: A Correlation Based Filter Approach

  33. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36

  34. Harrell FE Jr (2013) Hmisc: Harrell miscellaneous. R package version 3.12-2. Software available at http://cran.r-project.org/web/packages/Hmisc

  35. Harrell FE Jr (2015) Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer, New York

  36. Harrell FE Jr (2017) rms: regression modeling strategies. R package version 5.1-1. Software available at http://cran.r-project.org/web/packages/rms

  37. Hinkle DE, Wiersma W, Jurs SG (2003) Applied statistics for the behavioral sciences, vol 663. Houghton Mifflin College Division

  38. Hsu HH, Hsieh CW, Lu MD (2011) Hybrid feature selection by combining filters and wrappers. Expert Syst Appl 38(7):8144–8150. https://doi.org/10.1016/j.eswa.2010.12.156

  39. Liu H, Setiono R (1995) Chi2: feature selection and discretization of numeric attributes. In: Proceedings of the International Conference on Tools with Artificial Intelligence, pp 388–391

  40. Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules? In: Proceedings of the International Workshop on Defects in Large Software Systems (DEFECTS), pp 16–20

  41. Jiarpakdee J, Tantithamthavorn C, Ihara A, Matsumoto K (2016) A study of redundant metrics in defect prediction datasets. In: Proceedings of the International Symposium on Software Reliability Engineering Workshops (ISSREW), pp 51–52

  42. Jiarpakdee J, Tantithamthavorn C, Hassan AE (2018a) The impact of correlated metrics on defect models. arXiv:1801.10271, to appear

  43. Jiarpakdee J, Tantithamthavorn C, Treude C (2018b) AutoSpearman: automatically mitigating correlated software metrics for interpreting defect models. In: Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), pp 92–103

  44. Jiarpakdee J, Tantithamthavorn C, Treude C (2018c) Online appendix for “Should Automated Feature Selection Techniques be Applied when Interpreting Defect Models?”. https://github.com/software-analytics/autospearman-extension-appendix

  45. Jiarpakdee J, Tantithamthavorn C, Dam HK, Grundy J (2020) An empirical study of model-agnostic techniques for defect prediction models. Trans Softw Eng (TSE), early access

  46. John GH, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Proceedings of the International Conference on Machine Learning (ICML), pp 121–129

  47. Jureczko M, Madeyski L (2010) Towards identifying software project clusters with regard to defect prediction. In: Proceedings of the International Conference on Predictive Models in Software Engineering (PROMISE), p 9

  48. Kamei Y, Shihab E, Adams B, Hassan AE, Mockus A, Sinha A, Ubayashi N (2013) A Large-Scale empirical study of Just-In-Time quality assurance. Trans Softw Eng (TSE) 39(6):757–773

  49. Kaur A, Malhotra R (2008) Application of Random Forest in Predicting Fault-prone Classes. In: Proceedings of International Conference on the Advanced Computer Theory and Engineering (ICACTE), pp 37–43

  50. Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 481–490

  51. Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1-2):273–324

  52. Koru AG, Liu H (2005) An investigation of the effect of module size on defect prediction using static measures. Softw Eng Notes (SEN) 30:1–5

  53. Kraemer HC, Morgan GA, Leech NL, Gliner JA, Vaske JJ, Harmon RJ (2003) Measures of clinical significance. J Am Academy Child Adolescent Psychiat (JAACAP) 42(12):1524–1529

  54. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, Team R, et al. (2017) Caret: Classification and regression training. R package version 6.0–78. Software available at https://cran.r-project.org/package=caret

  55. Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. Trans Softw Eng (TSE) 34(4):485–496

  56. Lewis DD, Ringuette M (1994) A comparison of two learning algorithms for text categorization. In: Proceedings of the Annual Symposium on Document Analysis and Information Retrieval, vol 33, pp 81–93

  57. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2017) Feature selection: a data perspective. ACM Computing Surveys (CSUR) 50(6):94

  58. Lu H, Kocaguneli E, Cukic B (2014) Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction. In: Proceedings of the International Symposium on Software Reliability Engineering (ISSRE), pp 312–322

  59. Mason CH, Perreault WD Jr (1991) Collinearity, power, and interpretation of multiple regression analysis. J Market Res (JMR) 28(3):268–280

  60. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2):442–451

  61. McHugh ML (2013) The Chi-square Test of Independence. Biochemia Medica 23(2):143–149

  62. McIntosh S, Kamei Y, Adams B, Hassan AE (2014) The impact of code review coverage and code review participation on software quality. In: Proceedings of the International Conference on Mining Software Repositories (MSR), pp 192–201

  63. Mende T (2010) Replication of defect prediction studies: problems, pitfalls and recommendations. In: Proceedings of the International Conference on Predictive Models in Software Engineering (PROMISE), pp 1–10

  64. Mende T, Koschke R (2009) Revisiting the evaluation of defect prediction models. In: Proceedings of the International Conference on Predictive Models in Software Engineering (PROMISE), pp 7–16

  65. Menzies T (2018) The unreasonable effectiveness of software analytics. IEEE Softw 35(2):96–98. https://doi.org/10.1109/MS.2018.1661323

  66. Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. Trans Softw Eng (TSE) 33(1):2–13

  67. Menzies T, Caglayan B, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The Promise Repository of Empirical Software Engineering Data

  68. Mersmann O, Beleites C, Hurling R, Friedman A (2018) Microbenchmark: Accurate Timing Functions. R package version 1.4-6. Software available at https://cran.r-project.org/package=microbenchmark

  69. Mitchell TM (1997) Machine Learning. McGraw Hill

  70. Moser R, Pedrycz W, Succi G (2008) A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 181–190

  71. Nam J, Fu W, Kim S, Menzies T, Tan L (2017) Heterogeneous defect prediction. Trans Softw Eng (TSE), in press

  72. Okutan A, Yıldız OT (2014) Software defect prediction using Bayesian networks. Emp Softw Eng (EMSE) 19(1):154–181

  73. Osman H, Ghafari M, Nierstrasz O (2018) The impact of feature selection on predicting the number of bugs. arXiv:1807.04486

  74. Pandari Y, Thangavel P, Senthamaraikannan H, Jagadeeswaran S (2019) HybridFS: a hybrid filter-wrapper feature selection method. R package version 0.1.3. Software available at https://cran.r-project.org/package=HybridFS

  75. Petrić J, Bowes D, Hall T, Christianson B, Baddoo N (2016) The jinx on the NASA software defect data sets. In: Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE), pp 13–17

  76. Rahman F, Devanbu P (2013) How, and Why, process metrics are better. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 432–441

  77. Robles G (2010) Replicating MSR: A Study of the Potential Replicability of Papers Published in the Mining Software Repositories Proceedings. In: Proceedings of the International Conference on Mining Software Repositories (MSR), pp 171–180

  78. Rodríguez D, Ruiz R, Cuadrado-Gallego J, Aguilar-Ruiz J (2007) Detecting fault modules applying feature selection to classifiers. In: Proceedings of the International Conference on Information Reuse and Integration (IRI), pp 667–672

  79. Romano J, Kromrey JD, Coraggio J, Skowronek J (2006) Appropriate statistics for ordinal level data: should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys. In: Annual meeting of the Florida Association of Institutional Research (FAIR), pp 1–33

  80. Romanski P, Kotthoff L (2013) FSelector: Selecting attributes. R package version 0.19. Software available at https://cran.r-project.org/package=FSelector

  81. Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. Trans Softw Eng (TSE) 39(9):1208–1215

  82. Shepperd M, Bowes D, Hall T (2014) Researcher bias: the use of machine learning in software defect prediction. Trans Softw Eng (TSE) 40(6):603–616

  83. Shivaji S, Whitehead EJ, Akella R, Kim S (2013) Reducing features to improve code change-based bug prediction. Trans Softw Eng (TSE) 39(4):552–569

  84. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinformatics 9(1):307

  85. Tantithamthavorn C (2017) ScottKnottESD: The Scott-Knott Effect Size Difference (ESD) Test. R package version 2.0. Software available at https://cran.r-project.org/web/packages/ScottKnottESD

  86. Tantithamthavorn C, Hassan AE (2018) An experience report on defect modelling in practice: pitfalls and challenges. In: Proceedings of the International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp 286–295

  87. Tantithamthavorn C, Jiarpakdee J (2018) Rnalytica: An R package of the Miscellaneous Functions for Data Analytics Research. https://github.com/software-analytics/Rnalytica

  88. Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto K (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 812–823

  89. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016a) Automated parameter optimization of classification techniques for defect prediction models. In: Proceedings of the International Conference on Software Engineering (ICSE), pp 321–332

  90. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016b) Comments on bias: the use of machine learning in software defect prediction. Trans Softw Eng (TSE) 42(11):1092–1094

  91. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2017) An empirical comparison of model validation techniques for defect prediction models. Trans Softw Eng (TSE) 43(1):1–18

  92. Tantithamthavorn C, Hassan AE, Matsumoto K (2019a) The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. Trans Softw Eng (TSE), preprint

  93. Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2019b) The impact of automated parameter optimization on defect prediction models. Trans Softw Eng (TSE) 45(7):683–711

  94. R Core Team and contributors worldwide (2017) stats: The R Stats Package. R package version 3.4.0

  95. Tian Y, Nagappan M, Lo D, Hassan AE (2015) What Are the Characteristics of High-Rated Apps? A Case Study on Free Android Applications. In: Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), pp 301–310

  96. Torgo L, Torgo ML (2015) DMwR: Functions and Data for Data Mining with R. R package version 0.4.1. Software available at https://cran.r-project.org/package=DMwR

  97. Tosun A, Bener A (2009) Reducing false alarms in software defect prediction by decision threshold optimization. In: Proceedings of the International Symposium on Empirical Software Engineering and Measurement (ESEM), pp 477–480

  98. Xu Z, Liu J, Yang Z, An G, Jia X (2016) The Impact of Feature Selection on Defect Prediction Performance: An Empirical Comparison. In: Proceedings of the International Symposium on Software Reliability Engineering (ISSRE), pp 309–320

  99. Yan K, Zhang D (2015) Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sensors Actuators B: Chemical 212:353–363

  100. Yathish S, Jiarpakdee J, Thongtanunam P, Tantithamthavorn C (2019) Mining software defects: should we consider affected releases? In: Proceedings of the International Conference on Software Engineering (ICSE), to appear

  101. Zhang F, Hassan AE, McIntosh S, Zou Y (2017) The use of summation to aggregate software metrics hinders the performance of defect prediction models. Trans Softw Eng (TSE) 43(5):476–491

  102. Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Proceedings of the International Workshop on Predictor Models in Software Engineering (PROMISE), pp 9–19

  103. Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project Defect Prediction. In: Proceedings of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering (ESEC/FSE), pp 91–100

Acknowledgements

C. Tantithamthavorn is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941). C. Treude is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE180100153).

Author information

Corresponding author

Correspondence to Jirayus Jiarpakdee.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by: Tim Menzies

About this article

Cite this article

Jiarpakdee, J., Tantithamthavorn, C. & Treude, C. The impact of automated feature selection techniques on the interpretation of defect models. Empir Software Eng 25, 3590–3638 (2020). https://doi.org/10.1007/s10664-020-09848-1

Keywords

  • Software analytics
  • Defect prediction
  • Model interpretation
  • Feature selection