Abstract
The interpretation of defect models relies heavily on the software metrics used to construct them. Prior work often applies feature selection techniques to remove correlated and irrelevant metrics in order to improve model performance. Yet, conclusions derived from defect models may be unreliable if the selected metrics are inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques along five dimensions: consistency, correlation, performance, computational cost, and impact on interpretation. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent among correlation threshold values, such inconsistent most important metrics are highly correlated, with Spearman correlations of 0.85–1. Since we find that the subsets of metrics produced by the commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman, which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.
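The core idea behind Spearman-based mitigation of correlated metrics can be illustrated with a small sketch. AutoSpearman itself is distributed as an R package (Rnalytica); the Python code below is only an illustrative translation of the general approach, not the authors' implementation: compute pairwise Spearman rank correlations among metrics and greedily drop one metric from each pair whose absolute correlation exceeds a threshold. The `mitigate_correlated` function, the threshold value of 0.7, and the "drop the later metric" tie-break are all assumptions for illustration; AutoSpearman's actual selection rule differs.

```python
from statistics import mean

def _ranks(xs):
    """Assign average ranks (1-based), averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den if den else 0.0

def mitigate_correlated(metrics, threshold=0.7):
    """Greedily drop one metric from each highly correlated pair.

    metrics: dict mapping metric name -> list of observed values.
    Returns the names of the surviving metrics.
    """
    kept = list(metrics)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                a, b = kept[i], kept[j]
                if abs(spearman(metrics[a], metrics[b])) >= threshold:
                    kept.remove(b)  # illustrative tie-break: drop the later one
                    changed = True
                    break
            if changed:
                break
    return kept
```

For example, given a `loc` metric, a `loc_doubled` metric (a perfect rank duplicate), and an unrelated `noise` metric, the sketch removes `loc_doubled` and keeps the other two, which is the behaviour a correlation-mitigating selector should exhibit.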
Acknowledgements
C. Tantithamthavorn is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941). C. Treude is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE180100153).
Additional information
Communicated by: Tim Menzies
Cite this article
Jiarpakdee, J., Tantithamthavorn, C. & Treude, C. The impact of automated feature selection techniques on the interpretation of defect models. Empir Software Eng 25, 3590–3638 (2020). https://doi.org/10.1007/s10664-020-09848-1