
The impact of automated feature selection techniques on the interpretation of defect models


Abstract

The interpretation of defect models relies heavily on the software metrics that are used to construct them. Prior work often applies feature selection techniques to remove correlated and irrelevant metrics in order to improve model performance. Yet, conclusions derived from defect models may be inconsistent if the selected metrics are themselves inconsistent and correlated. In this paper, we systematically investigate 12 automated feature selection techniques along five dimensions: consistency, correlation, performance, computational cost, and impact on interpretation. Through an empirical investigation of 14 publicly-available defect datasets, we find that (1) 94–100% of the selected metrics are inconsistent among the studied techniques; (2) 37–90% of the selected metrics are inconsistent among training samples; (3) 0–68% of the selected metrics are inconsistent when the feature selection techniques are applied repeatedly; (4) 5–100% of the produced subsets of metrics contain highly correlated metrics; and (5) while the most important metrics are inconsistent across correlation threshold values, these inconsistent metrics are highly correlated, with Spearman correlations of 0.85–1. Since the subsets of metrics produced by commonly-used feature selection techniques (except for AutoSpearman) are often inconsistent and correlated, these techniques should be avoided when interpreting defect models. In addition to introducing AutoSpearman, which mitigates correlated metrics better than commonly-used feature selection techniques, this paper opens up new research avenues in the automated selection of features for defect models to optimise for interpretability as well as performance.
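To make the correlation-mitigation idea concrete, below is a minimal Python sketch in the spirit of the Spearman-based step of AutoSpearman: iteratively remove one metric from each pair whose Spearman rank correlation exceeds a threshold, keeping the metric that is, on average, less correlated with the remaining metrics. This is an illustrative reconstruction, not the authors' implementation (which is distributed as an R function); the 0.7 threshold, the tie-breaking heuristic, and the function name are assumptions.

```python
# Illustrative sketch of Spearman-based metric filtering in the spirit of
# AutoSpearman. Assumptions: |rho| >= 0.7 as the "highly correlated"
# threshold, and a mean-correlation tie-breaker for which metric to drop.
from itertools import combinations

import pandas as pd


def spearman_filter(metrics: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return metric names after dropping highly correlated metrics."""
    kept = list(metrics.columns)
    while True:
        # Absolute pairwise Spearman rank correlations of surviving metrics.
        corr = metrics[kept].corr(method="spearman").abs()
        pairs = [(corr.loc[a, b], a, b)
                 for a, b in combinations(kept, 2)
                 if corr.loc[a, b] >= threshold]
        if not pairs:
            return kept
        # Resolve the most strongly correlated pair first.
        _, a, b = max(pairs)
        # Drop whichever metric of the pair is, on average, more correlated
        # with everything else (i.e., the more redundant one).
        mean_a = corr.loc[a, [m for m in kept if m != a]].mean()
        mean_b = corr.loc[b, [m for m in kept if m != b]].mean()
        kept.remove(a if mean_a >= mean_b else b)
```

The full technique goes further than this sketch: after the correlation step, remaining multicollinearity among the surviving metrics is also mitigated (e.g., via variance inflation factor analysis).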



Acknowledgements

C. Tantithamthavorn is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE200100941). C. Treude is supported by the Australian Research Council’s Discovery Early Career Researcher Award (DECRA) funding scheme (DE180100153).

Author information


Corresponding author

Correspondence to Jirayus Jiarpakdee.

Additional information

Communicated by: Tim Menzies



About this article


Cite this article

Jiarpakdee, J., Tantithamthavorn, C. & Treude, C. The impact of automated feature selection techniques on the interpretation of defect models. Empir Software Eng 25, 3590–3638 (2020). https://doi.org/10.1007/s10664-020-09848-1
