Abstract
The objective of software defect prediction (SDP) is to identify defect-prone modules by constructing prediction models from datasets mined from software historical repositories. However, data mined from these repositories often suffer from high dimensionality, class imbalance, and mislabeled instances, which degrade classification performance and increase model complexity. To mitigate these problems, this paper proposes an integrated preprocessing framework in which feature selection (FS), data balancing (DB), and noise filtering (NF) techniques are fused to address the factors that impair learning performance. We apply the proposed framework to three sets of software metrics, namely static code metrics (SCM), object-oriented metrics (OOM), and combined metrics (CombM), and build models under four scenarios (S): (S1) the original data; (S2) FS subsets; (S3) FS subsets after DB using random under-sampling (RUS) and the synthetic minority over-sampling technique (SMOTE); and (S4) FS subsets after DB (RUS and SMOTE) and NF using the iterative partitioning filter (IPF) and iterative noise filtering based on the fusion of classifiers (INFFC). Empirical results show that (1) the integrated preprocessing of FS, DB, and NF improves the performance of all the models built for SDP; (2) for all FS methods, model performance improves progressively from S2 through S4 on all the software metrics; (3) model performance under S4 is statistically significantly better than under S3 for all the software metrics; and (4) achieving optimal model performance for SDP requires appropriate implementation of the proposed framework. The results validate the effectiveness of our proposal and provide guidelines for obtaining quality training data that enhances model performance for SDP.
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Bashir, K., Li, T. & Yohannese, C.W. An Empirical Study for Enhanced Software Defect Prediction Using a Learning-Based Framework. Int J Comput Intell Syst 12, 282–298 (2018). https://doi.org/10.2991/ijcis.2018.125905638