Abstract
The objective of software defect prediction (SDP) is to identify defect-prone modules by constructing prediction models from datasets mined from software historical repositories. However, data mined from these repositories often suffer from high dimensionality, class imbalance, and mislabeled instances, which degrade classification performance and increase model complexity. To mitigate these problems, this paper proposes an integrated preprocessing framework in which feature selection (FS), data balancing (DB), and noise filtering (NF) techniques are fused to address the factors that impair learning performance. We apply the proposed framework to three sets of software metrics, namely static code metrics (SCM), object-oriented metrics (OOM), and combined metrics (CombM), and build models under four scenarios (S): (S1) the original data; (S2) FS subsets; (S3) FS subsets after DB using random under-sampling (RUS) and the synthetic minority over-sampling technique (SMOTE); and (S4) FS subsets after DB (RUS and SMOTE) and NF using the iterative partitioning filter (IPF) and iterative noise filtering based on the fusion of classifiers (INFFC). Empirical results show that (1) the integrated preprocessing of FS, DB, and NF improves the performance of all the models built for SDP; (2) for all FS methods, model performance improves progressively from S2 through S4 on all the software metrics; (3) model performance under S4 is statistically significantly better than under S3 for all the software metrics; and (4) achieving optimal model performance for SDP requires appropriate implementation of the proposed framework. The results validate the effectiveness of our proposal and provide guidelines for obtaining quality training data that enhances model performance for SDP.
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
About this article
Cite this article
Bashir, K., Li, T. & Yohannese, C.W. An Empirical Study for Enhanced Software Defect Prediction Using a Learning-Based Framework. Int J Comput Intell Syst 12, 282–298 (2018). https://doi.org/10.2991/ijcis.2018.125905638