Abstract
Software Fault Prediction (SFP) is vital for predicting the fault-proneness of software modules: it allows software engineers to focus development activities on fault-prone modules, prioritize and optimize testing, improve software quality, and make better use of resources. Machine learning has been successfully applied to the classification problems that underlie SFP. Nevertheless, the variety of software metrics, the presence of redundant and irrelevant features, and the imbalanced nature of software datasets pose growing challenges for these classification problems. The objective of this study is therefore to independently examine software metrics with multiple Feature Selection Techniques (FST) combined with Data Balancing (DB) using the Synthetic Minority Oversampling Technique (SMOTE) to improve classification performance. Accordingly, a new framework that efficiently handles these challenges in combined form on both Object-Oriented Metrics (OOM) and Static Code Metrics (SCM) datasets is proposed. The experimental results confirm that prediction performance can be compromised without a suitable FST, and that the data must additionally be balanced; the combined technique thus ensures robust performance. Furthermore, the combination of Random Forest (RF) with Information Gain (IG) feature selection yields the highest Receiver Operating Characteristic (ROC) value (0.993) and is the best combination when SCM are used, whereas the combination of RF with Correlation-based Feature Selection (CFS) yields the highest ROC value (0.909) and is the best choice when OOM are used. Therefore, as this study shows, the software metrics used to predict the fault-proneness of software modules must be carefully examined, a suitable FST must be cautiously selected for them, and DB must be applied to obtain robust performance.
By addressing the challenges mentioned above, the proposed framework delivers remarkable classification performance and lays a pathway toward software quality assurance.
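The combined pipeline the abstract describes can be sketched in code. The following is a minimal, illustrative stand-in using scikit-learn (the study itself used WEKA): feature selection via mutual information (an information-gain-style criterion), a hand-rolled SMOTE-style oversampler on the training split, and a Random Forest scored by the area under the ROC curve. The synthetic dataset and all parameter values here are assumptions for illustration, not the paper's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def smote(X, y, minority=1, k=5, seed=0):
    """Minimal SMOTE: synthesize minority samples by interpolating between
    each sampled minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - len(X_min))  # oversample to balance
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])       # random near neighbor
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return (np.vstack([X, np.array(synth)]),
            np.concatenate([y, np.full(n_new, minority)]))

# Imbalanced toy "software metrics" dataset: the fault-prone class is rare.
X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# 1) Feature selection (mutual information approximates Information Gain).
fs = SelectKBest(mutual_info_classif, k=8).fit(X_tr, y_tr)
# 2) Data balancing, applied to the training split only.
X_bal, y_bal = smote(fs.transform(X_tr), y_tr)
# 3) Random Forest classifier, evaluated by area under the ROC curve.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(fs.transform(X_te))[:, 1])
print(f"ROC-AUC: {auc:.3f}")
```

Note that balancing only the training data, never the test split, keeps the evaluation faithful to the original class distribution.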
Rights and permissions
This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
Yohannese, C.W., Li, T. A Combined-Learning Based Framework for Improved Software Fault Prediction. Int J Comput Intell Syst 10, 647–662 (2017). https://doi.org/10.2991/ijcis.2017.10.1.43