Abstract
Software has become an indispensable part of human life. In the current computing era, large-scale complex network systems and millions of modern devices produce huge amounts of data every second. A considerable share of this data is imbalanced, and imbalanced data can mislead machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The cost of finding and fixing defects is estimated at billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it needs fine-tuning to reach the expected efficiency. In this chapter, we propose a new machine learning based model to predict software defects and to identify the key factors that can help software engineers locate the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all remaining features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling and a hybrid sampling method are used to balance highly imbalanced datasets. Third, Random Forest importance and the Chi-square test are used to find the factors that have a strong effect on software defects. Cross-validation is used to mitigate overfitting. The Scikit-learn library is used for the machine learning algorithms and the Pandas library for data processing. Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of hybrid sampling and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%, outperforming the other configurations.
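The three steps described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the chapter: the synthetic dataset from `make_classification`, the 0.9 correlation threshold, the helper names `drop_correlated` and `hybrid_sample`, and the hand-written SMOTE-style interpolation, which is a simplified stand-in for the sampling methods the authors used.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def drop_correlated(df, threshold=0.9):
    """Drop one feature of every pair whose |Pearson r| exceeds threshold (assumed value)."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


def hybrid_sample(X, y, rng, k=5):
    """Hypothetical hybrid sampler: undersample the majority class and add
    SMOTE-style synthetic minority points until both classes reach the midpoint size."""
    minority = 1 if (y == 1).sum() < (y == 0).sum() else 0
    X_min, X_maj = X[y == minority], X[y != minority]
    target = (len(X_min) + len(X_maj)) // 2
    # Random undersampling of the majority class.
    X_maj = X_maj[rng.choice(len(X_maj), size=target, replace=False)]
    # Each synthetic point lies between a minority sample and one of its k neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=target - len(X_min))
    mate = neighbours[base, rng.integers(0, k, size=len(base))]
    gap = rng.random((len(base), 1))
    synthetic = X_min[base] + gap * (X_min[mate] - X_min[base])
    X_bal = np.vstack([X_maj, X_min, synthetic])
    y_bal = np.concatenate([np.full(target, 1 - minority), np.full(target, minority)])
    return X_bal, y_bal


rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in for a defect dataset (about 10% "defective").
X_raw, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_redundant=3, weights=[0.9, 0.1], random_state=0)
df = pd.DataFrame(X_raw, columns=[f"m{i}" for i in range(10)])
df["m_dup"] = df["m0"] * 2.0 + 1.0  # perfectly correlated copy, should be dropped

# Step 1: remove highly correlated features, then scale to [0, 1].
df = drop_correlated(df)
X = MinMaxScaler().fit_transform(df)

# Step 2: balance the classes.
X_bal, y_bal = hybrid_sample(X, y, rng)

# Step 3: cross-validated Random Forest, then rank features two ways.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(forest, X_bal, y_bal, cv=folds)
forest.fit(X_bal, y_bal)
chi_scores, _ = chi2(X_bal, y_bal)  # chi2 needs non-negative inputs, hence the scaling
ranking = pd.DataFrame({"rf_importance": forest.feature_importances_,
                        "chi2": chi_scores},
                       index=df.columns).sort_values("rf_importance", ascending=False)

print(ranking)
print("mean CV accuracy:", scores.mean())
```

The sketch resamples before cross-validation for brevity. Because synthetic points derived from a sample can then land in a different fold than that sample, the resulting accuracy estimate may be optimistic; resampling only inside each training fold is the stricter design.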
Acknowledgements
This research work is supported by research grant RDU1703236 funded by Universiti Malaysia Pahang, https://www.ump.edu.my/. The authors would also like to thank the Faculty of Electrical & Electronics Engineering, Universiti Malaysia Pahang for financial support.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Hossen, M.A. et al. (2020). Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction. In: Kasruddin Nasir, A.N., et al. InECCE2019. Lecture Notes in Electrical Engineering, vol 632. Springer, Singapore. https://doi.org/10.1007/978-981-15-2317-5_46
DOI: https://doi.org/10.1007/978-981-15-2317-5_46
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2316-8
Online ISBN: 978-981-15-2317-5
eBook Packages: Engineering, Engineering (R0)