Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction

Hossen, Md Anwar; Islam, Md. Shariful; Yusof, Nurhafizah Abu Talip; Rahman, Md. Sakib; Siddika, Fatema; Rahman, Mostafijur; Khatun, Sabira; Karim, Mohamad Shaiful Abdul; Mahmud, S. M. Hasan

doi:10.1007/978-981-15-2317-5_46

Conference paper
First Online: 24 March 2020

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 632)

Abstract

Software has become an indispensable part of human life. In the current computing era, large-scale complex network systems and millions of modern devices produce a huge amount of data every second, and a relatively large share of this data is imbalanced. Imbalanced data misleads machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The cost of finding and fixing defects is estimated at billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it needs fine-tuning to reach the expected efficiency. In this chapter, we propose a new machine-learning-based model to predict software defects and to identify the key factors that may help software engineers locate the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling, and a hybrid sampling method are used to balance highly imbalanced datasets. Third, Random Forest importance and the Chi-square algorithm are used to find the factors that have a strong effect on software defects. Cross-validation is used to mitigate the overfitting problem. The scikit-learn library is used for the machine learning algorithms and the pandas library for data processing; Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of hybrid sampling and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%, demonstrating its superiority.
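The steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The 0.9 correlation threshold, the hand-written SMOTE-style interpolation, the definition of "hybrid" as undersampling the majority and oversampling the minority to a common size, the choice to resample only the training folds, and the synthetic dataset are all assumptions made here. The function names (`drop_correlated`, `smote`, `hybrid_sample`, `evaluate`, `rank_features`) are hypothetical. Only scikit-learn and pandas are named in the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler


def drop_correlated(df, threshold=0.9):
    """Drop the later feature of every pair with |Pearson r| > threshold (assumed cut-off)."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)


def smote(X_min, n_new, k=5, rng=None):
    """SMOTE-style oversampling: interpolate between a minority point and a near neighbour."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)
    other = neigh[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[other] - X_min[base])


def hybrid_sample(X, y, rng=0):
    """Assumed hybrid scheme: shrink class 0 (majority) and grow class 1 (minority) to the mean size."""
    rng = np.random.default_rng(rng)
    X_maj, X_min = X[y == 0], X[y == 1]
    target = (len(X_maj) + len(X_min)) // 2
    keep = rng.choice(len(X_maj), size=target, replace=False)
    X_new = smote(X_min, target - len(X_min), rng=rng)
    X_bal = np.vstack([X_maj[keep], X_min, X_new])
    y_bal = np.r_[np.zeros(target, dtype=int), np.ones(target, dtype=int)]
    return X_bal, y_bal


def evaluate(df, y, n_splits=5, seed=0):
    """Stratified k-fold accuracy; scaling and resampling are fitted on training folds only."""
    X = drop_correlated(df).to_numpy()
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in cv.split(X, y):
        scaler = MinMaxScaler().fit(X[train])
        X_tr, y_tr = hybrid_sample(scaler.transform(X[train]), y[train], rng=seed)
        model = RandomForestClassifier(n_estimators=100, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(scaler.transform(X[test]), y[test]))
    return float(np.mean(scores))


def rank_features(df, y, seed=0):
    """Score each retained feature by Random Forest importance and by the chi-square statistic."""
    kept = drop_correlated(df)
    X = MinMaxScaler().fit_transform(kept)  # chi2 requires non-negative inputs
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    chi_scores, _ = chi2(X, y)
    return pd.DataFrame(
        {"rf_importance": rf.feature_importances_, "chi2": chi_scores},
        index=kept.columns,
    )


# Synthetic, imbalanced stand-in data (about 85% class 0); not the dataset used in the chapter.
X_raw, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=2,
    weights=[0.85, 0.15], random_state=0,
)
df = pd.DataFrame(X_raw, columns=[f"m{i}" for i in range(8)])

if __name__ == "__main__":
    print("mean CV accuracy:", evaluate(df, y))
    print(rank_features(df, y).sort_values("rf_importance", ascending=False))
```

Resampling inside each training fold keeps synthetic points out of the validation data. The abstract does not say whether the chapter does this, so the accuracy printed here is not comparable to the reported 93.26%.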

Acknowledgements

This research work is supported by research grant RDU1703236 funded by Universiti Malaysia Pahang, https://www.ump.edu.my/. The authors would also like to thank the Faculty of Electrical & Electronics Engineering, Universiti Malaysia Pahang for financial support.

Author information

Authors and Affiliations

Department of Software Engineering, Daffodil International University, Dhaka, Bangladesh
Md Anwar Hossen, Md. Shariful Islam, Md. Sakib Rahman, Mostafijur Rahman & S. M. Hasan Mahmud
Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang, Pekan, Malaysia
Nurhafizah Abu Talip Yusof, Sabira Khatun & Mohamad Shaiful Abdul Karim
Department of Computer Science and Engineering, Jagannath University, Dhaka, Bangladesh
Fatema Siddika

Corresponding author

Correspondence to Md Anwar Hossen.

Editor information

Editors and Affiliations

Faculty of Electrical and Electronics Engineering, Universiti Malaysia Pahang, Pekan, Pahang, Malaysia
Ahmad Nor Kasruddin Nasir, Mohd Ashraf Ahmad, Muhammad Sharfi Najib, Yasmin Abdul Wahab, Nur Aqilah Othman, Nor Maniha Abd Ghani, Addie Irawan, Sabira Khatun, Raja Mohd Taufika Raja Ismail, Mohd Mawardi Saari, Mohd Razali Daud & Ahmad Afif Mohd Faudzi

About this paper

Cite this paper

Hossen, M.A. et al. (2020). Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction. In: Kasruddin Nasir, A.N., et al. InECCE2019. Lecture Notes in Electrical Engineering, vol 632. Springer, Singapore. https://doi.org/10.1007/978-981-15-2317-5_46

DOI: https://doi.org/10.1007/978-981-15-2317-5_46
Published: 24 March 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2316-8
Online ISBN: 978-981-15-2317-5
eBook Packages: Engineering, Engineering (R0)
