InECCE2019 pp 541–553

Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction

  • Conference paper

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 632)

Abstract

Software has become an indispensable part of human life. In the current computing era, large-scale complex network systems and millions of modern devices produce a huge amount of data every second, and a considerable share of these data is imbalanced. Imbalanced data mislead machine learning models. Software Defect Prediction (SDP) is one of the most helpful activities during the testing phase. The cost of finding and fixing defects is estimated at billions of pounds per year. Software defect prediction has emerged to reduce this cost, but it still needs fine-tuning to reach the expected efficiency. In this chapter, we propose a new machine learning model that predicts software defects and identifies the key factors that may help software engineers locate the most defect-prone parts of a system. The proposed model works as follows. First, highly correlated features are removed and all remaining features are brought to the same scale using feature scaling. Second, the Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Synthetic (ADASYN) sampling, and a hybrid sampling method are used to balance highly imbalanced datasets. Third, Random Forest importance and the Chi-square algorithm are used to find the factors that have a strong effect on software defects. Cross-validation is used to mitigate overfitting. The Scikit-learn library is used for the machine learning algorithms, and the Pandas library is used for data processing. Matplotlib and PyPlot are used for graphs and data visualization, respectively. The combination of the hybrid sampling method and the Random Forest (RF) algorithm achieved the highest prediction accuracy, about 93.26%.
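The balancing step can be illustrated with a short sketch. It is not the authors' implementation: it assumes that "hybrid sampling" means random undersampling of the majority class combined with SMOTE-style interpolation for the minority class, and it uses only the Python standard library rather than the Scikit-learn and Pandas stack named in the abstract. The function names, the `target` parameter, and the toy data are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by Euclidean distance,
        # excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def hybrid_sample(majority, minority, target, k=3, seed=0):
    """Assumed hybrid scheme: shrink the majority class to `target` samples
    and grow the minority class to `target` samples."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, min(target, len(majority)))
    n_new = max(0, target - len(minority))
    grown_minority = list(minority) + smote_like_oversample(minority, n_new, k, rng)
    return kept_majority, grown_minority


# Toy data: 20 non-defective modules versus 4 defective ones (made-up values).
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(100.0, 1.0), (101.0, 2.0), (102.0, 1.5), (103.0, 2.5)]

kept, grown = hybrid_sample(majority, minority, target=10)
print(len(kept), len(grown))  # 10 10
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority class already occupies. In a full pipeline, resampling would normally be applied to the training folds only, so that the cross-validation estimate is not computed on synthetic samples.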

Acknowledgements

This research work is supported by research grant RDU1703236 funded by Universiti Malaysia Pahang, https://www.ump.edu.my/. The authors would also like to thank the Faculty of Electrical & Electronics Engineering, Universiti Malaysia Pahang for financial support.

Corresponding author

Correspondence to Md Anwar Hossen.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Hossen, M.A. et al. (2020). Hybrid Sampling and Random Forest Based Machine Learning Approach for Software Defect Prediction. In: Kasruddin Nasir, A.N., et al. InECCE2019. Lecture Notes in Electrical Engineering, vol 632. Springer, Singapore. https://doi.org/10.1007/978-981-15-2317-5_46

  • DOI: https://doi.org/10.1007/978-981-15-2317-5_46

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-2316-8

  • Online ISBN: 978-981-15-2317-5

  • eBook Packages: Engineering, Engineering (R0)
