Training Data Selection Using Ensemble Dataset Approach for Software Defect Prediction

  • Conference paper
Cyber Security and Computer Science (ICONCS 2020)

Abstract

Cross-project defect prediction (CPDP) is used in software defect prediction (SDP) research because of the limitations of within-project defect prediction (WPDP). CPDP trains a machine learning model on data from one project to predict defects in another project. Because the source and target projects differ and may be structured differently, a given source-target pair is not always a good match. This study presents a categorical data set ensemble technique in which multiple data sets are aggregated to form the source data, instead of using a single data set. The method was evaluated on nine data sets taken from a publicly accessible repository, using two performance indicators. The data set ensemble approach improved prediction performance in more than 65% of combinations compared with traditional CPDP models. The results also show that train-test data set pairs from the same category (homogeneous) give high performance, whereas prediction performance mostly collapses for pairs from different categories. The proposed scheme is therefore recommended as an alternative for defect prediction that can improve on traditional cross-project SDP models in most cases.
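The aggregation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the category rule (a defect-ratio threshold), the nearest-centroid classifier, the function names, and all metric values are assumptions chosen only to keep the example self-contained. It pools every source data set that falls in a requested category and trains on the pooled rows instead of on a single source project.

```python
from statistics import mean

# Each data set is a list of (metrics, label) rows; label 1 = defective, 0 = clean.
# All values below are made up for illustration.
proj_a = [([10, 2], 0), ([12, 3], 0), ([40, 9], 1)]
proj_b = [([8, 1], 0), ([35, 8], 1), ([11, 2], 0), ([9, 2], 0)]
proj_c = [([50, 12], 1), ([45, 10], 1), ([10, 1], 0)]

def defect_ratio(dataset):
    """Fraction of defective rows in a data set."""
    return sum(label for _, label in dataset) / len(dataset)

def category(dataset, threshold=0.5):
    """Hypothetical category rule based on class counts (an assumption,
    not the rule used in the paper)."""
    return "mostly_defective" if defect_ratio(dataset) >= threshold else "mostly_clean"

def pool_sources(sources, wanted_category):
    """Aggregate all source data sets that belong to the wanted category."""
    pooled = []
    for dataset in sources:
        if category(dataset) == wanted_category:
            pooled.extend(dataset)
    return pooled

def train_centroids(rows):
    """Stand-in classifier: one mean metric vector per class."""
    centroids = {}
    for label in {lab for _, lab in rows}:
        vectors = [metrics for metrics, lab in rows if lab == label]
        centroids[label] = [mean(column) for column in zip(*vectors)]
    return centroids

def predict(centroids, metrics):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(metrics, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Pool the two "mostly_clean" projects and train on the aggregate.
pooled = pool_sources([proj_a, proj_b, proj_c], "mostly_clean")
model = train_centroids(pooled)
predictions = [predict(model, m) for m in ([9, 2], [42, 10])]
```

Any classifier could replace the nearest-centroid stand-in; the point of the sketch is only that the training set is the union of several same-category source data sets.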

Notes

  1. In this study, data sets are called homogeneous or heterogeneous based on the number of non-defective and defective class instances in a data set.

  2. http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/.

Author information

Corresponding author

Correspondence to Md Fahimuzzman Sohan.

Copyright information

© 2020 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

Cite this paper

Sohan, M.F., Kabir, M.A., Rahman, M., Hasan Mahmud, S.M., Bhuiyan, T. (2020). Training Data Selection Using Ensemble Dataset Approach for Software Defect Prediction. In: Bhuiyan, T., Rahman, M.M., Ali, M.A. (eds) Cyber Security and Computer Science. ICONCS 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-52856-0_19

  • DOI: https://doi.org/10.1007/978-3-030-52856-0_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-52855-3

  • Online ISBN: 978-3-030-52856-0

  • eBook Packages: Computer Science, Computer Science (R0)