Training Data Selection Using Ensemble Dataset Approach for Software Defect Prediction

Sohan, Md Fahimuzzman; Kabir, Md Alamgir; Rahman, Mostafijur; Hasan Mahmud, S. M.; Bhuiyan, Touhid

doi:10.1007/978-3-030-52856-0_19

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 325)

Included in the following conference series:

International Conference on Cyber Security and Computer Science

Abstract

Cross-project defect prediction (CPDP) is used in software defect prediction (SDP) research because of the limitations of within-project defect prediction (WPDP). CPDP trains a machine learning model on data from one project to predict defects in another. Because the source and target projects differ, and their data can be structured differently, a given source-target pair is not always a good match. This study presents a categorical data set ensemble technique in which multiple data sets are aggregated to form the source data instead of using a single data set. The method is evaluated on nine data sets from a publicly accessible repository using two performance indicators. The data set ensemble approach improves prediction performance in over 65% of the combinations compared with traditional CPDP models. The results also show that train-test pairs from the same category (homogeneous) perform well, whereas prediction performance mostly collapses when the data sets come from different categories. The proposed scheme is therefore recommended as an alternative that can improve defect prediction in most cases compared with traditional cross-project SDP models.
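The core idea of aggregating several source data sets before training can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, the metrics, or the two performance indicators, so the nearest-centroid classifier, the two made-up features, the toy rows, and the accuracy score below are all assumptions chosen only to keep the example self-contained.

```python
from statistics import mean

def pool_sources(sources):
    """Concatenate the rows of several source data sets into one training set."""
    pooled = []
    for rows in sources:
        pooled.extend(rows)
    return pooled

def fit_centroids(rows):
    """Compute one mean feature vector per class label (0 = clean, 1 = defective)."""
    centroids = {}
    for label in {y for _, y in rows}:
        members = [x for x, y in rows if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Toy rows of the form ([lines_of_code, coupling], defect_label).
# The feature names and values are hypothetical, not taken from the paper.
project_a = [([10.0, 1.0], 0), ([12.0, 2.0], 0), ([80.0, 9.0], 1)]
project_b = [([15.0, 1.0], 0), ([90.0, 8.0], 1), ([70.0, 7.0], 1)]
target = [([11.0, 1.0], 0), ([85.0, 9.0], 1)]

# Train on the pooled source projects, then predict the unseen target project.
model = fit_centroids(pool_sources([project_a, project_b]))
predictions = [predict(model, x) for x, _ in target]
accuracy = mean(int(p == y) for p, (_, y) in zip(predictions, target))
```

A traditional CPDP baseline in the same sketch would call `fit_centroids(project_a)` on a single source project; comparing the two scores per target is the kind of comparison the abstract reports.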

Notes

1. In this study, a data set is called homogeneous or heterogeneous based on the number of non-defective and defective class instances in the data set.
2. http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/

Author information

Authors and Affiliations

Department of Software Engineering, Daffodil International University, Dhaka, Bangladesh
Md Fahimuzzman Sohan, Md Alamgir Kabir, Mostafijur Rahman, S. M. Hasan Mahmud & Touhid Bhuiyan

Corresponding author

Correspondence to Md Fahimuzzman Sohan.

Editor information

Editors and Affiliations

Daffodil International University, Dhaka, Bangladesh
Touhid Bhuiyan
Daffodil International University, Dhaka, Bangladesh
Md. Mostafijur Rahman
Daffodil International University, Dhaka, Bangladesh
Md. Asraf Ali

About this paper

Cite this paper

Sohan, M.F., Kabir, M.A., Rahman, M., Hasan Mahmud, S.M., Bhuiyan, T. (2020). Training Data Selection Using Ensemble Dataset Approach for Software Defect Prediction. In: Bhuiyan, T., Rahman, M.M., Ali, M.A. (eds) Cyber Security and Computer Science. ICONCS 2020. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-52856-0_19

DOI: https://doi.org/10.1007/978-3-030-52856-0_19
Published: 30 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-52855-3
Online ISBN: 978-3-030-52856-0
eBook Packages: Computer Science, Computer Science (R0)
