Abstract
Software fault prediction (SFP) aims to improve software quality at minimum cost and time. Various machine learning models have been proposed for predicting software faults. The performance of these models depends on dataset quality and can be improved by identifying and eliminating data quality issues. In this paper, we present a systematic literature review of data quality issues in SFP datasets. We selected 145 primary studies published until November 2021 and analyzed them from five perspectives: data quality issue, pre-processing technique, modeling technique, dataset, and performance measures used. The findings indicate that data dimensionality, class imbalance, and their combination have been heavily studied in the literature. However, issues such as class overlap and missing data are also pertinent to SFP datasets and need further investigation. The effect of resolving one data quality issue relative to others remains unexplored. C4.5, naive Bayes, multilayer perceptron, support vector machine, and random forest are the classifiers most frequently used by researchers. However, researchers should know how sensitive these classifiers are to a particular data quality issue and select them accordingly. The PROMISE datasets have been used extensively in SFP. Accuracy, precision, recall, and area under the curve are the common performance measures. We suggest employing unbiased and stable performance measures such as the Matthews correlation coefficient for model evaluation. The survey concludes that data quality issues in SFP datasets degrade classifier performance and that there is scope for further research on data quality issues.
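To illustrate why a measure such as the Matthews correlation coefficient may be preferable to accuracy on class-imbalanced fault data, the following minimal sketch compares the two on confusion-matrix counts. The counts, the variable names `trivial` and `useful`, and the helper functions are hypothetical and chosen only for illustration; they are not taken from any of the reviewed studies.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of modules classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention
    (assumed here) for degenerate confusion matrices.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical project: 1000 modules, 50 of them faulty (5% minority class).
# A trivial classifier that labels every module as non-faulty:
trivial = dict(tp=0, fp=0, tn=950, fn=50)
# A classifier that finds 30 of the 50 faulty modules with 40 false alarms:
useful = dict(tp=30, fp=40, tn=910, fn=20)

for name, cm in (("trivial", trivial), ("useful", useful)):
    print(f"{name}: accuracy={accuracy(**cm):.3f}  MCC={mcc(**cm):.3f}")
```

With these assumed counts, accuracy ranks the trivial classifier slightly above the useful one, whereas the Matthews correlation coefficient gives the trivial classifier a score of zero and the useful one a clearly positive score.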
References
Abaei G, Selamat A (2014) Increasing the accuracy of software fault prediction using majority ranking fuzzy clustering. Int J Softw Innov 2:60–71. https://doi.org/10.4018/ijsi.2014100105
Adrion WR, Branstad MA, Cherniavsky JC (1982) Validation, verification, and testing of computer software. ACM Comput Surv 14:159–192. https://doi.org/10.1145/356876.356879
Agrawal A, Menzies T (2018) Is “better data” better than “better data miners”? In: Proceedings of the 40th International conference on software engineering. ACM, New York, pp 1050–1061
Alan O, Catal C (2009) An outlier detection algorithm based on object-oriented metrics thresholds. 2009 24th Int Symp Comput Inf Sci (ISC 2009), pp 567–570. https://doi.org/10.1109/ISCIS.2009.5291882
Alan O, Catal C (2011) Thresholds based outlier detection approach for mining class outliers: An empirical case study on software measurement datasets. Expert Syst Appl 38:3440–3445. https://doi.org/10.1016/j.eswa.2010.08.130
Alsawalqah H, Faris H, Aljarah I, Alnemer L (2017) Software engineering trends and techniques in intelligent systems. Springer, Cham
Altidor W, Khoshgoftaar TM, Napolitano A (2009) Wrapper-based feature ranking for software engineering metrics. In: 8th Int conf mach learn appl (ICMLA 2009), pp 241–246. https://doi.org/10.1109/ICMLA.2009.17
Anbu M, Anandha Mala GS (2019) Feature selection using firefly algorithm in software defect prediction. Cluster Comput 22:10925–10934. https://doi.org/10.1007/s10586-017-1235-3
Antoine JY, Villaneau J, Lefeuvre A (2014) Weighted Krippendorff’s alpha is a more reliable metrics for multicoders ordinal annotations: experimental studies on emotion, opinion and coreference annotation. In: 14th Conf Eur Chapter Assoc Comput Linguist 2014 (EACL 2014), pp 550–559. https://doi.org/10.3115/v1/e14-1058
Arar ÖF, Ayan K (2017) A feature dependent Naive Bayes approach and its application to the software defect prediction problem. Appl Soft Comput J 59:197–209. https://doi.org/10.1016/j.asoc.2017.05.043
Arisholm E, Briand LC, Johannessen EB (2010) A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J Syst Softw 83:2–17. https://doi.org/10.1016/j.jss.2009.06.055
Armah GK, Luo G, Qin K (2013) Multi-level data pre-processing for software defect prediction. In: Proc 2013 6th Int Conf Inf Manag Innov Manag Ind Eng (ICIII 2013), vol 2, pp 170–174. https://doi.org/10.1109/ICIII.2013.6703111
Arshad A, Riaz S, Jiao L, Murthy A (2018) Semi-supervised deep fuzzy C-mean clustering for software fault prediction. IEEE Access 6:25675–25685. https://doi.org/10.1109/ACCESS.2018.2866082
Azeem MI, Palomba F, Shi L, Wang Q (2019) Machine learning techniques for code smell detection: a systematic literature review and meta-analysis. Inf Softw Technol 108:115–138. https://doi.org/10.1016/j.infsof.2018.12.009
Aziz SR, Khan TA, Nadeem A (2021) Exclusive use and evaluation of inheritance metrics viability in software fault prediction—an experimental study. PeerJ Comput Sci 7:1–47. https://doi.org/10.7717/PEERJ-CS.563
Bal PR, Kumar S (2020) WR-ELM: weighted regularization extreme learning machine for imbalance learning in software fault prediction. IEEE Trans Reliab 69:1355–1375. https://doi.org/10.1109/TR.2020.2996261
Banga M, Bansal A (2020) Proposed software faults detection using hybrid approach. Secur Priv. https://doi.org/10.1002/spy2.103
Batool I, Khan TA (2022) Software fault prediction using data mining, machine learning and deep learning techniques: a systematic literature review. Comput Electr Eng 100:107886. https://doi.org/10.1016/j.compeleceng.2022.107886
Beecham S, Hall T, Bowes D et al (2010) A systematic review of fault prediction approaches used in software engineering. Limerick, Ireland: The Irish Software Engineering Research Centre.
Bejjanki KK, Gyani J, Gugulothu N (2020) Class imbalance reduction (CIR): a novel approach to software defect prediction in the presence of class imbalance. Symmetry (Basel). https://doi.org/10.3390/sym12030407
Bennin KE, Keung J, Phannachitta P et al (2018) MAHAKIL: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. IEEE Trans Softw Eng 44:534–550. https://doi.org/10.1109/TSE.2017.2731766
Biolchini J, Mian PG, Natali ACC, Travassos GH (2005) Systematic review in software engineering. System engineering and computer science department COPPE/UFRJ, Technical Report ES, 679(05), 45.
Boehm B, Basili V (2001) Software Defect Reduction Top 10 List, vol 34. Computer (Long Beach Calif), pp 135–137
Boetticher GD (2005) Nearest neighbor sampling for better defect prediction. ACM SIGSOFT Softw Eng Notes 30:1–6. https://doi.org/10.1145/1082983.1083173
Borandag E, Ozcift A, Kilinc D, Yucalar F (2019) Majority vote feature selection algorithm in software fault prediction. Comput Sci Inf Syst 16:515–539. https://doi.org/10.2298/CSIS180312039B
Bosu MF, Macdonell SG (2013) A taxonomy of data quality challenges in empirical software engineering. In: Proceedings of Australasian software engineering conference (ASWEC), pp 97–106. https://doi.org/10.1109/ASWEC.2013.21
Bowes D, Hall T, Petrić J (2018) Software defect prediction: do different classifiers find the same defects? Softw Qual J 26:525–552. https://doi.org/10.1007/s11219-016-9353-3
Brereton P, Kitchenham BA, Budgen D et al (2007) Lessons from applying the systematic literature review process within the software engineering domain. J Syst Softw 80:571–583. https://doi.org/10.1016/j.jss.2006.07.009
Brezočnik L, Podgorelec V (2019) Applying weighted particle swarm optimization to imbalanced data in software defect prediction. In: Karabegović I (ed) New technologies, development and applications. Springer, Cham, pp 289–296
Catal C (2011) Software fault prediction: a literature review and current trends. Expert Syst Appl 38:4626–4636. https://doi.org/10.1016/j.eswa.2010.10.024
Catal C, Diri B (2009) A systematic review of software fault prediction studies. Expert Syst Appl 36:7346–7354. https://doi.org/10.1016/j.eswa.2008.10.027
Catal C, Alan O, Balkan K (2011) Class noise detection based on software metrics and ROC curves. Inf Sci (NY) 181:4867–4877. https://doi.org/10.1016/j.ins.2011.06.017
Chakraborty T, Chakraborty AK (2021) Hellinger net: a hybrid imbalance learning model to improve software defect prediction. IEEE Trans Reliab 70:481–494. https://doi.org/10.1109/TR.2020.3020238
Chen J, Liu S, Chen X et al (2013) Empirical studies on feature selection for software fault prediction. In: Proceedings of the 5th Asia-Pacific symposium on internetware. ACM, New York, pp 1–4
Chen J, Liu S, Liu W et al (2014) A two-stage data preprocessing approach for software fault prediction. In: Proceedings of 8th international conference on software security and reliability (SERE), pp 20–29. https://doi.org/10.1109/SERE.2014.15
Chen X, Shen Y, Cui Z, Ju X (2017) Applying feature selection to software defect prediction using multi-objective optimization. In: Proceedings of international computer software and applications conference, vol 2, pp 54–59. https://doi.org/10.1109/COMPSAC.2017.65
Chen L, Fang B, Shang Z, Tang Y (2018) Tackling class overlap and imbalance problems in software defect prediction. Softw Qual J 26:97–125. https://doi.org/10.1007/s11219-016-9342-6
Choeikiwong T, Vateekul P (2015) Software defect prediction in imbalanced data sets using unbiased support vector machine. In: Kim KJ (ed) Lecture notes in electrical engineering. Springer, Berlin, pp 923–931
Choirunnisa S, Meidyani B, Rochimah S (2018) Software defect prediction using oversampling algorithm: A-SUWO. In: 2018 Electrical Power, Electronics, Communications, Control and Informatics Seminar (EECCIS 2018), pp 337–341. https://doi.org/10.1109/EECCIS.2018.8692874
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37–46
Cornelissen B, Zaidman A, van Deursen A et al (2009) A systematic survey of program comprehension through dynamic analysis. IEEE Trans Softw Eng 35:684–702. https://doi.org/10.1109/TSE.2009.28
Dhamayanthi N, Lavanya B (2019a) Software defect prediction using principal component analysis and naïve Bayes algorithm. Springer, Singapore
Dhamayanthi N, Lavanya B (2019b) Improvement in software defect prediction outcome using principal component analysis and ensemble machine learning algorithms. In: Lecture notes on data engineering and communications technologies. Springer, Cham, pp 397–406
Du Y, Zhang L, Shi J, et al (2018) Feature-grouping-based two steps feature selection algorithm in software defect prediction. In: ACM international conference proceeding series, pp 173–178
Dybå T, Dingsöyr T, Hanssen G. (2007) Applying systematic reviews to diverse study types: an experience report. In: Proceedings of international symposium on empirical software engineering and measurement conference, pp 225–234. https://doi.org/10.1109/ESEM.2007.59
Eivazpour Z, Keyvanpour MR (2019) Improving performance in software defect prediction using variational autoencoder. In: 2019 IEEE 5th conference on knowledge-based engineering and innovation (KBEI 2019), pp 644–649. https://doi.org/10.1109/KBEI.2019.8734915
Ekanayake J, Tappolet J, Gall HC, Bernstein A (2009) Tracking concept drift of software projects using defect prediction quality. In: Proceedings of 2009 6th IEEE international working conference on mining software repositories (MSR 2009), pp 51–60. https://doi.org/10.1109/MSR.2009.5069480
El-Shorbagy SA, El-Gammal WM, Abdelmoez WM (2018) Using SMOTE and heterogeneous stacking in ensemble learning for software defect prediction. In: Proceedings of the 7th international conference on software and information engineering—ICSIE ’18. ACM, New York, pp 44–47
Feng S, Keung J, Liu J et al (2021a) ROCT: Radius-based class overlap cleaning technique to alleviate the class overlap problem in software defect prediction. In: Proceedings of 2021a IEEE 45th annual computer software and applications conference (COMPSAC 2021), pp 228–237. https://doi.org/10.1109/COMPSAC51774.2021.00041
Feng S, Keung J, Yu X et al (2021b) COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction. Inf Softw Technol 129:106432. https://doi.org/10.1016/j.infsof.2020.106432
Galin D (2004) Software quality assurance: from theory to implementation. Pearson-Addison Wesley, New York
Galinac Grbac T, Runeson P, Huljenić D (2013) A second replicated quantitative analysis of fault distributions in complex software systems. IEEE Trans Softw Eng 39:462–476. https://doi.org/10.1109/TSE.2012.46
Gao K, Khoshgoftaar TM (2011) Software defect prediction for high-dimensional and class-imbalanced data. In: SEKE 2011—Proceedings of 23rd international conference on software engineering and knowledge engineering, pp 89–94
Gao K, Khoshgoftaar TM, Wang H, Seliya N (2011) Choosing software metrics for defect prediction: an investigation on feature selection techniques. Softw Pract Exp 41:579–606. https://doi.org/10.1002/spe.1043
Gao K, Khoshgoftaar TM, Napolitano A (2012a) A hybrid approach to coping with high dimensionality and class imbalance for software defect prediction. In: 2012a 11th International conference on machine learning and applications. IEEE, pp 281–288
Gao K, Khoshgoftaar TM, Seliya N (2012b) Predicting high-risk program modules by selecting the right software measurements. Softw Qual J 20:3–42. https://doi.org/10.1007/s11219-011-9132-0
Gao K, Khoshgoftaar TM, Wald R (2014) The use of under-and oversampling within ensemble feature selection and classification for software quality prediction. Int J Reliab Qual Saf Eng 21:1450004. https://doi.org/10.1142/S0218539314500041
Gao K, Khoshgoftaar TM, Napolitano A (2015a) Aggregating data sampling with feature subset selection to address skewed software defect data. Int J Softw Eng Knowl Eng 25:1531–1550. https://doi.org/10.1142/S0218194015400318
Gao K, Khoshgoftaar TM, Napolitano A (2015b) Investigating two approaches for adding feature ranking to sampled ensemble learning for software quality estimation. Int J Softw Eng Knowl Eng 25:115–146. https://doi.org/10.1142/S0218194015400069
Gayatri N, Nickolas S, Reddy AV (2012) ANOVA discriminant analysis for features selected through decision tree induction method. In: Communications in computer and information science, pp 61–70
Ghosh S, Rana A, Kansal V (2018) A nonlinear manifold detection based model for software defect prediction. Procedia Comput Sci 132:581–594. https://doi.org/10.1016/j.procs.2018.05.012
Gondra I (2008) Applying machine learning to software fault-proneness prediction. J Syst Softw 81:186–195. https://doi.org/10.1016/j.jss.2007.05.035
Gong L, Jiang S, Jiang L (2019a) Tackling class imbalance problem in software defect prediction through cluster-based over-sampling with filtering. IEEE Access 7:145725–145737. https://doi.org/10.1109/ACCESS.2019.2945858
Gong L, Jiang S, Wang R, Jiang L (2019b) Empirical evaluation of the impact of class overlap on software defect prediction. In: Proceedings of 2019b 34th IEEE/ACM international conference on automated software engineering (ASE 2019), pp 698–709. https://doi.org/10.1109/ASE.2019.0071
Gong L, Jiang S, Bo L et al (2020) A novel class-imbalance learning approach for both within-project and cross-project defect prediction. IEEE Trans Reliab 69:40–54. https://doi.org/10.1109/TR.2019.2895462
Goyal S (2021a) Handling class-imbalance with KNN (neighbourhood) under-sampling for software defect prediction. Artif Intell Rev. https://doi.org/10.1007/s10462-021-10044-w
Goyal S (2021b) Predicting the defects using stacked ensemble learner with filtered dataset. Autom Softw Eng 28:1–81. https://doi.org/10.1007/s10515-021-00285-y
Gray D, Bowes D, Davey N et al (2011) The misuse of the NASA Metrics Data Program data sets for automated software defect prediction. IET Semin Dig 2011:96–103. https://doi.org/10.1049/ic.2011.0012
Gray D, Bowes D, Davey N et al (2012) Reflections on the NASA MDP data sets. IET Softw 6:549–558. https://doi.org/10.1049/iet-sen.2011.0132
Guo S, Dong J, Li H, Wang J (2021) Software defect prediction with imbalanced distribution by radius-synthetic minority over-sampling technique. J Softw Evol Process 33:1–21. https://doi.org/10.1002/smr.2362
Gupta S, Gupta A (2017) A set of measures designed to identify overlapped instances in software defect prediction. Computing 99:889–914. https://doi.org/10.1007/s00607-016-0538-1
Hadi NT, Rochimah S (2018) Enhancing software defect prediction using principle component analysis and self-organizing map. In: 2018 Electr Power. Electron Commun Control Informatics Semin (EECCIS 2018), pp 320–325. https://doi.org/10.1109/EECCIS.2018.8692889
Hall T, Beecham S, Bowes D et al (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Trans Softw Eng 38:1276–1304. https://doi.org/10.1109/TSE.2011.103
Hassouneh Y, Turabieh H, Thaher T et al (2021) Boosted whale optimization algorithm with natural selection operators for software fault prediction. IEEE Access 9:14239–14258. https://doi.org/10.1109/ACCESS.2021.3052149
He H, Zhang X, Wang Q et al (2019) Ensemble multiboost based on RIPPER classifier for prediction of imbalanced software defect data. IEEE Access 7:110333–110343. https://doi.org/10.1109/access.2019.2934128
Hosseini S, Turhan B, Gunarathna D (2019) A systematic literature review and meta-analysis on cross project defect prediction. IEEE Trans Softw Eng 45:111–147. https://doi.org/10.1109/TSE.2017.2770124
Huang J, Sun H (2016) Grey Relational analysis based k nearest neighbor missing data imputation for software quality datasets. In: Proc - 2016 IEEE Int Conf Softw Qual Reliab Secur (QRS 2016), pp 86–91. https://doi.org/10.1109/QRS.2016.20
Huang J, Keung JW, Sarro F et al (2017) Cross-validation based K nearest neighbor imputation for software quality datasets: an empirical study. J Syst Softw 132:226–252. https://doi.org/10.1016/j.jss.2017.07.012
Huda S, Liu K, Abdelrazek M et al (2018) An ensemble oversampling model for class imbalance problem in software defect prediction. IEEE Access 6:24184–24195. https://doi.org/10.1109/ACCESS.2018.2817572
Ibarguren I, Perez JM, Mugerza J et al (2017) The Consolidated Tree Construction algorithm in imbalanced defect prediction datasets. In: 2017 IEEE Congr Evol Comput (CEC 2017) - Proc, pp 2656–2660. https://doi.org/10.1109/CEC.2017.7969629
Jayanthi R, Florence L (2019) Software defect prediction techniques using metrics based on neural network classifier. Clust Comput 22:77–88. https://doi.org/10.1007/s10586-018-1730-1
Ji H, Huang S, Wu Y et al (2017) A new attribute selection method based on maximal information coefficient and automatic clustering. In: 2017 International conference on dependable systems and their applications (DSA). IEEE, pp 22–28
Jian Y, Yu X, Xu Z, Ma Z (2019) A hybrid feature selection method for software fault prediction. IEICE Trans Inf Syst E102D:1966–1975. https://doi.org/10.1587/transinf.2019EDP7033
Jiang Y, Li M, Zhou ZH (2011) Software defect detection with Rocus. J Comput Sci Technol 26:328–342. https://doi.org/10.1007/s11390-011-9439-0
Johnson JM, Khoshgoftaar TM (2019) Survey on deep learning with class imbalance. J Big Data. https://doi.org/10.1186/s40537-019-0192-5
Jing XY, Wu F, Dong X, Xu B (2017) An improved SDA based defect prediction framework for both within-project and cross-project class-imbalance problems. IEEE Trans Softw Eng 43:321–339. https://doi.org/10.1109/TSE.2016.2597849
Johnson AM, Malek M (1988) Survey of software tools for evaluating reliability, availability, and serviceability. ACM Comput Surv 20:227–269. https://doi.org/10.1145/50020.50062
Joon A, Tyagi RK, Kumar K (2020) Noise filtering and imbalance class distribution removal for optimizing software fault prediction using best software metrics suite. In: Proceedings of the 5th international conference on communication and electronics systems (ICCES 2020), pp 1381–1389
Juneja K (2019) A fuzzy-filtered neuro-fuzzy framework for software fault prediction for inter-version and inter-project evaluation. Appl Soft Comput J 77:696–713. https://doi.org/10.1016/j.asoc.2019.02.008
Kalsoom A, Maqsood M, Ghazanfar MA et al (2018) A dimensionality reduction-based efficient software fault prediction using Fisher linear discriminant analysis (FLDA). J Supercomput 74:4568–4602. https://doi.org/10.1007/s11227-018-2326-5
Kaur S, Singh P (2019) How does object-oriented code refactoring influence software quality? Research landscape and challenges. J Syst Softw. https://doi.org/10.1016/j.jss.2019.110394
Khoshgoftaar TM, Gao K (2009) Feature selection with imbalanced data for software defect prediction. In: 8th Int Conf Mach Learn Appl (ICMLA 2009), pp 235–240. https://doi.org/10.1109/ICMLA.2009.18
Khoshgoftaar TM, Rebours P (2004) Generating multiple noise elimination filters with the ensemble- partitioning filter. In: Proc 2004 IEEE Int Conf Inf Reuse Integr (IRI-2004), pp 369–375. https://doi.org/10.1109/iri.2004.1431489
Khoshgoftaar TM, Rebours P (2007) Improving software quality prediction by noise filtering techniques. J Comput Sci Technol 22:387–396. https://doi.org/10.1007/s11390-007-9054-2
Khoshgoftaar TM, Seliya N, Gao K (2004) Rule-based noise detection for software measurement data. In: Proc 2004 IEEE Int Conf Inf Reuse Integr (IRI-2004), pp 302–307. https://doi.org/10.1109/iri.2004.1431478
Khoshgoftaar TM, Bullard LA, Gao K (2009) Attribute selection using rough sets in software quality classification. Int J Reliab Qual Saf Eng 16:73–89. https://doi.org/10.1142/S0218539309003307
Khoshgoftaar TM, Gao K, Seliya N (2010) Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction. In: 2010 22nd IEEE International Conference on Tools with Artificial Intelligence. IEEE, pp 137–144
Khoshgoftaar TM, Gao K, Napolitano A (2014a) Improving software quality estimation by combining feature selection strategies with sampled ensemble learning. In: Proc 2014 IEEE 15th Int Conf Inf Reuse Integr IEEE (IRI 2014), pp 428–433. https://doi.org/10.1109/IRI.2014.7051921
Khoshgoftaar TM, Gao K, Napolitano A, Wald R (2014b) A comparative study of iterative and non-iterative feature selection techniques for software defect prediction. Inf Syst Front 16:801–822. https://doi.org/10.1007/s10796-013-9430-0
Khuat TT, Le MH (2019) Binary teaching–learning-based optimization algorithm with a new update mechanism for sample subset optimization in software defect prediction. Soft Comput 23:9919–9935. https://doi.org/10.1007/s00500-018-3546-6
Khurma RA, Alsawalqah H, Aljarah I et al (2021) An enhanced evolutionary software defect prediction method using Island Moth Flame optimization. Mathematics 9:1722
Kim S, Whitehead EJ, Zhang Y (2008) Classifying software changes: clean or buggy? IEEE Trans Softw Eng 34:181–196. https://doi.org/10.1109/TSE.2007.70773
Kim S, Zhang H, Wu R, Gong L (2011) Dealing with noise in defect prediction. In: Proceedings of international conference on software engineering. IEEE, pp 481–490
Kim SY, Gu S, Jeong HH, Sohn KA (2015) A network clustering based software attribute selection for identifying fault-prone modules. In: 2015 5th Int Conf IT Converg Secur (ICITCS 2015) - Proc, pp 1–5. https://doi.org/10.1109/ICITCS.2015.7292921
Kitchenham B, Brereton P (2013) A systematic review of systematic review process research in software engineering. Inf Softw Technol 55:2049–2075. https://doi.org/10.1016/j.infsof.2013.07.010
Kitchenham B, Charters S (2007) Guidelines for performing systematic literature reviews in software engineering. Keele University and Durham University Joint Report
Kitchenham B, Pearl Brereton O, Budgen D et al (2009) Systematic literature reviews in software engineering—a systematic literature review. Inf Softw Technol 51:7–15. https://doi.org/10.1016/j.infsof.2008.09.009
Kumar L, Sripada SK, Sureka A, Rath SK (2018a) Effective fault prediction model developed using Least Square Support Vector Machine (LSSVM). J Syst Softw 137:686–712. https://doi.org/10.1016/j.jss.2017.04.016
Kumar L, Tirkey A, Rath S-K (2018b) An effective fault prediction model developed using an extreme learning machine with various kernel methods. Front Inf Technol Electron Eng 19:864–888. https://doi.org/10.1631/FITEE.1601501
Kundu D, Sarma M, Samanta D, Mall R (2009) System testing for object-oriented systems with test case prioritization. Softw Test Verif Reliab 19:297–333. https://doi.org/10.1002/stvr.407
Kutlubay O, Turhan B, Bener AB (2007) A two-step model for defect density estimation. In: EUROMICRO 2007 - Proc 33rd EUROMICRO Conf Softw Eng Adv Appl (SEAA 2007), pp 322–329. https://doi.org/10.1109/EUROMICRO.2007.13
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174. https://doi.org/10.2307/2529310
Laradji IH, Alshayeb M, Ghouti L (2015) Software defect prediction using ensemble learning on selected features. Inf Softw Technol 58:388–402. https://doi.org/10.1016/j.infsof.2014.07.005
Li G, Wang S (2016) Oversampling boosting for classification of imbalanced software defect data. In: Chinese control conf (CCC 2016), August, pp 4149–4154. https://doi.org/10.1109/ChiCC.2016.7554000
Li Z, Jing XY, Zhu X (2018) Progress on approaches to software defect prediction. IET Softw 12:161–175. https://doi.org/10.1049/iet-sen.2017.0148
Li Z, Jing XY, Zhu X et al (2019) Heterogeneous defect prediction with two-stage ensemble learning. Autom Softw Eng 26:599–651. https://doi.org/10.1007/s10515-019-00259-1
Limsettho N, Bennin KE, Keung JW et al (2018) Cross project defect prediction using class distribution estimation and oversampling. Inf Softw Technol 100:87–102. https://doi.org/10.1016/j.infsof.2018.04.001
Liu M, Miao L, Zhang D (2014a) Two-stage cost-sensitive learning for software defect prediction. IEEE Trans Reliab 63:676–686. https://doi.org/10.1109/TR.2014.2316951
Liu S, Chen X, Liu W et al (2014b) FECAR: a feature selection framework for software defect prediction. In: Proceedings of international on computer software and applications conference, pp 426–435. https://doi.org/10.1109/COMPSAC.2014.66
Liu W, Liu S, Gu Q et al (2016) Empirical studies of a two-stage data preprocessing approach for software fault prediction. IEEE Trans Reliab 65:38–53. https://doi.org/10.1109/TR.2015.2461676
Lu H, Cukic B, Culp M (2014a) A semi-supervised approach to software defect prediction. In: Proc - Int Comput Softw Appl Conf, pp 416–425. https://doi.org/10.1109/COMPSAC.2014.65
Lu H, Kocaguneli E, Cukic B (2014b) Defect prediction between software versions with active learning and dimensionality reduction. In: Proc - Int Symp Softw Reliab Eng ISSRE, pp 312–322. https://doi.org/10.1109/ISSRE.2014.35
Ma Y, Pan W, Zhu S et al (2014) An improved semi-supervised learning method for software defect prediction. J Intell Fuzzy Syst 27:2473–2480. https://doi.org/10.3233/IFS-141220
Malhotra R (2015) A systematic review of machine learning techniques for software fault prediction. Appl Soft Comput J 27:504–518. https://doi.org/10.1016/j.asoc.2014.11.023
Malhotra R, Kamal S (2017) Tool to handle imbalancing problem in software defect prediction using oversampling methods. In: 2017 International conference on advances in computing, communications and informatics (ICACCI). IEEE, pp 906–912
Martins LEG, Gorschek T (2016) Requirements engineering for safety-critical systems: a systematic literature review. Inf Softw Technol 75:71–89. https://doi.org/10.1016/j.infsof.2016.04.002
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33:2–13. https://doi.org/10.1109/TSE.2007.256941
Menzies T, Turhan B, Bener A et al (2008) Implications of ceiling effects in defect predictors. In: PROMISE’08. ACM, New York, pp 47–54
Menzies T, Milton Z, Turhan B et al (2010) Defect prediction from static code features: current results, limitations, new approaches. Autom Softw Eng 17:375–407. https://doi.org/10.1007/s10515-010-0069-5
Mousavi R, Eftekhari M, Rahdari F (2018) Omni-ensemble learning (OEL): Utilizing over-bagging, static and dynamic ensemble selection approaches for software defect prediction. Int J Artif Intell Tools 27:1850024. https://doi.org/10.1142/S0218213018500240
Murillo-Morera J, Quesada-López C, Jenkins M (2015) Software fault prediction: a systematic mapping study. In: CIBSE 2015—XVIII Ibero-American Conf Softw Eng, pp 446–459
Nascimento AM, de Melo VV, Dias LAV, da Cunha AM (2018) Increasing the prediction quality of software defective modules with automatic feature engineering. In: Advances in intelligent systems and computing, pp 527–535
NezhadShokouhi MM, Majidi MA, Rasoolzadegan A (2020) Software defect prediction using over-sampling and feature extraction based on Mahalanobis distance. J Supercomput 76:602–635. https://doi.org/10.1007/s11227-019-03051-w
Ni C, Chen X, Wu F et al (2019) An empirical study on pareto based multi-objective feature selection for software defect prediction. J Syst Softw 152:215–238. https://doi.org/10.1016/j.jss.2019.03.012
Ozturk MM, Zengin A (2016) HSDD: a hybrid sampling strategy for class imbalance in defect prediction data sets. In: 2016 5th International conference on future communication technologies (FGCT). IEEE, pp 60–69
Öztürk MM, Zengin A (2016) How repeated data points affect bug prediction performance: a case study. Appl Soft Comput J 49:1051–1061. https://doi.org/10.1016/j.asoc.2016.08.002
Pachouly J, Ahirrao S, Kotecha K et al (2022) A systematic literature review on software defect prediction using artificial intelligence: datasets, data validation methods, approaches, and tools. Eng Appl Artif Intell 111:104773. https://doi.org/10.1016/j.engappai.2022.104773
Pandey SK, Mishra RB, Tripathi AK (2021) Machine learning based methods for software fault prediction: a survey. Expert Syst Appl 172:114595. https://doi.org/10.1016/j.eswa.2021.114595
Pandey SK, Mishra RB, Tripathi AK (2020) BPDET: an effective software bug prediction model using deep representation and ensemble learning techniques. Expert Syst Appl 144:113085. https://doi.org/10.1016/j.eswa.2019.113085
Pelayo L, Dick S (2007) Applying novel resampling strategies to software defect prediction. In: NAFIPS 2007—2007 annual meeting of the north american fuzzy information processing society. IEEE, pp 69–72
Petersen K, Ali NB (2011) Identifying strategies for study selection in systematic reviews and maps. In: Int Symp Empir Softw Eng Meas, pp 351–354. https://doi.org/10.1109/esem.2011.46
Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: an update. In: Information and software technology. Elsevier, Amsterdam, pp 1–18
Qiu S, Lu L, Jiang S, Guo Y (2019) An investigation of imbalanced ensemble learning methods for cross-project defect prediction. Int J Pattern Recognit Artif Intell. https://doi.org/10.1142/S0218001419590377
Radjenović D, Heričko M, Torkar R, Živkovič A (2013) Software fault prediction metrics: a systematic literature review. Inf Softw Technol 55:1397–1418
Rahman MH, Sharmin S, Sarwar SM, Shoyaib M (2016) Software defect prediction using feature space transformation. In: Proceedings of the international conference on internet of things and cloud computing. ACM, New York, pp 1–6
Rao KN, Reddy CS (2018) An efficient software defect analysis using correlation-based oversampling. Arab J Sci Eng 43:4391–4411. https://doi.org/10.1007/s13369-018-3076-7
Rathore S, Gupta A (2014) A comparative study of feature-ranking and feature-subset selection techniques for improved fault prediction. In: Proceedings of the 7th india software engineering conference on—ISEC ’14. ACM Press, New York, pp 1–10
Rathore SS, Kumar S (2019) A study on software fault prediction techniques. Artif Intell Rev 51:255–327. https://doi.org/10.1007/s10462-017-9563-5
Rathore SS, Kumar S (2020) An empirical study of ensemble techniques for software fault prediction. Appl Intell. https://doi.org/10.1007/s10489-020-01935-6
Riaz S, Arshad A, Jiao L (2018) Rough noise-filtered easy ensemble for software fault prediction. IEEE Access 6:46886–46899. https://doi.org/10.1109/ACCESS.2018.2865383
Rodríguez D, Ruiz R, Cuadrado-Gallego J et al (2007) Attribute selection in software engineering datasets for detecting fault modules. In: EUROMICRO 2007—Proc 33rd EUROMICRO Conf Softw Eng Adv Appl SEAA 2007, pp 418–423. https://doi.org/10.1109/EUROMICRO.2007.20
Seiffert C, Khoshgoftaar TM, Van Hulse J (2009) Improving software-quality predictions with data sampling and boosting. IEEE Trans Syst Man Cybern Part A Syst Hum 39:1283–1294. https://doi.org/10.1109/TSMCA.2009.2027131
Seliya N, Khoshgoftaar TM (2011) The use of decision trees for cost-sensitive classification: an empirical study in software quality prediction. Wiley Interdiscip Rev Data Min Knowl Discov 1:448–459. https://doi.org/10.1002/widm.38
Seliya N, Khoshgoftaar TM, Van Hulse J (2010) Predicting faults in high assurance software. In: 2010 IEEE 12th International symposium on high assurance systems engineering. IEEE, pp 26–34
Shan C, Chen B, Hu C et al (2014) Software defect prediction model based on LLE and SVM. In: 2014 Communications security conference (CSC 2014). Institution of Engineering and Technology, London, pp 1–5
Shao Y, Liu B, Wang S, Li G (2018) A novel software defect prediction based on atomic class-association rule mining. Expert Syst Appl 114:237–254. https://doi.org/10.1016/j.eswa.2018.07.042
Sharmin S, Arefin MR, Wadud MA-A, et al (2015) SAL: An effective method for software defect prediction. In: 2015 18th International conference on computer and information technology (ICCIT). IEEE, pp 184–189
Shatnawi R (2012) Improving software fault-prediction for imbalanced data. In: 2012 Int Conf Innov Inf Technol (IIT 2012), pp 54–59. https://doi.org/10.1109/INNOVATIONS.2012.6207774
Shen C, Zhang SF, Zhai JH et al (2018) Imbalanced data classification based on extreme learning machine autoencoder. In: Proc - Int Conf Mach Learn Cybern, vol 2, pp 387–392. https://doi.org/10.1109/ICMLC.2018.8526934
Shepperd M, Song Q, Sun Z, Mair C (2013) Data quality: some comments on the NASA software defect datasets. IEEE Trans Softw Eng 39:1208–1215. https://doi.org/10.1109/TSE.2013.11
Shivaji S, Whitehead EJ, Akella R, Kim S (2009) Reducing features to improve bug prediction. In: ASE2009—24th IEEE/ACM Int Conf Autom Softw Eng, pp 600–604. https://doi.org/10.1109/ASE.2009.76
Siers MJ, Islam Z (2015) Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Inf Syst 51:62–71. https://doi.org/10.1016/j.is.2015.02.006
Singh P, Singh K (2017) Exploring automatic search in digital libraries. In: Proceedings of the 21st international conference on evaluation and assessment in software engineering. ACM, New York, pp 236–241
Singh P, Verma S (2020) ACO based comprehensive model for software fault prediction. Int J Knowl Based Intell Eng Syst 24:63–71. https://doi.org/10.3233/KES-200029
Soleimani A, Asdaghi F (2014) An AIS based feature selection method for software fault prediction. In: Iran Conf Intell Syst (ICIS 2014), pp 1–5. https://doi.org/10.1109/IranianCIS.2014.6802598
Son L, Pritam N, Khari M et al (2019) Empirical study of software defect prediction: a systematic mapping. Symmetry (Basel) 11:212. https://doi.org/10.3390/sym11020212
Song Q, Guo Y, Shepperd M (2019) A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans Softw Eng 45:1253–1269. https://doi.org/10.1109/TSE.2018.2836442
Sri Kavya K, Prasanth Y (2020) An ensemble deepboost classifier for software defect prediction. Int J Adv Trends Comput Sci Eng 9:2021–2028. https://doi.org/10.30534/ijatcse/2020/173922020
Sun Z, Song Q, Zhu X (2012) Using coding-based ensemble learning to improve software defect prediction. IEEE Trans Syst Man Cybern Part C Appl Rev 42:1806–1817. https://doi.org/10.1109/TSMCC.2012.2226152
Sun Y, Xu L, Li Y et al (2018) Utilizing deep architecture networks of VAE in software fault prediction. In: 2018 IEEE Intl conf on parallel & distributed processing with applications, ubiquitous computing & communications, big data & cloud computing, social computing & networking, sustainable computing & communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, pp 870–877
Tan M, Tan L, Dara S, Mayeux C (2015) Online defect prediction for imbalanced data. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering. IEEE, pp 99–108
Tang W, Khoshgoftaar TM (2004) Noise identification with the k-means algorithm. In: Proc - Int Conf Tools with Artif Intell (ICTAI), pp 373–378. https://doi.org/10.1109/ictai.2004.93
Tantithamthavorn C, McIntosh S, Hassan AE, Matsumoto K (2016) Comments on researcher bias: the use of machine learning in software defect prediction. IEEE Trans Softw Eng 42:1092–1094. https://doi.org/10.1109/TSE.2016.2553030
Tong H, Liu B, Wang S (2018) Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Inf Softw Technol 96:94–111. https://doi.org/10.1016/j.infsof.2017.11.008
Tran HD, Hanh LTM, Binh NT (2019) Combining feature selection, feature learning and ensemble learning for software fault prediction. In: Proc 2019 11th Int Conf Knowl Syst Eng (KSE 2019), pp 1–8. https://doi.org/10.1109/KSE.2019.8919292
Tumar I, Hassouneh Y, Turabieh H, Thaher T (2020) Enhanced binary moth flame optimization as a feature selection algorithm to predict software fault prediction. IEEE Access 8:8041–8055. https://doi.org/10.1109/ACCESS.2020.2964321
Turabieh H, Mafarja M, Li X (2019) Iterated feature selection algorithms with layered recurrent neural network for software fault prediction. Expert Syst Appl 122:27–42. https://doi.org/10.1016/j.eswa.2018.12.033
Verma R, Gupta A (2012) Software defect prediction using Two level data pre-processing. In: Proc 2012 Int Conf Recent Adv Comput Softw Syst (RACSS 2012), pp 311–317. https://doi.org/10.1109/RACSS.2012.6212686
Wahono RS (2015) A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks. J Softw Eng 1:1–16
Wahono RS, Suryana N, Ahmad S (2014) Metaheuristic optimization based feature selection for software defect prediction. J Softw. https://doi.org/10.4304/jsw.9.5.1324-1333
Walkinshaw N, Minku L (2018) Are 20% of files responsible for 80% of defects? In: Int Symp Empir Softw Eng Meas. https://doi.org/10.1145/3239235.3239244
Wang S, Yao X (2013) Using class imbalance learning for software defect prediction. IEEE Trans Reliab 62:434–443. https://doi.org/10.1109/TR.2013.2259203
Wang F, Ai J, Zou Z (2019) A cluster-based hybrid feature selection method for defect prediction. In: Proc - 19th IEEE Int conf Softw Qual Reliab Secur (QRS 2019), pp 1–9. https://doi.org/10.1109/QRS.2019.00014
Wang H, Khoshgoftaar TM, Napolitano A (2010) A comparative study of ensemble feature selection techniques for software defect prediction. In: Proc - 9th Int Conf Mach Learn Appl (ICMLA 2010), pp 135–140. https://doi.org/10.1109/ICMLA.2010.27
Wang H, Khoshgoftaar TM, Van Hulse J, Gao K (2011) Metric selection for software defect prediction. Int J Softw Eng Knowl Eng 21:237–257. https://doi.org/10.1142/S0218194011005256
Wang H, Khoshgoftaar TM, Napolitano A (2012) Software measurement data reduction using ensemble techniques. Neurocomputing 92:124–132. https://doi.org/10.1016/j.neucom.2011.08.040
Wang H, Khoshgoftaar TM, Napolitano A (2013) An empirical study on wrapper-based feature selection for software engineering data. In: Proc - 2013 12th Int Conf Mach Learn Appl (ICMLA 2013), vol 2, pp 84–89. https://doi.org/10.1109/ICMLA.2013.110
Wang S, Liu T, Tan L (2016a) Automatically learning semantic features for defect prediction. In: Proceedings of the 38th international conference on software engineering. ACM, New York, pp 297–308
Wang T, Zhang Z, Jing X, Zhang L (2016b) Multiple kernel ensemble learning for software defect prediction. Autom Softw Eng. https://doi.org/10.1007/s10515-015-0179-1
Wang K, Liu L, Yuan C, Wang Z (2020) Software defect prediction model based on LASSO–SVM. Neural Comput Appl. https://doi.org/10.1007/s00521-020-04960-1
Wei H, Hu C, Chen S et al (2019) Establishing a software defect prediction model via effective dimension reduction. Inf Sci (NY) 477:399–409. https://doi.org/10.1016/j.ins.2018.10.056
Wen J, Li S, Lin Z et al (2012) Systematic literature review of machine learning based software development effort estimation models. Inf Softw Technol 54:41–59. https://doi.org/10.1016/j.infsof.2011.09.002
Wohlin C (2014) Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: EASE ’14. ACM, New York, pp 1–10
Xia Y, Yan G, Jiang X, Yang Y (2014) A new metrics selection method for software defect prediction. In: Proc 2014 IEEE Int Conf Prog Informatics Comput (PIC 2014), pp 433–436. https://doi.org/10.1109/PIC.2014.6972372
Xu Z, Li S, Xu J et al (2019a) LDFR: Learning deep feature representation for software defect prediction. J Syst Softw 158:110402. https://doi.org/10.1016/j.jss.2019.110402
Xu Z, Liu J, Luo X et al (2019b) Software defect prediction based on kernel PCA and weighted extreme learning machine. Inf Softw Technol 106:182–200. https://doi.org/10.1016/j.infsof.2018.10.004
Xu Z, Xuan J, Liu J, Cui X (2016) MICHAC: defect prediction via feature selection based on maximal information coefficient with hierarchical agglomerative clustering. In: 2016 IEEE 23rd Int Conf Softw Anal Evol Reengineering (SANER 2016), pp 370–381. https://doi.org/10.1109/SANER.2016.34
Xu X, Chen W, Wang X (2021) RFC: a feature selection algorithm for software defect prediction. J Syst Eng Electron 32:389–398. https://doi.org/10.23919/JSEE.2021.000032
Yao J, Shepperd M (2021) The impact of using biased performance metrics on software defect prediction research. Inf Softw Technol 139:106664. https://doi.org/10.1016/j.infsof.2021.106664
Yohannese CW, Li T (2017) A combined-learning based framework for improved software fault prediction. Int J Comput Intell Syst 10:647–662. https://doi.org/10.2991/ijcis.2017.10.1.43
Yohannese CW, Li T, Bashir K (2018) A three-stage based ensemble learning for improved software fault prediction: an empirical comparative study. Int J Comput Intell Syst 11:1229–1247. https://doi.org/10.2991/ijcis.11.1.92
Yu Q, Jiang S, Wang R, Wang H (2017a) A feature selection approach based on a similarity measure for software defect prediction. Front Inf Technol Electron Eng 18:1744–1753. https://doi.org/10.1631/FITEE.1601322
Yu X, Ma Z, Ma C et al (2017b) FSCR: a feature selection method for software defect prediction. In: Proceedings of the international conference on software engineering and knowledge engineering, SEKE, pp 351–356
Zhang H, Babar MA, Tell P (2011) Identifying relevant studies in software engineering. Inf Softw Technol 53:625–637. https://doi.org/10.1016/j.infsof.2010.12.010
Zhang X, Song Q, Wang G et al (2015) A dissimilarity-based imbalance data classification algorithm. Appl Intell 42:544–565. https://doi.org/10.1007/s10489-014-0610-5
Zhang Z, Jing X, Wang T (2017) Label propagation based semi-supervised learning for software defect prediction. Autom Softw Eng 24:47–69. https://doi.org/10.1007/s10515-016-0194-x
Zhao Q, Yan X, Zhou Y (2018) Adaptive centre-weighted oversampling for class imbalance in software defect prediction. In: 2018 IEEE Intl conf on parallel & distributed processing with applications, ubiquitous computing & communications, big data & cloud computing, social computing & networking, sustainable computing & communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, pp 223–230
Zheng J (2010) Cost-sensitive boosting neural networks for software defect prediction. Expert Syst Appl 37:4537–4543. https://doi.org/10.1016/j.eswa.2009.12.056
Zhou L, Li R, Zhang S, Wang H (2018) Imbalanced data processing model for software defect prediction. Wirel Pers Commun 102:937–950. https://doi.org/10.1007/s11277-017-5117-z
Zhu K, Ying S, Zhang N, Zhu D (2021) Software defect prediction based on enhanced metaheuristic feature selection optimization and a hybrid deep neural network. J Syst Softw 180:111026. https://doi.org/10.1016/j.jss.2021.111026
Appendix
See Table 15.
About this article
Cite this article
Bhandari, K., Kumar, K. & Sangal, A.L. Data quality issues in software fault prediction: a systematic literature review. Artif Intell Rev 56, 7839–7908 (2023). https://doi.org/10.1007/s10462-022-10371-6
DOI: https://doi.org/10.1007/s10462-022-10371-6