Data quality issues in software fault prediction: a systematic literature review

Artificial Intelligence Review

Abstract

Software fault prediction (SFP) aims to improve software quality at the lowest possible cost and time. Various machine learning models have been proposed for predicting software faults. The performance of these models depends on dataset quality and can be improved by identifying and eliminating data quality issues. In this paper, we present a systematic literature review of data quality issues in SFP datasets. We selected 145 primary studies published up to November 2021 and analyzed them from five perspectives: data quality issue, pre-processing technique, modeling technique, dataset, and performance measures used. The findings indicate that data dimensionality, class imbalance, and their combination have been heavily studied in the literature. However, issues such as class overlap and missing data are also pertinent to SFP datasets and need further investigation. The effect of resolving one data quality issue relative to others remains unexplored. C4.5, naive Bayes, multilayer perceptron, support vector machine, and random forest are the classifiers most frequently used by researchers. However, researchers should know how sensitive each classifier is to a particular data quality issue and select classifiers accordingly. The PROMISE datasets have been used extensively in SFP. Accuracy, precision, recall, and area under the curve are the most common performance measures. We suggest employing unbiased and stable performance measures, such as the Matthews correlation coefficient (MCC), for model evaluation. Our findings indicate that data quality issues in SFP datasets degrade classifier performance and that there is scope for further research on these issues.
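The recommendation to prefer the Matthews correlation coefficient under class imbalance can be illustrated with a small sketch. The label vectors, counts, and helper names below are hypothetical and are not taken from any study in the review; they only show how accuracy can look favourable for a classifier that never flags a faulty module, while the coefficient does not.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 marks a faulty module."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of modules classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced dataset: 95 clean modules followed by 5 faulty ones.
y_true = [0] * 95 + [1] * 5

# Classifier A (hypothetical): always predicts "clean".
y_pred_majority = [0] * 100

# Classifier B (hypothetical): 10 false alarms, finds 4 of the 5 faulty modules.
y_pred_useful = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

acc_majority = accuracy(*confusion_counts(y_true, y_pred_majority))
mcc_majority = mcc(*confusion_counts(y_true, y_pred_majority))
acc_useful = accuracy(*confusion_counts(y_true, y_pred_useful))
mcc_useful = mcc(*confusion_counts(y_true, y_pred_useful))

print(f"A: accuracy={acc_majority:.2f}, MCC={mcc_majority:.2f}")
print(f"B: accuracy={acc_useful:.2f}, MCC={mcc_useful:.2f}")
```

In this made-up example, classifier A scores higher on accuracy than classifier B even though it detects no faults, whereas the coefficient ranks B above A. The zero-denominator convention used here is one common choice, not something the review prescribes.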

Author information

Corresponding author

Correspondence to Kirti Bhandari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 58 KB)

Appendix

See Table 15.

Table 15 Studies mapping to unique identifier

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bhandari, K., Kumar, K. & Sangal, A.L. Data quality issues in software fault prediction: a systematic literature review. Artif Intell Rev 56, 7839–7908 (2023). https://doi.org/10.1007/s10462-022-10371-6

  • DOI: https://doi.org/10.1007/s10462-022-10371-6
