
Identifying and eliminating less complex instances from software fault data

  • Original Article
  • Published in: International Journal of System Assurance Engineering and Management

Abstract

Software quality is costly to achieve for large systems. Developers and testers need to investigate a large number of software modules to ensure software quality. This investigation is time consuming, so formal models such as machine learning techniques are used to predict the fault-proneness of software modules, that is, where faults are more probable. However, many modules are small in either implementation or design size and are low-priority targets for quality investigation. In this research, machine learners are used to classify the most complex parts of the software. The less complex software modules are filtered out using two size measures: the number of public parameters in a software construct (NPM) and lines of code (LOC) as a surrogate for implementation size. Modules in the lowest 10, 20 and 30% of each measure are filtered out of the training and testing data sets. The remaining modules are used to build four classifiers: Naïve Bayes, logistic regression, nearest neighbors and decision trees. The resulting classifiers were evaluated and compared against classifiers built on the unfiltered data. The classifiers based on NPM filtering were more consistent with those built on the unfiltered data than the classifiers based on LOC filtering were.
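
As a rough illustration of the procedure described above, the sketch below drops the modules in the lowest 10, 20 and 30% of a size measure and trains the four listed classifiers on what remains. This is a minimal sketch in Python with scikit-learn, not the paper's implementation: the file name modules.csv, the column names npm, loc and defective, and the use of cross-validated AUC as the evaluation measure are all assumptions made for illustration.

    # Minimal sketch (assumed data layout, not the paper's actual pipeline):
    # modules.csv is assumed to hold one row per module with numeric metric
    # columns, including 'npm' and 'loc', and a binary 'defective' label.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def filter_lowest(df, measure, fraction):
        """Drop modules whose `measure` value lies in the lowest `fraction`."""
        threshold = df[measure].quantile(fraction)
        return df[df[measure] > threshold]

    data = pd.read_csv("modules.csv")  # hypothetical per-module metrics file
    classifiers = {
        "Naive Bayes": GaussianNB(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Nearest neighbors": KNeighborsClassifier(n_neighbors=5),
        "Decision tree": DecisionTreeClassifier(),
    }

    for measure in ("npm", "loc"):
        for fraction in (0.10, 0.20, 0.30):
            subset = filter_lowest(data, measure, fraction)
            X = subset.drop(columns=["defective"])
            y = subset["defective"]
            for name, clf in classifiers.items():
                # 10-fold cross-validated AUC on the filtered data set
                auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
                print(f"{measure}, lowest {fraction:.0%} removed, {name}: AUC = {auc:.3f}")

Training the same classifiers on the full, unfiltered data gives the baseline; comparing the two sets of scores per measure mirrors the NPM-versus-LOC comparison the abstract describes.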



Author information


Correspondence to Raed Shatnawi.


About this article


Cite this article

Shatnawi, R. Identifying and eliminating less complex instances from software fault data. Int J Syst Assur Eng Manag 8 (Suppl 2), 974–982 (2017). https://doi.org/10.1007/s13198-016-0556-6

