Abstract
Software quality is costly to achieve in large systems: developers and testers must investigate a large number of software modules to ensure quality. Because this investigation is time consuming, formal models such as machine learning techniques are used to predict the fault-proneness of software modules, i.e., to identify where faults are most probable. However, many modules are small in implementation or design size, and investigating their quality is a low priority. In this research, machine learners are used to classify the most complex parts of the software. The less complex modules are filtered out using two size measures: the number of public methods (NPM) in a software construct, and lines of code (LOC) as a surrogate for implementation size. Modules in the lowest 10, 20 and 30% of each measure are filtered out of the training and testing data sets. The remaining modules are used to build four classifiers: Naïve Bayes, logistic regression, nearest neighbors and decision trees. The resulting classifiers were evaluated and compared against classifiers built on the unfiltered data. Classifiers based on NPM filtering were more consistent with the unfiltered results than those based on the LOC measure.
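The core preprocessing step of the abstract can be sketched as follows: given a size measure for each module, discard the instances falling in the lowest 10, 20 or 30% before training. This is a minimal illustration with synthetic data; the variable names (`loc`, `labels`) and the lognormal size distribution are assumptions for the example, not the paper's actual dataset, and the filtered set would subsequently be fed to the four learners named above.

```python
import numpy as np

def filter_lowest(values, labels, pct):
    """Drop instances whose size measure falls in the lowest `pct` percent."""
    threshold = np.percentile(values, pct)
    keep = values > threshold
    return values[keep], labels[keep]

# Synthetic module sizes: LOC is typically right-skewed, so a lognormal
# distribution is a reasonable stand-in for illustration.
rng = np.random.default_rng(42)
loc = rng.lognormal(mean=4.0, sigma=1.0, size=1000)
labels = rng.integers(0, 2, size=1000)  # placeholder fault labels

for pct in (10, 20, 30):
    kept_loc, kept_labels = filter_lowest(loc, labels, pct)
    print(f"lowest {pct}% removed -> {len(kept_loc)} modules remain")
```

The same function applies unchanged when `values` holds the NPM measure instead of LOC; the study's comparison amounts to training the four classifiers once per measure and per cutoff.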
Shatnawi, R. Identifying and eliminating less complex instances from software fault data. Int J Syst Assur Eng Manag 8 (Suppl 2), 974–982 (2017). https://doi.org/10.1007/s13198-016-0556-6