Empirical Software Engineering, Volume 18, Issue 4, pp 659–698

On the value of outlier elimination on software effort estimation research

  • Yeong-Seok Seo
  • Doo-Hwan Bae


Producing accurate and reliable software effort estimates has long been a challenge for both academic research and the software industry. Data quality is an important factor affecting the accuracy of effort estimation methods. To assess its impact, we investigated the effect of eliminating outliers on the estimation accuracy of commonly used software effort estimation methods. Guided by three research questions, we analyzed the influence of outlier elimination on estimation accuracy by combining five outlier elimination methods (least trimmed squares, Cook's distance, K-means clustering, box plot, and the Mantel leverage metric) with two effort estimation methods (least squares regression and estimation by analogy, with varying parameters). Empirical experiments were performed on industrial data sets: ISBSG Release 9, the Bank and Stock data sets collected from financial companies, and the Desharnais data set from the PROMISE repository. The effect of the outlier elimination methods was further evaluated with statistical tests (the Friedman test and the Wilcoxon signed-rank test). Under the evaluation criteria, the experimental results showed no substantial difference between effort estimates obtained with and without outlier elimination. However, statistical analysis indicated that outlier elimination led to a significant improvement in estimation accuracy on the Stock data set for some combinations of outlier elimination and effort estimation methods. Although outlier elimination did not yield a significant improvement on the other data sets, our graphical analysis of errors showed that it can increase the likelihood of producing more accurate effort estimates for new software projects.
From a practical point of view, it is therefore worthwhile for software organizations to consider outlier elimination and to analyze the effort estimation results in detail in order to improve the accuracy of software effort estimation.
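To make the pipeline described above concrete, the following is a minimal sketch (not the authors' actual implementation) of one of the studied combinations: box-plot (IQR) outlier elimination applied to effort values, followed by simple least squares regression of effort on size. The project data, the IQR multiplier k=1.5, and the function names are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    # Box-plot rule: flag indices whose value falls outside
    # [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the conventional whisker length.
    q = statistics.quantiles(values, n=4)
    q1, q3 = q[0], q[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < lo or v > hi}

def fit_ols(xs, ys):
    # Closed-form simple least squares: effort = a + b * size.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical project data: (size in function points, effort in person-hours).
sizes = [100, 120, 150, 180, 200, 220, 250, 300]
efforts = [800, 950, 1200, 1400, 1600, 1750, 2000, 9000]  # last value is anomalous

outliers = iqr_outliers(efforts)          # indices flagged by the box-plot rule
keep = [i for i in range(len(sizes)) if i not in outliers]
a, b = fit_ols([sizes[i] for i in keep], [efforts[i] for i in keep])
estimate = a + b * 170                    # estimate effort for a new 170-FP project
```

On this toy data the box-plot rule removes the anomalous eighth project before the regression is fitted; the study's evaluation criteria (e.g., magnitude of relative error) would then be computed on hold-out projects estimated by the fitted model.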


Keywords: Software cost estimation · Software effort estimation · Outlier elimination · Software data quality



Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This work was partially supported by the Defense Acquisition Program Administration and the Agency for Defense Development under the contract.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Computer Science, College of Information Science & Technology, KAIST, Daejeon, South Korea
