Abstract
Software is often expensive to develop and can become a major cost factor in corporate information systems budgets. Given the variability of software characteristics and the continual emergence of new technologies, accurate prediction of software development costs is a critical problem in project management.
To address this issue, a large number of software cost prediction models have been proposed. Each succeeds to some extent, but all encounter the same problem: the inconsistency and inadequacy of the historical data sets. Often no preliminary data analysis has been performed, and the data may contain non-dominated or confounded variables. Moreover, some project attributes or their values are out of date, for example the type of computer used for project development in the COCOMO 81 (Boehm, 1981) data set.
This paper proposes a framework composed of a set of clearly identified steps that should be performed before a data set is used within a cost estimation model. The framework is based closely on a paradigm proposed by Maxwell (2002). Briefly, it applies a set of statistical techniques, including correlation analysis, analysis of variance (ANOVA), and the chi-square test, to the data set in order to remove outliers and identify dominant variables.
To ground the framework in a practical context, the procedure is used to analyze the ISBSG (International Software Benchmarking Standards Group, Release 8) data set, a widely used and accessible collection containing information on 2,008 software projects. As a consequence of this analysis, six explanatory variables are extracted and evaluated.
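As a rough illustration of the kind of preliminary analysis described above, the sketch below first removes outlying effort values and then checks which candidate variables correlate strongly with effort. The project records, field names, and thresholds are invented for illustration only; they are not drawn from the ISBSG or COCOMO data sets, and the real framework involves further steps (ANOVA, chi-square tests) not shown here.

```python
# Hypothetical sketch of two preliminary-analysis steps: IQR-based
# outlier removal, then Pearson correlation against effort to flag
# candidate dominant variables. All data below is invented.
from statistics import mean, quantiles

projects = [
    {"size": 120, "team": 4,  "effort": 900},
    {"size": 300, "team": 7,  "effort": 2100},
    {"size": 150, "team": 5,  "effort": 1100},
    {"size": 500, "team": 9,  "effort": 3600},
    {"size": 220, "team": 6,  "effort": 1500},
    {"size": 260, "team": 5,  "effort": 1800},
    {"size": 180, "team": 4,  "effort": 1200},
    {"size": 140, "team": 30, "effort": 40000},  # suspect record
]

def iqr_filter(rows, key, k=1.5):
    """Drop rows whose `key` value lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    vals = sorted(r[key] for r in rows)
    q1, _, q3 = quantiles(vals, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if lo <= r[key] <= hi]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

clean = iqr_filter(projects, "effort")
effort = [r["effort"] for r in clean]
for var in ("size", "team"):
    r = pearson([row[var] for row in clean], effort)
    print(f"{var}: r = {r:.3f}")  # retain variables with strong |r|
```

In a real analysis the retained variables would then be examined further, e.g. with ANOVA for categorical factors and chi-square tests for associations between categorical attributes.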
References
Basili, V.R. 1985. Quantitative evaluation of software methodology, Proceedings of the 1st Pan-Pacific Computer Conference.
Basili, V.R. and Rombach, H.D. 1988. The TAME project: Towards improvement-oriented software environments, IEEE Transactions on Software Engineering 14(6): 758–773.
Boehm, B.W. 1981. Software Engineering Economics. Englewood Cliffs, NJ, Prentice Hall.
Boetticher, G. 2001. Using machine learning to predict project effort: Empirical case studies in data-starved domains, Proceedings of the Model Based Requirements Workshop, pp. 17–24.
Briand, L.C., Basili, V.R., and Thomas, W. 1992. A pattern recognition approach for software engineering data analysis, IEEE Transactions on Software Engineering 18(11).
Briand, L.C., Emam, K.E., Surmann, D., and Wieczorek, I. 1998. An assessment and comparison of common software cost estimation modelling techniques, Technical Report ISERN-98-27, Fraunhofer Institute for Experimental Software Engineering, Germany.
Briand, L.C., Langley, T., and Wieczorek, I. 1999. A replicated assessment and comparison of common software cost modeling techniques, Technical Report, IESE-Report 073.99/E.
Burr, A. and Owen, M. 1996. Statistical Methods for Software Quality Using Metrics for Process Improvement. Thomson Computer Press.
Chulani, S., Boehm, B.W., and Steece, B. 1999. Bayesian analysis of empirical software engineering cost models, IEEE Transactions on Software Engineering 25(4): 573–583.
Conte, S.D., Dunsmore, H.E., and Shen, V.Y. 1986. Software Engineering Metrics and Models. Benjamin/Cummings.
Fenton, N.E. and Neil, M. 2000. Software metrics: Roadmap, “The Future of Software Engineering,” Proceedings of the 22nd International Conference on Software Engineering, pp. 357–370. ACM Press.
Finnie, G.R., Wittig, G.E., and Desharnais, J.M. 1997. Reassessing function points, Australian Journal of Information Systems 4(2): 39–45.
Gravetter, F.J. and Wallnau, L.B. 1996. Statistics for the Behavioral Sciences: A First Course for Students of Psychology and Education, 4th ed. St. Paul, West.
IFPUG. 1994. Counting Practices Manual, Release 4.0, International Function Point Users Group, Westerville, OH.
Karunanithi, N., Whitley, D., and Malaiya, K.Y. 1992. Using neural networks in reliability prediction, IEEE Software 9(4): 53–59.
Kemerer, C.F. 1987. An empirical validation of software cost estimation models, Communications of the ACM 30(5): 416–429.
Kitchenham, B.A. 1998. A procedure for analyzing unbalanced datasets, IEEE Transactions on Software Engineering 24(4): 278–301.
Kitchenham, B.A., Pfleeger, S., Pickard, L., Jones, P., Hoaglin, D., Emam, K.E., and Rosenberg, J. 2002. Preliminary guidelines for empirical research in software engineering, IEEE Transactions on Software Engineering 28(8): 721–734.
Maxwell, K. 2002. Applied Statistics for Software Managers. Upper Saddle River, NJ, Pearson Education.
Maxwell, K., Wassenhove, L.V., and Dutta, S. 1996. Software development productivity of European space, military, and industrial applications, IEEE Transactions on Software Engineering 22(10): 704–718.
Pfleeger, S.L., Jeffery, R., Curtis, B., and Kitchenham, B. 1997. Status report on software measurement, IEEE Software 14(2): 33–43.
Porter, A.A. and Selby, R.W. 1988. Learning from examples: Generation and evaluation of decision trees for software resource analysis, IEEE Transactions on Software Engineering 14(12): 1743–1757.
Porter, A.A. and Selby, R.W. 1990. Empirically guided software development using metric-based classification trees, IEEE Software 7(2): 46–54.
Putnam, L.H. 1978. A general empirical solution to the macro software sizing and estimating problem, IEEE Transactions on Software Engineering 4(4): 345–361.
Putnam, L.H. and Myers, W. 1992. Measures for Excellence: Reliable Software on Time, within Budget. Yourdon Press.
Samson, B., Ellison, D., and Dugard, P. 1997. Software cost estimation using an Albus perceptron (CMAC), Information and Software Technology 39: 55–60.
Srinivasan, K. and Fisher, D. 1995. Machine learning approaches to estimating software development effort, IEEE Transactions on Software Engineering 21(2): 126–137.
Liu, Q., Mintram, R.C. Preliminary Data Analysis Methods in Software Estimation. Software Qual J 13, 91–115 (2005). https://doi.org/10.1007/s11219-004-5262-y