Software Engineering Data Collection for Field Studies

  • Janice Singer
  • Susan E. Sim
  • Timothy C. Lethbridge

Software engineering is an intensely people-oriented activity, yet little is known about how software engineers perform their work. To improve software engineering tools and practice, it is therefore essential to conduct field studies, i.e., to study real practitioners as they solve real problems. To that end, we describe a series of data collection techniques for such studies, organized around a taxonomy based on the degree to which interaction with software engineers is necessary. For each technique, we provide examples from the literature, an analysis of some of its advantages and disadvantages, and a discussion of special reporting requirements. We also briefly discuss recording options and data analysis.





Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Janice Singer (1)
  • Susan E. Sim (2)
  • Timothy C. Lethbridge (3)

  1. Institute for Information Technology, National Research Council Canada, Ottawa, Canada
  2. Department of Informatics, University of California, Irvine, Irvine, USA
  3. School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
