Software Quality Journal, Volume 16, Issue 4, pp 563–600

Imputation techniques for multivariate missingness in software measurement data

  • Taghi M. Khoshgoftaar
  • Jason Van Hulse

Abstract

The problem of missing values in software measurement data used in empirical analysis has led to the proposal of numerous potential solutions. Imputation procedures, for example, have been proposed to 'fill in' the missing values with plausible alternatives. We present a comprehensive study of imputation techniques using real-world software measurement datasets. Two datasets with dramatically different properties were used, with missing values injected according to three missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and non-ignorable (NI). We consider missing values occurring in multiple attributes and compare three procedures: Bayesian multiple imputation, k nearest neighbor imputation, and mean imputation. We also examine the relationship between noise in the dataset and the performance of the imputation techniques, which has not been addressed previously. Our experiments demonstrate conclusively that Bayesian multiple imputation is an extremely effective imputation technique.
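
The abstract names the compared procedures without describing them. As a rough illustration only, the following Python sketch injects MCAR missingness into several attributes of a small synthetic dataset and then applies mean imputation and k nearest neighbor imputation. It is not the authors' experimental setup: the function names (`inject_mcar`, `mean_impute`, `knn_impute`, `mean_abs_error`), the synthetic data, the distance definition, and the error measure are all assumptions made for this example, and Bayesian multiple imputation is omitted because it does not fit in a short sketch.

```python
import math
import random


def inject_mcar(rows, rate, rng):
    """Copy rows, setting each cell to None with probability `rate`.

    Every cell has the same chance of being removed, regardless of its value
    or any other attribute, which is what MCAR means.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Replace each missing cell with the mean of its column's observed values."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=3):
    """Replace each missing cell with the column mean over the k nearest donor rows.

    Assumption: distance is the root mean squared difference over the columns
    observed in both rows; rows sharing no observed column are not donors.
    """
    fallback = mean_impute(rows)
    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [val for _, val in candidates[:k]]
            # With no usable donor, fall back to the column mean.
            result[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return result


def mean_abs_error(truth, masked, imputed):
    """Average absolute error over the cells that were missing in `masked`."""
    errors = [abs(t - p)
              for tr, mr, ir in zip(truth, masked, imputed)
              for t, m, p in zip(tr, mr, ir) if m is None]
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data with three correlated attributes (illustrative only).
    xs = [rng.uniform(0, 10) for _ in range(50)]
    complete = [[x, 2 * x + rng.gauss(0, 0.5), 3 * x + rng.gauss(0, 0.5)] for x in xs]
    masked = inject_mcar(complete, rate=0.1, rng=rng)
    print("mean imputation MAE:", mean_abs_error(complete, masked, mean_impute(masked)))
    print("kNN imputation MAE: ", mean_abs_error(complete, masked, knn_impute(masked)))
```

Because missing values can occur in any column, the kNN distance here uses only the attributes that both rows have observed. This is one simple way to handle multivariate missingness, not necessarily the one used in the paper.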

Keywords

Imputation, Software quality, Missing data, Data quality, Bayesian multiple imputation

Notes

Acknowledgements

We thank the anonymous reviewers for their constructive comments and suggestions which helped improve this paper. We are grateful to the current and former members of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories at Florida Atlantic University for their reviews and comments.

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, USA
