Abstract
Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (k-NN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by benchmarking the method against four other imputation methods: Random Draw Substitution, Random Imputation, Median Imputation and Mode Imputation. By simulating both non-response and imputation, we obtain comparable performance measures for all methods. We discuss the performance of k-NN in the light of the other methods, but also for different values of k, different proportions of missing data, different neighbour selection strategies and different numbers of data attributes. Our results show that the k-NN method performs well, even when much data are missing, but has strong competition from both Median Imputation and Mode Imputation for our particular data. However, unlike these methods, k-NN has better performance with more data attributes. We suggest that a suitable value of k is approximately the square root of the number of complete cases, and that letting certain incomplete cases qualify as neighbours boosts the imputation ability of the method.
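The k-NN idea summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance measure or how a replacement value is derived from the neighbours, so the sketch assumes Euclidean distance over the items both cases have answered and takes the low median of the neighbours' answers (so the imputed value is always a valid Likert score). Only complete cases act as donors here; the strategy of admitting certain incomplete cases as neighbours is not shown. The names `MISSING`, `distance`, `suggested_k` and `knn_impute` are hypothetical.

```python
import math
from statistics import median_low

MISSING = None  # assumed marker for an unanswered Likert item


def distance(a, b):
    """Euclidean distance over the items that both cases have answered (assumption)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf  # nothing to compare on, so treat as infinitely far apart
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def suggested_k(cases):
    """Rule of thumb from the abstract: k is roughly sqrt(number of complete cases)."""
    complete = sum(1 for case in cases if MISSING not in case)
    return max(1, round(math.sqrt(complete)))


def knn_impute(cases, k=None):
    """Return a copy of `cases` with missing items filled from the k nearest complete cases."""
    donors = [case for case in cases if MISSING not in case]
    if k is None:
        k = suggested_k(cases)
    result = []
    for case in cases:
        filled = list(case)
        if MISSING in case and donors:
            # Rank donors by distance; ties keep their original order (stable sort).
            neighbours = sorted(donors, key=lambda d: distance(case, d))[:k]
            for j, value in enumerate(case):
                if value is MISSING:
                    # Low median keeps the result on the ordinal scale.
                    filled[j] = median_low(n[j] for n in neighbours)
        result.append(filled)
    return result


# Small made-up data set: 7 complete cases and 2 cases with one missing item each.
survey = [
    [1, 2, 1, 2],
    [2, 2, 1, 1],
    [1, 1, 2, 2],
    [5, 4, 5, 4],
    [4, 5, 5, 5],
    [5, 5, 4, 4],
    [4, 4, 4, 5],
    [5, 5, 4, MISSING],
    [1, MISSING, 1, 2],
]
print(suggested_k(survey))
print(knn_impute(survey)[-2:])
```

With seven complete cases the rule of thumb gives k = 3. In a benchmark of the kind described above, the same incomplete data would also be passed to simpler baselines such as Median Imputation or Mode Imputation and the results compared against the known true answers.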
Acknowledgments
We would like to thank the anonymous reviewers for their helpful comments that have allowed us to improve the paper significantly. This work was partly funded by The Knowledge Foundation in Sweden under a research grant for the project “Blekinge—Engineering Software Qualities (BESQ)” http://www.bth.se/besq.
Cite this article
Jönsson, P., Wohlin, C. Benchmarking k-nearest neighbour imputation with homogeneous Likert data. Empir Software Eng 11, 463–489 (2006). https://doi.org/10.1007/s10664-006-9001-9