Techniques for dealing with incomplete data: a tutorial and survey

Abstract

Real-world applications of pattern recognition and machine learning algorithms often present situations where the data are partly missing, corrupted by noise, or otherwise incomplete. In spite of that, developments in the machine learning community over the last decade have mostly focused on the mathematical analysis of learning machines, making it difficult for practitioners to piece together an overview of the major approaches to this issue. Paradoxically, as a consequence, even established methodologies rooted in statistics appear to have long been forgotten. Although the relevant literature is now so vast that exhaustive coverage is no longer possible, the first goal of this paper is to provide the reader with a significant survey of major, or otherwise thoroughly sound, techniques for the tasks of pattern recognition, machine learning, and density estimation from incomplete data. Secondly, the paper aims to serve as a viable tutorial for the interested practitioner, allowing for a self-contained, step-by-step understanding of several approaches. An effort is made to categorize the different techniques as follows: (1) heuristic methods; (2) statistical approaches; (3) connectionist-oriented techniques; (4) other approaches (dynamical systems, adversarial deletion of features, etc.).

Notes

  1. Bailey and Jain [27] repeated Dudani’s experiments, concluding that the DW-KNN is not superior to the traditional \(K\)-nearest neighbor rule.

  2. “Normal” and “Gaussian” are used as synonyms.

  3. When there are only complete data, Eqs. (16) and (17) reduce to the well-known formulae for the ML estimation of \(\mu \) and \(\Sigma \).

  4. The constant \(a\) which minimizes \(E\{(y-ax)^{2}\}\), obtained by the least-squares method, is such that \(y-ax\) is orthogonal to \(x\), that is: \(E\{(y-ax)x\} = 0\).

References

  1. Lee C, Choi SW, Lee J-M, Lee I-B (2004) Sensor fault identification in MSPM using reconstructed monitoring statistics. Ind Eng Chem Res 43(15):4293–4304

  2. Lopes VV, Menezes JC (2005) Inferential sensor design in the presence of missing data: a case study. Chemometr Intell Lab Syst 78(1–2):1–10

  3. Rendtel U (2006) The 2005 plenary meeting on missing data and measurement error. AStA Adv Stat Anal 90(4):493–499

  4. Mott P, Sammis TW, Southward GM (1994) Climate data estimation using climate information from surrounding climate stations. Appl Eng Agric 10(1):41–44

  5. Li Q, Roxas BAP (2008) Significance analysis of microarray for relative quantitation of LC/MS data in proteomics. BMC Bioinform 9(1):187–197

  6. Green P, Barker J, Cooke M, Josifovski L (2001) Handling missing and unreliable information in speech recognition. In: Proceedings of AISTATS

  7. Barker J (2012) Missing-data techniques: recognition with incomplete spectrograms. Wiley, New York, pp 369–398

  8. Pynadath D, Wellman M (2000) Probabilistic state-dependent grammars for plan recognition. In: Proceedings of the conference on uncertainty in artificial intelligence, pp 507–514

  9. Guerreiro RFC, Aguiar PMQ (2002) Factorization with missing data for 3D structure recovery. In: Proceedings of the IEEE workshop on multimedia signal processing, pp 105–108

  10. Jia H, Martinez AM (2009) Support vector machines in face recognition with occlusions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 136–141

  11. Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge

  12. Chen K, Wang S (2011) Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions. IEEE Trans Pattern Anal Mach Intell 99(1):129–143

  13. You Z, Yin Z, Han K, Huang D-S, Zhou X (2010) A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinform 11:343

  14. Schwenker F, Trentin E (2014) Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recognit Lett 37:4–14

  15. Gabrys B (2009) Learning with missing or incomplete data. In: Foggia P, Sansone C, Vento M (eds) Image analysis and processing ICIAP 2009. Springer, Berlin, Heidelberg, pp 1–4

  16. Vinod NC, Punithavalli M (2011) Classification of incomplete data handling techniques: an overview. Int J Comput Sci Eng 3(1):340–344

  17. Richard MD, Lippmann RP (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3:461–483

  18. Lee RCT, Slagle JR, Mong CT (1976) Application of clustering to estimate missing data and improve data integrity. In: Proceedings of 2nd international software engineering conference, pp 539–544, San Francisco, October 1976

  19. Lim C-P, Leong J-H, Kuan M-M (2005) A hybrid neural network system for pattern classification tasks with missing features. IEEE Trans Pattern Anal Mach Intell 27(4):648–653

  20. Zhang S, Qin Y, Zhu X, Zhang J, Zhang C (2006) Optimized parameters for missing data imputation. In: PRICAI, pp 1010–1016

  21. Pelckmans K, De Brabanter J, Suykens JAK, De Moor B (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692

  22. Su X, Khoshgoftaar TM, Zhu X, Greiner R (2008) Imputation-boosted collaborative filtering using machine learning classifiers. In: SAC ’08: Proceedings of the 2008 ACM symposium on applied computing. ACM, New York, pp 949–950

  23. Su X, Greiner R, Khoshgoftaar TM, Napolitano A (2011) Using classifier-based nominal imputation to improve machine learning. In: Huang JZ, Cao L, Srivastava J (eds) PAKDD (1). Springer, pp 124–135

  24. Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York

  25. Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern 9(10):617–621

  26. Dudani SA (1976) The distance-weighted \(k\)-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6:325–327

  27. Bailey T, Jain AK (1978) A note on distance-weighted \(k\)-nearest neighbor rules. IEEE Trans Syst Man Cybern 8:311–313

  28. Morin RL, Raeside DE (1981) A reappraisal of distance-weighted \(k\)-nearest neighbor classification for pattern recognition with missing data. IEEE Trans Syst Man Cybern 11(3):241–243

  29. Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147–177

  30. Ghahramani Z, Jordan MI (1994) Learning from incomplete data. AI Memo 1509, CBCL paper 108. MIT, Cambridge

  31. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38

  32. Rao CR (1972) Linear statistical inference and its applications. Wiley, New York

  33. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103

  34. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239

  35. Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8:129–151

  36. Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York

  37. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov chain Monte Carlo in practice. Chapman & Hall/CRC, New York

  38. Ramoni M, Sebastiani P (2001) Robust learning with missing data. Mach Learn 45(2):147–170

  39. Beaton AE (1964) The use of special matrix operations in statistical calculus. Educational Testing Service Research Bulletin, RB-64-51

  40. Dempster AP (1969) Elements of continuous multivariate analysis. Addison-Wesley, Reading

  41. McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York

  42. Ghahramani Z (1994) Solving inverse problems using an EM approach to density estimation. In: Mozer MC, Smolensky P, Touretzky DS, Elman JL, Weigend AS (eds) Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, pp 316–323

  43. Ghahramani Z, Jordan MI (1994) Supervised learning from incomplete data via an EM approach. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo

  44. Moss S, Hancock ER (1997) Registering incomplete radar images using the EM algorithm. Image Vis Comput 15:637–648

  45. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth International Group, Belmont

  46. Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19:1–141

  47. Tresp V, Hollatz J, Ahmad S (1993) Network structuring and training using rule-based knowledge. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, pp 871–878

  48. Jordan MI, Jacobs RA (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Comput 6:181–214

  49. Tresp V, Ahmad S, Neuneier R (1994) Training neural networks with deficient data. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann Publishers, San Mateo, pp 128–135

  50. Streit RL, Luginbuhl TE (1994) Maximum likelihood training of probabilistic neural networks. IEEE Trans Neural Netw 5(5):764–783

  51. Tanaka M, Kotokawa Y, Tanino T (1996) Pattern classification by stochastic neural networks with missing data. In: IEEE international conference on systems, man and cybernetics, Beijing, China, pp 690–695, 14–17 October 1996

  52. Vellido A (2006) Missing data imputation through GTM as a mixture of t-distributions. Neural Netw 19(10):1624–1635

  53. Hwang JN, Wang CJ (1994) Classification of incomplete data with missing elements. In: International symposium on artificial neural networks, Tainan, Taiwan, December 1994, pp 471–477

  54. Schafer JL (2010) Analysis of incomplete multivariate data. Chapman and Hall/CRC Press, London

  55. Linden A, Kindermann J (1989) Inversion of multilayer nets. In: Proceedings of the international joint conference on neural networks, II, Washington DC, June 1989, pp 425–430

  56. Ahmad S, Tresp V (1993) Some solutions to the missing feature problem in vision. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in neural information processing systems 5. Morgan Kaufmann Publishers, San Mateo, pp 393–400

  57. Tresp V, Neuneier R, Ahmad S (1995) Efficient methods for dealing with missing data in supervised learning. In: Tesauro G, Touretzky D, Leen T (eds) Advances in neural information processing systems 7. Morgan Kaufmann Publishers, San Mateo, pp 689–696

  58. Graham BS, Keisuke H (2011) Robustness to parametric assumptions in missing data models. Am Econ Rev 101(3):538–543

  59. Ahmad S, Tresp V (1993) Classification with missing and uncertain inputs. In: Proceedings of the IEEE international conference on neural networks, San Francisco

  60. Moody J, Darken C (1988) Learning with localized receptive fields. In: Hinton G, Sejnowski T (eds) Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann

  61. Nowlan S (1990) Maximum likelihood competitive learning. In: Advances in neural information processing systems 2. Morgan Kaufmann Publishers, pp 574–582

  62. Moody J, Darken C (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294

  63. Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076

  64. Breiman L, Meisel W, Purcell E (1977) Variable kernel estimates of multivariate densities. Technometrics 19(2):135–144

  65. Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810

  66. Ahmad S (1994) Feature densities are required for computing feature correspondence. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo, pp 961–968

  67. Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(57)

  68. Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ (2004) Analyzing incomplete longitudinal clinical trial data. Biostatistics 5(3):445–464

  69. Congdon P (2006) Bayesian statistical modelling, 2nd edn. Wiley, New York

  70. Collins LM, Schafer JL, Kam CM (2001) A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychol Methods 6:330–351

  71. Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of economic and social measurement, vol 5, number 4. National Bureau of Economic Research, Inc, pp 475–492

  72. Berndt ER, Hall BH, Hall RE, Hausman JA (1974) Estimation and inference in nonlinear structural models. Ann Econ Soc Meas 3:653–665

  73. Marlin B, Roweis S, Zemel R (2005) Unsupervised learning with non-ignorable missing data. In: Proceedings of the tenth international workshop on artificial intelligence and statistics (AISTATS), pp 222–229

  74. Little RJA (1993) Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 88:125–134

  75. Molenberghs G, Kenward M (2007) Missing data in clinical studies. Wiley, New York

  76. Vonesh EF, Greene T, Schluchter MD (2006) Shared parameter models for the joint analysis of longitudinal data and event times. Stat Med 25(1):143–163

  77. Little RJ (2006) Selection and pattern-mixture models. CRC Press, London, pp 409–431

  78. Gad AM, Darwish NMM (2013) A shared parameter model for longitudinal data with missing values. Am J Appl Math Stat 1(2):30–35

  79. Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York

  80. Harel O, Zhou XH (2007) Multiple imputation: review of theory, implementation and software. Stat Med 26:3057–3077

  81. Kenward MG, Carpenter JC (2009) Multiple Imputation. CRC Press, London, pp 477–500

  82. Saltelli A, Chan K, Scott EM (2000) Sensitivity analysis. Wiley, New York

  83. White IR, Royston P, Wood AM (2011) Multiple imputation using chained equations: issues and guidance for practice. Stat Med 30(4):377–399

  84. Daniel RM, Kenward MG (2012) A method for increasing the robustness of multiple imputation. Comput Stat Data Anal 56(6):1624–1643

  85. Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG (2006) The nature of sensitivity in monotone missing not at random models. Comput Stat Data Anal 50(3):830–858

  86. Park J-S, Qian GQ, Jun Y (2008) Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data. Appl Math Comput 197(1):440–450

  87. Stubbendick AL, Ibrahim JG (2003) Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics 59(4):1140–1150

  88. Jolani S (2012) Dual imputation strategies for analyzing incomplete data. Utrecht University, Utrecht

  89. Enders CK (2011) Missing not at random models for latent growth curve analyses. Psychol Methods 16(1):1–16

  90. Molenberghs G, Beunckens C, Sotto C, Kenward MG (2008) Every missingness not at random model has a missingness at random counterpart with equal fit. J R Stat Soc Ser B 70(Part 2):371–388

  91. Vamplew P, Adams A (1992) Missing values in a backpropagation neural net. In: Leong S, Jabri M (eds) Proceedings of the third Australian conference on neural networks, Sydney, February 1992, pp 64–67

  92. Vamplew P, Clark D, Adams A, Muench J (1996) Techniques for dealing with missing values in feedforward networks. In: Proceedings of the seventh Australian conference on neural networks, Canberra, 10–12 April 1996

  93. Southcott ML, Bogner RE (1993) Classification of incomplete data using neural networks. In: Proceedings of the fourth Australian conference on neural networks, Melbourne, 3–5 February 1993, pp 220–223

  94. Hwang JN, Wang CJ (1994) Neural network inversion techniques for missing data applications. In: IEEE neural network workshop on signal processing, Ermioni, Greece, September 1994, pp 22–31

  95. Specht DF (1990) Probabilistic neural networks. Neural Netw 3(1):109–118

  96. Vapnik V (1982) Estimation of dependences based on empirical data. Springer, Berlin

  97. Buntine WL, Weigend AS (1991) Bayesian back-propagation. Complex Syst 5(6):603–643

  98. Arrowsmith DK, Place CM (1990) An introduction to dynamical systems. Cambridge University Press, Cambridge

  99. Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):267–296

  100. Jain LC, Medsker LR (1999) Recurrent neural networks: design and applications. CRC Press Inc, Boca Raton

  101. Jurafsky D, Martin JH (2009) Speech and language processing, 2nd edn. Prentice-Hall Inc, Upper Saddle River

  102. Trentin E, Gori M (2003) Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Trans Neural Netw 14(6):1519–1531

  103. Bertolami R, Bunke H (2008) Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recogn 41(11):3452–3460

  104. Baldi P, Brunak S (2001) Bioinformatics: the machine learning approach, 2nd edn. MIT Press, Cambridge

  105. Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 7. MIT Press

  106. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the National Academy of Sciences, vol 79, pp 2554–2558

  107. Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City

  108. Almeida L (1987) A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In: Caudill M, Butler C (eds) Proceedings of the IEEE first international conference on neural networks, vol 2. IEEE, San Diego, pp 609–618

  109. Pineda F (1989) Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comput 1:161–172

  110. Bengio Y, Gingras F (1996) Recurrent neural networks for missing or asynchronous data. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems 8. MIT Press, Cambridge, pp 395–401

  111. Minsky ML, Papert SA (1969) Perceptrons. MIT Press, Cambridge

  112. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 8. MIT Press, pp 318–362

  113. Globerson A, Roweis ST (2006) Nightmare at test time: robust learning by feature deletion. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, pp 353–360

  114. Dekel O, Shamir O (2008) Learning to classify with missing and corrupted features. In: ICML ’08: Proceedings of the 25th international conference on machine learning. ACM, New York, pp 216–223

  115. Ding Y, Simonoff JS (2010) An investigation of missing data methods for classification trees applied to binary response data. J Mach Learn Res 11:131–170

  116. Twala B (2009) An empirical comparison of techniques for handling incomplete data using decision trees. Appl Artif Intell 23:373–405

  117. Luengo J, García S, Herrera F (2010) A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFs and event covering method. Neural Netw 23:406–418

  118. Corani G, Zaffalon M (2008) Learning reliable classifiers from small or incomplete data sets: the naive credal classifier 2. J Mach Learn Res 9:581–621

  119. Chierichetti F, Kleinberg J, Liben-Nowell D (2011) Reconstructing patterns of information diffusion from incomplete observations. In: Shawe-Taylor J, Zemel RS, Bartlett P, Pereira FCN, Weinberger KQ (eds) Advances in neural information processing systems 24. MIT Press, Cambridge, pp 792–800

  120. Greenwald A, Li J, Sodomka E (2012) Approximating equilibria in sequential auctions with incomplete information and multi-unit demand. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25. MIT Press, Cambridge, pp 2330–2338

  121. Ghannad-Rezaie M, Soltanian-Zadeh H, Ying H, Dong M (2010) Selection-fusion approach for classification of data sets with missing values. Pattern Recognit 43:2340–2350

  122. Farhangfar A, Kurgan L, Dy J (2008) Impact of imputation of missing values on classification error for discrete data. Pattern Recognit 41:3692–3705

  123. Saar-Tsechansky M, Provost F (2007) Handling missing values when applying classification models. J Mach Learn Res 8:1623–1657

  124. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525

  125. Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096

  126. Kim H, Golub GH, Park H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21(2):187–198

  127. Scheel I, Aldrin M, Glad IK, Sorum R, Lyng H, Frigessi A (2005) The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 21(23):4272–4279

  128. Wang X, Jiang Z, Feng H (2006) Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinform 7(32):1–10

  129. Wong DSV, Wong FK, Wood GR (2007) A multi-stage approach to clustering and imputation of gene expression profiles. Bioinformatics 23(8):998–1005

  130. Yoon D, Lee EK, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(2):1–7

  131. Roure B, Baurain D, Philippe H (2013) Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol Biol Evol 30(1):197–214

  132. Nutt W, Razniewski S, Vegliach G (2012) Incomplete databases: missing records and missing values. In: Proceedings of the 17th international conference on database systems for advanced applications, DASFAA’12. Springer, pp 298–310

  133. Kaambwa B, Bryan S, Billingham L (2012) Do the methods used to analyse missing data really matter? An examination of data from an observational study of Intermediate Care patients. BMC Res Notes 5(1):330

  134. David M, Little RJA, Samuhel ME, Triest RK (1986) Alternative methods for CPS income imputation. J Am Stat Assoc 81(393):29–41

  135. Foster EM, Fang GY (2004) Alternative methods for handling attrition: an illustration using data from the fast track evaluation. Eval Rev 28(5):434–464

  136. Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90

  137. Dong Y, Peng C-YJ (2013) Principled missing data methods for researchers. Springerplus 2(1):222

  138. Ali AMG, Dawson SJ, Blows FM, Provenzano E, Ellis IO, Baglietto L, Huntsman D, Caldas C, Pharoah PD (2011) Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer. Br J Cancer 104(4):693–699

  139. Fielding S, Fayers P, Ramsay C (2010) Predicting missing quality of life data that were later recovered: an empirical comparison of approaches. Clin Trials 7(4):333–342

  140. Marshall A, Altman D, Royston P, Holder R (2010) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 10(1):7

  141. Hedden S, Woolson R, Malcolm R (2008) A comparison of missing data methods for hypothesis tests of the treatment effect in substance abuse clinical trials: a Monte Carlo simulation study. Subst Abuse Treat Prev Policy 3(1):1–9

  142. Ding Y, Simonoff JS (2010) An investigation of missing data methods for classification trees applied to binary response data. J Mach Learn Res 11:131–170

  143. Roda C, Nicolis I, Momas I, Guihenneuc-Jouyaux C (2013) Comparing methods for handling missing data. Epidemiology 24(3):469–471

  144. Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576

  145. Schwartz T, Zeig-Owens R (2013) Knowledge (of your missing data) is power: handling missing values in your SAS dataset. In: Proceedings of SAS global forum SUGI 31: statistics, data analysis and data mining, San Francisco, California, 28 April–1 May 2013

  146. Templ M, Alfons A, Filzmoser P (2012) Exploring incomplete data using visualization techniques. Adv Data Anal Classif 6(1):29–47

  147. Heitjan DF (2011) Incomplete data: what you don’t know might hurt you. Cancer Epidemiol Biomark Prev 20(8):1567–1570

Author information

Corresponding author

Correspondence to Edmondo Trentin.

Additional information

This work was accomplished while A. Freno was at the University of Siena.

Appendix

1.1 Appendix A: Marginal distributions of multivariate Normals

Let \(x \in R^{n}\) be a random vector normally distributed with mean \(\mu \) and covariance matrix \(\Sigma \), i.e., \(x\sim N_{n}(\mu ,\Sigma )\). Let \(x^{o} \in R^{k}\) be a vector corresponding to a subset of \(k\) components of \(x\) and let \(x^{m} \in R^{l}\) be a vector corresponding to the remaining components of \(x\). Without loss of generality, we suppose that the components of \(x^{o}\) correspond to the first \(k\) components of \(x\), and \(x^{m}\) to the remaining \(l\) components (otherwise, the components of \(x\), \(\mu \) and \(\Sigma \) can be simply relabeled).

The random vector \(x\), its mean vector, \(\mu \), and its covariance matrix, \(\Sigma \), can be partitioned according to \(x^{o}\) and \(x^{m}\), as:

$$\begin{aligned} x = \left( \begin{array}{c} x^{o} \\ x^{m} \end{array} \right) , \mu = \left( \begin{array}{c} \mu ^{o} \\ \mu ^{m} \end{array} \right) , \Sigma = \left( \begin{array}{cc} \Sigma ^{oo} &{} \Sigma ^{om}\\ \Sigma ^{mo} &{} \Sigma ^{mm} \end{array} \right) . \end{aligned}$$

Let \(A\) be the following \(k \times n\) matrix:

$$\begin{aligned} A = \left( \begin{array}{cccccccc} 1 &{} 0 &{} \cdots &{} 0 &{} 0 &{} 0 &{} \cdots &{} 0 \\ 0 &{} 1 &{} \cdots &{} 0 &{} 0 &{} 0 &{} \cdots &{} 0 \\ \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots \\ 0 &{} 0 &{} \cdots &{} 1 &{} 0 &{} 0 &{} \cdots &{} 0 \end{array} \right) . \end{aligned}$$

where the first \(k\) columns identify a \(k \times k\) identity matrix and the remaining \(l=n-k\) columns correspond to a \(k \times l\) null matrix.

Let us define the random vector \(z=Ax\), \(z \in R^{k}\). Since any linear transformation of a normally distributed random vector is itself normally distributed, \(z=Ax\) is distributed as a multivariate Normal:

$$\begin{aligned} z \sim N_{k}(\mu _{z},\Sigma _{z}) \end{aligned}$$

where the mean vector, \(\mu _{z}\), and the covariance matrix, \(\Sigma _{z}\), are:

$$\begin{aligned} \mu _{z} = E\{Ax\} = AE\{x\} = A \mu \end{aligned}$$
$$\begin{aligned} \Sigma _{z} = Cov(Ax) = ACov(x)A^{T} = A \Sigma A^{T} \end{aligned}$$

Finally, observing that \(z \equiv x^{o}\), \(A \mu \equiv \mu ^{o}\), and \(A \Sigma A^{T} \equiv \Sigma ^{oo}\), it follows that \(x^{o}\) is normally distributed with mean \(\mu ^{o}\) and covariance matrix \(\Sigma ^{oo}\), i.e., \(x^{o} \sim N_{k}(\mu ^{o},\Sigma ^{oo})\).
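
The marginalization result above can be checked numerically. The following sketch is not part of the original paper: it assumes NumPy and uses arbitrary illustrative values for \(\mu \) and \(\Sigma \). It builds the selection matrix \(A\), verifies that \(A\mu \) and \(A \Sigma A^{T}\) coincide with the blocks \(\mu ^{o}\) and \(\Sigma ^{oo}\), and confirms by sampling that the observed components behave like draws from \(N_{k}(\mu ^{o},\Sigma ^{oo})\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2                        # x in R^n; x^o = first k components (illustrative)

mu = np.array([1.0, -2.0, 0.5, 3.0])
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)    # an arbitrary symmetric, positive-definite covariance

# Selection matrix A = [ I_k | 0 ], so that z = A x extracts x^o
A = np.hstack([np.eye(k), np.zeros((k, n - k))])

# A mu and A Sigma A^T coincide with the blocks mu^o and Sigma^oo
print(np.allclose(A @ mu, mu[:k]))                   # True
print(np.allclose(A @ Sigma @ A.T, Sigma[:k, :k]))   # True

# Monte Carlo check that the observed block is distributed as N_k(mu^o, Sigma^oo)
x = rng.multivariate_normal(mu, Sigma, size=300_000)
x_o = x[:, :k]
print(np.abs(x_o.mean(axis=0) - mu[:k]).max())                  # small (sampling error only)
print(np.abs(np.cov(x_o, rowvar=False) - Sigma[:k, :k]).max())  # small (sampling error only)
```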

1.2 Appendix B: Convolution of multivariate Normals with diagonal covariances

Let us consider the convolution of two multivariate Normals with mean vectors \(\mu _{1}\) and \(\mu _{2}\), and diagonal covariance matrices \((\sigma _{1})^{2}\) and \((\sigma _{2})^{2}\):

$$\begin{aligned} C(x) = \int N_{n}(y;\mu _{1},\sigma _{1}) N_{n}(y-x;\mu _{2},\sigma _{2}) \mathrm{d}y. \end{aligned}$$
(59)

Given the diagonal nature of the two covariance matrices, the previous \(n\)-dimensional convolution integral can be rewritten as:

$$\begin{aligned} C(x) = \prod _{i=1}^{n} c_{i}(x_{i}), \end{aligned}$$
(60)

where \(c_{i}(x_{i})\) is the convolution of two univariate Normals with means \(\mu _{1,i}\) and \(\mu _{2,i}\), and variances \(\sigma _{1,i}^{2}\) and \(\sigma _{2,i}^{2}\), i.e.,

$$\begin{aligned} c_{i}(x_{i}) = \int N_{1}(y_{i};\mu _{1,i},\sigma _{1,i}) N_{1}(y_{i}-x_{i};\mu _{2,i},\sigma _{2,i}) \mathrm{d}y_{i}. \end{aligned}$$

The integrals in Eq. (60) can be solved by applying the convolution theorem:

$$\begin{aligned} \mathcal{F}[f *g] = \mathcal{F}[f] \mathcal{F}[g], \end{aligned}$$

where \(\mathcal{F}\) denotes the linear operator that maps each function to its Fourier transform.

As is well known, the Fourier transform of the Gaussian function \(\mathrm{e}^{-ax^{2}}\) is:

$$\begin{aligned} \mathcal{F}[\mathrm{e}^{-a x^{2}}] = \mathrm{e}^{-\frac{\omega ^{2}}{4a}}\sqrt{\frac{\pi }{a}}, \end{aligned}$$

from which, using the linearity of the Fourier transform and applying the shift theorem, it follows:

$$\begin{aligned} \mathcal{F}\left[\frac{1}{\sqrt{2\pi }\sigma } {\text {e}}^{-\frac{(x-\mu )^{2}}{2\sigma ^{2}}}\right] = {\text {e}}^{-\frac{\omega ^{2} \sigma ^{2}}{2}}{\text {e}}^{-j \omega \mu }. \end{aligned}$$

Applying the convolution theorem together with the transform above, the Fourier transform of each \(c_{i}(x_{i})\) in Eq. (60) is the product of the transforms of the two Normal densities:

$$\begin{aligned} \mathcal{F}[c_{i}(x_{i})] = {\text {e}}^{-\frac{\omega ^{2} (\sigma _{1,i}^{2}+\sigma _{2,i}^{2})}{2}} {\text {e}}^{-j \omega (\mu _{1,i}+\mu _{2,i})}, \end{aligned}$$

from which, taking the inverse Fourier transform of both sides, it follows:

$$\begin{aligned} c_{i}(x_{i}) = N_{1}(x_{i};\mu _{1,i}+\mu _{2,i},\sqrt{\sigma _{1,i}^{2}+\sigma _{2,i}^{2}}). \end{aligned}$$
(61)

Finally, from Eqs. (60) and (61) it follows that the result of the convolution integral (59) is a multivariate Normal with mean vector \(\mu =\mu _{1}+\mu _{2}\) and diagonal covariance matrix \((\sigma )^{2}=(\sigma _{1})^{2}+(\sigma _{2})^{2}\).
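
As a numerical sanity check of Eq. (61), the per-component convolution can be computed on a grid and compared with the analytic Normal having summed means and variances. This is a hedged sketch, not part of the paper: it assumes NumPy and arbitrary illustrative parameters for a single component \(i\); the multivariate case of Eq. (60) then follows by taking products over the components.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Normal density N_1(x; mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Illustrative parameters for one component i
mu1, s1 = 0.7, 1.2
mu2, s2 = -1.5, 0.8

# Grid-based numerical convolution c_i(x_i) = ∫ N(y; mu1, s1) N(x - y; mu2, s2) dy
x = np.linspace(-15.0, 15.0, 3001)            # step dx = 0.01
dx = x[1] - x[0]
numeric = np.convolve(gauss(x, mu1, s1), gauss(x, mu2, s2)) * dx   # full discrete convolution
z = np.linspace(-30.0, 30.0, 2 * x.size - 1)                       # grid of the full convolution

# Analytic result of Eq. (61): N_1(z; mu1 + mu2, sqrt(s1^2 + s2^2))
analytic = gauss(z, mu1 + mu2, np.sqrt(s1**2 + s2**2))

print(np.max(np.abs(numeric - analytic)))     # tiny: only the quadrature error of the grid
```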

1.3 Appendix C: Linear conditional expectations

Let \(x\) and \(y\) be two random variables with joint density function \(p(x,y)\). Let us assume \(p(x,y)\) is a normal distribution with parameter \(\theta =(\mu ,\Sigma )\), where \(\mu \) is the mean vector and \(\Sigma \) is the covariance matrix. Parameters \(\mu \) and \(\Sigma \) can be partitioned as:

$$\begin{aligned} \mu = \left( \begin{array}{c} \mu _{x} \\ \mu _{y} \end{array} \right) , \;\;\;\;\; \Sigma = \left( \begin{array}{cc} \Sigma _{xx} &{} \Sigma _{xy}\\ \Sigma _{yx} &{} \Sigma _{yy} \end{array} \right) . \end{aligned}$$

The task is to estimate the expectations of \(y\) and \(y^{2}\) given \(x\) using the least-squares linear regression between \(y\) and \(x\) as predicted by the model, i.e., \(E\{y|x,\theta \}\) and \(E\{y^{2}|x,\theta \}\).

To compute the least-squares linear regression, values of the constants \(a\) and \(b\) must be found that minimize the error expression \(e=E\{[y-(ax+b)]^{2}\}\).

First, let us consider the problem of finding the parameter \(b\) when \(a\) is known. The value of \(b\) which minimizes \(e\) is the least-squares estimate of \(y-ax\) using a constant model, i.e.,

$$\begin{aligned} b = E\{y - ax\} = \mu _{y} - a\mu _{x}. \end{aligned}$$
(62)

Using Eq. (62), the error expression can be rewritten as:

$$\begin{aligned} e&= E\{[(y - \mu _{y}) - a(x - \mu _{x})]^{2}\} \nonumber \\&= \Sigma _{yy} - 2a \Sigma _{xy} + a^{2} \Sigma _{xx}, \end{aligned}$$
(63)

which attains its minimum when the slope of the regression, \(a\), is equal to:

$$\begin{aligned} a = \frac{\Sigma _{xy}}{\Sigma _{xx}}. \end{aligned}$$
(64)

By substituting Eq. (64) into Eq. (63), the expression for the minimum error is obtained:

$$\begin{aligned} e_{m} = \Sigma _{yy} - \frac{\Sigma _{xy}^{2}}{\Sigma _{xx}}. \end{aligned}$$
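
A brief numerical illustration may help here. The sketch below is not part of the paper: it assumes NumPy and arbitrary illustrative values for \(\mu \) and \(\Sigma \), and it checks that the closed-form coefficients of Eqs. (62) and (64) match an ordinary least-squares fit on data drawn from the joint Normal, with residual mean squared error approaching \(e_{m}\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative joint-Normal parameters (both x and y are scalar here)
mu_x, mu_y = 2.0, -1.0
S_xx, S_xy, S_yy = 1.5, 0.9, 2.0           # Sigma_xx, Sigma_xy, Sigma_yy

# Closed-form least-squares coefficients and minimum error, Eqs. (62)-(64)
a = S_xy / S_xx                             # 0.6
b = mu_y - a * mu_x                         # -2.2
e_m = S_yy - S_xy**2 / S_xx                 # 1.46

# Empirical least-squares fit on a large sample from the joint Normal
xy = rng.multivariate_normal([mu_x, mu_y], [[S_xx, S_xy], [S_xy, S_yy]], size=500_000)
x, y = xy[:, 0], xy[:, 1]
a_hat, b_hat = np.polyfit(x, y, deg=1)      # slope and intercept of the fitted line
mse = np.mean((y - (a_hat * x + b_hat)) ** 2)

print(a, b, e_m)                            # analytic values
print(a_hat, b_hat, mse)                    # should match them closely
```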

The linear conditional expectations can be estimated using the principle of orthogonality (see footnote 4) and the simple observation that if two random variables \(z\) and \(w\) are independent, then \(E\{z|w\}=E\{z\}\).

Let us consider the first-order linear conditional expectation \(E\{y|x,\theta \}\). Observe that:

$$\begin{aligned} E\{(y -ax -b)x\} = 0 \end{aligned}$$
$$\begin{aligned} E\{(y -ax -b)\}E\{x\} = 0, \end{aligned}$$

where the first equality comes from the principle of orthogonality, and the second from Eq. (62). From the two previous equations, it follows that \(E\{(y-ax-b)x\}=E\{(y-ax-b)\}E\{x\}\), and therefore, \(y-ax-b\) and \(x\) are not correlated.

Noting that \(y-ax-b\) is normally distributed (being a linear combination of jointly normal variables) and that two uncorrelated, jointly normally distributed random variables are independent, it follows that \(y-ax-b\) and \(x\) are independent, i.e., \(E\{y-ax-b|x,\theta \}=E\{y-ax-b\}\). Moreover, from Eq. (62), \(E\{y-ax-b\}=0\), hence \(E\{y-ax-b|x,\theta \}=0\).

On the other hand:

$$\begin{aligned} E\{(y-ax-b)|x,\theta \} = E\{y-\mu _{y}-a(x-\mu _{x})|x,\theta \} \\ = E\{y|x,\theta \}-\mu _{y} - \frac{\Sigma _{xy}}{\Sigma _{xx}}(x-\mu _{x}), \end{aligned}$$

where Eqs. (62) and (64) are used. Finally, the conditional expectation of \(y\) given \(x\) is:

$$\begin{aligned} E\{y|x,\theta \} = \mu _{y} + \frac{\Sigma _{xy}}{\Sigma _{xx}}(x - \mu _{x}). \end{aligned}$$

Let us now consider the second-order linear conditional expectation \(E\{y^{2}|x,\theta \}\). Using the independence between \(y-ax-b\) and \(x\), it follows:

$$\begin{aligned} E\{(y-ax-b)^{2}|x,\theta \} = E\{(y-ax-b)^{2}\} = e_{m}, \end{aligned}$$

and then, using the principle of orthogonality:

$$\begin{aligned} E\{(y-ax-b)^{2}|x,\theta \} = E\{(y-ax-b)y|x,\theta \} = E\{y^{2}|x,\theta \} - E^{2}\{y|x,\theta \}. \end{aligned}$$

Finally, the conditional expectation of \(y^{2}\) given \(x\) is:

$$\begin{aligned} E\{y^{2}|x,\theta \}&= E^{2}\{y|x,\theta \} + e_{m}\\&= E^{2}\{y|x,\theta \} + \Sigma _{yy} - \frac{\Sigma _{xy}^{2}}{\Sigma _{xx}}. \end{aligned}$$
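
The two conditional moments can be validated by Monte Carlo. The sketch below is again an illustration under assumed parameters (NumPy, arbitrary covariance values), not the authors' code: it compares the closed-form \(E\{y|x,\theta \}\) and \(E\{y^{2}|x,\theta \}\) at a fixed \(x_{0}\) with empirical averages over samples whose \(x\) falls in a narrow bin around \(x_{0}\).

```python
import numpy as np

rng = np.random.default_rng(2)

mu_x, mu_y = 2.0, -1.0
S_xx, S_xy, S_yy = 1.5, 0.9, 2.0            # Sigma_xx, Sigma_xy, Sigma_yy

# Closed-form conditional moments at a chosen point x0
x0 = 3.0
E_y  = mu_y + (S_xy / S_xx) * (x0 - mu_x)   # E{y | x0, theta}   = -0.4
E_y2 = E_y**2 + S_yy - S_xy**2 / S_xx       # E{y^2 | x0, theta} =  1.62

# Empirical conditional averages over samples whose x lies near x0
xy = rng.multivariate_normal([mu_x, mu_y], [[S_xx, S_xy], [S_xy, S_yy]], size=2_000_000)
x, y = xy[:, 0], xy[:, 1]
near = np.abs(x - x0) < 0.02

print(E_y,  y[near].mean())                 # first conditional moment
print(E_y2, (y[near] ** 2).mean())          # second conditional moment
```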

Cite this article

Aste, M., Boninsegna, M., Freno, A. et al. Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal Applic 18, 1–29 (2015). https://doi.org/10.1007/s10044-014-0411-9
