Abstract
Real-world applications of pattern recognition and machine learning often present situations where the data are partly missing, corrupted by noise, or otherwise incomplete. In spite of that, developments in the machine learning community over the last decade have mostly focused on the mathematical analysis of learning machines, making it difficult for practitioners to gain an overview of the major approaches to this issue. Paradoxically, as a consequence, even established methodologies rooted in statistics appear to have long been forgotten. Although the relevant literature is so extensive that exhaustive coverage is no longer feasible, the first goal of this paper is to provide the reader with a nonetheless significant survey of major, well-founded techniques for dealing with the tasks of pattern recognition, machine learning, and density estimation from incomplete data. Secondly, the paper aims to serve as a viable tutorial for the interested practitioner, allowing for a self-contained, step-by-step understanding of several approaches. An effort is made to categorize the different techniques as follows: (1) heuristic methods; (2) statistical approaches; (3) connectionist-oriented techniques; (4) other approaches (dynamical systems, adversarial deletion of features, etc.).
Notes
Bailey and Jain [27] repeated Dudani’s experiments, concluding that the DW-KNN is not superior to the traditional \(K\)-nearest neighbor rule.
“Normal” and “Gaussian” are used as synonyms.
The constant \(a\) which minimizes \(E\{(y-ax)^{2}\}\), obtained by the least squares method, is such that \(y-ax\) is orthogonal to \(x\), that is: \(E\{(y-ax)x\} = 0\), whence \(a = E\{xy\}/E\{x^{2}\}\).
References
Lee C, Choi SW, Lee J-M, Lee I-B (2004) Sensor fault identification in MSPM using reconstructed monitoring statistics. Ind Eng Chem Res 43(15):4293–4304
Lopes VV, Menezes JC (2005) Inferential sensor design in the presence of missing data: a case study. Chemometr Intell Lab Syst 78(1–2):1–10
Rendtel U (2006) The 2005 plenary meeting on missing data and measurement error. AStA Adv Stat Anal 90(4):493–499
Mott P, Sammis TW, Southward GM (1994) Climate data estimation using climate information from surrounding climate stations. Appl Eng Agric 10(1):41–44
Li Q, Roxas BAP (2008) Significance analysis of microarray for relative quantitation of LC/MS data in proteomics. BMC Bioinform 9(1):187–197
Green P, Barker J, Cooke M, Josifovski L (2001) Handling missing and unreliable information in speech recognition. In: Proceedings of AISTATS
Barker J (2012) Missing-data techniques: recognition with incomplete spectrograms. Wiley, New York, pp 369–398
Pynadath D, Wellman M (2000) Probabilistic state-dependent grammars for plan recognition. In: Proceedings of the conference on uncertainty in artificial intelligence, pp 507–514
Guerreiro RFC, Aguiar PMQ (2002) Factorization with missing data for 3d structure recovery. In: Proceedings of the IEEE workshop on multimedia signal processing, pp 105–108
Jia H, Martinez AM (2009) Support vector machines in face recognition with occlusions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 136–141
Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge
Chen K, Wang S (2011) Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions. IEEE Trans Pattern Anal Mach Intell 99(1):129–143
You Z, Yin Z, Han K, Huang D-S, Zhou X (2010) A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinform 11:343
Schwenker F, Trentin E (2014) Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recognit Lett 37:4–14
Gabrys B (2009) Learning with missing or incomplete data. In: Foggia P, Sansone C, Vento M (eds) Image analysis and processing ICIAP 2009. Springer, Berlin, Heidelberg, pp 1–4
Vinod NC, Punithavalli M (2011) Classification of incomplete data handling techniques: an overview. Int J Comput Sci Eng 3(1):340–344
Richard MD, Lippmann RP (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3:461–483
Lee RCT, Slagle JR, Mong CT (1976) Application of clustering to estimate missing data and improve data integrity. In: Proceedings of 2nd international software engineering conference, pp 539–544, San Francisco, October 1976
Lim C-P, Leong J-H, Kuan M-M (2005) A hybrid neural network system for pattern classification tasks with missing features. IEEE Trans Pattern Anal Mach Intell 27(4):648–653
Zhang S, Qin Y, Zhu X, Zhang J, Zhang C (2006) Optimized parameters for missing data imputation. In: PRICAI, pp 1010–1016
Pelckmans K, De Brabanter J, Suykens JAK, De Moor B (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692
Su X, Khoshgoftaar TM, Zhu X, Greiner R (2008) Imputation-boosted collaborative filtering using machine learning classifiers. In: SAC ’08: Proceedings of the 2008 ACM symposium on applied computing. ACM, New York, pp 949–950
Su X, Greiner R, Khoshgoftaar TM, Napolitano A (2011) Using classifier-based nominal imputation to improve machine learning. In: Huang JZ, Cao L, Srivastava J (eds) PAKDD (1). Springer, pp 124–135
Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York
Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern 9(10):617–621
Dudani SA (1976) The distance-weighted \(k\)-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6:325–327
Bailey T, Jain AK (1978) A note on distance-weighted \(k\)-nearest neighbor rules. IEEE Trans Syst Man Cybern 8:311–313
Morin RL, Raeside DE (1981) A reappraisal of distance-weighted \(k\)-nearest neighbor classification for pattern recognition with missing data. IEEE Trans Syst Man Cybern 11(3):241–243
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147–177
Ghahramani Z, Jordan MI (1994) Learning from incomplete data. AI Memo 1509, CBCL paper 108. MIT, Cambridge
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Rao CR (1972) Linear statistical inference and its applications. Wiley, New York
Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8:129–151
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Gilks WR, Richardson S, Spiegelhalter D (1996) Markov chain Monte Carlo in practice. Chapman & Hall/CRC, New York
Ramoni M, Sebastiani P (2001) Robust learning with missing data. Machine Learn 45(2):147–170
Beaton AE (1964) The use of special matrix operations in statistical calculus. Educational Testing Service Research Bulletin, RB-64-51
Dempster AP (1969) Elements of continuous multivariate analysis. Addison-Wesley, Reading
McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
Ghahramani Z (1994) Solving inverse problems using an EM approach to density estimation. In: Mozer MC, Smolensky P, Touretzky DS, Elman JL, Weigend AS (eds) Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, pp 316–323
Ghahramani Z, Jordan MI (1994) Supervised learning from incomplete data via an EM approach. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo
Moss S, Hancock ER (1997) Registering incomplete radar images using the EM algorithm. Image Vis Comput 15:637–648
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth International Group, Belmont
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19:1–141
Tresp V, Hollatz J, Ahmad S (1993) Network structuring and training using rule-based knowledge. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, pp 871–878
Jordan MI, Jacobs RA (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Comput 6:181–214
Tresp V, Ahmad S, Neuneier R (1994) Training neural networks with deficient data. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann Publishers, San Mateo, pp 128–135
Streit RL, Luginbuhl TE (1994) Maximum likelihood training of probabilistic neural networks. IEEE Trans Neural Netw 5(5):764–783
Tanaka M, Kotokawa Y, Tanino T (1996) Pattern classification by stochastic neural networks with missing data. In: IEEE international conference on systems, man and cybernetics, Beijing, China, pp 690–695, 14–17 October 1996
Vellido A (2006) Missing data imputation through GTM as a mixture of t-distributions. Neural Netw 19(10):1624–1635
Hwang JN, Wang CJ (1994) Classification of incomplete data with missing elements. In: International symposium on artificial neural networks, Tainan, Taiwan, December 1994, pp 471–477
Schafer JL (2010) Analysis of incomplete multivariate data. Chapman & Hall/CRC Press, London
Linden A, Kindermann J (1989) Inversion of multilayer nets. In: Proceedings of the international joint conference on neural networks, II, Washington DC, June 1989, pp 425–430
Ahmad S, Tresp V (1993) Some solutions to the missing feature problem in vision. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in neural information processing systems 5. Morgan Kaufmann Publishers, San Mateo, pp 393–400
Tresp V, Neuneier R, Ahmad S (1995) Efficient methods for dealing with missing data in supervised learning. In: Tesauro G, Touretzky D, Leen T (eds) Advances in neural information processing systems 7. Morgan Kaufmann Publishers, San Mateo, pp 689–696
Graham BS, Keisuke H (2011) Robustness to parametric assumptions in missing data models. Am Econ Rev 101(3):538–543
Ahmad S, Tresp V (1993) Classification with missing and uncertain inputs. In: Proceedings of the IEEE international conference on neural networks, San Francisco
Moody J, Darken C (1988) Learning with localized receptive fields. In: Hinton G, Sejnowski T (eds) Proceedings of the 1988 Connectionist Models Summer School. Morgan-Kauffmann
Nowlan S (1990) Maximum likelihood competitive learning. In: Advances in neural information processing systems 2. Morgan Kaufmann Publishers, pp 574–582
Moody J, Darken C (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076
Breiman L, Meisel W, Purcell E (1977) Variable kernel estimates of multivariate densities. Technometrics 19(2):135–144
Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810
Ahmad S (1994) Feature densities are required for computing feature correspondence. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo, pp 961–968
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(57)
Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ (2004) Analyzing incomplete longitudinal clinical trial data. Biostatistics 5(3):445–464
Congdon P (2006) Bayesian statistical modelling, 2nd edn. Wiley, New York
Collins LM, Schafer JL, Kam CM (2001) A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychol Methods 6:330–351
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of economic and social measurement, vol 5, number 4. National Bureau of Economic Research, Inc, pp 475–492
Berndt ER, Hall BH, Hall RE, Hausman JA (1974) Estimation and inference in nonlinear structural models. Ann Econ Soc Meas 3:653–665
Marlin B, Roweis S, Zemel R (2005) Unsupervised learning with non-ignorable missing data. In: Proceedings of the tenth international workshop on artificial intelligence and statistics (AISTATS), pp 222–229
Little RJA (1993) Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 88:125–134
Molenberghs G, Kenward M (2007) Missing data in clinical studies. Wiley, New York
Vonesh EF, Greene T, Schluchter MD (2006) Shared parameter models for the joint analysis of longitudinal data and event times. Stat Med 25(1):143–163
Little RJ (2006) Selection and pattern-mixture models. CRC Press, London, pp 409–431
Gad AM, Darwish NMM (2013) A shared parameter model for longitudinal data with missing values. Am J Appl Math Stat 1(2):30–35
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Harel O, Zhou XH (2007) Multiple imputation: review of theory, implementation and software. Stat Med 26:3057–3077
Kenward MG, Carpenter JC (2009) Multiple Imputation. CRC Press, London, pp 477–500
Saltelli A, Chan K, Scott EM (2000) Sensitivity analysis. Wiley, New York
White IR, Royston P, Wood AM (2011) Multiple imputation using chained equations: issues and guidance for practice. Stat Med 30(4):377–399
Daniel RM, Kenward MG (2012) A method for increasing the robustness of multiple imputation. Comput Stat Data Anal 56(6):1624–1643
Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG (2006) The nature of sensitivity in monotone missing not at random models. Comput Stat Data Anal 50(3):830–858
Park J-S, Qian GQ, Jun Y (2008) Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data. Appl Math Comput 197(1):440–450
Stubbendick AL, Ibrahim JG (2003) Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics 59(4):1140–50
Jolani S (2012) Dual imputation strategies for analyzing incomplete data. Utrecht University, Utrecht
Enders CK (2011) Missing not at random models for latent growth curve analyses. Psychol Methods 16(1):1–16
Molenberghs G, Beunckens C, Sotto C, Kenward MG (2008) Every missingness not at random model has a missingness at random counterpart with equal fit. J R Stat Soc Ser B 70(Part 2):371–388
Vamplew P, Adams A (1992) Missing values in a backpropagation neural net. In: Leong S, Jabri M (eds) Proceedings of the third Australian conference on neural networks, Sidney, February 1992, pp 64–67
Vamplew P, Clark D, Adams A, Muench J (1996) Techniques for dealing with missing values in feedforward networks. In: Proceedings of the seventh Australian conference on neural networks, Canberra, 10–12 April 1996
Southcott ML, Bogner RE (1993) Classification of incomplete data using neural networks. In: Proceedings of the fourth Australian conference on neural networks, Melbourne, 3–5 February 1993, pp 220–223
Hwang JN, Wang CJ (1994) Neural network inversion techniques for missing data applications. In: IEEE neural network workshop on signal processing, Ermioni, Greece, September 1994, pp 22–31
Specht DF (1990) Probabilistic neural networks. Neural Netw 3(1):109–118
Vapnik V (1982) Estimation of dependences based on empirical data. Springer, Berlin
Buntine WL, Weigend AS (1991) Bayesian back-propagation. Complex Syst 5(6):603–643
Arrowsmith DK, Place CM (1990) An introduction to dynamical systems. Cambridge University Press, Cambridge
Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):267–296
Jain LC, Medsker LR (1999) Recurrent neural networks: design and applications. CRC Press Inc, Boca Raton
Jurafsky D, Martin JH (2009) Speech and language processing, 2nd edn. Prentice-Hall Inc, Upper Saddle River
Trentin E, Gori M (2003) Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Trans Neural Netw 14(6):1519–1531
Bertolami R, Bunke H (2008) Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recogn 41(11):3452–3460
Baldi P, Brunak S (2001) Bioinformatics: the machine learning approach, 2nd edn. MIT Press, Cambridge
Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 7. MIT Press
Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the National Academy of Sciences, vol 79, pp 2554–2558
Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City
Almeida L (1987) A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In: Caudill M, Butler C (eds) Proceedings of the IEEE first international conference on neural networks, vol 2. IEEE, San Diego, pp 609–618
Pineda F (1989) Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comput 1:161–172
Bengio Y, Gingras F (1996) Recurrent neural networks for missing or asynchronous data. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems 8. MIT Press, Cambridge, pp 395–401
Minsky ML, Papert SA (1969) Perceptrons. MIT Press, Cambridge
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 8. MIT Press, pp 318–362
Globerson A, Roweis ST (2006) Nightmare at test time: robust learning by feature deletion. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, pp 353–360
Dekel O, Shamir O (2008) Learning to classify with missing and corrupted features. In: ICML ’08: Proceedings of the 25th international conference on machine learning. ACM, New York, pp 216–223
Ding Y, Simonoff JS (2010) An investigation of missing data methods for classification trees applied to binary response data. J Mach Learn Res 11:131–170
Twala B (2009) An empirical comparison of techniques for handling incomplete data using decision trees. Appl Artif Intell 23:373–405
Luengo J, García S, Herrera F (2010) A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFs and event covering method. Neural Netw 23:406–418
Corani G, Zaffalon M (2008) Learning reliable classifiers from small or incomplete data sets: the naive credal classifier 2. J Mach Learn Res 9:581–621
Chierichetti F, Kleinberg J, Liben-Nowell D (2011) Reconstructing patterns of information diffusion from incomplete observations. In: Shawe-Taylor J, Zemel RS, Bartlett P, Pereira FCN, Weinberger KQ (eds) Advances in neural information processing systems 24. MIT Press, Cambridge, pp 792–800
Greenwald A, Li J, Sodomka E (2012) Approximating equilibria in sequential auctions with incomplete information and multi-unit demand. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25. MIT Press, Cambridge, pp 2330–2338
Ghannad-Rezaie M, Soltanian-Zadeh H, Ying H, Dong M (2010) Selection-fusion approach for classification of data sets with missing values. Pattern Recognit 43:2340–2350
Farhangfar A, Kurgan L, Dy J (2008) Impact of imputation of missing values on classification error for discrete data. Pattern Recognit 41:3692–3705
Saar-Tsechansky M, Provost F (2007) Handling missing values when applying classification models. J Mach Learn Res 8:1623–1657
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096
Kim H, Golub GH, Park H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21(2):187–198
Scheel I, Aldrin M, Glad IK, Sorum R, Lyng H, Frigessi A (2005) The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 21(23):4272–4279
Wang X, Jiang Z, Feng H (2006) Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinform 7(32):1–10
Wong DSV, Wong FK, Wood GR (2007) A multi-stage approach to clustering and imputation of gene expression profiles. Bioinformatics 23(8):998–1005
Yoon D, Lee EK, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(2):1–7
Roure B, Baurain D, Philippe H (2013) Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol Biol Evol 30(1):197–214
Nutt W, Razniewski S, Vegliach G (2012) Incomplete databases: missing records and missing values. In: Proceedings of the 17th international conference on database systems for advanced applications, DASFAA’12. Springer, pp 298–310
Kaambwa B, Bryan S, Billingham L (2012) Do the methods used to analyse missing data really matter? An examination of data from an observational study of Intermediate Care patients. BMC Res Notes 5(1):330
David M, Little RJA, Samuhel ME, Triest RK (1986) Alternative methods for CPS income imputation. J Am Stat Assoc 81(393):29–41
Foster EM, Fang GY (2004) Alternative methods for handling attrition: an illustration using data from the fast track evaluation. Eval Rev 28(5):434–464
Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Dong Y, Peng C-YJ (2013) Principled missing data methods for researchers. Springerplus 2(1):222
Ali AMG, Dawson SJ, Blows FM, Provenzano E, Ellis IO, Baglietto L, Huntsman D, Caldas C, Pharoah PD (2011) Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer. Br J Cancer 104(4):693–699
Fielding S, Fayers P, Ramsay C (2010) Predicting missing quality of life data that were later recovered: an empirical comparison of approaches. Clin Trials 7(4):333–342
Marshall A, Altman D, Royston P, Holder R (2010) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 10(1):7
Hedden S, Woolson R, Malcolm R (2008) A comparison of missing data methods for hypothesis tests of the treatment effect in substance abuse clinical trials: a Monte-Carlo simulation study. Subst Abuse Treatm Prev Policy 3(1):1–9
Roda C, Nicolis I, Momas I, Guihenneuc-Jouyaux C (2013) Comparing methods for handling missing data. Epidemiology 24(3):469–471
Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576
Schwartz T, Zeig-Owens R (2013) Knowledge (of your missing data) is power: handling missing values in your SAS dataset. In: Proceedings of SAS global forum SUGI 31: statistics, data analysis and data mining, San Francisco, California, 28 April–1 May 2013
Templ M, Alfons A, Filzmoser P (2012) Exploring incomplete data using visualization techniques. Adv Data Anal Classif 6(1):29–47
Heitjan DF (2011) Incomplete data: what you don’t know might hurt you. Cancer Epidemiol Biomark Prev 20(8):1567–1570
Additional information
This work was accomplished while A. Freno was at the University of Siena.
Appendix
1.1 Appendix A: Marginal distributions of multivariate Normals
Let \(x \in R^{n}\) be a random vector normally distributed with mean \(\mu \) and covariance matrix \(\Sigma \), i.e., \(x\sim N_{n}(\mu ,\Sigma )\). Let \(x^{o} \in R^{k}\) be a vector corresponding to a subset of \(k\) components of \(x\) and let \(x^{m} \in R^{l}\) be a vector corresponding to the remaining components of \(x\). Without loss of generality, we suppose that the components of \(x^{o}\) correspond to the first \(k\) components of \(x\), and \(x^{m}\) to the remaining \(l\) components (otherwise, the components of \(x\), \(\mu \) and \(\Sigma \) can be simply relabeled).
The random vector \(x\), its mean vector, \(\mu \), and its covariance matrix, \(\Sigma \), can be partitioned according to \(x^{o}\) and \(x^{m}\), as:
\[
x = \begin{pmatrix} x^{o} \\ x^{m} \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu^{o} \\ \mu^{m} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma^{oo} & \Sigma^{om} \\ \Sigma^{mo} & \Sigma^{mm} \end{pmatrix}.
\]
Let \(A\) be the following \(k \times n\) matrix:
\[
A = \bigl(\, I_{k} \;\; 0_{k \times l} \,\bigr),
\]
where the first \(k\) columns form the \(k \times k\) identity matrix \(I_{k}\) and the remaining \(l=n-k\) columns form the \(k \times l\) null matrix.
Let us define the random vector \(z=Ax\), \(z \in R^{k}\). Since any linear combination of the components of a normally distributed random vector is itself normally distributed, the \(k\) linear combinations \(z=Ax\) follow a multivariate Normal:
\[
z \sim N_{k}(\mu_{z}, \Sigma_{z}),
\]
where the mean vector, \(\mu _{z}\), and the covariance matrix, \(\Sigma _{z}\), are:
\[
\mu_{z} = A\mu, \qquad \Sigma_{z} = A \Sigma A^{T}.
\]
Finally, observing that \(z \equiv x^{o}\), \(A \mu \equiv \mu ^{o}\), and \(A \Sigma A^{T} \equiv \Sigma ^{oo}\), it follows that \(x^{o}\) is normally distributed with mean \(\mu ^{o}\) and covariance matrix \(\Sigma ^{oo}\), i.e., \(x^{o} \sim N_{k}(\mu ^{o},\Sigma ^{oo})\).
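The argument above can be checked numerically: selecting the first \(k\) components with \(A = (I_{k}\;0)\) must reproduce the leading sub-blocks of \(\mu\) and \(\Sigma\). The sketch below uses pure-Python matrix algebra; the three-dimensional example values are made up for illustration and are not from the paper.

```python
def matmul(M, N):
    """Row-by-column product of two matrices given as lists of lists."""
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

n, k = 3, 2
mu = [1.0, 2.0, 3.0]
Sigma = [[2.0, 0.5, 0.1],
         [0.5, 1.0, 0.2],
         [0.1, 0.2, 3.0]]

# A = (I_k | 0): k x n selector of the first k ("observed") components.
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(k)]

mu_z = [sum(A[i][j] * mu[j] for j in range(n)) for i in range(k)]
Sigma_z = matmul(matmul(A, Sigma), transpose(A))

assert mu_z == mu[:k]                             # mu^o
assert Sigma_z == [row[:k] for row in Sigma[:k]]  # Sigma^{oo}
```

Since \(A\) only selects components, \(A\mu\) and \(A\Sigma A^{T}\) coincide exactly with \(\mu^{o}\) and \(\Sigma^{oo}\), as the assertions confirm.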
1.2 Appendix B: Convolution of multivariate Normals with diagonal covariances
Let us consider the convolution of two multivariate Normal densities with mean vectors \(\mu _{1}\) and \(\mu _{2}\), and diagonal covariance matrices \((\sigma _{1})^{2}\) and \((\sigma _{2})^{2}\):
\[
c(x) = \int_{R^{n}} N_{n}\bigl(x'; \mu_{1}, (\sigma_{1})^{2}\bigr)\, N_{n}\bigl(x - x'; \mu_{2}, (\sigma_{2})^{2}\bigr)\, \mathrm{d}x'. \qquad (59)
\]
Given the diagonal nature of the two covariance matrices, the previous \(n\)-dimensional convolution integral factorizes as:
\[
c(x) = \prod_{i=1}^{n} c_{i}(x_{i}), \qquad (60)
\]
where \(c_{i}(x_{i})\) is the convolution of two univariate Normals with means \(\mu _{1,i}\) and \(\mu _{2,i}\), and variances \(\sigma _{1,i}^{2}\) and \(\sigma _{2,i}^{2}\), i.e.,
\[
c_{i}(x_{i}) = \int_{-\infty}^{+\infty} N\bigl(x'; \mu_{1,i}, \sigma_{1,i}^{2}\bigr)\, N\bigl(x_{i} - x'; \mu_{2,i}, \sigma_{2,i}^{2}\bigr)\, \mathrm{d}x'.
\]
The integrals in Eq. (60) can be solved by applying the convolution theorem:
\[
\mathcal{F}\{f * g\} = \mathcal{F}\{f\}\,\mathcal{F}\{g\},
\]
where \(\mathcal{F}\) indicates the linear operator which maps each function into its own Fourier transform.
As is well known, the transform of the exponential function \(\mathrm{e}^{-ax^{2}}\) is:
\[
\mathcal{F}\{\mathrm{e}^{-ax^{2}}\}(\omega) = \sqrt{\frac{\pi}{a}}\, \mathrm{e}^{-\omega^{2}/(4a)},
\]
from which, using the linearity of the Fourier transform and applying the shift theorem, it follows:
\[
\mathcal{F}\bigl\{N(\cdot\,; \mu_{j,i}, \sigma_{j,i}^{2})\bigr\}(\omega) = \mathrm{e}^{-i\omega \mu_{j,i}}\, \mathrm{e}^{-\sigma_{j,i}^{2}\omega^{2}/2}, \qquad j = 1, 2.
\]
Applying the convolution theorem to the transforms of the two univariate Normals yields \(\mathcal{F}\{c_{i}\}(\omega) = \mathrm{e}^{-i\omega(\mu_{1,i}+\mu_{2,i})}\, \mathrm{e}^{-(\sigma_{1,i}^{2}+\sigma_{2,i}^{2})\omega^{2}/2}\), from which, computing the inverse Fourier transform of the two sides of the previous equation, it follows:
\[
c_{i}(x_{i}) = N\bigl(x_{i};\, \mu_{1,i}+\mu_{2,i},\, \sigma_{1,i}^{2}+\sigma_{2,i}^{2}\bigr). \qquad (61)
\]
Finally, from Eqs. (60) and (61) it follows that the result of the convolution integral (59) is a multivariate Normal with mean vector \(\mu =\mu _{1}+\mu _{2}\) and diagonal covariance matrix \((\sigma )^{2}=(\sigma _{1})^{2}+(\sigma _{2})^{2}\).
1.3 Appendix C: Linear conditional expectations
Let \(x\) and \(y\) be two random variables with joint density function \(p(x,y)\). Let us assume \(p(x,y)\) is a Normal density with parameter \(\theta =(\mu ,\Sigma )\), where \(\mu \) is the mean vector and \(\Sigma \) is the covariance matrix. Parameters \(\mu \) and \(\Sigma \) can be rewritten as:
\[
\mu = \begin{pmatrix} \mu_{x} \\ \mu_{y} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}.
\]
The task is to estimate the expectations of \(y\) and \(y^{2}\) given \(x\) using the least square linear regression between \(y\) and \(x\) as predicted by the model, i.e., \(E\{y|x,\theta \}\) and \(E\{y^{2}|x,\theta \}\).
To compute the least square linear regression, values for constants \(a\) and \(b\) must be found which minimize the error expression \(e=E\{[y-(ax+b)]^{2}\}\).
First, let us consider the problem of finding parameter \(b\) when \(a\) is known. The value of \(b\) which minimizes \(e\) is the least squares estimate of \(y-ax\) using a constant model, i.e.,
\[
b = E\{y - ax\} = \mu_{y} - a\mu_{x}. \qquad (62)
\]
Using Eq. (62), the error expression can be rewritten as:
\[
e = E\bigl\{[(y-\mu_{y}) - a(x-\mu_{x})]^{2}\bigr\} = \sigma_{yy} - 2a\sigma_{xy} + a^{2}\sigma_{xx}, \qquad (63)
\]
which attains its minimum when the slope of the regression, \(a\), is equal to:
\[
a = \frac{\sigma_{xy}}{\sigma_{xx}}. \qquad (64)
\]
By substituting Eq. (64) into Eq. (63), the expression for the minimum error is obtained:
\[
e_{\min} = \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}}.
\]
The linear conditional expectations can be estimated using the principle of orthogonality and the simple observation that if two random variables \(z\) and \(w\) are independent, then \(E\{z|w\}=E\{z\}\).
Let us consider the first-order linear conditional expectation \(E\{y|x,\theta \}\). Observe that:
\[
E\{(y-ax-b)x\} = 0 = E\{y-ax-b\},
\]
where the first equality comes from the principle of orthogonality, and the second from Eq. (62). From these equalities, it follows that \(E\{(y-ax-b)x\}=E\{(y-ax-b)\}E\{x\}\), and therefore, \(y-ax-b\) and \(x\) are not correlated.
Noting that \(y-ax-b\) is normally distributed (being a linear combination of jointly normal variables) and that two uncorrelated, jointly normally distributed random variables are independent, it follows that \(y-ax-b\) and \(x\) are independent, i.e., \(E\{y-ax-b|x,\theta \}=E\{y-ax-b\}\). Moreover, from Eq. (62) it follows that \(E\{y-ax-b|x,\theta \}=0\).
On the other hand:
\[
ax + b = \mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x}),
\]
where Eqs. (62) and (64) are used. Finally, the conditional expectation of \(y\) given \(x\) is:
\[
E\{y|x,\theta\} = \mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x}).
\]
Let us now consider the second-order linear conditional expectation \(E\{y^{2}|x,\theta \}\). Using the independence between \(y-ax-b\) and \(x\), it follows:
\[
E\{(y-ax-b)^{2}|x,\theta\} = E\{(y-ax-b)^{2}\} = \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}},
\]
and then, using the principle of orthogonality:
\[
E\{y^{2}|x,\theta\} = E\{(y-ax-b)^{2}|x,\theta\} + (ax+b)^{2}.
\]
Finally, the conditional expectation of \(y^{2}\) given \(x\) is:
\[
E\{y^{2}|x,\theta\} = \left[\mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x})\right]^{2} + \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}}.
\]
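The closed forms derived in this appendix are easy to implement directly: given the joint Normal parameters, the regression slope is \(a = \sigma_{xy}/\sigma_{xx}\), the intercept is \(b = \mu_{y} - a\mu_{x}\), and the two conditional moments follow. The sketch below is not the authors' code; the numeric parameters are arbitrary illustration values.

```python
def conditional_moments(mu_x, mu_y, s_xx, s_xy, s_yy, x):
    """Conditional moments of y given x under a bivariate Normal model."""
    a = s_xy / s_xx                # regression slope (Eq. 64)
    b = mu_y - a * mu_x            # intercept (Eq. 62)
    e_min = s_yy - s_xy**2 / s_xx  # residual (minimum) variance
    ey = a * x + b                 # E{y | x, theta}
    ey2 = ey**2 + e_min            # E{y^2 | x, theta}
    return ey, ey2

ey, ey2 = conditional_moments(mu_x=0.0, mu_y=1.0,
                              s_xx=2.0, s_xy=1.0, s_yy=1.5, x=2.0)
assert ey == 2.0   # a = 0.5, b = 1.0, so E{y|x=2} = 0.5*2 + 1
assert ey2 == 5.0  # 2.0^2 + (1.5 - 0.5)
```

These are exactly the quantities needed, e.g., to impute a missing \(y\) (and its square) from an observed \(x\) in EM-style procedures for Gaussian models.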
Aste, M., Boninsegna, M., Freno, A. et al. Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal Applic 18, 1–29 (2015). https://doi.org/10.1007/s10044-014-0411-9