Abstract
Real-world applications of pattern recognition and machine learning often present situations where the data are partly missing, corrupted by noise, or otherwise incomplete. In spite of that, developments in the machine learning community over the last decade have mostly focused on the mathematical analysis of learning machines, making it difficult for practitioners to gain an overview of the major approaches to this issue. Paradoxically, as a consequence, even established methodologies rooted in statistics appear to have long been forgotten. Although the relevant literature is so extensive that exhaustive coverage is no longer feasible, the first goal of this paper is to provide the reader with a nonetheless significant survey of major, well-founded techniques for dealing with the tasks of pattern recognition, machine learning, and density estimation from incomplete data. Secondly, the paper aims to serve as a viable tutorial for the interested practitioner, allowing for a self-contained, step-by-step understanding of several approaches. An effort is made to categorize the different techniques as follows: (1) heuristic methods; (2) statistical approaches; (3) connectionist-oriented techniques; (4) other approaches (dynamical systems, adversarial deletion of features, etc.).
Notes
Bailey and Jain [27] repeated Dudani’s experiments, concluding that the DW-KNN is not superior to the traditional \(K\)-nearest neighbor rule.
“Normal” and “Gaussian” are used as synonyms.
The constant \(a\) which minimizes \(E\{(y-ax)^{2}\}\), obtained by the least squares method, is such that \(y-ax\) is orthogonal to \(x\), that is: \(E\{(y-ax)x\} = 0\), whence \(a = E\{xy\}/E\{x^{2}\}\).
References
Lee C, Choi SW, Lee J-M, Lee I-B (2004) Sensor fault identification in MSPM using reconstructed monitoring statistics. Ind Eng Chem Res 43(15):4293–4304
Lopes VV, Menezes JC (2005) Inferential sensor design in the presence of missing data: a case study. Chemometr Intell Lab Syst 78(1–2):1–10
Rendtel U (2006) The 2005 plenary meeting on missing data and measurement error. AStA Adv Stat Anal 90(4):493–499
Mott P, Sammis TW, Southward GM (1994) Climate data estimation using climate information from surrounding climate stations. Appl Eng Agric 10(1):41–44
Li Q, Roxas BAP (2008) Significance analysis of microarray for relative quantitation of LC/MS data in proteomics. BMC Bioinform 9(1):187–197
Green P, Barker J, Cooke M, Josifovski L (2001) Handling missing and unreliable information in speech recognition. In: Proceedings of AISTATS
Barker J (2012) Missing-data techniques: recognition with incomplete spectrograms. Wiley, New York, pp 369–398
Pynadath D, Wellman M (2000) Probabilistic state-dependent grammars for plan recognition. In: Proceedings of the conference on uncertainty in artificial intelligence, pp 507–514
Guerreiro RFC, Aguiar PMQ (2002) Factorization with missing data for 3d structure recovery. In: Proceedings of the IEEE workshop on multimedia signal processing, pp 105–108
Jia H, Martinez AM (2009) Support vector machines in face recognition with occlusions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 136–141
Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge
Chen K, Wang S (2011) Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions. IEEE Trans Pattern Anal Mach Intell 99(1):129–143
You Z, Yin Z, Han K, Huang D-S, Zhou X (2010) A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinform 11:343
Schwenker F, Trentin E (2014) Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recognit Lett 37:4–14
Gabrys B (2009) Learning with missing or incomplete data. In: Foggia P, Sansone C, Vento M (eds) Image analysis and processing ICIAP 2009. Springer, Berlin, Heidelberg, pp 1–4
Vinod NC, Punithavalli M (2011) Classification of incomplete data handling techniques: an overview. Int J Comput Sci Eng 3(1):340–344
Richard MD, Lippmann RP (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3:461–483
Lee RCT, Slagle JR, Mong CT (1976) Application of clustering to estimate missing data and improve data integrity. In: Proceedings of 2nd international software engineering conference, pp 539–544, San Francisco, October 1976
Lim C-P, Leong J-H, Kuan M-M (2005) A hybrid neural network system for pattern classification tasks with missing features. IEEE Trans Pattern Anal Mach Intell 27(4):648–653
Zhang S, Qin Y, Zhu X, Zhang J, Zhang C (2006) Optimized parameters for missing data imputation. In: PRICAI, pp 1010–1016
Pelckmans K, De Brabanter J, Suykens JAK, De Moor B (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692
Su X, Khoshgoftaar TM, Zhu X, Greiner R (2008) Imputation-boosted collaborative filtering using machine learning classifiers. In: SAC ’08: Proceedings of the 2008 ACM symposium on applied computing. ACM, New York, pp 949–950
Su X, Greiner R, Khoshgoftaar TM, Napolitano A (2011) Using classifier-based nominal imputation to improve machine learning. In: Huang JZ, Cao L, Srivastava J (eds) PAKDD (1). Springer, pp 124–135
Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York
Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern 9(10):617–621
Dudani SA (1976) The distance-weighted \(k\)-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6:325–327
Bailey T, Jain AK (1978) A note on distance-weighted \(k\)-nearest neighbor rules. IEEE Trans Syst Man Cybern 8:311–313
Morin RL, Raeside DE (1981) A reappraisal of distance-weighted \(k\)-nearest neighbor classification for pattern recognition with missing data. IEEE Trans Syst Man Cybern 11(3):241–243
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147–177
Ghahramani Z, Jordan MI (1994) Learning from incomplete data. AI Memo 1509, CBCL paper 108. MIT, Cambridge
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Rao CR (1972) Linear statistical inference and its applications. Wiley, New York
Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103
Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26:195–239
Xu L, Jordan MI (1996) On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput 8:129–151
Duda RO, Hart PE (1973) Pattern classification and scene analysis. Wiley, New York
Gilks WR, Richardson S, Spiegelhalter D (1996) Markov chain Monte Carlo in practice. Chapman & Hall/CRC, New York
Ramoni M, Sebastiani P (2001) Robust learning with missing data. Machine Learn 45(2):147–170
Beaton AE (1964) The use of special matrix operations in statistical calculus. Educational Testing Service Research Bulletin, RB-64-51
Dempster AP (1969) Elements of continuous multivariate analysis. Addison-Wesley, Reading
McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
Ghahramani Z (1994) Solving inverse problems using an EM approach to density estimation. In: Mozer MC, Smolensky P, Touretzky DS, Elman JL, Weigend AS (eds) Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, pp 316–323
Ghahramani Z, Jordan MI (1994) Supervised learning from incomplete data via an EM approach. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo
Moss S, Hancock ER (1997) Registering incomplete radar images using the EM algorithm. Image Vis Comput 15:637–648
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth International Group, Belmont
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19:1–141
Tresp V, Hollatz J, Ahmad S (1993) Network structuring and training using rule-based knowledge. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, pp 871–878
Jordan MI, Jacobs RA (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Comput 6:181–214
Tresp V, Ahmad S, Neuneier R (1994) Training neural networks with deficient data. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems. Morgan Kaufmann Publishers, San Mateo, pp 128–135
Streit RL, Luginbuhl TE (1994) Maximum likelihood training of probabilistic neural networks. IEEE Trans Neural Netw 5(5):764–783
Tanaka M, Kotokawa Y, Tanino T (1996) Pattern classification by stochastic neural networks with missing data. In: IEEE international conference on systems, man and cybernetics, Beijing, China, pp 690–695, 14–17 October 1996
Vellido A (2006) Missing data imputation through GTM as a mixture of t-distributions. Neural Netw 19(10):1624–1635
Hwang JN, Wang CJ (1994) Classification of incomplete data with missing elements. In: International symposium on artificial neural networks, Tainan, Taiwan, December 1994, pp 471–477
Schafer JL (2010) Analysis of incomplete multivariate data. Chapman & Hall/CRC Press, London
Linden A, Kindermann J (1989) Inversion of multilayer nets. In: Proceedings of the international joint conference on neural networks, II, Washington DC, June 1989, pp 425–430
Ahmad S, Tresp V (1993) Some solutions to the missing feature problem in vision. In: Hanson SJ, Cowan JD, Giles CL (eds) Advances in neural information processing systems 5. Morgan Kaufmann Publishers, San Mateo, pp 393–400
Tresp V, Neuneier R, Ahmad S (1995) Efficient methods for dealing with missing data in supervised learning. In: Tesauro G, Touretzky D, Leen T (eds) Advances in neural information processing systems 7. Morgan Kaufmann Publishers, San Mateo, pp 689–696
Graham BS, Keisuke H (2011) Robustness to parametric assumptions in missing data models. Am Econ Rev 101(3):538–543
Ahmad S, Tresp V (1993) Classification with missing and uncertain inputs. In: Proceedings of the IEEE international conference on neural networks, San Francisco
Moody J, Darken C (1988) Learning with localized receptive fields. In: Hinton G, Sejnowski T (eds) Proceedings of the 1988 Connectionist Models Summer School. Morgan-Kauffmann
Nowlan S (1990) Maximum likelihood competitive learning. In: Advances in neural information processing systems 2. Morgan Kaufmann Publishers, pp 574–582
Moody J, Darken C (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076
Breiman L, Meisel W, Purcell E (1977) Variable kernel estimates of multivariate densities. Technometrics 19(2):135–144
Hwang JN, Lay SR, Lippman A (1994) Nonparametric multivariate density estimation: a comparative study. IEEE Trans Signal Process 42(10):2795–2810
Ahmad S (1994) Feature densities are required for computing feature correspondence. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann Publishers, San Mateo, pp 961–968
Fielding S, Fayers PM, McDonald A, McPherson G, Campbell MK (2008) Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health Qual Life Outcomes 6(57)
Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ (2004) Analyzing incomplete longitudinal clinical trial data. Biostatistics 5(3):445–464
Congdon P (2006) Bayesian statistical modelling, 2nd edn. Wiley, New York
Collins LM, Schafer JL, Kam CM (2001) A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychol Methods 6:330–351
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. In: Annals of economic and social measurement, vol 5, number 4. National Bureau of Economic Research, Inc, pp 475–492
Berndt ER, Hall BH, Hall RE, Hausman JA (1974) Estimation and inference in nonlinear structural models. Ann Econ Soc Meas 3:653–665
Marlin B, Roweis S, Zemel R (2005) Unsupervised learning with non-ignorable missing data. In: Proceedings of the tenth international workshop on artificial intelligence and statistics (AISTATS), pp 222–229
Little RJA (1993) Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 88:125–134
Molenberghs G, Kenward M (2007) Missing data in clinical studies. Wiley, New York
Vonesh EF, Greene T, Schluchter MD (2006) Shared parameter models for the joint analysis of longitudinal data and event times. Stat Med 25(1):143–163
Little RJ (2006) Selection and pattern-mixture models. CRC Press, London, pp 409–431
Gad AM, Darwish NMM (2013) A shared parameter model for longitudinal data with missing values. Am J Appl Math Stat 1(2):30–35
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Harel O, Zhou XH (2007) Multiple imputation: review of theory, implementation and software. Stat Med 26:3057–3077
Kenward MG, Carpenter JC (2009) Multiple Imputation. CRC Press, London, pp 477–500
Saltelli A, Chan K, Scott EM (2000) Sensitivity analysis. Wiley, New York
White IR, Royston P, Wood AM (2011) Multiple imputation using chained equations: issues and guidance for practice. Stat Med 30(4):377–399
Daniel RM, Kenward MG (2012) A method for increasing the robustness of multiple imputation. Comput Stat Data Anal 56(6):1624–1643
Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG (2006) The nature of sensitivity in monotone missing not at random models. Comput Stat Data Anal 50(3):830–858
Park J-S, Qian GQ, Jun Y (2008) Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data. Appl Math Comput 197(1):440–450
Stubbendick AL, Ibrahim JG (2003) Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics 59(4):1140–50
Jolani S (2012) Dual imputation strategies for analyzing incomplete data. Utrecht University, Utrecht
Enders CK (2011) Missing not at random models for latent growth curve analyses. Psychol Methods 16(1):1–16
Molenberghs G, Beunckens C, Sotto C, Kenward MG (2008) Every missingness not at random model has a missingness at random counterpart with equal fit. J R Stat Soc Ser B 70(Part 2):371–388
Vamplew P, Adams A (1992) Missing values in a backpropagation neural net. In: Leong S, Jabri M (eds) Proceedings of the third Australian conference on neural networks, Sidney, February 1992, pp 64–67
Vamplew P, Clark D, Adams A, Muench J (1996) Techniques for dealing with missing values in feedforward networks. In: Proceedings of the seventh Australian conference on neural networks, Canberra, 10–12 April 1996
Southcott ML, Bogner RE (1993) Classification of incomplete data using neural networks. In: Proceedings of the fourth Australian conference on neural networks, Melbourne, 3–5 February 1993, pp 220–223
Hwang JN, Wang CJ (1994) Neural network inversion techniques for missing data applications. In: IEEE neural network workshop on signal processing, Ermioni, Greece, September 1994, pp 22–31
Specht DF (1990) Probabilistic neural networks. Neural Netw 3(1):109–118
Vapnik V (1982) Estimation of dependences based on empirical data. Springer, Berlin
Buntine WL, Weigend AS (1991) Bayesian back-propagation. Complex Syst 5(6):603–643
Arrowsmith DK, Place CM (1990) An introduction to dynamical systems. Cambridge University Press, Cambridge
Rabiner LR (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc IEEE 77(2):267–296
Jain LC, Medsker LR (1999) Recurrent neural networks: design and applications. CRC Press Inc, Boca Raton
Jurafsky D, Martin JH (2009) Speech and language processing, 2nd edn. Prentice-Hall Inc, Upper Saddle River
Trentin E, Gori M (2003) Robust combination of neural networks and hidden Markov models for speech recognition. IEEE Trans Neural Netw 14(6):1519–1531
Bertolami R, Bunke H (2008) Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recogn 41(11):3452–3460
Baldi P, Brunak S (2001) Bioinformatics: the machine learning approach, 2nd edn. MIT Press, Cambridge
Hinton GE, Sejnowski TJ (1986) Learning and relearning in Boltzmann machines. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 7. MIT Press
Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. In: Proceedings of the National Academy of Sciences, vol 79, pp 2554–2558
Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City
Almeida L (1987) A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In: Caudill M, Butler C (eds) Proceedings of the IEEE first international conference on neural networks, vol 2. IEEE, San Diego, pp 609–618
Pineda F (1989) Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comput 1:161–172
Bengio Y, Gingras F (1996) Recurrent neural networks for missing or asynchronous data. In: Touretzky DS, Mozer MC, Hasselmo ME (eds) Advances in neural information processing systems 8. MIT Press, Cambridge, pp 395–401
Minsky ML, Papert SA (1969) Perceptrons. MIT Press, Cambridge
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Rumelhart DE, McClelland J (eds) Parallel distributed processing, vol 1, chapter 8. MIT Press, pp 318–362
Globerson A, Roweis ST (2006) Nightmare at test time: robust learning by feature deletion. In: ICML ’06: Proceedings of the 23rd international conference on machine learning, pp 353–360
Dekel O, Shamir O (2008) Learning to classify with missing and corrupted features. In: ICML ’08: Proceedings of the 25th international conference on machine learning. ACM, New York, pp 216–223
Ding Y, Simonoff JS (2010) An investigation of missing data methods for classification trees applied to binary response data. J Mach Learn Res 11:131–170
Twala B (2009) An empirical comparison of techniques for handling incomplete data using decision trees. Appl Artif Intell 23:373–405
Luengo J, García S, Herrera F (2010) A study on the use of imputation methods for experimentation with radial basis function network classifiers handling missing attribute values: The good synergy between RBFs and event covering method. Neural Netw 23:406–418
Corani G, Zaffalon M (2008) Learning reliable classifiers from small or incomplete data sets: the naive credal classifier 2. J Mach Learn Res 9:581–621
Chierichetti F, Kleinberg J, Liben-Nowell D (2011) Reconstructing patterns of information diffusion from incomplete observations. In: Shawe-Taylor J, Zemel RS, Bartlett P, Pereira FCN, Weinberger KQ (eds) Advances in neural information processing systems 24. MIT Press, Cambridge, pp 792–800
Greenwald A, Li J, Sodomka E (2012) Approximating equilibria in sequential auctions with incomplete information and multi-unit demand. In: Bartlett P, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems 25. MIT Press, Cambridge, pp 2330–2338
Ghannad-Rezaie M, Soltanian-Zadeh H, Ying H, Dong M (2010) Selection-fusion approach for classification of data sets with missing values. Pattern Recognit 43:2340–2350
Farhangfar A, Kurgan L, Dy J (2008) Impact of imputation of missing values on classification error for discrete data. Pattern Recognit 41:3692–3705
Saar-Tsechansky M, Provost F (2007) Handling missing values when applying classification models. J Mach Learn Res 8:1623–1657
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096
Kim H, Golub GH, Park H (2005) Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 21(2):187–198
Scheel I, Aldrin M, Glad IK, Sorum R, Lyng H, Frigessi A (2005) The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 21(23):4272–4279
Wang X, Jiang Z, Feng H (2006) Missing value estimation for DNA microarray gene expression data by support vector regression imputation and orthogonal coding scheme. BMC Bioinform 7(32):1–10
Wong DSV, Wong FK, Wood GR (2007) A multi-stage approach to clustering and imputation of gene expression profiles. Bioinformatics 23(8):998–1005
Yoon D, Lee EK, Park T (2007) Robust imputation method for missing values in microarray data. BMC Bioinform 8(2):1–7
Roure B, Baurain D, Philippe H (2013) Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol Biol Evol 30(1):197–214
Nutt W, Razniewski S, Vegliach G (2012) Incomplete databases: missing records and missing values. In: Proceedings of the 17th international conference on database systems for advanced applications, DASFAA’12. Springer, pp 298–310
Kaambwa B, Bryan S, Billingham L (2012) Do the methods used to analyse missing data really matter? An examination of data from an observational study of Intermediate Care patients. BMC Res Notes 5(1):330
David M, Little RJA, Samuhel ME, Triest RK (1986) Alternative methods for CPS income imputation. J Am Stat Assoc 81(393):29–41
Foster EM, Fang GY (2004) Alternative methods for handling attrition: an illustration using data from the fast track evaluation. Eval Rev 28(5):434–464
Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Dong Y, Peng C-YJ (2013) Principled missing data methods for researchers. Springerplus 2(1):222
Ali AMG, Dawson SJ, Blows FM, Provenzano E, Ellis IO, Baglietto L, Huntsman D, Caldas C, Pharoah PD (2011) Comparison of methods for handling missing data on immunohistochemical markers in survival analysis of breast cancer. Br J Cancer 104(4):693–699
Fielding S, Fayers P, Ramsay C (2010) Predicting missing quality of life data that were later recovered: an empirical comparison of approaches. Clin Trials 7(4):333–342
Marshall A, Altman D, Royston P, Holder R (2010) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 10(1):7
Hedden S, Woolson R, Malcolm R (2008) A comparison of missing data methods for hypothesis tests of the treatment effect in substance abuse clinical trials: a Monte-Carlo simulation study. Subst Abuse Treatm Prev Policy 3(1):1–9
Roda C, Nicolis I, Momas I, Guihenneuc-Jouyaux C (2013) Comparing methods for handling missing data. Epidemiology 24(3):469–471
Graham JW (2009) Missing data analysis: making it work in the real world. Annu Rev Psychol 60:549–576
Schwartz T, Zeig-Owens R (2013) Knowledge (of your missing data) is power: handling missing values in your SAS dataset. In: Proceedings of SAS global forum SUGI 31: statistics, data analysis and data mining, San Francisco, California, 28 April–1 May 2013
Templ M, Alfons A, Filzmoser P (2012) Exploring incomplete data using visualization techniques. Adv Data Anal Classif 6(1):29–47
Heitjan DF (2011) Incomplete data: what you don’t know might hurt you. Cancer Epidemiol Biomark Prev 20(8):1567–1570
Additional information
This work was accomplished while A. Freno was at the University of Siena.
Appendix
1.1 Appendix A: Marginal distributions of multivariate Normals
Let \(x \in R^{n}\) be a random vector normally distributed with mean \(\mu \) and covariance matrix \(\Sigma \), i.e., \(x\sim N_{n}(\mu ,\Sigma )\). Let \(x^{o} \in R^{k}\) be a vector corresponding to a subset of \(k\) components of \(x\) and let \(x^{m} \in R^{l}\) be a vector corresponding to the remaining components of \(x\). Without loss of generality, we suppose that the components of \(x^{o}\) correspond to the first \(k\) components of \(x\), and \(x^{m}\) to the remaining \(l\) components (otherwise, the components of \(x\), \(\mu \) and \(\Sigma \) can be simply relabeled).
The random vector \(x\), its mean vector, \(\mu \), and its covariance matrix, \(\Sigma \), can be partitioned according to \(x^{o}\) and \(x^{m}\), as:
\[
x = \begin{pmatrix} x^{o} \\ x^{m} \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu^{o} \\ \mu^{m} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma^{oo} & \Sigma^{om} \\ \Sigma^{mo} & \Sigma^{mm} \end{pmatrix}.
\]
Let \(A\) be the following \(k \times n\) matrix:
\[
A = \bigl(\, I_{k} \;\; 0_{k \times l} \,\bigr),
\]
where the first \(k\) columns form the \(k \times k\) identity matrix \(I_{k}\) and the remaining \(l=n-k\) columns form the \(k \times l\) null matrix.
Let us define the random vector \(z=Ax\), \(z \in R^{k}\). Since any linear combination of the components of a normally distributed random vector is itself normally distributed, the \(k\) linear combinations \(z=Ax\) follow a multivariate Normal:
\[
z \sim N_{k}(\mu_{z}, \Sigma_{z}),
\]
where the mean vector, \(\mu _{z}\), and the covariance matrix, \(\Sigma _{z}\), are:
\[
\mu_{z} = A\mu, \qquad \Sigma_{z} = A \Sigma A^{T}.
\]
Finally, observing that \(z \equiv x^{o}\), \(A \mu \equiv \mu ^{o}\), and \(A \Sigma A^{T} \equiv \Sigma ^{oo}\), it follows that \(x^{o}\) is normally distributed with mean \(\mu ^{o}\) and covariance matrix \(\Sigma ^{oo}\), i.e., \(x^{o} \sim N_{k}(\mu ^{o},\Sigma ^{oo})\).
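The argument above can be checked numerically: selecting the first \(k\) components with \(A = (I_{k}\;0)\) must reproduce the leading sub-blocks of \(\mu\) and \(\Sigma\). The sketch below uses pure-Python matrix algebra; the three-dimensional example values are made up for illustration and are not from the paper.

```python
def matmul(M, N):
    """Row-by-column product of two matrices given as lists of lists."""
    return [[sum(M[i][t] * N[t][j] for t in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

n, k = 3, 2
mu = [1.0, 2.0, 3.0]
Sigma = [[2.0, 0.5, 0.1],
         [0.5, 1.0, 0.2],
         [0.1, 0.2, 3.0]]

# A = (I_k | 0): k x n selector of the first k ("observed") components.
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(k)]

mu_z = [sum(A[i][j] * mu[j] for j in range(n)) for i in range(k)]
Sigma_z = matmul(matmul(A, Sigma), transpose(A))

assert mu_z == mu[:k]                             # mu^o
assert Sigma_z == [row[:k] for row in Sigma[:k]]  # Sigma^{oo}
```

Since \(A\) only selects components, \(A\mu\) and \(A\Sigma A^{T}\) coincide exactly with \(\mu^{o}\) and \(\Sigma^{oo}\), as the assertions confirm.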
1.2 Appendix B: Convolution of multivariate Normals with diagonal covariances
Let us consider the convolution of two multivariate Normal densities with mean vectors \(\mu _{1}\) and \(\mu _{2}\), and diagonal covariance matrices \((\sigma _{1})^{2}\) and \((\sigma _{2})^{2}\):
\[
c(x) = \int_{R^{n}} N_{n}\bigl(x'; \mu_{1}, (\sigma_{1})^{2}\bigr)\, N_{n}\bigl(x - x'; \mu_{2}, (\sigma_{2})^{2}\bigr)\, \mathrm{d}x'. \qquad (59)
\]
Given the diagonal nature of the two covariance matrices, the previous \(n\)-dimensional convolution integral factorizes as:
\[
c(x) = \prod_{i=1}^{n} c_{i}(x_{i}), \qquad (60)
\]
where \(c_{i}(x_{i})\) is the convolution of two univariate Normals with means \(\mu _{1,i}\) and \(\mu _{2,i}\), and variances \(\sigma _{1,i}^{2}\) and \(\sigma _{2,i}^{2}\), i.e.,
\[
c_{i}(x_{i}) = \int_{-\infty}^{+\infty} N\bigl(x'; \mu_{1,i}, \sigma_{1,i}^{2}\bigr)\, N\bigl(x_{i} - x'; \mu_{2,i}, \sigma_{2,i}^{2}\bigr)\, \mathrm{d}x'.
\]
The integrals in Eq. (60) can be solved by applying the convolution theorem:
\[
\mathcal{F}\{f * g\} = \mathcal{F}\{f\}\,\mathcal{F}\{g\},
\]
where \(\mathcal{F}\) indicates the linear operator which maps each function into its own Fourier transform.
As is well known, the transform of the exponential function \(\mathrm{e}^{-ax^{2}}\) is:
\[
\mathcal{F}\{\mathrm{e}^{-ax^{2}}\}(\omega) = \sqrt{\frac{\pi}{a}}\, \mathrm{e}^{-\omega^{2}/(4a)},
\]
from which, using the linearity of the Fourier transform and applying the shift theorem, it follows:
\[
\mathcal{F}\bigl\{N(\cdot\,; \mu_{j,i}, \sigma_{j,i}^{2})\bigr\}(\omega) = \mathrm{e}^{-i\omega \mu_{j,i}}\, \mathrm{e}^{-\sigma_{j,i}^{2}\omega^{2}/2}, \qquad j = 1, 2.
\]
Applying the convolution theorem to the transforms of the two univariate Normals yields \(\mathcal{F}\{c_{i}\}(\omega) = \mathrm{e}^{-i\omega(\mu_{1,i}+\mu_{2,i})}\, \mathrm{e}^{-(\sigma_{1,i}^{2}+\sigma_{2,i}^{2})\omega^{2}/2}\), from which, computing the inverse Fourier transform of the two sides of the previous equation, it follows:
\[
c_{i}(x_{i}) = N\bigl(x_{i};\, \mu_{1,i}+\mu_{2,i},\, \sigma_{1,i}^{2}+\sigma_{2,i}^{2}\bigr). \qquad (61)
\]
Finally, from Eqs. (60) and (61) it follows that the result of the convolution integral (59) is a multivariate Normal with mean vector \(\mu =\mu _{1}+\mu _{2}\) and diagonal covariance matrix \((\sigma )^{2}=(\sigma _{1})^{2}+(\sigma _{2})^{2}\).
1.3 Appendix C: Linear conditional expectations
Let \(x\) and \(y\) be two random variables with joint density function \(p(x,y)\). Let us assume \(p(x,y)\) is a Normal density with parameter \(\theta =(\mu ,\Sigma )\), where \(\mu \) is the mean vector and \(\Sigma \) is the covariance matrix. Parameters \(\mu \) and \(\Sigma \) can be rewritten as:
\[
\mu = \begin{pmatrix} \mu_{x} \\ \mu_{y} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}.
\]
The task is to estimate the expectations of \(y\) and \(y^{2}\) given \(x\) using the least square linear regression between \(y\) and \(x\) as predicted by the model, i.e., \(E\{y|x,\theta \}\) and \(E\{y^{2}|x,\theta \}\).
To compute the least square linear regression, values for constants \(a\) and \(b\) must be found which minimize the error expression \(e=E\{[y-(ax+b)]^{2}\}\).
First, let us consider the problem of finding parameter \(b\) when \(a\) is known. The value of \(b\) which minimizes \(e\) is the least squares estimate of \(y-ax\) using a constant model, i.e.,
\[
b = E\{y - ax\} = \mu_{y} - a\mu_{x}. \qquad (62)
\]
Using Eq. (62), the error expression can be rewritten as:
\[
e = E\bigl\{[(y-\mu_{y}) - a(x-\mu_{x})]^{2}\bigr\} = \sigma_{yy} - 2a\sigma_{xy} + a^{2}\sigma_{xx}, \qquad (63)
\]
which attains its minimum when the slope of the regression, \(a\), is equal to:
\[
a = \frac{\sigma_{xy}}{\sigma_{xx}}. \qquad (64)
\]
By substituting Eq. (64) into Eq. (63), the expression for the minimum error is obtained:
\[
e_{\min} = \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}}.
\]
The linear conditional expectations can be estimated using the principle of orthogonality and the simple observation that if two random variables \(z\) and \(w\) are independent, then \(E\{z|w\}=E\{z\}\).
Let us consider the first-order linear conditional expectation \(E\{y|x,\theta \}\). Observe that:
\[
E\{(y-ax-b)x\} = 0 = E\{y-ax-b\},
\]
where the first equality comes from the principle of orthogonality, and the second from Eq. (62). From these equalities, it follows that \(E\{(y-ax-b)x\}=E\{(y-ax-b)\}E\{x\}\), and therefore, \(y-ax-b\) and \(x\) are not correlated.
Noting that \(y-ax-b\) is normally distributed (being a linear combination of jointly normal variables) and that two uncorrelated, jointly normally distributed random variables are independent, it follows that \(y-ax-b\) and \(x\) are independent, i.e., \(E\{y-ax-b|x,\theta \}=E\{y-ax-b\}\). Moreover, from Eq. (62) it follows that \(E\{y-ax-b|x,\theta \}=0\).
On the other hand:
\[
ax + b = \mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x}),
\]
where Eqs. (62) and (64) are used. Finally, the conditional expectation of \(y\) given \(x\) is:
\[
E\{y|x,\theta\} = \mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x}).
\]
Let us now consider the second-order linear conditional expectation \(E\{y^{2}|x,\theta \}\). Using the independence between \(y-ax-b\) and \(x\), it follows:
\[
E\{(y-ax-b)^{2}|x,\theta\} = E\{(y-ax-b)^{2}\} = \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}},
\]
and then, using the principle of orthogonality:
\[
E\{y^{2}|x,\theta\} = E\{(y-ax-b)^{2}|x,\theta\} + (ax+b)^{2}.
\]
Finally, the conditional expectation of \(y^{2}\) given \(x\) is:
\[
E\{y^{2}|x,\theta\} = \left[\mu_{y} + \frac{\sigma_{xy}}{\sigma_{xx}}(x - \mu_{x})\right]^{2} + \sigma_{yy} - \frac{\sigma_{xy}^{2}}{\sigma_{xx}}.
\]
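The closed forms derived in this appendix are easy to implement directly: given the joint Normal parameters, the regression slope is \(a = \sigma_{xy}/\sigma_{xx}\), the intercept is \(b = \mu_{y} - a\mu_{x}\), and the two conditional moments follow. The sketch below is not the authors' code; the numeric parameters are arbitrary illustration values.

```python
def conditional_moments(mu_x, mu_y, s_xx, s_xy, s_yy, x):
    """Conditional moments of y given x under a bivariate Normal model."""
    a = s_xy / s_xx                # regression slope (Eq. 64)
    b = mu_y - a * mu_x            # intercept (Eq. 62)
    e_min = s_yy - s_xy**2 / s_xx  # residual (minimum) variance
    ey = a * x + b                 # E{y | x, theta}
    ey2 = ey**2 + e_min            # E{y^2 | x, theta}
    return ey, ey2

ey, ey2 = conditional_moments(mu_x=0.0, mu_y=1.0,
                              s_xx=2.0, s_xy=1.0, s_yy=1.5, x=2.0)
assert ey == 2.0   # a = 0.5, b = 1.0, so E{y|x=2} = 0.5*2 + 1
assert ey2 == 5.0  # 2.0^2 + (1.5 - 0.5)
```

These are exactly the quantities needed, e.g., to impute a missing \(y\) (and its square) from an observed \(x\) in EM-style procedures for Gaussian models.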
Aste, M., Boninsegna, M., Freno, A. et al. Techniques for dealing with incomplete data: a tutorial and survey. Pattern Anal Applic 18, 1–29 (2015). https://doi.org/10.1007/s10044-014-0411-9