Abstract
In the framework of incomplete data analysis, this paper provides a nonparametric approach to missing data imputation based on Information Retrieval. In particular, an incremental procedure based on the iterative use of tree-based methods is proposed and a suitable Incremental Imputation Algorithm is introduced. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally. A simulation study and real data applications are carried out to assess the advantages and performance of the procedure relative to standard approaches.
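The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' Incremental Imputation Algorithm: it assumes numeric data, orders only the variables (by the lexicographic key "number of missing values, then column index") rather than both cases and variables, and replaces a full binary tree with a one-split regression tree. The names `fit_stump` and `incremental_impute` are hypothetical and chosen for this illustration only.

```python
from statistics import mean

def fit_stump(rows, target, predictors):
    """Fit a one-split regression tree (a stump) for column `target`.

    Returns (predictor, threshold, left_mean, right_mean), or None when
    no predictor offers a valid split. Simplified stand-in for a full tree.
    """
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[target] for r in rows if r[p] <= thr]
            right = [r[target] for r in rows if r[p] > thr]
            ml, mr = mean(left), mean(right)
            sse = (sum((y - ml) ** 2 for y in left)
                   + sum((y - mr) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, p, thr, ml, mr)
    return None if best is None else best[1:]

def incremental_impute(data):
    """Impute None entries column by column; returns an imputed copy.

    Assumes every column has at least one observed value.
    """
    data = [list(r) for r in data]
    n_cols = len(data[0])
    n_missing = [sum(r[j] is None for r in data) for j in range(n_cols)]
    # Assumed ordering: fewest missing values first, ties broken by index.
    order = sorted(range(n_cols), key=lambda j: (n_missing[j], j))
    complete = [j for j in order if n_missing[j] == 0]
    for j in order:
        if n_missing[j] == 0:
            continue
        train = [r for r in data if r[j] is not None]
        stump = fit_stump(train, j, complete)
        fallback = mean(r[j] for r in train)  # used when no split exists
        for r in data:
            if r[j] is None:
                if stump is None:
                    r[j] = fallback
                else:
                    p, thr, ml, mr = stump
                    # Conditional mean of the leaf the case falls into.
                    r[j] = ml if r[p] <= thr else mr
        complete.append(j)  # the filled column becomes a predictor
    return data

example = [
    [1.0, 10.0, 100.0],
    [2.0, 12.0, None],
    [3.0, None, 300.0],
    [8.0, 30.0, None],
    [9.0, 32.0, 900.0],
    [10.0, None, None],
]
imputed = incremental_impute(example)
```

Each column, once filled, is added to the pool of predictors for the columns that follow, which is what makes the procedure incremental. A practical version would grow deeper trees and handle categorical variables.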
Additional information
The authors would like to thank the editor and referees for their helpful comments, which have helped us to greatly improve the quality of this paper. Research work of Claudio Conversano is supported by the research funds awarded by University of Cagliari within the “Young Researchers’ Start-Up Programme 2007”. Research work of Roberta Siciliano is supported by the “iWebCare” European Project research funds (FP6-2004-IST-4-028055).
About this article
Cite this article
Conversano, C., Siciliano, R. Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering. J Classif 26, 361–379 (2009). https://doi.org/10.1007/s00357-009-9038-8
DOI: https://doi.org/10.1007/s00357-009-9038-8