Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering

Journal of Classification

Abstract

In the framework of incomplete data analysis, this paper provides a nonparametric approach to missing data imputation based on Information Retrieval. In particular, it proposes an incremental procedure based on the iterative use of tree-based methods and introduces a suitable Incremental Imputation Algorithm. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally. A simulation study and real data applications illustrate the advantages and performance of the procedure relative to standard approaches.
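The incremental idea can be illustrated with a minimal Python sketch. It is not the authors' Incremental Imputation Algorithm: it assumes numeric data with `None` marking missing cells and at least one fully observed column, it orders only the variables (by number of missing values), and it uses a tiny hand-written regression tree. The names `fit_tree`, `predict` and `incremental_impute` are hypothetical. Each incomplete column is imputed with the terminal-node mean of a tree grown on the columns that are already complete, and then joins the set of predictors for the next column.

```python
from statistics import mean

def sse(ys):
    """Sum of squared deviations from the mean (node impurity)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def fit_tree(X, y, depth=2, min_leaf=2):
    """Tiny binary regression tree: a leaf is a mean, a node is
    (feature, threshold, left_subtree, right_subtree)."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return mean(y)
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return mean(y)
    _, j, t, left, right = best

    def grow(idx):
        return fit_tree([X[i] for i in idx], [y[i] for i in idx],
                        depth - 1, min_leaf)

    return (j, t, grow(left), grow(right))

def predict(tree, row):
    """Drop a case down the tree and return the terminal-node mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def incremental_impute(data):
    """Impute columns in order of increasing missingness (simplifying
    assumption); each imputed column becomes a predictor for the next."""
    rows = [list(r) for r in data]  # work on a copy
    counts = [sum(r[j] is None for r in rows) for j in range(len(rows[0]))]
    order = sorted(range(len(counts)), key=lambda j: (counts[j], j))
    complete = [j for j in order if counts[j] == 0]
    if not complete:
        raise ValueError("the sketch needs at least one complete column")
    for target in order:
        if counts[target] == 0:
            continue
        observed = [r for r in rows if r[target] is not None]
        tree = fit_tree([[r[j] for j in complete] for r in observed],
                        [r[target] for r in observed])
        for r in rows:
            if r[target] is None:
                r[target] = predict(tree, [r[j] for j in complete])
        complete.append(target)
    return rows

data = [
    [1.0, 10.0], [2.0, 12.0], [1.5, None],
    [8.0, 30.0], [9.0, 32.0], [8.5, None],
]
imputed = incremental_impute(data)
print(imputed[2][1], imputed[5][1])  # 11.0 31.0
```

In this toy data the tree splits the complete column at 2.0, so the two missing cells receive the means of their respective terminal nodes. Because the sketch imputes one whole column at a time, the ordering of cases does not affect its result, whereas the paper orders both cases and variables.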

Author information

Corresponding author

Correspondence to Claudio Conversano.

Additional information

The authors would like to thank the editor and referees for their helpful comments, which have greatly improved the quality of this paper. The research work of Claudio Conversano is supported by research funds awarded by the University of Cagliari within the “Young Researchers’ Start-Up Programme 2007”. The research work of Roberta Siciliano is supported by the “iWebCare” European Project research funds (FP6-2004-IST-4-028055).

About this article

Cite this article

Conversano, C., Siciliano, R. Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering. J Classif 26, 361–379 (2009). https://doi.org/10.1007/s00357-009-9038-8
