Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering

Conversano, Claudio; Siciliano, Roberta

doi:10.1007/s00357-009-9038-8

Published: 07 January 2010

Journal of Classification, Volume 26, pages 361–379 (2009)

Abstract

In the framework of incomplete data analysis, this paper provides a nonparametric approach to missing data imputation based on Information Retrieval. In particular, an incremental procedure based on the iterative use of tree-based methods is proposed, and a suitable Incremental Imputation Algorithm is introduced. The key idea is to define a lexicographic ordering of cases and variables so that conditional mean imputation via binary trees can be performed incrementally. A simulation study and real data applications are carried out to describe the advantages and the performance of the approach with respect to standard approaches.
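
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea and not the authors' Incremental Imputation Algorithm. It assumes numeric data with `None` marking a missing value, processes variables from fewest to most missing values (a simplified stand-in for the lexicographic ordering), fits a small binary regression tree on the cases where the target is observed, and fills each gap with the mean of the leaf it falls into. Each imputed variable then joins the predictors for the next one. All function names and the toy data are hypothetical.

```python
# Illustrative sketch only; NOT the authors' algorithm. Assumes numeric data,
# None = missing, and at least one observed value in every column.

def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(X, y, depth=2, min_leaf=1):
    """Grow a small binary regression tree; each leaf stores the mean of y."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return mean(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive observed values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None:
        return mean(y)
    _, j, t = best
    lo = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    hi = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    return (j, t,
            fit_tree([r for r, _ in lo], [v for _, v in lo], depth - 1, min_leaf),
            fit_tree([r for r, _ in hi], [v for _, v in hi], depth - 1, min_leaf))

def predict(tree, row):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def incremental_impute(data):
    """Impute column by column, reusing already-imputed columns as predictors."""
    rows = [list(r) for r in data]  # work on a copy
    n_cols = len(rows[0])
    n_missing = [sum(r[j] is None for r in rows) for j in range(n_cols)]
    complete = [j for j in range(n_cols) if n_missing[j] == 0]
    # Assumed ordering: fewest missing values first, ties broken by column index.
    pending = sorted((j for j in range(n_cols) if n_missing[j] > 0),
                     key=lambda j: (n_missing[j], j))
    for target in pending:
        train = [r for r in rows if r[target] is not None]
        tree = fit_tree([[r[j] for j in complete] for r in train],
                        [r[target] for r in train])
        for r in rows:
            if r[target] is None:
                r[target] = predict(tree, [r[j] for j in complete])
        complete.append(target)  # the imputed column becomes a predictor
    return rows

# Hypothetical toy data: column 0 is complete, columns 1 and 2 have gaps.
data = [
    [1.0, 10.0, 100.0],
    [2.0, 10.0, 100.0],
    [3.0, None, 100.0],
    [8.0, 20.0, 200.0],
    [9.0, 20.0, None],
    [10.0, None, None],
]
imputed = incremental_impute(data)
for row in imputed:
    print(row)
```

In this toy run the second column is filled first, and its imputed values are then available when the tree for the third column is grown, which is the incremental aspect the abstract describes. A fuller treatment would also order the cases and handle categorical variables with classification trees.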

Author information

Authors and Affiliations

Department of Economics, University of Cagliari, Viale Frá Ignazio 17, I-09123, Cagliari, Italy
Claudio Conversano
Department of Mathematics and Statistics, University of Naples Federico II, Via Cinthia, Monte Sant’Angelo, I-80126, Naples, Italy
Roberta Siciliano

Corresponding author

Correspondence to Claudio Conversano.

Additional information

The authors would like to thank the editor and referees for their helpful comments, which have helped us to greatly improve the quality of this paper. Research work of Claudio Conversano is supported by the research funds awarded by University of Cagliari within the “Young Researchers’ Start-Up Programme 2007”. Research work of Roberta Siciliano is supported by the “iWebCare” European Project research funds (FP6-2004-IST-4-028055).

About this article

Cite this article

Conversano, C., Siciliano, R. Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering. J Classif 26, 361–379 (2009). https://doi.org/10.1007/s00357-009-9038-8

Issue Date: December 2009