Boosted Incremental Tree-based Imputation of Missing Data

  • Conference paper
Data Analysis, Classification and the Forward Search

Abstract

Tree-based procedures have recently been considered as nonparametric tools for missing data imputation when dealing with large data structures and no probabilistic assumptions. A previous work used an incremental algorithm based on cross-validated decision trees and a lexicographic ordering of the single data values to be imputed. This paper considers an ensemble method in which a tree-based model is used as the learner. Furthermore, the incremental imputation addresses the missing data of each variable in turn. As a result, the proposed method allows more accurate imputations through a more efficient algorithm. A simulation case study shows the overall good performance of the proposed method against some competitors. A MATLAB implementation enriches the Tree Harvest Software for non-standard classification and regression trees.
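The variable-by-variable imputation scheme summarised above can be sketched roughly as follows. This is an illustrative reconstruction in Python, not the authors' MATLAB/Tree Harvest implementation: the initial mean fill, the ordering of variables by amount of missingness, and the use of scikit-learn's GradientBoostingRegressor as the boosted tree learner are all assumptions made for the sake of a runnable example.

```python
# Hypothetical sketch of boosted incremental tree-based imputation:
# each variable with missing entries is imputed in turn, using a
# boosted tree ensemble trained on the cases observed for that variable.
# NOT the authors' method; ordering and initial fill are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def incremental_impute(X, max_iter=1):
    """Impute missing entries (NaN) column by column with boosted trees."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # Start from a simple mean fill so every column is usable as a predictor.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    # Impute columns in order of increasing missingness (an assumption).
    order = np.argsort(missing.sum(axis=0))
    for _ in range(max_iter):
        for j in order:
            rows = missing[:, j]
            if not rows.any():
                continue
            other = np.delete(np.arange(X.shape[1]), j)
            # Fit the boosted learner on cases where variable j is observed.
            model = GradientBoostingRegressor(n_estimators=100, max_depth=3)
            model.fit(X[~rows][:, other], X[~rows, j])
            # Replace only the missing entries of variable j.
            X[rows, j] = model.predict(X[rows][:, other])
    return X
```

Observed entries are never overwritten; only the positions flagged as missing at the start are replaced by ensemble predictions.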




Copyright information

© 2006 Springer-Verlag Heidelberg

About this paper

Cite this paper

Siciliano, R., Aria, M., D’Ambrosio, A. (2006). Boosted Incremental Tree-based Imputation of Missing Data. In: Zani, S., Cerioli, A., Riani, M., Vichi, M. (eds) Data Analysis, Classification and the Forward Search. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-35978-8_31
