Journal of Classification, Volume 29, Issue 2, pp 227–258

Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm

  • Antonio D’Ambrosio
  • Massimo Aria
  • Roberta Siciliano
Article

DOI: 10.1007/s00357-012-9108-1

Cite this article as:
D’Ambrosio, A., Aria, M. & Siciliano, R. J Classif (2012) 29: 227. doi:10.1007/s00357-012-9108-1

Abstract

This paper is framed within statistical data editing: specifically, how to edit or impute missing or contradictory data, and how to merge two independent data sets that each lack some information. Assuming a missing-at-random mechanism, the paper provides an accurate tree-based methodology for both missing data imputation and data fusion, justified within the Statistical Learning Theory of Vapnik. It considers both an incremental variable imputation method, to improve computational efficiency, and boosted trees, to gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus reducing the generalization (or prediction) error of imputation to a minimum. Moreover, the methodology is distribution-free: it holds independently of the underlying probability law generating the missing data values. Performance is assessed through simulation case studies and real-world applications.

Keywords

  • Data editing
  • Tree-based methods
  • Boosting algorithm
  • FAST algorithm
  • Incremental imputation
  • Generalization error

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Antonio D’Ambrosio (1)
  • Massimo Aria (1)
  • Roberta Siciliano (1)
  1. Department of Mathematics and Statistics, University of Naples Federico II, Naples, Italy
