Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm
- Cite this article as:
- D’Ambrosio, A., Aria, M. & Siciliano, R. J Classif (2012) 29: 227. doi:10.1007/s00357-012-9108-1
This paper is set within the framework of statistical data editing: specifically, how to edit or impute missing or contradictory data, and how to merge two independent data sets that each lack some information. Assuming a missing at random mechanism, it provides an accurate tree-based methodology for both missing data imputation and data fusion, justified within the Statistical Learning Theory of Vapnik. The methodology combines an incremental variable imputation method, which improves computational efficiency, with boosted trees, which gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus minimizing the generalization (or prediction) error of imputation. Moreover, the methodology is distribution free: it holds independently of the underlying probability law generating the missing data values. Performance is analyzed through simulation case studies and real-world applications.
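To make the idea of incremental variable imputation concrete, the following is a minimal Python sketch and not the authors' algorithm. It assumes one plausible reading of "incremental": variables are imputed in order of increasing number of missing values, and each imputed variable then joins the predictors for the next one. A one-split regression stump stands in for the boosted trees of the paper, and the names `fit_stump` and `incremental_impute` are hypothetical. The sketch also assumes numeric data with at least one fully observed column.

```python
from statistics import mean


def fit_stump(X, y):
    """Fit a one-split regression stump (a stand-in for a boosted tree).

    Picks the (feature, threshold) pair that minimizes the squared error
    and returns a function mapping a predictor row to a prediction.
    """
    if not X:
        raise ValueError("no observed cases to learn from")
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between distinct observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every predictor is constant: fall back to the mean
        m = mean(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def incremental_impute(data):
    """Return a copy of `data` (rows of floats or None) with None values imputed."""
    data = [list(row) for row in data]  # do not modify the caller's rows
    n_cols = len(data[0])
    n_missing = [sum(row[j] is None for row in data) for j in range(n_cols)]
    complete = [j for j in range(n_cols) if n_missing[j] == 0]
    if not complete:
        raise ValueError("this sketch needs at least one fully observed column")
    # Assumed ordering: impute the variable with the fewest missing values first.
    incomplete = sorted(
        (j for j in range(n_cols) if n_missing[j] > 0),
        key=lambda j: n_missing[j],
    )
    for target in incomplete:
        train = [row for row in data if row[target] is not None]
        X = [[row[j] for j in complete] for row in train]
        y = [row[target] for row in train]
        predict = fit_stump(X, y)
        for row in data:
            if row[target] is None:
                row[target] = predict([row[j] for j in complete])
        # The imputed variable becomes a predictor for the remaining ones.
        complete.append(target)
    return data
```

Calling `incremental_impute` on a list of rows in which missing entries are `None` returns a new, fully filled list and leaves the input untouched. Replacing `fit_stump` with a boosted ensemble of trees would bring the sketch closer to the methodology the abstract describes.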