Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm

D’Ambrosio, Antonio; Aria, Massimo; Siciliano, Roberta

doi:10.1007/s00357-012-9108-1

Published: 17 June 2012

Volume 29, pages 227–258 (2012)

Journal of Classification

Abstract

This paper is set within the framework of statistical data editing: specifically, how to edit or impute missing or contradictory data, and how to merge two independent data sets that each lack some information. Assuming a missing-at-random mechanism, the paper provides an accurate tree-based methodology for both missing data imputation and data fusion, justified within the Statistical Learning Theory of Vapnik. It considers both an incremental variable imputation method, to improve computational efficiency, and boosted trees, to gain prediction accuracy with respect to other methods. As a result, the best approximation of the structural risk (also known as irreducible error) is reached, thus minimizing the generalization (or prediction) error of imputation. Moreover, the methodology is distribution free: it holds independently of the underlying probability law generating the missing data values. Performance is assessed through simulation case studies and real-world applications.
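
The two ingredients named in the abstract, incremental variable imputation and boosted trees, can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes numeric variables only, orders the incomplete variables simply by their number of missing values, and swaps in plain least-squares gradient boosting of one-split regression trees (stumps) as the boosted learner. All function names and the `rounds` and `rate` parameters are hypothetical.

```python
from statistics import mean

def fit_stump(X, r):
    """Return the least-squares one-split tree for targets r, or None if no split exists."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent observed values
            left = [ri for row, ri in zip(X, r) if row[j] <= t]
            right = [ri for row, ri in zip(X, r) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    return best[1:] if best else None

def fit_boosted(X, y, rounds=50, rate=0.3):
    """Least-squares gradient boosting: each stump is fitted to the current residuals."""
    base = mean(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, resid)
        if stump is None:  # predictors are constant, nothing to split on
            break
        j, t, ml, mr = stump
        stumps.append(stump)
        pred = [p + rate * (ml if row[j] <= t else mr) for p, row in zip(pred, X)]
    return base, rate, stumps

def predict(model, row):
    base, rate, stumps = model
    return base + rate * sum(ml if row[j] <= t else mr for j, t, ml, mr in stumps)

def incremental_impute(data, rounds=50, rate=0.3):
    """Impute one variable at a time; each imputed variable then serves as a predictor."""
    rows = [list(r) for r in data]  # work on a copy, None marks a missing value
    p = len(rows[0])
    n_missing = [sum(r[j] is None for r in rows) for j in range(p)]
    complete = [j for j in range(p) if n_missing[j] == 0]
    # Assumption of this sketch: variables with fewer missing values are imputed first.
    todo = sorted((j for j in range(p) if n_missing[j] > 0), key=lambda j: n_missing[j])
    for j in todo:
        train = [r for r in rows if r[j] is not None]
        model = fit_boosted([[r[k] for k in complete] for r in train],
                            [r[j] for r in train], rounds, rate)
        for r in rows:
            if r[j] is None:
                r[j] = predict(model, [r[k] for k in complete])
        complete.append(j)  # the completed variable joins the predictor set
    return rows

# Toy data: the second column is a step function of the first, with two values missing.
demo = [[float(x), 10.0 if x < 5 else 20.0] for x in range(10)]
demo[2][1] = None
demo[7][1] = None
filled = incremental_impute(demo)
```

Because each completed variable is appended to the predictor set, later imputations can draw on earlier ones, which is the sense in which the scheme is incremental. The sketch needs at least one observed value per variable and falls back to the observed mean when no fully observed predictor is available.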

References

ALUJA-BANET, T., MORINEAU, A., and RIUS, R. (1997), “La Greffe de Fichiers et Ses Conditions D’application. Méthode et Exemple”, in Enquêtes et Sondages, eds. G. Brossier and A.M. Dussaix, Paris: Dunod, pp. 94–102.
ALUJA-BANET, T., RIUS, R., NONELL, R., and MARTÍNEZ-ABARCA, M.J. (1998), “Data Fusion and File Grafting”, in Analyses Multidimensionelles Des Données (1st ed.), NGUS 97, eds. A. Morineau, and K. Fernández Aguirre, Paris: CISIA-CERESTA, pp. 7–14.
ALUJA-BANET, T., DAUNIS-I-ESTADELLA, J., and PELLICER, D. (2007), “GRAFT, a Complete System for Data Fusion”, Computational Statistics and Data Analysis 52, 635–649.
BARCENA, M.J., and TUSELL, F. (1999), “Enlace de Encuestas: Una Propuesta Metodológica y Aplicación a la Encuesta de Presupuestos de Tiempo”, Qüestiió, 23(2), 297–320.
BARCENA, M.J., and TUSELL, F. (2000), “Tree-based Algorithms for Missing Data Imputation”, in Proceedings in Computational Statistics, COMPSTAT 2000, eds. J.G. Bethlehem, and P.G.M. van der Heijden, Heidelberg: Physica-Verlag, pp. 193–198.
BREIMAN, L. (1996), “Bagging Predictors”, Machine Learning, 24, 123–140.
BREIMAN, L. (1998), “Arcing Classifiers”, The Annals of Statistics, 26(3), 801–849.
BREIMAN, L., FRIEDMAN, J.H., OLSHEN, R.A., and STONE, C.J. (1984), Classification and Regression Trees, Belmont CA: Wadsworth International Group.
CAPPELLI, C., MOLA, F., and SICILIANO, R. (2002), “A Statistical Approach to Growing a Reliable Honest Tree”, Computational Statistics and Data Analysis, 38, 285–299.
CHU, C.K., and CHENG, P.E. (1995), “Nonparametric Regression Estimation With Missing Data”, Journal of Statistical Planning and Inference, 48, 85–99.
CONTI, P.L., MARELLA, D., and SCANU, M. (2008), “Evaluation of Matching Noise for Imputation Techniques Based on Nonparametric Local Linear Regression Estimators”, Computational Statistics and Data Analysis, 43, 354–365.
CONVERSANO, C., and SICILIANO, R. (2008), “Statistical Data Editing”, in Data Warehousing And Mining: Concepts, Methodologies, Tools, And Applications (Vol. 4), ed. J. Wang, Hershey PA: Information Science Reference, pp. 1835–1840.
CONVERSANO, C., and SICILIANO, R. (2009), “Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering”, Journal of Classification, 26(3), 361–379.
D’AMBROSIO, A., ARIA, M., and SICILIANO, R. (2007), “Robust Tree-based Incremental Imputation Method for Data Fusion”, in LNCS 4273; Advances in Intelligent Data Analysis, Berlin/Heidelberg: Springer-Verlag, pp. 174–183.
DAVID, M.H., LITTLE, R.J.A., SAMUEL, M.E., and TRIEST, R.K. (1986), “Alternative Methods for CPS Income Imputation”, Journal of the American Statistical Association, 81, 29–41.
DE WAAL, T., PANNEKOEK, J., and SCHOLTUS, S. (2011), Handbook of Statistical Data Editing and Imputation, New York: Wiley.
DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977), “Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm (With Discussion)”, Journal of the Royal Statistical Society, Series B, 39, 1–38.
DIETTERICH, T.G. (2000), “Ensemble Methods in Machine Learning”, in First International Workshop on Multiple Classifier Systems, eds. J. Kittler and F. Roli, Springer-Verlag, pp. 1–15.
D’ORAZIO, M., DI ZIO, M., and SCANU, M. (2006), Statistical Matching: Theory and Practice, Chichester: John Wiley & Sons.
EIBL, G., and PFEIFFER, K. P. (2002), “How To Make AdaBoost.M1 Work for Weak Base Classifiers by Changing Only One Line of the Code”, in Machine Learning: ECML 2002, Lecture Notes in Artificial Intelligence, Heidelberg: Springer.
FELLEGI, I. P., and HOLT, D. (1976), “A Systematic Approach To Automatic Edit and Imputation”, Journal of the American Statistical Association, 71, 17–35.
FORD, B.N. (1983), “An Overview of Hot Deck Procedures”, in Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography, eds. G. Madow, I. Olkin and D.B. Rubin, New York: Academic Press.
FREUND, Y., and SCHAPIRE, R.E. (1997), “A Decision-Theoretic Generalization of On-Line Learning and An Application To Boosting”, Journal of Computer and System Sciences, 55(1), 119–139.
GEY, S., and POGGI, J.M. (2006), “Boosting and Instability for Regression Trees”, Computational Statistics and Data Analysis, 50, 533–550.
HASTIE, T.J., TIBSHIRANI, R.J., and FRIEDMAN, J.H. (2009), The Elements of Statistical Learning (2nd ed.), New York: Springer Verlag.
IBRAHIM, J.G. (1990), “Incomplete Data in Generalized Linear Models”, Journal of the American Statistical Association, 85, 765–769.
IBRAHIM, J.G., LIPSITZ, S.R., and CHEN, M.H. (1999), “Missing Covariates in Generalized Linear Models When the Missing Data Mechanism Is Non-Ignorable”, Journal of the Royal Statistical Society, Series B, 61(1), 173–190.
KOHAVI, R., and WOLPERT, D. (1996), “Bias Plus Variance for Zero-One Loss Functions”, in Proceedings of the 13th International Machine Learning Conference, San Mateo CA: Morgan Kaufmann, pp. 275–283.
KONG, E., and DIETTERICH, T.G. (1995), “Error-Correcting Output Coding Correct Bias and Variance”, in The XII International Conference on Machine Learning, San Francisco CA: Morgan Kaufmann, pp. 313–321.
LAKSHMINARAYAN, K., HARP, S.A., GOLDMAN, R., and SAMAD, T. (1996), “Imputation of Missing Data Using Machine Learning Techniques”, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, eds. Simoudis, Han and Fayyad, Menlo Park CA: AAAI Press, pp. 140–145.
LITTLE, R.J.A. (1992), “Regression with Missing X’s: A Review”, Journal of the American Statistical Association, 87(420), 1227–1237.
LITTLE, R.J.A., and RUBIN, D.B. (1987), Statistical Analysis with Missing Data, New York: John Wiley and Sons.
McKNIGHT, P.E., McKNIGHT, K.M., SIDANI, S., and FIGUEREDO, A.J. (2007), Missing Data: A Gentle Introduction, New York: The Guilford Press.
MARELLA, D., SCANU, M., and CONTI, P.L. (2008), “On the Matching Noise of Some Nonparametric Imputation Procedures”, Statistics & Probability Letters, 78(12), 1593–1600.
MOLA, F., and SICILIANO, R. (1992), “A Two-Stage Predictive Splitting Algorithm in Binary Segmentation”, in Computational Statistics: COMPSTAT 92, 1, eds. Y. Dodge, and J. Whittaker, Heidelberg (D): Physica Verlag, pp. 179–184.
MOLA, F., and SICILIANO, R. (1997), “A Fast Splitting Procedure for Classification and Regression Trees”, Statistics and Computing, 7, 208–216.
OUDSHOORN, C.G.M., VAN BUUREN, S., and VAN RIJCKEVORSEL, J.L.A. (1999), “Flexible Multiple Imputation by Chained Equations of the AVO-95 Survey”, TNO Preventie en Gezondheid, TNO/PG 99.045.
PAAS, G. (1985), “Statistical Record Linkage Methodology, State of the Art and Future Prospects”, Bulletin of the International Statistical Institute, Proceedings of the 45th Session, LI, Book 2.
PETRAKOS, G., CONVERSANO, C., FARMAKIS, G., MOLA, F., SICILIANO, R., and STAVROPOULOS, P. (2004), “New Ways to Specify Data Edits”, Journal of the Royal Statistical Society, Series A, 167(2), 249–274.
RASSLER, S. (2002), Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches, New York: Springer-Verlag.
RASSLER, S. (2004), “Data Fusion: Identification Problems, Validity, and Multiple Imputation”, Austrian Journal of Statistics, 33(1 & 2), 153–171.
RUBIN, D.B. (1976), “Inference and Missing Data (with Discussion)”, Biometrika, 63, 581–592.
RUBIN, D.B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.
SANDE, I.G. (1983), “Hot Deck Imputation Procedures”, in Incomplete Data in Sample Surveys, Vol. III. Symposium on Incomplete Data: Proceedings, New York: Academic Press.
SAPORTA, G. (2002), “Data Fusion and Data Grafting”, Computational Statistics and Data Analysis, 38, 465–473.
SCHAPIRE, R.E., FREUND, Y., BARTLETT, P., and LEE, W.S. (1998), “Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods”, The Annals of Statistics, 26(5), 1651–1686.
SICILIANO, R., and CONVERSANO, C. (2002), “Tree-Based Classifiers for Conditional Missing Data Incremental Imputation”, Proceedings of the International Conference on Data Clean (Jyväskylä, May 29–31, 2002), University of Jyväskylä, Finland.
SICILIANO, R., and CONVERSANO, C. (2008), “Decision Tree Induction”, in Data Warehousing And Mining: Concepts, Methodologies, Tools, And Applications (Vol. 2), ed. J. Wang, Hershey PA: Information Science Reference, pp. 624–629.
SICILIANO, R., and MOLA, F. (1996), “A Fast Regression Tree Procedure”, in Statistical Modelling, Proceedings of the 11th International Workshop on Statistical Modeling, eds. A. Forcina, G.M. Marchetti, R. Hatzinger, and G. Galmacci, Orvieto, July 15–19, Graphos, Città di Castello, pp. 332–340.
TIBSHIRANI, R. (1996), “Bias, Variance and Prediction Error for Classification Rules”, Technical Report, University of Toronto, Department of Statistics.
VAPNIK, V.N. (1995), The Nature of Statistical Learning Theory, New York: Springer Verlag.
VAPNIK, V.N. (1998), Statistical Learning Theory, New York: Wiley.
VAPNIK, V.N., and CHERVONENKIS, A.J. (1989), “The Necessary and Sufficient Conditions for Consistency of the Method of Empirical Risk Minimization”, Pattern Recognition and Image Analysis, 284–305.
VAN BUUREN, S., BRAND, J.P.L., GROOTHUIS-OUDSHOORN, C.G.M., and RUBIN, D.B. (2006), “Fully Conditional Specification in Multivariate Imputation”, Journal of Statistical Computation and Simulation, 76(12), 1049–1064.
WINKLER, W. E. (1999), “State of Statistical Data Editing and Current Research Problems”, Working paper No 29 in the UN/ECE Work Session on Statistical Data Editing, Rome, 2–4 June 1999.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Naples Federico II, Via Cinthia, M.te S. Angelo, 80126, Naples, Italy
Antonio D’Ambrosio, Massimo Aria & Roberta Siciliano

Corresponding author

Correspondence to Antonio D’Ambrosio.

Additional information

The authors would like to thank the editor and the referees for their helpful comments, which have helped us to greatly improve the quality of this paper.

About this article

Cite this article

D’Ambrosio, A., Aria, M. & Siciliano, R. Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm. J Classif 29, 227–258 (2012). https://doi.org/10.1007/s00357-012-9108-1

Issue Date: July 2012
