Abstract
This paper describes a correlation-based batch process for addressing high-dimensional imputation problems. Relatively few algorithms are designed to efficiently impute missing data in high-dimensional contexts. Fewer still natively handle mixed-type data; most require lengthy pre-processing to get the data into proper shape, followed by post-processing to return the data to usable form. These requirements, along with the assumptions many methods make (e.g., about the data-generating process), limit their performance, flexibility, and usability. Building on a set of complementary algorithms for nonparametric imputation via chained random forests, I introduce a batching process that eases the computational costs of high-dimensional imputation by subsetting data based on ranked cross-feature absolute correlations. The algorithm then imputes each batch separately and joins the imputed subsets in a final step. The process, hdImpute, is fast and accurate. As a result, high-dimensional imputation is more accessible, and researchers are not forced to choose between speed and accuracy. Complementary software is available in the form of an R package, openly developed on GitHub under the MIT license. In the spirit of open science, collaboration and engagement with the actively developing software are encouraged.
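The batch process described above can be sketched in a few lines. The following is a minimal illustration, not the hdImpute implementation: features are ranked by mean absolute pairwise correlation, split into batches of similarly correlated features, imputed per batch, and re-joined. Column-mean imputation stands in here for the chained random forests the actual algorithm uses, and the function name and batching rule are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def hd_impute_sketch(df: pd.DataFrame, n_batches: int = 2) -> pd.DataFrame:
    """Illustrative correlation-based batch imputation (not the hdImpute API)."""
    # Step 1: absolute pairwise correlations (pairwise-complete observations).
    abs_cor = df.corr().abs()
    # Step 2: rank features by their mean absolute cross-feature correlation.
    ranked = abs_cor.mean().sort_values(ascending=False).index
    # Step 3: split the ranked features into batches (the `batch` hyperparameter).
    batches = np.array_split(np.asarray(ranked), n_batches)
    # Step 4: impute each batch separately (stand-in: column means, not
    # the chained random forests used by the real algorithm).
    imputed = [df[list(b)].fillna(df[list(b)].mean()) for b in batches]
    # Step 5: join the imputed batches, restoring the original column order.
    return pd.concat(imputed, axis=1)[df.columns]
```

Because each batch is imputed independently, memory and runtime scale with the batch size rather than the full feature space, which is the source of the computational savings.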
Notes
The hyperparameter is an argument called batch in the hdImpute software detailed later in the paper.
Creation of the correlation matrix is not a formal step in the hdImpute algorithm, but it can easily be built using the hdImpute package as described in Sect. 6.
Of note, all algorithms were run with their default settings. It is certainly conceivable that some tuning could produce a more accurate (and likely slower) fit for softImpute; the same could be said for missRanger and all configurations of hdImpute. Thus, in the absence of precise model tuning, which would be poorly comparable across models in such a case, default values were used.
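The cross-feature ranking mentioned in the notes above can be illustrated by flattening an absolute correlation matrix into a ranked long-format table. This is a sketch only; the function name and output layout are assumptions for illustration and do not reproduce the hdImpute package's API.

```python
import numpy as np
import pandas as pd

def ranked_feature_cor(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten an absolute correlation matrix into a table of feature
    pairs ranked from most to least correlated (illustrative only)."""
    abs_cor = df.corr().abs()
    # Keep each unordered pair once: mask the diagonal and upper triangle.
    upper = np.triu(np.ones(abs_cor.shape, dtype=bool))
    pairs = abs_cor.mask(upper).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_cor"]
    # Rank pairs by absolute correlation, strongest first.
    return pairs.sort_values("abs_cor", ascending=False).reset_index(drop=True)
```

A table like this makes it straightforward to group highly correlated features into the same batch before imputation.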
References
Bach F (2017) Breaking the curse of dimensionality with convex neural networks. J Mach Learn Res 18(1):629–681
Bennett J, Lanning S et al (2007) The netflix prize. In: Proceedings of KDD cup and workshop, vol 2007 New York, NY, USA, p 35
Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, Liss J (2021) Digital medicine and the curse of dimensionality. NPJ Digit Med 4(1):1–8
Bessa MA, Bostanabad R, Liu Z, Hu A, Apley DW, Brinson C, Chen W, Liu WK (2017) A framework for data-driven analysis of materials under uncertainty: countering the curse of dimensionality. Comput Methods Appl Mech Eng 320:633–667
Bollier D, Firestone CM et al (2010) The promise and peril of big data. Communications and Society Program Washington, DC, Aspen Institute
Chattopadhyay A, Lu T-P (2019) Gene-gene interaction: the curse of dimensionality. Ann Transl Med 7(24)
Daum F, Huang J (2003) Curse of dimensionality and particle filters. In: 2003 IEEE aerospace conference proceedings (Cat. No. 03TH8652), vol 4. IEEE, pp 4_1979–4_1993
De Marchi S (2005) Computational and mathematical modeling in the social sciences. Cambridge University Press, Cambridge
Goldfeld K, Wujciak-Jens J (2020) simstudy: illuminating research methods through data generation. J Open Source Softw 5(54):2763. https://doi.org/10.21105/joss.02763
Han J, Jentzen A, Weinan E (2018) Solving high-dimensional partial differential equations using deep learning. Proc Nat Acad Sci 115(34):8505–8510
Hastie T, Mazumder R (2013) Package 'softImpute'. R package
Hong S, Lynn HS (2020) Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol 20(1):1–12
Lall R, Robinson T (2021) The MIDAS touch: accurate and scalable missing-data imputation with deep learning. Polit Anal, pp 1–18
Lavanya K, Reddy LSS, Reddy BE (2019) A study of high-dimensional data imputation using additive LASSO regression model. Comput Intell Data Min. Springer, Berlin, pp 19–30
Little RJA, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons, New York
Mayer M (2019) Package 'missRanger'. R package
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096
Rhys H (2020) Machine Learning with R, the tidyverse, and mlr. Simon and Schuster
Salas J, Yepes V (2019) VisualUVAM: a decision support system addressing the curse of dimensionality for the multi-scale assessment of urban vulnerability in Spain. Sustainability 11(8):2191
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017) Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinf 18(1):1–13
Stekhoven DJ, Bühlmann P (2012) MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Van Buuren S (2018) Flexible imputation of missing data. CRC Press, Boca Raton
Van Buuren S, Oudshoorn K (1999) Flexible multivariate imputation by MICE. TNO, Leiden
Verleysen M, François D (2005) The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer, pp 758–770
Waggoner PD (2021) Modern dimension reduction. Cambridge University Press, Cambridge
Wilson S (2022) Package 'miceforest' v5.6.2. Python package. PyPI. https://pypi.org/project/miceforest/
Wright MN, Ziegler A (2015) Ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409
Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
I am grateful to Doug Rivers, Delia Bailey, and other colleagues at YouGov for many helpful conversations and support throughout, as well as the Institute for Social and Economic Research and Policy (ISERP) at Columbia. Any mistakes remain my own.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Waggoner, P.D. A batch process for high dimensional imputation. Comput Stat 39, 781–802 (2024). https://doi.org/10.1007/s00180-023-01325-9