Abstract
This paper describes a correlation-based batch process for addressing high-dimensional imputation problems. Relatively few algorithms are designed to efficiently impute missing data in high-dimensional contexts. Fewer still natively handle mixed-type data; most require lengthy pre-processing to get the data into proper shape, followed by post-processing to return the data to usable form. These requirements, along with the assumptions many methods make (e.g., about the data-generating process), limit their performance, flexibility, and usability. Building on a set of complementary algorithms for nonparametric imputation via chained random forests, I introduce a batching process that eases the computational costs of high-dimensional imputation by subsetting data based on ranked cross-feature absolute correlations. The algorithm then imputes each batch separately and joins the imputed subsets in a final step. The process, hdImpute, is fast and accurate. As a result, high-dimensional imputation is more accessible, and researchers are not forced to choose between speed and accuracy. Complementary software is available in the form of an R package, openly developed on GitHub under the MIT license. In the spirit of open science, collaboration and engagement with the actively developing software are encouraged.
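The batch process described above can be sketched in a few lines. The following is a minimal illustration, not the hdImpute implementation: features are ranked by mean absolute pairwise correlation, split into batches of similarly correlated features, imputed per batch, and re-joined. Column-mean imputation stands in here for the chained random forests the actual algorithm uses, and the function name and batching rule are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def hd_impute_sketch(df: pd.DataFrame, n_batches: int = 2) -> pd.DataFrame:
    """Illustrative correlation-based batch imputation (not the hdImpute API)."""
    # Step 1: absolute pairwise correlations (pairwise-complete observations).
    abs_cor = df.corr().abs()
    # Step 2: rank features by their mean absolute cross-feature correlation.
    ranked = abs_cor.mean().sort_values(ascending=False).index
    # Step 3: split the ranked features into batches (the `batch` hyperparameter).
    batches = np.array_split(np.asarray(ranked), n_batches)
    # Step 4: impute each batch separately (stand-in: column means, not
    # the chained random forests used by the real algorithm).
    imputed = [df[list(b)].fillna(df[list(b)].mean()) for b in batches]
    # Step 5: join the imputed batches, restoring the original column order.
    return pd.concat(imputed, axis=1)[df.columns]
```

Because each batch is imputed independently, memory and runtime scale with the batch size rather than the full feature space, which is the source of the computational savings.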
Notes
The hyperparameter is an argument called batch in the hdImpute software detailed later in the paper.
Creation of the correlation matrix is not a formal step in the hdImpute algorithm, but it can easily be built using the hdImpute package as described in Sect. 6.
Of note, all algorithms were run with their default settings. It is certainly conceivable that some tuning could produce a more accurate (and likely slower) fit for softImpute; the same could be said for missRanger and all configurations of hdImpute. Thus, in the absence of precise model tuning, which would be poorly comparable across models in such a case, default values were used.
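The cross-feature ranking mentioned in the notes above can be illustrated by flattening an absolute correlation matrix into a ranked long-format table. This is a sketch only; the function name and output layout are assumptions for illustration and do not reproduce the hdImpute package's API.

```python
import numpy as np
import pandas as pd

def ranked_feature_cor(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten an absolute correlation matrix into a table of feature
    pairs ranked from most to least correlated (illustrative only)."""
    abs_cor = df.corr().abs()
    # Keep each unordered pair once: mask the diagonal and upper triangle.
    upper = np.triu(np.ones(abs_cor.shape, dtype=bool))
    pairs = abs_cor.mask(upper).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "abs_cor"]
    # Rank pairs by absolute correlation, strongest first.
    return pairs.sort_values("abs_cor", ascending=False).reset_index(drop=True)
```

A table like this makes it straightforward to group highly correlated features into the same batch before imputation.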
References
Bach F (2017) Breaking the curse of dimensionality with convex neural networks. J Mach Learn Res 18(1):629–681
Bennett J, Lanning S et al (2007) The netflix prize. In: Proceedings of KDD cup and workshop, vol 2007 New York, NY, USA, p 35
Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, Liss J (2021) Digital medicine and the curse of dimensionality. NPJ Digit Med 4(1):1–8
Bessa MA, Bostanabad R, Liu Z, Hu A, Apley DW, Brinson C, Chen W, Liu WK (2017) A framework for data-driven analysis of materials under uncertainty: countering the curse of dimensionality. Comput Methods Appl Mech Eng 320:633–667
Bollier D, Firestone CM et al (2010) The promise and peril of big data. Communications and Society Program Washington, DC, Aspen Institute
Chattopadhyay A, Lu T-P (2019) Gene-gene interaction: the curse of dimensionality. Ann Transl Med 7(24)
Daum F, Huang J (2003) Curse of dimensionality and particle filters. In: 2003 IEEE aerospace conference proceedings (Cat. No. 03TH8652), vol 4. IEEE, pp 4_1979–4_1993
De Marchi S (2005) Computational and mathematical modeling in the social sciences. Cambridge University Press, Cambridge
Goldfeld K, Wujciak-Jens J (2020) simstudy: illuminating research methods through data generation. J Open Source Softw 5(54):2763. https://doi.org/10.21105/joss.02763
Han J, Jentzen A, Weinan E (2018) Solving high-dimensional partial differential equations using deep learning. Proc Nat Acad Sci 115(34):8505–8510
Hastie T, Mazumder R (2013) Package 'softImpute'. R package
Hong S, Lynn HS (2020) Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol 20(1):1–12
Lall R, Robinson T (2021) The MIDAS touch: accurate and scalable missing-data imputation with deep learning. Polit Anal, pp 1–18
Lavanya K, Reddy LSS, Reddy BE (2019) A study of high-dimensional data imputation using additive LASSO regression model. Comput Intell Data Min. Springer, Berlin, pp 19–30
Little RJA, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons, New York
Mayer M (2019) Package 'missRanger'. R package
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096
Rhys H (2020) Machine Learning with R, the tidyverse, and mlr. Simon and Schuster
Salas J, Yepes V (2019) VisualUVAM: a decision support system addressing the curse of dimensionality for the multi-scale assessment of urban vulnerability in Spain. Sustainability 11(8):2191
Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017) Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinf 18(1):1–13
Stekhoven DJ, Bühlmann P (2012) MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Van Buuren S (2018) Flexible imputation of missing data. CRC Press, Boca Raton
Van Buuren S, Oudshoorn K (1999) Flexible multivariate imputation by MICE. TNO, Leiden
Verleysen M, François D (2005) The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer, pp 758–770
Waggoner PD (2021) Modern dimension reduction. Cambridge University Press, Cambridge
Wilson S (2022) Package 'miceforest' v5.6.2. Python package. PyPI. https://pypi.org/project/miceforest/
Wright MN, Ziegler A (2015) Ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409
Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
I am grateful to Doug Rivers, Delia Bailey, and other colleagues at YouGov for many helpful conversations and support throughout, as well as the Institute for Social and Economic Research and Policy (ISERP) at Columbia. Any mistakes remain my own.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Waggoner, P.D. A batch process for high dimensional imputation. Comput Stat 39, 781–802 (2024). https://doi.org/10.1007/s00180-023-01325-9