A batch process for high dimensional imputation

  • Original paper
  • Computational Statistics

Abstract

This paper describes a correlation-based batch process for addressing high dimensional imputation problems. Relatively few algorithms are designed to handle imputation of missing data efficiently in high dimensional contexts. Fewer still natively handle mixed-type data; most require lengthy pre-processing to get the data into proper shape and post-processing to return the data to a usable form. These requirements, along with the assumptions many methods make (e.g., about the data generating process), limit their performance, flexibility, and usability. Building on a set of complementary algorithms for nonparametric imputation via chained random forests, I introduce a batching process that eases the computational costs of high dimensional imputation by subsetting the data based on ranked cross-feature absolute correlations. The algorithm then imputes each batch separately and joins the imputed subsets in a final step. The process, hdImpute, is fast and accurate. As a result, high dimensional imputation is more accessible, and researchers are not forced to choose between speed and accuracy. Complementary software is available as an R package, openly developed on GitHub under the MIT license. In the spirit of open science, collaboration and engagement with the actively developing software are encouraged.
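
To make the procedure described above concrete, the following is a minimal R sketch of the general idea rather than the hdImpute implementation itself: rank features by absolute cross-feature correlation, split them into batches of a user-chosen size (the batch argument discussed in note 1), impute each batch with chained random forests via the missRanger package, and join the results. The function name and the simple numeric recoding of categorical columns are illustrative simplifications.

  # Minimal sketch of correlation-based batch imputation (illustrative only).
  # Assumes a data.frame `dat` containing missing values.
  library(missRanger)  # chained random forest imputation

  batch_impute_sketch <- function(dat, batch_size = 10, seed = 1) {
    set.seed(seed)

    # Treat character columns as factors so they can be recoded and imputed
    dat <- as.data.frame(lapply(dat, function(x) if (is.character(x)) factor(x) else x))

    # 1. Cross-feature absolute correlations on a simple numeric recoding
    cors <- abs(cor(data.matrix(dat), use = "pairwise.complete.obs"))
    diag(cors) <- NA

    # 2. Rank features by mean absolute correlation with all other features
    ranked <- names(sort(rowMeans(cors, na.rm = TRUE), decreasing = TRUE, na.last = TRUE))

    # 3. Split the ranked features into batches of at most `batch_size` columns
    batches <- split(ranked, ceiling(seq_along(ranked) / batch_size))

    # 4. Impute each batch separately with chained random forests
    imputed <- lapply(batches, function(cols) {
      missRanger(dat[, cols, drop = FALSE], verbose = 0)
    })

    # 5. Join the imputed batches and restore the original column order
    out <- do.call(cbind, unname(imputed))
    out[, names(dat)]
  }

  # Example usage: completed <- batch_impute_sketch(dat, batch_size = 10)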

Notes

  1. The hyperparameter is an argument called batch in the hdImpute software detailed later in the paper.

  2. Creation of the correlation matrix is not a formal step in the hdImpute algorithm, but it can easily be built using the hdImpute package as described in Sect. 6; a small base-R sketch of this step appears after these notes.

  3. https://www.kaggle.com/subham07/detecting-anomalies-in-water-manufacturing.

  4. Of note, all algorithms were run using their default settings. It is certainly conceivable that some tuning could yield a more accurate (and likely slower) fit of softImpute; the same could be said for missRanger and all iterations of hdImpute. Because tuned models would be difficult to compare fairly across methods, default values were used throughout.
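
As a companion to note 2, the snippet below shows one base-R way to build and rank the cross-feature absolute correlation matrix. The hdImpute package provides its own helpers for this step (see Sect. 6), so the function here is purely illustrative.

  # Illustrative base-R construction of ranked cross-feature absolute
  # correlations (note 2); hdImpute provides its own helpers for this step.
  cor_rank_sketch <- function(dat) {
    num <- data.matrix(dat)                               # simple numeric recoding
    cors <- abs(cor(num, use = "pairwise.complete.obs"))  # absolute correlations
    diag(cors) <- NA                                      # drop self-correlations

    # Flatten to a long table and rank feature pairs from most to least correlated
    # (each unordered pair appears twice; deduplicate if needed)
    long <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
    names(long) <- c("feature_1", "feature_2", "abs_cor")
    long <- long[!is.na(long$abs_cor), ]
    long[order(long$abs_cor, decreasing = TRUE), ]
  }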

References

  • Bach F (2017) Breaking the curse of dimensionality with convex neural networks. J Mach Learn Res 18(1):629–681

  • Bennett J, Lanning S et al (2007) The Netflix prize. In: Proceedings of KDD Cup and Workshop, vol 2007. New York, NY, p 35

  • Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, Liss J (2021) Digital medicine and the curse of dimensionality. NPJ Digital Med 4(1):1–8

  • Bessa MA, Bostanabad R, Liu Z, Hu A, Apley DW, Brinson C, Chen W, Liu WK (2017) A framework for data-driven analysis of materials under uncertainty: countering the curse of dimensionality. Comput Methods Appl Mech Eng 320:633–667

  • Bollier D, Firestone CM et al (2010) The promise and peril of big data. Communications and Society Program Washington, DC, Aspen Institute

  • Chattopadhyay A, Lu T-P (2019) Gene-gene interaction: the curse of dimensionality. Ann Transl Med 7(24)

  • Daum F, Huang J (2003) Curse of dimensionality and particle filters. In: 2003 IEEE aerospace conference proceedings (Cat. No. 03TH8652), vol 4 IEEE pp 4_1979–4_1993

  • De Marchi S (2005) Computational and mathematical modeling in the social sciences. Cambridge University Press, Cambridge

  • Goldfeld K, Wujciak-Jens J (2020) simstudy: illuminating research methods through data generation. J Open Source Softw 5(54):2763. https://doi.org/10.21105/joss.02763

  • Han J, Jentzen A, Weinan E (2018) Solving high-dimensional partial differential equations using deep learning. Proc Nat Acad Sci 115(34):8505–8510

  • Hastie T, Mazumder R (2013) Package 'softImpute'. R package

  • Hong S, Lynn HS (2020) Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol 20(1):1–12

  • Lall R, Robinson T (2021) The MIDAS touch: accurate and scalable missing-data imputation with deep learning. Polit Anal, pp 1–18

  • Lavanya K, Reddy LSS, Reddy BE (2019) A study of high-dimensional data imputation using additive LASSO regression model. Comput Intell Data Min. Springer, Berlin, pp 19–30

  • Little RJA, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons, New York

  • Mayer M (2019) Package ’missRanger’. R Package

  • Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 19(16):2088–2096

  • Rhys H (2020) Machine Learning with R, the tidyverse, and mlr. Simon and Schuster

  • Salas J, Yepes V (2019) VisualUVAM: a decision support system addressing the curse of dimensionality for the multi-scale assessment of urban vulnerability in Spain. Sustainability 11(8):2191

  • Shah JS, Rai SN, DeFilippis AP, Hill BG, Bhatnagar A, Brock GN (2017) Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies. BMC Bioinf 18(1):1–13

  • Stekhoven DJ, Bühlmann P (2012) MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118

  • Van Buuren S (2018) Flexible imputation of missing data. CRC press

  • Van Buuren S, Oudshoorn K (1999) Flexible multivariate imputation by MICE. TNO, Leiden

  • Verleysen M, François D (2005) The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer pp 758–770

  • Waggoner PD (2021) Modern dimension reduction. Cambridge University Press, Cambridge

  • Wilson S (2022) Package ‘miceforest’ v5.6.2. Python package. PyPi . https://pypi.org/project/miceforest/

  • Wright MN, Ziegler A (2015) Ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409

  • Zhao Y, Long Q (2016) Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res 25(5):2021–2035

Author information

Corresponding author

Correspondence to Philip D. Waggoner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

I am grateful to Doug Rivers, Delia Bailey, and other colleagues at YouGov for many helpful conversations and support throughout, as well as the Institute for Social and Economic Research and Policy (ISERP) at Columbia. Any mistakes remain my own.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Waggoner, P.D. A batch process for high dimensional imputation. Comput Stat 39, 781–802 (2024). https://doi.org/10.1007/s00180-023-01325-9

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-023-01325-9
