
Optimal decorrelated score subsampling for generalized linear models with massive data

Science China Mathematics

Abstract

In this paper, we consider unified optimal subsampling estimation and inference for a low-dimensional parameter of main interest in the presence of a nuisance parameter in low- and high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function that reduces the influence of the less accurate, slowly converging nuisance parameter estimate. We establish the consistency and asymptotic normality of the resulting subsample estimator from a general decorrelated score subsampling algorithm, and derive two optimal subsampling probabilities under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve the asymptotic efficiency over the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to the census income and Fashion-MNIST datasets demonstrate their practical applicability.
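To make the two-step idea concrete, the following is a minimal sketch of a generic two-step subsampling scheme for ordinary low-dimensional logistic regression. It is not the decorrelated score procedure of this paper, whose formulas are not reproduced here. It assumes the familiar L-optimality-style probabilities proportional to |y_i - p_i| ||x_i||, evaluated at a pilot estimate from a uniform subsample. The function names (`weighted_logistic_mle`, `two_step_subsample`) and the sizes `r0` and `r` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_logistic_mle(X, y, w, n_iter=50, tol=1e-8):
    """Newton-Raphson maximiser of the weighted logistic log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.linalg.norm(step) < tol:
            break
    return beta


def two_step_subsample(X, y, r0, r, rng):
    """Illustrative two-step subsampling (assumed form, not the paper's)."""
    n = X.shape[0]

    # Step 1: uniform pilot subsample of size r0 gives a rough estimate.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = weighted_logistic_mle(X[idx0], y[idx0], np.ones(r0))

    # Approximate L-optimality-style probabilities at the pilot estimate.
    score_norm = np.abs(y - sigmoid(X @ beta0)) * np.linalg.norm(X, axis=1)
    probs = score_norm / score_norm.sum()

    # Step 2: draw r points with these probabilities, pool them with the
    # pilot sample, and weight each point by its inverse sampling probability
    # (rescaled by n, which does not change the maximiser).
    idx1 = rng.choice(n, size=r, replace=True, p=probs)
    idx = np.concatenate([idx0, idx1])
    weights = np.concatenate([np.ones(r0), 1.0 / (n * probs[idx1])])
    beta = weighted_logistic_mle(X[idx], y[idx], weights)
    return beta, probs
```

The inverse-probability weights are what keep the subsample estimating equation unbiased for the full-data one. In the setting of the abstract, the plain score used above would be replaced by a decorrelated score for the parameter of interest.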



Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities, National Natural Science Foundation of China (Grant No. 12271272) and the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin. The authors are grateful to the referees for their insightful comments and suggestions on this article, which have led to significant improvements.

Author information

Corresponding author

Correspondence to Lei Wang.

About this article

Cite this article

Gao, J., Wang, L. & Lian, H. Optimal decorrelated score subsampling for generalized linear models with massive data. Sci. China Math. 67, 405–430 (2024). https://doi.org/10.1007/s11425-022-2057-8

  • DOI: https://doi.org/10.1007/s11425-022-2057-8
