
A review on design inspired subsampling for big data

Abstract

Subsampling focuses on selecting a subsample that can efficiently sketch the information of the original data in terms of statistical inference. It provides a powerful tool in big data analysis and has gained the attention of data scientists in recent years. In this review, some state-of-the-art subsampling methods inspired by statistical design are summarized. Three types of designs, namely optimal designs, orthogonal designs, and space-filling designs, have shown great potential in subsampling for different objectives. The relationships between experimental designs and the related subsampling approaches are discussed. Specifically, two major families of design-inspired subsampling techniques are presented. The first aims to select a subsample in accordance with some optimal design criteria. The second tries to find a subsample that meets certain design requirements, including balance, orthogonality, and uniformity. Simulated and real data examples are provided to compare these methods empirically.

References

  • Ai M, Wang F, Yu J, Zhang H (2021a) Optimal subsampling for large-scale quantile regression. J Complex 62:101512

  • Ai M, Yu J, Zhang H, Wang H (2021b) Optimal subsampling algorithms for big data regressions. Stat Sin 31:749–772

  • Altschuler J, Bach F, Rudi A, Niles-Weed J (2019) Massively scalable Sinkhorn distances via the Nyström method. Adv Neural Inf Process Syst 32:4427–4437

  • Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS. Oxford University Press, Oxford

  • Avron H, Maymounkov P, Toledo S (2010) Blendenpik: supercharging Lapack’s least-squares solver. SIAM J Sci Comput 32:1217–1236

  • Beasley LB, Brualdi RA, Shader BL (1993) Combinatorial orthogonality. In: Brualdi RA, Friedland S, Klee V (eds) Combinatorial and graph-theoretical problems in linear algebra. Springer, New York, NY, pp 207–218

  • Berger YG, de La Riva-Torres O (2016) Empirical likelihood confidence intervals for complex sampling designs. J R Stat Soc Ser B 78:314–319

  • Bertsekas DP (1992) Auction algorithms for network flow problems: a tutorial introduction. Comput Optim Appl 1:7–66

  • Biedermann S, Dette H (2001) Minimax optimal designs for nonparametric regression—a further optimality property of the uniform distribution. In: Atkinson AC, Hackl P, Müller WG (eds) mODa 6—Advances in Model-Oriented Design and Analysis. Physica-Verlag HD, Heidelberg, pp 13–20

  • Blom G (1976) Some properties of incomplete U-statistics. Biometrika 63:573–580

  • Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194

  • Bottou L (1999) On-line learning and stochastic approximations. In: Saad D (ed) On-line learning in neural networks. Publications of the Newton Institute. Cambridge University Press, Cambridge, pp 9–42. https://doi.org/10.1017/CBO9780511569920.003

  • Breidt FJ, Opsomer JD (2000) Local polynomial regression estimators in survey sampling. Ann Stat 28:1026–1053

  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140

  • Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York

  • Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, vol 80, pp 844–853

  • Chen WY, Barp A, Briol FX, Gorham J, Girolami M, Mackey L, Oates CJ (2019) Stein point Markov chain Monte Carlo. https://arXiv.org/1905.03673

  • Cheng Q, Wang H, Yang M (2020) Information-based optimal subdata selection for big data logistic regression. J Stat Plan Inference 209:112–122

  • Chernozhukov V, Galichon A, Hallin M, Henry M (2017) Monge-Kantorovich depth, quantiles, ranks and signs. Ann Stat 45:223–256

  • Cioppa TM, Lucas TW (2007) Efficient nearly orthogonal and space-filling Latin hypercubes. Technometrics 49:45–55

  • Cook CE, Lopez R, Stroe O, Cochrane G, Apweiler R (2019) The European bioinformatics institute in 2018: tools, infrastructure and training. Nucleic Acids Res 47:D15–D22

  • Cox DR (1957) Note on grouping. J Am Stat Assoc 52:543–547

  • Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol 54:93

  • Derezinski M, Warmuth MKK (2017) Unbiased estimates for linear regression via volume sampling. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., Red Hook, pp 3084–3093

  • Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39

  • Derezinski M, Warmuth MKK, Hsu DJ (2018) Leveraged volume sampling for linear regression. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates, Inc., Red Hook, pp 2505–2514

  • Dereziński M, Clarkson KL, Mahoney MW, Warmuth MK (2019) Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In: Beygelzimer A, Hsu D (eds) Proceedings of the Thirty-Second Conference on Learning Theory, Phoenix, USA, Proceedings of Machine Learning Research, vol 99, pp 1050–1069

  • Devroye L (1986) Sample-based non-uniform random variate generation. In: Proceedings of the 18th conference on Winter simulation, ACM, pp 260–265

  • Dey A, Mukerjee R (1999) Fractional factorial plans. Wiley, New York

  • Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure. Accessed 25 July 2022

  • Doug L (2001) 3d data management: controlling data volume, velocity and variety. META Group Res Note 6:1

  • Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157

  • Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton

  • Eshragh A, Roosta F, Nazari A, Mahoney MW (2019) LSAR: efficient leverage score sampling algorithm for the analysis of big time series data. https://arxiv.org/1911.12321

  • Fan J, Han F, Liu H (2014) Challenges of big data analysis. Natl Sci Rev 1:293–314

  • Fan Y, Liu Y, Zhu L (2021) Optimal subsampling for linear quantile regression models. Can J Stat. https://doi.org/10.1002/cjs.11590

  • Fang KT, Wang Y (1994) Number-theoretic methods in statistics, vol 51. CRC Press, Boca Raton

  • Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on statistics and applied probability. Springer, Berlin

  • Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724

  • Flury BA (1990) Principal points. Biometrika 77:33–41

  • Gittens A, Mahoney MW (2016) Revisiting the Nyström method for improved large-scale machine learning. J Mach Learn Res 17:3977–4041

  • Gu C (2013) Smoothing spline ANOVA models. Springer, Berlin

  • Gu C, Kim YJ (2002) Penalized likelihood regression: general formulation and efficient approximation. Can J Stat 30:619–628

  • Hájek J (1964) Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat 35:1491–1523

  • Han L, Tan KM, Yang T, Zhang T et al (2020) Local uncertainty sampling for large-scale multiclass logistic regression. Ann Stat 48:1770–1788

  • Hansen MH, Madow WG, Tepping BJ (1983) An evaluation of model-dependent and probability-sampling inferences in sample surveys. J Am Stat Assoc 78:776–793

  • Harman R, Rosa S (2020) On greedy heuristics for computing D-efficient saturated subsets. Oper Res Lett 48:122–129

  • He Z, Owen AB (2016) Extensible grids: uniform sampling on a space filling curve. J R Stat Soc B 78:917–931

  • Hedayat AS, Sloane NJA, Stufken J (1999) Orthogonal arrays: theory and applications. Springer, Berlin

  • Hickernell FJ, Liu M (2002) Uniform designs limit aliasing. Biometrika 89:893–904

  • Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge

  • Hu G, Wang H (2021) Most likely optimal subsampled Markov chain Monte Carlo. J Syst Sci Complex 34:1121–1134

  • Iraji MS, Ameri H (2016) RMSD protein tertiary structure prediction with soft computing. IJ Math Sci Comput 2:24–33

  • Johnson M, Moore L, Ylvisaker D (1990) Minimax and maximin distance designs. J Stat Plan Inference 26:131–148

  • Joseph VR, Dasgupta T, Tuo R, Wu CFJ (2015) Sequential exploration of complex surfaces using minimum energy designs. Technometrics 57:64–74

  • Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308

  • Katharopoulos A, Fleuret F (2018) Not all samples are created equal: deep learning with importance sampling. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol 80, pp 2525–2534

  • Kiefer J (1974) General equivalence theory for optimum designs (approximate theory). Ann Stat 2:849–879

  • Kong X, Zheng W (2020) Design based incomplete U-statistics. https://arXiv.org/2008.04348

  • Kuang K, Xiong R, Cui P, Athey S, Li B (2018) Stable prediction across unknown environments. http://arxiv.org/1806.06270

  • Kuang K, Zhang H, Wu F, Zhuang Y, Zhang A (2020) Balance-subsampled stable prediction. https://arXiv.org/2006.04381

  • Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80

  • Lee J, Schifano ED, Wang H (2021) Fast optimal subsampling probability approximation for generalized linear models. Econom Stat. https://doi.org/10.1016/j.ecosta.2021.02.007

  • Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, Berlin

  • Lemieux C (2009) Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Berlin

  • Li N, Qardaji W, Su D (2012) On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ACM, pp 32–33

  • Li K, Kong X, Ai M (2016) A general theory for orthogonal array based Latin hypercube sampling. Stat Sin 26:761–777

  • Lid Hjort N, Pollard D (2011) Asymptotics for minimisers of convex processes. https://arXiv.org/1107.3806

  • Liu Q, Lee JD, Jordan MI (2016) A kernelized stein discrepancy for goodness-of-fit tests and model evaluation. https://arXiv.org/1602.03253

  • Loshchilov I, Hutter F (2015) Online batch selection for faster training of neural networks. https://arXiv.org/1511.06343

  • Ma CX, Fang KT (2001) A note on generalized aberration in factorial designs. Metrika 53:85–93

  • Ma P, Huang J, Zhang N (2015a) Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 102:631–645

  • Ma P, Mahoney MW, Yu B (2015b) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919

  • Ma P, Zhang X, Xing X, Ma J, Mahoney M (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 1026–1035

  • Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592

  • Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over-or underdetermined systems. SIAM J Sci Comput 36:C95–C118

  • Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735

  • Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2021) Lowcon: a design-based subsampling approach in a misspecified linear model. J Comput Gr Stat 30:694–708

  • Meng C, Yu J, Chen Y, Zhong W, Ma P (2022) Smoothing splines approximation using Hilbert curve basis selection. J Comput Gr Stat 31:802–812

  • Mishra A, Rana PS, Mittal A, Jayaram B (2014) D2n: distance to the native. Biochim Biophys Acta 1844:1798–1807. https://doi.org/10.1016/j.bbapap.2014.07.010

  • Musser D (1997) Introspective sorting and selection algorithms. Softw Pract Exp 27:983–993

  • Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. In: Heckman JJ, Leamer E (eds) Handbook of econometrics, vol 4. Elsevier, Amsterdam, pp 2111–2245

  • Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep, National Bureau of Economic Research

  • Pfeffermann D (1993) The role of sampling weights when modeling survey data. Int Stat Rev/Rev Int Stat 61:317–337

  • Pfeffermann D, Skinner CJ, Holmes DJ, Goldstein H, Rasbash J (1998) Weighting for unequal selection probabilities in multilevel models. J R Stat Soc B 60:23–40

  • Pronzato L (2006) On the sequential construction of optimum bounded designs. J Stat Plan Inference 136:2783–2804

  • Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics

  • Qi ZF, Zhou Y, Fang KT (2019) Representative points for location-biased data sets. Commun Stat Simul Comput 48:458–471

  • Quiroz M, Kohn R, Villani M, Tran MN (2019) Speeding up MCMC by efficient data subsampling. J Am Stat Assoc 114:831–843

  • R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Ren M, Zhao SL (2021) Subdata selection based on orthogonal array for big data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2021.2012196

  • Ren H, Zou C, Chen N, Li R (2022) Large-scale datastreams surveillance via pattern-oriented-sampling. J Am Stat Assoc 117:794–808

  • Santner TJ, Williams BJ, Notz WI (2003) Space-filling designs for computer experiments, vol 5. Springer, Cham, pp 121–161

  • Shao L, Song S, Zhou Y (2022) Optimal subsampling for large-sample quantile regression with massive data. Can J Stat. https://doi.org/10.1002/cjs.11697

  • Shi C, Tang B (2021) Model-robust subdata selection for big data. J Stat Theory Pract 15:82

  • Shu X, Yao D, Bertino E (2015) Privacy-preserving detection of sensitive data exposure. IEEE Trans Inf Forensics Secur 10:1092–1103

  • Su Y (2000) Asymptotically optimal representative points of bivariate random vectors. Stat Sin 10:559–575

  • Székely GJ, Rizzo ML (2013) Energy statistics: a class of statistics based on distances. J Stat Plan Inference 143:1249–1272

  • Thompson S (2012) Simple random sampling, vol 2. Wiley, New York, pp 9–37

  • Ting D, Brochu E (2018) Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp 3650–3659

  • Vakayil A, Joseph VR (2022) Data twinning. In: Clarke B (ed) Statistical analysis and data mining: the ASA data science journal. Wiley, New York. https://doi.org/10.1002/sam.11574

  • Villani C (2008) Optimal transport: old and new. Springer, Berlin

  • Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59

  • Wang H, Kim JK (2020) Maximum sampled conditional likelihood for informative subsampling. https://arXiv.org/2011.05988

  • Wang W, Jing BY (2021) Convergence of Gaussian process regression: optimality, robustness, and relationship with kernel ridge regression. https://arXiv.org/2104.09778

  • Wang H, Ma Y (2021) Optimal subsampling for quantile regression in big data. Biometrika 108:99–112

  • Wang H, Zou J (2021) A comparative study on sampling with replacement vs poisson sampling in optimal subsampling. In: Banerjee A, Fukumizu K (eds) Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR, Proceedings of Machine Learning Research, vol 130, pp 289–297

  • Wang W, Tuo R, Wu CFJ (2017a) On prediction properties of Kriging: uniform error bounds and robustness. https://arXiv.org/1710.06959

  • Wang Y, Yu AW, Singh A (2017b) On computationally tractable selection of experiments in measurement-constrained regression models. J Mach Learn Res 18:5238–5278

  • Wang H, Zhu R, Ma P (2018a) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844

  • Wang Y, Yang J, Xu H (2018b) On the connection between maximin distance designs and orthogonal designs. Biometrika 105:471–477

  • Wang Z, Zhu H, Dong Z, He X, Huang SL (2019a) Less is better: unweighted data subsampling via influence function. https://arXiv.org/1912.01321

  • Wang H, Yang M, Stufken J (2019b) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405

  • Wang L, Elmstedt J, Wong WK, Xu H (2021) Orthogonal subsampling for big data linear regression. https://arXiv.org/2105.14647

  • Wang Y, Sun F, Xu H (2022) On design orthogonality, maximin distance, and projection uniformity for computer experiments. J Am Stat Assoc 117:375–385

  • Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. Adv Neural Inf Process Syst 13:682–688

  • Wu CFJ, Hamada MS (2009) Experiments: planning, analysis and parameter design optimization, 2nd edn. Wiley, New York

  • Wu S, Zhu X, Wang H (2022) Subsampling and jackknifing: a practically convenient solution for large data analysis with limited computational resources. Stat Sin. https://doi.org/10.5705/ss.202021.0257

  • Xie MY, Fang KT (2000) Admissibility and minimaxity of the uniform design measure in nonparametric regression model. J Stat Plan Inference 83:101–111

  • Xie R, Wang Z, Bai S, Ma P, Zhong W (2019) Online decentralized leverage score sampling for streaming multidimensional time series. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, PMLR, Proceedings of Machine Learning Research, vol 89, pp 2301–2311

  • Xiong S, Li G (2008) Some results on the convergence of conditional distributions. Stat Probab Lett 78:3249–3253

  • Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:235–249

  • Yao Y, Wang H (2021a) A review on optimal subsampling methods for massive datasets. J Data Sci 19:151–172

  • Yao Y, Wang H (2021b) A selective review on statistical techniques for big data. In: Zhao Y, Chen DG (eds) Modern statistical methods for health research. Springer, New York. https://doi.org/10.1007/978-3-030-72437-5_11

  • Ye KQ (1998) Orthogonal column Latin hypercubes and their application in computer experiments. J Am Stat Assoc 93:1430–1439

  • Yu J, Wang H (2022) Subdata selection algorithm for linear model discrimination. Stat Pap. https://doi.org/10.1007/s00362-022-01299-8

  • Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, pp 1081–1088

  • Yu J, Wang H, Ai M, Zhang H (2022) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 117:265–276

  • Zhang H, Wang H (2021) Distributed subdata selection for big data via sampling-based approach. Comput Stat Data Anal 153:107072

  • Zhang T, Ning Y, Ruppert D (2021) Optimal sampling for generalized linear models under measurement constraints. J Comput Gr Stat 30:106–114

  • Zhang J, Meng C, Yu J, Zhang M, Zhong W, Ma P (2022) An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. J Comput Gr Stat. https://doi.org/10.1080/10618600.2022.2084404

  • Zhao Y, Amemiya Y, Hung Y (2018) Efficient Gaussian process modeling using experimental design-based subagging. Stat Sin 28:1459–1479

  • Zhou Y, Fang K (2019) FM-criterion for representative points. Sci Sin Math 49:1009–1020

  • Zhu X, Pan R, Wu S, Wang H (2021) Feature screening for massive data analysis by subsampling. J Bus Econ Stat. https://doi.org/10.1080/07350015.2021.1990771

  • Zuo L, Zhang H, Wang H, Liu L (2021) Sampling-based estimation for massive survival data with additive hazards model. Stat Med 40:441–450

Acknowledgements

The authors sincerely thank the editor, the associate editor, and the referees for their valuable comments and insightful suggestions, which led to further improvement of this work. Ai’s work was supported by NSFC grants 12071014 and 12131001, and LMEQF. Yu’s work was supported by NSFC grant 12001042 and the Beijing Institute of Technology research fund program for young scholars.

Author information

Corresponding author

Correspondence to Mingyao Ai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 A summary of subsampling methods

See Table 8.

Table 8 A summary of subsampling methods

1.2 Technical details

Proof of Theorem 1

The estimator \(\tilde{{\varvec{\beta }}}\) is the maximizer of \(L^*({\varvec{\beta }})\) in (1), so conditional on \({\mathcal {F}}_n\), \(\sqrt{r}(\tilde{{\varvec{\beta }}}-\hat{{\varvec{\beta }}})\) is the maximizer of \(\Lambda ({\varvec{\eta }})=n^{-1}rL^*(\hat{{\varvec{\beta }}}+{{\varvec{\eta }}}/\sqrt{r})-n^{-1}rL^*(\hat{{\varvec{\beta }}})\). For notational simplicity, let \({\dot{f}}({\varvec{\beta }})\) and \(\ddot{f}({\varvec{\beta }})\) be the first- and second-order derivatives of a given function \(f({\varvec{\beta }})\) with respect to \({\varvec{\beta }}\). Performing a Taylor expansion of \(\Lambda ({\varvec{\eta }})\) yields

$$\begin{aligned} \Lambda ({\varvec{\eta }})=\frac{\sqrt{r}}{n}{{\varvec{\eta }}}^T{\dot{L}}^*(\hat{{\varvec{\beta }}})+\frac{1}{2n}{{\varvec{\eta }}}^T{\ddot{L}}^*(\hat{{\varvec{\beta }}}){{\varvec{\eta }}}+R, \end{aligned}$$
(A.1)

where R is the remainder term. Under Assumption 3, one can see that R goes to zero in probability conditional on \({\mathcal {F}}_n\), i.e., \(R=o_{P|{\mathcal {F}}_n}(1)\). More precisely,

$$\begin{aligned} |R|\le \frac{p^3\Vert {\varvec{\eta }}\Vert ^3}{3\sqrt{r}}\times \frac{1}{n}\sum _{i=1}^n H({\varvec{x}}_i,y_i)=o_{P}(1), \end{aligned}$$
(A.2)

and thus \(R=o_{P|{\mathcal {F}}_n}(1)\) holds by Theorem 3.3 in Xiong and Li (2008).

Recall that \({\dot{\ell }}(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=n^{-1}(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\), and that the corresponding subsampling version is \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=(n\pi _i)^{-1}\delta _i(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\). One can see that the \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})\) are independent random vectors conditional on \({\mathcal {F}}_n\). Direct calculation yields

$$\begin{aligned} E({\dot{L}}^*(\hat{{\varvec{\beta }}})|{\mathcal {F}}_n)&={\varvec{0}}, \end{aligned}$$
(A.3)
$$\begin{aligned} \textrm{var}({\dot{L}}^*(\hat{{\varvec{\beta }}})|{\mathcal {F}}_n)&=\sum _{i=1}^n\frac{1-\pi _i}{\pi _i}{\dot{\ell }}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}}){\dot{\ell }}^{T}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}})\nonumber \\&=\sum _{i=1}^n\frac{1}{\pi _i}{\dot{\ell }}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}}){\dot{\ell }}^{T}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}})+o_{P|{\mathcal {F}}_n}(1)\nonumber \\&=V_c+o_{P|{\mathcal {F}}_n}(1), \end{aligned}$$
(A.4)

where the second-to-last equality follows from the fact that

$$\begin{aligned}&\sum _{i=1}^n{\dot{\ell }}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}}){\dot{\ell }}^{T}(y_i|{\varvec{x}}_i, \hat{{\varvec{\beta }}})\\ \le&\sup _{{\varvec{\beta }}\in \Theta }\frac{1}{n^2}\sum _{i=1}^n(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})^T\\ \le&\frac{1}{n^2}\sup _{{\varvec{\beta }}\in \Theta }\sum _{i=1}^n \Vert \partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }}\Vert ^2 I_p =o_{P|{\mathcal {F}}_n}(1), \end{aligned}$$

under Assumption 2 and \(r/n\rightarrow 0\).

Under Assumption 5, the Lindeberg-Feller condition holds under the conditional distribution. Thus, by the central limit theorem,

$$\begin{aligned} \frac{\sqrt{r}}{n}{\dot{L}}^*(\hat{{\varvec{\beta }}})\rightarrow N({\varvec{0}},V_c). \end{aligned}$$
(A.5)

Under Assumption 2, by the law of large numbers, we have

$$\begin{aligned} \frac{1}{n}{\ddot{L}}^*(\hat{{\varvec{\beta }}})=-{\mathcal {J}}_{\hat{{\varvec{\beta }}}}+o_{P|{\mathcal {F}}_n}(1). \end{aligned}$$
(A.6)

Thus the desired result follows by the Basic Corollary in Lid Hjort and Pollard (2011) and Theorem 3.3 in Xiong and Li (2008). \(\square \)
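
For illustration only (this sketch is ours, not part of the original paper), the following Python code computes the weighted subsample estimator \(\tilde{{\varvec{\beta }}}\) for a logistic regression model, assuming, consistently with the score \({\dot{\ell }}^*\) above, that \(L^*({\varvec{\beta }})\) is the inverse-probability-weighted log-likelihood of the selected observations; the function name, the use of scipy, and the form of the inclusion indices are assumptions made for the example.

import numpy as np
from scipy.optimize import minimize

def weighted_subsample_logit(X, y, pi, idx):
    """Inverse-probability-weighted logistic MLE on the subsample idx.

    X: (n, p) covariates, y: (n,) binary responses, pi: (n,) subsampling
    probabilities, idx: indices actually drawn.  Returns the maximizer of
    the weighted log-likelihood (a sketch of the estimator beta_tilde).
    """
    Xs, ys, w = X[idx], y[idx], 1.0 / pi[idx]          # weights 1/pi_i

    def neg_loglik(beta):
        eta = Xs @ beta
        # logistic log-likelihood y*eta - log(1 + exp(eta)), weighted by 1/pi_i
        return -np.sum(w * (ys * eta - np.logaddexp(0.0, eta)))

    res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
    return res.x

Following the proof above, conditional on \({\mathcal {F}}_n\) the estimator satisfies \(\sqrt{r}(\tilde{{\varvec{\beta }}}-\hat{{\varvec{\beta }}})\rightarrow N({\varvec{0}},{\mathcal {J}}_{\hat{{\varvec{\beta }}}}^{-1}V_c{\mathcal {J}}_{\hat{{\varvec{\beta }}}}^{-1})\), so a plug-in sandwich estimator based on the same subsample can be used to quantify its uncertainty.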

Proof of Theorem 2

Note that \({\varvec{z}}=\Sigma ^{-1/2}{\varvec{x}}\) is standard normal and

$$\begin{aligned}E(X_s^{ \textrm{T} }X_s)=\Sigma ^{1/2}\left( \int {\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\right) \Sigma ^{1/2},\end{aligned}$$

where \(\phi _I({\varvec{z}})\) is the density function of the p-dimensional standard normal distribution.

By the inequality of arithmetic and geometric means for the eigenvalues, one can see that

$$\begin{aligned} p\left( \det (\int {\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}})\right) ^{1/p}&\le \textrm{tr}\left( \int {\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\right) \\&=\int _{{\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c}\Vert {\varvec{z}}\Vert ^2\phi _I({\varvec{z}})d{\varvec{z}}, \end{aligned}$$

where the inequality above holds with equality because \(\int {\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\) is a nonzero multiple of the identity matrix, so that all of its eigenvalues are equal.

Note that for any selection rule \(g\), \(\textrm{tr}\left( \int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\right) =\int _{g(\Sigma ^{1/2}{\varvec{z}})} \Vert {\varvec{z}}\Vert ^2\phi _I({\varvec{z}})d{\varvec{z}}\). Clearly, the selection rule \({\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c)\) also achieves the upper bound of \(\textrm{tr}\big (\int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\big )\) over all \(g(\cdot )\) satisfying the constraint. Thus the result follows from the fact that \(\det (AB)=\det (A)\det (B)\) for any \(A,B>0\). \(\square \)
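
To make the selection rule analyzed above concrete, the following sketch (ours, not from the paper) keeps the r observations with the largest estimated Mahalanobis norm, i.e., an empirical version of \({\mathbb {I}}({\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c)\) in which \(\Sigma \) is replaced by the sample covariance and the threshold c is set implicitly by retaining the r largest norms.

import numpy as np

def norm_threshold_subsample(X, r):
    """Return indices of the r rows with the largest estimated Mahalanobis norm.

    A sketch of the rule I(z^T z > c): Sigma is estimated by the sample
    covariance and c is chosen implicitly by keeping the r largest norms.
    """
    Xc = X - X.mean(axis=0)                              # centering (illustrative choice)
    Sigma_inv = np.linalg.inv(np.cov(Xc, rowvar=False))  # estimate of Sigma^{-1}
    norms = np.einsum("ij,jk,ik->i", Xc, Sigma_inv, Xc)  # z_i^T z_i for each row
    return np.argpartition(norms, -r)[-r:]               # indices of the r largest norms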

1.3 Brief introduction of the optimal transport map

To ease the presentation, we only present a specific optimal transport map from a general distribution F on \(\Omega \subseteq {\mathbb {R}}^{p+1}\) to the uniform distribution \(F_\textrm{unif}\) on \([0,1]^{p+1}\). For the general definition of the optimal transport map, we refer to Villani (2008) for more details. Let \(\#\) denote the push-forward operator, such that for all measurable \(B\subseteq \Omega \), we have \(\phi _\#(F)(B)=F(\phi ^{-1}(B))\). The optimal transport map \(\phi ^*\) of interest is the one that minimizes the \(l_2\) cost \(\int _\Omega \Vert {\varvec{z}}- \phi ({\varvec{z}}) \Vert ^2 \text{ d }F\) over all measure-preserving maps \(\phi :\Omega \rightarrow [0,1]^{p+1}\) such that \(\phi _\#(F) = F_{\textrm{unif}}\) and \(\phi ^{-1}_\#(F_\textrm{unif}) = F\). It has been shown that when \(\Omega ={\mathbb {R}}\), \(\phi ^*\) is equivalent to the cumulative distribution function F (Villani 2008).

Given observations \(\{{\varvec{z}}_i\}_{i=1}^n\) from F and \(\{{\varvec{s}}_i\}_{i=1}^n\) from \(F_{\textrm{unif}}\), the empirical optimal transport map can be estimated by the auction algorithm or the refined auction algorithm (Bertsekas 1992). To further alleviate the computational burden, in practice, some projection-based methods can be used to approximate the optimal transport map \(\phi ^*\). Readers may refer to Zhang et al. (2022) for detailed discussions of the computational issues.
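
As a minimal illustration (ours, not code from the paper), the empirical assignment between \(\{{\varvec{z}}_i\}_{i=1}^n\) and \(\{{\varvec{s}}_i\}_{i=1}^n\) under the \(l_2\) cost can be computed with any exact assignment solver; the sketch below uses the Hungarian algorithm from scipy as a stand-in for the auction algorithm of Bertsekas (1992).

import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_ot_map(Z, S):
    """Empirical optimal transport (assignment) map under the squared l_2 cost.

    Z: (n, p+1) observations from F; S: (n, p+1) points from the uniform
    distribution on [0, 1]^{p+1}.  Returns perm such that phi(z_i) = S[perm[i]].
    """
    cost = ((Z[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)  # n x n cost matrix
    _, perm = linear_sum_assignment(cost)                      # minimize total cost
    return perm

The cost matrix above has size n x n, so this exact solver is only practical for moderate n; for large n the auction algorithm or the projection-based approximations mentioned above are preferable.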

1.4 Additional simulation results for \(n=1,000,000\)

In this subsection, we provide the log EMSE of the subsampling methods for large n. More precisely, we consider \(n=1,000,000\) under the three cases given in Sect. 6. Due to limited computational resources, we only report the results for the methods whose computing time is no more than \(O(np^2)\). The results for the linear regression model and the logistic regression model are reported in Tables 9 and 10, respectively.

Table 9 The log EMSE of the different subsampling methods for the linear models with different r
Table 10 The log EMSE of the different subsampling methods for the logistic regression models with different r

Cite this article

Yu, J., Ai, M. & Ye, Z. A review on design inspired subsampling for big data. Stat Papers 65, 467–510 (2024). https://doi.org/10.1007/s00362-022-01386-w
