Abstract
Subsampling focuses on selecting a subsample that can efficiently sketch the information in the original data in terms of statistical inference. It provides a powerful tool for big data analysis and has gained the attention of data scientists in recent years. In this review, some state-of-the-art subsampling methods inspired by statistical design are summarized. Three types of designs, namely optimal designs, orthogonal designs, and space-filling designs, have shown great potential in subsampling for different objectives. The relationships between experimental designs and the related subsampling approaches are discussed. Specifically, two major families of design-inspired subsampling techniques are presented. The first aims to select a subsample in accordance with some optimal design criterion. The second tries to find a subsample that meets certain design requirements, including balance, orthogonality, and uniformity. Simulated and real data examples are provided to compare these methods empirically.
References
Ai M, Wang F, Yu J, Zhang H (2021a) Optimal subsampling for large-scale quantile regression. J Complex 62:101512
Ai M, Yu J, Zhang H, Wang H (2021b) Optimal subsampling algorithms for big data regressions. Stat Sin 31:749–772
Altschuler J, Bach F, Rudi A, Niles-Weed J (2019) Massively scalable Sinkhorn distances via the Nyström method. Adv Neural Inf Process Syst 32:4427–4437
Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS. Oxford University Press, Oxford
Avron H, Maymounkov P, Toledo S (2010) Blendenpik: supercharging LAPACK’s least-squares solver. SIAM J Sci Comput 32:1217–1236
Beasley LB, Brualdi RA, Shader BL (1993) Combinatorial orthogonality. In: Brualdi RA, Friedland S, Klee V (eds) Combinatorial and graph-theoretical problems in linear algebra. Springer, New York, NY, pp 207–218
Berger YG, de La Riva-Torres O (2016) Empirical likelihood confidence intervals for complex sampling designs. J R Stat Soc Ser B 78:314–319
Bertsekas DP (1992) Auction algorithms for network flow problems: a tutorial introduction. Comput Optim Appl 1:7–66
Biedermann S, Dette H (2001) Minimax optimal designs for nonparametric regression—a further optimality property of the uniform distribution. In: Atkinson AC, Hackl P, Müller WG (eds) mODa 6—Advances in Model-Oriented Design and Analysis. Physica-Verlag HD, Heidelberg, pp 13–20
Blom G (1976) Some properties of incomplete U-statistics. Biometrika 63:573–580
Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194
Bottou L (1999) On-line learning and stochastic approximations. In: Saad D (ed) On-line learning in neural networks. Publications of the Newton Institute. Cambridge University Press, Cambridge, pp 9–42. https://doi.org/10.1017/CBO9780511569920.003
Breidt FJ, Opsomer JD (2000) Local polynomial regression estimators in survey sampling. Ann Stat 28:1026–1053
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York
Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, vol 80, pp 844–853
Chen WY, Barp A, Briol FX, Gorham J, Girolami M, Mackey L, Oates CJ (2019) Stein point Markov chain Monte Carlo. https://arxiv.org/abs/1905.03673
Cheng Q, Wang H, Yang M (2020) Information-based optimal subdata selection for big data logistic regression. J Stat Plan Inference 209:112–122
Chernozhukov V, Galichon A, Hallin M, Henry M (2017) Monge-Kantorovich depth, quantiles, ranks and signs. Ann Stat 45:223–256
Cioppa TM, Lucas TW (2007) Efficient nearly orthogonal and space-filling Latin hypercubes. Technometrics 49:45–55
Cook CE, Lopez R, Stroe O, Cochrane G, Apweiler R (2019) The European bioinformatics institute in 2018: tools, infrastructure and training. Nucleic Acids Res 47:D15–D22
Cox DR (1957) Note on grouping. J Am Stat Assoc 52:543–547
Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol 54:93
Derezinski M, Warmuth MKK (2017) Unbiased estimates for linear regression via volume sampling. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., Red Hook, pp 3084–3093
Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39
Derezinski M, Warmuth MKK, Hsu DJ (2018) Leveraged volume sampling for linear regression. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates, Inc., Red Hook, pp 2505–2514
Dereziński M, Clarkson KL, Mahoney MW, Warmuth MK (2019) Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In: Beygelzimer A, Hsu D (eds) Proceedings of the Thirty-Second Conference on Learning Theory, Phoenix, USA, Proceedings of Machine Learning Research, vol 99, pp 1050–1069
Devroye L (1986) Sample-based non-uniform random variate generation. In: Proceedings of the 18th conference on Winter simulation, ACM, pp 260–265
Dey A, Mukerjee R (1999) Fractional factorial plans. Wiley, New York
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure. Accessed 25 July 2022
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6:1
Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157
Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton
Eshragh A, Roosta F, Nazari A, Mahoney MW (2019) LSAR: efficient leverage score sampling algorithm for the analysis of big time series data. https://arxiv.org/abs/1911.12321
Fan J, Han F, Liu H (2014) Challenges of big data analysis. Natl Sci Rev 1:293–314
Fan Y, Liu Y, Zhu L (2021) Optimal subsampling for linear quantile regression models. Can J Stat. https://doi.org/10.1002/cjs.11590
Fang KT, Wang Y (1994) Number-theoretic methods in statistics, vol 51. CRC Press, Boca Raton
Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on statistics and applied probability. Springer, Berlin
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
Flury BA (1990) Principal points. Biometrika 77:33–41
Gittens A, Mahoney MW (2016) Revisiting the Nyström method for improved large-scale machine learning. J Mach Learn Res 17:3977–4041
Gu C (2013) Smoothing spline ANOVA models. Springer, Berlin
Gu C, Kim YJ (2002) Penalized likelihood regression: general formulation and efficient approximation. Can J Stat 30:619–628
Hájek J (1964) Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat 35:1491–1523
Han L, Tan KM, Yang T, Zhang T et al (2020) Local uncertainty sampling for large-scale multiclass logistic regression. Ann Stat 48:1770–1788
Hansen MH, Madow WG, Tepping BJ (1983) An evaluation of model-dependent and probability-sampling inferences in sample surveys. J Am Stat Assoc 78:776–793
Harman R, Rosa S (2020) On greedy heuristics for computing d-efficient saturated subsets. Oper Res Lett 48:122–129
He Z, Owen AB (2016) Extensible grids: uniform sampling on a space filling curve. J R Stat Soc B 78:917–931
Hedayat AS, Sloane NJA, Stufken J (1999) Orthogonal arrays: theory and applications. Springer, Berlin
Hickernell FJ, Liu M (2002) Uniform designs limit aliasing. Biometrika 89:893–904
Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge
Hu G, Wang H (2021) Most likely optimal subsampled Markov chain Monte Carlo. J Syst Sci Complex 34:1121–1134
Iraji MS, Ameri H (2016) RMSD protein tertiary structure prediction with soft computing. IJ Math Sci Comput 2:24–33
Johnson M, Moore L, Ylvisaker D (1990) Minimax and maximin distance designs. J Stat Plan Inference 26:131–148
Joseph VR, Dasgupta T, Tuo R, Wu CFJ (2015) Sequential exploration of complex surfaces using minimum energy designs. Technometrics 57:64–74
Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
Katharopoulos A, Fleuret F (2018) Not all samples are created equal: deep learning with importance sampling. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol 80, pp 2525–2534
Kiefer J (1974) General equivalence theory for optimum designs (approximate theory). Ann Stat 2:849–879
Kong X, Zheng W (2020) Design-based incomplete U-statistics. https://arxiv.org/abs/2008.04348
Kuang K, Xiong R, Cui P, Athey S, Li B (2018) Stable prediction across unknown environments. https://arxiv.org/abs/1806.06270
Kuang K, Zhang H, Wu F, Zhuang Y, Zhang A (2020) Balance-subsampled stable prediction. https://arxiv.org/abs/2006.04381
Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80
Lee J, Schifano ED, Wang H (2021) Fast optimal subsampling probability approximation for generalized linear models. Econom Stat. https://doi.org/10.1016/j.ecosta.2021.02.007
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, Berlin
Lemieux C (2009) Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Berlin
Li N, Qardaji W, Su D (2012) On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ACM, pp 32–33
Li K, Kong X, Ai M (2016) A general theory for orthogonal array based Latin hypercube sampling. Stat Sin 26:761–777
Lid Hjort N, Pollard D (2011) Asymptotics for minimisers of convex processes. https://arxiv.org/abs/1107.3806
Liu Q, Lee JD, Jordan MI (2016) A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. https://arxiv.org/abs/1602.03253
Loshchilov I, Hutter F (2015) Online batch selection for faster training of neural networks. https://arxiv.org/abs/1511.06343
Ma CX, Fang KT (2001) A note on generalized aberration in factorial designs. Metrika 53:85–93
Ma P, Huang J, Zhang N (2015a) Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 102:631–645
Ma P, Mahoney MW, Yu B (2015b) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919
Ma P, Zhang X, Xing X, Ma J, Mahoney M (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 1026–1035
Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over-or underdetermined systems. SIAM J Sci Comput 36:C95–C118
Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735
Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2021) Lowcon: a design-based subsampling approach in a misspecified linear model. J Comput Gr Stat 30:694–708
Meng C, Yu J, Chen Y, Zhong W, Ma P (2022) Smoothing splines approximation using Hilbert curve basis selection. J Comput Gr Stat 31:802–812
Mishra A, Rana PS, Mittal A, Jayaram B (2014) D2n: distance to the native. Biochim Biophys Acta 1844:1798–1807. https://doi.org/10.1016/j.bbapap.2014.07.010
Musser D (1997) Introspective sorting and selection algorithms. Softw Pract Exp 27:983–993
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. In: Heckman JJ, Leamer E (eds) Handbook of econometrics, vol 4. Elsevier, Amsterdam, pp 2111–2245
Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep, National Bureau of Economic Research
Pfeffermann D (1993) The role of sampling weights when modeling survey data. Int Stat Rev/Rev Int Stat 61:317–337
Pfeffermann D, Skinner CJ, Holmes DJ, Goldstein H, Rasbash J (1998) Weighting for unequal selection probabilities in multilevel models. J R Stat Soc B 60:23–40
Pronzato L (2006) On the sequential construction of optimum bounded designs. J Stat Plan Inference 136:2783–2804
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics
Qi ZF, Zhou Y, Fang KT (2019) Representative points for location-biased data sets. Commun Stat Simul Comput 48:458–471
Quiroz M, Kohn R, Villani M, Tran MN (2019) Speeding up MCMC by efficient data subsampling. J Am Stat Assoc 114:831–843
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Ren M, Zhao SL (2021) Subdata selection based on orthogonal array for big data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2021.2012196
Ren H, Zou C, Chen N, Li R (2022) Large-scale datastreams surveillance via pattern-oriented-sampling. J Am Stat Assoc 117:794–808
Santner TJ, Williams BJ, Notz WI (2003) Space-filling designs for computer experiments, vol 5. Springer, Cham, pp 121–161
Shao L, Song S, Zhou Y (2022) Optimal subsampling for large-sample quantile regression with massive data. Can J Stat. https://doi.org/10.1002/cjs.11697
Shi C, Tang B (2021) Model-robust subdata selection for big data. J Stat Theory Pract 15:82
Shu X, Yao D, Bertino E (2015) Privacy-preserving detection of sensitive data exposure. IEEE Trans Inf Forensics Secur 10:1092–1103
Su Y (2000) Asymptotically optimal representative points of bivariate random vectors. Stat Sin 10:559–575
Székely GJ, Rizzo ML (2013) Energy statistics: a class of statistics based on distances. J Stat Plan Inference 143:1249–1272
Thompson S (2012) Simple random sampling, vol 2. Wiley, New York, pp 9–37
Ting D, Brochu E (2018) Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp 3650–3659
Vakayil A, Joseph VR (2022) Data twinning. In: Clarke B (ed) Statistical analysis and data mining: the ASA data science journal. Wiley, New York. https://doi.org/10.1002/sam.11574
Villani C (2008) Optimal transport: old and new. Springer, Berlin
Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59
Wang H, Kim JK (2020) Maximum sampled conditional likelihood for informative subsampling. https://arxiv.org/abs/2011.05988
Wang W, Jing BY (2021) Convergence of Gaussian process regression: optimality, robustness, and relationship with kernel ridge regression. https://arxiv.org/abs/2104.09778
Wang H, Ma Y (2021) Optimal subsampling for quantile regression in big data. Biometrika 108:99–112
Wang H, Zou J (2021) A comparative study on sampling with replacement vs poisson sampling in optimal subsampling. In: Banerjee A, Fukumizu K (eds) Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR, Proceedings of Machine Learning Research, vol 130, pp 289–297
Wang W, Tuo R, Wu CFJ (2017a) On prediction properties of Kriging: uniform error bounds and robustness. https://arxiv.org/abs/1710.06959
Wang Y, Yu AW, Singh A (2017b) On computationally tractable selection of experiments in measurement-constrained regression models. J Mach Learn Res 18:5238–5278
Wang H, Zhu R, Ma P (2018a) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Wang Y, Yang J, Xu H (2018b) On the connection between maximin distance designs and orthogonal designs. Biometrika 105:471–477
Wang Z, Zhu H, Dong Z, He X, Huang SL (2019a) Less is better: unweighted data subsampling via influence function. https://arxiv.org/abs/1912.01321
Wang H, Yang M, Stufken J (2019b) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405
Wang L, Elmstedt J, Wong WK, Xu H (2021) Orthogonal subsampling for big data linear regression. https://arxiv.org/abs/2105.14647
Wang Y, Sun F, Xu H (2022) On design orthogonality, maximin distance, and projection uniformity for computer experiments. J Am Stat Assoc 117:375–385
Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. Adv Neural Inf Process Syst 13:682–688
Wu CFJ, Hamada MS (2009) Experiments: planning, analysis and parameter design optimization, 2nd edn. Wiley, New York
Wu S, Zhu X, Wang H (2022) Subsampling and jackknifing: a practically convenient solution for large data analysis with limited computational resources. Stat Sin. https://doi.org/10.5705/ss.202021.0257
Xie MY, Fang KT (2000) Admissibility and minimaxity of the uniform design measure in nonparametric regression model. J Stat Plan Inference 83:101–111
Xie R, Wang Z, Bai S, Ma P, Zhong W (2019) Online decentralized leverage score sampling for streaming multidimensional time series. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, PMLR, Proceedings of Machine Learning Research, vol 89, pp 2301–2311
Xiong S, Li G (2008) Some results on the convergence of conditional distributions. Stat Probab Lett 78:3249–3253
Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:235–249
Yao Y, Wang H (2021a) A review on optimal subsampling methods for massive datasets. J Data Sci 19:151–172
Yao Y, Wang H (2021b) A selective review on statistical techniques for big data. In: Zhao Y, Chen DG (eds) Modern statistical methods for health research. Springer, New York. https://doi.org/10.1007/978-3-030-72437-5_11
Ye KQ (1998) Orthogonal column Latin hypercubes and their application in computer experiments. J Am Stat Assoc 93:1430–1439
Yu J, Wang H (2022) Subdata selection algorithm for linear model discrimination. Stat Pap. https://doi.org/10.1007/s00362-022-01299-8
Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, pp 1081–1088
Yu J, Wang H, Ai M, Zhang H (2022) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 117:265–276
Zhang H, Wang H (2021) Distributed subdata selection for big data via sampling-based approach. Comput Stat Data Anal 153:107072
Zhang T, Ning Y, Ruppert D (2021) Optimal sampling for generalized linear models under measurement constraints. J Comput Gr Stat 30:106–114
Zhang J, Meng C, Yu J, Zhang M, Zhong W, Ma P (2022) An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. J Comput Gr Stat. https://doi.org/10.1080/10618600.2022.2084404
Zhao Y, Amemiya Y, Hung Y (2018) Efficient Gaussian process modeling using experimental design-based subagging. Stat Sin 28:1459–1479
Zhou Y, Fang K (2019) FM-criterion for representative points. Sci Sin Math 49:1009–1020
Zhu X, Pan R, Wu S, Wang H (2021) Feature screening for massive data analysis by subsampling. J Bus Econ Stat. https://doi.org/10.1080/07350015.2021.1990771
Zuo L, Zhang H, Wang H, Liu L (2021) Sampling-based estimation for massive survival data with additive hazards model. Stat Med 40:441–450
Acknowledgements
The authors sincerely thank the editor, associate editor, and referees for their valuable comments and insightful suggestions, which led to further improvement of this work. Ai’s work was supported by NSFC Grants 12071014 and 12131001 and by LMEQF. Yu’s work was supported by NSFC Grant 12001042 and the Beijing Institute of Technology research fund program for young scholars.
Appendix
1.1 A summary of subsampling methods
See Table 8.
1.2 Technical details
Proof of Theorem 1
The estimator \(\tilde{{\varvec{\beta }}}\) is the maximizer of \(L^*({\varvec{\beta }})\) in (1), so conditional on \({\mathcal {F}}_n\), \(\sqrt{r}(\tilde{{\varvec{\beta }}}-\hat{{\varvec{\beta }}})\) is the maximizer of \(\Lambda ({\varvec{\eta }})=n^{-1}rL^*(\hat{{\varvec{\beta }}}+{{\varvec{\eta }}}/\sqrt{r})-n^{-1}rL^*(\hat{{\varvec{\beta }}})\). For notational simplicity, let \({\dot{f}}({\varvec{\beta }})\) and \(\ddot{f}({\varvec{\beta }})\) denote the first- and second-order derivatives of a given function \(f({\varvec{\beta }})\) with respect to \({\varvec{\beta }}\). Performing a Taylor expansion on \(\Lambda ({\varvec{\eta }})\),
where R is the remainder term. Under Assumption 3, one can see that R goes to zero in probability conditional on \({\mathcal {F}}_n\), i.e., \(R=o_{P|{\mathcal {F}}_n}(1)\). Precisely,
thus the result holds by Theorem 3.3 in Xiong and Li (2008).
Recall that \({\dot{\ell }}(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=n^{-1}(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\), and the corresponding subsampling version is \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=(n\pi _i)^{-1}\delta _i(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\). One can see that the \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})\) are independent random vectors conditional on \({\mathcal {F}}_n\). Direct calculation yields
where the second-to-last equality follows from the facts
under Assumption 2 and \(r/n\rightarrow 0\).
Under Assumption 5, the Lindeberg–Feller condition holds under the conditional distribution. Thus, by the central limit theorem,
Under Assumption 2, by the law of large numbers, we have
Thus the desired result follows by the Basic Corollary in Lid Hjort and Pollard (2011) and Theorem 3.3 in Xiong and Li (2008). \(\square \)
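The inverse-probability weighting driving this argument can be checked numerically. Below is a minimal Python sketch on toy logistic-regression data (illustrative only, not from the paper) verifying that the weighted subsample score, with weights \((n\pi_i)^{-1}\), is conditionally unbiased for the full-data score; the probabilities proportional to row norms are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 5_000, 3, 500

# Toy logistic-regression data (illustrative only)
X = rng.standard_normal((n, p))
beta = np.array([0.5, -0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

def full_score(b):
    # n^{-1} * sum of per-observation log-likelihood gradients
    resid = y - 1.0 / (1.0 + np.exp(-X @ b))
    return X.T @ resid / n

pi = np.linalg.norm(X, axis=1)
pi /= pi.sum()  # arbitrary positive sampling probabilities

def subsample_score(b):
    # weighted score with weights (n * pi_i)^{-1}, averaged over r draws
    idx = rng.choice(n, size=r, p=pi)
    resid = y[idx] - 1.0 / (1.0 + np.exp(-X[idx] @ b))
    return (X[idx] * (resid / (n * pi[idx]))[:, None]).mean(axis=0)

# Averaging many independent subsample scores recovers the full-data score
avg = np.mean([subsample_score(beta) for _ in range(300)], axis=0)
print(np.max(np.abs(avg - full_score(beta))))
```

The printed maximal deviation is small and shrinks as the number of replicates grows, which is the conditional unbiasedness used in the direct calculation above.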
Proof of Theorem 2
Note that \({\varvec{z}}=\Sigma ^{-1/2}{\varvec{x}}\) is standard normal and
where \(\phi _I({\varvec{z}})\) is the density function of the p-dimensional standard normal distribution.
By the inequality of arithmetic and geometric means for the eigenvalues, one can see that
where the equality holds because, by spherical symmetry, \(\int _{{\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c}{\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\) is a nonzero multiple of the identity matrix.
Note that for any selection rule \(\textrm{tr}\left( \int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\right) \!=\!\int _{g(\Sigma ^{1/2}{\varvec{z}})} \Vert {\varvec{z}}\Vert ^2\phi _I({\varvec{z}})d{\varvec{z}}\). Clearly, such a selection also achieves the upper bound of \(\textrm{tr}\big (\int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\big )\) over all \(g(\cdot )\) satisfying the constraint. Thus the result follows from the fact that \(\det (AB)=\det (A)\det (B)\) for any \(A,B>0\). \(\square \)
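The trace identity used in the last step is easy to check numerically: since \(\textrm{tr}(S^{\textrm{T}}S)\) equals the sum of squared norms of the selected points, the size-\(k\) subset maximizing the trace of the information matrix consists exactly of the \(k\) largest-norm points. The brute-force verification below uses hypothetical toy data, not any dataset from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, k = 12, 2, 5
Z = rng.standard_normal((n, p))  # toy standardized covariates

def tr_info(rows):
    # trace of the (unnormalized) information matrix of the subset
    S = Z[list(rows)]
    return np.trace(S.T @ S)

# tr(S^T S) is the sum of squared norms of the selected points,
# so keeping the k largest-norm points maximizes it
largest_norm = np.argsort(-np.sum(Z ** 2, axis=1))[:k]
brute_force = max(tr_info(c) for c in combinations(range(n), k))
print(tr_info(largest_norm), brute_force)
```

The exhaustive search over all 792 subsets agrees with the norm-based selection; this is the scalar analogue of the bound attained in the proof.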
1.3 Brief introduction of the optimal transport map
To ease the presentation, we only present a specific optimal transport map from a general distribution F on \(\Omega \subseteq {\mathbb {R}}^{p+1}\) to the uniform distribution \(F_\textrm{unif}\) on \([0,1]^{p+1}\); for the general definition of the optimal transport map, we refer to Villani (2008). Let \(\#\) denote the push-forward operator, so that for all measurable \(B\subseteq \Omega \) we have \(\phi _\#(F)(B)=F(\phi ^{-1}(B))\). The optimal transport map \(\phi ^*\) of interest is the one that minimizes the \(l_2\) cost \(\int _\Omega \Vert {\varvec{z}}- \phi ({\varvec{z}}) \Vert ^2 \text{ d }F\) over all measure-preserving maps \(\phi :\Omega \rightarrow [0,1]^{p+1}\) such that \(\phi _\#(F) = F_{\textrm{unif}}\) and \(\phi ^{-1}_\#(F_\textrm{unif}) = F\). It has been shown that when \(\Omega ={\mathbb {R}}\), \(\phi ^*\) coincides with the cumulative distribution function F (Villani 2008).
Given observations \(\{{\varvec{z}}_i\}_{i=1}^n\) from F and \(\{{\varvec{s}}_i\}_{i=1}^n\) from \(F_{\textrm{unif}}\), the empirical optimal transport map can be computed by the auction algorithm or its refined variants (Bertsekas 1992). To further alleviate the computational burden, projection-based methods can be used in practice to approximate \(\phi ^*\). Readers may refer to Zhang et al. (2022) for detailed discussions of the computational issues.
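The one-dimensional fact above can be illustrated numerically. The following sketch computes the empirical optimal transport plan for the \(l_2\) cost by exact linear assignment (SciPy's Hungarian-type solver stands in for the auction algorithm here; the toy samples are illustrative) and checks that the optimal coupling is monotone, i.e., it pairs order statistics, mirroring the fact that \(\phi ^*\) is the CDF of F.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 60
z = rng.standard_normal(n)   # observations from a general F on R
s = rng.uniform(size=n)      # observations from Uniform[0, 1]

# Empirical OT plan under the l2 cost via exact linear assignment
cost = (z[:, None] - s[None, :]) ** 2
row, col = linear_sum_assignment(cost)

# In one dimension the optimal coupling is monotone: the i-th smallest
# z is matched to the i-th smallest s
rank_z = np.argsort(np.argsort(z))
rank_s = np.argsort(np.argsort(s))
print(np.all(rank_z[row] == rank_s[col]))  # True
```

The monotone pairing is the unique optimizer here because the squared cost is strictly convex: any crossing pair can be strictly improved by swapping, so the assignment solver must recover the rank matching.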
1.4 Additional simulation results for \(n=1,000,000\)
In this subsection, we provide the log EMSE for the subsampling methods with large n. More precisely, we consider \(n=1{,}000{,}000\) under the three cases given in Sect. 6. Due to limited computational resources, we only report results for methods whose computing time is no more than \(O(np^2)\). The results for the linear regression model and the logistic regression model are reported in Tables 9 and 10, respectively.
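As a concrete instance of a method within the \(O(np^2)\) budget, here is a minimal Python sketch of basic leverage-score subsampling for linear regression in the spirit of algorithmic leveraging (Ma et al. 2015b); the data are a toy example, not the simulation settings of Sect. 6, and exact leverage scores via a thin QR factorization cost \(O(np^2)\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 50_000, 10, 1_000

# Toy linear-regression data (illustrative only)
X = rng.standard_normal((n, p))
y = X @ rng.uniform(1.0, 2.0, size=p) + rng.standard_normal(n)

# Exact leverage scores h_ii via a thin QR factorization: O(n p^2)
Q, _ = np.linalg.qr(X)
pi = np.sum(Q ** 2, axis=1) / p   # leverage scores sum to p
pi /= pi.sum()                     # guard against rounding

# Basic leveraging: sample with replacement, solve the reweighted problem
idx = rng.choice(n, size=r, p=pi)
w = 1.0 / np.sqrt(r * pi[idx])
beta_sub = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)[0]
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(beta_sub - beta_full)))
```

The subsample estimator tracks the full-data least-squares fit at a fraction of the cost, which is the kind of trade-off the EMSE comparisons in Tables 9 and 10 quantify.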
Yu, J., Ai, M. & Ye, Z. A review on design inspired subsampling for big data. Stat Papers 65, 467–510 (2024). https://doi.org/10.1007/s00362-022-01386-w