Abstract
Subsampling focuses on selecting a subsample that can efficiently sketch the information in the original data in terms of statistical inference. It provides a powerful tool for big data analysis and has gained the attention of data scientists in recent years. In this review, some state-of-the-art subsampling methods inspired by statistical design are summarized. Three types of designs, namely optimal designs, orthogonal designs, and space-filling designs, have shown great potential in subsampling for different objectives. The relationships between experimental designs and the related subsampling approaches are discussed. Specifically, two major families of design-inspired subsampling techniques are presented. The first aims to select a subsample in accordance with some optimal design criterion. The second tries to find a subsample that meets certain design requirements, including balance, orthogonality, and uniformity. Simulated and real data examples are provided to compare these methods empirically.
References
Ai M, Wang F, Yu J, Zhang H (2021a) Optimal subsampling for large-scale quantile regression. J Complex 62:101512
Ai M, Yu J, Zhang H, Wang H (2021b) Optimal subsampling algorithms for big data regressions. Stat Sin 31:749–772
Altschuler J, Bach F, Rudi A, Niles-Weed J (2019) Massively scalable Sinkhorn distances via the Nyström method. Adv Neural Inf Process Syst 32:4427–4437
Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS. Oxford University Press, Oxford
Avron H, Maymounkov P, Toledo S (2010) Blendenpik: supercharging LAPACK’s least-squares solver. SIAM J Sci Comput 32:1217–1236
Beasley LB, Brualdi RA, Shader BL (1993) Combinatorial orthogonality. In: Brualdi RA, Friedland S, Klee V (eds) Combinatorial and graph-theoretical problems in linear algebra. Springer, New York, NY, pp 207–218
Berger YG, de La Riva-Torres O (2016) Empirical likelihood confidence intervals for complex sampling designs. J R Stat Soc Ser B 78:314–319
Bertsekas DP (1992) Auction algorithms for network flow problems: a tutorial introduction. Comput Optim Appl 1:7–66
Biedermann S, Dette H (2001) Minimax optimal designs for nonparametric regression—a further optimality property of the uniform distribution. In: Atkinson AC, Hackl P, Müller WG (eds) mODa 6—Advances in Model-Oriented Design and Analysis. Physica-Verlag HD, Heidelberg, pp 13–20
Blom G (1976) Some properties of incomplete U-statistics. Biometrika 63:573–580
Boivin J, Ng S (2006) Are more data always better for factor analysis? J Econom 132:169–194
Bottou L (1999) On-line learning and stochastic approximations. In: Saad D (ed) On-line learning in neural networks. Publications of the Newton Institute. Cambridge University Press, Cambridge, pp 9–42. https://doi.org/10.1017/CBO9780511569920.003
Breidt FJ, Opsomer JD (2000) Local polynomial regression estimators in survey sampling. Ann Stat 28:1026–1053
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd edn. Springer, New York
Chen WY, Mackey L, Gorham J, Briol FX, Oates C (2018) Stein points. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, vol 80, pp 844–853
Chen WY, Barp A, Briol FX, Gorham J, Girolami M, Mackey L, Oates CJ (2019) Stein point Markov chain Monte Carlo. https://arxiv.org/abs/1905.03673
Cheng Q, Wang H, Yang M (2020) Information-based optimal subdata selection for big data logistic regression. J Stat Plan Inference 209:112–122
Chernozhukov V, Galichon A, Hallin M, Henry M (2017) Monge-Kantorovich depth, quantiles, ranks and signs. Ann Stat 45:223–256
Cioppa TM, Lucas TW (2007) Efficient nearly orthogonal and space-filling Latin hypercubes. Technometrics 49:45–55
Cook CE, Lopez R, Stroe O, Cochrane G, Apweiler R (2019) The European bioinformatics institute in 2018: tools, infrastructure and training. Nucleic Acids Res 47:D15–D22
Cox DR (1957) Note on grouping. J Am Stat Assoc 52:543–547
Deldossi L, Tommasi C (2021) Optimal design subsampling from big datasets. J Qual Technol 54:93
Derezinski M, Warmuth MKK (2017) Unbiased estimates for linear regression via volume sampling. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., Red Hook, pp 3084–3093
Dereziński M, Warmuth MK (2018) Reverse iterative volume sampling for linear regression. J Mach Learn Res 19:1–39
Derezinski M, Warmuth MKK, Hsu DJ (2018) Leveraged volume sampling for linear regression. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates, Inc., Red Hook, pp 2505–2514
Dereziński M, Clarkson KL, Mahoney MW, Warmuth MK (2019) Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression. In: Beygelzimer A, Hsu D (eds) Proceedings of the Thirty-Second Conference on Learning Theory, Phoenix, USA, Proceedings of Machine Learning Research, vol 99, pp 1050–1069
Devroye L (1986) Sample-based non-uniform random variate generation. In: Proceedings of the 18th conference on Winter simulation, ACM, pp 260–265
Dey A, Mukerjee R (1999) Fractional factorial plans. Wiley, New York
Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure. Accessed 25 July 2022
Laney D (2001) 3D data management: controlling data volume, velocity and variety. META Group Res Note 6:1
Drineas P, Kannan R, Mahoney MW (2006) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36:132–157
Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press, Boca Raton
Eshragh A, Roosta F, Nazari A, Mahoney MW (2019) LSAR: efficient leverage score sampling algorithm for the analysis of big time series data. https://arxiv.org/abs/1911.12321
Fan J, Han F, Liu H (2014) Challenges of big data analysis. Natl Sci Rev 1:293–314
Fan Y, Liu Y, Zhu L (2021) Optimal subsampling for linear quantile regression models. Can J Stat. https://doi.org/10.1002/cjs.11590
Fang KT, Wang Y (1994) Number-theoretic methods in statistics, vol 51. CRC Press, Boca Raton
Fang KT, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Monographs on statistics and applied probability. Springer, Berlin
Fithian W, Hastie T (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693–1724
Flury BA (1990) Principal points. Biometrika 77:33–41
Gittens A, Mahoney MW (2016) Revisiting the Nyström method for improved large-scale machine learning. J Mach Learn Res 17:3977–4041
Gu C (2013) Smoothing spline ANOVA models. Springer, Berlin
Gu C, Kim YJ (2002) Penalized likelihood regression: general formulation and efficient approximation. Can J Stat 30:619–628
Hájek J (1964) Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat 35:1491–1523
Han L, Tan KM, Yang T, Zhang T et al (2020) Local uncertainty sampling for large-scale multiclass logistic regression. Ann Stat 48:1770–1788
Hansen MH, Madow WG, Tepping BJ (1983) An evaluation of model-dependent and probability-sampling inferences in sample surveys. J Am Stat Assoc 78:776–793
Harman R, Rosa S (2020) On greedy heuristics for computing d-efficient saturated subsets. Oper Res Lett 48:122–129
He Z, Owen AB (2016) Extensible grids: uniform sampling on a space filling curve. J R Stat Soc B 78:917–931
Hedayat AS, Sloane NJA, Stufken J (1999) Orthogonal arrays: theory and applications. Springer, Berlin
Hickernell FJ, Liu M (2002) Uniform designs limit aliasing. Biometrika 89:893–904
Horn RA, Johnson CR (2013) Matrix analysis, 2nd edn. Cambridge University Press, Cambridge
Hu G, Wang H (2021) Most likely optimal subsampled Markov chain Monte Carlo. J Syst Sci Complex 34:1121–1134
Iraji MS, Ameri H (2016) RMSD protein tertiary structure prediction with soft computing. IJ Math Sci Comput 2:24–33
Johnson M, Moore L, Ylvisaker D (1990) Minimax and maximin distance designs. J Stat Plan Inference 26:131–148
Joseph VR, Dasgupta T, Tuo R, Wu CFJ (2015) Sequential exploration of complex surfaces using minimum energy designs. Technometrics 57:64–74
Joseph VR, Wang D, Gu L, Lyu S, Tuo R (2019) Deterministic sampling of expensive posteriors using minimum energy designs. Technometrics 61:297–308
Katharopoulos A, Fleuret F (2018) Not all samples are created equal: deep learning with importance sampling. In: Dy J, Krause A (eds) Proceedings of the 35th International Conference on Machine Learning, PMLR, Proceedings of Machine Learning Research, vol 80, pp 2525–2534
Kiefer J (1974) General equivalence theory for optimum designs (approximate theory). Ann Stat 2:849–879
Kong X, Zheng W (2020) Design-based incomplete U-statistics. https://arxiv.org/abs/2008.04348
Kuang K, Xiong R, Cui P, Athey S, Li B (2018) Stable prediction across unknown environments. https://arxiv.org/abs/1806.06270
Kuang K, Zhang H, Wu F, Zhuang Y, Zhang A (2020) Balance-subsampled stable prediction. https://arxiv.org/abs/2006.04381
Lee S, Ng S (2020) An econometric perspective on algorithmic subsampling. Annu Rev Econ 12:45–80
Lee J, Schifano ED, Wang H (2021) Fast optimal subsampling probability approximation for generalized linear models. Econom Stat. https://doi.org/10.1016/j.ecosta.2021.02.007
Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer, Berlin
Lemieux C (2009) Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Berlin
Li N, Qardaji W, Su D (2012) On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM Symposium on Information, Computer and Communications Security, ACM, pp 32–33
Li K, Kong X, Ai M (2016) A general theory for orthogonal array based Latin hypercube sampling. Stat Sin 26:761–777
Lid Hjort N, Pollard D (2011) Asymptotics for minimisers of convex processes. https://arxiv.org/abs/1107.3806
Liu Q, Lee JD, Jordan MI (2016) A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. https://arxiv.org/abs/1602.03253
Loshchilov I, Hutter F (2015) Online batch selection for faster training of neural networks. https://arxiv.org/abs/1511.06343
Ma CX, Fang KT (2001) A note on generalized aberration in factorial designs. Metrika 53:85–93
Ma P, Huang J, Zhang N (2015a) Efficient computation of smoothing splines via adaptive basis sampling. Biometrika 102:631–645
Ma P, Mahoney MW, Yu B (2015b) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–919
Ma P, Zhang X, Xing X, Ma J, Mahoney M (2020) Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp 1026–1035
Mak S, Joseph VR (2018) Support points. Ann Stat 46:2562–2592
Meng X, Saunders MA, Mahoney MW (2014) LSRN: a parallel iterative solver for strongly over-or underdetermined systems. SIAM J Sci Comput 36:C95–C118
Meng C, Zhang X, Zhang J, Zhong W, Ma P (2020) More efficient approximation of smoothing splines via space-filling basis selection. Biometrika 107:723–735
Meng C, Xie R, Mandal A, Zhang X, Zhong W, Ma P (2021) Lowcon: a design-based subsampling approach in a misspecified linear model. J Comput Gr Stat 30:694–708
Meng C, Yu J, Chen Y, Zhong W, Ma P (2022) Smoothing splines approximation using Hilbert curve basis selection. J Comput Gr Stat 31:802–812
Mishra A, Rana PS, Mittal A, Jayaram B (2014) D2n: distance to the native. Biochim Biophys Acta 1844:1798–1807. https://doi.org/10.1016/j.bbapap.2014.07.010
Musser D (1997) Introspective sorting and selection algorithms. Softw Pract Exp 27:983–993
Newey WK, McFadden D (1994) Large sample estimation and hypothesis testing. In: Heckman JJ, Leamer E (eds) Handbook of econometrics, vol 4. Elsevier, Amsterdam, pp 2111–2245
Ng S (2017) Opportunities and challenges: lessons from analyzing terabytes of scanner data. Tech. rep, National Bureau of Economic Research
Pfeffermann D (1993) The role of sampling weights when modeling survey data. Int Stat Rev/Rev Int Stat 61:317–337
Pfeffermann D, Skinner CJ, Holmes DJ, Goldstein H, Rasbash J (1998) Weighting for unequal selection probabilities in multilevel models. J R Stat Soc B 60:23–40
Pronzato L (2006) On the sequential construction of optimum bounded designs. J Stat Plan Inference 136:2783–2804
Pukelsheim F (2006) Optimal design of experiments. Society for Industrial and Applied Mathematics
Qi ZF, Zhou Y, Fang KT (2019) Representative points for location-biased data sets. Commun Stat Simul Comput 48:458–471
Quiroz M, Kohn R, Villani M, Tran MN (2019) Speeding up MCMC by efficient data subsampling. J Am Stat Assoc 114:831–843
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Ren M, Zhao SL (2021) Subdata selection based on orthogonal array for big data. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2021.2012196
Ren H, Zou C, Chen N, Li R (2022) Large-scale datastreams surveillance via pattern-oriented-sampling. J Am Stat Assoc 117:794–808
Santner TJ, Williams BJ, Notz WI (2003) Space-filling designs for computer experiments, vol 5. Springer, Cham, pp 121–161
Shao L, Song S, Zhou Y (2022) Optimal subsampling for large-sample quantile regression with massive data. Can J Stat. https://doi.org/10.1002/cjs.11697
Shi C, Tang B (2021) Model-robust subdata selection for big data. J Stat Theory Pract 15:82
Shu X, Yao D, Bertino E (2015) Privacy-preserving detection of sensitive data exposure. IEEE Trans Inf Forensics Secur 10:1092–1103
Su Y (2000) Asymptotically optimal representative points of bivariate random vectors. Stat Sin 10:559–575
Székely GJ, Rizzo ML (2013) Energy statistics: a class of statistics based on distances. J Stat Plan Inference 143:1249–1272
Thompson S (2012) Simple random sampling, vol 2. Wiley, New York, pp 9–37
Ting D, Brochu E (2018) Optimal subsampling with influence functions. In: Advances in Neural Information Processing Systems, pp 3650–3659
Vakayil A, Joseph VR (2022) Data twinning. In: Clarke B (ed) Statistical analysis and data mining: the ASA data science journal. Wiley, New York. https://doi.org/10.1002/sam.11574
Villani C (2008) Optimal transport: old and new. Springer, Berlin
Wang H (2019) More efficient estimation for logistic regression with optimal subsamples. J Mach Learn Res 20:1–59
Wang H, Kim JK (2020) Maximum sampled conditional likelihood for informative subsampling. https://arxiv.org/abs/2011.05988
Wang W, Jing BY (2021) Convergence of Gaussian process regression: optimality, robustness, and relationship with kernel ridge regression. https://arxiv.org/abs/2104.09778
Wang H, Ma Y (2021) Optimal subsampling for quantile regression in big data. Biometrika 108:99–112
Wang H, Zou J (2021) A comparative study on sampling with replacement vs poisson sampling in optimal subsampling. In: Banerjee A, Fukumizu K (eds) Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR, Proceedings of Machine Learning Research, vol 130, pp 289–297
Wang W, Tuo R, Wu CFJ (2017a) On prediction properties of Kriging: uniform error bounds and robustness. https://arxiv.org/abs/1710.06959
Wang Y, Yu AW, Singh A (2017b) On computationally tractable selection of experiments in measurement-constrained regression models. J Mach Learn Res 18:5238–5278
Wang H, Zhu R, Ma P (2018a) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113:829–844
Wang Y, Yang J, Xu H (2018b) On the connection between maximin distance designs and orthogonal designs. Biometrika 105:471–477
Wang Z, Zhu H, Dong Z, He X, Huang SL (2019a) Less is better: unweighted data subsampling via influence function. https://arxiv.org/abs/1912.01321
Wang H, Yang M, Stufken J (2019b) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114:393–405
Wang L, Elmstedt J, Wong WK, Xu H (2021) Orthogonal subsampling for big data linear regression. https://arxiv.org/abs/2105.14647
Wang Y, Sun F, Xu H (2022) On design orthogonality, maximin distance, and projection uniformity for computer experiments. J Am Stat Assoc 117:375–385
Williams C, Seeger M (2001) Using the Nyström method to speed up kernel machines. Adv Neural Inf Process Syst 13:682–688
Wu CFJ, Hamada MS (2009) Experiments: planning, analysis and parameter design optimization, 2nd edn. Wiley, New York
Wu S, Zhu X, Wang H (2022) Subsampling and jackknifing: a practically convenient solution for large data analysis with limited computational resources. Stat Sin. https://doi.org/10.5705/ss.202021.0257
Xie MY, Fang KT (2000) Admissibility and minimaxity of the uniform design measure in nonparametric regression model. J Stat Plan Inference 83:101–111
Xie R, Wang Z, Bai S, Ma P, Zhong W (2019) Online decentralized leverage score sampling for streaming multidimensional time series. In: Chaudhuri K, Sugiyama M (eds) Proceedings of Machine Learning Research, PMLR, Proceedings of Machine Learning Research, vol 89, pp 2301–2311
Xiong S, Li G (2008) Some results on the convergence of conditional distributions. Stat Probab Lett 78:3249–3253
Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60:235–249
Yao Y, Wang H (2021a) A review on optimal subsampling methods for massive datasets. J Data Sci 19:151–172
Yao Y, Wang H (2021b) A selective review on statistical techniques for big data. In: Zhao Y, Chen DG (eds) Modern statistical methods for health research. Springer, New York. https://doi.org/10.1007/978-3-030-72437-5_11
Ye KQ (1998) Orthogonal column Latin hypercubes and their application in computer experiments. J Am Stat Assoc 93:1430–1439
Yu J, Wang H (2022) Subdata selection algorithm for linear model discrimination. Stat Pap. https://doi.org/10.1007/s00362-022-01299-8
Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. In: Proceedings of the 23rd International Conference on Machine Learning, Association for Computing Machinery, pp 1081–1088
Yu J, Wang H, Ai M, Zhang H (2022) Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J Am Stat Assoc 117:265–276
Zhang H, Wang H (2021) Distributed subdata selection for big data via sampling-based approach. Comput Stat Data Anal 153:107072
Zhang T, Ning Y, Ruppert D (2021) Optimal sampling for generalized linear models under measurement constraints. J Comput Gr Stat 30:106–114
Zhang J, Meng C, Yu J, Zhang M, Zhong W, Ma P (2022) An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation. J Comput Gr Stat. https://doi.org/10.1080/10618600.2022.2084404
Zhao Y, Amemiya Y, Hung Y (2018) Efficient Gaussian process modeling using experimental design-based subagging. Stat Sin 28:1459–1479
Zhou Y, Fang K (2019) FM-criterion for representative points. Sci Sin Math 49:1009–1020
Zhu X, Pan R, Wu S, Wang H (2021) Feature screening for massive data analysis by subsampling. J Bus Econ Stat. https://doi.org/10.1080/07350015.2021.1990771
Zuo L, Zhang H, Wang H, Liu L (2021) Sampling-based estimation for massive survival data with additive hazards model. Stat Med 40:441–450
Acknowledgements
The authors sincerely thank the editor, associate editor, and referees for their valuable comments and insightful suggestions, which led to further improvement of this work. Ai’s work was supported by NSFC Grants 12071014 and 12131001 and by LMEQF. Yu’s work was supported by NSFC Grant 12001042 and the Beijing Institute of Technology research fund program for young scholars.
Appendix
1.1 A summary of subsampling methods
See Table 8.
1.2 Technical details
Proof of Theorem 1
The estimator \(\tilde{{\varvec{\beta }}}\) is the maximizer of \(L^*({\varvec{\beta }})\) in (1), so conditional on \({\mathcal {F}}_n\), \(\sqrt{r}(\tilde{{\varvec{\beta }}}-\hat{{\varvec{\beta }}})\) is the maximizer of \(\Lambda ({\varvec{\eta }})=n^{-1}rL^*(\hat{{\varvec{\beta }}}+{{\varvec{\eta }}}/\sqrt{r})-n^{-1}rL^*(\hat{{\varvec{\beta }}})\). For notational simplicity, let \({\dot{f}}({\varvec{\beta }})\) and \(\ddot{f}({\varvec{\beta }})\) denote the first- and second-order derivatives of a given function \(f({\varvec{\beta }})\) with respect to \({\varvec{\beta }}\). Performing a Taylor expansion on \(\Lambda ({\varvec{\eta }})\),
where R is the remainder term. Under Assumption 3, one can see that R goes to zero in probability conditional on \({\mathcal {F}}_n\), i.e., \(R=o_{P|{\mathcal {F}}_n}(1)\). Precisely,
thus the result holds by Theorem 3.3 in Xiong and Li (2008).
Recall that \({\dot{\ell }}(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=n^{-1}(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\), and the corresponding subsampling version is \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})=(n\pi _i)^{-1}\delta _i(\partial \log f(y_i|{\varvec{x}}_i,{\varvec{\beta }})/\partial {\varvec{\beta }})\). One can see that the \({\dot{\ell }}^*(y_i|{\varvec{x}}_i, {{\varvec{\beta }}})\) are independent random vectors conditional on \({\mathcal {F}}_n\). Direct calculation yields
where the second-to-last equality follows from the facts
under Assumption 2 and \(r/n\rightarrow 0\).
Under Assumption 5, the Lindeberg–Feller condition holds under the conditional distribution. Thus, by the central limit theorem,
Under Assumption 2, by the law of large numbers, we have
Thus the desired result follows by the Basic Corollary in Lid Hjort and Pollard (2011) and Theorem 3.3 in Xiong and Li (2008). \(\square \)
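The inverse-probability weighting driving this argument can be checked numerically. Below is a minimal Python sketch on toy logistic-regression data (illustrative only, not from the paper) verifying that the weighted subsample score, with weights \((n\pi_i)^{-1}\), is conditionally unbiased for the full-data score; the probabilities proportional to row norms are an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 5_000, 3, 500

# Toy logistic-regression data (illustrative only)
X = rng.standard_normal((n, p))
beta = np.array([0.5, -0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

def full_score(b):
    # n^{-1} * sum of per-observation log-likelihood gradients
    resid = y - 1.0 / (1.0 + np.exp(-X @ b))
    return X.T @ resid / n

pi = np.linalg.norm(X, axis=1)
pi /= pi.sum()  # arbitrary positive sampling probabilities

def subsample_score(b):
    # weighted score with weights (n * pi_i)^{-1}, averaged over r draws
    idx = rng.choice(n, size=r, p=pi)
    resid = y[idx] - 1.0 / (1.0 + np.exp(-X[idx] @ b))
    return (X[idx] * (resid / (n * pi[idx]))[:, None]).mean(axis=0)

# Averaging many independent subsample scores recovers the full-data score
avg = np.mean([subsample_score(beta) for _ in range(300)], axis=0)
print(np.max(np.abs(avg - full_score(beta))))
```

The printed maximal deviation is small and shrinks as the number of replicates grows, which is the conditional unbiasedness used in the direct calculation above.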
Proof of Theorem 2
Note that \({\varvec{z}}=\Sigma ^{-1/2}{\varvec{x}}\) is standard normal and
where \(\phi _I({\varvec{z}})\) is the density function of the p-dimensional standard normal distribution.
By the inequality of arithmetic and geometric means for the eigenvalues, one can see that
where the equality holds because, by spherical symmetry, \(\int _{{\varvec{z}}^{ \textrm{T} }{\varvec{z}}>c}{\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\) is a nonzero multiple of the identity matrix.
Note that for any selection rule \(\textrm{tr}\left( \int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\right) \!=\!\int _{g(\Sigma ^{1/2}{\varvec{z}})} \Vert {\varvec{z}}\Vert ^2\phi _I({\varvec{z}})d{\varvec{z}}\). Clearly, such a selection also achieves the upper bound of \(\textrm{tr}\big (\int {\mathbb {I}}(g(\Sigma ^{1/2}{\varvec{z}})){\varvec{z}}{\varvec{z}}^{ \textrm{T} }\phi _I({\varvec{z}})d{\varvec{z}}\big )\) over all \(g(\cdot )\) satisfying the constraint. Thus the result follows from the fact that \(\det (AB)=\det (A)\det (B)\) for any \(A,B>0\). \(\square \)
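The trace identity used in the last step is easy to check numerically: since \(\textrm{tr}(S^{\textrm{T}}S)\) equals the sum of squared norms of the selected points, the size-\(k\) subset maximizing the trace of the information matrix consists exactly of the \(k\) largest-norm points. The brute-force verification below uses hypothetical toy data, not any dataset from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, k = 12, 2, 5
Z = rng.standard_normal((n, p))  # toy standardized covariates

def tr_info(rows):
    # trace of the (unnormalized) information matrix of the subset
    S = Z[list(rows)]
    return np.trace(S.T @ S)

# tr(S^T S) is the sum of squared norms of the selected points,
# so keeping the k largest-norm points maximizes it
largest_norm = np.argsort(-np.sum(Z ** 2, axis=1))[:k]
brute_force = max(tr_info(c) for c in combinations(range(n), k))
print(tr_info(largest_norm), brute_force)
```

The exhaustive search over all 792 subsets agrees with the norm-based selection; this is the scalar analogue of the bound attained in the proof.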
1.3 Brief introduction of the optimal transport map
To ease the presentation, we only present a specific optimal transport map from a general distribution F on \(\Omega \subseteq {\mathbb {R}}^{p+1}\) to the uniform distribution \(F_\textrm{unif}\) on \([0,1]^{p+1}\); for the general definition of the optimal transport map, we refer to Villani (2008). Let \(\#\) denote the push-forward operator, so that for all measurable \(B\subseteq \Omega \) we have \(\phi _\#(F)(B)=F(\phi ^{-1}(B))\). The optimal transport map \(\phi ^*\) of interest is the one that minimizes the \(l_2\) cost \(\int _\Omega \Vert {\varvec{z}}- \phi ({\varvec{z}}) \Vert ^2 \text{ d }F\) over all measure-preserving maps \(\phi :\Omega \rightarrow [0,1]^{p+1}\) such that \(\phi _\#(F) = F_{\textrm{unif}}\) and \(\phi ^{-1}_\#(F_\textrm{unif}) = F\). It has been shown that when \(\Omega ={\mathbb {R}}\), \(\phi ^*\) coincides with the cumulative distribution function F (Villani 2008).
Given observations \(\{{\varvec{z}}_i\}_{i=1}^n\) from F and \(\{{\varvec{s}}_i\}_{i=1}^n\) from \(F_{\textrm{unif}}\), the empirical optimal transport map can be computed by the auction algorithm or its refined variants (Bertsekas 1992). To further alleviate the computational burden, projection-based methods can be used in practice to approximate \(\phi ^*\). Readers may refer to Zhang et al. (2022) for detailed discussions of the computational issues.
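The one-dimensional fact above can be illustrated numerically. The following sketch computes the empirical optimal transport plan for the \(l_2\) cost by exact linear assignment (SciPy's Hungarian-type solver stands in for the auction algorithm here; the toy samples are illustrative) and checks that the optimal coupling is monotone, i.e., it pairs order statistics, mirroring the fact that \(\phi ^*\) is the CDF of F.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 60
z = rng.standard_normal(n)   # observations from a general F on R
s = rng.uniform(size=n)      # observations from Uniform[0, 1]

# Empirical OT plan under the l2 cost via exact linear assignment
cost = (z[:, None] - s[None, :]) ** 2
row, col = linear_sum_assignment(cost)

# In one dimension the optimal coupling is monotone: the i-th smallest
# z is matched to the i-th smallest s
rank_z = np.argsort(np.argsort(z))
rank_s = np.argsort(np.argsort(s))
print(np.all(rank_z[row] == rank_s[col]))  # True
```

The monotone pairing is the unique optimizer here because the squared cost is strictly convex: any crossing pair can be strictly improved by swapping, so the assignment solver must recover the rank matching.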
1.4 Additional simulation results for \(n=1,000,000\)
In this subsection, we provide the log EMSE for the subsampling methods with large n. More precisely, we consider \(n=1{,}000{,}000\) under the three cases given in Sect. 6. Due to limited computational resources, we only report results for methods whose computing time is no more than \(O(np^2)\). The results for the linear regression model and the logistic regression model are reported in Tables 9 and 10, respectively.
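As a concrete instance of a method within the \(O(np^2)\) budget, here is a minimal Python sketch of basic leverage-score subsampling for linear regression in the spirit of algorithmic leveraging (Ma et al. 2015b); the data are a toy example, not the simulation settings of Sect. 6, and exact leverage scores via a thin QR factorization cost \(O(np^2)\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 50_000, 10, 1_000

# Toy linear-regression data (illustrative only)
X = rng.standard_normal((n, p))
y = X @ rng.uniform(1.0, 2.0, size=p) + rng.standard_normal(n)

# Exact leverage scores h_ii via a thin QR factorization: O(n p^2)
Q, _ = np.linalg.qr(X)
pi = np.sum(Q ** 2, axis=1) / p   # leverage scores sum to p
pi /= pi.sum()                     # guard against rounding

# Basic leveraging: sample with replacement, solve the reweighted problem
idx = rng.choice(n, size=r, p=pi)
w = 1.0 / np.sqrt(r * pi[idx])
beta_sub = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)[0]
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(beta_sub - beta_full)))
```

The subsample estimator tracks the full-data least-squares fit at a fraction of the cost, which is the kind of trade-off the EMSE comparisons in Tables 9 and 10 quantify.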
Yu, J., Ai, M. & Ye, Z. A review on design inspired subsampling for big data. Stat Papers 65, 467–510 (2024). https://doi.org/10.1007/s00362-022-01386-w