Abstract
Subdata selection methods provide flexible tradeoffs between computational complexity and statistical efficiency in analyzing big data. In this work, we investigate a new algorithm for selecting informative subdata from massive data for a broad class of models, including generalized linear models as special cases. A connection between the proposed method and many widely used optimal design criteria such as A-, D-, and E-optimality criteria is established to provide a comprehensive understanding of the selected subdata. Theoretical justifications are provided for the proposed method, and numerical simulations are conducted to illustrate the superior performance of the selected subdata.
Acknowledgements
The authors sincerely thank the editors and referees for their valuable comments and insightful suggestions, which led to further improvement of this article. This research was partially supported by Beijing Municipal Natural Science Foundation Grant No. 1232019. Yu's work was supported by NSFC Grant 12001042 and the Beijing Institute of Technology Research Fund Program for Young Scholars. Liu and Wang's research was supported by NSF Grant CCF-2105571 and UConn CLAS Research in Academic Themes funding.
Appendix A Proofs
A.1 Proof of Theorem 1
Proof
Note that
Thus, the T-optimal subdata consists of the k points with the largest values of \(\Vert \varvec{x}_i\Vert ^2\). \(\square \)
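The selection rule in Theorem 1 reduces to a partial sort of the squared row norms. A minimal sketch in Python (the function name `t_optimal_subdata` and the simulated data are ours, not from the paper):

```python
import numpy as np

def t_optimal_subdata(X, k):
    """Select the k rows of X with the largest squared Euclidean norms.

    Under the T-optimality criterion of Theorem 1, these rows maximize
    the trace of the subdata information matrix.
    """
    norms = np.sum(X**2, axis=1)           # ||x_i||^2 for each row
    idx = np.argpartition(norms, -k)[-k:]  # indices of the k largest norms
    return X[idx], idx

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
sub, idx = t_optimal_subdata(X, 20)

# Every selected norm is at least as large as every unselected norm.
mask = np.ones(len(X), dtype=bool)
mask[idx] = False
assert np.min(np.sum(sub**2, axis=1)) >= np.max(np.sum(X[mask]**2, axis=1))
```

Note that `np.argpartition` avoids a full sort, matching the partial-sorting spirit of the selection algorithms cited in the paper.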
A.2 Proof of Theorem 2
Before proving Theorem 2, we need the following lemma.
Lemma A.1
Let \(\lambda _{1},\ldots ,\lambda _{d}\) be the d eigenvalues of \({\mathcal {I}}(\varvec{\delta })\) with \(\lambda _{\min }=\lambda _{1}\le \ldots \le \lambda _{d}=\lambda _{\max }\). Assume that \(\textrm{var}(\varvec{Z}_{1}^*)\le \ldots \le \textrm{var}(\varvec{Z}_{d}^*)\) with \(\textrm{var}(\varvec{Z}_j^*)\) being the sample variance for the jth column of \(\varvec{Z}^*\). For any subdata of size n, represented by \(\varvec{\delta }\), it holds that
where \(\varvec{R}^*\) is the sample correlation matrix of \(\varvec{Z}^*\).
Proof
Recall \(\varvec{Z}^*=(\varvec{z}_{1}^*,\ldots , \varvec{z}_{n}^*)^\textrm{T}\). Let \({\textbf{1}}\) be an \(n\times 1\) vector of ones, and \(\textrm{var}(\varvec{Z}_j^*)\) be the sample variance for the jth column of \(\varvec{Z}^*\), \(j=1,\ldots , d\). It follows that
Note that for any positive semidefinite matrices \(A\) and \(B\), it holds that \(\lambda _{\min }(B)\lambda _j(A^2)\le \lambda _j(ABA)\le \lambda _{\max }(B)\lambda _j(A^2)\). The desired result follows immediately by letting \(B=\varvec{R}^*\) and \(A=\textrm{diag}(\sqrt{\textrm{var}(\varvec{Z}_{1}^*)},\ldots ,\sqrt{\textrm{var}(\varvec{Z}_{d}^*)})\). \(\square \)
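The eigenvalue sandwich used in this proof can be checked numerically. A small sketch with random positive semidefinite inputs (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# B: a random positive semidefinite matrix (playing the role of R*)
M = rng.normal(size=(d, d))
B = M @ M.T

# A: a positive semidefinite diagonal matrix, as in the proof,
# where the diagonal holds square roots of sample variances
A = np.diag(rng.uniform(0.5, 2.0, size=d))

# eigvalsh returns eigenvalues in ascending order, matching index j
lam_ABA = np.linalg.eigvalsh(A @ B @ A)
lam_A2 = np.linalg.eigvalsh(A @ A)
lam_B = np.linalg.eigvalsh(B)

# lambda_min(B) * lambda_j(A^2) <= lambda_j(ABA) <= lambda_max(B) * lambda_j(A^2)
assert np.all(lam_B.min() * lam_A2 <= lam_ABA + 1e-10)
assert np.all(lam_ABA <= lam_B.max() * lam_A2 + 1e-10)
```

The inequality is an Ostrowski-type bound: \(ABA\) has the same eigenvalues as \(B^{1/2}A^2B^{1/2}\), so each \(\lambda _j(A^2)\) is scaled by a factor lying between \(\lambda _{\min }(B)\) and \(\lambda _{\max }(B)\).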
Proof of Theorem 2
Recall that \(\lambda _{1},\ldots ,\lambda _{d}\) are the d eigenvalues of \({\mathcal {I}}(\varvec{\delta })\) with \(0<\lambda _{\min }=\lambda _{1}\le \ldots \le \lambda _{d}=\lambda _{\max }\). We have
Thus (13)–(15) follow directly from Lemma A.1. \(\square \)
A.3 Proof of Theorem 3
Proof
For each sample variance,
For the first summation in (A.4),
Each term in the summation of (A.5) can be written as
where \({\tilde{z}}_j^{(s)l}\) is the jth dimension of the subdata point selected according to \(\{\tilde{{z}}_{il},i=1,\ldots ,N\}\) in the second step of Algorithm 1. From Assumption 1, for \(s,i\le r\), \({({\tilde{z}}_{(s)j}-{\tilde{z}}_{(i)j})}/{({\tilde{z}}_{(N)j}-{\tilde{z}}_{(1)j})}=o_{P}(1)\) and \({({\tilde{z}}_j^{(s)l}-{\tilde{z}}_{(i)j})}/{({\tilde{z}}_{(N)j}-{\tilde{z}}_{(1)j})}\) is either positive or \(o_{P}(1)\). Thus (A.6) implies
From Assumptions 1 and 2, for \(s\ge N-r+1\) and \(i\le r\), as \(N\rightarrow \infty \),
Similarly,
Combining (A.4), (A.9) and (A.10),
If \({\tilde{z}}_{(1)j}/{\tilde{z}}_{(N)j}\overset{P}{\rightarrow }0\) or \(\pm \infty \), then
Combining (A.11) and (A.12) shows that
Thus, the desired results follow from Theorem 2 and Slutsky's theorem.
If \({\tilde{z}}_{(N)j}\rightarrow \infty \) and \({\tilde{z}}_{(1)j}\) is bounded below, or \({\tilde{z}}_{(1)j}\rightarrow -\infty \) and \({\tilde{z}}_{(N)j}\) is bounded above, then
From (A.11) and (A.14), it follows that
Thus, the desired results follow from Theorem 2 and Slutsky's theorem.
\(\square \)
A.4 Some details of Example 3 and Proposition 4
Lemma A.2
Suppose \(\varvec{x}\sim N(\varvec{\mu },\Sigma )\) with \(\Sigma >0\) and \(\varvec{\theta }\) has more than one nonzero element. Then \((x_j,\varvec{\theta }^\textrm{T}\varvec{x})^\textrm{T}\) follows a nondegenerate normal distribution for all j.
Lemma A.3
Suppose \(\varvec{x}\sim N(\varvec{\mu },\Sigma )\) with \(\Sigma >0\), and \(\varvec{\theta }\) lies in a compact ball and has more than one nonzero element. Then for any given \(M\) and \(C>0\), it holds that
for all j.
Proof of Proposition 4
Note that the jth dimension of \(\tilde{\varvec{z}}\) can be written as
For any j, one can see that
where \(M\) and \(C\) are constants independent of \(\varvec{x}\).
Similarly, one can show that
From Lemma A.3, the desired result follows. \(\square \)
Details of Example 3
Based on Lemmas A.2 and A.3, we can see that for any \(M>0\) and \(j=1,\ldots ,d\),
From Lemma A.3, it follows that
Thus the result follows. \(\square \)
Proof of Lemma A.2
Without loss of generality, we only consider the case \(j=1\) here. Note that \(\varvec{x}\) follows a multivariate normal distribution. Let \({\varvec{e}_j}\) be a unit vector with jth element being one and \(C=(\varvec{e}_j,\varvec{\theta })\). Note that \(\varvec{\theta }\) has more than one nonzero element by assumption, so the rank of \(C\) is two. Since \(\Sigma >0\), the rank of \(C^\textrm{T}\Sigma C\) is also two. Thus \(\check{\varvec{x}}=C^\textrm{T}\varvec{x}=(x_j,\varvec{\theta }^\textrm{T}\varvec{x})^\textrm{T}\) follows a nondegenerate normal distribution with mean \(C^\textrm{T}\varvec{\mu }=(\mu _{j},\varvec{\mu }^\textrm{T}\varvec{\theta })^\textrm{T}\) and variance
where \(\mu _{j}\) is the jth element of \(\varvec{\mu }\), \(\sigma _{jj}\) is the (j, j)th element of \(\Sigma \), \({{\check{\sigma }}}_{12}\) equals \(\Sigma _{\cdot j}^\textrm{T}\varvec{\theta }\) with \(\Sigma _{\cdot j}\) being the jth column of \(\Sigma \), and \({{\check{\sigma }}}_{22}=\varvec{\theta }^\textrm{T}\Sigma \varvec{\theta }\).
\(\square \)
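The rank argument in Lemma A.2 can be illustrated numerically: the covariance \(C^\textrm{T}\Sigma C\) of \((x_j,\varvec{\theta }^\textrm{T}\varvec{x})^\textrm{T}\) has rank two whenever \(\Sigma \) is positive definite and \(\varvec{\theta }\) has more than one nonzero element. A sketch with arbitrary illustrative values (all numbers are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, j = 5, 1                                   # dimension and coordinate index (0-based)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)               # positive definite covariance
theta = np.array([1.0, -2.0, 0.0, 0.5, 0.0])  # more than one nonzero element

e_j = np.zeros(d)
e_j[j] = 1.0
C = np.column_stack([e_j, theta])             # C = (e_j, theta), a d x 2 matrix

cov_check = C.T @ Sigma @ C                   # covariance of (x_j, theta^T x)

# Rank two <=> the bivariate normal is nondegenerate.
assert np.linalg.matrix_rank(cov_check) == 2
# Entries match the closed forms given after the lemma's variance display.
assert np.isclose(cov_check[0, 0], Sigma[j, j])          # sigma_jj
assert np.isclose(cov_check[0, 1], Sigma[:, j] @ theta)  # Sigma_{.j}^T theta
assert np.isclose(cov_check[1, 1], theta @ Sigma @ theta)
```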
Proof of Lemma A.3
For the first result, simple calculation yields that
since \((x_j,\varvec{x}^\textrm{T}\varvec{\theta })^\textrm{T}\) follows a nondegenerate normal distribution by Lemma A.2. The proof of the second result is similar and is omitted.
\(\square \)
Cite this article
Yu, J., Liu, J. & Wang, H. Information-based optimal subdata selection for non-linear models. Stat Papers 64, 1069–1093 (2023). https://doi.org/10.1007/s00362-023-01430-3