Abstract
Information-based optimal subdata selection (IBOSS) is a computationally efficient method for selecting informative data points from large data sets by processing the full data column by column. However, when a data set is too large to fit in the available memory of a machine, the IBOSS procedure cannot be implemented directly. This paper develops a divide-and-conquer IBOSS approach to solve this problem: the full data set is divided into smaller partitions that can be loaded into memory, and a subset of data is then selected from each partition using the IBOSS algorithm. We derive both finite-sample and asymptotic properties of the resulting estimator. The asymptotic results show that if the full data set is partitioned randomly and the number of partitions is not too large, the resulting estimator has the same estimation efficiency as the original IBOSS estimator. We also carry out numerical experiments to evaluate the empirical performance of the proposed method.
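The divide-and-conquer procedure described above can be sketched as follows for a linear model. This is a minimal illustration, not the paper's exact pseudocode: the function names (`iboss`, `dc_iboss`) and the per-tail allocation \(r_B = k/(2pB)\) are assumptions made for the sketch.

```python
import numpy as np

def iboss(Z, r):
    """For each covariate, keep the r rows with the smallest and the
    r rows with the largest values (union over covariates)."""
    n, p = Z.shape
    idx = set()
    for j in range(p):
        order = np.argsort(Z[:, j])
        idx.update(order[:r])      # r smallest on covariate j
        idx.update(order[-r:])     # r largest on covariate j
    return np.fromiter(idx, dtype=int)

def dc_iboss(Z, y, k, B, rng=None):
    """Divide-and-conquer IBOSS: randomly partition the rows into B
    blocks, run IBOSS in each block, pool the selected rows, and fit
    ordinary least squares on the pooled subdata."""
    rng = np.random.default_rng(rng)
    n, p = Z.shape
    r_B = max(1, k // (2 * p * B))   # points per tail, per covariate, per block
    blocks = np.array_split(rng.permutation(n), B)
    sel = np.concatenate([b[iboss(Z[b], r_B)] for b in blocks])
    X = np.column_stack([np.ones(len(sel)), Z[sel]])
    beta, *_ = np.linalg.lstsq(X, y[sel], rcond=None)
    return beta, sel
```

Each block is processed independently, so only one partition needs to reside in memory at a time, which is the point of the divide-and-conquer construction.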
References
Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed estimation and inference with statistical guarantees. Ann Stat 46:1352–1382
Bezanson J, Edelman A, Karpinski S, Shah VB (2017) Julia: a fresh approach to numerical computing. SIAM Rev 59(1):65–98
Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684
Drineas P, Magdon-Ismail M, Mahoney M, Woodruff D (2012) Faster approximation of matrix coherence and statistical leverage. J Mach Learn Res 13:3475–3506
Galambos J (1987) The asymptotic theory of extreme order statistics. Robert E. Krieger, Melbourne
Hall P (1979) On the relative stability of large order statistics. Math Proc Camb Philos Soc 86:467–475
Jordan MI (2012) Divide-and-conquer and statistical inference for big data. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, p 4
Lin N, Xie R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
Martínez C (2004) On partial sorting. Technical report, 10th seminar on the analysis of algorithms
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Schifano ED, Wu J, Wang C, Yan J, Chen MH (2016) Online updating of statistical inference in the big data setting. Technometrics 58(3):393–403. https://doi.org/10.1080/00401706.2016.1142900
Shang Z, Cheng G (2017) Computational limits of a distributed algorithm for smoothing spline. J Mach Learn Res 18(1):3809–3845
Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint arXiv:1802.02698
Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844
Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114(525):393–405
Xue Y, Wang H, Yan J, Schifano ED (2018) An online updating approach for testing the proportional hazards assumption with streams of big survival data. arXiv preprint arXiv:1809.01291
Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60(2):235–249
Acknowledgements
This work was supported by NSF Grant 1812013, a UCONN REP Grant, and a GPU Grant from NVIDIA Corporation.
Appendix
1.1 Proof of Theorem 1
Proof
For \(l\ne j\), let \(z_j^{(i)l}\) be the concomitant of \(z_{(i)l}\) for \(z_j\); i.e., if \(z_{(i)l}=z_{sl}\), then \(z_j^{(i)l}=z_{sj}\), \(i=1, \dots , N\). Let \(\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\) and \({\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)\) be the sample mean and sample variance for covariate \(z_j\) of the subdata \({\mathcal {D}}_{\mathrm{BS}}\). From the proof of Theorem 3 in Wang et al. [16], we know that
Let \(k_B=k/B\) be the number of data points to take from each partition, and \(z_{bij}^*\) be the i-th observation on the j-th covariate in the subdata \({\mathcal {D}}_{S}^{(b)}\) from the b-th partition.
The sample variance for each j satisfies,
where \(\bar{z}_{bj}^{**}=\left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) z_{b(i)j}/(2r_B)\), \(\bar{z}_{bj}^{*l}=\sum _{i=1}^{r_B}z_{b(i)j}/{r_B}\), and \(\bar{z}_{bj}^{*u}=\sum _{i=n_B-r_B+1}^{n_B} z_{b(i)j}/(r_B)\). Thus,
which, combined with (12), shows that
This shows that
Each numerator in the summation of the bound in (14) relies on the covariate range of the corresponding data partition. If the full data are not divided randomly, this may not produce a sharp bound, and using the full-data covariate ranges may produce a better one. We use this idea to derive the bound \(C_E\) in the following. From Algorithm 2, for each \(j=1, \dots , p\), the \(r_B\) data points with the smallest values of \(z_j\) and the \(r_B\) data points with the largest values of \(z_j\) must be included in \({\mathcal {D}}_{\mathrm{BS}}\). Thus, for each sample variance,
In this case, \(\bar{z}_j^{**}=\left( \sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N}\right) z_{(i)j}/(2r_B)\), \(\bar{z}_j^{*l}=\sum _{i=1}^{r_B}z_{(i)j}/(r_B)\), and \(\bar{z}_j^{*u}=\sum _{i=N-r_B+1}^{N}z_{(i)j}/(r_B)\). Thus,
This, combined with (12), shows that
The proof finishes if we combine (14) and (16). \(\square\)
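The set-inclusion fact used above has an elementary justification: a point among the \(r_B\) globally smallest values of \(z_j\) has at most \(r_B-1\) points below it in the full data, hence at most \(r_B-1\) within its own partition, so it is always among its partition's \(r_B\) local minima (and symmetrically for maxima). A quick numerical check of this property (variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
N, B, r_B = 10_000, 5, 10
z = rng.standard_normal(N)

# Random partition of the N indices into B blocks
blocks = np.array_split(rng.permutation(N), B)

# Pool each block's r_B smallest and r_B largest points
pooled = set()
for b in blocks:
    order = b[np.argsort(z[b])]
    pooled.update(order[:r_B])
    pooled.update(order[-r_B:])

# The r_B globally smallest and largest points are all pooled
global_order = np.argsort(z)
extremes = set(global_order[:r_B]) | set(global_order[-r_B:])
assert extremes <= pooled
```

This is why the full-data covariate ranges can replace the per-partition ranges in the bound \(C_E\).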
1.2 Proof of Theorem 2
Proof
The proof for (7) is similar to the proof of inequality (19) in Wang et al. [16]. For (8), from the proof of Theorem 4 in Wang et al. [16], we know that
and
From (17), (18), and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\ge 1\), we have
From (13), (15) and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\le \lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\), we have
The proof finishes by combining (19), (20), and (21). \(\square\)
1.3 Proof of Theorem 3
Proof
For the first case that B and r are fixed, from (8) with \(V_{ej}\), we only need to verify that
which is true according to Theorems 2.8.1 and 2.8.2 in Galambos [5].
For the second case, \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) and \(z_{(N)j}-z_{(1)j}\) converge to the same fixed constant in probability, and \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) are bounded by this constant for all b. Thus,
converges to the same finite constant. Thus, (22) can be easily verified.
For the third case, let \(g_{N,j}=F_j^{-1}(1-1/N)\) and \(g_{n_B,j}=F_j^{-1}(1-1/n_B)\). When (9) holds, from the proof of Theorem 5 in Wang et al. [16], we have \(z_{(N)j}/g_{N,j}=1+o_{P}(1)\) and \(z_{(n_B-r_B+1)j}/g_{n_B,j}=1+o_{P}(1)\). Thus, noting that \(z_{b(r_B)j}\) and \(z_{(1)j}\) are bounded in probability when the lower endpoint of the support of \(F_j\) is finite, we have
Note that \(\left| \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right|\) are bounded by the same constant for all \(b\), so
Combining this and (8) with \(V_{aj}\), the result follows.
For the fourth case, the result follows from reversing the signs of the covariates in the proof for the third case.
The proof for the fifth case is obtained by combining the proof for the third case and the proof for the fourth case. \(\square\)
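The third case rests on the relative stability \(z_{(N)j}/g_{N,j}=1+o_P(1)\) with \(g_{N,j}=F_j^{-1}(1-1/N)\). For an Exp(1) covariate, \(g_{N,j}=\log N\), and a small simulation (illustrative, not from the paper) shows the ratio concentrating near 1 as \(N\) grows:

```python
import numpy as np

rng = np.random.default_rng(7)
for N in (10_000, 1_000_000):
    z = rng.exponential(size=N)
    g_N = np.log(N)            # F^{-1}(1 - 1/N) for Exp(1)
    print(N, z.max() / g_N)    # ratio approaches 1 in probability
```

The fluctuation of the sample maximum around \(g_N\) is of constant (Gumbel) order, so dividing by \(g_N \to \infty\) drives the ratio to 1.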
Wang, H. Divide-and-Conquer Information-Based Optimal Subdata Selection Algorithm. J Stat Theory Pract 13, 46 (2019). https://doi.org/10.1007/s42519-019-0048-5