
Divide-and-Conquer Information-Based Optimal Subdata Selection Algorithm


Abstract

Information-based optimal subdata selection (IBOSS) is a computationally efficient method for selecting informative data points from large data sets by processing the full data column by column. However, when a data set is too large to fit in the available memory of a machine, the IBOSS procedure cannot be implemented directly. This paper develops a divide-and-conquer IBOSS approach to this problem: the full data set is divided into smaller partitions that can be loaded into memory, and a subset of data is then selected from each partition using the IBOSS algorithm. We derive both finite-sample and asymptotic properties of the resulting estimator. The asymptotic results show that if the full data set is partitioned randomly and the number of partitions is not very large, then the resulting estimator has the same estimation efficiency as the original IBOSS estimator. We also carry out numerical experiments to evaluate the empirical performance of the proposed method.
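To make the procedure concrete, the following is a minimal sketch in Python/NumPy of the divide-and-conquer selection step; it is not the paper's implementation. It assumes the standard IBOSS allocation in which, within each partition, roughly \(r_B \approx k/(2pB)\) rows are taken from each tail of each covariate; all function and variable names are illustrative.

```python
import numpy as np

def iboss_partition(X, r_b):
    """IBOSS rule on one partition: for each covariate, keep the rows with
    the r_b smallest and the r_b largest values (duplicates kept once)."""
    selected = set()
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])              # a partial selection (np.argpartition) also works
        selected.update(order[:r_b].tolist())    # r_b smallest values of covariate j
        selected.update(order[-r_b:].tolist())   # r_b largest values of covariate j
    return np.fromiter(selected, dtype=int)

def dc_iboss(X, y, k, B, seed=None):
    """Divide-and-conquer IBOSS sketch: randomly split the N rows into B
    partitions, run IBOSS within each partition, pool the selected rows,
    and fit ordinary least squares on the pooled subdata."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    r_b = max(k // (2 * p * B), 1)                          # tail size per covariate per partition
    parts = np.array_split(rng.permutation(N), B)           # random partitioning of the full data
    idx = np.concatenate([part[iboss_partition(X[part], r_b)] for part in parts])
    Xs = np.column_stack([np.ones(len(idx)), X[idx]])       # add intercept column
    beta_hat, *_ = np.linalg.lstsq(Xs, y[idx], rcond=None)  # OLS on the selected subdata
    return beta_hat, idx
```

In practice the per-partition step would use partial selection rather than a full sort, which is what keeps IBOSS computationally cheap; the number of pooled rows can fall slightly below k when the same row is extreme for several covariates.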


References

  1. Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed estimation and inference with statistical guarantees. Ann Stat 46:1352–1382

  2. Bezanson J, Edelman A, Karpinski S, Shah VB (2017) Julia: a fresh approach to numerical computing. SIAM Rev 59(1):65–98

  3. Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684

  4. Drineas P, Magdon-Ismail M, Mahoney M, Woodruff D (2012) Faster approximation of matrix coherence and statistical leverage. J Mach Learn Res 13:3475–3506

  5. Galambos J (1987) The asymptotic theory of extreme order statistics. Robert E. Krieger, Melbourne

  6. Hall P (1979) On the relative stability of large order statistics. Math Proc Camb Philos Soc 86:467–475

  7. Jordan MI (2012) Divide-and-conquer and statistical inference for big data. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 4

  8. Lin N, Xie R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83

  9. Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911

  10. Martínez C (2004) On partial sorting. Technical report, 10th seminar on the analysis of algorithms

  11. R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  12. Schifano ED, Wu J, Wang C, Yan J, Chen MH (2016) Online updating of statistical inference in the big data setting. Technometrics 58(3):393–403. https://doi.org/10.1080/00401706.2016.1142900

  13. Shang Z, Cheng G (2017) Computational limits of a distributed algorithm for smoothing spline. J Mach Learn Res 18(1):3809–3845

  14. Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint arXiv:1802.02698

  15. Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844

  16. Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114(525):393–405

  17. Xue Y, Wang H, Yan J, Schifano ED (2018) An online updating approach for testing the proportional hazards assumption with streams of big survival data. arXiv preprint arXiv:1809.01291

  18. Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60(2):235–249


Acknowledgements

This work was supported by NSF Grant 1812013, a UCONN REP Grant, and a GPU Grant from NVIDIA Corporation.

Author information

Corresponding author

Correspondence to HaiYing Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 1

Proof

For \(l\ne j\), let \(z_j^{(i)l}\) be the concomitant of \(z_{(i)l}\) for \(z_j\); that is, if \(z_{(i)l}=z_{sl}\), then \(z_j^{(i)l}=z_{sj}\), \(i=1, \dots , N\). Let \(\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\) and \({\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)\) be the sample mean and sample variance of covariate \(z_j\) in the subdata \({\mathcal {D}}_{\mathrm{BS}}\). From the proof of Theorem 3 in Wang et al. [16], we know that

$$\begin{aligned} |({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}| \ge k(k-1)^p\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\prod _{j=1}^p{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*). \end{aligned}$$
(12)

Let \(k_B=k/B\) be the number of data points taken from each partition, and let \(z_{bij}^*\) be the \(i\)-th observation on the \(j\)-th covariate in the subdata \({\mathcal {D}}_{S}^{(b)}\) selected from the \(b\)-th partition.

The sample variance for each j satisfies,

$$\begin{aligned} (k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) & = \sum _{i=1}^{k}\left( z_{ij}^*-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2 =\sum _{b=1}^{B}\sum _{i=1}^{k_B}\left( z_{bij}^*-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2\\ & \ge \sum _{b=1}^{B}\left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) \left( z_{b(i)j}-\bar{z}_{bj}^{**}\right) ^2\\ & = \sum _{b=1}^{B}\Bigg \{\sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2\\&+\frac{r_B}{2}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2 \Bigg \}\\ & \ge \frac{r_B}{2} \sum _{b=1}^{B}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2 \ge \frac{r_B}{2}\sum _{b=1}^{B}\left( z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\right) ^2, \end{aligned}$$

where \(\bar{z}_{bj}^{**}=\left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) z_{b(i)j}/(2r_B)\), \(\bar{z}_{bj}^{*l}=\sum _{i=1}^{r_B}z_{b(i)j}/r_B\), and \(\bar{z}_{bj}^{*u}=\sum _{i=n_B-r_B+1}^{n_B} z_{b(i)j}/r_B\). Thus, since \(r=Br_B\),

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) \ge \frac{r(z_{(N)j}-z_{(1)j})^2}{2B(k-1)} \sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2, \end{aligned}$$
(13)

which, combined with (12), shows that

$$\begin{aligned} |({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}| & \ge \frac{r^p}{2^pB^p} k\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\prod _{j=1}^p(z_{(N)j}-z_{(1)j})^2\\&\quad \times \prod _{j=1}^p \sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$

This shows that

$$\begin{aligned} \frac{|({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}|}{M_N} & \ge \frac{\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{(Bp)^p} \prod _{j=1}^p\sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(14)
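The equality in the third line of the first display in this proof is the standard decomposition of a sum of squares over two groups of equal size \(r_B\); for completeness, since \(\bar{z}_{bj}^{**}=(\bar{z}_{bj}^{*l}+\bar{z}_{bj}^{*u})/2\),

$$\begin{aligned} \left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) \left( z_{b(i)j}-\bar{z}_{bj}^{**}\right) ^2 & = \sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +r_B\left( \bar{z}_{bj}^{*l}-\bar{z}_{bj}^{**}\right) ^2\\&\quad +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2 +r_B\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{**}\right) ^2\\ & = \sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2 +\frac{r_B}{2}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2, \end{aligned}$$

because \((\bar{z}_{bj}^{*l}-\bar{z}_{bj}^{**})^2=(\bar{z}_{bj}^{*u}-\bar{z}_{bj}^{**})^2=(\bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l})^2/4\). The analogous step is used again in the derivation of (15) below.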

Each numerator in the summation in the bound (14) relies on the covariate range within the corresponding data partition. If the full data are not divided randomly, this may not produce a sharp bound, and using the full-data covariate ranges may give a better one. We use this idea to derive the bound \(C_E\) in the following. From Algorithm 2, for each \(j=1, \dots , p\), the \(r_B\) data points with the smallest values of \(z_j\) and the \(r_B\) data points with the largest values of \(z_j\) in the full data must be included in \({\mathcal {D}}_{\mathrm{BS}}\), because each such point is among the \(r_B\) smallest (respectively largest) values of \(z_j\) within whichever partition it falls into. Thus, for each sample variance,

$$\begin{aligned} (k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) & \ge \left( \sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N}\right) \left( z_{(i)j}-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2\\ & \ge \left( \sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N}\right) \left( z_{(i)j}-\bar{z}_j^{**}\right) ^2\\ & \ge \frac{r_B}{2}\left( \bar{z}_j^{*u}-\bar{z}_j^{*l}\right) ^2 \ge \frac{r_B}{2}\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2. \end{aligned}$$

In this case, \(\bar{z}_j^{**}=(\sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N})z_{(i)j}/(2r_B)\), \(\bar{z}_j^{*l}=\sum _{i=1}^{r_B}z_{(i)j}/r_B\), and \(\bar{z}_j^{*u}=\sum _{i=N-r_B+1}^{N}z_{(i)j}/r_B\). Thus,

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)& \ge \frac{r(z_{(N)j}-z_{(1)j})^2}{2(k-1)B} \left( \frac{z_{(N-r_B+1)j}-z_{(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(15)

This, combined with (12), shows that

$$\begin{aligned} \frac{|({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}|}{M_N} & \ge \frac{\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{(Bp)^p} \times \prod _{j=1}^p \left( \frac{z_{(N-r_B+1)j}-z_{(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(16)

The proof is completed by combining (14) and (16). \(\square\)

1.2 Proof of Theorem 2

Proof

The proof for (7) is similar to the proof of inequality (19) in Wang et al. [16]. For (8), from the proof of Theorem 4 in Wang et al. [16], we know that

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)\le \frac{k}{4(k-1)}\left( z_{(N)j}-z_{(1)j}\right) ^2, \end{aligned}$$
(17)

and

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) =\frac{\sigma ^2}{k-1}\frac{(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)}. \end{aligned}$$
(18)

From (17), (18), and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\ge 1\), we have

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) \ge \frac{4\sigma ^2}{k\left( z_{(N)j}-z_{(1)j}\right) ^2}. \end{aligned}$$
(19)
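Spelling this step out, (19) follows by combining (18) with \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\ge 1\) and the upper bound (17) on the sample variance:

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) =\frac{\sigma ^2}{k-1}\frac{(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \ge \frac{\sigma ^2}{(k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \ge \frac{\sigma ^2}{k-1}\cdot \frac{4(k-1)}{k\left( z_{(N)j}-z_{(1)j}\right) ^2} =\frac{4\sigma ^2}{k\left( z_{(N)j}-z_{(1)j}\right) ^2}. \end{aligned}$$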

From (13), (15) and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\le \lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\), we have

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}})&\le \frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})(z_{(N-r_B+1)j}-z_{(r_B)j})^2}, \quad \text { and} \end{aligned}$$
(20)
$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}})&\le \frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}) \sum _{b=1}^{B}\left( z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\right) ^2}. \end{aligned}$$
(21)
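As a cross-check of the constants, (20) follows from (18), (15), and \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\le \lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\); the last equality below uses \(2B/r=4pB/k\), that is, the allocation \(k=2pr\) with \(r=Br_B\), which matches the factor \(4pB/k\) appearing in (20) and (21):

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) \le \frac{\sigma ^2}{k-1}\cdot \frac{\lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \le \frac{\sigma ^2\lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{k-1}\cdot \frac{2(k-1)B}{r\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2} =\frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2}, \end{aligned}$$

and (21) follows in the same way with (13) in place of (15).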

The proof is completed by combining (19), (20), and (21). \(\square\)

1.3 Proof of Theorem 3

Proof

For the first case, in which \(B\) and \(r\) are fixed, from (8) with \(V_{ej}\) we only need to verify that

$$\begin{aligned} \frac{z_{(N)j}-z_{(1)j}}{z_{(N-r_B+1)j}-z_{(r_B)j}}=O_{P}(1), \end{aligned}$$
(22)

which is true according to Theorems 2.8.1 and 2.8.2 in Galambos [5].

For the second case, \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) and \(z_{(N)j}-z_{(1)j}\) converge in probability to the same finite constant, and the differences \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) are bounded by this constant for all \(b\). Thus,

$$\begin{aligned} \frac{1}{B}\sum _{b=1}^{B}(z_{b(n_B-r_B+1)j}-z_{b(r_B)j})^2 \end{aligned}$$

converges in probability to the square of this constant, and thus (22) is easily verified.

For the third case, let \(g_{N,j}=F_j^{-1}(1-1/N)\) and \(g_{n_B,j}=F_j^{-1}(1-1/n_B)\). When (9) holds, from the proof of Theorem 5 in Wang et al. [16], we have \(z_{(N)j}/g_{N,j}=1+o_{P}(1)\) and \(z_{b(n_B-r_B+1)j}/g_{n_B,j}=1+o_{P}(1)\). Thus, noting that \(z_{b(r_B)j}\) and \(z_{(1)j}\) are bounded in probability when the lower endpoint of the support of \(F_j\) is finite, we have

$$\begin{aligned} \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}=1+o_{P}(1). \end{aligned}$$

Note that \(\left| \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right|\) are bounded by the same constant for all \(b\), so

$$\begin{aligned} \frac{1}{B}\sum _{b=1}^{B}\frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}=1+o_{P}(1). \end{aligned}$$

Combining this and (8) with \(V_{aj}\), the result follows.

For the fourth case, the result follows from reversing the signs of the covariates in the proof for the third case.

The proof for the fifth case is obtained by combining the proof for the third case and the proof for the fourth case. \(\square\)

About this article

Cite this article

Wang, H. Divide-and-Conquer Information-Based Optimal Subdata Selection Algorithm. J Stat Theory Pract 13, 46 (2019). https://doi.org/10.1007/s42519-019-0048-5
