
Divide-and-Conquer Information-Based Optimal Subdata Selection Algorithm


Abstract

Information-based optimal subdata selection (IBOSS) is a computationally efficient method for selecting informative data points from large data sets by processing the full data column by column. However, when a data set is too large to fit in the available memory of a machine, the IBOSS procedure cannot be implemented directly. This paper develops a divide-and-conquer IBOSS approach to this problem: the full data set is divided into smaller partitions that can be loaded into memory, and a subset of data is then selected from each partition using the IBOSS algorithm. We derive both finite-sample and asymptotic properties of the resulting estimator. The asymptotic results show that if the full data set is partitioned randomly and the number of partitions is not very large, then the resulting estimator has the same estimation efficiency as the original IBOSS estimator. We also carry out numerical experiments to evaluate the empirical performance of the proposed method.
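To make the procedure concrete, the following is a minimal sketch in Python/NumPy of the divide-and-conquer selection step; it is not the paper's implementation. It assumes the standard IBOSS allocation in which, within each partition, roughly \(r_B \approx k/(2pB)\) rows are taken from each tail of each covariate; all function and variable names are illustrative.

```python
import numpy as np

def iboss_partition(X, r_b):
    """IBOSS rule on one partition: for each covariate, keep the rows with
    the r_b smallest and the r_b largest values (duplicates kept once)."""
    selected = set()
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])              # a partial selection (np.argpartition) also works
        selected.update(order[:r_b].tolist())    # r_b smallest values of covariate j
        selected.update(order[-r_b:].tolist())   # r_b largest values of covariate j
    return np.fromiter(selected, dtype=int)

def dc_iboss(X, y, k, B, seed=None):
    """Divide-and-conquer IBOSS sketch: randomly split the N rows into B
    partitions, run IBOSS within each partition, pool the selected rows,
    and fit ordinary least squares on the pooled subdata."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    r_b = max(k // (2 * p * B), 1)                          # tail size per covariate per partition
    parts = np.array_split(rng.permutation(N), B)           # random partitioning of the full data
    idx = np.concatenate([part[iboss_partition(X[part], r_b)] for part in parts])
    Xs = np.column_stack([np.ones(len(idx)), X[idx]])       # add intercept column
    beta_hat, *_ = np.linalg.lstsq(Xs, y[idx], rcond=None)  # OLS on the selected subdata
    return beta_hat, idx
```

In practice the per-partition step would use partial selection rather than a full sort, which is what keeps IBOSS computationally cheap; the number of pooled rows can fall slightly below k when the same row is extreme for several covariates.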


References

  1. Battey H, Fan J, Liu H, Lu J, Zhu Z (2018) Distributed estimation and inference with statistical guarantees. Ann Stat 46:1352–1382

  2. Bezanson J, Edelman A, Karpinski S, Shah VB (2017) Julia: a fresh approach to numerical computing. SIAM Rev 59(1):65–98

  3. Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24:1655–1684

  4. Drineas P, Magdon-Ismail M, Mahoney M, Woodruff D (2012) Faster approximation of matrix coherence and statistical leverage. J Mach Learn Res 13:3475–3506

  5. Galambos J (1987) The asymptotic theory of extreme order statistics. Robert E. Krieger, Melbourne

  6. Hall P (1979) On the relative stability of large order statistics. Math Proc Camb Philos Soc 86:467–475

  7. Jordan MI (2012) Divide-and-conquer and statistical inference for big data. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 4

  8. Lin N, Xie R (2011) Aggregated estimating equation estimation. Stat Interface 4:73–83

  9. Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911

  10. Martínez C (2004) On partial sorting. Technical report, 10th seminar on the analysis of algorithms

  11. R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

  12. Schifano ED, Wu J, Wang C, Yan J, Chen MH (2016) Online updating of statistical inference in the big data setting. Technometrics 58(3):393–403. https://doi.org/10.1080/00401706.2016.1142900

  13. Shang Z, Cheng G (2017) Computational limits of a distributed algorithm for smoothing spline. J Mach Learn Res 18(1):3809–3845

  14. Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint arXiv:1802.02698

  15. Wang H, Zhu R, Ma P (2018) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844

  16. Wang H, Yang M, Stufken J (2019) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc 114(525):393–405

  17. Xue Y, Wang H, Yan J, Schifano ED (2018) An online updating approach for testing the proportional hazards assumption with streams of big survival data. arXiv preprint arXiv:1809.01291

  18. Yao Y, Wang H (2019) Optimal subsampling for softmax regression. Stat Pap 60(2):235–249


Acknowledgements

This work was supported by NSF Grant 1812013, a UCONN REP Grant, and a GPU Grant from NVIDIA Corporation.

Author information

Corresponding author

Correspondence to HaiYing Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Proof of Theorem 1

Proof

For \(l\ne j\), let \(z_j^{(i)l}\) be the concomitant of \(z_{(i)l}\) for \(z_j\); that is, if \(z_{(i)l}=z_{sl}\), then \(z_j^{(i)l}=z_{sj}\), \(i=1, \dots , N\). Let \(\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\) and \({\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)\) be the sample mean and sample variance of covariate \(z_j\) in the subdata \({\mathcal {D}}_{\mathrm{BS}}\). From the proof of Theorem 3 in Wang et al. [16], we know that

$$\begin{aligned} |({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}| \ge k(k-1)^p\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\prod _{j=1}^p{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*). \end{aligned}$$
(12)

Let \(k_B=k/B\) be the number of data points taken from each partition, and let \(z_{bij}^*\) be the \(i\)-th observation on the \(j\)-th covariate in the subdata \({\mathcal {D}}_{S}^{(b)}\) selected from the \(b\)-th partition.

The sample variance for each j satisfies,

$$\begin{aligned} (k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) & = \sum _{i=1}^{k}\left( z_{ij}^*-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2 =\sum _{b=1}^{B}\sum _{i=1}^{k_B}\left( z_{bij}^*-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2\\ & \ge \sum _{b=1}^{B}\left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) \left( z_{b(i)j}-\bar{z}_{bj}^{**}\right) ^2\\ & = \sum _{b=1}^{B}\Bigg \{\sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2\\&+\frac{r_B}{2}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2 \Bigg \}\\ & \ge \frac{r_B}{2} \sum _{b=1}^{B}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2 \ge \frac{r_B}{2}\sum _{b=1}^{B}\left( z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\right) ^2, \end{aligned}$$

where \(\bar{z}_{bj}^{**}=\left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) z_{b(i)j}/(2r_B)\), \(\bar{z}_{bj}^{*l}=\sum _{i=1}^{r_B}z_{b(i)j}/r_B\), and \(\bar{z}_{bj}^{*u}=\sum _{i=n_B-r_B+1}^{n_B} z_{b(i)j}/r_B\). Thus, since \(r=Br_B\),

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) \ge \frac{r(z_{(N)j}-z_{(1)j})^2}{2B(k-1)} \sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2, \end{aligned}$$
(13)

which, combined with (12), shows that

$$\begin{aligned} |({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}| & \ge \frac{r^p}{2^pB^p} k\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\prod _{j=1}^p(z_{(N)j}-z_{(1)j})^2\\&\quad \times \prod _{j=1}^p \sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$

This shows that

$$\begin{aligned} \frac{|({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}|}{M_N} & \ge \frac{\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{(Bp)^p} \prod _{j=1}^p\sum _{b=1}^{B}\left( \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(14)
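The equality in the third line of the first display in this proof is the standard decomposition of a sum of squares over two groups of equal size \(r_B\); for completeness, since \(\bar{z}_{bj}^{**}=(\bar{z}_{bj}^{*l}+\bar{z}_{bj}^{*u})/2\),

$$\begin{aligned} \left( \sum _{i=1}^{r_B}+\sum _{i=n_B-r_B+1}^{n_B}\right) \left( z_{b(i)j}-\bar{z}_{bj}^{**}\right) ^2 & = \sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +r_B\left( \bar{z}_{bj}^{*l}-\bar{z}_{bj}^{**}\right) ^2\\&\quad +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2 +r_B\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{**}\right) ^2\\ & = \sum _{i=1}^{r_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*l}\right) ^2 +\sum _{i=n_B-r_B+1}^{n_B}\left( z_{b(i)j}-\bar{z}_{bj}^{*u}\right) ^2 +\frac{r_B}{2}\left( \bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l}\right) ^2, \end{aligned}$$

because \((\bar{z}_{bj}^{*l}-\bar{z}_{bj}^{**})^2=(\bar{z}_{bj}^{*u}-\bar{z}_{bj}^{**})^2=(\bar{z}_{bj}^{*u}-\bar{z}_{bj}^{*l})^2/4\). The analogous step is used again in the derivation of (15) below.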

Each numerator in the summation in the bound (14) relies on the covariate range within the corresponding data partition. If the full data are not divided randomly, this may not produce a sharp bound, and using the full-data covariate ranges may give a better one. We use this idea to derive the bound \(C_E\) in the following. From Algorithm 2, for each \(j=1, \dots , p\), the \(r_B\) data points with the smallest values of \(z_j\) and the \(r_B\) data points with the largest values of \(z_j\) in the full data must be included in \({\mathcal {D}}_{\mathrm{BS}}\), because each such point is among the \(r_B\) smallest (respectively largest) values of \(z_j\) within whichever partition it falls into. Thus, for each sample variance,

$$\begin{aligned} (k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*) & \ge \left( \sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N}\right) \left( z_{(i)j}-\bar{z}_{j{\mathcal {D}}_{\mathrm{BS}}}^*\right) ^2\\ & \ge \left( \sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N}\right) \left( z_{(i)j}-\bar{z}_j^{**}\right) ^2\\ & \ge \frac{r_B}{2}\left( \bar{z}_j^{*u}-\bar{z}_j^{*l}\right) ^2 \ge \frac{r_B}{2}\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2. \end{aligned}$$

In this case, \(\bar{z}_j^{**}=(\sum _{i=1}^{r_B}+\sum _{i=N-r_B+1}^{N})z_{(i)j}/(2r_B)\), \(\bar{z}_j^{*l}=\sum _{i=1}^{r_B}z_{(i)j}/r_B\), and \(\bar{z}_j^{*u}=\sum _{i=N-r_B+1}^{N}z_{(i)j}/r_B\). Thus,

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)& \ge \frac{r(z_{(N)j}-z_{(1)j})^2}{2(k-1)B} \left( \frac{z_{(N-r_B+1)j}-z_{(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(15)

This, combined with (12), shows that

$$\begin{aligned} \frac{|({\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}})^{\mathrm{T}}{\mathcal {X}}^*_{{\mathcal {D}}_{\mathrm{BS}}}|}{M_N} & \ge \frac{\lambda _{\min }^p(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{(Bp)^p} \times \prod _{j=1}^p \left( \frac{z_{(N-r_B+1)j}-z_{(r_B)j}}{z_{(N)j}-z_{(1)j}}\right) ^2. \end{aligned}$$
(16)

The proof is completed by combining (14) and (16). \(\square\)

1.2 Proof of Theorem 2

Proof

The proof for (7) is similar to the proof of inequality (19) in Wang et al. [16]. For (8), from the proof of Theorem 4 in Wang et al. [16], we know that

$$\begin{aligned} {\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)\le \frac{k}{4(k-1)}\left( z_{(N)j}-z_{(1)j}\right) ^2, \end{aligned}$$
(17)

and

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) =\frac{\sigma ^2}{k-1}\frac{(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)}. \end{aligned}$$
(18)

From (17), (18), and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\ge 1\), we have

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) \ge \frac{4\sigma ^2}{k\left( z_{(N)j}-z_{(1)j}\right) ^2}. \end{aligned}$$
(19)
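Spelling this step out, (19) follows by combining (18) with \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\ge 1\) and the upper bound (17) on the sample variance:

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) =\frac{\sigma ^2}{k-1}\frac{(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \ge \frac{\sigma ^2}{(k-1){\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \ge \frac{\sigma ^2}{k-1}\cdot \frac{4(k-1)}{k\left( z_{(N)j}-z_{(1)j}\right) ^2} =\frac{4\sigma ^2}{k\left( z_{(N)j}-z_{(1)j}\right) ^2}. \end{aligned}$$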

From (13), (15) and the fact that \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\le \lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\), we have

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}})&\le \frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})(z_{(N-r_B+1)j}-z_{(r_B)j})^2}, \quad \text { and} \end{aligned}$$
(20)
$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}})&\le \frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}) \sum _{b=1}^{B}\left( z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\right) ^2}. \end{aligned}$$
(21)
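As a cross-check of the constants, (20) follows from (18), (15), and \((\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}}^{-1})_{jj}\le \lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\); the last equality below uses \(2B/r=4pB/k\), that is, the allocation \(k=2pr\) with \(r=Br_B\), which matches the factor \(4pB/k\) appearing in (20) and (21):

$$\begin{aligned} {\mathbb {V}}(\hat{\beta }^{{\mathcal {D}}_{\mathrm{BS}}}_j|{\mathcal {Z}}) \le \frac{\sigma ^2}{k-1}\cdot \frac{\lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{{\mathrm {var}}(z_{j{\mathcal {D}}_{\mathrm{BS}}}^*)} \le \frac{\sigma ^2\lambda _{\min }^{-1}(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})}{k-1}\cdot \frac{2(k-1)B}{r\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2} =\frac{4pB\sigma ^2}{k\lambda _{\min }(\varvec{R}_{{\mathcal {D}}_{\mathrm{BS}}})\left( z_{(N-r_B+1)j}-z_{(r_B)j}\right) ^2}, \end{aligned}$$

and (21) follows in the same way with (13) in place of (15).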

The proof is completed by combining (19), (20), and (21). \(\square\)

1.3 Proof of Theorem 3

Proof

For the first case, in which \(B\) and \(r\) are fixed, from (8) with \(V_{ej}\) we only need to verify that

$$\begin{aligned} \frac{z_{(N)j}-z_{(1)j}}{z_{(N-r_B+1)j}-z_{(r_B)j}}=O_{P}(1), \end{aligned}$$
(22)

which is true according to Theorems 2.8.1 and 2.8.2 in Galambos [5].

For the second case, \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) and \(z_{(N)j}-z_{(1)j}\) converge in probability to the same finite constant, and the differences \(z_{b(n_B-r_B+1)j}-z_{b(r_B)j}\) are bounded by this constant for all \(b\). Thus,

$$\begin{aligned} \frac{1}{B}\sum _{b=1}^{B}(z_{b(n_B-r_B+1)j}-z_{b(r_B)j})^2 \end{aligned}$$

converges in probability to the square of this constant, and thus (22) is easily verified.

For the third case, let \(g_{N,j}=F_j^{-1}(1-1/N)\) and \(g_{n_B,j}=F_j^{-1}(1-1/n_B)\). When (9) holds, from the proof of Theorem 5 in Wang et al. [16], we have \(z_{(N)j}/g_{N,j}=1+o_{P}(1)\) and \(z_{b(n_B-r_B+1)j}/g_{n_B,j}=1+o_{P}(1)\). Thus, noting that \(z_{b(r_B)j}\) and \(z_{(1)j}\) are bounded in probability when the lower endpoint of the support of \(F_j\) is finite, we have

$$\begin{aligned} \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}=1+o_{P}(1). \end{aligned}$$

Note that \(\left| \frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}\right|\) are bounded by the same constant for all \(b\), so

$$\begin{aligned} \frac{1}{B}\sum _{b=1}^{B}\frac{z_{b(n_B-r_B+1)j}-z_{b(r_B)j}}{z_{(N)j}-z_{(1)j}}=1+o_{P}(1). \end{aligned}$$

Combining this and (8) with \(V_{aj}\), the result follows.

For the fourth case, the result follows from reversing the signs of the covariates in the proof for the third case.

The proof for the fifth case is obtained by combining the proof for the third case and the proof for the fourth case. \(\square\)

About this article

Cite this article

Wang, H. Divide-and-Conquer Information-Based Optimal Subdata Selection Algorithm. J Stat Theory Pract 13, 46 (2019). https://doi.org/10.1007/s42519-019-0048-5
