Subsampling for Big Data: Some Recent Advances

Conference paper. In: Nonparametric Statistics (ISNPS 2016).

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 250)

Abstract

The goal of this contribution is to develop subsampling methods in the framework of big data and to show their feasibility in a simulation study. We argue that using different subsampling distributions with different subsampling sizes provides substantial information on the behavior of statistical procedures: subsampling makes it possible to estimate the rate of convergence of different procedures and to construct confidence intervals for general parameters, including the generalization error of an algorithm in machine learning.
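The rate-estimation idea described in the abstract, comparing subsampling distributions computed at different subsample sizes, can be illustrated with a minimal sketch. This is not the authors' implementation: the statistic (the sample mean, whose rate n^(1/2) is known and can serve as a sanity check), the subsample sizes, and the function names below are all assumptions chosen for the example. The spread of the centered subsampling distribution scales like b^(-alpha), so the unknown exponent alpha can be recovered from the log-ratio of spreads at two subsample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_spread(x, b, n_sub=500):
    """Interquartile range of the subsampling distribution of the
    statistic (here the sample mean) at subsample size b, centered
    at the full-sample estimate."""
    theta_n = x.mean()
    reps = np.array([
        rng.choice(x, size=b, replace=False).mean() - theta_n
        for _ in range(n_sub)
    ])
    q75, q25 = np.percentile(reps, [75, 25])
    return q75 - q25

# Simulated "big" sample with a known target (the mean).
n = 100_000
x = rng.normal(loc=1.0, scale=2.0, size=n)

# Two subsample sizes: the spread scales like b**(-alpha), so alpha
# follows from the log-ratio of the two spreads.
b1, b2 = 200, 2000
s1, s2 = subsample_spread(x, b1), subsample_spread(x, b2)
alpha_hat = np.log(s1 / s2) / np.log(b2 / b1)
print(alpha_hat)  # should be close to 0.5 for the sample mean
```

Once the rate is estimated, the subsampling distribution at a single size b, rescaled by (b/n)^alpha_hat, yields approximate confidence intervals for the parameter without knowing the limiting distribution in closed form.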



Acknowledgements

This research has been conducted as part of the project Labex MME-DII (ANR11-LBX-0023-01) and the industrial chair "Machine Learning for Big Data," Télécom ParisTech.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Bertail, P., Jelassi, O., Tressou, J., Zetlaoui, M. (2018). Subsampling for Big Data: Some Recent Advances. In: Bertail, P., Blanke, D., Cornillon, PA., Matzner-Løber, E. (eds) Nonparametric Statistics. ISNPS 2016. Springer Proceedings in Mathematics & Statistics, vol 250. Springer, Cham. https://doi.org/10.1007/978-3-319-96941-1_13
