Subsampling for Big Data: Some Recent Advances

  • P. Bertail
  • O. Jelassi
  • J. Tressou
  • M. Zetlaoui
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 250)

Abstract

The goal of this contribution is to develop subsampling methods in the framework of big data and to demonstrate their feasibility in a simulation study. We argue that using several subsampling distributions with different subsample sizes yields considerable information about the behavior of statistical procedures: subsampling makes it possible to estimate the rate of convergence of a procedure and to construct confidence intervals for general parameters, including the generalization error of a machine learning algorithm. A minimal sketch of this idea is given below.
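To make this concrete, here is a minimal Python sketch of subsampling-based inference with an unknown rate of convergence, in the spirit of Bertail, Politis and Romano (1999): the spread of the subsampling distribution at two subsample sizes b1 and b2 identifies the rate exponent, which is then plugged into a subsampling confidence interval. All function names, the use of the interquartile range as spread measure, and the tuning constants (subsample sizes, number of subsamples) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def subsampling_distribution(data, statistic, b, n_subsamples=500, rng=None):
    """Evaluate the statistic on n_subsamples subsamples of size b drawn
    without replacement from the data."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(data)
    return np.array([statistic(data[rng.choice(n, size=b, replace=False)])
                     for _ in range(n_subsamples)])

def estimate_rate(data, statistic, b1, b2, **kw):
    """Estimate alpha in tau_n = n**alpha by comparing the spread of the
    subsampling distributions at two subsample sizes.  The interquartile
    range of theta_b - theta_n scales like b**(-alpha), so alpha can be
    read off a log-log slope."""
    theta_n = statistic(data)
    r1 = np.subtract(*np.percentile(
        subsampling_distribution(data, statistic, b1, **kw) - theta_n, [75, 25]))
    r2 = np.subtract(*np.percentile(
        subsampling_distribution(data, statistic, b2, **kw) - theta_n, [75, 25]))
    return np.log(r1 / r2) / np.log(b2 / b1)

def subsampling_ci(data, statistic, b, alpha_rate, level=0.95, **kw):
    """Confidence interval for theta: the subsampling distribution of
    tau_b * (theta_b - theta_n) approximates the law of
    tau_n * (theta_n - theta) (Politis & Romano, 1994)."""
    n, theta_n = len(data), statistic(data)
    tau_b, tau_n = b ** alpha_rate, n ** alpha_rate
    roots = tau_b * (subsampling_distribution(data, statistic, b, **kw) - theta_n)
    lo, hi = np.percentile(roots, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return theta_n - hi / tau_n, theta_n - lo / tau_n

# Toy example: the sample mean of t-distributed data, whose true rate
# exponent is 1/2 (the usual sqrt(n) rate).
rng = np.random.default_rng(42)
x = rng.standard_t(df=5, size=10_000)
a_hat = estimate_rate(x, np.mean, b1=100, b2=400, rng=rng)
print("estimated rate exponent:", a_hat)  # should land near 0.5
print("95% CI for the mean:",
      subsampling_ci(x, np.mean, b=200, alpha_rate=a_hat, rng=rng))
```

The same skeleton applies to statistics with non-standard rates (extremes, change-points, generalization errors), which is precisely where estimating the rate rather than assuming sqrt(n) matters.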

Acknowledgements

This research was conducted as part of the Labex MME-DII project (ANR11-LBX-0023-01) and of the industrial chair "Machine Learning for Big Data" at Télécom ParisTech.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • P. Bertail (1)
  • O. Jelassi (2)
  • J. Tressou (3)
  • M. Zetlaoui (1)

  1. Modal'X, UPL, Université Paris Nanterre, Nanterre, France
  2. Telecom ParisTech, Paris, France
  3. MORSE, INRA-MIA, Paris, France
