Subsampling for Big Data: Some Recent Advances

Bertail, P.; Jelassi, O.; Tressou, J.; Zetlaoui, M.

doi:10.1007/978-3-319-96941-1_13

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 250)

Included in the following conference series:

Conference of the International Society for Non-Parametric Statistics

Abstract

The goal of this contribution is to develop subsampling methods in the framework of big data and to show their feasibility in a simulation study. We argue that using different subsampling distributions with different subsampling sizes provides substantial information about the behavior of statistical procedures: subsampling makes it possible to estimate the rate of convergence of different procedures and to construct confidence intervals for general parameters, including the generalization error of a machine learning algorithm.
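
To make the confidence-interval idea concrete, here is a minimal sketch of a generic subsampling interval for a mean. It is an illustration under stated assumptions, not the chapter's procedure: the function name `subsampling_ci`, its parameters, and the simulated Gaussian data are hypothetical, and the rate of convergence is assumed to be known and equal to sqrt(n), whereas the chapter also considers estimating that rate from several subsample sizes.

```python
import math
import random
import statistics


def subsampling_ci(data, b, n_subsamples=500, alpha=0.05, seed=0):
    """Subsampling confidence interval for the mean (assumes a sqrt(n) rate)."""
    rng = random.Random(seed)
    n = len(data)
    theta_n = statistics.fmean(data)  # estimate on the full sample
    # Rescaled deviations sqrt(b) * (theta_b - theta_n), each computed on a
    # random subsample of size b drawn without replacement.
    roots = sorted(
        math.sqrt(b) * (statistics.fmean(rng.sample(data, b)) - theta_n)
        for _ in range(n_subsamples)
    )
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the subsampling distribution.
    lo_q = roots[math.floor((alpha / 2) * (n_subsamples - 1))]
    hi_q = roots[math.ceil((1 - alpha / 2) * (n_subsamples - 1))]
    # Invert sqrt(n) * (theta_n - theta) to get bounds on theta.
    return theta_n - hi_q / math.sqrt(n), theta_n - lo_q / math.sqrt(n)


# Hypothetical usage on simulated data (true mean 2.0, standard deviation 1.0).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(20000)]
low, high = subsampling_ci(data, b=200)
```

Each subsample statistic only touches b observations, which is what makes the approach attractive when the full sample is too large to reprocess many times. The choice b=200 is arbitrary here; in practice the subsample size needs to grow with n while remaining small relative to it.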

Acknowledgements

This research has been conducted as part of the project Labex MME-DII (ANR11-LBX-0023-01) and the industrial chair “Machine Learning for Big Data,” Télécom ParisTech.

Author information

Authors and Affiliations

Modal’X, UPL, Université Paris Nanterre, Nanterre, France
P. Bertail & M. Zetlaoui
Telecom ParisTech, Paris, France
O. Jelassi
MORSE, INRA-MIA, Paris, France
J. Tressou

Editor information

Editors and Affiliations

MODAL’X, Paris West University Nanterre La Défense, Nanterre, France
Patrice Bertail
LMA, Avignon University, Avignon, France
Delphine Blanke
MIASHS, University of Rennes 2, Rennes, France
Pierre-André Cornillon
Formation Continue CEPE, Ecole Nationale de la Statistique et de l’Administration, Malakoff, France
Eric Matzner-Løber

Cite this paper

Bertail, P., Jelassi, O., Tressou, J., Zetlaoui, M. (2018). Subsampling for Big Data: Some Recent Advances. In: Bertail, P., Blanke, D., Cornillon, PA., Matzner-Løber, E. (eds) Nonparametric Statistics. ISNPS 2016. Springer Proceedings in Mathematics & Statistics, vol 250. Springer, Cham. https://doi.org/10.1007/978-3-319-96941-1_13

DOI: https://doi.org/10.1007/978-3-319-96941-1_13
Published: 09 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96940-4
Online ISBN: 978-3-319-96941-1
eBook Packages: Mathematics and Statistics (R0)
