Abstract
The goal of this contribution is to develop subsampling methods in the framework of big data and to show their feasibility in a simulation study. We argue that using different subsampling distributions with different subsampling sizes yields substantial information on the behavior of statistical procedures: subsampling makes it possible to estimate the rate of convergence of different procedures and to construct confidence intervals for general parameters, including the generalization error of a machine learning algorithm.
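As an illustration of the kind of procedure the abstract refers to, the following is a minimal sketch of a subsampling confidence interval for a mean. It is not the chapter's own implementation. It assumes i.i.d. data, a known rate of convergence (the square root of the sample size), and subsamples drawn without replacement; the function name `subsampling_ci` and all of its parameters are hypothetical and chosen for this example only.

```python
import math
import random
import statistics


def subsampling_ci(data, b, n_subsamples=500, alpha=0.05, rate=math.sqrt, seed=0):
    """Sketch of a subsampling confidence interval for the mean.

    Assumptions (not from the source): i.i.d. data held in a list,
    a known convergence rate `rate(n)` (sqrt(n) by default), and
    subsample size b much smaller than len(data).
    """
    rng = random.Random(seed)
    n = len(data)
    theta_n = statistics.fmean(data)  # estimator on the full sample

    # Rescaled, recentred estimator on each subsample drawn without replacement.
    roots = sorted(
        rate(b) * (statistics.fmean(rng.sample(data, b)) - theta_n)
        for _ in range(n_subsamples)
    )

    # Empirical quantiles of the subsampling distribution.
    lo_q = roots[math.floor((alpha / 2) * (n_subsamples - 1))]
    hi_q = roots[math.ceil((1 - alpha / 2) * (n_subsamples - 1))]

    # Invert the root: the upper quantile gives the lower bound and vice versa.
    return theta_n - hi_q / rate(n), theta_n - lo_q / rate(n)
```

Each subsample costs only `b` evaluations rather than `n`, which is what makes the approach attractive when the full data set is very large. When the rate of convergence is unknown, the default `rate` would have to be replaced by an estimate, for instance one obtained by comparing subsampling distributions across several subsample sizes.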
Acknowledgements
This research has been conducted as part of the project Labex MME-DII (ANR11-LBX-0023-01), and the industrial chair “Machine Learning for Big Data,” Télécom ParisTech.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Bertail, P., Jelassi, O., Tressou, J., Zetlaoui, M. (2018). Subsampling for Big Data: Some Recent Advances. In: Bertail, P., Blanke, D., Cornillon, PA., Matzner-Løber, E. (eds) Nonparametric Statistics. ISNPS 2016. Springer Proceedings in Mathematics & Statistics, vol 250. Springer, Cham. https://doi.org/10.1007/978-3-319-96941-1_13
DOI: https://doi.org/10.1007/978-3-319-96941-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96940-4
Online ISBN: 978-3-319-96941-1
eBook Packages: Mathematics and Statistics (R0)