Abstract
With advances in science and technology over the past decade, big data has become ubiquitous in all fields. The exponential growth of big data significantly outpaces the growth in storage and computational capacity of high-performance computers. The challenge of analyzing big data calls for innovative analytical and computational methods that make better use of currently available computing power. An emerging and powerful family of methods for effectively analyzing big data is called statistical leveraging. In these methods, one first takes a random subsample from the original full sample, then uses the subsample as a surrogate for any computation and estimation of interest. The key to the success of statistical leveraging methods is the construction of a data-adaptive sampling probability distribution, which gives preference to data points that are influential for model fitting and statistical inference. In this chapter, we review recent developments in statistical leveraging methods. In particular, we focus on various algorithms for constructing the subsampling probability distribution, and on a coherent theoretical framework for investigating their estimation properties and computational complexity. Simulation studies and real data examples are presented to demonstrate applications of the methodology.
Keywords
- Randomized algorithm
- Leverage scores
- Subsampling
- Least squares
- Linear regression
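The subsample-then-estimate procedure summarized in the abstract can be illustrated with a minimal sketch for least squares. It assumes that the sampling probabilities are proportional to the leverage scores (the diagonal of the hat matrix) and that the subsample is refit by reweighted least squares; the algorithms reviewed in the chapter may differ in detail. The function names `leverage_scores` and `leveraging_ols`, the synthetic data, and the subsample size `r` are illustrative assumptions, not taken from the chapter.

```python
import numpy as np


def leverage_scores(X):
    """Diagonal of the hat matrix X (X^T X)^{-1} X^T, via a thin QR factorization."""
    Q, _ = np.linalg.qr(X)          # Q has orthonormal columns spanning col(X)
    return np.sum(Q ** 2, axis=1)   # squared row norms of Q


def leveraging_ols(X, y, r, rng):
    """Least squares on r rows drawn with leverage-proportional probabilities."""
    h = leverage_scores(X)
    pi = h / h.sum()                                  # sampling distribution over rows
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    w = 1.0 / np.sqrt(r * pi[idx])                    # inverse-probability row rescaling
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta


# Synthetic data (illustrative only): heavy-tailed predictors yield non-uniform leverage.
rng = np.random.default_rng(0)
n, p = 20000, 5
X = rng.standard_t(df=3, size=(n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.normal(size=n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)     # full-sample estimate
beta_sub = leveraging_ols(X, y, r=1000, rng=rng)      # subsample surrogate
h = leverage_scores(X)
```

Computing exact leverage scores through a QR factorization costs about as much as solving the full least-squares problem, so this sketch only shows the sampling logic; a practical speed-up would need approximate leverage scores.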
Acknowledgements
This work was funded in part by NSF DMS-1440037(1222718), NSF DMS-1438957(1055815), NSF DMS-1440038(1228288), NIH R01GM122080, NIH R01GM113242.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Zhang, X., Xie, R., Ma, P. (2018). Statistical Leveraging Methods in Big Data. In: Härdle, W., Lu, HS., Shen, X. (eds) Handbook of Big Data Analytics. Springer Handbooks of Computational Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-18284-1_3
DOI: https://doi.org/10.1007/978-3-319-18284-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18283-4
Online ISBN: 978-3-319-18284-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)