Abstract
Researchers are increasingly confronted with the challenges of large-scale data computing. Quantile regression on a massive dataset is difficult because of the limitations of computer primary memory. Our proposed block average quantile regression (BAQR) provides a simple and efficient way to implement quantile regression on massive datasets. The key idea is to split the entire dataset into a few blocks, apply conventional quantile regression to the data within each block, and derive the final estimate by averaging the block-level quantile regression estimates. While our approach significantly reduces the storage required for estimation, the resulting estimator is theoretically as efficient as traditional quantile regression applied to the entire dataset. On the statistical side, we investigate the asymptotic properties of the resulting estimator. We verify and illustrate the proposed method through extensive Monte Carlo simulation studies as well as a real-world application.
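The split-average recipe described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it fits each block's quantile regression by solving the standard Koenker–Bassett linear program with `scipy.optimize.linprog`, then averages the block estimates. The function names `quantile_regression` and `baqr` are our own labels for this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau):
    """Fit the tau-th quantile regression via the Koenker-Bassett LP:
    min tau*1'u + (1-tau)*1'v  s.t.  X(b+ - b-) + u - v = y,
    with variables [b+, b-, u, v] all nonnegative."""
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]  # beta = beta_plus - beta_minus

def baqr(X, y, tau, K):
    """Block average quantile regression: split the N observations into
    K blocks, fit quantile regression within each block, and return the
    simple average of the K block estimates."""
    blocks = np.array_split(np.arange(X.shape[0]), K)
    estimates = [quantile_regression(X[idx], y[idx], tau) for idx in blocks]
    return np.mean(estimates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, K = 2000, 10
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    beta = np.array([1.0, 2.0, -1.0])
    y = X @ beta + rng.normal(size=N)  # symmetric errors: median fit recovers beta
    print(baqr(X, y, tau=0.5, K=K))    # close to [1, 2, -1]
```

Only the K block-level fits ever touch raw data, so each LP involves n = N/K observations rather than all N, which is the memory saving the abstract refers to.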
References
Alhamzawi R (2015) Model selection in quantile regression models. J Appl Stat 42(2):445–458
Arcones MA (1996) The Bahadur–Kiefer representation of Lp regression estimators. Econ Theor 12(2):257–283
Briollais L, Durrieu G (2014) Application of quantile regression to recent genetic and -omic studies. Hum Genet 133(8):951–966
Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24(4):1655–1684
Cook LP (2014) Gendered parenthood penalties and premiums across the earnings distribution in Australia, the United Kingdom, and the United States. Eur Sociol Rev 30(3):360–372
El Bantli F, Hallin M (1999) L1-estimation in linear models with heterogeneous white noise. Stat Prob Lett 45(4):305–315
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Fan TH, Cheng KF (2007) Tests and variables selection on regression analysis for massive datasets. Data Knowl Eng 63(3):811–819
Fan TH, Lin DKJ, Cheng KF (2007) Regression analysis for massive datasets. Data Knowl Eng 61(3):554–562
He X, Shao QM (1996) A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. Ann Stat 24(6):2608–2630
Jiang R, Qian WM, Zhou ZG (2016) Single-index composite quantile regression with heteroscedasticity and general error distributions. Stat Pap 57:185–203
Killewald A, Bearak J (2014) Is the motherhood penalty larger for low-wage women? A comment on quantile regression. Am Sociol Rev 79(2):350–357
Knight K (1998) Limiting distributions for L1 regression estimators under general conditions. Ann Stat 26(2):755–770
Koenker R (2005) Quantile regression. Cambridge University Press, New York
Koenker R, Bassett GW (1978) Regression quantiles. Econometrica 46(1):33–50
Koenker R, Geling O (2001) Reappraising medfly longevity: a quantile regression survival analysis. J Am Stat Assoc 96(454):458–468
Koenker R, Portnoy S (1987) L-estimation for linear models. J Am Stat Assoc 82(399):851–857
Koenker R, Zhao Q (1994) L-estimation for linear heteroscedastic models. J Nonparametr Stat 3(3–4):223–235
Li R, Lin DK, Li B (2013) Statistical inference in massive data sets. Appl Stoch Models Bus Ind 29(5):399–409
Ning Z, Tang L (2014) Estimation and test procedures for composite quantile regression with covariates missing at random. Stat Prob Lett 95:15–25
Okada K, Samreth S (2012) The effect of foreign aid on corruption: a quantile regression approach. Econ Lett 115(2):240–243
Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103(482):637–649
Powell D, Wagner J (2014) The exporter productivity premium along the productivity distribution: evidence from quantile regression with nonadditive firm fixed effects. Rev World Econ 150(4):763–785
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, New York
Wang H, He X (2007) Detecting differential expressions in genechip microarray studies: a quantile approach. J Am Stat Assoc 102(477):104–112
Xu Q, Niu X, Jiang C, Huang X (2015) The Phillips curve in the US: a nonlinear quantile regression approach. Econ Model 49:186–197
Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57:69–88
Yang J, Meng X, Mahoney MW (2013) Quantile regression for large-scale applications. In: Proceedings of the 30th international conference on machine learning, pp 881–887
Yang J, Meng X, Mahoney MW (2014) Quantile regression for large-scale applications. SIAM J Sci Comput 36(5):S78–S110
Zhang Y, Duchi J, Wainwright M (2013) Divide and conquer kernel ridge regression. J Mach Learn Res 30:592–617
Zhao T, Kolar M, Liu H (2015) A general framework for robust testing and confidence regions in high-dimensional quantile regression. Technical report
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Acknowledgements
The authors are grateful to the Editor-in-Chief, an associate editor, and three anonymous referees for their helpful comments and constructive guidance. The authors also gratefully acknowledge financial support from the National Natural Science Foundation of PR China (71671056, 71490725), the National Social Science Foundation of PR China (15BJY008), and the Humanity and Social Science Foundation of the Ministry of Education of PR China (14YJA790015).
Appendix
Proof of Theorem 1
Note that \(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )=\frac{1}{K}\sum ^{K}_{k=1}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )\).
From Propositions 1 and 2 of El Bantli and Hallin (1999), \(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0}\) as \(n\rightarrow \infty \) for each \(k\). Since a finite average of terms that each converge in probability to zero also converges in probability to zero, we have \(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0}\).
Consequently, the consistency holds.
By Knight (1998) and Koenker (2005), \(\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _{2} = O_p(1/\sqrt{n})\). Then, from (18) and the triangle inequality, we may derive \(\Vert \hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\Vert _{2}\le \frac{1}{K}\sum ^{K}_{k=1}\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _{2}=O_p(1/\sqrt{n})\).
Thus, the estimator \(\hat{\varvec{\beta }}(\tau )\) converges at rate \(O_p(1/\sqrt{n})\).
Proof of Theorem 2
From Theorem 4.1 of Koenker (2005), the asymptotic distribution of the quantile regression estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) takes the form \(\sqrt{n}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }^{(k)}\big )\),
where \(\varvec{\Sigma }^{(k)}=\omega ^{2}_{k}[\varvec{D}^{(k)}_{0}]^{-1}\) for i.i.d. errors and \(\varvec{\Sigma }^{(k)}=\tau (1-\tau )[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\varvec{D}^{(k)}_{0}[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\) for non-i.i.d. errors.
From C3, the Bahadur representation for the estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) is \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))=\varvec{W}^{(k)}+R^{(k)}_{n}\). By Arcones (1996) and He and Shao (1996), the remainder term satisfies
Hence, combining (21) with (22) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\varvec{W}^{(k)}= \sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )) - R^{(k)}_{n}\) has the same asymptotic distribution as \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))\), i.e. \(\varvec{W}^{(k)}\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }^{(k)}\big )\).
First, based on C2, it is easy to show that \(\varvec{\Sigma }^{(1)}=\varvec{\Sigma }^{(2)}=\cdots =\varvec{\Sigma }^{(K)}=\varvec{\Sigma }\) for both i.i.d. and non-i.i.d. cases. Using the BAQR estimator \(\hat{\varvec{\beta }}(\tau )=\frac{1}{K}\sum ^{K}_{k=1}\hat{\varvec{\beta }}^{(k)}(\tau )\) and \(N=nK\), we derive \(\sqrt{N}\big (\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\big )=\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}\sqrt{n}\big (\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\big )=\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}\varvec{W}^{(k)}+\frac{1}{\sqrt{K}}\sum ^{K}_{k=1}R^{(k)}_{n}\).
Second, we show that \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) follows a multivariate normal distribution. Since \((\varvec{x}_{i},y_{i})\) are independent for \(i=1,2,\ldots ,N\), \(\varvec{W}^{(k)}\) defined in (9) or (10) is therefore independent for \(k=1,2,\ldots ,K\). From (23), we know that \(\varvec{W}^{(k)}\) follows a multivariate normal distribution. Hence, we conclude that \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) also follows a multivariate normal distribution.
Third, we calculate the expectation and covariance matrix of \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\). From (23), we have \(\mathrm {E}\big [\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\big ]=\varvec{0}\) and, by the independence of the blocks, \(\mathrm {Cov}\big (\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\big )=\frac{1}{K}\sum ^K_{k=1}\varvec{\Sigma }=\varvec{\Sigma }\).
Based on the results of the second and third parts, we can infer that \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }\big )\).
Fourth, we show that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) follows a multivariate normal distribution. If \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}R^{(k)}_{n}=o_p(1)\)
holds (see Eq. (22)), then, combining (27) with (28) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) has the same asymptotic distribution as \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\), i.e. \(\sqrt{N}\big (\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\big )\xrightarrow {d}N\big (\varvec{0},\varvec{\Sigma }\big )\).
To ensure (29), we suggest \(N=Ke^{K^{1/2}}\). In addition, using \(N=nK\) we then get \(K=\log ^2n\). Hence, as \(n\rightarrow \infty \), we have
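The relation between the suggested sample size and the number of blocks follows directly by eliminating \(N\):

```latex
\[
N = K e^{\sqrt{K}}, \qquad N = nK
\;\Longrightarrow\; n = e^{\sqrt{K}}
\;\Longrightarrow\; \sqrt{K} = \log n
\;\Longrightarrow\; K = \log^{2} n .
\]
```

Thus \(K\) grows only polylogarithmically in the block size \(n\), so the per-block sample size still diverges while the number of blocks remains modest.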
Therefore, the result of asymptotic normality in (12) holds when \(N=Ke^{K^{1/2}}\), and we complete the whole proof.
Cite this article
Xu, Q., Cai, C., Jiang, C. et al. Block average quantile regression for massive dataset. Stat Papers 61, 141–165 (2020). https://doi.org/10.1007/s00362-017-0932-6