
Block average quantile regression for massive dataset

  • Regular Article, published in Statistical Papers

Abstract

Nowadays, researchers are frequently confronted with the computational challenges of large-scale data. Quantile regression on a massive dataset is difficult because of the limitations of computer primary memory. Our proposed block average quantile regression (BAQR) provides a simple and efficient way to implement quantile regression on massive datasets. The key idea is to split the entire dataset into a few blocks, apply conventional quantile regression to the data within each block, and derive the final estimate by simply averaging the block-wise quantile regression estimates. While our approach significantly reduces the storage required for estimation, the resulting estimator is theoretically as efficient as traditional quantile regression on the entire dataset. On the statistical side, we investigate the asymptotic properties of the resulting estimator. We verify and illustrate the proposed method through extensive Monte Carlo simulation studies as well as a real-world application.
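The split-fit-average scheme described above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the data-generating process, the helper names `fit_qr` and `baqr`, and the use of Nelder–Mead on the nonsmooth check loss are all assumptions made for this sketch (a production implementation would use a linear-programming quantile regression solver, e.g. R's quantreg or statsmodels' QuantReg).

```python
import numpy as np
from scipy.optimize import minimize

def quantile_loss(beta, X, y, tau):
    # check loss rho_tau(u) = u * (tau - 1{u < 0}), summed over observations
    u = y - X @ beta
    return np.sum(u * (tau - (u < 0)))

def fit_qr(X, y, tau):
    # conventional quantile regression on one block; Nelder-Mead on the
    # nonsmooth loss is adequate for this low-dimensional sketch
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS warm start
    return minimize(quantile_loss, beta0, args=(X, y, tau),
                    method="Nelder-Mead").x

def baqr(X, y, tau, K):
    # split the N observations into K blocks, fit quantile regression
    # within each block, and average the K block-wise estimates
    n = len(y) // K
    betas = [fit_qr(X[k*n:(k+1)*n], y[k*n:(k+1)*n], tau) for k in range(K)]
    return np.mean(betas, axis=0)

# toy data: y = 1 + 2x + symmetric noise, so the tau = 0.5 coefficients are (1, 2)
rng = np.random.default_rng(0)
N = 5000
x = rng.uniform(0.0, 2.0, N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, N)

beta_hat = baqr(X, y, tau=0.5, K=10)
print(beta_hat)  # close to [1.0, 2.0]
```

Only one block of n = N/K observations needs to be held in memory at a time, which is the storage saving the abstract refers to.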


References

  • Alhamzawi R (2015) Model selection in quantile regression models. J Appl Stat 42(2):445–458

  • Arcones MA (1996) The Bahadur–Kiefer representation of Lp regression estimators. Econ Theor 12(2):257–283

  • Briollais L, Durrieu G (2014) Application of quantile regression to recent genetic and -omic studies. Hum Genet 133(8):951–966

  • Chen X, Xie M (2014) A split-and-conquer approach for analysis of extraordinarily large data. Stat Sin 24(4):1655–1684

  • Cook LP (2014) Gendered parenthood penalties and premiums across the earnings distribution in Australia, the United Kingdom, and the United States. Eur Sociol Rev 30(3):360–372

  • El Bantli F, Hallin M (1999) L1-estimation in linear models with heterogeneous white noise. Stat Prob Lett 45(4):305–315

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  • Fan TH, Cheng KF (2007) Tests and variables selection on regression analysis for massive datasets. Data Knowl Eng 63(3):811–819

  • Fan TH, Lin DKJ, Cheng KF (2007) Regression analysis for massive datasets. Data Knowl Eng 61(3):554–562

  • He X, Shao QM (1996) A general Bahadur representation of M-estimators and its application to linear regression with nonstochastic designs. Ann Stat 24(6):2608–2630

  • Jiang R, Qian WM, Zhou ZG (2016) Single-index composite quantile regression with heteroscedasticity and general error distributions. Stat Pap 57:185–203

  • Killewald A, Bearak J (2014) Is the motherhood penalty larger for low-wage women? A comment on quantile regression. Am Sociol Rev 79(2):350–357

  • Knight K (1998) Limiting distributions for L1 regression estimators under general conditions. Ann Stat 26(2):755–770

  • Koenker R (2005) Quantile regression. Cambridge University Press, New York

  • Koenker R, Bassett GW (1978) Regression quantiles. Econometrica 46(1):33–50

  • Koenker R, Geling O (2001) Reappraising medfly longevity: a quantile regression survival analysis. J Am Stat Assoc 96(454):458–468

  • Koenker R, Portnoy S (1987) L-estimation for linear models. J Am Stat Assoc 82(399):851–857

  • Koenker R, Zhao Q (1994) L-estimation for linear heteroscedastic models. J Nonparametr Stat 3(3–4):223–235

  • Li R, Lin DK, Li B (2013) Statistical inference in massive data sets. Appl Stoch Models Bus Ind 29(5):399–409

  • Ning Z, Tang L (2014) Estimation and test procedures for composite quantile regression with covariates missing at random. Stat Prob Lett 95:15–25

  • Okada K, Samreth S (2012) The effect of foreign aid on corruption: a quantile regression approach. Econ Lett 115(2):240–243

  • Peng L, Huang Y (2008) Survival analysis with quantile regression models. J Am Stat Assoc 103(482):637–649

  • Powell D, Wagner J (2014) The exporter productivity premium along the productivity distribution: evidence from quantile regression with nonadditive firm fixed effects. Rev World Econ 150(4):763–785

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc 58(1):267–288

  • van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, New York

  • Wang H, He X (2007) Detecting differential expressions in GeneChip microarray studies: a quantile approach. J Am Stat Assoc 102(477):104–112

  • Xu Q, Niu X, Jiang C, Huang X (2015) The Phillips curve in the US: a nonlinear quantile regression approach. Econ Model 49:186–197

  • Yang H, Liu H (2016) Penalized weighted composite quantile estimators with missing covariates. Stat Pap 57:69–88

  • Yang J, Meng X, Mahoney MW (2013) Quantile regression for large-scale applications. In: Proceedings of the 30th international conference on machine learning, pp 881–887

  • Yang J, Meng X, Mahoney MW (2014) Quantile regression for large-scale applications. SIAM J Sci Comput 36(5):S78–S110

  • Zhang Y, Duchi J, Wainwright M (2013) Divide and conquer kernel ridge regression. J Mach Learn Res 30:592–617

  • Zhao T, Kolar M, Liu H (2015) A general framework for robust testing and confidence regions in high-dimensional quantile regression. Tech. rep.

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429


Acknowledgements

The authors are grateful to the Editor-in-Chief, an associate editor, and three anonymous referees for their helpful comments and constructive guidance. The authors also gratefully acknowledge financial support from the National Natural Science Foundation of PR China (71671056, 71490725), the National Social Science Foundation of PR China (15BJY008), and the Humanity and Social Science Foundation of the Ministry of Education of PR China (14YJA790015).

Author information

Corresponding author

Correspondence to Cuixia Jiang.

Appendix

Proof of Theorem 1

Note that

$$\begin{aligned} \hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )=\frac{1}{K}\sum \limits ^K_{k=1}\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )=\frac{1}{K}\sum \limits ^K_{k=1}\left( \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) . \end{aligned}$$
(18)

From Propositions 1 and 2 of El Bantli and Hallin (1999), \(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0}\) as \(n\rightarrow \infty \) for each \(k\). Since an average of finitely many terms, each converging in probability to \(\varvec{0}\), itself converges in probability to \(\varvec{0}\), we have

$$\begin{aligned} \hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\xrightarrow {p}\varvec{0} \quad \text {as} \quad n\rightarrow \infty . \end{aligned}$$
(19)

Consequently, the consistency holds.

By Knight (1998) and Koenker (2005), \(\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _{2} = O_p(1/\sqrt{n})\). Then, from (18), we may derive

$$\begin{aligned} \begin{array}{lll} \Vert \hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )\Vert _{2} &{}=&{} \Vert \frac{1}{K}\sum \limits ^K_{k=1}\left( \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) \Vert _2\\ &{}\le &{} \frac{1}{K}\sum \limits ^K_{k=1}\Vert \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\Vert _2\\ &{}=&{} O_p(1/\sqrt{n}). \end{array} \end{aligned}$$
(20)

Thus, the estimator \(\hat{\varvec{\beta }}(\tau )\) converges at the rate \(O_p(1/\sqrt{n})\).
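A quick Monte Carlo check of this rate (an illustration added here, not part of the paper) uses the simplest quantile regression, the intercept-only model at \(\tau =0.5\), whose estimator is the sample median: if the estimation error is \(O_p(1/\sqrt{n})\), then \(\sqrt{n}\) times the RMSE should stabilize as \(n\) grows (for standard normal data it approaches \(\sqrt{\pi /2}\approx 1.25\)).

```python
import numpy as np

# sqrt(n) * RMSE of the sample median (intercept-only tau = 0.5 quantile
# regression) should flatten out if the estimator is O_p(1/sqrt(n))
rng = np.random.default_rng(42)
reps = 2000
vals = []
for n in [400, 1600, 6400]:
    err = np.array([np.median(rng.normal(size=n)) for _ in range(reps)])
    vals.append(np.sqrt(n) * np.sqrt(np.mean(err ** 2)))
    print(f"n={n:>5}: sqrt(n) * RMSE = {vals[-1]:.3f}")
```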

Proof of Theorem 2

From Theorem 4.1 of Koenker (2005), the joint asymptotic distribution of the quantile regression estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) takes the form:

$$\begin{aligned} \sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))\xrightarrow {d} N(\varvec{0},\varvec{\Sigma }^{(k)}), \end{aligned}$$
(21)

where \(\varvec{\Sigma }^{(k)}=\omega ^{2}_{k}[\varvec{D}^{(k)}_{0}]^{-1}\) for i.i.d. errors and \(\varvec{\Sigma }^{(k)}=\tau (1-\tau )[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\varvec{D}^{(k)}_{0}[\varvec{D}^{(k)}_{1}(\tau )]^{-1}\) for non-i.i.d. errors.

From C3, the Bahadur representation of the estimator \(\hat{\varvec{\beta }}^{(k)}(\tau )\) is \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))=\varvec{W}^{(k)}+R^{(k)}_{n}\). By Arcones (1996) and He and Shao (1996), the remainder term satisfies

$$\begin{aligned} R^{(k)}_{n}=O(n^{-1/4}(\log \log n)^{3/4})\xrightarrow {p}0 \quad \text {as} \quad n\rightarrow \infty . \end{aligned}$$
(22)

Hence, combining (21) with (22) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\varvec{W}^{(k)}= \sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )) - R^{(k)}_{n}\) has the same limiting distribution as \(\sqrt{n}(\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau ))\), i.e.

$$\begin{aligned} \varvec{W}^{(k)}\xrightarrow {d} N(\varvec{0},\varvec{\Sigma }^{(k)}). \end{aligned}$$
(23)

First, based on C2, it is easy to show that \(\varvec{\Sigma }^{(1)}=\varvec{\Sigma }^{(2)}=\cdots =\varvec{\Sigma }^{(K)}=\varvec{\Sigma }\) for both i.i.d. and non-i.i.d. cases. Using the BAQR estimator, we derive

$$\begin{aligned} \begin{array}{lll} \sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau )) &{}=&{} \sqrt{N}\left( \frac{1}{K}\sum \limits ^K_{k=1}\hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) \\ &{}=&{} \frac{\sqrt{N}}{K} \sum \limits ^K_{k=1}\left( \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) \\ &{}=&{} \frac{\sqrt{N}}{K\sqrt{n}} \sum \limits ^K_{k=1}\sqrt{n}\left( \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) \\ &{}=&{} \frac{1}{\sqrt{K}} \sum \limits ^K_{k=1}\sqrt{n}\left( \hat{\varvec{\beta }}^{(k)}(\tau )-\varvec{\beta }(\tau )\right) \\ &{}=&{} \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}+\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}R^{(k)}_{n}. \end{array} \end{aligned}$$
(24)

Second, we show that \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) follows a multivariate normal distribution. Since the \((\varvec{x}_{i},y_{i})\) are independent for \(i=1,2,\ldots ,N\), the \(\varvec{W}^{(k)}\) defined in (9) or (10) are independent across \(k=1,2,\ldots ,K\). From (23), each \(\varvec{W}^{(k)}\) follows a multivariate normal distribution. Hence, \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\) also follows a multivariate normal distribution.

Third, we calculate the expectation and covariance matrix of \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\). From (23), we have

$$\begin{aligned} \text {E}\left( \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\right) =\frac{1}{\sqrt{K}} \sum \limits ^K_{k=1} \text {E}(\varvec{W}^{(k)})=\varvec{0}, \end{aligned}$$
(25)

and

$$\begin{aligned} \begin{array}{lll} \text {Var}\left( \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\right) &{}=&{} \frac{1}{K} \sum \limits ^K_{k=1} \text {Var}(\varvec{W}^{(k)})\\ &{}=&{} \frac{1}{K} (\varvec{\Sigma }^{(1)}+\varvec{\Sigma }^{(2)}+\cdots +\varvec{\Sigma }^{(K)})\\ &{}=&{} \varvec{\Sigma }. \end{array} \end{aligned}$$
(26)

Based on the results of the second and third parts, we can infer that

$$\begin{aligned} \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\xrightarrow {d} N(\varvec{0},\varvec{\Sigma }). \end{aligned}$$
(27)
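Equations (25)–(27) can be sanity-checked numerically: drawing i.i.d. \(N(\varvec{0},\varvec{\Sigma })\) vectors \(\varvec{W}^{(k)}\) and forming \(\frac{1}{\sqrt{K}}\sum ^K_{k=1}\varvec{W}^{(k)}\) should reproduce mean \(\varvec{0}\) and covariance \(\varvec{\Sigma }\). The sketch below, with an arbitrary \(2\times 2\) matrix \(\varvec{\Sigma }\), is purely illustrative.

```python
import numpy as np

# W^(1), ..., W^(K) i.i.d. N(0, Sigma); the scaled sum (1/sqrt(K)) * sum_k W^(k)
# has mean 0 and covariance exactly Sigma, matching Eqs. (25)-(26)
rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
K, reps = 20, 50_000
W = rng.multivariate_normal(np.zeros(2), Sigma, size=(reps, K))  # (reps, K, 2)
S = W.sum(axis=1) / np.sqrt(K)                                   # (reps, 2)
print(S.mean(axis=0))   # near [0, 0]
print(np.cov(S.T))      # near Sigma
```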

Fourth, we show that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) follows a multivariate normal distribution. If

$$\begin{aligned} \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}R^{(k)}_{n}=O(\sqrt{K}n^{-1/4}(\log \log n)^{3/4})\xrightarrow {p}0 \quad \text {as} \quad n\rightarrow \infty , \end{aligned}$$
(28)

holds (see Eq. (22)), then, combining (27) with (28) and using Theorem 2.7 of van der Vaart (1998), we can infer that \(\sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\) has the same limiting distribution as \(\frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}\varvec{W}^{(k)}\), i.e.

$$\begin{aligned} \sqrt{N}(\hat{\varvec{\beta }}(\tau )-\varvec{\beta }(\tau ))\xrightarrow {d} N(\varvec{0},\varvec{\Sigma }). \end{aligned}$$
(29)

To ensure (29), we suggest taking \(N=Ke^{K^{1/2}}\). Combined with \(N=nK\), this yields \(K=\log ^2 n\). Hence, as \(n\rightarrow \infty \), we have

$$\begin{aligned} \begin{array}{lll} \frac{1}{\sqrt{K}}\sum \limits ^K_{k=1}R^{(k)}_{n}&{}=&{}O(\sqrt{K}n^{-1/4}(\log \log n)^{3/4})\\ &{}=&{}O(n^{-1/4}(\log \log n)^{3/4}\log n)\xrightarrow {p}0. \end{array} \end{aligned}$$
(30)

Therefore, the asymptotic normality result in (12) holds when \(N=Ke^{K^{1/2}}\), which completes the proof.
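The block-number rule can be illustrated numerically (an added sketch, not from the paper): taking \(K=\log ^2 n\), so that \(N=nK=n\log ^2 n=Ke^{K^{1/2}}\), the aggregated remainder bound \(\sqrt{K}\,n^{-1/4}(\log \log n)^{3/4}\) in (30) indeed shrinks as the per-block size \(n\) grows.

```python
import numpy as np

# with K = log^2 n blocks (so N = n K), the aggregated Bahadur remainder
# bound sqrt(K) * n^{-1/4} * (log log n)^{3/4} from Eq. (30) vanishes
for n in [10**3, 10**4, 10**5, 10**6]:
    K = int(np.log(n) ** 2)
    bound = np.sqrt(K) * n ** (-0.25) * np.log(np.log(n)) ** 0.75
    print(f"n = {n:>8}, K = {K:>3}, N = nK = {n * K:>11,}, bound = {bound:.4f}")
```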


Cite this article

Xu, Q., Cai, C., Jiang, C. et al. Block average quantile regression for massive dataset. Stat Papers 61, 141–165 (2020). https://doi.org/10.1007/s00362-017-0932-6
